2023-08-08 16:40:04

by 黄杰

Subject: [RFC v3 Optimizing veth xsk performance 0/9]

AF_XDP is a kernel-bypass technology that can greatly improve performance.
However, for virtual devices like veth, even with AF_XDP sockets there are
still many additional software paths that consume CPU resources. This patch
series focuses on optimizing AF_XDP socket performance for veth virtual
devices. Patches 1 to 4 are mainly preparatory work. Patch 5 introduces a
tx queue and tx napi for packet transmission, patch 8 implements batch
sending for IPv4 UDP packets, and patch 9 adds support for the AF_XDP tx
need_wakeup feature. These optimizations significantly shorten the software
path and support checksum offload.
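As background for patch 9, this is the user-space pattern that the tx
need_wakeup feature targets. It is only an illustrative sketch using the
libxdp xsk helpers (the header name depends on the libxdp/libbpf version);
'xsk' and 'tx' stand for an already created AF_XDP socket and its TX ring:

#include <sys/socket.h>
#include <xdp/xsk.h>	/* libxdp; older setups use <bpf/xsk.h> */

/* Only issue the sendto() kick when the kernel has asked to be woken
 * up; otherwise the kernel-side tx napi keeps draining the ring
 * without extra syscalls.
 */
static void kick_tx_if_needed(struct xsk_socket *xsk, struct xsk_ring_prod *tx)
{
	if (xsk_ring_prod__needs_wakeup(tx))
		sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
}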

I tested these features with the typical topology shown below:
client(send):                                      server(recv):
veth <--> veth-peer                     veth1-peer <--> veth1
  1          |                               |           7
             | 2                           6 |
             |                               |
          bridge <--> eth0(mlnx5) -switch- eth1(mlnx5) <--> bridge1
             3             4                    5
         (machine1)                               (machine2)
AF_XDP sockets are attached to veth and veth1, and packets are sent out
through the physical NIC (eth0).

machine1:
veth:    172.17.0.2/24
bridge:  172.17.0.1/24
eth0:    192.168.156.66/24

machine2:
veth1:   172.17.0.2/24
bridge1: 172.17.0.1/24
eth1:    192.168.156.88/24

After setting up the default route, SNAT and DNAT, we can run the tests
to get the performance results.

Packets are sent from veth to veth1.
AF_XDP test tool:
link: https://github.com/cclinuxer/libxudp
send:(veth)
./objs/xudpperf send --dst 192.168.156.88:6002 -l 1300
recv:(veth1)
./objs/xudpperf recv --src 172.17.0.2:6002

UDP test tool: iperf3
send:(veth)
iperf3 -c 192.168.156.88 -p 6002 -l 1300 -b 0 -u
recv:(veth1)
iperf3 -s -p 6002

performance (veth, tested with the libxudp library):
UDP                           : 320 Kpps (with 100% cpu)
AF_XDP no zerocopy + no batch : 480 Kpps (with ksoftirqd 100% cpu)
AF_XDP with batch + zerocopy  : 1.5 Mpps (with ksoftirqd 15% cpu)

With AF_XDP batching, the libxudp user-space program becomes the
bottleneck, so the softirq does not reach its limit.

This is just an RFC patch series, and some code details still need
further consideration. Please review this proposal.

v2->v3:
- fix build error found by the kernel test robot.

v1->v2:
- all patches now pass the checkpatch.pl test (suggested by Simon Horman).
- iperf3 tested with -b 0 and the test results updated (suggested by Paolo Abeni).
- refactored the code to make its structure clearer.
- deleted some unneeded logic in the veth_xsk_tx_xmit function.
- added support for the AF_XDP tx need_wakeup feature.

Albert Huang (9):
veth: Implement ethtool's get_ringparam() callback
xsk: add dma_check_skip for skipping dma check
veth: add support for send queue
xsk: add xsk_tx_completed_addr function
veth: use send queue tx napi to xmit xsk tx desc
veth: add ndo_xsk_wakeup callback for veth
sk_buff: add destructor_arg_xsk_pool for zero copy
veth: af_xdp tx batch support for ipv4 udp
veth: add support for AF_XDP tx need_wakeup feature

drivers/net/veth.c | 679 +++++++++++++++++++++++++++++++++++-
include/linux/skbuff.h | 2 +
include/net/xdp_sock_drv.h | 5 +
include/net/xsk_buff_pool.h | 1 +
net/xdp/xsk.c | 6 +
net/xdp/xsk_buff_pool.c | 3 +-
net/xdp/xsk_queue.h | 10 +
7 files changed, 704 insertions(+), 2 deletions(-)

--
2.20.1



2023-08-08 17:56:46

by 黄杰

Subject: [RFC v3 Optimizing veth xsk performance 4/9] xsk: add xsk_tx_completed_addr function

Return a descriptor to the cq (completion queue) by its address.
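
For context, a minimal driver-side sketch of the intended usage
(illustrative only; 'pool' is the xsk_buff_pool bound to the tx queue and
the copy step is a placeholder). This is the pattern the veth tx napi in
patch 5 follows:

struct xdp_desc desc;

while (xsk_tx_peek_desc(pool, &desc)) {
	void *data = xsk_buff_raw_get_data(pool, desc.addr);

	/* ... copy desc.len bytes from 'data' into the driver's own buffer ... */

	/* complete this one descriptor by address, without batching */
	xsk_tx_completed_addr(pool, desc.addr);
}
xsk_tx_release(pool);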

Signed-off-by: Albert Huang <[email protected]>
---
include/net/xdp_sock_drv.h | 5 +++++
net/xdp/xsk.c | 6 ++++++
net/xdp/xsk_queue.h | 10 ++++++++++
3 files changed, 21 insertions(+)

diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 1f6fc8c7a84c..de82c596e48f 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -15,6 +15,7 @@
#ifdef CONFIG_XDP_SOCKETS

void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries);
+void xsk_tx_completed_addr(struct xsk_buff_pool *pool, u64 addr);
bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc);
u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 max);
void xsk_tx_release(struct xsk_buff_pool *pool);
@@ -188,6 +189,10 @@ static inline void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries)
{
}

+static inline void xsk_tx_completed_addr(struct xsk_buff_pool *pool, u64 addr)
+{
+}
+
static inline bool xsk_tx_peek_desc(struct xsk_buff_pool *pool,
struct xdp_desc *desc)
{
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 4f1e0599146e..b2b8aa7b0bcf 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -396,6 +396,12 @@ void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries)
}
EXPORT_SYMBOL(xsk_tx_completed);

+void xsk_tx_completed_addr(struct xsk_buff_pool *pool, u64 addr)
+{
+ xskq_prod_submit_addr(pool->cq, addr);
+}
+EXPORT_SYMBOL(xsk_tx_completed_addr);
+
void xsk_tx_release(struct xsk_buff_pool *pool)
{
struct xdp_sock *xs;
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 13354a1e4280..3a5e26a81dc2 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -428,6 +428,16 @@ static inline void __xskq_prod_submit(struct xsk_queue *q, u32 idx)
smp_store_release(&q->ring->producer, idx); /* B, matches C */
}

+static inline void xskq_prod_submit_addr(struct xsk_queue *q, u64 addr)
+{
+ struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
+ u32 idx = q->ring->producer;
+
+ ring->desc[idx++ & q->ring_mask] = addr;
+
+ __xskq_prod_submit(q, idx);
+}
+
static inline void xskq_prod_submit(struct xsk_queue *q)
{
__xskq_prod_submit(q, q->cached_prod);
--
2.20.1


2023-08-08 17:57:00

by 黄杰

Subject: [RFC v3 Optimizing veth xsk performance 5/9] veth: use send queue tx napi to xmit xsk tx desc

Use the send queue tx napi to transmit xsk tx descriptors: the napi poll
peeks descriptors from the xsk pool, copies each one into a newly
allocated page, builds an skb around it, and feeds it to the peer
device's GRO receive path.

Signed-off-by: Albert Huang <[email protected]>
---
drivers/net/veth.c | 230 ++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 229 insertions(+), 1 deletion(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 25faba879505..28b891dd8dc9 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -27,6 +27,8 @@
#include <linux/bpf_trace.h>
#include <linux/net_tstamp.h>
#include <net/page_pool.h>
+#include <net/xdp_sock_drv.h>
+#include <net/xdp.h>

#define DRV_NAME "veth"
#define DRV_VERSION "1.0"
@@ -1061,6 +1063,141 @@ static int veth_poll(struct napi_struct *napi, int budget)
return done;
}

+static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
+ int buflen)
+{
+ struct sk_buff *skb;
+
+ skb = build_skb(head, buflen);
+ if (!skb)
+ return NULL;
+
+ skb_reserve(skb, headroom);
+ skb_put(skb, len);
+
+ return skb;
+}
+
+static int veth_xsk_tx_xmit(struct veth_sq *sq, struct xsk_buff_pool *xsk_pool, int budget)
+{
+ struct veth_priv *priv, *peer_priv;
+ struct net_device *dev, *peer_dev;
+ struct veth_stats stats = {};
+ struct sk_buff *skb = NULL;
+ struct veth_rq *peer_rq;
+ struct xdp_desc desc;
+ int done = 0;
+
+ dev = sq->dev;
+ priv = netdev_priv(dev);
+ peer_dev = priv->peer;
+ peer_priv = netdev_priv(peer_dev);
+
+ /* todo: the queue index must be set before this */
+ peer_rq = &peer_priv->rq[sq->queue_index];
+
+ /* set the xsk need_wakeup flag; todo: decide where to disable it */
+ if (xsk_uses_need_wakeup(xsk_pool))
+ xsk_set_tx_need_wakeup(xsk_pool);
+
+ while (budget-- > 0) {
+ unsigned int truesize = 0;
+ struct page *page;
+ void *vaddr;
+ void *addr;
+
+ if (!xsk_tx_peek_desc(xsk_pool, &desc))
+ break;
+
+ addr = xsk_buff_raw_get_data(xsk_pool, desc.addr);
+
+ /* can not hold all data in a page */
+ truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+ truesize += desc.len + xsk_pool->headroom;
+ if (truesize > PAGE_SIZE) {
+ xsk_tx_completed_addr(xsk_pool, desc.addr);
+ stats.xdp_drops++;
+ break;
+ }
+
+ page = dev_alloc_page();
+ if (!page) {
+ xsk_tx_completed_addr(xsk_pool, desc.addr);
+ stats.xdp_drops++;
+ break;
+ }
+ vaddr = page_to_virt(page);
+
+ memcpy(vaddr + xsk_pool->headroom, addr, desc.len);
+ xsk_tx_completed_addr(xsk_pool, desc.addr);
+
+ skb = veth_build_skb(vaddr, xsk_pool->headroom, desc.len, PAGE_SIZE);
+ if (!skb) {
+ put_page(page);
+ stats.xdp_drops++;
+ break;
+ }
+ skb->protocol = eth_type_trans(skb, peer_dev);
+ napi_gro_receive(&peer_rq->xdp_napi, skb);
+
+ stats.xdp_bytes += desc.len;
+ done++;
+ }
+
+ /* release: move the consumer and wake up the producer */
+ if (done) {
+ napi_schedule(&peer_rq->xdp_napi);
+ xsk_tx_release(xsk_pool);
+ }
+
+ u64_stats_update_begin(&sq->stats.syncp);
+ sq->stats.vs.xdp_packets += done;
+ sq->stats.vs.xdp_bytes += stats.xdp_bytes;
+ sq->stats.vs.xdp_drops += stats.xdp_drops;
+ u64_stats_update_end(&sq->stats.syncp);
+
+ return done;
+}
+
+static int veth_poll_tx(struct napi_struct *napi, int budget)
+{
+ struct veth_sq *sq = container_of(napi, struct veth_sq, xdp_napi);
+ struct xsk_buff_pool *pool;
+ int done = 0;
+
+ sq->xsk.last_cpu = smp_processor_id();
+
+ /* xmit for tx queue */
+ rcu_read_lock();
+ pool = rcu_dereference(sq->xsk.pool);
+ if (pool)
+ done = veth_xsk_tx_xmit(sq, pool, budget);
+
+ rcu_read_unlock();
+
+ if (done < budget) {
+ /* if done < budget, the tx ring has no more buffers */
+ napi_complete_done(napi, done);
+ }
+
+ return done;
+}
+
+static int veth_napi_add_tx(struct net_device *dev)
+{
+ struct veth_priv *priv = netdev_priv(dev);
+ int i;
+
+ for (i = 0; i < dev->real_num_rx_queues; i++) {
+ struct veth_sq *sq = &priv->sq[i];
+
+ netif_napi_add(dev, &sq->xdp_napi, veth_poll_tx);
+ napi_enable(&sq->xdp_napi);
+ }
+
+ return 0;
+}
+
static int veth_create_page_pool(struct veth_rq *rq)
{
struct page_pool_params pp_params = {
@@ -1153,6 +1290,19 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
}
}

+static void veth_napi_del_tx(struct net_device *dev)
+{
+ struct veth_priv *priv = netdev_priv(dev);
+ int i;
+
+ for (i = 0; i < dev->real_num_rx_queues; i++) {
+ struct veth_sq *sq = &priv->sq[i];
+
+ napi_disable(&sq->xdp_napi);
+ __netif_napi_del(&sq->xdp_napi);
+ }
+}
+
static void veth_napi_del(struct net_device *dev)
{
veth_napi_del_range(dev, 0, dev->real_num_rx_queues);
@@ -1360,7 +1510,7 @@ static void veth_set_xdp_features(struct net_device *dev)
struct veth_priv *priv_peer = netdev_priv(peer);
xdp_features_t val = NETDEV_XDP_ACT_BASIC |
NETDEV_XDP_ACT_REDIRECT |
- NETDEV_XDP_ACT_RX_SG;
+ NETDEV_XDP_ACT_RX_SG | NETDEV_XDP_ACT_XSK_ZEROCOPY;

if (priv_peer->_xdp_prog || veth_gro_requested(peer))
val |= NETDEV_XDP_ACT_NDO_XMIT |
@@ -1737,11 +1887,89 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
return err;
}

+static int veth_xsk_pool_enable(struct net_device *dev, struct xsk_buff_pool *pool, u16 qid)
+{
+ struct veth_priv *peer_priv;
+ struct veth_priv *priv = netdev_priv(dev);
+ struct net_device *peer_dev = priv->peer;
+ int err = 0;
+
+ if (qid >= dev->real_num_tx_queues)
+ return -EINVAL;
+
+ if (!peer_dev)
+ return -EINVAL;
+
+ /* veth has no DMA, so skip the DMA check in xsk zero copy */
+ pool->dma_check_skip = true;
+
+ peer_priv = netdev_priv(peer_dev);
+
+ /* enable the peer's napi here; this side's
+ * xdp is enabled by veth_xdp_set.
+ * todo: we need to check whether this side already has xdp enabled,
+ * as it may not have an xdp prog attached.
+ */
+ if (!(peer_priv->_xdp_prog) && (!veth_gro_requested(peer_dev))) {
+ /* the peer should enable napi */
+ err = veth_napi_enable(peer_dev);
+ if (err)
+ return err;
+ }
+
+ /* This is already protected by rtnl_lock, so rcu_assign_pointer
+ * is safe.
+ */
+ rcu_assign_pointer(priv->sq[qid].xsk.pool, pool);
+
+ veth_napi_add_tx(dev);
+
+ return err;
+}
+
+static int veth_xsk_pool_disable(struct net_device *dev, u16 qid)
+{
+ struct veth_priv *peer_priv;
+ struct veth_priv *priv = netdev_priv(dev);
+ struct net_device *peer_dev = priv->peer;
+ int err = 0;
+
+ if (qid >= dev->real_num_tx_queues)
+ return -EINVAL;
+
+ if (!peer_dev)
+ return -EINVAL;
+
+ peer_priv = netdev_priv(peer_dev);
+
+ /* todo: this may fail */
+ if (!(peer_priv->_xdp_prog) && (!veth_gro_requested(peer_dev))) {
+ /* disable peer napi */
+ veth_napi_del(peer_dev);
+ }
+
+ veth_napi_del_tx(dev);
+
+ rcu_assign_pointer(priv->sq[qid].xsk.pool, NULL);
+ return err;
+}
+
+/* set up or tear down the xsk pool */
+static int veth_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp)
+{
+ if (xdp->xsk.pool)
+ return veth_xsk_pool_enable(dev, xdp->xsk.pool, xdp->xsk.queue_id);
+ else
+ return veth_xsk_pool_disable(dev, xdp->xsk.queue_id);
+}
+
static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
{
switch (xdp->command) {
case XDP_SETUP_PROG:
return veth_xdp_set(dev, xdp->prog, xdp->extack);
+ case XDP_SETUP_XSK_POOL:
+ return veth_xsk_pool_setup(dev, xdp);
default:
return -EINVAL;
}
--
2.20.1


2023-08-08 19:55:06

by Toke Høiland-Jørgensen

Subject: Re: [RFC v3 Optimizing veth xsk performance 0/9]

Albert Huang <[email protected]> writes:

> AF_XDP is a kernel bypass technology that can greatly improve performance.
> However,for virtual devices like veth,even with the use of AF_XDP sockets,
> there are still many additional software paths that consume CPU resources.
> This patch series focuses on optimizing the performance of AF_XDP sockets
> for veth virtual devices. Patches 1 to 4 mainly involve preparatory work.
> Patch 5 introduces tx queue and tx napi for packet transmission, while
> patch 8 primarily implements batch sending for IPv4 UDP packets, and patch 9
> add support for AF_XDP tx need_wakup feature. These optimizations significantly
> reduce the software path and support checksum offload.
>
> I tested those feature with
> A typical topology is shown below:
> client(send): server:(recv)
> veth<-->veth-peer veth1-peer<--->veth1
> 1 | | 7
> |2 6|
> | |
> bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1
> 3 4 5
> (machine1) (machine2)

I definitely applaud the effort to improve the performance of af_xdp
over veth, this is something we have flagged as in need of improvement
as well.

However, looking through your patch series, I am less sure that the
approach you're taking here is the right one.

AFAIU (speaking about the TX side here), the main difference between
AF_XDP ZC and the regular transmit mode is that in the regular TX mode
the stack will allocate an skb to hold the frame and push that down the
stack. Whereas in ZC mode, there's a driver NDO that gets called
directly, bypassing the skb allocation entirely.

In this series, you're implementing the ZC mode for veth, but the driver
code ends up allocating an skb anyway. Which seems to be a bit of a
weird midpoint between the two modes, and adds a lot of complexity to
the driver that (at least conceptually) is mostly just a
reimplementation of what the stack does in non-ZC mode (allocate an skb
and push it through the stack).

So my question is, why not optimise the non-zc path in the stack instead
of implementing the zc logic for veth? It seems to me that it would be
quite feasible to apply the same optimisations (bulking, and even GRO)
to that path and achieve the same benefits, without having to add all
this complexity to the veth driver?

-Toke


2023-08-09 08:32:49

by 黄杰

Subject: Re: Re: [RFC v3 Optimizing veth xsk performance 0/9]

Toke Høiland-Jørgensen <[email protected]> wrote on Tue, 8 Aug 2023 at 20:01:
>
> Albert Huang <[email protected]> writes:
>
> > AF_XDP is a kernel bypass technology that can greatly improve performance.
> > However,for virtual devices like veth,even with the use of AF_XDP sockets,
> > there are still many additional software paths that consume CPU resources.
> > This patch series focuses on optimizing the performance of AF_XDP sockets
> > for veth virtual devices. Patches 1 to 4 mainly involve preparatory work.
> > Patch 5 introduces tx queue and tx napi for packet transmission, while
> > patch 8 primarily implements batch sending for IPv4 UDP packets, and patch 9
> > add support for AF_XDP tx need_wakup feature. These optimizations significantly
> > reduce the software path and support checksum offload.
> >
> > I tested those feature with
> > A typical topology is shown below:
> > client(send): server:(recv)
> > veth<-->veth-peer veth1-peer<--->veth1
> > 1 | | 7
> > |2 6|
> > | |
> > bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1
> > 3 4 5
> > (machine1) (machine2)
>
> I definitely applaud the effort to improve the performance of af_xdp
> over veth, this is something we have flagged as in need of improvement
> as well.
>
> However, looking through your patch series, I am less sure that the
> approach you're taking here is the right one.
>
> AFAIU (speaking about the TX side here), the main difference between
> AF_XDP ZC and the regular transmit mode is that in the regular TX mode
> the stack will allocate an skb to hold the frame and push that down the
> stack. Whereas in ZC mode, there's a driver NDO that gets called
> directly, bypassing the skb allocation entirely.
>
> In this series, you're implementing the ZC mode for veth, but the driver
> code ends up allocating an skb anyway. Which seems to be a bit of a
> weird midpoint between the two modes, and adds a lot of complexity to
> the driver that (at least conceptually) is mostly just a
> reimplementation of what the stack does in non-ZC mode (allocate an skb
> and push it through the stack).
>
> So my question is, why not optimise the non-zc path in the stack instead
> of implementing the zc logic for veth? It seems to me that it would be
> quite feasible to apply the same optimisations (bulking, and even GRO)
> to that path and achieve the same benefits, without having to add all
> this complexity to the veth driver?
>
> -Toke
>
Thanks!
This is a really good idea; you've reminded me of something I
overlooked. I will look into implementing the solution you've proposed
and test the resulting performance improvement.

Albert.

2023-08-09 09:33:07

by Toke Høiland-Jørgensen

Subject: Re: Re: [RFC v3 Optimizing veth xsk performance 0/9]

黄杰 <[email protected]> writes:

> Toke Høiland-Jørgensen <[email protected]> wrote on Tue, 8 Aug 2023 at 20:01:
>>
>> Albert Huang <[email protected]> writes:
>>
>> > AF_XDP is a kernel bypass technology that can greatly improve performance.
>> > However,for virtual devices like veth,even with the use of AF_XDP sockets,
>> > there are still many additional software paths that consume CPU resources.
>> > This patch series focuses on optimizing the performance of AF_XDP sockets
>> > for veth virtual devices. Patches 1 to 4 mainly involve preparatory work.
>> > Patch 5 introduces tx queue and tx napi for packet transmission, while
>> > patch 8 primarily implements batch sending for IPv4 UDP packets, and patch 9
>> > add support for AF_XDP tx need_wakup feature. These optimizations significantly
>> > reduce the software path and support checksum offload.
>> >
>> > I tested those feature with
>> > A typical topology is shown below:
>> > client(send): server:(recv)
>> > veth<-->veth-peer veth1-peer<--->veth1
>> > 1 | | 7
>> > |2 6|
>> > | |
>> > bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1
>> > 3 4 5
>> > (machine1) (machine2)
>>
>> I definitely applaud the effort to improve the performance of af_xdp
>> over veth, this is something we have flagged as in need of improvement
>> as well.
>>
>> However, looking through your patch series, I am less sure that the
>> approach you're taking here is the right one.
>>
>> AFAIU (speaking about the TX side here), the main difference between
>> AF_XDP ZC and the regular transmit mode is that in the regular TX mode
>> the stack will allocate an skb to hold the frame and push that down the
>> stack. Whereas in ZC mode, there's a driver NDO that gets called
>> directly, bypassing the skb allocation entirely.
>>
>> In this series, you're implementing the ZC mode for veth, but the driver
>> code ends up allocating an skb anyway. Which seems to be a bit of a
>> weird midpoint between the two modes, and adds a lot of complexity to
>> the driver that (at least conceptually) is mostly just a
>> reimplementation of what the stack does in non-ZC mode (allocate an skb
>> and push it through the stack).
>>
>> So my question is, why not optimise the non-zc path in the stack instead
>> of implementing the zc logic for veth? It seems to me that it would be
>> quite feasible to apply the same optimisations (bulking, and even GRO)
>> to that path and achieve the same benefits, without having to add all
>> this complexity to the veth driver?
>>
>> -Toke
>>
> thanks!
> This idea is really good indeed. You've reminded me, and that's
> something I overlooked. I will now consider implementing the solution
> you've proposed and test the performance enhancement.

Sounds good, thanks! :)

-Toke


2023-08-09 11:59:52

by Jesper Dangaard Brouer

Subject: Re: [RFC v3 Optimizing veth xsk performance 0/9]


On 09/08/2023 11.06, Toke Høiland-Jørgensen wrote:
> 黄杰 <[email protected]> writes:
>
>> Toke Høiland-Jørgensen <[email protected]> wrote on Tue, 8 Aug 2023 at 20:01:
>>>
>>> Albert Huang <[email protected]> writes:
>>>
>>>> AF_XDP is a kernel bypass technology that can greatly improve performance.
>>>> However,for virtual devices like veth,even with the use of AF_XDP sockets,
>>>> there are still many additional software paths that consume CPU resources.
>>>> This patch series focuses on optimizing the performance of AF_XDP sockets
>>>> for veth virtual devices. Patches 1 to 4 mainly involve preparatory work.
>>>> Patch 5 introduces tx queue and tx napi for packet transmission, while
>>>> patch 8 primarily implements batch sending for IPv4 UDP packets, and patch 9
>>>> add support for AF_XDP tx need_wakup feature. These optimizations significantly
>>>> reduce the software path and support checksum offload.
>>>>
>>>> I tested those feature with
>>>> A typical topology is shown below:
>>>> client(send): server:(recv)
>>>> veth<-->veth-peer veth1-peer<--->veth1
>>>> 1 | | 7
>>>> |2 6|
>>>> | |
>>>> bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1
>>>> 3 4 5
>>>> (machine1) (machine2)
>>>
>>> I definitely applaud the effort to improve the performance of af_xdp
>>> over veth, this is something we have flagged as in need of improvement
>>> as well.
>>>
>>> However, looking through your patch series, I am less sure that the
>>> approach you're taking here is the right one.
>>>
>>> AFAIU (speaking about the TX side here), the main difference between
>>> AF_XDP ZC and the regular transmit mode is that in the regular TX mode
>>> the stack will allocate an skb to hold the frame and push that down the
>>> stack. Whereas in ZC mode, there's a driver NDO that gets called
>>> directly, bypassing the skb allocation entirely.
>>>
>>> In this series, you're implementing the ZC mode for veth, but the driver
>>> code ends up allocating an skb anyway. Which seems to be a bit of a
>>> weird midpoint between the two modes, and adds a lot of complexity to
>>> the driver that (at least conceptually) is mostly just a
>>> reimplementation of what the stack does in non-ZC mode (allocate an skb
>>> and push it through the stack).
>>>
>>> So my question is, why not optimise the non-zc path in the stack instead
>>> of implementing the zc logic for veth? It seems to me that it would be
>>> quite feasible to apply the same optimisations (bulking, and even GRO)
>>> to that path and achieve the same benefits, without having to add all
>>> this complexity to the veth driver?
>>>
>>> -Toke
>>>
>> thanks!
>> This idea is really good indeed. You've reminded me, and that's
>> something I overlooked. I will now consider implementing the solution
>> you've proposed and test the performance enhancement.
>
> Sounds good, thanks! :)

Good to hear, that you want to optimize the non-zc TX path of AF_XDP, as
Toke suggests.

There are a number of performance issues in AF_XDP non-zc TX that I've
talked/complained to Magnus and Bjørn about over the years.
I've recently started to work on fixing these myself, in collaboration
with Maryam (cc).

The most obvious is that non-zc TX uses socket memory accounting for the
SKBs that get allocated (ZC TX obviously doesn't). IMHO this doesn't
make sense, as the AF_XDP concept is to pre-allocate memory, so AF_XDP
memory limits are already bounded at setup time. Furthermore,
__xsk_generic_xmit() already has a backpressure mechanism based on the
available room in the CQ (Completion Queue). Hint: the
sock_alloc_send_skb() call is what does the socket memory accounting.

When AF_XDP gets combined with veth (or other layered software devices),
the problem gets worse, because:

(1) the SKB that gets allocated by xsk_build_skb() doesn't have enough
headroom to satisfy XDP requirement XDP_PACKET_HEADROOM.

(2) the backing memory type from sock_alloc_send_skb() is not
compatible with generic/veth XDP.

Both of these issues mean that when the peer veth device receives the
(AF_XDP) TX packet, it has to reallocate memory + SKB and copy the data
*again*.
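
To make issue (1) concrete, this is roughly the shape of the check on the
veth RX side that forces the reallocation (paraphrased for illustration,
not the exact upstream code):

#include <linux/skbuff.h>
#include <linux/bpf.h>	/* XDP_PACKET_HEADROOM */

/* If the skb coming from the peer is shared, nonlinear, or lacks
 * XDP_PACKET_HEADROOM bytes of headroom, the generic XDP path has to
 * allocate a new buffer and copy the packet before running the prog.
 */
static bool xdp_needs_realloc(const struct sk_buff *skb)
{
	return skb_shared(skb) ||
	       skb_shinfo(skb)->nr_frags ||
	       skb_headroom(skb) < XDP_PACKET_HEADROOM;
}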

I'm currently[1] looking into how to fix this and have some PoC patches
to estimate the performance benefit of avoiding the realloc when
entering veth. With packet size 512, the numbers start at 828 Kpps and
rise to 1002 Kpps (an increase of 20%, or about 208 nanosec per packet).

[1]
https://github.com/xdp-project/xdp-project/blob/veth-benchmark01/areas/core/veth_benchmark03.org

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer