Date: Mon, 18 Jan 2021 16:38:31 +0000
To: Magnus Karlsson
From: Alexander Lobakin
Cc: Alexander Lobakin, Yunsheng Lin, Xuan Zhuo, "Michael S. Tsirkin", Jason Wang, "David S.
 Miller", Jakub Kicinski, bjorn.topel@intel.com, Magnus Karlsson, Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Willem de Bruijn, Steffen Klassert, Miaohe Lin, Mauro Carvalho Chehab, Antoine Tenart, Michal Kubecek, Andrew Lunn, Florian Fainelli, Meir Lichtinger, virtualization@lists.linux-foundation.org, bpf, Network Development, open list
Reply-To: Alexander Lobakin
Subject: Re: [PATCH bpf-next] xsk: build skb by page
Message-ID: <20210118163742.10364-1-alobakin@pm.me>
In-Reply-To:
References: <579fa463bba42ac71591540a1811dca41d725350.1610764948.git.xuanzhuo@linux.alibaba.com>
 <4a4b475b-0e79-6cf6-44f5-44d45b5d85b5@huawei.com>
 <20210118125937.4088-1-alobakin@pm.me>
 <20210118143948.8706-1-alobakin@pm.me>
X-Mailing-List: linux-kernel@vger.kernel.org

> From: Magnus Karlsson
> Date: Mon, 18 Jan 2021 16:10:40 +0100
>
> On Mon, Jan 18, 2021 at 3:47 PM Alexander Lobakin wrote:
> >
> > From: Alexander Lobakin
> > Date: Mon, 18 Jan 2021 13:00:17 +0000
> >
> > > From: Yunsheng Lin
> > > Date: Mon, 18 Jan 2021 20:40:52 +0800
> > >
> > >> On 2021/1/16 10:44, Xuan Zhuo wrote:
> > >>> This patch is used to construct skb based on page to save memory copy
> > >>> overhead.
> > >>>
> > >>> This has one problem:
> > >>>
> > >>> We construct the skb by fill the data page as a frag into the skb. In
> > >>> this way, the linear space is empty, and the header information is also
> > >>> in the frag, not in the linear space, which is not allowed for some
> > >>> network cards. For example, Mellanox Technologies MT27710 Family
> > >>> [ConnectX-4 Lx] will get the following error message:
> > >>>
> > >>>     mlx5_core 0000:3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68
> > >>>     00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > >>>     00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > >>>     00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > >>>     00000030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2
> > >>>     WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64
> > >>>     00000000: 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00
> > >>>     00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > >>>     00000020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00
> > >>>     00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > >>>     mlx5_core 0000:3b:00.1 eth1: ERR CQE on SQ: 0x1dbb
> > >>>
> > >>> I also tried to use build_skb to construct skb, but because of the
> > >>> existence of skb_shinfo, it must be behind the linear space, so this
> > >>> method is not working. We can't put skb_shinfo on desc->addr, it will be
> > >>> exposed to users, this is not safe.
> > >>>
> > >>> Finally, I added a feature NETIF_F_SKB_NO_LINEAR to identify whether the
> > >>
> > >> Does it make sense to use ETHTOOL_TX_COPYBREAK tunable in ethtool to
> > >> configure if the data is copied or not?
> > >
> > > As far as I can grep, only mlx4 supports this, and it has a different
> > > meaning in that driver.
> > > So I guess a new netdev_feature would be a better solution.
> > >
> > >>> network card supports the header information of the packet in the frag
> > >>> and not in the linear space.
> > >>>
> > >>> ---------------- Performance Testing ------------
> > >>>
> > >>> The test environment is Aliyun ECS server.
> > >>> Test cmd:
> > >>> ```
> > >>> xdpsock -i eth0 -t -S -s
> > >>> ```
> > >>>
> > >>> Test result data:
> > >>>
> > >>> size    64      512     1024    1500
> > >>> copy    1916747 1775988 1600203 1440054
> > >>> page    1974058 1953655 1945463 1904478
> > >>> percent 3.0%    10.0%   21.58%  32.3%
> > >>>
> > >>> Signed-off-by: Xuan Zhuo
> > >>> Reviewed-by: Dust Li
> > >>> ---
> > >>>  drivers/net/virtio_net.c        |   2 +-
> > >>>  include/linux/netdev_features.h |   5 +-
> > >>>  net/ethtool/common.c            |   1 +
> > >>>  net/xdp/xsk.c                   | 108 +++++++++++++++++++++++++++++++++-------
> > >>>  4 files changed, 97 insertions(+), 19 deletions(-)
> > >>>
> > >>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > >>> index 4ecccb8..841a331 100644
> > >>> --- a/drivers/net/virtio_net.c
> > >>> +++ b/drivers/net/virtio_net.c
> > >>> @@ -2985,7 +2985,7 @@ static int virtnet_probe(struct virtio_device *vdev)
> > >>>          /* Set up network device as normal. */
> > >>>          dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE;
> > >>>          dev->netdev_ops = &virtnet_netdev;
> > >>> -        dev->features = NETIF_F_HIGHDMA;
> > >>> +        dev->features = NETIF_F_HIGHDMA | NETIF_F_SKB_NO_LINEAR;
> > >>>
> > >>>          dev->ethtool_ops = &virtnet_ethtool_ops;
> > >>>          SET_NETDEV_DEV(dev, &vdev->dev);
> > >>> diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
> > >>> index 934de56..8dd28e2 100644
> > >>> --- a/include/linux/netdev_features.h
> > >>> +++ b/include/linux/netdev_features.h
> > >>> @@ -85,9 +85,11 @@ enum {
> > >>>
> > >>>          NETIF_F_HW_MACSEC_BIT,          /* Offload MACsec operations */
> > >>>
> > >>> +        NETIF_F_SKB_NO_LINEAR_BIT,      /* Allow skb linear is empty */
> > >>> +
> > >>>          /*
> > >>>           * Add your fresh new feature above and remember to update
> > >>> -         * netdev_features_strings[] in net/core/ethtool.c and maybe
> > >>> +         * netdev_features_strings[] in net/ethtool/common.c and maybe
> > >>>           * some feature mask #defines below. Please also describe it
> > >>>           * in Documentation/networking/netdev-features.rst.
> > >>>           */
> > >>> @@ -157,6 +159,7 @@ enum {
> > >>>  #define NETIF_F_GRO_FRAGLIST    __NETIF_F(GRO_FRAGLIST)
> > >>>  #define NETIF_F_GSO_FRAGLIST    __NETIF_F(GSO_FRAGLIST)
> > >>>  #define NETIF_F_HW_MACSEC       __NETIF_F(HW_MACSEC)
> > >>> +#define NETIF_F_SKB_NO_LINEAR   __NETIF_F(SKB_NO_LINEAR)
> > >>>
> > >>>  /* Finds the next feature with the highest number of the range of start till 0.
> > >>>   */
> > >>> diff --git a/net/ethtool/common.c b/net/ethtool/common.c
> > >>> index 24036e3..2f3d309 100644
> > >>> --- a/net/ethtool/common.c
> > >>> +++ b/net/ethtool/common.c
> > >>> @@ -68,6 +68,7 @@
> > >>>          [NETIF_F_HW_TLS_RX_BIT] =        "tls-hw-rx-offload",
> > >>>          [NETIF_F_GRO_FRAGLIST_BIT] =     "rx-gro-list",
> > >>>          [NETIF_F_HW_MACSEC_BIT] =        "macsec-hw-offload",
> > >>> +        [NETIF_F_SKB_NO_LINEAR_BIT] =    "skb-no-linear",
> > >
> > > I completely forgot to add that you'd better to mention in both
> > > enumeration/feature and its Ethtool string that the feature applies
> > > to Tx path.
> > > Smth like:
> > >
> > > NETIF_F_SKB_TX_NO_LINEAR{,_BIT}, "skb-tx-no-linear"
> > > or
> > > NETIF_F_TX_SKB_NO_LINEAR{,_BIT}, "tx-skb-no-linear"
> > >
> > > Otherwise, it may be confusing for users and developers.
>
> I prefer one of these names for the property as they clearly describe
> a feature that the driver supports.
>
> > OR, I think we may tight the feature with the new approach to build
> > skbs by page as it makes no sense for anything else.
> > So, if we define something like:
> >
> > NETIF_F_XSK_TX_GENERIC_ZC{,_BIT}, "xsk-tx-generic-zerocopy",
>
> This one I misunderstood first. I thought: "this is not zerocopy", but
> you are right it is. It is zero-copy implemented with skb:s. But in my
> mind, the NO_LINEAR version that you suggested are clearer.
>
> > then user can toggle your new XSK Tx path on/off via Ethtool for
> > drivers that will support it (don't forget to add it to hw_features
> > for virtio_net then).

Users don't need to enable this manually: drivers usually enable most
of their features on netdevice creation. This way we would just have
an option to turn it off.

If the feature is not going to be exposed to the user at all, and only
indicates whether a particular driver supports skbs with
skb_headlen == 0 on its .ndo_start_xmit() path, then it might be
better to introduce a private flag (netdev_priv_flags) instead of a
netdev_feature. Private flags are kernel-only and can't be toggled
on/off after the netdev is registered.
E.g.

IFF_TX_SKB_NO_LINEAR

and test it like

if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
	/* new generic zerocopy path */
} else {
	/* current code */
}

> > >>>  };
> > >>>
> > >>>  const char
> > >>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > >>> index 8037b04..94d17dc 100644
> > >>> --- a/net/xdp/xsk.c
> > >>> +++ b/net/xdp/xsk.c
> > >>> @@ -430,6 +430,95 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> > >>>          sock_wfree(skb);
> > >>>  }
> > >>>
> > >>> +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > >>> +                                              struct xdp_desc *desc)
> > >>> +{
> > >>> +        u32 len, offset, copy, copied;
> > >>> +        struct sk_buff *skb;
> > >>> +        struct page *page;
> > >>> +        char *buffer;
> > >>> +        int err, i;
> > >>> +        u64 addr;
> > >>> +
> > >>> +        skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
> > >>> +        if (unlikely(!skb))
> > >>> +                return NULL;
> > >>> +
> > >>> +        addr = desc->addr;
> > >>> +        len = desc->len;
> > >>> +
> > >>> +        buffer = xsk_buff_raw_get_data(xs->pool, addr);
> > >>> +        offset = offset_in_page(buffer);
> > >>> +        addr = buffer - (char *)xs->pool->addrs;
> > >>> +
> > >>> +        for (copied = 0, i = 0; copied < len; ++i) {
> > >>> +                page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
> > >>> +
> > >>> +                get_page(page);
> > >>> +
> > >>> +                copy = min((u32)(PAGE_SIZE - offset), len - copied);
> > >>> +
> > >>> +                skb_fill_page_desc(skb, i, page, offset, copy);
> > >>> +
> > >>> +                copied += copy;
> > >>> +                addr += copy;
> > >>> +                offset = 0;
> > >>> +        }
> > >>> +
> > >>> +        skb->len += len;
> > >>> +        skb->data_len += len;
> > >>> +        skb->truesize += len;
> > >>> +
> > >>> +        refcount_add(len, &xs->sk.sk_wmem_alloc);
> > >>> +
> > >>> +        return skb;
> > >>> +}
> > >>> +
> > >>> +static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > >>> +                                     struct xdp_desc *desc, int *err)
> > >>> +{
> > >>> +        struct sk_buff *skb;
> > >>> +
> > >>> +        if (xs->dev->features & NETIF_F_SKB_NO_LINEAR) {
> > >>> +                skb = xsk_build_skb_zerocopy(xs, desc);
> > >>> +                if (unlikely(!skb)) {
> > >>> +                        *err = -ENOMEM;
> > >>> +                        return NULL;
> > >>> +                }
> > >>> +        } else {
> > >>> +                char *buffer;
> > >>> +                u64 addr;
> > >>> +                u32 len;
> > >>> +                int err;
> > >>> +
> > >>> +                len = desc->len;
> > >>> +                skb = sock_alloc_send_skb(&xs->sk, len, 1, &err);
> > >>> +                if (unlikely(!skb)) {
> > >>> +                        *err = -ENOMEM;
> > >>> +                        return NULL;
> > >>> +                }
> > >>> +
> > >>> +                skb_put(skb, len);
> > >>> +                addr = desc->addr;
> > >>> +                buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);
> > >>> +                err = skb_store_bits(skb, 0, buffer, len);
> > >>> +
> > >>> +                if (unlikely(err)) {
> > >>> +                        kfree_skb(skb);
> > >>> +                        *err = -EINVAL;
> > >>> +                        return NULL;
> > >>> +                }
> > >>> +        }
> > >>> +
> > >>> +        skb->dev = xs->dev;
> > >>> +        skb->priority = xs->sk.sk_priority;
> > >>> +        skb->mark = xs->sk.sk_mark;
> > >>> +        skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr;
> > >>> +        skb->destructor = xsk_destruct_skb;
> > >>> +
> > >>> +        return skb;
> > >>> +}
> > >>> +
> > >>>  static int xsk_generic_xmit(struct sock *sk)
> > >>>  {
> > >>>          struct xdp_sock *xs = xdp_sk(sk);
> > >>> @@ -446,43 +535,28 @@ static int xsk_generic_xmit(struct sock *sk)
> > >>>                  goto out;
> > >>>
> > >>>          while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
> > >>> -                char *buffer;
> > >>> -                u64 addr;
> > >>> -                u32 len;
> > >>> -
> > >>>                  if (max_batch-- == 0) {
> > >>>                          err = -EAGAIN;
> > >>>                          goto out;
> > >>>                  }
> > >>>
> > >>> -                len = desc.len;
> > >>> -                skb = sock_alloc_send_skb(sk, len, 1, &err);
> > >>> +                skb = xsk_build_skb(xs, &desc, &err);
> > >>>                  if (unlikely(!skb))
> > >>>                          goto out;
> > >>>
> > >>> -                skb_put(skb, len);
> > >>> -                addr = desc.addr;
> > >>> -                buffer = xsk_buff_raw_get_data(xs->pool, addr);
> > >>> -                err = skb_store_bits(skb, 0, buffer, len);
> > >>>                  /* This is the backpressure mechanism for the Tx path.
> > >>>                   * Reserve space in the completion queue and only proceed
> > >>>                   * if there is space in it. This avoids having to implement
> > >>>                   * any buffering in the Tx path.
> > >>>                   */
> > >>>                  spin_lock_irqsave(&xs->pool->cq_lock, flags);
> > >>> -                if (unlikely(err) || xskq_prod_reserve(xs->pool->cq)) {
> > >>> +                if (xskq_prod_reserve(xs->pool->cq)) {
> > >>>                          spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
> > >>>                          kfree_skb(skb);
> > >>>                          goto out;
> > >>>                  }
> > >>>                  spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
> > >>>
> > >>> -                skb->dev = xs->dev;
> > >>> -                skb->priority = sk->sk_priority;
> > >>> -                skb->mark = sk->sk_mark;
> > >>> -                skb_shinfo(skb)->destructor_arg = (void *)(long)desc.addr;
> > >>> -                skb->destructor = xsk_destruct_skb;
> > >>> -
> > >>>                  err = __dev_direct_xmit(skb, xs->queue_id);
> > >>>                  if (err == NETDEV_TX_BUSY) {
> > >>>                          /* Tell user-space to retry the send */
> > >>>
> > >
> > > Al
> >
> > Al
>

Al
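
For illustration only, and not part of the patch under review: a minimal userspace sketch of the private-flag test suggested in this reply. Every `sketch_`/`SKETCH_` name and the bit value are hypothetical stand-ins. In the kernel, the flag would presumably be added to `enum netdev_priv_flags` and tested on `struct net_device`, and the two branches would build the skb rather than return an enum.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the proposed IFF_TX_SKB_NO_LINEAR private flag.
 * The bit position is arbitrary and chosen only for this sketch. */
#define SKETCH_IFF_TX_SKB_NO_LINEAR (1u << 0)

/* Hypothetical stand-in for the one struct net_device field used here. */
struct sketch_netdev {
	unsigned int priv_flags; /* kernel-only, not toggled after registration */
};

/* The two Tx paths discussed in the thread. */
enum sketch_tx_path {
	SKETCH_TX_COPY,     /* current code: copy the frame into the linear area */
	SKETCH_TX_ZEROCOPY, /* new generic zerocopy path: attach pages as frags */
};

/* Pick the Tx path from the driver's private flags, mirroring the
 * "if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR)" test in the reply. */
enum sketch_tx_path sketch_pick_tx_path(const struct sketch_netdev *dev)
{
	assert(dev != NULL);

	if (dev->priv_flags & SKETCH_IFF_TX_SKB_NO_LINEAR)
		return SKETCH_TX_ZEROCOPY;

	return SKETCH_TX_COPY;
}
```

Under these assumptions, a driver able to transmit skbs with an empty linear area would set the flag before registering its netdev, and every other driver would stay on the copy path with no user-visible knob.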