Received: by 2002:a05:6a10:8c0a:0:0:0:0 with SMTP id go10csp1199013pxb; Sun, 17 Jan 2021 00:01:23 -0800 (PST) X-Google-Smtp-Source: ABdhPJyxH924zGt+sbOdckZx/v79eZIkRZ74T7XWoYOX2jR7SFWOqptdnd/Vr8NHqedWyPiV9Rgr X-Received: by 2002:a17:906:2898:: with SMTP id o24mr13757714ejd.215.1610870483505; Sun, 17 Jan 2021 00:01:23 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1610870483; cv=none; d=google.com; s=arc-20160816; b=c55z+NRvt7XeDiHhQg+aipKPttDJxesvO6YSnLtrMExBiYTsbd9IVh0HDVbBNGYOVB LrwY/NxXmR5qBqNwCbfDHE/l1HSBVxB83CT+p0FxuSr9+jrOiN/tikudpofAOiSPj/jE dEmBrrw3zIIF0loXPwJQ7UNkPptRGXa3a6To5OAMOC0ttVyn+RnMuc1mH+q4xuKsnjbx 5EQ7sW3W7uUwapaNDzaSGd3JSCQ+/QynQvuS1YeHlzP8A6+o5xRtA6ueefeR5iUKfd5j oCAsSr+Jq525PMVEjFx09jQ/I0W+01NO45211Uw8AWjwQT4CpTuwrR5A3zSJbTSyXBof 1xhg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=VESx0q50uHif31i5KEl5hWIvCIRQQNFTAOMGoAM4sHk=; b=yhPZIrO4iV80G9KpOjUDm4XI9E/SjgZNU5U7uyP7jKvREr2YgJZIhxGqRG664QO2n0 DYYE+y65uNSWyUwHFgRzxaEbD3iepjAeBFNIvjcnnPqI6+ctn4vY0bCNFhQmd6A1ATUf 9YxHYE4+YJ51Txf0Jcm3lAhJjItwrKqYK4Gesrjuu7RkRsFzWrqtljIzYKEGOIameXIo d2Dki48jnjSpaPvjzM4Gdz9HoFXD80pnMsYndq3vDNL9jmF7ju3pcgYcBZYqhvWA7Gxl 2DuDQrYAVtROcIrxohOCxGFWl6HbqyVZK2f1KG1vOj9hF8S3rEseTSC18B2jQrLR54sc 9bHA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@daynix-com.20150623.gappssmtp.com header.s=20150623 header.b=dowfzTW2; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id d20si6160506eds.14.2021.01.17.00.00.50; Sun, 17 Jan 2021 00:01:23 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@daynix-com.20150623.gappssmtp.com header.s=20150623 header.b=dowfzTW2; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728042AbhAQH7A (ORCPT + 99 others); Sun, 17 Jan 2021 02:59:00 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58312 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727446AbhAQH6l (ORCPT ); Sun, 17 Jan 2021 02:58:41 -0500 Received: from mail-ot1-x333.google.com (mail-ot1-x333.google.com [IPv6:2607:f8b0:4864:20::333]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D1E24C061574 for ; Sat, 16 Jan 2021 23:57:59 -0800 (PST) Received: by mail-ot1-x333.google.com with SMTP id i20so5393434otl.7 for ; Sat, 16 Jan 2021 23:57:59 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=daynix-com.20150623.gappssmtp.com; s=20150623; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=VESx0q50uHif31i5KEl5hWIvCIRQQNFTAOMGoAM4sHk=; b=dowfzTW23mQmJth/93y3cwCyGyvAl+fFqQcvXECxcGq8jo5zaQkT722+738BfC0A14 Bsqd/ozm5BiNtKsHXJ5UDIa7Wfm31hW52j0wVk/7WFwHHDzt1eytW0ilYs1zsM5BMX7s t1nLNTzQDYtxUd/eiOB+foBYE4wfdrR00TKQgH+31X2W7aLhZPfvJR4go5ov7vm4hlh+ 814jWwD7JM8JKTkGOMTEvwqXRqGSJC9Z16hI/+aYhFdXlqRtovk9JT9kkHGl5VyDVRY2 DewWaIsAU0QFt3CgW6rxqv0Qz3cazUlkFC55+YDczF0e0pq3KuAN2Y+g7PvU+9NnN9hd t2zQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=VESx0q50uHif31i5KEl5hWIvCIRQQNFTAOMGoAM4sHk=; b=D8NWVXAxOHDDgx+V9cmZksCPkACBm5TAn1KQ4LmokuTZjhjcinhXcyBKYw9J4tP32A QUoL2VDDPeAFRrnoQlTUiOOEtBy9KmsPBSZkJE6kAhSwXv32SxgnHIBZPUEox7789pk6 plqEaEAlf3MIR7GHjMHUkFqPkEam62WZbJvHh4VIxHMPkFJNTnIkUB7g1blWFRKTUPKU yUTB5KeKRohbLhaSFlKiwx7GmxcfO3Yk9ttuTNP3B4Fr0FSWDoa6hLqRKA5I2TvwTam9 i6BqoBGigAZ6CjcxI45NG4veEHjyq5YVf2Uit2yahtRA+MfwdrWQx3BOu6JDSYOW6lUA 0llQ== X-Gm-Message-State: AOAM531enqaXjtwGsBbkQj/nt0ldQw1yRdczforPWWn3fRuhe941EO8a i5tYGy1VUCXOxT3n4a27xhMGVcRf7kLvhj7Lg3Y8bQ== X-Received: by 2002:a9d:4715:: with SMTP id a21mr14710269otf.220.1610870278976; Sat, 16 Jan 2021 23:57:58 -0800 (PST) MIME-Version: 1.0 References: <20210112194143.1494-1-yuri.benditovich@daynix.com> <78bbc518-4b73-4629-68fb-2713250f8967@redhat.com> <8ea218a8-a068-1ed9-929d-67ad30111c3c@redhat.com> In-Reply-To: <8ea218a8-a068-1ed9-929d-67ad30111c3c@redhat.com> From: Yuri Benditovich Date: Sun, 17 Jan 2021 09:57:47 +0200 Message-ID: Subject: Re: [RFC PATCH 0/7] Support for virtio-net hash reporting To: Jason Wang Cc: Willem de Bruijn , "David S. Miller" , Jakub Kicinski , "Michael S . Tsirkin" , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau , Song Liu , Yonghong Song , John Fastabend , KP Singh , Randy Dunlap , "Gustavo A . R . Silva" , Herbert Xu , Steffen Klassert , Pablo Neira Ayuso , decui@microsoft.com, cai@lca.pw, Jakub Sitnicki , Marco Elver , Paolo Abeni , Network Development , linux-kernel , kvm@vger.kernel.org, virtualization@lists.linux-foundation.org, bpf , Yan Vugenfirer Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jan 14, 2021 at 5:39 AM Jason Wang wrote: > > > On 2021/1/13 =E4=B8=8B=E5=8D=8810:33, Willem de Bruijn wrote: > > On Tue, Jan 12, 2021 at 11:11 PM Jason Wang wrote= : > >> > >> On 2021/1/13 =E4=B8=8A=E5=8D=887:47, Willem de Bruijn wrote: > >>> On Tue, Jan 12, 2021 at 3:29 PM Yuri Benditovich > >>> wrote: > >>>> On Tue, Jan 12, 2021 at 9:49 PM Yuri Benditovich > >>>> wrote: > >>>>> On Tue, Jan 12, 2021 at 9:41 PM Yuri Benditovich > >>>>> wrote: > >>>>>> Existing TUN module is able to use provided "steering eBPF" to > >>>>>> calculate per-packet hash and derive the destination queue to > >>>>>> place the packet to. The eBPF uses mapped configuration data > >>>>>> containing a key for hash calculation and indirection table > >>>>>> with array of queues' indices. > >>>>>> > >>>>>> This series of patches adds support for virtio-net hash reporting > >>>>>> feature as defined in virtio specification. It extends the TUN mod= ule > >>>>>> and the "steering eBPF" as follows: > >>>>>> > >>>>>> Extended steering eBPF calculates the hash value and hash type, ke= eps > >>>>>> hash value in the skb->hash and returns index of destination virtq= ueue > >>>>>> and the type of the hash. TUN module keeps returned hash type in > >>>>>> (currently unused) field of the skb. > >>>>>> skb->__unused renamed to 'hash_report_type'. > >>>>>> > >>>>>> When TUN module is called later to allocate and fill the virtio-ne= t > >>>>>> header and push it to destination virtqueue it populates the hash > >>>>>> and the hash type into virtio-net header. > >>>>>> > >>>>>> VHOST driver is made aware of respective virtio-net feature that > >>>>>> extends the virtio-net header to report the hash value and hash re= port > >>>>>> type. > >>>>> Comment from Willem de Bruijn: > >>>>> > >>>>> Skbuff fields are in short supply. I don't think we need to add one > >>>>> just for this narrow path entirely internal to the tun device. > >>>>> > >>>> We understand that and try to minimize the impact by using an alread= y > >>>> existing unused field of skb. > >>> Not anymore. It was repurposed as a flags field very recently. > >>> > >>> This use case is also very narrow in scope. And a very short path fro= m > >>> data producer to consumer. So I don't think it needs to claim scarce > >>> bits in the skb. > >>> > >>> tun_ebpf_select_queue stores the field, tun_put_user reads it and > >>> converts it to the virtio_net_hdr in the descriptor. > >>> > >>> tun_ebpf_select_queue is called from .ndo_select_queue. Storing the > >>> field in skb->cb is fragile, as in theory some code could overwrite > >>> that between field between ndo_select_queue and > >>> ndo_start_xmit/tun_net_xmit, from which point it is fully under tun > >>> control again. But in practice, I don't believe anything does. > >>> > >>> Alternatively an existing skb field that is used only on disjoint > >>> datapaths, such as ingress-only, could be viable. > >> > >> A question here. We had metadata support in XDP for cooperation betwee= n > >> eBPF programs. Do we have something similar in the skb? > >> > >> E.g in the RSS, if we want to pass some metadata information between > >> eBPF program and the logic that generates the vnet header (either hard > >> logic in the kernel or another eBPF program). Is there any way that ca= n > >> avoid the possible conflicts of qdiscs? > > Not that I am aware of. The closest thing is cb[]. > > > > It'll have to aliase a field like that, that is known unused for the gi= ven path. > > > Right, we need to make sure cb is not used by other ones. I'm not sure > how hard to achieve that consider Qemu installs the eBPF program but it > doesn't deal with networking configurations. > > > > > > One other approach that has been used within linear call stacks is out > > of band. Like percpu variables softnet_data.xmit.more and > > mirred_rec_level. But that is perhaps a bit overwrought for this use > > case. > > > Yes, and if we go that way then eBPF turns out to be a burden since we > need to invent helpers to access those auxiliary data structure. It > would be better then to hard-coded the RSS in the kernel. > > > > > >>>>> Instead, you could just run the flow_dissector in tun_put_user if t= he > >>>>> feature is negotiated. Indeed, the flow dissector seems more apt to= me > >>>>> than BPF here. Note that the flow dissector internally can be > >>>>> overridden by a BPF program if the admin so chooses. > >>>>> > >>>> When this set of patches is related to hash delivery in the virtio-n= et > >>>> packet in general, > >>>> it was prepared in context of RSS feature implementation as defined = in > >>>> virtio spec [1] > >>>> In case of RSS it is not enough to run the flow_dissector in tun_put= _user: > >>>> in tun_ebpf_select_queue the TUN calls eBPF to calculate the hash, > >>>> hash type and queue index > >>>> according to the (mapped) parameters (key, hash types, indirection > >>>> table) received from the guest. > >>> TUNSETSTEERINGEBPF was added to support more diverse queue selection > >>> than the default in case of multiqueue tun. Not sure what the exact > >>> use cases are. > >>> > >>> But RSS is exactly the purpose of the flow dissector. It is used for > >>> that purpose in the software variant RPS. The flow dissector > >>> implements a superset of the RSS spec, and certainly computes a > >>> four-tuple for TCP/IPv6. In the case of RPS, it is skipped if the NIC > >>> has already computed a 4-tuple hash. > >>> > >>> What it does not give is a type indication, such as > >>> VIRTIO_NET_HASH_TYPE_TCPv6. I don't understand how this would be used= . > >>> In datapaths where the NIC has already computed the four-tuple hash > >>> and stored it in skb->hash --the common case for servers--, That type > >>> field is the only reason to have to compute again. > >> > >> The problem is there's no guarantee that the packet comes from the NIC= , > >> it could be a simple VM2VM or host2VM packet. > >> > >> And even if the packet is coming from the NIC that calculates the hash > >> there's no guarantee that it's the has that guest want (guest may use > >> different RSS keys). > > Ah yes, of course. > > > > I would still revisit the need to store a detailed hash_type along with > > the hash, as as far I can tell that conveys no actionable information > > to the guest. > > > Yes, need to figure out its usage. According to [1], it only mention > that storing has type is a charge of driver. Maybe Yuri can answer this. > For the case of Windows VM we can't know how exactly the network stack uses provided hash data (including hash type). But: different releases of Windows enable different hash types (for example UDP hash is enabled only on Server 2016 and up). Indeed the Windows requires a little more from the network adapter/driver than Linux does. The addition of RSS support to virtio specification takes in account the widest set of requirements (i.e. Windows one), our initial impression is that this should be enough also for Linux. The NDIS specification in part of RSS is _mandatory_ and there are certification tests that check that the driver provides the hash data as expected. All the high-performance network adapters have such RSS functionality in the hardware. With pre-RSS QEMU (i.e. where the virtio-net device does not indicate the RSS support) the virtio-net driver for Windows does all the job related to RSS: - hash calculation - hash/hash_type delivery - reporting each packet on the correct CPU according to RSS settings With RSS support in QEMU all the packets always come on a proper CPU and the driver never needs to reschedule them. The driver still need to calculate the hash and report it to Windows. In this case we do the same job twice: the d= evice (QEMU or eBPF) does calculate the hash and get proper queue/CPU to deliver the packet. But the hash is not delivered by the device, so the driver need= s to recalculate it and report to the Windows. If we add HASH_REPORT support (current set of patches) and the device indicates this feature we can avoid hash recalculation in the driver assuming we receive the correct hash value and hash type. Otherwise the driver can't know which exactly hash the device has calculated. Please let me know if I did not answer the question. > Thanks > > [1] > https://docs.microsoft.com/en-us/windows-hardware/drivers/network/indicat= ing-rss-receive-data > > > > >