Date: Mon, 12 Jun 2023 17:20:52 +0000
From: Bobby Eshleman
To: Arseniy Krasnov
Cc: Stefan Hajnoczi, Stefano Garzarella, "David S. Miller",
    Eric Dumazet, Jakub Kicinski, Paolo Abeni, "Michael S. Tsirkin",
    Jason Wang, Bobby Eshleman, kvm@vger.kernel.org,
    virtualization@lists.linux-foundation.org, netdev@vger.kernel.org,
    linux-kernel@vger.kernel.org, kernel@sberdevices.ru,
    oxffffaa@gmail.com
Subject: Re: [RFC PATCH v4 00/17] vsock: MSG_ZEROCOPY flag support
References: <20230603204939.1598818-1-AVKrasnov@sberdevices.ru>
In-Reply-To: <20230603204939.1598818-1-AVKrasnov@sberdevices.ru>
X-Mailing-List: linux-kernel@vger.kernel.org

Hey Arseniy,

Thanks for this series, very good stuff!

On Sat, Jun 03, 2023 at 11:49:22PM +0300, Arseniy Krasnov wrote:
> Hello,
>
>                            DESCRIPTION
>
> this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
> current implementation for TCP as much as possible:
>
> 1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this
>    flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY
>    flag will be ignored (e.g. without completion).
>
> 2) Kernel uses completions from socket's error queue. Single completion
>    for single tx syscall (or it can merge several completions to single
>    one). I used already implemented logic for MSG_ZEROCOPY support:
>    'msg_zerocopy_realloc()' etc.
>
> Difference with copy way is not significant. During packet allocation,
> non-linear skb is created and filled with pinned user pages.
> There are also some updates for vhost and guest parts of transport - in
> both cases i've added handling of non-linear skb for virtio part. vhost
> copies data from such skb to the guest's rx virtio buffers.
> In the guest, virtio transport fills tx virtio queue with pages from skb.
>
> Head of this patchset is:
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d20dd0ea14072e8a90ff864b2c1603bd68920b4b
>
>
> This version has several limits/problems:
>
> 1) As this feature totally depends on transport, there is no way (or it
>    is difficult) to check whether transport is able to handle it or not
>    during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific
>    setsockopt callback from setsockopt callback for SOL_SOCKET, but this
>    leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback
>    are not considered to be called from each other. So in current version
>    SO_ZEROCOPY is set successfully to any type (e.g. transport) of
>    AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY,
>    tx routine will fail with EOPNOTSUPP.
>
>    ^^^
>    This is still no resolved :(
>

I think to get around this you could set SOCK_CUSTOM_SOCKOPT in the
vsock create function, handle SO_ZEROCOPY in the vsock handler, but pass
the rest of the common options to sock_setsockopt().

I think the next issue you would run into, though, is that users may
call setsockopt() before connect(), and so the transport will still be
unknown (except for dgrams, which are weird for reasons).

What do you think about EOPNOTSUPP being returned when the user selects
an incompatible transport with connect() instead of returning it later
in the tx path?

> 2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
>    one completion. In each completion there is flag which shows how tx
>    was performed: zerocopy or copy. This leads that whole message must
>    be send in zerocopy or copy way - we can't send part of message with
>    copying and rest of message with zerocopy mode (or vice versa). Now,
>    we need to account vsock credit logic, e.g. we can't send whole data
>    once - only allowed number of bytes could sent at any moment.
>    In case of copying way there is no problem as in worst case we can
>    send single bytes, but zerocopy is more complex because smallest
>    transmission unit is single page. So if there is not enough space at
>    peer's side to send integer number of pages (at least one) - we will
>    wait, thus stalling tx side. To overcome this problem i've added
>    simple rule - zerocopy is possible only when there is enough space at
>    another side for whole message (to check, that current 'msghdr' was
>    already used in previous tx iterations i use 'iov_offset' field of
>    it's iov iter).
>
>    ^^^
>    Discussed as ok during v2. Link:
>    https://lore.kernel.org/netdev/23guh3txkghxpgcrcjx7h62qsoj3xgjhfzgtbmqp2slrz3rxr4@zya2z7kwt75l/
>
> 3) loopback transport is not supported, because it requires to implement
>    non-linear skb handling in dequeue logic (as we "send" fragged skb
>    and "receive" it from the same queue). I'm going to implement it in
>    next versions.
>
>    ^^^ fixed in v2
>
> 4) Current implementation sets max length of packet to 64KB. IIUC this
>    is due to 'kmalloc()' allocated data buffers. I think, in case of
>    MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
>    not touched for data - user space pages are used as buffers. Also
>    this limit trims every message which is > 64KB, thus such messages
>    will be send in copy mode due to 'iov_offset' check in 2).
>
>    ^^^ fixed in v2
>
>                          PATCHSET STRUCTURE
>
> Patchset has the following structure:
> 1) Handle non-linear skbuff on receive in virtio/vhost.
> 2) Handle non-linear skbuff on send in virtio/vhost.
> 3) Updates for AF_VSOCK.
> 4) Enable MSG_ZEROCOPY support on transports.
> 5) Tests/tools/docs updates.
>
>                             PERFORMANCE
>
> Performance: it is a little bit tricky to compare performance between
> copy and zerocopy transmissions.
> In zerocopy way we need to wait when
> user buffers will be released by kernel, so it is like synchronous
> path (wait until device driver will process it), while in copy way we
> can feed data to kernel as many as we want, don't care about device
> driver. So I compared only time which we spend in the 'send()' syscall.
> Then if this value will be combined with total number of transmitted
> bytes, we can get Gbit/s parameter. Also to avoid tx stalls due to not
> enough credit, receiver allocates same amount of space as sender needs.
>
> Sender:
> ./vsock_perf --sender --buf-size --bytes 256M [--zc]
>
> Receiver:
> ./vsock_perf --vsk-size 256M
>
> I run tests on two setups: desktop with Core i7 - I use this PC for
> development and in this case guest is nested guest, and host is normal
> guest. Another hardware is some embedded board with Atom - here I don't
> have nested virtualization - host runs on hw, and guest is normal guest.
>
> G2H transmission (values are Gbit/s):
>
> Core i7 with nested guest.           Atom with normal guest.
>
> *-------------------------------*    *-------------------------------*
> |          |         |          |    |          |         |          |
> | buf size |   copy  | zerocopy |    | buf size |   copy  | zerocopy |
> |          |         |          |    |          |         |          |
> *-------------------------------*    *-------------------------------*
> |   4KB    |    3    |    10    |    |   4KB    |   0.8   |   1.9    |
> *-------------------------------*    *-------------------------------*
> |   32KB   |   20    |    61    |    |   32KB   |   6.8   |   20.2   |
> *-------------------------------*    *-------------------------------*
> |  256KB   |   33    |   244    |    |  256KB   |   7.8   |    55    |
> *-------------------------------*    *-------------------------------*
> |    1M    |   30    |   373    |    |    1M    |    7    |    95    |
> *-------------------------------*    *-------------------------------*
> |    8M    |   22    |   475    |    |    8M    |    7    |   114    |
> *-------------------------------*    *-------------------------------*
>
> H2G:
>
> Core i7 with nested guest.           Atom with normal guest.
>
> *-------------------------------*    *-------------------------------*
> |          |         |          |    |          |         |          |
> | buf size |   copy  | zerocopy |    | buf size |   copy  | zerocopy |
> |          |         |          |    |          |         |          |
> *-------------------------------*    *-------------------------------*
> |   4KB    |   20    |    10    |    |   4KB    |   4.37  |    3     |
> *-------------------------------*    *-------------------------------*
> |   32KB   |   37    |    75    |    |   32KB   |   11    |    18    |
> *-------------------------------*    *-------------------------------*
> |  256KB   |   44    |   299    |    |  256KB   |   11    |    62    |
> *-------------------------------*    *-------------------------------*
> |    1M    |   28    |   335    |    |    1M    |    9    |    77    |
> *-------------------------------*    *-------------------------------*
> |    8M    |   27    |   417    |    |    8M    |   9.35  |   115    |
> *-------------------------------*    *-------------------------------*
>

Nice!

[...]

Thanks,
Bobby
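
P.S. (illustrative note) The two pieces of arithmetic described in the quoted cover letter can be sketched in plain userspace C. This is only a sketch under stated assumptions, not code from the patchset: the names vsock_can_zerocopy() and send_gbit_per_sec() and their parameters are hypothetical, and "Gbit" is assumed to mean 10^9 bits. The first helper models the rule from limitation 2) (zerocopy only when the peer has room for the whole message and the msghdr was not already partly consumed); the second models the way the Gbit/s figures are derived from total bytes and the time spent in send().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the patchset.
 *
 * msg_len    - total length of the user's message, in bytes
 * iov_offset - bytes of this message consumed by earlier tx iterations
 *              (0 means this is the first iteration for this msghdr)
 * peer_free  - space currently available at the peer, in bytes
 */
bool vsock_can_zerocopy(size_t msg_len, size_t iov_offset, size_t peer_free)
{
    /* Part of the message was already sent: it must stay in copy mode,
     * because one completion reports a single mode for the whole call. */
    if (iov_offset != 0)
        return false;

    /* Zerocopy only if the whole message fits at the other side. */
    return msg_len <= peer_free;
}

/*
 * Hypothetical helper: throughput in Gbit/s from the total number of
 * transmitted bytes and the time spent inside send(), in seconds.
 * Assumes 1 Gbit = 1e9 bits.
 */
double send_gbit_per_sec(uint64_t total_bytes, double send_seconds)
{
    if (send_seconds <= 0.0)
        return 0.0; /* avoid dividing by zero on a bad measurement */

    return ((double)total_bytes * 8.0) / 1e9 / send_seconds;
}
```

For example, a 4096-byte message with iov_offset 0 and 8192 bytes free at the peer would qualify for zerocopy under this sketch, while the same message with only 2048 bytes free would fall back to copy.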