Received: by 2002:a05:6358:701b:b0:131:369:b2a3 with SMTP id 27csp4482451rwo; Tue, 25 Jul 2023 06:50:08 -0700 (PDT) X-Google-Smtp-Source: APBJJlGxxZRUGjgSjgsuV4CNjbFZd0vachGxtsGOpQdIzzOTQc/yo6N1l80xWtLk0zD0FN/3wW8M X-Received: by 2002:a17:906:2096:b0:992:42d4:a7dc with SMTP id 22-20020a170906209600b0099242d4a7dcmr12170917ejq.21.1690293008050; Tue, 25 Jul 2023 06:50:08 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1690293008; cv=none; d=google.com; s=arc-20160816; b=09jbLqjyRCXBgrboJwUYWJ2is/rh0hKyzgGv5to5wdPjoQO1B2AS3Ft73KhlI+75+S Q99uUxfsYLYONw6duTH3zP/PRDxHpdbMmxoBaM7p5MhT81TBIelaC2H4OdVJDfTQBaqI LhvptuoL9O6Z/O0q5aALSnUvntNU+S24TPu8wEI7DgGZctJ9OOkgRQcqGW8oDVrsuqrm SFgMwcJb0rZZqXaA+pWkYhvGX3baY+PWPelSHzB6+DcIdwxtmJVYOhCPV1I3ra50s7Dh VcCUttnfNwL5rCpBo+7aTKVXpTrjuP2fXLd+gkPrSxYJeRYSlgd5zmmpzYvHoZPwDtPW akaQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature; bh=Z12R1uTDDrV/Hu5AGi/EpYkmK+9JgUWLpfxhqmRlK/Y=; fh=p8Cgs2ymfKt84CMTknS/7hK3+aUjkgTBCSKVsrpCjaY=; b=q4cepE40FtVuII8p+IUVJLQ42TfdF3C6CrcVh4hlkMxd1bWsaMOUaw0Y+pccoEJ7oY sL5ymq1YvLb5RHghjdCBXlyFvfVMSe4Fs38mDrcUl5F9zxnJwpxifN+0wbhJDte0JPea uuWei4z/HLq3pqx49ax7M+EcbyFFHqV54IBU8k+hbWJEXb1wIhbdtJkUDeJcviq6L00q vo1UENz6L1k7CrCtUYvaf5f5POIovt65WmszH9Xd3LWEQH26sIo4b8m0//aSwcVMq8MM 26P7czlsPZik1uxuUHoS/7Fv4SUBN1J0dxBkgvRLsPL7Kno2g9X/8qd+4D4PZybxmnLD WAhQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=HdjaV12I; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id p26-20020a170906141a00b00993664a9970si8518283ejc.873.2023.07.25.06.49.43; Tue, 25 Jul 2023 06:50:08 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=HdjaV12I; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230144AbjGYNWk (ORCPT + 99 others); Tue, 25 Jul 2023 09:22:40 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52796 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230060AbjGYNWi (ORCPT ); Tue, 25 Jul 2023 09:22:38 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 641CA10F8 for ; Tue, 25 Jul 2023 06:21:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1690291309; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=Z12R1uTDDrV/Hu5AGi/EpYkmK+9JgUWLpfxhqmRlK/Y=; b=HdjaV12IVy87MebqsSlt7kpXQJV7RuGt2GA5RzxzyLc803OZ5iQyUemaDGjMUSVTLZAdP7 /yPxDAIuncSqEpnIKLqzngA34csHiQ/u+LyOQ1rBODnmqLWQuNXuBn2sHPot9HOrBQIloP Fofu6dP0sYQRcv2uyYFhUTFM8tJgooI= Received: from mail-ej1-f72.google.com (mail-ej1-f72.google.com [209.85.218.72]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-638-e3Asz0wcP-O4I7sq89TEig-1; Tue, 25 Jul 2023 09:21:48 -0400 X-MC-Unique: e3Asz0wcP-O4I7sq89TEig-1 Received: by mail-ej1-f72.google.com with SMTP id a640c23a62f3a-993d631393fso457140466b.0 for ; Tue, 25 Jul 2023 06:21:48 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1690291307; x=1690896107; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=Z12R1uTDDrV/Hu5AGi/EpYkmK+9JgUWLpfxhqmRlK/Y=; b=XXhdI6vfFUiOHpvG78lRJy4FOimLuzOx121ZHs/ilJT8n+EEWsKg1M05L6TLudwsI0 jVTOcIcUpeU3o7G9XwJLllZOsDhMdgDh/zSBNmZh9vI1m2VOBnpjBA0nYoEkcdOT7+f4 vswHti1qSbPRvDSVO2cAIl5aXMhThM0dWKebTDnoK8YJvkA1rl8ufnaUDgn+kzIeVae/ DJJ8bf9ChdaTAuiDVPuMQfpZTuGz9aOK28o5xd7Li4s3WqjxadoVqQX24m3RhkErFdBI po5ubxHOfFdB/HJvcq33XBru3R3tHhcMrl76GRqAeLnj6QGySxu5XgnAf0fFVz3I0nSs tHPQ== X-Gm-Message-State: ABy/qLaI0jZJXxrrOOFrUwj8WEVyCCTXSTb1J2pEzARyLaYjLHfB1PyR PobO/yzbj6H9pMTV3XB4xkek8tl4N68VvYuDt/VNedavGRuK/9ZCmbOXL4MZEyTd8gQVHxli7JX OoUmbSco2KBH41qeQ/c31vtQm X-Received: by 2002:a17:906:20d5:b0:982:b920:daad with SMTP id c21-20020a17090620d500b00982b920daadmr12140548ejc.71.1690291307228; Tue, 25 Jul 2023 06:21:47 -0700 (PDT) X-Received: by 2002:a17:906:20d5:b0:982:b920:daad with SMTP id c21-20020a17090620d500b00982b920daadmr12140536ejc.71.1690291306921; Tue, 25 Jul 2023 06:21:46 -0700 (PDT) Received: from sgarzare-redhat ([193.207.178.30]) by smtp.gmail.com with ESMTPSA id b27-20020a1709062b5b00b00992b71d8f19sm8121153ejg.133.2023.07.25.06.21.44 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 25 Jul 2023 06:21:45 -0700 (PDT) Date: Tue, 25 Jul 2023 15:21:33 +0200 From: Stefano Garzarella To: "Michael S. Tsirkin" Cc: Arseniy Krasnov , Stefan Hajnoczi , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jason Wang , Bobby Eshleman , kvm@vger.kernel.org, virtualization@lists.linux-foundation.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, kernel@sberdevices.ru, oxffffaa@gmail.com Subject: Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support Message-ID: References: <20230720214245.457298-1-AVKrasnov@sberdevices.ru> <20230720214245.457298-5-AVKrasnov@sberdevices.ru> <091c067b-43a0-da7f-265f-30c8c7e62977@sberdevices.ru> <20230725042544-mutt-send-email-mst@kernel.org> <20230725090514-mutt-send-email-mst@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Disposition: inline In-Reply-To: <20230725090514-mutt-send-email-mst@kernel.org> X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H4,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_NONE, T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jul 25, 2023 at 09:06:02AM -0400, Michael S. Tsirkin wrote: >On Tue, Jul 25, 2023 at 02:53:39PM +0200, Stefano Garzarella wrote: >> On Tue, Jul 25, 2023 at 07:50:53AM -0400, Michael S. Tsirkin wrote: >> > On Fri, Jul 21, 2023 at 08:09:03AM +0300, Arseniy Krasnov wrote: >> > > >> > > >> > > On 21.07.2023 00:42, Arseniy Krasnov wrote: >> > > > This adds handling of MSG_ZEROCOPY flag on transmission path: if this >> > > > flag is set and zerocopy transmission is possible (enabled in socket >> > > > options and transport allows zerocopy), then non-linear skb will be >> > > > created and filled with the pages of user's buffer. Pages of user's >> > > > buffer are locked in memory by 'get_user_pages()'. Second thing that >> > > > this patch does is replace type of skb owning: instead of calling >> > > > 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this >> > > > change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc' >> > > > of socket, so to decrease this field correctly proper skb destructor is >> > > > needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'. >> > > > >> > > > Signed-off-by: Arseniy Krasnov >> > > > --- >> > > > Changelog: >> > > > v5(big patchset) -> v1: >> > > > * Refactorings of 'if' conditions. >> > > > * Remove extra blank line. >> > > > * Remove 'frag_off' field unneeded init. >> > > > * Add function 'virtio_transport_fill_skb()' which fills both linear >> > > > and non-linear skb with provided data. >> > > > v1 -> v2: >> > > > * Use original order of last four arguments in 'virtio_transport_alloc_skb()'. >> > > > v2 -> v3: >> > > > * Add new transport callback: 'msgzerocopy_check_iov'. It checks that >> > > > provided 'iov_iter' with data could be sent in a zerocopy mode. >> > > > If this callback is not set in transport - transport allows to send >> > > > any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true' >> > > > then zerocopy is allowed. Reason of this callback is that in case of >> > > > G2H transmission we insert whole skb to the tx virtio queue and such >> > > > skb must fit to the size of the virtio queue to be sent in a single >> > > > iteration (may be tx logic in 'virtio_transport.c' could be reworked >> > > > as in vhost to support partial send of current skb). This callback >> > > > will be enabled only for G2H path. For details pls see comment >> > > > 'Check that tx queue...' below. >> > > > >> > > > include/net/af_vsock.h | 3 + >> > > > net/vmw_vsock/virtio_transport.c | 39 ++++ >> > > > net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------ >> > > > 3 files changed, 241 insertions(+), 58 deletions(-) >> > > > >> > > > diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h >> > > > index 0e7504a42925..a6b346eeeb8e 100644 >> > > > --- a/include/net/af_vsock.h >> > > > +++ b/include/net/af_vsock.h >> > > > @@ -177,6 +177,9 @@ struct vsock_transport { >> > > > >> > > > /* Read a single skb */ >> > > > int (*read_skb)(struct vsock_sock *, skb_read_actor_t); >> > > > + >> > > > + /* Zero-copy. */ >> > > > + bool (*msgzerocopy_check_iov)(const struct iov_iter *); >> > > > }; >> > > > >> > > > /**** CORE ****/ >> > > > diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c >> > > > index 7bbcc8093e51..23cb8ed638c4 100644 >> > > > --- a/net/vmw_vsock/virtio_transport.c >> > > > +++ b/net/vmw_vsock/virtio_transport.c >> > > > @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq) >> > > > queue_work(virtio_vsock_workqueue, &vsock->rx_work); >> > > > } >> > > > >> > > > +static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov) >> > > > +{ >> > > > + struct virtio_vsock *vsock; >> > > > + bool res = false; >> > > > + >> > > > + rcu_read_lock(); >> > > > + >> > > > + vsock = rcu_dereference(the_virtio_vsock); >> > > > + if (vsock) { >> > > > + struct virtqueue *vq; >> > > > + int iov_pages; >> > > > + >> > > > + vq = vsock->vqs[VSOCK_VQ_TX]; >> > > > + >> > > > + iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE; >> > > > + >> > > > + /* Check that tx queue is large enough to keep whole >> > > > + * data to send. This is needed, because when there is >> > > > + * not enough free space in the queue, current skb to >> > > > + * send will be reinserted to the head of tx list of >> > > > + * the socket to retry transmission later, so if skb >> > > > + * is bigger than whole queue, it will be reinserted >> > > > + * again and again, thus blocking other skbs to be sent. >> > > > + * Each page of the user provided buffer will be added >> > > > + * as a single buffer to the tx virtqueue, so compare >> > > > + * number of pages against maximum capacity of the queue. >> > > > + * +1 means buffer for the packet header. >> > > > + */ >> > > > + if (iov_pages + 1 <= vq->num_max) >> > > >> > > I think this check is actual only for case one we don't have indirect buffer feature. >> > > With indirect mode whole data to send will be packed into one indirect buffer. >> > > >> > > Thanks, Arseniy >> > >> > Actually the reverse. With indirect you are limited to num_max. >> > Without you are limited to whatever space is left in the >> > queue (which you did not check here, so you should). >> > >> > >> > > > + res = true; >> > > > + } >> > > > + >> > > > + rcu_read_unlock(); >> > >> > Just curious: >> > is the point of all this RCU dance to allow vsock >> > to change from under us? then why is it ok to >> > have it change? the virtio_transport_msgzerocopy_check_iov >> > will then refer to the old vsock ... >> >> IIRC we introduced the RCU to handle hot-unplug issues: >> commit 0deab087b16a ("vsock/virtio: use RCU to avoid use-after-free on >> the_virtio_vsock") >> >> When we remove the device, we flush all the works, etc. so we should >> not be in this case (referring the old vsock), except for an irrelevant >> transient as the device is disappearing. >> >> Stefano > >what if old device goes away then new one appears? In virtio_vsock_remove() (.remove cb) we hold `the_virtio_vsock_mutex` while flushing all the works/sockets/packets and sync the RCU. In virtio_vsock_probe (.probe cb) we hold the same lock while adding the new one and updating the RCU pointer. (only 1 virtio-vsock device per guest is currently supported) So when the new one appears, all the previous sockets are closed, all the queue packets and pending works flushed. So new packets will see the new vsock device. It looks safe to me. Stefano