Date: Wed, 3 May 2023 14:52:35 +0200
From: Stefano Garzarella
To: Arseniy Krasnov
Cc: Stefan Hajnoczi, "David S. Miller", Eric Dumazet, Jakub Kicinski,
 Paolo Abeni, "Michael S. Tsirkin", Jason Wang, Bobby Eshleman,
 kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
 netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
 kernel@sberdevices.ru, oxffffaa@gmail.com
Subject: Re: [RFC PATCH v2 00/15] vsock: MSG_ZEROCOPY flag support
References: <20230423192643.1537470-1-AVKrasnov@sberdevices.ru>
In-Reply-To: <20230423192643.1537470-1-AVKrasnov@sberdevices.ru>

Hi Arseniy,

Sorry for the delay, but I have been very busy.

I can't apply this series on master or net-next; can you share the
base commit with me?

On Sun, Apr 23, 2023 at 10:26:28PM +0300, Arseniy Krasnov wrote:
>Hello,
>
>                            DESCRIPTION
>
>This is MSG_ZEROCOPY feature support for virtio/vsock. I tried to
>follow the current implementation for TCP as much as possible:
>
>1) The sender must enable the SO_ZEROCOPY option to use this feature.
>   Without this option, data will be sent in the "classic" copy manner
>   and the MSG_ZEROCOPY flag will be ignored (e.g. no completion is
>   queued).
>
>2) The kernel uses completions from the socket's error queue: a single
>   completion for a single tx syscall (or several completions may be
>   merged into one). I reused the logic already implemented for
>   MSG_ZEROCOPY support: 'msg_zerocopy_realloc()' etc.
>
>The difference from the copy path is not significant. During packet
>allocation, a non-linear skb is created, then 'pin_user_pages()' is
>called for each page of the user's iov iterator and each returned page
>is added to the skb as a fragment.
>There are also some updates for the vhost and guest parts of the
>transport: in both cases I've added handling of non-linear skbs for
>the virtio part. vhost copies data from such an skb to the guest's rx
>virtio buffers. In the guest, the virtio transport fills the tx virtio
>queue with pages from the skb.
>
>This version has several limits/problems:
>
>1) As this feature depends entirely on the transport, there is no way
>   (or it is difficult) to check whether the transport can handle it
>   when SO_ZEROCOPY is set. It seems I would need to call the
>   AF_VSOCK-specific setsockopt callback from the SOL_SOCKET
>   setsockopt callback, but this leads to a locking problem, because
>   neither callback is designed to be called from the other. So in the
>   current version SO_ZEROCOPY is set successfully on any type (e.g.
>   transport) of AF_VSOCK socket, but if the transport does not
>   support MSG_ZEROCOPY, the tx routine will fail with EOPNOTSUPP.

Do you plan to fix this in the next versions? If it is too
complicated, I think we can keep this limitation until we find a good
solution.

>
>2) When MSG_ZEROCOPY is used, for each tx system call we need to
>   enqueue one completion. Each completion carries a flag which shows
>   how the tx was performed: zerocopy or copy. This means the whole
>   message must be sent either in zerocopy or in copy mode - we can't
>   send part of a message by copying and the rest in zerocopy mode (or
>   vice versa). We also need to respect the vsock credit logic, i.e.
>   we can't always send all the data at once - only the allowed number
>   of bytes may be sent at any moment. In the copy case there is no
>   problem, as in the worst case we can send single bytes, but
>   zerocopy is more complex because the smallest transmission unit is
>   a single page. So if there is not enough space at the peer's side
>   for an integer number of pages (at least one), we will wait, thus
>   stalling the tx side.
>   To overcome this problem I've added a simple rule: zerocopy is
>   possible only when there is enough space at the other side for the
>   whole message (to check whether the current 'msghdr' was already
>   used in previous tx iterations, I use the 'iov_offset' field of its
>   iov iterator).

So, IIUC if MSG_ZEROCOPY is set but there isn't enough space at the
destination, we temporarily disable zerocopy, even though MSG_ZEROCOPY
is set. Right?

If that is the case, it seems reasonable to me.

>
>3) The loopback transport is not supported, because it requires
>   implementing non-linear skb handling in the dequeue logic (as we
>   "send" a fragmented skb and "receive" it from the same queue). I'm
>   going to implement it in the next versions.
>
>   ^^^ fixed in v2
>
>4) The current implementation sets the max length of a packet to 64KB.
>   IIUC this is due to the 'kmalloc()' allocated data buffers. I think
>   in the MSG_ZEROCOPY case this value could be increased, because
>   'kmalloc()' is not used for data - user space pages are used as
>   buffers. This limit also trims every message > 64KB, so such
>   messages will be sent in copy mode due to the 'iov_offset' check
>   in 2).
>
>   ^^^ fixed in v2
>
>                        PATCHSET STRUCTURE
>
>The patchset has the following structure:
>1) Handle non-linear skbuff on receive in virtio/vhost.
>2) Handle non-linear skbuff on send in virtio/vhost.
>3) Updates for AF_VSOCK.
>4) Enable MSG_ZEROCOPY support on transports.
>5) Tests/tools/docs updates.
>
>                           PERFORMANCE
>
>It is a little bit tricky to compare performance between copy and
>zerocopy transmissions. In the zerocopy case we need to wait until the
>user buffers are released by the kernel, so it is something like a
>synchronous path (wait until the device driver has processed them),
>while in the copy case we can feed data to the kernel as fast as we
>want, without caring about the device driver. So I compared only the
>time spent in the 'send()' syscall. Combining this value with the
>total number of transmitted bytes gives a Gbit/s figure.
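A note on the arithmetic here, since the throughput tables are derived
this way: Gbit/s = bytes * 8 / seconds / 1e9. A trivial helper
(illustrative only, not code from the series or from vsock_perf):

```c
/* Convert bytes transmitted and time spent in send() into Gbit/s.
 * Illustrative only; not code from the series or from vsock_perf. */
#include <assert.h>
#include <math.h>

static double to_gbps(double bytes, double seconds)
{
	return bytes * 8.0 / seconds / 1e9;
}
```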
>Also, to avoid tx stalls due to insufficient credit, the receiver
>allocates the same amount of space as the sender needs.
>
>Sender:
>./vsock_perf --sender --buf-size --bytes 256M [--zc]
>
>Receiver:
>./vsock_perf --vsk-size 256M
>
>G2H transmission (values are Gbit/s):
>
>*----------*------*----------*
>| buf size | copy | zerocopy |
>*----------*------*----------*
>| 4KB      |    3 |       10 |
>| 32KB     |    9 |       45 |
>| 256KB    |   24 |      195 |
>| 1M       |   27 |      270 |
>| 8M       |   22 |      277 |
>*----------*------*----------*
>
>H2G:
>
>*----------*------*----------*
>| buf size | copy | zerocopy |
>*----------*------*----------*
>| 4KB      |   17 |       11 |

Do you know why zerocopy is slower in this case? Could it be the cost
of pinning/unpinning the pages?

>| 32KB     |   30 |       66 |
>| 256KB    |   38 |      179 |
>| 1M       |   38 |      234 |
>| 8M       |   28 |      279 |
>*----------*------*----------*
>
>Loopback:
>
>*----------*------*----------*
>| buf size | copy | zerocopy |
>*----------*------*----------*
>| 4KB      |    8 |        7 |
>| 32KB     |   34 |       42 |
>| 256KB    |   43 |       83 |
>| 1M       |   40 |      109 |
>| 8M       |   40 |      171 |
>*----------*------*----------*
>
>I suppose the huge difference between the two modes above has two
>reasons:
>1) We don't need to copy the data.
>2) We don't need to allocate a buffer for the data, only for the
>   header.
>
>Zerocopy is faster than classic copy mode, but of course it requires a
>specific application architecture due to user page pinning, buffer
>size and alignment.
>
>If the host fails to send data with "Cannot allocate memory", check
>the value of /proc/sys/net/core/optmem_max - it is accounted during
>completion skb allocation.

What does the user need to do? Increase it?

>
>                             TESTING
>
>This patchset includes a set of tests for the MSG_ZEROCOPY feature. I
>tried to cover the new code as much as possible, so there are
>different cases for MSG_ZEROCOPY transmissions: with SO_ZEROCOPY
>disabled and with several io vector types (different sizes,
>alignments, with unmapped pages). I also ran the tests with the
>loopback transport and with vsockmon running.

Thanks for the tests again :-)

This cover letter is very good, with a lot of details, but please add
more details to each single patch, explaining the reason for the
changes; otherwise it is very difficult to review, because it is a
very big change.

I'll do a per-patch review in the next days.

Thanks,
Stefano