From: Yongji Xie
Date: Fri, 7 Oct 2022 19:21:51 +0800
Subject: Re: ublk-qcow2: ublk-qcow2 is available
To: Ming Lei
Cc: Stefan Hajnoczi, Stefan Hajnoczi, io-uring@vger.kernel.org,
    linux-block@vger.kernel.org, linux-kernel, Kirill Tkhai,
    Manuel Bentele, qemu-devel@nongnu.org, Kevin Wolf, rjones@redhat.com,
    "Denis V. Lunev", Stefano Garzarella
Lunev" , Stefano Garzarella Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Oct 7, 2022 at 6:51 PM Ming Lei wrote: > > On Fri, Oct 07, 2022 at 06:04:29PM +0800, Yongji Xie wrote: > > On Thu, Oct 6, 2022 at 7:24 PM Ming Lei wrote: > > > > > > On Wed, Oct 05, 2022 at 08:21:45AM -0400, Stefan Hajnoczi wrote: > > > > On Wed, 5 Oct 2022 at 00:19, Ming Lei wrote: > > > > > > > > > > On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote: > > > > > > On Tue, 4 Oct 2022 at 05:44, Ming Lei wrote: > > > > > > > > > > > > > > On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote: > > > > > > > > On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote: > > > > > > > > > ublk-qcow2 is available now. > > > > > > > > > > > > > > > > Cool, thanks for sharing! > > > > > > > > > > > > > > > > > > > > > > > > > > So far it provides basic read/write function, and compression and snapshot > > > > > > > > > aren't supported yet. The target/backend implementation is completely > > > > > > > > > based on io_uring, and share the same io_uring with ublk IO command > > > > > > > > > handler, just like what ublk-loop does. > > > > > > > > > > > > > > > > > > Follows the main motivations of ublk-qcow2: > > > > > > > > > > > > > > > > > > - building one complicated target from scratch helps libublksrv APIs/functions > > > > > > > > > become mature/stable more quickly, since qcow2 is complicated and needs more > > > > > > > > > requirement from libublksrv compared with other simple ones(loop, null) > > > > > > > > > > > > > > > > > > - there are several attempts of implementing qcow2 driver in kernel, such as > > > > > > > > > ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2 > > > > > > > > > might useful be for covering requirement in this field > > > > > > > > > > > > > > > > > > - performance comparison with qemu-nbd, and it was my 1st thought to evaluate > > > > > > > > > performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv > > > > > > > > > is started > > > > > > > > > > > > > > > > > > - help to abstract common building block or design pattern for writing new ublk > > > > > > > > > target/backend > > > > > > > > > > > > > > > > > > So far it basically passes xfstest(XFS) test by using ublk-qcow2 block > > > > > > > > > device as TEST_DEV, and kernel building workload is verified too. Also > > > > > > > > > soft update approach is applied in meta flushing, and meta data > > > > > > > > > integrity is guaranteed, 'make test T=qcow2/040' covers this kind of > > > > > > > > > test, and only cluster leak is reported during this test. > > > > > > > > > > > > > > > > > > The performance data looks much better compared with qemu-nbd, see > > > > > > > > > details in commit log[1], README[5] and STATUS[6]. And the test covers both > > > > > > > > > empty image and pre-allocated image, for example of pre-allocated qcow2 > > > > > > > > > image(8GB): > > > > > > > > > > > > > > > > > > - qemu-nbd (make test T=qcow2/002) > > > > > > > > > > > > > > > > Single queue? > > > > > > > > > > > > > > Yeah. 
> > > > > > > >
> > > > > > > > Please try qemu-storage-daemon's VDUSE export type as well. The
> > > > > > > > command-line should be similar to this:
> > > > > > > >
> > > > > > > > # modprobe virtio_vdpa   # attaches vDPA devices to host kernel
> > > > > > >
> > > > > > > I can't find the virtio_vdpa module even though I enabled all the
> > > > > > > following options:
> > > > > > >
> > > > > > >   --- vDPA drivers
> > > > > > >         vDPA device simulator core
> > > > > > >         vDPA simulator for networking device
> > > > > > >         vDPA simulator for block device
> > > > > > >         VDUSE (vDPA Device in Userspace) support
> > > > > > >         Intel IFC VF vDPA driver
> > > > > > >         Virtio PCI bridge vDPA driver
> > > > > > >         vDPA driver for Alibaba ENI
> > > > > > >
> > > > > > > BTW, my test environment is a VM and the shared data was collected
> > > > > > > in the VM too; can virtio_vdpa be used inside a VM?
> > > > > >
> > > > > > I hope Xie Yongji can help explain how to benchmark VDUSE.
> > > > > >
> > > > > > virtio_vdpa is available inside guests too. Please check that
> > > > > > VIRTIO_VDPA ("vDPA driver for virtio devices") is enabled in the
> > > > > > "Virtio drivers" menu.
> > > > > >
> > > > > > > > # modprobe vduse
> > > > > > > > # qemu-storage-daemon \
> > > > > > > >     --blockdev file,filename=test.qcow2,cache.direct=on|off,aio=native,node-name=file \
> > > > > > > >     --blockdev qcow2,file=file,node-name=qcow2 \
> > > > > > > >     --object iothread,id=iothread0 \
> > > > > > > >     --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
> > > > > > > > # vdpa dev add name vduse0 mgmtdev vduse
> > > > > > > >
> > > > > > > > A virtio-blk device should appear and xfstests can be run on it
> > > > > > > > (typically /dev/vda unless you already have other virtio-blk
> > > > > > > > devices).
> > > > > > > >
> > > > > > > > Afterwards you can destroy the device using:
> > > > > > > >
> > > > > > > > # vdpa dev del vduse0
> > > > > > > >
> > > > > > > > > - ublk-qcow2 (make test T=qcow2/022)
> > > > > > > >
> > > > > > > > There are a lot of other factors not directly related to NBD vs
> > > > > > > > ublk. In order to get an apples-to-apples comparison with qemu-*,
> > > > > > > > a ublk export type is needed in qemu-storage-daemon. That way the
> > > > > > > > only difference is the ublk interface and the rest of the code path
> > > > > > > > is identical, making it possible to compare NBD, VDUSE, ublk, etc.
> > > > > > > > more precisely.
> > > > > > >
> > > > > > > Maybe not true.
> > > > > > >
> > > > > > > ublk-qcow2 uses io_uring to handle all backend IO (including meta
> > > > > > > IO) completely, and so far a single io_uring/pthread handles all
> > > > > > > qcow2 IOs and IO commands.
> > > > > >
> > > > > > qemu-nbd doesn't use io_uring to handle the backend IO, so we don't
> > > > >
> > > > > I tried to use it via --aio=io_uring for setting up qemu-nbd, but did
> > > > > not succeed.
> > > > >
> > > > > > know whether the benchmark demonstrates that ublk is faster than NBD,
> > > > > > that the ublk-qcow2 implementation is faster than qemu-nbd's qcow2,
> > > > > > whether there are miscellaneous implementation differences between
> > > > > > ublk-qcow2 and qemu-nbd (like using the same io_uring context for both
> > > > > > ublk and backend IO), or something else.
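(A sketch of that qemu-nbd attempt, assuming a QEMU build with io_uring
support compiled in; the image and device names are illustrative:)

    # modprobe nbd
    # qemu-nbd --connect=/dev/nbd0 --format=qcow2 --cache=none \
               --aio=io_uring test.qcow2
    ... run the workload against /dev/nbd0 ...
    # qemu-nbd --disconnect /dev/nbd0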
> > > > >
> > > > > The theory shouldn't be too complicated:
> > > > >
> > > > > 1) io_uring passthrough (pt) communication is faster than a socket;
> > > > > the io command is carried over io_uring pt commands, and should be
> > > > > faster than virtio communication too.
> > > > >
> > > > > 2) io_uring IO handling is faster than the libaio used in the qemu-nbd
> > > > > test, and all qcow2 backend IO (including meta IO) is handled by
> > > > > io_uring.
> > > > >
> > > > > https://github.com/ming1/ubdsrv/blob/master/tests/common/qcow2_common
> > > > >
> > > > > 3) ublk uses one single io_uring to handle all io commands and qcow2
> > > > > backend IOs, so batched handling is common, and it is easy to see
> > > > > dozens of IOs/io commands handled in a single syscall, or even more.
> > > >
> > > > I agree with the theory, but theory has to be tested through
> > > > experiments in order to validate it. We can all learn from systematic
> > > > performance analysis - there might even be bottlenecks in ublk that
> > > > can be solved to improve performance further.
> > >
> > > Indeed, one thing is that ublk uses get_user_pages() to retrieve user
> > > pages for copying data, which may add latency for big-chunk IO, since
> > > the latency of get_user_pages() should increase linearly with nr_pages.
> > >
> > > I looked into the vduse code a bit too: vduse still needs the page
> > > copy, but lots of bounce pages are allocated and cached for the whole
> > > device lifetime, which avoids the runtime latency of retrieving and
> > > allocating pages at the cost of extra memory consumption. Correct me
> > > if this is wrong, Xie Yongji or anyone?
> > >
> >
> > Yes, you are right. Another way is registering the preallocated
> > userspace memory as the bounce buffer.
>
> Thanks for the clarification.
>
> IMO, the page consumption is too much for vduse: each vdpa device has
> one vduse_iova_domain, which may allocate up to 64K bounce pages (i.e.,
> up to 256MB with 4KB pages), and these pages won't be freed until the
> device itself is freed.
>

Yes, actually in our initial design this can be mitigated by some
memory reclaim mechanism and zero-copy support. We could even let
multiple vdpa devices share one iova domain.

Thanks,
Yongji

> But it is one solution for implementing a generic userspace device (not
> limited to block devices), and this idea seems great.
>
> Thanks,
> Ming
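(To run the kind of experiment Stefan asks for above, the VDUSE pieces
quoted earlier combine into roughly the following end-to-end run; the
image name and fio parameters are illustrative:)

    # modprobe vduse
    # modprobe virtio_vdpa    # so the vdpa device binds to virtio-blk
    # qemu-storage-daemon \
        --blockdev file,filename=test.qcow2,cache.direct=on,aio=native,node-name=file \
        --blockdev qcow2,file=file,node-name=qcow2 \
        --object iothread,id=iothread0 \
        --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0 &
    # vdpa dev add name vduse0 mgmtdev vduse
    # fio --name=randwrite --filename=/dev/vda --direct=1 --ioengine=libaio \
          --rw=randwrite --bs=4k --iodepth=64 --numjobs=1 --runtime=30 --time_based
    # vdpa dev del vduse0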