Date: Tue, 17 May 2022 09:57:56 +0800
From: Ming Lei
To: Stefan Hajnoczi
Cc: Jens Axboe, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
    io-uring@vger.kernel.org, Gabriel Krisman Bertazi, ZiyangZhang,
    Xiaoguang Wang, kwolf@redhat.com, sgarzare@redhat.com
Subject: Re: [RFC PATCH] ubd: add io_uring based userspace block driver
References: <20220509092312.254354-1-ming.lei@redhat.com>

Hello Stefan,

On Mon, May 16, 2022 at 08:29:25PM +0100, Stefan Hajnoczi wrote:
> Hi,
> This looks interesting! I have some questions:

Thanks for your comment!

>
> 1. What is the ubdsrv permission model?
>
> A big usability challenge for *-in-userspace interfaces is the balance
> between security and allowing unprivileged processes to use these
> features.
>
> - Does /dev/ubd-control need to be privileged? I guess the answer is
>   yes since an evil ubdsrv can hang I/O and corrupt data in hopes of
>   triggering file system bugs.

Yes, I think so. UBD should be in the same position as NBD, which does
require capable(CAP_SYS_ADMIN).

> - Can multiple processes that don't trust each other use UBD at the same
>   time? I guess not since ubd_index_idr is global.

Only a single process can open /dev/ubdcN to communicate with the ubd
driver, see ubd_ch_open().

> - What about containers and namespaces? They currently have (write)
>   access to the same global ubd_index_idr.

As I understand it, containers/namespaces only need to see /dev/ubdbN, and
the usage model should be the same as the kernel loop driver: the global
ubd_index_idr plays the same role as loop's loop_index_idr. Or can you
explain in a bit more detail if I misunderstood your point?

> - Maybe there should be a struct ubd_device "owner" (struct
>   task_struct *) so only devices created by the current process can be
>   modified?

I guess it isn't needed, since /dev/ubdcN is opened by a single process.

>
> 2. io_uring_cmd design
>
> The rationale for the io_uring_cmd design is not explained in the cover
> letter. I think it's worth explaining the design. Here are my guesses:
>
> The same thing can be achieved with just file_operations and io_uring.
> ubdsrv could read I/O submissions with IORING_OP_READ and write I/O
> completions with IORING_OP_WRITE. That would require 2 sqes per
> roundtrip instead of 1, but the same number of io_uring_enter(2) calls
> since multiple sqes/cqes can be batched per syscall:
>
> - IORING_OP_READ, addr=(struct ubdsrv_io_desc*) (for submission)
> - IORING_OP_WRITE, addr=(struct ubdsrv_io_cmd*) (for completion)
>
> Both operations require a copy_to/from_user() to access the command
> metadata.

Yes, but it can't be as efficient as an io_uring command.
Two ops need two long code paths, one for read and one for write, and
those paths are meant for handling FS I/O, so reusing the complicated FS
I/O interface just to send a command via a char dev is really overkill.
NVMe passthrough has already shown better IOPS with io_uring command than
with the read/write interface. The extra copy_to/from_user() of the
command metadata may also fault, which can slow down the ubd server.

Also, for IORING_OP_READ, copy_to_user() has to be done in the ubq daemon
context. Even though that isn't a big deal by itself, it adds cost (CPU
utilization) in the ubq daemon context, or it sleeps to handle a page
fault, and that is really what should be avoided: we need to save more CPU
in that context for handling the userspace I/O logic.

>
> The io_uring_cmd approach works differently. The IORING_OP_URING_CMD sqe
> carries a 40-byte payload so it's possible to embed struct ubdsrv_io_cmd
> inside it. The struct ubdsrv_io_desc mmap gets around the fact that
> io_uring cqes contain no payload. The driver therefore needs a
> side-channel to transfer the request submission details to ubdsrv. I
> don't see much of a difference between IORING_OP_READ and the mmap
> approach though.

At the least there is the performance difference: ->uring_cmd() needs a
much shorter code path (a single simple io_uring command) than read/write,
without any copy of command data, without faults in copy_to/from_user(),
and without two long, complicated FS I/O code paths.

A single UBD_IO_COMMIT_AND_FETCH_REQ command can handle both fetching the
io request desc and committing the command result in one trip.

>
> It's not obvious to me how much more efficient the io_uring_cmd approach
> is, but taking fewer trips around the io_uring submission/completion
> code path is likely to be faster. Something similar can be done with
> file_operations ->ioctl(), but I guess the point of using io_uring is
> that is composes. If ubdsrv itself wants to use io_uring for other I/O
> activity (e.g. networking, disk I/O, etc) then it can do so and won't be
> stuck in a blocking ioctl() syscall.

ioctl can't be a choice, since we would lose the benefit of batched
handling.

>
> It would be nice if you could write 2 or 3 paragraphs explaining why the
> io_uring_cmd design and the struct ubdsrv_io_desc mmap was chosen.

Fine, I guess most of it is in the above inline comments?

>
> 3. Miscellaneous stuff
>
> - There isn't much in the way of memory ordering in the code. I worry a
>   little that changes to the struct ubdsrv_io_desc mmap may not be
>   visible at the expected time with respect to the io_uring cq ring.

I believe io_uring_cmd_done() together with the userspace cqe helper
implies enough of a memory barrier: once the cqe is observed in userspace,
any memory op done before io_uring_cmd_done() should be visible to the
user-side cqe handling code; otherwise it can be considered an io_uring
bug.

If it doesn't work this way, we can still avoid any barrier by moving the
setting of the io desc into the ubq daemon context (ubd_rq_task_work_fn),
but I really want to save CPU in that context, and I don't think it is
needed.

Thanks,
Ming
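
For readers following the memory-ordering point above, here is a minimal
userspace sketch of the release/acquire pairing the argument relies on:
the producer fills a descriptor in a shared area, then publishes a
completion by advancing a tail index with release semantics; the consumer
loads the tail with acquire semantics before touching the descriptor. It
is not code from the patch. Every name in it (demo_io_desc, demo_shared,
demo_post, demo_reap, DEMO_QUEUE_DEPTH) is hypothetical, the descriptor
layout is invented, and C11 atomics stand in for the kernel's barriers and
for the userspace cqe helper.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_QUEUE_DEPTH 4u /* hypothetical depth; must be a power of two */

/* Hypothetical stand-in for struct ubdsrv_io_desc (real layout: see the patch). */
struct demo_io_desc {
	uint32_t op;
	uint32_t nr_sectors;
	uint64_t start_sector;
};

/* Hypothetical shared state: a descriptor array (playing the role of the
 * mmap'd area) plus a completion ring that carries only a tag. */
struct demo_shared {
	struct demo_io_desc desc[DEMO_QUEUE_DEPTH];
	uint32_t cq_tags[DEMO_QUEUE_DEPTH];
	_Atomic uint32_t cq_tail; /* advanced by the producer */
	uint32_t cq_head;         /* private to the consumer */
};

/* Producer side: write the descriptor first, then publish the completion.
 * The release store orders both plain writes before the new tail value.
 * Assumes each tag is outstanding at most once, so the ring cannot overflow. */
static void demo_post(struct demo_shared *s, uint32_t tag, struct demo_io_desc d)
{
	assert(tag < DEMO_QUEUE_DEPTH);
	uint32_t tail = atomic_load_explicit(&s->cq_tail, memory_order_relaxed);

	s->desc[tag] = d;
	s->cq_tags[tail & (DEMO_QUEUE_DEPTH - 1)] = tag;
	atomic_store_explicit(&s->cq_tail, tail + 1, memory_order_release);
}

/* Consumer side: the acquire load pairs with the release store above, so a
 * consumer that sees the new tail also sees the descriptor contents. */
static bool demo_reap(struct demo_shared *s, struct demo_io_desc *out,
		      uint32_t *tag_out)
{
	uint32_t tail = atomic_load_explicit(&s->cq_tail, memory_order_acquire);

	if (s->cq_head == tail)
		return false; /* nothing published yet */

	uint32_t tag = s->cq_tags[s->cq_head & (DEMO_QUEUE_DEPTH - 1)];
	*out = s->desc[tag];
	*tag_out = tag;
	s->cq_head++;
	return true;
}
```

In this model no extra barrier is needed around the descriptor writes as
long as they happen before the release store of the tail, which is the
property being claimed for io_uring_cmd_done() and the cqe helper. Whether
the ubd driver actually gets that guarantee on every path is a question
for the thread, not something this sketch settles.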