Message-ID: <5ac7c54a-e82b-9d28-9761-446a644e181c@kernel.dk>
Date: Mon, 9 May 2022 21:29:46 -0600
Subject: Re: [RFC PATCH] ubd: add io_uring based userspace block driver
From: Jens Axboe
To: Ming Lei
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
 io-uring@vger.kernel.org, Gabriel Krisman Bertazi, ZiyangZhang,
 Xiaoguang Wang
References: <20220509092312.254354-1-ming.lei@redhat.com>
 <9c833e12-fd09-fe7d-d4f2-e916c6ce4524@kernel.dk>
X-Mailing-List: linux-kernel@vger.kernel.org

On 5/9/22 8:58 PM, Ming Lei wrote:
> On Mon, May 09, 2022 at 10:09:10AM -0600, Jens Axboe wrote:
>> On 5/9/22 3:23 AM, Ming Lei wrote:
>>> This is the driver part of the userspace block driver (ubd driver);
>>> the other part is the userspace daemon (ubdsrv) [1].
>>>
>>> The two parts communicate via io_uring's IORING_OP_URING_CMD, with
>>> one shared cmd buffer for storing io commands. The buffer is
>>> read-only for ubdsrv, and each io command is written by the ubd
>>> driver and indexed directly by the io request tag.
>>>
>>> For example, when a READ request is submitted to the ubd block
>>> driver, the driver first stores the io command into the cmd buffer,
>>> then completes one IORING_OP_URING_CMD to notify ubdsrv. That
>>> URING_CMD was issued to the ubd driver beforehand by ubdsrv to get
>>> notification of any new io request, and each URING_CMD is associated
>>> with one io request by tag.
>>>
>>> After ubdsrv gets the io command, it translates and handles the ubd
>>> io request; for the ubd-loop target, for instance, ubdsrv translates
>>> the request into the same request on another file or disk, like the
>>> kernel loop block driver. In ubdsrv's implementation, the io is
>>> still handled by io_uring, sharing the same ring with the
>>> IORING_OP_URING_CMD command. When the target io request is done, the
>>> same IORING_OP_URING_CMD is issued to the ubd driver, both to commit
>>> the io request result and to get future notification of a new io
>>> request.
>>>
>>> Another thing done by the ubd driver is to copy data between the
>>> kernel io request and ubdsrv's io buffer:
>>>
>>> 1) before ubdsrv handles a WRITE request, copy the request's data
>>> into ubdsrv's userspace io buffer, so that ubdsrv can handle the
>>> write request
>>>
>>> 2) after ubdsrv handles a READ request, copy ubdsrv's userspace io
>>> buffer into the READ request, then the ubd driver can complete the
>>> READ request
>>>
>>> Zero copy may be switched on once mm is ready to support it.
>>>
>>> The ubd driver doesn't handle any logic of the specific userspace
>>> driver, so it should be small/simple enough.
>>
>> This is pretty interesting! Just one small thing I noticed, since you
>> want to make sure batching is Good Enough:
>>
>>> +static blk_status_t ubd_queue_rq(struct blk_mq_hw_ctx *hctx,
>>> +		const struct blk_mq_queue_data *bd)
>>> +{
>>> +	struct ubd_queue *ubq = hctx->driver_data;
>>> +	struct request *rq = bd->rq;
>>> +	struct ubd_io *io = &ubq->ios[rq->tag];
>>> +	struct ubd_rq_data *data = blk_mq_rq_to_pdu(rq);
>>> +	blk_status_t res;
>>> +
>>> +	if (ubq->aborted)
>>> +		return BLK_STS_IOERR;
>>> +
>>> +	/* this io cmd slot isn't active, so have to fail this io */
>>> +	if (WARN_ON_ONCE(!(io->flags & UBD_IO_FLAG_ACTIVE)))
>>> +		return BLK_STS_IOERR;
>>> +
>>> +	/* fill iod to slot in io cmd buffer */
>>> +	res = ubd_setup_iod(ubq, rq);
>>> +	if (res != BLK_STS_OK)
>>> +		return BLK_STS_IOERR;
>>> +
>>> +	blk_mq_start_request(bd->rq);
>>> +
>>> +	/* mark this cmd owned by ubdsrv */
>>> +	io->flags |= UBD_IO_FLAG_OWNED_BY_SRV;
>>> +
>>> +	/*
>>> +	 * clear ACTIVE since we are done with this sqe/cmd slot
>>> +	 *
>>> +	 * We can only accept io cmd in case of being not active.
>>> +	 */
>>> +	io->flags &= ~UBD_IO_FLAG_ACTIVE;
>>> +
>>> +	/*
>>> +	 * run data copy in task work context for WRITE, and complete
>>> +	 * io_uring cmd there too.
>>> +	 *
>>> +	 * This way should improve batching, meantime pinning pages in
>>> +	 * current context is pretty fast.
>>> +	 */
>>> +	task_work_add(ubq->ubq_daemon, &data->work, TWA_SIGNAL);
>>> +
>>> +	return BLK_STS_OK;
>>> +}
>>
>> It'd be better to use bd->last to indicate what kind of signaling you
>> need here. TWA_SIGNAL will force an immediate transition if the app
>> is running in userspace, which may not be what you want. Also see:
>>
>> https://git.kernel.dk/cgit/linux-block/commit/?h=for-5.19/io_uring&id=e788be95a57a9bebe446878ce9bf2750f6fe4974
>>
>> But regardless of the signaling needed, you don't need it except when
>> bd->last is true. You would need a commit_rqs() as well, but that's
>> trivial.
>
> Good point, I think we may add non-last requests via
> task_work_add(TWA_NONE), and only notify via TWA_SIGNAL_NO_IPI for
> bd->last.

Yep, I think that'd be the way to go.

>> More importantly, what prevents ubq->ubq_daemon from going away after
>> it's been assigned? I didn't look at the details, but is this relying
>> on io_uring being closed to cancel pending requests? That should
>> work, but
>
> I think there is no way to prevent ubq->ubq_daemon from being killed
> by 'kill -9', even though ubdsrv handles SIGTERM. That is why I
> suggest adding one service for removing all ubd devices before
> shutdown:
>
> https://github.com/ming1/ubdsrv/blob/devel/README

Right, you can't prevent a task from getting killed. But what you do
know is that file descriptors get closed when the task goes away, and
if you're integrated into io_uring in terms of how requests are
handled, then closing the io_uring ring descriptor should
wait-for/cancel pending requests. If done right, that could perhaps
exclude the issue of the stored task becoming invalid.

I haven't looked too closely at it all yet, so the above may not be a
viable approach. Or maybe it will... It's how io_uring itself does it.

> All the commands of UBD_IO_FETCH_REQ or UBD_IO_COMMIT_AND_FETCH_REQ
> have been submitted to the driver; I understand io_uring can't cancel
> them, please correct me if that is wrong.
Right, any storage IO can't get canceled once it has hit the block
layer or further down. So you end up waiting for them, which is fine
too.

> One solution I thought of is to use a watchdog to check if
> ubq->ubq_daemon is dead, then abort the whole device if it is. Or any
> suggestion?

You'd still need to ensure that it remains valid.

>> we need some way to ensure that ->ubq_daemon is always valid here.
>
> Good catch.
>
> get_task_struct() should be used for assigning ubq->ubq_daemon.

Yep, that could work too, as long as it doesn't introduce a weird
loopy dependency. Since it's just the task_struct itself, I think it'd
be fine and the simplest solution. This is a setup thing, and not
per-io?

-- 
Jens Axboe