Date: Fri, 15 Jul 2022 10:04:57 +0800
From: Ming Lei
To: Ziyang Zhang
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
    io-uring@vger.kernel.org, Jens Axboe, Gabriel Krisman Bertazi,
    Xiaoguang Wang
Subject: Re: [PATCH V5 1/2] ublk_drv: add io_uring based userspace block driver
References: <20220713140711.97356-1-ming.lei@redhat.com>
 <20220713140711.97356-2-ming.lei@redhat.com>

On Thu, Jul 14, 2022 at 09:23:40PM +0800, Ziyang Zhang wrote:
> On 2022/7/14 18:48, Ming Lei wrote:
> > On Thu, Jul 14, 2022 at 06:20:38PM +0800, Ziyang Zhang wrote:
> >> On 2022/7/13 22:07, Ming Lei wrote:
> >>> This is the driver part of the userspace block driver (ublk driver); the
> >>> other part is the userspace daemon (ublksrv) [1].
> >>>
> >>> The two parts communicate via io_uring's IORING_OP_URING_CMD with one
> >>> shared cmd buffer for storing io commands. The buffer is read-only for
> >>> ublksrv, each io command is indexed by the io request tag directly, and
> >>> it is written by the ublk driver.
> >>>
> >>> For example, when one READ io request is submitted to the ublk block
> >>> driver, the driver stores the io command into the cmd buffer first, then
> >>> completes one IORING_OP_URING_CMD for notifying ublksrv. The URING_CMD is
> >>> issued to the ublk driver beforehand by ublksrv for getting notification
> >>> of any new io request, and each URING_CMD is associated with one io
> >>> request by tag.
> >>>
> >>> After ublksrv gets the io command, it translates and handles the ublk io
> >>> request; for example, for the ublk-loop target, ublksrv translates the
> >>> request into the same request on another file or disk, like the kernel
> >>> loop block driver. In ublksrv's implementation, the io is still handled
> >>> by io_uring, and shares the same ring with the IORING_OP_URING_CMD
> >>> command. When the target io request is done, the same IORING_OP_URING_CMD
> >>> is issued to the ublk driver for both committing the io request result
> >>> and getting future notification of new io requests.
> >>>
> >>> Another thing done by the ublk driver is to copy data between the kernel
> >>> io request and ublksrv's io buffer:
> >>>
> >>> 1) before ublksrv handles a WRITE request, copy the request's data into
> >>> ublksrv's userspace io buffer, so that ublksrv can handle the write
> >>> request
> >>>
> >>> 2) after ublksrv handles a READ request, copy ublksrv's userspace io
> >>> buffer into this READ request, then the ublk driver can complete the
> >>> READ request
> >>>
> >>> Zero copy may be switched on once mm is ready to support it.
> >>>
> >>> The ublk driver doesn't handle any logic of the specific userspace
> >>> driver, so it is small/simple enough.
> >>>
> >>> [1] ublksrv
> >>>
> >>> https://github.com/ming1/ubdsrv
> >>>
> >>> Signed-off-by: Ming Lei
> >>> ---
> >>
> >>
> >> Hi, Ming
> >>
> >> I find that a big change from v4 to v5 is the simplification of locks.
> >>
> >> In v5 you remove ubq->abort_lock, and I want to ask why it is OK to
> >> remove it?
> >
> > Actually V4 and previous versions dealt with the issue in a too
> > complicated way.
> >
> >>
> >> If you have time, could you explain how ublk deals with the potential
> >> races between: 1) queue_rq 2) ublk_abort_queue 3) ublk_ctrl_stop_dev
> >> 4) ublk_rq_task_work?
> >> (Locking in ublk really confuses me...)
> >
> > One big change is the following code:
> >
> > __ublk_rq_task_work():
> >         bool task_exiting = current != ubq->ubq_daemon ||
> >                 (current->flags & PF_EXITING);
> >         ...
> >         if (unlikely(task_exiting)) {
> >                 blk_mq_end_request(req, BLK_STS_IOERR);
> >                 mod_delayed_work(system_wq, &ub->monitor_work, 0);
> >                 return;
> >         }
> >
> > Abort is always started after PF_EXITING is set, but if PF_EXITING is
> > set, __ublk_rq_task_work fails the request immediately, so io->flags
> > won't be touched and there is no race with abort. Also, PF_EXITING is
> > a per-task flag, it can only be set before calling __ublk_rq_task_work(),
> > and setting it is actually serialized with calling the task work func.
> >
> > In ublk_queue_rq(), we don't touch io->flags, so there isn't a race
> > with abort.
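To put that reasoning right next to the code, here is the same snippet
quoted above once more, with comments added that just restate the argument;
only the comments are new, the code itself is unchanged:

__ublk_rq_task_work():
	/*
	 * PF_EXITING is a per-task flag: it can only become set before this
	 * task work runs, and setting it is serialized with running the task
	 * work func, so the value observed here is stable for the whole call.
	 */
	bool task_exiting = current != ubq->ubq_daemon ||
		(current->flags & PF_EXITING);
	...
	if (unlikely(task_exiting)) {
		/*
		 * Fail the request without touching io->flags at all, so the
		 * abort path (which only starts after PF_EXITING is set) has
		 * nothing to race with on this io; also kick monitor_work
		 * immediately so the abort work can start.
		 */
		blk_mq_end_request(req, BLK_STS_IOERR);
		mod_delayed_work(system_wq, &ub->monitor_work, 0);
		return;
	}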
> >
> > Wrt. ublk_ctrl_stop_dev(), it isn't related to abort directly, and
> > if del_gendisk() waits for inflight IO, the abort work will be started
> > for making forward progress. After del_gendisk() returns, there can't
> > be any inflight io, so it is safe to cancel the other pending io
> > commands.
> >
>
> Thanks, Ming. I understand the aborting code now. And it looks good to me.
>
> Previously I thought that monitor_work and task_work might be scheduled
> at the same time while the task is exiting, and that blk_mq_end_request()
> could be called twice on the same tag.
>
> But I find there is a check on ublk_io's flag (UBLK_IO_FLAG_ACTIVE)
> in ublk_daemon_monitor_work(), and the ublk_io is aborted in task_work
> immediately (with UBLK_IO_FLAG_ACTIVE set, not cleared yet).
>
> So there is no chance to call a second blk_mq_end_request() on the same
> tag.
>
> Besides, for ublk_ios with UBLK_IO_FLAG_ACTIVE unset, the stop_work
> scheduled in monitor work will call ublk_cancel_queue(), which sends
> cqes with UBLK_IO_RES_ABORT.
>
> Put it together:
>
> When the daemon is PF_EXITING:
>
> 1) current ublk_io: aborted immediately in task_work

Precisely, it is just that the ublk io request is ended immediately, so
io->flags won't be touched.

> 2) UBLK_IO_FLAG_ACTIVE set: aborted in ublk_daemon_monitor_work

This part is important for making forward progress, that is why it has
to be done in a wq context.

> 3) UBLK_IO_FLAG_ACTIVE unset: send cqe with UBLK_IO_RES_ABORT

This is the 2nd stage of aborting, after the disk is deleted.

In short, it is a two-stage abort, and the idea is actually
straightforward; the whole flow is summarized below.

>
> Hope I'm correct this time. :)

Absolutely.
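Just to spell the two stages out in one place: the following is only a
hand-written summary of the flow discussed above, not code copied from the
patch, and per-io details plus error handling are left out:

	/*
	 * stage 1: monitor_work, runs in wq context so that del_gendisk()
	 * can make forward progress even while it waits for inflight IO
	 */
	for each ubq whose daemon is dying (PF_EXITING):
		ublk_abort_queue(ub, ubq);	/* end requests the exiting daemon
						 * will never complete; requests hitting
						 * __ublk_rq_task_work() are ended there
						 * directly */
		schedule stop_work

	/* stage 2: stop_work, after the disk is deleted */
	del_gendisk();			/* once this returns there is no inflight io */
	ublk_cancel_queue();		/* complete the still-pending io commands with
					 * UBLK_IO_RES_ABORT so the daemons can exit */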
>
> >>
> >>
> >> [...]
> >>
> >>> +
> >>> +/*
> >>> + * __ublk_fail_req() may be called from abort context or ->ubq_daemon
> >>> + * context during exiting, so lock is required.
> >>> + *
> >>> + * Also aborting may not be started yet, keep in mind that one failed
> >>> + * request may be issued by block layer again.
> >>> + */
> >>> +static void __ublk_fail_req(struct ublk_io *io, struct request *req)
> >>> +{
> >>> +	WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);
> >>> +
> >>> +	if (!(io->flags & UBLK_IO_FLAG_ABORTED)) {
> >>> +		io->flags |= UBLK_IO_FLAG_ABORTED;
> >>> +		blk_mq_end_request(req, BLK_STS_IOERR);
> >>> +	}
> >>> +}
> >>> +
> >>
> >> [...]
> >>
> >>> +
> >>> +/*
> >>> + * When ->ubq_daemon is exiting, either new request is ended immediately,
> >>> + * or any queued io command is drained, so it is safe to abort queue
> >>> + * lockless
> >>> + */
> >>> +static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
> >>> +{
> >>> +	int i;
> >>> +
> >>> +	if (!ublk_get_device(ub))
> >>> +		return;
> >>> +
> >>> +	for (i = 0; i < ubq->q_depth; i++) {
> >>> +		struct ublk_io *io = &ubq->ios[i];
> >>> +
> >>> +		if (!(io->flags & UBLK_IO_FLAG_ACTIVE)) {
> >>> +			struct request *rq;
> >>> +
> >>> +			/*
> >>> +			 * Either we fail the request or ublk_rq_task_work_fn
> >>> +			 * will do it
> >>> +			 */
> >>> +			rq = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], i);
> >>> +			if (rq)
> >>> +				__ublk_fail_req(io, rq);
> >>> +		}
> >>> +	}
> >>> +	ublk_put_device(ub);
> >>> +}
> >>> +
> >>
> >>
> >> Another problem:
> >>
> >> 1) comment of __ublk_fail_req(): "so lock is required"
> >
> > Yeah, now __ublk_fail_req is only called in abort context, with no race
> > with task work any more, so the lock isn't needed.
>
> Ok, I see.
>
> >
> >>
> >> 2) comment of ublk_abort_queue(): "so it is safe to abort queue lockless"
> >
> > This comment is updated in v5, and it is correct.
> >
> >>
> >> 3) ublk_abort_queue() calls __ublk_fail_req() on all ubqs.
> >
> > No, ublk_abort_queue() only aborts the passed ubq, so if one ubq daemon
> > is aborted, other ubqs can still handle IO during deleting the disk.
>
> Ok, I see.
>
> I think if one ubq daemon is killed (and the blk-mq requests related to it
> are aborted), stop_work should call del_gendisk(), and the other ubq
> daemons can still complete blk-mq requests, but no new blk-mq requests
> will be issued.
> After that, these unkilled ubq daemons will get UBLK_IO_RES_ABORT cqes
> and exit by themselves.

Yeah, you are right.


Thanks,
Ming