Subject: Re: [RFC PATCH] ubd: add io_uring based userspace block driver
Date: Tue, 10 May 2022 12:22:47 +0800
From: Ziyang Zhang <ZiyangZhang@linux.alibaba.com>
To: Gabriel Krisman Bertazi, Ming Lei
Cc: Jens Axboe, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
    io-uring@vger.kernel.org, Xiaoguang Wang
References: <20220509092312.254354-1-ming.lei@redhat.com>
    <87h75ylp2s.fsf@collabora.com>
In-Reply-To: <87h75ylp2s.fsf@collabora.com>

On 2022/5/9 23:10, Gabriel Krisman Bertazi wrote:
> Ming Lei writes:
>
>> This is the driver part of userspace block driver(ubd driver), the other
>> part is userspace daemon part(ubdsrv)[1].
>>
>> The two parts communicate by io_uring's IORING_OP_URING_CMD with one
>> shared cmd buffer for storing io command, and the buffer is read only for
>> ubdsrv, each io command is indexed by io request tag directly, and
>> is written by ubd driver.
>>
>> For example, when one READ io request is submitted to ubd block driver, ubd
>> driver stores the io command into cmd buffer first, then completes one
>> IORING_OP_URING_CMD for notifying ubdsrv, and the URING_CMD is issued to
>> ubd driver beforehand by ubdsrv for getting notification of any new io request,
>> and each URING_CMD is associated with one io request by tag.
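To make the flow above concrete, here is a rough sketch of what the ubdsrv
side of this handshake can look like for one queue. It is illustrative only:
ubd_arm_fetch/ubd_reap_one/cdev_fd/io_buf/handle_io are made-up names, not
taken from ubdsrv; the sqe->cmd_op and sqe->cmd fields follow the io_uring
passthrough uapi as it eventually landed (IORING_SETUP_SQE128 required, and a
liburing new enough to understand it), so details may differ from this RFC;
the addr field of ubdsrv_io_cmd is inferred from the driver's "userspace
buffer address from io cmd" comment. Error handling, io_uring_submit()
batching and cqe->res checking are omitted.

#include <string.h>
#include <stdint.h>
#include <liburing.h>
#include <linux/ubd_cmd.h>	/* the uapi header added by this patch */

/* Arm one FETCH_REQ for one tag; the tag is echoed back in cqe->user_data. */
static void ubd_arm_fetch(struct io_uring *ring, int cdev_fd,
			  int q_id, int tag, void *io_buf)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct ubdsrv_io_cmd *cmd;

	memset(sqe, 0, 2 * sizeof(*sqe));	/* SQE128 slot is 128 bytes */
	cmd = (struct ubdsrv_io_cmd *)sqe->cmd;

	sqe->opcode = IORING_OP_URING_CMD;
	sqe->fd = cdev_fd;			/* the ubd char device */
	sqe->cmd_op = UBD_IO_FETCH_REQ;
	sqe->user_data = tag;

	cmd->q_id = q_id;
	cmd->tag = tag;
	/* buffer the driver copies request data to/from for this tag (assumed field) */
	cmd->addr = (__u64)(uintptr_t)io_buf;
}

/* One round of the daemon's event loop: each completed URING_CMD is one new io. */
static int ubd_reap_one(struct io_uring *ring,
			const struct ubdsrv_io_desc *cmd_buf,
			void (*handle_io)(const struct ubdsrv_io_desc *iod))
{
	struct io_uring_cqe *cqe;
	unsigned int tag;

	if (io_uring_wait_cqe(ring, &cqe))
		return -1;
	tag = (unsigned int)cqe->user_data;	/* we stored the tag there */
	/* cmd_buf is the read-only descriptor array mmap()ed from the char device */
	handle_io(&cmd_buf[tag]);		/* target-specific, e.g. ubd-loop */
	io_uring_cqe_seen(ring, cqe);
	/* next step: re-arm this tag with UBD_IO_COMMIT_AND_FETCH_REQ,
	 * carrying the handling result back to the driver */
	return (int)tag;
}

Once every tag of every queue has an outstanding FETCH_REQ, the daemon just
loops on the CQ ring; the driver's ubd_queue_rq() below fills the descriptor
slot for the tag and completes the matching URING_CMD to wake it up.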
>>
>> After ubdsrv gets the io command, it translates and handles the ubd io
>> request, such as, for the ubd-loop target, ubdsrv translates the request
>> into same request on another file or disk, like the kernel loop block
>> driver. In ubdsrv's implementation, the io is still handled by io_uring,
>> and share same ring with IORING_OP_URING_CMD command. When the target io
>> request is done, the same IORING_OP_URING_CMD is issued to ubd driver for
>> both committing io request result and getting future notification of new
>> io request.
>>
>> Another thing done by ubd driver is to copy data between kernel io
>> request and ubdsrv's io buffer:
>>
>> 1) before ubsrv handles WRITE request, copy the request's data into
>> ubdsrv's userspace io buffer, so that ubdsrv can handle the write
>> request
>>
>> 2) after ubsrv handles READ request, copy ubdsrv's userspace io buffer
>> into this READ request, then ubd driver can complete the READ request
>>
>> Zero copy may be switched if mm is ready to support it.
>>
>> ubd driver doesn't handle any logic of the specific user space driver,
>> so it should be small/simple enough.
>>
>> [1] ubdsrv
>> https://github.com/ming1/ubdsrv/commits/devel
>>
>> Signed-off-by: Ming Lei
>> ---
>>  drivers/block/Kconfig        |    7 +
>>  drivers/block/Makefile       |    2 +
>>  drivers/block/ubd_drv.c      | 1193 ++++++++++++++++++++++++++++++++++
>>  include/uapi/linux/ubd_cmd.h |  167 +++
>>  4 files changed, 1369 insertions(+)
>>  create mode 100644 drivers/block/ubd_drv.c
>>  create mode 100644 include/uapi/linux/ubd_cmd.h
>>
>> diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
>> index fdb81f2794cd..3893ccd82e8a 100644
>> --- a/drivers/block/Kconfig
>> +++ b/drivers/block/Kconfig
>> @@ -408,6 +408,13 @@ config BLK_DEV_RBD
>>
>>  	  If unsure, say N.
>>
>> +config BLK_DEV_USER_BLK_DRV
>> +	bool "Userspace block driver"
>> +	select IO_URING
>> +	default y
>> +	help
>> +	  io uring based userspace block driver.
>> +
>>  source "drivers/block/rnbd/Kconfig"
>>
>>  endif # BLK_DEV
>> diff --git a/drivers/block/Makefile b/drivers/block/Makefile
>> index 934a9c7c3a7c..effff34babd9 100644
>> --- a/drivers/block/Makefile
>> +++ b/drivers/block/Makefile
>> @@ -39,4 +39,6 @@ obj-$(CONFIG_BLK_DEV_RNBD)	+= rnbd/
>>
>>  obj-$(CONFIG_BLK_DEV_NULL_BLK)	+= null_blk/
>>
>> +obj-$(CONFIG_BLK_DEV_USER_BLK_DRV)	+= ubd_drv.o
>> +
>>  swim_mod-y	:= swim.o swim_asm.o
>> diff --git a/drivers/block/ubd_drv.c b/drivers/block/ubd_drv.c
>> new file mode 100644
>> index 000000000000..9eba7ee17aff
>> --- /dev/null
>> +++ b/drivers/block/ubd_drv.c
>
> Hi Ming,
>
> Thank you very much for sending it.
>
> Given the interest in working on it from a bunch of people, I'd
> appreciate if we consider putting the code into drivers/staging as soon
> as possible, and let us work on the same code base to consolidate the
> protocol there, instead of taking too long to push to drivers/block.

Approve!

>
> One initial concern is that UBD is already the name for storage device
> exposed by User Mode Linux. A rename would avoid future confusion.
>
>> @@ -0,0 +1,1193 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +/*
>> + * Userspace block device - block device which IO is handled from userspace
>> + *
>> + * Take full use of io_uring passthrough command for communicating with
>> + * ubd userspace daemon(ubdsrvd) for handling basic IO request.
>> + * >> + * Copyright 2022 Ming Lei >> + * >> + * (part of code stolen from loop.c) >> + */ >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> + >> +#define UBD_MINORS (1U << MINORBITS) >> + >> +struct ubd_abort_work { >> + struct ubd_device *ub; >> + unsigned int q_id; >> + struct work_struct work; >> +}; >> + >> +struct ubd_rq_data { >> + struct callback_head work; >> +}; >> + >> +/* io cmd is active: sqe cmd is received, and its cqe isn't done */ >> +#define UBD_IO_FLAG_ACTIVE 0x01 >> + >> +/* >> + * FETCH io cmd is completed via cqe, and the io cmd is being handled by >> + * ubdsrv, and not committed yet >> + */ >> +#define UBD_IO_FLAG_OWNED_BY_SRV 0x02 >> + >> +struct ubd_io { >> + /* userspace buffer address from io cmd */ >> + __u64 addr; >> + unsigned int flags; >> + unsigned int res; >> + >> + struct io_uring_cmd *cmd; >> +}; >> + >> +struct ubd_queue { >> + int q_id; >> + int q_depth; >> + >> + struct task_struct *ubq_daemon; >> + char *io_cmd_buf; >> + >> + unsigned long io_addr; /* mapped vm address */ >> + unsigned int max_io_sz; >> + bool aborted; >> + struct ubd_io ios[0]; >> +}; >> + >> +struct ubd_device { >> + struct gendisk *ub_disk; >> + struct request_queue *ub_queue; >> + >> + char *__queues; >> + >> + unsigned short queue_size; >> + unsigned short bs_shift; >> + unsigned int nr_aborted_queues; >> + struct ubdsrv_ctrl_dev_info dev_info; >> + >> + struct blk_mq_tag_set tag_set; >> + >> + struct cdev cdev; >> + struct device cdev_dev; >> + >> + atomic_t ch_open_cnt; >> + int ub_number; >> + >> + struct mutex mutex; >> +}; >> + >> +static dev_t ubd_chr_devt; >> +static struct class *ubd_chr_class; >> + >> +static DEFINE_IDR(ubd_index_idr); >> +static DEFINE_MUTEX(ubd_ctl_mutex); >> + >> +static struct miscdevice ubd_misc; >> + >> +static inline struct ubd_queue *ubd_get_queue(struct ubd_device *dev, int qid) >> +{ >> + return (struct ubd_queue *)&(dev->__queues[qid * dev->queue_size]); >> +} >> + >> +static inline bool ubd_rq_need_copy(struct request *rq) >> +{ >> + return rq->bio && bio_has_data(rq->bio); >> +} >> + >> +static inline struct ubdsrv_io_desc *ubd_get_iod(struct ubd_queue *ubq, int tag) >> +{ >> + return (struct ubdsrv_io_desc *) >> + &(ubq->io_cmd_buf[tag * sizeof(struct ubdsrv_io_desc)]); >> +} >> + >> +static inline char *ubd_queue_cmd_buf(struct ubd_device *ub, int q_id) >> +{ >> + return ubd_get_queue(ub, q_id)->io_cmd_buf; >> +} >> + >> +static inline int ubd_queue_cmd_buf_size(struct ubd_device *ub, int q_id) >> +{ >> + struct ubd_queue *ubq = ubd_get_queue(ub, q_id); >> + >> + return round_up(ubq->q_depth * sizeof(struct ubdsrv_io_desc), PAGE_SIZE); >> +} >> + >> +static int ubd_open(struct block_device *bdev, fmode_t mode) >> +{ >> + return 0; >> +} >> + >> +static void ubd_release(struct gendisk *disk, fmode_t mode) >> +{ >> +} >> + >> +static const struct block_device_operations ub_fops = { >> + .owner = THIS_MODULE, >> + .open = ubd_open, >> + .release = ubd_release, >> +}; >> + >> +#define UBD_MAX_PIN_PAGES 32 >> + >> +static void ubd_release_pages(struct ubd_device *ub, struct page **pages, >> + int nr_pages) >> +{ >> + int i; >> + >> + for (i = 0; i < 
nr_pages; i++) >> + put_page(pages[i]); >> +} >> + >> +static int ubd_pin_user_pages(struct ubd_device *ub, u64 start_vm, >> + struct page **pages, unsigned int nr_pages, bool to_rq) >> +{ >> + unsigned int gup_flags = to_rq ? 0 : FOLL_WRITE; >> + >> + return get_user_pages_fast(start_vm, nr_pages, gup_flags, pages); >> +} >> + >> +static inline unsigned ubd_copy_bv(struct bio_vec *bv, void **bv_addr, >> + void *pg_addr, unsigned int *pg_off, >> + unsigned int *pg_len, bool to_bv) >> +{ >> + unsigned len = min_t(unsigned, bv->bv_len, *pg_len); >> + >> + if (*bv_addr == NULL) >> + *bv_addr = kmap_local_page(bv->bv_page); >> + >> + if (to_bv) >> + memcpy(*bv_addr + bv->bv_offset, pg_addr + *pg_off, len); >> + else >> + memcpy(pg_addr + *pg_off, *bv_addr + bv->bv_offset, len); >> + >> + bv->bv_offset += len; >> + bv->bv_len -= len; >> + *pg_off += len; >> + *pg_len -= len; >> + >> + if (!bv->bv_len) { >> + kunmap_local(*bv_addr); >> + *bv_addr = NULL; >> + } >> + >> + return len; >> +} >> + >> +/* copy rq pages to ubdsrv vm address pointed by io->addr */ >> +static int ubd_copy_pages(struct ubd_device *ub, struct request *rq) >> +{ >> + struct ubd_queue *ubq = rq->mq_hctx->driver_data; >> + struct ubd_io *io = &ubq->ios[rq->tag]; >> + struct page *pgs[UBD_MAX_PIN_PAGES]; >> + const bool to_rq = !op_is_write(rq->cmd_flags); >> + struct req_iterator req_iter; >> + struct bio_vec bv; >> + unsigned long start = io->addr, left = rq->__data_len; >> + unsigned int idx = 0, pg_len = 0, pg_off = 0; >> + int nr_pin = 0; >> + void *pg_addr = NULL; >> + struct page *curr = NULL; >> + >> + rq_for_each_segment(bv, rq, req_iter) { >> + unsigned len, bv_off = bv.bv_offset, bv_len = bv.bv_len; >> + void *bv_addr = NULL; >> + >> +refill: >> + if (pg_len == 0) { >> + unsigned int off = 0; >> + >> + if (pg_addr) { >> + kunmap_local(pg_addr); >> + if (!to_rq) >> + set_page_dirty_lock(curr); >> + pg_addr = NULL; >> + } >> + >> + /* refill pages */ >> + if (idx >= nr_pin) { >> + unsigned int max_pages; >> + >> + ubd_release_pages(ub, pgs, nr_pin); >> + >> + off = start & (PAGE_SIZE - 1); >> + max_pages = round_up(off + left, PAGE_SIZE); >> + nr_pin = min_t(unsigned, UBD_MAX_PIN_PAGES, max_pages); >> + nr_pin = ubd_pin_user_pages(ub, start, pgs, >> + nr_pin, to_rq); >> + if (nr_pin <= 0) >> + return -EINVAL; >> + idx = 0; >> + } >> + pg_off = off; >> + pg_len = min(PAGE_SIZE - off, left); >> + off = 0; >> + curr = pgs[idx++]; >> + pg_addr = kmap_local_page(curr); >> + } >> + >> + len = ubd_copy_bv(&bv, &bv_addr, pg_addr, &pg_off, &pg_len, >> + to_rq); >> + /* either one of the two has been consumed */ >> + WARN_ON_ONCE(bv.bv_len && pg_len); >> + start += len; >> + left -= len; >> + >> + /* overflow */ >> + WARN_ON_ONCE(left > rq->__data_len); >> + WARN_ON_ONCE(bv.bv_len > bv_len); >> + if (bv.bv_len) >> + goto refill; >> + >> + bv.bv_len = bv_len; >> + bv.bv_offset = bv_off; >> + } >> + if (pg_addr) { >> + kunmap_local(pg_addr); >> + if (!to_rq) >> + set_page_dirty_lock(curr); >> + } >> + ubd_release_pages(ub, pgs, nr_pin); >> + >> + WARN_ON_ONCE(left != 0); >> + >> + return 0; >> +} >> + >> +#define UBD_REMAP_BATCH 32 >> + >> +static int ubd_map_io(struct ubd_device *ub, struct request *req) >> +{ >> + /* >> + * no zero copy, we delay copy WRITE request data into ubdsrv >> + * context via task_work_add and the big benefit is that pinning >> + * pages in current context is pretty fast, see ubd_pin_user_pages >> + */ >> + if (!op_is_write(req->cmd_flags) || !ubd_rq_need_copy(req)) >> + return 0; >> + >> + /* convert 
to data copy in current context */ >> + return ubd_copy_pages(ub, req); >> +} >> + >> +static int ubd_unmap_io(struct request *req) >> +{ >> + struct ubd_device *ub = req->q->queuedata; >> + >> + if (!op_is_write(req->cmd_flags) && ubd_rq_need_copy(req)) >> + return ubd_copy_pages(ub, req); >> + return 0; >> +} >> + >> +static inline unsigned int ubd_req_build_flags(struct request *req) >> +{ >> + unsigned flags = 0; >> + >> + if (req->cmd_flags & REQ_FAILFAST_DEV) >> + flags |= UBD_IO_F_FAILFAST_DEV; >> + >> + if (req->cmd_flags & REQ_FAILFAST_TRANSPORT) >> + flags |= UBD_IO_F_FAILFAST_TRANSPORT; >> + >> + if (req->cmd_flags & REQ_FAILFAST_DRIVER) >> + flags |= UBD_IO_F_FAILFAST_DRIVER; >> + >> + if (req->cmd_flags & REQ_META) >> + flags |= UBD_IO_F_META; >> + >> + if (req->cmd_flags & REQ_INTEGRITY) >> + flags |= UBD_IO_F_INTEGRITY; >> + >> + if (req->cmd_flags & REQ_FUA) >> + flags |= UBD_IO_F_FUA; >> + >> + if (req->cmd_flags & REQ_PREFLUSH) >> + flags |= UBD_IO_F_PREFLUSH; >> + >> + if (req->cmd_flags & REQ_NOUNMAP) >> + flags |= UBD_IO_F_NOUNMAP; >> + >> + if (req->cmd_flags & REQ_SWAP) >> + flags |= UBD_IO_F_SWAP; >> + >> + return flags; >> +} >> + >> +static int ubd_setup_iod(struct ubd_queue *ubq, struct request *req) >> +{ >> + struct ubdsrv_io_desc *iod = ubd_get_iod(ubq, req->tag); >> + struct ubd_io *io = &ubq->ios[req->tag]; >> + u32 ubd_op; >> + >> + switch (req_op(req)) { >> + case REQ_OP_READ: >> + ubd_op = UBD_IO_OP_READ; >> + break; >> + case REQ_OP_WRITE: >> + ubd_op = UBD_IO_OP_WRITE; >> + break; >> + case REQ_OP_FLUSH: >> + ubd_op = UBD_IO_OP_FLUSH; >> + break; >> + case REQ_OP_DISCARD: >> + ubd_op = UBD_IO_OP_DISCARD; >> + break; >> + case REQ_OP_WRITE_ZEROES: >> + ubd_op = UBD_IO_OP_WRITE_ZEROES; >> + break; >> + default: >> + return BLK_STS_IOERR; >> + } >> + >> + /* need to translate since kernel may change */ >> + iod->op_flags = ubd_op | ubd_req_build_flags(req); >> + iod->tag_blocks = req->tag | (blk_rq_sectors(req) << 12); >> + iod->start_block = blk_rq_pos(req); >> + iod->addr = io->addr; >> + >> + return BLK_STS_OK; >> +} >> + >> +static blk_status_t ubd_queue_rq(struct blk_mq_hw_ctx *hctx, >> + const struct blk_mq_queue_data *bd) >> +{ >> + struct ubd_queue *ubq = hctx->driver_data; >> + struct request *rq = bd->rq; >> + struct ubd_io *io = &ubq->ios[rq->tag]; >> + struct ubd_rq_data *data = blk_mq_rq_to_pdu(rq); >> + blk_status_t res; >> + >> + if (ubq->aborted) >> + return BLK_STS_IOERR; >> + >> + /* this io cmd slot isn't active, so have to fail this io */ >> + if (WARN_ON_ONCE(!(io->flags & UBD_IO_FLAG_ACTIVE))) >> + return BLK_STS_IOERR; >> + >> + /* fill iod to slot in io cmd buffer */ >> + res = ubd_setup_iod(ubq, rq); >> + if (res != BLK_STS_OK) >> + return BLK_STS_IOERR; >> + >> + blk_mq_start_request(bd->rq); >> + >> + /* mark this cmd owned by ubdsrv */ >> + io->flags |= UBD_IO_FLAG_OWNED_BY_SRV; >> + >> + /* >> + * clear ACTIVE since we are done with this sqe/cmd slot >> + * >> + * We can only accept io cmd in case of being not active. >> + */ >> + io->flags &= ~UBD_IO_FLAG_ACTIVE; >> + >> + /* >> + * run data copy in task work context for WRITE, and complete io_uring >> + * cmd there too. >> + * >> + * This way should improve batching, meantime pinning pages in current >> + * context is pretty fast. 
>> + */ >> + task_work_add(ubq->ubq_daemon, &data->work, TWA_SIGNAL); >> + >> + return BLK_STS_OK; >> +} >> + >> +static void ubd_complete_rq(struct request *req) >> +{ >> + struct ubd_queue *ubq = req->mq_hctx->driver_data; >> + struct ubd_io *io = &ubq->ios[req->tag]; >> + >> + /* for READ request, writing data in iod->addr to rq buffers */ >> + ubd_unmap_io(req); >> + >> + blk_mq_end_request(req, io->res); >> +} >> + >> +static void ubd_rq_task_work_fn(struct callback_head *work) >> +{ >> + struct ubd_rq_data *data = container_of(work, >> + struct ubd_rq_data, work); >> + struct request *req = blk_mq_rq_from_pdu(data); >> + struct ubd_queue *ubq = req->mq_hctx->driver_data; >> + struct ubd_io *io = &ubq->ios[req->tag]; >> + struct ubd_device *ub = req->q->queuedata; >> + int ret = UBD_IO_RES_OK; >> + >> + if (ubd_map_io(ub, req)) >> + ret = UBD_IO_RES_DATA_BAD; >> + >> + pr_devel("%s: complete: cmd op %d, tag %d ret %x io_flags %x, addr %llx\n", >> + __func__, io->cmd->cmd_op, req->tag, ret, io->flags, >> + ubd_get_iod(ubq, req->tag)->addr); >> + /* tell ubdsrv one io request is coming */ >> + io_uring_cmd_done(io->cmd, ret, 0); >> +} >> + >> +static int ubd_init_hctx(struct blk_mq_hw_ctx *hctx, void *driver_data, >> + unsigned int hctx_idx) >> +{ >> + struct ubd_device *ub = hctx->queue->queuedata; >> + struct ubd_queue *ubq = ubd_get_queue(ub, hctx->queue_num); >> + >> + hctx->driver_data = ubq; >> + return 0; >> +} >> + >> +static int ubd_init_rq(struct blk_mq_tag_set *set, struct request *req, >> + unsigned int hctx_idx, unsigned int numa_node) >> +{ >> + struct ubd_rq_data *data = blk_mq_rq_to_pdu(req); >> + >> + init_task_work(&data->work, ubd_rq_task_work_fn); >> + >> + return 0; >> +} >> + >> +static const struct blk_mq_ops ubd_mq_ops = { >> + .queue_rq = ubd_queue_rq, >> + .init_hctx = ubd_init_hctx, >> + .init_request = ubd_init_rq, >> +}; >> + >> +static int ubd_ch_open(struct inode *inode, struct file *filp) >> +{ >> + struct ubd_device *ub = container_of(inode->i_cdev, >> + struct ubd_device, cdev); >> + >> + if (atomic_cmpxchg(&ub->ch_open_cnt, 0, 1) == 0) { >> + filp->private_data = ub; >> + return 0; >> + } >> + return -EBUSY; >> +} >> + >> +static int ubd_ch_release(struct inode *inode, struct file *filp) >> +{ >> + struct ubd_device *ub = filp->private_data; >> + >> + while (atomic_cmpxchg(&ub->ch_open_cnt, 1, 0) != 1) >> + cpu_relax(); >> + >> + filp->private_data = NULL; >> + return 0; >> +} >> + >> +/* map pre-allocated per-queue cmd buffer to ubdsrv daemon */ >> +static int ubd_ch_mmap(struct file *filp, struct vm_area_struct *vma) >> +{ >> + struct ubd_device *ub = filp->private_data; >> + size_t sz = vma->vm_end - vma->vm_start; >> + unsigned max_sz = UBD_MAX_QUEUE_DEPTH * sizeof(struct ubdsrv_io_desc); >> + unsigned long pfn, end; >> + int q_id; >> + >> + end = UBDSRV_CMD_BUF_OFFSET + ub->dev_info.nr_hw_queues * max_sz; >> + if (vma->vm_pgoff < UBDSRV_CMD_BUF_OFFSET || vma->vm_pgoff >= end) >> + return -EINVAL; >> + >> + q_id = (vma->vm_pgoff - UBDSRV_CMD_BUF_OFFSET) / max_sz; >> + >> + if (sz != ubd_queue_cmd_buf_size(ub, q_id)) >> + return -EINVAL; >> + >> + pfn = virt_to_phys(ubd_queue_cmd_buf(ub, q_id)) >> PAGE_SHIFT; >> + return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot); >> +} >> + >> +static void ubd_commit_completion(struct ubd_device *ub, >> + struct ubdsrv_io_cmd *ub_cmd) >> +{ >> + u32 qid = ub_cmd->q_id, tag = ub_cmd->tag; >> + struct ubd_queue *ubq = ubd_get_queue(ub, qid); >> + struct ubd_io *io = &ubq->ios[tag]; >> + struct 
request *req;
>>
>> +	/* now this cmd slot is owned by nbd driver */
>> +	io->flags &= ~UBD_IO_FLAG_OWNED_BY_SRV;
>> +	io->res = ub_cmd->result;
>> +
>> +	/* find the io request and complete */
>> +	req = blk_mq_tag_to_rq(ub->tag_set.tags[qid], ub_cmd->tag);
>> +
>> +	if (req && likely(!blk_should_fake_timeout(req->q)))
>> +		ubd_complete_rq(req);
>> +}
>> +
>> +/* has to be called disk is dead or frozen */
>> +static int ubd_abort_queue(struct ubd_device *ub, int qid)
>> +{
>> +	int ret = UBD_IO_RES_ABORT;
>> +	struct ubd_queue *q = ubd_get_queue(ub, qid);
>> +	int i;
>> +
>> +	for (i = 0; i < q->q_depth; i++) {
>> +		struct ubd_io *io = &q->ios[i];
>> +
>> +		if (io->flags & UBD_IO_FLAG_ACTIVE) {
>> +			io->flags &= ~UBD_IO_FLAG_ACTIVE;
>> +			io_uring_cmd_done(io->cmd, ret, 0);
>> +		}
>> +	}
>> +	q->ubq_daemon = NULL;
>> +	return 0;
>> +}
>> +
>> +static void ubd_abort_work_fn(struct work_struct *work)
>> +{
>> +	struct ubd_abort_work *abort_work =
>> +		container_of(work, struct ubd_abort_work, work);
>> +	struct ubd_device *ub = abort_work->ub;
>> +	struct ubd_queue *ubq = ubd_get_queue(ub, abort_work->q_id);
>> +
>> +	blk_mq_freeze_queue(ub->ub_queue);
>> +	mutex_lock(&ub->mutex);
>> +	if (!ubq->aborted) {
>> +		ubq->aborted = true;
>> +		ubd_abort_queue(ub, abort_work->q_id);
>> +		ub->nr_aborted_queues++;
>> +	}
>> +	mutex_unlock(&ub->mutex);
>> +	blk_mq_unfreeze_queue(ub->ub_queue);
>> +
>> +	mutex_lock(&ub->mutex);
>> +	if (ub->nr_aborted_queues == ub->dev_info.nr_hw_queues) {
>> +		if (disk_live(ub->ub_disk))
>> +			del_gendisk(ub->ub_disk);
>> +	}
>> +	mutex_unlock(&ub->mutex);
>> +
>> +	kfree(abort_work);
>> +}
>> +
>> +static int ubd_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
>> +{
>> +	struct ubdsrv_io_cmd *ub_cmd = (struct ubdsrv_io_cmd *)cmd->cmd;
>> +	struct ubd_device *ub = cmd->file->private_data;
>> +	struct ubd_queue *ubq;
>> +	struct ubd_io *io;
>> +	u32 cmd_op = cmd->cmd_op;
>> +	unsigned tag = ub_cmd->tag;
>> +	int ret = UBD_IO_RES_INVALID_SQE;
>> +
>> +	pr_devel("%s: receieved: cmd op %d, tag %d, queue %d\n",
>> +			__func__, cmd->cmd_op, tag, ub_cmd->q_id);
>> +
>> +	if (!(issue_flags & IO_URING_F_SQE128))
>> +		goto out;
>> +
>> +	if (cmd_op == UBD_IO_ABORT_QUEUE) {
>> +		struct ubd_abort_work *work = kzalloc(sizeof(*work),
>> +				GFP_KERNEL);
>> +		if (!work)
>> +			goto out;
>> +
>> +		INIT_WORK(&work->work, ubd_abort_work_fn);
>> +		work->ub = ub;
>> +		work->q_id = ub_cmd->q_id;
>> +
>> +		schedule_work(&work->work);
>> +		ret = UBD_IO_RES_OK;
>> +		goto out_done;
>> +	}
>> +
>> +	ubq = ubd_get_queue(ub, ub_cmd->q_id);
>> +	if (WARN_ON_ONCE(tag >= ubq->q_depth))
>> +		goto out;
>> +
>> +	io = &ubq->ios[tag];
>> +
>> +	/* there is pending io cmd, something must be wrong */
>> +	if (io->flags & UBD_IO_FLAG_ACTIVE) {
>> +		ret = UBD_IO_RES_BUSY;
>> +		goto out;
>> +	}
>> +
>> +	switch (cmd_op) {
>> +	case UBD_IO_FETCH_REQ:
>> +		/* FETCH_REQ is only issued when starting device */
>> +		mutex_lock(&ub->mutex);
>> +		if (!ubq->ubq_daemon)
>> +			ubq->ubq_daemon = current;
>
> I find it slightly weird to set ubq_daemon per-request. It would cause
> weird corruptions if a buggy thread sends a FETCH_REQ to the wrong q_id
> after another request in-flight has already set ubq_daemon. I was also
> working on MQ and I was thinking of dropping the q_id field from the
> io_uring command, implying it from the fd that was opened. Then, I let
> each thread reopen the char device to configure a new queue:
>
> fd = open("/dev/ubdc0", O_RDWR); // Connects to a new queue
>
> fd is closed with fork/exec.
>
>> +static int ubd_ctrl_start_dev(struct ubd_device *ub, struct io_uring_cmd *cmd)
>> +{
>> +	struct ubdsrv_ctrl_dev_info *info = (struct ubdsrv_ctrl_dev_info *)cmd->cmd;
>> +	int ret = -EINVAL;
>> +	unsigned long end = jiffies + 3 * HZ;
>> +
>> +	if (info->ubdsrv_pid <= 0)
>> +		return -1;
>> +
>> +	mutex_lock(&ub->mutex);
>> +
>> +	ub->dev_info.ubdsrv_pid = info->ubdsrv_pid;
>> +	if (disk_live(ub->ub_disk))
>> +		goto unlock;
>> +	while (time_before(jiffies, end)) {
>> +		if (ubd_io_ready(ub)) {
>> +			ret = 0;
>> +			break;
>> +		}
>> +		msleep(100);
>> +	}
>
> This timer is quite weird, in my opinion. Shouldn't we fail start_dev
> and let userspace retry the command at a later time?
>
>> + unlock:
>> +	mutex_unlock(&ub->mutex);
>> +	pr_devel("%s: device io ready %d\n", __func__, !ret);
>> +
>> +	if (ret == 0)
>> +		ret = add_disk(ub->ub_disk);
>> +
>> +	return ret;
>> +}
>> +
>> +static inline void ubd_dump(struct io_uring_cmd *cmd)
>> +{
>> +#ifdef DEBUG
>> +	struct ubdsrv_ctrl_dev_info *info =
>> +		(struct ubdsrv_ctrl_dev_info *)cmd->cmd;
>> +
>> +	printk("%s: cmd_op %x, dev id %d flags %llx\n", __func__,
>> +			cmd->cmd_op, info->dev_id, info->flags);
>> +
>> +	printk("\t nr_hw_queues %d queue_depth %d block size %d dev_capacity %lld\n",
>> +			info->nr_hw_queues, info->queue_depth,
>> +			info->block_size, info->dev_blocks);
>> +#endif
>> +}
>
> Maybe pr_debug or tracepoint?
>
>> +MODULE_AUTHOR("Ming Lei ");
>> +MODULE_LICENSE("GPL");
>> diff --git a/include/uapi/linux/ubd_cmd.h b/include/uapi/linux/ubd_cmd.h
>> new file mode 100644
>> index 000000000000..8ecea6aa9cfe
>> --- /dev/null
>> +++ b/include/uapi/linux/ubd_cmd.h
>> @@ -0,0 +1,167 @@
>> +#ifndef USER_BLK_DRV_CMD_INC_H
>> +#define USER_BLK_DRV_CMD_INC_H
>> +
>> +/* ubd server command definition */
>> +
>> +/* CMD result code */
>> +#define UBD_CTRL_CMD_RES_OK		0
>> +#define UBD_CTRL_CMD_RES_FAILED		-1
>> +
>> +/*
>> + * Admin commands, issued by ubd server, and handled by ubd driver.
>> + */
>> +#define UBD_CMD_SET_DEV_INFO	0x01
>> +#define UBD_CMD_GET_DEV_INFO	0x02
>> +#define UBD_CMD_ADD_DEV		0x04
>> +#define UBD_CMD_DEL_DEV		0x05
>> +#define UBD_CMD_START_DEV	0x06
>> +#define UBD_CMD_STOP_DEV	0x07
>> +
>> +/*
>> + * IO commands, issued by ubd server, and handled by ubd driver.
>> + *
>> + * FETCH_REQ: issued via sqe(URING_CMD) beforehand for fetching IO request
>> + *	from ubd driver, should be issued only when starting device. After
>> + *	the associated cqe is returned, request's tag can be retrieved via
>> + *	cqe->userdata.
>> + *
>> + * COMMIT_AND_FETCH_REQ: issued via sqe(URING_CMD) after ubdserver handled
>> + *	this IO request, request's handling result is committed to ubd
>> + *	driver, meantime FETCH_REQ is piggyback, and FETCH_REQ has to be
>> + *	handled before completing io request.
>> + *
>> + * COMMIT_REQ: issued via sqe(URING_CMD) after ubdserver handled this IO
>> + *	request, request's handling result is committed to ubd driver.
>> + * ABORT_QUEUE: issued via sqe(URING_CMD) and abort all active commands,
>> + *	meantime ubdserver can't issue any FETCH_REQ commands
>> + */
>> +#define	UBD_IO_FETCH_REQ		0x20
>> +#define	UBD_IO_COMMIT_AND_FETCH_REQ	0x21
>> +#define	UBD_IO_COMMIT_REQ		0x22
>> +#define	UBD_IO_ABORT_QUEUE		0x23
>> +
>> +#define UBD_IO_RES_OK			0x01
>> +#define UBD_IO_RES_INVALID_SQE		0x5f
>> +#define UBD_IO_RES_INVALID_TAG		0x5e
>> +#define UBD_IO_RES_INVALID_QUEUE	0x5d
>> +#define UBD_IO_RES_BUSY			0x5c
>> +#define UBD_IO_RES_DUP_FETCH		0x5b
>> +#define UBD_IO_RES_UNEXPECTED_CMD	0x5a
>> +#define UBD_IO_RES_DATA_BAD		0x59
>> +
>> +/* only ABORT means that no re-fetch */
>> +#define UBD_IO_RES_ABORT		0x59
>> +
>> +#define UBDSRV_CMD_BUF_OFFSET	0
>> +#define UBDSRV_IO_BUF_OFFSET	0x80000000
>> +
>> +/* tag bit is 12bit, so at most 4096 IOs for each queue */
>> +#define UBD_MAX_QUEUE_DEPTH	4096
>> +
>> +/*
>> + * zero copy requires 4k block size, and can remap ubd driver's io
>> + * request into ubdsrv's vm space
>> + */
>> +#define UBD_F_SUPPORT_ZERO_COPY	0
>> +
>> +struct ubdsrv_ctrl_dev_info {
>> +	__u16	nr_hw_queues;
>> +	__u16	queue_depth;
>> +	__u16	block_size;
>> +	__u16	state;
>> +
>> +	__u32	rq_max_blocks;
>> +	__u32	dev_id;
>> +
>> +	__u64	dev_blocks;
>> +	__u64	flags;
>> +
>> +	/*
>> +	 * Only valid for READ kind of ctrl command, and driver can
>> +	 * get the userspace buffer address here, then write data
>> +	 * into this buffer.
>> +	 *
>> +	 * And the buffer has to be inside one single page.
>> +	 */
>> +	__u64	addr;
>> +	__u32	len;
>> +	__s32	ubdsrv_pid;
>> +	__u64	reserved0[2];
>> +};
>
> I was about to send you a patch to simplify the interface for
> different commands. My plan was to avoid passing parts of the structure for
> commands that don't use it, considering the limit of io_uring cmd length.
>
>   struct ubdsrv_ctrl_dev_info {
>       __u32 dev_id;
>       union {
>
>             char raw[28]; /* Maximum size for iouring_command */
>
>             struct {
>                   __u32 rq_max_blocks;
>                   __s32 ubdsrv_pid;
>
>                   __u16 nr_hw_queues;
>                   __u16 queue_depth;
>                   __u16 block_size;
>                   __u16 state;
>
>                   __u64 dev_blocks;
>                   __u64 flags;
>             }, /* UBD_CMD_ADD_DEV */
>
>             struct {
>                   __u64 addr;
>                   __u32 len;
>             } /* CMD_GET_DEV_INFO */
>       }
>   };
>
> What do you think?
>
> in addition, I think UBD_CMD_ADD_DEV should embed an extra protocol
> version field, for future extension.
>
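To make the last two points concrete, here is one compilable arrangement of
that union with a protocol version field folded into the ADD_DEV payload.
This is a sketch only: the member names (add/get), the width and placement of
proto_version, and the raw[] size are illustrative, and the ADD_DEV payload
as written below is already larger than 28 bytes, so the real layout would
have to be rebalanced against the actual io_uring command length limit.

#include <linux/types.h>

/* illustrative sketch, not the patch's actual uapi */
struct ubdsrv_ctrl_dev_info {
	__u32	dev_id;

	union {
		/* upper bound comes from the io_uring cmd length limit */
		char	raw[28];

		struct {			/* UBD_CMD_ADD_DEV */
			__u16	proto_version;	/* extension point suggested above */
			__u16	nr_hw_queues;
			__u16	queue_depth;
			__u16	block_size;
			__u16	state;

			__u32	rq_max_blocks;
			__s32	ubdsrv_pid;

			__u64	dev_blocks;
			__u64	flags;
		} add;

		struct {			/* UBD_CMD_GET_DEV_INFO */
			__u64	addr;
			__u32	len;
		} get;
	};
};

Keeping dev_id (and possibly the version) outside the union would let every
control command be routed and validated the same way before the per-command
payload is interpreted.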