Received: by 2002:a05:6a10:a0d1:0:0:0:0 with SMTP id j17csp1047455pxa; Thu, 20 Aug 2020 00:39:22 -0700 (PDT) X-Google-Smtp-Source: ABdhPJxHK0f79FPM85gH6dXC7GNKQIFoNhAtI7mB/j4Vc3wtqJl95ELXe+A0dYzLp+mUtjitIdLQ X-Received: by 2002:a17:906:d9db:: with SMTP id qk27mr163234ejb.230.1597909162591; Thu, 20 Aug 2020 00:39:22 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1597909162; cv=none; d=google.com; s=arc-20160816; b=TmxyLtjl1QMhLMUTIlDOTLhc0U9b55IrbbpDnHhrR9QwV78FJNngnTjXybDIDpB/hH MjKQCsbvtr2oo2OeAh7koqKJe69ZAp90tHDPszcvthab/E2ifGIj8FGD8atEubqP1AmD MeruXGy1R18bOaxZaVGFZPUV/oUwlfIiWudBReYnp+WYXzHVUbwG8tOhNifDpJ7e/xXc tzoWuY+60qJHFv4yESJKwsWreZYAYU0JAZVhy+uz99SkjzXnMd7lpkj6T1tpxZOtp3cN b/8cZV1SOUO7t/2pTaqLjpvznvr4QzrvJGyhIBp4cveNqDvr9Fakz/+nxYQ52RaooqIn cKeA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=tQmGB+boenn+8AWQYGoZFj5ruu/kBTP0OjM4DoRcBzM=; b=wWkWm3o/yLy+NwnkgCLpMQHIxfUsRypa15aFNbv1mQ55zFn3T3xuhVApQ68UWgJ6SC e7326XaiduzlERve5SqkXWzFh7wrSwBYarxdsywXYzAOGfjpD/nsb2tEWTj4Kk+chKmQ KPw+29z273s7jUPndUpOzJS4EpeNhVlaFMVPTj6ECKw5tTSOas4ymyZlbM96n3WQYvoo CRdc+Z4f2xPd5lowuqCpR2/bSub6Mcx+m2JPG6+uPV1ZM36E7KK/ifjgiOLEjRKdDpnG rKy0twAQgiNezxRWPU2MZlAZGhDudWkmQiGwrvEWNDAlX61Qlr8NaUk9c1mD4p3GRGFO XeRA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=AviZ1SR1; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id b20si925011edn.57.2020.08.20.00.38.58; Thu, 20 Aug 2020 00:39:22 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=AviZ1SR1; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1725910AbgHTHiT (ORCPT + 99 others); Thu, 20 Aug 2020 03:38:19 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39698 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725824AbgHTHiP (ORCPT ); Thu, 20 Aug 2020 03:38:15 -0400 Received: from mail-wm1-x342.google.com (mail-wm1-x342.google.com [IPv6:2a00:1450:4864:20::342]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9720DC061757 for ; Thu, 20 Aug 2020 00:38:14 -0700 (PDT) Received: by mail-wm1-x342.google.com with SMTP id g75so676641wme.4 for ; Thu, 20 Aug 2020 00:38:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=tQmGB+boenn+8AWQYGoZFj5ruu/kBTP0OjM4DoRcBzM=; b=AviZ1SR1z3KYs/8oZaCL1cwH2GoTYKUaG1Mf7v+i8czULqXqUw5TOw+ZLKDfIYI+/P 5pgLSO7AAtAcdIC6ilZqNd58/QJbswRzm2H8QTbkGMxazUqSDiN7Op6SUAPOGmfnrrc6 je7fyhZTZ/iSQoIrjYE2/KbBrcX3nAY8KH5a3lzZLNofdM8/VZyNOMfChro5Sk2SEYTP TriwhY59mSH9pR9JTMlh34jQKmHdZbasI5wR8+JP+UHImu0Tr5QPMfucQW75o6yAM9ey 7s8tKEwQS3s5MJrRHaDV5z5efziaDVP1okOS27kT0lXZItiM8iqWshBv5aXY3bPjXT1Z yGhg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=tQmGB+boenn+8AWQYGoZFj5ruu/kBTP0OjM4DoRcBzM=; b=U6drgT2icHnx494P2BwONCE95NXYWtx8UYBSbHIkYB783rLZA6dF1RpF6+GyTeGaZd /RorUZEBnQbNzSiA2GXbm3fmq0uFodI0tUy30DnO7PJW7p9Ji/9qZVzMFYW0PJvf70/4 la4VxBfq/uHazw4w+FAYx92fyKuiGbSXtw58HwWm9AMj2G6/DR6MSD0UdB5+4NnT4SLD Pz8pkwa0nRHzhjI3biEGTqgmIoXoc7MgOAvgPKZsB1Fdyu+ImDofYjn98+tSE3hYIv4a dx7OcfaQHvIXKY+/4yIlzgopygEeX5+zEx9rLph00XPBub99k3V4aqfPnzL56n34W09h 3FxA== X-Gm-Message-State: AOAM531W97annbAJf+J5woNcflfZI3JEsUWdSaOWav3bLBGtMi4bbyvX /LY7xTTl9h4j39MnZU8BDaNUyffFT/xIy/qgF6E= X-Received: by 2002:a7b:c384:: with SMTP id s4mr2082840wmj.138.1597909093072; Thu, 20 Aug 2020 00:38:13 -0700 (PDT) MIME-Version: 1.0 References: <20200818052936.10995-1-joshi.k@samsung.com> <20200818052936.10995-3-joshi.k@samsung.com> <20200818071249.GB2544@lst.de> <20200819214248.GA26769@dhcp-10-100-145-180.wdl.wdc.com> In-Reply-To: <20200819214248.GA26769@dhcp-10-100-145-180.wdl.wdc.com> From: Kanchan Joshi Date: Thu, 20 Aug 2020 13:07:46 +0530 Message-ID: Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append To: Keith Busch Cc: David Fugate , "axboe@kernel.dk" , "Damien.LeMoal@wdc.com" , SelvaKumar S , "sagi@grimberg.me" , Kanchan Joshi , "johannes.thumshirn@wdc.com" , "linux-kernel@vger.kernel.org" , "linux-nvme@lists.infradead.org" , Nitesh Shetty , Javier Gonzalez , Christoph Hellwig Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Aug 20, 2020 at 3:22 AM Keith Busch wrote: > > On Wed, Aug 19, 2020 at 01:11:58PM -0600, David Fugate wrote: > > Intel does not support making *optional* NVMe spec features *required* > > by the NVMe driver. > > This is inaccurate. As another example, the spec optionally allows a > zone size to be a power of 2, but linux requires a power of 2 if you > want to use it with this driver. > > > Provided there's no glaring technical issues > > There are many. Some may be improved through a serious review process, > but the mess it makes out of the fast path is measurably harmful to > devices that don't subscribe to this. That issue is not so easily > remedied. > > Since this patch is a copy of the scsi implementation, the reasonable > thing is take this fight to the generic block layer for a common > solution. We're not taking this in the nvme driver. I sincerely want to minimize any adverse impact to the fast-path of non-zoned devices. My understanding of that aspect is (I feel it is good to confirm, irrespective of the future of this patch): 1. Submission path: There is no extra code for non-zoned device IO. For append, there is this "ns->append_emulate = 1" check. Snippet - case REQ_OP_ZONE_APPEND: - ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append); + if (!nvme_is_append_emulated(ns)) + ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append); + else { + /* prepare append like write, and adjust lba afterwards */ 2. Completion: Not as clean as submission for sure. The extra check is "ns && ns->append_emulate == 1" for completions if CONFIG_ZONED is enabled. A slightly better completion-handling version (than the submitted patch) is - - } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) && - req_op(req) == REQ_OP_ZONE_APPEND) { - req->__sector = nvme_lba_to_sect(req->q->queuedata, - le64_to_cpu(nvme_req(req)->result.u64)); + } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) { + struct nvme_ns *ns = req->q->queuedata; + /* append-emulation requires wp update for some cmds*/ + if (ns && nvme_is_append_emulated(ns)) { + if (nvme_need_zone_wp_update(req)) + nvme_zone_wp_update(ns, req, status); + } + else if (req_op(req) == REQ_OP_ZONE_APPEND) + req->__sector = nvme_lba_to_sect(ns, + le64_to_cpu(nvme_req(req)->result.u64)); Am I missing any other fast-path pain-points? A quick 1 minute 4K randwrite run (QD 64, 4 jobs,libaio) shows : before: IOPS=270k, BW=1056MiB/s (1107MB/s)(61.9GiB/60002msec) after: IOPS=270k, BW=1055MiB/s (1106MB/s)(61.8GiB/60005msec) This maynot be the best test to see the cost, and I am happy to conduct more and improvise. As for the volume of the code - it is same as SCSI emulation. And I can make efforts to reduce that by moving common code to blk-zone, and reuse in SCSI/NVMe emulation. In the patch I tried to isolate append-emulation by keeping everything into "nvme_za_emul". It contains nothing nvme'ish except nvme_ns, which can be turned into "driver_data". +#ifdef CONFIG_BLK_DEV_ZONED +struct nvme_za_emul { + unsigned int nr_zones; + spinlock_t zones_wp_offset_lock; + u32 *zones_wp_offset; + u32 *rev_wp_offset; + struct work_struct zone_wp_offset_work; + char *zone_wp_update_buf; + struct mutex rev_mutex; + struct nvme_ns *ns; +}; +#endif Will that be an acceptable line of further work/discussions? -- Kanchan