Date: Thu, 11 Apr 2024 09:22:51 -0700
From: Luis Chamberlain
To: John Garry
Cc: Dan Helmick, axboe@kernel.dk, kbusch@kernel.org, hch@lst.de,
 sagi@grimberg.me, jejb@linux.ibm.com, martin.petersen@oracle.com,
 djwong@kernel.org, viro@zeniv.linux.org.uk, brauner@kernel.org,
 dchinner@redhat.com, jack@suse.cz, linux-block@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org,
 linux-fsdevel@vger.kernel.org, tytso@mit.edu, jbongio@google.com,
 linux-scsi@vger.kernel.org, ojaswin@linux.ibm.com, linux-aio@kvack.org,
 linux-btrfs@vger.kernel.org, io-uring@vger.kernel.org, nilay@linux.ibm.com,
 ritesh.list@gmail.com, willy@infradead.org, Alan Adamson
Subject: Re: [PATCH v6 10/10] nvme: Atomic write support
References: <20240326133813.3224593-1-john.g.garry@oracle.com>
 <20240326133813.3224593-11-john.g.garry@oracle.com>
 <143e3d55-773f-4fcb-889c-bb24c0acabba@oracle.com>
In-Reply-To: <143e3d55-773f-4fcb-889c-bb24c0acabba@oracle.com>
X-Mailing-List: linux-kernel@vger.kernel.org
Sender: Luis Chamberlain

On Thu, Apr 11, 2024 at 09:59:57AM +0100, John Garry wrote:
> On 11/04/2024 01:29, Luis Chamberlain wrote:
> > On Tue, Mar 26, 2024 at 01:38:13PM +0000, John Garry wrote:
> > > From: Alan Adamson
> > >
> > > Add support to set block layer request_queue atomic write limits. The
> > > limits will be derived from either the namespace or controller atomic
> > > parameters.
> > >
> > > NVMe atomic-related parameters are grouped into "normal" and "power-fail"
> > > (or PF) class of parameter. For atomic write support, only PF parameters
> > > are of interest.
> > > The "normal" parameters are concerned with racing reads
> > > and writes (which also applies to PF). See NVM Command Set Specification
> > > Revision 1.0d section 2.1.4 for reference.
> > >
> > > Whether to use per namespace or controller atomic parameters is decided by
> > > NSFEAT bit 1 - see Figure 97: Identify – Identify Namespace Data
> > > Structure, NVM Command Set.
> > >
> > > NVMe namespaces may define an atomic boundary, whereby no atomic guarantees
> > > are provided for a write which straddles this per-lba space boundary. The
> > > block layer merging policy is such that no merges may occur in which the
> > > resultant request would straddle such a boundary.
> > >
> > > Unlike SCSI, NVMe specifies no granularity or alignment rules, apart from
> > > atomic boundary rule.
> >
> > Larger IU drives a larger alignment *preference*, and it can be multiples
> > of the LBA format, it's called Namespace Preferred Write Granularity (NPWG)
> > and the NVMe driver already parses it. So say you have a 4k LBA format
> > but a 16k NPWG. I suspect this means we'd want atomics writes to align to 16k
> > but I can let Dan confirm.
>
> If we need to be aligned to NPWG, then the min atomic write unit would also
> need to be NPWG. Any NPWG relation to atomic writes is not defined in the
> spec, AFAICS.

NPWG is just a preference, not a requirement, so it is different from the
logical block size. As far as I can tell we have no block topology
information to represent it. LBS will help users opt in to aligning to the
NPWG, and a respective NAWUPF will ensure you can also atomically write the
respective sector size.

For atomics, NABSPF is what we want to use.

The above statement in the commit log just seems a bit misleading then.

> We simply use the LBA data size as the min atomic unit in this patch.

I thought NABSPF is used.

> > > Note on NABSPF:
> > > There seems to be some vagueness in the spec as to whether NABSPF applies
> > > for NSFEAT bit 1 being unset.
> > > Figure 97 does not explicitly mention NABSPF
> > > and how it is affected by bit 1. However Figure 4 does tell to check Figure
> > > 97 for info about per-namespace parameters, which NABSPF is, so it is
> > > implied. However currently nvme_update_disk_info() does check namespace
> > > parameter NABO regardless of this bit.
> >
> > Yeah that its quirky.
> >
> > Also today we set the physical block size to min(npwg, atomic) and that
> > means for a today's average 4k IU drive if they get 16k atomic the
> > physical block size would still be 4k. As the physical block size in
> > practice can also lift the sector size filesystems used it would seem
> > odd only a larger npwg could lift it.
> It seems to me that if you want to provide atomic guarantees for this large
> "physical block size", then it needs to be based on (N)AWUPF and NPWG.

For atomicity, I read it as needing to use NABSPF. Aligning to NPWG will
just help performance. The NPWG comes from an internal mapping table
constructed and kept in DRAM on the drive in units of an IU size [0], so
not aligning to the IU means having to work with multiple entries in the
table rather than just one, and also incurs a read-modify-write.

Contrary to the logical block size, a write below NPWG but respecting the
logical block size is allowed, it's just not optimal.

[0] https://kernelnewbies.org/KernelProjects/large-block-size#Indirection_Unit_size_increases

  Luis
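
The arithmetic discussed in this thread can be illustrated with a small
sketch. It is plain Python, not kernel code, and every function name in it is
hypothetical. It assumes all sizes are already converted to bytes, that the
physical block size is taken as min(npwg, atomic) as described above, and that
a boundary size of 0 means "no atomic boundary". It only models the three
checks mentioned: the physical block size choice, whether a write follows the
NPWG preference, and whether a write straddles an atomic boundary.

```python
KIB = 1024


def phys_block_size(npwg_bytes: int, atomic_bytes: int) -> int:
    """Physical block size as described in the thread: min(npwg, atomic).

    Hypothetical helper; both inputs are assumed to be in bytes.
    """
    return min(npwg_bytes, atomic_bytes)


def is_npwg_aligned(offset: int, length: int, npwg_bytes: int) -> bool:
    """True if the write starts on and covers whole NPWG units.

    A write that fails this check is still allowed as long as it respects
    the logical block size; it is only a performance preference.
    """
    return offset % npwg_bytes == 0 and length % npwg_bytes == 0


def straddles_boundary(offset: int, length: int, boundary_bytes: int) -> bool:
    """True if [offset, offset + length) crosses an atomic boundary.

    Assumption: boundary_bytes == 0 means no boundary is defined.
    """
    if boundary_bytes == 0 or length == 0:
        return False
    first = offset // boundary_bytes
    last = (offset + length - 1) // boundary_bytes
    return first != last


if __name__ == "__main__":
    # 4k IU drive that gains a 16k atomic unit: still a 4k physical block.
    print(phys_block_size(4 * KIB, 16 * KIB))
    # Only a larger NPWG lifts the result.
    print(phys_block_size(16 * KIB, 16 * KIB))
    # An 8k write at 60k crosses a 64k boundary, so no atomic guarantee.
    print(straddles_boundary(60 * KIB, 8 * KIB, 64 * KIB))
```

With a 4k NPWG and a 16k atomic unit the sketch yields 4k, which is the case
called out above as odd: the larger atomic size alone does not raise the
physical block size.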