Date: Mon, 18 Sep 2023 08:05:20 +1000
From: Dave Chinner
To: Pankaj Raghav
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, p.raghav@samsung.com, da.gomez@samsung.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, willy@infradead.org, djwong@kernel.org, linux-mm@kvack.org, chandan.babu@oracle.com, mcgrof@kernel.org, gost.dev@samsung.com
Subject: Re: [RFC 00/23] Enable block size > page size in XFS
In-Reply-To: <20230915183848.1018717-1-kernel@pankajraghav.com>

On Fri, Sep 15, 2023 at 08:38:25PM +0200, Pankaj Raghav wrote:
> From: Pankaj Raghav
>
> There have been efforts over the last 16 years to enable Large
> Block Sizes (LBS), that is block sizes in filesystems where bs > page
> size [1] [2]. Through these efforts we have learned that one of the
> main blockers to supporting bs > ps in filesystems has been a way to
> allocate pages that are at least the filesystem block size on the page
> cache where bs > ps [3].
> Another blocker was the changes needed in filesystems due to
> buffer-heads. Thanks to these previous efforts, the surgery by Matthew
> Wilcox in the page cache for adopting xarray's multi-index support, and
> iomap support, supporting bs > ps in XFS is now possible with only a few
> lines of change to XFS. Most of the changes are to the page cache, to
> support a minimum folio order for the target block size on the filesystem.
>
> A new motivation for LBS today is to support high-capacity (many
> terabytes) QLC SSDs where the internal Indirection Units (IUs) are
> typically greater than 4k [4] to help reduce DRAM and so in turn cost
> and space. In practice this then allows different architectures to use a
> base page size of 4k while still enabling support for block sizes
> aligned to the larger IUs by relying on high order folios on the page
> cache when needed. It also makes it possible to take advantage of these
> same drives' support for atomics larger than 4k with buffered IO support
> in Linux. As described this year at LSFMM, supporting large atomics greater
> than 4k enables databases to remove the need to rely on their own
> journaling, so they can disable double buffered writes [5], which is a
> feature different cloud providers are already innovating on and enabling
> for customers through custom storage solutions.
>
> This series still needs some polishing and fixes for some crashes, but it
> is mainly targeted at getting initial feedback from the community and
> enabling initial experimentation, hence the RFC. It's being posted now
> given that our testing is showing much better results than expected, and
> we hope to polish this up together with the community. After all, this has
> been a 16-year-old effort and none of this would have been possible
> without that effort.
>
> Implementation:
>
> This series only adds the notion of a minimum order of a folio in the
> page cache that was initially proposed by Willy.
> The minimum folio order requirement is set during inode creation. The
> minimum order will typically correspond to the filesystem block size.
> The page cache will in turn respect the minimum folio order requirement
> while allocating a folio. This series mainly changes the page cache's
> filemap, readahead, and truncation code to allocate and align the folios
> to the minimum order set for the address space mapping of the
> filesystem's inode.
>
> Only XFS was enabled and tested as a part of this series as it has
> supported block sizes up to 64k and sector sizes up to 32k for years.
> The only thing missing was the page cache magic to enable bs > ps.
> However, any filesystem that doesn't depend on buffer-heads and already
> supports larger block sizes should be able to leverage this effort to
> also support LBS, bs > ps.
>
> This also paves the way for supporting block devices whose logical
> block size > page size in the future by leveraging iomap's address space
> operation added to the block device cache by Christoph Hellwig [6]. We
> have work to enable support for this, enabling LBAs > 4k on NVMe, and
> at the same time allow coexistence with buffer-heads on the same block
> device, so that a drive can switch between filesystems which may depend
> on buffer-heads or need the iomap address space operations for the block
> device cache. Patches for this will be posted shortly after this patch
> series.

Do you have a git tree branch that I can pull this from somewhere?

As it is, I'd really prefer stuff that adds significant XFS
functionality that we need to test to be based on a current Linus
TOT kernel so that we can test it without being impacted by all
the random unrelated breakages that regularly happen in linux-next
kernels....

-Dave.
-- 
Dave Chinner
david@fromorbit.com