Date: Mon, 18 Sep 2023 15:07:42 +1000
From: Dave Chinner <david@fromorbit.com>
To: Luis Chamberlain
Cc: Pankaj Raghav, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	p.raghav@samsung.com, da.gomez@samsung.com, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, willy@infradead.org, djwong@kernel.org,
	linux-mm@kvack.org, chandan.babu@oracle.com, gost.dev@samsung.com
Subject: Re: [RFC 00/23] Enable block size > page size in XFS
References: <20230915183848.1018717-1-kernel@pankajraghav.com>
List-ID: linux-kernel@vger.kernel.org

On Sun, Sep 17, 2023 at 07:04:24PM -0700, Luis Chamberlain wrote:
> On Mon, Sep 18, 2023 at 08:05:20AM +1000, Dave Chinner wrote:
> > On Fri, Sep 15, 2023 at 08:38:25PM +0200, Pankaj Raghav wrote:
> > > From: Pankaj Raghav
> > >
> > > There have been efforts over the last 16 years to enable Large
> > > Block Sizes (LBS), that is, block sizes in filesystems where
> > > bs > page size [1] [2].
> > > Through these efforts we have learned that one of the main
> > > blockers to supporting bs > ps in filesystems has been a way to
> > > allocate pages that are at least the filesystem block size in the
> > > page cache where bs > ps [3]. Another blocker was the changes
> > > needed in filesystems because of buffer-heads. Thanks to these
> > > previous efforts, the surgery by Matthew Wilcox in the page cache
> > > to adopt xarray's multi-index support, and iomap support,
> > > supporting bs > ps in XFS is possible with only a few lines of
> > > change to XFS. Most of the changes are to the page cache to
> > > support a minimum folio order matching the target block size on
> > > the filesystem.
> > >
> > > A new motivation for LBS today is to support high-capacity
> > > (multi-terabyte) QLC SSDs where the internal Indirection Unit
> > > (IU) is typically greater than 4k [4], to help reduce DRAM and so
> > > in turn cost and space. In practice this allows different
> > > architectures to use a base page size of 4k while still
> > > supporting block sizes aligned to the larger IUs, by relying on
> > > high-order folios in the page cache when needed. It also makes it
> > > possible to take advantage of these same drives' support for
> > > atomic writes larger than 4k with buffered IO support in Linux.
> > > As described this year at LSFMM, supporting atomic writes greater
> > > than 4k enables databases to remove the need to rely on their own
> > > journaling, so they can disable double-buffered writes [5], a
> > > feature different cloud providers are already innovating on and
> > > enabling for customers through custom storage solutions.
> > >
> > > This series still needs some polishing and fixing of some
> > > crashes, but it is mainly targeted at getting initial feedback
> > > from the community and enabling initial experimentation, hence
> > > the RFC.
> > > It's being posted now because the results from our testing are
> > > proving much better than expected, and we hope to polish this up
> > > together with the community. After all, this has been a 16-year
> > > effort and none of this could have been possible without it.
> > >
> > > Implementation:
> > >
> > > This series only adds the notion of a minimum order of a folio
> > > in the page cache, as initially proposed by Willy. The minimum
> > > folio order requirement is set during inode creation. The
> > > minimum order will typically correspond to the filesystem block
> > > size. The page cache will in turn respect the minimum folio
> > > order requirement when allocating a folio. This series mainly
> > > changes the page cache's filemap, readahead, and truncation code
> > > to allocate and align folios to the minimum order set for the
> > > filesystem inode's respective address space mapping.
> > >
> > > Only XFS was enabled and tested as a part of this series, as it
> > > has supported block sizes up to 64k and sector sizes up to 32k
> > > for years. The only thing missing was the page cache magic to
> > > enable bs > ps. However, any filesystem that doesn't depend on
> > > buffer-heads and already supports larger block sizes should be
> > > able to leverage this effort to also support LBS, bs > ps.
> > >
> > > This also paves the way for supporting block devices whose
> > > logical block size > page size in the future, by leveraging the
> > > iomap address space operations added to the block device cache
> > > by Christoph Hellwig [6]. We have work underway to enable
> > > support for this, enabling LBAs > 4k on NVMe, while at the same
> > > time allowing coexistence with buffer-heads on the same block
> > > device, so that a drive can switch between filesystems which may
> > > depend on buffer-heads or need the iomap address space
> > > operations for the block device cache.
> > > Patches for this will be posted shortly after this patch
> > > series.
> >
> > Do you have a git tree branch that I can pull this from
> > somewhere?
> >
> > As it is, I'd really prefer stuff that adds significant XFS
> > functionality that we need to test to be based on a current Linus
> > TOT kernel so that we can test it without being impacted by all
> > the random unrelated breakages that regularly happen in
> > linux-next kernels....
>
> That's understandable! I just rebased onto Linus' tree; this only
> has the bs > ps support on 4k sector size:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=v6.6-rc2-lbs-nobdev
>
> I just did a cursory build / boot / fsx test with 16k block size /
> 4k sector size with this tree only. I haven't run fstests on it.

With 64k block size, generic/042 fails (maybe just a test block size
thing), generic/091 fails (data corruption on read after ~70 ops) and
then generic/095 hung with a crash in iomap_readpage_iter() during
readahead. It looks like a null folio was passed to ifs_alloc(), which
implies the iomap_readpage_ctx didn't have a folio attached to it.
Something isn't working properly in the readahead code, which would
also explain the quick fsx failure...

> Just a heads up: using a 512 byte sector size will fail for now;
> it's a regression we have to fix. Likewise, 1k and 2k block sizes
> will also regress on fsx right now. These are regressions we are
> aware of but haven't had time yet to bisect / fix.

I'm betting that the recently added sub-folio dirty tracking code got
broken by this patchset....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com