Date: Sun, 16 Apr 2023 04:40:06 +0100
From: Matthew Wilcox
To: Luis Chamberlain
Cc: Hannes Reinecke, Pankaj Raghav, "kbus >> Keith Busch",
 brauner@kernel.org, viro@zeniv.linux.org.uk, akpm@linux-foundation.org,
 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
 gost.dev@samsung.com
Subject: Re: [RFC 0/4] convert create_page_buffers to create_folio_buffers
References: <20230414110821.21548-1-p.raghav@samsung.com>
 <1e68a118-d177-a218-5139-c8f13793dbbf@suse.de>
 <31765c8c-e895-4207-2b8c-39f6c7c83ece@suse.de>

On Sat, Apr 15, 2023 at 06:28:11PM -0700, Luis Chamberlain wrote:
> On Sat, Apr 15, 2023 at 06:09:12PM +0100, Matthew Wilcox wrote:
> > On Sat, Apr 15, 2023 at 03:14:33PM +0200, Hannes Reinecke wrote:
> > > On 4/15/23 05:44, Matthew Wilcox wrote:
> > > We _could_ upgrade to always do full page I/O; there's a good
> > > chance we'll be using the entire page anyway eventually.
>
> *Iff* doing away with buffer head 512 granularity could help block
> sizes greater than page size, where physical and logical block size >
> PAGE_SIZE, we would also be able to see it on 4kn drives (logical and
> physical block size == 4k). A projection could be made after.
>
> In so far as experimenting with this goes, if you already have some
> effort on IOMAP for bdev aops, one possibility for pure experimentation
> for now would be to peg a new set of aops in the path of
> __alloc_disk_node() --> bdev_alloc(), but that's a wee bit too early
> for us to know whether the device has (lbs = pbs) > 512. For NVMe, for
> instance, this would be nvme_alloc_ns() --> blk_mq_alloc_disk(). We
> put together and set the logical and physical block sizes for NVMe in
> nvme_update_ns_info() --> nvme_update_disk_info(), right before we
> call device_add_disk(). The only way to override the aops then would
> be right before device_add_disk(), or as part of a new
> device_add_disk_aops() or whatever.

I think you're making this harder than you need to. For LBA size >
PAGE_SIZE, there is always only going to be one BH per
folio-that-is-LBA-size, so all the problems with
more-than-8-BHs-per-4k-page don't actually exist. I don't think we
should be overriding the aops, and if we narrow the scope of large
folio support in blockdev to only supporting folio_size == LBA size,
it becomes much more feasible.

> > > And with storage bandwidth getting larger and larger we might even
> > > get a performance boost there.
> >
> > I think we need to look at this from the filesystem side.
>
> Before that, let's recap the bdev cache's current issues.

Ooh, yes, this is good! I was totally neglecting the partition
scanning code.

> Today, by just adding the disk, we move on to partition scanning
> immediately unless your block driver has a flag that says otherwise.
> The current crash we're evaluating with brd, and that we also hit
> with NVMe, is due to this part:
>
>   device_add_disk() -->
>     disk_scan_partitions() -->
>       blkdev_get_whole() -->
>         bdev_disk_changed() -->
>           filemap_read_folio() --> filler()
>
> The filler is from the aops.

Right, so you missed a step in that callchain, which is
read_mapping_folio(). That ends up in do_read_cache_folio(), which
contains the deadly:

	folio = filemap_alloc_folio(gfp, 0);

so that needs to be changed to get the minimum folio order from the
mapping, and then it should work.
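To make that concrete, here is a minimal sketch of the shape of that
fix. mapping_min_folio_order() and the field it reads are invented
names for illustration; some plumbing would still have to record the
order in the mapping when the bdev learns its LBA size:

	/* hypothetical helper: a minimum folio order recorded in the mapping */
	static inline unsigned int
	mapping_min_folio_order(const struct address_space *mapping)
	{
		return mapping->min_folio_order;	/* no such field exists today */
	}

	/* then, in do_read_cache_folio(), instead of hard-coding order 0: */
	folio = filemap_alloc_folio(gfp, mapping_min_folio_order(mapping));
	if (!folio)
		return ERR_PTR(-ENOMEM);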
> > What do filesystems actually want to do?
>
> So you are suggesting that the early reads of the block device by the
> bdev cache and its use of the page cache should be aligned / perhaps
> redesigned to assist more clearly with what modern filesystems would
> actually want today?

Perhaps? I'm just saying the needs of the block device are not the
only consideration here. I'd like an API that makes sense for the fs
author.

> > The first thing is they want to read
> > the superblock. That's either going to be immediately freed ("Oh,
> > this isn't a JFS filesystem after all") or it's going to hang around
> > indefinitely. There's no particular need to keep it in any kind of
> > cache (buffer or page).
>
> And the bdev cache would not be able to know beforehand that that's
> the case.

Right, nobody knows until it's been read and examined.

> > Except ... we want to probe a dozen different
> > filesystems, and half of them keep their superblock at the same offset
> > from the start of the block device. So we do want to keep it cached.
> > That's arguing for using the page cache, at least to read it.
>
> Do we currently share anything from the bdev cache with the fs for
> this? Let's say that first block of the device, at its block size,
> in memory.

sb_bread() is used by most filesystems, and the buffer cache aliases
into the page cache.

> > Now, do we want userspace to be able to dd a new superblock into place
> > and have the mounted filesystem see it?
>
> Not sure I follow this. dd a new superblock?

In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N',
I can overwrite the superblock. Do we want filesystems to see that
kind of vandalism, or do we want the mounted filesystem to have its
own copy of the data and overwrite what userspace wrote the next time
it updates the superblock?

(The trick is that this may not be vandalism; it might be the sysadmin
updating the uuid, or running some fsck-ish program, or trying to
update the superblock to support fabulous-new-feature on next mount.
Does this change the answer?)

> > I suspect that confuses just
> > about every filesystem out there. So I think the right answer is to read
> > the page into the bdev's page cache and then copy it into a kmalloc'ed
> > buffer which the filesystem is then responsible for freeing. It's also
> > responsible for writing it back (so that's another API we need), and for
> > a journalled filesystem, it needs to fit into the journalling scheme.
> > Also, we may need to write back multiple copies of the superblock,
> > possibly with slight modifications.
>
> Are you considering these as extensions to the bdev cache?

I'm trying to suggest some of the considerations that need to go into
a replacement for sb_bread().
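For reference, the API being replaced is typically used along these
lines today (a minimal sketch; the foofs names and on-disk layout are
invented):

	struct buffer_head *bh;
	struct foofs_super_block *fsb;		/* invented on-disk layout */

	bh = sb_bread(sb, FOOFS_SB_BLOCK);	/* invented block number */
	if (!bh)
		return -EIO;

	fsb = (struct foofs_super_block *)bh->b_data;
	if (le32_to_cpu(fsb->s_magic) != FOOFS_MAGIC) {
		/* "oh, this isn't a foofs filesystem after all" */
		brelse(bh);
		return -EINVAL;
	}

	/*
	 * Today the bh is then typically pinned for the life of the
	 * mount, e.g. in the fs's (invented) per-mount info.
	 */
	sbi->s_sbh = bh;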
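And one rough shape the copy-out replacement sketched above could
take. bdev_read_super() and bdev_write_super() are invented names,
and locking, highmem and the journalling hooks are all glossed over:

	/* read through the bdev's page cache; return a copy the fs must kfree() */
	void *bdev_read_super(struct block_device *bdev, loff_t pos, size_t len)
	{
		struct address_space *mapping = bdev->bd_inode->i_mapping;
		struct folio *folio;
		void *buf;

		folio = read_mapping_folio(mapping, pos >> PAGE_SHIFT, NULL);
		if (IS_ERR(folio))
			return NULL;

		buf = kmalloc(len, GFP_KERNEL);
		if (buf)
			memcpy(buf, folio_address(folio) +
			       offset_in_folio(folio, pos), len);
		folio_put(folio);
		return buf;
	}

	/* plus the writeback counterpart a journalled fs would hook into */
	int bdev_write_super(struct block_device *bdev, loff_t pos,
			     const void *buf, size_t len);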