Received: by 2002:a05:6a10:6744:0:0:0:0 with SMTP id w4csp1632135pxu; Thu, 8 Oct 2020 17:20:03 -0700 (PDT) X-Google-Smtp-Source: ABdhPJxl/RjMLSEnM4+CzeKJAoPZHs2odZ4NMltAKtP8Co4McgxOnF+cx6H7A2jbf1YsrWoVlKFv X-Received: by 2002:a17:906:33c8:: with SMTP id w8mr11431022eja.233.1602202803721; Thu, 08 Oct 2020 17:20:03 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1602202803; cv=none; d=google.com; s=arc-20160816; b=GLP6BsDK06/0TFDhwv0gDINlGVpdqkuuF248J+WkhRoqFkjghyi3AsfjPS/LQDR9WF RDCN9Hm99OkLw/070Ee6l+8DV1ntsU6YfzqYjlHe1xNUGCV/oDyJqbZLaElm/h82aatZ G8qAhYz1bOhEUJ7ZJOcFQYHsEcmiBFx0qKnt6cRCJI+fBwbsz7rnIu1H7nqdIgQ3j+CO BsBJgF8DHJYyzu5JestznBRM24u6B2M33vv/xAMwex8f7mMgtHgdCo3ACu2ObKIlRb0z KdEFWEFOW1A8BoNRcox6K+peAlLQbGB7dSTA0AuMIubjoBEjBCnlZjWcCGg4RpTJKMpJ 3B+Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date; bh=WpTlgF7B5Hmb07gEa2s29vqdLh2Au9JhL+nnwQMn34o=; b=z5DstRH/c4BgGr3ylgRyaijqMo+AiDKyO/aWaC4wsucJBnu92OvBWGHPSKiehwiv/U d5vNh5iw0glliNWEUAYiVI+MgOJDb3Mixjj7F5b06yzO11+zAqJ9DWGyp9a4jkWmJGII rcX5duTHha09CtaPuPDthcieyEYaxGGdAOAALVPm/pbJy6ZBCD7Nxh8yDhYtfcJ2JbuZ ZGo6RvkL8aPeJLoQiT0t0TAfQbzCczsJZUfN99ePRZ/8AcrFeY+MAgLPsPxD1BAyicZo W+O4duKFsllMzK3wXZU5W7dPWvA4HsD2psDi+baBZEkG+04q6DzeDQqwYCCIxNbmR+tq ddYw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-ext4-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id y2si4597547ede.379.2020.10.08.17.19.27; Thu, 08 Oct 2020 17:20:03 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-ext4-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726099AbgJHWXK (ORCPT + 99 others); Thu, 8 Oct 2020 18:23:10 -0400 Received: from relay12.mail.gandi.net ([217.70.178.232]:35523 "EHLO relay12.mail.gandi.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725941AbgJHWXK (ORCPT ); Thu, 8 Oct 2020 18:23:10 -0400 Received: from localhost (unknown [67.5.25.97]) (Authenticated sender: josh@joshtriplett.org) by relay12.mail.gandi.net (Postfix) with ESMTPSA id 2E05D200004; Thu, 8 Oct 2020 22:23:02 +0000 (UTC) Date: Thu, 8 Oct 2020 15:22:59 -0700 From: Josh Triplett To: "Theodore Y. Ts'o" Cc: "Darrick J. Wong" , Linus Torvalds , Andreas Dilger , Jan Kara , Linux Kernel Mailing List , linux-ext4@vger.kernel.org Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps Message-ID: <20201008222259.GA45658@localhost> References: <20201005173639.GA2311765@magnolia> <20201006003216.GB6553@localhost> <20201006025110.GJ49559@magnolia> <20201006031834.GA5797@mit.edu> <20201006050306.GA8098@localhost> <20201006133533.GC5797@mit.edu> <20201007080304.GB1112@localhost> <20201007143211.GA235506@mit.edu> <20201007201424.GB15049@localhost> <20201008021017.GD235506@mit.edu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20201008021017.GD235506@mit.edu> Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org On Wed, Oct 07, 2020 at 10:10:17PM -0400, Theodore Y. Ts'o wrote: > On Wed, Oct 07, 2020 at 01:14:24PM -0700, Josh Triplett wrote: > > That sounds like a conversation that would have been a lot more > > interesting and enjoyable if it hadn't started with "can we shoot it in > > the head", and continued with the notion that anything other than > > e2fsprogs making something to be mounted by mount(2) and handled by > > fs/ext4 is being "inflicted", and if the goal didn't still seem to be > > "how do we make it go away so that only e2fsprogs and the kernel ever > > touch ext4". I started this thread because I'd written some userspace > > code, a new version of the kernel made that userspace code stop working, > > so I wanted to report that the moment I'd discovered that, along with a > > potential way to address it with as little disrupton to ext4 as > > possible. > > What is really getting my dander up is your attempt to claim that the > on-disk file system format is like the userspace/kernel interface, > where if we break any file system that file system that was > "previously accepted by an older kernel", this is a bug that must be > reverted or otherwise fixed to allow file systems that had previously > worked, to continue to work. And this is true even if the file system > is ***invalid***. > > And the problem with this is that there have been any number of > commits where file systems which were previously invalid, but which > could be caused to trigger a syzbot whine, which was fixed by > tightening up the validity tests in the kernel. In some cases, I had > to also had to fix up e2fsck to detect the invalid file system which > was generated by the file system fuzzer. Yes, it's unfortunate that > we didn't have these checks earlier, but a file system has a huge > amount of state. > > The principle you've articulated would make it impossible for me to > fix these bugs, unless I can prove that the failure to check a > particular invalid file system corruption could lead to a security > vulnerability. (Would it be OK for me to make the kernel more strict > and reject an invalid file system if it triggers a WARN_ON, so I get > the syzbot complaint, but it doesn't actually cause a security issue?) > > So this conversation would have been a lot more pleasant for *me* if > you hadn't tried to elevate your request to a general principle, where > if someone is deliberately generating an invalid file system, I'm not > allowed to make the kernel more strict to detect said invalidity and > to reject the invalid / corrupted / fuzzed file system. I appreciate you providing this explanation; this makes the nature and severity of your concern *much* more clear. I also now understand why I was feeling confused, and why it felt several times like we were talking past each other; see below. For my part, I had no desire to have generated this level of concern. The restrictions you're describing above are far, far past anything I had possibly envisioned or desired. I want to clarify a couple of things. I wasn't trying to make a *new* general principle or policy. I was under the impression that this *was* the policy, because it never occurred to me that it could be otherwise. It seemed like a natural aspect of the kernel/userspace boundary, to the point that the idea of it *not* being part of the kernel's stability guarantees didn't cross my mind. Thus, when you were suggesting a policy of "only filesystems generated by the e2fsprogs userspace tools are acceptable", that seems like you were suggesting a massive loosening of existing policy. And when I suggested an alternative of "only filesystems that pass e2fsck are acceptable", I was also trying to suggest a loosening of the existing policy, just a different, milder one. I would have come to this discussion with a very different perspective if it had occurred to me that this might not be the policy, or at least that it simply hadn't come up in this way before. I would be quite happy to have a discussion about where that stability boundary should be. I also have no desire to be inflexible about that boundary or about the evolution of the software I'm working on; this is not some enterprise "thou shalt not touch" workload. On the contrary, within reason it's *relatively* straightforward for me to evolve filesystem generation code to deal with changes in the kernel or e2fsck. I was not looking to push some new general principle on stability; I only wished to ensure that it was reasonable to negotiate about the best way to solve problems if and when they arise, and I hope they do not arise often. (Most of the time, I'd expect that when they arise I'll end up wanting to change my software rather than the kernel.) I also don't think that a (milder) stability guarantee would mean that the kernel or fsck can never become stricter. On the contrary, they both *should* absolutely try to detect more issues over time. Much like other instances of the userspace/kernel boundary, changes that could hypothetically break some userspace software aren't automatically disqualified, only changes that *actually* break userspace in practice. I regularly see the kernel making a change that would technically affect some userspace software, but either no such software appears to exist (and the question could be revisited if it did), or any such software that does exist is not affected in a harmful way (e.g. the software copes with any error produced), or the software can be relatively easily patched to work around the new requirements and the nature of the software is such that people won't be running older versions, or any number of other practical considerations. Stability is a practical boundary rather than a theoretical concern. It's one that can be discussed and negotiated based on the nature of a change, and its practical impact. I suggested a kernel patch because that seemed like a straightforward approach with minimal impact to the kernel's validity checking. Darrick's suggestion in the next mail is an even better approach, and one I'm likely to use, so I'd have no problem with the patch that started this thread *not* being applied after all (though I appreciate the willingness to consider it). Finally, I think there's also some clarification needed in the role of what some of the incompat and ro_compat flags mean. For instance, "RO_COMPAT_READONLY" is documented as: > - Read-only filesystem image; the kernel will not mount this image > read-write and most tools will refuse to write to the image. Is it reasonable to interpret this as "this can never, ever become writable", such that no kernel should ever "understand" that flag in ro_compat? I'd assumed so, but this discussion is definitely leading me to want to confirm any such assumptions. Is this a flag that e2fsck could potentially use to determine that it shouldn't check read-write-specific data structures, or should that be a different flag? > > The short version is that I needed a library to rapidly turn > > dynamically-obtained data into a set of disk blocks to be served > > on-the-fly as a software-defined disk, and then mounted on the other > > side of that interface by the Linux kernel. Turns out that's *many > > orders of magnitude* faster than any kind of network filesystem like > > NFS. It's slightly similar to a vvfat for ext4. The less blocks it can > > generate and account for and cache, the faster it can run, and > > microseconds matter. > > So are you actually trying to dedup data blocks, or are you just > trying to avoid needing to track the block allocation bitmaps? Both. I do actually dedup some of the data blocks, when it's easy to do so. The less data that needs to get generated and go over the wire, the better. > And are you just writing a single file, or multiple files? Do you > know what the maximum size of the file or files will be? Do you need > a complex directory structure, or just a single root directory? It's an arbitrary filesystem hierarchy, including directories, files of various sizes (hence using inline_data), and permissions. The problem isn't to get data from point A to point B; the problem is (in part) to turn a representation of a filesystem into an actual mounted filesystem as efficiently as possible, live-serving individual blocks on demand rather than generating the whole image in advance. > Can the file system be sparse? Files almost certainly won't be (because in practice, it'd be rare enough that it's not worth the time and logic to check). The block device is in a way; unused blocks aren't transmitted. > So for example, you can do something like this, which puts all of the > metadata at beginning of the file system, and then you could write to > contiguous data blocks. Add the following in mke2fs.conf: (Note that the main reason mke2fs wouldn't help is that it generates filesystems in advance, rather than on the fly.) > [fs_types] > hugefile = { > features = extent,huge_file,bigalloc,flex_bg,uninit_bg,dir_nlink,extra_isize,^resize_inode,sparse_super2 Other than uninit_bg, that's pretty much the list of feature flags I'm setting. I'm also enabling 64bit, inline_data, largedir, and filetype. > cluster_size = 32768 Many files are potentially small, so 4k blocks make sense. (I've even considered 1k blocks, but that seemed like a less-well-trodden path and I didn't want to be less efficient for larger files. 4k seems to be a good balance, and has the advantage of matching the page size.) > So if you had come to the ext4 list with a set of requirements, it > could have been that we could have come up with something which uses > the existing file system features, or come up with something which > would have been more specific --- and more importantly, we'd know what > the semantics were of various on-disk file system formats that people > are depending upon. I've mentioned specific aspects of what I'm doing at least twice: once in the thread about 128-bit inodes and inline data (which got a good response, and ended up convincing me not to go that route even though it'd be theoretically possible), and once in the feature request asking for support in e2fsck for shared bitmap blocks (which didn't get any further followups after I'd provided additional information). > > If at some point I'm looking to make ext4 support more than it already > > does (e.g. a way to omit bitmaps entirely, or a way to express > > contiguous files with smaller extent maps, or other enhancements for > > read-only filesystems), > > See above for a way to significantly reduce the number of bitmaps. > Adding a way to omit bitmaps entirely would require an INCOMPAT flag, > so it might not be worth it. I agree that it'd be of limited value, compared to the amount of change required. > The way to express contiguous files with smaller extent files would be > to extend the kernel to allow file systems with block_size > page_size > read-only. This would allow you to create a file system with a block > size of 64k, which will reduce the size of the extent maps by a factor > of 16, and it wouldn't be all that hard to teach ext4 to support these > file systems. (The reason why it would be hard for us to support file > systems with block sizes > page size is dealing with page cache when > writing files while allocating blocks, especially when doing random > writes into a sparse file. Read-only would be much easier to > support.) That seems like it would work well for a filesystem that mostly contained larger files, but not a mix of small files and occasional large files. (Imagine, for one possible instance, a normal Linux system's root filesystem with everything on it, plus a single 200GB file.) It's more that I'd like to avoid generating and serving (and having the kernel parse and recurse through) a full set of extent tree blocks; it would be nice if an extent record could represent more than 32k blocks. It'd be even nicer if *inline* extents could represent an arbitrarily long file as long as the file was contiguous. Would it be reasonable, particularly on a read-only filesystem (to avoid the possibility of having to break it into a massive extent tree on the fly), to add a way to represent a completely contiguous or mostly contiguous file using only the 60-byte in-inode region, and no auxiliary extents? Having extents with a 48-bit number of blocks would be helpful, and in theory a contiguous file doesn't exactly need ee_block. > So please, talk to us, and *tell* us what it is you're trying to do > before you try to do it. Don't rely on some implementation detail > where we're not being sufficiently strict in checking for an invalid > file system, especially without telling us in advance and then trying > to hold us to the lack of checking forever because it's "breaking > things that used to work". I'd be happy to start more such conversations in the future. And as mentioned above, I am *not* looking for that level of guarantee of never checking *anything* that used to be checked; I'm looking to collaboratively figure out what the best solution is. I don't want the default assumption to be that the kernel needs to remain perfectly accepting of everything it did in the past, nor do I want the default assumption to be that userspace should always change to meet the new kernel requirements. - Josh Triplett