2020-10-05 17:41:33

by Darrick J. Wong

[permalink] [raw]
Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Mon, Oct 05, 2020 at 01:14:54AM -0700, Josh Triplett wrote:
> Ran into an ext4 regression when testing upgrades to 5.9-rc kernels:
>
> Commit e7bfb5c9bb3d ("ext4: handle add_system_zone() failure in
> ext4_setup_system_zone()") breaks mounting of read-only ext4 filesystems
> with intentionally overlapping bitmap blocks.
>
> On an always-read-only filesystem explicitly marked with
> EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS, prior to that commit, it's safe to
> point all the block and inode bitmaps to a single block

LOL, WHAT?

I didn't know shared blocks applied to fs metadata. I thought that
"shared" only applied to file extent maps being able to share physical
blocks.

Could /somebody/ please document the ondisk format changes that are
associated with this feature?

> of all 1s,
> because a read-only filesystem will never allocate or free any blocks or
> inodes.

All 1s? So the inode bitmap says that every inode table slot is in use,
even if the inode record itself says it isn't? What does e2fsck -n
think about that kind of metadata inconsistency?

Hmm, let's try.

$ truncate -s 300m /tmp/a.img
$ mke2fs -T ext4 -O shared_blocks /tmp/a.img -d /tmp/ -F
mke2fs 1.46~WIP-2020-10-04 (4-Oct-2020)
Invalid filesystem option set: shared_blocks

Oookay. So that's not how you create these shared block ext4s,
apparently...

$ mke2fs -T ext4 /tmp/a.img -F
mke2fs 1.46~WIP-2020-10-04 (4-Oct-2020)
Discarding device blocks: done
Creating filesystem with 76800 4k blocks and 19200 inodes
Filesystem UUID: 0a763191-89ca-49bc-9dc6-bf2986009ad9
Superblock backups stored on blocks:
32768

Allocating group tables: done
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done

$ debugfs -w /tmp/a.img
debugfs 1.45.6 (20-Mar-2020)
debugfs: features shared_blocks
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum shared_blocks
debugfs: set_bg 1 inode_bitmap 42
debugfs: set_bg 1 block_bitmap 39
debugfs: stats
Group 0: block bitmap at 39, inode bitmap at 42, inode table at 45
31517 free blocks, 6389 free inodes, 2 used directories, 6389 unused inodes
[Checksum 0xda06]
Group 1: block bitmap at 39, inode bitmap at 42, inode table at 445
28633 free blocks, 6400 free inodes, 0 used directories, 6400 unused inodes
[Inode not init, Checksum 0x2e69]
$ xfs_io -c "pwrite -S 0xFF $((39 * 4096)) 4096" /tmp/a.img
$ xfs_io -c "pwrite -S 0xFF $((42 * 4096)) 4096" /tmp/a.img

Ok, now we have a shared blocks fs where BG 0 and BG 1 share bitmaps,
and the bitmaps are set to 1.

$ e2fsck -n /tmp/a.img
e2fsck 1.45.6 (20-Mar-2020)
ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
e2fsck: Group descriptors look bad... trying backup blocks...
/tmp/a.img was not cleanly unmounted, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -(1251--32767)
Fix? no

Free blocks count wrong for group #0 (31517, counted=0).
Fix? no

Free blocks count wrong (71414, counted=39897).
Fix? no

Inode bitmap differences: -(12--6400)
Fix? no

Free inodes count wrong for group #0 (6389, counted=0).
Fix? no

Free inodes count wrong (19189, counted=12800).
Fix? no

Padding at end of inode bitmap is not set. Fix? no

Inode bitmap differences: Group 0 inode bitmap does not match checksum.
IGNORED.
Group 1 inode bitmap does not match checksum.
IGNORED.
Group 2 inode bitmap does not match checksum.
IGNORED.
Block bitmap differences: Group 0 block bitmap does not match checksum.
IGNORED.

/tmp/a.img: ********** WARNING: Filesystem still has errors **********

/tmp/a.img: 11/19200 files (0.0% non-contiguous), 5386/76800 blocks

Sooooo... are you shipping ext4 images with an undocumented ondisk
format variation that looks like inconsistency to the standard tools?

--D

>
> However, after that commit, the block validity check rejects such
> filesystems with -EUCLEAN and "failed to initialize system zone (-117)".
> This causes systems that previously worked correctly to fail when
> upgrading to v5.9-rc2 or later.
>
> This was obviously a bugfix, and I'm not suggesting that it should be
> reverted; it looks like this effectively worked by accident before,
> because the block_validity check wasn't fully functional. However, this
> does break real systems, and I'd like to get some kind of regression fix
> in before 5.9 final if possible. I think it would suffice to make
> block_validity default to false if and only if
> EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS is set.
>
> Does that seem like a reasonable fix?
>
> Here's a quick sketch of a patch, which I've tested and confirmed to
> work:
>
> ----- 8< -----
> Subject: [PATCH] Fix ext4 regression in v5.9-rc2 on ro fs with overlapped bitmaps
>
> Commit e7bfb5c9bb3d ("ext4: handle add_system_zone() failure in
> ext4_setup_system_zone()") breaks mounting of read-only ext4 filesystems
> with intentionally overlapping bitmap blocks.
>
> On an always-read-only filesystem explicitly marked with
> EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS, prior to that commit, it's safe to
> point all the block and inode bitmaps to a single block of all 1s,
> because a read-only filesystem will never allocate or free any blocks or
> inodes.
>
> However, after that commit, the block validity check rejects such
> filesystems with -EUCLEAN and "failed to initialize system zone (-117)".
> This causes systems that previously worked correctly to fail when
> upgrading to v5.9-rc2 or later.
>
> Fix this by defaulting block_validity to off when
> EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS is set.
>
> Signed-off-by: Josh Triplett <[email protected]>
> Fixes: e7bfb5c9bb3d ("ext4: handle add_system_zone() failure in ext4_setup_system_zone()")
> ---
> fs/ext4/ext4.h | 2 ++
> fs/ext4/super.c | 3 ++-
> 2 files changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 523e00d7b392..7874028fa864 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1834,6 +1834,7 @@ static inline bool ext4_verity_in_progress(struct inode *inode)
> #define EXT4_FEATURE_RO_COMPAT_METADATA_CSUM 0x0400
> #define EXT4_FEATURE_RO_COMPAT_READONLY 0x1000
> #define EXT4_FEATURE_RO_COMPAT_PROJECT 0x2000
> +#define EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS 0x4000
> #define EXT4_FEATURE_RO_COMPAT_VERITY 0x8000
>
> #define EXT4_FEATURE_INCOMPAT_COMPRESSION 0x0001
> @@ -1930,6 +1931,7 @@ EXT4_FEATURE_RO_COMPAT_FUNCS(bigalloc, BIGALLOC)
> EXT4_FEATURE_RO_COMPAT_FUNCS(metadata_csum, METADATA_CSUM)
> EXT4_FEATURE_RO_COMPAT_FUNCS(readonly, READONLY)
> EXT4_FEATURE_RO_COMPAT_FUNCS(project, PROJECT)
> +EXT4_FEATURE_RO_COMPAT_FUNCS(shared_blocks, SHARED_BLOCKS)
> EXT4_FEATURE_RO_COMPAT_FUNCS(verity, VERITY)
>
> EXT4_FEATURE_INCOMPAT_FUNCS(compression, COMPRESSION)
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index ea425b49b345..f57a7e966e44 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -3954,7 +3954,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> else
> set_opt(sb, ERRORS_RO);
> /* block_validity enabled by default; disable with noblock_validity */
> - set_opt(sb, BLOCK_VALIDITY);
> + if (!ext4_has_feature_shared_blocks(sb))
> + set_opt(sb, BLOCK_VALIDITY);
> if (def_mount_opts & EXT4_DEFM_DISCARD)
> set_opt(sb, DISCARD);
>


2020-10-06 00:18:33

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Mon, Oct 05, 2020 at 10:36:39AM -0700, Darrick J. Wong wrote:
> > Commit e7bfb5c9bb3d ("ext4: handle add_system_zone() failure in
> > ext4_setup_system_zone()") breaks mounting of read-only ext4 filesystems
> > with intentionally overlapping bitmap blocks.
> >
> > On an always-read-only filesystem explicitly marked with
> > EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS, prior to that commit, it's safe to
> > point all the block and inode bitmaps to a single block
>
> LOL, WHAT?
>
> I didn't know shared blocks applied to fs metadata. I thought that
> "shared" only applied to file extent maps being able to share physical
> bloctks.

My understanding matches Darrick's. I was going to track down the
Google engineer who has most recently (as far as I know) enhanced
e2fsprogs's support of the shared block feature (see the commits
returned by "git log --author [email protected] contrib/android") but
he's apparently out of the office today. Hopefully I'll be able to
track him down and ask about this tomorrow.

> Oookay. So that's not how you create these shared block ext4s,
> apparently...

Yeah, they are created by the e2fsdroid program. See sources in
contrib/e2fsdroid. I took a quick look, and I don't see anything
there which is frobbing with with the bitmaps; but perhaps I'm missing
something, which is why I'd ask David to see if he knows anything
about this.

More to the point, if we are have someone who is trying to dedup or
otherwise frob with bitmaps, I suspect this will break "e2fsck -E
unshare_blocks /dev/XXX", which is a way that you can take a root file
system which is using shared_blocks, and turn it into something that
can actually be mount read/write. This is something that I believe
was being used by AOSP "debug" or "userdebug" (I'm a bit fuzzy on the
details) so that Android developers couldn't actually modify the root
file system. (Of course, you have to also disable dm-verity in order
for this to work....)

Unfortunately, e2fsdroid is currently not buildable under the standard
Linux compilation environment. For the reason why, see:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=928551#75

The first step would be to teach e2fsprogs's configure to check for
libsparse, and to link against it if it's available. But before we
could enable this by default for Linux distribution, we need to link
against libsparse using dlopen(), since most distro release engineers
would be.... cranky.... if mke2fs were to drag in some random Android
libraries that have to be installed as shared libraries in their
installers. Which is the point of comment #75 in the above bug.

Since the only use of shared_blocks is for Android, since very few
other projects want a completely read-only root file system, and where
dedup is actually significantly helpful, we've never tried to make
this work outside of the Android context. At least in theory, though,
it might be useful if we could create shared_block file systems using
"mke2fs -O shared_blocks -d /path/to/embedded-root-fs system.img 1G".
Patches gratefully accepted....

Cheers,

- Ted

2020-10-06 00:32:58

by Josh Triplett

[permalink] [raw]
Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Mon, Oct 05, 2020 at 10:36:39AM -0700, Darrick J. Wong wrote:
> On Mon, Oct 05, 2020 at 01:14:54AM -0700, Josh Triplett wrote:
> > Ran into an ext4 regression when testing upgrades to 5.9-rc kernels:
> >
> > Commit e7bfb5c9bb3d ("ext4: handle add_system_zone() failure in
> > ext4_setup_system_zone()") breaks mounting of read-only ext4 filesystems
> > with intentionally overlapping bitmap blocks.
> >
> > On an always-read-only filesystem explicitly marked with
> > EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS, prior to that commit, it's safe to
> > point all the block and inode bitmaps to a single block
>
> LOL, WHAT?
>
> I didn't know shared blocks applied to fs metadata. I thought that
> "shared" only applied to file extent maps being able to share physical
> blocks.

The flag isn't documented very well yet, but since it specifically
forces the filesystem to always be mounted read-only, the bitmaps really
shouldn't matter at all. (In an ideal world, a permanently read-only
filesystem should be able to omit all the bitmaps entirely. Pointing
them all to a single disk block is the next best thing.)

> Could /somebody/ please document the ondisk format changes that are
> associated with this feature?

I pretty much had to sort it out by looking at a combination of
e2fsprogs and the kernel, and a lot of experimentation, until I ended up
with something that the kernel was completely happy with without a
single complaint.

I'd be happy to write up a summary of the format.

I'd still really like to see this patch applied for 5.9, to avoid having
filesystems that an old kernel can mount but a new one can't. This still
seems like a regression to me.

> > of all 1s,
> > because a read-only filesystem will never allocate or free any blocks or
> > inodes.
>
> All 1s? So the inode bitmap says that every inode table slot is in use,
> even if the inode record itself says it isn't?

Yes.

> What does e2fsck -n
> think about that kind of metadata inconsistency?

If you set up the rest of the metadata consistently with it (for
instance, 0 free blocks and 0 free inodes), you'll only get a single
complaint, from the e2fsck equivalent of block_validity. See
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=956509 for details on
that; with that fixed, e2fsck wouldn't complain at all. The kernel,
prior to 5.9-rc2, doesn't have a single complaint, whether at mount,
unmount, or read of every file and directory on the filesystem.

The errors you got in your e2fsck run came because you just overrode the
bitmaps, but didn't make the rest of the metadata consistent with that.
I can provide a sample filesystem if that would help.

- Josh

2020-10-06 02:52:11

by Darrick J. Wong

[permalink] [raw]
Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Mon, Oct 05, 2020 at 05:32:16PM -0700, Josh Triplett wrote:
> On Mon, Oct 05, 2020 at 10:36:39AM -0700, Darrick J. Wong wrote:
> > On Mon, Oct 05, 2020 at 01:14:54AM -0700, Josh Triplett wrote:
> > > Ran into an ext4 regression when testing upgrades to 5.9-rc kernels:
> > >
> > > Commit e7bfb5c9bb3d ("ext4: handle add_system_zone() failure in
> > > ext4_setup_system_zone()") breaks mounting of read-only ext4 filesystems
> > > with intentionally overlapping bitmap blocks.
> > >
> > > On an always-read-only filesystem explicitly marked with
> > > EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS, prior to that commit, it's safe to
> > > point all the block and inode bitmaps to a single block
> >
> > LOL, WHAT?
> >
> > I didn't know shared blocks applied to fs metadata. I thought that
> > "shared" only applied to file extent maps being able to share physical
> > blocks.
>
> The flag isn't documented very well yet, but since it specifically
> forces the filesystem to always be mounted read-only, the bitmaps really
> shouldn't matter at all. (In an ideal world, a permanently read-only
> filesystem should be able to omit all the bitmaps entirely. Pointing
> them all to a single disk block is the next best thing.)

I disagree; creating undocumented forks of an existing ondisk format
(especially one that presents as inconsistent metadata to regular tools)
is creating a ton of pain for future users and maintainers when the
incompat forks collide with the canonical implementation(s).

(Granted, I don't know if you /created/ this new interpretation of the
feature flag or if you've merely been stuck with it...)

I don't say that as a theoretical argument -- you've just crashed right
into this, because nobody knew that the in-kernel block validity doesn't
know how to deal with this other than to error out.

> > Could /somebody/ please document the ondisk format changes that are
> > associated with this feature?
>
> I pretty much had to sort it out by looking at a combination of
> e2fsprogs and the kernel, and a lot of experimentation, until I ended up
> with something that the kernel was completely happy with without a
> single complaint.
>
> I'd be happy to write up a summary of the format.

Seems like a good idea, particularly since you're asking for a format
change that requires kernel support and the ondisk format documentation
lives under Documentation/. That said...

> I'd still really like to see this patch applied for 5.9, to avoid having
> filesystems that an old kernel can mount but a new one can't. This still
> seems like a regression to me.
>
> > > of all 1s,
> > > because a read-only filesystem will never allocate or free any blocks or
> > > inodes.
> >
> > All 1s? So the inode bitmap says that every inode table slot is in use,
> > even if the inode record itself says it isn't?
>
> Yes.
>
> > What does e2fsck -n
> > think about that kind of metadata inconsistency?
>
> If you set up the rest of the metadata consistently with it (for
> instance, 0 free blocks and 0 free inodes), you'll only get a single
> complaint, from the e2fsck equivalent of block_validity. See
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=956509 for details on
> that;

...Ted shot down this whole thing six months ago.

The Debian bug database is /not/ the designated forum to discuss changes
to the ondisk format; linux-ext4 is.

--D

> with that fixed, e2fsck wouldn't complain at all. The kernel,
> prior to 5.9-rc2, doesn't have a single complaint, whether at mount,
> unmount, or read of every file and directory on the filesystem.
>
> The errors you got in your e2fsck run came because you just overrode the
> bitmaps, but didn't make the rest of the metadata consistent with that.
> I can provide a sample filesystem if that would help.
>
> - Josh

2020-10-06 03:19:27

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Mon, Oct 05, 2020 at 07:51:10PM -0700, Darrick J. Wong wrote:
> > > Could /somebody/ please document the ondisk format changes that are
> > > associated with this feature?
> >
> > I pretty much had to sort it out by looking at a combination of
> > e2fsprogs and the kernel, and a lot of experimentation, until I ended up
> > with something that the kernel was completely happy with without a
> > single complaint.
> >
> > I'd be happy to write up a summary of the format.
>
> Seems like a good idea, particularly since you're asking for a format
> change that requires kernel support and the ondisk format documentation
> lives under Documentation/. That said...

> > If you set up the rest of the metadata consistently with it (for
> > instance, 0 free blocks and 0 free inodes), you'll only get a single
> > complaint, from the e2fsck equivalent of block_validity. See
> > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=956509 for details on
> > that;
>
> ...Ted shot down this whole thing six months ago.
>
> The Debian bug database is /not/ the designated forum to discuss changes
> to the ondisk format; linux-ext4 is.

What Josh is proposing I'm pretty sure would also break "e2fsck -E
unshare_blocks", so that's another reason not to accept this as a
valid format change.

As far as I'm concerned, contrib/e2fsdroid is the canonical definition
of how to create valid file systems with shared_blocks.

- Ted

2020-10-06 05:06:37

by Josh Triplett

[permalink] [raw]
Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Mon, Oct 05, 2020 at 11:18:34PM -0400, Theodore Y. Ts'o wrote:
> What Josh is proposing I'm pretty sure would also break "e2fsck -E
> unshare_blocks", so that's another reason not to accept this as a
> valid format change.

The kernel already accepted this as a valid mountable filesystem format,
without a single error or warning of any kind, and has done so stably
for years.

> As far as I'm concerned, contrib/e2fsdroid is the canonical definition
> of how to create valid file systems with shared_blocks.

I'm not trying to create a problem here; I'm trying to address a whole
family of problems. I was generally under the impression that mounting
existing root filesystems fell under the scope of the kernel<->userspace
or kernel<->existing-system boundary, as defined by what the kernel
accepts and existing userspace has used successfully, and that upgrading
the kernel should work with existing userspace and systems. If there's
some other rule that applies for filesystems, I'm not aware of that.
(I'm also not trying to suggest that every random corner case of what
the kernel *could* accept needs to be the format definition, but rather,
cases that correspond to existing userspace.)

It wouldn't be *impossible* to work around this, this time; it may be
possible to adapt the existing userspace to work on the new and old
kernels. My concern is, if a filesystem format accepted by previous
kernels can be rejected by future kernels, what stops a future kernel
from further changing the format definition or its strictness
(co-evolving with one specific userspace) and causing further
regressions?

I don't *want* to rely on what apparently turned out to be an
undocumented bug in the kernel's validator. That's why I was trying to
fix the issue in what seemed like the right way, by detecting the
situation and turning off the validator. That seemed like it would fully
address the issue. If it would help, I could also supply a tiny filesystem
image for regression testing.

I'm trying to figure out what solution you'd like to see here, as long
as it isn't "any userspace that isn't e2fsdroid can be broken at will".
I'd be willing to work to adapt the userspace bits I have to work around
the regression, but I'd like to get this on the radar so this doesn't
happen again.

- Josh Triplett

2020-10-06 06:04:33

by Josh Triplett

[permalink] [raw]
Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Mon, Oct 05, 2020 at 10:03:13PM -0700, Josh Triplett wrote:
> On Mon, Oct 05, 2020 at 11:18:34PM -0400, Theodore Y. Ts'o wrote:
> > What Josh is proposing I'm pretty sure would also break "e2fsck -E
> > unshare_blocks", so that's another reason not to accept this as a
> > valid format change.
>
> The kernel already accepted this as a valid mountable filesystem format,
> without a single error or warning of any kind, and has done so stably
> for years.
>
> > As far as I'm concerned, contrib/e2fsdroid is the canonical definition
> > of how to create valid file systems with shared_blocks.
>
> I'm not trying to create a problem here; I'm trying to address a whole
> family of problems. I was generally under the impression that mounting
> existing root filesystems fell under the scope of the kernel<->userspace
> or kernel<->existing-system boundary, as defined by what the kernel
> accepts and existing userspace has used successfully, and that upgrading
> the kernel should work with existing userspace and systems. If there's
> some other rule that applies for filesystems, I'm not aware of that.
> (I'm also not trying to suggest that every random corner case of what
> the kernel *could* accept needs to be the format definition, but rather,
> cases that correspond to existing userspace.)
>
> It wouldn't be *impossible* to work around this, this time; it may be
> possible to adapt the existing userspace to work on the new and old
> kernels. My concern is, if a filesystem format accepted by previous
> kernels can be rejected by future kernels, what stops a future kernel
> from further changing the format definition or its strictness
> (co-evolving with one specific userspace) and causing further
> regressions?
>
> I don't *want* to rely on what apparently turned out to be an
> undocumented bug in the kernel's validator. That's why I was trying to
> fix the issue in what seemed like the right way, by detecting the
> situation and turning off the validator. That seemed like it would fully
> address the issue. If it would help, I could also supply a tiny filesystem
> image for regression testing.
>
> I'm trying to figure out what solution you'd like to see here, as long
> as it isn't "any userspace that isn't e2fsdroid can be broken at will".
> I'd be willing to work to adapt the userspace bits I have to work around
> the regression, but I'd like to get this on the radar so this doesn't
> happen again.

To clarify something further: I'm genuinely not looking to push hard on
the limits or corners of the kernel/userspace boundary here, nor do I
want to create an imposition on development. I'm happy to attempt to be
a little more flexible than most userspace. I'm trying to make
substantial, non-trivial use of the userspace side of a kernel/userspace
boundary, and within reason, I need to rely on the kernel's stability
guarantees. I'm relying on the combination of
Documentation/filesystems/ext4 and fs/ext4 as the format documentation.
The first time I discovered this issue was in doing some "there's about
to be a new kernel release" regression testing for 5.9, in which it
created a debugging adventure to track down what the problem was. I'd
like to find a good way to report and handle this kind of thing going
forward, if another issue like this arises.

- Josh Triplett

2020-10-06 13:36:30

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Mon, Oct 05, 2020 at 10:03:06PM -0700, Josh Triplett wrote:
>
> I'm not trying to create a problem here; I'm trying to address a whole
> family of problems. I was generally under the impression that mounting
> existing root filesystems fell under the scope of the kernel<->userspace
> or kernel<->existing-system boundary, as defined by what the kernel
> accepts and existing userspace has used successfully, and that upgrading
> the kernel should work with existing userspace and systems. If there's
> some other rule that applies for filesystems, I'm not aware of that.
> (I'm also not trying to suggest that every random corner case of what
> the kernel *could* accept needs to be the format definition, but rather,
> cases that correspond to existing userspace.)

I'm not opposed to the kernel side change; it's *this time*. I'm more
interested in killing off the tool that generated the malformed file
system in the first place. As I keep pointing out, things aren't
going to go well if "e2fsck -E unshare_blocks" is applied to it. So
users who use this unofficial tool to create this file system is can
run into at least this corner case, if not others, and that will
result in, as the UI designers like to say, "a poor user experience".

We had a similar issue with Android. Many years ago, Andy Rubin was
originally quite allergic to the GPL, and had tried to promulgate the
rule, "no GPL in Android Userspace". This is why bionic is used as
libc, and this resulted in Android engineers (I think before the
Google acquisition, but I'm not 100% sure), creating an unofficial,
"unauthorized" make_ext4fs which was a BSD-licensed version of mke2fs.

Unfortuantely, it created file systems which the kernel would never
complain about, but which, 50% of the time, would result in a file
system which under some circumstances, would get corrupted (even more)
when e2fsck attempted to repair the file system. So if a user had a
bit flip caused by an eMMC hiccup, e2fsck could end up making things
worse.

Worse, make_ext4fs had over time, grown extra functionality, such as
pre-setting the SELinux xattrs, such that you couldn't just replace it
with mke2fs. It took *years* to fix the problem, and that's why
contrib/e2fsdroid exists today. We finally, a few years ago, were
able to retire make_ext4fs and replace it with the combination of
mke2fs and e2fsdroid.

So that's why I really don't like it when there are "unauthorized",
unofficial tools creating file systems out there which we are now
obliged to support. Even if it's OK as far as the kernel is
concerned, unless you're planning on forking and/or reimplementing all
of e2fsprogs, and doing so correctly, that way is going to cause
headaches for file system developers.

As far as I'm concerned, it's not just about on-disk file system
format, it's also about the official user space tools. If you create
a file system which the kernel is happy with, but which wasn't created
using the official user space tools, file systems are so full of state
and permutations of how things should be done that the opportunities
for mischief are huge.

And what's especially aggravating is when it's done for petty reasons
--- whether it's trying to sae an extra 0.0003% of storage, or because
some VP was allergic to the GPL, it's stupid stuff.

> I don't *want* to rely on what apparently turned out to be an
> undocumented bug in the kernel's validator. That's why I was trying to
> fix the issue in what seemed like the right way, by detecting the
> situation and turning off the validator. That seemed like it would fully
> address the issue. If it would help, I could also supply a tiny filesystem
> image for regression testing.

I'm OK with working around the problem, and we're lucky that it's this
simple.... this time around.

But can we *please* take your custom tool out back and shoot it in the
head? Like make_ext4fs, it's just going to cause more headaches down
the line.

And perhaps we need to make a policy that makes it clear that for file
systems, it's not just about "whatever the kernel happens to accept".
It should also be, "was it generated using an official version of the
userspace tools", at least as a consideration. Yes, we can try to
make the kernel more strict, and that's a good thing to do, but
inevitably, as we make the kernel more strict, we can potentially
break other unffocial tools out there, and it's going to make it a lot
harder to be able to do backwards compatible format enhancements to
the file system.

- Ted

2020-10-07 08:22:29

by Josh Triplett

[permalink] [raw]
Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Tue, Oct 06, 2020 at 09:35:33AM -0400, Theodore Y. Ts'o wrote:
> On Mon, Oct 05, 2020 at 10:03:06PM -0700, Josh Triplett wrote:
> > I'm not trying to create a problem here; I'm trying to address a whole
> > family of problems. I was generally under the impression that mounting
> > existing root filesystems fell under the scope of the kernel<->userspace
> > or kernel<->existing-system boundary, as defined by what the kernel
> > accepts and existing userspace has used successfully, and that upgrading
> > the kernel should work with existing userspace and systems. If there's
> > some other rule that applies for filesystems, I'm not aware of that.
> > (I'm also not trying to suggest that every random corner case of what
> > the kernel *could* accept needs to be the format definition, but rather,
> > cases that correspond to existing userspace.)
>
> I'm not opposed to the kernel side change; it's *this time*.

I appreciate that. (Does that mean you wouldn't object to this patch
going into 5.9, with Jens' Reviewed-by and your ack?)

> I'm more interested in killing off the tool that generated the
> malformed file system in the first place.

It was developed for a reason, and it's not intended for general-purpose
use. It serves its purpose, which has nothing to do with e2fsprogs or
any component of e2fsprogs. It does not simply create filesystem images,
in the way that mke2fs or e2fsdroid does.

If I'd found any existing code that would have worked, or could have
been adapted to work, I would have happily reused it; I don't like
reinventing the wheel.

Also, see below regarding e2fsck.

> As I keep pointing out, things aren't going to go well if "e2fsck -E
> unshare_blocks" is applied to it.

These aren't intended to *ever* become writable. I've ensured that
they're marked with EXT4_FEATURE_RO_COMPAT_READONLY as well, which I
would hope implies that. Would that be sufficient to convince e2fsck
that it should never attempt to apply unshare_blocks to it?

Also, it seems like most of block_validity is trying to carefully avoid
metadata blocks intersecting with the journal, which could cause all
manner of issues; the filesystems I'm working with have no journal
(because they're permanently read-only).

With the sole exception of the shared bitmap block, I have it as a
requirement to pass e2fsck with zero complaints. If you'd be willing to
take a patch to e2fsck along the same lines as this kernel patch, then
I'd be happy to add it to the regression test suite: no filesystem
images that e2fsck is unhappy with. Would that help?

As mentioned before, I'd also be happy to supply some representative
filesystem images.

> We had a similar issue with Android. Many years ago, Andy Rubin was
> originally quite allergic to the GPL, and had tried to promulgate the
> rule, "no GPL in Android Userspace". This is why bionic is used as
> libc, and this resulted in Android engineers (I think before the
> Google acquisition, but I'm not 100% sure), creating an unofficial,
> "unauthorized" make_ext4fs which was a BSD-licensed version of mke2fs.

That is indeed a sad reason to develop anything. It's also particularly
sad to develop a different tool to serve exactly the same purpose as an
existing tool.

As opposed to, in this case, developing a different piece of software
to serve different purposes, that happens to make use of ext4 as one
component.

> Unfortuantely, it created file systems which the kernel would never
> complain about, but which, 50% of the time, would result in a file
> system which under some circumstances, would get corrupted (even more)
> when e2fsck attempted to repair the file system. So if a user had a
> bit flip caused by an eMMC hiccup, e2fsck could end up making things
> worse.

I'd be interested in reading more about that, if you could suggest a
link or the right search terms. I did find
https://bugzilla.kernel.org/show_bug.cgi?id=195561 , which references the
issues in the same way you've done here, but I didn't find any of the
specific details you're mentioning here about what made the images it
created fragile.

I did see you mention, there, the advice of making sure to check
filesystems with e2fsck. That's something I've done from day one: I
didn't stop until e2fsck was happy, not just the kernel.

> So that's why I really don't like it when there are "unauthorized",
> unofficial tools creating file systems out there which we are now
> obliged to support.

ext4 is an incredibly powerful and performant filesystem, with a
reasonably well-documented format and many useful properties. Not all
software that works with ext4 is going to live in e2fsprogs, nor should
it need to.

This would be analogous to the notion that (say) the FreeBSD kernel
belongs in the e2fsprogs repository because it has an ext4 driver. The
tail would be wagging the dog.

> Even if it's OK as far as the kernel is
> concerned, unless you're planning on forking and/or reimplementing all
> of e2fsprogs, and doing so correctly, that way is going to cause
> headaches for file system developers.

I haven't recreated or reimplemented e2fsprogs, and I have no desire to
do so. It seems like, as a result of the make_ext4fs debacle, you're
seeing any other software working with ext4 as an instance of
reinventing the wheel (and failing to make that wheel round, because
hey, it still rolls). That's not the case here.

> > I don't *want* to rely on what apparently turned out to be an
> > undocumented bug in the kernel's validator. That's why I was trying to
> > fix the issue in what seemed like the right way, by detecting the
> > situation and turning off the validator. That seemed like it would fully
> > address the issue. If it would help, I could also supply a tiny filesystem
> > image for regression testing.
>
> I'm OK with working around the problem, and we're lucky that it's this
> simple.... this time around.
>
> But can we *please* take your custom tool out back and shoot it in the
> head?

Nope. As mentioned, this isn't about creating ext4 filesystem images,
and it isn't even remotely similar to mke2fs.

> And perhaps we need to make a policy that makes it clear that for file
> systems, it's not just about "whatever the kernel happens to accept".
> It should also be, "was it generated using an official version of the
> userspace tools", at least as a consideration.

I don't think that's a reasonable policy at all, no.

"Should pass e2fsck" would be a relatively reasonable policy, assuming
that e2fsck doesn't ratchet up strictness in a way that breaks existing
filesystems. I think it'd be reasonable to recommend against trying to
push the boundaries of what the kernel accepts but e2fsck doesn't,
without a concrete plan to improve e2fsck and the filesystem
documentation.

- Josh Triplett

2020-10-07 14:34:00

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Wed, Oct 07, 2020 at 01:03:04AM -0700, Josh Triplett wrote:
> > But can we *please* take your custom tool out back and shoot it in the
> > head?
>
> Nope. As mentioned, this isn't about creating ext4 filesystem images,
> and it isn't even remotely similar to mke2fs.

Can you please tell us what your tool is for, then? Why are you doing
this? Why are you inflicting this on us?

And if the answer is nope, I can't, it's propietary, sorry.... then I
think we should treat this the same way we treat proprietary kernel
modules.

- Ted

2020-10-07 20:30:57

by Josh Triplett

[permalink] [raw]
Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Wed, Oct 07, 2020 at 10:32:11AM -0400, Theodore Y. Ts'o wrote:
> On Wed, Oct 07, 2020 at 01:03:04AM -0700, Josh Triplett wrote:
> > > But can we *please* take your custom tool out back and shoot it in the
> > > head?
> >
> > Nope. As mentioned, this isn't about creating ext4 filesystem images,
> > and it isn't even remotely similar to mke2fs.
>
> Can you please tell us what your tool is for, then? Why are you doing
> this? Why are you inflicting this on us?

That sounds like a conversation that would have been a lot more
interesting and enjoyable if it hadn't started with "can we shoot it in
the head", and continued with the notion that anything other than
e2fsprogs making something to be mounted by mount(2) and handled by
fs/ext4 is being "inflicted", and if the goal didn't still seem to be
"how do we make it go away so that only e2fsprogs and the kernel ever
touch ext4". I started this thread because I'd written some userspace
code, a new version of the kernel made that userspace code stop working,
so I wanted to report that the moment I'd discovered that, along with a
potential way to address it with as little disruption to ext4 as
possible.

I'm not looking to be an alternative userspace, or an alternative
anything; that implies serving more-or-less the same function
differently. I have no interest in supplanting mke2fs or any other part
of e2fsprogs; I'm using those for many of the purposes they already
serve.

The short version is that I needed a library to rapidly turn
dynamically-obtained data into a set of disk blocks to be served
on-the-fly as a software-defined disk, and then mounted on the other
side of that interface by the Linux kernel. Turns out that's *many
orders of magnitude* faster than any kind of network filesystem like
NFS. It's slightly similar to a vvfat for ext4. The less blocks it can
generate and account for and cache, the faster it can run, and
microseconds matter.

ext4 has *incredible* compatibility guarantees. I'd already understood
the whole compat/ro_compat mechanism when I read through the on-disk
format documentation and the code. RO_COMPAT_SHARED_BLOCKS *seemed* like
the right semantic description of "don't ever try to write to this
filesystem because there are deduplicated blocks", and
RO_COMPAT_READONLY seemed like the right semantic description for "don't
ever try to write this, it's permanently read-only for unspecified
reasons".

If those aren't the right way to express that, I could potentially
adapt. I had a similar such conversation on linux-ext4 already (about
inline data with 128-bit inodes), which led to me choosing to abandon
128-byte inodes rather than try to get ext4 to support what I wanted
with them, because I didn't want to be disruptive to ext4 for a niche
use case. In the particular case that motivated this thread, what I was
doing already worked in previous kernels, and it seemed reasonable to
ask for it to continue to work in new kernels, while preserving the
newly added checks in the new kernels.

If the response here had been more along the lines of "could we create
and use a *different* compat flag for shared metadata and have
RO_COMPAT_SHARED_BLOCKS only mean shared data blocks", I'd be fine with
that.

If at some point I'm looking to make ext4 support more than it already
does (e.g. a way to omit bitmaps entirely, or a way to express
contiguous files with smaller extent maps, or other enhancements for
read-only filesystems), I'd be coming with architectural discussions
first, patches second, and at no point would I have the expectation that
ext4 folks need to do extra work on my behalf. I'm happy to do the work.
The *only* thing I'm asking, here, is "don't break things that worked".
And after this particular item, I'd be happy to narrow that to "don't
break things that e2fsck was previously happy with".

- Josh Triplett

2020-10-08 02:13:25

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Wed, Oct 07, 2020 at 01:14:24PM -0700, Josh Triplett wrote:
>
> That sounds like a conversation that would have been a lot more
> interesting and enjoyable if it hadn't started with "can we shoot it in
> the head", and continued with the notion that anything other than
> e2fsprogs making something to be mounted by mount(2) and handled by
> fs/ext4 is being "inflicted", and if the goal didn't still seem to be
> "how do we make it go away so that only e2fsprogs and the kernel ever
> touch ext4". I started this thread because I'd written some userspace
> code, a new version of the kernel made that userspace code stop working,
> so I wanted to report that the moment I'd discovered that, along with a
> potential way to address it with as little disrupton to ext4 as
> possible.

What is really getting my dander up is your attempt to claim that the
on-disk file system format is like the userspace/kernel interface,
where if we break any file system that file system that was
"previously accepted by an older kernel", this is a bug that must be
reverted or otherwise fixed to allow file systems that had previously
worked, to continue to work. And this is true even if the file system
is ***invalid***.

And the problem with this is that there have been any number of
commits where file systems which were previously invalid, but which
could be caused to trigger a syzbot whine, which was fixed by
tightening up the validity tests in the kernel. In some cases, I had
to also had to fix up e2fsck to detect the invalid file system which
was generated by the file system fuzzer. Yes, it's unfortunate that
we didn't have these checks earlier, but a file system has a huge
amount of state.

The principle you've articulated would make it impossible for me to
fix these bugs, unless I can prove that the failure to check a
particular invalid file system corruption could lead to a security
vulnerability. (Would it be OK for me to make the kernel more strict
and reject an invalid file system if it triggers a WARN_ON, so I get
the syzbot complaint, but it doesn't actually cause a security issue?)

So this conversation would have been a lot more pleasant for *me* if
you hadn't tried to elevate your request to a general principle, where
if someone is deliberately generating an invalid file system, I'm not
allowed to make the kernel more strict to detect said invalidity and
to reject the invalid / corrupted / fuzzed file system.

And note that sometimes the security problem happens when there are
multiple file system corruptions that are chained together. So
enabling block validity *can* sometimes prevent the fuzzed file system
from proceeding further. Granted, this is less likely in the case of
a read-only file system, but it really worries me when there are
proprietary programs (maybe your library isn't proprietary, but I note
you haven't send me a link to your git repo, but instead have offered
sending sample file systems) which insist on generating their own file
systems, which might or might not be valid, and then expecting them to
receive first class support as part of an iron-bound contract where
I'm not even allowed to add stronger sanity checks which might reject
said invalid file system in the future.

> The short version is that I needed a library to rapidly turn
> dynamically-obtained data into a set of disk blocks to be served
> on-the-fly as a software-defined disk, and then mounted on the other
> side of that interface by the Linux kernel. Turns out that's *many
> orders of magnitude* faster than any kind of network filesystem like
> NFS. It's slightly similar to a vvfat for ext4. The less blocks it can
> generate and account for and cache, the faster it can run, and
> microseconds matter.

So are you actually trying to dedup data blocks, or are you just
trying to avoid needing to track the block allocation bitmaps? And
are you just writing a single file, or multiple files? Do you know
what the maximum size of the file or files will be? Do you need a
complex directory structure, or just a single root directory? Can the
file system be sparse?

So for example, you can do something like this, which puts all of the
metadata at beginning of the file system, and then you could write to
contiguous data blocks. Add the following in mke2fs.conf:

[fs_types]
hugefile = {
features = extent,huge_file,bigalloc,flex_bg,uninit_bg,dir_nlink,extra_isize,^resize_inode,sparse_super2
cluster_size = 32768
hash_alg = half_md4
reserved_ratio = 0.0
num_backup_sb = 0
packed_meta_blocks = 1
make_hugefiles = 1
inode_ratio = 4194304
hugefiles_dir = /storage
hugefiles_name = huge-file
hugefiles_digits = 0
hugefiles_size = 0
hugefiles_align = 256M
hugefiles_align_disk = true
num_hugefiles = 1
zero_hugefiles = false
inode_size = 128
}

hugefiles = {
features = extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize,^resize_inode,sparse_super2
hash_alg = half_md4
reserved_ratio = 0.0
num_backup_sb = 0
packed_meta_blocks = 1
make_hugefiles = 1
inode_ratio = 4194304
hugefiles_dir = /storage
hugefiles_name = chunk-
hugefiles_digits = 5
hugefiles_size = 4G
hugefiles_align = 256M
hugefiles_align_disk = true
zero_hugefiles = false
flex_bg_size = 262144
inode_size = 128
}

... and then run "mke2fs -T hugefile /tmp/image 1T" or "mke2fs -T
hugefiles /tmp/image 1T", and see what you get. In the case of
hugefile, you'll see a single file which covers the entire storage
device. Because we are using bigalloc with a large cluster size, this
minimizes the number of bitmap blocks.

With hugefiles, it will create a set of 4G files to fill the size of
the disk, again, aligned to 256 MiB zones at the beginning of the
disk. In both cases, the file or files are aligned to 256 MiB
relative to beginning of the disk, which can be handy if you are
creating the file system, on, say, a 14T SMR disk. And this is a
niche use case if there ever was one! :-)

So if you had come to the ext4 list with a set of requirements, it
could have been that we could have come up with something which uses
the existing file system features, or come up with something which
would have been more specific --- and more importantly, we'd know what
the semantics were of various on-disk file system formats that people
are depending upon.

> If at some point I'm looking to make ext4 support more than it already
> does (e.g. a way to omit bitmaps entirely, or a way to express
> contiguous files with smaller extent maps, or other enhancements for
> read-only filesystems),

See above for a way to significantly reduce the number of bitmaps.
Adding a way to omit bitmaps entirely would require an INCOMPAT flag,
so it might not be worth it.

The way to express contiguous files with smaller extent files would be
to extend the kernel to allow file systems with block_size > page_size
read-only. This would allow you to create a file system with a block
size of 64k, which will reduce the size of the extent maps by a factor
of 16, and it wouldn't be all that hard to teach ext4 to support these
file systems. (The reason why it would be hard for us to support file
systems with block sizes > page size is dealing with page cache when
writing files while allocating blocks, especially when doing random
writes into a sparse file. Read-only would be much easier to
support.)

So please, talk to us, and *tell* us what it is you're trying to do
before you try to do it. Don't rely on some implementation detail
where we're not being sufficiently strict in checking for an invalid
file system, especially without telling us in advance and then trying
to hold us to the lack of checking forever because it's "breaking
things that used to work".

Cheers,

- Ted

2020-10-08 03:12:42

by Andreas Dilger

[permalink] [raw]
Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Oct 7, 2020, at 2:14 PM, Josh Triplett <[email protected]> wrote:
> If those aren't the right way to express that, I could potentially
> adapt. I had a similar such conversation on linux-ext4 already (about
> inline data with 128-bit inodes), which led to me choosing to abandon
> 128-byte inodes rather than try to get ext4 to support what I wanted
> with them, because I didn't want to be disruptive to ext4 for a niche
> use case. In the particular case that motivated this thread, what I was
> doing already worked in previous kernels, and it seemed reasonable to
> ask for it to continue to work in new kernels, while preserving the
> newly added checks in the new kernels.

This was discussed in the "Inline data with 128-byte inodes?" thread
back in May. While Jan was not necessarily in favour of this, I was
actually OK with improving the ext4 code to handle this case better,
since it would (at minimum) clean up ext4 to make a clear separation
of how it is detecting data in the i_block[] array and the system.data
xattr, and I don't think it added any complexity to the code.

I even posted a WIP patch to that effect, but didn't get a response back:
https://marc.info/?l=linux-ext4&m=158863275019187

I *do* think that inline_data is an under-appreciated feature that I
would be happy to see some improvements with. I don't think that small
files are a niche use case, and if we can clean up the inline_data code
to work with 128-byte inodes I'm not against that, even though I'm not
going to use that combination of features myself.

Cheers, Andreas






Attachments:
signature.asc (890.00 B)
Message signed with OpenPGP

2020-10-08 18:50:46

by Darrick J. Wong

[permalink] [raw]
Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Wed, Oct 07, 2020 at 10:10:17PM -0400, Theodore Y. Ts'o wrote:
> On Wed, Oct 07, 2020 at 01:14:24PM -0700, Josh Triplett wrote:
> >
> > That sounds like a conversation that would have been a lot more
> > interesting and enjoyable if it hadn't started with "can we shoot it in

This whole conversation would have been a lot less tense and defensive
had I not started by thinking "Did Ted quietly slip more meaning into
SHARED_BLOCKS (i.e. shared metadata blocks) than I had previously known
about?" and then "No, e2fsprogs is still the same as it always was" and
finally "OH, Josh wrote some weird userspace tool that we've never heard
of which makes design assumptions that he's being cagey about".

To reiterate what Ted said: General purpose filesystems are massively
complex beasts. The kernel can do simple spot checks (checksums, record
validation) but at least with XFS and ext4, we defer the larger
relational checks between data structures (if this block is marked free,
is it unclaimed by the inode table and all file block maps?) to
userspace fsck.

IMO, the prominent free software filesystem projects offer (at least)
four layers of social structures to keep everyone on the same page,
including people writing competing implementations. The first is "Does
everything we write still work with the kernel", which I guess you
clearly did since you're complaining about this change in 5.9-rc2.

The second layer is "Does it pass fsck?" which, given that you're asking
for changes to e2fsck elsewhere in this thread and I couldn't figure out
how to generate a shared-bitmaps ext4 fs that didn't also cause e2fsck
to complain about the overlap doesn't make me confident that you went
beyond the first step before shipping something.

The third layer is consulting the ondisk format documentation to see if
it points out anything that the kernel or fsck didn't notice. That
one's kind of on Ted for not updating Documentation/ to spell out what
SHARED_BLOCKS actually means, but that just leads me to the fourth thing.

The fourth layer is emailing the list to ask "Hey, I was thinking of
___, does anyone see a problem with my interpretation?". That clearly
hasn't been done for shared bitmaps until now, which is all the more
strange since you /did/ ask linux-ext4 about the inline data topic
months ago.

ext* and XFS have been around for 25 years. The software validation of
both is imperfect and incomplete because the complexity of the ondisk
formats is vast and we haven't caught up with the technical debt that
resulted from not writing rigorous checks that would have been very
difficult with mid-90s hardware. I know, because I've been writing
online checking for XFS and have seen the wide the gap between what the
existing software looks for and what the ondisk format documentation
allows.

The only chance that we as a community have at managing the complexity
of a filesystem is to use those four tools I outlined above to try to
keep the mental models of everyone who participates in close enough
alignment. That's how we usually avoid discussions that end in rancor
and confusion.

That was a very long way of reiterating "If you're going to interpret
aspects of the software, please come talk to us before you start writing
code". That is key to ext4's long history of compatibility, because a
project cannot maintain continuity when actors develop conflicting code
in the shadows. Look at all the mutations FAT and UFS that have
appeared over the years-- the codebases became a mess as a result.

> > the head", and continued with the notion that anything other than
> > e2fsprogs making something to be mounted by mount(2) and handled by
> > fs/ext4 is being "inflicted", and if the goal didn't still seem to be
> > "how do we make it go away so that only e2fsprogs and the kernel ever
> > touch ext4". I started this thread because I'd written some userspace
> > code, a new version of the kernel made that userspace code stop working,
> > so I wanted to report that the moment I'd discovered that, along with a
> > potential way to address it with as little disrupton to ext4 as
> > possible.

Ted: I don't think it's a good idea to make SHARED_BLOCKS disable block
validity checking by default. You could someday enable users to write
to block-sharing files by teaching ext4 how to COW, at which point you'd
need correct bitmaps and block validity to prevent accidental overwrite
of critical metadata. At that point you'd either be stuck with the
precedent that SHARED_BLOCKS implies noblock_validity, or you'd end up
breaking Josh's images again.

Frankly, Josh, if you're not going to show us /your/ code and/or
creating an incompat flag for shared-bitmaps, then just set
"noblock_validity" in the superblock mount options field of the images
you create.

--D

> What is really getting my dander up is your attempt to claim that the
> on-disk file system format is like the userspace/kernel interface,
> where if we break any file system that file system that was
> "previously accepted by an older kernel", this is a bug that must be
> reverted or otherwise fixed to allow file systems that had previously
> worked, to continue to work. And this is true even if the file system
> is ***invalid***.
>
> And the problem with this is that there have been any number of
> commits where file systems which were previously invalid, but which
> could be caused to trigger a syzbot whine, which was fixed by
> tightening up the validity tests in the kernel. In some cases, I had
> to also had to fix up e2fsck to detect the invalid file system which
> was generated by the file system fuzzer. Yes, it's unfortunate that
> we didn't have these checks earlier, but a file system has a huge
> amount of state.
>
> The principle you've articulated would make it impossible for me to
> fix these bugs, unless I can prove that the failure to check a
> particular invalid file system corruption could lead to a security
> vulnerability. (Would it be OK for me to make the kernel more strict
> and reject an invalid file system if it triggers a WARN_ON, so I get
> the syzbot complaint, but it doesn't actually cause a security issue?)
>
> So this conversation would have been a lot more pleasant for *me* if
> you hadn't tried to elevate your request to a general principle, where
> if someone is deliberately generating an invalid file system, I'm not
> allowed to make the kernel more strict to detect said invalidity and
> to reject the invalid / corrupted / fuzzed file system.
>
> And note that sometimes the security problem happens when there are
> multiple file system corruptions that are chained together. So
> enabling block validity *can* sometimes prevent the fuzzed file system
> from proceeding further. Granted, this is less likely in the case of
> a read-only file system, but it really worries me when there are
> proprietary programs (maybe your library isn't proprietary, but I note
> you haven't send me a link to your git repo, but instead have offered
> sending sample file systems) which insist on generating their own file
> systems, which might or might not be valid, and then expecting them to
> receive first class support as part of an iron-bound contract where
> I'm not even allowed to add stronger sanity checks which might reject
> said invalid file system in the future.
>
> > The short version is that I needed a library to rapidly turn
> > dynamically-obtained data into a set of disk blocks to be served
> > on-the-fly as a software-defined disk, and then mounted on the other
> > side of that interface by the Linux kernel. Turns out that's *many
> > orders of magnitude* faster than any kind of network filesystem like
> > NFS. It's slightly similar to a vvfat for ext4. The less blocks it can
> > generate and account for and cache, the faster it can run, and
> > microseconds matter.
>
> So are you actually trying to dedup data blocks, or are you just
> trying to avoid needing to track the block allocation bitmaps? And
> are you just writing a single file, or multiple files? Do you know
> what the maximum size of the file or files will be? Do you need a
> complex directory structure, or just a single root directory? Can the
> file system be sparse?
>
> So for example, you can do something like this, which puts all of the
> metadata at beginning of the file system, and then you could write to
> contiguous data blocks. Add the following in mke2fs.conf:
>
> [fs_types]
> hugefile = {
> features = extent,huge_file,bigalloc,flex_bg,uninit_bg,dir_nlink,extra_isize,^resize_inode,sparse_super2
> cluster_size = 32768
> hash_alg = half_md4
> reserved_ratio = 0.0
> num_backup_sb = 0
> packed_meta_blocks = 1
> make_hugefiles = 1
> inode_ratio = 4194304
> hugefiles_dir = /storage
> hugefiles_name = huge-file
> hugefiles_digits = 0
> hugefiles_size = 0
> hugefiles_align = 256M
> hugefiles_align_disk = true
> num_hugefiles = 1
> zero_hugefiles = false
> inode_size = 128
> }
>
> hugefiles = {
> features = extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize,^resize_inode,sparse_super2
> hash_alg = half_md4
> reserved_ratio = 0.0
> num_backup_sb = 0
> packed_meta_blocks = 1
> make_hugefiles = 1
> inode_ratio = 4194304
> hugefiles_dir = /storage
> hugefiles_name = chunk-
> hugefiles_digits = 5
> hugefiles_size = 4G
> hugefiles_align = 256M
> hugefiles_align_disk = true
> zero_hugefiles = false
> flex_bg_size = 262144
> inode_size = 128
> }
>
> ... and then run "mke2fs -T hugefile /tmp/image 1T" or "mke2fs -T
> hugefiles /tmp/image 1T", and see what you get. In the case of
> hugefile, you'll see a single file which covers the entire storage
> device. Because we are using bigalloc with a large cluster size, this
> minimizes the number of bitmap blocks.
>
> With hugefiles, it will create a set of 4G files to fill the size of
> the disk, again, aligned to 256 MiB zones at the beginning of the
> disk. In both cases, the file or files are aligned to 256 MiB
> relative to beginning of the disk, which can be handy if you are
> creating the file system, on, say, a 14T SMR disk. And this is a
> niche use case if there ever was one! :-)
>
> So if you had come to the ext4 list with a set of requirements, it
> could have been that we could have come up with something which uses
> the existing file system features, or come up with something which
> would have been more specific --- and more importantly, we'd know what
> the semantics were of various on-disk file system formats that people
> are depending upon.
>
> > If at some point I'm looking to make ext4 support more than it already
> > does (e.g. a way to omit bitmaps entirely, or a way to express
> > contiguous files with smaller extent maps, or other enhancements for
> > read-only filesystems),
>
> See above for a way to significantly reduce the number of bitmaps.
> Adding a way to omit bitmaps entirely would require an INCOMPAT flag,
> so it might not be worth it.
>
> The way to express contiguous files with smaller extent files would be
> to extend the kernel to allow file systems with block_size > page_size
> read-only. This would allow you to create a file system with a block
> size of 64k, which will reduce the size of the extent maps by a factor
> of 16, and it wouldn't be all that hard to teach ext4 to support these
> file systems. (The reason why it would be hard for us to support file
> systems with block sizes > page size is dealing with page cache when
> writing files while allocating blocks, especially when doing random
> writes into a sparse file. Read-only would be much easier to
> support.)
>
> So please, talk to us, and *tell* us what it is you're trying to do
> before you try to do it. Don't rely on some implementation detail
> where we're not being sufficiently strict in checking for an invalid
> file system, especially without telling us in advance and then trying
> to hold us to the lack of checking forever because it's "breaking
> things that used to work".
>
> Cheers,
>
> - Ted

2020-10-08 19:20:31

by Josh Triplett

[permalink] [raw]
Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Wed, Oct 07, 2020 at 08:57:12PM -0600, Andreas Dilger wrote:
> On Oct 7, 2020, at 2:14 PM, Josh Triplett <[email protected]> wrote:
> > If those aren't the right way to express that, I could potentially
> > adapt. I had a similar such conversation on linux-ext4 already (about
> > inline data with 128-bit inodes), which led to me choosing to abandon
> > 128-byte inodes rather than try to get ext4 to support what I wanted
> > with them, because I didn't want to be disruptive to ext4 for a niche
> > use case. In the particular case that motivated this thread, what I was
> > doing already worked in previous kernels, and it seemed reasonable to
> > ask for it to continue to work in new kernels, while preserving the
> > newly added checks in the new kernels.
>
> This was discussed in the "Inline data with 128-byte inodes?" thread
> back in May. While Jan was not necessarily in favour of this, I was
> actually OK with improving the ext4 code to handle this case better,
> since it would (at minimum) clean up ext4 to make a clear separation
> of how it is detecting data in the i_block[] array and the system.data
> xattr, and I don't think it added any complexity to the code.
>
> I even posted a WIP patch to that effect, but didn't get a response back:
> https://marc.info/?l=linux-ext4&m=158863275019187

My apologies, I thought I responded to that. It looks promising to me,
though I wouldn't have the bandwidth to take it to completion anytime
soon.

> I *do* think that inline_data is an under-appreciated feature that I
> would be happy to see some improvements with. I don't think that small
> files are a niche use case, and if we can clean up the inline_data code
> to work with 128-byte inodes I'm not against that, even though I'm not
> going to use that combination of features myself.

I'd love to see that happen. At the time, it seemed like too large of a
change to block on, which is why I ended up deciding to switch to
256-byte inodes.

2020-10-08 19:27:11

by Andreas Dilger

[permalink] [raw]
Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Oct 8, 2020, at 1:12 PM, Josh Triplett <[email protected]> wrote:
>
> On Wed, Oct 07, 2020 at 08:57:12PM -0600, Andreas Dilger wrote:
>> On Oct 7, 2020, at 2:14 PM, Josh Triplett <[email protected]> wrote:
>>> If those aren't the right way to express that, I could potentially
>>> adapt. I had a similar such conversation on linux-ext4 already (about
>>> inline data with 128-bit inodes), which led to me choosing to abandon
>>> 128-byte inodes rather than try to get ext4 to support what I wanted
>>> with them, because I didn't want to be disruptive to ext4 for a niche
>>> use case. In the particular case that motivated this thread, what I was
>>> doing already worked in previous kernels, and it seemed reasonable to
>>> ask for it to continue to work in new kernels, while preserving the
>>> newly added checks in the new kernels.
>>
>> This was discussed in the "Inline data with 128-byte inodes?" thread
>> back in May. While Jan was not necessarily in favour of this, I was
>> actually OK with improving the ext4 code to handle this case better,
>> since it would (at minimum) clean up ext4 to make a clear separation
>> of how it is detecting data in the i_block[] array and the system.data
>> xattr, and I don't think it added any complexity to the code.
>>
>> I even posted a WIP patch to that effect, but didn't get a response back:
>> https://marc.info/?l=linux-ext4&m=158863275019187
>
> My apologies, I thought I responded to that. It looks promising to me,
> though I wouldn't have the bandwidth to take it to completion anytime
> soon.

NP, I don't have bandwidth to work on it right now either.

>> I *do* think that inline_data is an under-appreciated feature that I
>> would be happy to see some improvements with. I don't think that small
>> files are a niche use case, and if we can clean up the inline_data code
>> to work with 128-byte inodes I'm not against that, even though I'm not
>> going to use that combination of features myself.
>
> I'd love to see that happen. At the time, it seemed like too large of a
> change to block on, which is why I ended up deciding to switch to
> 256-byte inodes.

Does that mean you are using inline_data with 256-byte inodes? That would
also be good to know, since there haven't been any well-known users of
this feature so far (AFAIK). Since you are using this in a read-only
manner, you won't hit the one know issue when an inline_data inode is
extended to use an external block that may temporarily leave the inode
in an inconsistent state.

Cheers, Andreas






Attachments:
signature.asc (890.00 B)
Message signed with OpenPGP

2020-10-09 00:20:03

by Josh Triplett

[permalink] [raw]
Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Wed, Oct 07, 2020 at 10:10:17PM -0400, Theodore Y. Ts'o wrote:
> On Wed, Oct 07, 2020 at 01:14:24PM -0700, Josh Triplett wrote:
> > That sounds like a conversation that would have been a lot more
> > interesting and enjoyable if it hadn't started with "can we shoot it in
> > the head", and continued with the notion that anything other than
> > e2fsprogs making something to be mounted by mount(2) and handled by
> > fs/ext4 is being "inflicted", and if the goal didn't still seem to be
> > "how do we make it go away so that only e2fsprogs and the kernel ever
> > touch ext4". I started this thread because I'd written some userspace
> > code, a new version of the kernel made that userspace code stop working,
> > so I wanted to report that the moment I'd discovered that, along with a
> > potential way to address it with as little disrupton to ext4 as
> > possible.
>
> What is really getting my dander up is your attempt to claim that the
> on-disk file system format is like the userspace/kernel interface,
> where if we break any file system that file system that was
> "previously accepted by an older kernel", this is a bug that must be
> reverted or otherwise fixed to allow file systems that had previously
> worked, to continue to work. And this is true even if the file system
> is ***invalid***.
>
> And the problem with this is that there have been any number of
> commits where file systems which were previously invalid, but which
> could be caused to trigger a syzbot whine, which was fixed by
> tightening up the validity tests in the kernel. In some cases, I had
> to also had to fix up e2fsck to detect the invalid file system which
> was generated by the file system fuzzer. Yes, it's unfortunate that
> we didn't have these checks earlier, but a file system has a huge
> amount of state.
>
> The principle you've articulated would make it impossible for me to
> fix these bugs, unless I can prove that the failure to check a
> particular invalid file system corruption could lead to a security
> vulnerability. (Would it be OK for me to make the kernel more strict
> and reject an invalid file system if it triggers a WARN_ON, so I get
> the syzbot complaint, but it doesn't actually cause a security issue?)
>
> So this conversation would have been a lot more pleasant for *me* if
> you hadn't tried to elevate your request to a general principle, where
> if someone is deliberately generating an invalid file system, I'm not
> allowed to make the kernel more strict to detect said invalidity and
> to reject the invalid / corrupted / fuzzed file system.

I appreciate you providing this explanation; this makes the nature and
severity of your concern *much* more clear. I also now understand why I
was feeling confused, and why it felt several times like we were talking
past each other; see below. For my part, I had no desire to have
generated this level of concern. The restrictions you're describing
above are far, far past anything I had possibly envisioned or desired. I
want to clarify a couple of things.

I wasn't trying to make a *new* general principle or policy. I was under
the impression that this *was* the policy, because it never occurred to
me that it could be otherwise. It seemed like a natural aspect of the
kernel/userspace boundary, to the point that the idea of it *not* being
part of the kernel's stability guarantees didn't cross my mind. Thus,
when you were suggesting a policy of "only filesystems generated by the
e2fsprogs userspace tools are acceptable", that seems like you were
suggesting a massive loosening of existing policy. And when I suggested
an alternative of "only filesystems that pass e2fsck are acceptable", I
was also trying to suggest a loosening of the existing policy, just a
different, milder one. I would have come to this discussion with a very
different perspective if it had occurred to me that this might not be
the policy, or at least that it simply hadn't come up in this way
before.

I would be quite happy to have a discussion about where that stability
boundary should be. I also have no desire to be inflexible about that
boundary or about the evolution of the software I'm working on; this is
not some enterprise "thou shalt not touch" workload. On the contrary,
within reason it's *relatively* straightforward for me to evolve
filesystem generation code to deal with changes in the kernel or e2fsck.
I was not looking to push some new general principle on stability; I
only wished to ensure that it was reasonable to negotiate about the best
way to solve problems if and when they arise, and I hope they do not
arise often. (Most of the time, I'd expect that when they arise I'll
end up wanting to change my software rather than the kernel.)

I also don't think that a (milder) stability guarantee would mean that
the kernel or fsck can never become stricter. On the contrary, they both
*should* absolutely try to detect more issues over time. Much like other
instances of the userspace/kernel boundary, changes that could
hypothetically break some userspace software aren't automatically
disqualified, only changes that *actually* break userspace in practice.
I regularly see the kernel making a change that would technically affect
some userspace software, but either no such software appears to exist
(and the question could be revisited if it did), or any such software
that does exist is not affected in a harmful way (e.g. the software
copes with any error produced), or the software can be relatively easily
patched to work around the new requirements and the nature of the
software is such that people won't be running older versions, or any
number of other practical considerations.

Stability is a practical boundary rather than a theoretical concern.
It's one that can be discussed and negotiated based on the nature of a
change, and its practical impact. I suggested a kernel patch because
that seemed like a straightforward approach with minimal impact to the
kernel's validity checking. Darrick's suggestion in the next mail is an
even better approach, and one I'm likely to use, so I'd have no problem
with the patch that started this thread *not* being applied after all
(though I appreciate the willingness to consider it).

Finally, I think there's also some clarification needed in the role of
what some of the incompat and ro_compat flags mean. For instance,
"RO_COMPAT_READONLY" is documented as:
> - Read-only filesystem image; the kernel will not mount this image
> read-write and most tools will refuse to write to the image.
Is it reasonable to interpret this as "this can never, ever become
writable", such that no kernel should ever "understand" that flag in
ro_compat? I'd assumed so, but this discussion is definitely leading me
to want to confirm any such assumptions. Is this a flag that e2fsck
could potentially use to determine that it shouldn't check
read-write-specific data structures, or should that be a different flag?

> > The short version is that I needed a library to rapidly turn
> > dynamically-obtained data into a set of disk blocks to be served
> > on-the-fly as a software-defined disk, and then mounted on the other
> > side of that interface by the Linux kernel. Turns out that's *many
> > orders of magnitude* faster than any kind of network filesystem like
> > NFS. It's slightly similar to a vvfat for ext4. The less blocks it can
> > generate and account for and cache, the faster it can run, and
> > microseconds matter.
>
> So are you actually trying to dedup data blocks, or are you just
> trying to avoid needing to track the block allocation bitmaps?

Both. I do actually dedup some of the data blocks, when it's easy to do
so. The less data that needs to get generated and go over the wire, the
better.

> And are you just writing a single file, or multiple files? Do you
> know what the maximum size of the file or files will be? Do you need
> a complex directory structure, or just a single root directory?

It's an arbitrary filesystem hierarchy, including directories, files of
various sizes (hence using inline_data), and permissions. The problem
isn't to get data from point A to point B; the problem is (in part) to
turn a representation of a filesystem into an actual mounted filesystem
as efficiently as possible, live-serving individual blocks on demand
rather than generating the whole image in advance.

> Can the file system be sparse?

Files almost certainly won't be (because in practice, it'd be rare
enough that it's not worth the time and logic to check). The block
device is in a way; unused blocks aren't transmitted.

> So for example, you can do something like this, which puts all of the
> metadata at beginning of the file system, and then you could write to
> contiguous data blocks. Add the following in mke2fs.conf:

(Note that the main reason mke2fs wouldn't help is that it generates
filesystems in advance, rather than on the fly.)

> [fs_types]
> hugefile = {
> features = extent,huge_file,bigalloc,flex_bg,uninit_bg,dir_nlink,extra_isize,^resize_inode,sparse_super2

Other than uninit_bg, that's pretty much the list of feature flags I'm setting.
I'm also enabling 64bit, inline_data, largedir, and filetype.

> cluster_size = 32768

Many files are potentially small, so 4k blocks make sense. (I've even
considered 1k blocks, but that seemed like a less-well-trodden path and
I didn't want to be less efficient for larger files. 4k seems to be a
good balance, and has the advantage of matching the page size.)

> So if you had come to the ext4 list with a set of requirements, it
> could have been that we could have come up with something which uses
> the existing file system features, or come up with something which
> would have been more specific --- and more importantly, we'd know what
> the semantics were of various on-disk file system formats that people
> are depending upon.

I've mentioned specific aspects of what I'm doing at least twice: once
in the thread about 128-bit inodes and inline data (which got a good
response, and ended up convincing me not to go that route even though
it'd be theoretically possible), and once in the feature request asking
for support in e2fsck for shared bitmap blocks (which didn't get any
further followups after I'd provided additional information).

> > If at some point I'm looking to make ext4 support more than it already
> > does (e.g. a way to omit bitmaps entirely, or a way to express
> > contiguous files with smaller extent maps, or other enhancements for
> > read-only filesystems),
>
> See above for a way to significantly reduce the number of bitmaps.
> Adding a way to omit bitmaps entirely would require an INCOMPAT flag,
> so it might not be worth it.

I agree that it'd be of limited value, compared to the amount of change
required.

> The way to express contiguous files with smaller extent files would be
> to extend the kernel to allow file systems with block_size > page_size
> read-only. This would allow you to create a file system with a block
> size of 64k, which will reduce the size of the extent maps by a factor
> of 16, and it wouldn't be all that hard to teach ext4 to support these
> file systems. (The reason why it would be hard for us to support file
> systems with block sizes > page size is dealing with page cache when
> writing files while allocating blocks, especially when doing random
> writes into a sparse file. Read-only would be much easier to
> support.)

That seems like it would work well for a filesystem that mostly
contained larger files, but not a mix of small files and occasional
large files. (Imagine, for one possible instance, a normal Linux
system's root filesystem with everything on it, plus a single 200GB
file.) It's more that I'd like to avoid generating and serving (and
having the kernel parse and recurse through) a full set of extent tree
blocks; it would be nice if an extent record could represent more than
32k blocks. It'd be even nicer if *inline* extents could represent an
arbitrarily long file as long as the file was contiguous.

Would it be reasonable, particularly on a read-only filesystem (to avoid
the possibility of having to break it into a massive extent tree on the
fly), to add a way to represent a completely contiguous or mostly
contiguous file using only the 60-byte in-inode region, and no auxiliary
extents? Having extents with a 48-bit number of blocks would be helpful,
and in theory a contiguous file doesn't exactly need ee_block.

> So please, talk to us, and *tell* us what it is you're trying to do
> before you try to do it. Don't rely on some implementation detail
> where we're not being sufficiently strict in checking for an invalid
> file system, especially without telling us in advance and then trying
> to hold us to the lack of checking forever because it's "breaking
> things that used to work".

I'd be happy to start more such conversations in the future. And as
mentioned above, I am *not* looking for that level of guarantee of never
checking *anything* that used to be checked; I'm looking to
collaboratively figure out what the best solution is. I don't want the
default assumption to be that the kernel needs to remain perfectly
accepting of everything it did in the past, nor do I want the default
assumption to be that userspace should always change to meet the new
kernel requirements.

- Josh Triplett

2020-10-09 00:20:21

by Josh Triplett

[permalink] [raw]
Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Thu, Oct 08, 2020 at 01:25:57PM -0600, Andreas Dilger wrote:
> On Oct 8, 2020, at 1:12 PM, Josh Triplett <[email protected]> wrote:
> > On Wed, Oct 07, 2020 at 08:57:12PM -0600, Andreas Dilger wrote:
> >> I *do* think that inline_data is an under-appreciated feature that I
> >> would be happy to see some improvements with. I don't think that small
> >> files are a niche use case, and if we can clean up the inline_data code
> >> to work with 128-byte inodes I'm not against that, even though I'm not
> >> going to use that combination of features myself.
> >
> > I'd love to see that happen. At the time, it seemed like too large of a
> > change to block on, which is why I ended up deciding to switch to
> > 256-byte inodes.
>
> Does that mean you are using inline_data with 256-byte inodes?

I am, yes, and it mostly works great. I've hit zero issues with it in
the filesystems I'm generating.

> That would also be good to know, since there haven't been any
> well-known users of this feature so far (AFAIK). Since you are using
> this in a read-only manner, you won't hit the one know issue when an
> inline_data inode is extended to use an external block that may
> temporarily leave the inode in an inconsistent state.

I've run into a few other issues with it in other tools, as well. mke2fs
with inline_data generates invalid files given xattrs:
https://lore.kernel.org/linux-ext4/20200926102512.GA11386@localhost/T/#u

And extlinux doesn't like inline_data at all:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=971002

I'll report any other issues I run into using inline_data. I agree that
it's deeply underappreciated.

- Josh

2020-10-09 00:21:53

by Josh Triplett

[permalink] [raw]
Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Thu, Oct 08, 2020 at 10:54:48AM -0700, Darrick J. Wong wrote:
> IMO, the prominent free software filesystem projects offer (at least)
> four layers of social structures to keep everyone on the same page,
> including people writing competing implementations.

(I certainly hope that this isn't a *competing* implementation. It's
more that it serves a different purpose than the existing tools.)

> The first is "Does
> everything we write still work with the kernel", which I guess you
> clearly did since you're complaining about this change in 5.9-rc2.

Right.

> The second layer is "Does it pass fsck?" which, given that you're asking
> for changes to e2fsck elsewhere in this thread and I couldn't figure out
> how to generate a shared-bitmaps ext4 fs that didn't also cause e2fsck
> to complain about the overlap doesn't make me confident that you went
> beyond the first step before shipping something.

I did ensure that, other than the one top-level complaint that the
bitmaps overlapped, I got no complaints from e2fsck. I also confirmed
that with a patch to that one item, e2fsck passed with no issues.

> The third layer is consulting the ondisk format documentation to see if
> it points out anything that the kernel or fsck didn't notice. That
> one's kind of on Ted for not updating Documentation/ to spell out what
> SHARED_BLOCKS actually means, but that just leads me to the fourth thing.

I've been making *extensive* use of Documentation/filesystems/ext4 by
the way, and I greatly appreciate it. I know you've done a ton of work
in that area.

> The fourth layer is emailing the list to ask "Hey, I was thinking of
> ___, does anyone see a problem with my interpretation?". That clearly
> hasn't been done for shared bitmaps until now, which is all the more
> strange since you /did/ ask linux-ext4 about the inline data topic
> months ago.

That one was on me, you're right. Because Ted is the maintainer of
e2fsprogs in Debian, and conversations about ext4 often happen in the
Debian BTS, reporting a bug on e2fsprogs there felt like starting an
upstream conversation. That was my mistake; in the future, I'll make
sure that things make it to linux-ext4. I already did so for
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=971014 , and I
*should* have gone back and done so for the shared bitmap blocks issue.

> ext* and XFS have been around for 25 years. The software validation of
> both is imperfect and incomplete because the complexity of the ondisk
> formats is vast and we haven't caught up with the technical debt that
> resulted from not writing rigorous checks that would have been very
> difficult with mid-90s hardware. I know, because I've been writing
> online checking for XFS and have seen the wide the gap between what the
> existing software looks for and what the ondisk format documentation
> allows.
>
> The only chance that we as a community have at managing the complexity
> of a filesystem is to use those four tools I outlined above to try to
> keep the mental models of everyone who participates in close enough
> alignment. That's how we usually avoid discussions that end in rancor
> and confusion.
>
> That was a very long way of reiterating "If you're going to interpret
> aspects of the software, please come talk to us before you start writing
> code". That is key to ext4's long history of compatibility, because a
> project cannot maintain continuity when actors develop conflicting code
> in the shadows. Look at all the mutations FAT and UFS that have
> appeared over the years-- the codebases became a mess as a result.

Understood, agreed, and acknowledged. I'll make sure that any future
potentially "adventurous" filesystem experiments get discussed on
linux-ext4 first, not just in a distro bugtracker with a limited
audience.

> > > the head", and continued with the notion that anything other than
> > > e2fsprogs making something to be mounted by mount(2) and handled by
> > > fs/ext4 is being "inflicted", and if the goal didn't still seem to be
> > > "how do we make it go away so that only e2fsprogs and the kernel ever
> > > touch ext4". I started this thread because I'd written some userspace
> > > code, a new version of the kernel made that userspace code stop working,
> > > so I wanted to report that the moment I'd discovered that, along with a
> > > potential way to address it with as little disrupton to ext4 as
> > > possible.
>
> Ted: I don't think it's a good idea to make SHARED_BLOCKS disable block
> validity checking by default. You could someday enable users to write
> to block-sharing files by teaching ext4 how to COW, at which point you'd
> need correct bitmaps and block validity to prevent accidental overwrite
> of critical metadata. At that point you'd either be stuck with the
> precedent that SHARED_BLOCKS implies noblock_validity, or you'd end up
> breaking Josh's images again.

That's a fair point (though I think a writable COW version of
SHARED_BLOCKS would need a different flag). It'd make more sense to key
this logic off of RO_COMPAT_READONLY or similar. But even better:

> "noblock_validity" in the superblock mount options field of the images
> you create.

Yeah, I can do that. That's a much better solution, thank you. It would
have been problematic to have to change the userspace that mounts the
filesystem to pass new mount options ("noblock_validity") for the new
kernel. But if I can embed it in the filesystem, that'll work.

I'll do that, and please feel free to drop the original proposed patch
as it's no longer needed. Thanks, Darrick.

- Josh Triplett

2020-10-09 04:11:51

by Darrick J. Wong

[permalink] [raw]
Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Thu, Oct 08, 2020 at 03:38:58PM -0700, Josh Triplett wrote:
> On Thu, Oct 08, 2020 at 10:54:48AM -0700, Darrick J. Wong wrote:
> > IMO, the prominent free software filesystem projects offer (at least)
> > four layers of social structures to keep everyone on the same page,
> > including people writing competing implementations.
>
> (I certainly hope that this isn't a *competing* implementation. It's
> more that it serves a different purpose than the existing tools.)

I hope so too, but external implementations tend to have staying power
once people starting using them. You'd think e2fsdroid would have been
merged into mke2fs by now, but no...

> > The first is "Does
> > everything we write still work with the kernel", which I guess you
> > clearly did since you're complaining about this change in 5.9-rc2.
>
> Right.
>
> > The second layer is "Does it pass fsck?" which, given that you're asking
> > for changes to e2fsck elsewhere in this thread and I couldn't figure out
> > how to generate a shared-bitmaps ext4 fs that didn't also cause e2fsck
> > to complain about the overlap doesn't make me confident that you went
> > beyond the first step before shipping something.
>
> I did ensure that, other than the one top-level complaint that the
> bitmaps overlapped, I got no complaints from e2fsck. I also confirmed
> that with a patch to that one item, e2fsck passed with no issues.

<cough> even a single top level complaint still means it doesn't pass.

> > The third layer is consulting the ondisk format documentation to see if
> > it points out anything that the kernel or fsck didn't notice. That
> > one's kind of on Ted for not updating Documentation/ to spell out what
> > SHARED_BLOCKS actually means, but that just leads me to the fourth thing.
>
> I've been making *extensive* use of Documentation/filesystems/ext4 by
> the way, and I greatly appreciate it. I know you've done a ton of work
> in that area.
>
> > The fourth layer is emailing the list to ask "Hey, I was thinking of
> > ___, does anyone see a problem with my interpretation?". That clearly
> > hasn't been done for shared bitmaps until now, which is all the more
> > strange since you /did/ ask linux-ext4 about the inline data topic
> > months ago.
>
> That one was on me, you're right. Because Ted is the maintainer of
> e2fsprogs in Debian, and conversations about ext4 often happen in the
> Debian BTS, reporting a bug on e2fsprogs there felt like starting an
> upstream conversation. That was my mistake; in the future, I'll make
> sure that things make it to linux-ext4. I already did so for
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=971014 , and I
> *should* have gone back and done so for the shared bitmap blocks issue.

<nod> Thanks.

> > ext* and XFS have been around for 25 years. The software validation of
> > both is imperfect and incomplete because the complexity of the ondisk
> > formats is vast and we haven't caught up with the technical debt that
> > resulted from not writing rigorous checks that would have been very
> > difficult with mid-90s hardware. I know, because I've been writing
> > online checking for XFS and have seen the wide the gap between what the
> > existing software looks for and what the ondisk format documentation
> > allows.
> >
> > The only chance that we as a community have at managing the complexity
> > of a filesystem is to use those four tools I outlined above to try to
> > keep the mental models of everyone who participates in close enough
> > alignment. That's how we usually avoid discussions that end in rancor
> > and confusion.
> >
> > That was a very long way of reiterating "If you're going to interpret
> > aspects of the software, please come talk to us before you start writing
> > code". That is key to ext4's long history of compatibility, because a
> > project cannot maintain continuity when actors develop conflicting code
> > in the shadows. Look at all the mutations FAT and UFS that have
> > appeared over the years-- the codebases became a mess as a result.
>
> Understood, agreed, and acknowledged. I'll make sure that any future
> potentially "adventurous" filesystem experiments get discussed on
> linux-ext4 first, not just in a distro bugtracker with a limited
> audience.
>
> > > > the head", and continued with the notion that anything other than
> > > > e2fsprogs making something to be mounted by mount(2) and handled by
> > > > fs/ext4 is being "inflicted", and if the goal didn't still seem to be
> > > > "how do we make it go away so that only e2fsprogs and the kernel ever
> > > > touch ext4". I started this thread because I'd written some userspace
> > > > code, a new version of the kernel made that userspace code stop working,
> > > > so I wanted to report that the moment I'd discovered that, along with a
> > > > potential way to address it with as little disrupton to ext4 as
> > > > possible.
> >
> > Ted: I don't think it's a good idea to make SHARED_BLOCKS disable block
> > validity checking by default. You could someday enable users to write
> > to block-sharing files by teaching ext4 how to COW, at which point you'd
> > need correct bitmaps and block validity to prevent accidental overwrite
> > of critical metadata. At that point you'd either be stuck with the
> > precedent that SHARED_BLOCKS implies noblock_validity, or you'd end up
> > breaking Josh's images again.
>
> That's a fair point (though I think a writable COW version of
> SHARED_BLOCKS would need a different flag). It'd make more sense to key
> this logic off of RO_COMPAT_READONLY or similar. But even better:

It wouldn't require a new flag -- "rocompat" features bits mean that
"it's safe to allow userspace to read files off the disk if software
doesn't recognize this feature bit".

We're (ab)using the "doesn't recognize" part of that sentence, since the
kernel doesn't recognize RO_COMPAT_READONLY (or SHARED_BLOCKS) when it
compares the superblock ro compat field to EXT4_FEATURE_RO_COMPAT_SUPP.
It therefore only allows reading files.

If someone /did/ add the code to write to files safely in the presence
of shared blocks, all they'd have to do to light up the functionality is
add it to the _SUPP define. Also, it's strange that the kernel source
doesn't even know of SHARED_BLOCKS, but that's also on Ted...

> > "noblock_validity" in the superblock mount options field of the images
> > you create.
>
> Yeah, I can do that. That's a much better solution, thank you. It would
> have been problematic to have to change the userspace that mounts the
> filesystem to pass new mount options ("noblock_validity") for the new
> kernel. But if I can embed it in the filesystem, that'll work.
>
> I'll do that, and please feel free to drop the original proposed patch
> as it's no longer needed. Thanks, Darrick.

NP.

--D

>
> - Josh Triplett

2020-10-09 14:38:18

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Thu, Oct 08, 2020 at 03:22:59PM -0700, Josh Triplett wrote:
>
> I wasn't trying to make a *new* general principle or policy. I was under
> the impression that this *was* the policy, because it never occurred to
> me that it could be otherwise. It seemed like a natural aspect of the
> kernel/userspace boundary, to the point that the idea of it *not* being
> part of the kernel's stability guarantees didn't cross my mind.

From our perspective (and Darrick and I discussed this on this week's
ext4 video conference, so it represents the ext4 and xfs maintainer's
position) is that the file system format is different. First, the
on-disk format is not an ABI, and it is several orders more complex
than a system call interface. Second, we make no guarantees about
what the file system created by malicious tools will do. For example,
XFS developers reject bug reports from file system fuzzers, because
the v5 format has CRC checks, so randomly corrupted file systems won't
crash the kernel. Yes, this doesn't protect against maliciously
created file systems where the attacker makes sure the checksums are
valid, but only crazy people who think containers are just as secure
as VM's and that unprivileged users should be allowed to make the
kernel mount potentially maliciously created file systems would be
exposing the kernel to such maliciously created images.

> Finally, I think there's also some clarification needed in the role of
> what some of the incompat and ro_compat flags mean. For instance,
> "RO_COMPAT_READONLY" is documented as:
> > - Read-only filesystem image; the kernel will not mount this image
> > read-write and most tools will refuse to write to the image.
> Is it reasonable to interpret this as "this can never, ever become
> writable", such that no kernel should ever "understand" that flag in
> ro_compat?

Yes. However...

> I'd assumed so, but this discussion is definitely leading me
> to want to confirm any such assumptions. Is this a flag that e2fsck
> could potentially use to determine that it shouldn't check
> read-write-specific data structures, or should that be a different flag?

Just because it won't be modifiable, shouldn't mean that e2fsck won't
check to make sure that such structures are valid. "Won't be changed"
and "valid" are two different concepts. And certainly, today we *do*
check to make sure the bitmaps are valid and don't overlap, and we
can't go back in time to change that.

That being said, on the ext4 weekly video chat, we did discuss other
uses of an incompat feature flag that would allow the allocation
bitmap blocks and inode table block fields in the superblock to be
zero, which would mean that they are unallocated. This would allow us
to dynamically grow the inode table by adding an extra block group
descriptor. In fact, I'd probably use this as an opportunity to make
some other changes, such using inodes to store locations of the block
group descriptors, inode tables, and allocation bitmaps at the same
time. Those details can be discussed later, but the point is that
this is why it's good to discuss format changes from a requirements
perspective, so that if we do need to make an incompat change, we can
kill multiple birds with a single stone.

> It's an arbitrary filesystem hierarchy, including directories, files of
> various sizes (hence using inline_data), and permissions. The problem
> isn't to get data from point A to point B; the problem is (in part) to
> turn a representation of a filesystem into an actual mounted filesystem
> as efficiently as possible, live-serving individual blocks on demand
> rather than generating the whole image in advance.

Ah, so you want to be able to let the other side "look at" the file
system in parallel with it being generated on demand? The cache
coherency problems would seem to be... huge. For example, how can you
add a file to directory after the reader has looked at the directory
inode and directory blocks? Or for that matter, looked at a portion
of the inode table block? Or are you using 4k inodes so there is only
one inode per block? What about the fact that we sometimes do
readahead of inode table blocks?

I can think of all sorts of implementation level changes in terms of
caching, readahead behavior, etc., that we might make in the future
that might break you if you are doing some quite as outr? are that.
Again, the fact that you're being cagey about what you are doing, and
potentially complaining about changes we might make that would break
you, is ***really*** scaring me now.

Can you go into more details here? I'm sorry if you're working for
some startup who might want to patent these ideas, but if you want to
guarantee future compatibility, I'm really going to have to insist.

- Ted

2020-10-09 19:58:45

by Josh Triplett

[permalink] [raw]
Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Thu, Oct 08, 2020 at 07:54:09PM -0700, Darrick J. Wong wrote:
> On Thu, Oct 08, 2020 at 03:38:58PM -0700, Josh Triplett wrote:
> > On Thu, Oct 08, 2020 at 10:54:48AM -0700, Darrick J. Wong wrote:
> > > > > the head", and continued with the notion that anything other than
> > > > > e2fsprogs making something to be mounted by mount(2) and handled by
> > > > > fs/ext4 is being "inflicted", and if the goal didn't still seem to be
> > > > > "how do we make it go away so that only e2fsprogs and the kernel ever
> > > > > touch ext4". I started this thread because I'd written some userspace
> > > > > code, a new version of the kernel made that userspace code stop working,
> > > > > so I wanted to report that the moment I'd discovered that, along with a
> > > > > potential way to address it with as little disrupton to ext4 as
> > > > > possible.
> > >
> > > Ted: I don't think it's a good idea to make SHARED_BLOCKS disable block
> > > validity checking by default. You could someday enable users to write
> > > to block-sharing files by teaching ext4 how to COW, at which point you'd
> > > need correct bitmaps and block validity to prevent accidental overwrite
> > > of critical metadata. At that point you'd either be stuck with the
> > > precedent that SHARED_BLOCKS implies noblock_validity, or you'd end up
> > > breaking Josh's images again.
> >
> > That's a fair point (though I think a writable COW version of
> > SHARED_BLOCKS would need a different flag). It'd make more sense to key
> > this logic off of RO_COMPAT_READONLY or similar. But even better:
>
> It wouldn't require a new flag -- "rocompat" features bits mean that
> "it's safe to allow userspace to read files off the disk if software
> doesn't recognize this feature bit".

Sure; I was more thinking that if that involved adding some data
structures to track shared blocks and the need to COW, whatever
mechanism that used might potentially need an incompat flag.

> If someone /did/ add the code to write to files safely in the presence
> of shared blocks, all they'd have to do to light up the functionality is
> add it to the _SUPP define. Also, it's strange that the kernel source
> doesn't even know of SHARED_BLOCKS, but that's also on Ted...

It would be nice if the kernel's ext4.h header and e2fsprogs were fully
in sync, yes.

- Josh Triplett

2020-10-10 03:51:57

by Josh Triplett

[permalink] [raw]
Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Fri, Oct 09, 2020 at 10:37:32AM -0400, Theodore Y. Ts'o wrote:
> That being said, on the ext4 weekly video chat, we did discuss other
> uses of an incompat feature flag that would allow the allocation
> bitmap blocks and inode table block fields in the superblock to be
> zero, which would mean that they are unallocated.

What do you mean by "allocation bitmap blocks and inode table block
fields in the superblock"? Those are in the group descriptor, not the
superblock. Or am I missing something there?

> This would allow us
> to dynamically grow the inode table by adding an extra block group
> descriptor. In fact, I'd probably use this as an opportunity to make
> some other changes, such using inodes to store locations of the block
> group descriptors, inode tables, and allocation bitmaps at the same
> time. Those details can be discussed later, but the point is that
> this is why it's good to discuss format changes from a requirements
> perspective, so that if we do need to make an incompat change, we can
> kill multiple birds with a single stone.

I would be quite interested in that.

> On Thu, Oct 08, 2020 at 03:22:59PM -0700, Josh Triplett wrote:
> > It's an arbitrary filesystem hierarchy, including directories, files of
> > various sizes (hence using inline_data), and permissions. The problem
> > isn't to get data from point A to point B; the problem is (in part) to
> > turn a representation of a filesystem into an actual mounted filesystem
> > as efficiently as possible, live-serving individual blocks on demand
> > rather than generating the whole image in advance.
>
> Ah, so you want to be able to let the other side "look at" the file
> system in parallel with it being generated on demand? The cache
> coherency problems would seem to be... huge. For example, how can you
> add a file to directory after the reader has looked at the directory
> inode and directory blocks?

I don't. While the data is computed on demand for performance reasons,
the nature and size of all the data is fully known in advance. I never
add data to a filesystem, only create new filesystem images. The kernel
*is* looking at the filesystem in parallel with it being generated,
insofar as blocks aren't constructed until the kernel asks for them
(which is a massive performance win, especially since the kernel may
only want a small subset of the filesystem). But the block contents are
fixed in advance, even if they haven't been generated yet. So the kernel
can read ahead and cache any blocks it wants to, and they'll be valid.
(Excessive readahead might be a performance problem, but not a
correctness one.)

I briefly considered ideas around adding new data after the filesystem
was mounted, and dismissed those ideas just as quickly, for exactly
these reasons (disk caching, filesystem caching, readahead). That would
require a filesystem with at least some subset of cluster features. I
don't plan to go there if I don't have to.

If I do end up needing that, I'd consider proposing an ext4 change along
the lines of making the root directory into a "super-root" directory
under which multiple filesystem roots could live, and supporting a way
to re-read that root to discover new filesystem roots and new block
groups; then, it'd be possible to add a new root whose contents are
*mostly* references to existing inodes (that the kernel would already
have cached), and any modified directories or new/modified files would
have new inodes added. That would isolate the "don't read ahead and
cache" problem to the new inodes, which could potentially be isolated to
new block groups for simplicity, and the "discover new roots" mechanism
could also discover the newly added block groups and inode tables.

But again, I'm not curently looking to do that, and *if* I were, I'd
bring it up for architectural discussion and design on linux-ext4 first.
There's also a balance there between a simple version that'd work for an
append-only read-only filesystem, and a complex version that'd work for
a writable filesystem, and I'd hope more for the former than the latter,
but that'd be a topic for discussion.

- Josh Triplett

2021-01-10 18:42:40

by Pavel Machek

[permalink] [raw]
Subject: Malicious fs images was Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

Hi!

On Fri 2020-10-09 10:37:32, Theodore Y. Ts'o wrote:
> On Thu, Oct 08, 2020 at 03:22:59PM -0700, Josh Triplett wrote:
> >
> > I wasn't trying to make a *new* general principle or policy. I was under
> > the impression that this *was* the policy, because it never occurred to
> > me that it could be otherwise. It seemed like a natural aspect of the
> > kernel/userspace boundary, to the point that the idea of it *not* being
> > part of the kernel's stability guarantees didn't cross my mind.
>
> >From our perspective (and Darrick and I discussed this on this week's
> ext4 video conference, so it represents the ext4 and xfs maintainer's
> position) is that the file system format is different. First, the
> on-disk format is not an ABI, and it is several orders more complex
> than a system call interface. Second, we make no guarantees about
> what the file system created by malicious tools will do. For example,
> XFS developers reject bug reports from file system fuzzers, because
> the v5 format has CRC checks, so randomly corrupted file systems won't
> crash the kernel. Yes, this doesn't protect against maliciously
> created file systems where the attacker makes sure the checksums are
> valid, but only crazy people who think containers are just as secure

Well, it is not just containers. It is also USB sticks. And people who
believe secure boot is good idea and try to protect kernel against
root. And crazy people who encrypt pointers in dmesg. And...

People want to use USB sticks from time to time. And while I
understand XFS is so complex it is unsuitable for such use, I'd still
expect bugs to be fixed there.

I hope VFAT to be safe to mount, because that is very common on USB.

I also hope ext2/3/4 is safe in that regard.

Anyway it would be nice to have documentation explaining this. If I'm
wrong about VFAT being safe, it would be good to know, and I guess
many will be surprised that XFS is using different rules.

Best regards,
Pavel
--
http://www.livejournal.com/~pavelmachek


Attachments:
(No filename) (2.02 kB)
signature.asc (188.00 B)
Digital signature
Download all attachments

2021-01-11 18:53:22

by Darrick J. Wong

[permalink] [raw]
Subject: Re: Malicious fs images was Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Sun, Jan 10, 2021 at 07:41:02PM +0100, Pavel Machek wrote:
> Hi!
>
> On Fri 2020-10-09 10:37:32, Theodore Y. Ts'o wrote:
> > On Thu, Oct 08, 2020 at 03:22:59PM -0700, Josh Triplett wrote:
> > >
> > > I wasn't trying to make a *new* general principle or policy. I was under
> > > the impression that this *was* the policy, because it never occurred to
> > > me that it could be otherwise. It seemed like a natural aspect of the
> > > kernel/userspace boundary, to the point that the idea of it *not* being
> > > part of the kernel's stability guarantees didn't cross my mind.
> >
> > >From our perspective (and Darrick and I discussed this on this week's
> > ext4 video conference, so it represents the ext4 and xfs maintainer's
> > position) is that the file system format is different. First, the
> > on-disk format is not an ABI, and it is several orders more complex
> > than a system call interface. Second, we make no guarantees about
> > what the file system created by malicious tools will do. For example,
> > XFS developers reject bug reports from file system fuzzers, because

My recollection of this is quite different -- sybot was sending multiple
zeroday exploits per week to the public xfs list, and nobody at Google
was helping us to /fix/ those bugs. Each report took hours of developer
time to extract the malicious image (because Dmitry couldn't figure out
how to add that ability to syzbot) and syscall sequence from the
reproducer program, plus whatever time it took to craft a patch, test
it, and push it through review.

Dave and I complained to Dmitry about how the community had zero input
into the rate at which syzbot was allowed to look for xfs bugs. Nobody
at Google would commit to helping fix any of the XFS bugs, and Dmitry
would not give us better choices than (a) Google AI continuing to create
zerodays and leaving the community to clean up the mess, or (b) shutting
off syzbot entirely. At the time I said I would accept letting syzbot
run against xfs until it finds something, and turning it off until we
resolve the issue. That wasn't acceptable, because (I guess) nobody at
Google wants to put /any/ staff time into XFS at all.

TLDR: XFS /does/ accept bug reports about fuzzed and broken images.
What we don't want is make-work Google AIs spraying zeroday code in
public places and the community developers having to clean up the mess.

> > the v5 format has CRC checks, so randomly corrupted file systems
> > won't
> > crash the kernel. Yes, this doesn't protect against maliciously
> > created file systems where the attacker makes sure the checksums are
> > valid, but only crazy people who think containers are just as secure
>
> Well, it is not just containers. It is also USB sticks. And people who
> believe secure boot is good idea and try to protect kernel against
> root. And crazy people who encrypt pointers in dmesg. And...
>
> People want to use USB sticks from time to time. And while I
> understand XFS is so complex it is unsuitable for such use, I'd still
> expect bugs to be fixed there.
>
> I hope VFAT to be safe to mount, because that is very common on USB.
>
> I also hope ext2/3/4 is safe in that regard.

None of them are, that's why you can't mount local filesystems as an
unprivileged user by default. At a bare minimum you'd have to audit the
fs driver code to make sure it's robust against unexpected metadata; and
since filesystems store complex relational data across multiple indices,
you end up needing to hoist a full fsck into the mount process.

Or retain the current behavior, which is to delegate the trust decision
to the sysadmin. It's not just an XFS thing.

--D

> Anyway it would be nice to have documentation explaining this. If I'm
> wrong about VFAT being safe, it would be good to know, and I guess
> many will be surprised that XFS is using different rules.
>
> Best regards,
> Pavel
> --
> http://www.livejournal.com/~pavelmachek


2021-01-13 03:13:55

by Pavel Machek

[permalink] [raw]
Subject: Re: Malicious fs images was Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

Hi!

> > People want to use USB sticks from time to time. And while I
> > understand XFS is so complex it is unsuitable for such use, I'd still
> > expect bugs to be fixed there.
> >
> > I hope VFAT to be safe to mount, because that is very common on USB.
> >
> > I also hope ext2/3/4 is safe in that regard.
>
> Ext4 will fix file system fuzzing attack bugs on a best efforts basis.
> That is, when I have time, I've been known to stay up late to bugs
> reported by fuzzers. I hope ext4 is safe, but I'm not going to make
> any guarantees that it is Bug-Free(tm). If you want to trust it in
> that way, you do so at your risk.

Good.

> > Anyway it would be nice to have documentation explaining this. If I'm
> > wrong about VFAT being safe, it would be good to know, and I guess
> > many will be surprised that XFS is using different rules.
>
> Using USB sticks is fine, so long as you trust the provenance of the
> drive. If you take a random USB stick that is handed to you by

Well... That makes passing data between Windows and Linux machines
using USB stick "interesting", right?

> someone whom you don't trust implicitly, or worse, that you picked up
> abandoned on the sidewalk, there have been plenty of articles which
> describe why this is a REALLY BAD IDEA, and even if you ignore
> OS-level vuleranbilities, there are also firwmare and hardware based
> vulerabilities that would put your computer at risk. See [2] and
> [3]

I know, but bear with me.

> As far as documentation is concerned, how far should we go? Should
> there be a warning in the execve(2) system call man page that you
> shouldn't download random binaries from the network and execute them? :-)

No need to pull straw men for me.

This thread suggested that kernel is _not_ supposed to be robust
against corrupt filesystems (because fsck is not integrated in
kernel). Which was news to me (and I'm not the person that needs
warning in execve documentation).

I'd certainly like to hear that VFAT and EXT4 _is_ supposed to be
robust in that way.

And if we have filesystems where corrupt image is known to allow
arbitrary code execution, we need to

a) document that.

b) disable them when secure boot is enabled.

Because with secure boot, we are supposed to be secure against attacks
from root, and root can prepare malicious filesystems. ("The problem,
simply put, is this: the objective of secure boot is to prevent the
system from running any unsigned code in a privileged mode. So, if one
boots a Linux system that, in turn, gives access to the machine to
untrusted code, the entire purpose has been defeated. The consequences
could hurt both locally (bad code could take control of the machine)
and globally (the signing key used to boot Linux could be revoked), so
it is an outcome that is worth avoiding. Doing so, however, requires
placing limitations in the kernel so that not even root can circumvent
the secure boot chain of trust." -- https://lwn.net/Articles/514985/
).

Best regards,
Pavel

--
http://www.livejournal.com/~pavelmachek


Attachments:
(No filename) (3.06 kB)
signature.asc (201.00 B)
Download all attachments

2021-01-13 05:10:49

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Malicious fs images was Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

On Tue, Jan 12, 2021 at 11:28:40PM +0100, Pavel Machek wrote:
>
> This thread suggested that kernel is _not_ supposed to be robust
> against corrupt filesystems (because fsck is not integrated in
> kernel). Which was news to me (and I'm not the person that needs
> warning in execve documentation).

Define "supposed to be". In the ideal world, the kernel should be
robust against corrupt file systems. In the ideal world, hard drives
would never die, and memory bits would never get flipped due to cosmic
rays, and so Intel would be correct that consumers don't need ECC
memory. In the ideal world, drivers would never make mistakes, and so
seat belts would be completely unnecessasry.

Unfortunately, we live in the real world.

And so there is risk associated with using various technologies, and
that risk is not a binary 0% vs 100%. In my mind, assuming that file
systems are robust against maliciously created images is much like
driving around without a seat belt. There are plenty of people who
drive without seat belts for years, and they haven't been killed yet.
But it's not something *I* would do. But hey, I also spent extra
money to buy a personal desktop computer with ECC memory, and I don't
go bicycling without a helment, either.

You're free to decide what *you* want to do. But I wouldn't take a
random file system image from the Net and mount it without running
fsck on the darned thing first.

As far as secure boot is concerned, do you know how many zero days are
out there with Windows machines? I'm pretty sure there are vast
numbers. There are even more systems where the system adminsitrators
haven't bothered to install latest updates, as well.

> And if we have filesystems where corrupt image is known to allow
> arbitrary code execution, we need to

Define *known*. If you look at the syzbot dashboard, there are
hundreds of reproducers where root is able to crash a Linux server.
Some number of them will almost certainly be exploitable. The problem
is it takes a huge amount of time to analyze them, and Syzbot's file
system fuzzers are positively developer-hostile to root cause. So
usually I find and fix ext4 file system fuzzing reports via reports
from other fuzzers, and then eventually syzbot realizes that the
reproducer no longer works, and it gets closed out.

I'd certainly be willing to bet a beer or two that there are
vulnerabilities in VFAT, but do I *know* that to be the case? No,
because I don't have the time to take a syzbot report relating to a
file system and root cause it. The time to extract out the image, and
then figure out how to get a simple reproducer (as opposed to the mess
that a syzbot reproducer that might be randomly toggling a network
bridge interface on one thread while messing with a file system image
on another) is easily 10-20 times the effort to actually *fix* the bug
once we're able to come up with a trivial reproducer with a file
system image which is a separate file so we can analyze it using file
system debugging tools.

I'm also *quite* confident that the NSA, KGB, and other state
intelligence agencies have dozens of zero days for Windows, Linux,
etc. We just don't know in what subsystem they are in, so we can't
just "disable them when secure boot is enabled".

- Ted