2023-06-25 16:11:33

by zhanchengbin

[permalink] [raw]
Subject: [bug report] tune2fs: filesystem inconsistency occurs by concurrent write

Hi Tytso,
Tune2fs does not recognize writes to the manipulated filesystem in another
namespace, there will be two simultaneous write operations on a
block, resulting in filesystem inconsistencies.

The operation is as follows:
first terminal second terminal
mkfs.ext4 /dev/sdb;
mount /dev/sdb /test-sdb;
dd if=/dev/zero of=/test-sdb/test1 bs=1M count=100;
unshare -m;
umount;
gdb tune2fs;
b io_channel_write_byte
r -e remount-ro /dev/sdb
c(Write a byte of old data into the cache)
exit;
(gdb finish)
tune2fs -l /dev/sdb;
tune2fs 1.46.4 (18-Aug-2021)
tune2fs: Superblock checksum does not match superblock while trying to
open /dev/sdb
Couldn't find valid filesystem superblock.

- bin.


2023-06-26 02:37:35

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [bug report] tune2fs: filesystem inconsistency occurs by concurrent write

On Mon, Jun 26, 2023 at 12:00:08AM +0800, zhanchengbin wrote:
> Tune2fs does not recognize writes to the manipulated filesystem in another
> namespace, there will be two simultaneous write operations on a
> block, resulting in filesystem inconsistencies.

What you are reporting has nothing to do with namespaces, since
"tune2fs -e remount-ro /dev/sdb" is something which is allowed
regardless of whether the file system is mounted. What reproduction
is effectively doing is trying to set up a race between when tune2fs
writes a byte to update to update the errors behavior, and when the
actual unmount of the file system happens (e.g., when the last
namespace unmounts the file system). At that point, the kernel is
going to be updating the superblock as part of the unmount, and then
it calculates the superblock, and then it writes out the superblock.

If the tune2fs races with the unmount, it's possible for the tune2fs
update of the error beavhiour bit, and the update of the superblock
checksum, to race with the kernel's final update of the superblock,
includinig its attempt to update the checksum.

There are some workarounds to this, but ultimately, we need to replace
the ad-hoc modification of the block device by tune2fs with some
ioctls which specifically update superblock when the file system
mounted.

As far as whether or not tune2fs can detect if the file system is
mounted, what we can do is check to see if the block device is busy.
If it is mounted in some other namespace, we won't be able to see it
mounted in /proc/self/mounts, but we can see that it's not possible to
open the block device with O_EXCL.

Compare:

root@kvm-xfstests:~# /vtmp/tst_ismounted /dev/vdc
Device /dev/vdc reports flags 31
/dev/vdc is apparently in use.
/dev/vdc is mounted.
/dev/vdc is mounted on /vdc.

and then "unshare -m" in another terminal, followed by umount /dev/vdc
in the first terminal:

root@kvm-xfstests:~# /vtmp/tst_ismounted /dev/vdc
Device /dev/vdc reports flags 10
/dev/vdc is apparently in use.

... and then after we exit the last mount namespace which was keeping
/dev/vdc mounted:

root@kvm-xfstests:~# [ 2409.811328] EXT4-fs (vdc): unmounting filesystem bdc026fd-85a8-4ccf-94f8-961487000293.
root@kvm-xfstests:~# /vtmp/tst_ismounted /dev/vdc
Device /dev/vdc reports flags 00

Cheers,

- Ted

2023-06-26 12:05:52

by zhanchengbin

[permalink] [raw]
Subject: Re: [bug report] tune2fs: filesystem inconsistency occurs by concurrent write

Thanks for your explanation, it's a great idea.

- bin.

On 2023/6/26 10:17, Theodore Ts'o wrote:
> On Mon, Jun 26, 2023 at 12:00:08AM +0800, zhanchengbin wrote:
>> Tune2fs does not recognize writes to the manipulated filesystem in another
>> namespace, there will be two simultaneous write operations on a
>> block, resulting in filesystem inconsistencies.
>
> What you are reporting has nothing to do with namespaces, since
> "tune2fs -e remount-ro /dev/sdb" is something which is allowed
> regardless of whether the file system is mounted. What reproduction
> is effectively doing is trying to set up a race between when tune2fs
> writes a byte to update to update the errors behavior, and when the
> actual unmount of the file system happens (e.g., when the last
> namespace unmounts the file system). At that point, the kernel is
> going to be updating the superblock as part of the unmount, and then
> it calculates the superblock, and then it writes out the superblock.
>
> If the tune2fs races with the unmount, it's possible for the tune2fs
> update of the error beavhiour bit, and the update of the superblock
> checksum, to race with the kernel's final update of the superblock,
> includinig its attempt to update the checksum.
>
> There are some workarounds to this, but ultimately, we need to replace
> the ad-hoc modification of the block device by tune2fs with some
> ioctls which specifically update superblock when the file system
> mounted. >
> As far as whether or not tune2fs can detect if the file system is
> mounted, what we can do is check to see if the block device is busy.
> If it is mounted in some other namespace, we won't be able to see it
> mounted in /proc/self/mounts, but we can see that it's not possible to
> open the block device with O_EXCL.
>
> Compare:
>
> root@kvm-xfstests:~# /vtmp/tst_ismounted /dev/vdc
> Device /dev/vdc reports flags 31
> /dev/vdc is apparently in use.
> /dev/vdc is mounted.
> /dev/vdc is mounted on /vdc.
>
> and then "unshare -m" in another terminal, followed by umount /dev/vdc
> in the first terminal:
>
> root@kvm-xfstests:~# /vtmp/tst_ismounted /dev/vdc
> Device /dev/vdc reports flags 10
> /dev/vdc is apparently in use.
>
> ... and then after we exit the last mount namespace which was keeping
> /dev/vdc mounted:
>
> root@kvm-xfstests:~# [ 2409.811328] EXT4-fs (vdc): unmounting filesystem bdc026fd-85a8-4ccf-94f8-961487000293.
> root@kvm-xfstests:~# /vtmp/tst_ismounted /dev/vdc
> Device /dev/vdc reports flags 00
>
> Cheers,
>
> - Ted
>
> .
>

2023-07-04 09:06:53

by zhanchengbin

[permalink] [raw]
Subject: Re: [bug report] tune2fs: filesystem inconsistency occurs by concurrent write

Hi Tytso,

Referring to you kind and insightful advice, I have come up with three
solutions, listed as follows:

1) Add interface blkdev_bh_read and blkdev_bh_write in blkdev_ioctl.
Lock the
bh_lock when blkdev_bh_read is called, then change the value of the data and
write back it, and finally unlock_buffer;

This solution ensures that there is no coupling between the block layer and
the file system layer, but the interfaces are invoked by user processes, and
since user processes are subject to scheduling, there may be potential
performance overhead. Additionally, if user process is killed before
releasing
the bh_lock, it can result in a deadlock situation.

2) Add interface write_super in blkdev_ioctl. In this case, a hook is
used in
the write_super to invoke the commit_super function of each file system.
During
the mount process, the commit_super function of the respective filesystem is
saved into the hook. Since the metadata cache is stored through bdev, it is
sufficient to memcpy the data to bh->b_data and then flush it to disk.

This solution allows each file system to implement its own method of
flushing
the superblock. However, it introduces coupling between the block layer and
the filesystem layer, and there needs to be a place to store this
commit_super
hook function pointer. Is it stored in the block_device? It seems a
little bit
strange to me.

3) Add interface write_super in ext4_ioctl. The mount point is obtained
through
the block device, and open a random file in the file system, the
superblock is
passed to the kernel through ioctl of the file, and finally, the
superblock is
flushed to disk. Due to the inherent risks associated with granting user
space
write permissions to the superblock, it is deemed unsafe to utilize the
entire
superblock as provided by user space. Instead, the superblock should be
validated, followed by selective data writing and recording. Of course,
it is
dangerous to directly modify the data in the super block, so I plan to add a
flag in the super block to record that this modification is from the
user state.

This solution has no coupling with other layers, but uses the ioctl of
ordinary
files.

Personally speaking, I favour the third solution the most, what are your
opinions? If you already have other schemes, I will be more than
delighted to
discuss it with you.
Looking foward to hearing from you soon!

Thanks,
- bin.

2023-07-04 19:49:28

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [bug report] tune2fs: filesystem inconsistency occurs by concurrent write

On Tue, Jul 04, 2023 at 04:35:43PM +0800, zhanchengbin wrote:

> 3) Add interface write_super in ext4_ioctl. The mount point is
> obtained through the block device, and open a random file in the
> file system, the superblock is passed to the kernel through ioctl of
> the file, and finally, the superblock is flushed to disk. Due to the
> inherent risks associated with granting user space write permissions
> to the superblock, it is deemed unsafe to utilize the entire
> superblock as provided by user space. Instead, the superblock should
> be validated, followed by selective data writing and recording. Of
> course, it is dangerous to directly modify the data in the super
> block, so I plan to add a flag in the super block to record that
> this modification is from the user state.


> Personally speaking, I favour the third solution the most, what are
> your opinions? If you already have other schemes, I will be more
> than delighted to discuss it with you. Looking foward to hearing
> from you soon

I agree that the third solution (or some variant thereof) is the best.
The first two involve changes to the block layer, and in addition the
problems you've outlined, different file systems have file systems in
different superblocks in locations on disk, and the superblock may be
differently sized, etc. The problem we are trying to solve is also
fairly unique to ext2/3/4, since many other file systems either have
their own specialized ioctls, or they may simply not allow any file
system tuning operations while the file system is mounted.

The problem though with a write_super() ioctl though is that while it
solves the problem of false positive complaints from syzbot (at least
according to my opinion, sane threat models --- the syzbot folks seem
to disagree about what a sane threat model ought to be), it doesn't
solve the *other* problem which is that the superblock can also be
modified by the kernel, so there are some inherent race conditions
that can occcur when userspace and kernel are trying to modify the
superblock at the same time.

This can be potentially solved by a lot of checks by the kernel code
handling the hypothetical EXT4_IOC_WRITE_SUPER ioctl. This could be
improved by a passing in a second superblock so the ioctl handling
code can compare the modified superblocks between the original and
modified superblock, and then apply more sanity checks. But that's a
lot of extra complexity, and you won't know whether the kernel will
support modifying some pariticular superblock field.

So the better approach is to have multiple new ioctl's for each
superblock field (or set of fields) that you might want modify. We
have some of these already --- for example, EXT4_IOC_SETFSUUID and
FS_IOC_SETFSLABEL. For tune2fs, all of additional ioctls for making
configuration changes while the file system is mounted are:

* EXT4_IOC_SET_FEATURES
- for tune2fs -O...
* EXT4_IOC_CLEAR_FEATURES
- for tune2fs -O^...
* EXT4_IOC_SET_ERROR_BEHAVIOR
- for tune2fs -e
* EXT4_IOC_SET_DEFAULT_MOUNT_FLAGS
- for tune2fs -o
* EXT4_IOC_SET_DEFAULT_MOUNT_OPTS
- for tune2fs -E mount_opts=XXX
* EXT4_IOC_SET_FSCK_POLICY
- for tune2fs -[cCiT]
* EXT4_IOC_SET_RESERVED_ALLOC
- for tune2fs -[ugm]

(The last two assumes using a structure, since it's probably not worth
creating 7 new ioctls when 2 would probably do).

Some of these ioctls are pretty esoteric, and are rarely used by most
system adminsitrators. And we don't have to add all of them all at
once; we could gradually add some of them, and most of these are
simple enough that we could assign them as a starter project for a new
college grad that you've hired into your company, or to an intern.

Cheers,

- Ted

P.S. I've ignored ioctls to "get" the superblock. We could just have
a single EXT4_IOC_READ_SUPERBLOCK, or we can solve the problem by
simply having the userspace use an O_DIRECT to read the superblock,
since this will avoid the potential invalid checksum issues the
superblock will only be written out to disk by the kernel when it is
self-consistent.

2023-07-08 07:32:32

by zhanchengbin

[permalink] [raw]
Subject: Re: [bug report] tune2fs: filesystem inconsistency occurs by concurrent write

On 2023/7/5 3:33, Theodore Ts'o wrote:

I have written the ioctl for EXT4_IOC_SET_ERROR_BEHAVIOR according to your
instructions and verified that it can indeed modify the data on the disk.

However, I realized some problems. If we use the ioctl method to modify the
data, it would require multiple ioctls in user space to modify the
superblock.
If one ioctl successfully modifies the superblock data, but another ioctl
fails, the atomicity of the superblock cannot be guaranteed. This is just
within one user space process, not to mention the scenario of multiple user
space processes calling ioctls concurrently. Additionally, multiple ioctls
modifying the superblock may be placed in multiple transactions, which
further
compromises atomicity.

Writing the entire superblock buffer_head to disk can ensure atomicity.
However, these data need to be converted to ext4_sb_info. Otherwise, during
unmount, the data in memory will overwrite the data on the disk.
(Modifying the values in ext4_sb_info can potentially introduce unexpected
issues, just like the problem thata arises when setting the UUID without
checking ext4_has_feature_csum_seed.)

> So the better approach is to have multiple new ioctl's for each
> superblock field (or set of fields) that you might want modify. We
> have some of these already --- for example, EXT4_IOC_SETFSUUID and
> FS_IOC_SETFSLABEL. For tune2fs, all of additional ioctls for making
> configuration changes while the file system is mounted are:
>
> * EXT4_IOC_SET_FEATURES
> - for tune2fs -O...
> * EXT4_IOC_CLEAR_FEATURES
> - for tune2fs -O^...
> * EXT4_IOC_SET_ERROR_BEHAVIOR
> - for tune2fs -e
> * EXT4_IOC_SET_DEFAULT_MOUNT_FLAGS
> - for tune2fs -o
> * EXT4_IOC_SET_DEFAULT_MOUNT_OPTS
> - for tune2fs -E mount_opts=XXX
> * EXT4_IOC_SET_FSCK_POLICY
> - for tune2fs -[cCiT]
> * EXT4_IOC_SET_RESERVED_ALLOC
> - for tune2fs -[ugm]

diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 331859511f80..d598e1975786 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -51,6 +51,11 @@ static void ext4_sb_setuuid(struct ext4_super_block
*es, const void *arg)
memcpy(es->s_uuid, (__u8 *)arg, UUID_SIZE);
}

+static void ext4_sb_set_error_behavior(struct ext4_super_block *es,
const void *arg)
+{
+ es->s_errors = cpu_to_le16(*(__u16 *)arg);
+}
+
static
int ext4_update_primary_sb(struct super_block *sb, handle_t *handle,
ext4_update_sb_callback func,
@@ -1220,6 +1225,32 @@ static int ext4_ioctl_setuuid(struct file *filp,
return ret;
}

+static int ext4_ioctl_set_error_behavior(struct file *filp,
+ const __u16 __user *uerror_behavior)
+{
+ int ret = 0;
+ struct super_block *sb = file_inode(filp)->i_sb;
+ __u16 error_behavior;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ if (copy_from_user(&error_behavior, uerror_behavior,
sizeof(error_behavior)))
+ return -EFAULT;
+
+ if (error_behavior < EXT4_ERRORS_CONTINUE || error_behavior >
EXT4_ERRORS_PANIC)
+ return -EINVAL;
+
+ ret = mnt_want_write_file(filp);
+ if (ret)
+ return ret;
+
+ ret = ext4_update_superblocks_fn(sb, ext4_sb_set_error_behavior,
&error_behavior);
+ mnt_drop_write_file(filp);
+
+ return ret;
+}
+
static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned
long arg)
{
struct inode *inode = file_inode(filp);
@@ -1607,6 +1638,8 @@ static long __ext4_ioctl(struct file *filp,
unsigned int cmd, unsigned long arg)
return ext4_ioctl_getuuid(EXT4_SB(sb), (void __user *)arg);
case EXT4_IOC_SETFSUUID:
return ext4_ioctl_setuuid(filp, (const void __user *)arg);
+ case EXT4_IOC_SET_ERROR_BEHAVIOR:
+ return ext4_ioctl_set_error_behavior(filp, (const void __user *)arg);
default:
return -ENOTTY;
}
diff --git a/include/uapi/linux/ext4.h b/include/uapi/linux/ext4.h
index 1c4c2dd29112..68329d51a8a7 100644
--- a/include/uapi/linux/ext4.h
+++ b/include/uapi/linux/ext4.h
@@ -33,6 +33,7 @@
#define EXT4_IOC_CHECKPOINT _IOW('f', 43, __u32)
#define EXT4_IOC_GETFSUUID _IOR('f', 44, struct fsuuid)
#define EXT4_IOC_SETFSUUID _IOW('f', 44, struct fsuuid)
+#define EXT4_IOC_SET_ERROR_BEHAVIOR _IOW('f', 45, __u16)

#define EXT4_IOC_SHUTDOWN _IOR('X', 125, __u32)



Thanks,
- bin.


2023-07-12 00:24:19

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [bug report] tune2fs: filesystem inconsistency occurs by concurrent write

On Sat, Jul 08, 2023 at 03:29:51PM +0800, zhanchengbin wrote:
> On 2023/7/5 3:33, Theodore Ts'o wrote:
>
> I have written the ioctl for EXT4_IOC_SET_ERROR_BEHAVIOR according to your
> instructions and verified that it can indeed modify the data on the disk.
>
> However, I realized some problems. If we use the ioctl method to modify the
> data, it would require multiple ioctls in user space to modify the
> superblock.
> If one ioctl successfully modifies the superblock data, but another ioctl
> fails, the atomicity of the superblock cannot be guaranteed. This is just
> within one user space process, not to mention the scenario of multiple user
> space processes calling ioctls concurrently. Additionally, multiple ioctls
> modifying the superblock may be placed in multiple transactions, which
> further
> compromises atomicity.
>
> Writing the entire superblock buffer_head to disk can ensure atomicity.

...at a cost of racing with the mounted fs, which might be updating the
superblock at the same time; and prohibiting the kernel devs from
closing the "scribble on mounted bdev" attack surface.

How many of these attributes do you /really/ need to be able to commit
atomically, anyway? If the system crashes, can't you re-run the
program and end up with the same super fields?

--D

> However, these data need to be converted to ext4_sb_info. Otherwise, during
> unmount, the data in memory will overwrite the data on the disk.
> (Modifying the values in ext4_sb_info can potentially introduce unexpected
> issues, just like the problem thata arises when setting the UUID without
> checking ext4_has_feature_csum_seed.)
>
> > So the better approach is to have multiple new ioctl's for each
> > superblock field (or set of fields) that you might want modify. We
> > have some of these already --- for example, EXT4_IOC_SETFSUUID and
> > FS_IOC_SETFSLABEL. For tune2fs, all of additional ioctls for making
> > configuration changes while the file system is mounted are:
> >
> > * EXT4_IOC_SET_FEATURES
> > - for tune2fs -O...
> > * EXT4_IOC_CLEAR_FEATURES
> > - for tune2fs -O^...
> > * EXT4_IOC_SET_ERROR_BEHAVIOR
> > - for tune2fs -e
> > * EXT4_IOC_SET_DEFAULT_MOUNT_FLAGS
> > - for tune2fs -o
> > * EXT4_IOC_SET_DEFAULT_MOUNT_OPTS
> > - for tune2fs -E mount_opts=XXX
> > * EXT4_IOC_SET_FSCK_POLICY
> > - for tune2fs -[cCiT]
> > * EXT4_IOC_SET_RESERVED_ALLOC
> > - for tune2fs -[ugm]
>
> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> index 331859511f80..d598e1975786 100644
> --- a/fs/ext4/ioctl.c
> +++ b/fs/ext4/ioctl.c
> @@ -51,6 +51,11 @@ static void ext4_sb_setuuid(struct ext4_super_block *es,
> const void *arg)
> memcpy(es->s_uuid, (__u8 *)arg, UUID_SIZE);
> }
>
> +static void ext4_sb_set_error_behavior(struct ext4_super_block *es, const
> void *arg)
> +{
> + es->s_errors = cpu_to_le16(*(__u16 *)arg);
> +}
> +
> static
> int ext4_update_primary_sb(struct super_block *sb, handle_t *handle,
> ext4_update_sb_callback func,
> @@ -1220,6 +1225,32 @@ static int ext4_ioctl_setuuid(struct file *filp,
> return ret;
> }
>
> +static int ext4_ioctl_set_error_behavior(struct file *filp,
> + const __u16 __user *uerror_behavior)
> +{
> + int ret = 0;
> + struct super_block *sb = file_inode(filp)->i_sb;
> + __u16 error_behavior;
> +
> + if (!capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + if (copy_from_user(&error_behavior, uerror_behavior,
> sizeof(error_behavior)))
> + return -EFAULT;
> +
> + if (error_behavior < EXT4_ERRORS_CONTINUE || error_behavior >
> EXT4_ERRORS_PANIC)
> + return -EINVAL;
> +
> + ret = mnt_want_write_file(filp);
> + if (ret)
> + return ret;
> +
> + ret = ext4_update_superblocks_fn(sb, ext4_sb_set_error_behavior,
> &error_behavior);
> + mnt_drop_write_file(filp);
> +
> + return ret;
> +}
> +
> static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long
> arg)
> {
> struct inode *inode = file_inode(filp);
> @@ -1607,6 +1638,8 @@ static long __ext4_ioctl(struct file *filp, unsigned
> int cmd, unsigned long arg)
> return ext4_ioctl_getuuid(EXT4_SB(sb), (void __user *)arg);
> case EXT4_IOC_SETFSUUID:
> return ext4_ioctl_setuuid(filp, (const void __user *)arg);
> + case EXT4_IOC_SET_ERROR_BEHAVIOR:
> + return ext4_ioctl_set_error_behavior(filp, (const void __user *)arg);
> default:
> return -ENOTTY;
> }
> diff --git a/include/uapi/linux/ext4.h b/include/uapi/linux/ext4.h
> index 1c4c2dd29112..68329d51a8a7 100644
> --- a/include/uapi/linux/ext4.h
> +++ b/include/uapi/linux/ext4.h
> @@ -33,6 +33,7 @@
> #define EXT4_IOC_CHECKPOINT _IOW('f', 43, __u32)
> #define EXT4_IOC_GETFSUUID _IOR('f', 44, struct fsuuid)
> #define EXT4_IOC_SETFSUUID _IOW('f', 44, struct fsuuid)
> +#define EXT4_IOC_SET_ERROR_BEHAVIOR _IOW('f', 45, __u16)
>
> #define EXT4_IOC_SHUTDOWN _IOR('X', 125, __u32)
>
>
>
> Thanks,
> - bin.
>

2023-07-12 09:18:18

by zhanchengbin

[permalink] [raw]
Subject: Re: [bug report] tune2fs: filesystem inconsistency occurs by concurrent write



On 2023/7/12 8:05, Darrick J. Wong wrote:
> On Sat, Jul 08, 2023 at 03:29:51PM +0800, zhanchengbin wrote:
>> On 2023/7/5 3:33, Theodore Ts'o wrote:
>>
>> I have written the ioctl for EXT4_IOC_SET_ERROR_BEHAVIOR according to your
>> instructions and verified that it can indeed modify the data on the disk.
>>
>> However, I realized some problems. If we use the ioctl method to modify the
>> data, it would require multiple ioctls in user space to modify the
>> superblock.
>> If one ioctl successfully modifies the superblock data, but another ioctl
>> fails, the atomicity of the superblock cannot be guaranteed. This is just
>> within one user space process, not to mention the scenario of multiple user
>> space processes calling ioctls concurrently. Additionally, multiple ioctls
>> modifying the superblock may be placed in multiple transactions, which
>> further
>> compromises atomicity.
>>
>> Writing the entire superblock buffer_head to disk can ensure atomicity.
>
> ...at a cost of racing with the mounted fs, which might be updating the
> superblock at the same time; and prohibiting the kernel devs from
> closing the "scribble on mounted bdev" attack surface.

Regardless of whether I am modifying a single byte or the entire
buffer_head, there will always be a situation of contention with the kernel
lock, You can take a look at ext4_update_superblocks_fn which calls
lock_buffer.

What I am more concerned about is that the superblock needs to be
synchronized to the memory before being saved to the disk. Otherwise, during
the ext4_commit_super process, outdated data may be saved to the disk.

>
> How many of these attributes do you /really/ need to be able to commit

My plan with Tytso is to add seven attribute modifications.

> atomically, anyway? If the system crashes, can't you re-run the
> program and end up with the same super fields?
Just run the program again after the system crashes, o.O? I don't think so.

The program perceives that the superblock has been modified
successfully, and
the value of the modified superblock is saved on disk in
ext4_update_primary_sb, but there is no guarantee whether the super block is
saved in journal on the disk or whether it is checkpointed. If the super
block
has not been saved in journal on the disk and the system crashes, the
modification of the super block may be overwritten when the journal
recover; similarly, this problem will also occur for the translation
that has
not been checkpointed; Both of these scenarios are not perceptible to user
process unless there is a backup mechanism implemented in user mode.

Moreover, the method of rerunning the program cannot resolve the conflicting
racing condition between the two ioctls.

Thanks,
- bin.

>
> --D
>
>> However, these data need to be converted to ext4_sb_info. Otherwise, during
>> unmount, the data in memory will overwrite the data on the disk.
>> (Modifying the values in ext4_sb_info can potentially introduce unexpected
>> issues, just like the problem thata arises when setting the UUID without
>> checking ext4_has_feature_csum_seed.)

2023-07-12 15:50:07

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [bug report] tune2fs: filesystem inconsistency occurs by concurrent write

On Wed, Jul 12, 2023 at 05:06:31PM +0800, zhanchengbin wrote:
> > ...at a cost of racing with the mounted fs, which might be updating the
> > superblock at the same time; and prohibiting the kernel devs from
> > closing the "scribble on mounted bdev" attack surface.
>
> Regardless of whether I am modifying a single byte or the entire
> buffer_head, there will always be a situation of contention with the kernel
> lock, You can take a look at ext4_update_superblocks_fn which calls
> lock_buffer.

Many/most of the fields that tune2fs will need to modify are ones
which the kernel never needs to modify. So no locking will be
necessary, and so long as you are using the journal, we don't need to
worry about an invalid checksum getting written to the disk.

There might be races with buffered reads to the superblock, but those
races exist today. E2fsprogs has a way of dealing this where if the
checksum is invalid, it will just sleep and retry. Another way of
dealing with is to use an O_DIRECT read to the superblock.

Longer-term, we may want to have a EXT4_IOC_GET_SUPERBLOCK ioctl, at
which point we may need to have some kind of new kernel locking ----
or we can just take a snapshot of the superblock, and check to see the
checksum is valid; if not, it can just retry the snapshot.

> What I am more concerned about is that the superblock needs to be
> synchronized to the memory before being saved to the disk. Otherwise, during
> the ext4_commit_super process, outdated data may be saved to the disk.

As I've noted above, this is already not a problem if journalling is
enabled. If journalling is not enabled, it's possible that an invalid
superblock is written to disk. This can sort of happen already, if
you have an orphan list update racing with an ext4_error() update to
the superblock. It's a debateable point how much we should care,
since if you don't have journalling enabled, on a crash the file
system may be corrupted in many situations, and so we will need to use
fsck.ext4 anyway. If there is only a single superblock, then fsck
might not have a fallback superblock to use, but arguably that's a
corner case.

> The program perceives that the superblock has been modified
> successfully, and the value of the modified superblock is saved on
> disk in ext4_update_primary_sb, but there is no guarantee whether
> the super block is saved in journal on the disk or whether it is
> checkpointed.

So in actual practice, e2fsprogs will replay the journal and then
reread the superblock. So if you change the max_mounts_count via
tune2fs -c (even though very few systems use that these days, and it's
largely a deprecated feature), so long as the transaction has been
committed, the superblock update will be honored by e2fsck.

I'm not sure we care *all* that much, but if we really want to make
sure things like tune2fs -c will always "take" after a crash, we can
simply have the ioctl force a journal commit before returning.

> If the super block has not been saved in journal on
> the disk and the system crashes, the modification of the super block
> may be overwritten when the journal recover; similarly, this problem
> will also occur for the translation that has not been checkpointed;
> Both of these scenarios are not perceptible to user process unless
> there is a backup mechanism implemented in user mode.

We now *always* update the superblock through the journal. We only
fall back to a direct update of the superblock if the journal is not
present, or the journal has aborted due to some kind of fundamental
failure (so that the ext4_error can be written out before we reboot
the system or force the file system to be read-only).

> Moreover, the method of rerunning the program cannot resolve the conflicting
> racing condition between the two ioctls.

These ioctls are quite rarely used, and it's a problem we have today
if we have two racing tune2fs commands. By having the kernel handle
it, so long as the two ioctls are modifying different superblock
fields, both ioctl updates will happen --- where as today, when we
have two racing tune2fs, one of the tune2fs updates could be
completely lost. This has been true for decades in ext2, ext3, and
ext4, and no one has actually reported this as a problem.

Cheers,