2021-12-14 17:51:05

by Luis Henriques

[permalink] [raw]
Subject: [PATCH v2] ext4: set csum seed in tmp inode while migrating to extents

When migrating to extents, the temporary inode will have it's own checksum
seed. This means that, when swapping the inodes data, the inode checksums
will be incorrect.

This can be fixed by recalculating the extents checksums again. Or simply
by copying the seed into the temporary inode.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=213357
Reported-by: Jeroen van Wolffelaar <[email protected]>
Signed-off-by: Luís Henriques <[email protected]>
---
fs/ext4/migrate.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

changes since v1:

* Dropped tmp_ei variable
* ->i_csum_seed is now initialised immediately after tmp_inode is created
* New comment about the seed initialization and stating that recovery
needs to be fixed.

Cheers,
--
Luís

diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
index 7e0b4f81c6c0..36dfc88ce05b 100644
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -459,6 +459,17 @@ int ext4_ext_migrate(struct inode *inode)
ext4_journal_stop(handle);
goto out_unlock;
}
+ /*
+ * Use the correct seed for checksum (i.e. the seed from 'inode'). This
+ * is so that the metadata blocks will have the correct checksum after
+ * the migration.
+ *
+ * Note however that, if a crash occurs during the migration process,
+ * the recovery process is broken because the tmp_inode checksums will
+ * be wrong and the orphans cleanup will fail.
+ */
+ ei = EXT4_I(inode);
+ EXT4_I(tmp_inode)->i_csum_seed = ei->i_csum_seed;
i_size_write(tmp_inode, i_size_read(inode));
/*
* Set the i_nlink to zero so it will be deleted later
@@ -502,7 +513,6 @@ int ext4_ext_migrate(struct inode *inode)
goto out_tmp_inode;
}

- ei = EXT4_I(inode);
i_data = ei->i_data;
memset(&lb, 0, sizeof(lb));



2021-12-15 00:49:49

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v2] ext4: set csum seed in tmp inode while migrating to extents

On Tue, Dec 14, 2021 at 05:50:58PM +0000, Lu?s Henriques wrote:
> When migrating to extents, the temporary inode will have it's own checksum
> seed. This means that, when swapping the inodes data, the inode checksums
> will be incorrect.
>
> This can be fixed by recalculating the extents checksums again. Or simply
> by copying the seed into the temporary inode.
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=213357
> Reported-by: Jeroen van Wolffelaar <[email protected]>
> Signed-off-by: Lu?s Henriques <[email protected]>
> ---
> fs/ext4/migrate.c | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
>
> changes since v1:
>
> * Dropped tmp_ei variable
> * ->i_csum_seed is now initialised immediately after tmp_inode is created
> * New comment about the seed initialization and stating that recovery
> needs to be fixed.
>
> Cheers,
> --
> Lu?s
>
> diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
> index 7e0b4f81c6c0..36dfc88ce05b 100644
> --- a/fs/ext4/migrate.c
> +++ b/fs/ext4/migrate.c
> @@ -459,6 +459,17 @@ int ext4_ext_migrate(struct inode *inode)
> ext4_journal_stop(handle);
> goto out_unlock;
> }
> + /*
> + * Use the correct seed for checksum (i.e. the seed from 'inode'). This
> + * is so that the metadata blocks will have the correct checksum after
> + * the migration.
> + *
> + * Note however that, if a crash occurs during the migration process,
> + * the recovery process is broken because the tmp_inode checksums will
> + * be wrong and the orphans cleanup will fail.

...and then what does the user do?

--D

> + */
> + ei = EXT4_I(inode);
> + EXT4_I(tmp_inode)->i_csum_seed = ei->i_csum_seed;
> i_size_write(tmp_inode, i_size_read(inode));
> /*
> * Set the i_nlink to zero so it will be deleted later
> @@ -502,7 +513,6 @@ int ext4_ext_migrate(struct inode *inode)
> goto out_tmp_inode;
> }
>
> - ei = EXT4_I(inode);
> i_data = ei->i_data;
> memset(&lb, 0, sizeof(lb));
>

2021-12-15 10:46:37

by Luis Henriques

[permalink] [raw]
Subject: Re: [PATCH v2] ext4: set csum seed in tmp inode while migrating to extents

On Tue, Dec 14, 2021 at 04:49:45PM -0800, Darrick J. Wong wrote:
> On Tue, Dec 14, 2021 at 05:50:58PM +0000, Lu?s Henriques wrote:
> > When migrating to extents, the temporary inode will have it's own checksum
> > seed. This means that, when swapping the inodes data, the inode checksums
> > will be incorrect.
> >
> > This can be fixed by recalculating the extents checksums again. Or simply
> > by copying the seed into the temporary inode.
> >
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=213357
> > Reported-by: Jeroen van Wolffelaar <[email protected]>
> > Signed-off-by: Lu?s Henriques <[email protected]>
> > ---
> > fs/ext4/migrate.c | 12 +++++++++++-
> > 1 file changed, 11 insertions(+), 1 deletion(-)
> >
> > changes since v1:
> >
> > * Dropped tmp_ei variable
> > * ->i_csum_seed is now initialised immediately after tmp_inode is created
> > * New comment about the seed initialization and stating that recovery
> > needs to be fixed.
> >
> > Cheers,
> > --
> > Lu?s
> >
> > diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
> > index 7e0b4f81c6c0..36dfc88ce05b 100644
> > --- a/fs/ext4/migrate.c
> > +++ b/fs/ext4/migrate.c
> > @@ -459,6 +459,17 @@ int ext4_ext_migrate(struct inode *inode)
> > ext4_journal_stop(handle);
> > goto out_unlock;
> > }
> > + /*
> > + * Use the correct seed for checksum (i.e. the seed from 'inode'). This
> > + * is so that the metadata blocks will have the correct checksum after
> > + * the migration.
> > + *
> > + * Note however that, if a crash occurs during the migration process,
> > + * the recovery process is broken because the tmp_inode checksums will
> > + * be wrong and the orphans cleanup will fail.
>
> ...and then what does the user do?

I can't really say I know what is the right thing for a user to do. But
my understanding is that my patch doesn't change a lot: the recovery
process would still fail in a slightly different way, and will need to be
fixed at some point.

I believe the userspace tools already have support to handle orphan file,
but a quick look at the latest version of the patchset show that these
tools can't handle this either, and the recovery will also fail. But I
may be wrong.

Cheers,
--
Lu?s

>
> --D
>
> > + */
> > + ei = EXT4_I(inode);
> > + EXT4_I(tmp_inode)->i_csum_seed = ei->i_csum_seed;
> > i_size_write(tmp_inode, i_size_read(inode));
> > /*
> > * Set the i_nlink to zero so it will be deleted later
> > @@ -502,7 +513,6 @@ int ext4_ext_migrate(struct inode *inode)
> > goto out_tmp_inode;
> > }
> >
> > - ei = EXT4_I(inode);
> > i_data = ei->i_data;
> > memset(&lb, 0, sizeof(lb));
> >

2021-12-15 11:28:56

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2] ext4: set csum seed in tmp inode while migrating to extents

On Tue 14-12-21 16:49:45, Darrick J. Wong wrote:
> On Tue, Dec 14, 2021 at 05:50:58PM +0000, Lu?s Henriques wrote:
> > When migrating to extents, the temporary inode will have it's own checksum
> > seed. This means that, when swapping the inodes data, the inode checksums
> > will be incorrect.
> >
> > This can be fixed by recalculating the extents checksums again. Or simply
> > by copying the seed into the temporary inode.
> >
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=213357
> > Reported-by: Jeroen van Wolffelaar <[email protected]>
> > Signed-off-by: Lu?s Henriques <[email protected]>
> > ---
> > fs/ext4/migrate.c | 12 +++++++++++-
> > 1 file changed, 11 insertions(+), 1 deletion(-)
> >
> > changes since v1:
> >
> > * Dropped tmp_ei variable
> > * ->i_csum_seed is now initialised immediately after tmp_inode is created
> > * New comment about the seed initialization and stating that recovery
> > needs to be fixed.
> >
> > Cheers,
> > --
> > Lu?s
> >
> > diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
> > index 7e0b4f81c6c0..36dfc88ce05b 100644
> > --- a/fs/ext4/migrate.c
> > +++ b/fs/ext4/migrate.c
> > @@ -459,6 +459,17 @@ int ext4_ext_migrate(struct inode *inode)
> > ext4_journal_stop(handle);
> > goto out_unlock;
> > }
> > + /*
> > + * Use the correct seed for checksum (i.e. the seed from 'inode'). This
> > + * is so that the metadata blocks will have the correct checksum after
> > + * the migration.
> > + *
> > + * Note however that, if a crash occurs during the migration process,
> > + * the recovery process is broken because the tmp_inode checksums will
> > + * be wrong and the orphans cleanup will fail.
>
> ...and then what does the user do?

Run fsck of course! And then recover from backups :) I know this is sad but
the situation is that our migration code just is not crash-safe (if we
crash we are going to free blocks that are still used by the migrated
inode) and Luis makes it work in case we do not crash (which should be
hopefully more common) and documents it does not work in case we crash.
So overall I'd call it a win.

But maybe we should just remove this online-migration functionality
completely from the kernel? That would be also a fine solution for me. I
was thinking whether we could somehow make the inode migration crash-safe
but I didn't think of anything which would not require on-disk format
change...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-12-15 14:12:55

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH v2] ext4: set csum seed in tmp inode while migrating to extents

On Wed, Dec 15, 2021 at 12:28:52PM +0100, Jan Kara wrote:
> On Tue 14-12-21 16:49:45, Darrick J. Wong wrote:
> > On Tue, Dec 14, 2021 at 05:50:58PM +0000, Lu?s Henriques wrote:
> > > When migrating to extents, the temporary inode will have it's own checksum
> > > seed. This means that, when swapping the inodes data, the inode checksums
> > > will be incorrect.
> > >
> > > This can be fixed by recalculating the extents checksums again. Or simply
> > > by copying the seed into the temporary inode.
> > >
> > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=213357
> > > Reported-by: Jeroen van Wolffelaar <[email protected]>
> > > Signed-off-by: Lu?s Henriques <[email protected]>
> > > ---
> > > fs/ext4/migrate.c | 12 +++++++++++-
> > > 1 file changed, 11 insertions(+), 1 deletion(-)
> > >
> > > changes since v1:
> > >
> > > * Dropped tmp_ei variable
> > > * ->i_csum_seed is now initialised immediately after tmp_inode is created
> > > * New comment about the seed initialization and stating that recovery
> > > needs to be fixed.
> > >
> > > Cheers,
> > > --
> > > Lu?s
> > >
> > > diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
> > > index 7e0b4f81c6c0..36dfc88ce05b 100644
> > > --- a/fs/ext4/migrate.c
> > > +++ b/fs/ext4/migrate.c
> > > @@ -459,6 +459,17 @@ int ext4_ext_migrate(struct inode *inode)
> > > ext4_journal_stop(handle);
> > > goto out_unlock;
> > > }
> > > + /*
> > > + * Use the correct seed for checksum (i.e. the seed from 'inode'). This
> > > + * is so that the metadata blocks will have the correct checksum after
> > > + * the migration.
> > > + *
> > > + * Note however that, if a crash occurs during the migration process,
> > > + * the recovery process is broken because the tmp_inode checksums will
> > > + * be wrong and the orphans cleanup will fail.
> >
> > ...and then what does the user do?
>
> Run fsck of course! And then recover from backups :) I know this is sad but
> the situation is that our migration code just is not crash-safe (if we
> crash we are going to free blocks that are still used by the migrated
> inode) and Luis makes it work in case we do not crash (which should be
> hopefully more common) and documents it does not work in case we crash.
> So overall I'd call it a win.
>
> But maybe we should just remove this online-migration functionality
> completely from the kernel? That would be also a fine solution for me. I
> was thinking whether we could somehow make the inode migration crash-safe
> but I didn't think of anything which would not require on-disk format
> change...

Since this is not something that anyone can honestly recommend doing
without a prior backup and a word of warning I personaly would be in favor
of removing it.

-Lukas

>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
>


2021-12-15 16:01:18

by Luis Henriques

[permalink] [raw]
Subject: Re: [PATCH v2] ext4: set csum seed in tmp inode while migrating to extents

On Wed, Dec 15, 2021 at 03:12:37PM +0100, Lukas Czerner wrote:
> On Wed, Dec 15, 2021 at 12:28:52PM +0100, Jan Kara wrote:
> > On Tue 14-12-21 16:49:45, Darrick J. Wong wrote:
> > > On Tue, Dec 14, 2021 at 05:50:58PM +0000, Lu?s Henriques wrote:
> > > > When migrating to extents, the temporary inode will have it's own checksum
> > > > seed. This means that, when swapping the inodes data, the inode checksums
> > > > will be incorrect.
> > > >
> > > > This can be fixed by recalculating the extents checksums again. Or simply
> > > > by copying the seed into the temporary inode.
> > > >
> > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=213357
> > > > Reported-by: Jeroen van Wolffelaar <[email protected]>
> > > > Signed-off-by: Lu?s Henriques <[email protected]>
> > > > ---
> > > > fs/ext4/migrate.c | 12 +++++++++++-
> > > > 1 file changed, 11 insertions(+), 1 deletion(-)
> > > >
> > > > changes since v1:
> > > >
> > > > * Dropped tmp_ei variable
> > > > * ->i_csum_seed is now initialised immediately after tmp_inode is created
> > > > * New comment about the seed initialization and stating that recovery
> > > > needs to be fixed.
> > > >
> > > > Cheers,
> > > > --
> > > > Lu?s
> > > >
> > > > diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
> > > > index 7e0b4f81c6c0..36dfc88ce05b 100644
> > > > --- a/fs/ext4/migrate.c
> > > > +++ b/fs/ext4/migrate.c
> > > > @@ -459,6 +459,17 @@ int ext4_ext_migrate(struct inode *inode)
> > > > ext4_journal_stop(handle);
> > > > goto out_unlock;
> > > > }
> > > > + /*
> > > > + * Use the correct seed for checksum (i.e. the seed from 'inode'). This
> > > > + * is so that the metadata blocks will have the correct checksum after
> > > > + * the migration.
> > > > + *
> > > > + * Note however that, if a crash occurs during the migration process,
> > > > + * the recovery process is broken because the tmp_inode checksums will
> > > > + * be wrong and the orphans cleanup will fail.
> > >
> > > ...and then what does the user do?
> >
> > Run fsck of course! And then recover from backups :) I know this is sad but
> > the situation is that our migration code just is not crash-safe (if we
> > crash we are going to free blocks that are still used by the migrated
> > inode) and Luis makes it work in case we do not crash (which should be
> > hopefully more common) and documents it does not work in case we crash.
> > So overall I'd call it a win.
> >
> > But maybe we should just remove this online-migration functionality
> > completely from the kernel? That would be also a fine solution for me. I
> > was thinking whether we could somehow make the inode migration crash-safe
> > but I didn't think of anything which would not require on-disk format
> > change...
>
> Since this is not something that anyone can honestly recommend doing
> without a prior backup and a word of warning I personaly would be in favor
> of removing it.

It's odd that my first fix to ext4 ends up wiping out a whole feature :-)

Anyway, in case the decision is to go ahead with the feature removal, I'm
inlining bellow an RFC that simply removes the whole file (and returns
-EOPNOTSUPP in the appropriate places). This was only compile-tested, so
I may have missed something.

(Oh, I almost forgot: I also wanted to mention the word dromedary just to
mess a little bit with Jon's article in tomorrow's LWN :-) )

Cheers,
--
Lu?s

From 19810c891ca612fafbfdeeea6651ba5484d5039f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Lu=C3=ADs=20Henriques?= <[email protected]>
Date: Wed, 15 Dec 2021 15:10:26 +0000
Subject: [RFC PATCH] ext4: remove online-migration feature from the kernel
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Online migration is currently broken (see link bellow). Although this
specific issue could be easily fixed, it seems to be very difficult to
implement it in a safe manner so that recovery can be done without risking
loosing data. Instead of changing on-disk format, simply wipe the whole
feature.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=213357
Reported-by: Jeroen van Wolffelaar <[email protected]>
Signed-off-by: Lu?s Henriques <[email protected]>
---
fs/ext4/Makefile | 2 +-
fs/ext4/ext4.h | 4 -
fs/ext4/ioctl.c | 33 +--
fs/ext4/migrate.c | 669 ----------------------------------------------
4 files changed, 7 insertions(+), 701 deletions(-)
delete mode 100644 fs/ext4/migrate.c

diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
index 7d89142e1421..d904a1cbc3d8 100644
--- a/fs/ext4/Makefile
+++ b/fs/ext4/Makefile
@@ -7,7 +7,7 @@ obj-$(CONFIG_EXT4_FS) += ext4.o

ext4-y := balloc.o bitmap.o block_validity.o dir.o ext4_jbd2.o extents.o \
extents_status.o file.o fsmap.o fsync.o hash.o ialloc.o \
- indirect.o inline.o inode.o ioctl.o mballoc.o migrate.o \
+ indirect.o inline.o inode.o ioctl.o mballoc.o \
mmp.o move_extent.o namei.o page-io.o readpage.o resize.o \
super.o symlink.o sysfs.o xattr.o xattr_hurd.o xattr_trusted.o \
xattr_user.o fast_commit.o orphan.o
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 404dd50856e5..7b92120cb560 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3062,10 +3062,6 @@ int ext4_fileattr_set(struct user_namespace *mnt_userns,
int ext4_fileattr_get(struct dentry *dentry, struct fileattr *fa);
extern void ext4_reset_inode_seed(struct inode *inode);

-/* migrate.c */
-extern int ext4_ext_migrate(struct inode *);
-extern int ext4_ind_migrate(struct inode *inode);
-
/* namei.c */
extern int ext4_init_new_dir(handle_t *handle, struct inode *dir,
struct inode *inode);
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 606dee9e08a3..07f8461f69da 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -341,7 +341,7 @@ static int ext4_ioctl_setflags(struct inode *inode,
{
struct ext4_inode_info *ei = EXT4_I(inode);
handle_t *handle = NULL;
- int err = -EPERM, migrate = 0;
+ int err = -EPERM;
struct ext4_iloc iloc;
unsigned int oldflags, mask, i;
struct super_block *sb = inode->i_sb;
@@ -365,8 +365,10 @@ static int ext4_ioctl_setflags(struct inode *inode,
goto flags_out;
}

- if ((flags ^ oldflags) & EXT4_EXTENTS_FL)
- migrate = 1;
+ if ((flags ^ oldflags) & EXT4_EXTENTS_FL) {
+ err = -EOPNOTSUPP;
+ goto flags_out;
+ }

if ((flags ^ oldflags) & EXT4_CASEFOLD_FL) {
if (!ext4_has_feature_casefold(sb)) {
@@ -449,12 +451,6 @@ static int ext4_ioctl_setflags(struct inode *inode,
if (err)
goto flags_out;
}
- if (migrate) {
- if (flags & EXT4_EXTENTS_FL)
- err = ext4_ext_migrate(inode);
- else
- err = ext4_ind_migrate(inode);
- }

flags_out:
return err;
@@ -1009,24 +1005,7 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)

case EXT4_IOC_MIGRATE:
{
- int err;
- if (!inode_owner_or_capable(mnt_userns, inode))
- return -EACCES;
-
- err = mnt_want_write_file(filp);
- if (err)
- return err;
- /*
- * inode_mutex prevent write and truncate on the file.
- * Read still goes through. We take i_data_sem in
- * ext4_ext_swap_inode_data before we switch the
- * inode format to prevent read.
- */
- inode_lock((inode));
- err = ext4_ext_migrate(inode);
- inode_unlock((inode));
- mnt_drop_write_file(filp);
- return err;
+ return -EOPNOTSUPP;
}

case EXT4_IOC_ALLOC_DA_BLKS:
diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
deleted file mode 100644
index 7e0b4f81c6c0..000000000000
--- a/fs/ext4/migrate.c
+++ /dev/null
@@ -1,669 +0,0 @@
-// SPDX-License-Identifier: LGPL-2.1
-/*
- * Copyright IBM Corporation, 2007
- * Author Aneesh Kumar K.V <[email protected]>
- *
- */
-
-#include <linux/slab.h>
-#include "ext4_jbd2.h"
-#include "ext4_extents.h"
-
-/*
- * The contiguous blocks details which can be
- * represented by a single extent
- */
-struct migrate_struct {
- ext4_lblk_t first_block, last_block, curr_block;
- ext4_fsblk_t first_pblock, last_pblock;
-};
-
-static int finish_range(handle_t *handle, struct inode *inode,
- struct migrate_struct *lb)
-
-{
- int retval = 0, needed;
- struct ext4_extent newext;
- struct ext4_ext_path *path;
- if (lb->first_pblock == 0)
- return 0;
-
- /* Add the extent to temp inode*/
- newext.ee_block = cpu_to_le32(lb->first_block);
- newext.ee_len = cpu_to_le16(lb->last_block - lb->first_block + 1);
- ext4_ext_store_pblock(&newext, lb->first_pblock);
- /* Locking only for convenience since we are operating on temp inode */
- down_write(&EXT4_I(inode)->i_data_sem);
- path = ext4_find_extent(inode, lb->first_block, NULL, 0);
- if (IS_ERR(path)) {
- retval = PTR_ERR(path);
- path = NULL;
- goto err_out;
- }
-
- /*
- * Calculate the credit needed to inserting this extent
- * Since we are doing this in loop we may accumulate extra
- * credit. But below we try to not accumulate too much
- * of them by restarting the journal.
- */
- needed = ext4_ext_calc_credits_for_single_extent(inode,
- lb->last_block - lb->first_block + 1, path);
-
- retval = ext4_datasem_ensure_credits(handle, inode, needed, needed, 0);
- if (retval < 0)
- goto err_out;
- retval = ext4_ext_insert_extent(handle, inode, &path, &newext, 0);
-err_out:
- up_write((&EXT4_I(inode)->i_data_sem));
- ext4_ext_drop_refs(path);
- kfree(path);
- lb->first_pblock = 0;
- return retval;
-}
-
-static int update_extent_range(handle_t *handle, struct inode *inode,
- ext4_fsblk_t pblock, struct migrate_struct *lb)
-{
- int retval;
- /*
- * See if we can add on to the existing range (if it exists)
- */
- if (lb->first_pblock &&
- (lb->last_pblock+1 == pblock) &&
- (lb->last_block+1 == lb->curr_block)) {
- lb->last_pblock = pblock;
- lb->last_block = lb->curr_block;
- lb->curr_block++;
- return 0;
- }
- /*
- * Start a new range.
- */
- retval = finish_range(handle, inode, lb);
- lb->first_pblock = lb->last_pblock = pblock;
- lb->first_block = lb->last_block = lb->curr_block;
- lb->curr_block++;
- return retval;
-}
-
-static int update_ind_extent_range(handle_t *handle, struct inode *inode,
- ext4_fsblk_t pblock,
- struct migrate_struct *lb)
-{
- struct buffer_head *bh;
- __le32 *i_data;
- int i, retval = 0;
- unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
-
- bh = ext4_sb_bread(inode->i_sb, pblock, 0);
- if (IS_ERR(bh))
- return PTR_ERR(bh);
-
- i_data = (__le32 *)bh->b_data;
- for (i = 0; i < max_entries; i++) {
- if (i_data[i]) {
- retval = update_extent_range(handle, inode,
- le32_to_cpu(i_data[i]), lb);
- if (retval)
- break;
- } else {
- lb->curr_block++;
- }
- }
- put_bh(bh);
- return retval;
-
-}
-
-static int update_dind_extent_range(handle_t *handle, struct inode *inode,
- ext4_fsblk_t pblock,
- struct migrate_struct *lb)
-{
- struct buffer_head *bh;
- __le32 *i_data;
- int i, retval = 0;
- unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
-
- bh = ext4_sb_bread(inode->i_sb, pblock, 0);
- if (IS_ERR(bh))
- return PTR_ERR(bh);
-
- i_data = (__le32 *)bh->b_data;
- for (i = 0; i < max_entries; i++) {
- if (i_data[i]) {
- retval = update_ind_extent_range(handle, inode,
- le32_to_cpu(i_data[i]), lb);
- if (retval)
- break;
- } else {
- /* Only update the file block number */
- lb->curr_block += max_entries;
- }
- }
- put_bh(bh);
- return retval;
-
-}
-
-static int update_tind_extent_range(handle_t *handle, struct inode *inode,
- ext4_fsblk_t pblock,
- struct migrate_struct *lb)
-{
- struct buffer_head *bh;
- __le32 *i_data;
- int i, retval = 0;
- unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
-
- bh = ext4_sb_bread(inode->i_sb, pblock, 0);
- if (IS_ERR(bh))
- return PTR_ERR(bh);
-
- i_data = (__le32 *)bh->b_data;
- for (i = 0; i < max_entries; i++) {
- if (i_data[i]) {
- retval = update_dind_extent_range(handle, inode,
- le32_to_cpu(i_data[i]), lb);
- if (retval)
- break;
- } else {
- /* Only update the file block number */
- lb->curr_block += max_entries * max_entries;
- }
- }
- put_bh(bh);
- return retval;
-
-}
-
-static int free_dind_blocks(handle_t *handle,
- struct inode *inode, __le32 i_data)
-{
- int i;
- __le32 *tmp_idata;
- struct buffer_head *bh;
- struct super_block *sb = inode->i_sb;
- unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
- int err;
-
- bh = ext4_sb_bread(sb, le32_to_cpu(i_data), 0);
- if (IS_ERR(bh))
- return PTR_ERR(bh);
-
- tmp_idata = (__le32 *)bh->b_data;
- for (i = 0; i < max_entries; i++) {
- if (tmp_idata[i]) {
- err = ext4_journal_ensure_credits(handle,
- EXT4_RESERVE_TRANS_BLOCKS,
- ext4_free_metadata_revoke_credits(sb, 1));
- if (err < 0) {
- put_bh(bh);
- return err;
- }
- ext4_free_blocks(handle, inode, NULL,
- le32_to_cpu(tmp_idata[i]), 1,
- EXT4_FREE_BLOCKS_METADATA |
- EXT4_FREE_BLOCKS_FORGET);
- }
- }
- put_bh(bh);
- err = ext4_journal_ensure_credits(handle, EXT4_RESERVE_TRANS_BLOCKS,
- ext4_free_metadata_revoke_credits(sb, 1));
- if (err < 0)
- return err;
- ext4_free_blocks(handle, inode, NULL, le32_to_cpu(i_data), 1,
- EXT4_FREE_BLOCKS_METADATA |
- EXT4_FREE_BLOCKS_FORGET);
- return 0;
-}
-
-static int free_tind_blocks(handle_t *handle,
- struct inode *inode, __le32 i_data)
-{
- int i, retval = 0;
- __le32 *tmp_idata;
- struct buffer_head *bh;
- unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
-
- bh = ext4_sb_bread(inode->i_sb, le32_to_cpu(i_data), 0);
- if (IS_ERR(bh))
- return PTR_ERR(bh);
-
- tmp_idata = (__le32 *)bh->b_data;
- for (i = 0; i < max_entries; i++) {
- if (tmp_idata[i]) {
- retval = free_dind_blocks(handle,
- inode, tmp_idata[i]);
- if (retval) {
- put_bh(bh);
- return retval;
- }
- }
- }
- put_bh(bh);
- retval = ext4_journal_ensure_credits(handle, EXT4_RESERVE_TRANS_BLOCKS,
- ext4_free_metadata_revoke_credits(inode->i_sb, 1));
- if (retval < 0)
- return retval;
- ext4_free_blocks(handle, inode, NULL, le32_to_cpu(i_data), 1,
- EXT4_FREE_BLOCKS_METADATA |
- EXT4_FREE_BLOCKS_FORGET);
- return 0;
-}
-
-static int free_ind_block(handle_t *handle, struct inode *inode, __le32 *i_data)
-{
- int retval;
-
- /* ei->i_data[EXT4_IND_BLOCK] */
- if (i_data[0]) {
- retval = ext4_journal_ensure_credits(handle,
- EXT4_RESERVE_TRANS_BLOCKS,
- ext4_free_metadata_revoke_credits(inode->i_sb, 1));
- if (retval < 0)
- return retval;
- ext4_free_blocks(handle, inode, NULL,
- le32_to_cpu(i_data[0]), 1,
- EXT4_FREE_BLOCKS_METADATA |
- EXT4_FREE_BLOCKS_FORGET);
- }
-
- /* ei->i_data[EXT4_DIND_BLOCK] */
- if (i_data[1]) {
- retval = free_dind_blocks(handle, inode, i_data[1]);
- if (retval)
- return retval;
- }
-
- /* ei->i_data[EXT4_TIND_BLOCK] */
- if (i_data[2]) {
- retval = free_tind_blocks(handle, inode, i_data[2]);
- if (retval)
- return retval;
- }
- return 0;
-}
-
-static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode,
- struct inode *tmp_inode)
-{
- int retval, retval2 = 0;
- __le32 i_data[3];
- struct ext4_inode_info *ei = EXT4_I(inode);
- struct ext4_inode_info *tmp_ei = EXT4_I(tmp_inode);
-
- /*
- * One credit accounted for writing the
- * i_data field of the original inode
- */
- retval = ext4_journal_ensure_credits(handle, 1, 0);
- if (retval < 0)
- goto err_out;
-
- i_data[0] = ei->i_data[EXT4_IND_BLOCK];
- i_data[1] = ei->i_data[EXT4_DIND_BLOCK];
- i_data[2] = ei->i_data[EXT4_TIND_BLOCK];
-
- down_write(&EXT4_I(inode)->i_data_sem);
- /*
- * if EXT4_STATE_EXT_MIGRATE is cleared a block allocation
- * happened after we started the migrate. We need to
- * fail the migrate
- */
- if (!ext4_test_inode_state(inode, EXT4_STATE_EXT_MIGRATE)) {
- retval = -EAGAIN;
- up_write(&EXT4_I(inode)->i_data_sem);
- goto err_out;
- } else
- ext4_clear_inode_state(inode, EXT4_STATE_EXT_MIGRATE);
- /*
- * We have the extent map build with the tmp inode.
- * Now copy the i_data across
- */
- ext4_set_inode_flag(inode, EXT4_INODE_EXTENTS);
- memcpy(ei->i_data, tmp_ei->i_data, sizeof(ei->i_data));
-
- /*
- * Update i_blocks with the new blocks that got
- * allocated while adding extents for extent index
- * blocks.
- *
- * While converting to extents we need not
- * update the original inode i_blocks for extent blocks
- * via quota APIs. The quota update happened via tmp_inode already.
- */
- spin_lock(&inode->i_lock);
- inode->i_blocks += tmp_inode->i_blocks;
- spin_unlock(&inode->i_lock);
- up_write(&EXT4_I(inode)->i_data_sem);
-
- /*
- * We mark the inode dirty after, because we decrement the
- * i_blocks when freeing the indirect meta-data blocks
- */
- retval = free_ind_block(handle, inode, i_data);
- retval2 = ext4_mark_inode_dirty(handle, inode);
- if (unlikely(retval2 && !retval))
- retval = retval2;
-
-err_out:
- return retval;
-}
-
-static int free_ext_idx(handle_t *handle, struct inode *inode,
- struct ext4_extent_idx *ix)
-{
- int i, retval = 0;
- ext4_fsblk_t block;
- struct buffer_head *bh;
- struct ext4_extent_header *eh;
-
- block = ext4_idx_pblock(ix);
- bh = ext4_sb_bread(inode->i_sb, block, 0);
- if (IS_ERR(bh))
- return PTR_ERR(bh);
-
- eh = (struct ext4_extent_header *)bh->b_data;
- if (eh->eh_depth != 0) {
- ix = EXT_FIRST_INDEX(eh);
- for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
- retval = free_ext_idx(handle, inode, ix);
- if (retval) {
- put_bh(bh);
- return retval;
- }
- }
- }
- put_bh(bh);
- retval = ext4_journal_ensure_credits(handle, EXT4_RESERVE_TRANS_BLOCKS,
- ext4_free_metadata_revoke_credits(inode->i_sb, 1));
- if (retval < 0)
- return retval;
- ext4_free_blocks(handle, inode, NULL, block, 1,
- EXT4_FREE_BLOCKS_METADATA | EXT4_FREE_BLOCKS_FORGET);
- return 0;
-}
-
-/*
- * Free the extent meta data blocks only
- */
-static int free_ext_block(handle_t *handle, struct inode *inode)
-{
- int i, retval = 0;
- struct ext4_inode_info *ei = EXT4_I(inode);
- struct ext4_extent_header *eh = (struct ext4_extent_header *)ei->i_data;
- struct ext4_extent_idx *ix;
- if (eh->eh_depth == 0)
- /*
- * No extra blocks allocated for extent meta data
- */
- return 0;
- ix = EXT_FIRST_INDEX(eh);
- for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
- retval = free_ext_idx(handle, inode, ix);
- if (retval)
- return retval;
- }
- return retval;
-}
-
-int ext4_ext_migrate(struct inode *inode)
-{
- struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
- handle_t *handle;
- int retval = 0, i;
- __le32 *i_data;
- struct ext4_inode_info *ei;
- struct inode *tmp_inode = NULL;
- struct migrate_struct lb;
- unsigned long max_entries;
- __u32 goal;
- uid_t owner[2];
-
- /*
- * If the filesystem does not support extents, or the inode
- * already is extent-based, error out.
- */
- if (!ext4_has_feature_extents(inode->i_sb) ||
- (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
- return -EINVAL;
-
- if (S_ISLNK(inode->i_mode) && inode->i_blocks == 0)
- /*
- * don't migrate fast symlink
- */
- return retval;
-
- percpu_down_write(&sbi->s_writepages_rwsem);
-
- /*
- * Worst case we can touch the allocation bitmaps, a bgd
- * block, and a block to link in the orphan list. We do need
- * need to worry about credits for modifying the quota inode.
- */
- handle = ext4_journal_start(inode, EXT4_HT_MIGRATE,
- 4 + EXT4_MAXQUOTAS_TRANS_BLOCKS(inode->i_sb));
-
- if (IS_ERR(handle)) {
- retval = PTR_ERR(handle);
- goto out_unlock;
- }
- goal = (((inode->i_ino - 1) / EXT4_INODES_PER_GROUP(inode->i_sb)) *
- EXT4_INODES_PER_GROUP(inode->i_sb)) + 1;
- owner[0] = i_uid_read(inode);
- owner[1] = i_gid_read(inode);
- tmp_inode = ext4_new_inode(handle, d_inode(inode->i_sb->s_root),
- S_IFREG, NULL, goal, owner, 0);
- if (IS_ERR(tmp_inode)) {
- retval = PTR_ERR(tmp_inode);
- ext4_journal_stop(handle);
- goto out_unlock;
- }
- i_size_write(tmp_inode, i_size_read(inode));
- /*
- * Set the i_nlink to zero so it will be deleted later
- * when we drop inode reference.
- */
- clear_nlink(tmp_inode);
-
- ext4_ext_tree_init(handle, tmp_inode);
- ext4_orphan_add(handle, tmp_inode);
- ext4_journal_stop(handle);
-
- /*
- * start with one credit accounted for
- * superblock modification.
- *
- * For the tmp_inode we already have committed the
- * transaction that created the inode. Later as and
- * when we add extents we extent the journal
- */
- /*
- * Even though we take i_mutex we can still cause block
- * allocation via mmap write to holes. If we have allocated
- * new blocks we fail migrate. New block allocation will
- * clear EXT4_STATE_EXT_MIGRATE flag. The flag is updated
- * with i_data_sem held to prevent racing with block
- * allocation.
- */
- down_read(&EXT4_I(inode)->i_data_sem);
- ext4_set_inode_state(inode, EXT4_STATE_EXT_MIGRATE);
- up_read((&EXT4_I(inode)->i_data_sem));
-
- handle = ext4_journal_start(inode, EXT4_HT_MIGRATE, 1);
- if (IS_ERR(handle)) {
- /*
- * It is impossible to update on-disk structures without
- * a handle, so just rollback in-core changes and live other
- * work to orphan_list_cleanup()
- */
- ext4_orphan_del(NULL, tmp_inode);
- retval = PTR_ERR(handle);
- goto out_tmp_inode;
- }
-
- ei = EXT4_I(inode);
- i_data = ei->i_data;
- memset(&lb, 0, sizeof(lb));
-
- /* 32 bit block address 4 bytes */
- max_entries = inode->i_sb->s_blocksize >> 2;
- for (i = 0; i < EXT4_NDIR_BLOCKS; i++) {
- if (i_data[i]) {
- retval = update_extent_range(handle, tmp_inode,
- le32_to_cpu(i_data[i]), &lb);
- if (retval)
- goto err_out;
- } else
- lb.curr_block++;
- }
- if (i_data[EXT4_IND_BLOCK]) {
- retval = update_ind_extent_range(handle, tmp_inode,
- le32_to_cpu(i_data[EXT4_IND_BLOCK]), &lb);
- if (retval)
- goto err_out;
- } else
- lb.curr_block += max_entries;
- if (i_data[EXT4_DIND_BLOCK]) {
- retval = update_dind_extent_range(handle, tmp_inode,
- le32_to_cpu(i_data[EXT4_DIND_BLOCK]), &lb);
- if (retval)
- goto err_out;
- } else
- lb.curr_block += max_entries * max_entries;
- if (i_data[EXT4_TIND_BLOCK]) {
- retval = update_tind_extent_range(handle, tmp_inode,
- le32_to_cpu(i_data[EXT4_TIND_BLOCK]), &lb);
- if (retval)
- goto err_out;
- }
- /*
- * Build the last extent
- */
- retval = finish_range(handle, tmp_inode, &lb);
-err_out:
- if (retval)
- /*
- * Failure case delete the extent information with the
- * tmp_inode
- */
- free_ext_block(handle, tmp_inode);
- else {
- retval = ext4_ext_swap_inode_data(handle, inode, tmp_inode);
- if (retval)
- /*
- * if we fail to swap inode data free the extent
- * details of the tmp inode
- */
- free_ext_block(handle, tmp_inode);
- }
-
- /* We mark the tmp_inode dirty via ext4_ext_tree_init. */
- retval = ext4_journal_ensure_credits(handle, 1, 0);
- if (retval < 0)
- goto out_stop;
- /*
- * Mark the tmp_inode as of size zero
- */
- i_size_write(tmp_inode, 0);
-
- /*
- * set the i_blocks count to zero
- * so that the ext4_evict_inode() does the
- * right job
- *
- * We don't need to take the i_lock because
- * the inode is not visible to user space.
- */
- tmp_inode->i_blocks = 0;
-
- /* Reset the extent details */
- ext4_ext_tree_init(handle, tmp_inode);
-out_stop:
- ext4_journal_stop(handle);
-out_tmp_inode:
- unlock_new_inode(tmp_inode);
- iput(tmp_inode);
-out_unlock:
- percpu_up_write(&sbi->s_writepages_rwsem);
- return retval;
-}
-
-/*
- * Migrate a simple extent-based inode to use the i_blocks[] array
- */
-int ext4_ind_migrate(struct inode *inode)
-{
- struct ext4_extent_header *eh;
- struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
- struct ext4_super_block *es = sbi->s_es;
- struct ext4_inode_info *ei = EXT4_I(inode);
- struct ext4_extent *ex;
- unsigned int i, len;
- ext4_lblk_t start, end;
- ext4_fsblk_t blk;
- handle_t *handle;
- int ret, ret2 = 0;
-
- if (!ext4_has_feature_extents(inode->i_sb) ||
- (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
- return -EINVAL;
-
- if (ext4_has_feature_bigalloc(inode->i_sb))
- return -EOPNOTSUPP;
-
- /*
- * In order to get correct extent info, force all delayed allocation
- * blocks to be allocated, otherwise delayed allocation blocks may not
- * be reflected and bypass the checks on extent header.
- */
- if (test_opt(inode->i_sb, DELALLOC))
- ext4_alloc_da_blocks(inode);
-
- percpu_down_write(&sbi->s_writepages_rwsem);
-
- handle = ext4_journal_start(inode, EXT4_HT_MIGRATE, 1);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto out_unlock;
- }
-
- down_write(&EXT4_I(inode)->i_data_sem);
- ret = ext4_ext_check_inode(inode);
- if (ret)
- goto errout;
-
- eh = ext_inode_hdr(inode);
- ex = EXT_FIRST_EXTENT(eh);
- if (ext4_blocks_count(es) > EXT4_MAX_BLOCK_FILE_PHYS ||
- eh->eh_depth != 0 || le16_to_cpu(eh->eh_entries) > 1) {
- ret = -EOPNOTSUPP;
- goto errout;
- }
- if (eh->eh_entries == 0)
- blk = len = start = end = 0;
- else {
- len = le16_to_cpu(ex->ee_len);
- blk = ext4_ext_pblock(ex);
- start = le32_to_cpu(ex->ee_block);
- end = start + len - 1;
- if (end >= EXT4_NDIR_BLOCKS) {
- ret = -EOPNOTSUPP;
- goto errout;
- }
- }
-
- ext4_clear_inode_flag(inode, EXT4_INODE_EXTENTS);
- memset(ei->i_data, 0, sizeof(ei->i_data));
- for (i = start; i <= end; i++)
- ei->i_data[i] = cpu_to_le32(blk++);
- ret2 = ext4_mark_inode_dirty(handle, inode);
- if (unlikely(ret2 && !ret))
- ret = ret2;
-errout:
- ext4_journal_stop(handle);
- up_write(&EXT4_I(inode)->i_data_sem);
-out_unlock:
- percpu_up_write(&sbi->s_writepages_rwsem);
- return ret;
-}

2021-12-16 11:23:54

by Luis Henriques

[permalink] [raw]
Subject: Re: [PATCH v2] ext4: set csum seed in tmp inode while migrating to extents

On Wed, Dec 15, 2021 at 03:12:37PM +0100, Lukas Czerner wrote:
> On Wed, Dec 15, 2021 at 12:28:52PM +0100, Jan Kara wrote:
> > On Tue 14-12-21 16:49:45, Darrick J. Wong wrote:
> > > On Tue, Dec 14, 2021 at 05:50:58PM +0000, Lu?s Henriques wrote:
> > > > When migrating to extents, the temporary inode will have it's own checksum
> > > > seed. This means that, when swapping the inodes data, the inode checksums
> > > > will be incorrect.
> > > >
> > > > This can be fixed by recalculating the extents checksums again. Or simply
> > > > by copying the seed into the temporary inode.
> > > >
> > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=213357
> > > > Reported-by: Jeroen van Wolffelaar <[email protected]>
> > > > Signed-off-by: Lu?s Henriques <[email protected]>
> > > > ---
> > > > fs/ext4/migrate.c | 12 +++++++++++-
> > > > 1 file changed, 11 insertions(+), 1 deletion(-)
> > > >
> > > > changes since v1:
> > > >
> > > > * Dropped tmp_ei variable
> > > > * ->i_csum_seed is now initialised immediately after tmp_inode is created
> > > > * New comment about the seed initialization and stating that recovery
> > > > needs to be fixed.
> > > >
> > > > Cheers,
> > > > --
> > > > Lu?s
> > > >
> > > > diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
> > > > index 7e0b4f81c6c0..36dfc88ce05b 100644
> > > > --- a/fs/ext4/migrate.c
> > > > +++ b/fs/ext4/migrate.c
> > > > @@ -459,6 +459,17 @@ int ext4_ext_migrate(struct inode *inode)
> > > > ext4_journal_stop(handle);
> > > > goto out_unlock;
> > > > }
> > > > + /*
> > > > + * Use the correct seed for checksum (i.e. the seed from 'inode'). This
> > > > + * is so that the metadata blocks will have the correct checksum after
> > > > + * the migration.
> > > > + *
> > > > + * Note however that, if a crash occurs during the migration process,
> > > > + * the recovery process is broken because the tmp_inode checksums will
> > > > + * be wrong and the orphans cleanup will fail.
> > >
> > > ...and then what does the user do?
> >
> > Run fsck of course! And then recover from backups :) I know this is sad but
> > the situation is that our migration code just is not crash-safe (if we
> > crash we are going to free blocks that are still used by the migrated
> > inode) and Luis makes it work in case we do not crash (which should be
> > hopefully more common) and documents it does not work in case we crash.
> > So overall I'd call it a win.
> >
> > But maybe we should just remove this online-migration functionality
> > completely from the kernel? That would be also a fine solution for me. I
> > was thinking whether we could somehow make the inode migration crash-safe
> > but I didn't think of anything which would not require on-disk format
> > change...
>
> Since this is not something that anyone can honestly recommend doing
> without a prior backup and a word of warning I personaly would be in favor
> of removing it.

BTW, in case migration is kept in the kernel (even with the broken
recovery), I think it's worth turning this bug reproducer into an ext4
fstest. I was planning to do so, but I'd rather wait to see if the effort
is worthwhile (i.e. if migration is kept or not).

Cheers,
--
Lu?s

2021-12-16 18:32:37

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v2] ext4: set csum seed in tmp inode while migrating to extents

On Wed, Dec 15, 2021 at 03:12:37PM +0100, Lukas Czerner wrote:
> > Run fsck of course! And then recover from backups :) I know this is sad but
> > the situation is that our migration code just is not crash-safe (if we
> > crash we are going to free blocks that are still used by the migrated
> > inode) and Luis makes it work in case we do not crash (which should be
> > hopefully more common) and documents it does not work in case we crash.
> > So overall I'd call it a win.
> >
> > But maybe we should just remove this online-migration functionality
> > completely from the kernel? That would be also a fine solution for me. I
> > was thinking whether we could somehow make the inode migration crash-safe
> > but I didn't think of anything which would not require on-disk format
> > change...
>
> Since this is not something that anyone can honestly recommend doing
> without a prior backup and a word of warning I personaly would be in favor
> of removing it.

So there are a couple options that we could pursue:

1) We could change the migrate code to stop putting the orphan inode
on the orphan list. If we do this, an crash in the middle of the
migrate will, in the worst case (when the migration isn't completed
within a single jbd2 transaction) result in a leaked inode. That's
not ideal, but it won't lead user data loss, and e2fsck will recover
the situation by cloning the blocks, and leaving the inode in
lost+found.

2) We could try to ensure migration happens all within a single
transaction, if they all fit inside a the inode structure, we allocate
a tmp inode for all of the indirect blocks, attach the blocks to the
tmp inode, place the tmp inode on the orphan list, and put all of that
on a single handle, and then in a second handle, truncate the tmp
inode to release the indirect blocks. If we need to allocate extent
tree blocks, then all of that would need to fit in a single
transaction, and it's a bit more complicated, but it is doable.

3) We can simply remove the inode migration feature by removing
EXT4_EXTENTS_FL from EXT4_FL_USER_MODIFIABLE, and changing the
implementation of the EXT4_IOC_MIGRATE ioctl to return EOPNOTSUPP, and
then cleaning up the code paths that are now unreachable.

The migration feature is clearly less compelling than it was ten years
ago, when ext4 was first introduced --- and most enterprise distros
have never supported the feature even when it has existed. Also on
the plus side, we've never shipped a program to globally migrate a
file system by using ioctl interface.

On the other hand, there may have been user shell scripts that have
done something like "find /mntpt -type f -print0 | xargs -0 chattr +e {} \;"
And so option #3 could be construed as "breaking userspace", especially
without a deprecation window.

Furthermore, Option #1 is pretty simple to implement, and chances of a
migration getting spread across two jbd2 commits is not actually
pretty low. And if it does happen, there would only be a single inode
that would get its blocks cloned and attached to lost+found.

Thats being said, if we *did* option #1, in the long run we'd want to
land a complete solution, which would either be something like option
#2, or allocating a flag to give a hint to e2fsprogs that if it does
find an leaked inode with with the flag set on the on-disk inode, that
all it needs to do is to zero out the inode and be done with it.

So the question is, is it worth it to continue supporting the migrate
feature, or should we just delete all of the migration code, and risk
users complaining that we've broken their use case? The chances of
that happening is admittedly low, and Linus's rule that "it's only
breaking userspace if a user complains" means we might very well get
away with it. :-)

- Ted

2021-12-17 09:35:50

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH v2] ext4: set csum seed in tmp inode while migrating to extents

On Thu, Dec 16, 2021 at 01:32:14PM -0500, Theodore Ts'o wrote:
> On Wed, Dec 15, 2021 at 03:12:37PM +0100, Lukas Czerner wrote:
> > > Run fsck of course! And then recover from backups :) I know this is sad but
> > > the situation is that our migration code just is not crash-safe (if we
> > > crash we are going to free blocks that are still used by the migrated
> > > inode) and Luis makes it work in case we do not crash (which should be
> > > hopefully more common) and documents it does not work in case we crash.
> > > So overall I'd call it a win.
> > >
> > > But maybe we should just remove this online-migration functionality
> > > completely from the kernel? That would be also a fine solution for me. I
> > > was thinking whether we could somehow make the inode migration crash-safe
> > > but I didn't think of anything which would not require on-disk format
> > > change...
> >
> > Since this is not something that anyone can honestly recommend doing
> > without a prior backup and a word of warning I personaly would be in favor
> > of removing it.
>
> So there are a couple options that we could pursue:
>
> 1) We could change the migrate code to stop putting the orphan inode
> on the orphan list. If we do this, an crash in the middle of the
> migrate will, in the worst case (when the migration isn't completed
> within a single jbd2 transaction) result in a leaked inode. That's
> not ideal, but it won't lead user data loss, and e2fsck will recover
> the situation by cloning the blocks, and leaving the inode in
> lost+found.
>
> 2) We could try to ensure migration happens all within a single
> transaction, if they all fit inside a the inode structure, we allocate
> a tmp inode for all of the indirect blocks, attach the blocks to the
> tmp inode, place the tmp inode on the orphan list, and put all of that
> on a single handle, and then in a second handle, truncate the tmp
> inode to release the indirect blocks. If we need to allocate extent
> tree blocks, then all of that would need to fit in a single
> transaction, and it's a bit more complicated, but it is doable.
>
> 3) We can simply remove the inode migration feature by removing
> EXT4_EXTENTS_FL from EXT4_FL_USER_MODIFIABLE, and changing the
> implementation of the EXT4_IOC_MIGRATE ioctl to return EOPNOTSUPP, and
> then cleaning up the code paths that are now unreachable.
>
> The migration feature is clearly less compelling than it was ten years
> ago, when ext4 was first introduced --- and most enterprise distros
> have never supported the feature even when it has existed. Also on
> the plus side, we've never shipped a program to globally migrate a
> file system by using ioctl interface.
>
> On the other hand, there may have been user shell scripts that have
> done something like "find /mntpt -type f -print0 | xargs -0 chattr +e {} \;"
> And so option #3 could be construed as "breaking userspace", especially
> without a deprecation window.
>
> Furthermore, Option #1 is pretty simple to implement, and chances of a
> migration getting spread across two jbd2 commits is not actually
> pretty low. And if it does happen, there would only be a single inode
> that would get its blocks cloned and attached to lost+found.
>
> Thats being said, if we *did* option #1, in the long run we'd want to
> land a complete solution, which would either be something like option
> #2, or allocating a flag to give a hint to e2fsprogs that if it does
> find an leaked inode with with the flag set on the on-disk inode, that
> all it needs to do is to zero out the inode and be done with it.
>
> So the question is, is it worth it to continue supporting the migrate
> feature, or should we just delete all of the migration code, and risk
> users complaining that we've broken their use case? The chances of
> that happening is admittedly low, and Linus's rule that "it's only
> breaking userspace if a user complains" means we might very well get
> away with it. :-)

That's a very good summary Ted, thanks.

Our rationale behind not supporting the migration was always the fact
that we felt that backup was absolutely necessary before operation like
this. When you already have up-to-date backup available you might as
well create a fresh ext4 file system with all the advantages it brings
and recover data from said backup. I think this is still a very
reasonable approach.

I have no doubt it was useful featrure in the early days of ext4, but I
think we're well past that. Any attempt to rework or fix the feature
assumes it's still useful and has it's users today and into the future.
I very much doubt that, so let's test it. Let's start a year long
deprecation window.

-Lukas

>
> - Ted
>


2021-12-17 15:42:07

by Jeroen van Wolffelaar

[permalink] [raw]
Subject: Re: [PATCH v2] ext4: set csum seed in tmp inode while migrating to extents

First of all, thanks Luís for figuring out the bug and writing a patch.

On 2021-12-16 18:32, Theodore Ts'o wrote:
> So the question is, is it worth it to continue supporting the migrate
> feature, or should we just delete all of the migration code, and risk
> users complaining that we've broken their use case? The chances of
> that happening is admittedly low, and Linus's rule that "it's only
> breaking userspace if a user complains" means we might very well get
> away with it. :-)

As the person who ran into this bug and then filed the issue in
bugzilla, my two cents:

I can do without the migration functionality, it is slightly
inconvenient, but that's it. I do believe having an always-broken
migration path (HEAD kernel, if csum is on) is much worse than not
having one. I don't personally care if a crash means I lose one or
several files, but I feel it's really not okay for chattr to risk losing
a file in a crash on a journalled filesystem. I would consider that a
severe data loss bug.

Thus, I support removing this feature. At the same time, I believe the
original patch is strictly better than the existing situation, so I also
support adding that to the kernel so that in the happy case, this
doesn't cause e2fsck failures, pending the (much longer) deprecation
path.


In case anyone cares, my usecase, so you can see it's quite esoteric.
You can also stop reading, I believe that's better :-P.

I have a rotation of harddisks that I use to have offline backups. They
were ext3 until the day I ran into this bug, I finally got around to
upgrading the filesystems (I never wanted to upgrade the instant ext4
became available, but arguably this is somewhat late too).

Because I knew -- out of prior interest -- that ext3 blocklists did not
have checksums, I wanted to ensure all existing files were csum too, so
extents. chattr +e promised to do exactly that. Well, it didn't work.

All files are indexed and checksummed, losing any or all, I can recover
from. So a working chattr +e, even if it risks losing some or all data,
would be 'best' for my case. My alternative is just copying all files
in-disk, which is a few TiB of extra I/O, but whatever.

Recreating the filesystem is annoying, since that brings the number of
backup copies *intentionally* down by one, temporarily, which I do not
want to do. I want my buffer for unintentional failures.

Thank you all,
--Jeroen

2021-12-28 22:40:20

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH v2] ext4: set csum seed in tmp inode while migrating to extents

Hi!

> > So the question is, is it worth it to continue supporting the migrate
> > feature, or should we just delete all of the migration code, and risk
> > users complaining that we've broken their use case? The chances of
> > that happening is admittedly low, and Linus's rule that "it's only
> > breaking userspace if a user complains" means we might very well get
> > away with it. :-)
>
> That's a very good summary Ted, thanks.
>
> Our rationale behind not supporting the migration was always the fact
> that we felt that backup was absolutely necessary before operation like
> this. When you already have up-to-date backup available you might as
> well create a fresh ext4 file system with all the advantages it brings
> and recover data from said backup. I think this is still a very
> reasonable approach.

Umm. Not really?

First... full backup/restore will take a _long_ time.

Second... if you do online migration, you have filesystem you are
quite unlikely to corrupt, and backup you are unlikely to use. If you
do backup/restore, you have to be _way_ more careful that backup media
is reliable etc.

Best regards,
Pavel
--
http://www.livejournal.com/~pavelmachek


Attachments:
(No filename) (1.16 kB)
signature.asc (195.00 B)
Download all attachments

2021-12-30 06:57:13

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v2] ext4: set csum seed in tmp inode while migrating to extents

On Tue, Dec 28, 2021 at 11:40:17PM +0100, Pavel Machek wrote:
> > Our rationale behind not supporting the migration was always the fact
> > that we felt that backup was absolutely necessary before operation like
> > this. When you already have up-to-date backup available you might as
> > well create a fresh ext4 file system with all the advantages it brings
> > and recover data from said backup. I think this is still a very
> > reasonable approach.
>
> Umm. Not really?
>
> First... full backup/restore will take a _long_ time.

How big is an ext3 file system going to be, though? Ext4 was
available in RHEL 5, and was fully supported in RHEL 6 --- which was
released in the year 2000. Back in 2000, the biggest Seagate
Barracuda drive you could get was 30GB. The biggest disk available at
that time was the IBM Deskstar 75GXP, which was 75GB.

So a 5 disk RAID array from the era where you might have still been
using ext3 *might* have been as large as 300GB.

> Second... if you do online migration, you have filesystem you are
> quite unlikely to corrupt, and backup you are unlikely to use. If you
> do backup/restore, you have to be _way_ more careful that backup media
> is reliable etc.

21 years later, those IDE disks are probably not even functioning any
more. Every 4-7 years, depending on how careful you want to be, you'd
would be well-advised to buy new hard drives, since back then disk
drive capacities were doubling every 18-24 months. And so the sane
thing to do would be to do a disk-to-disk transfer. That is, you buy
a new computer, with brand new hard drives, install the latest distro
(many companies would only upgrade distros when they upgraded their
hardware), and so you'd format new hard drives with the latest file
system, and do a disk-to-disk transfer from the old system to the new
system.

Quite frankly, if you aren't willing to copy the data from your
ancient spinning rust platters (and back then, they probably *were*
actually iron oxide :-), your largest risk would be the HDD
self-destructing from extreme old age. :-)

- Ted


2022-01-06 04:42:08

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v2] ext4: set csum seed in tmp inode while migrating to extents

On Tue, 14 Dec 2021 17:50:58 +0000, Luís Henriques wrote:
> When migrating to extents, the temporary inode will have it's own checksum
> seed. This means that, when swapping the inodes data, the inode checksums
> will be incorrect.
>
> This can be fixed by recalculating the extents checksums again. Or simply
> by copying the seed into the temporary inode.
>
> [...]

Applied, thanks!

[1/1] ext4: set csum seed in tmp inode while migrating to extents
commit: 304f3f7a2817c03a05efabd6d4ae3f5e85b0da73

Best regards,
--
Theodore Ts'o <[email protected]>