2021-08-21 06:45:09

by Zhang Yi

[permalink] [raw]
Subject: [PATCH v3 0/4] ext4: fix a inode checksum error

We find a checksum error and a inode corruption problem while doing
stress test, this 4 patches address to fix them. The first patch is
relate to the error simulation, and the second & third patch are
prepare to do the fix. The last patch fix these two issue.

- Checksum error

EXT4-fs error (device sda): ext4_lookup:1784: inode #131074: comm cat: iget: checksum invalid

- Inode corruption

e2fsck 1.46.0 (29-Jan-2020)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Entry 'foo' in / (2) has deleted/unused inode 17. Clear<y>? yes
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Inode bitmap differences: -17
Fix<y>? yes
Free inodes count wrong for group #0 (32750, counted=32751).
Fix<y>? yes
Free inodes count wrong (32750, counted=32751).
Fix<y>? yes


Changes since v2:
- Instead of using WARN_ON_ONCE to prevent ext4_do_update_inode()
return before filling the inode buffer, keep the error and postpone
the report after the updating in the third patch.
- Fix some language mistacks in the last patch.

Changes since v1:
- Add a patch to prevent ext4_do_update_inode() return before filling
the inode buffer.
- Do not use BH_New flag to indicate the empty buffer, postpone the
zero and uptodate logic into ext4_do_update_inode() before filling
the inode buffer.

Thanks,
Yi.



Zhang Yi (4):
ext4: move inode eio simulation behind io completeion
ext4: remove an unnecessary if statement in __ext4_get_inode_loc()
ext4: make the updating inode data procedure atomic
ext4: prevent getting empty inode buffer

fs/ext4/inode.c | 227 +++++++++++++++++++++++++++---------------------
1 file changed, 126 insertions(+), 101 deletions(-)

--
2.31.1


2021-08-21 06:45:12

by Zhang Yi

[permalink] [raw]
Subject: [PATCH v3 1/4] ext4: move inode eio simulation behind io completeion

No EIO simulation is required if the buffer is uptodate, so move the
simulation behind read bio completeion just like inode/block bitmap
simulation does.

Signed-off-by: Zhang Yi <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d8de607849df..eb2526a35254 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4330,8 +4330,6 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
bh = sb_getblk(sb, block);
if (unlikely(!bh))
return -ENOMEM;
- if (ext4_simulate_fail(sb, EXT4_SIM_INODE_EIO))
- goto simulate_eio;
if (!buffer_uptodate(bh)) {
lock_buffer(bh);

@@ -4418,8 +4416,8 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
ext4_read_bh_nowait(bh, REQ_META | REQ_PRIO, NULL);
blk_finish_plug(&plug);
wait_on_buffer(bh);
+ ext4_simulate_fail_bh(sb, bh, EXT4_SIM_INODE_EIO);
if (!buffer_uptodate(bh)) {
- simulate_eio:
if (ret_block)
*ret_block = block;
brelse(bh);
--
2.31.1

2021-08-21 06:45:12

by Zhang Yi

[permalink] [raw]
Subject: [PATCH v3 3/4] ext4: make the updating inode data procedure atomic

Now that ext4_do_update_inode() return error before filling the whole
inode data if we fail to set inode blocks in ext4_inode_blocks_set().
This error should never happen in theory since sb->s_maxbytes should not
have allowed this, we have already init sb->s_maxbytes according to this
feature in ext4_fill_super(). So even through that could only happen due
to the filesystem corruption, we'd better to return after we finish
updating the inode because it may left an uninitialized buffer and we
could read this buffer later in "errors=continue" mode.

This patch make the updating inode data procedure atomic, call
EXT4_ERROR_INODE() after we dropping i_raw_lock after something bad
happened, make sure that the inode is integrated, and also drop a BUG_ON
and do some small cleanups.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/inode.c | 44 ++++++++++++++++++++++++++++----------------
1 file changed, 28 insertions(+), 16 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index eae1b2d0b550..8323d3e8f393 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4920,8 +4920,14 @@ static int ext4_inode_blocks_set(handle_t *handle,
ext4_clear_inode_flag(inode, EXT4_INODE_HUGE_FILE);
return 0;
}
+
+ /*
+ * This should never happen since sb->s_maxbytes should not have
+ * allowed this, sb->s_maxbytes was set according to the huge_file
+ * feature in ext4_fill_super().
+ */
if (!ext4_has_feature_huge_file(sb))
- return -EFBIG;
+ return -EFSCORRUPTED;

if (i_blocks <= 0xffffffffffffULL) {
/*
@@ -5024,16 +5030,14 @@ static int ext4_do_update_inode(handle_t *handle,

spin_lock(&ei->i_raw_lock);

- /* For fields not tracked in the in-memory inode,
- * initialise them to zero for new inodes. */
+ /*
+ * For fields not tracked in the in-memory inode, initialise them
+ * to zero for new inodes.
+ */
if (ext4_test_inode_state(inode, EXT4_STATE_NEW))
memset(raw_inode, 0, EXT4_SB(inode->i_sb)->s_inode_size);

err = ext4_inode_blocks_set(handle, raw_inode, ei);
- if (err) {
- spin_unlock(&ei->i_raw_lock);
- goto out_brelse;
- }

raw_inode->i_mode = cpu_to_le16(inode->i_mode);
i_uid = i_uid_read(inode);
@@ -5042,10 +5046,11 @@ static int ext4_do_update_inode(handle_t *handle,
if (!(test_opt(inode->i_sb, NO_UID32))) {
raw_inode->i_uid_low = cpu_to_le16(low_16_bits(i_uid));
raw_inode->i_gid_low = cpu_to_le16(low_16_bits(i_gid));
-/*
- * Fix up interoperability with old kernels. Otherwise, old inodes get
- * re-used with the upper 16 bits of the uid/gid intact
- */
+ /*
+ * Fix up interoperability with old kernels. Otherwise,
+ * old inodes get re-used with the upper 16 bits of the
+ * uid/gid intact.
+ */
if (ei->i_dtime && list_empty(&ei->i_orphan)) {
raw_inode->i_uid_high = 0;
raw_inode->i_gid_high = 0;
@@ -5114,8 +5119,9 @@ static int ext4_do_update_inode(handle_t *handle,
}
}

- BUG_ON(!ext4_has_feature_project(inode->i_sb) &&
- i_projid != EXT4_DEF_PROJID);
+ if (i_projid != EXT4_DEF_PROJID &&
+ !ext4_has_feature_project(inode->i_sb))
+ err = err ?: -EFSCORRUPTED;

if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE &&
EXT4_FITS_IN_INODE(raw_inode, ei, i_projid))
@@ -5123,6 +5129,11 @@ static int ext4_do_update_inode(handle_t *handle,

ext4_inode_csum_set(inode, raw_inode, ei);
spin_unlock(&ei->i_raw_lock);
+ if (err) {
+ EXT4_ERROR_INODE(inode, "corrupted inode contents");
+ goto out_brelse;
+ }
+
if (inode->i_sb->s_flags & SB_LAZYTIME)
ext4_update_other_inodes_time(inode->i_sb, inode->i_ino,
bh->b_data);
@@ -5130,13 +5141,13 @@ static int ext4_do_update_inode(handle_t *handle,
BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
err = ext4_handle_dirty_metadata(handle, NULL, bh);
if (err)
- goto out_brelse;
+ goto out_error;
ext4_clear_inode_state(inode, EXT4_STATE_NEW);
if (set_large_file) {
BUFFER_TRACE(EXT4_SB(sb)->s_sbh, "get write access");
err = ext4_journal_get_write_access(handle, EXT4_SB(sb)->s_sbh);
if (err)
- goto out_brelse;
+ goto out_error;
lock_buffer(EXT4_SB(sb)->s_sbh);
ext4_set_feature_large_file(sb);
ext4_superblock_csum_set(sb);
@@ -5146,9 +5157,10 @@ static int ext4_do_update_inode(handle_t *handle,
EXT4_SB(sb)->s_sbh);
}
ext4_update_inode_fsync_trans(handle, inode, need_datasync);
+out_error:
+ ext4_std_error(inode->i_sb, err);
out_brelse:
brelse(bh);
- ext4_std_error(inode->i_sb, err);
return err;
}

--
2.31.1

2021-08-21 06:45:13

by Zhang Yi

[permalink] [raw]
Subject: [PATCH v3 2/4] ext4: remove an unnecessary if statement in __ext4_get_inode_loc()

The "if (!buffer_uptodate(bh))" hunk covered almost the whole code after
getting buffer in __ext4_get_inode_loc() which seems unnecessary, remove
it and switch to check ext4_buffer_uptodate(), it simplify code and make
it more readable.

Signed-off-by: Zhang Yi <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 162 +++++++++++++++++++++++-------------------------
1 file changed, 78 insertions(+), 84 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index eb2526a35254..eae1b2d0b550 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4330,99 +4330,93 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
bh = sb_getblk(sb, block);
if (unlikely(!bh))
return -ENOMEM;
- if (!buffer_uptodate(bh)) {
- lock_buffer(bh);
-
- if (ext4_buffer_uptodate(bh)) {
- /* someone brought it uptodate while we waited */
- unlock_buffer(bh);
- goto has_buffer;
- }
+ if (ext4_buffer_uptodate(bh))
+ goto has_buffer;

- /*
- * If we have all information of the inode in memory and this
- * is the only valid inode in the block, we need not read the
- * block.
- */
- if (in_mem) {
- struct buffer_head *bitmap_bh;
- int i, start;
+ lock_buffer(bh);
+ /*
+ * If we have all information of the inode in memory and this
+ * is the only valid inode in the block, we need not read the
+ * block.
+ */
+ if (in_mem) {
+ struct buffer_head *bitmap_bh;
+ int i, start;

- start = inode_offset & ~(inodes_per_block - 1);
+ start = inode_offset & ~(inodes_per_block - 1);

- /* Is the inode bitmap in cache? */
- bitmap_bh = sb_getblk(sb, ext4_inode_bitmap(sb, gdp));
- if (unlikely(!bitmap_bh))
- goto make_io;
+ /* Is the inode bitmap in cache? */
+ bitmap_bh = sb_getblk(sb, ext4_inode_bitmap(sb, gdp));
+ if (unlikely(!bitmap_bh))
+ goto make_io;

- /*
- * If the inode bitmap isn't in cache then the
- * optimisation may end up performing two reads instead
- * of one, so skip it.
- */
- if (!buffer_uptodate(bitmap_bh)) {
- brelse(bitmap_bh);
- goto make_io;
- }
- for (i = start; i < start + inodes_per_block; i++) {
- if (i == inode_offset)
- continue;
- if (ext4_test_bit(i, bitmap_bh->b_data))
- break;
- }
+ /*
+ * If the inode bitmap isn't in cache then the
+ * optimisation may end up performing two reads instead
+ * of one, so skip it.
+ */
+ if (!buffer_uptodate(bitmap_bh)) {
brelse(bitmap_bh);
- if (i == start + inodes_per_block) {
- /* all other inodes are free, so skip I/O */
- memset(bh->b_data, 0, bh->b_size);
- set_buffer_uptodate(bh);
- unlock_buffer(bh);
- goto has_buffer;
- }
+ goto make_io;
+ }
+ for (i = start; i < start + inodes_per_block; i++) {
+ if (i == inode_offset)
+ continue;
+ if (ext4_test_bit(i, bitmap_bh->b_data))
+ break;
}
+ brelse(bitmap_bh);
+ if (i == start + inodes_per_block) {
+ /* all other inodes are free, so skip I/O */
+ memset(bh->b_data, 0, bh->b_size);
+ set_buffer_uptodate(bh);
+ unlock_buffer(bh);
+ goto has_buffer;
+ }
+ }

make_io:
- /*
- * If we need to do any I/O, try to pre-readahead extra
- * blocks from the inode table.
- */
- blk_start_plug(&plug);
- if (EXT4_SB(sb)->s_inode_readahead_blks) {
- ext4_fsblk_t b, end, table;
- unsigned num;
- __u32 ra_blks = EXT4_SB(sb)->s_inode_readahead_blks;
-
- table = ext4_inode_table(sb, gdp);
- /* s_inode_readahead_blks is always a power of 2 */
- b = block & ~((ext4_fsblk_t) ra_blks - 1);
- if (table > b)
- b = table;
- end = b + ra_blks;
- num = EXT4_INODES_PER_GROUP(sb);
- if (ext4_has_group_desc_csum(sb))
- num -= ext4_itable_unused_count(sb, gdp);
- table += num / inodes_per_block;
- if (end > table)
- end = table;
- while (b <= end)
- ext4_sb_breadahead_unmovable(sb, b++);
- }
+ /*
+ * If we need to do any I/O, try to pre-readahead extra
+ * blocks from the inode table.
+ */
+ blk_start_plug(&plug);
+ if (EXT4_SB(sb)->s_inode_readahead_blks) {
+ ext4_fsblk_t b, end, table;
+ unsigned num;
+ __u32 ra_blks = EXT4_SB(sb)->s_inode_readahead_blks;
+
+ table = ext4_inode_table(sb, gdp);
+ /* s_inode_readahead_blks is always a power of 2 */
+ b = block & ~((ext4_fsblk_t) ra_blks - 1);
+ if (table > b)
+ b = table;
+ end = b + ra_blks;
+ num = EXT4_INODES_PER_GROUP(sb);
+ if (ext4_has_group_desc_csum(sb))
+ num -= ext4_itable_unused_count(sb, gdp);
+ table += num / inodes_per_block;
+ if (end > table)
+ end = table;
+ while (b <= end)
+ ext4_sb_breadahead_unmovable(sb, b++);
+ }

- /*
- * There are other valid inodes in the buffer, this inode
- * has in-inode xattrs, or we don't have this inode in memory.
- * Read the block from disk.
- */
- trace_ext4_load_inode(sb, ino);
- ext4_read_bh_nowait(bh, REQ_META | REQ_PRIO, NULL);
- blk_finish_plug(&plug);
- wait_on_buffer(bh);
- ext4_simulate_fail_bh(sb, bh, EXT4_SIM_INODE_EIO);
- if (!buffer_uptodate(bh)) {
- if (ret_block)
- *ret_block = block;
- brelse(bh);
- return -EIO;
- }
+ /*
+ * There are other valid inodes in the buffer, this inode
+ * has in-inode xattrs, or we don't have this inode in memory.
+ * Read the block from disk.
+ */
+ trace_ext4_load_inode(sb, ino);
+ ext4_read_bh_nowait(bh, REQ_META | REQ_PRIO, NULL);
+ blk_finish_plug(&plug);
+ wait_on_buffer(bh);
+ ext4_simulate_fail_bh(sb, bh, EXT4_SIM_INODE_EIO);
+ if (!buffer_uptodate(bh)) {
+ if (ret_block)
+ *ret_block = block;
+ brelse(bh);
+ return -EIO;
}
has_buffer:
iloc->bh = bh;
--
2.31.1

2021-08-21 06:46:52

by Zhang Yi

[permalink] [raw]
Subject: [PATCH v3 4/4] ext4: prevent getting empty inode buffer

In ext4_get_inode_loc(), we may skip IO and get an zero && uptodate
inode buffer when the inode monopolize an inode block for performance
reason. For most cases, ext4_mark_iloc_dirty() will fill the inode
buffer to make it fine, but we could miss this call if something bad
happened. Finally, __ext4_get_inode_loc_noinmem() may probably get an
empty inode buffer and trigger ext4 error.

For example, if we remove a nonexistent xattr on inode A,
ext4_xattr_set_handle() will return ENODATA before invoking
ext4_mark_iloc_dirty(), it will left an uptodate but zero buffer. We
will get checksum error message in ext4_iget() when getting inode again.

EXT4-fs error (device sda): ext4_lookup:1784: inode #131074: comm cat: iget: checksum invalid

Even worse, if we allocate another inode B at the same inode block, it
will corrupt the inode A on disk when write back inode B.

So this patch postpone the initialization and mark buffer uptodate logic
until shortly before we fill correct inode data in ext4_do_update_inode()
if skip read I/O, ensure the buffer is really uptodate.

Signed-off-by: Zhang Yi <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 27 ++++++++++++++++++++++++---
1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 8323d3e8f393..000abb5696b0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4367,9 +4367,12 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
}
brelse(bitmap_bh);
if (i == start + inodes_per_block) {
- /* all other inodes are free, so skip I/O */
- memset(bh->b_data, 0, bh->b_size);
- set_buffer_uptodate(bh);
+ /*
+ * All other inodes are free, skip I/O. Return
+ * uninitialized buffer immediately, initialization
+ * is postponed until shortly before we fill inode
+ * contents.
+ */
unlock_buffer(bh);
goto has_buffer;
}
@@ -5028,6 +5031,24 @@ static int ext4_do_update_inode(handle_t *handle,
gid_t i_gid;
projid_t i_projid;

+ /*
+ * If the buffer is not uptodate, it means all information of the
+ * inode is in memory and we got this buffer without reading the
+ * block. We must be cautious that once we mark the buffer as
+ * uptodate, we rely on filling in the correct inode data later
+ * in this function. Otherwise if we left uptodate buffer without
+ * copying proper inode contents, we could corrupt the inode on
+ * disk after allocating another inode in the same block.
+ */
+ if (!buffer_uptodate(bh)) {
+ lock_buffer(bh);
+ if (!buffer_uptodate(bh)) {
+ memset(bh->b_data, 0, bh->b_size);
+ set_buffer_uptodate(bh);
+ }
+ unlock_buffer(bh);
+ }
+
spin_lock(&ei->i_raw_lock);

/*
--
2.31.1

2021-08-23 10:24:21

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] ext4: make the updating inode data procedure atomic

On Sat 21-08-21 14:54:49, Zhang Yi wrote:
> Now that ext4_do_update_inode() return error before filling the whole
> inode data if we fail to set inode blocks in ext4_inode_blocks_set().
> This error should never happen in theory since sb->s_maxbytes should not
> have allowed this, we have already init sb->s_maxbytes according to this
> feature in ext4_fill_super(). So even through that could only happen due
> to the filesystem corruption, we'd better to return after we finish
> updating the inode because it may left an uninitialized buffer and we
> could read this buffer later in "errors=continue" mode.
>
> This patch make the updating inode data procedure atomic, call
> EXT4_ERROR_INODE() after we dropping i_raw_lock after something bad
> happened, make sure that the inode is integrated, and also drop a BUG_ON
> and do some small cleanups.
>
> Signed-off-by: Zhang Yi <[email protected]>

Looks good! Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/ext4/inode.c | 44 ++++++++++++++++++++++++++++----------------
> 1 file changed, 28 insertions(+), 16 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index eae1b2d0b550..8323d3e8f393 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4920,8 +4920,14 @@ static int ext4_inode_blocks_set(handle_t *handle,
> ext4_clear_inode_flag(inode, EXT4_INODE_HUGE_FILE);
> return 0;
> }
> +
> + /*
> + * This should never happen since sb->s_maxbytes should not have
> + * allowed this, sb->s_maxbytes was set according to the huge_file
> + * feature in ext4_fill_super().
> + */
> if (!ext4_has_feature_huge_file(sb))
> - return -EFBIG;
> + return -EFSCORRUPTED;
>
> if (i_blocks <= 0xffffffffffffULL) {
> /*
> @@ -5024,16 +5030,14 @@ static int ext4_do_update_inode(handle_t *handle,
>
> spin_lock(&ei->i_raw_lock);
>
> - /* For fields not tracked in the in-memory inode,
> - * initialise them to zero for new inodes. */
> + /*
> + * For fields not tracked in the in-memory inode, initialise them
> + * to zero for new inodes.
> + */
> if (ext4_test_inode_state(inode, EXT4_STATE_NEW))
> memset(raw_inode, 0, EXT4_SB(inode->i_sb)->s_inode_size);
>
> err = ext4_inode_blocks_set(handle, raw_inode, ei);
> - if (err) {
> - spin_unlock(&ei->i_raw_lock);
> - goto out_brelse;
> - }
>
> raw_inode->i_mode = cpu_to_le16(inode->i_mode);
> i_uid = i_uid_read(inode);
> @@ -5042,10 +5046,11 @@ static int ext4_do_update_inode(handle_t *handle,
> if (!(test_opt(inode->i_sb, NO_UID32))) {
> raw_inode->i_uid_low = cpu_to_le16(low_16_bits(i_uid));
> raw_inode->i_gid_low = cpu_to_le16(low_16_bits(i_gid));
> -/*
> - * Fix up interoperability with old kernels. Otherwise, old inodes get
> - * re-used with the upper 16 bits of the uid/gid intact
> - */
> + /*
> + * Fix up interoperability with old kernels. Otherwise,
> + * old inodes get re-used with the upper 16 bits of the
> + * uid/gid intact.
> + */
> if (ei->i_dtime && list_empty(&ei->i_orphan)) {
> raw_inode->i_uid_high = 0;
> raw_inode->i_gid_high = 0;
> @@ -5114,8 +5119,9 @@ static int ext4_do_update_inode(handle_t *handle,
> }
> }
>
> - BUG_ON(!ext4_has_feature_project(inode->i_sb) &&
> - i_projid != EXT4_DEF_PROJID);
> + if (i_projid != EXT4_DEF_PROJID &&
> + !ext4_has_feature_project(inode->i_sb))
> + err = err ?: -EFSCORRUPTED;
>
> if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE &&
> EXT4_FITS_IN_INODE(raw_inode, ei, i_projid))
> @@ -5123,6 +5129,11 @@ static int ext4_do_update_inode(handle_t *handle,
>
> ext4_inode_csum_set(inode, raw_inode, ei);
> spin_unlock(&ei->i_raw_lock);
> + if (err) {
> + EXT4_ERROR_INODE(inode, "corrupted inode contents");
> + goto out_brelse;
> + }
> +
> if (inode->i_sb->s_flags & SB_LAZYTIME)
> ext4_update_other_inodes_time(inode->i_sb, inode->i_ino,
> bh->b_data);
> @@ -5130,13 +5141,13 @@ static int ext4_do_update_inode(handle_t *handle,
> BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
> err = ext4_handle_dirty_metadata(handle, NULL, bh);
> if (err)
> - goto out_brelse;
> + goto out_error;
> ext4_clear_inode_state(inode, EXT4_STATE_NEW);
> if (set_large_file) {
> BUFFER_TRACE(EXT4_SB(sb)->s_sbh, "get write access");
> err = ext4_journal_get_write_access(handle, EXT4_SB(sb)->s_sbh);
> if (err)
> - goto out_brelse;
> + goto out_error;
> lock_buffer(EXT4_SB(sb)->s_sbh);
> ext4_set_feature_large_file(sb);
> ext4_superblock_csum_set(sb);
> @@ -5146,9 +5157,10 @@ static int ext4_do_update_inode(handle_t *handle,
> EXT4_SB(sb)->s_sbh);
> }
> ext4_update_inode_fsync_trans(handle, inode, need_datasync);
> +out_error:
> + ext4_std_error(inode->i_sb, err);
> out_brelse:
> brelse(bh);
> - ext4_std_error(inode->i_sb, err);
> return err;
> }
>
> --
> 2.31.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-08-26 12:56:01

by Zhang Yi

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] ext4: prevent getting empty inode buffer

On 2021/8/21 14:54, Zhang Yi wrote:
> In ext4_get_inode_loc(), we may skip IO and get an zero && uptodate
> inode buffer when the inode monopolize an inode block for performance
> reason. For most cases, ext4_mark_iloc_dirty() will fill the inode
> buffer to make it fine, but we could miss this call if something bad
> happened. Finally, __ext4_get_inode_loc_noinmem() may probably get an
> empty inode buffer and trigger ext4 error.
>
> For example, if we remove a nonexistent xattr on inode A,
> ext4_xattr_set_handle() will return ENODATA before invoking
> ext4_mark_iloc_dirty(), it will left an uptodate but zero buffer. We
> will get checksum error message in ext4_iget() when getting inode again.
>
> EXT4-fs error (device sda): ext4_lookup:1784: inode #131074: comm cat: iget: checksum invalid
>
> Even worse, if we allocate another inode B at the same inode block, it
> will corrupt the inode A on disk when write back inode B.
>
> So this patch postpone the initialization and mark buffer uptodate logic
> until shortly before we fill correct inode data in ext4_do_update_inode()
> if skip read I/O, ensure the buffer is really uptodate.
>
> Signed-off-by: Zhang Yi <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> ---
> fs/ext4/inode.c | 27 ++++++++++++++++++++++++---
> 1 file changed, 24 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 8323d3e8f393..000abb5696b0 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4367,9 +4367,12 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
> }
> brelse(bitmap_bh);
> if (i == start + inodes_per_block) {
> - /* all other inodes are free, so skip I/O */
> - memset(bh->b_data, 0, bh->b_size);
> - set_buffer_uptodate(bh);
> + /*
> + * All other inodes are free, skip I/O. Return
> + * uninitialized buffer immediately, initialization
> + * is postponed until shortly before we fill inode
> + * contents.
> + */
> unlock_buffer(bh);
> goto has_buffer;
> }
> @@ -5028,6 +5031,24 @@ static int ext4_do_update_inode(handle_t *handle,
> gid_t i_gid;
> projid_t i_projid;
>
> + /*
> + * If the buffer is not uptodate, it means all information of the
> + * inode is in memory and we got this buffer without reading the
> + * block. We must be cautious that once we mark the buffer as
> + * uptodate, we rely on filling in the correct inode data later
> + * in this function. Otherwise if we left uptodate buffer without
> + * copying proper inode contents, we could corrupt the inode on
> + * disk after allocating another inode in the same block.
> + */
> + if (!buffer_uptodate(bh)) {
> + lock_buffer(bh);
> + if (!buffer_uptodate(bh)) {
> + memset(bh->b_data, 0, bh->b_size);
> + set_buffer_uptodate(bh);
> + }
> + unlock_buffer(bh);
> + }

Hi, Jan.

I notice that above solution is not correct. The problem is still in
ext4_xattr_set_handle(), if we set a new xattr entry in a pure inode,
the above hunk may zero out the ibody xattr entry we just set up in
ext4_xattr_ibody_set().

I guess we could not 'zero out buffer && mark buffer uptodate' here,
maybe __ext4_get_inode_loc() should return a really initialized buffer,
or else it's still fragile and hard to guarantee that the 'zero out'
and 'postponed set_buffer_uptodate()' will not zero out something we
just set or overwrite something we updated concurrently.

How about factor out the filling inode contents from ext4_do_update_inode()
into maybe ext4_fill_raw_inode(), and call it in __ext4_get_inode_loc() ?
Please see my v4 patchset.

Thanks,
Yi.