2021-08-19 06:47:00

by Zhang Yi

[permalink] [raw]
Subject: [PATCH v2 0/4] ext4: fix a inode checksum error

We find a checksum error and a inode corruption problem while doing
stress test, this 4 patches address to fix them. The first patch is
relate to the error simulation, and the second & third patch are
prepare to do the fix. The last patch fix these two issue.

- Checksum error

EXT4-fs error (device sda): ext4_lookup:1784: inode #131074: comm cat: iget: checksum invalid

- Inode corruption

e2fsck 1.46.0 (29-Jan-2020)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Entry 'foo' in / (2) has deleted/unused inode 17. Clear<y>? yes
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Inode bitmap differences: -17
Fix<y>? yes
Free inodes count wrong for group #0 (32750, counted=32751).
Fix<y>? yes
Free inodes count wrong (32750, counted=32751).
Fix<y>? yes

Change since v1:
- Add a patch to prevent ext4_do_update_inode() return before filling
the inode buffer.
- Do not use BH_New flag to indicate the empty buffer, postpone the
zero and uptodate logic into ext4_do_update_inode() before filling
the inode buffer.

Thanks,
Yi.


Zhang Yi (4):
ext4: move inode eio simulation behind io completeion
ext4: remove an unnecessary if statement in __ext4_get_inode_loc()
ext4: don't return error if huge_file feature mismatch
ext4: prevent getting empty inode buffer

fs/ext4/inode.c | 206 +++++++++++++++++++++++++-----------------------
1 file changed, 109 insertions(+), 97 deletions(-)

--
2.31.1


2021-08-19 06:47:00

by Zhang Yi

[permalink] [raw]
Subject: [PATCH v2 4/4] ext4: prevent getting empty inode buffer

In ext4_get_inode_loc(), we may skip IO and get an zero && uptodate
inode buffer when the inode monopolize an inode block for performance
reason. For most cases, ext4_mark_iloc_dirty() will fill the inode
buffer to make it fine, but we could miss this call if something bad
happened. Finally, __ext4_get_inode_loc_noinmem() may probably get an
empty inode buffer and trigger ext4 error.

For example, if we remove a nonexistent xattr on inode A,
ext4_xattr_set_handle() will return ENODATA before invoking
ext4_mark_iloc_dirty(), it will left an uptodate but zero buffer. We
will get checksum error message in ext4_iget() when getting inode again.

EXT4-fs error (device sda): ext4_lookup:1784: inode #131074: comm cat: iget: checksum invalid

Even worse, if we allocate another inode B at the same inode block, it
will corrupt the inode A on disk when write back inode B.

So this patch postpone the init and mark buffer uptodate logic until
before filling correct inode data in ext4_do_update_inode() if skip read
I/O, ensure the buffer is real uptodate.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/inode.c | 26 +++++++++++++++++++++++---
1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d0d7a5295bf9..02d910c9d8f1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4367,9 +4367,11 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
}
brelse(bitmap_bh);
if (i == start + inodes_per_block) {
- /* all other inodes are free, so skip I/O */
- memset(bh->b_data, 0, bh->b_size);
- set_buffer_uptodate(bh);
+ /*
+ * All other inodes are free, skip I/O. Return
+ * un-inited buffer (which is postponed until
+ * before filling inode data) immediately.
+ */
unlock_buffer(bh);
goto has_buffer;
}
@@ -5026,6 +5028,24 @@ static int ext4_do_update_inode(handle_t *handle,
gid_t i_gid;
projid_t i_projid;

+ /*
+ * If the buffer is not uptodate, it means all information of inode
+ * in memory and we got this buffer without reading the block. We
+ * must be cautious that once we mark the buffer as uptodate, we
+ * rely on filling in the correct inode data later in this function.
+ * Otherwise if we getting the left falsepositive buffer when
+ * creating other inode on the same block, it could corrupt the
+ * inode data on disk.
+ */
+ if (!buffer_uptodate(bh)) {
+ lock_buffer(bh);
+ if (!buffer_uptodate(bh)) {
+ memset(bh->b_data, 0, bh->b_size);
+ set_buffer_uptodate(bh);
+ }
+ unlock_buffer(bh);
+ }
+
spin_lock(&ei->i_raw_lock);

/* For fields not tracked in the in-memory inode,
--
2.31.1

2021-08-19 06:47:00

by Zhang Yi

[permalink] [raw]
Subject: [PATCH v2 1/4] ext4: move inode eio simulation behind io completeion

No EIO simulation is required if the buffer is uptodate, so move the
simulation behind read bio completeion just like inode/block bitmap
simulation does.

Signed-off-by: Zhang Yi <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d8de607849df..eb2526a35254 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4330,8 +4330,6 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
bh = sb_getblk(sb, block);
if (unlikely(!bh))
return -ENOMEM;
- if (ext4_simulate_fail(sb, EXT4_SIM_INODE_EIO))
- goto simulate_eio;
if (!buffer_uptodate(bh)) {
lock_buffer(bh);

@@ -4418,8 +4416,8 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
ext4_read_bh_nowait(bh, REQ_META | REQ_PRIO, NULL);
blk_finish_plug(&plug);
wait_on_buffer(bh);
+ ext4_simulate_fail_bh(sb, bh, EXT4_SIM_INODE_EIO);
if (!buffer_uptodate(bh)) {
- simulate_eio:
if (ret_block)
*ret_block = block;
brelse(bh);
--
2.31.1

2021-08-19 06:47:00

by Zhang Yi

[permalink] [raw]
Subject: [PATCH v2 2/4] ext4: remove an unnecessary if statement in __ext4_get_inode_loc()

The "if (!buffer_uptodate(bh))" hunk covered almost the whole code after
getting buffer in __ext4_get_inode_loc() which seems unnecessary, remove
it and switch to check ext4_buffer_uptodate(), it simplify code and make
it more readable.

Signed-off-by: Zhang Yi <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 162 +++++++++++++++++++++++-------------------------
1 file changed, 78 insertions(+), 84 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index eb2526a35254..eae1b2d0b550 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4330,99 +4330,93 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
bh = sb_getblk(sb, block);
if (unlikely(!bh))
return -ENOMEM;
- if (!buffer_uptodate(bh)) {
- lock_buffer(bh);
-
- if (ext4_buffer_uptodate(bh)) {
- /* someone brought it uptodate while we waited */
- unlock_buffer(bh);
- goto has_buffer;
- }
+ if (ext4_buffer_uptodate(bh))
+ goto has_buffer;

- /*
- * If we have all information of the inode in memory and this
- * is the only valid inode in the block, we need not read the
- * block.
- */
- if (in_mem) {
- struct buffer_head *bitmap_bh;
- int i, start;
+ lock_buffer(bh);
+ /*
+ * If we have all information of the inode in memory and this
+ * is the only valid inode in the block, we need not read the
+ * block.
+ */
+ if (in_mem) {
+ struct buffer_head *bitmap_bh;
+ int i, start;

- start = inode_offset & ~(inodes_per_block - 1);
+ start = inode_offset & ~(inodes_per_block - 1);

- /* Is the inode bitmap in cache? */
- bitmap_bh = sb_getblk(sb, ext4_inode_bitmap(sb, gdp));
- if (unlikely(!bitmap_bh))
- goto make_io;
+ /* Is the inode bitmap in cache? */
+ bitmap_bh = sb_getblk(sb, ext4_inode_bitmap(sb, gdp));
+ if (unlikely(!bitmap_bh))
+ goto make_io;

- /*
- * If the inode bitmap isn't in cache then the
- * optimisation may end up performing two reads instead
- * of one, so skip it.
- */
- if (!buffer_uptodate(bitmap_bh)) {
- brelse(bitmap_bh);
- goto make_io;
- }
- for (i = start; i < start + inodes_per_block; i++) {
- if (i == inode_offset)
- continue;
- if (ext4_test_bit(i, bitmap_bh->b_data))
- break;
- }
+ /*
+ * If the inode bitmap isn't in cache then the
+ * optimisation may end up performing two reads instead
+ * of one, so skip it.
+ */
+ if (!buffer_uptodate(bitmap_bh)) {
brelse(bitmap_bh);
- if (i == start + inodes_per_block) {
- /* all other inodes are free, so skip I/O */
- memset(bh->b_data, 0, bh->b_size);
- set_buffer_uptodate(bh);
- unlock_buffer(bh);
- goto has_buffer;
- }
+ goto make_io;
+ }
+ for (i = start; i < start + inodes_per_block; i++) {
+ if (i == inode_offset)
+ continue;
+ if (ext4_test_bit(i, bitmap_bh->b_data))
+ break;
}
+ brelse(bitmap_bh);
+ if (i == start + inodes_per_block) {
+ /* all other inodes are free, so skip I/O */
+ memset(bh->b_data, 0, bh->b_size);
+ set_buffer_uptodate(bh);
+ unlock_buffer(bh);
+ goto has_buffer;
+ }
+ }

make_io:
- /*
- * If we need to do any I/O, try to pre-readahead extra
- * blocks from the inode table.
- */
- blk_start_plug(&plug);
- if (EXT4_SB(sb)->s_inode_readahead_blks) {
- ext4_fsblk_t b, end, table;
- unsigned num;
- __u32 ra_blks = EXT4_SB(sb)->s_inode_readahead_blks;
-
- table = ext4_inode_table(sb, gdp);
- /* s_inode_readahead_blks is always a power of 2 */
- b = block & ~((ext4_fsblk_t) ra_blks - 1);
- if (table > b)
- b = table;
- end = b + ra_blks;
- num = EXT4_INODES_PER_GROUP(sb);
- if (ext4_has_group_desc_csum(sb))
- num -= ext4_itable_unused_count(sb, gdp);
- table += num / inodes_per_block;
- if (end > table)
- end = table;
- while (b <= end)
- ext4_sb_breadahead_unmovable(sb, b++);
- }
+ /*
+ * If we need to do any I/O, try to pre-readahead extra
+ * blocks from the inode table.
+ */
+ blk_start_plug(&plug);
+ if (EXT4_SB(sb)->s_inode_readahead_blks) {
+ ext4_fsblk_t b, end, table;
+ unsigned num;
+ __u32 ra_blks = EXT4_SB(sb)->s_inode_readahead_blks;
+
+ table = ext4_inode_table(sb, gdp);
+ /* s_inode_readahead_blks is always a power of 2 */
+ b = block & ~((ext4_fsblk_t) ra_blks - 1);
+ if (table > b)
+ b = table;
+ end = b + ra_blks;
+ num = EXT4_INODES_PER_GROUP(sb);
+ if (ext4_has_group_desc_csum(sb))
+ num -= ext4_itable_unused_count(sb, gdp);
+ table += num / inodes_per_block;
+ if (end > table)
+ end = table;
+ while (b <= end)
+ ext4_sb_breadahead_unmovable(sb, b++);
+ }

- /*
- * There are other valid inodes in the buffer, this inode
- * has in-inode xattrs, or we don't have this inode in memory.
- * Read the block from disk.
- */
- trace_ext4_load_inode(sb, ino);
- ext4_read_bh_nowait(bh, REQ_META | REQ_PRIO, NULL);
- blk_finish_plug(&plug);
- wait_on_buffer(bh);
- ext4_simulate_fail_bh(sb, bh, EXT4_SIM_INODE_EIO);
- if (!buffer_uptodate(bh)) {
- if (ret_block)
- *ret_block = block;
- brelse(bh);
- return -EIO;
- }
+ /*
+ * There are other valid inodes in the buffer, this inode
+ * has in-inode xattrs, or we don't have this inode in memory.
+ * Read the block from disk.
+ */
+ trace_ext4_load_inode(sb, ino);
+ ext4_read_bh_nowait(bh, REQ_META | REQ_PRIO, NULL);
+ blk_finish_plug(&plug);
+ wait_on_buffer(bh);
+ ext4_simulate_fail_bh(sb, bh, EXT4_SIM_INODE_EIO);
+ if (!buffer_uptodate(bh)) {
+ if (ret_block)
+ *ret_block = block;
+ brelse(bh);
+ return -EIO;
}
has_buffer:
iloc->bh = bh;
--
2.31.1

2021-08-19 06:48:29

by Zhang Yi

[permalink] [raw]
Subject: [PATCH v2 3/4] ext4: don't return error if huge_file feature mismatch

In ext4_inode_blocks_set(), huge_file feature should exist when setting
i_blocks beyond a 32 bit variable could be represented, return EFBIG if
not. This error should never happen in theory since sb->s_maxbytes should
not have allowed this, and we have already init sb->s_maxbytes according
to this feature in ext4_fill_super(). So switch to use WARN_ON_ONCE
instead.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/inode.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index eae1b2d0b550..d0d7a5295bf9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4902,9 +4902,9 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
return ERR_PTR(ret);
}

-static int ext4_inode_blocks_set(handle_t *handle,
- struct ext4_inode *raw_inode,
- struct ext4_inode_info *ei)
+static void ext4_inode_blocks_set(handle_t *handle,
+ struct ext4_inode *raw_inode,
+ struct ext4_inode_info *ei)
{
struct inode *inode = &(ei->vfs_inode);
u64 i_blocks = READ_ONCE(inode->i_blocks);
@@ -4918,10 +4918,15 @@ static int ext4_inode_blocks_set(handle_t *handle,
raw_inode->i_blocks_lo = cpu_to_le32(i_blocks);
raw_inode->i_blocks_high = 0;
ext4_clear_inode_flag(inode, EXT4_INODE_HUGE_FILE);
- return 0;
+ return;
}
- if (!ext4_has_feature_huge_file(sb))
- return -EFBIG;
+
+ /*
+ * This should never happen since sb->s_maxbytes should not have
+ * allowed this, which was set according to the huge_file feature
+ * in ext4_fill_super().
+ */
+ WARN_ON_ONCE(!ext4_has_feature_huge_file(sb));

if (i_blocks <= 0xffffffffffffULL) {
/*
@@ -4938,7 +4943,6 @@ static int ext4_inode_blocks_set(handle_t *handle,
raw_inode->i_blocks_lo = cpu_to_le32(i_blocks);
raw_inode->i_blocks_high = cpu_to_le16(i_blocks >> 32);
}
- return 0;
}

static void __ext4_update_other_inode_time(struct super_block *sb,
@@ -5029,11 +5033,7 @@ static int ext4_do_update_inode(handle_t *handle,
if (ext4_test_inode_state(inode, EXT4_STATE_NEW))
memset(raw_inode, 0, EXT4_SB(inode->i_sb)->s_inode_size);

- err = ext4_inode_blocks_set(handle, raw_inode, ei);
- if (err) {
- spin_unlock(&ei->i_raw_lock);
- goto out_brelse;
- }
+ ext4_inode_blocks_set(handle, raw_inode, ei);

raw_inode->i_mode = cpu_to_le16(inode->i_mode);
i_uid = i_uid_read(inode);
--
2.31.1

2021-08-19 10:27:40

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2 3/4] ext4: don't return error if huge_file feature mismatch

On Thu 19-08-21 14:57:03, Zhang Yi wrote:
> In ext4_inode_blocks_set(), huge_file feature should exist when setting
> i_blocks beyond a 32 bit variable could be represented, return EFBIG if
> not. This error should never happen in theory since sb->s_maxbytes should
> not have allowed this, and we have already init sb->s_maxbytes according
> to this feature in ext4_fill_super(). So switch to use WARN_ON_ONCE
> instead.
>
> Signed-off-by: Zhang Yi <[email protected]>
> ---

One comment below:

> @@ -4918,10 +4918,15 @@ static int ext4_inode_blocks_set(handle_t *handle,
> raw_inode->i_blocks_lo = cpu_to_le32(i_blocks);
> raw_inode->i_blocks_high = 0;
> ext4_clear_inode_flag(inode, EXT4_INODE_HUGE_FILE);
> - return 0;
> + return;
> }
> - if (!ext4_has_feature_huge_file(sb))
> - return -EFBIG;
> +
> + /*
> + * This should never happen since sb->s_maxbytes should not have
> + * allowed this, which was set according to the huge_file feature
> + * in ext4_fill_super().
> + */
> + WARN_ON_ONCE(!ext4_has_feature_huge_file(sb));

Thinking about this a bit more, this could also happen due to fs
corruption. So we probably need to call ext4_error_inode() here instead of
WARN_ON_ONCE(). Also it will result in properly marking fs as having
errors. But since we hold i_raw_lock at this call site we need to
keep the error bail out from ext4_inode_blocks_set() and in
ext4_do_update_inode() finish updating inode and then call
ext4_error_inode() after dropping i_raw_lock.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-08-19 10:35:55

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] ext4: prevent getting empty inode buffer

On Thu 19-08-21 14:57:04, Zhang Yi wrote:
> In ext4_get_inode_loc(), we may skip IO and get an zero && uptodate
> inode buffer when the inode monopolize an inode block for performance
> reason. For most cases, ext4_mark_iloc_dirty() will fill the inode
> buffer to make it fine, but we could miss this call if something bad
> happened. Finally, __ext4_get_inode_loc_noinmem() may probably get an
> empty inode buffer and trigger ext4 error.
>
> For example, if we remove a nonexistent xattr on inode A,
> ext4_xattr_set_handle() will return ENODATA before invoking
> ext4_mark_iloc_dirty(), it will left an uptodate but zero buffer. We
> will get checksum error message in ext4_iget() when getting inode again.
>
> EXT4-fs error (device sda): ext4_lookup:1784: inode #131074: comm cat: iget: checksum invalid
>
> Even worse, if we allocate another inode B at the same inode block, it
> will corrupt the inode A on disk when write back inode B.
>
> So this patch postpone the init and mark buffer uptodate logic until
> before filling correct inode data in ext4_do_update_inode() if skip read
> I/O, ensure the buffer is real uptodate.
>
> Signed-off-by: Zhang Yi <[email protected]>

Looks good. Just some language fixes below. Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

> ---
> fs/ext4/inode.c | 26 +++++++++++++++++++++++---
> 1 file changed, 23 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d0d7a5295bf9..02d910c9d8f1 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4367,9 +4367,11 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
> }
> brelse(bitmap_bh);
> if (i == start + inodes_per_block) {
> - /* all other inodes are free, so skip I/O */
> - memset(bh->b_data, 0, bh->b_size);
> - set_buffer_uptodate(bh);
> + /*
> + * All other inodes are free, skip I/O. Return
> + * un-inited buffer (which is postponed until

I'd repharse this sentence as: Return uninitialized buffer immediately,
initialization is postponed until shortly before we fill inode contents.

> + * before filling inode data) immediately.
> + */
> unlock_buffer(bh);
> goto has_buffer;
> }
> @@ -5026,6 +5028,24 @@ static int ext4_do_update_inode(handle_t *handle,
> gid_t i_gid;
> projid_t i_projid;
>
> + /*
> + * If the buffer is not uptodate, it means all information of inode
^^^^^^^^
of the inode is

> + * in memory and we got this buffer without reading the block. We
> + * must be cautious that once we mark the buffer as uptodate, we
> + * rely on filling in the correct inode data later in this function.
> + * Otherwise if we getting the left falsepositive buffer when

I'd rephrase this sentence as: Otherwise if we left uptodate buffer without
copying proper inode contents, we could corrupt the inode on disk after
allocating another inode in the same block.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-08-19 13:13:22

by Zhang Yi

[permalink] [raw]
Subject: Re: [PATCH v2 3/4] ext4: don't return error if huge_file feature mismatch

On 2021/8/19 18:26, Jan Kara wrote:
> On Thu 19-08-21 14:57:03, Zhang Yi wrote:
>> In ext4_inode_blocks_set(), huge_file feature should exist when setting
>> i_blocks beyond a 32 bit variable could be represented, return EFBIG if
>> not. This error should never happen in theory since sb->s_maxbytes should
>> not have allowed this, and we have already init sb->s_maxbytes according
>> to this feature in ext4_fill_super(). So switch to use WARN_ON_ONCE
>> instead.
>>
>> Signed-off-by: Zhang Yi <[email protected]>
>> ---
>
> One comment below:
>
>> @@ -4918,10 +4918,15 @@ static int ext4_inode_blocks_set(handle_t *handle,
>> raw_inode->i_blocks_lo = cpu_to_le32(i_blocks);
>> raw_inode->i_blocks_high = 0;
>> ext4_clear_inode_flag(inode, EXT4_INODE_HUGE_FILE);
>> - return 0;
>> + return;
>> }
>> - if (!ext4_has_feature_huge_file(sb))
>> - return -EFBIG;
>> +
>> + /*
>> + * This should never happen since sb->s_maxbytes should not have
>> + * allowed this, which was set according to the huge_file feature
>> + * in ext4_fill_super().
>> + */
>> + WARN_ON_ONCE(!ext4_has_feature_huge_file(sb));
>
> Thinking about this a bit more, this could also happen due to fs
> corruption. So we probably need to call ext4_error_inode() here instead of
> WARN_ON_ONCE(). Also it will result in properly marking fs as having
> errors. But since we hold i_raw_lock at this call site we need to
> keep the error bail out from ext4_inode_blocks_set() and in
> ext4_do_update_inode() finish updating inode and then call
> ext4_error_inode() after dropping i_raw_lock.
>
Yes, make sense, ext4_error_inode() is more reasonable.

Thanks,
Yi.