On Wed 17-08-22 21:27:01, Baokun Li wrote:
> In do_writepages, if the value returned by ext4_writepages is "-ENOMEM"
> and "wbc->sync_mode == WB_SYNC_ALL", retry until the condition is not met.
>
> In __ext4_get_inode_loc, if the bh returned by sb_getblk is NULL,
> the function returns -ENOMEM.
>
> In __getblk_slow, if the return value of grow_buffers is less than 0,
> the function returns NULL.
>
> When the three processes are connected in series like the following stack,
> an infinite loop may occur:
>
> do_writepages <--- keep retrying
> ext4_writepages
> mpage_map_and_submit_extent
> mpage_map_one_extent
> ext4_map_blocks
> ext4_ext_map_blocks
> ext4_ext_handle_unwritten_extents
> ext4_ext_convert_to_initialized
> ext4_split_extent
> ext4_split_extent_at
> __ext4_ext_dirty
> __ext4_mark_inode_dirty
> ext4_reserve_inode_write
> ext4_get_inode_loc
> __ext4_get_inode_loc <--- return -ENOMEM
> sb_getblk
> __getblk_gfp
> __getblk_slow <--- return NULL
> grow_buffers
> grow_dev_page <--- return -ENXIO
> ret = (block < end_block) ? 1 : -ENXIO;
>
> In this issue, bg_inode_table_hi is overwritten as an incorrect value.
> As a result, `block < end_block` cannot be met in grow_dev_page.
> Therefore, __ext4_get_inode_loc always returns '-ENOMEM' and do_writepages
> keeps retrying. As a result, the writeback process is in the D state due
> to an infinite loop.
>
> Add a check on inode table block in the __ext4_get_inode_loc function by
> referring to ext4_read_inode_bitmap to avoid this infinite loop.
>
> Signed-off-by: Baokun Li <[email protected]>
Thanks for the fixes. Normally, we check that inode table is fine in
ext4_check_descriptors() (and those checks are much stricter) so it seems
unnecessary to check it again here. I understand that in your case it was
resize that corrupted the group descriptor after the filesystem was mounted
which is nasty but there's much more metadata that can be corrupted like
this and it's infeasible to check each metadata block before we use it.
IMHO a proper fix to this class of issues would be for sb_getblk() to
return proper error so that we can distinguish ENOMEM from other errors.
But that will be a larger undertaking...
Honza
> ---
> fs/ext4/inode.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 601214453c3a..5e171879fa23 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4466,9 +4466,17 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
> inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
> inode_offset = ((ino - 1) %
> EXT4_INODES_PER_GROUP(sb));
> - block = ext4_inode_table(sb, gdp) + (inode_offset / inodes_per_block);
> iloc->offset = (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb);
>
> + block = ext4_inode_table(sb, gdp);
> + if ((block <= le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block)) ||
> + (block >= ext4_blocks_count(EXT4_SB(sb)->s_es))) {
> + ext4_error(sb, "Invalid inode table block %llu in "
> + "block_group %u", block, iloc->block_group);
> + return -EFSCORRUPTED;
> + }
> + block += (inode_offset / inodes_per_block);
> +
> bh = sb_getblk(sb, block);
> if (unlikely(!bh))
> return -ENOMEM;
> --
> 2.31.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On 22/08/17 04:31PM, Jan Kara wrote:
> On Wed 17-08-22 21:27:01, Baokun Li wrote:
> > In do_writepages, if the value returned by ext4_writepages is "-ENOMEM"
> > and "wbc->sync_mode == WB_SYNC_ALL", retry until the condition is not met.
> >
> > In __ext4_get_inode_loc, if the bh returned by sb_getblk is NULL,
> > the function returns -ENOMEM.
> >
> > In __getblk_slow, if the return value of grow_buffers is less than 0,
> > the function returns NULL.
> >
> > When the three processes are connected in series like the following stack,
> > an infinite loop may occur:
> >
> > do_writepages <--- keep retrying
> > ext4_writepages
> > mpage_map_and_submit_extent
> > mpage_map_one_extent
> > ext4_map_blocks
> > ext4_ext_map_blocks
> > ext4_ext_handle_unwritten_extents
> > ext4_ext_convert_to_initialized
> > ext4_split_extent
> > ext4_split_extent_at
> > __ext4_ext_dirty
> > __ext4_mark_inode_dirty
> > ext4_reserve_inode_write
> > ext4_get_inode_loc
> > __ext4_get_inode_loc <--- return -ENOMEM
> > sb_getblk
> > __getblk_gfp
> > __getblk_slow <--- return NULL
> > grow_buffers
> > grow_dev_page <--- return -ENXIO
> > ret = (block < end_block) ? 1 : -ENXIO;
> >
> > In this issue, bg_inode_table_hi is overwritten as an incorrect value.
> > As a result, `block < end_block` cannot be met in grow_dev_page.
> > Therefore, __ext4_get_inode_loc always returns '-ENOMEM' and do_writepages
> > keeps retrying. As a result, the writeback process is in the D state due
> > to an infinite loop.
> >
> > Add a check on inode table block in the __ext4_get_inode_loc function by
> > referring to ext4_read_inode_bitmap to avoid this infinite loop.
> >
> > Signed-off-by: Baokun Li <[email protected]>
>
> Thanks for the fixes. Normally, we check that inode table is fine in
> ext4_check_descriptors() (and those checks are much stricter) so it seems
> unnecessary to check it again here. I understand that in your case it was
> resize that corrupted the group descriptor after the filesystem was mounted
> which is nasty but there's much more metadata that can be corrupted like
> this and it's infeasible to check each metadata block before we use it.
>
> IMHO a proper fix to this class of issues would be for sb_getblk() to
> return proper error so that we can distinguish ENOMEM from other errors.
> But that will be a larger undertaking...
>
Hi Jan,
How about adding a wrapper around sb_getblk() which will do the basic block
bound checks for ext4. Then we can carefully convert all the callers of
sb_getblk() in ext4 to call ext4_sb_getblk().
ext4_sb_getblk() will then return either of below -
1. ERR_PTR(-EFSCORRUPTED)
2. NULL
3. struct buffer_head*
It's caller can then implement the proper error handling.
Folding a small patch to implement the simple bound check. Is this the right
approach?
From: "Ritesh Harjani (IBM)" <[email protected]>
Date: Thu, 18 Aug 2022 07:53:58 -0500
Subject: [RFC] ext4: Add ext4_sb_getblk() wrapper for block bounds checking
We might need more bounds checking on the block before calling sb_getblk().
This helper does that and if it is not valid then returns ERR_PTR(-EFSCORRUPTED)
Later we will need to carefully convert the callers to use ext4_sb_getblk()
instead of sb_getblk().
For e.g. __ext4_get_inode_loc()
Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext4/block_validity.c | 4 +---
fs/ext4/ext4.h | 12 ++++++++++++
fs/ext4/super.c | 9 +++++++++
3 files changed, 22 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/block_validity.c b/fs/ext4/block_validity.c
index 5504f72bbbbe..69347fab7c38 100644
--- a/fs/ext4/block_validity.c
+++ b/fs/ext4/block_validity.c
@@ -301,9 +301,7 @@ int ext4_sb_block_valid(struct super_block *sb, struct inode *inode,
struct rb_node *n;
int ret = 1;
- if ((start_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)) ||
- (start_blk + count < start_blk) ||
- (start_blk + count > ext4_blocks_count(sbi->s_es)))
+ if (!ext4_sb_block_valid_fastcheck(sbi->s_es, start_blk, count))
return 0;
/*
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9bca5565547b..79d0e45185d3 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3072,6 +3072,8 @@ extern struct buffer_head *ext4_sb_bread(struct super_block *sb,
sector_t block, blk_opf_t op_flags);
extern struct buffer_head *ext4_sb_bread_unmovable(struct super_block *sb,
sector_t block);
+extern struct buffer_head *ext4_sb_getblk(struct super_block *sb,
+ sector_t block);
extern void ext4_read_bh_nowait(struct buffer_head *bh, blk_opf_t op_flags,
bh_end_io_t *end_io);
extern int ext4_read_bh(struct buffer_head *bh, blk_opf_t op_flags,
@@ -3358,6 +3360,16 @@ static inline unsigned int ext4_flex_bg_size(struct ext4_sb_info *sbi)
return 1 << sbi->s_log_groups_per_flex;
}
+static inline bool ext4_sb_block_valid_fastcheck(struct ext4_super_block *es,
+ sector_t start_blk, unsigned int count)
+{
+ if ((start_blk <= le32_to_cpu(es->s_first_data_block)) ||
+ (start_blk + count < start_blk) ||
+ (start_blk + count > ext4_blocks_count(es)))
+ return false;
+ return true;
+}
+
#define ext4_std_error(sb, errno) \
do { \
if ((errno)) \
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 9a66abcca1a8..5b29272ad9a9 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -269,6 +269,15 @@ void ext4_sb_breadahead_unmovable(struct super_block *sb, sector_t block)
}
}
+struct buffer_head *ext4_sb_getblk(struct super_block *sb, sector_t block)
+{
+ struct ext4_super_block *es = EXT4_SB(sb)->s_es;
+
+ if (!ext4_sb_block_valid_fastcheck(es, block, 1))
+ return ERR_PTR(-EFSCORRUPTED);
+ return sb_getblk(sb, block);
+}
+
static int ext4_verify_csum_type(struct super_block *sb,
struct ext4_super_block *es)
{
--
2.25.1
-ritesh
> Honza
>
> > ---
> > fs/ext4/inode.c | 10 +++++++++-
> > 1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 601214453c3a..5e171879fa23 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -4466,9 +4466,17 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
> > inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
> > inode_offset = ((ino - 1) %
> > EXT4_INODES_PER_GROUP(sb));
> > - block = ext4_inode_table(sb, gdp) + (inode_offset / inodes_per_block);
> > iloc->offset = (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb);
> >
> > + block = ext4_inode_table(sb, gdp);
> > + if ((block <= le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block)) ||
> > + (block >= ext4_blocks_count(EXT4_SB(sb)->s_es))) {
> > + ext4_error(sb, "Invalid inode table block %llu in "
> > + "block_group %u", block, iloc->block_group);
> > + return -EFSCORRUPTED;
> > + }
> > + block += (inode_offset / inodes_per_block);
> > +
> > bh = sb_getblk(sb, block);
> > if (unlikely(!bh))
> > return -ENOMEM;
> > --
> > 2.31.1
> >
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
On Thu 18-08-22 20:13:53, Ritesh Harjani (IBM) wrote:
> On 22/08/17 04:31PM, Jan Kara wrote:
> > On Wed 17-08-22 21:27:01, Baokun Li wrote:
> > > In do_writepages, if the value returned by ext4_writepages is "-ENOMEM"
> > > and "wbc->sync_mode == WB_SYNC_ALL", retry until the condition is not met.
> > >
> > > In __ext4_get_inode_loc, if the bh returned by sb_getblk is NULL,
> > > the function returns -ENOMEM.
> > >
> > > In __getblk_slow, if the return value of grow_buffers is less than 0,
> > > the function returns NULL.
> > >
> > > When the three processes are connected in series like the following stack,
> > > an infinite loop may occur:
> > >
> > > do_writepages <--- keep retrying
> > > ext4_writepages
> > > mpage_map_and_submit_extent
> > > mpage_map_one_extent
> > > ext4_map_blocks
> > > ext4_ext_map_blocks
> > > ext4_ext_handle_unwritten_extents
> > > ext4_ext_convert_to_initialized
> > > ext4_split_extent
> > > ext4_split_extent_at
> > > __ext4_ext_dirty
> > > __ext4_mark_inode_dirty
> > > ext4_reserve_inode_write
> > > ext4_get_inode_loc
> > > __ext4_get_inode_loc <--- return -ENOMEM
> > > sb_getblk
> > > __getblk_gfp
> > > __getblk_slow <--- return NULL
> > > grow_buffers
> > > grow_dev_page <--- return -ENXIO
> > > ret = (block < end_block) ? 1 : -ENXIO;
> > >
> > > In this issue, bg_inode_table_hi is overwritten as an incorrect value.
> > > As a result, `block < end_block` cannot be met in grow_dev_page.
> > > Therefore, __ext4_get_inode_loc always returns '-ENOMEM' and do_writepages
> > > keeps retrying. As a result, the writeback process is in the D state due
> > > to an infinite loop.
> > >
> > > Add a check on inode table block in the __ext4_get_inode_loc function by
> > > referring to ext4_read_inode_bitmap to avoid this infinite loop.
> > >
> > > Signed-off-by: Baokun Li <[email protected]>
> >
> > Thanks for the fixes. Normally, we check that inode table is fine in
> > ext4_check_descriptors() (and those checks are much stricter) so it seems
> > unnecessary to check it again here. I understand that in your case it was
> > resize that corrupted the group descriptor after the filesystem was mounted
> > which is nasty but there's much more metadata that can be corrupted like
> > this and it's infeasible to check each metadata block before we use it.
> >
> > IMHO a proper fix to this class of issues would be for sb_getblk() to
> > return proper error so that we can distinguish ENOMEM from other errors.
> > But that will be a larger undertaking...
> >
>
> Hi Jan,
>
> How about adding a wrapper around sb_getblk() which will do the basic block
> bound checks for ext4. Then we can carefully convert all the callers of
> sb_getblk() in ext4 to call ext4_sb_getblk().
>
> ext4_sb_getblk() will then return either of below -
> 1. ERR_PTR(-EFSCORRUPTED)
> 2. NULL
> 3. struct buffer_head*
>
> It's caller can then implement the proper error handling.
>
> Folding a small patch to implement the simple bound check. Is this the right
> approach?
Yep, looks sensible to me. Maybe I'd just make ext4_sb_getblk() return bh
or ERR_PTR so something like ERR_PTR(-EFSCORRUPTED), ERR_PTR(-ENXIO), or bh
pointer.
Honza
>
> From: "Ritesh Harjani (IBM)" <[email protected]>
> Date: Thu, 18 Aug 2022 07:53:58 -0500
> Subject: [RFC] ext4: Add ext4_sb_getblk() wrapper for block bounds checking
>
> We might need more bounds checking on the block before calling sb_getblk().
> This helper does that and if it is not valid then returns ERR_PTR(-EFSCORRUPTED)
> Later we will need to carefully convert the callers to use ext4_sb_getblk()
> instead of sb_getblk().
> For e.g. __ext4_get_inode_loc()
>
> Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
> ---
> fs/ext4/block_validity.c | 4 +---
> fs/ext4/ext4.h | 12 ++++++++++++
> fs/ext4/super.c | 9 +++++++++
> 3 files changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/block_validity.c b/fs/ext4/block_validity.c
> index 5504f72bbbbe..69347fab7c38 100644
> --- a/fs/ext4/block_validity.c
> +++ b/fs/ext4/block_validity.c
> @@ -301,9 +301,7 @@ int ext4_sb_block_valid(struct super_block *sb, struct inode *inode,
> struct rb_node *n;
> int ret = 1;
>
> - if ((start_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)) ||
> - (start_blk + count < start_blk) ||
> - (start_blk + count > ext4_blocks_count(sbi->s_es)))
> + if (!ext4_sb_block_valid_fastcheck(sbi->s_es, start_blk, count))
> return 0;
>
> /*
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 9bca5565547b..79d0e45185d3 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3072,6 +3072,8 @@ extern struct buffer_head *ext4_sb_bread(struct super_block *sb,
> sector_t block, blk_opf_t op_flags);
> extern struct buffer_head *ext4_sb_bread_unmovable(struct super_block *sb,
> sector_t block);
> +extern struct buffer_head *ext4_sb_getblk(struct super_block *sb,
> + sector_t block);
> extern void ext4_read_bh_nowait(struct buffer_head *bh, blk_opf_t op_flags,
> bh_end_io_t *end_io);
> extern int ext4_read_bh(struct buffer_head *bh, blk_opf_t op_flags,
> @@ -3358,6 +3360,16 @@ static inline unsigned int ext4_flex_bg_size(struct ext4_sb_info *sbi)
> return 1 << sbi->s_log_groups_per_flex;
> }
>
> +static inline bool ext4_sb_block_valid_fastcheck(struct ext4_super_block *es,
> + sector_t start_blk, unsigned int count)
> +{
> + if ((start_blk <= le32_to_cpu(es->s_first_data_block)) ||
> + (start_blk + count < start_blk) ||
> + (start_blk + count > ext4_blocks_count(es)))
> + return false;
> + return true;
> +}
> +
> #define ext4_std_error(sb, errno) \
> do { \
> if ((errno)) \
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 9a66abcca1a8..5b29272ad9a9 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -269,6 +269,15 @@ void ext4_sb_breadahead_unmovable(struct super_block *sb, sector_t block)
> }
> }
>
> +struct buffer_head *ext4_sb_getblk(struct super_block *sb, sector_t block)
> +{
> + struct ext4_super_block *es = EXT4_SB(sb)->s_es;
> +
> + if (!ext4_sb_block_valid_fastcheck(es, block, 1))
> + return ERR_PTR(-EFSCORRUPTED);
> + return sb_getblk(sb, block);
> +}
> +
> static int ext4_verify_csum_type(struct super_block *sb,
> struct ext4_super_block *es)
> {
> --
> 2.25.1
>
> -ritesh
>
>
>
> > Honza
> >
> > > ---
> > > fs/ext4/inode.c | 10 +++++++++-
> > > 1 file changed, 9 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index 601214453c3a..5e171879fa23 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -4466,9 +4466,17 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
> > > inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
> > > inode_offset = ((ino - 1) %
> > > EXT4_INODES_PER_GROUP(sb));
> > > - block = ext4_inode_table(sb, gdp) + (inode_offset / inodes_per_block);
> > > iloc->offset = (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb);
> > >
> > > + block = ext4_inode_table(sb, gdp);
> > > + if ((block <= le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block)) ||
> > > + (block >= ext4_blocks_count(EXT4_SB(sb)->s_es))) {
> > > + ext4_error(sb, "Invalid inode table block %llu in "
> > > + "block_group %u", block, iloc->block_group);
> > > + return -EFSCORRUPTED;
> > > + }
> > > + block += (inode_offset / inodes_per_block);
> > > +
> > > bh = sb_getblk(sb, block);
> > > if (unlikely(!bh))
> > > return -ENOMEM;
> > > --
> > > 2.31.1
> > >
> > --
> > Jan Kara <[email protected]>
> > SUSE Labs, CR
--
Jan Kara <[email protected]>
SUSE Labs, CR
On 22/08/18 07:23PM, Jan Kara wrote:
> On Thu 18-08-22 20:13:53, Ritesh Harjani (IBM) wrote:
> > On 22/08/17 04:31PM, Jan Kara wrote:
> > > On Wed 17-08-22 21:27:01, Baokun Li wrote:
> > > > In do_writepages, if the value returned by ext4_writepages is "-ENOMEM"
> > > > and "wbc->sync_mode == WB_SYNC_ALL", retry until the condition is not met.
> > > >
> > > > In __ext4_get_inode_loc, if the bh returned by sb_getblk is NULL,
> > > > the function returns -ENOMEM.
> > > >
> > > > In __getblk_slow, if the return value of grow_buffers is less than 0,
> > > > the function returns NULL.
> > > >
> > > > When the three processes are connected in series like the following stack,
> > > > an infinite loop may occur:
> > > >
> > > > do_writepages <--- keep retrying
> > > > ext4_writepages
> > > > mpage_map_and_submit_extent
> > > > mpage_map_one_extent
> > > > ext4_map_blocks
> > > > ext4_ext_map_blocks
> > > > ext4_ext_handle_unwritten_extents
> > > > ext4_ext_convert_to_initialized
> > > > ext4_split_extent
> > > > ext4_split_extent_at
> > > > __ext4_ext_dirty
> > > > __ext4_mark_inode_dirty
> > > > ext4_reserve_inode_write
> > > > ext4_get_inode_loc
> > > > __ext4_get_inode_loc <--- return -ENOMEM
> > > > sb_getblk
> > > > __getblk_gfp
> > > > __getblk_slow <--- return NULL
> > > > grow_buffers
> > > > grow_dev_page <--- return -ENXIO
> > > > ret = (block < end_block) ? 1 : -ENXIO;
> > > >
> > > > In this issue, bg_inode_table_hi is overwritten as an incorrect value.
> > > > As a result, `block < end_block` cannot be met in grow_dev_page.
> > > > Therefore, __ext4_get_inode_loc always returns '-ENOMEM' and do_writepages
> > > > keeps retrying. As a result, the writeback process is in the D state due
> > > > to an infinite loop.
> > > >
> > > > Add a check on inode table block in the __ext4_get_inode_loc function by
> > > > referring to ext4_read_inode_bitmap to avoid this infinite loop.
> > > >
> > > > Signed-off-by: Baokun Li <[email protected]>
> > >
> > > Thanks for the fixes. Normally, we check that inode table is fine in
> > > ext4_check_descriptors() (and those checks are much stricter) so it seems
> > > unnecessary to check it again here. I understand that in your case it was
> > > resize that corrupted the group descriptor after the filesystem was mounted
> > > which is nasty but there's much more metadata that can be corrupted like
> > > this and it's infeasible to check each metadata block before we use it.
> > >
> > > IMHO a proper fix to this class of issues would be for sb_getblk() to
> > > return proper error so that we can distinguish ENOMEM from other errors.
> > > But that will be a larger undertaking...
> > >
> >
> > Hi Jan,
> >
> > How about adding a wrapper around sb_getblk() which will do the basic block
> > bound checks for ext4. Then we can carefully convert all the callers of
> > sb_getblk() in ext4 to call ext4_sb_getblk().
> >
> > ext4_sb_getblk() will then return either of below -
> > 1. ERR_PTR(-EFSCORRUPTED)
> > 2. NULL
> > 3. struct buffer_head*
> >
> > It's caller can then implement the proper error handling.
> >
> > Folding a small patch to implement the simple bound check. Is this the right
> > approach?
>
> Yep, looks sensible to me. Maybe I'd just make ext4_sb_getblk() return bh
> or ERR_PTR so something like ERR_PTR(-EFSCORRUPTED), ERR_PTR(-ENXIO), or bh
> pointer.
Sure, Thanks Jan. Will do that once I clear some confusion w.r.t
"start_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)"
At some places this is checked with "<= s_first_data_block"
e.g. fs/ext4/ialloc.c, ext4_sb_block_valid()
while at some places I see it to be "< s_first_data_block"
e.g. fs/ext4/mballoc.c, fs/ext4/mmp.c
Will spend sometime to understand why the difference and if there is anything
I might miss here for off-by-one check.
Adding more to the confusion would be difference w.r.t blocksize = 1024 v/s
other blocksizes. Based on the blocksize value I guess, s_first_data_block can
be different (0/1??). Or can bigalloc can change this...
...Will look more into this.
-ritesh
>
> Honza
>
> >
> > From: "Ritesh Harjani (IBM)" <[email protected]>
> > Date: Thu, 18 Aug 2022 07:53:58 -0500
> > Subject: [RFC] ext4: Add ext4_sb_getblk() wrapper for block bounds checking
> >
> > We might need more bounds checking on the block before calling sb_getblk().
> > This helper does that and if it is not valid then returns ERR_PTR(-EFSCORRUPTED)
> > Later we will need to carefully convert the callers to use ext4_sb_getblk()
> > instead of sb_getblk().
> > For e.g. __ext4_get_inode_loc()
> >
> > Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
> > ---
> > fs/ext4/block_validity.c | 4 +---
> > fs/ext4/ext4.h | 12 ++++++++++++
> > fs/ext4/super.c | 9 +++++++++
> > 3 files changed, 22 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/ext4/block_validity.c b/fs/ext4/block_validity.c
> > index 5504f72bbbbe..69347fab7c38 100644
> > --- a/fs/ext4/block_validity.c
> > +++ b/fs/ext4/block_validity.c
> > @@ -301,9 +301,7 @@ int ext4_sb_block_valid(struct super_block *sb, struct inode *inode,
> > struct rb_node *n;
> > int ret = 1;
> >
> > - if ((start_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)) ||
> > - (start_blk + count < start_blk) ||
> > - (start_blk + count > ext4_blocks_count(sbi->s_es)))
> > + if (!ext4_sb_block_valid_fastcheck(sbi->s_es, start_blk, count))
> > return 0;
> >
> > /*
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 9bca5565547b..79d0e45185d3 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -3072,6 +3072,8 @@ extern struct buffer_head *ext4_sb_bread(struct super_block *sb,
> > sector_t block, blk_opf_t op_flags);
> > extern struct buffer_head *ext4_sb_bread_unmovable(struct super_block *sb,
> > sector_t block);
> > +extern struct buffer_head *ext4_sb_getblk(struct super_block *sb,
> > + sector_t block);
> > extern void ext4_read_bh_nowait(struct buffer_head *bh, blk_opf_t op_flags,
> > bh_end_io_t *end_io);
> > extern int ext4_read_bh(struct buffer_head *bh, blk_opf_t op_flags,
> > @@ -3358,6 +3360,16 @@ static inline unsigned int ext4_flex_bg_size(struct ext4_sb_info *sbi)
> > return 1 << sbi->s_log_groups_per_flex;
> > }
> >
> > +static inline bool ext4_sb_block_valid_fastcheck(struct ext4_super_block *es,
> > + sector_t start_blk, unsigned int count)
> > +{
> > + if ((start_blk <= le32_to_cpu(es->s_first_data_block)) ||
> > + (start_blk + count < start_blk) ||
> > + (start_blk + count > ext4_blocks_count(es)))
> > + return false;
> > + return true;
> > +}
> > +
> > #define ext4_std_error(sb, errno) \
> > do { \
> > if ((errno)) \
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index 9a66abcca1a8..5b29272ad9a9 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -269,6 +269,15 @@ void ext4_sb_breadahead_unmovable(struct super_block *sb, sector_t block)
> > }
> > }
> >
> > +struct buffer_head *ext4_sb_getblk(struct super_block *sb, sector_t block)
> > +{
> > + struct ext4_super_block *es = EXT4_SB(sb)->s_es;
> > +
> > + if (!ext4_sb_block_valid_fastcheck(es, block, 1))
> > + return ERR_PTR(-EFSCORRUPTED);
> > + return sb_getblk(sb, block);
> > +}
> > +
> > static int ext4_verify_csum_type(struct super_block *sb,
> > struct ext4_super_block *es)
> > {
> > --
> > 2.25.1
> >
> > -ritesh
> >
> >
> >
> > > Honza
> > >
> > > > ---
> > > > fs/ext4/inode.c | 10 +++++++++-
> > > > 1 file changed, 9 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > index 601214453c3a..5e171879fa23 100644
> > > > --- a/fs/ext4/inode.c
> > > > +++ b/fs/ext4/inode.c
> > > > @@ -4466,9 +4466,17 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
> > > > inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
> > > > inode_offset = ((ino - 1) %
> > > > EXT4_INODES_PER_GROUP(sb));
> > > > - block = ext4_inode_table(sb, gdp) + (inode_offset / inodes_per_block);
> > > > iloc->offset = (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb);
> > > >
> > > > + block = ext4_inode_table(sb, gdp);
> > > > + if ((block <= le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block)) ||
> > > > + (block >= ext4_blocks_count(EXT4_SB(sb)->s_es))) {
> > > > + ext4_error(sb, "Invalid inode table block %llu in "
> > > > + "block_group %u", block, iloc->block_group);
> > > > + return -EFSCORRUPTED;
> > > > + }
> > > > + block += (inode_offset / inodes_per_block);
> > > > +
> > > > bh = sb_getblk(sb, block);
> > > > if (unlikely(!bh))
> > > > return -ENOMEM;
> > > > --
> > > > 2.31.1
> > > >
> > > --
> > > Jan Kara <[email protected]>
> > > SUSE Labs, CR
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
On Fri 19-08-22 04:45:41, Ritesh Harjani (IBM) wrote:
> On 22/08/18 07:23PM, Jan Kara wrote:
> > On Thu 18-08-22 20:13:53, Ritesh Harjani (IBM) wrote:
> > > On 22/08/17 04:31PM, Jan Kara wrote:
> > > > On Wed 17-08-22 21:27:01, Baokun Li wrote:
> > > > > In do_writepages, if the value returned by ext4_writepages is "-ENOMEM"
> > > > > and "wbc->sync_mode == WB_SYNC_ALL", retry until the condition is not met.
> > > > >
> > > > > In __ext4_get_inode_loc, if the bh returned by sb_getblk is NULL,
> > > > > the function returns -ENOMEM.
> > > > >
> > > > > In __getblk_slow, if the return value of grow_buffers is less than 0,
> > > > > the function returns NULL.
> > > > >
> > > > > When the three processes are connected in series like the following stack,
> > > > > an infinite loop may occur:
> > > > >
> > > > > do_writepages <--- keep retrying
> > > > > ext4_writepages
> > > > > mpage_map_and_submit_extent
> > > > > mpage_map_one_extent
> > > > > ext4_map_blocks
> > > > > ext4_ext_map_blocks
> > > > > ext4_ext_handle_unwritten_extents
> > > > > ext4_ext_convert_to_initialized
> > > > > ext4_split_extent
> > > > > ext4_split_extent_at
> > > > > __ext4_ext_dirty
> > > > > __ext4_mark_inode_dirty
> > > > > ext4_reserve_inode_write
> > > > > ext4_get_inode_loc
> > > > > __ext4_get_inode_loc <--- return -ENOMEM
> > > > > sb_getblk
> > > > > __getblk_gfp
> > > > > __getblk_slow <--- return NULL
> > > > > grow_buffers
> > > > > grow_dev_page <--- return -ENXIO
> > > > > ret = (block < end_block) ? 1 : -ENXIO;
> > > > >
> > > > > In this issue, bg_inode_table_hi is overwritten as an incorrect value.
> > > > > As a result, `block < end_block` cannot be met in grow_dev_page.
> > > > > Therefore, __ext4_get_inode_loc always returns '-ENOMEM' and do_writepages
> > > > > keeps retrying. As a result, the writeback process is in the D state due
> > > > > to an infinite loop.
> > > > >
> > > > > Add a check on inode table block in the __ext4_get_inode_loc function by
> > > > > referring to ext4_read_inode_bitmap to avoid this infinite loop.
> > > > >
> > > > > Signed-off-by: Baokun Li <[email protected]>
> > > >
> > > > Thanks for the fixes. Normally, we check that inode table is fine in
> > > > ext4_check_descriptors() (and those checks are much stricter) so it seems
> > > > unnecessary to check it again here. I understand that in your case it was
> > > > resize that corrupted the group descriptor after the filesystem was mounted
> > > > which is nasty but there's much more metadata that can be corrupted like
> > > > this and it's infeasible to check each metadata block before we use it.
> > > >
> > > > IMHO a proper fix to this class of issues would be for sb_getblk() to
> > > > return proper error so that we can distinguish ENOMEM from other errors.
> > > > But that will be a larger undertaking...
> > > >
> > >
> > > Hi Jan,
> > >
> > > How about adding a wrapper around sb_getblk() which will do the basic block
> > > bound checks for ext4. Then we can carefully convert all the callers of
> > > sb_getblk() in ext4 to call ext4_sb_getblk().
> > >
> > > ext4_sb_getblk() will then return either of below -
> > > 1. ERR_PTR(-EFSCORRUPTED)
> > > 2. NULL
> > > 3. struct buffer_head*
> > >
> > > It's caller can then implement the proper error handling.
> > >
> > > Folding a small patch to implement the simple bound check. Is this the right
> > > approach?
> >
> > Yep, looks sensible to me. Maybe I'd just make ext4_sb_getblk() return bh
> > or ERR_PTR so something like ERR_PTR(-EFSCORRUPTED), ERR_PTR(-ENXIO), or bh
> > pointer.
>
> Sure, Thanks Jan. Will do that once I clear some confusion w.r.t
> "start_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)"
>
> At some places this is checked with "<= s_first_data_block"
> e.g. fs/ext4/ialloc.c, ext4_sb_block_valid()
>
> while at some places I see it to be "< s_first_data_block"
> e.g. fs/ext4/mballoc.c, fs/ext4/mmp.c
Well, superblock is stored at s_first_data_block offset. So strictly
speaking the check should be < s_first_data_block because that block is a
valid filesystem block. OTOH in most places you are not supposed to look at
block with the superblock so stricter <= s_first_data_block is fine.
> Will spend sometime to understand why the difference and if there is anything
> I might miss here for off-by-one check.
>
> Adding more to the confusion would be difference w.r.t blocksize = 1024 v/s
> other blocksizes. Based on the blocksize value I guess, s_first_data_block can
> be different (0/1??). Or can bigalloc can change this...
> ...Will look more into this.
That's what we discussed on our ext4 call yesterday. Normally,
s_first_data_block is 1 for blocksize 1k and 0 for blocksize > 1k. So for
1k blocksize the first group begins with block 1 and not block 0.
Effectively the whole filesystem is shifted by 1 block with 1k blocksize.
When bigalloc comes to play and blocksize is 1k, things are even more
interesting because there, the first group starts at block 0 while
s_first_data_block is still 1, because superblock is stored in block 1 of
cluster 0.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu, Aug 18, 2022 at 08:13:53PM +0530, Ritesh Harjani (IBM) wrote:
> Folding a small patch to implement the simple bound check. Is this the right
> approach?
>
> From: "Ritesh Harjani (IBM)" <[email protected]>
> Date: Thu, 18 Aug 2022 07:53:58 -0500
> Subject: [RFC] ext4: Add ext4_sb_getblk() wrapper for block bounds checking
>
> We might need more bounds checking on the block before calling sb_getblk().
> This helper does that and if it is not valid then returns ERR_PTR(-EFSCORRUPTED)
> Later we will need to carefully convert the callers to use ext4_sb_getblk()
> instead of sb_getblk().
Hey Ritesh,
I was going through some old patches and came across this RFC patch.
Have you had a chance to polish this up? I don't think I've seen a
newer version of this patch, but maybe I missed it.
Thanks,
- Ted
On 22/11/28 03:44PM, Theodore Ts'o wrote:
> On Thu, Aug 18, 2022 at 08:13:53PM +0530, Ritesh Harjani (IBM) wrote:
> > Folding a small patch to implement the simple bound check. Is this the right
> > approach?
> >
> > From: "Ritesh Harjani (IBM)" <[email protected]>
> > Date: Thu, 18 Aug 2022 07:53:58 -0500
> > Subject: [RFC] ext4: Add ext4_sb_getblk() wrapper for block bounds checking
> >
> > We might need more bounds checking on the block before calling sb_getblk().
> > This helper does that and if it is not valid then returns ERR_PTR(-EFSCORRUPTED)
> > Later we will need to carefully convert the callers to use ext4_sb_getblk()
> > instead of sb_getblk().
>
> Hey Ritesh,
>
> I was going through some old patches and came across this RFC patch.
> Have you had a chance to polish this up? I don't think I've seen a
> newer version of this patch, but maybe I missed it.
Hello Ted,
Sorry, I guess I completely forgot about this, since around those dates
I was travelling. I will definitely try and get back to this once I finish
few ongoing things on my plate.
Thanks for bringing it up!
-ritesh