On 22/05/21 09:42PM, Baokun Li wrote:
> Hulk Robot reported a BUG_ON:
> ==================================================================
> kernel BUG at fs/ext4/mballoc.c:3211!
> [...]
> RIP: 0010:ext4_mb_mark_diskspace_used.cold+0x85/0x136f
> [...]
> Call Trace:
> ext4_mb_new_blocks+0x9df/0x5d30
> ext4_ext_map_blocks+0x1803/0x4d80
> ext4_map_blocks+0x3a4/0x1a10
> ext4_writepages+0x126d/0x2c30
> do_writepages+0x7f/0x1b0
> __filemap_fdatawrite_range+0x285/0x3b0
> file_write_and_wait_range+0xb1/0x140
> ext4_sync_file+0x1aa/0xca0
> vfs_fsync_range+0xfb/0x260
> do_fsync+0x48/0xa0
> [...]
> ==================================================================
>
> Above issue may happen as follows:
> -------------------------------------
> do_fsync
> vfs_fsync_range
> ext4_sync_file
> file_write_and_wait_range
> __filemap_fdatawrite_range
> do_writepages
> ext4_writepages
> mpage_map_and_submit_extent
> mpage_map_one_extent
> ext4_map_blocks
> ext4_mb_new_blocks
> ext4_mb_normalize_request
> >>> start + size <= ac->ac_o_ex.fe_logical
> ext4_mb_regular_allocator
> ext4_mb_simple_scan_group
> ext4_mb_use_best_found
> ext4_mb_new_preallocation
> ext4_mb_new_inode_pa
> ext4_mb_use_inode_pa
> >>> set ac->ac_b_ex.fe_len <= 0
> ext4_mb_mark_diskspace_used
> >>> BUG_ON(ac->ac_b_ex.fe_len <= 0);
>
> we can easily reproduce this problem with the following commands:
> `fallocate -l100M disk`
> `mkfs.ext4 -b 1024 -g 256 disk`
> `mount disk /mnt`
> `fsstress -d /mnt -l 0 -n 1000 -p 1`
Thanks for sharing the reproducer. Yes, this could easily trigger the bug_on.
We don't even need "-b 1024".
>
> The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP.
> Therefore, "start + size <= ac->ac_o_ex.fe_logical" may occur
> when the size is truncated. So start should be the start position of
> the group where ac_o_ex.fe_logical is located after alignment.
> In addition, when the value of fe_logical or EXT4_BLOCKS_PER_GROUP
> is very large, the value calculated by start_off is more accurate.
>
> Fixes: cd648b8a8fd5 ("ext4: trim allocation requests to group size")
Commit message does say that it can result into allocation request bigger then
the size of the block group. So then, what happens in case of flex_bg feature is
enabled (which by default nowadays).
Shouldn't we consider flex_bg_groups * EXT4_BLOCKS_PER_GROUP as the allocation size request?
>
>
> Reported-by: Hulk Robot <[email protected]>
> Signed-off-by: Baokun Li <[email protected]>
> ---
> fs/ext4/mballoc.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index ea653d19f9ec..32410b79b664 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -4107,6 +4107,17 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
> size = size >> bsbits;
> start = start_off >> bsbits;
>
> + /*
> + * Because size must be less than or equal to
> + * EXT4_BLOCKS_PER_GROUP,
We should confirm whether in case of flex_bg groups, this assumption
holds correct?
> start should be the start position of
> + * the group where ac_o_ex.fe_logical is located after alignment.
> + * In addition, when the value of fe_logical or
> + * EXT4_BLOCKS_PER_GROUP is very large, the value calculated
> + * by start_off is more accurate.
> + */
> + start = max(start, round_down(ac->ac_o_ex.fe_logical,
> + EXT4_BLOCKS_PER_GROUP(ac->ac_sb)));
> +
This does looks like it could solve the problem at hand.
This is because we try to allocate based on the start offset upto size.
As of now we do allocate more space then requested (due to normalization),
but due to the start offset of our preallocation, our allocation request
of ac->ac_o_ex.fe_logical + fe_len doesn't lie in the allocated PA. This happens since
we trim the size of the allocation request to only till blocks_per_group.
So with that if we make the above changes (your patch), it does looks like, that we
will then be allocating the PA such that our allocation request will fall within
the allocated PA (in ext4_mb_use_inode_pa()) and hence we won't hit the bug_on in
ext4_mb_mark_diskspace_used().
One more suggestion - I think we should also add a WARN_ON()/BUG_ON() here.
This makes the start of the problem more visible, rather then waiting till
ext4_mb_mark_diskspace_used() call to hit the bug_on().
Thoughts?
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 252c168454c7..e91d5aeb8efd 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4301,6 +4301,7 @@ static void ext4_mb_use_inode_pa(struct ext4_allocation_context *ac,
BUG_ON(start < pa->pa_pstart);
BUG_ON(end > pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len));
BUG_ON(pa->pa_free < len);
+ WARN_ON(len <= 0);
pa->pa_free -= len;
mb_debug(ac->ac_sb, "use %llu/%d from inode pa %p\n", start, len, pa);
-ritesh