2024-03-20 01:31:15

by Baokun Li

[permalink] [raw]
Subject: Re: [PATCH v3 4/9] ext4: fix slab-out-of-bounds in ext4_mb_find_good_group_avg_frag_lists()

On 2024/3/20 2:25, Ojaswin Mujoo wrote:
> On Tue, Mar 19, 2024 at 06:05:53PM +0800, Baokun Li wrote:
>> On 2024/3/18 20:39, Ojaswin Mujoo wrote:
>>> On Thu, Mar 14, 2024 at 10:09:01PM +0800, Baokun Li wrote:
>>>> --- a/fs/ext4/mballoc.c
>>>> +++ b/fs/ext4/mballoc.c
>>>> @@ -831,6 +831,8 @@ static int mb_avg_fragment_size_order(struct super_block *sb, ext4_grpblk_t len)
>>>> return 0;
>>>> if (order == MB_NUM_ORDERS(sb))
>>>> order--;
>>>> + if (WARN_ON_ONCE(order > MB_NUM_ORDERS(sb)))
>>>> + order = MB_NUM_ORDERS(sb) - 1;
>>> Hey Baokun,
>> Hi Ojaswin,
>>> Thanks for fixing this. This patch looks good to me, feel free to add:
>>>
>>> Reviewed-by: Ojaswin Mujoo <[email protected]>
>> Thanks for the review!
>>> my comments after this are less about the patch and more about some
>>> thoughts on the working of average fragment lists.
>>>
>>> So going through the v2 and this patch got me thinking about what really
>>> is going to happen when a user tries to allocate 32768 blocks which is also
>>> the maximum value we could have in say ac->ac_g_ex.fe_len.
>>>
>>> When this happens, ext4_mb_regular_allocator() will directly set the
>>> criteria as CR_GOAL_LEN_FAST. Now, we'll follow:
>>>
>>> ext4_mb_choose_next_group_goal_fast()
>>> for (i=mb_avg_fragment_size_order(); i < MB_NUM_ORDERS; i++) { .. }
>>>
>>> Here, mb_avg_fragment_siz_order() will do something like:
>>>
>>> order = fls(32768) - 2 = 14
>>> ...
>>> if (order == MB_NUM_ORDERS(sb))
>>> order--;
>>>
>>> return order;
>>>
>>> And we'll look in the fragment list[13] and since none of the groups
>>> there would have 32768 blocks free (since we dont track it here) we'll
>>> unnecessarily traverse the full list before falling to CR_BEST_AVAIL_LEN
>>> (this will become a noop due to the way order and min_order
>>> are calculated) and eventually to CR_GOAL_LEN_SLOW where we might get
>>> something or end up splitting.
>> That's not quite right, in ext4_mb_choose_next_group_goal_fast() even
>> though we're looking for the group with order 13, the group with 32768
>> free blocks is also in there. So after passing ext4_mb_good_group() in
>> ext4_mb_find_good_group_avg_frag_lists(), we get a group with 32768
>> free blocks. And in ext4_mb_choose_next_group_best_avail() we were
> Hey Baokun,
>
> So IIUC, a BG with 32768 blocks free will have bb_fragments = 0 and in
> mb_update_avg_fragment_size() we exit early if a BG has bb_fragments = 0
> hence it won't show up in the order 13 list.
Hello Ojaswin,

This sounded strange, so I added the following debugging information:

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index c65fac9b8c72..c6ec423e2971 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1212,6 +1212,7 @@ void ext4_mb_generate_buddy(struct super_block *sb,
                        i = mb_find_next_zero_bit(bitmap, max, i);
        }
        grp->bb_fragments = fragments;
+       pr_err(">>> greoup: %u, bb_free: %d, bb_fragments: %d\n",
grp->bb_group, grp->bb_free, grp->bb_fragments);

        if (free != grp->bb_free) {
                ext4_grp_locked_error(sb, group, 0, 0,


Then mount an ext4 image , wait for a moment , and got the
following printout:

>>> greoup: 6, bb_free: 32768, bb_fragments: 1
>>> greoup: 5, bb_free: 31741, bb_fragments: 1
>>> greoup: 4, bb_free: 32768, bb_fragments: 1
>>> greoup: 3, bb_free: 31741, bb_fragments: 1
>>> greoup: 2, bb_free: 32768, bb_fragments: 1
>>> greoup: 1, bb_free: 31741, bb_fragments: 1
>>> greoup: 0, bb_free: 23511, bb_fragments: 1
>> supposed to allocate blocks quickly by trim order, so it's necessary
>> here too. So there are no unnecessary loops here.
>>
>> But this will trigger the freshly added WARN_ON_ONCE, so in the
>> new iteration I need to change it to:
>>
>> if (WARN_ON_ONCE(order > MB_NUM_ORDERS(ac->ac_sb) + 1))
>>         order = MB_NUM_ORDERS(ac->ac_sb) - 1;
>>
>> In addition, when the block size is 4k, there are these limitations:
>>
>> 1) Limit the maximum size of the data allocation estimate to 8M in
>>     ext4_mb_normalize_request().
>> 2) #define MAX_WRITEPAGES_EXTENT_LEN 2048
>> 3) #define DIO_MAX_BLOCKS 4096
>> 4) Metadata is generally not allocated in many blocks at a time
>>
>> So it seems that only group_prealloc will allocate more than 2048
>> blocks at a time.
>>
>> And I've tried removing those 8M/2048/4096 limits before, but the
>> performance of DIO write barely changed, and it doesn't look like
>> the performance bottleneck is here in the number of blocks allocated
>> at a time at the moment.
> Ohh that's interesting, on paper I think it does seem like it should
> improve the performance. I think if CR_GOAL_LEN_FAST can start including
> blocks which are completely empty, and lift those restrictions then we
> might see better performance. I'll try to play around a bit with this as
> well.
>
> Regards,
> ojaswin
>
OK, waiting for your good news.

--
With Best Regards,
Baokun Li
.