2024-03-19 18:26:09

by Ojaswin Mujoo

[permalink] [raw]
Subject: Re: [PATCH v3 4/9] ext4: fix slab-out-of-bounds in ext4_mb_find_good_group_avg_frag_lists()

On Tue, Mar 19, 2024 at 06:05:53PM +0800, Baokun Li wrote:
> On 2024/3/18 20:39, Ojaswin Mujoo wrote:
> > On Thu, Mar 14, 2024 at 10:09:01PM +0800, Baokun Li wrote:
> > > --- a/fs/ext4/mballoc.c
> > > +++ b/fs/ext4/mballoc.c
> > > @@ -831,6 +831,8 @@ static int mb_avg_fragment_size_order(struct super_block *sb, ext4_grpblk_t len)
> > > return 0;
> > > if (order == MB_NUM_ORDERS(sb))
> > > order--;
> > > + if (WARN_ON_ONCE(order > MB_NUM_ORDERS(sb)))
> > > + order = MB_NUM_ORDERS(sb) - 1;
> > Hey Baokun,
> Hi Ojaswin,
> >
> > Thanks for fixing this. This patch looks good to me, feel free to add:
> >
> > Reviewed-by: Ojaswin Mujoo <[email protected]>
> Thanks for the review!
> > my comments after this are less about the patch and more about some
> > thoughts on the working of average fragment lists.
> >
> > So going through the v2 and this patch got me thinking about what really
> > is going to happen when a user tries to allocate 32768 blocks which is also
> > the maximum value we could have in say ac->ac_g_ex.fe_len.
> >
> > When this happens, ext4_mb_regular_allocator() will directly set the
> > criteria as CR_GOAL_LEN_FAST. Now, we'll follow:
> >
> > ext4_mb_choose_next_group_goal_fast()
> > for (i=mb_avg_fragment_size_order(); i < MB_NUM_ORDERS; i++) { .. }
> >
> > Here, mb_avg_fragment_siz_order() will do something like:
> >
> > order = fls(32768) - 2 = 14
> > ...
> > if (order == MB_NUM_ORDERS(sb))
> > order--;
> >
> > return order;
> >
> > And we'll look in the fragment list[13] and since none of the groups
> > there would have 32768 blocks free (since we dont track it here) we'll
> > unnecessarily traverse the full list before falling to CR_BEST_AVAIL_LEN
> > (this will become a noop due to the way order and min_order
> > are calculated) and eventually to CR_GOAL_LEN_SLOW where we might get
> > something or end up splitting.
> That's not quite right, in ext4_mb_choose_next_group_goal_fast() even
> though we're looking for the group with order 13, the group with 32768
> free blocks is also in there. So after passing ext4_mb_good_group() in
> ext4_mb_find_good_group_avg_frag_lists(), we get a group with 32768
> free blocks. And in ext4_mb_choose_next_group_best_avail() we were

Hey Baokun,

So IIUC, a BG with 32768 blocks free will have bb_fragments = 0 and in
mb_update_avg_fragment_size() we exit early if a BG has bb_fragments = 0
hence it won't show up in the order 13 list.

> supposed to allocate blocks quickly by trim order, so it's necessary
> here too. So there are no unnecessary loops here.
>
> But this will trigger the freshly added WARN_ON_ONCE, so in the
> new iteration I need to change it to:
>
> if (WARN_ON_ONCE(order > MB_NUM_ORDERS(ac->ac_sb) + 1))
> ??????? order = MB_NUM_ORDERS(ac->ac_sb) - 1;
>
> In addition, when the block size is 4k, there are these limitations:
>
> 1) Limit the maximum size of the data allocation estimate to 8M in
> ??? ext4_mb_normalize_request().
> 2) #define MAX_WRITEPAGES_EXTENT_LEN 2048
> 3) #define DIO_MAX_BLOCKS 4096
> 4) Metadata is generally not allocated in many blocks at a time
>
> So it seems that only group_prealloc will allocate more than 2048
> blocks at a time.
>
> And I've tried removing those 8M/2048/4096 limits before, but the
> performance of DIO write barely changed, and it doesn't look like
> the performance bottleneck is here in the number of blocks allocated
> at a time at the moment.

Ohh that's interesting, on paper I think it does seem like it should
improve the performance. I think if CR_GOAL_LEN_FAST can start including
blocks which are completely empty, and lift those restrictions then we
might see better performance. I'll try to play around a bit with this as
well.

Regards,
ojaswin

>
> Thanks,
> Baokun
> > I think something more optimal would be to:
> >
> > 1. Add another entry to average fragment lists for completely empty
> > groups. (As a sidenote i think we should use something like MB_NUM_FRAG_ORDER
> > instead of MB_NUM_ORDERS in calculating limits related to average
> > fragment lists since the NUM_ORDERS seems to be the buddy max order ie
> > 8192 blocks only valid for CR_POWER2 and shouldn't really limit the
> > fragment size lists)
> >
> > 2. If we don't want to go with 1 (maybe there's some history for that),
> > then probably should exit early from CR_GOAL_LEN_FAST so that we don't
> > iterate there.
> >
> > Would like to hear your thoughts on it Baokun, Jan.
> >
> > Regards,
> > ojaswin
> >