Hello,

attached is the C program I use for reproducing the mballoc performance
issue. I run it as:

  stress-unlink -s -c 100 -f 22528 16 0 /mnt

								Honza

On Wed 27-07-22 12:51:23, Jan Kara wrote:
> Hello,
>
> before going on vacation I was tracking down why the reaim benchmark
> regresses (10-20%) with a larger number of processes under the new
> mb_optimize_scan strategy of mballoc. After a while I reproduced the
> regression with a simple benchmark that just creates, fsyncs, and deletes
> lots of small files (22k) from 16 processes, each process working in its
> own directory. The immediate reason for the slowdown is that with
> mb_optimize_scan=1 the file blocks are spread among more block groups and
> thus we have more bitmaps to update in each transaction.
>
> So the question is why mballoc with mb_optimize_scan=1 spreads allocations
> more among block groups. The situation is somewhat obscured by the group
> preallocation feature of mballoc, where each *CPU* holds a preallocation
> and small (below 64k) allocations on that CPU are served from it. If I
> trace the creation of these group preallocations, the block groups they
> are taken from look like this:
>
> mb_optimize_scan=0:
> 49 81 113 97 17 33 113 49 81 33 97 113 81 1 17 33 33 81 1 113 97 17 113 113
> 33 33 97 81 49 81 17 49
>
> mb_optimize_scan=1:
> 127 126 126 125 126 127 125 126 127 124 123 124 122 122 121 120 119 118 117
> 116 115 116 114 113 111 110 109 108 107 106 105 104 104
>
> So we can see that with mb_optimize_scan=0 the preallocation is always
> taken from one of a few groups (among which we jump mostly randomly),
> while with mb_optimize_scan=1 we consistently drift from higher block
> groups to lower block groups.
>
> The mb_optimize_scan=0 behavior follows from the fact that the search for
> free space always starts in the block group where the inode is allocated,
> and the inode is always allocated in the same block group as its parent
> directory. So the create-delete benchmark generally keeps all inodes for
> one process in the same block group, and allocations always start in that
> block group. Because the files are small, we always succeed in finding
> free space in the starting block group, so allocations are generally
> restricted to the several block groups where the parent directories were
> originally allocated.
>
> With mb_optimize_scan=1 the block group to allocate from is selected by
> ext4_mb_choose_next_group_cr0(), so in this mode we completely ignore the
> "pack inode with data in the same group" rule. The reason why we keep
> drifting among block groups is that whenever free space in a block group
> is updated (blocks allocated / freed), we recalculate its largest free
> order (see mb_mark_used() and mb_free_blocks()), and as a side effect that
> removes the group from the bb_largest_free_order_node list and reinserts
> it at the tail.
>
> I have two questions about the mb_optimize_scan=1 strategy:
>
> 1) Shouldn't we respect the initial goal group and try to allocate from it
> in ext4_mb_regular_allocator() before calling ext4_mb_choose_next_group()?
>
> 2) The rotation of groups in mb_set_largest_free_order() seems a bit
> undesirable to me. In particular it seems pointless if the largest free
> order does not change. Was there some rationale behind it?
>
> 								Honza
> --
> Jan Kara
> SUSE Labs, CR
--
Jan Kara
SUSE Labs, CR
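
For reference, the workload boils down to roughly the following sketch.
This is not the attached stress-unlink program; NPROC, NFILES, FILE_SIZE,
the per-process directory layout and the worker() helper are illustrative
assumptions that loosely mirror the command line above (assuming -f gives
the file size in bytes):

/*
 * Minimal sketch of the described workload: N worker processes, each
 * creating, fsyncing and unlinking small files in its own directory.
 * NOT the attached stress-unlink program; names and defaults are
 * illustrative only.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC      16      /* number of worker processes */
#define NFILES     100     /* files created per process */
#define FILE_SIZE  22528   /* ~22k, small enough for group preallocation */

static void worker(const char *base, int id)
{
	char dir[256], path[512];
	char *buf;
	int i, fd;

	snprintf(dir, sizeof(dir), "%s/proc-%d", base, id);
	if (mkdir(dir, 0755) && errno != EEXIST) {
		perror("mkdir");
		exit(1);
	}
	buf = calloc(1, FILE_SIZE);
	if (!buf)
		exit(1);

	for (i = 0; i < NFILES; i++) {
		snprintf(path, sizeof(path), "%s/file-%d", dir, i);
		fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
		if (fd < 0) {
			perror("open");
			exit(1);
		}
		if (write(fd, buf, FILE_SIZE) != FILE_SIZE) {
			perror("write");
			exit(1);
		}
		fsync(fd);	/* force block allocation + journal commit */
		close(fd);
		unlink(path);	/* delete the file again right away */
	}
	free(buf);
	exit(0);
}

int main(int argc, char **argv)
{
	const char *base = argc > 1 ? argv[1] : "/mnt";
	int i;

	for (i = 0; i < NPROC; i++) {
		pid_t pid = fork();

		if (pid == 0)
			worker(base, i);
		if (pid < 0) {
			perror("fork");
			return 1;
		}
	}
	for (i = 0; i < NPROC; i++)
		wait(NULL);
	return 0;
}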
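
And to illustrate question 2) above: whenever a group's free space
changes, the group is deleted from its largest-free-order list and
re-added at the tail even when the recomputed order is unchanged, so
cr0 selection keeps walking on to other groups. Below is a simplified
userspace sketch of that rotation effect; the structures and helpers
are stand-ins, not the actual mballoc code:

/*
 * Userspace illustration of the rotation described above: updating a
 * group always removes it from the per-order list and re-adds it at
 * the tail, even if its largest free order did not change.
 */
#include <stdio.h>

struct group {
	int index;
	int largest_free_order;
	struct group *prev, *next;	/* node on the per-order list */
};

/* one circular list head per "order"; only a single order shown here */
static struct group order_list = { .prev = &order_list, .next = &order_list };

static void list_del(struct group *g)
{
	g->prev->next = g->next;
	g->next->prev = g->prev;
}

static void list_add_tail(struct group *g)
{
	g->prev = order_list.prev;
	g->next = &order_list;
	order_list.prev->next = g;
	order_list.prev = g;
}

/* mimics the described behavior: always delete + re-add at the tail */
static void update_group(struct group *g, int new_order)
{
	list_del(g);
	g->largest_free_order = new_order;	/* may well be unchanged */
	list_add_tail(g);			/* group rotates to the tail */
}

static void dump(const char *when)
{
	struct group *g;

	printf("%s:", when);
	for (g = order_list.next; g != &order_list; g = g->next)
		printf(" %d", g->index);
	printf("\n");
}

int main(void)
{
	struct group groups[4];
	int i;

	for (i = 0; i < 4; i++) {
		groups[i].index = i;
		groups[i].largest_free_order = 5;
		list_add_tail(&groups[i]);
	}
	dump("initial list order");

	/* touch group 0: its order stays 5, yet it moves to the tail */
	update_group(&groups[0], 5);
	dump("after touching group 0");
	return 0;
}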