Hello,

attached is the C program I use for reproducing the mballoc performance
issue. I run it as:

  stress-unlink -s -c 100 -f 22528 16 0 /mnt

								Honza

On Wed 27-07-22 12:51:23, Jan Kara wrote:
> Hello,
>
> before going on vacation I was tracking down why the reaim benchmark
> regresses (10-20%) with a larger number of processes under the new
> mb_optimize_scan strategy of mballoc. After a while I reproduced the
> regression with a simple benchmark that just creates, fsyncs, and deletes
> lots of small files (22k) from 16 processes, each process working in its
> own directory. The immediate reason for the slowdown is that with
> mb_optimize_scan=1 the file blocks are spread among more block groups and
> thus we have more bitmaps to update in each transaction.
>
> So the question is why mballoc with mb_optimize_scan=1 spreads allocations
> more among block groups. The situation is somewhat obscured by the group
> preallocation feature of mballoc, where each *CPU* holds a preallocation
> and small (below 64k) allocations on that CPU are served from it. If I
> trace the creation of these group preallocations, the block groups they
> are taken from look like this:
>
> mb_optimize_scan=0:
> 49 81 113 97 17 33 113 49 81 33 97 113 81 1 17 33 33 81 1 113 97 17 113 113
> 33 33 97 81 49 81 17 49
>
> mb_optimize_scan=1:
> 127 126 126 125 126 127 125 126 127 124 123 124 122 122 121 120 119 118 117
> 116 115 116 114 113 111 110 109 108 107 106 105 104 104
>
> So we can see that with mb_optimize_scan=0 the preallocation is always
> taken from one of a few groups (among which we jump mostly randomly),
> while with mb_optimize_scan=1 we consistently drift from higher block
> groups to lower block groups.
>
> The mb_optimize_scan=0 behavior follows from the fact that the search for
> free space always starts in the block group where the inode is allocated,
> and the inode is always allocated in the same block group as its parent
> directory. So the create-delete benchmark generally keeps all inodes for
> one process in the same block group, and allocations always start in that
> block group. Because the files are small, we always succeed in finding
> free space in the starting block group, so allocations are generally
> restricted to the several block groups where the parent directories were
> originally allocated.
>
> With mb_optimize_scan=1 the block group to allocate from is selected by
> ext4_mb_choose_next_group_cr0(), so in this mode we completely ignore the
> "pack inode with data in the same group" rule. The reason why we keep
> drifting among block groups is that whenever free space in a block group
> is updated (blocks allocated / freed), we recalculate its largest free
> order (see mb_mark_used() and mb_free_blocks()), and as a side effect that
> removes the group from the bb_largest_free_order_node list and reinserts
> it at the tail.
>
> I have two questions about the mb_optimize_scan=1 strategy:
>
> 1) Shouldn't we respect the initial goal group and try to allocate from it
> in ext4_mb_regular_allocator() before calling ext4_mb_choose_next_group()?
>
> 2) The rotation of groups in mb_set_largest_free_order() seems a bit
> undesirable to me. In particular it seems pointless if the largest free
> order does not change. Was there some rationale behind it?
>
> 								Honza
> --
> Jan Kara
> SUSE Labs, CR
--
Jan Kara
SUSE Labs, CR
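
For reference, the workload boils down to roughly the following sketch.
This is not the attached stress-unlink program; NPROC, NFILES, FILE_SIZE,
the per-process directory layout and the worker() helper are illustrative
assumptions that loosely mirror the command line above (assuming -f gives
the file size in bytes):

/*
 * Minimal sketch of the described workload: N worker processes, each
 * creating, fsyncing and unlinking small files in its own directory.
 * NOT the attached stress-unlink program; names and defaults are
 * illustrative only.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC      16      /* number of worker processes */
#define NFILES     100     /* files created per process */
#define FILE_SIZE  22528   /* ~22k, small enough for group preallocation */

static void worker(const char *base, int id)
{
	char dir[256], path[512];
	char *buf;
	int i, fd;

	snprintf(dir, sizeof(dir), "%s/proc-%d", base, id);
	if (mkdir(dir, 0755) && errno != EEXIST) {
		perror("mkdir");
		exit(1);
	}
	buf = calloc(1, FILE_SIZE);
	if (!buf)
		exit(1);

	for (i = 0; i < NFILES; i++) {
		snprintf(path, sizeof(path), "%s/file-%d", dir, i);
		fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
		if (fd < 0) {
			perror("open");
			exit(1);
		}
		if (write(fd, buf, FILE_SIZE) != FILE_SIZE) {
			perror("write");
			exit(1);
		}
		fsync(fd);	/* force block allocation + journal commit */
		close(fd);
		unlink(path);	/* delete the file again right away */
	}
	free(buf);
	exit(0);
}

int main(int argc, char **argv)
{
	const char *base = argc > 1 ? argv[1] : "/mnt";
	int i;

	for (i = 0; i < NPROC; i++) {
		pid_t pid = fork();

		if (pid == 0)
			worker(base, i);
		if (pid < 0) {
			perror("fork");
			return 1;
		}
	}
	for (i = 0; i < NPROC; i++)
		wait(NULL);
	return 0;
}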
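
And to illustrate question 2) above: whenever a group's free space
changes, the group is deleted from its largest-free-order list and
re-added at the tail even when the recomputed order is unchanged, so
cr0 selection keeps walking on to other groups. Below is a simplified
userspace sketch of that rotation effect; the structures and helpers
are stand-ins, not the actual mballoc code:

/*
 * Userspace illustration of the rotation described above: updating a
 * group always removes it from the per-order list and re-adds it at
 * the tail, even if its largest free order did not change.
 */
#include <stdio.h>

struct group {
	int index;
	int largest_free_order;
	struct group *prev, *next;	/* node on the per-order list */
};

/* one circular list head per "order"; only a single order shown here */
static struct group order_list = { .prev = &order_list, .next = &order_list };

static void list_del(struct group *g)
{
	g->prev->next = g->next;
	g->next->prev = g->prev;
}

static void list_add_tail(struct group *g)
{
	g->prev = order_list.prev;
	g->next = &order_list;
	order_list.prev->next = g;
	order_list.prev = g;
}

/* mimics the described behavior: always delete + re-add at the tail */
static void update_group(struct group *g, int new_order)
{
	list_del(g);
	g->largest_free_order = new_order;	/* may well be unchanged */
	list_add_tail(g);			/* group rotates to the tail */
}

static void dump(const char *when)
{
	struct group *g;

	printf("%s:", when);
	for (g = order_list.next; g != &order_list; g = g->next)
		printf(" %d", g->index);
	printf("\n");
}

int main(void)
{
	struct group groups[4];
	int i;

	for (i = 0; i < 4; i++) {
		groups[i].index = i;
		groups[i].largest_free_order = 5;
		list_add_tail(&groups[i]);
	}
	dump("initial list order");

	/* touch group 0: its order stays 5, yet it moves to the tail */
	update_group(&groups[0], 5);
	dump("after touching group 0");
	return 0;
}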