Hello,
before going on vacation I was tracking down why the reaim benchmark
regresses (10-20%) with a larger number of processes under the new
mb_optimize_scan strategy of mballoc. After a while I reproduced the
regression with a simple benchmark that just creates, fsyncs, and deletes
lots of small files (22k) from 16 processes, each process in its own
directory. The immediate reason for the slowdown is that with
mb_optimize_scan=1 the file blocks are spread among more block groups and
thus we have more bitmaps to update in each transaction.
So the question is why mballoc with mb_optimize_scan=1 spreads allocations
more among block groups. The situation is somewhat obscured by the group
preallocation feature of mballoc, where each *CPU* holds a preallocation and
small (below 64k) allocations on that CPU are allocated from this
preallocation. If I trace the creation of these group preallocations, I can
see that the block groups they are taken from look like:
mb_optimize_scan=0:
49 81 113 97 17 33 113 49 81 33 97 113 81 1 17 33 33 81 1 113 97 17 113 113
33 33 97 81 49 81 17 49
mb_optimize_scan=1:
127 126 126 125 126 127 125 126 127 124 123 124 122 122 121 120 119 118 117
116 115 116 114 113 111 110 109 108 107 106 105 104 104
So we can see that while with mb_optimize_scan=0 the preallocation is
always taken from one of a few groups (among which we jump mostly randomly),
with mb_optimize_scan=1 we consistently drift from higher block groups to
lower block groups.
The mb_optimize_scan=0 behavior is explained by the fact that the search for
free space always starts in the block group where the inode is allocated,
and the inode is always allocated in the same block group as its parent
directory. So the create-delete benchmark generally keeps all inodes for
one process in the same block group and thus allocations are always
starting in that block group. Because files are small, we always succeed in
finding free space in the starting block group and thus allocations are
generally restricted to the several block groups where parent directories
were originally allocated.
With mb_optimize_scan=1 the block group to allocate from is selected by
ext4_mb_choose_next_group_cr0(), so in this mode we completely ignore the
"pack inode with data in the same group" rule. The reason why we keep
drifting among block groups is that whenever the free space in a block group
is updated (blocks allocated / freed) we recalculate its largest free order
(see mb_mark_used() and mb_free_blocks()), and as a side effect that removes
the group from the bb_largest_free_order_node list and reinserts it at the
tail.
I have two questions about the mb_optimize_scan=1 strategy:
1) Shouldn't we respect the initial goal group and try to allocate from it
in ext4_mb_regular_allocator() before calling ext4_mb_choose_next_group()?
2) The rotation of groups in mb_set_largest_free_order() seems a bit
undesirable to me. In particular it seems pointless if the largest free
order does not change. Was there some rationale behind it?
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Jul 27, 2022, at 4:51 AM, Jan Kara <[email protected]> wrote:
>
> Hello,
>
> before going on vacation I was tracking down why reaim benchmark regresses
> (10-20%) with larger number of processes with the new mb_optimize_scan
> strategy of mballoc. After a while I have reproduced the regression with a
> simple benchmark that just creates, fsyncs, and deletes lots of small files
> (22k) from 16 processes, each process has its own directory. The immediate
> reason for the slow down is that with mb_optimize_scan=1 the file blocks
> are spread among more block groups and thus we have more bitmaps to update
> in each transaction.
>
> So the question is why mballoc with mb_optimize_scan=1 spreads allocations
> more among block groups. The situation is somewhat obscured by group
> preallocation feature of mballoc where each *CPU* holds a preallocation and
> small (below 64k) allocations on that CPU are allocated from this
> preallocation. If I trace creating of these group preallocations I can see
> that the block groups they are taken from look like:
>
> mb_optimize_scan=0:
> 49 81 113 97 17 33 113 49 81 33 97 113 81 1 17 33 33 81 1 113 97 17 113 113
> 33 33 97 81 49 81 17 49
>
> mb_optimize_scan=1:
> 127 126 126 125 126 127 125 126 127 124 123 124 122 122 121 120 119 118 117
> 116 115 116 114 113 111 110 109 108 107 106 105 104 104
>
> So we can see that while with mb_optimize_scan=0 the preallocation is
> always taken from one of a few groups (among which we jump mostly randomly),
> with mb_optimize_scan=1 we consistently drift from higher block groups to
> lower block groups.
>
> The mb_optimize_scan=0 behavior is given by the fact that search for free
> space always starts in the same block group where the inode is allocated
> and the inode is always allocated in the same block group as its parent
> directory. So the create-delete benchmark generally keeps all inodes for
> one process in the same block group and thus allocations are always
> starting in that block group. Because files are small, we always succeed in
> finding free space in the starting block group and thus allocations are
> generally restricted to the several block groups where parent directories
> were originally allocated.
>
> With mb_optimize_scan=1 the block group to allocate from is selected by
> ext4_mb_choose_next_group_cr0() so in this mode we completely ignore the
> "pack inode with data in the same group" rule. The reason why we keep
> drifting among block groups is that whenever free space in a block group is
> updated (blocks allocated / freed) we recalculate largest free order (see
> mb_mark_used() and mb_free_blocks()) and as a side effect that removes
> group from the bb_largest_free_order_node list and reinserts the group at
> the tail.
>
> I have two questions about the mb_optimize_scan=1 strategy:
>
> 1) Shouldn't we respect the initial goal group and try to allocate from it
> in ext4_mb_regular_allocator() before calling ext4_mb_choose_next_group()?
I would agree with this. Keeping the allocations close to the inode is a
useful first-guess heuristic that fairly distributes load across the groups
and doesn't cost anything in terms of complexity/searching. For multiple
writers of large inodes in the same group there might initially be some
contention, but with 8MB allocation chunks and 128MB groups this would be
resolved quickly, and those writers should end up in different groups after
the initial group is full.
> 2) The rotation of groups in mb_set_largest_free_order() seems a bit
> undesirable to me. In particular it seems pointless if the largest free
> order does not change. Was there some rationale behind it?
I'd actually been wondering about this same thing recently. IMHO, it makes
sense for the per-CPU small-block preallocation to stick with the same group
once the current PA chunk is full, unless that group itself is full. I don't
think it is desirable to fill every group in the filesystem only 1/32 full
(or so) and then move on to the next empty group.
Selecting a new group should only be done when the current group is (nearly)
full. That avoids partial filling of all groups, and also avoids contention
on the group selection locks.
As for the rotation of the groups in the list, this was to avoid concurrent
writers always contending for the same group at the start of the list.
However, it makes more sense if each writer (or per-CPU PA) is "sticky" with
a particular group until it is full, rather than doing a new group search
each time.
Cheers, Andreas
On 22/07/27 12:51PM, Jan Kara wrote:
> Hello,
>
> before going on vacation I was tracking down why reaim benchmark regresses
> (10-20%) with larger number of processes with the new mb_optimize_scan
> strategy of mballoc. After a while I have reproduced the regression with a
> simple benchmark that just creates, fsyncs, and deletes lots of small files
> (22k) from 16 processes, each process has its own directory. The immediate
> reason for the slow down is that with mb_optimize_scan=1 the file blocks
> are spread among more block groups and thus we have more bitmaps to update
> in each transaction.
To add a little more info on why this regression may be getting noticed this
late: initially the patch series had a bug where the optimization was never
enabled for files with extents, until that was fixed by this patch:
https://lore.kernel.org/linux-ext4/fc9a48f7f8dcfc83891a8b21f6dd8cdf056ed810.1646732698.git.ojaswin@linux.ibm.com/#t
>
> So the question is why mballoc with mb_optimize_scan=1 spreads allocations
> more among block groups. The situation is somewhat obscured by group
> preallocation feature of mballoc where each *CPU* holds a preallocation and
> small (below 64k) allocations on that CPU are allocated from this
> preallocation. If I trace creating of these group preallocations I can see
> that the block groups they are taken from look like:
>
> mb_optimize_scan=0:
> 49 81 113 97 17 33 113 49 81 33 97 113 81 1 17 33 33 81 1 113 97 17 113 113
> 33 33 97 81 49 81 17 49
>
> mb_optimize_scan=1:
> 127 126 126 125 126 127 125 126 127 124 123 124 122 122 121 120 119 118 117
> 116 115 116 114 113 111 110 109 108 107 106 105 104 104
>
> So we can see that while with mb_optimize_scan=0 the preallocation is
> always taken from one of a few groups (among which we jump mostly randomly),
> with mb_optimize_scan=1 we consistently drift from higher block groups to
> lower block groups.
>
> The mb_optimize_scan=0 behavior is given by the fact that search for free
> space always starts in the same block group where the inode is allocated
> and the inode is always allocated in the same block group as its parent
> directory. So the create-delete benchmark generally keeps all inodes for
> one process in the same block group and thus allocations are always
> starting in that block group. Because files are small, we always succeed in
> finding free space in the starting block group and thus allocations are
> generally restricted to the several block groups where parent directories
> were originally allocated.
>
> With mb_optimize_scan=1 the block group to allocate from is selected by
> ext4_mb_choose_next_group_cr0() so in this mode we completely ignore the
> "pack inode with data in the same group" rule. The reason why we keep
> drifting among block groups is that whenever free space in a block group is
> updated (blocks allocated / freed) we recalculate largest free order (see
> mb_mark_used() and mb_free_blocks()) and as a side effect that removes
> group from the bb_largest_free_order_node list and reinserts the group at
> the tail.
One thing that comes to mind is to cache the last block group from which an
allocation was satisfied, and only if that fails, we could then try the
largest_free_order() bg.
>
> I have two questions about the mb_optimize_scan=1 strategy:
>
> 1) Shouldn't we respect the initial goal group and try to allocate from it
> in ext4_mb_regular_allocator() before calling ext4_mb_choose_next_group()?
I remember discussing this problem and I think the argument at that time was...
""" ...snip from the cover letter.
These changes may result in allocations being spread across the block
device. While that would not matter for some block devices (such as flash)
it may be a cause of concern for other block devices that benefit from
storing related content together, such as disks. However, it can be
argued that in a high fragmentation scenario, especially for large disks,
it's still worth optimizing the scanning since in such cases, we get
cpu bound on group scanning instead of getting IO bound. Perhaps, in
future, we could dynamically turn this new optimization on based on
fragmentation levels for such devices.
"""
...but maybe more explanations can be added by others.
>
> 2) The rotation of groups in mb_set_largest_free_order() seems a bit
> undesirable to me. In particular it seems pointless if the largest free
> order does not change. Was there some rationale behind it?
Agree.
Also,
I am wondering whether there is a bot which does reaim benchmark testing too
on any of the performance patches, e.g. [1].
[1]: https://github.com/intel/lkp-tests/blob/3fece75132266f680047f4e1740b39c5b3faabbf/tests/reaim
Can the submitter of a patch also trigger this performance benchmark testing?
I have generally seen some kernel test bot reports with performance score
results, but I am not sure if there is an easy way to trigger this like how we
have for syzbot. Any idea?
-ritesh
On Wed 27-07-22 22:37:04, Ritesh Harjani wrote:
> On 22/07/27 12:51PM, Jan Kara wrote:
> > Hello,
> >
> > before going on vacation I was tracking down why reaim benchmark regresses
> > (10-20%) with larger number of processes with the new mb_optimize_scan
> > strategy of mballoc. After a while I have reproduced the regression with a
> > simple benchmark that just creates, fsyncs, and deletes lots of small files
> > (22k) from 16 processes, each process has its own directory. The immediate
> > reason for the slow down is that with mb_optimize_scan=1 the file blocks
> > are spread among more block groups and thus we have more bitmaps to update
> > in each transaction.
>
> To add a little more info on why this regression may be getting noticed this
> late: initially the patch series had a bug where the optimization was never
> enabled for files with extents, until that was fixed by this patch:
>
> https://lore.kernel.org/linux-ext4/fc9a48f7f8dcfc83891a8b21f6dd8cdf056ed810.1646732698.git.ojaswin@linux.ibm.com/#t
Yes. Also it took me quite some time to get to analyzing the regression
reported by our automated testing and understand what's going on...
> > So the question is why mballoc with mb_optimize_scan=1 spreads allocations
> > more among block groups. The situation is somewhat obscured by group
> > preallocation feature of mballoc where each *CPU* holds a preallocation and
> > small (below 64k) allocations on that CPU are allocated from this
> > preallocation. If I trace creating of these group preallocations I can see
> > that the block groups they are taken from look like:
> >
> > mb_optimize_scan=0:
> > 49 81 113 97 17 33 113 49 81 33 97 113 81 1 17 33 33 81 1 113 97 17 113 113
> > 33 33 97 81 49 81 17 49
> >
> > mb_optimize_scan=1:
> > 127 126 126 125 126 127 125 126 127 124 123 124 122 122 121 120 119 118 117
> > 116 115 116 114 113 111 110 109 108 107 106 105 104 104
> >
> > So we can see that while with mb_optimize_scan=0 the preallocation is
> > always taken from one of a few groups (among which we jump mostly randomly),
> > with mb_optimize_scan=1 we consistently drift from higher block groups to
> > lower block groups.
> >
> > The mb_optimize_scan=0 behavior is given by the fact that search for free
> > space always starts in the same block group where the inode is allocated
> > and the inode is always allocated in the same block group as its parent
> > directory. So the create-delete benchmark generally keeps all inodes for
> > one process in the same block group and thus allocations are always
> > starting in that block group. Because files are small, we always succeed in
> > finding free space in the starting block group and thus allocations are
> > generally restricted to the several block groups where parent directories
> > were originally allocated.
> >
> > With mb_optimize_scan=1 the block group to allocate from is selected by
> > ext4_mb_choose_next_group_cr0() so in this mode we completely ignore the
> > "pack inode with data in the same group" rule. The reason why we keep
> > drifting among block groups is that whenever free space in a block group is
> > updated (blocks allocated / freed) we recalculate largest free order (see
> > mb_mark_used() and mb_free_blocks()) and as a side effect that removes
> > group from the bb_largest_free_order_node list and reinserts the group at
> > the tail.
>
> One thing that comes to mind is to cache the last block group from which an
> allocation was satisfied, and only if that fails, we could then try the
> largest_free_order() bg.
Yes, this sounds like a reasonable heuristic to me.
> > I have two questions about the mb_optimize_scan=1 strategy:
> >
> > 1) Shouldn't we respect the initial goal group and try to allocate from it
> > in ext4_mb_regular_allocator() before calling ext4_mb_choose_next_group()?
>
> I remember discussing this problem and I think the argument at that time was...
>
> """ ...snip from the cover letter.
> These changes may result in allocations being spread across the block
> device. While that would not matter for some block devices (such as flash)
> it may be a cause of concern for other block devices that benefit from
> storing related content together, such as disks. However, it can be
> argued that in a high fragmentation scenario, especially for large disks,
> it's still worth optimizing the scanning since in such cases, we get
> cpu bound on group scanning instead of getting IO bound. Perhaps, in
> future, we could dynamically turn this new optimization on based on
> fragmentation levels for such devices.
> """
>
> ...but maybe more explainations can be added by others.
But this reasoning seems to be explaining that selecting the group by the
largest free order may spread allocations more (as groups fill up, a
group's largest free order will decrease and we'll move to the next group)
and that faster allocation is worth the spreading. But I don't think it
justifies why it is good to rotate among groups that have the same largest
free order...
> > 2) The rotation of groups in mb_set_largest_free_order() seems a bit
> > undesirable to me. In particular it seems pointless if the largest free
> > order does not change. Was there some rationale behind it?
>
> Agree.
>
> Also,
> I am wondering on whether there is a bot which does reaim benchmark
> testing too on any of the performance patches. For e.g. [1].
>
> [1]: https://github.com/intel/lkp-tests/blob/3fece75132266f680047f4e1740b39c5b3faabbf/tests/reaim
>
> Can the submitter of a patch also trigger this performance benchmark testing?
> I have generally seen some kernel test bot reports with performance score
> results, but I am not sure if there is an easy way to trigger this like
> how we have for syzbot. Any idea?
I don't think there's a way to trigger this from the outside.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Hello,
attached is the C program I use for reproducing the mballoc performance issue.
I run it as:
stress-unlink -s -c 100 -f 22528 16 0 /mnt
Honza
On Wed 27-07-22 12:51:23, Jan Kara wrote:
> Hello,
>
> before going on vacation I was tracking down why reaim benchmark regresses
> (10-20%) with larger number of processes with the new mb_optimize_scan
> strategy of mballoc. After a while I have reproduced the regression with a
> simple benchmark that just creates, fsyncs, and deletes lots of small files
> (22k) from 16 processes, each process has its own directory. The immediate
> reason for the slow down is that with mb_optimize_scan=1 the file blocks
> are spread among more block groups and thus we have more bitmaps to update
> in each transaction.
>
> So the question is why mballoc with mb_optimize_scan=1 spreads allocations
> more among block groups. The situation is somewhat obscured by group
> preallocation feature of mballoc where each *CPU* holds a preallocation and
> small (below 64k) allocations on that CPU are allocated from this
> preallocation. If I trace creating of these group preallocations I can see
> that the block groups they are taken from look like:
>
> mb_optimize_scan=0:
> 49 81 113 97 17 33 113 49 81 33 97 113 81 1 17 33 33 81 1 113 97 17 113 113
> 33 33 97 81 49 81 17 49
>
> mb_optimize_scan=1:
> 127 126 126 125 126 127 125 126 127 124 123 124 122 122 121 120 119 118 117
> 116 115 116 114 113 111 110 109 108 107 106 105 104 104
>
> So we can see that while with mb_optimize_scan=0 the preallocation is
> always taken from one of a few groups (among which we jump mostly randomly),
> with mb_optimize_scan=1 we consistently drift from higher block groups to
> lower block groups.
>
> The mb_optimize_scan=0 behavior is given by the fact that search for free
> space always starts in the same block group where the inode is allocated
> and the inode is always allocated in the same block group as its parent
> directory. So the create-delete benchmark generally keeps all inodes for
> one process in the same block group and thus allocations are always
> starting in that block group. Because files are small, we always succeed in
> finding free space in the starting block group and thus allocations are
> generally restricted to the several block groups where parent directories
> were originally allocated.
>
> With mb_optimize_scan=1 the block group to allocate from is selected by
> ext4_mb_choose_next_group_cr0() so in this mode we completely ignore the
> "pack inode with data in the same group" rule. The reason why we keep
> drifting among block groups is that whenever free space in a block group is
> updated (blocks allocated / freed) we recalculate largest free order (see
> mb_mark_used() and mb_free_blocks()) and as a side effect that removes
> group from the bb_largest_free_order_node list and reinserts the group at
> the tail.
>
> I have two questions about the mb_optimize_scan=1 strategy:
>
> 1) Shouldn't we respect the initial goal group and try to allocate from it
> in ext4_mb_regular_allocator() before calling ext4_mb_choose_next_group()?
>
> 2) The rotation of groups in mb_set_largest_free_order() seems a bit
> undesirable to me. In particular it seems pointless if the largest free
> order does not change. Was there some rationale behind it?
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
--
Jan Kara <[email protected]>
SUSE Labs, CR