2022-08-23 20:39:39

by Jan Kara

[permalink] [raw]
Subject: [PATCH 0/2] ext4: Fix performance regression with mballoc

Hello,

So I have implemented mballoc improvements to avoid spreading allocations
even with mb_optimize_scan=1. It fixes the performance regression I was able
to reproduce with reaim on my test machine:

mb_optimize_scan=0 mb_optimize_scan=1 patched
Hmean disk-1 2076.12 ( 0.00%) 2099.37 ( 1.12%) 2032.52 ( -2.10%)
Hmean disk-41 92481.20 ( 0.00%) 83787.47 * -9.40%* 90308.37 ( -2.35%)
Hmean disk-81 155073.39 ( 0.00%) 135527.05 * -12.60%* 154285.71 ( -0.51%)
Hmean disk-121 185109.64 ( 0.00%) 166284.93 * -10.17%* 185298.62 ( 0.10%)
Hmean disk-161 229890.53 ( 0.00%) 207563.39 * -9.71%* 232883.32 * 1.30%*
Hmean disk-201 223333.33 ( 0.00%) 203235.59 * -9.00%* 221446.93 ( -0.84%)
Hmean disk-241 235735.25 ( 0.00%) 217705.51 * -7.65%* 239483.27 * 1.59%*
Hmean disk-281 266772.15 ( 0.00%) 241132.72 * -9.61%* 263108.62 ( -1.37%)
Hmean disk-321 265435.50 ( 0.00%) 245412.84 * -7.54%* 267277.27 ( 0.69%)

Stefan, can you please test whether these patches fix the problem for you as
well? Comments & review welcome.

Honza


2022-08-23 20:40:38

by Jan Kara

[permalink] [raw]
Subject: [PATCH 2/2] ext4: Avoid unnecessary spreading of allocations among groups

mb_set_largest_free_order() updates lists containing groups with largest
chunk of free space of given order. The way it updates it leads to
always moving the group to the tail of the list. Thus allocations
looking for free space of given order effectively end up cycling through
all groups (and due to initialization in last to first order). This
spreads allocations among block groups which reduces performance for
rotating disks or low-end flash media. Change
mb_set_largest_free_order() to only update lists if the order of the
largest free chunk in the group changed.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/mballoc.c | 24 +++++++++++++-----------
1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 41e1cfecac3b..6251b4a6cc63 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1077,23 +1077,25 @@ mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
struct ext4_sb_info *sbi = EXT4_SB(sb);
int i;

- if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
+ for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--)
+ if (grp->bb_counters[i] > 0)
+ break;
+ /* No need to move between order lists? */
+ if (!test_opt2(sb, MB_OPTIMIZE_SCAN) ||
+ i == grp->bb_largest_free_order) {
+ grp->bb_largest_free_order = i;
+ return;
+ }
+
+ if (grp->bb_largest_free_order >= 0) {
write_lock(&sbi->s_mb_largest_free_orders_locks[
grp->bb_largest_free_order]);
list_del_init(&grp->bb_largest_free_order_node);
write_unlock(&sbi->s_mb_largest_free_orders_locks[
grp->bb_largest_free_order]);
}
- grp->bb_largest_free_order = -1; /* uninit */
-
- for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
- if (grp->bb_counters[i] > 0) {
- grp->bb_largest_free_order = i;
- break;
- }
- }
- if (test_opt2(sb, MB_OPTIMIZE_SCAN) &&
- grp->bb_largest_free_order >= 0 && grp->bb_free) {
+ grp->bb_largest_free_order = i;
+ if (grp->bb_largest_free_order >= 0 && grp->bb_free) {
write_lock(&sbi->s_mb_largest_free_orders_locks[
grp->bb_largest_free_order]);
list_add_tail(&grp->bb_largest_free_order_node,
--
2.35.3

2022-08-24 10:20:08

by Stefan Wahren

[permalink] [raw]
Subject: Re: [PATCH 0/2] ext4: Fix performance regression with mballoc

Hi Jan,

Am 23.08.22 um 22:15 schrieb Jan Kara:
> Hello,
>
> So I have implemented mballoc improvements to avoid spreading allocations
> even with mb_optimize_scan=1. It fixes the performance regression I was able
> to reproduce with reaim on my test machine:
>
> mb_optimize_scan=0 mb_optimize_scan=1 patched
> Hmean disk-1 2076.12 ( 0.00%) 2099.37 ( 1.12%) 2032.52 ( -2.10%)
> Hmean disk-41 92481.20 ( 0.00%) 83787.47 * -9.40%* 90308.37 ( -2.35%)
> Hmean disk-81 155073.39 ( 0.00%) 135527.05 * -12.60%* 154285.71 ( -0.51%)
> Hmean disk-121 185109.64 ( 0.00%) 166284.93 * -10.17%* 185298.62 ( 0.10%)
> Hmean disk-161 229890.53 ( 0.00%) 207563.39 * -9.71%* 232883.32 * 1.30%*
> Hmean disk-201 223333.33 ( 0.00%) 203235.59 * -9.00%* 221446.93 ( -0.84%)
> Hmean disk-241 235735.25 ( 0.00%) 217705.51 * -7.65%* 239483.27 * 1.59%*
> Hmean disk-281 266772.15 ( 0.00%) 241132.72 * -9.61%* 263108.62 ( -1.37%)
> Hmean disk-321 265435.50 ( 0.00%) 245412.84 * -7.54%* 267277.27 ( 0.69%)
>
> Stefan, can you please test whether these patches fix the problem for you as
> well? Comments & review welcome.

i tested the whole series against 5.19 and 6.0.0-rc2. In both cases the
update process succeed which is a improvement, but the download + unpack
duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0 ( ~ 1
minute ).

Unfortuntately i don't have much time this week and next week i'm in
holidays.

Just a question, my tests always had MBCACHE=y . Is it possible that the
mb_optimize_scan is counterproductive for MBCACHE in this case?

I'm asking because before the download the update script removes the
files from the previous update process which already cause a high load.

Best regards

>
> Honza

2022-08-24 10:46:16

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/2] ext4: Fix performance regression with mballoc

Hi Stefan!

On Wed 24-08-22 12:17:14, Stefan Wahren wrote:
> Am 23.08.22 um 22:15 schrieb Jan Kara:
> > Hello,
> >
> > So I have implemented mballoc improvements to avoid spreading allocations
> > even with mb_optimize_scan=1. It fixes the performance regression I was able
> > to reproduce with reaim on my test machine:
> >
> > mb_optimize_scan=0 mb_optimize_scan=1 patched
> > Hmean disk-1 2076.12 ( 0.00%) 2099.37 ( 1.12%) 2032.52 ( -2.10%)
> > Hmean disk-41 92481.20 ( 0.00%) 83787.47 * -9.40%* 90308.37 ( -2.35%)
> > Hmean disk-81 155073.39 ( 0.00%) 135527.05 * -12.60%* 154285.71 ( -0.51%)
> > Hmean disk-121 185109.64 ( 0.00%) 166284.93 * -10.17%* 185298.62 ( 0.10%)
> > Hmean disk-161 229890.53 ( 0.00%) 207563.39 * -9.71%* 232883.32 * 1.30%*
> > Hmean disk-201 223333.33 ( 0.00%) 203235.59 * -9.00%* 221446.93 ( -0.84%)
> > Hmean disk-241 235735.25 ( 0.00%) 217705.51 * -7.65%* 239483.27 * 1.59%*
> > Hmean disk-281 266772.15 ( 0.00%) 241132.72 * -9.61%* 263108.62 ( -1.37%)
> > Hmean disk-321 265435.50 ( 0.00%) 245412.84 * -7.54%* 267277.27 ( 0.69%)
> >
> > Stefan, can you please test whether these patches fix the problem for you as
> > well? Comments & review welcome.
>
> i tested the whole series against 5.19 and 6.0.0-rc2. In both cases the
> update process succeed which is a improvement, but the download + unpack
> duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0 ( ~ 1
> minute ).

OK, thanks for testing! I'll try to check specifically untar whether I can
still see some differences in the IO pattern on my test machine.

> Unfortuntately i don't have much time this week and next week i'm in
> holidays.

No problem.

> Just a question, my tests always had MBCACHE=y . Is it possible that the
> mb_optimize_scan is counterproductive for MBCACHE in this case?

MBCACHE (despite similar name) is actually related to extended attributes
so it likely has no impact on your workload.

> I'm asking because before the download the update script removes the files
> from the previous update process which already cause a high load.

Do you mean already the removal step is noticeably slower with
mb_optimize_scan=1? The removal will be modifying directory blocks, inode
table blocks, block & inode bitmaps, and group descriptors. So if block
allocations are more spread (due to mb_optimize_scan=1 used during the
untar), the removal may also take somewhat longer.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-08-24 14:39:17

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/2] ext4: Fix performance regression with mballoc

On Wed 24-08-22 12:40:10, Jan Kara wrote:
> Hi Stefan!
>
> On Wed 24-08-22 12:17:14, Stefan Wahren wrote:
> > Am 23.08.22 um 22:15 schrieb Jan Kara:
> > > Hello,
> > >
> > > So I have implemented mballoc improvements to avoid spreading allocations
> > > even with mb_optimize_scan=1. It fixes the performance regression I was able
> > > to reproduce with reaim on my test machine:
> > >
> > > mb_optimize_scan=0 mb_optimize_scan=1 patched
> > > Hmean disk-1 2076.12 ( 0.00%) 2099.37 ( 1.12%) 2032.52 ( -2.10%)
> > > Hmean disk-41 92481.20 ( 0.00%) 83787.47 * -9.40%* 90308.37 ( -2.35%)
> > > Hmean disk-81 155073.39 ( 0.00%) 135527.05 * -12.60%* 154285.71 ( -0.51%)
> > > Hmean disk-121 185109.64 ( 0.00%) 166284.93 * -10.17%* 185298.62 ( 0.10%)
> > > Hmean disk-161 229890.53 ( 0.00%) 207563.39 * -9.71%* 232883.32 * 1.30%*
> > > Hmean disk-201 223333.33 ( 0.00%) 203235.59 * -9.00%* 221446.93 ( -0.84%)
> > > Hmean disk-241 235735.25 ( 0.00%) 217705.51 * -7.65%* 239483.27 * 1.59%*
> > > Hmean disk-281 266772.15 ( 0.00%) 241132.72 * -9.61%* 263108.62 ( -1.37%)
> > > Hmean disk-321 265435.50 ( 0.00%) 245412.84 * -7.54%* 267277.27 ( 0.69%)
> > >
> > > Stefan, can you please test whether these patches fix the problem for you as
> > > well? Comments & review welcome.
> >
> > i tested the whole series against 5.19 and 6.0.0-rc2. In both cases the
> > update process succeed which is a improvement, but the download + unpack
> > duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0 ( ~ 1
> > minute ).
>
> OK, thanks for testing! I'll try to check specifically untar whether I can
> still see some differences in the IO pattern on my test machine.

I have created the same tar archive as you've referenced (files with same
number of blocks) and looked at where blocks get allocated with
mb_optimize_scan=0 and with mb_optimize_scan=1 + my patches. And the
resulting IO pattern looks practically the same on my test machine. In
particular in both cases files get allocated only in 6 groups, if I look
at the number of erase blocks that are expected to be touched by file data
(for various erase block sizes from 512k to 4MB) I get practically same
numbers for both cases.

Ojaswin, I think you've also mentioned you were able to reproduce the issue
in your setup? Are you still able to reproduce it with the patched kernel?
Can you help debugging while Stefan is away?

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-08-24 21:50:55

by Stefan Wahren

[permalink] [raw]
Subject: Re: [PATCH 0/2] ext4: Fix performance regression with mballoc

Hi Jan,

Am 24.08.22 um 12:40 schrieb Jan Kara:
> Hi Stefan!
>
> On Wed 24-08-22 12:17:14, Stefan Wahren wrote:
>> Am 23.08.22 um 22:15 schrieb Jan Kara:
>>> Hello,
>>>
>>> So I have implemented mballoc improvements to avoid spreading allocations
>>> even with mb_optimize_scan=1. It fixes the performance regression I was able
>>> to reproduce with reaim on my test machine:
>>>
>>> mb_optimize_scan=0 mb_optimize_scan=1 patched
>>> Hmean disk-1 2076.12 ( 0.00%) 2099.37 ( 1.12%) 2032.52 ( -2.10%)
>>> Hmean disk-41 92481.20 ( 0.00%) 83787.47 * -9.40%* 90308.37 ( -2.35%)
>>> Hmean disk-81 155073.39 ( 0.00%) 135527.05 * -12.60%* 154285.71 ( -0.51%)
>>> Hmean disk-121 185109.64 ( 0.00%) 166284.93 * -10.17%* 185298.62 ( 0.10%)
>>> Hmean disk-161 229890.53 ( 0.00%) 207563.39 * -9.71%* 232883.32 * 1.30%*
>>> Hmean disk-201 223333.33 ( 0.00%) 203235.59 * -9.00%* 221446.93 ( -0.84%)
>>> Hmean disk-241 235735.25 ( 0.00%) 217705.51 * -7.65%* 239483.27 * 1.59%*
>>> Hmean disk-281 266772.15 ( 0.00%) 241132.72 * -9.61%* 263108.62 ( -1.37%)
>>> Hmean disk-321 265435.50 ( 0.00%) 245412.84 * -7.54%* 267277.27 ( 0.69%)
>>>
>>> Stefan, can you please test whether these patches fix the problem for you as
>>> well? Comments & review welcome.
>> i tested the whole series against 5.19 and 6.0.0-rc2. In both cases the
>> update process succeed which is a improvement, but the download + unpack
>> duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0 ( ~ 1
>> minute ).
> OK, thanks for testing! I'll try to check specifically untar whether I can
> still see some differences in the IO pattern on my test machine.

i made two iostat output logs during the complete download phase with
5.19 and your series applied. iostat was running via ssh connection and
rpi-update via serial console.

First with mb_optimize_scan=0

https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_nooptimize_download_success.iostat.log

Second with mb_optimize_scan=1

https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_optimize_download_success.iostat.log

Maybe this helps

>
>> Unfortuntately i don't have much time this week and next week i'm in
>> holidays.
> No problem.
>
>> Just a question, my tests always had MBCACHE=y . Is it possible that the
>> mb_optimize_scan is counterproductive for MBCACHE in this case?
> MBCACHE (despite similar name) is actually related to extended attributes
> so it likely has no impact on your workload.
>
>> I'm asking because before the download the update script removes the files
>> from the previous update process which already cause a high load.
> Do you mean already the removal step is noticeably slower with
> mb_optimize_scan=1? The removal will be modifying directory blocks, inode
> table blocks, block & inode bitmaps, and group descriptors. So if block
> allocations are more spread (due to mb_optimize_scan=1 used during the
> untar), the removal may also take somewhat longer.
Not sure about this, maybe we should concentrate on download / untar phase.
> Honza

2022-08-25 09:20:48

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/2] ext4: Fix performance regression with mballoc

Hi Stefan!

On Wed 24-08-22 23:24:43, Stefan Wahren wrote:
> Am 24.08.22 um 12:40 schrieb Jan Kara:
> > Hi Stefan!
> >
> > On Wed 24-08-22 12:17:14, Stefan Wahren wrote:
> > > Am 23.08.22 um 22:15 schrieb Jan Kara:
> > > > Hello,
> > > >
> > > > So I have implemented mballoc improvements to avoid spreading allocations
> > > > even with mb_optimize_scan=1. It fixes the performance regression I was able
> > > > to reproduce with reaim on my test machine:
> > > >
> > > > mb_optimize_scan=0 mb_optimize_scan=1 patched
> > > > Hmean disk-1 2076.12 ( 0.00%) 2099.37 ( 1.12%) 2032.52 ( -2.10%)
> > > > Hmean disk-41 92481.20 ( 0.00%) 83787.47 * -9.40%* 90308.37 ( -2.35%)
> > > > Hmean disk-81 155073.39 ( 0.00%) 135527.05 * -12.60%* 154285.71 ( -0.51%)
> > > > Hmean disk-121 185109.64 ( 0.00%) 166284.93 * -10.17%* 185298.62 ( 0.10%)
> > > > Hmean disk-161 229890.53 ( 0.00%) 207563.39 * -9.71%* 232883.32 * 1.30%*
> > > > Hmean disk-201 223333.33 ( 0.00%) 203235.59 * -9.00%* 221446.93 ( -0.84%)
> > > > Hmean disk-241 235735.25 ( 0.00%) 217705.51 * -7.65%* 239483.27 * 1.59%*
> > > > Hmean disk-281 266772.15 ( 0.00%) 241132.72 * -9.61%* 263108.62 ( -1.37%)
> > > > Hmean disk-321 265435.50 ( 0.00%) 245412.84 * -7.54%* 267277.27 ( 0.69%)
> > > >
> > > > Stefan, can you please test whether these patches fix the problem for you as
> > > > well? Comments & review welcome.
> > > i tested the whole series against 5.19 and 6.0.0-rc2. In both cases the
> > > update process succeed which is a improvement, but the download + unpack
> > > duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0 ( ~ 1
> > > minute ).
> > OK, thanks for testing! I'll try to check specifically untar whether I can
> > still see some differences in the IO pattern on my test machine.
>
> i made two iostat output logs during the complete download phase with 5.19
> and your series applied. iostat was running via ssh connection and
> rpi-update via serial console.
>
> First with mb_optimize_scan=0
>
> https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_nooptimize_download_success.iostat.log
>
> Second with mb_optimize_scan=1
>
> https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_optimize_download_success.iostat.log
>
> Maybe this helps

Thanks for the data! So this is interesting. In both iostat logs, there is
initial phase where no IO happens. I guess that's expected. It is
significantly longer in the mb_optimize_scan=0 but I suppose that is just
caused by a difference in when iostat was actually started. Then in
mb_optimize_scan=0 there is 155 seconds where the eMMC card is 100%
utilized and then iostat ends. During this time ~63MB is written
altogether. Request sizes vary a lot, average is 60KB.

In mb_optimize_scan=1 case there is 715 seconds recorded where eMMC card is
100% utilized. During this time ~133MB is written, average request size is
40KB. If I look just at first 155 seconds of the trace (assuming iostat was
in both cases terminated before writing was fully done), we have written
~53MB and average request size is 56KB.

So with mb_optimize_scan=1 we are indeed still somewhat slower but based on
the trace it is not clear why the download+unpack should take 7 minutes
instead of 1 minute. There must be some other effect we are missing.

Perhaps if you just download the archive manually, call sync(1), and measure
how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we
can see whether plain untar is indeed making the difference or there's
something else influencing the result as well (I have checked and
rpi-update does a lot of other deleting & copying as the part of the
update)? Thanks.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-08-25 16:00:22

by Stefan Wahren

[permalink] [raw]
Subject: Re: [PATCH 0/2] ext4: Fix performance regression with mballoc

Hi Jan,

Am 25.08.22 um 11:18 schrieb Jan Kara:
> Hi Stefan!
>
> On Wed 24-08-22 23:24:43, Stefan Wahren wrote:
>> Am 24.08.22 um 12:40 schrieb Jan Kara:
>>> Hi Stefan!
>>>
>>> On Wed 24-08-22 12:17:14, Stefan Wahren wrote:
>>>> Am 23.08.22 um 22:15 schrieb Jan Kara:
>>>>> Hello,
>>>>>
>>>>> So I have implemented mballoc improvements to avoid spreading allocations
>>>>> even with mb_optimize_scan=1. It fixes the performance regression I was able
>>>>> to reproduce with reaim on my test machine:
>>>>>
>>>>> mb_optimize_scan=0 mb_optimize_scan=1 patched
>>>>> Hmean disk-1 2076.12 ( 0.00%) 2099.37 ( 1.12%) 2032.52 ( -2.10%)
>>>>> Hmean disk-41 92481.20 ( 0.00%) 83787.47 * -9.40%* 90308.37 ( -2.35%)
>>>>> Hmean disk-81 155073.39 ( 0.00%) 135527.05 * -12.60%* 154285.71 ( -0.51%)
>>>>> Hmean disk-121 185109.64 ( 0.00%) 166284.93 * -10.17%* 185298.62 ( 0.10%)
>>>>> Hmean disk-161 229890.53 ( 0.00%) 207563.39 * -9.71%* 232883.32 * 1.30%*
>>>>> Hmean disk-201 223333.33 ( 0.00%) 203235.59 * -9.00%* 221446.93 ( -0.84%)
>>>>> Hmean disk-241 235735.25 ( 0.00%) 217705.51 * -7.65%* 239483.27 * 1.59%*
>>>>> Hmean disk-281 266772.15 ( 0.00%) 241132.72 * -9.61%* 263108.62 ( -1.37%)
>>>>> Hmean disk-321 265435.50 ( 0.00%) 245412.84 * -7.54%* 267277.27 ( 0.69%)
>>>>>
>>>>> Stefan, can you please test whether these patches fix the problem for you as
>>>>> well? Comments & review welcome.
>>>> i tested the whole series against 5.19 and 6.0.0-rc2. In both cases the
>>>> update process succeed which is a improvement, but the download + unpack
>>>> duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0 ( ~ 1
>>>> minute ).
>>> OK, thanks for testing! I'll try to check specifically untar whether I can
>>> still see some differences in the IO pattern on my test machine.
>> i made two iostat output logs during the complete download phase with 5.19
>> and your series applied. iostat was running via ssh connection and
>> rpi-update via serial console.
>>
>> First with mb_optimize_scan=0
>>
>> https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_nooptimize_download_success.iostat.log
>>
>> Second with mb_optimize_scan=1
>>
>> https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_optimize_download_success.iostat.log
>>
>> Maybe this helps
> Thanks for the data! So this is interesting. In both iostat logs, there is
> initial phase where no IO happens. I guess that's expected. It is
> significantly longer in the mb_optimize_scan=0 but I suppose that is just
> caused by a difference in when iostat was actually started. Then in
> mb_optimize_scan=0 there is 155 seconds where the eMMC card is 100%
> utilized and then iostat ends. During this time ~63MB is written
> altogether. Request sizes vary a lot, average is 60KB.
>
> In mb_optimize_scan=1 case there is 715 seconds recorded where eMMC card is
> 100% utilized. During this time ~133MB is written, average request size is
> 40KB. If I look just at first 155 seconds of the trace (assuming iostat was
> in both cases terminated before writing was fully done), we have written
> ~53MB and average request size is 56KB.
>
> So with mb_optimize_scan=1 we are indeed still somewhat slower but based on
> the trace it is not clear why the download+unpack should take 7 minutes
> instead of 1 minute. There must be some other effect we are missing.
>
> Perhaps if you just download the archive manually, call sync(1), and measure
> how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we
> can see whether plain untar is indeed making the difference or there's
> something else influencing the result as well (I have checked and
> rpi-update does a lot of other deleting & copying as the part of the
> update)? Thanks.

I will provide those iostats.

Btw i untar the firmware archive (mb_optimized_scan=1 and your patch)
and got following:

cat /proc/fs/ext4/mmcblk1p2/mb_structs_summary


optimize_scan: 1
max_free_order_lists:
        list_order_0_groups: 5
        list_order_1_groups: 0
        list_order_2_groups: 0
        list_order_3_groups: 0
        list_order_4_groups: 1
        list_order_5_groups: 0
        list_order_6_groups: 1
        list_order_7_groups: 1
        list_order_8_groups: 10
        list_order_9_groups: 1
        list_order_10_groups: 2
        list_order_11_groups: 0
        list_order_12_groups: 2
        list_order_13_groups: 55
fragment_size_tree:
        tree_min: 1
        tree_max: 31249

        tree_nodes: 79

Is this expected?

>
> Honza

2022-08-25 17:04:19

by Stefan Wahren

[permalink] [raw]
Subject: Re: [PATCH 0/2] ext4: Fix performance regression with mballoc

Hi Jan,

Am 25.08.22 um 11:18 schrieb Jan Kara:
> Hi Stefan!
>
> On Wed 24-08-22 23:24:43, Stefan Wahren wrote:
>> Am 24.08.22 um 12:40 schrieb Jan Kara:
>>> Hi Stefan!
>>>
>>> On Wed 24-08-22 12:17:14, Stefan Wahren wrote:
>>>> Am 23.08.22 um 22:15 schrieb Jan Kara:
>>>>> Hello,
>>>>>
>>>>> So I have implemented mballoc improvements to avoid spreading allocations
>>>>> even with mb_optimize_scan=1. It fixes the performance regression I was able
>>>>> to reproduce with reaim on my test machine:
>>>>>
>>>>> mb_optimize_scan=0 mb_optimize_scan=1 patched
>>>>> Hmean disk-1 2076.12 ( 0.00%) 2099.37 ( 1.12%) 2032.52 ( -2.10%)
>>>>> Hmean disk-41 92481.20 ( 0.00%) 83787.47 * -9.40%* 90308.37 ( -2.35%)
>>>>> Hmean disk-81 155073.39 ( 0.00%) 135527.05 * -12.60%* 154285.71 ( -0.51%)
>>>>> Hmean disk-121 185109.64 ( 0.00%) 166284.93 * -10.17%* 185298.62 ( 0.10%)
>>>>> Hmean disk-161 229890.53 ( 0.00%) 207563.39 * -9.71%* 232883.32 * 1.30%*
>>>>> Hmean disk-201 223333.33 ( 0.00%) 203235.59 * -9.00%* 221446.93 ( -0.84%)
>>>>> Hmean disk-241 235735.25 ( 0.00%) 217705.51 * -7.65%* 239483.27 * 1.59%*
>>>>> Hmean disk-281 266772.15 ( 0.00%) 241132.72 * -9.61%* 263108.62 ( -1.37%)
>>>>> Hmean disk-321 265435.50 ( 0.00%) 245412.84 * -7.54%* 267277.27 ( 0.69%)
>>>>>
>>>>> Stefan, can you please test whether these patches fix the problem for you as
>>>>> well? Comments & review welcome.
>>>> i tested the whole series against 5.19 and 6.0.0-rc2. In both cases the
>>>> update process succeed which is a improvement, but the download + unpack
>>>> duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0 ( ~ 1
>>>> minute ).
>>> OK, thanks for testing! I'll try to check specifically untar whether I can
>>> still see some differences in the IO pattern on my test machine.
>> i made two iostat output logs during the complete download phase with 5.19
>> and your series applied. iostat was running via ssh connection and
>> rpi-update via serial console.
>>
>> First with mb_optimize_scan=0
>>
>> https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_nooptimize_download_success.iostat.log
>>
>> Second with mb_optimize_scan=1
>>
>> https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_optimize_download_success.iostat.log
>>
>> Maybe this helps
> Thanks for the data! So this is interesting. In both iostat logs, there is
> initial phase where no IO happens. I guess that's expected. It is
> significantly longer in the mb_optimize_scan=0 but I suppose that is just
> caused by a difference in when iostat was actually started. Then in
> mb_optimize_scan=0 there is 155 seconds where the eMMC card is 100%
> utilized and then iostat ends. During this time ~63MB is written
> altogether. Request sizes vary a lot, average is 60KB.
>
> In mb_optimize_scan=1 case there is 715 seconds recorded where eMMC card is
> 100% utilized. During this time ~133MB is written, average request size is
> 40KB. If I look just at first 155 seconds of the trace (assuming iostat was
> in both cases terminated before writing was fully done), we have written
> ~53MB and average request size is 56KB.
>
> So with mb_optimize_scan=1 we are indeed still somewhat slower but based on
> the trace it is not clear why the download+unpack should take 7 minutes
> instead of 1 minute. There must be some other effect we are missing.
>
> Perhaps if you just download the archive manually, call sync(1), and measure
> how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we
> can see whether plain untar is indeed making the difference or there's
> something else influencing the result as well (I have checked and
> rpi-update does a lot of other deleting & copying as the part of the
> update)? Thanks.

mb_optimize_scan=0 -> almost 5 minutes

mb_optimize_scan=1 -> almost 18 minutes

https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec

>
> Honza

2022-08-25 18:05:58

by Ojaswin Mujoo

[permalink] [raw]
Subject: Re: [PATCH 0/2] ext4: Fix performance regression with mballoc

On Wed, Aug 24, 2022 at 04:13:38PM +0200, Jan Kara wrote:
> On Wed 24-08-22 12:40:10, Jan Kara wrote:
> > Hi Stefan!
> >
> > On Wed 24-08-22 12:17:14, Stefan Wahren wrote:
> > > Am 23.08.22 um 22:15 schrieb Jan Kara:
> > > > Hello,
> > > >
> > > > So I have implemented mballoc improvements to avoid spreading allocations
> > > > even with mb_optimize_scan=1. It fixes the performance regression I was able
> > > > to reproduce with reaim on my test machine:
> > > >
> > > > mb_optimize_scan=0 mb_optimize_scan=1 patched
> > > > Hmean disk-1 2076.12 ( 0.00%) 2099.37 ( 1.12%) 2032.52 ( -2.10%)
> > > > Hmean disk-41 92481.20 ( 0.00%) 83787.47 * -9.40%* 90308.37 ( -2.35%)
> > > > Hmean disk-81 155073.39 ( 0.00%) 135527.05 * -12.60%* 154285.71 ( -0.51%)
> > > > Hmean disk-121 185109.64 ( 0.00%) 166284.93 * -10.17%* 185298.62 ( 0.10%)
> > > > Hmean disk-161 229890.53 ( 0.00%) 207563.39 * -9.71%* 232883.32 * 1.30%*
> > > > Hmean disk-201 223333.33 ( 0.00%) 203235.59 * -9.00%* 221446.93 ( -0.84%)
> > > > Hmean disk-241 235735.25 ( 0.00%) 217705.51 * -7.65%* 239483.27 * 1.59%*
> > > > Hmean disk-281 266772.15 ( 0.00%) 241132.72 * -9.61%* 263108.62 ( -1.37%)
> > > > Hmean disk-321 265435.50 ( 0.00%) 245412.84 * -7.54%* 267277.27 ( 0.69%)
> > > >
> > > > Stefan, can you please test whether these patches fix the problem for you as
> > > > well? Comments & review welcome.
> > >
> > > i tested the whole series against 5.19 and 6.0.0-rc2. In both cases the
> > > update process succeed which is a improvement, but the download + unpack
> > > duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0 ( ~ 1
> > > minute ).
> >
> > OK, thanks for testing! I'll try to check specifically untar whether I can
> > still see some differences in the IO pattern on my test machine.
>
> I have created the same tar archive as you've referenced (files with same
> number of blocks) and looked at where blocks get allocated with
> mb_optimize_scan=0 and with mb_optimize_scan=1 + my patches. And the
> resulting IO pattern looks practically the same on my test machine. In
> particular in both cases files get allocated only in 6 groups, if I look
> at the number of erase blocks that are expected to be touched by file data
> (for various erase block sizes from 512k to 4MB) I get practically same
> numbers for both cases.
>
> Ojaswin, I think you've also mentioned you were able to reproduce the issue
> in your setup? Are you still able to reproduce it with the patched kernel?
> Can you help debugging while Stefan is away?
>
> Honza
Hi Jan,

So I ran some more tests on v6.0-rc2 kernel with and without your patches and
here are the details:

Workload:-
time tar -xf rpi-firmware.tar -C ./test
time sync

System details:
- Rpi 3b+ w/ 8G memory card (~4G free)
- tar is ~120MB compressed

And here is the output of time command for various tests. Since some of them
take some time to complete, I ran them only 2 3 times each so the numbers might
vary but they are indicative of the issue.

v6.0-rc2 (Without patches)

mb_optimize_scan = 0

**tar**
real 1m39.574s
user 0m10.311s
sys 0m2.761s

**sync**
real 0m22.269s
user 0m0.001s
sys 0m0.005s

mb_optimize_scan = 1

**tar**
real 21m25.288s
user 0m9.607
sys 0m3.026

**sync**
real 1m23.402s
user 0m0.005s
sys 0m0.000s

v6.0-rc2 (With patches)

mb_optimize_scan = 0

* similar to unpatched (~1 to 2mins) *

mb_optimize_scan = 1

**tar**
real 5m7.858s
user 0m11.008s
sys 0m2.739s

**sync**
real 6m7.308s
user 0m0.003s
sys 0m0.001s

At this point, I'm pretty confident that it is the untar operation that is
having most of the regression and no other download/delete operations in
rpi-update are behind the delay. Further, it does seem like your patches
improve the performance but, from my tests, we are still not close to the
mb_optimize_scan=0 numbers.

I'm going to spend some more time trying to collect the perfs and which block
group the allocations are happening during the untar to see if we can get a better
idea from that data. Let me know if you'd want me to collect anything else.

PS: One question, to find the blocks groups being used I'm planning to take
the dumpe2fs output before and after untar and then see the groups where free blocks
changed (since there is nothing much running on Pi i assume this should give us
a rough idea of allocation pattern of untar), just wanted to check if there's a
better way to get this data.

Regards,
Ojaswin
>
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2022-08-26 09:17:53

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/2] ext4: Fix performance regression with mballoc

On Thu 25-08-22 23:19:48, Ojaswin Mujoo wrote:
> On Wed, Aug 24, 2022 at 04:13:38PM +0200, Jan Kara wrote:
> > On Wed 24-08-22 12:40:10, Jan Kara wrote:
> > > Hi Stefan!
> > >
> > > On Wed 24-08-22 12:17:14, Stefan Wahren wrote:
> > > > Am 23.08.22 um 22:15 schrieb Jan Kara:
> > > > > Hello,
> > > > >
> > > > > So I have implemented mballoc improvements to avoid spreading allocations
> > > > > even with mb_optimize_scan=1. It fixes the performance regression I was able
> > > > > to reproduce with reaim on my test machine:
> > > > >
> > > > > mb_optimize_scan=0 mb_optimize_scan=1 patched
> > > > > Hmean disk-1 2076.12 ( 0.00%) 2099.37 ( 1.12%) 2032.52 ( -2.10%)
> > > > > Hmean disk-41 92481.20 ( 0.00%) 83787.47 * -9.40%* 90308.37 ( -2.35%)
> > > > > Hmean disk-81 155073.39 ( 0.00%) 135527.05 * -12.60%* 154285.71 ( -0.51%)
> > > > > Hmean disk-121 185109.64 ( 0.00%) 166284.93 * -10.17%* 185298.62 ( 0.10%)
> > > > > Hmean disk-161 229890.53 ( 0.00%) 207563.39 * -9.71%* 232883.32 * 1.30%*
> > > > > Hmean disk-201 223333.33 ( 0.00%) 203235.59 * -9.00%* 221446.93 ( -0.84%)
> > > > > Hmean disk-241 235735.25 ( 0.00%) 217705.51 * -7.65%* 239483.27 * 1.59%*
> > > > > Hmean disk-281 266772.15 ( 0.00%) 241132.72 * -9.61%* 263108.62 ( -1.37%)
> > > > > Hmean disk-321 265435.50 ( 0.00%) 245412.84 * -7.54%* 267277.27 ( 0.69%)
> > > > >
> > > > > Stefan, can you please test whether these patches fix the problem for you as
> > > > > well? Comments & review welcome.
> > > >
> > > > i tested the whole series against 5.19 and 6.0.0-rc2. In both cases the
> > > > update process succeed which is a improvement, but the download + unpack
> > > > duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0 ( ~ 1
> > > > minute ).
> > >
> > > OK, thanks for testing! I'll try to check specifically untar whether I can
> > > still see some differences in the IO pattern on my test machine.
> >
> > I have created the same tar archive as you've referenced (files with same
> > number of blocks) and looked at where blocks get allocated with
> > mb_optimize_scan=0 and with mb_optimize_scan=1 + my patches. And the
> > resulting IO pattern looks practically the same on my test machine. In
> > particular in both cases files get allocated only in 6 groups, if I look
> > at the number of erase blocks that are expected to be touched by file data
> > (for various erase block sizes from 512k to 4MB) I get practically same
> > numbers for both cases.
> >
> > Ojaswin, I think you've also mentioned you were able to reproduce the issue
> > in your setup? Are you still able to reproduce it with the patched kernel?
> > Can you help debugging while Stefan is away?
> >
> > Honza
> Hi Jan,
>
> So I ran some more tests on v6.0-rc2 kernel with and without your patches and
> here are the details:
>
> Workload:-
> time tar -xf rpi-firmware.tar -C ./test
> time sync
>
> System details:
> - Rpi 3b+ w/ 8G memory card (~4G free)
> - tar is ~120MB compressed

Hum, maybe the difference is that I've tried with somewhat larger (20G) and
otherwise empty filesystem...

> And here is the output of time command for various tests. Since some of them
> take some time to complete, I ran them only 2 3 times each so the numbers might
> vary but they are indicative of the issue.
>
> v6.0-rc2 (Without patches)
>
> mb_optimize_scan = 0
>
> **tar**
> real 1m39.574s
> user 0m10.311s
> sys 0m2.761s
>
> **sync**
> real 0m22.269s
> user 0m0.001s
> sys 0m0.005s
>
> mb_optimize_scan = 1
>
> **tar**
> real 21m25.288s
> user 0m9.607
> sys 0m3.026
>
> **sync**
> real 1m23.402s
> user 0m0.005s
> sys 0m0.000s
>
> v6.0-rc2 (With patches)
>
> mb_optimize_scan = 0
>
> * similar to unpatched (~1 to 2mins) *
>
> mb_optimize_scan = 1
>
> **tar**
> real 5m7.858s
> user 0m11.008s
> sys 0m2.739s
>
> **sync**
> real 6m7.308s
> user 0m0.003s
> sys 0m0.001s
>
> At this point, I'm pretty confident that it is the untar operation that is
> having most of the regression and no other download/delete operations in
> rpi-update are behind the delay. Further, it does seem like your patches
> improve the performance but, from my tests, we are still not close to the
> mb_optimize_scan=0 numbers.

Yes, thanks for the tests!

> I'm going to spend some more time trying to collect the perfs and which block
> group the allocations are happening during the untar to see if we can get a better
> idea from that data. Let me know if you'd want me to collect anything else.
>
> PS: One question, to find the blocks groups being used I'm planning to take
> the dumpe2fs output before and after untar and then see the groups where free blocks
> changed (since there is nothing much running on Pi i assume this should give us
> a rough idea of allocation pattern of untar), just wanted to check if there's a
> better way to get this data.

I have used 'find <target-dir> -exec filefrag -v {} \;' to get block
numbers of files. That gets you better insight than plain dumpe2fs
numbers...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-08-26 09:55:41

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/2] ext4: Fix performance regression with mballoc

Hi Stefan!

On Thu 25-08-22 17:48:32, Stefan Wahren wrote:
> Am 25.08.22 um 11:18 schrieb Jan Kara:
> > On Wed 24-08-22 23:24:43, Stefan Wahren wrote:
> > > Am 24.08.22 um 12:40 schrieb Jan Kara:
> > > > On Wed 24-08-22 12:17:14, Stefan Wahren wrote:
> > > > > Am 23.08.22 um 22:15 schrieb Jan Kara:
> > > > > > So I have implemented mballoc improvements to avoid spreading allocations
> > > > > > even with mb_optimize_scan=1. It fixes the performance regression I was able
> > > > > > to reproduce with reaim on my test machine:
> > > > > >
> > > > > > mb_optimize_scan=0 mb_optimize_scan=1 patched
> > > > > > Hmean disk-1 2076.12 ( 0.00%) 2099.37 ( 1.12%) 2032.52 ( -2.10%)
> > > > > > Hmean disk-41 92481.20 ( 0.00%) 83787.47 * -9.40%* 90308.37 ( -2.35%)
> > > > > > Hmean disk-81 155073.39 ( 0.00%) 135527.05 * -12.60%* 154285.71 ( -0.51%)
> > > > > > Hmean disk-121 185109.64 ( 0.00%) 166284.93 * -10.17%* 185298.62 ( 0.10%)
> > > > > > Hmean disk-161 229890.53 ( 0.00%) 207563.39 * -9.71%* 232883.32 * 1.30%*
> > > > > > Hmean disk-201 223333.33 ( 0.00%) 203235.59 * -9.00%* 221446.93 ( -0.84%)
> > > > > > Hmean disk-241 235735.25 ( 0.00%) 217705.51 * -7.65%* 239483.27 * 1.59%*
> > > > > > Hmean disk-281 266772.15 ( 0.00%) 241132.72 * -9.61%* 263108.62 ( -1.37%)
> > > > > > Hmean disk-321 265435.50 ( 0.00%) 245412.84 * -7.54%* 267277.27 ( 0.69%)
> > > > > >
> > > > > > Stefan, can you please test whether these patches fix the problem for you as
> > > > > > well? Comments & review welcome.
> > > > > i tested the whole series against 5.19 and 6.0.0-rc2. In both cases the
> > > > > update process succeed which is a improvement, but the download + unpack
> > > > > duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0 ( ~ 1
> > > > > minute ).
> > > > OK, thanks for testing! I'll try to check specifically untar whether I can
> > > > still see some differences in the IO pattern on my test machine.
> > > i made two iostat output logs during the complete download phase with 5.19
> > > and your series applied. iostat was running via ssh connection and
> > > rpi-update via serial console.
> > >
> > > First with mb_optimize_scan=0
> > >
> > > https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_nooptimize_download_success.iostat.log
> > >
> > > Second with mb_optimize_scan=1
> > >
> > > https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_optimize_download_success.iostat.log
> > >
> > > Maybe this helps
> > Thanks for the data! So this is interesting. In both iostat logs, there is
> > initial phase where no IO happens. I guess that's expected. It is
> > significantly longer in the mb_optimize_scan=0 but I suppose that is just
> > caused by a difference in when iostat was actually started. Then in
> > mb_optimize_scan=0 there is 155 seconds where the eMMC card is 100%
> > utilized and then iostat ends. During this time ~63MB is written
> > altogether. Request sizes vary a lot, average is 60KB.
> >
> > In mb_optimize_scan=1 case there is 715 seconds recorded where eMMC card is
> > 100% utilized. During this time ~133MB is written, average request size is
> > 40KB. If I look just at first 155 seconds of the trace (assuming iostat was
> > in both cases terminated before writing was fully done), we have written
> > ~53MB and average request size is 56KB.
> >
> > So with mb_optimize_scan=1 we are indeed still somewhat slower but based on
> > the trace it is not clear why the download+unpack should take 7 minutes
> > instead of 1 minute. There must be some other effect we are missing.
> >
> > Perhaps if you just download the archive manually, call sync(1), and measure
> > how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we
> > can see whether plain untar is indeed making the difference or there's
> > something else influencing the result as well (I have checked and
> > rpi-update does a lot of other deleting & copying as the part of the
> > update)? Thanks.
>
> I will provide those iostats.
>
> Btw i untar the firmware archive (mb_optimized_scan=1 and your patch) and
> got following:
>
> cat /proc/fs/ext4/mmcblk1p2/mb_structs_summary
>
>
> optimize_scan: 1
> max_free_order_lists:
> ??????? list_order_0_groups: 5
> ??????? list_order_1_groups: 0
> ??????? list_order_2_groups: 0
> ??????? list_order_3_groups: 0
> ??????? list_order_4_groups: 1
> ??????? list_order_5_groups: 0
> ??????? list_order_6_groups: 1
> ??????? list_order_7_groups: 1
> ??????? list_order_8_groups: 10
> ??????? list_order_9_groups: 1
> ??????? list_order_10_groups: 2
> ??????? list_order_11_groups: 0
> ??????? list_order_12_groups: 2
> ??????? list_order_13_groups: 55
> fragment_size_tree:
> ??????? tree_min: 1
> ??????? tree_max: 31249
>
> ??????? tree_nodes: 79
>
> Is this expected?

Yes, I don't see anything out of ordinary in this for a used filesystem. It
tells us there are 55 groups with big chunks of free space, there are some
groups which have only small chunks of free space but that's expected when
the filesystem is reasonably filled...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-08-26 10:36:58

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/2] ext4: Fix performance regression with mballoc

Hi Stefan,

On Thu 25-08-22 18:57:08, Stefan Wahren wrote:
> > Perhaps if you just download the archive manually, call sync(1), and measure
> > how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we
> > can see whether plain untar is indeed making the difference or there's
> > something else influencing the result as well (I have checked and
> > rpi-update does a lot of other deleting & copying as the part of the
> > update)? Thanks.
>
> mb_optimize_scan=0 -> almost 5 minutes
>
> mb_optimize_scan=1 -> almost 18 minutes
>
> https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec

Thanks! So now the iostat data indeed looks substantially different.

nooptimize optimize
Total written 183.6 MB 190.5 MB
Time (recorded) 283 s 1040 s
Avg write request size 79 KB 41 KB

So indeed with mb_optimize_scan=1 we do submit substantially smaller
requests on average. So far I'm not sure why that is. Since Ojaswin can
reproduce as well, let's see what he can see from block location info.
Thanks again for help with debugging this and enjoy your vacation!

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-08-27 14:51:00

by Ojaswin Mujoo

[permalink] [raw]
Subject: Re: [PATCH 0/2] ext4: Fix performance regression with mballoc

On Fri, Aug 26, 2022 at 12:15:22PM +0200, Jan Kara wrote:
> Hi Stefan,
>
> On Thu 25-08-22 18:57:08, Stefan Wahren wrote:
> > > Perhaps if you just download the archive manually, call sync(1), and measure
> > > how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we
> > > can see whether plain untar is indeed making the difference or there's
> > > something else influencing the result as well (I have checked and
> > > rpi-update does a lot of other deleting & copying as the part of the
> > > update)? Thanks.
> >
> > mb_optimize_scan=0 -> almost 5 minutes
> >
> > mb_optimize_scan=1 -> almost 18 minutes
> >
> > https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec
>
> Thanks! So now the iostat data indeed looks substantially different.
>
> nooptimize optimize
> Total written 183.6 MB 190.5 MB
> Time (recorded) 283 s 1040 s
> Avg write request size 79 KB 41 KB
>
> So indeed with mb_optimize_scan=1 we do submit substantially smaller
> requests on average. So far I'm not sure why that is. Since Ojaswin can
> reproduce as well, let's see what he can see from block location info.
> Thanks again for help with debugging this and enjoy your vacation!
>

Hi Jan and Stefan,

Apologies for the delay, I was on leave yesterday and couldn't find time to get to this.

So I was able to collect the block numbers using the method you suggested. I converted the
blocks numbers to BG numbers and plotted that data to visualze the allocation spread. You can
find them here:

mb-opt=0, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-0-patched.png
mb-opt=1, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-patched.png
mb-opt=1, unpatched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-unpatched.png

Observations:
* Before the patched mb_optimize_scan=1 allocations were way more spread out in
40 different BGs.
* With the patch, we still allocate in 36 different BGs but majority happen in
just 1 or 2 BGs.
* With mb_optimize_scan=0, we only allocate in just 7 unique BGs, which could
explain why this is faster.

Also, one strange thing I'm seeing is that the perfs don't really show any
particular function causing the regression, which is surprising considering
mb_optimize_scan=1 almost takes 10 times more time.

All the perfs can be found here (raw files and perf report/diff --stdio ):
https://github.com/OjaswinM/mbopt-bug/tree/master/perfs

Lastly, FWIW I'm not able to replicate the regression when using loop devices
and mb_optmize_scan=1 performs similar to mb-opmtimize_scan=0 (without patches
as well). Not sure if this is related to the issue or just some side effect of
using loop devices.

Will post here if I have any updates on this.

Regards,
Ojaswin

2022-08-29 09:08:24

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/2] ext4: Fix performance regression with mballoc

On Sat 27-08-22 20:06:00, Ojaswin Mujoo wrote:
> On Fri, Aug 26, 2022 at 12:15:22PM +0200, Jan Kara wrote:
> > Hi Stefan,
> >
> > On Thu 25-08-22 18:57:08, Stefan Wahren wrote:
> > > > Perhaps if you just download the archive manually, call sync(1), and measure
> > > > how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we
> > > > can see whether plain untar is indeed making the difference or there's
> > > > something else influencing the result as well (I have checked and
> > > > rpi-update does a lot of other deleting & copying as the part of the
> > > > update)? Thanks.
> > >
> > > mb_optimize_scan=0 -> almost 5 minutes
> > >
> > > mb_optimize_scan=1 -> almost 18 minutes
> > >
> > > https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec
> >
> > Thanks! So now the iostat data indeed looks substantially different.
> >
> > nooptimize optimize
> > Total written 183.6 MB 190.5 MB
> > Time (recorded) 283 s 1040 s
> > Avg write request size 79 KB 41 KB
> >
> > So indeed with mb_optimize_scan=1 we do submit substantially smaller
> > requests on average. So far I'm not sure why that is. Since Ojaswin can
> > reproduce as well, let's see what he can see from block location info.
> > Thanks again for help with debugging this and enjoy your vacation!
> >
>
> Hi Jan and Stefan,
>
> Apologies for the delay, I was on leave yesterday and couldn't find time to get to this.
>
> So I was able to collect the block numbers using the method you suggested. I converted the
> blocks numbers to BG numbers and plotted that data to visualze the allocation spread. You can
> find them here:
>
> mb-opt=0, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-0-patched.png
> mb-opt=1, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-patched.png
> mb-opt=1, unpatched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-unpatched.png
>
> Observations:
> * Before the patched mb_optimize_scan=1 allocations were way more spread out in
> 40 different BGs.
> * With the patch, we still allocate in 36 different BGs but majority happen in
> just 1 or 2 BGs.
> * With mb_optimize_scan=0, we only allocate in just 7 unique BGs, which could
> explain why this is faster.

Thanks for testing Ojaswin! Based on iostats from Stefan, I'm relatively
confident the spread between block groups is responsible for the
performance regression. Iostats show pretty clearly that the write
throughput is determined by the average write request size which is
directly related to the number of block groups we allocate from.

Your stats for patched kernel show that there are two block groups which
get big part of allocations (these are likely the target block groups) but
then remaining ~1/3 is spread a lot. I'm not yet sure why that is... I
guess I will fiddle some more with my test VM and try to reproduce these
allocation differences (on my test server the allocation pattern on patched
kernel is very similar with mb_optimize_scan=0/1).

> Also, one strange thing I'm seeing is that the perfs don't really show any
> particular function causing the regression, which is surprising considering
> mb_optimize_scan=1 almost takes 10 times more time.

Well, the time is not spent by CPU. We spend more time waiting for IO which
is not visible in perf profiles. You could plot something like offcpu flame
graphs, there the difference would be visible but I don't expect you'd see
anything more than just we spend more time in functions waiting for
writeback to complete.

> Lastly, FWIW I'm not able to replicate the regression when using loop devices
> and mb_optmize_scan=1 performs similar to mb-opmtimize_scan=0 (without patches
> as well). Not sure if this is related to the issue or just some side effect of
> using loop devices.

This is because eMMC devices seem to be very sensitive to IO pattern (and
write request size). For loop devices, we don't care about request size
much so that's why mb_optimize_scan makes no big difference. But can you
still see the difference in the allocation pattern with the loop device?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-08-31 07:38:17

by Ojaswin Mujoo

[permalink] [raw]
Subject: Re: [PATCH 0/2] ext4: Fix performance regression with mballoc

On Mon, Aug 29, 2022 at 11:04:34AM +0200, Jan Kara wrote:
> On Sat 27-08-22 20:06:00, Ojaswin Mujoo wrote:
> > On Fri, Aug 26, 2022 at 12:15:22PM +0200, Jan Kara wrote:
> > > Hi Stefan,
> > >
> > > On Thu 25-08-22 18:57:08, Stefan Wahren wrote:
> > > > > Perhaps if you just download the archive manually, call sync(1), and measure
> > > > > how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we
> > > > > can see whether plain untar is indeed making the difference or there's
> > > > > something else influencing the result as well (I have checked and
> > > > > rpi-update does a lot of other deleting & copying as the part of the
> > > > > update)? Thanks.
> > > >
> > > > mb_optimize_scan=0 -> almost 5 minutes
> > > >
> > > > mb_optimize_scan=1 -> almost 18 minutes
> > > >
> > > > https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec
> > >
> > > Thanks! So now the iostat data indeed looks substantially different.
> > >
> > > nooptimize optimize
> > > Total written 183.6 MB 190.5 MB
> > > Time (recorded) 283 s 1040 s
> > > Avg write request size 79 KB 41 KB
> > >
> > > So indeed with mb_optimize_scan=1 we do submit substantially smaller
> > > requests on average. So far I'm not sure why that is. Since Ojaswin can
> > > reproduce as well, let's see what he can see from block location info.
> > > Thanks again for help with debugging this and enjoy your vacation!
> > >
> >
> > Hi Jan and Stefan,
> >
> > Apologies for the delay, I was on leave yesterday and couldn't find time to get to this.
> >
> > So I was able to collect the block numbers using the method you suggested. I converted the
> > blocks numbers to BG numbers and plotted that data to visualze the allocation spread. You can
> > find them here:
> >
> > mb-opt=0, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-0-patched.png
> > mb-opt=1, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-patched.png
> > mb-opt=1, unpatched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-unpatched.png
> >
> > Observations:
> > * Before the patched mb_optimize_scan=1 allocations were way more spread out in
> > 40 different BGs.
> > * With the patch, we still allocate in 36 different BGs but majority happen in
> > just 1 or 2 BGs.
> > * With mb_optimize_scan=0, we only allocate in just 7 unique BGs, which could
> > explain why this is faster.
>
> Thanks for testing Ojaswin! Based on iostats from Stefan, I'm relatively
> confident the spread between block groups is responsible for the
> performance regression. Iostats show pretty clearly that the write
> throughput is determined by the average write request size which is
> directly related to the number of block groups we allocate from.
>
> Your stats for patched kernel show that there are two block groups which
> get big part of allocations (these are likely the target block groups) but
> then remaining ~1/3 is spread a lot. I'm not yet sure why that is... I
> guess I will fiddle some more with my test VM and try to reproduce these
> allocation differences (on my test server the allocation pattern on patched
> kernel is very similar with mb_optimize_scan=0/1).
>
> > Also, one strange thing I'm seeing is that the perfs don't really show any
> > particular function causing the regression, which is surprising considering
> > mb_optimize_scan=1 almost takes 10 times more time.
>
> Well, the time is not spent by CPU. We spend more time waiting for IO which
> is not visible in perf profiles. You could plot something like offcpu flame
> graphs, there the difference would be visible but I don't expect you'd see
> anything more than just we spend more time in functions waiting for
> writeback to complete.
Ahh I see, that makes sense.
>
> > Lastly, FWIW I'm not able to replicate the regression when using loop devices
> > and mb_optmize_scan=1 performs similar to mb-opmtimize_scan=0 (without patches
> > as well). Not sure if this is related to the issue or just some side effect of
> > using loop devices.
>
> This is because eMMC devices seem to be very sensitive to IO pattern (and
> write request size). For loop devices, we don't care about request size
> much so that's why mb_optimize_scan makes no big difference. But can you
> still see the difference in the allocation pattern with the loop device?
So i tested with loop devices and yes I do see a similar allocation
pattern althout the regression is not there.

Thanks,
Ojaswin

2022-09-04 10:04:05

by Stefan Wahren

[permalink] [raw]
Subject: Re: [PATCH 0/2] ext4: Fix performance regression with mballoc

Hi,

Am 27.08.22 um 16:36 schrieb Ojaswin Mujoo:
> On Fri, Aug 26, 2022 at 12:15:22PM +0200, Jan Kara wrote:
>> Hi Stefan,
>>
>> On Thu 25-08-22 18:57:08, Stefan Wahren wrote:
>>>> Perhaps if you just download the archive manually, call sync(1), and measure
>>>> how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we
>>>> can see whether plain untar is indeed making the difference or there's
>>>> something else influencing the result as well (I have checked and
>>>> rpi-update does a lot of other deleting & copying as the part of the
>>>> update)? Thanks.
>>> mb_optimize_scan=0 -> almost 5 minutes
>>>
>>> mb_optimize_scan=1 -> almost 18 minutes
>>>
>>> https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec
>> Thanks! So now the iostat data indeed looks substantially different.
>>
>> nooptimize optimize
>> Total written 183.6 MB 190.5 MB
>> Time (recorded) 283 s 1040 s
>> Avg write request size 79 KB 41 KB
>>
>> So indeed with mb_optimize_scan=1 we do submit substantially smaller
>> requests on average. So far I'm not sure why that is. Since Ojaswin can
>> reproduce as well, let's see what he can see from block location info.
>> Thanks again for help with debugging this and enjoy your vacation!
>>
> Hi Jan and Stefan,
>
> Apologies for the delay, I was on leave yesterday and couldn't find time to get to this.
>
> So I was able to collect the block numbers using the method you suggested. I converted the
> blocks numbers to BG numbers and plotted that data to visualze the allocation spread. You can
> find them here:
>
> mb-opt=0, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-0-patched.png
> mb-opt=1, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-patched.png
> mb-opt=1, unpatched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-unpatched.png
>
> Observations:
> * Before the patched mb_optimize_scan=1 allocations were way more spread out in
> 40 different BGs.
> * With the patch, we still allocate in 36 different BGs but majority happen in
> just 1 or 2 BGs.
> * With mb_optimize_scan=0, we only allocate in just 7 unique BGs, which could
> explain why this is faster.

thanks this is very helpful for me to understand. So it seems to me that
with disabled mb_optimized_scan we have a more sequential write behavior
and with enabled mb_optimized_scan a more random write behavior.

From my understanding writing small blocks at random addresses of the
sd card flash causes a lot of overhead, because the sd card controller
need to erase huge blocks (up to 1 MB) before it's able to program the
flash pages. This would explain why this series doesn't fix the
performance issue, the total amount of BGs is still much higher.

Is this new block allocation pattern a side effect of the optimization
or desired?

> Also, one strange thing I'm seeing is that the perfs don't really show any
> particular function causing the regression, which is surprising considering
> mb_optimize_scan=1 almost takes 10 times more time.
>
> All the perfs can be found here (raw files and perf report/diff --stdio ):
> https://github.com/OjaswinM/mbopt-bug/tree/master/perfs
>
> Lastly, FWIW I'm not able to replicate the regression when using loop devices
> and mb_optmize_scan=1 performs similar to mb-opmtimize_scan=0 (without patches
> as well). Not sure if this is related to the issue or just some side effect of
> using loop devices.
The regression actually happen in the sd card, but it's triggered
external by the different IO pattern. For a loop device there is no
difference between sequential and random write.
>
> Will post here if I have any updates on this.
>
> Regards,
> Ojaswin

2022-09-04 23:13:17

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 0/2] ext4: Fix performance regression with mballoc

On Sep 4, 2022, at 00:01, Stefan Wahren <[email protected]> wrote:
>
>> Am 27.08.22 um 16:36 schrieb Ojaswin Mujoo:
>>> On Fri, Aug 26, 2022 at 12:15:22PM +0200, Jan Kara wrote:
>>> Hi Stefan,
>>>
>>> On Thu 25-08-22 18:57:08, Stefan Wahren wrote:
>>>>> Perhaps if you just download the archive manually, call sync(1), and measure
>>>>> how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we
>>>>> can see whether plain untar is indeed making the difference or there's
>>>>> something else influencing the result as well (I have checked and
>>>>> rpi-update does a lot of other deleting & copying as the part of the
>>>>> update)? Thanks.
>>>> mb_optimize_scan=0 -> almost 5 minutes
>>>>
>>>> mb_optimize_scan=1 -> almost 18 minutes
>>>>
>>>> https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec
>>> Thanks! So now the iostat data indeed looks substantially different.
>>>
>>> nooptimize optimize
>>> Total written 183.6 MB 190.5 MB
>>> Time (recorded) 283 s 1040 s
>>> Avg write request size 79 KB 41 KB
>>>
>>> So indeed with mb_optimize_scan=1 we do submit substantially smaller
>>> requests on average. So far I'm not sure why that is. Since Ojaswin can
>>> reproduce as well, let's see what he can see from block location info.
>>> Thanks again for help with debugging this and enjoy your vacation!
>>>
>> Hi Jan and Stefan,
>>
>> Apologies for the delay, I was on leave yesterday and couldn't find time to get to this.
>>
>> So I was able to collect the block numbers using the method you suggested. I converted the
>> blocks numbers to BG numbers and plotted that data to visualze the allocation spread. You can
>> find them here:
>>
>> mb-opt=0, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-0-patched.png
>> mb-opt=1, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-patched.png
>> mb-opt=1, unpatched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-unpatched.png
>>
>> Observations:
>> * Before the patched mb_optimize_scan=1 allocations were way more spread out in
>> 40 different BGs.
>> * With the patch, we still allocate in 36 different BGs but majority happen in
>> just 1 or 2 BGs.
>> * With mb_optimize_scan=0, we only allocate in just 7 unique BGs, which could
>> explain why this is faster.
>
> thanks this is very helpful for me to understand. So it seems to me that with disabled mb_optimized_scan we have a more sequential write behavior and with enabled mb_optimized_scan a more random write behavior.
>
> From my understanding writing small blocks at random addresses of the sd card flash causes a lot of overhead, because the sd card controller need to erase huge blocks (up to 1 MB) before it's able to program the flash pages. This would explain why this series doesn't fix the performance issue, the total amount of BGs is still much higher.
>
> Is this new block allocation pattern a side effect of the optimization or desired?

The goal of the mb_optimized_scan is to avoid a large amount of linear scanning for
very large filesystems when there are many block groups that are full or fragmented.

It seems for empty filesystems the new list management is not very ideal. It probably
makes sense to have a hybrid, with some small amount of linear scanning (eg. a meta
block group worth), and then use the new list to find a new group if that doesn't find any
group with free space.

Cheers, Andreas

2022-09-05 10:26:34

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/2] ext4: Fix performance regression with mballoc

On Sun 04-09-22 12:32:59, Andreas Dilger wrote:
> On Sep 4, 2022, at 00:01, Stefan Wahren <[email protected]> wrote:
> >
> >> Am 27.08.22 um 16:36 schrieb Ojaswin Mujoo:
> >>> On Fri, Aug 26, 2022 at 12:15:22PM +0200, Jan Kara wrote:
> >>> Hi Stefan,
> >>>
> >>> On Thu 25-08-22 18:57:08, Stefan Wahren wrote:
> >>>>> Perhaps if you just download the archive manually, call sync(1), and measure
> >>>>> how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we
> >>>>> can see whether plain untar is indeed making the difference or there's
> >>>>> something else influencing the result as well (I have checked and
> >>>>> rpi-update does a lot of other deleting & copying as the part of the
> >>>>> update)? Thanks.
> >>>> mb_optimize_scan=0 -> almost 5 minutes
> >>>>
> >>>> mb_optimize_scan=1 -> almost 18 minutes
> >>>>
> >>>> https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec
> >>> Thanks! So now the iostat data indeed looks substantially different.
> >>>
> >>> nooptimize optimize
> >>> Total written 183.6 MB 190.5 MB
> >>> Time (recorded) 283 s 1040 s
> >>> Avg write request size 79 KB 41 KB
> >>>
> >>> So indeed with mb_optimize_scan=1 we do submit substantially smaller
> >>> requests on average. So far I'm not sure why that is. Since Ojaswin can
> >>> reproduce as well, let's see what he can see from block location info.
> >>> Thanks again for help with debugging this and enjoy your vacation!
> >>>
> >> Hi Jan and Stefan,
> >>
> >> Apologies for the delay, I was on leave yesterday and couldn't find time to get to this.
> >>
> >> So I was able to collect the block numbers using the method you suggested. I converted the
> >> blocks numbers to BG numbers and plotted that data to visualze the allocation spread. You can
> >> find them here:
> >>
> >> mb-opt=0, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-0-patched.png
> >> mb-opt=1, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-patched.png
> >> mb-opt=1, unpatched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-unpatched.png
> >>
> >> Observations:
> >> * Before the patched mb_optimize_scan=1 allocations were way more spread out in
> >> 40 different BGs.
> >> * With the patch, we still allocate in 36 different BGs but majority happen in
> >> just 1 or 2 BGs.
> >> * With mb_optimize_scan=0, we only allocate in just 7 unique BGs, which could
> >> explain why this is faster.
> >
> > thanks this is very helpful for me to understand. So it seems to me that with disabled mb_optimized_scan we have a more sequential write behavior and with enabled mb_optimized_scan a more random write behavior.
> >
> > From my understanding writing small blocks at random addresses of the sd card flash causes a lot of overhead, because the sd card controller need to erase huge blocks (up to 1 MB) before it's able to program the flash pages. This would explain why this series doesn't fix the performance issue, the total amount of BGs is still much higher.
> >
> > Is this new block allocation pattern a side effect of the optimization or desired?
>
> The goal of the mb_optimized_scan is to avoid a large amount of linear
> scanning for very large filesystems when there are many block groups that
> are full or fragmented.
>
> It seems for empty filesystems the new list management is not very ideal.

The filesystems here are actually about half full and not too fragmented.

> It probably makes sense to have a hybrid, with some small amount of
> linear scanning (eg. a meta block group worth), and then use the new list
> to find a new group if that doesn't find any group with free space.

There is a heuristic to scan a few block groups linearly before using the
data structures to decide about the next block group in current mballoc
code but it gets used only for rotational devices. I don't know about some
easy way how to detect other types of storage like eMMC cards that also
benefit from better data allocation locality.

I have come up with two more patches on top of my current attempt which
improve allocation locality and at least for the untar case causing issues
on eMMC they do get close to the mb_optimize_scan=0 locality. I want to
check whether the higher locality does not hurt performance for highly
parallel workloads though. Then I'll post them for review and discussion.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR