2022-08-16 10:27:46

by Jan Kara

Subject: Re: [Regression] ext4: changes to mb_optimize_scan cause issues on Raspberry Pi

Hi Stefan!

On Sat 06-08-22 11:50:28, Stefan Wahren wrote:
> Am 28.07.22 um 12:00 schrieb Jan Kara:
> > Hello!
> >
> > On Mon 18-07-22 15:29:47, Stefan Wahren wrote:
> > > I noticed that since Linux 5.18 (Linux 5.19-rc6 is still affected) I'm
> > > unable to run "rpi-update" without a massive performance regression on my
> > > Raspberry Pi 4 (multi_v7_defconfig + CONFIG_ARM_LPAE). Using Linux 5.17 this
> > > tool successfully downloads the latest firmware (> 100 MB) to my development
> > > micro SD card (Kingston 16 GB Industrial) with an ext4 filesystem within ~1
> > > min. The same scenario on Linux 5.18 shows the following symptoms:
> > Thanks for the report and the bisection!
> > > - the download takes an endlessly long time and in most cases is aborted by
> > > userspace because of the poor performance
> > > - massive system load during the download, even after the download has been
> > > aborted (heartbeat LED goes wild)
> > OK, is it that the CPU is busy or are we waiting on the storage card?
> > Observing top(1) for a while should be enough to get the idea. (Sorry, I'm
> > not very familiar with RPi so I'm not sure what the heartbeat LED shows.)
>
> My description wasn't precise. I mean the green ACT LED, which uses the LED
> heartbeat trigger:
>
> "This allows LEDs to be controlled by a CPU load average. The flash
> frequency is a hyperbolic function of the 1-minute load average."
>
> I'm not sure if it's CPU- or IO-driven load; here is the top output in the bad case:
>
> top - 08:44:17 up 43 min,  2 users,  load average: 5,02, 5,45, 5,17
> Tasks: 142 total,   1 running, 141 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0,4 us,  0,4 sy,  0,0 ni, 49,0 id, 50,2 wa,  0,0 hi, 0,0 si,  0,0
> st
> MiB Mem :   7941,7 total,   4563,1 free,    312,7 used,   3066,0 buff/cache
> MiB Swap:    100,0 total,    100,0 free,      0,0 used.   7359,6 avail Mem

OK, there's plenty of memory available, CPUs are mostly idle, the load is
likely created by tasks waiting for IO (which also contribute to load
despite not consuming CPU). Nothing too surprising here.

> > Can you run "iostat -x 1" while the download is running so that we can see
> > roughly how the IO pattern looks?
> >
> Here is the output during the download:
>
> Device            r/s     w/s     rkB/s     wkB/s   rrqm/s wrqm/s  %rrqm
> %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm  %util
> mmcblk1          0,00    2,00      0,00     36,00     0,00 0,00   0,00
> 0,00    0,00 23189,50  46,38     0,00    18,00 500,00 100,00
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0,25    0,00    0,00   49,62    0,00   50,13
>
> Device            r/s     w/s     rkB/s     wkB/s   rrqm/s wrqm/s  %rrqm
> %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm  %util
> mmcblk1          0,00    2,00      0,00     76,00     0,00 0,00   0,00
> 0,00    0,00 46208,50  92,42     0,00    38,00 500,00 100,00
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0,25    0,00    0,00   49,62    0,00   50,13
>
> Device            r/s     w/s     rkB/s     wkB/s   rrqm/s wrqm/s  %rrqm
> %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm  %util
> mmcblk1          0,00    3,00      0,00     76,00     0,00 0,00   0,00
> 0,00    0,00 48521,67 145,56     0,00    25,33 333,33 100,00
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0,25    0,00    0,00   49,62    0,00   50,13

So this is interesting. We can see the card is 100% busy. The IO submitted
to the card consists of small requests - 18-38 KB per request - and each
request takes 0.3-0.5s to complete. So the resulting throughput is horrible
- only tens of KB/s. We can also see that many IOs are queued for the
device in parallel (aqu-sz column). This does not look like the load I
would expect to be generated by the download of a large file from the web.
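As a quick sanity check, the "tens of KB/s" follow directly from the iostat columns above (numbers copied from the first sample):

```shell
# Throughput is writes-per-second times average write request size:
# w/s = 2,00 and wareq-sz = 18,00 KB in the first iostat sample.
w_per_s=2
wareq_kb=18
throughput_kb=$((w_per_s * wareq_kb))
echo "${throughput_kb} KB/s"   # prints "36 KB/s", matching the wkB/s column
```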

You have mentioned in previous emails that with dd(1) you can write a
couple of MB/s to this card, which is far more than these tens of KB/s. So
the file download must be doing something which really destroys the IO
pattern (and with mb_optimize_scan=0 ext4 happened to deal with it better
and generate a better IO pattern). Can you perhaps strace the process doing
the download (or perhaps strace -f the whole rpi-update process) so that we
can see what the load generated on the filesystem looks like? Thanks!
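For reference, one possible way to capture the trace Jan asks for (the output path and syscall filters are just an illustrative choice, not from the thread):

```shell
# Follow forked children (-f), since rpi-update spawns curl and tar;
# timestamp each syscall (-tt) and restrict logging to filesystem
# activity so the trace stays readable.
strace -f -tt -e trace=%file,write,fsync -o /tmp/rpi-update.strace \
    sudo rpi-update
```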

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


2022-08-16 11:54:11

by Stefan Wahren

Subject: Re: [Regression] ext4: changes to mb_optimize_scan cause issues on Raspberry Pi

Hi Jan,

Am 16.08.22 um 11:34 schrieb Jan Kara:
> Hi Stefan!
>
> On Sat 06-08-22 11:50:28, Stefan Wahren wrote:
>> Am 28.07.22 um 12:00 schrieb Jan Kara:
>>> Hello!
>>>
>>> On Mon 18-07-22 15:29:47, Stefan Wahren wrote:
>>>> I noticed that since Linux 5.18 (Linux 5.19-rc6 is still affected) I'm
>>>> unable to run "rpi-update" without a massive performance regression on my
>>>> Raspberry Pi 4 (multi_v7_defconfig + CONFIG_ARM_LPAE). Using Linux 5.17 this
>>>> tool successfully downloads the latest firmware (> 100 MB) to my development
>>>> micro SD card (Kingston 16 GB Industrial) with an ext4 filesystem within ~1
>>>> min. The same scenario on Linux 5.18 shows the following symptoms:
>>> Thanks for the report and the bisection!
>>>> - the download takes an endlessly long time and in most cases is aborted by
>>>> userspace because of the poor performance
>>>> - massive system load during the download, even after the download has been
>>>> aborted (heartbeat LED goes wild)
>>> OK, is it that the CPU is busy or are we waiting on the storage card?
>>> Observing top(1) for a while should be enough to get the idea. (Sorry, I'm
>>> not very familiar with RPi so I'm not sure what the heartbeat LED shows.)
>> My description wasn't precise. I mean the green ACT LED, which uses the LED
>> heartbeat trigger:
>>
>> "This allows LEDs to be controlled by a CPU load average. The flash
>> frequency is a hyperbolic function of the 1-minute load average."
>>
>> I'm not sure if it's CPU- or IO-driven load; here is the top output in the bad case:
>>
>> top - 08:44:17 up 43 min,  2 users,  load average: 5,02, 5,45, 5,17
>> Tasks: 142 total,   1 running, 141 sleeping,   0 stopped,   0 zombie
>> %Cpu(s):  0,4 us,  0,4 sy,  0,0 ni, 49,0 id, 50,2 wa,  0,0 hi, 0,0 si,  0,0
>> st
>> MiB Mem :   7941,7 total,   4563,1 free,    312,7 used,   3066,0 buff/cache
>> MiB Swap:    100,0 total,    100,0 free,      0,0 used.   7359,6 avail Mem
> OK, there's plenty of memory available, CPUs are mostly idle, the load is
> likely created by tasks waiting for IO (which also contribute to load
> despite not consuming CPU). Nothing too surprising here.
>
>>> Can you run "iostat -x 1" while the download is running so that we can see
>>> roughly how the IO pattern looks?
>>>
>> Here is the output during the download:
>>
>> Device            r/s     w/s     rkB/s     wkB/s   rrqm/s wrqm/s  %rrqm
>> %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm  %util
>> mmcblk1          0,00    2,00      0,00     36,00     0,00 0,00   0,00
>> 0,00    0,00 23189,50  46,38     0,00    18,00 500,00 100,00
>>
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>            0,25    0,00    0,00   49,62    0,00   50,13
>>
>> Device            r/s     w/s     rkB/s     wkB/s   rrqm/s wrqm/s  %rrqm
>> %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm  %util
>> mmcblk1          0,00    2,00      0,00     76,00     0,00 0,00   0,00
>> 0,00    0,00 46208,50  92,42     0,00    38,00 500,00 100,00
>>
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>            0,25    0,00    0,00   49,62    0,00   50,13
>>
>> Device            r/s     w/s     rkB/s     wkB/s   rrqm/s wrqm/s  %rrqm
>> %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm  %util
>> mmcblk1          0,00    3,00      0,00     76,00     0,00 0,00   0,00
>> 0,00    0,00 48521,67 145,56     0,00    25,33 333,33 100,00
>>
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>            0,25    0,00    0,00   49,62    0,00   50,13
> So this is interesting. We can see the card is 100% busy. The IO submitted
> to the card consists of small requests - 18-38 KB per request - and each
> request takes 0.3-0.5s to complete. So the resulting throughput is horrible
> - only tens of KB/s. We can also see that many IOs are queued for the
> device in parallel (aqu-sz column). This does not look like the load I
> would expect to be generated by the download of a large file from the web.
>
> You have mentioned in previous emails that with dd(1) you can write a
> couple of MB/s to this card, which is far more than these tens of KB/s. So
> the file download must be doing something which really destroys the IO
> pattern (and with mb_optimize_scan=0 ext4 happened to deal with it better
> and generate a better IO pattern). Can you perhaps strace the process doing
> the download (or perhaps strace -f the whole rpi-update process) so that we
> can see what the load generated on the filesystem looks like? Thanks!

I can do that. But maybe the source of rpi-update is more helpful?

https://github.com/raspberrypi/rpi-update/blob/master/rpi-update

>
> Honza

2022-08-16 20:53:19

by Stefan Wahren

Subject: Re: [Regression] ext4: changes to mb_optimize_scan cause issues on Raspberry Pi

Hi Jan,

Am 16.08.22 um 11:34 schrieb Jan Kara:
> Hi Stefan!
> So this is interesting. We can see the card is 100% busy. The IO submitted
> to the card consists of small requests - 18-38 KB per request - and each
> request takes 0.3-0.5s to complete. So the resulting throughput is horrible
> - only tens of KB/s. We can also see that many IOs are queued for the
> device in parallel (aqu-sz column). This does not look like the load I
> would expect to be generated by the download of a large file from the web.
>
> You have mentioned in previous emails that with dd(1) you can write a
> couple of MB/s to this card, which is far more than these tens of KB/s. So
> the file download must be doing something which really destroys the IO
> pattern (and with mb_optimize_scan=0 ext4 happened to deal with it better
> and generate a better IO pattern). Can you perhaps strace the process doing
> the download (or perhaps strace -f the whole rpi-update process) so that we
> can see what the load generated on the filesystem looks like? Thanks!

I haven't created the strace yet, but I looked at the source of
rpi-update. In the end, the download phase is a curl call that downloads a
tar archive and pipes it directly to tar.
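
A rough local stand-in for that curl-to-tar pipeline (file count and sizes are made up here, just to mimic unpacking many small files without the network) could look like:

```shell
# Emulate "curl ... | tar x": stream a tarball of many small files
# through a pipe and unpack it, much as rpi-update effectively does.
set -e
src=$(mktemp -d); dst=$(mktemp -d)
for i in $(seq 1 200); do
    head -c 4096 /dev/urandom > "$src/file$i"   # ~4 KB files, like much of the firmware tree
done
tar -C "$src" -cf - . | tar -C "$dst" -xf -     # writer side of the pipe stands in for curl
ls "$dst" | wc -l                               # expect 200 files unpacked
```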

You can find the content list of the tar file here:

https://raw.githubusercontent.com/lategoodbye/mb_optimize_scan_regress/main/rpi-firmware-tar-content-list.txt

Best regards

>
> Honza

2022-08-17 05:29:27

by Ojaswin Mujoo

Subject: Re: [Regression] ext4: changes to mb_optimize_scan cause issues on Raspberry Pi

On Tue, Aug 16, 2022 at 10:45:48PM +0200, Stefan Wahren wrote:
> Hi Jan,
>
> Am 16.08.22 um 11:34 schrieb Jan Kara:
> > Hi Stefan!
> > So this is interesting. We can see the card is 100% busy. The IO submitted
> > to the card consists of small requests - 18-38 KB per request - and each
> > request takes 0.3-0.5s to complete. So the resulting throughput is horrible
> > - only tens of KB/s. We can also see that many IOs are queued for the
> > device in parallel (aqu-sz column). This does not look like the load I
> > would expect to be generated by the download of a large file from the web.
> >
> > You have mentioned in previous emails that with dd(1) you can write a
> > couple of MB/s to this card, which is far more than these tens of KB/s. So
> > the file download must be doing something which really destroys the IO
> > pattern (and with mb_optimize_scan=0 ext4 happened to deal with it better
> > and generate a better IO pattern). Can you perhaps strace the process doing
> > the download (or perhaps strace -f the whole rpi-update process) so that we
> > can see what the load generated on the filesystem looks like? Thanks!
>
> I haven't created the strace yet, but I looked at the source of rpi-update.
> In the end, the download phase is a curl call that downloads a tar archive
> and pipes it directly to tar.
>
> You can find the content list of the tar file here:
>
> https://raw.githubusercontent.com/lategoodbye/mb_optimize_scan_regress/main/rpi-firmware-tar-content-list.txt
>
> Best regards
>
> >
> > Honza
Hi Jan and Stefan,

I did some analysis of this on my Raspberry Pi 3B+. Not sure of the root
cause yet but here is what I observed:

1. So I noticed that the download itself is not causing any issues in my
case; it is the download piped to tar that causes the degradation.
With the pipe to tar, mb_optimize_scan=1 takes around 7 mins whereas
mb_optimize_scan=0 takes 1 min.

2. I tried to replicate this performance degradation by running untar
on an x86 machine but I was not able to reproduce it. It is
reproducible pretty consistently on my Raspberry Pi though (w/ an 8GB
memory card).

3. I did analyze the resulting mb_optimize_scan=0 vs mb_optimize_scan=1
filesystems and it seems like the allocated blocks are more spread out
in the mb_optimize_scan=1 case, but I am not yet sure if that is the
issue.
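
One way to eyeball that spread from userspace (paths are hypothetical; filefrag is part of e2fsprogs and uses the FIEMAP ioctl):

```shell
# Show the physical extents of one freshly unpacked file; widely
# scattered starting block numbers across many small files indicate
# the allocator spread them out.
filefrag -v /mnt/sd/lib/firmware/brcm/brcmfmac43455-sdio.bin

# Coarser view: extent counts per file across the whole tree. Many
# 1-extent files is expected for small files; the interesting part is
# how far apart those single extents start (visible in -v output).
find /mnt/sd/lib/firmware -type f -exec filefrag {} + | sort -t: -k2 -n | tail
```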

Will update here if I notice anything else.

Regards,
Ojaswin

2022-08-17 11:00:31

by Jan Kara

Subject: Re: [Regression] ext4: changes to mb_optimize_scan cause issues on Raspberry Pi

Hi Stefan!

On Tue 16-08-22 22:45:48, Stefan Wahren wrote:
> Am 16.08.22 um 11:34 schrieb Jan Kara:
> > Hi Stefan!
> > So this is interesting. We can see the card is 100% busy. The IO submitted
> > to the card consists of small requests - 18-38 KB per request - and each
> > request takes 0.3-0.5s to complete. So the resulting throughput is horrible
> > - only tens of KB/s. We can also see that many IOs are queued for the
> > device in parallel (aqu-sz column). This does not look like the load I
> > would expect to be generated by the download of a large file from the web.
> >
> > You have mentioned in previous emails that with dd(1) you can write a
> > couple of MB/s to this card, which is far more than these tens of KB/s. So
> > the file download must be doing something which really destroys the IO
> > pattern (and with mb_optimize_scan=0 ext4 happened to deal with it better
> > and generate a better IO pattern). Can you perhaps strace the process doing
> > the download (or perhaps strace -f the whole rpi-update process) so that we
> > can see what the load generated on the filesystem looks like? Thanks!
>
> I haven't created the strace yet, but I looked at the source of rpi-update.
> In the end, the download phase is a curl call that downloads a tar archive
> and pipes it directly to tar.
>
> You can find the content list of the tar file here:
>
> https://raw.githubusercontent.com/lategoodbye/mb_optimize_scan_regress/main/rpi-firmware-tar-content-list.txt

Thanks for the details! This is indeed even better. Looking at the tar
archive I can see it consists of a lot of small files, a big part of them
even below 10k. So this very much matches the workload I was examining
with reaim, where I saw a regression (although only ~8%) even on a normal
rotating drive on an x86 machine. In that case I have pretty much confirmed
that the problem is due to mb_optimize_scan=1 spreading small allocated
files out more, which is likely also harmful for the SD card because it
requires touching more erase blocks.

Thanks for the help with debugging this. I will implement some of the
heuristics we discussed with other ext4 developers to avoid this behavior
and will send you a patch for testing.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR