> On Apr 14, 2023, at 10:00 PM, Jonathan Neuschäfer <[email protected]> wrote:
>
> On Thu, Apr 13, 2023 at 01:13:21PM +0200, Arnd Bergmann wrote:
>> On Wed, Apr 12, 2023, at 23:33, Arnd Bergmann wrote:
>>> On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
>>>> This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
>>>> Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
>>>>
>>>> - LZO: 7.2 MiB, 6 seconds
>>>> - ZSTD: 5.6 MiB, 60 seconds
>>>
>>> That seems unexpected, as the usual numbers say it's about 25%
>>> slower than LZO. Do you have an idea why it is so much slower
>>> here? How long does it take to decompress the
>>> generated arch/arm/boot/Image file in user space on the same
>>> hardware using lzop and zstd?
>>
>> I looked through this a bit more and found two interesting points:
>>
>> - zstd uses a lot more unaligned loads and stores while
>> decompressing. On armv5 those turn into individual byte
>> accesses, while the others can likely use word-aligned
>> accesses. This could make a huge difference if caches are
>> disabled during the decompression.
>>
>> - The sliding window on zstd is much larger, with the kernel
>> using an 8MB window (zstd=23), compared to the normal 32kb
>> for deflate (couldn't find the default for lzo), so on
>> machines with no L2 cache, it is much more likely to thrash the
>> small L1 dcaches used on most arm9 cores.
>>
>> Arnd
>
> Makes sense.
>
> For ZSTD as used in kernel decompression (the zstd22 configuration), the
> window is even bigger, 128 MiB. (AFAIU)
Sorry, I’m a bit late to the party, I wasn’t getting LKML email for some time...
But this is entirely configurable; you can switch compression configurations
at any time. If you believe the window size is what is causing the speed
regression, you could tell zstd to use e.g. a 256 KiB window like this:

zstd -19 --zstd=wlog=18

This keeps the same search strength during compression, but limits the
decoder's memory usage.
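A quick way to see the effect, sketched with a stand-in payload (the real
input would be arch/arm/boot/Image); both commands use the same search
strength, only the window differs:

```shell
# Stand-in for the kernel image; any large, compressible file works.
seq 1 200000 > Image

# Level-19 default window (wlog=23, i.e. 8 MiB) vs. a 256 KiB window.
zstd -q -f -19 --zstd=wlog=23 -o Image.zst.w23 Image
zstd -q -f -19 --zstd=wlog=18 -o Image.zst.w18 Image

# Compare compressed sizes; the decoder's window buffer (and thus its
# cache working set) follows wlog, not the compression level.
ls -l Image.zst.w23 Image.zst.w18
```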
I will also try to get this patchset working on my machine, and try to debug.
A 10x slowdown is not expected, and we see much better speed in ARM
userspace. I suspect it has something to do with the preboot environment.
E.g. when implementing x86-64 zstd kernel decompression, I noticed that
memcpy(dst, src, 16) wasn’t getting inlined properly, causing a massive performance
penalty.
Best,
Nick Terrell
> Thanks
>
> Jonathan
On Thu, Oct 12, 2023 at 10:33:23PM +0000, Nick Terrell wrote:
> > On Apr 14, 2023, at 10:00 PM, Jonathan Neuschäfer <[email protected]> wrote:
> > On Thu, Apr 13, 2023 at 01:13:21PM +0200, Arnd Bergmann wrote:
> >> On Wed, Apr 12, 2023, at 23:33, Arnd Bergmann wrote:
> >>> On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
> >>>> This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
> >>>> Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
> >>>>
> >>>> - LZO: 7.2 MiB, 6 seconds
> >>>> - ZSTD: 5.6 MiB, 60 seconds
[...]
> > For ZSTD as used in kernel decompression (the zstd22 configuration), the
> > window is even bigger, 128 MiB. (AFAIU)
>
> Sorry, I’m a bit late to the party, I wasn’t getting LKML email for some time...
>
> But this is entirely configurable; you can switch compression configurations
> at any time. If you believe the window size is what is causing the speed
> regression, you could tell zstd to use e.g. a 256 KiB window like this:
>
> zstd -19 --zstd=wlog=18
>
> This will keep the same algorithm search strength, but limit the decoder memory
> usage.
Noted.
> I will also try to get this patchset working on my machine, and try to debug.
> A 10x slowdown is not expected, and we see much better speed in ARM
> userspace. I suspect it has something to do with the preboot environment.
> E.g. when implementing x86-64 zstd kernel decompression, I noticed that
> memcpy(dst, src, 16) wasn’t getting inlined properly, causing a massive performance
> penalty.
In the meantime I've seen 8s for ZSTD vs. 2s for other algorithms, on
only mildly less ancient hardware (Hi3518A, another ARM9 SoC), so I
think the main culprit here was particularly bad luck in my choice of
test hardware.
The inlining issues are a good point, noted for the next time I work on this.
Thanks,
Jonathan
> On Oct 12, 2023, at 6:27 PM, J. Neuschäfer <[email protected]> wrote:
>
> On Thu, Oct 12, 2023 at 10:33:23PM +0000, Nick Terrell wrote:
>>> On Apr 14, 2023, at 10:00 PM, Jonathan Neuschäfer <[email protected]> wrote:
>>> On Thu, Apr 13, 2023 at 01:13:21PM +0200, Arnd Bergmann wrote:
>>>> On Wed, Apr 12, 2023, at 23:33, Arnd Bergmann wrote:
>>>>> On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
>>>>>> This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
>>>>>> Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
>>>>>>
>>>>>> - LZO: 7.2 MiB, 6 seconds
>>>>>> - ZSTD: 5.6 MiB, 60 seconds
> [...]
>>> For ZSTD as used in kernel decompression (the zstd22 configuration), the
>>> window is even bigger, 128 MiB. (AFAIU)
>>
>> Sorry, I’m a bit late to the party, I wasn’t getting LKML email for some time...
>>
>> But this is entirely configurable; you can switch compression configurations
>> at any time. If you believe the window size is what is causing the speed
>> regression, you could tell zstd to use e.g. a 256 KiB window like this:
>>
>> zstd -19 --zstd=wlog=18
>>
>> This will keep the same algorithm search strength, but limit the decoder memory
>> usage.
>
> Noted.
>
>> I will also try to get this patchset working on my machine, and try to debug.
>> A 10x slowdown is not expected, and we see much better speed in ARM
>> userspace. I suspect it has something to do with the preboot environment.
>> E.g. when implementing x86-64 zstd kernel decompression, I noticed that
>> memcpy(dst, src, 16) wasn’t getting inlined properly, causing a massive performance
>> penalty.
>
> In the meantime I've seen 8s for ZSTD vs. 2s for other algorithms, on
> only mildly less ancient hardware (Hi3518A, another ARM9 SoC), so I
> think the main culprit here was particularly bad luck in my choice of
> test hardware.
>
> The inlining issues are a good point, noted for the next time I work on this.
I went out and bought a Raspberry Pi 4 to test on. I’ve done some crude measurements
and see that zstd kernel decompression is just slightly slower than gzip kernel
decompression, and about 2x slower than lzo. In userspace decompression of the same
file (a manually compressed kernel image) I see that zstd decompression is significantly
faster than gzip. So it is definitely something about the preboot environment, or
how the code is compiled for the preboot environment, that is causing the issue.
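The userspace comparison was roughly of this shape (a sketch with a
stand-in payload rather than the actual kernel image):

```shell
# Stand-in payload; the real test used a manually compressed kernel image.
seq 1 1000000 > Image
gzip -9 -k -f Image           # -> Image.gz
zstd -q -19 -k -f Image       # -> Image.zst

# Decompress each and compare wall-clock time.
time gunzip -k -f Image.gz    # gzip decompression
time zstd -d -q -f Image.zst  # zstd decompression
```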
My next step is to set up qemu on my Pi to try to get some perf measurements of the
decompression. One thing I’ve really been struggling with, and what thwarted my last
attempts at adding ARM zstd kernel decompression, was getting preboot logs printed.
I’ve figured out I need CONFIG_DEBUG_LL=y, but I’ve yet to actually get any logs.
And I can’t figure out how to get it working in qemu. I haven’t tried qemu on an ARM
host with kvm, but that’s the next thing I will try.
Do you happen to have any advice about how to get preboot logs in qemu? Is it
possible only on an ARM host, or would it also be possible on an x86-64 host?
Thanks,
Nick Terrell
> Thanks,
> Jonathan
On Fri, Oct 20, 2023 at 06:53:40PM +0000, Nick Terrell wrote:
> > On Oct 12, 2023, at 6:27 PM, J. Neuschäfer <[email protected]> wrote:
[...]
> > In the meantime I've seen 8s for ZSTD vs. 2s for other algorithms, on
> > only mildly less ancient hardware (Hi3518A, another ARM9 SoC), so I
> > think the main culprit here was particularly bad luck in my choice of
> > test hardware.
> >
> > The inlining issues are a good point, noted for the next time I work on this.
>
> I went out and bought a Raspberry Pi 4 to test on. I’ve done some crude measurements
> and see that zstd kernel decompression is just slightly slower than gzip kernel
> decompression, and about 2x slower than lzo. In userspace decompression of the same
> file (a manually compressed kernel image) I see that zstd decompression is significantly
> faster than gzip. So it is definitely something about the preboot environment, or
> how the code is compiled for the preboot environment, that is causing the issue.
>
> My next step is to set up qemu on my Pi to try to get some perf measurements of the
> decompression. One thing I’ve really been struggling with, and what thwarted my last
> attempts at adding ARM zstd kernel decompression, was getting preboot logs printed.
Interesting, please keep me updated if you find something out :)
> I’ve figured out I need CONFIG_DEBUG_LL=y, but I’ve yet to actually get any logs.
> And I can’t figure out how to get it working in qemu. I haven’t tried qemu on an ARM
> host with kvm, but that’s the next thing I will try.
>
> Do you happen to have any advice about how to get preboot logs in qemu? Is it
> possible only on an ARM host, or would it also be possible on an x86-64 host?
I have a patch for that, although I've only used it on real hardware,
AFAIR:
https://github.com/neuschaefer/linux/commit/f8542094e36652f2c086c76bf20584330aa27711
Assuming you use ARCH_MULTIPLATFORM, this should help once you enable
CONFIG_EXPERT.
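For reference, the config fragment ends up looking something like this
(the UART entries are board-specific; the PL011 addresses below are
placeholders matching qemu's virt machine, not any particular board):

```
CONFIG_EXPERT=y
CONFIG_DEBUG_LL=y
# Board-specific low-level UART; PL011 ("PL01X") on many ARM platforms:
CONFIG_DEBUG_UART_PL01X=y
CONFIG_DEBUG_UART_PHYS=0x09000000
CONFIG_DEBUG_UART_VIRT=0xf9000000
CONFIG_EARLY_PRINTK=y
```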
Thanks