2020-08-03 19:44:05

by Nick Terrell

Subject: [PATCH] lz4: Fix kernel decompression speed

From: Nick Terrell <[email protected]>

This patch replaces all memcpy() calls with LZ4_memcpy() which calls
__builtin_memcpy() so the compiler can inline it.

LZ4 relies heavily on memcpy() with a constant size being inlined. In
x86 and i386 pre-boot environments memcpy() cannot be inlined because
memcpy() doesn't get defined as __builtin_memcpy().
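
For illustration, the mechanism boils down to the wrapper added to
lz4defs.h below: with a compile-time constant size, the builtin form can
be expanded inline instead of remaining an out-of-line call. A minimal
sketch (the helper function is made up for the example, not part of the
patch):

  #define LZ4_memcpy(dst, src, size) __builtin_memcpy(dst, src, size)

  /* A constant 16-byte copy can be emitted as two 8-byte moves when
   * inlined, instead of a call into the pre-boot memcpy(). */
  static void copy16(unsigned char *op, const unsigned char *ip)
  {
          LZ4_memcpy(op, ip, 16);
  }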

An equivalent patch has been applied upstream so that the next import
won't lose this change [1].

I've measured the kernel decompression speed using QEMU before and after
this patch for the x86_64 and i386 architectures. The speed-up is about
10x as shown below.

Code    Arch     Kernel Size    Time     Speed
v5.8    x86_64   11504832 B     148 ms    79 MB/s
patch   x86_64   11503872 B      13 ms   885 MB/s
v5.8    i386      9621216 B      91 ms   106 MB/s
patch   i386      9620224 B      10 ms   962 MB/s

I also measured the time to decompress the initramfs on x86_64, i386,
and arm. All three show the same decompression speed before and after,
as expected.

[1] https://github.com/lz4/lz4/pull/890

Signed-off-by: Nick Terrell <[email protected]>
Cc: Yann Collet <[email protected]>
Cc: Gao Xiang <[email protected]>
Cc: Sven Schmidt <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Linus Torvalds <[email protected]>
---
lib/lz4/lz4_compress.c | 4 ++--
lib/lz4/lz4_decompress.c | 18 +++++++++---------
lib/lz4/lz4defs.h | 10 ++++++++++
lib/lz4/lz4hc_compress.c | 2 +-
4 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/lib/lz4/lz4_compress.c b/lib/lz4/lz4_compress.c
index cc7b6d4cc7c7..90bb67994688 100644
--- a/lib/lz4/lz4_compress.c
+++ b/lib/lz4/lz4_compress.c
@@ -446,7 +446,7 @@ static FORCE_INLINE int LZ4_compress_generic(
*op++ = (BYTE)(lastRun << ML_BITS);
}

- memcpy(op, anchor, lastRun);
+ LZ4_memcpy(op, anchor, lastRun);

op += lastRun;
}
@@ -708,7 +708,7 @@ static int LZ4_compress_destSize_generic(
} else {
*op++ = (BYTE)(lastRunSize<<ML_BITS);
}
- memcpy(op, anchor, lastRunSize);
+ LZ4_memcpy(op, anchor, lastRunSize);
op += lastRunSize;
}

diff --git a/lib/lz4/lz4_decompress.c b/lib/lz4/lz4_decompress.c
index 5371dab6b481..00cb0d0b73e1 100644
--- a/lib/lz4/lz4_decompress.c
+++ b/lib/lz4/lz4_decompress.c
@@ -153,7 +153,7 @@ static FORCE_INLINE int LZ4_decompress_generic(
&& likely((endOnInput ? ip < shortiend : 1) &
(op <= shortoend))) {
/* Copy the literals */
- memcpy(op, ip, endOnInput ? 16 : 8);
+ LZ4_memcpy(op, ip, endOnInput ? 16 : 8);
op += length; ip += length;

/*
@@ -172,9 +172,9 @@ static FORCE_INLINE int LZ4_decompress_generic(
(offset >= 8) &&
(dict == withPrefix64k || match >= lowPrefix)) {
/* Copy the match. */
- memcpy(op + 0, match + 0, 8);
- memcpy(op + 8, match + 8, 8);
- memcpy(op + 16, match + 16, 2);
+ LZ4_memcpy(op + 0, match + 0, 8);
+ LZ4_memcpy(op + 8, match + 8, 8);
+ LZ4_memcpy(op + 16, match + 16, 2);
op += length + MINMATCH;
/* Both stages worked, load the next token. */
continue;
@@ -263,7 +263,7 @@ static FORCE_INLINE int LZ4_decompress_generic(
}
}

- memcpy(op, ip, length);
+ LZ4_memcpy(op, ip, length);
ip += length;
op += length;

@@ -350,7 +350,7 @@ static FORCE_INLINE int LZ4_decompress_generic(
size_t const copySize = (size_t)(lowPrefix - match);
size_t const restSize = length - copySize;

- memcpy(op, dictEnd - copySize, copySize);
+ LZ4_memcpy(op, dictEnd - copySize, copySize);
op += copySize;
if (restSize > (size_t)(op - lowPrefix)) {
/* overlap copy */
@@ -360,7 +360,7 @@ static FORCE_INLINE int LZ4_decompress_generic(
while (op < endOfMatch)
*op++ = *copyFrom++;
} else {
- memcpy(op, lowPrefix, restSize);
+ LZ4_memcpy(op, lowPrefix, restSize);
op += restSize;
}
}
@@ -386,7 +386,7 @@ static FORCE_INLINE int LZ4_decompress_generic(
while (op < copyEnd)
*op++ = *match++;
} else {
- memcpy(op, match, mlen);
+ LZ4_memcpy(op, match, mlen);
}
op = copyEnd;
if (op == oend)
@@ -400,7 +400,7 @@ static FORCE_INLINE int LZ4_decompress_generic(
op[2] = match[2];
op[3] = match[3];
match += inc32table[offset];
- memcpy(op + 4, match, 4);
+ LZ4_memcpy(op + 4, match, 4);
match -= dec64table[offset];
} else {
LZ4_copy8(op, match);
diff --git a/lib/lz4/lz4defs.h b/lib/lz4/lz4defs.h
index 1a7fa9d9170f..c91dd96ef629 100644
--- a/lib/lz4/lz4defs.h
+++ b/lib/lz4/lz4defs.h
@@ -137,6 +137,16 @@ static FORCE_INLINE void LZ4_writeLE16(void *memPtr, U16 value)
return put_unaligned_le16(value, memPtr);
}

+/*
+ * LZ4 relies on memcpy with a constant size being inlined. In freestanding
+ * environments, the compiler can't assume the implementation of memcpy() is
+ * standard compliant, so it can't apply its specialized memcpy() inlining
+ * logic. When possible, use __builtin_memcpy() to tell the compiler to analyze
+ * memcpy() as-if it were standard compliant, so it can inline it in freestanding
+ * environments. This is needed when decompressing the Linux Kernel, for example.
+ */
+#define LZ4_memcpy(dst, src, size) __builtin_memcpy(dst, src, size)
+
static FORCE_INLINE void LZ4_copy8(void *dst, const void *src)
{
#if LZ4_ARCH64
diff --git a/lib/lz4/lz4hc_compress.c b/lib/lz4/lz4hc_compress.c
index 1b61d874e337..e7ac8694b797 100644
--- a/lib/lz4/lz4hc_compress.c
+++ b/lib/lz4/lz4hc_compress.c
@@ -570,7 +570,7 @@ static int LZ4HC_compress_generic(
*op++ = (BYTE) lastRun;
} else
*op++ = (BYTE)(lastRun<<ML_BITS);
- memcpy(op, anchor, iend - anchor);
+ LZ4_memcpy(op, anchor, iend - anchor);
op += iend - anchor;
}

--
2.28.0


2020-08-03 21:58:33

by Arvind Sankar

Subject: Re: [PATCH] lz4: Fix kernel decompression speed

On Mon, Aug 03, 2020 at 12:40:22PM -0700, Nick Terrell wrote:
> From: Nick Terrell <[email protected]>
>
> This patch replaces all memcpy() calls with LZ4_memcpy() which calls
> __builtin_memcpy() so the compiler can inline it.
>
> LZ4 relies heavily on memcpy() with a constant size being inlined. In
> x86 and i386 pre-boot environments memcpy() cannot be inlined because
> memcpy() doesn't get defined as __builtin_memcpy().
>
> An equivalent patch has been applied upstream so that the next import
> won't lose this change [1].
>
> I've measured the kernel decompression speed using QEMU before and after
> this patch for the x86_64 and i386 architectures. The speed-up is about
> 10x as shown below.
>
> Code Arch Kernel Size Time Speed
> v5.8 x86_64 11504832 B 148 ms 79 MB/s
> patch x86_64 11503872 B 13 ms 885 MB/s
> v5.8 i386 9621216 B 91 ms 106 MB/s
> patch i386 9620224 B 10 ms 962 MB/s
>
> I also measured the time to decompress the initramfs on x86_64, i386,
> and arm. All three show the same decompression speed before and after,
> as expected.
>
> [1] https://github.com/lz4/lz4/pull/890
>

Hi Nick, would you be able to test the below patch's performance to
verify it gives the same speedup? It removes the #undef in misc.c which
causes the decompressors to not use the builtin version. It should be
equivalent to yours except for applying it to all the decompressors.

Thanks.

From 10f8d939fc367e3127e2d72ba099678debcae422 Mon Sep 17 00:00:00 2001
From: Arvind Sankar <[email protected]>
Date: Mon, 3 Aug 2020 17:07:37 -0400
Subject: [PATCH] x86/boot/compressed: Use builtin mem functions for decompressor

Since commits

c041b5ad8640 ("x86, boot: Create a separate string.h file to provide standard string functions")
fb4cac573ef6 ("x86, boot: Move memcmp() into string.h and string.c")

the decompressor stub has been using the compiler's builtin memcpy,
memset and memcmp functions, _except_ where it would likely have the
largest impact, in the decompression code itself.

Remove the #undef's of memcpy and memset in misc.c so that the
decompressor code also uses the compiler builtins.

The rationale given in the comment doesn't really apply: just because
some functions use the out-of-line version is no reason to not use the
builtin version in the rest.

Replace the comment with an explanation of why memzero and memmove are
being #define'd.

Drop the suggestion to #undef in boot/string.h as well: the out-of-line
versions are not really optimized versions, they're generic code that's
good enough for the preboot environment. The compiler will likely
generate better code for constant-size memcpy/memset/memcmp if it is
allowed to.
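
For illustration only (not part of the patch): with the #undef's gone, a
constant-size call in the decompressor code reaches the compiler builtin
through the existing macros and can be expanded inline. The buffer name
below is made up:

  /* boot/string.h:  #define memset(d,c,l) __builtin_memset(d,c,l) */
  /* misc.c:         #define memzero(s, n) memset((s), 0, (n))     */

  memzero(window, 32);    /* expands to __builtin_memset(window, 0, 32)
                           * and compiles to a few stores, not a call */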

Signed-off-by: Arvind Sankar <[email protected]>
---
arch/x86/boot/compressed/misc.c | 7 ++-----
arch/x86/boot/string.h | 5 +----
2 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 9652d5c2afda..0c74a6e526b6 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -30,12 +30,9 @@
#define STATIC static

/*
- * Use normal definitions of mem*() from string.c. There are already
- * included header files which expect a definition of memset() and by
- * the time we define memset macro, it is too late.
+ * Provide definitions of memzero and memmove as some of the decompressors will
+ * try to define their own functions if these are not defined as macros.
*/
-#undef memcpy
-#undef memset
#define memzero(s, n) memset((s), 0, (n))
#define memmove memmove

diff --git a/arch/x86/boot/string.h b/arch/x86/boot/string.h
index 995f7b7ad512..a232da487cd2 100644
--- a/arch/x86/boot/string.h
+++ b/arch/x86/boot/string.h
@@ -11,10 +11,7 @@ void *memcpy(void *dst, const void *src, size_t len);
void *memset(void *dst, int c, size_t len);
int memcmp(const void *s1, const void *s2, size_t len);

-/*
- * Access builtin version by default. If one needs to use optimized version,
- * do "undef memcpy" in .c file and link against right string.c
- */
+/* Access builtin version by default. */
#define memcpy(d,s,l) __builtin_memcpy(d,s,l)
#define memset(d,c,l) __builtin_memset(d,c,l)
#define memcmp __builtin_memcmp
--
2.26.2

2020-08-03 22:59:20

by Nick Terrell

Subject: Re: [PATCH] lz4: Fix kernel decompression speed



> On Aug 3, 2020, at 2:57 PM, Arvind Sankar <[email protected]> wrote:
>
> On Mon, Aug 03, 2020 at 12:40:22PM -0700, Nick Terrell wrote:
>> From: Nick Terrell <[email protected]>
>>
>> This patch replaces all memcpy() calls with LZ4_memcpy() which calls
>> __builtin_memcpy() so the compiler can inline it.
>>
>> LZ4 relies heavily on memcpy() with a constant size being inlined. In
>> x86 and i386 pre-boot environments memcpy() cannot be inlined because
>> memcpy() doesn't get defined as __builtin_memcpy().
>>
>> An equivalent patch has been applied upstream so that the next import
>> won't lose this change [1].
>>
>> I've measured the kernel decompression speed using QEMU before and after
>> this patch for the x86_64 and i386 architectures. The speed-up is about
>> 10x as shown below.
>>
>> Code Arch Kernel Size Time Speed
>> v5.8 x86_64 11504832 B 148 ms 79 MB/s
>> patch x86_64 11503872 B 13 ms 885 MB/s
>> v5.8 i386 9621216 B 91 ms 106 MB/s
>> patch i386 9620224 B 10 ms 962 MB/s
>>
>> I also measured the time to decompress the initramfs on x86_64, i386,
>> and arm. All three show the same decompression speed before and after,
>> as expected.
>>
>> [1] https://github.com/lz4/lz4/pull/890
>>
>
> Hi Nick, would you be able to test the below patch's performance to
> verify it gives the same speedup? It removes the #undef in misc.c which
> causes the decompressors to not use the builtin version. It should be
> equivalent to yours except for applying it to all the decompressors.
>
> Thanks.

I will measure it. I would expect it to provide the same speed up. It would be great to fix
the problem for x86/i386 in general.

But, I believe that this is also a problem for ARM, though I have a hard time measuring
because I can’t get pre-boot print statements in QEMU. I will attempt to take a look at the
assembly, because I’m fairly certain that memcpy() isn’t inlined in master.

Even if we fix all the architectures, I would still like to merge the LZ4 patch. It seems like it
is pretty easy to merge a patch that is a boot speed regression, because people aren’t
actively measuring it. So I prefer a layered defense.

Additionally, this is following upstream. At some point LZ4 will be imported from upstream
and get an equivalent of this patch https://github.com/lz4/lz4/pull/890.

Best,
Nick

2020-08-04 01:59:15

by Arvind Sankar

Subject: Re: [PATCH] lz4: Fix kernel decompression speed

On Mon, Aug 03, 2020 at 10:55:01PM +0000, Nick Terrell wrote:
>
>
> > On Aug 3, 2020, at 2:57 PM, Arvind Sankar <[email protected]> wrote:
> >
> > On Mon, Aug 03, 2020 at 12:40:22PM -0700, Nick Terrell wrote:
> >> From: Nick Terrell <[email protected]>
> >>
> >> This patch replaces all memcpy() calls with LZ4_memcpy() which calls
> >> __builtin_memcpy() so the compiler can inline it.
> >>
> >> LZ4 relies heavily on memcpy() with a constant size being inlined. In
> >> x86 and i386 pre-boot environments memcpy() cannot be inlined because
> >> memcpy() doesn't get defined as __builtin_memcpy().
> >>
> >> An equivalent patch has been applied upstream so that the next import
> >> won't lose this change [1].
> >>
> >> I've measured the kernel decompression speed using QEMU before and after
> >> this patch for the x86_64 and i386 architectures. The speed-up is about
> >> 10x as shown below.
> >>
> >> Code Arch Kernel Size Time Speed
> >> v5.8 x86_64 11504832 B 148 ms 79 MB/s
> >> patch x86_64 11503872 B 13 ms 885 MB/s
> >> v5.8 i386 9621216 B 91 ms 106 MB/s
> >> patch i386 9620224 B 10 ms 962 MB/s
> >>
> >> I also measured the time to decompress the initramfs on x86_64, i386,
> >> and arm. All three show the same decompression speed before and after,
> >> as expected.
> >>
> >> [1] https://github.com/lz4/lz4/pull/890
> >>
> >
> > Hi Nick, would you be able to test the below patch's performance to
> > verify it gives the same speedup? It removes the #undef in misc.c which
> > causes the decompressors to not use the builtin version. It should be
> > equivalent to yours except for applying it to all the decompressors.
> >
> > Thanks.
>
> I will measure it. I would expect it to provide the same speed up. It would be great to fix
> the problem for x86/i386 in general.

Thanks. I tried using RDTSC to get some timings under QEMU, and I get
similar speedup as you have for LZ4, and around 15-20% or so for ZSTD
(on 64-bit) -- I see that ZSTD_copy8 is already using __builtin_memcpy,
but there must be more that can be optimized? There's a couple 1/2-byte
sized copies in huf_decompress.c.
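
Roughly the kind of timing I mean is sketched below; the helper and its
placement around the decompressor call are just illustration, not the
actual test code:

  static inline unsigned long long rdtsc_now(void)
  {
          unsigned int lo, hi;

          /* RDTSC returns the 64-bit TSC in EDX:EAX on i386 and x86_64 */
          asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
          return ((unsigned long long)hi << 32) | lo;
  }

  /* usage, e.g. bracketing the decompressor call in misc.c:
   *      unsigned long long t0 = rdtsc_now();
   *      ... run the decompressor ...
   *      cycles = rdtsc_now() - t0;
   */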

For the rest of the algos there seems to be no difference, likely
because some casual grepping doesn't show any constant-size copies in
the other algorithms.

2020-08-04 03:00:37

by Nick Terrell

Subject: Re: [PATCH] lz4: Fix kernel decompression speed



> On Aug 3, 2020, at 6:56 PM, Arvind Sankar <[email protected]> wrote:
>
> On Mon, Aug 03, 2020 at 10:55:01PM +0000, Nick Terrell wrote:
>>
>>
>>> On Aug 3, 2020, at 2:57 PM, Arvind Sankar <[email protected]> wrote:
>>>
>>> On Mon, Aug 03, 2020 at 12:40:22PM -0700, Nick Terrell wrote:
>>>> From: Nick Terrell <[email protected]>
>>>>
>>>> This patch replaces all memcpy() calls with LZ4_memcpy() which calls
>>>> __builtin_memcpy() so the compiler can inline it.
>>>>
>>>> LZ4 relies heavily on memcpy() with a constant size being inlined. In
>>>> x86 and i386 pre-boot environments memcpy() cannot be inlined because
>>>> memcpy() doesn't get defined as __builtin_memcpy().
>>>>
>>>> An equivalent patch has been applied upstream so that the next import
>>>> won't lose this change [1].
>>>>
>>>> I've measured the kernel decompression speed using QEMU before and after
>>>> this patch for the x86_64 and i386 architectures. The speed-up is about
>>>> 10x as shown below.
>>>>
>>>> Code Arch Kernel Size Time Speed
>>>> v5.8 x86_64 11504832 B 148 ms 79 MB/s
>>>> patch x86_64 11503872 B 13 ms 885 MB/s
>>>> v5.8 i386 9621216 B 91 ms 106 MB/s
>>>> patch i386 9620224 B 10 ms 962 MB/s
>>>>
>>>> I also measured the time to decompress the initramfs on x86_64, i386,
>>>> and arm. All three show the same decompression speed before and after,
>>>> as expected.
>>>>
>>>> [1] https://github.com/lz4/lz4/pull/890
>>>>
>>>
>>> Hi Nick, would you be able to test the below patch's performance to
>>> verify it gives the same speedup? It removes the #undef in misc.c which
>>> causes the decompressors to not use the builtin version. It should be
>>> equivalent to yours except for applying it to all the decompressors.
>>>
>>> Thanks.
>>
>> I will measure it. I would expect it to provide the same speed up. It would be great to fix
>> the problem for x86/i386 in general.
>
> Thanks. I tried using RDTSC to get some timings under QEMU, and I get
> similar speedup as you have for LZ4, and around 15-20% or so for ZSTD
> (on 64-bit)

By the way, I was using this script for performance testing [0].

> -- I see that ZSTD_copy8 is already using __builtin_memcpy,
> but there must be more that can be optimized? There's a couple 1/2-byte
> sized copies in huf_decompress.c.

Oh wow, I totally missed that, I guess I stopped looking once performance
was about what I expected, nice find!

I suspect it is mostly the memcpy inside of HUF_decodeSymbolX4(), since
that should be the only hot one [1].
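
The shape of the change would be something like this sketch (ZSTD_memcpy
here is a hypothetical wrapper mirroring LZ4_memcpy, and the decode line
is paraphrased, not the exact kernel source):

  #define ZSTD_memcpy(dst, src, size) __builtin_memcpy(dst, src, size)

  /* hot path in the Huffman decoder: a tiny constant-size copy */
  ZSTD_memcpy(op, dt + val, 2);   /* was: memcpy(op, dt + val, 2) */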

Do you want to put up the patch to fix the memcpy’s in zstd Huffman, or should I?

I will be submitting a patch upstream to migrate all of zstd’s memcpy() calls to
use __builtin_memcpy(), since I plan on updating the version in the kernel to
upstream zstd in the next few months. I was waiting until the compressed kernel
patch set landed, so I didn't distract from it.

[0] https://gist.github.com/terrelln/9bd53321a669f62683c608af8944fbc2
[1] https://github.com/torvalds/linux/blob/master/lib/zstd/huf_decompress.c#L598

Best,
Nick

2020-08-04 08:35:53

by Pavel Machek

Subject: Re: [PATCH] lz4: Fix kernel decompression speed

Hi!

> >> I've measured the kernel decompression speed using QEMU before and after
> >> this patch for the x86_64 and i386 architectures. The speed-up is about
> >> 10x as shown below.
> >>
> >> Code Arch Kernel Size Time Speed
> >> v5.8 x86_64 11504832 B 148 ms 79 MB/s
> >> patch x86_64 11503872 B 13 ms 885 MB/s
> >> v5.8 i386 9621216 B 91 ms 106 MB/s
> >> patch i386 9620224 B 10 ms 962 MB/s
> >>
> >> I also measured the time to decompress the initramfs on x86_64, i386,
> >> and arm. All three show the same decompression speed before and after,
> >> as expected.
> >>
> >> [1] https://github.com/lz4/lz4/pull/890
> >>
> >
> > Hi Nick, would you be able to test the below patch's performance to
> > verify it gives the same speedup? It removes the #undef in misc.c which
> > causes the decompressors to not use the builtin version. It should be
> > equivalent to yours except for applying it to all the decompressors.
> >
> > Thanks.
>
> I will measure it. I would expect it to provide the same speed up. It would be great to fix
> the problem for x86/i386 in general.
>
> But, I believe that this is also a problem for ARM, though I have a hard time measuring
> because I can’t get pre-boot print statements in QEMU. I will attempt to take a look at the
> assembly, because I’m fairly certain that memcpy() isn’t inlined in master.
>
> Even if we fix all the architectures, I would still like to merge the LZ4 patch. It seems like it
> is pretty easy to merge a patch that is a boot speed regression, because people aren’t
> actively measuring it. So I prefer a layered defense.


Layered defense against performance-only problem, happening on
emulation-only?

IMO that's a bit of overkill.

Best regards,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2020-08-04 15:20:18

by Arvind Sankar

Subject: Re: [PATCH] lz4: Fix kernel decompression speed

On Tue, Aug 04, 2020 at 10:32:36AM +0200, Pavel Machek wrote:
> Hi!
>
> > >> I've measured the kernel decompression speed using QEMU before and after
> > >> this patch for the x86_64 and i386 architectures. The speed-up is about
> > >> 10x as shown below.
> > >>
> > >> Code Arch Kernel Size Time Speed
> > >> v5.8 x86_64 11504832 B 148 ms 79 MB/s
> > >> patch x86_64 11503872 B 13 ms 885 MB/s
> > >> v5.8 i386 9621216 B 91 ms 106 MB/s
> > >> patch i386 9620224 B 10 ms 962 MB/s
> > >>
> > >> I also measured the time to decompress the initramfs on x86_64, i386,
> > >> and arm. All three show the same decompression speed before and after,
> > >> as expected.
> > >>
> > >> [1] https://github.com/lz4/lz4/pull/890
> > >>
> > >
> > > Hi Nick, would you be able to test the below patch's performance to
> > > verify it gives the same speedup? It removes the #undef in misc.c which
> > > causes the decompressors to not use the builtin version. It should be
> > > equivalent to yours except for applying it to all the decompressors.
> > >
> > > Thanks.
> >
> > I will measure it. I would expect it to provide the same speed up. It would be great to fix
> > the problem for x86/i386 in general.
> >
> > But, I believe that this is also a problem for ARM, though I have a hard time measuring
> > because I can’t get pre-boot print statements in QEMU. I will attempt to take a look at the
> > assembly, because I’m fairly certain that memcpy() isn’t inlined in master.
> >
> > Even if we fix all the architectures, I would still like to merge the LZ4 patch. It seems like it
> > is pretty easy to merge a patch that is a boot speed regression, because people aren’t
> > actively measuring it. So I prefer a layered defense.
>
>
> Layered defense against performance-only problem, happening on
> emulation-only?
>
> IMO that's a bit of overkill.
>
> Best regards,
> Pavel
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Why would it be emulation-only? QEMU is just being used for ease of
testing, but the performance impact should be similar on bare metal.

Thanks.

2020-08-04 15:22:48

by Arvind Sankar

Subject: Re: [PATCH] lz4: Fix kernel decompression speed

On Tue, Aug 04, 2020 at 02:57:50AM +0000, Nick Terrell wrote:
>
>
> > On Aug 3, 2020, at 6:56 PM, Arvind Sankar <[email protected]> wrote:
> >
>
> > -- I see that ZSTD_copy8 is already using __builtin_memcpy,
> > but there must be more that can be optimized? There's a couple 1/2-byte
> > sized copies in huf_decompress.c.
>
> Oh wow, I totally missed that, I guess I stopped looking once performance
> was about what I expected, nice find!
>
> I suspect it is mostly the memcpy inside of HUF_decodeSymbolX4(), since
> that should be the only hot one [1].
>
> Do you want to put up the patch to fix the memcpy’s in zstd Huffman, or should I?
>
> I will be submitting a patch upstream to migrate all of zstd’s memcpy() calls to
> use __builtin_memcpy(), since I plan on updating the version in the kernel to
> upstream zstd in the next few months. I was waiting until the compressed kernel
> patch set landed, so I didn't distract from it.
>
> [0] https://gist.github.com/terrelln/9bd53321a669f62683c608af8944fbc2
> [1] https://github.com/torvalds/linux/blob/master/lib/zstd/huf_decompress.c#L598
>
> Best,
> Nick
>

It's better if you do the zstd changes I think, as I'm not familiar with
the code at all.

Thanks.

2020-08-04 17:23:00

by Nick Terrell

Subject: Re: [PATCH] lz4: Fix kernel decompression speed



> On Aug 4, 2020, at 8:16 AM, Arvind Sankar <[email protected]> wrote:
>
> On Tue, Aug 04, 2020 at 10:32:36AM +0200, Pavel Machek wrote:
>> Hi!
>>
>>>>> I've measured the kernel decompression speed using QEMU before and after
>>>>> this patch for the x86_64 and i386 architectures. The speed-up is about
>>>>> 10x as shown below.
>>>>>
>>>>> Code Arch Kernel Size Time Speed
>>>>> v5.8 x86_64 11504832 B 148 ms 79 MB/s
>>>>> patch x86_64 11503872 B 13 ms 885 MB/s
>>>>> v5.8 i386 9621216 B 91 ms 106 MB/s
>>>>> patch i386 9620224 B 10 ms 962 MB/s
>>>>>
>>>>> I also measured the time to decompress the initramfs on x86_64, i386,
>>>>> and arm. All three show the same decompression speed before and after,
>>>>> as expected.
>>>>>
>>>>> [1] https://github.com/lz4/lz4/pull/890
>>>>>
>>>>
>>>> Hi Nick, would you be able to test the below patch's performance to
>>>> verify it gives the same speedup? It removes the #undef in misc.c which
>>>> causes the decompressors to not use the builtin version. It should be
>>>> equivalent to yours except for applying it to all the decompressors.
>>>>
>>>> Thanks.
>>>
>>> I will measure it. I would expect it to provide the same speed up. It would be great to fix
>>> the problem for x86/i386 in general.
>>>
>>> But, I believe that this is also a problem for ARM, though I have a hard time measuring
>>> because I can’t get pre-boot print statements in QEMU. I will attempt to take a look at the
>>> assembly, because I’m fairly certain that memcpy() isn’t inlined in master.
>>>
>>> Even if we fix all the architectures, I would still like to merge the LZ4 patch. It seems like it
>>> is pretty easy to merge a patch that is a boot speed regression, because people aren’t
>>> actively measuring it. So I prefer a layered defense.
>>
>>
>> Layered defense against performance-only problem, happening on
>> emulation-only?
>>
>> IMO that's a bit of overkill.
>
> Why would it be emulation-only? QEMU is just being used for ease of
> testing, but the performance impact should be similar on bare metal.

In addition, I want the decompressors to be fast in the pre-boot environment for all
architectures. Not everyone is going to know that zstd and lz4 require
memcpy() to be inlined or they are 10x slower; that is an implementation
detail of the library. It is a performance gotcha that I’d rather not have.

2020-08-04 18:02:06

by Nick Terrell

Subject: Re: [PATCH] lz4: Fix kernel decompression speed



> On Aug 3, 2020, at 6:56 PM, Arvind Sankar <[email protected]> wrote:
>
> On Mon, Aug 03, 2020 at 10:55:01PM +0000, Nick Terrell wrote:
>>
>>
>>> On Aug 3, 2020, at 2:57 PM, Arvind Sankar <[email protected]> wrote:
>>>
>>> On Mon, Aug 03, 2020 at 12:40:22PM -0700, Nick Terrell wrote:
>>>> From: Nick Terrell <[email protected]>
>>>>
>>>> This patch replaces all memcpy() calls with LZ4_memcpy() which calls
>>>> __builtin_memcpy() so the compiler can inline it.
>>>>
>>>> LZ4 relies heavily on memcpy() with a constant size being inlined. In
>>>> x86 and i386 pre-boot environments memcpy() cannot be inlined because
>>>> memcpy() doesn't get defined as __builtin_memcpy().
>>>>
>>>> An equivalent patch has been applied upstream so that the next import
>>>> won't lose this change [1].
>>>>
>>>> I've measured the kernel decompression speed using QEMU before and after
>>>> this patch for the x86_64 and i386 architectures. The speed-up is about
>>>> 10x as shown below.
>>>>
>>>> Code Arch Kernel Size Time Speed
>>>> v5.8 x86_64 11504832 B 148 ms 79 MB/s
>>>> patch x86_64 11503872 B 13 ms 885 MB/s
>>>> v5.8 i386 9621216 B 91 ms 106 MB/s
>>>> patch i386 9620224 B 10 ms 962 MB/s
>>>>
>>>> I also measured the time to decompress the initramfs on x86_64, i386,
>>>> and arm. All three show the same decompression speed before and after,
>>>> as expected.
>>>>
>>>> [1] https://github.com/lz4/lz4/pull/890
>>>>
>>>
>>> Hi Nick, would you be able to test the below patch's performance to
>>> verify it gives the same speedup? It removes the #undef in misc.c which
>>> causes the decompressors to not use the builtin version. It should be
>>> equivalent to yours except for applying it to all the decompressors.
>>>
>>> Thanks.
>>
>> I will measure it. I would expect it to provide the same speed up. It would be great to fix
>> the problem for x86/i386 in general.
>
> Thanks. I tried using RDTSC to get some timings under QEMU, and I get
> similar speedup as you have for LZ4, and around 15-20% or so for ZSTD
> (on 64-bit) -- I see that ZSTD_copy8 is already using __builtin_memcpy,
> but there must be more that can be optimized? There's a couple 1/2-byte
> sized copies in huf_decompress.c.

I tested your patch. I saw the same performance as my patch for LZ4, as expected.
I saw that zstd decompression time went from 84 ms -> 63 ms, a 25% improvement.

I will get started on a zstd patch. I look forward to your x86 patch getting merged!

Thanks,
Nick

2020-08-04 23:50:15

by Arvind Sankar

Subject: [PATCH 0/1] x86/boot/compressed: Use builtin mem functions for decompressor

This patch drops a couple of lines that prevent the decompressor code
from using __builtin_memcpy (and __builtin_memset).

Note that Nick Terrell's patch [0] will apply similar changes directly
in the LZ4 decompressor code (in-kernel and upstream). The ZSTD code is
already partially optimized to use __builtin_memcpy for hot memcpy's,
but still improves a little more with this patch.

[0] https://lore.kernel.org/lkml/[email protected]/

Arvind Sankar (1):
x86/boot/compressed: Use builtin mem functions for decompressor

arch/x86/boot/compressed/misc.c | 7 ++-----
arch/x86/boot/string.h | 5 +----
2 files changed, 3 insertions(+), 9 deletions(-)

--
2.26.2