From: Nick Terrell <[email protected]>
Hi all,
I attempted to send this patch on March 19, but it got rejected by the
mailing lists for some reason, so I'm resending it.
This patch set adds support for a ZSTD-compressed kernel, ramdisk, and
initramfs in the kernel boot process. ZSTD-compressed ramdisk and initramfs
are supported on all architectures. The ZSTD-compressed kernel is only
hooked up to x86 in this patch set.
Zstandard requires slightly more memory during the kernel decompression
on x86 (192 KB vs 64 KB), and the memory usage is independent of the
window size.
Zstandard requires memory proprortional to the window size used during
compression for decompressing the ramdisk image, since streaming mode is
used. Newer versions of zstd (1.3.2+) list the window size of a file
with `zstd -lv <file>'. The absolute maximum amount of memory required
is just over 8 MB, but it can be controlled at compression time.
This patch set has been boot tested with buildroot and QEMU based off
of linux-5.6-rc6.
On i386 and x86_64 I have tested the following configurations:
* zstd compressed kernel and a separate zstd compressed initramfs
* zstd compressed kernel and a built-in zstd compressed initramfs
* gzip compressed kernel and a separate gzip compressed initramfs
* gzip compressed kernel and a built-in gzip compressed initramfs
On arm and aarch64 I tested the same configurations, except that the kernel is
always gzip compressed.
Facebook has been using v1 of these patches on x86_64 devices for more than 6
months. When we switched from a xz compressed initramfs to a zstd compressed
initramfs decompression time shrunk from 12 seconds to 3 seconds. When we
switched from a xz compressed kernel to a zstd compressed kernel we saved 2
seconds of boot time.
Facebook has been using v2 of these patches on aarch64 devices for a few weeks.
When we switched from an lzma compressed initramfs to a zstd compressed initramfs
decompression time shrunk from 27 seconds to 8 seconds.
The zstd compressed kernel is smaller than the gzip compressed kernel but larger
than the xz or lzma compressed kernels, and it decompresses faster than
everything except lz4. See the table below for the measurement of an x86_64
kernel ordered by compressed size:
algo size
xz 6,509,792
lzma 6,856,576
zstd 7,399,157
gzip 8,522,527
bzip 8,629,603
lzo 9,808,035
lz4 10,705,570
none 32,565,672
v1 -> v2:
- Rebase
- usr/Makefile and init/Kconfig were changed so the patches were updated
- No functional changes except to rebase
- Split the patches up into smaller chunks
v2 -> v3:
- Add *.zst to the .gitignore in patch 8
- Style nits in patch 3
- Rename the PREBOOT macro to ZSTD_PREBOOT and XXH_PREBOOT in patches
1 through 3
Best,
Nick Terrell
Adam Borowski (1):
.gitignore: add ZSTD-compressed files
Nick Terrell (7):
lib: prepare zstd for preboot environment
lib: prepare xxhash for preboot environment
lib: add zstd support to decompress
init: add support for zstd compressed kernel
usr: add support for zstd compressed initramfs
x86: bump ZO_z_extra_bytes margin for zstd
x86: Add support for ZSTD compressed kernel
.gitignore | 1 +
Documentation/x86/boot.rst | 6 +-
arch/x86/Kconfig | 1 +
arch/x86/boot/compressed/Makefile | 5 +-
arch/x86/boot/compressed/misc.c | 4 +
arch/x86/boot/header.S | 8 +-
arch/x86/include/asm/boot.h | 6 +-
include/linux/decompress/unzstd.h | 11 +
init/Kconfig | 15 +-
lib/Kconfig | 4 +
lib/Makefile | 1 +
lib/decompress.c | 5 +
lib/decompress_unzstd.c | 338 ++++++++++++++++++++++++++++++
lib/xxhash.c | 21 +-
lib/zstd/decompress.c | 2 +
lib/zstd/fse_decompress.c | 9 +-
lib/zstd/zstd_internal.h | 14 +-
scripts/Makefile.lib | 15 ++
usr/Kconfig | 20 ++
usr/Makefile | 1 +
20 files changed, 460 insertions(+), 27 deletions(-)
create mode 100644 include/linux/decompress/unzstd.h
create mode 100644 lib/decompress_unzstd.c
--
2.25.1
From: Nick Terrell <[email protected]>
* Don't export symbols if ZSTD_PREBOOT is defined.
* Remove a double definition of the CHECK_F macro when the zstd
library is amalgamated.
* Switch ZSTD_copy8() to __builtin_memcpy(), because in the preboot
environment on x86 gcc can't inline `memcpy()` otherwise.
* Limit the gcc hack in ZSTD_wildcopy() to the broken gcc version. See
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81388.
These changes are necessary to get the build to work in the preboot
environment, and to get reasonable performance. ZSTD_copy8() and
ZSTD_wildcopy() are in the core of the zstd hot loop. So outlining
these calls to memcpy(), and having an extra branch are very
detrimental to performance.
Signed-off-by: Nick Terrell <[email protected]>
---
lib/zstd/decompress.c | 2 ++
lib/zstd/fse_decompress.c | 9 +--------
lib/zstd/zstd_internal.h | 14 ++++++++++++--
3 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/lib/zstd/decompress.c b/lib/zstd/decompress.c
index 269ee9a796c1..73ded63278cf 100644
--- a/lib/zstd/decompress.c
+++ b/lib/zstd/decompress.c
@@ -2490,6 +2490,7 @@ size_t ZSTD_decompressStream(ZSTD_DStream *zds, ZSTD_outBuffer *output, ZSTD_inB
}
}
+#ifndef ZSTD_PREBOOT
EXPORT_SYMBOL(ZSTD_DCtxWorkspaceBound);
EXPORT_SYMBOL(ZSTD_initDCtx);
EXPORT_SYMBOL(ZSTD_decompressDCtx);
@@ -2529,3 +2530,4 @@ EXPORT_SYMBOL(ZSTD_insertBlock);
MODULE_LICENSE("Dual BSD/GPL");
MODULE_DESCRIPTION("Zstd Decompressor");
+#endif
diff --git a/lib/zstd/fse_decompress.c b/lib/zstd/fse_decompress.c
index a84300e5a013..0b353530fb3f 100644
--- a/lib/zstd/fse_decompress.c
+++ b/lib/zstd/fse_decompress.c
@@ -47,6 +47,7 @@
****************************************************************/
#include "bitstream.h"
#include "fse.h"
+#include "zstd_internal.h"
#include <linux/compiler.h>
#include <linux/kernel.h>
#include <linux/string.h> /* memcpy, memset */
@@ -60,14 +61,6 @@
enum { FSE_static_assert = 1 / (int)(!!(c)) }; \
} /* use only *after* variable declarations */
-/* check and forward error code */
-#define CHECK_F(f) \
- { \
- size_t const e = f; \
- if (FSE_isError(e)) \
- return e; \
- }
-
/* **************************************************************
* Templates
****************************************************************/
diff --git a/lib/zstd/zstd_internal.h b/lib/zstd/zstd_internal.h
index 1a79fab9e13a..dac753397f86 100644
--- a/lib/zstd/zstd_internal.h
+++ b/lib/zstd/zstd_internal.h
@@ -127,7 +127,14 @@ static const U32 OF_defaultNormLog = OF_DEFAULTNORMLOG;
* Shared functions to include for inlining
*********************************************/
ZSTD_STATIC void ZSTD_copy8(void *dst, const void *src) {
- memcpy(dst, src, 8);
+ /*
+ * zstd relies heavily on gcc being able to analyze and inline this
+ * memcpy() call, since it is called in a tight loop. Preboot mode
+ * is compiled in freestanding mode, which stops gcc from analyzing
+ * memcpy(). Use __builtin_memcpy() to tell gcc to analyze this as a
+ * regular memcpy().
+ */
+ __builtin_memcpy(dst, src, 8);
}
/*! ZSTD_wildcopy() :
* custom version of memcpy(), can copy up to 7 bytes too many (8 bytes if length==0) */
@@ -137,13 +144,16 @@ ZSTD_STATIC void ZSTD_wildcopy(void *dst, const void *src, ptrdiff_t length)
const BYTE* ip = (const BYTE*)src;
BYTE* op = (BYTE*)dst;
BYTE* const oend = op + length;
- /* Work around https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81388.
+#if defined(GCC_VERSION) && GCC_VERSION >= 70000 && GCC_VERSION < 70200
+ /*
+ * Work around https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81388.
* Avoid the bad case where the loop only runs once by handling the
* special case separately. This doesn't trigger the bug because it
* doesn't involve pointer/integer overflow.
*/
if (length <= 8)
return ZSTD_copy8(dst, src);
+#endif
do {
ZSTD_copy8(op, ip);
op += 8;
--
2.25.1
From: Nick Terrell <[email protected]>
Don't export symbols if XXH_PREBOOT is defined.
This change is necessary to get xxhash to work in a preboot environment,
which is needed to support zstd-compressed kernels.
Signed-off-by: Nick Terrell <[email protected]>
---
lib/xxhash.c | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)
diff --git a/lib/xxhash.c b/lib/xxhash.c
index aa61e2a3802f..b4364e011392 100644
--- a/lib/xxhash.c
+++ b/lib/xxhash.c
@@ -80,13 +80,11 @@ void xxh32_copy_state(struct xxh32_state *dst, const struct xxh32_state *src)
{
memcpy(dst, src, sizeof(*dst));
}
-EXPORT_SYMBOL(xxh32_copy_state);
void xxh64_copy_state(struct xxh64_state *dst, const struct xxh64_state *src)
{
memcpy(dst, src, sizeof(*dst));
}
-EXPORT_SYMBOL(xxh64_copy_state);
/*-***************************
* Simple Hash Functions
@@ -151,7 +149,6 @@ uint32_t xxh32(const void *input, const size_t len, const uint32_t seed)
return h32;
}
-EXPORT_SYMBOL(xxh32);
static uint64_t xxh64_round(uint64_t acc, const uint64_t input)
{
@@ -234,7 +231,6 @@ uint64_t xxh64(const void *input, const size_t len, const uint64_t seed)
return h64;
}
-EXPORT_SYMBOL(xxh64);
/*-**************************************************
* Advanced Hash Functions
@@ -251,7 +247,6 @@ void xxh32_reset(struct xxh32_state *statePtr, const uint32_t seed)
state.v4 = seed - PRIME32_1;
memcpy(statePtr, &state, sizeof(state));
}
-EXPORT_SYMBOL(xxh32_reset);
void xxh64_reset(struct xxh64_state *statePtr, const uint64_t seed)
{
@@ -265,7 +260,6 @@ void xxh64_reset(struct xxh64_state *statePtr, const uint64_t seed)
state.v4 = seed - PRIME64_1;
memcpy(statePtr, &state, sizeof(state));
}
-EXPORT_SYMBOL(xxh64_reset);
int xxh32_update(struct xxh32_state *state, const void *input, const size_t len)
{
@@ -334,7 +328,6 @@ int xxh32_update(struct xxh32_state *state, const void *input, const size_t len)
return 0;
}
-EXPORT_SYMBOL(xxh32_update);
uint32_t xxh32_digest(const struct xxh32_state *state)
{
@@ -372,7 +365,6 @@ uint32_t xxh32_digest(const struct xxh32_state *state)
return h32;
}
-EXPORT_SYMBOL(xxh32_digest);
int xxh64_update(struct xxh64_state *state, const void *input, const size_t len)
{
@@ -439,7 +431,6 @@ int xxh64_update(struct xxh64_state *state, const void *input, const size_t len)
return 0;
}
-EXPORT_SYMBOL(xxh64_update);
uint64_t xxh64_digest(const struct xxh64_state *state)
{
@@ -494,7 +485,19 @@ uint64_t xxh64_digest(const struct xxh64_state *state)
return h64;
}
+
+#ifndef XXH_PREBOOT
+EXPORT_SYMBOL(xxh32_copy_state);
+EXPORT_SYMBOL(xxh64_copy_state);
+EXPORT_SYMBOL(xxh32);
+EXPORT_SYMBOL(xxh64);
+EXPORT_SYMBOL(xxh32_reset);
+EXPORT_SYMBOL(xxh64_reset);
+EXPORT_SYMBOL(xxh32_update);
+EXPORT_SYMBOL(xxh32_digest);
+EXPORT_SYMBOL(xxh64_update);
EXPORT_SYMBOL(xxh64_digest);
MODULE_LICENSE("Dual BSD/GPL");
MODULE_DESCRIPTION("xxHash");
+#endif
--
2.25.1
From: Nick Terrell <[email protected]>
* Add unzstd() and the zstd decompress interface.
* Add zstd support to decompress_method().
The decompress_method() and unzstd() functions are used to decompress
the initramfs and the initrd. The __decompress() function is used in
the preboot environment to decompress a zstd compressed kernel.
The zstd decompression function allows the input and output buffers to
overlap because that is used by x86 kernel decompression.
Signed-off-by: Nick Terrell <[email protected]>
---
include/linux/decompress/unzstd.h | 11 +
lib/Kconfig | 4 +
lib/Makefile | 1 +
lib/decompress.c | 5 +
lib/decompress_unzstd.c | 338 ++++++++++++++++++++++++++++++
5 files changed, 359 insertions(+)
create mode 100644 include/linux/decompress/unzstd.h
create mode 100644 lib/decompress_unzstd.c
diff --git a/include/linux/decompress/unzstd.h b/include/linux/decompress/unzstd.h
new file mode 100644
index 000000000000..56d539ae880f
--- /dev/null
+++ b/include/linux/decompress/unzstd.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef LINUX_DECOMPRESS_UNZSTD_H
+#define LINUX_DECOMPRESS_UNZSTD_H
+
+int unzstd(unsigned char *inbuf, long len,
+ long (*fill)(void*, unsigned long),
+ long (*flush)(void*, unsigned long),
+ unsigned char *output,
+ long *pos,
+ void (*error_fn)(char *x));
+#endif
diff --git a/lib/Kconfig b/lib/Kconfig
index bc7e56370129..11de5fa09a52 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -336,6 +336,10 @@ config DECOMPRESS_LZ4
select LZ4_DECOMPRESS
tristate
+config DECOMPRESS_ZSTD
+ select ZSTD_DECOMPRESS
+ tristate
+
#
# Generic allocator support is selected if needed
#
diff --git a/lib/Makefile b/lib/Makefile
index 611872c06926..09ad45ba6883 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -160,6 +160,7 @@ lib-$(CONFIG_DECOMPRESS_LZMA) += decompress_unlzma.o
lib-$(CONFIG_DECOMPRESS_XZ) += decompress_unxz.o
lib-$(CONFIG_DECOMPRESS_LZO) += decompress_unlzo.o
lib-$(CONFIG_DECOMPRESS_LZ4) += decompress_unlz4.o
+lib-$(CONFIG_DECOMPRESS_ZSTD) += decompress_unzstd.o
obj-$(CONFIG_TEXTSEARCH) += textsearch.o
obj-$(CONFIG_TEXTSEARCH_KMP) += ts_kmp.o
diff --git a/lib/decompress.c b/lib/decompress.c
index 857ab1af1ef3..ab3fc90ffc64 100644
--- a/lib/decompress.c
+++ b/lib/decompress.c
@@ -13,6 +13,7 @@
#include <linux/decompress/inflate.h>
#include <linux/decompress/unlzo.h>
#include <linux/decompress/unlz4.h>
+#include <linux/decompress/unzstd.h>
#include <linux/types.h>
#include <linux/string.h>
@@ -37,6 +38,9 @@
#ifndef CONFIG_DECOMPRESS_LZ4
# define unlz4 NULL
#endif
+#ifndef CONFIG_DECOMPRESS_ZSTD
+# define unzstd NULL
+#endif
struct compress_format {
unsigned char magic[2];
@@ -52,6 +56,7 @@ static const struct compress_format compressed_formats[] __initconst = {
{ {0xfd, 0x37}, "xz", unxz },
{ {0x89, 0x4c}, "lzo", unlzo },
{ {0x02, 0x21}, "lz4", unlz4 },
+ { {0x28, 0xb5}, "zstd", unzstd },
{ {0, 0}, NULL, NULL }
};
diff --git a/lib/decompress_unzstd.c b/lib/decompress_unzstd.c
new file mode 100644
index 000000000000..a6b391b47ab8
--- /dev/null
+++ b/lib/decompress_unzstd.c
@@ -0,0 +1,338 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Important notes about in-place decompression
+ *
+ * At least on x86, the kernel is decompressed in place: the compressed data
+ * is placed to the end of the output buffer, and the decompressor overwrites
+ * most of the compressed data. There must be enough safety margin to
+ * guarantee that the write position is always behind the read position.
+ *
+ * The safety margin for ZSTD with a 128 KB block size is calculated below.
+ * Note that the margin with ZSTD is bigger than with GZIP or XZ!
+ *
+ * The worst case for in-place decompression is that the beginning of
+ * the file is compressed extremely well, and the rest of the file is
+ * uncompressible. Thus, we must look for worst-case expansion when the
+ * compressor is encoding uncompressible data.
+ *
+ * The structure of the .zst file in case of a compresed kernel is as follows.
+ * Maximum sizes (as bytes) of the fields are in parenthesis.
+ *
+ * Frame Header: (18)
+ * Blocks: (N)
+ * Checksum: (4)
+ *
+ * The frame header and checksum overhead is at most 22 bytes.
+ *
+ * ZSTD stores the data in blocks. Each block has a header whose size is
+ * a 3 bytes. After the block header, there is up to 128 KB of payload.
+ * The maximum uncompressed size of the payload is 128 KB. The minimum
+ * uncompressed size of the payload is never less than the payload size
+ * (excluding the block header).
+ *
+ * The assumption, that the uncompressed size of the payload is never
+ * smaller than the payload itself, is valid only when talking about
+ * the payload as a whole. It is possible that the payload has parts where
+ * the decompressor consumes more input than it produces output. Calculating
+ * the worst case for this would be tricky. Instead of trying to do that,
+ * let's simply make sure that the decompressor never overwrites any bytes
+ * of the payload which it is currently reading.
+ *
+ * Now we have enough information to calculate the safety margin. We need
+ * - 22 bytes for the .zst file format headers;
+ * - 3 bytes per every 128 KiB of uncompressed size (one block header per
+ * block); and
+ * - 128 KiB (biggest possible zstd block size) to make sure that the
+ * decompressor never overwrites anything from the block it is currently
+ * reading.
+ *
+ * We get the following formula:
+ *
+ * safety_margin = 22 + uncompressed_size * 3 / 131072 + 131072
+ * <= 22 + (uncompressed_size >> 15) + 131072
+ */
+
+/*
+ * Preboot environments #include "path/to/decompress_unzstd.c".
+ * All of the source files we depend on must be #included.
+ * zstd's only source dependeny is xxhash, which has no source
+ * dependencies.
+ *
+ * zstd and xxhash avoid declaring themselves as modules
+ * when ZSTD_PREBOOT and XXH_PREBOOT are defined.
+ */
+#ifdef STATIC
+# define ZSTD_PREBOOT
+# define XXH_PREBOOT
+# include "xxhash.c"
+# include "zstd/entropy_common.c"
+# include "zstd/fse_decompress.c"
+# include "zstd/huf_decompress.c"
+# include "zstd/zstd_common.c"
+# include "zstd/decompress.c"
+#endif
+
+#include <linux/decompress/mm.h>
+#include <linux/kernel.h>
+#include <linux/zstd.h>
+
+/* 8 MB maximum window size */
+#define ZSTD_WINDOWSIZE_MAX (1 << 23)
+/* Size of the input and output buffers in multi-call mdoe */
+#define ZSTD_IOBUF_SIZE 4096
+
+static int INIT handle_zstd_error(size_t ret, void (*error)(char *x))
+{
+ const int err = ZSTD_getErrorCode(ret);
+
+ if (!ZSTD_isError(ret))
+ return 0;
+
+ switch (err) {
+ case ZSTD_error_memory_allocation:
+ error("ZSTD decompressor ran out of memory");
+ break;
+ case ZSTD_error_prefix_unknown:
+ error("Input is not in the ZSTD format (wrong magic bytes)");
+ break;
+ case ZSTD_error_dstSize_tooSmall:
+ case ZSTD_error_corruption_detected:
+ case ZSTD_error_checksum_wrong:
+ error("ZSTD-compressed data is corrupt");
+ break;
+ default:
+ error("ZSTD-compressed data is probably corrupt");
+ break;
+ }
+ return -1;
+}
+
+/*
+ * Handle the case where we have the entire input and output in one segment.
+ * We can allocate less memory (no circular buffer for the sliding window),
+ * and avoid some memcpy() calls.
+ */
+static int INIT decompress_single(const u8 *in_buf, long in_len, u8 *out_buf,
+ long out_len, long *in_pos,
+ void (*error)(char *x))
+{
+ const size_t wksp_size = ZSTD_DCtxWorkspaceBound();
+ void *wksp = large_malloc(wksp_size);
+ ZSTD_DCtx *dctx = ZSTD_initDCtx(wksp, wksp_size);
+ int err;
+ size_t ret;
+
+ if (dctx == NULL) {
+ error("Out of memory while allocating ZSTD_DCtx");
+ err = -1;
+ goto out;
+ }
+ /*
+ * Find out how large the frame actually is, there may be junk at
+ * the end of the frame that ZSTD_decompressDCtx() can't handle.
+ */
+ ret = ZSTD_findFrameCompressedSize(in_buf, in_len);
+ err = handle_zstd_error(ret, error);
+ if (err)
+ goto out;
+ in_len = (long)ret;
+
+ ret = ZSTD_decompressDCtx(dctx, out_buf, out_len, in_buf, in_len);
+ err = handle_zstd_error(ret, error);
+ if (err)
+ goto out;
+
+ if (in_pos != NULL)
+ *in_pos = in_len;
+
+ err = 0;
+out:
+ if (wksp != NULL)
+ large_free(wksp);
+ return err;
+}
+
+static int INIT __unzstd(unsigned char *in_buf, long in_len,
+ long (*fill)(void*, unsigned long),
+ long (*flush)(void*, unsigned long),
+ unsigned char *out_buf, long out_len,
+ long *in_pos,
+ void (*error)(char *x))
+{
+ ZSTD_inBuffer in;
+ ZSTD_outBuffer out;
+ ZSTD_frameParams params;
+ void *in_allocated = NULL;
+ void *out_allocated = NULL;
+ void *wksp = NULL;
+ size_t wksp_size;
+ ZSTD_DStream *dstream;
+ int err;
+ size_t ret;
+
+ if (out_len == 0)
+ out_len = LONG_MAX; /* no limit */
+
+ if (fill == NULL && flush == NULL)
+ /*
+ * We can decompress faster and with less memory when we have a
+ * single chunk.
+ */
+ return decompress_single(in_buf, in_len, out_buf, out_len,
+ in_pos, error);
+
+ /*
+ * If in_buf is not provided, we must be using fill(), so allocate
+ * a large enough buffer. If it is provided, it must be at least
+ * ZSTD_IOBUF_SIZE large.
+ */
+ if (in_buf == NULL) {
+ in_allocated = malloc(ZSTD_IOBUF_SIZE);
+ if (in_allocated == NULL) {
+ error("Out of memory while allocating input buffer");
+ err = -1;
+ goto out;
+ }
+ in_buf = in_allocated;
+ in_len = 0;
+ }
+ /* Read the first chunk, since we need to decode the frame header. */
+ if (fill != NULL)
+ in_len = fill(in_buf, ZSTD_IOBUF_SIZE);
+ if (in_len < 0) {
+ error("ZSTD-compressed data is truncated");
+ err = -1;
+ goto out;
+ }
+ /* Set the first non-empty input buffer. */
+ in.src = in_buf;
+ in.pos = 0;
+ in.size = in_len;
+ /* Allocate the output buffer if we are using flush(). */
+ if (flush != NULL) {
+ out_allocated = malloc(ZSTD_IOBUF_SIZE);
+ if (out_allocated == NULL) {
+ error("Out of memory while allocating output buffer");
+ err = -1;
+ goto out;
+ }
+ out_buf = out_allocated;
+ out_len = ZSTD_IOBUF_SIZE;
+ }
+ /* Set the output buffer. */
+ out.dst = out_buf;
+ out.pos = 0;
+ out.size = out_len;
+
+ /*
+ * We need to know the window size to allocate the ZSTD_DStream.
+ * Since we are streaming, we need to allocate a buffer for the sliding
+ * window. The window size varies from 1 KB to ZSTD_WINDOWSIZE_MAX
+ * (8 MB), so it is important to use the actual value so as not to
+ * waste memory when it is smaller.
+ */
+ ret = ZSTD_getFrameParams(¶ms, in.src, in.size);
+ err = handle_zstd_error(ret, error);
+ if (err)
+ goto out;
+ if (ret != 0) {
+ error("ZSTD-compressed data has an incomplete frame header");
+ err = -1;
+ goto out;
+ }
+ if (params.windowSize > ZSTD_WINDOWSIZE_MAX) {
+ error("ZSTD-compressed data has too large a window size");
+ err = -1;
+ goto out;
+ }
+
+ /*
+ * Allocate the ZSTD_DStream now that we know how much memory is
+ * required.
+ */
+ wksp_size = ZSTD_DStreamWorkspaceBound(params.windowSize);
+ wksp = large_malloc(wksp_size);
+ dstream = ZSTD_initDStream(params.windowSize, wksp, wksp_size);
+ if (dstream == NULL) {
+ error("Out of memory while allocating ZSTD_DStream");
+ err = -1;
+ goto out;
+ }
+
+ /*
+ * Decompression loop:
+ * Read more data if necessary (error if no more data can be read).
+ * Call the decompression function, which returns 0 when finished.
+ * Flush any data produced if using flush().
+ */
+ if (in_pos != NULL)
+ *in_pos = 0;
+ do {
+ /*
+ * If we need to reload data, either we have fill() and can
+ * try to get more data, or we don't and the input is truncated.
+ */
+ if (in.pos == in.size) {
+ if (in_pos != NULL)
+ *in_pos += in.pos;
+ in_len = fill ? fill(in_buf, ZSTD_IOBUF_SIZE) : -1;
+ if (in_len < 0) {
+ error("ZSTD-compressed data is truncated");
+ err = -1;
+ goto out;
+ }
+ in.pos = 0;
+ in.size = in_len;
+ }
+ /* Returns zero when the frame is complete. */
+ ret = ZSTD_decompressStream(dstream, &out, &in);
+ err = handle_zstd_error(ret, error);
+ if (err)
+ goto out;
+ /* Flush all of the data produced if using flush(). */
+ if (flush != NULL && out.pos > 0) {
+ if (out.pos != flush(out.dst, out.pos)) {
+ error("Failed to flush()");
+ err = -1;
+ goto out;
+ }
+ out.pos = 0;
+ }
+ } while (ret != 0);
+
+ if (in_pos != NULL)
+ *in_pos += in.pos;
+
+ err = 0;
+out:
+ if (in_allocated != NULL)
+ free(in_allocated);
+ if (out_allocated != NULL)
+ free(out_allocated);
+ if (wksp != NULL)
+ large_free(wksp);
+ return err;
+}
+
+#ifndef ZSTD_PREBOOT
+STATIC int INIT unzstd(unsigned char *buf, long len,
+ long (*fill)(void*, unsigned long),
+ long (*flush)(void*, unsigned long),
+ unsigned char *out_buf,
+ long *pos,
+ void (*error)(char *x))
+{
+ return __unzstd(buf, len, fill, flush, out_buf, 0, pos, error);
+}
+#else
+STATIC int INIT __decompress(unsigned char *buf, long len,
+ long (*fill)(void*, unsigned long),
+ long (*flush)(void*, unsigned long),
+ unsigned char *out_buf, long out_len,
+ long *pos,
+ void (*error)(char *x))
+{
+ return __unzstd(buf, len, fill, flush, out_buf, out_len, pos, error);
+}
+#endif
--
2.25.1
From: Nick Terrell <[email protected]>
* Adds the zstd cmd to scripts/Makefile.lib
* Adds the HAVE_KERNEL_ZSTD and KERNEL_ZSTD options
Architecture specific support is still needed for decompression.
Signed-off-by: Nick Terrell <[email protected]>
---
init/Kconfig | 15 ++++++++++++++-
scripts/Makefile.lib | 15 +++++++++++++++
2 files changed, 29 insertions(+), 1 deletion(-)
diff --git a/init/Kconfig b/init/Kconfig
index 20a6ac33761c..9b646a25918e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -173,13 +173,16 @@ config HAVE_KERNEL_LZO
config HAVE_KERNEL_LZ4
bool
+config HAVE_KERNEL_ZSTD
+ bool
+
config HAVE_KERNEL_UNCOMPRESSED
bool
choice
prompt "Kernel compression mode"
default KERNEL_GZIP
- depends on HAVE_KERNEL_GZIP || HAVE_KERNEL_BZIP2 || HAVE_KERNEL_LZMA || HAVE_KERNEL_XZ || HAVE_KERNEL_LZO || HAVE_KERNEL_LZ4 || HAVE_KERNEL_UNCOMPRESSED
+ depends on HAVE_KERNEL_GZIP || HAVE_KERNEL_BZIP2 || HAVE_KERNEL_LZMA || HAVE_KERNEL_XZ || HAVE_KERNEL_LZO || HAVE_KERNEL_LZ4 || HAVE_KERNEL_ZSTD || HAVE_KERNEL_UNCOMPRESSED
help
The linux kernel is a kind of self-extracting executable.
Several compression algorithms are available, which differ
@@ -258,6 +261,16 @@ config KERNEL_LZ4
is about 8% bigger than LZO. But the decompression speed is
faster than LZO.
+config KERNEL_ZSTD
+ bool "ZSTD"
+ depends on HAVE_KERNEL_ZSTD
+ help
+ ZSTD is a compression algorithm targeting intermediate compression
+ with fast decompression speed. It will compress better than GZIP and
+ decompress around the same speed as LZO, but slower than LZ4. You
+ will need at least 192 KB RAM or more for booting. The zstd command
+ line tools is required for compression.
+
config KERNEL_UNCOMPRESSED
bool "None"
depends on HAVE_KERNEL_UNCOMPRESSED
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index 752ff0a225a9..4b99893efa3d 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -394,6 +394,21 @@ quiet_cmd_xzkern = XZKERN $@
quiet_cmd_xzmisc = XZMISC $@
cmd_xzmisc = cat $(real-prereqs) | xz --check=crc32 --lzma2=dict=1MiB > $@
+# ZSTD
+# ---------------------------------------------------------------------------
+# Appends the uncompressed size of the data using size_append. The .zst
+# format has the size information available at the beginning of the file too,
+# but it's in a more complex format and it's good to avoid changing the part
+# of the boot code that reads the uncompressed size.
+# Note that the bytes added by size_append will make the zstd tool think that
+# the file is corrupt. This is expected.
+
+quiet_cmd_zstd = ZSTD $@
+cmd_zstd = (cat $(filter-out FORCE,$^) | \
+ zstd -19 && \
+ $(call size_append, $(filter-out FORCE,$^))) > $@ || \
+ (rm -f $@ ; false)
+
# ASM offsets
# ---------------------------------------------------------------------------
--
2.25.1
From: Nick Terrell <[email protected]>
* Add support for a zstd compressed initramfs.
* Add compression for compressing built-in initramfs with zstd.
I have tested this patch by boot testing with buildroot and QEMU.
Specifically, I booted the kernel with both a zstd and gzip compressed
initramfs, both built into the kernel and separate. I ensured that the
correct compression algorithm was used. I tested on arm, aarch64, i386,
and x86_64.
This patch has been tested in production on aarch64 and x86_64 devices.
Additionally, I have performance measurements from internal use in
production. On an aarch64 device we saw 19 second boot time improvement
from switching from lzma to zstd (27 seconds to 8 seconds). On an x86_64
device we saw a 9 second boot time reduction from switching from xz to
zstd.
Signed-off-by: Nick Terrell <[email protected]>
---
usr/Kconfig | 20 ++++++++++++++++++++
usr/Makefile | 1 +
2 files changed, 21 insertions(+)
diff --git a/usr/Kconfig b/usr/Kconfig
index bdf5bbd40727..43aca37d09b5 100644
--- a/usr/Kconfig
+++ b/usr/Kconfig
@@ -100,6 +100,15 @@ config RD_LZ4
Support loading of a LZ4 encoded initial ramdisk or cpio buffer
If unsure, say N.
+config RD_ZSTD
+ bool "Support initial ramdisk/ramfs compressed using ZSTD"
+ default y
+ depends on BLK_DEV_INITRD
+ select DECOMPRESS_ZSTD
+ help
+ Support loading of a ZSTD encoded initial ramdisk or cpio buffer.
+ If unsure, say N.
+
choice
prompt "Built-in initramfs compression mode"
depends on INITRAMFS_SOURCE != ""
@@ -207,4 +216,15 @@ config INITRAMFS_COMPRESSION_LZ4
If you choose this, keep in mind that most distros don't provide lz4
by default which could cause a build failure.
+config INITRAMFS_COMPRESSION_ZSTD
+ bool "ZSTD"
+ depends on RD_ZSTD
+ help
+ ZSTD is a compression algorithm targeting intermediate compression
+ with fast decompression speed. It will compress better than GZIP and
+ decompress around the same speed as LZO, but slower than LZ4.
+
+ If you choose this, keep in mind that you may need to install the zstd
+ tool to be able to compress the initram.
+
endchoice
diff --git a/usr/Makefile b/usr/Makefile
index c12e6b15ce72..b1a81a40eab1 100644
--- a/usr/Makefile
+++ b/usr/Makefile
@@ -15,6 +15,7 @@ compress-$(CONFIG_INITRAMFS_COMPRESSION_LZMA) := lzma
compress-$(CONFIG_INITRAMFS_COMPRESSION_XZ) := xzmisc
compress-$(CONFIG_INITRAMFS_COMPRESSION_LZO) := lzo
compress-$(CONFIG_INITRAMFS_COMPRESSION_LZ4) := lz4
+compress-$(CONFIG_INITRAMFS_COMPRESSION_ZSTD) := zstd
obj-$(CONFIG_BLK_DEV_INITRD) := initramfs_data.o
--
2.25.1
From: Nick Terrell <[email protected]>
* Add support for zstd compressed kernel
* Bump the heap size for zstd.
* Update the documentation.
Integrates the ZSTD decompression code to the x86 pre-boot code.
Zstandard requires slightly more memory during the kernel decompression
on x86 (192 KB vs 64 KB), and the memory usage is independent of the
window size.
This patch has been boot tested with both a zstd and gzip compressed
kernel on i386 and x86_64 using buildroot and QEMU.
Additionally, this has been tested in production on x86_64 devices.
We saw a 2 second boot time reduction by switching kernel compression
from xz to zstd.
Signed-off-by: Nick Terrell <[email protected]>
---
Documentation/x86/boot.rst | 6 +++---
arch/x86/Kconfig | 1 +
arch/x86/boot/compressed/Makefile | 5 ++++-
arch/x86/boot/compressed/misc.c | 4 ++++
arch/x86/include/asm/boot.h | 6 ++++--
5 files changed, 16 insertions(+), 6 deletions(-)
diff --git a/Documentation/x86/boot.rst b/Documentation/x86/boot.rst
index c9c201596c3e..cedcf4d49bf0 100644
--- a/Documentation/x86/boot.rst
+++ b/Documentation/x86/boot.rst
@@ -786,9 +786,9 @@ Protocol: 2.08+
uncompressed data should be determined using the standard magic
numbers. The currently supported compression formats are gzip
(magic numbers 1F 8B or 1F 9E), bzip2 (magic number 42 5A), LZMA
- (magic number 5D 00), XZ (magic number FD 37), and LZ4 (magic number
- 02 21). The uncompressed payload is currently always ELF (magic
- number 7F 45 4C 46).
+ (magic number 5D 00), XZ (magic number FD 37), LZ4 (magic number
+ 02 21) and ZSTD (magic number 28 B5). The uncompressed payload is
+ currently always ELF (magic number 7F 45 4C 46).
============ ==============
Field name: payload_length
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index beea77046f9b..12d88997a3a6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -183,6 +183,7 @@ config X86
select HAVE_KERNEL_LZMA
select HAVE_KERNEL_LZO
select HAVE_KERNEL_XZ
+ select HAVE_KERNEL_ZSTD
select HAVE_KPROBES
select HAVE_KPROBES_ON_FTRACE
select HAVE_FUNCTION_ERROR_INJECTION
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 26050ae0b27e..8233f598f15b 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -24,7 +24,7 @@ OBJECT_FILES_NON_STANDARD := y
KCOV_INSTRUMENT := n
targets := vmlinux vmlinux.bin vmlinux.bin.gz vmlinux.bin.bz2 vmlinux.bin.lzma \
- vmlinux.bin.xz vmlinux.bin.lzo vmlinux.bin.lz4
+ vmlinux.bin.xz vmlinux.bin.lzo vmlinux.bin.lz4 vmlinux.bin.zst
KBUILD_CFLAGS := -m$(BITS) -O2
KBUILD_CFLAGS += -fno-strict-aliasing $(call cc-option, -fPIE, -fPIC)
@@ -145,6 +145,8 @@ $(obj)/vmlinux.bin.lzo: $(vmlinux.bin.all-y) FORCE
$(call if_changed,lzo)
$(obj)/vmlinux.bin.lz4: $(vmlinux.bin.all-y) FORCE
$(call if_changed,lz4)
+$(obj)/vmlinux.bin.zst: $(vmlinux.bin.all-y) FORCE
+ $(call if_changed,zstd)
suffix-$(CONFIG_KERNEL_GZIP) := gz
suffix-$(CONFIG_KERNEL_BZIP2) := bz2
@@ -152,6 +154,7 @@ suffix-$(CONFIG_KERNEL_LZMA) := lzma
suffix-$(CONFIG_KERNEL_XZ) := xz
suffix-$(CONFIG_KERNEL_LZO) := lzo
suffix-$(CONFIG_KERNEL_LZ4) := lz4
+suffix-$(CONFIG_KERNEL_ZSTD) := zst
quiet_cmd_mkpiggy = MKPIGGY $@
cmd_mkpiggy = $(obj)/mkpiggy $< > $@
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 9652d5c2afda..39e592d0e0b4 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -77,6 +77,10 @@ static int lines, cols;
#ifdef CONFIG_KERNEL_LZ4
#include "../../../../lib/decompress_unlz4.c"
#endif
+
+#ifdef CONFIG_KERNEL_ZSTD
+#include "../../../../lib/decompress_unzstd.c"
+#endif
/*
* NOTE: When adding a new decompressor, please update the analysis in
* ../header.S.
diff --git a/arch/x86/include/asm/boot.h b/arch/x86/include/asm/boot.h
index 680c320363db..d6dd43d25d9f 100644
--- a/arch/x86/include/asm/boot.h
+++ b/arch/x86/include/asm/boot.h
@@ -24,9 +24,11 @@
# error "Invalid value for CONFIG_PHYSICAL_ALIGN"
#endif
-#ifdef CONFIG_KERNEL_BZIP2
+#if defined(CONFIG_KERNEL_BZIP2)
# define BOOT_HEAP_SIZE 0x400000
-#else /* !CONFIG_KERNEL_BZIP2 */
+#elif defined(CONFIG_KERNEL_ZSTD)
+# define BOOT_HEAP_SIZE 0x30000
+#else
# define BOOT_HEAP_SIZE 0x10000
#endif
--
2.25.1
From: Adam Borowski <[email protected]>
For now, that's arch/x86/boot/compressed/vmlinux.bin.zst but probably more
will come, thus let's be consistent with all other compressors.
Signed-off-by: Adam Borowski <[email protected]>
Signed-off-by: Nick Terrell <[email protected]>
---
.gitignore | 1 +
1 file changed, 1 insertion(+)
diff --git a/.gitignore b/.gitignore
index 72ef86a5570d..edb0191c294f 100644
--- a/.gitignore
+++ b/.gitignore
@@ -43,6 +43,7 @@
*.tab.[ch]
*.tar
*.xz
+*.zst
Module.symvers
modules.builtin
modules.order
--
2.25.1
From: Nick Terrell <[email protected]>
Bump the ZO_z_extra_bytes margin for zstd.
Zstd needs 3 bytes per 128 KB, and has a 22 byte fixed overhead.
Zstd needs to maintain 128 KB of space at all times, since that is
the maximum block size. See the comments regarding in-place
decompression added in lib/decompress_unzstd.c for details.
Signed-off-by: Nick Terrell <[email protected]>
---
arch/x86/boot/header.S | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S
index 97d9b6d6c1af..b820875c5c95 100644
--- a/arch/x86/boot/header.S
+++ b/arch/x86/boot/header.S
@@ -536,8 +536,14 @@ pref_address: .quad LOAD_PHYSICAL_ADDR # preferred load addr
# the size-dependent part now grows so fast.
#
# extra_bytes = (uncompressed_size >> 8) + 65536
+#
+# ZSTD compressed data grows by at most 3 bytes per 128K, and only has a 22
+# byte fixed overhead but has a maximum block size of 128K, so it needs a
+# larger margin.
+#
+# extra_bytes = (uncompressed_size >> 8) + 131072
-#define ZO_z_extra_bytes ((ZO_z_output_len >> 8) + 65536)
+#define ZO_z_extra_bytes ((ZO_z_output_len >> 8) + 131072)
#if ZO_z_output_len > ZO_z_input_len
# define ZO_z_extract_offset (ZO_z_output_len + ZO_z_extra_bytes - \
ZO_z_input_len)
--
2.25.1
On Wed, Mar 25, 2020 at 12:58:41PM -0700, Nick Terrell wrote:
> From: Nick Terrell <[email protected]>
>
> Hi all,
>
> I attempted to send this patch on March 19, but it got rejected by the
> mailing lists for some reason, so I'm resending it.
>
> This patch set adds support for a ZSTD-compressed kernel, ramdisk, and
> initramfs in the kernel boot process. ZSTD-compressed ramdisk and initramfs
> are supported on all architectures. The ZSTD-compressed kernel is only
> hooked up to x86 in this patch set.
>
> Zstandard requires slightly more memory during the kernel decompression
> on x86 (192 KB vs 64 KB), and the memory usage is independent of the
> window size.
>
> Zstandard requires memory proprortional to the window size used during
> compression for decompressing the ramdisk image, since streaming mode is
> used. Newer versions of zstd (1.3.2+) list the window size of a file
> with `zstd -lv <file>'. The absolute maximum amount of memory required
> is just over 8 MB, but it can be controlled at compression time.
>
> This patch set has been boot tested with buildroot and QEMU based off
> of linux-5.6-rc6.
>
> On i386 and x86_64 I have tested the following configurations:
> * zstd compressed kernel and a separate zstd compressed initramfs
> * zstd compressed kernel and a built-in zstd compressed initramfs
> * gzip compressed kernel and a separate gzip compressed initramfs
> * gzip compressed kernel and a built-in gzip compressed initramfs
>
> On arm and aarch64 I tested the same configurations, except that the kernel is
> always gzip compressed.
>
> Facebook has been using v1 of these patches on x86_64 devices for more than 6
> months. When we switched from a xz compressed initramfs to a zstd compressed
> initramfs decompression time shrunk from 12 seconds to 3 seconds. When we
> switched from a xz compressed kernel to a zstd compressed kernel we saved 2
> seconds of boot time.
>
> Facebook has been using v2 of these patches on aarch64 devices for a few weeks.
> When we switched from an lzma compressed initramfs to a zstd compressed initramfs
> decompression time shrunk from 27 seconds to 8 seconds.
>
> The zstd compressed kernel is smaller than the gzip compressed kernel but larger
> than the xz or lzma compressed kernels, and it decompresses faster than
> everything except lz4. See the table below for the measurement of an x86_64
> kernel ordered by compressed size:
>
> algo size
> xz 6,509,792
> lzma 6,856,576
> zstd 7,399,157
> gzip 8,522,527
> bzip 8,629,603
> lzo 9,808,035
> lz4 10,705,570
> none 32,565,672
>
> v1 -> v2:
> - Rebase
> - usr/Makefile and init/Kconfig were changed so the patches were updated
> - No functional changes except to rebase
> - Split the patches up into smaller chunks
>
> v2 -> v3:
> - Add *.zst to the .gitignore in patch 8
> - Style nits in patch 3
> - Rename the PREBOOT macro to ZSTD_PREBOOT and XXH_PREBOOT in patches
> 1 through 3
>
> Best,
> Nick Terrell
>
Hi! Just resending Reviewed-by, since it looks like the earlier v3
didn't make it to the mailing list, as Nick just pointed out to me:
https://lore.kernel.org/lkml/202003201109.00FDE311A7@keescook/
Reviewed-by: Kees Cook <[email protected]>
-Kees
> Adam Borowski (1):
> .gitignore: add ZSTD-compressed files
>
> Nick Terrell (7):
> lib: prepare zstd for preboot environment
> lib: prepare xxhash for preboot environment
> lib: add zstd support to decompress
> init: add support for zstd compressed kernel
> usr: add support for zstd compressed initramfs
> x86: bump ZO_z_extra_bytes margin for zstd
> x86: Add support for ZSTD compressed kernel
>
> .gitignore | 1 +
> Documentation/x86/boot.rst | 6 +-
> arch/x86/Kconfig | 1 +
> arch/x86/boot/compressed/Makefile | 5 +-
> arch/x86/boot/compressed/misc.c | 4 +
> arch/x86/boot/header.S | 8 +-
> arch/x86/include/asm/boot.h | 6 +-
> include/linux/decompress/unzstd.h | 11 +
> init/Kconfig | 15 +-
> lib/Kconfig | 4 +
> lib/Makefile | 1 +
> lib/decompress.c | 5 +
> lib/decompress_unzstd.c | 338 ++++++++++++++++++++++++++++++
> lib/xxhash.c | 21 +-
> lib/zstd/decompress.c | 2 +
> lib/zstd/fse_decompress.c | 9 +-
> lib/zstd/zstd_internal.h | 14 +-
> scripts/Makefile.lib | 15 ++
> usr/Kconfig | 20 ++
> usr/Makefile | 1 +
> 20 files changed, 460 insertions(+), 27 deletions(-)
> create mode 100644 include/linux/decompress/unzstd.h
> create mode 100644 lib/decompress_unzstd.c
>
> --
> 2.25.1
>
--
Kees Cook
Hi Nick,
I finally got some time to review your patch, here are my comments:
On Wed, Mar 25, 2020 at 12:58:42PM -0700, Nick Terrell wrote:
> * Don't export symbols if ZSTD_PREBOOT is defined.
I'm not sure if this is needed. When I worked on my patch, I have found that
all exporting and modinfo macros generate symbols in modinfo and discard.ksym
sections, which are then dropped by the vmlinux linker script, thus one
will get the same binary independently if he puts this change in or not.
I'm not sure if this is intentional as there is also __DISABLE_EXPORTS define,
which should be used by a decompressor (according to comments in export.h).
> * Remove a double definition of the CHECK_F macro when the zstd
> library is amalgamated.
> * Switch ZSTD_copy8() to __builtin_memcpy(), because in the preboot
> environment on x86 gcc can't inline `memcpy()` otherwise.
> * Limit the gcc hack in ZSTD_wildcopy() to the broken gcc version. See
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81388.
No comments to the rest.
Petr
Hi!
On Wed, Mar 25, 2020 at 12:58:44PM -0700, Nick Terrell wrote:
> From: Nick Terrell <[email protected]>
> * Add unzstd() and the zstd decompress interface.
Here I do not understand why you limit the window size to 8MB even when
you read a larger value from the header. I do not see a reason why there
should be such a limitation at the first place and if there should be,
why it differs from ZSTD_WINDOWLOG_MAX.
I removed that limitation to be able to test it in my environment and I
found the performance is worst than with my patch by roughly 20% (on
i7-3520M), which is a major drawback considering the main motivation
to use zstd is the decompression speed. I will test on arm as well and
share the result tomorrow.
Petr
> On Mar 26, 2020, at 8:40 AM, Petr Malat <[email protected]> wrote:
>
> Hi Nick,
> I finally got some time to review your patch, here are my comments:
>
> On Wed, Mar 25, 2020 at 12:58:42PM -0700, Nick Terrell wrote:
>> * Don't export symbols if ZSTD_PREBOOT is defined.
> I'm not sure if this is needed. When I worked on my patch, I have found that
> all exporting and modinfo macros generate symbols in modinfo and discard.ksym
> sections, which are then dropped by the vmlinux linker script, thus one
> will get the same binary independently if he puts this change in or not.
>
> I'm not sure if this is intentional as there is also __DISABLE_EXPORTS define,
> which should be used by a decompressor (according to comments in export.h).
This is not my area of expertise, I’m a zstd developer not a kernel developer.
For that reason I wanted to pick the safe route of disabling the exports explicitly,
since that is what the other decompressors do, so I know it works.
If you’re confident that it isn’t necessary, I can drop the modification. But, I do
prefer this approach, because there is no magic involved. I can see what is
happening clearly just by looking at the decompress_unzstd.c code.
Thanks for the review,
Nick
> On Mar 26, 2020, at 9:47 AM, Petr Malat <[email protected]> wrote:
>
> Hi!
> On Wed, Mar 25, 2020 at 12:58:44PM -0700, Nick Terrell wrote:
>> From: Nick Terrell <[email protected]>
>> * Add unzstd() and the zstd decompress interface.
> Here I do not understand why you limit the window size to 8MB even when
> you read a larger value from the header. I do not see a reason why there
> should be such a limitation at the first place and if there should be,
> why it differs from ZSTD_WINDOWLOG_MAX.
When we are doing streaming decompression (either flush or fill is provided)
we have to allocate memory proportional to the window size. We want to
bound that memory so we don’t accidentally allocate too much memory.
When we are doing a single-pass decompression (neither flush nor fill
are provided) the window size doesn’t matter, and we only have to allocate
a fixed amount of memory ~192 KB.
The zstd spec [0] specifies that all decoders should allow window sizes
up to 8 MB. Additionally, the zstd CLI won’t produce window sizes greater
than 8 MB by default. The window size is controlled by the compression
level, and can be explicitly set.
I would expect larger window sizes to be beneficial for compression ratio,
though there is demising returns. I would expect that for kernel image
compression larger window sizes are beneficial, since it is decompressed
with a single pass. For initramfs decompression, I would expect that limiting
the window size could help decompression speed, since it uses streaming
compression, so unzstd() has to allocate a buffer of window size bytes.
> I removed that limitation to be able to test it in my environment and I
> found the performance is worst than with my patch by roughly 20% (on
> i7-3520M), which is a major drawback considering the main motivation
> to use zstd is the decompression speed. I will test on arm as well and
> share the result tomorrow.
> Petr
What do you mean by that? Can you share with me the test you ran?
Is this for kernel decompression or initramfs decompression?
Best,
Nick
[0] https://tools.ietf.org/html/rfc8478#section-3.1.1.1.2
Hi!
On Thu, Mar 26, 2020 at 07:03:54PM +0000, Nick Terrell wrote:
> >> * Add unzstd() and the zstd decompress interface.
> > Here I do not understand why you limit the window size to 8MB even when
> > you read a larger value from the header. I do not see a reason why there
> > should be such a limitation at the first place and if there should be,
> > why it differs from ZSTD_WINDOWLOG_MAX.
>
> When we are doing streaming decompression (either flush or fill is provided)
> we have to allocate memory proportional to the window size. We want to
> bound that memory so we don't accidentally allocate too much memory.
> When we are doing a single-pass decompression (neither flush nor fill
> are provided) the window size doesn't matter, and we only have to allocate
> a fixed amount of memory ~192 KB.
>
> The zstd spec [0] specifies that all decoders should allow window sizes
> up to 8 MB. Additionally, the zstd CLI won't produce window sizes greater
> than 8 MB by default. The window size is controlled by the compression
> level, and can be explicitly set.
Yes, one needs to pass --ultra option to zstd to produce an incompatible
archive, but that doesn't justify the reason to limit this in the kernel,
especially if one is able to read the needed window size from the header
when allocating the memory. At the time when initramfs is extracted,
there usually is memory available as it's before any processes are
started and this memory is reclaimed after the decompression.
If, on the other hand, an user makes an initramfs for a memory constrained
system, he limits the window size while compressing the archive and
the small window size will be announced in the header.
The only scenario where using the hard-coded limit makes sense is in a
case the window size is not available (I'm not sure if it's mandatory
to provide it). That's how my code works - if the size is available,
it uses the provided value, if not it uses 1 << ZSTD_WINDOWLOG_MAX.
I would also agree a fixed limit would make a sense if a user (or network)
provided data would be used, but in this case only the system owner is able
to provide an initramfs. If one is able to change initramfs, he can render
the system unusable simply by providing a corrupted file. He doesn't have
to bother making the window bigger than the available memory.
> I would expect larger window sizes to be beneficial for compression ratio,
> though there is demising returns. I would expect that for kernel image
> compression larger window sizes are beneficial, since it is decompressed
> with a single pass. For initramfs decompression, I would expect that limiting
> the window size could help decompression speed, since it uses streaming
> compression, so unzstd() has to allocate a buffer of window size bytes.
Yes, larger window improves the compression ratio, see here a comparison
between level 19 and 22 on my testing x86-64 initramfs:
30775022 rootfs.cpio.zst-19
28755429 rootfs.cpio.zst-22
These 7% can be noticeable when one has a slow storage, e.g. a flash memory
on SPI bus.
> > I removed that limitation to be able to test it in my environment and I
> > found the performance is worst than with my patch by roughly 20% (on
> > i7-3520M), which is a major drawback considering the main motivation
> > to use zstd is the decompression speed. I will test on arm as well and
> > share the result tomorrow.
> > Petr
>
> What do you mean by that? Can you share with me the test you ran?
> Is this for kernel decompression or initramfs decompression?
Initramfs - you can apply my v2 patch on v5.5 and try with your test data.
I have tested your patch also on ARMv7 platform and there the degradation
was 8%.
Petr
> On Mar 26, 2020, at 1:16 PM, Petr Malat <[email protected]> wrote:
>
> Hi!
> On Thu, Mar 26, 2020 at 07:03:54PM +0000, Nick Terrell wrote:
>>>> * Add unzstd() and the zstd decompress interface.
>>> Here I do not understand why you limit the window size to 8MB even when
>>> you read a larger value from the header. I do not see a reason why there
>>> should be such a limitation at the first place and if there should be,
>>> why it differs from ZSTD_WINDOWLOG_MAX.
>>
>> When we are doing streaming decompression (either flush or fill is provided)
>> we have to allocate memory proportional to the window size. We want to
>> bound that memory so we don't accidentally allocate too much memory.
>> When we are doing a single-pass decompression (neither flush nor fill
>> are provided) the window size doesn't matter, and we only have to allocate
>> a fixed amount of memory ~192 KB.
>>
>> The zstd spec [0] specifies that all decoders should allow window sizes
>> up to 8 MB. Additionally, the zstd CLI won't produce window sizes greater
>> than 8 MB by default. The window size is controlled by the compression
>> level, and can be explicitly set.
> Yes, one needs to pass --ultra option to zstd to produce an incompatible
> archive, but that doesn't justify the reason to limit this in the kernel,
> especially if one is able to read the needed window size from the header
> when allocating the memory. At the time when initramfs is extracted,
> there usually is memory available as it's before any processes are
> started and this memory is reclaimed after the decompression.
I’m happy to increase this limit. I set it to 8 MB to be conservative, but I am
happy to increase it to 128 MB == 1 << ZSTD_WINDOWLOG_MAX. I will
submit a new version with that change.
> If, on the other hand, an user makes an initramfs for a memory constrained
> system, he limits the window size while compressing the archive and
> the small window size will be announced in the header.
>
> The only scenario where using the hard-coded limit makes sense is in a
> case the window size is not available (I'm not sure if it's mandatory
> to provide it). That's how my code works - if the size is available,
> it uses the provided value, if not it uses 1 << ZSTD_WINDOWLOG_MAX.
>
> I would also agree a fixed limit would make a sense if a user (or network)
> provided data would be used, but in this case only the system owner is able
> to provide an initramfs. If one is able to change initramfs, he can render
> the system unusable simply by providing a corrupted file. He doesn't have
> to bother making the window bigger than the available memory.
That makes sense to me.
>> I would expect larger window sizes to be beneficial for compression ratio,
>> though there is demising returns. I would expect that for kernel image
>> compression larger window sizes are beneficial, since it is decompressed
>> with a single pass. For initramfs decompression, I would expect that limiting
>> the window size could help decompression speed, since it uses streaming
>> compression, so unzstd() has to allocate a buffer of window size bytes.
> Yes, larger window improves the compression ratio, see here a comparison
> between level 19 and 22 on my testing x86-64 initramfs:
> 30775022 rootfs.cpio.zst-19
> 28755429 rootfs.cpio.zst-22
> These 7% can be noticeable when one has a slow storage, e.g. a flash memory
> on SPI bus.
>
>>> I removed that limitation to be able to test it in my environment and I
>>> found the performance is worst than with my patch by roughly 20% (on
>>> i7-3520M), which is a major drawback considering the main motivation
>>> to use zstd is the decompression speed. I will test on arm as well and
>>> share the result tomorrow.
>>> Petr
>>
>> What do you mean by that? Can you share with me the test you ran?
>> Is this for kernel decompression or initramfs decompression?
> Initramfs - you can apply my v2 patch on v5.5 and try with your test data.
>
> I have tested your patch also on ARMv7 platform and there the degradation
> was 8%.
Are you comparing the performance of an 8 MB window size vs a 128 MB
window size?
> Petr
On Thu, Mar 26, 2020 at 09:13:54PM +0000, Nick Terrell wrote:
> >> What do you mean by that? Can you share with me the test you ran?
> >> Is this for kernel decompression or initramfs decompression?
> > Initramfs - you can apply my v2 patch on v5.5 and try with your test data.
> >
> > I have tested your patch also on ARMv7 platform and there the degradation
> > was 8%.
>
> Are you comparing the performance of an 8 MB window size vs a 128 MB
> window size?
No, I use the same initramfs file with 2 different kernels for the test. I have
removed
if (params.windowSize > ZSTD_WINDOWSIZE_MAX) goto out;
from your code.
Petr
> On Mar 26, 2020, at 2:44 PM, Petr Malat <[email protected]> wrote:
>
> On Thu, Mar 26, 2020 at 09:13:54PM +0000, Nick Terrell wrote:
>>>> What do you mean by that? Can you share with me the test you ran?
>>>> Is this for kernel decompression or initramfs decompression?
>>> Initramfs - you can apply my v2 patch on v5.5 and try with your test data.
>>>
>>> I have tested your patch also on ARMv7 platform and there the degradation
>>> was 8%.
>>
>> Are you comparing the performance of an 8 MB window size vs a 128 MB
>> window size?
> No, I use the same initramfs file with 2 different kernels for the test. I have
> removed
> if (params.windowSize > ZSTD_WINDOWSIZE_MAX) goto out;
> from your code.
Thanks for the clarification. I will try to reproduce the speed difference you’ve
measured before submitting v4 (that deletes the windowSize bound).
Initramfs passes the whole input buffer (doesn’t use fill), but does use flush.
Zstd always decompresses into an internal buffer, then copies into the
ZSTD_outBuffer. That means the only functional difference between the two
versions for initramfs should be that I will call flush() 4 KB at a time, and you
call flush 128 KB at a time. Naively I wouldn’t expect this to matter too much,
but I will measure.
Best,
Nick
> On Mar 26, 2020, at 1:16 PM, Petr Malat <[email protected]> wrote:
>
> Hi!
> On Thu, Mar 26, 2020 at 07:03:54PM +0000, Nick Terrell wrote:
>>>> * Add unzstd() and the zstd decompress interface.
>>> Here I do not understand why you limit the window size to 8MB even when
>>> you read a larger value from the header. I do not see a reason why there
>>> should be such a limitation at the first place and if there should be,
>>> why it differs from ZSTD_WINDOWLOG_MAX.
>>
>> When we are doing streaming decompression (either flush or fill is provided)
>> we have to allocate memory proportional to the window size. We want to
>> bound that memory so we don't accidentally allocate too much memory.
>> When we are doing a single-pass decompression (neither flush nor fill
>> are provided) the window size doesn't matter, and we only have to allocate
>> a fixed amount of memory ~192 KB.
>>
>> The zstd spec [0] specifies that all decoders should allow window sizes
>> up to 8 MB. Additionally, the zstd CLI won't produce window sizes greater
>> than 8 MB by default. The window size is controlled by the compression
>> level, and can be explicitly set.
> Yes, one needs to pass --ultra option to zstd to produce an incompatible
> archive, but that doesn't justify the reason to limit this in the kernel,
> especially if one is able to read the needed window size from the header
> when allocating the memory. At the time when initramfs is extracted,
> there usually is memory available as it's before any processes are
> started and this memory is reclaimed after the decompression.
>
> If, on the other hand, an user makes an initramfs for a memory constrained
> system, he limits the window size while compressing the archive and
> the small window size will be announced in the header.
>
> The only scenario where using the hard-coded limit makes sense is in a
> case the window size is not available (I'm not sure if it's mandatory
> to provide it). That's how my code works - if the size is available,
> it uses the provided value, if not it uses 1 << ZSTD_WINDOWLOG_MAX.
>
> I would also agree a fixed limit would make a sense if a user (or network)
> provided data would be used, but in this case only the system owner is able
> to provide an initramfs. If one is able to change initramfs, he can render
> the system unusable simply by providing a corrupted file. He doesn't have
> to bother making the window bigger than the available memory.
>
>> I would expect larger window sizes to be beneficial for compression ratio,
>> though there is demising returns. I would expect that for kernel image
>> compression larger window sizes are beneficial, since it is decompressed
>> with a single pass. For initramfs decompression, I would expect that limiting
>> the window size could help decompression speed, since it uses streaming
>> compression, so unzstd() has to allocate a buffer of window size bytes.
> Yes, larger window improves the compression ratio, see here a comparison
> between level 19 and 22 on my testing x86-64 initramfs:
> 30775022 rootfs.cpio.zst-19
> 28755429 rootfs.cpio.zst-22
> These 7% can be noticeable when one has a slow storage, e.g. a flash memory
> on SPI bus.
>
>>> I removed that limitation to be able to test it in my environment and I
>>> found the performance is worst than with my patch by roughly 20% (on
>>> i7-3520M), which is a major drawback considering the main motivation
>>> to use zstd is the decompression speed. I will test on arm as well and
>>> share the result tomorrow.
>>> Petr
>>
>> What do you mean by that? Can you share with me the test you ran?
>> Is this for kernel decompression or initramfs decompression?
> Initramfs - you can apply my v2 patch on v5.5 and try with your test data.
>
> I have tested your patch also on ARMv7 platform and there the degradation
> was 8%.
Thanks again for measuring the speed differences between the two patchsets!
I’ve found that the difference in performance between our two patchsets is
caused by the output buffer size. I expect this is due to calling flush() more often,
since that is a complex state machine in initramfs’s use case.
I’ve measured the speed of this patch set (v3), compared against this patch set
with a 128 KB buffer size (ZSTD_DStreamOutSize()), vs Petr’s patchset. I’m
measuring on an Intel i9-9900K with turbo disabled on CPU 0. I’m booting the
kernel using QEMU.
To measure the initramfs decompression speed I look at the difference in timestamp
between “Unpacking initramfs…” and “Freeing ignited memory”. The initramfs is
compressed using level 19, but results for level 22 are similar. Times are reported
in seconds. I ran each test 3 times and took the median time, but the results are
very stable. On ARM the initramfs is 26 MB. On x86-64 the initramfs is 97 MB.
Arch v3 128 Petr
Arm 1.67 1.52 1.55
x64 1.76 1.69 1.66
The results for my patch are slightly better on ARM, yours are slightly better on x86.
In v4 of my patchset, which I will send out tonight, I will increase ZSTD_IOBUF_SIZE
to 128 KB (as well as remove the 8 MB window size limit). Please let me know if your
results align with mine on v4.
I’ve also measured the x86_64 zstd kernel decompression speed using our two
patchsets. I measured it by the timing between the “Decompressing Linux…” message
and the “Parsing ELF” message with this script [0]. I used the same technique for
measurement as above. The kernel I am testing is compressed at level 19 with my
patchset and at level 19 with a window size of 4 MB with your patchset.
I found that my patchset takes 70ms to decompress and yours takes 318ms. Your
patchset also uses 4 MB of heap memory, where mine only needs 192 KB. The difference
is caused by two things:
1. memcpy() is replaced by __builtin_memcpy() in patch 1 of my set. This is the
core of the decompression hot loop, and without it the compiler can’t inline memcpy.
2. My patchset calls decompress_single() when neither flush nor fill are provided,
like when decompressing the kernel. This saves the 4 MB of memory, as well as
speeds up decompression a little bit.
Best,
Nick Terrell
[0] https://gist.github.com/terrelln/9bd53321a669f62683c608af8944fbc2