2018-03-21 04:40:41

by Maninder Singh

Subject: [PATCH 0/1] cover-letter/lz4: Implement lz4 with dynamic offset length.

(Cover letter added to avoid putting too much text in the patch description.)

The LZ4 specification defines a 2-byte offset length for 64 KB of data.
But in the case of ZRAM we compress data per page, and on most
architectures PAGE_SIZE is 4 KB. So we can decide the offset length based
on the actual offset value. For this we reserve 1 bit to encode the offset
length (1 byte or 2 bytes). 2 bytes are required only if the offset is
greater than 127; otherwise 1 byte is enough.

With this new implementation, the offset value can be at most 32 KB.

Thus we can save more memory for compressed data.
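The 1-bit offset-length scheme described above can be sketched as a small stand-alone C example. The helper names are illustrative, not part of the patch; the encoding matches the patch's layout (low bit of the first byte is DYN_BIT, value stored in the remaining bits, little-endian for the 2-byte form):

```c
#include <stddef.h>
#include <stdint.h>

/* Encode 'offset' (1..32767) into buf; returns bytes written.
 * Offsets > 127 need 2 bytes (DYN_BIT set); smaller ones need 1 byte. */
static size_t encode_dyn_offset(uint8_t *buf, uint16_t offset)
{
	if (offset > 127) {
		uint16_t v = (uint16_t)((offset << 1) | 1); /* DYN_BIT set */
		buf[0] = (uint8_t)(v & 0xff);               /* little-endian */
		buf[1] = (uint8_t)(v >> 8);
		return 2;
	}
	buf[0] = (uint8_t)(offset << 1);                    /* DYN_BIT clear */
	return 1;
}

/* Decode one offset from buf; returns bytes consumed. */
static size_t decode_dyn_offset(const uint8_t *buf, uint16_t *offset)
{
	if (buf[0] & 1) {                                   /* 2-byte form */
		*offset = (uint16_t)((buf[0] | (buf[1] << 8)) >> 1);
		return 2;
	}
	*offset = buf[0] >> 1;                              /* 1-byte form */
	return 1;
}
```

With 15 usable bits the maximum representable offset is 32767, which is where the 32 KB limit above comes from.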

Results with the new implementation:

Compressed size for the same input source
(LZ4_DYN < LZO < LZ4)

LZO
=======
orig_data_size: 78917632
compr_data_size: 15894668
mem_used_total: 17117184

LZ4
========
orig_data_size: 78917632
compr_data_size: 16310717
mem_used_total: 17592320

LZ4_DYN
=======
orig_data_size: 78917632
compr_data_size: 15520506
mem_used_total: 16748544

Performance was checked with the tool below:
https://github.com/sergey-senozhatsky/zram-perf-test
# ./fio-perf-o-meter.sh /tmp/test-fio-zram-lz4 /tmp/test-fio-zram-lz4_dyn
Processing /tmp/test-fio-zram-lz4
Processing /tmp/test-fio-zram-lz4_dyn
#jobs1
WRITE: 1101.7MB/s 1197.7MB/s
WRITE: 799829KB/s 900838KB/s
READ: 2670.2MB/s 2649.5MB/s
READ: 2027.8MB/s 2039.9MB/s
READ: 603703KB/s 597855KB/s
WRITE: 602943KB/s 597103KB/s
READ: 680438KB/s 707986KB/s
WRITE: 679582KB/s 707095KB/s
#jobs2
WRITE: 1993.2MB/s 2121.2MB/s
WRITE: 1654.1MB/s 1700.2MB/s
READ: 5038.2MB/s 4970.9MB/s
READ: 3930.1MB/s 3908.5MB/s
READ: 1113.2MB/s 1117.4MB/s
WRITE: 1111.8MB/s 1115.2MB/s
READ: 1255.8MB/s 1286.5MB/s
WRITE: 1254.2MB/s 1284.9MB/s
#jobs3
WRITE: 2875.6MB/s 3010.3MB/s
WRITE: 2394.4MB/s 2363.2MB/s
READ: 7384.7MB/s 7314.3MB/s
READ: 5389.5MB/s 5427.6MB/s
READ: 1570.8MB/s 1557.3MB/s
WRITE: 1568.8MB/s 1555.3MB/s
READ: 1848.5MB/s 1854.0MB/s
WRITE: 1846.2MB/s 1851.7MB/s
#jobs4
WRITE: 3720.3MB/s 3077.4MB/s
WRITE: 3027.4MB/s 3072.8MB/s
READ: 9694.7MB/s 9822.6MB/s
READ: 6606.5MB/s 6617.2MB/s
READ: 1941.6MB/s 1966.8MB/s
WRITE: 1939.1MB/s 1964.3MB/s
READ: 2405.3MB/s 2347.5MB/s
WRITE: 2402.3MB/s 2344.5MB/s
#jobs5
WRITE: 3335.6MB/s 3360.7MB/s
WRITE: 2670.2MB/s 2677.9MB/s
READ: 9455.3MB/s 8782.2MB/s
READ: 6534.8MB/s 6501.7MB/s
READ: 1848.9MB/s 1858.3MB/s
WRITE: 1846.6MB/s 1855.1MB/s
READ: 2232.4MB/s 2223.7MB/s
WRITE: 2229.6MB/s 2220.9MB/s
#jobs6
WRITE: 3896.5MB/s 3772.9MB/s
WRITE: 3171.1MB/s 3109.4MB/s
READ: 11060MB/s 11120MB/s
READ: 7375.8MB/s 7384.7MB/s
READ: 2132.5MB/s 2133.1MB/s
WRITE: 2129.8MB/s 2131.3MB/s
READ: 2608.4MB/s 2627.3MB/s
WRITE: 2605.7MB/s 2623.2MB/s
#jobs7
WRITE: 4129.4MB/s 4083.2MB/s
WRITE: 3364.5MB/s 3384.4MB/s
READ: 12088MB/s 11062MB/s
READ: 7868.3MB/s 7851.5MB/s
READ: 2277.8MB/s 2291.6MB/s
WRITE: 2274.9MB/s 2288.7MB/s
READ: 2798.5MB/s 2890.1MB/s
WRITE: 2794.1MB/s 2887.4MB/s
#jobs8
WRITE: 4623.3MB/s 4794.9MB/s
WRITE: 3749.3MB/s 3676.9MB/s
READ: 12337MB/s 14076MB/s
READ: 8320.1MB/s 8229.4MB/s
READ: 2496.9MB/s 2486.3MB/s
WRITE: 2493.8MB/s 2483.2MB/s
READ: 3340.4MB/s 3370.6MB/s
WRITE: 3336.2MB/s 3366.4MB/s
#jobs9
WRITE: 4427.6MB/s 4341.3MB/s
WRITE: 3542.6MB/s 3597.2MB/s
READ: 10094MB/s 9888.5MB/s
READ: 7863.5MB/s 8119.9MB/s
READ: 2357.1MB/s 2382.1MB/s
WRITE: 2354.1MB/s 2379.1MB/s
READ: 2828.8MB/s 2826.2MB/s
WRITE: 2825.3MB/s 2822.7MB/s
#jobs10
WRITE: 4463.9MB/s 4327.7MB/s
WRITE: 3637.7MB/s 3592.4MB/s
READ: 10020MB/s 11118MB/s
READ: 7837.8MB/s 8098.7MB/s
READ: 2459.6MB/s 2406.5MB/s
WRITE: 2456.5MB/s 2403.4MB/s
READ: 2804.2MB/s 2829.8MB/s
WRITE: 2800.7MB/s 2826.2MB/s
jobs1 perfstat
stalled-cycles-frontend 20,23,52,25,317 ( 54.32%) 19,29,10,49,608 ( 54.50%)
instructions 44,62,30,88,401 ( 1.20) 42,50,67,71,907 ( 1.20)
branches 7,12,44,77,233 ( 738.975) 6,64,52,15,491 ( 725.584)
branch-misses 2,38,66,520 ( 0.33%) 2,04,33,819 ( 0.31%)
jobs2 perfstat
stalled-cycles-frontend 42,82,90,69,149 ( 56.63%) 41,58,70,01,387 ( 56.01%)
instructions 85,33,18,31,411 ( 1.13) 85,32,92,28,973 ( 1.15)
branches 13,35,34,99,713 ( 677.499) 13,34,97,00,453 ( 693.104)
branch-misses 4,50,17,075 ( 0.34%) 4,47,28,378 ( 0.34%)
jobs3 perfstat
stalled-cycles-frontend 66,01,57,23,062 ( 57.10%) 65,86,74,97,814 ( 57.30%)
instructions 1,28,18,27,80,041 ( 1.11) 1,28,04,92,91,306 ( 1.11)
branches 20,06,14,16,000 ( 651.453) 20,02,85,32,864 ( 652.536)
branch-misses 7,10,66,773 ( 0.35%) 7,12,75,728 ( 0.36%)
jobs4 perfstat
stalled-cycles-frontend 91,98,71,83,315 ( 58.09%) 93,70,91,50,920 ( 58.66%)
instructions 1,70,82,79,66,403 ( 1.08) 1,71,18,67,74,366 ( 1.07)
branches 26,73,53,03,398 ( 621.532) 26,80,89,38,054 ( 618.718)
branch-misses 9,82,07,177 ( 0.37%) 9,81,64,098 ( 0.37%)
jobs5 perfstat
stalled-cycles-frontend 1,47,29,71,29,605 ( 63.59%) 1,47,91,01,92,835 ( 63.86%)
instructions 2,18,90,41,63,988 ( 0.95) 2,18,55,73,09,594 ( 0.94)
branches 34,64,46,32,880 ( 553.209) 34,55,08,02,781 ( 551.953)
branch-misses 14,16,79,279 ( 0.41%) 13,84,85,054 ( 0.40%)
jobs6 perfstat
stalled-cycles-frontend 2,02,92,92,98,242 ( 66.70%) 2,05,33,49,39,627 ( 67.01%)
instructions 2,65,13,90,22,217 ( 0.87) 2,64,84,45,49,149 ( 0.86)
branches 42,11,54,07,400 ( 510.085) 42,03,58,57,789 ( 505.746)
branch-misses 17,71,33,628 ( 0.42%) 17,74,31,942 ( 0.42%)
jobs7 perfstat
stalled-cycles-frontend 2,79,22,74,37,283 ( 70.23%) 2,80,02,50,89,154 ( 70.48%)
instructions 3,11,90,38,02,741 ( 0.78) 3,09,20,69,87,835 ( 0.78)
branches 49,71,39,90,321 ( 460.940) 49,10,44,23,983 ( 455.686)
branch-misses 22,43,84,102 ( 0.45%) 21,96,67,440 ( 0.45%)
jobs8 perfstat
stalled-cycles-frontend 3,59,62,09,66,766 ( 73.38%) 3,58,04,85,16,351 ( 73.37%)
instructions 3,43,83,05,02,841 ( 0.70) 3,43,33,76,84,985 ( 0.70)
branches 54,02,15,25,784 ( 406.256) 53,91,13,38,774 ( 407.265)
branch-misses 25,20,35,507 ( 0.47%) 25,05,71,030 ( 0.46%)
jobs9 perfstat
stalled-cycles-frontend 4,15,33,64,48,628 ( 73.76%) 4,22,88,52,47,923 ( 74.16%)
instructions 3,90,79,09,16,552 ( 0.69) 3,91,12,92,41,516 ( 0.69)
branches 61,66,87,76,271 ( 403.896) 61,73,58,17,174 ( 399.363)
branch-misses 28,46,21,136 ( 0.46%) 28,45,74,774 ( 0.46%)
jobs10 perfstat
stalled-cycles-frontend 4,74,43,71,32,846 ( 74.30%) 4,66,34,70,59,452 ( 73.82%)
instructions 4,35,23,51,39,076 ( 0.68) 4,38,48,78,54,987 ( 0.69)
branches 68,72,17,08,212 ( 396.945) 69,48,52,50,280 ( 405.847)
branch-misses 31,73,62,053 ( 0.46%) 32,34,76,102 ( 0.47%)
seconds elapsed 11.470858891 10.862984653
seconds elapsed 11.802220972 11.348959061
seconds elapsed 11.847204652 11.850297919
seconds elapsed 12.352068602 12.853222188
seconds elapsed 16.162715423 16.355883496
seconds elapsed 16.605502317 16.855938732
seconds elapsed 18.108333660 18.108347866
seconds elapsed 18.621296174 18.354183020
seconds elapsed 22.366502860 22.357632546
seconds elapsed 24.362417439 24.363003009

Maninder Singh, Vaneet Narang (1):
lz4: Implement lz4 with dynamic offset (lz4_dyn).

crypto/lz4.c | 64 ++++++++++++++++++++++++++++++++-
drivers/block/zram/zcomp.c | 4 ++
fs/pstore/platform.c | 2 +-
include/linux/lz4.h | 15 ++++++--
lib/decompress_unlz4.c | 2 +-
lib/lz4/lz4_compress.c | 84 +++++++++++++++++++++++++++++++++++--------
lib/lz4/lz4_decompress.c | 56 ++++++++++++++++++++---------
lib/lz4/lz4defs.h | 11 ++++++
8 files changed, 197 insertions(+), 41 deletions(-)


2018-03-21 04:40:42

by Maninder Singh

Subject: [PATCH 1/1] lz4: Implement lz4 with dynamic offset length.

The LZ4 specification defines a 2-byte offset length for 64 KB of data.
But in the case of ZRAM we compress data per page, and on most
architectures PAGE_SIZE is 4 KB. So we can decide the offset length based
on the actual offset value. For this we reserve 1 bit to encode the offset
length (1 byte or 2 bytes). 2 bytes are required only if the offset is
greater than 127; otherwise 1 byte is enough.

With this new implementation, the offset value can be at most 32 KB.

Thus we can save more memory for compressed data.

Results with the new implementation:
LZO
=======
orig_data_size: 78917632
compr_data_size: 15894668
mem_used_total: 17117184

LZ4
========
orig_data_size: 78917632
compr_data_size: 16310717
mem_used_total: 17592320

LZ4_DYN
=======
orig_data_size: 78917632
compr_data_size: 15520506
mem_used_total: 16748544

Signed-off-by: Maninder Singh <[email protected]>
Signed-off-by: Vaneet Narang <[email protected]>
---
crypto/lz4.c | 64 ++++++++++++++++++++++++++++++++-
drivers/block/zram/zcomp.c | 4 ++
fs/pstore/platform.c | 2 +-
include/linux/lz4.h | 15 ++++++--
lib/decompress_unlz4.c | 2 +-
lib/lz4/lz4_compress.c | 84 +++++++++++++++++++++++++++++++++++--------
lib/lz4/lz4_decompress.c | 56 ++++++++++++++++++++---------
lib/lz4/lz4defs.h | 11 ++++++
8 files changed, 197 insertions(+), 41 deletions(-)

diff --git a/crypto/lz4.c b/crypto/lz4.c
index 2ce2660..f1a8a20 100644
--- a/crypto/lz4.c
+++ b/crypto/lz4.c
@@ -67,7 +67,20 @@ static int __lz4_compress_crypto(const u8 *src, unsigned int slen,
u8 *dst, unsigned int *dlen, void *ctx)
{
int out_len = LZ4_compress_default(src, dst,
- slen, *dlen, ctx);
+ slen, *dlen, ctx, false);
+
+ if (!out_len)
+ return -EINVAL;
+
+ *dlen = out_len;
+ return 0;
+}
+
+static int __lz4_compress_crypto_dynamic(const u8 *src, unsigned int slen,
+ u8 *dst, unsigned int *dlen, void *ctx)
+{
+ int out_len = LZ4_compress_default(src, dst,
+ slen, *dlen, ctx, true);

if (!out_len)
return -EINVAL;
@@ -91,10 +104,30 @@ static int lz4_compress_crypto(struct crypto_tfm *tfm, const u8 *src,
return __lz4_compress_crypto(src, slen, dst, dlen, ctx->lz4_comp_mem);
}

+static int lz4_compress_crypto_dynamic(struct crypto_tfm *tfm, const u8 *src,
+ unsigned int slen, u8 *dst, unsigned int *dlen)
+{
+ struct lz4_ctx *ctx = crypto_tfm_ctx(tfm);
+
+ return __lz4_compress_crypto_dynamic(src, slen, dst, dlen, ctx->lz4_comp_mem);
+}
+
static int __lz4_decompress_crypto(const u8 *src, unsigned int slen,
u8 *dst, unsigned int *dlen, void *ctx)
{
- int out_len = LZ4_decompress_safe(src, dst, slen, *dlen);
+ int out_len = LZ4_decompress_safe(src, dst, slen, *dlen, false);
+
+ if (out_len < 0)
+ return -EINVAL;
+
+ *dlen = out_len;
+ return 0;
+}
+
+static int __lz4_decompress_crypto_dynamic(const u8 *src, unsigned int slen,
+ u8 *dst, unsigned int *dlen, void *ctx)
+{
+ int out_len = LZ4_decompress_safe(src, dst, slen, *dlen, true);

if (out_len < 0)
return -EINVAL;
@@ -117,6 +150,13 @@ static int lz4_decompress_crypto(struct crypto_tfm *tfm, const u8 *src,
return __lz4_decompress_crypto(src, slen, dst, dlen, NULL);
}

+static int lz4_decompress_crypto_dynamic(struct crypto_tfm *tfm, const u8 *src,
+ unsigned int slen, u8 *dst,
+ unsigned int *dlen)
+{
+ return __lz4_decompress_crypto_dynamic(src, slen, dst, dlen, NULL);
+}
+
static struct crypto_alg alg_lz4 = {
.cra_name = "lz4",
.cra_flags = CRYPTO_ALG_TYPE_COMPRESS,
@@ -130,6 +170,19 @@ static int lz4_decompress_crypto(struct crypto_tfm *tfm, const u8 *src,
.coa_decompress = lz4_decompress_crypto } }
};

+static struct crypto_alg alg_lz4_dyn = {
+ .cra_name = "lz4_dyn",
+ .cra_flags = CRYPTO_ALG_TYPE_COMPRESS,
+ .cra_ctxsize = sizeof(struct lz4_ctx),
+ .cra_module = THIS_MODULE,
+ .cra_list = LIST_HEAD_INIT(alg_lz4_dyn.cra_list),
+ .cra_init = lz4_init,
+ .cra_exit = lz4_exit,
+ .cra_u = { .compress = {
+ .coa_compress = lz4_compress_crypto_dynamic,
+ .coa_decompress = lz4_decompress_crypto_dynamic } }
+};
+
static struct scomp_alg scomp = {
.alloc_ctx = lz4_alloc_ctx,
.free_ctx = lz4_free_ctx,
@@ -150,9 +203,16 @@ static int __init lz4_mod_init(void)
if (ret)
return ret;

+ ret = crypto_register_alg(&alg_lz4_dyn);
+ if (ret) {
+ crypto_unregister_alg(&alg_lz4);
+ return ret;
+ }
+
ret = crypto_register_scomp(&scomp);
if (ret) {
crypto_unregister_alg(&alg_lz4);
+ crypto_unregister_alg(&alg_lz4_dyn);
return ret;
}

diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
index 4ed0a78..5bc5aab 100644
--- a/drivers/block/zram/zcomp.c
+++ b/drivers/block/zram/zcomp.c
@@ -17,11 +17,15 @@
#include <linux/crypto.h>

#include "zcomp.h"
+#define KB (1 << 10)

static const char * const backends[] = {
"lzo",
#if IS_ENABLED(CONFIG_CRYPTO_LZ4)
"lz4",
+#if (PAGE_SIZE < (32 * KB))
+ "lz4_dyn",
+#endif
#endif
#if IS_ENABLED(CONFIG_CRYPTO_LZ4HC)
"lz4hc",
diff --git a/fs/pstore/platform.c b/fs/pstore/platform.c
index 6910321..2b03449 100644
--- a/fs/pstore/platform.c
+++ b/fs/pstore/platform.c
@@ -342,7 +342,7 @@ static int compress_lz4(const void *in, void *out, size_t inlen, size_t outlen)
{
int ret;

- ret = LZ4_compress_default(in, out, inlen, outlen, workspace);
+ ret = LZ4_compress_default(in, out, inlen, outlen, workspace, false);
if (!ret) {
pr_err("LZ4_compress_default error; compression failed!\n");
return -EIO;
diff --git a/include/linux/lz4.h b/include/linux/lz4.h
index 394e3d9..08bb95d 100644
--- a/include/linux/lz4.h
+++ b/include/linux/lz4.h
@@ -181,6 +181,9 @@ static inline int LZ4_compressBound(size_t isize)
* which must be already allocated
* @wrkmem: address of the working memory.
* This requires 'workmem' of LZ4_MEM_COMPRESS.
+ * @dynOffset: true or false.
+ * true selects dynamic offsets (1 or 2 bytes, based on the offset value),
+ * false selects normal offsets (2 bytes for every offset value).
*
* Compresses 'sourceSize' bytes from buffer 'source'
* into already allocated 'dest' buffer of size 'maxOutputSize'.
@@ -195,7 +198,7 @@ static inline int LZ4_compressBound(size_t isize)
* (necessarily <= maxOutputSize) or 0 if compression fails
*/
int LZ4_compress_default(const char *source, char *dest, int inputSize,
- int maxOutputSize, void *wrkmem);
+ int maxOutputSize, void *wrkmem, bool dynOffset);

/**
* LZ4_compress_fast() - As LZ4_compress_default providing an acceleration param
@@ -207,6 +210,9 @@ int LZ4_compress_default(const char *source, char *dest, int inputSize,
* @acceleration: acceleration factor
* @wrkmem: address of the working memory.
* This requires 'workmem' of LZ4_MEM_COMPRESS.
+ * @dynOffset: true or false.
+ * true selects dynamic offsets (1 or 2 bytes, based on the offset value),
+ * false selects normal offsets (2 bytes for every offset value).
*
* Same as LZ4_compress_default(), but allows to select an "acceleration"
* factor. The larger the acceleration value, the faster the algorithm,
@@ -219,7 +225,7 @@ int LZ4_compress_default(const char *source, char *dest, int inputSize,
* (necessarily <= maxOutputSize) or 0 if compression fails
*/
int LZ4_compress_fast(const char *source, char *dest, int inputSize,
- int maxOutputSize, int acceleration, void *wrkmem);
+ int maxOutputSize, int acceleration, void *wrkmem, bool dynOffset);

/**
* LZ4_compress_destSize() - Compress as much data as possible
@@ -277,6 +283,9 @@ int LZ4_compress_destSize(const char *source, char *dest, int *sourceSizePtr,
* which must be already allocated
* @compressedSize: is the precise full size of the compressed block
* @maxDecompressedSize: is the size of 'dest' buffer
+ * @dynOffset: true or false.
+ * true selects dynamic offsets (1 or 2 bytes, based on the offset value),
+ * false selects normal offsets (2 bytes for every offset value).
*
* Decompresses data from 'source' into 'dest'.
* If the source stream is detected malformed, the function will
@@ -290,7 +299,7 @@ int LZ4_compress_destSize(const char *source, char *dest, int *sourceSizePtr,
* or a negative result in case of error
*/
int LZ4_decompress_safe(const char *source, char *dest, int compressedSize,
- int maxDecompressedSize);
+ int maxDecompressedSize, bool dynOffset);

/**
* LZ4_decompress_safe_partial() - Decompress a block of size 'compressedSize'
diff --git a/lib/decompress_unlz4.c b/lib/decompress_unlz4.c
index 1b0baf3..8be2faa 100644
--- a/lib/decompress_unlz4.c
+++ b/lib/decompress_unlz4.c
@@ -158,7 +158,7 @@ STATIC inline int INIT unlz4(u8 *input, long in_len,
#else
dest_len = uncomp_chunksize;

- ret = LZ4_decompress_safe(inp, outp, chunksize, dest_len);
+ ret = LZ4_decompress_safe(inp, outp, chunksize, dest_len, false);
dest_len = ret;
#endif
if (ret < 0) {
diff --git a/lib/lz4/lz4_compress.c b/lib/lz4/lz4_compress.c
index cc7b6d4..185c358 100644
--- a/lib/lz4/lz4_compress.c
+++ b/lib/lz4/lz4_compress.c
@@ -183,7 +183,8 @@ static FORCE_INLINE int LZ4_compress_generic(
const tableType_t tableType,
const dict_directive dict,
const dictIssue_directive dictIssue,
- const U32 acceleration)
+ const U32 acceleration,
+ const Dynamic_Offset dynOffset)
{
const BYTE *ip = (const BYTE *) source;
const BYTE *base;
@@ -199,6 +200,7 @@ static FORCE_INLINE int LZ4_compress_generic(

BYTE *op = (BYTE *) dest;
BYTE * const olimit = op + maxOutputSize;
+ int max_distance = dynOffset ? MAX_DISTANCE_DYN : MAX_DISTANCE;

U32 forwardH;
size_t refDelta = 0;
@@ -245,6 +247,7 @@ static FORCE_INLINE int LZ4_compress_generic(
for ( ; ; ) {
const BYTE *match;
BYTE *token;
+ int curr_offset;

/* Find a match */
{
@@ -285,7 +288,7 @@ static FORCE_INLINE int LZ4_compress_generic(
: 0)
|| ((tableType == byU16)
? 0
- : (match + MAX_DISTANCE < ip))
+ : (match + max_distance < ip))
|| (LZ4_read32(match + refDelta)
!= LZ4_read32(ip)));
}
@@ -328,8 +331,26 @@ static FORCE_INLINE int LZ4_compress_generic(

_next_match:
/* Encode Offset */
- LZ4_writeLE16(op, (U16)(ip - match));
- op += 2;
+ if (dynOffset) {
+ curr_offset = (U16)(ip - match);
+
+ /*
+ * If the offset is greater than 127, we need 2 bytes
+ * to store it. Otherwise 1 byte is enough.
+ */
+ if (curr_offset > 127) {
+ curr_offset = (curr_offset << 1) | DYN_BIT;
+ LZ4_writeLE16(op, (U16)curr_offset);
+ op += 2;
+ } else {
+ curr_offset = curr_offset << 1;
+ *op = (BYTE)curr_offset;
+ op++;
+ }
+ } else {
+ LZ4_writeLE16(op, (U16)(ip - match));
+ op += 2;
+ }

/* Encode MatchLength */
{
@@ -480,39 +501,70 @@ static int LZ4_compress_fast_extState(
return LZ4_compress_generic(ctx, source,
dest, inputSize, 0,
noLimit, byU16, noDict,
- noDictIssue, acceleration);
+ noDictIssue, acceleration, NoDynOffset);
else
return LZ4_compress_generic(ctx, source,
dest, inputSize, 0,
noLimit, tableType, noDict,
- noDictIssue, acceleration);
+ noDictIssue, acceleration, NoDynOffset);
} else {
if (inputSize < LZ4_64Klimit)
return LZ4_compress_generic(ctx, source,
dest, inputSize,
maxOutputSize, limitedOutput, byU16, noDict,
- noDictIssue, acceleration);
+ noDictIssue, acceleration, NoDynOffset);
else
return LZ4_compress_generic(ctx, source,
dest, inputSize,
maxOutputSize, limitedOutput, tableType, noDict,
- noDictIssue, acceleration);
+ noDictIssue, acceleration, NoDynOffset);
}
}

+static int LZ4_compress_fast_extState_dynamic(
+ void *state,
+ const char *source,
+ char *dest,
+ int inputSize,
+ int maxOutputSize,
+ int acceleration)
+{
+ LZ4_stream_t_internal *ctx = &((LZ4_stream_t *)state)->internal_donotuse;
+
+ LZ4_resetStream((LZ4_stream_t *)state);
+
+ if (acceleration < 1)
+ acceleration = LZ4_ACCELERATION_DEFAULT;
+
+ if (maxOutputSize >= LZ4_COMPRESSBOUND(inputSize))
+ return LZ4_compress_generic(ctx, source,
+ dest, inputSize, 0,
+ noLimit, byU16, noDict,
+ noDictIssue, acceleration, DynOffset);
+ else
+ return LZ4_compress_generic(ctx, source,
+ dest, inputSize,
+ maxOutputSize, limitedOutput, byU16, noDict,
+ noDictIssue, acceleration, DynOffset);
+}
+
int LZ4_compress_fast(const char *source, char *dest, int inputSize,
- int maxOutputSize, int acceleration, void *wrkmem)
+ int maxOutputSize, int acceleration, void *wrkmem, bool dynOffset)
{
- return LZ4_compress_fast_extState(wrkmem, source, dest, inputSize,
+ if (!dynOffset)
+ return LZ4_compress_fast_extState(wrkmem, source, dest, inputSize,
+ maxOutputSize, acceleration);
+
+ return LZ4_compress_fast_extState_dynamic(wrkmem, source, dest, inputSize,
maxOutputSize, acceleration);
}
EXPORT_SYMBOL(LZ4_compress_fast);

int LZ4_compress_default(const char *source, char *dest, int inputSize,
- int maxOutputSize, void *wrkmem)
+ int maxOutputSize, void *wrkmem, bool dynOffset)
{
return LZ4_compress_fast(source, dest, inputSize,
- maxOutputSize, LZ4_ACCELERATION_DEFAULT, wrkmem);
+ maxOutputSize, LZ4_ACCELERATION_DEFAULT, wrkmem, dynOffset);
}
EXPORT_SYMBOL(LZ4_compress_default);

@@ -900,12 +952,12 @@ int LZ4_compress_fast_continue(LZ4_stream_t *LZ4_stream, const char *source,
result = LZ4_compress_generic(
streamPtr, source, dest, inputSize,
maxOutputSize, limitedOutput, byU32,
- withPrefix64k, dictSmall, acceleration);
+ withPrefix64k, dictSmall, acceleration, NoDynOffset);
} else {
result = LZ4_compress_generic(
streamPtr, source, dest, inputSize,
maxOutputSize, limitedOutput, byU32,
- withPrefix64k, noDictIssue, acceleration);
+ withPrefix64k, noDictIssue, acceleration, NoDynOffset);
}
streamPtr->dictSize += (U32)inputSize;
streamPtr->currentOffset += (U32)inputSize;
@@ -921,12 +973,12 @@ int LZ4_compress_fast_continue(LZ4_stream_t *LZ4_stream, const char *source,
result = LZ4_compress_generic(
streamPtr, source, dest, inputSize,
maxOutputSize, limitedOutput, byU32,
- usingExtDict, dictSmall, acceleration);
+ usingExtDict, dictSmall, acceleration, NoDynOffset);
} else {
result = LZ4_compress_generic(
streamPtr, source, dest, inputSize,
maxOutputSize, limitedOutput, byU32,
- usingExtDict, noDictIssue, acceleration);
+ usingExtDict, noDictIssue, acceleration, NoDynOffset);
}
streamPtr->dictionary = (const BYTE *)source;
streamPtr->dictSize = (U32)inputSize;
diff --git a/lib/lz4/lz4_decompress.c b/lib/lz4/lz4_decompress.c
index 141734d..337a828 100644
--- a/lib/lz4/lz4_decompress.c
+++ b/lib/lz4/lz4_decompress.c
@@ -71,7 +71,9 @@ static FORCE_INLINE int LZ4_decompress_generic(
/* only if dict == usingExtDict */
const BYTE * const dictStart,
/* note : = 0 if noDict */
- const size_t dictSize
+ const size_t dictSize,
+ /* dynOffset == 1 : use dynamic offset encoding */
+ const Dynamic_Offset dynOffset
)
{
/* Local Variables */
@@ -141,8 +143,8 @@ static FORCE_INLINE int LZ4_decompress_generic(
/* copy literals */
cpy = op + length;
if (((endOnInput) && ((cpy > (partialDecoding ? oexit : oend - MFLIMIT))
- || (ip + length > iend - (2 + 1 + LASTLITERALS))))
- || ((!endOnInput) && (cpy > oend - WILDCOPYLENGTH))) {
+ || (ip + length > iend - (2 + LASTLITERALS))))
+ || ((!endOnInput) && (cpy > oend - WILDCOPYLENGTH - 1))) {
if (partialDecoding) {
if (cpy > oend) {
/*
@@ -188,13 +190,31 @@ static FORCE_INLINE int LZ4_decompress_generic(
break;
}

- LZ4_wildCopy(op, ip, cpy);
+ if (dynOffset && length < 4)
+ LZ4_copy4(op, ip);
+ else
+ LZ4_wildCopy(op, ip, cpy);
+
ip += length;
op = cpy;

/* get offset */
- offset = LZ4_readLE16(ip);
- ip += 2;
+ if (dynOffset) {
+ /*
+ * If DYN_BIT is set, the offset is 2 bytes;
+ * otherwise it is 1 byte.
+ */
+ if (*ip & DYN_BIT) {
+ offset = LZ4_readLE16(ip) >> 1;
+ ip += 2;
+ } else {
+ offset = *ip >> 1;
+ ip += 1;
+ }
+ } else {
+ offset = LZ4_readLE16(ip);
+ ip += 2;
+ }
match = op - offset;

if ((checkOffset) && (unlikely(match < lowLimit))) {
@@ -335,11 +355,11 @@ static FORCE_INLINE int LZ4_decompress_generic(
}

int LZ4_decompress_safe(const char *source, char *dest,
- int compressedSize, int maxDecompressedSize)
+ int compressedSize, int maxDecompressedSize, bool dynOffset)
{
return LZ4_decompress_generic(source, dest, compressedSize,
maxDecompressedSize, endOnInputSize, full, 0,
- noDict, (BYTE *)dest, NULL, 0);
+ noDict, (BYTE *)dest, NULL, 0, dynOffset);
}

int LZ4_decompress_safe_partial(const char *source, char *dest,
@@ -347,14 +367,14 @@ int LZ4_decompress_safe_partial(const char *source, char *dest,
{
return LZ4_decompress_generic(source, dest, compressedSize,
maxDecompressedSize, endOnInputSize, partial,
- targetOutputSize, noDict, (BYTE *)dest, NULL, 0);
+ targetOutputSize, noDict, (BYTE *)dest, NULL, 0, NoDynOffset);
}

int LZ4_decompress_fast(const char *source, char *dest, int originalSize)
{
return LZ4_decompress_generic(source, dest, 0, originalSize,
endOnOutputSize, full, 0, withPrefix64k,
- (BYTE *)(dest - 64 * KB), NULL, 64 * KB);
+ (BYTE *)(dest - 64 * KB), NULL, 64 * KB, NoDynOffset);
}

int LZ4_setStreamDecode(LZ4_streamDecode_t *LZ4_streamDecode,
@@ -392,7 +412,7 @@ int LZ4_decompress_safe_continue(LZ4_streamDecode_t *LZ4_streamDecode,
endOnInputSize, full, 0,
usingExtDict, lz4sd->prefixEnd - lz4sd->prefixSize,
lz4sd->externalDict,
- lz4sd->extDictSize);
+ lz4sd->extDictSize, NoDynOffset);

if (result <= 0)
return result;
@@ -406,7 +426,7 @@ int LZ4_decompress_safe_continue(LZ4_streamDecode_t *LZ4_streamDecode,
compressedSize, maxOutputSize,
endOnInputSize, full, 0,
usingExtDict, (BYTE *)dest,
- lz4sd->externalDict, lz4sd->extDictSize);
+ lz4sd->externalDict, lz4sd->extDictSize, NoDynOffset);
if (result <= 0)
return result;
lz4sd->prefixSize = result;
@@ -427,7 +447,7 @@ int LZ4_decompress_fast_continue(LZ4_streamDecode_t *LZ4_streamDecode,
endOnOutputSize, full, 0,
usingExtDict,
lz4sd->prefixEnd - lz4sd->prefixSize,
- lz4sd->externalDict, lz4sd->extDictSize);
+ lz4sd->externalDict, lz4sd->extDictSize, NoDynOffset);

if (result <= 0)
return result;
@@ -440,7 +460,7 @@ int LZ4_decompress_fast_continue(LZ4_streamDecode_t *LZ4_streamDecode,
result = LZ4_decompress_generic(source, dest, 0, originalSize,
endOnOutputSize, full, 0,
usingExtDict, (BYTE *)dest,
- lz4sd->externalDict, lz4sd->extDictSize);
+ lz4sd->externalDict, lz4sd->extDictSize, NoDynOffset);
if (result <= 0)
return result;
lz4sd->prefixSize = originalSize;
@@ -463,19 +483,19 @@ static FORCE_INLINE int LZ4_decompress_usingDict_generic(const char *source,
if (dictSize == 0)
return LZ4_decompress_generic(source, dest,
compressedSize, maxOutputSize, safe, full, 0,
- noDict, (BYTE *)dest, NULL, 0);
+ noDict, (BYTE *)dest, NULL, 0, NoDynOffset);
if (dictStart + dictSize == dest) {
if (dictSize >= (int)(64 * KB - 1))
return LZ4_decompress_generic(source, dest,
compressedSize, maxOutputSize, safe, full, 0,
- withPrefix64k, (BYTE *)dest - 64 * KB, NULL, 0);
+ withPrefix64k, (BYTE *)dest - 64 * KB, NULL, 0, NoDynOffset);
return LZ4_decompress_generic(source, dest, compressedSize,
maxOutputSize, safe, full, 0, noDict,
- (BYTE *)dest - dictSize, NULL, 0);
+ (BYTE *)dest - dictSize, NULL, 0, NoDynOffset);
}
return LZ4_decompress_generic(source, dest, compressedSize,
maxOutputSize, safe, full, 0, usingExtDict,
- (BYTE *)dest, (const BYTE *)dictStart, dictSize);
+ (BYTE *)dest, (const BYTE *)dictStart, dictSize, NoDynOffset);
}

int LZ4_decompress_safe_usingDict(const char *source, char *dest,
diff --git a/lib/lz4/lz4defs.h b/lib/lz4/lz4defs.h
index 00a0b58..9451a73 100644
--- a/lib/lz4/lz4defs.h
+++ b/lib/lz4/lz4defs.h
@@ -75,6 +75,7 @@
#define WILDCOPYLENGTH 8
#define LASTLITERALS 5
#define MFLIMIT (WILDCOPYLENGTH + MINMATCH)
+#define DYN_BIT 0x1

/* Increase this value ==> compression run slower on incompressible data */
#define LZ4_SKIPTRIGGER 6
@@ -87,6 +88,7 @@

#define MAXD_LOG 16
#define MAX_DISTANCE ((1 << MAXD_LOG) - 1)
+#define MAX_DISTANCE_DYN ((1 << (MAXD_LOG - 1)) - 1)
#define STEPSIZE sizeof(size_t)

#define ML_BITS 4
@@ -147,6 +149,13 @@ static FORCE_INLINE void LZ4_copy8(void *dst, const void *src)
#endif
}

+static FORCE_INLINE void LZ4_copy4(void *dst, const void *src)
+{
+ U32 a = get_unaligned((const U32 *)src);
+
+ put_unaligned(a, (U32 *)dst);
+}
+
/*
* customized variant of memcpy,
* which can overwrite up to 7 bytes beyond dstEnd
@@ -224,4 +233,6 @@ static FORCE_INLINE unsigned int LZ4_count(
typedef enum { endOnOutputSize = 0, endOnInputSize = 1 } endCondition_directive;
typedef enum { full = 0, partial = 1 } earlyEnd_directive;

+typedef enum { NoDynOffset = 0, DynOffset = 1 } Dynamic_Offset;
+
#endif
--
1.7.1

2018-03-21 06:41:28

by Sergey Senozhatsky

Subject: Re: [PATCH 0/1] cover-letter/lz4: Implement lz4 with dynamic offset length.

On (03/21/18 10:10), Maninder Singh wrote:
> (Cover letter added to avoid putting too much text in the patch description.)
>
> The LZ4 specification defines a 2-byte offset length for 64 KB of data.
> But in the case of ZRAM we compress data per page, and on most
> architectures PAGE_SIZE is 4 KB. So we can decide the offset length based
> on the actual offset value. For this we reserve 1 bit to encode the offset
> length (1 byte or 2 bytes). 2 bytes are required only if the offset is
> greater than 127; otherwise 1 byte is enough.

So what happens if I compress the data on a system with no dyn
offset and then send it over the network to a machine which has
dyn offset? Or, say, I have a USB stick with a compression enabled
FS, store files on a dyn offset enabled PC and then mount that USB
stick on a machine with no dyn offset support. And vice versa.

-ss

2018-03-21 07:49:48

by Sergey Senozhatsky

Subject: Re: [PATCH 1/1] lz4: Implement lz4 with dynamic offset length.

On (03/21/18 10:10), Maninder Singh wrote:
[..]
> +static struct crypto_alg alg_lz4_dyn = {
> + .cra_name = "lz4_dyn",
> + .cra_flags = CRYPTO_ALG_TYPE_COMPRESS,
> + .cra_ctxsize = sizeof(struct lz4_ctx),
> + .cra_module = THIS_MODULE,
> + .cra_list = LIST_HEAD_INIT(alg_lz4_dyn.cra_list),
> + .cra_init = lz4_init,
> + .cra_exit = lz4_exit,
> + .cra_u = { .compress = {
> + .coa_compress = lz4_compress_crypto_dynamic,
> + .coa_decompress = lz4_decompress_crypto_dynamic } }
> +};

[..]

> diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
> index 4ed0a78..5bc5aab 100644
> --- a/drivers/block/zram/zcomp.c
> +++ b/drivers/block/zram/zcomp.c
> @@ -17,11 +17,15 @@
> #include <linux/crypto.h>
>
> #include "zcomp.h"
> +#define KB (1 << 10)
>
> static const char * const backends[] = {
> "lzo",
> #if IS_ENABLED(CONFIG_CRYPTO_LZ4)
> "lz4",
> +#if (PAGE_SIZE < (32 * KB))
> + "lz4_dyn",
> +#endif

This is not the list of supported algorithms. It's the list of
recommended algorithms. You can configure zram to use any of
available and known to Crypto API algorithms. Including lz4_dyn
on PAGE_SIZE > 32K systems.

-ss

2018-03-21 19:56:10

by Nick Terrell

Subject: Re: [PATCH 0/1] cover-letter/lz4: Implement lz4 with dynamic offset length.

On (03/21/18 10:10), Maninder Singh wrote:
> The LZ4 specification defines a 2-byte offset length for 64 KB of data.
> But in the case of ZRAM we compress data per page, and on most
> architectures PAGE_SIZE is 4 KB. So we can decide the offset length based
> on the actual offset value. For this we reserve 1 bit to encode the offset
> length (1 byte or 2 bytes). 2 bytes are required only if the offset is
> greater than 127; otherwise 1 byte is enough.
>
> With this new implementation, the offset value can be at most 32 KB.
>
> Thus we can save more memory for compressed data.
>
> Results with the new implementation:
>
> Compressed size for the same input source
> (LZ4_DYN < LZO < LZ4)
>
> LZO
> =======
> orig_data_size: 78917632
> compr_data_size: 15894668
> mem_used_total: 17117184
>
> LZ4
> ========
> orig_data_size: 78917632
> compr_data_size: 16310717
> mem_used_total: 17592320
>
> LZ4_DYN
> =======
> orig_data_size: 78917632
> compr_data_size: 15520506
> mem_used_total: 16748544

This seems like a reasonable extension to the algorithm, and it looks like
LZ4_DYN is about a 5% improvement in compression ratio on your benchmark.
The biggest question I have is whether it is worthwhile to maintain a
separate, incompatible variant of LZ4 in the kernel, without any upstream,
for a 5% gain. If we do want to go forward with this, we should perform
more benchmarks.

I commented in the patch, but because the `dynOffset` variable isn't a
compile time static in LZ4_decompress_generic(), I suspect that the patch
causes a regression in decompression speed for both LZ4 and LZ4_DYN. You'll
need to re-run the benchmarks to first show that LZ4 before the patch
performs the same as LZ4 after the patch. Then re-run the LZ4 vs LZ4_DYN
benchmarks.

I would also like to see a benchmark in user-space (with the code), so we
can see the performance of LZ4 before and after the patch, as well as LZ4
vs LZ4_DYN without anything else going on. I expect the extra branches in
the decoding loop to have an impact on speed, and I would like to see how
big the impact is without noise.
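The specialization concern can be made concrete with a small user-space sketch. The usual fix, already used for the other directives in the lz4 code, is to pass the mode as a compile-time constant through an always-inline generic function so that each outer wrapper gets its own copy with the branch folded away. Function names here are illustrative, not the patch's:

```c
/* Generic decoder: 'dynOffset' is a constant at each call site, so the
 * compiler can specialize each wrapper and eliminate the outer branch. */
static inline __attribute__((always_inline))
int decode_offset_generic(const unsigned char *ip, unsigned *offset,
			  const int dynOffset)
{
	if (dynOffset) {			/* folded away per wrapper */
		if (*ip & 1) {			/* DYN_BIT set: 2-byte form */
			*offset = (unsigned)((ip[0] | (ip[1] << 8)) >> 1);
			return 2;
		}
		*offset = ip[0] >> 1;		/* 1-byte form */
		return 1;
	}
	*offset = (unsigned)(ip[0] | (ip[1] << 8));	/* plain LE16 */
	return 2;
}

/* Each wrapper becomes a distinct, branch-specialized function. */
int decode_offset_classic(const unsigned char *ip, unsigned *offset)
{
	return decode_offset_generic(ip, offset, 0);
}

int decode_offset_dyn(const unsigned char *ip, unsigned *offset)
{
	return decode_offset_generic(ip, offset, 1);
}
```

If LZ4_decompress_generic() were wrapped this way, the classic-format path should compile to the same code as before the patch, which is what the benchmark comparison above is meant to confirm.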

CC-ing Yann Collet, the author of LZ4

2018-03-21 19:59:06

by Nick Terrell

Subject: Re: [PATCH 1/1] lz4: Implement lz4 with dynamic offset length.

On (03/21/18 10:10), Maninder Singh wrote:
> diff --git a/lib/lz4/lz4_compress.c b/lib/lz4/lz4_compress.c
> index cc7b6d4..185c358 100644
> --- a/lib/lz4/lz4_compress.c
> +++ b/lib/lz4/lz4_compress.c
> @@ -183,7 +183,8 @@ static FORCE_INLINE int LZ4_compress_generic(
> const tableType_t tableType,
> const dict_directive dict,
> const dictIssue_directive dictIssue,
> - const U32 acceleration)
> + const U32 acceleration,
> + const Dynamic_Offset dynOffset)
> {
> const BYTE *ip = (const BYTE *) source;
> const BYTE *base;
> @@ -199,6 +200,7 @@ static FORCE_INLINE int LZ4_compress_generic(
>
> BYTE *op = (BYTE *) dest;
> BYTE * const olimit = op + maxOutputSize;
> + int max_distance = dynOffset ? MAX_DISTANCE_DYN : MAX_DISTANCE;

Let's mark this variable `const`. If the compiler doesn't realize that this
variable and `dynOffset` are compile-time constants, I expect the speed to
be impacted.

>
> U32 forwardH;
> size_t refDelta = 0;
> @@ -245,6 +247,7 @@ static FORCE_INLINE int LZ4_compress_generic(
> for ( ; ; ) {
> const BYTE *match;
> BYTE *token;
> + int curr_offset;
>
> /* Find a match */
> {
> @@ -285,7 +288,7 @@ static FORCE_INLINE int LZ4_compress_generic(
> : 0)
> || ((tableType == byU16)
> ? 0
> - : (match + MAX_DISTANCE < ip))
> + : (match + max_distance < ip))
> || (LZ4_read32(match + refDelta)
> != LZ4_read32(ip)));
> }
> @@ -328,8 +331,26 @@ static FORCE_INLINE int LZ4_compress_generic(
>
> _next_match:
> /* Encode Offset */
> - LZ4_writeLE16(op, (U16)(ip - match));
> - op += 2;
> + if (dynOffset) {
> + curr_offset = (U16)(ip - match);
> +
> + /*
> + * If the offset is greater than 127, we need 2 bytes
> + * to store it. Otherwise 1 byte is enough.
> + */
> + if (curr_offset > 127) {
> + curr_offset = (curr_offset << 1) | DYN_BIT;
> + LZ4_writeLE16(op, (U16)curr_offset);
> + op += 2;
> + } else {
> + curr_offset = curr_offset << 1;
> + *op = (BYTE)curr_offset;
> + op++;
> + }

The standard way to do variable-sized integers is to use the high-bit as
the control bit, not the low-bit. Do you have benchmarks showing that using
the low-bit is faster than the high-bit? If so, let's note that in a comment
in the code; if not, let's use the high-bit.
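
For illustration, a minimal user-space sketch of the high-bit convention
(hypothetical helper, not code from the patch): the top bit of the first
byte flags whether a second byte follows, so a short offset is stored
verbatim with no shift.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Encode an offset (< 32 KB) with the high bit of the first byte as
 * the "two bytes follow" flag. Offsets <= 127 are stored as-is in one
 * byte; larger offsets use the remaining 15 bits across two bytes.
 * Returns the number of bytes written (1 or 2).
 */
static int put_offset_hibit(uint8_t *op, uint16_t offset)
{
	if (offset <= 127) {
		*op = (uint8_t)offset;	/* high bit clear: 1-byte form */
		return 1;
	}
	op[0] = (uint8_t)(0x80 | (offset >> 8));	/* high bit set */
	op[1] = (uint8_t)(offset & 0xFF);
	return 2;
}
```

Unlike the low-bit scheme in the patch, the 1-byte form here stores the
offset unshifted.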

> + } else {
> + LZ4_writeLE16(op, (U16)(ip - match));
> + op += 2;
> + }
>
> /* Encode MatchLength */
> {
> @@ -480,39 +501,70 @@ static int LZ4_compress_fast_extState(
> return LZ4_compress_generic(ctx, source,
> dest, inputSize, 0,
> noLimit, byU16, noDict,
> - noDictIssue, acceleration);
> + noDictIssue, acceleration, NoDynOffset);
> else
> return LZ4_compress_generic(ctx, source,
> dest, inputSize, 0,
> noLimit, tableType, noDict,
> - noDictIssue, acceleration);
> + noDictIssue, acceleration, NoDynOffset);
> } else {
> if (inputSize < LZ4_64Klimit)
> return LZ4_compress_generic(ctx, source,
> dest, inputSize,
> maxOutputSize, limitedOutput, byU16, noDict,
> - noDictIssue, acceleration);
> + noDictIssue, acceleration, NoDynOffset);
> else
> return LZ4_compress_generic(ctx, source,
> dest, inputSize,
> maxOutputSize, limitedOutput, tableType, noDict,
> - noDictIssue, acceleration);
> + noDictIssue, acceleration, NoDynOffset);
> }
> }
>
> +static int LZ4_compress_fast_extState_dynamic(
> + void *state,
> + const char *source,
> + char *dest,
> + int inputSize,
> + int maxOutputSize,
> + int acceleration)
> +{
> + LZ4_stream_t_internal *ctx = &((LZ4_stream_t *)state)->internal_donotuse;
> +
> + LZ4_resetStream((LZ4_stream_t *)state);
> +
> + if (acceleration < 1)
> + acceleration = LZ4_ACCELERATION_DEFAULT;
> +
> + if (maxOutputSize >= LZ4_COMPRESSBOUND(inputSize))
> + return LZ4_compress_generic(ctx, source,
> + dest, inputSize, 0,
> + noLimit, byU16, noDict,
> + noDictIssue, acceleration, DynOffset);
> + else
> + return LZ4_compress_generic(ctx, source,
> + dest, inputSize,
> + maxOutputSize, limitedOutput, byU16, noDict,
> + noDictIssue, acceleration, DynOffset);
> +}
> +
> int LZ4_compress_fast(const char *source, char *dest, int inputSize,
> - int maxOutputSize, int acceleration, void *wrkmem)
> + int maxOutputSize, int acceleration, void *wrkmem, bool dynOffset)
> {
> - return LZ4_compress_fast_extState(wrkmem, source, dest, inputSize,
> + if (!dynOffset)
> + return LZ4_compress_fast_extState(wrkmem, source, dest, inputSize,
> + maxOutputSize, acceleration);
> +
> + return LZ4_compress_fast_extState_dynamic(wrkmem, source, dest, inputSize,
> maxOutputSize, acceleration);
> }
> EXPORT_SYMBOL(LZ4_compress_fast);
>
> int LZ4_compress_default(const char *source, char *dest, int inputSize,
> - int maxOutputSize, void *wrkmem)
> + int maxOutputSize, void *wrkmem, bool dynOffset)
> {
> return LZ4_compress_fast(source, dest, inputSize,
> - maxOutputSize, LZ4_ACCELERATION_DEFAULT, wrkmem);
> + maxOutputSize, LZ4_ACCELERATION_DEFAULT, wrkmem, dynOffset);
> }
> EXPORT_SYMBOL(LZ4_compress_default);
>
> @@ -900,12 +952,12 @@ int LZ4_compress_fast_continue(LZ4_stream_t *LZ4_stream, const char *source,
> result = LZ4_compress_generic(
> streamPtr, source, dest, inputSize,
> maxOutputSize, limitedOutput, byU32,
> - withPrefix64k, dictSmall, acceleration);
> + withPrefix64k, dictSmall, acceleration, NoDynOffset);
> } else {
> result = LZ4_compress_generic(
> streamPtr, source, dest, inputSize,
> maxOutputSize, limitedOutput, byU32,
> - withPrefix64k, noDictIssue, acceleration);
> + withPrefix64k, noDictIssue, acceleration, NoDynOffset);
> }
> streamPtr->dictSize += (U32)inputSize;
> streamPtr->currentOffset += (U32)inputSize;
> @@ -921,12 +973,12 @@ int LZ4_compress_fast_continue(LZ4_stream_t *LZ4_stream, const char *source,
> result = LZ4_compress_generic(
> streamPtr, source, dest, inputSize,
> maxOutputSize, limitedOutput, byU32,
> - usingExtDict, dictSmall, acceleration);
> + usingExtDict, dictSmall, acceleration, NoDynOffset);
> } else {
> result = LZ4_compress_generic(
> streamPtr, source, dest, inputSize,
> maxOutputSize, limitedOutput, byU32,
> - usingExtDict, noDictIssue, acceleration);
> + usingExtDict, noDictIssue, acceleration, NoDynOffset);
> }
> streamPtr->dictionary = (const BYTE *)source;
> streamPtr->dictSize = (U32)inputSize;
> diff --git a/lib/lz4/lz4_decompress.c b/lib/lz4/lz4_decompress.c
> index 141734d..337a828 100644
> --- a/lib/lz4/lz4_decompress.c
> +++ b/lib/lz4/lz4_decompress.c
> @@ -71,7 +71,9 @@ static FORCE_INLINE int LZ4_decompress_generic(
> /* only if dict == usingExtDict */
> const BYTE * const dictStart,
> /* note : = 0 if noDict */
> - const size_t dictSize
> + const size_t dictSize,
> + /* offset == 1; dynamic offset */
> + const Dynamic_Offset dynOffset
> )
> {
> /* Local Variables */
> @@ -141,8 +143,8 @@ static FORCE_INLINE int LZ4_decompress_generic(
> /* copy literals */
> cpy = op + length;
> if (((endOnInput) && ((cpy > (partialDecoding ? oexit : oend - MFLIMIT))
> - || (ip + length > iend - (2 + 1 + LASTLITERALS))))
> - || ((!endOnInput) && (cpy > oend - WILDCOPYLENGTH))) {
> + || (ip + length > iend - (2 + LASTLITERALS))))
> + || ((!endOnInput) && (cpy > oend - WILDCOPYLENGTH - 1))) {
> if (partialDecoding) {
> if (cpy > oend) {
> /*
> @@ -188,13 +190,31 @@ static FORCE_INLINE int LZ4_decompress_generic(
> break;
> }
>
> - LZ4_wildCopy(op, ip, cpy);
> + if (dynOffset && length < 4)
> + LZ4_copy4(op, ip);
> + else
> + LZ4_wildCopy(op, ip, cpy);
> +

The LZ4 format enforces that the last 5 bytes are literals so that
LZ4_wildCopy() can be used here. I suspect that having this extra branch
here for `dynOffset` mode hurts decompression speed.
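
For context, LZ4_wildCopy() copies in 8-byte chunks and may write up to
7 bytes past the requested end; the trailing-literals rule is what makes
that overrun safe. A simplified user-space sketch (not the kernel's exact
code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WILDCOPYLENGTH 8

/*
 * Copy from src to dst until dst reaches dstEnd, in 8-byte steps.
 * May overwrite up to WILDCOPYLENGTH - 1 bytes beyond dstEnd, so the
 * caller must guarantee that much slack after the output position,
 * which is what the last-5-literals rule provides in LZ4.
 */
static void wild_copy(uint8_t *dst, const uint8_t *src, uint8_t *dstEnd)
{
	do {
		memcpy(dst, src, WILDCOPYLENGTH);
		dst += WILDCOPYLENGTH;
		src += WILDCOPYLENGTH;
	} while (dst < dstEnd);
}
```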

> ip += length;
> op = cpy;
>
> /* get offset */
> - offset = LZ4_readLE16(ip);
> - ip += 2;
> + if (dynOffset) {
> + /*
> + * Check if DYN_BIT is set, means 2 Byte Offset,
> + * else 1 Byte Offset.
> + */
> + if (*ip & DYN_BIT) {
> + offset = LZ4_readLE16(ip) >> 1;
> + ip += 2;
> + } else {
> + offset = *ip >> 1;
> + ip += 1;

If we use the high-bit as the control bit, this branch simply becomes
`offset = *ip`, though the long offset branch becomes a bit longer.
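
A matching user-space sketch of the decode side under the high-bit
convention (hypothetical helper, not code from the patch): the short
branch reads the byte directly, with no shift.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Decode an offset written with the high bit of the first byte as the
 * "two bytes follow" flag. Returns the number of input bytes consumed.
 */
static int get_offset_hibit(const uint8_t *ip, uint16_t *offset)
{
	if (!(ip[0] & 0x80)) {
		*offset = ip[0];	/* short branch: no shift needed */
		return 1;
	}
	*offset = (uint16_t)(((ip[0] & 0x7F) << 8) | ip[1]);
	return 2;
}
```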

> + }
> + } else {
> + offset = LZ4_readLE16(ip);
> + ip += 2;
> + }
> match = op - offset;
>
> if ((checkOffset) && (unlikely(match < lowLimit))) {
> @@ -335,11 +355,11 @@ static FORCE_INLINE int LZ4_decompress_generic(
> }
>
> int LZ4_decompress_safe(const char *source, char *dest,
> - int compressedSize, int maxDecompressedSize)
> + int compressedSize, int maxDecompressedSize, bool dynOffset)
> {
> return LZ4_decompress_generic(source, dest, compressedSize,
> maxDecompressedSize, endOnInputSize, full, 0,
> - noDict, (BYTE *)dest, NULL, 0);
> + noDict, (BYTE *)dest, NULL, 0, dynOffset);

You'll need to use the same trick that LZ4_compress_fast() uses: hard-code
`dynOffset`. We want the compiler to generate two versions of
LZ4_decompress_generic(), one with `dynOffset` and one with `!dynOffset`.
That way the tight loop won't contain the branches that check `dynOffset`.

if (dynOffset)
return LZ4_decompress_generic(..., true);
else
return LZ4_decompress_generic(..., false);

Without this trick, I expect that this patch causes a regression to both
LZ4 and LZ4_DYN decompression speed.

> }
>
> int LZ4_decompress_safe_partial(const char *source, char *dest,
> @@ -347,14 +367,14 @@ int LZ4_decompress_safe_partial(const char *source, char *dest,
> {
> return LZ4_decompress_generic(source, dest, compressedSize,
> maxDecompressedSize, endOnInputSize, partial,
> - targetOutputSize, noDict, (BYTE *)dest, NULL, 0);
> + targetOutputSize, noDict, (BYTE *)dest, NULL, 0, NoDynOffset);
> }
>
> int LZ4_decompress_fast(const char *source, char *dest, int originalSize)
> {
> return LZ4_decompress_generic(source, dest, 0, originalSize,
> endOnOutputSize, full, 0, withPrefix64k,
> - (BYTE *)(dest - 64 * KB), NULL, 64 * KB);
> + (BYTE *)(dest - 64 * KB), NULL, 64 * KB, NoDynOffset);
> }
>
> int LZ4_setStreamDecode(LZ4_streamDecode_t *LZ4_streamDecode,
> @@ -392,7 +412,7 @@ int LZ4_decompress_safe_continue(LZ4_streamDecode_t *LZ4_streamDecode,
> endOnInputSize, full, 0,
> usingExtDict, lz4sd->prefixEnd - lz4sd->prefixSize,
> lz4sd->externalDict,
> - lz4sd->extDictSize);
> + lz4sd->extDictSize, NoDynOffset);
>
> if (result <= 0)
> return result;
> @@ -406,7 +426,7 @@ int LZ4_decompress_safe_continue(LZ4_streamDecode_t *LZ4_streamDecode,
> compressedSize, maxOutputSize,
> endOnInputSize, full, 0,
> usingExtDict, (BYTE *)dest,
> - lz4sd->externalDict, lz4sd->extDictSize);
> + lz4sd->externalDict, lz4sd->extDictSize, NoDynOffset);
> if (result <= 0)
> return result;
> lz4sd->prefixSize = result;
> @@ -427,7 +447,7 @@ int LZ4_decompress_fast_continue(LZ4_streamDecode_t *LZ4_streamDecode,
> endOnOutputSize, full, 0,
> usingExtDict,
> lz4sd->prefixEnd - lz4sd->prefixSize,
> - lz4sd->externalDict, lz4sd->extDictSize);
> + lz4sd->externalDict, lz4sd->extDictSize, NoDynOffset);
>
> if (result <= 0)
> return result;
> @@ -440,7 +460,7 @@ int LZ4_decompress_fast_continue(LZ4_streamDecode_t *LZ4_streamDecode,
> result = LZ4_decompress_generic(source, dest, 0, originalSize,
> endOnOutputSize, full, 0,
> usingExtDict, (BYTE *)dest,
> - lz4sd->externalDict, lz4sd->extDictSize);
> + lz4sd->externalDict, lz4sd->extDictSize, NoDynOffset);
> if (result <= 0)
> return result;
> lz4sd->prefixSize = originalSize;
> @@ -463,19 +483,19 @@ static FORCE_INLINE int LZ4_decompress_usingDict_generic(const char *source,
> if (dictSize == 0)
> return LZ4_decompress_generic(source, dest,
> compressedSize, maxOutputSize, safe, full, 0,
> - noDict, (BYTE *)dest, NULL, 0);
> + noDict, (BYTE *)dest, NULL, 0, NoDynOffset);
> if (dictStart + dictSize == dest) {
> if (dictSize >= (int)(64 * KB - 1))
> return LZ4_decompress_generic(source, dest,
> compressedSize, maxOutputSize, safe, full, 0,
> - withPrefix64k, (BYTE *)dest - 64 * KB, NULL, 0);
> + withPrefix64k, (BYTE *)dest - 64 * KB, NULL, 0, NoDynOffset);
> return LZ4_decompress_generic(source, dest, compressedSize,
> maxOutputSize, safe, full, 0, noDict,
> - (BYTE *)dest - dictSize, NULL, 0);
> + (BYTE *)dest - dictSize, NULL, 0, NoDynOffset);
> }
> return LZ4_decompress_generic(source, dest, compressedSize,
> maxOutputSize, safe, full, 0, usingExtDict,
> - (BYTE *)dest, (const BYTE *)dictStart, dictSize);
> + (BYTE *)dest, (const BYTE *)dictStart, dictSize, NoDynOffset);
> }
>
> int LZ4_decompress_safe_usingDict(const char *source, char *dest,
> diff --git a/lib/lz4/lz4defs.h b/lib/lz4/lz4defs.h
> index 00a0b58..9451a73 100644
> --- a/lib/lz4/lz4defs.h
> +++ b/lib/lz4/lz4defs.h
> @@ -75,6 +75,7 @@
> #define WILDCOPYLENGTH 8
> #define LASTLITERALS 5
> #define MFLIMIT (WILDCOPYLENGTH + MINMATCH)
> +#define DYN_BIT 0x1
>
> /* Increase this value ==> compression run slower on incompressible data */
> #define LZ4_SKIPTRIGGER 6
> @@ -87,6 +88,7 @@
>
> #define MAXD_LOG 16
> #define MAX_DISTANCE ((1 << MAXD_LOG) - 1)
> +#define MAX_DISTANCE_DYN ((1 << (MAXD_LOG - 1)) - 1)
> #define STEPSIZE sizeof(size_t)
>
> #define ML_BITS 4
> @@ -147,6 +149,13 @@ static FORCE_INLINE void LZ4_copy8(void *dst, const void *src)
> #endif
> }
>
> +static FORCE_INLINE void LZ4_copy4(void *dst, const void *src)
> +{
> + U32 a = get_unaligned((const U32 *)src);
> +
> + put_unaligned(a, (U32 *)dst);
> +}
> +
> /*
> * customized variant of memcpy,
> * which can overwrite up to 7 bytes beyond dstEnd
> @@ -224,4 +233,6 @@ static FORCE_INLINE unsigned int LZ4_count(
> typedef enum { endOnOutputSize = 0, endOnInputSize = 1 } endCondition_directive;
> typedef enum { full = 0, partial = 1 } earlyEnd_directive;
>
> +typedef enum { NoDynOffset = 0, DynOffset = 1 } Dynamic_Offset;
> +
> #endif
> --
> 1.7.1

2018-03-22 02:43:26

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCH 0/1] cover-letter/lz4: Implement lz4 with dynamic offset length.

On (03/21/18 19:56), Nick Terrell wrote:
[..]
> This seems like a reasonable extension to the algorithm, and it looks like
> LZ4_DYN is about a 5% improvement to compression ratio on your benchmark.
> The biggest question I have is whether it is worthwhile to maintain a separate,
> incompatible variant of LZ4 in the kernel, with no upstream, for a 5%
> gain. If we do want to go forward with this, we should perform more
> benchmarks.
>
> I commented in the patch, but because the `dynOffset` variable isn't a
> compile time static in LZ4_decompress_generic(), I suspect that the patch
> causes a regression in decompression speed for both LZ4 and LZ4_DYN. You'll
> need to re-run the benchmarks to first show that LZ4 before the patch
> performs the same as LZ4 after the patch. Then re-run the LZ4 vs LZ4_DYN
> benchmarks.
>
> I would also like to see a benchmark in user-space (with the code), so we
> can see the performance of LZ4 before and after the patch, as well as LZ4
> vs LZ4_DYN without anything else going on. I expect the extra branches in
> the decoding loop to have an impact on speed, and I would like to see how
> big the impact is without noise.

Yes, I've been thinking about this. There are more branches now
("to dyn or not to dyn") in the compression/decompression hot path,
but we see fewer instructions and branches in the perf output at the end.
And my guess is that we have a lot of noise from zram and zsmalloc.
The data is XXX bytes shorter with dyn enabled, so we use YYY fewer
moves and ZZZ fewer branches while we copy the data to and from
zsmalloc, and maybe that's the root cause of the "performance
gain" that we see in the zram-fio tests. So maybe we need to run
benchmarks against lz4, not zram+lz4.

> CC-ing Yann Collet, the author of LZ4

Great, thanks.

-ss

2018-03-22 04:28:21

by Maninder Singh

[permalink] [raw]
Subject: Re: [PATCH 1/1] lz4: Implement lz4 with dynamic offset length.

CC: Vaneet Narang.

 
[..]
>
> +typedef enum { NoDynOffset = 0, DynOffset = 1 } Dynamic_Offset;
> +
>  #endif
> --
> 1.7.1
 
 

2018-03-22 23:09:49

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 1/1] lz4: Implement lz4 with dynamic offset length.

Hi Maninder,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.16-rc6]
[cannot apply to next-20180322]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Maninder-Singh/cover-letter-lz4-Implement-lz4-with-dynamic-offset-length/20180323-064137
config: i386-randconfig-x073-201811 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386

All errors (new ones prefixed by >>):

fs/pstore/platform.c: In function 'decompress_lz4':
>> fs/pstore/platform.c:357:8: error: too few arguments to function 'LZ4_decompress_safe'
ret = LZ4_decompress_safe(in, out, inlen, outlen);
^~~~~~~~~~~~~~~~~~~
In file included from fs/pstore/platform.c:38:0:
include/linux/lz4.h:301:5: note: declared here
int LZ4_decompress_safe(const char *source, char *dest, int compressedSize,
^~~~~~~~~~~~~~~~~~~

vim +/LZ4_decompress_safe +357 fs/pstore/platform.c

8cfc8ddc Geliang Tang 2016-02-18 352
8cfc8ddc Geliang Tang 2016-02-18 353 static int decompress_lz4(void *in, void *out, size_t inlen, size_t outlen)
8cfc8ddc Geliang Tang 2016-02-18 354 {
8cfc8ddc Geliang Tang 2016-02-18 355 int ret;
8cfc8ddc Geliang Tang 2016-02-18 356
d21b5ff1 Sven Schmidt 2017-02-24 @357 ret = LZ4_decompress_safe(in, out, inlen, outlen);
d21b5ff1 Sven Schmidt 2017-02-24 358 if (ret < 0) {
d21b5ff1 Sven Schmidt 2017-02-24 359 /*
d21b5ff1 Sven Schmidt 2017-02-24 360 * LZ4_decompress_safe will return an error code
d21b5ff1 Sven Schmidt 2017-02-24 361 * (< 0) if decompression failed
d21b5ff1 Sven Schmidt 2017-02-24 362 */
d21b5ff1 Sven Schmidt 2017-02-24 363 pr_err("LZ4_decompress_safe error, ret = %d!\n", ret);
8cfc8ddc Geliang Tang 2016-02-18 364 return -EIO;
8cfc8ddc Geliang Tang 2016-02-18 365 }
8cfc8ddc Geliang Tang 2016-02-18 366
d21b5ff1 Sven Schmidt 2017-02-24 367 return ret;
8cfc8ddc Geliang Tang 2016-02-18 368 }
8cfc8ddc Geliang Tang 2016-02-18 369

:::::: The code at line 357 was first introduced by commit
:::::: d21b5ff12df45a65bb220c7e8103a5f0f5609377 fs/pstore: fs/squashfs: change usage of LZ4 to work with new LZ4 version

:::::: TO: Sven Schmidt <[email protected]>
:::::: CC: Linus Torvalds <[email protected]>

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation



2018-03-22 23:32:32

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 1/1] lz4: Implement lz4 with dynamic offset length.

Hi Maninder,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.16-rc6]
[cannot apply to next-20180322]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Maninder-Singh/cover-letter-lz4-Implement-lz4-with-dynamic-offset-length/20180323-064137
config: i386-randconfig-s1-03221113 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
# save the attached .config to linux build tree
make ARCH=i386

All errors (new ones prefixed by >>):

fs/squashfs/lz4_wrapper.c: In function 'lz4_uncompress':
>> fs/squashfs/lz4_wrapper.c:110:8: error: too few arguments to function 'LZ4_decompress_safe'
res = LZ4_decompress_safe(stream->input, stream->output,
^~~~~~~~~~~~~~~~~~~
In file included from fs/squashfs/lz4_wrapper.c:13:0:
include/linux/lz4.h:301:5: note: declared here
int LZ4_decompress_safe(const char *source, char *dest, int compressedSize,
^~~~~~~~~~~~~~~~~~~

vim +/LZ4_decompress_safe +110 fs/squashfs/lz4_wrapper.c

9c06a46f Phillip Lougher 2014-11-27 91
9c06a46f Phillip Lougher 2014-11-27 92
9c06a46f Phillip Lougher 2014-11-27 93 static int lz4_uncompress(struct squashfs_sb_info *msblk, void *strm,
9c06a46f Phillip Lougher 2014-11-27 94 struct buffer_head **bh, int b, int offset, int length,
9c06a46f Phillip Lougher 2014-11-27 95 struct squashfs_page_actor *output)
9c06a46f Phillip Lougher 2014-11-27 96 {
9c06a46f Phillip Lougher 2014-11-27 97 struct squashfs_lz4 *stream = strm;
9c06a46f Phillip Lougher 2014-11-27 98 void *buff = stream->input, *data;
9c06a46f Phillip Lougher 2014-11-27 99 int avail, i, bytes = length, res;
9c06a46f Phillip Lougher 2014-11-27 100
9c06a46f Phillip Lougher 2014-11-27 101 for (i = 0; i < b; i++) {
9c06a46f Phillip Lougher 2014-11-27 102 avail = min(bytes, msblk->devblksize - offset);
9c06a46f Phillip Lougher 2014-11-27 103 memcpy(buff, bh[i]->b_data + offset, avail);
9c06a46f Phillip Lougher 2014-11-27 104 buff += avail;
9c06a46f Phillip Lougher 2014-11-27 105 bytes -= avail;
9c06a46f Phillip Lougher 2014-11-27 106 offset = 0;
9c06a46f Phillip Lougher 2014-11-27 107 put_bh(bh[i]);
9c06a46f Phillip Lougher 2014-11-27 108 }
9c06a46f Phillip Lougher 2014-11-27 109
d21b5ff1 Sven Schmidt 2017-02-24 @110 res = LZ4_decompress_safe(stream->input, stream->output,
d21b5ff1 Sven Schmidt 2017-02-24 111 length, output->length);
d21b5ff1 Sven Schmidt 2017-02-24 112
d21b5ff1 Sven Schmidt 2017-02-24 113 if (res < 0)
9c06a46f Phillip Lougher 2014-11-27 114 return -EIO;
9c06a46f Phillip Lougher 2014-11-27 115
d21b5ff1 Sven Schmidt 2017-02-24 116 bytes = res;
9c06a46f Phillip Lougher 2014-11-27 117 data = squashfs_first_page(output);
9c06a46f Phillip Lougher 2014-11-27 118 buff = stream->output;
9c06a46f Phillip Lougher 2014-11-27 119 while (data) {
09cbfeaf Kirill A. Shutemov 2016-04-01 120 if (bytes <= PAGE_SIZE) {
9c06a46f Phillip Lougher 2014-11-27 121 memcpy(data, buff, bytes);
9c06a46f Phillip Lougher 2014-11-27 122 break;
9c06a46f Phillip Lougher 2014-11-27 123 }
09cbfeaf Kirill A. Shutemov 2016-04-01 124 memcpy(data, buff, PAGE_SIZE);
09cbfeaf Kirill A. Shutemov 2016-04-01 125 buff += PAGE_SIZE;
09cbfeaf Kirill A. Shutemov 2016-04-01 126 bytes -= PAGE_SIZE;
9c06a46f Phillip Lougher 2014-11-27 127 data = squashfs_next_page(output);
9c06a46f Phillip Lougher 2014-11-27 128 }
9c06a46f Phillip Lougher 2014-11-27 129 squashfs_finish_page(output);
9c06a46f Phillip Lougher 2014-11-27 130
d21b5ff1 Sven Schmidt 2017-02-24 131 return res;
9c06a46f Phillip Lougher 2014-11-27 132 }
9c06a46f Phillip Lougher 2014-11-27 133

:::::: The code at line 110 was first introduced by commit
:::::: d21b5ff12df45a65bb220c7e8103a5f0f5609377 fs/pstore: fs/squashfs: change usage of LZ4 to work with new LZ4 version

:::::: TO: Sven Schmidt <[email protected]>
:::::: CC: Linus Torvalds <[email protected]>

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation



2018-03-23 13:21:19

by Vaneet Narang

[permalink] [raw]
Subject: Re: [PATCH 1/1] lz4: Implement lz4 with dynamic offset length.


Hi Nick,

Thanks for your comments. Please check my replies to a few of them below;
I will share benchmarking figures separately.

>
>> + if (curr_offset > 127) {
>> + curr_offset = (curr_offset << 1) | DYN_BIT;
>> + LZ4_writeLE16(op, (U16)curr_offset);
>> + op += 2;
>> + } else {
>> + curr_offset = curr_offset << 1;
>> + *op = (BYTE)curr_offset;
>> + op++;
>> + }
>
>The standard way to do variable sized integers is to use the high-bit as
>the control bit, not the low-bit. Do you have benchmarks to show that using
>the low-bit is faster than using the high-bit? If so, lets comment in the
>code, if not lets use the high-bit.
>
We are not sure about the performance difference between using the low or the high bit, but since
the LZ4 specification requires the offset to be stored in little-endian format, keeping the low bit
as the control bit makes it easier to retrieve the offset when it is spread across two bytes.

offset = LZ4_readLE16(ip);
if (offset & 0x1) {
        offset = offset >> 1;          /* just one right shift */
        ip += 2;
} else {
        offset = (offset & 0xff) >> 1; /* only two operations for a one-byte offset */
        ip += 1;
}

So we are not sure keeping the high bit would make much difference, as we would need
the same number of operations or more to retrieve the offset in that case.
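To make the scheme concrete, here is a standalone sketch of the 1-bit marker encoding described above. The function names are illustrative, not the ones in the patch, and the explicit byte stores stand in for the kernel's LZ4_writeLE16/LZ4_readLE16 helpers:

```c
#include <stdint.h>
#include <stddef.h>

/* Encode a match offset using the low bit as the length marker:
 * bit 0 set   -> two-byte little-endian encoding, offset up to 32767
 * bit 0 clear -> one-byte encoding, offset up to 127
 * Returns the number of bytes written.
 */
static size_t put_dyn_offset(uint8_t *op, uint16_t offset)
{
	if (offset > 127) {
		uint16_t v = (uint16_t)((offset << 1) | 1);

		op[0] = (uint8_t)(v & 0xff); /* little-endian low byte */
		op[1] = (uint8_t)(v >> 8);
		return 2;
	}
	op[0] = (uint8_t)(offset << 1);      /* marker bit stays 0 */
	return 1;
}

/* Decode an offset; returns the number of input bytes consumed. */
static size_t get_dyn_offset(const uint8_t *ip, uint16_t *offset)
{
	uint16_t v = (uint16_t)(ip[0] | (ip[1] << 8)); /* LZ4_readLE16 */

	if (v & 1) {
		*offset = v >> 1;          /* two-byte form: one right shift */
		return 2;
	}
	*offset = (v & 0xff) >> 1;         /* one-byte form: mask then shift */
	return 1;
}
```

Note that, as discussed below, the decoder always reads two bytes even for the one-byte form, which is exactly why the patch has to guard the end of the input buffer.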

>> /* copy literals */
>> cpy = op + length;
>> if (((endOnInput) && ((cpy > (partialDecoding ? oexit : oend - MFLIMIT))
>> - || (ip + length > iend - (2 + 1 + LASTLITERALS))))
>> - || ((!endOnInput) && (cpy > oend - WILDCOPYLENGTH))) {
>> + || (ip + length > iend - (2 + LASTLITERALS))))
>> + || ((!endOnInput) && (cpy > oend - WILDCOPYLENGTH - 1))) {
>> if (partialDecoding) {
>> if (cpy > oend) {
>> /*
>> @@ -188,13 +190,31 @@ static FORCE_INLINE int LZ4_decompress_generic(
>> break;
>> }
>>
>> - LZ4_wildCopy(op, ip, cpy);
>> + if (dynOffset && length < 4)
>> + LZ4_copy4(op, ip);
>> + else
>> + LZ4_wildCopy(op, ip, cpy);
>> +
>
>The LZ4 format enforces that the last 5 bytes are literals so that
>LZ4_wildCopy() can be used here. I suspect that having this extra branch
>here for `dynOffset` mode hurts decompression speed.
>
This check handles a one-byte read overflow when decompressing the last frame; LZ4_wildCopy() blindly copies 8 bytes at a time.

Issue case:
a zero-length literal followed by a one-byte offset, with the stream ending in a one-byte token and the 5-byte last literals.

With this combination only 7 bytes (1-byte offset + 1-byte token + 5-byte literals) remain at the
end of the input buffer, so reading 8 bytes from it results in a one-byte overflow.

Since a one-byte offset does not exist in original LZ4, this issue cannot occur there. To avoid the
overhead of this check, we plan to use 6 bytes as the last-literal length:

-#define LASTLITERALS 5
+#define LASTLITERALS 6
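The arithmetic behind that choice can be checked with a toy sketch (WILDCOPYLENGTH as in lz4defs.h; the helper is illustrative, not kernel code):

```c
#define WILDCOPYLENGTH 8 /* LZ4_wildCopy() copies 8 bytes blindly */

/* Smallest possible tail of the input buffer in the problem scenario
 * above: a one-byte dynamic offset, a one-byte token, then the last
 * literals. With LASTLITERALS = 5 that is 7 bytes, one short of a
 * blind 8-byte wildcopy; with LASTLITERALS = 6 the tail is 8 bytes
 * and the copy stays in bounds.
 */
static int min_tail_bytes(int lastliterals)
{
	return 1 /* offset */ + 1 /* token */ + lastliterals;
}
```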

>>
>> int LZ4_decompress_safe(const char *source, char *dest,
>> - int compressedSize, int maxDecompressedSize)
>> + int compressedSize, int maxDecompressedSize, bool dynOffset)
>> {
>> return LZ4_decompress_generic(source, dest, compressedSize,
>> maxDecompressedSize, endOnInputSize, full, 0,
>> - noDict, (BYTE *)dest, NULL, 0);
>> + noDict, (BYTE *)dest, NULL, 0, dynOffset);
>
>You'll need to use the same trick that LZ4_compress_fast() uses, by hard
>coding `dynOffset`. We want the compiler to generate two versions of
>LZ4_decompress_generic(), one with `dynOffset` and one with `!dynOffset`.
>That way the tight loop won't have the branches that check `dynOffset`.
>
> if (dynOffset)
> return LZ4_decompress_generic(..., true);
> else
> return LZ4_decompress_generic(..., false);
>
>Without this trick, I expect that this patch causes a regression to both
>LZ4 and LZ4_DYN decompression speed.

Since our solution is not backward compatible with original LZ4, we are a bit confused
whether we should make it a separate API, to avoid the overhead of dynOffset checks
everywhere in the code and to avoid changing the prototypes of the exported functions
LZ4_decompress/LZ4_compress, or keep these checks to avoid redundant code.
Kindly suggest.
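For reference, the specialization pattern Nick describes can be sketched standalone (illustrative names and a toy decode routine, not the real LZ4 code): a forced-inline generic function takes the flag as a parameter, and two thin wrappers pass compile-time constants, so the compiler emits two copies with the flag's branches folded away.

```c
#include <stdint.h>
#include <stddef.h>

#define FORCE_INLINE static inline __attribute__((always_inline))

/* Toy stand-in for LZ4_decompress_generic(): dyn_offset is a
 * compile-time constant at each call site, so the branch on it
 * disappears in each specialized copy.
 */
FORCE_INLINE size_t decode_offset_generic(const uint8_t *ip,
					  uint16_t *offset, int dyn_offset)
{
	uint16_t v = (uint16_t)(ip[0] | (ip[1] << 8));

	if (dyn_offset) {
		if (v & 1) {
			*offset = v >> 1;
			return 2;
		}
		*offset = (v & 0xff) >> 1;
		return 1;
	}
	*offset = v;          /* classic LZ4: always a 2-byte offset */
	return 2;
}

/* Two thin wrappers: each instantiates one specialization. */
size_t decode_offset_lz4(const uint8_t *ip, uint16_t *offset)
{
	return decode_offset_generic(ip, offset, 0);
}

size_t decode_offset_lz4_dyn(const uint8_t *ip, uint16_t *offset)
{
	return decode_offset_generic(ip, offset, 1);
}
```

This keeps a single exported prototype per variant while avoiding per-iteration flag checks, which is the trade-off being asked about above.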

Thanks & Regards,
Vaneet Narang

2018-03-23 13:43:19

by Vaneet Narang

[permalink] [raw]
Subject: Re: [PATCH 0/1] cover-letter/lz4: Implement lz4 with dynamic offset length.

Hi Nick / Sergey,


We have compared LZ4 dyn with original LZ4 using some samples of real-time application data (4 KB)
compressed/decompressed by ZRAM. For comparison we used lzbench (https://github.com/inikep/lzbench);
we implemented a dedicated LZ4 dyn API and kept the last-literal length at 6 to avoid the overhead
of extra checks. On average there is a saving of 3~4% in compression ratio, with almost the same
compression speed and a minor loss in decompression speed (~50 MB/s) compared with LZ4.

We also compared LZ4 dyn with LZO1x, as LZO1x is the default compressor of ZRAM.

Original LZ4:
sh-3.2# ./lzbench -r -elz4 data/
lzbench 1.7.3 (32-bit Linux) Assembled by P.Skibinski
Compressor name         Compress. Decompress. Compr. size  Ratio Filename
memcpy                   2205 MB/s  2217 MB/s        4096 100.00 data//data_1
lz4 1.8.0                 216 MB/s   761 MB/s        2433  59.40 data//data_1
lz4 1.8.0                 269 MB/s   877 MB/s        1873  45.73 data//data_2
lz4 1.8.0                 238 MB/s   575 MB/s        2060  50.29 data//data_3
lz4 1.8.0                 321 MB/s  1015 MB/s        1464  35.74 data//data_4
lz4 1.8.0                 464 MB/s  1090 MB/s         713  17.41 data//data_5
lz4 1.8.0                 296 MB/s   956 MB/s        1597  38.99 data//data_6
lz4 1.8.0                 338 MB/s   994 MB/s        2238  54.64 data//data_7
lz4 1.8.0                 705 MB/s  1172 MB/s         193   4.71 data//data_8
lz4 1.8.0                 404 MB/s  1150 MB/s        1097  26.78 data//data_9
lz4 1.8.0                 216 MB/s   921 MB/s        3183  77.71 data//data_10
lz4 1.8.0                 456 MB/s  1101 MB/s        1011  24.68 data//data_11
lz4 1.8.0                 867 MB/s  1202 MB/s          37   0.90 data//data_12


LZ4 Dynamic Offset:
sh-3.2# ./lzbench -r -elz4_dyn data/
lzbench 1.7.3 (32-bit Linux) Assembled by P.Skibinski
Compressor name         Compress. Decompress. Compr. size  Ratio Filename
memcpy                   2203 MB/s  2218 MB/s        4096 100.00 data//data_1
lz4 1.8.0                 218 MB/s   693 MB/s        2228  54.39 data//data_1
lz4 1.8.0                 273 MB/s   851 MB/s        1739  42.46 data//data_2
lz4 1.8.0                 230 MB/s   526 MB/s        1800  43.95 data//data_3
lz4 1.8.0                 321 MB/s   952 MB/s        1357  33.13 data//data_4
lz4 1.8.0                 470 MB/s  1075 MB/s         664  16.21 data//data_5
lz4 1.8.0                 303 MB/s   964 MB/s        1455  35.52 data//data_6
lz4 1.8.0                 345 MB/s   951 MB/s        2126  51.90 data//data_7
lz4 1.8.0                 744 MB/s  1163 MB/s         177   4.32 data//data_8
lz4 1.8.0                 409 MB/s  1257 MB/s        1033  25.22 data//data_9
lz4 1.8.0                 220 MB/s   857 MB/s        3049  74.44 data//data_10
lz4 1.8.0                 464 MB/s  1105 MB/s         934  22.80 data//data_11
lz4 1.8.0                 874 MB/s  1194 MB/s          36   0.88 data//data_12


LZ4 Dynamic Offset with 32K data:
sh-3.2# ./lzbench -elz4_dyn data/data32k
lzbench 1.7.3 (32-bit Linux) Assembled by P.Skibinski
Compressor name         Compress. Decompress. Compr. size  Ratio Filename
memcpy                   5285 MB/s  5283 MB/s       32768 100.00 data/data32k
lz4 1.8.0                 274 MB/s   995 MB/s       13435  41.00 data/data32k
done... (cIters=1 dIters=1 cTime=1.0 dTime=2.0 chunkSize=1706MB cSpeed=0MB)

Original LZ4 with 32K data:
sh-3.2# ./lzbench_orig -elz4 data/data32k
lzbench 1.7.3 (32-bit Linux) Assembled by P.Skibinski
Compressor name         Compress. Decompress. Compr. size  Ratio Filename
memcpy                   4918 MB/s  5108 MB/s       32768 100.00 data/data32k
lz4 1.8.0                 276 MB/s  1045 MB/s       14492  44.23 data/data32k

LZO1x with 32K data (Default Compressor for ZRAM):
sh-3.2# ./lzbench -elzo1x,1 data/data32k
lzbench 1.7.3 (32-bit Linux) Assembled by P.Skibinski
Compressor name         Compress. Decompress. Compr. size  Ratio Filename
memcpy                   5273 MB/s  5320 MB/s       32768 100.00 data/data32k
lzo1x 2.09 -1             283 MB/s   465 MB/s       14292  43.62 data/data32k

Regards,
Vaneet Narang

2018-03-29 10:26:13

by Maninder Singh

[permalink] [raw]
Subject: Re: [PATCH 0/1] cover-letter/lz4: Implement lz4 with dynamic offset length.


Hello Nick/Sergey,
 
Any suggestions or comments, so that we can change the code and resend the patch?
 
> Hi Nick / Sergey,


> We have compared LZ4 Dyn with Original LZ4 using some samples of realtime application data(4Kb)
> compressed/decompressed by ZRAM. For comparison we have used lzbench (https://github.com/inikep/lzbench)
> we have implemented dedicated LZ4 Dyn API & kept last literal length as 6 to avoid overhead 
> of checks. It seems in average case there is a saving of 3~4% in compression ratio with almost same compression
> speed and minor loss in decompression speed (~50MB/s) when compared with LZ4.

> Comparison of Lz4 Dyn with LZO1x is also done as LZO1x is default compressor of ZRAM.

> [benchmark tables snipped; quoted from the previous mail]

> Regards,
> Vaneet Narang
 
Thanks.
 

2018-03-30 05:41:56

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCH 0/1] cover-letter/lz4: Implement lz4 with dynamic offset length.

On (03/29/18 15:56), Maninder Singh wrote:
> Hello Nick/Sergey,
>
> Any suggestion or comments, so that we can change code and resend the patch?

Well... there were no replies to
https://marc.info/?l=linux-kernel&m=152161450026771&w=2
and
https://marc.info/?l=linux-kernel&m=152161860627974&w=2

-ss

2018-04-02 05:51:52

by Maninder Singh

[permalink] [raw]
Subject: Re: [PATCH 1/1] lz4: Implement lz4 with dynamic offset length.

Hi,

>> diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
>> index 4ed0a78..5bc5aab 100644
>> --- a/drivers/block/zram/zcomp.c
>> +++ b/drivers/block/zram/zcomp.c
>> @@ -17,11 +17,15 @@
>> #include <linux/crypto.h>
>>
>> #include "zcomp.h"
>> +#define KB (1 << 10)
>>
>> static const char * const backends[] = {
>> "lzo",
>> #if IS_ENABLED(CONFIG_CRYPTO_LZ4)
>> "lz4",
>> +#if (PAGE_SIZE < (32 * KB))
>> + "lz4_dyn",
>> +#endif
>
>This is not the list of supported algorithms. It's the list of
>recommended algorithms. You can configure zram to use any of
>available and known to Crypto API algorithms. Including lz4_dyn
>on PAGE_SIZE > 32K systems.
>
> -ss

Yes, we want to integrate the new compression (lz4_dyn) for ZRAM
only if PAGE_SIZE is less than 32 KB, to get the maximum benefit,
so we added lz4_dyn to the list of available ZRAM compression algorithms.

Thanks,
Maninder Singh

2018-04-02 06:03:54

by Maninder Singh

[permalink] [raw]
Subject: Re: [PATCH 0/1] cover-letter/lz4: Implement lz4 with dynamic offset length.


 Hello,

>> (Added cover letter to avoid much text in patch description)
>>
>> LZ4 specification defines 2 byte offset length for 64 KB data.
>> But in case of ZRAM we compress data per page and in most of
>> architecture PAGE_SIZE is 4KB. So we can decide offset length based
>> on actual offset value. For this we can reserve 1 bit to decide offset
>> length (1 byte or 2 byte). 2 bytes are required only if the offset is greater than 127,
>> else 1 byte is enough.
>
>So what happens if I compress the data on a system with no dyn
>offset and then send it over the network to a machine which has
>dyn offset? Or, say, I have a USB stick with a compression enabled
>FS, store files on a dyn offset enabled PC and then mount that USB
>stick on a machine with no dyn offset support. And vice versa.


lz4_dyn is not an extension of LZ4, so there is no backward compatibility.
Consider it a different algorithm, adapted from LZ4 for a better compression ratio.

Thanks
Maninder Singh

2018-04-03 12:26:42

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCH 1/1] lz4: Implement lz4 with dynamic offset length.

On (04/02/18 11:21), Maninder Singh wrote:
[..]
> >> static const char * const backends[] = {
> >> "lzo",
> >> #if IS_ENABLED(CONFIG_CRYPTO_LZ4)
> >> "lz4",
> >> +#if (PAGE_SIZE < (32 * KB))
> >> + "lz4_dyn",
> >> +#endif
> >
> >This is not the list of supported algorithms. It's the list of
> >recommended algorithms. You can configure zram to use any of
> >available and known to Crypto API algorithms. Including lz4_dyn
> >on PAGE_SIZE > 32K systems.
> >
> Yes, we want to integrate new compression(lz4_dyn) for ZRAM
> only if PAGE_SIZE is less than 32KB to get maximum benefit.
> so we added lz4_dyn to the available list of ZRAM compression algorithms.

Which is not what I was talking about.

You shrink a 2 bytes offset down to a 1 byte offset, thus you enforce that
'page should be less than 32KB', which I'm sure will be confusing. And you
rely on lz4_dyn users to do the right thing - namely, to use that 'nice'
`#if (PAGE_SIZE < (32 * KB))'. Apart from that, lz4_dyn supports only data
in up to page_size chunks. Suppose my system has page_size of less than 32K,
so I legitimately can enable lz4_dyn, but suppose that I will use it
somewhere where I don't work with page_size-d chunks. Will I be able to just
do tfm->compress(src, sz) on random buffers? The whole thing looks to be
quite fragile.

-ss

2018-04-03 13:43:47

by Vaneet Narang

[permalink] [raw]
Subject: RE: Re: [PATCH 1/1] lz4: Implement lz4 with dynamic offset length.

Hi Sergey,

>You shrink a 2 bytes offset down to a 1 byte offset, thus you enforce that
The 2-byte offset is not shrunk to 1 byte; only 1 bit out of the
16 offset bits is reserved, so 15 bits can be used to store the offset value.

>'page should be less than 32KB', which I'm sure will be confusing.
lz4_dyn will work on bigger data lengths (> 32K), but in that case the compression
ratio may not be better than LZ4's. This is the same as LZ4 compressing data larger
than 64K (16 bits): LZ4 can't store an offset larger than 64K, and similarly
LZ4 dyn can't store an offset larger than 32K.

There is handling for this in the LZ4 code, and similar handling is added for LZ4 dyn.

Handling in LZ4 dyn: max_distance is 32K for lz4_dyn and 64K for LZ4:
int max_distance = dynOffset ? MAX_DISTANCE_DYN : MAX_DISTANCE;

>And you
>rely on lz4_dyn users to do the right thing - namely, to use that 'nice'
>`#if (PAGE_SIZE < (32 * KB))'.
They don't need to add this code; they just need to choose the right compression algorithm
for their requirement. If the source length is less than 32K, then lz4_dyn
gives a better compression ratio than LZ4.

Considering ZRAM as a user of LZ4 dyn, we added this check for PAGE_SIZE, which
is the source length. This code adds lz4_dyn to the preferred list of compression algorithms
when the page size is less than 32K.

>Apart from that, lz4_dyn supports only data
>in up to page_size chunks. Suppose my system has page_size of less than 32K,
>so I legitimately can enable lz4_dyn, but suppose that I will use it
>somewhere where I don't work with page_size-d chunks. Will I able to just
>do tfm->compress(src, sz) on random buffers? The whole thing looks to be
>quite fragile.
No, that's not true: lz4_dyn can work on random buffers, and they need not be
page-sized chunks. There is no difference between how LZ4 and LZ4 dyn work.

The only difference is that LZ4 dyn doesn't use a fixed offset size. This concept is
already used in LZO, which sizes its metadata dynamically based on match length and
match offset, using different markers that define the metadata length.

lzodefs.h:

#define M1_MAX_OFFSET 0x0400
#define M2_MAX_OFFSET 0x0800
#define M3_MAX_OFFSET 0x4000
#define M4_MAX_OFFSET 0xbfff

#define M1_MIN_LEN 2
#define M1_MAX_LEN 2
#define M2_MIN_LEN 3
#define M2_MAX_LEN 8
#define M3_MIN_LEN 3
#define M3_MAX_LEN 33
#define M4_MIN_LEN 3
#define M4_MAX_LEN 9

#define M1_MARKER 0
#define M2_MARKER 64
#define M3_MARKER 32
#define M4_MARKER 16

Similarly for LZ4 Dyn, we have used 1 bit as a marker to determine offset length.

Thanks & Regards,
Vaneet Narang



2018-04-04 01:40:23

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCH 1/1] lz4: Implement lz4 with dynamic offset length.

On (04/03/18 19:13), Vaneet Narang wrote:
> Hi Sergey,
>
> >You shrink a 2 bytes offset down to a 1 byte offset, thus you enforce that
> 2 Byte offset is not shrinked to 1 byte, Its only 1 bit is reserved out of
> 16 bits of offset. So only 15 Bits can be used to store offset value.

Yes, you are right. My bad, was thinking about something else.

> >'page should be less than 32KB', which I'm sure will be confusing.
> lz4_dyn will work on bigger data length(> 32k) but in that case compression
> ratio may not be better than LZ4. This is same as LZ4 compressing data more
> than 64K (16Bits). LZ4 can't store offset more than 64K similarly
> LZ4 dyn can't store offset more than 32K.

Then drop that `if PAGE_SIZE' thing. I'd rather do that stuff internally
in lz4... if it is needed at all.

> >And you
> >rely on lz4_dyn users to do the right thing - namely, to use that 'nice'
> >`#if (PAGE_SIZE < (32 * KB))'.
> They don't need to add this code

Then drop it.

> >Apart from that, lz4_dyn supports only data
> >in up to page_size chunks. Suppose my system has page_size of less than 32K,
> >so I legitimately can enable lz4_dyn, but suppose that I will use it
> >somewhere where I don't work with page_size-d chunks. Will I able to just
> >do tfm->compress(src, sz) on random buffers? The whole thing looks to be
> >quite fragile.
> No thats not true, lz4_dyn can work for random buffers and it need not be
> of page size chunks. There is no difference in Lz4 and Lz4 dyn working.

You are right.

-ss

2018-04-16 10:21:38

by Maninder Singh

[permalink] [raw]
Subject: Re: [PATCH 0/1] cover-letter/lz4: Implement lz4 with dynamic offset length.


 Hello Nick/ Yann,

Any inputs regarding the LZ4 dyn results and approach?

>Hello Nick/Sergey,
>
>Any suggestion or comments, so that we can change code and resend the patch?
>
>> Hi Nick / Sergey,
>>
>>
>> We have compared LZ4 Dyn with Original LZ4 using some samples of realtime application data(4Kb)
>> compressed/decompressed by ZRAM. For comparison we have used lzbench (https://github.com/inikep/lzbench)
>> we have implemented dedicated LZ4 Dyn API & kept last literal length as 6 to avoid overhead
>> of checks. It seems in average case there is a saving of 3~4% in compression ratio with almost same compression
>> speed and minor loss in decompression speed (~50MB/s) when compared with LZ4.
>>
>> Comparison of Lz4 Dyn with LZO1x is also done as LZO1x is default compressor of ZRAM.
>>
>> Original LZ4:
>> sh-3.2# ./lzbench -r -elz4 data/
>> lzbench 1.7.3 (32-bit Linux) Assembled by P.Skibinski
>> Compressor name Compress. Decompress. Compr. size Ratio Filename
>> memcpy 2205 MB/s 2217 MB/s 4096 100.00 data//data_1
>> lz4 1.8.0 216 MB/s 761 MB/s 2433 59.40 data//data_1
>> lz4 1.8.0 269 MB/s 877 MB/s 1873 45.73 data//data_2
>> lz4 1.8.0 238 MB/s 575 MB/s 2060 50.29 data//data_3
>> lz4 1.8.0 321 MB/s 1015 MB/s 1464 35.74 data//data_4
>> lz4 1.8.0 464 MB/s 1090 MB/s 713 17.41 data//data_5
>> lz4 1.8.0 296 MB/s 956 MB/s 1597 38.99 data//data_6
>> lz4 1.8.0 338 MB/s 994 MB/s 2238 54.64 data//data_7
>> lz4 1.8.0 705 MB/s 1172 MB/s 193 4.71 data//data_8
>> lz4 1.8.0 404 MB/s 1150 MB/s 1097 26.78 data//data_9
>> lz4 1.8.0 216 MB/s 921 MB/s 3183 77.71 data//data_10
>> lz4 1.8.0 456 MB/s 1101 MB/s 1011 24.68 data//data_11
>> lz4 1.8.0 867 MB/s 1202 MB/s 37 0.90 data//data_12
>>
>>
>> LZ4 Dynamic Offet:
>> sh-3.2# ./lzbench -r -elz4_dyn data/
>> lzbench 1.7.3 (32-bit Linux) Assembled by P.Skibinski
>> Compressor name Compress. Decompress. Compr. size Ratio Filename
>> memcpy 2203 MB/s 2218 MB/s 4096 100.00 data//data_1
>> lz4 1.8.0 218 MB/s 693 MB/s 2228 54.39 data//data_1
>> lz4 1.8.0 273 MB/s 851 MB/s 1739 42.46 data//data_2
>> lz4 1.8.0 230 MB/s 526 MB/s 1800 43.95 data//data_3
>> lz4 1.8.0 321 MB/s 952 MB/s 1357 33.13 data//data_4
>> lz4 1.8.0 470 MB/s 1075 MB/s 664 16.21 data//data_5
>> lz4 1.8.0 303 MB/s 964 MB/s 1455 35.52 data//data_6
>> lz4 1.8.0 345 MB/s 951 MB/s 2126 51.90 data//data_7
>> lz4 1.8.0 744 MB/s 1163 MB/s 177 4.32 data//data_8
>> lz4 1.8.0 409 MB/s 1257 MB/s 1033 25.22 data//data_9
>> lz4 1.8.0 220 MB/s 857 MB/s 3049 74.44 data//data_10
>> lz4 1.8.0 464 MB/s 1105 MB/s 934 22.80 data//data_11
>> lz4 1.8.0 874 MB/s 1194 MB/s 36 0.88 data//data_12
>>
>>
>> LZ4 Dynamic Offset with 32K data:
>> sh-3.2# ./lzbench -elz4_dyn data/data32k
>> lzbench 1.7.3 (32-bit Linux) Assembled by P.Skibinski
>> Compressor name Compress. Decompress. Compr. size Ratio Filename
>> memcpy 5285 MB/s 5283 MB/s 32768 100.00 data/data32k
>> lz4 1.8.0 274 MB/s 995 MB/s 13435 41.00 data/data32k
>> done... (cIters=1 dIters=1 cTime=1.0 dTime=2.0 chunkSize=1706MB cSpeed=0MB)
>>
>> Original LZ4 with 32K data:
>> sh-3.2# ./lzbench_orig -elz4 data/data32k
>> lzbench 1.7.3 (32-bit Linux) Assembled by P.Skibinski
>> Compressor name Compress. Decompress. Compr. size Ratio Filename
>> memcpy 4918 MB/s 5108 MB/s 32768 100.00 data/data32k
>> lz4 1.8.0 276 MB/s 1045 MB/s 14492 44.23 data/data32k
>>
>> LZO1x with 32K data (Default Compressor for ZRAM):
>> sh-3.2# ./lzbench -elzo1x,1 data/data32k
>> lzbench 1.7.3 (32-bit Linux) Assembled by P.Skibinski
>> Compressor name Compress. Decompress. Compr. size Ratio Filename
>> memcpy 5273 MB/s 5320 MB/s 32768 100.00 data/data32k
>> lzo1x 2.09 -1 283 MB/s 465 MB/s 14292 43.62 data/data32k
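As a quick sanity check of the 32 KB figures above (an editorial snippet; the sizes are copied from the lzbench output):

```python
# Ratio delta for the 32 KB sample, using the compressed sizes reported
# by lzbench above: 14492 bytes for plain LZ4, 13435 bytes for LZ4 dyn.
orig_size = 32768
lz4_size = 14492      # reported ratio 44.23
lz4_dyn_size = 13435  # reported ratio 41.00

delta_pp = 100.0 * (lz4_size - lz4_dyn_size) / orig_size
print(f"LZ4 dyn saves {delta_pp:.2f} percentage points of ratio")
```

The result is about 3.23 percentage points, consistent with the 3~4% saving reported earlier in the thread.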


Thanks,
Maninder Singh
 

2018-04-16 19:34:29

by Yann Collet

Subject: Re: [PATCH 0/1] cover-letter/lz4: Implement lz4 with dynamic offset length.

Hi Singh

I don't have any strong opinion on this topic.

You made your case clear:
your variant trades a little bit of speed for a little bit more compression ratio.
In the context of zram, it makes sense, and I would expect it to work, as advertised in your benchmark results.
(disclaimer: I haven't reproduced these results; they just look reasonable to me, and I have no reason to doubt them).

So, the issue is less about performance than about code complexity.

As mentioned, this is an incompatible variant.
So it requires its own entry point, and preferably its own code path:
even if that code is heavily duplicated, mixing it with the regular lz4
source code, as proposed in the patch, would be bad for maintenance and
could negatively impact regular lz4 usage outside of zram.

So that's basically the "cost" of adding this option.

Is it worth it?
Well, this is completely outside of my responsibility area, so I really can't tell.
You'll have to convince people in charge that the gains are worth their complexity,
since _they_ will inherit the duty to keep the system working through its future evolutions.
At a minimum, you are targeting maintainers of zram and the crypto interface.
For this topic, they are the right people to talk to.


On 4/16/18, 04:09, "Maninder Singh" <[email protected]> wrote:


Hello Nick/ Yann,

Any inputs regarding the LZ4 dyn results and approach?

>Hello Nick/Sergey,
>
>Any suggestions or comments, so that we can update the code and resend the patch?
>
>> Hi Nick / Sergey,
>>
>>
>> We have compared LZ4 Dyn with the original LZ4 using samples of real-time application data (4 KB)
>> compressed/decompressed by ZRAM. For the comparison we used lzbench (https://github.com/inikep/lzbench).
>> We implemented a dedicated LZ4 Dyn API and kept the last literal length at 6 to avoid the overhead
>> of extra checks. On average there is a saving of 3~4% in compression ratio, with almost the same
>> compression speed and a minor loss in decompression speed (~50 MB/s) compared with LZ4.
>>
>> LZ4 Dyn is also compared with LZO1x, since LZO1x is the default compressor for ZRAM.
>>
>> [benchmark tables snipped -- identical to those in the previous message]


Thanks,
Maninder Singh



2018-04-16 20:01:18

by Eric Biggers

Subject: Re: [PATCH 0/1] cover-letter/lz4: Implement lz4 with dynamic offset length.

On Mon, Apr 16, 2018 at 07:34:29PM +0000, Yann Collet wrote:
> Hi Singh
>
> I don't have any strong opinion on this topic.
>
> You made your case clear:
> your variant trades a little bit of speed for a little bit more compression ratio.
> In the context of zram, it makes sense, and I would expect it to work, as advertised in your benchmark results.
> (disclaimer: I haven't reproduced these results; they just look reasonable to me, and I have no reason to doubt them).
>
> So, the issue is less about performance than about code complexity.
>
> As mentioned, this is an incompatible variant.
> So it requires its own entry point, and preferably its own code path:
> even if that code is heavily duplicated, mixing it with the regular lz4
> source code, as proposed in the patch, would be bad for maintenance and
> could negatively impact regular lz4 usage outside of zram.
>
> So that's basically the "cost" of adding this option.
>
> Is it worth it?
> Well, this is completely outside of my responsibility area, so I really can't tell.
> You'll have to convince people in charge that the gains are worth their complexity,
> since _they_ will inherit the duty to keep the system working through its future evolutions.
> At a minimum, you are targeting maintainers of zram and the crypto interface.
> For this topic, they are the right people to talk to.
>
>
> On 4/16/18, 04:09, "Maninder Singh" <[email protected]> wrote:
>
>
> Hello Nick/ Yann,
>
> Any inputs regarding the LZ4 dyn results and approach?
>
> >Hello Nick/Sergey,
> >
> >Any suggestions or comments, so that we can update the code and resend the patch?
> >
> >> Hi Nick / Sergey,
> >>
> >>
> >> We have compared LZ4 Dyn with the original LZ4 using samples of real-time application data (4 KB)
> >> compressed/decompressed by ZRAM. For the comparison we used lzbench (https://github.com/inikep/lzbench).
> >> We implemented a dedicated LZ4 Dyn API and kept the last literal length at 6 to avoid the overhead
> >> of extra checks. On average there is a saving of 3~4% in compression ratio, with almost the same
> >> compression speed and a minor loss in decompression speed (~50 MB/s) compared with LZ4.
> >>
> >> LZ4 Dyn is also compared with LZO1x, since LZO1x is the default compressor for ZRAM.
> >>

Unfortunately the track record of maintaining compression code in the Linux
kernel is not great. zlib, for example, was forked from v1.2.3, which was
released in 2005, and hasn't been updated since, apart from some random
drive-by patches that have made it diverge even further from upstream. There
have even been bugs assigned CVE numbers in upstream zlib, and I don't think
anyone has checked whether the Linux kernel version has those bugs or not.

The story with LZ4 is a bit better, as someone updated it to v1.7.3 last year.
But it took many rounds of review, in which I had to point out some subtle
regressions, like the hash table size being accidentally changed and things
that should be inlined not being inlined. And of course that version is
already outdated.

We also have LZO and Zstandard in the kernel to maintain too, as well as XZ
decompression.

And very problematically, *none* of these compression algorithms have a
maintainer listed in the MAINTAINERS file.

So in my opinion, as a prerequisite for this change, someone would need to
volunteer to actually maintain LZ4 in the kernel.

Thanks,

Eric