Received: by 10.213.65.68 with SMTP id h4csp84774imn; Wed, 21 Mar 2018 13:01:28 -0700 (PDT) X-Google-Smtp-Source: AG47ELujOiuW31MDgnDfwbo8zIf+9Zqrc4hZxs2Xt7X95Gx3NxfmLzpqmM0cwh+qfsWKGm7QycfG X-Received: by 10.98.201.70 with SMTP id k67mr14764191pfg.141.1521662488040; Wed, 21 Mar 2018 13:01:28 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1521662487; cv=none; d=google.com; s=arc-20160816; b=TbzSAcnx4RuCIZ+zQ2BmR7fxAN/pqMM35vw1NP9k+jjoucb17g+zDG6zFBWUPB3XTt rNSfi8iUg6Sz5lpdGOyPIJxOqpTEzdMa33Xctgz1EB9Pj2/uI/uGcA16n+tpwwToJwFO aXYvGeTJMDcnEsQx+sP/ZLhZN/lPUonwlPbA1umgdpTGfx3lbtsonW6VesvLRE3CKDam LvHPEOD38taJW6HQHS3bRBG3g0QbLdleGLBdaXd1n0doeuSSP2nFRME0QIoVtdIdL32U b9/7ZeP0jozBmMjXtjQB4PqxAEq/kjwWH6xLt1VCC8ofmc73h1GI8yJIVjou5P/8JnUp kk2A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:mime-version:references:in-reply-to :message-id:date:subject:smtp-origin-cluster:cc:to :smtp-origin-hostname:from:smtp-origin-hostprefix:dkim-signature :arc-authentication-results; bh=Pbc0YkX6pQ7Tt/8HYut4s1dklir4F1fbSo6sFQTpUG0=; b=dWzRRKINj0R20T34GpMUo1smLB6J9Vw7fsFdYaDNdPyheGsNfy/KaMlQ+J1zzwjnDr I3ejRo0bLnt0wrvfiKwmbxSjmvT94vbrX4yZS9GTiivJuh9AhcUwM2E2+rEqLFNO6Meb dCgBksuZaO9krzAhyyjYVk4PmlTTphadCIxUHn0QqjwwelAZhvz74y8Yg118X1JZFi36 wFTyCSB9tomAPfNhj4IEL7nvsdZf/YgyMrq6k8M3sWlXjKT+vt3u3MW7EwEvb0F/Cukl iw28HfaDt6BcCwtaU2CfFjx6N3NCDi7dBkKHJCTiTvIAtmJM3fGqODq6xVOF0BwCPJlu VLsw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@fb.com header.s=facebook header.b=c2a2iR9e; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=fb.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id r20si3556356pfj.245.2018.03.21.13.01.07; Wed, 21 Mar 2018 13:01:27 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@fb.com header.s=facebook header.b=c2a2iR9e; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=fb.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753324AbeCUT7V (ORCPT + 99 others); Wed, 21 Mar 2018 15:59:21 -0400 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:55756 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753006AbeCUT7O (ORCPT ); Wed, 21 Mar 2018 15:59:14 -0400 Received: from pps.filterd (m0044010.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.22/8.16.0.22) with SMTP id w2LJtd8o015412 for ; Wed, 21 Mar 2018 12:59:14 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-type; s=facebook; bh=Pbc0YkX6pQ7Tt/8HYut4s1dklir4F1fbSo6sFQTpUG0=; b=c2a2iR9e0ICsU68KBZdybvhCVxb4NlVT1Z6rjiX3KaCJSQErjQ3113jGXeRuDUjkt7zr lcgGSkBsY6cW/ugxlXkEbk6TZPjCGKX94zmgRx/yHjquklx7e8IvRJ0dDyqA8KtCDBqw gw+e13JDnK84v+9CA0Zbaery4X0/eLaqCgQ= Received: from mail.thefacebook.com ([199.201.64.23]) by mx0a-00082601.pphosted.com with ESMTP id 2guw05g7sr-4 (version=TLSv1 cipher=ECDHE-RSA-AES256-SHA bits=256 verify=NOT) for ; Wed, 21 Mar 2018 12:59:14 -0700 Received: from mx-out.facebook.com (192.168.52.123) by PRN-CHUB11.TheFacebook.com (192.168.16.21) with Microsoft SMTP Server id 14.3.319.2; Wed, 21 Mar 2018 12:59:12 -0700 Received: by devvm33104.prn1.facebook.com (Postfix, from userid 32154) id 253DD28DCB3; Wed, 21 Mar 2018 12:59:11 -0700 (PDT) Smtp-Origin-Hostprefix: devvm From: Nick Terrell Smtp-Origin-Hostname: devvm33104.prn1.facebook.com To: Maninder Singh CC: Yann Collet , Sergey Senozhatsky , , , Nick Terrell Smtp-Origin-Cluster: prn1c35 Subject: Re: [PATCH 1/1] lz4: Implement lz4 with dynamic offset length. Date: Wed, 21 Mar 2018 12:59:06 -0700 Message-ID: <20180321195906.1260480-1-terrelln@fb.com> X-Mailer: git-send-email 2.9.5 In-Reply-To: <1521607242-3968-2-git-send-email-maninder1.s@samsung.com> References: <1521607242-3968-2-git-send-email-maninder1.s@samsung.com> X-FB-Internal: Safe MIME-Version: 1.0 Content-Type: text/plain X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,, definitions=2018-03-21_09:,, signatures=0 X-Proofpoint-Spam-Reason: safe X-FB-Internal: Safe Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On (03/21/18 10:10), Maninder Singh wrote: > diff --git a/lib/lz4/lz4_compress.c b/lib/lz4/lz4_compress.c > index cc7b6d4..185c358 100644 > --- a/lib/lz4/lz4_compress.c > +++ b/lib/lz4/lz4_compress.c > @@ -183,7 +183,8 @@ static FORCE_INLINE int LZ4_compress_generic( > const tableType_t tableType, > const dict_directive dict, > const dictIssue_directive dictIssue, > - const U32 acceleration) > + const U32 acceleration, > + const Dynamic_Offset dynOffset) > { > const BYTE *ip = (const BYTE *) source; > const BYTE *base; > @@ -199,6 +200,7 @@ static FORCE_INLINE int LZ4_compress_generic( > > BYTE *op = (BYTE *) dest; > BYTE * const olimit = op + maxOutputSize; > + int max_distance = dynOffset ? MAX_DISTANCE_DYN : MAX_DISTANCE; Lets mark this variable `const`. If the compiler doesn't realize that this variable and `dynOffset` are compile time constants, I expect the speed to be impacted. > > U32 forwardH; > size_t refDelta = 0; > @@ -245,6 +247,7 @@ static FORCE_INLINE int LZ4_compress_generic( > for ( ; ; ) { > const BYTE *match; > BYTE *token; > + int curr_offset; > > /* Find a match */ > { > @@ -285,7 +288,7 @@ static FORCE_INLINE int LZ4_compress_generic( > : 0) > || ((tableType == byU16) > ? 0 > - : (match + MAX_DISTANCE < ip)) > + : (match + max_distance < ip)) > || (LZ4_read32(match + refDelta) > != LZ4_read32(ip))); > } > @@ -328,8 +331,26 @@ static FORCE_INLINE int LZ4_compress_generic( > > _next_match: > /* Encode Offset */ > - LZ4_writeLE16(op, (U16)(ip - match)); > - op += 2; > + if (dynOffset) { > + curr_offset = (U16)(ip - match); > + > + /* > + * If Ofsset is greater than 127, we need 2 bytes > + * to store it. Otherwise 1 byte is enough. > + */ > + if (curr_offset > 127) { > + curr_offset = (curr_offset << 1) | DYN_BIT; > + LZ4_writeLE16(op, (U16)curr_offset); > + op += 2; > + } else { > + curr_offset = curr_offset << 1; > + *op = (BYTE)curr_offset; > + op++; > + } The standard way to do variable sized integers is to use the high-bit as the control bit, not the low-bit. Do you have benchmarks to show that using the low-bit is faster than using the high-bit? If so, lets comment in the code, if not lets use the high-bit. > + } else { > + LZ4_writeLE16(op, (U16)(ip - match)); > + op += 2; > + } > > /* Encode MatchLength */ > { > @@ -480,39 +501,70 @@ static int LZ4_compress_fast_extState( > return LZ4_compress_generic(ctx, source, > dest, inputSize, 0, > noLimit, byU16, noDict, > - noDictIssue, acceleration); > + noDictIssue, acceleration, NoDynOffset); > else > return LZ4_compress_generic(ctx, source, > dest, inputSize, 0, > noLimit, tableType, noDict, > - noDictIssue, acceleration); > + noDictIssue, acceleration, NoDynOffset); > } else { > if (inputSize < LZ4_64Klimit) > return LZ4_compress_generic(ctx, source, > dest, inputSize, > maxOutputSize, limitedOutput, byU16, noDict, > - noDictIssue, acceleration); > + noDictIssue, acceleration, NoDynOffset); > else > return LZ4_compress_generic(ctx, source, > dest, inputSize, > maxOutputSize, limitedOutput, tableType, noDict, > - noDictIssue, acceleration); > + noDictIssue, acceleration, NoDynOffset); > } > } > > +static int LZ4_compress_fast_extState_dynamic( > + void *state, > + const char *source, > + char *dest, > + int inputSize, > + int maxOutputSize, > + int acceleration) > +{ > + LZ4_stream_t_internal *ctx = &((LZ4_stream_t *)state)->internal_donotuse; > + > + LZ4_resetStream((LZ4_stream_t *)state); > + > + if (acceleration < 1) > + acceleration = LZ4_ACCELERATION_DEFAULT; > + > + if (maxOutputSize >= LZ4_COMPRESSBOUND(inputSize)) > + return LZ4_compress_generic(ctx, source, > + dest, inputSize, 0, > + noLimit, byU16, noDict, > + noDictIssue, acceleration, DynOffset); > + else > + return LZ4_compress_generic(ctx, source, > + dest, inputSize, > + maxOutputSize, limitedOutput, byU16, noDict, > + noDictIssue, acceleration, DynOffset); > +} > + > int LZ4_compress_fast(const char *source, char *dest, int inputSize, > - int maxOutputSize, int acceleration, void *wrkmem) > + int maxOutputSize, int acceleration, void *wrkmem, bool dynOffset) > { > - return LZ4_compress_fast_extState(wrkmem, source, dest, inputSize, > + if (!dynOffset) > + return LZ4_compress_fast_extState(wrkmem, source, dest, inputSize, > + maxOutputSize, acceleration); > + > + return LZ4_compress_fast_extState_dynamic(wrkmem, source, dest, inputSize, > maxOutputSize, acceleration); > } > EXPORT_SYMBOL(LZ4_compress_fast); > > int LZ4_compress_default(const char *source, char *dest, int inputSize, > - int maxOutputSize, void *wrkmem) > + int maxOutputSize, void *wrkmem, bool dynOffset) > { > return LZ4_compress_fast(source, dest, inputSize, > - maxOutputSize, LZ4_ACCELERATION_DEFAULT, wrkmem); > + maxOutputSize, LZ4_ACCELERATION_DEFAULT, wrkmem, dynOffset); > } > EXPORT_SYMBOL(LZ4_compress_default); > > @@ -900,12 +952,12 @@ int LZ4_compress_fast_continue(LZ4_stream_t *LZ4_stream, const char *source, > result = LZ4_compress_generic( > streamPtr, source, dest, inputSize, > maxOutputSize, limitedOutput, byU32, > - withPrefix64k, dictSmall, acceleration); > + withPrefix64k, dictSmall, acceleration, NoDynOffset); > } else { > result = LZ4_compress_generic( > streamPtr, source, dest, inputSize, > maxOutputSize, limitedOutput, byU32, > - withPrefix64k, noDictIssue, acceleration); > + withPrefix64k, noDictIssue, acceleration, NoDynOffset); > } > streamPtr->dictSize += (U32)inputSize; > streamPtr->currentOffset += (U32)inputSize; > @@ -921,12 +973,12 @@ int LZ4_compress_fast_continue(LZ4_stream_t *LZ4_stream, const char *source, > result = LZ4_compress_generic( > streamPtr, source, dest, inputSize, > maxOutputSize, limitedOutput, byU32, > - usingExtDict, dictSmall, acceleration); > + usingExtDict, dictSmall, acceleration, NoDynOffset); > } else { > result = LZ4_compress_generic( > streamPtr, source, dest, inputSize, > maxOutputSize, limitedOutput, byU32, > - usingExtDict, noDictIssue, acceleration); > + usingExtDict, noDictIssue, acceleration, NoDynOffset); > } > streamPtr->dictionary = (const BYTE *)source; > streamPtr->dictSize = (U32)inputSize; > diff --git a/lib/lz4/lz4_decompress.c b/lib/lz4/lz4_decompress.c > index 141734d..337a828 100644 > --- a/lib/lz4/lz4_decompress.c > +++ b/lib/lz4/lz4_decompress.c > @@ -71,7 +71,9 @@ static FORCE_INLINE int LZ4_decompress_generic( > /* only if dict == usingExtDict */ > const BYTE * const dictStart, > /* note : = 0 if noDict */ > - const size_t dictSize > + const size_t dictSize, > + /* offset == 1; dynamic offset */ > + const Dynamic_Offset dynOffset > ) > { > /* Local Variables */ > @@ -141,8 +143,8 @@ static FORCE_INLINE int LZ4_decompress_generic( > /* copy literals */ > cpy = op + length; > if (((endOnInput) && ((cpy > (partialDecoding ? oexit : oend - MFLIMIT)) > - || (ip + length > iend - (2 + 1 + LASTLITERALS)))) > - || ((!endOnInput) && (cpy > oend - WILDCOPYLENGTH))) { > + || (ip + length > iend - (2 + LASTLITERALS)))) > + || ((!endOnInput) && (cpy > oend - WILDCOPYLENGTH - 1))) { > if (partialDecoding) { > if (cpy > oend) { > /* > @@ -188,13 +190,31 @@ static FORCE_INLINE int LZ4_decompress_generic( > break; > } > > - LZ4_wildCopy(op, ip, cpy); > + if (dynOffset && length < 4) > + LZ4_copy4(op, ip); > + else > + LZ4_wildCopy(op, ip, cpy); > + The LZ4 format enforces that the last 5 bytes are literals so that LZ4_wildCopy() can be used here. I suspect that having this extra branch here for `dynOffset` mode hurts decompression speed. > ip += length; > op = cpy; > > /* get offset */ > - offset = LZ4_readLE16(ip); > - ip += 2; > + if (dynOffset) { > + /* > + * Check if DYN_BIT is set, means 2 Byte Offset, > + * else 1 Byte Offset. > + */ > + if (*ip & DYN_BIT) { > + offset = LZ4_readLE16(ip) >> 1; > + ip += 2; > + } else { > + offset = *ip >> 1; > + ip += 1; If we use the high-bit as the control bit, this branch simply becomes `offset = *ip`, though the long offset branch becomes a bit longer. > + } > + } else { > + offset = LZ4_readLE16(ip); > + ip += 2; > + } > match = op - offset; > > if ((checkOffset) && (unlikely(match < lowLimit))) { > @@ -335,11 +355,11 @@ static FORCE_INLINE int LZ4_decompress_generic( > } > > int LZ4_decompress_safe(const char *source, char *dest, > - int compressedSize, int maxDecompressedSize) > + int compressedSize, int maxDecompressedSize, bool dynOffset) > { > return LZ4_decompress_generic(source, dest, compressedSize, > maxDecompressedSize, endOnInputSize, full, 0, > - noDict, (BYTE *)dest, NULL, 0); > + noDict, (BYTE *)dest, NULL, 0, dynOffset); You'll need to use the same trick that LZ4_compress_fast() uses, by hard coding `dynOffset`. We want the compiler to generate too version of LZ4_decompress_generic(), one with `dynOffset` and one with `!dynOffset`. That way the tight loop won't the branches that check `dynOffset`. if (dynOffset) return LZ4_decompress_generic(..., true); else return LZ4_decompress_generic(..., false); Without this trick, I expect that this patch causes a regression to both LZ4 and LZ4_DYN decompression speed. > } > > int LZ4_decompress_safe_partial(const char *source, char *dest, > @@ -347,14 +367,14 @@ int LZ4_decompress_safe_partial(const char *source, char *dest, > { > return LZ4_decompress_generic(source, dest, compressedSize, > maxDecompressedSize, endOnInputSize, partial, > - targetOutputSize, noDict, (BYTE *)dest, NULL, 0); > + targetOutputSize, noDict, (BYTE *)dest, NULL, 0, NoDynOffset); > } > > int LZ4_decompress_fast(const char *source, char *dest, int originalSize) > { > return LZ4_decompress_generic(source, dest, 0, originalSize, > endOnOutputSize, full, 0, withPrefix64k, > - (BYTE *)(dest - 64 * KB), NULL, 64 * KB); > + (BYTE *)(dest - 64 * KB), NULL, 64 * KB, NoDynOffset); > } > > int LZ4_setStreamDecode(LZ4_streamDecode_t *LZ4_streamDecode, > @@ -392,7 +412,7 @@ int LZ4_decompress_safe_continue(LZ4_streamDecode_t *LZ4_streamDecode, > endOnInputSize, full, 0, > usingExtDict, lz4sd->prefixEnd - lz4sd->prefixSize, > lz4sd->externalDict, > - lz4sd->extDictSize); > + lz4sd->extDictSize, NoDynOffset); > > if (result <= 0) > return result; > @@ -406,7 +426,7 @@ int LZ4_decompress_safe_continue(LZ4_streamDecode_t *LZ4_streamDecode, > compressedSize, maxOutputSize, > endOnInputSize, full, 0, > usingExtDict, (BYTE *)dest, > - lz4sd->externalDict, lz4sd->extDictSize); > + lz4sd->externalDict, lz4sd->extDictSize, NoDynOffset); > if (result <= 0) > return result; > lz4sd->prefixSize = result; > @@ -427,7 +447,7 @@ int LZ4_decompress_fast_continue(LZ4_streamDecode_t *LZ4_streamDecode, > endOnOutputSize, full, 0, > usingExtDict, > lz4sd->prefixEnd - lz4sd->prefixSize, > - lz4sd->externalDict, lz4sd->extDictSize); > + lz4sd->externalDict, lz4sd->extDictSize, NoDynOffset); > > if (result <= 0) > return result; > @@ -440,7 +460,7 @@ int LZ4_decompress_fast_continue(LZ4_streamDecode_t *LZ4_streamDecode, > result = LZ4_decompress_generic(source, dest, 0, originalSize, > endOnOutputSize, full, 0, > usingExtDict, (BYTE *)dest, > - lz4sd->externalDict, lz4sd->extDictSize); > + lz4sd->externalDict, lz4sd->extDictSize, NoDynOffset); > if (result <= 0) > return result; > lz4sd->prefixSize = originalSize; > @@ -463,19 +483,19 @@ static FORCE_INLINE int LZ4_decompress_usingDict_generic(const char *source, > if (dictSize == 0) > return LZ4_decompress_generic(source, dest, > compressedSize, maxOutputSize, safe, full, 0, > - noDict, (BYTE *)dest, NULL, 0); > + noDict, (BYTE *)dest, NULL, 0, NoDynOffset); > if (dictStart + dictSize == dest) { > if (dictSize >= (int)(64 * KB - 1)) > return LZ4_decompress_generic(source, dest, > compressedSize, maxOutputSize, safe, full, 0, > - withPrefix64k, (BYTE *)dest - 64 * KB, NULL, 0); > + withPrefix64k, (BYTE *)dest - 64 * KB, NULL, 0, NoDynOffset); > return LZ4_decompress_generic(source, dest, compressedSize, > maxOutputSize, safe, full, 0, noDict, > - (BYTE *)dest - dictSize, NULL, 0); > + (BYTE *)dest - dictSize, NULL, 0, NoDynOffset); > } > return LZ4_decompress_generic(source, dest, compressedSize, > maxOutputSize, safe, full, 0, usingExtDict, > - (BYTE *)dest, (const BYTE *)dictStart, dictSize); > + (BYTE *)dest, (const BYTE *)dictStart, dictSize, NoDynOffset); > } > > int LZ4_decompress_safe_usingDict(const char *source, char *dest, > diff --git a/lib/lz4/lz4defs.h b/lib/lz4/lz4defs.h > index 00a0b58..9451a73 100644 > --- a/lib/lz4/lz4defs.h > +++ b/lib/lz4/lz4defs.h > @@ -75,6 +75,7 @@ > #define WILDCOPYLENGTH 8 > #define LASTLITERALS 5 > #define MFLIMIT (WILDCOPYLENGTH + MINMATCH) > +#define DYN_BIT 0x1 > > /* Increase this value ==> compression run slower on incompressible data */ > #define LZ4_SKIPTRIGGER 6 > @@ -87,6 +88,7 @@ > > #define MAXD_LOG 16 > #define MAX_DISTANCE ((1 << MAXD_LOG) - 1) > +#define MAX_DISTANCE_DYN ((1 << (MAXD_LOG - 1)) - 1) > #define STEPSIZE sizeof(size_t) > > #define ML_BITS 4 > @@ -147,6 +149,13 @@ static FORCE_INLINE void LZ4_copy8(void *dst, const void *src) > #endif > } > > +static FORCE_INLINE void LZ4_copy4(void *dst, const void *src) > +{ > + U32 a = get_unaligned((const U32 *)src); > + > + put_unaligned(a, (U32 *)dst); > +} > + > /* > * customized variant of memcpy, > * which can overwrite up to 7 bytes beyond dstEnd > @@ -224,4 +233,6 @@ static FORCE_INLINE unsigned int LZ4_count( > typedef enum { endOnOutputSize = 0, endOnInputSize = 1 } endCondition_directive; > typedef enum { full = 0, partial = 1 } earlyEnd_directive; > > +typedef enum { NoDynOffset = 0, DynOffset = 1 } Dynamic_Offset; > + > #endif > -- > 1.7.1