2018-11-21 12:08:14

by Dave Rodgman

Subject: [PATCH 0/6] lib/lzo: performance improvements

This patch series introduces performance improvements for lzo.

The improvements fall into two categories: general Arm-specific optimisations
(e.g., more efficient memory access); and the introduction of a special case
for handling runs of zeros (which is a common case for zram) using run-length
encoding.
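
For illustration, the zero-run case amounts to the compressor counting a
run of zero bytes and emitting a single run-length code for it, rather than
ordinary literal/match codes. A sketch (names are illustrative only, not
the actual lzo-rle code or bitstream format):

/* Illustrative sketch only: count a leading run of zero bytes so that
 * the compressor can emit one run-length code for the whole run. */
static size_t zero_run_len(const unsigned char *ip, const unsigned char *in_end)
{
	const unsigned char *p = ip;

	while (p < in_end && *p == 0)
		p++;
	return p - ip;
}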

The introduction of RLE modifies the bitstream such that it can't be decoded
by old versions of lzo (the new lzo-rle can correctly decode old bitstreams).
To avoid possible issues where data is persisted on disk (e.g., squashfs), the
final patch in this series separates lzo-rle into a separate algorithm
alongside lzo, so that the new lzo-rle is (by default) only used for zram and
must be explicitly selected for other use-cases. This final patch could be
omitted if the consensus is that we'd rather avoid proliferation of lzo
variants.

Overall, performance is improved by around 1.1 - 4.8x (data-dependent: data
with many zero runs shows higher improvement). Under real-world testing with
zram, time spent in (de)compression during swapping is reduced by around 27%.
The graph below shows the weighted round-trip throughput of lzo, lz4 and
lzo-rle, for randomly generated 4k chunks of data with varying levels of
entropy. (To calculate weighted round-trip throughput, compression performance
is emphasised to reflect the fact that zram does around 2.25x more compression
than decompression. Results and overall trends are fairly similar for
unweighted throughput.)
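
For concreteness, one natural way to express such a weighting is a
time-weighted harmonic mean with compression counted 2.25x, e.g.:

/* illustrative only: weighted round-trip throughput, counting
 * compression 2.25x as heavily as decompression */
static double weighted_mbps(double comp_mbps, double decomp_mbps)
{
	return (2.25 + 1.0) / (2.25 / comp_mbps + 1.0 / decomp_mbps);
}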

https://drive.google.com/file/d/18GU4pgRVCLNN7wXxynz-8R2ygrY2IdyE/view

Contributors:
Dave Rodgman <[email protected]>
Matt Sealey <[email protected]>


2018-11-21 14:33:39

by Markus F.X.J. Oberhumer

Subject: Re: [PATCH 0/6] lib/lzo: performance improvements

Hi Dave,

thanks for your patch set. Just some initial comments:


I think the three patches

[PATCH 2/6] lib/lzo: enable 64-bit CTZ on Arm
[PATCH 3/6] lib/lzo: 64-bit CTZ on Arm aarch64
[PATCH 4/6] lib/lzo: fast 8-byte copy on arm64

should be applied in any case - could you please make an extra
pull request out of these and try to get them merged as fast
as possible. Thanks.


The first patch

[PATCH 1/6] lib/lzo: clean-up by introducing COPY16

does not really affect the resulting code at the moment, but please
note that in one case the actual copy unit is not allowed to
be greater than 8 bytes (which might be implied by the name "COPY16").
So this needs more work like an extra COPY16_BY_8() macro.
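
Something along these lines (untested sketch, just to show the intent -
16 bytes moved, but never with a single copy unit larger than 8 bytes):

#define COPY16_BY_8(dst, src)					\
	do {							\
		COPY8(dst, src);				\
		COPY8((dst) + 8, (src) + 8);			\
	} while (0)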


As for your "lzo-rle" improvements, I'll have a look.

Please note that the first byte value 17 is actually valid when using
external dictionaries ("lzo1x_decompress_dict_safe()" in the LZO source
code). While this functionality is not present in the Linux kernel at
the moment it might be worrisome wrt future enhancements.

Finally, I'm wondering if your chart comparison just covers the "lzo-rle"
patch or also includes the ARM64 improvements - I cannot understand where a
20% speedup should come from if you have 0% zeros.

Cheers,
Markus



On 2018-11-21 13:06, Dave Rodgman wrote:
> This patch series introduces performance improvements for lzo.
>
> The improvements fall into two categories: general Arm-specific optimisations
> (e.g., more efficient memory access); and the introduction of a special case
> for handling runs of zeros (which is a common case for zram) using run-length
> encoding.
>
> The introduction of RLE modifies the bitstream such that it can't be decoded
> by old versions of lzo (the new lzo-rle can correctly decode old bitstreams).
> To avoid possible issues where data is persisted on disk (e.g., squashfs), the
> final patch in this series separates lzo-rle into a separate algorithm
> alongside lzo, so that the new lzo-rle is (by default) only used for zram and
> must be explicitly selected for other use-cases. This final patch could be
> omitted if the consensus is that we'd rather avoid proliferation of lzo
> variants.
>
> Overall, performance is improved by around 1.1 - 4.8x (data-dependent: data
> with many zero runs shows higher improvement). Under real-world testing with
> zram, time spent in (de)compression during swapping is reduced by around 27%.
> The graph below shows the weighted round-trip throughput of lzo, lz4 and
> lzo-rle, for randomly generated 4k chunks of data with varying levels of
> entropy. (To calculate weighted round-trip throughput, compression performance
> is emphasised to reflect the fact that zram does around 2.25x more compression
> than decompression. Results and overall trends are fairly similar for
> unweighted throughput.)
>
> https://drive.google.com/file/d/18GU4pgRVCLNN7wXxynz-8R2ygrY2IdyE/view
>
> Contributors:
> Dave Rodgman <[email protected]>
> Matt Sealey <[email protected]>
>

--
Markus Oberhumer, <[email protected]>, http://www.oberhumer.com/

2018-11-21 19:39:59

by Dave Rodgman

Subject: Re: [PATCH 0/6] lib/lzo: performance improvements

On 21/11/2018 1:44 pm, Markus F.X.J. Oberhumer wrote:

> I think the three patches
>
> [PATCH 2/6] lib/lzo: enable 64-bit CTZ on Arm
> [PATCH 3/6] lib/lzo: 64-bit CTZ on Arm aarch64
> [PATCH 4/6] lib/lzo: fast 8-byte copy on arm64
>
> should be applied in any case - could you please make an extra
> pull request out of these and try to get them merged as fast
> as possible. Thanks.

The three patches you mention give around 10-25% performance uplift
(mostly on compression). I'll look at generating a pull request for these.
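
For reference, the gist of the CTZ changes (a sketch, not the literal
diff; assumes little-endian) is to let the match-length search compare
eight bytes at a time and use a 64-bit count-trailing-zeros on the XOR
to locate the first differing byte:

/* sketch only: how many of the first 8 bytes of a and b match */
static unsigned int match_bytes8(const u8 *a, const u8 *b)
{
	u64 v = get_unaligned((const u64 *)a) ^ get_unaligned((const u64 *)b);

	return v ? (unsigned int)__builtin_ctzll(v) / 8 : 8;
}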

> [PATCH 1/6] lib/lzo: clean-up by introducing COPY16
>
> does not really affect the resulting code at the moment, but please
> note that in one case the actual copy unit is not allowed to
> be greater than 8 bytes (which might be implied by the name "COPY16").
> So this needs more work like an extra COPY16_BY_8() macro.

I'll leave Matt to comment on this one, as it's his patch.

> As for your "lzo-rle" improvements, I'll have a look.
>
> Please note that the first byte value 17 is actually valid when using
> external dictionaries ("lzo1x_decompress_dict_safe()" in the LZO source
> code). While this functionality is not present in the Linux kernel at
> the moment it might be worrisome wrt future enhancements.

I wasn't aware of the external dictionary concern. Do you have any
suggestions for an alternative instruction that we could use instead
that would not be used by the existing lzo algorithm at the start of the
stream? If there isn't anything suitable, then we'd have to choose
between backwards compatibility (not a huge issue, if lzo-rle were to be
kept as a separate algorithm to lzo, but certainly nice to have) vs.
allowing for the possibility of introducing external dictionaries in future.

> Finally, I'm wondering if your chart comparison just covers the "lzo-rle"
> patch or also includes the ARM64 improvements - I cannot understand where a
> 20% speedup should come from if you have 0% zeros.

The chart does indeed include the other improvements, so this is where
the performance uplift on the left hand side of the chart (i.e., random
data) comes from.

Thanks for taking a look at this.

Dave

>
> Cheers,
> Markus
>
>
>
> On 2018-11-21 13:06, Dave Rodgman wrote:
>> This patch series introduces performance improvements for lzo.
>>
>> The improvements fall into two categories: general Arm-specific optimisations
>> (e.g., more efficient memory access); and the introduction of a special case
>> for handling runs of zeros (which is a common case for zram) using run-length
>> encoding.
>>
>> The introduction of RLE modifies the bitstream such that it can't be decoded
>> by old versions of lzo (the new lzo-rle can correctly decode old bitstreams).
>> To avoid possible issues where data is persisted on disk (e.g., squashfs), the
>> final patch in this series separates lzo-rle into a separate algorithm
>> alongside lzo, so that the new lzo-rle is (by default) only used for zram and
>> must be explicitly selected for other use-cases. This final patch could be
>> omitted if the consensus is that we'd rather avoid proliferation of lzo
>> variants.
>>
>> Overall, performance is improved by around 1.1 - 4.8x (data-dependent: data
>> with many zero runs shows higher improvement). Under real-world testing with
>> zram, time spent in (de)compression during swapping is reduced by around 27%.
>> The graph below shows the weighted round-trip throughput of lzo, lz4 and
>> lzo-rle, for randomly generated 4k chunks of data with varying levels of
>> entropy. (To calculate weighted round-trip throughput, compression performance
>> is emphasised to reflect the fact that zram does around 2.25x more compression
>> than decompression. Results and overall trends are fairly similar for
>> unweighted throughput.)
>>
>> https://drive.google.com/file/d/18GU4pgRVCLNN7wXxynz-8R2ygrY2IdyE/view
>>
>> Contributors:
>> Dave Rodgman <[email protected]>
>> Matt Sealey <[email protected]>
>>
>

2018-11-21 19:40:00

by Matt Sealey

Subject: RE: [PATCH 0/6] lib/lzo: performance improvements

Hi Markus.

> Hi Dave,
>
> thanks for your patch set. Just some initial comments:
>
>
> I think the three patches
>
> [PATCH 2/6] lib/lzo: enable 64-bit CTZ on Arm
> [PATCH 3/6] lib/lzo: 64-bit CTZ on Arm aarch64
> [PATCH 4/6] lib/lzo: fast 8-byte copy on arm64
>
> should be applied in any case - could you please make an extra
> pull request out of these and try to get them merged as fast
> as possible. Thanks.
>
>
> The first patch
>
> [PATCH 1/6] lib/lzo: clean-up by introducing COPY16
>
> does not really affect the resulting code at the moment, but please
> note that in one case the actual copy unit is not allowed to
> be greater than 8 bytes (which might be implied by the name "COPY16").
> So this needs more work like an extra COPY16_BY_8() macro.

I agree in principle but not in practice ;)

The original intent of those patches was to avail ourselves of
known architectural features, but also to give the compiler a helping
hand where LZO isn't supposed to be that knowledgeable about the
underlying processor implementation. The end result, I think, is that
letting an architecture provide a discrete 16-byte movement is more
beneficial than dictating how things are copied underneath it.

What came out is that there is a difference in codegen in compress
vs. decompress literally because of a difference in coding style:

decompress:

COPY8(op, ip);
op += 8;
ip += 8;
COPY8(op, ip);
op += 8;
ip += 8;

compress:

COPY8(op, ip);
COPY8(op+8, ip+8);
ip += 16;
op += 16;

Compilers do very well with the latter, and terribly with the former.
Once I aligned them to the same sequence I noticed the only time
LZO ever does 8-byte copies is in 16-byte chunks, and this aligned
with my expectation that on arm64 LDP/STP with writeback gets it done
in two instructions. I want to go to there!

COPY16 being defined as COPY8,COPY8 is low-hanging fruit that
improves codegen on arm64 slightly (because it cleans up the
decompression code) and performance even more slightly as a
result.

Just for elucidation, it goes from (decompress)

add x0, x0, 16 // op, op,
ldr w2, [x1] //, MEM[base: ii_17, offset: 0B]
add x1, x1, 16 // ii, ii,
cmp x0, x3 // op, _490
str w2, [x0, -16] // _89, MEM[base: op_8, offset: 0B]
ldr w2, [x1, -12] //, MEM[base: ii_17, offset: 4B]
str w2, [x0, -12] // _105, MEM[base: op_8, offset: 4B]
ldr w2, [x1, -8] //, MEM[base: ii_17, offset: 8B]
str w2, [x0, -8] // _145, MEM[base: op_8, offset: 8B]
ldr w2, [x1, -4] //, MEM[base: ii_17, offset: 12B]
str w2, [x0, -4] // _156, MEM[base: op_8, offset: 12B]

To this (same code as compress):

add x0, x0, 16 // op, op,
ldr x2, [x1] // _153, MEM[base: ii_17, offset: 0B]
add x1, x1, 16 // ii, ii,
cmp x0, x3 // op, _418
str x2, [x0, -16] // _153, MEM[base: op_8, offset: 0B]
ldr x2, [x1, -8] // _181, MEM[base: ii_17, offset: 8B]
str x2, [x0, -8] // _181, MEM[base: op_8, offset: 8B]

But it's going to want to be this way:

ldp x2, x3, [x1], 16 // _182, MEM[base: ii_17, offset: 0B]
stp x2, x3, [x0], 16 // _182, MEM[base: op_8, offset: 0B]

Which is a COPY16_BY_8, but only because of the architecture rules
on element sizes. COPY16_BY_8 doesn't fit the idiom we're
trying to create.

For arm64 the easiest way to get LDP/STP with writeback just
happens to be allowing compilers to lift memcpy(o,i,16), which
is always going to be alignment-safe, but this doesn't hold true
on arm32 or elsewhere, and we've not done enough testing on other
arches for that concept to bear fruit yet.
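
Concretely, the arm64-only form we'd like to move towards is something
like (sketch, not part of this series):

/* arm64: let the compiler lift the fixed-size memcpy into an
 * LDP/STP pair; not alignment-sensitive */
#define COPY16(dst, src)	memcpy(dst, src, 16)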

I'm confused by the 'one case the actual copy unit is not allowed to
be greater than 8 bytes' statement: this is not evidenced by the loop
structure. COPY8 is only ever implemented in pairs in the kernel -
can you point out where this might not be the case in the future
(and why they couldn't still use COPY8 in that case)?

Ta!
Matt Sealey <[email protected]>
Arm Partner Enablement Group

2018-11-24 07:52:53

by Sergey Senozhatsky

Subject: Re: [PATCH 0/6] lib/lzo: performance improvements

On (11/21/18 12:06), Dave Rodgman wrote:
>
> Overall, performance is improved by around 1.1 - 4.8x (data-dependent: data
> with many zero runs shows higher improvement). Under real-world testing with
> zram, time spent in (de)compression during swapping is reduced by around 27%.

Impressive.

I think we usually Cc Greg Kroah-Hartman and Andrew Morton on
lzo/lz4 patches.

> The graph below shows the weighted round-trip throughput of lzo, lz4 and
> lzo-rle, for randomly generated 4k chunks of data with varying levels of
> entropy. (To calculate weighted round-trip throughput, compression performance
> is emphasised to reflect the fact that zram does around 2.25x more compression
> than decompression.

Right. The number is data-dependent. Not all swapped-out pages can be
compressed; compressed pages that end up being >= zs_huge_class_size() are
considered incompressible and stored as-is.
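
I.e. the zram write path does, roughly (simplified):

	/* pages whose compressed size reaches the huge class threshold
	 * are stored uncompressed */
	if (comp_len >= huge_class_size)
		comp_len = PAGE_SIZE;

where huge_class_size comes from zs_huge_class_size().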

I'd say that on my setups around 50-60% of pages are incompressible.

-ss