2013-03-27 21:44:53

by Dan Magenheimer

Subject: zsmalloc/lzo compressibility vs entropy

This might be obvious to those of you who are better
mathematicians than I, but I ran some experiments
to confirm the relationship between entropy and compressibility
and thought I should report the results to the list.

Using the LZO code in the kernel via zsmalloc and some
hacks in zswap, I measured the compressed size of pages
generated by get_random_bytes() and then of pages
where half of the page is generated by get_random_bytes()
and the other half is zero-filled.
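
For anyone who wants to play with this without hacking
zswap, a userspace approximation along the following lines
should reproduce the flavor of the experiment. (Just a
sketch, not the harness I actually used: it assumes liblzo2
is installed for lzo1x_1_compress(), linked with -llzo2,
with /dev/urandom standing in for get_random_bytes().)

#include <lzo/lzo1x.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/* workmem for lzo1x_1_compress(), aligned as the LZO examples do */
static lzo_align_t wrkmem[(LZO1X_1_MEM_COMPRESS + sizeof(lzo_align_t) - 1)
			  / sizeof(lzo_align_t)];

/* fill a buffer from /dev/urandom (stand-in for get_random_bytes()) */
static void read_random(unsigned char *buf, size_t len)
{
	FILE *f = fopen("/dev/urandom", "rb");

	if (!f || fread(buf, 1, len, f) != len) {
		perror("/dev/urandom");
		exit(1);
	}
	fclose(f);
}

/* return the compressed size ("zsize") of one page */
static lzo_uint zsize(const unsigned char *page)
{
	/* worst case per the LZO docs: input can expand slightly */
	static unsigned char dst[PAGE_SIZE + PAGE_SIZE / 16 + 64 + 3];
	lzo_uint dst_len = sizeof(dst);

	if (lzo1x_1_compress(page, PAGE_SIZE, dst, &dst_len, wrkmem) != LZO_E_OK)
		exit(1);
	return dst_len;
}

int main(void)
{
	unsigned char page[PAGE_SIZE];

	if (lzo_init() != LZO_E_OK)
		return 1;

	/* fully random page */
	read_random(page, PAGE_SIZE);
	printf("random page:      zsize = %lu\n", (unsigned long)zsize(page));

	/* half random page: first half overwritten with zeroes */
	memset(page, 0, PAGE_SIZE / 2);
	printf("half-random page: zsize = %lu\n", (unsigned long)zsize(page));

	return 0;
}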

For a fully random page, one would expect the numbers of
zero-bits and one-bits to be equal (highest entropy), and
that proved true: the mean number of one-bits in the fully
random page was 16384 (x86, so PAGE_SIZE=4096, i.e. 32768
bits per page) with a stddev of 93 (sample size > 500000).
For this sample of pages, zsize (the compressed size) had
a mean of 4116 and a stddev of 16.
So for fully random pages, LZO compression results
in "negative" compression... the size of the compressed
page is slightly larger than a page.
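
(As a sanity check on those numbers: if each of the 32768
bits is modeled as an independent fair coin flip, the
binomial distribution predicts a mean of 32768/2 = 16384
one-bits and a stddev of sqrt(32768 * 1/2 * 1/2) ~= 90.5,
which lines up nicely with the measured 16384/93.)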

For a "half random page" -- a fully random page with
the first half of the page overwritten with zeros --
zsize mean is 2077 with a stddev of 6. So a half-random
page compresses by about a factor of 2. (Just to
be sure, I reran the experiment with the first half
of the page overwritten with ones instead of zeroes,
and the result was approximately the same.)

For extra credit, I ran a "quarter random page"...
zsize mean is 1052 with a stddev of 45.

For more extra credit, I tried a fully-random page
with every OTHER byte forced to zero, so half the
bytes are random and half are zero. The result:
mean zsize is 3841 with a stddev of 33. Then I
tried a fully-random page with every other PAIR of
bytes forced to zero. The result: zsize mean is 4029
with a stddev of 67. (Worse!)
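
In case the two patterns aren't obvious from the prose,
here is roughly how they look in C (illustrative only,
meant to drop into the userspace sketch above and run on
a page already filled from /dev/urandom):

/* every OTHER byte zeroed: R 0 R 0 R 0 ... */
static void zero_alternate_bytes(unsigned char *page)
{
	size_t i;

	for (i = 1; i < PAGE_SIZE; i += 2)
		page[i] = 0;
}

/* every other PAIR of bytes zeroed: R R 0 0 R R 0 0 ... */
static void zero_alternate_pairs(unsigned char *page)
{
	size_t i;

	for (i = 2; i < PAGE_SIZE; i += 4) {
		page[i] = 0;
		page[i + 1] = 0;
	}
}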

So LZO page compression works better when there are many
more zeroes than ones in a page (or vice-versa), but works
best when long sequences of bytes are identical. That makes
sense if I understand LZO correctly: it is an LZ77-style
match/literal compressor with no entropy-coding stage, so
it profits from repeated byte sequences rather than from
skewed bit statistics.

All this still leaves open the question of what the
page-entropy (and zsize distribution) will be over a
large set of pages and over a large set of workloads
AND across different classes of data (e.g. frontswap
pages vs cleancache pages), but at least we have
some theory to guide us.


2013-03-29 22:20:51

by Dan Magenheimer

Subject: RE: zsmalloc/lzo compressibility vs entropy

> From: Dan Magenheimer
> Sent: Wednesday, March 27, 2013 3:42 PM
> To: Seth Jennings; Konrad Wilk; Minchan Kim; Bob Liu; Robert Jennings; Nitin Gupta; Wanpeng Li; Andrew
> Morton; Mel Gorman
> Cc: [email protected]; [email protected]
> Subject: zsmalloc/lzo compressibility vs entropy
>
> This might be obvious to those of you who are better
> mathematicians than I, but I ran some experiments
> to confirm the relationship between entropy and compressibility
> and thought I should report the results to the list.

A few new observations worth mentioning:

Since Seth long ago mentioned that the text of Moby Dick
resulted in poor (but not horribly poor) compression, I
thought I'd look at some ASCII data.

I used the first sentence of the Gettysburg Address (91 characters)
and repeated it to fill a page. Interestingly, LZO apparently
discovered the repetition... the page compressed to 118 bytes
even though the result had 15618 one-bits (fairly high entropy).

I used the full Gettysburg Address (1459 characters), again
repeated to fill a page. LZO compressed this to 1070 bytes.
(14568 one-bits.)

To fill a page with text, I added part of the Declaration of
Independence. No repeating text now. This only compressed
to 2754 bytes (which, I assume, is close to Seth's observations
on Moby Dick). 14819 one-bits.

Last (for swap), to see if random ASCII would compress
better than binary, I masked off the MSB in each byte of a
random page. The mean zsize was 4116 bytes (larger than a
page) with a stddev of 51. The one-bit mean was 14336
(7/16 of the bits in a page).
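
These text cases are easy to approximate with the same
userspace harness; something like the following (again only
a sketch, reusing the PAGE_SIZE, read_random() and zsize()
helpers from my earlier note, with whatever text you like
passed in):

/* fill a page by repeating a text snippet, then compress it */
static void try_repeated_text(const char *text)
{
	unsigned char page[PAGE_SIZE];
	size_t len = strlen(text), off;

	for (off = 0; off < PAGE_SIZE; off += len)
		memcpy(page + off, text,
		       off + len <= PAGE_SIZE ? len : PAGE_SIZE - off);
	printf("repeated %zu-char text: zsize = %lu\n",
	       len, (unsigned long)zsize(page));
}

/* random page with the MSB of every byte masked off ("random ascii") */
static void try_random_ascii(void)
{
	unsigned char page[PAGE_SIZE];
	size_t i;

	read_random(page, PAGE_SIZE);
	for (i = 0; i < PAGE_SIZE; i++)
		page[i] &= 0x7f;
	printf("random ascii: zsize = %lu\n", (unsigned long)zsize(page));
}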

On a completely different track, I thought it would be relevant
to look at the difference between frontswap (anonymous) page
zsize distribution and cleancache (file) page zsize distribution.

Running kernbench, zsize mean was 1974 (stddev 895).

For a different benchmark, I did:

# find / | grep3

where grep3 is a simple bash script that does three separate
greps on the first argument. Since this fills the page cache
and causes reclaiming, and reclaims are captured by cleancache
and fed to zcache, this data page stream approximates random
pages on the disk.

This "benchmark" generated a zsize mean of 2265 with stddev 1008.
Also of note: Only a fraction of a percent of cleancache pages
are zero-filled, so Wanpeng's zcache patch to handle zero-filled
pages more efficiently is very good for frontswap pages but may
have little benefit for cleancache pages.
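
(For reference, the test such an optimization hinges on is
just "is every byte of this page zero"; the following is a
sketch of the idea, not Wanpeng's actual patch:)

/* true if the page contains nothing but zero bytes */
static int page_is_zero_filled(const unsigned char *page)
{
	size_t i;

	for (i = 0; i < PAGE_SIZE; i++)
		if (page[i])
			return 0;
	return 1;
}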

Bottom line conclusions: (1) Entropy is probably less of a
factor for LZO-compressibility than data repetition.
(2) Cleancache data pages may have a very different zsize
distribution than frontswap data pages, anecdotally skewed
toward much higher zsize.