2004-09-29 03:46:59

by John Richard Moser

Subject: Compressed filesystems: Better compression?

I wonder what compression Squashfs, cramfs, and zisofs use. I think
cramfs uses zlib compression; I don't know of any other compression in
the kernel tree, so I assume zisofs uses the same, as does squashfs. Am
I mistaken?

Doing a little digging, I've found that the algorithm used by 7zip[1]
(LZMA) has been ported[2] to Linux. This is a fairly high-performance
algorithm which may be of interest to port into the kernel tree for the
practical uses I will list later.

[1] http://7-zip.org/
[2] http://students.fhs-hagenberg.ac.at/se/se00001/Projects_LZMA_.html

(unfortunately, the link to the code at [2] appears to be dead; I'm
contacting the author.)

In my own personal tests, I've gotten a 6.25% increase in compression
ratio over bzip2 using the above lzma code. These were very weak tests
involving simply bunzipping a 32MiB tar.bz2 of the Mozilla 1.7 source
tree and recompressing it with lzma, which produced a 30MiB tar.lzma. I
tried, but could not get it to compress much better than that (I think I
touched 29.5 at some point but not sure, it was a while ago).
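
For reference, the whole test amounts to something like the sketch
below. Python's lzma module speaks LZMA2/xz rather than the old .lzma
format, and "mozilla.tar" is just a stand-in name, so exact numbers
will differ:

import bz2, lzma

data = open("mozilla.tar", "rb").read()          # stand-in test tarball

bz2_size = len(bz2.compress(data, 9))
lzma_size = len(lzma.compress(data, preset=9 | lzma.PRESET_EXTREME))

print("bzip2: %d bytes, lzma: %d bytes, gain over bzip2: %.2f%%"
      % (bz2_size, lzma_size, 100.0 * (1 - float(lzma_size) / bz2_size)))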

The LZMA developer claims that the BZ2:LZMA ratio is roughly the same as
GZ:BZ2, so the gain over zlib compression can be predicted to be about
12.5%. This may be significant in some cases. It is also notable that
LZMA decompression is quite fast: about twice as fast as bzip2,
according to the LZMA page above. It does take twice as long to
compress, though.

LZMA uses a lot of memory to compress. The more memory allotted, the
higher the compression ratio; parameters within the algorithm can be
freely adjusted to control this. Decompression uses approximately 1/10
of the memory needed to compress.
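
The dictionary size is the main knob. A rough sketch of the trade-off
(Python again, LZMA2 through the xz container, with "mozilla.tar" as a
placeholder input):

import lzma

data = open("mozilla.tar", "rb").read()          # placeholder input

for dict_size in (1 << 16, 1 << 20, 1 << 24):    # 64 KiB, 1 MiB, 16 MiB
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 9, "dict_size": dict_size}]
    out = lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)
    print("dict %10d bytes -> %d bytes compressed" % (dict_size, len(out)))

Compressor memory scales with the dictionary plus the match-finder
structures, while the decompressor needs little more than the dictionary
itself, which is consistent with the 1/10 figure above.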

The developer of the Linux LZMA port appears to have been inactive for
six months.

The LZMA algorithm, as with any compression algorithm, may have several
practical uses in the Linux source tree. Linux has a wide range of
applications, and is open to modifications which may benefit from
available high-performance compression algorithms.

A good LZMA-compressed file system, such as an enhanced SquashFS, would
allow slightly more to fit on an initrd. This would not be very
significant, amounting to approximately 128 kilobytes per megabyte of
compressed data based on the tests above; but it would still allow
initrds to take up less space on disk.

Compressed file systems are also useful for Live CDs. I'm very
interested in seeing a zisofs using LZMA; a 700M Knoppix[3] CD may
shrink to approximately 600M with the same content, allowing possibly
another 8-12% of data to be stored on it. Considering that Knoppix is
2.0G of content to start with, that's another 160-250M of content.

[3] http://knoppix.net/

With LZMA compression in the tree, file systems that gain compression
support in the future may opt to allow use of the LZMA algorithm, which
would produce better compression ratios. I liked the compression on
NTFS about 2-3 years ago when I was using Windows 2000; it can even
enhance performance on fast PCs (the hard disk is the bottleneck) while
lowering disk usage, providing wins on all sides. :)

LZMA might also be appropriate for compressed memory systems such as
Rodrigo Castro's compressed page cache patch[4]. With small data sets,
the memory footprint could be reduced; using 30MiB of memory to compress
a 4KiB or 8KiB page would be just ridiculous. Furthermore, because the
operation is performed on small data sets, the performance overhead
would likely still be much more acceptable than swapping.

[4] http://linuxcompressed.sf.net/

Overall, porting LZMA into the kernel tree would allow for more
innovation. This is an ideal time to take these steps; SquashFS,
cramfs, and zisofs are, to my knowledge, native and exclusive to Linux,
and should be enhanced before other operating systems clone them.
Also, an RFC for LZMA and for zisofs may be in order.

For other LZMA code, 7-Zip itself could be examined, though porting from
Windows code may be hard. The only arguments I can see against this are
that nobody has time, or that everyone is busy with more important
things :)

Any comments?

--
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.



2004-09-29 03:53:14

by John Richard Moser

Subject: Re: Compressed filesystems: Better compression?

| For other LZMA code,

http://sourceforge.net/projects/p7zip/ or that.


--
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.


2004-09-29 03:56:41

by John Richard Moser

Subject: Re: Compressed filesystems: Better compression?

> For other LZMA code,

http://sourceforge.net/projects/p7zip/ or that.

--
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.

2004-09-29 08:57:02

by David Woodhouse

Subject: Re: Compressed filesystems: Better compression?

On Tue, 2004-09-28 at 23:46 -0400, John Richard Moser wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> I wonder what compression Squashfs, cramfs, and zisofs use. I think
> cramfs uses zlib compression; I don't know of any other compression in
> the kernel tree, so I assume zisofs uses the same, as does squashfs. Am
> I mistaken?

JFFS2 has a variety of compression options too. In fact, it might be
nice to make a generic compression framework (or just use the crypto
one) instead of having that in JFFS2 itself.

--
dwmw2

2004-09-29 10:48:55

by Giuseppe Bilotta

Subject: Re: Compressed filesystems: Better compression?

John Richard Moser wrote:
> > For other LZMA code,
>
> http://sourceforge.net/projects/p7zip/ or that.

The LZMA SDK by the 7-Zip author himself is available at

http://www.7-zip.org/sdk.html

Don't know if it can help.

--
Giuseppe "Oblomov" Bilotta

Can't you see
It all makes perfect sense
Expressed in dollar and cents
Pounds shillings and pence
(Roger Waters)

2004-09-29 12:18:51

by Matti Aarnio

Subject: Re: Compressed filesystems: Better compression?

On Tue, Sep 28, 2004 at 11:46:54PM -0400, John Richard Moser wrote:
> I wonder what compression Squashfs, cramfs, and zisofs use. I think
> cramfs uses zlib compression; I don't know of any other compression in
> the kernel tree, so I assume zisofs uses the same, as does squashfs. Am
> I mistaken?

Compression algorithms are a bit tough to use in random-access,
small-block environments. In long streams, where you can use megabytes
worth of buffer space, there is no problem in achieving good performance.
But do try to do that in an environment where your maximum block size
is, say, 4 kB, and you have to start afresh at every block boundary.

Whatever algorithms you use, there will always be data sequences that
are of maximum entropy and won't compress. Rather, they will be
presented in the stream as-is, with wrappers a few bytes long around
them.

/Matti Aarnio

2004-09-29 12:36:55

by Jörn Engel

Subject: Re: Compressed filesystems: Better compression?

On Tue, 28 September 2004 23:46:54 -0400, John Richard Moser wrote:
>
> In my own personal tests, I've gotten a 6.25% increase in compression
> ratio over bzip2 using the above lzma code. These were very weak tests
> involving simply bunzipping a 32MiB tar.bz2 of the Mozilla 1.7 source
> tree and recompressing it with lzma, which produced a 30MiB tar.lzma. I
> tried, but could not get it to compress much better than that (I think I
> touched 29.5 at some point but not sure, it was a while ago).

Sounds sane. bzip2 is really hurt by the hard limit of 900k for block
sorting.

Inside the kernel, other things start to matter, though. If you
really want to impress me, take some large test data (your mozilla.tar
or whatever), cut it up into chunks of 4k and compress each chunk
individually. Does lzma still beat gzip?
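
Roughly, the experiment is (a Python sketch; mozilla.tar is a stand-in
for the test data, zlib stands in for gzip and the lzma module for the
lzma tool):

import lzma, zlib

CHUNK = 4096                        # compress in independent 4k chunks
totals = {"zlib": 0, "lzma": 0}

with open("mozilla.tar", "rb") as f:
    while True:
        chunk = f.read(CHUNK)
        if not chunk:
            break
        totals["zlib"] += len(zlib.compress(chunk, 9))
        totals["lzma"] += len(lzma.compress(chunk, preset=9))

print(totals)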

If you can at least get it to compress better for 64k chunks, that's
already quite interesting. But excellent compression with infinite
chunk-size and infinite memory is quite pointless inside the kernel.
Such things should be left in userspace where they belong.

Jörn

--
The wise man seeks everything in himself; the ignorant man tries to get
everything from somebody else.
-- unknown

2004-09-29 16:08:51

by John Richard Moser

Subject: Re: Compressed filesystems: Better compression?


Matti Aarnio wrote:
| On Tue, Sep 28, 2004 at 11:46:54PM -0400, John Richard Moser wrote:
|
|>I wonder what compression Squashfs, cramfs, and zisofs use. I think
|>cramfs uses zlib compression; I don't know of any other compression in
|>the kernel tree, so I assume zisofs uses the same, as does squashfs. Am
|>I mistaken?
|
|
| Compression algorithms are a bit tough to use in random-access,
| small-block environments. In long streams, where you can use megabytes
| worth of buffer space, there is no problem in achieving good performance.
| But do try to do that in an environment where your maximum block size
| is, say, 4 kB, and you have to start afresh at every block boundary.

Yes, of course. I've seen the compressed page cache patch do this and
get fair performance (10-20%), though on double-size blocks (8KiB) it
manages almost twice as well (20-50%, averaging around 30% IIRC). Not
great, but not bad.

On compressed filesystems you can work with 64k or 128k blocks.
Somewhere around 32-64k is usually optimal; you're not going to see
great improvements using 1M blocks instead of 512k blocks.

|
| Whatever algorithms you use, there will always be data sequences that
| are of maximum entropy and won't compress. Rather, they will be
| presented in the stream as-is, with wrappers a few bytes long around
| them.
|

Yes; an intelligent implementation decides that if the underlying
compression algorithm produces no gain, it just marks the block as
uncompressed and stores it as such. ZLIB does this if the block would
get bigger. LZMA might not; but higher-level framing (block headers)
could handle that easily (as you said).
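
In other words, something like this minimal sketch (zlib here only as an
example of an underlying compressor; the one-byte flag is made up for
illustration):

import zlib

def pack_block(block):
    compressed = zlib.compress(block, 9)
    if len(compressed) < len(block):
        return b"\x01" + compressed    # flag 1: stored compressed
    return b"\x00" + block             # flag 0: maximum-entropy data, stored raw

def unpack_block(packed):
    if packed[:1] == b"\x01":
        return zlib.decompress(packed[1:])
    return packed[1:]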

[...]

--
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.


2004-09-29 16:10:21

by John Richard Moser

Subject: Re: Compressed filesystems: Better compression?

Martin Ankerl says:

Hi, it is GPL, please try this link:
http://martinus.geekisp.com/files/lzma-0.01.tar.bz2

You might want to have a look at p7zip, which is a full port of 7zip to Linux
http://sourceforge.net/projects/p7zip/

=======

So he recommends looking at p7zip too. :) And of course the original
7zip reference algorithm.


--
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.


2004-09-29 16:39:32

by John Richard Moser

Subject: Re: Compressed filesystems: Better compression?

http://developer.linuxtag.net/knoppix/sources/cloop_2.01-5.tar.gz

While compiling cloop I noticed it has LZMA compression. Looks like I
need to do my research.

This may still be interesting to use for zisofs; although at this point
I appear to be producing a lot of white noise. Still, I'd like to see
LZMA ported out of cloop and into cryptoapi or such, where it can be
more generally useful.

--
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.


2004-09-29 17:34:42

by John Richard Moser

Subject: Re: Compressed filesystems: Better compression?


Jörn Engel wrote:
| On Tue, 28 September 2004 23:46:54 -0400, John Richard Moser wrote:
|
|>In my own personal tests, I've gotten a 6.25% increase in compression
|>ratio over bzip2 using the above lzma code. These were very weak tests
|>involving simply bunzipping a 32MiB tar.bz2 of the Mozilla 1.7 source
|>tree and recompressing it with lzma, which produced a 30MiB tar.lzma. I
|>tried, but could not get it to compress much better than that (I think I
|>touched 29.5 at some point but not sure, it was a while ago).
|
|
| Sounds sane. bzip2 is really hurt by the hard limit of 900k for block
| sorting.
|
| Inside the kernel, other things start to matter, though. If you
| really want to impress me, take some large test data (your mozilla.tar
| or whatever), cut it up into chunks of 4k and compress each chunk
| individually. Does lzma still beat gzip?
|

I'll try that. I'm more interested in 32-128k chunks, however. Based
on prior experience, I've come to rely on 32-64k being "optimal" for
compression; bigger block sizes don't seem to produce much of a gain
(some, but nothing amazing). These are also the ranges that would be
used for compressed filesystems such as squashfs. For filesystems such
as zisofs, it would be possible to split files up into blocks as well,
to lower the memory footprint and increase seek speed through the file.

[BlkSz][DictSz][CompressedData...........]

By placing an indicator of the compressed block size on each block, and
recording the uncompressed block size elsewhere (in the file header,
etc.), compressed data can be seeked through quickly without
decompressing the entire stream (decompressing at most one block).
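
A rough sketch of such a layout (Python for illustration; the per-block
dictionary-size field is dropped here, and the 64k block size and the
little-endian header layout are made up):

import lzma, struct

BLOCK = 64 * 1024          # fixed uncompressed block size, kept in the header

def write_image(src, path):
    with open(path, "wb") as out:
        nblocks = (len(src) + BLOCK - 1) // BLOCK
        out.write(struct.pack("<II", BLOCK, nblocks))   # file header
        for off in range(0, len(src), BLOCK):
            comp = lzma.compress(src[off:off + BLOCK], preset=6)
            out.write(struct.pack("<I", len(comp)))     # per-block size indicator
            out.write(comp)

def read_block(path, n):
    with open(path, "rb") as img:
        block_size, nblocks = struct.unpack("<II", img.read(8))
        if n >= nblocks:
            raise IndexError(n)
        for _ in range(n):                              # walk headers, skip payloads
            (clen,) = struct.unpack("<I", img.read(4))
            img.seek(clen, 1)
        (clen,) = struct.unpack("<I", img.read(4))
        return lzma.decompress(img.read(clen))          # only block n is decompressed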

| If you can at least get it to compress better for 64k chunks, that's
| already quite interesting. But excellent compression with infinite
| chunk-size and infinite memory is quite pointless inside the kernel.
| Such things should be left in userspace where they belong.
|

Yes, this needs to be practically useful; compressing 800M files in the
kernel using 16G of memory is NOT practical. :)


| Jörn
|

--
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.
