2006-05-21 14:35:03

by Justin Piszcz

Subject: Linux Kernel Source Compression

I was curious which utilities would offer the best compression ratio
for the kernel source. I thought it'd be bzip2 or rar, but lzma wins:
roughly 6 MiB smaller than bzip2.

$ du -sk * | sort -n
33520 linux-2.6.16.17.tar.lzma
33760 linux-2.6.16.17.tar.rar
38064 linux-2.6.16.17.tar.rz
39472 linux-2.6.16.17.tar.szip
39520 linux-2.6.16.17.tar.bz
39936 linux-2.6.16.17.tar.bz2
40000 linux-2.6.16.17.tar.bicom
40656 linux-2.6.16.17.tar.sit
47664 linux-2.6.16.17.tar.lha
49968 linux-2.6.16.17.tar.dzip
50000 linux-2.6.16.17.tar.gz
51344 linux-2.6.16.17.tar.arj
57552 linux-2.6.16.17.tar.lzo
57984 linux-2.6.16.17.tar.F
81136 linux-2.6.16.17.tar.Z
94544 linux-2.6.16.17.tar.zoo
101216 linux-2.6.16.17.tar.arc
228608 linux-2.6.16.17.tar

$ du -sh * | sort -n
33M linux-2.6.16.17.tar.lzma
33M linux-2.6.16.17.tar.rar
37M linux-2.6.16.17.tar.rz
39M linux-2.6.16.17.tar.bicom
39M linux-2.6.16.17.tar.bz
39M linux-2.6.16.17.tar.bz2
39M linux-2.6.16.17.tar.szip
40M linux-2.6.16.17.tar.sit
47M linux-2.6.16.17.tar.lha
49M linux-2.6.16.17.tar.dzip
49M linux-2.6.16.17.tar.gz
50M linux-2.6.16.17.tar.arj
56M linux-2.6.16.17.tar.lzo
57M linux-2.6.16.17.tar.F
79M linux-2.6.16.17.tar.Z
92M linux-2.6.16.17.tar.zoo
99M linux-2.6.16.17.tar.arc
223M linux-2.6.16.17.tar
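
The "roughly 6 MiB" figure can be checked from the listing above. The following is a minimal sketch that copies a few of the `du -sk` values by hand; note that `du` reports disk usage in KiB blocks, so the results are approximate rather than exact byte counts.

```python
# Sizes in KiB, copied from the `du -sk` listing above.
sizes_kib = {
    "tar": 228608,
    "lzma": 33520,
    "rar": 33760,
    "rz": 38064,
    "bz2": 39936,
    "gz": 50000,
}


def saving_vs(name, baseline):
    """Fraction by which `name` is smaller than `baseline`."""
    return 1 - sizes_kib[name] / sizes_kib[baseline]


def mib_saved(name, baseline):
    """Absolute difference in MiB (1 MiB = 1024 KiB)."""
    return (sizes_kib[baseline] - sizes_kib[name]) / 1024


for name in ("lzma", "rar", "rz", "bz2", "gz"):
    ratio = sizes_kib[name] / sizes_kib["tar"]
    print(f"{name:5s} {ratio:6.1%} of the tarball, "
          f"{saving_vs(name, 'bz2'):+6.1%} vs bz2")

print(f"lzma vs bz2: {mib_saved('lzma', 'bz2'):.2f} MiB smaller")
```

Measuring the saving relative to the bzip2 size (rather than the lzma size) is a choice; the other convention gives a somewhat larger percentage for the same two files.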


2006-05-21 18:41:24

by Jan Engelhardt

Subject: Re: Linux Kernel Source Compression

>
> Was curious as to which utilities would offer the best compression ratio for
> the kernel source, I thought it'd be bzip2 or rar but lzma wins, roughly 6 MiB
> smaller than bzip2.
>
You forgot:
- .7z 7zip
- .j JAR (http://www.arjsoftware.com)
- .ice LHICE (some sort of "brother" to lharc aka lzh)
- .ace ACE (http://www.winace.com)
- UPX (yes!, you just need to put '#!/\n' at the front)
- .cab MS CAB (use winace)
- .bh BlackHole
- .pak PKARC 2.51
- .sqz SqueezeIt
- "LZEXE"

ftp://camelot.spsl.nsc.ru/pub/win32/arc/ - you'll find some there
happy packing :)

> 38064 linux-2.6.16.17.tar.rz

- is this rzip with _maximum_ distance?


Jan Engelhardt
--

2006-05-21 18:56:34

by Kasper Sandberg

Subject: Re: Linux Kernel Source Compression

On Sun, 2006-05-21 at 20:40 +0200, Jan Engelhardt wrote:
> >
> > Was curious as to which utilities would offer the best compression ratio for
> > the kernel source, I thought it'd be bzip2 or rar but lzma wins, roughly 6 MiB
> > smaller than bzip2.
> >
> You forgot:
> - .7z 7zip
> - .j JAR (http://www.arjsoftware.com)
> - .ice LHICE (some sort of "brother" to lharc aka lzh)
> - .ace ACE (http://www.winace.com)
> - UPX (yes!, you just need to put '#!/\n' at the front)
> - .cab MS CAB (use winace)
> - .bh BlackHole
> - .pak PKARC 2.51
> - .sqz SqueezeIt
> - "LZEXE"
and also lzx, which was, in the Amiga days, the best there was, although
I know of no compressor for Linux

>
> ftp://camelot.spsl.nsc.ru/pub/win32/arc/ - you'll find some there
> happy packing :)
>
> > 38064 linux-2.6.16.17.tar.rz
>
> - is this rzip with _maximum_ distance?
>
>
> Jan Engelhardt

2006-05-21 19:03:21

by Alistair John Strachan

Subject: Re: Linux Kernel Source Compression

On Sunday 21 May 2006 15:35, Justin Piszcz wrote:
> Was curious as to which utilities would offer the best compression ratio
> for the kernel source, I thought it'd be bzip2 or rar but lzma wins,
> roughly 6 MiB smaller than bzip2.
>
> $ du -sk * | sort -n
> 33520 linux-2.6.16.17.tar.lzma

Somebody needs to make lzma userspace tools (like p7zip) faster, not crash,
and behave like a regular UNIX program. Then we need a patch to GNU tar to
emerge, and for it to persist for at least 4 years. Then maybe people will
adopt this format..

--
Cheers,
Alistair.

Third year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.

2006-05-21 19:28:30

by Justin Piszcz

Subject: Re: Linux Kernel Source Compression

Compressed with -9.

-9 slowest (best) compression

Unsure on the maximum distance.

Version info:

rzip 2.1
Copright (C) Andrew Tridgell 1998-2003


On Sun, 21 May 2006, Jan Engelhardt wrote:

>>
>> Was curious as to which utilities would offer the best compression ratio for
>> the kernel source, I thought it'd be bzip2 or rar but lzma wins, roughly 6 MiB
>> smaller than bzip2.
>>
> You forgot:
> - .7z 7zip
> - .j JAR (http://www.arjsoftware.com)
> - .ice LHICE (some sort of "brother" to lharc aka lzh)
> - .ace ACE (http://www.winace.com)
> - UPX (yes!, you just need to put '#!/\n' at the front)
> - .cab MS CAB (use winace)
> - .bh BlackHole
> - .pak PKARC 2.51
> - .sqz SqueezeIt
> - "LZEXE"
>
> ftp://camelot.spsl.nsc.ru/pub/win32/arc/ - you'll find some there
> happy packing :)
>
>> 38064 linux-2.6.16.17.tar.rz
>
> - is this rzip with _maximum_ distance?
>
>
> Jan Engelhardt
> --
>

2006-05-21 21:01:00

by Chris Wedgwood

Subject: Re: Linux Kernel Source Compression

On Sun, May 21, 2006 at 08:03:32PM +0100, Alistair John Strachan wrote:

> Somebody needs to make lzma userspace tools (like p7zip) faster, not
> crash, and behave like a regular UNIX program. Then we need a patch
> to GNU tar to emerge, and for it to persist for at least 4
> years. Then maybe people will adopt this format..

why?

the gains aren't that great

2006-05-21 21:22:05

by Alistair John Strachan

Subject: Re: Linux Kernel Source Compression

On Sunday 21 May 2006 22:00, Chris Wedgwood wrote:
> On Sun, May 21, 2006 at 08:03:32PM +0100, Alistair John Strachan wrote:
> > Somebody needs to make lzma userspace tools (like p7zip) faster, not
> > crash, and behave like a regular UNIX program. Then we need a patch
> > to GNU tar to emerge, and for it to persist for at least 4
> > years. Then maybe people will adopt this format..
>
> why?
>
> the gains aren't that great

If it was less than 5%, I'd agree with you. The fact is, it's 17% better on a
regular kernel tarball (not exactly a contrived test), so there would be
reason to use it. It's also faster to decompress.

http://tukaani.org/lzma/

This utility appears to address most of my original concerns (i.e., it works
with stream LZMA and has a bzip2/gzip-esque frontend). I could see LZMA
replacing bzip2, but not gzip, due to the compression performance issues.

--
Cheers,
Alistair.

Third year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.

2006-05-21 21:42:42

by Sam Vilain

Subject: Re: Linux Kernel Source Compression

Chris Wedgwood wrote:

>On Sun, May 21, 2006 at 08:03:32PM +0100, Alistair John Strachan wrote:
>
>
>
>>Somebody needs to make lzma userspace tools (like p7zip) faster, not
>>crash, and behave like a regular UNIX program. Then we need a patch
>>to GNU tar to emerge, and for it to persist for at least 4
>>years. Then maybe people will adopt this format..
>>
>>
>
>why?
>
>the gains aren't that great
>

Exactly, and while I know my network connection isn't exactly
representative of the general population of people building the kernel,
it's currently faster for me to download and unpack a .gz than to wait
the extra time for bzip2 to decompress. I've always found it quicker
dealing with .gz's for incremental patches. I thought the speed issue is
more of a speed / compression ratio trade-off, ie a caveat of
compression in general.

Mind you, 'git fetch' is even faster, even for people who aren't close
enough to their mirror to fetch a full .gz kernel tarball in <5s.

Sam.

2006-05-21 21:57:01

by Alistair John Strachan

Subject: Re: Linux Kernel Source Compression

On Sunday 21 May 2006 22:42, Sam Vilain wrote:
> Chris Wedgwood wrote:
> >On Sun, May 21, 2006 at 08:03:32PM +0100, Alistair John Strachan wrote:
> >>Somebody needs to make lzma userspace tools (like p7zip) faster, not
> >>crash, and behave like a regular UNIX program. Then we need a patch
> >>to GNU tar to emerge, and for it to persist for at least 4
> >>years. Then maybe people will adopt this format..
> >
> >why?
> >
> >the gains aren't that great
>
> Exactly, and while I know my network connection isn't exactly
> representative of the general population of people building the kernel,
> it's currently faster for me to download and unpack a .gz than to wait
> the extra time for bzip2 to decompress. I've always found it quicker
> dealing with .gz's for incremental patches. I thought the speed issue is
> more of a speed / compression ratio trade-off, ie a caveat of
> compression in general.

Actually, you're making false assumptions about LZMA. It is in fact quicker to
decompress than bzip2, which largely addresses this issue. Compression speeds
are the problem, but the end user won't do a lot of that.

--
Cheers,
Alistair.

Third year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.

2006-05-21 21:59:41

by Diego Calleja

Subject: Re: Linux Kernel Source Compression

On Mon, 22 May 2006 09:42:28 +1200,
Sam Vilain <[email protected]> wrote:

> it's currently faster for me to download and unpack a .gz than to wait
> the extra time for bzip2 to decompress. I've always found it quicker


For kernel patches and kernel releases it doesn't make much sense to
switch; you don't gain much.

LZMA has its gains, though. It's probably an interesting choice
for packaging software: you may get some extra space on the CD thanks
to the extra compression, and the faster decompression could make
installs a bit faster. While LZMA is slow as hell when compressing in
the "best compression" mode, it is faster than bzip2 at both compressing
and decompressing when set to compression levels comparable to bzip2's
(according to the web page linked earlier in the thread). That pretty
much means it's just better.

2006-05-21 22:22:21

by Sam Vilain

Subject: Re: Linux Kernel Source Compression

Alistair John Strachan wrote:

>>Exactly, and while I know my network connection isn't exactly
>>representative of the general population of people building the kernel,
>>it's currently faster for me to download and unpack a .gz than to wait
>>the extra time for bzip2 to decompress. I've always found it quicker
>>dealing with .gz's for incremental patches. I thought the speed issue is
>>more of a speed / compression ratio trade-off, ie a caveat of
>>compression in general.
>>
>>
>
>Actually, you're making false assumptions about LZMA. It is in fact quicker to
>decompress than bzip2, which largely addresses this issue. Compression speeds
>are the problem, but the end user won't do a lot of that.
>

Interesting. Googling a bit; from http://tukaani.org/lzma/benchmarks:

In terms of speed, gzip is the winner again. lzma comes right behind it
two to three times slower than gzip. bzip2 is a lot slower taking
usually two to six times more time than lzma, that is, four to twelve
times more than gzip. One interesting thing is that gzip and lzma
decompress the faster the smaller the compressed size is, while bzip2
gets slower when the compression ratio gets better.
[...]
neither bzip2 nor lzma can compete with gzip in terms of speed or memory
usage.

Also this:

"lzmash -8" and "lzmash -9" require lots of memory and are practical
only on newer computers; the files compressed with them are probably a
pain to decompress on systems with less than 32 MB or 64 MB of memory.
[...]
The files compressed with the default "lzmash -7" can still be
decompressed, even on machines with only 16 MB of RAM

Sam.
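
The size and decompression-speed claims quoted above are easy to try on your own data. What follows is only a rough sketch using the codecs in Python's standard library (`zlib`, `bz2`, `lzma`); it is not the tukaani benchmark. The synthetic sample text is a stand-in for a real tarball, `lzma` here writes the xz container rather than the 2006-era .lzma format, and the default preset is used because the highest presets need far more memory.

```python
import bz2
import lzma
import time
import zlib


def make_sample(repeats=2000):
    """Build repetitive, source-like text; a stand-in for a real tarball."""
    line = b"static int example_function(struct device *dev, int flags);\n"
    return b"".join(
        line.replace(b"example", b"ex%d" % (i % 97)) for i in range(repeats)
    )


# name -> (compress function, decompress function)
CODECS = {
    "zlib -9 (deflate)": (lambda d: zlib.compress(d, 9), zlib.decompress),
    "bzip2 -9": (lambda d: bz2.compress(d, 9), bz2.decompress),
    "lzma (xz, default)": (lambda d: lzma.compress(d), lzma.decompress),
}


def benchmark(data):
    """Return {name: (compressed_size, decompress_seconds)} per codec."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        packed = compress(data)
        start = time.perf_counter()
        unpacked = decompress(packed)
        elapsed = time.perf_counter() - start
        assert unpacked == data  # round-trip check
        results[name] = (len(packed), elapsed)
    return results


sample = make_sample()
for name, (size, seconds) in benchmark(sample).items():
    print(f"{name:20s} {size:8d} bytes, decompress {seconds * 1000:.2f} ms")
```

A single timing on a small in-memory sample is noisy; for anything you intend to quote, feed it a real kernel tarball and repeat each measurement several times.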

2006-05-21 22:29:27

by Alistair John Strachan

Subject: Re: Linux Kernel Source Compression

On Sunday 21 May 2006 23:22, Sam Vilain wrote:
[snip]
> Interesting. Googling a bit; from http://tukaani.org/lzma/benchmarks:
>
> In terms of speed, gzip is the winner again. lzma comes right behind it
> two to three times slower than gzip. bzip2 is a lot slower taking
> usually two to six times more time than lzma, that is, four to twelve
> times more than gzip. One interesting thing is that gzip and lzma
> decompress the faster the smaller the compressed size is, while bzip2
> gets slower when the compression ratio gets better.
> [...]
> neither bzip2 nor lzma can compete with gzip in terms of speed or memory
> usage.
>
> Also this:
>
> "lzmash -8" and "lzmash -9" require lots of memory and are practical
> only on newer computers; the files compressed with them are probably a
> pain to decompress on systems with less than 32 MB or 64 MB of memory.
> [...]
> The files compressed with the default "lzmash -7" can still be
> decompressed, even on machines with only 16 MB of RAM

Interesting info. I agree that LZMA is not a replacement for gzip/zlib,
because gzip is extremely size/time efficient.

However, as noted in another thread, it is almost certainly a viable
replacement for bzip2, since people that use bzip2 are generally interested
in a size optimisation, not a compression speed one, and even if compression
speed is relevant, LZMA's options scale to be approximately as good (or as
bad??) as bzip2.

This is all fairly academic. I think the issue still boils down to widespread
adoption; bzip2 took a while to get off the ground, people don't like messing
with new formats, and distributors have to pick up the tools.

I think kernel.org switching formats would be one of the last things that
could, or indeed should, happen.

--
Cheers,
Alistair.

Third year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.

2006-05-22 02:05:51

by Stefan Smietanowski

Subject: Re: Linux Kernel Source Compression

Justin Piszcz wrote:
> Compressed with -9.
>
> -9 slowest (best) compression
>
> Unsure on the maximum distance.
>
> Version info:
>
> rzip 2.1
> Copright (C) Andrew Tridgell 1998-2003
>
>
> On Sun, 21 May 2006, Jan Engelhardt wrote:
>
>>>
>>> Was curious as to which utilities would offer the best compression
>>> ratio for
>>> the kernel source, I thought it'd be bzip2 or rar but lzma wins,
>>> roughly 6 MiB
>>> smaller than bzip2.
>>>
>> You forgot:
>> - .7z 7zip
>> - .j JAR (http://www.arjsoftware.com)
>> - .ice LHICE (some sort of "brother" to lharc aka lzh)
>> - .ace ACE (http://www.winace.com)
>> - UPX (yes!, you just need to put '#!/\n' at the front)
>> - .cab MS CAB (use winace)
>> - .bh BlackHole
>> - .pak PKARC 2.51
>> - .sqz SqueezeIt
>> - "LZEXE"
>>
>> ftp://camelot.spsl.nsc.ru/pub/win32/arc/ - you'll find some there
>> happy packing :)
>>
>>> 38064 linux-2.6.16.17.tar.rz
>>
>>
>> - is this rzip with _maximum_ distance?
>>
>>
>> Jan Engelhardt

Don't forget about .lzx! Probably need an amiga to use it though :)

I could be remembering wrong, but I think a newer version of LZX
compression is used in .CAB, since that guy got a job at MS back in
the day.

// Stefan



2006-05-22 18:58:57

by H. Peter Anvin

Subject: Re: Linux Kernel Source Compression

Followup to: <[email protected]>
By author: Alistair John Strachan <[email protected]>
In newsgroup: linux.dev.kernel
>
> Somebody needs to make lzma userspace tools (like p7zip) faster, not crash,
> and behave like a regular UNIX program. Then we need a patch to GNU tar to
> emerge, and for it to persist for at least 4 years. Then maybe people will
> adopt this format..
>

The patch to GNU tar isn't necessary. If the "not crash, and behave
like a regular UNIX program" can be satisfied, I'd be happy to support
7zip/lzma on kernel.org. Unfortunately, as far as I can tell:

a) right now the standard encapsulation format for LZMA is 7zip, which
only comes in the form of hideously ugly code. lzma-tools are
cleaner, but incompatible.

b) Even lzma-tools relies on a shell script to behave like a Unix
program.

Personally, I would like to suggest adding LZMA capability to gzip.
The gzip format already has support for multiple compression formats.

-hpa

2006-05-22 19:00:21

by H. Peter Anvin

Subject: Re: Linux Kernel Source Compression

Followup to: <[email protected]>
By author: Alistair John Strachan <[email protected]>
In newsgroup: linux.dev.kernel
>
> I think kernel.org switching formats would be one of the last things that
> could, or indeed should, happen.
>

kernel.org already has a multi-format infrastructure in place. It
wouldn't be much more work to add a third format to the mix.

-hpa

2006-05-22 19:07:11

by Alistair John Strachan

Subject: Re: Linux Kernel Source Compression

On Monday 22 May 2006 19:58, H. Peter Anvin wrote:
[snip]
> Personally, I would like to suggest adding LZMA capability to gzip.
> The gzip format already has support for multiple compression formats.

Any idea why this wasn't done for bzip2?

--
Cheers,
Alistair.

Third year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.

2006-05-22 19:11:00

by H. Peter Anvin

Subject: Re: Linux Kernel Source Compression

Alistair John Strachan wrote:
> On Monday 22 May 2006 19:58, H. Peter Anvin wrote:
> [snip]
>> Personally, I would like to suggest adding LZMA capability to gzip.
>> The gzip format already has support for multiple compression formats.
>
> Any idea why this wasn't done for bzip2?

Yes. I have been told that the bzip2 author was originally planning to do
that, but then thought it would be harder to deploy that way (because gzip
is a core utility, and people are nervous about making it larger).

You'd have to ask him for the details, though.

It *is* true that there is a fair bit of code out there which sees a gzip magic number and
expects to call deflate functions on it, without ever checking the compression type field.
However, even if there is a need for a new magic number, this can be done within the
gzip code, or by forking gzip.

-hpa
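
To make the "compression type field" point concrete: in the gzip header defined by RFC 1952, the two magic bytes are followed by a one-byte compression method (CM), and 8 (deflate) is the only value the RFC assigns. The sketch below shows the check that careless readers skip. The method value 99 is made up purely for illustration; no LZMA method number is defined for gzip.

```python
import gzip
import struct

GZIP_MAGIC = b"\x1f\x8b"
CM_DEFLATE = 8  # the only compression method RFC 1952 assigns


def read_gzip_method(blob):
    """Return the CM byte of a gzip member, or raise ValueError."""
    if len(blob) < 10:
        raise ValueError("too short for a gzip header")
    # Fixed 10-byte header: ID1, ID2, CM, FLG, MTIME (4 bytes), XFL, OS.
    id1, id2, cm, flg, mtime, xfl, os_code = struct.unpack(
        "<BBBBIBB", blob[:10]
    )
    if bytes((id1, id2)) != GZIP_MAGIC:
        raise ValueError("not a gzip stream")
    return cm


def is_deflate(blob):
    """True only if the stream really declares the deflate method."""
    return read_gzip_method(blob) == CM_DEFLATE


stream = gzip.compress(b"hello, kernel.org\n")

# Hypothetical stream with a different method byte (99 is invented here).
fake = stream[:2] + bytes([99]) + stream[3:]

print(read_gzip_method(stream), is_deflate(stream), is_deflate(fake))
```

Code that tests only the magic number would hand `fake` straight to an inflate routine, which is the deployment hazard described above.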

2006-05-22 19:14:52

by Alistair John Strachan

Subject: Re: Linux Kernel Source Compression

On Monday 22 May 2006 20:10, H. Peter Anvin wrote:
> Alistair John Strachan wrote:
> > On Monday 22 May 2006 19:58, H. Peter Anvin wrote:
> > [snip]
> >
> >> Personally, I would like to suggest adding LZMA capability to gzip.
> >> The gzip format already has support for multiple compression formats.
> >
> > Any idea why this wasn't done for bzip2?
>
> Yes, the bzip2 author I have been told was originally planning to do that,
> but then thought it would be harder to deploy that way (because gzip is a
> core utility, and people are nervous about making it larger.)
>
> You'd have to ask him for the details, though.
>
> It *is* true that there is a fair bit of code out there which sees a gzip
> magic number and expects to call deflate functions on it, without ever
> checking the compression type field. However, even if there is a need for a
> new magic number, this can be done within the gzip code, or by forking
> gzip.

One trivial solution (that comes to mind) is by symlinking gunzip->unlzma (or
similar) and having gzip's defaults change according to argv[0].

It's a bit of a shame bzip2 even exists, really. It really would be better if
there was one unified, pluggable archiver on UNIX (and portables).

--
Cheers,
Alistair.

Third year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.
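
The argv[0] idea in the message above can be sketched in a few lines. This is only an illustration of the dispatch, not how gzip is implemented: the table of names and defaults is hypothetical, with "lzma" and "unlzma" taken from the suggestion itself.

```python
import os
import sys

# Hypothetical name-to-defaults table. The gzip/gunzip pair mirrors the
# existing convention; the lzma/unlzma entries are assumptions.
DEFAULTS = {
    "gzip":   {"codec": "deflate", "decompress": False},
    "gunzip": {"codec": "deflate", "decompress": True},
    "lzma":   {"codec": "lzma",    "decompress": False},
    "unlzma": {"codec": "lzma",    "decompress": True},
}


def defaults_for(argv0):
    """Pick defaults from the name the program was invoked as."""
    name = os.path.basename(argv0)
    # Unknown names fall back to plain gzip behaviour.
    return DEFAULTS.get(name, DEFAULTS["gzip"])


print(defaults_for("/usr/bin/unlzma"))
print(defaults_for(sys.argv[0]))
```

With symlinks named after each entry pointing at one binary, command-line flags would still be free to override whatever the name selected.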

2006-05-22 20:24:27

by Jan Engelhardt

Subject: Re: Linux Kernel Source Compression

>> > Any idea why this wasn't done for bzip2?
>>
>> Yes, the bzip2 author I have been told was originally planning to do that,
>> but then thought it would be harder to deploy that way (because gzip is a
>> core utility, and people are nervous about making it larger.)

I'd say that concern is valid.

>It's a bit of a shame bzip2 even exists, really. It really would be better if
>there was one unified, pluggable archiver on UNIX (and portables).

Would You Like To Contribute(tm)? :)
Whenever a program is missing, someone is there to write it.



Jan Engelhardt
--

2006-05-22 20:42:07

by H. Peter Anvin

Subject: Re: Linux Kernel Source Compression

Jan Engelhardt wrote:
>
> Would You Like To Contribute(tm)? :)
> Whenever a program is missing, someone is there to write it.
>

I don't have time, sorry. Between klibc, syslinux, kernel.org and being sick, my time has
been very sparse recently :(

-hpa

2006-05-22 21:00:08

by Alistair John Strachan

Subject: Re: Linux Kernel Source Compression

On Monday 22 May 2006 21:24, Jan Engelhardt wrote:
> >> > Any idea why this wasn't done for bzip2?
> >>
> >> Yes, the bzip2 author I have been told was originally planning to do
> >> that, but then thought it would be harder to deploy that way (because
> >> gzip is a core utility, and people are nervous about making it larger.)
>
> I'd say that concern is valid.
>
> >It's a bit of a shame bzip2 even exists, really. It really would be better
> > if there was one unified, pluggable archiver on UNIX (and portables).
>
> Would You Like To Contribute(tm)? :)
> Whenever a program is missing, someone is there to write it.

I would, but if it's a "valid concern" that gzip is a few hundred KB larger,
and the community would not graciously receive such work, there's not much
point, is there? :-)

Seriously, though, if I understand gzip correctly, it uses deflate/zlib
internally. Why, in that case, does /bin/gzip not (dynamically) link against
libz? If a first step was fixing that, a second could be linking dynamically
against libbz2 and 'liblzma', and making it all compile-time configurable.

That should keep everybody happy.

--
Cheers,
Alistair.

Third year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.
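
As a rough picture of what a single front end over several codec libraries might look like, here is a sketch that picks the decompressor from the stream's magic number. It uses Python's standard `gzip`, `bz2` and `lzma` modules in place of libz, libbz2 and the 'liblzma' named above. One loud assumption: it recognises the xz container, because the raw .lzma files discussed in this thread carry no magic number to sniff.

```python
import bz2
import gzip
import lzma

# (magic prefix, format name, decompress function). Adding a codec means
# adding one row; nothing else in the front end changes.
CODECS = [
    (b"\x1f\x8b", "gzip", gzip.decompress),
    (b"BZh", "bzip2", bz2.decompress),
    (b"\xfd7zXZ\x00", "xz", lzma.decompress),
]


def sniff(blob):
    """Return (name, decompress) for the first matching magic prefix."""
    for magic, name, decompress in CODECS:
        if blob.startswith(magic):
            return name, decompress
    raise ValueError("unknown compression format")


def unpack(blob):
    """Decompress a blob whose format is detected from its header."""
    name, decompress = sniff(blob)
    return name, decompress(blob)


payload = b"Linux kernel source\n" * 100
for packed in (
    gzip.compress(payload),
    bz2.compress(payload),
    lzma.compress(payload),
):
    name, data = unpack(packed)
    assert data == payload
    print(name, len(packed))
```

Whether each row is linked statically or loaded as a shared library is a build decision that this table-driven layout leaves open.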

2006-05-22 21:05:20

by H. Peter Anvin

Subject: Re: Linux Kernel Source Compression

Alistair John Strachan wrote:
>
> I would, but if it's a "valid concern" that gzip is a few hundred KB larger,
> and the community would not graciously receive such work, there's not much
> point, is there? :-)
>
> Seriously, though, if I understand gzip correctly, it uses deflate/zlib
> internally. Why, in that case, does /bin/gzip not (dynamically) link against
> libz? If a first step was fixing that, a second could be linking dynamically
> against libbz2 and 'liblzma', and making it all compile-time configurable.
>

Because gzip predates zlib...

-hpa

2006-05-22 21:11:40

by Joshua Hudson

Subject: Re: Linux Kernel Source Compression

On 5/22/06, H. Peter Anvin <[email protected]> wrote:
> Alistair John Strachan wrote:
[snip]
> > Seriously, though, if I understand gzip correctly, it uses deflate/zlib
> > internally. Why, in that case, does /bin/gzip not (dynamically) link against
> > libz? If a first step was fixing that, a second could be linking dynamically
> > against libbz2 and 'liblzma', and making it all compile-time configurable.
> >
>
> Because gzip predates zlib...
>
Also because it runs faster than zlib on x86 due to register pressure.
Relocatable code costs one register. x86 only has 7 that an algorithm
can scribble over (8 if it doesn't have any stack).

2006-05-23 02:16:16

by Nuri Jawad

Subject: Re: Linux Kernel Source Compression

Hi,
just wanted to remark that I never liked that bzip was replaced by bzip2
(were there license issues?) since bzip's compression was/is often
stronger:

39843104 Mar 28 09:33 linux-2.6.15.7.tar.bz2
39423739 Mar 28 09:33 linux-2.6.15.7.tar.bz

Not a big difference in this case but still a step back. I for one am
keeping my bzip binary.. does anyone know where the source can still be
found?

Regards, Nuri

2006-05-23 02:55:14

by Stefan Smietanowski

Subject: Re: Linux Kernel Source Compression

Nuri Jawad wrote:
> Hi,
> just wanted to remark that I never liked that bzip was replaced by bzip2
> (were there license issues?) since bzip's compression was/is often
> stronger:

bzip was (possibly) infringing a patent so a method bzip used was
removed and subsequently bzip2 was created.

http://lists.debian.org/debian-devel/1997/12/msg00778.html

// Stefan



2006-05-23 13:37:36

by Jan Engelhardt

Subject: Re: Linux Kernel Source Compression

>> Seriously, though, if I understand gzip correctly, it uses deflate/zlib
>> internally. Why, in that case, does /bin/gzip not (dynamically) link
>> against libz? If a first step was fixing that, a second could be linking
>> dynamically against libbz2 and 'liblzma', and making it all compile-time
>> configurable.
>
> Because gzip predates zlib...
>
So we are carrying cruft around.


Jan Engelhardt
--

2006-05-23 13:39:05

by Jan Engelhardt

Subject: Re: Linux Kernel Source Compression

>> >> > Any idea why this wasn't done for bzip2?
>> >>
>> >> Yes, the bzip2 author I have been told was originally planning to do
>> >> that, but then thought it would be harder to deploy that way (because
>> >> gzip is a core utility, and people are nervous about making it larger.)
>>
>> I'd say that concern is valid.
>>
>> >It's a bit of a shame bzip2 even exists, really. It really would be better
>> > if there was one unified, pluggable archiver on UNIX (and portables).
>>
>> Would You Like To Contribute(tm)? :)
>> Whenever a program is missing, someone is there to write it.
>
>I would, but if it's a "valid concern" that gzip is a few hundred KB larger,
>and the community would not graciously receive such work, there's not much
>point, is there? :-)
>
Make it use shared libraries (did I already mention that?)

BTW, "a few hundred KB" is really overestimated if it's just about bzip2:
-rwxr-xr-x 1 root root 27640 Apr 23 02:20 /usr/bin/bzip2
-rwxr-xr-x 1 root root 66864 Apr 23 02:20 /lib/libbz2.so.1.0.0
That's not even _one_ hundred KB. Oh, just keep it as .so. :)
And of course, compile with klibc, it has less loader bloat than glibc (as
someone had found out...I think it was Greg.)



Jan Engelhardt
--

2006-05-23 14:15:29

by Ivan Novick

Subject: Re: Linux Kernel Source Compression

cc'ing Julian Seward the author of bzip2

----- Original message -----
Hi,
just wanted to remark that I never liked that bzip was replaced by bzip2
(were there license issues?) since bzip's compression was/is often
stronger:

39843104 Mar 28 09:33 linux-2.6.15.7.tar.bz2
39423739 Mar 28 09:33 linux-2.6.15.7.tar.bz

Not a big difference in this case but still a step back. I for once am
keeping my bzip binary.. does anyone know where the source can still be
found?

2006-05-23 14:23:04

by Olivier Galibert

Subject: Re: Linux Kernel Source Compression

> just wanted to remark that I never liked that bzip was replaced by bzip2
> (were there license issues?) since bzip's compression was/is often
> stronger:

bzip1 uses arithmetic encoding which is heavily patented. bzip2 uses
huffman instead, which isn't, but is slightly (10% is often quoted)
less efficient. I guess bzip3 could use range coding which is
supposedly patent-free[1] and has a compression ratio similar to that
of arithmetic coding.

OG.

[1] I guess everything is in the way it is written, since I have a
very hard time understanding where the difference is between range coding
and arithmetic coding.
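
For readers wondering why Huffman coding can lose to arithmetic or range coding at all: a Huffman code spends a whole number of bits on every symbol, while an arithmetic or range coder can approach the Shannon entropy. The sketch below is a generic textbook illustration on a made-up, heavily skewed three-symbol distribution. It is not bzip2's coder and says nothing about the size of the gap in bzip2 itself.

```python
import heapq
import math


def huffman_lengths(probs):
    """Return {symbol: code length in bits} for a Huffman code."""
    # Heap entries: (probability, unique tie-breaker, {symbol: depth}).
    heap = [(p, i, {sym: 0}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level down.
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]


def average_bits(probs, lengths):
    """Expected code length per symbol, in bits."""
    return sum(probs[s] * lengths[s] for s in probs)


def entropy_bits(probs):
    """Shannon entropy: the bound an ideal arithmetic coder approaches."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


# Invented distribution, chosen to make the whole-bit penalty visible.
probs = {"a": 0.90, "b": 0.05, "c": 0.05}
lengths = huffman_lengths(probs)
print(lengths)
print(f"Huffman: {average_bits(probs, lengths):.3f} bits/symbol")
print(f"Entropy: {entropy_bits(probs):.3f} bits/symbol")
```

The penalty is largest when one symbol dominates, since Huffman cannot give it less than one bit; on flatter distributions the two figures move much closer together.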

2006-05-23 14:47:38

by Julian Seward

Subject: Re: Linux Kernel Source Compression


> bzip1 uses arithmetic encoding which is heavily patented. bzip2 uses
> huffman instead, which isn't, but is slightly (10% is often quoted)
> less efficient.

It uses an adaptive huffman scheme devised by David Wheeler, which usually
gets within 1% of the arithmetic coder that bzip1 used.

bzip2, especially the 1.0.X series, is superior to bzip1 in terms of speed,
memory use, robustness against bad-case inputs, recoverability of data
from damaged compressed streams, and that it can be used as a library.
Moving back to bzip1 would IMO be a big step backwards.

J
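
The "within 1%" figure can be set against the two file sizes Nuri Jawad posted earlier in the thread. This is plain arithmetic on those two numbers, a single data point rather than a benchmark.

```python
# Byte counts from the `ls` output posted earlier in the thread.
bz2_bytes = 39843104  # linux-2.6.15.7.tar.bz2
bz1_bytes = 39423739  # linux-2.6.15.7.tar.bz

extra = bz2_bytes - bz1_bytes
overhead = extra / bz1_bytes  # relative to the bzip (bzip1) output

print(f"bzip2 output is {extra} bytes ({overhead:.2%}) larger")
```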

2006-05-23 15:30:25

by Alistair John Strachan

Subject: Re: Linux Kernel Source Compression

On Tuesday 23 May 2006 14:38, Jan Engelhardt wrote:
> >> >> > Any idea why this wasn't done for bzip2?
> >> >>
> >> >> Yes, the bzip2 author I have been told was originally planning to do
> >> >> that, but then thought it would be harder to deploy that way (because
> >> >> gzip is a core utility, and people are nervous about making it
> >> >> larger.)
> >>
> >> I'd say that concern is valid.
> >>
> >> >It's a bit of a shame bzip2 even exists, really. It really would be
> >> > better if there was one unified, pluggable archiver on UNIX (and
> >> > portables).
> >>
> >> Would You Like To Contribute(tm)? :)
> >> Whenever a program is missing, someone is there to write it.
> >
> >I would, but if it's a "valid concern" that gzip is a few hundred KB
> > larger, and the community would not graciously receive such work, there's
> > not much point, is there? :-)
>
> Make it use shared libraries (did I already mention that?)

Actually I did, in the paragraph that you just snipped.

> BTW, "a few hundred KB" is really overestimated if it's just about bzip2:
> -rwxr-xr-x 1 root root 27640 Apr 23 02:20 /usr/bin/bzip2
> -rwxr-xr-x 1 root root 66864 Apr 23 02:20 /lib/libbz2.so.1.0.0
> That's not even _one_ hundred KB. Oh, just keep it as .so. :)
> And of course, compile with klibc, it has less loader bloat than glibc (as
> someone had found out...I think it was Greg.)

Agreed. However gzip is such an ancient (and presumably now secure) tool that
it might be unpopular to modify it so heavily. It might also be desirable for
embedded folks to statically link code.

But, this is now seriously OT for LKML, so I might just email the GNU gzip
folks and see whether it's been done before and/or whether it's a good idea.

--
Cheers,
Alistair.

Third year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.

2006-05-23 16:35:41

by Nuri Jawad

Subject: Re: Linux Kernel Source Compression

On Tue, 23 May 2006, Julian Seward wrote:

> It uses an adaptive huffman scheme devised by David Wheeler, which usually
> gets within 1% of the arithmetic coder that bzip1 used.

If that coder has patent issues, it shouldn't be used, of course,
regardless of performance.

> bzip2, especially the 1.0.X series, is superior to bzip1 in terms of speed,
> memory use, robustness against bad-case inputs, recoverability of data
> from damaged compressed streams, and that it can be used as a library.

Superior in most aspects, yes, but not regarding compression ratio.
Anyway, calling bzip2 a step backwards was a bit of provocation and not
really meant seriously, but it does have slightly reduced compression ratio.

Maybe bzip2 could be updated to make more use of today's fast CPUs? Much
larger dictionary or other computationally expensive improvements.

Regards, Nuri

2006-05-25 11:43:08

by Jan Engelhardt

Subject: Re: Linux Kernel Source Compression

>> just wanted to remark that I never liked that bzip was replaced by bzip2
>> (were there license issues?) since bzip's compression was/is often
>> stronger:
>
>bzip1 uses arithmetic encoding which is heavily patented. bzip2 uses
>huffman instead, which isn't, but is slightly (10% is often quoted)
>less efficient. I guess bzip3 could use range coding which is
>supposedly patent-free[1] and has similar compression ratio than
>arithmetic coding.
>
Although plans for a bzip3 have been posted (I think removing the MTF and
so on...), it has not been done yet. Maybe I am wrong here.


Jan Engelhardt
--

2006-05-26 04:11:25

by Bruce Guenter

Subject: Re: Linux Kernel Source Compression

On Sun, May 21, 2006 at 10:35:00AM -0400, Justin Piszcz wrote:
> Was curious as to which utilities would offer the best compression ratio
> for the kernel source, I thought it'd be bzip2 or rar but lzma wins,
> roughly 6 MiB smaller than bzip2.
>
> $ du -sk * | sort -n
> 33520 linux-2.6.16.17.tar.lzma

Since it was requested by somebody:

$ du -sk linux-2.6.16.17.*
32904 linux-2.6.16.17.7z
39919 linux-2.6.16.17.tar.bz2

This was done with: 7z -mx=9
--
Bruce Guenter <[email protected]> http://untroubled.org/
OpenPGP key: 699980E8 / D0B7 C8DD 365D A395 29DA 2E2A E96F B2DC 6999 80E8



2006-05-31 22:50:00

by Bill Davidsen

Subject: Re: Linux Kernel Source Compression

Nuri Jawad wrote:
> Hi,
> just wanted to remark that I never liked that bzip was replaced by bzip2
> (were there license issues?) since bzip's compression was/is often
> stronger:
>
> 39843104 Mar 28 09:33 linux-2.6.15.7.tar.bz2
> 39423739 Mar 28 09:33 linux-2.6.15.7.tar.bz
>
> Not a big difference in this case but still a step back. I for once am
> keeping my bzip binary.. does anyone know where the source can still be
> found?

I know I have a copy backed up, but I'm rather disorganized at the
moment, having moved two out-of-town offices into this one, after
spending 12 years on a ten week contract... but I doubt you want it,
it's slow as hell and violates all manner of patents. Mind you, I think
the patents are held by IBM, so they might be negotiable, but I think
the original is dead. Used either fractal or arithmetic compression IIRC.
--
Bill Davidsen <[email protected]>
Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a
normal user and is setuid root, with the "vi" line edit mode selected,
and the character set is "big5," an off-by-one errors occurs during
wildcard (glob) expansion.

2006-05-31 22:55:20

by Bill Davidsen

Subject: Re: Linux Kernel Source Compression

Alistair John Strachan wrote:

> It's a bit of a shame bzip2 even exists, really. It really would be better if
> there was one unified, pluggable archiver on UNIX (and portables).
>
All the people with slow connections bless bzip2. If you or someone else
wants a new compressor, write a program for it, call it something unique
so people will know it's different, and be happy.

Even with a fast line, I can only pull as fast as the source can push,
so smaller is better for all of us. The times to decompress a tar.bz2 and
a tar.gz are very similar: the CPU time for bzip2 is about double, and the
time to create the directories and write the files is the same in either
case.

--
Bill Davidsen <[email protected]>
Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a
normal user and is setuid root, with the "vi" line edit mode selected,
and the character set is "big5," an off-by-one errors occurs during
wildcard (glob) expansion.