2006-09-21 20:32:59

by Dax Kelson

Subject: Smaller compressed kernel source tarballs?

Today as I was watching the linux-2.6.18.tar.bz2 slowly download I
thought it would be nice if it could be made smaller.

The 7zip program/algorithm is free software (LGPL) and can be obtained
from http://www.7-zip.org/ and it is distributed with several
distributions (it is in Fedora Core 6 extras for example).

Here are the numbers:

ls -al
-rw-r--r-- 1 root root 240138240 Sep 21 13:55 linux-2.6.18.tar
-rw-r--r-- 1 root root 34180796 Sep 21 13:42 linux-2.6.18.tar.7z
-rw-r--r-- 1 root root 41863580 Sep 21 13:45 linux-2.6.18.tar.bz2
-rw-r--r-- 1 root root 52467357 Sep 21 13:13 linux-2.6.18.tar.gz

ls -alh
-rw-r--r-- 1 root root 230M Sep 21 13:55 linux-2.6.18.tar
-rw-r--r-- 1 root root 33M Sep 21 13:42 linux-2.6.18.tar.7z
-rw-r--r-- 1 root root 40M Sep 21 13:45 linux-2.6.18.tar.bz2
-rw-r--r-- 1 root root 51M Sep 21 13:13 linux-2.6.18.tar.gz

Smaller the better, especially with the international audience.

Dax Kelson
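
The listing above can be turned into ratios with a few lines of arithmetic. This is only a sketch: the byte counts are copied from the `ls -al` output, while the helper names (`percent_of`, `saving_vs`) are made up for illustration.

```python
# Sizes in bytes, copied from the `ls -al` listing in the message above.
SIZES = {
    "tar": 240138240,
    "tar.7z": 34180796,
    "tar.bz2": 41863580,
    "tar.gz": 52467357,
}


def percent_of(size: int, base: int) -> float:
    """Return `size` as a percentage of `base`."""
    return 100.0 * size / base


def saving_vs(smaller: str, larger: str) -> float:
    """Percentage by which the `smaller` archive undercuts the `larger` one."""
    return 100.0 * (1 - SIZES[smaller] / SIZES[larger])


for name, size in SIZES.items():
    share = percent_of(size, SIZES["tar"])
    print(f"{name:8s} {size:>10d} bytes  {share:5.1f}% of the plain tar")

print(f"7z vs bz2: {saving_vs('tar.7z', 'tar.bz2'):.1f}% smaller")
print(f"7z vs gz:  {saving_vs('tar.7z', 'tar.gz'):.1f}% smaller")
```

On these figures the .7z file comes out roughly 18% smaller than the .bz2 and roughly 35% smaller than the .gz.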


2006-09-21 20:42:52

by Lennart Sorensen

Subject: Re: Smaller compressed kernel source tarballs?

On Thu, Sep 21, 2006 at 02:32:57PM -0600, Dax Kelson wrote:
> Today as I was watching the linux-2.6.18.tar.bz2 slowly download I
> thought it would be nice if it could be made smaller.
>
> The 7zip program/algorithm is free software (LGPL) and can be obtained
> from http://www.7-zip.org/ and it is distributed with several
> distributions (it is in Fedora Core 6 extras for example).
>
> Here are the numbers:
>
> ls -al
> -rw-r--r-- 1 root root 240138240 Sep 21 13:55 linux-2.6.18.tar
> -rw-r--r-- 1 root root 34180796 Sep 21 13:42 linux-2.6.18.tar.7z
> -rw-r--r-- 1 root root 41863580 Sep 21 13:45 linux-2.6.18.tar.bz2
> -rw-r--r-- 1 root root 52467357 Sep 21 13:13 linux-2.6.18.tar.gz
>
> ls -alh
> -rw-r--r-- 1 root root 230M Sep 21 13:55 linux-2.6.18.tar
> -rw-r--r-- 1 root root 33M Sep 21 13:42 linux-2.6.18.tar.7z
> -rw-r--r-- 1 root root 40M Sep 21 13:45 linux-2.6.18.tar.bz2
> -rw-r--r-- 1 root root 51M Sep 21 13:13 linux-2.6.18.tar.gz
>
> Smaller the better, especially with the international audience.

But after you download it once, you can just get the diff next time.
How is the decompression time on 7zip versus bzip2 and gzip?

--
Len Sorensen

2006-09-21 21:17:50

by Sean

Subject: Re: Smaller compressed kernel source tarballs?

On Thu, 21 Sep 2006 16:42:50 -0400
Lennart Sorensen wrote:

> On Thu, Sep 21, 2006 at 02:32:57PM -0600, Dax Kelson wrote:
> > Today as I was watching the linux-2.6.18.tar.bz2 slowly download I
> > thought it would be nice if it could be made smaller.
[...]
> But after you download it once, you can just get the diff next time.
> How is the decompression time on 7zip versus bzip2 and gzip?

Not to mention that Git will take care of all that for you, downloading
only the updates with no need for you to manually apply diffs, etc.

Sean

2006-09-21 21:40:12

by Dax Kelson

Subject: Re: Smaller compressed kernel source tarballs?

On Thu, 2006-09-21 at 16:42 -0400, Lennart Sorensen wrote:
> But after you download it once, you can just get the diff next time.
> How is the decompression time on 7zip versus bzip2 and gzip?

Decompression times on 2.6.18 are as follows:

gzip: 0m3.509s
7zip: 0m10.012s
bzip2: 0m22.703s

Dax Kelson
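
A comparison like this can be reproduced in miniature with Python's standard library. What follows is a rough sketch under stated assumptions, not the method used for the numbers above: `sample` is a made-up stand-in for the kernel tarball, `time_decompress` is a hypothetical helper, and Python's `lzma` module writes the .xz container rather than a .7z archive, so absolute times will differ from those in the thread.

```python
import bz2
import gzip
import lzma
import time


def time_decompress(decompress, blob: bytes) -> tuple[float, int]:
    """Return (seconds taken, decompressed length) for one call."""
    start = time.perf_counter()
    out = decompress(blob)
    return time.perf_counter() - start, len(out)


# Hypothetical stand-in for source code; a real test would read the tarball.
sample = b"static int example(void)\n{\n\treturn 0;\n}\n" * 50_000

codecs = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

results = {}
for name, (compress, decompress) in codecs.items():
    blob = compress(sample)
    elapsed, size = time_decompress(decompress, blob)
    results[name] = (len(blob), elapsed, size)
    print(f"{name:6s} compressed={len(blob):>8d} bytes  decompress={elapsed:.4f}s")
```

Highly repetitive input like this flatters every codec, so the ordering is only meaningful when the script is pointed at the real archive contents.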

2006-09-21 21:41:17

by Dax Kelson

Subject: Re: Smaller compressed kernel source tarballs?

On Thu, 2006-09-21 at 17:17 -0400, Sean wrote:
> Not to mention that Git will take care of all that for you, downloading
> only the updates with no need for you to manually apply diffs, etc.
>
> Sean

Git users and tarball users are different audiences.

Dax Kelson

2006-09-21 21:44:13

by H. Peter Anvin

Subject: Re: Smaller compressed kernel source tarballs?

Lennart Sorensen wrote:
> On Thu, Sep 21, 2006 at 02:32:57PM -0600, Dax Kelson wrote:
>> Today as I was watching the linux-2.6.18.tar.bz2 slowly download I
>> thought it would be nice if it could be made smaller.
>>
>> The 7zip program/algorithm is free software (LGPL) and can be obtained
>> from http://www.7-zip.org/ and it is distributed with several
>> distributions (it is in Fedora Core 6 extras for example).
>>
>
> But after you download it once, you can just get the diff next time.
> How is the decompression time on 7zip versus bzip2 and gzip?
>

7zip (LZMA) decompresses quickly, and the decompressor text is actually
smaller than the equivalent for gzip. Quite nice.

What is not nice is the code for the compressor, which is a total mess.
I have been holding out on implementing LZMA on kernel.org, because
just as zip (deflate) didn't become common in the Unix world until an
encapsulation format that handles things expected in the Unix world,
e.g. streaming, was created (gzip), I don't think LZMA is going to be
widely used until there is an "lzip" which does the same thing. I
actually started the work of adding LZMA support to gzip, but then
realized it would be better if a new encapsulation format with proper
64-bit support everywhere was created.

-hpa

2006-09-21 21:50:14

by Bob Copeland

Subject: Re: Smaller compressed kernel source tarballs?

On 9/21/06, Dax Kelson wrote:
> Git users and tarball users are different audiences.

Try ketchup then. http://www.selenic.com/ketchup/wiki/

2006-09-21 21:57:21

by Sean

Subject: Re: Smaller compressed kernel source tarballs?

On Thu, 21 Sep 2006 15:41:15 -0600
Dax Kelson wrote:

>
> Git users and tarball users are different audiences.
>

Don't see why that needs to be the case. Git can even produce the
tarballs once you've synced up with kernel.org (see git-tar-tree).
People interested in conserving bandwidth should really consider
the use of Git.

Sean

2006-09-21 22:12:58

by David Lang

Subject: Re: Smaller compressed kernel source tarballs?

On Thu, 21 Sep 2006, Sean wrote:

> On Thu, 21 Sep 2006 15:41:15 -0600
> Dax Kelson wrote:
>
>>
>> Git users and tarball users are different audiences.
>>
>
> Don't see why that needs to be the case. Git can even produce the
> tarballs once you've synced up with kernel.org (see git-tar-tree).
> People interested in conserving bandwidth should really consider
> the use of Git.

yes,
however git users are people who plan on following every kernel version for a
while, tarball users are people who grab a copy of the kernel once in a while
(probably not every version). for the tarball users they would have to grab
multiple patches to get from the last thing that they have to whatever is
current. and frankly they may not (and probably should not) trust the last thing
that they have, as in many cases it's a distro patched kernel that may not be
compatible with the vanilla kernel.

people who start downloading every revision should start using git or patches,
but not everyone needs it.

also people could be behind a firewall that prevents git from working properly,
for them tarballs and patches are the right way of doing things.

David Lang

2006-09-21 22:24:59

by Dave Jones

Subject: Re: Smaller compressed kernel source tarballs?

On Thu, Sep 21, 2006 at 03:00:48PM -0700, David Lang wrote:

> for the tarball users they would have to grab
> multiple patches to get from the last thing that they have to whatever is
> current.

ketchup solves that problem. One command brings any tree up to current.

> also people could be behind a firewall that prevents git from working properly,
> for them tarballs and patches are the right way of doing things.

If they can't git through a firewall, they won't be able to wget a tarball through
it either.

Dave

2006-09-21 22:25:57

by Sean

Subject: Re: Smaller compressed kernel source tarballs?

On Thu, 21 Sep 2006 15:00:48 -0700 (PDT)
David Lang wrote:

> yes,
> however git users are people who plan on following every kernel version for a
> while, tarball users are people who grab a copy of the kernel once in a while
> (probably not every version). for the tarball users they would have to grab
> multiple patches to get from the last thing that they have to whatever is
> current. and frankly they may not (and probably should not) trust the last thing
> that they have, as in many cases it's a distro patched kernel that may not be
> compatible with the vanilla kernel.
>
> people who start downloading every revision should start using git or patches,
> but not everyone needs it.

Agreed, but for those people there isn't going to be much need (if any) to
worry about whether the tarball is in .gzip or .bzip2 or whatever, either. And
that was the case that inspired the suggestion.

> also people could be behind a firewall that prevents git from working properly,
> for them tarballs and patches are the right way of doing things.

I use git from behind a firewall every day without a problem. If you've seen
such a problem yourself, a bug report would hopefully lead to a solution.

Thanks,
Sean

2006-09-21 22:29:00

by David Lang

Subject: Re: Smaller compressed kernel source tarballs?

On Thu, 21 Sep 2006, Dave Jones wrote:

> On Thu, Sep 21, 2006 at 03:00:48PM -0700, David Lang wrote:
>
> > for the tarball users they would have to grab
> > multiple patches to get from the last thing that they have to whatever is
> > current.
>
> ketchup solves that problem. One command brings any tree up to current.

so are you saying that ketchup should be used for _all_ access to the vanilla
tree that isn't done via git?

if not then tarballs still have a place.

and how does ketchup deal with patched trees to start with?

> > also people could be behind a firewall that prevents git from working properly,
> > for them tarballs and patches are the right way of doing things.
>
> If they can't git through a firewall, they won't be able to wget a tarball through
> it either.

to work properly git should talk its own protocol, http/ftp can be allowed (and
authenticated) through firewalls that don't allow the git protocol.

David Lang

> Dave
>

2006-09-21 22:32:18

by David Lang

Subject: Re: Smaller compressed kernel source tarballs?

On Thu, 21 Sep 2006, Sean wrote:

>> also people could be behind a firewall that prevents git from working properly,
>> for them tarballs and patches are the right way of doing things.
>
> I use git from behind a firewall every day without a problem. If you've seen
> such a problem yourself, a bug report would hopefully lead to a solution.

it's not a bug, it's simply the fact that git (properly) uses its own port for
its own protocol, and not all firewalls allow access to that port. in some
cases even where a person would have the ability to get the firewall changed
they may not want to for other (political) reasons.

even if git tunneled over HTTP there would be firewalls that would require
authentication that git wouldn't be able to do and would therefore block the
access.

David Lang

2006-09-21 22:41:08

by Dave Jones

Subject: Re: Smaller compressed kernel source tarballs?

On Thu, Sep 21, 2006 at 03:16:57PM -0700, David Lang wrote:
> On Thu, 21 Sep 2006, Dave Jones wrote:
>
> > On Thu, Sep 21, 2006 at 03:00:48PM -0700, David Lang wrote:
> >
> > > for the tarball users they would have to grab
> > > multiple patches to get from the last thing that they have to whatever is
> > > current.
> >
> > ketchup solves that problem. One command brings any tree up to current.
>
> so are you saying that ketchup should be used for _all_ access to the vanilla
> tree that isn't done via git?
> if not then tarballs still have a place.

I think you have a misunderstanding over what ketchup is/does.
It cannot usurp tarballs by its very nature. It retrieves tarballs (if necessary)
and whatever patches are necessary to get to the tree you want.
http://www.selenic.com/ketchup/

> and how does ketchup deal with patched trees to start with?

By unpatching if necessary.

> > > also people could be behind a firewall that prevents git from working properly,
> > > for them tarballs and patches are the right way of doing things.
> >
> > If they can't git through a firewall, they won't be able to wget a tarball through
> > it either.
>
> to work properly git should talk its own protocol, http/ftp can be allowed (and
> authenticated) through firewalls that don't allow the git protocol.

'properly' is the wrong word here. optimally, yes, but the firewall argument
alone isn't sufficient to claim git can't be used to clone a tree.
A tree cloned over http: vs one over git: has exactly the same information in
it. All the history, all the changes. Everything.

Dave

2006-09-21 22:46:50

by David Lang

Subject: Re: Smaller compressed kernel source tarballs?

On Thu, 21 Sep 2006, Dave Jones wrote:

>
> On Thu, Sep 21, 2006 at 03:16:57PM -0700, David Lang wrote:
> > On Thu, 21 Sep 2006, Dave Jones wrote:
> >
> > > On Thu, Sep 21, 2006 at 03:00:48PM -0700, David Lang wrote:
> > >
> > > > for the tarball users they would have to grab
> > > > multiple patches to get from the last thing that they have to whatever is
> > > > current.
> > >
> > > ketchup solves that problem. One command brings any tree up to current.
> >
> > so are you saying that ketchup should be used for _all_ access to the vanilla
> > tree that isn't done via git?
> > if not then tarballs still have a place.
>
> I think you have a misunderstanding over what ketchup is/does.
> It cannot usurp tarballs by its very nature. It retrieves tarballs (if necessary)
> and whatever patches are necessary to get to the tree you want.
> http://www.selenic.com/ketchup/

in that case the compression of the tarballs is still worth dealing with

> > and how does ketchup deal with patched trees to start with?
>
> By unpatching if necessary.

assuming that it knows where to get the patches from, I was referring to things
like the debian or redhat tree with their patches.

> > > > also people could be behind a firewall that prevents git from working properly,
> > > > for them tarballs and patches are the right way of doing things.
> > >
> > > If they can't git through a firewall, they won't be able to wget a tarball through
> > > it either.
> >
> > to work properly git should talk its own protocol, http/ftp can be allowed (and
> > authenticated) through firewalls that don't allow the git protocol.
>
> 'properly' is the wrong word here. optimally, yes, but the firewall argument
> alone isn't sufficient to claim git can't be used to clone a tree.
> A tree cloned over http: vs one over git: has exactly the same information in
> it. All the history, all the changes. Everything.

in most cases, but there are cases where the dumb transports can make mistakes
(there have been several threads on the git list covering these), git is good
enough to notice most of them, but there is still room for problems. Also,
installing and configuring git should not be a prerequisite to getting the
kernel.

the point being git and ketchup do not eliminate the need to transfer tarballs,
and therefore do not eliminate the attractiveness of a compression that saves a
significant amount of bandwidth.

I was responding to the (apparent) argument that with git and ketchup people
should not ever be downloading tarballs, so something that cuts the size of a
tarball in half doesn't make any difference.

David Lang

2006-09-21 23:38:26

by Sean

Subject: Re: Smaller compressed kernel source tarballs?

On Thu, 21 Sep 2006 15:34:53 -0700 (PDT)
David Lang wrote:

> I was responding to the (apparent) argument that with git and ketchup people
> should not ever be downloading tarballs, so something that cuts the size of a
> tarball in half doesn't make any difference.

Sure there are some cases where tarballs are more appropriate, but with git
and maybe some of the other tools it should really be the minority situation.
I wonder how many people just use tarballs out of inertia. All said though
saving a few bytes of bandwidth by making the tarballs smaller can't hurt.

Sean

2006-09-22 14:00:33

by Lennart Sorensen

Subject: Re: Smaller compressed kernel source tarballs?

On Thu, Sep 21, 2006 at 03:40:09PM -0600, Dax Kelson wrote:
> Decompression times on 2.6.18 are as follows:
>
> gzip: 0m3.509s
> 7zip: 0m10.012s
> bzip2: 0m22.703s

Hmm, not bad.

--
Len Sorensen

2006-09-22 14:00:16

by Lennart Sorensen

Subject: Re: Smaller compressed kernel source tarballs?

On Thu, Sep 21, 2006 at 02:43:46PM -0700, H. Peter Anvin wrote:
> 7zip (LZMA) decompresses quickly, and the decompressor text is actually
> smaller than the equivalent for gzip. Quite nice.
>
> What is not nice is the code for the compressor, which is a total mess.
> I have been holding out on implementing LZMA on kernel.org, because
> just as zip (deflate) didn't become common in the Unix world until an
> encapsulation format that handles things expected in the Unix world,
> e.g. streaming, was created (gzip), I don't think LZMA is going to be
> widely used until there is an "lzip" which does the same thing. I
> actually started the work of adding LZMA support to gzip, but then
> realized it would be better if a new encapsulation format with proper
> 64-bit support everywhere was created.

It doesn't handle streaming?

So you can't do: tar c dirname | 7zip dirname.tar.7z ?

--
Len Sorensen
RuggedCom

2006-09-22 16:13:42

by H. Peter Anvin

Subject: Re: Smaller compressed kernel source tarballs?

Lennart Sorensen wrote:
> On Thu, Sep 21, 2006 at 02:43:46PM -0700, H. Peter Anvin wrote:
>> 7zip (LZMA) decompresses quickly, and the decompressor text is actually
>> smaller than the equivalent for gzip. Quite nice.
>>
>> What is not nice is the code for the compressor, which is a total mess.
>> I have been holding out on implementing LZMA on kernel.org, because
>> just as zip (deflate) didn't become common in the Unix world until an
>> encapsulation format that handles things expected in the Unix world,
>> e.g. streaming, was created (gzip), I don't think LZMA is going to be
>> widely used until there is an "lzip" which does the same thing. I
>> actually started the work of adding LZMA support to gzip, but then
>> realized it would be better if a new encapsulation format with proper
>> 64-bit support everywhere was created.
>
> It doesn't handle streaming?
>
> So you can't do: tar c dirname | 7zip dirname.tar.7z ?
>

Nope, and in particular you can't do:

tar cf - dirname | 7zip | ssh ...

This is because 7zip is an archiving format in its own right, much like
zip. What we want is something that is to 7zip what gzip is to zip.

-hpa
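
The property being asked for here, a compressor that only ever appends to its output, can be sketched with Python's standard `lzma` module. This is an illustration under assumptions, not the tooling discussed in the thread: `stream_compress` is a hypothetical helper, the payload is a made-up stand-in for `tar cf - dirname`, and the module's default .xz container is used simply because it is what the library provides.

```python
import lzma
import os
from typing import Callable, Iterable


def stream_compress(chunks: Iterable[bytes], write: Callable[[bytes], object]) -> None:
    """Compress chunks one at a time, emitting output through write() only."""
    compressor = lzma.LZMACompressor()  # default container is .xz with a CRC64 check
    for chunk in chunks:
        data = compressor.compress(chunk)
        if data:
            write(data)
    write(compressor.flush())  # finish the stream; earlier output is never rewritten


# Demo on a pipe, which cannot be seeked. The payload is a made-up stand-in
# for tar output and compresses to well under the pipe buffer size.
payload = [b"line of source code\n" * 500 for _ in range(4)]
read_fd, write_fd = os.pipe()
with os.fdopen(write_fd, "wb") as sink:
    stream_compress(payload, sink.write)
with os.fdopen(read_fd, "rb") as source:
    compressed = source.read()

restored = lzma.decompress(compressed)
print(len(b"".join(payload)), "bytes in,", len(compressed), "bytes through the pipe")
```

Because the function only calls `write()`, the same code would work with `sys.stdout.buffer.write` in the middle of a pipeline feeding ssh.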

2006-09-22 16:16:44

by Jan Engelhardt

Subject: Re: Smaller compressed kernel source tarballs?

>> widely used until there is an "lzip" which does the same thing. I
>> actually started the work of adding LZMA support to gzip, but then
>> realized it would be better if a new encapsulation format with proper
>> 64-bit support everywhere was created.
>
>It doesn't handle streaming?
>
>So you can't do: tar c dirname | 7zip dirname.tar.7z ?

man 7z [slightly changed for reasonability]:

-si
Read data from StdIn (eg: tar -c directory | 7z a -si directory.tar.7z)



Jan Engelhardt
--

2006-09-22 16:33:24

by H. Peter Anvin

Subject: Re: Smaller compressed kernel source tarballs?

Jan Engelhardt wrote:
>>> widely used until there is an "lzip" which does the same thing. I
>>> actually started the work of adding LZMA support to gzip, but then
>>> realized it would be better if a new encapsulation format with proper
>>> 64-bit support everywhere was created.
>> It doesn't handle streaming?
>>
>> So you can't do: tar c dirname | 7zip dirname.tar.7z ?
>
> man 7z [slightly changed for reasonability]:
>
> -si
> Read data from StdIn (eg: tar -c directory | 7z a -si directory.tar.7z)
>

Yes, but you can't make it write to an unseekable stdout.

-hpa

2006-09-22 17:42:14

by Johannes Stezenbach

Subject: Re: Smaller compressed kernel source tarballs?

On Fri, Sep 22, 2006 at 09:33:01AM -0700, H. Peter Anvin wrote:
> Jan Engelhardt wrote:
> >>>widely used until there is an "lzip" which does the same thing. I
> >>>actually started the work of adding LZMA support to gzip, but then
> >>>realized it would be better if a new encapsulation format with proper
> >>>64-bit support everywhere was created.
> >>It doesn't handle streaming?
> >>
> >>So you can't do: tar c dirname | 7zip dirname.tar.7z ?
> >
> >man 7z [slightly changed for reasonability]:
> >
> > -si
> > Read data from StdIn (eg: tar -c directory | 7z a -si
> > directory.tar.7z)
> >
>
> Yes, but you can't make it write to an unseekable stdout.

It seems the "lzma" program from LZMA Utils can:

http://tukaani.org/lzma/
"Very similar command line interface than what gzip and bzip2 have."

(Debian sid has this in the "lzma" package.)


Johannes

2006-09-22 18:09:55

by H. Peter Anvin

Subject: Re: Smaller compressed kernel source tarballs?

Johannes Stezenbach wrote:
>
> It seems the "lzma" program from LZMA Utils can:
>
> http://tukaani.org/lzma/
> "Very similar command line interface than what gzip and bzip2 have."
>
> (Debian sid has this in the "lzma" package.)
>

Yes, it can. If that's the way things go then I don't mind it, however,
my biggest problem with lzma utils is that the command line parsing is
done in a shell script wrapper.

Maybe I'll start using it anyway...

-hpa

2006-09-22 18:19:32

by Michael Tokarev

Subject: Re: Smaller compressed kernel source tarballs?

H. Peter Anvin wrote:
> Johannes Stezenbach wrote:
>>
>> It seems the "lzma" program from LZMA Utils can:
>>
>> http://tukaani.org/lzma/
>> "Very similar command line interface than what gzip and bzip2 have."
>>
>> (Debian sid has this in the "lzma" package.)
>>
>
> Yes, it can. If that's the way things go then I don't mind it, however,
> my biggest problem with lzma utils is that the command line parsing is
> done in a shell script wrapper.

Well, I don't see any shell code here, in /usr/bin/lzma as installed from
debian version 4.43-2.

But note that this lzma utility does not have any 'magic number' and does
no crc checks. On the site it's said lzma(sdk) is under rewrite to support
new format with magic number and crc checks...

After reading this thread I wanted to teach GNU tar to automatically recognize
.tar.lzma archives - and failed, exactly because of the lack of magic number
at the start of a file...

/mjt
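
The detection problem can be shown with a small sniffing sketch. The gzip, bzip2 and 7z signatures below are the well-known ones; `sniff` and the sample data are made up for illustration, and it is an assumption that Python's `lzma.FORMAT_ALONE` (the legacy .lzma container) matches what the lzma utility in question writes.

```python
import bz2
import gzip
import lzma
from typing import Optional

# Leading signature bytes for formats that do carry a magic number.
MAGIC = (
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"7z\xbc\xaf\x27\x1c", "7z"),
)


def sniff(header: bytes) -> Optional[str]:
    """Return the format name if the header starts with a known magic."""
    for magic, name in MAGIC:
        if header.startswith(magic):
            return name
    return None


data = b"hello kernel\n" * 100
samples = {
    "gzip": gzip.compress(data),
    "bzip2": bz2.compress(data),
    # Legacy .lzma container: starts with encoder properties, not a signature.
    "raw .lzma": lzma.compress(data, format=lzma.FORMAT_ALONE),
}

for label, blob in samples.items():
    print(f"{label:10s} first bytes {blob[:6].hex()} -> {sniff(blob)}")
```

The .lzma stream begins with encoder parameters that vary with the settings, so there is no fixed prefix for a tool like tar to match against.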

2006-09-22 18:27:35

by H. Peter Anvin

Subject: Re: Smaller compressed kernel source tarballs?

Michael Tokarev wrote:
>
> Well, I don't see any shell code here, in /usr/bin/lzma as installed from
> debian version 4.43-2.
>
> But note that this lzma utility does not have any 'magic number' and does
> no crc checks.

Ah, right, that's a total killer.

> On the site it's said lzma(sdk) is under rewrite to support
> new format with magic number and crc checks...

That is an absolute must, IMO. I would use the gzip format as a base.

-hpa

2006-09-25 11:51:45

by Paulo Marques

Subject: Re: Smaller compressed kernel source tarballs?

H. Peter Anvin wrote:
> Michael Tokarev wrote:
>>[...]
>> On the site it's said lzma(sdk) is under rewrite to support
>> new format with magic number and crc checks...
>
> That is an absolute must, IMO. I would use the gzip format as a base.

If you're suggesting a gzip like format (but with different magic,
etc.), that's ok.

However, it has been suggested on similar threads to use the CM field of
the gzip format to introduce different compression methods.

While this is the purpose of this field, I find this to be a very bad
idea. The worst part of it is that, after "lzma gzip" files start to
proliferate, you never know if you can decompress a .gz with your
version of gunzip, which is something that you currently take for granted.

If more formats start being supported inside gzip, this only gets worse...

--
Paulo Marques - http://www.grupopie.com

"The face of a child can say it all, especially the
mouth part of the face."
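
For readers unfamiliar with the field in question: in the gzip header, the two magic bytes are followed by a one-byte compression method (CM), and 8 means deflate. The sketch below reads that byte; the function names are hypothetical, and the CM value 99 is an invented placeholder for some future method, not a registered one.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"
CM_DEFLATE = 8  # the compression method gunzip understands


def gzip_method(header: bytes) -> int:
    """Return the CM byte of a gzip header, or raise if the magic is wrong."""
    if len(header) < 3 or header[:2] != GZIP_MAGIC:
        raise ValueError("not a gzip stream")
    return header[2]


def can_gunzip(header: bytes) -> bool:
    """True only if the magic matches and the method is deflate."""
    return gzip_method(header) == CM_DEFLATE


real = gzip.compress(b"example")
# Invented header: valid gzip magic, but a made-up method number (99).
imagined = GZIP_MAGIC + bytes([99])

print("real stream:     CM =", gzip_method(real), "decodable:", can_gunzip(real))
print("imagined stream: CM =", gzip_method(imagined), "decodable:", can_gunzip(imagined))
```

A tool that checks only the first two bytes would accept both streams as gzip, which is the compatibility hazard described above.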

2006-09-25 15:48:50

by H. Peter Anvin

Subject: Re: Smaller compressed kernel source tarballs?

Paulo Marques wrote:
> H. Peter Anvin wrote:
>> Michael Tokarev wrote:
>>> [...]
>>> On the site it's said lzma(sdk) is under rewrite to support
>>> new format with magic number and crc checks...
>>
>> That is an absolute must, IMO. I would use the gzip format as a base.
>
> If you're suggesting a gzip like format (but with different magic,
> etc.), that's ok.
>
> However, it has been suggested on similar threads to use the CM field of
> the gzip format to introduce different compression methods.
>
> While this is the purpose of this field, I find this to be a very bad
> idea. The worst part of it is that, after "lzma gzip" files start to
> proliferate, you never know if you can decompress a .gz with your
> version of gunzip, which is something that you currently take for granted.
>
> If more formats start being supported inside gzip, this only gets worse...
>

Doesn't mean that one should name the files .gz.

A more significant reason to not do this is that I think there are a lot
of programs out there which only check the magic number and not the
compression format.

-hpa

2006-10-02 02:35:39

by Drew Scott Daniels

Subject: Re: Smaller compressed kernel source tarballs?

ppmd, also in Debian, had better compression than lzma. PAQ8i has even
better compression, but isn't in Debian. See the maximumcompression web
site or other archive comparison tests.

The pace of compression algorithm development is high enough that I'd
suggest that the bar be placed quite high before switching to a new
compression format that's not reverse compatible.

For those interested, I'm working on publishing a proof of concept that
can make most tarballs compress better. About 2-3% better in my tests
with bzip2/gzip on the Linux kernel source code.

Drew Daniels
Resume: http://www.boxheap.net/ddaniels/resume.html

2006-10-02 03:32:11

by Bernd Eckenfels

Subject: Re: Smaller compressed kernel source tarballs?

In article <20061002033511.GB12695@zimmer> you wrote:
> The pace of compression algorithm development is high enough that I'd
> suggest that the bar be placed quite high before switching to a new
> compression format that's not reverse compatible.
>
> For those interested, I'm working on publishing a proof of concept that
> can make most tarballs compress better. About 2-3% better in my tests
> with bzip2/gzip on the Linux kernel source code.

3% is not a high bar.

Gruss
Bernd

2006-10-02 04:07:33

by Willy Tarreau

Subject: Re: Smaller compressed kernel source tarballs?

On Sun, Oct 01, 2006 at 10:35:11PM -0500, Drew Scott Daniels wrote:
> ppmd, also in Debian, had better compression than lzma. PAQ8i has even
> better compression, but isn't in Debian. See the maximumcompression web
> site or other archive comparison tests.

Interesting. But I suspect that you have not checked the compression time.
PAQ8I for instance is between 100 and 300 times SLOWER than bzip2 to achieve
about 30% smaller ! Given that the kernel already takes a very long time to
compress with bzip2, it would take several hours to compress it with such
tools. While they're very interesting proofs of concept for compression
research, they're not suited to any real world usage !

> The pace of compression algorithm development is high enough that I'd
> suggest that the bar be placed quite high before switching to a new
> compression format that's not reverse compatible.

At least, ppmd takes the same time as bzip2 to achieve about 12% better
compression. But I don't think it justifies a switch.

> For those interested, I'm working on publishing a proof of concept that
> can make most tarballs compress better. About 2-3% better in my tests
> with bzip2/gzip on the Linux kernel source code.

A lot of improvement can be made in tar to better compress archives with a
large number of small files such as the kernel. You just have to see the
difference in archive size depending on the base directory name. If you
come up with something really interesting which does not alter the output
format nor the compression time, it might get a place in the git-tar-tree
command. But IMHO, it would be more interesting to further reduce patch
size than tarball size, since patches might be downloaded far more often.

Regards,
Willy

2006-10-02 05:26:42

by David Lang

Subject: Re: Smaller compressed kernel source tarballs?

On Mon, 2 Oct 2006, Willy Tarreau wrote:

> A lot of improvement can be made in tar to better compress archives with a
> large number of small files such as the kernel. You just have to see the
> difference in archive size depending on the base directory name. If you
> come up with something really interesting which does not alter the output
> format nor the compression time, it might get a place in the git-tar-tree
> command. But IMHO, it would be more interesting to further reduce patch
> size than tarball size, since patches might be downloaded far more often.

I just had what's probably a silly thought.

as an alternative to using tar, what about using a git pack?

create a git archive with no history, just the current files, and then pack it
with aggressive delta options.

since git uses compression on the result anyway it's unlikely to be much worse
than a tarball, and since it can use deltas across files it may even be better
(potentially enough better to cover the cost of downloading the git binaries)

this would be especially effective once git adds a 'shallow clone' capability to
then take the snapshot pack and extend it (either forward or backward as
requested by the user), but may be worth doing even without this.

thoughts?

David Lang

2006-10-02 06:21:28

by Willy Tarreau

Subject: Re: Smaller compressed kernel source tarballs?

On Sun, Oct 01, 2006 at 10:11:49PM -0700, David Lang wrote:
> On Mon, 2 Oct 2006, Willy Tarreau wrote:
>
> >A lot of improvement can be made in tar to better compress archives with a
> >large number of small files such as the kernel. You just have to see the
> >difference in archive size depending on the base directory name. If you
> >come up with something really interesting which does not alter the output
> >format nor the compression time, it might get a place in the git-tar-tree
> >command. But IMHO, it would be more interesting to further reduce patch
> >size than tarball size, since patches might be downloaded far more often.
>
> I just had what's probably a silly thought.
>
> as an alternative to using tar, what about using a git pack?

Nice idea, but I tried on 2.4 : 43 MB for git-pack vs 38 for tar.gz and
31 for tar.bz2. However, it is blazingly fast. 4 seconds vs 30 for tar.gz
(hot cache).

When speed is important, it's a clear winner. When size matters, it's not
the best solution.

Regards,
Willy

2006-10-02 15:16:40

by Phillip Susi

Subject: Re: Smaller compressed kernel source tarballs?

David Lang wrote:
> I just had what's probably a silly thought.
>
> as an alternative to using tar, what about using a git pack?
>
> create a git archive with no history, just the current files, and then
> pack it with aggressive delta options.
>

Isn't that what a patch.gz is? Diff generates the deltas and then they
are compressed. Can't get much simpler or better than that.

> since git uses compression on the result anyway it's unlikely to be much
> worse than a tarball, and since it can use deltas across files it may
> even be better (potentially enough better to cover the cost of
> downloading the git binaries)
>
> this would be especially effective once git adds a 'shallow clone'
> capability to then take the snapshot pack and extend it (either forward
> or backward as requested by the user), but may be worth doing even
> without this.
>
> thoughts?
>
> David Lang

2006-10-02 16:05:10

by David Lang

Subject: Re: Smaller compressed kernel source tarballs?

On Mon, 2 Oct 2006, Phillip Susi wrote:

> David Lang wrote:
>> I just had what's probably a silly thought.
>>
>> as an alternative to using tar, what about using a git pack?
>>
>> create a git archive with no history, just the current files, and then pack
>> it with aggressive delta options.
>>
>
> Isn't that what a patch.gz is? Diff generates the deltas and then they are
> compressed. Can't get much simpler or better than that.

not quite: a git pack includes everything you need to get the full source, whereas
a patch.gz requires that you have the prior version of the source to start with.

David Lang

2006-10-02 20:20:27

by Phillip Susi

Subject: Re: Smaller compressed kernel source tarballs?

It sounded like you were talking about a modified pack file that did NOT
contain everything you need to get the current source. You said it
would have no history and use aggressive delta compression to achieve a
smaller size than a full tarball. If the pack contains the full
previous version and the delta to the head version, then it will be
larger than the tar, not smaller.

David Lang wrote:
> On Mon, 2 Oct 2006, Phillip Susi wrote:
>
>> David Lang wrote:
>>> I just had what's probably a silly thought.
>>>
>>> as an alternative to using tar, what about using a git pack?
>>>
>>> create a git archive with no history, just the current files, and
>>> then pack it with aggressive delta options.
>>>
>>
>> Isn't that what a patch.gz is? Diff generates the deltas and then
>> they are compressed. Can't get much simpler or better than that.
>
> not quite, a git pack includes everything you need to get the full
> source, a patch.gz requires that you have the prior version of the
> source to start with.
>
> David Lang

2006-10-02 20:29:57

by David Lang

Subject: Re: Smaller compressed kernel source tarballs?

no, I was suggesting a pack file that contained _only_ the head version.

within the pack file it would delta against other files in the pack (how many
copies of the GPLv2 text exist across all files for example)
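
To illustrate the cross-file idea, here is a rough Python sketch. It is not git's delta algorithm; as a stand-in it primes zlib's deflate with another file as a preset dictionary, so text shared between the two files (such as a license notice) can be encoded as back-references. The two sample "files" and the helper name compressed_size are made up for the example.

```python
import zlib

# Hypothetical sample data: two small source files that share a license notice.
NOTICE = (
    b"/* This program is free software; you can redistribute it and/or modify\n"
    b" * it under the terms of the GNU General Public License as published by\n"
    b" * the Free Software Foundation; either version 2 of the License, or\n"
    b" * (at your option) any later version.\n"
    b" */\n"
)
file_a = NOTICE + b"int foo(void) { return 1; }\n"
file_b = NOTICE + b"int bar(void) { return 2; }\n"


def compressed_size(data, reference=None):
    """Deflate `data`; if `reference` is given, use it as a preset dictionary.

    The preset dictionary only stands in for "delta against another file".
    zlib looks at most 32 KiB back, so only the tail of a large reference helps.
    """
    if reference is None:
        comp = zlib.compressobj(level=9)
    else:
        comp = zlib.compressobj(level=9, zdict=reference)
    return len(comp.compress(data) + comp.flush())


alone = compressed_size(file_b)
primed = compressed_size(file_b, reference=file_a)
print("file_b alone:", alone, "bytes; primed with file_a:", primed, "bytes")
```

The gain depends entirely on how much content the files really share, which is the open question for a single kernel tree.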

however Willy did a test and found that the resulting pack was significantly
larger than a .tgz. I don't know what options he used, so while there's some
chance that being more aggressive in looking for deltas would result in an
improvement, the difference to make up is fairly significant.

David Lang

2006-10-02 20:35:41

by Willy Tarreau

Subject: Re: Smaller compressed kernel source tarballs?

On Mon, Oct 02, 2006 at 01:12:55PM -0700, David Lang wrote:
> no, I was suggesting a pack file that contained _only_ the head version.
>
> within the pack file it would delta against other files in the pack (how
> many copies of the GPLv2 text exist across all files for example)
>
> however Willy did a test and found that the resulting pack was
> significantly larger than a .tgz. I don't know what options he used, so
> while there's some chance that being more aggressive in looking for deltas
> would result in an improvement, the difference to make up is fairly
> significant.

no options at all, so there may be room for improvement. Also, on my
notebook, I have hardlinked all my linux directories so that each
content only appears once. I don't have the numbers right here, but
I remember that it was really useful to merge lots of different versions,
but that the net gain within one given tree was really minor, as there
are not that many identical files in one tree.
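
For anyone who wants to check this on their own tree, a minimal Python sketch follows. It is not the tool I used; it simply groups files by a SHA-256 of their content and adds up the bytes that every copy beyond the first would cost. The function names are made up for the example, and it does not notice files that are already hardlinked to each other.

```python
import hashlib
import os
from collections import defaultdict


def duplicate_groups(root):
    """Return {digest: [paths]} for file contents that occur more than once."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            by_digest[digest].append(path)
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}


def redundant_bytes(groups):
    """Bytes taken by every copy beyond the first in each duplicate group."""
    return sum(
        os.path.getsize(paths[0]) * (len(paths) - 1)
        for paths in groups.values()
    )
```

Calling redundant_bytes(duplicate_groups("linux-2.4.33")) (the directory name is only an example) gives an upper bound on what deduplication could save inside one tree.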

Regards,
Willy

2006-10-02 21:49:42

by Sean

Subject: Re: Smaller compressed kernel source tarballs?

On Mon, 2 Oct 2006 22:35:27 +0200
Willy Tarreau <[email protected]> wrote:

> On Mon, Oct 02, 2006 at 01:12:55PM -0700, David Lang wrote:
> > no, I was suggesting a pack file that contained _only_ the head version.
> >
> > within the pack file it would delta against other files in the pack (how
> > many copies of the GPLv2 text exist across all files for example)
> >
> > however Willy did a test and found that the resulting pack was
> > significantly larger than a .tgz. I don't know what options he used, so
> > while there's some chance that being more aggressive in looking for deltas
> > would result in an improvement, the difference to make up is fairly
> > significant.
>
> no options at all, so there may be room for improvement. Also, on my
> notebook, I have hardlinked all my linux directories so that each
> content only appears once. I don't have the numbers right here, but
> I remember that it was really useful to merge lots of different versions,
> but that the net gain within one given tree was really minor, as there
> are not that many identical files in one tree.

Hey Willy,

I don't really understand the objective here, but you may want to double
check your procedure, the entire 2.4 history only takes a single 41M pack
in Git for me.

Sean

2006-10-02 21:57:57

by David Lang

Subject: Re: Smaller compressed kernel source tarballs?

On Mon, 2 Oct 2006, Sean wrote:

> On Mon, 2 Oct 2006 22:35:27 +0200
> Willy Tarreau <[email protected]> wrote:
>
>> On Mon, Oct 02, 2006 at 01:12:55PM -0700, David Lang wrote:
>>> no, I was suggesting a pack file that contained _only_ the head version.
>>>
>>> within the pack file it would delta against other files in the pack (how
>>> many copies of the GPLv2 text exist across all files for example)
>>>
>>> however Willy did a test and found that the resulting pack was
>>> significantly larger than a .tgz. I don't know what options he used, so
>>> while there's some chance that being more aggressive in looking for deltas
>>> would result in an improvement, the difference to make up is fairly
>>> significant.
>>
>> no options at all, so there may be room for improvement. Also, on my
>> notebook, I have hardlinked all my linux directories so that each
>> content only appears once. I don't have the numbers right here, but
>> I remember that it was really useful to merge lots of different versions,
>> but that the net gain within one given tree was really minor, as there
>> are not that many identical files in one tree.
>
> Hey Willy,
>
> I don't really understand the objective here, but you may want to double
> check your procedure, the entire 2.4 history only takes a single 41M pack
> in Git for me.

the idea was to use a git pack instead of a .tgz or .tar.bz2 as a distribution
format from kernel.org

for example, the pack would only include the 2.6.18 kernel, no history.

once git supports shallow clones then the distributed blob could be a clone seed
that a person could download and then track changes from there forward. but
that's a future enhancement.

David Lang

2006-10-03 03:22:14

by Willy Tarreau

Subject: Re: Smaller compressed kernel source tarballs?

On Mon, Oct 02, 2006 at 05:49:38PM -0400, Sean wrote:
> On Mon, 2 Oct 2006 22:35:27 +0200
> Willy Tarreau <[email protected]> wrote:
>
> > On Mon, Oct 02, 2006 at 01:12:55PM -0700, David Lang wrote:
> > > no, I was suggesting a pack file that contained _only_ the head version.
> > >
> > > within the pack file it would delta against other files in the pack (how
> > > many copies of the GPLv2 text exist across all files for example)
> > >
> > > however Willy did a test and found that the resulting pack was
> > > significantly larger than a .tgz. I don't know what options he used, so
> > > while there's some chance that being more aggressive in looking for deltas
> > > would result in an improvement, the difference to make up is fairly
> > > significant.
> >
> > no options at all, so there may be room for improvement. Also, on my
> > notebook, I have hardlinked all my linux directories so that each
> > content only appears once. I don't have the numbers right here, but
> > I remember that it was really useful to merge lots of different versions,
> > but that the net gain within one given tree was really minor, as there
> > are not that many identical files in one tree.
>
> Hey Willy,
>
> I don't really understand the objective here, but you may want to double
> check your procedure, the entire 2.4 history only takes a single 41M pack
> in Git for me.

I'm not really surprised, as GIT history begins at 2.4.32 and recent
2.4 patches are very small. So basically, the size is about the same
for the latest 2.4 and all 2.4 history.

Willy

2006-10-03 10:29:35

by Jan Engelhardt

Subject: Re: Smaller compressed kernel source tarballs?


>> ppmd, also in Debian had better compression than lzma. PAQ8i has even
>> better compression, but isn't in Debian. See the maximumcompression web
>> site or other archive comparison tests.
>
>Interesting. But I suspect that you have not checked the compression time.
>PAQ8I for instance is between 100 and 300 times SLOWER than bzip2 to achieve
>about 30% smaller ! Given that the kernel already takes a very long time to
>compress with bzip2, it would take several hours to compress it with such
>tools. While they're very interesting proofs of concept for compression
>research, they're not suited to any real world usage !

There are lots of obscure compression formats that achieve somewhat
better compression at the cost of MUCH more time (neglecting they are
not too open), such as MS CAB and ACE.


Jan Engelhardt
--

2006-10-03 18:23:57

by Phillip Susi

Subject: Re: Smaller compressed kernel source tarballs?

Jan Engelhardt wrote:
> There are lots of obscure compression formats that achieve somewhat
> better compression at the cost of MUCH more time (neglecting they are
> not too open), such as MS CAB and ACE.

CAB is an archive container format, not a compression algorithm. Last
time I worked on some code to handle it, they used the standard deflate
algorithm implemented by gzip (but had the ability to support others in
the future) and could only compress 32 KB blocks. The small block size
led to poor compression.


2006-10-04 15:58:58

by Jörn Engel

Subject: Compressing pages [was: Re: Smaller compressed kernel source tarballs?]

On Tue, 3 October 2006 14:24:01 -0400, Phillip Susi wrote:
> Jan Engelhardt wrote:
> >There are lots of obscure compression formats that achieve somewhat
> >better compression at the cost of MUCH more time (neglecting they are
> >not too open), such as MS CAB and ACE.
>
> CAB is an archive container format, not a compression algorithm. Last
> time I worked on some code to handle it, they used the standard deflate
> algorithm implemented by gzip (but had the ability to support others in
> the future) and could only compress 32 KB blocks. The small block size
> led to poor compression.

Actually, compression in 4KiB blocks is a _very_ interesting
benchmark. Jffs2 works with that size for compression and other
compressed filesystems likely do the same, although possibly with
something larger like 64KiB.

And the results are completely different in that benchmark. Gzip
actually beats bzip2 hands-down on compression ratio, for example.

I used to have a script, but cannot find it anymore. Basically
something like:

while (read next 4KiB from input file) {
    compress chunk
    add compressed_size to total
}
print total
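
A runnable approximation in Python, for anyone who wants to repeat the experiment. This is not the lost script, just a sketch of the same loop. It uses the standard zlib module as a stand-in for gzip (same deflate algorithm, smaller per-chunk header) and bz2 for bzip2; the names chunked_total and compare are made up here.

```python
import bz2
import zlib

CHUNK_SIZE = 4096  # 4 KiB, the block size discussed for jffs2


def chunked_total(data, compress, chunk_size=CHUNK_SIZE):
    """Compress each chunk independently and return the summed output size."""
    total = 0
    for offset in range(0, len(data), chunk_size):
        total += len(compress(data[offset:offset + chunk_size]))
    return total


def compare(data, chunk_size=CHUNK_SIZE):
    """Return the per-chunk compressed totals for deflate (zlib) and bzip2."""
    return {
        "zlib": chunked_total(data, lambda c: zlib.compress(c, 9), chunk_size),
        "bz2": chunked_total(data, lambda c: bz2.compress(c, 9), chunk_size),
    }
```

Read the input with open(path, "rb").read() and pass the bytes to compare(); varying chunk_size (for example 65536) shows how the ranking shifts as blocks grow.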

Jörn

--
Unless something dramatically changes, by 2015 we'll be largely
wondering what all the fuss surrounding Linux was really about.
-- Rob Enderle