2003-06-20 18:45:28

by Jörn Engel

Subject: [RFC] Breaking data compatibility with userspace bzlib

Hi!

I'm almost finished with adding bzlib support to the kernel. You
could also say, I'm already 150% finished, as there is only cleanup
left to do. Since bzlib is new to the kernel, I don't see a point in
keeping the uglies in the userspace interface for the kernel as well
and have already killed most of them.

However, one of the uglies, blockSize100k, is also part of the data
format, being one field in the header. So now I have to decide
whether to kill this wart or to keep it for compatibility.

The whole interface of the bzlib was modelled after the zlib
interface. blockSize100k is supposed to imitate the zlib compression
levels; the valid range is 1 to 9 as well. The semantics are that a
block of 100k * blockSize100k is compressed at a time.

Now, the cost of the underlying BWT is O(n) in memory and O(n*ln(n))
in time. That given, I consider it odd to use a linear semantic for
blockSizeXXXX and would prefer an exponential one, as the zlib uses
here and there. blockSizeBits would then give the block size as
1 << blockSizeBits, allowing block sizes well below 100k, resulting in
lower memory consumption for some users, and well above 900k, giving
better compression ratios.


Long intro, short question: Jay O Nay?


The change would make bzlib much more useful for jffs2, assuming it is
useful for jffs2 at all. But if any kernel bzlib users have to
interface with a userspace bzlib, things will break. That might be a
good reason to postpone the change for a while and rename the whole
thing when it gets done,,,, so my personal tendency is Nay.

Jörn

PS: Kept the surplus commata as a personal reminder that 2.5.72 still
has problems with my keyboard. Should check Andrew's Must-Fix List and
add this one if not yet present.

--
The competent programmer is fully aware of the strictly limited size of
his own skull; therefore he approaches the programming task in full
humility, and among other things he avoids clever tricks like the plague.
-- Edsger W. Dijkstra


2003-06-20 18:55:58

by Jeff Garzik

Subject: Re: [RFC] Breaking data compatibility with userspace bzlib

On Fri, Jun 20, 2003 at 08:59:15PM +0200, Jörn Engel wrote:
> Now, the cost of the underlying BWT is O(n) in memory and O(n*ln(n))
> in time. That given, I consider it odd to use a linear semantic of
> blockSizeXXXX and would prefer an exponential one, as the zlib uses
> here and there. Thus blockSizeBits would now give the blockSize as
> 1 << blockSizeBits, allowing to go well below 100k, resulting in lower
> memory consumption for some and well above 900k, giving better
> compression ratios.
>
>
> Long intro, short question: Jay O Nay?

The big question is whether bzip2's better compression is actually
useful in a kernel context. Patches to do bzip2 for initrd, for
example, have been around for ages:

http://gtf.org/garzik/kernel/files/initrd-bzip2-2.2.13-2.patch.gz

But the compression and decompression overhead is _much_ larger
than gzip's. It was so huge at maximal compression that dialing back
the compression hit a point of diminishing returns rather quickly,
when compared to gzip's memory usage and compression.

I talked a bit with the bzip2 author a while ago about memory usage.
He eventually added the capability to only require small blocks
for decompression (64K IIRC?), but there was a significant loss in
compression factor.

So... even in 2003, I really don't know of many (any?) tasks which
would benefit from bzip2, considering the additional memory and
cpu overhead.

Jeff



2003-06-20 19:31:41

by Jörn Engel

Subject: Re: [RFC] Breaking data compatibility with userspace bzlib

On Fri, 20 June 2003 15:09:57 -0400, Jeff Garzik wrote:
>
> The big question is whether bzip2's better compression is actually
> useful in a kernel context. Patches to do bzip2 for initrd, for
> example, have been around for ages:
>
> http://gtf.org/garzik/kernel/files/initrd-bzip2-2.2.13-2.patch.gz
>
> But the compression and decompression overhead is _much_ larger
> than gzip's. It was so huge at maximal compression that dialing back
> the compression hit a point of diminishing returns rather quickly,
> when compared to gzip's memory usage and compression.
>
> I talked a bit with the bzip2 author a while ago about memory usage.
> He eventually added the capability to only require small blocks
> for decompression (64K IIRC?), but there was a significant loss in
> compression factor.

You are puzzling me a bit. What exactly do you consider to be the
overhead? Code size? Memory consumption? CPU time?

When it comes to compression, the combination of compressed data and
decompression code should be smaller than the uncompressed data;
everything else is secondary. The secondary stuff might still affect
some users, but it is not a general problem.
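That criterion can be made concrete with a toy check: compression only
pays off when the compressed data plus the decompression code stay
smaller than the raw data. A Python sketch, with a made-up 10 kB
budget standing in for the decompressor's code size:

```python
import zlib

DECOMPRESSOR_SIZE = 10_000  # hypothetical code size of a decompressor stub

def compression_pays_off(data):
    """True when compressed data plus the decompression code is still
    smaller than the uncompressed data."""
    return len(zlib.compress(data)) + DECOMPRESSOR_SIZE < len(data)
```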

> So... even in 2003, I really don't know of many (any?) tasks which
> would benefit from bzip2, considering the additional memory and
> cpu overhead.

That overhead will become less important with time. And if we are
going to merge bzlib anyway, we may as well do it today.

But I do agree with you that some choices made by the maintainer were
rather stupid. And one of them is the reason for this thread and
appears to be the whole point behind your argument.

Jörn

--
The cost of changing business rules is much more expensive for software
than for a secretary.
-- unknown

2003-06-20 19:36:20

by David Miller

Subject: Re: [RFC] Breaking data compatibility with userspace bzlib

From: Jörn Engel <[email protected]>
Date: Fri, 20 Jun 2003 20:59:15 +0200

The whole interface of the bzlib was modelled after the zlib
interface.

The zlib interface is actually problematic: it doesn't allow
handling of scatter-gather lists on input or output, for example.

Therefore we couldn't make the compress cryptolib interface
take scatterlist elements either, which is a huge problem.

2003-06-20 19:36:11

by David Lang

Subject: Re: [RFC] Breaking data compatibility with userspace bzlib

he is saying that the memory and CPU requirements to do the bzip
uncompression are so much larger than what is necessary to do the gzip
uncompression that the small amount of space saved is almost never worth
the cost.

David Lang


On Fri, 20 Jun 2003, Jörn Engel wrote:

> Date: Fri, 20 Jun 2003 21:45:17 +0200
> From: Jörn Engel <[email protected]>
> To: Jeff Garzik <[email protected]>
> Cc: [email protected], [email protected], [email protected],
> David Woodhouse <[email protected]>
> Subject: Re: [RFC] Breaking data compatibility with userspace bzlib
>
> On Fri, 20 June 2003 15:09:57 -0400, Jeff Garzik wrote:
> >
> > The big question is whether bzip2's better compression is actually
> > useful in a kernel context. Patches to do bzip2 for initrd, for
> > example, have been around for ages:
> >
> > http://gtf.org/garzik/kernel/files/initrd-bzip2-2.2.13-2.patch.gz
> >
> > But the compression and decompression overhead is _much_ larger
> > than gzip's. It was so huge at maximal compression that dialing back
> > the compression hit a point of diminishing returns rather quickly,
> > when compared to gzip's memory usage and compression.
> >
> > I talked a bit with the bzip2 author a while ago about memory usage.
> > He eventually added the capability to only require small blocks
> > for decompression (64K IIRC?), but there was a significant loss in
> > compression factor.
>
> You are puzzling me a bit. What exactly do you consider to be the
> overhead? Code size? Memory consumption? CPU time?
>
> When it comes to compression, the combination of compressed data and
> decompression code should be smaller than the uncompressed data;
> everything else is secondary. The secondary stuff might still affect
> some users, but it is not a general problem.
>
> > So... even in 2003, I really don't know of many (any?) tasks which
> > would benefit from bzip2, considering the additional memory and
> > cpu overhead.
>
> That overhead will become less important with time. And if we are
> going to merge bzlib anyway, we may as well do it today.
>
> But I do agree with you that some choices made by the maintainer were
> rather stupid. And one of them is the reason for this thread and
> appears to be the whole point behind your argument.
>
> Jörn
>
> --
> The cost of changing business rules is much more expensive for software
> than for a secretary.
> -- unknown

2003-06-20 19:43:07

by Jörn Engel

Subject: Re: [RFC] Breaking data compatibility with userspace bzlib

On Fri, 20 June 2003 12:45:10 -0700, David S. Miller wrote:
> From: Jörn Engel <[email protected]>
> Date: Fri, 20 Jun 2003 20:59:15 +0200
>
> The whole interface of the bzlib was modelled after the zlib
> interface.
>
> The zlib interface is actually problematic: it doesn't allow
> handling of scatter-gather lists on input or output, for example.
>
> Therefore we couldn't make the compress cryptolib interface
> take scatterlist elements either, which is a huge problem.

Is there a reason to use the zlib and nothing but the zlib for the
cryptolib? RFCs 1950 - 1952? Or would any form of compression do, in
principle at least?

In the worst case, I consider it not too hard to add a wrapper
interface to the zlib to do the scatter-gather handling. Actually
going deeply into the guts of the zlib is not good for a kernel
hacker's sanity though. The massive use of macros with magic knowledge
of the surrounding functions is - well - interesting.
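In userspace terms, the wrapper idea could look roughly like this:
walk the discontiguous segments (a toy stand-in for a kernel
scatterlist) and feed each one to a single streaming compressor. A
Python sketch of the concept only, not the kernel zlib or cryptolib
API, and ignoring all kernel-context concerns:

```python
import zlib

def compress_segments(segments, level=6):
    """Feed each discontiguous buffer segment (a toy scatterlist) into
    one streaming compressor, producing a single compressed stream."""
    comp = zlib.compressobj(level)
    out = [comp.compress(seg) for seg in segments]
    out.append(comp.flush())  # emit remaining buffered data
    return b"".join(out)
```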

Jörn

--
If you're willing to restrict the flexibility of your approach,
you can almost always do something better.
-- John Carmack

2003-06-20 19:52:13

by Jörn Engel

Subject: Re: [RFC] Breaking data compatibility with userspace bzlib

On Fri, 20 June 2003 12:48:51 -0700, David Lang wrote:
>
> he is saying that the memory and CPU requirements to do the bzip
> uncompression are so much larger than what is necessary to do the gzip
> uncompression that the small amount of space saved is almost never worth
> the cost.

Which is why kernel images usually come as .bz2 today. ;)

Read my first posting, read the source (grep for BZMALLOC, there are
just a few). It is impossible in its current state to use less than
roughly 1MB of memory, even though the algorithm doesn't impose that
restriction at all. Drop that down to 280k, the current zlib value,
and you won't see a difference in compression ratios, for jffs2 at
least.
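Those numbers are consistent with the memory formulas documented in
the bzip2 manual: roughly 400k + 8 times the block size to compress,
and 100k + 4 times the block size to decompress (2.5 times in the
slower "small" mode). A quick sketch taking those published figures as
given:

```python
def bz2_compress_mem(block_size100k):
    """Approximate compression memory per the bzip2 manual:
    400k + 8 * block size, with block size = blockSize100k * 100k."""
    return 400_000 + 8 * block_size100k * 100_000

def bz2_decompress_mem(block_size100k, small=False):
    """Approximate decompression memory: 100k + 4 * block size,
    or 100k + 2.5 * block size in the slower 'small' mode."""
    factor = 2.5 if small else 4
    return int(100_000 + factor * block_size100k * 100_000)
```

Even at blockSize100k = 1, compression already needs about 1.2 MB,
which matches the "roughly 1MB" floor mentioned above.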

This might boil down to: bzlib merge 100% done (minus testing and
sending patches), and <nifty_name>, which is based on bzlib, just
started.

Comments?

Jörn

--
Beware of bugs in the above code; I have only proved it correct, but
not tried it.
-- Donald Knuth

2003-06-20 20:14:30

by David Miller

Subject: Re: [RFC] Breaking data compatibility with userspace bzlib

From: Jörn Engel <[email protected]>
Date: Fri, 20 Jun 2003 21:56:58 +0200

On Fri, 20 June 2003 12:45:10 -0700, David S. Miller wrote:
> Therefore we couldn't make the compress cryptolib interface
> take scatterlist elements either, which is a huge problem.

Is there a reason to use the zlib and nothing but the zlib for the
cryptolib? RFCs 1950 - 1952? Or would any form of compression do, in
principle at least?

Because it's not very sensible to have 2 pieces of code that
do exactly the same thing? :-)

In the worst case, I consider it not too hard to add a wrapper
interface to the zlib to do the scatter-gather handling.

It's not that simple.

Is the wrapper called in IRQ or BH or user context? This matters
to determine what kind of kmap() operations you must use in the
wrapper.

That's merely the beginning of the problems with this kind of idea.

Don't let my statements stall your work, it was just an FYI and
something we should take care of in the 2.7.x timeframe, not now.

2003-06-20 21:39:29

by Jeff Garzik

Subject: Re: [RFC] Breaking data compatibility with userspace bzlib

Jörn Engel wrote:
> On Fri, 20 June 2003 12:48:51 -0700, David Lang wrote:
>>he is saying that the memory and CPU requirements to do the bzip
>>uncompression are so much larger then what is nessasary to do the gzip
>>uncompression that the small amount of space saved is almost never worth
>>the cost.

exactly


> just a few). It is impossible in its current state to use less than
> roughly 1MB of memory, even though the algorithm doesn't give that
> restriction at all. Drop that down to 280k, the current zlib value
> and you won't see a difference in compression ratios for jffs2 at
> least.

Well, if you drop that down to 280k, and you do not beat zlib
compression by a "comfortable margin", it's just pointless code churn in
addition to swapping out the faster algorithm for the slower algorithm.

If you drop that down to 280k with much better compression than zlib,
that's fantastic and useful, and people won't mind the slower algorithm :)

The issue of memory usage at high compression rates was the main reason
why I didn't push the patch upstream and pursue bzlib further.

Jeff



2003-06-20 21:42:45

by Jeff Garzik

Subject: Re: [RFC] Breaking data compatibility with userspace bzlib

Jeff Garzik wrote:
> If you drop that down to 280k with much better compression than zlib,
> that's fantastic and useful, and people won't mind the slower algorithm :)

Just so people don't misinterpret this, I'm just talking about the
_addition_ of bzlib as an option. zlib isn't going anywhere.

Jeff