2004-04-23 17:23:43

by Timothy Miller

[permalink] [raw]
Subject: File system compression, not at the block layer

This is probably just another of my silly "they already thought of that
and someone is doing exactly this" ideas.

I get the impression that a lot of people interested in doing FS
compression want to do it at the block layer. This gets complicated,
because you need to allocate partial physical blocks.

Well, why not do the compression at the highest layer?

The idea is something akin to changing this (syntax variation intentional):

tar cf - somefiles* > file

To this:

tar cf - somefiles* | gzip > file

Except doing it transparently and for all files.

This way, the disk cache is all compressed data, and only decompressed
as it's read or written by a process.

For files below a certain size, this is obviously pointless, since you
can't save any space. But in many cases, this could speed up the I/O
for large files that are compressible. (Space is cheap. The only
reason to compress is for speed.)
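
A user-space sketch of what "transparently" could look like (a toy in
Python, with hypothetical names and paths; a real implementation would
live in the VFS and page cache, not in a library): whole files are stored
gzip-compressed and decompressed only when a process actually reads them.

import gzip

def write_compressed(path, data):
    # What the kernel would do transparently at write-out time.
    with gzip.open(path, "wb") as f:
        f.write(data)

def read_decompressed(path):
    # What the kernel would do transparently at read-in time.
    with gzip.open(path, "rb") as f:
        return f.read()

if __name__ == "__main__":
    payload = b"some highly compressible text\n" * 10000
    write_compressed("/tmp/somefile", payload)
    assert read_decompressed("/tmp/somefile") == payload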


2004-04-23 17:30:22

by Miquel van Smoorenburg

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

In article <[email protected]>,
Timothy Miller <[email protected]> wrote:
>Well, why not do the compression at the highest layer?
>[...] doing it transparently and for all files.

http://e2compr.sourceforge.net/

Mike.

2004-04-23 17:41:58

by Theodore Ts'o

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

On Fri, Apr 23, 2004 at 05:30:21PM +0000, Miquel van Smoorenburg wrote:
> In article <[email protected]>,
> Timothy Miller <[email protected]> wrote:
> >Well, why not do the compression at the highest layer?
> >[...] doing it transparently and for all files.
>
> http://e2compr.sourceforge.net/

It's been done (see the above URL), but given how cheap disk space has
gotten, and how CPU speed has grown much more quickly than disk access
speed has, many/most people have not been interested in trading off
performance for space. As a result, there are race conditions in
e2compr (which is why it never got merged into mainline), and there
hasn't been sufficient interest to either (a) forward-port e2compr to
more recent kernel revisions, or (b) find and fix the race conditions.

- Ted


2004-04-23 17:58:21

by Jörn Engel

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

On Fri, 23 April 2004 13:41:47 -0400, Theodore Ts'o wrote:
>
> It's been done (see the above URL), but given how cheap disk space has
> gotten, and how the speed of CPU has gotten faster much more quickly
> than disk access has, many/most people have not be interested in
> trading off performance for space.

Also, most disk space today is filled by data that is already
compressed.

Jörn

--
Ninety percent of everything is crap.
-- Sturgeon's Law

2004-04-23 18:11:31

by Timothy Miller

[permalink] [raw]
Subject: Re: File system compression, not at the block layer



Theodore Ts'o wrote:
> On Fri, Apr 23, 2004 at 05:30:21PM +0000, Miquel van Smoorenburg wrote:
>
>>In article <[email protected]>,
>>Timothy Miller <[email protected]> wrote:
>>
>>>Well, why not do the compression at the highest layer?
>>>[...] doing it transparently and for all files.
>>
>>http://e2compr.sourceforge.net/
>
>
> It's been done (see the above URL), but given how cheap disk space has
> gotten, and how the speed of CPU has gotten faster much more quickly
> than disk access has, many/most people have not be interested in
> trading off performance for space. As a result, there are race
> conditions in e2compr (which is why it never got merged into
> mainline), and there hasn't been sufficient interest to either (a)
> forward port e2compr to more recent kernels revisions, or (b) find and
> fix the race conditions.

Well, performance has been my only interest. Aside from the embedded
space (which already uses cramfs or something, right?), the only real
benefit to FS compression is the fact that it would reduce the amount of
data that you have to read from disk. If your IDE drive gives you
50MB/sec, and your file compresses by 50%, then you get 100MB/sec
reading that file.

In a private email, one gentleman (who can credit himself if he likes)
pointed out that compression doesn't reduce the number of seeks, and
since seek times dominate, the benefit of compression would diminish.
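
A back-of-the-envelope model of that objection (a sketch with assumed
numbers, not measurements: 50 MB/s media rate, 10 ms per seek, 2:1
compression). Compression halves the transfer term but leaves the seek
term untouched, so the more seek-bound the access pattern, the smaller
the win.

# Toy model: read time = seeks * seek_time + bytes_on_disk / transfer_rate.
# Assumed numbers, not measurements.
SEEK_S = 0.010       # 10 ms per seek
RATE = 50e6          # 50 MB/s off the platter
RATIO = 2.0          # 2:1 compression ratio

def read_time(file_bytes, seeks, compressed):
    on_disk = file_bytes / RATIO if compressed else file_bytes
    return seeks * SEEK_S + on_disk / RATE

for seeks in (1, 100, 1000):
    plain = read_time(100e6, seeks, compressed=False)
    comp = read_time(100e6, seeks, compressed=True)
    print(f"{seeks:5d} seeks: {plain/comp:.2f}x speedup from compression")
# With 1 seek the 100 MB file reads about 2x faster; with 1000 seeks the
# benefit shrinks toward 1x because seek time dominates.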

SO... in addition to the brilliance of AS, is there anything else that
can be done (using compression or something else) which could aid in
reducing seek time?

Nutty idea: Interleave files on the disk. So, any given file will have
its blocks allocated at, say, intervals of every 17 blocks. Make up for
the sequential performance hit with compression or something, but the
seek time to reach the beginnings of a group of files is reduced. Maybe.
Probably not, but hey. :)

Another idea is to actively fragment the disk based on access patterns.
The most frequently accessed blocks are grouped together so as to
maximize over-all throughput. The problem with this is that, well, say
boot time is critical -- booting wouldn't happen enough to get enough
attention so that its blocks get optimized (they would get dispersed as
a result of more common activities); but database access could benefit
in the long-term.

2004-04-23 18:29:55

by Paul Jackson

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

> SO... in addition to the brilliance of AS, is there anything else that
> can be done (using compression or something else) which could aid in
> reducing seek time?

Buy more disks and only use a small portion of each for all but the
most infrequently accessed data.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-04-23 20:14:28

by Joel Jaeggli

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

On Fri, 23 Apr 2004, Paul Jackson wrote:

> > SO... in addition to the brilliance of AS, is there anything else that
> > can be done (using compression or something else) which could aid in
> > reducing seek time?
>
> Buy more disks and only use a small portion of each for all but the
> most infrequently accessed data.

faster drives. The biggest disks at this point are far slower than the
fastest... the average read service time on a maxtor atlas 15k is about
5.7ms; on a 250GB western digital sata it's 14.1ms, so more than twice as
many reads can be executed on the fastest disks you can buy now... of
course then you pay for it in cost, heat, density, and controller costs.
everything is a tradeoff though.

>

--
--------------------------------------------------------------------------
Joel Jaeggli Unix Consulting [email protected]
GPG Key Fingerprint: 5C6E 0104 BAF0 40B0 5BD3 C38B F000 35AB B67F 56B2


2004-04-23 20:34:54

by Richard B. Johnson

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

On Fri, 23 Apr 2004, Joel Jaeggli wrote:

> On Fri, 23 Apr 2004, Paul Jackson wrote:
>
> > > SO... in addition to the brilliance of AS, is there anything else that
> > > can be done (using compression or something else) which could aid in
> > > reducing seek time?
> >
> > Buy more disks and only use a small portion of each for all but the
> > most infrequently accessed data.
>
> faster drives. The biggest disks at this point are far slower that the
> fastest... the average read service time on a maxtor atlas 15k is like
> 5.7ms on 250GB western digital sata, 14.1ms, so that more than twice as
> many reads can be executed on the fastest disks you can buy now... of
> course then you pay for it in cost, heat, density, and controller costs.
> everthing is a tradeoff though.
>

If you want to have fast disks, then you should do what I
suggested to Digital 20 years ago when they had ST-506
interfaces and SCSI was available only from third-parties.
It was called "striping" (I'm serious!). Not the so-called
RAID crap that took the original idea and destroyed it.
If you have 32 bits, you design an interface board for 32
disks. The interface board stripes the data bit by bit, sending one
bit to each disk. That makes the whole array 32 times faster
than a single drive and, of course, 32 times larger.

There is no redundancy in such an array, just brute-force
speed. One can add additional bits and CRC correction which
would allow the failure (or removal) of one drive at a time.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.26 on an i686 machine (5557.45 BogoMips).
Note 96.31% of all statistics are fiction.
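
A toy illustration of the bit-striping scheme described above (pure
simulation in Python, unrelated to any real controller): bit i of every
32-bit word lands on disk i, so each disk holds 1/32nd of the data and
all 32 transfer in parallel.

# Toy bit-striping across 32 "disks": bit i of every 32-bit word lands on
# disk i. Purely illustrative; a real controller would do this in hardware.
NDISKS = 32

def stripe(words):
    disks = [[] for _ in range(NDISKS)]
    for w in words:
        for i in range(NDISKS):
            disks[i].append((w >> i) & 1)
    return disks

def unstripe(disks):
    words = []
    for j in range(len(disks[0])):
        w = 0
        for i in range(NDISKS):
            w |= disks[i][j] << i
        words.append(w)
    return words

data = [0xDEADBEEF, 0x12345678, 0xFFFFFFFF, 0]
assert unstripe(stripe(data)) == data
# Each disk stores 1/32nd of the data and all 32 transfer in parallel,
# which is where the "32 times faster" streaming claim comes from.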


2004-04-23 20:44:11

by Måns Rullgård

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

"Richard B. Johnson" <[email protected]> writes:

> On Fri, 23 Apr 2004, Joel Jaeggli wrote:
>
>> On Fri, 23 Apr 2004, Paul Jackson wrote:
>>
>> > > SO... in addition to the brilliance of AS, is there anything else that
>> > > can be done (using compression or something else) which could aid in
>> > > reducing seek time?
>> >
>> > Buy more disks and only use a small portion of each for all but the
>> > most infrequently accessed data.
>>
>> faster drives. The biggest disks at this point are far slower that the
>> fastest... the average read service time on a maxtor atlas 15k is like
>> 5.7ms on 250GB western digital sata, 14.1ms, so that more than twice as
>> many reads can be executed on the fastest disks you can buy now... of
>> course then you pay for it in cost, heat, density, and controller costs.
>> everthing is a tradeoff though.
>>
>
> If you want to have fast disks, then you should do what I
> suggested to Digital 20 years ago when they had ST-506
> interfaces and SCSI was available only from third-parties.
> It was called "striping" (I'm serious!). Not the so-called
> RAID crap that took the original idea and destroyed it.
> If you have 32-bits, you design an interface board for 32
> disks. The interface board strips each bit to the data that
> each disk gets. That makes the whole array 32 times faster
> than a single drive and, of course, 32 times larger.

For best performance, the spindles should be synchronized too. This
might be tricky with disks not intended for such operation, of course.

--
Måns Rullgård
[email protected]

2004-04-23 20:58:32

by Richard B. Johnson

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

On Fri, 23 Apr 2004, Måns Rullgård wrote:

> "Richard B. Johnson" <[email protected]> writes:
>
> > On Fri, 23 Apr 2004, Joel Jaeggli wrote:
> >
> >> On Fri, 23 Apr 2004, Paul Jackson wrote:
> >>
> >> > > SO... in addition to the brilliance of AS, is there anything else that
> >> > > can be done (using compression or something else) which could aid in
> >> > > reducing seek time?
> >> >
> >> > Buy more disks and only use a small portion of each for all but the
> >> > most infrequently accessed data.
> >>
> >> faster drives. The biggest disks at this point are far slower that the
> >> fastest... the average read service time on a maxtor atlas 15k is like
> >> 5.7ms on 250GB western digital sata, 14.1ms, so that more than twice as
> >> many reads can be executed on the fastest disks you can buy now... of
> >> course then you pay for it in cost, heat, density, and controller costs.
> >> everthing is a tradeoff though.
> >>
> >
> > If you want to have fast disks, then you should do what I
> > suggested to Digital 20 years ago when they had ST-506
> > interfaces and SCSI was available only from third-parties.
> > It was called "striping" (I'm serious!). Not the so-called
> > RAID crap that took the original idea and destroyed it.
> > If you have 32-bits, you design an interface board for 32
> > disks. The interface board strips each bit to the data that
> > each disk gets. That makes the whole array 32 times faster
> > than a single drive and, of course, 32 times larger.
>
> For best performance, the spindles should be synchronized too. This
> might be tricky with disks not intended for such operation, of course.

Actually not. You need a FIFO to cache your bits into buffers of bytes
anyway. Depending upon the length of the FIFO, you can "rubber-band" a
lot of rotational latency. When you are dealing with a lot of drives,
you are never going to have all the write currents turn on at the same
time anyway because they are (very) soft-sectored, i.e., block
replacement, etc.

Your argument was used to shout down the idea. Actually, I think
it was lost in the NIH syndrome anyway.

>
> --
> Måns Rullgård
> [email protected]
>


Cheers,
Dick Johnson
Penguin : Linux version 2.4.26 on an i686 machine (5557.45 BogoMips).
Note 96.31% of all statistics are fiction.


2004-04-23 21:12:16

by Timothy Miller

[permalink] [raw]
Subject: Re: File system compression, not at the block layer



Joel Jaeggli wrote:
> On Fri, 23 Apr 2004, Paul Jackson wrote:
>
>
>>>SO... in addition to the brilliance of AS, is there anything else that
>>>can be done (using compression or something else) which could aid in
>>>reducing seek time?
>>
>>Buy more disks and only use a small portion of each for all but the
>>most infrequently accessed data.
>
>
> faster drives. The biggest disks at this point are far slower that the
> fastest... the average read service time on a maxtor atlas 15k is like
> 5.7ms on 250GB western digital sata, 14.1ms, so that more than twice as
> many reads can be executed on the fastest disks you can buy now... of
> course then you pay for it in cost, heat, density, and controller costs.
> everthing is a tradeoff though.


I had this idea of packing a bunch of those really tiny Toshiba
quarter-sized drives and some sort of RAID0 controller into a box the
size of a 3.5" hard drive.


2004-04-23 21:15:47

by Timothy Miller

[permalink] [raw]
Subject: Re: File system compression, not at the block layer



Richard B. Johnson wrote:

>
> Actually not. You need a FIFO to cache your bits into buffers of bytes
> anyway. Depending upon the length of the FIFO, you can "rubber-band" a
> lot of rotational latency. When you are dealing with a lot of drives,
> you are never going to have all the write currents turn on at the same
> time anyway because they are (very) soft-sectored, i.e., block
> replacement, etc.
>
> Your argument was used to shout down the idea. Actually, I think
> it was lost in the NIH syndrome anyway.
>


In a drive with multiple platters and therefore multiple heads, you
could read/write from all heads simultaneously. Or is that how they
already do it?


2004-04-23 21:15:08

by Ben Greear

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

Richard B. Johnson wrote:

> Actually not. You need a FIFO to cache your bits into buffers of bytes
> anyway. Depending upon the length of the FIFO, you can "rubber-band" a
> lot of rotational latency. When you are dealing with a lot of drives,
> you are never going to have all the write currents turn on at the same
> time anyway because they are (very) soft-sectored, i.e., block
> replacement, etc.

Wouldn't this pretty much guarantee the worst-case latency scenario for
reading, since on average at least one of your 32 disks is going to
require a full rotation (and probably a seek) to find its bit?

Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2004-04-23 21:23:03

by Timothy Miller

[permalink] [raw]
Subject: Re: File system compression, not at the block layer



Ben Greear wrote:
> Richard B. Johnson wrote:
>
>> Actually not. You need a FIFO to cache your bits into buffers of bytes
>> anyway. Depending upon the length of the FIFO, you can "rubber-band" a
>> lot of rotational latency. When you are dealing with a lot of drives,
>> you are never going to have all the write currents turn on at the same
>> time anyway because they are (very) soft-sectored, i.e., block
>> replacement, etc.
>
>
> Wouldn't this pretty much guarantee worst-case latency scenario for
> reading, since
> on average at least one of your 32 disks is going to require a full
> rotation
> (and probably a seek) to find it's bit?


Only for the first bit of a block. For large streams of reads, the
FIFOs will keep things going, except occasionally, as drives drift in
their relative rotational positions, which can cause some delays.


2004-04-23 21:31:53

by Joel Jaeggli

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

On Fri, 23 Apr 2004, Richard B. Johnson wrote:
>
> If you want to have fast disks, then you should do what I
> suggested to Digital 20 years ago when they had ST-506
> interfaces and SCSI was available only from third-parties.
> It was called "striping" (I'm serious!). Not the so-called
> RAID crap that took the original idea and destroyed it.
> If you have 32-bits, you design an interface board for 32
> disks. The interface board strips each bit to the data that
> each disk gets. That makes the whole array 32 times faster
> than a single drive and, of course, 32 times larger.
>
> There is no redundancy in such an array, just brute-force
> speed. One can add additional bits and CRC correction which
> would allow the failure (or removal) of one drive at a time.

except disks no longer encode one bit at a time (with PRML), and you're
still serializing requests across all the spindles instead of dividing
requests between spindles... it's pretty clear that in the foreseeable
future capacity growth will continue to far outstrip access speed in
spinning magnetic media. I would agree that any serious improvement is
likely to come from more creatively arranging the data at the block or
filesystem level, netapp's log-structured raid4 being one direction to
head...

> Cheers,
> Dick Johnson
> Penguin : Linux version 2.4.26 on an i686 machine (5557.45 BogoMips).
> Note 96.31% of all statistics are fiction.
>
>

--
--------------------------------------------------------------------------
Joel Jaeggli Unix Consulting [email protected]
GPG Key Fingerprint: 5C6E 0104 BAF0 40B0 5BD3 C38B F000 35AB B67F 56B2


2004-04-23 21:36:14

by Joel Jaeggli

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

On Fri, 23 Apr 2004, Timothy Miller wrote:

>
>
> Joel Jaeggli wrote:
> > On Fri, 23 Apr 2004, Paul Jackson wrote:
> >
> >
> >>>SO... in addition to the brilliance of AS, is there anything else that
> >>>can be done (using compression or something else) which could aid in
> >>>reducing seek time?
> >>
> >>Buy more disks and only use a small portion of each for all but the
> >>most infrequently accessed data.
> >
> >
> > faster drives. The biggest disks at this point are far slower that the
> > fastest... the average read service time on a maxtor atlas 15k is like
> > 5.7ms on 250GB western digital sata, 14.1ms, so that more than twice as
> > many reads can be executed on the fastest disks you can buy now... of
> > course then you pay for it in cost, heat, density, and controller costs.
> > everthing is a tradeoff though.
>
>
> I had this idea of packing a bunch of those really tiny Toshiba
> quarter-sized drives and some sort of RAID0 controller into a box the
> size of a 3.5" hard drive.

they're deathly slow... I'll send you an hdparm from a 4GB ibm microdrive
the next time I have it mounted...

>

--
--------------------------------------------------------------------------
Joel Jaeggli Unix Consulting [email protected]
GPG Key Fingerprint: 5C6E 0104 BAF0 40B0 5BD3 C38B F000 35AB B67F 56B2


2004-04-23 22:17:26

by Ian Stirling

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

Joel Jaeggli wrote:
> On Fri, 23 Apr 2004, Richard B. Johnson wrote:
>
>>If you want to have fast disks, then you should do what I
>>suggested to Digital 20 years ago when they had ST-506
>>interfaces and SCSI was available only from third-parties.


> except disks no longer encode one bit at a time (with prml), and you're
> still serializing requests across all the spindles instead of dividing
> requests between spindles... it's pretty clear that in the forseeable
> future capacity grown will continue to far outstrip access speed in
> spinning magnetic media. I would agree that any serious improvement is


I happened to do some sums about a week ago.

My first drive was an ST225R, which was 60M, 3600 RPM, and the whole
drive could be read in 2 or 3 minutes.
My new 160G drive is 7200RPM, and reads in around 50 mins.

It's not a complete coincidence that sqrt(160/.06) is about 50, and the number
of revs to read the drive is pretty much dead on 50 times.

The areal density of disk drives tends to go up both by adding more tracks, and
by squeezing the data into each track more densely.

While you can speed up the disk maybe 5 times if you are willing to pay
the price, the increasing number of tracks means that you're still going
to need lots more revs to read the drive.
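
Spelling out those sums (same assumed figures: roughly 60 MB read in 2
or 3 minutes at 3600 RPM, versus 160 GB at 7200 RPM): if areal density
grows equally from track count and from bits per track, the revolutions
needed to read the whole drive grow roughly with the square root of the
capacity ratio.

import math

# Assumed figures from the post: ~60 MB read in ~2.5 min at 3600 RPM,
# versus a 160 GB drive at 7200 RPM.
old_gb, new_gb = 0.06, 160.0
capacity_ratio = new_gb / old_gb                 # ~2667x more data

# If areal density grows equally from track count and bits per track,
# track count (and so revolutions to read everything) grows ~sqrt(ratio).
rev_ratio = math.sqrt(capacity_ratio)            # ~52x more revolutions
time_ratio = rev_ratio * (3600 / 7200)           # spindle is 2x faster
print(f"~{rev_ratio:.0f}x the revolutions, ~{time_ratio:.0f}x the wall time")
print(f"~{2.5 * time_ratio:.0f} minutes to read the whole new drive")
# Roughly 26x the time, i.e. around an hour, which is the same ballpark
# as the observed 50 minutes.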

2004-04-23 23:36:18

by Paul Jackson

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

> If you want to have fast disks, then you should do what I
> suggested to Digital 20 years ago when they had ST-506
> interfaces and SCSI was available only from third-parties.
> It was called "striping" (I'm serious!).

That gets your bandwidth up, but does nothing for latency.

Depending on your workload, that may or may not be critical.

As a former SGI employee noted:

"Money can buy bandwidth, but latency is forever" -- John Mashey

To get latency down, you need fast rotating disks and short strokes
(waste most of the disk on little used data, or on nothing at all).
And even that won't get you much faster than 20 years ago.

That, or lots of main memory, or if the data is pretty much
read-only, perhaps some complicated data duplication.

But we're not in such bad shape there - folks have been dealing
with that speed difference for at least 20 years ;).

It's the speed difference between the processor and main memory
that's more challenging now - as it approaches speed differences
we once saw between processor and disk.

To heck with disk compression - it's time for main memory compression.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-04-24 02:25:05

by Tom Vier

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

On Fri, Apr 23, 2004 at 05:18:44PM -0400, Timothy Miller wrote:
> In a drive with multiple platters and therefore multiple heads, you
> could read/write from all heads simultaneously. Or is that how they
> already do it?

fwih, there was once a drive that did this. the problem is track alignment.
these days, you'd need separate motors for each head.

--
Tom Vier <[email protected]>
DSA Key ID 0x15741ECE

2004-04-24 04:58:15

by Ben Greear

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

Timothy Miller wrote:

>> Wouldn't this pretty much guarantee worst-case latency scenario for
>> reading, since
>> on average at least one of your 32 disks is going to require a full
>> rotation
>> (and probably a seek) to find it's bit?
>
>
>
> Only for the first bit of a block. For large streams of reads, the
> fifos will keep things going, except for occasionally as drives drift in
> their relative rotation positions which can cause some delays.

So how is that better than using a striping raid that stripes at the
block level or multi-block level?

Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2004-04-24 07:38:46

by Willy Tarreau

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

On Fri, Apr 23, 2004 at 10:24:58PM -0400, Tom Vier wrote:
> On Fri, Apr 23, 2004 at 05:18:44PM -0400, Timothy Miller wrote:
> > In a drive with multiple platters and therefore multiple heads, you
> > could read/write from all heads simultaneously. Or is that how they
> > already do it?
>
> fwih, there was once a drive that did this. the problem is track alignment.
> these days, you'd need seperate motors for each head.

I think they now all do it. Haven't you noticed that drives with many
platters are always faster than their cousins with fewer platters? And
I'm not talking about access time, but about sequential reads.

Willy

2004-04-24 16:01:59

by Eric D. Mudama

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

On Sat, Apr 24 at 9:36, Willy Tarreau wrote:
>On Fri, Apr 23, 2004 at 10:24:58PM -0400, Tom Vier wrote:
>> On Fri, Apr 23, 2004 at 05:18:44PM -0400, Timothy Miller wrote:
>> > In a drive with multiple platters and therefore multiple heads, you
>> > could read/write from all heads simultaneously. Or is that how they
>> > already do it?
>>
>> fwih, there was once a drive that did this. the problem is track alignment.
>> these days, you'd need seperate motors for each head.
>
>I think they now all do it. Haven't you noticed that drives with many
>platters are always faster than their cousins with fewer platters ? And
>I don't speak about access time, but about sequential reads.

Only one read/write element can be active at one time in a modern disk
drive. The issue is that while the drive's headstack was originally
in alignment, all sorts of factors can cause it to fall out of
alignment. If that occurs, the heads might not line up with each
other, meaning that when you used to line up with A1 and B1 (side A,
cylinder 1) your two heads now align with A1 and B40.

Every surface has embedded servo information, which allows the drive
to work around mechanical variability and handling damage. The
difference in position between adjacent heads in a drive factors into
a parameter called "head switch skew". Head switch skew is "how long
does it take us to seek to the next sequential LBA after reading the
last LBA on a track/head?" Track-to-track skew is how long to seek
and settle on the adjacent track on the same head.

These two parameters are used to generate the drive's format, which in
turn accounts for the sequential throughput. (Higher skews mean a lower
usage duty cycle, which means lower overall throughput.) If the skews
are set too low, the drive blows revs because it can't settle in time
for the LBA it needs to read.
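
A toy model of the "higher skews, lower duty cycle" point (made-up
numbers, not figures for any real drive): sequential throughput is
roughly one track of data per revolution plus whatever skew is spent
switching to the next head or track.

# Toy sequential-throughput model: after reading a full track, the drive
# spends `skew` settling on the next head or track before data flows again.
# All numbers are made up for illustration.
RPM = 10_000
REV_S = 60.0 / RPM          # 6 ms per revolution
TRACK_BYTES = 512 * 1024    # assume 512 KB per track

def throughput(skew_s):
    return TRACK_BYTES / (REV_S + skew_s)

for skew_ms in (0.5, 1.0, 2.0):
    mb_s = throughput(skew_ms / 1000) / 1e6
    print(f"{skew_ms} ms skew -> {mb_s:.1f} MB/s sequential")
# Larger skews mean a smaller fraction of each rotation is spent reading,
# which is why track-to-track skew (usually the smaller one) can make a
# single-head drive marginally faster sequentially.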

In general, a drive with lots of heads will perform better on most
workloads because it doesn't have to seek as far radially to cover the
same amount of data. However, a single-headed and a multi-headed
drive of the same generation should be virtually identical in
sequential throughput... within a few percent. If anything, the
single-headed drive should be a bit faster because track-to-track
skews are typically smaller than headswitch skews.

--eric



--
Eric D. Mudama
[email protected]

2004-04-25 04:11:52

by Horst H. von Brand

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

Timothy Miller <[email protected]> said:

[...]

> In a drive with multiple platters and therefore multiple heads, you
> could read/write from all heads simultaneously. Or is that how they
> already do it?

No. Current disks have bad blocks (the bits on disk are way too small to
be able to ensure 100% are OK), and they are remapped by the drive
firmware to spare cylinders. To have the exact same blocks broken on each
surface would be a real lottery.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2004-04-25 04:10:13

by Horst H. von Brand

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

Willy Tarreau <[email protected]> said:
> On Fri, Apr 23, 2004 at 10:24:58PM -0400, Tom Vier wrote:
> > On Fri, Apr 23, 2004 at 05:18:44PM -0400, Timothy Miller wrote:
> > > In a drive with multiple platters and therefore multiple heads, you
> > > could read/write from all heads simultaneously. Or is that how they
> > > already do it?
> >
> > fwih, there was once a drive that did this. the problem is track alignment.
> > these days, you'd need seperate motors for each head.

> I think they now all do it.

No.

> Haven't you noticed that drives with many
> platters are always faster than their cousins with fewer platters ? And
> I don't speak about access time, but about sequential reads.

Have you ever wondered how they squeeze 16 or more platters into that slim
enclosure? If you take them apart, the question evaporates: There are 2 or
3 platters in them, no more. The "many platters" are an artifact of BIOS'
"disk geometry" description.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2004-04-25 04:11:58

by Horst H. von Brand

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

"Richard B. Johnson" <[email protected]> said:

[...]

> If you want to have fast disks, then you should do what I
> suggested to Digital 20 years ago when they had ST-506
> interfaces and SCSI was available only from third-parties.
> It was called "striping" (I'm serious!). Not the so-called
> RAID crap that took the original idea and destroyed it.
> If you have 32-bits, you design an interface board for 32
> disks. The interface board strips each bit to the data that
> each disk gets. That makes the whole array 32 times faster
> than a single drive and, of course, 32 times larger.

But seeks are just as slow as before... and weigh in more as sectors are
shorter (for the same visible sector size, 1/32nd). I'm not so sure this is
a win overall.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2004-04-25 07:32:16

by Willy Tarreau

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

On Sat, Apr 24, 2004 at 11:05:05PM -0400, Horst von Brand wrote:

> > Haven't you noticed that drives with many
> > platters are always faster than their cousins with fewer platters ? And
> > I don't speak about access time, but about sequential reads.
>
> Have you ever wondered how they squeeze 16 or more platters into that slim
> enclosure? If you take them apart, the question evaporates: There are 2 or
> 3 platters in them, no more. The "many platters" are an artifact of BIOS'
> "disk geometry" description.

I know, I was speaking about physical platters of course. Mark Hann told
me in private that he disagreed with me, so I checked recent disks
(36, 73, 147 GB SCSI with 1, 2, 4 platters) and he was right, they have
exactly the same spec concerning speed. But I said that I remember the
times when I regularly did this test on disks that I was integrating about
7-8 years ago, they were 2.1, 4.3, 6.4 GB (1,2,3 platters), and I'm fairly
certain that the 1-platter performed at about 5 MB/s while the 6.4 was around
12 MB/s. BTW, the 9GB SCSI I have in my PC does about 28 MB/s for 1 platter,
while its 18 GB equivalent (2 platters) does about 51. So I think that what
I observed remained true for such capacities, but changed on bigger disks
because of mechanical constraints. After all, what's 18 GB now? Less than
one twentieth of the biggest disk.

Anyway, this is off-topic, so that's my last post on LKML on the subject.

Regards,
Willy

2004-04-25 19:50:28

by Eric D. Mudama

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

On Sun, Apr 25 at 9:29, Willy Tarreau wrote:
>I know, I was speaking about physical platters of course. Mark Hann told
>me in private that he disagreed with me, so I checked recent disks
>(36, 73, 147 GB SCSI with 1, 2, 4 platters) and he was right, they have
>exactly the same spec concerning speed. But I said that I remember the
>times when I regularly did this test on disks that I was integrating about
>7-8 years ago, they were 2.1, 4.3, 6.4 GB (1,2,3 platters), and I'm fairly
>certain that the 1-platter performed at about 5 MB/s while the 6.4 was around
>12 MB/s. BTW, the 9GB SCSI I have in my PC does about 28 MB/s for 1 platter,
>while its 18 GB equivalent (2 platters) does about 51. So I think that what
>I observed remained true for such capacities, but changed on bigger disks
>because of mechanical constraints. Afterall, what's 18 GB now ? Less than
>one twentieth of the biggest disk.
>
>Anyway, this is off-topic, so that's my last post on LKML on the subject.

Let me throw in a final $.02...

Are you sure your 9GB and 18GB drives are of the same "generation" of
technology? SCSI drive platters have gotten smaller and smaller to
shorten the seek distance (they use 2.5" media now inside 3.5" drives)
for random operations, and I'm wondering if your 18GB is in fact a
generation ahead of your 9GB.

Are you sure your 9GB SCSI drive only has 1 platter in it?

--eric

--
Eric D. Mudama
[email protected]

2004-04-26 10:22:50

by Jörn Engel

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

On Fri, 23 April 2004 16:34:21 -0400, Richard B. Johnson wrote:
>
> If you want to have fast disks, then you should do what I
> suggested to Digital 20 years ago when they had ST-506
> interfaces and SCSI was available only from third-parties.
> It was called "striping" (I'm serious!). Not the so-called
> RAID crap that took the original idea and destroyed it.
> If you have 32-bits, you design an interface board for 32
> disks. The interface board strips each bit to the data that
> each disk gets. That makes the whole array 32 times faster
> than a single drive and, of course, 32 times larger.
>
> There is no redundancy in such an array, just brute-force
> speed. One can add additional bits and CRC correction which
> would allow the failure (or removal) of one drive at a time.

...and so you add latency to the ever-growing list of concepts you
publicly prove to be unaware of.

Those 32 disks now have something like 32x50MB/s or 1.6GB/s, great.
Seek time is still 10ms, though, so now each seek costs as much as
16MB of continuous data transfer. Nice. So readahead will be 64MB,
and disk cache 1GB, just to get rid of some seeks again? Sure.

If you were a little smarter and used the so-called RAID crap, you
would have stripes of about the readahead size (or more) and seeks get
spread out between disks. Sure, transfer speed will usually be lower
than 1.6GB/s, but who cares. The point is that each seek will only
cost you as much as 500kB of continuous transfer.
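
The arithmetic here, written out with the same numbers (50 MB/s per
disk, 32 disks, 10 ms per seek):

# Jörn's numbers: 32 disks at 50 MB/s each, 10 ms per seek.
per_disk = 50e6
seek_s = 0.010

# Bit-striped array: every request hits all 32 spindles, so the whole
# array seeks together, and one seek costs 1.6 GB/s * 10 ms of transfer.
array_rate = 32 * per_disk
print(f"bit-striped: a seek costs {array_rate * seek_s / 1e6:.0f} MB of transfer")

# Block-striped RAID: a request usually touches one spindle, so a seek
# only costs that one disk's 10 ms, i.e. 50 MB/s * 10 ms.
print(f"RAID stripe: a seek costs {per_disk * seek_s / 1e3:.0f} kB of transfer")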

But like so many other things, you will refuse to understand this as
well, right? Well, at least don't try to convince the unaware,
please.

Jörn

--
There's nothing better for promoting creativity in a medium than
making an audience feel "Hmm... I could do better than that!"
-- Douglas Adams in a slashdot interview

2004-04-27 15:39:27

by Timothy Miller

[permalink] [raw]
Subject: Re: File system compression, not at the block layer



Paul Jackson wrote:

>
> To heck with disk compression - it's time for main memory compression.
>

I think nVidia and ATI chips do that with the Z buffer. Definitely
improves bandwidth utilization.

2004-04-27 15:41:17

by Timothy Miller

[permalink] [raw]
Subject: Re: File system compression, not at the block layer



Tom Vier wrote:
> On Fri, Apr 23, 2004 at 05:18:44PM -0400, Timothy Miller wrote:
>
>>In a drive with multiple platters and therefore multiple heads, you
>>could read/write from all heads simultaneously. Or is that how they
>>already do it?
>
>
> fwih, there was once a drive that did this. the problem is track alignment.
> these days, you'd need seperate motors for each head.
>

Oh, yeah. Forget the separate motors. Would definately need that to
move heads independently.

The problem is track alignment. Don't drives dedicate one track on one
platter as an alignment track?

2004-04-27 15:42:52

by Timothy Miller

[permalink] [raw]
Subject: Re: File system compression, not at the block layer



Ben Greear wrote:
> Timothy Miller wrote:
>
>>> Wouldn't this pretty much guarantee worst-case latency scenario for
>>> reading, since
>>> on average at least one of your 32 disks is going to require a full
>>> rotation
>>> (and probably a seek) to find it's bit?
>>
>>
>>
>>
>> Only for the first bit of a block. For large streams of reads, the
>> fifos will keep things going, except for occasionally as drives drift
>> in their relative rotation positions which can cause some delays.
>
>
> So how is that better than using a striping raid that stripes at the
> block level or multi-block level?
>


It's only better for large streaming writes. The FIFOs I'm talking
about above would certainly be smaller than typical RAID0 stripes.

2004-04-27 16:04:45

by Jörn Engel

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

On Tue, 27 April 2004 11:42:11 -0400, Timothy Miller wrote:
> Paul Jackson wrote:
>
> >To heck with disk compression - it's time for main memory compression.
>
> I think nVidia and ATI chips do that with the Z buffer. Definately
> improves bandwidth utilization.
^^^^^^^^^

Well stated. For general purpose cpus with unpredictable access
patterns, compression makes latency even worse, so you need even
bigger caches.

On the other hand, memory compression makes memory bigger, and memory
of course is a disk cache, so it does improve latency somewhere.

Jörn

--
Victory in war is not repetitious.
-- Sun Tzu

2004-04-28 00:29:31

by Tom Vier

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

On Tue, Apr 27, 2004 at 11:43:58AM -0400, Timothy Miller wrote:
> The problem is track alignment. Don't drives dedicate one track on one
> platter as an alignment track?

it used to be that one whole platter was for servo alignment, i think. embedded
servo signals have been around for at least 7 years.

--
Tom Vier <[email protected]>
DSA Key ID 0x15741ECE

2004-04-28 01:00:34

by David Lang

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

to answer the fundamental question that was asked in this thread but not
answered.

the reason why we want to compress at the block level instead of over the
entire file is that sometimes we want to do random seeks into the middle
of the file or replace a chunk in the middle of a file (edits, inserts,
etc). by doing the compression per block, the worst that you have to do is
read that one block, decompress it and get your data out (or modify the
block, compress it and put it back on disk). if your unit of compression
is the entire file, each of these operations will require manipulating
basically the entire file (OK, for reads you can possibly stop after you
have found your data).
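
A minimal sketch of that block-by-block approach (hypothetical format:
zlib over 64 KB chunks, with a chunk list standing in for an on-disk
index): a read in the middle of the file only decompresses the chunks
that cover the requested range, not the whole file.

# Per-chunk compression with an index, so a read in the middle of the file
# only decompresses one chunk. Hypothetical on-disk format, for illustration.
import zlib

CHUNK = 64 * 1024

def compress_chunked(data):
    # The list of compressed chunks stands in for an on-disk chunk index.
    return [zlib.compress(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def read_range(chunks, offset, length):
    out = b""
    first, last = offset // CHUNK, (offset + length - 1) // CHUNK
    for idx in range(first, last + 1):  # decompress only the chunks we need
        out += zlib.decompress(chunks[idx])
    start = offset - first * CHUNK
    return out[start:start + length]

data = bytes(range(256)) * 4096         # 1 MB of sample data
chunks = compress_chunked(data)
assert read_range(chunks, 500_000, 100) == data[500_000:500_100]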

as for those who say that compression isn't useful because CPUs are so
much faster than disks, I will argue that that's exactly when compression
becomes most useful: if your CPU would otherwise be idle waiting for the
data to move to/from disk, then the compression is essentially free and
you save time overall by transferring less data through your IO
bottleneck. however, the right way to do this may be to put a compression
engine on the drive and allow the OS to request/send either compressed or
uncompressed data; that way, if the system is CPU bound or the data is
already compressed, no CPU time is spent on it, but if the data will
compress, the CPU and drive interface can compress it to ease the
bandwidth load between them.

David Lang

On Fri, 23 Apr 2004, Timothy Miller wrote:

> Date: Fri, 23 Apr 2004 13:26:38 -0400
> From: Timothy Miller <[email protected]>
> To: Linux Kernel Mailing List <[email protected]>
> Subject: File system compression, not at the block layer
>
> This is probably just another of my silly "they already thought of that
> and someone is doing exactly this" ideas.
>
> I get the impression that a lot of people interested in doing FS
> compression want to do it at the block layer. This gets complicated,
> because you need to allocate partial physical blocks.
>
> Well, why not do the compression at the highest layer?
>
> The idea is something akin to changing this (syntax variation intentional):
>
> tar cf - somefiles* > file
>
> To this:
>
> tar cf - somefiles* | gzip > file
>
> Except doing it transparently and for all files.
>
> This way, the disk cache is all compressed data, and only decompressed
> as it's read or written by a process.
>
> For files below a certain size, this is obviously pointless, since you
> can't save any space. But in many cases, this could speed up the I/O
> for large files that are compressable. (Space is cheap. The only
> reason to compress is for speed.)
>

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-04-28 10:09:31

by Jörn Engel

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

On Tue, 27 April 2004 18:00:11 -0700, David Lang wrote:
>
> to answer the fundamental question that was asked in this thread but not
> answered.
>
> the reason why we want to compress at the block level instead of over the
> entire file is that sometimes we want to do random seeks into the middle
> of the file or replace a chunk in the middle of a file (edits, inserts,
> etc). by doing the compression in a block the worst that you have to do is
> to read that one block, decompress it and get your data out (or modify the
> block, compress it and put it back on disk). if your unit of compression
> is the entire file each of these options will require manipulating basicly
> the entire file (Ok, reads you can possibly stop after you found your
> data)

*IF* your unit of compression...

If that is the complete block device, you're stupid and deserve what
you get. If it is the file, same thing. No difference.

Do it at the file system level or don't do it at all.

Jörn

--
Don't worry about people stealing your ideas. If your ideas are any good,
you'll have to ram them down people's throats.
-- Howard Aiken quoted by Ken Iverson quoted by Jim Horning quoted by
Raph Levien, 1979

2004-04-28 10:21:31

by Nikita Danilov

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

Jörn Engel writes:
> On Tue, 27 April 2004 18:00:11 -0700, David Lang wrote:
> >
> > to answer the fundamental question that was asked in this thread but not
> > answered.
> >
> > the reason why we want to compress at the block level instead of over the
> > entire file is that sometimes we want to do random seeks into the middle
> > of the file or replace a chunk in the middle of a file (edits, inserts,
> > etc). by doing the compression in a block the worst that you have to do is
> > to read that one block, decompress it and get your data out (or modify the
> > block, compress it and put it back on disk). if your unit of compression
> > is the entire file each of these options will require manipulating basicly
> > the entire file (Ok, reads you can possibly stop after you found your
> > data)
>
> *IF* your unit of compression...
>
> If that is the complete block device, you're stupid and deserve what
> you get. If it is the file, same thing. No difference.
>
> Do it at the file system level or don't do it at all.

A file system where the unit of disk space allocation is smaller than a
disk block (i.e., several files can use portions of the same disk block)
can efficiently use various "units of compression": 100 bytes, the device
block size, N blocks, etc.

>
> Jörn

Nikita.

2004-04-28 20:53:17

by Pavel Machek

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

Hi!

> > >Well, why not do the compression at the highest layer?
> > >[...] doing it transparently and for all files.
> >
> > http://e2compr.sourceforge.net/
>
> It's been done (see the above URL), but given how cheap disk space has
> gotten, and how the speed of CPU has gotten faster much more quickly
> than disk access has, many/most people have not be interested in
> trading off performance for space. As a result, there are race

Is CPU_speed / disk_throughput increasing? If so, compression
might help once again. CPU_speed / net_throughput probably is
increasing, so compressedNFS would probably make sense.
Pavel
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms
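
One way to pin down the ratio Pavel is asking about (a sketch with
assumed rates and a simplifying no-overlap assumption, not benchmarks):
compression wins when the codec can outrun the link by a factor that
depends on the compression ratio.

# Simplification: (de)compression does not overlap the transfer. Then
# compression wins when  size/ratio/link + size/codec < size/link,
# i.e.  codec_rate > link_rate * ratio / (ratio - 1).  Assumed numbers only.
def breakeven_codec_rate(link_mb_s, ratio):
    return link_mb_s * ratio / (ratio - 1)

for link in (10, 50, 100):  # MB/s of disk or network link
    need = breakeven_codec_rate(link, 2.0)
    print(f"{link:3d} MB/s link, 2:1 data -> codec must exceed {need:.0f} MB/s to win")
# As CPU (codec) throughput keeps growing faster than link throughput,
# the inequality tips back toward compressing.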

2004-04-28 22:52:55

by Timothy Miller

[permalink] [raw]
Subject: Re: File system compression, not at the block layer



Pavel Machek wrote:
> Hi!
>
>
>>>>Well, why not do the compression at the highest layer?
>>>>[...] doing it transparently and for all files.
>>>
>>>http://e2compr.sourceforge.net/
>>
>>It's been done (see the above URL), but given how cheap disk space has
>>gotten, and how the speed of CPU has gotten faster much more quickly
>>than disk access has, many/most people have not be interested in
>>trading off performance for space. As a result, there are race
>
>
> Is CPU_speed / disk_throughput increasing? If so, compression
> might help once again. CPU_speed / net_throughput probably is
> increasing, so compressedNFS would probably make sense.


I've always felt that way, but every time I mention it, people tell me
it's not worth the CPU overhead. For many years, I have felt that there
should be an IP socket type which was inherently compressed.

2004-04-29 09:47:26

by Jörn Engel

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

On Wed, 28 April 2004 18:57:08 -0400, Timothy Miller wrote:
>
> I've always felt that way, but every time I mention it, people tell me
> it's not worth the CPU overhead. For many years, I have felt that there
> should be an IP socket type which was inherently compressed.

Ever heard of ssh? ;)

Depending on speed of network and cpus involved, scp can be faster
than nfs.

Jörn

--
And spam is a useful source of entropy for /dev/random too!
-- Jasmine Strong

2004-04-29 09:52:53

by Pavel Machek

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

Hi!

> > I've always felt that way, but every time I mention it, people tell me
> > it's not worth the CPU overhead. For many years, I have felt that there
> > should be an IP socket type which was inherently compressed.
>
> Ever heard of ssh? ;)

It's too high level, and if you want compression but not encryption,
that's tricky to do.

> Depending on speed of network and cpus involved, scp can be faster
> than nfs.

Well... but that's due to nfs being broken, right?
Pavel
--
934a471f20d6580d5aad759bf0d97ddc

2004-04-29 10:10:33

by Jörn Engel

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

On Thu, 29 April 2004 11:52:37 +0200, Pavel Machek wrote:
>
> > Ever heard of ssh? ;)
>
> Its too high level, and if you want compression but not encryption
> that's tricky to do.
>
> > Depending on speed of network and cpus involved, scp can be faster
> > than nfs.
>
> Well... but that's due to nfs being broken, right?

I don't think nfs is broken because of missing compression, but yes,
the difference is by design.

Jörn

--
A victorious army first wins and then seeks battle.
-- Sun Tzu

2004-04-29 10:19:21

by Pavel Machek

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

On Thu 29-04-04 12:09:42, Jörn Engel wrote:
> On Thu, 29 April 2004 11:52:37 +0200, Pavel Machek wrote:
> >
> > > Ever heard of ssh? ;)
> >
> > Its too high level, and if you want compression but not encryption
> > that's tricky to do.
> >
> > > Depending on speed of network and cpus involved, scp can be faster
> > > than nfs.
> >
> > Well... but that's due to nfs being broken, right?
>
> I don't think nfs is broken because of missing compression, but yes,
> the difference is by design.

Well, scp is easy: scp is a linear copy of a file. nfs is a little more
tricky. I'm not talking about compression; due to various reasons
(UDP?), nfs is not always able to reach wire speed.
Pavel
--
934a471f20d6580d5aad759bf0d97ddc

2004-04-29 17:18:34

by Tim Connors

[permalink] [raw]
Subject: Re: File system compression, not at the block layer

Pavel Machek <[email protected]> said on Thu, 29 Apr 2004 11:52:37 +0200:
> Hi!
>
> > > I've always felt that way, but every time I mention it, people tell me
> > > it's not worth the CPU overhead. For many years, I have felt that there
> > > should be an IP socket type which was inherently compressed.
> >
> > Ever heard of ssh? ;)
>
> Its too high level, and if you want compression but not encryption
> that's tricky to do.

Just today we were trying to transfer ~350GB from a shell of a machine
(running Knoppix, with a very small amount of installed software, and
absolutely no disk space left) holding 4 disks to our raid disks --
the only thing installed was ssh, with even rsh being a symlink to ssh
(I was going to remove a whole bunch of packages to free up some space
so I could install rsh, but they didn't let me - it took them long
enough to get it to "work" in the first place).

Problem was that rsync combined with ssh was reading/writing at about
2MB/sec, given the age of the CPU. That will take a day more than
they have.

To put it bluntly, ssh is a *shit* solution on a secured net where
people care about performance.

--
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
White dwarf seeks red giant star for binary relationship