2003-08-11 07:37:55

by Alex Tomas

Subject: [RFC] file extents for EXT3


Attachments:
(No filename) (3.75 kB)
ext3-extents.patch (45.34 kB)

2003-08-11 12:53:47

by Jeff Garzik

Subject: Re: [RFC] file extents for EXT3

Alex Tomas wrote:
> hello all!
>
> there are several problems with old good method ext2/ext3
> use to store map of block for an inode. for example, ext3's
> truncate is quite slow. I think extents could solve this
> and some other troubles. so ...
>
>
> in fact, design is taken from htree modern ext2/ext3 uses. in contrast with
> htree, it isn't backward-compatible.

Neat. I really like extents, and think this is the best long-term
approach. Apparently the ext3 maintainers do, too, because tytso/sct's
"ext roadmap" paper published a while ago describes extents, too. (I
wish I had a URL for that)

Anyway, something to keep in mind:

Changing the underlying disk format without bumping the filesystem
revision is a hugely bad idea. I disagreed with merging htree (even
though it's backward compat) without bumping the filesystem version, too.

Vendors, distributors, OEMs, etc. all test against existing on-disk
formats, when they release their products. When the filesystem format
for an existing filesystem, in production, changes underneath them, they
tend to get worried and annoyed. So, to all ext developers,

Please add <it> to ext4 not ext3!

Jeff



2003-08-11 15:56:13

by Andreas Dilger

Subject: Re: [RFC] file extents for EXT3

On Aug 11, 2003 08:53 -0400, Jeff Garzik wrote:
> Changing the underlying disk format without bumping the filesystem
> revision is a hugely bad idea. I disagreed with merging htree (even
> though it's backward compat) without bumping the filesystem version, too.
>
> Vendors, distributors, OEMs, etc. all test against existing on-disk
> formats, when they release their products. When the filesystem format
> for an existing filesystem, in production, changes underneath them, they
> tend to get worried and annoyed. So, to all ext developers,
>
> Please add <it> to ext4 not ext3!

Ext2/3 uses feature flags instead of version numbers to indicate such
changes. Version numbers are a poor way of indicating whether a change
is compatible or not compared to feature flags. For example, if you bump
the minor number to indicate a "compatible" change it means that any
code that pretends to support version x.y features also needs to support
all features <= y and all features <= x.

If you really want to have a feature number to be happy, just think of

s.feature_incompat.s_feature_ro_compat.s_feature_compat

as something like a version number and you will nearly be happy.
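
A minimal sketch of how the three masks can act like a version check at mount time. Only the three field names come from the text above; the function name, the SUPPORTED_* masks and the bit values are made up for illustration and are not taken from the ext2/ext3 sources.

```python
# Hypothetical sketch of a feature-flag mount check. The three field names
# come from the message above; the bit values and SUPPORTED_* masks are
# invented for illustration only.

SUPPORTED_COMPAT = 0x0001     # unknown bits here can safely be ignored
SUPPORTED_RO_COMPAT = 0x0003  # unknown bits here still allow read-only mounts
SUPPORTED_INCOMPAT = 0x0002   # unknown bits here forbid mounting at all


def mount_mode(s_feature_compat, s_feature_ro_compat, s_feature_incompat):
    """Return "rw", "ro" or "refuse" for a superblock's feature masks."""
    if s_feature_incompat & ~SUPPORTED_INCOMPAT:
        return "refuse"  # an incompatible feature is not understood
    if s_feature_ro_compat & ~SUPPORTED_RO_COMPAT:
        return "ro"      # writing could damage a feature we do not know
    # Unknown bits in s_feature_compat never block a mount, so that mask
    # is deliberately not examined here.
    return "rw"


print(mount_mode(0xFFFF, 0x0000, 0x0000))  # only unknown "compat" bits set
print(mount_mode(0x0000, 0x0004, 0x0000))  # unknown "ro_compat" bit set
print(mount_mode(0x0000, 0x0000, 0x0004))  # unknown "incompat" bit set
```

Unlike a single x.y number, each bit is tested on its own, so code can support one feature without having to support every feature that was added before it.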

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2003-08-11 16:28:56

by Jeff Garzik

Subject: Re: [RFC] file extents for EXT3

Andreas Dilger wrote:
> Ext2/3 uses feature flags instead of version numbers to indicate such
> changes. Version numbers are a poor way of indicating whether a change
> is compatible or not compared to feature flags. For example, if you bump
> the minor number to indicate a "compatible" change it means that any
> code that pretends to support version x.y features also needs to support
> all features <= y and all features <= x.


What I'm talking about is more high level than that, and probably
touches on "marketing" aspects a bit:

The net effect of slowly sliding features into ext2/3 via feature flags
creates the poor situation we have today: your filesystem, and your
kernel, may or may not support the featureset you're looking for. Sure,
slowly sliding features into existing filesystems can be made to work
with compatibility flags and careful thought.

However, I argue that there should be an ext2/3 filesystem feature
freeze. And in this regard I am talking about _software_ versions, not
filesystem formats. ext4 should be where the bulk of the new work goes.
Please -- leave ext3 alone! It's still being stabilized.

Of course, the other alternative is to rename ext3 to "linuxfs", add a
"no journal at all" mode, and remove ext2. But I prefer my "ext4"
solution :)

Anyway, I am hoping that situation will be fixed, not propagated via
feature flags until the end of time as a Good Thing(tm). It is _not_
smart to create features like ACLs or htrees, and then use those
features under different versions of kernels. That strategy guarantees
your metadata will get out of sync with other metadata, in the name of
backward compatibility.

Jeff



2003-08-11 16:46:21

by Randy.Dunlap

Subject: Re: [Ext2-devel] Re: [RFC] file extents for EXT3

On Mon, 11 Aug 2003 08:53:28 -0400 Jeff Garzik wrote:

| Alex Tomas wrote:
| > hello all!
| >
| > there are several problems with old good method ext2/ext3
| > use to store map of block for an inode. for example, ext3's
| > truncate is quite slow. I think extents could solve this
| > and some other troubles. so ...
| >
| >
| > in fact, design is taken from htree modern ext2/ext3 uses. in contrast with
| > htree, it isn't backward-compatible.
|
| Neat. I really like extents, and think this is the best long-term
| approach. Apparently the ext3 maintainers do, too, because tytso/sct's
| "ext roadmap" paper published a while ago describes extents, too. (I
| wish I had a URL for that)

like this? http://www.usenix.org/publications/library/proceedings/usenix02/tech/freenix/tso.html


--
~Randy For Linux-2.6, see:
http://www.kernel.org/pub/linux/kernel/people/davej/misc/post-halloween-2.5.txt

2003-08-12 09:57:23

by Rob Landley

Subject: Re: [RFC] file extents for EXT3

On Monday 11 August 2003 12:23, Jeff Garzik wrote:

> Of course, the other alternative is to rename ext3 to "linuxfs", add a
> "no journal at all" mode, and remove ext2. But I prefer my "ext4"
> solution :)

Well, embedded developers probably like the smaller driver. Of course they
can always use minixfs. :)

Something I've wondered about for a while:

With the ability to place a journal on another block device, you could
theoretically throw the journal on a 1 megabyte ramdisk, and more or less
degrade ext3 to ext2 that way (as long as you made sure to fsck the heck out
of it on the way back up each time).

Beyond that, why is the minimum journal size 1 megabyte? (Having to waste a
megabyte of ram on a 4 megabyte filesystem is kind of annoying. And yes,
buildroot on uclibc with busybox can give you quite a lot of functionality in
4 megabytes) In theory, if the journal could be crushed down small enough,
then the ramdisk solution isn't so bad, although needing to compile in the
ramdisk and set it up is a bit clumsy, better still if the journal code could
just bounce the blocks off of a small internal ram buffer. (Personally, I'll
live with the redundant in-memory copies; still faster than the disk by a
long shot.)

Beyond THAT, ext2 could be considered ext3 with a "no journal" flag
(automatically supplied when the mount is read only, for example). Last time
I did an embedded device, I had to stick both ext3 in (for the runtime data
partition) and ext2 in (for the initrd that loopback mounted the firmware
image, which was a zisofs containing the root partition). Initramfs
addresses this particular annoyance, but still leaves a problem creating a
bootable CD that's going to install to ext3...

Having to compile two filesystems into the kernel with basically the same
on-disk layout is kind of annoying, but ext3 simply isn't a good fit for a
small ramdisk or for read-only media.

I realise that ext3 was kept separate from ext2 because ext2 should be
uber-stable, but the argument there is that people who care about keeping
their writeable data safe are intentionally not using journaling. (Meanwhile
we're completely redoing the block layer underneath them, and both the SCSI
and IDE subsystems, and raid, but all those are obviously FAR less likely to
do strange things to their data behind their back than the filesystem is...
:)

Oh well. Too late to worry about it for 2.6 anyway... :)

Rob


2003-08-12 15:17:53

by Andreas Dilger

Subject: Re: [RFC] file extents for EXT3

On Aug 12, 2003 05:33 -0400, Rob Landley wrote:
> With the ability to place a journal on another block device, you could
> theoretically throw the journal on a 1 megabyte ramdisk, and more or less
> degrade ext3 to ext2 that way (as long as you made sure to fsck the heck out
> of it on the way back up each time).

That would be a net loss over ext2, because at least when you crash an
ext2 system the filesystem will not be marked clean and e2fsck will auto
check it. There is no reason to use ext3 in such a situation except
making the system slower, less resilient to a crash, and use more RAM.
You would be far better off to just use ext2 in this case.

> Beyond that, why is the minimum journal size 1 megabyte? (Having to waste a
> megabyte of ram on a 4 megabyte filesystem is kind of annoying.

Not only would the journal itself require a 1MB ramdisk, but it could use
up to another 1MB for dirty journal buffers. Really, I can not stress it
enough that this is a terrible setup.

FYI, the reason that the journal needs to be 1MB is that the maximum
transaction size is 1/4 of the journal, and you need about 256 blocks
in a transaction to get decent "write merging" of dirty blocks in the
journal, or you will write the superblock and other commonly-dirtied
blocks out too often.

I _think_ (not to be trusted without extensive testing) that you could
make the journal as small as 3*128 blocks, but it would need some hacking
of the jbd code to set up j_max_transaction_buffers smaller, and also
e2fsck to allow you to make a smaller journal.
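
The arithmetic behind those numbers can be sketched as follows. The 1/4 ratio, the 1MB minimum and the 3*128 figure come from the text above; the 1 KiB block size is an assumption made here so that 1MB works out to 1024 blocks, and the helper names are hypothetical.

```python
# Back-of-the-envelope journal sizing. The 1/4 ratio and the 1MB minimum
# come from the message above; the 1 KiB block size is an assumption.

BLOCK_SIZE = 1024  # bytes per block (assumed, not stated in the thread)


def journal_blocks(journal_bytes, block_size=BLOCK_SIZE):
    """Number of whole blocks that fit in a journal of the given size."""
    return journal_bytes // block_size


def max_transaction_blocks(n_journal_blocks):
    """Apply the 'maximum transaction size is 1/4 of the journal' rule."""
    return n_journal_blocks // 4


one_mb = journal_blocks(1024 * 1024)
print(one_mb, max_transaction_blocks(one_mb))   # blocks, blocks per transaction

small = 3 * 128                                 # the smaller journal floated above
print(small, small * BLOCK_SIZE // 1024)        # blocks, size in KiB
print(max_transaction_blocks(small))            # what the unmodified 1/4 rule gives
```

Under these assumptions a 1MB journal yields the 256-block transaction mentioned above, while a 3*128-block journal is 384 KiB and the unmodified 1/4 rule would allow only 96 blocks per transaction.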

> Beyond THAT, ext2 could be considered ext3 with a "no journal" flag
> (automatically supplied when the mount is read only, for example). Last time
> I did an embedded device, I had to stick both ext3 in (for the runtime data
> partition) and ext2 in (for the initrd that loopback mounted the firmware
> image, which was a zisofs containing the root partition). Initramfs
> addresses this particular annoyance, but still leaves a problem creating a
> bootable CD that's going to install to ext3...

If you are interested in that, the ext3 code is _nearly_ ready to support
mounting without a journal, but it never quite was ready. Basically, you
skip the journal setup at mount time, and then in all of the journal helper
functions like ext3_journal_start() you make it a no-op if s_journal is NULL.
You would need to clear the "clean" flag again at mount.

You would still need to make some more helper functions to avoid dereferencing
handle and journal pointers in the ext3 code.
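
A toy model of that idea, written in Python rather than kernel C. None of this is ext3 code: the classes and the journal_start()/journal_stop() helpers are hypothetical stand-ins, and only the "no-op if s_journal is NULL" rule is taken from the text above.

```python
# Toy model of journal helpers that become no-ops when no journal is set up.
# Every name here is hypothetical; only the rule itself is from the thread.

class Handle:
    """Stand-in for a transaction handle."""
    def __init__(self, nblocks):
        self.nblocks = nblocks


class Journal:
    """Stand-in for a journal that just counts open handles."""
    def __init__(self):
        self.open_handles = 0

    def start(self, nblocks):
        self.open_handles += 1
        return Handle(nblocks)

    def stop(self, handle):
        self.open_handles -= 1


class SuperBlock:
    def __init__(self, journal=None):
        self.s_journal = journal  # None models a mount without a journal


def journal_start(sb, nblocks):
    """Return a handle, or None when the filesystem has no journal."""
    if sb.s_journal is None:
        return None
    return sb.s_journal.start(nblocks)


def journal_stop(sb, handle):
    """Close a handle; a None handle means there is nothing to do."""
    if handle is None:
        return
    sb.s_journal.stop(handle)


# Callers look the same with or without a journal.
for sb in (SuperBlock(), SuperBlock(Journal())):
    handle = journal_start(sb, 8)
    # ... modify metadata here ...
    journal_stop(sb, handle)
```

The point of routing everything through helpers is that callers never touch the handle or journal directly, which is where the extra helper functions mentioned above would come in.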

> Having to compile two filesystems into the kernel with basically the same
> on-disk layout is kind of annoying, but ext3 simply isn't a good fit for a
> small ramdisk or for read-only media.

Use something that is - like JFFS2 or similar?

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2003-08-12 21:08:02

by Rob Landley

Subject: Re: [RFC] file extents for EXT3

On Tuesday 12 August 2003 11:14, Andreas Dilger wrote:
> On Aug 12, 2003 05:33 -0400, Rob Landley wrote:
> > With the ability to place a journal on another block device, you could
> > theoretically throw the journal on a 1 megabyte ramdisk, and more or less
> > degrade ext3 to ext2 that way (as long as you made sure to fsck the heck
> > out of it on the way back up each time).
>
> That would be a net loss over ext2, because at least when you crash an
> ext2 system the filesystem will not be marked clean and e2fsck will auto
> check it.

Hence fscking the heck out of it on the way back up. (Or more accurately,
only ever using it on a read-only filesystem...)

> There is no reason to use ext3 in such a situation except
> making the system slower, less resilient to a crash, and use more RAM.
> You would be far better off to just use ext2 in this case.

Assuming you wanted to compile two filesystem drivers into the system with
basically the same on-disk layout, for use with an initial ramdisk or cd-rom
boot image...

> > Beyond that, why is the minimum journal size 1 megabyte? (Having to
> > waste a megabyte of ram on a 4 megabyte filesystem is kind of annoying.
>
> Not only would the journal itself require a 1MB ramdisk, but it could use
> up to another 1MB for dirty journal buffers. Really, I can not stress it
> enough that this is a terrible setup.

I know it's bad, I'm saying it's possible. :)

It's a gross kludge to get around the lack of a no-journal option by providing
it with a faux journal, patting it on the head, and sending it about its
merry way... :)

> > Beyond THAT, ext2 could be considered ext3 with a "no journal" flag
> > (automatically supplied when the mount is read only, for example). Last
> > time I did an embedded device, I had to stick both ext3 in (for the
> > runtime data partition) and ext2 in (for the initrd that loopback mounted
> > the firmware image, which was a zisofs containing the root partition).
> > Initramfs addresses this particular annoyance, but still leaves a problem
> > creating a bootable CD that's going to install to ext3...
>
> If you are interested in that, the ext3 code is _nearly_ ready to support
> mounting without a journal, but it never quite was ready. Basically, you
> skip the journal setup at mount time, and then in all of the journal helper
> functions like ext3_journal_start() you make it a no-op if s_journal is
> NULL. You would need to clear the "clean" flag again at mount.
>
> You would still need to make some more helper functions to avoid
> dereferencing handle and journal pointers in the ext3 code.

I'd need to come way the heck up to speed on the ext3 code. This evening I'm
poring over Con's changes to the scheduler. (I prefer exploring areas that
are a little less likely to eat my disk when banging away on my laptop. When
I get my scratch machines out of storage, then I'll worry about messing up
the disk. :)

> > Having to compile two filesystems into the kernel with basically the same
> > on-disk layout is kind of annoying, but ext3 simply isn't a good fit for
> > a small ramdisk or for read-only media.
>
> Use something that is - like JFFS2 or similar?

Jffs2 isn't that great a fit for a many-gigabyte hard drive partition for a
network attached storage device, either...

I had a system where ext2 needed to be compiled in to bring the system up, but
wasn't used while it was running (ext3 was), and I didn't want to go to a
modular kernel just for that. I speced out a way to hack my way around it in
a gross and disgusting fashion. The point was that I was looking for a
no-journal option for ext3 in a real world situation a few months ago, and
wouldn't have minded a performance hit to get it...

These days, initramfs may make this particular case go away. Last I heard,
ext3 still was a bad fit for read-only filesystem images, though...

Rob

2003-08-13 04:33:04

by Theodore Ts'o

Subject: Re: [Ext2-devel] Re: [RFC] file extents for EXT3

On Mon, Aug 11, 2003 at 12:23:07PM -0400, Jeff Garzik wrote:
>
> The net effect of slowly sliding features into ext2/3 via feature flags
> creates the poor situation we have today: your filesystem, and your
> kernel, may or may not support the featureset you're looking for. Sure,
> slowly sliding features into existing filesystems can be made to work
> with compatibility flags and careful thought.
>
> However, I argue that there should be an ext2/3 filesystem feature
> freeze. And in this regard I am talking about _software_ versions, not
> filesystem formats. ext4 should be where the bulk of the new work goes.
> Please -- leave ext3 alone! It's still being stabilized.

Any time you add features to a filesystem, there will potentially be
compatibility problems. In the case of htree, a lot of careful
thought was put into how to add them without causing compatibility
problems, and we succeeded.

There are at least three separate issues here, that you're conflating
into one.

The first is code stability. If we add new features, we risk possibly
destabilizing the tree. However, I'm sure any instability will be no
worse, and probably a lot better, than what people suffered when the
IDE drivers went to hell and back. Kernel development survived, even
if it was a bit inconvenienced. In addition, if we are very careful
about not modifying the old code paths when we add a new feature, even
this risk can be minimized. (And of course, we would only do this in
development kernels, and in initial test patches first!)

The second issue is one of filesystem backwards compatibility.
I disagree that it is a "poor" situation that a kernel may
not support a filesystem with new features. That's just simply life!
Whether or not you use minor versions with feature flags, which might
or might not have compatibility issues, or you do an entirely new
major number bump, the net result is still the same. For example,
there's no hope at all of using a kernel that understands only
reiserfs3 to mount a reiserfs4 filesystem.

However, in some cases we can do better, by making certain changes
which preserve read-only compatibility, or which only require a
forced update to a newer version of e2fsprogs. In the case of file
extents, certainly we won't be able to do anything but an incompatible
version bump. But this is true whether we do this via a filesystem
compatibility flag or by changing the major number in the superblock!

In any case, it will always be up to the user to decide whether or not
to enable any new feature.

> Of course, the other alternative is to rename ext3 to "linuxfs", add a
> "no journal at all" mode, and remove ext2. But I prefer my "ext4"
> solution :)

I would like to add "no journal" support to ext3, and then rename it
to ext2. At some level, the only reason why we called it ext3 was
mainly for the code stability issue. (Well, that and in case people
wanted a slightly smaller variant of ext2/3 --- but the people who
care about size issues will likely be in embedded applications, and in
those applications they will probably want to use something like jffs2
anyway.)

I really don't want to have to support n different variants of the
ext2/3/4/5/6/7 codebase. That's just silly, and it's a code
maintenance headache.

- Ted

2003-08-14 16:41:15

by James Antill

Subject: Re: [Ext2-devel] Re: [RFC] file extents for EXT3

Theodore Ts'o writes:

> I would like to add "no journal" support to ext3, and then rename it
> to ext2. At some level, the only reason why we called it ext3 was
> mainly for the code stability issue. (Well, that and in case people
> wanted a slightly smaller variant of ext2/3 --- but the people who
> care about size issues will likely be in embedded applications, and in
> those applications they will probably want to use something like jffs2
> anyway.)

I presume that this option to ext3 would also restore the ext2
behaviour for fsync()?

--
# James Antill -- [email protected]
:0:
* ^From: .*james@and\.org
/dev/null