2006-10-05 18:23:35

by Dave Kleikamp

Subject: Updated ext4/jbd2 patches based on 2.6.19-rc1

I have rebuilt the ext4/jbd2 patches against linux-2.6.19-rc1. The
patch set is located at
ftp://kernel.org/pub/linux/kernel/people/shaggy/ext4/2.6.19-rc1/ext4-patches-2.6.19-rc1.tar.gz

Broken out patches in
ftp://kernel.org/pub/linux/kernel/people/shaggy/ext4/2.6.19-rc1/ext4-patches

The patches begin with exact copies of the rc1 version of ext3 and jbd,
so there are no ext3/jbd patches currently in mainline that need to be
applied to the new code. I'll continue to watch ext3/jbd for patches
that need to be ported to ext4/jbd2.
--
David Kleikamp
IBM Linux Technology Center



2006-10-05 20:19:56

by Andrew Morton

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1

On Thu, 05 Oct 2006 13:23:30 -0500
Dave Kleikamp <[email protected]> wrote:

> I have rebuilt the ext4/jbd2 patches against linux-2.6.19-rc1. The
> patch set is located at
> ftp://kernel.org/pub/linux/kernel/people/shaggy/ext4/2.6.19-rc1/ext4-patches-2.6.19-rc1.tar.gz
>
> Broken out patches in
> ftp://kernel.org/pub/linux/kernel/people/shaggy/ext4/2.6.19-rc1/ext4-patches
>
> The patches begin with exact copies of the rc1 version of ext3 and jbd,
> so there are no ext3/jbd patches currently in mainline that need to be
> applied to the new code. I'll continue to watch ext3/jbd for patches
> that need to be ported to ext4/jbd2.

OK...

Linus, what's the best way of doing this? Will git dtrt with a patch which
copies files, or would a script which does the mkdir's and cp's be better?

2006-10-05 21:59:36

by Andrew Morton

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1

On Thu, 05 Oct 2006 13:23:30 -0500
Dave Kleikamp <[email protected]> wrote:

> I have rebuilt the ext4/jbd2 patches against linux-2.6.19-rc1. The
> patch set is located at
> ftp://kernel.org/pub/linux/kernel/people/shaggy/ext4/2.6.19-rc1/ext4-patches-2.6.19-rc1.tar.gz
>
> Broken out patches in
> ftp://kernel.org/pub/linux/kernel/people/shaggy/ext4/2.6.19-rc1/ext4-patches
>
> The patches begin with exact copies of the rc1 version of ext3 and jbd,
> so there are no ext3/jbd patches currently in mainline that need to be
> applied to the new code. I'll continue to watch ext3/jbd for patches
> that need to be ported to ext4/jbd2.

The only patch I can see at present in -mm and all the git trees is
fs-cache-provide-a-filesystem-specific-syncable-page-bit.patch, which
touches ext3. It renames PageChecked to PageFsMisc and so it'll break the
build nicely.

I'll try to remember to cc this list on jbd and ext3-affecting patches.


2006-10-05 23:25:43

by Linus Torvalds

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1



On Thu, 5 Oct 2006, Andrew Morton wrote:
>
> Linus, what's the best way of doing this? Will git dtrt with a patch which
> copies files, or would a script which does the mkdir's and cp's be better?

Git should dtrt.

In fact, if you use

git diff -C

it should generate the appropriate "file copied" things automatically, and
you don't need any huge file at all, you'll get a "patch" that looks
something like

diff --git a/fs/ext3/inode.c b/fs/ext4/inode.c
similarity index 100%
copy from fs/ext3/inode.c
copy to fs/ext4/inode.c
diff --git a/fs/ext3/super.c b/fs/ext4/super.c
similarity index 98%
copy from fs/ext3/super.c
copy to fs/ext4/super.c
index xyz..zzy 100644
--- a/fs/ext3/super.c
+++ b/fs/ext4/super.c
.. small diff that changes "ext3" to "ext4" goes here ..


ie you'll effectively get the best of both worlds: a "diff", but one that
is actually readable and shows what is going on.

I hate to beat my own drum (not really), but git really _is_ a lot better
than anything else out there ;)

Linus

2006-10-06 00:39:36

by Andrew Morton

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1

On Thu, 05 Oct 2006 13:23:30 -0500
Dave Kleikamp <[email protected]> wrote:

> I have rebuilt the ext4/jbd2 patches against linux-2.6.19-rc1. The
> patch set is located at
> ftp://kernel.org/pub/linux/kernel/people/shaggy/ext4/2.6.19-rc1/ext4-patches-2.6.19-rc1.tar.gz
>
> Broken out patches in
> ftp://kernel.org/pub/linux/kernel/people/shaggy/ext4/2.6.19-rc1/ext4-patches
>
> The patches begin with exact copies of the rc1 version of ext3 and jbd,
> so there are no ext3/jbd patches currently in mainline that need to be
> applied to the new code. I'll continue to watch ext3/jbd for patches
> that need to be ported to ext4/jbd2.

Could we please have a few nice words about ext4 for the record? Like,
what its features are, how one creates an instance, where to get the
correct userspace tools from, stability level, any known shortcomings,
issues, etc?

Thanks.

2006-10-06 03:55:31

by Andrew Morton

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1

On Thu, 05 Oct 2006 13:23:30 -0500
Dave Kleikamp <[email protected]> wrote:

> I have rebuilt the ext4/jbd2 patches against linux-2.6.19-rc1. The
> patch set is located at
> ftp://kernel.org/pub/linux/kernel/people/shaggy/ext4/2.6.19-rc1/ext4-patches-2.6.19-rc1.tar.gz
>

So let me see if I have this right.

You grab Alexandre's kit from http://www.bullopensource.org/ext4/20060926/
and a plain old `mke2fs -j' gives a filesystem which will mount as ext3 or
ext4.

If you then mount this filesystem with `-t ext4dev -o extents', it becomes
incompatible with the ext3 driver. Yes?

What else aren't we being told? ;)


2006-10-06 03:58:35

by Andrew Morton

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1

On Thu, 5 Oct 2006 20:55:26 -0700
Andrew Morton <[email protected]> wrote:

> You grab Alexandre's kit from http://www.bullopensource.org/ext4/20060926/
> and a plain old `mke2fs -j' gives a filesystem which will mount as ext3 or
> ext4.
>
> If you then mount this filesystem with `-t ext4dev -o extents', it becomes
> incompatible with the ext3 driver. Yes?

`mke2fs -O extents' doesn't work. Should it?

2006-10-06 04:31:36

by Andrew Morton

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1


If you mount the filesystem with `-t ext4dev -o extents' then create some
extenty files, then unmount it and then mount it without `-o extents', the
driver will then refuse to create extenty files.

IOW: you need to give it `-o extents' each time.

That seems fairly pointless. In fact, if I'd created the fs with `mke2fs
-O extents' (which doesn't work at present) then I'd expect it to use
extents from thereon after, without requiring `mount -o extents'.


2006-10-06 04:53:15

by Randy Dunlap

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1

On Thu, 5 Oct 2006 20:55:26 -0700 Andrew Morton wrote:

> On Thu, 05 Oct 2006 13:23:30 -0500
> Dave Kleikamp <[email protected]> wrote:
>
> > I have rebuilt the ext4/jbd2 patches against linux-2.6.19-rc1. The
> > patch set is located at
> > ftp://kernel.org/pub/linux/kernel/people/shaggy/ext4/2.6.19-rc1/ext4-patches-2.6.19-rc1.tar.gz
> >
>
> So let me see if I have this right.
>
> You grab Alexandre's kit from http://www.bullopensource.org/ext4/20060926/
> and a plain old `mke2fs -j' gives a filesystem which will mount as ext3 or
> ext4.
>
> If you then mount this filesystem with `-t ext4dev -o extents', it becomes
> incompatible with the ext3 driver. Yes?

I thought we s/ext4dev/ext4/ ??

> What else aren't we being told? ;)

---
~Randy

2006-10-06 05:05:16

by Andrew Morton

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1

On Thu, 5 Oct 2006 21:54:42 -0700
Randy Dunlap <[email protected]> wrote:

> I thought we s/ext4dev/ext4/ ??

nope.

2006-10-06 05:53:07

by Andreas Dilger

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1

On Oct 05, 2006 21:54 -0700, Randy Dunlap wrote:
> On Thu, 5 Oct 2006 20:55:26 -0700 Andrew Morton wrote:
> > If you then mount this filesystem with `-t ext4dev -o extents', it becomes
> > incompatible with the ext3 driver. Yes?
>
> I thought we s/ext4dev/ext4/ ??

No, we want to leave it at ext4dev for a while, to make it very clear
that this is still under development. We want to get the existing
patches upstream so they don't become completely unwieldy, and earlier
testing is also good, but it is not yet feature complete.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


2006-10-06 05:58:31

by Andreas Dilger

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1

On Oct 05, 2006 21:31 -0700, Andrew Morton wrote:
> If you mount the filesystem with `-t ext4dev -o extents' then create some
> extenty files, then unmount it and then mount it without `-o extents', the
> driver will then refuse to create extenty files.
>
> IOW: you need to give it `-o extents' each time.
>
> That seems fairly pointless. In fact, if I'd created the fs with `mke2fs
> -O extents' (which doesn't work at present) then I'd expect it to use
> extents from thereon after, without requiring `mount -o extents'.

I think this is an oversight. For Lustre we wanted the ability to mount
ext3 filesystems with or without extents, because different customers
have different levels of tolerance for risk. These days all of our
customers use extents (better performance in conjunction with mballoc),
but the patches have not been changed for ext4 (which should really
default to using extents on a filesystem with the INCOMPAT_EXTENT feature
set unless told otherwise). That is a necessity for filesystems larger
than 2^32 blocks, since there is no way to create old block-mapped files
past that limit.
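
The limit mentioned above can be checked with a little arithmetic. This is only a sketch, assuming 4KB blocks (the block size is not stated in this message) and using a made-up helper name, max_fs_bytes():

```c
#include <stdint.h>

/*
 * Largest addressable filesystem size in bytes for a given block-number
 * width. block_bits is 32 for old block-mapped files and 48 for extents,
 * as discussed in this thread. The caller must keep the result within
 * 64 bits (block_bits + log2(block_size) < 64).
 */
static uint64_t max_fs_bytes(unsigned block_bits, uint64_t block_size)
{
    return ((uint64_t)1 << block_bits) * block_size;
}
```

With 4096-byte blocks, max_fs_bytes(32, 4096) is 2^44 bytes (16 TiB), which matches the "> 16TB" figure quoted elsewhere in the thread, and max_fs_bytes(48, 4096) is 2^60 bytes.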

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


2006-10-06 06:04:07

by Andrew Morton

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1

On Thu, 5 Oct 2006 23:53:05 -0600
Andreas Dilger <[email protected]> wrote:

> On Oct 05, 2006 21:54 -0700, Randy Dunlap wrote:
> > On Thu, 5 Oct 2006 20:55:26 -0700 Andrew Morton wrote:
> > > If you then mount this filesystem with `-t ext4dev -o extents', it becomes
> > > incompatible with the ext3 driver. Yes?
> >
> > I thought we s/ext4dev/ext4/ ??
>
> No, we want to leave it at ext4dev for a while, to make it very clear
> that this is still under development. We want to get the existing
> patches upstream so they don't become completely unwieldy, and earlier
> testing is also good, but it is not yet feature complete.
>

What features are missing?

Heck, what features does it have now? Guys, we cannot release this thing
to the public without telling them what it is, how to use it, where to get
the tools from and what the roadmap is.


2006-10-06 06:11:02

by Andrew Morton

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1

On Thu, 5 Oct 2006 23:58:29 -0600
Andreas Dilger <[email protected]> wrote:

> but the patches have not been changed for ext4 (which should really
> default to using extents on a filesystem with the INCOMPAT_EXTENT feature
> set unless told otherwise). That is a necessity for filesystems larger
> than 2^32 blocks, since there is no way to create old block-mapped files
> past that limit.

That's news to me. So we only use 48-bit block numbers for extents and
not for old-style indirect blocks?

How much performance improvement do they get, btw? CPU or IO? I'm not noticing
any difference.

Has been a while since I did any fs testing. Boy, ext3 is beating the crap
out of ext2 for quality of file layout.

2006-10-06 06:41:05

by Andreas Dilger

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1

On Oct 05, 2006 23:04 -0700, Andrew Morton wrote:
> On Thu, 5 Oct 2006 23:53:05 -0600
> Andreas Dilger <[email protected]> wrote:
> > No, we want to leave it at ext4dev for a while, to make it very clear
> > that this is still under development. We want to get the existing
> > patches upstream so they don't become completely unwieldy, and earlier
> > testing is also good, but it is not yet feature complete.
> >
>
> What features are missing?

There are several under discussion; whether they all make it in is
partly a function of how much time everyone has to work on them:
- improved file allocation (multi-block alloc, delayed alloc; basically done)
- fix 32000 subdirectory limit (patch exists, needs some e2fsck work)
- nsec timestamps for mtime, atime, ctime, create time (patch exists,
  needs some e2fsck work)
- inode version field on disk (NFSv4, Lustre; prototype exists)
- reduced mke2fs/e2fsck time via uninitialized groups (prototype exists)
- journal checksumming for robustness, performance (prototype exists)

Features like metadata checksumming have been discussed and planned for
a bit but no patches exist yet so I'm not sure they're in the near-term
roadmap.

> Heck, what features does it have now? Guys, we cannot release this thing
> to the public without telling them what it is, how to use it, where to get
> the tools from and what the roadmap is.

Features now:
- ability to use filesystems > 16TB
- extent format reduces metadata overhead (RAM, IO for access, transactions)
- extent format more robust in face of on-disk corruption due to magics,
  internal redundancy in tree

Features soon (previously available, to be enabled by default by "mkefs.ext4"):
- dir_index and resize inode will be on by default
- large inodes will be used by default for fast EAs, nsec timestamps, etc

Other features as above patches are committed.

The big performance win will come with mballoc and delalloc. CFS has
been using mballoc for a few years already with Lustre, and IBM + Bull
did a lot of benchmarking on it. The reason it isn't in the first set of
patches is partly a manageability issue, and partly because it doesn't
directly affect the on-disk format (outside of much better allocation)
so it isn't critical to get into the first round of changes. I believe
Alex is working on a new set of patches right now.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


2006-10-06 06:48:57

by Andreas Dilger

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1

On Oct 05, 2006 23:10 -0700, Andrew Morton wrote:
> On Thu, 5 Oct 2006 23:58:29 -0600
> Andreas Dilger <[email protected]> wrote:
> > but the patches have not been changed for ext4 (which should really
> > default to using extents on a filesystem with the INCOMPAT_EXTENT feature
> > set unless told otherwise). That is a necessity for filesystems larger
> > than 2^32 blocks, since there is no way to create old block-mapped files
> > past that limit.
>
> That's news to me. So we only use 48-bit block numbers for extents and
> not for old-style indirect blocks?

Correct. The block-mapped {d,t,}indirect blocks chewed up enough space
as it was (0.1% of the file size) without doubling the block pointers.
Things like truncate hurt pretty badly because of that, as does the
increased IO load to read them and memory pressure due to keeping them
in RAM.
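
As a rough check of the 0.1% figure, here is a sketch that assumes 4KB blocks and counts only the leaf-level indirect blocks of a large file (the double and triple indirect blocks above them add a much smaller amount). The function name is made up for illustration:

```c
/*
 * Approximate fraction of a large file's size consumed by leaf indirect
 * blocks: each indirect block of block_size bytes holds
 * block_size / ptr_size block pointers, so one extra block is needed
 * per that many data blocks.
 */
static double indirect_overhead(unsigned block_size, unsigned ptr_size)
{
    unsigned ptrs_per_block = block_size / ptr_size;

    return 1.0 / ptrs_per_block;
}
```

indirect_overhead(4096, 4) gives 1/1024, about 0.1%; widening the pointers to 8 bytes, as 48-bit block numbers would need, doubles that to 1/512.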

> How much performance improvement do they get, btw? CPU or IO? I'm not
> noticing any difference.

As mentioned in my other email, the big performance win will come from
the multi-block allocation (mballoc) and delayed allocation (delalloc)
from Alex.

The mballoc patch allows a 1MB write to get a 1MB-aligned and contiguous
chunk of disk, instead of the next 256 blocks that might be free.
Having 1MB alignment is good for 10% or more on some RAID systems to
avoid writing partial stripes (which also requires a read).
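
To make the alignment point concrete, a small sketch follows. It assumes 4KB blocks, so that 1MB is 256 blocks, and a RAID stripe of the same size; both helpers are hypothetical and not taken from the mballoc patch:

```c
#include <stdint.h>

/* Round a block number up to the next multiple of chunk_blocks. */
static uint64_t align_up(uint64_t block, uint64_t chunk_blocks)
{
    return (block + chunk_blocks - 1) / chunk_blocks * chunk_blocks;
}

/*
 * Count the stripes that the run [start, start + len) covers only
 * partially. Each of these costs the array a read-modify-write.
 * len must be greater than zero.
 */
static unsigned partial_stripes(uint64_t start, uint64_t len,
                                uint64_t stripe_blocks)
{
    uint64_t end = start + len;                 /* exclusive */
    uint64_t first = start / stripe_blocks;     /* stripe holding the head */
    uint64_t last = (end - 1) / stripe_blocks;  /* stripe holding the tail */
    unsigned n = 0;

    if (start % stripe_blocks)
        n++;                                    /* ragged head */
    if (end % stripe_blocks &&
        (first != last || start % stripe_blocks == 0))
        n++;                                    /* ragged tail, not yet counted */
    return n;
}
```

A 256-block write placed at block 0 touches no partial stripe, while the same write placed at block 100 ("the next 256 blocks that might be free") touches two.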

Delalloc allows the filesystem to actually submit 1MB writes at once
without doing the block allocation in prepare_write(). Better for
picking free space, and avoids needless extent tree insertion/rebalancing.
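
The idea can be sketched as a toy model. This is not the delalloc patch; the struct and function names are invented, and the "allocator" is reduced to a call counter:

```c
#include <stdint.h>

/* Hypothetical per-file state for delayed allocation. */
struct delalloc_file {
    uint64_t pending_blocks;  /* dirty blocks with no disk location yet */
    unsigned alloc_calls;     /* how often the block allocator was asked */
};

/* Write path: only account for the blocks, do not pick disk locations. */
static void buffered_write(struct delalloc_file *f, uint64_t nblocks)
{
    f->pending_blocks += nblocks;
}

/*
 * Writeback: issue one allocation request covering everything pending.
 * Returns the number of blocks requested (0 if nothing was dirty).
 */
static uint64_t writeback_flush(struct delalloc_file *f)
{
    uint64_t n = f->pending_blocks;

    if (n) {
        f->alloc_calls++;
        f->pending_blocks = 0;
    }
    return n;
}
```

After 256 single-block buffered_write() calls, one writeback_flush() asks for all 256 blocks at once, so the allocator sees a single 1MB request rather than 256 small ones.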

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


2006-10-06 06:50:18

by Suparna Bhattacharya

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1

On Fri, Oct 06, 2006 at 12:41:03AM -0600, Andreas Dilger wrote:
> On Oct 05, 2006 23:04 -0700, Andrew Morton wrote:
> > On Thu, 5 Oct 2006 23:53:05 -0600
> > Andreas Dilger <[email protected]> wrote:
> > > No, we want to leave it at ext4dev for a while, to make it very clear
> > > that this is still under development. We want to get the existing
> > > patches upstream so they don't become completely unwieldy, and earlier
> > > testing is also good, but it is not yet feature complete.
> > >
> >
> > What features are missing?
>
> There are several under discussion, whether they all make it in is
> partly a function of how much time everyone has to work on them:
> - improved file allocation (multi-block alloc, delayed alloc; basically done)
> - fix 32000 subdirectory limit (patch exists, needs some e2fsck work)
> - nsec timestamps for mtime, atime, ctime, create time (patch exists,
> needs some e2fsck work)
> - inode version field on disk (NFSv4, Lustre; prototype exists)
> - reduced mke2fs/e2fsck time via uninitialized groups (prototype exists)
> - journal checksumming for robustness, performance (prototype exists)
>
> Features like metadata checksumming have been discussed and planned for
> a bit but no patches exist yet so I'm not sure they're in the near-term
> roadmap.

I would add persistent preallocation (of uninitialized blocks) support to the
list. Right now we have only put in support to recognize uninitialized
extents so that we can add preallocation, but will be working on developing
the actual implementation for persistent preallocation.

Regards
Suparna

>
> > Heck, what features does it have now? Guys, we cannot release this thing
> > to the public without telling them what it is, how to use it, where to get
> > the tools from and what the roadmap is.
>
> Features now:
> - ability to use filesystems > 16TB
> - extent format reduces metadata overhead (RAM, IO for access, transactions)
> - extent format more robust in face of on-disk corruption due to magics,
> internal redundancy in tree
>
> Features soon (previously available, to be enabled by default by "mkefs.ext4"):
> - dir_index and resize inode will be on by default
> - large inodes will be used by default for fast EAs, nsec timestamps, etc
>
> Other features as above patches are committed.
>
> The big performance win will come with mballoc and delalloc. CFS has
> been using mballoc for a few years already with Lustre, and IBM + Bull
> did a lot of benchmarking on it. The reason it isn't in the first set of
> patches is partly a manageability issue, and partly because it doesn't
> directly affect the on-disk format (outside of much better allocation)
> so it isn't critical to get into the first round of changes. I believe
> Alex is working on a new set of patches right now.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
>

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India


2006-10-06 06:50:23

by Andrew Morton

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1

On Fri, 6 Oct 2006 00:41:03 -0600
Andreas Dilger <[email protected]> wrote:

> The big performance win will come with mballoc and delalloc. CFS has
> been using mballoc for a few years already with Lustre, and IBM + Bull
> did a lot of benchmarking on it. The reason it isn't in the first set of
> patches is partly a manageability issue, and partly because it doesn't
> directly affect the on-disk format (outside of much better allocation)
> so it isn't critical to get into the first round of changes. I believe
> Alex is working on a new set of patches right now.

Are you sure that these things will improve allocation much? Reservations
made a big improvement there.

2006-10-06 10:30:54

by Alex Tomas

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1

>>>>> Andrew Morton (AM) writes:

AM> On Fri, 6 Oct 2006 00:41:03 -0600
AM> Andreas Dilger <[email protected]> wrote:

>> The big performance win will come with mballoc and delalloc. CFS has
>> been using mballoc for a few years already with Lustre, and IBM + Bull
>> did a lot of benchmarking on it. The reason it isn't in the first set of
>> patches is partly a manageability issue, and partly because it doesn't
>> directly affect the on-disk format (outside of much better allocation)
>> so it isn't critical to get into the first round of changes. I believe
>> Alex is working on a new set of patches right now.

AM> Are you sure that these things will improve allocation much? Reservations
AM> made a big improvement there.

it depends on underlying storage and workload. mballoc uses buddy
internally. it's much simpler and cheaper to find free 2^N blocks
compared to bitmap. this is especially important for arrays like
DDN and raid5/6 because they require stripe-aligned/-sized requests
for good throughput. also, the latest mballoc takes logical block into
account and can preallocate a few chunks at different logical offsets
for a file. imagine torrent downloading different pieces from a few peers.
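
A simplified sketch of the buddy-versus-bitmap point follows. It is not mballoc's data structure: it assumes a tiny 64-block group held in one word and keeps a per-order summary of wholly free aligned chunks, where a real buddy allocator tracks only maximal chunks. All names are made up:

```c
#include <stdint.h>

#define GROUP_BLOCKS 64   /* hypothetical tiny block group */
#define MAX_ORDER    6    /* 2^6 == GROUP_BLOCKS */

/*
 * Plain bitmap search: test every aligned run of 2^order blocks bit by
 * bit. A set bit in 'used' means the block is allocated.
 * Returns the first block of a free run, or -1.
 */
static int bitmap_find(uint64_t used, unsigned order)
{
    unsigned len = 1u << order;

    for (unsigned start = 0; start + len <= GROUP_BLOCKS; start += len) {
        unsigned i;

        for (i = 0; i < len; i++)
            if (used & ((uint64_t)1 << (start + i)))
                break;
        if (i == len)
            return (int)start;
    }
    return -1;
}

/* free_at[o] has bit c set when aligned chunk c of 2^o blocks is free. */
struct buddy {
    uint64_t free_at[MAX_ORDER + 1];
};

/* Build the summary once; each order is derived from the one below it. */
static void buddy_build(struct buddy *b, uint64_t used)
{
    b->free_at[0] = ~used;
    for (unsigned o = 1; o <= MAX_ORDER; o++) {
        uint64_t prev = b->free_at[o - 1], cur = 0;

        for (unsigned c = 0; c < (GROUP_BLOCKS >> o); c++)
            if (((prev >> (2 * c)) & 3) == 3)   /* both halves free */
                cur |= (uint64_t)1 << c;
        b->free_at[o] = cur;
    }
}

/* Lookup only inspects the summary word for the requested order. */
static int buddy_find(const struct buddy *b, unsigned order)
{
    uint64_t m = b->free_at[order];
    unsigned c = 0;

    if (!m)
        return -1;
    while (!(m & 1)) {
        m >>= 1;
        c++;
    }
    return (int)(c << order);
}
```

Both searches return the same block, but the buddy lookup never rescans individual block bits, which is where the CPU saving for 2^N requests comes from in this toy model.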

thanks, Alex

2006-10-06 10:33:51

by Alex Tomas

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1

>>>>> Andrew Morton (AM) writes:

AM> On Thu, 5 Oct 2006 20:55:26 -0700
AM> Andrew Morton <[email protected]> wrote:

>> You grab Alexandre's kit from http://www.bullopensource.org/ext4/20060926/
>> and a plain old `mke2fs -j' gives a filesystem which will mount as ext3 or
>> ext4.
>>
>> If you then mount this filesystem with `-t ext4dev -o extents', it becomes
>> incompatible with the ext3 driver. Yes?

AM> `mke2fs -O extents' doesn't work. Should it?

I believe we are keeping extents as a mount option for a while, just
for development purposes, though there is an agreement about
mke2fs -O extents, IIRC.

thanks, Alex

2006-10-06 12:21:14

by Theodore Ts'o

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1

On Thu, Oct 05, 2006 at 08:55:26PM -0700, Andrew Morton wrote:
> On Thu, 05 Oct 2006 13:23:30 -0500
> Dave Kleikamp <[email protected]> wrote:
>
> > I have rebuilt the ext4/jbd2 patches against linux-2.6.19-rc1. The
> > patch set is located at
> > ftp://kernel.org/pub/linux/kernel/people/shaggy/ext4/2.6.19-rc1/ext4-patches-2.6.19-rc1.tar.gz
> >
>
> So let me see if I have this right.
>
> You grab Alexandre's kit from http://www.bullopensource.org/ext4/20060926/
> and a plain old `mke2fs -j' gives a filesystem which will mount as ext3 or
> ext4.
>
> If you then mount this filesystem with `-t ext4dev -o extents', it becomes
> incompatible with the ext3 driver. Yes?

I agree that's the wrong behaviour, and I've always hated the idea of
using mount -o options to enable ext3/4 features. (We do it
with EA's and acl's, sigh, and that's wrong too, but at least I was
able to paper over that later by adding default mount option support
into the superblock.)

The right way to do this is to only enable a feature like extents
after using tune2fs -O extents, or creating a filesystem with mke2fs
-O extents.
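
What "enabled by a feature flag" means can be sketched in a few lines. The struct below is a hypothetical stand-in for the on-disk superblock, not the kernel's definition, and 0x0040 is assumed here as the value of the extents incompat bit:

```c
#include <stdint.h>

#define FEATURE_INCOMPAT_EXTENTS 0x0040u  /* assumed value of the extents bit */

/* Hypothetical stand-in for the superblock's incompat feature word. */
struct sb_sketch {
    uint32_t s_feature_incompat;
};

/* The feature is on if and only if mke2fs/tune2fs set the bit on disk. */
static int has_incompat(const struct sb_sketch *sb, uint32_t mask)
{
    return (sb->s_feature_incompat & mask) != 0;
}

/*
 * A driver may mount the filesystem only if it understands every
 * incompat bit that is set; 'supported' is the set it knows about.
 */
static int can_mount(const struct sb_sketch *sb, uint32_t supported)
{
    return (sb->s_feature_incompat & ~supported) == 0;
}
```

With the bit stored in the superblock, a driver that does not list it in 'supported' refuses the mount outright, and nothing depends on the user remembering to pass a mount option each time.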

Can we change the patches to do this, please?

- Ted

2006-10-06 12:50:36

by Dave Kleikamp

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1

On Thu, 2006-10-05 at 16:25 -0700, Linus Torvalds wrote:
>
> On Thu, 5 Oct 2006, Andrew Morton wrote:
> >
> > Linus, what's the best way of doing this? Will git dtrt with a patch which
> > copies files, or would a script which does the mkdir's and cp's be better?
>
> Git should dtrt.
>
> In fact, if you use
>
> git diff -C
>
> it should generate the appropriate "file copied" things automatically, and
> you don't need any huge file at all, you'll get a "patch" that looks
> something like
>
> diff --git a/fs/ext3/inode.c b/fs/ext4/inode.c
> similarity index 100%
> copy from fs/ext3/inode.c
> copy to fs/ext4/inode.c
> diff --git a/fs/ext3/super.c b/fs/ext4/super.c
> similarity index 98%
> copy from fs/ext3/super.c
> copy to fs/ext4/super.c
> index xyz..zzy 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext4/super.c
> .. small diff that changes "ext3" to "ext4" goes here ..
>
>
> ie you'll effectively get the best of both worlds: a "diff", but one that
> is actually readable and shows what is going on.

We haven't been using git to manage ext4 so far, although in hindsight
it probably would have made things easier. I'm assuming that you're
just suggesting this for educational purposes and that git will handle
the patches that Andrew picked up into -mm just fine.

I could re-generate the patches that do the copies from git, but I don't
believe it will be that beneficial at this point.

> I hate to beat my own drum (not really), but git really _is_ a lot better
> than anything else out there ;)

No argument here :-)

Shaggy
--
David Kleikamp
IBM Linux Technology Center


2006-10-06 13:57:43

by Andrew Morton

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1

On Fri, 06 Oct 2006 14:31:59 +0400
Alex Tomas <[email protected]> wrote:

> >>>>> Andrew Morton (AM) writes:
>
> AM> On Fri, 6 Oct 2006 00:41:03 -0600
> AM> Andreas Dilger <[email protected]> wrote:
>
> >> The big performance win will come with mballoc and delalloc. CFS has
> >> been using mballoc for a few years already with Lustre, and IBM + Bull
> >> did a lot of benchmarking on it. The reason it isn't in the first set of
> >> patches is partly a manageability issue, and partly because it doesn't
> >> directly affect the on-disk format (outside of much better allocation)
> >> so it isn't critical to get into the first round of changes. I believe
> >> Alex is working on a new set of patches right now.
>
> AM> Are you sure that these things will improve allocation much? Reservations
> AM> made a big improvement there.
>
> it depends on underlying storage and workload. mballoc uses buddy
> internally. it's much simpler and cheaper to find free 2^N blocks
> compared to bitmap.

So mballoc's application is to save CPU cycles?

> this is especially important for arrays like
> DDN and raid5/6 because they require stripe-aligned/-sized requests
> for good throughput.

Does this not imply that there needs to be new linkage between the
filesystem and the lower layers? So that raid/etc can inform the
filesystem driver about its alignment and striping requirements?

> also, last mballoc takes logical block into
> account and can preallocate few chunks at different logical offsets
> for a file. imagine torrent downloading different pieces from few peers.

hm. You don't need anything as exotic as bittorrent to show up problems in
that area:

box:/usr/src/25> sudo bmap vmlinux | wc -l
1152



2006-10-06 16:11:53

by Linus Torvalds

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1



On Fri, 6 Oct 2006, Dave Kleikamp wrote:
> On Thu, 2006-10-05 at 16:25 -0700, Linus Torvalds wrote:
>
> We haven't been using git to manage ext4 so far, although in hindsight
> it probably would have made things easier. I'm assuming that you're
> just suggesting this for educational purposes and that git will handle
> the patches that Andrew picked up into -mm just fine.

Well, also that even if you're not using git, what any random person can
do is to actually just import the current state into git, and do the
simple "git diff -C" thing to get the nicer diff.

One of the advantages about git is that "intent" doesn't matter. Git only
tracks cold, hard data. So if you copied a file, git doesn't care one whit
whether you _tell_ it that you copied it or not: it will purely look at
the end result, and say "you copied it" if the file looks the same.

> I could re-generate the patches that do the copies from git, but I don't
> believe it will be that beneficial at this point.

The only advantage (but I'd argue that it's a real one, and very possibly
worth it) is that when this gets sent to me (or anybody else) by email, it
can be sent in a format that is actually readable, instead of sending it
as a huge patch that doesn't actually talk about what it does..

Linus

2006-10-06 21:10:23

by Dave Kleikamp

Subject: [PATCH] Get rid of extents mount option

On Fri, 2006-10-06 at 08:21 -0400, Theodore Tso wrote:
> On Thu, Oct 05, 2006 at 08:55:26PM -0700, Andrew Morton wrote:
> > So let me see if I have this right.
> >
> > You grab Alexandre's kit from http://www.bullopensource.org/ext4/20060926/
> > and a plain old `mke2fs -j' gives a filesystem which will mount as ext3 or
> > ext4.
> >
> > If you then mount this filesystem with `-t ext4dev -o extents', it becomes
> > incompatible with the ext3 driver. Yes?
>
> I agree that's the wrong behaviour, and I've always hated the idea of
> using mount -o options to enable ext3/4 features. (We do it
> with EA's and acl's, sigh, and that's wrong too, but at least I was
> able to paper over that later by adding default mount option support
> into the superblock.)
>
> The right way to do this is to only enable a feature like extents
> after using tune2fs -O extents, or creating a filesystem with mke2fs
> -O extents.
>
> Can we change the patches to do this, please?

Something like this?

EXT4: Get rid of extents mount option

Enabling an ext4 file system to use extents should be done with
'tune2fs -O extents' or 'mke2fs -O extents', not with a mount option

Signed-off-by: Dave Kleikamp <[email protected]>

diff -Nurp linux-orig/fs/ext4/extents.c linux/fs/ext4/extents.c
--- linux-orig/fs/ext4/extents.c 2006-10-05 07:39:08.000000000 -0500
+++ linux/fs/ext4/extents.c 2006-10-06 15:45:59.000000000 -0500
@@ -1875,7 +1875,7 @@ void ext4_ext_init(struct super_block *s
* possible initialization would be here
*/

- if (test_opt(sb, EXTENTS)) {
+ if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EXTENTS)) {
printk("EXT4-fs: file extents enabled");
#ifdef AGRESSIVE_TEST
printk(", agressive tests");
@@ -1900,7 +1900,7 @@ void ext4_ext_init(struct super_block *s
*/
void ext4_ext_release(struct super_block *sb)
{
- if (!test_opt(sb, EXTENTS))
+ if (!EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EXTENTS))
return;

#ifdef EXTENTS_STATS
diff -Nurp linux-orig/fs/ext4/ialloc.c linux/fs/ext4/ialloc.c
--- linux-orig/fs/ext4/ialloc.c 2006-10-05 07:39:08.000000000 -0500
+++ linux/fs/ext4/ialloc.c 2006-10-06 15:37:36.000000000 -0500
@@ -618,16 +618,9 @@ got:
ext4_std_error(sb, err);
goto fail_free_drop;
}
- if (test_opt(sb, EXTENTS)) {
+ if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EXTENTS)) {
EXT4_I(inode)->i_flags |= EXT4_EXTENTS_FL;
ext4_ext_tree_init(handle, inode);
- if (!EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EXTENTS)) {
- err = ext4_journal_get_write_access(handle, EXT4_SB(sb)->s_sbh);
- if (err) goto fail;
- EXT4_SET_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EXTENTS);
- BUFFER_TRACE(EXT4_SB(sb)->s_sbh, "call ext4_journal_dirty_metadata");
- err = ext4_journal_dirty_metadata(handle, EXT4_SB(sb)->s_sbh);
- }
}

ext4_debug("allocating inode %lu\n", inode->i_ino);
diff -Nurp linux-orig/fs/ext4/super.c linux/fs/ext4/super.c
--- linux-orig/fs/ext4/super.c 2006-10-05 07:39:08.000000000 -0500
+++ linux/fs/ext4/super.c 2006-10-06 15:47:47.000000000 -0500
@@ -728,7 +728,7 @@ enum {
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
- Opt_grpquota, Opt_extents,
+ Opt_grpquota,
};

static match_table_t tokens = {
@@ -778,7 +778,6 @@ static match_table_t tokens = {
{Opt_quota, "quota"},
{Opt_usrquota, "usrquota"},
{Opt_barrier, "barrier=%u"},
- {Opt_extents, "extents"},
{Opt_err, NULL},
{Opt_resize, "resize"},
};
@@ -1111,9 +1110,6 @@ clear_qf_name:
case Opt_bh:
clear_opt(sbi->s_mount_opt, NOBH);
break;
- case Opt_extents:
- set_opt (sbi->s_mount_opt, EXTENTS);
- break;
default:
printk (KERN_ERR
"EXT4-fs: Unrecognized mount option \"%s\" "

Shaggy
--
David Kleikamp
IBM Linux Technology Center


2006-10-06 21:21:45

by Dave Kleikamp

Subject: [PATCH] Get rid of extents mount option - try 2

I rushed that out too quickly. This one cleans up the header files too.

EXT4: Get rid of extents mount option

Enabling an ext4 file system to use extents should be done with
'tune2fs -O extents' or 'mke2fs -O extents', not with a mount option

Signed-off-by: Dave Kleikamp <[email protected]>

diff -Nurp linux-orig/fs/ext4/extents.c linux/fs/ext4/extents.c
--- linux-orig/fs/ext4/extents.c 2006-10-05 07:39:08.000000000 -0500
+++ linux/fs/ext4/extents.c 2006-10-06 15:45:59.000000000 -0500
@@ -1875,7 +1875,7 @@ void ext4_ext_init(struct super_block *s
* possible initialization would be here
*/

- if (test_opt(sb, EXTENTS)) {
+ if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EXTENTS)) {
printk("EXT4-fs: file extents enabled");
#ifdef AGRESSIVE_TEST
printk(", agressive tests");
@@ -1900,7 +1900,7 @@ void ext4_ext_init(struct super_block *s
*/
void ext4_ext_release(struct super_block *sb)
{
- if (!test_opt(sb, EXTENTS))
+ if (!EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EXTENTS))
return;

#ifdef EXTENTS_STATS
diff -Nurp linux-orig/fs/ext4/ialloc.c linux/fs/ext4/ialloc.c
--- linux-orig/fs/ext4/ialloc.c 2006-10-05 07:39:08.000000000 -0500
+++ linux/fs/ext4/ialloc.c 2006-10-06 15:37:36.000000000 -0500
@@ -618,16 +618,9 @@ got:
ext4_std_error(sb, err);
goto fail_free_drop;
}
- if (test_opt(sb, EXTENTS)) {
+ if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EXTENTS)) {
EXT4_I(inode)->i_flags |= EXT4_EXTENTS_FL;
ext4_ext_tree_init(handle, inode);
- if (!EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EXTENTS)) {
- err = ext4_journal_get_write_access(handle, EXT4_SB(sb)->s_sbh);
- if (err) goto fail;
- EXT4_SET_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EXTENTS);
- BUFFER_TRACE(EXT4_SB(sb)->s_sbh, "call ext4_journal_dirty_metadata");
- err = ext4_journal_dirty_metadata(handle, EXT4_SB(sb)->s_sbh);
- }
}

ext4_debug("allocating inode %lu\n", inode->i_ino);
diff -Nurp linux-orig/fs/ext4/super.c linux/fs/ext4/super.c
--- linux-orig/fs/ext4/super.c 2006-10-05 07:39:08.000000000 -0500
+++ linux/fs/ext4/super.c 2006-10-06 15:47:47.000000000 -0500
@@ -728,7 +728,7 @@ enum {
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
- Opt_grpquota, Opt_extents,
+ Opt_grpquota,
};

static match_table_t tokens = {
@@ -778,7 +778,6 @@ static match_table_t tokens = {
{Opt_quota, "quota"},
{Opt_usrquota, "usrquota"},
{Opt_barrier, "barrier=%u"},
- {Opt_extents, "extents"},
{Opt_err, NULL},
{Opt_resize, "resize"},
};
@@ -1111,9 +1110,6 @@ clear_qf_name:
case Opt_bh:
clear_opt(sbi->s_mount_opt, NOBH);
break;
- case Opt_extents:
- set_opt (sbi->s_mount_opt, EXTENTS);
- break;
default:
printk (KERN_ERR
"EXT4-fs: Unrecognized mount option \"%s\" "
diff -Nurp linux-orig/include/linux/ext4_fs.h linux/include/linux/ext4_fs.h
--- linux-orig/include/linux/ext4_fs.h 2006-10-05 07:39:08.000000000 -0500
+++ linux/include/linux/ext4_fs.h 2006-10-06 16:13:07.000000000 -0500
@@ -399,7 +399,6 @@ struct ext4_inode {
#define EXT4_MOUNT_QUOTA 0x80000 /* Some quota option set */
#define EXT4_MOUNT_USRQUOTA 0x100000 /* "old" user quota */
#define EXT4_MOUNT_GRPQUOTA 0x200000 /* "old" group quota */
-#define EXT4_MOUNT_EXTENTS 0x400000 /* Extents support */

/* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */
#ifndef _LINUX_EXT2_FS_H
diff -Nurp linux-orig/include/linux/ext4_jbd2.h linux/include/linux/ext4_jbd2.h
--- linux-orig/include/linux/ext4_jbd2.h 2006-10-05 07:39:08.000000000 -0500
+++ linux/include/linux/ext4_jbd2.h 2006-10-06 16:17:20.000000000 -0500
@@ -33,7 +33,7 @@

#define EXT4_SINGLEDATA_TRANS_BLOCKS(sb) \
(EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EXTENTS) \
- || test_opt(sb, EXTENTS) ? 27U : 8U)
+ ? 27U : 8U)

/* Extended attribute operations touch at most two data buffers,
* two bitmap buffers, and two group summaries, in addition to the inode

--
David Kleikamp
IBM Linux Technology Center
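
To illustrate what the patch above changes: the mount-option test is replaced by a check of the extents bit in the superblock's incompatible-feature mask. The following is a minimal userspace sketch, not kernel code. `struct mini_super`, `has_incompat_feature()` and `singledata_trans_blocks()` are hypothetical names, byte-order conversion is ignored, and the value 0x0040 for the extents bit is an assumption that should be checked against ext4_fs.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value of the extents incompat bit; verify against ext4_fs.h. */
#define FEATURE_INCOMPAT_EXTENTS 0x0040u

/* Hypothetical, simplified stand-in for the on-disk superblock. */
struct mini_super {
    uint32_t s_feature_incompat;   /* bitmask of incompatible features */
};

/* Non-zero when the given incompat feature bit is set in the superblock. */
int has_incompat_feature(const struct mini_super *sb, uint32_t mask)
{
    assert(sb != NULL);
    return (sb->s_feature_incompat & mask) != 0;
}

/*
 * Mirrors the patched EXT4_SINGLEDATA_TRANS_BLOCKS macro: 27 blocks when
 * the extents feature is set, 8 otherwise. The mount option no longer
 * takes part in the decision.
 */
unsigned singledata_trans_blocks(const struct mini_super *sb)
{
    return has_incompat_feature(sb, FEATURE_INCOMPAT_EXTENTS) ? 27u : 8u;
}
```

With this shape, whether extents are used is a property of the filesystem recorded on disk rather than of a particular mount.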


2006-10-06 22:32:34

by Andrew Morton

Subject: Re: [PATCH] Get rid of extents mount option - try 2

On Fri, 06 Oct 2006 16:21:40 -0500
Dave Kleikamp <[email protected]> wrote:

> Enabling an ext4 file system to use extents should be done with
> 'tune2fs -O extents' or 'mke2fs -O extents', not with a mount option

But the mke2fs which I built by applying
http://www.bullopensource.org/ext4/20060926/ to e2fsprogs-1.39 doesn't
recognise `-O extents'. So the only way I can use extents is `mount -o
extents'.

What am I missing here?

2006-10-06 23:20:04

by Dave Kleikamp

Subject: Re: [PATCH] Get rid of extents mount option - try 2

On Fri, 2006-10-06 at 15:32 -0700, Andrew Morton wrote:
> On Fri, 06 Oct 2006 16:21:40 -0500
> Dave Kleikamp <[email protected]> wrote:
>
> > Enabling an ext4 file system to use extents should be done with
> > 'tune2fs -O extents' or 'mke2fs -O extents', not with a mount option
>
> But the mke2fs which I built by applying
> http://www.bullopensource.org/ext4/20060926/ to e2fsprogs-1.39 doesn't
> recognise `-O extents'. So the only way I can use extents is `mount -o
> extents'.
>
> What am I missing here?

To be honest, I've been lazy and I haven't even tried to get the new
e2fsprogs. I just grabbed the latest from the mercurial repository,
http://e2fsprogs.sourceforge.net/e2fsprogs-hacking.html , and it doesn't
work for me either. Ted?

Hold off on the patch until we figure it out. :-)

Shaggy
--
David Kleikamp
IBM Linux Technology Center


2006-10-07 04:14:53

by Theodore Ts'o

Subject: Re: [PATCH] Get rid of extents mount option - try 2

On Fri, Oct 06, 2006 at 06:20:00PM -0500, Dave Kleikamp wrote:
> To be honest, I've been lazy and I haven't even tried to get the new
> e2fsprogs. I just grabbed the latest from the mercurial repository,
> http://e2fsprogs.sourceforge.net/e2fsprogs-hacking.html , and it doesn't
> work for me either. Ted?
>
> Hold off on the patch until we figure it out. :-)

I've been busy cleaning up the userspace extents patches before I'm
willing to accept them into the mainline e2fsprogs tree. So they're not
in Mercurial yet. They're coming soon; but in the meantime, my
interim patchset, which I've been using to hack on the extents patches
plus the signed-char-powerpc-dirhash problem, can be found at:

ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim

Both a rolled-up patch file plus a broken-out tar.gz file are
available there. The current version on the above URL is
e2fsprogs-1.39-tyt1. Note that you will have to take the
f_extents/image.gz from the broken-out tar file and copy it into
tests/f_extents/image.gz or the f_extents regression test will fail.
In addition, the f_lotsbad regression test is also known to fail
in 1.39-tyt1, and that failure can be safely ignored
for now.

This should be good enough for the extents patches that Shaggy has
been queuing up.

- Ted

P.S. Before we add the extents patch, I just thought of one
additional change that might be good to add. Could we add a u32
field in the superblock which counts the number of files with extents,
and is automatically incremented and decremented as necessary by the
kernel, and which can be checked by e2fsck? It would be really useful
for making it easy for tune2fs to be able to tell if it can safely
remove the extents feature from the filesystem, or whether it should
refuse such a request.
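
The counter proposed in the P.S. could look roughly like the sketch below. The struct and field names are invented for illustration; nothing in the thread fixes a name or an on-disk location for the field, and real kernel code would update it under a journal handle.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical superblock fragment; the field name is invented here. */
struct sb_counts {
    uint32_t s_extent_files;   /* inodes currently mapped with extents */
};

/* Called when an inode that uses extents is created. */
void extent_file_created(struct sb_counts *sb)
{
    sb->s_extent_files++;
}

/* Called when an inode that uses extents is deleted. */
void extent_file_deleted(struct sb_counts *sb)
{
    assert(sb->s_extent_files > 0);   /* underflow would mean corruption */
    sb->s_extent_files--;
}

/* tune2fs-style check: the feature may be cleared only when no file uses it. */
int can_clear_extents_feature(const struct sb_counts *sb)
{
    return sb->s_extent_files == 0;
}
```

A checker such as e2fsck could compare the stored value against the number of extent-mapped inodes it finds during its own scan.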

2006-10-07 15:53:56

by Dave Kleikamp

Subject: Re: [PATCH] Get rid of extents mount option - try 2

On Sat, 2006-10-07 at 00:14 -0400, Theodore Tso wrote:
> On Fri, Oct 06, 2006 at 06:20:00PM -0500, Dave Kleikamp wrote:
> > To be honest, I've been lazy and I haven't even tried to get the new
> > e2fsprogs. I just grabbed the latest from the mercurial repository,
> > http://e2fsprogs.sourceforge.net/e2fsprogs-hacking.html , and it doesn't
> > work for me either. Ted?
> >
> > Hold off on the patch until we figure it out. :-)
>
> I've been busy cleaning up the userspace extents patches before I'm
> willing to accept them into the mainline e2fsprogs tree. So they're not
> in Mercurial yet. They're coming soon; but in the meantime, my
> interim patchset, which I've been using to hack on the extents patches
> plus the signed-char-powerpc-dirhash problem, can be found at:
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim
>
> Both a rolled-up patch file plus a broken-out tar.gz file are
> available there. The current version on the above URL is
> e2fsprogs-1.39-tyt1. Note that you will have to take the
> f_extents/image.gz from the broken-out tar file and copy it into
> tests/f_extents/image.gz or the f_extents regression test will fail.
> In addition, the f_lotsbad regression test is also known to fail
> in 1.39-tyt1, and that failure can be safely ignored
> for now.
>
> This should be good enough for the extents patches that Shaggy has
> been queuing up.

I noticed we are missing Documentation/filesystems/ext4.txt. Over the
weekend, I'll try to put something together with instructions on getting
the right version of e2fsprogs, etc.

> P.S. Before we add the extents patch, I just thought of one
> additional change that might be good to add. Could we add a u32
> field in the superblock which counts the number of files with extents,
> and is automatically incremented and decremented as necessary by the
> kernel, and which can be checked by e2fsck? It would be really useful
> for making it easy for tune2fs to be able to tell if it can safely
> remove the extents feature from the filesystem, or whether it should
> refuse such a request.

I guess this would be useful to turn the feature off immediately after
turning it on, but with the removal of the extents mount option, we no
longer have the ability to make old-style files once the feature is
turned on. So it's unlikely that you'd be able to turn the feature off
once a file system has been used.

Also, do we update the superblock in every transaction that creates or
deletes a file? Otherwise, how do we guarantee the count is accurate
after replaying the journal?

Shaggy
--
David Kleikamp
IBM Linux Technology Center


2006-10-07 17:20:50

by Theodore Ts'o

Subject: Re: [PATCH] Get rid of extents mount option - try 2

On Sat, Oct 07, 2006 at 10:53:47AM -0500, Dave Kleikamp wrote:
> I noticed we are missing Documentation/filesystems/ext4.txt. Over the
> weekend, I'll try to put something together with instructions on getting
> the right version of e2fsprogs, etc.

For now just say to grab the latest from:

ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim

We'll have to upgrade it once we have a released version of e2fsprogs
that supports extents (although by then we may need e2fsprogs-interim
for 48 or 64 bit extents, or whatever next new feature we're working
on :-).

> I guess this would be useful to turn the feature off immediately after
> turning it on, but with the removal of the extents mount option, we no
> longer have the ability to make old-style files once the feature is
> turned on. So it's unlikely that you'd be able to turn the feature off
> once a file system has been used.

Well, we could have tune2fs scan all inodes, and have it allocate
triple/double/indirect blocks to replace the extent tree, at some
point. The count would allow us to turn it off immediately after
turning it on without forcing the full scan of all inodes. Maybe it's
not worth the overhead though.

> Also, do we update the superblock in every transaction that creates or
> deletes a file? Otherwise, how do we guarantee the count is accurate
> after replaying the journal?

Yes, we do. The number of free inodes has to be kept up-to-date,
after all, so the superblock is marked dirty and as being part of the
transaction.

- Ted

2006-10-07 19:43:40

by Alex Tomas

Subject: Re: [PATCH] Get rid of extents mount option - try 2

>>>>> Theodore Tso (TT) writes:

>> Also, do we update the superblock in every transaction that creates or
>> deletes a file? Otherwise, how do we guarantee the count is accurate
>> after replaying the journal?

TT> Yes, we do. The number of free inodes has to be kept up-to-date,
TT> after all, so the superblock is marked dirty and as being part of the
TT> transaction.

actually, not any more. we use group descriptors to initialize
free inodes counter at mount time.

thanks, Alex

2006-10-07 19:58:11

by Andrew Morton

Subject: Re: [PATCH] Get rid of extents mount option - try 2

On Sat, 7 Oct 2006 13:20:27 -0400
Theodore Tso <[email protected]> wrote:

> > Also, do we update the superblock in every transaction that creates or
> > deletes a file? Otherwise, how do we guarantee the count is accurate
> > after replaying the journal?
>
> Yes, we do. The number of free inodes has to be kept up-to-date,
> after all, so the superblock is marked dirty and as being part of the
> transaction.

Actually we cheat, and we don't keep the superblock free inodes counter up
to date in real time. Done for CPU consumption reasons, but it was
perhaps a false optimisation, given that we still have a system-wide
inode_lock.

The free inode count is already triply redundant: inode table scan, inode
bitmap scan, ext4_group_desc.bg_free_inodes_count. Making it quadruply
redundant seemed a bit over the top.

At runtime the definitive free-inodes count is the sum of the
per-blockgroup free-inode counts. On clean shutdown we regenerate that and
write it into the superblock.
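
The runtime calculation described above, summing the per-blockgroup counts, can be sketched as follows. `struct group_desc` here is a hypothetical cut-down stand-in that keeps only `bg_free_inodes_count`; the real descriptor has more fields and stores them little-endian, and `count_free_inodes()` is an illustrative name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified group descriptor with a single field. */
struct group_desc {
    uint16_t bg_free_inodes_count;   /* free inodes in this block group */
};

/*
 * Sum the per-group free-inode counts. This total is what the
 * superblock counter would be regenerated from on a clean shutdown.
 */
uint32_t count_free_inodes(const struct group_desc *gdp, size_t ngroups)
{
    uint32_t total = 0;
    size_t i;

    assert(gdp != NULL || ngroups == 0);
    for (i = 0; i < ngroups; i++)
        total += gdp[i].bg_free_inodes_count;
    return total;
}
```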

2006-10-07 20:08:13

by Alex Tomas

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1


Hi

>>>>> Alex Tomas (AT) writes:

>>> it depends on underlying storage and workload. mballoc uses buddy
>>> internally. it's much simpler and cheaper to find free 2^N blocks
>>> compared to bitmap.

AM> So mballoc's application is to save CPU cycles?

AFAIU, we don't implement complex scanning for a given size in balloc.c
because a bitmap isn't a very convenient structure for this and that would
require many cycles. with mballoc it becomes possible. for example,
to find a 1MB free chunk one has to choose a group (mballoc tracks the
number of free chunks in every buddy) and then scan just a few bits. thus
we can produce a better layout and improve performance.

>>> this is especially important for arrays like
>>> DDN and raid5/6 because they require stripe-aligned/-sized requests
>>> for good throughput.

AM> Does this not imply that there needs to be new linkage between the
AM> filesystem and the lower layers? So that raid/etc can inform the
AM> filesystem driver about its alignment and striping requirements?

currently, we pass preferred I/O size with mount option (stripe=N).
I'd like that sort of communication between block driver and fs.
something like f_bsize.

>>> also, last mballoc takes logical block into
>>> account and can preallocate few chunks at different logical offsets
>>> for a file. imagine torrent downloading different pieces from few peers.

AM> hm. You don't need anything as exotic as bittorrent to show up problems in
AM> that area:

AM> box:/usr/src/25> sudo bmap vmlinux | wc -l
AM> 1152

well, this can be (and will be, I very much hope :) solved
by delayed allocation. I mentioned torrent because it's
often used to get really large files. so large that they
don't fit in cache and delayed allocation won't help much.
preallocation can help, but then a few preallocations per
file are required.

thanks, Alex
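
The group-selection step described in this message (pick a group from per-buddy free-chunk counts, then look at only a few bits) might be sketched like this. It is not mballoc's code: `struct group_buddy`, `MAX_ORDER`, `order_for()` and `find_group()` are hypothetical, and the 1MB example assumes 4KB blocks, so 1MB is 256 blocks, or order 8.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ORDER 13   /* hypothetical: chunks of up to 2^13 blocks */

/* Hypothetical per-group summary: free_chunks[k] counts free 2^k-block chunks. */
struct group_buddy {
    unsigned free_chunks[MAX_ORDER + 1];
};

/* Smallest order k with 2^k >= nblocks; MAX_ORDER + 1 if the request is too big. */
unsigned order_for(unsigned long nblocks)
{
    unsigned order = 0;

    assert(nblocks > 0);
    while (order <= MAX_ORDER && (1UL << order) < nblocks)
        order++;
    return order;
}

/*
 * Return the index of the first group holding a free chunk large enough
 * for nblocks, or -1 if none does. Only the small per-group counters are
 * consulted; no block bitmap is scanned.
 */
long find_group(const struct group_buddy *groups, size_t ngroups,
                unsigned long nblocks)
{
    unsigned want = order_for(nblocks);
    unsigned k;
    size_t g;

    if (want > MAX_ORDER)
        return -1;
    for (g = 0; g < ngroups; g++)
        for (k = want; k <= MAX_ORDER; k++)
            if (groups[g].free_chunks[k] > 0)
                return (long)g;
    return -1;
}
```

The point of the counters is that rejecting a group costs a handful of integer comparisons, whereas a plain bitmap would have to be walked to learn whether a large enough free run exists.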

2006-10-10 06:29:31

by Andrew Morton

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1

On Thu, 5 Oct 2006 17:39:33 -0700
Andrew Morton <[email protected]> wrote:

> Could we please have a few nice words about ext4 for the record? Like,
> what its features are, how one creates an instance, where to get the
> correct userspace tools from, stability level, any known shortcomings,
> issues, etc?

So I guess I get to write this.

2006-10-10 07:51:49

by Suparna Bhattacharya

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1

On Mon, Oct 09, 2006 at 11:29:27PM -0700, Andrew Morton wrote:
> On Thu, 5 Oct 2006 17:39:33 -0700
> Andrew Morton <[email protected]> wrote:
>
> > Could we please have a few nice words about ext4 for the record? Like,
> > what its features are, how one creates an instance, where to get the
> > correct userspace tools from, stability level, any known shortcomings,
> > issues, etc?
>
> So I guess I get to write this.

Hopefully not :)
We should be able to put something together for a start. Where should this
reside ? Under Documentation/filesystems/ext4.txt ?

However, since this is still very clearly a "development" branch at the moment,
with lots of ongoing work both on the kernel and even more so on the
tools side, how much detail are you looking for ?

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India


2006-10-10 08:20:11

by Andrew Morton

Subject: Re: Updated ext4/jbd2 patches based on 2.6.19-rc1

On Tue, 10 Oct 2006 13:24:02 +0530
Suparna Bhattacharya <[email protected]> wrote:

> On Mon, Oct 09, 2006 at 11:29:27PM -0700, Andrew Morton wrote:
> > On Thu, 5 Oct 2006 17:39:33 -0700
> > Andrew Morton <[email protected]> wrote:
> >
> > > Could we please have a few nice words about ext4 for the record? Like,
> > > what its features are, how one creates an instance, where to get the
> > > correct userspace tools from, stability level, any known shortcomings,
> > > issues, etc?
> >
> > So I guess I get to write this.
>
> Hopefully not :)
> We should be able to put something together for a start. Where should this
> reside ? Under Documentation/filesystems/ext4.txt ?

That sounds appropriate. And in the patch changelog.

> However, since this is still very clearly a "development" branch at the moment,
> with lots of ongoing work both on the kernel and even more so on the
> tools side, how much detail are you looking for ?

We should make it as easy as possible for our testers to get up and running
and using all available features. They're skilled, so a simple user guide
which tells them what the features are and how to access them should
suffice, thanks.


2006-10-10 18:48:25

by Dave Kleikamp

Subject: Re: [PATCH] Get rid of extents mount option - try 2

On Sat, 2006-10-07 at 13:20 -0400, Theodore Tso wrote:
> On Sat, Oct 07, 2006 at 10:53:47AM -0500, Dave Kleikamp wrote:
> > I noticed we are missing Documentation/filesystems/ext4.txt. Over the
> > weekend, I'll try to put something together with instructions on getting
> > the right version of e2fsprogs, etc.
>
> For now just say to grab the latest from:
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim

Ted,
Do you think it's possible to put a source tarball out there with the
patches applied? It's confusing to untar the tarball only to find the
patchset. Otherwise, we'll have to beef up the instructions a little
bit.

Thanks,
Shaggy

--
David Kleikamp
IBM Linux Technology Center


2006-10-10 20:02:45

by Dave Kleikamp

Subject: [RFC] [PATCH] Documentation/filesystems/ext4.txt

On Tue, 2006-10-10 at 01:14 -0700, Andrew Morton wrote:
> On Tue, 10 Oct 2006 13:24:02 +0530
> Suparna Bhattacharya <[email protected]> wrote:
>
> > On Mon, Oct 09, 2006 at 11:29:27PM -0700, Andrew Morton wrote:
> > > On Thu, 5 Oct 2006 17:39:33 -0700
> > > Andrew Morton <[email protected]> wrote:
> > >
> > > > Could we please have a few nice words about ext4 for the record? Like,
> > > > what its features are, how one creates an instance, where to get the
> > > > correct userspace tools from, stability level, any known shortcomings,
> > > > issues, etc?
> > >
> > > So I guess I get to write this.

Sorry I didn't get something written sooner. This is based off of what
you put in the -mm1 announcement.

> > Hopefully not :)
> > We should be able to put something together for a start. Where should this
> > reside ? Under Documentation/filesystems/ext4.txt ?

Suparna put this together and I updated it a bit.

> That sounds appropriate. And in the patch changelog.

How do you want to handle the patch set? I could resend it with more
comments, put it into git, or do you just want to plug the comments into
the patches you are carrying? I can do whatever works best for you.

> > However, since this is still very clearly a "development" branch at the moment,
> > with lots of ongoing work both on the kernel and even more so on the
> > tools side, how much detail are you looking for ?
>
> We should make it as easy as possible for our testers to get up and running
> and using all available features. They're skilled, so a simple user guide
> which tells them what the features are and how to access them should
> suffice, thanks.

This file, ext4.txt, was put together with information from Andrew Morton,
Andreas Dilger, Suparna Bhattacharya, and Ted Ts'o.

I copied the mount options, with the exception of "extents", from ext3.txt,
so if anyone is aware of anything out-of-date, please let me know.

Signed-off-by: Dave Kleikamp <[email protected]>

diff -Nurp linux-orig/Documentation/filesystems/00-INDEX linux/Documentation/filesystems/00-INDEX
--- linux-orig/Documentation/filesystems/00-INDEX 2006-10-05 07:22:05.000000000 -0500
+++ linux/Documentation/filesystems/00-INDEX 2006-10-06 17:30:59.000000000 -0500
@@ -34,6 +34,8 @@ ext2.txt
- info, mount options and specifications for the Ext2 filesystem.
ext3.txt
- info, mount options and specifications for the Ext3 filesystem.
+ext4.txt
+ - info, mount options and specifications for the Ext4 filesystem.
files.txt
- info on file management in the Linux kernel.
fuse.txt
diff -Nurp linux-orig/Documentation/filesystems/ext4.txt linux/Documentation/filesystems/ext4.txt
--- linux-orig/Documentation/filesystems/ext4.txt 1969-12-31 18:00:00.000000000 -0600
+++ linux/Documentation/filesystems/ext4.txt 2006-10-10 14:25:38.000000000 -0500
@@ -0,0 +1,236 @@
+
+Ext4 Filesystem
+===============
+
+This is a development version of the ext4 filesystem, an advanced level
+of the ext3 filesystem which incorporates scalability and reliability
+enhancements for supporting large filesystems (64 bit) in keeping with
+increasing disk capacities and state-of-the-art feature requirements.
+
+Mailing list: [email protected]
+
+
+1. Quick usage instructions:
+===========================
+
+ - Grab updated e2fsprogs from
+ ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/
+ This is a patchset on top of e2fsprogs-1.39, which can be found at
+ ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
+
+ - It's still mke2fs -j /dev/hda1
+
+ - mount /dev/hda1 /wherever -t ext4dev
+
+ - To enable extents,
+
+ mount /dev/hda1 /wherever -t ext4dev -o extents
+
+ - The filesystem is compatible with the ext3 driver until you add a file
+ which has extents (ie: `mount -o extents', then create a file).
+
+ NOTE: The "extents" mount flag is temporary. It will soon go away and
+ extents will be enabled by the "-O extents" flag to mke2fs or tune2fs.
+
+ - When comparing performance with other filesystems, remember that
+ ext3/4 by default offers higher data integrity guarantees than most. So
+ when comparing with a metadata-only journalling filesystem, use `mount -o
+ data=writeback'. And you might as well use `mount -o nobh' too along
+ with it. Making the journal larger than the mke2fs default often helps
+ performance with metadata-intensive workloads.
+
+2. Features
+===========
+
+2.1 Currently available
+
+* ability to use filesystems > 16TB
+* extent format reduces metadata overhead (RAM, IO for access, transactions)
+* extent format more robust in face of on-disk corruption due to magics,
+  internal redundancy in tree
+
+2.2 Previously available, soon to be enabled by default by "mkfs.ext4":
+
+* dir_index and resize inode will be on by default
+* large inodes will be used by default for fast EAs, nsec timestamps, etc
+
+2.3 Candidate features for future inclusion
+
+There are several under discussion, whether they all make it in is
+partly a function of how much time everyone has to work on them:
+
+* improved file allocation (multi-block alloc, delayed alloc; basically done)
+* fix 32000 subdirectory limit (patch exists, needs some e2fsck work)
+* nsec timestamps for mtime, atime, ctime, create time (patch exists,
+ needs some e2fsck work)
+* inode version field on disk (NFSv4, Lustre; prototype exists)
+* reduced mke2fs/e2fsck time via uninitialized groups (prototype exists)
+* journal checksumming for robustness, performance (prototype exists)
+* persistent file preallocation (e.g for streaming media, databases)
+
+Features like metadata checksumming have been discussed and planned for
+a bit but no patches exist yet so I'm not sure they're in the near-term
+roadmap.
+
+The big performance win will come with mballoc and delalloc. CFS has
+been using mballoc for a few years already with Lustre, and IBM + Bull
+did a lot of benchmarking on it. The reason it isn't in the first set of
+patches is partly a manageability issue, and partly because it doesn't
+directly affect the on-disk format (outside of much better allocation)
+so it isn't critical to get into the first round of changes. I believe
+Alex is working on a new set of patches right now.
+
+3. Options
+==========
+
+When mounting an ext4 filesystem, the following options are accepted:
+(*) == default
+
+extents ext4 will use extents to address file data. The
+ file system will no longer be mountable by ext3.
+
+journal=update Update the ext4 file system's journal to the current
+ format.
+
+journal=inum When a journal already exists, this option is ignored.
+ Otherwise, it specifies the number of the inode which
+ will represent the ext4 file system's journal file.
+
+journal_dev=devnum When the external journal device's major/minor numbers
+ have changed, this option allows the user to specify
+ the new journal location. The journal device is
+ identified through its new major/minor numbers encoded
+ in devnum.
+
+noload Don't load the journal on mounting.
+
+data=journal All data are committed into the journal prior to being
+ written into the main file system.
+
+data=ordered (*) All data are forced directly out to the main file
+ system prior to its metadata being committed to the
+ journal.
+
+data=writeback Data ordering is not preserved, data may be written
+ into the main file system after its metadata has been
+ committed to the journal.
+
+commit=nrsec (*) Ext4 can be told to sync all its data and metadata
+ every 'nrsec' seconds. The default value is 5 seconds.
+ This means that if you lose your power, you will lose
+ as much as the latest 5 seconds of work (your
+ filesystem will not be damaged though, thanks to the
+ journaling). This default value (or any low value)
+ will hurt performance, but it's good for data-safety.
+ Setting it to 0 will have the same effect as leaving
+ it at the default (5 seconds).
+ Setting it to very large values will improve
+ performance.
+
+barrier=1 This enables/disables barriers. barrier=0 disables
+ it, barrier=1 enables it.
+
+orlov (*) This enables the new Orlov block allocator. It is
+ enabled by default.
+
+oldalloc This disables the Orlov block allocator and enables
+ the old block allocator. Orlov should have better
+ performance - we'd like to get some feedback if it's
+ the contrary for you.
+
+user_xattr Enables Extended User Attributes. Additionally, you
+ need to have extended attribute support enabled in the
+ kernel configuration (CONFIG_EXT4_FS_XATTR). See the
+ attr(5) manual page and http://acl.bestbits.at/ to
+ learn more about extended attributes.
+
+nouser_xattr Disables Extended User Attributes.
+
+acl Enables POSIX Access Control Lists support.
+ Additionally, you need to have ACL support enabled in
+ the kernel configuration (CONFIG_EXT4_FS_POSIX_ACL).
+ See the acl(5) manual page and http://acl.bestbits.at/
+ for more information.
+
+noacl This option disables POSIX Access Control List
+ support.
+
+reservation
+
+noreservation
+
+bsddf (*) Make 'df' act like BSD.
+minixdf Make 'df' act like Minix.
+
+check=none Don't do extra checking of bitmaps on mount.
+nocheck
+
+debug Extra debugging information is sent to syslog.
+
+errors=remount-ro(*) Remount the filesystem read-only on an error.
+errors=continue Keep going on a filesystem error.
+errors=panic Panic and halt the machine if an error occurs.
+
+grpid Give objects the same group ID as their parent directory.
+bsdgroups
+
+nogrpid (*) New objects have the group ID of their creator.
+sysvgroups
+
+resgid=n The group ID which may use the reserved blocks.
+
+resuid=n The user ID which may use the reserved blocks.
+
+sb=n Use alternate superblock at this location.
+
+quota
+noquota
+grpquota
+usrquota
+
+bh (*) ext4 associates buffer heads to data pages to
+nobh (a) cache disk block mapping information
+ (b) link pages into transaction to provide
+ ordering guarantees.
+ "bh" option forces use of buffer heads.
+ "nobh" option tries to avoid associating buffer
+ heads (supported only for "writeback" mode).
+
+
+Data Mode
+---------
+There are 3 different data modes:
+
+* writeback mode
+In data=writeback mode, ext4 does not journal data at all. This mode provides
+a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
+mode - metadata journaling. A crash+recovery can cause incorrect data to
+appear in files which were written shortly before the crash. This mode will
+typically provide the best ext4 performance.
+
+* ordered mode
+In data=ordered mode, ext4 only officially journals metadata, but it logically
+groups metadata and data blocks into a single unit called a transaction. When
+it's time to write the new metadata out to disk, the associated data blocks
+are written first. In general, this mode performs slightly slower than
+writeback but significantly faster than journal mode.
+
+* journal mode
+data=journal mode provides full data and metadata journaling. All new data is
+written to the journal first, and then to its final location.
+In the event of a crash, the journal can be replayed, bringing both data and
+metadata into a consistent state. This mode is the slowest except when data
+needs to be read from and written to disk at the same time where it
+outperforms all other modes.
+
+References
+==========
+
+kernel source: <file:fs/ext4/>
+ <file:fs/jbd2/>
+
+programs: http://e2fsprogs.sourceforge.net/
+ http://ext2resize.sourceforge.net
+
+useful links: http://fedoraproject.org/wiki/ext3-devel
+ http://www.bullopensource.org/ext4/

--
David Kleikamp
IBM Linux Technology Center
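
For testers who drive mounts from a program rather than the command line, the benchmarking advice in the quick usage section (`data=writeback` together with `nobh`) could be expressed with mount(2) roughly as below. This is a Linux-only sketch: `build_ext4_opts()` and `mount_ext4dev()` are hypothetical helpers, and the device and mount point are supplied by the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Build an option string of the form "data=<mode>[,nobh]" into buf.
 * Returns 0 on success, -1 if the string did not fit.
 */
int build_ext4_opts(char *buf, size_t len, const char *data_mode, int nobh)
{
    int n;

    assert(buf != NULL && data_mode != NULL);
    n = snprintf(buf, len, "data=%s%s", data_mode, nobh ? ",nobh" : "");
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Mount dev on dir using the development filesystem type "ext4dev".
 * Needs root privileges; returns whatever mount(2) returns.
 */
int mount_ext4dev(const char *dev, const char *dir, const char *opts)
{
    return mount(dev, dir, "ext4dev", 0, opts);
}
```

Since ext4.txt says "nobh" is supported only in writeback mode, a caller would normally pass `nobh` as non-zero only together with "writeback".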


2006-10-10 20:56:59

by Andrew Morton

Subject: Re: [RFC] [PATCH] Documentation/filesystems/ext4.txt

On Tue, 10 Oct 2006 15:02:35 -0500
Dave Kleikamp <[email protected]> wrote:

> > > Hopefully not :)
> > > We should be able to put something together for a start. Where should this
> > > reside ? Under Documentation/filesystems/ext4.txt ?
>
> Suparna put this together and I updated it a bit.

Great, thanks.

> > That sounds appropriate. And in the patch changelog.
>
> How do you want to handle the patch set? I could resend it with more
> comments, put it into git, or do you just want to plug the comments into
> the patches you are carrying? I can do whatever works best for you.

I'll just send what I have now in the next batch, probably this evening.

> + - Grab updated e2fsprogs from
> + ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/
> + This is a patchset on top of e2fsprogs-1.39, which can be found at
> + ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/

Could we please get a patched tarball up there? The easier we make it for
our testers, the more we get.


2006-10-10 21:07:23

by Theodore Ts'o

Subject: Re: [PATCH] Get rid of extents mount option - try 2

On Tue, Oct 10, 2006 at 01:48:18PM -0500, Dave Kleikamp wrote:
> On Sat, 2006-10-07 at 13:20 -0400, Theodore Tso wrote:
> > On Sat, Oct 07, 2006 at 10:53:47AM -0500, Dave Kleikamp wrote:
> > > I noticed we are missing Documentation/filesystems/ext4.txt. Over the
> > > weekend, I'll try to put something together with instructions on getting
> > > the right version of e2fsprogs, etc.
> >
> > For now just say to grab the latest from:
> >
> > ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim
>
> Ted,
> Do you think it's possible to put a source tarball out there with the
> patches applied? It's confusing to untar the tarball only to find the
> patchset. Otherwise, we'll have to beef up the instructions a little
> bit.

I was assuming that someone who knew how to deal with the -mm patchset
would know how to deal with a patchset. I agree it should be
changed to e2fsprogs-1.39-tyt1-broken-out.tar.gz, though.

We can put a source tarball up, but sometimes the e2fsprogs-interim
patches will be, well, about as stable as the -mm tree. So people who
assume a completely stable release may end up being a little
disappointed. Hopefully by the time the ext4 stuff gets merged into
the mainline kernel we'll have an e2fsprogs-WIP release which will
support extents, and then we can tell people to use that....

- Ted


2006-10-10 21:18:34

by Andrew Morton

Subject: Re: [PATCH] Get rid of extents mount option - try 2

On Tue, 10 Oct 2006 17:07:18 -0400
Theodore Tso wrote:

> On Tue, Oct 10, 2006 at 01:48:18PM -0500, Dave Kleikamp wrote:
> > On Sat, 2006-10-07 at 13:20 -0400, Theodore Tso wrote:
> > > On Sat, Oct 07, 2006 at 10:53:47AM -0500, Dave Kleikamp wrote:
> > > > I noticed we are missing Documentation/filesystems/ext4.txt. Over the
> > > > weekend, I'll try to put something together with instructions on getting
> > > > the right version of e2fsprogs, etc.
> > >
> > > For now just say to grab the latest from:
> > >
> > > ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim
> >
> > Ted,
> > Do you think it's possible to put a source tarball out there with the
> > patches applied? It's confusing to untar the tarball only to find the
> > patchset. Otherwise, we'll have to beef up the instructions a little
> > bit.
>
> I was assuming that someone who knew how to deal with the -mm patchset
> would know how to deal with a patchset. I agree it should be
> changed to e2fsprogs-1.39-tyt1-broken-out.tar.gz, though.
>
> We can put a source tarball up, but sometimes the e2fsprogs-interim
> patches will be, well, about as stable as the -mm tree.

So put a stable one up ;)

Are people likely to care about e2fsprogs much? They'll just want to do
mkfs, maybe the occasional fsck. At this stage it's the kernel code we
want people to beat on.

> So people who
> assume a completely stable release may end up being a little
> disappointed. Hopefully by the time the ext4 stuff gets merged into
> the mainline kernel we'll have an e2fsprogs-WIP release which will
> support extents, and then we can tell people to use that....

I hope you didn't have anything else planned for this evening ;)

2006-10-11 17:03:38

by Andreas Dilger

Subject: Re: [RFC] [PATCH] Documentation/filesystems/ext4.txt

On Oct 10, 2006 15:02 -0500, Dave Kleikamp wrote:
> + - It's still mke2fs -j /dev/hda1

I would suggest "mke2fs -j -O dir_index -I 256 /dev/XXX" to be more
representative of what will be used in the future.

> +programs: http://e2fsprogs.sourceforge.net/
> + http://ext2resize.sourceforge.net

You should likely remove ext2resize from this list, it hasn't got any
support for extent-mapped files.


Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


2006-10-11 17:16:40

by Andreas Dilger

Subject: Re: [PATCH] Get rid of extents mount option - try 2

On Oct 06, 2006 16:21 -0500, Dave Kleikamp wrote:
> EXT4: Get rid of extents mount option
>
> Enabling an ext4 file system to use extents should be done with
> 'tune2fs -O extents' or 'mke2fs -O extents', not with a mount option

I would agree that the presence of INCOMPAT_EXTENTS should imply the
EXTENTS mount option, but it is also desirable to be able to turn this
off for testing. In our internal patches we also have a "noextents"
mount option to disable extents at runtime even if "extents" was given
as a default mount option.

So, I would leave most of the code as-is (with "test_opt(sb, EXTENTS)"),
and just have ext4_fill_super() enable EXT4_MOUNT_EXTENTS if
INCOMPAT_EXTENTS is set. This is a tiny bit tricky since parse_options()
is called before the superblock is read, so I suspect we'll need a
separate EXT4_MOUNT_NOEXTENTS to distinguish between no "extents" mount
option being given and "noextents" explicitly disabling it.

The Opt_noextents handling would clear EXT4_MOUNT_EXTENTS, and vice versa.
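
A minimal sketch of the flag handling described above, assuming two
independent bits and a two-step mount (options parsed first, on-disk
feature checked afterwards). The SKETCH_* names, the bit values and both
helper functions are hypothetical stand-ins, not the actual kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values for illustration; not the kernel's definitions. */
#define SKETCH_MOUNT_EXTENTS   0x1u
#define SKETCH_MOUNT_NOEXTENTS 0x2u

enum sketch_opt { SKETCH_OPT_EXTENTS, SKETCH_OPT_NOEXTENTS };

/*
 * Option-parsing step: runs before the on-disk feature flag is known,
 * so it only records what the user asked for. Each option clears the
 * other one's bit.
 */
static uint32_t sketch_parse_option(uint32_t mount_opt, enum sketch_opt opt)
{
    switch (opt) {
    case SKETCH_OPT_EXTENTS:
        mount_opt |= SKETCH_MOUNT_EXTENTS;
        mount_opt &= ~SKETCH_MOUNT_NOEXTENTS;
        break;
    case SKETCH_OPT_NOEXTENTS:
        mount_opt |= SKETCH_MOUNT_NOEXTENTS;
        mount_opt &= ~SKETCH_MOUNT_EXTENTS;
        break;
    }
    return mount_opt;
}

/*
 * Superblock step: the on-disk feature implies extents unless the user
 * explicitly passed "noextents".
 */
static uint32_t sketch_fill_super(uint32_t mount_opt, bool incompat_extents)
{
    if (incompat_extents && !(mount_opt & SKETCH_MOUNT_NOEXTENTS))
        mount_opt |= SKETCH_MOUNT_EXTENTS;
    return mount_opt;
}
```

With no option given, sketch_fill_super(0, true) turns extents on from the
feature flag alone; after a "noextents" option the same call leaves the
extents bit clear, which is the distinction the separate NOEXTENTS bit is
there to preserve.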

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


2006-10-12 14:20:09

by Valerie Clement

Subject: Re: [RFC] [PATCH] Documentation/filesystems/ext4.txt

Dave Kleikamp wrote:

> +2.2 Candidate features for future inclusion
> +
> +There are several under discussion; whether they all make it in is
> +partly a function of how much time everyone has to work on them:
> +
> +* improved file allocation (multi-block alloc, delayed alloc; basically done)
> +* fix 32000 subdirectory limit (patch exists, needs some e2fsck work)
> +* nsec timestamps for mtime, atime, ctime, create time (patch exists,
> + needs some e2fsck work)
> +* inode version field on disk (NFSv4, Lustre; prototype exists)
> +* reduced mke2fs/e2fsck time via uninitialized groups (prototype exists)
> +* journal checksumming for robustness, performance (prototype exists)
> +* persistent file preallocation (e.g. for streaming media, databases)

Could you add "support of larger block group size"?
Currently, a prototype exists, but we still have tests to do.

Thanks,
Valérie