2012-01-23 12:51:54

by Robin Dong

Subject: [RFC] Add new extent structure in ext4

Hi Ted, Andreas and the list,

After the bigalloc feature is completed in ext4, we could have much
larger block groups (and also bigger contiguous space), but the extent
structure of files currently limits extent sizes to below 128MB, which
is not optimal.

We could solve the problem by creating a new extent format to support
larger extent size, which looks like this:

struct ext4_extent2 {
	__le64 ee_block;	/* first logical block extent covers */
	__le64 ee_start;	/* starting physical block */
	__le32 ee_len;		/* number of blocks covered by extent */
	__le32 ee_flags;	/* flags and future extension */
};

struct ext4_extent2_idx {
	__le64 ei_block;	/* index covers logical blocks from 'block' */
	__le64 ei_leaf;		/* pointer to the physical block of the next level */
	__le32 ei_flags;	/* flags and future extension */
	__le32 ei_unused;	/* padding */
};

I think we could keep the structure of ext4_extent_header and add a new
incompat flag EXT4_FEATURE_INCOMPAT_EXTENTS2.

The new extent format could support 16TB of contiguous space and larger volumes.

What's your opinion?

--
Best Regards,
Robin Dong


2012-01-23 18:59:36

by Theodore Ts'o

Subject: Re: [RFC] Add new extent structure in ext4

On Mon, Jan 23, 2012 at 08:51:53PM +0800, Robin Dong wrote:
>
> We could solve the problem by creating a new extent format to support
> larger extent size, which looks like this:
>
> struct ext4_extent2 {
> __le64 ee_block; /* first logical block extent covers */
> __le64 ee_start; /* starting physical block */
> __le32 ee_len; /* number of blocks covered by extent */
> __le32 ee_flags; /* flags and future extension */
> };
>
> I think we could keep the structure of ext4_extent_header and add a new
> incompat flag EXT4_FEATURE_INCOMPAT_EXTENTS2.

The really unfortunate thing about using a 24 byte on-disk extent
structure is that you can only fit 2 extents in the inode before
needing to spill out to an external header.

So being able to support multiple extent formats in the inode (by using
a different eh_magic number) would probably be a good thing. In fact,
it might be useful to also have a version which looks like this:

struct ext4_extent_packed {
	__le32 ee_start_lo;
	__le16 ee_start_hi;
	__le16 ee_len;
};

i.e., something which only takes 8 bytes, but which is only used for
non-sparse files in the inode structure, so that you can fit 6 extents
in the inode.

The hard part will be cleaning up and refactoring the extent code to
support multiple on-disk extent formats. (That's going to be very
messy, though! So if we're going to go through all of that work, it
would be nice if it had advantages not only for huge file systems, but
also for desktop workloads.) Once this investment gets done, supporting
a third extent format should be relatively straightforward.

This would also allow us to make the new extent format be an RO_COMPAT
feature, so that an existing ext4 file system could be converted to
take advantage of the new extent encodings without needing to do a
backup / reformat / restore pass.

- Ted

2012-01-23 23:17:44

by Andreas Dilger

Subject: Re: [RFC] Add new extent structure in ext4

On 2012-01-23, at 11:59 AM, Ted Ts'o wrote:
> On Mon, Jan 23, 2012 at 08:51:53PM +0800, Robin Dong wrote:
>>
>> We could solve the problem by creating a new extent format to support
>> larger extent size, which looks like this:
>>
>> struct ext4_extent2 {
>> __le64 ee_block; /* first logical block extent covers */
>> __le64 ee_start; /* starting physical block */
>> __le32 ee_len; /* number of blocks covered by extent */
>> __le32 ee_flags; /* flags and future extension */
>> };
>>
>> I think we could keep the structure of ext4_extent_header and add a new
>> incompat flag EXT4_FEATURE_INCOMPAT_EXTENTS2.
>
> The really unfortunate thing about using a 24 byte on-disk extent
> structure is that you can only fit 2 extents in the inode before
> needing to spill out to an external header.
>
> So being able to support multiple extent formats in the inode (by using
> a different eh_magic number) would probably be a good thing. In fact,
> it might be useful to also have a version which looks like this:
>
> struct ext4_extent_packed {
> __le32 ee_start_lo;
> __le16 ee_start_hi;
> __le16 ee_len;
> };
>
> i.e., something which only takes 8 bytes, but which is only used for
> non-sparse files in the inode structure, so that you can fit 6 extents
> in the inode.

How does the code determine in advance whether a file is going to be
sparse or not? Does this mean that the extents would have to be changed
as soon as a hole is added to a file? That probably isn't bad if this
format is only used inside the inode, but would be very complex if it is
used for an indirect block.

Actually, my thought has been that it would be useful to have a new
"extent" format for block-mapped files that have fragmented on-disk
layout, like directories.

> The hard part will be cleaning up and refactoring the extent code to
> support multiple on-disk extent formats. (That's going to be very
> messy, though! So if we're going to go through all of that work, it
> would be nice if it had advantages not only for huge file systems, but
> also for desktop workloads.) Once this investment gets done, supporting
> a third extent format should be relatively straightforward.

... and fourth...

> This would also allow us to make the new extent format be an RO_COMPAT
> feature, so that an existing ext4 file system could be converted to
> take advantage of the new extent encodings without needing to do a
> backup / reformat / restore pass.

How could a new extent format be RO_COMPAT? Old kernels couldn't possibly
be able to read files with the new extent format. I guess you are
thinking that they are RO_COMPAT in the sense of "they don't crash old
kernels, but new files cannot be read"?

Cheers, Andreas






2012-01-24 13:34:39

by Jan Kara

Subject: Re: [RFC] Add new extent structure in ext4

Hello,

On Mon 23-01-12 20:51:53, Robin Dong wrote:
> After the bigalloc feature is completed in ext4, we could have much
> larger block groups (and also bigger contiguous space), but the extent
> structure of files currently limits extent sizes to below 128MB, which
> is not optimal.
It is not optimal, but does it really make a difference? I.e., what
improvement do you expect from enlarging extents from 128MB to, say, 4GB
(or do you expect to be consistently able to allocate contiguous chunks
larger than 4GB)? All you save is a single read of an indirect block...
Is that really worth the complications of another extent format? But
maybe I'm missing some benefit.

Honza

> We could solve the problem by creating a new extent format to support
> larger extent size, which looks like this:
>
> struct ext4_extent2 {
> __le64 ee_block; /* first logical block extent covers */
> __le64 ee_start; /* starting physical block */
> __le32 ee_len; /* number of blocks covered by extent */
> __le32 ee_flags; /* flags and future extension */
> };
>
> struct ext4_extent2_idx {
> __le64 ei_block; /* index covers logical blocks from 'block' */
> __le64 ei_leaf; /* pointer to the physical block of the next level */
> __le32 ei_flags; /* flags and future extension */
> __le32 ei_unused; /* padding */
> };
>
> I think we could keep the structure of ext4_extent_header and add a new
> incompat flag EXT4_FEATURE_INCOMPAT_EXTENTS2.
>
> The new extent format could support 16TB of contiguous space and larger volumes.
>
> What's your opinion?
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-01-24 17:31:48

by Andreas Dilger

Subject: Re: [RFC] Add new extent structure in ext4

On 2012-01-24, at 6:34, Jan Kara <[email protected]> wrote:
> On Mon 23-01-12 20:51:53, Robin Dong wrote:
>> After the bigalloc feature is completed in ext4, we could have much
>> larger block groups (and also bigger contiguous space), but the extent
>> structure of files currently limits extent sizes to below 128MB, which
>> is not optimal.
>
> It is not optimal, but does it really make a difference? I.e., what
> improvement do you expect from enlarging extents from 128MB to, say, 4GB
> (or do you expect to be consistently able to allocate contiguous chunks
> larger than 4GB)? All you save is a single read of an indirect block...
> Is that really worth the complications of another extent format? But
> maybe I'm missing some benefit.

What I'm (somewhat) interested in is increasing the maximum file size. IMHO, it would be better to do this with a larger block size (similar to bigalloc, but actually handling large blocks as a side benefit), since this will reduce the allocation overhead as well.

Even if the blocksize is only 64kB, that would allow files up to 256TB, and filesystems up to 2^64 bytes without the complexity of changing the extent format (which Ted looked at once and thought was difficult). Since Robin and Ted already did most of that work for bigalloc, I think the remaining effort would be manageable, especially if mmap is disabled on such a filesystem.

Increasing the maximum extent size may have some small benefit, but I don't think it would be noticeable, and would rarely be used due to fragmentation and such. A single index block with 128MB extents can already address over 16GB, and with large blocks this increases with the square of the blocksize (larger extents * more extents per index block).

Cheers, Andreas

>> We could solve the problem by creating a new extent format to support
>> larger extent size, which looks like this:
>>
>> struct ext4_extent2 {
>> __le64 ee_block; /* first logical block extent covers */
>> __le64 ee_start; /* starting physical block */
>> __le32 ee_len; /* number of blocks covered by extent */
>> __le32 ee_flags; /* flags and future extension */
>> };
>>
>> struct ext4_extent2_idx {
>> __le64 ei_block; /* index covers logical blocks from 'block' */
>> __le64 ei_leaf; /* pointer to the physical block of the next level */
>> __le32 ei_flags; /* flags and future extension */
>> __le32 ei_unused; /* padding */
>> };
>>
>> I think we could keep the structure of ext4_extent_header and add a new
>> incompat flag EXT4_FEATURE_INCOMPAT_EXTENTS2.
>>
>> The new extent format could support 16TB of contiguous space and larger volumes.
>>
>> What's your opinion?
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2012-01-25 22:48:50

by Dave Chinner

Subject: Re: [RFC] Add new extent structure in ext4

On Mon, Jan 23, 2012 at 08:51:53PM +0800, Robin Dong wrote:
> Hi Ted, Andreas and the list,
>
> After the bigalloc feature is completed in ext4, we could have much
> larger block groups (and also bigger contiguous space), but the extent
> structure of files currently limits extent sizes to below 128MB, which
> is not optimal.
>
> We could solve the problem by creating a new extent format to support
> larger extent size, which looks like this:
>
> struct ext4_extent2 {
> __le64 ee_block; /* first logical block extent covers */
> __le64 ee_start; /* starting physical block */
> __le32 ee_len; /* number of blocks covered by extent */
> __le32 ee_flags; /* flags and future extension */
> };
>
> struct ext4_extent2_idx {
> __le64 ei_block; /* index covers logical blocks from 'block' */
> __le64 ei_leaf; /* pointer to the physical block of the next level */
> __le32 ei_flags; /* flags and future extension */
> __le32 ei_unused; /* padding */
> };
>
> I think we could keep the structure of ext4_extent_header and add a new
> incompat flag EXT4_FEATURE_INCOMPAT_EXTENTS2.
>
> The new extent format could support 16TB of contiguous space and larger volumes.
>
> What's your opinion?

Just use XFS.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-01-25 23:03:11

by Andreas Dilger

Subject: Re: [RFC] Add new extent structure in ext4

On 2012-01-25, at 3:48 PM, Dave Chinner wrote:
> On Mon, Jan 23, 2012 at 08:51:53PM +0800, Robin Dong wrote:
>> Hi Ted, Andreas and the list,
>>
>> After the bigalloc feature is completed in ext4, we could have much
>> larger block groups (and also bigger contiguous space), but the extent
>> structure of files currently limits extent sizes to below 128MB, which
>> is not optimal.
>>
>> We could solve the problem by creating a new extent format to support
>> larger extent size, which looks like this:
>>
>> struct ext4_extent2 {
>> __le64 ee_block; /* first logical block extent covers */
>> __le64 ee_start; /* starting physical block */
>> __le32 ee_len; /* number of blocks covered by extent */
>> __le32 ee_flags; /* flags and future extension */
>> };
>>
>> struct ext4_extent2_idx {
>> __le64 ei_block; /* index covers logical blocks from 'block' */
>> __le64 ei_leaf; /* pointer to the physical block of the next level */
>> __le32 ei_flags; /* flags and future extension */
>> __le32 ei_unused; /* padding */
>> };
>>
>> I think we could keep the structure of ext4_extent_header and add a new
>> incompat flag EXT4_FEATURE_INCOMPAT_EXTENTS2.
>>
>> The new extent format could support 16TB of contiguous space and larger volumes.
>>
>> What's your opinion?
>
> Just use XFS.

Thanks for your troll.

If you have something actually useful to contribute, please feel free to post.
Otherwise, this is a list for ext4 development.

I don't encourage XFS users to switch to ext4 (or ZFS, for that matter, since
ZFS can do a lot of things that just aren't possible for XFS, and is now
available for Linux) on your mailing lists, and I'd appreciate the same
courtesy here...

Cheers, Andreas


2012-01-27 00:19:08

by Dave Chinner

Subject: Re: [RFC] Add new extent structure in ext4

On Wed, Jan 25, 2012 at 04:03:09PM -0700, Andreas Dilger wrote:
> On 2012-01-25, at 3:48 PM, Dave Chinner wrote:
> > On Mon, Jan 23, 2012 at 08:51:53PM +0800, Robin Dong wrote:
> >> Hi Ted, Andreas and the list,
> >>
> >> After the bigalloc feature is completed in ext4, we could have much
> >> larger block groups (and also bigger contiguous space), but the extent
> >> structure of files currently limits extent sizes to below 128MB, which
> >> is not optimal.
> >>
> >> We could solve the problem by creating a new extent format to support
> >> larger extent size, which looks like this:
> >>
> >> struct ext4_extent2 {
> >> __le64 ee_block; /* first logical block extent covers */
> >> __le64 ee_start; /* starting physical block */
> >> __le32 ee_len; /* number of blocks covered by extent */
> >> __le32 ee_flags; /* flags and future extension */
> >> };
> >>
> >> struct ext4_extent2_idx {
> >> __le64 ei_block; /* index covers logical blocks from 'block' */
> >> __le64 ei_leaf; /* pointer to the physical block of the next level */
> >> __le32 ei_flags; /* flags and future extension */
> >> __le32 ei_unused; /* padding */
> >> };
> >>
> >> I think we could keep the structure of ext4_extent_header and add a new
> >> incompat flag EXT4_FEATURE_INCOMPAT_EXTENTS2.
> >>
> >> The new extent format could support 16TB of contiguous space and larger volumes.
> >>
> >> What's your opinion?
> >
> > Just use XFS.
>
> Thanks for your troll.
>
> If you have something actually useful to contribute, please feel free to post.
> Otherwise, this is a list for ext4 development.

You can choose to see my comment as a troll, but it has a serious
message. If your use case is for large multi-TB files, then why
wouldn't you just use a filesystem that was designed for files that
large from the ground up, rather than try to extend a filesystem that
is already struggling with the file sizes it already supports? Not to
mention that very few people even need this functionality, and those
that do are using XFS right now.

Indeed, on current measures, a 15.95TB file on ext4 takes 330s to
allocate on my test rig, while XFS will do it in under *35
milliseconds*. What's the point of increasing the maximum file size
when it takes so long to allocate or free the space? If you can't
make the allocation and freeing scale to the existing file size limits
first, there's little point in introducing support for larger files.

And as an ext4 user, all I want from ext4 is for it to be stable like
ext3 is stable, not to have it continually destabilised by the addition
of incompatible feature after incompatible feature. Indeed, I can't
use ext4 in the places I'm using ext3 right now because ext4 is not
very resilient in the face of 20 system crashes a day. I generally
find that ext4 filesystems are irretrievably corrupted within a
week. In comparison, I have ext3 filesystems that have lasted more
than 3 years under such workloads without any corruption occurring.

So the long form of my 3-word comment is effectively: "If you need
multi-TB files, then use the filesystem most appropriate for that
workload instead of trying to make ext4 more complex and unstable
than it already is".

> I don't encourage XFS users to switch to ext4 (or ZFS, for that matter, since
> ZFS can do a lot of things that just aren't possible for XFS, and is now
> available for Linux) on your mailing lists, and I'd appreciate the same
> courtesy here...

Sorry, I didn't realise that I'm not allowed to tell ext4
people to use the filesystem most appropriate to their requirements.
Extending ext4 is not the right solution to every problem.

I say stuff like this w.r.t. "don't use XFS for that" or "XFS will
never support that" all the time on the XFS lists and IRC channels,
and nobody thinks that it is out of place. If you want to pop up and
say that "you should use ext4 for that" on the XFS lists then you
are welcome to do so. Such comments generally result in an
informative technical discussion of the pros and cons of why
something is or is not suited to the given requirement without
anyone being called a troll.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-01-27 14:27:30

by Tao Ma

Subject: Re: [RFC] Add new extent structure in ext4

Hi Dave,
On 01/27/2012 08:19 AM, Dave Chinner wrote:
> On Wed, Jan 25, 2012 at 04:03:09PM -0700, Andreas Dilger wrote:
>> On 2012-01-25, at 3:48 PM, Dave Chinner wrote:
>>> On Mon, Jan 23, 2012 at 08:51:53PM +0800, Robin Dong wrote:
>>>> Hi Ted, Andreas and the list,
>>>>
>>>> After the bigalloc feature is completed in ext4, we could have much
>>>> larger block groups (and also bigger contiguous space), but the extent
>>>> structure of files currently limits extent sizes to below 128MB, which
>>>> is not optimal.
>>>>
>>>> We could solve the problem by creating a new extent format to support
>>>> larger extent size, which looks like this:
>>>>
>>>> struct ext4_extent2 {
>>>> __le64 ee_block; /* first logical block extent covers */
>>>> __le64 ee_start; /* starting physical block */
>>>> __le32 ee_len; /* number of blocks covered by extent */
>>>> __le32 ee_flags; /* flags and future extension */
>>>> };
>>>>
>>>> struct ext4_extent2_idx {
>>>> __le64 ei_block; /* index covers logical blocks from 'block' */
>>>> __le64 ei_leaf; /* pointer to the physical block of the next level */
>>>> __le32 ei_flags; /* flags and future extension */
>>>> __le32 ei_unused; /* padding */
>>>> };
>>>>
>>>> I think we could keep the structure of ext4_extent_header and add a new
>>>> incompat flag EXT4_FEATURE_INCOMPAT_EXTENTS2.
>>>>
>>>> The new extent format could support 16TB of contiguous space and larger volumes.
>>>>
>>>> What's your opinion?
>>>
>>> Just use XFS.
>>
>> Thanks for your troll.
>>
>> If you have something actually useful to contribute, please feel free to post.
>> Otherwise, this is a list for ext4 development.
>
> You can choose to see my comment as a troll, but it has a serious
> message. If your use case is for large multi-TB files, then why
> wouldn't you just use a filesystem that was designed for files that
> large from the ground up, rather than try to extend a filesystem that
> is already struggling with the file sizes it already supports? Not to
> mention that very few people even need this functionality, and those
> that do are using XFS right now.
Robin is one of my colleagues. And to be frank, ext4 currently works
well in our production system, and we'd like to see it grow to fit our
future needs as well. I think that helps both the community and our
employer. Having said that, another reason we don't consider XFS as
our choice is that we don't think we have the ability to maintain two
file systems in our production system.
>
> Indeed, on current measures, a 15.95TB file on ext4 takes 330s to
> allocate on my test rig, while XFS will do it in under *35
> milliseconds*. What's the point of increasing the maximum file size
> when it takes so long to allocate or free the space? If you can't
> make the allocation and freeing scale to the existing file size limits
> first, there's little point in introducing support for larger files.
I think your test case here is biased, since you used the most
successful story for XFS. Yes, it is a little hard for a bitmap-based
file system to allocate a very large file if the bitmap is scattered
all over the disk, but I don't think ext4 can't close that gap on this
test case in the future. Let us wait and see. :)
>
> And as an ext4 user, all I want from ext4 is for it to be stable like
> ext3 is stable, not to have it continually destabilised by the addition
> of incompatible feature after incompatible feature. Indeed, I can't
> use ext4 in the places I'm using ext3 right now because ext4 is not
> very resilient in the face of 20 system crashes a day. I generally
> find that ext4 filesystems are irretrievably corrupted within a
> week. In comparison, I have ext3 filesystems that have lasted more
> than 3 years under such workloads without any corruption occurring.
OK, so next time you see corruption, please at least report it to
the mailing list so that the ext4 developers have a chance of seeing it.
Complaining doesn't improve anything.

I have read your original letter about the review process in XFS
development; it is good, and I guess ext4 should take it as a standard
process.
>
> So the long form of my 3-word comment is effectively: "If you need
> multi-TB files, then use the filesystem most appropriate for that
> workload instead of trying to make ext4 more complex and unstable
> than it already is".
I have read and watched the talk you gave at this year's LCA. Your
assessment of ext4 may be a little frightening, but it is good for the
ext4 community. In your talk you said that XFS was much slower than
ext4 in 2009-2010 for meta-intensive workloads, and now it works much
faster. So why do you think ext4 can't be improved the same way XFS
was?

Thanks
Tao
>
>> I don't encourage XFS users to switch to ext4 (or ZFS, for that matter, since
>> ZFS can do a lot of things that just aren't possible for XFS, and is now
>> available for Linux) on your mailing lists, and I'd appreciate the same
>> courtesy here...
>
> Sorry, I didn't realise that I'm not allowed to tell ext4
> people to use the filesystem most appropriate to their requirements.
> Extending ext4 is not the right solution to every problem.
>
> I say stuff like this w.r.t. "don't use XFS for that" or "XFS will
> never support that" all the time on the XFS lists and IRC channels,
> and nobody thinks that it is out of place. If you want to pop up and
> say that "you should use ext4 for that" on the XFS lists then you
> are welcome to do so. Such comments generally result in an
> informative technical discussion of the pros and cons of why
> something is or is not suited to the given requirement without
> anyone being called a troll.


2012-01-29 22:07:09

by Dave Chinner

Subject: Re: [RFC] Add new extent structure in ext4

On Fri, Jan 27, 2012 at 10:27:02PM +0800, Tao Ma wrote:
> Hi Dave,
> On 01/27/2012 08:19 AM, Dave Chinner wrote:
> > On Wed, Jan 25, 2012 at 04:03:09PM -0700, Andreas Dilger wrote:
> >> On 2012-01-25, at 3:48 PM, Dave Chinner wrote:
> >>> On Mon, Jan 23, 2012 at 08:51:53PM +0800, Robin Dong wrote:
> >>>> Hi Ted, Andreas and the list,
> >>>>
> >>>> After the bigalloc feature is completed in ext4, we could have much
> >>>> larger block groups (and also bigger contiguous space), but the extent
> >>>> structure of files currently limits extent sizes to below 128MB, which
> >>>> is not optimal.

.....

> >>>> The new extent format could support 16TB of contiguous space and larger volumes.
> >>>>
> >>>> What's your opinion?
> >>>
> >>> Just use XFS.
> >>
> >> Thanks for your troll.
> >>
> >> If you have something actually useful to contribute, please feel free to post.
> >> Otherwise, this is a list for ext4 development.
> >
> > You can choose to see my comment as a troll, but it has a serious
> > message. If your use case is for large multi-TB files, then why
> > wouldn't you just use a filesystem that was designed for files that
> > large from the ground up, rather than try to extend a filesystem that
> > is already struggling with the file sizes it already supports? Not to
> > mention that very few people even need this functionality, and those
> > that do are using XFS right now.
> Robin is one of my colleagues. And to be frank, ext4 currently works
> well in our production system, and we'd like to see it grow to fit our
> future needs as well.

Sure. But at the expense of the average user? ext4 is supposed to be
primarily the Linux desktop filesystem, yet all I see is people
trying to make it something for big, bigger and biggest. Bigalloc,
new extent formats, no-journal mode, dioread_nolock, COW snapshots,
secure delete, etc. It's a list of features that are somewhat
incompatible with each other that are useful to only a handful of
vendors or companies. Most have no relevance at all to the uses of
the majority of ext4 users.

This is what I'm getting at - I don't object to adding functionality
that is generically useful and applies to all filesystem configs,
but that's not what is happening. ext4 appears to have a development
mindset of "if we don't support X, then we can do Y" and I don't
think that serves the ext4 users very well at all.

BTW, if you think that is a harsh criticism, just reflect on the
insanity of the recent "we can support 64k block sizes if we just
disable mmap" discussion. Yes, that's great for Lustre, but it is
useless for everyone else...

> I think it helps both the community and our employer. Having
> said that, another reason why we don't consider of XFS as our choice is
> that we don't think we have the ability to maintain 2 file systems in
> our product system.

That's your choice as a product vendor, not mine as an ext4 user....

> > Indeed, on current measures, a 15.95TB file on ext4 takes 330s to
> > allocate on my test rig, while XFS will do it in under *35
> > milliseconds*. What's the point of increasing the maximum file size
> > when it takes so long to allocate or free the space? If you can't
> > make the allocation and freeing scale to the existing file size limits
> > first, there's little point in introducing support for larger files.
> I think your test case here is biased, since you used the most
> successful story for XFS. Yes, it is a little hard for a bitmap-based
> file system to allocate a very large file if the bitmap is scattered
> all over the disk,

Which is the case whenever the filesystem has been used for a while.
I did those tests on a pristine, empty filesystem, so the speed of
allocation only goes down from there. Bitmap-based allocation
degrades much, much faster than extent-tree based allocation,
especially when you have to search for the free space to allocate
from....

Indeed, how do you plan to test such large files robustly when it
takes so long to allocate the space for them? I mean, I can easily
test large files on XFS because of how quickly allocation occurs. I
can easily fragment free space and test large fragmented files
because of how quickly allocation occurs. But if the same tests that
take a minute to run on XFS take 4 orders of magnitude longer on
ext4, just how good is your test coverage going to be? What about
when you have different filesystem block sizes, or different mount
options, or doing it concurrently with an online resize?

IOWs, the slowness of the allocation greatly limits the ability to
test such a feature at the scale it is designed to support. That's
my big, overriding concern - with ext4 allocation being so slow, we
can't really test large files with enough thoroughness *right now*.
Increasing the file size is only going to make that problem worse
and that, to me, is a show stopper. If you can't test it properly,
then the change should not be made.

> but I don't think ext4 can't close that gap on this test case in the
> future. Let us wait and see. :)

How do you plan to fix it? If there isn't a plan, or it involves a
major on-disk format change, then aren't we back to square one about
adding intrusive, complex and destabilising features to a filesystem
that people are relying on to be stable?

> > And as an ext4 user, all I want is from ext4 to be stable like ext3
> > is stable, not have it continually destabilised by the addition of
> > incompatible feature after incompatible feature. Indeed, I can't
> > use ext4 in the places I'm using ext3 right now because ext4 is not
> > very resilient in the face of 20 system crashes a day. I generally
> > find that ext4 filesystems are irretrievable corrupted within a
> > week. In comparison, I have ext3 filesystems have lasted more than
> > 3 years under such workloads without any corruptions occurring.
> OK, so next time you see corruption, please at least report it to
> the mailing list so that the ext4 developers have a chance of seeing it.
> Complaining doesn't improve anything.

I won't be reporting corruptions because I stopped using ext4 more
than 6 months ago on these machines after the last batch of
unreproducible, unrepairable corruptions that occurred. I couldn't
get anything from the corpses (I do know how to analyse a corrupt
ext4 filesystem), so there really wasn't anything to report....

Generally speaking, the first sign of problems was a corrupted
binary or missing or empty file. The filesystem never complained or
detected corruption at runtime. By that stage, the original cause of
the corruption was unfindable because the problems may have happened
many crashes ago and been propagated further. Running e2fsck at that
point generally resulted in a mess with lots of stuff ending in
lost+found and multiply linked blocks being duplicated all over the
place. IOWs, an unrecoverable mess.

> > So the long form of my 3-word comment is effectively: "If you need
> > multi-TB files, then use the filesystem most appropriate for that
> > workload instead of trying to make ext4 more complex and unstable
> > than it already is".
> I have read and watched the talk you gave at this year's LCA. Your
> assessment of ext4 may be a little frightening, but it is good for the
> ext4 community. In your talk you said that XFS was much slower than
> ext4 in 2009-2010 for meta-intensive workloads, and now it works much
> faster. So why do you think ext4 can't be improved the same way XFS was?

Because all of the XFS changes talked about in that talk did not
change the on-disk format at all. They are *software-only* changes
and are completely transparent to users. They are even the default
behaviours now, so users with 10 year old XFS filesystems will also
benefit from them. And they can go back to their old kernels if they
don't like the new kernels, too...

We know that the problems ext4 has are much, much deeper and as this
thread shows require significant on-disk format changes to solve.
And they will only benefit those that have new filesystems or make
their old filesystems incompatible with old kernels. IOWs, the
changes being proposed don't help solve problems on all the existing
filesystems transparently. That's a *major* difference between
where XFS was 2 years ago and where ext4 is now.

Sure, given enough time and resources, any problem is solvable. But
really, do ext4 users need a new, incompatible, difficult-to-test
on-disk format to solve problems that most people will never
hit on their desktop and server systems before they migrate them to
BTRFS?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-01-30 20:41:58

by Eric Sandeen

[permalink] [raw]
Subject: Re: [RFC] Add new extent structure in ext4

On 1/23/12 6:51 AM, Robin Dong wrote:
> Hi Ted, Andreas and the list,
>
> After the bigalloc feature is completed in ext4, we could have much
> bigger block groups (and thus bigger contiguous space), but the extent
> structure of files now limits the extent size to below 128MB, which is
> not optimal.
>
> We could solve the problem by creating a new extent format to support
> larger extent size, which looks like this:
>
> struct ext4_extent2 {
> __le64 ee_block; /* first logical block extent covers */
> __le64 ee_start; /* starting physical block */
> __le32 ee_len; /* number of blocks covered by extent */
> __le32 ee_flags; /* flags and future extension */
> };
>
> struct ext4_extent2_idx {
> __le64 ei_block; /* index covers logical blocks from 'block' */
> __le64 ei_leaf; /* pointer to the physical block of the next level */
> __le32 ei_flags; /* flags and future extension */
> __le32 ei_unused; /* padding */
> };
>
> I think we could keep the structure of ext4_extent_header and add a new
> incompat flag EXT4_FEATURE_INCOMPAT_EXTENTS2.
>
> The new extent format could support 16TB of contiguous space and larger volumes.

(larger volumes?)

> What's your opinion?
>

I think that, mailing list drama aside ;), Dave has a decent point that we shouldn't
allow structures to scale out further than the code *using* them can scale.

In other words, if we already have some trouble being efficient with 2^32 blocks
in a file, it is risky and perhaps unwise to allow even larger files, until
those problems are resolved. At a minimum, I'd suggest that such a change
should not go in until it is demonstrated that ext4 can, in general, handle such
large file sizes efficiently.

It'd be nice to be able to self-host large sparse images for large fs testing,
though.

I suppose bigalloc solves that a little, though with some backing store space
usage penalty. I suppose if a bigalloc fs is hosted on a bigalloc fs, things
should (?) line up and be reasonable.

-Eric

2012-01-30 22:50:27

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC] Add new extent structure in ext4

On 2012-01-29, at 3:07 PM, Dave Chinner wrote:
> On Fri, Jan 27, 2012 at 10:27:02PM +0800, Tao Ma wrote:
>> Hi Dave,
>> On 01/27/2012 08:19 AM, Dave Chinner wrote:
>>> On Wed, Jan 25, 2012 at 04:03:09PM -0700, Andreas Dilger wrote:
>>>> On 2012-01-25, at 3:48 PM, Dave Chinner wrote:
>>>>> On Mon, Jan 23, 2012 at 08:51:53PM +0800, Robin Dong wrote:
>>>>>> Hi Ted, Andreas and the list,
>>>>>>
>>>>>> After the bigalloc feature is completed in ext4, we could have much
>>>>>> bigger block groups (and thus bigger contiguous space), but the extent
>>>>>> structure of files now limits the extent size to below 128MB, which is
>>>>>> not optimal.
>
> .....
>
>>>>>> The new extent format could support 16TB of contiguous space and larger volumes.
>>>>>>
>>>>>> What's your opinion?
>>>>>
>>>>> Just use XFS.
>>>>
>>>> Thanks for your troll.
>>>>
>>>> If you have something actually useful to contribute, please feel free to post.
>>>> Otherwise, this is a list for ext4 development.
>>>
>>> You can chose to see my comment as a troll, but it has a serious
>>> message. If that is your use case is for large multi-TB files, then
>>> why wouldn't you just use a filesystem that was designed for files
>>> that large from the ground up rather than try to extend a filesystem
>>> that is already struggling with file sizes that it already supports?
>>> Not to mention that very few people even need this functionality,
>>> and those that do right now are using XFS.
>>
>> Robin is one of my colleagues. And to be frank, ext4 works well currently
>> in our product system. And we'd like to see it grow to fit our future
>> needs as well.
>
> Sure. But at the expense of the average user? ext4 is supposed to be
> primarily the Linux desktop filesystem,

That is your opinion, as an XFS developer who is trying to keep XFS
relevant for some part of the market. Yet ext4 does extremely well
at both desktop and server workloads.

> yet all I see is people trying to make it something for big, bigger
> and biggest. Bigalloc, new extent formats, no-journal mode,
> dioread_nolock, COW snapshots, secure delete, etc. It's a list of
> features that are somewhat incompatible with each other that are
> useful to only a handful of vendors or companies. Most have no
> relevance at all to the uses of the majority of ext4 users.

??? This is quickly degrading into a mud slinging match. You claim
that "because ext4 is only relevant for desktops, it shouldn't try to
scale or improve performance". Should I similarly claim that "because
XFS is only relevant to gigantic SMP systems with huge RAID arrays it
shouldn't try to improve small file performance or be CPU efficient"?

Not at all. The ext4 users and developers choose it because it meets
their needs better than XFS for one reason or another, and we will
continue to improve it for everyone as long as we are interested in doing so.
The ext4 multi-block allocator was originally done for high-throughput
file servers, but it is totally relevant for desktop workloads today.
The same is true for delayed allocation, and other improvements in the
past. I imagine that bigalloc would be very welcome for media servers
and other large file IO environments.

> This is what I'm getting at - I don't object to adding functionality
> that is generically useful and applies to all filesystem configs,
> but that's not what is happening. ext4 appears to have a development
> mindset of "if we don't support X, then we can do Y" and I don't
> think that serves the ext4 users very well at all.
>
> BTW, if you think that is a harsh criticism, just reflect on the
> insanity of the recent "we can support 64k block sizes if we just
> disable mmap" discussion. Yes, that's great for Lustre, but it is
> useless for everyone else...

I don't see that at all. The complexity of blocksize > PAGE_SIZE
is greatly reduced if we don't have to support mmap IO. Of course
I'd be much happier if the VM supported this properly, but it's been
10 years and it hasn't happened, so waiting longer isn't reasonable.

To be honest, I totally agree that large blocks may not be relevant
for every desktop user. It may not even be relevant for Lustre, but
that isn't a valid reason not even to _discuss_ feature development
and see whether that leads us to an implementation that meets a number
of different needs.

Disabling mmap IO for some configurations doesn't prevent someone from
having a 4kB block LV for the root filesystem, and a separate data LV
for large file IO. It isn't that mmap for blocksize > PAGE_SIZE is
impossible to implement, but I'd rather see the code handling the
real-world use cases (efficient large file IO, filesystem portability
between IA64, PPC, ARM) than growing extra complexity to handle an
obscure use case (e.g. mmap file IO and binaries executed from a data
storage filesystem).

Once we get the mechanics of large block allocation, we can still look
into the complexity of mmap thereon. A large block ext4 filesystem
does not actually involve a disk format change, since large blocks have
been handled for ages by ext2/3/4 on CPUs with a larger PAGE_SIZE.
Handling mmap was in Robin's original submission, and I suggested that
we exclude it to reduce complexity in the initial implementation.

>> I think it helps both the community and our employer. Having
>> said that, another reason why we don't consider XFS as our choice is
>> that we don't think we have the ability to maintain 2 file systems in
>> our product system.
>
> That's your choice as a product vendor, not mine as an ext4 user....

You're suggesting that if I started using XFS on my home filesystems
then I get veto power over your development plans? Hmm, I don't think
that is going to happen. Later on, you claim that you aren't even an
ext4 user, so what is the point of your complaint?

The way it works is that anyone is free to develop any features they want
for ext4; they are free to post them to this list (or not), and the ext4
maintainers can evaluate them on functionality and performance in the
manner that they see fit, without any requirement that they be accepted,
keeping in mind that we _do_ take regular user needs into account.

The mere existence of a feature, nay even the discussion of a feature for
ext4, should not be stifled by the suggestion that XFS is the last word
in filesystems (especially since ZFS has already claimed that label :-).

>>> Indeed, on current measures, a 15.95TB file on ext4 takes 330s to
>>> allocate on my test rig, while XFS will do it under *35
>>> milliseconds*. What's the point of increasing the maximum file size
>>> when it takes so long to allocate or free the space? If you
>>> can't make the allocation and freeing scale first to the existing
>>> file size limits, there's little point in introducing support for
>>> larger files.
>>
>> I think your test case here is biased since you used the most successful
>> story from XFS. Yes, a bitmap-based file system finds it a little hard to
>> allocate a very large file if the bitmap is scattered all over the disk,
>
> Which is the case whenever the filesystem has been used for a while.
> I did those tests on a pristine, empty filesystem, so the speed of
> allocation only goes down from there. bitmap based allocation
> degrades much, much faster than extent-tree based allocation,
> especially when you have to search for the free space to allocate
> from....
>
> Indeed, how do you plan to test such large files robustly when it
> takes so long to allocate the space to them? I mean, I can easily
> test large files on XFS because of how quickly allocation occurs. I
> can easily fragment free space and test large fragmented files
> because of how quickly allocation occurs. But if the same tests that
> take a minute to run on XFS take 4 orders of magnitude longer on
> ext4, just how good is your test coverage going to be? What about
> when you have different filesystem block sizes, or different mount
> options, or doing it concurrently with an online resize?
>
> IOWs, the slowness of the allocation greatly limits the ability to
> test such a feature at the scale it is designed to support. That's
> my big, overriding concern - with ext4 allocation being so slow, we
> can't really test large files with enough thoroughness *right now*.
> Increasing the file size is only going to make that problem worse
> and that, to me, is a show stopper. If you can't test it properly,
> then the change should not be made.

Hmm, excellent suggestion. Maybe if we implement faster allocation
for ext4 your objections could be quieted? Wait, that is what you
are objecting to in the first place (bigalloc, large blocks, etc.),
along with any changes to ext4 that don't meet your approval.

>> but I do think ext4 can close the gap on this test case in the
>> future. Let us wait and see. :)
>
> How do you plan to fix it? If there isn't a plan, or it involves a
> major on-disk format change, then aren't we back to square one about
> adding intrusive, complex and destablising features to a filesystem
> that people are relying to be stable?
>
>>> And as an ext4 user, all I want from ext4 is to be stable like ext3
>>> is stable, not have it continually destabilised by the addition of
>>> incompatible feature after incompatible feature. Indeed, I can't
>>> use ext4 in the places I'm using ext3 right now because ext4 is not
>>> very resilient in the face of 20 system crashes a day. I generally
>>> find that ext4 filesystems are irretrievably corrupted within a
>>> week. In comparison, I have ext3 filesystems that have lasted more than
>>> 3 years under such workloads without any corruptions occurring.
>>
>> OK, so next time you see corruption, please at least report it to
>> the mailing list so that ext4 developers have a chance of seeing it.
>> Complaining alone doesn't improve anything.
>
> I won't be reporting corruptions because I stopped using ext4 more
> than 6 months ago on these machines after the last batch of
> unreproducable, unrepairable corruptions that occurred. I couldn't
> get anything from the corpses (I do know how to analyse a corrupt
> ext4 filesystem), so there really wasn't anything to report....
>
> Generally speaking, the first sign of problems was a corrupted
> binary or missing or empty file. The filesystem never complained or
> detected corruption at runtime. By that stage, the original cause of
> the corruption was unfindable because the problems may have happened
> many crashes ago and been propagated further. Running e2fsck at that
> point generally resulted in a mess with lots of stuff ending in
> lost+found and multiply linked blocks being duplicated all over the
> place. IOWs, an unrecoverable mess.

I haven't heard of similar problems reported here, but even the
existence of such bug reports can be useful to alert developers to
the existence of such a problem, and to help narrow down corruption
issues to a specific kernel version.

>>> So the long form of my 3-word comment is effectively: "If you need
>>> multi-TB files, then use the filesystem most appropriate for that
>>> workload instead of trying to make ext4 more complex and unstable
>>> than it already is".
>>
>> I have read and watched the talk you gave at this year's LCA; your
>> assessment of ext4 may be a little frightening, but it is good for
>> the ext4 community. In your talk, "xfs is much slower than ext4 in
>> 2009-2010 for meta-intensive workloads", and now it works much faster. So
>> why do you think ext4 can't be improved like xfs was?
>
> Because all of the XFS changes talked about in that talk did not
> change the on-disk format at all. They are *software-only* changes
> and are completely transparent to users. They are even the default
> behaviours now, so users with 10 year old XFS filesystems will also
> benefit from them. And they can go back to their old kernels if they
> don't like the new kernels, too...

That is only partly true. XFS had to change from 32-bit to 64-bit
inode numbers to get better performance, and that is not backward
compatible on 32-bit systems. XFS also changed the logging format
to be more efficient in order to not suck at metadata benchmarks.

> We know that the problems ext4 has are much, much deeper and as this
> thread shows require significant on-disk format changes to solve.

That is a very broad statement, and I think it is your extrapolation
from reading a snippet of one thread on this list.

> And they will only benefit those that have new filesystems or make
> their old filesystems incompatible with old kernels. IOWs, the
> changes being proposed don't help solve problems on all the existing
> filesystems transparently. That's a *major* difference between
> where XFS was 2 years ago and where ext4 is now.

Not true. The ext4 code can mount and run ancient ext2 filesystems
and shows a significant performance improvement without any on-disk
format changes. Ask google about their million(?) ext4 filesystems
and how they have improved with only a software update.

Maybe the converse could also be said, that the fact that XFS can
show so much performance improvement without changing the on-disk
format is a testament to how complex and badly written the old code
was? I think that argument holds as little value as yours, but I
don't jump up and down in [email protected] touting the fact that
ext4 is as fast as (or faster than) XFS for most real-world workloads
with only 1/2 of the code.

> Sure, given enough time and resources, any problem is solvable. But
> really, do ext4 users really need a new, incompatible, difficult to
> test on-disk formats to solve problems that most people will never
> hit on their desktop and server systems before they migrate them to
> BTRFS?

Again, you are entitled to your opinion, and are free to spend your
time and efforts where you like. I wish Chris all the best for Btrfs,
but having looked at that code I'm not in a hurry to move over to
using it for our production workloads, nor even for my home file server.

The joy of open source software is that everyone is free to make their
own choices. I've made mine, and along with many other developers and
users the choice has been ext4. Thanks for your input, we'll continue
to discuss and develop whatever we want, regardless of how much you
want everyone to use XFS.

Cheers, Andreas






2012-01-30 22:52:24

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC] Add new extent structure in ext4

On 2012-01-30, at 1:41 PM, Eric Sandeen wrote:
> On 1/23/12 6:51 AM, Robin Dong wrote:
>> After the bigalloc feature is completed in ext4, we could have much
>> bigger block groups (and thus bigger contiguous space), but the extent
>> structure of files now limits the extent size to below 128MB, which is
>> not optimal.
>>
>> The new extent format could support 16TB of contiguous space and larger volumes.
>
> (larger volumes?)

Strictly speaking, the current extent format "only" allows filesystems up
to 2^48 * blocksize bytes, typically 2^60 bytes. That in itself is not a
significant limitation IMHO, since there are a number of other format-based
limitations in this area (number of group descriptor blocks, etc.), plus
the overall question of whether we realistically expect a single filesystem
to be so big, none of which can be fixed by simply increasing the
addressable blocks per file.

Those format-based limits would not be present if we could handle a larger
blocksize for the filesystem, since the number of groups is reduced by
the square of the blocksize increase, as are a number of other limits.

>> What's your opinion?
>
> I think that, mailing list drama aside ;), Dave has a decent point that we
> shouldn't allow structures to scale out further than the code *using* them
> can scale.
>
> In other words, if we already have some trouble being efficient with 2^32
> blocks in a file, it is risky and perhaps unwise to allow even larger
> files, until those problems are resolved. At a minimum, I'd suggest that
> such a change should not go in until it is demonstrated that ext4 can, in
> general, handle such large file sizes efficiently.

I think the issue that Dave pointed out (efficiency of allocating large
files) is one that has partially been addressed by bigalloc. Using bigalloc
allows larger clusters to be allocated much more efficiently, but it only
gets us part of the way there.

> It'd be nice to be able to self-host large sparse images for large fs
> testing, though. I suppose bigalloc solves that a little, though with
> some backing store space usage penalty. I suppose if a bigalloc fs is
> hosted on a bigalloc fs, things should (?) line up and be reasonable.

This is the one limitation of bigalloc - it doesn't change the underlying
filesystem blocksize. That means the current extent format still cannot
address more than 2^32 blocks in a single file, so self-hosting filesystem
images over 16TB with 4kB blocksize is not possible with bigalloc. It
_would_ be possible with a larger filesystem blocksize, and the bigalloc
code already paved the way for most of that to happen.

The joy of allowing large blocks for 4kB PAGE_SIZE is that it _doesn't_
involve an on-disk format change, and would have the added benefit that
it would allow mounting IA64, PPC, ARM, SPARC, etc. filesystems directly,
and facilitate migration or disaster recovery from those aging platforms.

Cheers, Andreas






2012-01-30 23:52:48

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC] Add new extent structure in ext4

As a large meta-comment, let me say that I find that most
conversations about which file systems users "should" use are very often
not very useful. Even less useful is what developers "should" be
working on. In that way, my philosophy of ext4 is that it should be
like the Linux kernel; it's an evolutionary process and central
planning is often overrated. People contribute to ext4 for many
different reasons, and that means they optimize ext4 for their
particular workloads. Like Linus for Linux, we're not trying to
architect for "world domination" by saying, "hmm, in order to 'take
out' reiserfs4, we'd better implement features foo and bar".

Instead, it's things like "gee, this company over here is interested
in using ext4 as a back-end store for a cluster file system where the
journal is unnecessary overhead and performance under severe memory
pressure is important" --- and so we got no journal mode and some
improvements to the block allocator so it works better under those
conditions.

People contribute to ext4 for different goals, just as people
contribute to Linux for different goals. And just as there are times
when improvements for big servers have improved Linux's capabilities
for embedded machines, and vice versa, there are similar things that
can and have happened for ext4 (such as extents and the multi-block
allocator originally being developed for Lustre, but which have been
very useful for many other use cases).

Personally, I find that I get a lot more joy out of programming to
make a codebase better --- as opposed to programming with the goal of
killing off some other codebase, or discouraging users from using some
other codebase.

Now that's an open source approach to things. Things are no doubt
very different if you are trying to allocate engineering resources at
a distribution. So there may be some tensions between a desire from
an open source perspective to be as flexible as possible, and a
company's position that they only want to support a limited set of
configuration options. I think those decisions are ones which are
best made by the distribution, and not as part of the open source
process. After all, what might make sense for one distribution's
customer base and business model, might not make sense for another's.

There are some dangers to that model; for example, RAID support was
only implemented for Lustre's private in-kernel (and out-of-tree)
API. The smarts needed in ext4's writepages codepath to properly
handle RAID are currently lacking. I'd work on it,
except that I don't personally (nor does my employer) have a strong
need to worry about RAID systems. I'll certainly integrate code that
fixes that problem, and I'm confident that eventually someone will
decide that's the one bit of improvement they need so that ext4 is a
good match for their use case. I'm definitely not going to stress
that this is something we have to do right away just so we can kill
off XFS; most of us are hopefully working on ext4 because it's fun,
and secondarily because amazingly enough our employers are willing to
pay for us to work on something cool. (Just as I'm glad most Linux
kernel developers weren't waking up trying to think up ways to kill
off FreeBSD or try to put the Mark Williams Company out of business. :-)

Let me also add that competition is a good thing. It keeps all of us
on our toes. Legacy unix systems accepted that system calls and
context switches were naturally slow, until Linux proved that it could
be done very quickly and efficiently. SGI didn't bother dealing with
XFS's slow metadata performance even though they were selling desktops
during its original development. It was only when Ric Wheeler (as he
tells the story) told the XFS developers how much XFS lagged on
fs_mark that there was a strong effort to address those issues, over a
decade and a half after XFS's original deployment. That's why I don't
believe it's productive to say that a particular file system has no
place in an ecosystem. If developers are continuing to work on an OS,
or a file system, and if users continue to use it, then of course it
has a place. You might not understand why that might be true
initially, but in general it's not because everyone is being
foolish/stupid.

One last observation. It's dangerous to focus on just one benchmark;
especially if it is a micro-benchmark. As a tool to improve one
aspect of a file system's performance, it's certainly useful. But how
many workloads will really hammer a file system with 16 cores, by
creating lots of small files and nothing else? I have no doubt that
we could improve ext4's scalability for that particular workload. But
is that a deadly shortcoming that should cause ext4 developers to drop
everything else they are doing and work on this problem, lest users
immediately reformat their disks and switch to another file system
because ext4's block allocator isn't as scalable as it could be for
lots of small block allocations done in parallel?

I'd suggest that might be an over-reaction.

Best regards,

- Ted

2012-02-01 03:57:13

by Dave Chinner

[permalink] [raw]
Subject: Re: [RFC] Add new extent structure in ext4

On Mon, Jan 30, 2012 at 03:50:24PM -0700, Andreas Dilger wrote:
> On 2012-01-29, at 3:07 PM, Dave Chinner wrote:
> > yet all I see is people trying to make it something for big, bigger
> > and biggest. Bigalloc, new extent formats, no-journal mode,
> > dioread_nolock, COW snapshots, secure delete, etc. It's a list of
> > features that are somewhat incompatible with each other that are
> > useful to only a handful of vendors or companies. Most have no
> > relevance at all to the uses of the majority of ext4 users.
>
> ??? This is quickly degrading into a mud slinging match. You claim
> that "because ext4 is only relevant for desktops, it shouldn't try to
> scale or improve performance". Should I similarly claim that "because
> XFS is only relevant to gigantic SMP systems with huge RAID arrays it
> shouldn't try to improve small file performance or be CPU efficient"?

You can if you want.....

But then I'll just point to Eric Whitney's latest results showing
XFS is generally slightly more CPU efficient than ext4, and performs
as well as ext4 on the small file workload he ran. :)

> Not at all. The ext4 users and developers choose it because it meets
> their needs better than XFS for one reason or another, and we will

More likely is that most desktop users choose ext4 because it is the
default filesystem their distribution installs, not because they
know anything about it or any other linux filesystem....

> continue to improve it for everyone as long as we are interested in doing so.
> The ext4 multi-block allocator was originally done for high-throughput
> file servers, but it is totally relevant for desktop workloads today.
> The same is true for delayed allocation, and other improvements in the
> past. I imagine that bigalloc would be very welcome for media servers
> and other large file IO environments.

Yes, it will help certain workloads, but it isn't a general solution
to the allocation scalability problems. It also requires informed
and knowledgeable users who know about such features, when it is best
to use them, and when not to use them.

One of the things that I'm concerned about is that the changes
being made add new upfront decisions that users have to be
informed about and understand sufficiently to be able to make the
correct choice. You're making the assumption that users are
informed and knowledgeable, and all filesystem developers should know
this is simply not true. Users repeatedly demonstrate that they
don't know how filesystems work, don't understand the knobs that
are provided, don't understand what their applications do in terms
of filesystem operations and don't really understand their data
sets. Education takes time and effort, but still users make the same
mistakes over and over again.

That's the reason why we have the mantra "use the defaults" when it
comes to users asking questions about how to optimise an XFS
filesystem. XFS is almost at the point where the defaults work for
most people, from $300 ARM-based NAS boxes all the way up to
multi-million dollar supercomputers. That's what we should be
delivering to users - something that just works. Special case
solutions should be few and far between, and only in those cases
should education about the various options be necessary.

That ext4 now has a much more complex configuration matrix than XFS,
and that developers are expecting users to understand that matrix
and how it relates to their systems and workloads without prior
experience seems like a pretty valid concern to me.

> > IOWs, the slowness of the allocation greatly limits the ability to
> > test such a feature at the scale it is designed to support. That's
> > my big, overriding concern - with ext4 allocation being so slow, we
> > can't really test large files with enough thoroughness *right now*.
> > Increasing the file size is only going to make that problem worse
> > and that, to me, is a show stopper. If you can't test it properly,
> > then the change should not be made.
>
> Hmm, excellent suggestion. Maybe if we implement faster allocation
> for ext4 your objections could be quieted? Wait, that is what you
> are objecting to in the first place (bigalloc, large blocks, etc.),
> along with any changes to ext4 that don't meet your approval.

bigalloc is not a solution to the use case that I initially found
this problem on - filling large filesystems quickly before starting
testing. Regardless of the existence of bigalloc, we still need to
test large 4k block size, 4k alloc size filesystems because that is
what users will mostly use.

Further, bigalloc makes the large filesystem test matrix more
complex and time consuming - we now have to test default configs as
well as bigalloc filesystems. And if this new extent format change
goes in, suddenly it is "defaults X bigalloc (various sizes) X
extent format". This gets impossible to test very quickly, and so
we end up with a mess of options that nobody really knows how well
they work together because they simply aren't adequately tested.

I've been trying to help address this large scale testing problem -
to make >16TB filesystem testing for ext4 and btrfs as well as XFS
easy to do through xfstests. Allocation speed is just one of the
initial problems I'm coming across for both ext4 and BTRFS. Having
easily repeatable tests for large filesystems is fundamental to
being able to support such filesystems.

However, requiring magic pixie dust to enable such testing raises a
serious question about the suitability of the filesystem for such
usage. And then further expanding support in an area that is known
to be deficient seems very misguided to me - it doesn't make testing
any easier, and it makes testing large files and filesystems even
more time consuming. This is a serious problem, and that's why I'm
asking whether this change is even something that should be done in
the first place.

Yes, I could have said it better than a throw-away, one-line
comment. But I'm trying to explain the many reasons I had for the
glib comment, because that comment was based on problems that I've seen
over the past year or so trying to use and test ext4....

> >> I have read and watched the talk you gave at this year's LCA;
> >> your assessment of ext4 may be a little frightening, but it
> >> is good for the ext4 community. In your talk, "xfs is much
> >> slower than ext4 in 2009-2010 for meta-intensive workloads", and
> >> now it works much faster. So why do you think ext4 can't be
> >> improved like xfs was?
> >
> > Because all of the XFS changes talked about in that talk did not
> > change the on-disk format at all. They are *software-only*
> > changes and are completely transparent to users. They are even
> > the default behaviours now, so users with 10 year old XFS
> > filesystems will also benefit from them. And they can go back to
> > their old kernels if they don't like the new kernels, too...
>
> That is only partly true. XFS had to change from 32-bit to 64-bit
> inode numbers to get better performance, and that is not backward
> compatible on 32-bit systems. XFS also changed the logging format
> to be more efficient in order to not suck at metadata benchmarks.

Not true, but it's irrelevant to the above discussion, anyway, so I
won't waste time going down this path any further....

Cheers,

Dave.
--
Dave Chinner
[email protected]