2008-06-02 21:50:52

by Thomas King

Subject: Questions for article

Folks,

I am writing an article for Linux.com to answer Henry Newman's at
http://www.enterprisestorageforum.com/sans/features/article.php/3749926. Is
there anyone that can field a few questions on ext4?

Thanks!
Tom King


2008-06-02 22:30:54

by Eric Sandeen

Subject: Re: Questions for article

Thomas King wrote:
> Folks,
>
> I am writing an article for Linux.com to answer Henry Newman's at
> http://www.enterprisestorageforum.com/sans/features/article.php/3749926. Is
> there anyone that can field a few questions on ext4?
>
> Thanks!
> Tom King


Honestly I'm not sure it's worth feeding the trolls... that guy has some
points but is sufficiently off-base to make me wonder if he actually has
any broad Linux filesystem experience. ...But anyway, I'd just ask the
questions on-list if you don't mind a collaborative answer. :)

-Eric

2008-06-02 22:59:45

by Andreas Dilger

Subject: Re: Questions for article

On Jun 02, 2008 16:50 -0500, Thomas King wrote:
> I am writing an article for Linux.com to answer Henry Newman's at
> http://www.enterprisestorageforum.com/sans/features/article.php/3749926. Is
> there anyone that can field a few questions on ext4?

It depends on what you are proposing to write... Henry's comments are
mostly accurate. There isn't even support for > 16TB filesystems in
e2fsprogs today, so I wouldn't go rushing into an email saying "ext4
can support a single 100TB filesystem today". It wouldn't be too hard
to take a 100TB Lustre filesystem and run it on a single node, but I
doubt anyone would actually want to do that and it still doesn't meet
the requirements of "a single instance filesystem".
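
(To make the 16TB figure concrete: it is just 32-bit block numbers, which is all
e2fsprogs handles today, times the usual 4kB blocksize. A throwaway C sketch of
the arithmetic, nothing ext4-specific:)

#include <stdio.h>
#include <stdint.h>

/* Rough sketch of where the ~16TB ceiling comes from: with 32-bit block
 * numbers and 4kB blocks, the maximum addressable size is
 * 2^32 * 4096 bytes = 16 TiB. */
int main(void)
{
    uint64_t max_blocks = 1ULL << 32;       /* 32-bit block numbers */
    uint64_t block_size = 4096;             /* common 4kB blocksize */
    printf("max fs size: %llu TiB\n",
           (unsigned long long)(max_blocks * block_size >> 40));
    return 0;
}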

What is noteworthy is that the comments about IO not being aligned
to RAID boundaries are only partly correct. This is actually done in
ext4 with mballoc (assuming you set these boundaries in the superblock
manually), and is also done by XFS automatically. The RAID geometry
detection code should be added to mke2fs also, if someone would be
interested. The ext4/mballoc code does NOT align the metadata to RAID
boundaries, though this is being worked on also.
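
(For anyone wondering what "set these boundaries in the superblock" boils down
to: two numbers, blocks per RAID chunk and blocks per full stripe. An untested
C sketch of that arithmetic, with a made-up RAID geometry:)

#include <stdio.h>

/* Illustration only (hypothetical geometry): how the RAID alignment hints
 * are typically derived.  "stride" is the number of filesystem blocks per
 * disk chunk; "stripe width" is stride times the number of data-bearing
 * disks.  These are the boundaries mballoc tries to align data to. */
int main(void)
{
    unsigned chunk_kb   = 64;    /* RAID chunk size, e.g. 64kB (assumed)  */
    unsigned block_kb   = 4;     /* filesystem blocksize, 4kB             */
    unsigned data_disks = 6;     /* e.g. 8-disk RAID6 => 6 data disks     */

    unsigned stride       = chunk_kb / block_kb;      /* 16 blocks */
    unsigned stripe_width = stride * data_disks;      /* 96 blocks */

    printf("stride=%u blocks, stripe-width=%u blocks\n", stride, stripe_width);
    return 0;
}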

The mballoc code also does efficient block allocations (multi-MB at a
time), BUT there is no userspace interface for this yet, except O_DIRECT.
The delayed allocation (delalloc) patches for ext4 are still in the unstable
part of the patch series... What Henry is misunderstanding here is that
the filesystem blocksize isn't necessarily the maximum unit for space
allocation. I agree we could do this more efficiently (e.g. allocate an
entire 128MB block group at a time for large files), but we haven't gotten
there yet.
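
(Since O_DIRECT is currently the only userspace path that exercises those large
allocations, here is a minimal, illustrative C sketch of one big aligned
O_DIRECT write; the file name and sizes are made up:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Sketch: a single large, aligned O_DIRECT write, the kind of request that
 * lets mballoc hand back a big contiguous extent in one go. */
int main(void)
{
    const size_t len = 8 << 20;                  /* 8MB request */
    void *buf;
    int fd = open("/mnt/test/bigfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return 1;
    if (posix_memalign(&buf, 4096, len))         /* O_DIRECT needs alignment */
        return 1;
    memset(buf, 0, len);
    if (write(fd, buf, len) != (ssize_t)len)     /* one syscall, one big allocation */
        return 1;
    close(fd);
    free(buf);
    return 0;
}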

There are a large number of IO performance improvements in ext4 due to
work to improve IO server performance for Lustre (which Henry is of
course familiar with), and for Lustre at least we are able to get IO
performance in the 2GB/s range on 42 x 50MB/s disks with software RAID 0
(Sun x4500), but these are with O_DIRECT.

For the fsck front, there have been performance improvements recently
(uninit_bg), and more arriving soon (flex_bg and block metadata
clustering), but that is still a long way from removing the need for
e2fsck in case of corruption.
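
(The idea behind uninit_bg, sketched below: block groups that were never used
are flagged in their group descriptors, so e2fsck can skip them instead of
reading every bitmap on disk. Flag names and values here are illustrative,
per my reading of the patches:)

#include <stdint.h>
#include <stdio.h>

#define BG_INODE_UNINIT  0x0001   /* inode table/bitmap not initialized (illustrative) */
#define BG_BLOCK_UNINIT  0x0002   /* block bitmap not initialized (illustrative)       */

struct group_desc { uint16_t bg_flags; /* ... */ };

int main(void)
{
    /* Pretend filesystem with 4 block groups, two of them never touched. */
    struct group_desc gd[4] = {
        { 0 }, { BG_BLOCK_UNINIT | BG_INODE_UNINIT },
        { BG_BLOCK_UNINIT | BG_INODE_UNINIT }, { 0 },
    };
    unsigned g, skipped = 0;
    for (g = 0; g < 4; g++)
        if (gd[g].bg_flags & (BG_INODE_UNINIT | BG_BLOCK_UNINIT))
            skipped++;        /* never-used group: no bitmaps to read or check */
    printf("skipped %u of 4 groups\n", skipped);
    return 0;
}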

Similarly, Lustre (with ext3) can scale to a 10M file directory reasonably
(though not superbly) for a certain kind of workload. On the other hand,
this can be really nasty with a "readdir+stat" kind of workload. Lustre
also runs with filesystems > 250M files total, but I haven't heard of
e2fsck performance for such filesystems.
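
(The "readdir+stat" pattern in question is the classic ls -l / backup-scan
loop; an illustrative C sketch with a hypothetical path, showing why it costs
one inode lookup per directory entry:)

#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

/* Sketch: every entry returned by readdir() costs an extra stat(), and on a
 * 10M-file directory those inode lookups turn into millions of scattered
 * reads.  Directory name is made up. */
int main(void)
{
    char path[4096];
    struct stat st;
    struct dirent *de;
    unsigned long nfiles = 0;
    DIR *d = opendir("/mnt/test/bigdir");

    if (!d)
        return 1;
    while ((de = readdir(d)) != NULL) {
        snprintf(path, sizeof(path), "/mnt/test/bigdir/%s", de->d_name);
        if (stat(path, &st) == 0)        /* one random inode read per entry */
            nfiles++;
    }
    closedir(d);
    printf("stat()ed %lu entries\n", nfiles);
    return 0;
}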


I'd personally tend to keep quiet until we CAN show that ext4
runs well on a 100TB filesystem, that e2fsck time isn't fatal, etc.


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-06-03 00:40:58

by Eric Sandeen

Subject: Re: Questions for article

Andreas Dilger wrote:
> On Jun 02, 2008 16:50 -0500, Thomas King wrote:
>> I am writing an article for Linux.com to answer Henry Newman's at
>> http://www.enterprisestorageforum.com/sans/features/article.php/3749926. Is
>> there anyone that can field a few questions on ext4?
>
> It depends on what you are proposing to write... Henry's comments are
> mostly accurate.

But others are way off base IMHO, to the point where I don't put a lot
of stock in the article. fsck only checks the log? Hardly. No linux
filesystem does proper geometry alignment? XFS has for years.

He seems to take ext3 weaknesses and extrapolate to all linux
filesystems. The fact that he suggests testing a 500T ext3 filesystem
indicates a ... lack of research. Never mind that had he done that
research he'd have found that you, well... you can't do it. :) On the
one hand it proves his point about scalability (of ext3) but on the other hand indicates
other hand indicates that he's not completely investigated the problem
of linux filesystem scalability, himself.

Of the tests he proposes, he's clearly not bothered to do them himself.
A 100 million inode filesystem is not that uncommon on xfs, and some of
the tests he proposes are probably in daily use at SGI customers.

So writing an article about ext4 to refute all his arguments might be
premature, but dismissing all linux filesystems based on ext3
shortcomings is also shortsighted. He has some valid points but saying
"fscking a multi-terabyte fs is too slow on linux" without showing that
it actually *is* slow on linux, or that it *is* fast on $whatever_else,
is just hand-waving. On the other hand it's a very hard test for mere
mortals to run. :)

-Eric

> There isn't even support for > 16TB filesystems in
> e2fsprogs today, so I wouldn't go rushing into an email saying "ext4
> can support a single 100TB filesystem today". It wouldn't be too hard
> to take a 100TB Lustre filesystem and run it on a single node, but I
> doubt anyone would actually want to do that and it still doesn't meet
> the requirements of "a single instance filesystem".
>
> What is noteworthy is that the comments about IO not being aligned
> to RAID boundaries is only partly correct. This is actually done in
> ext4 with mballoc (assuming you set these boundaries in the superblock
> manually), and is also done by XFS automatically. The RAID geometry
> detection code should be added to mke2fs also, if someone would be
> interested. The ext4/mballoc code does NOT align the metadata to RAID
> boundaries, though this is being worked on also.
>
> The mballoc code also does efficient block allocations (multi-MB at a
> time), BUT there is no userspace interface for this yet, except O_DIRECT.
> The delayed allocation (delalloc) patches for ext4 are still in the unstable
> part of the patch series... What Henry is misunderstanding here is that
> the filesystem blocksize isn't necessarily the maximum unit for space
> allocation. I agree we could do this more efficiently (e.g. allocate an
> entire 128MB block group at a time for large files), but we haven't gotten
> there yet.
>
> There are a large number of IO performance improvements in ext4 due to
> work to improve IO server performance for Lustre (which Henry is of
> course familiar with), and for Lustre at least we are able to get IO
> performance in the 2GB/s range on 42 50MB/s disks with software RAID 0
> (Sun x4500), but these are with O_DIRECT.
>
> For the fsck front, there have been performance improvements recently
> (uninit_bg), and more arriving soon (flex_bg and block metadata
> clustering), but that is still a far way from removing the need for
> e2fsck in case of corruption.
>
> Similarly, Lustre (with ext3) can scale to a 10M file directory reasonably
> (though not superbly) for a certain kind of workload. On the other hand,
> this can be really nasty with a "readdir+stat" kind of workload. Lustre
> also runs with filesystems > 250M files total, but I haven't heard of
> e2fsck performance for such filesystems.
>
>
> I'd personally tend to keep quiet until we CAN show that ext4
> runs well on a 100TB filesystem, that e2fsck time isn't fatal, etc.
>
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>


2008-06-03 15:17:31

by Thomas King

Subject: Re: Questions for article

> Andreas Dilger wrote:
>> On Jun 02, 2008 16:50 -0500, Thomas King wrote:
>>> I am writing an article for Linux.com to answer Henry Newman's at
>>> http://www.enterprisestorageforum.com/sans/features/article.php/3749926. Is
>>> there anyone that can field a few questions on ext4?
>>
>> It depends on what you are proposing to write... Henry's comments are mostly
>> accurate.
>
> But others are way off base IMHO, to the point where I don't put a lot of
> stock in the article. fsck only checks the log? Hardly. No linux filesystem
> does proper geometry alignment? XFS has for years.
>
> He seems to take ext3 weaknesses and extrapolate to all linux
> filesystems. The fact that he suggests testing a 500T ext3 filesystem
> indicates a ... lack of research. Never mind that had he done that research
> he'd have found that you, well... you can't do it. :) On the one hand it
> proves his point about scalability (of ext3) but on the other hand indicates
> that he's not completely investigated the problem of linux filesystem
> scalability, himself.
>
> Of the tests he proposes, he's clearly not bothered to do them himself.
> A 100 million inode filesystem is not that uncommon on xfs, and some of
> the tests he proposes are probably in daily use at SGI customers.
>
> So writing an article about ext4 to refute all his arguments might be
> premature, but dismissing all linux filesystems based on ext3
> shortcomings is also shortsighted. He has some valid points but saying
> "fscking a multi-terabyte fs is too slow on linux" without showing that it
> actually *is* slow on linux, or that it *is* fast on $whatever_else, is just
> hand-waving. On the other hand it's a very hard test for mere mortals to
> run. :)
>
> -Eric
>
>> There isn't even support for > 16TB filesystems in
>> e2fsprogs today, so I wouldn't go rushing into an email saying "ext4 can
>> support a single 100TB filesystem today". It wouldn't be too hard to take a
>> 100TB Lustre filesystem and run it on a single node, but I doubt anyone would
>> actually want to do that and it still doesn't meet the requirements of "a
>> single instance filesystem".
>>
>> What is noteworthy is that the comments about IO not being aligned to RAID
>> boundaries is only partly correct. This is actually done in ext4 with
>> mballoc (assuming you set these boundaries in the superblock manually), and
>> is also done by XFS automatically. The RAID geometry detection code should
>> be added to mke2fs also, if someone would be interested. The ext4/mballoc
>> code does NOT align the metadata to RAID boundaries, though this is being
>> worked on also.
>>
>> The mballoc code also does efficient block allocations (multi-MB at a time),
>> BUT there is no userspace interface for this yet, except O_DIRECT. The
>> delayed allocation (delalloc) patches for ext4 are still in the unstable part
>> of the patch series... What Henry is misunderstanding here is that the
>> filesystem blocksize isn't necessarily the maximum unit for space allocation.
>> I agree we could do this more efficiently (e.g. allocate an entire 128MB
>> block group at a time for large files), but we haven't gotten there yet.
>>
>> There are a large number of IO performance improvements in ext4 due to work
>> to improve IO server performance for Lustre (which Henry is of course
>> familiar with), and for Lustre at least we are able to get IO performance in
>> the 2GB/s range on 42 50MB/s disks with software RAID 0 (Sun x4500), but
>> these are with O_DIRECT.
>>
>> For the fsck front, there have been performance improvements recently
>> (uninit_bg), and more arriving soon (flex_bg and block metadata clustering),
>> but that is still a far way from removing the need for e2fsck in case of
>> corruption.
>>
>> Similarly, Lustre (with ext3) can scale to a 10M file directory reasonably
>> (though not superbly) for a certain kind of workload. On the other hand,
>> this can be really nasty with a "readdir+stat" kind of workload. Lustre also
>> runs with filesystems > 250M files total, but I haven't heard of e2fsck
>> performance for such filesystems.
>>
>> I'd personally tend to keep quiet until we CAN show that ext4
>> runs well on a 100TB filesystem, that e2fsck time isn't fatal, etc.
>>
>> Cheers, Andreas

He is fairly keen on XFS except for a couple of items: "The metadata areas are
not aligned with RAID strips and allocation units are FAR too small but better
than ext." However, some of his comments hint that no current filesystem
technology would make him happy. ;)

Folks, thank you for suffering my questions and probing. I may post a few more
later.
Tom King

2008-06-03 15:25:52

by Thomas King

Subject: Re: Questions for article

> On Jun 02, 2008 16:50 -0500, Thomas King wrote:
>> I am writing an article for Linux.com to answer Henry Newman's at
>> http://www.enterprisestorageforum.com/sans/features/article.php/3749926. Is
>> there anyone that can field a few questions on ext4?
>
> It depends on what you are proposing to write... Henry's comments are
> mostly accurate. There isn't even support for > 16TB filesystems in
> e2fsprogs today, so I wouldn't go rushing into an email saying "ext4
> can support a single 100TB filesystem today". It wouldn't be too hard
> to take a 100TB Lustre filesystem and run it on a single node, but I
> doubt anyone would actually want to do that and it still doesn't meet
> the requirements of "a single instance filesystem".
>
Aye, as you probably saw in his article, he's skirting cluster filesystems since
most of the implementations he's referencing use a single physical filesystem.

> What is noteworthy is that the comments about IO not being aligned
> to RAID boundaries is only partly correct. This is actually done in
> ext4 with mballoc (assuming you set these boundaries in the superblock
> manually), and is also done by XFS automatically. The RAID geometry
> detection code should be added to mke2fs also, if someone would be
> interested. The ext4/mballoc code does NOT align the metadata to RAID
> boundaries, though this is being worked on also.
>
Good to know!

> The mballoc code also does efficient block allocations (multi-MB at a
> time), BUT there is no userspace interface for this yet, except O_DIRECT.
> The delayed allocation (delalloc) patches for ext4 are still in the unstable
> part of the patch series... What Henry is misunderstanding here is that
> the filesystem blocksize isn't necessarily the maximum unit for space
> allocation. I agree we could do this more efficiently (e.g. allocate an
> entire 128MB block group at a time for large files), but we haven't gotten
> there yet.
>
Can I assume this (large block size) is a possibility later?

> There are a large number of IO performance improvements in ext4 due to
> work to improve IO server performance for Lustre (which Henry is of
> course familiar with), and for Lustre at least we are able to get IO
> performance in the 2GB/s range on 42 50MB/s disks with software RAID 0
> (Sun x4500), but these are with O_DIRECT.
>
> For the fsck front, there have been performance improvements recently
> (uninit_bg), and more arriving soon (flex_bg and block metadata
> clustering), but that is still a far way from removing the need for
> e2fsck in case of corruption.
>
> Similarly, Lustre (with ext3) can scale to a 10M file directory reasonably
> (though not superbly) for a certain kind of workload. On the other hand,
> this can be really nasty with a "readdir+stat" kind of workload. Lustre
> also runs with filesystems > 250M files total, but I haven't heard of
> e2fsck performance for such filesystems.
>
>
> I'd personally tend to keep quiet until we CAN show that ext4
> runs well on a 100TB filesystem, that e2fsck time isn't fatal, etc.
>
What will be the largest theoretical filesystem for ext4?
Here are three other features he thought necessary for massive filesystems in
Linux:
-T10 DIF (block protect?) aware file system
-NFSv4.1 support
-Support for proposed POSIX relaxation extensions for HPC
Are these already in ext4 or on the radar?
Is there anything else y'all would like folks to know about ext4 and its future?

>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.

Thanks!
Tom King

2008-06-03 15:51:46

by Martin K. Petersen

Subject: Re: Questions for article

>>>>> "Thomas" == Thomas King <[email protected]> writes:

Thomas> - T10 DIF (block protect?) aware file system

I'm not really sure what the ext4 people are officially planning but I
know from conversations with Ted and a few others that there's
interest. Wiring up ext4 to the block integrity infrastructure is
pretty easy. It's defining the tagging and making fsck use it that's
the hard part. Some of that hinges on a userland interface that I
haven't quite finished baking yet.

However, a filesystem doesn't have to be explicitly DIF-aware to take
advantage of it. Sector tagging is just icing on the cake. The
current DIF infrastructure automagically protects all I/O that doesn't
already have integrity metadata attached.

Unfortunately, ext[23] aren't working well with protection turned on
right now. The way DIF works is that a checksum is added to the I/O
when it is submitted; if there's a mismatch, the HBA or the drive
will reject the I/O. Both ext2 and ext3 frequently modify pages that
are in flight, causing a checksum mismatch. I have yet to try ext4.
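
(For reference, the protection information is 8 bytes per 512-byte sector; a
small C sketch of the tuple layout as I recall it from the T10 spec, field
widths only, fields are big-endian on the wire:)

#include <stdint.h>
#include <stdio.h>

/* Sketch of the 8 bytes of T10 DIF protection information carried per
 * 512-byte sector.  The HBA or drive recomputes the guard tag over the data
 * and rejects the I/O on a mismatch; that is exactly what trips when ext2/3
 * rewrite a page that is already in flight. */
struct sd_dif_tuple {
    uint16_t guard_tag;    /* CRC16 of the 512 bytes of data              */
    uint16_t app_tag;      /* owner-defined; where fs tagging would go    */
    uint32_t ref_tag;      /* typically the low 32 bits of the sector #   */
};

int main(void)
{
    printf("%zu bytes of protection info per 512-byte sector\n",
           sizeof(struct sd_dif_tuple));
    return 0;
}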

XFS and btrfs work fine with DIF except for the generic writable mmap
hole that I think I'm about to fix.

--
Martin K. Petersen Oracle Linux Engineering


2008-06-03 22:07:13

by Andreas Dilger

Subject: Re: Questions for article

On Jun 03, 2008 10:10 -0500, Thomas King wrote:
> > The mballoc code also does efficient block allocations (multi-MB at a
> > time), BUT there is no userspace interface for this yet, except O_DIRECT.
> > The delayed allocation (delalloc) patches for ext4 are still in the unstable
> > part of the patch series... What Henry is misunderstanding here is that
> > the filesystem blocksize isn't necessarily the maximum unit for space
> > allocation. I agree we could do this more efficiently (e.g. allocate an
> > entire 128MB block group at a time for large files), but we haven't gotten
> > there yet.
>
> Can I assume this (large block size) is a possibility later?

Well, anything is a possibility later. There are no plans to implement it.

> > I'd personally tend to keep quiet until we CAN show that ext4
> > runs well on a 100TB filesystem, that e2fsck time isn't fatal, etc.
>
> What will be the largest theoretical filesystem for ext4?

In theory, it could be 2^64 bytes in size, though common architectures
would currently be limited to 2^60 bytes due to 4kB PAGE_SIZE == blocksize.
I'm not at all interested in "theoretical filesystem size", however, since
theory != practice and a 2^64-byte filesystem that takes 10 weeks to format
or fsck wouldn't be very useful... Not that I think ext4 is that bad, but
I don't like to make claims based on complete guesswork.
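
(The arithmetic behind those two numbers, for the curious: ext4 extents carry
48-bit physical block numbers, so capacity is 2^48 blocks times the blocksize.
Quick C sketch:)

#include <stdio.h>

/* With 48-bit block numbers, filesystem size = 2^48 blocks * blocksize.
 * blocksize == PAGE_SIZE == 4kB gives 2^60 bytes on common architectures;
 * a 64kB blocksize would be needed to reach the full 2^64 bytes. */
int main(void)
{
    unsigned block_bits_4k  = 12;   /* 4kB blocks  */
    unsigned block_bits_64k = 16;   /* 64kB blocks */
    printf("4kB  blocks: 2^%u bytes\n", 48 + block_bits_4k);   /* 2^60 */
    printf("64kB blocks: 2^%u bytes\n", 48 + block_bits_64k);  /* 2^64 */
    return 0;
}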

> Here are three other features he thought necessary for massive filesystems in
> Linux:
> -T10 DIF (block protect?) aware file system

- DIF support is underway, though I'm not aware of filesystem support for it

> -NFSv4.1 support

- in progress

> -Support for proposed POSIX relaxation extensions for HPC

- nothing more than a proposal; it wouldn't even begin to see a Linux
implementation until there is something more than a few emails on
the list. These are mostly meaningless outside of the context of
a cluster.

Don't get me wrong, these ARE things that Linux will want to implement
as filesystems and clusters get huge, and it is also my job to work on
such large file system deployments.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.