2006-09-12 11:07:35

by Pavel Mironchik

Subject: ext2/3 create large filesystem takes too much time; solutions

Hi,

Ext2/3 erases the inode tables when creating a new filesystem.
This is a very long operation when the target volume is larger than
2 TB. Other filesystems do not suffer such a long delay at creation
time. My goal is to improve the design of ext3 to reduce the time it
takes to create large ext3 volumes on storage servers.
In general, to solve the problem we should defer the job of cleaning
inodes to the kernel. In e2fsprogs there is a LAZY_BG option, but it
only skips erasing the inodes.

I see several solutions to this problem:
1) Add special bitmaps to the fs header (inode group descriptors?).
By looking at those bitmaps the kernel could determine whether an inode
has not been cleaned yet, and initialize that inode properly.
2) Add a special identifier to each inode. If superblock id != inode id,
the inode is dirty and should be cleaned by the kernel. The superblock
id is generated at creation time.
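
A minimal sketch of option 2, assuming a simplified in-memory model rather than the real ext3 on-disk layout. The names `Inode`, `fs_id`, `new_fs_id` and `read_inode` are hypothetical and are not taken from the attached patches.

```python
import os
from dataclasses import dataclass


@dataclass
class Inode:
    # Hypothetical fields; not the ext3 on-disk inode layout.
    fs_id: bytes   # id stamped into the inode when it was last initialized
    mode: int = 0
    size: int = 0


def new_fs_id() -> bytes:
    """Generate a fresh superblock id at filesystem creation time."""
    return os.urandom(16)


def read_inode(raw: Inode, sb_id: bytes) -> Inode:
    """Return a usable inode, cleaning it if it predates this filesystem."""
    if raw.fs_id != sb_id:
        # Left over from a previous filesystem: treat as dirty and reset.
        return Inode(fs_id=sb_id)
    return raw
```

With this scheme mkfs never has to touch the inode tables: whatever was on the disk before carries a different id and is reset on first access. The price is the space taken by the identifier in every inode.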

I chose the second (much easier, just a few lines of code)
and implemented a patch for e2fsprogs and the kernel ext3 code. It is
just a proof of concept.
With the help of this patch I could create terabyte-sized volumes quickly.
Of course this patch breaks compatibility with existing filesystems.
The more correct approach is to choose the first way and not break
compatibility.

With this mail I just want to check whether there is any interest in
this problem from the community.
I would like future ext4 filesystems to be fast to create,
and I would appreciate any thoughts or remarks.
---------------------------
Pavel S. Mironchik


Attachments:
(No filename) (1.50 kB)
ext3-kernel.patch (3.35 kB)
mkfs_and_fsck.patch (4.52 kB)

2006-09-15 21:20:41

by Theodore Ts'o

Subject: Re: ext2/3 create large filesystem takes too much time; solutions

On Tue, Sep 12, 2006 at 02:07:34PM +0300, Pavel Mironchik wrote:
>
> Ext2/3 erases the inode tables when creating a new filesystem.
> This is a very long operation when the target volume is larger than
> 2 TB. Other filesystems do not suffer such a long delay at creation
> time. My goal is to improve the design of ext3 to reduce the time it
> takes to create large ext3 volumes on storage servers.
> In general, to solve the problem we should defer the job of cleaning
> inodes to the kernel. In e2fsprogs there is a LAZY_BG option, but it
> only skips erasing the inodes.

Hi Pavel,

Apologies that no one responded right away; I think a lot of
people have been incredibly busy. I've been doing a huge amount of
travel myself personally, and so my e-mail latency has been larger
than normal.

The problem of long mke2fs run times is one that we've
considered, and we do want to do something with it, but it's not been
as high priority as some of the other problems on our hit list.
Certainly, given that inode space is very precious, I'm not convinced
that breaking backwards compatibility and burning an extra 16 bytes
per inode is worth the net gain --- although there are other solutions
that don't have that particular cost. Yes, they take more lines of
code to support, but given the hopefully large number of people that
will be using ext4, I'd much rather spend an extra amount of
development time getting it right, than doing something fast and dirty
which then everyone pays for, over and over, again and again and again
across millions and millions of machines!


> I see several solutions to this problem:
> 1) Add special bitmaps to the fs header (inode group descriptors?).
> By looking at those bitmaps the kernel could determine whether an inode
> has not been cleaned yet, and initialize that inode properly.

Actually, you don't need a bitmap; a much simpler solution is to have
an integer field in the block group descriptors which indicates the
number of inodes that have been initialized in that block group. The
problem, though, is what happens if the block group descriptors (or the
bitmaps) get corrupted. So what we also want to do is to add support
for checksums in the individual inodes and in the block group
descriptors themselves, as a double-check.
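
A minimal sketch of the counter idea, assuming 128-byte inodes and a group descriptor reduced to two fields. `GroupDesc`, `inodes_initialized`, `load_inode` and `init_up_to` are hypothetical names for illustration, not fields or functions of ext3/ext4.

```python
from dataclasses import dataclass

INODE_SIZE = 128  # bytes per on-disk inode; an assumption for this sketch


@dataclass
class GroupDesc:
    inodes_per_group: int
    inodes_initialized: int  # hypothetical counter field in the descriptor


def load_inode(desc: GroupDesc, table: bytes, index: int) -> bytes:
    """Read inode `index` (zero-based within the group) from the inode table."""
    if not 0 <= index < desc.inodes_per_group:
        raise IndexError(index)
    if index >= desc.inodes_initialized:
        # Never initialized: ignore whatever garbage is on disk.
        return bytes(INODE_SIZE)
    off = index * INODE_SIZE
    return bytes(table[off:off + INODE_SIZE])


def init_up_to(desc: GroupDesc, table: bytearray, index: int) -> None:
    """Zero inodes from the current counter through `index`, then advance it."""
    if index >= desc.inodes_initialized:
        start = desc.inodes_initialized * INODE_SIZE
        end = (index + 1) * INODE_SIZE
        table[start:end] = bytes(end - start)
        desc.inodes_initialized = index + 1
```

mkfs would then only need to write the descriptors with the counter set to zero. Everything depends on the counter being trustworthy, which is why the descriptor needs a checksum.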

These are useful features in and of themselves, and there are some
sample implementations of them (for example, in the Iron ext2 paper).
So my thinking is that we should get that work into ext4, and then
it's not hard to add the support for fields in the block group
descriptors that would allow for fast mke2fs support.

Regards,

- Ted

2006-09-16 14:56:03

by Pavel Mironchik

Subject: Re: ext2/3 create large filesystem takes too much time; solutions

Hi Ted,

Thanks for the response...
I agree with you, and I would prefer to send something more
serious to the list than those earlier patches. I like your
idea with counters. Btw, I assume a CRC is preferable to a
plain checksum for the block group descriptors....

Pavel

2006-09-16 20:06:26

by Theodore Ts'o

Subject: Re: ext2/3 create large filesystem takes too much time; solutions

On Sat, Sep 16, 2006 at 05:56:02PM +0300, Pavel Mironchik wrote:
> Hi Ted,
>
> Thanks for the response...
> I agree with you, and I would prefer to send something more
> serious to the list than those earlier patches. I like your
> idea with counters. Btw, I assume a CRC is preferable to a
> plain checksum for the block group descriptors....

Yes, when I said checksum I meant a cyclic redundancy checksum, and
not an additive checksum... (and one of the things we can do is to
build in the superblock UUID into the CRC, so that if the filesystem
gets recreated we can distinguish an old inode from a new one).
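
A small illustration of both points, assuming CRC-32 from Python's `zlib` as a stand-in for whichever CRC is finally chosen. The helper names `additive_checksum` and `inode_crc` are hypothetical.

```python
import zlib


def additive_checksum(data: bytes) -> int:
    """Plain 16-bit sum of bytes; blind to reordering of the data."""
    return sum(data) & 0xFFFF


def inode_crc(fs_uuid: bytes, inode_bytes: bytes) -> int:
    """CRC-32 over the filesystem UUID followed by the inode contents."""
    return zlib.crc32(inode_bytes, zlib.crc32(fs_uuid))


a = b"\x01\x02\x03\x04"
b = b"\x04\x03\x02\x01"
# The additive sum cannot tell these apart; the CRC can.
same_sum = additive_checksum(a) == additive_checksum(b)
same_crc = zlib.crc32(a) == zlib.crc32(b)
```

Because the UUID is fed into the CRC first, an inode left over from a previous filesystem on the same device fails verification under the new UUID even if its contents are otherwise well formed.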

- Ted

2006-09-17 07:57:17

by Andreas Dilger

Subject: Re: ext2/3 create large filesystem takes too much time; solutions

On Sep 16, 2006 16:06 -0400, Theodore Tso wrote:
> On Sat, Sep 16, 2006 at 05:56:02PM +0300, Pavel Mironchik wrote:
> > I agree with you, and I would prefer to send something more
> > serious to the list than those earlier patches. I like your
> > idea with counters. Btw, I assume a CRC is preferable to a
> > plain checksum for the block group descriptors....
>
> Yes, when I said checksum I meant a cyclic redundancy checksum, and
> not an additive checksum... (and one of the things we can do is to
> build in the superblock UUID into the CRC, so that if the filesystem
> gets recreated we can distinguish an old inode from a new one).

Just to avoid duplication of effort, I'm attaching the current
work-in-progress patches for the uninitialized groups (kernel + e2fsprogs).
They are really at the "barely compile" stage (if that), but at least
people can look at them and start improving them instead of starting from
scratch. The patches are based on work done by Anshu Goel, but have
been reworked a fair amount since they were given to me (i.e. bugs
added are mine). I've been sitting on them
for too long and they should see the light of day instead of continuing
to stagnate.

I also just incorporated Ted's suggestion to include the filesystem UUID
into the checksum. I previously had added in the group number, so that
if the block is written out to the wrong location it wouldn't verify
correctly.
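
A rough sketch of a descriptor checksum that covers the UUID and the group number, again assuming CRC-32 from `zlib` as a stand-in. `group_desc_csum` and `verify` are hypothetical names and do not come from the attached patches.

```python
import struct
import zlib


def group_desc_csum(fs_uuid: bytes, group: int, desc_bytes: bytes) -> int:
    """CRC over UUID, little-endian group number, then descriptor bytes."""
    crc = zlib.crc32(fs_uuid)
    crc = zlib.crc32(struct.pack("<I", group), crc)
    return zlib.crc32(desc_bytes, crc)


def verify(fs_uuid: bytes, group: int, desc_bytes: bytes, stored: int) -> bool:
    """Check a stored checksum against the group the block was read from."""
    return group_desc_csum(fs_uuid, group, desc_bytes) == stored
```

A descriptor computed for group 5 but read back from the slot of group 4 no longer verifies, which is the misplaced-write case described above.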

Things that need to be done:
- the kernel block/inode allocation needs to be reworked:
  - initialize a whole block worth of inodes at one time instead
    of single inodes.
  - I don't think we need to zero out the unused inodes - the kernel
    should already be doing this if the inode block is unused
  - find a happy medium between using existing groups (inodes/blocks)
    and initializing new ones
- we likely need to verify the checksum in more places in e2fsck before
  trusting the UNINIT flags
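
For the "whole block worth of inodes" item, the arithmetic can be sketched as follows, assuming 128-byte inodes and 4096-byte blocks as defaults and zero-based inode indexes within a group's inode table. `inodes_in_same_block` is a hypothetical helper.

```python
def inodes_in_same_block(index: int, inode_size: int = 128,
                         block_size: int = 4096) -> range:
    """Return the inode indexes that share an inode-table block with `index`."""
    per_block = block_size // inode_size
    first = (index // per_block) * per_block
    return range(first, first + per_block)
```

Initializing every inode in that range at once means the block is written a single time instead of once per inode.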

I won't be able to work more on this for a while, so have at it :-).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Attachments:
(No filename) (2.06 kB)
ext3-uninit_groups-2.6.12.patch (19.45 kB)
e2fsprogs-uninit.patch (28.89 kB)

2006-09-18 19:51:17

by Andreas Dilger

Subject: Re: ext2/3 create large filesystem takes too much time; solutions

On Sep 17, 2006 01:57 -0600, Andreas Dilger wrote:
> Things that need to be done:
> - the kernel block/inode allocation needs to be reworked:
>   - initialize a whole block worth of inodes at one time instead
>     of single inodes.
>   - I don't think we need to zero out the unused inodes - the kernel
>     should already be doing this if the inode block is unused
>   - find a happy medium between using existing groups (inodes/blocks)
>     and initializing new ones
> - we likely need to verify the checksum in more places in e2fsck before
>   trusting the UNINIT flags
- need to decide what to do if the UNINIT flag is set but the checksum is
  wrong. This has the possibility of getting a LOT of garbage from the
  disk, including old "valid" inodes, garbage for bitmaps, etc.
- should the kernel and/or e2fsck zero the unused parts of the inode table
  asynchronously to avoid such problems? It could optionally only write
  out the blocks if they are not already zero (to avoid consuming space
  on sparse filesystems), but this would require an additional read of
  each block (maybe this can be done slowly to avoid overloading the
  system)? We could also have another flag which indicates whether the
  group data is already zeroed.
- need to clear the UNINIT flags if we detect a bitmap/inode is in use in
  a group; this would possibly also force a restart of e2fsck so that it
  checks the whole group (with caveat for above).
- need to zero itable blocks if allocating from an UNINIT group in e2fsprogs
- need to zero ibitmap/bbitmap if using an UNINIT group in e2fsprogs
- should we drop bg_itable_unused to the minimum possible value in e2fsck?
  This would reduce subsequent e2fsck time a bit.
- need to properly handle big-endian machines in e2fsprogs when computing
  the checksum. The kernel will always do the CRC on little-endian disk
  data, and little-endian e2fsprogs will do the same.
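
The endianness item can be illustrated with a short sketch, assuming a hypothetical three-field descriptor and CRC-32 from `zlib` as a stand-in. The point is only that the checksum must be computed over the little-endian on-disk bytes, never over the host's native in-memory representation.

```python
import struct
import zlib


def desc_to_disk(block_bitmap: int, inode_bitmap: int,
                 itable_unused: int) -> bytes:
    """Serialize a hypothetical descriptor in little-endian disk order."""
    return struct.pack("<IIH", block_bitmap, inode_bitmap, itable_unused)


def desc_csum(block_bitmap: int, inode_bitmap: int,
              itable_unused: int) -> int:
    """Checksum the on-disk byte image, so every host gets the same value."""
    return zlib.crc32(desc_to_disk(block_bitmap, inode_bitmap, itable_unused))
```

The explicit `<` in the format string fixes the byte order regardless of the machine running the code. A tool that checksummed its native structure on a big-endian host would be hashing different bytes and would disagree with the kernel.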

Attached is a slightly improved version; it at least passes "make check"
in tests, though I haven't gotten the "tst_csum" program to build & run
automatically (it passes by hand).


Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Attachments:
(No filename) (2.10 kB)
e2fsprogs-uninit.patch (33.51 kB)