The time to format a filesystem grows roughly linearly with filesystem
size. The exact time spent formatting depends on hardware and software,
but it is mainly explained by the zeroing of metadata blocks (inode
bitmaps, block bitmaps, and inode tables).
While the mkfs time can be considered negligible in some contexts (for
example compared to RAID initialization of disk arrays), it is
significant compared to the formatting time of other filesystems.
This is noticeable when conducting performance comparisons, or tests
involving repeated formatting of the same device, and it may become
prohibitive for large disks or arrays.
For some measurements, see:
http://www.bullopensource.org/ext4/20080909-mkfs-speed-lazy_itable_init/
http://www.bullopensource.org/ext4/20080911-mkfs-speed-lazy_itable_init/
http://www.bullopensource.org/ext4/20080912-mkfs-speed-lazy_itable_init/
So far the measured times stay under one hour; further measurements
would be needed, for example for 16TB filesystems.
It is possible to skip the initialization of the inode table blocks
with the mkfs option "lazy_itable_init" (mkfs.ext4(8)).
However, this option is not safe with respect to fsck, as there is no
way to distinguish an uninitialized block filled with old bits from a
corrupted one.
(The use of lazy_itable_init could be considered safe in the case where
the blocks of the disk, in particular those used by the inode tables,
are prefilled with zeros.)
These patches (try to) initialize the inode tables after mount via a
kernel thread launched by module loading. The goal is to find a
tradeoff between speed and safety.
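The mechanism is roughly the following (a simplified illustration
only, not the actual patch; all identifiers here are made up):

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/err.h>

static struct task_struct *itable_task;

static int itable_init_thread(void *data)
{
        /* walk the block groups, zeroing uninitialized inode tables */
        while (!kthread_should_stop()) {
                /* ... zero one chunk of one group's itable ... */
                schedule_timeout_interruptible(HZ);     /* then yield */
        }
        return 0;
}

static int __init itable_init_module(void)
{
        itable_task = kthread_run(itable_init_thread, NULL,
                                  "ext4itableinit");
        return IS_ERR(itable_task) ? PTR_ERR(itable_task) : 0;
}

static void __exit itable_exit_module(void)
{
        kthread_stop(itable_task);
}

module_init(itable_init_module);
module_exit(itable_exit_module);
MODULE_LICENSE("GPL");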
Apart from use in testing, another use case could be distribution
installation: since device size grows faster than system size, the
share of installation time spent formatting will increase. Since the
system will use only a fraction of the full device (say 10GB for the
system installation on a 1TB disk), it would not be strictly necessary
to initialize all the inode tables before starting the installation,
for example those of the home partition.
So far, I have only been able to initialize some small filesystems
with this code (using 2.6.28-rc4). For example:
. dd if=/dev/zero of=/tmp/ext4fs.img bs=1M count=1024
. losetup /dev/loop0 /tmp/ext4fs.img
. mkfs.ext4 -O^resize_inode -Elazy_itable_init /dev/loop0
. mount /dev/loop0 /mnt/test-ext4
. [dumpe2fs /dev/loop0]
. modprobe ext4_itable_init
. [dumpe2fs /dev/loop0 # here check the ITABLE_ZEROED]
. umount /mnt/test-ext4
. [dumpe2fs /dev/loop0]
. [fsck /dev/loop0]
But I also hit several bugs and managed to somehow screw up my
machine, so be _extremely_ careful if you ever try the code!
TODO:
. fix the resize inode case
. fix the observed soft lockup
. decide whether to keep it a module.
If not, decide how/when to run the kernel thread
. initialize some blocks (for example the non-empty ones) at mount
time, or somewhere else.
. non-empty group case
. feature interactions? (for example inode zeroing vs. resize)
. multiple threads (based on cpu/disks)
. other?
On Fri, Nov 21, 2008 at 11:23:09AM +0100, [email protected] wrote:
> . decide whether to keep it a module.
> If not, decide how/when to run the kernel thread
> . multiple threads (based on cpu/disks)
I would *not* do it as a module. It's more than a little
aesthetically unclean that this has to be a module --- there are
people who by choice decide not to use modules, for example. If
you're clever, doing it as a module allows you to shorten your
compile-edit-debug cycle, I suppose, so maybe it's a justification for
doing it that way, but if that's the main reason, I'd choose using
user mode linux or KVM as my main development vehicle to speed up the
development cycle....
Instead, what I would do is have the mount system call, if the
filesystem is mounted read/write and there are uninitialized block
groups, create a kernel thread responsible for initializing the
filesystem in question. Once initialization is complete, the thread
can exit.
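Something like this (a rough sketch; ext4_has_uninit_itables() and
ext4_itable_init_thread() are hypothetical helpers):

static int ext4_maybe_start_itable_thread(struct super_block *sb)
{
        struct task_struct *t;

        if (sb->s_flags & MS_RDONLY)
                return 0;               /* read-only: nothing to do */
        if (!ext4_has_uninit_itables(sb))       /* hypothetical helper */
                return 0;

        /* the thread exits on its own once all groups are done */
        t = kthread_run(ext4_itable_init_thread, sb,
                        "ext4init-%s", sb->s_id);
        return IS_ERR(t) ? PTR_ERR(t) : 0;
}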
> . initialize some blocks (for example the non-empty ones) at mount
> time, or somewhere else.
> . non-empty group case
I'm not sure why you are treating the non-empty group case any
differently from the empty-group case. The only difference is where you
start zeroing the inode table. In both cases you do need to worry
about locking issues --- what happens if the filesystem tries to
allocate a new inode just as you are starting to zero the filesystem?
In your current patch, you check whether the block group has no inodes
in ext4_thread_init_itable(), by calling has_no_inode(), but there is
no locking between the check that a particular part of the inode table
is not in use and the call to ext4_zero_itable_blocks(). If an inode
does get allocated, either between the call to has_no_inode() and
ext4_zero_itable_blocks(), or while ext4_zero_itable_blocks() is
running, the inode could get zeroed out, causing data loss. So locking
is critical.
My suggestion for how to do locking is to add a field in
ext4_group_info, a data structure which was defined in mballoc.h but
is going to be moved to ext4.h as part of a patch that is currently in
the ext4 patch queue. This field would be protected by sb_bgl_lock()
(like the other fields in the bg descriptor), and would be called
inode_bg_unavail. If it is non-zero, and the relative inode number (i.e.,
the inode number modulo the number of inodes per blockgroup) selected
by the inode allocator is greater than or equal to inode_bg_unavail,
the inode allocator should try to find another block group to allocate
the inode.
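The allocator-side check would then look something like this (a
sketch; inode_bg_unavail is the proposed new field, everything else
already exists in the tree):

static int ext4_rel_inode_available(struct super_block *sb,
                                    ext4_group_t group,
                                    unsigned int rel_ino)
{
        struct ext4_group_info *grp = ext4_get_group_info(sb, group);
        int avail;

        spin_lock(sb_bgl_lock(EXT4_SB(sb), group));
        avail = (grp->inode_bg_unavail == 0 ||
                 rel_ino < grp->inode_bg_unavail);
        spin_unlock(sb_bgl_lock(EXT4_SB(sb), group));

        return avail;   /* if 0, try another block group */
}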
Now what the inode table initialization thread can do is, for each
block group where EXT4_BG_INODE_ZEROED is not set, first set
inode_bg_unavail to bg_itable_unused. This will prevent the inode
allocator from allocating any new inodes in that block group. Since we
are going to zero out inode table blocks, being paranoid is a good
thing; we should check that the bg_checksum is valid, and that the
inode bitmap does not show any inodes past inode_bg_unavail as being
in use. If it does, the filesystem must have gotten corrupted and we
should call ext4_error() in that case.
When we start zeroing the inode table, I would recommend doing so
without using the journal, doing direct block I/O from the zero
page. The ext4_ext_zeroout() function does most of what you need,
although I'd generalize it so it doesn't take an inode and an
ext4_extent structure, but rather a struct super_block, a starting
physical block number, and a number of blocks. This function would
wait for the I/O to complete, and once it completes you know the
blocks are on disk and you can make changes to the filesystem metadata
that would be journaled. I would recommend doing the first 32k of the
inode table first, and once that completes, you can update
inode_bg_unavail so that an additional (32k / EXT4_INODE_SIZE(sb))
inodes are available. This allows the inode allocator to allocate
inodes in the block group while the itable initializer is initializing
the rest of the inode table in that block group. Once the entire block
group's inode table has been initialized, the itable initializer can
then set the EXT4_BG_INODE_ZEROED flag (and this would be a journaled
update), and then move on to the next block group.
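Putting it together, the per-group step would look roughly like this
(a sketch; ext4_zeroout_blocks() is the generalized ext4_ext_zeroout()
described above, and the locking details are elided):

static int ext4_init_one_itable(struct super_block *sb,
                                ext4_group_t group)
{
        struct ext4_group_desc *gdp;
        ext4_fsblk_t itable;
        unsigned long nblocks = EXT4_SB(sb)->s_itb_per_group;
        /* 32k chunks; assumes blocksize <= 32k */
        unsigned long chunk = 32768 >> sb->s_blocksize_bits;
        unsigned long done = 0;
        int err;

        gdp = ext4_get_group_desc(sb, group, NULL);
        if (!gdp)
                return -EIO;
        itable = ext4_inode_table(sb, gdp);

        /* 1) fence off the group via inode_bg_unavail (see above)
         * 2) paranoia: verify bg_checksum and the inode bitmap,
         *    calling ext4_error() on any inconsistency */

        while (done < nblocks) {
                unsigned long n = min(chunk, nblocks - done);

                /* direct, unjournaled I/O from the zero page;
                 * waits for completion before returning */
                err = ext4_zeroout_blocks(sb, itable + done, n);
                if (err)
                        return err;
                done += n;
                /* 3) raise inode_bg_unavail so the freshly zeroed
                 *    inodes become allocatable again */
        }

        /* 4) journaled update: set EXT4_BG_INODE_ZEROED */
        return 0;
}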
Does that make sense?
In terms of how quickly the itable initializer should work: in between
each block group, as we discussed on the call, the simplest thing for
it to do is to wait for some time period to go by (say, 5 seconds)
before working on the next block group. The next, slightly more
complicated scheme would be to set a "last ext4 operation time" field
in EXT4_SB(sb) which is set any time the ext4 code paths are entered
(basically, any function in ext4's inode operations, super operations,
or file operations). The itable initializer would sample that time,
and before starting to initialize the next block group where
EXT4_BG_INODE_ZEROED is not set, it would check the last ext4 operation
time field; if there had been an ext4 operation in the last 5 seconds,
it would sleep 5 seconds and check again. This would prevent the
itable initializer from running if the filesystem is in use, although
it will not detect the case where there is a lot of mmap'ed I/O going
on, but no other ext4 operations.
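The idle check could be as small as this (a sketch; s_last_op_time is
the proposed field, kept in jiffies):

static void ext4_itable_wait_idle(struct super_block *sb)
{
        /* sleep while there has been an ext4 operation within
         * the last 5 seconds */
        while (time_before(jiffies,
                           EXT4_SB(sb)->s_last_op_time + 5 * HZ))
                schedule_timeout_interruptible(5 * HZ);
}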
In the long run, we would really want some kind of I/O activity
indication from the block device elevator, but that would require
changes to the core kernel, and the last ext4 operation time is
almost as good.
> . fix the resize inode case
Not sure what problem you were having here?
- Ted
On Nov 25, 2008 00:32 -0500, Theodore Ts'o wrote:
> I would recommend doing the first 32k of the inode table
> first, and once that completes, you can update inode_bg_unavail so
> that an additional (32k / EXT4_INODE_SIZE(sb)) inodes are available.
I agree with everything Ted says, though I would zero the itable in
chunks of 64kB or even 128kB, for two reasons. First, 64kB is the
maximum blocksize for the filesystem, and it doesn't make sense to zero
less than a whole block at once. Secondly, 64kB is more likely to
match the internal track size of spinning disks, and 128kB is more
likely to match the erase block size of SSDs.
> In terms of how quickly the itable initializer should work: in between
> each block group, as we discussed on the call, the simplest thing for
> it to do is to wait for some time period to go by (say, 5 seconds)
> before working on the next block group. The next, slightly more
> complicated scheme would be to set a "last ext4 operation time" field
> in EXT4_SB(sb) which is set any time the ext4 code paths are entered
That would be "s_wtime", which is already in the on-disk superblock.
It wouldn't kill us to update it occasionally in ext4, though not on
disk all the time.
> (basically, any function in ext4's inode operations, super operations,
> or file operations). The itable initializer would sample that time,
> and before starting to initialize the next block group where
> EXT4_BG_INODE_ZEROED is not set, it would check the last ext4 operation
> time field; if there had been an ext4 operation in the last 5 seconds,
> it would sleep 5 seconds and check again.
Well, I'd say that if it has slept 5s, it should submit a block
regardless of whether the filesystem was in use or not; otherwise the
itable may never be zeroed if the filesystem is always in use. Adding
a rare 64kB write to disk is unlikely to hurt anything, and if people
REALLY care about it they can avoid formatting with
"lazy_itable_init".
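In other words, make the wait a bounded delay, something like this
(same hypothetical s_last_op_time field as above):

static void ext4_itable_throttle(struct super_block *sb)
{
        /* sleep at most once, then let the caller submit the next
         * chunk regardless, so progress is guaranteed */
        if (time_before(jiffies,
                        EXT4_SB(sb)->s_last_op_time + 5 * HZ))
                schedule_timeout_interruptible(5 * HZ);
}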
> This would prevent the itable initializer from running if the filesystem
> is in use, although it will not detect the case where there is a lot
> of mmap'ed I/O going on, but no other ext4 operations.
Wouldn't even mmap operations cause some ext4 methods to be called?
> In the long run, we would really want some kind of I/O activity
> indication from the block device elevator, but that would require
> changes to the core kernel, and the last ext4 operation time is
> almost as good.
Alternatively, we could check the journal tid?
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Theodore Tso writes:
> Does that make sense?
Yes, thanks for the guidance!
> doing it as a module allows you to shorten your
> compile-edit-debug cycle, I suppose, so maybe it's a justification for
> doing it that way,
Yes it is.
> but if that's the main reason, I'd choose using
> user mode linux or KVM as my main development vehicle to speed up the
> development cycle....
I did both KVM and a module :-)
Other reasons were:
. this resize comment:
* This could probably be made into a module, because it is not often in use.
and this sentence from the OLS'02 paper
"Since the online resizing code is only used very rarely, it would
be possible to put the bulk of this code into a separate module that
is only loaded when a resize operation is done."
Inode zeroing is done only once in a filesystem's lifetime (and again
each time it is resized).
. the fact that I did not have a clear idea of when to start the thread.
. some consideration for memory usage (you mentioned NAS boxes in
another thread).
> I'm not sure why you are treating the non-empty group case any
> differently from the empty-group case.
For simplicity; I wanted to have "something" to start the
discussion.
I was also thinking that there may be other places to do it. For
example, zeroing the inode table where the inode bitmap is initialized
(ext4_init_inode_bitmap() called only once in
ext4_read_inode_bitmap()).
The reasoning would have been to zero as soon as it is known to be
needed:
. without deferring it to the threads,
. decreasing the probability of the zeroing competing with other code,
. decreasing the "window of vulnerability" (the time between formatting
and the end of zeroing, during which fsck is known not to be safe).
I don't know if it would have been sufficient to guarantee that all
the groups eventually get their inode tables zeroed.
> > . fix the resize inode case
>
> Not sure what problem you were having here?
With a resize inode, the resulting filesystem is corrupted; fsck says
"Resize inode not valid. Recreate?"
as well as:
"Free blocks count wrong for group #0 (6788, counted=6789)."
Apart from the data structure changes you mentioned, these changes
were discussed:
. a mount option to disable the threads when doing testing/performance
benchmarking
. a flag in s_flags field of struct ext4_super_block to indicate that
the zeroing has been done on all the groups. Possibly reset with
resize.
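A rough sketch of how the two could fit together (the option and flag
names here are invented; test_opt() and s_es already exist):

#define EXT4_FLAGS_ITABLES_ZEROED  0x0010  /* invented s_flags bit */

static int ext4_itable_thread_wanted(struct super_block *sb)
{
        struct ext4_sb_info *sbi = EXT4_SB(sb);

        if (test_opt(sb, NO_ITABLE_INIT))  /* proposed mount option */
                return 0;
        if (le32_to_cpu(sbi->s_es->s_flags) &
            EXT4_FLAGS_ITABLES_ZEROED)
                return 0;  /* zeroing already done on all groups */
        return 1;
}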
Do they sound reasonable?
--
solofo
On Tue, Nov 25, 2008 at 01:28:47PM +0100, [email protected] wrote:
> * This could probably be made into a module, because it is not often in use.
> and this sentence from the OLS'02 paper
> "Since the online resizing code is only used very rarely, it would
> be possible to put the bulk of this code into a separate module that
> is only loaded when a resize operation is done."
> Inode zeroing is done only once in a filesystem's lifetime (and
> again each time it is resized).
Sure, but (a) zeroing the inode table should not be much code,
especially since we need to zero contiguous block ranges in extents.c
to deal with uninitialized extents. Also (b) modules have their own
cost; they waste on average PAGE_SIZE/2 worth of memory per module due
to internal fragmentation, and cause extra entries in the TLB cache
(on an x86, the entire kernel uses a single TLB entry, but each 4k
page used by a module burns a separate TLB entry), and (c) loading
modules is slow and serializes the kernel at boot time. In addition,
(d) some users simply prohibit modules for policy reasons.
So I could imagine adding module support, but it is critical that it
work when built into the kernel, and that will probably be the primary
way it is compiled. And you need to make sure the kernel automatically
loads the module when a resize happens or a new filesystem is mounted,
if it is compiled as a module.
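The usual mechanism for that is request_module() from the built-in
side (a sketch; the module name is whatever the final patch uses):

#include <linux/kmod.h>

static void ext4_request_itable_module(void)
{
        if (request_module("ext4_itable_init") < 0)
                printk(KERN_WARNING
                       "ext4: itable init module not available\n");
}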
> I was also thinking that there may be other places to do it. For
> example, zeroing the inode table where the inode bitmap is initialized
> (ext4_init_inode_bitmap() called only once in
> ext4_read_inode_bitmap()).
When we first allocate an inode in the inode table and when we need
to zero the table are largely unrelated. The problem is that e2fsck
doesn't want to trust the inode bitmap as being accurate; nor can it
necessarily trust the bg_itable_unused field --- note particularly
that this is not updated in the backup group descriptors. Hence the
window of vulnerability has nothing to do with whether or not we have
started using a particular part of the inode table; it exists because
the inode table has not been initialized.
So we want to get the inode table fully initialized as soon as
possible, although we have to balance this with not impacting the
system's performance.
> The reasoning would have been to zero as soon as it is known to be
> needed:
> . without deferring it to the threads,
> . decreasing the probability of the zeroing competing with other code,
A block group's inode table can be 2-4 megabytes, and zeroing out
that many disk blocks can take a noticeable amount of time, so doing
it synchronously with an inode creation doesn't seem like a great idea
to me...
> > > . fix the resize inode case
> >
> > Not sure what problem you were having here?
>
> With a resize inode, the resulting filesystem is corrupted; fsck says
> "Resize inode not valid. Recreate?"
> as well as:
> "Free blocks count wrong for group #0 (6788, counted=6789)."
Something really bogus must be happening; the resize inode is in block
group 0, which always has some inodes (and therefore would never be
touched by your patch, which only initializes completely empty inode
tables). So if it managed to corrupt the resize inode, that's
especially worrisome.
> Apart from the data structure changes you mentioned, these changes
> were discussed:
> . a mount option to disable the threads when doing testing/performance
> benchmarking
Yes, this makes sense. The administrator can always remount the
filesystem with a mount option that re-enables itable initialization,
to zero the inode table later on.
> . a flag in s_flags field of struct ext4_super_block to indicate that
> the zeroing has been done on all the groups. Possibly reset with
> resize.
This doesn't make that much sense to me. It's not that hard to
iterate through all of the block group descriptors checking for the
EXT4_BG_INODE_ZEROED flag.
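That is, something like this (a sketch using the existing descriptor
accessors):

static int ext4_all_itables_zeroed(struct super_block *sb)
{
        ext4_group_t i, ngroups = EXT4_SB(sb)->s_groups_count;

        for (i = 0; i < ngroups; i++) {
                struct ext4_group_desc *gdp =
                        ext4_get_group_desc(sb, i, NULL);

                if (!gdp || !(gdp->bg_flags &
                              cpu_to_le16(EXT4_BG_INODE_ZEROED)))
                        return 0;
        }
        return 1;
}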
- Ted
On Nov 25, 2008 13:28 +0100, [email protected] wrote:
> Apart from the data structure changes you mentioned, these changes
> were discussed:
> . a mount option to disable the threads when doing testing/performance
> benchmarking
Sure.
> . a flag in s_flags field of struct ext4_super_block to indicate that
> the zeroing has been done on all the groups. Possibly reset with
> resize.
I was thinking that it makes sense to have this same thread do checking
of all the block group metadata as it traverses the filesystem. That
includes validating the GDT checksums, checking the existing block and
inode bitmaps (and possibly checksums for them, when that is implemented),
along with zeroing the inode table.
The only requirement is that there be only a single such thread
running on the filesystem at one time, and that, if the filesystem is
unmounted, the thread be killed before the unmount completes.
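The unmount side is straightforward with kthreads (a sketch;
s_itable_task is a hypothetical field in ext4_sb_info, and the race
between the thread exiting on its own and the unmount is elided):

static void ext4_stop_itable_thread(struct super_block *sb)
{
        struct ext4_sb_info *sbi = EXT4_SB(sb);

        /* kthread_stop() blocks until the thread has exited, so
         * the unmount cannot complete while zeroing is in flight */
        if (sbi->s_itable_task) {
                kthread_stop(sbi->s_itable_task);
                sbi->s_itable_task = NULL;
        }
}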
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.