2008-05-08 22:49:14

by Andreas Dilger

[permalink] [raw]
Subject: mke2fs and lazy_itable_init

I just noticed lazy_itable_init in the mke2fs.8.in man page. I think
a warning needs to be added there that this is not currently safe to
use, because the kernel does not yet do the background zeroing. There
is nothing in the man page to indicate that this is unsafe...

.B lazy_itable_init\fR[\fb= \fI<0 to disable, 1 to enable>\fR]
If enabled and the uninit_bg feature is enabled, the inode table will
not fully initialized by
.BR mke2fs .
This speeds up filesystem
initialization noitceably, but it requires the kernel to finish
initializing the filesystem in the background when the filesystem is
first mounted. If the option value is omitted, it defaults to 1 to
enable lazy inode table initialization.
+.BR NOTE :
+No kernels do the background zeroing of the inode table at this time,
+so this option is unsafe for production use, due to the risk of serious
+filesystem corruption if e2fsck has to scan the whole inode table for
+some reason (e.g. incorrect group descriptor checksum).

Sorry for half patch, but my GIT repo is broken somehow and I wanted to
email before I forget.


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



2008-05-09 02:18:46

by Theodore Ts'o

[permalink] [raw]
Subject: Re: mke2fs and lazy_itable_init

On Thu, May 08, 2008 at 04:48:47PM -0600, Andreas Dilger wrote:
> I just noticed lazy_itable_init in the mke2fs.8.in man page. I think
> a warning needs to be added there that this is not currently safe to
> use, because the kernel does not yet do the background zeroing. There
> is nothing in the man page to indicate that this is unsafe...

Yeah, I was hoping we would actually get this fixed before 1.41 was
released.... (i.e., implement the background zeroing). One of the
things I was thinking about was whether we could avoid needing to go
through the jbd layer when zeroing out an entire inode table block,
and then in the completion callback function when the block group was
completely initiaized, we could clear the ITABLE_UNINIT flag.

It doesn't need to go through the journal, because if we crash without
having the flag set, its not a big deal; the inode table will just not
be marked initialized. The only thing which might require a little
care is if buffer head referencing part of the inode table which is
getting zero'ed out is in flight when an inode allocation happens, an
inode gets marked dirty, and fs/ext4/inode.c wants to write out an
inode table block that is in the middle of being zero'ed. Given that
we've bypassed the jbd layer for efficiency's sake, something bad
could happy unless we protect it with some kind of lock.

Or we could just say that this initialization pass is relatively rare,
so do it the cheap cheasy way, even if the blocks end up going through
the journal. The upside is that it should be pretty quick and easy to
code it this way.

- Ted

2008-05-09 08:19:57

by Andreas Dilger

[permalink] [raw]
Subject: Re: mke2fs and lazy_itable_init

On May 08, 2008 22:18 -0400, Theodore Ts'o wrote:
> On Thu, May 08, 2008 at 04:48:47PM -0600, Andreas Dilger wrote:
> > I just noticed lazy_itable_init in the mke2fs.8.in man page. I think
> > a warning needs to be added there that this is not currently safe to
> > use, because the kernel does not yet do the background zeroing. There
> > is nothing in the man page to indicate that this is unsafe...
>
> Yeah, I was hoping we would actually get this fixed before 1.41 was
> released.... (i.e., implement the background zeroing).

It would still be an issue for 2.6.24 and 2.6.25 kernels, so I think
it at least deserves a warning until there is a specific kernel that
can be referenced that has this functionality.

> One of the
> things I was thinking about was whether we could avoid needing to go
> through the jbd layer when zeroing out an entire inode table block,
> and then in the completion callback function when the block group was
> completely initiaized, we could clear the ITABLE_UNINIT flag.
>
> It doesn't need to go through the journal, because if we crash without
> having the flag set, its not a big deal; the inode table will just not
> be marked initialized. The only thing which might require a little
> care is if buffer head referencing part of the inode table which is
> getting zero'ed out is in flight when an inode allocation happens, an
> inode gets marked dirty, and fs/ext4/inode.c wants to write out an
> inode table block that is in the middle of being zero'ed. Given that
> we've bypassed the jbd layer for efficiency's sake, something bad
> could happy unless we protect it with some kind of lock.
>
> Or we could just say that this initialization pass is relatively rare,
> so do it the cheap cheasy way, even if the blocks end up going through
> the journal. The upside is that it should be pretty quick and easy to
> code it this way.

It is only a once-per-filesystem-lifetime operation, and while it is
a fair amount of IO, it could be done efficient because of sequential IO.
I believe that the unwritten extent code added a function "ext4_zero_blocks"
or similar that maps the ZERO_PAGE into a bio and submits that for IO
to zero out the extent. This could be used for the inode tables also,
avoiding the major problem of dirtying thousands of blocks in memory.

The risk, as you write, is the locking, and the only locks I see on the
inode table are the bh locks on the itable blocks in __ext4_get_inode_loc().
We can't hold the per-group hashed spinlock for this IO. We might consider
adding an rw semaphore, and in the very common case there will only ever
be readers on this lock. If there is a zeroing happening then there can
be a "trylock" operation and the entire group skipped to avoid blocking
the caller for a long time.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-08-27 09:29:39

by Solofo.Ramangalahy

[permalink] [raw]
Subject: Re: mke2fs and lazy_itable_init

>>>>> On Fri, 09 May 2008 02:19:30 -0600, Andreas Dilger <[email protected]> said:

Andreas> On May 08, 2008 22:18 -0400, Theodore Ts'o wrote:
>> On Thu, May 08, 2008 at 04:48:47PM -0600, Andreas Dilger wrote:
>> > I just noticed lazy_itable_init in the mke2fs.8.in man page.
>> I think > a warning needs to be added there that this is not
>> currently safe to > use, because the kernel does not yet do the
>> background zeroing. There > is nothing in the man page to
>> indicate that this is unsafe...
>>
>> Yeah, I was hoping we would actually get this fixed before 1.41
>> was released.... (i.e., implement the background zeroing).

Andreas> It would still be an issue for 2.6.24 and 2.6.25 kernels,
Andreas> so I think it at least deserves a warning until there is
Andreas> a specific kernel that can be referenced that has this
Andreas> functionality.

I think this patch (from Andreas) would be good for e2fsprogs 1.41.1,
as discussed at Monday call.

--
solofo

Index: b/misc/mke2fs.8.in
===================================================================
--- a/misc/mke2fs.8.in
+++ b/misc/mke2fs.8.in
@@ -221,6 +221,12 @@ initialization noticeably, but it requir
initializing the filesystem in the background when the filesystem is
first mounted. If the option value is omitted, it defaults to 1 to
enable lazy inode table initialization.
+. BR NOTE:
+No kernels do the background zeroing of the inode table at this time,
+so this option is unsafe for production use, due to the risk of
+filesystem corruption if e2fsck has to scan the whole inode table for
+some reason (e.g. incorrect group descriptor checksum).
+
.TP
.B test_fs
Set a flag in the filesystem superblock indicating that it may be