2008-04-03 12:55:25

by Theodore Ts'o

[permalink] [raw]
Subject: Bug in ext2fs_set_gdt_csum() and uninit_groups handling


Unfortunately, this has already been merged into the mainline branch
before I noticed this, but we have a nasty bug right now in the
uninit_groups handling. Note this code here in ext2fs_set_gdt_csum():

/* Even if it wasn't zeroed, by the time this function is
* called by e2fsck we have already scanned and corrected
* the whole inode table so we may as well not overwrite it.
* This is just a hint to the kernel that it could do lazy
* zeroing of the inode table if mke2fs didn't do it, to help
* out if we need to do a full itable scan sometime later. */
if (!(bg->bg_flags & (EXT2_BG_INODE_UNINIT |
EXT2_BG_INODE_ZEROED)))
fs->group_desc[i].bg_flags |= EXT2_BG_INODE_ZEROED;


There's only one problem. ext2fs_set_gdt_csum() is also called by
mke2fs, so all of the block groups are having the EXT2_BG_INODE_ZEROED
bit set.

I'm also not entirely sure this is safe for e2fsck, since as far as I
can tell, e2fsck does *not* currently scan zero uninitialized portions
of the inode table. It uses the ext2fs_get_next_inode() function, which
will skip uninitialized inodes.

So, I think what needs to happen is that the above fragment in
csum.c:ext2fs_set_gdt_csum() needs to disappear.

Is there a kernel patch that will zero out the inode tables in the
background? I remember hearing discussions of this, but it doesn't seem
to be in the kernel tree or in the ext4 patch queue.

Eventually, we might want to add a feature to e2fsck (probably
controlled by e2fsck.conf or a command-line option?) to zero out the
inode table, and then we can clear the INODE_ZERO'ed bg_flags.

Jose, Andreas, does this make sense?

- Ted

P.S. I noticed this because I had added to dumpe2fs the ability to show
the INODE_ZEROED flag, and noticed it was set when I was *not* expecting
it to be set.


2008-04-03 14:17:54

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Bug in ext2fs_set_gdt_csum() and uninit_groups handling

So after looking at this some more, I think I understand what had
happened. The code as currently written is OK, because we are
initializing the inode table even if the uninit_group feature is
enabled. There is a new COMPAT feature, lazy_bg, which causes mke2fs
to skip initializing the inode table --- but since it's not in the
allowed set of filesystem features, we're not in any danger since
currently zeroing of the inode table can't be skipped.

HOWEVER, the code in ext2fs_set_gdt_csum() which I pointed out in my
earlier message is a ticking time bomb that will go off if we ever
enable lazy_bg.

Given that uninit_groups has only recently hit the e2fsprosg mainline,
and as far as I know, no one is currently using it in production ---
and those that might be, such as CFS, are zero'ing out the entire
inode table anyway (since the feature was originally called gdt_cksum)
--- and it would seem HIGHLY uninituitive that uninit_groups, or
uninit_bg doesn't actually prevent the slow step of zeroing the inode
table, I propose that we do the following:

*) Remove lazy_bg feature flag entirely

*) Remove the bogus code in ext2fs_set_gdt_csum()

*) Change mke2fs to skip initializing the inode tables if
uninit_groups is set.

*) Implement the kernel code for ext4 so it will zero out the inode
table in the background.

Does this make sense? I believe it is safe, given that all existing
filesystems using uninit_groups (what few of them exist) have the
inode table completely zero'ed out, so there is no danger in making
this change in the deifnition of what uninit_groups (or uninit_bg) means.

- Ted

2008-04-10 17:33:18

by Andreas Dilger

[permalink] [raw]
Subject: Re: Bug in ext2fs_set_gdt_csum() and uninit_groups handling

On Apr 03, 2008 10:17 -0400, Theodore Ts'o wrote:
> So after looking at this some more, I think I understand what had
> happened. The code as currently written is OK, because we are
> initializing the inode table even if the uninit_group feature is
> enabled. There is a new COMPAT feature, lazy_bg, which causes mke2fs
> to skip initializing the inode table --- but since it's not in the
> allowed set of filesystem features, we're not in any danger since
> currently zeroing of the inode table can't be skipped.
>
> HOWEVER, the code in ext2fs_set_gdt_csum() which I pointed out in my
> earlier message is a ticking time bomb that will go off if we ever
> enable lazy_bg.
>
> Given that uninit_groups has only recently hit the e2fsprosg mainline,
> and as far as I know, no one is currently using it in production ---
> and those that might be, such as CFS, are zero'ing out the entire
> inode table anyway (since the feature was originally called gdt_cksum)
> --- and it would seem HIGHLY uninituitive that uninit_groups, or
> uninit_bg doesn't actually prevent the slow step of zeroing the inode
> table, I propose that we do the following:
>
> *) Remove lazy_bg feature flag entirely
>
> *) Remove the bogus code in ext2fs_set_gdt_csum()
>
> *) Change mke2fs to skip initializing the inode tables if
> uninit_groups is set.
>
> *) Implement the kernel code for ext4 so it will zero out the inode
> table in the background.
>
> Does this make sense? I believe it is safe, given that all existing
> filesystems using uninit_groups (what few of them exist) have the
> inode table completely zero'ed out, so there is no danger in making
> this change in the deifnition of what uninit_groups (or uninit_bg) means.

Sorry for delayed reply - I was away the last 2 weeks and am just skimming
my email backlog.

The zeroing of the inode tables can definitely NOT be removed until
the kernel has the ability to do that if the ZEROED flag is not set.
It is otherwise dangerous if the group checksum is incorrect, or the
backup GDT blocks are used, because they will not have correct UNINIT
flags and itable_unused values. When e2fsck runs from the backups or
if the checksum is bad then the UNINIT flags and itable_unused field
are cleared and a full-group itable scan is done, and this would raise
a HUGE problem if the itable blocks have not been zeroed, which I would
predict would lead to a corrupt and useless filesystem. The window
between mke2fs and the zeroing of the itable blocks must be as small
as possible to avoid this risk.

We added the ZEROED flag as a way of allowing the fast mke2fs to be
added in the future, but the kernel-side support has never been added.
That lazy_bg skips the itable zeroing is no different before or after
the addition of uninit_groups. We didn't consider it a huge problem
for initial uninit_groups implementation because mke2fs is run far less
often than e2fsck...

If the main concern is that lazy_bg is still setting the ZEROED flag,
then yes I agree this is a problem that definitely wasn't intended. I
don't think it is safe to turn off the zeroing of itable blocks in mke2fs
until after the kernel-side support has been available for some time.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.