2012-03-27 14:47:37

by Yongqiang Yang

Subject: backup of the last group descriptor when it is the 1st group of a meta_bg

Hi Ted, Andreas and List,

As Andreas pointed out last year, if the last group is the 1st group
in a meta_bg, then its group descriptor has no backup.
With meta_bg the resize inode is not needed, so I had a thought: could we
store a backup group descriptor of the last group in the resize inode?
What are your opinions?

Yongqiang.


2012-03-28 17:07:28

by Andreas Dilger

Subject: Re: backup of the last group descriptor when it is the 1st group of a meta_bg

On 2012-03-27, at 8:47 AM, Yongqiang Yang wrote:
> Hi Ted, Andreas and List,
>
> As Andreas pointed out last year, if the last group is the 1st group
> in a meta_bg, then its group descriptor has no backup.
> With meta_bg the resize inode is not needed, so I had a thought: could we
> store a backup group descriptor of the last group in the resize inode?
> What are your opinions?

The main difficulty of referencing a backup group descriptor from the
resize inode is that it may confuse tools that are trying to modify
the resize inode. Also, it is more difficult to access the block from
userspace, since it would need to read the inode and use an extent to
reference the block beyond 16TB.

What about storing the 64-bit block number in the superblock? This
should be safe for older e2fsprogs that understand META_BG. At worst
the new backup group descriptor will not be updated on a resize by
older e2fsprogs, which is no worse than not having a backup at all.

I would suggest to put the backup group descriptor in the last block
of the filesystem. This would be in the 0th group of the metagroup.
If the metagroup grows to have a second group, then this block is not
needed anymore, and if both the primary (at the beginning of the group)
and the backup (at the end of the group) are corrupted, then there is
little chance that the data in this last group is good either...

Actually, if the backup is always stored in the last block of the 0th
group (which is itself the last group in the filesystem), there isn't
even a need to store this location in the superblock.
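
A minimal sketch of the arithmetic behind this proposal, in Python. It assumes the whole filesystem uses the meta_bg layout and 32-byte group descriptors; the function and parameter names are hypothetical and are not taken from e2fsprogs.

```python
def last_group_info(blocks_count, blocks_per_group, block_size,
                    desc_size=32, first_data_block=0):
    """Return (last_group, is_first_in_meta_bg, last_block).

    Assumption: every group belongs to a meta_bg, so one metagroup spans
    as many groups as there are descriptors in a single block.
    """
    data_blocks = blocks_count - first_data_block
    # Round up so that a partial final group is counted.
    groups = -(-data_blocks // blocks_per_group)
    groups_per_meta_bg = block_size // desc_size
    last_group = groups - 1
    # The normal backups live in the 2nd and last groups of a metagroup,
    # so a metagroup holding a single group has none.
    is_first_in_meta_bg = (last_group % groups_per_meta_bg == 0)
    # Proposed backup location: the last block inside the filesystem.
    last_block = blocks_count - 1
    return last_group, is_first_in_meta_bg, last_block
```

For example, with 4 KiB blocks and 32768 blocks per group, a filesystem of 129 groups has group 128 alone in its metagroup, so the backup would go in block 129 * 32768 - 1.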

Cheers, Andreas




2012-04-02 05:04:17

by Yongqiang Yang

Subject: Re: backup of the last group descriptor when it is the 1st group of a meta_bg

On Thu, Mar 29, 2012 at 12:08 AM, Andreas Dilger <[email protected]> wrote:
> On 2012-03-27, at 8:47 AM, Yongqiang Yang wrote:
>> Hi Ted, Andreas and List,
>>
>> As Andreas pointed out last year, if the last group is the 1st group
>> in a meta_bg, then its group descriptor has no backup.
>> With meta_bg the resize inode is not needed, so I had a thought: could we
>> store a backup group descriptor of the last group in the resize inode?
>> What are your opinions?
>
> The main difficulty of referencing a backup group descriptor from the
> resize inode is that it may confuse tools that are trying to modify
> the resize inode.  Also, it is more difficult to access the block from
> userspace, since it would need to read the inode and use an extent to
> reference the block beyond 16TB.
I meant that we store the backup group descriptor in the resize inode
itself rather than in data blocks, so it does not need an extent at
all. However, if the inode is corrupted, we need to patch e2fsck so
that it understands the resize inode.
>
> What about storing the 64-bit block number in the superblock?  This
> should be safe for older e2fsprogs that understand META_BG.  At worst
> the new backup group descriptor will not be updated on a resize by
> older e2fsprogs, which is no worse than not having a backup at all.
>
> I would suggest to put the backup group descriptor in the last block
> of the filesystem.  This would be in the 0th group of the metagroup.
> If the metagroup grows to have a second group, then this block is not
> needed anymore, and if both the primary (at the beginning of the group)
> and the backup (at the end of the group) are corrupted, then there is
> little chance that the data in this last group is good either...
>
> Actually, if the backup is always stored in the last block of the 0th
> group (which is itself the last group in the filesystem), there isn't
> even a need to store this location in the superblock.
Now we have 2 solutions: the 1st is storing the backup group
descriptor in the resize inode itself, while the 2nd is storing the
backup in the last block of the 0th group. Both need e2fsck to be
patched, because older e2fsck does not handle them. The 1st one's
patch to e2fsck is much more complicated, because only one group
descriptor is stored in the resize inode itself, whereas e2fsck's code
reads and writes whole group descriptor blocks. So I like the 2nd one.

Yongqiang.



--
Best Wishes
Yongqiang Yang

2012-04-02 05:44:02

by Andreas Dilger

Subject: Re: backup of the last group descriptor when it is the 1st group of a meta_bg

On 2012-04-01, at 11:04 PM, Yongqiang Yang wrote:
> On Thu, Mar 29, 2012 at 12:08 AM, Andreas Dilger <[email protected]> wrote:
>> I would suggest to put the backup group descriptor in the last block
>> of the filesystem. This would be in the 0th group of the metagroup.
>> If the metagroup grows to have a second group, then this block is not
>> needed anymore, and if both the primary (at the beginning of the group)
>> and the backup (at the end of the group) are corrupted, then there is
>> little chance that the data in this last group is good either...
>
> Now we have 2 solutions: the 1st is storing the backup group
> descriptor in the resize inode itself, while the 2nd is storing the
> backup in the last block of the 0th group. Both need e2fsck to be
> patched, because older e2fsck does not handle them. The 1st one's
> patch to e2fsck is much more complicated, because only one group
> descriptor is stored in the resize inode itself, whereas e2fsck's code
> reads and writes whole group descriptor blocks. So I like the 2nd one.

This solution doesn't _require_ patching e2fsck, which is useful.
If an older e2fsck doesn't understand the backup group descriptor is
in the last block, it is no worse than today where the backup does
not exist at all. In that case, the old e2fsck would mark this block
free, and there is a tiny chance that it would be allocated to some
file and overwritten.

However, the last block will almost never be allocated, since block
allocation is typically biased toward the beginning of the disk, so
storing a checksum in it (per Darrick's patches) would allow a new
e2fsck to use it in case of emergency, and it would mark the block
in use again (so long as it wasn't allocated to some file).

Cheers, Andreas






2012-04-03 18:39:53

by Theodore Ts'o

Subject: Re: backup of the last group descriptor when it is the 1st group of a meta_bg

On Wed, Mar 28, 2012 at 10:08:39AM -0600, Andreas Dilger wrote:
> On 2012-03-27, at 8:47 AM, Yongqiang Yang wrote:
> > Hi Ted, Andreas and List,
> >
> > As Andreas pointed out last year, if the last group is the 1st group
> > in a meta_bg, then its group descriptor has no backup.
> > With meta_bg the resize inode is not needed, so I had a thought: could we
> > store a backup group descriptor of the last group in the resize inode?
> > What are your opinions?

It's an issue; however, when I originally thought about it a number of
years ago, it wasn't something I was terribly worried about, since by
definition the percentage of the file system that we could lose is a
small percentage overall. If we really were worried we could simply
strongly bias the inode allocator against using the inodes in that last
block group. Alternatively, now that we have metadata checksums, it
becomes even easier to find the inode table via a brute force search
if we really needed to find it.

> Actually, if the backup is always stored in the last block of the 0th
> group (which is itself the last group in the filesystem), there isn't
> even a need to store this location in the superblock.

Something that might make sense is to put a backup of the block group
descriptor at block #s_blocks_count (i.e., one block past the end of the
file system as described in the superblock). E2fsck would just simply
try to see if there's a valid block group descriptor block at one
block past the end of the file system if the primary looks bad and
there aren't any of the normal meta_bg backups --- and that way we
wouldn't need to make any file system format changes.
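
One practical consequence of this placement is that the underlying device must extend at least one block beyond the filesystem. A small sketch of that check in Python; the function and parameter names are hypothetical, and nothing here reflects actual e2fsck code.

```python
def past_end_backup_offset(blocks_count, block_size, device_size_bytes):
    """Byte offset of the block one past the end of the filesystem.

    Returns None when the device is too small to hold that extra block,
    in which case no backup could be kept there.
    """
    # Block numbers start at 0, so block number `blocks_count` is the
    # first block that the superblock does not describe.
    offset = blocks_count * block_size
    if device_size_bytes < offset + block_size:
        return None
    return offset
```

With a filesystem that exactly fills its device the function returns None, so a layout like this would presumably need mke2fs or resize2fs to leave one block of slack.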

- Ted

2012-04-03 19:25:59

by Andreas Dilger

Subject: Re: backup of the last group descriptor when it is the 1st group of a meta_bg

On 2012-04-03, at 12:39 PM, Ted Ts'o wrote:
> On Wed, Mar 28, 2012 at 10:08:39AM -0600, Andreas Dilger wrote:
>> Actually, if the backup is always stored in the last block of the 0th
>> group (which is itself the last group in the filesystem), there isn't
>> even a need to store this location in the superblock.
>
> Something that might make sense is to put a backup of the block group
> descriptor at block #s_blocks_count (i.e., one block past the end of the
> file system as described in the superblock). E2fsck would just simply
> try to see if there's a valid block group descriptor block at one
> block past the end of the file system if the primary looks bad and
> there aren't any of the normal meta_bg backups --- and that way we
> wouldn't need to make any file system format changes.

I had thought of that as well, but came to the conclusion that it was
a bit hacky to store filesystem metadata past the end of the filesystem
as declared in the superblock.

My thinking for putting the backup inside the filesystem was that the
error reported by an old e2fsck (last block marked in bitmap) was
harmless, and could easily be repaired by a newer version of e2fsck.

It would probably not even cause the backup group descriptor to be
lost in the worst case (new mke2fs/e2fsck/resize2fs creates gd backup,
old e2fsck "deletes" gd backup block, use filesystem for a long time,
corrupt primary group descriptors, try to recover using new e2fsck).

If the user is only running an old e2fsck, then they are no worse off
than before - no backup existed for the meta_bg at all. In virtually
every case with a new e2fsck, the last block inside the filesystem
would not be allocated (compounding the already rare case where the
last group descriptor is corrupted), and e2fsck can verify the checksum
of the group descriptor in that block and use this to locate the bitmaps
and inode table for the last group. Again, if the checksum fails, then
the filesystem is no worse off than today if no backup was stored.

Cheers, Andreas






2012-04-03 21:26:58

by Theodore Ts'o

Subject: Re: backup of the last group descriptor when it is the 1st group of a meta_bg

On Tue, Apr 03, 2012 at 01:28:14PM -0600, Andreas Dilger wrote:
> It would probably not even cause the backup group descriptor to be
> lost in the worst case (new mke2fs/e2fsck/resize2fs creates gd backup,
> old e2fsck "deletes" gd backup block, use filesystem for a long time,
> corrupt primary group descriptors, try to recover using new e2fsck).

Well, it can only be repaired if that block hasn't been allocated and
assigned to a file. If it has, then you can't easily repair it and
you have to resign yourself to not having a backup of the bgd. And
that means more complexity since e2fsck would have to deal with the
possibility that the last block might contain a backup bgd, or might
be allocated to a file.

And even if there is a "harmless" corruption, it will still
potentially alarm users who happen to format an ext4 file system with
this change implemented, and then they boot a rescue CD which is
using an older e2fsprogs.

Ultimately I suspect the best approach might be to simply try to
reconstruct the last bgd by attempting to find the inode table in case
the last meta_bg bgd is destroyed. Since this only comes up for file
systems with a single block group in a meta_bg, it's a relatively easy
thing to do.....

- Ted

2012-04-03 22:05:39

by Andreas Dilger

Subject: Re: backup of the last group descriptor when it is the 1st group of a meta_bg

On 2012-04-03, at 3:26 PM, Ted Ts'o wrote:
> On Tue, Apr 03, 2012 at 01:28:14PM -0600, Andreas Dilger wrote:
>> It would probably not even cause the backup group descriptor to be
>> lost in the worst case (new mke2fs/e2fsck/resize2fs creates gd backup,
>> old e2fsck "deletes" gd backup block, use filesystem for a long time,
>> corrupt primary group descriptors, try to recover using new e2fsck).
>
> Well, it can only be repaired if that block hasn't been allocated and
> assigned to a file.

True, but this is IMHO a fairly rare case, since inode and block
allocation is biased toward the beginning of the filesystem and
the beginning of each group. 19 of 20 filesystems I checked didn't
have the last block allocated.

> If it has, then you can't easily repair it and you have to resign
> yourself to not having a backup of the bgd. And that means more
> complexity since e2fsck would have to deal with the possibility that
> the last block might contain a backup bgd, or might be allocated to
> a file.

Sure, but it is not worse than having no backup as you propose below.

> And even if there is a "harmless" corruption, it will still
> potentially alarm users who happen to format an ext4 file system with
> this change implemented, and then they boot a rescue CD which is
> using an older e2fsprogs.

I modified a filesystem with debugfs to check this. e2fsck -fn reports:

Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -4194303

/dev/sdc: ********** WARNING: Filesystem still has errors **********
/dev/sdc: 11/1048576 files (0.0% non-contiguous), 77074/4194304 blocks


which I agree isn't completely silent. Running e2fsck -fy reports:

Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -4194303
Fix? yes

Free blocks count wrong for group #127 (32253, counted=32254).
Fix? yes

Free blocks count wrong (4117230, counted=4117231).
Fix? yes

/dev/sdc: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sdc: 11/1048576 files (0.0% non-contiguous), 77073/4194304 blocks

e2fsck -fp is quiet, since all of these errors are harmless:

/dev/sdc: 11/1048576 files (0.0% non-contiguous), 77073/4194304 blocks
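
The numbers in the output above are consistent with a single freed block, the last one in the filesystem. A quick check in Python, assuming 32768 blocks per group (the usual value for 4 KiB blocks, not stated in the output) and a first data block of 0:

```python
BLOCKS_COUNT = 4194304      # total blocks, from the e2fsck summary line
BLOCKS_PER_GROUP = 32768    # assumption: default for 4 KiB blocks
FREED_BLOCK = 4194303       # from "Block bitmap differences: -4194303"

# The freed block is the very last block of the filesystem.
is_last_block = FREED_BLOCK == BLOCKS_COUNT - 1

# It belongs to the last group, matching "group #127" in the output.
group_of_block = FREED_BLOCK // BLOCKS_PER_GROUP

# Used blocks before and after the fix, derived from the free counts.
used_before = BLOCKS_COUNT - 4117230
used_after = BLOCKS_COUNT - 4117231

print(is_last_block, group_of_block, used_before, used_after)
# prints: True 127 77074 77073
```

The used-block totals match the 77074 and 77073 shown in the two summary lines.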


> Ultimately I suspect the best approach might be to simply try to
> reconstruct the last bgd by attempting to find the inode table in case
> the last meta_bg bgd is destroyed. Since this only comes up for file
> systems with a single block group in a meta_bg, it's a relatively easy
> thing to do.....

This is guesswork that could be wrong, and doesn't get any closer to
actually getting a proper backup. Adding the backup gives a long-term
robust solution, and it only has very minor drawbacks (spurious error
messages in e2fsck, some chance of no backup) with a combination of
extremely rare failure cases. It is still possible to fall back to
guessing, but I'd rather avoid it.

Cheers, Andreas






2012-04-04 19:28:09

by Theodore Ts'o

Subject: Re: backup of the last group descriptor when it is the 1st group of a meta_bg

On Tue, Apr 03, 2012 at 04:07:54PM -0600, Andreas Dilger wrote:
> > And even if there is a "harmless" corruption, it will still
> > potentially alarm users who happen to format an ext4 file system with
> > this change implemented, and then they boot a rescue CD which is
> > using an older e2fsprogs.
>
> I modified a filesystem with debugfs to check this. e2fsck -fn reports:...
> which I agree isn't completely silent. Running e2fsck -fy reports:
>
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> Block bitmap differences: -4194303
> Fix? yes
>
> Free blocks count wrong for group #127 (32253, counted=32254).
> Fix? yes
>
> Free blocks count wrong (4117230, counted=4117231).
> Fix? yes
>
> /dev/sdc: ***** FILE SYSTEM WAS MODIFIED *****
> /dev/sdc: 11/1048576 files (0.0% non-contiguous), 77073/4194304 blocks
>
> e2fsck -fp is quiet, since all of these errors are harmless:
>
> /dev/sdc: 11/1048576 files (0.0% non-contiguous), 77073/4194304 blocks

Granted that in preen mode the error messages aren't printed. But if
someone runs e2fsck by hand, they'll see the output of e2fsck -fy.

I was chatting with Alasdair last evening at the Collab Summit
reception, and he pointed out to me (in relation to another technical
issue, but it applies here too) that if you're a distro, you really
want to engineer to avoid support calls. If you have error messages
that are confusing or are needlessly scary, if that causes even a
small number of support calls to your help desk, that costs you (the
distro) real money. So it's really important to consider carefully
how you write your error messages; on the one hand you want to make
sure enough information ends up in the dmesg log so that a helpdesk
can debug the problem; but if the printk's are scary, it can cause
needless calls to a support desk, and that costs money, and can be the
difference between profit and loss.

Granted that would only happen if you have a mix between older and
newer versions of e2fsprogs, so perhaps this won't be that likely. I
suppose we could use a compat feature to prevent an older version of
e2fsck from running on that file system. Is it worth it? Perhaps...

> This is guesswork that could be wrong, and doesn't get any closer to
> actually getting a proper backup. Adding the backup gives a long-term
> robust solution, and it only has very minor drawbacks (spurious error
> messages in e2fsck, some chance of no backup) with a combination of
> extremely rare failure cases. It is still possible to fall back to
> guessing, but I'd rather avoid it.

Well, once you have metadata checksums (which should be landing soon),
the "guesswork" is actually going to be reliable, since we won't need
to use heuristics to determine whether or not a potential inode table
block really is an ITB.
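
A toy illustration of why checksums make such a search reliable, in Python. This is not ext4's on-disk format: the block layout is invented for the example (each block carries one trailing 4-byte checksum seeded with the filesystem UUID), `zlib.crc32` is only a stand-in checksum, and all names are hypothetical.

```python
import zlib

def seal(uuid: bytes, payload: bytes) -> bytes:
    """Build a toy block: payload followed by a checksum seeded with the UUID."""
    csum = zlib.crc32(uuid + payload)
    return payload + csum.to_bytes(4, "little")

def verifies(uuid: bytes, block: bytes) -> bool:
    """True if the trailing 4 bytes match the checksum of the rest."""
    payload, stored = block[:-4], block[-4:]
    return zlib.crc32(uuid + payload).to_bytes(4, "little") == stored

def find_table(uuid: bytes, blocks, table_len: int):
    """Index of the first run of table_len consecutive verifying blocks, or None."""
    run = 0
    for i, block in enumerate(blocks):
        run = run + 1 if verifies(uuid, block) else 0
        if run == table_len:
            return i - table_len + 1
    return None
```

Because the checksum is seeded with the UUID, stale blocks from another filesystem do not verify, so a match is strong evidence rather than a guess based on what the contents look like.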

In fact, one of the things I'm looking forward to doing is using this
technique to make a completely safe mke2fs -S functionality. Whether
we do this in the mke2fs program, or in e2fsck, it will substantially
improve our ability to recover even if the backup descriptors aren't
available for some reason. (Users do use mke2fs -S, so there are
definitely times when the backup bgd's aren't sufficient.)

Ultimately, it's a tradeoff. The kludge of putting the backup in the
last block in the file system (whether or not we decrement
s_blocks_count, which to me isn't that different in terms of
kludginess) has the advantage that it only requires a new version of
e2fsprogs.

If we try to use heuristics, which with metadata checksums I think
solve the problem completely, that requires an updated kernel plus an
updated e2fsprogs. Without metadata checksums, your criticism of
the heuristic is fair, although I suspect most of the
time it would actually work.

Regards,

- Ted