2009-11-04 17:37:17

by Eric Sandeen

Subject: [PATCH] handle journal checksum errors via ext4_error()

As the code stands today, journal checksum errors on the root fs,
resulting in only a partial journal replay, result in no kernel
messages and no error condition, because it is all handled under
a "!(sb->s_flags & MS_RDONLY)" test - and the root fs starts out
life in RO mode.

It seems to me that just calling ext4_error() here is almost the
right thing to do; it will unconditionally mark the fs with errors,
and act according to the selected mount -o errors= behavior...

However, for the root fs, ext4_error will be short-circuited because
the root is mounted readonly; and in fact these errors came into
being while we were writing to a read-only fs during journal recovery.
So we won't do a journal_abort or anything like that, and other than
a printk and an error flag, not much action is taken for a
partially-recovered fs. Does this seem like enough?

Should we just let the root fs carry on if it's been partially-recovered,
and therefore pretty well known to be corrupt?

Signed-off-by: Eric Sandeen <[email protected]>
---

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 312211e..08370ae 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2719,25 +2719,12 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
EXT4_HAS_COMPAT_FEATURE(sb, EXT4_FEATURE_COMPAT_HAS_JOURNAL)) {
if (ext4_load_journal(sb, es, journal_devnum))
goto failed_mount3;
- if (!(sb->s_flags & MS_RDONLY) &&
- EXT4_SB(sb)->s_journal->j_failed_commit) {
- ext4_msg(sb, KERN_CRIT, "error: "
- "ext4_fill_super: Journal transaction "
- "%u is corrupt",
- EXT4_SB(sb)->s_journal->j_failed_commit);
- if (test_opt(sb, ERRORS_RO)) {
- ext4_msg(sb, KERN_CRIT,
- "Mounting filesystem read-only");
- sb->s_flags |= MS_RDONLY;
- EXT4_SB(sb)->s_mount_state |= EXT4_ERROR_FS;
- es->s_state |= cpu_to_le16(EXT4_ERROR_FS);
- }
- if (test_opt(sb, ERRORS_PANIC)) {
- EXT4_SB(sb)->s_mount_state |= EXT4_ERROR_FS;
- es->s_state |= cpu_to_le16(EXT4_ERROR_FS);
- ext4_commit_super(sb, 1);
- goto failed_mount4;
- }
+
+ if (EXT4_SB(sb)->s_journal->j_failed_commit) {
+ ext4_error(sb, __func__,
+ "Journal transaction %u is corrupt, "
+ "filesystem only partially recovered",
+ EXT4_SB(sb)->s_journal->j_failed_commit);
}
} else if (test_opt(sb, NOLOAD) && !(sb->s_flags & MS_RDONLY) &&
EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER)) {




2009-11-08 21:34:10

by Theodore Ts'o

Subject: Re: [PATCH] handle journal checksum errors via ext4_error()

On Wed, Nov 04, 2009 at 11:37:20AM -0600, Eric Sandeen wrote:
> As the code stands today, journal checksum errors on the root fs,
> resulting in only a partial journal replay, result in no kernel
> messages and no error condition, because it is all handled under
> a "!(sb->s_flags & MS_RDONLY)" test - and the root fs starts out
> life in RO mode.
>
> It seems to me that just calling ext4_error() here is almost the
> right thing to do; it will unconditionally mark the fs with errors,
> and act according to the selected mount -o errors= behavior...
>
> However, for the root fs, ext4_error will be short-circuited because
> the root is mounted readonly; and in fact these errors came into
> being while we were writing to a read-only fs during journal recovery.
> So we won't do a journal_abort or anything like that, and other than
> a printk and an error flag, not much action is taken for a
> partially-recovered fs. Does this seem like enough?
>
> Should we just let the root fs carry on if it's been partially-recovered,
> and therefore pretty well known to be corrupt?

I think we need to do more, especially since there will be a large
number of desktop/laptop systems that only have a single root file
system. Given that we are willing to modify the file system to replay
the journal and deal with the orphaned inode list, even when it is
mounted read-only, I think we should temporarily mark the file system
read/write before calling ext4_error(), and then restore it to
read-only status if we return from ext4_error().

That way when the init scripts run fsck, we'll at least get a forced
fsck to clean up the file system afterwards.
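
Roughly along these lines, perhaps -- an untested sketch against the
hunk in Eric's patch above, just to show the flag juggling I mean:

	if (EXT4_SB(sb)->s_journal->j_failed_commit) {
		int was_ro = sb->s_flags & MS_RDONLY;

		/* let ext4_error() register the error even on a ro root */
		sb->s_flags &= ~MS_RDONLY;
		ext4_error(sb, __func__,
			   "Journal transaction %u is corrupt, "
			   "filesystem only partially recovered",
			   EXT4_SB(sb)->s_journal->j_failed_commit);
		/* put back the read-only state we were mounted with */
		if (was_ro)
			sb->s_flags |= MS_RDONLY;
	}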

Given that we've seen how disastrous things can be when we abort a
journal replay several commits before the end, I think we really
need to think about adding per-block checksums, so that in the case
where the block is later overwritten by a subsequent transaction
(which is quite likely for most metadata blocks, especially block or
inode allocation bitmaps), we have a better chance of recovering from
a failed checksum. Once we have per-block checksums, even if we don't
have a newer version of the block, it may be that we're better off
continuing to replay subsequent commits, and simply not writing the block
where we know the checksum is bad, on the theory that while we will
still need to run fsck afterwards, there will likely be less damage to
repair...
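
To make that concrete, the replay loop could do something like the
following.  This is purely hypothetical -- jbd2 has no per-block
checksums today, and jbd2_block_csum_ok() is an imagined helper -- but
the surrounding names are the ones used in do_one_pass() in
fs/jbd2/recovery.c:

	/* PASS_REPLAY, just after reading the journal copy (obh) of
	 * the fs block described by this tag */
	if (!jbd2_block_csum_ok(journal, obh, tag)) {
		printk(KERN_WARNING
		       "JBD2: bad checksum for fs block %llu in "
		       "transaction %u, not replaying it\n",
		       (unsigned long long) blocknr, next_commit_ID);
		brelse(obh);
		/* skip just this block; a later commit may well
		 * rewrite it, and fsck will clean up whatever is left */
		goto skip_write;
	}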

The other thing we might do is see if we can be more aggressive
about checkpointing transactions, since if the blocks have been written
to their final location on disk, it doesn't matter if there was a failure
writing them to the journal. This obviously has performance
implications, though.
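
Just to illustrate the idea -- this is a sketch, not a proposal for
where the hook would actually live, and ext4_aggressive_checkpoint()
is a made-up name:

/*
 * Force everything that has already committed out to its final
 * location, so a later corruption in the journal can't touch it.
 * How often to call this is exactly the performance trade-off above.
 */
static void ext4_aggressive_checkpoint(struct super_block *sb)
{
	journal_t *journal = EXT4_SB(sb)->s_journal;

	if (journal)
		jbd2_journal_flush(journal);
}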

- Ted

2009-11-14 20:27:53

by Theodore Ts'o

Subject: Re: [PATCH] handle journal checksum errors via ext4_error()

On Wed, Nov 04, 2009 at 11:37:20AM -0600, Eric Sandeen wrote:
> As the code stands today, journal checksum errors on the root fs,
> resulting in only a partial journal replay, result in no kernel
> messages and no error condition, because it is all handled under
> a "!(sb->s_flags & MS_RDONLY)" test - and the root fs starts out
> life in RO mode.
>
> It seems to me that just calling ext4_error() here is almost the
> right thing to do; it will unconditionally mark the fs with errors,
> and act according to the selected mount -o errors= behavior...
>
> However, for the root fs, ext4_error will be short-circuited because
> the root is mounted readonly; and in fact these errors came into
> being while we were writing to a read-only fs during journal recovery.
> So we won't do a journal_abort or anything like that, and other than
> a printk and an error flag, not much action is taken for a
> partially-recovered fs. Does this seem like enough?
>
> Should we just let the root fs carry on if it's been partially-recovered,
> and therefore pretty well known to be corrupt?

I've been thinking about this some more, and I think we need to make
a change to jbd2_journal_load() so that it checks for a failed journal
checksum and avoids resetting the journal. This allows a
userspace program like e2fsck to do something better about recovering
from this situation.

- Ted

jbd2: don't wipe the journal on a failed journal checksum

If there is a failed journal checksum, don't reset the journal. This
allows for userspace programs to decide how to recover from this
situation. It may be that ignoring the journal checksum failure is
a better way of recovering the file system. Once we add per-block
checksums, we can definitely do better. Until then, a system
administrator can try backing up the file system image (or taking a
snapshot) and trying to determine experimentally whether ignoring
the checksum failure or aborting the journal replay results in less
data loss.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/jbd2/journal.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index fed8538..af60d98 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1248,6 +1248,13 @@ int jbd2_journal_load(journal_t *journal)
if (jbd2_journal_recover(journal))
goto recovery_error;

+ if (journal->j_failed_commit) {
+ printk(KERN_ERR "JBD2: journal transaction %u on %s "
+ "is corrupt.\n", journal->j_failed_commit,
+ journal->j_devname);
+ return -EIO;
+ }
+
/* OK, we've finished with the dynamic journal bits:
* reinitialise the dynamic contents of the superblock in memory
* and reset them on disk. */


2009-11-15 20:28:17

by Theodore Ts'o

Subject: [PATCH] ext4: Remove failed journal checksum check

Now that we are checking for failed journal checksums in the jbd2
layer, we don't need to check in the ext4 mount path --- since a
checksum failure will result in ext4_load_journal() returning an error,
causing the file system to refuse to be mounted until e2fsck can deal
with the problem.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/super.c | 20 --------------------
1 files changed, 0 insertions(+), 20 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8662b2e..04c6690 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2721,26 +2721,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
EXT4_HAS_COMPAT_FEATURE(sb, EXT4_FEATURE_COMPAT_HAS_JOURNAL)) {
if (ext4_load_journal(sb, es, journal_devnum))
goto failed_mount3;
- if (!(sb->s_flags & MS_RDONLY) &&
- EXT4_SB(sb)->s_journal->j_failed_commit) {
- ext4_msg(sb, KERN_CRIT, "error: "
- "ext4_fill_super: Journal transaction "
- "%u is corrupt",
- EXT4_SB(sb)->s_journal->j_failed_commit);
- if (test_opt(sb, ERRORS_RO)) {
- ext4_msg(sb, KERN_CRIT,
- "Mounting filesystem read-only");
- sb->s_flags |= MS_RDONLY;
- EXT4_SB(sb)->s_mount_state |= EXT4_ERROR_FS;
- es->s_state |= cpu_to_le16(EXT4_ERROR_FS);
- }
- if (test_opt(sb, ERRORS_PANIC)) {
- EXT4_SB(sb)->s_mount_state |= EXT4_ERROR_FS;
- es->s_state |= cpu_to_le16(EXT4_ERROR_FS);
- ext4_commit_super(sb, 1);
- goto failed_mount4;
- }
- }
} else if (test_opt(sb, NOLOAD) && !(sb->s_flags & MS_RDONLY) &&
EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER)) {
ext4_msg(sb, KERN_ERR, "required journal recovery "
--
1.6.5.216.g5288a.dirty


2009-11-17 16:05:37

by Jan Kara

Subject: Re: [PATCH] ext4: Remove failed journal checksum check

> Now that we are checking for failed journal checksums in the jbd2
> layer, we don't need to check in the ext4 mount path --- since a
> checksum failure will result in ext4_load_journal() returning an error,
> causing the file system to refuse to be mounted until e2fsck can deal
> with the problem.
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
> fs/ext4/super.c | 20 --------------------
> 1 files changed, 0 insertions(+), 20 deletions(-)
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 8662b2e..04c6690 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -2721,26 +2721,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> EXT4_HAS_COMPAT_FEATURE(sb, EXT4_FEATURE_COMPAT_HAS_JOURNAL)) {
> if (ext4_load_journal(sb, es, journal_devnum))
> goto failed_mount3;
But shouldn't we set the EXT4_ERROR_FS flag? We don't seem to do this
in ext4_load_journal() when jbd2_journal_load() fails.

Honza

--
Jan Kara <[email protected]>
SuSE CR Labs

2009-11-18 03:50:12

by Theodore Ts'o

Subject: Re: [PATCH] ext4: Remove failed journal checksum check

On Tue, Nov 17, 2009 at 05:05:42PM +0100, Jan Kara wrote:
> > But shouldn't we set the EXT4_ERROR_FS flag? We don't seem to do this
> in ext4_load_journal() when jbd2_journal_load() fails.

No, we don't need to set the EXT4_ERROR_FS flag. When
jbd2_journal_load() fails, we are leaving the journal in place and we
are refusing the mount. In the case of a root file system with this
problem, this will lead to a panic, and the user will have to use a
rescue CD.

In any case, when e2fsck runs, the current version will report the
error, abort the journal playback, and then force a full check of the
file system. So this actually does what we want without setting the
EXT4_ERROR_FS flag. In fact setting the flag will likely be
pointless, since if the superblock is journalled, it will get
overwritten during the journal replay.

In fact, what I think e2fsck should do as the default option is to
*skip* the journal transaction with the failed checksum, but *not*
abort the journal replay, and to replay the rest of the journal
transactions with correct checksums, and then force a full fsck.
Aborting the journal replay and abandoning 10 or more transactions
after the failed one is likely to do far more damage. We're better
off replaying the remaining transactions, hoping that some or all of
the blocks in the skipped, failed transaction are contained in
subsequent transactions, and then cleaning up the file system
afterwards.
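
As a rough sketch of that policy, using the variable names from the
kernel's fs/jbd2/recovery.c (e2fsck's copy of the recovery code is
close but not identical, so treat this as illustration only):

	/* PASS_SCAN: remember the commit whose checksum didn't verify,
	 * but keep scanning instead of ending the replay here */
	if (chksum_err) {
		journal->j_failed_commit = next_commit_ID;
		chksum_err = 0;
	}

	/* PASS_REPLAY: when we reach the descriptor block of the failed
	 * commit, step over its data blocks without writing them out */
	if (next_commit_ID == journal->j_failed_commit) {
		next_log_block += count_tags(journal, bh);
		wrap(journal, next_log_block);
		brelse(bh);
		continue;
	}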

E2fsck should have a (non-default) option to replay the failed
transaction anyway, so that a really paranoid system administrator
could try it both ways. Using an LVM snapshot would allow the
sysadmin to try both ways quite efficiently.

Here's an excerpt from the journal of a file system that was aborted
during an fs_mark run. (Generated using "logdump -a" in debugfs):

Found expected sequence 5735, type 2 (commit block) at block 1977
Found expected sequence 5736, type 1 (descriptor block) at block 1978
Dumping descriptor block, sequence 5736, at block 1978:
FS block 277 logged at journal block 1979 (flags 0x0)
FS block 2 logged at journal block 1980 (flags 0x2)
FS block 1009 logged at journal block 1981 (flags 0x2)
FS block 547 logged at journal block 1982 (flags 0x2)
FS block 4433 logged at journal block 1983 (flags 0x2)
FS block 267 logged at journal block 1984 (flags 0xa)
Found expected sequence 5736, type 2 (commit block) at block 1985
Found expected sequence 5737, type 1 (descriptor block) at block 1986
Dumping descriptor block, sequence 5737, at block 1986:
FS block 277 logged at journal block 1987 (flags 0x0)
FS block 2 logged at journal block 1988 (flags 0x2)
FS block 1009 logged at journal block 1989 (flags 0x2)
FS block 547 logged at journal block 1990 (flags 0x2)
FS block 4451 logged at journal block 1991 (flags 0x2)
FS block 267 logged at journal block 1992 (flags 0xa)
Found expected sequence 5737, type 2 (commit block) at block 1993
Found expected sequence 5738, type 1 (descriptor block) at block 1994
Dumping descriptor block, sequence 5738, at block 1994:
FS block 277 logged at journal block 1995 (flags 0x0)
FS block 2 logged at journal block 1996 (flags 0x2)
FS block 1009 logged at journal block 1997 (flags 0x2)
FS block 547 logged at journal block 1998 (flags 0x2)
FS block 4680 logged at journal block 1999 (flags 0x2)
FS block 267 logged at journal block 2000 (flags 0xa)
Found expected sequence 5738, type 2 (commit block) at block 2001
Found expected sequence 5739, type 1 (descriptor block) at block 2002
Dumping descriptor block, sequence 5739, at block 2002:
FS block 277 logged at journal block 2003 (flags 0x0)
FS block 2 logged at journal block 2004 (flags 0x2)
FS block 1009 logged at journal block 2005 (flags 0x2)
FS block 547 logged at journal block 2006 (flags 0x2)
FS block 4714 logged at journal block 2007 (flags 0x2)
FS block 291 logged at journal block 2008 (flags 0xa)

This is a best case, but note how many blocks can appear multiple
times in the journal. If fs blocks 277, 2, 1009, or 547 are corrupted
in any transaction before #5739, causing a checksum failure in commit
#5436 (for example), replaying the subsequent transactions will
recover the damage. In fact, if blocks 4433 or 267 are intact, we're
better off replaying commit #5436, even if the journal checksum
doesn't match, since the corrupted blocks will be repaired by
subsequent commits, and at least that way we don't lose the updates to
blocks 4433 and 267.

So this is something that we really need to address in userspace, by
making e2fsck smarter. (And this is also why we really need
per-block checksums; it will help us recover from corrupted journals
much more easily and automatically.)

- Ted

2009-11-18 10:09:56

by Jan Kara

Subject: Re: [PATCH] ext4: Remove failed journal checksum check

On Tue 17-11-09 22:50:14, [email protected] wrote:
> On Tue, Nov 17, 2009 at 05:05:42PM +0100, Jan Kara wrote:
> > But shouldn't we set the EXT4_ERROR_FS flag? We don't seem to do this
> > in ext4_load_journal() when jbd2_journal_load() fails.
>
> No, we don't need to set the EXT4_ERROR_FS flag. When
> jbd2_journal_load() fails, we are leaving the journal in place and we
> are refusing the mount. In the case of a root file system with this
> problem, this will lead to a panic, and the user will have to use a
> rescue CD.
>
> In any case, when e2fsck runs, the current version will report the
> error, abort the journal playback, and then force a full check of the
> file system. So this actually does what we want without setting the
> EXT4_ERROR_FS flag. In fact setting the flag will likely be
> pointless, since if the superblock is journalled, it will get
> overwritten during the journal replay.
Ah, you're right. Thanks for the explanation.

> In fact, what I think e2fsck should do as the default option is to
> *skip* the journal transaction with the failed checksum, but *not*
> abort the journal replay, and to replay the rest of the journal
> transactions with correct checksums, and then force a full fsck.
> Aborting the journal replay and abandoning 10 or more transactions
> after the failed one is likely to do far more damage. We're better
> off replaying the remaining transactions, hoping that some or all of
> the blocks in the skipped, failed transaction are contained in
> subsequent transactions, and then cleaning up the file system
> afterwards.
I agree, replaying all transactions with correct checksums seems like a
better option. Of course, having per-block checksums and replaying all
blocks that have a correct checksum would be even better.

> E2fsck should have a (non-default) option to replay the failed
> transaction anyway, so that a really paranoid system administrator
> could try it both ways. Using an LVM snapshot would allow the
> sysadmin to try both ways quite efficiently.
Yes, possibly this can be useful as well.

> Here's an excerpt from the journal of a file system that was aborted
> during an fs_mark run. (Generated using "logdump -a" in debugfs):
>
> Found expected sequence 5735, type 2 (commit block) at block 1977
> Found expected sequence 5736, type 1 (descriptor block) at block 1978
> Dumping descriptor block, sequence 5736, at block 1978:
> FS block 277 logged at journal block 1979 (flags 0x0)
> FS block 2 logged at journal block 1980 (flags 0x2)
> FS block 1009 logged at journal block 1981 (flags 0x2)
> FS block 547 logged at journal block 1982 (flags 0x2)
> FS block 4433 logged at journal block 1983 (flags 0x2)
> FS block 267 logged at journal block 1984 (flags 0xa)
> Found expected sequence 5736, type 2 (commit block) at block 1985
> Found expected sequence 5737, type 1 (descriptor block) at block 1986
> Dumping descriptor block, sequence 5737, at block 1986:
> FS block 277 logged at journal block 1987 (flags 0x0)
> FS block 2 logged at journal block 1988 (flags 0x2)
> FS block 1009 logged at journal block 1989 (flags 0x2)
> FS block 547 logged at journal block 1990 (flags 0x2)
> FS block 4451 logged at journal block 1991 (flags 0x2)
> FS block 267 logged at journal block 1992 (flags 0xa)
> Found expected sequence 5737, type 2 (commit block) at block 1993
> Found expected sequence 5738, type 1 (descriptor block) at block 1994
> Dumping descriptor block, sequence 5738, at block 1994:
> FS block 277 logged at journal block 1995 (flags 0x0)
> FS block 2 logged at journal block 1996 (flags 0x2)
> FS block 1009 logged at journal block 1997 (flags 0x2)
> FS block 547 logged at journal block 1998 (flags 0x2)
> FS block 4680 logged at journal block 1999 (flags 0x2)
> FS block 267 logged at journal block 2000 (flags 0xa)
> Found expected sequence 5738, type 2 (commit block) at block 2001
> Found expected sequence 5739, type 1 (descriptor block) at block 2002
> Dumping descriptor block, sequence 5739, at block 2002:
> FS block 277 logged at journal block 2003 (flags 0x0)
> FS block 2 logged at journal block 2004 (flags 0x2)
> FS block 1009 logged at journal block 2005 (flags 0x2)
> FS block 547 logged at journal block 2006 (flags 0x2)
> FS block 4714 logged at journal block 2007 (flags 0x2)
> FS block 291 logged at journal block 2008 (flags 0xa)
>
> This is a best case, but note how many blocks can appear multiple
> times in the journal. If fs blocks 277, 2, 1009, or 547 are corrupted
> in any transaction before #5739, causing a checksum failure in commit
> #5736 (for example), replaying the subsequent transactions will
> recover the damage. In fact, if blocks 4433 or 267 are intact, we're
> better off replaying commit #5736, even if the journal checksum
> doesn't match, since the corrupted blocks will be repaired by
> subsequent commits, and at least that way we don't lose the updates to
> blocks 4433 and 267.
Yes, it heavily depends on the workload. But you're right that bitmaps,
group descriptors and the superblock are likely to appear in later
transactions. OTOH inodes, directory blocks or indirect blocks (or
whatever we call such blocks for extents) are not that likely to be there.

Honza