2008-05-24 22:34:31

by Theodore Ts'o

[permalink] [raw]
Subject: What to do when the journal checksum is incorrect


I've been taking a much closer look at the ext4's journal checksum code
as I integrated it into e2fsck, and I'm finding that what it's doing
doesn't make a whole lot of sense.

Suppose the journal has five commits, with transaction ID's 2, 3, 4, 5,
and 6. And suppose the CRC in the commit block delineating the end of
transaction #4 is bad. At the moment, due to a bug in the code, it
stops processing at transaction #4, meaning that transactions #2, #3,
and #4 are replayed into the filesystem --- even though transaction #4
failed the CRC checksum. Worse yet, no indication of any problems is
sent back to the ext4 filesystem code.

Even far worse, though, is that the filesystem is now in a potentially
very compromised state. Stopping the journal replay after transaction
#3 is no better, because the rest of the filesystem may had changes made
to it since then which would have been overwritten by replaying
only transactions #2 and #3.

What this means is that the only time when it's OK for us to proceed
when there is a checksum failure is if it's the very last commit block,
and journal_async_commit's are enabled. At any other time, we have to
assume that filesystem is corrupted.

The next question is what should we do the journal. For the root
filesystem, we need to be able to mount it in order to run e2fsck. If
we fail the mount because of the bad journal, then the user is forced to
use a rescue CD, at which point e2fsck is going to have to do something
automatic or force the user to start using a disk editor to recover.

So I think the right thing to do is to replay the *entire* journal,
including the commits with the failed checksums (except in the case
where journal_async_commit is enabled and the last commit has a bad
checksum, in which case we skip the last transaction). By replaying the
entire journal, we don't lose any of the revoke blocks, which is
critical in making sure we don't overwrite any data blocks, and
replaying subsequent metadata blocks will probably leave us in a much
better position for e2fsck to be able to recover the filesystem.

But if there are any non-terminal commits that have bad checksums, we
need to pass a flag back to the filesystem, and have the filesystem call
ext4_error() indicating that the journal was corrupt; this will mark the
filesystem has containing errors, and use the filesystem policy to
decide whether to remount the filesystem read-only, continue, or panic.
Then e2fsck at boot time can try to sort out the mess.

If e2fsck is replaying the journal, it will do the same thing; play it
back despite the checksum errors, and then force a full check of the
filesystem since it is quite likely corrupted as a result.

Does this make sense? It's the direction I plan to go in terms of
making changes to the kernel and the e2fsck's recovery code, so please
let me know if you disagree with this strategy.

- Ted







2008-05-25 06:30:28

by Andreas Dilger

[permalink] [raw]
Subject: Re: What to do when the journal checksum is incorrect

On May 24, 2008 18:34 -0400, Theodore Ts'o wrote:
> Suppose the journal has five commits, with transaction ID's 2, 3, 4, 5,
> and 6. And suppose the CRC in the commit block delineating the end of
> transaction #4 is bad. At the moment, due to a bug in the code, it
> stops processing at transaction #4, meaning that transactions #2, #3,
> and #4 are replayed into the filesystem --- even though transaction #4
> failed the CRC checksum. Worse yet, no indication of any problems is
> sent back to the ext4 filesystem code.

I agree it is a bug that transaction #4 with the bad checksum would be
replayed, but I thought there was supposed to be an error returned in
case of the bad checksum not being in the last transaction?

> Even far worse, though, is that the filesystem is now in a potentially
> very compromised state. Stopping the journal replay after transaction
> #3 is no better, because the rest of the filesystem may had changes made
> to it since then which would have been overwritten by replaying
> only transactions #2 and #3.

This is true, but I don't agree with the conculsion you draw from it.

> So I think the right thing to do is to replay the *entire* journal,
> including the commits with the failed checksums (except in the case
> where journal_async_commit is enabled and the last commit has a bad
> checksum, in which case we skip the last transaction). By replaying the
> entire journal, we don't lose any of the revoke blocks, which is
> critical in making sure we don't overwrite any data blocks, and
> replaying subsequent metadata blocks will probably leave us in a much
> better position for e2fsck to be able to recover the filesystem.

I don't agree with this at all. With the exception of the checksum
error being in the last transaction, there is a danger of complete
garbage being checkpointed into the filesystem if the checksum is bad.
The alternative to not replay the rest of the transactions at worst
has some probability that newer versions of the blocks were checkpointed
into the filesystem before it crashed.

You are proposing that we write some blocks which we know to be garbage
directly into the filesystem metadata instead of leaving some unordered
writes in the filesystem to be fixed (as happens with ext2 all of the time).

> But if there are any non-terminal commits that have bad checksums, we
> need to pass a flag back to the filesystem, and have the filesystem call
> ext4_error() indicating that the journal was corrupt; this will mark the
> filesystem has containing errors, and use the filesystem policy to
> decide whether to remount the filesystem read-only, continue, or panic.
> Then e2fsck at boot time can try to sort out the mess.

Yes, that is definitely the right approach, and I thought that was the
case. It's been a while since I looked at the code.

> If e2fsck is replaying the journal, it will do the same thing; play it
> back despite the checksum errors, and then force a full check of the
> filesystem since it is quite likely corrupted as a result.
>
> Does this make sense? It's the direction I plan to go in terms of
> making changes to the kernel and the e2fsck's recovery code, so please
> let me know if you disagree with this strategy.

I do disagree. The whole point of this patch was to avoid the case where
random garbage had been written into the journal and then splattering it
all over the filesystem. Considering that the journal has the highest
density of important metadata in the filesystem, it is virtually impossible
to get more serious corruption than in the journal.

PS - note new email addresses

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-05-25 11:39:00

by Theodore Ts'o

[permalink] [raw]
Subject: Re: What to do when the journal checksum is incorrect

On Sun, May 25, 2008 at 12:30:24AM -0600, Andreas Dilger wrote:
> I don't agree with this at all. With the exception of the checksum
> error being in the last transaction, there is a danger of complete
> garbage being checkpointed into the filesystem if the checksum is bad.
> The alternative to not replay the rest of the transactions at worst
> has some probability that newer versions of the blocks were checkpointed
> into the filesystem before it crashed.
>
> You are proposing that we write some blocks which we know to be garbage
> directly into the filesystem metadata instead of leaving some unordered
> writes in the filesystem to be fixed (as happens with ext2 all of the time).


Well, what are the alternatives? Remember, we could have potentially
50-100 megabytes of stale metadata that haven't been written to
filesystem. And unlike ext2, we've deliberately held back writing
back metadata by pinning it so, things could be much worse. So let's
tick off the possibilities:

* An individual data block is bad --- we write complete garbage into
the filesystem, which means in the worst case we lose 32 inodes
(unless that inode table block is repeated later in the journal), 1
directory block (causing files to land in lost+found), one bitmap
block (which e2fsck can regenerate), or a data block (if data=jouranalled).

* A journal descriptor block is bad --- if it's just a bit-flip, we
could end up writing a data block in the wrong place, which would be
bad; if it's complete garbage, we will probably assume the journal
ended early, and leave the filesystem silently badly corrupted.

* The journal commit block is bad --- probably we will just silently
assume the journal ended early, unless the bit-flip happened exactly
in the CRC field.

The most common case is that one or more individual data blocks in the
journal are bad, and the question is whether writing that garbage into
the filesystem is better or worse than aborting the journal right then
and there.

The problem with only replaying the "good" part of the journal is the
kernel then truncates the journal, and it leaves e2fsck with no way of
doing anything intelligent afterwards. So another possibility is to
not replay the journal at all, and fail the mount unless the
filesystem is being mounted read-only; but the question is whether we
are better off not replaying the journal at *all*, or just replaying
part of it.

Consider that if /boot/grub/menu.lst got written, and one of its data
block was previously directory block that had since gotten deleted,
but in the journal and had been revoked, replaying part of the journal
might make the system non-bootable.

So the other alternative I seriously considered was not replaying the
journal at all, and bailing out after seeing the bad checksum --- but
that just defers the problem to e2fsck, and e2fsck can't really do
anything much different, and the tools to allow a human to make a
decision on a block by block basis in the journal don't exist, and
even if they did would make more system administrators run screaming.

I suspect the *best* approach is to change the journal format one more
time, and include a CRC on a per-block basis in the descriptor blocks,
and a CRC for the entire descriptor block. That way, we can decide
what to replay or not on a per-block basis.

> > But if there are any non-terminal commits that have bad checksums, we
> > need to pass a flag back to the filesystem, and have the filesystem call
> > ext4_error() indicating that the journal was corrupt; this will mark the
> > filesystem has containing errors, and use the filesystem policy to
> > decide whether to remount the filesystem read-only, continue, or panic.
> > Then e2fsck at boot time can try to sort out the mess.
>
> Yes, that is definitely the right approach, and I thought that was the
> case. It's been a while since I looked at the code.

Nope, right now the mount will silently succeed, even though part of
the journal (include the commit with the bad CRC, but no commits
beyond that) have been replayed, and then the entire journal is
truncated and thrown away.

Basically, almost any change we make would be an improvement. :-)

- Ted

2008-05-26 16:30:07

by Theodore Ts'o

[permalink] [raw]
Subject: Re: What to do when the journal checksum is incorrect

On Sun, May 25, 2008 at 07:38:42AM -0400, Theodore Tso wrote:
> So the other alternative I seriously considered was not replaying the
> journal at all, and bailing out after seeing the bad checksum --- but
> that just defers the problem to e2fsck, and e2fsck can't really do
> anything much different, and the tools to allow a human to make a
> decision on a block by block basis in the journal don't exist, and
> even if they did would make more system administrators run screaming.
>
> I suspect the *best* approach is to change the journal format one more
> time, and include a CRC on a per-block basis in the descriptor blocks,
> and a CRC for the entire descriptor block. That way, we can decide
> what to replay or not on a per-block basis.

One other alternative I forgot to mention is that e2fsck (or the
kernel) could skip the bad (non-terminal) commit, and continue
replaying the rest of the good commits. That's a much coarser grain
alternative to the above, given that there could be hundred or more
blocks in a particular commit that we would be skipping.

E2fsck could also save the journal to an external file which could
allow an expert to paw over the results and try to do something sane
with the results. Given that *very* few experts would know what to
do, I'm not entirely sure it's worth the effort --- although perhaps
some of the folks who are working on ext3grep might be interested
writing some journal forensic tools.

> Nope, right now the mount will silently succeed, even though part of
> the journal (include the commit with the bad CRC, but no commits
> beyond that) have been replayed, and then the entire journal is
> truncated and thrown away.

BTW, here's the patch that will cause fs/jbd2/recovery.c to terminate
upon finding the non-terminal commit with the failed CRC. It's not a
complete fix, since apparently fs/jbd2/recovery.c is also currently
noting the in the journal-async-commit, terminal-commit with bad CRC
case, fs/jbd2/recovery.c is noting the bad CRC --- and then replaying
it, which means that journal-async-commit is currently NOT safe. (Not
yet fixed with my current patch; I had been so focused on the
non-terminal bad CRC case, that I only recently noticed that the
terminal async journal case was also busted.)

All of this was found by porting over the fs/jbd2/recovery.c changes
into e2fsck/recovery.c, compiling e2fsprogs with --enable-jbd-debug,
and then setting the JBD_DEBUG environment variable, then watching
journal recoveries with selectively corrupted journals using lde.
That's also how I found the buffer head memory leak which sent to
linux-ext4 a few days ago --- and why I try hard to keep
e2fsck/recover.c in sync with fs/jbd[2]/recovery.c.

The Lesson of the Week is that running tests in userspace is easier
and faster, which encourages more testing --- which means bugs get
found faster. :-)

- Ted

commit 78c061f21271968e24c434bb8a5ec41602ad0b99
Author: Theodore Ts'o <[email protected]>
Date: Sat May 24 15:45:31 2008 -0400

jbd2: Fix handling of a bad checksum of a non-terminal commit

If there is a checksum problem in transaction n, and it is not the
last transaction, the problem isn't until the commit block for
transaction n+1 is found. So we need to decrement the commit_ID
counter so the right transaction is marked as the troublemaker, and we
don't replay the corrupted transaction.

Cc: Andreas Dilger <[email protected]>
Cc: Girish Shilamkar <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>

diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index 7199db5..2d48448 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -611,6 +611,7 @@ static int do_one_pass(journal_t *journal,
chksum_err = chksum_seen = 0;

if (info->end_transaction) {
+ next_commit_ID--;
printk(KERN_ERR "JBD: Transaction %u "
"found to be corrupt.\n",
next_commit_ID - 1);

2008-05-26 18:24:31

by Andreas Dilger

[permalink] [raw]
Subject: Re: What to do when the journal checksum is incorrect

On May 25, 2008 07:38 -0400, Theodore Ts'o wrote:
> Well, what are the alternatives? Remember, we could have potentially
> 50-100 megabytes of stale metadata that haven't been written to
> filesystem. And unlike ext2, we've deliberately held back writing
> back metadata by pinning it so, things could be much worse. So let's
> tick off the possibilities:
>
> * An individual data block is bad --- we write complete garbage into
> the filesystem, which means in the worst case we lose 32 inodes
> (unless that inode table block is repeated later in the journal), 1
> directory block (causing files to land in lost+found), one bitmap
> block (which e2fsck can regenerate), or a data block (if data=jouranalled).
>
> * A journal descriptor block is bad --- if it's just a bit-flip, we
> could end up writing a data block in the wrong place, which would be
> bad; if it's complete garbage, we will probably assume the journal
> ended early, and leave the filesystem silently badly corrupted.
>
> * The journal commit block is bad --- probably we will just silently
> assume the journal ended early, unless the bit-flip happened exactly
> in the CRC field.
>
> The most common case is that one or more individual data blocks in the
> journal are bad, and the question is whether writing that garbage into
> the filesystem is better or worse than aborting the journal right then
> and there.

You are focussing on the case where 1 or 2 filesystem blocks in the
journal are bad, but I suspect the real-world cases are more likely to
be 1 or 2MB of data are bad, or more. Considering that a disk sector
is at least 4 or 64kB in size, and problems like track misalignment
(overpowered seek), write failure (high-flying write), or device cache
reordering problems will result in a large number of bad blocks in the
journal, I don't think 1 or 2 filesystem is a realistic failure scenario
anymore.

> The problem with only replaying the "good" part of the journal is the
> kernel then truncates the journal, and it leaves e2fsck with no way of
> doing anything intelligent afterwards. So another possibility is to
> not replay the journal at all, and fail the mount unless the
> filesystem is being mounted read-only; but the question is whether we
> are better off not replaying the journal at *all*, or just replaying
> part of it.

I'd think at a minimum to replay the journal up to the bad transaction.
That the current code is broken and also replays the bad transaction is
of course incorrect. The probability that later transactions have
begun checkpointing their blocks to the filesystem is decreasing for
each later transaction after the bad one, so the probability of those
changes corrupting the filesystem are correspondingly lower.

> Consider that if /boot/grub/menu.lst got written, and one of its data
> block was previously directory block that had since gotten deleted,
> but in the journal and had been revoked, replaying part of the journal
> might make the system non-bootable.

Sure, such scenarios exist, but the architecture of ext3/4 is that the
data block will _likely_ have been rewritten in the same place. The
more likely case is that some important filesystem metadata (itable,
indirect blocks of files, etc) is being overwritten and corruption in
the journal is a laser-guided missile to finding all of the important
blocks in the filesystem to spread that corruption to.

> So the other alternative I seriously considered was not replaying the
> journal at all, and bailing out after seeing the bad checksum --- but
> that just defers the problem to e2fsck, and e2fsck can't really do
> anything much different, and the tools to allow a human to make a
> decision on a block by block basis in the journal don't exist, and
> even if they did would make more system administrators run screaming.
>
> I suspect the *best* approach is to change the journal format one more
> time, and include a CRC on a per-block basis in the descriptor blocks,
> and a CRC for the entire descriptor block. That way, we can decide
> what to replay or not on a per-block basis.

Yes, I was thinking exactly this same thing. This would give the maximum
probability of the correct outcome, because only "correct" blocks are
checkpointed into the filesystem, and at least an old version of the
block is present in the filesystem (unless it is a new block). The chance
also exists that a later transaction will even overwrite the bad block,
which will avoid even the need to invoke e2fsck.

This would need:
- a checksum in the per-block transaction record (tag). One option is
to keep an 8- or 16-bit checksum in the "flags" field, to keep it
compatible with older JBD implementations.
- a checksum of the commit header and tags to ensure we can trust the
per-block checksums, and we don't need a huge checksum for each block.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-05-26 21:28:35

by Ric Wheeler

[permalink] [raw]
Subject: Re: What to do when the journal checksum is incorrect

Andreas Dilger wrote:
> On May 25, 2008 07:38 -0400, Theodore Ts'o wrote:
>> Well, what are the alternatives? Remember, we could have potentially
>> 50-100 megabytes of stale metadata that haven't been written to
>> filesystem. And unlike ext2, we've deliberately held back writing
>> back metadata by pinning it so, things could be much worse. So let's
>> tick off the possibilities:
>>
>> * An individual data block is bad --- we write complete garbage into
>> the filesystem, which means in the worst case we lose 32 inodes
>> (unless that inode table block is repeated later in the journal), 1
>> directory block (causing files to land in lost+found), one bitmap
>> block (which e2fsck can regenerate), or a data block (if data=jouranalled).
>>
>> * A journal descriptor block is bad --- if it's just a bit-flip, we
>> could end up writing a data block in the wrong place, which would be
>> bad; if it's complete garbage, we will probably assume the journal
>> ended early, and leave the filesystem silently badly corrupted.
>>
>> * The journal commit block is bad --- probably we will just silently
>> assume the journal ended early, unless the bit-flip happened exactly
>> in the CRC field.
>>
>> The most common case is that one or more individual data blocks in the
>> journal are bad, and the question is whether writing that garbage into
>> the filesystem is better or worse than aborting the journal right then
>> and there.
>
> You are focussing on the case where 1 or 2 filesystem blocks in the
> journal are bad, but I suspect the real-world cases are more likely to
> be 1 or 2MB of data are bad, or more. Considering that a disk sector
> is at least 4 or 64kB in size, and problems like track misalignment
> (overpowered seek), write failure (high-flying write), or device cache
> reordering problems will result in a large number of bad blocks in the
> journal, I don't think 1 or 2 filesystem is a realistic failure scenario
> anymore.

Disk sectors are still (almost always) 512 bytes today, but the industry is
pushing hard to get 4k byte sectors out since that has a promise of getting
better data protection and denser layout. Disk arrays have internal "sectors"
that can be really big (64k or bigger).

What seems to be most common is a small number of bad sectors that will be
unreadable (IO errors on read). I would be surprised to see megabytes of
continuous errors, but you could see 10's of kilobytes.

What the checksums will probably be most useful in catching is problems with
memory parts - either non-ECC DRAM in your server, bad DRAM in the disk cache
itself, etc. The interesting thing about these errors is that they will tend to
repeat (depending on where that stuck bit is) and you can see it all over the place.

One thing that will be really neat is to actually put in counters to track the
rate and validate these assumptions.


ric

2008-06-03 10:29:14

by Girish Shilamkar

[permalink] [raw]
Subject: Re: What to do when the journal checksum is incorrect

Hi Ted,
On Sat, 2008-05-24 at 18:34 -0400, Theodore Ts'o wrote:
> I've been taking a much closer look at the ext4's journal checksum code
> as I integrated it into e2fsck, and I'm finding that what it's doing
> doesn't make a whole lot of sense.
>
> Suppose the journal has five commits, with transaction ID's 2, 3, 4, 5,
> and 6. And suppose the CRC in the commit block delineating the end of
> transaction #4 is bad. At the moment, due to a bug in the code, it
> stops processing at transaction #4, meaning that transactions #2, #3,
> and #4 are replayed into the filesystem --- even though transaction #4
> failed the CRC checksum.
I went through the code and also re-ran the e2fsprogs tests which we had
sent upstream for journal checksum. And found that if the transaction is
bad it is marked as info->end_transaction which indicates a bad
transaction and is not replayed.

if (chksum_err) {
info->end_transaction = next_commit_ID;

The end_transaction is set to transaction ID which is found to be
corrupt. So #4 will be set in end_transaction and in PASS_REPLAY the
last transaction to be replayed will be #3 due to this:
----------------------------------------------------------------
if (tid_geq(next_commit_ID, info->end_transaction))
break;
-----------------------------------------------------------------

if (!JFS_HAS_COMPAT_FEATURE(journal,
JFS_FEATURE_INCOMPAT_ASYNC_COMMIT)){
printk(KERN_ERR "JBD: Transaction %u found to be corrupt.\n",
next_commit_ID);
brelse(bh);
break;
}
}
> Worse yet, no indication of any problems is
> sent back to the ext4 filesystem code.
This definitely is not present and needs to be incorporated.

Thanks,
Girish


2008-06-03 21:27:49

by Andreas Dilger

[permalink] [raw]
Subject: Re: What to do when the journal checksum is incorrect

On Jun 03, 2008 15:52 +0530, Girish Shilamkar wrote:
> I went through the code and also re-ran the e2fsprogs tests which we had
> sent upstream for journal checksum. And found that if the transaction is
> bad it is marked as info->end_transaction which indicates a bad
> transaction and is not replayed.
>
> if (chksum_err) {
> info->end_transaction = next_commit_ID;
>
> The end_transaction is set to transaction ID which is found to be
> corrupt. So #4 will be set in end_transaction and in PASS_REPLAY the
> last transaction to be replayed will be #3 due to this:
> ----------------------------------------------------------------
> if (tid_geq(next_commit_ID, info->end_transaction))
> break;
> -----------------------------------------------------------------
>
> if (!JFS_HAS_COMPAT_FEATURE(journal,
> JFS_FEATURE_INCOMPAT_ASYNC_COMMIT)){
> printk(KERN_ERR "JBD: Transaction %u found to be corrupt.\n",
> next_commit_ID);
> brelse(bh);
> break;
> }
> }

Girish, thanks for following up on this. It definitely eases my
concerns about the patch we are currently using.

> > Worse yet, no indication of any problems is
> > sent back to the ext4 filesystem code.
>
> This definitely is not present and needs to be incorporated.

Yes, it seems that skipping later transactions definitely has some
drawbacks.

We had discussed on the ext4 concall a change to the format of the journal
checksum code, having it store a small per-block checksum in addition to
the per-transaction checksum, so that in case of transaction corruption
the good blocks can be recovered and the bad ones skipped.

The proposal was to do an adler32 checksum of the block (using
the transaction number and filesystem block number at the start of
the checksum data), and then store the high (or low?) 16 bits of the
checksum in the high 16 bits of the journal_block_tag_s.t_flags field.
The full 32-bit per-block checksum would also be checksummed in order
to generate the 32-bit transaction checksum.

While a 16-bit checksum is itself is not very strong, we have the full
32-bit transaction checksum to verify the whole transaction, and we
only care about 16-bit per-block checksum in the rare case when the
transaction checksum is bad. Having the transaction number "seed" each
of the per-block checksums avoids the risk of having an old journal tag
block in the journal with "good" checksums, and having the filesystem
block number in the checksum avoids the risk of single-bit errors in
the tag block resulting in the "good checksum" block being written to
the wrong part of the filesystem.

This new per-block checksum would be recorded as a new checksum type
in the journal superblock. The reason for using adler32 instead of
the CRC32 we are using now is that adler32 is significantly faster
when implemented in software, and is equally robust for the blocksizes
we are using for ext3 (adler32 is not as strong with very small buffer
sizes like 128 bytes or less).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-06-04 23:41:18

by Theodore Ts'o

[permalink] [raw]
Subject: Re: What to do when the journal checksum is incorrect

On Tue, Jun 03, 2008 at 03:52:13PM +0530, Girish Shilamkar wrote:
> I went through the code and also re-ran the e2fsprogs tests which we had
> sent upstream for journal checksum. And found that if the transaction is
> bad it is marked as info->end_transaction which indicates a bad
> transaction and is not replayed.

You're correct; my apologies. I had misplaced or misfiled your
patches, and recreated the patch to support async checksums, and I
screwed up the patch by missing this change at the end of
do_one_pass()

if (pass == PASS_SCAN) {
- info->end_transaction = next_commit_ID;
+ if (!info->end_transaction)
+ info->end_transaction = next_commit_ID;
} else {
/* It's really bad news if different passes end up at
* different places (but possible due to IO errors). */

My patch to fix the buffer head leak is still relevant, though, and
I've created a patch to reflect the error up to ext4.

> > Worse yet, no indication of any problems is
> > sent back to the ext4 filesystem code.
> This definitely is not present and needs to be incorporated.
>

I will send the patches that I've placed into the ext4 patch queue
relevant to the journal checksum code on this thread.

- Ted

2008-06-04 23:56:37

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH] jbd2: If a journal checksum error is detected, propagate the error to ext4

If a journal checksum error is detected, the ext4 filesystem will call
ext4_error(), and the mount will either continue, become a read-only
mount, or cause a kernel panic based on the superblock flags
indicating the user's preference of what to do in case of filesystem
corruption being detected.

XXX should we simply fail the mount instead in the panic case?

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/super.c | 6 ++
fs/jbd2/recovery.c | 4 ++
include/linux/jbd2.h | 3 +
scripts/package/Makefile | 100 +-----------------------------------
scripts/package/builddeb | 126 ----------------------------------------------
5 files changed, 15 insertions(+), 224 deletions(-)
delete mode 100644 scripts/package/builddeb

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 09d9359..dd56c45 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2189,6 +2189,12 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent)
EXT4_HAS_COMPAT_FEATURE(sb, EXT4_FEATURE_COMPAT_HAS_JOURNAL)) {
if (ext4_load_journal(sb, es, journal_devnum))
goto failed_mount3;
+ if (!(sb->s_flags & MS_RDONLY) &&
+ EXT4_SB(sb)->s_journal->j_failed_commit) {
+ ext4_error(sb, "ext4_fill_super",
+ "Journal transaction %u is corrupt",
+ EXT4_SB(sb)->s_journal->j_failed_commit);
+ }
} else if (journal_inum) {
if (ext4_create_journal(sb, es, journal_inum))
goto failed_mount3;
diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index 7199db5..1ded464 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -611,6 +611,8 @@ static int do_one_pass(journal_t *journal,
chksum_err = chksum_seen = 0;

if (info->end_transaction) {
+ journal->j_failed_commit =
+ info->end_transaction;
printk(KERN_ERR "JBD: Transaction %u "
"found to be corrupt.\n",
next_commit_ID - 1);
@@ -648,6 +650,8 @@ static int do_one_pass(journal_t *journal,
"JBD: Transaction %u "
"found to be corrupt.\n",
next_commit_ID);
+ journal->j_failed_commit =
+ next_commit_ID;
brelse(bh);
break;
}
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 05e2b30..d147f0f 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -919,6 +919,9 @@ struct journal_s
struct proc_dir_entry *j_proc_entry;
struct transaction_stats_s j_stats;

+ /* Failed journal commit ID */
+ unsigned int j_failed_commit;
+
/*
* An opaque pointer to fs-private information. ext3 puts its
* superblock pointer here
diff --git a/scripts/package/Makefile b/scripts/package/Makefile
index 5e32607..f758b75 100644
--- a/scripts/package/Makefile
+++ b/scripts/package/Makefile
@@ -1,98 +1,2 @@
-# Makefile for the different targets used to generate full packages of a kernel
-# It uses the generic clean infrastructure of kbuild
-
-# Ignore the following files/directories during tar operation
-TAR_IGNORE := --exclude SCCS --exclude BitKeeper --exclude .svn --exclude CVS
-
-
-# RPM target
-# ---------------------------------------------------------------------------
-# The rpm target generates two rpm files:
-# /usr/src/packages/SRPMS/kernel-2.6.7rc2-1.src.rpm
-# /usr/src/packages/RPMS/i386/kernel-2.6.7rc2-1.<arch>.rpm
-# The src.rpm files includes all source for the kernel being built
-# The <arch>.rpm includes kernel configuration, modules etc.
-#
-# Process to create the rpm files
-# a) clean the kernel
-# b) Generate .spec file
-# c) Build a tar ball, using symlink to make kernel version
-# first entry in the path
-# d) and pack the result to a tar.gz file
-# e) generate the rpm files, based on kernel.spec
-# - Use /. to avoid tar packing just the symlink
-
-# Do we have rpmbuild, otherwise fall back to the older rpm
-RPM := $(shell if [ -x "/usr/bin/rpmbuild" ]; then echo rpmbuild; \
- else echo rpm; fi)
-
-# Remove hyphens since they have special meaning in RPM filenames
-KERNELPATH := kernel-$(subst -,,$(KERNELRELEASE))
-MKSPEC := $(srctree)/scripts/package/mkspec
-PREV := set -e; cd ..;
-
-# rpm-pkg
-# ---------------------------------------------------------------------------
-$(objtree)/kernel.spec: $(MKSPEC) $(srctree)/Makefile
- $(CONFIG_SHELL) $(MKSPEC) > [email protected]
-
-rpm-pkg rpm: $(objtree)/kernel.spec FORCE
- $(MAKE) clean
- $(PREV) ln -sf $(srctree) $(KERNELPATH)
- $(PREV) tar -cz $(RCS_TAR_IGNORE) -f $(KERNELPATH).tar.gz $(KERNELPATH)/.
- $(PREV) rm $(KERNELPATH)
-
- set -e; \
- $(CONFIG_SHELL) $(srctree)/scripts/mkversion > $(objtree)/.tmp_version
- set -e; \
- mv -f $(objtree)/.tmp_version $(objtree)/.version
-
- $(RPM) --target $(UTS_MACHINE) -ta ../$(KERNELPATH).tar.gz
- rm ../$(KERNELPATH).tar.gz
-
-clean-files := $(objtree)/kernel.spec
-
-# binrpm-pkg
-# ---------------------------------------------------------------------------
-$(objtree)/binkernel.spec: $(MKSPEC) $(srctree)/Makefile
- $(CONFIG_SHELL) $(MKSPEC) prebuilt > [email protected]
-
-binrpm-pkg: $(objtree)/binkernel.spec FORCE
- $(MAKE) KBUILD_SRC=
- set -e; \
- $(CONFIG_SHELL) $(srctree)/scripts/mkversion > $(objtree)/.tmp_version
- set -e; \
- mv -f $(objtree)/.tmp_version $(objtree)/.version
-
- $(RPM) --define "_builddir $(srctree)" --target $(UTS_MACHINE) -bb $<
-
-clean-files += $(objtree)/binkernel.spec
-
-# Deb target
-# ---------------------------------------------------------------------------
-deb-pkg: FORCE
- $(MAKE) KBUILD_SRC=
- $(CONFIG_SHELL) $(srctree)/scripts/package/builddeb
-
-clean-dirs += $(objtree)/debian/
-
-
-# tarball targets
-# ---------------------------------------------------------------------------
-tar%pkg: FORCE
- $(MAKE) KBUILD_SRC=
- $(CONFIG_SHELL) $(srctree)/scripts/package/buildtar [email protected]
-
-clean-dirs += $(objtree)/tar-install/
-
-
-# Help text displayed when executing 'make help'
-# ---------------------------------------------------------------------------
-help: FORCE
- @echo ' rpm-pkg - Build both source and binary RPM kernel packages'
- @echo ' binrpm-pkg - Build only the binary kernel package'
- @echo ' deb-pkg - Build the kernel as an deb package'
- @echo ' tar-pkg - Build the kernel as an uncompressed tarball'
- @echo ' targz-pkg - Build the kernel as a gzip compressed tarball'
- @echo ' tarbz2-pkg - Build the kernel as a bzip2 compressed tarball'
-
+# Dummy file
+help:
diff --git a/scripts/package/builddeb b/scripts/package/builddeb
deleted file mode 100644
index ba6bf5d..0000000
--- a/scripts/package/builddeb
+++ /dev/null
@@ -1,126 +0,0 @@
-#!/bin/sh
-#
-# builddeb 1.2
-# Copyright 2003 Wichert Akkerman <[email protected]>
-#
-# Simple script to generate a deb package for a Linux kernel. All the
-# complexity of what to do with a kernel after it is installer or removed
-# is left to other scripts and packages: they can install scripts in the
-# /etc/kernel/{pre,post}{inst,rm}.d/ directories that will be called on
-# package install and removal.
-
-set -e
-
-# Some variables and settings used throughout the script
-version=$KERNELRELEASE
-revision=`cat .version`
-tmpdir="$objtree/debian/tmp"
-packagename=linux-$version
-
-if [ "$ARCH" == "um" ] ; then
- packagename=user-mode-linux-$version
-fi
-
-# Setup the directory structure
-rm -rf "$tmpdir"
-mkdir -p "$tmpdir/DEBIAN" "$tmpdir/lib" "$tmpdir/boot"
-if [ "$ARCH" == "um" ] ; then
- mkdir -p "$tmpdir/usr/lib/uml/modules/$version" "$tmpdir/usr/share/doc/$packagename" "$tmpdir/usr/bin"
-fi
-
-# Build and install the kernel
-if [ "$ARCH" == "um" ] ; then
- $MAKE linux
- cp System.map "$tmpdir/usr/lib/uml/modules/$version/System.map"
- cp .config "$tmpdir/usr/share/doc/$packagename/config"
- gzip "$tmpdir/usr/share/doc/$packagename/config"
- cp $KBUILD_IMAGE "$tmpdir/usr/bin/linux-$version"
-else
- cp System.map "$tmpdir/boot/System.map-$version"
- cp .config "$tmpdir/boot/config-$version"
- cp $KBUILD_IMAGE "$tmpdir/boot/vmlinuz-$version"
-fi
-
-if grep -q '^CONFIG_MODULES=y' .config ; then
- INSTALL_MOD_PATH="$tmpdir" make KBUILD_SRC= modules_install
- if [ "$ARCH" == "um" ] ; then
- mv "$tmpdir/lib/modules/$version"/* "$tmpdir/usr/lib/uml/modules/$version/"
- rmdir "$tmpdir/lib/modules/$version"
- fi
-fi
-
-# Install the maintainer scripts
-for script in postinst postrm preinst prerm ; do
- mkdir -p "$tmpdir/etc/kernel/$script.d"
- cat <<EOF > "$tmpdir/DEBIAN/$script"
-#!/bin/sh
-
-set -e
-
-test -d /etc/kernel/$script.d && run-parts --arg="$version" /etc/kernel/$script.d
-exit 0
-EOF
- chmod 755 "$tmpdir/DEBIAN/$script"
-done
-
-name="Kernel Compiler <$(id -nu)@$(hostname -f)>"
-# Generate a simple changelog template
-cat <<EOF > debian/changelog
-linux ($version-$revision) unstable; urgency=low
-
- * A standard release
-
- -- $name $(date -R)
-EOF
-
-# Generate a control file
-if [ "$ARCH" == "um" ]; then
-
-cat <<EOF > debian/control
-Source: linux
-Section: base
-Priority: optional
-Maintainer: $name
-Standards-Version: 3.6.1
-
-Package: $packagename
-Provides: kernel-image-$version, linux-image-$version
-Architecture: any
-Description: User Mode Linux kernel, version $version
- User-mode Linux is a port of the Linux kernel to its own system call
- interface. It provides a kind of virtual machine, which runs Linux
- as a user process under another Linux kernel. This is useful for
- kernel development, sandboxes, jails, experimentation, and
- many other things.
- .
- This package contains the Linux kernel, modules and corresponding other
- files version $version
-EOF
-
-else
-cat <<EOF > debian/control
-Source: linux
-Section: base
-Priority: optional
-Maintainer: $name
-Standards-Version: 3.6.1
-
-Package: $packagename
-Provides: kernel-image-$version, linux-image-$version
-Architecture: any
-Description: Linux kernel, version $version
- This package contains the Linux kernel, modules and corresponding other
- files version $version
-EOF
-fi
-
-# Fix some ownership and permissions
-chown -R root:root "$tmpdir"
-chmod -R go-w "$tmpdir"
-
-# Perform the final magic
-dpkg-gencontrol -isp
-dpkg --build "$tmpdir" ..
-
-exit 0

2008-06-04 23:56:36

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH] jbd2: Fix memory leak when verifying checksums in the journal

Cc: Andreas Dilger <[email protected]>
Cc: Girish Shilamkar <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/jbd2/recovery.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index 5d0405a..7199db5 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -344,6 +344,7 @@ static int calc_chksums(journal_t *journal, struct buffer_head *bh,
*crc32_sum = crc32_be(*crc32_sum, (void *)obh->b_data,
obh->b_size);
}
+ put_bh(obh);
}
return 0;
}
--
1.5.4.1.144.gdfee-dirty


2008-06-05 03:17:15

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] jbd2: If a journal checksum error is detected, propagate the error to ext4

On Jun 04, 2008 19:56 -0400, Theodore Ts'o wrote:
> If a journal checksum error is detected, the ext4 filesystem will call
> ext4_error(), and the mount will either continue, become a read-only
> mount, or cause a kernel panic based on the superblock flags
> indicating the user's preference of what to do in case of filesystem
> corruption being detected.
>
> XXX should we simply fail the mount instead in the panic case?

It does seem that if someone has a bad ext4 root filesystem it would be
easy to get an unusable system. That is why there are no "ext3_error"
or "ext4_error" calls anywhere in ext[34]_fill_super(). They are all
instead "printk(KERN_ERR ...)" and the mount fails.

> diff --git a/scripts/package/Makefile b/scripts/package/Makefile
> index 5e32607..f758b75 100644
> --- a/scripts/package/Makefile
> +++ b/scripts/package/Makefile

Not sure what the rest of this patch is...

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-06-05 16:21:19

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] jbd2: If a journal checksum error is detected, propagate the error to ext4

On Wed, Jun 04, 2008 at 09:17:12PM -0600, Andreas Dilger wrote:
> On Jun 04, 2008 19:56 -0400, Theodore Ts'o wrote:
> > If a journal checksum error is detected, the ext4 filesystem will call
> > ext4_error(), and the mount will either continue, become a read-only
> > mount, or cause a kernel panic based on the superblock flags
> > indicating the user's preference of what to do in case of filesystem
> > corruption being detected.
> >
> > XXX should we simply fail the mount instead in the panic case?
>
> It does seem that if someone has a bad ext4 root filesystem it would be
> easy to get an unusable system. That is why there are no "ext3_error"
> or "ext4_error" calls anywhere in ext[34]_fill_super(). They are all
> instead "printk(KERN_ERR ...)" and the mount fails.

Well, failing the mount or panic'ing as the same result for the root
filesystem. It is rude to panic when mounting a corrupt filesystem,
especially when you could just fail the mount, though. I'll make that change.

>
> > diff --git a/scripts/package/Makefile b/scripts/package/Makefile
> > index 5e32607..f758b75 100644
> > --- a/scripts/package/Makefile
> > +++ b/scripts/package/Makefile
>
> Not sure what the rest of this patch is...
>

Some ditritus casued by make-kpkg that slipped in. Oops, sorry about
that.

- Ted