2007-06-19 07:50:47

by Girish Shilamkar

Subject: Updated patches for journal checksums.

Hi,
The journal checksum patches for ext4 and e2fsprogs sent on 30th May
ran into problems on big-endian machines. I have made the required
changes.
The major updates are:
1. The checksum was written in little-endian format in the commit header;
changed it to big-endian.
2. The crc32 code ported to userspace (which is used by e2fsprogs) had a
problem with an #ifdef; fixed that.
3. The crc32 userspace code used i386 assembly for swabbing; changed it
to the generic version.
4. Removed various #includes which caused compilation errors on ppc
machines.

Any comments/suggestions are always welcome.

Thanks,
Girish.


Attachments:
ext4-journal_chksum-2.6.20.patch (18.80 kB)
e2fsprogs-journal_chksum.patch (41.58 kB)

2007-06-19 08:03:25

by Alex Tomas

Subject: Re: Updated patches for journal checksums.

say, at mount time we found a transaction logged. this means part of it can be
on disk. what do we do if a transaction in the journal is found with a wrong
checksum? leave the partial transaction in place?

thanks, Alex

Girish Shilamkar wrote:
> Hi,
> The journal checksum patches for ext4 and e2fsprogs sent on 30th May
> ran into problems on big-endian machines. I have made the required
> changes.
> The major updates are:
> 1. The checksum was written in little-endian format in the commit header;
> changed it to big-endian.
> 2. The crc32 code ported to userspace (which is used by e2fsprogs) had a
> problem with an #ifdef; fixed that.
> 3. The crc32 userspace code used i386 assembly for swabbing; changed it
> to the generic version.
> 4. Removed various #includes which caused compilation errors on ppc
> machines.
>
> Any comments/suggestions are always welcome.
>
> Thanks,
> Girish.
>

2007-06-19 08:15:29

by Girish Shilamkar

Subject: Re: Updated patches for journal checksums.

On Tue, 2007-06-19 at 12:03 +0400, Alex Tomas wrote:
> say, at mount time we found a transaction logged. this means part of it can be
> on disk.
I am not sure I understand this completely. Still, I hope the following
answers your question.

> what do we do if a transaction in the journal is found with a wrong
> checksum? leave the partial transaction in place?
>
The sanity of each transaction is checked in PASS_SCAN. If the checksum
is found to be incorrect for the nth transaction, then the last transaction
written to disk is (n - 1).
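That check can be sketched roughly like this (a hypothetical illustration,
not the actual PASS_SCAN code; the struct and function names are invented):
the scan walks the committed transactions in log order and recovery stops
at the first one whose recomputed checksum disagrees with the stored one:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory stand-in for one committed transaction: the
 * checksum stored in its commit block, and the checksum recomputed
 * over its blocks during the scan pass. */
struct txn {
    uint32_t stored_csum;
    uint32_t computed_csum;
};

/* Walk transactions 1..count in log order; if the nth transaction has
 * a bad checksum, only transactions 1..(n - 1) are replayed.  Returns
 * the number of the last trustworthy transaction (0 if even the first
 * one is bad). */
size_t last_good_txn(const struct txn *log, size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (log[i].stored_csum != log[i].computed_csum)
            return i;           /* transaction i + 1 is the bad nth one */
    return count;               /* every transaction verified */
}
```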



Regards,
Girish

2007-06-19 08:51:43

by Andreas Dilger

Subject: Re: Updated patches for journal checksums.

On Jun 19, 2007 13:45 +0530, Girish Shilamkar wrote:
> On Tue, 2007-06-19 at 12:03 +0400, Alex Tomas wrote:
> > say, at mount time we found a transaction logged. this means part of it can be
> > on disk.
>
> I am not sure I understand this completely. Still, I hope the following
> answers your question.

I _think_ Alex is asking "what happens if a transaction undergoing
checkpoint of its blocks to the filesystem (not the last one in the journal)
is interrupted by a crash, and upon restart the partially-checkpointed
transaction is found to have a checksum error?"

> > what do we do if a transaction in the journal is found with a wrong
> > checksum? leave the partial transaction in place?
>
> The sanity of each transaction is checked in PASS_SCAN. If the checksum
> is found to be incorrect for the nth transaction, then the last transaction
> written to disk is (n - 1).

The recovery.c code (AFAIK) does not replay any transaction that
does not have a valid checksum, nor any transactions beyond it. If the
bad transaction had already started checkpointing (i.e. isn't the last
committed transaction) then the journal _should_ return an error up to
the filesystem, so it can call ext4_error() at startup. For e2fsck
(which normally does journal replay & recovery) it can do a full
filesystem check at this point.
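The decision could be sketched like so (hypothetical names, not the
actual recovery.c logic): given the index of the first transaction that
fails its checksum, recovery only needs to escalate when that transaction
is not the last committed one, since an older bad transaction may already
be partially checkpointed into the filesystem:

```c
#include <stddef.h>

enum recover_verdict {
    RECOVER_OK,          /* drop the bad tail, mount normally       */
    RECOVER_FSCK_NEEDED  /* partial checkpoint possible: escalate   */
};

/* bad_txn: index of the first transaction with a checksum error
 * (== count when every checksum verified); count: number of committed
 * transactions found in the journal. */
enum recover_verdict recovery_verdict(size_t bad_txn, size_t count)
{
    if (bad_txn >= count)
        return RECOVER_OK;        /* all checksums verified            */
    if (bad_txn == count - 1)
        return RECOVER_OK;        /* last committed txn: just discard  */
    return RECOVER_FSCK_NEEDED;   /* older txn: may be half-checkpointed */
}
```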

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-06-19 09:11:14

by Alex Tomas

Subject: Re: Updated patches for journal checksums.

Andreas Dilger wrote:
> I _think_ Alex is asking "what happens if a transaction undergoing
> checkpoint of its blocks to the filesystem (not the last one in the journal)
> is interrupted by a crash, and upon restart the partially-checkpointed
> transaction is found to have a checksum error?"

yup, thanks for clarification.

>>> what do we do if a transaction in the journal is found with a wrong
>>> checksum? leave the partial transaction in place?
>> The sanity of each transaction is checked in PASS_SCAN. If the checksum
>> is found to be incorrect for the nth transaction, then the last transaction
>> written to disk is (n - 1).
>
> The recovery.c code (AFAIK) does not replay any transaction that
> does not have a valid checksum, nor any transactions beyond it. If the
> bad transaction had already started checkpointing (i.e. isn't the last
> committed transaction) then the journal _should_ return an error up to
> the filesystem, so it can call ext4_error() at startup. For e2fsck
> (which normally does journal replay & recovery) it can do a full
> filesystem check at this point.

hmm, it actually can be the last transaction (following no activity?)

thanks, Alex

2007-06-19 19:06:49

by Andreas Dilger

Subject: Re: Updated patches for journal checksums.

On Jun 19, 2007 13:10 +0400, Alex Tomas wrote:
> Andreas Dilger wrote:
> >I _think_ Alex is asking "what happens if a transaction undergoing
> >checkpoint of its blocks to the filesystem (not the last one in the journal)
> >is interrupted by a crash, and upon restart the partially-checkpointed
> >transaction is found to have a checksum error?"
>
> yup, thanks for clarification.
>
> >>>what do we do if a transaction in the journal is found with a wrong
> >>>checksum? leave the partial transaction in place?
> >>The sanity of each transaction is checked in PASS_SCAN. If the checksum
> >>is found to be incorrect for the nth transaction, then the last transaction
> >>written to disk is (n - 1).
> >
> >The recovery.c code (AFAIK) does not replay any transaction that
> >does not have a valid checksum, nor any transactions beyond it. If the
> >bad transaction had already started checkpointing (i.e. isn't the last
> >committed transaction) then the journal _should_ return an error up to
> >the filesystem, so it can call ext4_error() at startup. For e2fsck
> >(which normally does journal replay & recovery) it can do a full
> >filesystem check at this point.
>
> hmm, it actually can be the last transaction (following no activity?)

The current implementation is that if you are running with sync commits
(i.e. the current 2-phase write: {data blocks} {wait} {commit block}
{wait}) and you detect a checksum error in the last transaction, this
SHOULD result in the filesystem being marked in error (I haven't
verified this is in the most recent version of the code). This is
what JBD_FEATURE_COMPAT_CHECKSUM does - it only adds checksums to the
commit blocks.
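The mechanism can be sketched roughly as follows (hypothetical helper
names; a plain bitwise CRC32 stands in for the kernel's crc32 helpers):
one checksum is accumulated across all of a transaction's data blocks
and stored in the commit block, and recovery recomputes it the same way:

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal bitwise CRC32 (IEEE polynomial, reflected) -- a stand-in for
 * the kernel's crc32() helpers, only to illustrate the mechanism. */
static uint32_t crc32_update(uint32_t crc, const uint8_t *buf, size_t len)
{
    crc = ~crc;
    while (len--) {
        crc ^= *buf++;
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;
}

/* Accumulate one checksum over every data block of the transaction;
 * the result is what the commit block would carry (stored big-endian,
 * per the fix in this thread). */
uint32_t txn_checksum(const uint8_t *blocks, size_t nblocks, size_t blocksize)
{
    uint32_t crc = 0;
    for (size_t i = 0; i < nblocks; i++)
        crc = crc32_update(crc, blocks + i * blocksize, blocksize);
    return crc;
}
```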

However, in the case of async commits (JBD_FEATURE_INCOMPAT_ASYNC_COMMIT
- write {data blocks} {commit block} {wait}, which is what gives the
performance gain) there isn't any way to tell the difference between a
checksum error in the last transaction and an interrupted commit. An
interrupted commit would not have started any checkpointing into the
filesystem yet, so that case shouldn't matter.

This leaves only the "last transaction, finished async commit, started
checkpoint, crash, corruption, checksum error" case. I agree this could
cause data corruption. One possibility is to limit the corruption to
at most a single block by adding checksums to the transaction descriptor
blocks (which already hold the transaction ID to verify they are not
stale) and covering the block numbers. That would seem to be the only
place where we could get error "fan out", and we would then limit the
corruption to the block(s) that were actually corrupted. This wouldn't
be any different from random corruption of disk blocks inside the fs.
It would probably be a separate INCOMPAT flag.
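A rough sketch of that descriptor-block idea (the layout and names are
entirely hypothetical, and a trivial XOR stands in for a real CRC): the
checksum covers the transaction ID and the list of target block numbers,
so a corrupted descriptor can never misdirect replay, and damage stays
confined to data blocks that were themselves corrupted:

```c
#include <stdbool.h>
#include <stdint.h>

#define DESC_MAX 16

/* Hypothetical on-disk descriptor block for one transaction. */
struct desc_block {
    uint32_t txn_id;             /* already stored: catches stale blocks   */
    uint32_t nr;                 /* how many block mappings follow         */
    uint64_t blocknr[DESC_MAX];  /* where each journal block gets replayed */
    uint32_t csum;               /* proposed: covers txn_id, nr, blocknr[] */
};

/* Stand-in checksum (the real code would use crc32). */
static uint32_t desc_csum(const struct desc_block *d)
{
    uint32_t c = d->txn_id ^ d->nr;
    for (uint32_t i = 0; i < d->nr && i < DESC_MAX; i++)
        c ^= (uint32_t)d->blocknr[i] ^ (uint32_t)(d->blocknr[i] >> 32);
    return c;
}

/* Replay skips (or errors out on) any descriptor that fails to verify,
 * so bad block numbers never "fan out" into wrong filesystem writes. */
static bool desc_block_ok(const struct desc_block *d)
{
    return d->nr <= DESC_MAX && desc_csum(d) == d->csum;
}
```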


Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.