2021-12-28 20:36:28

by Manfred Spraul

Subject: JBD2: journal transaction 6943 on loop0-8 is corrupt.

Hi,

with simulated power failures, I see a corrupted journal

[39056.200845] JBD2: journal transaction 6943 on loop0-8 is corrupt.
[39056.200851] EXT4-fs (loop0): error loading journal

I can't rule out that this is an artifact from bugs in my setup. Thus
any hints are welcome.

Especially: I assume that data is written in 512-byte sectors, so a
1024-byte write can be interrupted halfway.

- Fedora 5.15.10-200.fc35.x86_64.

- Test load: Mostly dbench for 20 seconds, with a 512 MB ext4 image over
nbd. Further details below.

- nbd-server modified so that it creates a trace file of all block write
operations.

- trim is not used.

- in total, there are around 2.2 million 512-byte writes in the trace.
The uncompressed trace is ~1 GB.

- If I stop after 1908363 512-byte writes, then the mount fails.
After 1908362 or 1908364 writes, the image mounts properly.
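The replay step can be sketched as follows. This is a minimal
illustration, not the actual replay tool; the in-memory record layout
(`struct trace_rec`) is an assumption for the sketch, the real trace is
the nbd-server transaction log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR 512

/* Hypothetical trace record: one 512-byte sector write at a byte
 * offset into the image.  The real on-disk trace format differs. */
struct trace_rec {
    uint64_t offset;          /* byte offset into the image */
    uint8_t  data[SECTOR];    /* one sector of payload */
};

/* Apply at most max_writes records to the image, then stop --
 * simulating power loss between two sector writes.  Returns the
 * number of writes actually applied. */
static size_t replay(uint8_t *image, size_t image_len,
                     const struct trace_rec *trace, size_t nrec,
                     size_t max_writes)
{
    size_t done = 0;
    for (size_t i = 0; i < nrec && done < max_writes; i++) {
        if (trace[i].offset + SECTOR > image_len)
            break;            /* ignore writes past the image end */
        memcpy(image + trace[i].offset, trace[i].data, SECTOR);
        done++;
    }
    return done;
}
```

Bisecting over `max_writes` is what produced the 1908362/1908363/1908364
boundary above.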

The compressed image files are ~20 MB, thus if there is a suitable
storage area, I can copy it there.

I have attached the hexdumps starting from 0x8c77000. From outside, it
appears as if the old content at 0x8c77e00 confuses ext4. The write
1908364 sets the sector to 0x00, then everything works again.


Extract from replaying the trace:

> > H=740000008e000000 C=0x40000001 (NBD_CMD_WRITE+NONE)
> O=0000000008c70400 L=00007800
> writing to offset 147260416 (0x8c70400), len 512 (0x200).
> writing to offset 147260928 (0x8c70600), len 512 (0x200).
> writing to offset 147261440 (0x8c70800), len 512 (0x200).
>
> [...]
>
> writing to offset 147288576 (0x8c77200), len 512 (0x200).
> writing to offset 147289088 (0x8c77400), len 512 (0x200).
> writing to offset 147289600 (0x8c77600), len 512 (0x200).
> writing to offset 147290112 (0x8c77800), len 512 (0x200).
> writing to offset 147290624 (0x8c77a00), len 512 (0x200).
>
<stop after 1908362 512-byte writes: ok>
>
> < H=740000008e000000 E=0x00000000
> > H=750000002b3c0000 C=0x00000003 (NBD_CMD_FLUSH+NONE)
> O=0000000000000000 L=00000000
> < H=750000002b3c0000 E=0x00000000
> > H=7600000096000000 C=0x40010001 (NBD_CMD_WRITE+FUA)
> O=0000000008c77c00 L=00000400
> writing to offset 147291136 (0x8c77c00), len 512 (0x200).
>
<stop after 1908363 512-byte writes: bad>
>
> writing to offset 147291648 (0x8c77e00), len 512 (0x200).
>
<stop after 1908364 512-byte writes: good>

Stress test: a modified maketr from the nbd project. The tar file
contains a partial linux-2.2.26 source tree.

> #!/bin/sh
> #
> # Example script to make a transaction log file
> # Must be run as root. Remember to chown the file afterwards
>
> # Insert the name of a tarfile here
> tarfile=/home/manfred/git/manfred/nbd-datalog/data.tar.gz
> tmpnam=$(mktemp)
> tmpmnt=$(echo -n /tmp/mnt-; basename $tmpnam)
> conffile=${tmpnam}.conf
> pidfile=${tmpnam}.pid
> output=`pwd`/output.tr
>
> ulimit -c unlimited
>
> cat >${conffile} <<EOF
> [generic]
> [export1]
>        exportname = $tmpnam
>        transactionlog = $output
>        datalog = true
>        flush = true
>        fua = true
>        rotational = true
> EOF
> nbd-server -C ${conffile} -p ${pidfile}&
> PID=$!
> sleep 1
> dd if=/dev/zero of=${tmpnam} bs=1M count=500
> nbd-client -N export1 127.0.0.1 /dev/nbd0
> mkfs.ext4 /dev/nbd0
> mkdir $tmpmnt
> mount -t ext4 /dev/nbd0 $tmpmnt
> (cd $tmpmnt ; mkdir abc; cd abc; tar xvzf ${tarfile} ; setfacl -R -m u:432:rwx linux-2.2.26/Documentation; cd .. ; (setfacl -R -m u:123:rx abc &); (rm abc -Rf &); tar xvfz ${tarfile}; wait; sync) 2>&1 >/dev/null
> dbench -x -D $tmpmnt -t 30 8&
> (cd $tmpmnt ; mkdir abd; cd abd; tar xvzf ${tarfile} ; setfacl -R -m u:432:rwx linux-2.2.26/Documentation; cd .. ; (setfacl -R -m u:123:rx abd &); (rm abd -Rf &); wait; sync) 2>&1 >/dev/null
> sleep 20
> killall dbench
> sleep 2
> killall -KILL dbench
> sync
> wait
> sleep 3
> umount $tmpmnt
> nbd-client -d /dev/nbd0
> if [ -f ${pidfile} ]
> then
>        kill `cat ${pidfile}`
>        rm -f ${pidfile}
> else
>        kill $PID
> fi
> rm -f ${conffile}
> ls -la ${output}
> ls -la ${tmpnam}
>

tune2fs:

> tune2fs 1.46.3 (27-Jul-2021)
> Filesystem volume name:   <none>
> Last mounted on:          /tmp/mnt-tmp.BYToCaXtWn
> Filesystem UUID:          393d4a8f-4c60-4e35-a723-554bc6ab7c6f
> Filesystem magic number:  0xEF53
> Filesystem revision #:    1 (dynamic)
> Filesystem features:      has_journal ext_attr resize_inode dir_index
> filetype needs_recovery extent 64bit flex_bg sparse_super large_file
> huge_file dir_nlink extra_isize metadata_csum
> Filesystem flags:         signed_directory_hash
> Default mount options:    user_xattr acl
> Filesystem state:         clean
> Errors behavior:          Continue
> Filesystem OS type:       Linux
> Inode count:              128016
> Block count:              512000
> Reserved block count:     25600
> Overhead clusters:        26670
> Free blocks:              485316
> Free inodes:              128005
> First block:              1
> Block size:               1024
> Fragment size:            1024
> Group descriptor size:    64
> Reserved GDT blocks:      256
> Blocks per group:         8192
> Fragments per group:      8192
> Inodes per group:         2032
> Inode blocks per group:   254
> Flex block group size:    16
> Filesystem created:       Tue Dec 28 17:52:31 2021
> Last mount time:          Tue Dec 28 17:52:31 2021
> Last write time:          Tue Dec 28 17:52:31 2021
> Mount count:              1
> Maximum mount count:      -1
> Last checked:             Tue Dec 28 17:52:31 2021
> Check interval:           0 (<none>)
> Lifetime writes:          284 kB
> Reserved blocks uid:      0 (user root)
> Reserved blocks gid:      0 (group root)
> First inode:              11
> Inode size:               128
> Journal inode:            8
> Default directory hash:   half_md4
> Directory Hash Seed:      4799ae9b-680a-41a7-aa0f-3ae76e28ef13
> Journal backup:           inode blocks
> Checksum type:            crc32c
> Checksum:                 0xc6a92ff8
>


Attachments:
extract-data-1908364.img.txt (19.76 kB)
extract-data-1908363.img.txt (19.76 kB)
extract-data-1908362.img.txt (19.76 kB)

2021-12-30 01:37:30

by Theodore Ts'o

Subject: Re: JBD2: journal transaction 6943 on loop0-8 is corrupt.

On Tue, Dec 28, 2021 at 09:36:22PM +0100, Manfred Spraul wrote:
> Hi,
>
> with simulated power failures, I see a corrupted journal
>
> [39056.200845] JBD2: journal transaction 6943 on loop0-8 is corrupt.
> [39056.200851] EXT4-fs (loop0): error loading journal

This means that the journal replay found a commit which was *not* the
last commit, and which contained a CRC error. If it's the last commit
(e.g., there is no valid subsequent commit block), then it's possible
that the journal commit was never completed before the system crashed
--- e.g., it was an interrupted commit.

Your test is aborting the commit at various points in the write I/O
stream, so it should be simulating an interrupted commit (assuming
that it's not corrupting any I/O). So the jbd2 layer should have
understood it was the last commit in the journal, and been OK with the
checksum failure.

But what can happen is that if there is a commit block in the right
place at the end of the transaction, left over from the previous
journalling session, this can confuse the jbd2 layer into thinking
that it is *not* the last transaction, and then it will make the
"journal transaction is corrupt" report.

How does the jbd2 layer determine whether there is a valid subsequent
commit? It considers one valid if the subsequent commit block meets
the following two criteria:

* the commit id is the correct, expected one (n+1 of the previous
commit id).
* the commit time (seconds since January 1, 1970) in the
commit block is greater than the commit time in the previous
commit block.
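The two criteria can be sketched as a small predicate. This is an
illustration of the description above; the structure and function names
are invented here and are not the actual jbd2 identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative only -- not the real jbd2 on-disk structures. */
struct commit_info {
    uint32_t sequence;    /* commit ID of this commit block */
    uint64_t commit_sec;  /* commit time, seconds since 1970-01-01 */
};

/* A subsequent commit block counts as valid only if its ID is exactly
 * one past the previous commit and its timestamp is strictly later. */
static bool valid_subsequent_commit(const struct commit_info *prev,
                                    const struct commit_info *next)
{
    return next->sequence == prev->sequence + 1 &&
           next->commit_sec > prev->commit_sec;
}
```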

So if your test setup doesn't correctly set the time (say, it always
leaves the boot time at January 1, 1970), and the workload is
extremely regular, it's possible that the replay interrupted a journal
commit, but there was a left-over commit block that *looked* valid, and
it triggered the failure.

If this is what happened, it's not a disaster --- the journal replay
will have correctly stopped where it should have, but it thought it
was an exceptional abort, as opposed to a normal journal replay
completion. So the "file system is corrupted" flag will be set,
forcing an fsck, but the fsck shouldn't find any problems with the
file system.

Does this explanation seem to fit with how your test setup is
arranged?

- Ted

2021-12-30 08:16:17

by Manfred Spraul

Subject: Re: JBD2: journal transaction 6943 on loop0-8 is corrupt.

Hi Ted,

On 12/30/21 02:37, Theodore Ts'o wrote:
> On Tue, Dec 28, 2021 at 09:36:22PM +0100, Manfred Spraul wrote:
>> Hi,
>>
>> with simulated power failures, I see a corrupted journal
>>
>> [39056.200845] JBD2: journal transaction 6943 on loop0-8 is corrupt.
>> [39056.200851] EXT4-fs (loop0): error loading journal
> This means that the journal replay found a commit which was *not* the
> last commit, and which contained a CRC error. If it's the last commit
> (e.g., there is no valid subsequent commit block), then it's possible
> that the journal commit was never completed before the system crashed
> --- e.g., it was an interrupted commit.

It is the last commit; there are no valid subsequent commit blocks.

The current failure model is simple: all blocks up to block <n> are
written, the blocks starting from <n+1> are discarded.

What I can't rule out, although I don't think it is what I'm seeing:
The image is small (512 MB) and everything is in memory, so I would
not rule out that the whole journal is filled within less than one second.
The commit header contains h_commit_sec and h_commit_nsec, but from what
I see, do_one_pass() evaluates only h_commit_sec.
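A minimal sketch of why second granularity matters here. The helper is
illustrative, but its condition mirrors the "commit_time <
last_trans_commit_time" test in do_one_pass() quoted further down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The scan pass ignores a CRC mismatch only when the commit block's
 * timestamp is strictly older than that of the last transaction.
 * Only whole seconds (h_commit_sec) enter the comparison, so a stale
 * block written within the same second is NOT recognized as stale. */
static bool crc_mismatch_ignored(uint64_t h_commit_sec,
                                 uint64_t last_trans_commit_sec)
{
    return h_commit_sec < last_trans_commit_sec;
}
```

With a 512 MB image that fits in memory, the whole journal can be
rewritten within one second, so a left-over commit block would carry the
same h_commit_sec as the genuine last commit.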


> Your test is aborting the commit at various points in the write I/O
> stream, so it should be simulating an interrupted commit (assuming
> that it's not corrupting any I/O. So the jbd2 layer should have
> understood it was the last commit in the journal, and been OK with the
> checksum failure.

I think the jbd2 layer understood that it was the last commit - but it
nevertheless failed the recovery.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/jbd2/recovery.c?h=v5.16-rc7#n809

> if (pass == PASS_SCAN &&
>     !jbd2_commit_block_csum_verify(journal, bh->b_data)) {
> chksum_error:
>         if (commit_time < last_trans_commit_time)
>                 goto ignore_crc_mismatch;
>         info->end_transaction = next_commit_ID;
>
>         if (!jbd2_has_feature_async_commit(journal)) {
>                 journal->j_failed_commit = next_commit_ID;
>                 brelse(bh);
>                 break;
>         }
> }

jbd2_has_feature_async_commit() is false.

journal->j_failed_commit is set, and thus after journal playback, a
failure is reported :-(


What I have done:

- I have removed the line journal->j_failed_commit = next_commit_ID;
then the image mounts.

- I added pr_info() lines, and they confirm that it starts with a
failure of jbd2_commit_block_csum_verify().

dmesg with JBD2 debug output and additional pr_info lines:

>>>

[  748.591348] next_commit_id increased: 6943.
[  748.593397] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for
sequence ID 6943 at 4544/8192
[  748.594322] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking
block 4544
[  748.595879] fs/jbd2/recovery.c: (do_one_pass, 549): Found magic 5,
sequence 6943
[  748.596800] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for
sequence ID 6943 at 4545/8192
[  748.600073] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking
block 4545
[  748.601398] fs/jbd2/recovery.c: (do_one_pass, 549): Found magic 1,
sequence 6943
[  748.603481] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for
sequence ID 6943 at 4574/8192
[  748.606238] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking
block 4574
[  748.607241] fs/jbd2/recovery.c: (do_one_pass, 549): Found magic 2,
sequence 6943
[  748.608248] commit_csum_verify error: provided 81be34bd.
[  748.610199] direct block verify error.
[  748.612327] chksum_error.
[  748.617616] error line 2.
[  748.622085] error line 3.
[  748.623291] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for
sequence ID 6943 at 4575/8192
[  748.624133] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking
block 4575
[  748.625134] done, info->end_transaction is 6943.
[  748.626966] done with update, info->end_transaction is 6943.
[  748.629521] next_commit_id: initial value 6799.
[  748.630836] fs/jbd2/recovery.c: (do_one_pass, 491): Starting recovery
pass 1
[  748.631529] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for
sequence ID 6799 at 6734/8192
[  748.633346] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking
block 6734

<<<

What do you think?
Is JBD2 too aggressive in declaring something a journal corruption?

What is definitely correct is that the failure is minor: fsck -f was
able to correct it.
Thus, what is your understanding: if a mount command fails due to a
journal corruption, should user space first try fsck -f before giving
up?


--

    Manfred


Attachments:
0001-jbd2-recovery.c-Continue-on-csum-failures-for-commit.patch (2.68 kB)
0003-DEBUG-patch-add-printk-to-fs-jbd2-recovery.c.patch (3.24 kB)