2021-02-24 01:20:33

by Seamus Connor

Subject: reproducible corruption in journal

Hello All,

I am investigating an issue on our system where a filesystem is becoming corrupt. The underlying block device is provided by us, so we were suspicious that we were screwing something up. However, after gathering a repro of the issue with write logging (a slightly modified version of dm-log-writes), it looks like jbd2 is actually the one running into problems. Up to entry 2979 in our (dm-log-writes) log file, things seem OK:

$ replay-log --log write_log --replay base_filesystem.ext4.tmp --limit 2979
$ debugfs base_filesystem.ext4.tmp -f <(echo logdump) | tail
debugfs 1.44.1 (24-Mar-2018)
Found expected sequence 12821, type 1 (descriptor block) at block 1791
Found expected sequence 12821, type 2 (commit block) at block 1795
Found expected sequence 12822, type 1 (descriptor block) at block 1796
Found expected sequence 12822, type 2 (commit block) at block 1802
Found expected sequence 12823, type 1 (descriptor block) at block 1803
Found expected sequence 12823, type 2 (commit block) at block 1810
Found expected sequence 12824, type 1 (descriptor block) at block 1811
Found expected sequence 12824, type 2 (commit block) at block 1815
Found expected sequence 12825, type 1 (descriptor block) at block 1816
No magic number at block 1821: end of journal.

However, once we replay the next IO, things get a little weirder.

$ ...--limit 2980 ...
Found expected sequence 12825, type 2 (commit block) at block 1821
Found sequence 12824 (not 12826) at block 1822: end of journal.

And finally, after replaying the next IO, it looks like a bunch of old entries in the journal get resurrected:

$ ...--limit 2981 ...
Found expected sequence 15454, type 1 (descriptor block) at block 15352
Found expected sequence 15454, type 2 (commit block) at block 15356

If we fsck our filesystem image at IO 2981 or later, we get a bunch of errors; before that, it is clean. IO 2981 touches a single 4k block in inode 8's data blocks. When this corruption was produced, the filesystem was mounted rw,relatime,sync.

This issue is reproduced by our workload, which is a smallish volume of writes every 30 seconds or so. Once the files are written, they are never modified, and they are eventually deleted in a rotation style. During a torture test, we disable IO to the underlying block device, and before re-enabling it we fsck the contents of the device. It is here that we detect the corruption; we see it in around 1% of these events.

Our kernel is based on Ubuntu 16.04's 4.4 series kernel. We have many patches on top of this, but none in mm or fs, and none in block that should be relevant. The issue also reproduces on Linux 5.4 + our patch series, though I have not studied any reproductions there.

My next step is to start going through the jbd2 code to figure out how the journal could become corrupted. Any assistance would be appreciated, as I am mostly unfamiliar with this code. Does anyone have pointers on what to look at, or any recollection of similar issues? I can extract more information from the logs if it would help. I am also happy to share the filesystem images and the write log if someone wants to look at them.

Thanks,
Seamus


2021-02-24 15:07:45

by Theodore Ts'o

Subject: Re: reproducible corruption in journal

On Tue, Feb 23, 2021 at 04:41:20PM -0800, Seamus Connor wrote:
> Hello All,
>
> I am investigating an issue on our system where a filesystem is becoming corrupt.

So I'm not 100% sure, but it sounds to me like what is going on is
caused by the following:

*) The jbd/jbd2 layer relies on finding an invalid block (a block
which is missing the jbd/jbd2 "magic number", or where the sequence
number is unexpected) to indicate the end of the journal.

*) We reset the (4 byte) sequence number to zero on a freshly
mounted file system.

*) It appears that your test is generating a large number of very
small transactions, and you are then "crashing" the file system by
disconnecting the file system from further updates, and running e2fsck
to replay the journal, throwing away the block writes after the
"disconnection", and then remounting the file system. I'm going to
further guess that the sizes of the small transactions are very similar,
and the amount of time between when the file system is mounted, and
when the file system is forcibly disconnected, is highly predictable
(e.g., always N seconds, plus or minus a small delta).

Is that last point correct? If so, that's a perfect storm where it's
possible for the journal replay to get confused and mistake previous
blocks in the journal for ones that are part of the last valid file system
mount. It's something which probably never happens in practice in
production, since users are generally not running a super-fixed
workload, and then causing the system to repeatedly crash after a
fixed interval, such that the mistake described above could happen.
That being said, it's arguably still a bug.
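
To make that concrete, here is a toy userspace model of the scan.  This
is *not* the real code -- the real logic is do_one_pass() in
fs/jbd2/recovery.c -- and the numbers are just illustrative; it only
shows the shape of the check that can be fooled:

#include <stdint.h>
#include <stdio.h>

#define JBD2_MAGIC_NUMBER 0xc03b3998U   /* on-disk journal magic */

/* Toy stand-in for the journal block header fields the scan looks at. */
struct toy_header {
        uint32_t h_magic;
        uint32_t h_sequence;
};

/*
 * Walk the log and stop at the first block whose magic or sequence does
 * not match the next expected commit ID.  (The real code only bumps the
 * expected ID on commit blocks; that detail is elided here.)
 */
static unsigned int scan_tail(const struct toy_header *log,
                              unsigned int nblocks, uint32_t first_commit_id)
{
        uint32_t next_commit_id = first_commit_id;
        unsigned int i;

        for (i = 0; i < nblocks; i++) {
                if (log[i].h_magic != JBD2_MAGIC_NUMBER)
                        break;          /* no magic: end of journal */
                if (log[i].h_sequence != next_commit_id)
                        break;          /* unexpected sequence: end of journal */
                next_commit_id++;
        }
        return i;
}

int main(void)
{
        /*
         * Blocks 0-2 are from the current mount.  Block 3 is a stale block
         * left over from a *previous* mount, but because every mount starts
         * from the same sequence number and runs for roughly the same time,
         * it happens to carry exactly the sequence the scan expects next, so
         * the replay keeps going past the true end of the journal.
         */
        struct toy_header log[] = {
                { JBD2_MAGIC_NUMBER, 12823 },
                { JBD2_MAGIC_NUMBER, 12824 },
                { JBD2_MAGIC_NUMBER, 12825 },
                { JBD2_MAGIC_NUMBER, 12826 },   /* stale, but "expected" */
        };

        printf("blocks treated as valid: %u\n",
               scan_tail(log, sizeof(log) / sizeof(log[0]), 12823));
        return 0;
}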

Is this hypothesis consistent with what you are seeing?

If so, I can see two possible solutions to avoid this:

1) When we initialize the journal, after replaying the journal and
writing a new journal superblock, we issue a discard for the rest of
the journal. This won't help for block devices that don't support
discard, but it should slightly reduce work for the FTL, and perhaps
slightly improve the write endurance for flash.

2) We should stop resetting the sequence number to zero, but instead,
keep the sequence number at the last used number. For testing
purposes, we should have an option where the sequence number is forced
to (0U - 300) so that we test what happens when the 4 byte unsigned
integer wraps.
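
(For what it's worth, the sequence comparisons themselves are meant to be
wrap-safe -- jbd2 has tid_gt()/tid_geq() helpers in include/linux/jbd2.h
that compare via signed subtraction.  A quick userspace illustration of
that style of comparison, starting at the proposed 0U - 300; again, this
is just an illustration, not kernel code:)

#include <stdint.h>
#include <stdio.h>

typedef uint32_t tid_t;

/* Wrap-safe "x is newer than y", in the style of jbd2's tid_gt() helper. */
static int tid_gt(tid_t x, tid_t y)
{
        int32_t difference = (int32_t)(x - y);
        return difference > 0;
}

int main(void)
{
        tid_t before = 0U - 300;        /* the proposed test starting point */
        tid_t after = before + 600;     /* wraps past zero, ends up at 300 */

        printf("before=%u after=%u tid_gt(after, before)=%d\n",
               before, after, tid_gt(after, before));
        return 0;
}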

Cheers,

- Ted

2021-02-24 19:03:37

by Seamus Connor

Subject: Re: reproducible corruption in journal

> *) It appears that your test is generating a large number of very
> small transactions, and you are then "crashing" the file system by
> disconnecting the file system from further updates, and running e2fsck
> to replay the journal, throwing away the block writes after the
> "disconnection", and then remounting the file system. I'm going to
> further guess that the sizes of the small transactions are very similar,
> and the amount of time between when the file system is mounted, and
> when the file system is forcibly disconnected, is highly predictable
> (e.g., always N seconds, plus or minus a small delta).

Yes, this matches the workload. I assume the transactions are very small
because we are doing a large number of metadata operations, and
because we are mounted sync?

>
> Is that last point correct? If so, that's a perfect storm where it's
> possible for the journal replay to get confused and mistake previous
> blocks in the journal for ones that are part of the last valid file system
> mount. It's something which probably never happens in practice in
> production, since users are generally not running a super-fixed
> workload, and then causing the system to repeatedly crash after a
> fixed interval, such that the mistake described above could happen.
> That being said, it's arguably still a bug.
>
> Is this hypothesis consistent with what you are seeing?

Yes, this is consistent with what I am seeing. The only thing to add is that
the workload isn't particularly fixed. The data being written is generated
by a production workload (we are recording statistics about hardware).
The interval at which we are shutting down the block device is regular
but not precise (+/- 30 seconds).

>
> If so, I can see two possible solutions to avoid this:
>
> 1) When we initialize the journal, after replaying the journal and
> writing a new journal superblock, we issue a discard for the rest of
> the journal. This won't help for block devices that don't support
> discard, but it should slightly reduce work for the FTL, and perhaps
> slightly improve the write endurance for flash.

Our virtual device doesn't support discard; could that be why others aren't
seeing this issue?

>
> 2) We should stop resetting the sequence number to zero, but instead,
> keep the sequence number at the last used number. For testing
> purposes, we should have an option where the sequence number is forced
> to (0U - 300) so that we test what happens when the 4 byte unsigned
> integer wraps.

I can give this a try with my workload. Just so I can be sure I understand, the
hypothesis is that we are running into issues during do_one_pass(..., PASS_SCAN)
because we are getting unlucky with "if (sequence != next_commit_ID) {..."?
The solution is to reduce the occurrence of this issue (to basically zero) by not
resetting the sequence number? Have I understood you correctly? Looking
through e2fsprogs, I think there is a commit that already does this
(32448f50df7d974ded956bbc78a419cf65ec09a3) during replay. Another thing
that I could try is zeroing out the contents of inode 8 after a journal replay and
recreating the journal after each event.
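
For that last idea, I was picturing something along these lines after each
event, once the device is offline and the journal has been replayed (the
device name is just a placeholder):

$ e2fsck -fy /dev/vdX                  # replay the journal and check
$ tune2fs -O ^has_journal /dev/vdX     # drop the old journal (inode 8)
$ tune2fs -O has_journal /dev/vdX      # create a fresh journal

That should leave the old journal contents inert after every event, assuming
tune2fs initializes the new journal blocks rather than doing a lazy init.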

Thanks for your help!