LinuxLists.cc - [RFC PATCH 0/2] ext4, jbd2: journal cycled record transactions

2023-01-19 05:09:36

Subject: [RFC PATCH 0/2] ext4, jbd2: journal cycled record transactions

From: Zhang Yi <[email protected]>

Hello,

This patch set introduces a new mount option named
journal_cycle_record. It saves the journal head for a cleanly unmounted
file system, so that ext4 continues recording new journal transactions
after those of the previous mount, or after the recovered transactions
for an unclean file system. This gives us more information when
analysing a corrupted file system image and makes it easier to locate
kernel consistency bugs.

This is just the kernel part, and it has already passed xfstests in
auto mode. I will continue with the e2fsprogs part if nobody strongly
dislikes the idea. Any comments are welcome.

Thanks,
Yi.

Zhang Yi (2):
jbd2: cycled record log on clean journal logging area
ext4: add journal cycled recording support

fs/ext4/ext4.h | 2 ++
fs/ext4/super.c | 17 +++++++++++++++++
fs/jbd2/journal.c | 18 ++++++++++++++++--
fs/jbd2/recovery.c | 22 +++++++++++++++++-----
include/linux/jbd2.h | 9 +++++++--
5 files changed, 59 insertions(+), 9 deletions(-)

--
2.31.1
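
As a rough illustration of the behaviour the cover letter describes, the sketch below models how a mount might choose where to start logging and how the head then cycles through the log area. It is not the jbd2 code: `struct toy_journal`, `toy_pick_head()` and `toy_advance()` are hypothetical names, and the only assumptions taken from the patch are that a saved head is used when the option is on and the value lies inside the log area, and that the first log block is used otherwise.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the few journal_t fields involved. */
struct toy_journal {
	unsigned long first;	/* first usable log block */
	unsigned long last;	/* one past the last usable log block */
	unsigned long head;	/* where the next transaction is written */
	bool cycle_record;	/* models the journal_cycle_record option */
};

/*
 * Choose the starting head for a new mount. saved_head models the value
 * remembered from the previous clean unmount or from recovery
 * (0 on an image that never recorded one).
 */
static unsigned long toy_pick_head(struct toy_journal *j,
				   unsigned long saved_head)
{
	assert(j->first < j->last);
	if (j->cycle_record && saved_head >= j->first && saved_head < j->last)
		j->head = saved_head;	/* continue after the older logs */
	else
		j->head = j->first;	/* option off, or saved head invalid */
	return j->head;
}

/* Advance the head by nblocks, wrapping back to first at the end of the log. */
static unsigned long toy_advance(struct toy_journal *j, unsigned long nblocks)
{
	unsigned long size = j->last - j->first;

	j->head = j->first + (j->head - j->first + nblocks) % size;
	return j->head;
}
```

With `first = 1` and `last = 1025`, a saved head of 300 is kept when the option is on, while a saved head of 0 or 1025 falls back to block 1, which is the classic behaviour.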

2023-01-19 05:10:01

by Zhang Yi

Subject: [RFC PATCH 1/2] jbd2: cycled record log on clean journal logging area

From: Zhang Yi <[email protected]>

For a newly mounted file system, the journal committing thread always
starts recording the log from the beginning of the journal area, no
matter whether the journal is clean or has just been recovered. This is
a disadvantage when analysing a corrupted file system image and locating
file system inconsistency bugs. When we get a corrupted file system
image and want to find out what has happened, besides looking up the
system log, one effective way is to backtrack through the journal log.
But we may not always run e2fsck before each mount, and the default
fsck -a mode cannot always find all inconsistencies, so some
inconsistencies could be carried over into the next mount until we
detect them. In the end, the transactions in the journal may be
discontinuous and some relatively new transactions may have been
overwritten, which makes analysis hard. So if we could record
transactions continuously across mounts, we could get more useful
information from the journal.

|Previous mount checkpointed/recovered logs|Current mount logs |
|{------}{---}{--------} ... {------}| ... |{======}{========}...000000|

This patch saves the head blocknr in the superblock after flushing the
journal or unmounting the file system, so that the next mount can
continue recording new transactions after it. This change is backward
compatible because old kernels do not care about the head blocknr of the
journal. It is also fine to mount a clean old image without a valid head
blocknr: we fall back to setting it to s_first, just like before.
Finally, when mounting an unclean file system, we can also get the
journal head easily after scanning the journal, and new transactions
will be recorded after the recovered transactions.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/jbd2/journal.c | 18 ++++++++++++++++--
fs/jbd2/recovery.c | 22 +++++++++++++++++-----
include/linux/jbd2.h | 9 +++++++--
3 files changed, 40 insertions(+), 9 deletions(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 2696f43e7239..41f0f5625e7c 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1557,8 +1557,21 @@ static int journal_reset(journal_t *journal)
journal->j_first = first;
journal->j_last = last;

- journal->j_head = journal->j_first;
- journal->j_tail = journal->j_first;
+ if (journal->j_flags & JBD2_CYCLE_RECORD) {
+ /*
+ * Disable the cycled recording mode if the journal head block
+ * number is not correct.
+ */
+ if (journal->j_head < first || journal->j_head >= last) {
+ printk(KERN_WARNING "JBD2: Incorrect Journal head block %lu, "
+ "disable journal_cycle_record\n",
+ journal->j_head);
+ journal->j_head = journal->j_first;
+ }
+ } else {
+ journal->j_head = journal->j_first;
+ }
+ journal->j_tail = journal->j_head;
journal->j_free = journal->j_last - journal->j_first;

journal->j_tail_sequence = journal->j_transaction_sequence;
@@ -1730,6 +1743,7 @@ static void jbd2_mark_journal_empty(journal_t *journal, blk_opf_t write_flags)

sb->s_sequence = cpu_to_be32(journal->j_tail_sequence);
sb->s_start = cpu_to_be32(0);
+ sb->s_head = cpu_to_be32(journal->j_head);
if (jbd2_has_feature_fast_commit(journal)) {
/*
* When journal is clean, no need to commit fast commit flag and
diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index 8286a9ec122f..6f6bcb75fffe 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -29,6 +29,7 @@ struct recovery_info
{
tid_t start_transaction;
tid_t end_transaction;
+ unsigned long head_block;

int nr_replays;
int nr_revokes;
@@ -301,11 +302,11 @@ int jbd2_journal_recover(journal_t *journal)
* is always zero if, and only if, the journal was cleanly
* unmounted.
*/
-
if (!sb->s_start) {
- jbd2_debug(1, "No recovery required, last transaction %d\n",
- be32_to_cpu(sb->s_sequence));
+ jbd2_debug(1, "No recovery required, last transaction %d, head block %u\n",
+ be32_to_cpu(sb->s_sequence), be32_to_cpu(sb->s_head));
journal->j_transaction_sequence = be32_to_cpu(sb->s_sequence) + 1;
+ journal->j_head = be32_to_cpu(sb->s_head);
return 0;
}

@@ -324,6 +325,9 @@ int jbd2_journal_recover(journal_t *journal)
/* Restart the log at the next transaction ID, thus invalidating
* any existing commit records in the log. */
journal->j_transaction_sequence = ++info.end_transaction;
+ journal->j_head = info.head_block;
+ jbd2_debug(1, "JBD2: last transaction %d, head block %u\n",
+ journal->j_transaction_sequence, journal->j_head);

jbd2_journal_clear_revoke(journal);
err2 = sync_blockdev(journal->j_fs_dev);
@@ -364,6 +368,7 @@ int jbd2_journal_skip_recovery(journal_t *journal)
if (err) {
printk(KERN_ERR "JBD2: error %d scanning journal\n", err);
++journal->j_transaction_sequence;
+ journal->j_head = journal->j_first;
} else {
#ifdef CONFIG_JBD2_DEBUG
int dropped = info.end_transaction -
@@ -373,6 +378,7 @@ int jbd2_journal_skip_recovery(journal_t *journal)
dropped, (dropped == 1) ? "" : "s");
#endif
journal->j_transaction_sequence = ++info.end_transaction;
+ journal->j_head = info.head_block;
}

journal->j_tail = 0;
@@ -462,7 +468,7 @@ static int do_one_pass(journal_t *journal,
struct recovery_info *info, enum passtype pass)
{
unsigned int first_commit_ID, next_commit_ID;
- unsigned long next_log_block;
+ unsigned long next_log_block, head_block;
int err, success = 0;
journal_superblock_t * sb;
journal_header_t * tmp;
@@ -485,6 +491,7 @@ static int do_one_pass(journal_t *journal,
sb = journal->j_superblock;
next_commit_ID = be32_to_cpu(sb->s_sequence);
next_log_block = be32_to_cpu(sb->s_start);
+ head_block = next_log_block;

first_commit_ID = next_commit_ID;
if (pass == PASS_SCAN)
@@ -809,6 +816,7 @@ static int do_one_pass(journal_t *journal,
if (commit_time < last_trans_commit_time)
goto ignore_crc_mismatch;
info->end_transaction = next_commit_ID;
+ info->head_block = head_block;

if (!jbd2_has_feature_async_commit(journal)) {
journal->j_failed_commit =
@@ -817,8 +825,10 @@ static int do_one_pass(journal_t *journal,
break;
}
}
- if (pass == PASS_SCAN)
+ if (pass == PASS_SCAN) {
last_trans_commit_time = commit_time;
+ head_block = next_log_block;
+ }
brelse(bh);
next_commit_ID++;
continue;
@@ -868,6 +878,8 @@ static int do_one_pass(journal_t *journal,
if (pass == PASS_SCAN) {
if (!info->end_transaction)
info->end_transaction = next_commit_ID;
+ if (!info->head_block)
+ info->head_block = head_block;
} else {
/* It's really bad news if different passes end up at
* different places (but possible due to IO errors). */
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 2170e0cc279d..d5843ebfa6ed 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -265,8 +265,10 @@ typedef struct journal_superblock_s
__u8 s_padding2[3];
/* 0x0054 */
__be32 s_num_fc_blks; /* Number of fast commit blocks */
-/* 0x0058 */
- __u32 s_padding[41];
+ __be32 s_head; /* blocknr of head of log, only uptodate
+ * while the filesystem is clean */
+/* 0x005C */
+ __u32 s_padding[40];
__be32 s_checksum; /* crc32c(superblock) */

/* 0x0100 */
@@ -1392,6 +1394,9 @@ JBD2_FEATURE_INCOMPAT_FUNCS(fast_commit, FAST_COMMIT)
#define JBD2_ABORT_ON_SYNCDATA_ERR 0x040 /* Abort the journal on file
* data write error in ordered
* mode */
+#define JBD2_CYCLE_RECORD 0x080 /* Journal cycled record log on
+ * clean and empty filesystem
+ * logging area */
#define JBD2_FAST_COMMIT_ONGOING 0x100 /* Fast commit is ongoing */
#define JBD2_FULL_COMMIT_ONGOING 0x200 /* Full commit is ongoing */
#define JBD2_JOURNAL_FLUSH_DISCARD 0x0001
--
2.31.1
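
The jbd2.h hunk above carves s_head out of the reserved padding, so the offsets of the fields after it should not move. The sketch below checks that arithmetic at compile time. It is only a model: `struct toy_jsb_old` and `struct toy_jsb_new` are hypothetical, the first 0x54 bytes are kept opaque, and plain `uint32_t` stands in for `__be32`. The offsets 0x0054, 0x0058, 0x005C and 0x0100 are the ones given in the patch comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the tail of the journal superblock before the patch. */
struct toy_jsb_old {
	uint8_t  prefix[0x54];		/* everything before s_num_fc_blks */
	uint32_t s_num_fc_blks;		/* 0x0054 */
	uint32_t s_padding[41];		/* 0x0058 */
	uint32_t s_checksum;		/* 0x00FC */
};

/* The same region after the patch: one padding word becomes s_head. */
struct toy_jsb_new {
	uint8_t  prefix[0x54];
	uint32_t s_num_fc_blks;		/* 0x0054 */
	uint32_t s_head;		/* 0x0058, taken from the padding */
	uint32_t s_padding[40];		/* 0x005C */
	uint32_t s_checksum;		/* 0x00FC */
};

static_assert(offsetof(struct toy_jsb_new, s_head) == 0x58,
	      "s_head occupies the first former padding word");
static_assert(offsetof(struct toy_jsb_old, s_checksum) ==
	      offsetof(struct toy_jsb_new, s_checksum),
	      "s_checksum must not move");
static_assert(sizeof(struct toy_jsb_old) == sizeof(struct toy_jsb_new),
	      "the modelled region keeps its size");
```

If the padding word was zero on images written by older kernels (an assumption, not something the patch states), s_head reads as 0 there, which lies below the first log block and so takes the fallback path in journal_reset().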

2023-01-26 10:15:02

by Jan Kara

Subject: Re: [RFC PATCH 1/2] jbd2: cycled record log on clean journal logging area

Hello!

On Thu 19-01-23 11:45:59, Zhang Yi wrote:
> From: Zhang Yi <[email protected]>
>
> For a newly mounted file system, the journal committing thread always
> starts recording the log from the beginning of the journal area, no
> matter whether the journal is clean or has just been recovered. This is
> a disadvantage when analysing a corrupted file system image and locating
> file system inconsistency bugs. When we get a corrupted file system
> image and want to find out what has happened, besides looking up the
> system log, one effective way is to backtrack through the journal log.
> But we may not always run e2fsck before each mount, and the default
> fsck -a mode cannot always find all inconsistencies, so some
> inconsistencies could be carried over into the next mount until we
> detect them. In the end, the transactions in the journal may be
> discontinuous and some relatively new transactions may have been
> overwritten, which makes analysis hard. So if we could record
> transactions continuously across mounts, we could get more useful
> information from the journal.
>
> |Previous mount checkpointed/recovered logs|Current mount logs |
> |{------}{---}{--------} ... {------}| ... |{======}{========}...000000|
>
> This patch saves the head blocknr in the superblock after flushing the
> journal or unmounting the file system, so that the next mount can
> continue recording new transactions after it. This change is backward
> compatible because old kernels do not care about the head blocknr of the
> journal. It is also fine to mount a clean old image without a valid head
> blocknr: we fall back to setting it to s_first, just like before.
> Finally, when mounting an unclean file system, we can also get the
> journal head easily after scanning the journal, and new transactions
> will be recorded after the recovered transactions.

I understand the usecase although if there are multiple mounts between
the time when the corruption happened and when it got detected I suspect
the journal will be already overwritten (filled and wrapped over) and so not
too useful anyway. But still the number of blocks preserved in the journal
will be higher so I guess there is some chance there will be something
useful in there.

Do you want this mostly for debugging stuff (like fuzzer testing) or
would you really want to run with this on production machines?

Also I think we could actually implement something like this without adding
the s_head field (i.e., without any on-disk format change). Setting s_start
to 0 when the journal is empty is actually only an optimization. We could
leave it where it is (in this debug mode) and just make jbd2, while the mode
is in use, detect an empty journal from j_head == s_start instead of by
testing s_start == 0. The only difference would be that
jbd2_journal_recover() would now try recovering even an empty journal (but
abort immediately), which mostly should not happen on a clean mount anyway
because we call jbd2 to recover the journal only if
ext4_has_feature_journal_needs_recovery().

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR
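
The alternative suggested in the reply above comes down to two different tests for an empty journal. The sketch below places them side by side. It is a simplified model, not jbd2 code: `struct toy_state` and the three functions are hypothetical, and the idea that a fully checkpointed journal would record its current position in s_start is one reading of "leave it where it is".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the two values involved. */
struct toy_state {
	uint32_t s_start;	/* on-disk start of the log */
	unsigned long j_head;	/* in-memory head of the log */
};

/* Existing convention: s_start is zero if and only if the journal is clean. */
static bool toy_empty_by_zero(const struct toy_state *s)
{
	return s->s_start == 0;
}

/* Suggested debug-mode test: nothing was logged past s_start. */
static bool toy_empty_by_head(const struct toy_state *s)
{
	return s->j_head == s->s_start;
}

/*
 * Assumed debug-mode "mark empty": instead of zeroing s_start, leave it at
 * the current position so the next mount knows where to continue.
 */
static void toy_mark_empty_debug(struct toy_state *s)
{
	s->s_start = (uint32_t)s->j_head;
	assert(toy_empty_by_head(s));
}
```

A state marked empty this way passes the head-based test but fails the s_start == 0 test, which is the case an existing kernel would see.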

2023-01-28 06:49:55

by Zhang Yi

Subject: Re: [RFC PATCH 1/2] jbd2: cycled record log on clean journal logging area

Hello Jan, thanks for the suggestions.

On 2023/1/26 18:14, Jan Kara wrote:
> Hello!
>
> I understand the usecase although if there are multiple mounts between
> the time when the corruption happened and when it got detected I suspect
> the journal will be already overwritten (filled and wrapped over) and so not
> too useful anyway. But still the number of blocks preserved in the journal
> will be higher so I guess there is some chance there will be something
> useful in there.
>
> Do you want this mostly for debugging stuff (like fuzzer testing) or
> would you really want to run with this on production machines?

It's useful for debugging, but it may also benefit our production machines
(e.g. we have many consumer and embedded products that do not run for long
and do not make many filesystem changes per run and mount), so I would
really like to run with this on production machines if possible.

>
> Also I think we could actually implement something like this without adding
> the s_head field (i.e., without any on-disk format change). Setting s_start
> to 0 when the journal is empty is actually only an optimization. We could
> leave it where it is (in this debug mode) and just make jbd2, while the mode
> is in use, detect an empty journal from j_head == s_start instead of by
> testing s_start == 0. The only difference would be that
> jbd2_journal_recover() would now try recovering even an empty journal (but
> abort immediately), which mostly should not happen on a clean mount anyway
> because we call jbd2 to recover the journal only if
> ext4_has_feature_journal_needs_recovery().
>

I understand it's best to avoid changing the on-disk format. But IIUC, this
is not backward compatible: it changes the 'magic code' (s_start == 0) of a
clean journal, which old kernels rely on. If we mount, on an old kernel, a
clean ext4 image that has just been used in debug mode, the warning below
from jbd2_journal_wipe() appears, and fsck also complains about it.

JBD2: Clearing recovery information on journal

fsck.ext4 -a /dev/pmem1
/dev/pmem1: Superblock needs_recovery flag is clear, but journal has data.
/dev/pmem1: Run journal anyway.
/dev/pmem1: recovering journal
...

Although it is not a big deal, it looks strange and confusing. For this
reason, it seems that this (reusing s_start) may only be usable for
debugging, if we don't care about this incompatibility warning. Otherwise
things get complicated: we may have to add one more incompatible feature bit
for this mode, and then we cannot mount it on old kernels. What do you think?

Thanks.
Yi.
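
The compatibility concern in this reply can be shown with a small model of how an older kernel would read the journal superblock after a clean unmount under each scheme. All names below are hypothetical. The only assumption about the older kernel is the one stated in the thread: it treats s_start == 0 as the sign of a clean journal and does not look at the new field.

```c
#include <assert.h>
#include <stdint.h>

enum toy_verdict { TOY_CLEAN, TOY_HAS_DATA };

/* Hypothetical model: an older kernel sees s_head only as ignored padding. */
struct toy_sb {
	uint32_t s_start;
	uint32_t s_head;
};

/* What an older kernel concludes, based on s_start alone. */
static enum toy_verdict toy_old_kernel_view(const struct toy_sb *sb)
{
	return sb->s_start == 0 ? TOY_CLEAN : TOY_HAS_DATA;
}

/* Clean unmount under the s_head scheme proposed in the patch. */
static struct toy_sb toy_clean_with_s_head(uint32_t head)
{
	struct toy_sb sb = { .s_start = 0, .s_head = head };

	return sb;
}

/* Clean unmount when s_start is reused to remember the log position. */
static struct toy_sb toy_clean_reusing_s_start(uint32_t head)
{
	/* 0 is the "empty" marker, so a remembered position is non-zero. */
	assert(head != 0);

	struct toy_sb sb = { .s_start = head, .s_head = 0 };

	return sb;
}
```

In this model, the first scheme still looks clean to an older kernel, while the second looks like a journal that holds data, which matches the jbd2_journal_wipe() warning and the fsck complaint quoted above.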