LinuxLists.cc - [PATCH] jbd2: skip reading super block if it has been verified

2023-06-15 03:56:50

Subject: [PATCH] jbd2: skip reading super block if it has been verified

From: Zhang Yi <[email protected]>

We got a NULL pointer dereference issue below while running generic/475
I/O failure pressure test.

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP PTI
CPU: 1 PID: 15600 Comm: fsstress Not tainted 6.4.0-rc5-xfstests-00055-gd3ab1bca26b4 #190
RIP: 0010:jbd2_journal_set_features+0x13d/0x430
...
Call Trace:
<TASK>
? __die+0x23/0x60
? page_fault_oops+0xa4/0x170
? exc_page_fault+0x67/0x170
? asm_exc_page_fault+0x26/0x30
? jbd2_journal_set_features+0x13d/0x430
jbd2_journal_revoke+0x47/0x1e0
__ext4_forget+0xc3/0x1b0
ext4_free_blocks+0x214/0x2f0
ext4_free_branches+0xeb/0x270
ext4_ind_truncate+0x2bf/0x320
ext4_truncate+0x1e4/0x490
ext4_handle_inode_extension+0x1bd/0x2a0
? iomap_dio_complete+0xaf/0x1d0

The root cause is the journal super block had been failed to write out
due to I/O fault injection, it's uptodate bit was cleared by
end_buffer_write_sync() and didn't reset yet in jbd2_write_superblock().
And it raced by journal_get_superblock()->bh_read(), unfortunately, the
read IO is also failed, so the error handling in
journal_fail_superblock() unexpectedly clear the journal->j_sb_buffer,
finally lead to above NULL pointer dereference issue.

If the journal super block had been read and verified, there is no need
to call bh_read() read it again even if it has been failed to written
out. So the fix could be simply move buffer_verified(bh) in front of
bh_read().

Fixes: d9eafe0afafa ("jbd2: factor out journal initialization from journal_get_superblock()")
Reported-by: Theodore Ts'o <[email protected]>
Signed-off-by: Zhang Yi <[email protected]>
---
fs/jbd2/journal.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index b5e57735ab3f..0cabce7b31bf 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1919,6 +1919,9 @@ static int journal_get_superblock(journal_t *journal)
bh = journal->j_sb_buffer;

J_ASSERT(bh != NULL);
+ if (buffer_verified(bh))
+ return 0;
+
err = bh_read(bh, 0);
if (err < 0) {
printk(KERN_ERR
@@ -1926,9 +1929,6 @@ static int journal_get_superblock(journal_t *journal)
goto out;
}

- if (buffer_verified(bh))
- return 0;
-
sb = journal->j_superblock;

err = -EINVAL;
--
2.39.2

2023-06-15 05:35:37

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH] jbd2: skip reading super block if it has been verified

On Thu, Jun 15, 2023 at 11:49:41AM +0800, Zhang Yi wrote:
> From: Zhang Yi <[email protected]>
>
> We got a NULL pointer dereference issue below while running generic/475
> I/O failure pressure test.

Have you been able to reproduce this failure without the "recheck
checkpoint" series applied? I have not, so like with the e2fsck bug
fix, I can understand how the bug fix worked, but I still don't
understand why I wasn't seeing until I tried to apply the "recheck
chekcpoint" and the following patches in that patch series.

> If the journal super block had been read and verified, there is no need
> to call bh_read() read it again even if it has been failed to written
> out. So the fix could be simply move buffer_verified(bh) in front of
> bh_read().
>
> Fixes: d9eafe0afafa ("jbd2: factor out journal initialization from journal_get_superblock()")

That works, but it's worth noting that commit d9eafe0afafa caused the
failure by removing the check on j_journal_version to determine
whether the superblock was read or not. If the journal superblock had
been previously read, j_journal_version would be either 1 or 2. If it
had been zero, then superblock was not read. So from commit
d9eafe0afafa:

/* Load journal superblock if it is not loaded yet. */
- if (journal->j_format_version == 0 &&
- journal_get_superblock(journal) != 0)
+ if (journal_get_superblock(journal))
return 0;
if (!jbd2_format_support_feature(journal))
return 0;

The comment "Load journal superblock if it is not loaded yet." should
be removed, since it no longer makes sense once the
"journal->j_format_version == 0" check was removed.

I'll also note that a problem with d9eafe0afafa is that by removing
the j_format_version check, every time we add a revoke header, and we
call jbd2_journal_set_features(), this was causing an unconditional
read of the journal superblock and that unnecessary I/O could slow
down certain workloads.

- Ted

2023-06-15 08:39:13

by Zhang Yi

[permalink] [raw]

Subject: Re: [PATCH] jbd2: skip reading super block if it has been verified

On 2023/6/15 13:26, Theodore Ts'o wrote:
> On Thu, Jun 15, 2023 at 11:49:41AM +0800, Zhang Yi wrote:
>> From: Zhang Yi <[email protected]>
>>
>> We got a NULL pointer dereference issue below while running generic/475
>> I/O failure pressure test.
>
> Have you been able to reproduce this failure without the "recheck
> checkpoint" series applied? I have not, so like with the e2fsck bug
> fix, I can understand how the bug fix worked, but I still don't
> understand why I wasn't seeing until I tried to apply the "recheck
> chekcpoint" and the following patches in that patch series.

Yes, I can reproduce this failure without the "recheck
checkpoint" series applied, I reproduced it in ranges from about 5
minutes to 1 hour on your dev branch(just reset to the parent commit
5404e4738054 "ext4: refactoring to use the unified helper
ext4_quotas_off()") with below fstests config.

# ext4 regression fstests config
[ext4]
export FSTYP=ext4
export TEST_DEV=/dev/pmem0p1
export TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/pmem0p2
export SCRATCH_MNT=/mnt/scratch
export LOGWRITES_DEV=/dev/vdc1
export SCRATCH_LOGDEV=/dev/vdc2
export MKFS_OPTIONS="-O ^extents,^flex_bg,^uninit_bg,^64bit,^metadata_csum,^huge_file,^dir_nlink,^extra_isize"

[ 315.435845] EXT4-fs (dm-0): previous I/O error to superblock detected
[ 315.435877] EXT4-fs (dm-0): I/O error while writing superblock
[ 315.435885] EXT4-fs (dm-0): Remounting filesystem read-only
[ 315.438261] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 315.453689] #PF: supervisor write access in kernel mode
[ 315.454884] #PF: error_code(0x0002) - not-present page
[ 315.456048] PGD 139b3b067 P4D 139b3b067 PUD 1538ea067 PMD 0
[ 315.456201] EXT4-fs error (device dm-0): __ext4_find_entry:1678: inode #131073: comm fsstress: reading directory lblock 0
[ 315.457403] Oops: 0002 [#1] PREEMPT SMP
[ 315.457411] CPU: 14 PID: 10107 Comm: fsstress Not tainted 6.4.0-rc5-00054-g5404e4738054 #214
[ 315.457416] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014
[ 315.457418] RIP: 0010:jbd2_journal_set_features+0xf4/0x500
[ 315.461326] EXT4-fs (dm-0): I/O error while writing superblock
[ 315.462073] Code: 48 83 05 5e 32 90 0c 01 48 83 05 f6 05 90 0c 01 4d 8b 74 24 38 e8 dc 6c bc 00 48 83 05 ec 05 90 0c 01 48 83 05 bc 05 90 0c 01 <f0> 49 0f ba 2e 02 0f 92 c0 48 83 05 b3 05 90 0c 01 48 83 05 d5
[ 315.462086] RSP: 0018:ffffc900116cbad8 EFLAGS: 00010212
[ 315.462103] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
[ 315.462107] RDX: 0000000080000000 RSI: ffffffffafd25d54 RDI: 0000000000000001
[ 315.462115] RBP: 0000000000000000 R08: ffffffffafd256f0 R09: 0000000000000000
[ 315.468526] R10: 642820726f727265 R11: 2073662d34545845 R12: ffff88817e85e800
[ 315.468535] R13: 0000000000000000 R14: 0000000000000000 R15: ffff888126d93000
[ 315.468548] FS: 00007fda46982b80(0000) GS:ffff888237980000(0000) knlGS:0000000000000000
[ 315.468560] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 315.468568] CR2: 0000000000000000 CR3: 00000001398d0000 CR4: 00000000000006e0
[ 315.487792] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 315.487798] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 315.491048] Call Trace:
[ 315.491560] <TASK>
[ 315.494065] ? show_regs+0x84/0x90
[ 315.494089] ? __die_body+0x22/0x90
[ 315.494104] ? __die+0x35/0x50
[ 315.494121] ? page_fault_oops+0x1d3/0x5e0
[ 315.515211] ? search_bpf_extables+0x85/0xc0
[ 315.523168] ? jbd2_journal_set_features+0xf4/0x500
[ 315.524214] ? search_exception_tables+0x7c/0x90
[ 315.525211] ? kernelmode_fixup_or_oops+0x140/0x1a0
[ 315.526370] ? __bad_area_nosemaphore+0x208/0x350
[ 315.527475] ? mt_find+0x2ab/0x3c0
[ 315.528718] ? __bad_area+0x88/0xc0
[ 315.529936] ? bad_area+0x1a/0x30
[ 315.530696] ? do_user_addr_fault+0xa6d/0xd00
[ 315.531550] ? exc_page_fault+0xe7/0x3b0
[ 315.532339] ? asm_exc_page_fault+0x22/0x30
[ 315.533153] ? jbd2_journal_set_features+0xf4/0x500
[ 315.533922] ? jbd2_journal_set_features+0xe4/0x500
[ 315.534636] jbd2_journal_revoke+0x43/0x330
[ 315.535272] __ext4_forget+0x112/0x2c0
[ 315.535804] ? __find_get_block+0x155/0x5a0
[ 315.536443] ext4_free_blocks+0xbd2/0xf20
[ 315.537058] ? ext4_free_data+0x140/0x210
[ 315.538420] ? ext4_free_branches+0x2d4/0x3a0
[ 315.540534] ext4_free_branches+0x1c9/0x3a0
[ 315.542064] ext4_ind_truncate+0x361/0x3f0
[ 315.543304] ? ext4_discard_preallocations+0x3c1/0x740
[ 315.546111] ext4_truncate+0x4a0/0x710
[ 315.547623] ext4_file_write_iter+0xb8d/0xe90
[ 315.548940] vfs_write+0x20e/0x590
[ 315.549986] ksys_write+0x77/0x160
[ 315.552027] __x64_sys_write+0x1d/0x30
[ 315.553492] do_syscall_64+0x68/0xf0
[ 315.554711] entry_SYSCALL_64_after_hwframe+0x63/0xcd

I also try to accelerate reproduce in about 2 mins through add
delay in jbd2_write_superblock() either applied the "recheck
chekcpoint" patch series or not.

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index b5e57735ab3f..90d78fe0fb33 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1623,6 +1623,7 @@ static int journal_reset(journal_t *journal)
* This function expects that the caller will have locked the journal
* buffer head, and will return with it unlocked
*/
+#include <linux/delay.h>
static int jbd2_write_superblock(journal_t *journal, blk_opf_t write_flags)
{
struct buffer_head *bh = journal->j_sb_buffer;
@@ -1659,6 +1660,7 @@ static int jbd2_write_superblock(journal_t *journal, blk_opf_t write_flags)
bh->b_end_io = end_buffer_write_sync;
submit_bh(REQ_OP_WRITE | write_flags, bh);
wait_on_buffer(bh);
+ msleep(10);
if (buffer_write_io_error(bh)) {
clear_buffer_write_io_error(bh);
set_buffer_uptodate(bh);

>
>> If the journal super block had been read and verified, there is no need
>> to call bh_read() read it again even if it has been failed to written
>> out. So the fix could be simply move buffer_verified(bh) in front of
>> bh_read().
>>
>> Fixes: d9eafe0afafa ("jbd2: factor out journal initialization from journal_get_superblock()")
>
> That works, but it's worth noting that commit d9eafe0afafa caused the
> failure by removing the check on j_journal_version to determine
> whether the superblock was read or not. If the journal superblock had
> been previously read, j_journal_version would be either 1 or 2. If it
> had been zero, then superblock was not read. So from commit
> d9eafe0afafa:
>
> /* Load journal superblock if it is not loaded yet. */
> - if (journal->j_format_version == 0 &&
> - journal_get_superblock(journal) != 0)
> + if (journal_get_superblock(journal))
> return 0;
> if (!jbd2_format_support_feature(journal))
> return 0;
>
>
> The comment "Load journal superblock if it is not loaded yet." should
> be removed, since it no longer makes sense once the
> "journal->j_format_version == 0" check was removed.

Yes.

>
> I'll also note that a problem with d9eafe0afafa is that by removing
> the j_format_version check, every time we add a revoke header, and we
> call jbd2_journal_set_features(), this was causing an unconditional
> read of the journal superblock and that unnecessary I/O could slow
> down certain workloads.
>

Yes, fortunately it is innocuous in general because the journal super
block buffer is always in memory and uptodate, therefore bh_read() does
not submit I/O. It's only affects the fault case about the window in
jbd2_write_superblock() which the journal super block has been failed
to write out and has not been restore to uptodate yet.

Thanks,
Yi.