LinuxLists.cc - Kernel panic and filesystem corruption in setxattr due to journal transaction restart

2022-01-12 23:13:55

Subject: Kernel panic and filesystem corruption in setxattr due to journal transaction restart

Hello,

During recent testing, we have found a repeatable kernel panic and ext4 corruption. Andrew Perepechko investigated the root cause and prepared a reproducer that works on the current master HEAD

[root@CO82 linux]# git rev-parse HEAD
daadb3bd0e8d3e317e36bc2c1542e86c528665e5

Here is the reproducer:
[root@CO82 ~]# cat setxttr_corrupt.sh
#!/bin/bash
dd if=/dev/zero of=/tmp/ldiskfs bs=1M count=100
mkfs.ext4 -O ea_inode /tmp/ldiskfs -J size=16 -I 512

mkdir -p /tmp/ldiskfs_m
mount -t ext4 /tmp/ldiskfs /tmp/ldiskfs_m -o loop,commit=600,no_mbcache
touch /tmp/ldiskfs_m/file{1..1024}

V=$(for i in `seq 60000`; do echo -n x ; done)
V1="1$V"
V2="2$V"

while true; do
setfattr -n user.xattr -v $V /tmp/ldiskfs_m/file{1..1024}
setfattr -n user.xattr -v $V1 /tmp/ldiskfs_m/file{1..1024} &
setfattr -n user.xattr -v $V2 /tmp/ldiskfs_m/file{1024..1} &
wait
done

umount /tmp/ldiskfs_m

Here is an outpost of the test session:
[root@CO82 ~]# sh ./setxttr_corrupt.sh
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.0674946 s, 1.6 GB/s
mke2fs 1.46.2.wc3 (18-Jun-2021)
Discarding device blocks: done
Creating filesystem with 102400 1k blocks and 25584 inodes
Filesystem UUID: c328f332-3f48-49de-b9e0-25ca129fd8da
Superblock backups stored on blocks:
8193, 24577, 40961, 57345, 73729

Allocating group tables: done
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done

Kernel BUG from dmesg:
[ 275.670724] ------------[ cut here ]------------
[ 275.670729] kernel BUG at fs/jbd2/transaction.c:1503!
[ 275.670937] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 275.671128] CPU: 4 PID: 920 Comm: setfattr Not tainted 5.16.0rash+ #56
[ 275.671352] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 275.671621] RIP: 0010:jbd2_journal_dirty_metadata+0x1e5/0x220
[ 275.671621] Code: 8e 00 4c 89 e7 ba 01 00 00 00 48 89 ee e8 c3 f5 ff ff 48 89 df e8 4b ed 8e 00 e9 54 ff ff ff 49 39 6c 24 30 0f 84 86 fe ff ff <0f> 0b 4d 8b 4a 78 4c 39 cd 0f 84 3a ff ff ff e9 cf 4e 8a 00 0f 0b
[ 275.671621] RSP: 0018:ffffa54e8088fa20 EFLAGS: 00010207
[ 275.671621] RAX: 0000000000000000 RBX: ffff914541e80070 RCX: 0000000000000000
[ 275.671621] RDX: 0000000000000001 RSI: ffff9145437a60d0 RDI: 0000000000000000
[ 275.671621] RBP: ffff914543662d00 R08: ffff9145437a60d0 R09: ffff9145575103f8
[ 275.671621] R10: ffff9145456b5800 R11: 0000000000000004 R12: ffff9145563ebf00
[ 275.671621] R13: 0000000000000000 R14: ffff9145437a60d0 R15: ffff9145563ebf08
[ 275.671621] FS: 00007f82fd09f740(0000) GS:ffff91455bd00000(0000) knlGS:0000000000000000
[ 275.671621] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 275.671621] CR2: 000055d7358b3f08 CR3: 0000000102220004 CR4: 00000000000706e0
[ 275.671621] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 275.671621] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 275.671621] Call Trace:
[ 275.671621] <TASK>
[ 275.671621] __ext4_handle_dirty_metadata+0x55/0x1a0
[ 275.671621] ext4_mark_iloc_dirty+0x143/0x680
[ 275.671621] ext4_xattr_set_handle+0x39a/0x680
[ 275.671621] ext4_xattr_set+0xd1/0x180
[ 275.671621] __vfs_setxattr+0x65/0x80
[ 275.671621] __vfs_setxattr_noperm+0x6b/0x200
[ 275.671621] vfs_setxattr+0x9d/0x180
[ 275.671621] setxattr+0x11f/0x180
[ 275.671621] ? kmem_cache_alloc+0x2e/0x1a0
[ 275.671621] ? getname_flags+0x65/0x1e0
[ 275.671621] ? preempt_count_add+0x44/0x90
[ 275.671621] path_setxattr+0xc2/0xe0
[ 275.671621] __x64_sys_setxattr+0x22/0x30
[ 275.671621] do_syscall_64+0x3a/0x80
[ 275.671621] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 275.671621] RIP: 0033:0x7f82fc9b3dae
[ 275.671621] Code: 48 8b 0d dd 20 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 bc 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d aa 20 2c 00 f7 d8 64 89 01 48
[ 275.671621] RSP: 002b:00007fff8edd7388 EFLAGS: 00000246 ORIG_RAX: 00000000000000bc
[ 275.671621] RAX: ffffffffffffffda RBX: 00005621ae002300 RCX: 00007f82fc9b3dae
[ 275.671621] RDX: 00005621ae002300 RSI: 00007fff8eddae48 RDI: 00007fff8ede9cae
[ 275.671621] RBP: 00007fff8ede9cae R08: 0000000000000000 R09: 0000000000000003
[ 275.671621] R10: 000000000000ea61 R11: 0000000000000246 R12: 00007fff8eddae48
[ 275.671621] R13: 00007fff8edd74b0 R14: 0000000000000000 R15: 0000000000000000
[ 275.671621] </TASK>
[ 275.671621] Modules linked in:
[ 275.683183] ---[ end trace 6c451d27803f53e0 ]—

The corresponding line of code would map to:
J_ASSERT_JH(jh, jh->b_transaction == transaction ||
jh->b_next_transaction == transaction);

More precisely, jh is associated with an actively committing transaction in its disk writing phase (i.e. t_updates already dropped to zero).

After a bit of tracing, we've found that the transaction is restarting when changing a large EA to another large EA, which causes a new EA inode to be allocated and the old inode to be freed. The truncate part of the old inode release sometimes fails to extend current transaction and has to restart it (this stack from test session on LDISKFS, ldiskfs_ prefix can be changed to the ext4_):

mdt03_024-198115 [012] 45670.650452: kernel_stack: <stack trace>
=> trace_event_raw_event_jbd2_handle_start_class (ffffffffc0c7e60c)
=> jbd2__journal_restart (ffffffffc0c75b5c)
=> ldiskfs_datasem_ensure_credits (ffffffffc1ac3431)
=> ldiskfs_ext_rm_leaf (ffffffffc1ac44e8)
=> ldiskfs_ext_remove_space (ffffffffc1ac8240)
=> ldiskfs_ext_truncate (ffffffffc1ac953a)
=> ldiskfs_truncate (ffffffffc1adbdcb)
=> ldiskfs_evict_inode (ffffffffc1adcc71)
=> evict (ffffffff84f37202)
=> ldiskfs_xattr_set_entry (ffffffffc1abcf1e)
=> ldiskfs_xattr_ibody_set (ffffffffc1abd5be)
=> ldiskfs_xattr_set_handle (ffffffffc1abf9e4)
=> ldiskfs_xattr_set (ffffffffc1abfd70)
=> __vfs_setxattr (ffffffff84f431b6)
=> osd_xattr_set (ffffffffc1b7891d)
=> lod_sub_xattr_set (ffffffffc17da152)
=> lod_generate_and_set_lovea (ffffffffc17c7d8c)
=> lod_striped_create (ffffffffc17c81d0)
=> lod_layout_change (ffffffffc17c839b)
=> mdd_layout_change (ffffffffc1850f7d)
=> mdt_layout_change (ffffffffc18aeaf1)
=> mdt_intent_layout (ffffffffc18b5e30)
=> mdt_intent_opc (ffffffffc18ac778)
=> mdt_intent_policy (ffffffffc18b3ba6)
=> ldlm_lock_enqueue (ffffffffc138ffff)
=> ldlm_handle_enqueue0 (ffffffffc13b811f)
=> tgt_enqueue (ffffffffc1441b14)
=> tgt_request_handle (ffffffffc14465cd)
=> ptlrpc_server_handle_request (ffffffffc13ecaea)
=> ptlrpc_main (ffffffffc13f132a)
=> kthread (ffffffff84d043a6)
=> ret_from_fork (ffffffff8560023f)

One problematic part here is that transaction restart enforces current transaction commit so the incomplete transaction will likely commit before the kernel panics. It will cause ext4 CORRUPTION after remount. The reason why the kernel panic is that we restart this transaction somewhere in between of ext4_get_write_access() and ext4_mark_dirty_metadata() so the inode bh sticks in the old transaction:

ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int
name_index,

const char *name, const
void *value, size_t value_len,

int
flags)
...
error = ext4_reserve_inode_write(handle, inode, &is.iloc);
...
error = ext4_xattr_ibody_set(handle, inode, &i, &is);
...
error = ext4_mark_iloc_dirty(handle, inode, &is.iloc);
...
}

We don't have a fix yet and haven't yet decided how to fix this.
Andreas suggested moving final iput for the old EA inode out of transaction. Despite this can work for EXT4, may be problematic with LUSTRE/LDISKFS layering.

Any ideas how to fix this are welcome.

Thanks.

Best regards,
Artem Blagodarenko.