2020-11-11 03:12:15

by Murphy Zhou

[permalink] [raw]
Subject: [fsdax] kernel BUG at fs/ext4/ext4_extents.h:199!

Hi,

A corner case panic starts since the fast-commit merge.

------
kernel BUG at fs/ext4/ext4_extents.h:199!
invalid opcode: 0000 [#1] SMP PTI
CPU: 5 PID: 1184 Comm: t_mmap_dio Tainted: G E 5.10.0-rc3-master-407ab579637c+ #37
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
RIP: 0010:ext4_fc_write_inode_data+0x1d0/0x1e0 [ext4]
Code: 7f 00 00 74 25 66 81 ca 00 80 66 89 54 24 30 e9 62 ff ff ff 4c 89 ff e8 9e 63 4d c1 31 c0 eb 84 b8 83 ff ff ff e9 7a ff ff ff <0f> 0b e8 79 c8 4c c1 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41
RSP: 0018:ffff9bc200d53d88 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000008000
RDX: 0000000000008000 RSI: 0000000000060002 RDI: 000032ee98c0aa2c
RBP: 0000000000008000 R08: ffffffffc0559158 R09: ffffffffc054cdc0
R10: 0000000000008000 R11: 0000000000000000 R12: 0000000000007fff
R13: ffff9bc200d53e6c R14: ffff88d2e0cf3420 R15: ffff88d2e0cf33a8
FS: 00007f20006cab80(0000) GS:ffff88d367140000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2000937000 CR3: 00000003db940002 CR4: 0000000000020ee0
Call Trace:
ext4_fc_commit+0x874/0x900 [ext4]
? file_check_and_advance_wb_err+0x2e/0xc0
ext4_sync_file+0xd4/0x350 [ext4]
__x64_sys_fsync+0x34/0x60
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f20008d1ef7
------

Bisect result shows the first bad commit:
[aa75f4d3daaeb1389b9cce9d6b84401eaf228d4e] ext4: main fast-commit commit path

It's hit by running xfstests generic/605 on pmem ramdisk, fsdax enabled.
I've narrowed down to a simplified reproducer, which requires a pmem ramdisk
been setup first, eg /dev/pmem10. And /root/xfstests-dev is built.

--------------------- Reproducer start -----------
#! /bin/bash
SCRATCH_DEV=/dev/pmem10
SCRATCH_MNT=/daxmnt
SRC_FILE=$SCRATCH_MNT/src
DST_FILE=$SCRATCH_MNT/dst
XFS_IO_PROG=/usr/sbin/xfs_io
tsize=$((128 * 1024 * 1024))
wipefs -af $SCRATCH_DEV > out.full 2>&1
mkfs.ext4 -b 4096 $SCRATCH_DEV >> out.full 2>&1 || exit
mkdir -p $SCRATCH_MNT
mount -t ext4 -o dax=inode,context=system_u:object_r:nfs_t:s0 $SCRATCH_DEV $SCRATCH_MNT || exit
prep_files()
{
rm -f $SRC_FILE $DST_FILE
$XFS_IO_PROG -f -c "falloc 0 $tsize" $SRC_FILE $DST_FILE >> out.full 2>&1 || exit
}
prep_files
# with O_DIRECT first
/root/xfstests-dev/src/t_mmap_dio $SRC_FILE $DST_FILE 1024 "dio both dax" && echo pass dio
prep_files
# again with buffered IO
/root/xfstests-dev/src/t_mmap_dio -b $SRC_FILE $DST_FILE 1024 "buffered both dax" && echo pass bio
umount /daxmnt
--------------------- Reproducer end -----------

The "context=xx" and "dax=inode" mount option are *necessary* to reproduce.

NO panic reproduced if mount with "-o dax" or "dax=always" or "dax=never".

The context value seems irrelevant. Other value can also trigger this panic.
The inode size value seems irrelevant. Tested with 128 and 256, both panic.

Some gdb info and fs info paste below for you ref.

Thanks,
Murphy

------------ gdb info -----------
(gdb) l *(ext4_fc_write_inode_data+0x1d0/0x1e0)
0x64fd0 is in ext4_fc_write_inode_data (fs/ext4/fast_commit.c:787).
782 /*
783 * Writes updated data ranges for the inode in question. Updates CRC.
784 * Returns 0 on success, error otherwise.
785 */
786 static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
787 {
788 ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size;
789 struct ext4_inode_info *ei = EXT4_I(inode);
790 struct ext4_map_blocks map;
791 struct ext4_fc_add_range fc_ext;
(gdb) l *(ext4_fc_write_inode_data+0x1d0)
0x651a0 is in ext4_fc_write_inode_data (fs/ext4/ext4_extents.h:199).
194 }
195
196 static inline void ext4_ext_mark_unwritten(struct ext4_extent *ext)
197 {
198 /* We can not have an unwritten extent of zero length! */
199 BUG_ON((le16_to_cpu(ext->ee_len) & ~EXT_INIT_MAX_LEN) == 0);
200 ext->ee_len |= cpu_to_le16(EXT_INIT_MAX_LEN);
201 }
202
203 static inline int ext4_ext_is_unwritten(struct ext4_extent *ext)
(gdb) l *(ext4_fc_commit+0x874/0x900)
0x67460 is in ext4_fc_commit (fs/ext4/fast_commit.c:1072).
1067 * commit_tid if needed. If it's not possible to perform a fast commit
1068 * due to various reasons, we fall back to full commit. Returns 0
1069 * on success, error otherwise.
1070 */
1071 int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
1072 {
1073 struct super_block *sb = (struct super_block *)(journal->j_private);
1074 struct ext4_sb_info *sbi = EXT4_SB(sb);
1075 int nblks = 0, ret, bsize = journal->j_blocksize;
1076 int subtid = atomic_read(&sbi->s_fc_subtid);
(gdb) l *(ext4_fc_commit+0x874)
0x67cd4 is in ext4_fc_commit (fs/ext4/fast_commit.c:972).
967 */
968 ret = ext4_fc_write_inode(inode, crc);
969 if (ret)
970 goto lock_and_exit;
971
972 ret = ext4_fc_write_inode_data(inode, crc);
973 if (ret)
974 goto lock_and_exit;
975
976 if (!ext4_fc_add_dentry_tlv(
(gdb)
-------------gdb info end ------------

--------------- tune2fs -l outout --------
tune2fs 1.45.5 (07-Jan-2020)
Filesystem volume name: <none>
Last mounted on: <not available>
Filesystem UUID: 281dc870-a1ac-4f0e-85ec-e5cecd3d0f88
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 655360
Block count: 2621440
Reserved block count: 131072
Free blocks: 2554687
Free inodes: 655349
First block: 0
Block size: 4096
Fragment size: 4096
Group descriptor size: 64
Reserved GDT blocks: 1024
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Wed Nov 11 10:20:53 2020
Last mount time: Wed Nov 11 10:20:53 2020
Last write time: Wed Nov 11 10:20:53 2020
Mount count: 1
Maximum mount count: -1
Last checked: Wed Nov 11 10:20:53 2020
Check interval: 0 (<none>)
Lifetime writes: 4137 kB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 32
Desired extra isize: 32
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: b7a0f1fd-a3fa-4669-ab23-3d63df2b16ef
Journal backup: inode blocks
Checksum type: crc32c
Checksum: 0x555540b3
--------------- tune2fs -l output end --------

--------------- mount info ----------
/dev/pmem10 on /daxmnt type ext4 (rw,relatime,context=system_u:object_r:nfs_t:s0,dax=inode,fast_commit)
--------------- mount info end ----------
--
Murphy


2020-11-11 18:09:36

by harshad shirwadkar

[permalink] [raw]
Subject: Re: [fsdax] kernel BUG at fs/ext4/ext4_extents.h:199!

Thanks for the report Murphy. I just realized we have a collision of
mount flags:

#define EXT4_MOUNT2_DAX_NEVER 0x00000008 /* Do not allow
Direct Access */
#define EXT4_MOUNT2_DAX_INODE 0x00000010 /* For printing
options only */

#define EXT4_MOUNT2_EXPLICIT_JOURNAL_CHECKSUM 0x00000008 /* User explicitly
specified journal checksum */

#define EXT4_MOUNT2_JOURNAL_FAST_COMMIT 0x00000010 /* Journal fast commit */

So whenever dax=inode is turned on, fast commit also gets turned on.
I'll send out a fix for this soon.

Thanks,
Harshad

On Tue, Nov 10, 2020 at 7:10 PM Murphy Zhou <[email protected]> wrote:
>
> Hi,
>
> A corner case panic starts since the fast-commit merge.
>
> ------
> kernel BUG at fs/ext4/ext4_extents.h:199!
> invalid opcode: 0000 [#1] SMP PTI
> CPU: 5 PID: 1184 Comm: t_mmap_dio Tainted: G E 5.10.0-rc3-master-407ab579637c+ #37
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
> RIP: 0010:ext4_fc_write_inode_data+0x1d0/0x1e0 [ext4]
> Code: 7f 00 00 74 25 66 81 ca 00 80 66 89 54 24 30 e9 62 ff ff ff 4c 89 ff e8 9e 63 4d c1 31 c0 eb 84 b8 83 ff ff ff e9 7a ff ff ff <0f> 0b e8 79 c8 4c c1 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41
> RSP: 0018:ffff9bc200d53d88 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000008000
> RDX: 0000000000008000 RSI: 0000000000060002 RDI: 000032ee98c0aa2c
> RBP: 0000000000008000 R08: ffffffffc0559158 R09: ffffffffc054cdc0
> R10: 0000000000008000 R11: 0000000000000000 R12: 0000000000007fff
> R13: ffff9bc200d53e6c R14: ffff88d2e0cf3420 R15: ffff88d2e0cf33a8
> FS: 00007f20006cab80(0000) GS:ffff88d367140000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f2000937000 CR3: 00000003db940002 CR4: 0000000000020ee0
> Call Trace:
> ext4_fc_commit+0x874/0x900 [ext4]
> ? file_check_and_advance_wb_err+0x2e/0xc0
> ext4_sync_file+0xd4/0x350 [ext4]
> __x64_sys_fsync+0x34/0x60
> do_syscall_64+0x33/0x40
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7f20008d1ef7
> ------
>
> Bisect result shows the first bad commit:
> [aa75f4d3daaeb1389b9cce9d6b84401eaf228d4e] ext4: main fast-commit commit path
>
> It's hit by running xfstests generic/605 on pmem ramdisk, fsdax enabled.
> I've narrowed down to a simplified reproducer, which requires a pmem ramdisk
> been setup first, eg /dev/pmem10. And /root/xfstests-dev is built.
>
> --------------------- Reproducer start -----------
> #! /bin/bash
> SCRATCH_DEV=/dev/pmem10
> SCRATCH_MNT=/daxmnt
> SRC_FILE=$SCRATCH_MNT/src
> DST_FILE=$SCRATCH_MNT/dst
> XFS_IO_PROG=/usr/sbin/xfs_io
> tsize=$((128 * 1024 * 1024))
> wipefs -af $SCRATCH_DEV > out.full 2>&1
> mkfs.ext4 -b 4096 $SCRATCH_DEV >> out.full 2>&1 || exit
> mkdir -p $SCRATCH_MNT
> mount -t ext4 -o dax=inode,context=system_u:object_r:nfs_t:s0 $SCRATCH_DEV $SCRATCH_MNT || exit
> prep_files()
> {
> rm -f $SRC_FILE $DST_FILE
> $XFS_IO_PROG -f -c "falloc 0 $tsize" $SRC_FILE $DST_FILE >> out.full 2>&1 || exit
> }
> prep_files
> # with O_DIRECT first
> /root/xfstests-dev/src/t_mmap_dio $SRC_FILE $DST_FILE 1024 "dio both dax" && echo pass dio
> prep_files
> # again with buffered IO
> /root/xfstests-dev/src/t_mmap_dio -b $SRC_FILE $DST_FILE 1024 "buffered both dax" && echo pass bio
> umount /daxmnt
> --------------------- Reproducer end -----------
>
> The "context=xx" and "dax=inode" mount option are *necessary* to reproduce.
>
> NO panic reproduced if mount with "-o dax" or "dax=always" or "dax=never".
>
> The context value seems irrelevant. Other value can also trigger this panic.
> The inode size value seems irrelevant. Tested with 128 and 256, both panic.
>
> Some gdb info and fs info paste below for you ref.
>
> Thanks,
> Murphy
>
> ------------ gdb info -----------
> (gdb) l *(ext4_fc_write_inode_data+0x1d0/0x1e0)
> 0x64fd0 is in ext4_fc_write_inode_data (fs/ext4/fast_commit.c:787).
> 782 /*
> 783 * Writes updated data ranges for the inode in question. Updates CRC.
> 784 * Returns 0 on success, error otherwise.
> 785 */
> 786 static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
> 787 {
> 788 ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size;
> 789 struct ext4_inode_info *ei = EXT4_I(inode);
> 790 struct ext4_map_blocks map;
> 791 struct ext4_fc_add_range fc_ext;
> (gdb) l *(ext4_fc_write_inode_data+0x1d0)
> 0x651a0 is in ext4_fc_write_inode_data (fs/ext4/ext4_extents.h:199).
> 194 }
> 195
> 196 static inline void ext4_ext_mark_unwritten(struct ext4_extent *ext)
> 197 {
> 198 /* We can not have an unwritten extent of zero length! */
> 199 BUG_ON((le16_to_cpu(ext->ee_len) & ~EXT_INIT_MAX_LEN) == 0);
> 200 ext->ee_len |= cpu_to_le16(EXT_INIT_MAX_LEN);
> 201 }
> 202
> 203 static inline int ext4_ext_is_unwritten(struct ext4_extent *ext)
> (gdb) l *(ext4_fc_commit+0x874/0x900)
> 0x67460 is in ext4_fc_commit (fs/ext4/fast_commit.c:1072).
> 1067 * commit_tid if needed. If it's not possible to perform a fast commit
> 1068 * due to various reasons, we fall back to full commit. Returns 0
> 1069 * on success, error otherwise.
> 1070 */
> 1071 int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
> 1072 {
> 1073 struct super_block *sb = (struct super_block *)(journal->j_private);
> 1074 struct ext4_sb_info *sbi = EXT4_SB(sb);
> 1075 int nblks = 0, ret, bsize = journal->j_blocksize;
> 1076 int subtid = atomic_read(&sbi->s_fc_subtid);
> (gdb) l *(ext4_fc_commit+0x874)
> 0x67cd4 is in ext4_fc_commit (fs/ext4/fast_commit.c:972).
> 967 */
> 968 ret = ext4_fc_write_inode(inode, crc);
> 969 if (ret)
> 970 goto lock_and_exit;
> 971
> 972 ret = ext4_fc_write_inode_data(inode, crc);
> 973 if (ret)
> 974 goto lock_and_exit;
> 975
> 976 if (!ext4_fc_add_dentry_tlv(
> (gdb)
> -------------gdb info end ------------
>
> --------------- tune2fs -l outout --------
> tune2fs 1.45.5 (07-Jan-2020)
> Filesystem volume name: <none>
> Last mounted on: <not available>
> Filesystem UUID: 281dc870-a1ac-4f0e-85ec-e5cecd3d0f88
> Filesystem magic number: 0xEF53
> Filesystem revision #: 1 (dynamic)
> Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
> Filesystem flags: signed_directory_hash
> Default mount options: user_xattr acl
> Filesystem state: clean
> Errors behavior: Continue
> Filesystem OS type: Linux
> Inode count: 655360
> Block count: 2621440
> Reserved block count: 131072
> Free blocks: 2554687
> Free inodes: 655349
> First block: 0
> Block size: 4096
> Fragment size: 4096
> Group descriptor size: 64
> Reserved GDT blocks: 1024
> Blocks per group: 32768
> Fragments per group: 32768
> Inodes per group: 8192
> Inode blocks per group: 512
> Flex block group size: 16
> Filesystem created: Wed Nov 11 10:20:53 2020
> Last mount time: Wed Nov 11 10:20:53 2020
> Last write time: Wed Nov 11 10:20:53 2020
> Mount count: 1
> Maximum mount count: -1
> Last checked: Wed Nov 11 10:20:53 2020
> Check interval: 0 (<none>)
> Lifetime writes: 4137 kB
> Reserved blocks uid: 0 (user root)
> Reserved blocks gid: 0 (group root)
> First inode: 11
> Inode size: 256
> Required extra isize: 32
> Desired extra isize: 32
> Journal inode: 8
> Default directory hash: half_md4
> Directory Hash Seed: b7a0f1fd-a3fa-4669-ab23-3d63df2b16ef
> Journal backup: inode blocks
> Checksum type: crc32c
> Checksum: 0x555540b3
> --------------- tune2fs -l output end --------
>
> --------------- mount info ----------
> /dev/pmem10 on /daxmnt type ext4 (rw,relatime,context=system_u:object_r:nfs_t:s0,dax=inode,fast_commit)
> --------------- mount info end ----------
> --
> Murphy