I've discovered that generic/269 will trigger a BUG_ON on line 5070 in
ext4_mb_new_inode_pa when running kvm-xfstests on the 1k test case
with a kernel built from the current ext4 dev branch. After hitting the
BUG_ON, the kernel then reports persistent soft lockups. I mentioned this in
today's concall, and Ted confirmed the current dev branch should reflect
what's upstream at this time.
This test reproduces for me 5 to 10% of the time, but reliably enough - I
typically don't need more than 25 trials to see the failure, and 10 often
suffices. (I haven't yet tried the 4k test case, but will do so.)
The failure bisects to:
7e170922f06b ("ext4: Add allocation criteria 1.5 (CR1_5)")
Trace follows.
Eric
generic/269 24s ... [21:41:11][ 284.208474] run fstests generic/269 at 2023-07-03 21:41:11
[ 284.511484] EXT4-fs (vdc): mounted filesystem 2b1fbdd6-2724-47bc-b7b5-f4a73c9f19be r/w with ordered data mode. Quota mode: none.
[ 284.950657] ------------[ cut here ]------------
[ 284.950901] kernel BUG at fs/ext4/mballoc.c:5070!
[ 284.951104] invalid opcode: 0000 [#1] PREEMPT SMP
[ 284.951296] CPU: 0 PID: 12039 Comm: fsstress Not tainted 6.4.0-rc5+ #6
[ 284.951567] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
[ 284.951900] RIP: 0010:ext4_mb_new_inode_pa+0x2a6/0x2c0
[ 284.952124] Code: b5 7e 0f 85 b5 fe ff ff 0f 1f 44 00 00 e9 ab fe ff ff e8 9d 56 d3 ff 84 c0 0f 85 b5 fe ff ff 0f 0b e9 ae fe ff ff 0f 0b 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 4c 89 c1 31 c0 e9 42 ff ff ff 0f 1f 84 00
[ 284.952891] RSP: 0018:ffffc90004053970 EFLAGS: 00010a87
[ 284.953124] RAX: 0000000000000002 RBX: ffff8880342d0720 RCX: 0000000000000001
[ 284.953420] RDX: 0000000000004000 RSI: 00001e4000000000 RDI: ffff8880342d0720
[ 284.953720] RBP: ffffc90004053a00 R08: ffff88800a5fc000 R09: 0000000000000000
[ 284.954020] R10: ffff888007964a98 R11: 0000000000000002 R12: 0000000000000003
[ 284.954321] R13: ffff8880342d0720 R14: ffff88800a5fc000 R15: ffff88800abfc000
[ 284.954610] FS: 00007f3d9db07740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
[ 284.954923] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 284.955148] CR2: 000055b54ce0fde8 CR3: 0000000006730006 CR4: 0000000000770ef0
[ 284.955443] PKRU: 55555554
[ 284.955559] Call Trace:
[ 284.955669] <TASK>
[ 284.955760] ? die+0x33/0x90
[ 284.955887] ? do_trap+0xe0/0x110
[ 284.956031] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
[ 284.956223] ? do_error_trap+0x65/0x80
[ 284.956385] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
[ 284.956575] ? exc_invalid_op+0x4b/0x70
[ 284.956738] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
[ 284.956929] ? asm_exc_invalid_op+0x16/0x20
[ 284.957112] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
[ 284.957305] ext4_mb_complex_scan_group+0x2e0/0x3e0
[ 284.957512] ext4_mb_regular_allocator+0x3be/0xd80
[ 284.957716] ext4_mb_new_blocks+0x9dc/0x1040
[ 284.957895] ? __kmalloc+0xca/0x150
[ 284.958038] ? ext4_find_extent+0x3ec/0x450
[ 284.958204] ? _raw_write_unlock+0x29/0x50
[ 284.958369] ext4_ext_map_blocks+0x9a4/0x19d0
[ 284.958543] ? __kmem_cache_free+0x17d/0x2e0
[ 284.958723] ? find_held_lock+0x2b/0x80
[ 284.958889] ext4_map_blocks+0x230/0x5d0
[ 284.959056] ? lock_release+0x139/0x280
[ 284.959222] ext4_getblk+0x7b/0x2d0
[ 284.959369] ext4_bread+0xc/0x70
[ 284.959510] ext4_append+0x8d/0x190
[ 284.959665] ext4_init_new_dir+0xd5/0x1b0
[ 284.959835] ext4_mkdir+0x192/0x340
[ 284.959987] vfs_mkdir+0x98/0x140
[ 284.960133] do_mkdirat+0x131/0x160
[ 284.960285] __x64_sys_mkdir+0x48/0x70
[ 284.960445] do_syscall_64+0x38/0x90
[ 284.960600] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 284.960814] RIP: 0033:0x7f3d9dbf8b07
[ 284.960967] Code: 1f 40 00 48 8b 05 89 f3 0c 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 f3 0c 00 f7 d8 64 89 01 48
[ 284.961741] RSP: 002b:00007ffcd5131098 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
[ 284.962075] RAX: ffffffffffffffda RBX: 00007ffcd5131200 RCX: 00007f3d9dbf8b07
[ 284.962363] RDX: 0000000000000000 RSI: 00000000000001ff RDI: 000055b54cdac240
[ 284.962647] RBP: 00000000000001ff R08: 0000000000000001 R09: 0000000000000003
[ 284.962928] R10: 00007ffcd5130d16 R11: 0000000000000206 R12: 00000000000000cb
[ 284.963209] R13: 8f5c28f5c28f5c29 R14: 000055b54c8ec660 R15: 00000000000000cb
[ 284.963492] </TASK>
[ 284.963586] Modules linked in:
[ 284.963730] ---[ end trace 0000000000000000 ]---
Eric Whitney <[email protected]> writes:
> I've discovered that generic/269 will trigger a BUG_ON on line 5070 in
Can you confirm in your tree what is line 5070 out of the two?
BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
BUG_ON(ac->ac_pa == NULL);
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/tree/fs/ext4/mballoc.c?h=dev&id=ab8627e104696b8c1c6953ad5255def5b0821e06#n5070
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/tree/fs/ext4/mballoc.c?h=dev#n5070
Based on which tree you tested, it could differ I guess.
I am assuming it is...
BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
> ext4_mb_new_inode_pa when running kvm-xfstests on the 1k test case
> with a kernel built from the current ext4 dev branch. After hitting the
> BUG_ON, the kernel then reports persistent soft lockups. I mentioned this in
> today's concall, and Ted confirmed the current dev branch should reflect
> what's upstream at this time.
>
> This test reproduces for me 5 to 10% of the time, but reliably enough - I
> typically don't need more than 25 trials to see the failure, and 10 often
> suffices. (I haven't yet tried the 4k test case, but will do so.)
>
> The failure bisects to:
> 7e170922f06b ("ext4: Add allocation criteria 1.5 (CR1_5)")
>
Thanks for the bisection.
> Trace follows.
>
> Eric
>
>
> generic/269 24s ... [21:41:11][ 284.208474] run fstests generic/269 at 2023-07-03 21:41:11
> [ 284.511484] EXT4-fs (vdc): mounted filesystem 2b1fbdd6-2724-47bc-b7b5-f4a73c9f19be r/w with ordered data mode. Quota mode: none.
> [ 284.950657] ------------[ cut here ]------------
> [ 284.950901] kernel BUG at fs/ext4/mballoc.c:5070!
> [ 284.951104] invalid opcode: 0000 [#1] PREEMPT SMP
> [ 284.951296] CPU: 0 PID: 12039 Comm: fsstress Not tainted 6.4.0-rc5+ #6
> [ 284.951567] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
> [ 284.951900] RIP: 0010:ext4_mb_new_inode_pa+0x2a6/0x2c0
> [ 284.952124] Code: b5 7e 0f 85 b5 fe ff ff 0f 1f 44 00 00 e9 ab fe ff ff e8 9d 56 d3 ff 84 c0 0f 85 b5 fe ff ff 0f 0b e9 ae fe ff ff 0f 0b 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 4c 89 c1 31 c0 e9 42 ff ff ff 0f 1f 84 00
> [ 284.952891] RSP: 0018:ffffc90004053970 EFLAGS: 00010a87
> [ 284.953124] RAX: 0000000000000002 RBX: ffff8880342d0720 RCX: 0000000000000001
> [ 284.953420] RDX: 0000000000004000 RSI: 00001e4000000000 RDI: ffff8880342d0720
> [ 284.953720] RBP: ffffc90004053a00 R08: ffff88800a5fc000 R09: 0000000000000000
> [ 284.954020] R10: ffff888007964a98 R11: 0000000000000002 R12: 0000000000000003
> [ 284.954321] R13: ffff8880342d0720 R14: ffff88800a5fc000 R15: ffff88800abfc000
> [ 284.954610] FS: 00007f3d9db07740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
> [ 284.954923] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 284.955148] CR2: 000055b54ce0fde8 CR3: 0000000006730006 CR4: 0000000000770ef0
> [ 284.955443] PKRU: 55555554
> [ 284.955559] Call Trace:
> [ 284.955669] <TASK>
> [ 284.955760] ? die+0x33/0x90
> [ 284.955887] ? do_trap+0xe0/0x110
> [ 284.956031] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
> [ 284.956223] ? do_error_trap+0x65/0x80
> [ 284.956385] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
> [ 284.956575] ? exc_invalid_op+0x4b/0x70
> [ 284.956738] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
> [ 284.956929] ? asm_exc_invalid_op+0x16/0x20
> [ 284.957112] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
> [ 284.957305] ext4_mb_complex_scan_group+0x2e0/0x3e0
> [ 284.957512] ext4_mb_regular_allocator+0x3be/0xd80
> [ 284.957716] ext4_mb_new_blocks+0x9dc/0x1040
> [ 284.957895] ? __kmalloc+0xca/0x150
> [ 284.958038] ? ext4_find_extent+0x3ec/0x450
> [ 284.958204] ? _raw_write_unlock+0x29/0x50
> [ 284.958369] ext4_ext_map_blocks+0x9a4/0x19d0
> [ 284.958543] ? __kmem_cache_free+0x17d/0x2e0
> [ 284.958723] ? find_held_lock+0x2b/0x80
> [ 284.958889] ext4_map_blocks+0x230/0x5d0
> [ 284.959056] ? lock_release+0x139/0x280
> [ 284.959222] ext4_getblk+0x7b/0x2d0
> [ 284.959369] ext4_bread+0xc/0x70
> [ 284.959510] ext4_append+0x8d/0x190
> [ 284.959665] ext4_init_new_dir+0xd5/0x1b0
> [ 284.959835] ext4_mkdir+0x192/0x340
hmm.. looks like a allocation request for a directory inode.
> [ 284.959987] vfs_mkdir+0x98/0x140
> [ 284.960133] do_mkdirat+0x131/0x160
> [ 284.960285] __x64_sys_mkdir+0x48/0x70
> [ 284.960445] do_syscall_64+0x38/0x90
> [ 284.960600] entry_SYSCALL_64_after_hwframe+0x72/0xdc
> [ 284.960814] RIP: 0033:0x7f3d9dbf8b07
> [ 284.960967] Code: 1f 40 00 48 8b 05 89 f3 0c 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 f3 0c 00 f7 d8 64 89 01 48
> [ 284.961741] RSP: 002b:00007ffcd5131098 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
> [ 284.962075] RAX: ffffffffffffffda RBX: 00007ffcd5131200 RCX: 00007f3d9dbf8b07
> [ 284.962363] RDX: 0000000000000000 RSI: 00000000000001ff RDI: 000055b54cdac240
> [ 284.962647] RBP: 00000000000001ff R08: 0000000000000001 R09: 0000000000000003
> [ 284.962928] R10: 00007ffcd5130d16 R11: 0000000000000206 R12: 00000000000000cb
> [ 284.963209] R13: 8f5c28f5c28f5c29 R14: 000055b54c8ec660 R15: 00000000000000cb
> [ 284.963492] </TASK>
> [ 284.963586] Modules linked in:
> [ 284.963730] ---[ end trace 0000000000000000 ]---
@Ojaswin,
I was looking at the code. In function ext4_mb_choose_next_group_best_avail(),
we use fls() on goal len to calculate "order". But we never subtract 1
from it. Then we set the goal len based on this "order". This might make
ac_g_ex.fe_len > ac_o_ex.fe_len in some cases where we really don't want
that (like the current case).
You have added some comments there, so I was not sure if that was
intentional.
Now, IIUC, the overall concept of cr_1.5 is to trim the max len order
from goal len to something which is still larger than original length.
But this is only valid for regular files allocation request. Because we don't
normalize the request length for non-regular files. See
ext4_mb_normalize_request() function. As I also see from the current
bug_on, the request was for dir inode I guess.
Although, I still think we should check the function logic in
ext4_mb_choose_next_group_best_avail(), but either ways I guess we don't
want to use CR_BEST_AVAIL_LEN criteria for non-regular files right,
given we anyways don't normalize the allocation request len for such files.
In that case do you think below diff make sense?
mballoc: Don't use CR_BEST_AVAIL_LEN for non-regular files
Using CR_BEST_AVAIL_LEN only make sense for regular files, as for
non-regular files we never normalize the allocation request length i.e.
goal len is same as original length (ac_g_ex.fe_len == ac_o_ex.fe_len).
Hence there is no scope of trimming the goal length such that it can
still satisfy original request len. Thus this patch avoids using
CR_BEST_AVAIL_LEN criteria for non-regular files request.
---
fs/ext4/mballoc.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a2475b8c9fb5..5fbbd7344456 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -974,7 +974,19 @@ static void ext4_mb_choose_next_group_goal_fast(struct ext4_allocation_context *
*group = grp->bb_group;
ac->ac_flags |= EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED;
} else {
- *new_cr = CR_BEST_AVAIL_LEN;
+ /*
+ * CR_BEST_AVAIL_LEN works based on the concept that we have
+ * a larger normalized goal len request which can be trimmed to
+ * a smaller goal len such that it can still satisfy original
+ * request len. However, allocation request for non-regular
+ * files never gets normalized.
+ * See function ext4_mb_normalize_request() (EXT4_MB_HINT_DATA).
+ */
+ if ((ac->ac_criteria & EXT4_MB_HINT_DATA))
+ *new_cr = CR_BEST_AVAIL_LEN;
+ else
+ *new_cr = CR_GOAL_LEN_SLOW;
+
}
}
--
2.30.2
-ritesh
* Ritesh Harjani <[email protected]>:
> Eric Whitney <[email protected]> writes:
>
> > I've discovered that generic/269 will trigger a BUG_ON on line 5070 in
>
> Can you confirm in your tree what is line 5070 out of the two?
>
> BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
> BUG_ON(ac->ac_pa == NULL);
Hi Ritesh:
The second line above. That's line 5070 in mballoc.c after cloning Ted's ext4
tree and checking out the dev branch, which is how I built the test kernel.
Also, I've now run generic/269 500 times using the same kernel on the 4k
test case without failures. So, the failure may be limited to 1K block file
systems (note test appliance is x86_64, block size < page size).
Eric
>
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/tree/fs/ext4/mballoc.c?h=dev&id=ab8627e104696b8c1c6953ad5255def5b0821e06#n5070
> [2]: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/tree/fs/ext4/mballoc.c?h=dev#n5070
>
> Based on which tree you tested, it could differ I guess.
>
> I am assuming it is...
> BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
>
> > ext4_mb_new_inode_pa when running kvm-xfstests on the 1k test case
> > with a kernel built from the current ext4 dev branch. After hitting the
> > BUG_ON, the kernel then reports persistent soft lockups. I mentioned this in
> > today's concall, and Ted confirmed the current dev branch should reflect
> > what's upstream at this time.
> >
> > This test reproduces for me 5 to 10% of the time, but reliably enough - I
> > typically don't need more than 25 trials to see the failure, and 10 often
> > suffices. (I haven't yet tried the 4k test case, but will do so.)
> >
> > The failure bisects to:
> > 7e170922f06b ("ext4: Add allocation criteria 1.5 (CR1_5)")
> >
>
> Thanks for the bisection.
>
> > Trace follows.
> >
> > Eric
> >
> >
> > generic/269 24s ... [21:41:11][ 284.208474] run fstests generic/269 at 2023-07-03 21:41:11
> > [ 284.511484] EXT4-fs (vdc): mounted filesystem 2b1fbdd6-2724-47bc-b7b5-f4a73c9f19be r/w with ordered data mode. Quota mode: none.
> > [ 284.950657] ------------[ cut here ]------------
> > [ 284.950901] kernel BUG at fs/ext4/mballoc.c:5070!
> > [ 284.951104] invalid opcode: 0000 [#1] PREEMPT SMP
> > [ 284.951296] CPU: 0 PID: 12039 Comm: fsstress Not tainted 6.4.0-rc5+ #6
> > [ 284.951567] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
> > [ 284.951900] RIP: 0010:ext4_mb_new_inode_pa+0x2a6/0x2c0
> > [ 284.952124] Code: b5 7e 0f 85 b5 fe ff ff 0f 1f 44 00 00 e9 ab fe ff ff e8 9d 56 d3 ff 84 c0 0f 85 b5 fe ff ff 0f 0b e9 ae fe ff ff 0f 0b 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 4c 89 c1 31 c0 e9 42 ff ff ff 0f 1f 84 00
> > [ 284.952891] RSP: 0018:ffffc90004053970 EFLAGS: 00010a87
> > [ 284.953124] RAX: 0000000000000002 RBX: ffff8880342d0720 RCX: 0000000000000001
> > [ 284.953420] RDX: 0000000000004000 RSI: 00001e4000000000 RDI: ffff8880342d0720
> > [ 284.953720] RBP: ffffc90004053a00 R08: ffff88800a5fc000 R09: 0000000000000000
> > [ 284.954020] R10: ffff888007964a98 R11: 0000000000000002 R12: 0000000000000003
> > [ 284.954321] R13: ffff8880342d0720 R14: ffff88800a5fc000 R15: ffff88800abfc000
> > [ 284.954610] FS: 00007f3d9db07740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
> > [ 284.954923] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 284.955148] CR2: 000055b54ce0fde8 CR3: 0000000006730006 CR4: 0000000000770ef0
> > [ 284.955443] PKRU: 55555554
> > [ 284.955559] Call Trace:
> > [ 284.955669] <TASK>
> > [ 284.955760] ? die+0x33/0x90
> > [ 284.955887] ? do_trap+0xe0/0x110
> > [ 284.956031] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
> > [ 284.956223] ? do_error_trap+0x65/0x80
> > [ 284.956385] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
> > [ 284.956575] ? exc_invalid_op+0x4b/0x70
> > [ 284.956738] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
> > [ 284.956929] ? asm_exc_invalid_op+0x16/0x20
> > [ 284.957112] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
> > [ 284.957305] ext4_mb_complex_scan_group+0x2e0/0x3e0
> > [ 284.957512] ext4_mb_regular_allocator+0x3be/0xd80
> > [ 284.957716] ext4_mb_new_blocks+0x9dc/0x1040
> > [ 284.957895] ? __kmalloc+0xca/0x150
> > [ 284.958038] ? ext4_find_extent+0x3ec/0x450
> > [ 284.958204] ? _raw_write_unlock+0x29/0x50
> > [ 284.958369] ext4_ext_map_blocks+0x9a4/0x19d0
> > [ 284.958543] ? __kmem_cache_free+0x17d/0x2e0
> > [ 284.958723] ? find_held_lock+0x2b/0x80
> > [ 284.958889] ext4_map_blocks+0x230/0x5d0
> > [ 284.959056] ? lock_release+0x139/0x280
> > [ 284.959222] ext4_getblk+0x7b/0x2d0
> > [ 284.959369] ext4_bread+0xc/0x70
> > [ 284.959510] ext4_append+0x8d/0x190
> > [ 284.959665] ext4_init_new_dir+0xd5/0x1b0
> > [ 284.959835] ext4_mkdir+0x192/0x340
>
> hmm.. looks like a allocation request for a directory inode.
>
> > [ 284.959987] vfs_mkdir+0x98/0x140
> > [ 284.960133] do_mkdirat+0x131/0x160
> > [ 284.960285] __x64_sys_mkdir+0x48/0x70
> > [ 284.960445] do_syscall_64+0x38/0x90
> > [ 284.960600] entry_SYSCALL_64_after_hwframe+0x72/0xdc
> > [ 284.960814] RIP: 0033:0x7f3d9dbf8b07
> > [ 284.960967] Code: 1f 40 00 48 8b 05 89 f3 0c 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 f3 0c 00 f7 d8 64 89 01 48
> > [ 284.961741] RSP: 002b:00007ffcd5131098 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
> > [ 284.962075] RAX: ffffffffffffffda RBX: 00007ffcd5131200 RCX: 00007f3d9dbf8b07
> > [ 284.962363] RDX: 0000000000000000 RSI: 00000000000001ff RDI: 000055b54cdac240
> > [ 284.962647] RBP: 00000000000001ff R08: 0000000000000001 R09: 0000000000000003
> > [ 284.962928] R10: 00007ffcd5130d16 R11: 0000000000000206 R12: 00000000000000cb
> > [ 284.963209] R13: 8f5c28f5c28f5c29 R14: 000055b54c8ec660 R15: 00000000000000cb
> > [ 284.963492] </TASK>
> > [ 284.963586] Modules linked in:
> > [ 284.963730] ---[ end trace 0000000000000000 ]---
>
>
> @Ojaswin,
>
> I was looking at the code. In function ext4_mb_choose_next_group_best_avail(),
> we use fls() on goal len to calculate "order". But we never subtract 1
> from it. Then we set the goal len based on this "order". This might make
> ac_g_ex.fe_len > ac_o_ex.fe_len in some cases where we really don't want
> that (like the current case).
> You have added some comments there, so I was not sure if that was
> intentional.
>
> Now, IIUC, the overall concept of cr_1.5 is to trim the max len order
> from goal len to something which is still larger than original length.
> But this is only valid for regular files allocation request. Because we don't
> normalize the request length for non-regular files. See
> ext4_mb_normalize_request() function. As I also see from the current
> bug_on, the request was for dir inode I guess.
>
> Although, I still think we should check the function logic in
> ext4_mb_choose_next_group_best_avail(), but either ways I guess we don't
> want to use CR_BEST_AVAIL_LEN criteria for non-regular files right,
> given we anyways don't normalize the allocation request len for such files.
>
> In that case do you think below diff make sense?
>
>
> mballoc: Don't use CR_BEST_AVAIL_LEN for non-regular files
>
> Using CR_BEST_AVAIL_LEN only make sense for regular files, as for
> non-regular files we never normalize the allocation request length i.e.
> goal len is same as original length (ac_g_ex.fe_len == ac_o_ex.fe_len).
>
> Hence there is no scope of trimming the goal length such that it can
> still satisfy original request len. Thus this patch avoids using
> CR_BEST_AVAIL_LEN criteria for non-regular files request.
>
> ---
> fs/ext4/mballoc.c | 14 +++++++++++++-
> 1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index a2475b8c9fb5..5fbbd7344456 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -974,7 +974,19 @@ static void ext4_mb_choose_next_group_goal_fast(struct ext4_allocation_context *
> *group = grp->bb_group;
> ac->ac_flags |= EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED;
> } else {
> - *new_cr = CR_BEST_AVAIL_LEN;
> + /*
> + * CR_BEST_AVAIL_LEN works based on the concept that we have
> + * a larger normalized goal len request which can be trimmed to
> + * a smaller goal len such that it can still satisfy original
> + * request len. However, allocation request for non-regular
> + * files never gets normalized.
> + * See function ext4_mb_normalize_request() (EXT4_MB_HINT_DATA).
> + */
> + if ((ac->ac_criteria & EXT4_MB_HINT_DATA))
> + *new_cr = CR_BEST_AVAIL_LEN;
> + else
> + *new_cr = CR_GOAL_LEN_SLOW;
> +
> }
> }
>
> --
> 2.30.2
>
> -ritesh
Ritesh Harjani (IBM) <[email protected]> writes:
> Eric Whitney <[email protected]> writes:
>
>> I've discovered that generic/269 will trigger a BUG_ON on line 5070 in
>
> Can you confirm in your tree what is line 5070 out of the two?
>
> BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
> BUG_ON(ac->ac_pa == NULL);
>
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/tree/fs/ext4/mballoc.c?h=dev&id=ab8627e104696b8c1c6953ad5255def5b0821e06#n5070
> [2]: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/tree/fs/ext4/mballoc.c?h=dev#n5070
>
> Based on which tree you tested, it could differ I guess.
>
> I am assuming it is...
> BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
>
>> ext4_mb_new_inode_pa when running kvm-xfstests on the 1k test case
>> with a kernel built from the current ext4 dev branch. After hitting the
>> BUG_ON, the kernel then reports persistent soft lockups. I mentioned this in
>> today's concall, and Ted confirmed the current dev branch should reflect
>> what's upstream at this time.
>>
>> This test reproduces for me 5 to 10% of the time, but reliably enough - I
>> typically don't need more than 25 trials to see the failure, and 10 often
>> suffices. (I haven't yet tried the 4k test case, but will do so.)
>>
>> The failure bisects to:
>> 7e170922f06b ("ext4: Add allocation criteria 1.5 (CR1_5)")
>>
>
> Thanks for the bisection.
>
>> Trace follows.
>>
>> Eric
>>
>>
>> generic/269 24s ... [21:41:11][ 284.208474] run fstests generic/269 at 2023-07-03 21:41:11
>> [ 284.511484] EXT4-fs (vdc): mounted filesystem 2b1fbdd6-2724-47bc-b7b5-f4a73c9f19be r/w with ordered data mode. Quota mode: none.
>> [ 284.950657] ------------[ cut here ]------------
>> [ 284.950901] kernel BUG at fs/ext4/mballoc.c:5070!
>> [ 284.951104] invalid opcode: 0000 [#1] PREEMPT SMP
>> [ 284.951296] CPU: 0 PID: 12039 Comm: fsstress Not tainted 6.4.0-rc5+ #6
>> [ 284.951567] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
>> [ 284.951900] RIP: 0010:ext4_mb_new_inode_pa+0x2a6/0x2c0
>> [ 284.952124] Code: b5 7e 0f 85 b5 fe ff ff 0f 1f 44 00 00 e9 ab fe ff ff e8 9d 56 d3 ff 84 c0 0f 85 b5 fe ff ff 0f 0b e9 ae fe ff ff 0f 0b 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 4c 89 c1 31 c0 e9 42 ff ff ff 0f 1f 84 00
>> [ 284.952891] RSP: 0018:ffffc90004053970 EFLAGS: 00010a87
>> [ 284.953124] RAX: 0000000000000002 RBX: ffff8880342d0720 RCX: 0000000000000001
>> [ 284.953420] RDX: 0000000000004000 RSI: 00001e4000000000 RDI: ffff8880342d0720
>> [ 284.953720] RBP: ffffc90004053a00 R08: ffff88800a5fc000 R09: 0000000000000000
>> [ 284.954020] R10: ffff888007964a98 R11: 0000000000000002 R12: 0000000000000003
>> [ 284.954321] R13: ffff8880342d0720 R14: ffff88800a5fc000 R15: ffff88800abfc000
>> [ 284.954610] FS: 00007f3d9db07740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
>> [ 284.954923] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 284.955148] CR2: 000055b54ce0fde8 CR3: 0000000006730006 CR4: 0000000000770ef0
>> [ 284.955443] PKRU: 55555554
>> [ 284.955559] Call Trace:
>> [ 284.955669] <TASK>
>> [ 284.955760] ? die+0x33/0x90
>> [ 284.955887] ? do_trap+0xe0/0x110
>> [ 284.956031] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
>> [ 284.956223] ? do_error_trap+0x65/0x80
>> [ 284.956385] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
>> [ 284.956575] ? exc_invalid_op+0x4b/0x70
>> [ 284.956738] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
>> [ 284.956929] ? asm_exc_invalid_op+0x16/0x20
>> [ 284.957112] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
>> [ 284.957305] ext4_mb_complex_scan_group+0x2e0/0x3e0
>> [ 284.957512] ext4_mb_regular_allocator+0x3be/0xd80
>> [ 284.957716] ext4_mb_new_blocks+0x9dc/0x1040
>> [ 284.957895] ? __kmalloc+0xca/0x150
>> [ 284.958038] ? ext4_find_extent+0x3ec/0x450
>> [ 284.958204] ? _raw_write_unlock+0x29/0x50
>> [ 284.958369] ext4_ext_map_blocks+0x9a4/0x19d0
>> [ 284.958543] ? __kmem_cache_free+0x17d/0x2e0
>> [ 284.958723] ? find_held_lock+0x2b/0x80
>> [ 284.958889] ext4_map_blocks+0x230/0x5d0
>> [ 284.959056] ? lock_release+0x139/0x280
>> [ 284.959222] ext4_getblk+0x7b/0x2d0
>> [ 284.959369] ext4_bread+0xc/0x70
>> [ 284.959510] ext4_append+0x8d/0x190
>> [ 284.959665] ext4_init_new_dir+0xd5/0x1b0
>> [ 284.959835] ext4_mkdir+0x192/0x340
>
> hmm.. looks like a allocation request for a directory inode.
>
>> [ 284.959987] vfs_mkdir+0x98/0x140
>> [ 284.960133] do_mkdirat+0x131/0x160
>> [ 284.960285] __x64_sys_mkdir+0x48/0x70
>> [ 284.960445] do_syscall_64+0x38/0x90
>> [ 284.960600] entry_SYSCALL_64_after_hwframe+0x72/0xdc
>> [ 284.960814] RIP: 0033:0x7f3d9dbf8b07
>> [ 284.960967] Code: 1f 40 00 48 8b 05 89 f3 0c 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 f3 0c 00 f7 d8 64 89 01 48
>> [ 284.961741] RSP: 002b:00007ffcd5131098 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
>> [ 284.962075] RAX: ffffffffffffffda RBX: 00007ffcd5131200 RCX: 00007f3d9dbf8b07
>> [ 284.962363] RDX: 0000000000000000 RSI: 00000000000001ff RDI: 000055b54cdac240
>> [ 284.962647] RBP: 00000000000001ff R08: 0000000000000001 R09: 0000000000000003
>> [ 284.962928] R10: 00007ffcd5130d16 R11: 0000000000000206 R12: 00000000000000cb
>> [ 284.963209] R13: 8f5c28f5c28f5c29 R14: 000055b54c8ec660 R15: 00000000000000cb
>> [ 284.963492] </TASK>
>> [ 284.963586] Modules linked in:
>> [ 284.963730] ---[ end trace 0000000000000000 ]---
>
>
> @Ojaswin,
>
> I was looking at the code. In function ext4_mb_choose_next_group_best_avail(),
> we use fls() on goal len to calculate "order". But we never subtract 1
> from it. Then we set the goal len based on this "order". This might make
> ac_g_ex.fe_len > ac_o_ex.fe_len in some cases where we really don't want
> that (like the current case).
> You have added some comments there, so I was not sure if that was
> intentional.
>
> Now, IIUC, the overall concept of cr_1.5 is to trim the max len order
> from goal len to something which is still larger than original length.
> But this is only valid for regular files allocation request. Because we don't
> normalize the request length for non-regular files. See
> ext4_mb_normalize_request() function. As I also see from the current
> bug_on, the request was for dir inode I guess.
>
> Although, I still think we should check the function logic in
> ext4_mb_choose_next_group_best_avail(), but either ways I guess we don't
> want to use CR_BEST_AVAIL_LEN criteria for non-regular files right,
> given we anyways don't normalize the allocation request len for such files.
>
> In that case do you think below diff make sense?
>
>
> mballoc: Don't use CR_BEST_AVAIL_LEN for non-regular files
>
> Using CR_BEST_AVAIL_LEN only make sense for regular files, as for
> non-regular files we never normalize the allocation request length i.e.
> goal len is same as original length (ac_g_ex.fe_len == ac_o_ex.fe_len).
>
> Hence there is no scope of trimming the goal length such that it can
> still satisfy original request len. Thus this patch avoids using
> CR_BEST_AVAIL_LEN criteria for non-regular files request.
>
> ---
> fs/ext4/mballoc.c | 14 +++++++++++++-
> 1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index a2475b8c9fb5..5fbbd7344456 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -974,7 +974,19 @@ static void ext4_mb_choose_next_group_goal_fast(struct ext4_allocation_context *
> *group = grp->bb_group;
> ac->ac_flags |= EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED;
> } else {
> - *new_cr = CR_BEST_AVAIL_LEN;
> + /*
> + * CR_BEST_AVAIL_LEN works based on the concept that we have
> + * a larger normalized goal len request which can be trimmed to
> + * a smaller goal len such that it can still satisfy original
> + * request len. However, allocation request for non-regular
> + * files never gets normalized.
> + * See function ext4_mb_normalize_request() (EXT4_MB_HINT_DATA).
> + */
> + if ((ac->ac_criteria & EXT4_MB_HINT_DATA))
my bad. It should be
if ((ac->ac_flags & EXT4_MB_HINT_DATA))
-ritesh
* Ritesh Harjani <[email protected]>:
> Ritesh Harjani (IBM) <[email protected]> writes:
>
> > Eric Whitney <[email protected]> writes:
> >
> >> I've discovered that generic/269 will trigger a BUG_ON on line 5070 in
> >
> > Can you confirm in your tree what is line 5070 out of the two?
> >
> > BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
> > BUG_ON(ac->ac_pa == NULL);
> >
> > [1]: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/tree/fs/ext4/mballoc.c?h=dev&id=ab8627e104696b8c1c6953ad5255def5b0821e06#n5070
> > [2]: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/tree/fs/ext4/mballoc.c?h=dev#n5070
> >
> > Based on which tree you tested, it could differ I guess.
> >
> > I am assuming it is...
> > BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
> >
> >> ext4_mb_new_inode_pa when running kvm-xfstests on the 1k test case
> >> with a kernel built from the current ext4 dev branch. After hitting the
> >> BUG_ON, the kernel then reports persistent soft lockups. I mentioned this in
> >> today's concall, and Ted confirmed the current dev branch should reflect
> >> what's upstream at this time.
> >>
> >> This test reproduces for me 5 to 10% of the time, but reliably enough - I
> >> typically don't need more than 25 trials to see the failure, and 10 often
> >> suffices. (I haven't yet tried the 4k test case, but will do so.)
> >>
> >> The failure bisects to:
> >> 7e170922f06b ("ext4: Add allocation criteria 1.5 (CR1_5)")
> >>
> >
> > Thanks for the bisection.
> >
> >> Trace follows.
> >>
> >> Eric
> >>
> >>
> >> generic/269 24s ... [21:41:11][ 284.208474] run fstests generic/269 at 2023-07-03 21:41:11
> >> [ 284.511484] EXT4-fs (vdc): mounted filesystem 2b1fbdd6-2724-47bc-b7b5-f4a73c9f19be r/w with ordered data mode. Quota mode: none.
> >> [ 284.950657] ------------[ cut here ]------------
> >> [ 284.950901] kernel BUG at fs/ext4/mballoc.c:5070!
> >> [ 284.951104] invalid opcode: 0000 [#1] PREEMPT SMP
> >> [ 284.951296] CPU: 0 PID: 12039 Comm: fsstress Not tainted 6.4.0-rc5+ #6
> >> [ 284.951567] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
> >> [ 284.951900] RIP: 0010:ext4_mb_new_inode_pa+0x2a6/0x2c0
> >> [ 284.952124] Code: b5 7e 0f 85 b5 fe ff ff 0f 1f 44 00 00 e9 ab fe ff ff e8 9d 56 d3 ff 84 c0 0f 85 b5 fe ff ff 0f 0b e9 ae fe ff ff 0f 0b 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 4c 89 c1 31 c0 e9 42 ff ff ff 0f 1f 84 00
> >> [ 284.952891] RSP: 0018:ffffc90004053970 EFLAGS: 00010a87
> >> [ 284.953124] RAX: 0000000000000002 RBX: ffff8880342d0720 RCX: 0000000000000001
> >> [ 284.953420] RDX: 0000000000004000 RSI: 00001e4000000000 RDI: ffff8880342d0720
> >> [ 284.953720] RBP: ffffc90004053a00 R08: ffff88800a5fc000 R09: 0000000000000000
> >> [ 284.954020] R10: ffff888007964a98 R11: 0000000000000002 R12: 0000000000000003
> >> [ 284.954321] R13: ffff8880342d0720 R14: ffff88800a5fc000 R15: ffff88800abfc000
> >> [ 284.954610] FS: 00007f3d9db07740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
> >> [ 284.954923] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [ 284.955148] CR2: 000055b54ce0fde8 CR3: 0000000006730006 CR4: 0000000000770ef0
> >> [ 284.955443] PKRU: 55555554
> >> [ 284.955559] Call Trace:
> >> [ 284.955669] <TASK>
> >> [ 284.955760] ? die+0x33/0x90
> >> [ 284.955887] ? do_trap+0xe0/0x110
> >> [ 284.956031] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
> >> [ 284.956223] ? do_error_trap+0x65/0x80
> >> [ 284.956385] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
> >> [ 284.956575] ? exc_invalid_op+0x4b/0x70
> >> [ 284.956738] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
> >> [ 284.956929] ? asm_exc_invalid_op+0x16/0x20
> >> [ 284.957112] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
> >> [ 284.957305] ext4_mb_complex_scan_group+0x2e0/0x3e0
> >> [ 284.957512] ext4_mb_regular_allocator+0x3be/0xd80
> >> [ 284.957716] ext4_mb_new_blocks+0x9dc/0x1040
> >> [ 284.957895] ? __kmalloc+0xca/0x150
> >> [ 284.958038] ? ext4_find_extent+0x3ec/0x450
> >> [ 284.958204] ? _raw_write_unlock+0x29/0x50
> >> [ 284.958369] ext4_ext_map_blocks+0x9a4/0x19d0
> >> [ 284.958543] ? __kmem_cache_free+0x17d/0x2e0
> >> [ 284.958723] ? find_held_lock+0x2b/0x80
> >> [ 284.958889] ext4_map_blocks+0x230/0x5d0
> >> [ 284.959056] ? lock_release+0x139/0x280
> >> [ 284.959222] ext4_getblk+0x7b/0x2d0
> >> [ 284.959369] ext4_bread+0xc/0x70
> >> [ 284.959510] ext4_append+0x8d/0x190
> >> [ 284.959665] ext4_init_new_dir+0xd5/0x1b0
> >> [ 284.959835] ext4_mkdir+0x192/0x340
> >
> > hmm.. looks like a allocation request for a directory inode.
> >
> >> [ 284.959987] vfs_mkdir+0x98/0x140
> >> [ 284.960133] do_mkdirat+0x131/0x160
> >> [ 284.960285] __x64_sys_mkdir+0x48/0x70
> >> [ 284.960445] do_syscall_64+0x38/0x90
> >> [ 284.960600] entry_SYSCALL_64_after_hwframe+0x72/0xdc
> >> [ 284.960814] RIP: 0033:0x7f3d9dbf8b07
> >> [ 284.960967] Code: 1f 40 00 48 8b 05 89 f3 0c 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 f3 0c 00 f7 d8 64 89 01 48
> >> [ 284.961741] RSP: 002b:00007ffcd5131098 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
> >> [ 284.962075] RAX: ffffffffffffffda RBX: 00007ffcd5131200 RCX: 00007f3d9dbf8b07
> >> [ 284.962363] RDX: 0000000000000000 RSI: 00000000000001ff RDI: 000055b54cdac240
> >> [ 284.962647] RBP: 00000000000001ff R08: 0000000000000001 R09: 0000000000000003
> >> [ 284.962928] R10: 00007ffcd5130d16 R11: 0000000000000206 R12: 00000000000000cb
> >> [ 284.963209] R13: 8f5c28f5c28f5c29 R14: 000055b54c8ec660 R15: 00000000000000cb
> >> [ 284.963492] </TASK>
> >> [ 284.963586] Modules linked in:
> >> [ 284.963730] ---[ end trace 0000000000000000 ]---
> >
> >
> > @Ojaswin,
> >
> > I was looking at the code. In function ext4_mb_choose_next_group_best_avail(),
> > we use fls() on goal len to calculate "order". But we never subtract 1
> > from it. Then we set the goal len based on this "order". This might make
> > ac_g_ex.fe_len > ac_o_ex.fe_len in some cases where we really don't want
> > that (like the current case).
> > You have added some comments there, so I was not sure if that was
> > intentional.
> >
> > Now, IIUC, the overall concept of cr_1.5 is to trim the max len order
> > from goal len to something which is still larger than original length.
> > But this is only valid for regular files allocation request. Because we don't
> > normalize the request length for non-regular files. See
> > ext4_mb_normalize_request() function. As I also see from the current
> > bug_on, the request was for dir inode I guess.
> >
> > Although, I still think we should check the function logic in
> > ext4_mb_choose_next_group_best_avail(), but either ways I guess we don't
> > want to use CR_BEST_AVAIL_LEN criteria for non-regular files right,
> > given we anyways don't normalize the allocation request len for such files.
> >
> > In that case do you think below diff make sense?
> >
> >
> > mballoc: Don't use CR_BEST_AVAIL_LEN for non-regular files
> >
> > Using CR_BEST_AVAIL_LEN only make sense for regular files, as for
> > non-regular files we never normalize the allocation request length i.e.
> > goal len is same as original length (ac_g_ex.fe_len == ac_o_ex.fe_len).
> >
> > Hence there is no scope of trimming the goal length such that it can
> > still satisfy original request len. Thus this patch avoids using
> > CR_BEST_AVAIL_LEN criteria for non-regular files request.
> >
> > ---
> > fs/ext4/mballoc.c | 14 +++++++++++++-
> > 1 file changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > index a2475b8c9fb5..5fbbd7344456 100644
> > --- a/fs/ext4/mballoc.c
> > +++ b/fs/ext4/mballoc.c
> > @@ -974,7 +974,19 @@ static void ext4_mb_choose_next_group_goal_fast(struct ext4_allocation_context *
> > *group = grp->bb_group;
> > ac->ac_flags |= EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED;
> > } else {
> > - *new_cr = CR_BEST_AVAIL_LEN;
> > + /*
> > + * CR_BEST_AVAIL_LEN works based on the concept that we have
> > + * a larger normalized goal len request which can be trimmed to
> > + * a smaller goal len such that it can still satisfy original
> > + * request len. However, allocation request for non-regular
> > + * files never gets normalized.
> > + * See function ext4_mb_normalize_request() (EXT4_MB_HINT_DATA).
> > + */
> > + if ((ac->ac_criteria & EXT4_MB_HINT_DATA))
>
> my bad. It should be
> if ((ac->ac_flags & EXT4_MB_HINT_DATA))
>
> -ritesh
Ritesh:
An update: the generic/269 failure on 1k is easily reproducible on 6.5-rc1.
The BUG_ON on line 5069 in mballoc.c is now triggering for me, rather than 5070:
BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
Representative trace follows - operation at top of trace varies run to run.
Eric
generic/269 25s ... [18:07:16][ 54.255436] run fstests generic/269 at 2023-06
[ 54.508385] EXT4-fs (vdc): mounted filesystem a986a0fa-5c36-4e16-bad9-2d53b8.
[ 57.914332] ------------[ cut here ]------------
[ 57.914534] kernel BUG at fs/ext4/mballoc.c:5069!
[ 57.914863] invalid opcode: 0000 [#1] PREEMPT SMP
[ 57.915156] CPU: 1 PID: 4157 Comm: fsstress Not tainted 6.5.0-rc1 #1
[ 57.915531] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-4
[ 57.916033] RIP: 0010:ext4_mb_new_inode_pa+0x2a6/0x2c0
[ 57.916354] Code: b5 7e 0f 85 b5 fe ff ff 0f 1f 44 00 00 e9 ab fe ff ff e8 70
[ 57.917473] RSP: 0018:ffffc900057f39f0 EFLAGS: 00010206
[ 57.917799] RAX: 0000000000000002 RBX: ffff888017b70d10 RCX: 0000000000000001
[ 57.918233] RDX: 000000000000a000 RSI: 00000b1c00000000 RDI: ffff888017b70d10
[ 57.918667] RBP: ffffc900057f3a80 R08: ffff88800672f000 R09: ffff888009ac65a0
[ 57.919100] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000068
[ 57.919534] R13: ffff888017b70d10 R14: ffff88800672f000 R15: ffff888007b02000
[ 57.919971] FS: 00007f03368b6740(0000) GS:ffff88807dd00000(0000) knlGS:00000
[ 57.920461] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 57.920789] CR2: 000055fe23769d78 CR3: 000000000787e005 CR4: 0000000000770ee0
[ 57.921113] PKRU: 55555554
[ 57.921213] Call Trace:
[ 57.921327] <TASK>
[ 57.921443] ? die+0x33/0x90
[ 57.921611] ? do_trap+0xe0/0x110
[ 57.921807] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
[ 57.922023] ? do_error_trap+0x65/0x80
[ 57.922174] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
[ 57.922354] ? exc_invalid_op+0x4b/0x70
[ 57.922510] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
[ 57.922693] ? asm_exc_invalid_op+0x16/0x20
[ 57.922859] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
[ 57.923039] ext4_mb_complex_scan_group+0x2e0/0x3e0
[ 57.923232] ext4_mb_regular_allocator+0x3be/0xd80
[ 57.923425] ext4_mb_new_blocks+0x9dc/0x1040
[ 57.923592] ? __kmalloc+0xca/0x150
[ 57.923732] ? ext4_find_extent+0x3ec/0x450
[ 57.923896] ? __kmem_cache_free+0x191/0x310
[ 57.924064] ? rcu_is_watching+0xd/0x40
[ 57.924218] ext4_ext_map_blocks+0x981/0x19c0
[ 57.924393] ? __lock_acquire.constprop.0+0x4b/0x6a0
[ 57.924601] ? __lock_acquire.constprop.0+0x4b/0x6a0
[ 57.924871] ext4_map_blocks+0x22a/0x5b0
[ 57.925079] ext4_getblk+0x7b/0x2d0
[ 57.925260] ext4_bread+0xc/0x70
[ 57.925392] ext4_symlink+0x12b/0x3c0
[ 57.925610] vfs_symlink+0x18c/0x230
[ 57.925839] do_symlinkat+0x11c/0x130
[ 57.926071] __x64_sys_symlink+0x38/0x50
[ 57.926318] do_syscall_64+0x38/0x90
[ 57.926544] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 57.926866] RIP: 0033:0x7f03369a9a97
[ 57.927093] Code: f0 ff ff 73 01 c3 48 8b 0d f6 d3 0c 00 f7 d8 64 89 01 48 88
[ 57.928176] RSP: 002b:00007ffdf02ee128 EFLAGS: 00000206 ORIG_RAX: 00000000008
[ 57.928574] RAX: ffffffffffffffda RBX: 00007ffdf02ee290 RCX: 00007f03369a9a97
[ 57.928879] RDX: 0000000000000064 RSI: 000055fe23514410 RDI: 000055fe23542d10
[ 57.929161] RBP: 000055fe23514410 R08: 000055fe23542d10 R09: 0000000000000003
[ 57.929446] R10: 00007ffdf02edda6 R11: 0000000000000206 R12: 0000000000000170
[ 57.929722] R13: 000055fe23542d10 R14: 00007ffdf02ee290 R15: 000055fe23542d10
[ 57.930029] </TASK>
[ 57.930126] Modules linked in:
[ 57.930274] ---[ end trace 0000000000000000 ]---
Eric Whitney <[email protected]> writes:
> * Ritesh Harjani <[email protected]>:
>> Ritesh Harjani (IBM) <[email protected]> writes:
>>
>> > Eric Whitney <[email protected]> writes:
>> >
>> >> I've discovered that generic/269 will trigger a BUG_ON on line 5070 in
>> >
>> > Can you confirm in your tree what is line 5070 out of the two?
>> >
>> > BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
>> > BUG_ON(ac->ac_pa == NULL);
>> >
>> > [1]: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/tree/fs/ext4/mballoc.c?h=dev&id=ab8627e104696b8c1c6953ad5255def5b0821e06#n5070
>> > [2]: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/tree/fs/ext4/mballoc.c?h=dev#n5070
>> >
>> > Based on which tree you tested, it could differ I guess.
>> >
>> > I am assuming it is...
>> > BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
>> >
>> >> ext4_mb_new_inode_pa when running kvm-xfstests on the 1k test case
>> >> with a kernel built from the current ext4 dev branch. After hitting the
>> >> BUG_ON, the kernel then reports persistent soft lockups. I mentioned this in
>> >> today's concall, and Ted confirmed the current dev branch should reflect
>> >> what's upstream at this time.
>> >>
>> >> This test reproduces for me 5 to 10% of the time, but reliably enough - I
>> >> typically don't need more than 25 trials to see the failure, and 10 often
>> >> suffices. (I haven't yet tried the 4k test case, but will do so.)
>> >>
>> >> The failure bisects to:
>> >> 7e170922f06b ("ext4: Add allocation criteria 1.5 (CR1_5)")
>> >>
>> >
>> > Thanks for the bisection.
>> >
>> >> Trace follows.
>> >>
>> >> Eric
>> >>
>> >>
>> >> generic/269 24s ... [21:41:11][ 284.208474] run fstests generic/269 at 2023-07-03 21:41:11
>> >> [ 284.511484] EXT4-fs (vdc): mounted filesystem 2b1fbdd6-2724-47bc-b7b5-f4a73c9f19be r/w with ordered data mode. Quota mode: none.
>> >> [ 284.950657] ------------[ cut here ]------------
>> >> [ 284.950901] kernel BUG at fs/ext4/mballoc.c:5070!
>> >> [ 284.951104] invalid opcode: 0000 [#1] PREEMPT SMP
>> >> [ 284.951296] CPU: 0 PID: 12039 Comm: fsstress Not tainted 6.4.0-rc5+ #6
>> >> [ 284.951567] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
>> >> [ 284.951900] RIP: 0010:ext4_mb_new_inode_pa+0x2a6/0x2c0
>> >> [ 284.952124] Code: b5 7e 0f 85 b5 fe ff ff 0f 1f 44 00 00 e9 ab fe ff ff e8 9d 56 d3 ff 84 c0 0f 85 b5 fe ff ff 0f 0b e9 ae fe ff ff 0f 0b 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 4c 89 c1 31 c0 e9 42 ff ff ff 0f 1f 84 00
>> >> [ 284.952891] RSP: 0018:ffffc90004053970 EFLAGS: 00010a87
>> >> [ 284.953124] RAX: 0000000000000002 RBX: ffff8880342d0720 RCX: 0000000000000001
>> >> [ 284.953420] RDX: 0000000000004000 RSI: 00001e4000000000 RDI: ffff8880342d0720
>> >> [ 284.953720] RBP: ffffc90004053a00 R08: ffff88800a5fc000 R09: 0000000000000000
>> >> [ 284.954020] R10: ffff888007964a98 R11: 0000000000000002 R12: 0000000000000003
>> >> [ 284.954321] R13: ffff8880342d0720 R14: ffff88800a5fc000 R15: ffff88800abfc000
>> >> [ 284.954610] FS: 00007f3d9db07740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
>> >> [ 284.954923] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> >> [ 284.955148] CR2: 000055b54ce0fde8 CR3: 0000000006730006 CR4: 0000000000770ef0
>> >> [ 284.955443] PKRU: 55555554
>> >> [ 284.955559] Call Trace:
>> >> [ 284.955669] <TASK>
>> >> [ 284.955760] ? die+0x33/0x90
>> >> [ 284.955887] ? do_trap+0xe0/0x110
>> >> [ 284.956031] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
>> >> [ 284.956223] ? do_error_trap+0x65/0x80
>> >> [ 284.956385] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
>> >> [ 284.956575] ? exc_invalid_op+0x4b/0x70
>> >> [ 284.956738] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
>> >> [ 284.956929] ? asm_exc_invalid_op+0x16/0x20
>> >> [ 284.957112] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
>> >> [ 284.957305] ext4_mb_complex_scan_group+0x2e0/0x3e0
>> >> [ 284.957512] ext4_mb_regular_allocator+0x3be/0xd80
>> >> [ 284.957716] ext4_mb_new_blocks+0x9dc/0x1040
>> >> [ 284.957895] ? __kmalloc+0xca/0x150
>> >> [ 284.958038] ? ext4_find_extent+0x3ec/0x450
>> >> [ 284.958204] ? _raw_write_unlock+0x29/0x50
>> >> [ 284.958369] ext4_ext_map_blocks+0x9a4/0x19d0
>> >> [ 284.958543] ? __kmem_cache_free+0x17d/0x2e0
>> >> [ 284.958723] ? find_held_lock+0x2b/0x80
>> >> [ 284.958889] ext4_map_blocks+0x230/0x5d0
>> >> [ 284.959056] ? lock_release+0x139/0x280
>> >> [ 284.959222] ext4_getblk+0x7b/0x2d0
>> >> [ 284.959369] ext4_bread+0xc/0x70
>> >> [ 284.959510] ext4_append+0x8d/0x190
>> >> [ 284.959665] ext4_init_new_dir+0xd5/0x1b0
>> >> [ 284.959835] ext4_mkdir+0x192/0x340
>> >
>> > hmm.. looks like a allocation request for a directory inode.
>> >
>> >> [ 284.959987] vfs_mkdir+0x98/0x140
>> >> [ 284.960133] do_mkdirat+0x131/0x160
>> >> [ 284.960285] __x64_sys_mkdir+0x48/0x70
>> >> [ 284.960445] do_syscall_64+0x38/0x90
>> >> [ 284.960600] entry_SYSCALL_64_after_hwframe+0x72/0xdc
>> >> [ 284.960814] RIP: 0033:0x7f3d9dbf8b07
>> >> [ 284.960967] Code: 1f 40 00 48 8b 05 89 f3 0c 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 f3 0c 00 f7 d8 64 89 01 48
>> >> [ 284.961741] RSP: 002b:00007ffcd5131098 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
>> >> [ 284.962075] RAX: ffffffffffffffda RBX: 00007ffcd5131200 RCX: 00007f3d9dbf8b07
>> >> [ 284.962363] RDX: 0000000000000000 RSI: 00000000000001ff RDI: 000055b54cdac240
>> >> [ 284.962647] RBP: 00000000000001ff R08: 0000000000000001 R09: 0000000000000003
>> >> [ 284.962928] R10: 00007ffcd5130d16 R11: 0000000000000206 R12: 00000000000000cb
>> >> [ 284.963209] R13: 8f5c28f5c28f5c29 R14: 000055b54c8ec660 R15: 00000000000000cb
>> >> [ 284.963492] </TASK>
>> >> [ 284.963586] Modules linked in:
>> >> [ 284.963730] ---[ end trace 0000000000000000 ]---
>> >
>> >
>> > @Ojaswin,
>> >
>> > I was looking at the code. In function ext4_mb_choose_next_group_best_avail(),
>> > we use fls() on goal len to calculate "order". But we never subtract 1
>> > from it. Then we set the goal len based on this "order". This might make
>> > ac_g_ex.fe_len > ac_o_ex.fe_len in some cases where we really don't want
>> > that (like the current case).
>> > You have added some comments there, so I was not sure if that was
>> > intentional.
>> >
>> > Now, IIUC, the overall concept of cr_1.5 is to trim the max len order
>> > from goal len to something which is still larger than original length.
>> > But this is only valid for regular files allocation request. Because we don't
>> > normalize the request length for non-regular files. See
>> > ext4_mb_normalize_request() function. As I also see from the current
>> > bug_on, the request was for dir inode I guess.
>> >
>> > Although, I still think we should check the function logic in
>> > ext4_mb_choose_next_group_best_avail(), but either ways I guess we don't
>> > want to use CR_BEST_AVAIL_LEN criteria for non-regular files right,
>> > given we anyways don't normalize the allocation request len for such files.
>> >
>> > In that case do you think below diff make sense?
>> >
>> >
>> > mballoc: Don't use CR_BEST_AVAIL_LEN for non-regular files
>> >
>> > Using CR_BEST_AVAIL_LEN only make sense for regular files, as for
>> > non-regular files we never normalize the allocation request length i.e.
>> > goal len is same as original length (ac_g_ex.fe_len == ac_o_ex.fe_len).
>> >
>> > Hence there is no scope of trimming the goal length such that it can
>> > still satisfy original request len. Thus this patch avoids using
>> > CR_BEST_AVAIL_LEN criteria for non-regular files request.
>> >
>> > ---
>> > fs/ext4/mballoc.c | 14 +++++++++++++-
>> > 1 file changed, 13 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> > index a2475b8c9fb5..5fbbd7344456 100644
>> > --- a/fs/ext4/mballoc.c
>> > +++ b/fs/ext4/mballoc.c
>> > @@ -974,7 +974,19 @@ static void ext4_mb_choose_next_group_goal_fast(struct ext4_allocation_context *
>> > *group = grp->bb_group;
>> > ac->ac_flags |= EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED;
>> > } else {
>> > - *new_cr = CR_BEST_AVAIL_LEN;
>> > + /*
>> > + * CR_BEST_AVAIL_LEN works based on the concept that we have
>> > + * a larger normalized goal len request which can be trimmed to
>> > + * a smaller goal len such that it can still satisfy original
>> > + * request len. However, allocation request for non-regular
>> > + * files never gets normalized.
>> > + * See function ext4_mb_normalize_request() (EXT4_MB_HINT_DATA).
>> > + */
>> > + if ((ac->ac_criteria & EXT4_MB_HINT_DATA))
>>
>> my bad. It should be
>> if ((ac->ac_flags & EXT4_MB_HINT_DATA))
>>
>> -ritesh
>
>
> Ritesh:
>
> An update: the generic/269 failure on 1k is easily reproducible on 6.5-rc1.
> The BUG_ON on line 5069 in mballoc.c is now triggering for me, rather than 5070:
> BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
Thanks Eric for updating. Yes, I gave this a shot on my setup too and I
am able to reproduce the same BUG_ON on my setup with a similar call
trace that you pasted below. I was trying to see if I can hit the other
bug on with ac->ac_pa = NULL, but I wasn't able to hit that at all.
>
> Representative trace follows - operation at top of trace varies run to run.
>
> Eric
>
> generic/269 25s ... [18:07:16][ 54.255436] run fstests generic/269 at 2023-06
> [ 54.508385] EXT4-fs (vdc): mounted filesystem a986a0fa-5c36-4e16-bad9-2d53b8.
> [ 57.914332] ------------[ cut here ]------------
> [ 57.914534] kernel BUG at fs/ext4/mballoc.c:5069!
> [ 57.914863] invalid opcode: 0000 [#1] PREEMPT SMP
> [ 57.915156] CPU: 1 PID: 4157 Comm: fsstress Not tainted 6.5.0-rc1 #1
> [ 57.915531] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-4
> [ 57.916033] RIP: 0010:ext4_mb_new_inode_pa+0x2a6/0x2c0
> [ 57.916354] Code: b5 7e 0f 85 b5 fe ff ff 0f 1f 44 00 00 e9 ab fe ff ff e8 70
> [ 57.917473] RSP: 0018:ffffc900057f39f0 EFLAGS: 00010206
> [ 57.917799] RAX: 0000000000000002 RBX: ffff888017b70d10 RCX: 0000000000000001
> [ 57.918233] RDX: 000000000000a000 RSI: 00000b1c00000000 RDI: ffff888017b70d10
> [ 57.918667] RBP: ffffc900057f3a80 R08: ffff88800672f000 R09: ffff888009ac65a0
> [ 57.919100] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000068
> [ 57.919534] R13: ffff888017b70d10 R14: ffff88800672f000 R15: ffff888007b02000
> [ 57.919971] FS: 00007f03368b6740(0000) GS:ffff88807dd00000(0000) knlGS:00000
> [ 57.920461] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 57.920789] CR2: 000055fe23769d78 CR3: 000000000787e005 CR4: 0000000000770ee0
> [ 57.921113] PKRU: 55555554
> [ 57.921213] Call Trace:
> [ 57.921327] <TASK>
> [ 57.921443] ? die+0x33/0x90
> [ 57.921611] ? do_trap+0xe0/0x110
> [ 57.921807] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
> [ 57.922023] ? do_error_trap+0x65/0x80
> [ 57.922174] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
> [ 57.922354] ? exc_invalid_op+0x4b/0x70
> [ 57.922510] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
> [ 57.922693] ? asm_exc_invalid_op+0x16/0x20
> [ 57.922859] ? ext4_mb_new_inode_pa+0x2a6/0x2c0
> [ 57.923039] ext4_mb_complex_scan_group+0x2e0/0x3e0
> [ 57.923232] ext4_mb_regular_allocator+0x3be/0xd80
> [ 57.923425] ext4_mb_new_blocks+0x9dc/0x1040
> [ 57.923592] ? __kmalloc+0xca/0x150
> [ 57.923732] ? ext4_find_extent+0x3ec/0x450
> [ 57.923896] ? __kmem_cache_free+0x191/0x310
> [ 57.924064] ? rcu_is_watching+0xd/0x40
> [ 57.924218] ext4_ext_map_blocks+0x981/0x19c0
> [ 57.924393] ? __lock_acquire.constprop.0+0x4b/0x6a0
> [ 57.924601] ? __lock_acquire.constprop.0+0x4b/0x6a0
> [ 57.924871] ext4_map_blocks+0x22a/0x5b0
> [ 57.925079] ext4_getblk+0x7b/0x2d0
> [ 57.925260] ext4_bread+0xc/0x70
> [ 57.925392] ext4_symlink+0x12b/0x3c0
> [ 57.925610] vfs_symlink+0x18c/0x230
> [ 57.925839] do_symlinkat+0x11c/0x130
> [ 57.926071] __x64_sys_symlink+0x38/0x50
> [ 57.926318] do_syscall_64+0x38/0x90
> [ 57.926544] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> [ 57.926866] RIP: 0033:0x7f03369a9a97
> [ 57.927093] Code: f0 ff ff 73 01 c3 48 8b 0d f6 d3 0c 00 f7 d8 64 89 01 48 88
> [ 57.928176] RSP: 002b:00007ffdf02ee128 EFLAGS: 00000206 ORIG_RAX: 00000000008
> [ 57.928574] RAX: ffffffffffffffda RBX: 00007ffdf02ee290 RCX: 00007f03369a9a97
> [ 57.928879] RDX: 0000000000000064 RSI: 000055fe23514410 RDI: 000055fe23542d10
> [ 57.929161] RBP: 000055fe23514410 R08: 000055fe23542d10 R09: 0000000000000003
> [ 57.929446] R10: 00007ffdf02edda6 R11: 0000000000000206 R12: 0000000000000170
> [ 57.929722] R13: 000055fe23542d10 R14: 00007ffdf02ee290 R15: 000055fe23542d10
> [ 57.930029] </TASK>
> [ 57.930126] Modules linked in:
> [ 57.930274] ---[ end trace 0000000000000000 ]---
-ritesh