2023-10-02 08:13:06

by bugzilla-daemon

Subject: [Bug 217965] New: ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

Bug ID: 217965
Summary: ext4(?) regression since 6.5.0 on sata hdd
Product: File System
Version: 2.5
Hardware: All
OS: Linux
Status: NEW
Severity: normal
Priority: P3
Component: ext4
Assignee: [email protected]
Reporter: [email protected]
Regression: No

Since kernel 6.5.x and 6.6-rc* I'm getting weird kworker flush activity when
building openwrt from sources:

91 root 20 0 0 0 0 R 99,7 0,0 18:06.57 kworker/u16:4+flush-8:16

The openwrt sources reside on a sata hdd with an ext4 fs; I've been using this
setup for the last 5 years. The problem is that since the 6.5 kernels, after
the openwrt kernel patch make step
(https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=scripts/patch-kernel.sh;h=c2b7e7204952f93946a6075d546cbeae32c2627f;hb=HEAD,
which probably involves a lot of copy and write operations),
kworker/u16:4+flush-8:16 uses 100% of one core for a while (5-15 minutes),
even after I cancel the openwrt build.

I tried moving the openwrt sources folder to the ssd drive where my system
resides and running the openwrt build from there, and I get no issues with
kworker flush cpu usage. I also see no such behavior with 6.4.x and older
kernels, so it looks like a regression to me; I'm not sure if this is a fs,
vfs or even a block subsystem issue.
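For reference, a profile like the one below is typically captured with
something along these lines (illustrative invocation; the exact options used
may have differed):

$ perf record -ag -- sleep 30
$ perf report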

This is how it looks in perf:

Samples: 320K of event 'cycles:P', Event count (approx.): 363448649248
  Children      Self  Command          Shared Object     Symbol
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]  [k] ret_from_fork_asm
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]  [k] ret_from_fork
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]  [k] kthread
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]  [k] worker_thread
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]  [k] process_one_work
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]  [k] wb_workfn
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]  [k] wb_writeback
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]  [k] __writeback_inodes_wb
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]  [k] writeback_sb_inodes
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]  [k] __writeback_single_inode
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]  [k] do_writepages
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]  [k] ext4_writepages
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]  [k] ext4_do_writepages
+   12,39%     0,00%  kworker/u16:2+f  [kernel.vmlinux]  [k] ext4_map_blocks
+   12,39%     0,00%  kworker/u16:2+f  [kernel.vmlinux]  [k] ext4_ext_map_blocks
+   12,38%     0,00%  kworker/u16:2+f  [kernel.vmlinux]  [k] ext4_mb_new_blocks
+   12,38%     0,93%  kworker/u16:2+f  [kernel.vmlinux]  [k] ext4_mb_regular_allocator
+    9,42%     0,00%  cc1              [unknown]         [.] 0000000000000000
+    5,42%     0,53%  kworker/u16:2+f  [kernel.vmlinux]  [k] ext4_mb_scan_aligned
+    4,88%     0,69%  kworker/u16:2+f  [kernel.vmlinux]  [k] mb_find_extent
+    3,99%     3,95%  kworker/u16:2+f  [kernel.vmlinux]  [k] mb_find_order_for_block
+    3,51%     0,61%  kworker/u16:2+f  [kernel.vmlinux]  [k] ext4_mb_load_buddy_gfp
+    2,95%     0,01%  cc1              [kernel.vmlinux]  [k] asm_exc_page_fault
+    2,67%     0,18%  kworker/u16:2+f  [kernel.vmlinux]  [k] pagecache_get_page
+    2,41%     0,40%  kworker/u16:2+f  [kernel.vmlinux]  [k] __filemap_get_folio
+    2,33%     2,10%  cc1              cc1               [.] cpp_get_token_1
+    2,12%     0,05%  cc1              [kernel.vmlinux]  [k] exc_page_fault
+    2,07%     0,04%  cc1              [kernel.vmlinux]  [k] do_user_addr_fault
+    1,81%     0,52%  kworker/u16:2+f  [kernel.vmlinux]  [k] filemap_get_entry
     1,80%     1,71%  cc1              cc1               [.] ht_lookup_with_hash
+    1,77%     0,08%  cc1              [kernel.vmlinux]  [k] handle_mm_fault
+    1,65%     0,14%  cc1              [kernel.vmlinux]  [k] __handle_mm_fault
     1,60%     1,49%  cc1              cc1               [.] _cpp_lex_direct
+    1,54%     0,73%  kworker/u16:2+f  [kernel.vmlinux]  [k] ext4_mb_good_group
+    1,49%     1,46%  cc1              cc1               [.] ggc_internal_alloc
+    1,28%     0,05%  cc1              [kernel.vmlinux]  [k] do_anonymous_page
+    1,28%     0,04%  cc1              [kernel.vmlinux]  [k] entry_SYSCALL_64



2023-10-02 08:53:20

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

Ivan Ivanich ([email protected]) changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]
Regression|No |Yes


2023-10-02 08:59:33

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #1 from Ivan Ivanich ([email protected]) ---
Created attachment 305179
--> https://bugzilla.kernel.org/attachment.cgi?id=305179&action=edit
Kernel config


2023-10-02 09:07:58

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #2 from Ivan Ivanich ([email protected]) ---
Backtrace from echo l > /proc/sysrq-trigger:
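(Side note: if writing to sysrq-trigger has no effect, sysrq may first need to
be enabled, e.g.:

# echo 1 > /proc/sys/kernel/sysrq

which allows all sysrq functions.)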


[ 3128.330756] CPU: 2 PID: 311853 Comm: patch Tainted: P O
6.6.0-rc4-x86_64 #1
[ 3128.330758] Hardware name: ASUS All Series/Z97-K, BIOS 2902 03/31/2016
[ 3128.330759] RIP: 0010:mb_find_order_for_block+0x68/0xa0
[ 3128.330762] Code: d0 02 00 00 42 0f b7 04 00 48 01 f0 48 89 c6 41 89 f9 c1
e0 03 83 e0 38 41 d3 f9 48 83 e6 f8 44 01 c8 48 98 48 0f a3 06 73 27 <0f> b7 42
30 83 c1 01 83 c0 01 39 c1 7f 17 48 8b 42 18 48 8b 72 08
[ 3128.330763] RSP: 0018:ffffc9000e363630 EFLAGS: 00000207
[ 3128.330765] RAX: 0000000000000036 RBX: ffffc9000e363698 RCX:
000000000000000c
[ 3128.330766] RDX: ffffc9000e363738 RSI: ffff88810a7e2ff8 RDI:
0000000000006000
[ 3128.330767] RBP: 0000000000007ff0 R08: 0000000000000018 R09:
0000000000000006
[ 3128.330767] R10: 0000000000000001 R11: 0000000000000000 R12:
0000000000006000
[ 3128.330768] R13: 0000000000006000 R14: 000000000000000d R15:
ffffc9000e363738
[ 3128.330769] FS: 00007f237c609740(0000) GS:ffff88841fa80000(0000)
knlGS:0000000000000000
[ 3128.330771] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3128.330772] CR2: 00005576c2964000 CR3: 0000000261c76004 CR4:
00000000001706e0
[ 3128.330773] Call Trace:
[ 3128.330774] <NMI>
[ 3128.330774] ? nmi_cpu_backtrace+0x88/0xf0
[ 3128.330776] ? nmi_cpu_backtrace_handler+0xc/0x20
[ 3128.330778] ? nmi_handle+0x53/0xf0
[ 3128.330780] ? default_do_nmi+0x40/0x220
[ 3128.330782] ? exc_nmi+0xf3/0x120
[ 3128.330784] ? end_repeat_nmi+0x16/0x67
[ 3128.330787] ? mb_find_order_for_block+0x68/0xa0
[ 3128.330788] ? mb_find_order_for_block+0x68/0xa0
[ 3128.330790] ? mb_find_order_for_block+0x68/0xa0
[ 3128.330792] </NMI>
[ 3128.330792] <TASK>
[ 3128.330793] mb_find_extent+0x149/0x200
[ 3128.330795] ext4_mb_scan_aligned+0xd3/0x140
[ 3128.330798] ext4_mb_regular_allocator+0xcca/0xe50
[ 3128.330800] ext4_mb_new_blocks+0x8c6/0xe80
[ 3128.330802] ? ext4_find_extent+0x3c1/0x410
[ 3128.330803] ? release_pages+0x13a/0x400
[ 3128.330805] ext4_ext_map_blocks+0x626/0x15c0
[ 3128.330807] ? filemap_get_folios_tag+0x1c6/0x1f0
[ 3128.330809] ? mpage_prepare_extent_to_map+0x462/0x4a0
[ 3128.330811] ext4_map_blocks+0x18b/0x5f0
[ 3128.330813] ? kmem_cache_alloc+0x278/0x410
[ 3128.330814] ? ext4_alloc_io_end_vec+0x19/0x50
[ 3128.330816] ext4_do_writepages+0x619/0xa90
[ 3128.330818] ext4_writepages+0xa8/0x180
[ 3128.330819] do_writepages+0xc6/0x190
[ 3128.330821] ? __find_get_block+0x1b2/0x2a0
[ 3128.330824] filemap_fdatawrite_wbc+0x5e/0x80
[ 3128.330827] __filemap_fdatawrite_range+0x57/0x80
[ 3128.330829] ext4_rename+0x753/0xbf0
[ 3128.330831] vfs_rename+0xa7e/0xbe0
[ 3128.330834] ? do_renameat2+0x4be/0x660
[ 3128.330836] do_renameat2+0x4be/0x660
[ 3128.330838] __x64_sys_renameat+0x44/0x60
[ 3128.330839] do_syscall_64+0x3f/0x90
[ 3128.330841] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 3128.330843] RIP: 0033:0x7f237c66160a
[ 3128.330844] Code: 48 8b 15 21 e8 15 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66
2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 49 89 ca b8 08 01 00 00 0f 05 <48> 3d 00
f0 ff ff 77 06 c3 0f 1f 44 00 00 48 8b 15 e9 e7 15 00 f7
[ 3128.330845] RSP: 002b:00007ffcf85ca528 EFLAGS: 00000202 ORIG_RAX:
0000000000000108
[ 3128.330847] RAX: ffffffffffffffda RBX: 0000000000000004 RCX:
00007f237c66160a
[ 3128.330848] RDX: 0000000000000004 RSI: 00005576c295a091 RDI:
0000000000000004
[ 3128.330848] RBP: 0000000000000004 R08: 00000000000081a4 R09:
0000000000000000
[ 3128.330849] R10: 00005576c2959ad1 R11: 0000000000000202 R12:
00005576c2959ad1
[ 3128.330850] R13: 00005576c295a091 R14: 00007ffcf85ca630 R15:
0000000000000000
[ 3128.330851] </TASK>
[ 3128.330852] NMI backtrace for cpu 6 skipped: idling at intel_idle+0x62/0xb0
[ 4648.228980] sysrq: Show backtrace of all active CPUs
[ 4648.228983] NMI backtrace for cpu 1
[ 4648.228985] CPU: 1 PID: 526431 Comm: bash Tainted: P O
6.6.0-rc4-x86_64 #1
[ 4648.228987] Hardware name: ASUS All Series/Z97-K, BIOS 2902 03/31/2016
[ 4648.228987] Call Trace:
[ 4648.228989] <TASK>
[ 4648.228991] dump_stack_lvl+0x36/0x50
[ 4648.228995] nmi_cpu_backtrace+0xc0/0xf0
[ 4648.228997] ? __pfx_nmi_raise_cpu_backtrace+0x10/0x10
[ 4648.228999] nmi_trigger_cpumask_backtrace+0xc8/0xf0
[ 4648.229000] __handle_sysrq+0xd2/0x190
[ 4648.229002] write_sysrq_trigger+0x23/0x40
[ 4648.229004] proc_reg_write+0x57/0xa0
[ 4648.229006] vfs_write+0xcb/0x410
[ 4648.229008] ? filp_flush+0x4d/0x80
[ 4648.229010] ksys_write+0x62/0xe0
[ 4648.229011] do_syscall_64+0x3f/0x90
[ 4648.229013] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 4648.229016] RIP: 0033:0x7f3d2d70e640
[ 4648.229017] Code: 87 0c 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f
1f 84 00 00 00 00 00 80 3d 29 0d 0d 00 00 74 17 b8 01 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
[ 4648.229019] RSP: 002b:00007ffd39f5e4a8 EFLAGS: 00000202 ORIG_RAX:
0000000000000001
[ 4648.229021] RAX: ffffffffffffffda RBX: 0000000000000002 RCX:
00007f3d2d70e640
[ 4648.229022] RDX: 0000000000000002 RSI: 000055fbbed84ba0 RDI:
0000000000000001
[ 4648.229023] RBP: 00007f3d2d7d85c0 R08: 0000000000000030 R09:
00007f3d2d7d7b00
[ 4648.229024] R10: 0000000000000000 R11: 0000000000000202 R12:
0000000000000002
[ 4648.229025] R13: 000055fbbed84ba0 R14: 0000000000000002 R15:
00007f3d2d7d5f40
[ 4648.229026] </TASK>
[ 4648.229027] Sending NMI from CPU 1 to CPUs 0,2-7:
[ 4648.229030] NMI backtrace for cpu 3 skipped: idling at intel_idle+0x62/0xb0
[ 4648.229033] NMI backtrace for cpu 7 skipped: idling at intel_idle+0x62/0xb0
[ 4648.229036] NMI backtrace for cpu 5 skipped: idling at intel_idle+0x62/0xb0
[ 4648.229039] NMI backtrace for cpu 2
[ 4648.229041] CPU: 2 PID: 71621 Comm: kworker/u16:1 Tainted: P O
6.6.0-rc4-x86_64 #1
[ 4648.229042] Hardware name: ASUS All Series/Z97-K, BIOS 2902 03/31/2016
[ 4648.229043] Workqueue: writeback wb_workfn (flush-8:16)
[ 4648.229047] RIP: 0010:mb_find_order_for_block+0x66/0xa0
[ 4648.229050] Code: 8b 80 d0 02 00 00 42 0f b7 04 00 48 01 f0 48 89 c6 41 89
f9 c1 e0 03 83 e0 38 41 d3 f9 48 83 e6 f8 44 01 c8 48 98 48 0f a3 06 <73> 27 0f
b7 42 30 83 c1 01 83 c0 01 39 c1 7f 17 48 8b 42 18 48 8b
[ 4648.229052] RSP: 0018:ffffc90001f076c8 EFLAGS: 00000203
[ 4648.229053] RAX: 0000000000000138 RBX: ffffc90001f07730 RCX:
0000000000000006
[ 4648.229054] RDX: ffffc90001f077d0 RSI: ffff88810a7e2f80 RDI:
0000000000004e30
[ 4648.229055] RBP: 0000000000007ff0 R08: 000000000000000c R09:
0000000000000138
[ 4648.229056] R10: 0000000000000000 R11: 0000000000000000 R12:
ffff8881049f7800
[ 4648.229056] R13: 0000000000000001 R14: 0000000000007ff0 R15:
ffffc90001f077d0
[ 4648.229057] FS: 0000000000000000(0000) GS:ffff88841fa80000(0000)
knlGS:0000000000000000
[ 4648.229059] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4648.229060] CR2: 00007f23b3ffc3a0 CR3: 000000023ba2a001 CR4:
00000000001706e0
[ 4648.229060] Call Trace:
[ 4648.229061] <NMI>
[ 4648.229062] ? nmi_cpu_backtrace+0x88/0xf0
[ 4648.229064] ? nmi_cpu_backtrace_handler+0xc/0x20
[ 4648.229065] ? nmi_handle+0x53/0xf0
[ 4648.229068] ? default_do_nmi+0x40/0x220
[ 4648.229070] ? exc_nmi+0xf3/0x120
[ 4648.229072] ? end_repeat_nmi+0x16/0x67
[ 4648.229075] ? mb_find_order_for_block+0x66/0xa0
[ 4648.229077] ? mb_find_order_for_block+0x66/0xa0
[ 4648.229079] ? mb_find_order_for_block+0x66/0xa0
[ 4648.229081] </NMI>
[ 4648.229081] <TASK>
[ 4648.229081] mb_find_extent+0xd7/0x200
[ 4648.229084] ext4_mb_scan_aligned+0xd3/0x140
[ 4648.229086] ? ext4_get_group_info+0x4b/0x60
[ 4648.229088] ext4_mb_regular_allocator+0xcca/0xe50
[ 4648.229090] ? mempool_alloc+0x58/0x180
[ 4648.229093] ext4_mb_new_blocks+0x8c6/0xe80
[ 4648.229094] ? ext4_find_extent+0x3c1/0x410
[ 4648.229096] ext4_ext_map_blocks+0x626/0x15c0
[ 4648.229099] ? filemap_get_folios_tag+0x1c6/0x1f0
[ 4648.229100] ? mpage_prepare_extent_to_map+0x462/0x4a0
[ 4648.229102] ext4_map_blocks+0x18b/0x5f0
[ 4648.229104] ? kmem_cache_alloc+0x278/0x410
[ 4648.229106] ? ext4_alloc_io_end_vec+0x19/0x50
[ 4648.229109] ext4_do_writepages+0x619/0xa90
[ 4648.229111] ext4_writepages+0xa8/0x180
[ 4648.229113] do_writepages+0xc6/0x190
[ 4648.229115] __writeback_single_inode+0x2c/0x1a0
[ 4648.229117] writeback_sb_inodes+0x1e8/0x450
[ 4648.229119] __writeback_inodes_wb+0x47/0xe0
[ 4648.229121] wb_writeback.isra.0+0x159/0x1b0
[ 4648.229123] wb_workfn+0x269/0x380
[ 4648.229125] ? __schedule+0x2ca/0x1110
[ 4648.229127] ? __mod_timer+0x116/0x370
[ 4648.229130] process_one_work+0x12f/0x240
[ 4648.229132] worker_thread+0x2f0/0x410
[ 4648.229134] ? __pfx_worker_thread+0x10/0x10
[ 4648.229136] kthread+0xe3/0x110
[ 4648.229138] ? __pfx_kthread+0x10/0x10
[ 4648.229139] ret_from_fork+0x2f/0x50
[ 4648.229141] ? __pfx_kthread+0x10/0x10
[ 4648.229142] ret_from_fork_asm+0x1b/0x30
[ 4648.229145] </TASK>
[ 4648.229145] NMI backtrace for cpu 0 skipped: idling at intel_idle+0x62/0xb0
[ 4648.229149] NMI backtrace for cpu 4 skipped: idling at intel_idle+0x62/0xb0
[ 4648.229151] NMI backtrace for cpu 6 skipped: idling at intel_idle+0x62/0xb0


2023-10-02 15:38:17

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

Theodore Tso ([email protected]) changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]

--- Comment #3 from Theodore Tso ([email protected]) ---
What sort of information can you give about the ext4 file system where this is
happening? How much free space is there? Can you run "df" and "dumpe2fs -h"
on the system so we can see which file system features are enabled? Also,
can you specify the kind of block device?

Differential debugging would be useful. For example, can you try copying the
file system image to another hard drive and see if you can reproduce the
behavior using the same kernel build workload? What if you copy the file
system image to a USB-attached SSD and see if you can reproduce the behavior
there? What if you attach the USB-attached SSD and use a freshly formatted
ext4 file system? Does it reproduce then? If the file system is nearly full,
what if you delete a lot of unused files, to provide a lot more free space,
and see if it reproduces then?
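For the image-copy experiments, one way to take a raw copy of the file system
(device and destination are placeholders; the source should ideally be
unmounted or mounted read-only) would be something like:

$ dd if=/dev/sdXN of=/mnt/otherdisk/fs-copy.img bs=64M status=progress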


2023-10-02 16:26:53

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #4 from Ivan Ivanich ([email protected]) ---

(In reply to Theodore Tso from comment #3)
> What sort of information can you give about the ext4 file system where this
> is happening? How much free space is there? Can you run "df" and
> "dumpe2fs -h" on the system so we can see which file system features are
> enabled? Also, can you specify the kind of block device?
>
> Differential debugging would be useful. For example, can you try copying
> the file system image to another hard drive and see if you can reproduce
> the behavior using the same kernel build workload? What if you copy the
> file system image to a USB-attached SSD and see if you can reproduce the
> behavior there? What if you attach the USB-attached SSD and use a freshly
> formatted ext4 file system? Does it reproduce then? If the file system is
> nearly full, what if you delete a lot of unused files, to provide a lot
> more free space, and see if it reproduces then?

mount |grep sdb2
/dev/sdb2 on /mnt/sdb2 type ext4 (rw,relatime,stripe=32752)

df /mnt/sdb2/
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdb2 2667140496 2125391752 539022500 80% /mnt/sdb2


dumpe2fs -h /dev/sdb2
dumpe2fs 1.47.0 (5-Feb-2023)
Filesystem volume name: <none>
Last mounted on: /mnt/sdb2
Filesystem UUID: 3bd82ee7-f08a-4a23-8e50-2df3786b0858
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype
needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg
dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 169369600
Block count: 677467392
Reserved block count: 677465
Overhead clusters: 10682268
Free blocks: 135437186
Free inodes: 168057645
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 862
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
RAID stride: 32752
Flex block group size: 16
Filesystem created: Wed Jun 25 13:10:17 2014
Last mount time: Mon Oct 2 18:46:12 2023
Last write time: Mon Oct 2 18:46:12 2023
Mount count: 23
Maximum mount count: -1
Last checked: Thu Sep 28 14:37:21 2023
Check interval: 0 (<none>)
Lifetime writes: 84 TB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: a90aaa2d-577c-4181-83f9-696dee7ea92d
Journal backup: inode blocks
Journal features: journal_incompat_revoke
Total journal size: 128M
Total journal blocks: 32768
Max transaction length: 32768
Fast commit length: 0
Journal sequence: 0x002a8661
Journal start: 9556

I don't have physical access to this machine, so I can't do any manipulation
of the hardware or attach USB drives, etc.

The device is:
Model Family: Toshiba 3.5" DT01ACA... Desktop HDD
Device Model: TOSHIBA DT01ACA300
Serial Number: 14KP58HGS
LU WWN Device Id: 5 000039 ff4d7b718
Firmware Version: MX6OABB0
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Oct 2 18:53:26 2023 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM feature is: Disabled
Rd look-ahead is: Enabled
Write cache is: Enabled
DSN feature is: Unavailable
ATA Security is: Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled


2023-10-03 06:50:18

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

Ojaswin Mujoo ([email protected]) changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]

--- Comment #5 from Ojaswin Mujoo ([email protected]) ---
Hi Ivan,

Thanks for sharing the additional information. I can see that you are passing
stripe=32752, and in the backtrace I do see some functions which indicate that
a striped allocation was taking place. There were some changes in the
allocator in the 6.5 kernel, if I'm not wrong, but I'll need to try to
replicate this to confirm whether the issue is related to those, and I'll
update here.

Meanwhile, can you provide some more info on the following:

1. I want to check whether this issue is related to the striped allocation
code path or if it's also seen in the usual path. To check that, would it be
possible for you to run the same workload without the "stripe=32752" option
being applied at mount time?

2. How often do you see this issue; is it noticeable 100% of the time?

Regards,
ojaswin


2023-10-03 07:01:10

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #6 from Ivan Ivanich ([email protected]) ---
Hi Ojaswin,

1. Actually, I'm not passing the stripe option in my fstab; it seems to be
added automatically. Could you please clarify how I can get rid of it?

The corresponding fstab entry:

UUID=3bd82ee7-f08a-4a23-8e50-2df3786b0858 /mnt/sdb2 ext4 defaults,user_xattr 0 2

2. The issue is reproducible/visible 100% of the time when building openwrt
on kernels >=6.5, including 6.6.


2023-10-03 11:05:08

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #7 from Ivan Ivanich ([email protected]) ---
Removed stripe with tune2fs -f -E stride=0 /dev/sdb2 and it fixed the issue.
Still have no clue why it was enabled.
Thanks


2023-10-03 13:42:11

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #8 from Ojaswin Mujoo ([email protected]) ---
Hey Ivan, thanks!

It seems the stripe mount option is automatically added if the filesystem was
made using the stride or stripe-width options at mkfs time.
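For illustration (the value here is made up), such a geometry typically ends
up in the superblock at mkfs time via something like:

# mkfs.ext4 -E stride=8192 /dev/sdXN

The kernel then surfaces it as the stripe= mount option; it is the same value
that dumpe2fs reports above as "RAID stride".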

Can you kindly perform the steps below as well, to help narrow this issue
down further:

1. Mount FS with the usual options (including using stripe=32752).
2. Once mounted, run:

$ echo "0" > /sys/fs/ext4/sdb2/mb_best_avail_max_trim_order

This might need sudo. Basically this disables one of the allocator features
that was added in 6.5.

3. Run the workload again and see if the issue is still visible.

Regards,
ojaswin


2023-10-03 14:24:10

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #9 from Ivan Ivanich ([email protected]) ---
With the stripe=32752 mount option and mb_best_avail_max_trim_order disabled,
the issue is still visible.

cat /sys/fs/ext4/sdb2/mb_best_avail_max_trim_order
0

mount|grep sdb2
/dev/sdb2 on /mnt/sdb2 type ext4 (rw,relatime,stripe=32752)

4687 root 20 0 0 0 0 R 100,0 0,0 0:25.73 kworker/u16:1+flush-8:16


2023-10-04 12:44:38

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #10 from Ojaswin Mujoo ([email protected]) ---
Hi Ivan,

So unfortunately I'm not able to replicate it yet on my end. While I try
that, I wanted to check whether you could give a few things a try.

It seems like the CPU is stuck in mb_find_order_for_block(), called from
mb_find_extent(). I do see a while loop in mb_find_order_for_block(), but it's
not obvious whether it's stuck there and, if so, why.

If possible, can you:

1. Recompile the kernel with CONFIG_DEBUG_INFO=y, CONFIG_SOFTLOCKUP_DETECTOR=y
and CONFIG_HARDLOCKUP_DETECTOR=y, which might provide more information in
dmesg when the lockup happens (one way to enable these is shown after these
steps).

2. Replicate it once more and note the RIP value in the trace of the stuck
CPU; for example, in the above trace it was mb_find_order_for_block+0x68 for
CPU 2.

3. Run the following in the kernel's source dir to get the corresponding line
number (CONFIG_DEBUG_INFO needed):

$ ./scripts/faddr2line vmlinux mb_find_order_for_block+0x68

Maybe you can share the code you see in and around those lines as well as the
exact kernel version.

This will help pinpoint the location where the code might be stuck (for example
in a loop), which can help root cause this.
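As a side note, one way to flip those options on an existing .config is
something like the following (on recent kernels CONFIG_DEBUG_INFO itself is
selected through one of the DEBUG_INFO_DWARF_* choices, so that is what gets
enabled here):

$ ./scripts/config -e DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT -e SOFTLOCKUP_DETECTOR -e HARDLOCKUP_DETECTOR
$ make olddefconfig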

Also, you mentioned that the CPU gets stuck at 100% util for 10-15 minutes;
does it ever come back to normal or does it stay stuck?

Regards,
ojaswin


2023-10-04 17:23:17

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #11 from Ivan Ivanich ([email protected]) ---
Hi Ojaswin,

I tried it, but unfortunately it gives me nothing useful:

[ 619.894677] RIP: 0010:mb_find_order_for_block+0x50/0xa0

./scripts/faddr2line vmlinux mb_find_order_for_block+0x50
mb_find_order_for_block+0x50/0xa0:
fs mm ??:0

All the required kernel options are enabled:
zcat /proc/config.gz |grep -i
'CONFIG_DEBUG_INFO\|CONFIG_SOFTLOCKUP_DETECTOR\|CONFIG_HARDLOCKUP_DETECTOR'|grep
-v '#'
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
CONFIG_DEBUG_INFO_COMPRESSED_NONE=y
CONFIG_SOFTLOCKUP_DETECTOR=y
CONFIG_HARDLOCKUP_DETECTOR=y
CONFIG_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER=y

>Also, you mentioned that the CPU gets stuck at 100% util for 10-15mins, does
>it ever come back to normal or does it stay stuck?

Yes, it comes back to normal after quite some time.


2023-10-11 07:53:50

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #12 from Ojaswin Mujoo ([email protected]) ---
Hey Ivan,

so I used the kernel v6.6-rc1 and the same config you provided, as well as an
hdd mounted on my VM. Then I followed the steps here to build openwrt [1].
However, I'm still unable to replicate the 100% cpu utilization in a
kworker/flush thread (I do get ...).

Since you have the config options enabled and we didn't see them trigger any
warning, and since we get back to normal after a few minutes, this indicates
that it's not a lockup/deadlock. We also see that on the faster SSD we don't
see this issue, so this might even have something to do with a lot of IOs
being queued up on the slower disk causing us to notice the delay. Maybe we
are waiting a lot more on some spinlock, which could explain the CPU
utilization.

Since I'm unable to replicate it, I'll have to ask you for some more info to
get to the bottom of this. More specifically, can you kindly provide the
following:

For the kernel with this issue:

1. Replicate the 100% util in one terminal window.
2. Once the 100% util is hit, in another terminal run the following command:

$ iostat -x /dev/<dev> 2 (run this for 20 to 30 seconds)
$ perf record -ag sleep 20
$ echo l > /proc/sysrq-trigger
$ uname -a

3. Repeat the above for a kernel where the issue is not seen.

Kindly share the sysrq backtrace, iostat output, perf.data and uname output
for both runs here, so that I can take a closer look at what is causing the
unusual utilization.

[1] https://github.com/openwrt/openwrt#quickstart


Regards,
ojaswin


2023-10-11 16:08:25

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #13 from Ivan Ivanich ([email protected]) ---
Affected one:
iostat: https://pastebin.com/mau5VYnV
perf:
https://drive.google.com/file/d/15FO5YsvYrFnKtJI9hfl8qPG7oI8SgLeI/view?usp=sharing
backtrace: https://pastebin.com/AgjYXhjR
uname -a:
Linux gentoo-desktop 6.5.7-gentoo-x86_64 #1 SMP PREEMPT_DYNAMIC Wed Oct 11
14:40:24 EEST 2023 x86_64 Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz GenuineIntel
GNU/Linux

I don't have a 6.4.16 kernel at the moment; I will compile it later and
provide the relevant info.


2023-10-11 22:24:07

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #14 from Ivan Ivanich ([email protected]) ---
Not affected 6.4.16:
iostat: https://pastebin.com/6XadeAuh
perf:
https://drive.google.com/file/d/1neAq73hoztfTU87NyCHWGAQhCKQfILOH/view?usp=sharing
backtrace: https://pastebin.com/zdnWgx4M
uname -a:
Linux gentoo-desktop 6.4.16-gentoo-x86_64 #1 SMP PREEMPT_DYNAMIC Thu Oct 12
00:09:02 EEST 2023 x86_64 Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz GenuineIntel
GNU/Linux


2023-10-17 10:45:44

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

Eduard Kohler ([email protected]) changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]

--- Comment #15 from Eduard Kohler ([email protected]) ---
I'm encountering the same issue on a stock Fedora 37 kernel:
# uname -a
Linux xxx.net 6.4.16-100.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Sep 13 18:22:38
UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

The issue also seems to be present on btrfs partitions, as mentioned in
https://bugzilla.redhat.com/show_bug.cgi?id=2242391

In my case, I have an EXT4 partition on an mdadm raid 1 array of two HDDs.

# mdadm --misc --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Sun Feb 24 21:40:33 2013
Raid Level : raid1
Array Size : 3906118928 (3.64 TiB 4.00 TB)
Used Dev Size : 3906118928 (3.64 TiB 4.00 TB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent

Update Time : Tue Oct 17 12:05:22 2023
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0

Consistency Policy : resync

Name : xxx.net:0 (local to host xxx.net)
UUID : 25070e74:b2ac1695:46ee20f9:7e8d1e05
Events : 402440

Number Major Minor RaidDevice State
3 8 17 0 active sync /dev/sdb1
2 8 33 1 active sync /dev/sdc1

dumpe2fs -h /dev/md0p1
dumpe2fs 1.46.5 (30-Dec-2021)
Filesystem volume name: <none>
Last mounted on: /home
Filesystem UUID: 4df09a03-d8ef-4cef-a64d-cefde4823fc2
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype
needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg
dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 244137984
Block count: 976529232
Reserved block count: 48826457
Overhead clusters: 15373546
Free blocks: 545717412
Free inodes: 243040909
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 791
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
RAID stride: 32745
Flex block group size: 16
Filesystem created: Sun Feb 24 22:13:52 2013
Last mount time: Mon Oct 16 21:06:57 2023
Last write time: Mon Oct 16 21:06:57 2023
Mount count: 6
Maximum mount count: -1
Last checked: Sat Oct 7 16:27:02 2023
Check interval: 0 (<none>)
Lifetime writes: 238 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
First orphan inode: 13509593
Default directory hash: half_md4
Directory Hash Seed: af37e79f-0457-4318-9c5d-840dc2f60bce
Journal backup: inode blocks
Journal features: journal_incompat_revoke
Total journal size: 128M
Total journal blocks: 32768
Max transaction length: 32768
Fast commit length: 0
Journal sequence: 0x003bc716
Journal start: 1

# df /dev/md0p1
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/md0p1 3844622744 1661757340 1987543192 46% /home

# mount |grep md0
/dev/md0p1 on /home type ext4 (rw,relatime,seclabel,stripe=32745)

In order to trigger the 100% CPU on one core, a simple
# for i in {0001..0200}; do echo "some text" > "file_${i}.txt"; done
is enough. This brings the kworker/flush thread to 100% for around 90 seconds.
The time depends on the number of created files. Trying to delete these files
is more or less impossible.
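For what it's worth, the writeback drain time can also be measured directly by
following the loop with a sync, which blocks until the flusher is done:

# time sync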

On the same setup I have a small SSD used for the system installation. The
same test there doesn't trigger the kworker/flush behaviour.

# dumpe2fs -h /dev/sda3
dumpe2fs 1.46.5 (30-Dec-2021)
Filesystem volume name: <none>
Last mounted on: /
Filesystem UUID: 589968d6-2283-4903-8699-1b23f5c341a5
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype
needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg
dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 848640
Block count: 3390208
Reserved block count: 169510
Free blocks: 1823000
Free inodes: 735742
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 827
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8160
Inode blocks per group: 510
Flex block group size: 16
Filesystem created: Sun Apr 6 00:00:12 2014
Last mount time: Mon Oct 16 21:06:53 2023
Last write time: Mon Oct 16 21:06:47 2023
Mount count: 437
Maximum mount count: -1
Last checked: Sun Apr 6 00:00:12 2014
Check interval: 0 (<none>)
Lifetime writes: 1791 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
First orphan inode: 130778
Default directory hash: half_md4
Directory Hash Seed: 7d17214b-9585-4370-8c06-4236e449fa7f
Journal backup: inode blocks
Journal features: journal_incompat_revoke
Total journal size: 128M
Total journal blocks: 32768
Max transaction length: 32768
Fast commit length: 0
Journal sequence: 0x0114205c
Journal start: 8194

# mount |grep sda3
/dev/sda3 on / type ext4 (rw,noatime,seclabel)


2023-10-17 10:52:30

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #16 from Eduard Kohler ([email protected]) ---
Oops, not the right kernel version. The 6.4.16 one is the one that is working.
The one with the issue (and actually every 6.5 kernel I tested) is:
Linux version 6.5.4-100.fc37.x86_64
([email protected]) (gcc (GCC) 12.3.1 20230508 (Red
Hat 12.3.1-1), GNU ld version 2.38-27.fc37) #1 SMP PREEMPT_DYNAMIC Tue Sep 19
13:13:25 UTC 2023


2023-10-20 10:24:27

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #17 from Ojaswin Mujoo ([email protected]) ---
Hey Ivan, Eduard,

Thanks for all the data. The iostat output suggests that we are breaking up
the writes much more, as seen in the wareq-sz field, which is much lower on
the regressed kernel. Further, since the issue only shows up when striping is
used (i.e. in raid scenarios), it seems to have something to do with the
striped allocation path.

Eduard's reproducer seems much easier to try, so let me see if I'm able to
replicate this over a raid setup of HDDs.

Regards,
ojaswin


2023-10-21 10:08:07

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #18 from Eduard Kohler ([email protected]) ---
Hi Ojaswin,

If you want to replicate something similar to my setup, I'd better give you
more info about it, because it's not a desktop computer.

It's a very slow system by current standards, and I think that may explain why
there aren't more people complaining about this issue. In Ivan's case it seems
that after a small while it recovered, even with a lot of files; his setup
seems more powerful than mine. In my case, pulling a 75MB container with
podman didn't recover after 2 hours of 100% CPU.

The system has 4G RAM:
# free
              total        used        free      shared  buff/cache   available
Mem:        3977448      615760      410332       14580     2951356     3045304
Swap:       1564668           0     1564668

and an AMD G-T40E processor:
# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 20
model : 2
model name : AMD G-T40E Processor
stepping : 0
microcode : 0x5000119
cpu MHz : 801.798
cache size : 512 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 6
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb
rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf
pni monitor ssse3 cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm
sse4a misalignsse 3dnowprefetch ibs skinit wdt hw_pstate vmmcall arat npt lbrv
svm_lock nrip_save pausefilter
bugs : fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
spec_store_bypass
bogomips : 1999.82
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor : 1
vendor_id : AuthenticAMD
cpu family : 20
model : 2
model name : AMD G-T40E Processor
stepping : 0
microcode : 0x5000119
cpu MHz : 799.941
cache size : 512 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 6
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb
rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf
pni monitor ssse3 cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm
sse4a misalignsse 3dnowprefetch ibs skinit wdt hw_pstate vmmcall arat npt lbrv
svm_lock nrip_save pausefilter
bugs : fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
spec_store_bypass
bogomips : 1999.82
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

The underlying raid1 hardware has changed since 2013 from 1T -> 2T -> 4T HDDs.
The file system grew accordingly. During the 2T -> 4T step the disk layout was
converted from MBR to GPT:
# gdisk -l /dev/sdb
GPT fdisk (gdisk) version 1.0.9

Partition table scan:
MBR: protective
BSD: not present
APM: not present
GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sdb: 7814037168 sectors, 3.6 TiB
Model: ST4000VN008-2DR1
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): CD6E68D9-1A0F-40F7-9755-FDE136159F92
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 7814037134
Partitions will be aligned on 2048-sector boundaries
Total free space is 1537100 sectors (750.5 MiB)

Number Start (sector) End (sector) Size Code Name
1 2048 7812502048 3.6 TiB 8300 primary

And again the raid1 md0 partition table:
# gdisk -l /dev/md0
GPT fdisk (gdisk) version 1.0.9

Partition table scan:
MBR: protective
BSD: not present
APM: not present
GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/md0: 7812237857 sectors, 3.6 TiB
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): 46C6A662-69D1-492D-A31F-1121048F9FE3
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 7812237823
Partitions will be aligned on 2048-sector boundaries
Total free space is 3933 sectors (1.9 MiB)

Number Start (sector) End (sector) Size Code Name
1 2048 7812235904 3.6 TiB 8300 Linux filesystem

Last one is the EXT4 filesystem.


2023-11-15 17:28:06

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #25 from Ojaswin Mujoo ([email protected]) ---
Hey Eyal,

Would it be possible for you to try this experiment on an ext4 FS mounted
with the -o nodelalloc option? This should allocate the blocks immediately
instead of doing so during writeback.

Further, would you be able to provide the perf data file along with kallsyms
or something similar that can be used to resolve the symbols? If not, the
following output will also do:

perf report --no-children --stdio -i perf.data
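Alternatively, if resolving the symbols elsewhere is a problem, running

perf archive perf.data

should bundle the build-id objects that perf report needs into a tarball you
can attach alongside perf.data.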

Meanwhile I'll continue to look.

Thanks!


2023-11-15 22:39:46

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #26 from Eyal Lebedinsky ([email protected]) ---
Ojaswin,

For my understanding: the (dirty) buffers hold data blocks that need to be
written. Clearing dirty buffers does not involve fs activity, which was
already performed during the copy. Is this correct? If so, then why are we
talking about ext4?
I expect that my understanding is incorrect.
The copy itself is very fast, completing in a few seconds.

I assume that the requested mount option is safe. Is it safe to use this way
$ sudo mount -o remount,nodelalloc /data1
or should I also add the other current options, 'noatime' (from fstab) and
'stripe=640' (automatic)?
or should I reboot with this option in fstab?

Below is what I did now, is this what you need?

$ sudo perf record -p 1366946 sleep 60
[ perf record: Woken up 37 times to write data ]
[ perf record: Captured and wrote 9.079 MB perf.data (237339 samples) ]

$ sudo perf report --no-children --stdio -i perf.data
# To display the perf.data header info, please use --header/--header-only
options.
#
#
# Total Lost Samples: 0
#
# Samples: 237K of event 'cycles:P'
# Event count (approx.): 258097341907
#
# Overhead Command Shared Object Symbol
# ........ ............... .................
..........................................
#
53.90% kworker/u16:4+f [kernel.kallsyms] [k] ext4_mb_good_group
13.71% kworker/u16:4+f [kernel.kallsyms] [k] ext4_get_group_info
7.39% kworker/u16:4+f [kernel.kallsyms] [k]
ext4_mb_find_good_group_avg_frag_lists
6.96% kworker/u16:4+f [kernel.kallsyms] [k] __rcu_read_unlock
5.01% kworker/u16:4+f [kernel.kallsyms] [k] ext4_mb_scan_aligned
4.51% kworker/u16:4+f [kernel.kallsyms] [k] __rcu_read_lock
1.70% kworker/u16:4+f [kernel.kallsyms] [k] mb_find_order_for_block
0.89% kworker/u16:4+f [kernel.kallsyms] [k] xas_descend
0.85% kworker/u16:4+f [kernel.kallsyms] [k] filemap_get_entry
0.81% kworker/u16:4+f [kernel.kallsyms] [k] ext4_mb_regular_allocator
0.77% kworker/u16:4+f [kernel.kallsyms] [k] ext4_mb_load_buddy_gfp
0.54% kworker/u16:4+f [kernel.kallsyms] [k] xas_load
0.50% kworker/u16:4+f [kernel.kallsyms] [k] ext4_mb_unload_buddy
0.42% kworker/u16:4+f [kernel.kallsyms] [k] _raw_read_unlock
0.40% kworker/u16:4+f [kernel.kallsyms] [k] mb_find_extent
0.40% kworker/u16:4+f [kernel.kallsyms] [k] _raw_spin_trylock
0.25% kworker/u16:4+f [kernel.kallsyms] [k] __filemap_get_folio
0.24% kworker/u16:4+f [kernel.kallsyms] [k] _raw_read_lock
0.13% kworker/u16:4+f [kernel.kallsyms] [k] xas_start
0.12% kworker/u16:4+f [kernel.kallsyms] [k] mb_find_buddy
0.10% kworker/u16:4+f [kernel.kallsyms] [k] pagecache_get_page
0.08% kworker/u16:4+f [kernel.kallsyms] [k] folio_mark_accessed
0.05% kworker/u16:4+f [kernel.kallsyms] [k] folio_test_hugetlb
0.04% kworker/u16:4+f [kernel.kallsyms] [k] _raw_spin_unlock
0.02% kworker/u16:4+f [kernel.kallsyms] [k] __cond_resched
0.02% kworker/u16:4+f [kernel.kallsyms] [k]
perf_adjust_freq_unthr_context
0.01% kworker/u16:4+f [kernel.kallsyms] [k] gen8_irq_handler
0.01% kworker/u16:4+f [kernel.kallsyms] [k] _raw_spin_lock
0.01% kworker/u16:4+f [kernel.kallsyms] [k] native_irq_return_iret
0.01% kworker/u16:4+f [kernel.kallsyms] [k] native_write_msr
0.01% kworker/u16:4+f [kernel.kallsyms] [k] timekeeping_advance
0.01% kworker/u16:4+f [kernel.kallsyms] [k] _raw_spin_lock_irqsave
0.01% kworker/u16:4+f [kernel.kallsyms] [k] native_read_msr
<then many zero% items>


2023-11-16 03:15:45

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #27 from Ojaswin Mujoo ([email protected]) ---
Hey Eyal,

So the way most file systems handle their writes is:

1. Data is written to in-memory buffers, aka the pagecache.

2. When writeback/flush kicks in, the FS tries to group adjacent data together
and allocates disk blocks for it.

3. Finally, the data is sent to the lower levels, like the block layer ->
raid -> scsi etc., for the actual write.
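As a rough, purely illustrative way to watch steps 1 and 2 from userspace, the
amount of dirty and under-writeback page cache is visible in /proc/meminfo:

$ watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'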

The practice of delaying the actual allocation till writeback/flush is known
as delayed allocation, or delalloc, in ext4 and is on by default (other FSes
might have different names for it). This is why the ext4 allocation related
functions (ext4_mb_regular_allocator etc.) show up in your perf report of the
flusher thread.

With delalloc, we send bigger requests to the ext4 allocator, since we try to
group buffers together. With nodelalloc we disable this, so fs block
allocation happens while we are dirtying the buffers (in step 1 above) and we
only allocate as much as that write asked for, thus sending smaller requests
at a time. Since with delalloc your flusher seems to be taking a lot of time
in the ext4 allocation routines, I wanted to check whether a change in
allocation pattern via nodelalloc could help us narrow down the issue.

Using:

$ sudo mount -o remount,nodelalloc /data1

should be safe and preserve your other mount options so you can give it a try.

Lastly, thanks for the perf report; however, I'm sorry, I forgot to mention
that I was actually looking for the call graph, which can be collected as
follows:

$ sudo perf record -p 1234 -g sleep 60

Can you please share the report of the above command?

Thanks!
Ojaswin


2023-11-16 04:05:07

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #28 from Eyal Lebedinsky ([email protected]) ---
$ sudo perf record -p 1531839 -g sleep 60
[ perf record: Woken up 191 times to write data ]
[ perf record: Captured and wrote 47.856 MB perf.data (238391 samples) ]

$ sudo perf report --no-children --stdio -i perf.data
# To display the perf.data header info, please use --header/--header-only
options.
#
#
# Total Lost Samples: 0
#
# Samples: 238K of event 'cycles:P'
# Event count (approx.): 262920477580
#
# Overhead Command Shared Object Symbol
# ........ ............... .................
..........................................
#
58.48% kworker/u16:5+f [kernel.kallsyms] [k] ext4_mb_good_group
|
|--56.07%--ext4_mb_good_group
| |
| --55.85%--ext4_mb_find_good_group_avg_frag_lists
| ext4_mb_regular_allocator
| ext4_mb_new_blocks
| ext4_ext_map_blocks
| ext4_map_blocks
| ext4_do_writepages
| ext4_writepages
| do_writepages
| __writeback_single_inode
| writeback_sb_inodes
| __writeback_inodes_wb
| wb_writeback
| wb_workfn
| process_one_work
| worker_thread
| kthread
| ret_from_fork
| ret_from_fork_asm
|
--2.16%--ext4_get_group_info
ext4_mb_good_group
|
--2.16%--ext4_mb_find_good_group_avg_frag_lists
ext4_mb_regular_allocator
ext4_mb_new_blocks
ext4_ext_map_blocks
ext4_map_blocks
ext4_do_writepages
ext4_writepages
do_writepages
__writeback_single_inode
writeback_sb_inodes
__writeback_inodes_wb
wb_writeback
wb_workfn
process_one_work
worker_thread
kthread
ret_from_fork
ret_from_fork_asm

15.69% kworker/u16:5+f [kernel.kallsyms] [k] ext4_get_group_info
|
--15.10%--ext4_get_group_info
|
--15.05%--ext4_mb_good_group
|

--15.01%--ext4_mb_find_good_group_avg_frag_lists
ext4_mb_regular_allocator
ext4_mb_new_blocks
ext4_ext_map_blocks
ext4_map_blocks
ext4_do_writepages
ext4_writepages
do_writepages
__writeback_single_inode
writeback_sb_inodes
__writeback_inodes_wb
wb_writeback
wb_workfn
process_one_work
worker_thread
kthread
ret_from_fork
ret_from_fork_asm

9.51% kworker/u16:5+f [kernel.kallsyms] [k]
ext4_mb_find_good_group_avg_frag_lists
|
--9.26%--ext4_mb_find_good_group_avg_frag_lists
ext4_mb_regular_allocator
ext4_mb_new_blocks
ext4_ext_map_blocks
ext4_map_blocks
ext4_do_writepages
ext4_writepages
do_writepages
__writeback_single_inode
writeback_sb_inodes
__writeback_inodes_wb
wb_writeback
wb_workfn
process_one_work
worker_thread
kthread
ret_from_fork
ret_from_fork_asm

6.63% kworker/u16:5+f [kernel.kallsyms] [k] __rcu_read_unlock
|
|--4.56%--__rcu_read_unlock
| |
| --4.54%--ext4_get_group_info
| |
| --4.53%--ext4_mb_good_group
| |
|
--4.51%--ext4_mb_find_good_group_avg_frag_lists
|
ext4_mb_regular_allocator
| ext4_mb_new_blocks
| ext4_ext_map_blocks
| ext4_map_blocks
| ext4_do_writepages
| ext4_writepages
| do_writepages
|
__writeback_single_inode
| writeback_sb_inodes
| __writeback_inodes_wb
| wb_writeback
| wb_workfn
| process_one_work
| worker_thread
| kthread
| ret_from_fork
| ret_from_fork_asm
|
--2.08%--ext4_get_group_info
ext4_mb_good_group
|
--2.07%--ext4_mb_find_good_group_avg_frag_lists
ext4_mb_regular_allocator
ext4_mb_new_blocks
ext4_ext_map_blocks
ext4_map_blocks
ext4_do_writepages
ext4_writepages
do_writepages
__writeback_single_inode
writeback_sb_inodes
__writeback_inodes_wb
wb_writeback
wb_workfn
process_one_work
worker_thread
kthread
ret_from_fork
ret_from_fork_asm
4.32% kworker/u16:5+f [kernel.kallsyms] [k] __rcu_read_lock
|
--4.30%--__rcu_read_lock
|
--4.29%--ext4_get_group_info
|
--4.27%--ext4_mb_good_group
|

--4.26%--ext4_mb_find_good_group_avg_frag_lists

ext4_mb_regular_allocator
ext4_mb_new_blocks
ext4_ext_map_blocks
ext4_map_blocks
ext4_do_writepages
ext4_writepages
do_writepages

__writeback_single_inode
writeback_sb_inodes
__writeback_inodes_wb
wb_writeback
wb_workfn
process_one_work
worker_thread
kthread
ret_from_fork
ret_from_fork_asm

1.98% kworker/u16:5+f [kernel.kallsyms] [k] ext4_mb_scan_aligned
|
--1.98%--ext4_mb_scan_aligned
ext4_mb_regular_allocator
ext4_mb_new_blocks
ext4_ext_map_blocks
ext4_map_blocks
ext4_do_writepages
ext4_writepages
do_writepages
__writeback_single_inode
writeback_sb_inodes
__writeback_inodes_wb
wb_writeback
wb_workfn
process_one_work
worker_thread
kthread
ret_from_fork
ret_from_fork_asm

0.50% kworker/u16:5+f [kernel.kallsyms] [k] mb_find_order_for_block
0.44% kworker/u16:5+f [kernel.kallsyms] [k] ext4_mb_load_buddy_gfp
0.38% kworker/u16:5+f [kernel.kallsyms] [k] xas_descend
0.33% kworker/u16:5+f [kernel.kallsyms] [k] ext4_mb_regular_allocator
0.30% kworker/u16:5+f [kernel.kallsyms] [k] filemap_get_entry
0.24% kworker/u16:5+f [kernel.kallsyms] [k] xas_load
0.18% kworker/u16:5+f [kernel.kallsyms] [k] ext4_mb_unload_buddy
0.16% kworker/u16:5+f [kernel.kallsyms] [k] mb_find_extent
0.15% kworker/u16:5+f [kernel.kallsyms] [k] _raw_spin_trylock
0.14% kworker/u16:5+f [kernel.kallsyms] [k] _raw_read_unlock
0.10% kworker/u16:5+f [kernel.kallsyms] [k] _raw_read_lock
0.08% kworker/u16:5+f [kernel.kallsyms] [k] __filemap_get_folio
0.07% kworker/u16:5+f [kernel.kallsyms] [k] mb_find_buddy
0.05% kworker/u16:5+f [kernel.kallsyms] [k] xas_start
0.03% kworker/u16:5+f [kernel.kallsyms] [k] pagecache_get_page
0.03% kworker/u16:5+f [kernel.kallsyms] [k] _raw_spin_unlock
0.02% kworker/u16:5+f [kernel.kallsyms] [k]
perf_adjust_freq_unthr_context
0.02% kworker/u16:5+f [kernel.kallsyms] [k] folio_mark_accessed
0.02% kworker/u16:5+f [kernel.kallsyms] [k] folio_test_hugetlb
0.01% kworker/u16:5+f [kernel.kallsyms] [k] native_write_msr
0.01% kworker/u16:5+f [kernel.kallsyms] [k] _raw_spin_lock
0.01% kworker/u16:5+f [kernel.kallsyms] [k] __cond_resched
0.01% kworker/u16:5+f [kernel.kallsyms] [k] timekeeping_advance
0.01% kworker/u16:5+f [kernel.kallsyms] [k] native_irq_return_iret
0.00% kworker/u16:5+f [kernel.kallsyms] [k] gen8_irq_handler


2023-11-16 04:31:11

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #29 from Eyal Lebedinsky ([email protected]) ---
I have now repeated the test with the new option.

The result is that very few dirty buffers are recorded, but the rsync that
usually takes a few seconds now stalls for minutes at a time.

The machine also stutters, for example as I open a small file on /data1.

The following perf run is of the rsync.

$ sudo sysctl vm.drop_caches=3
$ sudo mount -o remount,nodelalloc /data1
$ (d="`date`" ; sudo rsync -aHSK --stats --progress --checksum-choice=none
--no-compress -W /data2/no-backup/old-backups/tapes/13/file.1.data
/data1/no-backup/old-backups/tapes-again/13/ ; echo "`date`: started $d")
$ sudo perf record -p 1531839 -g sleep 60
$ sudo perf report --no-children --stdio -i perf.data

I will add the logs as an attachment (I hope).


2023-11-16 04:33:48

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #30 from Eyal Lebedinsky ([email protected]) ---
Created attachment 305411
--> https://bugzilla.kernel.org/attachment.cgi?id=305411&action=edit
perf test logs

perf log for the rsync, see comment #29


2023-11-16 19:33:50

by bugzilla-daemon

Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #31 from Ojaswin Mujoo ([email protected]) ---
Hey Eyal,

Thanks for running the tests. So it definitely feels like there is something
going on on the ext4 allocator side. Ideally we shouldn't be taking this much
time to find blocks in the FS, especially since the older kernels complete it
quickly, so it can't just be a low-free-space issue.

I spent a good amount of time today trying to replicate it, to no avail, so
I'll have to ask you for some more data. Thanks to the perf report you've
shared, I have an idea of some kernel areas we can probe to understand what's
truly going on. I'll just share the perf probe and perf record commands below.
Please let me know if it's not possible to run these on the affected distro,
so we can think of some other way :)

** Adding probe points **

sudo perf probe -a "ext4_mb_find_good_group_avg_frag_lists order"

sudo perf probe -a "ext4_mb_find_good_group_avg_frag_lists:18 cr
iter->bb_group"

sudo perf probe -a "ext4_mb_good_group:20 free fragments ac->ac_g_ex.fe_len
ac->ac_2order"

** Recording using the probe point **

sudo perf record -e probe:ext4_mb_find_good_group_avg_frag_lists -e
probe:ext4_mb_find_good_group_avg_frag_lists_L18 -e
probe:ext4_mb_good_group_L20 -e ext4:ext4_mballoc_alloc -g -p <pid> sleep 20

** Saving the output **

sudo perf script -i perf.data > perf-probe.log

I tried these commands on the mainline kernel, but I hope they haven't changed
too much for your kernel. You *might* need to change the line numbers of the
probes if that's the case, and the event names in the record command would
then change accordingly. Please let me know if you need any more info from my
end to get this done, I'd be glad to help.

Thanks again and regards,
ojaswin


2023-11-17 00:01:35

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #32 from Eyal Lebedinsky ([email protected]) ---
Looks like a basic misunderstanding on my side, or maybe more preparation is
needed? What am I missing?

$ sudo perf probe -a "ext4_mb_find_good_group_avg_frag_lists order"
Failed to find the path for the kernel: Invalid ELF file
Error: Failed to add events.

$ uname -a
Linux e7.eyal.emu.id.au 6.5.10-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Nov
2 19:59:55 UTC 2023 x86_64 GNU/Linux

$ ls -l /boot/*6.5.10*
-rw------- 1 root root 8727059 Nov 2 11:00
/boot/System.map-6.5.10-200.fc38.x86_64
-rw-r--r-- 1 root root 264245 Nov 2 11:00
/boot/config-6.5.10-200.fc38.x86_64
-rw------- 1 root root 39368262 Nov 9 10:55
/boot/initramfs-6.5.10-200.fc38.x86_64.img
lrwxrwxrwx 1 root root 46 Nov 9 10:55
/boot/symvers-6.5.10-200.fc38.x86_64.xz ->
/lib/modules/6.5.10-200.fc38.x86_64/symvers.xz
-rwxr-xr-x 1 root root 14551752 Nov 2 11:00
/boot/vmlinuz-6.5.10-200.fc38.x86_64

I strace'd the command, and toward the end I see this (I have vmlinuz, not
vmlinux - related?).
These are the first failed 'openat()' calls in the trace.

openat(AT_FDCWD, "vmlinux", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/boot/vmlinux", O_RDONLY) = -1 ENOENT (No such file or
directory)
openat(AT_FDCWD, "/boot/vmlinux-6.5.10-200.fc38.x86_64", O_RDONLY) = -1 ENOENT
(No such file or directory)
openat(AT_FDCWD, "/usr/lib/debug/boot/vmlinux-6.5.10-200.fc38.x86_64",
O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/modules/6.5.10-200.fc38.x86_64/build/vmlinux", O_RDONLY)
= -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/debug/lib/modules/6.5.10-200.fc38.x86_64/vmlinux",
O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/debug/boot/vmlinux-6.5.10-200.fc38.x86_64.debug",
O_RDONLY) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD,
"/root/.debug/.build-id/d3/6a8ae40f88e310401726e275aa1940b279babd",
{st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
openat(AT_FDCWD,
"/root/.debug/.build-id/d3/6a8ae40f88e310401726e275aa1940b279babd/kallsyms",
O_RDONLY) = 3

Earlier I see this call succeed:

newfstatat(AT_FDCWD, "/lib/modules/6.5.10-200.fc38.x86_64/vmlinuz",
{st_mode=S_IFREG|0755, st_size=14551752, ...}, 0) = 0

which is identical to the one in /boot.

Should I boot a different kernel type?

Remember that I am not a dev...


2023-11-17 00:44:45

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #33 from Ojaswin Mujoo ([email protected]) ---
Hey Eyal,

Yes, you are right, it is not able to find a vmlinux with debug symbols. You
can either download the debuginfo package for the corresponding kernel, which
should place the vmlinux in the required location, or simply have a "vmlinux"
with debug symbols in the directory where you are running perf probe.

Also, once this error is resolved you *might* encounter something like "Error:
Failed to show lines." To fix it you can install the kernel source code package
or just use the -s flag to point to the source code (example of how I use it):

sudo perf probe -s ~/workspace/9p -a "ext4_mb_find_good_group_avg_frag_lists order"

Let me know if it still doesn't work.

Thanks!
Ojaswin


2023-11-17 01:39:28

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #34 from Eyal Lebedinsky ([email protected]) ---
Created attachment 305414
--> https://bugzilla.kernel.org/attachment.cgi?id=305414&action=edit
detailed perf log

Attachment for comment #34


2023-11-17 01:41:03

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #35 from Eyal Lebedinsky ([email protected]) ---
$ sudo dnf debuginfo-install kernel
Installing:
kernel-debuginfo
Installing dependencies:
kernel-debuginfo-common-x86_64

$ sudo perf probe -a "ext4_mb_find_good_group_avg_frag_lists order"

$ sudo perf probe -a "ext4_mb_find_good_group_avg_frag_lists:18 cr
iter->bb_group"

$ sudo perf record -e probe:ext4_mb_find_good_group_avg_frag_lists -e
probe:ext4_mb_find_good_group_avg_frag_lists_L18 -e
probe:ext4_mb_good_group_L20 -e ext4:ext4_mballoc_alloc -g -p 2789843 sleep 20

$ sudo perf script -i perf.data > perf-probe.log
[ perf record: Woken up 1395 times to write data ]
Warning:
Processed 24640612 events and lost 22 chunks!

Check IO/CPU overload!

[ perf record: Captured and wrote 4725.791 MB perf.data (19187380 samples) ]

$ sudo perf script -i perf.data > perf-probe.log
Warning:
Processed 24640612 events and lost 22 chunks!

Check IO/CPU overload!

$ ls -l perf-probe.log
-rw-r--r-- 1 eyal eyal 39024390135 Nov 17 12:31 perf-probe.log

I am limited to uploading much less than this as an attachment
(file size limit: 5120 KB).

I managed to include only this many lines from the head:

$ wc -l perf-probe-head.log
51594 perf-probe-head.log

See attachment https://bugzilla.kernel.org/attachment.cgi?id=305414


2023-11-17 15:39:40

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #36 from Ojaswin Mujoo ([email protected]) ---
Hey Eyal,

So the trace data has given me an idea of what's going on. Basically, in ext4
we maintain lists of FS block groups (BGs), where each list holds BGs based on
the order of their free block count (a BG with 64 free blocks goes in the
order 6 list, one with 640 free blocks goes in the order 9 list, etc). In our
case we are trying to allocate stripe-size chunks at a time, i.e. 640 blocks
or roughly 2.5 MB, and ext4 looks at the order 9 list to find a BG fit to
satisfy the request.

Unfortunately there seem to be a lot of BGs in the order 9 list (> 1000), but
most of them don't have enough free blocks to satisfy the request, so we keep
looping and calling ext4_mb_good_group() on each of them to see if any is good
enough. Once we do find a good enough BG, due to striping we actually look for
blocks that are specifically aligned to the stripe size, and once we don't
find them we just start looping through the list again from the beginning (!!).
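
To make that scan pattern concrete, here is a toy, self-contained sketch of
the loop just described. This is not the kernel code: good_group() and
scan_aligned() are simplified stand-ins for ext4_mb_good_group() and
ext4_mb_scan_aligned(), and the group counts are made up purely for
illustration.

/*
 * Toy model of the scan described above (NOT kernel code).
 * good_group() and scan_aligned() stand in for ext4_mb_good_group()
 * and ext4_mb_scan_aligned(); the numbers are only illustrative.
 */
#include <stdbool.h>
#include <stdio.h>

#define NGROUPS  1000   /* "> 1000 BGs" sitting in the order 9 list */
#define GOAL_LEN  640   /* stripe-sized request: 640 blocks */

struct toy_group {
	unsigned int free_blocks;   /* free blocks in the group */
	bool has_aligned_extent;    /* stripe-aligned run of GOAL_LEN blocks? */
};

/* Enough free blocks overall? (rough ext4_mb_good_group() stand-in) */
static bool good_group(const struct toy_group *bg)
{
	return bg->free_blocks >= GOAL_LEN;
}

/* Stripe-aligned extent available? (rough ext4_mb_scan_aligned() stand-in) */
static bool scan_aligned(const struct toy_group *bg)
{
	return bg->has_aligned_extent;
}

int main(void)
{
	static struct toy_group order9[NGROUPS];
	unsigned long checks = 0;

	/* Groups have 512..1023 free blocks (hence the order 9 list),
	 * but none of them has a stripe-aligned run to offer. */
	for (int i = 0; i < NGROUPS; i++) {
		order9[i].free_blocks = 512 + (i % 512);
		order9[i].has_aligned_extent = false;
	}

	/* A failed aligned scan restarts the walk from the top of the
	 * list, so the same groups get re-checked on every pass.  The
	 * pass count is capped here only so the demo terminates. */
	for (int pass = 0; pass < 5; pass++) {
		for (int i = 0; i < NGROUPS; i++) {
			checks++;
			if (!good_group(&order9[i]))
				continue;
			if (scan_aligned(&order9[i]))
				goto found;
			break;  /* back to the start of the list */
		}
	}
found:
	printf("group checks performed: %lu\n", checks);
	return 0;
}

With no stripe-aligned extent available anywhere, every pass walks the same
groups again, which is essentially the repeated re-scanning the probe data
shows.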

Although I have a good idea now, I'm not able to point my finger at the exact
change in 6.5 that might have caused this. We did change the allocator to some
extent and it might be related to that, but we need to dig a bit deeper to
confirm.

Would it be possible to share the same perf record again? This time I'm adding
a few more probes and removing -g so we can fit more in the 5 MB limit, and
also adding the commands for Linux 6.4 so we can compare what's changed:

Linux 6.5+:

Probe adding commands:

sudo perf probe -a "ext4_mb_find_good_group_avg_frag_lists order"
sudo perf probe -a "ext4_mb_find_good_group_avg_frag_lists:18 cr
iter->bb_group"
sudo perf probe -a "ext4_mb_good_group:20 free fragments ac->ac_g_ex.fe_len
ac->ac_2order"
sudo perf probe -a "ext4_mb_scan_aligned:26 i max"

Record command:

perf record -e probe:ext4_mb_find_good_group_avg_frag_lists_L18 \
  -e probe:ext4_mb_good_group_L20 -e probe:ext4_mb_find_good_group_avg_frag_lists \
  -e probe:ext4_mb_scan_aligned_L26 -e ext4:ext4_mballoc_alloc -p <pid> sleep 20


Linux 6.4.x:

Probe adding commands:

sudo perf probe -a "ext4_mb_choose_next_group_cr1:25 i iter->bb_group"
sudo perf probe -a "ext4_mb_good_group:20 free fragments ac->ac_g_ex.fe_len
ac->ac_2order"
sudo perf probe -a "ext4_mb_scan_aligned:26 i max"

Record command:

sudo perf record -e probe:ext4_mb_choose_next_group_cr1_L25 \
  -e probe:ext4_mb_good_group_L20 -e probe:ext4_mb_scan_aligned_L26 \
  -e ext4:ext4_mballoc_alloc -p <pid> sleep 20

Thanks again for all your help on this!

Regards,
ojaswin


2023-11-17 23:52:25

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #37 from Eyal Lebedinsky ([email protected]) ---
Finally got it.

Kernel 6.5 was easy. The kernel-6.4 debuginfo could not be found with 'dnf'
and I searched for it for a long time, finally locating it on koji.

Then the third probe was wrong, and after some learning I discovered
$ sudo perf probe --line "ext4_mb_scan_aligned"
     21         while (i < EXT4_CLUSTERS_PER_GROUP(sb)) {
     22                 if (!mb_test_bit(i, bitmap)) {
     23                         max = mb_find_extent(e4b, i, sbi->s_stripe, &ex);
     24                         if (max >= sbi->s_stripe) {
     25                                 ac->ac_found++;
     26                                 ex.fe_logical = 0xDEADF00D; /* debug value */
                                        ac->ac_b_ex = ex;
     28                                 ext4_mb_use_best_found(ac, e4b);
     29                                 break;
                                }
                        }
     32                 i += sbi->s_stripe;
                }

and decided to pick line 24 (not the original 26). Is this right?

I will now upload the reports.


2023-11-17 23:53:55

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #38 from Eyal Lebedinsky ([email protected]) ---
Created attachment 305417
--> https://bugzilla.kernel.org/attachment.cgi?id=305417&action=edit
more perf reports

for comment #37


2023-11-17 23:57:29

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #39 from Eyal Lebedinsky ([email protected]) ---
Created attachment 305418
--> https://bugzilla.kernel.org/attachment.cgi?id=305418&action=edit
more perf reports

for comment #37


2023-11-18 12:10:11

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #40 from Ojaswin Mujoo ([email protected]) ---
Hey Eyal,

Thanks for the data, the perf probes you added were correct!

I see that the problem is as I suspected: we keep trying to find aligned
blocks in ext4 when probably none exist. Aligned allocation is currently only
done as an optimization when the stripe mount option is passed. Right now we
don't seem to fall back to normal allocation if aligned allocation doesn't
work, and this causes the very long, seemingly infinite looping.

I can try to work on a patchset that fixes this; however, as a temporary fix
you can continue with the stripe mount option turned off for ext4. This will
instruct ext4 to just use normal allocation rather than aligned allocation.

One point to note is that -o stripe=xyz is sometimes automatically added
during mount even when we don't pass it. You can look at Comment #6, #7 and #8
in this bug for more info. To confirm it's off you can look into the
/proc/fs/ext4/<dev>/options file, which lists all the currently active mount
options; you shouldn't see stripe there.
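
If it helps, here is a tiny standalone sketch of that check. This is not an
existing tool; the md127 name in the path is only a placeholder for whatever
entry your filesystem has under /proc/fs/ext4/.

/* Sketch: dump the active ext4 options for one filesystem and flag
 * whether a stripe= option is currently in effect.  "md127" below is
 * a placeholder; use your device's entry under /proc/fs/ext4/. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *path = "/proc/fs/ext4/md127/options";   /* placeholder */
	char line[256];
	int striped = 0;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		fputs(line, stdout);            /* show the active options */
		if (strstr(line, "stripe="))
			striped = 1;            /* aligned allocation still on */
	}
	fclose(f);
	fputs(striped ? "-> stripe option is active\n"
		      : "-> no stripe option active\n", stdout);
	return 0;
}

The same check is of course just a grep away; the point is only where to look.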

Further, this is not something that changed between 6.4 and 6.5; however, it
seems the allocator changes in 6.5 made it even harder to come out of this
loop, thus prolonging the time taken to flush.

Also, I just wanted to check whether you have any non-prod setup where you'd
be open to compiling a kernel with patches to see if we are able to fix the issue.

Lastly, thank you so much for all the probe data and logs, it's been a huge
help :)

Regards,
ojaswin


2023-11-18 13:18:08

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #41 from Eyal Lebedinsky ([email protected]) ---
OK, so how do I disable the stripe=640 that was automatically added?
Is there a mount option to disable it?
I would rather not futz with tune2fs.

RE kernel build: I used to do this often, maybe 30 years ago, but on a
workstation, not on the server. I expect the build to be simple, but it is the
risk of a bad patch that worries me. I would rather not do that.

BTW, while you are looking at the cache issue, what are your thoughts on the
fact that the array shows (iostat -x) a very high w_await when the members
show much lower values? Is this normal?

Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz
w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s
%drqm d_await dareq-sz f/s f_await aqu-sz %util
16:04:08 md127 0.20 0.80 0.00 0.00 10.50 4.00
9.80 739.20 0.00 0.00 1495.44 75.43 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 14.66 4.51
16:04:08 sdb 2.30 187.20 44.50 95.09 30.74 81.39
6.80 345.20 79.70 92.14 29.72 50.76 0.00 0.00 0.00
0.00 0.00 0.00 0.60 14.33 0.28 1.26
16:04:08 sdc 1.90 138.80 32.80 94.52 16.63 73.05
6.60 295.60 67.50 91.09 13.06 44.79 0.00 0.00 0.00
0.00 0.00 0.00 0.60 11.33 0.12 1.05
16:04:08 sdd 2.30 194.40 46.30 95.27 3.70 84.52
2.90 138.80 32.00 91.69 3.34 47.86 0.00 0.00 0.00
0.00 0.00 0.00 0.60 4.67 0.02 0.54
16:04:08 sde 1.80 204.00 49.20 96.47 14.89 113.33
4.10 302.40 71.70 94.59 13.27 73.76 0.00 0.00 0.00
0.00 0.00 0.00 0.60 6.17 0.08 1.66
16:04:08 sdf 1.90 97.20 22.40 92.18 14.63 51.16
4.70 212.00 48.50 91.17 19.70 45.11 0.00 0.00 0.00
0.00 0.00 0.00 0.60 11.50 0.13 1.98
16:04:08 sdg 2.00 212.00 51.00 96.23 7.80 106.00
4.70 279.20 65.30 93.29 8.32 59.40 0.00 0.00 0.00
0.00 0.00 0.00 0.60 8.00 0.06 1.86
16:04:08 sdh 2.20 213.20 51.10 95.87 21.73 96.91
4.80 315.60 74.30 93.93 27.77 65.75 0.00 0.00 0.00
0.00 0.00 0.00 0.60 14.33 0.19 2.05


2023-11-20 06:11:13

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #42 from Ojaswin Mujoo ([email protected]) ---
Hey Eyal,

stripe size can be set to 0 by remounting the FS as follows:

mount -o remount,stripe=0 <dev>

This should allow us to bypass aligned allocation.

Regarding building the kernel, sure I understand your concerns.

Regarding the util of the RAID array, I don't think I have enough experience
dealing with RAID devices to comment on whether it is normal behavior or not.
Still, I'll try to look into it a bit more to see if I can think of something.

Regards,
ojaswin


2023-11-20 08:49:06

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #43 from Eyal Lebedinsky ([email protected]) ---
Thanks Ojaswin,

The mount option did the trick, things are drastically faster now.
I am now finishing the actual copy that I was unable to complete.

The reason I ask about the raid is that it writes so slowly that I wonder
whether it is just the looping thread or maybe something more.
The large difference between what the array reports and what the members show
(in iostat -x) is beyond what I expected as the array's write amplification.

I will keep watching this thread.

Regards,
Eyal


2023-11-27 17:31:59

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #44 from Eduard Kohler ([email protected]) ---
Hi Ojaswin,

I'm still interested in a patch for this. I may find a way to give it a try.
This aligned allocation must work for most people, otherwise more of them
would be here.
Also, as the FS is automagically mounted with a stripe value (based on RAID
info), it would be nice to avoid future misbehaviours like this.


2023-12-03 07:33:58

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #45 from Ojaswin Mujoo ([email protected]) ---
Hi Eduard,

Sure. I'm on vacation this week; before that I found some time to try to
replicate this in a simulated environment, and I'll be sending a patch once
I'm back.

Regards
Ojaswin


2023-12-15 11:27:43

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #46 from Ojaswin Mujoo ([email protected]) ---
Hello,

I've posted a patch that, as per my understanding of the issue, should be able
to fix the halt in kernels 6.5 and newer.

As I was not able to completely replicate the issue, I relied on the probes
and other logs collected to root-cause this, so I'd really appreciate it if
the folks here could help test this on the setups where the issue was being
observed.

Link to patch:
https://lore.kernel.org/linux-ext4/[email protected]/T/#t

Regards,
ojaswin


2023-12-23 00:57:03

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

[email protected] changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]

--- Comment #47 from [email protected] ---
I've posted an easy reproducer here:
https://marc.info/?l=linux-raid&m=170327844709957&w=2

You can adjust the VM settings to make sure the whole tree fits in the RAM
cache, depending on how much RAM your machine has.

By default the FS is mounted with stripe=1280 because it's on a raid6.
Remounting with stripe=0 works around the problem. Excellent!

Also, this affects all 6.* kernels, as I said in the above post.


2023-12-23 01:48:46

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

Andreas Dilger ([email protected]) changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |adilger.kernelbugzilla@dilg
| |er.ca

--- Comment #48 from Andreas Dilger ([email protected]) ---
Independent of the fixes to the mballoc code to improve the allocation
performance, I'm wondering about the "RAID stride" values in use here.
The "stride" value is intended to be the size of one complete set of
disks (e.g. 128KiB chunk size * 8 data disks = 1MiB). The filesystem
doesn't see the parity disks, so the number of those disks does not
matter to ext4.


> RAID stride: 32752

> In my case, I have an EXT4 partition over an mdadm raid 1 array of two HDD.
> RAID stride: 32745


It seems in all these cases that the stripe/stride is strange. I can't
see any value to setting stride to (almost) 128MB, especially not on a
RAID-1 system. Were these values automatically generated by mke2fs,
or entered manually? If manually, why was that value chosen? If there
is something unclear in the documentation it should be fixed, and the
same if there is something wrong in mke2fs detection of the geometry.

> By default the FS is mounted with stripe=1280 because it's on a raid6.
> Remounting with stripe=0 works around the problem. Excellent!

Carlos, how many data disks in this system? Do you have 5x 256KiB or
10x 128KiB *data* disks, plus 2 *parity* disks?


2023-12-23 12:46:37

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #49 from Eduard Kohler ([email protected]) ---
Hi Ojaswin,

I gave your patch a try and indeed it fixes the issue.
I started by compiling a 'vanilla' 6.5.13 Fedora kernel (the latest available
on F37) and tested it, just to have a baseline. As expected, the issue kicked in.
I added your patch, recompiled, and the issue was gone.

Andreas, I am positively sure that I never set or changed any stride value on
my own, mostly because I have no idea what it is and how it interacts with the
FS or RAID. I don't know if it can be changed or if it is set only once. So it
was chosen by whatever tool I used to create the partition, the RAID, or the
ext4 FS.


2023-12-23 14:22:22

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #50 from Eduard Kohler ([email protected]) ---
After Andreas's comment, I took a look at the RAID wiki and changed the
stripe/stride values:

tune2fs -E stride=32,stripe-width=32 /dev/md0p1

which may or may not be optimal for RAID1.

I am running a 6.5.12 kernel (the Fedora-released one, not the previously
compiled one), and this also makes the issue go away.


2023-12-23 15:47:20

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #51 from [email protected] ---
[email protected] ([email protected]) wrote on Fri, Dec 22,
2023 at 10:48:39PM -03:
> --- Comment #48 from Andreas Dilger ([email protected]) ---
> [...]
>
> > By default the FS is mounted with stripe=1280 because it's on a raid6.
> > Remounting with stripe=0 works around the problem. Excellent!
>
> Carlos, how many data disks in this system? Do you have 5x 256KiB or
> 10x 128KiB *data* disks, plus 2 *parity* disks?

There are 10 data disks. The stride, determined automatically by mke2fs, is
correct since the chunk is 512KiB.
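
For reference, that geometry lines up exactly with the stripe=1280 mount value
quoted above. A quick standalone arithmetic sketch (assuming the usual 4 KiB
ext4 block size; the variable names are only for illustration):

/* Arithmetic sketch: how stripe=1280 follows from a 512 KiB chunk and
 * 10 data disks, assuming 4 KiB ext4 blocks. */
#include <stdio.h>

int main(void)
{
	unsigned int chunk_kib  = 512;   /* RAID chunk size per data disk */
	unsigned int data_disks = 10;    /* parity disks excluded */
	unsigned int block_kib  = 4;     /* ext4 block size */

	unsigned int stride       = chunk_kib / block_kib;   /* 128 blocks  */
	unsigned int stripe_width = stride * data_disks;     /* 1280 blocks */

	printf("stride=%u blocks, stripe-width=%u blocks (%u KiB per full stripe)\n",
	       stride, stripe_width, stripe_width * block_kib);
	return 0;
}

That prints stride=128 blocks and stripe-width=1280 blocks (5120 KiB per full
data stripe).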


2023-12-29 20:35:32

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

Matthew Stapleton ([email protected]) changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]

--- Comment #52 from Matthew Stapleton ([email protected]) ---
I'm a bit late to reporting this, but it might be related to this bug report.

Since noticing this 100% CPU flush problem on my computer with a non-RAID
Samsung SSD 990 PRO NVMe and kernel 6.6.3 (and now 6.6.8), I decided to do a
git bisect and seemed to track the problem down to commit
7e170922f06bf46effa7c57f6035fc463d6edc7e, if that helps, although I haven't
tested whether disabling CR1_5 also resolves the problem. Running "tar xf
linux-6.6.8.tar.xz" is enough to trigger 100% CPU for a few seconds after tar
finishes on my system, although it doesn't noticeably trigger the problem on
my server with RAID 6 and hard drives. It also looks like my ext4 filesystem
is mounted with stripe=32733 according to /proc/mounts, although dumpe2fs -h
doesn't show any stripe settings on the filesystem.

I'll try the patch and see if it fixes the problem.


2023-12-29 21:01:09

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #53 from Matthew Stapleton ([email protected]) ---
Oh I just rechecked and I do have RAID stride set to 32733 on my filesystem
which I think I did to optimise for SSD blocks. Also, another test which makes
the flush take a really long time to complete on my system is "cp -ai
linux-6.6.8 linux-6.6.8_orig".


2023-12-29 21:06:04

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #54 from Matthew Stapleton ([email protected]) ---
(In reply to Ojaswin Mujoo from comment #46)
> [...]
> Link to patch:
> https://lore.kernel.org/linux-ext4/[email protected]/T/#t

It looks like that patch has fixed the problem for my system. Thanks. I still
haven't checked if disabling CR1_5 from
7e170922f06bf46effa7c57f6035fc463d6edc7e also resolves the problem, although
I'm still planning to try that out as well.


2023-12-29 22:11:09

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #55 from Matthew Stapleton ([email protected]) ---
It looks like reversing the changes made in
7e170922f06bf46effa7c57f6035fc463d6edc7e also fixed the problem (I didn't
notice before that the names were changed in the very next commit as well).

In fs/ext4/mballoc.c, commenting out new_cr = CR_GOAL_LEN_SLOW in
ext4_mb_choose_next_group_goal_fast wasn't enough to fix it, so I then did the
following as well, which fixed the problem:
- commented out the call to ext4_mb_choose_next_group_best_avail in
  ext4_mb_choose_next_group
- commented out the code to "Reset goal length to original goal length before
  falling into CR_GOAL_LEN_SLOW"


2023-12-29 22:13:16

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #56 from Matthew Stapleton ([email protected]) ---
Correction: It was new_cr = CR_BEST_AVAIL_LEN that I commented out


2023-12-29 23:07:37

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #57 from Matthew Stapleton ([email protected]) ---
Also, it's a bit hard to tell, but even with the patch from
https://lore.kernel.org/linux-ext4/[email protected]/T/#t
I think flush might still be using more CPU compared to when I disable
ext4_mb_choose_next_group_best_avail, even though it isn't stalling anymore.


2023-12-29 23:22:47

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #58 from Matthew Stapleton ([email protected]) ---
Maybe it's just because different code is being executed, but running the
tests a few more times, top does seem to show less kernel CPU usage while
extracting thunderbird-115.5.0.source.tar.xz
(https://archive.mozilla.org/pub/thunderbird/releases/115.5.0/source/thunderbird-115.5.0.source.tar.xz)
with CR_BEST_AVAIL_LEN disabled, even though the actual extract operation
takes the same amount of time with both patch methods and all the CPU usage
goes away in both cases once tar is finished.


2024-01-02 04:44:16

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #59 from Ojaswin Mujoo ([email protected]) ---
Hi Matthew,

Thanks for testing out the patch and for sharing the results with and without
CR_BEST_AVAIL_LEN commented out. Would it be possible for you to add the perf
probes mentioned in comment 36 [1] and share the result? That would be helpful
in understanding why CR_BEST_AVAIL_LEN is getting stuck. Basically, I am
interested in seeing how the scan happens in CR_GOAL_LEN_FAST vs
CR_BEST_AVAIL_LEN.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=217965#c36

Also, another thing I'd request you to try: instead of commenting out the
code, can you run

$ echo "0" > /sys/fs/ext4/<dev>/mb_best_avail_max_trim_order

and rerun the test? This disables most of the code but keeps some of the
initial logic. It might help narrow down whether the issue is in the initial
logic or in the disabled code.

Thanks again,
ojaswin


2024-01-02 06:44:24

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #60 from Matthew Stapleton ([email protected]) ---
It looks like echo "0" > /sys/fs/ext4/<dev>/mb_best_avail_max_trim_order stops
flush from getting stuck, but the flush kworker still uses more cpu during the
write operation. Hopefully I can get the perf results over the next few days.


2024-01-04 05:38:42

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #61 from Ojaswin Mujoo ([email protected]) ---
Hi Matthew, thanks for confirming. So, as pointed out in comment 9 [1], the
above steps to disable the CR_BEST_AVAIL_LEN code did not fix the issue there.
My suspicion is that this issue can occur in either the CR_GOAL_LEN_FAST or
the CR_BEST_AVAIL_LEN lookup, depending on the block groups being searched.
For you it was probably occurring during the CR_BEST_AVAIL_LEN lookup, and
hence disabling that code fixed it.

Further, as Carlos pointed out above, they are able to see this in all 6.*
kernels, which means it was happening before CR_BEST_AVAIL_LEN was introduced;
however, it seems to be much easier to trigger in 6.5+ kernels. Now, assuming
the above theory is correct, in cases where this is triggered from
CR_GOAL_LEN_FAST it should ideally be easy to hit in pre-6.5 kernels as well,
but it isn't, which makes me think there might be some other related changes
in 6.5 that make it easier to trigger.

I'll try to play around a bit more with this. Also, as for the higher CPU
usage, how high are we talking? CR_BEST_AVAIL_LEN does add some extra cycles
in exchange for generally faster allocation on fragmented filesystems;
however, since you have disabled it, we ideally shouldn't be seeing that.
Also, does the CPU util consistently drop when you comment out that code?

[1] https://bugzilla.kernel.org/show_bug.cgi?id=217965#c9


2024-01-04 22:31:16

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #62 from Matthew Stapleton ([email protected]) ---
With top at 1-second intervals during the thunderbird source extract, I can
see the flush process at around 70% to 100% for two 1-second intervals during
the tar extract. I don't think this much CPU usage was present with
CR_BEST_AVAIL_LEN commented out. Hopefully the perf probes will show where the
real usage is. I still haven't tried out the perf probes, but I'm planning on
doing that hopefully sometime in the next few days.


2024-01-06 02:11:32

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #63 from Matthew Stapleton ([email protected]) ---
I can't seem to trigger the problem now, even with all the patches disabled
and /sys/fs/ext4/<dev>/mb_best_avail_max_trim_order set to the default of 3,
so unfortunately I won't be able to help with this problem for now.

According to perf record -ag sleep 5, at the moment most of the ext4 CPU usage
is in ext4_get_group_desc and ext4_get_group_info, both called from
ext4_mb_prefetch, with some usage in ext4_mb_good_group called from
ext4_mb_find_good_group_avg_frag_lists. Both ext4_mb_prefetch and
ext4_mb_find_good_group_avg_frag_lists were called from
ext4_mb_regular_allocator. But this is all fairly low CPU usage at the moment,
and it all stops shortly after the file operation completes.


2024-01-06 03:18:07

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #64 from Matthew Stapleton ([email protected]) ---
After patching the ext4_mb_choose_next_group set of functions with
noinline_for_stack, perf now includes them in the report
(perf report --stdio -i perf.data). The usage in ext4_mb_regular_allocator is
only about 1.39% (compared to 17.59% for ext4_get_group_desc and
ext4_get_group_info), with ext4_mb_choose_next_group_goal_fast using 0.74% and
ext4_mb_choose_next_group_best_avail using 0.58%, which indicates
ext4_mb_choose_next_group_best_avail isn't stalling for me like it was before.


2024-01-07 02:56:51

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #65 from Matthew Stapleton ([email protected]) ---
I'm not sure if this will be worth looking into, but regarding the CPU usage
problem, I've done some more testing and "perf report --stdio -i perf.data" is
showing around 2 to 3 times as much CPU time since the commit that added the
CR_BEST_AVAIL_LEN feature (7e170922f06bf). With
/sys/fs/ext4/<dev>/mb_best_avail_max_trim_order set to 0 it goes a bit lower,
but it is still at least 2x the CPU time. With perf record -ag sleep 25, which
is the time it takes the extract to run, it goes from about 2% to 5%, so
fairly small, but when measuring with 5-second intervals this can go from
around 5% to 15%. This CPU usage is concentrated in ext4_get_group_desc and
ext4_get_group_info, and within these functions perf is picking up more CPU
time in __rcu_read_lock and __rcu_read_unlock.


2024-02-07 11:44:55

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

Matteo Italia ([email protected]) changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]

--- Comment #66 from Matteo Italia ([email protected]) ---
I think I'm experiencing the same problem: a relatively big (1.5T), relatively
empty (716 G used) ext4 partition on a slow-ish SATA magnetic HDD
(ST3000DM007-1WY10G), stripe=32747, 6.5.0-14-generic kernel from the Ubuntu
22.04 HWE stack. During the installation of COMSOL Multiphysics 6.2 (which
essentially untars ~16 GB of stuff) I got the kworker flush process stuck at
100% CPU for multiple hours, and the IO cache filling my 32 GB of RAM. I tried
the various parameter tweaks suggested here with no particular change of pace
(although maybe the pending IO still worked according to the old values?). I
run the bfq scheduler on this disk, but I tried switching to pretty much all
the other schedulers mid-process with no perceptible change.

After ~3 hours of agonizingly slow progress (with disk write performance for
the installation often in the tens of KB/s) I decided to abort the
installation. I rebooted into the Ubuntu GA kernel (5.15.0-92-generic) without
touching any specific parameter and restarted the installation; this time it
completed in ~10 minutes.


2024-03-12 20:27:49

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

[email protected] changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]

--- Comment #67 from [email protected] ---
Hi,

I just stumbled upon this issue, which seems to be the same one I am suffering
from on a Raspberry Pi 5. The OS is Arch Linux ARM with linux-rpi 6.6.20
running on an NVMe SSD. After reading this issue, I checked the ext4 mount
options and found stripe=8191. As a first test I downgraded linux-rpi to
6.1.69, and my issue of a CPU-hogging kworker process when pulling and
creating docker containers did not happen anymore. With kernel 6.6.20 and all
earlier 6.6.x versions, docker operations were really slow; in some cases
pulling and recreating a container took over 1000 seconds. With kernel 6.1.69
these operations are back to normal speed.

As a next step I would like to try setting stripe=0, but I am a bit unsure
what the right approach is. Is it ok to remount the running system partition
with mount -o remount,stripe=0 <dev> without any risks, for a first test? And
to change to stripe=0 permanently, is it ok and sufficient to run
tune2fs -f -E stride=0 <dev> on the mounted system partition without any
risks? I would be thankful for a helping hand here so I don't do anything
wrong or dangerous.

I hope that this works around my issue for now, until a fix has been found.
Thank you very much for looking into this!

Kind regards
denk


2024-03-17 11:58:27

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #68 from Ojaswin Mujoo ([email protected]) ---
Hello Denk,

Both of the commands you mentioned, "mount -o remount,stripe=0 <dev>" and
"tune2fs -f -E stride=0 <dev>", should be okay to run and aren't dangerous for
your filesystem. You can always take a backup to be extra cautious ;)

That being said, this issue was fixed by the following patch [1], which I
believe landed in Linux kernel 6.7, so anything above that should help you
avoid the issue as well.

Regards,
Ojaswin

[1]
https://lore.kernel.org/linux-ext4/[email protected]


2024-03-18 04:45:31

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #69 from Eyal Lebedinsky ([email protected]) ---
Seeing this last comment (#68), I wondered if the fix is included in my latest
kernel. Or do I have a different problem?

After updating my Fedora 38 box to kernel 6.7.9, I did NOT do the remount and
repeated my original problematic test.
$ uname -a
Linux e7.eyal.emu.id.au 6.7.9-100.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Wed
Mar 6 19:31:16 UTC 2024 x86_64 GNU/Linux
$ mount|grep data1
/dev/md127 on /data1 type ext4 (rw,noatime,stripe=640)

I rsync a directory (262GB, 8,911,139 files) from an SSD to a raid6 (7x12TB =>
60TB fs). Both are ext4.
$ sudo time rsync -aHSK --stats --progress
/data/no-backup/old-backups/tapes /data1/no-backup/really-not/old-backups/

What I see is that once meminfo shows the limit (4GB) was reached, the kworker
kicks in at 100% CPU.
At that point iostat shows activity on the array dropping, from about 160MB/s
to very little (with an occasional burst of a few MB/s).

```
10:12:33 2024-03-18
10:12:33 Device tps kB_read/s kB_wrtn/s kB_dscd/s
kB_read kB_wrtn kB_dscd
[trimmed]
10:17:13 md127 0.00 0.00 0.00 0.00
0 0 0
10:17:23 md127 34.40 137.60 0.00 0.00
1376 0 0 <<< rsync starts
10:17:33 md127 2346.30 747.20 8638.00 0.00
7472 86380 0
10:17:43 md127 7067.10 431.60 133644.40 0.00
4316 1336444 0
10:17:53 md127 1692.80 578.80 7015.20 0.00
5788 70152 0
10:18:03 md127 2439.20 169.60 32071.20 0.00
1696 320712 0
10:18:13 md127 274.00 4.00 2242.00 0.00
40 22420 0
10:18:23 md127 3172.70 17.60 56828.00 0.00
176 568280 0
10:18:33 md127 416.20 0.80 1664.80 0.00
8 16648 0
10:18:43 md127 18.70 0.40 76.40 0.00
4 764 0
10:18:53 md127 6.50 0.00 30.80 0.00
0 308 0
10:19:03 md127 4.80 0.00 40.00 0.00
0 400 0
10:19:13 md127 5.70 0.00 63.60 0.00
0 636 0
10:19:23 md127 2.60 0.00 54.80 0.00
0 548 0
10:19:33 md127 7.40 0.00 243.20 0.00
0 2432 0
10:19:43 md127 5.20 0.00 75.60 0.00
0 756 0
10:19:53 md127 3.80 0.00 20.40 0.00
0 204 0
10:20:03 md127 2.00 0.00 13.20 0.00
0 132 0
10:20:13 md127 3.90 0.00 29.20 0.00
0 292 0
10:20:23 md127 3.80 0.00 19.60 0.00
0 196 0

At the same time meminfo shows:
2024-03-18 10:17:04 Dirty: 11220 kB Buffers: 829988 kB MemFree:
670576 kB
2024-03-18 10:17:14 Dirty: 10784 kB Buffers: 830016 kB MemFree:
631500 kB
2024-03-18 10:17:24 Dirty: 750616 kB Buffers: 875592 kB MemFree:
654236 kB
2024-03-18 10:17:34 Dirty: 2757048 kB Buffers: 972948 kB MemFree:
600636 kB
2024-03-18 10:17:44 Dirty: 2855196 kB Buffers: 1046736 kB MemFree:
551940 kB
2024-03-18 10:17:54 Dirty: 4104524 kB Buffers: 1127200 kB MemFree:
538136 kB
2024-03-18 10:18:04 Dirty: 4390504 kB Buffers: 1155588 kB MemFree:
600828 kB
2024-03-18 10:18:14 Dirty: 4518280 kB Buffers: 1161916 kB MemFree:
580176 kB
2024-03-18 10:18:24 Dirty: 4356952 kB Buffers: 1185872 kB MemFree:
543072 kB
2024-03-18 10:18:34 Dirty: 4559504 kB Buffers: 1196396 kB MemFree:
518872 kB
2024-03-18 10:18:44 Dirty: 4567212 kB Buffers: 1197060 kB MemFree:
606572 kB
2024-03-18 10:18:54 Dirty: 4567592 kB Buffers: 1197084 kB MemFree:
611440 kB
... and stays there until ...

I then killed the copy (14,296MB copied). The writing to the array remained
very low, the kernel thread stayed at 100%
and meminfo drained very slowly. Access to the array is now slow with some
hiccups.

2024-03-18 10:35:24 Dirty: 4484720 kB Buffers: 4984308 kB MemFree:
820532 kB <<< rsync killed
2024-03-18 10:35:34 Dirty: 4484436 kB Buffers: 4984348 kB MemFree:
851288 kB
2024-03-18 10:35:44 Dirty: 4483992 kB Buffers: 4984368 kB MemFree:
817516 kB
2024-03-18 10:35:54 Dirty: 4483780 kB Buffers: 4984400 kB MemFree:
803156 kB
2024-03-18 10:36:04 Dirty: 4483704 kB Buffers: 4984460 kB MemFree:
809956 kB
2024-03-18 10:36:14 Dirty: 4479416 kB Buffers: 4984496 kB MemFree:
832980 kB
2024-03-18 10:36:24 Dirty: 4474312 kB Buffers: 4984528 kB MemFree:
881464 kB
2024-03-18 10:36:34 Dirty: 4474260 kB Buffers: 4984568 kB MemFree:
840444 kB
2024-03-18 10:36:44 Dirty: 4474132 kB Buffers: 4984600 kB MemFree:
843524 kB
2024-03-18 10:36:54 Dirty: 4474292 kB Buffers: 4984640 kB MemFree:
841004 kB
2024-03-18 10:37:04 Dirty: 4474052 kB Buffers: 4984680 kB MemFree:
834148 kB
2024-03-18 10:37:14 Dirty: 4473688 kB Buffers: 4984712 kB MemFree:
853200 kB
2024-03-18 10:37:24 Dirty: 4473448 kB Buffers: 4984752 kB MemFree:
782540 kB
2024-03-18 10:37:34 Dirty: 4473288 kB Buffers: 4984776 kB MemFree:
786100 kB
2024-03-18 10:37:44 Dirty: 3871768 kB Buffers: 4984972 kB MemFree:
846020 kB
2024-03-18 10:37:54 Dirty: 3871612 kB Buffers: 4985020 kB MemFree:
826664 kB
2024-03-18 10:38:04 Dirty: 3871736 kB Buffers: 4985052 kB MemFree:
826084 kB
2024-03-18 10:38:14 Dirty: 3871184 kB Buffers: 4985100 kB MemFree:
876572 kB
2024-03-18 10:38:24 Dirty: 3870936 kB Buffers: 4985140 kB MemFree:
918944 kB
2024-03-18 10:38:34 Dirty: 3648080 kB Buffers: 4985256 kB MemFree:
901336 kB
2024-03-18 10:38:44 Dirty: 3556612 kB Buffers: 4985316 kB MemFree:
902532 kB
2024-03-18 10:38:54 Dirty: 3551636 kB Buffers: 4985364 kB MemFree:
837816 kB
2024-03-18 10:39:04 Dirty: 3551968 kB Buffers: 4985468 kB MemFree:
823392 kB
2024-03-18 10:39:14 Dirty: 2835648 kB Buffers: 4985656 kB MemFree:
629428 kB
...
2024-03-18 11:05:25 Dirty: 2737096 kB Buffers: 4993860 kB MemFree:
599424 kB <<< 30m later
2024-03-18 11:35:25 Dirty: 2573748 kB Buffers: 5001184 kB MemFree:
612288 kB <<< again
2024-03-18 12:05:26 Dirty: 2432572 kB Buffers: 5007704 kB MemFree:
663928 kB <<< again
2024-03-18 12:35:27 Dirty: 2145348 kB Buffers: 3707492 kB MemFree:
588464 kB <<< again
2024-03-18 13:05:27 Dirty: 2017848 kB Buffers: 3718936 kB MemFree:
585500 kB <<< again
2024-03-18 13:35:28 Dirty: 1822436 kB Buffers: 3746824 kB MemFree:
565560 kB <<< again
2024-03-18 14:05:29 Dirty: 1595088 kB Buffers: 3799124 kB MemFree:
544504 kB <<< again
2024-03-18 14:35:29 Dirty: 1498416 kB Buffers: 3816868 kB MemFree:
3883524 kB <<< again
2024-03-18 15:05:30 Dirty: 1387140 kB Buffers: 3835824 kB MemFree:
3266060 kB <<< again
...
2024-03-18 15:32:51 Dirty: 1284940 kB Buffers: 3850936 kB MemFree:
3088904 kB <<< finally
2024-03-18 15:33:01 Dirty: 933268 kB Buffers: 3851144 kB MemFree:
3098840 kB
2024-03-18 15:33:11 Dirty: 51956 kB Buffers: 3851248 kB MemFree:
3095456 kB
2024-03-18 15:33:21 Dirty: 51968 kB Buffers: 3851284 kB MemFree:
3059212 kB
2024-03-18 15:33:31 Dirty: 52032 kB Buffers: 3851308 kB MemFree:
3085352 kB
2024-03-18 15:33:41 Dirty: 172 kB Buffers: 3851336 kB MemFree:
3090912 kB
2024-03-18 15:33:51 Dirty: 64 kB Buffers: 3851368 kB MemFree:
3030584 kB
```

So over 5 hours to copy this small part (14GB of 262GB) of the data.

Is this expected? Is this a fundamental "feature" of ext4? Or of raid6?
When I then ran "sudo mount -o remount,stripe=0 /data1", the copy progressed
nicely with a good write speed.

I have logs of the progress of this test at 10s intervals.

Regards


2024-03-20 06:46:27

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #70 from Ojaswin Mujoo ([email protected]) ---
Hi Eyal,

My bad, seems like the patch landed in 6.8-rc1 as seen here [1].

[1] https://lore.kernel.org/linux-ext4/[email protected]/

Regards,
ojaswin


2024-03-23 08:54:05

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd

https://bugzilla.kernel.org/show_bug.cgi?id=217965

--- Comment #71 from [email protected] ---
(In reply to Ojaswin Mujoo from comment #68)
> Both of the commands you mentioned "mount -o remount,stripe=0 <dev>" and
> "tune2fs -f -E stride=0 <dev>" should be okay to run and aren't dangerous
> for your filesystem. You can always take a backup to be extra cautious ;)
> [...]

Hello Ojaswin,

thank you very much! I changed stripe to 0 last weekend with
tune2fs -f -E stripe_width=0 <dev>, as tune2fs -f -E stride=0 <dev> did not
change stripe to 0. Since then my system is back to normal performance, with
no further issues so far.
Thank you very much for all the work on ext4!
Thanks you very much for all the work for ext4!
Best
denk
