The machine had been up for 3 days running kernel 6.9.3.
The stalls happened while an rsync server was writing to /mnt/lv_data,
which is ext4 on md raid6 with lvmcache. The rsync data is pushed over
the network from another machine.
The same rsync job runs twice a day and worked until the 3rd day.
Logs attached: dmesg, journalctl -k, ver_linux, lsblk and lspci.
Aside: This looks different from the 6.9.3 phy deadlock I experienced
on a different machine [1].
A git bisect is not practical given how infrequently this occurs.
The first 3 log entries begin with the following call traces:
<TASK>
? __schedule+0x3cf/0x1510
? sysvec_apic_timer_interrupt+0xe/0x90
? asm_sysvec_apic_timer_interrupt+0x1a/0x20
? xas_descend+0x2f/0xa0
? xas_load+0x41/0x50
? filemap_get_entry+0x72/0x140
? __filemap_get_folio+0x37/0x2e0
? __find_get_block+0x84/0x330
? bdev_getblk+0x22/0x70
? __ext4_get_inode_loc+0x132/0x4d0 [ext4 9f75ec11db44bef3511c7e45e58aac1eb2f9252d]
...
<TASK>
__schedule+0x3c7/0x1510
rt_mutex_schedule+0x20/0x40
rt_mutex_slowlock_block.constprop.0+0x40/0x170
__rt_mutex_slowlock_locked.constprop.0+0xbd/0x2f0
rt_mutex_lock+0x49/0x60
rcu_boost_kthread+0xca/0x2f0
? __pfx_rcu_boost_kthread+0x10/0x10
...
<TASK>
__schedule+0x3c7/0x1510
? asm_sysvec_apic_timer_interrupt+0x1a/0x20
? update_load_avg+0x7e/0x7b0
schedule+0x27/0xf0
percpu_ref_switch_to_atomic_sync+0x9b/0xf0
? __pfx_autoremove_wake_function+0x10/0x10
set_in_sync+0x5c/0xc0 [md_mod dee80e622cfc358bebfe77522144fac96dc4812e]
md_check_recovery+0x26b/0x3e0 [md_mod dee80e622cfc358bebfe77522144fac96dc4812e]
raid5d+0x59/0x710 [raid456 d2a0fd36840a461fec669ba17b3965c20926b921]
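
For anyone who wants to pull the same kind of excerpt out of the attached
dmesg, here is a minimal Python sketch. It assumes the log is plain text,
that lines may carry the usual "[ seconds.micros]" dmesg timestamp prefix,
and that each trace starts at a <TASK> line. The names first_frames and
max_frames are made up for illustration and are not part of any kernel
tooling.

```python
import re

# Optional dmesg timestamp prefix such as "[  123.456789] " (assumed format).
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def first_frames(log_text, max_frames=5):
    """Return the first max_frames lines of every <TASK> block in log_text."""
    traces = []
    current = None  # frames of the trace being read, or None outside a trace
    for raw in log_text.splitlines():
        line = TIMESTAMP.sub("", raw).strip()
        if line == "<TASK>":
            # A new trace begins; start collecting its frames.
            current = []
            traces.append(current)
        elif line == "</TASK>":
            # End of the trace; ignore lines until the next <TASK>.
            current = None
        elif current is not None and line and len(current) < max_frames:
            current.append(line)
    return traces
```

Calling first_frames() on the text of the saved dmesg gives one list of
frame lines per trace, which makes it easier to compare the top of each
stack side by side.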
[1] https://lore.kernel.org/lkml/[email protected]/
--
Gene