2024-01-23 02:02:22

by Dan Moulding

Subject: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

After upgrading from 6.7.0 to 6.7.1 a couple of my systems with md
RAID-5 arrays started experiencing hangs. It starts with some
processes which write to the array getting stuck. The whole system
eventually becomes unresponsive and an unclean shutdown must be performed
(poweroff and reboot don't work).

While trying to diagnose the issue, I noticed that the md0_raid5
kernel thread consumes 100% CPU after the issue occurs. No relevant
warnings or errors were found in dmesg.

On 6.7.1, I can reproduce the issue somewhat reliably by copying a
large amount of data to the array. I am unable to reproduce the issue
at all on 6.7.0. The bisection was a bit difficult since I don't have
a 100% reliable method to reproduce the problem, but with some
perseverance I eventually managed to whittle it down to commit
0de40f76d567 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
raid5d"). After reverting that commit (i.e. reapplying the reverted
commit) on top of 6.7.1, I can no longer reproduce the problem at all.
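
In other words, the kernel that no longer reproduces the hang was
built roughly like this (sketch only; the build commands shown are
just the usual ones, not necessarily exactly what I ran):

# start from the 6.7.1 stable tag
git checkout v6.7.1
# reverting the revert re-applies the original raid5d wait fix
git revert 0de40f76d567
make olddefconfig && make -j"$(nproc)"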

Some details that might be relevant:
- Both systems are running MD RAID-5 with a journal device.
- mdadm in monitor mode is always running on both systems.
- Both systems were previously running 6.7.0 and earlier just fine.
- The older of the two systems has been running a raid5 array without
incident for many years (kernel going back to at least 5.1) -- this
is the first raid5 issue it has encountered.

Please let me know if there is any other helpful information that I
might be able to provide.

-- Dan

#regzbot introduced: 0de40f76d567133b871cd6ad46bb87afbce46983


2024-01-23 02:35:04

by Song Liu

Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

On Mon, Jan 22, 2024 at 4:57 PM Dan Moulding <[email protected]> wrote:
>
> After upgrading from 6.7.0 to 6.7.1 a couple of my systems with md
> RAID-5 arrays started experiencing hangs. It starts with some
> processes which write to the array getting stuck. The whole system
> eventually becomes unresponsive and unclean shutdown must be performed
> (poweroff and reboot don't work).
>
> While trying to diagnose the issue, I noticed that the md0_raid5
> kernel thread consumes 100% CPU after the issue occurs. No relevant
> warnings or errors were found in dmesg.
>
> On 6.7.1, I can reproduce the issue somewhat reliably by copying a
> large amount of data to the array. I am unable to reproduce the issue
> at all on 6.7.0. The bisection was a bit difficult since I don't have
> a 100% reliable method to reproduce the problem, but with some
> perseverance I eventually managed to whittle it down to commit
> 0de40f76d567 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
> raid5d"). After reverting that commit (i.e. reapplying the reverted
> commit) on top of 6.7.1 I can no longer reproduce the problem at all.
>
> Some details that might be relevant:
> - Both systems are running MD RAID-5 with a journal device.
> - mdadm in monitor mode is always running on both systems.
> - Both systems were previously running 6.7.0 and earlier just fine.
> - The older of the two systems has been running a raid5 array without
> incident for many years (kernel going back to at least 5.1) -- this
> is the first raid5 issue it has encountered.
>
> Please let me know if there is any other helpful information that I
> might be able to provide.

Thanks for the report, and sorry for the problem.

We are looking into some regressions that are probably related to this.
We will fix the issue ASAP.

Song

2024-01-23 02:40:51

by Dan Moulding

Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Some additional information: I realized after filing this report that
on the mainline there is a second commit, part of a pair, that was
supposed to go with commit 0de40f76d567. That second commit upstream
is d6e035aad6c0 ("md: bypass block throttle for superblock update").
That commit was probably also supposed to have been backported to
stable along with the first, since it provides what is meant to be a
replacement for the reverted fix, but it was not.

So I rebuilt my kernel with the missed commit backported, instead of
just reverting the first commit (i.e. I have now built 6.7.1 with just
commit d6e035aad6c0 on top). Unfortunately, I can still reproduce the
hang after applying this second commit, so the regression is still
present even with that fix in place.

Coincidentally, I see that this second commit was picked up for
inclusion in 6.7.2 just today. I think that needs to NOT be done.
Instead, the stable series should probably revert 0de40f76d567 until
the regression is successfully dealt with on master, and probably no
further changes related to this patch series should be backported
until then.

Cheers,

-- Dan

2024-01-23 06:39:15

by Song Liu

Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Hi Dan,

On Mon, Jan 22, 2024 at 5:35 PM Dan Moulding <[email protected]> wrote:
>
> Some additional new information: I realized after filing this report
> that on the mainline there is a second commit, part of a pair, that
> was supposed to go with commit 0de40f76d567. That second commit
> upstream is d6e035aad6c0 ("md: bypass block throttle for superblock
> update"). That commit probably also was supposed to have been
> backported to stable along with the first, but was not, since it
> provides what is supposed to be a replacement for the fix that has
> been reverted.
>
> So I rebuilt my kernel with the missed commit also backported instead
> of just reverting the first commit (i.e. I have now built 6.7.1 with
> just commit d6e035aad6c0 on top). Unfortunately, I can still reproduce
> the hang after applying this second commit. So it looks
> like even with that fix applied the regression is still present.
>
> Coincidentally, I see it seems this second commit was picked up for
> inclusion in 6.7.2 just today. I think that needs to NOT be
> done. Instead the stable series should probably revert 0de40f76d567
> until the regression is successfully dealt with on master. Probably no
> further changes related to this patch series should be backported
> until then.

I think we still want d6e035aad6c0 in 6.7.2. We may need to revert
0de40f76d567 on top of that. Could you please test it out? (6.7.1 +
d6e035aad6c0 + revert 0de40f76d567).

OTOH, I am not able to reproduce the issue. Could you please help
get more information:
cat /proc/mdstat
profile (perf, etc.) of the md thread

Thanks,
Song

2024-01-23 21:53:35

by Dan Moulding

Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

> I think we still want d6e035aad6c0 in 6.7.2. We may need to revert
> 0de40f76d567 on top of that. Could you please test it out? (6.7.1 +
> d6e035aad6c0 + revert 0de40f76d567).

I was operating under the assumption that the two commits were
intended to exist as a pair (the first reverts the old fix because the
second has what is supposed to be a better fix). But since the
regression still exists even with both patches applied, the old fix
must be reapplied to resolve the current regression.

But, as you've requested, I have tested 6.7.1 + d6e035aad6c0 + revert
0de40f76d567 and it seems fine. So I have no issue if you think it
makes sense to accept d6e035aad6c0 on its own, even though it would
break up the pair of commits.
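
For clarity, the combination tested above is roughly this (sketch
only; the cherry-pick may need minor conflict resolution on the
stable tree):

git checkout v6.7.1
git cherry-pick d6e035aad6c0   # "md: bypass block throttle for superblock update"
git revert 0de40f76d567        # re-apply the raid5d wait fix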

> OTOH, I am not able to reproduce the issue. Could you please help
> get more information:
> cat /proc/mdstat

Here is /proc/mdstat from one of the systems where I can reproduce it:

$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 dm-0[4](J) sdc[3] sda[0] sdb[1]
3906764800 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>

dm-0 is an LVM logical volume which is backed by an NVMe SSD. The
others are run-of-the-mill SATA SSDs.
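
If it helps with setting up a reproducer, an equivalent array with a
journal device can be created with something like this (illustrative
sketch only; the journal LV path is a placeholder):

# three member disks plus a dedicated write journal
mdadm --create /dev/md0 --level=5 --raid-devices=3 \
      --write-journal /dev/vg0/journal /dev/sda /dev/sdb /dev/sdc
# optionally switch the journal from write-through to write-back mode
echo write-back > /sys/block/md0/md/journal_mode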

> profile (perf, etc.) of the md thread

I might need a little more pointing in the right direction on what
exactly to look for and under what conditions (i.e. should I run perf
while the thread is stuck in the 100% CPU loop? What kind of report
should I ask perf for?). Also, are there any debug options I could
enable in the kernel configuration that might help gather more
information? Maybe something in debugfs? I currently get absolutely no
warnings or errors in dmesg when the problem occurs.

Cheers,

-- Dan

2024-01-23 22:22:21

by Song Liu

Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Hi Dan,

On Tue, Jan 23, 2024 at 1:53 PM Dan Moulding <[email protected]> wrote:
>
> > I think we still want d6e035aad6c0 in 6.7.2. We may need to revert
> > 0de40f76d567 on top of that. Could you please test it out? (6.7.1 +
> > d6e035aad6c0 + revert 0de40f76d567).
>
> I was operating under the assumption that the two commits were
> intended to exist as a pair (the one reverts the old fix, because the
> next commit has what is supposed to be a better fix). But since the
> regression still exists, even with both patches applied, the old fix
> must be reapplied to resolve the current regression.
>
> But, as you've requested, I have tested 6.7.1 + d6e035aad6c0 + revert
> 0de40f76d567 and it seems fine. So I have no issue if you think it
> makes sense to accept d6e035aad6c0 on its own, even though it would
> break up the pair of commits.

Thanks for running the test!

>
> > OTOH, I am not able to reproduce the issue. Could you please help
> > get more information:
> > cat /proc/mdstat
>
> Here is /proc/mdstat from one of the systems where I can reproduce it:
>
> $ cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid5 dm-0[4](J) sdc[3] sda[0] sdb[1]
> 3906764800 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
>
> unused devices: <none>
>
> dm-0 is an LVM logical volume which is backed by an NVMe SSD. The
> others are run-of-the-mill SATA SSDs.
>
> > profile (perf, etc.) of the md thread
>
> I might need a little more pointing in the direction of what exactly
> to look for and under what conditions (i.e. should I run perf while
> the thread is stuck in the 100% CPU loop? what kind of report should I
> ask perf for?). Also, are there any debug options I could enable in
> the kernel configuration that might help gather more information?
> Maybe something in debugfs? I currently get absolutely no warnings or
> errors in dmesg when the problem occurs.

It appears the md thread has hit some infinite loop, so I would like
to know what it is doing. We can probably get that information with
the perf tool, something like:

perf record -a
perf report
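
If it is more convenient, limiting the capture to the md thread and
collecting call graphs should also work; untested sketch, assuming
the busy thread is still named md0_raid5:

# sample only the md0_raid5 thread for ~30 seconds, with call graphs
perf record -g -t "$(pgrep md0_raid5)" -- sleep 30
perf report --stdio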

Thanks,
Song

2024-01-23 23:58:29

by Dan Moulding

Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

> This appears the md thread hit some infinite loop, so I would like to
> know what it is doing. We can probably get the information with the
> perf tool, something like:
>
> perf record -a
> perf report

Here you go!

# Total Lost Samples: 0
#
# Samples: 78K of event 'cycles'
# Event count (approx.): 83127675745
#
# Overhead Command Shared Object Symbol
# ........ ............... .............................. ...................................................
#
49.31% md0_raid5 [kernel.kallsyms] [k] handle_stripe
18.63% md0_raid5 [kernel.kallsyms] [k] ops_run_io
6.07% md0_raid5 [kernel.kallsyms] [k] handle_active_stripes.isra.0
5.50% md0_raid5 [kernel.kallsyms] [k] do_release_stripe
3.09% md0_raid5 [kernel.kallsyms] [k] _raw_spin_lock_irqsave
2.48% md0_raid5 [kernel.kallsyms] [k] r5l_write_stripe
1.89% md0_raid5 [kernel.kallsyms] [k] md_wakeup_thread
1.45% ksmd [kernel.kallsyms] [k] ksm_scan_thread
1.37% md0_raid5 [kernel.kallsyms] [k] stripe_is_lowprio
0.87% ksmd [kernel.kallsyms] [k] memcmp
0.68% ksmd [kernel.kallsyms] [k] xxh64
0.56% md0_raid5 [kernel.kallsyms] [k] __wake_up_common
0.52% md0_raid5 [kernel.kallsyms] [k] __wake_up
0.46% ksmd [kernel.kallsyms] [k] mtree_load
0.44% ksmd [kernel.kallsyms] [k] try_grab_page
0.40% ksmd [kernel.kallsyms] [k] follow_p4d_mask.constprop.0
0.39% md0_raid5 [kernel.kallsyms] [k] r5l_log_disk_error
0.37% md0_raid5 [kernel.kallsyms] [k] _raw_spin_lock_irq
0.33% md0_raid5 [kernel.kallsyms] [k] release_stripe_list
0.31% md0_raid5 [kernel.kallsyms] [k] release_inactive_stripe_list
0.31% ksmd [kernel.kallsyms] [k] get_ksm_page
0.30% md0_raid5 [kernel.kallsyms] [k] __cond_resched
0.28% md0_raid5 [kernel.kallsyms] [k] mutex_unlock
0.28% ksmd [kernel.kallsyms] [k] _raw_spin_lock
0.27% swapper [kernel.kallsyms] [k] intel_idle
0.26% md0_raid5 [kernel.kallsyms] [k] mutex_lock
0.24% md0_raid5 [kernel.kallsyms] [k] rcu_all_qs
0.22% md0_raid5 [kernel.kallsyms] [k] r5c_is_writeback
0.20% md0_raid5 [kernel.kallsyms] [k] __lock_text_start
0.18% ksmd [kernel.kallsyms] [k] up_read
0.18% ksmd [kernel.kallsyms] [k] down_read
0.17% md0_raid5 [kernel.kallsyms] [k] raid5d
0.15% ksmd [kernel.kallsyms] [k] follow_trans_huge_pmd
0.13% kworker/u16:3-e [kernel.kallsyms] [k] ioread32
0.13% kworker/u16:1-e [kernel.kallsyms] [k] ioread32
0.12% ksmd [kernel.kallsyms] [k] follow_page_pte
0.11% md0_raid5 [kernel.kallsyms] [k] r5l_flush_stripe_to_raid
0.11% ksmd [kernel.kallsyms] [k] follow_page
0.11% ksmd [kernel.kallsyms] [k] memcmp_pages
0.10% swapper [kernel.kallsyms] [k] poll_idle
0.08% ksmd [kernel.kallsyms] [k] mtree_range_walk
0.07% ksmd [kernel.kallsyms] [k] __cond_resched
0.07% ksmd [kernel.kallsyms] [k] rcu_all_qs
0.06% ksmd [kernel.kallsyms] [k] __pte_offset_map_lock
0.04% ksmd [kernel.kallsyms] [k] __pte_offset_map
0.03% md0_raid5 [kernel.kallsyms] [k] llist_reverse_order
0.03% md0_raid5 [kernel.kallsyms] [k] r5l_write_stripe_run
0.02% swapper [kernel.kallsyms] [k] menu_select
0.02% ksmd [kernel.kallsyms] [k] rb_insert_color
0.02% ksmd [kernel.kallsyms] [k] vm_normal_page
0.02% swapper [kernel.kallsyms] [k] cpuidle_enter_state
0.01% md0_raid5 [kernel.kallsyms] [k] r5l_submit_current_io
0.01% ksmd [kernel.kallsyms] [k] vma_is_secretmem
0.01% swapper [kernel.kallsyms] [k] alx_mask_msix
0.01% ksmd [kernel.kallsyms] [k] remove_rmap_item_from_tree
0.01% swapper [kernel.kallsyms] [k] lapic_next_deadline
0.01% swapper [kernel.kallsyms] [k] read_tsc
0.01% ksmd [kernel.kallsyms] [k] mas_walk
0.01% swapper [kernel.kallsyms] [k] do_idle
0.01% md0_raid5 [kernel.kallsyms] [k] perf_adjust_freq_unthr_context
0.01% md0_raid5 [kernel.kallsyms] [k] lapic_next_deadline
0.01% swapper [kernel.kallsyms] [k] perf_adjust_freq_unthr_context
0.01% swapper [kernel.kallsyms] [k] __switch_to_asm
0.01% swapper [kernel.kallsyms] [k] _raw_spin_lock_irqsave
0.01% swapper [kernel.kallsyms] [k] native_irq_return_iret
0.01% swapper [kernel.kallsyms] [k] arch_scale_freq_tick
0.01% kworker/u16:3-e [kernel.kallsyms] [k] lapic_next_deadline
0.00% swapper [kernel.kallsyms] [k] __hrtimer_next_event_base
0.00% ksmd [kernel.kallsyms] [k] calc_checksum
0.00% swapper [kernel.kallsyms] [k] psi_group_change
0.00% swapper [kernel.kallsyms] [k] timerqueue_add
0.00% ksmd [kernel.kallsyms] [k] mas_find
0.00% swapper [kernel.kallsyms] [k] __schedule
0.00% swapper [kernel.kallsyms] [k] ioread32
0.00% kworker/u16:3-e [kernel.kallsyms] [k] _aesni_enc4
0.00% swapper [kernel.kallsyms] [k] rb_next
0.00% kworker/u16:1-e [kernel.kallsyms] [k] lapic_next_deadline
0.00% swapper [kernel.kallsyms] [k] ktime_get
0.00% kworker/u16:1-e [kernel.kallsyms] [k] psi_group_change
0.00% Qt bearer threa [kernel.kallsyms] [k] alx_update_hw_stats
0.00% swapper [kernel.kallsyms] [k] cpuidle_enter
0.00% swapper [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0
0.00% swapper [kernel.kallsyms] [k] ct_kernel_exit_state
0.00% swapper [kernel.kallsyms] [k] nr_iowait_cpu
0.00% swapper [kernel.kallsyms] [k] sched_clock_noinstr
0.00% swapper [kernel.kallsyms] [k] psi_flags_change
0.00% swapper [kernel.kallsyms] [k] tick_nohz_stop_idle
0.00% swapper [kernel.kallsyms] [k] __run_timers.part.0
0.00% swapper [kernel.kallsyms] [k] native_apic_msr_eoi
0.00% swapper [kernel.kallsyms] [k] __update_load_avg_se
0.00% md0_raid5 [kernel.kallsyms] [k] __intel_pmu_enable_all.isra.0
0.00% md0_raid5 [kernel.kallsyms] [k] update_vsyscall
0.00% md0_raid5 [kernel.kallsyms] [k] arch_scale_freq_tick
0.00% md0_raid5 [kernel.kallsyms] [k] read_tsc
0.00% md0_raid5 [kernel.kallsyms] [k] x86_pmu_disable
0.00% md0_raid5 [kernel.kallsyms] [k] __update_load_avg_cfs_rq
0.00% swapper [kernel.kallsyms] [k] rcu_sched_clock_irq
0.00% perf [kernel.kallsyms] [k] rep_movs_alternative
0.00% swapper [kernel.kallsyms] [k] hrtimer_active
0.00% swapper [kernel.kallsyms] [k] newidle_balance.isra.0
0.00% swapper [kernel.kallsyms] [k] _raw_spin_lock_irq
0.00% kworker/u16:3-e [kernel.kallsyms] [k] aesni_xts_encrypt
0.00% swapper [kernel.kallsyms] [k] enqueue_task_fair
0.00% swapper [kernel.kallsyms] [k] tick_nohz_idle_stop_tick
0.00% swapper [kernel.kallsyms] [k] leave_mm
0.00% kworker/u16:1-e [kernel.kallsyms] [k] sched_clock_noinstr
0.00% plasmashell [kernel.kallsyms] [k] ext4fs_dirhash
0.00% kwin_x11 [kernel.kallsyms] [k] ioread32
0.00% kworker/0:2-eve [kernel.kallsyms] [k] ioread32
0.00% swapper [kernel.kallsyms] [k] memchr_inv
0.00% swapper [kernel.kallsyms] [k] pick_next_task_fair
0.00% swapper [kernel.kallsyms] [k] tick_sched_do_timer
0.00% swapper [kernel.kallsyms] [k] ktime_get_update_offsets_now
0.00% swapper [kernel.kallsyms] [k] __update_load_avg_cfs_rq
0.00% kworker/u16:3-e [kernel.kallsyms] [k] psi_group_change
0.00% swapper [kernel.kallsyms] [k] ct_kernel_exit.constprop.0
0.00% ksmd [kernel.kallsyms] [k] psi_task_switch
0.00% swapper [kernel.kallsyms] [k] tick_nohz_next_event
0.00% swapper [kernel.kallsyms] [k] clockevents_program_event
0.00% swapper [kernel.kallsyms] [k] __sysvec_apic_timer_interrupt
0.00% swapper [kernel.kallsyms] [k] enqueue_entity
0.00% QXcbEventQueue [kernel.kallsyms] [k] schedule
0.00% swapper [kernel.kallsyms] [k] get_cpu_device
0.00% swapper [kernel.kallsyms] [k] scheduler_tick
0.00% swapper [kernel.kallsyms] [k] tick_check_oneshot_broadcast_this_cpu
0.00% swapper [kernel.kallsyms] [k] switch_mm_irqs_off
0.00% swapper [kernel.kallsyms] [k] calc_load_nohz_stop
0.00% swapper [kernel.kallsyms] [k] _raw_spin_lock
0.00% swapper [kernel.kallsyms] [k] nohz_run_idle_balance
0.00% swapper [kernel.kallsyms] [k] rcu_note_context_switch
0.00% swapper [kernel.kallsyms] [k] run_timer_softirq
0.00% swapper [kernel.kallsyms] [k] kthread_is_per_cpu
0.00% swapper [kernel.kallsyms] [k] x86_pmu_disable
0.00% ksoftirqd/4 [kernel.kallsyms] [k] rcu_cblist_dequeue
0.00% init init [.] 0x0000000000008874
0.00% swapper [kernel.kallsyms] [k] ct_kernel_enter.constprop.0
0.00% swapper [kernel.kallsyms] [k] update_rq_clock.part.0
0.00% swapper [kernel.kallsyms] [k] __dequeue_entity
0.00% swapper [kernel.kallsyms] [k] ttwu_queue_wakelist
0.00% swapper [kernel.kallsyms] [k] __hrtimer_run_queues
0.00% swapper [kernel.kallsyms] [k] select_task_rq_fair
0.00% md0_raid5 [kernel.kallsyms] [k] update_wall_time
0.00% md0_raid5 [kernel.kallsyms] [k] ntp_tick_length
0.00% md0_raid5 [kernel.kallsyms] [k] trigger_load_balance
0.00% md0_raid5 [kernel.kallsyms] [k] acct_account_cputime
0.00% md0_raid5 [kernel.kallsyms] [k] ktime_get
0.00% md0_raid5 [kernel.kallsyms] [k] timerqueue_add
0.00% md0_raid5 [kernel.kallsyms] [k] _raw_spin_lock
0.00% md0_raid5 [kernel.kallsyms] [k] tick_do_update_jiffies64
0.00% md0_raid5 [kernel.kallsyms] [k] native_irq_return_iret
0.00% md0_raid5 [kernel.kallsyms] [k] ktime_get_update_offsets_now
0.00% swapper [kernel.kallsyms] [k] asm_sysvec_apic_timer_interrupt
0.00% swapper [kernel.kallsyms] [k] update_blocked_averages
0.00% md0_raid5 [kernel.kallsyms] [k] error_entry
0.00% md0_raid5 [kernel.kallsyms] [k] rcu_sched_clock_irq
0.00% md0_raid5 [kernel.kallsyms] [k] native_apic_msr_eoi
0.00% swapper [kernel.kallsyms] [k] tick_nohz_highres_handler
0.00% md0_raid5 [kernel.kallsyms] [k] irq_work_tick
0.00% ksmd [kernel.kallsyms] [k] __mod_timer
0.00% ksmd [kernel.kallsyms] [k] __hrtimer_run_queues
0.00% kwin_x11 [kernel.kallsyms] [k] do_vfs_ioctl
0.00% swapper [kernel.kallsyms] [k] run_posix_cpu_timers
0.00% swapper [kernel.kallsyms] [k] __rdgsbase_inactive
0.00% ksmd [kernel.kallsyms] [k] hrtimer_interrupt
0.00% kworker/u16:3-e [kernel.kallsyms] [k] nvkm_object_search
0.00% kworker/u16:3-e [kernel.kallsyms] [k] hrtimer_start_range_ns
0.00% swapper [kernel.kallsyms] [k] __wrgsbase_inactive
0.00% kworker/u16:1-e [kernel.kallsyms] [k] clockevents_program_event
0.00% kworker/u16:1-e [kernel.kallsyms] [k] wq_worker_running
0.00% QSGRenderThread [kernel.kallsyms] [k] ioread32
0.00% swapper [kernel.kallsyms] [k] ct_nmi_exit
0.00% kworker/u16:3-e [kernel.kallsyms] [k] __hrtimer_init
0.00% kworker/u16:1-e [kernel.kallsyms] [k] __schedule
0.00% kworker/u16:3-e [kernel.kallsyms] [k] sched_clock_noinstr
0.00% kworker/u16:1-e [kernel.kallsyms] [k] calc_global_load_tick
0.00% swapper [kernel.kallsyms] [k] load_balance
0.00% swapper [kernel.kallsyms] [k] hrtimer_start_range_ns
0.00% swapper [kernel.kallsyms] [k] irqentry_exit
0.00% ksmd [kernel.kallsyms] [k] psi_group_change
0.00% swapper [kernel.kallsyms] [k] hrtimer_interrupt
0.00% swapper [kernel.kallsyms] [k] rebalance_domains
0.00% plasmashell libKF5Plasma.so.5.113.0 [.] Plasma::Containment::metaObject
0.00% plasmashell [kernel.kallsyms] [k] rb_insert_color
0.00% swapper [kernel.kallsyms] [k] cpuidle_reflect
0.00% swapper [kernel.kallsyms] [k] update_cfs_group
0.00% dmeventd [kernel.kallsyms] [k] update_curr
0.00% plasmashell libc.so.6 [.] __poll
0.00% kworker/u16:1-e [kernel.kallsyms] [k] queued_spin_lock_slowpath
0.00% swapper [kernel.kallsyms] [k] quiet_vmstat
0.00% plasmashell [kernel.kallsyms] [k] call_filldir
0.00% gpm gpm [.] 0x0000000000010470
0.00% gpm [kernel.kallsyms] [k] getname_flags
0.00% QSGRenderThread libQt5Quick.so.5.15.11 [.] 0x0000000000199010
0.00% QSGRenderThread libqxcb-glx-integration.so [.] QXcbWindow::needsSync@plt
0.00% synergys [kernel.kallsyms] [k] do_sys_poll
0.00% plasmashell libQt5Core.so.5.15.11 [.] readdir64@plt
0.00% swapper [kernel.kallsyms] [k] ct_nmi_enter
0.00% plasmashell libQt5Core.so.5.15.11 [.] 0x00000000002dc2b5
0.00% perf [kernel.kallsyms] [k] ext4_journal_check_start
0.00% swapper [kernel.kallsyms] [k] timerqueue_del
0.00% kworker/u16:1-e [kernel.kallsyms] [k] _raw_spin_lock_irqsave
0.00% swapper [kernel.kallsyms] [k] call_cpuidle
0.00% kworker/u16:3-e [kernel.kallsyms] [k] percpu_counter_add_batch
0.00% swapper [kernel.kallsyms] [k] tsc_verify_tsc_adjust
0.00% ksmd [kernel.kallsyms] [k] schedule_timeout
0.00% konsole libQt5XcbQpa.so.5.15.11 [.] QKeyEvent::modifiers@plt
0.00% plasmashell libQt5Core.so.5.15.11 [.] QString::fromLocal8Bit_helper
0.00% kwin_x11 libkwin.so.5.27.10 [.] KWin::Application::dispatchEvent
0.00% swapper [kernel.kallsyms] [k] sysvec_apic_timer_interrupt
0.00% migration/2 [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0
0.00% plasmashell libQt5Core.so.5.15.11 [.] QArrayData::allocate
0.00% kworker/u16:1-e [kernel.kallsyms] [k] hrtimer_active
0.00% plasmashell [kernel.kallsyms] [k] __get_user_1
0.00% synergys [kernel.kallsyms] [k] avg_vruntime
0.00% plasmashell libQt5Core.so.5.15.11 [.] 0x00000000001cdca2
0.00% ksmd [kernel.kallsyms] [k] hrtimer_active
0.00% kworker/u16:1-e [kernel.kallsyms] [k] __switch_to
0.00% ksmd [kernel.kallsyms] [k] nohz_balance_exit_idle.part.0
0.00% konsole libharfbuzz.so.0.60830.0 [.] 0x00000000000a9aa0
0.00% swapper [kernel.kallsyms] [k] rb_erase
0.00% swapper [kernel.kallsyms] [k] activate_task
0.00% plasmashell libQt5Core.so.5.15.11 [.] 0x00000000001d72f3
0.00% swapper [kernel.kallsyms] [k] tick_nohz_idle_retain_tick
0.00% konsole libxcb.so.1.1.0 [.] xcb_send_request64
0.00% swapper [unknown] [.] 0000000000000000
0.00% swapper [kernel.kallsyms] [k] hrtimer_update_next_event
0.00% kworker/7:0-eve [kernel.kallsyms] [k] collect_percpu_times
0.00% plasmashell libQt5Qml.so.5.15.11 [.] QQmlJavaScriptExpression::clearActiveGuards
0.00% perf [kernel.kallsyms] [k] __block_commit_write
0.00% swapper [kernel.kallsyms] [k] __intel_pmu_enable_all.isra.0
0.00% perf [kernel.kallsyms] [k] affine_move_task
0.00% swapper [kernel.kallsyms] [k] tick_nohz_get_sleep_length
0.00% kworker/u16:3-e [kernel.kallsyms] [k] mpage_release_unused_pages
0.00% plasmashell libQt5Qml.so.5.15.11 [.] QQmlData::isSignalConnected
0.00% perf [kernel.kallsyms] [k] mt_find
0.00% xembedsniproxy [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0
0.00% plasmashell libQt5Core.so.5.15.11 [.] 0x0000000000202d04
0.00% migration/3 [kernel.kallsyms] [k] psi_group_change
0.00% swapper [kernel.kallsyms] [k] tick_program_event
0.00% swapper [kernel.kallsyms] [k] cpuidle_get_cpu_driver
0.00% swapper [kernel.kallsyms] [k] account_process_tick
0.00% Qt bearer threa libc.so.6 [.] 0x0000000000093948
0.00% swapper [kernel.kallsyms] [k] __flush_smp_call_function_queue
0.00% kworker/u16:3-e [kernel.kallsyms] [k] xts_crypt
0.00% swapper [kernel.kallsyms] [k] kmem_cache_free
0.00% synergys [kernel.kallsyms] [k] psi_group_change
0.00% avahi-daemon libavahi-common.so.3.5.4 [.] avahi_unescape_label
0.00% migration/0 [kernel.kallsyms] [k] __update_load_avg_se
0.00% swapper [kernel.kallsyms] [k] ct_idle_exit
0.00% swapper [kernel.kallsyms] [k] cpuidle_not_available
0.00% swapper [kernel.kallsyms] [k] error_entry
0.00% swapper [kernel.kallsyms] [k] tick_nohz_idle_got_tick
0.00% X Xorg [.] 0x0000000000094069
0.00% swapper [kernel.kallsyms] [k] try_to_wake_up
0.00% plasmashell libQt5Core.so.5.15.11 [.] 0x0000000000202db6
0.00% swapper [kernel.kallsyms] [k] idle_cpu
0.00% kwin_x11 nouveau_dri.so [.] 0x00000000001342e0
0.00% swapper [kernel.kallsyms] [k] irq_work_needs_cpu
0.00% QXcbEventQueue [kernel.kallsyms] [k] _raw_read_lock_irqsave
0.00% swapper [kernel.kallsyms] [k] nvkm_pci_wr32
0.00% kwin_x11 libkwineffects.so.5.27.10 [.] KWin::WindowPaintData::brightness
0.00% plasmashell libQt5Quick.so.5.15.11 [.] QTextLayout::beginLayout@plt
0.00% QXcbEventQueue [kernel.kallsyms] [k] unix_destruct_scm
0.00% X [kernel.kallsyms] [k] ___slab_alloc.isra.0
0.00% kwin_x11 nouveau_dri.so [.] 0x0000000000070093
0.00% swapper [kernel.kallsyms] [k] psi_task_change
0.00% X Xorg [.] XkbComputeDerivedState
0.00% swapper [kernel.kallsyms] [k] rb_insert_color
0.00% synergys [kernel.kallsyms] [k] newidle_balance.isra.0
0.00% QXcbEventQueue [kernel.kallsyms] [k] __copy_msghdr
0.00% swapper [kernel.kallsyms] [k] __softirqentry_text_start
0.00% kworker/u16:3-e [kernel.kallsyms] [k] ext4_reserve_inode_write
0.00% konsole libc.so.6 [.] 0x0000000000092244
0.00% kwin_x11 libQt5Gui.so.5.15.11 [.] QRegion::~QRegion
0.00% perf [kernel.kallsyms] [k] __rmqueue_pcplist
0.00% konsole libQt5Core.so.5.15.11 [.] 0x00000000002db957
0.00% ksmd [kernel.kallsyms] [k] mas_next_slot
0.00% kwin_x11 libQt5Core.so.5.15.11 [.] QtPrivate::qustrchr
0.00% swapper [kernel.kallsyms] [k] update_load_avg
0.00% swapper [kernel.kallsyms] [k] perf_pmu_nop_void
0.00% plasmashell libc.so.6 [.] 0x00000000000920bf
0.00% synergys [kernel.kallsyms] [k] sock_poll
0.00% QSGRenderThread nouveau_dri.so [.] 0x00000000007474d0
0.00% kwin_x11 [kernel.kallsyms] [k] nvkm_vmm_get_locked
0.00% swapper [kernel.kallsyms] [k] __msecs_to_jiffies
0.00% QXcbEventQueue [kernel.kallsyms] [k] task_h_load
0.00% synergys [kernel.kallsyms] [k] __fget_light
0.00% swapper [kernel.kallsyms] [k] irq_work_tick
0.00% swapper [kernel.kallsyms] [k] irqentry_enter
0.00% kwin_x11 nouveau_dri.so [.] 0x0000000000745aa0
0.00% X [kernel.kallsyms] [k] do_iter_write
0.00% plasmashell libQt5XcbQpa.so.5.15.11 [.] QXcbConnection::handleXcbEvent
0.00% QSGRenderThread [kernel.kallsyms] [k] nvkm_vmm_get_locked
0.00% QSGRenderThread libQt5Quick.so.5.15.11 [.] QSGRenderContext::endSync
0.00% swapper [kernel.kallsyms] [k] arch_cpu_idle_enter
0.00% X [kernel.kallsyms] [k] drain_obj_stock
0.00% swapper [kernel.kallsyms] [k] calc_global_load_tick
0.00% Qt bearer threa [kernel.kallsyms] [k] macvlan_fill_info
0.00% X libdrm_nouveau.so.2.0.0 [.] 0x0000000000004ee2
0.00% synergys libc.so.6 [.] __poll
0.00% swapper [kernel.kallsyms] [k] cpuidle_governor_latency_req
0.00% swapper [kernel.kallsyms] [k] _nohz_idle_balance.isra.0
0.00% X Xorg [.] 0x000000000008207c
0.00% plasmashell libglib-2.0.so.0.7800.3 [.] 0x0000000000059794
0.00% swapper [kernel.kallsyms] [k] irq_exit_rcu
0.00% X [kernel.kallsyms] [k] timestamp_truncate
0.00% plasmashell libglib-2.0.so.0.7800.3 [.] 0x00000000000567c4
0.00% QSGRenderThread nouveau_dri.so [.] 0x000000000024295e
0.00% X [kernel.kallsyms] [k] save_fpregs_to_fpstate
0.00% perf [kernel.kallsyms] [k] lru_add_fn
0.00% swapper [kernel.kallsyms] [k] rcu_preempt_deferred_qs
0.00% swapper [kernel.kallsyms] [k] hrtimer_get_next_event
0.00% plasmashell libc.so.6 [.] 0x0000000000140199
0.00% X [kernel.kallsyms] [k] dequeue_task_fair
0.00% swapper [kernel.kallsyms] [k] __lock_text_start
0.00% swapper [kernel.kallsyms] [k] __remove_hrtimer
0.00% swapper [kernel.kallsyms] [k] rcu_needs_cpu
0.00% swapper [kernel.kallsyms] [k] alx_poll
0.00% swapper [kernel.kallsyms] [k] rcu_segcblist_ready_cbs
0.00% swapper [kernel.kallsyms] [k] task_tick_idle
0.00% swapper [kernel.kallsyms] [k] cr4_update_irqsoff
0.00% plasmashell libQt5Quick.so.5.15.11 [.] 0x000000000020564d
0.00% swapper [kernel.kallsyms] [k] cpu_latency_qos_limit
0.00% swapper [kernel.kallsyms] [k] get_next_timer_interrupt
0.00% InputThread [kernel.kallsyms] [k] __get_user_8
0.00% xembedsniproxy libQt5XcbQpa.so.5.15.11 [.] QXcbConnection::processXcbEvents
0.00% kwin_x11 libxkbcommon.so.0.0.0 [.] xkb_state_key_get_level
0.00% sudo libc.so.6 [.] read
0.00% kworker/u16:3-e [kernel.kallsyms] [k] filemap_get_folios_tag
0.00% InputThread [kernel.kallsyms] [k] ep_item_poll.isra.0
0.00% swapper [kernel.kallsyms] [k] can_stop_idle_tick
0.00% swapper [kernel.kallsyms] [k] __pick_eevdf
0.00% perf [kernel.kallsyms] [k] __fget_light
0.00% InputThread [kernel.kallsyms] [k] _copy_from_iter
0.00% InputThread [kernel.kallsyms] [k] ep_done_scan
0.00% swapper [kernel.kallsyms] [k] netlink_broadcast_filtered
0.00% upsd [kernel.kallsyms] [k] __cgroup_account_cputime
0.00% kworker/7:0-eve [kernel.kallsyms] [k] __cond_resched
0.00% X [kernel.kallsyms] [k] ww_mutex_lock_interruptible
0.00% swapper [kernel.kallsyms] [k] attach_entity_load_avg
0.00% plasmashell libKF5Archive.so.5.113.0 [.] 0x000000000000ea00
0.00% QSGRenderThread nouveau_dri.so [.] 0x000000000037f463
0.00% jbd2/dm-2-8 [kernel.kallsyms] [k] _aesni_enc4
0.00% kwin_x11 [kernel.kallsyms] [k] obj_cgroup_charge
0.00% X nouveau_dri.so [.] 0x0000000000125020
0.00% perf [kernel.kallsyms] [k] fault_in_readable
0.00% perf [kernel.kallsyms] [k] should_failslab
0.00% usbhid-ups [kernel.kallsyms] [k] xhci_ring_ep_doorbell
0.00% kworker/u16:3-e [kernel.kallsyms] [k] put_cpu_partial
0.00% swapper [kernel.kallsyms] [k] ___slab_alloc.isra.0
0.00% kwin_x11 [kernel.kallsyms] [k] evict
0.00% swapper [kernel.kallsyms] [k] sched_clock
0.00% crond libc.so.6 [.] 0x00000000000b1330
0.00% swapper [kernel.kallsyms] [k] update_dl_rq_load_avg
0.00% X libdrm_nouveau.so.2.0.0 [.] nouveau_bo_ref
0.00% perf perf [.] 0x000000000007e2a6
0.00% konsole [kernel.kallsyms] [k] n_tty_read
0.00% synergys [kernel.kallsyms] [k] __schedule
0.00% swapper [kernel.kallsyms] [k] calc_load_nohz_start
0.00% swapper [kernel.kallsyms] [k] tick_irq_enter
0.00% swapper [kernel.kallsyms] [k] skb_release_head_state
0.00% swapper [kernel.kallsyms] [k] task_tick_mm_cid
0.00% swapper [kernel.kallsyms] [k] nohz_csd_func
0.00% swapper [kernel.kallsyms] [k] update_process_times
0.00% perf [kernel.kallsyms] [k] xas_load
0.00% swapper [kernel.kallsyms] [k] update_rt_rq_load_avg
0.00% synergys [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack
0.00% plasmashell libQt5Core.so.5.15.11 [.] 0x00000000002b9526
0.00% plasmashell libc.so.6 [.] _pthread_cleanup_push
0.00% plasmashell libglib-2.0.so.0.7800.3 [.] g_mutex_lock
0.00% synergys synergys [.] 0x000000000004dd9b
0.00% usbhid-ups [kernel.kallsyms] [k] update_cfs_group
0.00% swapper [kernel.kallsyms] [k] sched_clock_cpu
0.00% kglobalaccel5 libxcb-keysyms.so.1.0.0 [.] xcb_key_symbols_get_keysym
0.00% synergys [kernel.kallsyms] [k] pipe_poll
0.00% swapper [kernel.kallsyms] [k] record_times
0.00% swapper [kernel.kallsyms] [k] cpu_startup_entry
0.00% plasmashell libQt5Qml.so.5.15.11 [.] QV4::QObjectWrapper::findProperty
0.00% swapper [kernel.kallsyms] [k] finish_task_switch.isra.0
0.00% kwin_x11 libQt5Core.so.5.15.11 [.] qstrcmp
0.00% synergys [kernel.kallsyms] [k] dequeue_entity
0.00% QXcbEventQueue libxcb.so.1.1.0 [.] 0x000000000000f56e
0.00% kglobalaccel5 libc.so.6 [.] pthread_getspecific
0.00% swapper [kernel.kallsyms] [k] ttwu_do_activate.isra.0
0.00% synergys libxcb.so.1.1.0 [.] xcb_poll_for_event
0.00% synergys [kernel.kallsyms] [k] unix_poll
0.00% konqueror libQt5WebEngineCore.so.5.15.11 [.] 0x0000000002ba3914
0.00% rcu_sched [kernel.kallsyms] [k] rcu_all_qs
0.00% QSGRenderThread [kernel.kallsyms] [k] mutex_spin_on_owner
0.00% konqueror libQt5WebEngineCore.so.5.15.11 [.] 0x0000000002b56bc8
0.00% synergys [kernel.kallsyms] [k] update_cfs_group
0.00% QSGRenderThread [kernel.kallsyms] [k] syscall_return_via_sysret
0.00% synergys synergys [.] pthread_mutex_lock@plt
0.00% synergys [kernel.kallsyms] [k] __switch_to
0.00% at-spi2-registr libglib-2.0.so.0.7800.3 [.] 0x0000000000056e64
0.00% perf [kernel.kallsyms] [k] __get_file_rcu
0.00% synergys [kernel.kallsyms] [k] __switch_to_asm
0.00% swapper [kernel.kallsyms] [k] local_clock_noinstr
0.00% perf [kernel.kallsyms] [k] __filemap_add_folio
0.00% swapper [kernel.kallsyms] [k] trigger_load_balance
0.00% swapper [kernel.kallsyms] [k] xhci_ring_ep_doorbell
0.00% synergys [kernel.kallsyms] [k] __rseq_handle_notify_resume
0.00% swapper [kernel.kallsyms] [k] intel_pmu_disable_all
0.00% kwin_x11 kwin_x11 [.] 0x000000000008ee30
0.00% swapper [kernel.kallsyms] [k] sched_idle_set_state
0.00% swapper [kernel.kallsyms] [k] hrtimer_next_event_without
0.00% upsmon [kernel.kallsyms] [k] __ip_finish_output
0.00% plasmashell libQt5Core.so.5.15.11 [.] QVariant::clear
0.00% perf [kernel.kallsyms] [k] create_empty_buffers
0.00% perf [kernel.kallsyms] [k] memset_orig
0.00% synergys libc.so.6 [.] recvmsg
0.00% baloorunner libQt5XcbQpa.so.5.15.11 [.] 0x0000000000065c0d
0.00% konsole libc.so.6 [.] 0x000000000013d502
0.00% swapper [kernel.kallsyms] [k] update_curr
0.00% QSGRenderThread nouveau_dri.so [.] 0x00000000002428dc
0.00% synergys [kernel.kallsyms] [k] save_fpregs_to_fpstate
0.00% synergys [kernel.kallsyms] [k] __update_load_avg_se
0.00% kworker/u16:1-e [kernel.kallsyms] [k] mem_cgroup_css_rstat_flush
0.00% swapper [kernel.kallsyms] [k] ___bpf_prog_run
0.00% kwin_x11 libQt5Core.so.5.15.11 [.] QArrayData::deallocate
0.00% konqueror libQt5Core.so.5.15.11 [.] qstrcmp
0.00% X libglamoregl.so [.] 0x000000000000c6de
0.00% synergys [kernel.kallsyms] [k] exit_to_user_mode_prepare
0.00% X [kernel.kallsyms] [k] __kmem_cache_alloc_node
0.00% synergys libc.so.6 [.] pthread_mutex_lock
0.00% swapper [kernel.kallsyms] [k] tick_nohz_idle_enter
0.00% swapper [kernel.kallsyms] [k] tick_check_broadcast_expired
0.00% perf [kernel.kallsyms] [k] __fdget_pos
0.00% konqueror libQt5WebEngineCore.so.5.15.11 [.] 0x0000000002b6092c
0.00% ksoftirqd/5 [kernel.kallsyms] [k] load_balance
0.00% kglobalaccel5 ld-linux-x86-64.so.2 [.] __tls_get_addr
0.00% swapper [kernel.kallsyms] [k] perf_swevent_stop
0.00% Qt bearer threa [kernel.kallsyms] [k] inet6_fill_ifla6_attrs
0.00% perf [kernel.kallsyms] [k] copy_page_from_iter_atomic
0.00% swapper [kernel.kallsyms] [k] __call_rcu_common.constprop.0
0.00% swapper [kernel.kallsyms] [k] psi_task_switch
0.00% swapper [kernel.kallsyms] [k] menu_reflect
0.00% synergys [kernel.kallsyms] [k] __update_load_avg_cfs_rq
0.00% :-1 [kernel.kallsyms] [k] proc_invalidate_siblings_dcache
0.00% rcu_sched [kernel.kallsyms] [k] dequeue_task_fair
0.00% swapper [kernel.kallsyms] [k] check_tsc_unstable
0.00% konsole libQt5Core.so.5.15.11 [.] QAbstractEventDispatcherPrivate::releaseTimerId
0.00% konqueror libQt5WebEngineCore.so.5.15.11 [.] 0x0000000002b836e2
0.00% kclockd [kernel.kallsyms] [k] __get_user_8
0.00% usbhid-ups libc.so.6 [.] ioctl
0.00% swapper [kernel.kallsyms] [k] perf_event_task_tick
0.00% swapper [kernel.kallsyms] [k] tun_net_xmit
0.00% rcu_sched [kernel.kallsyms] [k] enqueue_timer
0.00% swapper [kernel.kallsyms] [k] tick_nohz_idle_exit
0.00% swapper [kernel.kallsyms] [k] set_next_entity
0.00% synergys [kernel.kallsyms] [k] syscall_enter_from_user_mode
0.00% swapper [kernel.kallsyms] [k] tick_nohz_irq_exit
0.00% usbhid-ups [kernel.kallsyms] [k] proc_do_submiturb
0.00% usbhid-ups [kernel.kallsyms] [k] usbdev_poll
0.00% kworker/u16:1-e [kernel.kallsyms] [k] enqueue_to_backlog
0.00% ksoftirqd/5 [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0
0.00% kwin_x11 libkwin.so.5.27.10 [.] KWin::RenderLoopPrivate::scheduleRepaint
0.00% :-1 [kernel.kallsyms] [k] wake_up_bit
0.00% synergys [kernel.kallsyms] [k] update_load_avg
0.00% QXcbEventQueue libQt5Core.so.5.15.11 [.] QMutex::lock
0.00% synergys [unknown] [.] 0000000000000000
0.00% kworker/u16:1-e [kernel.kallsyms] [k] record_times
0.00% usbhid-ups [kernel.kallsyms] [k] drain_obj_stock
0.00% konqueror [kernel.kallsyms] [k] refill_stock
0.00% konqueror libQt5WebEngineCore.so.5.15.11 [.] 0x0000000002bc6fff
0.00% perf [kernel.kallsyms] [k] _raw_write_lock
0.00% synergys libX11.so.6.4.0 [.] XPending
0.00% synergys libc.so.6 [.] pthread_mutex_unlock
0.00% synergys synergys [.] poll@plt
0.00% usbhid-ups [kernel.kallsyms] [k] schedule_hrtimeout_range_clock
0.00% synergys synergys [.] pthread_mutex_unlock@plt
0.00% swapper [kernel.kallsyms] [k] schedule_idle
0.00% kworker/5:2-eve [kernel.kallsyms] [k] wq_worker_running
0.00% rcu_sched [kernel.kallsyms] [k] __switch_to_asm
0.00% kworker/u16:3-e [kernel.kallsyms] [k] mem_cgroup_css_rstat_flush
0.00% synergys libX11.so.6.4.0 [.] 0x00000000000440b0
0.00% synergys [kernel.kallsyms] [k] unix_stream_read_generic
0.00% usbhid-ups libusb-1.0.so.0.3.0 [.] 0x0000000000011979
0.00% avahi-daemon libavahi-core.so.7.1.0 [.] avahi_dns_packet_check_valid
0.00% X Xorg [.] 0x00000000000d094e
0.00% synergys libxcb.so.1.1.0 [.] 0x000000000000f56c
0.00% swapper [kernel.kallsyms] [k] wakeup_preempt
0.00% swapper [kernel.kallsyms] [k] avg_vruntime
0.00% swapper [kernel.kallsyms] [k] put_prev_task_idle
0.00% swapper [kernel.kallsyms] [k] _find_next_bit
0.00% plasmashell libc.so.6 [.] malloc
0.00% Qt bearer threa [kernel.kallsyms] [k] kmem_cache_alloc_node
0.00% QXcbEventQueue libQt5Core.so.5.15.11 [.] QThread::eventDispatcher
0.00% Qt bearer threa [kernel.kallsyms] [k] do_syscall_64
0.00% perf [kernel.kallsyms] [k] perf_poll
0.00% X libEGL_mesa.so.0.0.0 [.] 0x0000000000018a27
0.00% synergys [kernel.kallsyms] [k] pick_next_task_fair
0.00% swapper [kernel.kallsyms] [k] enqueue_hrtimer
0.00% rcu_sched [kernel.kallsyms] [k] psi_group_change
0.00% kworker/0:2-eve [kernel.kallsyms] [k] vmstat_shepherd
0.00% perf perf [.] 0x0000000000101078
0.00% perf [kernel.kallsyms] [k] lock_vma_under_rcu
0.00% swapper [kernel.kallsyms] [k] tcp_orphan_count_sum
0.00% kworker/u16:1-e [kernel.kallsyms] [k] _raw_spin_lock_irq
0.00% synergys [kernel.kallsyms] [k] sched_clock_noinstr
0.00% swapper [kernel.kallsyms] [k] __rb_insert_augmented
0.00% swapper [kernel.kallsyms] [k] cpuidle_select
0.00% QSGRenderThread libQt5Quick.so.5.15.11 [.] QSGBatchRenderer::Renderer::buildRenderLists
0.00% QSGRenderThread libQt5Quick.so.5.15.11 [.] QSGBatchRenderer::Renderer::nodeChanged
0.00% kwin_x11 libKF5JobWidgets.so.5.113.0 [.] 0x000000000000fa50
0.00% usbhid-ups [kernel.kallsyms] [k] __cgroup_account_cputime
0.00% usbhid-ups libc.so.6 [.] 0x000000000013e8b0
0.00% konqueror libQt5Core.so.5.15.11 [.] clock_gettime@plt
0.00% swapper [kernel.kallsyms] [k] mm_cid_get
0.00% gmain [kernel.kallsyms] [k] inode_permission
0.00% swapper [kernel.kallsyms] [k] hrtimer_try_to_cancel.part.0
0.00% rcu_sched [kernel.kallsyms] [k] _raw_spin_lock_irqsave
0.00% usbhid-ups libc.so.6 [.] 0x000000000007ad00
0.00% kworker/5:2-eve [kernel.kallsyms] [k] kvfree_rcu_bulk
0.00% synergys [kernel.kallsyms] [k] sockfd_lookup_light
0.00% synergys libc.so.6 [.] 0x000000000008ac00
0.00% swapper [kernel.kallsyms] [k] timerqueue_iterate_next
0.00% synergys [kernel.kallsyms] [k] __get_user_8
0.00% kworker/0:2-eve [kernel.kallsyms] [k] memchr_inv
0.00% swapper [kernel.kallsyms] [k] wb_timer_fn
0.00% perf perf [.] 0x0000000000104467
0.00% swapper [kernel.kallsyms] [k] ct_idle_enter
0.00% synergys libX11.so.6.4.0 [.] 0x0000000000043e60
0.00% usbhid-ups libc.so.6 [.] 0x00000000000826a3
0.00% kworker/u16:3-e [kernel.kallsyms] [k] __mod_memcg_lruvec_state
0.00% synergys synergys [.] 0x0000000000026260
0.00% ksoftirqd/5 [kernel.kallsyms] [k] kthread_should_stop
0.00% synergys synergys [.] 0x0000000000025047
0.00% usbhid-ups libc.so.6 [.] pthread_mutex_trylock
0.00% synergys libxcb.so.1.1.0 [.] 0x0000000000010030
0.00% kworker/5:2-eve [kernel.kallsyms] [k] psi_avgs_work
0.00% synergys [kernel.kallsyms] [k] ____sys_recvmsg
0.00% kwin_x11 libglib-2.0.so.0.7800.3 [.] g_mutex_lock
0.00% synergys [kernel.kallsyms] [k] _copy_from_user
0.00% rcu_sched [kernel.kallsyms] [k] update_min_vruntime
0.00% kwin_x11 libQt5Gui.so.5.15.11 [.] QImageData::~QImageData
0.00% rcu_sched [kernel.kallsyms] [k] rcu_gp_kthread
0.00% synergys synergys [.] 0x0000000000025040
0.00% usbhid-ups [kernel.kallsyms] [k] memcpy_orig
0.00% synergys [kernel.kallsyms] [k] timerqueue_add
0.00% swapper [kernel.kallsyms] [k] tick_nohz_tick_stopped
0.00% swapper [kernel.kallsyms] [k] __put_task_struct
0.00% QXcbEventQueue [kernel.kallsyms] [k] kfree
0.00% dmeventd [kernel.kallsyms] [k] finish_task_switch.isra.0
0.00% perf [kernel.kallsyms] [k] __rdgsbase_inactive
0.00% swapper [kernel.kallsyms] [k] irq_chip_ack_parent
0.00% swapper [kernel.kallsyms] [k] irq_enter_rcu
0.00% usbhid-ups [kernel.kallsyms] [k] __fget_light
0.00% usbhid-ups usbhid-ups [.] 0x000000000001e143
0.00% rcu_sched [kernel.kallsyms] [k] __mod_timer
0.00% synergys libX11.so.6.4.0 [.] 0x0000000000031dab
0.00% ksoftirqd/5 [kernel.kallsyms] [k] __softirqentry_text_start
0.00% synergys [kernel.kallsyms] [k] ___sys_recvmsg
0.00% swapper [kernel.kallsyms] [k] error_return
0.00% swapper [kernel.kallsyms] [k] run_rebalance_domains
0.00% rcu_sched [kernel.kallsyms] [k] check_cfs_rq_runtime
0.00% perf [kernel.kallsyms] [k] do_sys_poll
0.00% rcu_sched [kernel.kallsyms] [k] __update_load_avg_se
0.00% ThreadPoolForeg libQt5WebEngineCore.so.5.15.11 [.] 0x0000000002b8cf74
0.00% rcu_sched [kernel.kallsyms] [k] rcu_implicit_dynticks_qs
0.00% swapper [kernel.kallsyms] [k] atomic_notifier_call_chain
0.00% synergys libX11.so.6.4.0 [.] 0x00000000000476cb
0.00% synergys libX11.so.6.4.0 [.] 0x0000000000031cd0
0.00% swapper [kernel.kallsyms] [k] llist_reverse_order
0.00% rcu_sched [kernel.kallsyms] [k] finish_task_switch.isra.0
0.00% synergys libX11.so.6.4.0 [.] 0x00000000000441d0
0.00% upsmon [kernel.kallsyms] [k] __schedule
0.00% upsmon [kernel.kallsyms] [k] check_stack_object
0.00% usbhid-ups [kernel.kallsyms] [k] usbdev_ioctl
0.00% swapper [kernel.kallsyms] [k] hrtimer_run_queues
0.00% swapper [kernel.kallsyms] [k] i_callback
0.00% swapper [kernel.kallsyms] [k] wake_up_process
0.00% synergys [kernel.kallsyms] [k] _raw_spin_lock_irqsave
0.00% X [kernel.kallsyms] [k] rcu_note_context_switch
0.00% kwin_x11 [kernel.kallsyms] [k] __get_task_ioprio
0.00% kwin_x11 libkwin.so.5.27.10 [.] KWin::Workspace::findClient
0.00% X [kernel.kallsyms] [k] update_min_vruntime
0.00% X libGLdispatch.so.0.0.0 [.] 0x000000000004918b
0.00% synergys libX11.so.6.4.0 [.] 0x0000000000031da8
0.00% konqueror [kernel.kallsyms] [k] unix_poll
0.00% konqueror libKF5WidgetsAddons.so.5.113.0 [.] 0x0000000000075fb0
0.00% rcu_sched [kernel.kallsyms] [k] psi_task_switch
0.00% swapper [kernel.kallsyms] [k] __mod_memcg_lruvec_state
0.00% swapper [kernel.kallsyms] [k] get_nohz_timer_target
0.00% rcu_sched [kernel.kallsyms] [k] avg_vruntime
0.00% X libEGL_mesa.so.0.0.0 [.] 0x0000000000018a20
0.00% X [kernel.kallsyms] [k] drm_file_get_master
0.00% swapper [kernel.kallsyms] [k] timer_clear_idle
0.00% ksoftirqd/5 [kernel.kallsyms] [k] __switch_to_asm
0.00% kwin_x11 libQt5Core.so.5.15.11 [.] malloc@plt
0.00% swapper [kernel.kallsyms] [k] evdev_pass_values.part.0
0.00% synergys libX11.so.6.4.0 [.] xcb_connection_has_error@plt
0.00% swapper [kernel.kallsyms] [k] need_update
0.00% synergys [kernel.kallsyms] [k] __cgroup_account_cputime
0.00% synergys [kernel.kallsyms] [k] remove_wait_queue
0.00% swapper [kernel.kallsyms] [k] first_online_pgdat
0.00% swapper [kernel.kallsyms] [k] raw_spin_rq_lock_nested
0.00% perf [kernel.kallsyms] [k] remote_function
0.00% kwin_x11 [kernel.kallsyms] [k] __get_file_rcu
0.00% :-1 [kernel.kallsyms] [k] evict
0.00% X [kernel.kallsyms] [k] sock_poll
0.00% swapper [kernel.kallsyms] [k] arch_cpu_idle_exit
0.00% synergys [kernel.kallsyms] [k] enter_lazy_tlb
0.00% rcu_sched [kernel.kallsyms] [k] rcu_gp_cleanup
0.00% synergys [kernel.kallsyms] [k] __entry_text_start
0.00% swapper [kernel.kallsyms] [k] irq_work_run_list
0.00% swapper [kernel.kallsyms] [k] place_entity
0.00% perf [kernel.kallsyms] [k] xas_start
0.00% synergys [kernel.kallsyms] [k] copy_msghdr_from_user
0.00% synergys [kernel.kallsyms] [k] syscall_return_via_sysret
0.00% synergys [kernel.kallsyms] [k] schedule_hrtimeout_range_clock
0.00% synergys [kernel.kallsyms] [k] set_normalized_timespec64
0.00% kworker/5:2-eve [kernel.kallsyms] [k] desc_read
0.00% kworker/5:2-eve [kernel.kallsyms] [k] update_min_vruntime
0.00% synergys [kernel.kallsyms] [k] update_min_vruntime
0.00% :-1 [kernel.kallsyms] [k] ___d_drop
0.00% kworker/5:2-eve [kernel.kallsyms] [k] strscpy
0.00% swapper [kernel.kallsyms] [k] __wake_up_common
0.00% swapper [kernel.kallsyms] [k] ep_poll_callback
0.00% rcu_sched [kernel.kallsyms] [k] update_curr
0.00% rcu_sched [kernel.kallsyms] [k] pick_next_task_idle
0.00% rcu_sched [kernel.kallsyms] [k] cpuacct_charge
0.00% InputThread libinput_drv.so [.] 0x0000000000008e92
0.00% swapper [kernel.kallsyms] [k] __smp_call_single_queue
0.00% swapper [kernel.kallsyms] [k] reweight_entity
0.00% rcu_sched [kernel.kallsyms] [k] lock_timer_base
0.00% synergys [kernel.kallsyms] [k] put_prev_task_fair
0.00% kworker/u16:3-e [kernel.kallsyms] [k] dequeue_entity
0.00% konsole libQt5Core.so.5.15.11 [.] 0x00000000002d4f40
0.00% kworker/2:1-mm_ [kernel.kallsyms] [k] collect_percpu_times
0.00% synergys libxcb.so.1.1.0 [.] 0x000000000001004b
0.00% swapper [kernel.kallsyms] [k] hrtimer_forward
0.00% upsmon libc.so.6 [.] strlen@plt
0.00% konqueror libQt5Widgets.so.5.15.11 [.] QApplication::notify
0.00% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
0.00% kworker/u16:3-e [kernel.kallsyms] [k] extract_entropy.constprop.0
0.00% swapper [kernel.kallsyms] [k] skb_network_protocol
0.00% kworker/5:2-eve [kernel.kallsyms] [k] _prb_read_valid
0.00% swapper [kernel.kallsyms] [k] enter_lazy_tlb
0.00% synergys [kernel.kallsyms] [k] dequeue_task_fair
0.00% synergys [kernel.kallsyms] [k] psi_task_switch
0.00% swapper [kernel.kallsyms] [k] flush_smp_call_function_queue
0.00% kworker/u16:3-e [kernel.kallsyms] [k] crypt_page_alloc
0.00% kworker/u16:3-e [kernel.kallsyms] [k] vsnprintf
0.00% kwin_x11 libkwin.so.5.27.10 [.] KWin::X11Window::windowEvent
0.00% swapper [kernel.kallsyms] [k] nsecs_to_jiffies
0.00% synergys [kernel.kallsyms] [k] schedule
0.00% rcu_sched [kernel.kallsyms] [k] dequeue_entity
0.00% synergys [kernel.kallsyms] [k] get_nohz_timer_target
0.00% synergys [kernel.kallsyms] [k] record_times
0.00% synergys synergys [.] 0x000000000004dd96
0.00% synergys [kernel.kallsyms] [k] __x64_sys_poll
0.00% rcu_sched [kernel.kallsyms] [k] __switch_to
0.00% kworker/u16:1-e [kernel.kallsyms] [k] cgroup_rstat_flush_locked
0.00% swapper [kernel.kallsyms] [k] nohz_balance_enter_idle
0.00% swapper [kernel.kallsyms] [k] __switch_to
0.00% avahi-daemon [kernel.kallsyms] [k] free_unref_page_commit
0.00% swapper [kernel.kallsyms] [k] account_idle_ticks
0.00% swapper [kernel.kallsyms] [k] perf_swevent_start
0.00% kworker/0:2-eve [kernel.kallsyms] [k] __rdgsbase_inactive
0.00% rcu_sched [kernel.kallsyms] [k] detach_if_pending
0.00% QXcbEventQueue [kernel.kallsyms] [k] mutex_lock
0.00% perf [kernel.kallsyms] [k] fput
0.00% upsmon [kernel.kallsyms] [k] eth_type_trans
0.00% synergys libX11.so.6.4.0 [.] pthread_mutex_lock@plt
0.00% kworker/0:2-eve [kernel.kallsyms] [k] enqueue_timer
0.00% kwin_x11 KF5WindowSystemX11Plugin.so [.] qstrcmp@plt
0.00% usbhid-ups [kernel.kallsyms] [k] __kmem_cache_alloc_node
0.00% QXcbEventQueue libc.so.6 [.] malloc
0.00% kscreen_backend libQt5XcbQpa.so.5.15.11 [.] xcb_flush@plt
0.00% QXcbEventQueue [kernel.kallsyms] [k] __wake_up_common
0.00% avahi-daemon [kernel.kallsyms] [k] pipe_write
0.00% gmain [kernel.kallsyms] [k] restore_fpregs_from_fpstate
0.00% swapper [kernel.kallsyms] [k] pick_next_task_idle
0.00% swapper [kernel.kallsyms] [k] timekeeping_max_deferment
0.00% rcu_sched [kernel.kallsyms] [k] __note_gp_changes
0.00% swapper [kernel.kallsyms] [k] ct_irq_exit
0.00% usbhid-ups usbhid-ups [.] 0x000000000001d21b
0.00% gmain libgio-2.0.so.0.7800.3 [.] g_list_free@plt
0.00% kworker/2:1-mm_ [kernel.kallsyms] [k] refresh_cpu_vm_stats
0.00% swapper [kernel.kallsyms] [k] br_config_bpdu_generation
0.00% swapper [kernel.kallsyms] [k] process_timeout
0.00% kworker/5:2-eve [kernel.kallsyms] [k] psi_group_change
0.00% kwin_x11 libc.so.6 [.] pthread_getspecific
0.00% swapper [kernel.kallsyms] [k] free_unref_page_prepare
0.00% X libc.so.6 [.] __errno_location
0.00% rcu_sched [kernel.kallsyms] [k] schedule
0.00% kworker/5:2-eve [kernel.kallsyms] [k] notifier_call_chain
0.00% dmeventd [kernel.kallsyms] [k] cpuacct_charge
0.00% synergys [kernel.kallsyms] [k] do_syscall_64
0.00% GUsbEventThread libusb-1.0.so.0.3.0 [.] pthread_mutex_unlock@plt
0.00% swapper [kernel.kallsyms] [k] list_add_leaf_cfs_rq
0.00% synergys [kernel.kallsyms] [k] finish_task_switch.isra.0
0.00% synergys libX11.so.6.4.0 [.] _XSend@plt
0.00% synergys [kernel.kallsyms] [k] sched_clock_cpu
0.00% swapper [kernel.kallsyms] [k] find_busiest_group
0.00% kworker/0:2-eve [kernel.kallsyms] [k] worker_thread
0.00% synergys synergys [.] 0x00000000000356fd
0.00% ksoftirqd/7 [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0
0.00% kworker/0:2-eve [kernel.kallsyms] [k] _raw_spin_lock_irqsave
0.00% swapper [kernel.kallsyms] [k] __slab_free.isra.0
0.00% X [kernel.kallsyms] [k] switch_fpu_return
0.00% swapper [kernel.kallsyms] [k] hrtimer_reprogram
0.00% QXcbEventQueue [kernel.kallsyms] [k] __schedule
0.00% QXcbEventQueue libxcb.so.1.1.0 [.] pthread_mutex_lock@plt
0.00% swapper [kernel.kallsyms] [k] ipt_do_table
0.00% synergys [kernel.kallsyms] [k] __hrtimer_init
0.00% kworker/dying [kernel.kallsyms] [k] queued_spin_lock_slowpath
0.00% ksoftirqd/5 [kernel.kallsyms] [k] smpboot_thread_fn
0.00% avahi-daemon [kernel.kallsyms] [k] __get_user_8
0.00% kworker/5:2-eve [kernel.kallsyms] [k] enqueue_timer
0.00% kworker/0:2-eve [kernel.kallsyms] [k] collect_percpu_times
0.00% synergys libc.so.6 [.] 0x00000000000826cd
0.00% swapper [kernel.kallsyms] [k] macvlan_forward_source
0.00% kworker/0:2-eve [kernel.kallsyms] [k] get_pfnblock_flags_mask
0.00% swapper [kernel.kallsyms] [k] raise_softirq
0.00% rcu_sched [kernel.kallsyms] [k] rcu_gp_init
0.00% kworker/0:2-eve [kernel.kallsyms] [k] lock_timer_base
0.00% perf [kernel.kallsyms] [k] event_function_call
0.00% synergys [kernel.kallsyms] [k] update_curr
0.00% swapper [kernel.kallsyms] [k] ip_route_input_slow
0.00% swapper [kernel.kallsyms] [k] sched_clock_tick
0.00% swapper [kernel.kallsyms] [k] __nf_conntrack_find_get.isra.0
0.00% perf [kernel.kallsyms] [k] __intel_pmu_enable_all.isra.0
0.00% gmain libc.so.6 [.] clock_gettime
0.00% kworker/5:2-eve [kernel.kallsyms] [k] psi_task_switch
0.00% swapper [kernel.kallsyms] [k] input_event_dispose
0.00% swapper [kernel.kallsyms] [k] __next_timer_interrupt
0.00% swapper [kernel.kallsyms] [k] ct_irq_enter
0.00% kwin_x11 libc.so.6 [.] 0x0000000000082620
0.00% dmeventd libc.so.6 [.] 0x0000000000087dfd
0.00% perf [kernel.kallsyms] [k] perf_ctx_enable.constprop.0
0.00% kworker/4:2-eve [kernel.kallsyms] [k] fold_diff
0.00% rcu_sched [kernel.kallsyms] [k] put_prev_task_fair
0.00% swapper [kernel.kallsyms] [k] tick_nohz_get_next_hrtimer
0.00% usbhid-ups [kernel.kallsyms] [k] unix_poll
0.00% rcu_sched [kernel.kallsyms] [k] __schedule
0.00% rcu_sched [kernel.kallsyms] [k] update_rq_clock.part.0
0.00% swapper [kernel.kallsyms] [k] put_cpu_partial
0.00% perf [kernel.kallsyms] [k] nmi_restore
0.00% rcu_sched [kernel.kallsyms] [k] __timer_delete_sync
0.00% kworker/3:2-mm_ [kernel.kallsyms] [k] lru_add_drain_per_cpu
0.00% swapper [kernel.kallsyms] [k] local_touch_nmi
0.00% swapper [kernel.kallsyms] [k] rcu_cblist_dequeue
0.00% swapper [kernel.kallsyms] [k] notifier_call_chain
0.00% swapper [kernel.kallsyms] [k] update_rq_clock
0.00% rcu_sched [kernel.kallsyms] [k] force_qs_rnp
0.00% swapper [kernel.kallsyms] [k] __mod_timer
0.00% swapper [kernel.kallsyms] [k] update_group_capacity
0.00% rcu_sched [kernel.kallsyms] [k] __lock_text_start
0.00% rcu_sched [kernel.kallsyms] [k] newidle_balance.isra.0
0.00% rcu_sched [kernel.kallsyms] [k] _raw_spin_lock
0.00% rcu_sched [kernel.kallsyms] [k] schedule_timeout
0.00% swapper [kernel.kallsyms] [k] __enqueue_entity
0.00% swapper [kernel.kallsyms] [k] put_ucounts
0.00% perf [kernel.kallsyms] [k] native_apic_msr_write

2024-01-25 00:03:25

by Song Liu

Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Thanks for the information!


On Tue, Jan 23, 2024 at 3:58 PM Dan Moulding <[email protected]> wrote:
>
> > This appears the md thread hit some infinite loop, so I would like to
> > know what it is doing. We can probably get the information with the
> > perf tool, something like:
> >
> > perf record -a
> > perf report
>
> Here you go!
>
> # Total Lost Samples: 0
> #
> # Samples: 78K of event 'cycles'
> # Event count (approx.): 83127675745
> #
> # Overhead Command Shared Object Symbol
> # ........ ............... .............................. ..................................................
> #
> 49.31% md0_raid5 [kernel.kallsyms] [k] handle_stripe
> 18.63% md0_raid5 [kernel.kallsyms] [k] ops_run_io
> 6.07% md0_raid5 [kernel.kallsyms] [k] handle_active_stripes.isra.0
> 5.50% md0_raid5 [kernel.kallsyms] [k] do_release_stripe
> 3.09% md0_raid5 [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> 2.48% md0_raid5 [kernel.kallsyms] [k] r5l_write_stripe
> 1.89% md0_raid5 [kernel.kallsyms] [k] md_wakeup_thread
> 1.45% ksmd [kernel.kallsyms] [k] ksm_scan_thread
> 1.37% md0_raid5 [kernel.kallsyms] [k] stripe_is_lowprio
> 0.87% ksmd [kernel.kallsyms] [k] memcmp
> 0.68% ksmd [kernel.kallsyms] [k] xxh64
> 0.56% md0_raid5 [kernel.kallsyms] [k] __wake_up_common
> 0.52% md0_raid5 [kernel.kallsyms] [k] __wake_up
> 0.46% ksmd [kernel.kallsyms] [k] mtree_load
> 0.44% ksmd [kernel.kallsyms] [k] try_grab_page
> 0.40% ksmd [kernel.kallsyms] [k] follow_p4d_mask.constprop.0
> 0.39% md0_raid5 [kernel.kallsyms] [k] r5l_log_disk_error
> 0.37% md0_raid5 [kernel.kallsyms] [k] _raw_spin_lock_irq
> 0.33% md0_raid5 [kernel.kallsyms] [k] release_stripe_list
> 0.31% md0_raid5 [kernel.kallsyms] [k] release_inactive_stripe_list

It appears the thread is indeed doing something. I haven't had any
luck reproducing this on my hosts. Could you please check whether the
following change fixes the issue (without reverting 0de40f76d567)? I
will keep trying to reproduce the issue on my side.

Junxiao,

Please also help look into this.

Thanks,
Song

2024-01-25 16:51:30

by Junxiao Bi

Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Hi Dan,

Thanks for the report.

Can you define the hang? Are there any hung-task warnings or other
errors in dmesg? Is any process stuck in D state, and if so, what is
its call trace? From the perf result, it looks like the raid thread is
doing some real work; it may be issuing I/O, since ops_run_io() took
around 20% of the CPU. Please share "iostat -xz 1" output while the
workload is running. I am wondering whether this is a performance
issue with the workload.
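
For the D state processes, something like the following should
capture the call traces (sketch; assumes root and that sysrq is
enabled):

# list tasks stuck in uninterruptible sleep and what they wait on
ps axo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
# dump kernel stacks of all blocked tasks into the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -n 100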

Thanks,

Junxiao.

On 1/24/24 4:01 PM, Song Liu wrote:
> Thanks for the information!
>
>
> On Tue, Jan 23, 2024 at 3:58 PM Dan Moulding <[email protected]> wrote:
>>> This appears the md thread hit some infinite loop, so I would like to
>>> know what it is doing. We can probably get the information with the
>>> perf tool, something like:
>>>
>>> perf record -a
>>> perf report
>> Here you go!
>>
>> # Total Lost Samples: 0
>> #
>> # Samples: 78K of event 'cycles'
>> # Event count (approx.): 83127675745
>> #
>> # Overhead Command Shared Object Symbol
>> # ........ ............... .............................. ...................................................
>> #
>> 49.31% md0_raid5 [kernel.kallsyms] [k] handle_stripe
>> 18.63% md0_raid5 [kernel.kallsyms] [k] ops_run_io
>> 6.07% md0_raid5 [kernel.kallsyms] [k] handle_active_stripes.isra.0
>> 5.50% md0_raid5 [kernel.kallsyms] [k] do_release_stripe
>> 3.09% md0_raid5 [kernel.kallsyms] [k] _raw_spin_lock_irqsave
>> 2.48% md0_raid5 [kernel.kallsyms] [k] r5l_write_stripe
>> 1.89% md0_raid5 [kernel.kallsyms] [k] md_wakeup_thread
>> 1.45% ksmd [kernel.kallsyms] [k] ksm_scan_thread
>> 1.37% md0_raid5 [kernel.kallsyms] [k] stripe_is_lowprio
>> 0.87% ksmd [kernel.kallsyms] [k] memcmp
>> 0.68% ksmd [kernel.kallsyms] [k] xxh64
>> 0.56% md0_raid5 [kernel.kallsyms] [k] __wake_up_common
>> 0.52% md0_raid5 [kernel.kallsyms] [k] __wake_up
>> 0.46% ksmd [kernel.kallsyms] [k] mtree_load
>> 0.44% ksmd [kernel.kallsyms] [k] try_grab_page
>> 0.40% ksmd [kernel.kallsyms] [k] follow_p4d_mask.constprop.0
>> 0.39% md0_raid5 [kernel.kallsyms] [k] r5l_log_disk_error
>> 0.37% md0_raid5 [kernel.kallsyms] [k] _raw_spin_lock_irq
>> 0.33% md0_raid5 [kernel.kallsyms] [k] release_stripe_list
>> 0.31% md0_raid5 [kernel.kallsyms] [k] release_inactive_stripe_list
> It appears the thread is indeed doing something. I haven't got luck to
> reproduce this on my hosts. Could you please try whether the following
> change fixes the issue (without reverting 0de40f76d567)? I will try to
> reproduce the issue on my side.
>
> Junxiao,
>
> Please also help look into this.
>
> Thanks,
> Song

2024-01-25 19:41:29

by Song Liu

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

On Thu, Jan 25, 2024 at 8:44 AM <[email protected]> wrote:
>
> Hi Dan,
>
> Thanks for the report.
>
> Can you define the hung? No hung task or other error from dmesg, any
> process in D status and what is the call trace if there is? From the
> perf result, looks like the raid thread is doing some real job, it may
> be issuing io since ops_run_io() took around 20% cpu, please share
> "iostat -xz 1" while the workload is running, i am wondering is this
> some performance issue with the workload?

I am hoping to get a repro on my side. From the information shared
by Dan, the md thread is busy looping on some stripes. The issue
probably only triggers with a raid5 journal.

Thanks,
Song

2024-01-25 20:31:52

by Dan Moulding

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Hi Junxiao,

I first noticed this problem the next day after I had upgraded some
machines to the 6.7.1 kernel. One of the machines is a backup server.
Just a few hours after the upgrade to 6.7.1, it started running its
overnight backup jobs. Those backup jobs hung part way through. When I
tried to check on the backups in the morning, I found the server
mostly unresponsive. I could SSH in but most shell commands would just
hang. I was able to run top and see that the md0_raid5 kernel thread
was using 100% CPU. I tried to reboot the server, but it wasn't able
to successfully shutdown and eventually I had to hard reset it.

The next day, the same sequence of events occurred on that server
again when it tried to run its backup jobs. Then the following day, I
experienced another hang on a different machine, with a similar RAID-5
configuration. That time I was scp'ing a large file to a virtual
machine whose image was stored on the RAID-5 array. Part way through
the transfer scp reported that the transfer had stalled. I checked top
on that machine and found once again that the md0_raid5 kernel thread
was using 100% CPU.

Yesterday I created a fresh Fedora 39 VM for the purposes of
reproducing this problem in a different environment (the other two
machines are both Gentoo servers running v6.7 kernels straight from
the stable trees with a custom kernel configuration). I am able to
reproduce the problem on Fedora 39 running both the v6.6.13 stable
tree kernel code and the Fedora 39 6.6.13 distribution kernel.

On this Fedora 39 VM, I created a 1GiB LVM volume to use as the RAID-5
journal from space on the "boot" disk. Then I attached 3 additional
100 GiB virtual disks and created the RAID-5 from those 3 disks and
the write-journal device. I then created a new LVM volume group from
the md0 array and created one LVM logical volume named "data", using
all but 64GiB of the available VG space. I then created an ext4 file
system on the "data" volume, mounted it, and used "dd" to copy 1MiB
blocks from /dev/urandom to a file on the "data" file system, and just
let it run. Eventually "dd" hangs and top shows that md0_raid5 is
using 100% CPU.
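
In rough outline, the setup described above corresponds to commands along
these lines (reconstructed from the description and the listings further
below, not a verbatim transcript; device and volume names are taken from
those listings):

  # write-journal LV carved out of the boot disk (sda4 is the 1 GiB LVM partition)
  pvcreate /dev/sda4
  vgcreate journal /dev/sda4
  lvcreate -n journal -l 100%FREE journal

  # RAID-5 over the three 100 GiB disks plus the journal device
  mdadm --create /dev/md0 --level=5 --raid-devices=3 \
        --write-journal=/dev/journal/journal /dev/sdc /dev/sdd /dev/sde

  # LVM on top of the array, leaving ~64 GiB of the VG unused
  pvcreate /dev/md0
  vgcreate array /dev/md0
  lvcreate -n data -L 136G array

  # ext4 on the data LV, mounted at /data for the dd workload
  mkfs.ext4 /dev/array/data
  mount /dev/array/data /data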

Here is an example command I just ran, which has hung after writing
4.1 GiB of random data to the array:

test@localhost:~$ dd if=/dev/urandom bs=1M of=/data/random.dat status=progress
4410310656 bytes (4.4 GB, 4.1 GiB) copied, 324 s, 13.6 MB/s

Top shows md0_raid5 using 100% CPU and dd in the "D" state:

top - 19:10:07 up 14 min, 1 user, load average: 7.00, 5.93, 3.30
Tasks: 246 total, 2 running, 244 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 37.5 id, 50.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 1963.4 total, 81.6 free, 490.7 used, 1560.2 buff/cache
MiB Swap: 1963.0 total, 1962.5 free, 0.5 used. 1472.7 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
993 root 20 0 0 0 0 R 99.9 0.0 7:19.08 md0_raid5
1461 root 20 0 0 0 0 I 0.0 0.0 0:00.17 kworker/1+
18 root 20 0 0 0 0 I 0.0 0.0 0:00.12 rcu_preem+
1071 systemd+ 20 0 16240 7480 6712 S 0.0 0.4 0:00.22 systemd-o+
1136 root 20 0 504124 27960 27192 S 0.0 1.4 0:00.26 rsyslogd
1 root 20 0 75356 27884 10456 S 0.0 1.4 0:01.48 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
..
1417 test 20 0 222668 3120 2096 D 0.0 0.2 0:10.45 dd

The dd process stack shows this:

test@localhost:~$ sudo cat /proc/1417/stack
[<0>] do_get_write_access+0x266/0x3f0
[<0>] jbd2_journal_get_write_access+0x5f/0x80
[<0>] __ext4_journal_get_write_access+0x74/0x170
[<0>] ext4_reserve_inode_write+0x61/0xc0
[<0>] __ext4_mark_inode_dirty+0x78/0x240
[<0>] ext4_dirty_inode+0x5b/0x80
[<0>] __mark_inode_dirty+0x57/0x390
[<0>] generic_update_time+0x4e/0x60
[<0>] file_modified+0xa1/0xb0
[<0>] ext4_buffered_write_iter+0x54/0x100
[<0>] vfs_write+0x23b/0x420
[<0>] ksys_write+0x6f/0xf0
[<0>] do_syscall_64+0x5d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8

I have run that dd test in the VM several times (I have to power cycle
the VM in between tests since each time it hangs it won't successfully
reboot). I also tested creating an LVM snapshot of the "data" LV while
the "dd" is running, and from the few runs I've done it seems the problem
might reproduce more easily when the LVM snapshot exists (the snapshot
would act as a write amplifier, since it performs a copy-on-write
operation while dd is writing to the data LV, and perhaps that helps to
induce the problem). However, the backup server I mentioned above does
not utilize LVM snapshots, so I know that an LVM snapshot isn't
required to cause the problem.
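
For completeness, a snapshot of that kind can be created along these
lines (the snapshot size here is only illustrative):

  # copy-on-write snapshot of the data LV
  lvcreate --snapshot --name data-snap --size 32G array/data

  # remove it again after the test
  lvremove -y array/data-snap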

Below I will include a (hopefully) complete description of how this VM
is configured which might aid in efforts to reproduce the problem.

I hope this helps to understand the nature of the problem, and will be
of assistance in diagnosing or reproducing the issue.

-- Dan


test@localhost:~$ ls -ld /sys/block/sd*
lrwxrwxrwx. 1 root root 0 Jan 25 19:59 /sys/block/sda -> ../devices/pci0000:00/0000:00:02.2/0000:03:00.0/virtio2/host6/target6:0:0/6:0:0:0/block/sda
lrwxrwxrwx. 1 root root 0 Jan 25 19:59 /sys/block/sdb -> ../devices/pci0000:00/0000:00:02.2/0000:03:00.0/virtio2/host6/target6:0:0/6:0:0:4/block/sdb
lrwxrwxrwx. 1 root root 0 Jan 25 19:59 /sys/block/sdc -> ../devices/pci0000:00/0000:00:02.2/0000:03:00.0/virtio2/host6/target6:0:0/6:0:0:3/block/sdc
lrwxrwxrwx. 1 root root 0 Jan 25 19:59 /sys/block/sdd -> ../devices/pci0000:00/0000:00:02.2/0000:03:00.0/virtio2/host6/target6:0:0/6:0:0:2/block/sdd
lrwxrwxrwx. 1 root root 0 Jan 25 19:59 /sys/block/sde -> ../devices/pci0000:00/0000:00:02.2/0000:03:00.0/virtio2/host6/target6:0:0/6:0:0:1/block/sde




test@localhost:~$ sudo fdisk -l /dev/sd[a,b,c,d,e]
Disk /dev/sda: 32 GiB, 34359738368 bytes, 67108864 sectors
Disk model: QEMU HARDDISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 3B52A5A1-29BD-436B-8145-EEF27D9EFC97

Device Start End Sectors Size Type
/dev/sda1 2048 4095 2048 1M BIOS boot
/dev/sda2 4096 2101247 2097152 1G Linux filesystem
/dev/sda3 2101248 14678015 12576768 6G Linux LVM
/dev/sda4 14678016 16777215 2099200 1G Linux LVM
/dev/sda5 16777216 67106815 50329600 24G Linux LVM


Disk /dev/sdb: 32 GiB, 34359738368 bytes, 67108864 sectors
Disk model: QEMU HARDDISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/sdc: 100 GiB, 107374182400 bytes, 209715200 sectors
Disk model: QEMU HARDDISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/sdd: 100 GiB, 107374182400 bytes, 209715200 sectors
Disk model: QEMU HARDDISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/sde: 100 GiB, 107374182400 bytes, 209715200 sectors
Disk model: QEMU HARDDISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes




test@localhost:~$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 dm-1[0](J) sdd[4] sde[3] sdc[1]
209711104 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>




test@localhost:~$ sudo pvs
PV VG Fmt Attr PSize PFree
/dev/md0 array lvm2 a-- 199.99g 0
/dev/sda3 sysvg lvm2 a-- <6.00g 0
/dev/sda4 journal lvm2 a-- 1.00g 0
/dev/sda5 sysvg lvm2 a-- <24.00g 0
/dev/sdb sysvg lvm2 a-- <32.00g 0




test@localhost:~$ sudo vgs
VG #PV #LV #SN Attr VSize VFree
array 1 1 0 wz--n- 199.99g 63.99g
journal 1 1 0 wz--n- 1.00g 0
sysvg 3 1 0 wz--n- <61.99g 0




test@localhost:~$ sudo lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
data array -wi-ao---- 136.00g
journal journal -wi-ao---- 1.00g
root sysvg -wi-ao---- <61.99g




test@localhost:~$ sudo blkid
/dev/mapper/journal-journal: UUID="3e3379f6-ef7a-6bd4-adc1-1c00e328a556" UUID_SUB="446d19c8-d56c-6938-a82f-ff8d52ba1772" LABEL="localhost.localdomain:0" TYPE="linux_raid_member"
/dev/sdd: UUID="3e3379f6-ef7a-6bd4-adc1-1c00e328a556" UUID_SUB="5ab8d465-102a-d333-b1fa-012bd73d7cf5" LABEL="localhost.localdomain:0" TYPE="linux_raid_member"
/dev/sdb: UUID="Gj7d9g-LgcN-LLVl-iv37-DFZy-U0mz-s5nt3e" TYPE="LVM2_member"
/dev/md0: UUID="LcJ3i3-8Gfc-vs1g-ZNZc-8m0G-bPI3-l87W4X" TYPE="LVM2_member"
/dev/mapper/sysvg-root: LABEL="sysroot" UUID="22b0112a-6f38-41d6-921e-2492a19008f0" BLOCK_SIZE="512" TYPE="xfs"
/dev/sde: UUID="3e3379f6-ef7a-6bd4-adc1-1c00e328a556" UUID_SUB="171616cc-ce88-94be-affe-00933b8a7a30" LABEL="localhost.localdomain:0" TYPE="linux_raid_member"
/dev/sdc: UUID="3e3379f6-ef7a-6bd4-adc1-1c00e328a556" UUID_SUB="6d28b122-c1b6-973d-f8df-0834756581f0" LABEL="localhost.localdomain:0" TYPE="linux_raid_member"
/dev/sda4: UUID="ceH3kP-hljE-T6q4-W2qI-Iutm-Vf2N-Uz4omD" TYPE="LVM2_member" PARTUUID="2ed40d4b-f8b2-4c86-b8ca-61216a0c3f48"
/dev/sda2: UUID="c2192edb-0767-464b-9c3a-29d2d8e11c6e" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="effdb052-4887-4571-84df-5c5df132d702"
/dev/sda5: UUID="MEAcyI-qQwk-shwO-Y8qv-EFGa-ggpm-t6NhAV" TYPE="LVM2_member" PARTUUID="343aa231-9f62-46e2-b412-66640d153840"
/dev/sda3: UUID="yKUg0d-XqD2-5IEA-GFkd-6kDc-jVLz-cntwkj" TYPE="LVM2_member" PARTUUID="0dfa0e2d-f467-4e26-b013-9c965ed5a95c"
/dev/zram0: LABEL="zram0" UUID="5087ad0b-ec76-4de7-bbeb-7f39dd1ae318" TYPE="swap"
/dev/mapper/array-data: UUID="fcb29d49-5546-487f-9620-18afb0eeee90" BLOCK_SIZE="4096" TYPE="ext4"
/dev/sda1: PARTUUID="93d0bf6a-463d-4a2a-862f-0a4026964d54"




test@localhost:~$ lsblk -i
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 32G 0 disk
|-sda1 8:1 0 1M 0 part
|-sda2 8:2 0 1G 0 part /boot
|-sda3 8:3 0 6G 0 part
| `-sysvg-root 253:0 0 62G 0 lvm /
|-sda4 8:4 0 1G 0 part
| `-journal-journal 253:1 0 1G 0 lvm
| `-md0 9:0 0 200G 0 raid5
| `-array-data 253:3 0 136G 0 lvm /data
`-sda5 8:5 0 24G 0 part
`-sysvg-root 253:0 0 62G 0 lvm /
sdb 8:16 0 32G 0 disk
`-sysvg-root 253:0 0 62G 0 lvm /
sdc 8:32 0 100G 0 disk
`-md0 9:0 0 200G 0 raid5
`-array-data 253:3 0 136G 0 lvm /data
sdd 8:48 0 100G 0 disk
`-md0 9:0 0 200G 0 raid5
`-array-data 253:3 0 136G 0 lvm /data
sde 8:64 0 100G 0 disk
`-md0 9:0 0 200G 0 raid5
`-array-data 253:3 0 136G 0 lvm /data
zram0 252:0 0 1.9G 0 disk [SWAP]




test@localhost:~$ findmnt --ascii
TARGET SOURCE FSTYPE OPTIONS
/ /dev/mapper/sysvg-root xfs rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota
|-/dev devtmpfs devtmpfs rw,nosuid,seclabel,size=4096k,nr_inodes=245904,mode=755,inode64
| |-/dev/hugepages hugetlbfs hugetlbfs rw,nosuid,nodev,relatime,seclabel,pagesize=2M
| |-/dev/mqueue mqueue mqueue rw,nosuid,nodev,noexec,relatime,seclabel
| |-/dev/shm tmpfs tmpfs rw,nosuid,nodev,seclabel,inode64
| `-/dev/pts devpts devpts rw,nosuid,noexec,relatime,seclabel,gid=5,mode=620,ptmxmode=000
|-/sys sysfs sysfs rw,nosuid,nodev,noexec,relatime,seclabel
| |-/sys/fs/selinux selinuxfs selinuxfs rw,nosuid,noexec,relatime
| |-/sys/kernel/debug debugfs debugfs rw,nosuid,nodev,noexec,relatime,seclabel
| |-/sys/kernel/tracing tracefs tracefs rw,nosuid,nodev,noexec,relatime,seclabel
| |-/sys/fs/fuse/connections fusectl fusectl rw,nosuid,nodev,noexec,relatime
| |-/sys/kernel/security securityfs securityfs rw,nosuid,nodev,noexec,relatime
| |-/sys/fs/cgroup cgroup2 cgroup2 rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate,memory_recursiveprot
| |-/sys/fs/pstore pstore pstore rw,nosuid,nodev,noexec,relatime,seclabel
| |-/sys/fs/bpf bpf bpf rw,nosuid,nodev,noexec,relatime,mode=700
| `-/sys/kernel/config configfs configfs rw,nosuid,nodev,noexec,relatime
|-/proc proc proc rw,nosuid,nodev,noexec,relatime
| `-/proc/sys/fs/binfmt_misc systemd-1 autofs rw,relatime,fd=34,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=8800
| `-/proc/sys/fs/binfmt_misc binfmt_misc binfmt_misc rw,nosuid,nodev,noexec,relatime
|-/run tmpfs tmpfs rw,nosuid,nodev,seclabel,size=402112k,nr_inodes=819200,mode=755,inode64
| `-/run/user/1000 tmpfs tmpfs rw,nosuid,nodev,relatime,seclabel,size=201056k,nr_inodes=50264,mode=700,uid=1000,gid=1000,inode64
|-/tmp tmpfs tmpfs rw,nosuid,nodev,seclabel,nr_inodes=1048576,inode64
|-/boot /dev/sda2 ext4 rw,relatime,seclabel
|-/data /dev/mapper/array-data ext4 rw,relatime,seclabel,stripe=256
`-/var/lib/nfs/rpc_pipefs sunrpc rpc_pipefs rw,relatime




(On virtual machine host)
$ sudo virsh dumpxml raid5-test-Fedora-Server-39-x86_64
<domain type='kvm' id='48'>
<name>raid5-test-Fedora-Server-39-x86_64</name>
<uuid>abb4cad1-35a4-4209-9da1-01e1cf3463da</uuid>
<metadata>
<libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/domain/1.0">
<libosinfo:os id="http://fedoraproject.org/fedora/38"/>
</libosinfo:libosinfo>
</metadata>
<memory unit='KiB'>2097152</memory>
<currentMemory unit='KiB'>2097152</currentMemory>
<vcpu placement='static'>8</vcpu>
<resource>
<partition>/machine</partition>
</resource>
<os>
<type arch='x86_64' machine='pc-q35-8.0'>hvm</type>
<boot dev='hd'/>
</os>
<features>
<acpi/>
<apic/>
<vmport state='off'/>
</features>
<cpu mode='host-passthrough' check='none' migratable='on'/>
<clock offset='utc'>
<timer name='rtc' tickpolicy='catchup'/>
<timer name='pit' tickpolicy='delay'/>
<timer name='hpet' present='no'/>
</clock>
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>destroy</on_crash>
<pm>
<suspend-to-mem enabled='no'/>
<suspend-to-disk enabled='no'/>
</pm>
<devices>
<emulator>/usr/bin/qemu-system-x86_64</emulator>
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2'/>
<source file='/var/lib/libvirt/images/raid5-test-Fedora-Server-39-x86_64.qcow2' index='5'/>
<backingStore/>
<target dev='sda' bus='scsi'/>
<alias name='scsi0-0-0-0'/>
<address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2'/>
<source file='/var/lib/libvirt/images/raid5-test-Fedora-Server-39-x86_64-raid-1.qcow2' index='4'/>
<backingStore/>
<target dev='sdb' bus='scsi'/>
<alias name='scsi0-0-0-1'/>
<address type='drive' controller='0' bus='0' target='0' unit='1'/>
</disk>
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2'/>
<source file='/var/lib/libvirt/images/raid5-test-Fedora-Server-39-x86_64-raid-2.qcow2' index='3'/>
<backingStore/>
<target dev='sdc' bus='scsi'/>
<alias name='scsi0-0-0-2'/>
<address type='drive' controller='0' bus='0' target='0' unit='2'/>
</disk>
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2'/>
<source file='/var/lib/libvirt/images/raid5-test-Fedora-Server-39-x86_64-raid-3.qcow2' index='2'/>
<backingStore/>
<target dev='sdd' bus='scsi'/>
<alias name='scsi0-0-0-3'/>
<address type='drive' controller='0' bus='0' target='0' unit='3'/>
</disk>
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2'/>
<source file='/var/lib/libvirt/images/raid5-test-Fedora-Server-39-x86_64-1.qcow2' index='1'/>
<backingStore/>
<target dev='sde' bus='scsi'/>
<alias name='scsi0-0-0-4'/>
<address type='drive' controller='0' bus='0' target='0' unit='4'/>
</disk>
<controller type='usb' index='0' model='qemu-xhci' ports='15'>
<alias name='usb'/>
<address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
</controller>
<controller type='pci' index='0' model='pcie-root'>
<alias name='pcie.0'/>
</controller>
<controller type='pci' index='1' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='1' port='0x10'/>
<alias name='pci.1'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
</controller>
<controller type='pci' index='2' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='2' port='0x11'/>
<alias name='pci.2'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/>
</controller>
<controller type='pci' index='3' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='3' port='0x12'/>
<alias name='pci.3'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x2'/>
</controller>
<controller type='pci' index='4' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='4' port='0x13'/>
<alias name='pci.4'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x3'/>
</controller>
<controller type='pci' index='5' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='5' port='0x14'/>
<alias name='pci.5'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x4'/>
</controller>
<controller type='pci' index='6' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='6' port='0x15'/>
<alias name='pci.6'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x5'/>
</controller>
<controller type='pci' index='7' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='7' port='0x16'/>
<alias name='pci.7'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x6'/>
</controller>
<controller type='pci' index='8' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='8' port='0x17'/>
<alias name='pci.8'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x7'/>
</controller>
<controller type='pci' index='9' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='9' port='0x18'/>
<alias name='pci.9'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0' multifunction='on'/>
</controller>
<controller type='pci' index='10' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='10' port='0x19'/>
<alias name='pci.10'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x1'/>
</controller>
<controller type='pci' index='11' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='11' port='0x1a'/>
<alias name='pci.11'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x2'/>
</controller>
<controller type='pci' index='12' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='12' port='0x1b'/>
<alias name='pci.12'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x3'/>
</controller>
<controller type='pci' index='13' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='13' port='0x1c'/>
<alias name='pci.13'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x4'/>
</controller>
<controller type='pci' index='14' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='14' port='0x1d'/>
<alias name='pci.14'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x5'/>
</controller>
<controller type='scsi' index='0' model='virtio-scsi'>
<alias name='scsi0'/>
<address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
</controller>
<controller type='sata' index='0'>
<alias name='ide'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
</controller>
<controller type='virtio-serial' index='0'>
<alias name='virtio-serial0'/>
<address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
</controller>
<interface type='network'>
<mac address='52:54:00:01:a7:85'/>
<source network='default' portid='2f054bc0-bdd3-4431-9a7f-f57c84313f0d' bridge='virbr0'/>
<target dev='vnet47'/>
<model type='virtio'/>
<alias name='net0'/>
<address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
</interface>
<serial type='pty'>
<source path='/dev/pts/9'/>
<target type='isa-serial' port='0'>
<model name='isa-serial'/>
</target>
<alias name='serial0'/>
</serial>
<console type='pty' tty='/dev/pts/9'>
<source path='/dev/pts/9'/>
<target type='serial' port='0'/>
<alias name='serial0'/>
</console>
<channel type='unix'>
<source mode='bind' path='/run/libvirt/qemu/channel/48-raid5-test-Fedora-Se/org.qemu.guest_agent.0'/>
<target type='virtio' name='org.qemu.guest_agent.0' state='connected'/>
<alias name='channel0'/>
<address type='virtio-serial' controller='0' bus='0' port='1'/>
</channel>
<channel type='spicevmc'>
<target type='virtio' name='com.redhat.spice.0' state='disconnected'/>
<alias name='channel1'/>
<address type='virtio-serial' controller='0' bus='0' port='2'/>
</channel>
<input type='tablet' bus='usb'>
<alias name='input0'/>
<address type='usb' bus='0' port='1'/>
</input>
<input type='mouse' bus='ps2'>
<alias name='input1'/>
</input>
<input type='keyboard' bus='ps2'>
<alias name='input2'/>
</input>
<graphics type='spice' port='5902' autoport='yes' listen='127.0.0.1'>
<listen type='address' address='127.0.0.1'/>
<image compression='off'/>
</graphics>
<sound model='ich9'>
<alias name='sound0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x1b' function='0x0'/>
</sound>
<audio id='1' type='spice'/>
<video>
<model type='virtio' heads='1' primary='yes'/>
<alias name='video0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
</video>
<redirdev bus='usb' type='spicevmc'>
<alias name='redir0'/>
<address type='usb' bus='0' port='2'/>
</redirdev>
<redirdev bus='usb' type='spicevmc'>
<alias name='redir1'/>
<address type='usb' bus='0' port='3'/>
</redirdev>
<watchdog model='itco' action='reset'>
<alias name='watchdog0'/>
</watchdog>
<memballoon model='virtio'>
<stats period='5'/>
<alias name='balloon0'/>
<address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
</memballoon>
<rng model='virtio'>
<backend model='random'>/dev/urandom</backend>
<alias name='rng0'/>
<address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
</rng>
</devices>
<seclabel type='dynamic' model='dac' relabel='yes'>
<label>+77:+77</label>
<imagelabel>+77:+77</imagelabel>
</seclabel>
</domain>

2024-01-26 03:40:38

by Carlos Carvalho

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Dan Moulding ([email protected]) wrote on Thu, Jan 25, 2024 at 05:31:30PM -03:
> I then created an ext4 file system on the "data" volume, mounted it, and used
> "dd" to copy 1MiB blocks from /dev/urandom to a file on the "data" file
> system, and just let it run. Eventually "dd" hangs and top shows that
> md0_raid5 is using 100% CPU.

ext4 is known to show these symptoms with parity RAID. To make sure it's a
RAID problem, you should try another filesystem or remount with stripe=0.
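
For example, assuming the file system from the report is mounted at /data,
the remount would look something like:

  mount -o remount,stripe=0 /data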

2024-01-26 15:46:34

by Dan Moulding

[permalink] [raw]
Subject: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

> It's known that ext4 has these symptoms with parity raid.

Interesting. I'm not aware of that problem. One of the systems that
hit this hang has been running with ext4 on an MD RAID-5 array with
every kernel since at least 5.1 and never had an issue until this
regression.

> To make sure it's a raid problem you should try another filesystem or
> remount it with stripe=0.

That's a good suggestion, so I switched it to use XFS. It can still
reproduce the hang. Sounds like this is probably a different problem
than the known ext4 one.

Thanks,

-- Dan

2024-01-26 16:33:36

by Roman Mamedov

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

On Fri, 26 Jan 2024 00:30:46 -0300
Carlos Carvalho <[email protected]> wrote:

> Dan Moulding ([email protected]) wrote on Thu, Jan 25, 2024 at 05:31:30PM -03:
> > I then created an ext4 file system on the "data" volume, mounted it, and used
> > "dd" to copy 1MiB blocks from /dev/urandom to a file on the "data" file
> > system, and just let it run. Eventually "dd" hangs and top shows that
> > md0_raid5 is using 100% CPU.
>
> It's known that ext4 has these symptoms with parity raid. To make sure it's a
> raid problem you should try another filesystem or remount it with stripe=0.

If ext4 didn't work properly on parity RAID, that would be a bug to be
tracked down and fixed, not worked around by using a different FS. I am in
disbelief that you are seriously suggesting that, and to be honest I really
doubt there is any such high-profile "known" issue that stays unfixed and is
just commonly worked around.

--
With respect,
Roman

2024-01-30 16:32:38

by Blazej Kucman

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Hi,

On Fri, 26 Jan 2024 08:46:10 -0700
Dan Moulding <[email protected]> wrote:
>
> That's a good suggestion, so I switched it to use XFS. It can still
> reproduce the hang. Sounds like this is probably a different problem
> than the known ext4 one.
>

Our daily tests directed at mdadm/md also detected a problem with
symptoms identical to those described in this thread.

The issue was detected with IMSM metadata, but it also reproduces with
native metadata.
NVMe disks under a VMD controller were used.

Scenario:
1. Create raid10:
mdadm --create /dev/md/r10d4s128-15_A --level=10 --chunk=128
--raid-devices=4 /dev/nvme6n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme0n1
--size=7864320 --run
2. Create FS
mkfs.ext4 /dev/md/r10d4s128-15_A
3. Set faulty one raid member:
mdadm --set-faulty /dev/md/r10d4s128-15_A /dev/nvme3n1
4. Stop raid devices:
mdadm -Ss

Expected result:
The raid stops without kernel hangs and errors.

Actual result:
command "mdadm -Ss" hangs,
hung_task occurs in OS.

[  62.770472] md: resync of RAID array md127
[ 140.893329] md: md127: resync done.
[ 204.100490] md/raid10:md127: Disk failure on nvme3n1, disabling device.
              md/raid10:md127: Operation continuing on 3 devices.
[ 244.625393] INFO: task kworker/48:1:755 blocked for more than 30 seconds.
[ 244.632294]       Tainted: G S          6.8.0-rc1-20240129.intel.13479453+ #1
[ 244.640157] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 244.648105] task:kworker/48:1 state:D stack:14592 pid:755 tgid:755 ppid:2 flags:0x00004000
[ 244.657552] Workqueue: md_misc md_start_sync [md_mod]
[ 244.662688] Call Trace:
[ 244.665176]  <TASK>
[ 244.667316]  __schedule+0x2f0/0x9c0
[ 244.670868]  ? sched_clock+0x10/0x20
[ 244.674510]  schedule+0x28/0x90
[ 244.677703]  mddev_suspend+0x11d/0x1e0 [md_mod]
[ 244.682313]  ? __update_idle_core+0x29/0xc0
[ 244.686574]  ? swake_up_all+0xe0/0xe0
[ 244.690302]  md_start_sync+0x3c/0x280 [md_mod]
[ 244.694825]  process_scheduled_works+0x87/0x320
[ 244.699427]  worker_thread+0x147/0x2a0
[ 244.703237]  ? rescuer_thread+0x2d0/0x2d0
[ 244.707313]  kthread+0xe5/0x120
[ 244.710504]  ? kthread_complete_and_exit+0x20/0x20
[ 244.715370]  ret_from_fork+0x31/0x40
[ 244.719007]  ? kthread_complete_and_exit+0x20/0x20
[ 244.723879]  ret_from_fork_asm+0x11/0x20
[ 244.727872]  </TASK>
[ 244.730117] INFO: task mdadm:8457 blocked for more than 30 seconds.
[ 244.736486]       Tainted: G S          6.8.0-rc1-20240129.intel.13479453+ #1
[ 244.744345] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 244.752293] task:mdadm state:D stack:13512 pid:8457 tgid:8457 ppid:8276 flags:0x00000000
[ 244.761736] Call Trace:
[ 244.764241]  <TASK>
[ 244.766389]  __schedule+0x2f0/0x9c0
[ 244.773224]  schedule+0x28/0x90
[ 244.779690]  stop_sync_thread+0xfa/0x170 [md_mod]
[ 244.787737]  ? swake_up_all+0xe0/0xe0
[ 244.794705]  do_md_stop+0x51/0x4c0 [md_mod]
[ 244.802166]  md_ioctl+0x59d/0x10a0 [md_mod]
[ 244.809567]  blkdev_ioctl+0x1bb/0x270
[ 244.816417]  __x64_sys_ioctl+0x7a/0xb0
[ 244.823720]  do_syscall_64+0x4e/0x110
[ 244.830481]  entry_SYSCALL_64_after_hwframe+0x63/0x6b
[ 244.838700] RIP: 0033:0x7f2c540c97cb
[ 244.845457] RSP: 002b:00007fff4ad6a8f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 244.856265] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2c540c97cb
[ 244.866659] RDX: 0000000000000000 RSI: 0000000000000932 RDI: 0000000000000003
[ 244.877031] RBP: 0000000000000019 R08: 0000000000200000 R09: 00007fff4ad6a4c5
[ 244.887382] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff4ad6a9c0
[ 244.897723] R13: 00007fff4ad6a9a0 R14: 000055724d0990e0 R15: 000055724efaa780
[ 244.908018]  </TASK>
[ 275.345375] INFO: task kworker/48:1:755 blocked for more than 60 seconds.
[ 275.355363]       Tainted: G S          6.8.0-rc1-20240129.intel.13479453+ #1
[ 275.366306] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 275.377334] task:kworker/48:1 state:D stack:14592 pid:755 tgid:755 ppid:2 flags:0x00004000
[ 275.389863] Workqueue: md_misc md_start_sync [md_mod]
[ 275.398102] Call Trace:
[ 275.403673]  <TASK>


The issue also reproduces with an XFS file system, and does not reproduce
when there is no file system on the RAID.

Repository used for testing:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
Branch: master

Last working build: kernel branch HEAD: acc657692aed ("keys, dns: Fix
size check of V1 server-list header")

I see one merge commit touching md after the above one:
01d550f0fcc0 ("Merge tag 'for-6.8/block-2024-01-08' of
git://git.kernel.dk/linux")

I hope these additional logs will help find the cause.

Thanks,
Blazej


2024-01-30 20:22:42

by Song Liu

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Hi Blazej,

On Tue, Jan 30, 2024 at 8:27 AM Blazej Kucman
<[email protected]> wrote:
>
> Hi,
>
> On Fri, 26 Jan 2024 08:46:10 -0700
> Dan Moulding <[email protected]> wrote:
> >
> > That's a good suggestion, so I switched it to use XFS. It can still
> > reproduce the hang. Sounds like this is probably a different problem
> > than the known ext4 one.
> >
>
> Our daily tests directed at mdadm/md also detected a problem with
> identical symptoms as described in the thread.
>
> Issue detected with IMSM metadata but it also reproduces with native
> metadata.
> NVMe disks under VMD controller were used.
>
> Scenario:
> 1. Create raid10:
> mdadm --create /dev/md/r10d4s128-15_A --level=10 --chunk=128
> --raid-devices=4 /dev/nvme6n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme0n1
> --size=7864320 --run
> 2. Create FS
> mkfs.ext4 /dev/md/r10d4s128-15_A
> 3. Set faulty one raid member:
> mdadm --set-faulty /dev/md/r10d4s128-15_A /dev/nvme3n1
> 4. Stop raid devies:
> mdadm -Ss

Thanks for the report. I can reproduce the issue locally.

The revert [1] cannot fix this one, because the revert is for raid5 (and
the repro is on raid10). I will look into this.

Thanks again!

Song


[1] https://lore.kernel.org/linux-raid/[email protected]/


>
> Expected result:
> The raid stops without kernel hangs and errors.
>
> Actual result:
> command "mdadm -Ss" hangs,
> hung_task occurs in OS.
>
> [ 62.770472] md: resync of RAID array md127
> [ 140.893329] md: md127: resync done.
> [ 204.100490] md/raid10:md127: Disk failure on nvme3n1, disabling
> device. md/raid10:md127: Operation continuing on 3 devices.
> [ 244.625393] INFO: task kworker/48:1:755 blocked for more than 30
> seconds. [ 244.632294] Tainted: G S
> 6.8.0-rc1-20240129.intel.13479453+ #1 [ 244.640157] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message. [
> 244.648105] task:kworker/48:1 state:D stack:14592 pid:755 tgid:755
> ppid:2 flags:0x00004000 [ 244.657552] Workqueue: md_misc
> md_start_sync [md_mod] [ 244.662688] Call Trace: [ 244.665176] <TASK>
> [ 244.667316] __schedule+0x2f0/0x9c0
> [ 244.670868] ? sched_clock+0x10/0x20
> [ 244.674510] schedule+0x28/0x90
> [ 244.677703] mddev_suspend+0x11d/0x1e0 [md_mod]
> [ 244.682313] ? __update_idle_core+0x29/0xc0
> [ 244.686574] ? swake_up_all+0xe0/0xe0
> [ 244.690302] md_start_sync+0x3c/0x280 [md_mod]
> [ 244.694825] process_scheduled_works+0x87/0x320
> [ 244.699427] worker_thread+0x147/0x2a0
> [ 244.703237] ? rescuer_thread+0x2d0/0x2d0
> [ 244.707313] kthread+0xe5/0x120
> [ 244.710504] ? kthread_complete_and_exit+0x20/0x20
> [ 244.715370] ret_from_fork+0x31/0x40
> [ 244.719007] ? kthread_complete_and_exit+0x20/0x20
> [ 244.723879] ret_from_fork_asm+0x11/0x20
> [ 244.727872] </TASK>
> [ 244.730117] INFO: task mdadm:8457 blocked for more than 30 seconds.
> [ 244.736486] Tainted: G S
> 6.8.0-rc1-20240129.intel.13479453+ #1 [ 244.744345] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message. [
> 244.752293] task:mdadm state:D stack:13512 pid:8457
> tgid:8457 ppid:8276 flags:0x00000000 [ 244.761736] Call Trace: [
> 244.764241] <TASK> [ 244.766389] __schedule+0x2f0/0x9c0
> [ 244.773224] schedule+0x28/0x90
> [ 244.779690] stop_sync_thread+0xfa/0x170 [md_mod]
> [ 244.787737] ? swake_up_all+0xe0/0xe0
> [ 244.794705] do_md_stop+0x51/0x4c0 [md_mod]
> [ 244.802166] md_ioctl+0x59d/0x10a0 [md_mod]
> [ 244.809567] blkdev_ioctl+0x1bb/0x270
> [ 244.816417] __x64_sys_ioctl+0x7a/0xb0
> [ 244.823720] do_syscall_64+0x4e/0x110
> [ 244.830481] entry_SYSCALL_64_after_hwframe+0x63/0x6b
> [ 244.838700] RIP: 0033:0x7f2c540c97cb
> [ 244.845457] RSP: 002b:00007fff4ad6a8f8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010 [ 244.856265] RAX: ffffffffffffffda RBX:
> 0000000000000003 RCX: 00007f2c540c97cb [ 244.866659] RDX:
> 0000000000000000 RSI: 0000000000000932 RDI: 0000000000000003 [
> 244.877031] RBP: 0000000000000019 R08: 0000000000200000 R09:
> 00007fff4ad6a4c5 [ 244.887382] R10: 0000000000000000 R11:
> 0000000000000246 R12: 00007fff4ad6a9c0 [ 244.897723] R13:
> 00007fff4ad6a9a0 R14: 000055724d0990e0 R15: 000055724efaa780 [
> 244.908018] </TASK> [ 275.345375] INFO: task kworker/48:1:755 blocked
> for more than 60 seconds. [ 275.355363] Tainted: G S
> 6.8.0-rc1-20240129.intel.13479453+ #1 [ 275.366306] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message. [
> 275.377334] task:kworker/48:1 state:D stack:14592 pid:755 tgid:755
> ppid:2 flags:0x00004000 [ 275.389863] Workqueue: md_misc
> md_start_sync [md_mod] [ 275.398102] Call Trace: [ 275.403673] <TASK>
>
>
> Also reproduces with XFS FS, does not reproduce when there is no FS on
> RAID.
>
> Repository used for testing:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
> Branch: master
>
> Last working build: kernel branch HEAD: acc657692aed ("keys, dns: Fix
> size check of V1 server-list header")
>
> I see one merge commit touching md after the above one:
> 01d550f0fcc0 ("Merge tag 'for-6.8/block-2024-01-08' of
> git://git.kernel.dk/linux")
>
> I hope these additional logs will help find the cause.
>
> Thanks,
> Blazej
>

2024-01-31 01:28:10

by Song Liu

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Update my findings so far.

On Tue, Jan 30, 2024 at 8:27 AM Blazej Kucman
<[email protected]> wrote:
[...]
> Our daily tests directed at mdadm/md also detected a problem with
> identical symptoms as described in the thread.
>
> Issue detected with IMSM metadata but it also reproduces with native
> metadata.
> NVMe disks under VMD controller were used.
>
> Scenario:
> 1. Create raid10:
> mdadm --create /dev/md/r10d4s128-15_A --level=10 --chunk=128
> --raid-devices=4 /dev/nvme6n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme0n1
> --size=7864320 --run
> 2. Create FS
> mkfs.ext4 /dev/md/r10d4s128-15_A
> 3. Set faulty one raid member:
> mdadm --set-faulty /dev/md/r10d4s128-15_A /dev/nvme3n1

With a failed drive, md_thread calls md_check_recovery() and kicks
off mddev->sync_work, which is md_start_sync().
md_check_recovery() also sets MD_RECOVERY_RUNNING.

md_start_sync() calls mddev_suspend() and waits for
mddev->active_io to become zero.

> 4. Stop raid devies:
> mdadm -Ss

This command calls stop_sync_thread() and waits for
MD_RECOVERY_RUNNING to be cleared.

Given that we need a working file system to reproduce the issue, I
suspect the problem comes from active_io.

Yu Kuai, I guess we missed this case in the recent refactoring. I don't
have a good idea for fixing this yet. Please also take a look at this.

Thanks,
Song

2024-01-31 02:13:46

by Yu Kuai

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Hi,

On 2024/01/31 9:26, Song Liu wrote:
>> Scenario:
>> 1. Create raid10:
>> mdadm --create /dev/md/r10d4s128-15_A --level=10 --chunk=128
>> --raid-devices=4 /dev/nvme6n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme0n1
>> --size=7864320 --run
>> 2. Create FS
>> mkfs.ext4 /dev/md/r10d4s128-15_A
>> 3. Set faulty one raid member:
>> mdadm --set-faulty /dev/md/r10d4s128-15_A /dev/nvme3n1
> With a failed drive, md_thread calls md_check_recovery() and kicks
> off mddev->sync_work, which is md_start_sync().
> md_check_recovery() also sets MD_RECOVERY_RUNNING.
>
> md_start_sync() calls mddev_suspend() and waits for
> mddev->active_io to become zero.
>
>> 4. Stop raid devies:
>> mdadm -Ss
> This command calls stop_sync_thread() and waits for
> MD_RECOVERY_RUNNING to be cleared.
>
> Given we need a working file system to reproduce the issue, I
> suspect the problem comes from active_io.

I'll look into this, but I don't understand the root cause yet.
Who grabs the 'active_io' reference, and why isn't it released?

Thanks,
Kuai

>
> Yu Kuai, I guess we missed this case in the recent refactoring.
> I don't have a good idea to fix this. Please also take a look into
> this.


2024-01-31 02:41:28

by Yu Kuai

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Hi, Blazej!

On 2024/01/31 0:26, Blazej Kucman wrote:
> Hi,
>
> On Fri, 26 Jan 2024 08:46:10 -0700
> Dan Moulding <[email protected]> wrote:
>>
>> That's a good suggestion, so I switched it to use XFS. It can still
>> reproduce the hang. Sounds like this is probably a different problem
>> than the known ext4 one.
>>
>
> Our daily tests directed at mdadm/md also detected a problem with
> identical symptoms as described in the thread.
>
> Issue detected with IMSM metadata but it also reproduces with native
> metadata.
> NVMe disks under VMD controller were used.
>
> Scenario:
> 1. Create raid10:
> mdadm --create /dev/md/r10d4s128-15_A --level=10 --chunk=128
> --raid-devices=4 /dev/nvme6n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme0n1
> --size=7864320 --run
> 2. Create FS
> mkfs.ext4 /dev/md/r10d4s128-15_A
> 3. Set faulty one raid member:
> mdadm --set-faulty /dev/md/r10d4s128-15_A /dev/nvme3n1
> 4. Stop raid devies:
> mdadm -Ss
>
> Expected result:
> The raid stops without kernel hangs and errors.
>
> Actual result:
> command "mdadm -Ss" hangs,
> hung_task occurs in OS.

Can you test the following patch?

Thanks!
Kuai

diff --git a/drivers/md/md.c b/drivers/md/md.c
index e3a56a958b47..a8db84c200fe 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -578,8 +578,12 @@ static void submit_flushes(struct work_struct *ws)
 			rcu_read_lock();
 		}
 	rcu_read_unlock();
-	if (atomic_dec_and_test(&mddev->flush_pending))
+	if (atomic_dec_and_test(&mddev->flush_pending)) {
+		/* The pair is percpu_ref_get() from md_flush_request() */
+		percpu_ref_put(&mddev->active_io);
+
 		queue_work(md_wq, &mddev->flush_work);
+	}
 }

static void md_submit_flush_data(struct work_struct *ws)

>
> [ 62.770472] md: resync of RAID array md127
> [ 140.893329] md: md127: resync done.
> [ 204.100490] md/raid10:md127: Disk failure on nvme3n1, disabling
> device. md/raid10:md127: Operation continuing on 3 devices.
> [ 244.625393] INFO: task kworker/48:1:755 blocked for more than 30
> seconds. [ 244.632294] Tainted: G S
> 6.8.0-rc1-20240129.intel.13479453+ #1 [ 244.640157] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message. [
> 244.648105] task:kworker/48:1 state:D stack:14592 pid:755 tgid:755
> ppid:2 flags:0x00004000 [ 244.657552] Workqueue: md_misc
> md_start_sync [md_mod] [ 244.662688] Call Trace: [ 244.665176] <TASK>
> [ 244.667316] __schedule+0x2f0/0x9c0
> [ 244.670868] ? sched_clock+0x10/0x20
> [ 244.674510] schedule+0x28/0x90
> [ 244.677703] mddev_suspend+0x11d/0x1e0 [md_mod]
> [ 244.682313] ? __update_idle_core+0x29/0xc0
> [ 244.686574] ? swake_up_all+0xe0/0xe0
> [ 244.690302] md_start_sync+0x3c/0x280 [md_mod]
> [ 244.694825] process_scheduled_works+0x87/0x320
> [ 244.699427] worker_thread+0x147/0x2a0
> [ 244.703237] ? rescuer_thread+0x2d0/0x2d0
> [ 244.707313] kthread+0xe5/0x120
> [ 244.710504] ? kthread_complete_and_exit+0x20/0x20
> [ 244.715370] ret_from_fork+0x31/0x40
> [ 244.719007] ? kthread_complete_and_exit+0x20/0x20
> [ 244.723879] ret_from_fork_asm+0x11/0x20
> [ 244.727872] </TASK>
> [ 244.730117] INFO: task mdadm:8457 blocked for more than 30 seconds.
> [ 244.736486] Tainted: G S
> 6.8.0-rc1-20240129.intel.13479453+ #1 [ 244.744345] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message. [
> 244.752293] task:mdadm state:D stack:13512 pid:8457
> tgid:8457 ppid:8276 flags:0x00000000 [ 244.761736] Call Trace: [
> 244.764241] <TASK> [ 244.766389] __schedule+0x2f0/0x9c0
> [ 244.773224] schedule+0x28/0x90
> [ 244.779690] stop_sync_thread+0xfa/0x170 [md_mod]
> [ 244.787737] ? swake_up_all+0xe0/0xe0
> [ 244.794705] do_md_stop+0x51/0x4c0 [md_mod]
> [ 244.802166] md_ioctl+0x59d/0x10a0 [md_mod]
> [ 244.809567] blkdev_ioctl+0x1bb/0x270
> [ 244.816417] __x64_sys_ioctl+0x7a/0xb0
> [ 244.823720] do_syscall_64+0x4e/0x110
> [ 244.830481] entry_SYSCALL_64_after_hwframe+0x63/0x6b
> [ 244.838700] RIP: 0033:0x7f2c540c97cb
> [ 244.845457] RSP: 002b:00007fff4ad6a8f8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010 [ 244.856265] RAX: ffffffffffffffda RBX:
> 0000000000000003 RCX: 00007f2c540c97cb [ 244.866659] RDX:
> 0000000000000000 RSI: 0000000000000932 RDI: 0000000000000003 [
> 244.877031] RBP: 0000000000000019 R08: 0000000000200000 R09:
> 00007fff4ad6a4c5 [ 244.887382] R10: 0000000000000000 R11:
> 0000000000000246 R12: 00007fff4ad6a9c0 [ 244.897723] R13:
> 00007fff4ad6a9a0 R14: 000055724d0990e0 R15: 000055724efaa780 [
> 244.908018] </TASK> [ 275.345375] INFO: task kworker/48:1:755 blocked
> for more than 60 seconds. [ 275.355363] Tainted: G S
> 6.8.0-rc1-20240129.intel.13479453+ #1 [ 275.366306] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message. [
> 275.377334] task:kworker/48:1 state:D stack:14592 pid:755 tgid:755
> ppid:2 flags:0x00004000 [ 275.389863] Workqueue: md_misc
> md_start_sync [md_mod] [ 275.398102] Call Trace: [ 275.403673] <TASK>
>
>
> Also reproduces with XFS FS, does not reproduce when there is no FS on
> RAID.
>
> Repository used for testing:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
> Branch: master
>
> Last working build: kernel branch HEAD: acc657692aed ("keys, dns: Fix
> size check of V1 server-list header")
>
> I see one merge commit touching md after the above one:
> 01d550f0fcc0 ("Merge tag 'for-6.8/block-2024-01-08' of
> git://git.kernel.dk/linux")
>
> I hope these additional logs will help find the cause.
>
> Thanks,
> Blazej


2024-01-31 04:56:03

by Song Liu

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

On Tue, Jan 30, 2024 at 6:41 PM Yu Kuai <[email protected]> wrote:
>
> Hi, Blazej!
>
> On 2024/01/31 0:26, Blazej Kucman wrote:
> > Hi,
> >
> > On Fri, 26 Jan 2024 08:46:10 -0700
> > Dan Moulding <[email protected]> wrote:
> >>
> >> That's a good suggestion, so I switched it to use XFS. It can still
> >> reproduce the hang. Sounds like this is probably a different problem
> >> than the known ext4 one.
> >>
> >
> > Our daily tests directed at mdadm/md also detected a problem with
> > identical symptoms as described in the thread.
> >
> > Issue detected with IMSM metadata but it also reproduces with native
> > metadata.
> > NVMe disks under VMD controller were used.
> >
> > Scenario:
> > 1. Create raid10:
> > mdadm --create /dev/md/r10d4s128-15_A --level=10 --chunk=128
> > --raid-devices=4 /dev/nvme6n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme0n1
> > --size=7864320 --run
> > 2. Create FS
> > mkfs.ext4 /dev/md/r10d4s128-15_A
> > 3. Set faulty one raid member:
> > mdadm --set-faulty /dev/md/r10d4s128-15_A /dev/nvme3n1
> > 4. Stop raid devies:
> > mdadm -Ss
> >
> > Expected result:
> > The raid stops without kernel hangs and errors.
> >
> > Actual result:
> > command "mdadm -Ss" hangs,
> > hung_task occurs in OS.
>
> Can you test the following patch?
>
> Thanks!
> Kuai
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index e3a56a958b47..a8db84c200fe 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -578,8 +578,12 @@ static void submit_flushes(struct work_struct *ws)
> rcu_read_lock();
> }
> rcu_read_unlock();
> - if (atomic_dec_and_test(&mddev->flush_pending))
> + if (atomic_dec_and_test(&mddev->flush_pending)) {
> + /* The pair is percpu_ref_get() from md_flush_request() */
> + percpu_ref_put(&mddev->active_io);
> +
> queue_work(md_wq, &mddev->flush_work);
> + }
> }
>
> static void md_submit_flush_data(struct work_struct *ws)

This fixes the issue in my tests. Please submit the official patch.
Also, we should add a test in mdadm/tests to cover this case.
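
A rough sketch of such a test, written as plain shell rather than in the
mdadm test-suite conventions (device names are placeholders):

  # create a raid10 array, put a file system on it, fail one member,
  # then make sure stopping the array completes instead of hanging
  mdadm --create /dev/md127 --level=10 --chunk=128 --raid-devices=4 \
        /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 --run
  mdadm --wait /dev/md127          # let the initial resync finish
  mkfs.ext4 -F /dev/md127
  mdadm --set-faulty /dev/md127 /dev/nvme3n1
  timeout 60 mdadm --stop /dev/md127 || echo "FAIL: stopping the array hung"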

Thanks,
Song

2024-01-31 13:37:08

by Blazej Kucman

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

On Tue, 30 Jan 2024 20:55:39 -0800
Song Liu <[email protected]> wrote:

> On Tue, Jan 30, 2024 at 6:41 PM Yu Kuai <[email protected]>
> >
> > Can you test the following patch?
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index e3a56a958b47..a8db84c200fe 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -578,8 +578,12 @@ static void submit_flushes(struct work_struct
> > *ws) rcu_read_lock();
> > }
> > rcu_read_unlock();
> > - if (atomic_dec_and_test(&mddev->flush_pending))
> > + if (atomic_dec_and_test(&mddev->flush_pending)) {
> > + /* The pair is percpu_ref_get() from
> > md_flush_request() */
> > + percpu_ref_put(&mddev->active_io);
> > +
> > queue_work(md_wq, &mddev->flush_work);
> > + }
> > }
> >
> > static void md_submit_flush_data(struct work_struct *ws)
>
> This fixes the issue in my tests. Please submit the official patch.
> Also, we should add a test in mdadm/tests to cover this case.
>
> Thanks,
> Song
>

Hi Kuai,

On my hardware, the issue also stopped reproducing with this fix.

I applied the fix on the current HEAD of the master
branch in the kernel/git/torvalds/linux.git repo.

Thanks,
Blazej




2024-01-31 17:38:06

by Junxiao Bi

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Hi Dan,

On 1/25/24 12:31 PM, Dan Moulding wrote:
> On this Fedora 39 VM, I created a 1GiB LVM volume to use as the RAID-5
> journal from space on the "boot" disk. Then I attached 3 additional
> 100 GiB virtual disks and created the RAID-5 from those 3 disks and
> the write-journal device. I then created a new LVM volume group from
> the md0 array and created one LVM logical volume named "data", using
> all but 64GiB of the available VG space. I then created an ext4 file
> system on the "data" volume, mounted it, and used "dd" to copy 1MiB
> blocks from /dev/urandom to a file on the "data" file system, and just
> let it run. Eventually "dd" hangs and top shows that md0_raid5 is
> using 100% CPU.

I can't reproduce this issue with this test case after running it overnight;
dd is making good progress. I can see that dd is very busy, close to 100%,
and it sometimes sits in D status, but only for a moment. md0_raid5 stays
around 60%, never 100%.

I am wondering whether your case is a performance issue or a hard hang. If
it's a hang, I suppose we should see a hung-task call trace for dd in dmesg,
unless you have disabled kernel.hung_task_timeout_secs.

Also, are you able to configure kdump and trigger a core dump when the issue
reproduces?
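
For reference, something along these lines would check the hung-task setting
and, once kdump is configured (packages and the crashkernel= reservation vary
by distro), capture a vmcore while the system is stuck:

  # hung-task detection must not be disabled (0 disables it)
  sysctl kernel.hung_task_timeout_secs

  # with the kdump service running, force a crash dump via sysrq
  echo 1 > /proc/sys/kernel/sysrq
  echo c > /proc/sysrq-trigger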

Thanks,

Junxiao.

2024-02-01 01:49:07

by Yu Kuai

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Hi!

On 2024/01/31 21:36, Blazej Kucman wrote:
> Hi Kuai,
>
> On my hardware issue also stopped reproducing with this fix.
>
> I applied the fix on current HEAD of master
> branch in kernel/git/torvalds/linux.git repo.

That is great, thanks for testing!

Hi Dan, can you try this patch as well? I feel this is a different
problem than the one you reported first, because reverting 0de40f76d567
shouldn't make any difference here.

Thanks,
Kuai


2024-02-06 08:13:05

by Song Liu

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

On Thu, Jan 25, 2024 at 12:31 PM Dan Moulding <[email protected]> wrote:
>
> Hi Junxiao,
>
> I first noticed this problem the next day after I had upgraded some
> machines to the 6.7.1 kernel. One of the machines is a backup server.
> Just a few hours after the upgrade to 6.7.1, it started running its
> overnight backup jobs. Those backup jobs hung part way through. When I
> tried to check on the backups in the morning, I found the server
> mostly unresponsive. I could SSH in but most shell commands would just
> hang. I was able to run top and see that the md0_raid5 kernel thread
> was using 100% CPU. I tried to reboot the server, but it wasn't able
> to successfully shutdown and eventually I had to hard reset it.
>
> The next day, the same sequence of events occurred on that server
> again when it tried to run its backup jobs. Then the following day, I
> experienced another hang on a different machine, with a similar RAID-5
> configuration. That time I was scp'ing a large file to a virtual
> machine whose image was stored on the RAID-5 array. Part way through
> the transfer scp reported that the transfer had stalled. I checked top
> on that machine and found once again that the md0_raid5 kernel thread
> was using 100% CPU.
>
> Yesterday I created a fresh Fedora 39 VM for the purposes of
> reproducing this problem in a different environment (the other two
> machines are both Gentoo servers running v6.7 kernels straight from
> the stable trees with a custom kernel configuration). I am able to
> reproduce the problem on Fedora 39 running both the v6.6.13 stable
> tree kernel code and the Fedora 39 6.6.13 distribution kernel.
>
> On this Fedora 39 VM, I created a 1GiB LVM volume to use as the RAID-5
> journal from space on the "boot" disk. Then I attached 3 additional
> 100 GiB virtual disks and created the RAID-5 from those 3 disks and
> the write-journal device. I then created a new LVM volume group from
> the md0 array and created one LVM logical volume named "data", using
> all but 64GiB of the available VG space. I then created an ext4 file
> system on the "data" volume, mounted it, and used "dd" to copy 1MiB
> blocks from /dev/urandom to a file on the "data" file system, and just
> let it run. Eventually "dd" hangs and top shows that md0_raid5 is
> using 100% CPU.
>
> Here is an example command I just ran, which has hung after writing
> 4.1 GiB of random data to the array:
>
> test@localhost:~$ dd if=/dev/urandom bs=1M of=/data/random.dat status=progress
> 4410310656 bytes (4.4 GB, 4.1 GiB) copied, 324 s, 13.6 MB/s

Update on this..

I have been testing the following config on the md-6.9 branch [1].
The array works fine AFAICT.

Dan, could you please run the test on this branch
(83cbdaf61b1ab9cdaa0321eeea734bc70ca069c8)?

Thanks,
Song


[1] https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-6.9
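
For reference, one way to check out and build that exact commit (build and
install steps depend on the local setup):

  git remote add song https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git
  git fetch song md-6.9
  git checkout 83cbdaf61b1ab9cdaa0321eeea734bc70ca069c8
  make olddefconfig && make -j"$(nproc)"
  sudo make modules_install install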

[root@eth50-1 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sr0 11:0 1 1024M 0 rom
vda 253:0 0 32G 0 disk
├─vda1 253:1 0 2G 0 part /boot
└─vda2 253:2 0 30G 0 part /
nvme2n1 259:0 0 50G 0 disk
└─md0 9:0 0 100G 0 raid5
├─vg--md--data-md--data-real 250:2 0 50G 0 lvm
│ ├─vg--md--data-md--data 250:1 0 50G 0 lvm /mnt/2
│ └─vg--md--data-snap 250:4 0 50G 0 lvm
└─vg--md--data-snap-cow 250:3 0 49G 0 lvm
└─vg--md--data-snap 250:4 0 50G 0 lvm
nvme0n1 259:1 0 50G 0 disk
└─md0 9:0 0 100G 0 raid5
├─vg--md--data-md--data-real 250:2 0 50G 0 lvm
│ ├─vg--md--data-md--data 250:1 0 50G 0 lvm /mnt/2
│ └─vg--md--data-snap 250:4 0 50G 0 lvm
└─vg--md--data-snap-cow 250:3 0 49G 0 lvm
└─vg--md--data-snap 250:4 0 50G 0 lvm
nvme1n1 259:2 0 50G 0 disk
└─md0 9:0 0 100G 0 raid5
├─vg--md--data-md--data-real 250:2 0 50G 0 lvm
│ ├─vg--md--data-md--data 250:1 0 50G 0 lvm /mnt/2
│ └─vg--md--data-snap 250:4 0 50G 0 lvm
└─vg--md--data-snap-cow 250:3 0 49G 0 lvm
└─vg--md--data-snap 250:4 0 50G 0 lvm
nvme4n1 259:3 0 2G 0 disk
nvme3n1 259:4 0 50G 0 disk
└─vg--data-lv--journal 250:0 0 512M 0 lvm
└─md0 9:0 0 100G 0 raid5
├─vg--md--data-md--data-real 250:2 0 50G 0 lvm
│ ├─vg--md--data-md--data 250:1 0 50G 0 lvm /mnt/2
│ └─vg--md--data-snap 250:4 0 50G 0 lvm
└─vg--md--data-snap-cow 250:3 0 49G 0 lvm
└─vg--md--data-snap 250:4 0 50G 0 lvm
nvme5n1 259:5 0 2G 0 disk
nvme6n1 259:6 0 4G 0 disk
[root@eth50-1 ~]# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid5 nvme2n1[4] dm-0[3](J) nvme1n1[1] nvme0n1[0]
104790016 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>
[root@eth50-1 ~]# mount | grep /mnt/2
/dev/mapper/vg--md--data-md--data on /mnt/2 type ext4 (rw,relatime,stripe=256)

2024-02-06 20:56:15

by Dan Moulding

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

> Dan, could you please run the test on this branch
> (83cbdaf61b1ab9cdaa0321eeea734bc70ca069c8)?

I'm sorry to report that I can still reproduce the problem running the
kernel built from the md-6.9 branch (83cbdaf61b1a).

But the only commit I see on that branch that's not in master and
touches raid5.c is this one:

test@sysrescue:~/src/linux$ git log master..song/md-6.9 drivers/md/raid5.c
commit 61c90765e131e63ead773b9b99167415e246a945
Author: Yu Kuai <[email protected]>
Date: Thu Dec 28 20:55:51 2023 +0800

md: remove redundant check of 'mddev->sync_thread'

Is that expected, or were you expecting additional fixes to be in there?

-- Dan

2024-02-06 21:35:22

by Song Liu

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

On Tue, Feb 6, 2024 at 12:56 PM Dan Moulding <[email protected]> wrote:
>
> > Dan, could you please run the test on this branch
> > (83cbdaf61b1ab9cdaa0321eeea734bc70ca069c8)?
>
> I'm sorry to report that I can still reproduce the problem running the
> kernel built from the md-6.9 branch (83cbdaf61b1a).
>
> But the only commit I see on that branch that's not in master and
> touches raid5.c is this one:
>
> test@sysrescue:~/src/linux$ git log master..song/md-6.9 drivers/md/raid5.c
> commit 61c90765e131e63ead773b9b99167415e246a945
> Author: Yu Kuai <[email protected]>
> Date: Thu Dec 28 20:55:51 2023 +0800
>
> md: remove redundant check of 'mddev->sync_thread'
>
> Is that expected, or were you expecting additional fixes to be in there?

I don't expect that commit to fix the issue; it is simply expected to be
merged into master in the next merge window. I am curious why I cannot
reproduce the issue. Let me try some more.

Thanks,
Song

2024-02-20 23:07:14

by Dan Moulding

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Just a friendly reminder that this regression still exists on the
mainline. It has been reverted in 6.7 stable. But I upgraded a
development system to 6.8-rc5 today and immediately hit this issue
again. Then I saw that it hasn't yet been reverted in Linus' tree.

Cheers,

-- Dan

2024-02-20 23:15:47

by Junxiao Bi

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Hi Dan,

The thing is we can't reproduce this issue at all. If you can generate a
vmcore when the hang happens, then we can review which processes are
stuck.

Thanks,

Junxiao.

On 2/20/24 3:06 PM, Dan Moulding wrote:
> Just a friendly reminder that this regression still exists on the
> mainline. It has been reverted in 6.7 stable. But I upgraded a
> development system to 6.8-rc5 today and immediately hit this issue
> again. Then I saw that it hasn't yet been reverted in Linus' tree.
>
> Cheers,
>
> -- Dan

2024-02-21 19:01:20

by Mateusz Kusiak

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

On 21.02.2024 00:15, [email protected] wrote:
>
> The thing is we can't reproduce this issue at all. If you can generate
> a vmcore when the hung happened, then we can review which processes
> are stuck.
>
Hi,
I don't know if this will be of any help, but I ran the scenario below with
SATA and NVMe drives. For me, the issue is reproducible on NVMe drives only.

Scenario:
1. Create R5D3 with native metadata
    # mdadm -CR /dev/md/vol -l5 -n3 /dev/nvme[0-2]n1 --assume-clean
2. Create FS on the array
    # mkfs.ext4 /dev/md/vol -F
3. Remove single member drive via "--incremental --fail"
    # mdadm -If nvme0n1

The result is almost instant.

Thanks,
Mateusz

2024-02-21 19:25:01

by Junxiao Bi

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

On 2/21/24 6:50 AM, Mateusz Kusiak wrote:
> On 21.02.2024 00:15, [email protected] wrote:
>>
>> The thing is we can't reproduce this issue at all. If you can
>> generate a vmcore when the hung happened, then we can review which
>> processes are stuck.
>>
> Hi,
> don't know if that be any of help, but I run below scenario with SATA
> and NVMe drives. For me, the issue is reproducible on NVMe drives only.
>
> Scenario:
> 1. Create R5D3 with native metadata
>     # mdadm -CR /dev/md/vol -l5 -n3 /dev/nvme[0-2]n1 --assume-clean
> 2. Create FS on the array
>     # mkfs.ext4 /dev/md/vol -F
> 3. Remove single member drive via "--incremental --fail"
>     # mdadm -If nvme0n1
>
> The result is almost instant.

This is not the same issue that Dan reported; it looks like another
regression that Yu Kuai has already fixed. Can you please try this patch?

https://lore.kernel.org/lkml/[email protected]/

Thanks,

Junxiao.

>
> Thanks,
> Mateusz

2024-02-23 08:07:56

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

On 21.02.24 00:06, Dan Moulding wrote:
> Just a friendly reminder that this regression still exists on the
> mainline. It has been reverted in 6.7 stable. But I upgraded a
> development system to 6.8-rc5 today and immediately hit this issue
> again. Then I saw that it hasn't yet been reverted in Linus' tree.

Song Liu, what's the status here? I'm aware that you have fixed quite a
few regressions recently, but it seems like resolving this one is
stalled. Or were you able to reproduce the issue or make some progress
and I just missed it?

And if not, what's the way forward here wrt the release of 6.8?
Revert the culprit and try again later? Or is that not an option for one
reason or another?

Or do we assume that this is not a real issue? That it's caused by some
oddity (bit-flip in the metadata or something like that?) only to be
found in Dan's setup?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke

2024-02-23 17:49:58

by Dan Moulding

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Hi Junxiao,

Thanks for your time so far on this problem. It took some time,
because I've never had to generate a vmcore before, but I have one now
and it looks usable from what I've seen using crash and gdb on
it. It's a bit large, 1.1GB. How can I share it? Also, I'm assuming
you'll also need the vmlinux image that it came from? It's also a bit
big, 251MB.

-- Dan

2024-02-23 19:18:32

by Junxiao Bi

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Thanks Dan.

Before we figure out how to share the vmcore, can you run the commands below from crash first:

1. ps -m | grep UN

2. foreach UN bt

3. ps -m | grep md

4. bt each md process

Thanks,

Junxiao.

On 2/23/24 9:44 AM, Dan Moulding wrote:
> Hi Junxiao,
>
> Thanks for your time so far on this problem. It took some time,
> because I've never had to generate a vmcore before, but I have one now
> and it looks usable from what I've seen using crash and gdb on
> it. It's a bit large, 1.1GB. How can I share it? Also, I'm assuming
> you'll also need the vmlinux image that it came from? It's also a bit
> big, 251MB.
>
> -- Dan

2024-02-23 20:28:34

by Dan Moulding

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

> Before we know how to share vmcore, can you run below cmds from crash first:
>
> 1. ps -m | grep UN
>
> 2. foreach UN bt
>
> 3. ps -m | grep md
>
> 4. bt each md process

Sure, here you go!

----

root@localhost:/var/crash/127.0.0.1-2024-02-23-01:34:56# crash /home/test/src/linux/vmlinux vmcore

crash 8.0.4-2.fc39
Copyright (C) 2002-2022 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011, 2020-2022 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
Copyright (C) 2015, 2021 VMware, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...

WARNING: ORC unwinder: module orc_entry structures have changed
WARNING: cannot determine how modules are linked
WARNING: no kernel module access

KERNEL: /home/test/src/linux/vmlinux
DUMPFILE: vmcore
CPUS: 8
DATE: Fri Feb 23 01:34:54 UTC 2024
UPTIME: 00:41:00
LOAD AVERAGE: 6.00, 5.90, 4.80
TASKS: 309
NODENAME: localhost.localdomain
RELEASE: 6.8.0-rc5
VERSION: #1 SMP Fri Feb 23 00:22:23 UTC 2024
MACHINE: x86_64 (2999 Mhz)
MEMORY: 8 GB
PANIC: "Kernel panic - not syncing: sysrq triggered crash"
PID: 1977
COMMAND: "bash"
TASK: ffff888105325880 [THREAD_INFO: ffff888105325880]
CPU: 5
STATE: TASK_RUNNING (PANIC)

crash> ps -m | grep UN
[0 00:15:50.424] [UN] PID: 957 TASK: ffff88810baa0ec0 CPU: 1 COMMAND: "jbd2/dm-3-8"
[0 00:15:56.151] [UN] PID: 1835 TASK: ffff888108a28ec0 CPU: 2 COMMAND: "dd"
[0 00:15:56.187] [UN] PID: 876 TASK: ffff888108bebb00 CPU: 3 COMMAND: "md0_reclaim"
[0 00:15:56.185] [UN] PID: 1914 TASK: ffff8881015e6740 CPU: 1 COMMAND: "kworker/1:2"
[0 00:15:56.255] [UN] PID: 403 TASK: ffff888101351d80 CPU: 7 COMMAND: "kworker/u21:1"
crash> foreach UN bt
PID: 403 TASK: ffff888101351d80 CPU: 7 COMMAND: "kworker/u21:1"
#0 [ffffc90000863840] __schedule at ffffffff81ac18ac
#1 [ffffc900008638a0] schedule at ffffffff81ac1d82
#2 [ffffc900008638b8] io_schedule at ffffffff81ac1e4d
#3 [ffffc900008638c8] wait_for_in_progress at ffffffff81806224
#4 [ffffc90000863910] do_origin at ffffffff81807265
#5 [ffffc90000863948] __map_bio at ffffffff817ede6a
#6 [ffffc90000863978] dm_submit_bio at ffffffff817ee31e
#7 [ffffc900008639f0] __submit_bio at ffffffff81515ec1
#8 [ffffc90000863a08] submit_bio_noacct_nocheck at ffffffff815162a7
#9 [ffffc90000863a60] ext4_io_submit at ffffffff813b506b
#10 [ffffc90000863a70] ext4_do_writepages at ffffffff81399ed6
#11 [ffffc90000863b20] ext4_writepages at ffffffff8139a85d
#12 [ffffc90000863bb8] do_writepages at ffffffff81258c30
#13 [ffffc90000863c18] __writeback_single_inode at ffffffff8132348a
#14 [ffffc90000863c48] writeback_sb_inodes at ffffffff81323b62
#15 [ffffc90000863d18] __writeback_inodes_wb at ffffffff81323e17
#16 [ffffc90000863d58] wb_writeback at ffffffff8132400a
#17 [ffffc90000863dc0] wb_workfn at ffffffff8132503c
#18 [ffffc90000863e68] process_one_work at ffffffff81147b69
#19 [ffffc90000863ea8] worker_thread at ffffffff81148554
#20 [ffffc90000863ef8] kthread at ffffffff8114f8ee
#21 [ffffc90000863f30] ret_from_fork at ffffffff8108bb98
#22 [ffffc90000863f50] ret_from_fork_asm at ffffffff81000da1

PID: 876 TASK: ffff888108bebb00 CPU: 3 COMMAND: "md0_reclaim"
#0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
#1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
#2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
#3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
#4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
#5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
#6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
#7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
#8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1

PID: 957 TASK: ffff88810baa0ec0 CPU: 1 COMMAND: "jbd2/dm-3-8"
#0 [ffffc90001d47b10] __schedule at ffffffff81ac18ac
#1 [ffffc90001d47b70] schedule at ffffffff81ac1d82
#2 [ffffc90001d47b88] io_schedule at ffffffff81ac1e4d
#3 [ffffc90001d47b98] wait_for_in_progress at ffffffff81806224
#4 [ffffc90001d47be0] do_origin at ffffffff81807265
#5 [ffffc90001d47c18] __map_bio at ffffffff817ede6a
#6 [ffffc90001d47c48] dm_submit_bio at ffffffff817ee31e
#7 [ffffc90001d47cc0] __submit_bio at ffffffff81515ec1
#8 [ffffc90001d47cd8] submit_bio_noacct_nocheck at ffffffff815162a7
#9 [ffffc90001d47d30] jbd2_journal_commit_transaction at ffffffff813d246c
#10 [ffffc90001d47e90] kjournald2 at ffffffff813d65cb
#11 [ffffc90001d47ef8] kthread at ffffffff8114f8ee
#12 [ffffc90001d47f30] ret_from_fork at ffffffff8108bb98
#13 [ffffc90001d47f50] ret_from_fork_asm at ffffffff81000da1

PID: 1835 TASK: ffff888108a28ec0 CPU: 2 COMMAND: "dd"
#0 [ffffc90000c2fb30] __schedule at ffffffff81ac18ac
#1 [ffffc90000c2fb90] schedule at ffffffff81ac1d82
#2 [ffffc90000c2fba8] io_schedule at ffffffff81ac1e4d
#3 [ffffc90000c2fbb8] bit_wait_io at ffffffff81ac2418
#4 [ffffc90000c2fbc8] __wait_on_bit at ffffffff81ac214a
#5 [ffffc90000c2fc10] out_of_line_wait_on_bit at ffffffff81ac22cc
#6 [ffffc90000c2fc60] do_get_write_access at ffffffff813d0bc3
#7 [ffffc90000c2fcb0] jbd2_journal_get_write_access at ffffffff813d0dc4
#8 [ffffc90000c2fcd8] __ext4_journal_get_write_access at ffffffff8137c2c9
#9 [ffffc90000c2fd18] ext4_reserve_inode_write at ffffffff813997f8
#10 [ffffc90000c2fd40] __ext4_mark_inode_dirty at ffffffff81399a38
#11 [ffffc90000c2fdc0] ext4_dirty_inode at ffffffff8139cf52
#12 [ffffc90000c2fdd8] __mark_inode_dirty at ffffffff81323284
#13 [ffffc90000c2fe10] generic_update_time at ffffffff8130de25
#14 [ffffc90000c2fe28] file_modified at ffffffff8130e23c
#15 [ffffc90000c2fe50] ext4_buffered_write_iter at ffffffff81388b6f
#16 [ffffc90000c2fe78] vfs_write at ffffffff812ee149
#17 [ffffc90000c2ff08] ksys_write at ffffffff812ee47e
#18 [ffffc90000c2ff40] do_syscall_64 at ffffffff81ab418e
#19 [ffffc90000c2ff50] entry_SYSCALL_64_after_hwframe at ffffffff81c0006a
RIP: 00007f14bdcacc74 RSP: 00007ffcee806498 RFLAGS: 00000202
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f14bdcacc74
RDX: 0000000000100000 RSI: 00007f14bdaa0000 RDI: 0000000000000001
RBP: 00007ffcee8064c0 R8: 0000000000000001 R9: 00007ffcee8a8080
R10: 0000000000000017 R11: 0000000000000202 R12: 0000000000100000
R13: 00007f14bdaa0000 R14: 0000000000000000 R15: 0000000000100000
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b

PID: 1914 TASK: ffff8881015e6740 CPU: 1 COMMAND: "kworker/1:2"
#0 [ffffc90000d5fa58] __schedule at ffffffff81ac18ac
#1 [ffffc90000d5fab8] schedule at ffffffff81ac1d82
#2 [ffffc90000d5fad0] schedule_timeout at ffffffff81ac64e9
#3 [ffffc90000d5fb18] io_schedule_timeout at ffffffff81ac15e7
#4 [ffffc90000d5fb30] __wait_for_common at ffffffff81ac2723
#5 [ffffc90000d5fb98] sync_io at ffffffff817f695d
#6 [ffffc90000d5fc00] dm_io at ffffffff817f6b22
#7 [ffffc90000d5fc80] chunk_io at ffffffff81808950
#8 [ffffc90000d5fd38] persistent_commit_exception at ffffffff81808caa
#9 [ffffc90000d5fd50] copy_callback at ffffffff8180601a
#10 [ffffc90000d5fd80] run_complete_job at ffffffff817f78ff
#11 [ffffc90000d5fdc8] process_jobs at ffffffff817f7c5e
#12 [ffffc90000d5fe10] do_work at ffffffff817f7eb7
#13 [ffffc90000d5fe68] process_one_work at ffffffff81147b69
#14 [ffffc90000d5fea8] worker_thread at ffffffff81148554
#15 [ffffc90000d5fef8] kthread at ffffffff8114f8ee
#16 [ffffc90000d5ff30] ret_from_fork at ffffffff8108bb98
#17 [ffffc90000d5ff50] ret_from_fork_asm at ffffffff81000da1
crash> ps -m | grep md
[0 00:00:00.129] [IN] PID: 965 TASK: ffff88810b8de740 CPU: 4 COMMAND: "systemd-oomd"
[0 00:00:01.187] [RU] PID: 875 TASK: ffff888108bee740 CPU: 3 COMMAND: "md0_raid5"
[0 00:00:07.128] [IN] PID: 707 TASK: ffff88810cc31d80 CPU: 1 COMMAND: "systemd-journal"
[0 00:00:07.524] [IN] PID: 1007 TASK: ffff88810b8dc9c0 CPU: 4 COMMAND: "systemd-logind"
[0 00:00:07.524] [IN] PID: 1981 TASK: ffff88810521bb00 CPU: 5 COMMAND: "systemd-hostnam"
[0 00:00:07.524] [IN] PID: 1 TASK: ffff888100158000 CPU: 0 COMMAND: "systemd"
[0 00:00:07.824] [IN] PID: 1971 TASK: ffff88810521ac40 CPU: 2 COMMAND: "systemd-userwor"
[0 00:00:07.825] [IN] PID: 1006 TASK: ffff8881045a0ec0 CPU: 4 COMMAND: "systemd-homed"
[0 00:00:07.830] [IN] PID: 1970 TASK: ffff888105218000 CPU: 1 COMMAND: "systemd-userwor"
[0 00:00:10.916] [IN] PID: 1972 TASK: ffff888105218ec0 CPU: 1 COMMAND: "systemd-userwor"
[0 00:00:36.004] [IN] PID: 971 TASK: ffff8881089c2c40 CPU: 0 COMMAND: "systemd-userdbd"
[0 00:10:56.905] [IN] PID: 966 TASK: ffff888105546740 CPU: 4 COMMAND: "systemd-resolve"
[0 00:15:56.187] [UN] PID: 876 TASK: ffff888108bebb00 CPU: 3 COMMAND: "md0_reclaim"
[0 00:34:52.328] [IN] PID: 1669 TASK: ffff88810521c9c0 CPU: 2 COMMAND: "systemd"
[0 00:39:21.349] [IN] PID: 739 TASK: ffff8881089c5880 CPU: 3 COMMAND: "systemd-udevd"
[0 00:40:59.426] [ID] PID: 74 TASK: ffff888100a68000 CPU: 6 COMMAND: "kworker/R-md"
[0 00:40:59.427] [ID] PID: 75 TASK: ffff888100a68ec0 CPU: 7 COMMAND: "kworker/R-md_bi"
[0 00:40:59.556] [IN] PID: 66 TASK: ffff8881003e8000 CPU: 4 COMMAND: "ksmd"
crash> bt 875
PID: 875 TASK: ffff888108bee740 CPU: 3 COMMAND: "md0_raid5"
#0 [fffffe00000bee60] crash_nmi_callback at ffffffff810a351e
#1 [fffffe00000bee68] nmi_handle at ffffffff81085acb
#2 [fffffe00000beea8] default_do_nmi at ffffffff81ab59d2
#3 [fffffe00000beed0] exc_nmi at ffffffff81ab5c9c
#4 [fffffe00000beef0] end_repeat_nmi at ffffffff81c010f7
[exception RIP: ops_run_io+224]
RIP: ffffffff817c4740 RSP: ffffc90000b3fb58 RFLAGS: 00000206
RAX: 0000000000000220 RBX: 0000000000000003 RCX: ffff88810cee7098
RDX: ffff88812495a3d0 RSI: 0000000000000000 RDI: ffff88810cee7000
RBP: ffff888103884000 R8: 0000000000000000 R9: ffff888103884000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000003 R14: ffff88812495a1b0 R15: ffffc90000b3fc00
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#5 [ffffc90000b3fb58] ops_run_io at ffffffff817c4740
#6 [ffffc90000b3fc40] handle_stripe at ffffffff817cd85d
#7 [ffffc90000b3fd40] handle_active_stripes at ffffffff817ce82c
#8 [ffffc90000b3fdd0] raid5d at ffffffff817cee88
#9 [ffffc90000b3fe98] md_thread at ffffffff817db1ef
#10 [ffffc90000b3fef8] kthread at ffffffff8114f8ee
#11 [ffffc90000b3ff30] ret_from_fork at ffffffff8108bb98
#12 [ffffc90000b3ff50] ret_from_fork_asm at ffffffff81000da1
crash> bt 876
PID: 876 TASK: ffff888108bebb00 CPU: 3 COMMAND: "md0_reclaim"
#0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
#1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
#2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
#3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
#4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
#5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
#6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
#7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
#8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1
crash> bt 74
PID: 74 TASK: ffff888100a68000 CPU: 6 COMMAND: "kworker/R-md"
#0 [ffffc900002afdf8] __schedule at ffffffff81ac18ac
#1 [ffffc900002afe58] schedule at ffffffff81ac1d82
#2 [ffffc900002afe70] rescuer_thread at ffffffff81148138
#3 [ffffc900002afef8] kthread at ffffffff8114f8ee
#4 [ffffc900002aff30] ret_from_fork at ffffffff8108bb98
#5 [ffffc900002aff50] ret_from_fork_asm at ffffffff81000da1
crash> bt 75
PID: 75 TASK: ffff888100a68ec0 CPU: 7 COMMAND: "kworker/R-md_bi"
#0 [ffffc900002b7df8] __schedule at ffffffff81ac18ac
#1 [ffffc900002b7e58] schedule at ffffffff81ac1d82
#2 [ffffc900002b7e70] rescuer_thread at ffffffff81148138
#3 [ffffc900002b7ef8] kthread at ffffffff8114f8ee
#4 [ffffc900002b7f30] ret_from_fork at ffffffff8108bb98
#5 [ffffc900002b7f50] ret_from_fork_asm at ffffffff81000da1
crash>

2024-02-24 02:13:50

by Song Liu

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Hi,

On Fri, Feb 23, 2024 at 12:07 AM Linux regression tracking (Thorsten
Leemhuis) <[email protected]> wrote:
>
> On 21.02.24 00:06, Dan Moulding wrote:
> > Just a friendly reminder that this regression still exists on the
> > mainline. It has been reverted in 6.7 stable. But I upgraded a
> > development system to 6.8-rc5 today and immediately hit this issue
> > again. Then I saw that it hasn't yet been reverted in Linus' tree.
>
> Song Liu, what's the status here? I aware that you fixed with quite a
> few regressions recently, but it seems like resolving this one is
> stalled. Or were you able to reproduce the issue or make some progress
> and I just missed it?

Sorry for the delay with this issue. I have been occupied with some
other stuff this week.

I haven't had any luck reproducing this issue. I will spend more time
looking into it next week.

>
> And if not, what's the way forward here wrt to the release of 6.8?
> Revert the culprit and try again later? Or is that not an option for one
> reason or another?

If we don't make progress with it in the next week, we will do the revert,
same as we did with stable kernels.

>
> Or do we assume that this is not a real issue? That it's caused by some
> oddity (bit-flip in the metadata or something like that?) only to be
> found in Dan's setup?

I don't think this is because of oddities. Hopefully we can get more
information about this soon.

Thanks,
Song

>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
> #regzbot poke
>

2024-03-01 20:28:15

by Junxiao Bi

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Hi Dan & Song,

I have not root-caused this yet, but would like to share some findings from
the vmcore Dan shared. From what I can see, this doesn't look like an md
issue, but rather something wrong in the block layer or below.

1. There were multiple processes hung on IO for over 15 minutes.

crash> ps -m | grep UN
[0 00:15:50.424] [UN]  PID: 957      TASK: ffff88810baa0ec0  CPU: 1   COMMAND: "jbd2/dm-3-8"
[0 00:15:56.151] [UN]  PID: 1835     TASK: ffff888108a28ec0  CPU: 2   COMMAND: "dd"
[0 00:15:56.187] [UN]  PID: 876      TASK: ffff888108bebb00  CPU: 3   COMMAND: "md0_reclaim"
[0 00:15:56.185] [UN]  PID: 1914     TASK: ffff8881015e6740  CPU: 1   COMMAND: "kworker/1:2"
[0 00:15:56.255] [UN]  PID: 403      TASK: ffff888101351d80  CPU: 7   COMMAND: "kworker/u21:1"

2. Let's pick md0_reclaim to take a look; it is waiting for the superblock
update to complete. We can see there were two pending superblock writes and
other pending IO for the underlying physical disks, which caused these
processes to hang.

crash> bt 876
PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND: "md0_reclaim"
 #0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
 #1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
 #2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
 #3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
 #4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
 #5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
 #6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
 #7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
 #8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1

crash> mddev.pending_writes,disks 0xffff888108335800
  pending_writes = {
    counter = 2  <<<<<<<<<< 2 active super block write
  },
  disks = {
    next = 0xffff88810ce85a00,
    prev = 0xffff88810ce84c00
  },
crash> list -l md_rdev.same_set -s md_rdev.kobj.name,nr_pending
0xffff88810ce85a00
ffff88810ce85a00
  kobj.name = 0xffff8881067c1a00 "dev-dm-1",
  nr_pending = {
    counter = 0
  },
ffff8881083ace00
  kobj.name = 0xffff888100a93280 "dev-sde",
  nr_pending = {
    counter = 10 <<<<
  },
ffff8881010ad200
  kobj.name = 0xffff8881012721c8 "dev-sdc",
  nr_pending = {
    counter = 8 <<<<<
  },
ffff88810ce84c00
  kobj.name = 0xffff888100325f08 "dev-sdd",
  nr_pending = {
    counter = 2 <<<<<
  },

3. From the block layer, I can find the in-flight IO for the md superblock
write, which has been pending for 955s; that matches the hang time of
"md0_reclaim".

crash>
request.q,mq_hctx,cmd_flags,rq_flags,start_time_ns,bio,biotail,state,__data_len,flush,end_io
ffff888103b4c300
  q = 0xffff888103a00d80,
  mq_hctx = 0xffff888103c5d200,
  cmd_flags = 38913,
  rq_flags = 139408,
  start_time_ns = 1504179024146,
  bio = 0x0,
  biotail = 0xffff888120758e40,
  state = MQ_RQ_COMPLETE,
  __data_len = 0,
  flush = {
    seq = 3, <<<< REQ_FSEQ_PREFLUSH |  REQ_FSEQ_DATA
    saved_end_io = 0x0
  },
  end_io = 0xffffffff815186e0 <mq_flush_data_end_io>,

crash> p tk_core.timekeeper.tkr_mono.base
$1 = 2459916243002
crash> eval 2459916243002-1504179024146
hexadecimal: de86609f28
    decimal: 955737218856  <<<<<<< IO pending time is 955s
      octal: 15720630117450
     binary:
0000000000000000000000001101111010000110011000001001111100101000

crash> bio.bi_iter,bi_end_io 0xffff888120758e40
  bi_iter = {
    bi_sector = 8, <<<< super block offset
    bi_size = 0,
    bi_idx = 0,
    bi_bvec_done = 0
  },
  bi_end_io = 0xffffffff817dca50 <super_written>,
crash> dev -d | grep ffff888103a00d80
    8 ffff8881003ab000   sdd        ffff888103a00d80       0 0     0
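
As a quick sanity check of that arithmetic, a trivial standalone C snippet
(the two constants are simply the values dumped above) gives the same ~955s:

    #include <stdio.h>

    int main(void)
    {
            unsigned long long now_ns   = 2459916243002ULL;  /* tk_core.timekeeper.tkr_mono.base */
            unsigned long long start_ns = 1504179024146ULL;  /* request.start_time_ns */

            printf("superblock write pending for %.1f s\n",
                   (double)(now_ns - start_ns) / 1e9);
            return 0;
    }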

4. Checking the above request: even though its state is "MQ_RQ_COMPLETE", it
is still pending. That's because each md superblock write was marked with
REQ_PREFLUSH | REQ_FUA, so it is handled in 3 steps: pre_flush, data, and
post_flush. As each step completes, it is recorded in "request.flush.seq";
here the value is 3, which is REQ_FSEQ_PREFLUSH | REQ_FSEQ_DATA, so the last
step, "post_flush", has not been done. Another weird thing is that
blk_flush_queue.flush_data_in_flight is still 1 even though the "data" step
is already done.

crash> blk_mq_hw_ctx.fq 0xffff888103c5d200
  fq = 0xffff88810332e240,
crash> blk_flush_queue 0xffff88810332e240
struct blk_flush_queue {
  mq_flush_lock = {
    {
      rlock = {
        raw_lock = {
          {
            val = {
              counter = 0
            },
            {
              locked = 0 '\000',
              pending = 0 '\000'
            },
            {
              locked_pending = 0,
              tail = 0
            }
          }
        }
      }
    }
  },
  flush_pending_idx = 1,
  flush_running_idx = 1,
  rq_status = 0 '\000',
  flush_pending_since = 4296171408,
  flush_queue = {{
      next = 0xffff88810332e250,
      prev = 0xffff88810332e250
    }, {
      next = 0xffff888103b4c348, <<<< the request is in this list
      prev = 0xffff888103b4c348
    }},
  flush_data_in_flight = 1,  >>>>>> still 1
  flush_rq = 0xffff888103c2e000
}

crash> list 0xffff888103b4c348
ffff888103b4c348
ffff88810332e260

crash> request.tag,state,ref 0xffff888103c2e000 >>>> flush_rq of hw queue
  tag = -1,
  state = MQ_RQ_IDLE,
  ref = {
    counter = 0
  },
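
For reference, here is a tiny standalone helper that decodes that flush
"seq" value. The bit values are an assumption based on the REQ_FSEQ_*
annotations above (PREFLUSH, DATA, POSTFLUSH as the three low bits), not a
copy of the kernel headers:

    #include <stdio.h>

    /* assumed bit layout, matching the REQ_FSEQ_* annotations in this thread */
    #define FSEQ_PREFLUSH  (1u << 0)
    #define FSEQ_DATA      (1u << 1)
    #define FSEQ_POSTFLUSH (1u << 2)

    int main(void)
    {
            unsigned int seq = 3;  /* the value seen on the stuck superblock write */

            printf("seq=%u:%s%s%s\n", seq,
                   (seq & FSEQ_PREFLUSH)  ? " PREFLUSH done"  : "",
                   (seq & FSEQ_DATA)      ? " DATA done"      : "",
                   (seq & FSEQ_POSTFLUSH) ? " POSTFLUSH done" : ", POSTFLUSH still outstanding");
            return 0;
    }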

5. It looks like the block layer or the underlying layer (scsi/virtio-scsi)
may have some issue which leaves the IO request from the md layer stuck in a
partially complete state. I can't see how this can be related to commit
bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
raid5d"").


Dan,

Are you able to reproduce using a regular scsi disk? I would like to
rule out whether this is related to virtio-scsi.

And I see the kernel version is 6.8.0-rc5 from the vmcore; is this the
official mainline v6.8-rc5 without any other patches?


Thanks,

Junxiao.

On 2/23/24 6:13 PM, Song Liu wrote:
> Hi,
>
> On Fri, Feb 23, 2024 at 12:07 AM Linux regression tracking (Thorsten
> Leemhuis) <[email protected]> wrote:
>> On 21.02.24 00:06, Dan Moulding wrote:
>>> Just a friendly reminder that this regression still exists on the
>>> mainline. It has been reverted in 6.7 stable. But I upgraded a
>>> development system to 6.8-rc5 today and immediately hit this issue
>>> again. Then I saw that it hasn't yet been reverted in Linus' tree.
>> Song Liu, what's the status here? I aware that you fixed with quite a
>> few regressions recently, but it seems like resolving this one is
>> stalled. Or were you able to reproduce the issue or make some progress
>> and I just missed it?
> Sorry for the delay with this issue. I have been occupied with some
> other stuff this week.
>
> I haven't got luck to reproduce this issue. I will spend more time looking
> into it next week.
>
>> And if not, what's the way forward here wrt to the release of 6.8?
>> Revert the culprit and try again later? Or is that not an option for one
>> reason or another?
> If we don't make progress with it in the next week, we will do the revert,
> same as we did with stable kernels.
>
>> Or do we assume that this is not a real issue? That it's caused by some
>> oddity (bit-flip in the metadata or something like that?) only to be
>> found in Dan's setup?
> I don't think this is because of oddities. Hopefully we can get more
> information about this soon.
>
> Thanks,
> Song
>
>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>> --
>> Everything you wanna know about Linux kernel regression tracking:
>> https://linux-regtracking.leemhuis.info/about/#tldr
>> If I did something stupid, please tell me, as explained on that page.
>>
>> #regzbot poke
>>

2024-03-01 23:12:37

by Dan Moulding

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

> 5. Looks like the block layer or underlying(scsi/virtio-scsi) may have
> some issue which leading to the io request from md layer stayed in a
> partial complete statue. I can't see how this can be related with the
> commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
> raid5d"")

There is no question that the above mentioned commit makes this
problem appear. While it may be that ultimately the root cause lies
outside the md/raid5 code (I'm not able to make such an assessment), I
can tell you that change is what turned it into a runtime
regression. Prior to that change, I cannot reproduce the problem. One
of my RAID-5 arrays has been running on every kernel version since
4.8 without issue. Then, with kernel 6.7.1, the problem appeared within
hours of running the new code and affected not just one but two
different machines with RAID-5 arrays. With that change reverted, the
problem is not reproducible. Then when I recently upgraded to 6.8-rc5
I immediately hit the problem again (because it hadn't been reverted
in the mainline yet). I'm now running 6.8.0-rc5 on one of my affected
machines without issue after reverting that commit on top of it.

It would seem a very unlikely coincidence that a careful bisection of
all changes between 6.7.0 and 6.7.1 pointed at that commit as being
the culprit, and that the change is to raid5.c, and that the hang
happens in the raid5 kernel task, if there was no connection. :)

> Are you able to reproduce using some regular scsi disk, would like to
> rule out whether this is related with virtio-scsi?

The first time I hit this problem was on two bare-metal machines, one
server and one desktop with different hardware. I then set up this
virtual machine just to reproduce the problem in a different
environment (and to see if I could reproduce it with a distribution
kernel since the other machines are running custom kernel
configurations). So I'm able to reproduce it on:

- A virtual machine
- Bare metal machines
- Custom kernel configuration with code straight from stable and mainline
- Fedora 39 distribution kernel

> And I see the kernel version is 6.8.0-rc5 from vmcore, is this the
> official mainline v6.8-rc5 without any other patches?

Yes this particular vmcore was from the Fedora 39 VM I used to
reproduce the problem, but with the straight 6.8.0-rc5 mainline code
(so that you wouldn't have to worry about any possible interference
from distribution patches).

Cheers,

-- Dan

2024-03-02 00:07:49

by Song Liu

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Hi Dan and Junxiao,

On Fri, Mar 1, 2024 at 3:12 PM Dan Moulding <[email protected]> wrote:
>
> > 5. Looks like the block layer or underlying(scsi/virtio-scsi) may have
> > some issue which leading to the io request from md layer stayed in a
> > partial complete statue. I can't see how this can be related with the
> > commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
> > raid5d"")
>
> There is no question that the above mentioned commit makes this
> problem appear. While it may be that ultimately the root cause lies
> outside the md/raid5 code (I'm not able to make such an assessment), I
> can tell you that change is what turned it into a runtime
> regression. Prior to that change, I cannot reproduce the problem. One
> of my RAID-5 arrays has been running on every kernel version since
> 4.8, without issue. Then kernel 6.7.1 the problem appeared within
> hours of running the new code and affected not just one but two
> different machines with RAID-5 arrays. With that change reverted, the
> problem is not reproducible. Then when I recently upgraded to 6.8-rc5
> I immediately hit the problem again (because it hadn't been reverted
> in the mainline yet). I'm now running 6.8.0-rc5 on one of my affected
> machines without issue after reverting that commit on top of it.
>
> It would seem a very unlikely coincidence that a careful bisection of
> all changes between 6.7.0 and 6.7.1 pointed at that commit as being
> the culprit, and that the change is to raid5.c, and that the hang
> happens in the raid5 kernel task, if there was no connection. :)
>
> > Are you able to reproduce using some regular scsi disk, would like to
> > rule out whether this is related with virtio-scsi?
>
> The first time I hit this problem was on two bare-metal machines, one
> server and one desktop with different hardware. I then set up this
> virtual machine just to reproduce the problem in a different
> environment (and to see if I could reproduce it with a distribution
> kernel since the other machines are running custom kernel
> configurations). So I'm able to reproduce it on:
>
> - A virtual machine
> - Bare metal machines
> - Custom kernel configuration with straight from stable and mainline code
> - Fedora 39 distribution kernel
>
> > And I see the kernel version is 6.8.0-rc5 from vmcore, is this the
> > official mainline v6.8-rc5 without any other patches?
>
> Yes this particular vmcore was from the Fedora 39 VM I used to
> reproduce the problem, but with the straight 6.8.0-rc5 mainline code
> (so that you wouldn't have to worry about any possible interference
> from distribution patches).

Thanks to both of you for looking into the issue and running various
tests.

I also tried again to reproduce the issue, but haven't had any luck. While
I continue trying to reproduce it, I will also send the revert for the 6.8
kernel. We have been fighting multiple issues recently, so we didn't get
much time to spend on this one. Fortunately, we now have proper fixes for
most of the other issues, so we should have more time to look into this.

Thanks again,
Song

2024-03-02 16:55:55

by Dan Moulding

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

> I have not root cause this yet, but would like share some findings from
> the vmcore Dan shared. From what i can see, this doesn't look like a md
> issue, but something wrong with block layer or below.

Below is one other thing I found that might be of interest. This is
from the original email thread [1] that was linked to in the original
issue from 2022, which the change in question reverts:

On 2022-09-02 17:46, Logan Gunthorpe wrote:
> I've made some progress on this nasty bug. I've got far enough to know it's not
> related to the blk-wbt or the block layer.
>
> Turns out a bunch of bios are stuck queued in a blk_plug in the md_raid5
> thread while that thread appears to be stuck in an infinite loop (so it never
> schedules or does anything to flush the plug).
>
> I'm still debugging to try and find out the root cause of that infinite loop,
> but I just wanted to send an update that the previous place I was stuck at
> was not correct.
>
> Logan

This certainly sounds like it has some similarities to what we are
seeing when that change is reverted. The md0_raid5 thread appears to be
in an infinite loop, consuming 100% CPU, but not actually doing any
work.

-- Dan

[1] https://lore.kernel.org/r/[email protected]

2024-03-06 08:38:46

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

On 02.03.24 01:05, Song Liu wrote:
> On Fri, Mar 1, 2024 at 3:12 PM Dan Moulding <[email protected]> wrote:
>>
>>> 5. Looks like the block layer or underlying(scsi/virtio-scsi) may have
>>> some issue which leading to the io request from md layer stayed in a
>>> partial complete statue. I can't see how this can be related with the
>>> commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
>>> raid5d"")
>>
>> There is no question that the above mentioned commit makes this
>> problem appear. While it may be that ultimately the root cause lies
>> outside the md/raid5 code (I'm not able to make such an assessment), I
>> can tell you that change is what turned it into a runtime
>> regression. Prior to that change, I cannot reproduce the problem. One
>> of my RAID-5 arrays has been running on every kernel version since
>> 4.8, without issue. Then kernel 6.7.1 the problem appeared within
>> hours of running the new code and affected not just one but two
>> different machines with RAID-5 arrays. With that change reverted, the
>> problem is not reproducible. Then when I recently upgraded to 6.8-rc5
>> I immediately hit the problem again (because it hadn't been reverted
>> in the mainline yet). I'm now running 6.8.0-rc5 on one of my affected
>> machines without issue after reverting that commit on top of it.
> [...]
> I also tried again to reproduce the issue, but haven't got luck. While
> I will continue try to repro the issue, I will also send the revert to 6.8
> kernel.

Is that revert on the way meanwhile? I'm asking because Linus might
release 6.8 on Sunday.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

2024-03-06 17:14:23

by Song Liu

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Hi Thorsten,

On Wed, Mar 6, 2024 at 12:38 AM Linux regression tracking (Thorsten
Leemhuis) <[email protected]> wrote:
>
> On 02.03.24 01:05, Song Liu wrote:
> > On Fri, Mar 1, 2024 at 3:12 PM Dan Moulding <[email protected]> wrote:
> >>
> >>> 5. Looks like the block layer or underlying(scsi/virtio-scsi) may have
> >>> some issue which leading to the io request from md layer stayed in a
> >>> partial complete statue. I can't see how this can be related with the
> >>> commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
> >>> raid5d"")
> >>
> >> There is no question that the above mentioned commit makes this
> >> problem appear. While it may be that ultimately the root cause lies
> >> outside the md/raid5 code (I'm not able to make such an assessment), I
> >> can tell you that change is what turned it into a runtime
> >> regression. Prior to that change, I cannot reproduce the problem. One
> >> of my RAID-5 arrays has been running on every kernel version since
> >> 4.8, without issue. Then kernel 6.7.1 the problem appeared within
> >> hours of running the new code and affected not just one but two
> >> different machines with RAID-5 arrays. With that change reverted, the
> >> problem is not reproducible. Then when I recently upgraded to 6.8-rc5
> >> I immediately hit the problem again (because it hadn't been reverted
> >> in the mainline yet). I'm now running 6.8.0-rc5 on one of my affected
> >> machines without issue after reverting that commit on top of it.
> > [...]
> > I also tried again to reproduce the issue, but haven't got luck. While
> > I will continue try to repro the issue, I will also send the revert to 6.8
> > kernel.
>
> Is that revert on the way meanwhile? I'm asking because Linus might
> release 6.8 on Sunday.

The patch is on its way to the 6.9 kernel via a PR sent yesterday [1]. It
will land in the stable 6.8 kernel via stable backports.

Since this is not a new regression in the 6.8 kernel and Dan is the only one
experiencing it, we would rather not rush a last-minute change into the 6.8
release.

Thanks,
Song

[1] https://lore.kernel.org/linux-raid/[email protected]/

2024-03-07 03:35:20

by Yu Kuai

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Hi,

On 2024/03/02 4:26, [email protected] wrote:
> Hi Dan & Song,
>
> I have not root cause this yet, but would like share some findings from
> the vmcore Dan shared. From what i can see, this doesn't look like a md
> issue, but something wrong with block layer or below.

I would like to take a look at the vmcore as well. How is Dan sharing the
vmcore? I don't see it in the thread.

Thanks,
Kuai

>
> 1. There were multiple process hung by IO over 15mins.
>
> crash> ps -m | grep UN
> [0 00:15:50.424] [UN]  PID: 957      TASK: ffff88810baa0ec0  CPU: 1
> COMMAND: "jbd2/dm-3-8"
> [0 00:15:56.151] [UN]  PID: 1835     TASK: ffff888108a28ec0  CPU: 2
> COMMAND: "dd"
> [0 00:15:56.187] [UN]  PID: 876      TASK: ffff888108bebb00  CPU: 3
> COMMAND: "md0_reclaim"
> [0 00:15:56.185] [UN]  PID: 1914     TASK: ffff8881015e6740  CPU: 1
> COMMAND: "kworker/1:2"
> [0 00:15:56.255] [UN]  PID: 403      TASK: ffff888101351d80  CPU: 7
> COMMAND: "kworker/u21:1"
>
> 2. Let pick md0_reclaim to take a look, it is waiting done super_block
> update. We can see there were two pending superblock write and other
> pending io for the underling physical disk, which caused these process
> hung.
>
> crash> bt 876
> PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND: "md0_reclaim"
>  #0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
>  #1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
>  #2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
>  #3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
>  #4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
>  #5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
>  #6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
>  #7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
>  #8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1
>
> crash> mddev.pending_writes,disks 0xffff888108335800
>   pending_writes = {
>     counter = 2  <<<<<<<<<< 2 active super block write
>   },
>   disks = {
>     next = 0xffff88810ce85a00,
>     prev = 0xffff88810ce84c00
>   },
> crash> list -l md_rdev.same_set -s md_rdev.kobj.name,nr_pending
> 0xffff88810ce85a00
> ffff88810ce85a00
>   kobj.name = 0xffff8881067c1a00 "dev-dm-1",
>   nr_pending = {
>     counter = 0
>   },
> ffff8881083ace00
>   kobj.name = 0xffff888100a93280 "dev-sde",
>   nr_pending = {
>     counter = 10 <<<<
>   },
> ffff8881010ad200
>   kobj.name = 0xffff8881012721c8 "dev-sdc",
>   nr_pending = {
>     counter = 8 <<<<<
>   },
> ffff88810ce84c00
>   kobj.name = 0xffff888100325f08 "dev-sdd",
>   nr_pending = {
>     counter = 2 <<<<<
>   },
>
> 3. From block layer, i can find the inflight IO for md superblock write
> which has been pending 955s which matches with the hung time of
> "md0_reclaim"
>
> crash>
> request.q,mq_hctx,cmd_flags,rq_flags,start_time_ns,bio,biotail,state,__data_len,flush,end_io
> ffff888103b4c300
>   q = 0xffff888103a00d80,
>   mq_hctx = 0xffff888103c5d200,
>   cmd_flags = 38913,
>   rq_flags = 139408,
>   start_time_ns = 1504179024146,
>   bio = 0x0,
>   biotail = 0xffff888120758e40,
>   state = MQ_RQ_COMPLETE,
>   __data_len = 0,
>   flush = {
>     seq = 3, <<<< REQ_FSEQ_PREFLUSH |  REQ_FSEQ_DATA
>     saved_end_io = 0x0
>   },
>   end_io = 0xffffffff815186e0 <mq_flush_data_end_io>,
>
> crash> p tk_core.timekeeper.tkr_mono.base
> $1 = 2459916243002
> crash> eval 2459916243002-1504179024146
> hexadecimal: de86609f28
>     decimal: 955737218856  <<<<<<< IO pending time is 955s
>       octal: 15720630117450
>      binary:
> 0000000000000000000000001101111010000110011000001001111100101000
>
> crash> bio.bi_iter,bi_end_io 0xffff888120758e40
>   bi_iter = {
>     bi_sector = 8, <<<< super block offset
>     bi_size = 0,
>     bi_idx = 0,
>     bi_bvec_done = 0
>   },
>   bi_end_io = 0xffffffff817dca50 <super_written>,
> crash> dev -d | grep ffff888103a00d80
>     8 ffff8881003ab000   sdd        ffff888103a00d80       0 0     0
>
> 4. Check above request, even its state is "MQ_RQ_COMPLETE", but it is
> still pending. That's because each md superblock write was marked with
> REQ_PREFLUSH | REQ_FUA, so it will be handled in 3 steps: pre_flush,
> data, and post_flush. Once each step complete, it will be marked in
> "request.flush.seq", here the value is 3, which is REQ_FSEQ_PREFLUSH |
> REQ_FSEQ_DATA, so the last step "post_flush" has not be done.  Another
> wired thing is that blk_flush_queue.flush_data_in_flight is still 1 even
> "data" step already done.
>
> crash> blk_mq_hw_ctx.fq 0xffff888103c5d200
>   fq = 0xffff88810332e240,
> crash> blk_flush_queue 0xffff88810332e240
> struct blk_flush_queue {
>   mq_flush_lock = {
>     {
>       rlock = {
>         raw_lock = {
>           {
>             val = {
>               counter = 0
>             },
>             {
>               locked = 0 '\000',
>               pending = 0 '\000'
>             },
>             {
>               locked_pending = 0,
>               tail = 0
>             }
>           }
>         }
>       }
>     }
>   },
>   flush_pending_idx = 1,
>   flush_running_idx = 1,
>   rq_status = 0 '\000',
>   flush_pending_since = 4296171408,
>   flush_queue = {{
>       next = 0xffff88810332e250,
>       prev = 0xffff88810332e250
>     }, {
>       next = 0xffff888103b4c348, <<<< the request is in this list
>       prev = 0xffff888103b4c348
>     }},
>   flush_data_in_flight = 1,  >>>>>> still 1
>   flush_rq = 0xffff888103c2e000
> }
>
> crash> list 0xffff888103b4c348
> ffff888103b4c348
> ffff88810332e260
>
> crash> request.tag,state,ref 0xffff888103c2e000 >>>> flush_rq of hw queue
>   tag = -1,
>   state = MQ_RQ_IDLE,
>   ref = {
>     counter = 0
>   },
>
> 5. Looks like the block layer or underlying(scsi/virtio-scsi) may have
> some issue which leading to the io request from md layer stayed in a
> partial complete statue. I can't see how this can be related with the
> commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
> raid5d"")
>
>
> Dan,
>
> Are you able to reproduce using some regular scsi disk, would like to
> rule out whether this is related with virtio-scsi?
>
> And I see the kernel version is 6.8.0-rc5 from vmcore, is this the
> official mainline v6.8-rc5 without any other patches?
>
>
> Thanks,
>
> Junxiao.
>
> On 2/23/24 6:13 PM, Song Liu wrote:
>> Hi,
>>
>> On Fri, Feb 23, 2024 at 12:07 AM Linux regression tracking (Thorsten
>> Leemhuis) <[email protected]> wrote:
>>> On 21.02.24 00:06, Dan Moulding wrote:
>>>> Just a friendly reminder that this regression still exists on the
>>>> mainline. It has been reverted in 6.7 stable. But I upgraded a
>>>> development system to 6.8-rc5 today and immediately hit this issue
>>>> again. Then I saw that it hasn't yet been reverted in Linus' tree.
>>> Song Liu, what's the status here? I aware that you fixed with quite a
>>> few regressions recently, but it seems like resolving this one is
>>> stalled. Or were you able to reproduce the issue or make some progress
>>> and I just missed it?
>> Sorry for the delay with this issue. I have been occupied with some
>> other stuff this week.
>>
>> I haven't got luck to reproduce this issue. I will spend more time
>> looking
>> into it next week.
>>
>>> And if not, what's the way forward here wrt to the release of 6.8?
>>> Revert the culprit and try again later? Or is that not an option for one
>>> reason or another?
>> If we don't make progress with it in the next week, we will do the
>> revert,
>> same as we did with stable kernels.
>>
>>> Or do we assume that this is not a real issue? That it's caused by some
>>> oddity (bit-flip in the metadata or something like that?) only to be
>>> found in Dan's setup?
>> I don't think this is because of oddities. Hopefully we can get more
>> information about this soon.
>>
>> Thanks,
>> Song
>>
>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>> --
>>> Everything you wanna know about Linux kernel regression tracking:
>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>> If I did something stupid, please tell me, as explained on that page.
>>>
>>> #regzbot poke
>>>
>
> .
>


2024-03-08 23:50:30

by Junxiao Bi

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Here is the root cause for this issue:

Commit 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in
raid5d") introduced a regression, so it got reverted through commit
bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
raid5d""). To address the original issue that commit 5e2cf333b7bd was
fixing, commit d6e035aad6c0 ("md: bypass block throttle for superblock
update") was created. It avoids the md superblock write getting throttled
by the block layer, which is good, but the md superblock write can also
get stuck in the block layer due to a block flush, and that is what was
happening in this regression report.

Process "md0_reclaim" got stuck while waiting IO for md superblock write
done, that IO was marked with REQ_PREFLUSH | REQ_FUA flags, these 3
steps ( PREFLUSH, DATA and POSTFLUSH ) will be executed before done, the
hung of this process is because the last step "POSTFLUSH" never done.
And that was because of  process "md0_raid5" submitted another IO with
REQ_FUA flag marked just before that step started. To handle that IO,
blk_insert_flush() will be invoked and hit "REQ_FSEQ_DATA |
REQ_FSEQ_POSTFLUSH" case where "fq->flush_data_in_flight" will be
increased. When the IO for md superblock write was to issue "POSTFLUSH"
step through blk_kick_flush(), it found that "fq->flush_data_in_flight"
was not zero, so it will skip that step, that is expected, because flush
will be triggered when "fq->flush_data_in_flight" dropped to zero.

Unfortunately here that inflight data IO from "md0_raid5" will never
done, because it was added into the blk_plug list of that process, but
"md0_raid5" run into infinite loop due to "MD_SB_CHANGE_PENDING" which
made it never had a chance to finish the blk plug until
"MD_SB_CHANGE_PENDING" was cleared. Process "md0_reclaim" was supposed
to clear that flag but it was stuck by "md0_raid5", so this is a deadlock.
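
To make the circular wait easier to see, below is a minimal userspace sketch
of the dependency cycle (made-up names and pthreads rather than kernel code;
it only models the two waiters described above):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static atomic_bool sb_change_pending = true;  /* models MD_SB_CHANGE_PENDING */
    static atomic_int  plugged_io        = 1;     /* models the REQ_FUA bio stuck in the plug */

    /* models md0_raid5: spins on the pending flag and never flushes its plug */
    static void *raid5d_model(void *arg)
    {
            (void)arg;
            while (atomic_load(&sb_change_pending))
                    ;  /* plugged_io is never submitted from here */
            return NULL;
    }

    /* models md0_reclaim: cannot clear the flag until the plugged IO completes */
    static void *reclaim_model(void *arg)
    {
            (void)arg;
            while (atomic_load(&plugged_io) > 0)  /* never drops to zero */
                    usleep(1000);
            atomic_store(&sb_change_pending, false);  /* never reached */
            return NULL;
    }

    int main(void)
    {
            pthread_t a, b;

            pthread_create(&a, NULL, raid5d_model, NULL);
            pthread_create(&b, NULL, reclaim_model, NULL);
            sleep(2);
            printf("after 2s: sb_change_pending=%d, plugged_io=%d -> neither side can make progress\n",
                   (int)atomic_load(&sb_change_pending), atomic_load(&plugged_io));
            exit(0);  /* exit without joining; both threads are stuck on each other */
    }

In the real system, flushing the plug is what breaks the cycle, which is
what the patch below does.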

It looks like the approach in the RFC patch that tried to resolve the
regression caused by commit 5e2cf333b7bd can help with this issue. Once
"md0_raid5" starts looping due to "MD_SB_CHANGE_PENDING", it should
release all of its staged IO requests to avoid blocking others, and a
cond_resched() will keep it from running into a lockup.

https://www.spinics.net/lists/raid/msg75338.html

Dan, can you try the following patch?

diff --git a/block/blk-core.c b/block/blk-core.c
index de771093b526..474462abfbdc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1183,6 +1183,7 @@ void __blk_flush_plug(struct blk_plug *plug, bool from_schedule)
 	if (unlikely(!rq_list_empty(plug->cached_rq)))
 		blk_mq_free_plug_rqs(plug);
 }
+EXPORT_SYMBOL(__blk_flush_plug);
 
 /**
  * blk_finish_plug - mark the end of a batch of submitted I/O
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8497880135ee..26e09cdf46a3 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6773,6 +6773,11 @@ static void raid5d(struct md_thread *thread)
 			spin_unlock_irq(&conf->device_lock);
 			md_check_recovery(mddev);
 			spin_lock_irq(&conf->device_lock);
+		} else {
+			spin_unlock_irq(&conf->device_lock);
+			blk_flush_plug(&plug, false);
+			cond_resched();
+			spin_lock_irq(&conf->device_lock);
 		}
 	}
 	pr_debug("%d stripes handled\n", handled);

Thanks,

Junxiao.

On 3/1/24 12:26 PM, [email protected] wrote:
> Hi Dan & Song,
>
> I have not root cause this yet, but would like share some findings
> from the vmcore Dan shared. From what i can see, this doesn't look
> like a md issue, but something wrong with block layer or below.
>
> 1. There were multiple process hung by IO over 15mins.
>
> crash> ps -m | grep UN
> [0 00:15:50.424] [UN]  PID: 957      TASK: ffff88810baa0ec0  CPU: 1   
> COMMAND: "jbd2/dm-3-8"
> [0 00:15:56.151] [UN]  PID: 1835     TASK: ffff888108a28ec0  CPU: 2   
> COMMAND: "dd"
> [0 00:15:56.187] [UN]  PID: 876      TASK: ffff888108bebb00  CPU: 3   
> COMMAND: "md0_reclaim"
> [0 00:15:56.185] [UN]  PID: 1914     TASK: ffff8881015e6740  CPU: 1   
> COMMAND: "kworker/1:2"
> [0 00:15:56.255] [UN]  PID: 403      TASK: ffff888101351d80  CPU: 7   
> COMMAND: "kworker/u21:1"
>
> 2. Let pick md0_reclaim to take a look, it is waiting done super_block
> update. We can see there were two pending superblock write and other
> pending io for the underling physical disk, which caused these process
> hung.
>
> crash> bt 876
> PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND: "md0_reclaim"
>  #0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
>  #1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
>  #2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
>  #3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
>  #4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
>  #5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
>  #6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
>  #7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
>  #8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1
>
> crash> mddev.pending_writes,disks 0xffff888108335800
>   pending_writes = {
>     counter = 2  <<<<<<<<<< 2 active super block write
>   },
>   disks = {
>     next = 0xffff88810ce85a00,
>     prev = 0xffff88810ce84c00
>   },
> crash> list -l md_rdev.same_set -s md_rdev.kobj.name,nr_pending
> 0xffff88810ce85a00
> ffff88810ce85a00
>   kobj.name = 0xffff8881067c1a00 "dev-dm-1",
>   nr_pending = {
>     counter = 0
>   },
> ffff8881083ace00
>   kobj.name = 0xffff888100a93280 "dev-sde",
>   nr_pending = {
>     counter = 10 <<<<
>   },
> ffff8881010ad200
>   kobj.name = 0xffff8881012721c8 "dev-sdc",
>   nr_pending = {
>     counter = 8 <<<<<
>   },
> ffff88810ce84c00
>   kobj.name = 0xffff888100325f08 "dev-sdd",
>   nr_pending = {
>     counter = 2 <<<<<
>   },
>
> 3. From block layer, i can find the inflight IO for md superblock
> write which has been pending 955s which matches with the hung time of
> "md0_reclaim"
>
> crash>
> request.q,mq_hctx,cmd_flags,rq_flags,start_time_ns,bio,biotail,state,__data_len,flush,end_io
> ffff888103b4c300
>   q = 0xffff888103a00d80,
>   mq_hctx = 0xffff888103c5d200,
>   cmd_flags = 38913,
>   rq_flags = 139408,
>   start_time_ns = 1504179024146,
>   bio = 0x0,
>   biotail = 0xffff888120758e40,
>   state = MQ_RQ_COMPLETE,
>   __data_len = 0,
>   flush = {
>     seq = 3, <<<< REQ_FSEQ_PREFLUSH |  REQ_FSEQ_DATA
>     saved_end_io = 0x0
>   },
>   end_io = 0xffffffff815186e0 <mq_flush_data_end_io>,
>
> crash> p tk_core.timekeeper.tkr_mono.base
> $1 = 2459916243002
> crash> eval 2459916243002-1504179024146
> hexadecimal: de86609f28
>     decimal: 955737218856  <<<<<<< IO pending time is 955s
>       octal: 15720630117450
>      binary:
> 0000000000000000000000001101111010000110011000001001111100101000
>
> crash> bio.bi_iter,bi_end_io 0xffff888120758e40
>   bi_iter = {
>     bi_sector = 8, <<<< super block offset
>     bi_size = 0,
>     bi_idx = 0,
>     bi_bvec_done = 0
>   },
>   bi_end_io = 0xffffffff817dca50 <super_written>,
> crash> dev -d | grep ffff888103a00d80
>     8 ffff8881003ab000   sdd        ffff888103a00d80       0 0 0
>
> 4. Check above request, even its state is "MQ_RQ_COMPLETE", but it is
> still pending. That's because each md superblock write was marked with
> REQ_PREFLUSH | REQ_FUA, so it will be handled in 3 steps: pre_flush,
> data, and post_flush. Once each step complete, it will be marked in
> "request.flush.seq", here the value is 3, which is REQ_FSEQ_PREFLUSH
> |  REQ_FSEQ_DATA, so the last step "post_flush" has not be done. 
> Another wired thing is that blk_flush_queue.flush_data_in_flight is
> still 1 even "data" step already done.
>
> crash> blk_mq_hw_ctx.fq 0xffff888103c5d200
>   fq = 0xffff88810332e240,
> crash> blk_flush_queue 0xffff88810332e240
> struct blk_flush_queue {
>   mq_flush_lock = {
>     {
>       rlock = {
>         raw_lock = {
>           {
>             val = {
>               counter = 0
>             },
>             {
>               locked = 0 '\000',
>               pending = 0 '\000'
>             },
>             {
>               locked_pending = 0,
>               tail = 0
>             }
>           }
>         }
>       }
>     }
>   },
>   flush_pending_idx = 1,
>   flush_running_idx = 1,
>   rq_status = 0 '\000',
>   flush_pending_since = 4296171408,
>   flush_queue = {{
>       next = 0xffff88810332e250,
>       prev = 0xffff88810332e250
>     }, {
>       next = 0xffff888103b4c348, <<<< the request is in this list
>       prev = 0xffff888103b4c348
>     }},
>   flush_data_in_flight = 1,  >>>>>> still 1
>   flush_rq = 0xffff888103c2e000
> }
>
> crash> list 0xffff888103b4c348
> ffff888103b4c348
> ffff88810332e260
>
> crash> request.tag,state,ref 0xffff888103c2e000 >>>> flush_rq of hw queue
>   tag = -1,
>   state = MQ_RQ_IDLE,
>   ref = {
>     counter = 0
>   },
>
> 5. It looks like the block layer or the underlying (scsi/virtio-scsi) layer may
> have some issue which leads to the IO request from the md layer staying in a
> partially complete state. I can't see how this can be related to
> commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING
> in raid5d"")
>
>
> Dan,
>
> Are you able to reproduce using some regular scsi disk, would like to
> rule out whether this is related with virtio-scsi?
>
> And I see the kernel version is 6.8.0-rc5 from vmcore, is this the
> official mainline v6.8-rc5 without any other patches?
>
>
> Thanks,
>
> Junxiao.
>
> On 2/23/24 6:13 PM, Song Liu wrote:
>> Hi,
>>
>> On Fri, Feb 23, 2024 at 12:07 AM Linux regression tracking (Thorsten
>> Leemhuis) <[email protected]> wrote:
>>> On 21.02.24 00:06, Dan Moulding wrote:
>>>> Just a friendly reminder that this regression still exists on the
>>>> mainline. It has been reverted in 6.7 stable. But I upgraded a
>>>> development system to 6.8-rc5 today and immediately hit this issue
>>>> again. Then I saw that it hasn't yet been reverted in Linus' tree.
>>> Song Liu, what's the status here? I aware that you fixed with quite a
>>> few regressions recently, but it seems like resolving this one is
>>> stalled. Or were you able to reproduce the issue or make some progress
>>> and I just missed it?
>> Sorry for the delay with this issue. I have been occupied with some
>> other stuff this week.
>>
>> I haven't got luck to reproduce this issue. I will spend more time
>> looking
>> into it next week.
>>
>>> And if not, what's the way forward here wrt to the release of 6.8?
>>> Revert the culprit and try again later? Or is that not an option for
>>> one
>>> reason or another?
>> If we don't make progress with it in the next week, we will do the
>> revert,
>> same as we did with stable kernels.
>>
>>> Or do we assume that this is not a real issue? That it's caused by some
>>> oddity (bit-flip in the metadata or something like that?) only to be
>>> found in Dan's setup?
>> I don't think this is because of oddities. Hopefully we can get more
>> information about this soon.
>>
>> Thanks,
>> Song
>>
>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker'
>>> hat)
>>> --
>>> Everything you wanna know about Linux kernel regression tracking:
>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>> If I did something stupid, please tell me, as explained on that page.
>>>
>>> #regzbot poke
>>>

2024-03-10 05:13:34

by Dan Moulding

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

> Dan, can you try the following patch?
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index de771093b526..474462abfbdc 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1183,6 +1183,7 @@ void __blk_flush_plug(struct blk_plug *plug, bool from_schedule)
>         if (unlikely(!rq_list_empty(plug->cached_rq)))
>                 blk_mq_free_plug_rqs(plug);
>  }
> +EXPORT_SYMBOL(__blk_flush_plug);
>
>  /**
>   * blk_finish_plug - mark the end of a batch of submitted I/O
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 8497880135ee..26e09cdf46a3 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -6773,6 +6773,11 @@ static void raid5d(struct md_thread *thread)
>                         spin_unlock_irq(&conf->device_lock);
>                         md_check_recovery(mddev);
>                         spin_lock_irq(&conf->device_lock);
> +               } else {
> +                       spin_unlock_irq(&conf->device_lock);
> +                       blk_flush_plug(&plug, false);
> +                       cond_resched();
> +                       spin_lock_irq(&conf->device_lock);
>                 }
>         }
>         pr_debug("%d stripes handled\n", handled);

This patch seems to work! I can no longer reproduce the problem after
applying this.

Thanks,

-- Dan

2024-03-11 01:50:44

by Yu Kuai

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Hi,

On 2024/03/09 7:49, [email protected] wrote:
> Here is the root cause for this issue:
>
> Commit 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in
> raid5d") introduced a regression, it got reverted through commit
> bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
> raid5d"). To fix the original issue commit 5e2cf333b7bd was fixing,
> commit d6e035aad6c0 ("md: bypass block throttle for superblock update")
> was created, it avoids md superblock write getting throttled by block
> layer which is good, but md superblock write could be stuck in block
> layer due to block flush as well, and that is what was happening in this
> regression report.
>
> Process "md0_reclaim" got stuck while waiting IO for md superblock write
> done, that IO was marked with REQ_PREFLUSH | REQ_FUA flags, these 3
> steps ( PREFLUSH, DATA and POSTFLUSH ) will be executed before done, the
> hung of this process is because the last step "POSTFLUSH" never done.
> And that was because of  process "md0_raid5" submitted another IO with
> REQ_FUA flag marked just before that step started. To handle that IO,
> blk_insert_flush() will be invoked and hit "REQ_FSEQ_DATA |
> REQ_FSEQ_POSTFLUSH" case where "fq->flush_data_in_flight" will be
> increased. When the IO for md superblock write was to issue "POSTFLUSH"
> step through blk_kick_flush(), it found that "fq->flush_data_in_flight"
> was not zero, so it will skip that step, that is expected, because flush
> will be triggered when "fq->flush_data_in_flight" dropped to zero.
>
> Unfortunately here that inflight data IO from "md0_raid5" will never
> done, because it was added into the blk_plug list of that process, but
> "md0_raid5" run into infinite loop due to "MD_SB_CHANGE_PENDING" which
> made it never had a chance to finish the blk plug until
> "MD_SB_CHANGE_PENDING" was cleared. Process "md0_reclaim" was supposed
> to clear that flag but it was stuck by "md0_raid5", so this is a deadlock.
>
> Looks like the approach in the RFC patch trying to resolve the
> regression of commit 5e2cf333b7bd can help this issue. Once "md0_raid5"
> starts looping due to "MD_SB_CHANGE_PENDING", it should release all its
> staging IO requests to avoid blocking others. Also a cond_reschedule()
> will avoid it run into lockup.
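For readers following the flush accounting in the quoted analysis, the decision it attributes to blk_kick_flush() can be boiled down to a small stand-alone model. This is a simplification for illustration only: the REQ_FSEQ_* values follow the analysis above, but the struct and functions below are invented for the model and are not kernel code.

#include <stdio.h>

#define REQ_FSEQ_PREFLUSH  (1 << 0)
#define REQ_FSEQ_DATA      (1 << 1)
#define REQ_FSEQ_POSTFLUSH (1 << 2)

struct flush_state {
        unsigned int seq;          /* flush steps already completed   */
        int flush_data_in_flight;  /* DATA requests not yet completed */
};

/* Roughly the question described above: may the POSTFLUSH step for
 * this request be issued now? */
static int can_issue_postflush(const struct flush_state *fs)
{
        if (fs->seq & REQ_FSEQ_POSTFLUSH)
                return 0;          /* already issued */
        /* POSTFLUSH is deferred until no DATA request is in flight. */
        return fs->flush_data_in_flight == 0;
}

int main(void)
{
        /* The state seen in the vmcore: seq == 3 (PREFLUSH | DATA done),
         * but one DATA request -- stuck in md0_raid5's blk_plug -- is
         * still accounted as in flight, so POSTFLUSH is never issued. */
        struct flush_state stuck = {
                .seq = REQ_FSEQ_PREFLUSH | REQ_FSEQ_DATA,
                .flush_data_in_flight = 1,
        };

        printf("POSTFLUSH can be issued now: %s\n",
               can_issue_postflush(&stuck) ? "yes" : "no");
        return 0;
}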

The analysis sounds good. However, it seems to me that the behaviour
where raid5d() spins on the CPU waiting for 'MD_SB_CHANGE_PENDING' to be
cleared is not reasonable, because md_check_recovery() must hold
'reconfig_mutex' to clear the flag.

Looking at raid1/raid10, there are two different behaviours that seem to
avoid this problem as well:

1) blk_start_plug() is delayed until all failed IO is handled. This looks
reasonable because, in order to get better performance, IO should be
handled by the submitting thread as much as possible, and meanwhile the
deadlock can't be triggered here.
2) If 'MD_SB_CHANGE_PENDING' is not cleared by md_check_recovery(), skip
the handling of failed IO; when mddev_unlock() is called, the daemon
thread will be woken up again to handle the failed IO.

How about the following patch?

Thanks,
Kuai

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3ad5f3c7f91e..0b2e6060f2c9 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6720,7 +6720,6 @@ static void raid5d(struct md_thread *thread)
 
        md_check_recovery(mddev);
 
-       blk_start_plug(&plug);
        handled = 0;
        spin_lock_irq(&conf->device_lock);
        while (1) {
@@ -6728,6 +6727,14 @@ static void raid5d(struct md_thread *thread)
                int batch_size, released;
                unsigned int offset;
 
+               /*
+                * md_check_recovery() can't clear sb_flags, usually because of
+                * 'reconfig_mutex' can't be grabbed, wait for mddev_unlock() to
+                * wake up raid5d().
+                */
+               if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
+                       goto skip;
+
                released = release_stripe_list(conf, conf->temp_inactive_list);
                if (released)
                        clear_bit(R5_DID_ALLOC, &conf->cache_state);
@@ -6766,8 +6773,8 @@ static void raid5d(struct md_thread *thread)
                        spin_lock_irq(&conf->device_lock);
                }
        }
+skip:
        pr_debug("%d stripes handled\n", handled);
-
        spin_unlock_irq(&conf->device_lock);
        if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state) &&
            mutex_trylock(&conf->cache_size_mutex)) {
@@ -6779,6 +6786,7 @@ static void raid5d(struct md_thread *thread)
                mutex_unlock(&conf->cache_size_mutex);
        }
 
+       blk_start_plug(&plug);
        flush_deferred_bios(conf);
 
        r5l_flush_stripe_to_raid(conf->log);

>
> https://www.spinics.net/lists/raid/msg75338.html
>
> Dan, can you try the following patch?
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index de771093b526..474462abfbdc 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1183,6 +1183,7 @@ void __blk_flush_plug(struct blk_plug *plug, bool
> from_schedule)
>         if (unlikely(!rq_list_empty(plug->cached_rq)))
>                 blk_mq_free_plug_rqs(plug);
>  }
> +EXPORT_SYMBOL(__blk_flush_plug);
>
>  /**
>   * blk_finish_plug - mark the end of a batch of submitted I/O
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 8497880135ee..26e09cdf46a3 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -6773,6 +6773,11 @@ static void raid5d(struct md_thread *thread)
> spin_unlock_irq(&conf->device_lock);
>                         md_check_recovery(mddev);
>                         spin_lock_irq(&conf->device_lock);
> +               } else {
> + spin_unlock_irq(&conf->device_lock);
> +                       blk_flush_plug(&plug, false);
> +                       cond_resched();
> +                       spin_lock_irq(&conf->device_lock);
>                 }
>         }
>         pr_debug("%d stripes handled\n", handled);
>
> Thanks,
>
> Junxiao.
>
> On 3/1/24 12:26 PM, [email protected] wrote:
>> Hi Dan & Song,
>>
>> I have not root cause this yet, but would like share some findings
>> from the vmcore Dan shared. From what i can see, this doesn't look
>> like a md issue, but something wrong with block layer or below.
>>
>> 1. There were multiple process hung by IO over 15mins.
>>
>> crash> ps -m | grep UN
>> [0 00:15:50.424] [UN]  PID: 957      TASK: ffff88810baa0ec0  CPU: 1
>> COMMAND: "jbd2/dm-3-8"
>> [0 00:15:56.151] [UN]  PID: 1835     TASK: ffff888108a28ec0  CPU: 2
>> COMMAND: "dd"
>> [0 00:15:56.187] [UN]  PID: 876      TASK: ffff888108bebb00  CPU: 3
>> COMMAND: "md0_reclaim"
>> [0 00:15:56.185] [UN]  PID: 1914     TASK: ffff8881015e6740  CPU: 1
>> COMMAND: "kworker/1:2"
>> [0 00:15:56.255] [UN]  PID: 403      TASK: ffff888101351d80  CPU: 7
>> COMMAND: "kworker/u21:1"
>>
>> 2. Let pick md0_reclaim to take a look, it is waiting done super_block
>> update. We can see there were two pending superblock write and other
>> pending io for the underling physical disk, which caused these process
>> hung.
>>
>> crash> bt 876
>> PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND: "md0_reclaim"
>>  #0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
>>  #1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
>>  #2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
>>  #3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
>>  #4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
>>  #5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
>>  #6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
>>  #7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
>>  #8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1
>>
>> crash> mddev.pending_writes,disks 0xffff888108335800
>>   pending_writes = {
>>     counter = 2  <<<<<<<<<< 2 active super block write
>>   },
>>   disks = {
>>     next = 0xffff88810ce85a00,
>>     prev = 0xffff88810ce84c00
>>   },
>> crash> list -l md_rdev.same_set -s md_rdev.kobj.name,nr_pending
>> 0xffff88810ce85a00
>> ffff88810ce85a00
>>   kobj.name = 0xffff8881067c1a00 "dev-dm-1",
>>   nr_pending = {
>>     counter = 0
>>   },
>> ffff8881083ace00
>>   kobj.name = 0xffff888100a93280 "dev-sde",
>>   nr_pending = {
>>     counter = 10 <<<<
>>   },
>> ffff8881010ad200
>>   kobj.name = 0xffff8881012721c8 "dev-sdc",
>>   nr_pending = {
>>     counter = 8 <<<<<
>>   },
>> ffff88810ce84c00
>>   kobj.name = 0xffff888100325f08 "dev-sdd",
>>   nr_pending = {
>>     counter = 2 <<<<<
>>   },
>>
>> 3. From block layer, i can find the inflight IO for md superblock
>> write which has been pending 955s which matches with the hung time of
>> "md0_reclaim"
>>
>> crash>
>> request.q,mq_hctx,cmd_flags,rq_flags,start_time_ns,bio,biotail,state,__data_len,flush,end_io
>> ffff888103b4c300
>>   q = 0xffff888103a00d80,
>>   mq_hctx = 0xffff888103c5d200,
>>   cmd_flags = 38913,
>>   rq_flags = 139408,
>>   start_time_ns = 1504179024146,
>>   bio = 0x0,
>>   biotail = 0xffff888120758e40,
>>   state = MQ_RQ_COMPLETE,
>>   __data_len = 0,
>>   flush = {
>>     seq = 3, <<<< REQ_FSEQ_PREFLUSH |  REQ_FSEQ_DATA
>>     saved_end_io = 0x0
>>   },
>>   end_io = 0xffffffff815186e0 <mq_flush_data_end_io>,
>>
>> crash> p tk_core.timekeeper.tkr_mono.base
>> $1 = 2459916243002
>> crash> eval 2459916243002-1504179024146
>> hexadecimal: de86609f28
>>     decimal: 955737218856  <<<<<<< IO pending time is 955s
>>       octal: 15720630117450
>>      binary:
>> 0000000000000000000000001101111010000110011000001001111100101000
>>
>> crash> bio.bi_iter,bi_end_io 0xffff888120758e40
>>   bi_iter = {
>>     bi_sector = 8, <<<< super block offset
>>     bi_size = 0,
>>     bi_idx = 0,
>>     bi_bvec_done = 0
>>   },
>>   bi_end_io = 0xffffffff817dca50 <super_written>,
>> crash> dev -d | grep ffff888103a00d80
>>     8 ffff8881003ab000   sdd        ffff888103a00d80       0 0 0
>>
>> 4. Check above request, even its state is "MQ_RQ_COMPLETE", but it is
>> still pending. That's because each md superblock write was marked with
>> REQ_PREFLUSH | REQ_FUA, so it will be handled in 3 steps: pre_flush,
>> data, and post_flush. Once each step complete, it will be marked in
>> "request.flush.seq", here the value is 3, which is REQ_FSEQ_PREFLUSH
>> |  REQ_FSEQ_DATA, so the last step "post_flush" has not be done.
>> Another wired thing is that blk_flush_queue.flush_data_in_flight is
>> still 1 even "data" step already done.
>>
>> crash> blk_mq_hw_ctx.fq 0xffff888103c5d200
>>   fq = 0xffff88810332e240,
>> crash> blk_flush_queue 0xffff88810332e240
>> struct blk_flush_queue {
>>   mq_flush_lock = {
>>     {
>>       rlock = {
>>         raw_lock = {
>>           {
>>             val = {
>>               counter = 0
>>             },
>>             {
>>               locked = 0 '\000',
>>               pending = 0 '\000'
>>             },
>>             {
>>               locked_pending = 0,
>>               tail = 0
>>             }
>>           }
>>         }
>>       }
>>     }
>>   },
>>   flush_pending_idx = 1,
>>   flush_running_idx = 1,
>>   rq_status = 0 '\000',
>>   flush_pending_since = 4296171408,
>>   flush_queue = {{
>>       next = 0xffff88810332e250,
>>       prev = 0xffff88810332e250
>>     }, {
>>       next = 0xffff888103b4c348, <<<< the request is in this list
>>       prev = 0xffff888103b4c348
>>     }},
>>   flush_data_in_flight = 1,  >>>>>> still 1
>>   flush_rq = 0xffff888103c2e000
>> }
>>
>> crash> list 0xffff888103b4c348
>> ffff888103b4c348
>> ffff88810332e260
>>
>> crash> request.tag,state,ref 0xffff888103c2e000 >>>> flush_rq of hw queue
>>   tag = -1,
>>   state = MQ_RQ_IDLE,
>>   ref = {
>>     counter = 0
>>   },
>>
>> 5. Looks like the block layer or underlying(scsi/virtio-scsi) may have
>> some issue which leading to the io request from md layer stayed in a
>> partial complete statue. I can't see how this can be related with the
>> commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING
>> in raid5d"")
>>
>>
>> Dan,
>>
>> Are you able to reproduce using some regular scsi disk, would like to
>> rule out whether this is related with virtio-scsi?
>>
>> And I see the kernel version is 6.8.0-rc5 from vmcore, is this the
>> official mainline v6.8-rc5 without any other patches?
>>
>>
>> Thanks,
>>
>> Junxiao.
>>
>> On 2/23/24 6:13 PM, Song Liu wrote:
>>> Hi,
>>>
>>> On Fri, Feb 23, 2024 at 12:07 AM Linux regression tracking (Thorsten
>>> Leemhuis) <[email protected]> wrote:
>>>> On 21.02.24 00:06, Dan Moulding wrote:
>>>>> Just a friendly reminder that this regression still exists on the
>>>>> mainline. It has been reverted in 6.7 stable. But I upgraded a
>>>>> development system to 6.8-rc5 today and immediately hit this issue
>>>>> again. Then I saw that it hasn't yet been reverted in Linus' tree.
>>>> Song Liu, what's the status here? I aware that you fixed with quite a
>>>> few regressions recently, but it seems like resolving this one is
>>>> stalled. Or were you able to reproduce the issue or make some progress
>>>> and I just missed it?
>>> Sorry for the delay with this issue. I have been occupied with some
>>> other stuff this week.
>>>
>>> I haven't got luck to reproduce this issue. I will spend more time
>>> looking
>>> into it next week.
>>>
>>>> And if not, what's the way forward here wrt to the release of 6.8?
>>>> Revert the culprit and try again later? Or is that not an option for
>>>> one
>>>> reason or another?
>>> If we don't make progress with it in the next week, we will do the
>>> revert,
>>> same as we did with stable kernels.
>>>
>>>> Or do we assume that this is not a real issue? That it's caused by some
>>>> oddity (bit-flip in the metadata or something like that?) only to be
>>>> found in Dan's setup?
>>> I don't think this is because of oddities. Hopefully we can get more
>>> information about this soon.
>>>
>>> Thanks,
>>> Song
>>>
>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker'
>>>> hat)
>>>> --
>>>> Everything you wanna know about Linux kernel regression tracking:
>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>> If I did something stupid, please tell me, as explained on that page.
>>>>
>>>> #regzbot poke
>>>>
>
> .
>


2024-03-12 22:57:19

by Junxiao Bi

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

On 3/10/24 6:50 PM, Yu Kuai wrote:

> Hi,
>
> 在 2024/03/09 7:49, [email protected] 写道:
>> Here is the root cause for this issue:
>>
>> Commit 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in
>> raid5d") introduced a regression, it got reverted through commit
>> bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
>> raid5d"). To fix the original issue commit 5e2cf333b7bd was fixing,
>> commit d6e035aad6c0 ("md: bypass block throttle for superblock
>> update") was created, it avoids md superblock write getting throttled
>> by block layer which is good, but md superblock write could be stuck
>> in block layer due to block flush as well, and that is what was
>> happening in this regression report.
>>
>> Process "md0_reclaim" got stuck while waiting IO for md superblock
>> write done, that IO was marked with REQ_PREFLUSH | REQ_FUA flags,
>> these 3 steps ( PREFLUSH, DATA and POSTFLUSH ) will be executed
>> before done, the hung of this process is because the last step
>> "POSTFLUSH" never done. And that was because of  process "md0_raid5"
>> submitted another IO with REQ_FUA flag marked just before that step
>> started. To handle that IO, blk_insert_flush() will be invoked and
>> hit "REQ_FSEQ_DATA | REQ_FSEQ_POSTFLUSH" case where
>> "fq->flush_data_in_flight" will be increased. When the IO for md
>> superblock write was to issue "POSTFLUSH" step through
>> blk_kick_flush(), it found that "fq->flush_data_in_flight" was not
>> zero, so it will skip that step, that is expected, because flush will
>> be triggered when "fq->flush_data_in_flight" dropped to zero.
>>
>> Unfortunately here that inflight data IO from "md0_raid5" will never
>> done, because it was added into the blk_plug list of that process,
>> but "md0_raid5" run into infinite loop due to "MD_SB_CHANGE_PENDING"
>> which made it never had a chance to finish the blk plug until
>> "MD_SB_CHANGE_PENDING" was cleared. Process "md0_reclaim" was
>> supposed to clear that flag but it was stuck by "md0_raid5", so this
>> is a deadlock.
>>
>> Looks like the approach in the RFC patch trying to resolve the
>> regression of commit 5e2cf333b7bd can help this issue. Once
>> "md0_raid5" starts looping due to "MD_SB_CHANGE_PENDING", it should
>> release all its staging IO requests to avoid blocking others. Also a
>> cond_reschedule() will avoid it run into lockup.
>
> The analysis sounds good, however, it seems to me that the behaviour
> raid5d() pings the cpu to wait for 'MD_SB_CHANGE_PENDING' to be cleared
> is not reasonable, because md_check_recovery() must hold
> 'reconfig_mutex' to clear the flag.

That's the behavior from before commit 5e2cf333b7bd, which was added in Sep
2022, so this behavior has been in raid5 for many years.


>
> Look at raid1/raid10, there are two different behaviour that seems can
> avoid this problem as well:
>
> 1) blk_start_plug() is delayed until all failed IO is handled. This look
> reasonable because in order to get better performance, IO should be
> handled by submitted thread as much as possible, and meanwhile, the
> deadlock can be triggered here.
> 2) if 'MD_SB_CHANGE_PENDING' is not cleared by md_check_recovery(), skip
> the handling of failed IO, and when mddev_unlock() is called, daemon
> thread will be woken up again to handle failed IO.
>
> How about the following patch?
>
> Thanks,
> Kuai
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 3ad5f3c7f91e..0b2e6060f2c9 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -6720,7 +6720,6 @@ static void raid5d(struct md_thread *thread)
>
>         md_check_recovery(mddev);
>
> -       blk_start_plug(&plug);
>         handled = 0;
>         spin_lock_irq(&conf->device_lock);
>         while (1) {
> @@ -6728,6 +6727,14 @@ static void raid5d(struct md_thread *thread)
>                 int batch_size, released;
>                 unsigned int offset;
>
> +               /*
> +                * md_check_recovery() can't clear sb_flags, usually
> because of
> +                * 'reconfig_mutex' can't be grabbed, wait for
> mddev_unlock() to
> +                * wake up raid5d().
> +                */
> +               if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
> +                       goto skip;
> +
>                 released = release_stripe_list(conf,
> conf->temp_inactive_list);
>                 if (released)
>                         clear_bit(R5_DID_ALLOC, &conf->cache_state);
> @@ -6766,8 +6773,8 @@ static void raid5d(struct md_thread *thread)
>                         spin_lock_irq(&conf->device_lock);
>                 }
>         }
> +skip:
>         pr_debug("%d stripes handled\n", handled);
> -
>         spin_unlock_irq(&conf->device_lock);
>         if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state) &&
>             mutex_trylock(&conf->cache_size_mutex)) {
> @@ -6779,6 +6786,7 @@ static void raid5d(struct md_thread *thread)
>                 mutex_unlock(&conf->cache_size_mutex);
>         }
>
> +       blk_start_plug(&plug);
>         flush_deferred_bios(conf);
>
>         r5l_flush_stripe_to_raid(conf->log);

This patch eliminates the benefit of blk_plug; I think it will not be
good from an IO performance perspective?
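For context on the trade-off: bios submitted between blk_start_plug() and blk_finish_plug() are collected on the submitting task's plug list so the block layer can merge them and dispatch them as a batch. A generic sketch of that pattern follows (not the raid5d code path; the helper name and its arguments are made up for illustration).

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Illustrative only: submit 'nr' already-prepared bios as one plugged
 * batch instead of dispatching them to the device one by one. */
static void submit_bio_batch(struct bio **bios, int nr)
{
        struct blk_plug plug;
        int i;

        blk_start_plug(&plug);          /* start collecting on the plug list */
        for (i = 0; i < nr; i++)
                submit_bio(bios[i]);    /* held back for merging/batching    */
        blk_finish_plug(&plug);         /* dispatch the whole batch          */
}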


Thanks,

Junxiao.

>
>>
>> https://www.spinics.net/lists/raid/msg75338.html
>>
>> Dan, can you try the following patch?
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index de771093b526..474462abfbdc 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -1183,6 +1183,7 @@ void __blk_flush_plug(struct blk_plug *plug,
>> bool from_schedule)
>>          if (unlikely(!rq_list_empty(plug->cached_rq)))
>>                  blk_mq_free_plug_rqs(plug);
>>   }
>> +EXPORT_SYMBOL(__blk_flush_plug);
>>
>>   /**
>>    * blk_finish_plug - mark the end of a batch of submitted I/O
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 8497880135ee..26e09cdf46a3 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -6773,6 +6773,11 @@ static void raid5d(struct md_thread *thread)
>> spin_unlock_irq(&conf->device_lock);
>>                          md_check_recovery(mddev);
>> spin_lock_irq(&conf->device_lock);
>> +               } else {
>> + spin_unlock_irq(&conf->device_lock);
>> +                       blk_flush_plug(&plug, false);
>> +                       cond_resched();
>> + spin_lock_irq(&conf->device_lock);
>>                  }
>>          }
>>          pr_debug("%d stripes handled\n", handled);
>>
>> Thanks,
>>
>> Junxiao.
>>
>> On 3/1/24 12:26 PM, [email protected] wrote:
>>> Hi Dan & Song,
>>>
>>> I have not root cause this yet, but would like share some findings
>>> from the vmcore Dan shared. From what i can see, this doesn't look
>>> like a md issue, but something wrong with block layer or below.
>>>
>>> 1. There were multiple process hung by IO over 15mins.
>>>
>>> crash> ps -m | grep UN
>>> [0 00:15:50.424] [UN]  PID: 957      TASK: ffff88810baa0ec0 CPU: 1
>>> COMMAND: "jbd2/dm-3-8"
>>> [0 00:15:56.151] [UN]  PID: 1835     TASK: ffff888108a28ec0 CPU: 2
>>> COMMAND: "dd"
>>> [0 00:15:56.187] [UN]  PID: 876      TASK: ffff888108bebb00 CPU: 3
>>> COMMAND: "md0_reclaim"
>>> [0 00:15:56.185] [UN]  PID: 1914     TASK: ffff8881015e6740 CPU: 1
>>> COMMAND: "kworker/1:2"
>>> [0 00:15:56.255] [UN]  PID: 403      TASK: ffff888101351d80 CPU: 7
>>> COMMAND: "kworker/u21:1"
>>>
>>> 2. Let pick md0_reclaim to take a look, it is waiting done
>>> super_block update. We can see there were two pending superblock
>>> write and other pending io for the underling physical disk, which
>>> caused these process hung.
>>>
>>> crash> bt 876
>>> PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND: "md0_reclaim"
>>>  #0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
>>>  #1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
>>>  #2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
>>>  #3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
>>>  #4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
>>>  #5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
>>>  #6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
>>>  #7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
>>>  #8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1
>>>
>>> crash> mddev.pending_writes,disks 0xffff888108335800
>>>   pending_writes = {
>>>     counter = 2  <<<<<<<<<< 2 active super block write
>>>   },
>>>   disks = {
>>>     next = 0xffff88810ce85a00,
>>>     prev = 0xffff88810ce84c00
>>>   },
>>> crash> list -l md_rdev.same_set -s md_rdev.kobj.name,nr_pending
>>> 0xffff88810ce85a00
>>> ffff88810ce85a00
>>>   kobj.name = 0xffff8881067c1a00 "dev-dm-1",
>>>   nr_pending = {
>>>     counter = 0
>>>   },
>>> ffff8881083ace00
>>>   kobj.name = 0xffff888100a93280 "dev-sde",
>>>   nr_pending = {
>>>     counter = 10 <<<<
>>>   },
>>> ffff8881010ad200
>>>   kobj.name = 0xffff8881012721c8 "dev-sdc",
>>>   nr_pending = {
>>>     counter = 8 <<<<<
>>>   },
>>> ffff88810ce84c00
>>>   kobj.name = 0xffff888100325f08 "dev-sdd",
>>>   nr_pending = {
>>>     counter = 2 <<<<<
>>>   },
>>>
>>> 3. From block layer, i can find the inflight IO for md superblock
>>> write which has been pending 955s which matches with the hung time
>>> of "md0_reclaim"
>>>
>>> crash>
>>> request.q,mq_hctx,cmd_flags,rq_flags,start_time_ns,bio,biotail,state,__data_len,flush,end_io
>>> ffff888103b4c300
>>>   q = 0xffff888103a00d80,
>>>   mq_hctx = 0xffff888103c5d200,
>>>   cmd_flags = 38913,
>>>   rq_flags = 139408,
>>>   start_time_ns = 1504179024146,
>>>   bio = 0x0,
>>>   biotail = 0xffff888120758e40,
>>>   state = MQ_RQ_COMPLETE,
>>>   __data_len = 0,
>>>   flush = {
>>>     seq = 3, <<<< REQ_FSEQ_PREFLUSH | REQ_FSEQ_DATA
>>>     saved_end_io = 0x0
>>>   },
>>>   end_io = 0xffffffff815186e0 <mq_flush_data_end_io>,
>>>
>>> crash> p tk_core.timekeeper.tkr_mono.base
>>> $1 = 2459916243002
>>> crash> eval 2459916243002-1504179024146
>>> hexadecimal: de86609f28
>>>     decimal: 955737218856  <<<<<<< IO pending time is 955s
>>>       octal: 15720630117450
>>>      binary:
>>> 0000000000000000000000001101111010000110011000001001111100101000
>>>
>>> crash> bio.bi_iter,bi_end_io 0xffff888120758e40
>>>   bi_iter = {
>>>     bi_sector = 8, <<<< super block offset
>>>     bi_size = 0,
>>>     bi_idx = 0,
>>>     bi_bvec_done = 0
>>>   },
>>>   bi_end_io = 0xffffffff817dca50 <super_written>,
>>> crash> dev -d | grep ffff888103a00d80
>>>     8 ffff8881003ab000   sdd        ffff888103a00d80       0 0 0
>>>
>>> 4. Check above request, even its state is "MQ_RQ_COMPLETE", but it
>>> is still pending. That's because each md superblock write was marked
>>> with REQ_PREFLUSH | REQ_FUA, so it will be handled in 3 steps:
>>> pre_flush, data, and post_flush. Once each step complete, it will be
>>> marked in "request.flush.seq", here the value is 3, which is
>>> REQ_FSEQ_PREFLUSH |  REQ_FSEQ_DATA, so the last step "post_flush"
>>> has not be done. Another wired thing is that
>>> blk_flush_queue.flush_data_in_flight is still 1 even "data" step
>>> already done.
>>>
>>> crash> blk_mq_hw_ctx.fq 0xffff888103c5d200
>>>   fq = 0xffff88810332e240,
>>> crash> blk_flush_queue 0xffff88810332e240
>>> struct blk_flush_queue {
>>>   mq_flush_lock = {
>>>     {
>>>       rlock = {
>>>         raw_lock = {
>>>           {
>>>             val = {
>>>               counter = 0
>>>             },
>>>             {
>>>               locked = 0 '\000',
>>>               pending = 0 '\000'
>>>             },
>>>             {
>>>               locked_pending = 0,
>>>               tail = 0
>>>             }
>>>           }
>>>         }
>>>       }
>>>     }
>>>   },
>>>   flush_pending_idx = 1,
>>>   flush_running_idx = 1,
>>>   rq_status = 0 '\000',
>>>   flush_pending_since = 4296171408,
>>>   flush_queue = {{
>>>       next = 0xffff88810332e250,
>>>       prev = 0xffff88810332e250
>>>     }, {
>>>       next = 0xffff888103b4c348, <<<< the request is in this list
>>>       prev = 0xffff888103b4c348
>>>     }},
>>>   flush_data_in_flight = 1,  >>>>>> still 1
>>>   flush_rq = 0xffff888103c2e000
>>> }
>>>
>>> crash> list 0xffff888103b4c348
>>> ffff888103b4c348
>>> ffff88810332e260
>>>
>>> crash> request.tag,state,ref 0xffff888103c2e000 >>>> flush_rq of hw
>>> queue
>>>   tag = -1,
>>>   state = MQ_RQ_IDLE,
>>>   ref = {
>>>     counter = 0
>>>   },
>>>
>>> 5. Looks like the block layer or underlying(scsi/virtio-scsi) may
>>> have some issue which leading to the io request from md layer stayed
>>> in a partial complete statue. I can't see how this can be related
>>> with the commit bed9e27baf52 ("Revert "md/raid5: Wait for
>>> MD_SB_CHANGE_PENDING in raid5d"")
>>>
>>>
>>> Dan,
>>>
>>> Are you able to reproduce using some regular scsi disk, would like
>>> to rule out whether this is related with virtio-scsi?
>>>
>>> And I see the kernel version is 6.8.0-rc5 from vmcore, is this the
>>> official mainline v6.8-rc5 without any other patches?
>>>
>>>
>>> Thanks,
>>>
>>> Junxiao.
>>>
>>> On 2/23/24 6:13 PM, Song Liu wrote:
>>>> Hi,
>>>>
>>>> On Fri, Feb 23, 2024 at 12:07 AM Linux regression tracking (Thorsten
>>>> Leemhuis) <[email protected]> wrote:
>>>>> On 21.02.24 00:06, Dan Moulding wrote:
>>>>>> Just a friendly reminder that this regression still exists on the
>>>>>> mainline. It has been reverted in 6.7 stable. But I upgraded a
>>>>>> development system to 6.8-rc5 today and immediately hit this issue
>>>>>> again. Then I saw that it hasn't yet been reverted in Linus' tree.
>>>>> Song Liu, what's the status here? I aware that you fixed with quite a
>>>>> few regressions recently, but it seems like resolving this one is
>>>>> stalled. Or were you able to reproduce the issue or make some
>>>>> progress
>>>>> and I just missed it?
>>>> Sorry for the delay with this issue. I have been occupied with some
>>>> other stuff this week.
>>>>
>>>> I haven't got luck to reproduce this issue. I will spend more time
>>>> looking
>>>> into it next week.
>>>>
>>>>> And if not, what's the way forward here wrt to the release of 6.8?
>>>>> Revert the culprit and try again later? Or is that not an option
>>>>> for one
>>>>> reason or another?
>>>> If we don't make progress with it in the next week, we will do the
>>>> revert,
>>>> same as we did with stable kernels.
>>>>
>>>>> Or do we assume that this is not a real issue? That it's caused by
>>>>> some
>>>>> oddity (bit-flip in the metadata or something like that?) only to be
>>>>> found in Dan's setup?
>>>> I don't think this is because of oddities. Hopefully we can get more
>>>> information about this soon.
>>>>
>>>> Thanks,
>>>> Song
>>>>
>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression
>>>>> tracker' hat)
>>>>> --
>>>>> Everything you wanna know about Linux kernel regression tracking:
>>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>>> If I did something stupid, please tell me, as explained on that page.
>>>>>
>>>>> #regzbot poke
>>>>>
>>
>> .
>>
>

2024-03-13 01:20:26

by Yu Kuai

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Hi,

On 2024/03/13 6:56, [email protected] wrote:
> On 3/10/24 6:50 PM, Yu Kuai wrote:
>
>> Hi,
>>
>> On 2024/03/09 7:49, [email protected] wrote:
>>> Here is the root cause for this issue:
>>>
>>> Commit 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in
>>> raid5d") introduced a regression, it got reverted through commit
>>> bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
>>> raid5d"). To fix the original issue commit 5e2cf333b7bd was fixing,
>>> commit d6e035aad6c0 ("md: bypass block throttle for superblock
>>> update") was created, it avoids md superblock write getting throttled
>>> by block layer which is good, but md superblock write could be stuck
>>> in block layer due to block flush as well, and that is what was
>>> happening in this regression report.
>>>
>>> Process "md0_reclaim" got stuck while waiting IO for md superblock
>>> write done, that IO was marked with REQ_PREFLUSH | REQ_FUA flags,
>>> these 3 steps ( PREFLUSH, DATA and POSTFLUSH ) will be executed
>>> before done, the hung of this process is because the last step
>>> "POSTFLUSH" never done. And that was because of  process "md0_raid5"
>>> submitted another IO with REQ_FUA flag marked just before that step
>>> started. To handle that IO, blk_insert_flush() will be invoked and
>>> hit "REQ_FSEQ_DATA | REQ_FSEQ_POSTFLUSH" case where
>>> "fq->flush_data_in_flight" will be increased. When the IO for md
>>> superblock write was to issue "POSTFLUSH" step through
>>> blk_kick_flush(), it found that "fq->flush_data_in_flight" was not
>>> zero, so it will skip that step, that is expected, because flush will
>>> be triggered when "fq->flush_data_in_flight" dropped to zero.
>>>
>>> Unfortunately here that inflight data IO from "md0_raid5" will never
>>> done, because it was added into the blk_plug list of that process,
>>> but "md0_raid5" run into infinite loop due to "MD_SB_CHANGE_PENDING"
>>> which made it never had a chance to finish the blk plug until
>>> "MD_SB_CHANGE_PENDING" was cleared. Process "md0_reclaim" was
>>> supposed to clear that flag but it was stuck by "md0_raid5", so this
>>> is a deadlock.
>>>
>>> Looks like the approach in the RFC patch trying to resolve the
>>> regression of commit 5e2cf333b7bd can help this issue. Once
>>> "md0_raid5" starts looping due to "MD_SB_CHANGE_PENDING", it should
>>> release all its staging IO requests to avoid blocking others. Also a
>>> cond_reschedule() will avoid it run into lockup.
>>
>> The analysis sounds good, however, it seems to me that the behaviour
>> raid5d() pings the cpu to wait for 'MD_SB_CHANGE_PENDING' to be cleared
>> is not reasonable, because md_check_recovery() must hold
>> 'reconfig_mutex' to clear the flag.
>
> That's the behavior before commit 5e2cf333b7bd which was added into Sep
> 2022, so this behavior has been with raid5 for many years.
>

Yes, but the fact that it has existed for a long time doesn't mean it's good.
It is really weird to hold a spinlock while waiting for a mutex.
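The cycle being referred to can be modelled in a few lines of stand-alone C. This is only an illustration of the dependency, not kernel code; the function names are placeholders for the roles described earlier in the thread (raid5d spinning on the flag, md0_reclaim needing the plugged IO to finish before it can clear it).

#include <stdbool.h>
#include <stdio.h>

static bool sb_change_pending = true;  /* stands in for MD_SB_CHANGE_PENDING   */
static bool plug_flushed;              /* has the daemon's plugged IO gone out? */

/* Stands in for md0_reclaim: it may only clear the flag once the
 * superblock write completes, which needs the plugged DATA request. */
static void reclaim_step(void)
{
        if (plug_flushed)
                sb_change_pending = false;
}

/* Stands in for one pass of the raid5d loop while the flag is set. */
static void raid5d_step(bool flush_plug_while_waiting)
{
        if (flush_plug_while_waiting)
                plug_flushed = true;   /* what the proposed fixes ensure */
        reclaim_step();                /* the other thread gets to run   */
}

int main(void)
{
        int i;

        /* Current behaviour: spin without ever flushing the plug. */
        for (i = 0; i < 1000 && sb_change_pending; i++)
                raid5d_step(false);
        printf("never flushing the plug: pending=%d after %d passes\n",
               sb_change_pending, i);

        /* Once the plug is flushed, reclaim can finish and clear the flag. */
        raid5d_step(true);
        printf("after flushing the plug: pending=%d\n", sb_change_pending);
        return 0;
}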
>
>>
>> Look at raid1/raid10, there are two different behaviour that seems can
>> avoid this problem as well:
>>
>> 1) blk_start_plug() is delayed until all failed IO is handled. This look
>> reasonable because in order to get better performance, IO should be
>> handled by submitted thread as much as possible, and meanwhile, the
>> deadlock can be triggered here.
>> 2) if 'MD_SB_CHANGE_PENDING' is not cleared by md_check_recovery(), skip
>> the handling of failed IO, and when mddev_unlock() is called, daemon
>> thread will be woken up again to handle failed IO.
>>
>> How about the following patch?
>>
>> Thanks,
>> Kuai
>>
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 3ad5f3c7f91e..0b2e6060f2c9 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -6720,7 +6720,6 @@ static void raid5d(struct md_thread *thread)
>>
>>         md_check_recovery(mddev);
>>
>> -       blk_start_plug(&plug);
>>         handled = 0;
>>         spin_lock_irq(&conf->device_lock);
>>         while (1) {
>> @@ -6728,6 +6727,14 @@ static void raid5d(struct md_thread *thread)
>>                 int batch_size, released;
>>                 unsigned int offset;
>>
>> +               /*
>> +                * md_check_recovery() can't clear sb_flags, usually
>> because of
>> +                * 'reconfig_mutex' can't be grabbed, wait for
>> mddev_unlock() to
>> +                * wake up raid5d().
>> +                */
>> +               if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
>> +                       goto skip;
>> +
>>                 released = release_stripe_list(conf,
>> conf->temp_inactive_list);
>>                 if (released)
>>                         clear_bit(R5_DID_ALLOC, &conf->cache_state);
>> @@ -6766,8 +6773,8 @@ static void raid5d(struct md_thread *thread)
>>                         spin_lock_irq(&conf->device_lock);
>>                 }
>>         }
>> +skip:
>>         pr_debug("%d stripes handled\n", handled);
>> -
>>         spin_unlock_irq(&conf->device_lock);
>>         if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state) &&
>>             mutex_trylock(&conf->cache_size_mutex)) {
>> @@ -6779,6 +6786,7 @@ static void raid5d(struct md_thread *thread)
>>                 mutex_unlock(&conf->cache_size_mutex);
>>         }
>>
>> +       blk_start_plug(&plug);
>>         flush_deferred_bios(conf);
>>
>>         r5l_flush_stripe_to_raid(conf->log);
>
> This patch eliminated the benefit of blk_plug, i think it will not be
> good for IO performance perspective?

There is only one daemon thread, so as far as possible IO should not be
handled here. The IO should be handled by the thread that submits it,
and the daemon should be left to handle the cases where IO fails or
can't be submitted at that time.

Thanks,
Kuai

>
>
> Thanks,
>
> Junxiao.
>
>>
>>>
>>> https://www.spinics.net/lists/raid/msg75338.html
>>>
>>> Dan, can you try the following patch?
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index de771093b526..474462abfbdc 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -1183,6 +1183,7 @@ void __blk_flush_plug(struct blk_plug *plug,
>>> bool from_schedule)
>>>          if (unlikely(!rq_list_empty(plug->cached_rq)))
>>>                  blk_mq_free_plug_rqs(plug);
>>>   }
>>> +EXPORT_SYMBOL(__blk_flush_plug);
>>>
>>>   /**
>>>    * blk_finish_plug - mark the end of a batch of submitted I/O
>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>>> index 8497880135ee..26e09cdf46a3 100644
>>> --- a/drivers/md/raid5.c
>>> +++ b/drivers/md/raid5.c
>>> @@ -6773,6 +6773,11 @@ static void raid5d(struct md_thread *thread)
>>> spin_unlock_irq(&conf->device_lock);
>>>                          md_check_recovery(mddev);
>>> spin_lock_irq(&conf->device_lock);
>>> +               } else {
>>> + spin_unlock_irq(&conf->device_lock);
>>> +                       blk_flush_plug(&plug, false);
>>> +                       cond_resched();
>>> + spin_lock_irq(&conf->device_lock);
>>>                  }
>>>          }
>>>          pr_debug("%d stripes handled\n", handled);
>>>
>>> Thanks,
>>>
>>> Junxiao.
>>>
>>> On 3/1/24 12:26 PM, [email protected] wrote:
>>>> Hi Dan & Song,
>>>>
>>>> I have not root cause this yet, but would like share some findings
>>>> from the vmcore Dan shared. From what i can see, this doesn't look
>>>> like a md issue, but something wrong with block layer or below.
>>>>
>>>> 1. There were multiple process hung by IO over 15mins.
>>>>
>>>> crash> ps -m | grep UN
>>>> [0 00:15:50.424] [UN]  PID: 957      TASK: ffff88810baa0ec0 CPU: 1
>>>> COMMAND: "jbd2/dm-3-8"
>>>> [0 00:15:56.151] [UN]  PID: 1835     TASK: ffff888108a28ec0 CPU: 2
>>>> COMMAND: "dd"
>>>> [0 00:15:56.187] [UN]  PID: 876      TASK: ffff888108bebb00 CPU: 3
>>>> COMMAND: "md0_reclaim"
>>>> [0 00:15:56.185] [UN]  PID: 1914     TASK: ffff8881015e6740 CPU: 1
>>>> COMMAND: "kworker/1:2"
>>>> [0 00:15:56.255] [UN]  PID: 403      TASK: ffff888101351d80 CPU: 7
>>>> COMMAND: "kworker/u21:1"
>>>>
>>>> 2. Let pick md0_reclaim to take a look, it is waiting done
>>>> super_block update. We can see there were two pending superblock
>>>> write and other pending io for the underling physical disk, which
>>>> caused these process hung.
>>>>
>>>> crash> bt 876
>>>> PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND: "md0_reclaim"
>>>>  #0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
>>>>  #1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
>>>>  #2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
>>>>  #3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
>>>>  #4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
>>>>  #5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
>>>>  #6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
>>>>  #7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
>>>>  #8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1
>>>>
>>>> crash> mddev.pending_writes,disks 0xffff888108335800
>>>>   pending_writes = {
>>>>     counter = 2  <<<<<<<<<< 2 active super block write
>>>>   },
>>>>   disks = {
>>>>     next = 0xffff88810ce85a00,
>>>>     prev = 0xffff88810ce84c00
>>>>   },
>>>> crash> list -l md_rdev.same_set -s md_rdev.kobj.name,nr_pending
>>>> 0xffff88810ce85a00
>>>> ffff88810ce85a00
>>>>   kobj.name = 0xffff8881067c1a00 "dev-dm-1",
>>>>   nr_pending = {
>>>>     counter = 0
>>>>   },
>>>> ffff8881083ace00
>>>>   kobj.name = 0xffff888100a93280 "dev-sde",
>>>>   nr_pending = {
>>>>     counter = 10 <<<<
>>>>   },
>>>> ffff8881010ad200
>>>>   kobj.name = 0xffff8881012721c8 "dev-sdc",
>>>>   nr_pending = {
>>>>     counter = 8 <<<<<
>>>>   },
>>>> ffff88810ce84c00
>>>>   kobj.name = 0xffff888100325f08 "dev-sdd",
>>>>   nr_pending = {
>>>>     counter = 2 <<<<<
>>>>   },
>>>>
>>>> 3. From block layer, i can find the inflight IO for md superblock
>>>> write which has been pending 955s which matches with the hung time
>>>> of "md0_reclaim"
>>>>
>>>> crash>
>>>> request.q,mq_hctx,cmd_flags,rq_flags,start_time_ns,bio,biotail,state,__data_len,flush,end_io
>>>> ffff888103b4c300
>>>>   q = 0xffff888103a00d80,
>>>>   mq_hctx = 0xffff888103c5d200,
>>>>   cmd_flags = 38913,
>>>>   rq_flags = 139408,
>>>>   start_time_ns = 1504179024146,
>>>>   bio = 0x0,
>>>>   biotail = 0xffff888120758e40,
>>>>   state = MQ_RQ_COMPLETE,
>>>>   __data_len = 0,
>>>>   flush = {
>>>>     seq = 3, <<<< REQ_FSEQ_PREFLUSH | REQ_FSEQ_DATA
>>>>     saved_end_io = 0x0
>>>>   },
>>>>   end_io = 0xffffffff815186e0 <mq_flush_data_end_io>,
>>>>
>>>> crash> p tk_core.timekeeper.tkr_mono.base
>>>> $1 = 2459916243002
>>>> crash> eval 2459916243002-1504179024146
>>>> hexadecimal: de86609f28
>>>>     decimal: 955737218856  <<<<<<< IO pending time is 955s
>>>>       octal: 15720630117450
>>>>      binary:
>>>> 0000000000000000000000001101111010000110011000001001111100101000
>>>>
>>>> crash> bio.bi_iter,bi_end_io 0xffff888120758e40
>>>>   bi_iter = {
>>>>     bi_sector = 8, <<<< super block offset
>>>>     bi_size = 0,
>>>>     bi_idx = 0,
>>>>     bi_bvec_done = 0
>>>>   },
>>>>   bi_end_io = 0xffffffff817dca50 <super_written>,
>>>> crash> dev -d | grep ffff888103a00d80
>>>>     8 ffff8881003ab000   sdd        ffff888103a00d80       0 0 0
>>>>
>>>> 4. Check above request, even its state is "MQ_RQ_COMPLETE", but it
>>>> is still pending. That's because each md superblock write was marked
>>>> with REQ_PREFLUSH | REQ_FUA, so it will be handled in 3 steps:
>>>> pre_flush, data, and post_flush. Once each step complete, it will be
>>>> marked in "request.flush.seq", here the value is 3, which is
>>>> REQ_FSEQ_PREFLUSH |  REQ_FSEQ_DATA, so the last step "post_flush"
>>>> has not be done. Another wired thing is that
>>>> blk_flush_queue.flush_data_in_flight is still 1 even "data" step
>>>> already done.
>>>>
>>>> crash> blk_mq_hw_ctx.fq 0xffff888103c5d200
>>>>   fq = 0xffff88810332e240,
>>>> crash> blk_flush_queue 0xffff88810332e240
>>>> struct blk_flush_queue {
>>>>   mq_flush_lock = {
>>>>     {
>>>>       rlock = {
>>>>         raw_lock = {
>>>>           {
>>>>             val = {
>>>>               counter = 0
>>>>             },
>>>>             {
>>>>               locked = 0 '\000',
>>>>               pending = 0 '\000'
>>>>             },
>>>>             {
>>>>               locked_pending = 0,
>>>>               tail = 0
>>>>             }
>>>>           }
>>>>         }
>>>>       }
>>>>     }
>>>>   },
>>>>   flush_pending_idx = 1,
>>>>   flush_running_idx = 1,
>>>>   rq_status = 0 '\000',
>>>>   flush_pending_since = 4296171408,
>>>>   flush_queue = {{
>>>>       next = 0xffff88810332e250,
>>>>       prev = 0xffff88810332e250
>>>>     }, {
>>>>       next = 0xffff888103b4c348, <<<< the request is in this list
>>>>       prev = 0xffff888103b4c348
>>>>     }},
>>>>   flush_data_in_flight = 1,  >>>>>> still 1
>>>>   flush_rq = 0xffff888103c2e000
>>>> }
>>>>
>>>> crash> list 0xffff888103b4c348
>>>> ffff888103b4c348
>>>> ffff88810332e260
>>>>
>>>> crash> request.tag,state,ref 0xffff888103c2e000 >>>> flush_rq of hw
>>>> queue
>>>>   tag = -1,
>>>>   state = MQ_RQ_IDLE,
>>>>   ref = {
>>>>     counter = 0
>>>>   },
>>>>
>>>> 5. Looks like the block layer or underlying(scsi/virtio-scsi) may
>>>> have some issue which leading to the io request from md layer stayed
>>>> in a partial complete statue. I can't see how this can be related
>>>> with the commit bed9e27baf52 ("Revert "md/raid5: Wait for
>>>> MD_SB_CHANGE_PENDING in raid5d"")
>>>>
>>>>
>>>> Dan,
>>>>
>>>> Are you able to reproduce using some regular scsi disk, would like
>>>> to rule out whether this is related with virtio-scsi?
>>>>
>>>> And I see the kernel version is 6.8.0-rc5 from vmcore, is this the
>>>> official mainline v6.8-rc5 without any other patches?
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Junxiao.
>>>>
>>>> On 2/23/24 6:13 PM, Song Liu wrote:
>>>>> Hi,
>>>>>
>>>>> On Fri, Feb 23, 2024 at 12:07 AM Linux regression tracking (Thorsten
>>>>> Leemhuis) <[email protected]> wrote:
>>>>>> On 21.02.24 00:06, Dan Moulding wrote:
>>>>>>> Just a friendly reminder that this regression still exists on the
>>>>>>> mainline. It has been reverted in 6.7 stable. But I upgraded a
>>>>>>> development system to 6.8-rc5 today and immediately hit this issue
>>>>>>> again. Then I saw that it hasn't yet been reverted in Linus' tree.
>>>>>> Song Liu, what's the status here? I aware that you fixed with quite a
>>>>>> few regressions recently, but it seems like resolving this one is
>>>>>> stalled. Or were you able to reproduce the issue or make some
>>>>>> progress
>>>>>> and I just missed it?
>>>>> Sorry for the delay with this issue. I have been occupied with some
>>>>> other stuff this week.
>>>>>
>>>>> I haven't got luck to reproduce this issue. I will spend more time
>>>>> looking
>>>>> into it next week.
>>>>>
>>>>>> And if not, what's the way forward here wrt to the release of 6.8?
>>>>>> Revert the culprit and try again later? Or is that not an option
>>>>>> for one
>>>>>> reason or another?
>>>>> If we don't make progress with it in the next week, we will do the
>>>>> revert,
>>>>> same as we did with stable kernels.
>>>>>
>>>>>> Or do we assume that this is not a real issue? That it's caused by
>>>>>> some
>>>>>> oddity (bit-flip in the metadata or something like that?) only to be
>>>>>> found in Dan's setup?
>>>>> I don't think this is because of oddities. Hopefully we can get more
>>>>> information about this soon.
>>>>>
>>>>> Thanks,
>>>>> Song
>>>>>
>>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression
>>>>>> tracker' hat)
>>>>>> --
>>>>>> Everything you wanna know about Linux kernel regression tracking:
>>>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>>>> If I did something stupid, please tell me, as explained on that page.
>>>>>>
>>>>>> #regzbot poke
>>>>>>
>>>
>>> .
>>>
>>
> .
>


2024-03-14 16:12:29

by Dan Moulding

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

> How about the following patch?
>
> Thanks,
> Kuai
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 3ad5f3c7f91e..0b2e6060f2c9 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -6720,7 +6720,6 @@ static void raid5d(struct md_thread *thread)
>
>         md_check_recovery(mddev);
>
> -       blk_start_plug(&plug);
>         handled = 0;
>         spin_lock_irq(&conf->device_lock);
>         while (1) {
> @@ -6728,6 +6727,14 @@ static void raid5d(struct md_thread *thread)
>                 int batch_size, released;
>                 unsigned int offset;
>
> +               /*
> +                * md_check_recovery() can't clear sb_flags, usually because of
> +                * 'reconfig_mutex' can't be grabbed, wait for mddev_unlock() to
> +                * wake up raid5d().
> +                */
> +               if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
> +                       goto skip;
> +
>                 released = release_stripe_list(conf, conf->temp_inactive_list);
>                 if (released)
>                         clear_bit(R5_DID_ALLOC, &conf->cache_state);
> @@ -6766,8 +6773,8 @@ static void raid5d(struct md_thread *thread)
>                         spin_lock_irq(&conf->device_lock);
>                 }
>         }
> +skip:
>         pr_debug("%d stripes handled\n", handled);
> -
>         spin_unlock_irq(&conf->device_lock);
>         if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state) &&
>             mutex_trylock(&conf->cache_size_mutex)) {
> @@ -6779,6 +6786,7 @@ static void raid5d(struct md_thread *thread)
>                 mutex_unlock(&conf->cache_size_mutex);
>         }
>
> +       blk_start_plug(&plug);
>         flush_deferred_bios(conf);
>
>         r5l_flush_stripe_to_raid(conf->log);

I can confirm that this patch also works. I'm unable to reproduce the
hang after applying this instead of the first patch provided by
Junxiao. So it looks like both approaches are successful in avoiding the hang.

-- Dan

2024-03-14 18:23:52

by Junxiao Bi

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

On 3/12/24 6:20 PM, Yu Kuai wrote:

> Hi,
>
> On 2024/03/13 6:56, [email protected] wrote:
>> On 3/10/24 6:50 PM, Yu Kuai wrote:
>>
>>> Hi,
>>>
>>> On 2024/03/09 7:49, [email protected] wrote:
>>>> Here is the root cause for this issue:
>>>>
>>>> Commit 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in
>>>> raid5d") introduced a regression, it got reverted through commit
>>>> bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
>>>> raid5d"). To fix the original issue commit 5e2cf333b7bd was fixing,
>>>> commit d6e035aad6c0 ("md: bypass block throttle for superblock
>>>> update") was created, it avoids md superblock write getting
>>>> throttled by block layer which is good, but md superblock write
>>>> could be stuck in block layer due to block flush as well, and that
>>>> is what was happening in this regression report.
>>>>
>>>> Process "md0_reclaim" got stuck while waiting IO for md superblock
>>>> write done, that IO was marked with REQ_PREFLUSH | REQ_FUA flags,
>>>> these 3 steps ( PREFLUSH, DATA and POSTFLUSH ) will be executed
>>>> before done, the hung of this process is because the last step
>>>> "POSTFLUSH" never done. And that was because of  process
>>>> "md0_raid5" submitted another IO with REQ_FUA flag marked just
>>>> before that step started. To handle that IO, blk_insert_flush()
>>>> will be invoked and hit "REQ_FSEQ_DATA | REQ_FSEQ_POSTFLUSH" case
>>>> where "fq->flush_data_in_flight" will be increased. When the IO for
>>>> md superblock write was to issue "POSTFLUSH" step through
>>>> blk_kick_flush(), it found that "fq->flush_data_in_flight" was not
>>>> zero, so it will skip that step, that is expected, because flush
>>>> will be triggered when "fq->flush_data_in_flight" dropped to zero.
>>>>
>>>> Unfortunately here that inflight data IO from "md0_raid5" will
>>>> never done, because it was added into the blk_plug list of that
>>>> process, but "md0_raid5" run into infinite loop due to
>>>> "MD_SB_CHANGE_PENDING" which made it never had a chance to finish
>>>> the blk plug until "MD_SB_CHANGE_PENDING" was cleared. Process
>>>> "md0_reclaim" was supposed to clear that flag but it was stuck by
>>>> "md0_raid5", so this is a deadlock.
>>>>
>>>> Looks like the approach in the RFC patch trying to resolve the
>>>> regression of commit 5e2cf333b7bd can help this issue. Once
>>>> "md0_raid5" starts looping due to "MD_SB_CHANGE_PENDING", it should
>>>> release all its staging IO requests to avoid blocking others. Also
>>>> a cond_reschedule() will avoid it run into lockup.
>>>
>>> The analysis sounds good, however, it seems to me that the behaviour
>>> raid5d() pings the cpu to wait for 'MD_SB_CHANGE_PENDING' to be cleared
>>> is not reasonable, because md_check_recovery() must hold
>>> 'reconfig_mutex' to clear the flag.
>>
>> That's the behavior before commit 5e2cf333b7bd which was added into
>> Sep 2022, so this behavior has been with raid5 for many years.
>>
>
> Yes, it exists for a long time doesn't mean it's good. It is really
> weird to hold spinlock to wait for a mutex.
I am confused about this: where is the code that waits for a mutex while
holding a spinlock? Wouldn't that cause a deadlock?
>>
>>>
>>> Look at raid1/raid10, there are two different behaviour that seems can
>>> avoid this problem as well:
>>>
>>> 1) blk_start_plug() is delayed until all failed IO is handled. This
>>> look
>>> reasonable because in order to get better performance, IO should be
>>> handled by submitted thread as much as possible, and meanwhile, the
>>> deadlock can be triggered here.
>>> 2) if 'MD_SB_CHANGE_PENDING' is not cleared by md_check_recovery(),
>>> skip
>>> the handling of failed IO, and when mddev_unlock() is called, daemon
>>> thread will be woken up again to handle failed IO.
>>>
>>> How about the following patch?
>>>
>>> Thanks,
>>> Kuai
>>>
>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>>> index 3ad5f3c7f91e..0b2e6060f2c9 100644
>>> --- a/drivers/md/raid5.c
>>> +++ b/drivers/md/raid5.c
>>> @@ -6720,7 +6720,6 @@ static void raid5d(struct md_thread *thread)
>>>
>>>         md_check_recovery(mddev);
>>>
>>> -       blk_start_plug(&plug);
>>>         handled = 0;
>>>         spin_lock_irq(&conf->device_lock);
>>>         while (1) {
>>> @@ -6728,6 +6727,14 @@ static void raid5d(struct md_thread *thread)
>>>                 int batch_size, released;
>>>                 unsigned int offset;
>>>
>>> +               /*
>>> +                * md_check_recovery() can't clear sb_flags, usually
>>> because of
>>> +                * 'reconfig_mutex' can't be grabbed, wait for
>>> mddev_unlock() to
>>> +                * wake up raid5d().
>>> +                */
>>> +               if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
>>> +                       goto skip;
>>> +
>>>                 released = release_stripe_list(conf,
>>> conf->temp_inactive_list);
>>>                 if (released)
>>>                         clear_bit(R5_DID_ALLOC, &conf->cache_state);
>>> @@ -6766,8 +6773,8 @@ static void raid5d(struct md_thread *thread)
>>> spin_lock_irq(&conf->device_lock);
>>>                 }
>>>         }
>>> +skip:
>>>         pr_debug("%d stripes handled\n", handled);
>>> -
>>>         spin_unlock_irq(&conf->device_lock);
>>>         if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state) &&
>>>             mutex_trylock(&conf->cache_size_mutex)) {
>>> @@ -6779,6 +6786,7 @@ static void raid5d(struct md_thread *thread)
>>>                 mutex_unlock(&conf->cache_size_mutex);
>>>         }
>>>
>>> +       blk_start_plug(&plug);
>>>         flush_deferred_bios(conf);
>>>
>>>         r5l_flush_stripe_to_raid(conf->log);
>>
>> This patch eliminated the benefit of blk_plug, i think it will not be
>> good for IO performance perspective?
>
> There is only one daemon thread, so IO should not be handled here as
> much as possible. The IO should be handled by the thread that is
> submitting the IO, and let daemon to hanldle the case that IO failed or
> can't be submitted at that time.

I am not sure how much impact dropping the blk_plug will have.

Song, what's your take on this?

Thanks,

Junxiao.

>
> Thanks,
> Kuai
>
>>
>>
>> Thanks,
>>
>> Junxiao.
>>
>>>
>>>>
>>>> https://www.spinics.net/lists/raid/msg75338.html
>>>>
>>>> Dan, can you try the following patch?
>>>>
>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>> index de771093b526..474462abfbdc 100644
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -1183,6 +1183,7 @@ void __blk_flush_plug(struct blk_plug *plug,
>>>> bool from_schedule)
>>>>          if (unlikely(!rq_list_empty(plug->cached_rq)))
>>>>                  blk_mq_free_plug_rqs(plug);
>>>>   }
>>>> +EXPORT_SYMBOL(__blk_flush_plug);
>>>>
>>>>   /**
>>>>    * blk_finish_plug - mark the end of a batch of submitted I/O
>>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>>>> index 8497880135ee..26e09cdf46a3 100644
>>>> --- a/drivers/md/raid5.c
>>>> +++ b/drivers/md/raid5.c
>>>> @@ -6773,6 +6773,11 @@ static void raid5d(struct md_thread *thread)
>>>> spin_unlock_irq(&conf->device_lock);
>>>>                          md_check_recovery(mddev);
>>>> spin_lock_irq(&conf->device_lock);
>>>> +               } else {
>>>> + spin_unlock_irq(&conf->device_lock);
>>>> +                       blk_flush_plug(&plug, false);
>>>> +                       cond_resched();
>>>> + spin_lock_irq(&conf->device_lock);
>>>>                  }
>>>>          }
>>>>          pr_debug("%d stripes handled\n", handled);
>>>>
>>>> Thanks,
>>>>
>>>> Junxiao.
>>>>
>>>> On 3/1/24 12:26 PM, [email protected] wrote:
>>>>> Hi Dan & Song,
>>>>>
>>>>> I have not root cause this yet, but would like share some findings
>>>>> from the vmcore Dan shared. From what i can see, this doesn't look
>>>>> like a md issue, but something wrong with block layer or below.
>>>>>
>>>>> 1. There were multiple process hung by IO over 15mins.
>>>>>
>>>>> crash> ps -m | grep UN
>>>>> [0 00:15:50.424] [UN]  PID: 957      TASK: ffff88810baa0ec0 CPU: 1
>>>>> COMMAND: "jbd2/dm-3-8"
>>>>> [0 00:15:56.151] [UN]  PID: 1835     TASK: ffff888108a28ec0 CPU: 2
>>>>> COMMAND: "dd"
>>>>> [0 00:15:56.187] [UN]  PID: 876      TASK: ffff888108bebb00 CPU: 3
>>>>> COMMAND: "md0_reclaim"
>>>>> [0 00:15:56.185] [UN]  PID: 1914     TASK: ffff8881015e6740 CPU: 1
>>>>> COMMAND: "kworker/1:2"
>>>>> [0 00:15:56.255] [UN]  PID: 403      TASK: ffff888101351d80 CPU: 7
>>>>> COMMAND: "kworker/u21:1"
>>>>>
>>>>> 2. Let pick md0_reclaim to take a look, it is waiting done
>>>>> super_block update. We can see there were two pending superblock
>>>>> write and other pending io for the underling physical disk, which
>>>>> caused these process hung.
>>>>>
>>>>> crash> bt 876
>>>>> PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND:
>>>>> "md0_reclaim"
>>>>>  #0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
>>>>>  #1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
>>>>>  #2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
>>>>>  #3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
>>>>>  #4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
>>>>>  #5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
>>>>>  #6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
>>>>>  #7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
>>>>>  #8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1
>>>>>
>>>>> crash> mddev.pending_writes,disks 0xffff888108335800
>>>>>   pending_writes = {
>>>>>     counter = 2  <<<<<<<<<< 2 active super block write
>>>>>   },
>>>>>   disks = {
>>>>>     next = 0xffff88810ce85a00,
>>>>>     prev = 0xffff88810ce84c00
>>>>>   },
>>>>> crash> list -l md_rdev.same_set -s md_rdev.kobj.name,nr_pending
>>>>> 0xffff88810ce85a00
>>>>> ffff88810ce85a00
>>>>>   kobj.name = 0xffff8881067c1a00 "dev-dm-1",
>>>>>   nr_pending = {
>>>>>     counter = 0
>>>>>   },
>>>>> ffff8881083ace00
>>>>>   kobj.name = 0xffff888100a93280 "dev-sde",
>>>>>   nr_pending = {
>>>>>     counter = 10 <<<<
>>>>>   },
>>>>> ffff8881010ad200
>>>>>   kobj.name = 0xffff8881012721c8 "dev-sdc",
>>>>>   nr_pending = {
>>>>>     counter = 8 <<<<<
>>>>>   },
>>>>> ffff88810ce84c00
>>>>>   kobj.name = 0xffff888100325f08 "dev-sdd",
>>>>>   nr_pending = {
>>>>>     counter = 2 <<<<<
>>>>>   },
>>>>>
>>>>> 3. From block layer, i can find the inflight IO for md superblock
>>>>> write which has been pending 955s which matches with the hung time
>>>>> of "md0_reclaim"
>>>>>
>>>>> crash>
>>>>> request.q,mq_hctx,cmd_flags,rq_flags,start_time_ns,bio,biotail,state,__data_len,flush,end_io
>>>>> ffff888103b4c300
>>>>>   q = 0xffff888103a00d80,
>>>>>   mq_hctx = 0xffff888103c5d200,
>>>>>   cmd_flags = 38913,
>>>>>   rq_flags = 139408,
>>>>>   start_time_ns = 1504179024146,
>>>>>   bio = 0x0,
>>>>>   biotail = 0xffff888120758e40,
>>>>>   state = MQ_RQ_COMPLETE,
>>>>>   __data_len = 0,
>>>>>   flush = {
>>>>>     seq = 3, <<<< REQ_FSEQ_PREFLUSH | REQ_FSEQ_DATA
>>>>>     saved_end_io = 0x0
>>>>>   },
>>>>>   end_io = 0xffffffff815186e0 <mq_flush_data_end_io>,
>>>>>
>>>>> crash> p tk_core.timekeeper.tkr_mono.base
>>>>> $1 = 2459916243002
>>>>> crash> eval 2459916243002-1504179024146
>>>>> hexadecimal: de86609f28
>>>>>     decimal: 955737218856  <<<<<<< IO pending time is 955s
>>>>>       octal: 15720630117450
>>>>>      binary:
>>>>> 0000000000000000000000001101111010000110011000001001111100101000
>>>>>
>>>>> crash> bio.bi_iter,bi_end_io 0xffff888120758e40
>>>>>   bi_iter = {
>>>>>     bi_sector = 8, <<<< super block offset
>>>>>     bi_size = 0,
>>>>>     bi_idx = 0,
>>>>>     bi_bvec_done = 0
>>>>>   },
>>>>>   bi_end_io = 0xffffffff817dca50 <super_written>,
>>>>> crash> dev -d | grep ffff888103a00d80
>>>>>     8 ffff8881003ab000   sdd        ffff888103a00d80 0 0 0
>>>>>
>>>>> 4. Check above request, even its state is "MQ_RQ_COMPLETE", but it
>>>>> is still pending. That's because each md superblock write was
>>>>> marked with REQ_PREFLUSH | REQ_FUA, so it will be handled in 3
>>>>> steps: pre_flush, data, and post_flush. Once each step complete,
>>>>> it will be marked in "request.flush.seq", here the value is 3,
>>>>> which is REQ_FSEQ_PREFLUSH |  REQ_FSEQ_DATA, so the last step
>>>>> "post_flush" has not be done. Another wired thing is that
>>>>> blk_flush_queue.flush_data_in_flight is still 1 even "data" step
>>>>> already done.
>>>>>
>>>>> crash> blk_mq_hw_ctx.fq 0xffff888103c5d200
>>>>>   fq = 0xffff88810332e240,
>>>>> crash> blk_flush_queue 0xffff88810332e240
>>>>> struct blk_flush_queue {
>>>>>   mq_flush_lock = {
>>>>>     {
>>>>>       rlock = {
>>>>>         raw_lock = {
>>>>>           {
>>>>>             val = {
>>>>>               counter = 0
>>>>>             },
>>>>>             {
>>>>>               locked = 0 '\000',
>>>>>               pending = 0 '\000'
>>>>>             },
>>>>>             {
>>>>>               locked_pending = 0,
>>>>>               tail = 0
>>>>>             }
>>>>>           }
>>>>>         }
>>>>>       }
>>>>>     }
>>>>>   },
>>>>>   flush_pending_idx = 1,
>>>>>   flush_running_idx = 1,
>>>>>   rq_status = 0 '\000',
>>>>>   flush_pending_since = 4296171408,
>>>>>   flush_queue = {{
>>>>>       next = 0xffff88810332e250,
>>>>>       prev = 0xffff88810332e250
>>>>>     }, {
>>>>>       next = 0xffff888103b4c348, <<<< the request is in this list
>>>>>       prev = 0xffff888103b4c348
>>>>>     }},
>>>>>   flush_data_in_flight = 1,  >>>>>> still 1
>>>>>   flush_rq = 0xffff888103c2e000
>>>>> }
>>>>>
>>>>> crash> list 0xffff888103b4c348
>>>>> ffff888103b4c348
>>>>> ffff88810332e260
>>>>>
>>>>> crash> request.tag,state,ref 0xffff888103c2e000 >>>> flush_rq of
>>>>> hw queue
>>>>>   tag = -1,
>>>>>   state = MQ_RQ_IDLE,
>>>>>   ref = {
>>>>>     counter = 0
>>>>>   },
>>>>>
>>>>> 5. Looks like the block layer or underlying(scsi/virtio-scsi) may
>>>>> have some issue which leading to the io request from md layer
>>>>> stayed in a partial complete statue. I can't see how this can be
>>>>> related with the commit bed9e27baf52 ("Revert "md/raid5: Wait for
>>>>> MD_SB_CHANGE_PENDING in raid5d"")
>>>>>
>>>>>
>>>>> Dan,
>>>>>
>>>>> Are you able to reproduce using some regular scsi disk, would like
>>>>> to rule out whether this is related with virtio-scsi?
>>>>>
>>>>> And I see the kernel version is 6.8.0-rc5 from vmcore, is this the
>>>>> official mainline v6.8-rc5 without any other patches?
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Junxiao.
>>>>>
>>>>> On 2/23/24 6:13 PM, Song Liu wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On Fri, Feb 23, 2024 at 12:07 AM Linux regression tracking (Thorsten
>>>>>> Leemhuis) <[email protected]> wrote:
>>>>>>> On 21.02.24 00:06, Dan Moulding wrote:
>>>>>>>> Just a friendly reminder that this regression still exists on the
>>>>>>>> mainline. It has been reverted in 6.7 stable. But I upgraded a
>>>>>>>> development system to 6.8-rc5 today and immediately hit this issue
>>>>>>>> again. Then I saw that it hasn't yet been reverted in Linus' tree.
>>>>>>> Song Liu, what's the status here? I aware that you fixed with
>>>>>>> quite a
>>>>>>> few regressions recently, but it seems like resolving this one is
>>>>>>> stalled. Or were you able to reproduce the issue or make some
>>>>>>> progress
>>>>>>> and I just missed it?
>>>>>> Sorry for the delay with this issue. I have been occupied with some
>>>>>> other stuff this week.
>>>>>>
>>>>>> I haven't got luck to reproduce this issue. I will spend more
>>>>>> time looking
>>>>>> into it next week.
>>>>>>
>>>>>>> And if not, what's the way forward here wrt to the release of 6.8?
>>>>>>> Revert the culprit and try again later? Or is that not an option
>>>>>>> for one
>>>>>>> reason or another?
>>>>>> If we don't make progress with it in the next week, we will do
>>>>>> the revert,
>>>>>> same as we did with stable kernels.
>>>>>>
>>>>>>> Or do we assume that this is not a real issue? That it's caused
>>>>>>> by some
>>>>>>> oddity (bit-flip in the metadata or something like that?) only
>>>>>>> to be
>>>>>>> found in Dan's setup?
>>>>>> I don't think this is because of oddities. Hopefully we can get more
>>>>>> information about this soon.
>>>>>>
>>>>>> Thanks,
>>>>>> Song
>>>>>>
>>>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression
>>>>>>> tracker' hat)
>>>>>>> --
>>>>>>> Everything you wanna know about Linux kernel regression tracking:
>>>>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>>>>> If I did something stupid, please tell me, as explained on that
>>>>>>> page.
>>>>>>>
>>>>>>> #regzbot poke
>>>>>>>
>>>>
>>>> .
>>>>
>>>
>> .
>>
>

2024-03-14 22:37:07

by Song Liu

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

On Thu, Mar 14, 2024 at 11:20 AM <[email protected]> wrote:
>
[...]
> >>
> >> This patch eliminated the benefit of blk_plug, i think it will not be
> >> good for IO performance perspective?
> >
> > There is only one daemon thread, so IO should not be handled here as
> > much as possible. The IO should be handled by the thread that is
> > submitting the IO, and let daemon to hanldle the case that IO failed or
> > can't be submitted at that time.

raid5 can have multiple threads calling handle_stripe(); see raid5_do_work().
Only chunk_aligned_read() can be handled in raid5_make_request().

>
> I am not sure how much it will impact regarding drop the blk_plug.
>
> Song, what's your take on this?

I think we need to evaluate the impact of (removing) blk_plug. We had
some performance regressions related to blk_plug a couple years ago.

Thanks,
Song

2024-03-15 01:18:20

by Yu Kuai

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Hi,

On 2024/03/15 0:12, Dan Moulding wrote:
>> How about the following patch?
>>
>> Thanks,
>> Kuai
>>
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 3ad5f3c7f91e..0b2e6060f2c9 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -6720,7 +6720,6 @@ static void raid5d(struct md_thread *thread)
>>
>>         md_check_recovery(mddev);
>>
>> -       blk_start_plug(&plug);
>>         handled = 0;
>>         spin_lock_irq(&conf->device_lock);
>>         while (1) {
>> @@ -6728,6 +6727,14 @@ static void raid5d(struct md_thread *thread)
>>                 int batch_size, released;
>>                 unsigned int offset;
>>
>> +               /*
>> +                * md_check_recovery() can't clear sb_flags, usually because of
>> +                * 'reconfig_mutex' can't be grabbed, wait for mddev_unlock() to
>> +                * wake up raid5d().
>> +                */
>> +               if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
>> +                       goto skip;
>> +
>>                 released = release_stripe_list(conf, conf->temp_inactive_list);
>>                 if (released)
>>                         clear_bit(R5_DID_ALLOC, &conf->cache_state);
>> @@ -6766,8 +6773,8 @@ static void raid5d(struct md_thread *thread)
>>                         spin_lock_irq(&conf->device_lock);
>>                 }
>>         }
>> +skip:
>>         pr_debug("%d stripes handled\n", handled);
>> -
>>         spin_unlock_irq(&conf->device_lock);
>>         if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state) &&
>>             mutex_trylock(&conf->cache_size_mutex)) {
>> @@ -6779,6 +6786,7 @@ static void raid5d(struct md_thread *thread)
>>                 mutex_unlock(&conf->cache_size_mutex);
>>         }
>>
>> +       blk_start_plug(&plug);
>>         flush_deferred_bios(conf);
>>
>>         r5l_flush_stripe_to_raid(conf->log);
>
> I can confirm that this patch also works. I'm unable to reproduce the
> hang after applying this instead of the first patch provided by
> Junxiao. So it looks like both ways are successful in avoiding the hang.
>

Thanks a lot for the testing! Can you also give the following patch a try?
It removes the change to blk_plug, because Dan and Song are worried about
performance degradation, so we need to verify the performance before
considering that patch.

Anyway, I think the following patch can fix this problem as well.

Thanks,
Kuai

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3ad5f3c7f91e..ae8665be9940 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6728,6 +6728,9 @@ static void raid5d(struct md_thread *thread)
                int batch_size, released;
                unsigned int offset;

+               if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
+                       goto skip;
+
                released = release_stripe_list(conf, conf->temp_inactive_list);
                if (released)
                        clear_bit(R5_DID_ALLOC, &conf->cache_state);
@@ -6766,6 +6769,7 @@ static void raid5d(struct md_thread *thread)
                        spin_lock_irq(&conf->device_lock);
                }
        }
+skip:
        pr_debug("%d stripes handled\n", handled);

        spin_unlock_irq(&conf->device_lock);


> -- Dan
> .
>


2024-03-15 01:30:35

by Yu Kuai

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Hi,

On 2024/03/15 2:20, [email protected] wrote:
> On 3/12/24 6:20 PM, Yu Kuai wrote:
>
>> Hi,
>>
>> 在 2024/03/13 6:56, [email protected] 写道:
>>> On 3/10/24 6:50 PM, Yu Kuai wrote:
>>>
>>>> Hi,
>>>>
>>>> 在 2024/03/09 7:49, [email protected] 写道:
>>>>> Here is the root cause for this issue:
>>>>>
>>>>> Commit 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in
>>>>> raid5d") introduced a regression, it got reverted through commit
>>>>> bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
>>>>> raid5d"). To fix the original issue commit 5e2cf333b7bd was fixing,
>>>>> commit d6e035aad6c0 ("md: bypass block throttle for superblock
>>>>> update") was created, it avoids md superblock write getting
>>>>> throttled by block layer which is good, but md superblock write
>>>>> could be stuck in block layer due to block flush as well, and that
>>>>> is what was happening in this regression report.
>>>>>
>>>>> Process "md0_reclaim" got stuck while waiting IO for md superblock
>>>>> write done, that IO was marked with REQ_PREFLUSH | REQ_FUA flags,
>>>>> these 3 steps ( PREFLUSH, DATA and POSTFLUSH ) will be executed
>>>>> before done, the hung of this process is because the last step
>>>>> "POSTFLUSH" never done. And that was because of  process
>>>>> "md0_raid5" submitted another IO with REQ_FUA flag marked just
>>>>> before that step started. To handle that IO, blk_insert_flush()
>>>>> will be invoked and hit "REQ_FSEQ_DATA | REQ_FSEQ_POSTFLUSH" case
>>>>> where "fq->flush_data_in_flight" will be increased. When the IO for
>>>>> md superblock write was to issue "POSTFLUSH" step through
>>>>> blk_kick_flush(), it found that "fq->flush_data_in_flight" was not
>>>>> zero, so it will skip that step, that is expected, because flush
>>>>> will be triggered when "fq->flush_data_in_flight" dropped to zero.
>>>>>
>>>>> Unfortunately here that inflight data IO from "md0_raid5" will
>>>>> never done, because it was added into the blk_plug list of that
>>>>> process, but "md0_raid5" run into infinite loop due to
>>>>> "MD_SB_CHANGE_PENDING" which made it never had a chance to finish
>>>>> the blk plug until "MD_SB_CHANGE_PENDING" was cleared. Process
>>>>> "md0_reclaim" was supposed to clear that flag but it was stuck by
>>>>> "md0_raid5", so this is a deadlock.
>>>>>
>>>>> Looks like the approach in the RFC patch trying to resolve the
>>>>> regression of commit 5e2cf333b7bd can help this issue. Once
>>>>> "md0_raid5" starts looping due to "MD_SB_CHANGE_PENDING", it should
>>>>> release all its staging IO requests to avoid blocking others. Also
>>>>> a cond_reschedule() will avoid it run into lockup.
>>>>
>>>> The analysis sounds good, however, it seems to me that the behaviour
>>>> raid5d() pings the cpu to wait for 'MD_SB_CHANGE_PENDING' to be cleared
>>>> is not reasonable, because md_check_recovery() must hold
>>>> 'reconfig_mutex' to clear the flag.
>>>
>>> That's the behavior before commit 5e2cf333b7bd which was added into
>>> Sep 2022, so this behavior has been with raid5 for many years.
>>>
>>
>> Yes, it exists for a long time doesn't mean it's good. It is really
>> weird to hold spinlock to wait for a mutex.
> I am confused about this, where is the code that waiting mutex while
> holding spinlock, wouldn't that cause a deadlock?

For example, assume that another context is already holding the
'reconfig_mutex', and this can be slow; then in raid5d():

md_check_recovery
  try to lock 'reconfig_mutex' -> failed

while (1)
  hold spin_lock
  try to issue IO -> failed
  release spin_lock
  blk_flush_plug
  hold spin_lock

So, until the other context releases the 'reconfig_mutex' and
md_check_recovery() then updates the super_block, raid5d() will not make
progress; meanwhile it will waste one CPU.
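
(To make that busy-wait concrete, below is a toy userspace model of the
pattern. It is purely illustrative and not kernel code: the names
reconfig_mutex and sb_change_pending only mirror the kernel ones, and the
program models just the spin itself. The "daemon" can clear the pending flag
only when a trylock on the mutex succeeds, so while another thread holds the
mutex it loops and burns a CPU.)

/*
 * Toy userspace model of the busy-wait described above (illustrative
 * only, not kernel code).
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t reconfig_mutex = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool sb_change_pending = true;

static void *daemon_thread(void *arg)
{
        unsigned long wasted = 0;

        (void)arg;
        while (atomic_load(&sb_change_pending)) {
                /* like md_check_recovery(): the flag can only be cleared
                 * if the mutex can be taken without blocking */
                if (pthread_mutex_trylock(&reconfig_mutex) == 0) {
                        atomic_store(&sb_change_pending, false);
                        pthread_mutex_unlock(&reconfig_mutex);
                        break;
                }
                wasted++;       /* otherwise loop again, burning CPU */
        }
        printf("daemon made progress after %lu wasted iterations\n", wasted);
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_mutex_lock(&reconfig_mutex);    /* the "other context" */
        pthread_create(&t, NULL, daemon_thread, NULL);
        sleep(1);                               /* mutex held for a while */
        pthread_mutex_unlock(&reconfig_mutex);
        pthread_join(t, NULL);
        return 0;
}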

Thanks,
Kuai

>>>
>>>>
>>>> Look at raid1/raid10, there are two different behaviour that seems can
>>>> avoid this problem as well:
>>>>
>>>> 1) blk_start_plug() is delayed until all failed IO is handled. This
>>>> look
>>>> reasonable because in order to get better performance, IO should be
>>>> handled by submitted thread as much as possible, and meanwhile, the
>>>> deadlock can be triggered here.
>>>> 2) if 'MD_SB_CHANGE_PENDING' is not cleared by md_check_recovery(),
>>>> skip
>>>> the handling of failed IO, and when mddev_unlock() is called, daemon
>>>> thread will be woken up again to handle failed IO.
>>>>
>>>> How about the following patch?
>>>>
>>>> Thanks,
>>>> Kuai
>>>>
>>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>>>> index 3ad5f3c7f91e..0b2e6060f2c9 100644
>>>> --- a/drivers/md/raid5.c
>>>> +++ b/drivers/md/raid5.c
>>>> @@ -6720,7 +6720,6 @@ static void raid5d(struct md_thread *thread)
>>>>
>>>>         md_check_recovery(mddev);
>>>>
>>>> -       blk_start_plug(&plug);
>>>>         handled = 0;
>>>>         spin_lock_irq(&conf->device_lock);
>>>>         while (1) {
>>>> @@ -6728,6 +6727,14 @@ static void raid5d(struct md_thread *thread)
>>>>                 int batch_size, released;
>>>>                 unsigned int offset;
>>>>
>>>> +               /*
>>>> +                * md_check_recovery() can't clear sb_flags, usually
>>>> because of
>>>> +                * 'reconfig_mutex' can't be grabbed, wait for
>>>> mddev_unlock() to
>>>> +                * wake up raid5d().
>>>> +                */
>>>> +               if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
>>>> +                       goto skip;
>>>> +
>>>>                 released = release_stripe_list(conf,
>>>> conf->temp_inactive_list);
>>>>                 if (released)
>>>>                         clear_bit(R5_DID_ALLOC, &conf->cache_state);
>>>> @@ -6766,8 +6773,8 @@ static void raid5d(struct md_thread *thread)
>>>> spin_lock_irq(&conf->device_lock);
>>>>                 }
>>>>         }
>>>> +skip:
>>>>         pr_debug("%d stripes handled\n", handled);
>>>> -
>>>>         spin_unlock_irq(&conf->device_lock);
>>>>         if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state) &&
>>>>             mutex_trylock(&conf->cache_size_mutex)) {
>>>> @@ -6779,6 +6786,7 @@ static void raid5d(struct md_thread *thread)
>>>>                 mutex_unlock(&conf->cache_size_mutex);
>>>>         }
>>>>
>>>> +       blk_start_plug(&plug);
>>>>         flush_deferred_bios(conf);
>>>>
>>>>         r5l_flush_stripe_to_raid(conf->log);
>>>
>>> This patch eliminated the benefit of blk_plug, i think it will not be
>>> good for IO performance perspective?
>>
>> There is only one daemon thread, so IO should not be handled here as
>> much as possible. The IO should be handled by the thread that is
>> submitting the IO, and let daemon to hanldle the case that IO failed or
>> can't be submitted at that time.
>
> I am not sure how much it will impact regarding drop the blk_plug.
>
> Song, what's your take on this?
>
> Thanks,
>
> Junxiao.
>
>>
>> Thanks,
>> Kuai
>>
>>>
>>>
>>> Thanks,
>>>
>>> Junxiao.
>>>
>>>>
>>>>>
>>>>> https://www.spinics.net/lists/raid/msg75338.html
>>>>>
>>>>> Dan, can you try the following patch?
>>>>>
>>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>>> index de771093b526..474462abfbdc 100644
>>>>> --- a/block/blk-core.c
>>>>> +++ b/block/blk-core.c
>>>>> @@ -1183,6 +1183,7 @@ void __blk_flush_plug(struct blk_plug *plug,
>>>>> bool from_schedule)
>>>>>          if (unlikely(!rq_list_empty(plug->cached_rq)))
>>>>>                  blk_mq_free_plug_rqs(plug);
>>>>>   }
>>>>> +EXPORT_SYMBOL(__blk_flush_plug);
>>>>>
>>>>>   /**
>>>>>    * blk_finish_plug - mark the end of a batch of submitted I/O
>>>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>>>>> index 8497880135ee..26e09cdf46a3 100644
>>>>> --- a/drivers/md/raid5.c
>>>>> +++ b/drivers/md/raid5.c
>>>>> @@ -6773,6 +6773,11 @@ static void raid5d(struct md_thread *thread)
>>>>> spin_unlock_irq(&conf->device_lock);
>>>>>                          md_check_recovery(mddev);
>>>>> spin_lock_irq(&conf->device_lock);
>>>>> +               } else {
>>>>> + spin_unlock_irq(&conf->device_lock);
>>>>> +                       blk_flush_plug(&plug, false);
>>>>> +                       cond_resched();
>>>>> + spin_lock_irq(&conf->device_lock);
>>>>>                  }
>>>>>          }
>>>>>          pr_debug("%d stripes handled\n", handled);
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Junxiao.
>>>>>
>>>>> On 3/1/24 12:26 PM, [email protected] wrote:
>>>>>> Hi Dan & Song,
>>>>>>
>>>>>> I have not root cause this yet, but would like share some findings
>>>>>> from the vmcore Dan shared. From what i can see, this doesn't look
>>>>>> like a md issue, but something wrong with block layer or below.
>>>>>>
>>>>>> 1. There were multiple process hung by IO over 15mins.
>>>>>>
>>>>>> crash> ps -m | grep UN
>>>>>> [0 00:15:50.424] [UN]  PID: 957      TASK: ffff88810baa0ec0 CPU: 1
>>>>>> COMMAND: "jbd2/dm-3-8"
>>>>>> [0 00:15:56.151] [UN]  PID: 1835     TASK: ffff888108a28ec0 CPU: 2
>>>>>> COMMAND: "dd"
>>>>>> [0 00:15:56.187] [UN]  PID: 876      TASK: ffff888108bebb00 CPU: 3
>>>>>> COMMAND: "md0_reclaim"
>>>>>> [0 00:15:56.185] [UN]  PID: 1914     TASK: ffff8881015e6740 CPU: 1
>>>>>> COMMAND: "kworker/1:2"
>>>>>> [0 00:15:56.255] [UN]  PID: 403      TASK: ffff888101351d80 CPU: 7
>>>>>> COMMAND: "kworker/u21:1"
>>>>>>
>>>>>> 2. Let pick md0_reclaim to take a look, it is waiting done
>>>>>> super_block update. We can see there were two pending superblock
>>>>>> write and other pending io for the underling physical disk, which
>>>>>> caused these process hung.
>>>>>>
>>>>>> crash> bt 876
>>>>>> PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND:
>>>>>> "md0_reclaim"
>>>>>>  #0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
>>>>>>  #1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
>>>>>>  #2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
>>>>>>  #3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
>>>>>>  #4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
>>>>>>  #5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
>>>>>>  #6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
>>>>>>  #7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
>>>>>>  #8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1
>>>>>>
>>>>>> crash> mddev.pending_writes,disks 0xffff888108335800
>>>>>>   pending_writes = {
>>>>>>     counter = 2  <<<<<<<<<< 2 active super block write
>>>>>>   },
>>>>>>   disks = {
>>>>>>     next = 0xffff88810ce85a00,
>>>>>>     prev = 0xffff88810ce84c00
>>>>>>   },
>>>>>> crash> list -l md_rdev.same_set -s md_rdev.kobj.name,nr_pending
>>>>>> 0xffff88810ce85a00
>>>>>> ffff88810ce85a00
>>>>>>   kobj.name = 0xffff8881067c1a00 "dev-dm-1",
>>>>>>   nr_pending = {
>>>>>>     counter = 0
>>>>>>   },
>>>>>> ffff8881083ace00
>>>>>>   kobj.name = 0xffff888100a93280 "dev-sde",
>>>>>>   nr_pending = {
>>>>>>     counter = 10 <<<<
>>>>>>   },
>>>>>> ffff8881010ad200
>>>>>>   kobj.name = 0xffff8881012721c8 "dev-sdc",
>>>>>>   nr_pending = {
>>>>>>     counter = 8 <<<<<
>>>>>>   },
>>>>>> ffff88810ce84c00
>>>>>>   kobj.name = 0xffff888100325f08 "dev-sdd",
>>>>>>   nr_pending = {
>>>>>>     counter = 2 <<<<<
>>>>>>   },
>>>>>>
>>>>>> 3. From block layer, i can find the inflight IO for md superblock
>>>>>> write which has been pending 955s which matches with the hung time
>>>>>> of "md0_reclaim"
>>>>>>
>>>>>> crash>
>>>>>> request.q,mq_hctx,cmd_flags,rq_flags,start_time_ns,bio,biotail,state,__data_len,flush,end_io
>>>>>> ffff888103b4c300
>>>>>>   q = 0xffff888103a00d80,
>>>>>>   mq_hctx = 0xffff888103c5d200,
>>>>>>   cmd_flags = 38913,
>>>>>>   rq_flags = 139408,
>>>>>>   start_time_ns = 1504179024146,
>>>>>>   bio = 0x0,
>>>>>>   biotail = 0xffff888120758e40,
>>>>>>   state = MQ_RQ_COMPLETE,
>>>>>>   __data_len = 0,
>>>>>>   flush = {
>>>>>>     seq = 3, <<<< REQ_FSEQ_PREFLUSH | REQ_FSEQ_DATA
>>>>>>     saved_end_io = 0x0
>>>>>>   },
>>>>>>   end_io = 0xffffffff815186e0 <mq_flush_data_end_io>,
>>>>>>
>>>>>> crash> p tk_core.timekeeper.tkr_mono.base
>>>>>> $1 = 2459916243002
>>>>>> crash> eval 2459916243002-1504179024146
>>>>>> hexadecimal: de86609f28
>>>>>>     decimal: 955737218856  <<<<<<< IO pending time is 955s
>>>>>>       octal: 15720630117450
>>>>>>      binary:
>>>>>> 0000000000000000000000001101111010000110011000001001111100101000
>>>>>>
>>>>>> crash> bio.bi_iter,bi_end_io 0xffff888120758e40
>>>>>>   bi_iter = {
>>>>>>     bi_sector = 8, <<<< super block offset
>>>>>>     bi_size = 0,
>>>>>>     bi_idx = 0,
>>>>>>     bi_bvec_done = 0
>>>>>>   },
>>>>>>   bi_end_io = 0xffffffff817dca50 <super_written>,
>>>>>> crash> dev -d | grep ffff888103a00d80
>>>>>>     8 ffff8881003ab000   sdd        ffff888103a00d80 0 0 0
>>>>>>
>>>>>> 4. Check above request, even its state is "MQ_RQ_COMPLETE", but it
>>>>>> is still pending. That's because each md superblock write was
>>>>>> marked with REQ_PREFLUSH | REQ_FUA, so it will be handled in 3
>>>>>> steps: pre_flush, data, and post_flush. Once each step complete,
>>>>>> it will be marked in "request.flush.seq", here the value is 3,
>>>>>> which is REQ_FSEQ_PREFLUSH |  REQ_FSEQ_DATA, so the last step
>>>>>> "post_flush" has not be done. Another wired thing is that
>>>>>> blk_flush_queue.flush_data_in_flight is still 1 even "data" step
>>>>>> already done.
>>>>>>
>>>>>> crash> blk_mq_hw_ctx.fq 0xffff888103c5d200
>>>>>>   fq = 0xffff88810332e240,
>>>>>> crash> blk_flush_queue 0xffff88810332e240
>>>>>> struct blk_flush_queue {
>>>>>>   mq_flush_lock = {
>>>>>>     {
>>>>>>       rlock = {
>>>>>>         raw_lock = {
>>>>>>           {
>>>>>>             val = {
>>>>>>               counter = 0
>>>>>>             },
>>>>>>             {
>>>>>>               locked = 0 '\000',
>>>>>>               pending = 0 '\000'
>>>>>>             },
>>>>>>             {
>>>>>>               locked_pending = 0,
>>>>>>               tail = 0
>>>>>>             }
>>>>>>           }
>>>>>>         }
>>>>>>       }
>>>>>>     }
>>>>>>   },
>>>>>>   flush_pending_idx = 1,
>>>>>>   flush_running_idx = 1,
>>>>>>   rq_status = 0 '\000',
>>>>>>   flush_pending_since = 4296171408,
>>>>>>   flush_queue = {{
>>>>>>       next = 0xffff88810332e250,
>>>>>>       prev = 0xffff88810332e250
>>>>>>     }, {
>>>>>>       next = 0xffff888103b4c348, <<<< the request is in this list
>>>>>>       prev = 0xffff888103b4c348
>>>>>>     }},
>>>>>>   flush_data_in_flight = 1,  >>>>>> still 1
>>>>>>   flush_rq = 0xffff888103c2e000
>>>>>> }
>>>>>>
>>>>>> crash> list 0xffff888103b4c348
>>>>>> ffff888103b4c348
>>>>>> ffff88810332e260
>>>>>>
>>>>>> crash> request.tag,state,ref 0xffff888103c2e000 >>>> flush_rq of
>>>>>> hw queue
>>>>>>   tag = -1,
>>>>>>   state = MQ_RQ_IDLE,
>>>>>>   ref = {
>>>>>>     counter = 0
>>>>>>   },
>>>>>>
>>>>>> 5. Looks like the block layer or underlying(scsi/virtio-scsi) may
>>>>>> have some issue which leading to the io request from md layer
>>>>>> stayed in a partial complete statue. I can't see how this can be
>>>>>> related with the commit bed9e27baf52 ("Revert "md/raid5: Wait for
>>>>>> MD_SB_CHANGE_PENDING in raid5d"")
>>>>>>
>>>>>>
>>>>>> Dan,
>>>>>>
>>>>>> Are you able to reproduce using some regular scsi disk, would like
>>>>>> to rule out whether this is related with virtio-scsi?
>>>>>>
>>>>>> And I see the kernel version is 6.8.0-rc5 from vmcore, is this the
>>>>>> official mainline v6.8-rc5 without any other patches?
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Junxiao.
>>>>>>
>>>>>> On 2/23/24 6:13 PM, Song Liu wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On Fri, Feb 23, 2024 at 12:07 AM Linux regression tracking (Thorsten
>>>>>>> Leemhuis) <[email protected]> wrote:
>>>>>>>> On 21.02.24 00:06, Dan Moulding wrote:
>>>>>>>>> Just a friendly reminder that this regression still exists on the
>>>>>>>>> mainline. It has been reverted in 6.7 stable. But I upgraded a
>>>>>>>>> development system to 6.8-rc5 today and immediately hit this issue
>>>>>>>>> again. Then I saw that it hasn't yet been reverted in Linus' tree.
>>>>>>>> Song Liu, what's the status here? I aware that you fixed with
>>>>>>>> quite a
>>>>>>>> few regressions recently, but it seems like resolving this one is
>>>>>>>> stalled. Or were you able to reproduce the issue or make some
>>>>>>>> progress
>>>>>>>> and I just missed it?
>>>>>>> Sorry for the delay with this issue. I have been occupied with some
>>>>>>> other stuff this week.
>>>>>>>
>>>>>>> I haven't got luck to reproduce this issue. I will spend more
>>>>>>> time looking
>>>>>>> into it next week.
>>>>>>>
>>>>>>>> And if not, what's the way forward here wrt to the release of 6.8?
>>>>>>>> Revert the culprit and try again later? Or is that not an option
>>>>>>>> for one
>>>>>>>> reason or another?
>>>>>>> If we don't make progress with it in the next week, we will do
>>>>>>> the revert,
>>>>>>> same as we did with stable kernels.
>>>>>>>
>>>>>>>> Or do we assume that this is not a real issue? That it's caused
>>>>>>>> by some
>>>>>>>> oddity (bit-flip in the metadata or something like that?) only
>>>>>>>> to be
>>>>>>>> found in Dan's setup?
>>>>>>> I don't think this is because of oddities. Hopefully we can get more
>>>>>>> information about this soon.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Song
>>>>>>>
>>>>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression
>>>>>>>> tracker' hat)
>>>>>>>> --
>>>>>>>> Everything you wanna know about Linux kernel regression tracking:
>>>>>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>>>>>> If I did something stupid, please tell me, as explained on that
>>>>>>>> page.
>>>>>>>>
>>>>>>>> #regzbot poke
>>>>>>>>
>>>>>
>>>>> .
>>>>>
>>>>
>>> .
>>>
>>
> .
>


2024-03-19 14:16:29

by Dan Moulding

[permalink] [raw]
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

> Thanks a lot for the testing! Can you also give following patch a try?
> It removes the change to blk_plug, because Dan and Song are worried
> about performance degradation, so we need to verify the performance
> before consider that patch.
>
> Anyway, I think following patch can fix this problem as well.
>
> Thanks,
> Kuai
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 3ad5f3c7f91e..ae8665be9940 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -6728,6 +6728,9 @@ static void raid5d(struct md_thread *thread)
>                 int batch_size, released;
>                 unsigned int offset;
>
> +               if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
> +                       goto skip;
> +
>                 released = release_stripe_list(conf, conf->temp_inactive_list);
>                 if (released)
>                         clear_bit(R5_DID_ALLOC, &conf->cache_state);
> @@ -6766,6 +6769,7 @@ static void raid5d(struct md_thread *thread)
>                         spin_lock_irq(&conf->device_lock);
>                 }
>         }
> +skip:
>         pr_debug("%d stripes handled\n", handled);
>
>         spin_unlock_irq(&conf->device_lock);

Yes, this patch also seems to work. I cannot reproduce the problem on
6.8-rc7 or 6.8.1 with just this one applied.

Cheers!

-- Dan