2020-11-28 22:05:03

by Donald Buczek

[permalink] [raw]
Subject: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear Linux mdraid people,

we are using raid6 on several servers. Occasionally we had failures where an mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions.

The last time this happened (in this case with Linux 5.10.0-rc4), I took some data.

The triggering event seems to be the mdcheck cron job trying to pause the ongoing check operation in the morning with

echo idle > /sys/devices/virtual/block/md1/md/sync_action

This doesn't complete. Here's /proc/<pid>/stack of this process:

root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333
UID PID PPID C STIME TTY TIME CMD
root 23333 23331 0 02:00 ? 00:00:00 /bin/bash /usr/bin/mdcheck --continue --duration 06:00
root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat /proc/23333/stack
[<0>] kthread_stop+0x6e/0x150
[<0>] md_unregister_thread+0x3e/0x70
[<0>] md_reap_sync_thread+0x1f/0x1e0
[<0>] action_store+0x141/0x2b0
[<0>] md_attr_store+0x71/0xb0
[<0>] kernfs_fop_write+0x113/0x1a0
[<0>] vfs_write+0xbc/0x250
[<0>] ksys_write+0xa1/0xe0
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
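
For orientation, the write path is roughly the following (simplified from my reading of the 5.10 md.c, not a verbatim copy); the writer ends up in kthread_stop(), which only returns once the sync thread has actually exited:

/* md.c, action_store(), "idle" branch, roughly: */
    if (cmd_match(page, "idle") || cmd_match(page, "frozen")) {
        ...
        if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
            mddev_lock(mddev) == 0) {
            if (mddev->sync_thread) {
                set_bit(MD_RECOVERY_INTR, &mddev->recovery);
                /*
                 * md_reap_sync_thread() -> md_unregister_thread()
                 * -> kthread_stop(): blocks until the mdX_resync
                 * thread has left md_do_sync() and exited. This is
                 * where the writer above is stuck.
                 */
                md_reap_sync_thread(mddev);
            }
            mddev_unlock(mddev);
        }
    }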

Note that md0 had been paused successfully just before.

2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD (/usr/bin/mdcheck --continue --duration "06:00")
2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md0 from 10623180920
2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md: data-check of RAID array md0
2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md1 from 11582849320
2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md: data-check of RAID array md1

2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md: md0: data-check interrupted.
2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at 13351127680
2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md: md1: data-check interrupted.
[ nothing related following.... ]

After that, we see md1_raid6 in a busy loop:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2376 root 20 0 0 0 0 R 100.0 0.0 1387:38 md1_raid6

Also, all processes doing I/O to the md device block.

This is /proc/mdstat:

Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[==================>..] check = 94.0% (7350290348/7813894144) finish=57189.3min speed=135K/sec
bitmap: 0/59 pages [0KB], 65536KB chunk

md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
bitmap: 0/59 pages [0KB], 65536KB chunk

There doesn't seem to be any further progress.

I've taken a function_graph trace of the looping md1_raid6 process: https://owww.molgen.mpg.de/~buczek/2020-11-27_trace.txt (30 MB)

Maybe this helps to get an idea of what might be going on?

Best
Donald

--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433


2020-11-30 02:11:52

by Guoqing Jiang

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition



On 11/28/20 13:25, Donald Buczek wrote:
> Dear Linux mdraid people,
>
> we are using raid6 on several servers. Occasionally we had failures,
> where a mdX_raid6 process seems to go into a busy loop and all I/O to
> the md device blocks. We've seen this on various kernel versions.
>
> The last time this happened (in this case with Linux 5.10.0-rc4), I took
> some data.
>
> The triggering event seems to be the md_check cron job trying to pause
> the ongoing check operation in the morning with
>
>     echo idle > /sys/devices/virtual/block/md1/md/sync_action
>
> This doesn't complete. Here's /proc/stack of this process:
>
>     root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333
>     UID        PID  PPID  C STIME TTY          TIME CMD
>     root     23333 23331  0 02:00 ?        00:00:00 /bin/bash
> /usr/bin/mdcheck --continue --duration 06:00
>     root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat
> /proc/23333/stack
>     [<0>] kthread_stop+0x6e/0x150
>     [<0>] md_unregister_thread+0x3e/0x70
>     [<0>] md_reap_sync_thread+0x1f/0x1e0
>     [<0>] action_store+0x141/0x2b0
>     [<0>] md_attr_store+0x71/0xb0
>     [<0>] kernfs_fop_write+0x113/0x1a0
>     [<0>] vfs_write+0xbc/0x250
>     [<0>] ksys_write+0xa1/0xe0
>     [<0>] do_syscall_64+0x33/0x40
>     [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Note, that md0 has been paused successfully just before.

What is the personality of md0? Is it also raid6?

>
>     2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD
> (/usr/bin/mdcheck --continue --duration "06:00")
>     2020-11-27T02:00:01+01:00 done root: mdcheck continue checking
> /dev/md0 from 10623180920
>     2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md:
> data-check of RAID array md0
>     2020-11-27T02:00:01+01:00 done root: mdcheck continue checking
> /dev/md1 from 11582849320
>     2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md:
> data-check of RAID array md1
>     2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md:
> md0: data-check interrupted.
>     2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at
> 13351127680
>     2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md:
> md1: data-check interrupted.
>     [ nothing related following.... ]
>
> After that, we see md1_raid6 in a busy loop:
>
>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+
> COMMAND
>     2376 root     20   0       0      0      0 R 100.0  0.0   1387:38
> md1_raid6

It seems the reap sync thread was trying to stop md1_raid6 while md1_raid6 was being triggered again and again.
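
For context, both the mdX_raid6 and the mdX_resync kernel threads run md_thread(), which is roughly (simplified, not a verbatim copy):

/* md.c, md_thread(), roughly: */
static int md_thread(void *arg)
{
    struct md_thread *thread = arg;

    while (!kthread_should_stop()) {
        wait_event_interruptible_timeout(thread->wqueue,
            test_bit(THREAD_WAKEUP, &thread->flags) ||
            kthread_should_stop(),
            thread->timeout);
        clear_bit(THREAD_WAKEUP, &thread->flags);
        if (!kthread_should_stop())
            thread->run(thread);    /* raid5d() for mdX_raid6 */
    }
    return 0;
}

So as long as something keeps waking the thread, or raid5d() keeps finding stripes marked for handling, it spins at 100% CPU.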

>
> Also, all processes doing I/O do the md device block.
>
> This is /proc/mdstat:
>
>     Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
> [multipath]
>     md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11]
> sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm
> 2 [16/16] [UUUUUUUUUUUUUUUU]
>           [==================>..]  check = 94.0%
> (7350290348/7813894144) finish=57189.3min speed=135K/sec
>           bitmap: 0/59 pages [0KB], 65536KB chunk
>     md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12]
> sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3]
> sdu[2] sdt[1]
>           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm
> 2 [16/16] [UUUUUUUUUUUUUUUU]
>           bitmap: 0/59 pages [0KB], 65536KB chunk
>

So the RECOVERY_CHECK flag should be set. I'm not sure if this simple change helps, but you may give it a try.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 98bac4f..e2697d0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9300,7 +9300,8 @@ void md_check_recovery(struct mddev *mddev)
 			md_update_sb(mddev, 0);
 
 		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
-		    !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
+		    (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
+		     test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
 			/* resync/recovery still happening */
 			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 			goto unlock;

Thanks,
Guoqing

2020-12-01 09:33:33

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On 30.11.20 at 03:06, Guoqing Jiang wrote:
>
>
> On 11/28/20 13:25, Donald Buczek wrote:
>> Dear Linux mdraid people,
>>
>> we are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions.
>>
>> The last time this happened (in this case with Linux 5.10.0-rc4), I took some data.
>>
>> The triggering event seems to be the md_check cron job trying to pause the ongoing check operation in the morning with
>>
>>      echo idle > /sys/devices/virtual/block/md1/md/sync_action
>>
>> This doesn't complete. Here's /proc/stack of this process:
>>
>>      root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333
>>      UID        PID  PPID  C STIME TTY          TIME CMD
>>      root     23333 23331  0 02:00 ?        00:00:00 /bin/bash /usr/bin/mdcheck --continue --duration 06:00
>>      root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat /proc/23333/stack
>>      [<0>] kthread_stop+0x6e/0x150
>>      [<0>] md_unregister_thread+0x3e/0x70
>>      [<0>] md_reap_sync_thread+0x1f/0x1e0
>>      [<0>] action_store+0x141/0x2b0
>>      [<0>] md_attr_store+0x71/0xb0
>>      [<0>] kernfs_fop_write+0x113/0x1a0
>>      [<0>] vfs_write+0xbc/0x250
>>      [<0>] ksys_write+0xa1/0xe0
>>      [<0>] do_syscall_64+0x33/0x40
>>      [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> Note, that md0 has been paused successfully just before.
>
> What is the personality of md0? Is it also raid6?

Yes.

>
>>
>>      2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD (/usr/bin/mdcheck --continue --duration "06:00")
>>      2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md0 from 10623180920
>>      2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md: data-check of RAID array md0
>>      2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md1 from 11582849320
>>      2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md: data-check of RAID array md1
>>      2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md: md0: data-check interrupted.
>>      2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at 13351127680
>>      2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md: md1: data-check interrupted.
>>      [ nothing related following.... ]
>>
>> After that, we see md1_raid6 in a busy loop:
>>
>>      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>>      2376 root     20   0       0      0      0 R 100.0  0.0   1387:38 md1_raid6
>
> Seems the reap sync thread was trying to stop md1_raid6 while md1_raid6 was triggered again and again.
>
>>
>> Also, all processes doing I/O do the md device block.
>>
>> This is /proc/mdstat:
>>
>>      Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
>>      md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>            [==================>..]  check = 94.0% (7350290348/7813894144) finish=57189.3min speed=135K/sec
>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>      md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>
>
> So the RECOVERY_CHECK flag should be set, not sure if the simple changes
> helps, but you may give it a try.

Thanks. I've copied the original condition block to execute before the modified one and added some logging to it to see if the change actually triggers. I will pause and unpause the check frequently on a busy machine to get this code executed more often.

Best
Donald

>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 98bac4f..e2697d0 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -9300,7 +9300,8 @@ void md_check_recovery(struct mddev *mddev)
>                         md_update_sb(mddev, 0);
>
>                 if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
> -                   !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
> +                   (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
> +                    test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
>                         /* resync/recovery still happening */
>                         clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>                         goto unlock;
>
> Thanks,
> Guoqing

--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2020-12-02 17:31:46

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear Guoqing,

unfortunately the patch didn't fix the problem (unless I messed it up with my logging). This is what I used:

--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9305,6 +9305,14 @@ void md_check_recovery(struct mddev *mddev)
 			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 			goto unlock;
 		}
+		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
+		    (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
+		     test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
+			/* resync/recovery still happening */
+			pr_info("md: XXX BUGFIX applied\n");
+			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+			goto unlock;
+		}
 		if (mddev->sync_thread) {
 			md_reap_sync_thread(mddev);
 			goto unlock;

By pausing and continuing the check four times an hour, I could trigger the problem after about 48 hours. This time, the other device (md0) locked up on `echo idle > /sys/devices/virtual/block/md0/md/sync_action`, while the check of md1 is still ongoing:

Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[=>...................] check = 8.5% (666852112/7813894144) finish=1271.2min speed=93701K/sec
bitmap: 0/59 pages [0KB], 65536KB chunk

md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[>....................] check = 0.2% (19510348/7813894144) finish=253779.6min speed=511K/sec
bitmap: 0/59 pages [0KB], 65536KB chunk

after 1 minute:

Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[=>...................] check = 8.6% (674914560/7813894144) finish=941.1min speed=126418K/sec
bitmap: 0/59 pages [0KB], 65536KB chunk

md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[>....................] check = 0.2% (19510348/7813894144) finish=256805.0min speed=505K/sec
bitmap: 0/59 pages [0KB], 65536KB chunk

A data point I didn't mention in my previous mail is that the mdX_resync thread is not gone when the problem occurs:

buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ ps -Af|fgrep [md
root 134 2 0 Nov30 ? 00:00:00 [md]
root 1321 2 27 Nov30 ? 12:57:14 [md0_raid6]
root 1454 2 26 Nov30 ? 12:37:23 [md1_raid6]
root 5845 2 0 16:20 ? 00:00:30 [md0_resync]
root 5855 2 13 16:20 ? 00:14:11 [md1_resync]
buczek 9880 9072 0 18:05 pts/0 00:00:00 grep -F [md

buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ sudo cat /proc/5845/stack
[<0>] md_bitmap_cond_end_sync+0x12d/0x170
[<0>] raid5_sync_request+0x24b/0x390
[<0>] md_do_sync+0xb41/0x1030
[<0>] md_thread+0x122/0x160
[<0>] kthread+0x118/0x130
[<0>] ret_from_fork+0x1f/0x30

I guess md_bitmap_cond_end_sync+0x12d is the `wait_event(bitmap->mddev->recovery_wait, atomic_read(&bitmap->mddev->recovery_active) == 0);` in md-bitmap.c.
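
For reference, the counterpart that is supposed to wake this up is md_done_sync(), which looks roughly like this (simplified from my reading of md.c, not verbatim). It is called from the stripe completion path (raid5d -> handle_stripe -> ...), so if handle_stripe() never completes the outstanding resync stripes, recovery_active never drops to zero and the mdX_resync thread stays stuck in the wait_event above:

/* md.c, roughly: called when a chunk of resync I/O has completed */
void md_done_sync(struct mddev *mddev, int blocks, int ok)
{
    /* another "blocks" sectors of the resync window are done */
    atomic_sub(blocks, &mddev->recovery_active);
    wake_up(&mddev->recovery_wait);   /* what md_bitmap_cond_end_sync() waits for */
    if (!ok) {
        set_bit(MD_RECOVERY_INTR, &mddev->recovery);
        set_bit(MD_RECOVERY_ERROR, &mddev->recovery);
        md_wakeup_thread(mddev->thread);
        /* md_do_sync() will notice and exit */
    }
}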

Donald


On 01.12.20 10:29, Donald Buczek wrote:
> On 30.11.20 at 03:06, Guoqing Jiang wrote:
>>
>>
>> On 11/28/20 13:25, Donald Buczek wrote:
>>> Dear Linux mdraid people,
>>>
>>> we are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions.
>>>
>>> The last time this happened (in this case with Linux 5.10.0-rc4), I took some data.
>>>
>>> The triggering event seems to be the md_check cron job trying to pause the ongoing check operation in the morning with
>>>
>>>      echo idle > /sys/devices/virtual/block/md1/md/sync_action
>>>
>>> This doesn't complete. Here's /proc/stack of this process:
>>>
>>>      root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333
>>>      UID        PID  PPID  C STIME TTY          TIME CMD
>>>      root     23333 23331  0 02:00 ?        00:00:00 /bin/bash /usr/bin/mdcheck --continue --duration 06:00
>>>      root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat /proc/23333/stack
>>>      [<0>] kthread_stop+0x6e/0x150
>>>      [<0>] md_unregister_thread+0x3e/0x70
>>>      [<0>] md_reap_sync_thread+0x1f/0x1e0
>>>      [<0>] action_store+0x141/0x2b0
>>>      [<0>] md_attr_store+0x71/0xb0
>>>      [<0>] kernfs_fop_write+0x113/0x1a0
>>>      [<0>] vfs_write+0xbc/0x250
>>>      [<0>] ksys_write+0xa1/0xe0
>>>      [<0>] do_syscall_64+0x33/0x40
>>>      [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>
>>> Note, that md0 has been paused successfully just before.
>>
>> What is the personality of md0? Is it also raid6?
>
> Yes.
>
>>
>>>
>>>      2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD (/usr/bin/mdcheck --continue --duration "06:00")
>>>      2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md0 from 10623180920
>>>      2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md: data-check of RAID array md0
>>>      2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md1 from 11582849320
>>>      2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md: data-check of RAID array md1
>>>      2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md: md0: data-check interrupted.
>>>      2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at 13351127680
>>>      2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md: md1: data-check interrupted.
>>>      [ nothing related following.... ]
>>>
>>> After that, we see md1_raid6 in a busy loop:
>>>
>>>      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>>>      2376 root     20   0       0      0      0 R 100.0  0.0   1387:38 md1_raid6
>>
>> Seems the reap sync thread was trying to stop md1_raid6 while md1_raid6 was triggered again and again.
>>
>>>
>>> Also, all processes doing I/O do the md device block.
>>>
>>> This is /proc/mdstat:
>>>
>>>      Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
>>>      md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>>            [==================>..]  check = 94.0% (7350290348/7813894144) finish=57189.3min speed=135K/sec
>>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>>      md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
>>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>>
>>
>> So the RECOVERY_CHECK flag should be set, not sure if the simple changes
>> helps, but you may give it a try.
>
> Thanks. I've copied the original condition block to execute before the modified one and added some logging to it to see, if the change actually triggers. I will pause and unpause the check frequently on a busy machine to get this code executed more often.
>
> Best
>   Donald
>
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 98bac4f..e2697d0 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -9300,7 +9300,8 @@ void md_check_recovery(struct mddev *mddev)
>>                          md_update_sb(mddev, 0);
>>
>>                  if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>> -                   !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
>> +                   (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
>> +                    test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
>>                          /* resync/recovery still happening */
>>                          clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>>                          goto unlock;
>>
>> Thanks,
>> Guoqing
>

--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2020-12-03 01:58:07

by Guoqing Jiang

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi Donald,

On 12/2/20 18:28, Donald Buczek wrote:
> Dear Guoqing,
>
> unfortunately the patch didn't fix the problem (unless I messed it up
> with my logging). This is what I used:
>
>     --- a/drivers/md/md.c
>     +++ b/drivers/md/md.c
>     @@ -9305,6 +9305,14 @@ void md_check_recovery(struct mddev *mddev)
>                             clear_bit(MD_RECOVERY_NEEDED,
> &mddev->recovery);
>                             goto unlock;
>                     }

I think you can add the check of RECOVERY_CHECK in the above part instead of adding a new part.

>     +               if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>     +                   (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
>     +                    test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
>     +                       /* resync/recovery still happening */
>     +                       pr_info("md: XXX BUGFIX applied\n");
>     +                       clear_bit(MD_RECOVERY_NEEDED,
> &mddev->recovery);
>     +                       goto unlock;
>     +               }
>                     if (mddev->sync_thread) {
>                             md_reap_sync_thread(mddev);
>                             goto unlock;


>
> With pausing and continuing the check four times an hour, I could
> trigger the problem after about 48 hours. This time, the other device
> (md0) has locked up on `echo idle >
> /sys/devices/virtual/block/md0/md/sync_action` , while the check of md1
> is still ongoing:

Without the patch, md0 was good while md1 was locked. So the patch
switches the status of the two arrays, a little weird ...

What is the stack of the process? I guess it is the same as the stack of 23333 in your previous mail, but just to confirm.

>
>     Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
> [multipath]
>     md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11]
> sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm
> 2 [16/16] [UUUUUUUUUUUUUUUU]
>           [=>...................]  check =  8.5% (666852112/7813894144)
> finish=1271.2min speed=93701K/sec
>           bitmap: 0/59 pages [0KB], 65536KB chunk
>     md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12]
> sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3]
> sdu[2] sdt[1]
>           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm
> 2 [16/16] [UUUUUUUUUUUUUUUU]
>           [>....................]  check =  0.2% (19510348/7813894144)
> finish=253779.6min speed=511K/sec
>           bitmap: 0/59 pages [0KB], 65536KB chunk
>
> after 1 minute:
>
>     Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
> [multipath]
>     md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11]
> sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm
> 2 [16/16] [UUUUUUUUUUUUUUUU]
>           [=>...................]  check =  8.6% (674914560/7813894144)
> finish=941.1min speed=126418K/sec
>           bitmap: 0/59 pages [0KB], 65536KB chunk
>     md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12]
> sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3]
> sdu[2] sdt[1]
>           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm
> 2 [16/16] [UUUUUUUUUUUUUUUU]
>           [>....................]  check =  0.2% (19510348/7813894144)
> finish=256805.0min speed=505K/sec
>           bitmap: 0/59 pages [0KB], 65536KB chunk
>
> A data point, I didn't mention in my previous mail, is that the
> mdX_resync thread is not gone when the problem occurs:
>
>     buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ ps -Af|fgrep [md
>     root       134     2  0 Nov30 ?        00:00:00 [md]
>     root      1321     2 27 Nov30 ?        12:57:14 [md0_raid6]
>     root      1454     2 26 Nov30 ?        12:37:23 [md1_raid6]
>     root      5845     2  0 16:20 ?        00:00:30 [md0_resync]
>     root      5855     2 13 16:20 ?        00:14:11 [md1_resync]
>     buczek    9880  9072  0 18:05 pts/0    00:00:00 grep -F [md
>     buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ sudo cat
> /proc/5845/stack
>     [<0>] md_bitmap_cond_end_sync+0x12d/0x170
>     [<0>] raid5_sync_request+0x24b/0x390
>     [<0>] md_do_sync+0xb41/0x1030
>     [<0>] md_thread+0x122/0x160
>     [<0>] kthread+0x118/0x130
>     [<0>] ret_from_fork+0x1f/0x30
>
> I guess, md_bitmap_cond_end_sync+0x12d is the
> `wait_event(bitmap->mddev->recovery_wait,atomic_read(&bitmap->mddev->recovery_active)
> == 0);` in md-bitmap.c.
>

Could be. If so, then I think md_done_sync was not triggered by the path md0_raid6 -> ... -> handle_stripe.

I'd suggest comparing the stacks of md0 and md1 to find the difference.

Thanks,
Guoqing

2020-12-03 11:45:22

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear Guoqing,

On 12/3/20 2:55 AM, Guoqing Jiang wrote:
> Hi Donald,
>
> On 12/2/20 18:28, Donald Buczek wrote:
>> Dear Guoqing,
>>
>> unfortunately the patch didn't fix the problem (unless I messed it up with my logging). This is what I used:
>>
>>      --- a/drivers/md/md.c
>>      +++ b/drivers/md/md.c
>>      @@ -9305,6 +9305,14 @@ void md_check_recovery(struct mddev *mddev)
>>                              clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>>                              goto unlock;
>>                      }
>
> I think you can add the check of RECOVERY_CHECK in above part instead of add a new part.

I understand that. I did it this way so that a log line appears only when the changed expression actually makes a difference (the original block, still with its goto, not triggering, while the modified block does trigger), and so that it works even if either of us made some stupid error with the boolean expressions.

>
>>      +               if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>>      +                   (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
>>      +                    test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
>>      +                       /* resync/recovery still happening */
>>      +                       pr_info("md: XXX BUGFIX applied\n");
>>      +                       clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>>      +                       goto unlock;
>>      +               }
>>                      if (mddev->sync_thread) {
>>                              md_reap_sync_thread(mddev);
>>                              goto unlock;
>
>
>>
>> With pausing and continuing the check four times an hour, I could trigger the problem after about 48 hours. This time, the other device (md0) has locked up on `echo idle > /sys/devices/virtual/block/md0/md/sync_action` , while the check of md1 is still ongoing:
>
> Without the patch, md0 was good while md1 was locked. So the patch switches the status of the two arrays, a little weird ...

I think it's just random. There is a slight chance of a deadlock, depending on I/O to the device or other system load, when stopping the resync. In the last experiment, both md0 and md1 were switched from "check" to "idle" about 400 times before one of them blocked.

OTOH (and I don't understand this at all), it looks like md0 hung again, after a reboot to a 5.4 kernel, at the very first "check" to "idle" transition, which seems to contradict the "slight" chance and the "once after 400 attempts".

> What is the stack of the process? I guess it is same as the stack of 23333 in your previous mail, but just to confirm.

Of the md0_raid6 thread? Sorry, I didn't take that data before the reboot this time and concentrated on the non-terminating md0_resync thread.

>>      Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
>>      md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>            [=>...................]  check =  8.5% (666852112/7813894144) finish=1271.2min speed=93701K/sec
>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>      md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>            [>....................]  check =  0.2% (19510348/7813894144) finish=253779.6min speed=511K/sec
>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>
>> after 1 minute:
>>
>>      Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
>>      md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>            [=>...................]  check =  8.6% (674914560/7813894144) finish=941.1min speed=126418K/sec
>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>      md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>            [>....................]  check =  0.2% (19510348/7813894144) finish=256805.0min speed=505K/sec
>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>
>> A data point, I didn't mention in my previous mail, is that the mdX_resync thread is not gone when the problem occurs:
>>
>>      buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ ps -Af|fgrep [md
>>      root       134     2  0 Nov30 ?        00:00:00 [md]
>>      root      1321     2 27 Nov30 ?        12:57:14 [md0_raid6]
>>      root      1454     2 26 Nov30 ?        12:37:23 [md1_raid6]
>>      root      5845     2  0 16:20 ?        00:00:30 [md0_resync]
>>      root      5855     2 13 16:20 ?        00:14:11 [md1_resync]
>>      buczek    9880  9072  0 18:05 pts/0    00:00:00 grep -F [md
>>      buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ sudo cat /proc/5845/stack
>>      [<0>] md_bitmap_cond_end_sync+0x12d/0x170
>>      [<0>] raid5_sync_request+0x24b/0x390
>>      [<0>] md_do_sync+0xb41/0x1030
>>      [<0>] md_thread+0x122/0x160
>>      [<0>] kthread+0x118/0x130
>>      [<0>] ret_from_fork+0x1f/0x30
>>
>> I guess, md_bitmap_cond_end_sync+0x12d is the `wait_event(bitmap->mddev->recovery_wait,atomic_read(&bitmap->mddev->recovery_active) == 0);` in md-bitmap.c.
>>
>
> Could be, if so, then I think md_done_sync was not triggered by the path md0_raid6 -> ... -> handle_stripe.

Ok. Maybe I can find a way to add some logging there.

> I'd suggest to compare the stacks between md0 and md1 to find the difference.

Compare the stacks of md0_raid6 and md1_raid6?

Okay, but when one md is deadlocked, the other is operating just normally: I/O can be done, and you can even start and stop a resync on it. So I'd expect the stack of the operating raid to vary.

Thanks.

Donald

>
> Thanks,
> Guoqing


--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2020-12-21 12:35:30

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear Guoqing,

I now think that this is not an issue in md. I've driven a system into that situation again and have a clear indication that this is a problem of the member block device driver.

With md0 in the described erroneous state (md0_raid6 busy looping, echo idle > .../sync_action blocked, no progress in mdstat) and md1 operating normally:

root:deadbird:/scratch/local/# for f in /sys/devices/virtual/block/md?/md/rd*/block/inflight;do echo $f: $(cat $f);done
/sys/devices/virtual/block/md0/md/rd0/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd1/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd10/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd11/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd12/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd13/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd14/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd15/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd2/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd3/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd4/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd5/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd6/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd7/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd8/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd9/block/inflight: 1 0
/sys/devices/virtual/block/md1/md/rd0/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd1/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd10/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd11/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd12/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd13/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd14/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd15/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd2/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd3/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd4/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd5/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd6/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd7/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd8/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd9/block/inflight: 0 0

Best
Donald

2021-01-19 12:29:13

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear md-raid people,

I've reported a problem in this thread in December:

"We are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions." It was clear, that this is related to "mdcheck" running, because we found, that the command, which should stop the scrubbing in the morning (`echo idle > /sys/devices/virtual/block/md1/md/sync_action`) is also blocked.

On 12/21/20, I reported that the problem might be caused by a failure of the underlying block device driver, because I had found "inflight" counters of the block devices not being zero. However, this is not the case: we were able to run into the mdX_raid6 looping condition a few times again, but the non-zero inflight counters were not observed again.

I was able to collect a lot of additional information from a blocked system.

- The `echo idle > /sys/devices/virtual/block/md1/md/sync_action` command is waiting in kthread_stop() to stop the sync thread. [ https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L7989 ]

- The sync thread ("md1_resync") does not finish, because its blocked at

[<0>] raid5_get_active_stripe+0x4c4/0x660 # [1]
[<0>] raid5_sync_request+0x364/0x390
[<0>] md_do_sync+0xb41/0x1030
[<0>] md_thread+0x122/0x160
[<0>] kthread+0x118/0x130
[<0>] ret_from_fork+0x22/0x30

[1] https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L735

- Yes, gdb confirms that `conf->cache_state` is 0x03 (R5_INACTIVE_BLOCKED + R5_ALLOC_MORE).
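
For reference, the wait at [1] is roughly the following excerpt (simplified, not verbatim). With R5_INACTIVE_BLOCKED set and active_stripes (27534, see below) above 3/4 of max_nr_stripes (29816 in the r5conf dump appended below), the condition cannot become true until stripes are actually completed and released back to the inactive lists:

/* raid5.c, raid5_get_active_stripe(), roughly: no free stripe_head found */
    set_bit(R5_INACTIVE_BLOCKED, &conf->cache_state);
    wait_event_lock_irq(conf->wait_for_stripe,
        !list_empty(conf->inactive_list + hash) &&
        (atomic_read(&conf->active_stripes)
            < (conf->max_nr_stripes * 3 / 4) ||
         !test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)),
        *(conf->hash_locks + hash));
    clear_bit(R5_INACTIVE_BLOCKED, &conf->cache_state);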

- We have lots of active stripes:

root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/stripe_cache_active
27534

- handle_stripe() doesn't make progress:

echo "func handle_stripe =pflt" > /sys/kernel/debug/dynamic_debug/control

In dmesg, we see the debug output from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4925 but never from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4958:

[171908.896651] [1388] handle_stripe:4929: handling stripe 4947089856, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
[171908.896657] [1388] handle_stripe:4929: handling stripe 4947089872, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
[171908.896669] [1388] handle_stripe:4929: handling stripe 4947089912, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0

The sector numbers repeat after some time. We have only the following variants of stripe state and "check":

state=0x2031 cnt=1, check:0 # ACTIVE +INSYNC+REPLACED+IO_STARTED, check_state_idle
state=0x2029 cnt=1, check:4 # ACTIVE+SYNCING +REPLACED+IO_STARTED, check_state_check_result
state=0x2009 cnt=1, check:0 # ACTIVE+SYNCING +IO_STARTED, check_state_idle

- We have MD_SB_CHANGE_PENDING set:

root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/array_state
write-pending

gdb confirms that sb_flags = 6 (MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING)

So it can be assumed that handle_stripe breaks out at https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4939
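
The code there is roughly (simplified, not verbatim):

/* raid5.c, handle_stripe(), roughly: */
    if (s.handle_bad_blocks ||            /* s is the local stripe_head_state */
        test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
        set_bit(STRIPE_HANDLE, &sh->state);
        goto finish;
    }

With MD_SB_CHANGE_PENDING set, the stripe is only marked STRIPE_HANDLE again and re-queued instead of being processed, which would explain both the 100% CPU loop and the lack of progress.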

- The system can manually be freed from the deadlock:

When `echo active > /sys/block/md1/md/array_state` is used, the scrubbing and other I/O continue. Probably because of https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L4520
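
That code is roughly the "active" case of array_state_store() (simplified, not verbatim):

/* md.c, array_state_store(), roughly: */
    case active:
        if (mddev->pers) {
            err = restart_array(mddev);
            if (err)
                break;
            /* drop the pending superblock write and wake the waiters */
            clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
            wake_up(&mddev->sb_wait);
            err = 0;
        }
        break;

Clearing MD_SB_CHANGE_PENDING removes exactly the condition that makes handle_stripe() bail out above, so the stuck stripes can be processed again.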

I, of course, don't fully understand it yet. Any ideas?

I append some data from a hanging raid (mddev, r5conf, and a sample stripe_head from the handle_list with its first disk).

Donald


$L_002: (struct mddev *) 0xffff8898a48e3000 : {
private = (struct r5conf *) 0xffff88988695a800 /* --> $L_004 */
pers = (struct md_personality *) 0xffffffff826ee280 <raid6_personality>
unit = 9437185
md_minor = 1
disks = {next = 0xffff889886ee8c00, prev = 0xffff889891506000}
flags = 0
sb_flags = 6
suspended = 0
active_io = { counter=4 } /* atomic_t */
ro = 0
sysfs_active = 0
gendisk = (struct gendisk *) 0xffff8898bc357800
kobj = {
name = (const char *) 0xffffffff8220cef8 "md"
entry = {next = 0xffff8898a48e3058, prev = 0xffff8898a48e3058} /* empty */
parent = (struct kobject *) 0xffff8898bc357868
kset = 0x0
ktype = (struct kobj_type *) 0xffffffff826ef140 <md_ktype>
sd = (struct kernfs_node *) 0xffff8898b5f65980
kref = {
refcount = {
refs = { counter=17 } /* atomic_t */
}
}
state_initialized = 1
state_in_sysfs = 1
state_add_uevent_sent = 0
state_remove_uevent_sent = 0
uevent_suppress = 0
}
hold_active = 0
major_version = 1
minor_version = 2
patch_version = 0
persistent = 1
external = 0
metadata_type = [
(8)
[0] : 0 '\000'
[1] : 0 '\000'
[2] : 0 '\000'
[3] : 0 '\000'
[4] : 0 '\000'
[5] : 0 '\000'
[6] : 0 '\000'
[7] : 0 '\000'
[8] : 0 '\000'
[9] : 0 '\000'
[10] : 0 '\000'
[11] : 0 '\000'
[12] : 0 '\000'
[13] : 0 '\000'
[14] : 0 '\000'
[15] : 0 '\000'
[16] : 0 '\000'
]
chunk_sectors = 1024
ctime = 1607680540
utime = 1610853337
level = 6
layout = 2
clevel = [
(8)
[0] : 114 'r'
[1] : 97 'a'
[2] : 105 'i'
[3] : 100 'd'
[4] : 54 '6'
[5] : 0 '\000'
[6] : 0 '\000'
[7] : 0 '\000'
[8] : 0 '\000'
[9] : 0 '\000'
[10] : 0 '\000'
[11] : 0 '\000'
[12] : 0 '\000'
[13] : 0 '\000'
[14] : 0 '\000'
[15] : 0 '\000'
]
raid_disks = 16
max_disks = 1920
dev_sectors = 15627788288
array_sectors = 218789036032
external_size = 0
events = 32409
can_decrease_events = 0
uuid = [
(8)
[0] : -49 '\317'
[1] : 69 'E'
[2] : 119 'w'
[3] : 62 '>'
[4] : -113 '\217'
[5] : -107 '\225'
[6] : 9 '\t'
[7] : -90 '\246'
[8] : -2 '\376'
[9] : 84 'T'
[10] : -12 '\364'
[11] : -31 '\341'
[12] : -110 '\222'
[13] : 24 '\030'
[14] : -15 '\361'
[15] : 109 'm'
]
reshape_position = 18446744073709551615
delta_disks = 0
new_level = 6
new_layout = 2
new_chunk_sectors = 1024
reshape_backwards = 0
thread = (struct md_thread *) 0xffff889887bde080
sync_thread = 0x0
last_sync_action = (char *) 0xffffffff8225e246 "check"
curr_resync = 4947256648
curr_resync_completed = 4947016640
resync_mark = 4415850173
resync_mark_cnt = 66402936
curr_mark_cnt = 75262560
resync_max_sectors = 15627788288
resync_mismatches = {
counter = 0
}
suspend_lo = 0
suspend_hi = 0
sync_speed_min = 0
sync_speed_max = 0
parallel_resync = 0
ok_start_degraded = 0
recovery = 203
recovery_disabled = 16
in_sync = 0
open_mutex = {
owner = { counter=0 } /* atomic_t */
wait_lock = {counter=0} /* spinlock_t */
osq = {
tail = { counter=0 } /* atomic_t */
}
wait_list = {next = 0xffff8898a48e31d8, prev = 0xffff8898a48e31d8} /* empty */
}
reconfig_mutex = {
owner = { counter=-131247629823744 } /* atomic_t */
wait_lock = {counter=0} /* spinlock_t */
osq = {
tail = { counter=0 } /* atomic_t */
}
wait_list = {next = 0xffff8898a48e31f8, prev = 0xffff8898a48e31f8} /* empty */
}
active = { counter=2 } /* atomic_t */
openers = { counter=1 } /* atomic_t */
changed = 0
degraded = 0
recovery_active = { counter=181360 } /* atomic_t */
recovery_wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff8898a48e3228, prev = 0xffff8898a48e3228} /* empty */
}
recovery_cp = 18446744073709551615
resync_min = 4871994088
resync_max = 18446744073709551615
sysfs_state = (struct kernfs_node *) 0xffff8898b5f65380
sysfs_action = (struct kernfs_node *) 0xffff889887d87600
sysfs_completed = (struct kernfs_node *) 0xffff8898ac6f4e80
sysfs_degraded = (struct kernfs_node *) 0xffff8898ac6f4b80
sysfs_level = (struct kernfs_node *) 0xffff8898b5f65700
del_work = {
data = { counter=128 } /* atomic_t */
entry = {next = 0xffff8898a48e3280, prev = 0xffff8898a48e3280} /* empty */
func = (work_func_t) 0xffffffff819ad430 <md_start_sync>
}
lock = {counter=0} /* spinlock_t */
sb_wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffffc9000e4a7a68, prev = 0xffffc9000cf97d30}
}
pending_writes = { counter=0 } /* atomic_t */
safemode = 0
safemode_delay = 201
safemode_timer = {
entry = {
next = (struct hlist_node *) 0xdead000000000122
pprev = 0x0
}
expires = 4415857239
function = (void (*)(struct timer_list *)) 0xffffffff819a6a40 <md_safemode_timeout>
flags = 314572820
}
writes_pending = {
percpu_count_ptr = 105896157983096
data = (struct percpu_ref_data *) 0xffff8898b340fc80
}
sync_checkers = 0
queue = (struct request_queue *) 0xffff8898b65847c8
bitmap = (struct bitmap *) 0xffff889887d84400
bitmap_info = {
file = 0x0
offset = 8
space = 0
default_offset = 2
default_space = 6
mutex = {
owner = { counter=0 } /* atomic_t */
wait_lock = {counter=0} /* spinlock_t */
osq = {
tail = { counter=0 } /* atomic_t */
}
wait_list = {next = 0xffff8898a48e3350, prev = 0xffff8898a48e3350} /* empty */
}
chunksize = 67108864
daemon_sleep = 5000
max_write_behind = 0
external = 0
nodes = 0
cluster_name = [
(8)
[0] : 0 '\000'
[1] : 0 '\000'
[2] : 0 '\000'
[3] : 0 '\000'
[4] : 0 '\000'
[5] : 0 '\000'
[6] : 0 '\000'
[7] : 0 '\000'
[8] : 0 '\000'
[9] : 0 '\000'
[10] : 0 '\000'
[11] : 0 '\000'
[12] : 0 '\000'
[13] : 0 '\000'
[14] : 0 '\000'
[15] : 0 '\000'
[16] : 0 '\000'
[17] : 0 '\000'
[18] : 0 '\000'
[19] : 0 '\000'
[20] : 0 '\000'
[21] : 0 '\000'
[22] : 0 '\000'
[23] : 0 '\000'
[24] : 0 '\000'
[25] : 0 '\000'
[26] : 0 '\000'
[27] : 0 '\000'
[28] : 0 '\000'
[29] : 0 '\000'
[30] : 0 '\000'
[31] : 0 '\000'
[32] : 0 '\000'
[33] : 0 '\000'
[34] : 0 '\000'
[35] : 0 '\000'
[36] : 0 '\000'
[37] : 0 '\000'
[38] : 0 '\000'
[39] : 0 '\000'
[40] : 0 '\000'
[41] : 0 '\000'
[42] : 0 '\000'
[43] : 0 '\000'
[44] : 0 '\000'
[45] : 0 '\000'
[46] : 0 '\000'
[47] : 0 '\000'
[48] : 0 '\000'
[49] : 0 '\000'
[50] : 0 '\000'
[51] : 0 '\000'
[52] : 0 '\000'
[53] : 0 '\000'
[54] : 0 '\000'
[55] : 0 '\000'
[56] : 0 '\000'
[57] : 0 '\000'
[58] : 0 '\000'
[59] : 0 '\000'
[60] : 0 '\000'
[61] : 0 '\000'
[62] : 0 '\000'
[63] : 0 '\000'
]
}
max_corr_read_errors = { counter=20 } /* atomic_t */
all_mddevs = {next = 0xffffffff826eef10, prev = 0xffff88810a551bc8}
to_remove = 0x0
bio_set = {
bio_slab = (struct kmem_cache *) 0xffff889886fbdc00
front_pad = 0
bio_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 2
curr_nr = 2
elements = (void **) 0xffff8898b4fae300
pool_data = (void *) 0xffff889886fbdc00
alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab>
free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab>
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff8898a48e3428, prev = 0xffff8898a48e3428} /* empty */
}
}
bvec_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 2
curr_nr = 2
elements = (void **) 0xffff8898b4fae2e0
pool_data = (void *) 0xffff8881078ecb40
alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab>
free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab>
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff8898a48e3470, prev = 0xffff8898a48e3470} /* empty */
}
}
bio_integrity_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 0
curr_nr = 0
elements = 0x0
pool_data = 0x0
alloc = 0x0
free = 0x0
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0x0000000000000000, prev = 0x0000000000000000}
}
}
bvec_integrity_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 0
curr_nr = 0
elements = 0x0
pool_data = 0x0
alloc = 0x0
free = 0x0
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0x0000000000000000, prev = 0x0000000000000000}
}
}
rescue_lock = {counter=0} /* spinlock_t */
rescue_list = {
head = 0x0
tail = 0x0
}
rescue_work = {
data = { counter=68719476704 } /* atomic_t */
entry = {next = 0xffff8898a48e3530, prev = 0xffff8898a48e3530} /* empty */
func = (work_func_t) 0xffffffff81430230 <bio_alloc_rescue>
}
rescue_workqueue = 0x0
}
sync_set = {
bio_slab = (struct kmem_cache *) 0xffff889886fbdc00
front_pad = 0
bio_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 2
curr_nr = 2
elements = (void **) 0xffff8898b4fae2c0
pool_data = (void *) 0xffff889886fbdc00
alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab>
free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab>
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff8898a48e3598, prev = 0xffff8898a48e3598} /* empty */
}
}
bvec_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 2
curr_nr = 2
elements = (void **) 0xffff8898b4fae2a0
pool_data = (void *) 0xffff8881078ecb40
alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab>
free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab>
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff8898a48e35e0, prev = 0xffff8898a48e35e0} /* empty */
}
}
bio_integrity_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 0
curr_nr = 0
elements = 0x0
pool_data = 0x0
alloc = 0x0
free = 0x0
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0x0000000000000000, prev = 0x0000000000000000}
}
}
bvec_integrity_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 0
curr_nr = 0
elements = 0x0
pool_data = 0x0
alloc = 0x0
free = 0x0
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0x0000000000000000, prev = 0x0000000000000000}
}
}
rescue_lock = {counter=0} /* spinlock_t */
rescue_list = {
head = 0x0
tail = 0x0
}
rescue_work = {
data = { counter=68719476704 } /* atomic_t */
entry = {next = 0xffff8898a48e36a0, prev = 0xffff8898a48e36a0} /* empty */
func = (work_func_t) 0xffffffff81430230 <bio_alloc_rescue>
}
rescue_workqueue = 0x0
}
md_io_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 2
curr_nr = 2
elements = (void **) 0xffff8898b236af60
pool_data = (void *) 0x28 <fixed_percpu_data+40>
alloc = (mempool_alloc_t *) 0xffffffff811c8b50 <mempool_kmalloc>
free = (mempool_free_t *) 0xffffffff811c84b0 <mempool_kfree>
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff8898a48e36f8, prev = 0xffff8898a48e36f8} /* empty */
}
}
flush_bio = 0x0
flush_pending = { counter=0 } /* atomic_t */
start_flush = 121249416090188
last_flush = 121249416090188
flush_work = {
data = { counter=2176 } /* atomic_t */
entry = {next = 0xffff8898a48e3730, prev = 0xffff8898a48e3730} /* empty */
func = (work_func_t) 0xffffffff819aca80 <md_submit_flush_data>
}
event_work = {
data = { counter=0 } /* atomic_t */
entry = {next = 0x0000000000000000, prev = 0x0000000000000000}
func = 0x0
}
serial_info_pool = 0x0
sync_super = 0x0
cluster_info = 0x0
good_device_nr = 0
noio_flag = 0
has_superblocks = true_21_
fail_last_dev = false_21_
serialize_policy = false_21_
}

$L_004: (struct r5conf *) 0xffff88988695a800 : {
stripe_hashtbl = (struct hlist_head *) 0xffff8898a1cb9000 /* --> $L_006 */
hash_locks = [
(23)
[0] : {counter=0} /* spinlock_t */
[1] : {counter=0} /* spinlock_t */
[2] : {counter=0} /* spinlock_t */
[3] : {counter=0} /* spinlock_t */
[4] : {counter=0} /* spinlock_t */
[5] : {counter=0} /* spinlock_t */
[6] : {counter=0} /* spinlock_t */
[7] : {counter=0} /* spinlock_t */
]
mddev = (struct mddev *) 0xffff8898a48e3000
chunk_sectors = 1024
level = 6
algorithm = 2
rmw_level = 1
max_degraded = 2
raid_disks = 16
max_nr_stripes = 29816
min_nr_stripes = 256
reshape_progress = 18446744073709551615
reshape_safe = 0
previous_raid_disks = 16
prev_chunk_sectors = 1024
prev_algo = 2
generation = 0
gen_lock = {
seqcount = {
sequence = 0
}
}
reshape_checkpoint = 0
min_offset_diff = 0
handle_list = {next = 0xffff888a7893c890, prev = 0xffff8889de3cc7d0}
loprio_list = {next = 0xffff88988695a898, prev = 0xffff88988695a898} /* empty */
hold_list = {next = 0xffff88988695a8a8, prev = 0xffff88988695a8a8} /* empty */
delayed_list = {next = 0xffff88988695a8b8, prev = 0xffff88988695a8b8} /* empty */
bitmap_list = {next = 0xffff88988695a8c8, prev = 0xffff88988695a8c8} /* empty */
retry_read_aligned = 0x0
retry_read_offset = 0
retry_read_aligned_list = 0x0
preread_active_stripes = { counter=0 } /* atomic_t */
active_aligned_reads = { counter=0 } /* atomic_t */
pending_full_writes = { counter=0 } /* atomic_t */
bypass_count = 0
bypass_threshold = 1
skip_copy = 0
last_hold = (struct list_head *) 0xffff88a299ea0010
reshape_stripes = { counter=0 } /* atomic_t */
active_name = 0
cache_name = [
(2)
[0] : [
(8)
[0] : 114 'r'
[1] : 97 'a'
[2] : 105 'i'
[3] : 100 'd'
[4] : 54 '6'
[5] : 45 '-'
[6] : 109 'm'
[7] : 100 'd'
[8] : 49 '1'
[9] : 0 '\000'
[10] : 0 '\000'
[11] : 0 '\000'
[12] : 0 '\000'
[13] : 0 '\000'
[14] : 0 '\000'
[15] : 0 '\000'
[16] : 0 '\000'
[17] : 0 '\000'
[18] : 0 '\000'
[19] : 0 '\000'
[20] : 0 '\000'
[21] : 0 '\000'
[22] : 0 '\000'
[23] : 0 '\000'
[24] : 0 '\000'
[25] : 0 '\000'
[26] : 0 '\000'
[27] : 0 '\000'
[28] : 0 '\000'
[29] : 0 '\000'
[30] : 0 '\000'
[31] : 0 '\000'
]
[1] : [
(8)
[0] : 114 'r'
[1] : 97 'a'
[2] : 105 'i'
[3] : 100 'd'
[4] : 54 '6'
[5] : 45 '-'
[6] : 109 'm'
[7] : 100 'd'
[8] : 49 '1'
[9] : 45 '-'
[10] : 97 'a'
[11] : 108 'l'
[12] : 116 't'
[13] : 0 '\000'
[14] : 0 '\000'
[15] : 0 '\000'
[16] : 0 '\000'
[17] : 0 '\000'
[18] : 0 '\000'
[19] : 0 '\000'
[20] : 0 '\000'
[21] : 0 '\000'
[22] : 0 '\000'
[23] : 0 '\000'
[24] : 0 '\000'
[25] : 0 '\000'
[26] : 0 '\000'
[27] : 0 '\000'
[28] : 0 '\000'
[29] : 0 '\000'
[30] : 0 '\000'
[31] : 0 '\000'
]
]
slab_cache = (struct kmem_cache *) 0xffff8881078eca80
cache_size_mutex = {
owner = { counter=0 } /* atomic_t */
wait_lock = {counter=0} /* spinlock_t */
osq = {
tail = { counter=0 } /* atomic_t */
}
wait_list = {next = 0xffff88988695a970, prev = 0xffff88988695a970} /* empty */
}
seq_flush = 5031591
seq_write = 5031591
quiesce = 0
fullsync = 0
recovery_disabled = 15
percpu = (struct raid5_percpu *) 0x604fdee147e8
scribble_disks = 16
scribble_sectors = 1024
node = {
next = 0x0
pprev = (struct hlist_node **) 0xffff88810a5b89a8
}
active_stripes = { counter=27534 } /* atomic_t */
inactive_list = [
(3)
[0] : {next = 0xffff88990fd76710, prev = 0xffff8889b7f72550}
[1] : {next = 0xffff88959af002d0, prev = 0xffff889958cd4590}
[2] : {next = 0xffff88a60a070150, prev = 0xffff88a37d3d0850}
[3] : {next = 0xffff889e211e0690, prev = 0xffff8893a4000110}
[4] : {next = 0xffff888c2139a350, prev = 0xffff88901a188410}
[5] : {next = 0xffff88a83413e3d0, prev = 0xffff88a2090bc350}
[6] : {next = 0xffff88a720d54310, prev = 0xffff888a0da6c5d0}
[7] : {next = 0xffff888a0412c650, prev = 0xffff889250596550}
]
r5c_cached_full_stripes = { counter=0 } /* atomic_t */
r5c_full_stripe_list = {next = 0xffff88988695aa48, prev = 0xffff88988695aa48} /* empty */
r5c_cached_partial_stripes = { counter=0 } /* atomic_t */
r5c_partial_stripe_list = {next = 0xffff88988695aa60, prev = 0xffff88988695aa60} /* empty */
r5c_flushing_full_stripes = { counter=0 } /* atomic_t */
r5c_flushing_partial_stripes = { counter=0 } /* atomic_t */
empty_inactive_list_nr = { counter=0 } /* atomic_t */
released_stripes = {
first = 0x0
}
wait_for_quiescent = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff88988695aa90, prev = 0xffff88988695aa90} /* empty */
}
wait_for_stripe = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffffc9000e3f7c78, prev = 0xffffc9000e3f7c78}
}
wait_for_overlap = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff88988695aac0, prev = 0xffff88988695aac0} /* empty */
}
cache_state = 3
shrinker = {
count_objects = (long unsigned int (*)(struct shrinker *, struct shrink_control *)) 0xffffffff8198bc60 <raid5_cache_count>
scan_objects = (long unsigned int (*)(struct shrinker *, struct shrink_control *)) 0xffffffff8198de90 <raid5_cache_scan>
batch = 128
seeks = 128
flags = 0
list = {next = 0xffff8898867e8c30, prev = 0xffffffffa0e14520}
id = 0
nr_deferred = (atomic_long_t *) 0xffff88988d8955c0
}
pool_size = 16
device_lock = {counter=0} /* spinlock_t */
disks = (struct disk_info *) 0xffff8898b1df3800
bio_split = {
bio_slab = (struct kmem_cache *) 0xffff889886fbdc00
front_pad = 0
bio_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 2
curr_nr = 2
elements = (void **) 0xffff8898b4fae280
pool_data = (void *) 0xffff889886fbdc00
alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab>
free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab>
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff88988695ab70, prev = 0xffff88988695ab70} /* empty */
}
}
bvec_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 0
curr_nr = 0
elements = 0x0
pool_data = 0x0
alloc = 0x0
free = 0x0
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0x0000000000000000, prev = 0x0000000000000000}
}
}
bio_integrity_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 0
curr_nr = 0
elements = 0x0
pool_data = 0x0
alloc = 0x0
free = 0x0
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0x0000000000000000, prev = 0x0000000000000000}
}
}
bvec_integrity_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 0
curr_nr = 0
elements = 0x0
pool_data = 0x0
alloc = 0x0
free = 0x0
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0x0000000000000000, prev = 0x0000000000000000}
}
}
rescue_lock = {counter=0} /* spinlock_t */
rescue_list = {
head = 0x0
tail = 0x0
}
rescue_work = {
data = { counter=68719476704 } /* atomic_t */
entry = {next = 0xffff88988695ac78, prev = 0xffff88988695ac78} /* empty */
func = (work_func_t) 0xffffffff81430230 <bio_alloc_rescue>
}
rescue_workqueue = 0x0
}
thread = 0x0
temp_inactive_list = [
(3)
[0] : {next = 0xffff88988695aca0, prev = 0xffff88988695aca0} /* empty */
[1] : {next = 0xffff88988695acb0, prev = 0xffff88988695acb0} /* empty */
[2] : {next = 0xffff88988695acc0, prev = 0xffff88988695acc0} /* empty */
[3] : {next = 0xffff88988695acd0, prev = 0xffff88988695acd0} /* empty */
[4] : {next = 0xffff88988695ace0, prev = 0xffff88988695ace0} /* empty */
[5] : {next = 0xffff88988695acf0, prev = 0xffff88988695acf0} /* empty */
[6] : {next = 0xffff88988695ad00, prev = 0xffff88988695ad00} /* empty */
[7] : {next = 0xffff88988695ad10, prev = 0xffff88988695ad10} /* empty */
]
worker_groups = 0x0
group_cnt = 0
worker_cnt_per_group = 0
log = 0x0
log_private = 0x0
pending_bios_lock = {counter=0} /* spinlock_t */
batch_bio_dispatch = true
pending_data = (struct r5pending_data *) 0xffff88988fae0000
free_list = {next = 0xffff88988fae4fd8, prev = 0xffff88988fae0000}
pending_list = {next = 0xffff88988695ad60, prev = 0xffff88988695ad60} /* empty */
pending_data_cnt = 0
next_pending_data = 0x0
}


set $sh = container_of ((void *)0xffff88a57f1e6550, struct stripe_head, lru)

(gdb) dump *$sh

$L_010: (struct stripe_head *) 0xffff88a57f1e6540 : {
hash = {
next = (struct hlist_node *) 0xffff888768f7c680
pprev = (struct hlist_node **) 0xffff8892f0520080
}
lru = {next = 0xffff88945e034510, prev = 0xffff8891d8f04210}
release_list = {
next = (struct llist_node *) 0xffff88945e034520
}
raid_conf = (struct r5conf *) 0xffff88988695a800
generation = 0
sector = 4947068032
pd_idx = 14
qd_idx = 15
ddf_layout = 0
hash_lock_index = 0
state = 8242 # 0x2032 HANDLE + INSYNC + REPLACED + STARTED
count = { counter=0 } /* atomic_t */
bm_seq = 5030861
disks = 16
overwrite_disks = 0
check_state = check_state_idle
reconstruct_state = reconstruct_state_idle
stripe_lock = {counter=0} /* spinlock_t */
cpu = 17
group = 0x0
batch_head = 0x0 # no batch
batch_lock = {counter=0} /* spinlock_t */
batch_list = {next = 0xffff88a57f1e65c8, prev = 0xffff88a57f1e65c8} /* empty */
/* anon union*/ {
log_io = 0x0
ppl_io = 0x0
}
log_list = {next = 0xffff88a57f1e65e0, prev = 0xffff88a57f1e65e0} /* empty */
log_start = 18446744073709551615
r5c = {next = 0xffff88a57f1e65f8, prev = 0xffff88a57f1e65f8} /* empty */
ppl_page = 0x0
ops = {
target = 0
target2 = 0
zero_sum_result = 0
}


(gdb) dump $sh->dev[0]

$L_011: (struct r5dev *) 0xffff88a57f1e6620 : {
req = {
bi_next = 0x0
bi_disk = 0x0
bi_opf = 0
bi_flags = 0
bi_ioprio = 0
bi_write_hint = 0
bi_status = 0 '\000'
bi_partno = 0 '\000'
__bi_remaining = { counter=1 } /* atomic_t */
bi_iter = {
bi_sector = 0
bi_size = 0
bi_idx = 0
bi_bvec_done = 0
}
bi_end_io = 0x0
bi_private = 0x0
bi_blkg = 0x0
bi_issue = {
value = 0
}
/* anon union*/ {
bi_integrity = 0x0
}
bi_vcnt = 0
bi_max_vecs = 1
__bi_cnt = { counter=1 } /* atomic_t */
bi_io_vec = (struct bio_vec *) 0xffff88a57f1e6710
bi_pool = 0x0
bi_inline_vecs = [
]
}
rreq = {
bi_next = 0x0
bi_disk = 0x0
bi_opf = 0
bi_flags = 0
bi_ioprio = 0
bi_write_hint = 0
bi_status = 0 '\000'
bi_partno = 0 '\000'
__bi_remaining = { counter=1 } /* atomic_t */
bi_iter = {
bi_sector = 0
bi_size = 0
bi_idx = 0
bi_bvec_done = 0
}
bi_end_io = 0x0
bi_private = 0x0
bi_blkg = 0x0
bi_issue = {
value = 0
}
/* anon union*/ {
bi_integrity = 0x0
}
bi_vcnt = 0
bi_max_vecs = 1
__bi_cnt = { counter=1 } /* atomic_t */
bi_io_vec = (struct bio_vec *) 0xffff88a57f1e6720
bi_pool = 0x0
bi_inline_vecs = [
]
}
vec = {
bv_page = (struct page *) 0xffffea008ffccc80
bv_len = 4096
bv_offset = 0
}
rvec = {
bv_page = 0x0
bv_len = 0
bv_offset = 0
}
page = (struct page *) 0xffffea008ffccc80
orig_page = (struct page *) 0xffffea008ffccc80
offset = 0
toread = 0x0
read = 0x0
towrite = 0x0
written = 0x0
sector = 69258950784
flags = 17 # 0x11 : R5_UPTODATE + R5_INSYNC
log_checksum = 0
write_hint = 0
}

$sh->dev[1] .. $sh->dev[15] omitted. All the same, just different page buffers and sector numbers.

$sh->dev[14] and $sh->dev[15] have sector = 0.

2021-01-20 16:43:13

by Guoqing Jiang

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi Donald,

On 1/19/21 12:30, Donald Buczek wrote:
> Dear md-raid people,
>
> I've reported a problem in this thread in December:
>
> "We are using raid6 on several servers. Occasionally we had failures,
> where a mdX_raid6 process seems to go into a busy loop and all I/O to
> the md device blocks. We've seen this on various kernel versions." It
> was clear, that this is related to "mdcheck" running, because we found,
> that the command, which should stop the scrubbing in the morning (`echo
> idle > /sys/devices/virtual/block/md1/md/sync_action`) is also blocked.
>
> On 12/21/20, I've reported, that the problem might be caused by a
> failure of the underlying block device driver, because I've found
> "inflight" counters of the block devices not being zero. However, this
> is not the case. We were able to run into the mdX_raid6 looping
> condition a few times again, but the non-zero inflight counters have not
> been observed again.
>
> I was able to collect a lot of additional information from a blocked
> system.
>
> - The `cat idle > /sys/devices/virtual/block/md1/md/sync_action` command
> is waiting at kthread_stop to stop the sync thread. [
> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L7989 ]
>
> - The sync thread ("md1_resync") does not finish, because its blocked at
>
> [<0>] raid5_get_active_stripe+0x4c4/0x660     # [1]
> [<0>] raid5_sync_request+0x364/0x390
> [<0>] md_do_sync+0xb41/0x1030
> [<0>] md_thread+0x122/0x160
> [<0>] kthread+0x118/0x130
> [<0>] ret_from_fork+0x22/0x30
>
> [1] https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L735
>
> - yes, gdb confirms that `conf->cache_state` is 0x03 (
> R5_INACTIVE_BLOCKED + R5_ALLOC_MORE )

The resync thread is blocked since it can't get a sh from the inactive list,
so R5_ALLOC_MORE and R5_INACTIVE_BLOCKED are set; that is why `echo idle
> /sys/devices/virtual/block/md1/md/sync_action` can't stop the sync thread.

>
> - We have lots of active stripes:
>
> root@deadbird:~/linux_problems/mdX_raid6_looping# cat
> /sys/block/md1/md/stripe_cache_active
> 27534

There are too many active stripes, so this is false:

atomic_read(&conf->active_stripes) < (conf->max_nr_stripes * 3 / 4)

so raid5_get_active_stripe has to wait until the condition becomes true; we
either need to increase max_nr_stripes or decrease active_stripes.

1. Increase max_nr_stripes
since "mdX_raid6 process seems to go into a busy loop" and R5_ALLOC_MORE
is set, if there is enough memory, I suppose mdX_raid6 (raid5d) could
alloc new stripe in grow_one_stripe and increase max_nr_stripes. So
please check the memory usage of your system.

Another option is to increase the number of stripe_heads (sh) manually by
writing a new number to stripe_cache_size, if there is enough memory.
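
For example, something along these lines could be used to check the
memory situation and to try a bigger stripe cache (just a sketch; md1
and the value 8192 are only examples, and each stripe costs roughly one
page per member device):

    free -m
    cat /sys/block/md1/md/stripe_cache_size     # current limit (number of stripes)
    cat /sys/block/md1/md/stripe_cache_active   # stripes currently in use
    echo 8192 > /sys/block/md1/md/stripe_cache_size   # example value only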

2. Or decrease active_stripes
This is supposed to be done by do_release_stripe, but STRIPE_HANDLE is set

> - handle_stripe() doesn't make progress:
>
> echo "func handle_stripe =pflt" > /sys/kernel/debug/dynamic_debug/control
>
> In dmesg, we see the debug output from
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4925
> but never from
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4958:
>
> [171908.896651] [1388] handle_stripe:4929: handling stripe 4947089856,
> state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>                 , check:4, reconstruct:0
> [171908.896657] [1388] handle_stripe:4929: handling stripe 4947089872,
> state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>                 , check:4, reconstruct:0
> [171908.896669] [1388] handle_stripe:4929: handling stripe 4947089912,
> state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>                 , check:4, reconstruct:0
>
> The sector numbers repeat after some time. We have only the following
> variants of stripe state and "check":
>
> state=0x2031 cnt=1, check:0 # ACTIVE        +INSYNC+REPLACED+IO_STARTED,
> check_state_idle
> state=0x2029 cnt=1, check:4 # ACTIVE+SYNCING       +REPLACED+IO_STARTED,
> check_state_check_result
> state=0x2009 cnt=1, check:0 # ACTIVE+SYNCING                +IO_STARTED,
> check_state_idle
>
> - We have MD_SB_CHANGE_PENDING set:

because the MD_SB_CHANGE_PENDING flag is set, so do_release_stripe can't
call atomic_dec(&conf->active_stripes).

Normally, SB_CHANGE_PENDING is cleared by md_update_sb, which could be
called from md_reap_sync_thread. But md_reap_sync_thread is stuck
unregistering the sync_thread (which is blocked in raid5_get_active_stripe).


Still, I don't understand well why mdX_raid6 is in a busy loop; maybe raid5d
can't break out of the while(1) loop because "if (!batch_size &&
!released)" is false. I would assume released is '0' since
> released_stripes = {
> first = 0x0
> }
And __get_priority_stripe fetches the sh from conf->handle_list and deletes
it from handle_list, handle_stripe marks the sh with STRIPE_HANDLE, then
do_release_stripe puts the sh back on handle_list. So batch_size can't be
'0', and handle_active_stripes repeats the process in the busy loop. This
is my best guess to explain the busy-loop situation.
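
A possible way to check this from userspace is to enable the existing
pr_debug at the end of the raid5d loop via dynamic debug and watch
whether non-zero counts keep being reported while no I/O completes
(a sketch, assuming CONFIG_DYNAMIC_DEBUG):

    echo 'func raid5d +p' > /sys/kernel/debug/dynamic_debug/control
    dmesg -w | grep 'stripes handled'
    # turn it off again afterwards:
    echo 'func raid5d -p' > /sys/kernel/debug/dynamic_debug/control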

>
> root@deadbird:~/linux_problems/mdX_raid6_looping# cat
> /sys/block/md1/md/array_state
> write-pending
>
> gdb confirms that sb_flags = 6 (MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING)

Since rdev_set_badblocks could set them, could you check whether there are
bad blocks on the underlying devices (sd*)?
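
For example (a sketch; this assumes the per-device bad-block attributes
are exposed on your kernel):

    for d in /sys/block/md1/md/dev-*; do
            echo "== $d"
            cat "$d/bad_blocks" "$d/unacknowledged_bad_blocks"
    done
    # or, per member device:
    # mdadm --examine-badblocks /dev/sdX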

>
> So it can be assumed that handle_stripe breaks out at
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4939
>
> - The system can manually be freed from the deadlock:
>
> When `echo active > /sys/block/md1/md/array_state` is used, the
> scrubbing and other I/O continue. Probably because of
> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L4520

Hmm, it seems clearing SB_CHANGE_PENDING did the trick, so the blocked
processes can make progress.

>
> I, of coruse, don't fully understand it yet. Any ideas?
>
> I append some data from a hanging raid... (mddev, r5conf and a sample
> stripe_head from handle_list with it first disks)

These data did help the investigation!

Thanks,
Guoqing

2021-01-23 13:07:32

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear Guoqing,

On 20.01.21 17:33, Guoqing Jiang wrote:
> Hi Donald,
>
> On 1/19/21 12:30, Donald Buczek wrote:
>> Dear md-raid people,
>>
>> I've reported a problem in this thread in December:
>>
>> "We are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions." It was clear, that this is related to "mdcheck" running, because we found, that the command, which should stop the scrubbing in the morning (`echo idle > /sys/devices/virtual/block/md1/md/sync_action`) is also blocked.
>>
>> On 12/21/20, I've reported, that the problem might be caused by a failure of the underlying block device driver, because I've found "inflight" counters of the block devices not being zero. However, this is not the case. We were able to run into the mdX_raid6 looping condition a few times again, but the non-zero inflight counters have not been observed again.
>>
>> I was able to collect a lot of additional information from a blocked system.
>>
>> - The `cat idle > /sys/devices/virtual/block/md1/md/sync_action` command is waiting at kthread_stop to stop the sync thread. [ https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L7989 ]
>>
>> - The sync thread ("md1_resync") does not finish, because its blocked at
>>
>> [<0>] raid5_get_active_stripe+0x4c4/0x660     # [1]
>> [<0>] raid5_sync_request+0x364/0x390
>> [<0>] md_do_sync+0xb41/0x1030
>> [<0>] md_thread+0x122/0x160
>> [<0>] kthread+0x118/0x130
>> [<0>] ret_from_fork+0x22/0x30
>>
>> [1] https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L735
>>
>> - yes, gdb confirms that `conf->cache_state` is 0x03 ( R5_INACTIVE_BLOCKED + R5_ALLOC_MORE )
>
> The resync thread is blocked since it can't get sh from inactive list, so R5_ALLOC_MORE and R5_INACTIVE_BLOCKED are set, that is why `echo idle > /sys/devices/virtual/block/md1/md/sync_action` can't stop sync thread.
>
>>
>> - We have lots of active stripes:
>>
>> root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/stripe_cache_active
>> 27534
>
> There are too many active stripes, so this is false:
>
> atomic_read(&conf->active_stripes) < (conf->max_nr_stripes * 3 / 4)
>
> so raid5_get_active_stripe has to wait till it becomes true, either increase max_nr_stripes or decrease active_stripes.
>
> 1. Increase max_nr_stripes
> since "mdX_raid6 process seems to go into a busy loop" and R5_ALLOC_MORE is set, if there is enough memory, I suppose mdX_raid6 (raid5d) could alloc new stripe in grow_one_stripe and increase max_nr_stripes. So please check the memory usage of your system.
>
> Another thing is you can try to increase the number of sh manually by write new number to stripe_cache_size if there is enough memory.
>
> 2. Or decrease active_stripes
> This is suppose to be done by do_release_stripe, but STRIPE_HANDLE is set
>
>> - handle_stripe() doesn't make progress:
>>
>> echo "func handle_stripe =pflt" > /sys/kernel/debug/dynamic_debug/control
>>
>> In dmesg, we see the debug output from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4925 but never from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4958:
>>
>> [171908.896651] [1388] handle_stripe:4929: handling stripe 4947089856, state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>>                  , check:4, reconstruct:0
>> [171908.896657] [1388] handle_stripe:4929: handling stripe 4947089872, state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>>                  , check:4, reconstruct:0
>> [171908.896669] [1388] handle_stripe:4929: handling stripe 4947089912, state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>>                  , check:4, reconstruct:0
>>
>> The sector numbers repeat after some time. We have only the following variants of stripe state and "check":
>>
>> state=0x2031 cnt=1, check:0 # ACTIVE        +INSYNC+REPLACED+IO_STARTED, check_state_idle
>> state=0x2029 cnt=1, check:4 # ACTIVE+SYNCING       +REPLACED+IO_STARTED, check_state_check_result
>> state=0x2009 cnt=1, check:0 # ACTIVE+SYNCING                +IO_STARTED, check_state_idle
>>
>> - We have MD_SB_CHANGE_PENDING set:
>
> because MD_SB_CHANGE_PENDING flag. So do_release_stripe can't call atomic_dec(&conf->active_stripes).
>
> Normally, SB_CHANGE_PENDING is cleared from md_update_sb which could be called by md_reap_sync_thread. But md_reap_sync_thread stuck with unregister sync_thread (it was blocked in raid5_get_active_stripe).
>
>
> Still I don't understand well why mdX_raid6 in a busy loop, maybe raid5d can't break from the while(1) loop because "if (!batch_size && !released)" is false. I would assume released is '0' since
> >      released_stripes = {
> >          first = 0x0
> >      }
> And __get_priority_stripe fetches sh from conf->handle_list and delete
> it from handle_list, handle_stripe marks the state of the sh with STRIPE_HANDLE, then do_release_stripe put the sh back to handle_list.
> So batch_size can't be '0', and handle_active_stripes in the loop
> repeats the process in the busy loop. This is my best guess to explain the busy loop situation.
>
>>
>> root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/array_state
>> write-pending
>>
>> gdb confirms that sb_flags = 6 (MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING)
>
> since rdev_set_badblocks could set them, could you check if there is badblock of underlying device (sd*)?

The problem is, we are still not able to run a system into the deadlocked state by a repeatable procedure. So I was not yet able to explicitly check for badblocks.

However, on the production systems, which locked up, and on the test systems, which we managed to lock up, there always was some file system activity to the affected devices aside from mdcheck itself. This alone would explain any clean->write-pending transitions and might be a required condition for the problem to happen. I'm currently trying to explicitly exercise this path (by write/fsync()/sleep), with no result yet.

Btw: When you monitor md/stripe_cache_active during an ongoing "check" on an otherwise idle system, you see a variety of values up to a certain maximum value, which is visible most of the time. But this maximum value seems to increase continuously. Also, when you start the check at higher blocks (via md/sync_min) right away, the maximum value seems to be higher. Can this be explained? A bigger seek gap between the superblocks and the active data area? But should that increase the number of active stripes?
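
For reference, a simple loop like the following is enough for this kind of monitoring (a sketch; md1 and the one-second interval are arbitrary):

    while sleep 1; do
            echo "$(date +%T) $(cat /sys/block/md1/md/sync_action) $(cat /sys/block/md1/md/sync_completed) $(cat /sys/block/md1/md/stripe_cache_active)"
    done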

Best
Donald

>> So it can be assumed that handle_stripe breaks out at https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4939
>>
>> - The system can manually be freed from the deadlock:
>>
>> When `echo active > /sys/block/md1/md/array_state` is used, the scrubbing and other I/O continue. Probably because of https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L4520
>
> Hmm, seems clear SB_CHANGE_PENDING made the trick, so the blocked process can make progress.
>
>>
>> I, of coruse, don't fully understand it yet. Any ideas?
>>
>> I append some data from a hanging raid... (mddev, r5conf and a sample stripe_head from handle_list with it first disks)
>
> These data did help for investigation!
>
> Thanks,
> Guoqing

2021-01-25 23:10:30

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition



On 25.01.21 09:54, Donald Buczek wrote:
> Dear Guoqing,
>
> a colleague of mine was able to produce the issue inside a vm and were able to find a procedure to run the vm into the issue within minutes (not unreliably after hours on a physical system as before). This of course helped to pinpoint the problem.
>
> My current theory of what is happening is:
>
> - MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by md_write_start() when file-system I/O wants to do a write and the array transitions from "clean" to "active". (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308)
>
> - Before raid5d gets to write the superblock (its busy processing active stripes because of the sync activity) , userspace wants to pause the check by `echo idle > /sys/block/mdX/md/sync_action`
>
> - action_store() takes the reconfig_mutex before trying to stop the sync thread. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) Dump of struct mddev of email 1/19/21 confirms reconf_mutex non-zero.
>
> - raid5d is running in its main loop. raid5d()->handle_active_stripes() returns a positive batch size ( https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329 ) although raid5d()->handle_active_stripes()->handle_stripe() doesn't process any stripe because of MD_SB_CHANGE_PENDING. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729 ). This is the reason, raid5d is busy looping.
>
> - raid5d()->md_check_recovery() is called by the raid5d main loop. One of its duties is to write the superblock, if a change is pending. However to do so, it needs either MD_ALLOW_SB_UPDATE or must be able to take the reconfig_mutex. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967 , https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006) Both is not true, so the superblock is not written and MD_SB_CHANGE_PENDING is not cleared.
>
> - (as discussed previously) the sync thread is waiting for the number of active stripes to go down and doesn't terminate. The userspace thread is waiting for the sync thread to terminate.
>
> Does this make sense?
>
> Just for reference, I add the procedure which triggers the issue on the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:
>
> ```
> #! /bin/bash
>
> (
>         while true; do
>                 echo "start check"
>                 echo check > /sys/block/md3/md/sync_action
>                 sleep 10
>                 echo "stop check"
>                 echo idle > /sys/block/md3/md/sync_action
>                 sleep 2
>         done
> ) &
>
> (
>         while true; do
>                 dd bs=1k count=$((5*1024*1024)) if=/dev/zero of=/mnt/raid_md3/bigfile status=none
>                 sync /mnt/raid_md3/bigfile
>                 rm /mnt/raid_md3/bigfile
>                 sleep .1
>         done
> ) &
>
> start="$(date +%s)"
> cd /sys/block/md3/md
> wp_count=0
> while true; do
>         array_state=$(cat array_state)
>         if [ "$array_state" = write-pending ]; then
>                 wp_count=$(($wp_count+1))
>         else
>                 wp_count=0
>         fi
>         echo $(($(date +%s)-$start)) $(cat sync_action) $(cat sync_completed) $array_state $(cat stripe_cache_active)
>         if [ $wp_count -ge 3 ]; then
>                 kill -- -$$
>                 exit
>         fi
>         sleep 1
> done
> ```
>
> The time, this needs to trigger the issue, varies from under a minute to one hour with 5 minute being typical. The output ends like this:
>
>     309 check 6283872 / 8378368 active-idle 4144
>     310 check 6283872 / 8378368 active 1702
>     311 check 6807528 / 8378368 active 4152
>     312 check 7331184 / 8378368 clean 3021
>     stop check
>     313 check 7331184 / 8378368 write-pending 3905
>     314 check 7331184 / 8378368 write-pending 3905
>     315 check 7331184 / 8378368 write-pending 3905
>     Terminated
>
> If I add
>
>     pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n", batch_size, released, mddev->sb_flags);
>
> in raid5d after the call to handle_active_stripes and enable the debug location after the deadlock occurred, I get
>
>     [ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>     [ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>     [ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>     [ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>
> If I add
>
>     pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags);
>
> at the head of md_check_recovery, I get:
>
>     [  789.555462] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>     [  789.555477] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>     [  789.555491] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>     [  789.555505] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>     [  789.555520] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>
> More debug lines in md_check_recovery confirm the control flow ( `if (mddev_trylock(mddev))` block not taken )
>
> What approach would you suggest to fix this?

I naively tried the following patch and it seems to fix the problem. The test procedure didn't trigger the deadlock in 10 hours.

D.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2d21c298ffa7..f40429843906 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len)
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
mddev_lock(mddev) == 0) {
+ set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
flush_workqueue(md_misc_wq);
if (mddev->sync_thread) {
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
md_reap_sync_thread(mddev);
}
+ clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
mddev_unlock(mddev);
}
} else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
--
2.30.0

2021-01-26 02:52:39

by Guoqing Jiang

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi Donald,

On 1/25/21 22:32, Donald Buczek wrote:
>
>
> On 25.01.21 09:54, Donald Buczek wrote:
>> Dear Guoqing,
>>
>> a colleague of mine was able to produce the issue inside a vm and were
>> able to find a procedure to run the vm into the issue within minutes
>> (not unreliably after hours on a physical system as before). This of
>> course helped to pinpoint the problem.
>>
>> My current theory of what is happening is:
>>
>> - MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by
>> md_write_start() when file-system I/O wants to do a write and the
>> array transitions from "clean" to "active".
>> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308)
>>
>> - Before raid5d gets to write the superblock (its busy processing
>> active stripes because of the sync activity) , userspace wants to
>> pause the check by `echo idle > /sys/block/mdX/md/sync_action`
>>
>> - action_store() takes the reconfig_mutex before trying to stop the
>> sync thread.
>> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) Dump
>> of struct mddev of email 1/19/21 confirms reconf_mutex non-zero.
>>
>> - raid5d is running in its main loop.
>> raid5d()->handle_active_stripes() returns a positive batch size (
>> https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329
>> ) although raid5d()->handle_active_stripes()->handle_stripe() doesn't
>> process any stripe because of MD_SB_CHANGE_PENDING.
>> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729
>> ). This is the reason, raid5d is busy looping.
>>
>> - raid5d()->md_check_recovery() is called by the raid5d main loop. One
>> of its duties is to write the superblock, if a change is pending.
>> However to do so, it needs either MD_ALLOW_SB_UPDATE or must be able
>> to take the reconfig_mutex.
>> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967
>> ,
>> https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006)
>> Both is not true, so the superblock is not written and
>> MD_SB_CHANGE_PENDING is not cleared.
>>
>> - (as discussed previously) the sync thread is waiting for the number
>> of active stripes to go down and doesn't terminate. The userspace
>> thread is waiting for the sync thread to terminate.
>>
>> Does this make sense?

Yes, exactly! That was my thought too, the scenario is twisted.

So the resync thread is blocked because there are too many active stripes;
raid5d is in a busy loop since SB_CHANGE_PENDING is set, which means the
number of active stripes can't be decreased; and the echo idle command
can't make progress because the resync thread is blocked, while the command
still holds the reconfig_mutex, which keeps raid5d in the busy loop and
unable to clear the SB_CHANGE_PENDING flag.

And raid5 could suffer from the same issue I think.

>>
>> Just for reference, I add the procedure which triggers the issue on
>> the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:
>>
>> ```
>> #! /bin/bash
>>
>> (
>>          while true; do
>>                  echo "start check"
>>                  echo check > /sys/block/md3/md/sync_action
>>                  sleep 10
>>                  echo "stop check"
>>                  echo idle > /sys/block/md3/md/sync_action
>>                  sleep 2
>>          done
>> ) &
>>
>> (
>>          while true; do
>>                  dd bs=1k count=$((5*1024*1024)) if=/dev/zero
>> of=/mnt/raid_md3/bigfile status=none
>>                  sync /mnt/raid_md3/bigfile
>>                  rm /mnt/raid_md3/bigfile
>>                  sleep .1
>>          done
>> ) &
>>
>> start="$(date +%s)"
>> cd /sys/block/md3/md
>> wp_count=0
>> while true; do
>>          array_state=$(cat array_state)
>>          if [ "$array_state" = write-pending ]; then
>>                  wp_count=$(($wp_count+1))
>>          else
>>                  wp_count=0
>>          fi
>>          echo $(($(date +%s)-$start)) $(cat sync_action) $(cat
>> sync_completed) $array_state $(cat stripe_cache_active)
>>          if [ $wp_count -ge 3 ]; then
>>                  kill -- -$$
>>                  exit
>>          fi
>>          sleep 1
>> done
>> ```
>>
>> The time, this needs to trigger the issue, varies from under a minute
>> to one hour with 5 minute being typical. The output ends like this:
>>
>>      309 check 6283872 / 8378368 active-idle 4144
>>      310 check 6283872 / 8378368 active 1702
>>      311 check 6807528 / 8378368 active 4152
>>      312 check 7331184 / 8378368 clean 3021
>>      stop check
>>      313 check 7331184 / 8378368 write-pending 3905
>>      314 check 7331184 / 8378368 write-pending 3905
>>      315 check 7331184 / 8378368 write-pending 3905
>>      Terminated
>>
>> If I add
>>
>>      pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n",
>> batch_size, released, mddev->sb_flags);
>>
>> in raid5d after the call to handle_active_stripes and enable the debug
>> location after the deadlock occurred, I get
>>
>>      [ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0
>> mdddev->sb_flags 6
>>      [ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0
>> mdddev->sb_flags 6
>>      [ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0
>> mdddev->sb_flags 6
>>      [ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0
>> mdddev->sb_flags 6
>>
>> If I add
>>
>>      pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags
>> %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags);
>>
>> at the head of md_check_recovery, I get:
>>
>>      [  789.555462] [1191] md_check_recovery:8970: XXX 1
>> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>      [  789.555477] [1191] md_check_recovery:8970: XXX 1
>> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>      [  789.555491] [1191] md_check_recovery:8970: XXX 1
>> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>      [  789.555505] [1191] md_check_recovery:8970: XXX 1
>> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>      [  789.555520] [1191] md_check_recovery:8970: XXX 1
>> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>
>> More debug lines in md_check_recovery confirm the control flow ( `if
>> (mddev_trylock(mddev))` block not taken )
>>

It is great that you have a reproducer now!

>> What approach would you suggest to fix this?
>
> I naively tried the following patch and it seems to fix the problem. The
> test procedure didn't trigger the deadlock in 10 hours.
>
> D.
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 2d21c298ffa7..f40429843906 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char
> *page, size_t len)
>              clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>          if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>              mddev_lock(mddev) == 0) {
> +            set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>              flush_workqueue(md_misc_wq);
>              if (mddev->sync_thread) {
>                  set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>                  md_reap_sync_thread(mddev);
>              }
> +            clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>              mddev_unlock(mddev);
>          }
>      } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))

Yes, it could break the deadlock, but I am not sure it is the right way:
we only set ALLOW_SB_UPDATE in suspend, which makes sense since the I/O is
quiesced there, while the write-idle action can't guarantee anything
similar. I would prefer to make the resync thread not wait forever here.

wait_event_lock_irq(
        conf->wait_for_stripe,
        !list_empty(conf->inactive_list + hash) &&
        (atomic_read(&conf->active_stripes)
         < (conf->max_nr_stripes * 3 / 4)
         || !test_bit(R5_INACTIVE_BLOCKED,
                      &conf->cache_state)),
        *(conf->hash_locks + hash));

So, could you please try the attached?

Thanks,
Guoqing


Attachments:
raid5-proposal (5.72 kB)

2021-01-26 11:34:14

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear Guoqing,

On 26.01.21 01:44, Guoqing Jiang wrote:
> Hi Donald,
>
> On 1/25/21 22:32, Donald Buczek wrote:
>>
>>
>> On 25.01.21 09:54, Donald Buczek wrote:
>>> Dear Guoqing,
>>>
>>> a colleague of mine was able to produce the issue inside a vm and were able to find a procedure to run the vm into the issue within minutes (not unreliably after hours on a physical system as before). This of course helped to pinpoint the problem.
>>>
>>> My current theory of what is happening is:
>>>
>>> - MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by md_write_start() when file-system I/O wants to do a write and the array transitions from "clean" to "active". (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308)
>>>
>>> - Before raid5d gets to write the superblock (its busy processing active stripes because of the sync activity) , userspace wants to pause the check by `echo idle > /sys/block/mdX/md/sync_action`
>>>
>>> - action_store() takes the reconfig_mutex before trying to stop the sync thread. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) Dump of struct mddev of email 1/19/21 confirms reconf_mutex non-zero.
>>>
>>> - raid5d is running in its main loop. raid5d()->handle_active_stripes() returns a positive batch size ( https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329 ) although raid5d()->handle_active_stripes()->handle_stripe() doesn't process any stripe because of MD_SB_CHANGE_PENDING. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729 ). This is the reason, raid5d is busy looping.
>>>
>>> - raid5d()->md_check_recovery() is called by the raid5d main loop. One of its duties is to write the superblock, if a change is pending. However to do so, it needs either MD_ALLOW_SB_UPDATE or must be able to take the reconfig_mutex. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967 , https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006) Both is not true, so the superblock is not written and MD_SB_CHANGE_PENDING is not cleared.
>>>
>>> - (as discussed previously) the sync thread is waiting for the number of active stripes to go down and doesn't terminate. The userspace thread is waiting for the sync thread to terminate.
>>>
>>> Does this make sense?
>
> Yes, exactly! That was my thought too, the scenario is twisted.
>
> Then resync thread is blocked due to there are too many active stripes, because raid5d is in busy loop since SB_CHANGE_PENDING is set which means tactive stripes can't be decreased, and echo idle cmd can't make progress given resync thread is blocked while the cmd still hold reconfig_mutex which make raid5d in busy loop and can't clear SB_CHANGE_PENDING flag.
>
> And raid5 could suffer from the same issue I think.
>
>>>
>>> Just for reference, I add the procedure which triggers the issue on the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:
>>>
>>> ```
>>> #! /bin/bash
>>>
>>> (
>>>          while true; do
>>>                  echo "start check"
>>>                  echo check > /sys/block/md3/md/sync_action
>>>                  sleep 10
>>>                  echo "stop check"
>>>                  echo idle > /sys/block/md3/md/sync_action
>>>                  sleep 2
>>>          done
>>> ) &
>>>
>>> (
>>>          while true; do
>>>                  dd bs=1k count=$((5*1024*1024)) if=/dev/zero of=/mnt/raid_md3/bigfile status=none
>>>                  sync /mnt/raid_md3/bigfile
>>>                  rm /mnt/raid_md3/bigfile
>>>                  sleep .1
>>>          done
>>> ) &
>>>
>>> start="$(date +%s)"
>>> cd /sys/block/md3/md
>>> wp_count=0
>>> while true; do
>>>          array_state=$(cat array_state)
>>>          if [ "$array_state" = write-pending ]; then
>>>                  wp_count=$(($wp_count+1))
>>>          else
>>>                  wp_count=0
>>>          fi
>>>          echo $(($(date +%s)-$start)) $(cat sync_action) $(cat sync_completed) $array_state $(cat stripe_cache_active)
>>>          if [ $wp_count -ge 3 ]; then
>>>                  kill -- -$$
>>>                  exit
>>>          fi
>>>          sleep 1
>>> done
>>> ```
>>>
>>> The time, this needs to trigger the issue, varies from under a minute to one hour with 5 minute being typical. The output ends like this:
>>>
>>>      309 check 6283872 / 8378368 active-idle 4144
>>>      310 check 6283872 / 8378368 active 1702
>>>      311 check 6807528 / 8378368 active 4152
>>>      312 check 7331184 / 8378368 clean 3021
>>>      stop check
>>>      313 check 7331184 / 8378368 write-pending 3905
>>>      314 check 7331184 / 8378368 write-pending 3905
>>>      315 check 7331184 / 8378368 write-pending 3905
>>>      Terminated
>>>
>>> If I add
>>>
>>>      pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n", batch_size, released, mddev->sb_flags);
>>>
>>> in raid5d after the call to handle_active_stripes and enable the debug location after the deadlock occurred, I get
>>>
>>>      [ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>>>      [ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>>>      [ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>>>      [ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>>>
>>> If I add
>>>
>>>      pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags);
>>>
>>> at the head of md_check_recovery, I get:
>>>
>>>      [  789.555462] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>>      [  789.555477] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>>      [  789.555491] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>>      [  789.555505] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>>      [  789.555520] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>>
>>> More debug lines in md_check_recovery confirm the control flow ( `if (mddev_trylock(mddev))` block not taken )
>>>
>
> That is great that you have a reproducer now!
>
>>> What approach would you suggest to fix this?
>>
>> I naively tried the following patch and it seems to fix the problem. The test procedure didn't trigger the deadlock in 10 hours.
>>
>> D.
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 2d21c298ffa7..f40429843906 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len)
>>               clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>           if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>>               mddev_lock(mddev) == 0) {
>> +            set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>>               flush_workqueue(md_misc_wq);
>>               if (mddev->sync_thread) {
>>                   set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>>                   md_reap_sync_thread(mddev);
>>               }
>> +            clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>>               mddev_unlock(mddev);
>>           }
>>       } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
>
> Yes, it could break the deadlock issue, but I am not sure if it is the right way given we only set ALLOW_SB_UPDATE in suspend which makes sense since the io will be quiesced, but write idle action can't guarantee the  similar thing.

Thinking (and documenting) MD_ALLOW_SB_UPDATE as "the holder of reconfig_mutex promises not to make any changes which would exclude superblocks from being written" might make it easier to accept the usage.

> I prefer to make resync thread not wait forever here.
>
> wait_event_lock_irq(
>     conf->wait_for_stripe,
>     !list_empty(conf->inactive_list + hash) &&
>     (atomic_read(&conf->active_stripes)
>     < (conf->max_nr_stripes * 3 / 4)
>     || !test_bit(R5_INACTIVE_BLOCKED,
>              &conf->cache_state)
>     *(conf->hash_locks + hash));
>
> So, could you please try the attached?
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 4337ae0..378ce5c 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -1931,7 +1931,7 @@ r5c_recovery_alloc_stripe(
> {
> struct stripe_head *sh;
>
> - sh = raid5_get_active_stripe(conf, stripe_sect, 0, noblock, 0);
> + sh = raid5_get_active_stripe(conf, stripe_sect, 0, 0, noblock, 0);
> if (!sh)
> return NULL; /* no more stripe available */
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 8888973..33a2a22 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -700,10 +700,11 @@ static int has_failed(struct r5conf *conf)
> }
>
> struct stripe_head *
> -raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
> +raid5_get_active_stripe(struct r5conf *conf, sector_t sector, int sync_req,
> int previous, int noblock, int noquiesce)
> {
> struct stripe_head *sh;
> + struct mddev *mddev = conf->mddev;
> int hash = stripe_hash_locks_hash(conf, sector);
> int inc_empty_inactive_list_flag;
>
> @@ -738,8 +739,14 @@ raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
> (atomic_read(&conf->active_stripes)
> < (conf->max_nr_stripes * 3 / 4)
> || !test_bit(R5_INACTIVE_BLOCKED,
> - &conf->cache_state)),
> + &conf->cache_state)
> + || (test_bit(MD_RECOVERY_INTR,
> + &mddev->recovery)
> + && sync_req)),
> *(conf->hash_locks + hash));
> + if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)
> + && sync_req)
> + break;
> clear_bit(R5_INACTIVE_BLOCKED,
> &conf->cache_state);
> } else {
> @@ -4527,7 +4534,7 @@ static void handle_stripe_expansion(struct r5conf *conf, struct stripe_head *sh)
> sector_t bn = raid5_compute_blocknr(sh, i, 1);
> sector_t s = raid5_compute_sector(conf, bn, 0,
> &dd_idx, NULL);
> - sh2 = raid5_get_active_stripe(conf, s, 0, 1, 1);
> + sh2 = raid5_get_active_stripe(conf, s, 0, 0, 1, 1);
> if (sh2 == NULL)
> /* so far only the early blocks of this stripe
> * have been requested. When later blocks
> @@ -5164,7 +5171,7 @@ static void handle_stripe(struct stripe_head *sh)
> /* Finish reconstruct operations initiated by the expansion process */
> if (sh->reconstruct_state == reconstruct_state_result) {
> struct stripe_head *sh_src
> - = raid5_get_active_stripe(conf, sh->sector, 1, 1, 1);
> + = raid5_get_active_stripe(conf, sh->sector, 0, 1, 1, 1);
> if (sh_src && test_bit(STRIPE_EXPAND_SOURCE, &sh_src->state)) {
> /* sh cannot be written until sh_src has been read.
> * so arrange for sh to be delayed a little
> @@ -5705,7 +5712,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
> DEFINE_WAIT(w);
> int d;
> again:
> - sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0);
> + sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0, 0);
> prepare_to_wait(&conf->wait_for_overlap, &w,
> TASK_UNINTERRUPTIBLE);
> set_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags);
> @@ -5861,7 +5868,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
> (unsigned long long)new_sector,
> (unsigned long long)logical_sector);
>
> - sh = raid5_get_active_stripe(conf, new_sector, previous,
> + sh = raid5_get_active_stripe(conf, new_sector, previous, 0,


Mistake here (fourth argument added instead of third)

> (bi->bi_opf & REQ_RAHEAD), 0);
> if (sh) {
> if (unlikely(previous)) {
> @@ -6100,7 +6107,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
> for (i = 0; i < reshape_sectors; i += RAID5_STRIPE_SECTORS(conf)) {
> int j;
> int skipped_disk = 0;
> - sh = raid5_get_active_stripe(conf, stripe_addr+i, 0, 0, 1);
> + sh = raid5_get_active_stripe(conf, stripe_addr+i, 0, 0, 0, 1);
> set_bit(STRIPE_EXPANDING, &sh->state);
> atomic_inc(&conf->reshape_stripes);
> /* If any of this stripe is beyond the end of the old
> @@ -6149,7 +6156,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
> if (last_sector >= mddev->dev_sectors)
> last_sector = mddev->dev_sectors - 1;
> while (first_sector <= last_sector) {
> - sh = raid5_get_active_stripe(conf, first_sector, 1, 0, 1);
> + sh = raid5_get_active_stripe(conf, first_sector, 0, 1, 0, 1);
> set_bit(STRIPE_EXPAND_SOURCE, &sh->state);
> set_bit(STRIPE_HANDLE, &sh->state);
> raid5_release_stripe(sh);
> @@ -6269,9 +6276,14 @@ static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_n
>
> md_bitmap_cond_end_sync(mddev->bitmap, sector_nr, false);
>
> - sh = raid5_get_active_stripe(conf, sector_nr, 0, 1, 0);
> + sh = raid5_get_active_stripe(conf, sector_nr, 1, 0, 1, 0);
> if (sh == NULL) {
> - sh = raid5_get_active_stripe(conf, sector_nr, 0, 0, 0);
> + sh = raid5_get_active_stripe(conf, sector_nr, 1, 0, 0, 0);
> + if (!sh && test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
> + *skipped = 1;
> + return 0;
> + }
> +
> /* make sure we don't swamp the stripe cache if someone else
> * is trying to get access
> */
> @@ -6334,7 +6346,7 @@ static int retry_aligned_read(struct r5conf *conf, struct bio *raid_bio,
> /* already done this stripe */
> continue;
>
> - sh = raid5_get_active_stripe(conf, sector, 0, 1, 1);
> + sh = raid5_get_active_stripe(conf, sector, 0, 0, 1, 1);
>
> if (!sh) {
> /* failed to get a stripe - must wait */
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 5c05acf..d9eab46 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -806,7 +806,7 @@ extern sector_t raid5_compute_sector(struct r5conf *conf, sector_t r_sector,
> int previous, int *dd_idx,
> struct stripe_head *sh);
> extern struct stripe_head *
> -raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
> +raid5_get_active_stripe(struct r5conf *conf, sector_t sector, int sync_req,
> int previous, int noblock, int noquiesce);
> extern int raid5_calc_degraded(struct r5conf *conf);
> extern int r5c_journal_mode_set(struct mddev *mddev, int journal_mode);

Unfortunately, this patch did not fix the issue.

root@sloth:~/linux# cat /proc/$(pgrep md3_resync)/stack
[<0>] raid5_get_active_stripe+0x1e7/0x6b0
[<0>] raid5_sync_request+0x2a7/0x3d0
[<0>] md_do_sync.cold+0x3ee/0x97c
[<0>] md_thread+0xab/0x160
[<0>] kthread+0x11b/0x140
[<0>] ret_from_fork+0x22/0x30

which is ( according to objdump -d -Sl drivers/md/raid5.o ) at https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735

Isn't it still the case that the superblocks are not written, therefore stripes are not processed, and therefore the number of active stripes is not decreasing? Who is expected to wake up the conf->wait_for_stripe waiters?
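
(A quick grep over the source lists the candidates, i.e. both the sleepers and the wake_up() sites for that waitqueue; just a sketch:)

    grep -n 'wait_for_stripe' drivers/md/raid5.c drivers/md/raid5.h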

Best
Donald

>
> Thanks,
> Guoqing

2021-01-26 11:49:31

by Guoqing Jiang

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi Donald,

On 1/26/21 10:50, Donald Buczek wrote:
[...]

>>>
>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>> index 2d21c298ffa7..f40429843906 100644
>>> --- a/drivers/md/md.c
>>> +++ b/drivers/md/md.c
>>> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char
>>> *page, size_t len)
>>>               clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>>           if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>>>               mddev_lock(mddev) == 0) {
>>> +            set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>>>               flush_workqueue(md_misc_wq);
>>>               if (mddev->sync_thread) {
>>>                   set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>>>                   md_reap_sync_thread(mddev);
>>>               }
>>> +            clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>>>               mddev_unlock(mddev);
>>>           }
>>>       } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
>>
>> Yes, it could break the deadlock issue, but I am not sure if it is the
>> right way given we only set ALLOW_SB_UPDATE in suspend which makes
>> sense since the io will be quiesced, but write idle action can't
>> guarantee the  similar thing.
>
> Thinking (and documenting) MD_ALLOW_SB_UPDATE as "the holder of
> reconfig_mutex promises not to make any changes which would exclude
> superblocks from being written" might make it easier to accept the usage.

I am not sure it is safe to set the flag here, since "write idle" can't
prevent I/O from the filesystem, while mddev_suspend can guarantee that.

>
>> I prefer to make resync thread not wait forever here.
>>

[...]

>>
>> -        sh = raid5_get_active_stripe(conf, new_sector, previous,
>> +        sh = raid5_get_active_stripe(conf, new_sector, previous, 0,
>
>
> Mistake here (fourth argument added instead of third)

Thanks for checking.

[...]

> Unfortunately, this patch did not fix the issue.
>
>     root@sloth:~/linux# cat /proc/$(pgrep md3_resync)/stack
>     [<0>] raid5_get_active_stripe+0x1e7/0x6b0
>     [<0>] raid5_sync_request+0x2a7/0x3d0
>     [<0>] md_do_sync.cold+0x3ee/0x97c
>     [<0>] md_thread+0xab/0x160
>     [<0>] kthread+0x11b/0x140
>     [<0>] ret_from_fork+0x22/0x30
>
> which is ( according to objdump -d -Sl drivers/md/raid5.o ) at
> https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
>
> Isn't it still the case that the superblocks are not written, therefore
> stripes are not processed, therefore number of active stripes are not
> decreasing? Who is expected to wake up conf->wait_for_stripe waiters?

Hmm, how about waking the waiters up in the while loop of raid5d?

@@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
                        md_check_recovery(mddev);
                        spin_lock_irq(&conf->device_lock);
                }
+
+               if ((atomic_read(&conf->active_stripes)
+                    < (conf->max_nr_stripes * 3 / 4) ||
+                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
+                       wake_up(&conf->wait_for_stripe);
        }
        pr_debug("%d stripes handled\n", handled);

If the issue still appears, then I will change the waiter to break out as
soon as MD_RECOVERY_INTR is set, something like:

wait_event_lock_irq(conf->wait_for_stripe,
        (test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) ||
        /* the previous condition */,
        *(conf->hash_locks + hash));

Thanks,
Guoqing

2021-01-26 13:04:04

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition



On 26.01.21 12:14, Guoqing Jiang wrote:
> Hi Donald,
>
> On 1/26/21 10:50, Donald Buczek wrote:
> [...]
>
>>>>
>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>>> index 2d21c298ffa7..f40429843906 100644
>>>> --- a/drivers/md/md.c
>>>> +++ b/drivers/md/md.c
>>>> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len)
>>>>               clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>>>           if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>>>>               mddev_lock(mddev) == 0) {
>>>> +            set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>>>>               flush_workqueue(md_misc_wq);
>>>>               if (mddev->sync_thread) {
>>>>                   set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>>>>                   md_reap_sync_thread(mddev);
>>>>               }
>>>> +            clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>>>>               mddev_unlock(mddev);
>>>>           }
>>>>       } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
>>>
>>> Yes, it could break the deadlock issue, but I am not sure if it is the right way given we only set ALLOW_SB_UPDATE in suspend which makes sense since the io will be quiesced, but write idle action can't guarantee the  similar thing.
>>
>> Thinking (and documenting) MD_ALLOW_SB_UPDATE as "the holder of reconfig_mutex promises not to make any changes which would exclude superblocks from being written" might make it easier to accept the usage.
>
> I am not sure it is safe to set the flag here since write idle can't prevent IO from fs while mddev_suspend can guarantee that.
>
>>
>>> I prefer to make resync thread not wait forever here.
>>>
>
> [...]
>
>>>
>>> -        sh = raid5_get_active_stripe(conf, new_sector, previous,
>>> +        sh = raid5_get_active_stripe(conf, new_sector, previous, 0,
>>
>>
>> Mistake here (fourth argument added instead of third)
>
> Thanks for checking.
>
> [...]
>
>> Unfortunately, this patch did not fix the issue.
>>
>>      root@sloth:~/linux# cat /proc/$(pgrep md3_resync)/stack
>>      [<0>] raid5_get_active_stripe+0x1e7/0x6b0
>>      [<0>] raid5_sync_request+0x2a7/0x3d0
>>      [<0>] md_do_sync.cold+0x3ee/0x97c
>>      [<0>] md_thread+0xab/0x160
>>      [<0>] kthread+0x11b/0x140
>>      [<0>] ret_from_fork+0x22/0x30
>>
>> which is ( according to objdump -d -Sl drivers/md/raid5.o ) at https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
>>
>> Isn't it still the case that the superblocks are not written, therefore stripes are not processed, therefore number of active stripes are not decreasing? Who is expected to wake up conf->wait_for_stripe waiters?
>
> Hmm, how about wake the waiter up in the while loop of raid5d?
>
> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
>                         md_check_recovery(mddev);
>                         spin_lock_irq(&conf->device_lock);
>                 }
> +
> +               if ((atomic_read(&conf->active_stripes)
> +                    < (conf->max_nr_stripes * 3 / 4) ||
> +                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
> +                       wake_up(&conf->wait_for_stripe);
>         }
>         pr_debug("%d stripes handled\n", handled);

Hmm... With this patch on top of your other one, we still have the basic symptoms (md3_raid6 busy looping), but the sync thread is now hanging at

root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
[<0>] md_do_sync.cold+0x8ec/0x97c
[<0>] md_thread+0xab/0x160
[<0>] kthread+0x11b/0x140
[<0>] ret_from_fork+0x22/0x30

instead, which is https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963

And, unlike before, "md: md3: data-check interrupted." from the pr_info two lines above appears in dmesg.

Best
Donald

> If the issue still appears then I will change the waiter to break just if MD_RECOVERY_INTR is set, something like.
>
> wait_event_lock_irq(conf->wait_for_stripe,
>     (test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) ||
>      /* the previous condition */,
>     *(conf->hash_locks + hash));
>
> Thanks,
> Guoqing

2021-01-26 14:15:22

by Guoqing Jiang

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition



On 1/26/21 13:58, Donald Buczek wrote:
>
>
>> Hmm, how about wake the waiter up in the while loop of raid5d?
>>
>> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
>>                          md_check_recovery(mddev);
>>                          spin_lock_irq(&conf->device_lock);
>>                  }
>> +
>> +               if ((atomic_read(&conf->active_stripes)
>> +                    < (conf->max_nr_stripes * 3 / 4) ||
>> +                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
>> +                       wake_up(&conf->wait_for_stripe);
>>          }
>>          pr_debug("%d stripes handled\n", handled);
>
> Hmm... With this patch on top of your other one, we still have the basic
> symptoms (md3_raid6 busy looping), but the sync thread is now hanging at
>
>     root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>     [<0>] md_do_sync.cold+0x8ec/0x97c
>     [<0>] md_thread+0xab/0x160
>     [<0>] kthread+0x11b/0x140
>     [<0>] ret_from_fork+0x22/0x30
>
> instead, which is
> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963

I am not sure why recovery_active is not zero: it is set to 0 before
blk_start_plug, raid5_sync_request returns 0, and skipped is also set to 1.
Perhaps handle_stripe calls md_done_sync.

Could you double-check the value of recovery_active? Or just don't wait if
the resync thread is interrupted:

wait_event(mddev->recovery_wait,
           test_bit(MD_RECOVERY_INTR, &mddev->recovery) ||
           !atomic_read(&mddev->recovery_active));

> And, unlike before, "md: md3: data-check interrupted." from the pr_info
> two lines above appears in dmesg.

Yes, that is intentional since MD_RECOVERY_INTR is set by write idle.

Anyway, I will try the script and investigate the issue further.

Thanks,
Guoqing

2021-01-26 16:11:23

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear Guoqing,

On 26.01.21 15:06, Guoqing Jiang wrote:
>
>
> On 1/26/21 13:58, Donald Buczek wrote:
>>
>>
>>> Hmm, how about wake the waiter up in the while loop of raid5d?
>>>
>>> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
>>>                          md_check_recovery(mddev);
>>>                          spin_lock_irq(&conf->device_lock);
>>>                  }
>>> +
>>> +               if ((atomic_read(&conf->active_stripes)
>>> +                    < (conf->max_nr_stripes * 3 / 4) ||
>>> +                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
>>> +                       wake_up(&conf->wait_for_stripe);
>>>          }
>>>          pr_debug("%d stripes handled\n", handled);
>>
>> Hmm... With this patch on top of your other one, we still have the basic symptoms (md3_raid6 busy looping), but the sync thread is now hanging at
>>
>>      root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>>      [<0>] md_do_sync.cold+0x8ec/0x97c
>>      [<0>] md_thread+0xab/0x160
>>      [<0>] kthread+0x11b/0x140
>>      [<0>] ret_from_fork+0x22/0x30
>>
>> instead, which is https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963
>
> Not sure why recovery_active is not zero, because it is set 0 before blk_start_plug, and raid5_sync_request returns 0 and skipped is also set to 1. Perhaps handle_stripe calls md_done_sync.
>
> Could you double check the value of recovery_active? Or just don't wait if resync thread is interrupted.
>
> wait_event(mddev->recovery_wait,
>        test_bit(MD_RECOVERY_INTR,&mddev->recovery) ||
>        !atomic_read(&mddev->recovery_active));

With that added, md3_resync goes into a loop, too. Not 100% busy, though.

root@sloth:~# cat /proc/$(pgrep md3_resync)/stack

[<0>] raid5_get_active_stripe+0x1e7/0x6b0 # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
[<0>] raid5_sync_request+0x2a7/0x3d0 # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L6274
[<0>] md_do_sync.cold+0x3ee/0x97c # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/md.c#L8883
[<0>] md_thread+0xab/0x160
[<0>] kthread+0x11b/0x140
[<0>] ret_from_fork+0x22/0x30

Sometimes the top of the stack is raid5_get_active_stripe+0x1ef/0x6b0 instead of raid5_get_active_stripe+0x1e7/0x6b0, so I guess it sleeps, it's woken, but the conditions don't match, so it sleeps again.

Best
Donald

>
>> And, unlike before, "md: md3: data-check interrupted." from the pr_info two lines above appears in dmesg.
>
> Yes, that is intentional since MD_RECOVERY_INTR is set by write idle.
>
> Anyway, will try the script and investigate more about the issue.
>
> Thanks,
> Guoqing

--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2021-01-26 21:58:44

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear Guoqing,

a colleague of mine was able to reproduce the issue inside a VM and to find a procedure which drives the VM into the issue within minutes (rather than unreliably after hours on a physical system, as before). This, of course, helped to pinpoint the problem.

My current theory of what is happening is:

- MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by md_write_start() when file-system I/O wants to do a write and the array transitions from "clean" to "active". (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308)

- Before raid5d gets to write the superblock (it's busy processing active stripes because of the sync activity), userspace wants to pause the check with `echo idle > /sys/block/mdX/md/sync_action`.

- action_store() takes the reconfig_mutex before trying to stop the sync thread. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) The dump of struct mddev from the email of 1/19/21 confirms that reconfig_mutex is non-zero.

- raid5d is running in its main loop. raid5d()->handle_active_stripes() returns a positive batch size ( https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329 ) although raid5d()->handle_active_stripes()->handle_stripe() doesn't process any stripe because of MD_SB_CHANGE_PENDING (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729). This is the reason raid5d is busy-looping.

- raid5d()->md_check_recovery() is called by the raid5d main loop. One of its duties is to write the superblock if a change is pending. However, to do so, it either needs MD_ALLOW_SB_UPDATE or must be able to take the reconfig_mutex (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967 , https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006). Neither is true, so the superblock is not written and MD_SB_CHANGE_PENDING is not cleared.

- (as discussed previously) the sync thread is waiting for the number of active stripes to go down and doesn't terminate. The userspace thread is waiting for the sync thread to terminate.

Does this make sense?

Just for reference, I add the procedure which triggers the issue on the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:

```
#! /bin/bash

(
while true; do
echo "start check"
echo check > /sys/block/md3/md/sync_action
sleep 10
echo "stop check"
echo idle > /sys/block/md3/md/sync_action
sleep 2
done
) &

(
while true; do
dd bs=1k count=$((5*1024*1024)) if=/dev/zero of=/mnt/raid_md3/bigfile status=none
sync /mnt/raid_md3/bigfile
rm /mnt/raid_md3/bigfile
sleep .1
done
) &

start="$(date +%s)"
cd /sys/block/md3/md
wp_count=0
while true; do
array_state=$(cat array_state)
if [ "$array_state" = write-pending ]; then
wp_count=$(($wp_count+1))
else
wp_count=0
fi
echo $(($(date +%s)-$start)) $(cat sync_action) $(cat sync_completed) $array_state $(cat stripe_cache_active)
if [ $wp_count -ge 3 ]; then
kill -- -$$
exit
fi
sleep 1
done
```

The time this needs to trigger the issue varies from under a minute to one hour, with 5 minutes being typical. The output ends like this:

309 check 6283872 / 8378368 active-idle 4144
310 check 6283872 / 8378368 active 1702
311 check 6807528 / 8378368 active 4152
312 check 7331184 / 8378368 clean 3021
stop check
313 check 7331184 / 8378368 write-pending 3905
314 check 7331184 / 8378368 write-pending 3905
315 check 7331184 / 8378368 write-pending 3905
Terminated
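
(When it ends like this, the state discussed in this thread can be captured in one go — a minimal sketch, assuming the md3 array and thread names from the script above, run as root:)

```
# Capture the relevant state once the script prints "Terminated".
cat /sys/block/md3/md/array_state               # expected to stay "write-pending"
cat /proc/"$(pgrep md3_raid6)"/stack            # busy-looping raid thread
cat /proc/"$(pgrep md3_resync)"/stack           # hanging sync thread
top -b -n 1 -p "$(pgrep md3_raid6)" | tail -n 1 # ~100% CPU for md3_raid6
# Find whoever is blocked writing to sync_action (the "echo idle" writer):
for s in /proc/[0-9]*/stack; do
    grep -q action_store "$s" 2>/dev/null && { echo "$s"; cat "$s"; }
done
```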

If I add

pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n", batch_size, released, mddev->sb_flags);

in raid5d after the call to handle_active_stripes and enable the debug location after the deadlock occurred, I get

[ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
[ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
[ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
[ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6

If I add

pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags);

at the head of md_check_recovery, I get:

[ 789.555462] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
[ 789.555477] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
[ 789.555491] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
[ 789.555505] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
[ 789.555520] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006

More debug lines in md_check_recovery confirm the control flow ( `if (mddev_trylock(mddev))` block not taken )
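
(For reference, such pr_debug() locations can be toggled at run time through dynamic debug, so no reboot is needed between experiments — assuming CONFIG_DYNAMIC_DEBUG=y and debugfs mounted at /sys/kernel/debug:)

```
# Enable the pr_debug() calls after the deadlock has occurred ...
echo 'func raid5d +p' > /sys/kernel/debug/dynamic_debug/control
echo 'func md_check_recovery +p' > /sys/kernel/debug/dynamic_debug/control
# ... and switch them off again once enough lines have reached dmesg:
echo 'func raid5d -p' > /sys/kernel/debug/dynamic_debug/control
echo 'func md_check_recovery -p' > /sys/kernel/debug/dynamic_debug/control
```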

What approach would you suggest to fix this?

Best
Donald


2021-02-02 22:43:46

by Guoqing Jiang

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi Donald,

On 1/26/21 17:05, Donald Buczek wrote:
> Dear Guoqing,
>
> On 26.01.21 15:06, Guoqing Jiang wrote:
>>
>>
>> On 1/26/21 13:58, Donald Buczek wrote:
>>>
>>>
>>>> Hmm, how about wake the waiter up in the while loop of raid5d?
>>>>
>>>> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
>>>>                          md_check_recovery(mddev);
>>>>                          spin_lock_irq(&conf->device_lock);
>>>>                  }
>>>> +
>>>> +               if ((atomic_read(&conf->active_stripes)
>>>> +                    < (conf->max_nr_stripes * 3 / 4) ||
>>>> +                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
>>>> +                       wake_up(&conf->wait_for_stripe);
>>>>          }
>>>>          pr_debug("%d stripes handled\n", handled);
>>>
>>> Hmm... With this patch on top of your other one, we still have the
>>> basic symptoms (md3_raid6 busy looping), but the sync thread is now
>>> hanging at
>>>
>>>      root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>>>      [<0>] md_do_sync.cold+0x8ec/0x97c
>>>      [<0>] md_thread+0xab/0x160
>>>      [<0>] kthread+0x11b/0x140
>>>      [<0>] ret_from_fork+0x22/0x30
>>>
>>> instead, which is
>>> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963
>>
>> Not sure why recovery_active is not zero, because it is set 0 before
>> blk_start_plug, and raid5_sync_request returns 0 and skipped is also
>> set to 1. Perhaps handle_stripe calls md_done_sync.
>>
>> Could you double check the value of recovery_active? Or just don't
>> wait if resync thread is interrupted.
>>
>> wait_event(mddev->recovery_wait,
>>         test_bit(MD_RECOVERY_INTR,&mddev->recovery) ||
>>         !atomic_read(&mddev->recovery_active));
>
> With that added, md3_resync goes into a loop, too. Not 100% busy, though.
>
> root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>
> [<0>] raid5_get_active_stripe+0x1e7/0x6b0  #
> https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
> [<0>] raid5_sync_request+0x2a7/0x3d0       #
> https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L6274
> [<0>] md_do_sync.cold+0x3ee/0x97c          #
> https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/md.c#L8883
> [<0>] md_thread+0xab/0x160
> [<0>] kthread+0x11b/0x140
> [<0>] ret_from_fork+0x22/0x30
>
> Sometimes top of stack is raid5_get_active_stripe+0x1ef/0x6b0 instead of
> raid5_get_active_stripe+0x1e7/0x6b0, so I guess it sleeps, its woken,
> but the conditions don't match so its sleeps again.

I don't know why the condition was not true after the change since the
RECOVERY_INTR is set and the caller is raid5_sync_request.

wait_event_lock_irq(conf->wait_for_stripe,
                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) ||
                    /* the previous condition */,
                    *(conf->hash_locks + hash));


BTW, I think there are some possible ways:

1. let "echo idle" give up the reconfig_mutex if there is only a limited
number of active stripes, but I feel it is ugly to check the sh
(stripe_head) count from action_store (kind of a layer violation).

2. make raid5_sync_request -> raid5_get_active_stripe able to quit from
the current situation (this is what we tried, though it doesn't work so
far).

3. set MD_ALLOW_SB_UPDATE as you said, though I am not sure about the
safety (but maybe I am wrong).

4. given that write IO keeps coming from the upper layer, which decreases
the available stripes, maybe we need to call grow_one_stripe at the
beginning of raid5_make_request for this case, then call drop_one_stripe
at the end of make_request.

5. maybe don't hold reconfig_mutex when try to unregister sync_thread,
like this.

/* resync has finished, collect result */
mddev_unlock(mddev);
md_unregister_thread(&mddev->sync_thread);
mddev_lock(mddev);



My suggestion would be to try 2 + 4 together, since the reproducer
triggers both sync IO and write IO. Or try 5. Perhaps there is a better
way which I just can't find.

Thanks,
Guoqing

2021-02-08 12:03:13

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On 02.02.21 16:42, Guoqing Jiang wrote:
> Hi Donald,
>
> On 1/26/21 17:05, Donald Buczek wrote:
>> Dear Guoqing,
>>
>> On 26.01.21 15:06, Guoqing Jiang wrote:
>>>
>>>
>>> On 1/26/21 13:58, Donald Buczek wrote:
>>>>
>>>>
>>>>> Hmm, how about wake the waiter up in the while loop of raid5d?
>>>>>
>>>>> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
>>>>>                          md_check_recovery(mddev);
>>>>>                          spin_lock_irq(&conf->device_lock);
>>>>>                  }
>>>>> +
>>>>> +               if ((atomic_read(&conf->active_stripes)
>>>>> +                    < (conf->max_nr_stripes * 3 / 4) ||
>>>>> +                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
>>>>> +                       wake_up(&conf->wait_for_stripe);
>>>>>          }
>>>>>          pr_debug("%d stripes handled\n", handled);
>>>>
>>>> Hmm... With this patch on top of your other one, we still have the basic symptoms (md3_raid6 busy looping), but the sync thread is now hanging at
>>>>
>>>>      root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>>>>      [<0>] md_do_sync.cold+0x8ec/0x97c
>>>>      [<0>] md_thread+0xab/0x160
>>>>      [<0>] kthread+0x11b/0x140
>>>>      [<0>] ret_from_fork+0x22/0x30
>>>>
>>>> instead, which is https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963
>>>
>>> Not sure why recovery_active is not zero, because it is set 0 before blk_start_plug, and raid5_sync_request returns 0 and skipped is also set to 1. Perhaps handle_stripe calls md_done_sync.
>>>
>>> Could you double check the value of recovery_active? Or just don't wait if resync thread is interrupted.
>>>
>>> wait_event(mddev->recovery_wait,
>>>         test_bit(MD_RECOVERY_INTR,&mddev->recovery) ||
>>>         !atomic_read(&mddev->recovery_active));
>>
>> With that added, md3_resync goes into a loop, too. Not 100% busy, though.
>>
>> root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>>
>> [<0>] raid5_get_active_stripe+0x1e7/0x6b0  # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
>> [<0>] raid5_sync_request+0x2a7/0x3d0       # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L6274
>> [<0>] md_do_sync.cold+0x3ee/0x97c          # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/md.c#L8883
>> [<0>] md_thread+0xab/0x160
>> [<0>] kthread+0x11b/0x140
>> [<0>] ret_from_fork+0x22/0x30
>>
>> Sometimes top of stack is raid5_get_active_stripe+0x1ef/0x6b0 instead of raid5_get_active_stripe+0x1e7/0x6b0, so I guess it sleeps, its woken, but the conditions don't match so its sleeps again.
>
> I don't know why the condition was not true after the change since the RECOVERY_INTR is set and the caller is raid5_sync_request.
>
> wait_event_lock_irq(conf->wait_for_stripe,
>     (test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) ||
>      /* the previous condition */,
>     *(conf->hash_locks + hash));

Not knowing the system by heart, I'd probably need more than a full day to analyze that. Please let me know if this is needed.

> BTW, I think there some some possible ways:
>
> 1. let "echo idle" give up the reconfig_mutex if there are limited number of active stripe, but I feel it is ugly to check sh number from action_store (kind of layer violation).

My first thought is that either holding the mutex is needed or it is not. Why should this depend on the number of available stripes?

And wouldn't we get into the same situation if another reconfiguration from userspace takes the mutex again?

> 2. make raid5_sync_request -> raid5_get_active_stripe can quit from the current situation (this was we tried, though it doesn't work so far).
>
> 3. set MD_ALLOW_SB_UPDATE as you said though I am not sure the safety (but maybe I am wrong).
>
> 4. given the write IO keeps coming from upper layer which decrease the available stripes. Maybe we need to call grow_one_stripe at the beginning of raid5_make_request for this case, then call drop_one_stripe
> at the end of make_request.
>
> 5. maybe don't hold reconfig_mutex when try to unregister sync_thread, like this.
>
>         /* resync has finished, collect result */
>         mddev_unlock(mddev);
>         md_unregister_thread(&mddev->sync_thread);
>         mddev_lock(mddev);

As above: While we wait for the sync thread to terminate, wouldn't it be a problem, if another user space operation takes the mutex?

> My suggestion would be try 2 + 4 together since the reproducer triggers both sync io and write io. Or try 5. Perhaps there is better way which I just can't find.

I just get a bit suspicious that there might be something wrong with the abstraction if special code needs to be added here, here and here.

But I'm afraid that, aside from testing, I can't be of too much help here, as I don't know all the components or details of the system that well.

Best
Donald

> Thanks,
> Guoqing

2021-02-08 14:58:20

by Guoqing Jiang

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition



On 2/8/21 12:38, Donald Buczek wrote:
>> 5. maybe don't hold reconfig_mutex when try to unregister sync_thread,
>> like this.
>>
>>          /* resync has finished, collect result */
>>          mddev_unlock(mddev);
>>          md_unregister_thread(&mddev->sync_thread);
>>          mddev_lock(mddev);
>
> As above: While we wait for the sync thread to terminate, wouldn't it be
> a problem, if another user space operation takes the mutex?

I don't think other places can be blocked while holding the mutex, otherwise
those places could cause a potential deadlock. Please try the above two-line
change. And perhaps others have a better idea.

Thanks,
Guoqing

2021-02-08 20:17:11

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear Guoqing,

On 08.02.21 15:53, Guoqing Jiang wrote:
>
>
> On 2/8/21 12:38, Donald Buczek wrote:
>>> 5. maybe don't hold reconfig_mutex when try to unregister sync_thread, like this.
>>>
>>>          /* resync has finished, collect result */
>>>          mddev_unlock(mddev);
>>>          md_unregister_thread(&mddev->sync_thread);
>>>          mddev_lock(mddev);
>>
>> As above: While we wait for the sync thread to terminate, wouldn't it be a problem, if another user space operation takes the mutex?
>
> I don't think other places can be blocked while hold mutex, otherwise these places can cause potential deadlock. Please try above two lines change. And perhaps others have better idea.

Yes, this works. No deadlock after >11000 seconds,

(Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265, 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )

Best
Donald
>
> Thanks,
> Guoqing

2021-02-09 00:51:37

by Guoqing Jiang

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi Donald,

On 2/8/21 19:41, Donald Buczek wrote:
> Dear Guoqing,
>
> On 08.02.21 15:53, Guoqing Jiang wrote:
>>
>>
>> On 2/8/21 12:38, Donald Buczek wrote:
>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>>> sync_thread, like this.
>>>>
>>>>          /* resync has finished, collect result */
>>>>          mddev_unlock(mddev);
>>>>          md_unregister_thread(&mddev->sync_thread);
>>>>          mddev_lock(mddev);
>>>
>>> As above: While we wait for the sync thread to terminate, wouldn't it
>>> be a problem, if another user space operation takes the mutex?
>>
>> I don't think other places can be blocked while hold mutex, otherwise
>> these places can cause potential deadlock. Please try above two lines
>> change. And perhaps others have better idea.
>
> Yes, this works. No deadlock after >11000 seconds,
>
> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )

Great. I will send a formal patch with your reported-by and tested-by.

Thanks,
Guoqing

2021-02-09 09:32:01

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On 09.02.21 01:46, Guoqing Jiang wrote:

> Great. I will send a formal patch with your reported-by and tested-by.

Yes, that's fine.

Thanks a lot for your help!

Donald

>
> Thanks,
> Guoqing

2023-03-14 13:28:26

by Marc Smith

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
<[email protected]> wrote:
>
> Hi Donald,
>
> On 2/8/21 19:41, Donald Buczek wrote:
> > Dear Guoqing,
> >
> > On 08.02.21 15:53, Guoqing Jiang wrote:
> >>
> >>
> >> On 2/8/21 12:38, Donald Buczek wrote:
> >>>> 5. maybe don't hold reconfig_mutex when try to unregister
> >>>> sync_thread, like this.
> >>>>
> >>>> /* resync has finished, collect result */
> >>>> mddev_unlock(mddev);
> >>>> md_unregister_thread(&mddev->sync_thread);
> >>>> mddev_lock(mddev);
> >>>
> >>> As above: While we wait for the sync thread to terminate, wouldn't it
> >>> be a problem, if another user space operation takes the mutex?
> >>
> >> I don't think other places can be blocked while hold mutex, otherwise
> >> these places can cause potential deadlock. Please try above two lines
> >> change. And perhaps others have better idea.
> >
> > Yes, this works. No deadlock after >11000 seconds,
> >
> > (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
> > 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>
> Great. I will send a formal patch with your reported-by and tested-by.
>
> Thanks,
> Guoqing

I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
of the patches that supposedly resolve this were applied to the stable
kernels, however, one was omitted due to a regression:
md: don't unregister sync_thread with reconfig_mutex held (upstream
commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)

I don't see any follow-up on the thread from June 8th 2022 asking for
this patch to be dropped from all stable kernels since it caused a
regression.

The patch doesn't appear to be present in the current mainline kernel
(6.3-rc2) either. So I assume this issue is still present there, or it
was resolved differently and I just can't find the commit/patch.

I can induce the issue by using Donald's script above which will
eventually result in hangs:
...
147948.504621] INFO: task md_test_2.sh:68033 blocked for more than 122 seconds.
[147948.504624] Tainted: P OE 5.4.229-esos.prod #1
[147948.504624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[147948.504625] md_test_2.sh D 0 68033 1 0x00000004
[147948.504627] Call Trace:
[147948.504634] __schedule+0x4ab/0x4f3
[147948.504637] ? usleep_range+0x7a/0x7a
[147948.504638] schedule+0x67/0x81
[147948.504639] schedule_timeout+0x2c/0xe5
[147948.504643] ? do_raw_spin_lock+0x2b/0x52
[147948.504644] __wait_for_common+0xc4/0x13a
[147948.504647] ? wake_up_q+0x40/0x40
[147948.504649] kthread_stop+0x9a/0x117
[147948.504653] md_unregister_thread+0x43/0x4d
[147948.504655] md_reap_sync_thread+0x1c/0x1d5
[147948.504657] action_store+0xc9/0x284
[147948.504658] md_attr_store+0x9f/0xb8
[147948.504661] kernfs_fop_write+0x10a/0x14c
[147948.504664] vfs_write+0xa0/0xdd
[147948.504666] ksys_write+0x71/0xba
[147948.504668] do_syscall_64+0x52/0x60
[147948.504671] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
...
[147948.504748] INFO: task md120_resync:135315 blocked for more than
122 seconds.
[147948.504749] Tainted: P OE 5.4.229-esos.prod #1
[147948.504749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[147948.504749] md120_resync D 0 135315 2 0x80004000
[147948.504750] Call Trace:
[147948.504752] __schedule+0x4ab/0x4f3
[147948.504754] ? printk+0x53/0x6a
[147948.504755] schedule+0x67/0x81
[147948.504756] md_do_sync+0xae7/0xdd9
[147948.504758] ? remove_wait_queue+0x41/0x41
[147948.504759] md_thread+0x128/0x151
[147948.504761] ? _raw_spin_lock_irqsave+0x31/0x5d
[147948.504762] ? md_start_sync+0xdc/0xdc
[147948.504763] kthread+0xe4/0xe9
[147948.504764] ? kthread_flush_worker+0x70/0x70
[147948.504765] ret_from_fork+0x35/0x40
...

This happens on 'raid6' MD RAID arrays that initially have sync_action==resync.

Any guidance would be greatly appreciated.

--Marc

2023-03-14 13:56:45

by Guoqing Jiang

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition



On 3/14/23 21:25, Marc Smith wrote:
> On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
> <[email protected]> wrote:
>> Hi Donald,
>>
>> On 2/8/21 19:41, Donald Buczek wrote:
>>> Dear Guoqing,
>>>
>>> On 08.02.21 15:53, Guoqing Jiang wrote:
>>>>
>>>> On 2/8/21 12:38, Donald Buczek wrote:
>>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>>>>> sync_thread, like this.
>>>>>>
>>>>>> /* resync has finished, collect result */
>>>>>> mddev_unlock(mddev);
>>>>>> md_unregister_thread(&mddev->sync_thread);
>>>>>> mddev_lock(mddev);
>>>>> As above: While we wait for the sync thread to terminate, wouldn't it
>>>>> be a problem, if another user space operation takes the mutex?
>>>> I don't think other places can be blocked while hold mutex, otherwise
>>>> these places can cause potential deadlock. Please try above two lines
>>>> change. And perhaps others have better idea.
>>> Yes, this works. No deadlock after >11000 seconds,
>>>
>>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
>>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>> Great. I will send a formal patch with your reported-by and tested-by.
>>
>> Thanks,
>> Guoqing
> I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
> of the patches that supposedly resolve this were applied to the stable
> kernels, however, one was omitted due to a regression:
> md: don't unregister sync_thread with reconfig_mutex held (upstream
> commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
>
> I don't see any follow-up on the thread from June 8th 2022 asking for
> this patch to be dropped from all stable kernels since it caused a
> regression.
>
> The patch doesn't appear to be present in the current mainline kernel
> (6.3-rc2) either. So I assume this issue is still present there, or it
> was resolved differently and I just can't find the commit/patch.

It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap
sync_thread in action_store".

Thanks,
Guoqing

2023-03-14 14:46:08

by Marc Smith

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On Tue, Mar 14, 2023 at 9:55 AM Guoqing Jiang <[email protected]> wrote:
>
>
>
> On 3/14/23 21:25, Marc Smith wrote:
> > On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
> > <[email protected]> wrote:
> >> Hi Donald,
> >>
> >> On 2/8/21 19:41, Donald Buczek wrote:
> >>> Dear Guoqing,
> >>>
> >>> On 08.02.21 15:53, Guoqing Jiang wrote:
> >>>>
> >>>> On 2/8/21 12:38, Donald Buczek wrote:
> >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
> >>>>>> sync_thread, like this.
> >>>>>>
> >>>>>> /* resync has finished, collect result */
> >>>>>> mddev_unlock(mddev);
> >>>>>> md_unregister_thread(&mddev->sync_thread);
> >>>>>> mddev_lock(mddev);
> >>>>> As above: While we wait for the sync thread to terminate, wouldn't it
> >>>>> be a problem, if another user space operation takes the mutex?
> >>>> I don't think other places can be blocked while hold mutex, otherwise
> >>>> these places can cause potential deadlock. Please try above two lines
> >>>> change. And perhaps others have better idea.
> >>> Yes, this works. No deadlock after >11000 seconds,
> >>>
> >>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
> >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
> >> Great. I will send a formal patch with your reported-by and tested-by.
> >>
> >> Thanks,
> >> Guoqing
> > I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
> > of the patches that supposedly resolve this were applied to the stable
> > kernels, however, one was omitted due to a regression:
> > md: don't unregister sync_thread with reconfig_mutex held (upstream
> > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
> >
> > I don't see any follow-up on the thread from June 8th 2022 asking for
> > this patch to be dropped from all stable kernels since it caused a
> > regression.
> >
> > The patch doesn't appear to be present in the current mainline kernel
> > (6.3-rc2) either. So I assume this issue is still present there, or it
> > was resolved differently and I just can't find the commit/patch.
>
> It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap
> sync_thread in action_store".

Okay, let me try applying that patch... it does not appear to be
present in my 5.4.229 kernel source. Thanks.

--Marc


>
> Thanks,
> Guoqing

2023-03-15 03:02:51

by Yu Kuai

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition



On 2023/03/14 21:55, Guoqing Jiang wrote:
>
>
> On 3/14/23 21:25, Marc Smith wrote:
>> On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
>> <[email protected]> wrote:
>>> Hi Donald,
>>>
>>> On 2/8/21 19:41, Donald Buczek wrote:
>>>> Dear Guoqing,
>>>>
>>>> On 08.02.21 15:53, Guoqing Jiang wrote:
>>>>>
>>>>> On 2/8/21 12:38, Donald Buczek wrote:
>>>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>>>>>> sync_thread, like this.
>>>>>>>
>>>>>>>           /* resync has finished, collect result */
>>>>>>>           mddev_unlock(mddev);
>>>>>>>           md_unregister_thread(&mddev->sync_thread);
>>>>>>>           mddev_lock(mddev);
>>>>>> As above: While we wait for the sync thread to terminate, wouldn't it
>>>>>> be a problem, if another user space operation takes the mutex?
>>>>> I don't think other places can be blocked while hold mutex, otherwise
>>>>> these places can cause potential deadlock. Please try above two lines
>>>>> change. And perhaps others have better idea.
>>>> Yes, this works. No deadlock after >11000 seconds,
>>>>
>>>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
>>>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>>> Great. I will send a formal patch with your reported-by and tested-by.
>>>
>>> Thanks,
>>> Guoqing
>> I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
>> of the patches that supposedly resolve this were applied to the stable
>> kernels, however, one was omitted due to a regression:
>> md: don't unregister sync_thread with reconfig_mutex held (upstream
>> commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
Hi, Guoqing,

Let me borrow this thread for a discussion: I think this commit might have
a problem in some corner cases:

t1:                t2:
action_store
 mddev_lock
  if (mddev->sync_thread)
   mddev_unlock
   md_unregister_thread
                md_check_recovery
                 set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
                 queue_work(md_misc_wq, &mddev->del_work)
   mddev_lock_nointr
   md_reap_sync_thread
   // clear running
 mddev_lock

t3:
md_start_sync
// running is not set

Our testing reported a problem that, in theory, can be caused by this, but
we can't be sure for now...

We thought about how to fix this: instead of calling
md_register_thread() here to wait for sync_thread to be done
synchronously, we could do this asynchronously, like md_set_readonly()
and do_md_stop() do.

What do you think?

Thanks,
Kuai
>>
>> I don't see any follow-up on the thread from June 8th 2022 asking for
>> this patch to be dropped from all stable kernels since it caused a
>> regression.
>>
>> The patch doesn't appear to be present in the current mainline kernel
>> (6.3-rc2) either. So I assume this issue is still present there, or it
>> was resolved differently and I just can't find the commit/patch.
>
> It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap
> sync_thread in action_store".
>
> Thanks,
> Guoqing
> .
>

2023-03-15 07:52:57

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi,

I can just comment that the simple patch I proposed at https://lore.kernel.org/linux-raid/[email protected]/ has been working for us for over two years now, with several different kernel versions, on currently 195 raid6 JBODs across 105 systems, each going through several "idle->sync->idle" transitions per month.

So if you suffer from the problem and are able to add patches to the kernel you use, you might give it a try.
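
(If it helps, pulling that patch from lore and applying it amounts to something like the following — a sketch only; <message-id> is a placeholder for the real message-id behind the link above:)

```
# Apply the patch referenced above to your kernel tree.
b4 am -o - '<message-id>' | git am
# or, without b4:
curl -L 'https://lore.kernel.org/linux-raid/<message-id>/raw' | git am
```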

Best
Donald

On 3/14/23 14:25, Marc Smith wrote:
> On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
> <[email protected]> wrote:t
>>
>> Hi Donald,
>>
>> On 2/8/21 19:41, Donald Buczek wrote:
>>> Dear Guoqing,
>>>
>>> On 08.02.21 15:53, Guoqing Jiang wrote:
>>>>
>>>>
>>>> On 2/8/21 12:38, Donald Buczek wrote:
>>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>>>>> sync_thread, like this.
>>>>>>
>>>>>> /* resync has finished, collect result */
>>>>>> mddev_unlock(mddev);
>>>>>> md_unregister_thread(&mddev->sync_thread);
>>>>>> mddev_lock(mddev);
>>>>>
>>>>> As above: While we wait for the sync thread to terminate, wouldn't it
>>>>> be a problem, if another user space operation takes the mutex?
>>>>
>>>> I don't think other places can be blocked while hold mutex, otherwise
>>>> these places can cause potential deadlock. Please try above two lines
>>>> change. And perhaps others have better idea.
>>>
>>> Yes, this works. No deadlock after >11000 seconds,
>>>
>>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
>>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>>
>> Great. I will send a formal patch with your reported-by and tested-by.
>>
>> Thanks,
>> Guoqing
>
> I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
> of the patches that supposedly resolve this were applied to the stable
> kernels, however, one was omitted due to a regression:
> md: don't unregister sync_thread with reconfig_mutex held (upstream
> commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
>
> I don't see any follow-up on the thread from June 8th 2022 asking for
> this patch to be dropped from all stable kernels since it caused a
> regression.
>
> The patch doesn't appear to be present in the current mainline kernel
> (6.3-rc2) either. So I assume this issue is still present there, or it
> was resolved differently and I just can't find the commit/patch.
>
> I can induce the issue by using Donald's script above which will
> eventually result in hangs:
> ...
> 147948.504621] INFO: task md_test_2.sh:68033 blocked for more than 122 seconds.
> [147948.504624] Tainted: P OE 5.4.229-esos.prod #1
> [147948.504624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [147948.504625] md_test_2.sh D 0 68033 1 0x00000004
> [147948.504627] Call Trace:
> [147948.504634] __schedule+0x4ab/0x4f3
> [147948.504637] ? usleep_range+0x7a/0x7a
> [147948.504638] schedule+0x67/0x81
> [147948.504639] schedule_timeout+0x2c/0xe5
> [147948.504643] ? do_raw_spin_lock+0x2b/0x52
> [147948.504644] __wait_for_common+0xc4/0x13a
> [147948.504647] ? wake_up_q+0x40/0x40
> [147948.504649] kthread_stop+0x9a/0x117
> [147948.504653] md_unregister_thread+0x43/0x4d
> [147948.504655] md_reap_sync_thread+0x1c/0x1d5
> [147948.504657] action_store+0xc9/0x284
> [147948.504658] md_attr_store+0x9f/0xb8
> [147948.504661] kernfs_fop_write+0x10a/0x14c
> [147948.504664] vfs_write+0xa0/0xdd
> [147948.504666] ksys_write+0x71/0xba
> [147948.504668] do_syscall_64+0x52/0x60
> [147948.504671] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
> ...
> [147948.504748] INFO: task md120_resync:135315 blocked for more than
> 122 seconds.
> [147948.504749] Tainted: P OE 5.4.229-esos.prod #1
> [147948.504749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [147948.504749] md120_resync D 0 135315 2 0x80004000
> [147948.504750] Call Trace:
> [147948.504752] __schedule+0x4ab/0x4f3
> [147948.504754] ? printk+0x53/0x6a
> [147948.504755] schedule+0x67/0x81
> [147948.504756] md_do_sync+0xae7/0xdd9
> [147948.504758] ? remove_wait_queue+0x41/0x41
> [147948.504759] md_thread+0x128/0x151
> [147948.504761] ? _raw_spin_lock_irqsave+0x31/0x5d
> [147948.504762] ? md_start_sync+0xdc/0xdc
> [147948.504763] kthread+0xe4/0xe9
> [147948.504764] ? kthread_flush_worker+0x70/0x70
> [147948.504765] ret_from_fork+0x35/0x40
> ...
>
> This happens on 'raid6' MD RAID arrays that initially have sync_action==resync.
>
> Any guidance would be greatly appreciated.
>
> --Marc

--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2023-03-15 09:30:42

by Guoqing Jiang

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition



On 3/15/23 11:02, Yu Kuai wrote:
>
>
> 在 2023/03/14 21:55, Guoqing Jiang 写道:
>>
>>
>> On 3/14/23 21:25, Marc Smith wrote:
>>> On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
>>> <[email protected]> wrote:
>>>> Hi Donald,
>>>>
>>>> On 2/8/21 19:41, Donald Buczek wrote:
>>>>> Dear Guoqing,
>>>>>
>>>>> On 08.02.21 15:53, Guoqing Jiang wrote:
>>>>>>
>>>>>> On 2/8/21 12:38, Donald Buczek wrote:
>>>>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>>>>>>> sync_thread, like this.
>>>>>>>>
>>>>>>>>           /* resync has finished, collect result */
>>>>>>>>           mddev_unlock(mddev);
>>>>>>>> md_unregister_thread(&mddev->sync_thread);
>>>>>>>>           mddev_lock(mddev);
>>>>>>> As above: While we wait for the sync thread to terminate,
>>>>>>> wouldn't it
>>>>>>> be a problem, if another user space operation takes the mutex?
>>>>>> I don't think other places can be blocked while hold mutex,
>>>>>> otherwise
>>>>>> these places can cause potential deadlock. Please try above two
>>>>>> lines
>>>>>> change. And perhaps others have better idea.
>>>>> Yes, this works. No deadlock after >11000 seconds,
>>>>>
>>>>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
>>>>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>>>> Great. I will send a formal patch with your reported-by and tested-by.
>>>>
>>>> Thanks,
>>>> Guoqing
>>> I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
>>> of the patches that supposedly resolve this were applied to the stable
>>> kernels, however, one was omitted due to a regression:
>>> md: don't unregister sync_thread with reconfig_mutex held (upstream
>>> commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
> Hi, Guoqing,
>
> Just borrow this thread to discuss, I think this commit might have
> problem in some corner cases:
>
> t1:                t2:
> action_store
>  mddev_lock
>   if (mddev->sync_thread)
>    mddev_unlock
>    md_unregister_thread
>                 md_check_recovery
>                  set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
>                  queue_work(md_misc_wq, &mddev->del_work)
>    mddev_lock_nointr
>    md_reap_sync_thread
>    // clear running
>  mddev_lock
>
> t3:
> md_start_sync
> // running is not set

What does 'running' mean? MD_RECOVERY_RUNNING?

> Our test report a problem that can be cause by this in theory, by we
> can't be sure for now...

I guess you tried to describe a race between

action_store -> md_register_thread

and

md_start_sync -> md_register_thread

Didn't you already fix them in the series?

[PATCH -next 0/5] md: fix uaf for sync_thread

Sorry, I haven't followed the problem or your series closely. I might try
your test with the latest mainline kernel if the test is available somewhere.

> We thought about how to fix this, instead of calling
> md_register_thread() here to wait for sync_thread to be done
> synchronisely,

IMO, md_register_thread just creates and wakes a thread; I'm not sure why
it would wait for sync_thread.

> we do this asynchronously like what md_set_readonly() and 
> do_md_stop() does.

Still, I don't have a clear picture of the problem, so I can't judge it.

Thanks,
Guoqing

2023-03-15 09:55:58

by Yu Kuai

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi,

On 2023/03/15 17:30, Guoqing Jiang wrote:
>
>> Just borrow this thread to discuss, I think this commit might have
>> problem in some corner cases:
>>
>> t1:                t2:
>> action_store
>>  mddev_lock
>>   if (mddev->sync_thread)
>>    mddev_unlock
>>    md_unregister_thread
>>                 md_check_recovery
>>                  set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
>>                  queue_work(md_misc_wq, &mddev->del_work)
>>    mddev_lock_nointr
>>    md_reap_sync_thread
>>    // clear running
>>  mddev_lock
>>
>> t3:
>> md_start_sync
>> // running is not set
>
> What does 'running' mean? MD_RECOVERY_RUNNING?
>
>> Our test report a problem that can be cause by this in theory, by we
>> can't be sure for now...
>
> I guess you tried to describe racy between
>
> action_store -> md_register_thread
>
> and
>
> md_start_sync -> md_register_thread
>
> Didn't you already fix them in the series?
>
> [PATCH -next 0/5] md: fix uaf for sync_thread
>
> Sorry, I didn't follow the problem and also your series, I might try your
> test with latest mainline kernel if the test is available somewhere.
>
>> We thought about how to fix this, instead of calling
>> md_register_thread() here to wait for sync_thread to be done
>> synchronisely,
>
> IMO, md_register_thread just create and wake a thread, not sure why it
> waits for sync_thread.
>
>> we do this asynchronously like what md_set_readonly() and do_md_stop()
>> does.
>
> Still, I don't have clear picture about the problem, so I can't judge it.
>

Sorry that I didn't explain the problem clearly. Let me describe the
problem we hit first:

1) raid10d is waiting for sync_thread to stop:
   raid10d
    md_unregister_thread
     kthread_stop

2) sync_thread is waiting for io to finish:
   md_do_sync
    wait_event(... atomic_read(&mddev->recovery_active) == 0)

3) io is waiting for raid10d to finish (the online crash dump found 2 IOs in
   conf->retry_list)

Additional information from the online crash dump:
mddev->recovery = 29 // DONE, RUNNING, INTR are set

PID: 138293 TASK: ffff0000de89a900 CPU: 7 COMMAND: "md0_resync"
#0 [ffffa00107c178a0] __switch_to at ffffa0010001d75c
#1 [ffffa00107c178d0] __schedule at ffffa001017c7f14
#2 [ffffa00107c179f0] schedule at ffffa001017c880c
#3 [ffffa00107c17a20] md_do_sync at ffffa0010129cdb4
#4 [ffffa00107c17d50] md_thread at ffffa00101290d9c
#5 [ffffa00107c17e50] kthread at ffffa00100187a74

PID: 138294 TASK: ffff0000eba13d80 CPU: 5 COMMAND: "md0_resync"
#0 [ffffa00107e47a60] __switch_to at ffffa0010001d75c
#1 [ffffa00107e47a90] __schedule at ffffa001017c7f14
#2 [ffffa00107e47bb0] schedule at ffffa001017c880c
#3 [ffffa00107e47be0] schedule_timeout at ffffa001017d1298
#4 [ffffa00107e47d50] md_thread at ffffa00101290ee8
#5 [ffffa00107e47e50] kthread at ffffa00100187a74
// there are two sync_threads for md0

I believe the root cause is that two sync_threads exist for the same
mddev, and this is how I think that is possible:

t1:                t2:
action_store
 mddev_lock
  if (mddev->sync_thread)
   mddev_unlock
   md_unregister_thread
   // first sync_thread is done
                md_check_recovery
                 set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
                 queue_work(md_misc_wq, &mddev->del_work)
   mddev_lock_nointr
   md_reap_sync_thread
   // MD_RECOVERY_RUNNING is cleared
  mddev_unlock

t3:
md_start_sync
// second sync_thread is registered

t3:
md_check_recovery
 queue_work(md_misc_wq, &mddev->del_work)
 // MD_RECOVERY_RUNNING is not set, a new sync_thread can be started

This is just a guess; I can't reproduce the problem yet. Please let me
know if you have any questions.
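
(A quick way to check for this symptom on a live system — a sketch assuming the md0 array from the crash dump above; a healthy array should never show more than one mdX_resync thread:)

```
# Count resync kthreads for md0; more than one line of output here would
# match the "two sync_threads" situation described above.
ps -eo pid,comm | awk '$2 == "md0_resync"'
```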

Thanks,
Kuai


2023-03-16 15:25:58

by Marc Smith

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On Tue, Mar 14, 2023 at 10:45 AM Marc Smith <[email protected]> wrote:
>
> On Tue, Mar 14, 2023 at 9:55 AM Guoqing Jiang <[email protected]> wrote:
> >
> >
> >
> > On 3/14/23 21:25, Marc Smith wrote:
> > > On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
> > > <[email protected]> wrote:
> > >> Hi Donald,
> > >>
> > >> On 2/8/21 19:41, Donald Buczek wrote:
> > >>> Dear Guoqing,
> > >>>
> > >>> On 08.02.21 15:53, Guoqing Jiang wrote:
> > >>>>
> > >>>> On 2/8/21 12:38, Donald Buczek wrote:
> > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
> > >>>>>> sync_thread, like this.
> > >>>>>>
> > >>>>>> /* resync has finished, collect result */
> > >>>>>> mddev_unlock(mddev);
> > >>>>>> md_unregister_thread(&mddev->sync_thread);
> > >>>>>> mddev_lock(mddev);
> > >>>>> As above: While we wait for the sync thread to terminate, wouldn't it
> > >>>>> be a problem, if another user space operation takes the mutex?
> > >>>> I don't think other places can be blocked while hold mutex, otherwise
> > >>>> these places can cause potential deadlock. Please try above two lines
> > >>>> change. And perhaps others have better idea.
> > >>> Yes, this works. No deadlock after >11000 seconds,
> > >>>
> > >>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
> > >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
> > >> Great. I will send a formal patch with your reported-by and tested-by.
> > >>
> > >> Thanks,
> > >> Guoqing
> > > I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
> > > of the patches that supposedly resolve this were applied to the stable
> > > kernels, however, one was omitted due to a regression:
> > > md: don't unregister sync_thread with reconfig_mutex held (upstream
> > > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
> > >
> > > I don't see any follow-up on the thread from June 8th 2022 asking for
> > > this patch to be dropped from all stable kernels since it caused a
> > > regression.
> > >
> > > The patch doesn't appear to be present in the current mainline kernel
> > > (6.3-rc2) either. So I assume this issue is still present there, or it
> > > was resolved differently and I just can't find the commit/patch.
> >
> > It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap
> > sync_thread in action_store".
>
> Okay, let me try applying that patch... it does not appear to be
> present in my 5.4.229 kernel source. Thanks.

Yes, applying this '9dfbdafda3b3 "md: unlock mddev before reap
sync_thread in action_store"' patch on top of vanilla 5.4.229 source
appears to fix the problem for me -- I can't reproduce the issue with
the script, and it's been running for >24 hours now. (Previously I was
able to induce the issue within a matter of minutes.)
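
(For anyone else on a 5.4.y tree, the backport boils down to something like the sketch below — it assumes a linux-stable checkout with the mainline history available, and conflicts may still need hand-resolving:)

```
# Cherry-pick the mainline commit onto a 5.4.y base and rebuild.
git checkout -b md-action-store-fix v5.4.229
git cherry-pick 9dfbdafda3b3   # md: unlock mddev before reap sync_thread in action_store
```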


>
> --Marc
>
>
> >
> > Thanks,
> > Guoqing

2023-03-29 00:13:58

by Song Liu

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On Thu, Mar 16, 2023 at 8:25 AM Marc Smith <[email protected]> wrote:
>
> On Tue, Mar 14, 2023 at 10:45 AM Marc Smith <[email protected]> wrote:
> >
> > On Tue, Mar 14, 2023 at 9:55 AM Guoqing Jiang <[email protected]> wrote:
> > >
> > >
> > >
> > > On 3/14/23 21:25, Marc Smith wrote:
> > > > On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
> > > > <[email protected]> wrote:
> > > >> Hi Donald,
> > > >>
> > > >> On 2/8/21 19:41, Donald Buczek wrote:
> > > >>> Dear Guoqing,
> > > >>>
> > > >>> On 08.02.21 15:53, Guoqing Jiang wrote:
> > > >>>>
> > > >>>> On 2/8/21 12:38, Donald Buczek wrote:
> > > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
> > > >>>>>> sync_thread, like this.
> > > >>>>>>
> > > >>>>>> /* resync has finished, collect result */
> > > >>>>>> mddev_unlock(mddev);
> > > >>>>>> md_unregister_thread(&mddev->sync_thread);
> > > >>>>>> mddev_lock(mddev);
> > > >>>>> As above: While we wait for the sync thread to terminate, wouldn't it
> > > >>>>> be a problem, if another user space operation takes the mutex?
> > > >>>> I don't think other places can be blocked while hold mutex, otherwise
> > > >>>> these places can cause potential deadlock. Please try above two lines
> > > >>>> change. And perhaps others have better idea.
> > > >>> Yes, this works. No deadlock after >11000 seconds,
> > > >>>
> > > >>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
> > > >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
> > > >> Great. I will send a formal patch with your reported-by and tested-by.
> > > >>
> > > >> Thanks,
> > > >> Guoqing
> > > > I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
> > > > of the patches that supposedly resolve this were applied to the stable
> > > > kernels, however, one was omitted due to a regression:
> > > > md: don't unregister sync_thread with reconfig_mutex held (upstream
> > > > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
> > > >
> > > > I don't see any follow-up on the thread from June 8th 2022 asking for
> > > > this patch to be dropped from all stable kernels since it caused a
> > > > regression.
> > > >
> > > > The patch doesn't appear to be present in the current mainline kernel
> > > > (6.3-rc2) either. So I assume this issue is still present there, or it
> > > > was resolved differently and I just can't find the commit/patch.
> > >
> > > It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap
> > > sync_thread in action_store".
> >
> > Okay, let me try applying that patch... it does not appear to be
> > present in my 5.4.229 kernel source. Thanks.
>
> Yes, applying this '9dfbdafda3b3 "md: unlock mddev before reap
> sync_thread in action_store"' patch on top of vanilla 5.4.229 source
> appears to fix the problem for me -- I can't reproduce the issue with
> the script, and it's been running for >24 hours now. (Previously I was
> able to induce the issue within a matter of minutes.)

Hi Marc,

Could you please run your reproducer on the md-tmp branch?

https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-tmp

This contains a different version of the fix by Yu Kuai.
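
(Fetching that branch for a test build amounts to something like the following — a sketch, assuming an existing kernel git tree and the usual build/install flow:)

```
# Pull the md-tmp branch carrying the alternative fix series.
git remote add song https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git
git fetch song md-tmp
git checkout -b md-tmp-test song/md-tmp
```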

Thanks,
Song

2023-08-23 06:58:16

by Yu Kuai

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi,

On 2023/08/23 5:16, Dragan Stancevic wrote:
> On Tue, 3/28/23 17:01 Song Liu wrote:
>> On Thu, Mar 16, 2023 at 8:25 AM Marc Smith <[email protected]> wrote:
>> >
>> > On Tue, Mar 14, 2023 at 10:45 AM Marc Smith <[email protected]> wrote:
>> > >
>> > > On Tue, Mar 14, 2023 at 9:55 AM Guoqing Jiang <[email protected]> wrote:
>> > > >
>> > > >
>> > > >
>> > > > On 3/14/23 21:25, Marc Smith wrote:
>> > > > > On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
>> > > > > <[email protected]> wrote:
>> > > > >> Hi Donald,
>> > > > >>
>> > > > >> On 2/8/21 19:41, Donald Buczek wrote:
>> > > > >>> Dear Guoqing,
>> > > > >>>
>> > > > >>> On 08.02.21 15:53, Guoqing Jiang wrote:
>> > > > >>>>
>> > > > >>>> On 2/8/21 12:38, Donald Buczek wrote:
>> > > > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>> > > > >>>>>> sync_thread, like this.
>> > > > >>>>>>
>> > > > >>>>>> /* resync has finished, collect result */
>> > > > >>>>>> mddev_unlock(mddev);
>> > > > >>>>>> md_unregister_thread(&mddev->sync_thread);
>> > > > >>>>>> mddev_lock(mddev);
>> > > > >>>>> As above: While we wait for the sync thread to terminate, wouldn't it
>> > > > >>>>> be a problem, if another user space operation takes the mutex?
>> > > > >>>> I don't think other places can be blocked while hold mutex, otherwise
>> > > > >>>> these places can cause potential deadlock. Please try above two lines
>> > > > >>>> change. And perhaps others have better idea.
>> > > > >>> Yes, this works. No deadlock after >11000 seconds,
>> > > > >>>
>> > > > >>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
>> > > > >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>> > > > >> Great. I will send a formal patch with your reported-by and tested-by.
>> > > > >>
>> > > > >> Thanks,
>> > > > >> Guoqing
>> > > > > I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
>> > > > > of the patches that supposedly resolve this were applied to the stable
>> > > > > kernels, however, one was omitted due to a regression:
>> > > > > md: don't unregister sync_thread with reconfig_mutex held (upstream
>> > > > > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
>> > > > >
>> > > > > I don't see any follow-up on the thread from June 8th 2022 asking for
>> > > > > this patch to be dropped from all stable kernels since it caused a
>> > > > > regression.
>> > > > >
>> > > > > The patch doesn't appear to be present in the current mainline kernel
>> > > > > (6.3-rc2) either. So I assume this issue is still present there, or it
>> > > > > was resolved differently and I just can't find the commit/patch.
>> > > >
>> > > > It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap
>> > > > sync_thread in action_store".
>> > >
>> > > Okay, let me try applying that patch... it does not appear to be
>> > > present in my 5.4.229 kernel source. Thanks.
>> >
>> > Yes, applying this '9dfbdafda3b3 "md: unlock mddev before reap
>> > sync_thread in action_store"' patch on top of vanilla 5.4.229 source
>> > appears to fix the problem for me -- I can't reproduce the issue with
>> > the script, and it's been running for >24 hours now. (Previously I was
>> > able to induce the issue within a matter of minutes.)
>>
>> Hi Marc,
>>
>> Could you please run your reproducer on the md-tmp branch?
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-tmp
>>
>> This contains a different version of the fix by Yu Kuai.
>>
>> Thanks,
>> Song
>>
>
> Hi Song, I can easily reproduce this issue on 5.10.133 and 5.10.53. The change
> "9dfbdafda3b3 "md: unlock mddev before reap" does not fix the issue for me.
>
> But I did pull the changes from the md-tmp branch you are refering:
> https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-tmp
>
> I was not totally clear on which change exactly to pull, but I pulled
> the following changes:
> 2023-03-28 md: enhance checking in md_check_recovery()md-tmp Yu Kuai 1 -7/+15
> 2023-03-28 md: wake up 'resync_wait' at last in md_reap_sync_thread() Yu Kuai 1 -1/+1
> 2023-03-28 md: refactor idle/frozen_sync_thread() Yu Kuai 2 -4/+22
> 2023-03-28 md: add a mutex to synchronize idle and frozen in action_store() Yu Kuai 2 -0/+8
> 2023-03-28 md: refactor action_store() for 'idle' and 'frozen' Yu Kuai 1 -16/+45
>
> I used to be able to reproduce the lockup within minutes, but with those
> changes the test system has been running for more than 120 hours.
>
> When you said a "different fix", can you confirm that I grabbed the right
> changes and that I need all 5 of them.

Yes, you grabbed the right changes, and these patches are merged into
linux-next as well.
>
> And second question was, has this fix been submitted upstream yet?
> If so which kernel version?

This fix is currently in linux-next, and will be applied to v6.6-rc1
soon.

Thanks,
Kuai

>
> Thank you
>
>
> .
>


2023-08-30 18:34:36

by Yu Kuai

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi,

On 2023/08/29 4:32, Dragan Stancevic wrote:

> Just a followup on 6.1 testing. I tried reproducing this problem for 5
> days with 6.1.42 kernel without your patches and I was not able to
> reproduce it.
>
> It seems that 6.1 has some other code that prevents this from happening.
>

I see that there are lots of patches for raid456 between 5.10 and 6.1;
however, I remember that I used to reproduce the deadlock after 6.1, and
it's true that it's not easy to reproduce, see below:

https://lore.kernel.org/linux-raid/[email protected]/

My guess is that the deadlock is harder to reproduce on 6.1 than on 5.10
due to some changes inside raid456.

By the way, raid10 had a similar deadlock and can be fixed the same way,
so it makes sense to backport these patches.

https://lore.kernel.org/r/[email protected]

Thanks,
Kuai


> On 5.10 I can reproduce it within minutes to an hour.
>


2023-09-05 16:42:12

by Yu Kuai

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi,

On 2023/08/30 9:36, Yu Kuai wrote:
> Hi,
>
> 在 2023/08/29 4:32, Dragan Stancevic 写道:
>
>> Just a followup on 6.1 testing. I tried reproducing this problem for 5
>> days with 6.1.42 kernel without your patches and I was not able to
>> reproduce it.

Oops, I forgot that you need to backport this patch first to reproduce
this problem:

https://lore.kernel.org/all/[email protected]/

The patch fixes the deadlock as well, but it introduces some regressions.

Thanks,
Kuai

>>
>> It seems that 6.1 has some other code that prevents this from happening.
>>
>
> I see that there are lots of patches for raid456 between 5.10 and 6.1,
> however, I remember that I used to reporduce the deadlock after 6.1, and
> it's true it's not easy to reporduce, see below:
>
> https://lore.kernel.org/linux-raid/[email protected]/
>
>
> My guess is that 6.1 is harder to reporduce than 5.10 due to some
> changes inside raid456.
>
> By the way, raid10 had a similiar deadlock, and can be fixed the same
> way, so it make sense to backport these patches.
>
> https://lore.kernel.org/r/[email protected]
>
> Thanks,
> Kuai
>
>
>> On 5.10 I can reproduce it within minutes to an hour.
>>
>
> .
>

2023-09-13 09:10:22

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On 9/5/23 3:54 PM, Dragan Stancevic wrote:
> On 9/4/23 22:50, Yu Kuai wrote:
>> Hi,
>>
>> 在 2023/08/30 9:36, Yu Kuai 写道:
>>> Hi,
>>>
>>> 在 2023/08/29 4:32, Dragan Stancevic 写道:
>>>
>>>> Just a followup on 6.1 testing. I tried reproducing this problem for 5 days with 6.1.42 kernel without your patches and I was not able to reproduce it.
>>
>> oops, I forgot that you need to backport this patch first to reporduce
>> this problem:
>>
>> https://lore.kernel.org/all/[email protected]/
>>
>> The patch fix the deadlock as well, but it introduce some regressions.

We've just had an unplanned lockup on the "check" to "idle" transition with 6.1.52 after a few hours on a backup server. For the last 2 1/2 years we have used the patch I originally proposed [1] with multiple kernel versions. But this no longer seems to be valid, or maybe it's even destructive in combination with the other changes.

But I've totally lost track of the further development. As I understand it, there are patches queued up for mainline which should fix the problem and which might go into 6.1, too, but they have not landed there yet?

Can anyone give me exact references to the patches I'd need to apply to 6.1.52, so that I can probably fix my problem and also test the patches for you on production systems with a load which tends to run into that problem easily?

Thanks

Donald

[1]: https://lore.kernel.org/linux-raid/[email protected]/

> Ha, jinx :) I was about to email you that I isolated that change with the testing over the weekend that made it more difficult to reproduce in 6.1 and that the original change must be reverted :)
>
>
>
>>
>> Thanks,
>> Kuai
>>
>>>>
>>>> It seems that 6.1 has some other code that prevents this from happening.
>>>>
>>>
>>> I see that there are lots of patches for raid456 between 5.10 and 6.1,
>>> however, I remember that I used to reporduce the deadlock after 6.1, and
>>> it's true it's not easy to reporduce, see below:
>>>
>>> https://lore.kernel.org/linux-raid/[email protected]/
>>>
>>> My guess is that 6.1 is harder to reporduce than 5.10 due to some
>>> changes inside raid456.
>>>
>>> By the way, raid10 had a similiar deadlock, and can be fixed the same
>>> way, so it make sense to backport these patches.
>>>
>>> https://lore.kernel.org/r/[email protected]
>>>
>>> Thanks,
>>> Kuai
>>>
>>>
>>>> On 5.10 I can reproduce it within minutes to an hour.
>>>>
>>>
>>> .
>>>
>>
>


--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2023-09-14 01:26:19

by Dragan Stancevic

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi Donald-

On 9/13/23 04:08, Donald Buczek wrote:
> On 9/5/23 3:54 PM, Dragan Stancevic wrote:
>> On 9/4/23 22:50, Yu Kuai wrote:
>>> Hi,
>>>
>>>> On 2023/08/30 9:36, Yu Kuai wrote:
>>>> Hi,
>>>>
>>>>> On 2023/08/29 4:32, Dragan Stancevic wrote:
>>>>
>>>>> Just a followup on 6.1 testing. I tried reproducing this problem for 5 days with 6.1.42 kernel without your patches and I was not able to reproduce it.
>>>
>>> oops, I forgot that you need to backport this patch first to reporduce
>>> this problem:
>>>
>>> https://lore.kernel.org/all/[email protected]/
>>>
>>> The patch fix the deadlock as well, but it introduce some regressions.
>
> We've just got an unplanned lock up on "check" to "idle" transition with 6.1.52 after a few hours on a backup server. For the last 2 1/2 years we used the patch I originally proposed with multiple kernel versions [1]. But this no longer seems to be valid or maybe its even destructive in combination with the other changes.
>
> But I totally lost track of the further development. As I understood, there are patches queue up in mainline, which might go into 6.1, too, but have not landed there which should fix the problem?
>
> Can anyone give me exact references to the patches I'd need to apply to 6.1.52, so that I could probably fix my problem and also test the patches for you those on production systems with a load which tends to run into that problem easily?

Here is a list of changes for 6.1:

e5e9b9cb71a0 md: factor out a helper to wake up md_thread directly
f71209b1f21c md: enhance checking in md_check_recovery()
753260ed0b46 md: wake up 'resync_wait' at last in md_reap_sync_thread()
130443d60b1b md: refactor idle/frozen_sync_thread() to fix deadlock
6f56f0c4f124 md: add a mutex to synchronize idle and frozen in action_store()
64e5e09afc14 md: refactor action_store() for 'idle' and 'frozen'
a865b96c513b Revert "md: unlock mddev before reap sync_thread in action_store"

You can get them from the following tree:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
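
(For illustration only, a backport onto a 6.1.y tree might look roughly like
the following; the branch name is just a placeholder, and it assumes the
v6.1.52 tag is available locally and that the commit ids above can be
resolved from the linux-next remote. Conflicts may still need hand-fixing.)

  # make the commit ids above resolvable
  git remote add linux-next https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
  git fetch linux-next
  # start from the stable release and apply the series oldest-first
  git checkout -b md-deadlock-backport v6.1.52
  git cherry-pick a865b96c513b 64e5e09afc14 6f56f0c4f124 \
                  130443d60b1b 753260ed0b46 f71209b1f21c e5e9b9cb71a0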


>
> Thanks
>
> Donald
>
> [1]: https://lore.kernel.org/linux-raid/[email protected]/
>
>> Ha, jinx :) I was about to email you that I isolated that change with the testing over the weekend that made it more difficult to reproduce in 6.1 and that the original change must be reverted :)
>>
>>
>>
>>>
>>> Thanks,
>>> Kuai
>>>
>>>>>
>>>>> It seems that 6.1 has some other code that prevents this from happening.
>>>>>
>>>>
>>>> I see that there are lots of patches for raid456 between 5.10 and 6.1,
>>>> however, I remember that I used to reporduce the deadlock after 6.1, and
>>>> it's true it's not easy to reporduce, see below:
>>>>
>>>> https://lore.kernel.org/linux-raid/[email protected]/
>>>>
>>>> My guess is that 6.1 is harder to reporduce than 5.10 due to some
>>>> changes inside raid456.
>>>>
>>>> By the way, raid10 had a similiar deadlock, and can be fixed the same
>>>> way, so it make sense to backport these patches.
>>>>
>>>> https://lore.kernel.org/r/[email protected]
>>>>
>>>> Thanks,
>>>> Kuai
>>>>
>>>>
>>>>> On 5.10 I can reproduce it within minutes to an hour.
>>>>>
>>>>
>>>> .
>>>>
>>>
>>
>
>

--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla

2023-09-14 06:08:35

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition



On 9/13/23 16:16, Dragan Stancevic wrote:
> Hi Donald-
>
> On 9/13/23 04:08, Donald Buczek wrote:
>> On 9/5/23 3:54 PM, Dragan Stancevic wrote:
>>> On 9/4/23 22:50, Yu Kuai wrote:
>>>> Hi,
>>>>
>>>>> On 2023/08/30 9:36, Yu Kuai wrote:
>>>>> Hi,
>>>>>
>>>>>> On 2023/08/29 4:32, Dragan Stancevic wrote:
>>>>>
>>>>>> Just a followup on 6.1 testing. I tried reproducing this problem for 5 days with 6.1.42 kernel without your patches and I was not able to reproduce it.
>>>>
>>>> oops, I forgot that you need to backport this patch first to reporduce
>>>> this problem:
>>>>
>>>> https://lore.kernel.org/all/[email protected]/
>>>>
>>>> The patch fix the deadlock as well, but it introduce some regressions.
>>
>> We've just got an unplanned lock up on "check" to "idle" transition with 6.1.52 after a few hours on a backup server. For the last 2 1/2 years we used the patch I originally proposed with multiple kernel versions [1]. But this no longer seems to be valid or maybe its even destructive in combination with the other changes.
>>
>> But I totally lost track of the further development. As I understood, there are patches queue up in mainline, which might go into 6.1, too, but have not landed there which should fix the problem?
>>
>> Can anyone give me exact references to the patches I'd need to apply to 6.1.52, so that I could probably fix my problem and also test the patches for you those on production systems with a load which tends to run into that problem easily?
>
> Here is a list of changes for 6.1:
>
> e5e9b9cb71a0 md: factor out a helper to wake up md_thread directly
> f71209b1f21c md: enhance checking in md_check_recovery()
> 753260ed0b46 md: wake up 'resync_wait' at last in md_reap_sync_thread()
> 130443d60b1b md: refactor idle/frozen_sync_thread() to fix deadlock
> 6f56f0c4f124 md: add a mutex to synchronize idle and frozen in action_store()
> 64e5e09afc14 md: refactor action_store() for 'idle' and 'frozen'
> a865b96c513b Revert "md: unlock mddev before reap sync_thread in action_store"

Thanks!

I've put these patches on v6.1.52. A few hours ago I started a script which transitions the three md devices of a very active backup server through idle->check->idle every 6 minutes. It has gone through ~400 iterations so far. No lock-ups yet.
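
(A minimal sketch of such a cycling loop, using the device names and
interval from above; this is only an illustration, not the actual script:)

  while true; do
      for md in md0 md1 md2; do
          echo check > /sys/devices/virtual/block/$md/md/sync_action
      done
      sleep 360    # let the check run for ~6 minutes
      for md in md0 md1 md2; do
          echo idle > /sys/devices/virtual/block/$md/md/sync_action
      done
      sleep 10     # short pause before the next iteration
  done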

LGTM !

Donald

buczek@done:~$ dmesg|grep "data-check of RAID array"|wc
393 2820 18864
buczek@done:~$ cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md2 : active raid6 sdc[0] sdo[15] sdn[14] sdm[13] sdl[12] sdk[11] sdj[10] sdi[9] sdh[8] sdg[7] sdf[6] sde[5] sdd[4] sdr[3] sdq[2] sdp[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[=========>...........] check = 47.1% (3681799128/7813894144) finish=671.8min speed=102496K/sec
bitmap: 0/59 pages [0KB], 65536KB chunk

md1 : active raid6 sdaa[0] sdz[15] sdy[14] sdx[13] sdw[12] sdv[11] sdu[10] sdt[16] sds[8] sdah[7] sdag[17] sdaf[5] sdae[4] sdad[3] sdac[2] sdab[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[=======>.............] check = 38.5% (3009484896/7813894144) finish=811.0min speed=98720K/sec
bitmap: 0/59 pages [0KB], 65536KB chunk

md0 : active raid6 sdai[0] sdax[15] sdaw[16] sdav[13] sdau[12] sdat[11] sdas[10] sdar[9] sdaq[8] sdap[7] sdao[6] sdan[17] sdam[4] sdal[3] sdak[2] sdaj[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[========>............] check = 42.3% (3311789940/7813894144) finish=911.9min speed=82272K/sec
bitmap: 6/59 pages [24KB], 65536KB chunk

unused devices: <none>





>
> You can get them from the following tree:
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>
>
>>
>> Thanks
>>
>>    Donald
>>
>> [1]: https://lore.kernel.org/linux-raid/[email protected]/
>>
>>> Ha, jinx :) I was about to email you that I isolated that change with the testing over the weekend that made it more difficult to reproduce in 6.1 and that the original change must be reverted :)
>>>
>>>
>>>
>>>>
>>>> Thanks,
>>>> Kuai
>>>>
>>>>>>
>>>>>> It seems that 6.1 has some other code that prevents this from happening.
>>>>>>
>>>>>
>>>>> I see that there are lots of patches for raid456 between 5.10 and 6.1,
>>>>> however, I remember that I used to reporduce the deadlock after 6.1, and
>>>>> it's true it's not easy to reporduce, see below:
>>>>>
>>>>> https://lore.kernel.org/linux-raid/[email protected]/
>>>>>
>>>>> My guess is that 6.1 is harder to reporduce than 5.10 due to some
>>>>> changes inside raid456.
>>>>>
>>>>> By the way, raid10 had a similiar deadlock, and can be fixed the same
>>>>> way, so it make sense to backport these patches.
>>>>>
>>>>> https://lore.kernel.org/r/[email protected]
>>>>>
>>>>> Thanks,
>>>>> Kuai
>>>>>
>>>>>
>>>>>> On 5.10 I can reproduce it within minutes to an hour.
>>>>>>
>>>>>
>>>>> .
>>>>>
>>>>
>>>
>>
>>
>

--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2023-09-17 08:58:48

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On 9/14/23 08:03, Donald Buczek wrote:
> On 9/13/23 16:16, Dragan Stancevic wrote:
>> Hi Donald-
>> [...]
>> Here is a list of changes for 6.1:
>>
>> e5e9b9cb71a0 md: factor out a helper to wake up md_thread directly
>> f71209b1f21c md: enhance checking in md_check_recovery()
>> 753260ed0b46 md: wake up 'resync_wait' at last in md_reap_sync_thread()
>> 130443d60b1b md: refactor idle/frozen_sync_thread() to fix deadlock
>> 6f56f0c4f124 md: add a mutex to synchronize idle and frozen in action_store()
>> 64e5e09afc14 md: refactor action_store() for 'idle' and 'frozen'
>> a865b96c513b Revert "md: unlock mddev before reap sync_thread in action_store"
>
> Thanks!
>
> I've put these patches on v6.1.52. I've started a script which transitions the three md-devices of a very active backup server through idle->check->idle every 6 minutes a few ours ago.  It went through ~400 iterations till now. No lock-ups so far.

Oh dear, it looks like the deadlock problem is _not_ fixed with these patches.

We've had a lockup again after ~3 days of operation. Again, the `echo idle > $sys/md/sync_action` is hanging:

# # /proc/70554/task/70554: mdcheck.safe : /bin/bash /usr/bin/mdcheck.safe --continue --duration 06:00
# cat /proc/70554/task/70554/stack

[<0>] action_store+0x17f/0x390
[<0>] md_attr_store+0x83/0xf0
[<0>] kernfs_fop_write_iter+0x117/0x1b0
[<0>] vfs_write+0x2ce/0x400
[<0>] ksys_write+0x5f/0xe0
[<0>] do_syscall_64+0x43/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

And everything else doing I/O to that specific raid (md0) is dead, too. No task is busy looping.

So as it looks now, we can't go from 5.15.X to 6.1.X as we would like to do. These patches don't fix the problem, and our own patch no longer works with 6.1. Unfortunately, this happened on a production system which I need to reboot and which is not available for further analysis. We'd need to reproduce the problem on a dedicated machine to really work on it.

Here's some more possibly interesting /sys and /proc output and some example task stacks.

/sys/devices/virtual/block/md0/inflight : 0 3936

#/proc/mdstat

Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 sdae[0] sdad[15] sdac[14] sdab[13] sdaa[12] sdz[11] sdy[10] sdx[9] sdw[8] sdv[7] sdu[6] sdt[5] sds[4] sdah[3] sdag[2] sdaf[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
bitmap: 0/59 pages [0KB], 65536KB chunk

md0 : active raid6 sdc[0] sdr[17] sdq[16] sdp[13] sdo[12] sdn[11] sdm[10] sdl[9] sdk[8] sdj[7] sdi[6] sdh[5] sdg[4] sdf[3] sde[2] sdd[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[===>.................] check = 15.9% (1242830396/7813894144) finish=14788.4min speed=7405K/sec
bitmap: 53/59 pages [212KB], 65536KB chunk

unused devices: <none>

# # /proc/66024/task/66024: md0_resync :
# cat /proc/66024/task/66024/stack

[<0>] raid5_get_active_stripe+0x20f/0x4d0
[<0>] raid5_sync_request+0x38b/0x3b0
[<0>] md_do_sync.cold+0x40c/0x985
[<0>] md_thread+0xb1/0x160
[<0>] kthread+0xe7/0x110
[<0>] ret_from_fork+0x22/0x30

# # /proc/939/task/939: md0_raid6 :
# cat /proc/939/task/939/stack

[<0>] md_thread+0x12d/0x160
[<0>] kthread+0xe7/0x110
[<0>] ret_from_fork+0x22/0x30

# # /proc/1228/task/1228: xfsaild/md0 :
# cat /proc/1228/task/1228/stack

[<0>] raid5_get_active_stripe+0x20f/0x4d0
[<0>] raid5_make_request+0x24c/0x1170
[<0>] md_handle_request+0x131/0x220
[<0>] __submit_bio+0x89/0x130
[<0>] submit_bio_noacct_nocheck+0x160/0x360
[<0>] _xfs_buf_ioapply+0x26c/0x420
[<0>] __xfs_buf_submit+0x64/0x1d0
[<0>] xfs_buf_delwri_submit_buffers+0xc5/0x1e0
[<0>] xfsaild+0x2a0/0x880
[<0>] kthread+0xe7/0x110
[<0>] ret_from_fork+0x22/0x30

# # /proc/49747/task/49747: kworker/24:2+xfs-inodegc/md0 :
# cat /proc/49747/task/49747/stack

[<0>] xfs_buf_lock+0x35/0xf0
[<0>] xfs_buf_find_lock+0x45/0xf0
[<0>] xfs_buf_get_map+0x17d/0xa60
[<0>] xfs_buf_read_map+0x52/0x280
[<0>] xfs_trans_read_buf_map+0x115/0x350
[<0>] xfs_btree_read_buf_block.constprop.0+0x9a/0xd0
[<0>] xfs_btree_lookup_get_block+0x97/0x170
[<0>] xfs_btree_lookup+0xc4/0x4a0
[<0>] xfs_difree_finobt+0x62/0x250
[<0>] xfs_difree+0x130/0x1c0
[<0>] xfs_ifree+0x86/0x510
[<0>] xfs_inactive_ifree.isra.0+0xa2/0x1c0
[<0>] xfs_inactive+0xf8/0x170
[<0>] xfs_inodegc_worker+0x90/0x140
[<0>] process_one_work+0x1c7/0x3c0
[<0>] worker_thread+0x4d/0x3c0
[<0>] kthread+0xe7/0x110
[<0>] ret_from_fork+0x22/0x30

# # /proc/49844/task/49844: kworker/30:3+xfs-sync/md0 :
# cat /proc/49844/task/49844/stack

[<0>] __flush_workqueue+0x10e/0x390
[<0>] xlog_cil_push_now.isra.0+0x25/0x90
[<0>] xlog_cil_force_seq+0x7c/0x240
[<0>] xfs_log_force+0x83/0x240
[<0>] xfs_log_worker+0x3b/0xd0
[<0>] process_one_work+0x1c7/0x3c0
[<0>] worker_thread+0x4d/0x3c0
[<0>] kthread+0xe7/0x110
[<0>] ret_from_fork+0x22/0x30


# # /proc/52646/task/52646: kworker/u263:2+xfs-cil/md0 :
# cat /proc/52646/task/52646/stack

[<0>] raid5_get_active_stripe+0x20f/0x4d0
[<0>] raid5_make_request+0x24c/0x1170
[<0>] md_handle_request+0x131/0x220
[<0>] __submit_bio+0x89/0x130
[<0>] submit_bio_noacct_nocheck+0x160/0x360
[<0>] xlog_state_release_iclog+0xf6/0x1d0
[<0>] xlog_write_get_more_iclog_space+0x79/0xf0
[<0>] xlog_write+0x334/0x3b0
[<0>] xlog_cil_push_work+0x501/0x740
[<0>] process_one_work+0x1c7/0x3c0
[<0>] worker_thread+0x4d/0x3c0
[<0>] kthread+0xe7/0x110
[<0>] ret_from_fork+0x22/0x30

# # /proc/52753/task/52753: rm : rm -rf /project/pbackup_gone/data/C8029/home_Cyang/home_Cyang:202306011248:C3019.BEING_DELETED
# cat /proc/52753/task/52753/stack

[<0>] xfs_buf_lock+0x35/0xf0
[<0>] xfs_buf_find_lock+0x45/0xf0
[<0>] xfs_buf_get_map+0x17d/0xa60
[<0>] xfs_buf_read_map+0x52/0x280
[<0>] xfs_trans_read_buf_map+0x115/0x350
[<0>] xfs_read_agi+0x98/0x140
[<0>] xfs_iunlink+0x63/0x1f0
[<0>] xfs_remove+0x280/0x3a0
[<0>] xfs_vn_unlink+0x53/0xa0
[<0>] vfs_rmdir.part.0+0x5e/0x1e0
[<0>] do_rmdir+0x15c/0x1c0
[<0>] __x64_sys_unlinkat+0x4b/0x60
[<0>] do_syscall_64+0x43/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

Best
Donald

--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2023-09-24 14:37:23

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On 9/17/23 10:55, Donald Buczek wrote:
> On 9/14/23 08:03, Donald Buczek wrote:
>> On 9/13/23 16:16, Dragan Stancevic wrote:
>>> Hi Donald-
>>> [...]
>>> Here is a list of changes for 6.1:
>>>
>>> e5e9b9cb71a0 md: factor out a helper to wake up md_thread directly
>>> f71209b1f21c md: enhance checking in md_check_recovery()
>>> 753260ed0b46 md: wake up 'resync_wait' at last in md_reap_sync_thread()
>>> 130443d60b1b md: refactor idle/frozen_sync_thread() to fix deadlock
>>> 6f56f0c4f124 md: add a mutex to synchronize idle and frozen in action_store()
>>> 64e5e09afc14 md: refactor action_store() for 'idle' and 'frozen'
>>> a865b96c513b Revert "md: unlock mddev before reap sync_thread in action_store"
>>
>> Thanks!
>>
>> I've put these patches on v6.1.52. I've started a script which transitions the three md-devices of a very active backup server through idle->check->idle every 6 minutes a few ours ago.  It went through ~400 iterations till now. No lock-ups so far.
>
> Oh dear, looks like the deadlock problem is _not_fixed with these patches.

Some more info after another incident:

- We've hit the deadlock with 5.15.131 (so it is NOT introduced by any of the above patches)
- The symptoms are not exactly the same as with the original year-old problem. Differences:
- - mdX_raid6 is NOT busy looping
- - /sys/devices/virtual/block/mdX/md/array_state says "active" not "write pending"
- - `echo active > /sys/devices/virtual/block/mdX/md/array_state` does not resolve the deadlock
- - After hours in the deadlock state the system resumed operation when a script of mine read(!) lots of sysfs files.
- But in both cases, `echo idle > /sys/devices/virtual/block/mdX/md/sync_action` hangs as does all I/O operation on the raid.

The fact that we didn't hit the problem for many months on 5.15.94 might hint that it was introduced between 5.15.94 and 5.15.131.

We'll try to reproduce the problem on a test machine for analysis, but this may take time (vacation imminent, for one...).

But it's not like these patches caused the problem. And maybe they _did_ fix the original problem, as we didn't hit that one.

Best

Donald

--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2023-09-25 01:11:47

by Yu Kuai

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi,

On 2023/09/24 22:35, Donald Buczek wrote:
> On 9/17/23 10:55, Donald Buczek wrote:
>> On 9/14/23 08:03, Donald Buczek wrote:
>>> On 9/13/23 16:16, Dragan Stancevic wrote:
>>>> Hi Donald-
>>>> [...]
>>>> Here is a list of changes for 6.1:
>>>>
>>>> e5e9b9cb71a0 md: factor out a helper to wake up md_thread directly
>>>> f71209b1f21c md: enhance checking in md_check_recovery()
>>>> 753260ed0b46 md: wake up 'resync_wait' at last in md_reap_sync_thread()
>>>> 130443d60b1b md: refactor idle/frozen_sync_thread() to fix deadlock
>>>> 6f56f0c4f124 md: add a mutex to synchronize idle and frozen in
>>>> action_store()
>>>> 64e5e09afc14 md: refactor action_store() for 'idle' and 'frozen'
>>>> a865b96c513b Revert "md: unlock mddev before reap sync_thread in
>>>> action_store"
>>>
>>> Thanks!
>>>
>>> I've put these patches on v6.1.52. I've started a script which
>>> transitions the three md-devices of a very active backup server
>>> through idle->check->idle every 6 minutes a few ours ago.  It went
>>> through ~400 iterations till now. No lock-ups so far.
>>
>> Oh dear, looks like the deadlock problem is _not_fixed with these
>> patches.
>
> Some more info after another incident:
>
> - We've hit the deadlock with 5.15.131 (so it is NOT introduced by any
> of the above patches)
> - The symptoms are not exactly the same as with the original year-old
> problem. Differences:
> - - mdX_raid6 is NOT busy looping
> - - /sys/devices/virtual/block/mdX/md/array_state says "active" not
> "write pending"
> - - `echo active > /sys/devices/virtual/block/mdX/md/array_state` does
> not resolve the deadlock
> - - After hours in the deadlock state the system resumed operation when
> a script of mine read(!) lots of sysfs files.
> - But in both cases, `echo idle >
> /sys/devices/virtual/block/mdX/md/sync_action` hangs as does all I/O
> operation on the raid.
>
> The fact that we didn't hit the problem for many month on 5.15.94 might
> hint that it was introduced between 5.15.94 and 5.15.131
>
> We'll try to reproduce the problem on a test machine for analysis, but
> this make take time (vacation imminent for one...).
>
> But its not like these patches caused the problem. Any maybe they _did_
> fix the original problem, as we didn't hit that one.

Sorry for the late reply. Yes, this looks like a different problem. I'm
pretty confident that the original problem is fixed, since echo
idle/frozen no longer holds the 'reconfig_mutex' lock while waiting for
the sync_thread to be done.

I'll check patches between 5.15.94 and 5.15.131.
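
(Just as an illustration of one way to list the candidates — the stable
tags and the md file paths below are only the obvious suspects, not a
claim about where the change is:)

  git log --oneline v5.15.94..v5.15.131 -- drivers/md/md.c drivers/md/raid5.c drivers/md/md-bitmap.c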

Thanks,
Kuai

>
> Best
>
>   Donald
>

2023-09-25 09:25:08

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On 9/25/23 03:11, Yu Kuai wrote:
> Hi,
>
> On 2023/09/24 22:35, Donald Buczek wrote:
>> On 9/17/23 10:55, Donald Buczek wrote:
>>> On 9/14/23 08:03, Donald Buczek wrote:
>>>> On 9/13/23 16:16, Dragan Stancevic wrote:
>>>>> Hi Donald-
>>>>> [...]
>>>>> Here is a list of changes for 6.1:
>>>>>
>>>>> e5e9b9cb71a0 md: factor out a helper to wake up md_thread directly
>>>>> f71209b1f21c md: enhance checking in md_check_recovery()
>>>>> 753260ed0b46 md: wake up 'resync_wait' at last in md_reap_sync_thread()
>>>>> 130443d60b1b md: refactor idle/frozen_sync_thread() to fix deadlock
>>>>> 6f56f0c4f124 md: add a mutex to synchronize idle and frozen in action_store()
>>>>> 64e5e09afc14 md: refactor action_store() for 'idle' and 'frozen'
>>>>> a865b96c513b Revert "md: unlock mddev before reap sync_thread in action_store"
>>>>
>>>> Thanks!
>>>>
>>>> I've put these patches on v6.1.52. I've started a script which transitions the three md-devices of a very active backup server through idle->check->idle every 6 minutes a few ours ago.  It went through ~400 iterations till now. No lock-ups so far.
>>>
>>> Oh dear, looks like the deadlock problem is _not_fixed with these patches.
>>
>> Some more info after another incident:
>>
>> - We've hit the deadlock with 5.15.131 (so it is NOT introduced by any of the above patches)
>> - The symptoms are not exactly the same as with the original year-old problem. Differences:
>> - - mdX_raid6 is NOT busy looping
>> - - /sys/devices/virtual/block/mdX/md/array_state says "active" not "write pending"
>> - - `echo active > /sys/devices/virtual/block/mdX/md/array_state` does not resolve the deadlock
>> - - After hours in the deadlock state the system resumed operation when a script of mine read(!) lots of sysfs files.
>> - But in both cases, `echo idle > /sys/devices/virtual/block/mdX/md/sync_action` hangs as does all I/O operation on the raid.
>>
>> The fact that we didn't hit the problem for many month on 5.15.94 might hint that it was introduced between 5.15.94 and 5.15.131
>>
>> We'll try to reproduce the problem on a test machine for analysis, but this make take time (vacation imminent for one...).
>>
>> But its not like these patches caused the problem. Any maybe they _did_ fix the original problem, as we didn't hit that one.
>
> Sorry for the late reply, yes, this looks like a different problem. I'm
> pretty confident that the orignal problem is fixed since that echo
> idle/frozen doesn't hold the lock 'reconfig_mutex' to wait for
> sync_thread to be done.
>
> I'll check patches between 5.15.94 and 5.15.131.

We've got another event today. Here is some more information to save you work. I'm sorry this comes dripping in, but as I said, currently we can't reproduce it at will and only hit it on production machines, where we have limited time to analyze:

* In the last two events, `echo idle > /sys/devices/virtual/block/mdX/md/sync_action` was not even executing. So this is not the trigger; it just happened to be a random victim when the problem occurred the first time. This misled me into believing that this is some variation of the old problem.

* It's not filesystem related: yesterday `blkid -o value -s LABEL /dev/md1` was hanging, too, and today, for example, `df`.

* /sys/devices/virtual/block/md0/inflight today was (frozen at) "2 579"

* iotop showed no disk activity (on the raid) at all. Only a single member device had activity from time to time (usually after ~30 seconds, but sometimes after a few seconds) with usually 1-4 tps, but sometimes more, max 136 tps.

* As I said, I use a script to take a snapshot of various /sys and /proc information, and running this script resolved the deadlock twice.

* The stack traces of mdX_raid6 of the hanging raid recorded in the two events were

[<0>] md_bitmap_unplug.part.0+0xce/0x100
[<0>] raid5d+0xe4/0x5a0
[<0>] md_thread+0xab/0x160
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

and

[<0>] md_super_wait+0x72/0xa0
[<0>] md_bitmap_unplug.part.0+0xce/0x100
[<0>] raid5d+0xe4/0x5a0
[<0>] md_thread+0xab/0x160
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

But note that these were probably taken after the previous commands in the script had already unfrozen the system. Today I manually looked at the stack while the system was still frozen, and it was just

[<0>] md_thread+0x122/0x160
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

* Because I knew that my script seems to unblock the system, I ran it slowly, line by line, to see what actually unfreezes the system. There is one loop which takes "comm", "cmdline" and "stack" of all threads:

for task in /proc/*/task/*; do
echo "# # $task: $(cat $task/comm) : $(cat $task/cmdline | xargs -0 echo)"
cmd cat $task/stack
done

I added a few "read" statements to single-step it. Unfortunately, when it came to the 64 nfsd threads, I got a bit impatient and hit "return" faster than I should have, so when the unfreeze happened, I couldn't say exactly where it was triggered. But it must have been somewhere in this tail:

# # /proc/1299/task/1299: nfsd

[<0>] svc_recv+0x7a7/0x8c0 [sunrpc]
[<0>] nfsd+0xd6/0x140 [nfsd]
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

# # /proc/13/task/13: ksoftirqd/0

[<0>] smpboot_thread_fn+0xf3/0x140
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

# # /proc/130/task/130: cpuhp/22

[<0>] smpboot_thread_fn+0xf3/0x140
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

# # /proc/1300/task/1300: nfsd

[<0>] svc_recv+0x7a7/0x8c0 [sunrpc]
[<0>] nfsd+0xd6/0x140 [nfsd]
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

## (3 more repetitions of other nfsd threads with exactly the same stack skipped here) ##

So it appears that possibly a `cat /proc/PID/stack` of a "ksoftirqd" or (maybe) a "cpuhp" thread unblocks the system. "nfsd" seems unlikely, as there shouldn't be, and wasn't, anything NFS-mounted from this system.

Conclusion: This is probably not related to mdraid at all and might be a problem in the block layer or some other infrastructure subsystem. Do you agree?

Best

Donald
--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2023-09-25 09:39:48

by Yu Kuai

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi,

On 2023/09/25 17:11, Donald Buczek wrote:
> On 9/25/23 03:11, Yu Kuai wrote:
>> Hi,
>>
>> On 2023/09/24 22:35, Donald Buczek wrote:
>>> On 9/17/23 10:55, Donald Buczek wrote:
>>>> On 9/14/23 08:03, Donald Buczek wrote:
>>>>> On 9/13/23 16:16, Dragan Stancevic wrote:
>>>>>> Hi Donald-
>>>>>> [...]
>>>>>> Here is a list of changes for 6.1:
>>>>>>
>>>>>> e5e9b9cb71a0 md: factor out a helper to wake up md_thread directly
>>>>>> f71209b1f21c md: enhance checking in md_check_recovery()
>>>>>> 753260ed0b46 md: wake up 'resync_wait' at last in
>>>>>> md_reap_sync_thread()
>>>>>> 130443d60b1b md: refactor idle/frozen_sync_thread() to fix deadlock
>>>>>> 6f56f0c4f124 md: add a mutex to synchronize idle and frozen in
>>>>>> action_store()
>>>>>> 64e5e09afc14 md: refactor action_store() for 'idle' and 'frozen'
>>>>>> a865b96c513b Revert "md: unlock mddev before reap sync_thread in
>>>>>> action_store"
>>>>>
>>>>> Thanks!
>>>>>
>>>>> I've put these patches on v6.1.52. I've started a script which
>>>>> transitions the three md-devices of a very active backup server
>>>>> through idle->check->idle every 6 minutes a few ours ago.  It went
>>>>> through ~400 iterations till now. No lock-ups so far.
>>>>
>>>> Oh dear, looks like the deadlock problem is _not_fixed with these
>>>> patches.
>>>
>>> Some more info after another incident:
>>>
>>> - We've hit the deadlock with 5.15.131 (so it is NOT introduced by
>>> any of the above patches)
>>> - The symptoms are not exactly the same as with the original year-old
>>> problem. Differences:
>>> - - mdX_raid6 is NOT busy looping
>>> - - /sys/devices/virtual/block/mdX/md/array_state says "active" not
>>> "write pending"
>>> - - `echo active > /sys/devices/virtual/block/mdX/md/array_state`
>>> does not resolve the deadlock
>>> - - After hours in the deadlock state the system resumed operation
>>> when a script of mine read(!) lots of sysfs files.
>>> - But in both cases, `echo idle >
>>> /sys/devices/virtual/block/mdX/md/sync_action` hangs as does all I/O
>>> operation on the raid.
>>>
>>> The fact that we didn't hit the problem for many month on 5.15.94
>>> might hint that it was introduced between 5.15.94 and 5.15.131
>>>
>>> We'll try to reproduce the problem on a test machine for analysis,
>>> but this make take time (vacation imminent for one...).
>>>
>>> But its not like these patches caused the problem. Any maybe they
>>> _did_ fix the original problem, as we didn't hit that one.
>>
>> Sorry for the late reply, yes, this looks like a different problem. I'm
>> pretty confident that the orignal problem is fixed since that echo
>> idle/frozen doesn't hold the lock 'reconfig_mutex' to wait for
>> sync_thread to be done.
>>
>> I'll check patches between 5.15.94 and 5.15.131.
>
> We've got another event today. Some more information to save you work.
> I'm sorry, this comes dripping in, but as I said, currently we can't
> reproduce it and hit it on production machines only, where we have
> limited time to analyze:

There is a way to clarify whether IO is stuck in the underlying disks:

Once the problem is triggered and there is no disk activity:

cat /sys/kernel/debug/block/[disk]/hctx*/sched_tags | grep busy
cat /sys/kernel/debug/block/[disk]/hctx*/tags | grep busy

If busy is not 0, that means IO is stuck in the underlying disks, and this
problem is not related to raid; otherwise, raid is not issuing any IO to the
underlying disks and the problem is related to raid.
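
(As a convenience, a sketch that runs the same check for every member disk
of one array; it assumes debugfs is mounted at /sys/kernel/debug and uses
md0 as an example:)

  for dev in /sys/block/md0/slaves/*; do
      d=$(basename $dev)
      echo "== $d =="
      grep -H busy /sys/kernel/debug/block/$d/hctx*/tags \
                   /sys/kernel/debug/block/$d/hctx*/sched_tags 2>/dev/null
  done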

>
> * In the last two events, "echo idle >
> sys/devices/virtual/block/mdX/md/sync_action" was not even executing.
> This is not a trigger, but was a random victim when it happened the
> first time. This deceived me to believe this is some variation of the
> old problem.
>
> * It's not filesystem related, yesterday `blkid -o value -s LABEL
> /dev/md1` was hanging, too, and today, for example, `df`.
>
> * /sys/devices/virtual/block/md0/inflight today was (frozen at) "2
> 579"
>
> * iotop showed no disk activity (on the raid) at all. Only a single
> member device had activity from time to time (usually after ~30 seconds,
> but sometimes after a few seconds) with usually 1-4 tps, but sometimes
> more, max 136 tps.
>
> * As I said, I use a script to take a snapshot of various /sys and /proc
> information and running this script resolved the deadlock twice.
>
> * The recorded stack traces of mdX_raid6 of the hanging raid recorded in
> the two events were
>
>     [<0>] md_bitmap_unplug.part.0+0xce/0x100
>     [<0>] raid5d+0xe4/0x5a0
>     [<0>] md_thread+0xab/0x160
>     [<0>] kthread+0x127/0x150
>     [<0>] ret_from_fork+0x22/0x30
>
> and
>
>     [<0>] md_super_wait+0x72/0xa0
>     [<0>] md_bitmap_unplug.part.0+0xce/0x100
>     [<0>] raid5d+0xe4/0x5a0
>     [<0>] md_thread+0xab/0x160
>     [<0>] kthread+0x127/0x150
>     [<0>] ret_from_fork+0x22/0x30

The above stack shows that raid issued bitmap IO to the underlying disks and
is waiting for that IO to be done. Unless bitmap IO handling is broken in
raid, this problem should not be related to raid; the debugfs files above
can help to clarify this.

Thanks,
Kuai

>
> But note, that these probably were taken after the previous commands in
> the script already unfroze the system. Today I've manually looked at the
> stack while the system was still frozen, and it was just
>
>     [<0>] md_thread+0x122/0x160
>     [<0>] kthread+0x127/0x150
>     [<0>] ret_from_fork+0x22/0x30
>
> * Because I knew that my script seems to unblock the system, I've run it
> slowly line by line to see what actually unfreezes the system. There is
> one loop which takes "comm" "cmdline" and "stack" of all threads:
>
>     for task in /proc/*/task/*; do
>         echo  "# # $task: $(cat $task/comm) : $(cat $task/cmdline |
> xargs -0 echo)"
>         cmd cat $task/stack
>     done
>
> I've added a few "read" to single-step it. Unfortunately, when it came
> to the 64 nfsd threads, I've got a bit impatient and hit "return" faster
> then I should have and when the unfreeze happened, I couldn't say
> exactly were it was triggered. But it must have been somewhere in this
> tail:
>
> # # /proc/1299/task/1299: nfsd
>
> [<0>] svc_recv+0x7a7/0x8c0 [sunrpc]
> [<0>] nfsd+0xd6/0x140 [nfsd]
> [<0>] kthread+0x127/0x150
> [<0>] ret_from_fork+0x22/0x30
>
> # # /proc/13/task/13: ksoftirqd/0
>
> [<0>] smpboot_thread_fn+0xf3/0x140
> [<0>] kthread+0x127/0x150
> [<0>] ret_from_fork+0x22/0x30
>
> # # /proc/130/task/130: cpuhp/22
>
> [<0>] smpboot_thread_fn+0xf3/0x140
> [<0>] kthread+0x127/0x150
> [<0>] ret_from_fork+0x22/0x30
>
> # # /proc/1300/task/1300: nfsd
>
> [<0>] svc_recv+0x7a7/0x8c0 [sunrpc]
> [<0>] nfsd+0xd6/0x140 [nfsd]
> [<0>] kthread+0x127/0x150
> [<0>] ret_from_fork+0x22/0x30
>
> ## (3 more repetitions of other nfsd threads which exactly the same
> stack skipped here ##
>
> So it appears, that possibly a cat /proc/PID/stack of a "ksoftirqd" or a
> (maybe) a "cpuhp" thread unblocks the system. "nfsd" seems unlikely, as
> there shouldn't and wasn't anything nfs-mounted from this system.
>
> Conclusion: This is probably not related to mdraid at all and might be a
> problem of the block or some infrastructure subsystem. Do you agree?
>
> Best
>
>   Donald