2020-11-28 22:05:03

by Donald Buczek

[permalink] [raw]
Subject: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear Linux mdraid people,

we are using raid6 on several servers. Occasionally we had failures where an mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions.

The last time this happened (in this case with Linux 5.10.0-rc4), I took some data.

The triggering event seems to be the mdcheck cron job trying to pause the ongoing check operation in the morning with

echo idle > /sys/devices/virtual/block/md1/md/sync_action

This doesn't complete. Here's /proc/<pid>/stack of this process:

root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333
UID PID PPID C STIME TTY TIME CMD
root 23333 23331 0 02:00 ? 00:00:00 /bin/bash /usr/bin/mdcheck --continue --duration 06:00
root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat /proc/23333/stack
[<0>] kthread_stop+0x6e/0x150
[<0>] md_unregister_thread+0x3e/0x70
[<0>] md_reap_sync_thread+0x1f/0x1e0
[<0>] action_store+0x141/0x2b0
[<0>] md_attr_store+0x71/0xb0
[<0>] kernfs_fop_write+0x113/0x1a0
[<0>] vfs_write+0xbc/0x250
[<0>] ksys_write+0xa1/0xe0
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
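
For orientation, the write path is roughly the following (simplified from my reading of the 5.10 md.c, not a verbatim copy); the writer ends up in kthread_stop(), which only returns once the sync thread has actually exited:

/* md.c, action_store(), "idle" branch, roughly: */
    if (cmd_match(page, "idle") || cmd_match(page, "frozen")) {
        ...
        if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
            mddev_lock(mddev) == 0) {
            if (mddev->sync_thread) {
                set_bit(MD_RECOVERY_INTR, &mddev->recovery);
                /*
                 * md_reap_sync_thread() -> md_unregister_thread()
                 * -> kthread_stop(): blocks until the mdX_resync
                 * thread has left md_do_sync() and exited. This is
                 * where the writer above is stuck.
                 */
                md_reap_sync_thread(mddev);
            }
            mddev_unlock(mddev);
        }
    }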

Note that md0 had been paused successfully just before.

2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD (/usr/bin/mdcheck --continue --duration "06:00")
2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md0 from 10623180920
2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md: data-check of RAID array md0
2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md1 from 11582849320
2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md: data-check of RAID array md1

2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md: md0: data-check interrupted.
2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at 13351127680
2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md: md1: data-check interrupted.
[ nothing related following.... ]

After that, we see md1_raid6 in a busy loop:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2376 root 20 0 0 0 0 R 100.0 0.0 1387:38 md1_raid6

Also, all processes doing I/O to the md device block.

This is /proc/mdstat:

Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[==================>..] check = 94.0% (7350290348/7813894144) finish=57189.3min speed=135K/sec
bitmap: 0/59 pages [0KB], 65536KB chunk

md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
bitmap: 0/59 pages [0KB], 65536KB chunk

There doesn't seem to be any further progress.

I've taken a function_graph trace of the looping md1_raid6 process: https://owww.molgen.mpg.de/~buczek/2020-11-27_trace.txt (30 MB)

Maybe this helps to get an idea of what might be going on?

Best
Donald

--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433


2020-11-30 02:11:52

by Guoqing Jiang

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition



On 11/28/20 13:25, Donald Buczek wrote:
> Dear Linux mdraid people,
>
> we are using raid6 on several servers. Occasionally we had failures,
> where a mdX_raid6 process seems to go into a busy loop and all I/O to
> the md device blocks. We've seen this on various kernel versions.
>
> The last time this happened (in this case with Linux 5.10.0-rc4), I took
> some data.
>
> The triggering event seems to be the md_check cron job trying to pause
> the ongoing check operation in the morning with
>
>     echo idle > /sys/devices/virtual/block/md1/md/sync_action
>
> This doesn't complete. Here's /proc/stack of this process:
>
>     root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333
>     UID        PID  PPID  C STIME TTY          TIME CMD
>     root     23333 23331  0 02:00 ?        00:00:00 /bin/bash
> /usr/bin/mdcheck --continue --duration 06:00
>     root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat
> /proc/23333/stack
>     [<0>] kthread_stop+0x6e/0x150
>     [<0>] md_unregister_thread+0x3e/0x70
>     [<0>] md_reap_sync_thread+0x1f/0x1e0
>     [<0>] action_store+0x141/0x2b0
>     [<0>] md_attr_store+0x71/0xb0
>     [<0>] kernfs_fop_write+0x113/0x1a0
>     [<0>] vfs_write+0xbc/0x250
>     [<0>] ksys_write+0xa1/0xe0
>     [<0>] do_syscall_64+0x33/0x40
>     [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Note, that md0 has been paused successfully just before.

What is the personality of md0? Is it also raid6?

>
>     2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD
> (/usr/bin/mdcheck --continue --duration "06:00")
>     2020-11-27T02:00:01+01:00 done root: mdcheck continue checking
> /dev/md0 from 10623180920
>     2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md:
> data-check of RAID array md0
>     2020-11-27T02:00:01+01:00 done root: mdcheck continue checking
> /dev/md1 from 11582849320
>     2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md:
> data-check of RAID array md1
>     2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md:
> md0: data-check interrupted.
>     2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at
> 13351127680
>     2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md:
> md1: data-check interrupted.
>     [ nothing related following.... ]
>
> After that, we see md1_raid6 in a busy loop:
>
>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+
> COMMAND
>     2376 root     20   0       0      0      0 R 100.0  0.0   1387:38
> md1_raid6

It seems the reap sync thread was trying to stop md1_raid6 while md1_raid6 was being triggered again and again.
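
For context, both the mdX_raid6 and the mdX_resync kernel threads run md_thread(), which is roughly (simplified, not a verbatim copy):

/* md.c, md_thread(), roughly: */
static int md_thread(void *arg)
{
    struct md_thread *thread = arg;

    while (!kthread_should_stop()) {
        wait_event_interruptible_timeout(thread->wqueue,
            test_bit(THREAD_WAKEUP, &thread->flags) ||
            kthread_should_stop(),
            thread->timeout);
        clear_bit(THREAD_WAKEUP, &thread->flags);
        if (!kthread_should_stop())
            thread->run(thread);    /* raid5d() for mdX_raid6 */
    }
    return 0;
}

So as long as something keeps waking the thread, or raid5d() keeps finding stripes marked for handling, it spins at 100% CPU.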

>
> Also, all processes doing I/O do the md device block.
>
> This is /proc/mdstat:
>
>     Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
> [multipath]
>     md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11]
> sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm
> 2 [16/16] [UUUUUUUUUUUUUUUU]
>           [==================>..]  check = 94.0%
> (7350290348/7813894144) finish=57189.3min speed=135K/sec
>           bitmap: 0/59 pages [0KB], 65536KB chunk
>     md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12]
> sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3]
> sdu[2] sdt[1]
>           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm
> 2 [16/16] [UUUUUUUUUUUUUUUU]
>           bitmap: 0/59 pages [0KB], 65536KB chunk
>

So the RECOVERY_CHECK flag should be set. I'm not sure if this simple change helps, but you may give it a try.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 98bac4f..e2697d0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9300,7 +9300,8 @@ void md_check_recovery(struct mddev *mddev)
 			md_update_sb(mddev, 0);
 
 		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
-		    !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
+		    (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
+		     test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
 			/* resync/recovery still happening */
 			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 			goto unlock;

Thanks,
Guoqing

2020-12-01 09:33:33

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On 30.11.20 at 03:06, Guoqing Jiang wrote:
>
>
> On 11/28/20 13:25, Donald Buczek wrote:
>> Dear Linux mdraid people,
>>
>> we are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions.
>>
>> The last time this happened (in this case with Linux 5.10.0-rc4), I took some data.
>>
>> The triggering event seems to be the md_check cron job trying to pause the ongoing check operation in the morning with
>>
>>      echo idle > /sys/devices/virtual/block/md1/md/sync_action
>>
>> This doesn't complete. Here's /proc/stack of this process:
>>
>>      root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333
>>      UID        PID  PPID  C STIME TTY          TIME CMD
>>      root     23333 23331  0 02:00 ?        00:00:00 /bin/bash /usr/bin/mdcheck --continue --duration 06:00
>>      root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat /proc/23333/stack
>>      [<0>] kthread_stop+0x6e/0x150
>>      [<0>] md_unregister_thread+0x3e/0x70
>>      [<0>] md_reap_sync_thread+0x1f/0x1e0
>>      [<0>] action_store+0x141/0x2b0
>>      [<0>] md_attr_store+0x71/0xb0
>>      [<0>] kernfs_fop_write+0x113/0x1a0
>>      [<0>] vfs_write+0xbc/0x250
>>      [<0>] ksys_write+0xa1/0xe0
>>      [<0>] do_syscall_64+0x33/0x40
>>      [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> Note, that md0 has been paused successfully just before.
>
> What is the personality of md0? Is it also raid6?

Yes.

>
>>
>>      2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD (/usr/bin/mdcheck --continue --duration "06:00")
>>      2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md0 from 10623180920
>>      2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md: data-check of RAID array md0
>>      2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md1 from 11582849320
>>      2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md: data-check of RAID array md1
>>      2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md: md0: data-check interrupted.
>>      2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at 13351127680
>>      2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md: md1: data-check interrupted.
>>      [ nothing related following.... ]
>>
>> After that, we see md1_raid6 in a busy loop:
>>
>>      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>>      2376 root     20   0       0      0      0 R 100.0  0.0   1387:38 md1_raid6
>
> Seems the reap sync thread was trying to stop md1_raid6 while md1_raid6 was triggered again and again.
>
>>
>> Also, all processes doing I/O do the md device block.
>>
>> This is /proc/mdstat:
>>
>>      Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
>>      md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>            [==================>..]  check = 94.0% (7350290348/7813894144) finish=57189.3min speed=135K/sec
>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>      md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>
>
> So the RECOVERY_CHECK flag should be set, not sure if the simple changes
> helps, but you may give it a try.

Thanks. I've copied the original condition block to execute before the modified one and added some logging to it to see if the change actually triggers. I will pause and unpause the check frequently on a busy machine to get this code executed more often.

Best
Donald

>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 98bac4f..e2697d0 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -9300,7 +9300,8 @@ void md_check_recovery(struct mddev *mddev)
>                         md_update_sb(mddev, 0);
>
>                 if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
> -                   !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
> +                   (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
> +                    test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
>                         /* resync/recovery still happening */
>                         clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>                         goto unlock;
>
> Thanks,
> Guoqing

--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2020-12-02 17:31:46

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear Guoqing,

unfortunately the patch didn't fix the problem (unless I messed it up with my logging). This is what I used:

--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9305,6 +9305,14 @@ void md_check_recovery(struct mddev *mddev)
 			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 			goto unlock;
 		}
+		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
+		    (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
+		     test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
+			/* resync/recovery still happening */
+			pr_info("md: XXX BUGFIX applied\n");
+			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+			goto unlock;
+		}
 		if (mddev->sync_thread) {
 			md_reap_sync_thread(mddev);
 			goto unlock;

By pausing and continuing the check four times an hour, I could trigger the problem after about 48 hours. This time, the other device (md0) locked up on `echo idle > /sys/devices/virtual/block/md0/md/sync_action`, while the check of md1 is still ongoing:

Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[=>...................] check = 8.5% (666852112/7813894144) finish=1271.2min speed=93701K/sec
bitmap: 0/59 pages [0KB], 65536KB chunk

md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[>....................] check = 0.2% (19510348/7813894144) finish=253779.6min speed=511K/sec
bitmap: 0/59 pages [0KB], 65536KB chunk

after 1 minute:

Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[=>...................] check = 8.6% (674914560/7813894144) finish=941.1min speed=126418K/sec
bitmap: 0/59 pages [0KB], 65536KB chunk

md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[>....................] check = 0.2% (19510348/7813894144) finish=256805.0min speed=505K/sec
bitmap: 0/59 pages [0KB], 65536KB chunk

A data point I didn't mention in my previous mail is that the mdX_resync thread is not gone when the problem occurs:

buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ ps -Af|fgrep [md
root 134 2 0 Nov30 ? 00:00:00 [md]
root 1321 2 27 Nov30 ? 12:57:14 [md0_raid6]
root 1454 2 26 Nov30 ? 12:37:23 [md1_raid6]
root 5845 2 0 16:20 ? 00:00:30 [md0_resync]
root 5855 2 13 16:20 ? 00:14:11 [md1_resync]
buczek 9880 9072 0 18:05 pts/0 00:00:00 grep -F [md

buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ sudo cat /proc/5845/stack
[<0>] md_bitmap_cond_end_sync+0x12d/0x170
[<0>] raid5_sync_request+0x24b/0x390
[<0>] md_do_sync+0xb41/0x1030
[<0>] md_thread+0x122/0x160
[<0>] kthread+0x118/0x130
[<0>] ret_from_fork+0x1f/0x30

I guess md_bitmap_cond_end_sync+0x12d is the `wait_event(bitmap->mddev->recovery_wait, atomic_read(&bitmap->mddev->recovery_active) == 0);` in md-bitmap.c.
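
For reference, the counterpart that is supposed to wake this up is md_done_sync(), which looks roughly like this (simplified from my reading of md.c, not verbatim). It is called from the stripe completion path (raid5d -> handle_stripe -> ...), so if handle_stripe() never completes the outstanding resync stripes, recovery_active never drops to zero and the mdX_resync thread stays stuck in the wait_event above:

/* md.c, roughly: called when a chunk of resync I/O has completed */
void md_done_sync(struct mddev *mddev, int blocks, int ok)
{
    /* another "blocks" sectors of the resync window are done */
    atomic_sub(blocks, &mddev->recovery_active);
    wake_up(&mddev->recovery_wait);   /* what md_bitmap_cond_end_sync() waits for */
    if (!ok) {
        set_bit(MD_RECOVERY_INTR, &mddev->recovery);
        set_bit(MD_RECOVERY_ERROR, &mddev->recovery);
        md_wakeup_thread(mddev->thread);
        /* md_do_sync() will notice and exit */
    }
}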

Donald


On 01.12.20 10:29, Donald Buczek wrote:
> On 30.11.20 at 03:06, Guoqing Jiang wrote:
>>
>>
>> On 11/28/20 13:25, Donald Buczek wrote:
>>> Dear Linux mdraid people,
>>>
>>> we are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions.
>>>
>>> The last time this happened (in this case with Linux 5.10.0-rc4), I took some data.
>>>
>>> The triggering event seems to be the md_check cron job trying to pause the ongoing check operation in the morning with
>>>
>>>      echo idle > /sys/devices/virtual/block/md1/md/sync_action
>>>
>>> This doesn't complete. Here's /proc/stack of this process:
>>>
>>>      root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333
>>>      UID        PID  PPID  C STIME TTY          TIME CMD
>>>      root     23333 23331  0 02:00 ?        00:00:00 /bin/bash /usr/bin/mdcheck --continue --duration 06:00
>>>      root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat /proc/23333/stack
>>>      [<0>] kthread_stop+0x6e/0x150
>>>      [<0>] md_unregister_thread+0x3e/0x70
>>>      [<0>] md_reap_sync_thread+0x1f/0x1e0
>>>      [<0>] action_store+0x141/0x2b0
>>>      [<0>] md_attr_store+0x71/0xb0
>>>      [<0>] kernfs_fop_write+0x113/0x1a0
>>>      [<0>] vfs_write+0xbc/0x250
>>>      [<0>] ksys_write+0xa1/0xe0
>>>      [<0>] do_syscall_64+0x33/0x40
>>>      [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>
>>> Note, that md0 has been paused successfully just before.
>>
>> What is the personality of md0? Is it also raid6?
>
> Yes.
>
>>
>>>
>>>      2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD (/usr/bin/mdcheck --continue --duration "06:00")
>>>      2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md0 from 10623180920
>>>      2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md: data-check of RAID array md0
>>>      2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md1 from 11582849320
>>>      2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md: data-check of RAID array md1
>>>      2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md: md0: data-check interrupted.
>>>      2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at 13351127680
>>>      2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md: md1: data-check interrupted.
>>>      [ nothing related following.... ]
>>>
>>> After that, we see md1_raid6 in a busy loop:
>>>
>>>      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>>>      2376 root     20   0       0      0      0 R 100.0  0.0   1387:38 md1_raid6
>>
>> Seems the reap sync thread was trying to stop md1_raid6 while md1_raid6 was triggered again and again.
>>
>>>
>>> Also, all processes doing I/O do the md device block.
>>>
>>> This is /proc/mdstat:
>>>
>>>      Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
>>>      md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>>            [==================>..]  check = 94.0% (7350290348/7813894144) finish=57189.3min speed=135K/sec
>>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>>      md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
>>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>>
>>
>> So the RECOVERY_CHECK flag should be set, not sure if the simple changes
>> helps, but you may give it a try.
>
> Thanks. I've copied the original condition block to execute before the modified one and added some logging to it to see, if the change actually triggers. I will pause and unpause the check frequently on a busy machine to get this code executed more often.
>
> Best
>   Donald
>
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 98bac4f..e2697d0 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -9300,7 +9300,8 @@ void md_check_recovery(struct mddev *mddev)
>>                          md_update_sb(mddev, 0);
>>
>>                  if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>> -                   !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
>> +                   (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
>> +                    test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
>>                          /* resync/recovery still happening */
>>                          clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>>                          goto unlock;
>>
>> Thanks,
>> Guoqing
>

--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2020-12-03 01:58:07

by Guoqing Jiang

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi Donald,

On 12/2/20 18:28, Donald Buczek wrote:
> Dear Guoqing,
>
> unfortunately the patch didn't fix the problem (unless I messed it up
> with my logging). This is what I used:
>
>     --- a/drivers/md/md.c
>     +++ b/drivers/md/md.c
>     @@ -9305,6 +9305,14 @@ void md_check_recovery(struct mddev *mddev)
>                             clear_bit(MD_RECOVERY_NEEDED,
> &mddev->recovery);
>                             goto unlock;
>                     }

I think you can add the check of RECOVERY_CHECK in the above part instead of adding a new part.

>     +               if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>     +                   (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
>     +                    test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
>     +                       /* resync/recovery still happening */
>     +                       pr_info("md: XXX BUGFIX applied\n");
>     +                       clear_bit(MD_RECOVERY_NEEDED,
> &mddev->recovery);
>     +                       goto unlock;
>     +               }
>                     if (mddev->sync_thread) {
>                             md_reap_sync_thread(mddev);
>                             goto unlock;


>
> With pausing and continuing the check four times an hour, I could
> trigger the problem after about 48 hours. This time, the other device
> (md0) has locked up on `echo idle >
> /sys/devices/virtual/block/md0/md/sync_action` , while the check of md1
> is still ongoing:

Without the patch, md0 was good while md1 was locked. So the patch
switches the status of the two arrays, a little weird ...

What is the stack of the process? I guess it is the same as the stack of 23333 in your previous mail, but just to confirm.

>
>     Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
> [multipath]
>     md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11]
> sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm
> 2 [16/16] [UUUUUUUUUUUUUUUU]
>           [=>...................]  check =  8.5% (666852112/7813894144)
> finish=1271.2min speed=93701K/sec
>           bitmap: 0/59 pages [0KB], 65536KB chunk
>     md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12]
> sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3]
> sdu[2] sdt[1]
>           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm
> 2 [16/16] [UUUUUUUUUUUUUUUU]
>           [>....................]  check =  0.2% (19510348/7813894144)
> finish=253779.6min speed=511K/sec
>           bitmap: 0/59 pages [0KB], 65536KB chunk
>
> after 1 minute:
>
>     Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
> [multipath]
>     md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11]
> sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm
> 2 [16/16] [UUUUUUUUUUUUUUUU]
>           [=>...................]  check =  8.6% (674914560/7813894144)
> finish=941.1min speed=126418K/sec
>           bitmap: 0/59 pages [0KB], 65536KB chunk
>     md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12]
> sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3]
> sdu[2] sdt[1]
>           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm
> 2 [16/16] [UUUUUUUUUUUUUUUU]
>           [>....................]  check =  0.2% (19510348/7813894144)
> finish=256805.0min speed=505K/sec
>           bitmap: 0/59 pages [0KB], 65536KB chunk
>
> A data point, I didn't mention in my previous mail, is that the
> mdX_resync thread is not gone when the problem occurs:
>
>     buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ ps -Af|fgrep [md
>     root       134     2  0 Nov30 ?        00:00:00 [md]
>     root      1321     2 27 Nov30 ?        12:57:14 [md0_raid6]
>     root      1454     2 26 Nov30 ?        12:37:23 [md1_raid6]
>     root      5845     2  0 16:20 ?        00:00:30 [md0_resync]
>     root      5855     2 13 16:20 ?        00:14:11 [md1_resync]
>     buczek    9880  9072  0 18:05 pts/0    00:00:00 grep -F [md
>     buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ sudo cat
> /proc/5845/stack
>     [<0>] md_bitmap_cond_end_sync+0x12d/0x170
>     [<0>] raid5_sync_request+0x24b/0x390
>     [<0>] md_do_sync+0xb41/0x1030
>     [<0>] md_thread+0x122/0x160
>     [<0>] kthread+0x118/0x130
>     [<0>] ret_from_fork+0x1f/0x30
>
> I guess, md_bitmap_cond_end_sync+0x12d is the
> `wait_event(bitmap->mddev->recovery_wait,atomic_read(&bitmap->mddev->recovery_active)
> == 0);` in md-bitmap.c.
>

Could be. If so, then I think md_done_sync was not triggered by the path md0_raid6 -> ... -> handle_stripe.

I'd suggest comparing the stacks of md0 and md1 to find the difference.

Thanks,
Guoqing

2020-12-03 11:45:22

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear Guoqing,

On 12/3/20 2:55 AM, Guoqing Jiang wrote:
> Hi Donald,
>
> On 12/2/20 18:28, Donald Buczek wrote:
>> Dear Guoqing,
>>
>> unfortunately the patch didn't fix the problem (unless I messed it up with my logging). This is what I used:
>>
>>      --- a/drivers/md/md.c
>>      +++ b/drivers/md/md.c
>>      @@ -9305,6 +9305,14 @@ void md_check_recovery(struct mddev *mddev)
>>                              clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>>                              goto unlock;
>>                      }
>
> I think you can add the check of RECOVERY_CHECK in above part instead of add a new part.

I understand that. I did it this way so that a log line appears only when the changed expression actually makes a difference (the original block, still with its goto, not triggering, while the modified block does trigger), and so that it works even if either of us made some stupid error with the boolean expressions.

>
>>      +               if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>>      +                   (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
>>      +                    test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
>>      +                       /* resync/recovery still happening */
>>      +                       pr_info("md: XXX BUGFIX applied\n");
>>      +                       clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>>      +                       goto unlock;
>>      +               }
>>                      if (mddev->sync_thread) {
>>                              md_reap_sync_thread(mddev);
>>                              goto unlock;
>
>
>>
>> With pausing and continuing the check four times an hour, I could trigger the problem after about 48 hours. This time, the other device (md0) has locked up on `echo idle > /sys/devices/virtual/block/md0/md/sync_action` , while the check of md1 is still ongoing:
>
> Without the patch, md0 was good while md1 was locked. So the patch switches the status of the two arrays, a little weird ...

I think it's just random. There is a slight chance of a deadlock, depending on I/O to the device or other system load, when stopping the resync. In the last experiment, both md0 and md1 were switched from "check" to "idle" about 400 times before one of them blocked.

OTOH (and I don't understand this at all), it looks like md0 hung again, after a reboot to a 5.4 kernel, at the very first "check" to "idle" transition, which seems to contradict the "slight" chance and the "once after 400 attempts".

> What is the stack of the process? I guess it is same as the stack of 23333 in your previous mail, but just to confirm.

Of the md0_raid6 thread? Sorry, I didn't take that data before the reboot this time and concentrated on the non-terminating md0_resync thread.

>>      Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
>>      md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>            [=>...................]  check =  8.5% (666852112/7813894144) finish=1271.2min speed=93701K/sec
>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>      md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>            [>....................]  check =  0.2% (19510348/7813894144) finish=253779.6min speed=511K/sec
>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>
>> after 1 minute:
>>
>>      Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
>>      md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>            [=>...................]  check =  8.6% (674914560/7813894144) finish=941.1min speed=126418K/sec
>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>      md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>            [>....................]  check =  0.2% (19510348/7813894144) finish=256805.0min speed=505K/sec
>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>
>> A data point, I didn't mention in my previous mail, is that the mdX_resync thread is not gone when the problem occurs:
>>
>>      buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ ps -Af|fgrep [md
>>      root       134     2  0 Nov30 ?        00:00:00 [md]
>>      root      1321     2 27 Nov30 ?        12:57:14 [md0_raid6]
>>      root      1454     2 26 Nov30 ?        12:37:23 [md1_raid6]
>>      root      5845     2  0 16:20 ?        00:00:30 [md0_resync]
>>      root      5855     2 13 16:20 ?        00:14:11 [md1_resync]
>>      buczek    9880  9072  0 18:05 pts/0    00:00:00 grep -F [md
>>      buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ sudo cat /proc/5845/stack
>>      [<0>] md_bitmap_cond_end_sync+0x12d/0x170
>>      [<0>] raid5_sync_request+0x24b/0x390
>>      [<0>] md_do_sync+0xb41/0x1030
>>      [<0>] md_thread+0x122/0x160
>>      [<0>] kthread+0x118/0x130
>>      [<0>] ret_from_fork+0x1f/0x30
>>
>> I guess, md_bitmap_cond_end_sync+0x12d is the `wait_event(bitmap->mddev->recovery_wait,atomic_read(&bitmap->mddev->recovery_active) == 0);` in md-bitmap.c.
>>
>
> Could be, if so, then I think md_done_sync was not triggered by the path md0_raid6 -> ... -> handle_stripe.

Ok. Maybe I can find a way to add some logging there.

> I'd suggest to compare the stacks between md0 and md1 to find the difference.

Compare the stacks of md0_raid6 and md1_raid6?

Okay, but when one md is deadlocked, the other is operating just normally: I/O can be done, and you can even start and stop a resync on it. So I'd expect the stack of the operating raid to vary.

Thanks.

Donald

>
> Thanks,
> Guoqing


--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2020-12-21 12:35:30

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear Guoqing,

I now think that this is not an issue in md. I've driven a system into that situation again and have a clear indication that this is a problem of the member block device driver.

With md0 in the described erroneous state (md0_raid6 busy looping, echo idle > .../sync_action blocked, no progress in mdstat) and md1 operating normally:

root:deadbird:/scratch/local/# for f in /sys/devices/virtual/block/md?/md/rd*/block/inflight;do echo $f: $(cat $f);done
/sys/devices/virtual/block/md0/md/rd0/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd1/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd10/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd11/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd12/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd13/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd14/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd15/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd2/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd3/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd4/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd5/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd6/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd7/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd8/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd9/block/inflight: 1 0
/sys/devices/virtual/block/md1/md/rd0/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd1/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd10/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd11/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd12/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd13/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd14/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd15/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd2/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd3/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd4/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd5/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd6/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd7/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd8/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd9/block/inflight: 0 0

Best
Donald

2021-01-19 12:29:13

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear md-raid people,

I've reported a problem in this thread in December:

"We are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions." It was clear, that this is related to "mdcheck" running, because we found, that the command, which should stop the scrubbing in the morning (`echo idle > /sys/devices/virtual/block/md1/md/sync_action`) is also blocked.

On 12/21/20, I reported that the problem might be caused by a failure of the underlying block device driver, because I had found "inflight" counters of the block devices not being zero. However, this is not the case: we were able to run into the mdX_raid6 looping condition a few times again, but the non-zero inflight counters were not observed again.

I was able to collect a lot of additional information from a blocked system.

- The `echo idle > /sys/devices/virtual/block/md1/md/sync_action` command is waiting in kthread_stop() to stop the sync thread. [ https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L7989 ]

- The sync thread ("md1_resync") does not finish, because its blocked at

[<0>] raid5_get_active_stripe+0x4c4/0x660 # [1]
[<0>] raid5_sync_request+0x364/0x390
[<0>] md_do_sync+0xb41/0x1030
[<0>] md_thread+0x122/0x160
[<0>] kthread+0x118/0x130
[<0>] ret_from_fork+0x22/0x30

[1] https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L735

- Yes, gdb confirms that `conf->cache_state` is 0x03 (R5_INACTIVE_BLOCKED + R5_ALLOC_MORE).
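
For reference, the wait at [1] is roughly the following excerpt (simplified, not verbatim). With R5_INACTIVE_BLOCKED set and active_stripes (27534, see below) above 3/4 of max_nr_stripes (29816 in the r5conf dump appended below), the condition cannot become true until stripes are actually completed and released back to the inactive lists:

/* raid5.c, raid5_get_active_stripe(), roughly: no free stripe_head found */
    set_bit(R5_INACTIVE_BLOCKED, &conf->cache_state);
    wait_event_lock_irq(conf->wait_for_stripe,
        !list_empty(conf->inactive_list + hash) &&
        (atomic_read(&conf->active_stripes)
            < (conf->max_nr_stripes * 3 / 4) ||
         !test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)),
        *(conf->hash_locks + hash));
    clear_bit(R5_INACTIVE_BLOCKED, &conf->cache_state);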

- We have lots of active stripes:

root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/stripe_cache_active
27534

- handle_stripe() doesn't make progress:

echo "func handle_stripe =pflt" > /sys/kernel/debug/dynamic_debug/control

In dmesg, we see the debug output from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4925 but never from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4958:

[171908.896651] [1388] handle_stripe:4929: handling stripe 4947089856, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
[171908.896657] [1388] handle_stripe:4929: handling stripe 4947089872, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
[171908.896669] [1388] handle_stripe:4929: handling stripe 4947089912, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0

The sector numbers repeat after some time. We have only the following variants of stripe state and "check":

state=0x2031 cnt=1, check:0 # ACTIVE +INSYNC+REPLACED+IO_STARTED, check_state_idle
state=0x2029 cnt=1, check:4 # ACTIVE+SYNCING +REPLACED+IO_STARTED, check_state_check_result
state=0x2009 cnt=1, check:0 # ACTIVE+SYNCING +IO_STARTED, check_state_idle

- We have MD_SB_CHANGE_PENDING set:

root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/array_state
write-pending

gdb confirms that sb_flags = 6 (MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING)

So it can be assumed that handle_stripe breaks out at https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4939
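
The code there is roughly (simplified, not verbatim):

/* raid5.c, handle_stripe(), roughly: */
    if (s.handle_bad_blocks ||            /* s is the local stripe_head_state */
        test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
        set_bit(STRIPE_HANDLE, &sh->state);
        goto finish;
    }

With MD_SB_CHANGE_PENDING set, the stripe is only marked STRIPE_HANDLE again and re-queued instead of being processed, which would explain both the 100% CPU loop and the lack of progress.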

- The system can manually be freed from the deadlock:

When `echo active > /sys/block/md1/md/array_state` is used, the scrubbing and other I/O continue. Probably because of https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L4520
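
That code is roughly the "active" case of array_state_store() (simplified, not verbatim):

/* md.c, array_state_store(), roughly: */
    case active:
        if (mddev->pers) {
            err = restart_array(mddev);
            if (err)
                break;
            /* drop the pending superblock write and wake the waiters */
            clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
            wake_up(&mddev->sb_wait);
            err = 0;
        }
        break;

Clearing MD_SB_CHANGE_PENDING removes exactly the condition that makes handle_stripe() bail out above, so the stuck stripes can be processed again.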

I, of course, don't fully understand it yet. Any ideas?

I append some data from a hanging raid (mddev, r5conf, and a sample stripe_head from the handle_list with its first disk).

Donald


$L_002: (struct mddev *) 0xffff8898a48e3000 : {
private = (struct r5conf *) 0xffff88988695a800 /* --> $L_004 */
pers = (struct md_personality *) 0xffffffff826ee280 <raid6_personality>
unit = 9437185
md_minor = 1
disks = {next = 0xffff889886ee8c00, prev = 0xffff889891506000}
flags = 0
sb_flags = 6
suspended = 0
active_io = { counter=4 } /* atomic_t */
ro = 0
sysfs_active = 0
gendisk = (struct gendisk *) 0xffff8898bc357800
kobj = {
name = (const char *) 0xffffffff8220cef8 "md"
entry = {next = 0xffff8898a48e3058, prev = 0xffff8898a48e3058} /* empty */
parent = (struct kobject *) 0xffff8898bc357868
kset = 0x0
ktype = (struct kobj_type *) 0xffffffff826ef140 <md_ktype>
sd = (struct kernfs_node *) 0xffff8898b5f65980
kref = {
refcount = {
refs = { counter=17 } /* atomic_t */
}
}
state_initialized = 1
state_in_sysfs = 1
state_add_uevent_sent = 0
state_remove_uevent_sent = 0
uevent_suppress = 0
}
hold_active = 0
major_version = 1
minor_version = 2
patch_version = 0
persistent = 1
external = 0
metadata_type = [
(8)
[0] : 0 '\000'
[1] : 0 '\000'
[2] : 0 '\000'
[3] : 0 '\000'
[4] : 0 '\000'
[5] : 0 '\000'
[6] : 0 '\000'
[7] : 0 '\000'
[8] : 0 '\000'
[9] : 0 '\000'
[10] : 0 '\000'
[11] : 0 '\000'
[12] : 0 '\000'
[13] : 0 '\000'
[14] : 0 '\000'
[15] : 0 '\000'
[16] : 0 '\000'
]
chunk_sectors = 1024
ctime = 1607680540
utime = 1610853337
level = 6
layout = 2
clevel = [
(8)
[0] : 114 'r'
[1] : 97 'a'
[2] : 105 'i'
[3] : 100 'd'
[4] : 54 '6'
[5] : 0 '\000'
[6] : 0 '\000'
[7] : 0 '\000'
[8] : 0 '\000'
[9] : 0 '\000'
[10] : 0 '\000'
[11] : 0 '\000'
[12] : 0 '\000'
[13] : 0 '\000'
[14] : 0 '\000'
[15] : 0 '\000'
]
raid_disks = 16
max_disks = 1920
dev_sectors = 15627788288
array_sectors = 218789036032
external_size = 0
events = 32409
can_decrease_events = 0
uuid = [
(8)
[0] : -49 '\317'
[1] : 69 'E'
[2] : 119 'w'
[3] : 62 '>'
[4] : -113 '\217'
[5] : -107 '\225'
[6] : 9 '\t'
[7] : -90 '\246'
[8] : -2 '\376'
[9] : 84 'T'
[10] : -12 '\364'
[11] : -31 '\341'
[12] : -110 '\222'
[13] : 24 '\030'
[14] : -15 '\361'
[15] : 109 'm'
]
reshape_position = 18446744073709551615
delta_disks = 0
new_level = 6
new_layout = 2
new_chunk_sectors = 1024
reshape_backwards = 0
thread = (struct md_thread *) 0xffff889887bde080
sync_thread = 0x0
last_sync_action = (char *) 0xffffffff8225e246 "check"
curr_resync = 4947256648
curr_resync_completed = 4947016640
resync_mark = 4415850173
resync_mark_cnt = 66402936
curr_mark_cnt = 75262560
resync_max_sectors = 15627788288
resync_mismatches = {
counter = 0
}
suspend_lo = 0
suspend_hi = 0
sync_speed_min = 0
sync_speed_max = 0
parallel_resync = 0
ok_start_degraded = 0
recovery = 203
recovery_disabled = 16
in_sync = 0
open_mutex = {
owner = { counter=0 } /* atomic_t */
wait_lock = {counter=0} /* spinlock_t */
osq = {
tail = { counter=0 } /* atomic_t */
}
wait_list = {next = 0xffff8898a48e31d8, prev = 0xffff8898a48e31d8} /* empty */
}
reconfig_mutex = {
owner = { counter=-131247629823744 } /* atomic_t */
wait_lock = {counter=0} /* spinlock_t */
osq = {
tail = { counter=0 } /* atomic_t */
}
wait_list = {next = 0xffff8898a48e31f8, prev = 0xffff8898a48e31f8} /* empty */
}
active = { counter=2 } /* atomic_t */
openers = { counter=1 } /* atomic_t */
changed = 0
degraded = 0
recovery_active = { counter=181360 } /* atomic_t */
recovery_wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff8898a48e3228, prev = 0xffff8898a48e3228} /* empty */
}
recovery_cp = 18446744073709551615
resync_min = 4871994088
resync_max = 18446744073709551615
sysfs_state = (struct kernfs_node *) 0xffff8898b5f65380
sysfs_action = (struct kernfs_node *) 0xffff889887d87600
sysfs_completed = (struct kernfs_node *) 0xffff8898ac6f4e80
sysfs_degraded = (struct kernfs_node *) 0xffff8898ac6f4b80
sysfs_level = (struct kernfs_node *) 0xffff8898b5f65700
del_work = {
data = { counter=128 } /* atomic_t */
entry = {next = 0xffff8898a48e3280, prev = 0xffff8898a48e3280} /* empty */
func = (work_func_t) 0xffffffff819ad430 <md_start_sync>
}
lock = {counter=0} /* spinlock_t */
sb_wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffffc9000e4a7a68, prev = 0xffffc9000cf97d30}
}
pending_writes = { counter=0 } /* atomic_t */
safemode = 0
safemode_delay = 201
safemode_timer = {
entry = {
next = (struct hlist_node *) 0xdead000000000122
pprev = 0x0
}
expires = 4415857239
function = (void (*)(struct timer_list *)) 0xffffffff819a6a40 <md_safemode_timeout>
flags = 314572820
}
writes_pending = {
percpu_count_ptr = 105896157983096
data = (struct percpu_ref_data *) 0xffff8898b340fc80
}
sync_checkers = 0
queue = (struct request_queue *) 0xffff8898b65847c8
bitmap = (struct bitmap *) 0xffff889887d84400
bitmap_info = {
file = 0x0
offset = 8
space = 0
default_offset = 2
default_space = 6
mutex = {
owner = { counter=0 } /* atomic_t */
wait_lock = {counter=0} /* spinlock_t */
osq = {
tail = { counter=0 } /* atomic_t */
}
wait_list = {next = 0xffff8898a48e3350, prev = 0xffff8898a48e3350} /* empty */
}
chunksize = 67108864
daemon_sleep = 5000
max_write_behind = 0
external = 0
nodes = 0
cluster_name = [
(8)
[0] : 0 '\000'
[1] : 0 '\000'
[2] : 0 '\000'
[3] : 0 '\000'
[4] : 0 '\000'
[5] : 0 '\000'
[6] : 0 '\000'
[7] : 0 '\000'
[8] : 0 '\000'
[9] : 0 '\000'
[10] : 0 '\000'
[11] : 0 '\000'
[12] : 0 '\000'
[13] : 0 '\000'
[14] : 0 '\000'
[15] : 0 '\000'
[16] : 0 '\000'
[17] : 0 '\000'
[18] : 0 '\000'
[19] : 0 '\000'
[20] : 0 '\000'
[21] : 0 '\000'
[22] : 0 '\000'
[23] : 0 '\000'
[24] : 0 '\000'
[25] : 0 '\000'
[26] : 0 '\000'
[27] : 0 '\000'
[28] : 0 '\000'
[29] : 0 '\000'
[30] : 0 '\000'
[31] : 0 '\000'
[32] : 0 '\000'
[33] : 0 '\000'
[34] : 0 '\000'
[35] : 0 '\000'
[36] : 0 '\000'
[37] : 0 '\000'
[38] : 0 '\000'
[39] : 0 '\000'
[40] : 0 '\000'
[41] : 0 '\000'
[42] : 0 '\000'
[43] : 0 '\000'
[44] : 0 '\000'
[45] : 0 '\000'
[46] : 0 '\000'
[47] : 0 '\000'
[48] : 0 '\000'
[49] : 0 '\000'
[50] : 0 '\000'
[51] : 0 '\000'
[52] : 0 '\000'
[53] : 0 '\000'
[54] : 0 '\000'
[55] : 0 '\000'
[56] : 0 '\000'
[57] : 0 '\000'
[58] : 0 '\000'
[59] : 0 '\000'
[60] : 0 '\000'
[61] : 0 '\000'
[62] : 0 '\000'
[63] : 0 '\000'
]
}
max_corr_read_errors = { counter=20 } /* atomic_t */
all_mddevs = {next = 0xffffffff826eef10, prev = 0xffff88810a551bc8}
to_remove = 0x0
bio_set = {
bio_slab = (struct kmem_cache *) 0xffff889886fbdc00
front_pad = 0
bio_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 2
curr_nr = 2
elements = (void **) 0xffff8898b4fae300
pool_data = (void *) 0xffff889886fbdc00
alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab>
free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab>
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff8898a48e3428, prev = 0xffff8898a48e3428} /* empty */
}
}
bvec_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 2
curr_nr = 2
elements = (void **) 0xffff8898b4fae2e0
pool_data = (void *) 0xffff8881078ecb40
alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab>
free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab>
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff8898a48e3470, prev = 0xffff8898a48e3470} /* empty */
}
}
bio_integrity_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 0
curr_nr = 0
elements = 0x0
pool_data = 0x0
alloc = 0x0
free = 0x0
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0x0000000000000000, prev = 0x0000000000000000}
}
}
bvec_integrity_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 0
curr_nr = 0
elements = 0x0
pool_data = 0x0
alloc = 0x0
free = 0x0
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0x0000000000000000, prev = 0x0000000000000000}
}
}
rescue_lock = {counter=0} /* spinlock_t */
rescue_list = {
head = 0x0
tail = 0x0
}
rescue_work = {
data = { counter=68719476704 } /* atomic_t */
entry = {next = 0xffff8898a48e3530, prev = 0xffff8898a48e3530} /* empty */
func = (work_func_t) 0xffffffff81430230 <bio_alloc_rescue>
}
rescue_workqueue = 0x0
}
sync_set = {
bio_slab = (struct kmem_cache *) 0xffff889886fbdc00
front_pad = 0
bio_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 2
curr_nr = 2
elements = (void **) 0xffff8898b4fae2c0
pool_data = (void *) 0xffff889886fbdc00
alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab>
free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab>
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff8898a48e3598, prev = 0xffff8898a48e3598} /* empty */
}
}
bvec_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 2
curr_nr = 2
elements = (void **) 0xffff8898b4fae2a0
pool_data = (void *) 0xffff8881078ecb40
alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab>
free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab>
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff8898a48e35e0, prev = 0xffff8898a48e35e0} /* empty */
}
}
bio_integrity_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 0
curr_nr = 0
elements = 0x0
pool_data = 0x0
alloc = 0x0
free = 0x0
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0x0000000000000000, prev = 0x0000000000000000}
}
}
bvec_integrity_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 0
curr_nr = 0
elements = 0x0
pool_data = 0x0
alloc = 0x0
free = 0x0
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0x0000000000000000, prev = 0x0000000000000000}
}
}
rescue_lock = {counter=0} /* spinlock_t */
rescue_list = {
head = 0x0
tail = 0x0
}
rescue_work = {
data = { counter=68719476704 } /* atomic_t */
entry = {next = 0xffff8898a48e36a0, prev = 0xffff8898a48e36a0} /* empty */
func = (work_func_t) 0xffffffff81430230 <bio_alloc_rescue>
}
rescue_workqueue = 0x0
}
md_io_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 2
curr_nr = 2
elements = (void **) 0xffff8898b236af60
pool_data = (void *) 0x28 <fixed_percpu_data+40>
alloc = (mempool_alloc_t *) 0xffffffff811c8b50 <mempool_kmalloc>
free = (mempool_free_t *) 0xffffffff811c84b0 <mempool_kfree>
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff8898a48e36f8, prev = 0xffff8898a48e36f8} /* empty */
}
}
flush_bio = 0x0
flush_pending = { counter=0 } /* atomic_t */
start_flush = 121249416090188
last_flush = 121249416090188
flush_work = {
data = { counter=2176 } /* atomic_t */
entry = {next = 0xffff8898a48e3730, prev = 0xffff8898a48e3730} /* empty */
func = (work_func_t) 0xffffffff819aca80 <md_submit_flush_data>
}
event_work = {
data = { counter=0 } /* atomic_t */
entry = {next = 0x0000000000000000, prev = 0x0000000000000000}
func = 0x0
}
serial_info_pool = 0x0
sync_super = 0x0
cluster_info = 0x0
good_device_nr = 0
noio_flag = 0
has_superblocks = true_21_
fail_last_dev = false_21_
serialize_policy = false_21_
}

$L_004: (struct r5conf *) 0xffff88988695a800 : {
stripe_hashtbl = (struct hlist_head *) 0xffff8898a1cb9000 /* --> $L_006 */
hash_locks = [
(23)
[0] : {counter=0} /* spinlock_t */
[1] : {counter=0} /* spinlock_t */
[2] : {counter=0} /* spinlock_t */
[3] : {counter=0} /* spinlock_t */
[4] : {counter=0} /* spinlock_t */
[5] : {counter=0} /* spinlock_t */
[6] : {counter=0} /* spinlock_t */
[7] : {counter=0} /* spinlock_t */
]
mddev = (struct mddev *) 0xffff8898a48e3000
chunk_sectors = 1024
level = 6
algorithm = 2
rmw_level = 1
max_degraded = 2
raid_disks = 16
max_nr_stripes = 29816
min_nr_stripes = 256
reshape_progress = 18446744073709551615
reshape_safe = 0
previous_raid_disks = 16
prev_chunk_sectors = 1024
prev_algo = 2
generation = 0
gen_lock = {
seqcount = {
sequence = 0
}
}
reshape_checkpoint = 0
min_offset_diff = 0
handle_list = {next = 0xffff888a7893c890, prev = 0xffff8889de3cc7d0}
loprio_list = {next = 0xffff88988695a898, prev = 0xffff88988695a898} /* empty */
hold_list = {next = 0xffff88988695a8a8, prev = 0xffff88988695a8a8} /* empty */
delayed_list = {next = 0xffff88988695a8b8, prev = 0xffff88988695a8b8} /* empty */
bitmap_list = {next = 0xffff88988695a8c8, prev = 0xffff88988695a8c8} /* empty */
retry_read_aligned = 0x0
retry_read_offset = 0
retry_read_aligned_list = 0x0
preread_active_stripes = { counter=0 } /* atomic_t */
active_aligned_reads = { counter=0 } /* atomic_t */
pending_full_writes = { counter=0 } /* atomic_t */
bypass_count = 0
bypass_threshold = 1
skip_copy = 0
last_hold = (struct list_head *) 0xffff88a299ea0010
reshape_stripes = { counter=0 } /* atomic_t */
active_name = 0
cache_name = [
(2)
[0] : [
(8)
[0] : 114 'r'
[1] : 97 'a'
[2] : 105 'i'
[3] : 100 'd'
[4] : 54 '6'
[5] : 45 '-'
[6] : 109 'm'
[7] : 100 'd'
[8] : 49 '1'
[9] : 0 '\000'
[10] : 0 '\000'
[11] : 0 '\000'
[12] : 0 '\000'
[13] : 0 '\000'
[14] : 0 '\000'
[15] : 0 '\000'
[16] : 0 '\000'
[17] : 0 '\000'
[18] : 0 '\000'
[19] : 0 '\000'
[20] : 0 '\000'
[21] : 0 '\000'
[22] : 0 '\000'
[23] : 0 '\000'
[24] : 0 '\000'
[25] : 0 '\000'
[26] : 0 '\000'
[27] : 0 '\000'
[28] : 0 '\000'
[29] : 0 '\000'
[30] : 0 '\000'
[31] : 0 '\000'
]
[1] : [
(8)
[0] : 114 'r'
[1] : 97 'a'
[2] : 105 'i'
[3] : 100 'd'
[4] : 54 '6'
[5] : 45 '-'
[6] : 109 'm'
[7] : 100 'd'
[8] : 49 '1'
[9] : 45 '-'
[10] : 97 'a'
[11] : 108 'l'
[12] : 116 't'
[13] : 0 '\000'
[14] : 0 '\000'
[15] : 0 '\000'
[16] : 0 '\000'
[17] : 0 '\000'
[18] : 0 '\000'
[19] : 0 '\000'
[20] : 0 '\000'
[21] : 0 '\000'
[22] : 0 '\000'
[23] : 0 '\000'
[24] : 0 '\000'
[25] : 0 '\000'
[26] : 0 '\000'
[27] : 0 '\000'
[28] : 0 '\000'
[29] : 0 '\000'
[30] : 0 '\000'
[31] : 0 '\000'
]
]
slab_cache = (struct kmem_cache *) 0xffff8881078eca80
cache_size_mutex = {
owner = { counter=0 } /* atomic_t */
wait_lock = {counter=0} /* spinlock_t */
osq = {
tail = { counter=0 } /* atomic_t */
}
wait_list = {next = 0xffff88988695a970, prev = 0xffff88988695a970} /* empty */
}
seq_flush = 5031591
seq_write = 5031591
quiesce = 0
fullsync = 0
recovery_disabled = 15
percpu = (struct raid5_percpu *) 0x604fdee147e8
scribble_disks = 16
scribble_sectors = 1024
node = {
next = 0x0
pprev = (struct hlist_node **) 0xffff88810a5b89a8
}
active_stripes = { counter=27534 } /* atomic_t */
inactive_list = [
(3)
[0] : {next = 0xffff88990fd76710, prev = 0xffff8889b7f72550}
[1] : {next = 0xffff88959af002d0, prev = 0xffff889958cd4590}
[2] : {next = 0xffff88a60a070150, prev = 0xffff88a37d3d0850}
[3] : {next = 0xffff889e211e0690, prev = 0xffff8893a4000110}
[4] : {next = 0xffff888c2139a350, prev = 0xffff88901a188410}
[5] : {next = 0xffff88a83413e3d0, prev = 0xffff88a2090bc350}
[6] : {next = 0xffff88a720d54310, prev = 0xffff888a0da6c5d0}
[7] : {next = 0xffff888a0412c650, prev = 0xffff889250596550}
]
r5c_cached_full_stripes = { counter=0 } /* atomic_t */
r5c_full_stripe_list = {next = 0xffff88988695aa48, prev = 0xffff88988695aa48} /* empty */
r5c_cached_partial_stripes = { counter=0 } /* atomic_t */
r5c_partial_stripe_list = {next = 0xffff88988695aa60, prev = 0xffff88988695aa60} /* empty */
r5c_flushing_full_stripes = { counter=0 } /* atomic_t */
r5c_flushing_partial_stripes = { counter=0 } /* atomic_t */
empty_inactive_list_nr = { counter=0 } /* atomic_t */
released_stripes = {
first = 0x0
}
wait_for_quiescent = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff88988695aa90, prev = 0xffff88988695aa90} /* empty */
}
wait_for_stripe = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffffc9000e3f7c78, prev = 0xffffc9000e3f7c78}
}
wait_for_overlap = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff88988695aac0, prev = 0xffff88988695aac0} /* empty */
}
cache_state = 3
shrinker = {
count_objects = (long unsigned int (*)(struct shrinker *, struct shrink_control *)) 0xffffffff8198bc60 <raid5_cache_count>
scan_objects = (long unsigned int (*)(struct shrinker *, struct shrink_control *)) 0xffffffff8198de90 <raid5_cache_scan>
batch = 128
seeks = 128
flags = 0
list = {next = 0xffff8898867e8c30, prev = 0xffffffffa0e14520}
id = 0
nr_deferred = (atomic_long_t *) 0xffff88988d8955c0
}
pool_size = 16
device_lock = {counter=0} /* spinlock_t */
disks = (struct disk_info *) 0xffff8898b1df3800
bio_split = {
bio_slab = (struct kmem_cache *) 0xffff889886fbdc00
front_pad = 0
bio_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 2
curr_nr = 2
elements = (void **) 0xffff8898b4fae280
pool_data = (void *) 0xffff889886fbdc00
alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab>
free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab>
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff88988695ab70, prev = 0xffff88988695ab70} /* empty */
}
}
bvec_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 0
curr_nr = 0
elements = 0x0
pool_data = 0x0
alloc = 0x0
free = 0x0
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0x0000000000000000, prev = 0x0000000000000000}
}
}
bio_integrity_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 0
curr_nr = 0
elements = 0x0
pool_data = 0x0
alloc = 0x0
free = 0x0
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0x0000000000000000, prev = 0x0000000000000000}
}
}
bvec_integrity_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 0
curr_nr = 0
elements = 0x0
pool_data = 0x0
alloc = 0x0
free = 0x0
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0x0000000000000000, prev = 0x0000000000000000}
}
}
rescue_lock = {counter=0} /* spinlock_t */
rescue_list = {
head = 0x0
tail = 0x0
}
rescue_work = {
data = { counter=68719476704 } /* atomic_t */
entry = {next = 0xffff88988695ac78, prev = 0xffff88988695ac78} /* empty */
func = (work_func_t) 0xffffffff81430230 <bio_alloc_rescue>
}
rescue_workqueue = 0x0
}
thread = 0x0
temp_inactive_list = [
(3)
[0] : {next = 0xffff88988695aca0, prev = 0xffff88988695aca0} /* empty */
[1] : {next = 0xffff88988695acb0, prev = 0xffff88988695acb0} /* empty */
[2] : {next = 0xffff88988695acc0, prev = 0xffff88988695acc0} /* empty */
[3] : {next = 0xffff88988695acd0, prev = 0xffff88988695acd0} /* empty */
[4] : {next = 0xffff88988695ace0, prev = 0xffff88988695ace0} /* empty */
[5] : {next = 0xffff88988695acf0, prev = 0xffff88988695acf0} /* empty */
[6] : {next = 0xffff88988695ad00, prev = 0xffff88988695ad00} /* empty */
[7] : {next = 0xffff88988695ad10, prev = 0xffff88988695ad10} /* empty */
]
worker_groups = 0x0
group_cnt = 0
worker_cnt_per_group = 0
log = 0x0
log_private = 0x0
pending_bios_lock = {counter=0} /* spinlock_t */
batch_bio_dispatch = true
pending_data = (struct r5pending_data *) 0xffff88988fae0000
free_list = {next = 0xffff88988fae4fd8, prev = 0xffff88988fae0000}
pending_list = {next = 0xffff88988695ad60, prev = 0xffff88988695ad60} /* empty */
pending_data_cnt = 0
next_pending_data = 0x0
}


set $sh = container_of ((void *)0xffff88a57f1e6550, struct stripe_head, lru)

(gdb) dump *$sh

$L_010: (struct stripe_head *) 0xffff88a57f1e6540 : {
hash = {
next = (struct hlist_node *) 0xffff888768f7c680
pprev = (struct hlist_node **) 0xffff8892f0520080
}
lru = {next = 0xffff88945e034510, prev = 0xffff8891d8f04210}
release_list = {
next = (struct llist_node *) 0xffff88945e034520
}
raid_conf = (struct r5conf *) 0xffff88988695a800
generation = 0
sector = 4947068032
pd_idx = 14
qd_idx = 15
ddf_layout = 0
hash_lock_index = 0
state = 8242 # 0x2032 HANDLE + INSYNC + REPLACED + STARTED
count = { counter=0 } /* atomic_t */
bm_seq = 5030861
disks = 16
overwrite_disks = 0
check_state = check_state_idle
reconstruct_state = reconstruct_state_idle
stripe_lock = {counter=0} /* spinlock_t */
cpu = 17
group = 0x0
batch_head = 0x0 # no batch
batch_lock = {counter=0} /* spinlock_t */
batch_list = {next = 0xffff88a57f1e65c8, prev = 0xffff88a57f1e65c8} /* empty */
/* anon union*/ {
log_io = 0x0
ppl_io = 0x0
}
log_list = {next = 0xffff88a57f1e65e0, prev = 0xffff88a57f1e65e0} /* empty */
log_start = 18446744073709551615
r5c = {next = 0xffff88a57f1e65f8, prev = 0xffff88a57f1e65f8} /* empty */
ppl_page = 0x0
ops = {
target = 0
target2 = 0
zero_sum_result = 0
}


(gdb) dump $sh->dev[0]

$L_011: (struct r5dev *) 0xffff88a57f1e6620 : {
req = {
bi_next = 0x0
bi_disk = 0x0
bi_opf = 0
bi_flags = 0
bi_ioprio = 0
bi_write_hint = 0
bi_status = 0 '\000'
bi_partno = 0 '\000'
__bi_remaining = { counter=1 } /* atomic_t */
bi_iter = {
bi_sector = 0
bi_size = 0
bi_idx = 0
bi_bvec_done = 0
}
bi_end_io = 0x0
bi_private = 0x0
bi_blkg = 0x0
bi_issue = {
value = 0
}
/* anon union*/ {
bi_integrity = 0x0
}
bi_vcnt = 0
bi_max_vecs = 1
__bi_cnt = { counter=1 } /* atomic_t */
bi_io_vec = (struct bio_vec *) 0xffff88a57f1e6710
bi_pool = 0x0
bi_inline_vecs = [
]
}
rreq = {
bi_next = 0x0
bi_disk = 0x0
bi_opf = 0
bi_flags = 0
bi_ioprio = 0
bi_write_hint = 0
bi_status = 0 '\000'
bi_partno = 0 '\000'
__bi_remaining = { counter=1 } /* atomic_t */
bi_iter = {
bi_sector = 0
bi_size = 0
bi_idx = 0
bi_bvec_done = 0
}
bi_end_io = 0x0
bi_private = 0x0
bi_blkg = 0x0
bi_issue = {
value = 0
}
/* anon union*/ {
bi_integrity = 0x0
}
bi_vcnt = 0
bi_max_vecs = 1
__bi_cnt = { counter=1 } /* atomic_t */
bi_io_vec = (struct bio_vec *) 0xffff88a57f1e6720
bi_pool = 0x0
bi_inline_vecs = [
]
}
vec = {
bv_page = (struct page *) 0xffffea008ffccc80
bv_len = 4096
bv_offset = 0
}
rvec = {
bv_page = 0x0
bv_len = 0
bv_offset = 0
}
page = (struct page *) 0xffffea008ffccc80
orig_page = (struct page *) 0xffffea008ffccc80
offset = 0
toread = 0x0
read = 0x0
towrite = 0x0
written = 0x0
sector = 69258950784
flags = 17 # 0x11 : R5_UPTODATE + R5_INSYNC
log_checksum = 0
write_hint = 0
}

$sh->dev[1] .. $sh->dev[15] omitted. All the same, just different page buffers and sector numbers.

$sh->dev[14] and $sh->dev[15] have sector = 0.

2021-01-20 16:43:13

by Guoqing Jiang

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi Donald,

On 1/19/21 12:30, Donald Buczek wrote:
> Dear md-raid people,
>
> I've reported a problem in this thread in December:
>
> "We are using raid6 on several servers. Occasionally we had failures,
> where a mdX_raid6 process seems to go into a busy loop and all I/O to
> the md device blocks. We've seen this on various kernel versions." It
> was clear, that this is related to "mdcheck" running, because we found,
> that the command, which should stop the scrubbing in the morning (`echo
> idle > /sys/devices/virtual/block/md1/md/sync_action`) is also blocked.
>
> On 12/21/20, I've reported, that the problem might be caused by a
> failure of the underlying block device driver, because I've found
> "inflight" counters of the block devices not being zero. However, this
> is not the case. We were able to run into the mdX_raid6 looping
> condition a few times again, but the non-zero inflight counters have not
> been observed again.
>
> I was able to collect a lot of additional information from a blocked
> system.
>
> - The `cat idle > /sys/devices/virtual/block/md1/md/sync_action` command
> is waiting at kthread_stop to stop the sync thread. [
> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L7989 ]
>
> - The sync thread ("md1_resync") does not finish, because its blocked at
>
> [<0>] raid5_get_active_stripe+0x4c4/0x660     # [1]
> [<0>] raid5_sync_request+0x364/0x390
> [<0>] md_do_sync+0xb41/0x1030
> [<0>] md_thread+0x122/0x160
> [<0>] kthread+0x118/0x130
> [<0>] ret_from_fork+0x22/0x30
>
> [1] https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L735
>
> - yes, gdb confirms that `conf->cache_state` is 0x03 (
> R5_INACTIVE_BLOCKED + R5_ALLOC_MORE )

The resync thread is blocked since it can't get a sh from the inactive list,
so R5_ALLOC_MORE and R5_INACTIVE_BLOCKED are set; that is why `echo idle
> /sys/devices/virtual/block/md1/md/sync_action` can't stop the sync thread.

>
> - We have lots of active stripes:
>
> root@deadbird:~/linux_problems/mdX_raid6_looping# cat
> /sys/block/md1/md/stripe_cache_active
> 27534

There are too many active stripes, so this is false:

atomic_read(&conf->active_stripes) < (conf->max_nr_stripes * 3 / 4)

so raid5_get_active_stripe has to wait until the condition becomes true; we
either need to increase max_nr_stripes or decrease active_stripes.

1. Increase max_nr_stripes
since "mdX_raid6 process seems to go into a busy loop" and R5_ALLOC_MORE
is set, if there is enough memory, I suppose mdX_raid6 (raid5d) could
alloc new stripe in grow_one_stripe and increase max_nr_stripes. So
please check the memory usage of your system.

Another option is to increase the number of stripe_heads (sh) manually by
writing a new number to stripe_cache_size, if there is enough memory.
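
For example, something along these lines could be used to check the
memory situation and to try a bigger stripe cache (just a sketch; md1
and the value 8192 are only examples, and each stripe costs roughly one
page per member device):

    free -m
    cat /sys/block/md1/md/stripe_cache_size     # current limit (number of stripes)
    cat /sys/block/md1/md/stripe_cache_active   # stripes currently in use
    echo 8192 > /sys/block/md1/md/stripe_cache_size   # example value only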

2. Or decrease active_stripes
This is supposed to be done by do_release_stripe, but STRIPE_HANDLE is set

> - handle_stripe() doesn't make progress:
>
> echo "func handle_stripe =pflt" > /sys/kernel/debug/dynamic_debug/control
>
> In dmesg, we see the debug output from
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4925
> but never from
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4958:
>
> [171908.896651] [1388] handle_stripe:4929: handling stripe 4947089856,
> state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>                 , check:4, reconstruct:0
> [171908.896657] [1388] handle_stripe:4929: handling stripe 4947089872,
> state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>                 , check:4, reconstruct:0
> [171908.896669] [1388] handle_stripe:4929: handling stripe 4947089912,
> state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>                 , check:4, reconstruct:0
>
> The sector numbers repeat after some time. We have only the following
> variants of stripe state and "check":
>
> state=0x2031 cnt=1, check:0 # ACTIVE        +INSYNC+REPLACED+IO_STARTED,
> check_state_idle
> state=0x2029 cnt=1, check:4 # ACTIVE+SYNCING       +REPLACED+IO_STARTED,
> check_state_check_result
> state=0x2009 cnt=1, check:0 # ACTIVE+SYNCING                +IO_STARTED,
> check_state_idle
>
> - We have MD_SB_CHANGE_PENDING set:

because the MD_SB_CHANGE_PENDING flag is set, so do_release_stripe can't
call atomic_dec(&conf->active_stripes).

Normally, SB_CHANGE_PENDING is cleared by md_update_sb, which could be
called from md_reap_sync_thread. But md_reap_sync_thread is stuck
unregistering the sync_thread (which is blocked in raid5_get_active_stripe).


Still, I don't understand well why mdX_raid6 is in a busy loop; maybe raid5d
can't break out of the while(1) loop because "if (!batch_size &&
!released)" is false. I would assume released is '0' since
> released_stripes = {
> first = 0x0
> }
And __get_priority_stripe fetches the sh from conf->handle_list and deletes
it from handle_list, handle_stripe marks the sh with STRIPE_HANDLE, then
do_release_stripe puts the sh back on handle_list. So batch_size can't be
'0', and handle_active_stripes repeats the process in the busy loop. This
is my best guess to explain the busy-loop situation.
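
A possible way to check this from userspace is to enable the existing
pr_debug at the end of the raid5d loop via dynamic debug and watch
whether non-zero counts keep being reported while no I/O completes
(a sketch, assuming CONFIG_DYNAMIC_DEBUG):

    echo 'func raid5d +p' > /sys/kernel/debug/dynamic_debug/control
    dmesg -w | grep 'stripes handled'
    # turn it off again afterwards:
    echo 'func raid5d -p' > /sys/kernel/debug/dynamic_debug/control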

>
> root@deadbird:~/linux_problems/mdX_raid6_looping# cat
> /sys/block/md1/md/array_state
> write-pending
>
> gdb confirms that sb_flags = 6 (MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING)

Since rdev_set_badblocks could set them, could you check whether there are
bad blocks on the underlying devices (sd*)?
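
For example (a sketch; this assumes the per-device bad-block attributes
are exposed on your kernel):

    for d in /sys/block/md1/md/dev-*; do
            echo "== $d"
            cat "$d/bad_blocks" "$d/unacknowledged_bad_blocks"
    done
    # or, per member device:
    # mdadm --examine-badblocks /dev/sdX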

>
> So it can be assumed that handle_stripe breaks out at
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4939
>
> - The system can manually be freed from the deadlock:
>
> When `echo active > /sys/block/md1/md/array_state` is used, the
> scrubbing and other I/O continue. Probably because of
> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L4520

Hmm, it seems clearing SB_CHANGE_PENDING did the trick, so the blocked
processes can make progress.

>
> I, of coruse, don't fully understand it yet. Any ideas?
>
> I append some data from a hanging raid... (mddev, r5conf and a sample
> stripe_head from handle_list with it first disks)

These data did help the investigation!

Thanks,
Guoqing

2021-01-23 13:07:32

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear Guoqing,

On 20.01.21 17:33, Guoqing Jiang wrote:
> Hi Donald,
>
> On 1/19/21 12:30, Donald Buczek wrote:
>> Dear md-raid people,
>>
>> I've reported a problem in this thread in December:
>>
>> "We are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions." It was clear, that this is related to "mdcheck" running, because we found, that the command, which should stop the scrubbing in the morning (`echo idle > /sys/devices/virtual/block/md1/md/sync_action`) is also blocked.
>>
>> On 12/21/20, I've reported, that the problem might be caused by a failure of the underlying block device driver, because I've found "inflight" counters of the block devices not being zero. However, this is not the case. We were able to run into the mdX_raid6 looping condition a few times again, but the non-zero inflight counters have not been observed again.
>>
>> I was able to collect a lot of additional information from a blocked system.
>>
>> - The `cat idle > /sys/devices/virtual/block/md1/md/sync_action` command is waiting at kthread_stop to stop the sync thread. [ https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L7989 ]
>>
>> - The sync thread ("md1_resync") does not finish, because its blocked at
>>
>> [<0>] raid5_get_active_stripe+0x4c4/0x660     # [1]
>> [<0>] raid5_sync_request+0x364/0x390
>> [<0>] md_do_sync+0xb41/0x1030
>> [<0>] md_thread+0x122/0x160
>> [<0>] kthread+0x118/0x130
>> [<0>] ret_from_fork+0x22/0x30
>>
>> [1] https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L735
>>
>> - yes, gdb confirms that `conf->cache_state` is 0x03 ( R5_INACTIVE_BLOCKED + R5_ALLOC_MORE )
>
> The resync thread is blocked since it can't get sh from inactive list, so R5_ALLOC_MORE and R5_INACTIVE_BLOCKED are set, that is why `echo idle > /sys/devices/virtual/block/md1/md/sync_action` can't stop sync thread.
>
>>
>> - We have lots of active stripes:
>>
>> root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/stripe_cache_active
>> 27534
>
> There are too many active stripes, so this is false:
>
> atomic_read(&conf->active_stripes) < (conf->max_nr_stripes * 3 / 4)
>
> so raid5_get_active_stripe has to wait till it becomes true, either increase max_nr_stripes or decrease active_stripes.
>
> 1. Increase max_nr_stripes
> since "mdX_raid6 process seems to go into a busy loop" and R5_ALLOC_MORE is set, if there is enough memory, I suppose mdX_raid6 (raid5d) could alloc new stripe in grow_one_stripe and increase max_nr_stripes. So please check the memory usage of your system.
>
> Another thing is you can try to increase the number of sh manually by write new number to stripe_cache_size if there is enough memory.
>
> 2. Or decrease active_stripes
> This is suppose to be done by do_release_stripe, but STRIPE_HANDLE is set
>
>> - handle_stripe() doesn't make progress:
>>
>> echo "func handle_stripe =pflt" > /sys/kernel/debug/dynamic_debug/control
>>
>> In dmesg, we see the debug output from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4925 but never from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4958:
>>
>> [171908.896651] [1388] handle_stripe:4929: handling stripe 4947089856, state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>>                  , check:4, reconstruct:0
>> [171908.896657] [1388] handle_stripe:4929: handling stripe 4947089872, state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>>                  , check:4, reconstruct:0
>> [171908.896669] [1388] handle_stripe:4929: handling stripe 4947089912, state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>>                  , check:4, reconstruct:0
>>
>> The sector numbers repeat after some time. We have only the following variants of stripe state and "check":
>>
>> state=0x2031 cnt=1, check:0 # ACTIVE        +INSYNC+REPLACED+IO_STARTED, check_state_idle
>> state=0x2029 cnt=1, check:4 # ACTIVE+SYNCING       +REPLACED+IO_STARTED, check_state_check_result
>> state=0x2009 cnt=1, check:0 # ACTIVE+SYNCING                +IO_STARTED, check_state_idle
>>
>> - We have MD_SB_CHANGE_PENDING set:
>
> because MD_SB_CHANGE_PENDING flag. So do_release_stripe can't call atomic_dec(&conf->active_stripes).
>
> Normally, SB_CHANGE_PENDING is cleared from md_update_sb which could be called by md_reap_sync_thread. But md_reap_sync_thread stuck with unregister sync_thread (it was blocked in raid5_get_active_stripe).
>
>
> Still I don't understand well why mdX_raid6 in a busy loop, maybe raid5d can't break from the while(1) loop because "if (!batch_size && !released)" is false. I would assume released is '0' since
> >      released_stripes = {
> >          first = 0x0
> >      }
> And __get_priority_stripe fetches sh from conf->handle_list and delete
> it from handle_list, handle_stripe marks the state of the sh with STRIPE_HANDLE, then do_release_stripe put the sh back to handle_list.
> So batch_size can't be '0', and handle_active_stripes in the loop
> repeats the process in the busy loop. This is my best guess to explain the busy loop situation.
>
>>
>> root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/array_state
>> write-pending
>>
>> gdb confirms that sb_flags = 6 (MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING)
>
> since rdev_set_badblocks could set them, could you check if there is badblock of underlying device (sd*)?

The problem is, we are still not able to run a system into the deadlocked state by a repeatable procedure. So I was not yet able to explicitly check for badblocks.

However, on the production systems, which locked up, and on the test systems, which we managed to lock up, there always was some file system activity to the affected devices aside from mdcheck itself. This alone would explain any clean->write-pending transitions and might be a required condition for the problem to happen. I'm currently trying to explicitly exercise this path (by write/fsync()/sleep), with no result yet.

Btw: When you monitor md/stripe_cache_active during an ongoing "check" on an otherwise idle system, you see a variety of values up to a certain maximum value, which is visible most of the time. But this maximum value seems to increase continuously. Also, when you start the check at higher blocks (via md/sync_min) right away, the maximum value seems to be higher. Can this be explained? A bigger seek gap between the superblocks and the active data area? But should that increase the number of active stripes?
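
For reference, a simple loop like the following is enough for this kind of monitoring (a sketch; md1 and the one-second interval are arbitrary):

    while sleep 1; do
            echo "$(date +%T) $(cat /sys/block/md1/md/sync_action) $(cat /sys/block/md1/md/sync_completed) $(cat /sys/block/md1/md/stripe_cache_active)"
    done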

Best
Donald

>> So it can be assumed that handle_stripe breaks out at https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4939
>>
>> - The system can manually be freed from the deadlock:
>>
>> When `echo active > /sys/block/md1/md/array_state` is used, the scrubbing and other I/O continue. Probably because of https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L4520
>
> Hmm, seems clear SB_CHANGE_PENDING made the trick, so the blocked process can make progress.
>
>>
>> I, of coruse, don't fully understand it yet. Any ideas?
>>
>> I append some data from a hanging raid... (mddev, r5conf and a sample stripe_head from handle_list with it first disks)
>
> These data did help for investigation!
>
> Thanks,
> Guoqing

2021-01-25 23:10:30

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition



On 25.01.21 09:54, Donald Buczek wrote:
> Dear Guoqing,
>
> a colleague of mine was able to produce the issue inside a vm and were able to find a procedure to run the vm into the issue within minutes (not unreliably after hours on a physical system as before). This of course helped to pinpoint the problem.
>
> My current theory of what is happening is:
>
> - MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by md_write_start() when file-system I/O wants to do a write and the array transitions from "clean" to "active". (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308)
>
> - Before raid5d gets to write the superblock (its busy processing active stripes because of the sync activity) , userspace wants to pause the check by `echo idle > /sys/block/mdX/md/sync_action`
>
> - action_store() takes the reconfig_mutex before trying to stop the sync thread. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) Dump of struct mddev of email 1/19/21 confirms reconf_mutex non-zero.
>
> - raid5d is running in its main loop. raid5d()->handle_active_stripes() returns a positive batch size ( https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329 ) although raid5d()->handle_active_stripes()->handle_stripe() doesn't process any stripe because of MD_SB_CHANGE_PENDING. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729 ). This is the reason, raid5d is busy looping.
>
> - raid5d()->md_check_recovery() is called by the raid5d main loop. One of its duties is to write the superblock, if a change is pending. However to do so, it needs either MD_ALLOW_SB_UPDATE or must be able to take the reconfig_mutex. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967 , https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006) Both is not true, so the superblock is not written and MD_SB_CHANGE_PENDING is not cleared.
>
> - (as discussed previously) the sync thread is waiting for the number of active stripes to go down and doesn't terminate. The userspace thread is waiting for the sync thread to terminate.
>
> Does this make sense?
>
> Just for reference, I add the procedure which triggers the issue on the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:
>
> ```
> #! /bin/bash
>
> (
>         while true; do
>                 echo "start check"
>                 echo check > /sys/block/md3/md/sync_action
>                 sleep 10
>                 echo "stop check"
>                 echo idle > /sys/block/md3/md/sync_action
>                 sleep 2
>         done
> ) &
>
> (
>         while true; do
>                 dd bs=1k count=$((5*1024*1024)) if=/dev/zero of=/mnt/raid_md3/bigfile status=none
>                 sync /mnt/raid_md3/bigfile
>                 rm /mnt/raid_md3/bigfile
>                 sleep .1
>         done
> ) &
>
> start="$(date +%s)"
> cd /sys/block/md3/md
> wp_count=0
> while true; do
>         array_state=$(cat array_state)
>         if [ "$array_state" = write-pending ]; then
>                 wp_count=$(($wp_count+1))
>         else
>                 wp_count=0
>         fi
>         echo $(($(date +%s)-$start)) $(cat sync_action) $(cat sync_completed) $array_state $(cat stripe_cache_active)
>         if [ $wp_count -ge 3 ]; then
>                 kill -- -$$
>                 exit
>         fi
>         sleep 1
> done
> ```
>
> The time, this needs to trigger the issue, varies from under a minute to one hour with 5 minute being typical. The output ends like this:
>
>     309 check 6283872 / 8378368 active-idle 4144
>     310 check 6283872 / 8378368 active 1702
>     311 check 6807528 / 8378368 active 4152
>     312 check 7331184 / 8378368 clean 3021
>     stop check
>     313 check 7331184 / 8378368 write-pending 3905
>     314 check 7331184 / 8378368 write-pending 3905
>     315 check 7331184 / 8378368 write-pending 3905
>     Terminated
>
> If I add
>
>     pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n", batch_size, released, mddev->sb_flags);
>
> in raid5d after the call to handle_active_stripes and enable the debug location after the deadlock occurred, I get
>
>     [ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>     [ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>     [ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>     [ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>
> If I add
>
>     pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags);
>
> at the head of md_check_recovery, I get:
>
>     [  789.555462] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>     [  789.555477] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>     [  789.555491] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>     [  789.555505] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>     [  789.555520] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>
> More debug lines in md_check_recovery confirm the control flow ( `if (mddev_trylock(mddev))` block not taken )
>
> What approach would you suggest to fix this?

I naively tried the following patch and it seems to fix the problem. The test procedure didn't trigger the deadlock in 10 hours.

D.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2d21c298ffa7..f40429843906 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len)
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
mddev_lock(mddev) == 0) {
+ set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
flush_workqueue(md_misc_wq);
if (mddev->sync_thread) {
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
md_reap_sync_thread(mddev);
}
+ clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
mddev_unlock(mddev);
}
} else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
--
2.30.0

2021-01-26 02:52:39

by Guoqing Jiang

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi Donald,

On 1/25/21 22:32, Donald Buczek wrote:
>
>
> On 25.01.21 09:54, Donald Buczek wrote:
>> Dear Guoqing,
>>
>> a colleague of mine was able to produce the issue inside a vm and were
>> able to find a procedure to run the vm into the issue within minutes
>> (not unreliably after hours on a physical system as before). This of
>> course helped to pinpoint the problem.
>>
>> My current theory of what is happening is:
>>
>> - MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by
>> md_write_start() when file-system I/O wants to do a write and the
>> array transitions from "clean" to "active".
>> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308)
>>
>> - Before raid5d gets to write the superblock (its busy processing
>> active stripes because of the sync activity) , userspace wants to
>> pause the check by `echo idle > /sys/block/mdX/md/sync_action`
>>
>> - action_store() takes the reconfig_mutex before trying to stop the
>> sync thread.
>> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) Dump
>> of struct mddev of email 1/19/21 confirms reconf_mutex non-zero.
>>
>> - raid5d is running in its main loop.
>> raid5d()->handle_active_stripes() returns a positive batch size (
>> https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329
>> ) although raid5d()->handle_active_stripes()->handle_stripe() doesn't
>> process any stripe because of MD_SB_CHANGE_PENDING.
>> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729
>> ). This is the reason, raid5d is busy looping.
>>
>> - raid5d()->md_check_recovery() is called by the raid5d main loop. One
>> of its duties is to write the superblock, if a change is pending.
>> However to do so, it needs either MD_ALLOW_SB_UPDATE or must be able
>> to take the reconfig_mutex.
>> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967
>> ,
>> https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006)
>> Both is not true, so the superblock is not written and
>> MD_SB_CHANGE_PENDING is not cleared.
>>
>> - (as discussed previously) the sync thread is waiting for the number
>> of active stripes to go down and doesn't terminate. The userspace
>> thread is waiting for the sync thread to terminate.
>>
>> Does this make sense?

Yes, exactly! That was my thought too, the scenario is twisted.

So the resync thread is blocked because there are too many active stripes;
raid5d is in a busy loop since SB_CHANGE_PENDING is set, which means the
number of active stripes can't be decreased; and the echo idle command
can't make progress because the resync thread is blocked, while the command
still holds the reconfig_mutex, which keeps raid5d in the busy loop and
unable to clear the SB_CHANGE_PENDING flag.

And raid5 could suffer from the same issue I think.

>>
>> Just for reference, I add the procedure which triggers the issue on
>> the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:
>>
>> ```
>> #! /bin/bash
>>
>> (
>>          while true; do
>>                  echo "start check"
>>                  echo check > /sys/block/md3/md/sync_action
>>                  sleep 10
>>                  echo "stop check"
>>                  echo idle > /sys/block/md3/md/sync_action
>>                  sleep 2
>>          done
>> ) &
>>
>> (
>>          while true; do
>>                  dd bs=1k count=$((5*1024*1024)) if=/dev/zero
>> of=/mnt/raid_md3/bigfile status=none
>>                  sync /mnt/raid_md3/bigfile
>>                  rm /mnt/raid_md3/bigfile
>>                  sleep .1
>>          done
>> ) &
>>
>> start="$(date +%s)"
>> cd /sys/block/md3/md
>> wp_count=0
>> while true; do
>>          array_state=$(cat array_state)
>>          if [ "$array_state" = write-pending ]; then
>>                  wp_count=$(($wp_count+1))
>>          else
>>                  wp_count=0
>>          fi
>>          echo $(($(date +%s)-$start)) $(cat sync_action) $(cat
>> sync_completed) $array_state $(cat stripe_cache_active)
>>          if [ $wp_count -ge 3 ]; then
>>                  kill -- -$$
>>                  exit
>>          fi
>>          sleep 1
>> done
>> ```
>>
>> The time, this needs to trigger the issue, varies from under a minute
>> to one hour with 5 minute being typical. The output ends like this:
>>
>>      309 check 6283872 / 8378368 active-idle 4144
>>      310 check 6283872 / 8378368 active 1702
>>      311 check 6807528 / 8378368 active 4152
>>      312 check 7331184 / 8378368 clean 3021
>>      stop check
>>      313 check 7331184 / 8378368 write-pending 3905
>>      314 check 7331184 / 8378368 write-pending 3905
>>      315 check 7331184 / 8378368 write-pending 3905
>>      Terminated
>>
>> If I add
>>
>>      pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n",
>> batch_size, released, mddev->sb_flags);
>>
>> in raid5d after the call to handle_active_stripes and enable the debug
>> location after the deadlock occurred, I get
>>
>>      [ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0
>> mdddev->sb_flags 6
>>      [ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0
>> mdddev->sb_flags 6
>>      [ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0
>> mdddev->sb_flags 6
>>      [ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0
>> mdddev->sb_flags 6
>>
>> If I add
>>
>>      pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags
>> %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags);
>>
>> at the head of md_check_recovery, I get:
>>
>>      [  789.555462] [1191] md_check_recovery:8970: XXX 1
>> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>      [  789.555477] [1191] md_check_recovery:8970: XXX 1
>> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>      [  789.555491] [1191] md_check_recovery:8970: XXX 1
>> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>      [  789.555505] [1191] md_check_recovery:8970: XXX 1
>> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>      [  789.555520] [1191] md_check_recovery:8970: XXX 1
>> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>
>> More debug lines in md_check_recovery confirm the control flow ( `if
>> (mddev_trylock(mddev))` block not taken )
>>

It is great that you have a reproducer now!

>> What approach would you suggest to fix this?
>
> I naively tried the following patch and it seems to fix the problem. The
> test procedure didn't trigger the deadlock in 10 hours.
>
> D.
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 2d21c298ffa7..f40429843906 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char
> *page, size_t len)
>              clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>          if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>              mddev_lock(mddev) == 0) {
> +            set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>              flush_workqueue(md_misc_wq);
>              if (mddev->sync_thread) {
>                  set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>                  md_reap_sync_thread(mddev);
>              }
> +            clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>              mddev_unlock(mddev);
>          }
>      } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))

Yes, it could break the deadlock, but I am not sure it is the right way:
we only set ALLOW_SB_UPDATE in suspend, which makes sense since the I/O is
quiesced there, while the write-idle action can't guarantee anything
similar. I would prefer to make the resync thread not wait forever here.

wait_event_lock_irq(
        conf->wait_for_stripe,
        !list_empty(conf->inactive_list + hash) &&
        (atomic_read(&conf->active_stripes)
         < (conf->max_nr_stripes * 3 / 4)
         || !test_bit(R5_INACTIVE_BLOCKED,
                      &conf->cache_state)),
        *(conf->hash_locks + hash));

So, could you please try the attached?

Thanks,
Guoqing


Attachments:
raid5-proposal (5.72 kB)

2021-01-26 11:34:14

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear Guoqing,

On 26.01.21 01:44, Guoqing Jiang wrote:
> Hi Donald,
>
> On 1/25/21 22:32, Donald Buczek wrote:
>>
>>
>> On 25.01.21 09:54, Donald Buczek wrote:
>>> Dear Guoqing,
>>>
>>> a colleague of mine was able to produce the issue inside a vm and were able to find a procedure to run the vm into the issue within minutes (not unreliably after hours on a physical system as before). This of course helped to pinpoint the problem.
>>>
>>> My current theory of what is happening is:
>>>
>>> - MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by md_write_start() when file-system I/O wants to do a write and the array transitions from "clean" to "active". (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308)
>>>
>>> - Before raid5d gets to write the superblock (its busy processing active stripes because of the sync activity) , userspace wants to pause the check by `echo idle > /sys/block/mdX/md/sync_action`
>>>
>>> - action_store() takes the reconfig_mutex before trying to stop the sync thread. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) Dump of struct mddev of email 1/19/21 confirms reconf_mutex non-zero.
>>>
>>> - raid5d is running in its main loop. raid5d()->handle_active_stripes() returns a positive batch size ( https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329 ) although raid5d()->handle_active_stripes()->handle_stripe() doesn't process any stripe because of MD_SB_CHANGE_PENDING. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729 ). This is the reason, raid5d is busy looping.
>>>
>>> - raid5d()->md_check_recovery() is called by the raid5d main loop. One of its duties is to write the superblock, if a change is pending. However to do so, it needs either MD_ALLOW_SB_UPDATE or must be able to take the reconfig_mutex. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967 , https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006) Both is not true, so the superblock is not written and MD_SB_CHANGE_PENDING is not cleared.
>>>
>>> - (as discussed previously) the sync thread is waiting for the number of active stripes to go down and doesn't terminate. The userspace thread is waiting for the sync thread to terminate.
>>>
>>> Does this make sense?
>
> Yes, exactly! That was my thought too, the scenario is twisted.
>
> Then resync thread is blocked due to there are too many active stripes, because raid5d is in busy loop since SB_CHANGE_PENDING is set which means tactive stripes can't be decreased, and echo idle cmd can't make progress given resync thread is blocked while the cmd still hold reconfig_mutex which make raid5d in busy loop and can't clear SB_CHANGE_PENDING flag.
>
> And raid5 could suffer from the same issue I think.
>
>>>
>>> Just for reference, I add the procedure which triggers the issue on the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:
>>>
>>> ```
>>> #! /bin/bash
>>>
>>> (
>>>          while true; do
>>>                  echo "start check"
>>>                  echo check > /sys/block/md3/md/sync_action
>>>                  sleep 10
>>>                  echo "stop check"
>>>                  echo idle > /sys/block/md3/md/sync_action
>>>                  sleep 2
>>>          done
>>> ) &
>>>
>>> (
>>>          while true; do
>>>                  dd bs=1k count=$((5*1024*1024)) if=/dev/zero of=/mnt/raid_md3/bigfile status=none
>>>                  sync /mnt/raid_md3/bigfile
>>>                  rm /mnt/raid_md3/bigfile
>>>                  sleep .1
>>>          done
>>> ) &
>>>
>>> start="$(date +%s)"
>>> cd /sys/block/md3/md
>>> wp_count=0
>>> while true; do
>>>          array_state=$(cat array_state)
>>>          if [ "$array_state" = write-pending ]; then
>>>                  wp_count=$(($wp_count+1))
>>>          else
>>>                  wp_count=0
>>>          fi
>>>          echo $(($(date +%s)-$start)) $(cat sync_action) $(cat sync_completed) $array_state $(cat stripe_cache_active)
>>>          if [ $wp_count -ge 3 ]; then
>>>                  kill -- -$$
>>>                  exit
>>>          fi
>>>          sleep 1
>>> done
>>> ```
>>>
>>> The time, this needs to trigger the issue, varies from under a minute to one hour with 5 minute being typical. The output ends like this:
>>>
>>>      309 check 6283872 / 8378368 active-idle 4144
>>>      310 check 6283872 / 8378368 active 1702
>>>      311 check 6807528 / 8378368 active 4152
>>>      312 check 7331184 / 8378368 clean 3021
>>>      stop check
>>>      313 check 7331184 / 8378368 write-pending 3905
>>>      314 check 7331184 / 8378368 write-pending 3905
>>>      315 check 7331184 / 8378368 write-pending 3905
>>>      Terminated
>>>
>>> If I add
>>>
>>>      pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n", batch_size, released, mddev->sb_flags);
>>>
>>> in raid5d after the call to handle_active_stripes and enable the debug location after the deadlock occurred, I get
>>>
>>>      [ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>>>      [ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>>>      [ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>>>      [ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>>>
>>> If I add
>>>
>>>      pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags);
>>>
>>> at the head of md_check_recovery, I get:
>>>
>>>      [  789.555462] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>>      [  789.555477] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>>      [  789.555491] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>>      [  789.555505] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>>      [  789.555520] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>>
>>> More debug lines in md_check_recovery confirm the control flow ( `if (mddev_trylock(mddev))` block not taken )
>>>
>
> That is great that you have a reproducer now!
>
>>> What approach would you suggest to fix this?
>>
>> I naively tried the following patch and it seems to fix the problem. The test procedure didn't trigger the deadlock in 10 hours.
>>
>> D.
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 2d21c298ffa7..f40429843906 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len)
>>               clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>           if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>>               mddev_lock(mddev) == 0) {
>> +            set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>>               flush_workqueue(md_misc_wq);
>>               if (mddev->sync_thread) {
>>                   set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>>                   md_reap_sync_thread(mddev);
>>               }
>> +            clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>>               mddev_unlock(mddev);
>>           }
>>       } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
>
> Yes, it could break the deadlock issue, but I am not sure if it is the right way given we only set ALLOW_SB_UPDATE in suspend which makes sense since the io will be quiesced, but write idle action can't guarantee the  similar thing.

Thinking (and documenting) MD_ALLOW_SB_UPDATE as "the holder of reconfig_mutex promises not to make any changes which would exclude superblocks from being written" might make it easier to accept the usage.

> I prefer to make resync thread not wait forever here.
>
> wait_event_lock_irq(
>     conf->wait_for_stripe,
>     !list_empty(conf->inactive_list + hash) &&
>     (atomic_read(&conf->active_stripes)
>     < (conf->max_nr_stripes * 3 / 4)
>     || !test_bit(R5_INACTIVE_BLOCKED,
>              &conf->cache_state)
>     *(conf->hash_locks + hash));
>
> So, could you please try the attached?
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 4337ae0..378ce5c 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -1931,7 +1931,7 @@ r5c_recovery_alloc_stripe(
> {
> struct stripe_head *sh;
>
> - sh = raid5_get_active_stripe(conf, stripe_sect, 0, noblock, 0);
> + sh = raid5_get_active_stripe(conf, stripe_sect, 0, 0, noblock, 0);
> if (!sh)
> return NULL; /* no more stripe available */
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 8888973..33a2a22 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -700,10 +700,11 @@ static int has_failed(struct r5conf *conf)
> }
>
> struct stripe_head *
> -raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
> +raid5_get_active_stripe(struct r5conf *conf, sector_t sector, int sync_req,
> int previous, int noblock, int noquiesce)
> {
> struct stripe_head *sh;
> + struct mddev *mddev = conf->mddev;
> int hash = stripe_hash_locks_hash(conf, sector);
> int inc_empty_inactive_list_flag;
>
> @@ -738,8 +739,14 @@ raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
> (atomic_read(&conf->active_stripes)
> < (conf->max_nr_stripes * 3 / 4)
> || !test_bit(R5_INACTIVE_BLOCKED,
> - &conf->cache_state)),
> + &conf->cache_state)
> + || (test_bit(MD_RECOVERY_INTR,
> + &mddev->recovery)
> + && sync_req)),
> *(conf->hash_locks + hash));
> + if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)
> + && sync_req)
> + break;
> clear_bit(R5_INACTIVE_BLOCKED,
> &conf->cache_state);
> } else {
> @@ -4527,7 +4534,7 @@ static void handle_stripe_expansion(struct r5conf *conf, struct stripe_head *sh)
> sector_t bn = raid5_compute_blocknr(sh, i, 1);
> sector_t s = raid5_compute_sector(conf, bn, 0,
> &dd_idx, NULL);
> - sh2 = raid5_get_active_stripe(conf, s, 0, 1, 1);
> + sh2 = raid5_get_active_stripe(conf, s, 0, 0, 1, 1);
> if (sh2 == NULL)
> /* so far only the early blocks of this stripe
> * have been requested. When later blocks
> @@ -5164,7 +5171,7 @@ static void handle_stripe(struct stripe_head *sh)
> /* Finish reconstruct operations initiated by the expansion process */
> if (sh->reconstruct_state == reconstruct_state_result) {
> struct stripe_head *sh_src
> - = raid5_get_active_stripe(conf, sh->sector, 1, 1, 1);
> + = raid5_get_active_stripe(conf, sh->sector, 0, 1, 1, 1);
> if (sh_src && test_bit(STRIPE_EXPAND_SOURCE, &sh_src->state)) {
> /* sh cannot be written until sh_src has been read.
> * so arrange for sh to be delayed a little
> @@ -5705,7 +5712,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
> DEFINE_WAIT(w);
> int d;
> again:
> - sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0);
> + sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0, 0);
> prepare_to_wait(&conf->wait_for_overlap, &w,
> TASK_UNINTERRUPTIBLE);
> set_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags);
> @@ -5861,7 +5868,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
> (unsigned long long)new_sector,
> (unsigned long long)logical_sector);
>
> - sh = raid5_get_active_stripe(conf, new_sector, previous,
> + sh = raid5_get_active_stripe(conf, new_sector, previous, 0,


Mistake here (fourth argument added instead of third)

> (bi->bi_opf & REQ_RAHEAD), 0);
> if (sh) {
> if (unlikely(previous)) {
> @@ -6100,7 +6107,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
> for (i = 0; i < reshape_sectors; i += RAID5_STRIPE_SECTORS(conf)) {
> int j;
> int skipped_disk = 0;
> - sh = raid5_get_active_stripe(conf, stripe_addr+i, 0, 0, 1);
> + sh = raid5_get_active_stripe(conf, stripe_addr+i, 0, 0, 0, 1);
> set_bit(STRIPE_EXPANDING, &sh->state);
> atomic_inc(&conf->reshape_stripes);
> /* If any of this stripe is beyond the end of the old
> @@ -6149,7 +6156,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
> if (last_sector >= mddev->dev_sectors)
> last_sector = mddev->dev_sectors - 1;
> while (first_sector <= last_sector) {
> - sh = raid5_get_active_stripe(conf, first_sector, 1, 0, 1);
> + sh = raid5_get_active_stripe(conf, first_sector, 0, 1, 0, 1);
> set_bit(STRIPE_EXPAND_SOURCE, &sh->state);
> set_bit(STRIPE_HANDLE, &sh->state);
> raid5_release_stripe(sh);
> @@ -6269,9 +6276,14 @@ static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_n
>
> md_bitmap_cond_end_sync(mddev->bitmap, sector_nr, false);
>
> - sh = raid5_get_active_stripe(conf, sector_nr, 0, 1, 0);
> + sh = raid5_get_active_stripe(conf, sector_nr, 1, 0, 1, 0);
> if (sh == NULL) {
> - sh = raid5_get_active_stripe(conf, sector_nr, 0, 0, 0);
> + sh = raid5_get_active_stripe(conf, sector_nr, 1, 0, 0, 0);
> + if (!sh && test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
> + *skipped = 1;
> + return 0;
> + }
> +
> /* make sure we don't swamp the stripe cache if someone else
> * is trying to get access
> */
> @@ -6334,7 +6346,7 @@ static int retry_aligned_read(struct r5conf *conf, struct bio *raid_bio,
> /* already done this stripe */
> continue;
>
> - sh = raid5_get_active_stripe(conf, sector, 0, 1, 1);
> + sh = raid5_get_active_stripe(conf, sector, 0, 0, 1, 1);
>
> if (!sh) {
> /* failed to get a stripe - must wait */
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 5c05acf..d9eab46 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -806,7 +806,7 @@ extern sector_t raid5_compute_sector(struct r5conf *conf, sector_t r_sector,
> int previous, int *dd_idx,
> struct stripe_head *sh);
> extern struct stripe_head *
> -raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
> +raid5_get_active_stripe(struct r5conf *conf, sector_t sector, int sync_req,
> int previous, int noblock, int noquiesce);
> extern int raid5_calc_degraded(struct r5conf *conf);
> extern int r5c_journal_mode_set(struct mddev *mddev, int journal_mode);

Unfortunately, this patch did not fix the issue.

root@sloth:~/linux# cat /proc/$(pgrep md3_resync)/stack
[<0>] raid5_get_active_stripe+0x1e7/0x6b0
[<0>] raid5_sync_request+0x2a7/0x3d0
[<0>] md_do_sync.cold+0x3ee/0x97c
[<0>] md_thread+0xab/0x160
[<0>] kthread+0x11b/0x140
[<0>] ret_from_fork+0x22/0x30

which is ( according to objdump -d -Sl drivers/md/raid5.o ) at https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735

Isn't it still the case that the superblocks are not written, therefore stripes are not processed, and therefore the number of active stripes is not decreasing? Who is expected to wake up the conf->wait_for_stripe waiters?
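
(A quick grep over the source lists the candidates, i.e. both the sleepers and the wake_up() sites for that waitqueue; just a sketch:)

    grep -n 'wait_for_stripe' drivers/md/raid5.c drivers/md/raid5.h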

Best
Donald

>
> Thanks,
> Guoqing

2021-01-26 11:49:31

by Guoqing Jiang

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi Donald,

On 1/26/21 10:50, Donald Buczek wrote:
[...]

>>>
>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>> index 2d21c298ffa7..f40429843906 100644
>>> --- a/drivers/md/md.c
>>> +++ b/drivers/md/md.c
>>> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char
>>> *page, size_t len)
>>>               clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>>           if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>>>               mddev_lock(mddev) == 0) {
>>> +            set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>>>               flush_workqueue(md_misc_wq);
>>>               if (mddev->sync_thread) {
>>>                   set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>>>                   md_reap_sync_thread(mddev);
>>>               }
>>> +            clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>>>               mddev_unlock(mddev);
>>>           }
>>>       } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
>>
>> Yes, it could break the deadlock issue, but I am not sure if it is the
>> right way given we only set ALLOW_SB_UPDATE in suspend which makes
>> sense since the io will be quiesced, but write idle action can't
>> guarantee the  similar thing.
>
> Thinking (and documenting) MD_ALLOW_SB_UPDATE as "the holder of
> reconfig_mutex promises not to make any changes which would exclude
> superblocks from being written" might make it easier to accept the usage.

I am not sure it is safe to set the flag here, since "write idle" can't
prevent I/O from the filesystem, while mddev_suspend can guarantee that.

>
>> I prefer to make resync thread not wait forever here.
>>

[...]

>>
>> -        sh = raid5_get_active_stripe(conf, new_sector, previous,
>> +        sh = raid5_get_active_stripe(conf, new_sector, previous, 0,
>
>
> Mistake here (fourth argument added instead of third)

Thanks for checking.

[...]

> Unfortunately, this patch did not fix the issue.
>
>     root@sloth:~/linux# cat /proc/$(pgrep md3_resync)/stack
>     [<0>] raid5_get_active_stripe+0x1e7/0x6b0
>     [<0>] raid5_sync_request+0x2a7/0x3d0
>     [<0>] md_do_sync.cold+0x3ee/0x97c
>     [<0>] md_thread+0xab/0x160
>     [<0>] kthread+0x11b/0x140
>     [<0>] ret_from_fork+0x22/0x30
>
> which is ( according to objdump -d -Sl drivers/md/raid5.o ) at
> https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
>
> Isn't it still the case that the superblocks are not written, therefore
> stripes are not processed, therefore number of active stripes are not
> decreasing? Who is expected to wake up conf->wait_for_stripe waiters?

Hmm, how about waking the waiters up in the while loop of raid5d?

@@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
                        md_check_recovery(mddev);
                        spin_lock_irq(&conf->device_lock);
                }
+
+               if ((atomic_read(&conf->active_stripes)
+                    < (conf->max_nr_stripes * 3 / 4) ||
+                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
+                       wake_up(&conf->wait_for_stripe);
        }
        pr_debug("%d stripes handled\n", handled);

If the issue still appears, then I will change the waiter to break out as
soon as MD_RECOVERY_INTR is set, something like:

wait_event_lock_irq(conf->wait_for_stripe,
        (test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) ||
        /* the previous condition */,
        *(conf->hash_locks + hash));

Thanks,
Guoqing

2021-01-26 13:04:04

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition



On 26.01.21 12:14, Guoqing Jiang wrote:
> Hi Donald,
>
> On 1/26/21 10:50, Donald Buczek wrote:
> [...]
>
>>>>
>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>>> index 2d21c298ffa7..f40429843906 100644
>>>> --- a/drivers/md/md.c
>>>> +++ b/drivers/md/md.c
>>>> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len)
>>>>               clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>>>           if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>>>>               mddev_lock(mddev) == 0) {
>>>> +            set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>>>>               flush_workqueue(md_misc_wq);
>>>>               if (mddev->sync_thread) {
>>>>                   set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>>>>                   md_reap_sync_thread(mddev);
>>>>               }
>>>> +            clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>>>>               mddev_unlock(mddev);
>>>>           }
>>>>       } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
>>>
>>> Yes, it could break the deadlock issue, but I am not sure if it is the right way given we only set ALLOW_SB_UPDATE in suspend which makes sense since the io will be quiesced, but write idle action can't guarantee the  similar thing.
>>
>> Thinking (and documenting) MD_ALLOW_SB_UPDATE as "the holder of reconfig_mutex promises not to make any changes which would exclude superblocks from being written" might make it easier to accept the usage.
>
> I am not sure it is safe to set the flag here since write idle can't prevent IO from fs while mddev_suspend can guarantee that.
>
>>
>>> I prefer to make resync thread not wait forever here.
>>>
>
> [...]
>
>>>
>>> -        sh = raid5_get_active_stripe(conf, new_sector, previous,
>>> +        sh = raid5_get_active_stripe(conf, new_sector, previous, 0,
>>
>>
>> Mistake here (fourth argument added instead of third)
>
> Thanks for checking.
>
> [...]
>
>> Unfortunately, this patch did not fix the issue.
>>
>>      root@sloth:~/linux# cat /proc/$(pgrep md3_resync)/stack
>>      [<0>] raid5_get_active_stripe+0x1e7/0x6b0
>>      [<0>] raid5_sync_request+0x2a7/0x3d0
>>      [<0>] md_do_sync.cold+0x3ee/0x97c
>>      [<0>] md_thread+0xab/0x160
>>      [<0>] kthread+0x11b/0x140
>>      [<0>] ret_from_fork+0x22/0x30
>>
>> which is ( according to objdump -d -Sl drivers/md/raid5.o ) at https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
>>
>> Isn't it still the case that the superblocks are not written, therefore stripes are not processed, therefore number of active stripes are not decreasing? Who is expected to wake up conf->wait_for_stripe waiters?
>
> Hmm, how about wake the waiter up in the while loop of raid5d?
>
> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
>                         md_check_recovery(mddev);
>                         spin_lock_irq(&conf->device_lock);
>                 }
> +
> +               if ((atomic_read(&conf->active_stripes)
> +                    < (conf->max_nr_stripes * 3 / 4) ||
> +                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
> +                       wake_up(&conf->wait_for_stripe);
>         }
>         pr_debug("%d stripes handled\n", handled);

Hmm... With this patch on top of your other one, we still have the basic symptoms (md3_raid6 busy looping), but the sync thread is now hanging at

root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
[<0>] md_do_sync.cold+0x8ec/0x97c
[<0>] md_thread+0xab/0x160
[<0>] kthread+0x11b/0x140
[<0>] ret_from_fork+0x22/0x30

instead, which is https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963

And, unlike before, "md: md3: data-check interrupted." from the pr_info two lines above appears in dmesg.

Best
Donald

> If the issue still appears then I will change the waiter to break just if MD_RECOVERY_INTR is set, something like.
>
> wait_event_lock_irq(conf->wait_for_stripe,
>     (test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) ||
>      /* the previous condition */,
>     *(conf->hash_locks + hash));
>
> Thanks,
> Guoqing

2021-01-26 14:15:22

by Guoqing Jiang

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition



On 1/26/21 13:58, Donald Buczek wrote:
>
>
>> Hmm, how about wake the waiter up in the while loop of raid5d?
>>
>> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
>>                          md_check_recovery(mddev);
>>                          spin_lock_irq(&conf->device_lock);
>>                  }
>> +
>> +               if ((atomic_read(&conf->active_stripes)
>> +                    < (conf->max_nr_stripes * 3 / 4) ||
>> +                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
>> +                       wake_up(&conf->wait_for_stripe);
>>          }
>>          pr_debug("%d stripes handled\n", handled);
>
> Hmm... With this patch on top of your other one, we still have the basic
> symptoms (md3_raid6 busy looping), but the sync thread is now hanging at
>
>     root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>     [<0>] md_do_sync.cold+0x8ec/0x97c
>     [<0>] md_thread+0xab/0x160
>     [<0>] kthread+0x11b/0x140
>     [<0>] ret_from_fork+0x22/0x30
>
> instead, which is
> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963

I am not sure why recovery_active is not zero: it is set to 0 before
blk_start_plug, raid5_sync_request returns 0, and skipped is also set to 1.
Perhaps handle_stripe calls md_done_sync.

Could you double-check the value of recovery_active? Or just don't wait if
the resync thread is interrupted:

wait_event(mddev->recovery_wait,
           test_bit(MD_RECOVERY_INTR, &mddev->recovery) ||
           !atomic_read(&mddev->recovery_active));

> And, unlike before, "md: md3: data-check interrupted." from the pr_info
> two lines above appears in dmesg.

Yes, that is intentional since MD_RECOVERY_INTR is set by write idle.

Anyway, I will try the script and investigate the issue further.

Thanks,
Guoqing

2021-01-26 16:11:23

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear Guoqing,

On 26.01.21 15:06, Guoqing Jiang wrote:
>
>
> On 1/26/21 13:58, Donald Buczek wrote:
>>
>>
>>> Hmm, how about wake the waiter up in the while loop of raid5d?
>>>
>>> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
>>>                          md_check_recovery(mddev);
>>>                          spin_lock_irq(&conf->device_lock);
>>>                  }
>>> +
>>> +               if ((atomic_read(&conf->active_stripes)
>>> +                    < (conf->max_nr_stripes * 3 / 4) ||
>>> +                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
>>> +                       wake_up(&conf->wait_for_stripe);
>>>          }
>>>          pr_debug("%d stripes handled\n", handled);
>>
>> Hmm... With this patch on top of your other one, we still have the basic symptoms (md3_raid6 busy looping), but the sync thread is now hanging at
>>
>>      root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>>      [<0>] md_do_sync.cold+0x8ec/0x97c
>>      [<0>] md_thread+0xab/0x160
>>      [<0>] kthread+0x11b/0x140
>>      [<0>] ret_from_fork+0x22/0x30
>>
>> instead, which is https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963
>
> Not sure why recovery_active is not zero, because it is set 0 before blk_start_plug, and raid5_sync_request returns 0 and skipped is also set to 1. Perhaps handle_stripe calls md_done_sync.
>
> Could you double check the value of recovery_active? Or just don't wait if resync thread is interrupted.
>
> wait_event(mddev->recovery_wait,
>        test_bit(MD_RECOVERY_INTR,&mddev->recovery) ||
>        !atomic_read(&mddev->recovery_active));

With that added, md3_resync goes into a loop, too. Not 100% busy, though.

root@sloth:~# cat /proc/$(pgrep md3_resync)/stack

[<0>] raid5_get_active_stripe+0x1e7/0x6b0 # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
[<0>] raid5_sync_request+0x2a7/0x3d0 # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L6274
[<0>] md_do_sync.cold+0x3ee/0x97c # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/md.c#L8883
[<0>] md_thread+0xab/0x160
[<0>] kthread+0x11b/0x140
[<0>] ret_from_fork+0x22/0x30

Sometimes the top of the stack is raid5_get_active_stripe+0x1ef/0x6b0 instead of raid5_get_active_stripe+0x1e7/0x6b0, so I guess it sleeps, it's woken, but the conditions don't match, so it sleeps again.

Best
Donald

>
>> And, unlike before, "md: md3: data-check interrupted." from the pr_info two lines above appears in dmesg.
>
> Yes, that is intentional since MD_RECOVERY_INTR is set by write idle.
>
> Anyway, will try the script and investigate more about the issue.
>
> Thanks,
> Guoqing

--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2021-01-26 21:58:44

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear Guoqing,

a colleague of mine was able to reproduce the issue inside a VM and to find a procedure which drives the VM into the issue within minutes (rather than unreliably after hours on a physical system, as before). This, of course, helped to pinpoint the problem.

My current theory of what is happening is:

- MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by md_write_start() when file-system I/O wants to do a write and the array transitions from "clean" to "active". (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308)

- Before raid5d gets to write the superblock (it's busy processing active stripes because of the sync activity), userspace wants to pause the check with `echo idle > /sys/block/mdX/md/sync_action`.

- action_store() takes the reconfig_mutex before trying to stop the sync thread. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) The dump of struct mddev from the email of 1/19/21 confirms that reconfig_mutex is non-zero.

- raid5d is running in its main loop. raid5d()->handle_active_stripes() returns a positive batch size ( https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329 ) although raid5d()->handle_active_stripes()->handle_stripe() doesn't process any stripe because of MD_SB_CHANGE_PENDING (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729). This is the reason raid5d is busy-looping.

- raid5d()->md_check_recovery() is called by the raid5d main loop. One of its duties is to write the superblock if a change is pending. However, to do so, it either needs MD_ALLOW_SB_UPDATE or must be able to take the reconfig_mutex (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967 , https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006). Neither is true, so the superblock is not written and MD_SB_CHANGE_PENDING is not cleared.

- (as discussed previously) the sync thread is waiting for the number of active stripes to go down and doesn't terminate. The userspace thread is waiting for the sync thread to terminate.

Does this make sense?

Just for reference, I add the procedure which triggers the issue on the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:

```
#! /bin/bash

(
while true; do
echo "start check"
echo check > /sys/block/md3/md/sync_action
sleep 10
echo "stop check"
echo idle > /sys/block/md3/md/sync_action
sleep 2
done
) &

(
while true; do
dd bs=1k count=$((5*1024*1024)) if=/dev/zero of=/mnt/raid_md3/bigfile status=none
sync /mnt/raid_md3/bigfile
rm /mnt/raid_md3/bigfile
sleep .1
done
) &

start="$(date +%s)"
cd /sys/block/md3/md
wp_count=0
while true; do
array_state=$(cat array_state)
if [ "$array_state" = write-pending ]; then
wp_count=$(($wp_count+1))
else
wp_count=0
fi
echo $(($(date +%s)-$start)) $(cat sync_action) $(cat sync_completed) $array_state $(cat stripe_cache_active)
if [ $wp_count -ge 3 ]; then
kill -- -$$
exit
fi
sleep 1
done
```

The time this needs to trigger the issue varies from under a minute to one hour, with 5 minutes being typical. The output ends like this:

309 check 6283872 / 8378368 active-idle 4144
310 check 6283872 / 8378368 active 1702
311 check 6807528 / 8378368 active 4152
312 check 7331184 / 8378368 clean 3021
stop check
313 check 7331184 / 8378368 write-pending 3905
314 check 7331184 / 8378368 write-pending 3905
315 check 7331184 / 8378368 write-pending 3905
Terminated
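
(When it ends like this, the state discussed in this thread can be captured in one go — a minimal sketch, assuming the md3 array and thread names from the script above, run as root:)

```
# Capture the relevant state once the script prints "Terminated".
cat /sys/block/md3/md/array_state               # expected to stay "write-pending"
cat /proc/"$(pgrep md3_raid6)"/stack            # busy-looping raid thread
cat /proc/"$(pgrep md3_resync)"/stack           # hanging sync thread
top -b -n 1 -p "$(pgrep md3_raid6)" | tail -n 1 # ~100% CPU for md3_raid6
# Find whoever is blocked writing to sync_action (the "echo idle" writer):
for s in /proc/[0-9]*/stack; do
    grep -q action_store "$s" 2>/dev/null && { echo "$s"; cat "$s"; }
done
```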

If I add

pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n", batch_size, released, mddev->sb_flags);

in raid5d after the call to handle_active_stripes and enable the debug location after the deadlock occurred, I get

[ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
[ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
[ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
[ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6

If I add

pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags);

at the head of md_check_recovery, I get:

[ 789.555462] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
[ 789.555477] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
[ 789.555491] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
[ 789.555505] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
[ 789.555520] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006

More debug lines in md_check_recovery confirm the control flow ( `if (mddev_trylock(mddev))` block not taken )
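
(For reference, such pr_debug() locations can be toggled at run time through dynamic debug, so no reboot is needed between experiments — assuming CONFIG_DYNAMIC_DEBUG=y and debugfs mounted at /sys/kernel/debug:)

```
# Enable the pr_debug() calls after the deadlock has occurred ...
echo 'func raid5d +p' > /sys/kernel/debug/dynamic_debug/control
echo 'func md_check_recovery +p' > /sys/kernel/debug/dynamic_debug/control
# ... and switch them off again once enough lines have reached dmesg:
echo 'func raid5d -p' > /sys/kernel/debug/dynamic_debug/control
echo 'func md_check_recovery -p' > /sys/kernel/debug/dynamic_debug/control
```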

What approach would you suggest to fix this?

Best
Donald


2021-02-02 22:43:46

by Guoqing Jiang

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi Donald,

On 1/26/21 17:05, Donald Buczek wrote:
> Dear Guoqing,
>
> On 26.01.21 15:06, Guoqing Jiang wrote:
>>
>>
>> On 1/26/21 13:58, Donald Buczek wrote:
>>>
>>>
>>>> Hmm, how about wake the waiter up in the while loop of raid5d?
>>>>
>>>> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
>>>>                          md_check_recovery(mddev);
>>>>                          spin_lock_irq(&conf->device_lock);
>>>>                  }
>>>> +
>>>> +               if ((atomic_read(&conf->active_stripes)
>>>> +                    < (conf->max_nr_stripes * 3 / 4) ||
>>>> +                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
>>>> +                       wake_up(&conf->wait_for_stripe);
>>>>          }
>>>>          pr_debug("%d stripes handled\n", handled);
>>>
>>> Hmm... With this patch on top of your other one, we still have the
>>> basic symptoms (md3_raid6 busy looping), but the sync thread is now
>>> hanging at
>>>
>>>      root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>>>      [<0>] md_do_sync.cold+0x8ec/0x97c
>>>      [<0>] md_thread+0xab/0x160
>>>      [<0>] kthread+0x11b/0x140
>>>      [<0>] ret_from_fork+0x22/0x30
>>>
>>> instead, which is
>>> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963
>>
>> Not sure why recovery_active is not zero, because it is set 0 before
>> blk_start_plug, and raid5_sync_request returns 0 and skipped is also
>> set to 1. Perhaps handle_stripe calls md_done_sync.
>>
>> Could you double check the value of recovery_active? Or just don't
>> wait if resync thread is interrupted.
>>
>> wait_event(mddev->recovery_wait,
>>         test_bit(MD_RECOVERY_INTR,&mddev->recovery) ||
>>         !atomic_read(&mddev->recovery_active));
>
> With that added, md3_resync goes into a loop, too. Not 100% busy, though.
>
> root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>
> [<0>] raid5_get_active_stripe+0x1e7/0x6b0  #
> https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
> [<0>] raid5_sync_request+0x2a7/0x3d0       #
> https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L6274
> [<0>] md_do_sync.cold+0x3ee/0x97c          #
> https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/md.c#L8883
> [<0>] md_thread+0xab/0x160
> [<0>] kthread+0x11b/0x140
> [<0>] ret_from_fork+0x22/0x30
>
> Sometimes top of stack is raid5_get_active_stripe+0x1ef/0x6b0 instead of
> raid5_get_active_stripe+0x1e7/0x6b0, so I guess it sleeps, its woken,
> but the conditions don't match so its sleeps again.

I don't know why the condition was not true after the change since the
RECOVERY_INTR is set and the caller is raid5_sync_request.

wait_event_lock_irq(conf->wait_for_stripe,
                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) ||
                    /* the previous condition */,
                    *(conf->hash_locks + hash));


BTW, I think there are some possible ways:

1. let "echo idle" give up the reconfig_mutex if there is only a limited
number of active stripes, but I feel it is ugly to check the sh
(stripe_head) count from action_store (kind of a layer violation).

2. make raid5_sync_request -> raid5_get_active_stripe able to quit from
the current situation (this is what we tried, though it doesn't work so
far).

3. set MD_ALLOW_SB_UPDATE as you said, though I am not sure about the
safety (but maybe I am wrong).

4. given that write IO keeps coming from the upper layer, which decreases
the available stripes, maybe we need to call grow_one_stripe at the
beginning of raid5_make_request for this case, then call drop_one_stripe
at the end of make_request.

5. maybe don't hold reconfig_mutex when try to unregister sync_thread,
like this.

/* resync has finished, collect result */
mddev_unlock(mddev);
md_unregister_thread(&mddev->sync_thread);
mddev_lock(mddev);



My suggestion would be to try 2 + 4 together, since the reproducer
triggers both sync IO and write IO. Or try 5. Perhaps there is a better
way which I just can't find.

Thanks,
Guoqing

2021-02-08 12:03:13

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On 02.02.21 16:42, Guoqing Jiang wrote:
> Hi Donald,
>
> On 1/26/21 17:05, Donald Buczek wrote:
>> Dear Guoqing,
>>
>> On 26.01.21 15:06, Guoqing Jiang wrote:
>>>
>>>
>>> On 1/26/21 13:58, Donald Buczek wrote:
>>>>
>>>>
>>>>> Hmm, how about wake the waiter up in the while loop of raid5d?
>>>>>
>>>>> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
>>>>>                          md_check_recovery(mddev);
>>>>>                          spin_lock_irq(&conf->device_lock);
>>>>>                  }
>>>>> +
>>>>> +               if ((atomic_read(&conf->active_stripes)
>>>>> +                    < (conf->max_nr_stripes * 3 / 4) ||
>>>>> +                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
>>>>> +                       wake_up(&conf->wait_for_stripe);
>>>>>          }
>>>>>          pr_debug("%d stripes handled\n", handled);
>>>>
>>>> Hmm... With this patch on top of your other one, we still have the basic symptoms (md3_raid6 busy looping), but the sync thread is now hanging at
>>>>
>>>>      root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>>>>      [<0>] md_do_sync.cold+0x8ec/0x97c
>>>>      [<0>] md_thread+0xab/0x160
>>>>      [<0>] kthread+0x11b/0x140
>>>>      [<0>] ret_from_fork+0x22/0x30
>>>>
>>>> instead, which is https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963
>>>
>>> Not sure why recovery_active is not zero, because it is set 0 before blk_start_plug, and raid5_sync_request returns 0 and skipped is also set to 1. Perhaps handle_stripe calls md_done_sync.
>>>
>>> Could you double check the value of recovery_active? Or just don't wait if resync thread is interrupted.
>>>
>>> wait_event(mddev->recovery_wait,
>>>         test_bit(MD_RECOVERY_INTR,&mddev->recovery) ||
>>>         !atomic_read(&mddev->recovery_active));
>>
>> With that added, md3_resync goes into a loop, too. Not 100% busy, though.
>>
>> root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>>
>> [<0>] raid5_get_active_stripe+0x1e7/0x6b0  # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
>> [<0>] raid5_sync_request+0x2a7/0x3d0       # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L6274
>> [<0>] md_do_sync.cold+0x3ee/0x97c          # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/md.c#L8883
>> [<0>] md_thread+0xab/0x160
>> [<0>] kthread+0x11b/0x140
>> [<0>] ret_from_fork+0x22/0x30
>>
>> Sometimes top of stack is raid5_get_active_stripe+0x1ef/0x6b0 instead of raid5_get_active_stripe+0x1e7/0x6b0, so I guess it sleeps, its woken, but the conditions don't match so its sleeps again.
>
> I don't know why the condition was not true after the change since the RECOVERY_INTR is set and the caller is raid5_sync_request.
>
> wait_event_lock_irq(conf->wait_for_stripe,
>     (test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) ||
>      /* the previous condition */,
>     *(conf->hash_locks + hash));

Not knowing the system by heart, I'd probably need more than a full day to analyze that. Please let me know if this is needed.

> BTW, I think there some some possible ways:
>
> 1. let "echo idle" give up the reconfig_mutex if there are limited number of active stripe, but I feel it is ugly to check sh number from action_store (kind of layer violation).

My first thought is that either holding the mutex is needed or it is not. Why should this depend on the number of available stripes?

And wouldn't we get into the same situation if another reconfiguration from userspace takes the mutex again?

> 2. make raid5_sync_request -> raid5_get_active_stripe can quit from the current situation (this was we tried, though it doesn't work so far).
>
> 3. set MD_ALLOW_SB_UPDATE as you said though I am not sure the safety (but maybe I am wrong).
>
> 4. given the write IO keeps coming from upper layer which decrease the available stripes. Maybe we need to call grow_one_stripe at the beginning of raid5_make_request for this case, then call drop_one_stripe
> at the end of make_request.
>
> 5. maybe don't hold reconfig_mutex when try to unregister sync_thread, like this.
>
>         /* resync has finished, collect result */
>         mddev_unlock(mddev);
>         md_unregister_thread(&mddev->sync_thread);
>         mddev_lock(mddev);

As above: While we wait for the sync thread to terminate, wouldn't it be a problem, if another user space operation takes the mutex?

> My suggestion would be try 2 + 4 together since the reproducer triggers both sync io and write io. Or try 5. Perhaps there is better way which I just can't find.

I just get a bit suspicious that there might be something wrong with the abstraction if special code needs to be added here, here and here.

But I'm afraid that, aside from testing, I can't be of too much help here, as I don't know all the components or details of the system that well.

Best
Donald

> Thanks,
> Guoqing

2021-02-08 14:58:20

by Guoqing Jiang

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition



On 2/8/21 12:38, Donald Buczek wrote:
>> 5. maybe don't hold reconfig_mutex when try to unregister sync_thread,
>> like this.
>>
>>          /* resync has finished, collect result */
>>          mddev_unlock(mddev);
>>          md_unregister_thread(&mddev->sync_thread);
>>          mddev_lock(mddev);
>
> As above: While we wait for the sync thread to terminate, wouldn't it be
> a problem, if another user space operation takes the mutex?

I don't think other places can be blocked while holding the mutex, otherwise
those places could cause a potential deadlock. Please try the above two-line
change. And perhaps others have a better idea.

Thanks,
Guoqing

2021-02-08 20:17:11

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear Guoqing,

On 08.02.21 15:53, Guoqing Jiang wrote:
>
>
> On 2/8/21 12:38, Donald Buczek wrote:
>>> 5. maybe don't hold reconfig_mutex when try to unregister sync_thread, like this.
>>>
>>>          /* resync has finished, collect result */
>>>          mddev_unlock(mddev);
>>>          md_unregister_thread(&mddev->sync_thread);
>>>          mddev_lock(mddev);
>>
>> As above: While we wait for the sync thread to terminate, wouldn't it be a problem, if another user space operation takes the mutex?
>
> I don't think other places can be blocked while hold mutex, otherwise these places can cause potential deadlock. Please try above two lines change. And perhaps others have better idea.

Yes, this works. No deadlock after >11000 seconds,

(Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265, 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )

Best
Donald
>
> Thanks,
> Guoqing

2021-02-09 00:51:37

by Guoqing Jiang

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi Donald,

On 2/8/21 19:41, Donald Buczek wrote:
> Dear Guoqing,
>
> On 08.02.21 15:53, Guoqing Jiang wrote:
>>
>>
>> On 2/8/21 12:38, Donald Buczek wrote:
>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>>> sync_thread, like this.
>>>>
>>>>          /* resync has finished, collect result */
>>>>          mddev_unlock(mddev);
>>>>          md_unregister_thread(&mddev->sync_thread);
>>>>          mddev_lock(mddev);
>>>
>>> As above: While we wait for the sync thread to terminate, wouldn't it
>>> be a problem, if another user space operation takes the mutex?
>>
>> I don't think other places can be blocked while hold mutex, otherwise
>> these places can cause potential deadlock. Please try above two lines
>> change. And perhaps others have better idea.
>
> Yes, this works. No deadlock after >11000 seconds,
>
> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )

Great. I will send a formal patch with your reported-by and tested-by.

Thanks,
Guoqing

2021-02-09 09:32:01

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On 09.02.21 01:46, Guoqing Jiang wrote:

> Great. I will send a formal patch with your reported-by and tested-by.

Yes, that's fine.

Thanks a lot for your help!

Donald

>
> Thanks,
> Guoqing

2023-03-14 13:28:26

by Marc Smith

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
<[email protected]> wrote:
>
> Hi Donald,
>
> On 2/8/21 19:41, Donald Buczek wrote:
> > Dear Guoqing,
> >
> > On 08.02.21 15:53, Guoqing Jiang wrote:
> >>
> >>
> >> On 2/8/21 12:38, Donald Buczek wrote:
> >>>> 5. maybe don't hold reconfig_mutex when try to unregister
> >>>> sync_thread, like this.
> >>>>
> >>>> /* resync has finished, collect result */
> >>>> mddev_unlock(mddev);
> >>>> md_unregister_thread(&mddev->sync_thread);
> >>>> mddev_lock(mddev);
> >>>
> >>> As above: While we wait for the sync thread to terminate, wouldn't it
> >>> be a problem, if another user space operation takes the mutex?
> >>
> >> I don't think other places can be blocked while hold mutex, otherwise
> >> these places can cause potential deadlock. Please try above two lines
> >> change. And perhaps others have better idea.
> >
> > Yes, this works. No deadlock after >11000 seconds,
> >
> > (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
> > 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>
> Great. I will send a formal patch with your reported-by and tested-by.
>
> Thanks,
> Guoqing

I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
of the patches that supposedly resolve this were applied to the stable
kernels, however, one was omitted due to a regression:
md: don't unregister sync_thread with reconfig_mutex held (upstream
commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)

I don't see any follow-up on the thread from June 8th 2022 asking for
this patch to be dropped from all stable kernels since it caused a
regression.

The patch doesn't appear to be present in the current mainline kernel
(6.3-rc2) either. So I assume this issue is still present there, or it
was resolved differently and I just can't find the commit/patch.

I can induce the issue by using Donald's script above which will
eventually result in hangs:
...
147948.504621] INFO: task md_test_2.sh:68033 blocked for more than 122 seconds.
[147948.504624] Tainted: P OE 5.4.229-esos.prod #1
[147948.504624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[147948.504625] md_test_2.sh D 0 68033 1 0x00000004
[147948.504627] Call Trace:
[147948.504634] __schedule+0x4ab/0x4f3
[147948.504637] ? usleep_range+0x7a/0x7a
[147948.504638] schedule+0x67/0x81
[147948.504639] schedule_timeout+0x2c/0xe5
[147948.504643] ? do_raw_spin_lock+0x2b/0x52
[147948.504644] __wait_for_common+0xc4/0x13a
[147948.504647] ? wake_up_q+0x40/0x40
[147948.504649] kthread_stop+0x9a/0x117
[147948.504653] md_unregister_thread+0x43/0x4d
[147948.504655] md_reap_sync_thread+0x1c/0x1d5
[147948.504657] action_store+0xc9/0x284
[147948.504658] md_attr_store+0x9f/0xb8
[147948.504661] kernfs_fop_write+0x10a/0x14c
[147948.504664] vfs_write+0xa0/0xdd
[147948.504666] ksys_write+0x71/0xba
[147948.504668] do_syscall_64+0x52/0x60
[147948.504671] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
...
[147948.504748] INFO: task md120_resync:135315 blocked for more than
122 seconds.
[147948.504749] Tainted: P OE 5.4.229-esos.prod #1
[147948.504749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[147948.504749] md120_resync D 0 135315 2 0x80004000
[147948.504750] Call Trace:
[147948.504752] __schedule+0x4ab/0x4f3
[147948.504754] ? printk+0x53/0x6a
[147948.504755] schedule+0x67/0x81
[147948.504756] md_do_sync+0xae7/0xdd9
[147948.504758] ? remove_wait_queue+0x41/0x41
[147948.504759] md_thread+0x128/0x151
[147948.504761] ? _raw_spin_lock_irqsave+0x31/0x5d
[147948.504762] ? md_start_sync+0xdc/0xdc
[147948.504763] kthread+0xe4/0xe9
[147948.504764] ? kthread_flush_worker+0x70/0x70
[147948.504765] ret_from_fork+0x35/0x40
...

This happens on 'raid6' MD RAID arrays that initially have sync_action==resync.

Any guidance would be greatly appreciated.

--Marc

2023-03-14 13:56:45

by Guoqing Jiang

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition



On 3/14/23 21:25, Marc Smith wrote:
> On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
> <[email protected]> wrote:
>> Hi Donald,
>>
>> On 2/8/21 19:41, Donald Buczek wrote:
>>> Dear Guoqing,
>>>
>>> On 08.02.21 15:53, Guoqing Jiang wrote:
>>>>
>>>> On 2/8/21 12:38, Donald Buczek wrote:
>>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>>>>> sync_thread, like this.
>>>>>>
>>>>>> /* resync has finished, collect result */
>>>>>> mddev_unlock(mddev);
>>>>>> md_unregister_thread(&mddev->sync_thread);
>>>>>> mddev_lock(mddev);
>>>>> As above: While we wait for the sync thread to terminate, wouldn't it
>>>>> be a problem, if another user space operation takes the mutex?
>>>> I don't think other places can be blocked while hold mutex, otherwise
>>>> these places can cause potential deadlock. Please try above two lines
>>>> change. And perhaps others have better idea.
>>> Yes, this works. No deadlock after >11000 seconds,
>>>
>>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
>>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>> Great. I will send a formal patch with your reported-by and tested-by.
>>
>> Thanks,
>> Guoqing
> I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
> of the patches that supposedly resolve this were applied to the stable
> kernels, however, one was omitted due to a regression:
> md: don't unregister sync_thread with reconfig_mutex held (upstream
> commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
>
> I don't see any follow-up on the thread from June 8th 2022 asking for
> this patch to be dropped from all stable kernels since it caused a
> regression.
>
> The patch doesn't appear to be present in the current mainline kernel
> (6.3-rc2) either. So I assume this issue is still present there, or it
> was resolved differently and I just can't find the commit/patch.

It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap
sync_thread in action_store".

Thanks,
Guoqing

2023-03-14 14:46:08

by Marc Smith

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On Tue, Mar 14, 2023 at 9:55 AM Guoqing Jiang <[email protected]> wrote:
>
>
>
> On 3/14/23 21:25, Marc Smith wrote:
> > On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
> > <[email protected]> wrote:
> >> Hi Donald,
> >>
> >> On 2/8/21 19:41, Donald Buczek wrote:
> >>> Dear Guoqing,
> >>>
> >>> On 08.02.21 15:53, Guoqing Jiang wrote:
> >>>>
> >>>> On 2/8/21 12:38, Donald Buczek wrote:
> >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
> >>>>>> sync_thread, like this.
> >>>>>>
> >>>>>> /* resync has finished, collect result */
> >>>>>> mddev_unlock(mddev);
> >>>>>> md_unregister_thread(&mddev->sync_thread);
> >>>>>> mddev_lock(mddev);
> >>>>> As above: While we wait for the sync thread to terminate, wouldn't it
> >>>>> be a problem, if another user space operation takes the mutex?
> >>>> I don't think other places can be blocked while hold mutex, otherwise
> >>>> these places can cause potential deadlock. Please try above two lines
> >>>> change. And perhaps others have better idea.
> >>> Yes, this works. No deadlock after >11000 seconds,
> >>>
> >>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
> >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
> >> Great. I will send a formal patch with your reported-by and tested-by.
> >>
> >> Thanks,
> >> Guoqing
> > I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
> > of the patches that supposedly resolve this were applied to the stable
> > kernels, however, one was omitted due to a regression:
> > md: don't unregister sync_thread with reconfig_mutex held (upstream
> > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
> >
> > I don't see any follow-up on the thread from June 8th 2022 asking for
> > this patch to be dropped from all stable kernels since it caused a
> > regression.
> >
> > The patch doesn't appear to be present in the current mainline kernel
> > (6.3-rc2) either. So I assume this issue is still present there, or it
> > was resolved differently and I just can't find the commit/patch.
>
> It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap
> sync_thread in action_store".

Okay, let me try applying that patch... it does not appear to be
present in my 5.4.229 kernel source. Thanks.

--Marc


>
> Thanks,
> Guoqing

2023-03-15 03:02:51

by Yu Kuai

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition



On 2023/03/14 21:55, Guoqing Jiang wrote:
>
>
> On 3/14/23 21:25, Marc Smith wrote:
>> On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
>> <[email protected]> wrote:
>>> Hi Donald,
>>>
>>> On 2/8/21 19:41, Donald Buczek wrote:
>>>> Dear Guoqing,
>>>>
>>>> On 08.02.21 15:53, Guoqing Jiang wrote:
>>>>>
>>>>> On 2/8/21 12:38, Donald Buczek wrote:
>>>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>>>>>> sync_thread, like this.
>>>>>>>
>>>>>>>           /* resync has finished, collect result */
>>>>>>>           mddev_unlock(mddev);
>>>>>>>           md_unregister_thread(&mddev->sync_thread);
>>>>>>>           mddev_lock(mddev);
>>>>>> As above: While we wait for the sync thread to terminate, wouldn't it
>>>>>> be a problem, if another user space operation takes the mutex?
>>>>> I don't think other places can be blocked while hold mutex, otherwise
>>>>> these places can cause potential deadlock. Please try above two lines
>>>>> change. And perhaps others have better idea.
>>>> Yes, this works. No deadlock after >11000 seconds,
>>>>
>>>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
>>>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>>> Great. I will send a formal patch with your reported-by and tested-by.
>>>
>>> Thanks,
>>> Guoqing
>> I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
>> of the patches that supposedly resolve this were applied to the stable
>> kernels, however, one was omitted due to a regression:
>> md: don't unregister sync_thread with reconfig_mutex held (upstream
>> commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
Hi, Guoqing,

Let me borrow this thread for a discussion: I think this commit might have
a problem in some corner cases:

t1:                t2:
action_store
 mddev_lock
  if (mddev->sync_thread)
   mddev_unlock
   md_unregister_thread
                md_check_recovery
                 set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
                 queue_work(md_misc_wq, &mddev->del_work)
   mddev_lock_nointr
   md_reap_sync_thread
   // clear running
 mddev_lock

t3:
md_start_sync
// running is not set

Our testing reported a problem that, in theory, can be caused by this, but
we can't be sure for now...

We thought about how to fix this: instead of calling
md_register_thread() here to wait for sync_thread to be done
synchronously, we could do this asynchronously, like md_set_readonly()
and do_md_stop() do.

What do you think?

Thanks,
Kuai
>>
>> I don't see any follow-up on the thread from June 8th 2022 asking for
>> this patch to be dropped from all stable kernels since it caused a
>> regression.
>>
>> The patch doesn't appear to be present in the current mainline kernel
>> (6.3-rc2) either. So I assume this issue is still present there, or it
>> was resolved differently and I just can't find the commit/patch.
>
> It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap
> sync_thread in action_store".
>
> Thanks,
> Guoqing
> .
>

2023-03-15 07:52:57

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi,

I can just comment that the simple patch I proposed at https://lore.kernel.org/linux-raid/[email protected]/ has been working for us for over two years now, with several different kernel versions, on currently 195 raid6 JBODs across 105 systems, each going through several "idle->sync->idle" transitions per month.

So if you suffer from the problem and are able to add patches to the kernel you use, you might give it a try.
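
(If it helps, pulling that patch from lore and applying it amounts to something like the following — a sketch only; <message-id> is a placeholder for the real message-id behind the link above:)

```
# Apply the patch referenced above to your kernel tree.
b4 am -o - '<message-id>' | git am
# or, without b4:
curl -L 'https://lore.kernel.org/linux-raid/<message-id>/raw' | git am
```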

Best
Donald

On 3/14/23 14:25, Marc Smith wrote:
> On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
> <[email protected]> wrote:t
>>
>> Hi Donald,
>>
>> On 2/8/21 19:41, Donald Buczek wrote:
>>> Dear Guoqing,
>>>
>>> On 08.02.21 15:53, Guoqing Jiang wrote:
>>>>
>>>>
>>>> On 2/8/21 12:38, Donald Buczek wrote:
>>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>>>>> sync_thread, like this.
>>>>>>
>>>>>> /* resync has finished, collect result */
>>>>>> mddev_unlock(mddev);
>>>>>> md_unregister_thread(&mddev->sync_thread);
>>>>>> mddev_lock(mddev);
>>>>>
>>>>> As above: While we wait for the sync thread to terminate, wouldn't it
>>>>> be a problem, if another user space operation takes the mutex?
>>>>
>>>> I don't think other places can be blocked while hold mutex, otherwise
>>>> these places can cause potential deadlock. Please try above two lines
>>>> change. And perhaps others have better idea.
>>>
>>> Yes, this works. No deadlock after >11000 seconds,
>>>
>>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
>>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>>
>> Great. I will send a formal patch with your reported-by and tested-by.
>>
>> Thanks,
>> Guoqing
>
> I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
> of the patches that supposedly resolve this were applied to the stable
> kernels, however, one was omitted due to a regression:
> md: don't unregister sync_thread with reconfig_mutex held (upstream
> commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
>
> I don't see any follow-up on the thread from June 8th 2022 asking for
> this patch to be dropped from all stable kernels since it caused a
> regression.
>
> The patch doesn't appear to be present in the current mainline kernel
> (6.3-rc2) either. So I assume this issue is still present there, or it
> was resolved differently and I just can't find the commit/patch.
>
> I can induce the issue by using Donald's script above which will
> eventually result in hangs:
> ...
> 147948.504621] INFO: task md_test_2.sh:68033 blocked for more than 122 seconds.
> [147948.504624] Tainted: P OE 5.4.229-esos.prod #1
> [147948.504624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [147948.504625] md_test_2.sh D 0 68033 1 0x00000004
> [147948.504627] Call Trace:
> [147948.504634] __schedule+0x4ab/0x4f3
> [147948.504637] ? usleep_range+0x7a/0x7a
> [147948.504638] schedule+0x67/0x81
> [147948.504639] schedule_timeout+0x2c/0xe5
> [147948.504643] ? do_raw_spin_lock+0x2b/0x52
> [147948.504644] __wait_for_common+0xc4/0x13a
> [147948.504647] ? wake_up_q+0x40/0x40
> [147948.504649] kthread_stop+0x9a/0x117
> [147948.504653] md_unregister_thread+0x43/0x4d
> [147948.504655] md_reap_sync_thread+0x1c/0x1d5
> [147948.504657] action_store+0xc9/0x284
> [147948.504658] md_attr_store+0x9f/0xb8
> [147948.504661] kernfs_fop_write+0x10a/0x14c
> [147948.504664] vfs_write+0xa0/0xdd
> [147948.504666] ksys_write+0x71/0xba
> [147948.504668] do_syscall_64+0x52/0x60
> [147948.504671] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
> ...
> [147948.504748] INFO: task md120_resync:135315 blocked for more than
> 122 seconds.
> [147948.504749] Tainted: P OE 5.4.229-esos.prod #1
> [147948.504749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [147948.504749] md120_resync D 0 135315 2 0x80004000
> [147948.504750] Call Trace:
> [147948.504752] __schedule+0x4ab/0x4f3
> [147948.504754] ? printk+0x53/0x6a
> [147948.504755] schedule+0x67/0x81
> [147948.504756] md_do_sync+0xae7/0xdd9
> [147948.504758] ? remove_wait_queue+0x41/0x41
> [147948.504759] md_thread+0x128/0x151
> [147948.504761] ? _raw_spin_lock_irqsave+0x31/0x5d
> [147948.504762] ? md_start_sync+0xdc/0xdc
> [147948.504763] kthread+0xe4/0xe9
> [147948.504764] ? kthread_flush_worker+0x70/0x70
> [147948.504765] ret_from_fork+0x35/0x40
> ...
>
> This happens on 'raid6' MD RAID arrays that initially have sync_action==resync.
>
> Any guidance would be greatly appreciated.
>
> --Marc

--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2023-03-15 09:30:42

by Guoqing Jiang

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition



On 3/15/23 11:02, Yu Kuai wrote:
>
>
> 在 2023/03/14 21:55, Guoqing Jiang 写道:
>>
>>
>> On 3/14/23 21:25, Marc Smith wrote:
>>> On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
>>> <[email protected]> wrote:
>>>> Hi Donald,
>>>>
>>>> On 2/8/21 19:41, Donald Buczek wrote:
>>>>> Dear Guoqing,
>>>>>
>>>>> On 08.02.21 15:53, Guoqing Jiang wrote:
>>>>>>
>>>>>> On 2/8/21 12:38, Donald Buczek wrote:
>>>>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>>>>>>> sync_thread, like this.
>>>>>>>>
>>>>>>>>           /* resync has finished, collect result */
>>>>>>>>           mddev_unlock(mddev);
>>>>>>>> md_unregister_thread(&mddev->sync_thread);
>>>>>>>>           mddev_lock(mddev);
>>>>>>> As above: While we wait for the sync thread to terminate,
>>>>>>> wouldn't it
>>>>>>> be a problem, if another user space operation takes the mutex?
>>>>>> I don't think other places can be blocked while hold mutex,
>>>>>> otherwise
>>>>>> these places can cause potential deadlock. Please try above two
>>>>>> lines
>>>>>> change. And perhaps others have better idea.
>>>>> Yes, this works. No deadlock after >11000 seconds,
>>>>>
>>>>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
>>>>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>>>> Great. I will send a formal patch with your reported-by and tested-by.
>>>>
>>>> Thanks,
>>>> Guoqing
>>> I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
>>> of the patches that supposedly resolve this were applied to the stable
>>> kernels, however, one was omitted due to a regression:
>>> md: don't unregister sync_thread with reconfig_mutex held (upstream
>>> commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
> Hi, Guoqing,
>
> Just borrow this thread to discuss, I think this commit might have
> problem in some corner cases:
>
> t1:                t2:
> action_store
>  mddev_lock
>   if (mddev->sync_thread)
>    mddev_unlock
>    md_unregister_thread
>                 md_check_recovery
>                  set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
>                  queue_work(md_misc_wq, &mddev->del_work)
>    mddev_lock_nointr
>    md_reap_sync_thread
>    // clear running
>  mddev_lock
>
> t3:
> md_start_sync
> // running is not set

What does 'running' mean? MD_RECOVERY_RUNNING?

> Our test report a problem that can be cause by this in theory, by we
> can't be sure for now...

I guess you tried to describe a race between

action_store -> md_register_thread

and

md_start_sync -> md_register_thread

Didn't you already fix them in the series?

[PATCH -next 0/5] md: fix uaf for sync_thread

Sorry, I haven't followed the problem or your series closely. I might try
your test with the latest mainline kernel if the test is available somewhere.

> We thought about how to fix this, instead of calling
> md_register_thread() here to wait for sync_thread to be done
> synchronisely,

IMO, md_register_thread just creates and wakes a thread; I'm not sure why
it would wait for sync_thread.

> we do this asynchronously like what md_set_readonly() and 
> do_md_stop() does.

Still, I don't have a clear picture of the problem, so I can't judge it.

Thanks,
Guoqing

2023-03-15 09:55:58

by Yu Kuai

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi,

On 2023/03/15 17:30, Guoqing Jiang wrote:
>
>> Just borrow this thread to discuss, I think this commit might have
>> problem in some corner cases:
>>
>> t1:                t2:
>> action_store
>>  mddev_lock
>>   if (mddev->sync_thread)
>>    mddev_unlock
>>    md_unregister_thread
>>                 md_check_recovery
>>                  set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
>>                  queue_work(md_misc_wq, &mddev->del_work)
>>    mddev_lock_nointr
>>    md_reap_sync_thread
>>    // clear running
>>  mddev_lock
>>
>> t3:
>> md_start_sync
>> // running is not set
>
> What does 'running' mean? MD_RECOVERY_RUNNING?
>
>> Our test report a problem that can be cause by this in theory, by we
>> can't be sure for now...
>
> I guess you tried to describe racy between
>
> action_store -> md_register_thread
>
> and
>
> md_start_sync -> md_register_thread
>
> Didn't you already fix them in the series?
>
> [PATCH -next 0/5] md: fix uaf for sync_thread
>
> Sorry, I didn't follow the problem and also your series, I might try your
> test with latest mainline kernel if the test is available somewhere.
>
>> We thought about how to fix this, instead of calling
>> md_register_thread() here to wait for sync_thread to be done
>> synchronisely,
>
> IMO, md_register_thread just create and wake a thread, not sure why it
> waits for sync_thread.
>
>> we do this asynchronously like what md_set_readonly() and do_md_stop()
>> does.
>
> Still, I don't have clear picture about the problem, so I can't judge it.
>

Sorry that I didn't explain the problem clearly. Let me describe the
problem we hit first:

1) raid10d is waiting for sync_thread to stop:
   raid10d
    md_unregister_thread
     kthread_stop

2) sync_thread is waiting for io to finish:
   md_do_sync
    wait_event(... atomic_read(&mddev->recovery_active) == 0)

3) io is waiting for raid10d to finish (the online crash dump found 2 IOs in
   conf->retry_list)

Additional information from the online crash dump:
mddev->recovery = 29 // DONE, RUNNING, INTR are set

PID: 138293 TASK: ffff0000de89a900 CPU: 7 COMMAND: "md0_resync"
#0 [ffffa00107c178a0] __switch_to at ffffa0010001d75c
#1 [ffffa00107c178d0] __schedule at ffffa001017c7f14
#2 [ffffa00107c179f0] schedule at ffffa001017c880c
#3 [ffffa00107c17a20] md_do_sync at ffffa0010129cdb4
#4 [ffffa00107c17d50] md_thread at ffffa00101290d9c
#5 [ffffa00107c17e50] kthread at ffffa00100187a74

PID: 138294 TASK: ffff0000eba13d80 CPU: 5 COMMAND: "md0_resync"
#0 [ffffa00107e47a60] __switch_to at ffffa0010001d75c
#1 [ffffa00107e47a90] __schedule at ffffa001017c7f14
#2 [ffffa00107e47bb0] schedule at ffffa001017c880c
#3 [ffffa00107e47be0] schedule_timeout at ffffa001017d1298
#4 [ffffa00107e47d50] md_thread at ffffa00101290ee8
#5 [ffffa00107e47e50] kthread at ffffa00100187a74
// there are two sync_threads for md0

I believe the root cause is that two sync_threads exist for the same
mddev, and this is how I think that is possible:

t1:                t2:
action_store
 mddev_lock
  if (mddev->sync_thread)
   mddev_unlock
   md_unregister_thread
   // first sync_thread is done
                md_check_recovery
                 set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
                 queue_work(md_misc_wq, &mddev->del_work)
   mddev_lock_nointr
   md_reap_sync_thread
   // MD_RECOVERY_RUNNING is cleared
  mddev_unlock

t3:
md_start_sync
// second sync_thread is registered

t3:
md_check_recovery
 queue_work(md_misc_wq, &mddev->del_work)
 // MD_RECOVERY_RUNNING is not set, a new sync_thread can be started

This is just a guess; I can't reproduce the problem yet. Please let me
know if you have any questions.
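
(A quick way to check for this symptom on a live system — a sketch assuming the md0 array from the crash dump above; a healthy array should never show more than one mdX_resync thread:)

```
# Count resync kthreads for md0; more than one line of output here would
# match the "two sync_threads" situation described above.
ps -eo pid,comm | awk '$2 == "md0_resync"'
```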

Thanks,
Kuai


2023-03-16 15:25:58

by Marc Smith

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On Tue, Mar 14, 2023 at 10:45 AM Marc Smith <[email protected]> wrote:
>
> On Tue, Mar 14, 2023 at 9:55 AM Guoqing Jiang <[email protected]> wrote:
> >
> >
> >
> > On 3/14/23 21:25, Marc Smith wrote:
> > > On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
> > > <[email protected]> wrote:
> > >> Hi Donald,
> > >>
> > >> On 2/8/21 19:41, Donald Buczek wrote:
> > >>> Dear Guoqing,
> > >>>
> > >>> On 08.02.21 15:53, Guoqing Jiang wrote:
> > >>>>
> > >>>> On 2/8/21 12:38, Donald Buczek wrote:
> > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
> > >>>>>> sync_thread, like this.
> > >>>>>>
> > >>>>>> /* resync has finished, collect result */
> > >>>>>> mddev_unlock(mddev);
> > >>>>>> md_unregister_thread(&mddev->sync_thread);
> > >>>>>> mddev_lock(mddev);
> > >>>>> As above: While we wait for the sync thread to terminate, wouldn't it
> > >>>>> be a problem, if another user space operation takes the mutex?
> > >>>> I don't think other places can be blocked while hold mutex, otherwise
> > >>>> these places can cause potential deadlock. Please try above two lines
> > >>>> change. And perhaps others have better idea.
> > >>> Yes, this works. No deadlock after >11000 seconds,
> > >>>
> > >>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
> > >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
> > >> Great. I will send a formal patch with your reported-by and tested-by.
> > >>
> > >> Thanks,
> > >> Guoqing
> > > I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
> > > of the patches that supposedly resolve this were applied to the stable
> > > kernels, however, one was omitted due to a regression:
> > > md: don't unregister sync_thread with reconfig_mutex held (upstream
> > > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
> > >
> > > I don't see any follow-up on the thread from June 8th 2022 asking for
> > > this patch to be dropped from all stable kernels since it caused a
> > > regression.
> > >
> > > The patch doesn't appear to be present in the current mainline kernel
> > > (6.3-rc2) either. So I assume this issue is still present there, or it
> > > was resolved differently and I just can't find the commit/patch.
> >
> > It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap
> > sync_thread in action_store".
>
> Okay, let me try applying that patch... it does not appear to be
> present in my 5.4.229 kernel source. Thanks.

Yes, applying this '9dfbdafda3b3 "md: unlock mddev before reap
sync_thread in action_store"' patch on top of vanilla 5.4.229 source
appears to fix the problem for me -- I can't reproduce the issue with
the script, and it's been running for >24 hours now. (Previously I was
able to induce the issue within a matter of minutes.)
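
(For anyone else on a 5.4.y tree, the backport boils down to something like the sketch below — it assumes a linux-stable checkout with the mainline history available, and conflicts may still need hand-resolving:)

```
# Cherry-pick the mainline commit onto a 5.4.y base and rebuild.
git checkout -b md-action-store-fix v5.4.229
git cherry-pick 9dfbdafda3b3   # md: unlock mddev before reap sync_thread in action_store
```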


>
> --Marc
>
>
> >
> > Thanks,
> > Guoqing

2023-03-29 00:13:58

by Song Liu

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On Thu, Mar 16, 2023 at 8:25 AM Marc Smith <[email protected]> wrote:
>
> On Tue, Mar 14, 2023 at 10:45 AM Marc Smith <[email protected]> wrote:
> >
> > On Tue, Mar 14, 2023 at 9:55 AM Guoqing Jiang <[email protected]> wrote:
> > >
> > >
> > >
> > > On 3/14/23 21:25, Marc Smith wrote:
> > > > On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
> > > > <[email protected]> wrote:
> > > >> Hi Donald,
> > > >>
> > > >> On 2/8/21 19:41, Donald Buczek wrote:
> > > >>> Dear Guoqing,
> > > >>>
> > > >>> On 08.02.21 15:53, Guoqing Jiang wrote:
> > > >>>>
> > > >>>> On 2/8/21 12:38, Donald Buczek wrote:
> > > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
> > > >>>>>> sync_thread, like this.
> > > >>>>>>
> > > >>>>>> /* resync has finished, collect result */
> > > >>>>>> mddev_unlock(mddev);
> > > >>>>>> md_unregister_thread(&mddev->sync_thread);
> > > >>>>>> mddev_lock(mddev);
> > > >>>>> As above: While we wait for the sync thread to terminate, wouldn't it
> > > >>>>> be a problem, if another user space operation takes the mutex?
> > > >>>> I don't think other places can be blocked while hold mutex, otherwise
> > > >>>> these places can cause potential deadlock. Please try above two lines
> > > >>>> change. And perhaps others have better idea.
> > > >>> Yes, this works. No deadlock after >11000 seconds,
> > > >>>
> > > >>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
> > > >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
> > > >> Great. I will send a formal patch with your reported-by and tested-by.
> > > >>
> > > >> Thanks,
> > > >> Guoqing
> > > > I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
> > > > of the patches that supposedly resolve this were applied to the stable
> > > > kernels, however, one was omitted due to a regression:
> > > > md: don't unregister sync_thread with reconfig_mutex held (upstream
> > > > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
> > > >
> > > > I don't see any follow-up on the thread from June 8th 2022 asking for
> > > > this patch to be dropped from all stable kernels since it caused a
> > > > regression.
> > > >
> > > > The patch doesn't appear to be present in the current mainline kernel
> > > > (6.3-rc2) either. So I assume this issue is still present there, or it
> > > > was resolved differently and I just can't find the commit/patch.
> > >
> > > It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap
> > > sync_thread in action_store".
> >
> > Okay, let me try applying that patch... it does not appear to be
> > present in my 5.4.229 kernel source. Thanks.
>
> Yes, applying this '9dfbdafda3b3 "md: unlock mddev before reap
> sync_thread in action_store"' patch on top of vanilla 5.4.229 source
> appears to fix the problem for me -- I can't reproduce the issue with
> the script, and it's been running for >24 hours now. (Previously I was
> able to induce the issue within a matter of minutes.)

Hi Marc,

Could you please run your reproducer on the md-tmp branch?

https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-tmp

This contains a different version of the fix by Yu Kuai.
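
(Fetching that branch for a test build amounts to something like the following — a sketch, assuming an existing kernel git tree and the usual build/install flow:)

```
# Pull the md-tmp branch carrying the alternative fix series.
git remote add song https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git
git fetch song md-tmp
git checkout -b md-tmp-test song/md-tmp
```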

Thanks,
Song

2023-08-23 06:58:16

by Yu Kuai

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi,

On 2023/08/23 5:16, Dragan Stancevic wrote:
> On Tue, 3/28/23 17:01 Song Liu wrote:
>> On Thu, Mar 16, 2023 at 8:25 AM Marc Smith <[email protected]> wrote:
>> >
>> > On Tue, Mar 14, 2023 at 10:45 AM Marc Smith <[email protected]> wrote:
>> > >
>> > > On Tue, Mar 14, 2023 at 9:55 AM Guoqing Jiang <[email protected]> wrote:
>> > > >
>> > > >
>> > > >
>> > > > On 3/14/23 21:25, Marc Smith wrote:
>> > > > > On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
>> > > > > <[email protected]> wrote:
>> > > > >> Hi Donald,
>> > > > >>
>> > > > >> On 2/8/21 19:41, Donald Buczek wrote:
>> > > > >>> Dear Guoqing,
>> > > > >>>
>> > > > >>> On 08.02.21 15:53, Guoqing Jiang wrote:
>> > > > >>>>
>> > > > >>>> On 2/8/21 12:38, Donald Buczek wrote:
>> > > > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>> > > > >>>>>> sync_thread, like this.
>> > > > >>>>>>
>> > > > >>>>>> /* resync has finished, collect result */
>> > > > >>>>>> mddev_unlock(mddev);
>> > > > >>>>>> md_unregister_thread(&mddev->sync_thread);
>> > > > >>>>>> mddev_lock(mddev);
>> > > > >>>>> As above: While we wait for the sync thread to terminate, wouldn't it
>> > > > >>>>> be a problem, if another user space operation takes the mutex?
>> > > > >>>> I don't think other places can be blocked while hold mutex, otherwise
>> > > > >>>> these places can cause potential deadlock. Please try above two lines
>> > > > >>>> change. And perhaps others have better idea.
>> > > > >>> Yes, this works. No deadlock after >11000 seconds,
>> > > > >>>
>> > > > >>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
>> > > > >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>> > > > >> Great. I will send a formal patch with your reported-by and tested-by.
>> > > > >>
>> > > > >> Thanks,
>> > > > >> Guoqing
>> > > > > I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
>> > > > > of the patches that supposedly resolve this were applied to the stable
>> > > > > kernels, however, one was omitted due to a regression:
>> > > > > md: don't unregister sync_thread with reconfig_mutex held (upstream
>> > > > > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
>> > > > >
>> > > > > I don't see any follow-up on the thread from June 8th 2022 asking for
>> > > > > this patch to be dropped from all stable kernels since it caused a
>> > > > > regression.
>> > > > >
>> > > > > The patch doesn't appear to be present in the current mainline kernel
>> > > > > (6.3-rc2) either. So I assume this issue is still present there, or it
>> > > > > was resolved differently and I just can't find the commit/patch.
>> > > >
>> > > > It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap
>> > > > sync_thread in action_store".
>> > >
>> > > Okay, let me try applying that patch... it does not appear to be
>> > > present in my 5.4.229 kernel source. Thanks.
>> >
>> > Yes, applying this '9dfbdafda3b3 "md: unlock mddev before reap
>> > sync_thread in action_store"' patch on top of vanilla 5.4.229 source
>> > appears to fix the problem for me -- I can't reproduce the issue with
>> > the script, and it's been running for >24 hours now. (Previously I was
>> > able to induce the issue within a matter of minutes.)
>>
>> Hi Marc,
>>
>> Could you please run your reproducer on the md-tmp branch?
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-tmp
>>
>> This contains a different version of the fix by Yu Kuai.
>>
>> Thanks,
>> Song
>>
>
> Hi Song, I can easily reproduce this issue on 5.10.133 and 5.10.53. The change
> "9dfbdafda3b3 "md: unlock mddev before reap" does not fix the issue for me.
>
> But I did pull the changes from the md-tmp branch you are refering:
> https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-tmp
>
> I was not totally clear on which change exactly to pull, but I pulled
> the following changes:
> 2023-03-28 md: enhance checking in md_check_recovery()md-tmp Yu Kuai 1 -7/+15
> 2023-03-28 md: wake up 'resync_wait' at last in md_reap_sync_thread() Yu Kuai 1 -1/+1
> 2023-03-28 md: refactor idle/frozen_sync_thread() Yu Kuai 2 -4/+22
> 2023-03-28 md: add a mutex to synchronize idle and frozen in action_store() Yu Kuai 2 -0/+8
> 2023-03-28 md: refactor action_store() for 'idle' and 'frozen' Yu Kuai 1 -16/+45
>
> I used to be able to reproduce the lockup within minutes, but with those
> changes the test system has been running for more than 120 hours.
>
> When you said a "different fix", can you confirm that I grabbed the right
> changes and that I need all 5 of them.

Yes, you grabbed the right changes, and these patches are merged into
linux-next as well.
>
> And second question was, has this fix been submitted upstream yet?
> If so which kernel version?

This fix is currently in linux-next, and will be applied to v6.6-rc1
soon.

Thanks,
Kuai

>
> Thank you
>
>
> .
>


2023-08-30 18:34:36

by Yu Kuai

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi,

On 2023/08/29 4:32, Dragan Stancevic wrote:

> Just a followup on 6.1 testing. I tried reproducing this problem for 5
> days with 6.1.42 kernel without your patches and I was not able to
> reproduce it.
>
> It seems that 6.1 has some other code that prevents this from happening.
>

I see that there are lots of patches for raid456 between 5.10 and 6.1;
however, I remember that I used to reproduce the deadlock after 6.1, and
it's true that it's not easy to reproduce, see below:

https://lore.kernel.org/linux-raid/[email protected]/

My guess is that the deadlock is harder to reproduce on 6.1 than on 5.10
due to some changes inside raid456.

By the way, raid10 had a similar deadlock and can be fixed the same way,
so it makes sense to backport these patches.

https://lore.kernel.org/r/[email protected]

Thanks,
Kuai


> On 5.10 I can reproduce it within minutes to an hour.
>


2023-09-05 16:42:12

by Yu Kuai

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi,

On 2023/08/30 9:36, Yu Kuai wrote:
> Hi,
>
> 在 2023/08/29 4:32, Dragan Stancevic 写道:
>
>> Just a followup on 6.1 testing. I tried reproducing this problem for 5
>> days with 6.1.42 kernel without your patches and I was not able to
>> reproduce it.

Oops, I forgot that you need to backport this patch first to reproduce
this problem:

https://lore.kernel.org/all/[email protected]/

The patch fixes the deadlock as well, but it introduces some regressions.

Thanks,
Kuai

>>
>> It seems that 6.1 has some other code that prevents this from happening.
>>
>
> I see that there are lots of patches for raid456 between 5.10 and 6.1,
> however, I remember that I used to reporduce the deadlock after 6.1, and
> it's true it's not easy to reporduce, see below:
>
> https://lore.kernel.org/linux-raid/[email protected]/
>
>
> My guess is that 6.1 is harder to reporduce than 5.10 due to some
> changes inside raid456.
>
> By the way, raid10 had a similiar deadlock, and can be fixed the same
> way, so it make sense to backport these patches.
>
> https://lore.kernel.org/r/[email protected]
>
> Thanks,
> Kuai
>
>
>> On 5.10 I can reproduce it within minutes to an hour.
>>
>
> .
>

2023-09-13 09:10:22

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On 9/5/23 3:54 PM, Dragan Stancevic wrote:
> On 9/4/23 22:50, Yu Kuai wrote:
>> Hi,
>>
>> 在 2023/08/30 9:36, Yu Kuai 写道:
>>> Hi,
>>>
>>> 在 2023/08/29 4:32, Dragan Stancevic 写道:
>>>
>>>> Just a followup on 6.1 testing. I tried reproducing this problem for 5 days with 6.1.42 kernel without your patches and I was not able to reproduce it.
>>
>> oops, I forgot that you need to backport this patch first to reporduce
>> this problem:
>>
>> https://lore.kernel.org/all/[email protected]/
>>
>> The patch fix the deadlock as well, but it introduce some regressions.

We've just had an unplanned lockup on the "check" to "idle" transition with 6.1.52 after a few hours on a backup server. For the last 2 1/2 years we have used the patch I originally proposed [1] with multiple kernel versions. But this no longer seems to be valid, or maybe it's even destructive in combination with the other changes.

But I've totally lost track of the further development. As I understand it, there are patches queued up for mainline which should fix the problem and which might go into 6.1, too, but they have not landed there yet?

Can anyone give me exact references to the patches I'd need to apply to 6.1.52, so that I can probably fix my problem and also test the patches for you on production systems with a load which tends to run into that problem easily?

Thanks

Donald

[1]: https://lore.kernel.org/linux-raid/[email protected]/

> Ha, jinx :) I was about to email you that I isolated that change with the testing over the weekend that made it more difficult to reproduce in 6.1 and that the original change must be reverted :)
>
>
>
>>
>> Thanks,
>> Kuai
>>
>>>>
>>>> It seems that 6.1 has some other code that prevents this from happening.
>>>>
>>>
>>> I see that there are lots of patches for raid456 between 5.10 and 6.1,
>>> however, I remember that I used to reporduce the deadlock after 6.1, and
>>> it's true it's not easy to reporduce, see below:
>>>
>>> https://lore.kernel.org/linux-raid/[email protected]/
>>>
>>> My guess is that 6.1 is harder to reporduce than 5.10 due to some
>>> changes inside raid456.
>>>
>>> By the way, raid10 had a similiar deadlock, and can be fixed the same
>>> way, so it make sense to backport these patches.
>>>
>>> https://lore.kernel.org/r/[email protected]
>>>
>>> Thanks,
>>> Kuai
>>>
>>>
>>>> On 5.10 I can reproduce it within minutes to an hour.
>>>>
>>>
>>> .
>>>
>>
>


--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2023-09-14 01:26:19

by Dragan Stancevic

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi Donald-

On 9/13/23 04:08, Donald Buczek wrote:
> On 9/5/23 3:54 PM, Dragan Stancevic wrote:
>> On 9/4/23 22:50, Yu Kuai wrote:
>>> Hi,
>>>
>>>> On 2023/08/30 9:36, Yu Kuai wrote:
>>>> Hi,
>>>>
>>>>> On 2023/08/29 4:32, Dragan Stancevic wrote:
>>>>
>>>>> Just a followup on 6.1 testing. I tried reproducing this problem for 5 days with 6.1.42 kernel without your patches and I was not able to reproduce it.
>>>
>>> oops, I forgot that you need to backport this patch first to reporduce
>>> this problem:
>>>
>>> https://lore.kernel.org/all/[email protected]/
>>>
>>> The patch fix the deadlock as well, but it introduce some regressions.
>
> We've just got an unplanned lock up on "check" to "idle" transition with 6.1.52 after a few hours on a backup server. For the last 2 1/2 years we used the patch I originally proposed with multiple kernel versions [1]. But this no longer seems to be valid or maybe its even destructive in combination with the other changes.
>
> But I totally lost track of the further development. As I understood, there are patches queue up in mainline, which might go into 6.1, too, but have not landed there which should fix the problem?
>
> Can anyone give me exact references to the patches I'd need to apply to 6.1.52, so that I could probably fix my problem and also test the patches for you those on production systems with a load which tends to run into that problem easily?

Here is a list of changes for 6.1:

e5e9b9cb71a0 md: factor out a helper to wake up md_thread directly
f71209b1f21c md: enhance checking in md_check_recovery()
753260ed0b46 md: wake up 'resync_wait' at last in md_reap_sync_thread()
130443d60b1b md: refactor idle/frozen_sync_thread() to fix deadlock
6f56f0c4f124 md: add a mutex to synchronize idle and frozen in action_store()
64e5e09afc14 md: refactor action_store() for 'idle' and 'frozen'
a865b96c513b Revert "md: unlock mddev before reap sync_thread in action_store"

You can get them from the following tree:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
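
(For illustration only, a backport onto a 6.1.y tree might look roughly like
the following; the branch name is just a placeholder, and it assumes the
v6.1.52 tag is available locally and that the commit ids above can be
resolved from the linux-next remote. Conflicts may still need hand-fixing.)

  # make the commit ids above resolvable
  git remote add linux-next https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
  git fetch linux-next
  # start from the stable release and apply the series oldest-first
  git checkout -b md-deadlock-backport v6.1.52
  git cherry-pick a865b96c513b 64e5e09afc14 6f56f0c4f124 \
                  130443d60b1b 753260ed0b46 f71209b1f21c e5e9b9cb71a0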


>
> Thanks
>
> Donald
>
> [1]: https://lore.kernel.org/linux-raid/[email protected]/
>
>> Ha, jinx :) I was about to email you that I isolated that change with the testing over the weekend that made it more difficult to reproduce in 6.1 and that the original change must be reverted :)
>>
>>
>>
>>>
>>> Thanks,
>>> Kuai
>>>
>>>>>
>>>>> It seems that 6.1 has some other code that prevents this from happening.
>>>>>
>>>>
>>>> I see that there are lots of patches for raid456 between 5.10 and 6.1,
>>>> however, I remember that I used to reporduce the deadlock after 6.1, and
>>>> it's true it's not easy to reporduce, see below:
>>>>
>>>> https://lore.kernel.org/linux-raid/[email protected]/
>>>>
>>>> My guess is that 6.1 is harder to reporduce than 5.10 due to some
>>>> changes inside raid456.
>>>>
>>>> By the way, raid10 had a similiar deadlock, and can be fixed the same
>>>> way, so it make sense to backport these patches.
>>>>
>>>> https://lore.kernel.org/r/[email protected]
>>>>
>>>> Thanks,
>>>> Kuai
>>>>
>>>>
>>>>> On 5.10 I can reproduce it within minutes to an hour.
>>>>>
>>>>
>>>> .
>>>>
>>>
>>
>
>

--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla

2023-09-14 06:08:35

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition



On 9/13/23 16:16, Dragan Stancevic wrote:
> Hi Donald-
>
> On 9/13/23 04:08, Donald Buczek wrote:
>> On 9/5/23 3:54 PM, Dragan Stancevic wrote:
>>> On 9/4/23 22:50, Yu Kuai wrote:
>>>> Hi,
>>>>
>>>>> On 2023/08/30 9:36, Yu Kuai wrote:
>>>>> Hi,
>>>>>
>>>>>> On 2023/08/29 4:32, Dragan Stancevic wrote:
>>>>>
>>>>>> Just a followup on 6.1 testing. I tried reproducing this problem for 5 days with 6.1.42 kernel without your patches and I was not able to reproduce it.
>>>>
>>>> oops, I forgot that you need to backport this patch first to reporduce
>>>> this problem:
>>>>
>>>> https://lore.kernel.org/all/[email protected]/
>>>>
>>>> The patch fix the deadlock as well, but it introduce some regressions.
>>
>> We've just got an unplanned lock up on "check" to "idle" transition with 6.1.52 after a few hours on a backup server. For the last 2 1/2 years we used the patch I originally proposed with multiple kernel versions [1]. But this no longer seems to be valid or maybe its even destructive in combination with the other changes.
>>
>> But I totally lost track of the further development. As I understood, there are patches queue up in mainline, which might go into 6.1, too, but have not landed there which should fix the problem?
>>
>> Can anyone give me exact references to the patches I'd need to apply to 6.1.52, so that I could probably fix my problem and also test the patches for you those on production systems with a load which tends to run into that problem easily?
>
> Here is a list of changes for 6.1:
>
> e5e9b9cb71a0 md: factor out a helper to wake up md_thread directly
> f71209b1f21c md: enhance checking in md_check_recovery()
> 753260ed0b46 md: wake up 'resync_wait' at last in md_reap_sync_thread()
> 130443d60b1b md: refactor idle/frozen_sync_thread() to fix deadlock
> 6f56f0c4f124 md: add a mutex to synchronize idle and frozen in action_store()
> 64e5e09afc14 md: refactor action_store() for 'idle' and 'frozen'
> a865b96c513b Revert "md: unlock mddev before reap sync_thread in action_store"

Thanks!

I've put these patches on v6.1.52. A few hours ago I started a script which transitions the three md devices of a very active backup server through idle->check->idle every 6 minutes. It has gone through ~400 iterations so far. No lock-ups yet.
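
(A minimal sketch of such a cycling loop, using the device names and
interval from above; this is only an illustration, not the actual script:)

  while true; do
      for md in md0 md1 md2; do
          echo check > /sys/devices/virtual/block/$md/md/sync_action
      done
      sleep 360    # let the check run for ~6 minutes
      for md in md0 md1 md2; do
          echo idle > /sys/devices/virtual/block/$md/md/sync_action
      done
      sleep 10     # short pause before the next iteration
  done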

LGTM !

Donald

buczek@done:~$ dmesg|grep "data-check of RAID array"|wc
393 2820 18864
buczek@done:~$ cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md2 : active raid6 sdc[0] sdo[15] sdn[14] sdm[13] sdl[12] sdk[11] sdj[10] sdi[9] sdh[8] sdg[7] sdf[6] sde[5] sdd[4] sdr[3] sdq[2] sdp[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[=========>...........] check = 47.1% (3681799128/7813894144) finish=671.8min speed=102496K/sec
bitmap: 0/59 pages [0KB], 65536KB chunk

md1 : active raid6 sdaa[0] sdz[15] sdy[14] sdx[13] sdw[12] sdv[11] sdu[10] sdt[16] sds[8] sdah[7] sdag[17] sdaf[5] sdae[4] sdad[3] sdac[2] sdab[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[=======>.............] check = 38.5% (3009484896/7813894144) finish=811.0min speed=98720K/sec
bitmap: 0/59 pages [0KB], 65536KB chunk

md0 : active raid6 sdai[0] sdax[15] sdaw[16] sdav[13] sdau[12] sdat[11] sdas[10] sdar[9] sdaq[8] sdap[7] sdao[6] sdan[17] sdam[4] sdal[3] sdak[2] sdaj[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[========>............] check = 42.3% (3311789940/7813894144) finish=911.9min speed=82272K/sec
bitmap: 6/59 pages [24KB], 65536KB chunk

unused devices: <none>





>
> You can get them from the following tree:
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>
>
>>
>> Thanks
>>
>>    Donald
>>
>> [1]: https://lore.kernel.org/linux-raid/[email protected]/
>>
>>> Ha, jinx :) I was about to email you that I isolated that change with the testing over the weekend that made it more difficult to reproduce in 6.1 and that the original change must be reverted :)
>>>
>>>
>>>
>>>>
>>>> Thanks,
>>>> Kuai
>>>>
>>>>>>
>>>>>> It seems that 6.1 has some other code that prevents this from happening.
>>>>>>
>>>>>
>>>>> I see that there are lots of patches for raid456 between 5.10 and 6.1,
>>>>> however, I remember that I used to reporduce the deadlock after 6.1, and
>>>>> it's true it's not easy to reporduce, see below:
>>>>>
>>>>> https://lore.kernel.org/linux-raid/[email protected]/
>>>>>
>>>>> My guess is that 6.1 is harder to reporduce than 5.10 due to some
>>>>> changes inside raid456.
>>>>>
>>>>> By the way, raid10 had a similiar deadlock, and can be fixed the same
>>>>> way, so it make sense to backport these patches.
>>>>>
>>>>> https://lore.kernel.org/r/[email protected]
>>>>>
>>>>> Thanks,
>>>>> Kuai
>>>>>
>>>>>
>>>>>> On 5.10 I can reproduce it within minutes to an hour.
>>>>>>
>>>>>
>>>>> .
>>>>>
>>>>
>>>
>>
>>
>

--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2023-09-17 08:58:48

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On 9/14/23 08:03, Donald Buczek wrote:
> On 9/13/23 16:16, Dragan Stancevic wrote:
>> Hi Donald-
>> [...]
>> Here is a list of changes for 6.1:
>>
>> e5e9b9cb71a0 md: factor out a helper to wake up md_thread directly
>> f71209b1f21c md: enhance checking in md_check_recovery()
>> 753260ed0b46 md: wake up 'resync_wait' at last in md_reap_sync_thread()
>> 130443d60b1b md: refactor idle/frozen_sync_thread() to fix deadlock
>> 6f56f0c4f124 md: add a mutex to synchronize idle and frozen in action_store()
>> 64e5e09afc14 md: refactor action_store() for 'idle' and 'frozen'
>> a865b96c513b Revert "md: unlock mddev before reap sync_thread in action_store"
>
> Thanks!
>
> I've put these patches on v6.1.52. I've started a script which transitions the three md-devices of a very active backup server through idle->check->idle every 6 minutes a few ours ago.  It went through ~400 iterations till now. No lock-ups so far.

Oh dear, it looks like the deadlock problem is _not_ fixed with these patches.

We've had a lockup again after ~3 days of operation. Again, the `echo idle > $sys/md/sync_action` is hanging:

# # /proc/70554/task/70554: mdcheck.safe : /bin/bash /usr/bin/mdcheck.safe --continue --duration 06:00
# cat /proc/70554/task/70554/stack

[<0>] action_store+0x17f/0x390
[<0>] md_attr_store+0x83/0xf0
[<0>] kernfs_fop_write_iter+0x117/0x1b0
[<0>] vfs_write+0x2ce/0x400
[<0>] ksys_write+0x5f/0xe0
[<0>] do_syscall_64+0x43/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

And everything else doing I/O to that specific raid (md0) is dead, too. No task is busy looping.

So as it looks now, we can't go from 5.15.X to 6.1.X as we would like to do. These patches don't fix the problem, and our own patch no longer works with 6.1. Unfortunately, this happened on a production system which I need to reboot and which is not available for further analysis. We'd need to reproduce the problem on a dedicated machine to really work on it.

Here's some more possibly interesting /sys and /proc output and some example task stacks.

/sys/devices/virtual/block/md0/inflight : 0 3936

#/proc/mdstat

Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 sdae[0] sdad[15] sdac[14] sdab[13] sdaa[12] sdz[11] sdy[10] sdx[9] sdw[8] sdv[7] sdu[6] sdt[5] sds[4] sdah[3] sdag[2] sdaf[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
bitmap: 0/59 pages [0KB], 65536KB chunk

md0 : active raid6 sdc[0] sdr[17] sdq[16] sdp[13] sdo[12] sdn[11] sdm[10] sdl[9] sdk[8] sdj[7] sdi[6] sdh[5] sdg[4] sdf[3] sde[2] sdd[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[===>.................] check = 15.9% (1242830396/7813894144) finish=14788.4min speed=7405K/sec
bitmap: 53/59 pages [212KB], 65536KB chunk

unused devices: <none>

# # /proc/66024/task/66024: md0_resync :
# cat /proc/66024/task/66024/stack

[<0>] raid5_get_active_stripe+0x20f/0x4d0
[<0>] raid5_sync_request+0x38b/0x3b0
[<0>] md_do_sync.cold+0x40c/0x985
[<0>] md_thread+0xb1/0x160
[<0>] kthread+0xe7/0x110
[<0>] ret_from_fork+0x22/0x30

# # /proc/939/task/939: md0_raid6 :
# cat /proc/939/task/939/stack

[<0>] md_thread+0x12d/0x160
[<0>] kthread+0xe7/0x110
[<0>] ret_from_fork+0x22/0x30

# # /proc/1228/task/1228: xfsaild/md0 :
# cat /proc/1228/task/1228/stack

[<0>] raid5_get_active_stripe+0x20f/0x4d0
[<0>] raid5_make_request+0x24c/0x1170
[<0>] md_handle_request+0x131/0x220
[<0>] __submit_bio+0x89/0x130
[<0>] submit_bio_noacct_nocheck+0x160/0x360
[<0>] _xfs_buf_ioapply+0x26c/0x420
[<0>] __xfs_buf_submit+0x64/0x1d0
[<0>] xfs_buf_delwri_submit_buffers+0xc5/0x1e0
[<0>] xfsaild+0x2a0/0x880
[<0>] kthread+0xe7/0x110
[<0>] ret_from_fork+0x22/0x30

# # /proc/49747/task/49747: kworker/24:2+xfs-inodegc/md0 :
# cat /proc/49747/task/49747/stack

[<0>] xfs_buf_lock+0x35/0xf0
[<0>] xfs_buf_find_lock+0x45/0xf0
[<0>] xfs_buf_get_map+0x17d/0xa60
[<0>] xfs_buf_read_map+0x52/0x280
[<0>] xfs_trans_read_buf_map+0x115/0x350
[<0>] xfs_btree_read_buf_block.constprop.0+0x9a/0xd0
[<0>] xfs_btree_lookup_get_block+0x97/0x170
[<0>] xfs_btree_lookup+0xc4/0x4a0
[<0>] xfs_difree_finobt+0x62/0x250
[<0>] xfs_difree+0x130/0x1c0
[<0>] xfs_ifree+0x86/0x510
[<0>] xfs_inactive_ifree.isra.0+0xa2/0x1c0
[<0>] xfs_inactive+0xf8/0x170
[<0>] xfs_inodegc_worker+0x90/0x140
[<0>] process_one_work+0x1c7/0x3c0
[<0>] worker_thread+0x4d/0x3c0
[<0>] kthread+0xe7/0x110
[<0>] ret_from_fork+0x22/0x30

# # /proc/49844/task/49844: kworker/30:3+xfs-sync/md0 :
# cat /proc/49844/task/49844/stack

[<0>] __flush_workqueue+0x10e/0x390
[<0>] xlog_cil_push_now.isra.0+0x25/0x90
[<0>] xlog_cil_force_seq+0x7c/0x240
[<0>] xfs_log_force+0x83/0x240
[<0>] xfs_log_worker+0x3b/0xd0
[<0>] process_one_work+0x1c7/0x3c0
[<0>] worker_thread+0x4d/0x3c0
[<0>] kthread+0xe7/0x110
[<0>] ret_from_fork+0x22/0x30


# # /proc/52646/task/52646: kworker/u263:2+xfs-cil/md0 :
# cat /proc/52646/task/52646/stack

[<0>] raid5_get_active_stripe+0x20f/0x4d0
[<0>] raid5_make_request+0x24c/0x1170
[<0>] md_handle_request+0x131/0x220
[<0>] __submit_bio+0x89/0x130
[<0>] submit_bio_noacct_nocheck+0x160/0x360
[<0>] xlog_state_release_iclog+0xf6/0x1d0
[<0>] xlog_write_get_more_iclog_space+0x79/0xf0
[<0>] xlog_write+0x334/0x3b0
[<0>] xlog_cil_push_work+0x501/0x740
[<0>] process_one_work+0x1c7/0x3c0
[<0>] worker_thread+0x4d/0x3c0
[<0>] kthread+0xe7/0x110
[<0>] ret_from_fork+0x22/0x30

# # /proc/52753/task/52753: rm : rm -rf /project/pbackup_gone/data/C8029/home_Cyang/home_Cyang:202306011248:C3019.BEING_DELETED
# cat /proc/52753/task/52753/stack

[<0>] xfs_buf_lock+0x35/0xf0
[<0>] xfs_buf_find_lock+0x45/0xf0
[<0>] xfs_buf_get_map+0x17d/0xa60
[<0>] xfs_buf_read_map+0x52/0x280
[<0>] xfs_trans_read_buf_map+0x115/0x350
[<0>] xfs_read_agi+0x98/0x140
[<0>] xfs_iunlink+0x63/0x1f0
[<0>] xfs_remove+0x280/0x3a0
[<0>] xfs_vn_unlink+0x53/0xa0
[<0>] vfs_rmdir.part.0+0x5e/0x1e0
[<0>] do_rmdir+0x15c/0x1c0
[<0>] __x64_sys_unlinkat+0x4b/0x60
[<0>] do_syscall_64+0x43/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

Best
Donald

--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2023-09-24 14:37:23

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On 9/17/23 10:55, Donald Buczek wrote:
> On 9/14/23 08:03, Donald Buczek wrote:
>> On 9/13/23 16:16, Dragan Stancevic wrote:
>>> Hi Donald-
>>> [...]
>>> Here is a list of changes for 6.1:
>>>
>>> e5e9b9cb71a0 md: factor out a helper to wake up md_thread directly
>>> f71209b1f21c md: enhance checking in md_check_recovery()
>>> 753260ed0b46 md: wake up 'resync_wait' at last in md_reap_sync_thread()
>>> 130443d60b1b md: refactor idle/frozen_sync_thread() to fix deadlock
>>> 6f56f0c4f124 md: add a mutex to synchronize idle and frozen in action_store()
>>> 64e5e09afc14 md: refactor action_store() for 'idle' and 'frozen'
>>> a865b96c513b Revert "md: unlock mddev before reap sync_thread in action_store"
>>
>> Thanks!
>>
>> I've put these patches on v6.1.52. I've started a script which transitions the three md-devices of a very active backup server through idle->check->idle every 6 minutes a few ours ago.  It went through ~400 iterations till now. No lock-ups so far.
>
> Oh dear, looks like the deadlock problem is _not_fixed with these patches.

Some more info after another incident:

- We've hit the deadlock with 5.15.131 (so it is NOT introduced by any of the above patches)
- The symptoms are not exactly the same as with the original year-old problem. Differences:
- - mdX_raid6 is NOT busy looping
- - /sys/devices/virtual/block/mdX/md/array_state says "active" not "write pending"
- - `echo active > /sys/devices/virtual/block/mdX/md/array_state` does not resolve the deadlock
- - After hours in the deadlock state the system resumed operation when a script of mine read(!) lots of sysfs files.
- But in both cases, `echo idle > /sys/devices/virtual/block/mdX/md/sync_action` hangs as does all I/O operation on the raid.

The fact that we didn't hit the problem for many months on 5.15.94 might hint that it was introduced between 5.15.94 and 5.15.131.

We'll try to reproduce the problem on a test machine for analysis, but this may take time (vacation imminent, for one...).

But it's not like these patches caused the problem. And maybe they _did_ fix the original problem, as we didn't hit that one.

Best

Donald

--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2023-09-25 01:11:47

by Yu Kuai

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi,

On 2023/09/24 22:35, Donald Buczek wrote:
> On 9/17/23 10:55, Donald Buczek wrote:
>> On 9/14/23 08:03, Donald Buczek wrote:
>>> On 9/13/23 16:16, Dragan Stancevic wrote:
>>>> Hi Donald-
>>>> [...]
>>>> Here is a list of changes for 6.1:
>>>>
>>>> e5e9b9cb71a0 md: factor out a helper to wake up md_thread directly
>>>> f71209b1f21c md: enhance checking in md_check_recovery()
>>>> 753260ed0b46 md: wake up 'resync_wait' at last in md_reap_sync_thread()
>>>> 130443d60b1b md: refactor idle/frozen_sync_thread() to fix deadlock
>>>> 6f56f0c4f124 md: add a mutex to synchronize idle and frozen in
>>>> action_store()
>>>> 64e5e09afc14 md: refactor action_store() for 'idle' and 'frozen'
>>>> a865b96c513b Revert "md: unlock mddev before reap sync_thread in
>>>> action_store"
>>>
>>> Thanks!
>>>
>>> I've put these patches on v6.1.52. I've started a script which
>>> transitions the three md-devices of a very active backup server
>>> through idle->check->idle every 6 minutes a few ours ago.  It went
>>> through ~400 iterations till now. No lock-ups so far.
>>
>> Oh dear, looks like the deadlock problem is _not_fixed with these
>> patches.
>
> Some more info after another incident:
>
> - We've hit the deadlock with 5.15.131 (so it is NOT introduced by any
> of the above patches)
> - The symptoms are not exactly the same as with the original year-old
> problem. Differences:
> - - mdX_raid6 is NOT busy looping
> - - /sys/devices/virtual/block/mdX/md/array_state says "active" not
> "write pending"
> - - `echo active > /sys/devices/virtual/block/mdX/md/array_state` does
> not resolve the deadlock
> - - After hours in the deadlock state the system resumed operation when
> a script of mine read(!) lots of sysfs files.
> - But in both cases, `echo idle >
> /sys/devices/virtual/block/mdX/md/sync_action` hangs as does all I/O
> operation on the raid.
>
> The fact that we didn't hit the problem for many month on 5.15.94 might
> hint that it was introduced between 5.15.94 and 5.15.131
>
> We'll try to reproduce the problem on a test machine for analysis, but
> this make take time (vacation imminent for one...).
>
> But its not like these patches caused the problem. Any maybe they _did_
> fix the original problem, as we didn't hit that one.

Sorry for the late reply. Yes, this looks like a different problem. I'm
pretty confident that the original problem is fixed, since echo
idle/frozen no longer holds the 'reconfig_mutex' lock while waiting for
the sync_thread to be done.

I'll check patches between 5.15.94 and 5.15.131.
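
(Just as an illustration of one way to list the candidates — the stable
tags and the md file paths below are only the obvious suspects, not a
claim about where the change is:)

  git log --oneline v5.15.94..v5.15.131 -- drivers/md/md.c drivers/md/raid5.c drivers/md/md-bitmap.c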

Thanks,
Kuai

>
> Best
>
>   Donald
>

2023-09-25 09:25:08

by Donald Buczek

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On 9/25/23 03:11, Yu Kuai wrote:
> Hi,
>
> On 2023/09/24 22:35, Donald Buczek wrote:
>> On 9/17/23 10:55, Donald Buczek wrote:
>>> On 9/14/23 08:03, Donald Buczek wrote:
>>>> On 9/13/23 16:16, Dragan Stancevic wrote:
>>>>> Hi Donald-
>>>>> [...]
>>>>> Here is a list of changes for 6.1:
>>>>>
>>>>> e5e9b9cb71a0 md: factor out a helper to wake up md_thread directly
>>>>> f71209b1f21c md: enhance checking in md_check_recovery()
>>>>> 753260ed0b46 md: wake up 'resync_wait' at last in md_reap_sync_thread()
>>>>> 130443d60b1b md: refactor idle/frozen_sync_thread() to fix deadlock
>>>>> 6f56f0c4f124 md: add a mutex to synchronize idle and frozen in action_store()
>>>>> 64e5e09afc14 md: refactor action_store() for 'idle' and 'frozen'
>>>>> a865b96c513b Revert "md: unlock mddev before reap sync_thread in action_store"
>>>>
>>>> Thanks!
>>>>
>>>> I've put these patches on v6.1.52. I've started a script which transitions the three md-devices of a very active backup server through idle->check->idle every 6 minutes a few ours ago.  It went through ~400 iterations till now. No lock-ups so far.
>>>
>>> Oh dear, looks like the deadlock problem is _not_fixed with these patches.
>>
>> Some more info after another incident:
>>
>> - We've hit the deadlock with 5.15.131 (so it is NOT introduced by any of the above patches)
>> - The symptoms are not exactly the same as with the original year-old problem. Differences:
>> - - mdX_raid6 is NOT busy looping
>> - - /sys/devices/virtual/block/mdX/md/array_state says "active" not "write pending"
>> - - `echo active > /sys/devices/virtual/block/mdX/md/array_state` does not resolve the deadlock
>> - - After hours in the deadlock state the system resumed operation when a script of mine read(!) lots of sysfs files.
>> - But in both cases, `echo idle > /sys/devices/virtual/block/mdX/md/sync_action` hangs as does all I/O operation on the raid.
>>
>> The fact that we didn't hit the problem for many month on 5.15.94 might hint that it was introduced between 5.15.94 and 5.15.131
>>
>> We'll try to reproduce the problem on a test machine for analysis, but this make take time (vacation imminent for one...).
>>
>> But its not like these patches caused the problem. Any maybe they _did_ fix the original problem, as we didn't hit that one.
>
> Sorry for the late reply, yes, this looks like a different problem. I'm
> pretty confident that the orignal problem is fixed since that echo
> idle/frozen doesn't hold the lock 'reconfig_mutex' to wait for
> sync_thread to be done.
>
> I'll check patches between 5.15.94 and 5.15.131.

We've got another event today. Here is some more information to save you work. I'm sorry this comes dripping in, but as I said, currently we can't reproduce it at will and only hit it on production machines, where we have limited time to analyze:

* In the last two events, `echo idle > /sys/devices/virtual/block/mdX/md/sync_action` was not even executing. So this is not the trigger; it just happened to be a random victim when the problem occurred the first time. This misled me into believing that this is some variation of the old problem.

* It's not filesystem related: yesterday `blkid -o value -s LABEL /dev/md1` was hanging, too, and today, for example, `df`.

* /sys/devices/virtual/block/md0/inflight today was (frozen at) "2 579"

* iotop showed no disk activity (on the raid) at all. Only a single member device had activity from time to time (usually after ~30 seconds, but sometimes after a few seconds) with usually 1-4 tps, but sometimes more, max 136 tps.

* As I said, I use a script to take a snapshot of various /sys and /proc information, and running this script resolved the deadlock twice.

* The stack traces of mdX_raid6 of the hanging raid recorded in the two events were

[<0>] md_bitmap_unplug.part.0+0xce/0x100
[<0>] raid5d+0xe4/0x5a0
[<0>] md_thread+0xab/0x160
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

and

[<0>] md_super_wait+0x72/0xa0
[<0>] md_bitmap_unplug.part.0+0xce/0x100
[<0>] raid5d+0xe4/0x5a0
[<0>] md_thread+0xab/0x160
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

But note that these were probably taken after the previous commands in the script had already unfrozen the system. Today I manually looked at the stack while the system was still frozen, and it was just

[<0>] md_thread+0x122/0x160
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

* Because I knew that my script seems to unblock the system, I ran it slowly, line by line, to see what actually unfreezes the system. There is one loop which takes "comm", "cmdline" and "stack" of all threads:

for task in /proc/*/task/*; do
echo "# # $task: $(cat $task/comm) : $(cat $task/cmdline | xargs -0 echo)"
cmd cat $task/stack
done

I added a few "read" statements to single-step it. Unfortunately, when it came to the 64 nfsd threads, I got a bit impatient and hit "return" faster than I should have, so when the unfreeze happened, I couldn't say exactly where it was triggered. But it must have been somewhere in this tail:

# # /proc/1299/task/1299: nfsd

[<0>] svc_recv+0x7a7/0x8c0 [sunrpc]
[<0>] nfsd+0xd6/0x140 [nfsd]
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

# # /proc/13/task/13: ksoftirqd/0

[<0>] smpboot_thread_fn+0xf3/0x140
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

# # /proc/130/task/130: cpuhp/22

[<0>] smpboot_thread_fn+0xf3/0x140
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

# # /proc/1300/task/1300: nfsd

[<0>] svc_recv+0x7a7/0x8c0 [sunrpc]
[<0>] nfsd+0xd6/0x140 [nfsd]
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

## (3 more repetitions of other nfsd threads with exactly the same stack skipped here) ##

So it appears that possibly a `cat /proc/PID/stack` of a "ksoftirqd" or (maybe) a "cpuhp" thread unblocks the system. "nfsd" seems unlikely, as there shouldn't be, and wasn't, anything NFS-mounted from this system.

Conclusion: This is probably not related to mdraid at all and might be a problem in the block layer or some other infrastructure subsystem. Do you agree?

Best

Donald
--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2023-09-25 09:39:48

by Yu Kuai

[permalink] [raw]
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi,

On 2023/09/25 17:11, Donald Buczek wrote:
> On 9/25/23 03:11, Yu Kuai wrote:
>> Hi,
>>
>> On 2023/09/24 22:35, Donald Buczek wrote:
>>> On 9/17/23 10:55, Donald Buczek wrote:
>>>> On 9/14/23 08:03, Donald Buczek wrote:
>>>>> On 9/13/23 16:16, Dragan Stancevic wrote:
>>>>>> Hi Donald-
>>>>>> [...]
>>>>>> Here is a list of changes for 6.1:
>>>>>>
>>>>>> e5e9b9cb71a0 md: factor out a helper to wake up md_thread directly
>>>>>> f71209b1f21c md: enhance checking in md_check_recovery()
>>>>>> 753260ed0b46 md: wake up 'resync_wait' at last in
>>>>>> md_reap_sync_thread()
>>>>>> 130443d60b1b md: refactor idle/frozen_sync_thread() to fix deadlock
>>>>>> 6f56f0c4f124 md: add a mutex to synchronize idle and frozen in
>>>>>> action_store()
>>>>>> 64e5e09afc14 md: refactor action_store() for 'idle' and 'frozen'
>>>>>> a865b96c513b Revert "md: unlock mddev before reap sync_thread in
>>>>>> action_store"
>>>>>
>>>>> Thanks!
>>>>>
>>>>> I've put these patches on v6.1.52. I've started a script which
>>>>> transitions the three md-devices of a very active backup server
>>>>> through idle->check->idle every 6 minutes a few ours ago.  It went
>>>>> through ~400 iterations till now. No lock-ups so far.
>>>>
>>>> Oh dear, looks like the deadlock problem is _not_fixed with these
>>>> patches.
>>>
>>> Some more info after another incident:
>>>
>>> - We've hit the deadlock with 5.15.131 (so it is NOT introduced by
>>> any of the above patches)
>>> - The symptoms are not exactly the same as with the original year-old
>>> problem. Differences:
>>> - - mdX_raid6 is NOT busy looping
>>> - - /sys/devices/virtual/block/mdX/md/array_state says "active" not
>>> "write pending"
>>> - - `echo active > /sys/devices/virtual/block/mdX/md/array_state`
>>> does not resolve the deadlock
>>> - - After hours in the deadlock state the system resumed operation
>>> when a script of mine read(!) lots of sysfs files.
>>> - But in both cases, `echo idle >
>>> /sys/devices/virtual/block/mdX/md/sync_action` hangs as does all I/O
>>> operation on the raid.
>>>
>>> The fact that we didn't hit the problem for many month on 5.15.94
>>> might hint that it was introduced between 5.15.94 and 5.15.131
>>>
>>> We'll try to reproduce the problem on a test machine for analysis,
>>> but this make take time (vacation imminent for one...).
>>>
>>> But its not like these patches caused the problem. Any maybe they
>>> _did_ fix the original problem, as we didn't hit that one.
>>
>> Sorry for the late reply, yes, this looks like a different problem. I'm
>> pretty confident that the orignal problem is fixed since that echo
>> idle/frozen doesn't hold the lock 'reconfig_mutex' to wait for
>> sync_thread to be done.
>>
>> I'll check patches between 5.15.94 and 5.15.131.
>
> We've got another event today. Some more information to save you work.
> I'm sorry, this comes dripping in, but as I said, currently we can't
> reproduce it and hit it on production machines only, where we have
> limited time to analyze:

There is a way to clarify whether IO is stuck in the underlying disks:

Once the problem is triggered and there is no disk activity:

cat /sys/kernel/debug/block/[disk]/hctx*/sched_tags | grep busy
cat /sys/kernel/debug/block/[disk]/hctx*/tags | grep busy

If busy is not 0, that means IO is stuck in the underlying disks, and this
problem is not related to raid; otherwise, raid is not issuing any IO to the
underlying disks and the problem is related to raid.
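
(As a convenience, a sketch that runs the same check for every member disk
of one array; it assumes debugfs is mounted at /sys/kernel/debug and uses
md0 as an example:)

  for dev in /sys/block/md0/slaves/*; do
      d=$(basename $dev)
      echo "== $d =="
      grep -H busy /sys/kernel/debug/block/$d/hctx*/tags \
                   /sys/kernel/debug/block/$d/hctx*/sched_tags 2>/dev/null
  done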

>
> * In the last two events, "echo idle >
> sys/devices/virtual/block/mdX/md/sync_action" was not even executing.
> This is not a trigger, but was a random victim when it happened the
> first time. This deceived me to believe this is some variation of the
> old problem.
>
> * It's not filesystem related, yesterday `blkid -o value -s LABEL
> /dev/md1` was hanging, too, and today, for example, `df`.
>
> * /sys/devices/virtual/block/md0/inflight today was (frozen at) "2
> 579"
>
> * iotop showed no disk activity (on the raid) at all. Only a single
> member device had activity from time to time (usually after ~30 seconds,
> but sometimes after a few seconds) with usually 1-4 tps, but sometimes
> more, max 136 tps.
>
> * As I said, I use a script to take a snapshot of various /sys and /proc
> information and running this script resolved the deadlock twice.
>
> * The recorded stack traces of mdX_raid6 of the hanging raid recorded in
> the two events were
>
>     [<0>] md_bitmap_unplug.part.0+0xce/0x100
>     [<0>] raid5d+0xe4/0x5a0
>     [<0>] md_thread+0xab/0x160
>     [<0>] kthread+0x127/0x150
>     [<0>] ret_from_fork+0x22/0x30
>
> and
>
>     [<0>] md_super_wait+0x72/0xa0
>     [<0>] md_bitmap_unplug.part.0+0xce/0x100
>     [<0>] raid5d+0xe4/0x5a0
>     [<0>] md_thread+0xab/0x160
>     [<0>] kthread+0x127/0x150
>     [<0>] ret_from_fork+0x22/0x30

The above stack shows that raid issued bitmap IO to the underlying disks and
is waiting for that IO to be done. Unless bitmap IO handling is broken in
raid, this problem should not be related to raid; the debugfs files above
can help to clarify this.

Thanks,
Kuai

>
> But note, that these probably were taken after the previous commands in
> the script already unfroze the system. Today I've manually looked at the
> stack while the system was still frozen, and it was just
>
>     [<0>] md_thread+0x122/0x160
>     [<0>] kthread+0x127/0x150
>     [<0>] ret_from_fork+0x22/0x30
>
> * Because I knew that my script seems to unblock the system, I've run it
> slowly line by line to see what actually unfreezes the system. There is
> one loop which takes "comm" "cmdline" and "stack" of all threads:
>
>     for task in /proc/*/task/*; do
>         echo  "# # $task: $(cat $task/comm) : $(cat $task/cmdline |
> xargs -0 echo)"
>         cmd cat $task/stack
>     done
>
> I've added a few "read" to single-step it. Unfortunately, when it came
> to the 64 nfsd threads, I've got a bit impatient and hit "return" faster
> then I should have and when the unfreeze happened, I couldn't say
> exactly were it was triggered. But it must have been somewhere in this
> tail:
>
> # # /proc/1299/task/1299: nfsd
>
> [<0>] svc_recv+0x7a7/0x8c0 [sunrpc]
> [<0>] nfsd+0xd6/0x140 [nfsd]
> [<0>] kthread+0x127/0x150
> [<0>] ret_from_fork+0x22/0x30
>
> # # /proc/13/task/13: ksoftirqd/0
>
> [<0>] smpboot_thread_fn+0xf3/0x140
> [<0>] kthread+0x127/0x150
> [<0>] ret_from_fork+0x22/0x30
>
> # # /proc/130/task/130: cpuhp/22
>
> [<0>] smpboot_thread_fn+0xf3/0x140
> [<0>] kthread+0x127/0x150
> [<0>] ret_from_fork+0x22/0x30
>
> # # /proc/1300/task/1300: nfsd
>
> [<0>] svc_recv+0x7a7/0x8c0 [sunrpc]
> [<0>] nfsd+0xd6/0x140 [nfsd]
> [<0>] kthread+0x127/0x150
> [<0>] ret_from_fork+0x22/0x30
>
> ## (3 more repetitions of other nfsd threads which exactly the same
> stack skipped here ##
>
> So it appears, that possibly a cat /proc/PID/stack of a "ksoftirqd" or a
> (maybe) a "cpuhp" thread unblocks the system. "nfsd" seems unlikely, as
> there shouldn't and wasn't anything nfs-mounted from this system.
>
> Conclusion: This is probably not related to mdraid at all and might be a
> problem of the block or some infrastructure subsystem. Do you agree?
>
> Best
>
>   Donald