2022-08-17 15:07:44

by Chris Murphy

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18



On Wed, Aug 17, 2022, at 10:53 AM, Ming Lei wrote:
> On Wed, Aug 17, 2022 at 10:34:38AM -0400, Chris Murphy wrote:
>>
>>
>> On Wed, Aug 17, 2022, at 8:06 AM, Ming Lei wrote:
>>
>> > blk-mq debugfs log is usually helpful for io stall issue, care to post
>> > the blk-mq debugfs log:
>> >
>> > (cd /sys/kernel/debug/block/$disk && find . -type f -exec grep -aH . {} \;)
>>
>> This is only sda
>> https://drive.google.com/file/d/1aAld-kXb3RUiv_ShAvD_AGAFDRS03Lr0/view?usp=sharing
>
> From the log, there isn't any in-flight IO request.
>
> So please confirm that it is collected after the IO stall is triggered.

Yes, iotop reports no reads or writes at the time of collection. IO pressure 99% for auditd, systemd-journald, rsyslogd, and postgresql, with increasing pressure from all the qemu processes.

Keep in mind this is a raid10, so maybe it's enough for just one block device IO to stall and the whole thing stops? That's why I included all block devices.

> If yes, the issue may not be related with BFQ, and should be related
> with blk-cgroup code.

The problem happens with cgroup_disable=io; does that setting affect blk-cgroup?

--
Chris Murphy


2022-08-17 15:37:53

by Ming Lei

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18

On Wed, Aug 17, 2022 at 11:02:25AM -0400, Chris Murphy wrote:
>
>
> On Wed, Aug 17, 2022, at 10:53 AM, Ming Lei wrote:
> > On Wed, Aug 17, 2022 at 10:34:38AM -0400, Chris Murphy wrote:
> >>
> >>
> >> On Wed, Aug 17, 2022, at 8:06 AM, Ming Lei wrote:
> >>
> >> > blk-mq debugfs log is usually helpful for io stall issue, care to post
> >> > the blk-mq debugfs log:
> >> >
> >> > (cd /sys/kernel/debug/block/$disk && find . -type f -exec grep -aH . {} \;)
> >>
> >> This is only sda
> >> https://drive.google.com/file/d/1aAld-kXb3RUiv_ShAvD_AGAFDRS03Lr0/view?usp=sharing
> >
> > From the log, there isn't any in-flight IO request.
> >
> > So please confirm that it is collected after the IO stall is triggered.
>
> Yes, iotop reports no reads or writes at the time of collection. IO pressure 99% for auditd, systemd-journald, rsyslogd, and postgresql, with increasing pressure from all the qemu processes.
>
> Keep in mind this is a raid10, so maybe it's enough for just one block device IO to stall and the whole thing stops? That's why I included all block devices.
>

From the 2nd log of blockdebugfs-all.txt, I still don't see any in-flight IO on
request-based block devices, but sda is _not_ included in this log, and
only sdi, sdg and sdf are collected. Is that expected?

BTW, all request-based block devices should be visible in blk-mq debugfs.
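As a quick cross-check, here is a sketch that lists which devices actually appear in a collected dump. The `./<disk>/hctx*/...` path layout is assumed from the `grep -aH` output above; the sample lines below are synthetic, not from the real log:

```shell
# Print the set of devices present in a "grep -aH" blk-mq debugfs dump,
# where each line looks like "./sdd/hctx2/sched/busy:busy=308".
# Splitting on "/" makes field 2 the device name.
devices_in_log() {
  awk -F/ '{ print $2 }' | sort -u
}

# Synthetic sample standing in for a real blockdebugfs-all.txt:
printf './sdi/hctx0/flags:alloc\n./sdg/hctx1/busy:busy=3\n./sdf/hctx0/type:default\n' \
  | devices_in_log
# prints sdf, sdg, sdi (one per line)
```

Any device missing from that list was not captured in the collection run.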



thanks,
Ming

2022-08-17 17:41:28

by Chris Murphy

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18



On Wed, Aug 17, 2022, at 11:34 AM, Ming Lei wrote:

> From the 2nd log of blockdebugfs-all.txt, still not see any in-flight IO on
> request based block devices, but sda is _not_ included in this log, and
> only sdi, sdg and sdf are collected, is that expected?

While the problem was happening I did

cd /sys/kernel/debug/block
find . -type f -exec grep -aH . {} \;

The file has the nodes out of order, but I don't know enough about the interface to see if there are things that are missing, or what it means.
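For anyone reading the raw file: a sketch that re-groups such a dump by its path prefix, assuming each line has the `grep -aH` form `<path>:<content>` (the input below is synthetic):

```shell
# Sort a "grep -aH" dump by the part before the colon so all entries for
# one device/hctx sit together; -t: -k1,1 keys on the path, -s keeps the
# original order for lines sharing a path, LC_ALL=C makes it deterministic.
order_nodes() {
  LC_ALL=C sort -t: -k1,1 -s
}

printf './sdi/hctx0/flags:x\n./sda/hctx1/busy:y\n./sda/hctx0/type:z\n' | order_nodes
```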


> BTW, all request based block devices should be observed in blk-mq debugfs.

/sys/kernel/debug/block contains

drwxr-xr-x. 2 root root 0 Aug 17 15:20 md0
drwxr-xr-x. 51 root root 0 Aug 17 15:20 sda
drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdb
drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdc
drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdd
drwxr-xr-x. 51 root root 0 Aug 17 15:20 sde
drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdf
drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdg
drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdh
drwxr-xr-x. 4 root root 0 Aug 17 15:20 sdi
drwxr-xr-x. 2 root root 0 Aug 17 15:20 zram0


--
Chris Murphy

2022-08-18 01:07:21

by Ming Lei

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18

On Wed, Aug 17, 2022 at 12:34:42PM -0400, Chris Murphy wrote:
>
>
> On Wed, Aug 17, 2022, at 11:34 AM, Ming Lei wrote:
>
> > From the 2nd log of blockdebugfs-all.txt, still not see any in-flight IO on
> > request based block devices, but sda is _not_ included in this log, and
> > only sdi, sdg and sdf are collected, is that expected?
>
> While the problem was happening I did
>
> cd /sys/kernel/debug/block
> find . -type f -exec grep -aH . {} \;
>
> The file has the nodes out of order, but I don't know enough about the interface to see if there are things that are missing, or what it means.
>
>
> > BTW, all request based block devices should be observed in blk-mq debugfs.
>
> /sys/kernel/debug/block contains
>
> drwxr-xr-x. 2 root root 0 Aug 17 15:20 md0
> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sda
> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdb
> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdc
> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdd
> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sde
> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdf
> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdg
> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdh
> drwxr-xr-x. 4 root root 0 Aug 17 15:20 sdi
> drwxr-xr-x. 2 root root 0 Aug 17 15:20 zram0

OK, so lots of devices are missing from your log. The following command
should collect the log from every block device's debugfs:

(cd /sys/kernel/debug/block/ && find . -type f -exec grep -aH . {} \;)
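If it helps to take repeated snapshots while the stall develops, here is a hypothetical wrapper along the same lines; the debugfs path and the find/grep command come from the thread, while the timestamped filename is my own convention:

```shell
# Snapshot the blk-mq debugfs state into a timestamped file so several
# collections taken during one stall can be compared afterwards.
# The directory argument defaults to the debugfs path used above;
# pass another directory when testing without root/debugfs.
collect_blkmq() {
  dir=${1:-/sys/kernel/debug/block}
  out="blockdebugfs-$(date +%Y%m%d-%H%M%S).txt"
  (cd "$dir" && find . -type f -exec grep -aH . {} \;) > "$out"
  echo "$out"   # report the snapshot filename
}
```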


Thanks,
Ming

2022-08-18 03:04:01

by Chris Murphy

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18



On Wed, Aug 17, 2022, at 9:03 PM, Ming Lei wrote:
> On Wed, Aug 17, 2022 at 12:34:42PM -0400, Chris Murphy wrote:
>>
>>
>> On Wed, Aug 17, 2022, at 11:34 AM, Ming Lei wrote:
>>
>> > From the 2nd log of blockdebugfs-all.txt, still not see any in-flight IO on
>> > request based block devices, but sda is _not_ included in this log, and
>> > only sdi, sdg and sdf are collected, is that expected?
>>
>> While the problem was happening I did
>>
>> cd /sys/kernel/debug/block
>> find . -type f -exec grep -aH . {} \;
>>
>> The file has the nodes out of order, but I don't know enough about the interface to see if there are things that are missing, or what it means.
>>
>>
>> > BTW, all request based block devices should be observed in blk-mq debugfs.
>>
>> /sys/kernel/debug/block contains
>>
>> drwxr-xr-x. 2 root root 0 Aug 17 15:20 md0
>> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sda
>> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdb
>> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdc
>> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdd
>> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sde
>> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdf
>> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdg
>> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdh
>> drwxr-xr-x. 4 root root 0 Aug 17 15:20 sdi
>> drwxr-xr-x. 2 root root 0 Aug 17 15:20 zram0
>
> OK, so lots of devices are missed in your log, and the following command
> is supposed to work for collecting log from all block device's debugfs:
>
> (cd /sys/kernel/debug/block/ && find . -type f -exec grep -aH . {} \;)

OK here it is:

https://drive.google.com/file/d/18nEOx2Ghsqx8uII6nzWpCFuYENHuQd-f/view?usp=sharing


--
Chris Murphy

2022-08-18 03:50:28

by Ming Lei

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18

On Wed, Aug 17, 2022 at 10:30:39PM -0400, Chris Murphy wrote:
>
>
> On Wed, Aug 17, 2022, at 9:03 PM, Ming Lei wrote:
> > On Wed, Aug 17, 2022 at 12:34:42PM -0400, Chris Murphy wrote:
> >>
> >>
> >> On Wed, Aug 17, 2022, at 11:34 AM, Ming Lei wrote:
> >>
> >> > From the 2nd log of blockdebugfs-all.txt, still not see any in-flight IO on
> >> > request based block devices, but sda is _not_ included in this log, and
> >> > only sdi, sdg and sdf are collected, is that expected?
> >>
> >> While the problem was happening I did
> >>
> >> cd /sys/kernel/debug/block
> >> find . -type f -exec grep -aH . {} \;
> >>
> >> The file has the nodes out of order, but I don't know enough about the interface to see if there are things that are missing, or what it means.
> >>
> >>
> >> > BTW, all request based block devices should be observed in blk-mq debugfs.
> >>
> >> /sys/kernel/debug/block contains
> >>
> >> drwxr-xr-x. 2 root root 0 Aug 17 15:20 md0
> >> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sda
> >> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdb
> >> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdc
> >> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdd
> >> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sde
> >> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdf
> >> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdg
> >> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdh
> >> drwxr-xr-x. 4 root root 0 Aug 17 15:20 sdi
> >> drwxr-xr-x. 2 root root 0 Aug 17 15:20 zram0
> >
> > OK, so lots of devices are missed in your log, and the following command
> > is supposed to work for collecting log from all block device's debugfs:
> >
> > (cd /sys/kernel/debug/block/ && find . -type f -exec grep -aH . {} \;)
>
> OK here it is:
>
> https://drive.google.com/file/d/18nEOx2Ghsqx8uII6nzWpCFuYENHuQd-f/view?usp=sharing

The above log shows that the io stall happens on sdd, where:

1) 616 requests pending in the scheduler queue

grep "busy=" blockdebugfs-all2.txt | grep sdd | grep sched | awk -F "=" '{s+=$2} END {print s}'
616
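The same accounting can be generalized; a sketch that sums the `busy=` counts per device from a collected dump (the real input is the `find`/`grep -aH` output; the sample lines below are synthetic):

```shell
# Sum busy= values per device from lines like
# "./sdd/hctx2/sched/busy:busy=308".
# -F'[/=]' splits on both "/" and "=", so $2 is the device
# and $NF is the busy count.
sum_busy() {
  awk -F'[/=]' '/busy=/ { s[$2] += $NF } END { for (d in s) print d, s[d] }'
}

printf './sdd/hctx2/sched/busy:busy=308\n./sdd/hctx3/sched/busy:busy=308\n./sda/hctx0/sched/busy:busy=5\n' \
  | sum_busy | sort
# prints "sda 5" then "sdd 616"
```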

2) 11 requests pending in ./sdd/hctx2/dispatch for more than 300 seconds

Recently we have seldom observed IO hangs from the dispatch list, except for
the following two:

https://lore.kernel.org/linux-block/[email protected]/
https://lore.kernel.org/linux-block/[email protected]/

BTW, what is the output of the following log?

(cd /sys/block/sdd/device && find . -type f -exec grep -aH . {} \;)

Also, the above log shows that host_tagset_enable support is still
crippled in v5.12. I guess the issue may not be triggered (or only with
difficulty) after you update to d97e594c5166 ("blk-mq: Use request queue-wide
tags for tagset-wide sbitmap"), i.e. v5.14.



thanks,
Ming

2022-08-18 04:19:30

by Chris Murphy

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18



On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:

> OK, can you post the blk-mq debugfs log after you trigger it on v5.17?

https://drive.google.com/file/d/1n8f66pVLCwQTJ0PMd71EiUZoeTWQk3dB/view?usp=sharing

This time it happened pretty quickly. This log was taken soon after triple-digit load with no IO, but the stall is not as fully developed as before. The system has become entirely unresponsive to new commands, so I have to issue sysrq+b; if I let it go too long, even that won't work.

--
Chris Murphy

2022-08-18 04:27:58

by Chris Murphy

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18



On Thu, Aug 18, 2022, at 12:12 AM, Chris Murphy wrote:
> On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:
>
>> OK, can you post the blk-mq debugfs log after you trigger it on v5.17?
>
> https://drive.google.com/file/d/1n8f66pVLCwQTJ0PMd71EiUZoeTWQk3dB/view?usp=sharing
>
> This time it happened pretty quickly. This log is soon after triple
> digit load and no IO, but not as fully developed as before. The system
> has become entirely unresponsive to new commands, so I have to issue
> sysrq+b - if I let it go too long even that won't work.

OK, by the time I clicked send, the system had recovered. That also happens sometimes, but then IO stalls again later and won't recover. So I haven't issued sysrq+b on this run yet. Here is a second blk-mq debugfs log...

https://drive.google.com/file/d/1irHcns0qe7e7DJaDfanX8vSiqE1Nj5xl/view?usp=sharing


--
Chris Murphy

2022-08-18 05:20:54

by Chris Murphy

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18



On Thu, Aug 18, 2022, at 12:18 AM, Chris Murphy wrote:
> On Thu, Aug 18, 2022, at 12:12 AM, Chris Murphy wrote:
>> On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:
>>
>>> OK, can you post the blk-mq debugfs log after you trigger it on v5.17?

Same boot, 3rd log. But the load is above 300 so I kinda need to sysrq+b soon.

https://drive.google.com/file/d/1375H558kqPTdng439rvG6LuXXWPXLToo/view?usp=sharing



--
Chris Murphy

2022-08-18 05:23:56

by Ming Lei

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18

On Thu, Aug 18, 2022 at 12:27:04AM -0400, Chris Murphy wrote:
>
>
> On Thu, Aug 18, 2022, at 12:18 AM, Chris Murphy wrote:
> > On Thu, Aug 18, 2022, at 12:12 AM, Chris Murphy wrote:
> >> On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:
> >>
> >>> OK, can you post the blk-mq debugfs log after you trigger it on v5.17?
>
> Same boot, 3rd log. But the load is above 300 so I kinda need to sysrq+b soon.
>
> https://drive.google.com/file/d/1375H558kqPTdng439rvG6LuXXWPXLToo/view?usp=sharing
>

Please test the following patch and see if it makes a difference:

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index a4f7c101b53b..8e8d77e79dd6 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -44,7 +44,10 @@ void __blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
 	 */
 	smp_mb();
 
-	blk_mq_run_hw_queue(hctx, true);
+	if (blk_mq_is_shared_tags(hctx->flags))
+		blk_mq_run_hw_queues(hctx->queue, true);
+	else
+		blk_mq_run_hw_queue(hctx, true);
 }
 
 static int sched_rq_cmp(void *priv, const struct list_head *a,


Thanks,
Ming

2022-08-18 05:28:32

by Ming Lei

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18

On Thu, Aug 18, 2022 at 12:27:04AM -0400, Chris Murphy wrote:
>
>
> On Thu, Aug 18, 2022, at 12:18 AM, Chris Murphy wrote:
> > On Thu, Aug 18, 2022, at 12:12 AM, Chris Murphy wrote:
> >> On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:
> >>
> >>> OK, can you post the blk-mq debugfs log after you trigger it on v5.17?
>
> Same boot, 3rd log. But the load is above 300 so I kinda need to sysrq+b soon.
>
> https://drive.google.com/file/d/1375H558kqPTdng439rvG6LuXXWPXLToo/view?usp=sharing
>

Also please test the following one too:


diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5ee62b95f3e5..d01c64be08e2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1991,7 +1991,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
 	if (!needs_restart ||
 	    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
 		blk_mq_run_hw_queue(hctx, true);
-	else if (needs_restart && needs_resource)
+	else if (needs_restart && (needs_resource ||
+				blk_mq_is_shared_tags(hctx->flags)))
 		blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
 
 	blk_mq_update_dispatch_busy(hctx, true);


Thanks,
Ming

2022-08-18 14:22:45

by Chris Murphy

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18



On Thu, Aug 18, 2022, at 1:24 AM, Ming Lei wrote:

>
> Also please test the following one too:
>
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 5ee62b95f3e5..d01c64be08e2 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1991,7 +1991,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx
> *hctx, struct list_head *list,
> if (!needs_restart ||
> (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> blk_mq_run_hw_queue(hctx, true);
> - else if (needs_restart && needs_resource)
> + else if (needs_restart && (needs_resource ||
> + blk_mq_is_shared_tags(hctx->flags)))
> blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
>
> blk_mq_update_dispatch_busy(hctx, true);
>

Should I test both patches at the same time, or separately? On top of v5.17 clean, or with b6e68ee82585 still reverted?

--
Chris Murphy

2022-08-18 15:23:31

by Ming Lei

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18

On Thu, Aug 18, 2022 at 9:50 PM Chris Murphy <[email protected]> wrote:
>
>
>
> On Thu, Aug 18, 2022, at 1:24 AM, Ming Lei wrote:
>
> >
> > Also please test the following one too:
> >
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 5ee62b95f3e5..d01c64be08e2 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1991,7 +1991,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx
> > *hctx, struct list_head *list,
> > if (!needs_restart ||
> > (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> > blk_mq_run_hw_queue(hctx, true);
> > - else if (needs_restart && needs_resource)
> > + else if (needs_restart && (needs_resource ||
> > + blk_mq_is_shared_tags(hctx->flags)))
> > blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
> >
> > blk_mq_update_dispatch_busy(hctx, true);
> >
>
> Should I test both patches at the same time, or separately? On top of v5.17 clean, or with b6e68ee82585 still reverted?

Please test them separately against v5.17.

thanks,

2022-08-18 19:24:19

by Chris Murphy

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18



On Thu, Aug 18, 2022, at 1:15 AM, Ming Lei wrote:
> On Thu, Aug 18, 2022 at 12:27:04AM -0400, Chris Murphy wrote:
>>
>>
>> On Thu, Aug 18, 2022, at 12:18 AM, Chris Murphy wrote:
>> > On Thu, Aug 18, 2022, at 12:12 AM, Chris Murphy wrote:
>> >> On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:
>> >>
>> >>> OK, can you post the blk-mq debugfs log after you trigger it on v5.17?
>>
>> Same boot, 3rd log. But the load is above 300 so I kinda need to sysrq+b soon.
>>
>> https://drive.google.com/file/d/1375H558kqPTdng439rvG6LuXXWPXLToo/view?usp=sharing
>>
>
> Please test the following patch and see if it makes a difference:
>
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index a4f7c101b53b..8e8d77e79dd6 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -44,7 +44,10 @@ void __blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
> */
> smp_mb();
>
> - blk_mq_run_hw_queue(hctx, true);
> + if (blk_mq_is_shared_tags(hctx->flags))
> + blk_mq_run_hw_queues(hctx->queue, true);
> + else
> + blk_mq_run_hw_queue(hctx, true);
> }
>
> static int sched_rq_cmp(void *priv, const struct list_head *a,


I still get a stall. By the time I noticed it, I couldn't run any new commands (they just hung), so I had to sysrq+b. Let me know if I should rerun the test in order to capture the block debug log.


--
Chris Murphy

2022-08-19 19:38:33

by Chris Murphy

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18



On Thu, Aug 18, 2022, at 1:24 AM, Ming Lei wrote:
> On Thu, Aug 18, 2022 at 12:27:04AM -0400, Chris Murphy wrote:
>>
>>
>> On Thu, Aug 18, 2022, at 12:18 AM, Chris Murphy wrote:
>> > On Thu, Aug 18, 2022, at 12:12 AM, Chris Murphy wrote:
>> >> On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:
>> >>
>> >>> OK, can you post the blk-mq debugfs log after you trigger it on v5.17?
>>
>> Same boot, 3rd log. But the load is above 300 so I kinda need to sysrq+b soon.
>>
>> https://drive.google.com/file/d/1375H558kqPTdng439rvG6LuXXWPXLToo/view?usp=sharing
>>
>
> Also please test the following one too:
>
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 5ee62b95f3e5..d01c64be08e2 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1991,7 +1991,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx
> *hctx, struct list_head *list,
> if (!needs_restart ||
> (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> blk_mq_run_hw_queue(hctx, true);
> - else if (needs_restart && needs_resource)
> + else if (needs_restart && (needs_resource ||
> + blk_mq_is_shared_tags(hctx->flags)))
> blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
>
> blk_mq_update_dispatch_busy(hctx, true);
>


With just this patch on top of 5.17.0, it still hangs. I've captured the block debugfs log:
https://drive.google.com/file/d/1ic4YHxoL9RrCdy_5FNdGfh_q_J3d_Ft0/view?usp=sharing



--
Chris Murphy

2022-08-20 07:17:20

by Ming Lei

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18

On Fri, Aug 19, 2022 at 03:20:25PM -0400, Chris Murphy wrote:
>
>
> On Thu, Aug 18, 2022, at 1:24 AM, Ming Lei wrote:
> > On Thu, Aug 18, 2022 at 12:27:04AM -0400, Chris Murphy wrote:
> >>
> >>
> >> On Thu, Aug 18, 2022, at 12:18 AM, Chris Murphy wrote:
> >> > On Thu, Aug 18, 2022, at 12:12 AM, Chris Murphy wrote:
> >> >> On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:
> >> >>
> >> >>> OK, can you post the blk-mq debugfs log after you trigger it on v5.17?
> >>
> >> Same boot, 3rd log. But the load is above 300 so I kinda need to sysrq+b soon.
> >>
> >> https://drive.google.com/file/d/1375H558kqPTdng439rvG6LuXXWPXLToo/view?usp=sharing
> >>
> >
> > Also please test the following one too:
> >
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 5ee62b95f3e5..d01c64be08e2 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1991,7 +1991,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx
> > *hctx, struct list_head *list,
> > if (!needs_restart ||
> > (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> > blk_mq_run_hw_queue(hctx, true);
> > - else if (needs_restart && needs_resource)
> > + else if (needs_restart && (needs_resource ||
> > + blk_mq_is_shared_tags(hctx->flags)))
> > blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
> >
> > blk_mq_update_dispatch_busy(hctx, true);
> >
>
>
> With just this patch on top of 5.17.0, it still hangs. I've captured block debugfs log:
> https://drive.google.com/file/d/1ic4YHxoL9RrCdy_5FNdGfh_q_J3d_Ft0/view?usp=sharing

The log is similar to the one before; the only difference is that RESTART is not
set.

Also, the following patch, merged in v5.18, fixes an IO stall too; feel free to test it:

8f5fea65b06d blk-mq: avoid extending delays of active hctx from blk_mq_delay_run_hw_queues



Thanks,
Ming

2022-09-01 07:40:19

by Yu Kuai

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18

Hi, Chris

On 2022/08/20 15:00, Ming Lei wrote:
> On Fri, Aug 19, 2022 at 03:20:25PM -0400, Chris Murphy wrote:
>>
>>
>> On Thu, Aug 18, 2022, at 1:24 AM, Ming Lei wrote:
>>> On Thu, Aug 18, 2022 at 12:27:04AM -0400, Chris Murphy wrote:
>>>>
>>>>
>>>> On Thu, Aug 18, 2022, at 12:18 AM, Chris Murphy wrote:
>>>>> On Thu, Aug 18, 2022, at 12:12 AM, Chris Murphy wrote:
>>>>>> On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:
>>>>>>
>>>>>>> OK, can you post the blk-mq debugfs log after you trigger it on v5.17?
>>>>
>>>> Same boot, 3rd log. But the load is above 300 so I kinda need to sysrq+b soon.
>>>>
>>>> https://drive.google.com/file/d/1375H558kqPTdng439rvG6LuXXWPXLToo/view?usp=sharing
>>>>
>>>
>>> Also please test the following one too:
>>>
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 5ee62b95f3e5..d01c64be08e2 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -1991,7 +1991,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx
>>> *hctx, struct list_head *list,
>>> if (!needs_restart ||
>>> (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>>> blk_mq_run_hw_queue(hctx, true);
>>> - else if (needs_restart && needs_resource)
>>> + else if (needs_restart && (needs_resource ||
>>> + blk_mq_is_shared_tags(hctx->flags)))
>>> blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
>>>
>>> blk_mq_update_dispatch_busy(hctx, true);
>>>
>>
>>
>> With just this patch on top of 5.17.0, it still hangs. I've captured block debugfs log:
>> https://drive.google.com/file/d/1ic4YHxoL9RrCdy_5FNdGfh_q_J3d_Ft0/view?usp=sharing
>
> The log is similar with before, and the only difference is RESTART not
> set.
>
> Also follows another patch merged to v5.18 and it fixes io stall too, feel free to test it:
>
> 8f5fea65b06d blk-mq: avoid extending delays of active hctx from blk_mq_delay_run_hw_queues

Have you tried this patch?

We met a similar problem in our tests, and I'm pretty sure about what
happens at the point of the stall.

Our test environment: nvme with the bfq ioscheduler.

How the IO gets stalled:

1. hctx1 dispatches a rq from the bfq in-service queue and the bfqq becomes
empty; the dispatch somehow fails, the rq is inserted into hctx1->dispatch,
and a new run work is queued.

2. Another hctx tries to dispatch a rq; however, the in-service bfqq is
empty, so bfq_dispatch_request returns NULL and
blk_mq_delay_run_hw_queues is called.

3. Due to the problem described in the patch above, the run work from "hctx1"
can be stalled.

The patch above should fix this IO stall; however, it seems to me that bfq
does have a problem: the in-service bfqq doesn't expire under the following
conditions:

1. dispatched rqs don't complete
2. no new rq is issued to bfq

Thanks,
Kuai
>
>
>
> Thanks,
> Ming
>
> .
>

2022-09-01 09:18:52

by Jan Kara

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18

On Thu 01-09-22 15:02:03, Yu Kuai wrote:
> Hi, Chris
>
> On 2022/08/20 15:00, Ming Lei wrote:
> > On Fri, Aug 19, 2022 at 03:20:25PM -0400, Chris Murphy wrote:
> > >
> > >
> > > On Thu, Aug 18, 2022, at 1:24 AM, Ming Lei wrote:
> > > > On Thu, Aug 18, 2022 at 12:27:04AM -0400, Chris Murphy wrote:
> > > > >
> > > > >
> > > > > On Thu, Aug 18, 2022, at 12:18 AM, Chris Murphy wrote:
> > > > > > On Thu, Aug 18, 2022, at 12:12 AM, Chris Murphy wrote:
> > > > > > > On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:
> > > > > > >
> > > > > > > > OK, can you post the blk-mq debugfs log after you trigger it on v5.17?
> > > > >
> > > > > Same boot, 3rd log. But the load is above 300 so I kinda need to sysrq+b soon.
> > > > >
> > > > > https://drive.google.com/file/d/1375H558kqPTdng439rvG6LuXXWPXLToo/view?usp=sharing
> > > > >
> > > >
> > > > Also please test the following one too:
> > > >
> > > >
> > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > index 5ee62b95f3e5..d01c64be08e2 100644
> > > > --- a/block/blk-mq.c
> > > > +++ b/block/blk-mq.c
> > > > @@ -1991,7 +1991,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx
> > > > *hctx, struct list_head *list,
> > > > if (!needs_restart ||
> > > > (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> > > > blk_mq_run_hw_queue(hctx, true);
> > > > - else if (needs_restart && needs_resource)
> > > > + else if (needs_restart && (needs_resource ||
> > > > + blk_mq_is_shared_tags(hctx->flags)))
> > > > blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
> > > >
> > > > blk_mq_update_dispatch_busy(hctx, true);
> > > >
> > >
> > >
> > > With just this patch on top of 5.17.0, it still hangs. I've captured block debugfs log:
> > > https://drive.google.com/file/d/1ic4YHxoL9RrCdy_5FNdGfh_q_J3d_Ft0/view?usp=sharing
> >
> > The log is similar with before, and the only difference is RESTART not
> > set.
> >
> > Also follows another patch merged to v5.18 and it fixes io stall too, feel free to test it:
> >
> > 8f5fea65b06d blk-mq: avoid extending delays of active hctx from blk_mq_delay_run_hw_queues
>
> Have you tried this patch?
>
> We meet a similar problem in our test, and I'm pretty sure about the
> situation at the scene,
>
> Our test environment:nvme with bfq ioscheduler,
>
> How io is stalled:
>
> 1. hctx1 dispatch rq from bfq in service queue, bfqq becomes empty,
> dispatch somehow fails and rq is inserted to hctx1->dispatch, new run
> work is queued.
>
> 2. other hctx tries to dispatch rq, however, in service bfqq is
> empty, bfq_dispatch_request return NULL, thus
> blk_mq_delay_run_hw_queues is called.
>
> 3. for the problem described in above patch,run work from "hctx1"
> can be stalled.
>
> Above patch should fix this io stall, however, it seems to me bfq do
> have some problems that in service bfqq doesn't expire under following
> situation:
>
> 1. dispatched rqs don't complete
> 2. no new rq is issued to bfq

And I guess:
3. there are requests queued in other bfqqs?

Otherwise I don't see a point in expiring current bfqq because there's
nothing bfq could do anyway. But under normal circumstances the request
completion should not take so long so I don't think it would be really
worth it to implement some special mechanism for this in bfq.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-09-02 17:03:02

by Chris Murphy

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18



On Thu, Sep 1, 2022, at 3:02 AM, Yu Kuai wrote:
> Hi, Chris


>> Also follows another patch merged to v5.18 and it fixes io stall too, feel free to test it:
>>
>> 8f5fea65b06d blk-mq: avoid extending delays of active hctx from blk_mq_delay_run_hw_queues
>
> Have you tried this patch?

The problem happens on 5.18-series kernels, but takes longer to appear. Once I regain access to this setup, I can try to reproduce it on 5.18 and 5.19, and provide block debugfs logs.


--
Chris Murphy

2022-09-06 10:02:33

by Paolo Valente

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18



> On 1 Sep 2022, at 09:02, Yu Kuai <[email protected]> wrote:
>
> Hi, Chris
>
> On 2022/08/20 15:00, Ming Lei wrote:
>> On Fri, Aug 19, 2022 at 03:20:25PM -0400, Chris Murphy wrote:
>>>
>>>
>>> On Thu, Aug 18, 2022, at 1:24 AM, Ming Lei wrote:
>>>> On Thu, Aug 18, 2022 at 12:27:04AM -0400, Chris Murphy wrote:
>>>>>
>>>>>
>>>>> On Thu, Aug 18, 2022, at 12:18 AM, Chris Murphy wrote:
>>>>>> On Thu, Aug 18, 2022, at 12:12 AM, Chris Murphy wrote:
>>>>>>> On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:
>>>>>>>
>>>>>>>> OK, can you post the blk-mq debugfs log after you trigger it on v5.17?
>>>>>
>>>>> Same boot, 3rd log. But the load is above 300 so I kinda need to sysrq+b soon.
>>>>>
>>>>> https://drive.google.com/file/d/1375H558kqPTdng439rvG6LuXXWPXLToo/view?usp=sharing
>>>>>
>>>>
>>>> Also please test the following one too:
>>>>
>>>>
>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> index 5ee62b95f3e5..d01c64be08e2 100644
>>>> --- a/block/blk-mq.c
>>>> +++ b/block/blk-mq.c
>>>> @@ -1991,7 +1991,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx
>>>> *hctx, struct list_head *list,
>>>> if (!needs_restart ||
>>>> (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>>>> blk_mq_run_hw_queue(hctx, true);
>>>> - else if (needs_restart && needs_resource)
>>>> + else if (needs_restart && (needs_resource ||
>>>> + blk_mq_is_shared_tags(hctx->flags)))
>>>> blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
>>>>
>>>> blk_mq_update_dispatch_busy(hctx, true);
>>>>
>>>
>>>
>>> With just this patch on top of 5.17.0, it still hangs. I've captured block debugfs log:
>>> https://drive.google.com/file/d/1ic4YHxoL9RrCdy_5FNdGfh_q_J3d_Ft0/view?usp=sharing
>> The log is similar with before, and the only difference is RESTART not
>> set.
>> Also follows another patch merged to v5.18 and it fixes io stall too, feel free to test it:
>> 8f5fea65b06d blk-mq: avoid extending delays of active hctx from blk_mq_delay_run_hw_queues
>
> Have you tried this patch?
>
> We meet a similar problem in our test, and I'm pretty sure about the
> situation at the scene,
>
> Our test environment:nvme with bfq ioscheduler,
>
> How io is stalled:
>
> 1. hctx1 dispatch rq from bfq in service queue, bfqq becomes empty,
> dispatch somehow fails and rq is inserted to hctx1->dispatch, new run
> work is queued.
>
> 2. other hctx tries to dispatch rq, however, in service bfqq is
> empty, bfq_dispatch_request return NULL, thus
> blk_mq_delay_run_hw_queues is called.
>
> 3. for the problem described in above patch,run work from "hctx1"
> can be stalled.
>
> Above patch should fix this io stall, however, it seems to me bfq do
> have some problems that in service bfqq doesn't expire under following
> situation:
>
> 1. dispatched rqs don't complete
> 2. no new rq is issued to bfq
>

There may be one more important problem: is bfq_finish_requeue_request
eventually invoked for the failed rq? If it is not, then a memory
leak follows, because refcounting gets unavoidably unbalanced.

In contrast, if bfq_finish_requeue_request is correctly invoked, then
no stall should occur.

Thanks,
Paolo

> Thanks,
> Kuai
>> Thanks,
>> Ming
>> .