2022-08-10 17:38:10

by Chris Murphy

[permalink] [raw]
Subject: stalling IO regression in linux 5.12

CPU: Intel E5-2680 v3
RAM: 128 G
02:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] [1000:005d] (rev 02), using megaraid_sas driver
8 Disks: TOSHIBA AL13SEB600


The problem shows up as increasing load and increasing IO pressure (PSI) while actual IO drops to zero. It never happens on the 5.11 kernel series, always happens from 5.12-rc1 onward, and persists through 5.18.0. There's a new mix of behaviors with 5.19; I suspect the mm improvements in that series might be masking the problem.

The workload involves openqa, which spins up 30 qemu-kvm instances and runs a battery of tests, generating quite a lot of writes for each VM: qcow2 files, video in the form of many screenshots, and various log files. Each VM runs in its own cgroup. As the problem begins, I see increasing IO pressure and decreasing IO for each qemu instance's cgroup, as well as for the cgroups for httpd, journald, auditd, and postgresql. IO pressure climbs to ~99% and IO is literally 0.
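For reference, the numbers I'm watching come from PSI and the cgroup2 io files; a rough sketch of the checks (machine.slice is only an example path, adjust for wherever the qemu cgroups actually live):

cat /proc/pressure/io                          # system-wide IO pressure
cat /sys/fs/cgroup/machine.slice/io.pressure   # per-cgroup IO pressure
cat /sys/fs/cgroup/machine.slice/io.stat       # per-cgroup IO bytes and ops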

Left unattended, the problem eventually results in a completely unresponsive system, with no kernel messages. It reproduces in the following configurations; for the first two I provide links to full dmesg with sysrq+w output:

btrfs raid10 (native) on plain partitions [1]
btrfs single/dup on dmcrypt on mdadm raid 10 and parity raid [2]
XFS on dmcrypt on mdadm raid10 or parity raid

I've started a bisect, but for a reason I haven't figured out yet, I'm now getting compiled kernels that don't boot the hardware. The failure happens so early that the UUID for the root file system isn't found, and there isn't much to go on as to why. [3] I have tested the first and last skipped commits in the bisect log below; they successfully boot a VM but not the hardware.

Anyway, I'm kinda stuck at this point trying to narrow it down further. Any suggestions? Thanks.

[1] btrfs raid10, plain partitions
https://drive.google.com/file/d/1-oT3MX-hHYtQqI0F3SpgPjCIDXXTysLU/view?usp=sharing

[2] btrfs single/dup, dmcrypt, mdadm raid10
https://drive.google.com/file/d/1m_T3YYaEjBKUROz6dHt5_h92ZVRji9FM/view?usp=sharing

[3]
$ git bisect log
git bisect start
# status: waiting for both good and bad commits
# bad: [c03c21ba6f4e95e406a1a7b4c34ef334b977c194] Merge tag 'keys-misc-20210126' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
git bisect bad c03c21ba6f4e95e406a1a7b4c34ef334b977c194
# status: waiting for good commit(s), bad commit known
# good: [f40ddce88593482919761f74910f42f4b84c004b] Linux 5.11
git bisect good f40ddce88593482919761f74910f42f4b84c004b
# bad: [df24212a493afda0d4de42176bea10d45825e9a0] Merge tag 's390-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
git bisect bad df24212a493afda0d4de42176bea10d45825e9a0
# good: [82851fce6107d5a3e66d95aee2ae68860a732703] Merge tag 'arm-dt-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect good 82851fce6107d5a3e66d95aee2ae68860a732703
# good: [99f1a5872b706094ece117368170a92c66b2e242] Merge tag 'nfsd-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
git bisect good 99f1a5872b706094ece117368170a92c66b2e242
# bad: [9eef02334505411667a7b51a8f349f8c6c4f3b66] Merge tag 'locking-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 9eef02334505411667a7b51a8f349f8c6c4f3b66
# bad: [9820b4dca0f9c6b7ab8b4307286cdace171b724d] Merge tag 'for-5.12/drivers-2021-02-17' of git://git.kernel.dk/linux-block
git bisect bad 9820b4dca0f9c6b7ab8b4307286cdace171b724d
# good: [bd018bbaa58640da786d4289563e71c5ef3938c7] Merge tag 'for-5.12/libata-2021-02-17' of git://git.kernel.dk/linux-block
git bisect good bd018bbaa58640da786d4289563e71c5ef3938c7
# skip: [203c018079e13510f913fd0fd426370f4de0fd05] Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.12/drivers
git bisect skip 203c018079e13510f913fd0fd426370f4de0fd05
# skip: [49d1ec8573f74ff1e23df1d5092211de46baa236] block: manage bio slab cache by xarray
git bisect skip 49d1ec8573f74ff1e23df1d5092211de46baa236
# bad: [73d90386b559d6f4c3c5db5e6bb1b68aae8fd3e7] nvme: cleanup zone information initialization
git bisect bad 73d90386b559d6f4c3c5db5e6bb1b68aae8fd3e7
# skip: [71217df39dc67a0aeed83352b0d712b7892036a2] block, bfq: make waker-queue detection more robust
git bisect skip 71217df39dc67a0aeed83352b0d712b7892036a2
# bad: [8358c28a5d44bf0223a55a2334086c3707bb4185] block: fix memory leak of bvec
git bisect bad 8358c28a5d44bf0223a55a2334086c3707bb4185
# skip: [3a905c37c3510ea6d7cfcdfd0f272ba731286560] block: skip bio_check_eod for partition-remapped bios
git bisect skip 3a905c37c3510ea6d7cfcdfd0f272ba731286560
# skip: [3c337690d2ebb7a01fa13bfa59ce4911f358df42] block, bfq: avoid spurious switches to soft_rt of interactive queues
git bisect skip 3c337690d2ebb7a01fa13bfa59ce4911f358df42
# skip: [3e1a88ec96259282b9a8b45c3f1fda7a3ff4f6ea] bio: add a helper calculating nr segments to alloc
git bisect skip 3e1a88ec96259282b9a8b45c3f1fda7a3ff4f6ea
# skip: [4eb1d689045552eb966ebf25efbc3ce648797d96] blk-crypto: use bio_kmalloc in blk_crypto_clone_bio
git bisect skip 4eb1d689045552eb966ebf25efbc3ce648797d96


--
Chris Murphy


2022-08-10 17:52:21

by Josef Bacik

[permalink] [raw]
Subject: Re: stalling IO regression in linux 5.12

On Wed, Aug 10, 2022 at 12:35:34PM -0400, Chris Murphy wrote:
> CPU: Intel E5-2680 v3
> RAM: 128 G
> 02:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] [1000:005d] (rev 02), using megaraid_sas driver
> 8 Disks: TOSHIBA AL13SEB600
>
>
> The problem exhibits as increasing load, increasing IO pressure (PSI), and actual IO goes to zero. It never happens on kernel 5.11 series, and always happens after 5.12-rc1 and persists through 5.18.0. There's a new mix of behaviors with 5.19, I suspect the mm improvements in this series might be masking the problem.
>
> The workload involves openqa, which spins up 30 qemu-kvm instances, and does a bunch of tests, generating quite a lot of writes: qcow2 files, and video in the form of many screenshots, and various log files, for each VM. These VMs are each in their own cgroup. As the problem begins, I see increasing IO pressure, and decreasing IO, for each qemu instance's cgroup, and the cgroups for httpd, journald, auditd, and postgresql. IO pressure goes to nearly ~99% and IO is literally 0.
>
> The problem left unattended to progress will eventually result in a completely unresponsive system, with no kernel messages. It reproduces in the following configurations, the first two I provide links to full dmesg with sysrq+w:
>
> btrfs raid10 (native) on plain partitions [1]
> btrfs single/dup on dmcrypt on mdadm raid 10 and parity raid [2]
> XFS on dmcrypt on mdadm raid10 or parity raid
>
> I've started a bisect, but for some reason I haven't figured out I've started getting compiled kernels that don't boot the hardware. The failure is very early on such that the UUID for the root file system isn't found, but not much to go on as to why.[3] I have tested the first and last skipped commits in the bisect log below, they successfully boot a VM but not the hardware.
>
> Anyway, I'm kinda stuck at this point trying to narrow it down further. Any suggestions? Thanks.
>

I looked at the traces; btrfs is stuck waiting on IO and blk tags, which means
we've got a lot of outstanding requests and are waiting for them to finish so we
can allocate more requests.

Additionally I'm seeing a bunch of the blkg async submit things, which are used
when we have the block cgroup stuff turned on and compression enabled, so we
punt any compressed bios to a per-cgroup async thread to submit the IOs in the
appropriate block cgroup context.

This could mean we're just being overly mean and generating too many IOs, but
since the IO goes to 0 I'm more inclined to believe there's a screw-up in
whatever IO cgroup controller you're using.

To help narrow this down, can you disable any IO controller you've got enabled
and see if you can reproduce? If you can, sysrq+w is super helpful as it'll
point us in the next direction to look. Thanks,
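A rough sketch of the two steps, in case it helps (assumes cgroup2 is mounted
at /sys/fs/cgroup):

cat /sys/fs/cgroup/cgroup.controllers   # is "io" among the enabled controllers?
echo w > /proc/sysrq-trigger            # emit the sysrq+w blocked-task traces to dmesg once it stalls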

Josef

2022-08-10 18:48:08

by Chris Murphy

[permalink] [raw]
Subject: Re: stalling IO regression in linux 5.12



On Wed, Aug 10, 2022, at 1:48 PM, Josef Bacik wrote:

> To help narrow this down can you disable any IO controller you've got enabled
> and see if you can reproduce? If you can sysrq+w is super helpful as it'll
> point us in the next direction to look. Thanks,

I'm not following, sorry. I can boot with systemd.unified_cgroup_hierarchy=0 to make sure it's all off, but we're not using an IO cgroup controller specifically as far as I'm aware.

--
Chris Murphy

2022-08-10 19:38:54

by Josef Bacik

[permalink] [raw]
Subject: Re: stalling IO regression in linux 5.12

On Wed, Aug 10, 2022 at 02:42:40PM -0400, Chris Murphy wrote:
>
>
> On Wed, Aug 10, 2022, at 2:33 PM, Chris Murphy wrote:
> > On Wed, Aug 10, 2022, at 1:48 PM, Josef Bacik wrote:
> >
> >> To help narrow this down can you disable any IO controller you've got enabled
> >> and see if you can reproduce? If you can sysrq+w is super helpful as it'll
> >> point us in the next direction to look. Thanks,
> >
> > I'm not following, sorry. I can boot with
> > systemd.unified_cgroup_hierarchy=0 to make sure it's all off, but we're
> > not using an IO cgroup controllers specifically as far as I'm aware.
>
> OK yeah that won't work because the workload requires cgroup2 or it won't run.
>

Oh no, I don't want cgroups completely off, just the io controller disabled.
Figure out which cgroup your thing is being run in, and then

echo "-io" > <parent dir>/cgroup.subtree_control

If you cat /sys/fs/cgroup/<your cgroup>/cgroup.controllers and you still see
"io" in there, keep doing the above in the next highest parent directory until
io is no longer listed. Thanks,
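A sketch of what that looks like, assuming purely as an example that the qemu
instances live in scopes under /sys/fs/cgroup/machine.slice/ (my-vm.scope is a
placeholder):

cat /sys/fs/cgroup/machine.slice/my-vm.scope/cgroup.controllers   # still lists "io"?
echo "-io" > /sys/fs/cgroup/machine.slice/cgroup.subtree_control  # drop io for machine.slice's children
cat /sys/fs/cgroup/machine.slice/my-vm.scope/cgroup.controllers   # re-check, and if "io" remains go one level up:
echo "-io" > /sys/fs/cgroup/cgroup.subtree_control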

Josef

2022-08-10 20:04:54

by Chris Murphy

[permalink] [raw]
Subject: Re: stalling IO regression in linux 5.12



On Wed, Aug 10, 2022, at 2:33 PM, Chris Murphy wrote:
> On Wed, Aug 10, 2022, at 1:48 PM, Josef Bacik wrote:
>
>> To help narrow this down can you disable any IO controller you've got enabled
>> and see if you can reproduce? If you can sysrq+w is super helpful as it'll
>> point us in the next direction to look. Thanks,
>
> I'm not following, sorry. I can boot with
> systemd.unified_cgroup_hierarchy=0 to make sure it's all off, but we're
> not using an IO cgroup controllers specifically as far as I'm aware.

OK yeah that won't work because the workload requires cgroup2 or it won't run.


--
Chris Murphy

2022-08-10 20:12:33

by Chris Murphy

[permalink] [raw]
Subject: Re: stalling IO regression in linux 5.12



On Wed, Aug 10, 2022, at 2:42 PM, Chris Murphy wrote:
> On Wed, Aug 10, 2022, at 2:33 PM, Chris Murphy wrote:
>> On Wed, Aug 10, 2022, at 1:48 PM, Josef Bacik wrote:
>>
>>> To help narrow this down can you disable any IO controller you've got enabled
>>> and see if you can reproduce? If you can sysrq+w is super helpful as it'll
>>> point us in the next direction to look. Thanks,
>>
>> I'm not following, sorry. I can boot with
>> systemd.unified_cgroup_hierarchy=0 to make sure it's all off, but we're
>> not using an IO cgroup controllers specifically as far as I'm aware.
>
> OK yeah that won't work because the workload requires cgroup2 or it won't run.


Booted with cgroup_disable=io, and confirmed cat /sys/fs/cgroup/cgroup.controllers does not list io.

I'll rerun the workload now. Sometimes it reproduces fast, other times it takes a couple of hours.



--
Chris Murphy

2022-08-12 16:09:22

by Chris Murphy

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18



On Wed, Aug 10, 2022, at 3:34 PM, Chris Murphy wrote:
> Booted with cgroup_disable=io, and confirmed cat
> /sys/fs/cgroup/cgroup.controllers does not list io.

The problem still reproduces with the cgroup IO controller disabled.

On a whim, I decided to switch the IO scheduler from Fedora's default for rotating drives, bfq, to mq-deadline. The problem did not reproduce for 15+ hours, which is not 100% conclusive but probably 99% conclusive. I then switched back to bfq on all eight drives, live, while running the workload, and within 10 minutes the system cratered: all new commands just hang, load average goes to triple digits, iowait keeps increasing, IO pressure for the workload tasks hits 100%, and IO completely stalls to zero. I managed to switch only two of the drive queues back to mq-deadline before losing responsiveness in that shell, and had to issue sysrq+b...
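For the record, the live switches were just the usual sysfs knob, per drive (sda shown as an example, repeated for sdb through sdh):

cat /sys/block/sda/queue/scheduler                 # e.g. [bfq] mq-deadline none
echo mq-deadline > /sys/block/sda/queue/scheduler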

Before that I was able to capture sysrq+w and sysrq+t.
https://drive.google.com/file/d/16hdQjyBnuzzQIhiQT6fQdE0nkRQJj7EI/view?usp=sharing

I can't tell if this is a bfq bug, or if there's some negative interaction between bfq and scsi or megaraid_sas. Obviously it's rare, because otherwise people would have been falling over this much sooner. But at this point there's a strong correlation that it's bfq related, and that it's a kernel regression present from 5.12.0 through 5.18.0; I suspect it's also in 5.19.0 but is being partly masked by other improvements.



--
Chris Murphy

2022-08-12 18:01:21

by Josef Bacik

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18

On Fri, Aug 12, 2022 at 12:05 PM Chris Murphy <[email protected]> wrote:
>
>
>
> On Wed, Aug 10, 2022, at 3:34 PM, Chris Murphy wrote:
> > Booted with cgroup_disable=io, and confirmed cat
> > /sys/fs/cgroup/cgroup.controllers does not list io.
>
> The problem still reproduces with the cgroup IO controller disabled.
>
> On a whim, I decided to switch the IO scheduler from Fedora's default bfq for rotating drives to mq-deadline. The problem does not reproduce for 15+ hours, which is not 100% conclusive but probably 99% conclusive. I then switched live while running the workload to bfq on all eight drives, and within 10 minutes the system cratered, all new commands just hang. Load average goes to triple digits, i/o wait increasing, i/o pressure for the workload tasks to 100%, and IO completely stalls to zero. I was able to switch only two of the drive queues back to mq-deadline and then lost responsivness in that shell and had to issue sysrq+b...
>
> Before that I was able to extra sysrq+w and sysrq+t.
> https://drive.google.com/file/d/16hdQjyBnuzzQIhiQT6fQdE0nkRQJj7EI/view?usp=sharing
>
> I can't tell if this is a bfq bug, or if there's some negative interaction between bfq and scsi or megaraid_sas. Obviously it's rare because otherwise people would have been falling over this much sooner. But at this point there's strong correlation that it's bfq related and is a kernel regression that's been around since 5.12.0 through 5.18.0, and I suspect also 5.19.0 but it's being partly masked by other improvements.

This matches observations we've had internally (inside Facebook) as
well as my continuous integration performance testing. It should
probably be looked into by the BFQ guys, as it was working previously.
Thanks,

Josef

2022-08-12 18:24:34

by Jens Axboe

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18

On 8/12/22 11:59 AM, Josef Bacik wrote:
> On Fri, Aug 12, 2022 at 12:05 PM Chris Murphy <[email protected]> wrote:
>>
>>
>>
>> On Wed, Aug 10, 2022, at 3:34 PM, Chris Murphy wrote:
>>> Booted with cgroup_disable=io, and confirmed cat
>>> /sys/fs/cgroup/cgroup.controllers does not list io.
>>
>> The problem still reproduces with the cgroup IO controller disabled.
>>
>> On a whim, I decided to switch the IO scheduler from Fedora's default bfq for rotating drives to mq-deadline. The problem does not reproduce for 15+ hours, which is not 100% conclusive but probably 99% conclusive. I then switched live while running the workload to bfq on all eight drives, and within 10 minutes the system cratered, all new commands just hang. Load average goes to triple digits, i/o wait increasing, i/o pressure for the workload tasks to 100%, and IO completely stalls to zero. I was able to switch only two of the drive queues back to mq-deadline and then lost responsivness in that shell and had to issue sysrq+b...
>>
>> Before that I was able to extra sysrq+w and sysrq+t.
>> https://drive.google.com/file/d/16hdQjyBnuzzQIhiQT6fQdE0nkRQJj7EI/view?usp=sharing
>>
>> I can't tell if this is a bfq bug, or if there's some negative interaction between bfq and scsi or megaraid_sas. Obviously it's rare because otherwise people would have been falling over this much sooner. But at this point there's strong correlation that it's bfq related and is a kernel regression that's been around since 5.12.0 through 5.18.0, and I suspect also 5.19.0 but it's being partly masked by other improvements.
>
> This matches observations we've had internally (inside Facebook) as
> well as my continual integration performance testing. It should
> probably be looked into by the BFQ guys as it was working previously.
> Thanks,

5.12 has a few BFQ changes:

Jan Kara:
bfq: Avoid false bfq queue merging
bfq: Use 'ttime' local variable
bfq: Use only idle IO periods for think time calculations

Jia Cheng Hu
block, bfq: set next_rq to waker_bfqq->next_rq in waker injection

Paolo Valente
block, bfq: use half slice_idle as a threshold to check short ttime
block, bfq: increase time window for waker detection
block, bfq: do not raise non-default weights
block, bfq: avoid spurious switches to soft_rt of interactive queues
block, bfq: do not expire a queue when it is the only busy one
block, bfq: replace mechanism for evaluating I/O intensity
block, bfq: re-evaluate convenience of I/O plugging on rq arrivals
block, bfq: fix switch back from soft-rt weitgh-raising
block, bfq: save also weight-raised service on queue merging
block, bfq: save also injection state on queue merging
block, bfq: make waker-queue detection more robust

Might be worth trying to revert those from 5.12 to see if they are
causing the issue? Jan, Paolo - does this ring any bells?

--
Jens Axboe

2022-08-14 20:49:15

by Chris Murphy

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18



On Fri, Aug 12, 2022, at 2:02 PM, Jens Axboe wrote:
> On 8/12/22 11:59 AM, Josef Bacik wrote:
>> On Fri, Aug 12, 2022 at 12:05 PM Chris Murphy <[email protected]> wrote:
>>>
>>>
>>>
>>> On Wed, Aug 10, 2022, at 3:34 PM, Chris Murphy wrote:
>>>> Booted with cgroup_disable=io, and confirmed cat
>>>> /sys/fs/cgroup/cgroup.controllers does not list io.
>>>
>>> The problem still reproduces with the cgroup IO controller disabled.
>>>
>>> On a whim, I decided to switch the IO scheduler from Fedora's default bfq for rotating drives to mq-deadline. The problem does not reproduce for 15+ hours, which is not 100% conclusive but probably 99% conclusive. I then switched live while running the workload to bfq on all eight drives, and within 10 minutes the system cratered, all new commands just hang. Load average goes to triple digits, i/o wait increasing, i/o pressure for the workload tasks to 100%, and IO completely stalls to zero. I was able to switch only two of the drive queues back to mq-deadline and then lost responsivness in that shell and had to issue sysrq+b...
>>>
>>> Before that I was able to extra sysrq+w and sysrq+t.
>>> https://drive.google.com/file/d/16hdQjyBnuzzQIhiQT6fQdE0nkRQJj7EI/view?usp=sharing
>>>
>>> I can't tell if this is a bfq bug, or if there's some negative interaction between bfq and scsi or megaraid_sas. Obviously it's rare because otherwise people would have been falling over this much sooner. But at this point there's strong correlation that it's bfq related and is a kernel regression that's been around since 5.12.0 through 5.18.0, and I suspect also 5.19.0 but it's being partly masked by other improvements.
>>
>> This matches observations we've had internally (inside Facebook) as
>> well as my continual integration performance testing. It should
>> probably be looked into by the BFQ guys as it was working previously.
>> Thanks,
>
> 5.12 has a few BFQ changes:
>
> Jan Kara:
> bfq: Avoid false bfq queue merging
> bfq: Use 'ttime' local variable
> bfq: Use only idle IO periods for think time calculations
>
> Jia Cheng Hu
> block, bfq: set next_rq to waker_bfqq->next_rq in waker injection
>
> Paolo Valente
> block, bfq: use half slice_idle as a threshold to check short ttime
> block, bfq: increase time window for waker detection
> block, bfq: do not raise non-default weights
> block, bfq: avoid spurious switches to soft_rt of interactive queues
> block, bfq: do not expire a queue when it is the only busy one
> block, bfq: replace mechanism for evaluating I/O intensity
> block, bfq: re-evaluate convenience of I/O plugging on rq arrivals
> block, bfq: fix switch back from soft-rt weitgh-raising
> block, bfq: save also weight-raised service on queue merging
> block, bfq: save also injection state on queue merging
> block, bfq: make waker-queue detection more robust
>
> Might be worth trying to revert those from 5.12 to see if they are
> causing the issue? Jan, Paolo - does this ring any bells?

git log --oneline --no-merges v5.11..c03c21ba6f4e > bisect.txt

I tried checking out a33df75c6328, which is right before the first bfq commit, but that kernel won't boot the hardware.

Next I checked out v5.12, then reverted these commits in the order they were found in the bisect.txt file:

7684fbde4516 bfq: Use only idle IO periods for think time calculations
28c6def00919 bfq: Use 'ttime' local variable
41e76c85660c bfq: Avoid false bfq queue merging
>>>a5bf0a92e1b8 bfq: bfq_check_waker() should be static
71217df39dc6 block, bfq: make waker-queue detection more robust
5a5436b98d5c block, bfq: save also injection state on queue merging
e673914d52f9 block, bfq: save also weight-raised service on queue merging
d1f600fa4732 block, bfq: fix switch back from soft-rt weitgh-raising
7f1995c27b19 block, bfq: re-evaluate convenience of I/O plugging on rq arrivals
eb2fd80f9d2c block, bfq: replace mechanism for evaluating I/O intensity
>>>1a23e06cdab2 bfq: don't duplicate code for different paths
2391d13ed484 block, bfq: do not expire a queue when it is the only busy one
3c337690d2eb block, bfq: avoid spurious switches to soft_rt of interactive queues
91b896f65d32 block, bfq: do not raise non-default weights
ab1fb47e33dc block, bfq: increase time window for waker detection
d4fc3640ff36 block, bfq: set next_rq to waker_bfqq->next_rq in waker injection
b5f74ecacc31 block, bfq: use half slice_idle as a threshold to check short ttime

The two commits prefixed by >>> above were not previously mentioned by Jens, but I reverted them anyway because they showed up in the git log command.
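Roughly, the revert sequence looked like this (shas abbreviated as in the list above):

git checkout v5.12
git revert --no-edit 7684fbde4516 28c6def00919 41e76c85660c a5bf0a92e1b8 ...   # and so on through b5f74ecacc31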

OK so, within 10 minutes the problem still happens. This is the block/bfq-iosched.c resulting from the above reverts, in case anyone wants to double-check what I did:
https://drive.google.com/file/d/1ykU7MpmylJuXVobODWiiaLJk-XOiAjSt/view?usp=sharing



--
Chris Murphy

2022-08-15 11:31:16

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: stalling IO regression in linux 5.12

[TLDR: I'm adding this regression report to the list of tracked
regressions; all text from me you find below is based on a few template
paragraphs you might have encountered already in similar form.]

Hi, this is your Linux kernel regression tracker.

On 10.08.22 18:35, Chris Murphy wrote:
> CPU: Intel E5-2680 v3
> RAM: 128 G
> 02:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] [1000:005d] (rev 02), using megaraid_sas driver
> 8 Disks: TOSHIBA AL13SEB600
>
>
> The problem exhibits as increasing load, increasing IO pressure (PSI), and actual IO goes to zero. It never happens on kernel 5.11 series, and always happens after 5.12-rc1 and persists through 5.18.0. There's a new mix of behaviors with 5.19, I suspect the mm improvements in this series might be masking the problem.
>
> The workload involves openqa, which spins up 30 qemu-kvm instances, and does a bunch of tests, generating quite a lot of writes: qcow2 files, and video in the form of many screenshots, and various log files, for each VM. These VMs are each in their own cgroup. As the problem begins, I see increasing IO pressure, and decreasing IO, for each qemu instance's cgroup, and the cgroups for httpd, journald, auditd, and postgresql. IO pressure goes to nearly ~99% and IO is literally 0.
>
> The problem left unattended to progress will eventually result in a completely unresponsive system, with no kernel messages. It reproduces in the following configurations, the first two I provide links to full dmesg with sysrq+w:
>
> btrfs raid10 (native) on plain partitions [1]
> btrfs single/dup on dmcrypt on mdadm raid 10 and parity raid [2]
> XFS on dmcrypt on mdadm raid10 or parity raid
>
> I've started a bisect, but for some reason I haven't figured out I've started getting compiled kernels that don't boot the hardware. The failure is very early on such that the UUID for the root file system isn't found, but not much to go on as to why.[3] I have tested the first and last skipped commits in the bisect log below, they successfully boot a VM but not the hardware.
>
> Anyway, I'm kinda stuck at this point trying to narrow it down further. Any suggestions? Thanks.
>
> [1] btrfs raid10, plain partitions
> https://drive.google.com/file/d/1-oT3MX-hHYtQqI0F3SpgPjCIDXXTysLU/view?usp=sharing
>
> [2] btrfs single/dup, dmcrypt, mdadm raid10
> https://drive.google.com/file/d/1m_T3YYaEjBKUROz6dHt5_h92ZVRji9FM/view?usp=sharing
>
> [3]
> $ git bisect log
> git bisect start
> # status: waiting for both good and bad commits
> # bad: [c03c21ba6f4e95e406a1a7b4c34ef334b977c194] Merge tag 'keys-misc-20210126' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
> git bisect bad c03c21ba6f4e95e406a1a7b4c34ef334b977c194
> # status: waiting for good commit(s), bad commit known
> # good: [f40ddce88593482919761f74910f42f4b84c004b] Linux 5.11
> git bisect good f40ddce88593482919761f74910f42f4b84c004b
> # bad: [df24212a493afda0d4de42176bea10d45825e9a0] Merge tag 's390-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
> git bisect bad df24212a493afda0d4de42176bea10d45825e9a0
> # good: [82851fce6107d5a3e66d95aee2ae68860a732703] Merge tag 'arm-dt-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
> git bisect good 82851fce6107d5a3e66d95aee2ae68860a732703
> # good: [99f1a5872b706094ece117368170a92c66b2e242] Merge tag 'nfsd-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
> git bisect good 99f1a5872b706094ece117368170a92c66b2e242
> # bad: [9eef02334505411667a7b51a8f349f8c6c4f3b66] Merge tag 'locking-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> git bisect bad 9eef02334505411667a7b51a8f349f8c6c4f3b66
> # bad: [9820b4dca0f9c6b7ab8b4307286cdace171b724d] Merge tag 'for-5.12/drivers-2021-02-17' of git://git.kernel.dk/linux-block
> git bisect bad 9820b4dca0f9c6b7ab8b4307286cdace171b724d
> # good: [bd018bbaa58640da786d4289563e71c5ef3938c7] Merge tag 'for-5.12/libata-2021-02-17' of git://git.kernel.dk/linux-block
> git bisect good bd018bbaa58640da786d4289563e71c5ef3938c7
> # skip: [203c018079e13510f913fd0fd426370f4de0fd05] Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.12/drivers
> git bisect skip 203c018079e13510f913fd0fd426370f4de0fd05
> # skip: [49d1ec8573f74ff1e23df1d5092211de46baa236] block: manage bio slab cache by xarray
> git bisect skip 49d1ec8573f74ff1e23df1d5092211de46baa236
> # bad: [73d90386b559d6f4c3c5db5e6bb1b68aae8fd3e7] nvme: cleanup zone information initialization
> git bisect bad 73d90386b559d6f4c3c5db5e6bb1b68aae8fd3e7
> # skip: [71217df39dc67a0aeed83352b0d712b7892036a2] block, bfq: make waker-queue detection more robust
> git bisect skip 71217df39dc67a0aeed83352b0d712b7892036a2
> # bad: [8358c28a5d44bf0223a55a2334086c3707bb4185] block: fix memory leak of bvec
> git bisect bad 8358c28a5d44bf0223a55a2334086c3707bb4185
> # skip: [3a905c37c3510ea6d7cfcdfd0f272ba731286560] block: skip bio_check_eod for partition-remapped bios
> git bisect skip 3a905c37c3510ea6d7cfcdfd0f272ba731286560
> # skip: [3c337690d2ebb7a01fa13bfa59ce4911f358df42] block, bfq: avoid spurious switches to soft_rt of interactive queues
> git bisect skip 3c337690d2ebb7a01fa13bfa59ce4911f358df42
> # skip: [3e1a88ec96259282b9a8b45c3f1fda7a3ff4f6ea] bio: add a helper calculating nr segments to alloc
> git bisect skip 3e1a88ec96259282b9a8b45c3f1fda7a3ff4f6ea
> # skip: [4eb1d689045552eb966ebf25efbc3ce648797d96] blk-crypto: use bio_kmalloc in blk_crypto_clone_bio
> git bisect skip 4eb1d689045552eb966ebf25efbc3ce648797d96

Thanks for the report. To be sure below issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, my Linux kernel regression
tracking bot:

#regzbot ^introduced v5.11..v5.12-rc1
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it is already being
discussed somewhere else? It was already fixed? You want to clarify when
the regression started to happen? Or point out that I got the title or
something else totally wrong? Then just reply -- ideally while also
telling regzbot about it, as explained here:
https://linux-regtracking.leemhuis.info/tracked-regression/

Reminder for developers: When fixing the issue, add 'Link:' tags
pointing to the report (the mail this one replies to), as explained in
the Linux kernel's documentation; the webpage above explains why this is
important for tracked regressions.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply; it's in everyone's interest to set the public record straight.

2022-08-16 15:02:51

by Chris Murphy

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18



On Sun, Aug 14, 2022, at 4:28 PM, Chris Murphy wrote:
> On Fri, Aug 12, 2022, at 2:02 PM, Jens Axboe wrote:
>> Might be worth trying to revert those from 5.12 to see if they are
>> causing the issue? Jan, Paolo - does this ring any bells?
>
> git log --oneline --no-merges v5.11..c03c21ba6f4e > bisect.txt
>
> I tried checking out a33df75c6328, which is right before the first bfq
> commit, but that kernel won't boot the hardware.
>
> Next I checked out v5.12, then reverted these commits in order (that
> they were found in the bisect.txt file):
>
> 7684fbde4516 bfq: Use only idle IO periods for think time calculations
> 28c6def00919 bfq: Use 'ttime' local variable
> 41e76c85660c bfq: Avoid false bfq queue merging
>>>>a5bf0a92e1b8 bfq: bfq_check_waker() should be static
> 71217df39dc6 block, bfq: make waker-queue detection more robust
> 5a5436b98d5c block, bfq: save also injection state on queue merging
> e673914d52f9 block, bfq: save also weight-raised service on queue merging
> d1f600fa4732 block, bfq: fix switch back from soft-rt weitgh-raising
> 7f1995c27b19 block, bfq: re-evaluate convenience of I/O plugging on rq arrivals
> eb2fd80f9d2c block, bfq: replace mechanism for evaluating I/O intensity
>>>>1a23e06cdab2 bfq: don't duplicate code for different paths
> 2391d13ed484 block, bfq: do not expire a queue when it is the only busy
> one
> 3c337690d2eb block, bfq: avoid spurious switches to soft_rt of
> interactive queues
> 91b896f65d32 block, bfq: do not raise non-default weights
> ab1fb47e33dc block, bfq: increase time window for waker detection
> d4fc3640ff36 block, bfq: set next_rq to waker_bfqq->next_rq in waker
> injection
> b5f74ecacc31 block, bfq: use half slice_idle as a threshold to check
> short ttime
>
> The two commits prefixed by >>> above were not previously mentioned by
> Jens, but I reverted them anyway because they showed up in the git log
> command.
>
> OK so, within 10 minutes the problem does happen still. This is
> block/bfq-iosched.c resulting from the above reverts, in case anyone
> wants to double check what I did:
> https://drive.google.com/file/d/1ykU7MpmylJuXVobODWiiaLJk-XOiAjSt/view?usp=sharing

Any suggestions for further testing? I could try going farther down the bisect.txt list. The problem is that if the hardware falls over on an unbootable kernel, I have to bug someone with LOM access, and that's a limited resource.


--
Chris Murphy

2022-08-16 15:53:01

by Nikolay Borisov

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18



On 16.08.22 г. 17:22 ч., Chris Murphy wrote:
>
>
> On Sun, Aug 14, 2022, at 4:28 PM, Chris Murphy wrote:
>> On Fri, Aug 12, 2022, at 2:02 PM, Jens Axboe wrote:
>>> Might be worth trying to revert those from 5.12 to see if they are
>>> causing the issue? Jan, Paolo - does this ring any bells?
>>
>> git log --oneline --no-merges v5.11..c03c21ba6f4e > bisect.txt
>>
>> I tried checking out a33df75c6328, which is right before the first bfq
>> commit, but that kernel won't boot the hardware.
>>
>> Next I checked out v5.12, then reverted these commits in order (that
>> they were found in the bisect.txt file):
>>
>> 7684fbde4516 bfq: Use only idle IO periods for think time calculations
>> 28c6def00919 bfq: Use 'ttime' local variable
>> 41e76c85660c bfq: Avoid false bfq queue merging
>>>>> a5bf0a92e1b8 bfq: bfq_check_waker() should be static
>> 71217df39dc6 block, bfq: make waker-queue detection more robust
>> 5a5436b98d5c block, bfq: save also injection state on queue merging
>> e673914d52f9 block, bfq: save also weight-raised service on queue merging
>> d1f600fa4732 block, bfq: fix switch back from soft-rt weitgh-raising
>> 7f1995c27b19 block, bfq: re-evaluate convenience of I/O plugging on rq arrivals
>> eb2fd80f9d2c block, bfq: replace mechanism for evaluating I/O intensity
>>>>> 1a23e06cdab2 bfq: don't duplicate code for different paths
>> 2391d13ed484 block, bfq: do not expire a queue when it is the only busy
>> one
>> 3c337690d2eb block, bfq: avoid spurious switches to soft_rt of
>> interactive queues
>> 91b896f65d32 block, bfq: do not raise non-default weights
>> ab1fb47e33dc block, bfq: increase time window for waker detection
>> d4fc3640ff36 block, bfq: set next_rq to waker_bfqq->next_rq in waker
>> injection
>> b5f74ecacc31 block, bfq: use half slice_idle as a threshold to check
>> short ttime
>>
>> The two commits prefixed by >>> above were not previously mentioned by
>> Jens, but I reverted them anyway because they showed up in the git log
>> command.
>>
>> OK so, within 10 minutes the problem does happen still. This is
>> block/bfq-iosched.c resulting from the above reverts, in case anyone
>> wants to double check what I did:
>> https://drive.google.com/file/d/1ykU7MpmylJuXVobODWiiaLJk-XOiAjSt/view?usp=sharing
>
> Any suggestions for further testing? I could try go down farther in the bisect.txt list. The problem is if the hardware falls over on an unbootable kernel, I have to bug someone with LOM access. That's a limited resource.
>
>

How about changing the scheduler to either mq-deadline or noop, just to see
if this is also reproducible with a different scheduler? I guess noop
would imply the blk cgroup controller is going to be disabled.

2022-08-16 16:10:42

by Chris Murphy

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18



On Tue, Aug 16, 2022, at 11:25 AM, Nikolay Borisov wrote:
> On 16.08.22 г. 17:22 ч., Chris Murphy wrote:
>>
>>
>> On Sun, Aug 14, 2022, at 4:28 PM, Chris Murphy wrote:
>>> On Fri, Aug 12, 2022, at 2:02 PM, Jens Axboe wrote:
>>>> Might be worth trying to revert those from 5.12 to see if they are
>>>> causing the issue? Jan, Paolo - does this ring any bells?
>>>
>>> git log --oneline --no-merges v5.11..c03c21ba6f4e > bisect.txt
>>>
>>> I tried checking out a33df75c6328, which is right before the first bfq
>>> commit, but that kernel won't boot the hardware.
>>>
>>> Next I checked out v5.12, then reverted these commits in order (that
>>> they were found in the bisect.txt file):
>>>
>>> 7684fbde4516 bfq: Use only idle IO periods for think time calculations
>>> 28c6def00919 bfq: Use 'ttime' local variable
>>> 41e76c85660c bfq: Avoid false bfq queue merging
>>>>>> a5bf0a92e1b8 bfq: bfq_check_waker() should be static
>>> 71217df39dc6 block, bfq: make waker-queue detection more robust
>>> 5a5436b98d5c block, bfq: save also injection state on queue merging
>>> e673914d52f9 block, bfq: save also weight-raised service on queue merging
>>> d1f600fa4732 block, bfq: fix switch back from soft-rt weitgh-raising
>>> 7f1995c27b19 block, bfq: re-evaluate convenience of I/O plugging on rq arrivals
>>> eb2fd80f9d2c block, bfq: replace mechanism for evaluating I/O intensity
>>>>>> 1a23e06cdab2 bfq: don't duplicate code for different paths
>>> 2391d13ed484 block, bfq: do not expire a queue when it is the only busy
>>> one
>>> 3c337690d2eb block, bfq: avoid spurious switches to soft_rt of
>>> interactive queues
>>> 91b896f65d32 block, bfq: do not raise non-default weights
>>> ab1fb47e33dc block, bfq: increase time window for waker detection
>>> d4fc3640ff36 block, bfq: set next_rq to waker_bfqq->next_rq in waker
>>> injection
>>> b5f74ecacc31 block, bfq: use half slice_idle as a threshold to check
>>> short ttime
>>>
>>> The two commits prefixed by >>> above were not previously mentioned by
>>> Jens, but I reverted them anyway because they showed up in the git log
>>> command.
>>>
>>> OK so, within 10 minutes the problem does happen still. This is
>>> block/bfq-iosched.c resulting from the above reverts, in case anyone
>>> wants to double check what I did:
>>> https://drive.google.com/file/d/1ykU7MpmylJuXVobODWiiaLJk-XOiAjSt/view?usp=sharing
>>
>> Any suggestions for further testing? I could try go down farther in the bisect.txt list. The problem is if the hardware falls over on an unbootable kernel, I have to bug someone with LOM access. That's a limited resource.
>>
>>
>
> How about changing the scheduler either mq-deadline or noop, just to see
> if this is also reproducible with a different scheduler. I guess noop
> would imply the blk cgroup controller is going to be disabled

I already reported on that: it always happens with bfq within an hour or less. It doesn't happen with mq-deadline for ~25+ hours. It does happen with bfq with the above patches reverted. It does happen with cgroup_disable=io set.

Sounds to me like it's something bfq depends on that is somehow being perturbed in a way mq-deadline is not affected by, and that changed between 5.11 and 5.12. I have no idea what's underneath bfq that matches this description.

--
Chris Murphy

2022-08-17 10:10:12

by Holger Hoffstätte

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18

On 2022-08-16 17:34, Chris Murphy wrote:
>
> On Tue, Aug 16, 2022, at 11:25 AM, Nikolay Borisov wrote:
>> How about changing the scheduler either mq-deadline or noop, just
>> to see if this is also reproducible with a different scheduler. I
>> guess noop would imply the blk cgroup controller is going to be
>> disabled
>
> I already reported on that: always happens with bfq within an hour or
> less. Doesn't happen with mq-deadline for ~25+ hours. Does happen
> with bfq with the above patches removed. Does happen with
> cgroup.disabled=io set.
>
> Sounds to me like it's something bfq depends on and is somehow
> becoming perturbed in a way that mq-deadline does not, and has
> changed between 5.11 and 5.12. I have no idea what's under bfq that
> matches this description.

Chris, just a shot in the dark but can you try the patch from

https://lore.kernel.org/linux-block/[email protected]/

on top of something more recent than 5.12? Ideally 5.19 where it applies
cleanly.

No guarantees, I just remembered this patch and your problem sounds like
a lost wakeup. Maybe BFQ just drives the sbitmap in a way that triggers the
symptom.

-h

2022-08-17 12:01:31

by Jan Kara

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18

On Wed 17-08-22 11:52:54, Holger Hoffstätte wrote:
> On 2022-08-16 17:34, Chris Murphy wrote:
> >
> > On Tue, Aug 16, 2022, at 11:25 AM, Nikolay Borisov wrote:
> > > How about changing the scheduler either mq-deadline or noop, just
> > > to see if this is also reproducible with a different scheduler. I
> > > guess noop would imply the blk cgroup controller is going to be
> > > disabled
> >
> > I already reported on that: always happens with bfq within an hour or
> > less. Doesn't happen with mq-deadline for ~25+ hours. Does happen
> > with bfq with the above patches removed. Does happen with
> > cgroup.disabled=io set.
> >
> > Sounds to me like it's something bfq depends on and is somehow
> > becoming perturbed in a way that mq-deadline does not, and has
> > changed between 5.11 and 5.12. I have no idea what's under bfq that
> > matches this description.
>
> Chris, just a shot in the dark but can you try the patch from
>
> https://lore.kernel.org/linux-block/[email protected]/
>
> on top of something more recent than 5.12? Ideally 5.19 where it applies
> cleanly.
>
> No guarantees, I just remembered this patch and your problem sounds like
> a lost wakeup. Maybe BFQ just drives the sbitmap in a way that triggers the
> symptom.

Yes, the symptoms look similar, and it happens for devices with shared tagsets
(which megaraid_sas is one of), but that problem usually appeared when there
were lots of LUNs sharing the tagset so that the number of tags available per
LUN was rather low. Not sure that is the case here, but the patch is probably
worth a try.

Another thing worth trying is to compile the kernel without
CONFIG_BFQ_GROUP_IOSCHED. That will essentially disable cgroup support in
BFQ so we will see whether the problem may be cgroup related or not.
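A minimal sketch of one way to do that, assuming you build from an existing
.config in the kernel source tree:

scripts/config --disable BFQ_GROUP_IOSCHED
make olddefconfig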

Another interesting thing might be to dump
/sys/kernel/debug/block/<device>/hctx*/{sched_tags,sched_tags_bitmap,tags,tags_bitmap}
as the system is hanging. That should tell us whether tags are in fact in use
when processes are blocked waiting for tags.
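Something like this should capture it once things are stuck (sda is just an
example device; needs root and debugfs mounted):

grep -aH . /sys/kernel/debug/block/sda/hctx*/{sched_tags,sched_tags_bitmap,tags,tags_bitmap}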

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-08-17 12:24:43

by Chris Murphy

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18



On Wed, Aug 17, 2022, at 5:52 AM, Holger Hoffstätte wrote:
> On 2022-08-16 17:34, Chris Murphy wrote:
>>
>> On Tue, Aug 16, 2022, at 11:25 AM, Nikolay Borisov wrote:
>>> How about changing the scheduler either mq-deadline or noop, just
>>> to see if this is also reproducible with a different scheduler. I
>>> guess noop would imply the blk cgroup controller is going to be
>>> disabled
>>
>> I already reported on that: always happens with bfq within an hour or
>> less. Doesn't happen with mq-deadline for ~25+ hours. Does happen
>> with bfq with the above patches removed. Does happen with
>> cgroup.disabled=io set.
>>
>> Sounds to me like it's something bfq depends on and is somehow
>> becoming perturbed in a way that mq-deadline does not, and has
>> changed between 5.11 and 5.12. I have no idea what's under bfq that
>> matches this description.
>
> Chris, just a shot in the dark but can you try the patch from
>
> https://lore.kernel.org/linux-block/[email protected]/
>
> on top of something more recent than 5.12? Ideally 5.19 where it applies
> cleanly.

The problem doesn't reliably reproduce on 5.19. A patch for 5.12..5.18 would be much more testable.


--
Chris Murphy

2022-08-17 12:41:49

by Holger Hoffstätte

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18

On 2022-08-17 13:57, Chris Murphy wrote:
>
>
> On Wed, Aug 17, 2022, at 5:52 AM, Holger Hoffstätte wrote:
>> On 2022-08-16 17:34, Chris Murphy wrote:
>>>
>>> On Tue, Aug 16, 2022, at 11:25 AM, Nikolay Borisov wrote:
>>>> How about changing the scheduler either mq-deadline or noop, just
>>>> to see if this is also reproducible with a different scheduler. I
>>>> guess noop would imply the blk cgroup controller is going to be
>>>> disabled
>>>
>>> I already reported on that: always happens with bfq within an hour or
>>> less. Doesn't happen with mq-deadline for ~25+ hours. Does happen
>>> with bfq with the above patches removed. Does happen with
>>> cgroup.disabled=io set.
>>>
>>> Sounds to me like it's something bfq depends on and is somehow
>>> becoming perturbed in a way that mq-deadline does not, and has
>>> changed between 5.11 and 5.12. I have no idea what's under bfq that
>>> matches this description.
>>
>> Chris, just a shot in the dark but can you try the patch from
>>
>> https://lore.kernel.org/linux-block/[email protected]/
>>
>> on top of something more recent than 5.12? Ideally 5.19 where it applies
>> cleanly.
>
> The problem doesn't reliably reproduce on 5.19. A patch for 5.12..5.18 would be much more testable.

If you look at the changes to sbitmap at:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/lib/sbitmap.c

you'll find that they are relatively recent, so Yukai's patch will probably also apply
to 5.18 - I don't know. Also look at the most recent commit which mentions
"Checking free bits when setting the target bits. Otherwise, it may reuse the busying bits."

Reusing the busy bits sounds "not great" either and (AFAIU) may also be a cause for
lost wakeups, but I'm sure Jan and Ming know all that better than me.
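A quick way to see those changes for yourself (the version range is just an example):

git log --oneline v5.12..v5.19 -- lib/sbitmap.c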

Jan's suggestion re. disabling BFQ cgroup support is probably the easiest
thing to try first. What you're observing may not have a single root cause, and
even if it does, it might not be where we suspect.

-h

2022-08-17 12:41:53

by Ming Lei

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18

Hello Chris,

On Tue, Aug 16, 2022 at 11:35 PM Chris Murphy <[email protected]> wrote:
>
>
>
...
>
> I already reported on that: always happens with bfq within an hour or less. Doesn't happen with mq-deadline for ~25+ hours. Does happen with bfq with the above patches removed. Does happen with cgroup.disabled=io set.
>
> Sounds to me like it's something bfq depends on and is somehow becoming perturbed in a way that mq-deadline does not, and has changed between 5.11 and 5.12. I have no idea what's under bfq that matches this description.
>

The blk-mq debugfs log is usually helpful for IO stall issues; care to post
it:

(cd /sys/kernel/debug/block/$disk && find . -type f -exec grep -aH . {} \;)
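For example, to capture one log per disk while the stall is in progress
(assuming sda through sdh as in your setup):

for d in sd{a..h}; do
  (cd /sys/kernel/debug/block/$d && find . -type f -exec grep -aH . {} \;) > /tmp/blkmq-$d.txt
done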

Thanks,
Ming

2022-08-17 14:50:49

by Chris Murphy

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18



On Wed, Aug 17, 2022, at 8:06 AM, Ming Lei wrote:

> blk-mq debugfs log is usually helpful for io stall issue, care to post
> the blk-mq debugfs log:
>
> (cd /sys/kernel/debug/block/$disk && find . -type f -exec grep -aH . {} \;)

This is only sda
https://drive.google.com/file/d/1aAld-kXb3RUiv_ShAvD_AGAFDRS03Lr0/view?usp=sharing

This is all the block devices
https://drive.google.com/file/d/1iHqRuoz8ZzvkNcMtkV3Ep7h5Uof7sTKw/view?usp=sharing

--
Chris Murphy

2022-08-17 14:51:48

by Chris Murphy

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18



On Wed, Aug 17, 2022, at 7:49 AM, Jan Kara wrote:

> Another thing worth trying is to compile the kernel without
> CONFIG_BFQ_GROUP_IOSCHED. That will essentially disable cgroup support in
> BFQ so we will see whether the problem may be cgroup related or not.

Does the boot param cgroup_disable=io affect it? Because the problem still happens with that parameter. Otherwise I can build a kernel with that option disabled.

--
Chris Murphy

2022-08-17 15:34:08

by Chris Murphy

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18



On Wed, Aug 17, 2022, at 7:49 AM, Jan Kara wrote:

>
> Another thing worth trying is to compile the kernel without
> CONFIG_BFQ_GROUP_IOSCHED. That will essentially disable cgroup support in
> BFQ so we will see whether the problem may be cgroup related or not.

The problem happens with a 5.12.0 kernel built without CONFIG_BFQ_GROUP_IOSCHED.


--
Chris Murphy

2022-08-17 15:54:17

by Ming Lei

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18

On Wed, Aug 17, 2022 at 10:34:38AM -0400, Chris Murphy wrote:
>
>
> On Wed, Aug 17, 2022, at 8:06 AM, Ming Lei wrote:
>
> > blk-mq debugfs log is usually helpful for io stall issue, care to post
> > the blk-mq debugfs log:
> >
> > (cd /sys/kernel/debug/block/$disk && find . -type f -exec grep -aH . {} \;)
>
> This is only sda
> https://drive.google.com/file/d/1aAld-kXb3RUiv_ShAvD_AGAFDRS03Lr0/view?usp=sharing

From the log, there aren't any in-flight IO requests.

So please confirm that it was collected after the IO stall was triggered.

If yes, the issue may not be related to BFQ, and should instead be related
to the blk-cgroup code.


Thanks,
Ming

2022-08-17 17:45:40

by Jan Kara

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18

On Wed 17-08-22 11:09:26, Chris Murphy wrote:
>
>
> On Wed, Aug 17, 2022, at 7:49 AM, Jan Kara wrote:
>
> >
> > Another thing worth trying is to compile the kernel without
> > CONFIG_BFQ_GROUP_IOSCHED. That will essentially disable cgroup support in
> > BFQ so we will see whether the problem may be cgroup related or not.
>
> The problem happens with a 5.12.0 kernel built without
> CONFIG_BFQ_GROUP_IOSCHED.

Thanks for testing! Just to answer your previous question: this is
different from cgroup_disable=io because BFQ takes different code paths. So
this makes it even less likely this is some obscure BFQ bug. Why BFQ could
behave differently here from mq-deadline is that it artificially reduces the
device queue depth (it sets shallow_depth when allocating new tags), and maybe
that triggers some bug in request tag allocation.

BTW, are you sure the first problematic kernel is 5.12? Support for
shared tagsets was added to the megaraid_sas driver in 5.11 (5.11-rc3 in
particular - commit 81e7eb5bf08f3 ("Revert "Revert "scsi: megaraid_sas:
Added support for shared host tagset for cpuhotplug"")), and that is one
candidate I'd expect to start triggering issues. That may be an
interesting thing to try: can you boot with the
"megaraid_sas.host_tagset_enable=0" kernel option and see whether the
issue reproduces?
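If the module parameter is exposed read-only (it should be), you can also
check the current value at runtime before rebooting with that option on the
kernel command line:

cat /sys/module/megaraid_sas/parameters/host_tagset_enable   # 1 = shared host tagset in use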

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-08-17 19:03:15

by Chris Murphy

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18



On Wed, Aug 17, 2022, at 5:52 AM, Holger Hoffstätte wrote:

> Chris, just a shot in the dark but can you try the patch from
>
> https://lore.kernel.org/linux-block/[email protected]/
>
> on top of something more recent than 5.12? Ideally 5.19 where it applies
> cleanly.


This patch applies cleanly on 5.12.0. I can try newer kernels later, but since the problem reproduces so easily on 5.12 and first appeared there, I'm sticking with it for now. (Ultimately we'd prefer to be on the 5.19 series.)

Let me know if I should try it still.


--
Chris Murphy

2022-08-17 19:08:11

by Holger Hoffstätte

[permalink] [raw]
Subject: Re: stalling IO regression since linux 5.12, through 5.18

On 2022-08-17 20:16, Chris Murphy wrote:
>
>
> On Wed, Aug 17, 2022, at 5:52 AM, Holger Hoffstätte wrote:
>
>> Chris, just a shot in the dark but can you try the patch from
>>
>> https://lore.kernel.org/linux-block/[email protected]/
>>
>> on top of something more recent than 5.12? Ideally 5.19 where it applies
>> cleanly.
>
>
> This patch applies cleanly on 5.12.0. I can try newer kernels later, but as the problem so easily reproduces with 5.12 and the problem first appeared there, is why I'm sticking with it. (For sure we prefer to be on 5.19 series.)
>
> Let me know if I should try it still.

I just started running it on 5.19.2 to see if it breaks anything;
no issues so far, but then again I didn't have any problems to begin with,
only do peasant I/O load, and have no MegaRAID.
However, if it applies *and builds* on 5.12 I'd just go ahead and see what
catches fire. But you also need to try the megaraid setting, otherwise we
won't be able to tell whether that is really a contributing factor,
or indeed the other commit that Jan identified.
Unfortunately 5.12 is a bit old already, and most of the other important
fixes to sbitmap.c probably won't apply due to other blk-mq changes.

In any case the plot thickens, so keep going. :)

-h