Hello.
After a v5.13.2 massive update I encountered this:
```
[19231.556665] ------------[ cut here ]------------
[19231.556674] WARNING: CPU: 20 PID: 502 at drivers/nvme/host/core.c:850 nvme_setup_discard+0x188/0x1f0
...
[19231.556776] CPU: 20 PID: 502 Comm: kworker/20:1H Not tainted 5.13.2 #1
[19231.556780] Hardware name: ASUS System Product Name/Pro WS X570-ACE, BIOS 3601 05/26/2021
[19231.556784] Workqueue: kblockd blk_mq_run_work_fn
[19231.556789] RIP: 0010:nvme_setup_discard+0x188/0x1f0
[19231.556794] Code: 49 8b 44 24 10 4c 8b 90 40 0b 00 00 4c 2b 15 8f 09 d8 00 49 c1 fa 06 49 c1 e2 0c 4c 03 15 90 09 d8 00 4d 89 d0 e9 b9 fe ff ff <0f> 0b b8 00 00 00 80 49 01 c2 72 52 48 c7 c0 00 00 00 80 48 2b 05
[19231.556798] RSP: 0018:ffffaed2416efc00 EFLAGS: 00010287
[19231.556802] RAX: ffff8e67fb580000 RBX: ffff8e640bbe5240 RCX: 0000000000000020
[19231.556805] RDX: ffff8e67fb580000 RSI: 000000000000001f RDI: 0000000000000000
[19231.556808] RBP: ffff8e640bbe5388 R08: ffff8e677b580000 R09: 0000000008abb800
[19231.556811] R10: ffff8e677b580000 R11: 0000000000000400 R12: ffff8e6405999c00
[19231.556814] R13: 000000000000001f R14: ffff8e6405693000 R15: ffff8e640bbaf800
[19231.556816] FS:  0000000000000000(0000) GS:ffff8e6b0ef00000(0000) knlGS:0000000000000000
[19231.556819] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19231.556822] CR2: ffff888c76ece000 CR3: 000000047a184000 CR4: 0000000000350ee0
[19231.556825] Call Trace:
[19231.556830] nvme_setup_cmd+0x2d0/0x670
[19231.556834] nvme_queue_rq+0x79/0xc90
[19231.556837] ? __sbitmap_get_word+0x30/0x80
[19231.556842] ? sbitmap_get+0x85/0x180
[19231.556846] blk_mq_dispatch_rq_list+0x15c/0x810
[19231.556851] ? list_sort+0x21d/0x2f0
[19231.556856] __blk_mq_do_dispatch_sched+0x196/0x320
[19231.556860] __blk_mq_sched_dispatch_requests+0x14d/0x190
[19231.556864] blk_mq_sched_dispatch_requests+0x2f/0x60
[19231.556867] blk_mq_run_work_fn+0x43/0xc0
[19231.556871] process_one_work+0x24e/0x430
[19231.556876] worker_thread+0x54/0x4d0
[19231.556880] ? process_one_work+0x430/0x430
[19231.556883] kthread+0x1b3/0x1e0
[19231.556886] ? __kthread_init_worker+0x50/0x50
[19231.556889] ret_from_fork+0x22/0x30
[19231.556895] ---[ end trace d9abdf019a56b4c7 ]---
[19231.556906] blk_update_request: I/O error, dev nvme1n1, sector 632935424 op 0x3:(DISCARD) flags 0x0 phys_seg 31 prio class 0
```
or, in code:
```
850 if (WARN_ON_ONCE(n != segments)) {
851 if (virt_to_page(range) == ns->ctrl->discard_page)
852 clear_bit_unlock(0, &ns->ctrl->discard_page_busy);
853 else
854 kfree(range);
855 return BLK_STS_IOERR;
856 }
```
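For context, here is a condensed sketch of the loop right above that check (paraphrased, not a verbatim copy of core.c): the range buffer is sized for `segments` entries, while `n` counts every discard bio in the request, so the warning means the request carried a different number of discard bios than its reported segment count.
```
/* Condensed sketch of nvme_setup_discard() around the warning (paraphrased) */
unsigned short segments = blk_rq_nr_discard_segments(req), n = 0;
struct nvme_dsm_range *range;	/* buffer sized for 'segments' entries */
struct bio *bio;

__rq_for_each_bio(bio, req) {
	u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
	u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;

	if (n < segments) {		/* only the first 'segments' bios are filled in */
		range[n].cattr = cpu_to_le32(0);
		range[n].nlb = cpu_to_le32(nlb);
		range[n].slba = cpu_to_le64(slba);
	}
	n++;				/* ...but every bio is counted */
}

if (WARN_ON_ONCE(n != segments))	/* bio count diverged from segment count */
	return BLK_STS_IOERR;		/* after releasing the range buffer */
```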
BFQ scheduler is in use.
Something similar was already fixed by commit a958937ff166fc60d1c3a721036f6ff41bfa2821,
but I do not have a multipath device here; it's just 2 NVMe SSDs in a soft
RAID10 with LUKS and LVM.
Any idea what this might mean? v5.13.2 brought some commits into the stable tree
that are, as I still suspect, causing unreproducible panics [1] [2].
Previously, I dropped that extra stuff from my kernel build and had no issues.
This time I do not have any extra commits in the block layer either, only those
that are in v5.13.2.
Thanks.
[1] https://lore.kernel.org/linux-block/3533087.dJKXTdksHR@spock/
[2] https://lore.kernel.org/linux-block/2957867.CS06ZTPI5V@spock/
--
Oleksandr Natalenko (post-factum)
Hello.
On čtvrtek 15. července 2021 16:19:31 CEST Greg Kroah-Hartman wrote:
> Can you run 'git bisect' to find the offending patch?
Of course, as soon as I reproduce this reliably.
Thanks.
--
Oleksandr Natalenko (post-factum)
On Thu, Jul 15, 2021 at 03:56:38PM +0200, Oleksandr Natalenko wrote:
> Hello.
>
> After a v5.13.2 massive update I encountered this:
>
> [...]
Can you run 'git bisect' to find the offending patch?
thanks,
greg k-h
On Thu, 2021-07-15 at 16:19 +0200, Greg Kroah-Hartman wrote:
> On Thu, Jul 15, 2021 at 03:56:38PM +0200, Oleksandr Natalenko wrote:
> > Hello.
> >
> > After a v5.13.2 massive update I encountered this:
> >
> > [...]
>
> Can you run 'git bisect' to find the offending patch?
>
> thanks,
>
> greg k-h
>
Hello
[root@ml150 ~]# uname -a
Linux ml150 5.14.0-rc1+ #1 SMP Thu Jul 15 16:41:08 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
[root@ml150 ~]# nvme list
Node             SN                   Model                 Namespace  Usage                    Format         FW Rev
---------------- -------------------- --------------------- ---------- ------------------------ -------------- --------
/dev/nvme0n1     CVCQ536300C9400AGN   INTEL SSDPEDMW400G4   1          400.09 GB / 400.09 GB    512 B + 0 B    8EV10135
/dev/nvme1n1     CVFT7383000W400BGN   INTEL SSDPEDMD400G4   1          400.09 GB / 400.09 GB    512 B + 0 B    8DV10171
FWIW, I built 5.14 and I have 2 NVMe devices, and I am not seeing this even
when using them to build the kernel on.
Regards
Laurence Oberman
On Thu, Jul 15, 2021 at 03:56:38PM +0200, Oleksandr Natalenko wrote:
> Hello.
>
> After a v5.13.2 massive update I encountered this:
>
> [...]
>
> or, in code:
>
> ```
> 850 if (WARN_ON_ONCE(n != segments)) {
What is 'n' and 'segments'?
thanks,
Ming
Hello.
On čtvrtek 15. července 2021 23:37:21 CEST Laurence Oberman wrote:
> [root@ml150 ~]# uname -a
> Linux ml150 5.14.0-rc1+ #1 SMP Thu Jul 15 16:41:08 EDT 2021 x86_64
> x86_64 x86_64 GNU/Linux
>
> [...]
>
> fwiw
>
> I built 5.14 and I have 2 nvme devices and I am not seeing this even
> using them to build the kernel on.
Thanks for trying. What about v5.13.2? Also, are you using BFQ? Is device mapper
in use?
--
Oleksandr Natalenko (post-factum)
Hello.
On pátek 16. července 2021 4:16:18 CEST Ming Lei wrote:
> > 850 if (WARN_ON_ONCE(n != segments)) {
>
> What is 'n' and 'segments'?
So, I used this change to check that:
```
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 66973bb56305..0a64f2f77daf 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -847,7 +847,8 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
 		n++;
 	}
 
-	if (WARN_ON_ONCE(n != segments)) {
+	if (unlikely(n != segments)) {
+		pr_warn("nvme_setup_discard: n != segments (%u != %u)\n", n, segments);
 		if (virt_to_page(range) == ns->ctrl->discard_page)
 			clear_bit_unlock(0, &ns->ctrl->discard_page_busy);
 		else
```
and the result is:
```
nvme_setup_discard: n != segments (3 != 2)
nvme_setup_discard: n != segments (12 != 11)
```
(those are from different boots)
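To answer the question directly: `n` is the number of discard bios walked via `__rq_for_each_bio()`, while `segments` comes from `blk_rq_nr_discard_segments()`, which simply mirrors the request's accounted segment count. A rough sketch of that helper (paraphrased from include/linux/blkdev.h, not verbatim):
```
/* Rough sketch of blk_rq_nr_discard_segments() (paraphrased): the discard
 * segment count the driver sees is just the request's nr_phys_segments, so it
 * stays correct only if every merged discard bio also bumps that field. */
static inline unsigned short blk_rq_nr_discard_segments(struct request *rq)
{
	return max_t(unsigned short, rq->nr_phys_segments, 1);
}
```
So n > segments presumably means discard bios were merged into the request without nr_phys_segments being incremented accordingly.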
Thanks.
--
Oleksandr Natalenko (post-factum)
On Fri, Jul 16, 2021 at 07:53:28AM +0200, Oleksandr Natalenko wrote:
> Hello.
>
> On pátek 16. července 2021 4:16:18 CEST Ming Lei wrote:
> > > 850 if (WARN_ON_ONCE(n != segments)) {
> >
> > What is 'n' and 'segments'?
>
> So, I used this change to check that:
>
> [...]
>
> and the result is:
>
> ```
> nvme_setup_discard: n != segments (3 != 2)
> nvme_setup_discard: n != segments (12 != 11)
> ```
Can you test the following patch?
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 727955918563..673a634eadd9 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2361,6 +2361,9 @@ static int bfq_request_merge(struct request_queue *q, struct request **req,
__rq = bfq_find_rq_fmerge(bfqd, bio, q);
if (__rq && elv_bio_merge_ok(__rq, bio)) {
*req = __rq;
+
+ if (blk_discard_mergable(__rq))
+ return ELEVATOR_DISCARD_MERGE;
return ELEVATOR_FRONT_MERGE;
}
diff --git a/block/blk-merge.c b/block/blk-merge.c
index a11b3b53717e..f8707ff7e2fc 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -705,22 +705,6 @@ static void blk_account_io_merge_request(struct request *req)
}
}
-/*
- * Two cases of handling DISCARD merge:
- * If max_discard_segments > 1, the driver takes every bio
- * as a range and send them to controller together. The ranges
- * needn't to be contiguous.
- * Otherwise, the bios/requests will be handled as same as
- * others which should be contiguous.
- */
-static inline bool blk_discard_mergable(struct request *req)
-{
- if (req_op(req) == REQ_OP_DISCARD &&
- queue_max_discard_segments(req->q) > 1)
- return true;
- return false;
-}
-
static enum elv_merge blk_try_req_merge(struct request *req,
struct request *next)
{
diff --git a/block/elevator.c b/block/elevator.c
index 52ada14cfe45..a5fe2615ec0f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -336,6 +336,9 @@ enum elv_merge elv_merge(struct request_queue *q, struct request **req,
__rq = elv_rqhash_find(q, bio->bi_iter.bi_sector);
if (__rq && elv_bio_merge_ok(__rq, bio)) {
*req = __rq;
+
+ if (blk_discard_mergable(__rq))
+ return ELEVATOR_DISCARD_MERGE;
return ELEVATOR_BACK_MERGE;
}
diff --git a/block/mq-deadline-main.c b/block/mq-deadline-main.c
index 6f612e6dc82b..294be0c0db65 100644
--- a/block/mq-deadline-main.c
+++ b/block/mq-deadline-main.c
@@ -677,6 +677,8 @@ static int dd_request_merge(struct request_queue *q, struct request **rq,
if (elv_bio_merge_ok(__rq, bio)) {
*rq = __rq;
+ if (blk_discard_mergable(__rq))
+ return ELEVATOR_DISCARD_MERGE;
return ELEVATOR_FRONT_MERGE;
}
}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3177181c4326..87f00292fd7a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1521,6 +1521,22 @@ static inline int queue_limit_discard_alignment(struct queue_limits *lim, sector
return offset << SECTOR_SHIFT;
}
+/*
+ * Two cases of handling DISCARD merge:
+ * If max_discard_segments > 1, the driver takes every bio
+ * as a range and send them to controller together. The ranges
+ * needn't to be contiguous.
+ * Otherwise, the bios/requests will be handled as same as
+ * others which should be contiguous.
+ */
+static inline bool blk_discard_mergable(struct request *req)
+{
+ if (req_op(req) == REQ_OP_DISCARD &&
+ queue_max_discard_segments(req->q) > 1)
+ return true;
+ return false;
+}
+
static inline int bdev_discard_alignment(struct block_device *bdev)
{
struct request_queue *q = bdev_get_queue(bdev);
Thanks,
Ming
Hello.
On pátek 16. července 2021 11:33:05 CEST Ming Lei wrote:
> Can you test the following patch?
Sure, building it at the moment, and will give it a try. Also please see my
comments and questions below.
>
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 727955918563..673a634eadd9 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -2361,6 +2361,9 @@ static int bfq_request_merge(struct request_queue *q,
> struct request **req, __rq = bfq_find_rq_fmerge(bfqd, bio, q);
> if (__rq && elv_bio_merge_ok(__rq, bio)) {
> *req = __rq;
> +
> + if (blk_discard_mergable(__rq))
> + return ELEVATOR_DISCARD_MERGE;
> return ELEVATOR_FRONT_MERGE;
> }
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index a11b3b53717e..f8707ff7e2fc 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -705,22 +705,6 @@ static void blk_account_io_merge_request(struct request
> *req) }
> }
>
> -/*
> - * Two cases of handling DISCARD merge:
> - * If max_discard_segments > 1, the driver takes every bio
> - * as a range and send them to controller together. The ranges
> - * needn't to be contiguous.
> - * Otherwise, the bios/requests will be handled as same as
> - * others which should be contiguous.
> - */
> -static inline bool blk_discard_mergable(struct request *req)
> -{
> - if (req_op(req) == REQ_OP_DISCARD &&
> - queue_max_discard_segments(req->q) > 1)
> - return true;
> - return false;
> -}
> -
> static enum elv_merge blk_try_req_merge(struct request *req,
> struct request *next)
> {
> diff --git a/block/elevator.c b/block/elevator.c
> index 52ada14cfe45..a5fe2615ec0f 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -336,6 +336,9 @@ enum elv_merge elv_merge(struct request_queue *q, struct
> request **req, __rq = elv_rqhash_find(q, bio->bi_iter.bi_sector);
> if (__rq && elv_bio_merge_ok(__rq, bio)) {
> *req = __rq;
> +
> + if (blk_discard_mergable(__rq))
> + return ELEVATOR_DISCARD_MERGE;
> return ELEVATOR_BACK_MERGE;
> }
>
> diff --git a/block/mq-deadline-main.c b/block/mq-deadline-main.c
> index 6f612e6dc82b..294be0c0db65 100644
> --- a/block/mq-deadline-main.c
> +++ b/block/mq-deadline-main.c
I had to adjust this against v5.13 because there's no mq-deadline-main.c, only
mq-deadline.c (due to Bart series, I assume). I hope this is fine as the patch
applies cleanly.
> @@ -677,6 +677,8 @@ static int dd_request_merge(struct request_queue *q,
> struct request **rq,
>
> if (elv_bio_merge_ok(__rq, bio)) {
> *rq = __rq;
> + if (blk_discard_mergable(__rq))
> + return ELEVATOR_DISCARD_MERGE;
> return ELEVATOR_FRONT_MERGE;
> }
> }
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 3177181c4326..87f00292fd7a 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1521,6 +1521,22 @@ static inline int
> queue_limit_discard_alignment(struct queue_limits *lim, sector return
> offset << SECTOR_SHIFT;
> }
>
> +/*
> + * Two cases of handling DISCARD merge:
> + * If max_discard_segments > 1, the driver takes every bio
> + * as a range and send them to controller together. The ranges
> + * needn't to be contiguous.
> + * Otherwise, the bios/requests will be handled as same as
> + * others which should be contiguous.
> + */
> +static inline bool blk_discard_mergable(struct request *req)
> +{
> + if (req_op(req) == REQ_OP_DISCARD &&
> + queue_max_discard_segments(req->q) > 1)
> + return true;
> + return false;
> +}
> +
> static inline int bdev_discard_alignment(struct block_device *bdev)
> {
> struct request_queue *q = bdev_get_queue(bdev);
Do I understand correctly that this will be something like:
Fixes: 2705dfb209 ("block: fix discard request merge")
?
Because as the bisection progresses, I've bumped into this commit only.
Without it the issue is not reproducible, at least so far.
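That would make sense to me. If I read block/blk-merge.c correctly, only the discard-specific merge path keeps the request's segment count in sync with the number of bios; a rough sketch of bio_attempt_discard_merge(), which is what an ELEVATOR_DISCARD_MERGE result leads to (paraphrased, not verbatim):
```
/* Rough sketch of bio_attempt_discard_merge() from block/blk-merge.c
 * (paraphrased). When the elevator reports ELEVATOR_DISCARD_MERGE, the bio is
 * chained onto the request as an extra range and nr_phys_segments is bumped,
 * so blk_rq_nr_discard_segments() keeps matching the number of bios. */
static enum bio_merge_status bio_attempt_discard_merge(struct request_queue *q,
		struct request *req, struct bio *bio)
{
	unsigned short segments = blk_rq_nr_discard_segments(req);

	if (segments >= queue_max_discard_segments(q))
		goto no_merge;
	if (blk_rq_sectors(req) + bio_sectors(bio) >
	    blk_rq_get_max_sectors(req, blk_rq_pos(req)))
		goto no_merge;

	req->biotail->bi_next = bio;
	req->biotail = bio;
	req->__data_len += bio->bi_iter.bi_size;
	req->nr_phys_segments = segments + 1;	/* keep segments == number of bios */
	return BIO_MERGE_OK;

no_merge:
	req_set_nomerge(q, req);
	return BIO_MERGE_FAILED;
}
```
A plain front/back merge of a discard bio, by contrast, appends the bio without touching nr_phys_segments on these kernels, as far as I can tell, which would match the n > segments counts reported above.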
Thanks!
--
Oleksandr Natalenko (post-factum)
On Fri, Jul 16, 2021 at 12:03:43PM +0200, Oleksandr Natalenko wrote:
> Hello.
>
> On pátek 16. července 2021 11:33:05 CEST Ming Lei wrote:
> > Can you test the following patch?
>
> Sure, building it at the moment, and will give it a try. Also please see my
> comments and questions below.
>
> >
> > [...]
> > diff --git a/block/mq-deadline-main.c b/block/mq-deadline-main.c
> > index 6f612e6dc82b..294be0c0db65 100644
> > --- a/block/mq-deadline-main.c
> > +++ b/block/mq-deadline-main.c
>
> I had to adjust this against v5.13 because there's no mq-deadline-main.c, only
> mq-deadline.c (due to Bart series, I assume). I hope this is fine as the patch
> applies cleanly.
>
> > [...]
>
> Do I understand correctly that this will be something like:
>
> Fixes: 2705dfb209 ("block: fix discard request merge")
>
> ?
>
> Because as the bisection progresses, I've bumped into this commit only.
> Without it the issue is not reproducible, at least so far.
It could be.
So can you just test v5.14-rc1?
Thanks,
Ming
On pátek 16. července 2021 12:41:52 CEST Ming Lei wrote:
> > Do I understand correctly that this will be something like:
> >
> > Fixes: 2705dfb209 ("block: fix discard request merge")
> >
> > ?
> >
> > Because as the bisection progresses, I've bumped into this commit only.
> > Without it the issue is not reproducible, at least so far.
>
> It could be.
>
> So can you just test v5.14-rc1?
Doing it right now, but I've got another issue. Why is BFQ not listed here:
```
/sys/class/block/nvme0n1/queue/scheduler:[mq-deadline] kyber none
/sys/class/block/nvme1n1/queue/scheduler:[mq-deadline] kyber none
```
?
It is a built-in, FWIW:
```
$ modinfo bfq
name: bfq
filename: (builtin)
description: MQ Budget Fair Queueing I/O Scheduler
license: GPL
file: block/bfq
author: Paolo Valente
alias: bfq-iosched
```
So far the issue is not reproducible with your patch + 5.13.2 as well as 5.14-
rc1 (but I don't have BFQ either with v5.14-rc1).
--
Oleksandr Natalenko (post-factum)
On Fri, Jul 16, 2021 at 02:56:22PM +0200, Oleksandr Natalenko wrote:
> On pátek 16. července 2021 12:41:52 CEST Ming Lei wrote:
> > > Do I understand correctly that this will be something like:
> > >
> > > Fixes: 2705dfb209 ("block: fix discard request merge")
> > >
> > > ?
> > >
> > > Because as the bisection progresses, I've bumped into this commit only.
> > > Without it the issue is not reproducible, at least so far.
> >
> > It could be.
> >
> > So can you just test v5.14-rc1?
>
> Doing it right now, but I've got another issue. Why is BFQ not listed here:
>
> ```
> /sys/class/block/nvme0n1/queue/scheduler:[mq-deadline] kyber none
> /sys/class/block/nvme1n1/queue/scheduler:[mq-deadline] kyber none
> ```
Maybe you need to check if the build is OK, I can't reproduce it in my
VM, and BFQ is still builtin:
[root@ktest-01 ~]# uname -a
Linux ktest-01 5.14.0-rc1+ #52 SMP Fri Jul 16 18:56:36 CST 2021 x86_64 x86_64 x86_64 GNU/Linux
[root@ktest-01 ~]# cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline kyber bfq
>
> ?
>
> It is a built-in, FWIW:
>
> ```
> $ modinfo bfq
> name: bfq
> filename: (builtin)
> description: MQ Budget Fair Queueing I/O Scheduler
> license: GPL
> file: block/bfq
> author: Paolo Valente
> alias: bfq-iosched
> ```
>
> So far the issue is not reproducible with your patch + 5.13.2 as well as 5.14-
> rc1 (but I don't have BFQ either with v5.14-rc1).
You have to verify it with BFQ applied, :-)
Thanks,
Ming
On sobota 17. července 2021 11:35:32 CEST Ming Lei wrote:
> Maybe you need to check if the build is OK, I can't reproduce it in my
> VM, and BFQ is still builtin:
>
> [root@ktest-01 ~]# uname -a
> Linux ktest-01 5.14.0-rc1+ #52 SMP Fri Jul 16 18:56:36 CST 2021 x86_64
> x86_64 x86_64 GNU/Linux [root@ktest-01 ~]# cat
> /sys/block/nvme0n1/queue/scheduler
> [none] mq-deadline kyber bfq
I don't think this is an issue with the build… BTW, with `initcall_debug`:
```
[ 0.902555] calling bfq_init+0x0/0x8b @ 1
[ 0.903448] initcall bfq_init+0x0/0x8b returned -28 after 507 usecs
```
-ENOSPC? Why? Also re-tested with the latest git tip, same result :(.
> > So far the issue is not reproducible with your patch + 5.13.2 as well as
> > 5.14- rc1 (but I don't have BFQ either with v5.14-rc1).
>
> You have to verify it with BFQ applied, :-)
Understandable… BTW, I'm still running v5.13.2 with your patch applied and do
not see the issue.
--
Oleksandr Natalenko (post-factum)
On sobota 17. července 2021 14:11:05 CEST Oleksandr Natalenko wrote:
> On sobota 17. července 2021 11:35:32 CEST Ming Lei wrote:
> > Maybe you need to check if the build is OK, I can't reproduce it in my
> > VM, and BFQ is still builtin:
> >
> > [root@ktest-01 ~]# uname -a
> > Linux ktest-01 5.14.0-rc1+ #52 SMP Fri Jul 16 18:56:36 CST 2021 x86_64
> > x86_64 x86_64 GNU/Linux [root@ktest-01 ~]# cat
> > /sys/block/nvme0n1/queue/scheduler
> > [none] mq-deadline kyber bfq
>
> I don't think this is an issue with the build… BTW, with `initcall_debug`:
>
> ```
> [ 0.902555] calling bfq_init+0x0/0x8b @ 1
> [ 0.903448] initcall bfq_init+0x0/0x8b returned -28 after 507 usecs
> ```
>
> -ENOSPC? Why? Also re-tested with the latest git tip, same result :(.
OK, one extra pr_info, and I see this:
```
[ 0.871180] blkcg_policy_register: BLKCG_MAX_POLS too small
[ 0.871612] blkcg_policy_register: -28
```
What does it mean please :)? The value seems to be hard-coded:
```
include/linux/blkdev.h
60:#define BLKCG_MAX_POLS 5
```
--
Oleksandr Natalenko (post-factum)
On sobota 17. července 2021 14:19:59 CEST Oleksandr Natalenko wrote:
> On sobota 17. července 2021 14:11:05 CEST Oleksandr Natalenko wrote:
> > [...]
>
> OK, one extra pr_info, and I see this:
>
> ```
> [ 0.871180] blkcg_policy_register: BLKCG_MAX_POLS too small
> [ 0.871612] blkcg_policy_register: -28
> ```
>
> What does it mean please :)? The value seems to be hard-coded:
>
> ```
> include/linux/blkdev.h
> 60:#define BLKCG_MAX_POLS 5
> ```
OK, after increasing this to 6 I've got my BFQ back. Please see [1].
[1] https://lore.kernel.org/linux-block/[email protected]/
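For reference, the -28 comes from the fixed-size policy slot array in blkcg_policy_register(); roughly (paraphrased from block/blk-cgroup.c, not verbatim):
```
/* Rough sketch of the slot check in blkcg_policy_register() (paraphrased).
 * Every built-in blkcg policy needs a free slot in a BLKCG_MAX_POLS-sized
 * array; once the array is full, the next policy to register -- here BFQ --
 * fails with -ENOSPC (-28), and bfq_init() fails with it. */
int i, ret = -ENOSPC;

for (i = 0; i < BLKCG_MAX_POLS; i++)
	if (!blkcg_policy[i])
		break;
if (i >= BLKCG_MAX_POLS) {
	pr_warn("blkcg_policy_register: BLKCG_MAX_POLS too small\n");
	goto err_unlock;
}
```
So bumping BLKCG_MAX_POLS just makes room for one more registered policy.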
--
Oleksandr Natalenko (post-factum)
On Sat, Jul 17, 2021 at 02:35:14PM +0200, Oleksandr Natalenko wrote:
> On sobota 17. července 2021 14:19:59 CEST Oleksandr Natalenko wrote:
> > [...]
>
> OK, after increasing this to 6 I've got my BFQ back. Please see [1].
>
> [1] https://lore.kernel.org/linux-block/[email protected]/
OK, after you fixed the issue in blkcg_policy_register(), can you
reproduce the discard issue on v5.14-rc1 with BFQ applied? If yes,
can you test the patch I posted previously?
Thanks,
Ming
Hello.
On pondělí 19. července 2021 3:40:40 CEST Ming Lei wrote:
> On Sat, Jul 17, 2021 at 02:35:14PM +0200, Oleksandr Natalenko wrote:
> > [...]
> OK, after you fixed the issue in blkcg_policy_register(), can you
> reproduce the discard issue on v5.14-rc1 with BFQ applied? If yes,
> can you test the patch I posted previously?
Yes, the issue is reproducible with both v5.13.2 and v5.14-rc1. I haven't
managed to reproduce it with v5.13.2+your patch. Now I will build v5.14-
rc2+your patch and test further.
Thanks.
--
Oleksandr Natalenko (post-factum)
Hello, Ming.
On pondělí 19. července 2021 8:27:29 CEST Oleksandr Natalenko wrote:
> On pondělí 19. července 2021 3:40:40 CEST Ming Lei wrote:
> > [...]
> >
> > OK, after you fixed the issue in blkcg_policy_register(), can you
> > reproduce the discard issue on v5.14-rc1 with BFQ applied? If yes,
> > can you test the patch I posted previously?
>
> Yes, the issue is reproducible with both v5.13.2 and v5.14-rc1. I haven't
> managed to reproduce it with v5.13.2+your patch. Now I will build v5.14-
> rc2+your patch and test further.
I'm still hammering v5.14-rc2 + your patch, and I cannot reproduce the issue.
Given I do not have a reliable reproducer (I'm just firing up the kernel build,
and the issue pops up eventually, sooner or later, but usually within a couple
of first tries), for how long I should hammer it for your fix to be considered
proven?
Thanks.
--
Oleksandr Natalenko (post-factum)
On Tue, Jul 20, 2021 at 11:05:29AM +0200, Oleksandr Natalenko wrote:
> Hello, Ming.
>
> On pondělí 19. července 2021 8:27:29 CEST Oleksandr Natalenko wrote:
> > On pondělí 19. července 2021 3:40:40 CEST Ming Lei wrote:
> > > [...]
> > >
> > > OK, after you fixed the issue in blkcg_policy_register(), can you
> > > reproduce the discard issue on v5.14-rc1 with BFQ applied? If yes,
> > > can you test the patch I posted previously?
> >
> > Yes, the issue is reproducible with both v5.13.2 and v5.14-rc1. I haven't
> > managed to reproduce it with v5.13.2+your patch. Now I will build v5.14-
> > rc2+your patch and test further.
>
> I'm still hammering v5.14-rc2 + your patch, and I cannot reproduce the issue.
> Given I do not have a reliable reproducer (I'm just firing up the kernel build,
> and the issue pops up eventually, sooner or later, but usually within a couple
> of first tries), for how long I should hammer it for your fix to be considered
> proven?
You mentioned that the issue is reproducible with v5.14-rc, that means
it can be always reproduced in limited time(suppose it is A). If the issue
can't be reproduced any more after applying the patch in long enough time B(B >> A),
we can think it is fixed by the patch.
For example, if A is one hour, we can set B as 5*A or bigger to simulate
the long enough time.
Thanks,
Ming
Hello.
On středa 21. července 2021 10:00:56 CEST Ming Lei wrote:
> > I'm still hammering v5.14-rc2 + your patch, and I cannot reproduce the
> > issue. Given I do not have a reliable reproducer (I'm just firing up the
> > kernel build, and the issue pops up eventually, sooner or later, but
> > usually within a couple of first tries), for how long I should hammer it
> > for your fix to be considered proven?
>
> You mentioned that the issue is reproducible with v5.14-rc, that means
> it can be always reproduced in limited time(suppose it is A). If the issue
> can't be reproduced any more after applying the patch in long enough time
> B(B >> A), we can think it is fixed by the patch.
>
> For example, if A is one hour, we can set B as 5*A or bigger to simulate
> the long enough time.
With your patch the issue is still not reproducible for me. Hence, from my
side the patch looks to be verified.
Thanks.
--
Oleksandr Natalenko (post-factum)
On Tue, Jul 27, 2021 at 05:12:03PM +0200, Oleksandr Natalenko wrote:
> Hello.
>
> On středa 21. července 2021 10:00:56 CEST Ming Lei wrote:
> > > I'm still hammering v5.14-rc2 + your patch, and I cannot reproduce the
> > > issue. Given I do not have a reliable reproducer (I'm just firing up the
> > > kernel build, and the issue pops up eventually, sooner or later, but
> > > usually within a couple of first tries), for how long I should hammer it
> > > for your fix to be considered proven?
> >
> > You mentioned that the issue is reproducible with v5.14-rc, that means
> > it can be always reproduced in limited time(suppose it is A). If the issue
> > can't be reproduced any more after applying the patch in long enough time
> > B(B >> A), we can think it is fixed by the patch.
> >
> > For example, if A is one hour, we can set B as 5*A or bigger to simulate
> > the long enough time.
>
> With your patch the issue is still not reproducible for me. Hence, from my
> side the patch looks to be verified.
OK.
BTW, can you test the following patch? It is another approach to the same
issue, with other benefits.
From c853e7ed05a75f631da5b7952b9a989983437819 Mon Sep 17 00:00:00 2001
From: Ming Lei <[email protected]>
Date: Mon, 7 Jun 2021 16:03:51 +0800
Subject: [PATCH 2/2] block: support bio merge for multi-range discard
So far multi-range discard treats each bio as one segment(range) of single
discard request. This way becomes not efficient if lots of small sized
discard bios are submitted, and one example is raid456.
Support bio merge for multi-range discard for improving lots of small
sized discard bios.
Turns out it is easy to support it:
1) always try to merge bio first
2) run into multi-range discard only if bio merge can't be done
3) add rq_for_each_discard_range() for retrieving each range(segment)
of discard request
Reported-by: Wang Shanker <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
block/blk-merge.c | 12 ++++-----
drivers/block/virtio_blk.c | 9 ++++---
drivers/nvme/host/core.c | 8 +++---
include/linux/blkdev.h | 51 ++++++++++++++++++++++++++++++++++++++
4 files changed, 66 insertions(+), 14 deletions(-)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index bcdff1879c34..65210e9a8efa 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -724,10 +724,10 @@ static inline bool blk_discard_mergable(struct request *req)
static enum elv_merge blk_try_req_merge(struct request *req,
struct request *next)
{
- if (blk_discard_mergable(req))
- return ELEVATOR_DISCARD_MERGE;
- else if (blk_rq_pos(req) + blk_rq_sectors(req) == blk_rq_pos(next))
+ if (blk_rq_pos(req) + blk_rq_sectors(req) == blk_rq_pos(next))
return ELEVATOR_BACK_MERGE;
+ else if (blk_discard_mergable(req))
+ return ELEVATOR_DISCARD_MERGE;
return ELEVATOR_NO_MERGE;
}
@@ -908,12 +908,12 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
enum elv_merge blk_try_merge(struct request *rq, struct bio *bio)
{
- if (blk_discard_mergable(rq))
- return ELEVATOR_DISCARD_MERGE;
- else if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
+ if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
return ELEVATOR_BACK_MERGE;
else if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector)
return ELEVATOR_FRONT_MERGE;
+ else if (blk_discard_mergable(rq))
+ return ELEVATOR_DISCARD_MERGE;
return ELEVATOR_NO_MERGE;
}
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index b9fa3ef5b57c..970cb0d8acaa 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -116,7 +116,6 @@ static int virtblk_setup_discard_write_zeroes(struct request *req, bool unmap)
unsigned short segments = blk_rq_nr_discard_segments(req);
unsigned short n = 0;
struct virtio_blk_discard_write_zeroes *range;
- struct bio *bio;
u32 flags = 0;
if (unmap)
@@ -138,9 +137,11 @@ static int virtblk_setup_discard_write_zeroes(struct request *req, bool unmap)
range[0].sector = cpu_to_le64(blk_rq_pos(req));
n = 1;
} else {
- __rq_for_each_bio(bio, req) {
- u64 sector = bio->bi_iter.bi_sector;
- u32 num_sectors = bio->bi_iter.bi_size >> SECTOR_SHIFT;
+ struct req_discard_range r;
+
+ rq_for_each_discard_range(r, req) {
+ u64 sector = r.sector;
+ u32 num_sectors = r.size >> SECTOR_SHIFT;
range[n].flags = cpu_to_le32(flags);
range[n].num_sectors = cpu_to_le32(num_sectors);
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 24bcae88587a..4b0a39360ce9 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -813,7 +813,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
{
unsigned short segments = blk_rq_nr_discard_segments(req), n = 0;
struct nvme_dsm_range *range;
- struct bio *bio;
+ struct req_discard_range r;
/*
* Some devices do not consider the DSM 'Number of Ranges' field when
@@ -835,9 +835,9 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
range = page_address(ns->ctrl->discard_page);
}
- __rq_for_each_bio(bio, req) {
- u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
- u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
+ rq_for_each_discard_range(r, req) {
+ u64 slba = nvme_sect_to_lba(ns, r.sector);
+ u32 nlb = r.size >> ns->lba_shift;
if (n < segments) {
range[n].cattr = cpu_to_le32(0);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d66d0da72529..bd9d22269a7b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1007,6 +1007,57 @@ static inline unsigned int blk_rq_stats_sectors(const struct request *rq)
return rq->stats_sectors;
}
+struct req_discard_range {
+ sector_t sector;
+ unsigned int size;
+
+ /*
+ * internal field: driver don't use it, and it always points to
+ * next bio to be processed
+ */
+ struct bio *__bio;
+};
+
+static inline void req_init_discard_range_iter(const struct request *rq,
+ struct req_discard_range *range)
+{
+ range->__bio = rq->bio;
+}
+
+/* return true if @range stores one valid discard range */
+static inline bool req_get_discard_range(struct req_discard_range *range)
+{
+ struct bio *bio;
+
+ if (!range->__bio)
+ return false;
+
+ bio = range->__bio;
+ range->sector = bio->bi_iter.bi_sector;
+ range->size = bio->bi_iter.bi_size;
+ range->__bio = bio->bi_next;
+
+ while (range->__bio) {
+ struct bio *bio = range->__bio;
+
+ if (range->sector + (range->size >> SECTOR_SHIFT) !=
+ bio->bi_iter.bi_sector)
+ break;
+
+ /*
+ * ->size won't overflow because req->__data_len is defined
+ * as 'unsigned int'
+ */
+ range->size += bio->bi_iter.bi_size;
+ range->__bio = bio->bi_next;
+ }
+ return true;
+}
+
+#define rq_for_each_discard_range(range, rq) \
+ for (req_init_discard_range_iter((rq), &range); \
+ req_get_discard_range(&range);)
+
#ifdef CONFIG_BLK_DEV_ZONED
/* Helper to convert BLK_ZONE_ZONE_XXX to its string format XXX */
--
2.31.1
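For what it's worth, a minimal usage sketch of the rq_for_each_discard_range() helper introduced above, modeled on the nvme hunk of the patch (illustration only): each iteration yields one contiguous (sector, size) range with adjacent bios folded together, so the driver fills one DSM range per iteration instead of one per bio.
```
/* Minimal usage sketch of rq_for_each_discard_range(), modeled on the
 * nvme_setup_discard() hunk from the patch above (illustration only). */
struct req_discard_range r;
unsigned short n = 0;

rq_for_each_discard_range(r, req) {
	u64 slba = nvme_sect_to_lba(ns, r.sector);	/* start LBA of this range */
	u32 nlb = r.size >> ns->lba_shift;		/* length in logical blocks */

	if (n < segments) {
		range[n].cattr = cpu_to_le32(0);
		range[n].nlb = cpu_to_le32(nlb);
		range[n].slba = cpu_to_le64(slba);
	}
	n++;
}
```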
Thanks,
Ming
Hello.
On úterý 27. července 2021 17:58:39 CEST Ming Lei wrote:
> BTW, can you test the following patch? It is another approach to the same
> issue, with other benefits.
>
> [...]
Do I have to revert the previous one and apply this one? If so, with this one the issue is triggered pretty quickly:
```
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 20 PID: 490 at drivers/nvme/host/core.c:850 nvme_setup_discard+0x1b9/0x220
…
kernel: CPU: 20 PID: 490 Comm: md0_raid10 Not tainted 5.13.0-pf4 #1
kernel: Hardware name: ASUS System Product Name/Pro WS X570-ACE, BIOS 3601 05/26/2021
kernel: RIP: 0010:nvme_setup_discard+0x1b9/0x220
kernel: Code: 38 4c 8b 88 40 0b 00 00 4c 2b 0d f2 06 d8 00 49 c1 f9 06 49 c1 e1 0c 4c 03 0d f3 06 d8 00 4d 89 c8 48 85 d2 0f 85 9f fe ff ff <0f> 0b b8 00 00 00 80 4c 01 c8 72 52 48 c7 c2 00 00 00 80 48 2b 15
kernel: RSP: 0018:ffffa3a34152ba10 EFLAGS: 00010202
kernel: RAX: ffff8b78d80db0c0 RBX: 000000000000000f RCX: 0000000000000400
kernel: RDX: 0000000000000000 RSI: 00000000241b5c00 RDI: 000000000000000d
kernel: RBP: ffff8b78cbd70380 R08: ffff8b78d80db000 R09: ffff8b78d80db000
kernel: R10: 00000000241b5c00 R11: 0000000000000000 R12: ffff8b78c5a4b800
kernel: R13: ffff8b78cbd704c8 R14: ffff8b78c5bd8000 R15: ffff8b78cabbf000
kernel: FS: 0000000000000000(0000) GS:ffff8b7fcef00000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007faaaf746020 CR3: 00000001e0342000 CR4: 0000000000350ee0
kernel: Call Trace:
kernel: nvme_setup_cmd+0x2d0/0x670
kernel: nvme_queue_rq+0x79/0xc90
kernel: ? __sbitmap_get_word+0x30/0x80
kernel: ? sbitmap_get+0x85/0x180
kernel: blk_mq_dispatch_rq_list+0x15c/0x810
kernel: __blk_mq_do_dispatch_sched+0xca/0x320
kernel: ? ktime_get+0x38/0xa0
kernel: __blk_mq_sched_dispatch_requests+0x14d/0x190
kernel: blk_mq_sched_dispatch_requests+0x2f/0x60
kernel: __blk_mq_run_hw_queue+0x30/0xa0
kernel: __blk_mq_delay_run_hw_queue+0x142/0x170
kernel: blk_mq_sched_insert_requests+0x6d/0xf0
kernel: blk_mq_flush_plug_list+0x111/0x1c0
kernel: blk_finish_plug+0x21/0x30
kernel: raid10d+0x7c8/0x1960 [raid10]
kernel: ? psi_task_switch+0xf2/0x330
kernel: ? __switch_to_asm+0x42/0x70
kernel: ? finish_task_switch.isra.0+0xaa/0x290
kernel: ? md_thread+0xc3/0x190 [md_mod]
kernel: md_thread+0xc3/0x190 [md_mod]
kernel: ? finish_wait+0x80/0x80
kernel: ? md_rdev_init+0xb0/0xb0 [md_mod]
kernel: kthread+0x1b3/0x1e0
kernel: ? __kthread_init_worker+0x50/0x50
kernel: ret_from_fork+0x22/0x30
kernel: ---[ end trace dc148fcea235e799 ]---
kernel: blk_update_request: I/O error, dev nvme0n1, sector 605615104 op 0x3:(DISCARD) flags 0x0 phys_seg 15 prio class 0
kernel: blk_update_request: I/O error, dev nvme1n1, sector 118159360 op 0x3:(DISCARD) flags 0x0 phys_seg 15 prio class 0
kernel: blk_update_request: I/O error, dev nvme0n1, sector 118200320 op 0x3:(DISCARD) flags 0x0 phys_seg 50 prio class 0
kernel: blk_update_request: I/O error, dev nvme1n1, sector 118326272 op 0x3:(DISCARD) flags 0x0 phys_seg 165 prio class 0
```
--
Oleksandr Natalenko (post-factum)
On Wed, Jul 28, 2021 at 03:44:06PM +0200, Oleksandr Natalenko wrote:
> Hello.
>
> On úterý 27. července 2021 17:58:39 CEST Ming Lei wrote:
> > BTW, can you test the following patch? which is another approach on the same
> > issue with other benefits.
> >
> > From c853e7ed05a75f631da5b7952b9a989983437819 Mon Sep 17 00:00:00 2001
> > From: Ming Lei <[email protected]>
> > Date: Mon, 7 Jun 2021 16:03:51 +0800
> > Subject: [PATCH 2/2] block: support bio merge for multi-range discard
> >
> > So far multi-range discard treats each bio as one segment(range) of single
> > discard request. This way becomes not efficient if lots of small sized
> > discard bios are submitted, and one example is raid456.
> >
> > Support bio merge for multi-range discard for improving lots of small
> > sized discard bios.
> >
> > Turns out it is easy to support it:
> >
> > 1) always try to merge bio first
> >
> > 2) run into multi-range discard only if bio merge can't be done
> >
> > 3) add rq_for_each_discard_range() for retrieving each range(segment)
> > of discard request
> >
> > Reported-by: Wang Shanker <[email protected]>
> > Signed-off-by: Ming Lei <[email protected]>
> > ---
> > block/blk-merge.c | 12 ++++-----
> > drivers/block/virtio_blk.c | 9 ++++---
> > drivers/nvme/host/core.c | 8 +++---
> > include/linux/blkdev.h | 51 ++++++++++++++++++++++++++++++++++++++
> > 4 files changed, 66 insertions(+), 14 deletions(-)
> >
> > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > index bcdff1879c34..65210e9a8efa 100644
> > --- a/block/blk-merge.c
> > +++ b/block/blk-merge.c
> > @@ -724,10 +724,10 @@ static inline bool blk_discard_mergable(struct request
> > *req) static enum elv_merge blk_try_req_merge(struct request *req,
> > struct request *next)
> > {
> > - if (blk_discard_mergable(req))
> > - return ELEVATOR_DISCARD_MERGE;
> > - else if (blk_rq_pos(req) + blk_rq_sectors(req) == blk_rq_pos(next))
> > + if (blk_rq_pos(req) + blk_rq_sectors(req) == blk_rq_pos(next))
> > return ELEVATOR_BACK_MERGE;
> > + else if (blk_discard_mergable(req))
> > + return ELEVATOR_DISCARD_MERGE;
> >
> > return ELEVATOR_NO_MERGE;
> > }
> > @@ -908,12 +908,12 @@ bool blk_rq_merge_ok(struct request *rq, struct bio
> > *bio)
> >
> > enum elv_merge blk_try_merge(struct request *rq, struct bio *bio)
> > {
> > - if (blk_discard_mergable(rq))
> > - return ELEVATOR_DISCARD_MERGE;
> > - else if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
> > + if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
> > return ELEVATOR_BACK_MERGE;
> > else if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector)
> > return ELEVATOR_FRONT_MERGE;
> > + else if (blk_discard_mergable(rq))
> > + return ELEVATOR_DISCARD_MERGE;
> > return ELEVATOR_NO_MERGE;
> > }
> >
> > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> > index b9fa3ef5b57c..970cb0d8acaa 100644
> > --- a/drivers/block/virtio_blk.c
> > +++ b/drivers/block/virtio_blk.c
> > @@ -116,7 +116,6 @@ static int virtblk_setup_discard_write_zeroes(struct
> > request *req, bool unmap) unsigned short segments =
> > blk_rq_nr_discard_segments(req);
> > unsigned short n = 0;
> > struct virtio_blk_discard_write_zeroes *range;
> > - struct bio *bio;
> > u32 flags = 0;
> >
> > if (unmap)
> > @@ -138,9 +137,11 @@ static int virtblk_setup_discard_write_zeroes(struct
> > request *req, bool unmap) range[0].sector = cpu_to_le64(blk_rq_pos(req));
> > n = 1;
> > } else {
> > - __rq_for_each_bio(bio, req) {
> > - u64 sector = bio->bi_iter.bi_sector;
> > - u32 num_sectors = bio->bi_iter.bi_size >> SECTOR_SHIFT;
> > + struct req_discard_range r;
> > +
> > + rq_for_each_discard_range(r, req) {
> > + u64 sector = r.sector;
> > + u32 num_sectors = r.size >> SECTOR_SHIFT;
> >
> > range[n].flags = cpu_to_le32(flags);
> > range[n].num_sectors = cpu_to_le32(num_sectors);
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index 24bcae88587a..4b0a39360ce9 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -813,7 +813,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns
> > *ns, struct request *req, {
> > unsigned short segments = blk_rq_nr_discard_segments(req), n = 0;
> > struct nvme_dsm_range *range;
> > - struct bio *bio;
> > + struct req_discard_range r;
> >
> > /*
> > * Some devices do not consider the DSM 'Number of Ranges' field when
> > @@ -835,9 +835,9 @@ static blk_status_t nvme_setup_discard(struct nvme_ns
> > *ns, struct request *req, range = page_address(ns->ctrl->discard_page);
> > }
> >
> > - __rq_for_each_bio(bio, req) {
> > - u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
> > - u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
> > + rq_for_each_discard_range(r, req) {
> > + u64 slba = nvme_sect_to_lba(ns, r.sector);
> > + u32 nlb = r.size >> ns->lba_shift;
> >
> > if (n < segments) {
> > range[n].cattr = cpu_to_le32(0);
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index d66d0da72529..bd9d22269a7b 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -1007,6 +1007,57 @@ static inline unsigned int blk_rq_stats_sectors(const
> > struct request *rq) return rq->stats_sectors;
> > }
> >
> > +struct req_discard_range {
> > + sector_t sector;
> > + unsigned int size;
> > +
> > + /*
> > + * internal field: driver don't use it, and it always points to
> > + * next bio to be processed
> > + */
> > + struct bio *__bio;
> > +};
> > +
> > +static inline void req_init_discard_range_iter(const struct request *rq,
> > + struct req_discard_range *range)
> > +{
> > + range->__bio = rq->bio;
> > +}
> > +
> > +/* return true if @range stores one valid discard range */
> > +static inline bool req_get_discard_range(struct req_discard_range *range)
> > +{
> > + struct bio *bio;
> > +
> > + if (!range->__bio)
> > + return false;
> > +
> > + bio = range->__bio;
> > + range->sector = bio->bi_iter.bi_sector;
> > + range->size = bio->bi_iter.bi_size;
> > + range->__bio = bio->bi_next;
> > +
> > + while (range->__bio) {
> > + struct bio *bio = range->__bio;
> > +
> > + if (range->sector + (range->size >> SECTOR_SHIFT) !=
> > + bio->bi_iter.bi_sector)
> > + break;
> > +
> > + /*
> > + * ->size won't overflow because req->__data_len is defined
> > + * as 'unsigned int'
> > + */
> > + range->size += bio->bi_iter.bi_size;
> > + range->__bio = bio->bi_next;
> > + }
> > + return true;
> > +}
> > +
> > +#define rq_for_each_discard_range(range, rq) \
> > + for (req_init_discard_range_iter((rq), &range); \
> > + req_get_discard_range(&range);)
> > +
> > #ifdef CONFIG_BLK_DEV_ZONED
> >
> > /* Helper to convert BLK_ZONE_ZONE_XXX to its string format XXX */
>
> Do I have to revert the previous one and apply this one? If so, with this one the issue is triggered pretty quickly:
Yeah, the previous one needs to be reverted.
>
> ```
> kernel: ------------[ cut here ]------------
> kernel: WARNING: CPU: 20 PID: 490 at drivers/nvme/host/core.c:850 nvme_setup_discard+0x1b9/0x220
Can you collect debug log by applying the following patch against the
last one?
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 8780e4aa9df2..fbd8a68c619b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -828,6 +828,24 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
}
+static inline void blk_dump_rq(const struct request *req)
+{
+ struct bio *bio;
+ int i = 0;
+
+ printk("dump req %p(f:%x, seg: %d)\n", req, req->cmd_flags,
+ req->nr_phys_segments);
+
+ __rq_for_each_bio(bio, req) {
+ printk("%d-%p: %hx/%hx %llu %u\n",
+ i++, bio,
+ bio->bi_flags, bio->bi_opf,
+ (unsigned long long)bio->bi_iter.bi_sector,
+ bio->bi_iter.bi_size>>9);
+ }
+}
+
+
static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmnd)
{
@@ -868,6 +886,8 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
}
if (WARN_ON_ONCE(n != segments)) {
+ printk("%s: ranges %u segments %u\n", __func__, n, segments);
+ blk_dump_rq(req);
if (virt_to_page(range) == ns->ctrl->discard_page)
clear_bit_unlock(0, &ns->ctrl->discard_page_busy);
else
Thanks,
Ming
Hello.
On středa 28. července 2021 17:53:05 CEST Ming Lei wrote:
> Can you collect debug log by applying the following patch against the
> last one?
Yes, please see below.
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 8780e4aa9df2..fbd8a68c619b 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -828,6 +828,24 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
> cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
> }
>
> +static inline void blk_dump_rq(const struct request *req)
> +{
> + struct bio *bio;
> + int i = 0;
> +
> + printk("dump req %p(f:%x, seg: %d)\n", req, req->cmd_flags,
> + req->nr_phys_segments);
> +
> + __rq_for_each_bio(bio, req) {
> + printk("%d-%p: %hx/%hx %llu %u\n",
> + i++, bio,
> + bio->bi_flags, bio->bi_opf,
> + (unsigned long long)bio->bi_iter.bi_sector,
> + bio->bi_iter.bi_size>>9);
> + }
> +}
> +
> +
> static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request
> *req, struct nvme_command *cmnd)
> {
> @@ -868,6 +886,8 @@ static blk_status_t nvme_setup_discard(struct nvme_ns
> *ns, struct request *req, }
>
> if (WARN_ON_ONCE(n != segments)) {
> + printk("%s: ranges %u segments %u\n", __func__, n, segments);
> + blk_dump_rq(req);
> if (virt_to_page(range) == ns->ctrl->discard_page)
> clear_bit_unlock(0, &ns->ctrl->discard_page_busy);
> else
```
WARNING: CPU: 17 PID: 821 at drivers/nvme/host/core.c:868 nvme_setup_discard+0x1c6/0x220
…
CPU: 17 PID: 821 Comm: kworker/17:1H Not tainted 5.13.0-pf4 #1
Hardware name: ASUS System Product Name/Pro WS X570-ACE, BIOS 3601 05/26/2021
Workqueue: kblockd blk_mq_run_work_fn
RIP: 0010:nvme_setup_discard+0x1c6/0x220
Code: 8b a0 40 0b 00 00 4c 2b 25 f7 ff d7 00 49 c1 fc 06 49 c1 e4 0c 4c 03 25 f8 ff d7 00 4c 89 e5 48 85 d2 0f 85 9b fe ff ff 31 d2 <0f> 0b 48 c7 c6 e0 a8 10 8b 41 0f b7 cd 48 c7 c7 af 09 40 8b e8 14
RSP: 0018:ffffafa884517bf0 EFLAGS: 00010297
RAX: ffff91602f5b20d0 RBX: ffff915e05743c00 RCX: 0000000000080000
RDX: 000000000000000f RSI: 0000000028445c00 RDI: 000000000000000f
RBP: ffff91602f5b2000 R08: 000000000b366000 R09: ffff91602f5b2000
R10: 000000000000002d R11: fefefefefefefeff R12: ffff91602f5b2000
R13: 000000000000000e R14: ffff915e18c77000 R15: ffff915e18c77148
FS: 0000000000000000(0000) GS:ffff91650ee40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00001c90b36879a8 CR3: 000000010d18c000 CR4: 0000000000350ee0
Call Trace:
nvme_setup_cmd+0x2e4/0x6a0
nvme_queue_rq+0x79/0xc90
blk_mq_dispatch_rq_list+0x15c/0x810
__blk_mq_do_dispatch_sched+0xca/0x320
__blk_mq_sched_dispatch_requests+0x14d/0x190
blk_mq_sched_dispatch_requests+0x2f/0x60
blk_mq_run_work_fn+0x43/0xc0
process_one_work+0x24e/0x430
worker_thread+0x54/0x4d0
kthread+0x1b3/0x1e0
ret_from_fork+0x22/0x30
---[ end trace bd51917eae1d7201 ]---
nvme_setup_discard: ranges 15 segments 14
dump req 000000002c6a085b(f:3, seg: 14)
0-000000002c3297c7: f80/3 675773440 1024
1-0000000098edb2a8: b80/3 188319744 1024
2-00000000f58e3b18: b80/3 675775488 1024
3-00000000f6670c5a: b80/3 188129280 2048
4-00000000ea371a88: b80/3 675585024 2048
5-00000000e9cec043: b80/3 188140544 2048
6-000000006e1126e6: b80/3 675596288 2048
7-000000009466f937: b80/3 188327936 2048
8-000000003c9e2ccd: b80/3 675783680 2048
9-00000000ab322c68: b80/3 188329984 2048
10-00000000eb2b3fb6: b80/3 675785728 2048
11-00000000aa19f73d: b80/3 188098560 4096
12-00000000dd2f8682: b80/3 675554304 4096
13-0000000072f20a8c: b80/3 188112896 1024
14-00000000559ba401: b80/3 675568640 1024
blk_update_request: I/O error, dev nvme1n1, sector 675773440 op 0x3:(DISCARD) flags 0x0 phys_seg 14 prio class 0
nvme_setup_discard: ranges 45 segments 48
dump req 00000000f96fdd7c(f:3, seg: 48)
0-00000000180da7f3: f82/3 188316672 1024
1-00000000d2d0828e: b82/3 675774464 1024
2-000000000c4882eb: b82/3 645145600 1024
3-000000009d701d1e: b82/3 188318720 1024
4-000000000b9e8e53: b82/3 675776512 1024
5-00000000c73ca145: b82/3 157696000 1024
6-00000000e22bbe77: b82/3 188320768 1024
7-0000000054696401: b82/3 675777536 1024
8-0000000016bc968c: b82/3 188321792 1024
9-000000006e25f763: b82/3 675779584 1024
10-00000000340df89d: b82/3 157765632 1024
11-0000000046522f81: b82/3 188323840 1024
12-00000000fa9a368c: b82/3 675780608 1024
13-0000000093af3f87: b82/3 188128256 1024
14-000000008585ce43: b82/3 675587072 1024
15-0000000044f73721: b82/3 188132352 1024
16-000000006c9df2f2: b82/3 675589120 1024
17-00000000c6f4ba8b: b82/3 675594240 1024
18-000000001f5dbb5e: b82/3 188138496 1024
19-000000001de03037: b82/3 188139520 1024
20-00000000a307ccef: b82/3 645164032 1024
21-0000000018da5076: b82/3 675598336 1024
22-0000000033cdeaf5: b82/3 188143616 1024
23-00000000ecf3d07a: b82/3 675600384 1024
24-00000000e86b10f4: b82/3 188144640 1024
25-00000000ba439495: b82/3 157726720 1024
26-0000000027fcd176: b82/3 641614848 1024
27-0000000014182edf: b82/3 157728768 1024
28-00000000194e904f: b82/3 157731840 1024
29-000000002ea97c9e: b82/3 675787776 1024
30-000000004d5c6130: b82/3 188332032 1024
31-0000000058639801: b82/3 675788800 1024
32-000000003dd731c7: b82/3 675789824 1024
33-00000000f6f8a09e: b82/3 645190656 1024
34-000000002d4c6b21: b82/3 188334080 1024
35-000000007ea3ba79: b82/3 675790848 1024
36-00000000011db072: b82/3 188064768 1024
37-00000000e31f9da8: b82/3 188067840 1024
38-000000006baa465a: b82/3 188097536 1024
39-00000000a870893c: b82/3 675559424 1024
40-000000004105bc56: b82/3 188103680 1024
41-00000000e75709b5: b82/3 675560448 1024
42-00000000e08f7e0a: b82/3 675566592 1024
43-0000000020218f1b: b82/3 188110848 1024
44-000000004bb5880a: b82/3 188111872 1024
45-00000000e6632009: b82/3 675569664 1024
46-0000000029e61303: b82/3 188114944 1024
47-000000006f66801c: b82/3 675571712 1024
blk_update_request: I/O error, dev nvme1n1, sector 188316672 op 0x3:(DISCARD) flags 0x0 phys_seg 48 prio class 0
nvme_setup_discard: ranges 73 segments 75
dump req 00000000b51b828c(f:3, seg: 75)
0-000000009c966abb: b82/3 675772416 1024
1-00000000d0b7b912: f80/3 675773440 1024
2-00000000c38e02dd: b80/3 188319744 1024
3-0000000043a53b40: b80/3 675775488 1024
4-000000009b5569c4: b80/3 188129280 2048
5-0000000066d31f07: b80/3 675585024 2048
6-000000006ff6d59f: b80/3 188140544 2048
7-00000000f39f9bdb: b80/3 675596288 2048
8-00000000929b3cff: b80/3 188327936 2048
9-0000000034c64365: b80/3 675783680 2048
10-000000004c508955: b80/3 188329984 2048
11-00000000fc9b9d0a: b80/3 675785728 2048
12-000000002a402a6c: b80/3 188098560 4096
13-0000000008074d32: b80/3 675554304 4096
14-0000000018d76730: b80/3 188112896 1024
15-000000003a12b456: b80/3 675568640 1024
16-00000000285092ed: b82/3 610962432 1024
17-000000005f714a72: b82/3 188342272 1024
18-00000000794c5e57: b82/3 611005440 1024
19-00000000630a4c01: b82/3 153951232 1024
20-00000000ffa4369f: b82/3 675798016 1024
21-00000000521f4eda: b82/3 188348416 1024
22-0000000041b3ca89: b82/3 17291264 1024
23-00000000e9c2e9f0: b82/3 17362944 1024
24-000000008f2954da: b82/3 599607296 1024
25-00000000805912ef: b82/3 611051520 1024
26-000000008533232e: b82/3 188336128 1024
27-00000000f9c0f9ae: b82/3 174541824 1024
28-00000000552ad7fc: b82/3 188318720 1024
29-000000009971d59c: b82/3 157689856 1024
30-0000000001ef1cc6: b82/3 675774464 1024
31-0000000037eddacb: b82/3 188320768 1024
32-00000000a2c8dd24: b82/3 645151744 1024
33-000000003e4d7ac7: b82/3 675776512 1024
34-000000000cb417af: b82/3 188321792 1024
35-0000000056c3e342: b82/3 675777536 1024
36-00000000ab22b292: b82/3 188323840 1024
37-0000000069b44eba: b82/3 645221376 1024
38-00000000a35eafb0: b82/3 675779584 1024
39-00000000f71cf77d: b82/3 188324864 1024
40-0000000011ba1e4e: b82/3 675584000 1024
41-0000000012ec22d4: b82/3 188131328 1024
42-00000000acf89f36: b82/3 675588096 1024
43-000000007fdfa2fc: b82/3 188133376 1024
44-000000006ae8f6de: b82/3 188138496 1024
45-00000000c86eaa59: b82/3 675594240 1024
46-00000000cbeecf93: b82/3 675595264 1024
47-0000000002061627: b82/3 157708288 1024
48-00000000c6c0e9fb: b82/3 188142592 1024
49-00000000aca0a548: b82/3 675599360 1024
50-00000000cf220d06: b82/3 188144640 1024
51-000000007a9bdfd3: b82/3 675600384 1024
52-000000009c6129de: b82/3 645182464 1024
53-000000006744fe84: b82/3 154159104 1024
54-00000000217342e3: b82/3 645184512 1024
55-00000000bc5fea55: b82/3 645187584 1024
56-00000000eb6093bd: b82/3 188332032 1024
57-0000000056371b99: b82/3 675787776 1024
58-0000000080e53e1e: b82/3 188333056 1024
59-0000000020cc4aa0: b82/3 188334080 1024
60-00000000dfb62a3e: b82/3 157734912 1024
61-00000000ff41dcf5: b82/3 675789824 1024
62-0000000042c2746d: b82/3 188335104 1024
63-00000000192eebcb: b82/3 675520512 1024
64-000000008725b451: b82/3 675523584 1024
65-00000000893843dc: b82/3 675553280 1024
66-00000000eab256cb: b82/3 188103680 1024
67-0000000046ef0487: b82/3 675559424 1024
68-000000000228f430: b82/3 188104704 1024
69-00000000386e9a99: b82/3 188110848 1024
70-000000008ced777c: b82/3 675566592 1024
71-000000008a5f9e82: b82/3 675567616 1024
72-0000000065da8656: b82/3 188113920 1024
73-00000000c938ac22: b82/3 675570688 1024
74-00000000c67c0e32: b82/3 188115968 1024
75-00000000646e12f4: b82/3 675760128 1024
76-0000000054aa9d13: b82/3 188305408 1024
blk_update_request: I/O error, dev nvme0n1, sector 675772416 op 0x3:(DISCARD) flags 0x0 phys_seg 75 prio class 0
nvme_setup_discard: ranges 5 segments 6
dump req 00000000a157e475(f:3, seg: 6)
0-000000004fc95770: f82/3 189429760 1024
1-000000003d5f4d39: b82/3 189432832 1024
2-000000003e3a5606: b82/3 676889600 1024
3-00000000d5b2b48e: b82/3 189433856 1024
4-00000000101795a9: b82/3 189434880 1024
5-000000001c3c352f: b82/3 184399872 1024
blk_update_request: I/O error, dev nvme1n1, sector 189429760 op 0x3:(DISCARD) flags 0x0 phys_seg 6 prio class 0
nvme_setup_discard: ranges 2 segments 3
dump req 00000000e17a5886(f:3, seg: 3)
0-000000007af7af24: b82/3 188352512 1024
1-000000003af3dbc9: f80/3 188353536 1024
2-00000000c4e5d7e7: b80/3 675809280 1024
3-00000000cb04663c: b82/3 675810304 1024
blk_update_request: I/O error, dev nvme1n1, sector 188352512 op 0x3:(DISCARD) flags 0x0 phys_seg 3 prio class 0
nvme_setup_discard: ranges 2 segments 3
dump req 00000000ad5525bd(f:3, seg: 3)
0-000000005cf09cca: b82/3 188348416 1024
1-0000000018cc2558: f80/3 188349440 1024
2-00000000c4c2a716: b80/3 675805184 1024
3-00000000a8908f6c: b82/3 675806208 1024
blk_update_request: I/O error, dev nvme1n1, sector 188348416 op 0x3:(DISCARD) flags 0x0 phys_seg 3 prio class 0
nvme_setup_discard: ranges 2 segments 3
dump req 00000000f96fdd7c(f:3, seg: 3)
0-000000009d4e946d: b82/3 189264896 1024
1-00000000d5eee140: f80/3 189265920 2048
2-00000000d84f9e26: b80/3 676721664 2048
3-0000000099245351: b82/3 676723712 1024
blk_update_request: I/O error, dev nvme1n1, sector 189264896 op 0x3:(DISCARD) flags 0x0 phys_seg 3 prio class 0
nvme_setup_discard: ranges 48 segments 49
dump req 0000000034977ea1(f:3, seg: 49)
0-00000000e4cae23d: f82/3 676881408 1024
1-0000000068dc7629: b82/3 676885504 1024
2-00000000f049ea5a: b82/3 676888576 1024
3-00000000c20d2ce8: b82/3 189433856 1024
4-000000004c139e91: b82/3 676889600 1024
5-0000000091100457: b82/3 676890624 1024
6-0000000076b82d71: b82/3 671855616 1024
7-000000003003717f: b82/3 144389120 1024
8-00000000f664d7bc: b82/3 188355584 1024
9-0000000004dffa3c: b82/3 675811328 1024
10-00000000333580c6: b82/3 188356608 1024
11-00000000b69a3602: b82/3 631820288 1024
12-00000000736b0c31: b80/3 188353536 1024
13-0000000007888c71: b80/3 675809280 1024
14-0000000003614d26: b82/3 675808256 1024
15-000000001a7ae177: b82/3 188354560 1024
16-000000008ab11bbd: b82/3 631758848 1024
17-00000000998c6d3c: b82/3 675807232 1024
18-0000000056d675a4: b82/3 188352512 1024
19-00000000854a6b1e: b80/3 188349440 1024
20-000000000b49b05e: b80/3 675805184 1024
21-000000008acc3b01: b82/3 675804160 1024
22-00000000315cb886: b82/3 188350464 1024
23-0000000086fb9348: b82/3 175320064 1024
24-000000006233dfd4: b82/3 188591104 1024
25-000000008022c2e2: b82/3 676046848 1024
26-000000002388ff7a: b82/3 188592128 1024
27-0000000024e54dc9: b82/3 188590080 1024
28-00000000e1706a07: b82/3 670319616 1024
29-00000000f5308dfc: b82/3 189270016 1024
30-00000000c005ae85: b80/3 189265920 2048
31-00000000a98ed3a6: b80/3 676721664 2048
32-000000003e7e01e1: b82/3 676720640 1024
33-000000003e9b0c86: b82/3 189267968 1024
34-00000000802b8160: b82/3 676716544 1024
35-00000000adf4fab8: b82/3 189257728 1024
36-000000007136233a: b82/3 676713472 1024
37-0000000085a35fca: b82/3 189258752 1024
38-000000008c8ed35f: b82/3 676710400 1024
39-00000000567baa73: b82/3 189255680 1024
40-0000000041ff7157: b82/3 676711424 1024
41-00000000fb3f4cd8: b82/3 189249536 1024
42-000000002efa7248: b82/3 676705280 1024
43-00000000a53bc3cb: b82/3 189239296 1024
44-000000006e6456e9: b82/3 676691968 1024
45-00000000f0fa291b: b82/3 676690944 1024
46-000000008b814484: b82/3 189234176 1024
47-00000000f3fa7f80: b82/3 676686848 1024
48-00000000e34ea3ca: b82/3 676679680 1024
blk_update_request: I/O error, dev nvme0n1, sector 676881408 op 0x3:(DISCARD) flags 0x0 phys_seg 49 prio class 0
nvme_setup_discard: ranges 9 segments 10
dump req 0000000066d1c6c7(f:3, seg: 10)
0-000000006f9c1d2b: f82/3 95788032 1024
1-000000005592941a: b82/3 155664384 1024
2-00000000180b2bab: b80/3 186275840 2048
3-00000000485a4757: b80/3 673731584 2048
4-00000000fcc93dcb: b82/3 673733632 1024
5-0000000089e0acd8: b80/3 186273792 1024
6-0000000005896717: b80/3 673729536 1024
7-00000000f264b15c: b82/3 186272768 1024
8-000000009684f15b: b82/3 673730560 1024
9-000000003a2a57b4: b82/3 175249408 1024
blk_update_request: I/O error, dev nvme1n1, sector 95788032 op 0x3:(DISCARD) flags 0x0 phys_seg 10 prio class 0
```
Thanks.
--
Oleksandr Natalenko (post-factum)
On Wed, Jul 28, 2021 at 06:38:36PM +0200, Oleksandr Natalenko wrote:
> Hello.
>
> On středa 28. července 2021 17:53:05 CEST Ming Lei wrote:
> > Can you collect debug log by applying the following patch against the
> > last one?
>
> Yes, please see below.
>
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index 8780e4aa9df2..fbd8a68c619b 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -828,6 +828,24 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
> > cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
> > }
> >
> > +static inline void blk_dump_rq(const struct request *req)
> > +{
> > + struct bio *bio;
> > + int i = 0;
> > +
> > + printk("dump req %p(f:%x, seg: %d)\n", req, req->cmd_flags,
> > + req->nr_phys_segments);
> > +
> > + __rq_for_each_bio(bio, req) {
> > + printk("%d-%p: %hx/%hx %llu %u\n",
> > + i++, bio,
> > + bio->bi_flags, bio->bi_opf,
> > + (unsigned long long)bio->bi_iter.bi_sector,
> > + bio->bi_iter.bi_size>>9);
> > + }
> > +}
> > +
> > +
> > static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request
> > *req, struct nvme_command *cmnd)
> > {
> > @@ -868,6 +886,8 @@ static blk_status_t nvme_setup_discard(struct nvme_ns
> > *ns, struct request *req, }
> >
> > if (WARN_ON_ONCE(n != segments)) {
> > + printk("%s: ranges %u segments %u\n", __func__, n, segments);
> > + blk_dump_rq(req);
> > if (virt_to_page(range) == ns->ctrl->discard_page)
> > clear_bit_unlock(0, &ns->ctrl->discard_page_busy);
> > else
>
> ```
> WARNING: CPU: 17 PID: 821 at drivers/nvme/host/core.c:868 nvme_setup_discard+0x1c6/0x220
> …
> CPU: 17 PID: 821 Comm: kworker/17:1H Not tainted 5.13.0-pf4 #1
> Hardware name: ASUS System Product Name/Pro WS X570-ACE, BIOS 3601 05/26/2021
> Workqueue: kblockd blk_mq_run_work_fn
> RIP: 0010:nvme_setup_discard+0x1c6/0x220
> Code: 8b a0 40 0b 00 00 4c 2b 25 f7 ff d7 00 49 c1 fc 06 49 c1 e4 0c 4c 03 25 f8 ff d7 00 4c 89 e5 48 85 d2 0f 85 9b fe ff ff 31 d2 <0f> 0b 48 c7 c6 e0 a8 10 8b 41 0f b7 cd 48 c7 c7 af 09 40 8b e8 14
> RSP: 0018:ffffafa884517bf0 EFLAGS: 00010297
> RAX: ffff91602f5b20d0 RBX: ffff915e05743c00 RCX: 0000000000080000
> RDX: 000000000000000f RSI: 0000000028445c00 RDI: 000000000000000f
> RBP: ffff91602f5b2000 R08: 000000000b366000 R09: ffff91602f5b2000
> R10: 000000000000002d R11: fefefefefefefeff R12: ffff91602f5b2000
> R13: 000000000000000e R14: ffff915e18c77000 R15: ffff915e18c77148
> FS: 0000000000000000(0000) GS:ffff91650ee40000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00001c90b36879a8 CR3: 000000010d18c000 CR4: 0000000000350ee0
> Call Trace:
> nvme_setup_cmd+0x2e4/0x6a0
> nvme_queue_rq+0x79/0xc90
> blk_mq_dispatch_rq_list+0x15c/0x810
> __blk_mq_do_dispatch_sched+0xca/0x320
> __blk_mq_sched_dispatch_requests+0x14d/0x190
> blk_mq_sched_dispatch_requests+0x2f/0x60
> blk_mq_run_work_fn+0x43/0xc0
> process_one_work+0x24e/0x430
> worker_thread+0x54/0x4d0
> kthread+0x1b3/0x1e0
> ret_from_fork+0x22/0x30
> ---[ end trace bd51917eae1d7201 ]---
> nvme_setup_discard: ranges 15 segments 14
> dump req 000000002c6a085b(f:3, seg: 14)
> 0-000000002c3297c7: f80/3 675773440 1024
> 1-0000000098edb2a8: b80/3 188319744 1024
> 2-00000000f58e3b18: b80/3 675775488 1024
> 3-00000000f6670c5a: b80/3 188129280 2048
> 4-00000000ea371a88: b80/3 675585024 2048
> 5-00000000e9cec043: b80/3 188140544 2048
> 6-000000006e1126e6: b80/3 675596288 2048
> 7-000000009466f937: b80/3 188327936 2048
> 8-000000003c9e2ccd: b80/3 675783680 2048
> 9-00000000ab322c68: b80/3 188329984 2048
188329984 = 188327936 + 2048
> 10-00000000eb2b3fb6: b80/3 675785728 2048
675785728 = 675783680 + 2048
It seems the adjacent bios end up separated by another incoming discard
range.
It looks like supporting mixed discard IO merge (traditional IO merge vs.
multi-range merge) isn't mature yet, so I will post the previous patch to
fix this issue.
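To make the mismatch concrete, here is a small userspace sketch (not kernel
code; the struct and variable names are made up) that replays the bio list of
the dumped request 000000002c6a085b through the same coalescing rule
req_get_discard_range() uses, i.e. only bios that are both contiguous and
consecutive in the list collapse into one range:
```
/*
 * Userspace sketch, not kernel code: replay the bio list of request
 * 000000002c6a085b from the dump above through the coalescing rule of
 * req_get_discard_range(): merge a bio into the current range only if it
 * is contiguous with it AND immediately follows it in the list.
 */
#include <stdio.h>

struct fake_bio {
	unsigned long long sector;
	unsigned int sectors;
};

int main(void)
{
	/* (start sector, length in sectors), in bio-list order, from the log */
	const struct fake_bio bios[] = {
		{ 675773440, 1024 }, { 188319744, 1024 }, { 675775488, 1024 },
		{ 188129280, 2048 }, { 675585024, 2048 }, { 188140544, 2048 },
		{ 675596288, 2048 }, { 188327936, 2048 }, { 675783680, 2048 },
		{ 188329984, 2048 }, { 675785728, 2048 }, { 188098560, 4096 },
		{ 675554304, 4096 }, { 188112896, 1024 }, { 675568640, 1024 },
	};
	const int nr_bios = sizeof(bios) / sizeof(bios[0]);
	int ranges = 1;

	for (int i = 1; i < nr_bios; i++)
		/* only list-consecutive contiguity folds into one range */
		if (bios[i - 1].sector + bios[i - 1].sectors != bios[i].sector)
			ranges++;

	/*
	 * bio 7 ends at 188329984, exactly where bio 9 starts (and bio 8
	 * ends where bio 10 starts), but the intervening bios keep the
	 * iterator from folding those pairs.
	 */
	printf("ranges = %d, segments (from log) = 14\n", ranges);
	return 0;
}
```
It prints ranges = 15 against the 14 segments the request reported, matching
the "nvme_setup_discard: ranges 15 segments 14" line above.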
> blk_update_request: I/O error, dev nvme1n1, sector 675773440 op 0x3:(DISCARD) flags 0x0 phys_seg 14 prio class 0
> nvme_setup_discard: ranges 45 segments 48
...
> nvme_setup_discard: ranges 73 segments 75
...
> nvme_setup_discard: ranges 5 segments 6
...
> nvme_setup_discard: ranges 2 segments 3
...
> nvme_setup_discard: ranges 2 segments 3
...
> nvme_setup_discard: ranges 2 segments 3
...
> nvme_setup_discard: ranges 48 segments 49
...
> nvme_setup_discard: ranges 9 segments 10
For the above (ranges < segments), that should be caused by
blk_recalc_rq_segments(), which can be solved by switching to
rq_for_each_discard_range() for discard IO.
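The other direction (ranges < segments) can also be read off the same logs,
if blk_recalc_rq_segments() ends up counting one segment per discard bio here
(an assumption on my part): in the "ranges 45 segments 48" dump, exactly three
bio pairs are contiguous and consecutive in list order (bios 18/19, 31/32 and
43/44), and the iterator folds each pair into a single range, 48 - 3 = 45.
A minimal sketch of one such pair, again userspace-only with made-up names:
```
/*
 * Userspace sketch, not kernel code: bios 17-19 of the "ranges 45
 * segments 48" request above. Bios 18 and 19 are contiguous and adjacent
 * in the list, so req_get_discard_range() folds them into one range,
 * while a one-segment-per-bio count still reports three segments.
 */
#include <stdio.h>

struct fake_bio {
	unsigned long long sector;
	unsigned int sectors;
};

int main(void)
{
	const struct fake_bio bios[] = {
		{ 675594240, 1024 },	/* bio 17 */
		{ 188138496, 1024 },	/* bio 18 */
		{ 188139520, 1024 },	/* bio 19: starts at 188138496 + 1024 */
	};
	const int segments = 3;		/* one per bio */
	int ranges = 1;

	for (int i = 1; i < 3; i++)
		if (bios[i - 1].sector + bios[i - 1].sectors != bios[i].sector)
			ranges++;

	printf("ranges = %d, segments = %d\n", ranges, segments);
	return 0;
}
```
Applying the same rule to all 48 bios of that request yields the 45 ranges
reported in the warning.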
Thanks,
Ming
On Thu, Jul 29, 2021 at 11:33:33AM +0800, Ming Lei wrote:
> On Wed, Jul 28, 2021 at 06:38:36PM +0200, Oleksandr Natalenko wrote:
> > Hello.
> >
> > On středa 28. července 2021 17:53:05 CEST Ming Lei wrote:
> > > Can you collect debug log by applying the following patch against the
> > > last one?
> >
> > Yes, please see below.
> >
> > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > > index 8780e4aa9df2..fbd8a68c619b 100644
> > > --- a/drivers/nvme/host/core.c
> > > +++ b/drivers/nvme/host/core.c
> > > @@ -828,6 +828,24 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
> > > cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
> > > }
> > >
> > > +static inline void blk_dump_rq(const struct request *req)
> > > +{
> > > + struct bio *bio;
> > > + int i = 0;
> > > +
> > > + printk("dump req %p(f:%x, seg: %d)\n", req, req->cmd_flags,
> > > + req->nr_phys_segments);
> > > +
> > > + __rq_for_each_bio(bio, req) {
> > > + printk("%d-%p: %hx/%hx %llu %u\n",
> > > + i++, bio,
> > > + bio->bi_flags, bio->bi_opf,
> > > + (unsigned long long)bio->bi_iter.bi_sector,
> > > + bio->bi_iter.bi_size>>9);
> > > + }
> > > +}
> > > +
> > > +
> > > static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request
> > > *req, struct nvme_command *cmnd)
> > > {
> > > @@ -868,6 +886,8 @@ static blk_status_t nvme_setup_discard(struct nvme_ns
> > > *ns, struct request *req, }
> > >
> > > if (WARN_ON_ONCE(n != segments)) {
> > > + printk("%s: ranges %u segments %u\n", __func__, n, segments);
> > > + blk_dump_rq(req);
> > > if (virt_to_page(range) == ns->ctrl->discard_page)
> > > clear_bit_unlock(0, &ns->ctrl->discard_page_busy);
> > > else
> >
> > ```
> > WARNING: CPU: 17 PID: 821 at drivers/nvme/host/core.c:868 nvme_setup_discard+0x1c6/0x220
> > …
> > CPU: 17 PID: 821 Comm: kworker/17:1H Not tainted 5.13.0-pf4 #1
> > Hardware name: ASUS System Product Name/Pro WS X570-ACE, BIOS 3601 05/26/2021
> > Workqueue: kblockd blk_mq_run_work_fn
> > RIP: 0010:nvme_setup_discard+0x1c6/0x220
> > Code: 8b a0 40 0b 00 00 4c 2b 25 f7 ff d7 00 49 c1 fc 06 49 c1 e4 0c 4c 03 25 f8 ff d7 00 4c 89 e5 48 85 d2 0f 85 9b fe ff ff 31 d2 <0f> 0b 48 c7 c6 e0 a8 10 8b 41 0f b7 cd 48 c7 c7 af 09 40 8b e8 14
> > RSP: 0018:ffffafa884517bf0 EFLAGS: 00010297
> > RAX: ffff91602f5b20d0 RBX: ffff915e05743c00 RCX: 0000000000080000
> > RDX: 000000000000000f RSI: 0000000028445c00 RDI: 000000000000000f
> > RBP: ffff91602f5b2000 R08: 000000000b366000 R09: ffff91602f5b2000
> > R10: 000000000000002d R11: fefefefefefefeff R12: ffff91602f5b2000
> > R13: 000000000000000e R14: ffff915e18c77000 R15: ffff915e18c77148
> > FS: 0000000000000000(0000) GS:ffff91650ee40000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00001c90b36879a8 CR3: 000000010d18c000 CR4: 0000000000350ee0
> > Call Trace:
> > nvme_setup_cmd+0x2e4/0x6a0
> > nvme_queue_rq+0x79/0xc90
> > blk_mq_dispatch_rq_list+0x15c/0x810
> > __blk_mq_do_dispatch_sched+0xca/0x320
> > __blk_mq_sched_dispatch_requests+0x14d/0x190
> > blk_mq_sched_dispatch_requests+0x2f/0x60
> > blk_mq_run_work_fn+0x43/0xc0
> > process_one_work+0x24e/0x430
> > worker_thread+0x54/0x4d0
> > kthread+0x1b3/0x1e0
> > ret_from_fork+0x22/0x30
> > ---[ end trace bd51917eae1d7201 ]---
> > nvme_setup_discard: ranges 15 segments 14
> > dump req 000000002c6a085b(f:3, seg: 14)
> > 0-000000002c3297c7: f80/3 675773440 1024
> > 1-0000000098edb2a8: b80/3 188319744 1024
> > 2-00000000f58e3b18: b80/3 675775488 1024
> > 3-00000000f6670c5a: b80/3 188129280 2048
> > 4-00000000ea371a88: b80/3 675585024 2048
> > 5-00000000e9cec043: b80/3 188140544 2048
> > 6-000000006e1126e6: b80/3 675596288 2048
> > 7-000000009466f937: b80/3 188327936 2048
> > 8-000000003c9e2ccd: b80/3 675783680 2048
> > 9-00000000ab322c68: b80/3 188329984 2048
>
> 188329984 = 188327936 + 2048
>
> > 10-00000000eb2b3fb6: b80/3 675785728 2048
>
> 675785728 = 675783680 + 2048
>
> It seems the adjacent bios end up separated by another incoming discard
> range.
>
> It looks like supporting mixed discard IO merge (traditional IO merge vs.
> multi-range merge) isn't mature yet, so I will post the previous patch to
> fix this issue.
The reason is the following lines of code:
#define rq_hash_key(rq) (blk_rq_pos(rq) + blk_rq_sectors(rq))
if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
...
The current block merge code assumes that all bios in one request are physically
contiguous, so we can't simply support mixed merge for multi-range discard requests.
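As a toy illustration of that assumption (userspace sketch with hypothetical
sector numbers, not kernel code): once a discard request holds two non-adjacent
ranges, blk_rq_pos() + blk_rq_sectors() no longer points at the end of the
request, yet it is still what the hash key and the back-merge check above
compare an incoming bio's sector against:
```
/*
 * Toy illustration, not kernel code: rq_hash_key() is "start of request +
 * total sectors", which only marks the end of the request if all of its
 * bios are physically contiguous. All numbers below are hypothetical.
 */
#include <stdio.h>

int main(void)
{
	/* a multi-range discard request: ranges [100, 108) and [500, 508) */
	unsigned long long rq_pos = 100;	/* blk_rq_pos(): first bio's start */
	unsigned long long rq_sectors = 16;	/* blk_rq_sectors(): 8 + 8 */
	unsigned long long last_bio_end = 508;	/* real end of the last range */

	unsigned long long hash_key = rq_pos + rq_sectors;	/* rq_hash_key() */

	/* an incoming discard bio starting at sector 116 */
	unsigned long long bio_sector = 116;

	printf("hash key = %llu, last bio ends at %llu\n", hash_key, last_bio_end);
	printf("back-merge check matches: %d, physically adjacent: %d\n",
	       hash_key == bio_sector, last_bio_end == bio_sector);
	return 0;
}
```
So for a multi-range discard request these checks can match a sector that is
not adjacent to the request's real tail, which is why mixed merging falls
apart.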
Thanks,
Ming