2022-01-21 21:19:26

by Daniel Wagner

Subject: [PATCH 0/2] fix crash when reading /proc/diskstats

During nvme-tcp failover tests in an HA setup a crash occurred. iostat
was running in the background while the test was executed.

I am not sure if the fix is correct, i.e. whether using
blk_queue_enter() is the right thing. When staring at the code snippet
I saw that an older version used blk_queue_enter(), but it was
eventually replaced by rcu_read_lock(), as far as I understand at the
moment... So any tip here would be highly appreciated.

Daniel

Daniel Wagner (2):
block: remove commented out code from diskstats_show
block: hold queue lock while iterating in diskstats_show

block/genhd.c | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)

--
2.29.2


2022-01-21 21:19:38

by Daniel Wagner

Subject: [PATCH 2/2] block: hold queue lock while iterating in diskstats_show

The request queues can be freed while we operate on them. Make sure we
hold a reference when using blk_mq_queue_tag_busy_iter.

RIP: blk_mq_queue_tag_busy_iter
Call Trace:
? blk_mq_hctx_mark_pending
? diskstats_show
? blk_mq_hctx_mark_pending
blk_mq_in_flight
diskstats_show
? klist_next
seq_read
proc_reg_read
vfs_read
ksys_read
do_syscall_64
entry_SYSCALL_64_after_hwframe

Signed-off-by: Daniel Wagner <[email protected]>
---
block/genhd.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index c9d4386dd177..0e163055a4e6 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1167,10 +1167,13 @@ static int diskstats_show(struct seq_file *seqf, void *v)
if (bdev_is_partition(hd) && !bdev_nr_sectors(hd))
continue;
part_stat_read_all(hd, &stat);
+ if (blk_queue_enter(gp->queue, BLK_MQ_REQ_NOWAIT))
+ continue;
if (queue_is_mq(gp->queue))
inflight = blk_mq_in_flight(gp->queue, hd);
else
inflight = part_in_flight(hd);
+ blk_queue_exit(gp->queue);

seq_printf(seqf, "%4d %7d %pg "
"%lu %lu %lu %u "
--
2.29.2

2022-01-21 21:20:36

by Daniel Wagner

Subject: [PATCH 1/2] block: remove commented out code from diskstats_show

The diskstats format is ABI, so we just can't add a new header. The
code snippet has been commented out since at least v2.6.12-rc2.
Remove it.

Signed-off-by: Daniel Wagner <[email protected]>
---
block/genhd.c | 8 --------
1 file changed, 8 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 626c8406f21a..c9d4386dd177 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1162,14 +1162,6 @@ static int diskstats_show(struct seq_file *seqf, void *v)
struct disk_stats stat;
unsigned long idx;

- /*
- if (&disk_to_dev(gp)->kobj.entry == block_class.devices.next)
- seq_puts(seqf, "major minor name"
- " rio rmerge rsect ruse wio wmerge "
- "wsect wuse running use aveq"
- "\n\n");
- */
-
rcu_read_lock();
xa_for_each(&gp->part_tbl, idx, hd) {
if (bdev_is_partition(hd) && !bdev_nr_sectors(hd))
--
2.29.2

2022-01-21 21:22:40

by Hannes Reinecke

Subject: Re: [PATCH 1/2] block: remove commented out code from diskstats_show

On 1/20/22 11:52 AM, Daniel Wagner wrote:
> The diskstats format is ABI, so we just can't add a new header. The
> code snippet has been commented out since at least v2.6.12-rc2.
> Remove it.
>
> Signed-off-by: Daniel Wagner <[email protected]>
> ---
> block/genhd.c | 8 --------
> 1 file changed, 8 deletions(-)
>
> diff --git a/block/genhd.c b/block/genhd.c
> index 626c8406f21a..c9d4386dd177 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -1162,14 +1162,6 @@ static int diskstats_show(struct seq_file *seqf, void *v)
> struct disk_stats stat;
> unsigned long idx;
>
> - /*
> - if (&disk_to_dev(gp)->kobj.entry == block_class.devices.next)
> - seq_puts(seqf, "major minor name"
> - " rio rmerge rsect ruse wio wmerge "
> - "wsect wuse running use aveq"
> - "\n\n");
> - */
> -
> rcu_read_lock();
> xa_for_each(&gp->part_tbl, idx, hd) {
> if (bdev_is_partition(hd) && !bdev_nr_sectors(hd))
>
Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

2022-01-21 21:34:28

by Hannes Reinecke

Subject: Re: [PATCH 2/2] block: hold queue lock while iterating in diskstats_show

On 1/20/22 11:52 AM, Daniel Wagner wrote:
> The request queues can be freed while we operate on them. Make sure we
> hold a reference when using blk_mq_queue_tag_busy_iter.
>
> RIP: blk_mq_queue_tag_busy_iter
> Call Trace:
> ? blk_mq_hctx_mark_pending
> ? diskstats_show
> ? blk_mq_hctx_mark_pending
> blk_mq_in_flight
> diskstats_show
> ? klist_next
> seq_read
> proc_reg_read
> vfs_read
> ksys_read
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
>
> Signed-off-by: Daniel Wagner <[email protected]>
> ---
> block/genhd.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/block/genhd.c b/block/genhd.c
> index c9d4386dd177..0e163055a4e6 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -1167,10 +1167,13 @@ static int diskstats_show(struct seq_file *seqf, void *v)
> if (bdev_is_partition(hd) && !bdev_nr_sectors(hd))
> continue;
> part_stat_read_all(hd, &stat);
> + if (blk_queue_enter(gp->queue, BLK_MQ_REQ_NOWAIT))
> + continue;
> if (queue_is_mq(gp->queue))
> inflight = blk_mq_in_flight(gp->queue, hd);
> else
> inflight = part_in_flight(hd);
> + blk_queue_exit(gp->queue);
>
> seq_printf(seqf, "%4d %7d %pg "
> "%lu %lu %lu %u "
>
While this looks good, it seems to have been cut against an earlier
branch; the code in question looks different in my copy.

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

2022-01-21 21:59:50

by Daniel Wagner

Subject: Re: [PATCH 2/2] block: hold queue lock while iterating in diskstats_show

On Thu, Jan 20, 2022 at 12:18:24PM +0100, Hannes Reinecke wrote:
> On 1/20/22 11:52 AM, Daniel Wagner wrote:
> While this looks good, it seems to have been cut against an earlier
> branch; the code in question looks different in my copy.

This was cut against linux-block/for-next. Let me know which tree/branch
I should use.

2022-01-21 22:02:52

by Ming Lei

Subject: Re: [PATCH 2/2] block: hold queue lock while iterating in diskstats_show

On Thu, Jan 20, 2022 at 11:52:48AM +0100, Daniel Wagner wrote:
> The request queues can be freed while we operate on them. Make sure we

It shouldn't be freed here.

> hold a reference when using blk_mq_queue_tag_busy_iter.
>
> RIP: blk_mq_queue_tag_busy_iter

Any chance to post the whole stack trace?

> Call Trace:
> ? blk_mq_hctx_mark_pending
> ? diskstats_show
> ? blk_mq_hctx_mark_pending
> blk_mq_in_flight
> diskstats_show
> ? klist_next
> seq_read
> proc_reg_read
> vfs_read
> ksys_read
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
>
> Signed-off-by: Daniel Wagner <[email protected]>
> ---
> block/genhd.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/block/genhd.c b/block/genhd.c
> index c9d4386dd177..0e163055a4e6 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -1167,10 +1167,13 @@ static int diskstats_show(struct seq_file *seqf, void *v)
> if (bdev_is_partition(hd) && !bdev_nr_sectors(hd))
> continue;
> part_stat_read_all(hd, &stat);
> + if (blk_queue_enter(gp->queue, BLK_MQ_REQ_NOWAIT))
> + continue;

The disk device's refcnt is supposed to be grabbed by the class device
iterator code, so the queue shouldn't be released in diskstats_show().

Not to mention, blk_queue_enter() won't prevent the disk from being released...

thanks,
Ming

2022-01-21 22:07:45

by Daniel Wagner

Subject: Re: [PATCH 2/2] block: hold queue lock while iterating in diskstats_show

Hi Ming,

On Thu, Jan 20, 2022 at 08:56:03PM +0800, Ming Lei wrote:
> On Thu, Jan 20, 2022 at 11:52:48AM +0100, Daniel Wagner wrote:
> > The request queues can be freed while we operate on them. Make sure we
>
> It shouldn't be freed here.

No, that is not what I tried to say. We somehow end up here operating on
pointers which are invalid.

> > hold a reference when using blk_mq_queue_tag_busy_iter.
> >
> > RIP: blk_mq_queue_tag_busy_iter
>
> Any chance to pose the whole stack trace?

[323465.820735] nvme nvme3: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 100.102.253.10:8009
[323465.821654] nvme nvme3: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[323465.850291] nvme nvme3: creating 80 I/O queues.
[323465.974196] nvme nvme3: mapped 80/0/0 default/read/poll queues.
[323466.031107] nvme nvme3: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:a2ee06317d1845EE5227", addr 100.102.251.11:4420
[323466.093520] nvme nvme4: creating 80 I/O queues.
[323466.131740] nvme nvme4: mapped 80/0/0 default/read/poll queues.
[323466.194742] nvme nvme4: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:a2ee06317d1845EE5227", addr 100.102.252.12:4420
[323466.207338] nvme nvme5: creating 80 I/O queues.
[323466.219473] nvme nvme5: mapped 80/0/0 default/read/poll queues.
[323466.256877] nvme nvme5: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:a2ee06317d1845EE5227", addr 100.102.253.12:4420
[323466.295990] nvme nvme6: creating 80 I/O queues.
[323466.308171] nvme nvme6: mapped 80/0/0 default/read/poll queues.
[323466.359055] nvme nvme6: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:a2ee06317d1845EE5227", addr 100.102.251.12:4420
[323466.366912] nvme nvme7: creating 80 I/O queues.
[323466.379132] nvme nvme7: mapped 80/0/0 default/read/poll queues.
[323466.432957] nvme nvme7: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:a2ee06317d1845EE5227", addr 100.102.252.11:4420
[323466.458548] nvme nvme8: creating 80 I/O queues.
[323466.474414] nvme nvme8: mapped 80/0/0 default/read/poll queues.
[323466.548254] nvme nvme8: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:a2ee06317d1845EE5227", addr 100.102.253.11:4420
[323467.255527] BUG: kernel NULL pointer dereference, address: 0000000000000000
[323467.255538] #PF: supervisor read access in kernel mode
[323467.255541] #PF: error_code(0x0000) - not-present page
[323467.255544] PGD 0 P4D 0
[323467.255550] Oops: 0000 [#1] SMP PTI
[323467.255555] CPU: 13 PID: 17640 Comm: iostat Kdump: loaded Tainted: G IOE X 5.3.18-59.27-default #1 SLE15-SP3
[323467.255559] Hardware name: Dell Inc. PowerEdge R940xa/08XR9M, BIOS 2.12.2 07/12/2021
[323467.255569] RIP: 0010:blk_mq_queue_tag_busy_iter+0x1e4/0x2e0
[323467.255573] Code: 89 44 24 48 41 8b 1f 85 db 0f 84 d1 fe ff ff 4c 89 6c 24 20 4c 89 74 24 28 89 e8 44 29 e3 48 8d 04 40 48 c1 e0 06 49 03 47 10 <48> 8b 10 39 d3 0f 43 da 48 8b 90 80 00 00 00 41 01 dc 48 f7 d2 48
[323467.255583] RSP: 0018:ffffb701aefb7c20 EFLAGS: 00010246
[323467.255587] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[323467.255592] RDX: ffff954f43c14c00 RSI: ffffffffbaa442c0 RDI: ffff954fb3749010
[323467.255597] RBP: 0000000000000000 R08: 0000000800000000 R09: 00000008ffffffff
[323467.255602] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[323467.255607] R13: ffff9567b2200000 R14: ffff954798c09bc0 R15: ffff954798c09c20
[323467.255612] FS: 00007f6032444580(0000) GS:ffff956fbf6c0000(0000) knlGS:0000000000000000
[323467.255616] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[323467.255618] CR2: 0000000000000000 CR3: 00000037bc7c0001 CR4: 00000000007726e0
[323467.255621] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[323467.255624] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[323467.255626] PKRU: 55555554
[323467.255628] Call Trace:
[323467.255641] ? blk_mq_hctx_mark_pending+0x70/0x70
[323467.255647] ? diskstats_show+0x207/0x2a0
[323467.255651] ? blk_mq_hctx_mark_pending+0x70/0x70
[323467.255654] blk_mq_in_flight+0x35/0x60
[323467.255659] diskstats_show+0x6d/0x2a0
[323467.255669] ? klist_next+0x88/0x110
[323467.255676] seq_read+0x2a6/0x3f0
[323467.255682] proc_reg_read+0x39/0x60
[323467.255690] vfs_read+0x89/0x140
[323467.255693] ksys_read+0xa1/0xe0
[323467.255702] do_syscall_64+0x5b/0x1e0
[323467.255709] entry_SYSCALL_64_after_hwframe+0x44/0xa9

> The disk device's refcnt is supposed to be grabbed by class device
> iterator code, so queue shouldn't be released in diskstats_show().

Okay, but it's not the disk device which is invalid, as far as I
understand. It's the request_queue which is bogus.

> Not to mention, blk_queue_enter() won't prevent the disk from being released...

Understood.

Daniel

2022-01-21 22:12:14

by Ming Lei

Subject: Re: [PATCH 2/2] block: hold queue lock while iterating in diskstats_show

On Thu, Jan 20, 2022 at 02:19:36PM +0100, Daniel Wagner wrote:
> Hi Ming,
>
> On Thu, Jan 20, 2022 at 08:56:03PM +0800, Ming Lei wrote:
> [323467.255527] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [323467.255538] #PF: supervisor read access in kernel mode
> [323467.255541] #PF: error_code(0x0000) - not-present page
> [323467.255544] PGD 0 P4D 0
> [323467.255550] Oops: 0000 [#1] SMP PTI
> [323467.255555] CPU: 13 PID: 17640 Comm: iostat Kdump: loaded Tainted: G IOE X 5.3.18-59.27-default #1 SLE15-SP3
> [323467.255559] Hardware name: Dell Inc. PowerEdge R940xa/08XR9M, BIOS 2.12.2 07/12/2021
> [323467.255569] RIP: 0010:blk_mq_queue_tag_busy_iter+0x1e4/0x2e0

Then can you figure out where blk_mq_queue_tag_busy_iter+0x1e4 points to
in the source code? And what is the NULL pointer?

With this kind of info, we may know the root cause.

Thanks,
Ming

2022-01-21 22:13:08

by Daniel Wagner

Subject: Re: [PATCH 2/2] block: hold queue lock while iterating in diskstats_show

On Thu, Jan 20, 2022 at 09:51:18PM +0800, Ming Lei wrote:
> Then can you figure out where blk_mq_queue_tag_busy_iter+0x1e4 points to
> in the source code? And what is the NULL pointer?

Here we go:

/usr/src/debug/kernel-default-5.3.18-59.27.1.x86_64/linux-5.3/linux-obj/../include/linux/sbitmap.h: 249
0xffffffffb244a254 <blk_mq_queue_tag_busy_iter+484>: mov (%rax),%rdx

234static inline void __sbitmap_for_each_set(struct sbitmap *sb,
235 unsigned int start,
236 sb_for_each_fn fn, void *data)
237{
238 unsigned int index;
239 unsigned int nr;
240 unsigned int scanned = 0;
241
242 if (start >= sb->depth)
243 start = 0;
244 index = SB_NR_TO_INDEX(sb, start);
245 nr = SB_NR_TO_BIT(sb, start);
246
247 while (scanned < sb->depth) {
248 unsigned long word;
249 unsigned int depth = min_t(unsigned int,
250 sb->map[index].depth - nr,
251 sb->depth - scanned);


Attachments:
blk_mq_queue_tag_busy_iter.dis (14.91 kB)

2022-01-21 22:13:29

by Daniel Wagner

Subject: Re: [PATCH 2/2] block: hold queue lock while iterating in diskstats_show

On Thu, Jan 20, 2022 at 03:01:27PM +0100, Daniel Wagner wrote:
> On Thu, Jan 20, 2022 at 09:51:18PM +0800, Ming Lei wrote:
> > Then can you figure out where blk_mq_queue_tag_busy_iter+0x1e4 points to
> > in the source code? And what is the NULL pointer?

Yes, %rax is NULL.

> Here we go:
>
> /usr/src/debug/kernel-default-5.3.18-59.27.1.x86_64/linux-5.3/linux-obj/../include/linux/sbitmap.h: 249
> 0xffffffffb244a254 <blk_mq_queue_tag_busy_iter+484>: mov (%rax),%rdx
>
> 234static inline void __sbitmap_for_each_set(struct sbitmap *sb,
> 235 unsigned int start,
> 236 sb_for_each_fn fn, void *data)
> 237{
> 238 unsigned int index;
> 239 unsigned int nr;
> 240 unsigned int scanned = 0;
> 241
> 242 if (start >= sb->depth)
> 243 start = 0;
> 244 index = SB_NR_TO_INDEX(sb, start);
> 245 nr = SB_NR_TO_BIT(sb, start);
> 246
> 247 while (scanned < sb->depth) {
> 248 unsigned long word;
> 249 unsigned int depth = min_t(unsigned int,
> 250 sb->map[index].depth - nr,
> 251 sb->depth - scanned);


Forgot to add this:

crash> bt
PID: 17640 TASK: ffff956f4a468000 CPU: 13 COMMAND: "iostat"
#0 [ffffb701aefb7980] machine_kexec at ffffffffba66fb91
#1 [ffffb701aefb79d8] __crash_kexec at ffffffffba75927d
#2 [ffffb701aefb7aa0] crash_kexec at ffffffffba75a13d
#3 [ffffb701aefb7ab8] oops_end at ffffffffba636cdf
#4 [ffffb701aefb7ad8] no_context at ffffffffba682baf
#5 [ffffb701aefb7b40] do_page_fault at ffffffffba683e30
#6 [ffffb701aefb7b70] page_fault at ffffffffbb0012fe
[exception RIP: blk_mq_queue_tag_busy_iter+484]
RIP: ffffffffbaa4a254 RSP: ffffb701aefb7c20 RFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
RDX: ffff954f43c14c00 RSI: ffffffffbaa442c0 RDI: ffff954fb3749010
RBP: 0000000000000000 R8: 0000000800000000 R9: 00000008ffffffff
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffff9567b2200000 R14: ffff954798c09bc0 R15: ffff954798c09c20
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffffb701aefb7cb0] blk_mq_in_flight at ffffffffbaa461e5
#8 [ffffb701aefb7cd0] diskstats_show at ffffffffbaa4f00d
#9 [ffffb701aefb7e50] seq_read at ffffffffba90df16
#10 [ffffb701aefb7eb0] proc_reg_read at ffffffffba96d789
#11 [ffffb701aefb7ec8] vfs_read at ffffffffba8e4c39
#12 [ffffb701aefb7ef8] ksys_read at ffffffffba8e4fc1
#13 [ffffb701aefb7f38] do_syscall_64 at ffffffffba60538b
#14 [ffffb701aefb7f50] entry_SYSCALL_64_after_hwframe at ffffffffbb00008c
RIP: 00007f6031f4eb5e RSP: 00007ffd187a7e88 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 00005577a698c2c0 RCX: 00007f6031f4eb5e
RDX: 0000000000000400 RSI: 00005577a698f480 RDI: 0000000000000003
RBP: 00007f603222e300 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000a
R13: 0000000000000d68 R14: 00007f603222d700 R15: 0000000000000d68
ORIG_RAX: 0000000000000000 CS: 0033 SS: 002b


I've tried to figure out the request_queue pointer from the registers
and I think the pointer is still in %rdi.

struct request_queue {
last_merge = 0x0,
elevator = 0x0,
stats = 0xffff956f45a9bec0,
rq_qos = 0xffff954f54c57558,
mq_ops = 0xffffffffc0c27140 <nvme_tcp_mq_ops>,
queue_ctx = 0x4151cf2266c0,
queue_depth = 0x0,
queue_hw_ctx = 0xffff954f43c14c00,
nr_hw_queues = 0x50,
backing_dev_info = 0xffff953fae3ae800,
queuedata = 0xffff953622282800,
queue_flags = 0x5041d0,


Attachments:
request_queue.log (16.25 kB)

2022-01-21 22:14:42

by Christoph Hellwig

Subject: Re: [PATCH 2/2] block: hold queue lock while iterating in diskstats_show

On Thu, Jan 20, 2022 at 03:01:26PM +0100, Daniel Wagner wrote:
> On Thu, Jan 20, 2022 at 09:51:18PM +0800, Ming Lei wrote:
> > Then can you figure out where blk_mq_queue_tag_busy_iter+0x1e4 points to
> > in the source code? And what is the NULL pointer?
>
> Here we go:
>
> /usr/src/debug/kernel-default-5.3.18-59.27.1.x86_64/linux-5.3/linux-obj/../include/linux/sbitmap.h: 249

That looks quite old. How does this look on mainline?

2022-01-21 22:17:09

by Daniel Wagner

Subject: Re: [PATCH 2/2] block: hold queue lock while iterating in diskstats_show

On Thu, Jan 20, 2022 at 06:21:26AM -0800, Christoph Hellwig wrote:
> That looks quite old. How does this look on mainline?

I am still trying to reproduce it locally (on our SUSE kernels and on
mainline). So far no luck using nvme-fc. I'll try now with nvme-tcp;
maybe it's a transport thing.

2022-01-21 22:22:39

by Ming Lei

Subject: Re: [PATCH 2/2] block: hold queue lock while iterating in diskstats_show

On Thu, Jan 20, 2022 at 03:01:26PM +0100, Daniel Wagner wrote:
> On Thu, Jan 20, 2022 at 09:51:18PM +0800, Ming Lei wrote:
> > Then can you figure out where blk_mq_queue_tag_busy_iter+0x1e4 points to
> > in the source code? And what is the NULL pointer?
>
> Here we go:
>
> /usr/src/debug/kernel-default-5.3.18-59.27.1.x86_64/linux-5.3/linux-obj/../include/linux/sbitmap.h: 249
> 0xffffffffb244a254 <blk_mq_queue_tag_busy_iter+484>: mov (%rax),%rdx
>
> 234static inline void __sbitmap_for_each_set(struct sbitmap *sb,
> 235 unsigned int start,
> 236 sb_for_each_fn fn, void *data)
> 237{
> 238 unsigned int index;
> 239 unsigned int nr;
> 240 unsigned int scanned = 0;
> 241
> 242 if (start >= sb->depth)
> 243 start = 0;
> 244 index = SB_NR_TO_INDEX(sb, start);
> 245 nr = SB_NR_TO_BIT(sb, start);
> 246
> 247 while (scanned < sb->depth) {
> 248 unsigned long word;
> 249 unsigned int depth = min_t(unsigned int,
> 250 sb->map[index].depth - nr,
> 251 sb->depth - scanned);
>

Seems more likely an nvme recovery issue: tags is changed even though
->q_usage_counter is grabbed in blk_mq_queue_tag_busy_iter(), which was
added in v4.19.

You may check dmesg log and see if there is any nvme error recovery info.

Thanks,
Ming

2022-01-21 22:25:16

by Daniel Wagner

Subject: Re: [PATCH 2/2] block: hold queue lock while iterating in diskstats_show

On Thu, Jan 20, 2022 at 11:37:00PM +0800, Ming Lei wrote:
> Seems more likely an nvme recovery issue: tags is changed even though
> ->q_usage_counter is grabbed in blk_mq_queue_tag_busy_iter(), which was
> added in v4.19.
>
> You may check dmesg log and see if there is any nvme error recovery info.

Thanks for the tip. It's likely that a recovery is running although
there is no entry in the logs. At least I have an idea what could be the
problem here.

2022-03-04 17:22:04

by Daniel Wagner

Subject: Re: [PATCH 1/2] block: remove commented out code from diskstats_show

On Thu, Jan 20, 2022 at 12:02:49PM +0100, Hannes Reinecke wrote:
> On 1/20/22 11:52 AM, Daniel Wagner wrote:
> > The diskstats format is ABI, so we just can't add a new header. The
> > code snippet has been commented out since at least v2.6.12-rc2.
> > Remove it.
>
> Reviewed-by: Hannes Reinecke <[email protected]>

Any chance to queue this patch up?

2022-03-04 18:37:25

by Daniel Wagner

Subject: Re: [PATCH 2/2] block: hold queue lock while iterating in diskstats_show

On Thu, Jan 20, 2022 at 05:29:50PM +0100, Daniel Wagner wrote:
> On Thu, Jan 20, 2022 at 11:37:00PM +0800, Ming Lei wrote:
> > Seems more likely an nvme recovery issue: tags is changed even though
> > ->q_usage_counter is grabbed in blk_mq_queue_tag_busy_iter(), which was
> > added in v4.19.
> >
> > You may check dmesg log and see if there is any nvme error recovery info.
>
> Thanks for the tip. It's likely that a recovery is running although
> there is no entry in the logs. At least I have an idea what could be the
> problem here.

FWIW, a846a8e6c9a5 ("blk-mq: don't free tags if the tag_set is used by
other device in queue initialztion") was missing from our tree.