2020-09-17 11:12:20

by Tianxianting

[permalink] [raw]
Subject: [PATCH] [v2] blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps()

We found it takes more time of blk_mq_alloc_rq_maps() in kernel space when
testing nvme hot-plugging. The test and anlysis as below.

Debug code,
1, blk_mq_alloc_rq_maps():
u64 start, end;
depth = set->queue_depth;
start = ktime_get_ns();
pr_err("[%d:%s switch:%ld,%ld] queue depth %d, nr_hw_queues %d\n",
current->pid, current->comm, current->nvcsw, current->nivcsw,
set->queue_depth, set->nr_hw_queues);
do {
err = __blk_mq_alloc_rq_maps(set);
if (!err)
break;

set->queue_depth >>= 1;
if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
err = -ENOMEM;
break;
}
} while (set->queue_depth);
end = ktime_get_ns();
pr_err("[%d:%s switch:%ld,%ld] all hw queues init cost time %lld ns\n",
current->pid, current->comm,
current->nvcsw, current->nivcsw, end - start);

2, __blk_mq_alloc_rq_maps():
u64 start, end;
for (i = 0; i < set->nr_hw_queues; i++) {
start = ktime_get_ns();
if (!__blk_mq_alloc_rq_map(set, i))
goto out_unwind;
end = ktime_get_ns();
pr_err("hw queue %d init cost time %lld\n", i, end - start);
}

Test nvme hot-plugging with above debug code, we found it totally cost more
than 3ms in kernel space without being scheduled out when alloc rqs for all
16 hw queues with depth 1024, each hw queue cost about 140-250us. The time
cost will be increased with hw queue number and queue depth increasing. And
if __blk_mq_alloc_rq_maps() returns -ENOMEM, it will try "queue_depth >>= 1",
more time will be consumed.
[ 428.428771] nvme nvme0: pci function 10000:01:00.0
[ 428.428798] nvme 10000:01:00.0: enabling device (0000 -> 0002)
[ 428.428806] pcieport 10000:00:00.0: can't derive routing for PCI INT A
[ 428.428809] nvme 10000:01:00.0: PCI INT A: no GSI
[ 432.593374] [4688:kworker/u33:8 switch:663,2] queue depth 30, nr_hw_queues 1
[ 432.593404] hw queue 0 init cost time 22883 ns
[ 432.593408] [4688:kworker/u33:8 switch:663,2] all hw queues init cost time 35960 ns
[ 432.595953] nvme nvme0: 16/0/0 default/read/poll queues
[ 432.595958] [4688:kworker/u33:8 switch:700,2] queue depth 1023, nr_hw_queues 16
[ 432.596203] hw queue 0 init cost time 242630 ns
[ 432.596441] hw queue 1 init cost time 235913 ns
[ 432.596659] hw queue 2 init cost time 216461 ns
[ 432.596877] hw queue 3 init cost time 215851 ns
[ 432.597107] hw queue 4 init cost time 228406 ns
[ 432.597336] hw queue 5 init cost time 227298 ns
[ 432.597564] hw queue 6 init cost time 224633 ns
[ 432.597785] hw queue 7 init cost time 219954 ns
[ 432.597937] hw queue 8 init cost time 150930 ns
[ 432.598082] hw queue 9 init cost time 143496 ns
[ 432.598231] hw queue 10 init cost time 147261 ns
[ 432.598397] hw queue 11 init cost time 164522 ns
[ 432.598542] hw queue 12 init cost time 143401 ns
[ 432.598692] hw queue 13 init cost time 148934 ns
[ 432.598841] hw queue 14 init cost time 147194 ns
[ 432.598991] hw queue 15 init cost time 148942 ns
[ 432.598993] [4688:kworker/u33:8 switch:700,2] all hw queues init cost time 3035099 ns
[ 432.602611] nvme0n1: p1

So use this patch to trigger schedule between each hw queue init, to avoid
other threads getting stuck. We call cond_resched() only when
"queue depth >= 512". We are not in atomic context when executing
__blk_mq_alloc_rq_maps(), so it is safe to call cond_resched().

Signed-off-by: Xianting Tian <[email protected]>
---
block/blk-mq.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b3d2785ee..5a71fe53a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3255,11 +3255,16 @@ void blk_mq_exit_queue(struct request_queue *q)
static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
{
int i;
+ unsigned int depth = set->queue_depth;

- for (i = 0; i < set->nr_hw_queues; i++)
+ for (i = 0; i < set->nr_hw_queues; i++) {
if (!__blk_mq_alloc_map_and_request(set, i))
goto out_unwind;

+ if (depth >= 512)
+ cond_resched();
+ }
+
return 0;

out_unwind:
--
2.17.1


2020-09-25 10:15:17

by Tianxianting

[permalink] [raw]
Subject: RE: [PATCH] [v2] blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps()

Hi Jens
Could I get your comments for the issue? Whether it is a problem?
Maybe my understanding is superficial, highly appreciated if you can comment a bit.

-----Original Message-----
From: tianxianting (RD)
Sent: Thursday, September 17, 2020 4:13 PM
To: [email protected]
Cc: [email protected]; [email protected]; tianxianting (RD) <[email protected]>
Subject: [PATCH] [v2] blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps()

We found it takes more time of blk_mq_alloc_rq_maps() in kernel space when testing nvme hot-plugging. The test and anlysis as below.

Debug code,
1, blk_mq_alloc_rq_maps():
u64 start, end;
depth = set->queue_depth;
start = ktime_get_ns();
pr_err("[%d:%s switch:%ld,%ld] queue depth %d, nr_hw_queues %d\n",
current->pid, current->comm, current->nvcsw, current->nivcsw,
set->queue_depth, set->nr_hw_queues);
do {
err = __blk_mq_alloc_rq_maps(set);
if (!err)
break;

set->queue_depth >>= 1;
if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
err = -ENOMEM;
break;
}
} while (set->queue_depth);
end = ktime_get_ns();
pr_err("[%d:%s switch:%ld,%ld] all hw queues init cost time %lld ns\n",
current->pid, current->comm,
current->nvcsw, current->nivcsw, end - start);

2, __blk_mq_alloc_rq_maps():
u64 start, end;
for (i = 0; i < set->nr_hw_queues; i++) {
start = ktime_get_ns();
if (!__blk_mq_alloc_rq_map(set, i))
goto out_unwind;
end = ktime_get_ns();
pr_err("hw queue %d init cost time %lld\n", i, end - start);
}

Test nvme hot-plugging with above debug code, we found it totally cost more than 3ms in kernel space without being scheduled out when alloc rqs for all
16 hw queues with depth 1024, each hw queue cost about 140-250us. The time cost will be increased with hw queue number and queue depth increasing. And if __blk_mq_alloc_rq_maps() returns -ENOMEM, it will try "queue_depth >>= 1", more time will be consumed.
[ 428.428771] nvme nvme0: pci function 10000:01:00.0
[ 428.428798] nvme 10000:01:00.0: enabling device (0000 -> 0002)
[ 428.428806] pcieport 10000:00:00.0: can't derive routing for PCI INT A
[ 428.428809] nvme 10000:01:00.0: PCI INT A: no GSI
[ 432.593374] [4688:kworker/u33:8 switch:663,2] queue depth 30, nr_hw_queues 1
[ 432.593404] hw queue 0 init cost time 22883 ns
[ 432.593408] [4688:kworker/u33:8 switch:663,2] all hw queues init cost time 35960 ns
[ 432.595953] nvme nvme0: 16/0/0 default/read/poll queues
[ 432.595958] [4688:kworker/u33:8 switch:700,2] queue depth 1023, nr_hw_queues 16
[ 432.596203] hw queue 0 init cost time 242630 ns
[ 432.596441] hw queue 1 init cost time 235913 ns
[ 432.596659] hw queue 2 init cost time 216461 ns
[ 432.596877] hw queue 3 init cost time 215851 ns
[ 432.597107] hw queue 4 init cost time 228406 ns
[ 432.597336] hw queue 5 init cost time 227298 ns
[ 432.597564] hw queue 6 init cost time 224633 ns
[ 432.597785] hw queue 7 init cost time 219954 ns
[ 432.597937] hw queue 8 init cost time 150930 ns
[ 432.598082] hw queue 9 init cost time 143496 ns
[ 432.598231] hw queue 10 init cost time 147261 ns
[ 432.598397] hw queue 11 init cost time 164522 ns
[ 432.598542] hw queue 12 init cost time 143401 ns
[ 432.598692] hw queue 13 init cost time 148934 ns
[ 432.598841] hw queue 14 init cost time 147194 ns
[ 432.598991] hw queue 15 init cost time 148942 ns
[ 432.598993] [4688:kworker/u33:8 switch:700,2] all hw queues init cost time 3035099 ns
[ 432.602611] nvme0n1: p1

So use this patch to trigger schedule between each hw queue init, to avoid other threads getting stuck. We call cond_resched() only when "queue depth >= 512". We are not in atomic context when executing __blk_mq_alloc_rq_maps(), so it is safe to call cond_resched().

Signed-off-by: Xianting Tian <[email protected]>
---
block/blk-mq.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c index b3d2785ee..5a71fe53a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3255,11 +3255,16 @@ void blk_mq_exit_queue(struct request_queue *q) static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set) {
int i;
+ unsigned int depth = set->queue_depth;

- for (i = 0; i < set->nr_hw_queues; i++)
+ for (i = 0; i < set->nr_hw_queues; i++) {
if (!__blk_mq_alloc_map_and_request(set, i))
goto out_unwind;

+ if (depth >= 512)
+ cond_resched();
+ }
+
return 0;

out_unwind:
--
2.17.1

2020-09-25 21:06:43

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] [v2] blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps()

On 9/17/20 2:13 AM, Xianting Tian wrote:
> We found it takes more time of blk_mq_alloc_rq_maps() in kernel space when
> testing nvme hot-plugging. The test and anlysis as below.
>
> Debug code,
> 1, blk_mq_alloc_rq_maps():
> u64 start, end;
> depth = set->queue_depth;
> start = ktime_get_ns();
> pr_err("[%d:%s switch:%ld,%ld] queue depth %d, nr_hw_queues %d\n",
> current->pid, current->comm, current->nvcsw, current->nivcsw,
> set->queue_depth, set->nr_hw_queues);
> do {
> err = __blk_mq_alloc_rq_maps(set);
> if (!err)
> break;
>
> set->queue_depth >>= 1;
> if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
> err = -ENOMEM;
> break;
> }
> } while (set->queue_depth);
> end = ktime_get_ns();
> pr_err("[%d:%s switch:%ld,%ld] all hw queues init cost time %lld ns\n",
> current->pid, current->comm,
> current->nvcsw, current->nivcsw, end - start);
>
> 2, __blk_mq_alloc_rq_maps():
> u64 start, end;
> for (i = 0; i < set->nr_hw_queues; i++) {
> start = ktime_get_ns();
> if (!__blk_mq_alloc_rq_map(set, i))
> goto out_unwind;
> end = ktime_get_ns();
> pr_err("hw queue %d init cost time %lld\n", i, end - start);
> }
>
> Test nvme hot-plugging with above debug code, we found it totally cost more
> than 3ms in kernel space without being scheduled out when alloc rqs for all
> 16 hw queues with depth 1024, each hw queue cost about 140-250us. The time
> cost will be increased with hw queue number and queue depth increasing. And
> if __blk_mq_alloc_rq_maps() returns -ENOMEM, it will try "queue_depth >>= 1",
> more time will be consumed.
> [ 428.428771] nvme nvme0: pci function 10000:01:00.0
> [ 428.428798] nvme 10000:01:00.0: enabling device (0000 -> 0002)
> [ 428.428806] pcieport 10000:00:00.0: can't derive routing for PCI INT A
> [ 428.428809] nvme 10000:01:00.0: PCI INT A: no GSI
> [ 432.593374] [4688:kworker/u33:8 switch:663,2] queue depth 30, nr_hw_queues 1
> [ 432.593404] hw queue 0 init cost time 22883 ns
> [ 432.593408] [4688:kworker/u33:8 switch:663,2] all hw queues init cost time 35960 ns
> [ 432.595953] nvme nvme0: 16/0/0 default/read/poll queues
> [ 432.595958] [4688:kworker/u33:8 switch:700,2] queue depth 1023, nr_hw_queues 16
> [ 432.596203] hw queue 0 init cost time 242630 ns
> [ 432.596441] hw queue 1 init cost time 235913 ns
> [ 432.596659] hw queue 2 init cost time 216461 ns
> [ 432.596877] hw queue 3 init cost time 215851 ns
> [ 432.597107] hw queue 4 init cost time 228406 ns
> [ 432.597336] hw queue 5 init cost time 227298 ns
> [ 432.597564] hw queue 6 init cost time 224633 ns
> [ 432.597785] hw queue 7 init cost time 219954 ns
> [ 432.597937] hw queue 8 init cost time 150930 ns
> [ 432.598082] hw queue 9 init cost time 143496 ns
> [ 432.598231] hw queue 10 init cost time 147261 ns
> [ 432.598397] hw queue 11 init cost time 164522 ns
> [ 432.598542] hw queue 12 init cost time 143401 ns
> [ 432.598692] hw queue 13 init cost time 148934 ns
> [ 432.598841] hw queue 14 init cost time 147194 ns
> [ 432.598991] hw queue 15 init cost time 148942 ns
> [ 432.598993] [4688:kworker/u33:8 switch:700,2] all hw queues init cost time 3035099 ns
> [ 432.602611] nvme0n1: p1
>
> So use this patch to trigger schedule between each hw queue init, to avoid
> other threads getting stuck. We call cond_resched() only when
> "queue depth >= 512". We are not in atomic context when executing
> __blk_mq_alloc_rq_maps(), so it is safe to call cond_resched().
>
> Signed-off-by: Xianting Tian <[email protected]>
> ---
> block/blk-mq.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b3d2785ee..5a71fe53a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3255,11 +3255,16 @@ void blk_mq_exit_queue(struct request_queue *q)
> static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
> {
> int i;
> + unsigned int depth = set->queue_depth;
>
> - for (i = 0; i < set->nr_hw_queues; i++)
> + for (i = 0; i < set->nr_hw_queues; i++) {
> if (!__blk_mq_alloc_map_and_request(set, i))
> goto out_unwind;
>
> + if (depth >= 512)
> + cond_resched();
> + }
> +
> return 0;

I generally dislike

if (some_num)
cond_resched();

particularly when it's not a true hot path. How about just making the
cond_resched() unconditional? I suspect that will be just fine.

--
Jens Axboe

2020-09-26 04:33:47

by Tianxianting

[permalink] [raw]
Subject: RE: [PATCH] [v2] blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps()

Hi Jens
Thanks a lot for the comments,
I think it is not hot path, it is only called when system startup or device hot-plugging.
So I submitted V3 patch for you reviewing :)
https://lkml.org/lkml/2020/9/25/1543

-----Original Message-----
From: Jens Axboe [mailto:[email protected]]
Sent: Saturday, September 26, 2020 3:26 AM
To: tianxianting (RD) <[email protected]>
Cc: [email protected]; [email protected]
Subject: Re: [PATCH] [v2] blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps()

On 9/17/20 2:13 AM, Xianting Tian wrote:
> We found it takes more time of blk_mq_alloc_rq_maps() in kernel space
> when testing nvme hot-plugging. The test and anlysis as below.
>
> Debug code,
> 1, blk_mq_alloc_rq_maps():
> u64 start, end;
> depth = set->queue_depth;
> start = ktime_get_ns();
> pr_err("[%d:%s switch:%ld,%ld] queue depth %d, nr_hw_queues %d\n",
> current->pid, current->comm, current->nvcsw, current->nivcsw,
> set->queue_depth, set->nr_hw_queues);
> do {
> err = __blk_mq_alloc_rq_maps(set);
> if (!err)
> break;
>
> set->queue_depth >>= 1;
> if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
> err = -ENOMEM;
> break;
> }
> } while (set->queue_depth);
> end = ktime_get_ns();
> pr_err("[%d:%s switch:%ld,%ld] all hw queues init cost time %lld ns\n",
> current->pid, current->comm,
> current->nvcsw, current->nivcsw, end - start);
>
> 2, __blk_mq_alloc_rq_maps():
> u64 start, end;
> for (i = 0; i < set->nr_hw_queues; i++) {
> start = ktime_get_ns();
> if (!__blk_mq_alloc_rq_map(set, i))
> goto out_unwind;
> end = ktime_get_ns();
> pr_err("hw queue %d init cost time %lld\n", i, end - start);
> }
>
> Test nvme hot-plugging with above debug code, we found it totally cost
> more than 3ms in kernel space without being scheduled out when alloc
> rqs for all
> 16 hw queues with depth 1024, each hw queue cost about 140-250us. The
> time cost will be increased with hw queue number and queue depth
> increasing. And if __blk_mq_alloc_rq_maps() returns -ENOMEM, it will
> try "queue_depth >>= 1", more time will be consumed.
> [ 428.428771] nvme nvme0: pci function 10000:01:00.0
> [ 428.428798] nvme 10000:01:00.0: enabling device (0000 -> 0002)
> [ 428.428806] pcieport 10000:00:00.0: can't derive routing for PCI INT A
> [ 428.428809] nvme 10000:01:00.0: PCI INT A: no GSI
> [ 432.593374] [4688:kworker/u33:8 switch:663,2] queue depth 30, nr_hw_queues 1
> [ 432.593404] hw queue 0 init cost time 22883 ns
> [ 432.593408] [4688:kworker/u33:8 switch:663,2] all hw queues init cost time 35960 ns
> [ 432.595953] nvme nvme0: 16/0/0 default/read/poll queues
> [ 432.595958] [4688:kworker/u33:8 switch:700,2] queue depth 1023, nr_hw_queues 16
> [ 432.596203] hw queue 0 init cost time 242630 ns
> [ 432.596441] hw queue 1 init cost time 235913 ns
> [ 432.596659] hw queue 2 init cost time 216461 ns
> [ 432.596877] hw queue 3 init cost time 215851 ns
> [ 432.597107] hw queue 4 init cost time 228406 ns
> [ 432.597336] hw queue 5 init cost time 227298 ns
> [ 432.597564] hw queue 6 init cost time 224633 ns
> [ 432.597785] hw queue 7 init cost time 219954 ns
> [ 432.597937] hw queue 8 init cost time 150930 ns
> [ 432.598082] hw queue 9 init cost time 143496 ns
> [ 432.598231] hw queue 10 init cost time 147261 ns
> [ 432.598397] hw queue 11 init cost time 164522 ns
> [ 432.598542] hw queue 12 init cost time 143401 ns
> [ 432.598692] hw queue 13 init cost time 148934 ns
> [ 432.598841] hw queue 14 init cost time 147194 ns
> [ 432.598991] hw queue 15 init cost time 148942 ns
> [ 432.598993] [4688:kworker/u33:8 switch:700,2] all hw queues init cost time 3035099 ns
> [ 432.602611] nvme0n1: p1
>
> So use this patch to trigger schedule between each hw queue init, to
> avoid other threads getting stuck. We call cond_resched() only when
> "queue depth >= 512". We are not in atomic context when executing
> __blk_mq_alloc_rq_maps(), so it is safe to call cond_resched().
>
> Signed-off-by: Xianting Tian <[email protected]>
> ---
> block/blk-mq.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c index
> b3d2785ee..5a71fe53a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3255,11 +3255,16 @@ void blk_mq_exit_queue(struct request_queue
> *q) static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set) {
> int i;
> + unsigned int depth = set->queue_depth;
>
> - for (i = 0; i < set->nr_hw_queues; i++)
> + for (i = 0; i < set->nr_hw_queues; i++) {
> if (!__blk_mq_alloc_map_and_request(set, i))
> goto out_unwind;
>
> + if (depth >= 512)
> + cond_resched();
> + }
> +
> return 0;

I generally dislike

if (some_num)
cond_resched();

particularly when it's not a true hot path. How about just making the
cond_resched() unconditional? I suspect that will be just fine.

--
Jens Axboe