2018-10-12 10:06:27

by jianchao.wang

Subject: [PATCH V2 0/4] blk-mq: some fixes about updating hw queues

Hi Jens

This patch set fixes some defects encountered while updating the number of hw queues.

The 1st patch refactors the blk-mq debugfs and sysfs registration during
blk_mq_update_nr_hw_queues.

The 2nd patch changes GFP_KERNEL to GFP_NOIO in blk_mq_realloc_hw_ctxs.

The 3rd patch reallocates a hctx when it is mapped to a different node.

The 4th patch should fix the panic reported in the following link:
https://marc.info/?l=linux-block&m=153799465603723&w=2

V2:
- Newly add 2nd and 3rd patch.
- Use blk_mq_map_queues when __blk_mq_update_nr_hw_queues falls back to
previous nr_hw_queues, to avoid the driver's .map_queues leaving some
cpus unmapped due to the fallback nr_hw_queues.
- Add Ming's Reviewed-by in 1st patch.


Jianchao Wang (4):
blk-mq: adjust debugfs and sysfs register when updating nr_hw_queues
blk-mq: change gfp flags to GFP_NOIO in blk_mq_realloc_hw_ctxs
blk-mq: realloc hctx when hw queue is mapped to another node
blk-mq: fallback to previous nr_hw_queues when updating fails

block/blk-core.c | 2 +-
block/blk-flush.c | 6 +--
block/blk-mq.c | 155 ++++++++++++++++++++++++++++++++++--------------------
block/blk.h | 2 +-
4 files changed, 102 insertions(+), 63 deletions(-)


2018-10-12 10:06:34

by jianchao.wang

Subject: [PATCH V2 3/4] blk-mq: realloc hctx when hw queue is mapped to another node

When the hw queues and mq_map are updated, a hctx could be mapped
to a different numa node. In that case, we need to realloc the hctx.
If that fails, keep using the previous hctx.
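
The core of the change is a per-queue node comparison; everything
else is alloc/swap/fallback bookkeeping. A minimal sketch of the idea
(hand-simplified from the diff below; locking and error cleanup
omitted):

	node = blk_mq_hw_queue_to_node(q->mq_map, i);
	if (hctxs[i] && hctxs[i]->numa_node == node)
		continue;	/* mapping unchanged, keep the old hctx */

	hctx = blk_mq_alloc_and_init_hctx(set, q, i, node);
	if (hctx) {
		/* swap in the hctx allocated on the new home node */
		if (hctxs[i])
			blk_mq_exit_hctx(q, set, hctxs[i], i);
		hctxs[i] = hctx;
	} else {
		/* allocation failed: warn and keep the old, remote hctx */
	}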

Signed-off-by: Jianchao Wang <[email protected]>
---
block/blk-mq.c | 82 +++++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 56 insertions(+), 26 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9805d6a..34f3973 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2504,6 +2504,39 @@ static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
return hw_ctx_size;
}

+static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
+ struct blk_mq_tag_set *set, struct request_queue *q,
+ int hctx_idx, int node)
+{
+ struct blk_mq_hw_ctx *hctx;
+
+ hctx = kzalloc_node(blk_mq_hw_ctx_size(set),
+ GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
+ node);
+ if (!hctx)
+ return NULL;
+
+ if (!zalloc_cpumask_var_node(&hctx->cpumask,
+ GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
+ node)) {
+ kfree(hctx);
+ return NULL;
+ }
+
+ atomic_set(&hctx->nr_active, 0);
+ hctx->numa_node = node;
+ hctx->queue_num = hctx_idx;
+
+ if (blk_mq_init_hctx(q, set, hctx, hctx_idx)) {
+ free_cpumask_var(hctx->cpumask);
+ kfree(hctx);
+ return NULL;
+ }
+ blk_mq_hctx_kobj_init(hctx);
+
+ return hctx;
+}
+
static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
struct request_queue *q)
{
@@ -2514,37 +2547,34 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
mutex_lock(&q->sysfs_lock);
for (i = 0; i < set->nr_hw_queues; i++) {
int node;
-
- if (hctxs[i])
- continue;
+ struct blk_mq_hw_ctx *hctx;

node = blk_mq_hw_queue_to_node(q->mq_map, i);
- hctxs[i] = kzalloc_node(blk_mq_hw_ctx_size(set),
- GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
- node);
- if (!hctxs[i])
- break;
-
- if (!zalloc_cpumask_var_node(&hctxs[i]->cpumask,
- GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
- node)) {
- kfree(hctxs[i]);
- hctxs[i] = NULL;
- break;
- }
-
- atomic_set(&hctxs[i]->nr_active, 0);
- hctxs[i]->numa_node = node;
- hctxs[i]->queue_num = i;
+ /*
+ * If the hw queue has been mapped to another numa node,
+ * we need to realloc the hctx. If allocation fails, fallback
+ * to use the previous one.
+ */
+ if (hctxs[i] && (hctxs[i]->numa_node == node))
+ continue;

- if (blk_mq_init_hctx(q, set, hctxs[i], i)) {
- free_cpumask_var(hctxs[i]->cpumask);
- kfree(hctxs[i]);
- hctxs[i] = NULL;
- break;
+ hctx = blk_mq_alloc_and_init_hctx(set, q, i, node);
+ if (hctx) {
+ if (hctxs[i]) {
+ blk_mq_exit_hctx(q, set, hctxs[i], i);
+ kobject_put(&hctxs[i]->kobj);
+ }
+ hctxs[i] = hctx;
+ } else {
+ if (hctxs[i])
+ pr_warn("Allocate new hctx on node %d fails,\
+ fallback to previous one on node %d\n",
+ node, hctxs[i]->numa_node);
+ else
+ break;
}
- blk_mq_hctx_kobj_init(hctxs[i]);
}
+
for (j = i; j < q->nr_hw_queues; j++) {
struct blk_mq_hw_ctx *hctx = hctxs[j];

--
2.7.4


2018-10-12 10:07:06

by jianchao.wang

Subject: [PATCH V2 4/4] blk-mq: fallback to previous nr_hw_queues when updating fails

When we try to increase nr_hw_queues, we may fail due to a shortage
of memory or some other reason; blk_mq_realloc_hw_ctxs then stops and
some entries in q->queue_hw_ctx are left NULL. However, because the
queue map has already been updated with the new nr_hw_queues, some
cpus are mapped to a hw queue whose allocation just failed, so
blk_mq_map_queue could return NULL. This causes a panic in the
following blk_mq_map_swqueue.
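
For reference, the crash path is visible in a simplified rendering of
blk_mq_map_swqueue (paraphrased, not the exact kernel source):

	for_each_possible_cpu(cpu) {
		/*
		 * q->mq_map[cpu] may point at a hw queue whose hctx
		 * allocation failed above, in which case hctx is NULL.
		 */
		hctx = blk_mq_map_queue(q, cpu);
		cpumask_set_cpu(cpu, hctx->cpumask);	/* NULL pointer deref */
	}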

To fix it, when increasing nr_hw_queues fails, fall back to the
previous nr_hw_queues and post a warning. In addition, a driver's
.map_queues usually uses completion irq affinity to map hw queues and
cpus; with the fallback nr_hw_queues it could leave some cpus without
a mapping, so use the default blk_mq_map_queues instead.
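
Unlike an affinity-based .map_queues, the generic blk_mq_map_queues
leaves no cpu unmapped. Conceptually it reduces to the following
(simplified; the real code is also topology-aware):

	for_each_possible_cpu(cpu)
		set->mq_map[cpu] = cpu % set->nr_hw_queues;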

Reported-by: [email protected]
Signed-off-by: Jianchao Wang <[email protected]>
---
block/blk-mq.c | 27 ++++++++++++++++++++++++---
1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 34f3973..d2ce67a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2540,7 +2540,7 @@ static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
struct request_queue *q)
{
- int i, j;
+ int i, j, end;
struct blk_mq_hw_ctx **hctxs = q->queue_hw_ctx;

/* protect against switching io scheduler */
@@ -2574,8 +2574,20 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
break;
}
}
+ /*
+ * Increasing nr_hw_queues fails. Free the newly allocated
+ * hctxs and keep the previous q->nr_hw_queues.
+ */
+ if (i != set->nr_hw_queues) {
+ j = q->nr_hw_queues;
+ end = i;
+ } else {
+ j = i;
+ end = q->nr_hw_queues;
+ q->nr_hw_queues = set->nr_hw_queues;
+ }

- for (j = i; j < q->nr_hw_queues; j++) {
+ for (; j < end; j++) {
struct blk_mq_hw_ctx *hctx = hctxs[j];

if (hctx) {
@@ -2587,7 +2599,6 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,

}
}
- q->nr_hw_queues = i;
mutex_unlock(&q->sysfs_lock);
}

@@ -2972,6 +2983,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
{
struct request_queue *q;
LIST_HEAD(head);
+ int prev_nr_hw_queues;

lockdep_assert_held(&set->tag_list_lock);

@@ -3000,10 +3012,19 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
blk_mq_sysfs_unregister(q);
}

+ prev_nr_hw_queues = set->nr_hw_queues;
set->nr_hw_queues = nr_hw_queues;
blk_mq_update_queue_map(set);
+fallback:
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_realloc_hw_ctxs(set, q);
+ if (q->nr_hw_queues != set->nr_hw_queues) {
+ pr_warn("Increasing nr_hw_queues to %d fails, fallback to %d\n",
+ nr_hw_queues, prev_nr_hw_queues);
+ set->nr_hw_queues = prev_nr_hw_queues;
+ blk_mq_map_queues(set);
+ goto fallback;
+ }
blk_mq_map_swqueue(q);
}

--
2.7.4


2018-10-12 10:07:34

by jianchao.wang

Subject: [PATCH V2 1/4] blk-mq: adjust debugfs and sysfs register when updating nr_hw_queues

blk-mq debugfs and sysfs entries need to be removed before updating
the queue map, otherwise we get wrong results there. This patch fixes
that and removes the redundant debugfs and sysfs register/unregister
operations during __blk_mq_update_nr_hw_queues.
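
The resulting ordering in __blk_mq_update_nr_hw_queues, condensed from
the hunks below (each step runs per queue in the tag set):

	/* 1. tear down per-hctx debugfs/sysfs while the old map is live */
	blk_mq_debugfs_unregister_hctxs(q);
	blk_mq_sysfs_unregister(q);

	/* 2. update the map and realloc hctxs with nothing registered */
	blk_mq_update_queue_map(set);
	blk_mq_realloc_hw_ctxs(set, q);
	blk_mq_map_swqueue(q);

	/* 3. re-register everything against the new map */
	blk_mq_sysfs_register(q);
	blk_mq_debugfs_register_hctxs(q);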

Signed-off-by: Jianchao Wang <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
---
block/blk-mq.c | 39 ++++++++++++---------------------------
1 file changed, 12 insertions(+), 27 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e3c39ea..32be432 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2137,8 +2137,6 @@ static void blk_mq_exit_hctx(struct request_queue *q,
struct blk_mq_tag_set *set,
struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
{
- blk_mq_debugfs_unregister_hctx(hctx);
-
if (blk_mq_hw_queue_mapped(hctx))
blk_mq_tag_idle(hctx);

@@ -2165,6 +2163,7 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
queue_for_each_hw_ctx(q, hctx, i) {
if (i == nr_queue)
break;
+ blk_mq_debugfs_unregister_hctx(hctx);
blk_mq_exit_hctx(q, set, hctx, i);
}
}
@@ -2222,8 +2221,6 @@ static int blk_mq_init_hctx(struct request_queue *q,
if (hctx->flags & BLK_MQ_F_BLOCKING)
init_srcu_struct(hctx->srcu);

- blk_mq_debugfs_register_hctx(q, hctx);
-
return 0;

free_fq:
@@ -2512,8 +2509,6 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
int i, j;
struct blk_mq_hw_ctx **hctxs = q->queue_hw_ctx;

- blk_mq_sysfs_unregister(q);
-
/* protect against switching io scheduler */
mutex_lock(&q->sysfs_lock);
for (i = 0; i < set->nr_hw_queues; i++) {
@@ -2561,7 +2556,6 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
}
q->nr_hw_queues = i;
mutex_unlock(&q->sysfs_lock);
- blk_mq_sysfs_register(q);
}

struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
@@ -2659,25 +2653,6 @@ void blk_mq_free_queue(struct request_queue *q)
blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
}

-/* Basically redo blk_mq_init_queue with queue frozen */
-static void blk_mq_queue_reinit(struct request_queue *q)
-{
- WARN_ON_ONCE(!atomic_read(&q->mq_freeze_depth));
-
- blk_mq_debugfs_unregister_hctxs(q);
- blk_mq_sysfs_unregister(q);
-
- /*
- * redo blk_mq_init_cpu_queues and blk_mq_init_hw_queues. FIXME: maybe
- * we should change hctx numa_node according to the new topology (this
- * involves freeing and re-allocating memory, worth doing?)
- */
- blk_mq_map_swqueue(q);
-
- blk_mq_sysfs_register(q);
- blk_mq_debugfs_register_hctxs(q);
-}
-
static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
{
int i;
@@ -2987,11 +2962,21 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
if (!blk_mq_elv_switch_none(&head, q))
goto switch_back;

+ list_for_each_entry(q, &set->tag_list, tag_set_list) {
+ blk_mq_debugfs_unregister_hctxs(q);
+ blk_mq_sysfs_unregister(q);
+ }
+
set->nr_hw_queues = nr_hw_queues;
blk_mq_update_queue_map(set);
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_realloc_hw_ctxs(set, q);
- blk_mq_queue_reinit(q);
+ blk_mq_map_swqueue(q);
+ }
+
+ list_for_each_entry(q, &set->tag_list, tag_set_list) {
+ blk_mq_sysfs_register(q);
+ blk_mq_debugfs_register_hctxs(q);
}

switch_back:
--
2.7.4


2018-10-12 10:08:00

by jianchao.wang

Subject: [PATCH V2 2/4] blk-mq: change gfp flags to GFP_NOIO in blk_mq_realloc_hw_ctxs

blk_mq_realloc_hw_ctxs could be invoked while updating the number of
hw queues. At that moment, IO is blocked. Change the gfp flags from
GFP_KERNEL to GFP_NOIO to avoid hanging forever on memory allocation
in blk_mq_realloc_hw_ctxs.
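
The hang in question is a reclaim deadlock: with the queues frozen for
the update, a GFP_KERNEL allocation may enter direct reclaim and wait
on writeback IO that can never complete. An alternative to threading
gfp flags through the callers would have been the process-level scope
API (a sketch of that option only, not what this patch does):

	unsigned int noio_flag;

	/* all allocations in this scope implicitly behave as GFP_NOIO */
	noio_flag = memalloc_noio_save();
	blk_mq_realloc_hw_ctxs(set, q);
	memalloc_noio_restore(noio_flag);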

Signed-off-by: Jianchao Wang <[email protected]>
---
block/blk-core.c | 2 +-
block/blk-flush.c | 6 +++---
block/blk-mq.c | 17 ++++++++++-------
block/blk.h | 2 +-
4 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index cff0a60..4fe7084 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1160,7 +1160,7 @@ int blk_init_allocated_queue(struct request_queue *q)
{
WARN_ON_ONCE(q->mq_ops);

- q->fq = blk_alloc_flush_queue(q, NUMA_NO_NODE, q->cmd_size);
+ q->fq = blk_alloc_flush_queue(q, NUMA_NO_NODE, q->cmd_size, GFP_KERNEL);
if (!q->fq)
return -ENOMEM;

diff --git a/block/blk-flush.c b/block/blk-flush.c
index ce41f66..8b44b86 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -566,12 +566,12 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
EXPORT_SYMBOL(blkdev_issue_flush);

struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q,
- int node, int cmd_size)
+ int node, int cmd_size, gfp_t flags)
{
struct blk_flush_queue *fq;
int rq_sz = sizeof(struct request);

- fq = kzalloc_node(sizeof(*fq), GFP_KERNEL, node);
+ fq = kzalloc_node(sizeof(*fq), flags, node);
if (!fq)
goto fail;

@@ -579,7 +579,7 @@ struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q,
spin_lock_init(&fq->mq_flush_lock);

rq_sz = round_up(rq_sz + cmd_size, cache_line_size());
- fq->flush_rq = kzalloc_node(rq_sz, GFP_KERNEL, node);
+ fq->flush_rq = kzalloc_node(rq_sz, flags, node);
if (!fq->flush_rq)
goto fail_rq;

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 32be432..9805d6a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2193,12 +2193,12 @@ static int blk_mq_init_hctx(struct request_queue *q,
* runtime
*/
hctx->ctxs = kmalloc_array_node(nr_cpu_ids, sizeof(void *),
- GFP_KERNEL, node);
+ GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node);
if (!hctx->ctxs)
goto unregister_cpu_notifier;

- if (sbitmap_init_node(&hctx->ctx_map, nr_cpu_ids, ilog2(8), GFP_KERNEL,
- node))
+ if (sbitmap_init_node(&hctx->ctx_map, nr_cpu_ids, ilog2(8),
+ GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node))
goto free_ctxs;

hctx->nr_ctx = 0;
@@ -2211,7 +2211,8 @@ static int blk_mq_init_hctx(struct request_queue *q,
set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
goto free_bitmap;

- hctx->fq = blk_alloc_flush_queue(q, hctx->numa_node, set->cmd_size);
+ hctx->fq = blk_alloc_flush_queue(q, hctx->numa_node, set->cmd_size,
+ GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY);
if (!hctx->fq)
goto exit_hctx;

@@ -2519,12 +2520,14 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,

node = blk_mq_hw_queue_to_node(q->mq_map, i);
hctxs[i] = kzalloc_node(blk_mq_hw_ctx_size(set),
- GFP_KERNEL, node);
+ GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
+ node);
if (!hctxs[i])
break;

- if (!zalloc_cpumask_var_node(&hctxs[i]->cpumask, GFP_KERNEL,
- node)) {
+ if (!zalloc_cpumask_var_node(&hctxs[i]->cpumask,
+ GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
+ node)) {
kfree(hctxs[i]);
hctxs[i] = NULL;
break;
diff --git a/block/blk.h b/block/blk.h
index 9db4e38..0f86eda 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -124,7 +124,7 @@ static inline void __blk_get_queue(struct request_queue *q)
}

struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q,
- int node, int cmd_size);
+ int node, int cmd_size, gfp_t flags);
void blk_free_flush_queue(struct blk_flush_queue *q);

int blk_init_rl(struct request_list *rl, struct request_queue *q,
--
2.7.4


2018-10-13 22:52:52

by Jens Axboe

Subject: Re: [PATCH V2 0/4] blk-mq: some fixes about updating hw queues

On 10/12/18 4:07 AM, Jianchao Wang wrote:
> Hi Jens
>
> This patch set fixes some defects encountered while updating the number of hw queues.
>
> The 1st patch refactors the blk-mq debugfs and sysfs registration during
> blk_mq_update_nr_hw_queues.
>
> The 2nd patch changes GFP_KERNEL to GFP_NOIO in blk_mq_realloc_hw_ctxs.
>
> The 3rd patch reallocates a hctx when it is mapped to a different node.
>
> The 4th patch should fix the panic reported in the following link:
> https://marc.info/?l=linux-block&m=153799465603723&w=2
>
> V2:
> - Newly add 2nd and 3rd patch.
> - Use blk_mq_map_queues when __blk_mq_update_nr_hw_queues falls back to
> previous nr_hw_queues, to avoid the driver's .map_queues leaving some
> cpus unmapped due to the fallback nr_hw_queues.
> - Add Ming's Reviewed-by in 1st patch.
>
>
> Jianchao Wang (4):
> blk-mq: adjust debugfs and sysfs register when updating nr_hw_queues
> blk-mq: change gfp flags to GFP_NOIO in blk_mq_realloc_hw_ctxs
> blk-mq: realloc hctx when hw queue is mapped to another node
> blk-mq: fallback to previous nr_hw_queues when updating fails
>
> block/blk-core.c | 2 +-
> block/blk-flush.c | 6 +--
> block/blk-mq.c | 155 ++++++++++++++++++++++++++++++++++--------------------
> block/blk.h | 2 +-
> 4 files changed, 102 insertions(+), 63 deletions(-)
>

Applied, thanks.

--
Jens Axboe