2015-07-18 16:28:46

by Akinobu Mita

Subject: [PATCH v3 0/7] blk-mq: fix race conditions on cpu hotplug handling

This patchset addresses several race conditions in CPU hotplug handling
for blk-mq. All of the problems can be reproduced with the following
script.

while true; do
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online
done &

while true; do
modprobe -r null_blk
modprobe null_blk queue_mode=2 irqmode=1
sleep 0.1
done

* Changes from v2
- Add regression fix for hctx->tags->cpumask
- Remove the per-hctx BLK_MQ_F_SYSFS_UP flag and use a per-queue
mq_sysfs_init_done flag instead, with appropriate locking, in order
to track the lifetime of the 'mq' sysfs entry
- Add comments on non-trivial parts, as suggested by Ming

* Changes from v1
- Release q->mq_map in blk_mq_release()
- Fix deadlock when reading cpu_list
- Fix race between freeze and unfreeze

Akinobu Mita (7):
blk-mq: avoid access hctx->tags->cpumask before allocation
blk-mq: fix sysfs registration/unregistration race
blk-mq: Fix use-after-free of q->mq_map
blk-mq: fix q->mq_usage_counter access race
blk-mq: avoid inserting requests before establishing new mapping
blk-mq: fix freeze queue race
blk-mq: fix deadlock when reading cpu_list

block/blk-core.c | 1 +
block/blk-mq-cpumap.c | 9 +++--
block/blk-mq-sysfs.c | 36 ++++++++++++------
block/blk-mq.c | 100 +++++++++++++++++++++++++++++++++++++------------
block/blk-mq.h | 3 +-
include/linux/blk-mq.h | 1 -
include/linux/blkdev.h | 8 ++++
7 files changed, 116 insertions(+), 42 deletions(-)

Cc: Jens Axboe <[email protected]>
Cc: Ming Lei <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Keith Busch <[email protected]>
--
1.9.1


2015-07-18 16:28:51

by Akinobu Mita

Subject: [PATCH v3 1/7] blk-mq: avoid access hctx->tags->cpumask before allocation

When an unmapped hw queue is remapped after the CPU topology is
changed, hctx->tags->cpumask is set before hctx->tags is allocated in
blk_mq_map_swqueue().

In order to fix this null pointer dereference, hctx->tags must be
allocated before configuring hctx->tags->cpumask.
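
For illustration, a simplified sketch of the problematic ordering in
blk_mq_map_swqueue() (not the actual code; details elided):

    queue_for_each_ctx(q, ctx, i) {
        hctx = q->mq_ops->map_queue(q, i);
        cpumask_set_cpu(i, hctx->cpumask);
        /* hctx->tags may still be NULL here for a newly mapped hctx */
        cpumask_set_cpu(i, hctx->tags->cpumask);
        ...
    }

    queue_for_each_hw_ctx(q, hctx, i) {
        ...
        /* hctx->tags is only assigned here, after the loop above */
        hctx->tags = set->tags[i];
    }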

Fixes: f26cdc8536 ("blk-mq: Shared tag enhancements")
Signed-off-by: Akinobu Mita <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Ming Lei <[email protected]>
---
block/blk-mq.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7d842db..f29f766 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1821,7 +1821,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)

hctx = q->mq_ops->map_queue(q, i);
cpumask_set_cpu(i, hctx->cpumask);
- cpumask_set_cpu(i, hctx->tags->cpumask);
ctx->index_hw = hctx->nr_ctx;
hctx->ctxs[hctx->nr_ctx++] = ctx;
}
@@ -1861,6 +1860,14 @@ static void blk_mq_map_swqueue(struct request_queue *q)
hctx->next_cpu = cpumask_first(hctx->cpumask);
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
}
+
+ queue_for_each_ctx(q, ctx, i) {
+ if (!cpu_online(i))
+ continue;
+
+ hctx = q->mq_ops->map_queue(q, i);
+ cpumask_set_cpu(i, hctx->tags->cpumask);
+ }
}

static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set)
--
1.9.1

2015-07-18 16:28:56

by Akinobu Mita

Subject: [PATCH v3 2/7] blk-mq: fix sysfs registration/unregistration race

There is a race between cpu hotplug handling and adding/deleting
gendisk for blk-mq, where both are trying to register and unregister
the same sysfs entries.

null_add_dev
--> blk_mq_init_queue
--> blk_mq_init_allocated_queue
--> add to 'all_q_list' (*)
--> add_disk
--> blk_register_queue
--> blk_mq_register_disk (++)

null_del_dev
--> del_gendisk
--> blk_unregister_queue
--> blk_mq_unregister_disk (--)
--> blk_cleanup_queue
--> blk_mq_free_queue
--> del from 'all_q_list' (*)

blk_mq_queue_reinit
--> blk_mq_sysfs_unregister (-)
--> blk_mq_sysfs_register (+)

While the request queue is on 'all_q_list' (*), blk_mq_queue_reinit()
can be called for the queue at any time by the CPU hotplug callback.
But blk_mq_sysfs_unregister (-) and blk_mq_sysfs_register (+) in
blk_mq_queue_reinit must not be called before blk_mq_register_disk (++)
or after blk_mq_unregister_disk (--) has finished, because
'/sys/block/*/mq/' does not exist then.

There is already a BLK_MQ_F_SYSFS_UP flag in hctx->flags which can be
used to track this sysfs state, but it only fixes this issue partially.

In order to fix it completely, we need a per-queue flag instead of a
per-hctx flag, with appropriate locking. So this introduces
q->mq_sysfs_init_done, which is properly protected by all_q_mutex.

Also, we need to ensure that blk_mq_map_swqueue() is called with
all_q_mutex held. Since hctx->nr_ctx is temporarily reset and then
updated in blk_mq_map_swqueue(), blk_mq_register_hctx() must not see
the temporary hctx->nr_ctx value during CPU hotplug handling or while
adding/deleting a gendisk.
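
For illustration, the hctx->nr_ctx interaction mentioned above
(simplified, not the actual code):

    /* blk_mq_map_swqueue(), now called under all_q_mutex */
    queue_for_each_hw_ctx(q, hctx, i) {
        cpumask_clear(hctx->cpumask);
        hctx->nr_ctx = 0;                 /* transiently zero */
    }
    queue_for_each_ctx(q, ctx, i) {
        hctx = q->mq_ops->map_queue(q, i);
        hctx->ctxs[hctx->nr_ctx++] = ctx; /* rebuilt here */
    }

    /* blk_mq_register_hctx(), from the register/reinit paths */
    if (!hctx->nr_ctx)                    /* must not see the transient zero */
        return 0;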

Signed-off-by: Akinobu Mita <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Ming Lei <[email protected]>
---
block/blk-mq-sysfs.c | 30 ++++++++++++++++++++++--------
block/blk-mq.c | 6 +++---
include/linux/blk-mq.h | 1 -
include/linux/blkdev.h | 2 ++
4 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index b79685e..79096a6 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -332,7 +332,7 @@ static void blk_mq_unregister_hctx(struct blk_mq_hw_ctx *hctx)
struct blk_mq_ctx *ctx;
int i;

- if (!hctx->nr_ctx || !(hctx->flags & BLK_MQ_F_SYSFS_UP))
+ if (!hctx->nr_ctx)
return;

hctx_for_each_ctx(hctx, ctx, i)
@@ -347,7 +347,7 @@ static int blk_mq_register_hctx(struct blk_mq_hw_ctx *hctx)
struct blk_mq_ctx *ctx;
int i, ret;

- if (!hctx->nr_ctx || !(hctx->flags & BLK_MQ_F_SYSFS_UP))
+ if (!hctx->nr_ctx)
return 0;

ret = kobject_add(&hctx->kobj, &q->mq_kobj, "%u", hctx->queue_num);
@@ -370,6 +370,8 @@ void blk_mq_unregister_disk(struct gendisk *disk)
struct blk_mq_ctx *ctx;
int i, j;

+ blk_mq_disable_hotplug();
+
queue_for_each_hw_ctx(q, hctx, i) {
blk_mq_unregister_hctx(hctx);

@@ -384,6 +386,9 @@ void blk_mq_unregister_disk(struct gendisk *disk)
kobject_put(&q->mq_kobj);

kobject_put(&disk_to_dev(disk)->kobj);
+
+ q->mq_sysfs_init_done = false;
+ blk_mq_enable_hotplug();
}

static void blk_mq_sysfs_init(struct request_queue *q)
@@ -414,27 +419,30 @@ int blk_mq_register_disk(struct gendisk *disk)
struct blk_mq_hw_ctx *hctx;
int ret, i;

+ blk_mq_disable_hotplug();
+
blk_mq_sysfs_init(q);

ret = kobject_add(&q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq");
if (ret < 0)
- return ret;
+ goto out;

kobject_uevent(&q->mq_kobj, KOBJ_ADD);

queue_for_each_hw_ctx(q, hctx, i) {
- hctx->flags |= BLK_MQ_F_SYSFS_UP;
ret = blk_mq_register_hctx(hctx);
if (ret)
break;
}

- if (ret) {
+ if (ret)
blk_mq_unregister_disk(disk);
- return ret;
- }
+ else
+ q->mq_sysfs_init_done = true;
+out:
+ blk_mq_enable_hotplug();

- return 0;
+ return ret;
}
EXPORT_SYMBOL_GPL(blk_mq_register_disk);

@@ -443,6 +451,9 @@ void blk_mq_sysfs_unregister(struct request_queue *q)
struct blk_mq_hw_ctx *hctx;
int i;

+ if (!q->mq_sysfs_init_done)
+ return;
+
queue_for_each_hw_ctx(q, hctx, i)
blk_mq_unregister_hctx(hctx);
}
@@ -452,6 +463,9 @@ int blk_mq_sysfs_register(struct request_queue *q)
struct blk_mq_hw_ctx *hctx;
int i, ret = 0;

+ if (!q->mq_sysfs_init_done)
+ return ret;
+
queue_for_each_hw_ctx(q, hctx, i) {
ret = blk_mq_register_hctx(hctx);
if (ret)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f29f766..68921b7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2045,13 +2045,13 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
goto err_hctxs;

mutex_lock(&all_q_mutex);
- list_add_tail(&q->all_q_node, &all_q_list);
- mutex_unlock(&all_q_mutex);

+ list_add_tail(&q->all_q_node, &all_q_list);
blk_mq_add_queue_tag_set(set, q);
-
blk_mq_map_swqueue(q);

+ mutex_unlock(&all_q_mutex);
+
return q;

err_hctxs:
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 37d1602..b80ba45 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -145,7 +145,6 @@ enum {
BLK_MQ_F_SHOULD_MERGE = 1 << 0,
BLK_MQ_F_TAG_SHARED = 1 << 1,
BLK_MQ_F_SG_MERGE = 1 << 2,
- BLK_MQ_F_SYSFS_UP = 1 << 3,
BLK_MQ_F_DEFER_ISSUE = 1 << 4,
BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
BLK_MQ_F_ALLOC_POLICY_BITS = 1,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d4068c1..b02c90b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -462,6 +462,8 @@ struct request_queue {

struct blk_mq_tag_set *tag_set;
struct list_head tag_set_list;
+
+ bool mq_sysfs_init_done;
};

#define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */
--
1.9.1

2015-07-18 16:28:59

by Akinobu Mita

Subject: [PATCH v3 3/7] blk-mq: Fix use-after-free of q->mq_map

CPU hotplug handling for blk-mq (blk_mq_queue_reinit) updates
q->mq_map via blk_mq_update_queue_map() for all request queues in
all_q_list. On the other hand, q->mq_map is released before the queue
is deleted from all_q_list.

So if a CPU hotplug event occurs in this window, an invalid memory
access can happen. Fix it by releasing q->mq_map in blk_mq_release()
so that it happens later than the removal from all_q_list.

Signed-off-by: Akinobu Mita <[email protected]>
Suggested-by: Ming Lei <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Ming Lei <[email protected]>
---
block/blk-mq.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 68921b7..9328405 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1935,6 +1935,9 @@ void blk_mq_release(struct request_queue *q)
kfree(hctx);
}

+ kfree(q->mq_map);
+ q->mq_map = NULL;
+
kfree(q->queue_hw_ctx);

/* ctx kobj stays in queue_ctx */
@@ -2080,11 +2083,6 @@ void blk_mq_free_queue(struct request_queue *q)
blk_mq_free_hw_queues(q, set);

percpu_ref_exit(&q->mq_usage_counter);
-
- kfree(q->mq_map);
-
- q->mq_map = NULL;
-
mutex_lock(&all_q_mutex);
list_del_init(&q->all_q_node);
mutex_unlock(&all_q_mutex);
--
1.9.1

2015-07-18 16:29:04

by Akinobu Mita

Subject: [PATCH v3 4/7] blk-mq: fix q->mq_usage_counter access race

CPU hotplug handling for blk-mq (blk_mq_queue_reinit) accesses
q->mq_usage_counter while freezing all request queues in all_q_list.
On the other hand, q->mq_usage_counter is deinitialized in
blk_mq_free_queue() before deleting the queue from all_q_list.

So if a CPU hotplug event occurs in this window, percpu_ref_kill() is
called on q->mq_usage_counter, which has already been marked dead, and
that triggers a warning. Fix it by deleting the queue from all_q_list
before destroying q->mq_usage_counter.

Signed-off-by: Akinobu Mita <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Ming Lei <[email protected]>
---
block/blk-mq.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9328405..d5d93e0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2077,15 +2077,16 @@ void blk_mq_free_queue(struct request_queue *q)
{
struct blk_mq_tag_set *set = q->tag_set;

+ mutex_lock(&all_q_mutex);
+ list_del_init(&q->all_q_node);
+ mutex_unlock(&all_q_mutex);
+
blk_mq_del_queue_tag_set(q);

blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
blk_mq_free_hw_queues(q, set);

percpu_ref_exit(&q->mq_usage_counter);
- mutex_lock(&all_q_mutex);
- list_del_init(&q->all_q_node);
- mutex_unlock(&all_q_mutex);
}

/* Basically redo blk_mq_init_queue with queue frozen */
--
1.9.1

2015-07-18 16:29:07

by Akinobu Mita

Subject: [PATCH v3 5/7] blk-mq: avoid inserting requests before establishing new mapping

Notifier callbacks for the CPU_ONLINE action can run on a CPU other
than the one that has just been onlined. So it is possible for a
process running on the just-onlined CPU to insert a request and run
the hw queue before the new mapping is established by
blk_mq_queue_reinit_notify().

This can cause a problem when the CPU is onlined for the first time
since the request queue was initialized. At this point ctx->index_hw
for that CPU, which is the index into hctx->ctxs[] for this ctx, is
still zero, because blk_mq_queue_reinit_notify() has not yet been
called by the notifier callbacks for the CPU_ONLINE action.

For example, suppose there is a single hw queue (hctx) and two CPU
queues (ctx0 for CPU0, and ctx1 for CPU1). Now CPU1 has just been
onlined, a request is inserted into ctx1->rq_list, and bit 0 is set
in the pending bitmap because ctx1->index_hw is still zero.

Then, while running the hw queue, flush_busy_ctxs() finds bit 0 set
in the pending bitmap and tries to retrieve requests from
hctx->ctxs[0]->rq_list. But hctx->ctxs[0] is a pointer to ctx0, so
the request in ctx1->rq_list is ignored.

Fix it by ensuring that the new mapping is established before the
onlined CPU starts running.
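
For illustration, a rough sketch of what flush_busy_ctxs() effectively
does (not the actual implementation; 'pending' and 'rq_list' are
placeholder names):

    for_each_set_bit(bit, pending, hctx->nr_ctx) {
        struct blk_mq_ctx *ctx = hctx->ctxs[bit];

        /* bit 0 resolves to hctx->ctxs[0], i.e. ctx0 -- not ctx1 */
        list_splice_tail_init(&ctx->rq_list, &rq_list);
    }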

Signed-off-by: Akinobu Mita <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Ming Lei <[email protected]>
---
block/blk-mq-cpumap.c | 9 ++++----
block/blk-mq.c | 59 +++++++++++++++++++++++++++++++++++++++------------
block/blk-mq.h | 3 ++-
3 files changed, 52 insertions(+), 19 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 1e28ddb..8764c24 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -31,7 +31,8 @@ static int get_first_sibling(unsigned int cpu)
return cpu;
}

-int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues)
+int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
+ const struct cpumask *online_mask)
{
unsigned int i, nr_cpus, nr_uniq_cpus, queue, first_sibling;
cpumask_var_t cpus;
@@ -41,7 +42,7 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues)

cpumask_clear(cpus);
nr_cpus = nr_uniq_cpus = 0;
- for_each_online_cpu(i) {
+ for_each_cpu(i, online_mask) {
nr_cpus++;
first_sibling = get_first_sibling(i);
if (!cpumask_test_cpu(first_sibling, cpus))
@@ -51,7 +52,7 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues)

queue = 0;
for_each_possible_cpu(i) {
- if (!cpu_online(i)) {
+ if (!cpumask_test_cpu(i, online_mask)) {
map[i] = 0;
continue;
}
@@ -95,7 +96,7 @@ unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set)
if (!map)
return NULL;

- if (!blk_mq_update_queue_map(map, set->nr_hw_queues))
+ if (!blk_mq_update_queue_map(map, set->nr_hw_queues, cpu_online_mask))
return map;

kfree(map);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d5d93e0..d861c70 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1799,7 +1799,8 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
}
}

-static void blk_mq_map_swqueue(struct request_queue *q)
+static void blk_mq_map_swqueue(struct request_queue *q,
+ const struct cpumask *online_mask)
{
unsigned int i;
struct blk_mq_hw_ctx *hctx;
@@ -1816,7 +1817,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
*/
queue_for_each_ctx(q, ctx, i) {
/* If the cpu isn't online, the cpu is mapped to first hctx */
- if (!cpu_online(i))
+ if (!cpumask_test_cpu(i, online_mask))
continue;

hctx = q->mq_ops->map_queue(q, i);
@@ -1862,7 +1863,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
}

queue_for_each_ctx(q, ctx, i) {
- if (!cpu_online(i))
+ if (!cpumask_test_cpu(i, online_mask))
continue;

hctx = q->mq_ops->map_queue(q, i);
@@ -2047,13 +2048,15 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
if (blk_mq_init_hw_queues(q, set))
goto err_hctxs;

+ get_online_cpus();
mutex_lock(&all_q_mutex);

list_add_tail(&q->all_q_node, &all_q_list);
blk_mq_add_queue_tag_set(set, q);
- blk_mq_map_swqueue(q);
+ blk_mq_map_swqueue(q, cpu_online_mask);

mutex_unlock(&all_q_mutex);
+ put_online_cpus();

return q;

@@ -2090,13 +2093,14 @@ void blk_mq_free_queue(struct request_queue *q)
}

/* Basically redo blk_mq_init_queue with queue frozen */
-static void blk_mq_queue_reinit(struct request_queue *q)
+static void blk_mq_queue_reinit(struct request_queue *q,
+ const struct cpumask *online_mask)
{
WARN_ON_ONCE(!atomic_read(&q->mq_freeze_depth));

blk_mq_sysfs_unregister(q);

- blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues);
+ blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues, online_mask);

/*
* redo blk_mq_init_cpu_queues and blk_mq_init_hw_queues. FIXME: maybe
@@ -2104,7 +2108,7 @@ static void blk_mq_queue_reinit(struct request_queue *q)
* involves free and re-allocate memory, worthy doing?)
*/

- blk_mq_map_swqueue(q);
+ blk_mq_map_swqueue(q, online_mask);

blk_mq_sysfs_register(q);
}
@@ -2113,16 +2117,43 @@ static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
unsigned long action, void *hcpu)
{
struct request_queue *q;
+ int cpu = (unsigned long)hcpu;
+ /*
+ * New online cpumask which is going to be set in this hotplug event.
+ * Declare this cpumasks as global as cpu-hotplug operation is invoked
+ * one-by-one and dynamically allocating this could result in a failure.
+ */
+ static struct cpumask online_new;

/*
- * Before new mappings are established, hotadded cpu might already
- * start handling requests. This doesn't break anything as we map
- * offline CPUs to first hardware queue. We will re-init the queue
- * below to get optimal settings.
+ * Before hotadded cpu starts handling requests, new mappings must
+ * be established. Otherwise, these requests in hw queue might
+ * never be dispatched.
+ *
+ * For example, there is a single hw queue (hctx) and two CPU queues
+ * (ctx0 for CPU0, and ctx1 for CPU1).
+ *
+ * Now CPU1 is just onlined and a request is inserted into
+ * ctx1->rq_list and set bit0 in pending bitmap as ctx1->index_hw is
+ * still zero.
+ *
+ * And then while running hw queue, flush_busy_ctxs() finds bit0 is
+ * set in pending bitmap and tries to retrieve requests in
+ * hctx->ctxs[0]->rq_list. But htx->ctxs[0] is a pointer to ctx0,
+ * so the request in ctx1->rq_list is ignored.
*/
- if (action != CPU_DEAD && action != CPU_DEAD_FROZEN &&
- action != CPU_ONLINE && action != CPU_ONLINE_FROZEN)
+ switch (action & ~CPU_TASKS_FROZEN) {
+ case CPU_DEAD:
+ case CPU_UP_CANCELED:
+ cpumask_copy(&online_new, cpu_online_mask);
+ break;
+ case CPU_UP_PREPARE:
+ cpumask_copy(&online_new, cpu_online_mask);
+ cpumask_set_cpu(cpu, &online_new);
+ break;
+ default:
return NOTIFY_OK;
+ }

mutex_lock(&all_q_mutex);

@@ -2146,7 +2177,7 @@ static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
}

list_for_each_entry(q, &all_q_list, all_q_node)
- blk_mq_queue_reinit(q);
+ blk_mq_queue_reinit(q, &online_new);

list_for_each_entry(q, &all_q_list, all_q_node)
blk_mq_unfreeze_queue(q);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 6a48c4c..f4fea79 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -51,7 +51,8 @@ void blk_mq_disable_hotplug(void);
* CPU -> queue mappings
*/
extern unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set);
-extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues);
+extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
+ const struct cpumask *online_mask);
extern int blk_mq_hw_queue_to_node(unsigned int *map, unsigned int);

/*
--
1.9.1

2015-07-18 16:29:14

by Akinobu Mita

Subject: [PATCH v3 6/7] blk-mq: fix freeze queue race

There are several race conditions while freezing the queue.

When unfreezing the queue, there is a small window between
decrementing q->mq_freeze_depth to zero and the percpu_ref_reinit()
call on q->mq_usage_counter. If another task calls
blk_mq_freeze_queue_start() in that window, q->mq_freeze_depth is
increased from zero to one and percpu_ref_kill() is called on
q->mq_usage_counter, which has already been killed. The percpu
refcount should be re-initialized before being killed again.
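
For illustration, the problematic interleaving (pseudo-code timeline,
not actual code):

    /*
     * CPU0: blk_mq_unfreeze_queue()
     *         atomic_dec_return(&q->mq_freeze_depth);   depth 1 -> 0
     * CPU1: blk_mq_freeze_queue_start()
     *         atomic_inc_return(&q->mq_freeze_depth);   depth 0 -> 1
     *         percpu_ref_kill(&q->mq_usage_counter);    killed again, never re-initialized
     * CPU0:   percpu_ref_reinit(&q->mq_usage_counter);  reinit after the second kill
     */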

Also, there is a race condition while switching to percpu mode.
percpu_ref_switch_to_percpu() and percpu_ref_kill() must not be
executed at the same time as the following scenario is possible:

1. q->mq_usage_counter is initialized in atomic mode.
(atomic counter: 1)

2. After the disk registration, a process like systemd-udev starts
accessing the disk, and successfully increases the refcount with
percpu_ref_tryget_live() in blk_mq_queue_enter().
(atomic counter: 2)

3. In the final stage of initialization, q->mq_usage_counter is being
switched to percpu mode by percpu_ref_switch_to_percpu() in
blk_mq_finish_init(). But if CONFIG_PREEMPT_VOLUNTARY is enabled,
the process is rescheduled in the middle of switching when calling
wait_event() in __percpu_ref_switch_to_percpu().
(atomic counter: 2)

4. CPU hotplug handling for blk-mq calls percpu_ref_kill() to freeze
the request queue. q->mq_usage_counter is decreased and marked as
DEAD, and the hotplug handling waits until all requests have finished.
(atomic counter: 1)

5. The process rescheduled in step 3 is resumed and finishes
all remaining work in __percpu_ref_switch_to_percpu().
A bias value is added to the atomic counter of q->mq_usage_counter.
(atomic counter: PERCPU_COUNT_BIAS + 1)

6. A request issued in step 2 is finished and q->mq_usage_counter
is decreased by blk_mq_queue_exit(). q->mq_usage_counter is DEAD,
so the atomic counter is decreased and no release handler is called.
(atomic counter: PERCPU_COUNT_BIAS)

7. The CPU hotplug handling in step 4 will wait forever, as
q->mq_usage_counter will never reach zero.

Also, percpu_ref_reinit() and percpu_ref_kill() must not be executed
at the same time, because both functions could call
__percpu_ref_switch_to_percpu(), which adds the bias value and
initializes the percpu counter.

Fix these races by serializing them with a per-queue mutex.

Signed-off-by: Akinobu Mita <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Ming Lei <[email protected]>
Cc: Tejun Heo <[email protected]>
---
block/blk-core.c | 1 +
block/blk-mq-sysfs.c | 2 ++
block/blk-mq.c | 8 ++++++++
include/linux/blkdev.h | 6 ++++++
4 files changed, 17 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 627ed0c..544b237 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -687,6 +687,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);

init_waitqueue_head(&q->mq_freeze_wq);
+ mutex_init(&q->mq_freeze_lock);

if (blkcg_init_queue(q))
goto fail_bdi;
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 79096a6..f63b464 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -409,7 +409,9 @@ static void blk_mq_sysfs_init(struct request_queue *q)
/* see blk_register_queue() */
void blk_mq_finish_init(struct request_queue *q)
{
+ mutex_lock(&q->mq_freeze_lock);
percpu_ref_switch_to_percpu(&q->mq_usage_counter);
+ mutex_unlock(&q->mq_freeze_lock);
}

int blk_mq_register_disk(struct gendisk *disk)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d861c70..b931e38 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -115,11 +115,15 @@ void blk_mq_freeze_queue_start(struct request_queue *q)
{
int freeze_depth;

+ mutex_lock(&q->mq_freeze_lock);
+
freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
if (freeze_depth == 1) {
percpu_ref_kill(&q->mq_usage_counter);
blk_mq_run_hw_queues(q, false);
}
+
+ mutex_unlock(&q->mq_freeze_lock);
}
EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start);

@@ -143,12 +147,16 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
{
int freeze_depth;

+ mutex_lock(&q->mq_freeze_lock);
+
freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
WARN_ON_ONCE(freeze_depth < 0);
if (!freeze_depth) {
percpu_ref_reinit(&q->mq_usage_counter);
wake_up_all(&q->mq_freeze_wq);
}
+
+ mutex_unlock(&q->mq_freeze_lock);
}
EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b02c90b..b867c32 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -457,6 +457,12 @@ struct request_queue {
#endif
struct rcu_head rcu_head;
wait_queue_head_t mq_freeze_wq;
+ /*
+ * Protect concurrent access to mq_usage_counter by
+ * percpu_ref_switch_to_percpu(), percpu_ref_kill(), and
+ * percpu_ref_reinit().
+ */
+ struct mutex mq_freeze_lock;
struct percpu_ref mq_usage_counter;
struct list_head all_q_node;

--
1.9.1

2015-07-18 16:29:17

by Akinobu Mita

Subject: [PATCH v3 7/7] blk-mq: fix deadlock when reading cpu_list

CPU hotplug handling for blk-mq (blk_mq_queue_reinit) acquires
all_q_mutex in blk_mq_queue_reinit_notify() and then removes sysfs
entries by blk_mq_sysfs_unregister(). Removing a sysfs entry blocks
until the active reference count of the kernfs_node drops to zero.

On the other hand, reading the blk_mq_hw_sysfs_cpu sysfs entry (e.g.
/sys/block/nullb0/mq/0/cpu_list) acquires all_q_mutex in
blk_mq_hw_sysfs_cpus_show().

If these happen at the same time, a deadlock can occur: one side waits
for the active reference to drop to zero while holding all_q_mutex,
and the other tries to acquire all_q_mutex while holding the active
reference.

The reason all_q_mutex is acquired in blk_mq_hw_sysfs_cpus_show() is
to avoid reading an incomplete hctx->cpumask. Since reading a blk-mq
sysfs entry already requires holding q->sysfs_lock, we can avoid both
the deadlock and reading an incomplete hctx->cpumask by taking
q->sysfs_lock while hctx->cpumask is being updated.
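
For illustration, the two paths involved (rough sketch; assumes
blk_mq_disable_hotplug() is what takes all_q_mutex here):

    /*
     * CPU hotplug path                      sysfs read path
     *
     * mutex_lock(&all_q_mutex)              kernfs takes the active reference
     * blk_mq_sysfs_unregister(q)            blk_mq_hw_sysfs_cpus_show()
     *   kernfs_remove()                       blk_mq_disable_hotplug()
     *     ... waits for active ref == 0          ... blocks on all_q_mutex
     */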

Signed-off-by: Akinobu Mita <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Ming Lei <[email protected]>
---
block/blk-mq-sysfs.c | 4 ----
block/blk-mq.c | 7 +++++++
2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index f63b464..e0f71bf 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -218,8 +218,6 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
unsigned int i, first = 1;
ssize_t ret = 0;

- blk_mq_disable_hotplug();
-
for_each_cpu(i, hctx->cpumask) {
if (first)
ret += sprintf(ret + page, "%u", i);
@@ -229,8 +227,6 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
first = 0;
}

- blk_mq_enable_hotplug();
-
ret += sprintf(ret + page, "\n");
return ret;
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b931e38..1a5e7d1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1815,6 +1815,11 @@ static void blk_mq_map_swqueue(struct request_queue *q,
struct blk_mq_ctx *ctx;
struct blk_mq_tag_set *set = q->tag_set;

+ /*
+ * Avoid others reading imcomplete hctx->cpumask through sysfs
+ */
+ mutex_lock(&q->sysfs_lock);
+
queue_for_each_hw_ctx(q, hctx, i) {
cpumask_clear(hctx->cpumask);
hctx->nr_ctx = 0;
@@ -1834,6 +1839,8 @@ static void blk_mq_map_swqueue(struct request_queue *q,
hctx->ctxs[hctx->nr_ctx++] = ctx;
}

+ mutex_unlock(&q->sysfs_lock);
+
queue_for_each_hw_ctx(q, hctx, i) {
struct blk_mq_ctxmap *map = &hctx->ctx_map;

--
1.9.1

2015-07-19 10:24:19

by Ming Lei

Subject: Re: [PATCH v3 1/7] blk-mq: avoid access hctx->tags->cpumask before allocation

On Sun, Jul 19, 2015 at 12:28 AM, Akinobu Mita <[email protected]> wrote:
> When unmapped hw queue is remapped after CPU topology is changed,
> hctx->tags->cpumask is set before hctx->tags is allocated in
> blk_mq_map_swqueue().
>
> In order to fix this null pointer dereference, hctx->tags must be
> allocated before configuring hctx->tags->cpumask.

The root cause should be that the mapping between hctx and ctx
can be changed after the CPU topology is changed, and then hctx->tags
can be changed too, so hctx->tags->cpumask has to be set after
hctx->tags is set up.

>
> Fixes: f26cdc8536 ("blk-mq: Shared tag enhancements")

I am wondering if the above commit considered CPU hotplug; nvme uses
tag->cpumask to set the irq affinity hint only when starting the
queue. It looks reasonable to introduce a mapping_changed() callback
for handling this kind of thing, but that isn't related to this patch.
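
For example, such a callback might look roughly like this (purely
hypothetical; no mapping_changed() callback exists today):

    struct blk_mq_ops {
        ...
        /* hypothetical: notify the driver after the ctx<->hctx mapping changes */
        void (*mapping_changed)(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx);
    };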

> Signed-off-by: Akinobu Mita <[email protected]>
> Cc: Keith Busch <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Ming Lei <[email protected]>
> ---
> block/blk-mq.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 7d842db..f29f766 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1821,7 +1821,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
>
> hctx = q->mq_ops->map_queue(q, i);
> cpumask_set_cpu(i, hctx->cpumask);
> - cpumask_set_cpu(i, hctx->tags->cpumask);
> ctx->index_hw = hctx->nr_ctx;
> hctx->ctxs[hctx->nr_ctx++] = ctx;
> }
> @@ -1861,6 +1860,14 @@ static void blk_mq_map_swqueue(struct request_queue *q)
> hctx->next_cpu = cpumask_first(hctx->cpumask);
> hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
> }
> +
> + queue_for_each_ctx(q, ctx, i) {
> + if (!cpu_online(i))
> + continue;
> +
> + hctx = q->mq_ops->map_queue(q, i);
> + cpumask_set_cpu(i, hctx->tags->cpumask);

If tags->cpumask is always the same as hctx->cpumask, this
CPU iteration can be avoided.

> + }
> }
>
> static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set)
> --
> 1.9.1
>



--
Ming Lei

2015-07-19 11:47:06

by Ming Lei

Subject: Re: [PATCH v3 2/7] blk-mq: fix sysfs registration/unregistration race

On Sun, Jul 19, 2015 at 12:28 AM, Akinobu Mita <[email protected]> wrote:
> There is a race between cpu hotplug handling and adding/deleting
> gendisk for blk-mq, where both are trying to register and unregister
> the same sysfs entries.
>
> null_add_dev
> --> blk_mq_init_queue
> --> blk_mq_init_allocated_queue
> --> add to 'all_q_list' (*)
> --> add_disk
> --> blk_register_queue
> --> blk_mq_register_disk (++)
>
> null_del_dev
> --> del_gendisk
> --> blk_unregister_queue
> --> blk_mq_unregister_disk (--)
> --> blk_cleanup_queue
> --> blk_mq_free_queue
> --> del from 'all_q_list' (*)
>
> blk_mq_queue_reinit
> --> blk_mq_sysfs_unregister (-)
> --> blk_mq_sysfs_register (+)
>
> While the request queue is added to 'all_q_list' (*),
> blk_mq_queue_reinit() can be called for the queue anytime by CPU
> hotplug callback. But blk_mq_sysfs_unregister (-) and
> blk_mq_sysfs_register (+) in blk_mq_queue_reinit must not be called
> before blk_mq_register_disk (++) and after blk_mq_unregister_disk (--)
> is finished. Because '/sys/block/*/mq/' is not exists.
>
> There has already been BLK_MQ_F_SYSFS_UP flag in hctx->flags which can
> be used to track these sysfs stuff, but it is only fixing this issue
> partially.
>
> In order to fix it completely, we just need per-queue flag instead of
> per-hctx flag with appropriate locking. So this introduces
> q->mq_sysfs_init_done which is properly protected with all_q_mutex.
>
> Also, we need to ensure that blk_mq_map_swqueue() is called with
> all_q_mutex is held. Since hctx->nr_ctx is reset temporarily and
> updated in blk_mq_map_swqueue(), so we should avoid
> blk_mq_register_hctx() seeing the temporary hctx->nr_ctx value
> in CPU hotplug handling or adding/deleting gendisk .
>
> Signed-off-by: Akinobu Mita <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Ming Lei <[email protected]>

Reviewed-by: Ming Lei <[email protected]>

> ---
> block/blk-mq-sysfs.c | 30 ++++++++++++++++++++++--------
> block/blk-mq.c | 6 +++---
> include/linux/blk-mq.h | 1 -
> include/linux/blkdev.h | 2 ++
> 4 files changed, 27 insertions(+), 12 deletions(-)
>
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index b79685e..79096a6 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -332,7 +332,7 @@ static void blk_mq_unregister_hctx(struct blk_mq_hw_ctx *hctx)
> struct blk_mq_ctx *ctx;
> int i;
>
> - if (!hctx->nr_ctx || !(hctx->flags & BLK_MQ_F_SYSFS_UP))
> + if (!hctx->nr_ctx)
> return;
>
> hctx_for_each_ctx(hctx, ctx, i)
> @@ -347,7 +347,7 @@ static int blk_mq_register_hctx(struct blk_mq_hw_ctx *hctx)
> struct blk_mq_ctx *ctx;
> int i, ret;
>
> - if (!hctx->nr_ctx || !(hctx->flags & BLK_MQ_F_SYSFS_UP))
> + if (!hctx->nr_ctx)
> return 0;
>
> ret = kobject_add(&hctx->kobj, &q->mq_kobj, "%u", hctx->queue_num);
> @@ -370,6 +370,8 @@ void blk_mq_unregister_disk(struct gendisk *disk)
> struct blk_mq_ctx *ctx;
> int i, j;
>
> + blk_mq_disable_hotplug();
> +
> queue_for_each_hw_ctx(q, hctx, i) {
> blk_mq_unregister_hctx(hctx);
>
> @@ -384,6 +386,9 @@ void blk_mq_unregister_disk(struct gendisk *disk)
> kobject_put(&q->mq_kobj);
>
> kobject_put(&disk_to_dev(disk)->kobj);
> +
> + q->mq_sysfs_init_done = false;
> + blk_mq_enable_hotplug();
> }
>
> static void blk_mq_sysfs_init(struct request_queue *q)
> @@ -414,27 +419,30 @@ int blk_mq_register_disk(struct gendisk *disk)
> struct blk_mq_hw_ctx *hctx;
> int ret, i;
>
> + blk_mq_disable_hotplug();
> +
> blk_mq_sysfs_init(q);
>
> ret = kobject_add(&q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq");
> if (ret < 0)
> - return ret;
> + goto out;
>
> kobject_uevent(&q->mq_kobj, KOBJ_ADD);
>
> queue_for_each_hw_ctx(q, hctx, i) {
> - hctx->flags |= BLK_MQ_F_SYSFS_UP;
> ret = blk_mq_register_hctx(hctx);
> if (ret)
> break;
> }
>
> - if (ret) {
> + if (ret)
> blk_mq_unregister_disk(disk);
> - return ret;
> - }
> + else
> + q->mq_sysfs_init_done = true;
> +out:
> + blk_mq_enable_hotplug();
>
> - return 0;
> + return ret;
> }
> EXPORT_SYMBOL_GPL(blk_mq_register_disk);
>
> @@ -443,6 +451,9 @@ void blk_mq_sysfs_unregister(struct request_queue *q)
> struct blk_mq_hw_ctx *hctx;
> int i;
>
> + if (!q->mq_sysfs_init_done)
> + return;
> +
> queue_for_each_hw_ctx(q, hctx, i)
> blk_mq_unregister_hctx(hctx);
> }
> @@ -452,6 +463,9 @@ int blk_mq_sysfs_register(struct request_queue *q)
> struct blk_mq_hw_ctx *hctx;
> int i, ret = 0;
>
> + if (!q->mq_sysfs_init_done)
> + return ret;
> +
> queue_for_each_hw_ctx(q, hctx, i) {
> ret = blk_mq_register_hctx(hctx);
> if (ret)
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f29f766..68921b7 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2045,13 +2045,13 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
> goto err_hctxs;
>
> mutex_lock(&all_q_mutex);
> - list_add_tail(&q->all_q_node, &all_q_list);
> - mutex_unlock(&all_q_mutex);
>
> + list_add_tail(&q->all_q_node, &all_q_list);
> blk_mq_add_queue_tag_set(set, q);
> -
> blk_mq_map_swqueue(q);
>
> + mutex_unlock(&all_q_mutex);
> +
> return q;
>
> err_hctxs:
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 37d1602..b80ba45 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -145,7 +145,6 @@ enum {
> BLK_MQ_F_SHOULD_MERGE = 1 << 0,
> BLK_MQ_F_TAG_SHARED = 1 << 1,
> BLK_MQ_F_SG_MERGE = 1 << 2,
> - BLK_MQ_F_SYSFS_UP = 1 << 3,
> BLK_MQ_F_DEFER_ISSUE = 1 << 4,
> BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
> BLK_MQ_F_ALLOC_POLICY_BITS = 1,
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index d4068c1..b02c90b 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -462,6 +462,8 @@ struct request_queue {
>
> struct blk_mq_tag_set *tag_set;
> struct list_head tag_set_list;
> +
> + bool mq_sysfs_init_done;
> };
>
> #define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */
> --
> 1.9.1
>



--
Ming Lei

2015-07-19 12:12:21

by Ming Lei

Subject: Re: [PATCH v3 5/7] blk-mq: avoid inserting requests before establishing new mapping

On Sun, Jul 19, 2015 at 12:28 AM, Akinobu Mita <[email protected]> wrote:
> Notifier callbacks for CPU_ONLINE action can be run on the other CPU
> than the CPU which was just onlined. So it is possible for the
> process running on the just onlined CPU to insert request and run
> hw queue before establishing new mapping which is done by
> blk_mq_queue_reinit_notify().
>
> This can cause a problem when the CPU has just been onlined first time
> since the request queue was initialized. At this time ctx->index_hw
> for the CPU, which is the index in hctx->ctxs[] for this ctx, is still
> zero before blk_mq_queue_reinit_notify() is called by notifier
> callbacks for CPU_ONLINE action.
>
> For example, there is a single hw queue (hctx) and two CPU queues
> (ctx0 for CPU0, and ctx1 for CPU1). Now CPU1 is just onlined and
> a request is inserted into ctx1->rq_list and set bit0 in pending
> bitmap as ctx1->index_hw is still zero.
>
> And then while running hw queue, flush_busy_ctxs() finds bit0 is set
> in pending bitmap and tries to retrieve requests in
> hctx->ctxs[0]->rq_list. But htx->ctxs[0] is a pointer to ctx0, so the
> request in ctx1->rq_list is ignored.
>
> Fix it by ensuring that new mapping is established before onlined cpu
> starts running.
>
> Signed-off-by: Akinobu Mita <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Ming Lei <[email protected]>

Reviewed-by: Ming Lei <[email protected]>


> ---
> block/blk-mq-cpumap.c | 9 ++++----
> block/blk-mq.c | 59 +++++++++++++++++++++++++++++++++++++++------------
> block/blk-mq.h | 3 ++-
> 3 files changed, 52 insertions(+), 19 deletions(-)
>
> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> index 1e28ddb..8764c24 100644
> --- a/block/blk-mq-cpumap.c
> +++ b/block/blk-mq-cpumap.c
> @@ -31,7 +31,8 @@ static int get_first_sibling(unsigned int cpu)
> return cpu;
> }
>
> -int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues)
> +int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
> + const struct cpumask *online_mask)
> {
> unsigned int i, nr_cpus, nr_uniq_cpus, queue, first_sibling;
> cpumask_var_t cpus;
> @@ -41,7 +42,7 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues)
>
> cpumask_clear(cpus);
> nr_cpus = nr_uniq_cpus = 0;
> - for_each_online_cpu(i) {
> + for_each_cpu(i, online_mask) {
> nr_cpus++;
> first_sibling = get_first_sibling(i);
> if (!cpumask_test_cpu(first_sibling, cpus))
> @@ -51,7 +52,7 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues)
>
> queue = 0;
> for_each_possible_cpu(i) {
> - if (!cpu_online(i)) {
> + if (!cpumask_test_cpu(i, online_mask)) {
> map[i] = 0;
> continue;
> }
> @@ -95,7 +96,7 @@ unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set)
> if (!map)
> return NULL;
>
> - if (!blk_mq_update_queue_map(map, set->nr_hw_queues))
> + if (!blk_mq_update_queue_map(map, set->nr_hw_queues, cpu_online_mask))
> return map;
>
> kfree(map);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d5d93e0..d861c70 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1799,7 +1799,8 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
> }
> }
>
> -static void blk_mq_map_swqueue(struct request_queue *q)
> +static void blk_mq_map_swqueue(struct request_queue *q,
> + const struct cpumask *online_mask)
> {
> unsigned int i;
> struct blk_mq_hw_ctx *hctx;
> @@ -1816,7 +1817,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
> */
> queue_for_each_ctx(q, ctx, i) {
> /* If the cpu isn't online, the cpu is mapped to first hctx */
> - if (!cpu_online(i))
> + if (!cpumask_test_cpu(i, online_mask))
> continue;
>
> hctx = q->mq_ops->map_queue(q, i);
> @@ -1862,7 +1863,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
> }
>
> queue_for_each_ctx(q, ctx, i) {
> - if (!cpu_online(i))
> + if (!cpumask_test_cpu(i, online_mask))
> continue;
>
> hctx = q->mq_ops->map_queue(q, i);
> @@ -2047,13 +2048,15 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
> if (blk_mq_init_hw_queues(q, set))
> goto err_hctxs;
>
> + get_online_cpus();
> mutex_lock(&all_q_mutex);
>
> list_add_tail(&q->all_q_node, &all_q_list);
> blk_mq_add_queue_tag_set(set, q);
> - blk_mq_map_swqueue(q);
> + blk_mq_map_swqueue(q, cpu_online_mask);
>
> mutex_unlock(&all_q_mutex);
> + put_online_cpus();
>
> return q;
>
> @@ -2090,13 +2093,14 @@ void blk_mq_free_queue(struct request_queue *q)
> }
>
> /* Basically redo blk_mq_init_queue with queue frozen */
> -static void blk_mq_queue_reinit(struct request_queue *q)
> +static void blk_mq_queue_reinit(struct request_queue *q,
> + const struct cpumask *online_mask)
> {
> WARN_ON_ONCE(!atomic_read(&q->mq_freeze_depth));
>
> blk_mq_sysfs_unregister(q);
>
> - blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues);
> + blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues, online_mask);
>
> /*
> * redo blk_mq_init_cpu_queues and blk_mq_init_hw_queues. FIXME: maybe
> @@ -2104,7 +2108,7 @@ static void blk_mq_queue_reinit(struct request_queue *q)
> * involves free and re-allocate memory, worthy doing?)
> */
>
> - blk_mq_map_swqueue(q);
> + blk_mq_map_swqueue(q, online_mask);
>
> blk_mq_sysfs_register(q);
> }
> @@ -2113,16 +2117,43 @@ static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
> unsigned long action, void *hcpu)
> {
> struct request_queue *q;
> + int cpu = (unsigned long)hcpu;
> + /*
> + * New online cpumask which is going to be set in this hotplug event.
> + * Declare this cpumasks as global as cpu-hotplug operation is invoked
> + * one-by-one and dynamically allocating this could result in a failure.
> + */
> + static struct cpumask online_new;
>
> /*
> - * Before new mappings are established, hotadded cpu might already
> - * start handling requests. This doesn't break anything as we map
> - * offline CPUs to first hardware queue. We will re-init the queue
> - * below to get optimal settings.
> + * Before hotadded cpu starts handling requests, new mappings must
> + * be established. Otherwise, these requests in hw queue might
> + * never be dispatched.
> + *
> + * For example, there is a single hw queue (hctx) and two CPU queues
> + * (ctx0 for CPU0, and ctx1 for CPU1).
> + *
> + * Now CPU1 is just onlined and a request is inserted into
> + * ctx1->rq_list and set bit0 in pending bitmap as ctx1->index_hw is
> + * still zero.
> + *
> + * And then while running hw queue, flush_busy_ctxs() finds bit0 is
> + * set in pending bitmap and tries to retrieve requests in
> + * hctx->ctxs[0]->rq_list. But htx->ctxs[0] is a pointer to ctx0,
> + * so the request in ctx1->rq_list is ignored.
> */
> - if (action != CPU_DEAD && action != CPU_DEAD_FROZEN &&
> - action != CPU_ONLINE && action != CPU_ONLINE_FROZEN)
> + switch (action & ~CPU_TASKS_FROZEN) {
> + case CPU_DEAD:
> + case CPU_UP_CANCELED:
> + cpumask_copy(&online_new, cpu_online_mask);
> + break;
> + case CPU_UP_PREPARE:
> + cpumask_copy(&online_new, cpu_online_mask);
> + cpumask_set_cpu(cpu, &online_new);
> + break;
> + default:
> return NOTIFY_OK;
> + }
>
> mutex_lock(&all_q_mutex);
>
> @@ -2146,7 +2177,7 @@ static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
> }
>
> list_for_each_entry(q, &all_q_list, all_q_node)
> - blk_mq_queue_reinit(q);
> + blk_mq_queue_reinit(q, &online_new);
>
> list_for_each_entry(q, &all_q_list, all_q_node)
> blk_mq_unfreeze_queue(q);
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 6a48c4c..f4fea79 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -51,7 +51,8 @@ void blk_mq_disable_hotplug(void);
> * CPU -> queue mappings
> */
> extern unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set);
> -extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues);
> +extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
> + const struct cpumask *online_mask);
> extern int blk_mq_hw_queue_to_node(unsigned int *map, unsigned int);
>
> /*
> --
> 1.9.1
>



--
Ming Lei

2015-07-19 12:13:02

by Ming Lei

Subject: Re: [PATCH v3 6/7] blk-mq: fix freeze queue race

On Sun, Jul 19, 2015 at 12:28 AM, Akinobu Mita <[email protected]> wrote:
> There are several race conditions while freezing queue.
>
> When unfreezing queue, there is a small window between decrementing
> q->mq_freeze_depth to zero and percpu_ref_reinit() call with
> q->mq_usage_counter. If the other calls blk_mq_freeze_queue_start()
> in the window, q->mq_freeze_depth is increased from zero to one and
> percpu_ref_kill() is called with q->mq_usage_counter which is already
> killed. percpu refcount should be re-initialized before killed again.
>
> Also, there is a race condition while switching to percpu mode.
> percpu_ref_switch_to_percpu() and percpu_ref_kill() must not be
> executed at the same time as the following scenario is possible:
>
> 1. q->mq_usage_counter is initialized in atomic mode.
> (atomic counter: 1)
>
> 2. After the disk registration, a process like systemd-udev starts
> accessing the disk, and successfully increases refcount successfully
> by percpu_ref_tryget_live() in blk_mq_queue_enter().
> (atomic counter: 2)
>
> 3. In the final stage of initialization, q->mq_usage_counter is being
> switched to percpu mode by percpu_ref_switch_to_percpu() in
> blk_mq_finish_init(). But if CONFIG_PREEMPT_VOLUNTARY is enabled,
> the process is rescheduled in the middle of switching when calling
> wait_event() in __percpu_ref_switch_to_percpu().
> (atomic counter: 2)
>
> 4. CPU hotplug handling for blk-mq calls percpu_ref_kill() to freeze
> request queue. q->mq_usage_counter is decreased and marked as
> DEAD. Wait until all requests have finished.
> (atomic counter: 1)
>
> 5. The process rescheduled in the step 3. is resumed and finishes
> all remaining work in __percpu_ref_switch_to_percpu().
> A bias value is added to atomic counter of q->mq_usage_counter.
> (atomic counter: PERCPU_COUNT_BIAS + 1)
>
> 6. A request issed in the step 2. is finished and q->mq_usage_counter
> is decreased by blk_mq_queue_exit(). q->mq_usage_counter is DEAD,
> so atomic counter is decreased and no release handler is called.
> (atomic counter: PERCPU_COUNT_BIAS)
>
> 7. CPU hotplug handling in the step 4. will wait forever as
> q->mq_usage_counter will never be zero.
>
> Also, percpu_ref_reinit() and percpu_ref_kill() must not be executed
> at the same time. Because both functions could call
> __percpu_ref_switch_to_percpu() which adds the bias value and
> initialize percpu counter.
>
> Fix those races by serializing with per-queue mutex.
>
> Signed-off-by: Akinobu Mita <[email protected]>
> Acked-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Ming Lei <[email protected]>
> Cc: Tejun Heo <[email protected]>

Reviewed-by: Ming Lei <[email protected]>

> ---
> block/blk-core.c | 1 +
> block/blk-mq-sysfs.c | 2 ++
> block/blk-mq.c | 8 ++++++++
> include/linux/blkdev.h | 6 ++++++
> 4 files changed, 17 insertions(+)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 627ed0c..544b237 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -687,6 +687,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
> __set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);
>
> init_waitqueue_head(&q->mq_freeze_wq);
> + mutex_init(&q->mq_freeze_lock);
>
> if (blkcg_init_queue(q))
> goto fail_bdi;
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index 79096a6..f63b464 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -409,7 +409,9 @@ static void blk_mq_sysfs_init(struct request_queue *q)
> /* see blk_register_queue() */
> void blk_mq_finish_init(struct request_queue *q)
> {
> + mutex_lock(&q->mq_freeze_lock);
> percpu_ref_switch_to_percpu(&q->mq_usage_counter);
> + mutex_unlock(&q->mq_freeze_lock);
> }
>
> int blk_mq_register_disk(struct gendisk *disk)
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d861c70..b931e38 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -115,11 +115,15 @@ void blk_mq_freeze_queue_start(struct request_queue *q)
> {
> int freeze_depth;
>
> + mutex_lock(&q->mq_freeze_lock);
> +
> freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
> if (freeze_depth == 1) {
> percpu_ref_kill(&q->mq_usage_counter);
> blk_mq_run_hw_queues(q, false);
> }
> +
> + mutex_unlock(&q->mq_freeze_lock);
> }
> EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start);
>
> @@ -143,12 +147,16 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
> {
> int freeze_depth;
>
> + mutex_lock(&q->mq_freeze_lock);
> +
> freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
> WARN_ON_ONCE(freeze_depth < 0);
> if (!freeze_depth) {
> percpu_ref_reinit(&q->mq_usage_counter);
> wake_up_all(&q->mq_freeze_wq);
> }
> +
> + mutex_unlock(&q->mq_freeze_lock);
> }
> EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index b02c90b..b867c32 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -457,6 +457,12 @@ struct request_queue {
> #endif
> struct rcu_head rcu_head;
> wait_queue_head_t mq_freeze_wq;
> + /*
> + * Protect concurrent access to mq_usage_counter by
> + * percpu_ref_switch_to_percpu(), percpu_ref_kill(), and
> + * percpu_ref_reinit().
> + */
> + struct mutex mq_freeze_lock;
> struct percpu_ref mq_usage_counter;
> struct list_head all_q_node;
>
> --
> 1.9.1
>



--
Ming Lei

2015-07-19 12:31:01

by Ming Lei

Subject: Re: [PATCH v3 7/7] blk-mq: fix deadlock when reading cpu_list

On Sun, Jul 19, 2015 at 12:28 AM, Akinobu Mita <[email protected]> wrote:
> CPU hotplug handling for blk-mq (blk_mq_queue_reinit) acquires
> all_q_mutex in blk_mq_queue_reinit_notify() and then removes sysfs
> entries by blk_mq_sysfs_unregister(). Removing sysfs entry needs to
> be blocked until the active reference of the kernfs_node to be zero.
>
> On the other hand, reading blk_mq_hw_sysfs_cpu sysfs entry (e.g.
> /sys/block/nullb0/mq/0/cpu_list) acquires all_q_mutex in
> blk_mq_hw_sysfs_cpus_show().
>
> If these happen at the same time, a deadlock can happen. Because one
> can wait for the active reference to be zero with holding all_q_mutex,
> and the other tries to acquire all_q_mutex with holding the active
> reference.
>
> The reason that all_q_mutex is acquired in blk_mq_hw_sysfs_cpus_show()
> is to avoid reading an imcomplete hctx->cpumask. Since reading sysfs
> entry for blk-mq needs to acquire q->sysfs_lock, we can avoid deadlock
> and reading an imcomplete hctx->cpumask by protecting q->sysfs_lock
> while hctx->cpumask is being updated.
>
> Signed-off-by: Akinobu Mita <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Ming Lei <[email protected]>

Reviewed-by: Ming Lei <[email protected]>

> ---
> block/blk-mq-sysfs.c | 4 ----
> block/blk-mq.c | 7 +++++++
> 2 files changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index f63b464..e0f71bf 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -218,8 +218,6 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
> unsigned int i, first = 1;
> ssize_t ret = 0;
>
> - blk_mq_disable_hotplug();
> -
> for_each_cpu(i, hctx->cpumask) {
> if (first)
> ret += sprintf(ret + page, "%u", i);
> @@ -229,8 +227,6 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
> first = 0;
> }
>
> - blk_mq_enable_hotplug();
> -
> ret += sprintf(ret + page, "\n");
> return ret;
> }
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b931e38..1a5e7d1 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1815,6 +1815,11 @@ static void blk_mq_map_swqueue(struct request_queue *q,
> struct blk_mq_ctx *ctx;
> struct blk_mq_tag_set *set = q->tag_set;
>
> + /*
> + * Avoid others reading imcomplete hctx->cpumask through sysfs
> + */
> + mutex_lock(&q->sysfs_lock);
> +
> queue_for_each_hw_ctx(q, hctx, i) {
> cpumask_clear(hctx->cpumask);
> hctx->nr_ctx = 0;
> @@ -1834,6 +1839,8 @@ static void blk_mq_map_swqueue(struct request_queue *q,
> hctx->ctxs[hctx->nr_ctx++] = ctx;
> }
>
> + mutex_unlock(&q->sysfs_lock);
> +
> queue_for_each_hw_ctx(q, hctx, i) {
> struct blk_mq_ctxmap *map = &hctx->ctx_map;
>
> --
> 1.9.1
>



--
Ming Lei

2015-07-21 13:58:54

by Akinobu Mita

Subject: Re: [PATCH v3 1/7] blk-mq: avoid access hctx->tags->cpumask before allocation

On Sun, 2015-07-19 at 18:24 +0800, Ming Lei wrote:
> On Sun, Jul 19, 2015 at 12:28 AM, Akinobu Mita <[email protected]> wrote:
> > When unmapped hw queue is remapped after CPU topology is changed,
> > hctx->tags->cpumask is set before hctx->tags is allocated in
> > blk_mq_map_swqueue().
> >
> > In order to fix this null pointer dereference, hctx->tags must be
> > allocated before configuring hctx->tags->cpumask.
>
> The root cause should be that the mapping between hctx and ctx
> can be changed after CPU topo is changed, then hctx->tags can
> be changed too, so hctx->tags->cpumask has to be set after
> hctx->tags is setup.
>
> >
> > Fixes: f26cdc8536 ("blk-mq: Shared tag enhancements")
>
> I am wondering if the above commit considers CPU hotplug, and
> nvme uses tag->cpumask to set irq affinity hint just during
> starting queue. Looks it should be reasonalbe to
> introduce one callback of mapping_changed() for handling
> this kind of stuff. But this isn't related with this patch.
>
> > Signed-off-by: Akinobu Mita <[email protected]>
> > Cc: Keith Busch <[email protected]>
> > Cc: Jens Axboe <[email protected]>
> > Cc: Ming Lei <[email protected]>
> > ---
> > block/blk-mq.c | 9 ++++++++-
> > 1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 7d842db..f29f766 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1821,7 +1821,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
> >
> > hctx = q->mq_ops->map_queue(q, i);
> > cpumask_set_cpu(i, hctx->cpumask);
> > - cpumask_set_cpu(i, hctx->tags->cpumask);
> > ctx->index_hw = hctx->nr_ctx;
> > hctx->ctxs[hctx->nr_ctx++] = ctx;
> > }
> > @@ -1861,6 +1860,14 @@ static void blk_mq_map_swqueue(struct request_queue *q)
> > hctx->next_cpu = cpumask_first(hctx->cpumask);
> > hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
> > }
> > +
> > + queue_for_each_ctx(q, ctx, i) {
> > + if (!cpu_online(i))
> > + continue;
> > +
> > + hctx = q->mq_ops->map_queue(q, i);
> > + cpumask_set_cpu(i, hctx->tags->cpumask);
>
> If tags->cpumask is always same with hctx->cpumaks, this
> CPU iterator can be avoided.

How about this patch?
Or should we use cpumask_or() instead of cpumask_copy()?

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7d842db..56f814a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1821,7 +1821,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)

hctx = q->mq_ops->map_queue(q, i);
cpumask_set_cpu(i, hctx->cpumask);
- cpumask_set_cpu(i, hctx->tags->cpumask);
ctx->index_hw = hctx->nr_ctx;
hctx->ctxs[hctx->nr_ctx++] = ctx;
}
@@ -1846,7 +1845,10 @@ static void blk_mq_map_swqueue(struct request_queue *q)
if (!set->tags[i])
set->tags[i] = blk_mq_init_rq_map(set, i);
hctx->tags = set->tags[i];
- WARN_ON(!hctx->tags);
+ if (hctx->tags)
+ cpumask_copy(hctx->tags->cpumask, hctx->cpumask);
+ else
+ WARN_ON(1);

/*
* Set the map size to the number of mapped software queues.

2015-07-27 09:49:46

by Ming Lei

Subject: Re: [PATCH v3 1/7] blk-mq: avoid access hctx->tags->cpumask before allocation

On Tue, Jul 21, 2015 at 9:58 AM, Akinobu Mita <[email protected]> wrote:
> On Sun, 2015-07-19 at 18:24 +0800, Ming Lei wrote:
>> On Sun, Jul 19, 2015 at 12:28 AM, Akinobu Mita <[email protected]> wrote:
>> > When unmapped hw queue is remapped after CPU topology is changed,
>> > hctx->tags->cpumask is set before hctx->tags is allocated in
>> > blk_mq_map_swqueue().
>> >
>> > In order to fix this null pointer dereference, hctx->tags must be
>> > allocated before configuring hctx->tags->cpumask.
>>
>> The root cause should be that the mapping between hctx and ctx
>> can be changed after CPU topo is changed, then hctx->tags can
>> be changed too, so hctx->tags->cpumask has to be set after
>> hctx->tags is setup.
>>
>> >
>> > Fixes: f26cdc8536 ("blk-mq: Shared tag enhancements")
>>
>> I am wondering if the above commit considers CPU hotplug, and
>> nvme uses tag->cpumask to set irq affinity hint just during
>> starting queue. Looks it should be reasonalbe to
>> introduce one callback of mapping_changed() for handling
>> this kind of stuff. But this isn't related with this patch.
>>
>> > Signed-off-by: Akinobu Mita <[email protected]>
>> > Cc: Keith Busch <[email protected]>
>> > Cc: Jens Axboe <[email protected]>
>> > Cc: Ming Lei <[email protected]>
>> > ---
>> > block/blk-mq.c | 9 ++++++++-
>> > 1 file changed, 8 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/block/blk-mq.c b/block/blk-mq.c
>> > index 7d842db..f29f766 100644
>> > --- a/block/blk-mq.c
>> > +++ b/block/blk-mq.c
>> > @@ -1821,7 +1821,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
>> >
>> > hctx = q->mq_ops->map_queue(q, i);
>> > cpumask_set_cpu(i, hctx->cpumask);
>> > - cpumask_set_cpu(i, hctx->tags->cpumask);
>> > ctx->index_hw = hctx->nr_ctx;
>> > hctx->ctxs[hctx->nr_ctx++] = ctx;
>> > }
>> > @@ -1861,6 +1860,14 @@ static void blk_mq_map_swqueue(struct request_queue *q)
>> > hctx->next_cpu = cpumask_first(hctx->cpumask);
>> > hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
>> > }
>> > +
>> > + queue_for_each_ctx(q, ctx, i) {
>> > + if (!cpu_online(i))
>> > + continue;
>> > +
>> > + hctx = q->mq_ops->map_queue(q, i);
>> > + cpumask_set_cpu(i, hctx->tags->cpumask);
>>
>> If tags->cpumask is always same with hctx->cpumaks, this
>> CPU iterator can be avoided.
>
> How about this patch?
> Or should we use cpumask_or() instead of cpumask_copy()?

I guess tags->cpumask will need to be fixed in the future, so it is
better to just take the current patch:

[PATCH v3 1/7] blk-mq: avoid access hctx->tags->cpumask before allocation

Thanks,

>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 7d842db..56f814a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1821,7 +1821,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
>
> hctx = q->mq_ops->map_queue(q, i);
> cpumask_set_cpu(i, hctx->cpumask);
> - cpumask_set_cpu(i, hctx->tags->cpumask);
> ctx->index_hw = hctx->nr_ctx;
> hctx->ctxs[hctx->nr_ctx++] = ctx;
> }
> @@ -1846,7 +1845,10 @@ static void blk_mq_map_swqueue(struct request_queue *q)
> if (!set->tags[i])
> set->tags[i] = blk_mq_init_rq_map(set, i);
> hctx->tags = set->tags[i];
> - WARN_ON(!hctx->tags);
> + if (hctx->tags)
> + cpumask_copy(hctx->tags->cpumask, hctx->cpumask);
> + else
> + WARN_ON(1);
>
> /*
> * Set the map size to the number of mapped software queues.
>
>



--
Ming Lei

2015-07-28 08:10:36

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH v3 7/7] blk-mq: fix deadlock when reading cpu_list



On 7/19/15 12:28 AM, Akinobu Mita wrote:
> CPU hotplug handling for blk-mq (blk_mq_queue_reinit) acquires
> all_q_mutex in blk_mq_queue_reinit_notify() and then removes sysfs
> entries by blk_mq_sysfs_unregister(). Removing a sysfs entry has to
> block until the active reference count of the kernfs_node drops to zero.
>
> On the other hand, reading blk_mq_hw_sysfs_cpu sysfs entry (e.g.
> /sys/block/nullb0/mq/0/cpu_list) acquires all_q_mutex in
> blk_mq_hw_sysfs_cpus_show().
>
> If these happen at the same time, a deadlock can occur, because one
> side waits for the active reference to drop to zero while holding
> all_q_mutex, and the other tries to acquire all_q_mutex while holding
> the active reference.
>
> The reason that all_q_mutex is acquired in blk_mq_hw_sysfs_cpus_show()
> is to avoid reading an incomplete hctx->cpumask. Since reading a blk-mq
> sysfs entry needs to acquire q->sysfs_lock, we can avoid both the
> deadlock and reading an incomplete hctx->cpumask by holding
> q->sysfs_lock while hctx->cpumask is being updated.

After your patch, dumping the ctx attrs will be mutually exclusive with
the software-to-hardware queue mapping operation, which may be confusing
to people.

Regards,
Wanpeng Li

>
> Signed-off-by: Akinobu Mita <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Ming Lei <[email protected]>
> ---
> block/blk-mq-sysfs.c | 4 ----
> block/blk-mq.c | 7 +++++++
> 2 files changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index f63b464..e0f71bf 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -218,8 +218,6 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
> unsigned int i, first = 1;
> ssize_t ret = 0;
>
> - blk_mq_disable_hotplug();
> -
> for_each_cpu(i, hctx->cpumask) {
> if (first)
> ret += sprintf(ret + page, "%u", i);
> @@ -229,8 +227,6 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
> first = 0;
> }
>
> - blk_mq_enable_hotplug();
> -
> ret += sprintf(ret + page, "\n");
> return ret;
> }
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b931e38..1a5e7d1 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1815,6 +1815,11 @@ static void blk_mq_map_swqueue(struct request_queue *q,
> struct blk_mq_ctx *ctx;
> struct blk_mq_tag_set *set = q->tag_set;
>
> + /*
> + * Avoid others reading incomplete hctx->cpumask through sysfs
> + */
> + mutex_lock(&q->sysfs_lock);
> +
> queue_for_each_hw_ctx(q, hctx, i) {
> cpumask_clear(hctx->cpumask);
> hctx->nr_ctx = 0;
> @@ -1834,6 +1839,8 @@ static void blk_mq_map_swqueue(struct request_queue *q,
> hctx->ctxs[hctx->nr_ctx++] = ctx;
> }
>
> + mutex_unlock(&q->sysfs_lock);
> +
> queue_for_each_hw_ctx(q, hctx, i) {
> struct blk_mq_ctxmap *map = &hctx->ctx_map;
>
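
To make the deadlock described in the quoted changelog concrete, here is a
toy userspace model (all names are illustrative, not kernel API; a pthread
mutex stands in for all_q_mutex and an atomic counter for the kernfs active
reference, and the watchdog in main() reports the cycle instead of hanging):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t all_q_mutex = PTHREAD_MUTEX_INITIALIZER;
static atomic_int active_ref = 1;	/* a reader is already inside ->show() */
static atomic_int hotplug_done, reader_done;

/* models blk_mq_queue_reinit_notify() -> blk_mq_sysfs_unregister() */
static void *hotplug_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&all_q_mutex);
	while (atomic_load(&active_ref) != 0)	/* wait for the reader to leave */
		usleep(1000);
	atomic_store(&hotplug_done, 1);
	pthread_mutex_unlock(&all_q_mutex);
	return NULL;
}

/* models blk_mq_hw_sysfs_cpus_show(), which took all_q_mutex
 * via blk_mq_disable_hotplug() before this patch */
static void *reader_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&all_q_mutex);	/* blocks: hotplug path owns it */
	pthread_mutex_unlock(&all_q_mutex);
	atomic_store(&active_ref, 0);		/* leave ->show(), drop the ref */
	atomic_store(&reader_done, 1);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, hotplug_path, NULL);
	usleep(10000);				/* let the hotplug path take the mutex first */
	pthread_create(&b, NULL, reader_path, NULL);

	sleep(1);
	if (!atomic_load(&hotplug_done) && !atomic_load(&reader_done))
		puts("deadlock: hotplug holds all_q_mutex and waits for active_ref;\n"
		     "          reader holds active_ref and waits for all_q_mutex");
	_exit(0);				/* do not try to join the stuck threads */
}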

2015-07-28 23:37:34

by Akinobu Mita

[permalink] [raw]
Subject: Re: [PATCH v3 7/7] blk-mq: fix deadlock when reading cpu_list

2015-07-28 17:10 GMT+09:00 Wanpeng Li <[email protected]>:
>
>
> On 7/19/15 12:28 AM, Akinobu Mita wrote:
>>
>> CPU hotplug handling for blk-mq (blk_mq_queue_reinit) acquires
>> all_q_mutex in blk_mq_queue_reinit_notify() and then removes sysfs
>> entries by blk_mq_sysfs_unregister(). Removing a sysfs entry has to
>> block until the active reference count of the kernfs_node drops to zero.
>>
>> On the other hand, reading blk_mq_hw_sysfs_cpu sysfs entry (e.g.
>> /sys/block/nullb0/mq/0/cpu_list) acquires all_q_mutex in
>> blk_mq_hw_sysfs_cpus_show().
>>
>> If these happen at the same time, a deadlock can occur, because one
>> side waits for the active reference to drop to zero while holding
>> all_q_mutex, and the other tries to acquire all_q_mutex while holding
>> the active reference.
>>
>> The reason that all_q_mutex is acquired in blk_mq_hw_sysfs_cpus_show()
>> is to avoid reading an incomplete hctx->cpumask. Since reading a blk-mq
>> sysfs entry needs to acquire q->sysfs_lock, we can avoid both the
>> deadlock and reading an incomplete hctx->cpumask by holding
>> q->sysfs_lock while hctx->cpumask is being updated.
>
>
> After your patch, dumping the ctx attrs will be mutually exclusive with
> the software-to-hardware queue mapping operation, which may be confusing
> to people.

Are you suggesting that we shouldn't take q->sysfs_lock in
blk_mq_map_swqueue() in this patch?

>>
>> Signed-off-by: Akinobu Mita <[email protected]>
>> Cc: Jens Axboe <[email protected]>
>> Cc: Ming Lei <[email protected]>
>>
>> ---
>> block/blk-mq-sysfs.c | 4 ----
>> block/blk-mq.c | 7 +++++++
>> 2 files changed, 7 insertions(+), 4 deletions(-)
>>
>> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
>> index f63b464..e0f71bf 100644
>> --- a/block/blk-mq-sysfs.c
>> +++ b/block/blk-mq-sysfs.c
>> @@ -218,8 +218,6 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
>> unsigned int i, first = 1;
>> ssize_t ret = 0;
>> - blk_mq_disable_hotplug();
>> -
>> for_each_cpu(i, hctx->cpumask) {
>> if (first)
>> ret += sprintf(ret + page, "%u", i);
>> @@ -229,8 +227,6 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
>> first = 0;
>> }
>> - blk_mq_enable_hotplug();
>> -
>> ret += sprintf(ret + page, "\n");
>> return ret;
>> }
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index b931e38..1a5e7d1 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -1815,6 +1815,11 @@ static void blk_mq_map_swqueue(struct request_queue *q,
>> struct blk_mq_ctx *ctx;
>> struct blk_mq_tag_set *set = q->tag_set;
>> + /*
>> + * Avoid others reading incomplete hctx->cpumask through sysfs
>> + */
>> + mutex_lock(&q->sysfs_lock);
>> +
>> queue_for_each_hw_ctx(q, hctx, i) {
>> cpumask_clear(hctx->cpumask);
>> hctx->nr_ctx = 0;
>> @@ -1834,6 +1839,8 @@ static void blk_mq_map_swqueue(struct request_queue *q,
>> hctx->ctxs[hctx->nr_ctx++] = ctx;
>> }
>> + mutex_unlock(&q->sysfs_lock);
>> +
>> queue_for_each_hw_ctx(q, hctx, i) {
>> struct blk_mq_ctxmap *map = &hctx->ctx_map;
>>
>
>