2015-06-21 13:53:00

by Akinobu Mita

Subject: [PATCH 0/4] blk-mq: fix race conditions on cpu hotplug handling

This patchset addresses several race conditions in cpu hotplug handling
for blk-mq. All of the problems are reproducible with the following script.

while true; do
	echo 0 > /sys/devices/system/cpu/cpu1/online
	echo 1 > /sys/devices/system/cpu/cpu1/online
done &

while true; do
	modprobe -r null_blk
	modprobe null_blk queue_mode=2 irqmode=1
	sleep 0.1
done

Akinobu Mita (4):
blk-mq: fix sysfs registration/unregistration race
blk-mq: fix q->mq_map access race
blk-mq: establish new mapping before cpu starts handling requests
blk-mq: fix mq_usage_counter race when switching to percpu mode

block/blk-mq-cpumap.c | 9 ++++----
block/blk-mq-sysfs.c | 21 +++++++++++++++++--
block/blk-mq.c | 56 +++++++++++++++++++++++++++++++++-----------------
block/blk-mq.h | 3 ++-
include/linux/blk-mq.h | 1 +
include/linux/blkdev.h | 6 ++++++
6 files changed, 70 insertions(+), 26 deletions(-)

--
1.9.1


2015-06-21 13:53:19

by Akinobu Mita

Subject: [PATCH 1/4] blk-mq: fix sysfs registration/unregistration race

There is a race between cpu hotplug handling and adding/deleting
gendisk for blk-mq, where both are trying to register and unregister
the same sysfs entries.

CPU hotplug handling for blk-mq (blk_mq_queue_reinit) unregisters and
re-registers the hctx sysfs entries for all request queues in
all_q_list. On the other hand, those entries for a request queue may
already have been unregistered when a cpu hotplug event occurs,
because the sysfs entries are unregistered by blk_mq_unregister_disk()
first, and the request queue is only deleted from all_q_list later by
blk_mq_free_queue(). So there is a race window in which cpu hotplug
must not occur.

Fix it by serializing sysfs registration/unregistration with a
per-hardware-queue mutex and the BLK_MQ_F_SYSFS_UP flag.

Signed-off-by: Akinobu Mita <[email protected]>
Cc: Jens Axboe <[email protected]>
---
block/blk-mq-sysfs.c | 19 +++++++++++++++++--
block/blk-mq.c | 1 +
include/linux/blk-mq.h | 1 +
3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index b79685e..d8ef3a3 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -371,7 +371,10 @@ void blk_mq_unregister_disk(struct gendisk *disk)
int i, j;

queue_for_each_hw_ctx(q, hctx, i) {
+ mutex_lock(&hctx->sysfs_up_lock);
blk_mq_unregister_hctx(hctx);
+ hctx->flags &= ~BLK_MQ_F_SYSFS_UP;
+ mutex_unlock(&hctx->sysfs_up_lock);

hctx_for_each_ctx(hctx, ctx, j)
kobject_put(&ctx->kobj);
@@ -423,10 +426,17 @@ int blk_mq_register_disk(struct gendisk *disk)
kobject_uevent(&q->mq_kobj, KOBJ_ADD);

queue_for_each_hw_ctx(q, hctx, i) {
+ mutex_lock(&hctx->sysfs_up_lock);
+
hctx->flags |= BLK_MQ_F_SYSFS_UP;
ret = blk_mq_register_hctx(hctx);
- if (ret)
+
+ if (ret) {
+ hctx->flags &= ~BLK_MQ_F_SYSFS_UP;
+ mutex_unlock(&hctx->sysfs_up_lock);
break;
+ }
+ mutex_unlock(&hctx->sysfs_up_lock);
}

if (ret) {
@@ -443,8 +453,11 @@ void blk_mq_sysfs_unregister(struct request_queue *q)
struct blk_mq_hw_ctx *hctx;
int i;

- queue_for_each_hw_ctx(q, hctx, i)
+ queue_for_each_hw_ctx(q, hctx, i) {
+ mutex_lock(&hctx->sysfs_up_lock);
blk_mq_unregister_hctx(hctx);
+ mutex_unlock(&hctx->sysfs_up_lock);
+ }
}

int blk_mq_sysfs_register(struct request_queue *q)
@@ -453,7 +466,9 @@ int blk_mq_sysfs_register(struct request_queue *q)
int i, ret = 0;

queue_for_each_hw_ctx(q, hctx, i) {
+ mutex_lock(&hctx->sysfs_up_lock);
ret = blk_mq_register_hctx(hctx);
+ mutex_unlock(&hctx->sysfs_up_lock);
if (ret)
break;
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 594eea0..ec94258 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1660,6 +1660,7 @@ static int blk_mq_init_hctx(struct request_queue *q,
INIT_DELAYED_WORK(&hctx->delay_work, blk_mq_delay_work_fn);
spin_lock_init(&hctx->lock);
INIT_LIST_HEAD(&hctx->dispatch);
+ mutex_init(&hctx->sysfs_up_lock);
hctx->queue = q;
hctx->queue_num = hctx_idx;
hctx->flags = set->flags;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2056a99..78cd4f3 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -59,6 +59,7 @@ struct blk_mq_hw_ctx {

struct blk_mq_cpu_notifier cpu_notifier;
struct kobject kobj;
+ struct mutex sysfs_up_lock;
};

struct blk_mq_tag_set {
--
1.9.1

2015-06-21 13:53:11

by Akinobu Mita

Subject: [PATCH 2/4] blk-mq: fix q->mq_map access race

CPU hotplug handling for blk-mq (blk_mq_queue_reinit) updates q->mq_map
by blk_mq_update_queue_map() for all request queues in all_q_list.
On the other hand, q->mq_map is released and set to NULL before
the queue is deleted from all_q_list.

So if a cpu hotplug event occurs in that window, an invalid memory
access can happen. Fix it by deleting the queue from all_q_list
before destroying q->mq_map.

Signed-off-by: Akinobu Mita <[email protected]>
Cc: Jens Axboe <[email protected]>
---
block/blk-mq.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ec94258..6487710 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2038,6 +2038,10 @@ void blk_mq_free_queue(struct request_queue *q)
{
struct blk_mq_tag_set *set = q->tag_set;

+ mutex_lock(&all_q_mutex);
+ list_del_init(&q->all_q_node);
+ mutex_unlock(&all_q_mutex);
+
blk_mq_del_queue_tag_set(q);

blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
@@ -2048,10 +2052,6 @@ void blk_mq_free_queue(struct request_queue *q)
kfree(q->mq_map);

q->mq_map = NULL;
-
- mutex_lock(&all_q_mutex);
- list_del_init(&q->all_q_node);
- mutex_unlock(&all_q_mutex);
}

/* Basically redo blk_mq_init_queue with queue frozen */
--
1.9.1

2015-06-21 13:53:53

by Akinobu Mita

Subject: [PATCH 3/4] blk-mq: establish new mapping before cpu starts handling requests

ctx->index_hw is zero for the CPUs which have never been onlined since
the block queue was initialized. If one of those CPUs is hotadded and
starts handling requests before the new mappings are established, the
pending bitmap is not correctly marked for inserted requests because
ctx->index_hw is still zero, so flush_busy_ctxs() can't find the
requests on the software queue.

Fix it by establishing the new mapping before the hotadded cpu starts
handling requests.

Signed-off-by: Akinobu Mita <[email protected]>
Cc: Jens Axboe <[email protected]>
---
block/blk-mq-cpumap.c | 9 +++++----
block/blk-mq.c | 41 ++++++++++++++++++++++++++---------------
block/blk-mq.h | 3 ++-
3 files changed, 33 insertions(+), 20 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 5f13f4d..99f65c0 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -31,7 +31,8 @@ static int get_first_sibling(unsigned int cpu)
return cpu;
}

-int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues)
+int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
+ const struct cpumask *online_mask)
{
unsigned int i, nr_cpus, nr_uniq_cpus, queue, first_sibling;
cpumask_var_t cpus;
@@ -41,7 +42,7 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues)

cpumask_clear(cpus);
nr_cpus = nr_uniq_cpus = 0;
- for_each_online_cpu(i) {
+ for_each_cpu(i, online_mask) {
nr_cpus++;
first_sibling = get_first_sibling(i);
if (!cpumask_test_cpu(first_sibling, cpus))
@@ -51,7 +52,7 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues)

queue = 0;
for_each_possible_cpu(i) {
- if (!cpu_online(i)) {
+ if (!cpumask_test_cpu(i, online_mask)) {
map[i] = 0;
continue;
}
@@ -95,7 +96,7 @@ unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set)
if (!map)
return NULL;

- if (!blk_mq_update_queue_map(map, set->nr_hw_queues))
+ if (!blk_mq_update_queue_map(map, set->nr_hw_queues, cpu_online_mask))
return map;

kfree(map);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6487710..64d93e4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1771,7 +1771,8 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
}
}

-static void blk_mq_map_swqueue(struct request_queue *q)
+static void blk_mq_map_swqueue(struct request_queue *q,
+ const struct cpumask *online_mask)
{
unsigned int i;
struct blk_mq_hw_ctx *hctx;
@@ -1788,7 +1789,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
*/
queue_for_each_ctx(q, ctx, i) {
/* If the cpu isn't online, the cpu is mapped to first hctx */
- if (!cpu_online(i))
+ if (!cpumask_test_cpu(i, online_mask))
continue;

hctx = q->mq_ops->map_queue(q, i);
@@ -2014,7 +2015,10 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,

blk_mq_add_queue_tag_set(set, q);

- blk_mq_map_swqueue(q);
+ get_online_cpus();
+ blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues, cpu_online_mask);
+ blk_mq_map_swqueue(q, cpu_online_mask);
+ put_online_cpus();

return q;

@@ -2055,13 +2059,14 @@ void blk_mq_free_queue(struct request_queue *q)
}

/* Basically redo blk_mq_init_queue with queue frozen */
-static void blk_mq_queue_reinit(struct request_queue *q)
+static void blk_mq_queue_reinit(struct request_queue *q,
+ const struct cpumask *online_mask)
{
WARN_ON_ONCE(!q->mq_freeze_depth);

blk_mq_sysfs_unregister(q);

- blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues);
+ blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues, online_mask);

/*
* redo blk_mq_init_cpu_queues and blk_mq_init_hw_queues. FIXME: maybe
@@ -2069,25 +2074,31 @@ static void blk_mq_queue_reinit(struct request_queue *q)
* involves free and re-allocate memory, worthy doing?)
*/

- blk_mq_map_swqueue(q);
+ blk_mq_map_swqueue(q, online_mask);

blk_mq_sysfs_register(q);
}

+static struct cpumask online_new;
+
static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
unsigned long action, void *hcpu)
{
struct request_queue *q;
+ int cpu = (unsigned long)hcpu;

- /*
- * Before new mappings are established, hotadded cpu might already
- * start handling requests. This doesn't break anything as we map
- * offline CPUs to first hardware queue. We will re-init the queue
- * below to get optimal settings.
- */
- if (action != CPU_DEAD && action != CPU_DEAD_FROZEN &&
- action != CPU_ONLINE && action != CPU_ONLINE_FROZEN)
+ switch (action & ~CPU_TASKS_FROZEN) {
+ case CPU_DEAD:
+ case CPU_UP_CANCELED:
+ cpumask_copy(&online_new, cpu_online_mask);
+ break;
+ case CPU_UP_PREPARE:
+ cpumask_copy(&online_new, cpu_online_mask);
+ cpumask_set_cpu(cpu, &online_new);
+ break;
+ default:
return NOTIFY_OK;
+ }

mutex_lock(&all_q_mutex);

@@ -2111,7 +2122,7 @@ static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
}

list_for_each_entry(q, &all_q_list, all_q_node)
- blk_mq_queue_reinit(q);
+ blk_mq_queue_reinit(q, &online_new);

list_for_each_entry(q, &all_q_list, all_q_node)
blk_mq_unfreeze_queue(q);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 6a48c4c..f4fea79 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -51,7 +51,8 @@ void blk_mq_disable_hotplug(void);
* CPU -> queue mappings
*/
extern unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set);
-extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues);
+extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
+ const struct cpumask *online_mask);
extern int blk_mq_hw_queue_to_node(unsigned int *map, unsigned int);

/*
--
1.9.1

2015-06-21 13:53:47

by Akinobu Mita

Subject: [PATCH 4/4] blk-mq: fix mq_usage_counter race when switching to percpu mode

percpu_ref_switch_to_percpu() and percpu_ref_kill() must not be
executed at the same time, as the following scenario is possible:

1. q->mq_usage_counter is initialized in atomic mode.
(atomic counter: 1)

2. After the disk registration, a process like systemd-udev starts
accessing the disk, and successfully increases the refcount
by percpu_ref_tryget_live() in blk_mq_queue_enter().
(atomic counter: 2)

3. In the final stage of initialization, q->mq_usage_counter is being
switched to percpu mode by percpu_ref_switch_to_percpu() in
blk_mq_finish_init(). But if CONFIG_PREEMPT_VOLUNTARY is enabled,
the process is rescheduled in the middle of switching when calling
wait_event() in __percpu_ref_switch_to_percpu().
(atomic counter: 2)

4. CPU hotplug handling for blk-mq calls percpu_ref_kill() to freeze the
request queue. q->mq_usage_counter is decreased and marked as
DEAD, and it waits until all requests have finished.
(atomic counter: 1)

5. The process rescheduled in step 3 is resumed and finishes
all remaining work in __percpu_ref_switch_to_percpu().
A bias value is added to the atomic counter of q->mq_usage_counter.
(atomic counter: PERCPU_COUNT_BIAS + 1)

6. A request issued in step 2 is finished and q->mq_usage_counter
is decreased by blk_mq_queue_exit(). q->mq_usage_counter is DEAD,
so the atomic counter is decreased and no release handler is called.
(atomic counter: PERCPU_COUNT_BIAS)

7. CPU hotplug handling in step 4 will wait forever as
q->mq_usage_counter will never reach zero.

Also, percpu_ref_reinit() and percpu_ref_kill() must not be executed
at the same time, because both functions can call
__percpu_ref_switch_to_percpu(), which adds the bias value and
initializes the percpu counter.

Fix those races by serializing with a mutex.

Signed-off-by: Akinobu Mita <[email protected]>
Cc: Jens Axboe <[email protected]>
---
block/blk-mq-sysfs.c | 2 ++
block/blk-mq.c | 6 ++++++
include/linux/blkdev.h | 6 ++++++
3 files changed, 14 insertions(+)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index d8ef3a3..af3e126 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -407,7 +407,9 @@ static void blk_mq_sysfs_init(struct request_queue *q)
/* see blk_register_queue() */
void blk_mq_finish_init(struct request_queue *q)
{
+ mutex_lock(&q->mq_usage_lock);
percpu_ref_switch_to_percpu(&q->mq_usage_counter);
+ mutex_unlock(&q->mq_usage_lock);
}

int blk_mq_register_disk(struct gendisk *disk)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 64d93e4..62d0ef1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -119,7 +119,9 @@ void blk_mq_freeze_queue_start(struct request_queue *q)
spin_unlock_irq(q->queue_lock);

if (freeze) {
+ mutex_lock(&q->mq_usage_lock);
percpu_ref_kill(&q->mq_usage_counter);
+ mutex_unlock(&q->mq_usage_lock);
blk_mq_run_hw_queues(q, false);
}
}
@@ -150,7 +152,9 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
WARN_ON_ONCE(q->mq_freeze_depth < 0);
spin_unlock_irq(q->queue_lock);
if (wake) {
+ mutex_lock(&q->mq_usage_lock);
percpu_ref_reinit(&q->mq_usage_counter);
+ mutex_unlock(&q->mq_usage_lock);
wake_up_all(&q->mq_freeze_wq);
}
}
@@ -1961,6 +1965,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
hctxs[i]->queue_num = i;
}

+ mutex_init(&q->mq_usage_lock);
+
/*
* Init percpu_ref in atomic mode so that it's faster to shutdown.
* See blk_register_queue() for details.
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5d93a66..c5bf534 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -484,6 +484,12 @@ struct request_queue {
struct rcu_head rcu_head;
wait_queue_head_t mq_freeze_wq;
struct percpu_ref mq_usage_counter;
+ /*
+ * Protect concurrent access from percpu_ref_switch_to_percpu and
+ * percpu_ref_kill, and access from percpu_ref_switch_to_percpu and
+ * percpu_ref_reinit.
+ */
+ struct mutex mq_usage_lock;
struct list_head all_q_node;

struct blk_mq_tag_set *tag_set;
--
1.9.1

2015-06-24 08:59:28

by Ming Lei

Subject: Re: [PATCH 1/4] blk-mq: fix sysfs registration/unregistration race

On Sun, Jun 21, 2015 at 9:52 PM, Akinobu Mita <[email protected]> wrote:
> There is a race between cpu hotplug handling and adding/deleting
> gendisk for blk-mq, where both are trying to register and unregister
> the same sysfs entries.

Yes, it is one issue.

>
> cpu hotplug handling for blk-mq (blk_mq_queue_reinit) does unregister
> and register sysfs entries of hctx for all request queues in
> all_q_list. On the other hand, these entries for a request queue may
> have already been unregistered when cpu hotplug event occurs. Because
> those sysfs entries are unregistered by blk_mq_unregister_disk()
> firstly, and then a request queue is deleted from all_q_list by
> blk_mq_free_queue(). So there is a race window that cpu hotplug must
> not occur.
>
> Fix it by serializing sysfs registration/unregistration with using
> per hardware queue mutex and BLK_MQ_F_SYSFS_UP flag.

But I believe the easier and correct fix is to remove the
queue from 'all_q_list' before unregistering the disk in the
disk deletion path.

>
> Signed-off-by: Akinobu Mita <[email protected]>
> Cc: Jens Axboe <[email protected]>
> ---
> block/blk-mq-sysfs.c | 19 +++++++++++++++++--
> block/blk-mq.c | 1 +
> include/linux/blk-mq.h | 1 +
> 3 files changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index b79685e..d8ef3a3 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -371,7 +371,10 @@ void blk_mq_unregister_disk(struct gendisk *disk)
> int i, j;
>
> queue_for_each_hw_ctx(q, hctx, i) {
> + mutex_lock(&hctx->sysfs_up_lock);
> blk_mq_unregister_hctx(hctx);
> + hctx->flags &= ~BLK_MQ_F_SYSFS_UP;
> + mutex_unlock(&hctx->sysfs_up_lock);

It may be better to put the lock and flag handling into the
register/unregister helpers.

>
> hctx_for_each_ctx(hctx, ctx, j)
> kobject_put(&ctx->kobj);
> @@ -423,10 +426,17 @@ int blk_mq_register_disk(struct gendisk *disk)
> kobject_uevent(&q->mq_kobj, KOBJ_ADD);
>
> queue_for_each_hw_ctx(q, hctx, i) {
> + mutex_lock(&hctx->sysfs_up_lock);
> +
> hctx->flags |= BLK_MQ_F_SYSFS_UP;
> ret = blk_mq_register_hctx(hctx);
> - if (ret)
> +
> + if (ret) {
> + hctx->flags &= ~BLK_MQ_F_SYSFS_UP;
> + mutex_unlock(&hctx->sysfs_up_lock);
> break;
> + }
> + mutex_unlock(&hctx->sysfs_up_lock);
> }
>
> if (ret) {
> @@ -443,8 +453,11 @@ void blk_mq_sysfs_unregister(struct request_queue *q)
> struct blk_mq_hw_ctx *hctx;
> int i;
>
> - queue_for_each_hw_ctx(q, hctx, i)
> + queue_for_each_hw_ctx(q, hctx, i) {
> + mutex_lock(&hctx->sysfs_up_lock);
> blk_mq_unregister_hctx(hctx);
> + mutex_unlock(&hctx->sysfs_up_lock);
> + }
> }
>
> int blk_mq_sysfs_register(struct request_queue *q)
> @@ -453,7 +466,9 @@ int blk_mq_sysfs_register(struct request_queue *q)
> int i, ret = 0;
>
> queue_for_each_hw_ctx(q, hctx, i) {
> + mutex_lock(&hctx->sysfs_up_lock);
> ret = blk_mq_register_hctx(hctx);
> + mutex_unlock(&hctx->sysfs_up_lock);
> if (ret)
> break;
> }
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 594eea0..ec94258 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1660,6 +1660,7 @@ static int blk_mq_init_hctx(struct request_queue *q,
> INIT_DELAYED_WORK(&hctx->delay_work, blk_mq_delay_work_fn);
> spin_lock_init(&hctx->lock);
> INIT_LIST_HEAD(&hctx->dispatch);
> + mutex_init(&hctx->sysfs_up_lock);
> hctx->queue = q;
> hctx->queue_num = hctx_idx;
> hctx->flags = set->flags;
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 2056a99..78cd4f3 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -59,6 +59,7 @@ struct blk_mq_hw_ctx {
>
> struct blk_mq_cpu_notifier cpu_notifier;
> struct kobject kobj;
> + struct mutex sysfs_up_lock;
> };
>
> struct blk_mq_tag_set {
> --
> 1.9.1
>



--
Ming Lei

2015-06-24 09:23:53

by Ming Lei

Subject: Re: [PATCH 2/4] blk-mq: fix q->mq_map access race

On Sun, Jun 21, 2015 at 9:52 PM, Akinobu Mita <[email protected]> wrote:
> CPU hotplug handling for blk-mq (blk_mq_queue_reinit) updates q->mq_map
> by blk_mq_update_queue_map() for all request queues in all_q_list.
> On the other hand, q->mq_map is released and set to NULL before
> deleting the queue from all_q_list.
>
> So if cpu hotplug event occurs in the window, invalid memory access
> can happen. Fix it by deleting the queue from all_q_list earlier
> than destroyng q->mq_map.

It would be better to free q->mq_map in blk_mq_release(), because
blk_cleanup_queue() can now be run before del_gendisk(), for example
by loop and md.
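
For example, a rough, untested sketch of where the free would move (the
existing teardown in blk_mq_release() is only hinted at by a comment):

void blk_mq_release(struct request_queue *q)
{
	/* (existing freeing of hw queues and per-cpu ctxs, unchanged) */

	kfree(q->mq_map);
	q->mq_map = NULL;
}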

>
> Signed-off-by: Akinobu Mita <[email protected]>
> Cc: Jens Axboe <[email protected]>
> ---
> block/blk-mq.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index ec94258..6487710 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2038,6 +2038,10 @@ void blk_mq_free_queue(struct request_queue *q)
> {
> struct blk_mq_tag_set *set = q->tag_set;
>
> + mutex_lock(&all_q_mutex);
> + list_del_init(&q->all_q_node);
> + mutex_unlock(&all_q_mutex);

As commented on patch 1, the above can be moved to
blk_mq_unregister_disk() to fix the issue described in patch 1.
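
A rough, untested sketch of that move (ignoring that all_q_mutex and
all_q_list are currently local to blk-mq.c, so they would first need to
be made reachable from blk-mq-sysfs.c):

void blk_mq_unregister_disk(struct gendisk *disk)
{
	struct request_queue *q = disk->queue;

	mutex_lock(&all_q_mutex);
	list_del_init(&q->all_q_node);
	mutex_unlock(&all_q_mutex);

	/* (existing unregistration of the hctx/ctx kobjects, unchanged) */
}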

> +
> blk_mq_del_queue_tag_set(q);
>
> blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
> @@ -2048,10 +2052,6 @@ void blk_mq_free_queue(struct request_queue *q)
> kfree(q->mq_map);
>
> q->mq_map = NULL;
> -
> - mutex_lock(&all_q_mutex);
> - list_del_init(&q->all_q_node);
> - mutex_unlock(&all_q_mutex);
> }
>
> /* Basically redo blk_mq_init_queue with queue frozen */
> --
> 1.9.1
>



--
Ming Lei

2015-06-24 09:46:11

by Ming Lei

Subject: Re: [PATCH 3/4] blk-mq: establish new mapping before cpu starts handling requests

On Sun, Jun 21, 2015 at 9:52 PM, Akinobu Mita <[email protected]> wrote:
> ctx->index_hw is zero for the CPUs which have never been onlined since
> the block queue was initialized. If one of those CPUs is hotadded and
> starts handling request before new mappings are established, pending

Could you explain a bit what "handling requests" means here? The fact is that
blk_mq_queue_reinit() is run after all queues are put into freezing.

> bitmap is not correctly marked for inserted requests as ctx->index_hw
> is still zero. So flush_busy_ctxs can't find the requests on the
> software queue.
>
> Fix it by establishing new mapping before hotadded cpu starts handling
> requests.
>
> Signed-off-by: Akinobu Mita <[email protected]>
> Cc: Jens Axboe <[email protected]>
> ---
> block/blk-mq-cpumap.c | 9 +++++----
> block/blk-mq.c | 41 ++++++++++++++++++++++++++---------------
> block/blk-mq.h | 3 ++-
> 3 files changed, 33 insertions(+), 20 deletions(-)
>
> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> index 5f13f4d..99f65c0 100644
> --- a/block/blk-mq-cpumap.c
> +++ b/block/blk-mq-cpumap.c
> @@ -31,7 +31,8 @@ static int get_first_sibling(unsigned int cpu)
> return cpu;
> }
>
> -int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues)
> +int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
> + const struct cpumask *online_mask)
> {
> unsigned int i, nr_cpus, nr_uniq_cpus, queue, first_sibling;
> cpumask_var_t cpus;
> @@ -41,7 +42,7 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues)
>
> cpumask_clear(cpus);
> nr_cpus = nr_uniq_cpus = 0;
> - for_each_online_cpu(i) {
> + for_each_cpu(i, online_mask) {
> nr_cpus++;
> first_sibling = get_first_sibling(i);
> if (!cpumask_test_cpu(first_sibling, cpus))
> @@ -51,7 +52,7 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues)
>
> queue = 0;
> for_each_possible_cpu(i) {
> - if (!cpu_online(i)) {
> + if (!cpumask_test_cpu(i, online_mask)) {
> map[i] = 0;
> continue;
> }
> @@ -95,7 +96,7 @@ unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set)
> if (!map)
> return NULL;
>
> - if (!blk_mq_update_queue_map(map, set->nr_hw_queues))
> + if (!blk_mq_update_queue_map(map, set->nr_hw_queues, cpu_online_mask))
> return map;
>
> kfree(map);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 6487710..64d93e4 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1771,7 +1771,8 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
> }
> }
>
> -static void blk_mq_map_swqueue(struct request_queue *q)
> +static void blk_mq_map_swqueue(struct request_queue *q,
> + const struct cpumask *online_mask)
> {
> unsigned int i;
> struct blk_mq_hw_ctx *hctx;
> @@ -1788,7 +1789,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
> */
> queue_for_each_ctx(q, ctx, i) {
> /* If the cpu isn't online, the cpu is mapped to first hctx */
> - if (!cpu_online(i))
> + if (!cpumask_test_cpu(i, online_mask))
> continue;
>
> hctx = q->mq_ops->map_queue(q, i);
> @@ -2014,7 +2015,10 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>
> blk_mq_add_queue_tag_set(set, q);
>
> - blk_mq_map_swqueue(q);
> + get_online_cpus();
> + blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues, cpu_online_mask);
> + blk_mq_map_swqueue(q, cpu_online_mask);
> + put_online_cpus();
>
> return q;
>
> @@ -2055,13 +2059,14 @@ void blk_mq_free_queue(struct request_queue *q)
> }
>
> /* Basically redo blk_mq_init_queue with queue frozen */
> -static void blk_mq_queue_reinit(struct request_queue *q)
> +static void blk_mq_queue_reinit(struct request_queue *q,
> + const struct cpumask *online_mask)
> {
> WARN_ON_ONCE(!q->mq_freeze_depth);
>
> blk_mq_sysfs_unregister(q);
>
> - blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues);
> + blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues, online_mask);
>
> /*
> * redo blk_mq_init_cpu_queues and blk_mq_init_hw_queues. FIXME: maybe
> @@ -2069,25 +2074,31 @@ static void blk_mq_queue_reinit(struct request_queue *q)
> * involves free and re-allocate memory, worthy doing?)
> */
>
> - blk_mq_map_swqueue(q);
> + blk_mq_map_swqueue(q, online_mask);
>
> blk_mq_sysfs_register(q);
> }
>
> +static struct cpumask online_new;
> +
> static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
> unsigned long action, void *hcpu)
> {
> struct request_queue *q;
> + int cpu = (unsigned long)hcpu;
>
> - /*
> - * Before new mappings are established, hotadded cpu might already
> - * start handling requests. This doesn't break anything as we map
> - * offline CPUs to first hardware queue. We will re-init the queue
> - * below to get optimal settings.
> - */
> - if (action != CPU_DEAD && action != CPU_DEAD_FROZEN &&
> - action != CPU_ONLINE && action != CPU_ONLINE_FROZEN)
> + switch (action & ~CPU_TASKS_FROZEN) {
> + case CPU_DEAD:
> + case CPU_UP_CANCELED:
> + cpumask_copy(&online_new, cpu_online_mask);
> + break;
> + case CPU_UP_PREPARE:
> + cpumask_copy(&online_new, cpu_online_mask);
> + cpumask_set_cpu(cpu, &online_new);
> + break;
> + default:
> return NOTIFY_OK;
> + }
>
> mutex_lock(&all_q_mutex);
>
> @@ -2111,7 +2122,7 @@ static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
> }
>
> list_for_each_entry(q, &all_q_list, all_q_node)
> - blk_mq_queue_reinit(q);
> + blk_mq_queue_reinit(q, &online_new);
>
> list_for_each_entry(q, &all_q_list, all_q_node)
> blk_mq_unfreeze_queue(q);
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 6a48c4c..f4fea79 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -51,7 +51,8 @@ void blk_mq_disable_hotplug(void);
> * CPU -> queue mappings
> */
> extern unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set);
> -extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues);
> +extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
> + const struct cpumask *online_mask);
> extern int blk_mq_hw_queue_to_node(unsigned int *map, unsigned int);
>
> /*
> --
> 1.9.1
>



--
Ming Lei

2015-06-24 12:35:47

by Ming Lei

Subject: Re: [PATCH 4/4] blk-mq: fix mq_usage_counter race when switching to percpu mode

On Sun, Jun 21, 2015 at 9:52 PM, Akinobu Mita <[email protected]> wrote:
> percpu_ref_switch_to_percpu() and percpu_ref_kill() must not be
> executed at the same time as the following scenario is possible:
>
> 1. q->mq_usage_counter is initialized in atomic mode.
> (atomic counter: 1)
>
> 2. After the disk registration, a process like systemd-udev starts
> accessing the disk, and successfully increases refcount successfully
> by percpu_ref_tryget_live() in blk_mq_queue_enter().
> (atomic counter: 2)
>
> 3. In the final stage of initialization, q->mq_usage_counter is being
> switched to percpu mode by percpu_ref_switch_to_percpu() in
> blk_mq_finish_init(). But if CONFIG_PREEMPT_VOLUNTARY is enabled,
> the process is rescheduled in the middle of switching when calling
> wait_event() in __percpu_ref_switch_to_percpu().
> (atomic counter: 2)

This can only happen when freezing the queue from CPU
hotplug happens before running wait_event() from blk_mq_finish_init().

So it looks like we may still avoid the race by moving the addition of
the queue to all_q_list into blk_mq_register_disk(), but the mapping
needs to be updated at that time.

>
> 4. CPU hotplug handling for blk-mq calls percpu_ref_kill() to freeze
> request queue. q->mq_usage_counter is decreased and marked as
> DEAD. Wait until all requests have finished.
> (atomic counter: 1)
>
> 5. The process rescheduled in the step 3. is resumed and finishes
> all remaining work in __percpu_ref_switch_to_percpu().
> A bias value is added to atomic counter of q->mq_usage_counter.
> (atomic counter: PERCPU_COUNT_BIAS + 1)
>
> 6. A request issed in the step 2. is finished and q->mq_usage_counter
> is decreased by blk_mq_queue_exit(). q->mq_usage_counter is DEAD,
> so atomic counter is decreased and no release handler is called.
> (atomic counter: PERCPU_COUNT_BIAS)
>
> 7. CPU hotplug handling in the step 4. will wait forever as
> q->mq_usage_counter will never be zero.
>
> Also, percpu_ref_reinit() and percpu_ref_kill() must not be executed
> at the same time. Because both functions could call
> __percpu_ref_switch_to_percpu() which adds the bias value and
> initialize percpu counter.
>
> Fix those races by serializing with a mutex.
>
> Signed-off-by: Akinobu Mita <[email protected]>
> Cc: Jens Axboe <[email protected]>
> ---
> block/blk-mq-sysfs.c | 2 ++
> block/blk-mq.c | 6 ++++++
> include/linux/blkdev.h | 6 ++++++
> 3 files changed, 14 insertions(+)
>
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index d8ef3a3..af3e126 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -407,7 +407,9 @@ static void blk_mq_sysfs_init(struct request_queue *q)
> /* see blk_register_queue() */
> void blk_mq_finish_init(struct request_queue *q)
> {
> + mutex_lock(&q->mq_usage_lock);
> percpu_ref_switch_to_percpu(&q->mq_usage_counter);
> + mutex_unlock(&q->mq_usage_lock);
> }
>
> int blk_mq_register_disk(struct gendisk *disk)
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 64d93e4..62d0ef1 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -119,7 +119,9 @@ void blk_mq_freeze_queue_start(struct request_queue *q)
> spin_unlock_irq(q->queue_lock);
>
> if (freeze) {
> + mutex_lock(&q->mq_usage_lock);
> percpu_ref_kill(&q->mq_usage_counter);
> + mutex_unlock(&q->mq_usage_lock);
> blk_mq_run_hw_queues(q, false);
> }
> }
> @@ -150,7 +152,9 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
> WARN_ON_ONCE(q->mq_freeze_depth < 0);
> spin_unlock_irq(q->queue_lock);
> if (wake) {
> + mutex_lock(&q->mq_usage_lock);
> percpu_ref_reinit(&q->mq_usage_counter);
> + mutex_unlock(&q->mq_usage_lock);
> wake_up_all(&q->mq_freeze_wq);
> }
> }
> @@ -1961,6 +1965,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
> hctxs[i]->queue_num = i;
> }
>
> + mutex_init(&q->mq_usage_lock);
> +
> /*
> * Init percpu_ref in atomic mode so that it's faster to shutdown.
> * See blk_register_queue() for details.
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 5d93a66..c5bf534 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -484,6 +484,12 @@ struct request_queue {
> struct rcu_head rcu_head;
> wait_queue_head_t mq_freeze_wq;
> struct percpu_ref mq_usage_counter;
> + /*
> + * Protect concurrent access from percpu_ref_switch_to_percpu and
> + * percpu_ref_kill, and access from percpu_ref_switch_to_percpu and
> + * percpu_ref_reinit.
> + */
> + struct mutex mq_usage_lock;
> struct list_head all_q_node;
>
> struct blk_mq_tag_set *tag_set;
> --
> 1.9.1
>



--
Ming Lei

2015-06-24 14:34:11

by Akinobu Mita

Subject: Re: [PATCH 3/4] blk-mq: establish new mapping before cpu starts handling requests

Hi Ming,

2015-06-24 18:46 GMT+09:00 Ming Lei <[email protected]>:
> On Sun, Jun 21, 2015 at 9:52 PM, Akinobu Mita <[email protected]> wrote:
>> ctx->index_hw is zero for the CPUs which have never been onlined since
>> the block queue was initialized. If one of those CPUs is hotadded and
>> starts handling request before new mappings are established, pending
>
> Could you explain a bit what the handling request is? The fact is that
> blk_mq_queue_reinit() is run after all queues are put into freezing.

Notifier callbacks for the CPU_ONLINE action can run on a CPU other
than the one which was just onlined. So it is possible for a
process running on the just-onlined CPU to insert a request and run
the hw queue before blk_mq_queue_reinit_notify() is actually called with
action=CPU_ONLINE.
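
For reference, the relevant ordering in kernel/cpu.c of that era looks
roughly like this (heavily trimmed sketch; error handling and locking
omitted):

static int _cpu_up(unsigned int cpu, int tasks_frozen)
{
	void *hcpu = (void *)(long)cpu;
	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
	struct task_struct *idle = idle_thread_get(cpu);
	int ret;

	__cpu_notify(CPU_UP_PREPARE | mod, hcpu, -1, NULL);

	/* The new CPU starts running and scheduling tasks here. */
	ret = __cpu_up(cpu, idle);

	/*
	 * The CPU_ONLINE notifiers, including blk_mq_queue_reinit_notify(),
	 * only run afterwards, on the CPU driving the hotplug. In between,
	 * tasks on the new CPU can already submit I/O while ctx->index_hw
	 * has not been set up for that CPU yet.
	 */
	cpu_notify(CPU_ONLINE | mod, hcpu);

	return ret;
}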

2015-06-24 16:24:35

by Ming Lei

Subject: Re: [PATCH 3/4] blk-mq: establish new mapping before cpu starts handling requests

On Wed, Jun 24, 2015 at 10:34 PM, Akinobu Mita <[email protected]> wrote:
> Hi Ming,
>
> 2015-06-24 18:46 GMT+09:00 Ming Lei <[email protected]>:
>> On Sun, Jun 21, 2015 at 9:52 PM, Akinobu Mita <[email protected]> wrote:
>>> ctx->index_hw is zero for the CPUs which have never been onlined since
>>> the block queue was initialized. If one of those CPUs is hotadded and
>>> starts handling request before new mappings are established, pending
>>
>> Could you explain a bit what the handling request is? The fact is that
>> blk_mq_queue_reinit() is run after all queues are put into freezing.
>
> Notifier callbacks for CPU_ONLINE action can be run on the other CPU
> than the CPU which was just onlined. So it is possible for the
> process running on the just onlined CPU to insert request and run
> hw queue before blk_mq_queue_reinit_notify() is actually called with
> action=CPU_ONLINE.

You are right: because blk_mq_queue_reinit_notify() is always run after
the CPU becomes UP, there is a tiny window in which the CPU is up
but the mapping is not yet updated. Per the current design, the CPU just
onlined is still mapped to hw queue 0 until the mapping is updated by
blk_mq_queue_reinit_notify().

But I am wondering why that is a problem and why you think flush_busy_ctxs()
can't find the requests on the software queue in this situation?


--
Ming Lei

2015-06-25 02:57:02

by Akinobu Mita

Subject: Re: [PATCH 3/4] blk-mq: establish new mapping before cpu starts handling requests

2015-06-25 1:24 GMT+09:00 Ming Lei <[email protected]>:
> On Wed, Jun 24, 2015 at 10:34 PM, Akinobu Mita <[email protected]> wrote:
>> Hi Ming,
>>
>> 2015-06-24 18:46 GMT+09:00 Ming Lei <[email protected]>:
>>> On Sun, Jun 21, 2015 at 9:52 PM, Akinobu Mita <[email protected]> wrote:
>>>> ctx->index_hw is zero for the CPUs which have never been onlined since
>>>> the block queue was initialized. If one of those CPUs is hotadded and
>>>> starts handling request before new mappings are established, pending
>>>
>>> Could you explain a bit what the handling request is? The fact is that
>>> blk_mq_queue_reinit() is run after all queues are put into freezing.
>>
>> Notifier callbacks for CPU_ONLINE action can be run on the other CPU
>> than the CPU which was just onlined. So it is possible for the
>> process running on the just onlined CPU to insert request and run
>> hw queue before blk_mq_queue_reinit_notify() is actually called with
>> action=CPU_ONLINE.
>
> You are right because blk_mq_queue_reinit_notify() is alwasy run after
> the CPU becomes UP, so there is a tiny window in which the CPU is up
> but the mapping is updated. Per current design, the CPU just onlined
> is still mapped to hw queue 0 until the mapping is updated by
> blk_mq_queue_reinit_notify().
>
> But I am wondering why it is a problem and why you think flush_busy_ctxs
> can't find the requests on the software queue in this situation?

The problem happens when the CPU has just been onlined for the first time
since the request queue was initialized. At this point ctx->index_hw
for that CPU is still zero, because blk_mq_queue_reinit_notify has not
been called yet.

The request can be inserted into ctx->rq_list, but blk_mq_hctx_mark_pending()
marks the wrong bit position as busy, since ctx->index_hw is zero.

flush_busy_ctxs() only retrieves the requests from software queues
which are marked busy. So the request just inserted is ignored, as
the corresponding bit position is not busy.
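
To illustrate, here is a deliberately simplified sketch of the marking and
flushing logic (not the actual blk-mq data structures or code, just the
shape of the problem):

#include <linux/bitops.h>
#include <linux/list.h>

/* Simplified stand-ins for struct blk_mq_ctx / struct blk_mq_hw_ctx. */
struct sw_ctx {
	unsigned int index_hw;		/* this ctx's slot in hctx->ctxs[] */
	struct list_head rq_list;
};

struct hw_ctx {
	unsigned long pending;		/* one bit per ctxs[] slot */
	struct sw_ctx *ctxs[BITS_PER_LONG];
};

/* Roughly what blk_mq_hctx_mark_pending() does: if index_hw is stale
 * (still zero), the bit for slot 0 is set instead of this ctx's slot. */
static void mark_pending(struct hw_ctx *hctx, struct sw_ctx *ctx)
{
	set_bit(ctx->index_hw, &hctx->pending);
}

/* Roughly what flush_busy_ctxs() does: only slots whose bit is set are
 * scanned, and bit N is assumed to refer to hctx->ctxs[N]. */
static void flush_busy(struct hw_ctx *hctx, struct list_head *out)
{
	unsigned int bit;

	for_each_set_bit(bit, &hctx->pending, BITS_PER_LONG) {
		clear_bit(bit, &hctx->pending);
		list_splice_tail_init(&hctx->ctxs[bit]->rq_list, out);
	}
}

With ctx1->index_hw still zero, mark_pending() sets bit0 (which belongs to
ctx0) while the request sits on ctx1->rq_list, so flush_busy() only looks
at ctx0 and never finds it.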

2015-06-25 08:07:36

by Ming Lei

Subject: Re: [PATCH 3/4] blk-mq: establish new mapping before cpu starts handling requests

On Thu, Jun 25, 2015 at 10:56 AM, Akinobu Mita <[email protected]> wrote:
> 2015-06-25 1:24 GMT+09:00 Ming Lei <[email protected]>:
>> On Wed, Jun 24, 2015 at 10:34 PM, Akinobu Mita <[email protected]> wrote:
>>> Hi Ming,
>>>
>>> 2015-06-24 18:46 GMT+09:00 Ming Lei <[email protected]>:
>>>> On Sun, Jun 21, 2015 at 9:52 PM, Akinobu Mita <[email protected]> wrote:
>>>>> ctx->index_hw is zero for the CPUs which have never been onlined since
>>>>> the block queue was initialized. If one of those CPUs is hotadded and
>>>>> starts handling request before new mappings are established, pending
>>>>
>>>> Could you explain a bit what the handling request is? The fact is that
>>>> blk_mq_queue_reinit() is run after all queues are put into freezing.
>>>
>>> Notifier callbacks for CPU_ONLINE action can be run on the other CPU
>>> than the CPU which was just onlined. So it is possible for the
>>> process running on the just onlined CPU to insert request and run
>>> hw queue before blk_mq_queue_reinit_notify() is actually called with
>>> action=CPU_ONLINE.
>>
>> You are right because blk_mq_queue_reinit_notify() is alwasy run after
>> the CPU becomes UP, so there is a tiny window in which the CPU is up
>> but the mapping is updated. Per current design, the CPU just onlined
>> is still mapped to hw queue 0 until the mapping is updated by
>> blk_mq_queue_reinit_notify().
>>
>> But I am wondering why it is a problem and why you think flush_busy_ctxs
>> can't find the requests on the software queue in this situation?
>
> The problem happens when the CPU has just been onlined first time
> since the request queue was initialized. At this time ctx->index_hw
> for the CPU is still zero before blk_mq_queue_reinit_notify is called.
>
> The request can be inserted to ctx->rq_list, but blk_mq_hctx_mark_pending()
> marks busy for wrong bit position as ctx->index_hw is zero.

It isn't the wrong bit, since the CPU just onlined is still mapped to hctx 0
at that time.

>
> flush_busy_ctxs() only retrieves the requests from software queues
> which are marked busy. So the request just inserted is ignored as
> the corresponding bit position is not busy.

Before the remap is made in blk_mq_queue_reinit() for the CPU topology
change, the request queue will be put into freezing first, and all requests
inserted to hctx 0 should be retrieved and scheduled out. So how can the
request be ignored by flush_busy_ctxs()?

--
Ming Lei

2015-06-25 12:49:50

by Akinobu Mita

Subject: Re: [PATCH 3/4] blk-mq: establish new mapping before cpu starts handling requests

2015-06-25 17:07 GMT+09:00 Ming Lei <[email protected]>:
> On Thu, Jun 25, 2015 at 10:56 AM, Akinobu Mita <[email protected]> wrote:
>> 2015-06-25 1:24 GMT+09:00 Ming Lei <[email protected]>:
>>> On Wed, Jun 24, 2015 at 10:34 PM, Akinobu Mita <[email protected]> wrote:
>>>> Hi Ming,
>>>>
>>>> 2015-06-24 18:46 GMT+09:00 Ming Lei <[email protected]>:
>>>>> On Sun, Jun 21, 2015 at 9:52 PM, Akinobu Mita <[email protected]> wrote:
>>>>>> ctx->index_hw is zero for the CPUs which have never been onlined since
>>>>>> the block queue was initialized. If one of those CPUs is hotadded and
>>>>>> starts handling request before new mappings are established, pending
>>>>>
>>>>> Could you explain a bit what the handling request is? The fact is that
>>>>> blk_mq_queue_reinit() is run after all queues are put into freezing.
>>>>
>>>> Notifier callbacks for CPU_ONLINE action can be run on the other CPU
>>>> than the CPU which was just onlined. So it is possible for the
>>>> process running on the just onlined CPU to insert request and run
>>>> hw queue before blk_mq_queue_reinit_notify() is actually called with
>>>> action=CPU_ONLINE.
>>>
>>> You are right because blk_mq_queue_reinit_notify() is alwasy run after
>>> the CPU becomes UP, so there is a tiny window in which the CPU is up
>>> but the mapping is updated. Per current design, the CPU just onlined
>>> is still mapped to hw queue 0 until the mapping is updated by
>>> blk_mq_queue_reinit_notify().
>>>
>>> But I am wondering why it is a problem and why you think flush_busy_ctxs
>>> can't find the requests on the software queue in this situation?
>>
>> The problem happens when the CPU has just been onlined first time
>> since the request queue was initialized. At this time ctx->index_hw
>> for the CPU is still zero before blk_mq_queue_reinit_notify is called.
>>
>> The request can be inserted to ctx->rq_list, but blk_mq_hctx_mark_pending()
>> marks busy for wrong bit position as ctx->index_hw is zero.
>
> It isn't wrong bit since the CPU onlined just is still mapped to hctx 0 at that
> time .

ctx->index_hw is not the CPU queue to HW queue mapping.
ctx->index_hw is the index in hctx->ctxs[] for this ctx.
Each ctx in a hw queue should have a unique ctx->index_hw.

This problem is reproducible with a single hw queue. (The script
in the cover letter can reproduce it with a single hw queue.)

>> flush_busy_ctxs() only retrieves the requests from software queues
>> which are marked busy. So the request just inserted is ignored as
>> the corresponding bit position is not busy.
>
> Before making the remap in blk_mq_queue_reinit() for the CPU topo change,
> the request queue will be put into freezing first and all requests
> inserted to hctx 0
> should be retrieved and scheduled out. So can the request be igonred by
> flush_busy_ctxs()?

For example, there is a single hw queue (hctx) and two CPU queues
(ctx0 for CPU0, and ctx1 for CPU1). Now CPU1 has just been onlined and
a request is inserted into ctx1->rq_list, and bit0 is set in the pending
bitmap because ctx1->index_hw is still zero.

And then while running the hw queue, flush_busy_ctxs() finds bit0 set
in the pending bitmap and tries to retrieve requests from
hctx->ctxs[0].rq_list. But hctx->ctxs[0] is ctx0, so the request in
ctx1->rq_list is ignored.

2015-06-25 15:40:50

by Ming Lei

Subject: Re: [PATCH 3/4] blk-mq: establish new mapping before cpu starts handling requests

On Thu, 25 Jun 2015 21:49:43 +0900
Akinobu Mita <[email protected]> wrote:

> 2015-06-25 17:07 GMT+09:00 Ming Lei <[email protected]>:
> > On Thu, Jun 25, 2015 at 10:56 AM, Akinobu Mita <[email protected]> wrote:
> >> 2015-06-25 1:24 GMT+09:00 Ming Lei <[email protected]>:
> >>> On Wed, Jun 24, 2015 at 10:34 PM, Akinobu Mita <[email protected]> wrote:
> >>>> Hi Ming,
> >>>>
> >>>> 2015-06-24 18:46 GMT+09:00 Ming Lei <[email protected]>:
> >>>>> On Sun, Jun 21, 2015 at 9:52 PM, Akinobu Mita <[email protected]> wrote:
> >>>>>> ctx->index_hw is zero for the CPUs which have never been onlined since
> >>>>>> the block queue was initialized. If one of those CPUs is hotadded and
> >>>>>> starts handling request before new mappings are established, pending
> >>>>>
> >>>>> Could you explain a bit what the handling request is? The fact is that
> >>>>> blk_mq_queue_reinit() is run after all queues are put into freezing.
> >>>>
> >>>> Notifier callbacks for CPU_ONLINE action can be run on the other CPU
> >>>> than the CPU which was just onlined. So it is possible for the
> >>>> process running on the just onlined CPU to insert request and run
> >>>> hw queue before blk_mq_queue_reinit_notify() is actually called with
> >>>> action=CPU_ONLINE.
> >>>
> >>> You are right because blk_mq_queue_reinit_notify() is alwasy run after
> >>> the CPU becomes UP, so there is a tiny window in which the CPU is up
> >>> but the mapping is updated. Per current design, the CPU just onlined
> >>> is still mapped to hw queue 0 until the mapping is updated by
> >>> blk_mq_queue_reinit_notify().
> >>>
> >>> But I am wondering why it is a problem and why you think flush_busy_ctxs
> >>> can't find the requests on the software queue in this situation?
> >>
> >> The problem happens when the CPU has just been onlined first time
> >> since the request queue was initialized. At this time ctx->index_hw
> >> for the CPU is still zero before blk_mq_queue_reinit_notify is called.
> >>
> >> The request can be inserted to ctx->rq_list, but blk_mq_hctx_mark_pending()
> >> marks busy for wrong bit position as ctx->index_hw is zero.
> >
> > It isn't wrong bit since the CPU onlined just is still mapped to hctx 0 at that
> > time .
>
> ctx->index_hw is not CPU queue to HW queue mapping.
> ctx->index_hw is the index in hctx->ctxs[] for this ctx.
> Each ctx in a hw queue should have unique ctx->index_hw.

You are right, sorry for my fault.

>
> This problem can be reproducible with a single hw queue. (The script
> in cover letter can reproduce this problem with a single hw queue)
>
> >> flush_busy_ctxs() only retrieves the requests from software queues
> >> which are marked busy. So the request just inserted is ignored as
> >> the corresponding bit position is not busy.
> >
> > Before making the remap in blk_mq_queue_reinit() for the CPU topo change,
> > the request queue will be put into freezing first and all requests
> > inserted to hctx 0
> > should be retrieved and scheduled out. So can the request be igonred by
> > flush_busy_ctxs()?
>
> For example, there is a single hw queue (hctx) and two CPU queues
> (ctx0 for CPU0, and ctx1 for CPU1). Now CPU1 is just onlined and
> a request is inserted into ctx1->rq_list and set bit0 in pending
> bitmap as ctx1->index_hw is still zero.
>
> And then while running hw queue, flush_busy_ctxs() finds bit0 is set
> in pending bitmap and tries to retrieve requests in
> hctx->ctxs[0].rq_list. But htx->ctxs[0] is ctx0, so the request in
> ctx1->rq_list is ignored.

Per the current design, the request should have been inserted into ctx0
instead of ctx1, because ctx1 isn't mapped yet even though ctx1->cpu has
become ONLINE.

So how about the following patch, which looks much simpler?

---
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f537796..2f45b73 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1034,7 +1034,12 @@ void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
struct blk_mq_ctx *ctx = rq->mq_ctx, *current_ctx;

current_ctx = blk_mq_get_ctx(q);
- if (!cpu_online(ctx->cpu))
+ /*
+ * ctx->cpu may become ONLINE but ctx hasn't been mapped to
+ * hctx yet because there is a tiny race window between
+ * ctx->cpu ONLINE and doing the remap
+ */
+ if (!blk_mq_ctx_mapped(ctx))
rq->mq_ctx = ctx = current_ctx;

hctx = q->mq_ops->map_queue(q, ctx->cpu);
@@ -1063,7 +1068,7 @@ static void blk_mq_insert_requests(struct request_queue *q,

current_ctx = blk_mq_get_ctx(q);

- if (!cpu_online(ctx->cpu))
+ if (!blk_mq_ctx_mapped(ctx))
ctx = current_ctx;
hctx = q->mq_ops->map_queue(q, ctx->cpu);

@@ -1816,13 +1821,16 @@ static void blk_mq_map_swqueue(struct request_queue *q)
*/
queue_for_each_ctx(q, ctx, i) {
/* If the cpu isn't online, the cpu is mapped to first hctx */
- if (!cpu_online(i))
+ if (!cpu_online(i)) {
+ ctx->mapped = 0;
continue;
+ }

hctx = q->mq_ops->map_queue(q, i);
cpumask_set_cpu(i, hctx->cpumask);
cpumask_set_cpu(i, hctx->tags->cpumask);
ctx->index_hw = hctx->nr_ctx;
+ ctx->mapped = 1;
hctx->ctxs[hctx->nr_ctx++] = ctx;
}

diff --git a/block/blk-mq.h b/block/blk-mq.h
index 6a48c4c..52819ad 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -10,7 +10,8 @@ struct blk_mq_ctx {
} ____cacheline_aligned_in_smp;

unsigned int cpu;
- unsigned int index_hw;
+ unsigned int index_hw : 16;
+ unsigned int mapped : 1;

unsigned int last_tag ____cacheline_aligned_in_smp;

@@ -123,4 +124,9 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
return hctx->nr_ctx && hctx->tags;
}

+static inline bool blk_mq_ctx_mapped(struct blk_mq_ctx *ctx)
+{
+ return ctx->mapped;
+}
+
#endif

2015-06-25 23:35:46

by Akinobu Mita

Subject: Re: [PATCH 3/4] blk-mq: establish new mapping before cpu starts handling requests

2015-06-26 0:40 GMT+09:00 Ming Lei <[email protected]>:
> On Thu, 25 Jun 2015 21:49:43 +0900
> Akinobu Mita <[email protected]> wrote:
>> For example, there is a single hw queue (hctx) and two CPU queues
>> (ctx0 for CPU0, and ctx1 for CPU1). Now CPU1 is just onlined and
>> a request is inserted into ctx1->rq_list and set bit0 in pending
>> bitmap as ctx1->index_hw is still zero.
>>
>> And then while running hw queue, flush_busy_ctxs() finds bit0 is set
>> in pending bitmap and tries to retrieve requests in
>> hctx->ctxs[0].rq_list. But htx->ctxs[0] is ctx0, so the request in
>> ctx1->rq_list is ignored.
>
> Per current design, the request should have been inserted into ctx0 instead
> of ctx1 because ctx1 isn't mapped yet even though ctx1->cpu becomes ONLINE.
>
> So how about the following patch? which looks much simpler.

OK, I'll try this patch to see if the problem disappears.

> ---
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f537796..2f45b73 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1034,7 +1034,12 @@ void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
> struct blk_mq_ctx *ctx = rq->mq_ctx, *current_ctx;
>
> current_ctx = blk_mq_get_ctx(q);
> - if (!cpu_online(ctx->cpu))
> + /*
> + * ctx->cpu may become ONLINE but ctx hasn't been mapped to
> + * hctx yet because there is a tiny race window between
> + * ctx->cpu ONLINE and doing the remap
> + */
> + if (!blk_mq_ctx_mapped(ctx))
> rq->mq_ctx = ctx = current_ctx;
>
> hctx = q->mq_ops->map_queue(q, ctx->cpu);
> @@ -1063,7 +1068,7 @@ static void blk_mq_insert_requests(struct request_queue *q,
>
> current_ctx = blk_mq_get_ctx(q);
>
> - if (!cpu_online(ctx->cpu))
> + if (!blk_mq_ctx_mapped(ctx))
> ctx = current_ctx;
> hctx = q->mq_ops->map_queue(q, ctx->cpu);
>
> @@ -1816,13 +1821,16 @@ static void blk_mq_map_swqueue(struct request_queue *q)
> */
> queue_for_each_ctx(q, ctx, i) {
> /* If the cpu isn't online, the cpu is mapped to first hctx */
> - if (!cpu_online(i))
> + if (!cpu_online(i)) {
> + ctx->mapped = 0;
> continue;
> + }
>
> hctx = q->mq_ops->map_queue(q, i);
> cpumask_set_cpu(i, hctx->cpumask);
> cpumask_set_cpu(i, hctx->tags->cpumask);
> ctx->index_hw = hctx->nr_ctx;
> + ctx->mapped = 1;
> hctx->ctxs[hctx->nr_ctx++] = ctx;
> }
>
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 6a48c4c..52819ad 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -10,7 +10,8 @@ struct blk_mq_ctx {
> } ____cacheline_aligned_in_smp;
>
> unsigned int cpu;
> - unsigned int index_hw;
> + unsigned int index_hw : 16;
> + unsigned int mapped : 1;
>
> unsigned int last_tag ____cacheline_aligned_in_smp;
>
> @@ -123,4 +124,9 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
> return hctx->nr_ctx && hctx->tags;
> }
>
> +static inline bool blk_mq_ctx_mapped(struct blk_mq_ctx *ctx)
> +{
> + return ctx->mapped;
> +}
> +
> #endif
>
>

2015-06-26 17:14:28

by Akinobu Mita

Subject: Re: [PATCH 3/4] blk-mq: establish new mapping before cpu starts handling requests

Akinobu Mita <[email protected]> wrote:
> 2015-06-26 0:40 GMT+09:00 Ming Lei <[email protected]>:
> > On Thu, 25 Jun 2015 21:49:43 +0900
> > Akinobu Mita <[email protected]> wrote:
> >> For example, there is a single hw queue (hctx) and two CPU queues
> >> (ctx0 for CPU0, and ctx1 for CPU1). Now CPU1 is just onlined and
> >> a request is inserted into ctx1->rq_list and set bit0 in pending
> >> bitmap as ctx1->index_hw is still zero.
> >>
> >> And then while running hw queue, flush_busy_ctxs() finds bit0 is set
> >> in pending bitmap and tries to retrieve requests in
> >> hctx->ctxs[0].rq_list. But htx->ctxs[0] is ctx0, so the request in
> >> ctx1->rq_list is ignored.
> >
> > Per current design, the request should have been inserted into ctx0 instead
> > of ctx1 because ctx1 isn't mapped yet even though ctx1->cpu becomes ONLINE.
> >
> > So how about the following patch? which looks much simpler.
>
> OK, I'll try this patch to see if the problem disappears.

This doesn't fix the problem, because:

> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index f537796..2f45b73 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1034,7 +1034,12 @@ void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
> > struct blk_mq_ctx *ctx = rq->mq_ctx, *current_ctx;
> >
> > current_ctx = blk_mq_get_ctx(q);
> > - if (!cpu_online(ctx->cpu))
> > + /*
> > + * ctx->cpu may become ONLINE but ctx hasn't been mapped to
> > + * hctx yet because there is a tiny race window between
> > + * ctx->cpu ONLINE and doing the remap
> > + */
> > + if (!blk_mq_ctx_mapped(ctx))
> > rq->mq_ctx = ctx = current_ctx;

A process running on the just-onlined CPU1 in the above example can
satisfy this condition, and current_ctx will then be ctx1. So the same
scenario can happen (the request is ignored by flush_busy_ctxs).

I found a simple alternative solution that assigns the offline CPUs
unique ctx->index_hw values.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 594eea0..a8fcfbf 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1787,10 +1787,11 @@ static void blk_mq_map_swqueue(struct request_queue *q)
*/
queue_for_each_ctx(q, ctx, i) {
/* If the cpu isn't online, the cpu is mapped to first hctx */
- if (!cpu_online(i))
- continue;
+ if (!cpu_online(i) && cpu_possible(i))
+ hctx = q->mq_ops->map_queue(q, 0);
+ else
+ hctx = q->mq_ops->map_queue(q, i);

- hctx = q->mq_ops->map_queue(q, i);
cpumask_set_cpu(i, hctx->cpumask);
ctx->index_hw = hctx->nr_ctx;
hctx->ctxs[hctx->nr_ctx++] = ctx;
--


2015-06-27 16:08:17

by Ming Lei

Subject: Re: [PATCH 3/4] blk-mq: establish new mapping before cpu starts handling requests

On Sat, Jun 27, 2015 at 1:14 AM, Akinobu Mita <[email protected]> wrote:
> Akinobu Mita <[email protected]> wrote:
>> 2015-06-26 0:40 GMT+09:00 Ming Lei <[email protected]>:
>> > On Thu, 25 Jun 2015 21:49:43 +0900
>> > Akinobu Mita <[email protected]> wrote:
>> >> For example, there is a single hw queue (hctx) and two CPU queues
>> >> (ctx0 for CPU0, and ctx1 for CPU1). Now CPU1 is just onlined and
>> >> a request is inserted into ctx1->rq_list and set bit0 in pending
>> >> bitmap as ctx1->index_hw is still zero.
>> >>
>> >> And then while running hw queue, flush_busy_ctxs() finds bit0 is set
>> >> in pending bitmap and tries to retrieve requests in
>> >> hctx->ctxs[0].rq_list. But htx->ctxs[0] is ctx0, so the request in
>> >> ctx1->rq_list is ignored.
>> >
>> > Per current design, the request should have been inserted into ctx0 instead
>> > of ctx1 because ctx1 isn't mapped yet even though ctx1->cpu becomes ONLINE.
>> >
>> > So how about the following patch? which looks much simpler.
>>
>> OK, I'll try this patch to see if the problem disappears.
>
> This doesn't fix the problem. Because:
>
>> > diff --git a/block/blk-mq.c b/block/blk-mq.c
>> > index f537796..2f45b73 100644
>> > --- a/block/blk-mq.c
>> > +++ b/block/blk-mq.c
>> > @@ -1034,7 +1034,12 @@ void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
>> > struct blk_mq_ctx *ctx = rq->mq_ctx, *current_ctx;
>> >
>> > current_ctx = blk_mq_get_ctx(q);
>> > - if (!cpu_online(ctx->cpu))
>> > + /*
>> > + * ctx->cpu may become ONLINE but ctx hasn't been mapped to
>> > + * hctx yet because there is a tiny race window between
>> > + * ctx->cpu ONLINE and doing the remap
>> > + */
>> > + if (!blk_mq_ctx_mapped(ctx))
>> > rq->mq_ctx = ctx = current_ctx;
>
> The process running on just onlined CPU1 in the above example can
> satisfy this condition and current_ctx will be ctx1. So the same
> scenario can happen (the request is ignored by flush_busy_ctxs).

Yeah, that is possible, and it should be considered a bug in
blk_mq_insert_request(), because the function supposes that the current
ctx is mapped.

Then I think the approach in your first email of this thread may be a good
one, and it looks like we have to do the remapping in the
CPU UP_PREPARE notifier.

>
> I found simple alternative solution that assigns the offline CPUs
> unique ctx->index_hw.

This solution looks simpler, but it may not be correct.

>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 594eea0..a8fcfbf 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1787,10 +1787,11 @@ static void blk_mq_map_swqueue(struct
> request_queue *q)
> */
> queue_for_each_ctx(q, ctx, i) {
> /* If the cpu isn't online, the cpu is mapped to first hctx */
> - if (!cpu_online(i))
> - continue;
> + if (!cpu_online(i) && cpu_possible(i))

The cpu_possible() check isn't needed.

> + hctx = q->mq_ops->map_queue(q, 0);
> + else
> + hctx = q->mq_ops->map_queue(q, i);

The above change supposes that all offline CPUs (ctxs)
share the same 'hctx' mapped from CPU 0, and that is
obviously wrong.

All offline CPUs should share the 1st 'hctx' instead of
the 'hctx' mapped from CPU 0.
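
Concretely, the distinction is roughly the following in
blk_mq_map_swqueue() (just a sketch of the idea; note that
q->mq_ops->map_queue(q, 0) returns whatever hctx CPU 0 happens to be
mapped to, which is not necessarily q->queue_hw_ctx[0]):

	if (!cpu_online(i))
		hctx = q->queue_hw_ctx[0];	/* always the 1st hctx */
	else
		hctx = q->mq_ops->map_queue(q, i);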

>
> - hctx = q->mq_ops->map_queue(q, i);
> cpumask_set_cpu(i, hctx->cpumask);

CPU i shouldn't have been set on hctx->cpumask in this approach
if it isn't online.

> ctx->index_hw = hctx->nr_ctx;
> hctx->ctxs[hctx->nr_ctx++] = ctx;

I am not sure the current code is ready for adding an offline 'ctx' into
a 'hctx', and there are some observable effects at least:

- blk_mq_run_hw_queue() can run even if the hctx has no mapped
'ctx'
- the offline ctx kobject will be exposed to user space in sysfs
- blk_mq_hw_queue_mapped() may become always true
- set->tags[i] (request entries) may not be freed even if there
is no mapped 'ctx' in one 'hctx'

--
Ming Lei

2015-06-29 14:25:52

by Akinobu Mita

[permalink] [raw]
Subject: Re: [PATCH 3/4] blk-mq: establish new mapping before cpu starts handling requests

2015-06-28 1:08 GMT+09:00 Ming Lei <[email protected]>:
> On Sat, Jun 27, 2015 at 1:14 AM, Akinobu Mita <[email protected]> wrote:
>> Akinobu Mita <[email protected]> wrote:
>>> 2015-06-26 0:40 GMT+09:00 Ming Lei <[email protected]>:
>>> > On Thu, 25 Jun 2015 21:49:43 +0900
>>> > Akinobu Mita <[email protected]> wrote:
>>> >> For example, there is a single hw queue (hctx) and two CPU queues
>>> >> (ctx0 for CPU0, and ctx1 for CPU1). Now CPU1 is just onlined and
>>> >> a request is inserted into ctx1->rq_list and set bit0 in pending
>>> >> bitmap as ctx1->index_hw is still zero.
>>> >>
>>> >> And then while running hw queue, flush_busy_ctxs() finds bit0 is set
>>> >> in pending bitmap and tries to retrieve requests in
>>> >> hctx->ctxs[0].rq_list. But hctx->ctxs[0] is ctx0, so the request in
>>> >> ctx1->rq_list is ignored.
>>> >
>>> > Per current design, the request should have been inserted into ctx0 instead
>>> > of ctx1 because ctx1 isn't mapped yet even though ctx1->cpu becomes ONLINE.
>>> >
>>> > So how about the following patch? which looks much simpler.
>>>
>>> OK, I'll try this patch to see if the problem disappears.
>>
>> This doesn't fix the problem. Because:
>>
>>> > diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> > index f537796..2f45b73 100644
>>> > --- a/block/blk-mq.c
>>> > +++ b/block/blk-mq.c
>>> > @@ -1034,7 +1034,12 @@ void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
>>> > struct blk_mq_ctx *ctx = rq->mq_ctx, *current_ctx;
>>> >
>>> > current_ctx = blk_mq_get_ctx(q);
>>> > - if (!cpu_online(ctx->cpu))
>>> > + /*
>>> > + * ctx->cpu may become ONLINE but ctx hasn't been mapped to
>>> > + * hctx yet because there is a tiny race window between
>>> > + * ctx->cpu ONLINE and doing the remap
>>> > + */
>>> > + if (!blk_mq_ctx_mapped(ctx))
>>> > rq->mq_ctx = ctx = current_ctx;
>>
>> The process running on just onlined CPU1 in the above example can
>> satisfy this condition and current_ctx will be ctx1. So the same
>> scenario can happen (the request is ignored by flush_busy_ctxs).
>
> Yeah, that is possible, and it should be considered a bug in
> blk_mq_insert_request(), because the function assumes that the
> current ctx is mapped.
>
> Then I think the approach in your 1st email of this thread may be a
> good one, and it looks like we have to do the remapping during the
> CPU UP_PREPARE notifier.

OK, we can move on to the other topic that you suggested in the other
mail thread.

>> I found a simple alternative solution that assigns the offline CPUs
>> unique ctx->index_hw values.
>
> This solution looks simpler, but it may not be correct.

You are right. This solution needs amendments, just in case we need
to come back to this solution instead of the first approach.

>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 594eea0..a8fcfbf 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -1787,10 +1787,11 @@ static void blk_mq_map_swqueue(struct
>> request_queue *q)
>> */
>> queue_for_each_ctx(q, ctx, i) {
>> /* If the cpu isn't online, the cpu is mapped to first hctx */
>> - if (!cpu_online(i))
>> - continue;
>> + if (!cpu_online(i) && cpu_possible(i))
>
> The cpu_possible() check isn't needed.

OK.

>> + hctx = q->mq_ops->map_queue(q, 0);
>> + else
>> + hctx = q->mq_ops->map_queue(q, i);
>
> The above change supposes that all offline CPUs (ctxs)
> share the same 'hctx' mapped from CPU 0, and that is
> obviously wrong.
>
> All offline CPUs should share the 1st 'hctx' instead of
> the 'hctx' mapped from CPU 0.

Do you mean that we should use 'q->queue_hw_ctx[0]' for offline CPUs?

>>
>> - hctx = q->mq_ops->map_queue(q, i);
>> cpumask_set_cpu(i, hctx->cpumask);
>
> CPU i shouldn't have been set on hctx->cpumask in this approach
> if it isn't online.

OK.

>> ctx->index_hw = hctx->nr_ctx;
>> hctx->ctxs[hctx->nr_ctx++] = ctx;
>
> I am not sure the current code is ready for adding an offline 'ctx' into
> a 'hctx', and there are some observable effects at least:
>
> - blk_mq_run_hw_queue() can run even if the hctx has no mapped
> 'ctx'

Is this fixed by not setting hctx->cpumask for offline CPUs?

> - the offline ctx kobject will be exposed to user space in sysfs

OK, this should be avoided.

> - blk_mq_hw_queue_mapped() may become always true
> - set->tags[i] (request entries) may not be freed even if there
> is no mapped 'ctx' in one 'hctx'

Don't these only happen for the 1st hctx (i.e. 'q->queue_hw_ctx[0]')?

2015-06-29 14:31:28

by Akinobu Mita

[permalink] [raw]
Subject: Re: [PATCH 4/4] blk-mq: fix mq_usage_counter race when switching to percpu mode

2015-06-24 21:35 GMT+09:00 Ming Lei <[email protected]>:
> On Sun, Jun 21, 2015 at 9:52 PM, Akinobu Mita <[email protected]> wrote:
>> percpu_ref_switch_to_percpu() and percpu_ref_kill() must not be
>> executed at the same time as the following scenario is possible:
>>
>> 1. q->mq_usage_counter is initialized in atomic mode.
>> (atomic counter: 1)
>>
>> 2. After the disk registration, a process like systemd-udev starts
>> accessing the disk, and successfully increases the refcount
>> by percpu_ref_tryget_live() in blk_mq_queue_enter().
>> (atomic counter: 2)
>>
>> 3. In the final stage of initialization, q->mq_usage_counter is being
>> switched to percpu mode by percpu_ref_switch_to_percpu() in
>> blk_mq_finish_init(). But if CONFIG_PREEMPT_VOLUNTARY is enabled,
>> the process is rescheduled in the middle of switching when calling
>> wait_event() in __percpu_ref_switch_to_percpu().
>> (atomic counter: 2)
>
> This can only happen when the queue freeze from CPU hotplug
> happens before wait_event() runs in blk_mq_finish_init().
>
> So it looks like we may still avoid the race by moving the addition
> of the queue to all_q_list into blk_mq_register_disk(), but the
> mapping needs to be updated at that time.

As you suggested, we can avoid the problems described in patches 1/4,
2/4, and 4/4 by adding the queue to all_q_list in blk_mq_register_disk()
and removing it from all_q_list in blk_mq_unregister_disk().
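
A rough sketch of that movement, assuming the existing all_q_mutex and
all_q_list are kept (and ignoring that they are currently static to
blk-mq.c, so small helpers would be needed):

	/* from blk_mq_register_disk(), instead of at queue init time */
	mutex_lock(&all_q_mutex);
	list_add_tail(&q->all_q_node, &all_q_list);
	mutex_unlock(&all_q_mutex);

	/* and the counterpart from blk_mq_unregister_disk() */
	mutex_lock(&all_q_mutex);
	list_del_init(&q->all_q_node);
	mutex_unlock(&all_q_mutex);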

On the other hand, that change breaks patch 3/4, because it allows
requests to be inserted before the queue is added to all_q_list. As we
have discussed for patch 3/4, there is a problem if a request is
inserted on a just-onlined CPU before the new mapping is established.

2015-06-30 09:06:30

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH 3/4] blk-mq: establish new mapping before cpu starts handling requests

On Mon, Jun 29, 2015 at 10:25 PM, Akinobu Mita <[email protected]> wrote:
> 2015-06-28 1:08 GMT+09:00 Ming Lei <[email protected]>:
>> On Sat, Jun 27, 2015 at 1:14 AM, Akinobu Mita <[email protected]> wrote:
>>> Akinobu Mita <[email protected]> wrote:
>>>> 2015-06-26 0:40 GMT+09:00 Ming Lei <[email protected]>:
>>>> > On Thu, 25 Jun 2015 21:49:43 +0900
>>>> > Akinobu Mita <[email protected]> wrote:
>>>> >> For example, there is a single hw queue (hctx) and two CPU queues
>>>> >> (ctx0 for CPU0, and ctx1 for CPU1). Now CPU1 is just onlined and
>>>> >> a request is inserted into ctx1->rq_list and set bit0 in pending
>>>> >> bitmap as ctx1->index_hw is still zero.
>>>> >>
>>>> >> And then while running hw queue, flush_busy_ctxs() finds bit0 is set
>>>> >> in pending bitmap and tries to retrieve requests in
>>>> >> hctx->ctxs[0].rq_list. But hctx->ctxs[0] is ctx0, so the request in
>>>> >> ctx1->rq_list is ignored.
>>>> >
>>>> > Per current design, the request should have been inserted into ctx0 instead
>>>> > of ctx1 because ctx1 isn't mapped yet even though ctx1->cpu becomes ONLINE.
>>>> >
>>>> > So how about the following patch? which looks much simpler.
>>>>
>>>> OK, I'll try this patch to see if the problem disappears.
>>>
>>> This doesn't fix the problem. Because:
>>>
>>>> > diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> > index f537796..2f45b73 100644
>>>> > --- a/block/blk-mq.c
>>>> > +++ b/block/blk-mq.c
>>>> > @@ -1034,7 +1034,12 @@ void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
>>>> > struct blk_mq_ctx *ctx = rq->mq_ctx, *current_ctx;
>>>> >
>>>> > current_ctx = blk_mq_get_ctx(q);
>>>> > - if (!cpu_online(ctx->cpu))
>>>> > + /*
>>>> > + * ctx->cpu may become ONLINE but ctx hasn't been mapped to
>>>> > + * hctx yet because there is a tiny race window between
>>>> > + * ctx->cpu ONLINE and doing the remap
>>>> > + */
>>>> > + if (!blk_mq_ctx_mapped(ctx))
>>>> > rq->mq_ctx = ctx = current_ctx;
>>>
>>> The process running on just onlined CPU1 in the above example can
>>> satisfy this condition and current_ctx will be ctx1. So the same
>>> scenario can happen (the request is ignored by flush_busy_ctxs).
>>
>> Yeah, that is possible, and it should be considered a bug in
>> blk_mq_insert_request(), because the function assumes that the
>> current ctx is mapped.
>>
>> Then I think the approach in your 1st email of this thread may be a
>> good one, and it looks like we have to do the remapping during the
>> CPU UP_PREPARE notifier.
>
> OK, we can move on to the other topic that you suggested in the other
> mail thread.

Why? That is just what this patch in this thread is doing, :-)

>
>>> I found a simple alternative solution that assigns the offline CPUs
>>> unique ctx->index_hw values.
>>
>> This solution looks simpler, but it may not be correct.
>
> You are right. This solution needs amendments, just in case we need
> to come back to this solution instead of the first approach.
>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 594eea0..a8fcfbf 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -1787,10 +1787,11 @@ static void blk_mq_map_swqueue(struct
>>> request_queue *q)
>>> */
>>> queue_for_each_ctx(q, ctx, i) {
>>> /* If the cpu isn't online, the cpu is mapped to first hctx */
>>> - if (!cpu_online(i))
>>> - continue;
>>> + if (!cpu_online(i) && cpu_possible(i))
>>
>> The cpu_possible() check isn't needed.
>
> OK.
>
>>> + hctx = q->mq_ops->map_queue(q, 0);
>>> + else
>>> + hctx = q->mq_ops->map_queue(q, i);
>>
>> The above change supposes that all offline CPUs (ctxs)
>> share the same 'hctx' mapped from CPU 0, and that is
>> obviously wrong.
>>
>> All offline CPUs should share the 1st 'hctx' instead of
>> the 'hctx' mapped from CPU 0.
>
> Do you mean that we should use 'q->queue_hw_ctx[0]' for offline CPUs?

Yes, it is the 1st hctx, but I don't think it is a good idea to add the
unmapped ctx into the 1st hctx because it may bring side effects.

>
>>>
>>> - hctx = q->mq_ops->map_queue(q, i);
>>> cpumask_set_cpu(i, hctx->cpumask);
>>
>> CPU i shouldn't have been set on hctx->cpumask in this approach
>> if it isn't online.
>
> OK.
>
>>> ctx->index_hw = hctx->nr_ctx;
>>> hctx->ctxs[hctx->nr_ctx++] = ctx;
>>
>> I am not sure the current code is ready for adding an offline 'ctx' into
>> a 'hctx', and there are some observable effects at least:
>>
>> - blk_mq_run_hw_queue() can run even if the hctx has no mapped
>> 'ctx'
>
> Is this fixed by not setting hctx->cpumask for offline CPUs?
>
>> - the offline ctx kobject will be exposed to user space in sysfs
>
> OK, this should be avoided.
>
>> - blk_mq_hw_queue_mapped() may become always true
>> - set->tags[i] (request entries) may not be freed even if there
>> is no mapped 'ctx' in one 'hctx'
>
> Don't these only happen for the 1st hctx (i.e. 'q->queue_hw_ctx[0]')?

I mean it may bring lots of issues if you add the unmapped ctx into
the 1st hctx with 'hctx->ctxs[hctx->nr_ctx++] = ctx', and that may
make the fix quite complicated.

So I think it will be good to do the remapping during the CPU
UP_PREPARE notifier, and I will review this patch later.
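
For reference, that would roughly mean letting the existing notifier
also act on CPU_UP_PREPARE, along these lines (sketch only, with the
actual reinit work elided):

static int blk_mq_queue_reinit_notify(struct notifier_block *nfb,
				      unsigned long action, void *hcpu)
{
	switch (action & ~CPU_TASKS_FROZEN) {
	case CPU_UP_PREPARE:	/* remap before the new CPU may submit */
	case CPU_DEAD:		/* remap after the CPU has gone away */
		break;
	default:
		return NOTIFY_OK;
	}

	/* ... freeze the queues on all_q_list and redo blk_mq_map_swqueue() ... */
	return NOTIFY_OK;
}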

Thanks,
--
Ming Lei