From: Ming Lei
To: Akinobu Mita
Cc: Linux Kernel Mailing List, Jens Axboe
Subject: Re: [PATCH 4/4] blk-mq: fix mq_usage_counter race when switching to percpu mode
Date: Wed, 24 Jun 2015 20:35:37 +0800

On Sun, Jun 21, 2015 at 9:52 PM, Akinobu Mita wrote:
> percpu_ref_switch_to_percpu() and percpu_ref_kill() must not be
> executed at the same time, as the following scenario is otherwise
> possible:
>
> 1. q->mq_usage_counter is initialized in atomic mode.
>    (atomic counter: 1)
>
> 2. After the disk registration, a process like systemd-udev starts
>    accessing the disk, and successfully increases the refcount by
>    percpu_ref_tryget_live() in blk_mq_queue_enter().
>    (atomic counter: 2)
>
> 3. In the final stage of initialization, q->mq_usage_counter is being
>    switched to percpu mode by percpu_ref_switch_to_percpu() in
>    blk_mq_finish_init(). But if CONFIG_PREEMPT_VOLUNTARY is enabled,
>    the process is rescheduled in the middle of the switch when calling
>    wait_event() in __percpu_ref_switch_to_percpu().
>    (atomic counter: 2)

This can only happen when freezing the queue from CPU hotplug occurs
before wait_event() is run from blk_mq_finish_init(). So it looks like
we could still avoid the race by moving the addition of the queue to
all_q_list into blk_mq_register_disk(), but the mapping would need to
be updated at that time.

>
> 4. CPU hotplug handling for blk-mq calls percpu_ref_kill() to freeze
>    the request queue. q->mq_usage_counter is decreased and marked as
>    DEAD. Wait until all requests have finished.
>    (atomic counter: 1)
>
> 5. The process rescheduled in step 3 is resumed and finishes
>    all remaining work in __percpu_ref_switch_to_percpu().
>    A bias value is added to the atomic counter of q->mq_usage_counter.
>    (atomic counter: PERCPU_COUNT_BIAS + 1)
>
> 6. A request issued in step 2 is finished and q->mq_usage_counter
>    is decreased by blk_mq_queue_exit(). q->mq_usage_counter is DEAD,
>    so the atomic counter is decreased and no release handler is called.
>    (atomic counter: PERCPU_COUNT_BIAS)
>
> 7. CPU hotplug handling in step 4 will wait forever as
>    q->mq_usage_counter will never be zero.
>
> Also, percpu_ref_reinit() and percpu_ref_kill() must not be executed
> at the same time, because both functions could call
> __percpu_ref_switch_to_percpu(), which adds the bias value and
> initializes the percpu counter.
>
> Fix those races by serializing with a mutex.
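
For readers following along, here is a minimal sketch of the
serialization pattern the patch introduces, pulled out of the blk-mq
context. The usage_counter struct and helper names below are
illustrative only; the percpu_ref and mutex calls mirror what the
patch below does inside struct request_queue:

#include <linux/gfp.h>
#include <linux/mutex.h>
#include <linux/percpu-refcount.h>

/*
 * Illustrative only: a stand-alone version of the locking pattern.
 * The real patch puts the mutex in struct request_queue and takes it
 * around the existing percpu_ref calls in blk-mq.
 */
struct usage_counter {
	struct percpu_ref ref;
	struct mutex lock;	/* serializes mode switch vs. kill/reinit */
};

static void usage_release(struct percpu_ref *ref)
{
	/* in blk-mq this wakes up waiters on mq_freeze_wq */
}

static int usage_counter_init(struct usage_counter *u)
{
	mutex_init(&u->lock);
	/* start in atomic mode, as blk_mq_init_allocated_queue() does */
	return percpu_ref_init(&u->ref, usage_release,
			       PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);
}

/* step 3 above: switch to percpu mode once initialization is done */
static void usage_counter_finish_init(struct usage_counter *u)
{
	mutex_lock(&u->lock);
	percpu_ref_switch_to_percpu(&u->ref);
	mutex_unlock(&u->lock);
}

/* step 4 above: the freeze path can no longer overlap with the switch */
static void usage_counter_freeze(struct usage_counter *u)
{
	mutex_lock(&u->lock);
	percpu_ref_kill(&u->ref);
	mutex_unlock(&u->lock);
}

static void usage_counter_unfreeze(struct usage_counter *u)
{
	mutex_lock(&u->lock);
	percpu_ref_reinit(&u->ref);
	mutex_unlock(&u->lock);
}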
>
> Signed-off-by: Akinobu Mita
> Cc: Jens Axboe
> ---
>  block/blk-mq-sysfs.c   | 2 ++
>  block/blk-mq.c         | 6 ++++++
>  include/linux/blkdev.h | 6 ++++++
>  3 files changed, 14 insertions(+)
>
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index d8ef3a3..af3e126 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -407,7 +407,9 @@ static void blk_mq_sysfs_init(struct request_queue *q)
>  /* see blk_register_queue() */
>  void blk_mq_finish_init(struct request_queue *q)
>  {
> +	mutex_lock(&q->mq_usage_lock);
>  	percpu_ref_switch_to_percpu(&q->mq_usage_counter);
> +	mutex_unlock(&q->mq_usage_lock);
>  }
>
>  int blk_mq_register_disk(struct gendisk *disk)
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 64d93e4..62d0ef1 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -119,7 +119,9 @@ void blk_mq_freeze_queue_start(struct request_queue *q)
>  	spin_unlock_irq(q->queue_lock);
>
>  	if (freeze) {
> +		mutex_lock(&q->mq_usage_lock);
>  		percpu_ref_kill(&q->mq_usage_counter);
> +		mutex_unlock(&q->mq_usage_lock);
>  		blk_mq_run_hw_queues(q, false);
>  	}
>  }
> @@ -150,7 +152,9 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
>  	WARN_ON_ONCE(q->mq_freeze_depth < 0);
>  	spin_unlock_irq(q->queue_lock);
>  	if (wake) {
> +		mutex_lock(&q->mq_usage_lock);
>  		percpu_ref_reinit(&q->mq_usage_counter);
> +		mutex_unlock(&q->mq_usage_lock);
>  		wake_up_all(&q->mq_freeze_wq);
>  	}
>  }
> @@ -1961,6 +1965,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>  		hctxs[i]->queue_num = i;
>  	}
>
> +	mutex_init(&q->mq_usage_lock);
> +
>  	/*
>  	 * Init percpu_ref in atomic mode so that it's faster to shutdown.
>  	 * See blk_register_queue() for details.
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 5d93a66..c5bf534 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -484,6 +484,12 @@ struct request_queue {
>  	struct rcu_head		rcu_head;
>  	wait_queue_head_t	mq_freeze_wq;
>  	struct percpu_ref	mq_usage_counter;
> +	/*
> +	 * Protect concurrent access from percpu_ref_switch_to_percpu and
> +	 * percpu_ref_kill, and access from percpu_ref_switch_to_percpu and
> +	 * percpu_ref_reinit.
> +	 */
> +	struct mutex		mq_usage_lock;
>  	struct list_head	all_q_node;
>
>  	struct blk_mq_tag_set	*tag_set;
> --
> 1.9.1
>

--
Ming Lei