Date: Thu, 9 Jul 2015 14:55:10 +0800
From: Ming Lei <tom.leiming@gmail.com>
To: Akinobu Mita <akinobu.mita@gmail.com>
Cc: linux-kernel@vger.kernel.org, Jens Axboe <axboe@kernel.dk>,
        tom.leiming@gmail.com, Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH v2 5/6] blk-mq: fix freeze queue race
Message-ID: <20150709145510.6b8e21a2@tom-T450>
In-Reply-To: <1435847397-724-6-git-send-email-akinobu.mita@gmail.com>
References: <1435847397-724-1-git-send-email-akinobu.mita@gmail.com>
	<1435847397-724-6-git-send-email-akinobu.mita@gmail.com>
Organization: Ming
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 6369
Lines: 177

On Thu,  2 Jul 2015 23:29:56 +0900
Akinobu Mita <akinobu.mita@gmail.com> wrote:

> There are several race conditions while freezing queue.
> 
> When unfreezing queue, there is a small window between decrementing
> q->mq_freeze_depth to zero and percpu_ref_reinit() call with
> q->mq_usage_counter.  If another task calls blk_mq_freeze_queue_start()
> in the window, q->mq_freeze_depth is increased from zero to one and
> percpu_ref_kill() is called with q->mq_usage_counter which is already
> killed.  percpu refcount should be re-initialized before killed again.
> 
> Also, there is a race condition while switching to percpu mode.
> percpu_ref_switch_to_percpu() and percpu_ref_kill() must not be
> executed at the same time as the following scenario is possible:
> 
> 1. q->mq_usage_counter is initialized in atomic mode.
>    (atomic counter: 1)
> 
> 2. After the disk registration, a process like systemd-udev starts
>    accessing the disk, and successfully increases the refcount
>    by percpu_ref_tryget_live() in blk_mq_queue_enter().
>    (atomic counter: 2)
> 
> 3. In the final stage of initialization, q->mq_usage_counter is being
>    switched to percpu mode by percpu_ref_switch_to_percpu() in
>    blk_mq_finish_init().  But if CONFIG_PREEMPT_VOLUNTARY is enabled,
>    the process is rescheduled in the middle of switching when calling
>    wait_event() in __percpu_ref_switch_to_percpu().
>    (atomic counter: 2)
> 
> 4. CPU hotplug handling for blk-mq calls percpu_ref_kill() to freeze
>    request queue.  q->mq_usage_counter is decreased and marked as
>    DEAD.  Wait until all requests have finished.
>    (atomic counter: 1)
> 
> 5. The process rescheduled in the step 3. is resumed and finishes
>    all remaining work in __percpu_ref_switch_to_percpu().
>    A bias value is added to atomic counter of q->mq_usage_counter.
>    (atomic counter: PERCPU_COUNT_BIAS + 1)
> 
> 6. A request issued in the step 2. is finished and q->mq_usage_counter
>    is decreased by blk_mq_queue_exit().  q->mq_usage_counter is DEAD,
>    so atomic counter is decreased and no release handler is called.
>    (atomic counter: PERCPU_COUNT_BIAS)
> 
> 7. CPU hotplug handling in the step 4. will wait forever as
>    q->mq_usage_counter will never be zero.
> 
> Also, percpu_ref_reinit() and percpu_ref_kill() must not be executed
> at the same time, because both functions could call
> __percpu_ref_switch_to_percpu(), which adds the bias value and
> initializes the percpu counter.
> 
> Fix those races by serializing with per-queue mutex.

This patch looks fine, since changes to the DEAD state of a percpu ref
should at least be synchronized by the caller.

Also, it looks like __percpu_ref_switch_to_percpu() needs to check whether
the refcount has become dead once the wait for the current switch has
finished, and something like the following seems to be needed:

diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 6111bcb..2532949 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -235,6 +235,10 @@ static void __percpu_ref_switch_to_percpu(struct percpu_ref *ref)
 
 	wait_event(percpu_ref_switch_waitq, !ref->confirm_switch);
 
+	/* dying or dead ref can't be switched to percpu mode w/o reinit */
+	if (ref->percpu_count_ptr & __PERCPU_REF_DEAD)
+		return;
+
 	atomic_long_add(PERCPU_COUNT_BIAS, &ref->count);
 
 	/*
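
To make the counter arithmetic in steps 1-7 of the quoted scenario easier
to follow, here is a rough userspace model.  It is only a sketch and not
kernel code: struct model_ref, MODEL_BIAS and the model_*() helpers are
made-up names for illustration, MODEL_BIAS is an arbitrary stand-in for
PERCPU_COUNT_BIAS, and the percpu counters, RCU and wait queue are not
modeled at all.  The check_dead flag plays the role of the DEAD check in
the hunk above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for PERCPU_COUNT_BIAS; the real value differs. */
#define MODEL_BIAS (1LL << 40)

/* Toy model of the atomic part of a percpu_ref. */
struct model_ref {
	long long count;	/* models the atomic counter */
	bool dead;		/* models __PERCPU_REF_DEAD */
};

static void model_init(struct model_ref *r)
{
	r->count = 1;		/* initial reference, atomic mode */
	r->dead = false;
}

static void model_get(struct model_ref *r)
{
	r->count++;
}

static void model_put(struct model_ref *r)
{
	assert(r->count > 0);
	r->count--;
}

/* Models percpu_ref_kill(): mark DEAD and drop the initial reference. */
static void model_kill(struct model_ref *r)
{
	r->dead = true;
	r->count--;
}

/*
 * Models the tail of the switch to percpu mode, i.e. the part that runs
 * after the sleep in wait_event().  With check_dead set, a ref that was
 * killed in the meantime is left alone instead of getting the bias added.
 */
static void model_finish_switch(struct model_ref *r, bool check_dead)
{
	if (check_dead && r->dead)
		return;
	r->count += MODEL_BIAS;
}

/* Replays steps 1-6 and returns the final value of the counter. */
static long long model_replay(bool check_dead)
{
	struct model_ref r;

	model_init(&r);				/* step 1: count = 1 */
	model_get(&r);				/* step 2: count = 2 */
	/* step 3: the switch starts and sleeps before adding the bias */
	model_kill(&r);				/* step 4: count = 1, DEAD */
	model_finish_switch(&r, check_dead);	/* step 5 */
	model_put(&r);				/* step 6 */
	return r.count;
}
```

In this model, model_replay(false) ends with the counter stuck at
MODEL_BIAS, which corresponds to the freeze waiting forever in step 7,
while model_replay(true) ends at zero.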


> 
> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Ming Lei <tom.leiming@gmail.com>
> ---
>  block/blk-core.c       | 1 +
>  block/blk-mq-sysfs.c   | 2 ++
>  block/blk-mq.c         | 8 ++++++++
>  include/linux/blkdev.h | 6 ++++++
>  4 files changed, 17 insertions(+)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index bbf67cd..f3c5ae2 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -687,6 +687,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
>  	__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);
>  
>  	init_waitqueue_head(&q->mq_freeze_wq);
> +	mutex_init(&q->mq_freeze_lock);
>  	mutex_init(&q->mq_sysfs_lock);
>  
>  	if (blkcg_init_queue(q))
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index 79a3e8d..8448513 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -413,7 +413,9 @@ static void blk_mq_sysfs_init(struct request_queue *q)
>  /* see blk_register_queue() */
>  void blk_mq_finish_init(struct request_queue *q)
>  {
> +	mutex_lock(&q->mq_freeze_lock);
>  	percpu_ref_switch_to_percpu(&q->mq_usage_counter);
> +	mutex_unlock(&q->mq_freeze_lock);
>  }
>  
>  int blk_mq_register_disk(struct gendisk *disk)
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index ad07373..f31de35 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -115,11 +115,15 @@ void blk_mq_freeze_queue_start(struct request_queue *q)
>  {
>  	int freeze_depth;
>  
> +	mutex_lock(&q->mq_freeze_lock);
> +
>  	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
>  	if (freeze_depth == 1) {
>  		percpu_ref_kill(&q->mq_usage_counter);
>  		blk_mq_run_hw_queues(q, false);
>  	}
> +
> +	mutex_unlock(&q->mq_freeze_lock);
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start);
>  
> @@ -143,12 +147,16 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
>  {
>  	int freeze_depth;
>  
> +	mutex_lock(&q->mq_freeze_lock);
> +
>  	freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
>  	WARN_ON_ONCE(freeze_depth < 0);
>  	if (!freeze_depth) {
>  		percpu_ref_reinit(&q->mq_usage_counter);
>  		wake_up_all(&q->mq_freeze_wq);
>  	}
> +
> +	mutex_unlock(&q->mq_freeze_lock);
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
>  
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index c56f5a6..0bf8bea 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -457,6 +457,12 @@ struct request_queue {
>  #endif
>  	struct rcu_head		rcu_head;
>  	wait_queue_head_t	mq_freeze_wq;
> +	/*
> +	 * Protect concurrent access to mq_usage_counter by
> +	 * percpu_ref_switch_to_percpu(), percpu_ref_kill(), and
> +	 * percpu_ref_reinit().
> +	 */
> +	struct mutex		mq_freeze_lock;
>  	struct percpu_ref	mq_usage_counter;
>  	struct list_head	all_q_node;
>  
