Date: Thu, 9 Jul 2015 14:55:10 +0800
From: Ming Lei
To: Akinobu Mita
Cc: linux-kernel@vger.kernel.org, Jens Axboe, tom.leiming@gmail.com, Tejun Heo
Subject: Re: [PATCH v2 5/6] blk-mq: fix freeze queue race
Message-ID: <20150709145510.6b8e21a2@tom-T450>
In-Reply-To: <1435847397-724-6-git-send-email-akinobu.mita@gmail.com>
References: <1435847397-724-1-git-send-email-akinobu.mita@gmail.com>
 <1435847397-724-6-git-send-email-akinobu.mita@gmail.com>

On Thu, 2 Jul 2015 23:29:56 +0900
Akinobu Mita wrote:

> There are several race conditions while freezing queue.
>
> When unfreezing queue, there is a small window between decrementing
> q->mq_freeze_depth to zero and percpu_ref_reinit() call with
> q->mq_usage_counter. If the other calls blk_mq_freeze_queue_start()
> in the window, q->mq_freeze_depth is increased from zero to one and
> percpu_ref_kill() is called with q->mq_usage_counter which is already
> killed. percpu refcount should be re-initialized before killed again.
>
> Also, there is a race condition while switching to percpu mode.
> percpu_ref_switch_to_percpu() and percpu_ref_kill() must not be
> executed at the same time as the following scenario is possible:
>
> 1. q->mq_usage_counter is initialized in atomic mode.
>    (atomic counter: 1)
>
> 2. After the disk registration, a process like systemd-udev starts
>    accessing the disk, and successfully increases refcount
>    by percpu_ref_tryget_live() in blk_mq_queue_enter().
>    (atomic counter: 2)
>
> 3. In the final stage of initialization, q->mq_usage_counter is being
>    switched to percpu mode by percpu_ref_switch_to_percpu() in
>    blk_mq_finish_init(). But if CONFIG_PREEMPT_VOLUNTARY is enabled,
>    the process is rescheduled in the middle of switching when calling
>    wait_event() in __percpu_ref_switch_to_percpu().
>    (atomic counter: 2)
>
> 4. CPU hotplug handling for blk-mq calls percpu_ref_kill() to freeze
>    request queue. q->mq_usage_counter is decreased and marked as
>    DEAD. Wait until all requests have finished.
>    (atomic counter: 1)
>
> 5. The process rescheduled in the step 3. is resumed and finishes
>    all remaining work in __percpu_ref_switch_to_percpu().
>    A bias value is added to atomic counter of q->mq_usage_counter.
>    (atomic counter: PERCPU_COUNT_BIAS + 1)
>
> 6. A request issued in the step 2. is finished and q->mq_usage_counter
>    is decreased by blk_mq_queue_exit(). q->mq_usage_counter is DEAD,
>    so atomic counter is decreased and no release handler is called.
>    (atomic counter: PERCPU_COUNT_BIAS)
>
> 7. CPU hotplug handling in the step 4. will wait forever as
>    q->mq_usage_counter will never be zero.
>
> Also, percpu_ref_reinit() and percpu_ref_kill() must not be executed
> at the same time. Because both functions could call
> __percpu_ref_switch_to_percpu() which adds the bias value and
> initialize percpu counter.
>
> Fix those races by serializing with per-queue mutex.

This patch looks fine, since at least changes to the DEAD state of a
percpu ref should be synchronized by the caller.

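The counter arithmetic in the quoted seven-step scenario can be replayed with
a small userspace model. This is only a sketch under stated assumptions:
struct model_ref, model_put(), replay_race() and MODEL_BIAS are hypothetical
names invented here, MODEL_BIAS is a stand-in for PERCPU_COUNT_BIAS rather
than its real value, and the interleaving is hard-coded instead of being
produced by real scheduling.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the kernel's PERCPU_COUNT_BIAS; the real value differs. */
#define MODEL_BIAS (1L << 30)

/* Toy model of a percpu_ref that is still in atomic mode (hypothetical). */
struct model_ref {
	long count;     /* the "atomic counter" from the scenario */
	bool dead;      /* models the DEAD mark set by percpu_ref_kill() */
	bool released;  /* set only if the counter ever reaches zero */
};

/* Drop one reference; release fires only when the count hits zero. */
static void model_put(struct model_ref *ref)
{
	assert(ref->count > 0);
	if (--ref->count == 0)
		ref->released = true;
}

/* Replay steps 1-7 in the interleaved order from the commit message. */
static struct model_ref replay_race(void)
{
	struct model_ref ref = { .count = 1 }; /* step 1: atomic mode, count 1 */

	ref.count++;             /* step 2: tryget_live in queue enter -> 2 */
	/* step 3: the switch starts and sleeps in wait_event(): still 2 */
	ref.dead = true;         /* step 4: kill marks the ref DEAD ... */
	model_put(&ref);         /*         ... and drops the initial ref -> 1 */
	ref.count += MODEL_BIAS; /* step 5: the switch resumes, adds the bias */
	model_put(&ref);         /* step 6: request completes -> MODEL_BIAS */
	return ref;              /* step 7: never zero, so never released */
}
```

Calling replay_race() from a test harness or a main() leaves count equal to
MODEL_BIAS with released still false, which corresponds to the freezer in
step 4 waiting forever.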
It also looks like __percpu_ref_switch_to_percpu() needs to check whether
the refcount has become dead after the current switching, and something
like the following seems to be needed:

diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 6111bcb..2532949 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -235,6 +235,10 @@ static void __percpu_ref_switch_to_percpu(struct percpu_ref *ref)
 
 	wait_event(percpu_ref_switch_waitq, !ref->confirm_switch);
 
+	/* dying or dead ref can't be switched to percpu mode w/o reinit */
+	if (ref->percpu_count_ptr & __PERCPU_REF_DEAD)
+		return;
+
 	atomic_long_add(PERCPU_COUNT_BIAS, &ref->count);
 
 	/*

>
> Signed-off-by: Akinobu Mita
> Cc: Jens Axboe
> Cc: Ming Lei
> ---
>  block/blk-core.c       | 1 +
>  block/blk-mq-sysfs.c   | 2 ++
>  block/blk-mq.c         | 8 ++++++++
>  include/linux/blkdev.h | 6 ++++++
>  4 files changed, 17 insertions(+)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index bbf67cd..f3c5ae2 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -687,6 +687,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
>  	__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);
>
>  	init_waitqueue_head(&q->mq_freeze_wq);
> +	mutex_init(&q->mq_freeze_lock);
>  	mutex_init(&q->mq_sysfs_lock);
>
>  	if (blkcg_init_queue(q))
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index 79a3e8d..8448513 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -413,7 +413,9 @@ static void blk_mq_sysfs_init(struct request_queue *q)
>  /* see blk_register_queue() */
>  void blk_mq_finish_init(struct request_queue *q)
>  {
> +	mutex_lock(&q->mq_freeze_lock);
>  	percpu_ref_switch_to_percpu(&q->mq_usage_counter);
> +	mutex_unlock(&q->mq_freeze_lock);
>  }
>
>  int blk_mq_register_disk(struct gendisk *disk)
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index ad07373..f31de35 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -115,11 +115,15 @@ void blk_mq_freeze_queue_start(struct request_queue *q)
>  {
>  	int freeze_depth;
>
> +	mutex_lock(&q->mq_freeze_lock);
> +
>  	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
>  	if (freeze_depth == 1) {
>  		percpu_ref_kill(&q->mq_usage_counter);
>  		blk_mq_run_hw_queues(q, false);
>  	}
> +
> +	mutex_unlock(&q->mq_freeze_lock);
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start);
>
> @@ -143,12 +147,16 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
>  {
>  	int freeze_depth;
>
> +	mutex_lock(&q->mq_freeze_lock);
> +
>  	freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
>  	WARN_ON_ONCE(freeze_depth < 0);
>  	if (!freeze_depth) {
>  		percpu_ref_reinit(&q->mq_usage_counter);
>  		wake_up_all(&q->mq_freeze_wq);
>  	}
> +
> +	mutex_unlock(&q->mq_freeze_lock);
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index c56f5a6..0bf8bea 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -457,6 +457,12 @@ struct request_queue {
>  #endif
>  	struct rcu_head		rcu_head;
>  	wait_queue_head_t	mq_freeze_wq;
> +	/*
> +	 * Protect concurrent access to mq_usage_counter by
> +	 * percpu_ref_switch_to_percpu(), percpu_ref_kill(), and
> +	 * percpu_ref_reinit().
> +	 */
> +	struct mutex		mq_freeze_lock;
>  	struct percpu_ref	mq_usage_counter;
>  	struct list_head	all_q_node;
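The unfreeze window described at the top of the quoted commit message can
also be traced with a deterministic userspace model. This is a rough sketch,
not kernel code: struct model_queue and every model_* function are
hypothetical names made up for illustration, and the "serialized" ordering
merely stands in for what holding the per-queue mutex across the decrement
and the reinit would enforce.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a queue's freeze state; not the kernel structures. */
struct model_queue {
	int freeze_depth;  /* models q->mq_freeze_depth */
	bool killed;       /* true between a kill and the following reinit */
	int double_kills;  /* kills issued against an already-killed ref */
};

static void model_kill(struct model_queue *q)
{
	if (q->killed)
		q->double_kills++; /* the bug: ref was not re-initialized */
	q->killed = true;
}

static void model_freeze_start(struct model_queue *q)
{
	if (++q->freeze_depth == 1)
		model_kill(q);
}

/* Unfreeze is split in two so a freeze can be placed in between. */
static bool model_unfreeze_dec(struct model_queue *q)
{
	assert(q->freeze_depth > 0);
	return --q->freeze_depth == 0;
}

static void model_unfreeze_reinit(struct model_queue *q)
{
	q->killed = false;
}

/*
 * One unfreeze plus one competing freeze. With serialized == false the
 * freeze lands between the decrement and the reinit; with serialized ==
 * true it runs only after the unfreeze has completely finished.
 */
static int count_double_kills(bool serialized)
{
	struct model_queue q = { 0 };

	model_freeze_start(&q);             /* queue starts frozen once */
	bool last = model_unfreeze_dec(&q); /* depth 1 -> 0 */
	if (!serialized)
		model_freeze_start(&q);     /* in the window: second kill */
	if (last)
		model_unfreeze_reinit(&q);
	if (serialized)
		model_freeze_start(&q);     /* after reinit: a normal kill */
	return q.double_kills;
}
```

In this model count_double_kills(false) reports one kill on an already-killed
ref, while count_double_kills(true) reports none. Only one of the orderings a
mutex would allow is shown; the other (freeze entirely before unfreeze) just
raises the depth to two and performs no kill.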