Date: Thu, 25 Jun 2015 23:40:30 +0800
From: Ming Lei
To: Akinobu Mita
Cc: Linux Kernel Mailing List, Jens Axboe
Subject: Re: [PATCH 3/4] blk-mq: establish new mapping before cpu starts handling requests
Message-ID: <20150625234030.4dc99725@tom-T450>
References: <1434894751-6877-1-git-send-email-akinobu.mita@gmail.com> <1434894751-6877-4-git-send-email-akinobu.mita@gmail.com>

On Thu, 25 Jun 2015 21:49:43 +0900 Akinobu Mita wrote:

> 2015-06-25 17:07 GMT+09:00 Ming Lei :
> > On Thu, Jun 25, 2015 at 10:56 AM, Akinobu Mita wrote:
> >> 2015-06-25 1:24 GMT+09:00 Ming Lei :
> >>> On Wed, Jun 24, 2015 at 10:34 PM, Akinobu Mita wrote:
> >>>> Hi Ming,
> >>>>
> >>>> 2015-06-24 18:46 GMT+09:00 Ming Lei :
> >>>>> On Sun, Jun 21, 2015 at 9:52 PM, Akinobu Mita wrote:
> >>>>>> ctx->index_hw is zero for the CPUs which have never been onlined since
> >>>>>> the block queue was initialized.  If one of those CPUs is hotadded and
> >>>>>> starts handling request before new mappings are established, pending
> >>>>>
> >>>>> Could you explain a bit what the handling request is?  The fact is that
> >>>>> blk_mq_queue_reinit() is run after all queues are put into freezing.
> >>>>
> >>>> Notifier callbacks for the CPU_ONLINE action can be run on a CPU other
> >>>> than the CPU which was just onlined.
So it is possible for a
> >>>> process running on the just-onlined CPU to insert a request and run the
> >>>> hw queue before blk_mq_queue_reinit_notify() is actually called with
> >>>> action=CPU_ONLINE.
> >>>
> >>> You are right, because blk_mq_queue_reinit_notify() is always run after
> >>> the CPU becomes UP, so there is a tiny window in which the CPU is up
> >>> but the mapping is not updated yet.  Per current design, the CPU just
> >>> onlined is still mapped to hw queue 0 until the mapping is updated by
> >>> blk_mq_queue_reinit_notify().
> >>>
> >>> But I am wondering why it is a problem and why you think flush_busy_ctxs
> >>> can't find the requests on the software queue in this situation?
> >>
> >> The problem happens when the CPU has just been onlined for the first time
> >> since the request queue was initialized.  At this time ctx->index_hw
> >> for the CPU is still zero before blk_mq_queue_reinit_notify is called.
> >>
> >> The request can be inserted to ctx->rq_list, but blk_mq_hctx_mark_pending()
> >> marks busy for the wrong bit position, as ctx->index_hw is zero.
> >
> > It isn't the wrong bit, since the CPU just onlined is still mapped to
> > hctx 0 at that time.
>
> ctx->index_hw is not the CPU queue to hw queue mapping.
> ctx->index_hw is the index in hctx->ctxs[] for this ctx.
> Each ctx in a hw queue should have a unique ctx->index_hw.

You are right, sorry for my mistake.

> This problem is reproducible with a single hw queue.  (The script in the
> cover letter can reproduce this problem with a single hw queue.)
>
> >> flush_busy_ctxs() only retrieves the requests from software queues
> >> which are marked busy.  So the request just inserted is ignored, as
> >> the corresponding bit position is not busy.
> >
> > Before making the remap in blk_mq_queue_reinit() for the CPU topo change,
> > the request queue will be put into freezing first, and all requests
> > inserted to hctx 0 should be retrieved and scheduled out.  So can the
> > request be ignored by flush_busy_ctxs()?
>
> For example, there is a single hw queue (hctx) and two CPU queues
> (ctx0 for CPU0, and ctx1 for CPU1).  Now CPU1 is just onlined and
> a request is inserted into ctx1->rq_list, and bit0 is set in the pending
> bitmap, as ctx1->index_hw is still zero.
>
> And then while running the hw queue, flush_busy_ctxs() finds bit0 is set
> in the pending bitmap and tries to retrieve requests in
> hctx->ctxs[0].rq_list.  But hctx->ctxs[0] is ctx0, so the request in
> ctx1->rq_list is ignored.

Per current design, the request should have been inserted into ctx0 instead
of ctx1, because ctx1 isn't mapped yet even though ctx1->cpu has become
ONLINE.  So how about the following patch?  It looks much simpler.

---
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f537796..2f45b73 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1034,7 +1034,12 @@ void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
 	struct blk_mq_ctx *ctx = rq->mq_ctx, *current_ctx;
 
 	current_ctx = blk_mq_get_ctx(q);
-	if (!cpu_online(ctx->cpu))
+	/*
+	 * ctx->cpu may become ONLINE but ctx hasn't been mapped to
+	 * hctx yet because there is a tiny race window between
+	 * ctx->cpu ONLINE and doing the remap
+	 */
+	if (!blk_mq_ctx_mapped(ctx))
 		rq->mq_ctx = ctx = current_ctx;
 
 	hctx = q->mq_ops->map_queue(q, ctx->cpu);
@@ -1063,7 +1068,7 @@ static void blk_mq_insert_requests(struct request_queue *q,
 
 		current_ctx = blk_mq_get_ctx(q);
 
-		if (!cpu_online(ctx->cpu))
+		if (!blk_mq_ctx_mapped(ctx))
 			ctx = current_ctx;
 		hctx = q->mq_ops->map_queue(q, ctx->cpu);
 
@@ -1816,13 +1821,16 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 	 */
 	queue_for_each_ctx(q, ctx, i) {
 		/* If the cpu isn't online, the cpu is mapped to first hctx */
-		if (!cpu_online(i))
+		if (!cpu_online(i)) {
+			ctx->mapped = 0;
 			continue;
+		}
 
 		hctx = q->mq_ops->map_queue(q, i);
 		cpumask_set_cpu(i, hctx->cpumask);
 		cpumask_set_cpu(i, hctx->tags->cpumask);
 		ctx->index_hw = hctx->nr_ctx;
+		ctx->mapped = 1;
 		hctx->ctxs[hctx->nr_ctx++] = ctx;
 	}
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 6a48c4c..52819ad 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -10,7 +10,8 @@ struct blk_mq_ctx {
 	} ____cacheline_aligned_in_smp;
 
 	unsigned int		cpu;
-	unsigned int		index_hw;
+	unsigned int		index_hw : 16;
+	unsigned int		mapped : 1;
 
 	unsigned int		last_tag ____cacheline_aligned_in_smp;
 
@@ -123,4 +124,9 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
 	return hctx->nr_ctx && hctx->tags;
 }
 
+static inline bool blk_mq_ctx_mapped(struct blk_mq_ctx *ctx)
+{
+	return ctx->mapped;
+}
+
 #endif