Message-ID: <1390319905.20232.38.camel@bobble.lax.corp.google.com>
Subject: Crash in elevator_dispatch_fn() (e.g. deadline_dispatch()) when
 changing elevators.
From: Frank Mayhar <fmayhar@google.com>
To: Tejun Heo <tj@kernel.org>
Cc: linux-kernel <linux-kernel@vger.kernel.org>, Jens Axboe <axboe@kernel.dk>
Date: Tue, 21 Jan 2014 07:58:25 -0800
In-Reply-To: <20140118143145.GD3640@htj.dyndns.org>
References: <1389995976.20232.27.camel@bobble.lax.corp.google.com>
	 <20140118143145.GD3640@htj.dyndns.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
Mime-Version: 1.0
Sender: linux-kernel-owner@vger.kernel.org

On Sat, 2014-01-18 at 09:31 -0500, Tejun Heo wrote:
> Hello, Frank.
> 
> On Fri, Jan 17, 2014 at 01:59:36PM -0800, Frank Mayhar wrote:
> > After investigation, it's clear that the elevator switch code is trying
> > to quiesce the request queue and sets the bypass flag.  Unfortunately,
> > scsi_mode_sense() and friends call blk_execute_rq() directly, which pays
> > no attention to said flag.
> 
> Ouch.
> 
> > In fact, the bypass flag, as far as I can tell, is only checked in the
> > blk_cgroup.c, in blkg_lookup() and blkg_lookup_create().
> 
> Hmmm?  IIRC the main place is in __get_request().  If the queue is
> bypassing, the request doesn't get REQ_ELVPRIV which basically
> indicates that the request shouldn't go through elevator.  I don't
> think that part is broken even for scsi_execute_rq() - it uses
> blk_get_request() which should handle it properly.
> 
> > So my question is:  Is this a simple oversight?  Should blk_execute_rq()
> 
> Yeah, it's a bug.  It's not supposed to crash like that.

Clearly. :-)

> > care about the bypass flag?  Should it perhaps hold off the I/O until
> > the bypass flag is cleared?  Since at that level it has nothing to do
> > with cgroups I kinda don't think so but I'm still trying to get my head
> > around how all this stuff goes together.
> 
> The cgroup part isn't really relevant.  That's just because cgroup
> also uses bypassing to quiesce the queue when it's changing blkcg
> policies.  The thing relevant to the elveator is the check in
> __get_request().
> 
> Hmm.... the culprit is that we don't have any check in the dispatch
> path.  It doesn't really matter who's calling blk_peek_request(), that
> function is called as part of all request dispatches and should never
> crash.  I think the only reason we haven't noticed yet is because
> calling evelator dispatch_fn doesn't crash in most cases even if the
> elevator isn't in fully working condition.
> 
> Probably what we need is replacing blk_queue_dying() with
> blk_queue_bypass() test in __elv_next_request() before invoking the
> elevator dispatch function.

Replacing?  Or adding to?  Is BYPASS always set when DYING is set?  (My
guess is not but I haven't done an exhaustive analysis.)  So the
relevant code snippet in __elv_next_request() would be:
		if (unlikely(blk_queue_dying(q)) ||
		    unlikely(blk_queue_bypass(q)) ||
		    !q->elevator->type->ops.elevator_dispatch_fn(q, 0))
			return NULL;

> Can you please reply w/ Jens and lkml cc'd with the original
> description and my response?  No reason to keep this private.

Sure, doing so.  I just wanted to make sure my understanding was correct
before bringing others into it.

> Thanks a lot for the report!

No problem.  Thanks for the quick response.  Sorry for my
three-day-weekend-delayed reply. :-)


Jens, others, my original email follows:

> Hey, Tejun.  I have a question about stuff going on in the block layer
> that's not quite as straightforward as one might wish.  We have a
> situation in which we need to use something other than the CFQ I/O
> scheduler so after we create a block device (iSCSI, as it happens) we
> switch it to use the deadline scheduler.  Unfortunately we've run into a
> crash when we do this in which a request completion (in some cases) or a
> sync I/O (in others) calls blk_peek_request() which in turn calls
> __elv_next_request() which in its turn calls
> q->elevator->type->ops.elevator_dispatch_fn(), which happens at the
> point of the call be deadline_dispatch_requests(), which promptly falls
> over because we're in the middle of the elevator switch and
> q->elevator->elevator_data is NULL.
> 
> After investigation, it's clear that the elevator switch code is trying
> to quiesce the request queue and sets the bypass flag.  Unfortunately,
> scsi_mode_sense() and friends call blk_execute_rq() directly, which pays
> no attention to said flag.
> 
> In fact, the bypass flag, as far as I can tell, is only checked in the
> blk_cgroup.c, in blkg_lookup() and blkg_lookup_create().
> 
> So my question is:  Is this a simple oversight?  Should blk_execute_rq()
> care about the bypass flag?  Should it perhaps hold off the I/O until
> the bypass flag is cleared?  Since at that level it has nothing to do
> with cgroups I kinda don't think so but I'm still trying to get my head
> around how all this stuff goes together.
> 
> Any hints would be _greatly_ appreciated.  Thanks!
-- 
Frank Mayhar
310-460-4042

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/