Subject: Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
From: Jens Axboe
To: Jan Kara
Cc: Paolo Valente, Christoph Hellwig, Arnd Bergmann, Bart Van Assche, Tejun Heo, linux-block@vger.kernel.org, Linux-Kernal, Ulf Hansson, Linus Walleij, Mark Brown, Hannes Reinecke, grant.likely@secretlab.ca, James.Bottomley@hansenpartnership.com
Date: Fri, 28 Oct 2016 08:10:06 -0600

On 10/28/2016 01:59 AM, Jan Kara wrote:
> On Thu 27-10-16 10:26:18, Jens Axboe wrote:
>> On 10/27/2016 03:26 AM, Jan Kara wrote:
>>> On Wed 26-10-16 10:12:38, Jens Axboe wrote:
>>>> On 10/26/2016 10:04 AM, Paolo Valente wrote:
>>>>>
>>>>>> On 26 Oct 2016, at 17:32, Jens Axboe wrote:
>>>>>>
>>>>>> On 10/26/2016 09:29 AM, Christoph Hellwig wrote:
>>>>>>> On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:
>>>>>>>> The question to ask first is whether to actually have pluggable
>>>>>>>> schedulers on blk-mq at all, or just have one that is meant to
>>>>>>>> do the right thing in every case (and possibly can be bypassed
>>>>>>>> completely).
>>>>>>>
>>>>>>> That would be my preference. Have a BFQ variant for blk-mq as an
>>>>>>> option (default to off unless opted in by the driver or user),
>>>>>>> and no other scheduler for blk-mq. Don't bother with BFQ for
>>>>>>> non-blk-mq. It's not like there is any advantage to the legacy
>>>>>>> request path, even for slow devices, except for the option of
>>>>>>> having I/O scheduling.
>>>>>>
>>>>>> It's the only right way forward. blk-mq might not offer any
>>>>>> substantial advantages for rotating storage, but with scheduling,
>>>>>> it won't have a downside either. And it'll take us towards the
>>>>>> real goal, which is to have just one IO path.
>>>>>
>>>>> ok
>>>>>
>>>>>> Adding a new scheduler for the legacy IO path makes no sense.
>>>>>
>>>>> I would fully agree if effective and stable I/O scheduling were
>>>>> available in blk-mq within one or two months. But I guess that it
>>>>> will optimistically take at least a year, given the current status
>>>>> of the needed infrastructure, and given the great difficulty of
>>>>> doing effective scheduling at the high parallelism and extreme
>>>>> target speeds of blk-mq. Of course, this holds true unless only
>>>>> minimal scheduling is performed.
>>>>>
>>>>> So, what's the point in forcing a lot of users to wait another
>>>>> year or more for a solution that has yet to even be defined, when
>>>>> they could enjoy a much better system now, and then switch to an
>>>>> even better one once scheduling is ready in blk-mq too?
>>>>
>>>> That same argument could have been made two years ago. Saying no to
>>>> a new scheduler for the legacy framework goes back roughly that
>>>> long. We could have had BFQ for mq NOW, if we didn't keep coming
>>>> back to this very point.
>>>>
>>>> I'm hesitant to add a new scheduler because it's very easy to add,
>>>> very difficult to get rid of. If we do add BFQ as a legacy
>>>> scheduler now, it'll take us years and years to get rid of it
>>>> again. We should be moving towards FEWER moving parts in the legacy
>>>> path, not more.
>>>>
>>>> We can keep having this discussion every few years, but I think
>>>> we'd both prefer to make some actual progress here. It's perfectly
>>>> fine to add an interface for a single-queue IO scheduler for
>>>> blk-mq, since we don't care too much about scalability there. And
>>>> that won't take years; it should be a few weeks. Retrofitting BFQ
>>>> on top of that should not be hard either. It can co-exist with a
>>>> real multiqueue scheduler as well, something that's geared towards
>>>> some fairness for faster devices.
>>>
>>> OK, so some solution like having a variant of blk_sq_make_request()
>>> that will consume bios, make IO scheduling decisions on them, and
>>> feed them into the HW queue as it sees fit would be acceptable?
>>> That would provide the IO scheduler the global view it needs for
>>> complex scheduling decisions, so it should indeed be relatively
>>> easy to port BFQ to work like that.
>>
>> I'd probably start off with Omar's base [1], which switches the
>> software queues to store bios instead of requests, since that lifts
>> the 1:1 mapping between what we can queue up and what we can
>> dispatch. Without that, the IO scheduler won't have much to work
>> with. And with that in place, it'll be a "bio in, request out" type
>> of setup, which is similar to what we have in the legacy path.
>>
>> I'd keep the software queues, but as a starting point, mandate one
>> hardware queue to keep that as the per-device view of the state. The
>> IO scheduler would be responsible for moving one or more bios from
>> the software queues to the hardware queue, when they are ready to
>> dispatch.
>>
>> [1] https://github.com/osandov/linux/commit/8ef3508628b6cf7c4712cd3d8084ee11ef5d2530
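(Very roughly, that kind of "bio in, request out" setup could look
like the sketch below. All of the sched_* names and fields are
hypothetical placeholders for illustration, not existing kernel API.)

struct sched_queue {		/* hypothetical per-cpu software queue */
	spinlock_t	lock;
	struct bio_list	bios;
};

/*
 * "Bio in": park the bio in a per-cpu software queue. This side is
 * cheap and mostly uncontended.
 */
static blk_qc_t sched_sq_make_request(struct request_queue *q,
				      struct bio *bio)
{
	struct sched_queue *sq = this_cpu_ptr(q->sched_sw_queues);

	spin_lock(&sq->lock);
	bio_list_add(&sq->bios, bio);
	spin_unlock(&sq->lock);

	/*
	 * "Request out": the IO scheduler decides when to pull bios
	 * out of the software queues, turn them into requests, and
	 * hand them to the single hardware queue for dispatch.
	 */
	sched_kick_dispatch(q);
	return BLK_QC_T_NONE;
}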
> Yeah, but what would software queues actually be good for on a
> single-queue device with device-global IO scheduling? An IO scheduler
> making complex decisions will keep all the bios / requests in a
> single structure anyway, so there's no scalability to gain from
> per-cpu software queues. So you can directly consume bios in your
> ->make_request handler, place them in IO scheduler structures, and
> then push requests out to the HW queue in response to HW tags getting
> freed (i.e. on IO completion). There's no need for intermediate
> software queues. But maybe I'm missing something.

The software queues tend to take the pressure off lock contention on
the submission side. That's one of the reasons why single-queue blk-mq
still scales a lot better than the old request_fn model. If you bypass
the software queues and grab bios directly at make_request time, I'd
be worried that we lose the various support functionality we have for
block devices, or have to implement it differently.
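(A rough illustration of the contention difference, with the same
hypothetical sched_* placeholders as above, none of it existing API:
each submitter only touches its own per-cpu lock, and the device-wide
lock is taken only when the scheduler drains the software queues for
dispatch. A single make_request-time structure would instead serialize
every submitter on one lock, which is exactly what hurt the old
request_fn model.)

static void sched_drain_sw_queues(struct request_queue *q)
{
	int cpu;

	spin_lock_irq(q->queue_lock);	/* device-wide, dispatch side only */
	for_each_possible_cpu(cpu) {
		struct sched_queue *sq = per_cpu_ptr(q->sched_sw_queues, cpu);
		struct bio *bio;

		spin_lock(&sq->lock);	/* per-cpu, held briefly */
		while ((bio = bio_list_pop(&sq->bios)))
			sched_insert_bio(q, bio); /* into scheduler structures */
		spin_unlock(&sq->lock);
	}
	spin_unlock_irq(q->queue_lock);
}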
-- 
Jens Axboe