Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler
From: Paolo Valente
Date: Sat, 16 Apr 2016 00:08:44 +0200
To: Tejun Heo
Cc: Jens Axboe, Fabio Checconi, Arianna Avanzini, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, Ulf Hansson, Linus Walleij, Mark Brown

On 15 Apr 2016, at 21:29, Tejun Heo wrote:

> Hello, Paolo.
>
> On Fri, Apr 15, 2016 at 06:17:55PM +0200, Paolo Valente wrote:
>>> I don't think that is true with time based scheduling.  If you
>>> allocate 50% of time, it'll get close to 50% of IO time which
>>> translates to bandwidth which is lower than 50% but still in the
>>> ballpark.
>>
>> But this is the same minimal service guarantee that you get with BFQ
>> in any case.
>> I'm sorry for being so confusing and failing to make this central
>> point clear :(
>
> lol sorry about being dumb.

Dumb? My problem is your remaining patience ...

>>> That is very different from "we can't guarantee anything if
>>> the other workloads are highly variable".
>>
>> If you have 50% of the time, but
>> - you don't know anything about your workload properties, and
>> - the device speed can vary by two orders of magnitude,
>> then you can't provide any bandwidth guarantee, with any scheduler.  Of
>> course I'm neglecting the minimal, trivial guarantee of "getting a
>> fraction of the minimum possible speed of the device".
>
> Oh, the guarantee is about "getting close to half of the available IO
> resource"; what that translates to depends on the underlying hardware
> and the workload of course.

Yes.

>> If you have 50% of the time allocated for a quasi-sequential workload,
>> then bandwidth and latencies may vary by an uncontrollable 30 or 40%,
>> depending on what you and the other groups do.
>
> Yes, maybe, but it won't dive to 5% depending on what others are
> doing.

Exactly.

>> With the same device, if you have 50% of the bandwidth allocated with
>> BFQ for a quasi-sequential workload, then you can provide bandwidth
>> and latencies that may vary at most by a (still uncontrollable) 3 or
>> 4%, depending on what you and the other groups do.
>>
>> This improvement is shown, e.g., in my--admittedly boring--numerical
>> example, and is confirmed by my experimental results so far.
>
> I don't think the above is true.  Are you saying that the following
> two cases would lead to the same outcome for cgroup A?
>
>              cgroup A (50)   cgroup B (50)
>     case 1   sequential      sequential
>     case 2   sequential      random (to a certain degree)
>
> The aggregate bandwidths for cases 1 and 2 would be wildly different
> depending on the randomness of the second workload.  What cgroup A
> would be able to get would fluctuate accordingly, no?

Your example is definitely to the point.
The answer to your question is no. In fact, in both cases cgroup A will get exactly the same service slots, and in each slot exactly the same number of sectors transferred. In particular, cgroup B will systematically hit the timeout in the second case. In other words, in case 2 cgroup A is guaranteed the same bandwidth that it would get, in case 1, if cgroup B were quasi-sequential but so slow that it got served for a full time slice every time it gained access to the resource.

Maybe the source of confusion is the fact that a simple sector-based, proportional-share scheduler always distributes total bandwidth according to weights. The catch is the additional BFQ rule: random workloads get only time isolation, and are charged for full budgets, so as not to affect the schedule of quasi-sequential workloads.

So, the correct claim for BFQ is that it distributes total bandwidth according to weights (only) when all competing workloads are quasi-sequential. If some workloads are random, then these workloads are just time-scheduled. This does break proportional-share bandwidth distribution with mixed workloads but, much more importantly, preserves both the total throughput and the individual bandwidths of the quasi-sequential workloads.

We could then check whether I succeeded in tuning timeouts and budgets so as to achieve the best trade-offs. But this is probably a second-order problem as of now.

>>> So, I get that for a lot of workloads, especially interactive ones, IO
>>> patterns are quasi-sequential and bw based scheduling is beneficial
>>> and we don't care that much about fairness in general; however, it's
>>> problematic that it would make the behavior of proportional control
>>> quite surprising.
>>
>> If I have somehow convinced you with what I wrote above, then I hope
>> we might agree that a surprising behavior of BFQ with cgroups would be
>> just a matter of bugs.
>
> I think I might still need more help.  What am I missing?

I hope that what I wrote above did help.
Thanks,
Paolo

> Thanks.
>
> --
> tejun