2016-03-01 18:47:02

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler

Hello, Paolo.

Sorry about the delay.

On Sat, Feb 20, 2016 at 11:23:43AM +0100, Paolo Valente wrote:
> Before replying to your points, I want to stress that I'm not a
> champion of budget-based scheduling at all costs. Budget-based
> scheduling just seems to provide tight bandwidth and latency
> guarantees that are practically impossible to get with time-based
> scheduling. I will try to explain this fact better, and to provide
> also a numerical example, in my replies to your points.

I do like the budget-based scheduling. It just feels that the budget
is based on the wrong unit.

...
> I think I got your point. In any case, a queue is not punished *after*
> it has consumed an undue amount of the resource, because a queue just
> cannot get to consume an undue amount of the resource. There is a
> timeout: a queue cannot be served for more than a pre-defined maximum
> time slice.
>
> Even if a queue expires before that timeout, BFQ checks anyway, on the
> expiration of the queue, whether the queue is so slow to not deserve
> accurate service-based guarantees. This is done to achieve additional
> robustness. In fact, if service-based guarantees were provided to a
> very slow queue that, for some reason, never causes the timeout to
> expire, then the queue would happen to be served too often, i.e., to
> get the undue amount of IO resource you mention.

I see. Once a queue starts timing out its slice, it gets switched to
time based scheduling; however, as you mentioned, workloads which
generate moderate random IOs would still get preferential treatment
over workloads which generate sequential IOs, by definition.

...
> Your metaphor is clear, but it does not seem to match what I expect
> from my storage device. As a user of my PC, I’m concerned, e.g., about
> how long it takes to copy a large file. I’m not concerned about what
> percentage of time will be guaranteed to my file copy if other
> processes happen to do I/O in parallel. As a similar example, a good

The underlying device is fundamentally incapable of giving guarantees
like that. The only way to get a (quasi) bandwidth guarantee from a
disk device is either ensuring that the IO is almost completely
sequential or there's enough buffer in capacity for the expected
seekiness of the IO pattern.

For use cases where the differences in seekiness across workloads are
accidental - e.g. all are trying to stream different files but some
files are more fragmented by accident - using bandwidth as the
resource unit would be helpful in mitigating the random gaps that the
user shouldn't be bothered by, but that'd be focusing on a pretty
narrow set of use cases.

Workloads are varied and underlying device performs wildly differently
depending on the specific IO pattern. Because rotating disks suck so
badly, it's true that there's a lot of wiggle room in what the IO
scheduler can do. People are accustomed to dealing with random
behaviors. That said, it still doesn't feel comfortable to use the
obviously wrong unit as the fundamental basis of resource
distribution.

> file-hosting service must probably guarantee reasonable read/write,
> i.e., download/upload, speeds to users (of course, I/O speed matters
> only if the bottleneck is not the network). Most likely, a user of
> such a service does not care (directly) about how much resource-time
> is guaranteed to the I/O related to his/her downloads/uploads.
>
> With a budget-based service scheme, you can easily provide these
> service guarantees accurately. In particular, you can achieve this
> goal even if the bandwidth fluctuates, provided that fluctuations
> depend mostly on I/O patterns. That is, it must hold true that the
> device achieves about the same bandwidth with the same I/O
> pattern. This is exactly the case with both rotational and

So, yes, I see that bandwidth based control would yield a better
result for this specific use case but at the same time this is a very
specialized use case and probably the *only* use case where bandwidth
based distribution makes sense - equivalent logically sequential
workloads where the specifics of IO pattern are completely accidental.
We can't really design the whole thing around that single use case.

> non-rotational devices. With a time-based scheme, it is impossible to
> provide these service guarantees, if bandwidth fluctuates and
> requirements are minimally stringent. I will try to show it with a
> simple numerical example. Before this numerical example, I would like
> to report a last practical example.

I agree that bandwidth based distribution would behave better for this
use case but think the above paragraph is going a bit too far.
Bandwidth based distribution can stick to the line better but that
just means that time based scheduling would need a bit more buffer in
achieving the same level of behavioral consistency. It's not like
bandwidth can actually be guarnateed no matter what we do.

...
> To achieve, with a time-based scheduler, the same precise and stable
> bandwidth distribution as with a budget-based scheduler, the only
> solution is to change weights dynamically, as a function of how the
> throughput achieved by B varies in its time slots. Such a solution
> would be definitely complex, if ever feasible and stable.

Isn't that a use case specifically carved for bandwidth based
distribution? Imagine how this would work when e.g. there are mostly
sequential IO workloads and fluctuating random workloads. Addition of
another sequential workload would behave as expected but addition of
random workloads would cripple everyone to the same level if the
random workloads play their cards right.

One side of the coin is "if I have two parallel file copies, they
proceed at the same speed regardless of how they're distributed across
disk" and the other is "but if I start an application which
intermittently issues random IOs, my copies take 5x longer". Isn't
"the two parallel copies mostly keep the same pace but may deviate a
bit but addition of random IOs doesn't collapse the whole thing" a
better proposition?

Hmmm... it could be that I'm mistaken on how trigger happy the switch
to time based scheduling is. Maybe it's sensitive enough that
bandwidth based scheduling is only applied to workloads which are
mostly sequential. I'm sorry if I'm being too dense on this point but
can you please give me some examples on what would happen when
sequential workloads and random ones mix?

Thanks.

--
tejun


2016-03-04 17:29:42

by Linus Walleij

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler

Hi Tejun,

I'm doing a summary of this discussion as a part of presenting
Linaro's involvement in Paolo's work. So I try to understand things.

What I'm happy to see is a convergence in the discussion on
timeslicing vs budgeting+time slicing, AFAICT there is consensus
that whatever tool does the job best should be employed, despite
some remaining disagreement on the scheduling unit (time or
bandwidth).

It also seems that BFQ delivers. It remains to be seen if it always
delivers.

The big question is this:

On Wed, Mar 2, 2016 at 1:46 AM, Tejun Heo <[email protected]> wrote:

> For use cases where the differences in seekiness across workloads are
> accidental - e.g. all are trying to stream different files but some
> files are more fragmented by accident - using bandwidth as the
> resource unit would be helpful in mitigating the random gaps that the
> user shouldn't be bothered by, but that'd be focusing on a pretty
> narrow set of use cases.
>
> Workloads are varied and underlying device performs wildly differently
> depending on the specific IO pattern.

What strikes me from the overall discussion is the reference to use
cases, which ones are interesting for the I/O scheduler developers
and which are interesting for users?

I guess the current assumption is that CFQ address the needs of
the average user. The special deadline scheduler is for a special
class of users, not average. The noop scheduler is for testing.

So Paolo was asked to develop BFQ as a replacement plug-in for
CFQ on the assumption that it would better meet the needs of
the average user. He also already prepared a version that is a
selectable class like the deadline, but that was discouraged.

The notion you bring up, that BFQ may only address a narrow
set of use cases would merit it to be merged as a scheduler class
for special use cases alongside the deadline scheduler. (The only
objection I could see would be that it's a big piece of code to
maintain.)

But if it indeed meets the expectations of the "average user" the
current approach would be correct.

To me it seems it's very hard to proceed to conclude whether to
replace CFQ with BFQ or make it a special-case class without
first agreeing on what the average user use case(s) really is.
Does anyone really know? All references in this communication
seems rather handwavy.

What I know is that Paolo & his colleagues made a test suite for
BFQ that they are using, and I replicated their results with that
suite:
https://github.com/Algodev-github/S

As you can see it uses FIO to generate some (specified)
background noise and then tests some use cases such as:
- File copy
- Kernel development (git, compiles)
- Video playback
- Video streaming
- Startup of applications

Apart from the obvious that kernel devs may naively take themselves
as reference: git+compiles, that must be the most important
use case, and yes I noted my git checkouts are smackier with
BFQ but that's a subjective measure. This probably means the
average command line user issuing stuff to the left and right will
be happy with BFQ, especially since it has this boost for newly
forked processes.

There are geeksy usecases: for example video playback
on a set of random readers+writers trivially simulates someone
playing back 1080p video while downloading another one using
bittorrent in the background. That's geeky, but is it relevant for the
average user?

A NAS user for example (these boxes all run Linux more or less):
sequential reading and writing of large files is what they mostly do.
If the user downloads torrents onto the NAS, it's the former geeky
usecase.

But what is the usecase for a Google server, a NYSE transaction
system or an Oracle database? Can the people running these loads
tell us how to synthesize their typical workload using FIO and play
back something that looks like what they are doing? Is there a place
with these workload recipes that I'm obviously missing because of
being new to the block layer?

What would be helpful would be if people tested their favourite
workloads before and after this patch set and reported results.

My colleagues tried some tests on Android. It appears Android
starts up applications quicker (with cold cache) when there is
workload of dd-type sequential readers/writers in the background.
This corresponds to a smartphone user starting Candy Crush while
updating an application in the background, a usecase where Android
is known to be somewhat unresponsive. There are millions of
Android users, so this matters to them, we can add these seconds
up to years of people's lives across the userbase. (This claim may
need to be verified, but AFAICT it is genuine.)

So: can we figure out a set of FIO-based use cases that will
correspond to "average user"? That would help us determine
whether BFQ is addressing that or something fringe, so it needs
to be a special-case scheduler.

Yours,
Linus Walleij

2016-03-04 17:39:50

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler

On Sat, Mar 05, 2016 at 12:29:39AM +0700, Linus Walleij wrote:
> Hi Tejun,
>
> I'm doing a summary of this discussion as a part of presenting
> Linaro's involvement in Paolo's work. So I try to understand things.

Btw, can someone explain why you guys waste so much time hacking and
arguing about a legacy codebase (old request code and I/O schedulers)
that everyone would really like to see disappear. Why don't you
spend your time on blk-mq where you have an entirely clean slate
for scheduling?

2016-03-04 18:11:24

by Austin S Hemmelgarn

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler

On 2016-03-04 12:39, Christoph Hellwig wrote:
> On Sat, Mar 05, 2016 at 12:29:39AM +0700, Linus Walleij wrote:
>> Hi Tejun,
>>
>> I'm doing a summary of this discussion as a part of presenting
>> Linaro's involvement in Paolo's work. So I try to understand things.
>
> Btw, can someone explain why you guys waste so much time hacking and
> arguing about a legacy codebase (old request code and I/O schedulers)
> that everyone would really like to see disappear. Why don't you
> spend your time on blk-mq where you have an entirely clean slate
> for scheduling?
>
1. This all started long before blk-mq hit mainline.
2. There's still a decent amount of block drivers that don't support
blk-mq. Last time I looked (around the time 4.4 came out), I saw the
following that either obviously don't support it, or are ambiguous as to
whether they support it or not. Here's a list of just the ones I know
are being used on existing systems running relatively recent kernel
versions, not including any of the MTD stuff:
* fd
* MD
* bcache
* mmcblk
* nbd
* dasd
* drbd
* rbd
* aoe
* xvd (I know there were patches for this floating around, but I
never saw if they got merged or not)

2016-03-05 12:18:39

by Linus Walleij

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler

On Sat, Mar 5, 2016 at 12:39 AM, Christoph Hellwig <[email protected]> wrote:

> Btw, can someone explain why you guys waste so much time hacking and
> arguing about a legacy codebase (old request code and I/O schedulers)
> that everyone would really like to see disappear. Why don't you
> spend your time on blk-mq where you have an entirely clean slate
> for scheduling?

Depends on what time horizon and target I'd say. Paolo was in contact with
the MMC/SD subsystem maintainer Ulf Hansson. (e)MMC/SD are both
synchronous command-response-based protocols, and as of today
single-channel. So everone's smartphone and tablet etc are today
single-channel. I don't know if there is even a protocol change coming
to augment this, the only duct-tapeish solution I've heard about is
command queueing which is basically a kind of pipelining of requests.

So: large userbase in all things handheld.

Yours,
Linus Walleij

2016-03-09 06:34:32

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler


Il giorno 01/mar/2016, alle ore 19:46, Tejun Heo <[email protected]> ha scritto:

> Hello, Paolo.
>
> Sorry about the delay.
>
> On Sat, Feb 20, 2016 at 11:23:43AM +0100, Paolo Valente wrote:
>> Before replying to your points, I want to stress that I'm not a
>> champion of budget-based scheduling at all costs. Budget-based
>> scheduling just seems to provide tight bandwidth and latency
>> guarantees that are practically impossible to get with time-based
>> scheduling. I will try to explain this fact better, and to provide
>> also a numerical example, in my replies to your points.
>
> I do like the budget-based scheduling. It just feels that the budget
> is based on the wrong unit.
>

This is probably the focal point of our discussion. Unfortunately, I
am still not convinced of your claim. In fact, basing budgets on
sectors (service), instead of time, still seems to me the only way to
provide the stronger bandwidth and low-latency guarantees that I have
tried to highlight in my previous email. And these guarantees do not
seem to concern only a single special case, but several, common use
cases for server and desktop systems. I will try to repeat these facts
more concisely, and hopefully more clearly, in my replies to next
points.

> ...
>> I think I got your point. In any case, a queue is not punished *after*
>> it has consumed an undue amount of the resource, because a queue just
>> cannot get to consume an undue amount of the resource. There is a
>> timeout: a queue cannot be served for more than a pre-defined maximum
>> time slice.
>>
>> Even if a queue expires before that timeout, BFQ checks anyway, on the
>> expiration of the queue, whether the queue is so slow to not deserve
>> accurate service-based guarantees. This is done to achieve additional
>> robustness. In fact, if service-based guarantees were provided to a
>> very slow queue that, for some reason, never causes the timeout to
>> expire, then the queue would happen to be served too often, i.e., to
>> get the undue amount of IO resource you mention.
>
> I see. Once a queue starts timing out its slice, it gets switched to
> time based scheduling; however, as you mentioned, workloads which
> generate moderate random IOs would still get preferential treatment
> over workloads which generate sequential IOs, by definition.
>

Exactly. However, there is still something I don?t fully understand in
your doubt. With BFQ, workloads that generate moderate random IOs
would actually do less damage to throughput, on average, than with
CFQ. In fact, with CFQ the queues containing these IOs would
systematically get a full time slice, while there are two
possibilities with BFQ:
1) If the degree of randomness of (the IOs in) these queues is not too
high, then these queues are likely to finish budgets before
timeouts. In this case, with BFQ these queues get less service than
with CFQ, and thus can waste throughput less.
2) If the degree of randomness of these queues is very high, then they
consume full time slices with BFQ, exactly as with CFQ.

Of course, performance may differ if time slices, i.e., timeouts,
differ between BFQ and CFQ, but this is easy to tune, if needed.

> ...
>> Your metaphor is clear, but it does not seem to match what I expect
>> from my storage device. As a user of my PC, I?m concerned, e.g., about
>> how long it takes to copy a large file. I?m not concerned about what
>> percentage of time will be guaranteed to my file copy if other
>> processes happen to do I/O in parallel. As a similar example, a good
>
> The underlying device is fundamentally incapable of giving guarantees
> like that. The only way to get a (quasi) bandwidth guarantee from a
> disk device is either ensuring that the IO is almost completely
> sequential or there's enough buffer in capacity for the expected
> seekiness of the IO pattern.
>
> For use cases where the differences in seekiness across workloads are
> accidental - e.g. all are trying to stream different files but some
> files are more fragmented by accident - using bandwidth as the
> resource unit would be helpful in mitigating the random gaps that the
> user shouldn't be bothered by, but that'd be focusing on a pretty
> narrow set of use cases.
>
> Workloads are varied and underlying device performs wildly differently
> depending on the specific IO pattern. Because rotating disks suck so
> badly, it's true that there's a lot of wiggle room in what the IO
> scheduler can do. People are accustomed to dealing with random
> behaviors. That said, it still doesn't feel comfortable to use the
> obviously wrong unit as the fundamental basis of resource
> distribution.
>

Actually this does not seem to match our (admittedly limited)
experience with: low-to-high-end rotational devices, RAIDS, SSDs, SD
cards and eMMCs. When stimulated with the same patterns in out tests,
these devices always responded with about the same IO service
times. And this seems to comply with intuition, because, apart from
different initial cache states, the same stimuli cause about the same
arm movements, cache hits/misses, and circuitry operations.

Of course, things differ with write and re-write operations on
flash-based devices, especially if remapping is performed. But
write-intensive or only-write IOs do not seem to be the best kind of
IO with these devices, so we did not focus on them much.


>> file-hosting service must probably guarantee reasonable read/write,
>> i.e., download/upload, speeds to users (of course, I/O speed matters
>> only if the bottleneck is not the network). Most likely, a user of
>> such a service does not care (directly) about how much resource-time
>> is guaranteed to the I/O related to his/her downloads/uploads.
>>
>> With a budget-based service scheme, you can easily provide these
>> service guarantees accurately. In particular, you can achieve this
>> goal even if the bandwidth fluctuates, provided that fluctuations
>> depend mostly on I/O patterns. That is, it must hold true that the
>> device achieves about the same bandwidth with the same I/O
>> pattern. This is exactly the case with both rotational and
>
> So, yes, I see that bandwidth based control would yield a better
> result for this specific use case but at the same time this is a very
> specialized use case and probably the *only* use case where bandwidth
> based distribution makes sense - equivalent logically sequential
> workloads where the specifics of IO pattern are completely accidental.
> We can't really design the whole thing around that single use case.
>

Actually, the tight bandwidth guarantees that I have tried to
highlight are the ground on which the other low-latency guarantees are
built. So, to sum up the set of guarantees that Linus discussed in
more detail in his email, BFQ mainly guarantees, even in the presence
of throughput fluctuations, and thanks also to sector-based
scheduling:
1) Stable(r) and tight bandwidth distribution for mostly-sequential
reads/writes
2) Stable(r) and high responsiveness
3) Stable(r) and low latency for soft real-time applications
4) Faster execution of dev tasks, such as compile and git operations
(checkout, merge, ?), in the presence of background workloads, and
while guaranteeing a high responsiveness too


>> non-rotational devices. With a time-based scheme, it is impossible to
>> provide these service guarantees, if bandwidth fluctuates and
>> requirements are minimally stringent. I will try to show it with a
>> simple numerical example. Before this numerical example, I would like
>> to report a last practical example.
>
> I agree that bandwidth based distribution would behave better for this
> use case but think the above paragraph is going a bit too far.
> Bandwidth based distribution can stick to the line better but that
> just means that time based scheduling would need a bit more buffer in
> achieving the same level of behavioral consistency. It's not like
> bandwidth can actually be guarnateed no matter what we do.
>

I hope that my reply to the above point responds somehow also to this
point.

> ...
>> To achieve, with a time-based scheduler, the same precise and stable
>> bandwidth distribution as with a budget-based scheduler, the only
>> solution is to change weights dynamically, as a function of how the
>> throughput achieved by B varies in its time slots. Such a solution
>> would be definitely complex, if ever feasible and stable.
>
> Isn't that a use case specifically carved for bandwidth based
> distribution? Imagine how this would work when e.g. there are mostly
> sequential IO workloads and fluctuating random workloads. Addition of
> another sequential workload would behave as expected but addition of
> random workloads would cripple everyone to the same level if the
> random workloads play their cards right.
>
> One side of the coin is "if I have two parallel file copies, they
> proceed at the same speed regardless of how they're distributed across
> disk" and the other is "but if I start an application which
> intermittently issues random IOs, my copies take 5x longer". Isn't
> "the two parallel copies mostly keep the same pace but may deviate a
> bit but addition of random IOs doesn't collapse the whole thing" a
> better proposition?
>
> Hmmm... it could be that I'm mistaken on how trigger happy the switch
> to time based scheduling is. Maybe it's sensitive enough that
> bandwidth based scheduling is only applied to workloads which are
> mostly sequential. I'm sorry if I'm being too dense on this point but
> can you please give me some examples on what would happen when
> sequential workloads and random ones mix?
>

In the simplest case,
. sequential workloads would get sector-based service guarantees, with
the resulting stability and low-latency properties that I have tried
to highlight;
. random workloads would get time-based service, and thus similar
service guarantees as with CFQ (actually guarantees would still be
tighter, because of the more accurate scheduling policy of BFQ).

Things differ a bit for the queues that the low-latency heuristics of
BFQ ?like'. In fact, BFQ forgives randomness a little bit more for
these queues, so as to improve responsiveness or low latency for soft
real-time applications. Of course, this may cause a throughput drop
while an interactive or a soft real-time application is being
privileged. If one only aims at maximum throughput, and does not care
about responsiveness or low latency, he/she can just set the
low_latency parameter to 0.

Thanks,
Paolo

> Thanks.
>
> --
> tejun

2016-03-09 06:55:45

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler


Il giorno 04/mar/2016, alle ore 18:39, Christoph Hellwig <[email protected]> ha scritto:

> On Sat, Mar 05, 2016 at 12:29:39AM +0700, Linus Walleij wrote:
>> Hi Tejun,
>>
>> I'm doing a summary of this discussion as a part of presenting
>> Linaro's involvement in Paolo's work. So I try to understand things.
>
> Btw, can someone explain why you guys waste so much time hacking and
> arguing about a legacy codebase (old request code and I/O schedulers)
> that everyone would really like to see disappear. Why don't you
> spend your time on blk-mq where you have an entirely clean slate
> for scheduling?

I do agree that it would very important to deal with blk-mq. And much more difficult. IMHO, a clean way to proceed is to first try to improve bandwidth and latency guarantees in the simplest, single-queue case. Then to face the multi-queue case, leveraging the lessons learned in the single-queue case.

Thanks,
Paolo

2016-03-11 11:16:47

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler

On Fri, Mar 04, 2016 at 01:10:31PM -0500, Austin S. Hemmelgarn wrote:
> 1. This all started long before blk-mq hit mainline.

Whoe cares? :)

> 2. There's still a decent amount of block drivers that don't support blk-mq.
> Last time I looked (around the time 4.4 came out), I saw the following that
> either obviously don't support it, or are ambiguous as to whether they
> support it or not. Here's a list of just the ones I know are being used on
> existing systems running relatively recent kernel versions, not including

There is no ambiguouity. You clearly named a few ones that aren't
converted, but also a lot of make_request_fn based drivers which don't
support any I/O scheduler.

But that whole point is that anything actively developed should move
over.

2016-03-11 11:17:26

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler

On Sat, Mar 05, 2016 at 07:18:30PM +0700, Linus Walleij wrote:
> Depends on what time horizon and target I'd say. Paolo was in contact with
> the MMC/SD subsystem maintainer Ulf Hansson. (e)MMC/SD are both
> synchronous command-response-based protocols, and as of today
> single-channel. So everone's smartphone and tablet etc are today
> single-channel. I don't know if there is even a protocol change coming
> to augment this, the only duct-tapeish solution I've heard about is
> command queueing which is basically a kind of pipelining of requests.

We use blkj-mq for single queue devices as well, in fact most consumers
are single queue. MMC is a prime candidate that hould move over soon.

2016-03-11 11:24:36

by Angel Shtilianov

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler



On 03/11/2016 01:17 PM, Christoph Hellwig wrote:
> On Sat, Mar 05, 2016 at 07:18:30PM +0700, Linus Walleij wrote:
>> Depends on what time horizon and target I'd say. Paolo was in contact with
>> the MMC/SD subsystem maintainer Ulf Hansson. (e)MMC/SD are both
>> synchronous command-response-based protocols, and as of today
>> single-channel. So everone's smartphone and tablet etc are today
>> single-channel. I don't know if there is even a protocol change coming
>> to augment this, the only duct-tapeish solution I've heard about is
>> command queueing which is basically a kind of pipelining of requests.
>
> We use blkj-mq for single queue devices as well, in fact most consumers
> are single queue. MMC is a prime candidate that hould move over soon.

Out of curiosity - would it make sense to have something like BFQ/CFQ
for single-queue blk-mq users?

2016-03-11 11:49:40

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler

On Fri, Mar 11, 2016 at 01:24:24PM +0200, Nikolay Borisov wrote:
> Out of curiosity - would it make sense to have something like BFQ/CFQ
> for single-queue blk-mq users?

The mess that CFQ is certainly not. But we want to eventually be able
to get rid of the legacy request code, so we need suitable I/O
scheduling. See the threads at: https://lkml.org/lkml/2015/11/19/227 or
http://thread.gmane.org/gmane.linux.scsi/111257/focus=111320 for more
details. A leightweight bfq-like scheduler might come in handy on top
of Andreas' time slicing work.

2016-03-11 13:40:36

by Austin S Hemmelgarn

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler

On 2016-03-11 06:16, Christoph Hellwig wrote:
> On Fri, Mar 04, 2016 at 01:10:31PM -0500, Austin S. Hemmelgarn wrote:
>> 1. This all started long before blk-mq hit mainline.
>
> Whoe cares? :)
It wasn't intended as an argument for this continuing to be developed
outside of blk-mq, but as an explanation of why the code exists as it
does right now.
>> 2. There's still a decent amount of block drivers that don't support blk-mq.
>> Last time I looked (around the time 4.4 came out), I saw the following that
>> either obviously don't support it, or are ambiguous as to whether they
>> support it or not. Here's a list of just the ones I know are being used on
>> existing systems running relatively recent kernel versions, not including
>
> There is no ambiguouity. You clearly named a few ones that aren't
> converted, but also a lot of make_request_fn based drivers which don't
> support any I/O scheduler.
For someone reading the code there might not be any ambiguity, but from
the perspective of someone who isn't reading the code (or even someone
who has limited background in kernel programming), it very much is
ambiguous, especially what does and doesn't support I/O scheduling in
general.
> But that whole point is that anything actively developed should move
> over.
Agreed.

2016-03-11 14:53:58

by Linus Walleij

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler

On Fri, Mar 11, 2016 at 6:17 PM, Christoph Hellwig <[email protected]> wrote:

> We use blkj-mq for single queue devices as well, in fact most consumers
> are single queue. MMC is a prime candidate that hould move over soon.

OK we should do that with some before-and-after tests and see what
happens.

Yours,
Linus Walleij

2016-04-13 19:54:31

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler

Hello, Christoph.

On Fri, Mar 04, 2016 at 09:39:47AM -0800, Christoph Hellwig wrote:
> On Sat, Mar 05, 2016 at 12:29:39AM +0700, Linus Walleij wrote:
> > I'm doing a summary of this discussion as a part of presenting
> > Linaro's involvement in Paolo's work. So I try to understand things.
>
> Btw, can someone explain why you guys waste so much time hacking and
> arguing about a legacy codebase (old request code and I/O schedulers)
> that everyone would really like to see disappear. Why don't you
> spend your time on blk-mq where you have an entirely clean slate
> for scheduling?

idk, are we gonna duplicate a full disk scheduler on blk-mq path? I
think it'd be more sensible to make blk-mq call into the old elevator
path for scheduling IOs on rotating rusts.

Thanks.

--
tejun

2016-04-13 20:41:15

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler

Hello,

Sorry about the long delay.

On Wed, Mar 09, 2016 at 07:34:15AM +0100, Paolo Valente wrote:
> This is probably the focal point of our discussion. Unfortunately, I
> am still not convinced of your claim. In fact, basing budgets on
> sectors (service), instead of time, still seems to me the only way to
> provide the stronger bandwidth and low-latency guarantees that I have
> tried to highlight in my previous email. And these guarantees do not
> seem to concern only a single special case, but several, common use
> cases for server and desktop systems. I will try to repeat these facts
> more concisely, and hopefully more clearly, in my replies to next
> points.

I'm still trying to get my head wrapped around why basing the
scheduling on bandwidth would have those benefits because the
connection isn't intuitive to me at all. If you're saying that most
workloads care about bandwidth a lot more and the specifics of their
IO patterns are mostly accidental and should be discounted for
scheduling decisions, I can understand how that could be. Is that
what you're saying?

> > I see. Once a queue starts timing out its slice, it gets switched to
> > time based scheduling; however, as you mentioned, workloads which
> > generate moderate random IOs would still get preferential treatment
> > over workloads which generate sequential IOs, by definition.
>
> Exactly. However, there is still something I don’t fully understand in
> your doubt. With BFQ, workloads that generate moderate random IOs
> would actually do less damage to throughput, on average, than with
> CFQ. In fact, with CFQ the queues containing these IOs would
> systematically get a full time slice, while there are two
> possibilities with BFQ:
> 1) If the degree of randomness of (the IOs in) these queues is not too
> high, then these queues are likely to finish budgets before
> timeouts. In this case, with BFQ these queues get less service than
> with CFQ, and thus can waste throughput less.
> 2) If the degree of randomness of these queues is very high, then they
> consume full time slices with BFQ, exactly as with CFQ.
>
> Of course, performance may differ if time slices, i.e., timeouts,
> differ between BFQ and CFQ, but this is easy to tune, if needed.

Hmm.. the above doesn't really make sense to me, so you're saying that
bandwidth based control only cuts down the slice a random workload
would use and thus wouldn't benefit them; however, that cut-down of
slice is based on bandwidth consumption, so it would kick in a lot
more for sequential workloads. It wouldn't make a random workload's
slice longer than the timeout but it would make sequantial ones'
slices shorter. What am I missing?

> > Workloads are varied and underlying device performs wildly differently
> > depending on the specific IO pattern. Because rotating disks suck so
> > badly, it's true that there's a lot of wiggle room in what the IO
> > scheduler can do. People are accustomed to dealing with random
> > behaviors. That said, it still doesn't feel comfortable to use the
> > obviously wrong unit as the fundamental basis of resource
> > distribution.
>
> Actually this does not seem to match our (admittedly limited)
> experience with: low-to-high-end rotational devices, RAIDS, SSDs, SD
> cards and eMMCs. When stimulated with the same patterns in out tests,
> these devices always responded with about the same IO service
> times. And this seems to comply with intuition, because, apart from
> different initial cache states, the same stimuli cause about the same
> arm movements, cache hits/misses, and circuitry operations.

Oh, that's not what I meant. If you feed the same sequence of IOs,
they would behave similarly. What I meant was that the composition of
IOs themselves would change significantly depneding on how different
types of workloads get scheduled.

> > So, yes, I see that bandwidth based control would yield a better
> > result for this specific use case but at the same time this is a very
> > specialized use case and probably the *only* use case where bandwidth
> > based distribution makes sense - equivalent logically sequential
> > workloads where the specifics of IO pattern are completely accidental.
> > We can't really design the whole thing around that single use case.
>
> Actually, the tight bandwidth guarantees that I have tried to
> highlight are the ground on which the other low-latency guarantees are
> built. So, to sum up the set of guarantees that Linus discussed in
> more detail in his email, BFQ mainly guarantees, even in the presence
> of throughput fluctuations, and thanks also to sector-based
> scheduling:
> 1) Stable(r) and tight bandwidth distribution for mostly-sequential
> reads/writes

So, yeah, the above makes toal sense.

> 2) Stable(r) and high responsiveness
> 3) Stable(r) and low latency for soft real-time applications
> 4) Faster execution of dev tasks, such as compile and git operations
> (checkout, merge, …), in the presence of background workloads, and
> while guaranteeing a high responsiveness too

But can you please enlighten me on why 2-4 are inherently tied to
bandwidth-based scheduling?

> > Hmmm... it could be that I'm mistaken on how trigger happy the switch
> > to time based scheduling is. Maybe it's sensitive enough that
> > bandwidth based scheduling is only applied to workloads which are
> > mostly sequential. I'm sorry if I'm being too dense on this point but
> > can you please give me some examples on what would happen when
> > sequential workloads and random ones mix?
> >
>
> In the simplest case,
> . sequential workloads would get sector-based service guarantees, with
> the resulting stability and low-latency properties that I have tried
> to highlight;
> . random workloads would get time-based service, and thus similar
> service guarantees as with CFQ (actually guarantees would still be
> tighter, because of the more accurate scheduling policy of BFQ).

But don't the above inherently mean that sequential workloads would
get less in terms of IO time?

To summarize,

1. I still don't understand why bandwidth-based scheduling is better
(sorry). The only reason I can think of is that most workloads
that we care about are at least quasi-sequential and can benefit
from ignoring randomness to a certain degree. Is that it?

2. I don't think strict fairness matters is all that important for IO
scheduling in general. Whatever gives us the best overall result
should work, so if bandwidth based scheduling does that great;
however, fairness does matter across cgroups. A cgroup configured
to receive 50% of IO resources should get close to that no matter
what others are doing, would bfq be able to do that?

Thanks.

--
tejun

2016-04-14 05:03:36

by Mark Brown

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler

On Wed, Apr 13, 2016 at 03:54:22PM -0400, Tejun Heo wrote:
> On Fri, Mar 04, 2016 at 09:39:47AM -0800, Christoph Hellwig wrote:

> > Btw, can someone explain why you guys waste so much time hacking and
> > arguing about a legacy codebase (old request code and I/O schedulers)
> > that everyone would really like to see disappear. Why don't you
> > spend your time on blk-mq where you have an entirely clean slate
> > for scheduling?

> idk, are we gonna duplicate a full disk scheduler on blk-mq path? I
> think it'd be more sensible to make blk-mq call into the old elevator
> path for scheduling IOs on rotating rusts.

It's not just rust, it's also lower end solid state devices.


Attachments:
(No filename) (685.00 B)
signature.asc (473.00 B)
Download all attachments

2016-04-14 10:23:24

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler


Il giorno 13/apr/2016, alle ore 22:41, Tejun Heo <[email protected]> ha scritto:

> Hello,
>
> Sorry about the long delay.
>
> On Wed, Mar 09, 2016 at 07:34:15AM +0100, Paolo Valente wrote:
>> This is probably the focal point of our discussion. Unfortunately, I
>> am still not convinced of your claim. In fact, basing budgets on
>> sectors (service), instead of time, still seems to me the only way to
>> provide the stronger bandwidth and low-latency guarantees that I have
>> tried to highlight in my previous email. And these guarantees do not
>> seem to concern only a single special case, but several, common use
>> cases for server and desktop systems. I will try to repeat these facts
>> more concisely, and hopefully more clearly, in my replies to next
>> points.
>
> I'm still trying to get my head wrapped around why basing the
> scheduling on bandwidth would have those benefits because the
> connection isn't intuitive to me at all. If you're saying that most
> workloads care about bandwidth a lot more and the specifics of their
> IO patterns are mostly accidental and should be discounted for
> scheduling decisions, I can understand how that could be. Is that
> what you're saying?
>

Yes. I will try to elaborate more in my replies to your points below.

>>> I see. Once a queue starts timing out its slice, it gets switched to
>>> time based scheduling; however, as you mentioned, workloads which
>>> generate moderate random IOs would still get preferential treatment
>>> over workloads which generate sequential IOs, by definition.
>>
>> Exactly. However, there is still something I don?t fully understand in
>> your doubt. With BFQ, workloads that generate moderate random IOs
>> would actually do less damage to throughput, on average, than with
>> CFQ. In fact, with CFQ the queues containing these IOs would
>> systematically get a full time slice, while there are two
>> possibilities with BFQ:
>> 1) If the degree of randomness of (the IOs in) these queues is not too
>> high, then these queues are likely to finish budgets before
>> timeouts. In this case, with BFQ these queues get less service than
>> with CFQ, and thus can waste throughput less.
>> 2) If the degree of randomness of these queues is very high, then they
>> consume full time slices with BFQ, exactly as with CFQ.
>>
>> Of course, performance may differ if time slices, i.e., timeouts,
>> differ between BFQ and CFQ, but this is easy to tune, if needed.
>
> Hmm.. the above doesn't really make sense to me, so you're saying that
> bandwidth based control only cuts down the slice a random workload
> would use and thus wouldn't benefit them; however, that cut-down of
> slice is based on bandwidth consumption, so it would kick in a lot
> more for sequential workloads. It wouldn't make a random workload's
> slice longer than the timeout but it would make sequantial ones'
> slices shorter. What am I missing?
>

Not all sequential or quasi-sequential workloads achieve the same
throughput. According to the tests I have run so far, there is a
variation of at most a ~30%, depending on the device. The
"sector-based plus timeout" service scheme of BFQ adapts elastically
to this variation: parameters are configured in such a way that fast
sequential workloads finish their budget at most a ~30% of the time
before their timeout, while slow sequential workloads tend to finish
closer to the timeout (the slower they are, the closer they finish).
More precisely, the timeout is a hard limit to the maximum slowness
tolerated by BFQ for a sequential workload.

Random workloads systematically hit the timeout.

As shown in my previous numerical example, and confirmed so far by my
tests, this service scheme guarantees a high throughput as well as an
easy and accurate bandwidth distribution.

>>> Workloads are varied and underlying device performs wildly differently
>>> depending on the specific IO pattern. Because rotating disks suck so
>>> badly, it's true that there's a lot of wiggle room in what the IO
>>> scheduler can do. People are accustomed to dealing with random
>>> behaviors. That said, it still doesn't feel comfortable to use the
>>> obviously wrong unit as the fundamental basis of resource
>>> distribution.
>>
>> Actually this does not seem to match our (admittedly limited)
>> experience with: low-to-high-end rotational devices, RAIDS, SSDs, SD
>> cards and eMMCs. When stimulated with the same patterns in out tests,
>> these devices always responded with about the same IO service
>> times. And this seems to comply with intuition, because, apart from
>> different initial cache states, the same stimuli cause about the same
>> arm movements, cache hits/misses, and circuitry operations.
>
> Oh, that's not what I meant. If you feed the same sequence of IOs,
> they would behave similarly. What I meant was that the composition of
> IOs themselves would change significantly depneding on how different
> types of workloads get scheduled.
>

Definitely. To better sync with you, let me change my point as
follows: for all use cases characterized by a given I/O locality,
sector-based scheduling makes it easy to distribute bandwidth as
desired, and yields accurate and stable results.

As for I/O locality, IMHO many, or probably most use cases are
characterized by a rather stable locality:
- file-hosting services have mostly-sequential, greedy file
reads/writes (download/uploads)
- audio and video streaming services have periodic, mostly-sequential
reads (streaming), combined with greedy mostly-sequential writes
(updates)
- information-dumping services have mostly-sequential, often greedy,
writes
- database services have mostly-random workloads, in most cases
- single users, on their PCs, generate time-varying workloads that
depend on their habits and on the tasks at hand, but while any of
these tasks is being executed, the I/O has typically a well-defined
pattern

>>> So, yes, I see that bandwidth based control would yield a better
>>> result for this specific use case but at the same time this is a very
>>> specialized use case and probably the *only* use case where bandwidth
>>> based distribution makes sense - equivalent logically sequential
>>> workloads where the specifics of IO pattern are completely accidental.
>>> We can't really design the whole thing around that single use case.
>>
>> Actually, the tight bandwidth guarantees that I have tried to
>> highlight are the ground on which the other low-latency guarantees are
>> built. So, to sum up the set of guarantees that Linus discussed in
>> more detail in his email, BFQ mainly guarantees, even in the presence
>> of throughput fluctuations, and thanks also to sector-based
>> scheduling:
>> 1) Stable(r) and tight bandwidth distribution for mostly-sequential
>> reads/writes
>
> So, yeah, the above makes toal sense.
>
>> 2) Stable(r) and high responsiveness
>> 3) Stable(r) and low latency for soft real-time applications
>> 4) Faster execution of dev tasks, such as compile and git operations
>> (checkout, merge, ?), in the presence of background workloads, and
>> while guaranteeing a high responsiveness too
>
> But can you please enlighten me on why 2-4 are inherently tied to
> bandwidth-based scheduling?

Goals 2-4 are obtained by granting a higher share of the throughput
to the applications to privilege. The more stably and accurately the
underlying scheduling engine is able to enforce the desired bandwidth
distribution, the more stably and accurately higher shares can be
guaranteed. Then 2-4 follows from 1, i.e., from that BFQ guarantees
a stabler and tight(er) bandwidth distribution.

>
>>> Hmmm... it could be that I'm mistaken on how trigger happy the switch
>>> to time based scheduling is. Maybe it's sensitive enough that
>>> bandwidth based scheduling is only applied to workloads which are
>>> mostly sequential. I'm sorry if I'm being too dense on this point but
>>> can you please give me some examples on what would happen when
>>> sequential workloads and random ones mix?
>>>
>>
>> In the simplest case,
>> . sequential workloads would get sector-based service guarantees, with
>> the resulting stability and low-latency properties that I have tried
>> to highlight;
>> . random workloads would get time-based service, and thus similar
>> service guarantees as with CFQ (actually guarantees would still be
>> tighter, because of the more accurate scheduling policy of BFQ).
>
> But don't the above inherently mean that sequential workloads would
> get less in terms of IO time?
>

Yes: if a sequential workload is so fast to finish its budgets before
the timeout, then it gets less IO time. The purpose of a sector-based
scheduler is not to distribute IO time, but to distribute bandwidth.
In particular, BFQ tries to achieve the latter goal by giving to each
process as little time as possible (which further improves latency).


> To summarize,
>
> 1. I still don't understand why bandwidth-based scheduling is better
> (sorry). The only reason I can think of is that most workloads
> that we care about are at least quasi-sequential and can benefit
> from ignoring randomness to a certain degree. Is that it?
>

If I have understood correctly, you refer to that maximum ~30%
throughput loss that a quasi-sequential workload can incur (because of
some randomness or of other unlucky accidents). If so, then I think
you fully got the point.

> 2. I don't think strict fairness matters is all that important for IO
> scheduling in general. Whatever gives us the best overall result
> should work, so if bandwidth based scheduling does that great;
> however, fairness does matter across cgroups. A cgroup configured
> to receive 50% of IO resources should get close to that no matter
> what others are doing, would bfq be able to do that?
>

BFQ guarantees 50% of the bandwidth of the resource, not 50% of the
time. In this respect, with 50% of the time instead of 50% of the
bandwidth, a group suffers from the bandwidth fluctuation, higher
latency and throughput loss problems that I have tried to highlight.
Plus, it is not possible to easily answer to questions like, e.g.: "how
long would it take to copy this file"?.

In any case, it is of course possible to get time distribution also
with BFQ, by 'just' letting it work in the time domain. However,
changing BFQ to operate in the time domain, or, probably much better,
extending BFQ to operate correctly in both domains, would be a lot of
work. I don't know whether it would be worth the effort and the
extra complexity.

Thanks,
Paolo

> Thanks.
>
> --
> tejun

2016-04-14 16:29:58

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler

Hello, Paolo.

On Thu, Apr 14, 2016 at 12:23:14PM +0200, Paolo Valente wrote:
...
> >> 1) Stable(r) and tight bandwidth distribution for mostly-sequential
> >> reads/writes
> >
> > So, yeah, the above makes toal sense.
> >
> >> 2) Stable(r) and high responsiveness
> >> 3) Stable(r) and low latency for soft real-time applications
> >> 4) Faster execution of dev tasks, such as compile and git operations
> >> (checkout, merge, …), in the presence of background workloads, and
> >> while guaranteeing a high responsiveness too
> >
> > But can you please enlighten me on why 2-4 are inherently tied to
> > bandwidth-based scheduling?
>
> Goals 2-4 are obtained by granting a higher share of the throughput
> to the applications to privilege. The more stably and accurately the
> underlying scheduling engine is able to enforce the desired bandwidth
> distribution, the more stably and accurately higher shares can be
> guaranteed. Then 2-4 follows from 1, i.e., from that BFQ guarantees
> a stabler and tight(er) bandwidth distribution.

4) makes sense as a lot of that workload would be at least
quasi-sequential but I can't tell why 2) and 3) would depend on
bandwidth based scheduling. They're about recognizing workloads which
can benefit from low latency and treating them accordingly. Why would
making the underlying scheduling time based change that?

> > To summarize,
> >
> > 1. I still don't understand why bandwidth-based scheduling is better
> > (sorry). The only reason I can think of is that most workloads
> > that we care about are at least quasi-sequential and can benefit
> > from ignoring randomness to a certain degree. Is that it?
> >
>
> If I have understood correctly, you refer to that maximum ~30%
> throughput loss that a quasi-sequential workload can incur (because of
> some randomness or of other unlucky accidents). If so, then I think
> you fully got the point.

Alright, I see.

> > 2. I don't think strict fairness matters is all that important for IO
> > scheduling in general. Whatever gives us the best overall result
> > should work, so if bandwidth based scheduling does that great;
> > however, fairness does matter across cgroups. A cgroup configured
> > to receive 50% of IO resources should get close to that no matter
> > what others are doing, would bfq be able to do that?
>
> BFQ guarantees 50% of the bandwidth of the resource, not 50% of the
> time. In this respect, with 50% of the time instead of 50% of the

So, across cgroups, I don't think we can pretend that bandwidth is the
resource. There should be a reasonable level of isolation. Bandwidth
for a rotating disk is a byproduct which can fluctuate widely. "You
have 50% of the total disk bandwidth" doesn't mean anything if that
bandwidth can easily fluctuate a hundred fold.

> bandwidth, a group suffers from the bandwidth fluctuation, higher
> latency and throughput loss problems that I have tried to highlight.
> Plus, it is not possible to easily answer to questions like, e.g.: "how
> long would it take to copy this file"?.

It's actually a lot more difficult to answer that with bandwidth
scheduling. Let's say cgroup A has 50% of disk time. Sure, there are
inaccuracies, but it should be able to get close to the ballpark -
let's be lax and say between 30% and 45% of raw sequential bandwidth.
It isn't ideal but now imagine bandwidth based scheduling. Depending
on what the others are doing, it may get 5% or even lower of the raw
sequential bandwidth. It isn't isolating anything.

> In any case, it is of course possible to get time distribution also
> with BFQ, by 'just' letting it work in the time domain. However,
> changing BFQ to operate in the time domain, or, probably much better,
> extending BFQ to operate correctly in both domains, would be a lot of
> work. I don't know whether it would be worth the effort and the
> extra complexity.

As I wrote before, as fairness isn't that important for normal
scheduling, if empirical data show that bandwidth based scheduling is
beneficial for most common workloads, that's awesome especially given
that CFQ has plenty of issues. I don't think cgroup case is workable
as currently implemented tho.

Thanks.

--
tejun

2016-04-15 14:20:58

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler


Il giorno 14/apr/2016, alle ore 18:29, Tejun Heo <[email protected]> ha scritto:

> Hello, Paolo.
>
> On Thu, Apr 14, 2016 at 12:23:14PM +0200, Paolo Valente wrote:
> ...
>>>> 1) Stable(r) and tight bandwidth distribution for mostly-sequential
>>>> reads/writes
>>>
>>> So, yeah, the above makes toal sense.
>>>
>>>> 2) Stable(r) and high responsiveness
>>>> 3) Stable(r) and low latency for soft real-time applications
>>>> 4) Faster execution of dev tasks, such as compile and git operations
>>>> (checkout, merge, ?), in the presence of background workloads, and
>>>> while guaranteeing a high responsiveness too
>>>
>>> But can you please enlighten me on why 2-4 are inherently tied to
>>> bandwidth-based scheduling?
>>
>> Goals 2-4 are obtained by granting a higher share of the throughput
>> to the applications to privilege. The more stably and accurately the
>> underlying scheduling engine is able to enforce the desired bandwidth
>> distribution, the more stably and accurately higher shares can be
>> guaranteed. Then 2-4 follows from 1, i.e., from that BFQ guarantees
>> a stabler and tight(er) bandwidth distribution.
>
> 4) makes sense as a lot of that workload would be at least
> quasi-sequential but I can't tell why 2) and 3) would depend on
> bandwidth based scheduling. They're about recognizing workloads which
> can benefit from low latency and treating them accordingly. Why would
> making the underlying scheduling time based change that?
>

Because, in BFQ, "treating them accordingly" means raising their
weights to let them receive a higher share of the throughput (other
rigid solutions, such as priority scheduling, would easily lead to
starvation problems). With time-based scheduling, the share of the
throughput guaranteed for a given value of the weight is less stable
than with sector-based scheduling. Then, to provide the same latency
guarantees as with sector-based scheduling, weights have to be raised
more. This would throttle unprivileged processes more. And it would
not however improve stability.

With time-based scheduling, latency guarantees among two privileged
applications may vary more too, even if both applications perform
quasi-sequential I/O. In fact, throughput shares would vary more
depending, e.g., on where the sectors requested by the applications are
located.

>>> To summarize,
>>>
>>> 1. I still don't understand why bandwidth-based scheduling is better
>>> (sorry). The only reason I can think of is that most workloads
>>> that we care about are at least quasi-sequential and can benefit
>>> from ignoring randomness to a certain degree. Is that it?
>>>
>>
>> If I have understood correctly, you refer to that maximum ~30%
>> throughput loss that a quasi-sequential workload can incur (because of
>> some randomness or of other unlucky accidents). If so, then I think
>> you fully got the point.
>
> Alright, I see.
>
>>> 2. I don't think strict fairness matters is all that important for IO
>>> scheduling in general. Whatever gives us the best overall result
>>> should work, so if bandwidth based scheduling does that great;
>>> however, fairness does matter across cgroups. A cgroup configured
>>> to receive 50% of IO resources should get close to that no matter
>>> what others are doing, would bfq be able to do that?
>>
>> BFQ guarantees 50% of the bandwidth of the resource, not 50% of the
>> time. In this respect, with 50% of the time instead of 50% of the
>
> So, across cgroups, I don't think we can pretend that bandwidth is the
> resource. There should be a reasonable level of isolation. Bandwidth
> for a rotating disk is a byproduct which can fluctuate widely. "You
> have 50% of the total disk bandwidth" doesn't mean anything if that
> bandwidth can easily fluctuate a hundred fold.
>

I agree that, if a system serves a workload whose characteristics
change significantly every 100ms, or even more frequently, and in an
unpredictable way, then both time-based and sector-based scheduling
provide exactly the same level of bandwidth guarantees. That is,
almost no bandwidth guarantee.

But AFAIK many systems, services and applications do not behave in such
a way. On the contrary, their IO patterns are rather stable.

So, if we choose time-based scheduling in view of the fact that for an
unpredictable system there would be no benefits with sector-based
scheduling, then we just throw away all the benefits that we would
have with sector-based scheduling on more stable systems.

>> bandwidth, a group suffers from the bandwidth fluctuation, higher
>> latency and throughput loss problems that I have tried to highlight.
>> Plus, it is not possible to easily answer to questions like, e.g.: "how
>> long would it take to copy this file"?.
>
> It's actually a lot more difficult to answer that with bandwidth
> scheduling. Let's say cgroup A has 50% of disk time. Sure, there are
> inaccuracies, but it should be able to get close to the ballpark -
> let's be lax and say between 30% and 45% of raw sequential bandwidth.
> It isn't ideal but now imagine bandwidth based scheduling. Depending
> on what the others are doing, it may get 5% or even lower of the raw
> sequential bandwidth. It isn't isolating anything.
>

Definitely. Nevertheless my point is still about the same: we have to
consider one system at a time. If the workload of the system is highly
variable and completely unpredictable, then it is hard to provide any
bandwidth guarantee with any solution.

But if the workload has a minimum of stability, then sector scheduling
either wins or provides the same guarantees as time-based guarantees.
For example, a concrete instance of your low-bandwidth example may be
one where you have one quasi-sequential workload W, competing with
nine random workloads. In this case, if, e.g., all workloads have the
same weight, then BFQ would schedule the resource like a time-based
scheduler: one full budget (which lasts for about one time slice) for
workload W, followed by one time slice for each of the other
workloads. Then there would be no service-guarantee loss with respect
to time-based scheduling.

In contrast, in all the other examples I have mentioned so far
(file-hosting, streaming, video/audio playback against background
workloads, application start-up, ...) sector-based scheduling would be
clearly beneficial even in a hierarchical setting.

In the end, if we give up sector scheduling for cgroups, we can only
lose some benefits. Unless I'm still missing some even more important
problem (sorry about that).

>> In any case, it is of course possible to get time distribution also
>> with BFQ, by 'just' letting it work in the time domain. However,
>> changing BFQ to operate in the time domain, or, probably much better,
>> extending BFQ to operate correctly in both domains, would be a lot of
>> work. I don't know whether it would be worth the effort and the
>> extra complexity.
>
> As I wrote before, as fairness isn't that important for normal
> scheduling, if empirical data show that bandwidth based scheduling is
> beneficial for most common workloads, that's awesome especially given
> that CFQ has plenty of issues. I don't think cgroup case is workable
> as currently implemented tho.
>

I was thinking about some solution to achieve both goals. An option is
probably to let BFQ work in a double mode: sector-based within groups
and time-based among groups. However, I find it a little messy and
confusing.

Other ideas/solutions? I have no better proposal at the moment :(

Thanks,
Paolo


> Thanks.
>
> --
> tejun

2016-04-15 14:49:25

by Linus Walleij

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler

On Thu, Apr 14, 2016 at 6:29 PM, Tejun Heo <[email protected]> wrote:
> On Thu, Apr 14, 2016 at 12:23:14PM +0200, Paolo Valente wrote:
> ...
>> >> 3) Stable(r) and low latency for soft real-time applications
(...)
>> Goals 2-4 are obtained by granting a higher share of the throughput
>> to the applications to privilege.
(...)
> 4) makes sense as a lot of that workload would be at least
> quasi-sequential but I can't tell why 2) and 3) would depend on
> bandwidth based scheduling. They're about recognizing workloads which
> can benefit from low latency and treating them accordingly. Why would
> making the underlying scheduling time based change that?

For (3) Isn't that fairly intuitive?

Media people (think vlc, mplayer, gstreamer and whatnot) have
a framed media format, and need the following to process it
and guarantee an enjoyable media stream with audio/video:

- a certain number of bits/s from the storage device or network

- a certain number of MIPS or FLOPS or whatnot from the
CPU before a certain deadline (a determinate time slice of
computation time)

What they don't need is an allocated time slice on a block
device. So the BFQ is asking the right question
for that kind of applications: how many bits/s can I get,
and it delivers accordingly.

For the record I have successfully reproduced Paulo's reported
benefits from this test case:
https://github.com/Algodev-github/S/blob/master/video_playing_vs_commands/video_play_vs_comms.sh

Less frames *are* dropped!

And I have been booting BFQ kernels for my development laptop
for the last months simply because I like to have a music stream
running while working. With a stock kernel, my heavy git operations
(like git fetch && git reset --hard origin/master on linux-next)
makes the audio skip and hang for seconds. With the BFQ-patched
kernel it does *not* happen. This is subjective though, based
on intuition.

Yours,
Linus Walleij

2016-04-15 15:08:54

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler

Hello, Paolo.

On Fri, Apr 15, 2016 at 04:20:44PM +0200, Paolo Valente wrote:
> > It's actually a lot more difficult to answer that with bandwidth
> > scheduling. Let's say cgroup A has 50% of disk time. Sure, there are
> > inaccuracies, but it should be able to get close to the ballpark -
> > let's be lax and say between 30% and 45% of raw sequential bandwidth.
> > It isn't ideal but now imagine bandwidth based scheduling. Depending
> > on what the others are doing, it may get 5% or even lower of the raw
> > sequential bandwidth. It isn't isolating anything.
>
> Definitely. Nevertheless my point is still about the same: we have to
> consider one system at a time. If the workload of the system is highly
> variable and completely unpredictable, then it is hard to provide any
> bandwidth guarantee with any solution.

I don't think that is true with time based scheduling. If you
allocate 50% of time, it'll get close to 50% of IO time which
translates to bandwidth which is lower than 50% but still in the
ballpark. That is very different from "we can't guarantee anything if
the other workloads are highly variable".

So, I get that for a lot of workload, especially interactive ones, IO
patterns are quasi-sequential and bw based scheduling is beneficial
and we don't care that much about fairness in general; however, it's
problematic that it would make the behavior of proportional control
quite surprising.

> > As I wrote before, as fairness isn't that important for normal
> > scheduling, if empirical data show that bandwidth based scheduling is
> > beneficial for most common workloads, that's awesome especially given
> > that CFQ has plenty of issues. I don't think cgroup case is workable
> > as currently implemented tho.
>
> I was thinking about some solution to achieve both goals. An option is
> probably to let BFQ work in a double mode: sector-based within groups
> and time-based among groups. However, I find it a little messy and
> confusing.
>
> Other ideas/solutions? I have no better proposal at the moment :(

No idea. I don't think isolation could work without time based
scheduling at some level tho. :(

Thanks.

--
tejun

2016-04-15 16:18:05

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler


Il giorno 15/apr/2016, alle ore 17:08, Tejun Heo <[email protected]> ha scritto:

> Hello, Paolo.
>
> On Fri, Apr 15, 2016 at 04:20:44PM +0200, Paolo Valente wrote:
>>> It's actually a lot more difficult to answer that with bandwidth
>>> scheduling. Let's say cgroup A has 50% of disk time. Sure, there are
>>> inaccuracies, but it should be able to get close to the ballpark -
>>> let's be lax and say between 30% and 45% of raw sequential bandwidth.
>>> It isn't ideal but now imagine bandwidth based scheduling. Depending
>>> on what the others are doing, it may get 5% or even lower of the raw
>>> sequential bandwidth. It isn't isolating anything.
>>
>> Definitely. Nevertheless my point is still about the same: we have to
>> consider one system at a time. If the workload of the system is highly
>> variable and completely unpredictable, then it is hard to provide any
>> bandwidth guarantee with any solution.
>
> I don't think that is true with time based scheduling. If you
> allocate 50% of time, it'll get close to 50% of IO time which
> translates to bandwidth which is lower than 50% but still in the
> ballpark.

But this is the same minimal service guarantee that you get with BFQ
in any case. I'm sorry for being so confusing to not make this central
point clear :(

> That is very different from "we can't guarantee anything if
> the other workloads are highly variable?.
>


If you have 50% of the time, but
. you don?t know anything about your workload properties, and
. the device speed can vary by two orders of magnitude,
then you can't provide any bandwidth guarantee, with any scheduler. Of
course I'm neglecting the minimal, trivial guarantee "getting a fraction
of the minimum possible speed of the device".

If you have 50% of the time allocated for a quasi-sequential workload,
then bandwidth and latencies may vary by an uncontrollable 30 or 40%,
depending on what you and the other groups do.

With the same device, if you have 50% of the bandwidth allocated with
BFQ for a quasi-sequential workload, then you can provide bandwidth
and latencies that may vary at most by a (still uncontrollable) 3 or
4%, depending on what you and the other groups do.

This improvement is shown, e.g., in my--admittedly boring--numerical
example, and is confirmed by my experimental results so far.

> So, I get that for a lot of workload, especially interactive ones, IO
> patterns are quasi-sequential and bw based scheduling is beneficial
> and we don't care that much about fairness in general; however, it's
> problematic that it would make the behavior of proportional control
> quite surprising.

If I have somehow convinced you with what I wrote above, then I hope
we might agree that a surprising behavior of BFQ with cgroups would be
just a matter of bugs.

Thanks,
Paolo

>
>>> As I wrote before, as fairness isn't that important for normal
>>> scheduling, if empirical data show that bandwidth based scheduling is
>>> beneficial for most common workloads, that's awesome especially given
>>> that CFQ has plenty of issues. I don't think cgroup case is workable
>>> as currently implemented tho.
>>
>> I was thinking about some solution to achieve both goals. An option is
>> probably to let BFQ work in a double mode: sector-based within groups
>> and time-based among groups. However, I find it a little messy and
>> confusing.
>>
>> Other ideas/solutions? I have no better proposal at the moment :(
>
> No idea. I don't think isolation could work without time based
> scheduling at some level tho. :(
>
> Thanks.
>
> --
> tejun

2016-04-15 19:29:37

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler

Hello, Paolo.

On Fri, Apr 15, 2016 at 06:17:55PM +0200, Paolo Valente wrote:
> > I don't think that is true with time based scheduling. If you
> > allocate 50% of time, it'll get close to 50% of IO time which
> > translates to bandwidth which is lower than 50% but still in the
> > ballpark.
>
> But this is the same minimal service guarantee that you get with BFQ
> in any case. I'm sorry for being so confusing to not make this central
> point clear :(

lol sorry about being dumb.

> > That is very different from "we can't guarantee anything if
> > the other workloads are highly variable”.
>
> If you have 50% of the time, but
> . you don’t know anything about your workload properties, and
> . the device speed can vary by two orders of magnitude,
> then you can't provide any bandwidth guarantee, with any scheduler. Of
> course I'm neglecting the minimal, trivial guarantee "getting a fraction
> of the minimum possible speed of the device".

Oh, the guarantee is about "getting close to half of the available IO
resource", what that translates to depends on the underlying hardware
and the workload of course.

> If you have 50% of the time allocated for a quasi-sequential workload,
> then bandwidth and latencies may vary by an uncontrollable 30 or 40%,
> depending on what you and the other groups do.

Yes, may be but it won't dive to 5% depending on what others are
doing.

> With the same device, if you have 50% of the bandwidth allocated with
> BFQ for a quasi-sequential workload, then you can provide bandwidth
> and latencies that may vary at most by a (still uncontrollable) 3 or
> 4%, depending on what you and the other groups do.
>
> This improvement is shown, e.g., in my--admittedly boring--numerical
> example, and is confirmed by my experimental results so far.

I don't think the above is true. Are you saying that the following
two cases would lead to the same outcome for cgroup A?

cgroup A (50) cgroup B (50)
case 1 sequential sequential
case 2 sequential random (to a certain degree)

The aggregate bandwidths for case 1 and 2 would be wildly different
depending on the randomness of the second workload. What cgroup A
would be able to get would fluctuate accordingly, no?

> > So, I get that for a lot of workload, especially interactive ones, IO
> > patterns are quasi-sequential and bw based scheduling is beneficial
> > and we don't care that much about fairness in general; however, it's
> > problematic that it would make the behavior of proportional control
> > quite surprising.
>
> If I have somehow convinced you with what I wrote above, then I hope
> we might agree that a surprising behavior of BFQ with cgroups would be
> just a matter of bugs.

I think I might still need more help. What am I missing?

Thanks.

--
tejun

2016-04-15 22:08:53

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler


Il giorno 15/apr/2016, alle ore 21:29, Tejun Heo <[email protected]> ha scritto:

> Hello, Paolo.
>
> On Fri, Apr 15, 2016 at 06:17:55PM +0200, Paolo Valente wrote:
>>> I don't think that is true with time based scheduling. If you
>>> allocate 50% of time, it'll get close to 50% of IO time which
>>> translates to bandwidth which is lower than 50% but still in the
>>> ballpark.
>>
>> But this is the same minimal service guarantee that you get with BFQ
>> in any case. I'm sorry for being so confusing to not make this central
>> point clear :(
>
> lol sorry about being dumb.
>

dumb? my problem is your remaining patience ...

>>> That is very different from "we can't guarantee anything if
>>> the other workloads are highly variable?.
>>
>> If you have 50% of the time, but
>> . you don?t know anything about your workload properties, and
>> . the device speed can vary by two orders of magnitude,
>> then you can't provide any bandwidth guarantee, with any scheduler. Of
>> course I'm neglecting the minimal, trivial guarantee "getting a fraction
>> of the minimum possible speed of the device".
>
> Oh, the guarantee is about "getting close to half of the available IO
> resource", what that translates to depends on the underlying hardware
> and the workload of course.
>

yes

>> If you have 50% of the time allocated for a quasi-sequential workload,
>> then bandwidth and latencies may vary by an uncontrollable 30 or 40%,
>> depending on what you and the other groups do.
>
> Yes, may be but it won't dive to 5% depending on what others are
> doing.
>

exact

>> With the same device, if you have 50% of the bandwidth allocated with
>> BFQ for a quasi-sequential workload, then you can provide bandwidth
>> and latencies that may vary at most by a (still uncontrollable) 3 or
>> 4%, depending on what you and the other groups do.
>>
>> This improvement is shown, e.g., in my--admittedly boring--numerical
>> example, and is confirmed by my experimental results so far.
>
> I don't think the above is true. Are you saying that the following
> two cases would lead to the same outcome for cgroup A?
>
> cgroup A (50) cgroup B (50)
> case 1 sequential sequential
> case 2 sequential random (to a certain degree)
>
> The aggregate bandwidths for case 1 and 2 would be wildly different
> depending on the randomness of the second workload. What cgroup A
> would be able to get would fluctuate accordingly, no?
>

Your example is definitely to the point.

The answer to your question is no. In fact, in both cases cgroup A
will get exactly the same service slots, and in each slot exactly the
same number of sectors transferred. In particular, cgroup B will
systematically hit the timeout in the second case.

In other words, in case 2 cgroup A is guaranteed the same bandwidth
that it would get, in case 1, if cgroup B was quasi-sequential and so
slow to get served for a full time slice every time it got access to
the resource.

Maybe the source of confusion is the fact that a simple sector-based,
proportional share scheduler always distributes total bandwidth
according to weights. The catch is the additional BFQ rule: random
workloads get only time isolation, and are charged for full budgets,
so as to not affect the schedule of quasi-sequential workloads. So,
the correct claim for BFQ is that it distributes total bandwidth
according to weights (only) when all competing workloads are
quasi-sequential. If some workloads are random, then these workloads
are just time scheduled. This does break proportional-share bandwidth
distribution with mixed workloads, but, much more importantly, saves
both total throughput and individual bandwidths of quasi-sequential
workloads.

We could then check whether I did succeed in tuning timeouts and
budgets so as to achieve the best tradeoffs. But this is probably a
second-order problem as of now.


>>> So, I get that for a lot of workload, especially interactive ones, IO
>>> patterns are quasi-sequential and bw based scheduling is beneficial
>>> and we don't care that much about fairness in general; however, it's
>>> problematic that it would make the behavior of proportional control
>>> quite surprising.
>>
>> If I have somehow convinced you with what I wrote above, then I hope
>> we might agree that a surprising behavior of BFQ with cgroups would be
>> just a matter of bugs.
>
> I think I might still need more help. What am I missing?
>

I hope that what I wrote above did help.

Thanks,
Paolo

> Thanks.
>
> --
> tejun

2016-04-15 22:45:28

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler

Hello, Paolo.

On Sat, Apr 16, 2016 at 12:08:44AM +0200, Paolo Valente wrote:
> Maybe the source of confusion is the fact that a simple sector-based,
> proportional share scheduler always distributes total bandwidth
> according to weights. The catch is the additional BFQ rule: random
> workloads get only time isolation, and are charged for full budgets,
> so as to not affect the schedule of quasi-sequential workloads. So,
> the correct claim for BFQ is that it distributes total bandwidth
> according to weights (only) when all competing workloads are
> quasi-sequential. If some workloads are random, then these workloads
> are just time scheduled. This does break proportional-share bandwidth
> distribution with mixed workloads, but, much more importantly, saves
> both total throughput and individual bandwidths of quasi-sequential
> workloads.
>
> We could then check whether I did succeed in tuning timeouts and
> budgets so as to achieve the best tradeoffs. But this is probably a
> second-order problem as of now.

Ah, I see. Yeah, that clears it up for me. I'm gonna play with
cgroup settings and see how it actually behaves.

Thanks for your patience. :)

--
tejun

2016-04-16 06:04:02

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler

Hi

Il giorno 16/apr/2016, alle ore 00:45, Tejun Heo <[email protected]> ha scritto:

> Hello, Paolo.
>
> On Sat, Apr 16, 2016 at 12:08:44AM +0200, Paolo Valente wrote:
>> Maybe the source of confusion is the fact that a simple sector-based,
>> proportional share scheduler always distributes total bandwidth
>> according to weights. The catch is the additional BFQ rule: random
>> workloads get only time isolation, and are charged for full budgets,
>> so as to not affect the schedule of quasi-sequential workloads. So,
>> the correct claim for BFQ is that it distributes total bandwidth
>> according to weights (only) when all competing workloads are
>> quasi-sequential. If some workloads are random, then these workloads
>> are just time scheduled. This does break proportional-share bandwidth
>> distribution with mixed workloads, but, much more importantly, saves
>> both total throughput and individual bandwidths of quasi-sequential
>> workloads.
>>
>> We could then check whether I did succeed in tuning timeouts and
>> budgets so as to achieve the best tradeoffs. But this is probably a
>> second-order problem as of now.
>
> Ah, I see. Yeah, that clears it up for me.

Very glad to hear that!

> I'm gonna play with
> cgroup settings and see how it actually behaves.
>

You have already done it two months ago, and ... found a bug!

https://lkml.org/lkml/2016/2/11/731

I will try to spot and fix it (plus the other issues you have
reported), and hopefully get back in a few days with a revised version
of the patchset.

Thanks,
Paolo

> Thanks for your patience. :)
>
> --
> tejun