2016-10-09 01:16:05

by Kyle Sanderson

[permalink] [raw]
Subject: Fwd: [PATCH V3 00/11] block-throttle: add .high limit

Re-sending as plain-text as the Gmail Android App is still
historically broken...

---------- Forwarded message ----------
From: Kyle Sanderson <[email protected]>
Date: Wed, Oct 5, 2016 at 7:09 AM
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit
To: Tejun Heo <[email protected]>
Cc: [email protected], Paolo Valente <[email protected]>,
[email protected], Mark Brown <[email protected]>,
[email protected], Shaohua Li <[email protected]>, Jens Axboe
<[email protected]>, Linus Walleij <[email protected]>, Vivek Goyal
<[email protected]>, [email protected], Ulf Hansson
<[email protected]>


Obviously not to compound against this, however it has been proven for
years that CFQ will lock significantly under contention, and other
schedulers, such as BFQ attempt to provide fairness which is
absolutely the desired outcome from using a machine. The networking
space is a little wrecked in the sense there's a plethora of qdiscs
that don't necessarily need to exist; but are legacy. This limitation
does not exist in this realm as there are no specific tunables.

There is no reason that in 2016 a user-space application can steal all
of the I/O from a disk, completely locking the machine when BFQ has
essentially solved this years ago. I've been a moderately happy user
of BFQ for quite sometime now. There aren't tens, or hundreds of us,
but thousands through the custom kernels that are spun, and the
distros that helped support BFQ.

How is this even a discussion when hard numbers, and trying any
reproduction case easily reproduce the issues that CFQ causes. Reading
this thread, and many others only grows not only my disappointment,
but whenever someone launches kterm or scrot and their machine
freezes, leaves a selective few individuals completely responsible for
this. Help those users, help yourself, help Linux.


On 4 Oct 2016 1:29 pm, "Tejun Heo" <[email protected]> wrote:
>
> Hello, Paolo.
>
> On Tue, Oct 04, 2016 at 09:29:48PM +0200, Paolo Valente wrote:
> > > Hmm... I think we already discussed this but here's a really simple
> > > case. There are three unknown workloads A, B and C and we want to
> > > give A certain best-effort guarantees (let's say around 80% of the
> > > underlying device) whether A is sharing the device with B or C.
> >
> > That's the same example that you proposed me in our previous
> > discussion. For this example I showed you, with many boring numbers,
> > that with BFQ you get the most accurate distribution of the resource.
>
> Yes, it is about the same example and what I understood was that
> "accurate distribution of the resources" holds as long as the
> randomness is incidental (ie. due to layout on the filesystem and so
> on) with the slice expiration mechanism offsetting the actually random
> workloads.
>
> > If you have enough stamina, I can repeat them again. To save your
>
> I'll go back to the thread and re-read them.
>
> > patience, here is a very brief summary. In a concrete use case, the
> > unknown workloads turn into something like this: there will be a first
> > time interval during which A happens to be, say, sequential, B happens
> > to be, say, random and C happens to be, say, quasi-sequential. Then
> > there will be a next time interval during which their characteristics
> > change, and so on. It is easy (but boring, I acknowledge it) to show
> > that, for each of these time intervals BFQ provides the best possible
> > service in terms of fairness, bandwidth distribution, stability and so
> > on. Why? Because of the elastic bandwidth-time scheduling of BFQ
> > that we already discussed, and because BFQ is naturally accurate in
> > redistributing aggregate throughput proportionally, when needed.
>
> Yeah, that's what I remember and for workload above certain level of
> randomness its time consumption is mapped to bw, right?
>
> > > I get that bfq can be a good compromise on most desktop workloads and
> > > behave reasonably well for some server workloads with the slice
> > > expiration mechanism but it really isn't an IO resource partitioning
> > > mechanism.
> >
> > Right. My argument is that BFQ enables you to give to each client the
> > bandwidth and low-latency guarantees you want. And this IMO is way
> > better than partitioning a resource and then getting unavoidable
> > unfairness and high latency.
>
> But that statement only holds while bw is the main thing to guarantee,
> no? The level of isolation that we're looking for here is fairly
> strict adherence to sub/few-milliseconds in terms of high percentile
> scheduling latency while within the configured bw/iops limits, not
> "overall this device is being used pretty well".
>
> Thanks.
>
> --
> tejun


2016-10-14 16:40:47

by Tejun Heo

[permalink] [raw]
Subject: Re: Fwd: [PATCH V3 00/11] block-throttle: add .high limit

Hello, Kyle.

On Sat, Oct 08, 2016 at 06:15:14PM -0700, Kyle Sanderson wrote:
> How is this even a discussion when hard numbers, and trying any
> reproduction case easily reproduce the issues that CFQ causes. Reading
> this thread, and many others only grows not only my disappointment,
> but whenever someone launches kterm or scrot and their machine
> freezes, leaves a selective few individuals completely responsible for
> this. Help those users, help yourself, help Linux.

So, just to be clear. I wasn't arguing against bfq replacing cfq (or
anything along that line) but that proportional control, as
implemented, would be too costly for many use cases and thus we need
something along the line of what Shaohua is proposing.

FWIW, it looks like the only way we can implement proportional control
on highspeed ssds with acceptable overhead is somehow finding a way to
calculate the cost of each IO and throttle IOs according to that while
controlling for latency as necessary. Slice scheduling with idling
seems too expensive with highspeed devices with high io depth.

Thanks.

--
tejun

2016-10-14 17:14:03

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit


> Il giorno 14 ott 2016, alle ore 18:40, Tejun Heo <[email protected]> ha scritto:
>
> Hello, Kyle.
>
> On Sat, Oct 08, 2016 at 06:15:14PM -0700, Kyle Sanderson wrote:
>> How is this even a discussion when hard numbers, and trying any
>> reproduction case easily reproduce the issues that CFQ causes. Reading
>> this thread, and many others only grows not only my disappointment,
>> but whenever someone launches kterm or scrot and their machine
>> freezes, leaves a selective few individuals completely responsible for
>> this. Help those users, help yourself, help Linux.
>
> So, just to be clear. I wasn't arguing against bfq replacing cfq (or
> anything along that line) but that proportional control, as
> implemented, would be too costly for many use cases and thus we need
> something along the line of what Shaohua is proposing.
>

Sorry for dropping in all the times, but the vision that you and some
other guys propose seems to miss some important piece (unless, now or
then, you will patiently prove me wrong, or I will finally understand
on my own why I'm wrong).

You are of course right: bfq, as a component of blk, and above all, as
a sort of derivative of CFQ (and of its overhead), has currently too
high a overhead to handle more than 10-20K IOPS.

That said, your 'thus' seems a little too strong: "bfq does not yet
handle fast SSDs, thus we need something else". What about the
millions of devices (and people) still within 10-20 K IOPS, and
experiencing awful latencies and lack of bandwidth guarantees?

For certain systems or applications, it isn't even just a "buy a fast
SSD" matter, but a technological constraint.

> FWIW, it looks like the only way we can implement proportional control
> on highspeed ssds with acceptable overhead

Maybe not: as I wrote to Viveck in a previous reply, containing
pointers to documentation, we have already achieved twenty millions
of decisions per second with a prototype driving existing
proportional-share packet schedulers (essentially without
modifications).

> is somehow finding a way to
> calculate the cost of each IO and throttle IOs according to that while
> controlling for latency as necessary. Slice scheduling with idling
> seems too expensive with highspeed devices with high io depth.
>

Yes, that's absolutely true. I'm already thinking about an idleless
solution. As I already wrote, I'm willing to help with scheduling in
blk-mq. I hope there will be the opportunity to find some way to go
at KS.

Thanks,
Paolo

> Thanks.
>
> --
> tejun


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/





2016-10-14 18:35:22

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

Hello, Paolo.

On Fri, Oct 14, 2016 at 07:13:41PM +0200, Paolo Valente wrote:
> That said, your 'thus' seems a little too strong: "bfq does not yet
> handle fast SSDs, thus we need something else". What about the
> millions of devices (and people) still within 10-20 K IOPS, and
> experiencing awful latencies and lack of bandwidth guarantees?

I'm not objecting to any of that. My point just is that bfq, at least
as currently implemented, is unfit for certain classes of use cases.

> > FWIW, it looks like the only way we can implement proportional control
> > on highspeed ssds with acceptable overhead
>
> Maybe not: as I wrote to Viveck in a previous reply, containing
> pointers to documentation, we have already achieved twenty millions
> of decisions per second with a prototype driving existing
> proportional-share packet schedulers (essentially without
> modifications).

And that doesn't require idling and thus doesn't severely impact
utilization?

> > is somehow finding a way to
> > calculate the cost of each IO and throttle IOs according to that while
> > controlling for latency as necessary. Slice scheduling with idling
> > seems too expensive with highspeed devices with high io depth.
>
> Yes, that's absolutely true. I'm already thinking about an idleless
> solution. As I already wrote, I'm willing to help with scheduling in
> blk-mq. I hope there will be the opportunity to find some way to go
> at KS.

It'd be great to have a proportional control mechanism whose overhead
is acceptable. Unfortunately, we don't have one now and nothing seems
right around the corner. (Mostly) work-conserving throttling would be
fiddlier to use but is something which is useful regardless of such
proportional control mechanism and can be obtained relatively easily.

I don't see why the two approaches would be mutually exclusive.

Thanks.

--
tejun

2016-10-16 19:02:59

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit


> Il giorno 14 ott 2016, alle ore 20:35, Tejun Heo <[email protected]> ha scritto:
>
> Hello, Paolo.
>
> On Fri, Oct 14, 2016 at 07:13:41PM +0200, Paolo Valente wrote:
>> That said, your 'thus' seems a little too strong: "bfq does not yet
>> handle fast SSDs, thus we need something else". What about the
>> millions of devices (and people) still within 10-20 K IOPS, and
>> experiencing awful latencies and lack of bandwidth guarantees?
>
> I'm not objecting to any of that.

Ok, sorry for misunderstanding. I'm just more and more confused about
why a readily available, and not proven wrong solution has not yet
been accepted, if everybody apparently acknowledges the problem.

> My point just is that bfq, at least
> as currently implemented, is unfit for certain classes of use cases.
>

Absolutely correct.

>>> FWIW, it looks like the only way we can implement proportional control
>>> on highspeed ssds with acceptable overhead
>>
>> Maybe not: as I wrote to Viveck in a previous reply, containing
>> pointers to documentation, we have already achieved twenty millions
>> of decisions per second with a prototype driving existing
>> proportional-share packet schedulers (essentially without
>> modifications).
>
> And that doesn't require idling and thus doesn't severely impact
> utilization?
>

Nope. Packets are commonly assumed to be sent asynchronously.
I guess that discussing the validity of this assumption is out of the
scope of this thread.

Thanks,
Paolo

>>> is somehow finding a way to
>>> calculate the cost of each IO and throttle IOs according to that while
>>> controlling for latency as necessary. Slice scheduling with idling
>>> seems too expensive with highspeed devices with high io depth.
>>
>> Yes, that's absolutely true. I'm already thinking about an idleless
>> solution. As I already wrote, I'm willing to help with scheduling in
>> blk-mq. I hope there will be the opportunity to find some way to go
>> at KS.
>
> It'd be great to have a proportional control mechanism whose overhead
> is acceptable. Unfortunately, we don't have one now and nothing seems
> right around the corner. (Mostly) work-conserving throttling would be
> fiddlier to use but is something which is useful regardless of such
> proportional control mechanism and can be obtained relatively easily.
>
> I don't see why the two approaches would be mutually exclusive.
>
> Thanks.
>
> --
> tejun
> --
> To unsubscribe from this list: send the line "unsubscribe linux-block" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/





2016-10-18 05:15:41

by Kyle Sanderson

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

Not to compound upon this again. However if BFQ isn't suitable to
replace CFQ for high I/O workloads (I've yet to see 20k IOPS on any
reasonably sized SAN (SC4020 / v5000, etc)), can't we at-least default
BFQ to become the default I/O scheduler for people otherwise
requesting CFQ? Paolo has had a team of students working on this for
years, even if the otherwise "secret weapon" is mainlined I highly
doubt his work will stop. We're pretty close to fixing hard I/O stalls
in Linux, mainlining being the last major burden.

While I've contributed nothing to BFQ code wise, absolutely let any of
us know if there's anything outstanding to solve hard lockups and I
believe any of us will try our best.

Kyle.

On Sun, Oct 16, 2016 at 12:02 PM, Paolo Valente
<[email protected]> wrote:
>
>> Il giorno 14 ott 2016, alle ore 20:35, Tejun Heo <[email protected]> ha scritto:
>>
>> Hello, Paolo.
>>
>> On Fri, Oct 14, 2016 at 07:13:41PM +0200, Paolo Valente wrote:
>>> That said, your 'thus' seems a little too strong: "bfq does not yet
>>> handle fast SSDs, thus we need something else". What about the
>>> millions of devices (and people) still within 10-20 K IOPS, and
>>> experiencing awful latencies and lack of bandwidth guarantees?
>>
>> I'm not objecting to any of that.
>
> Ok, sorry for misunderstanding. I'm just more and more confused about
> why a readily available, and not proven wrong solution has not yet
> been accepted, if everybody apparently acknowledges the problem.
>
>> My point just is that bfq, at least
>> as currently implemented, is unfit for certain classes of use cases.
>>
>
> Absolutely correct.
>
>>>> FWIW, it looks like the only way we can implement proportional control
>>>> on highspeed ssds with acceptable overhead
>>>
>>> Maybe not: as I wrote to Viveck in a previous reply, containing
>>> pointers to documentation, we have already achieved twenty millions
>>> of decisions per second with a prototype driving existing
>>> proportional-share packet schedulers (essentially without
>>> modifications).
>>
>> And that doesn't require idling and thus doesn't severely impact
>> utilization?
>>
>
> Nope. Packets are commonly assumed to be sent asynchronously.
> I guess that discussing the validity of this assumption is out of the
> scope of this thread.
>
> Thanks,
> Paolo
>
>>>> is somehow finding a way to
>>>> calculate the cost of each IO and throttle IOs according to that while
>>>> controlling for latency as necessary. Slice scheduling with idling
>>>> seems too expensive with highspeed devices with high io depth.
>>>
>>> Yes, that's absolutely true. I'm already thinking about an idleless
>>> solution. As I already wrote, I'm willing to help with scheduling in
>>> blk-mq. I hope there will be the opportunity to find some way to go
>>> at KS.
>>
>> It'd be great to have a proportional control mechanism whose overhead
>> is acceptable. Unfortunately, we don't have one now and nothing seems
>> right around the corner. (Mostly) work-conserving throttling would be
>> fiddlier to use but is something which is useful regardless of such
>> proportional control mechanism and can be obtained relatively easily.
>>
>> I don't see why the two approaches would be mutually exclusive.
>>
>> Thanks.
>>
>> --
>> tejun
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-block" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
> --
> Paolo Valente
> Algogroup
> Dipartimento di Scienze Fisiche, Informatiche e Matematiche
> Via Campi 213/B
> 41125 Modena - Italy
> http://algogroup.unimore.it/people/paolo/
>
>
>
>
>