2014-06-02 09:26:16

by Paolo Valente

Subject: Re: [PATCH RFC - TAKE TWO - 11/12] block, bfq: boost the throughput on NCQ-capable flash-based devices


On 31 May 2014, at 13:52, Tejun Heo <[email protected]> wrote:

> Hello, Paolo.
>
> So, I've actually looked at the code. Here are some questions.
>
> On Thu, May 29, 2014 at 11:05:42AM +0200, Paolo Valente wrote:
>> + * 1) all active queues have the same weight,
>> + * 2) all active groups at the same level in the groups tree have the same
>> + * weight,
>> + * 3) all active groups at the same level in the groups tree have the same
>> + * number of children.
>
> 3) basically disables it whenever blkcg is used. Might as well just
> skip the whole thing if there are any !root cgroups. It's only
> theoretically interesting.

It is easier for me to reply to this, and the other related comments, cumulatively below.

>
>> static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
>> {
>> struct bfq_data *bfqd = bfqq->bfqd;
>
> bool symmetric_scenario, expire_non_wr;
>
>> +#ifdef CONFIG_CGROUP_BFQIO
>> +#define symmetric_scenario (!bfqd->active_numerous_groups && \
>> + !bfq_differentiated_weights(bfqd))
>
> symmetric_scenario = xxx;
>
>> +#else
>> +#define symmetric_scenario (!bfq_differentiated_weights(bfqd))
>
> symmetric_scenario = yyy;
>
>> +#endif
>> /*
>> * Condition for expiring a non-weight-raised queue (and hence not idling
>> * the device).
>> */
>> #define cond_for_expiring_non_wr (bfqd->hw_tag && \
>> - bfqd->wr_busy_queues > 0)
>> + (bfqd->wr_busy_queues > 0 || \
>> + (symmetric_scenario && \
>> + blk_queue_nonrot(bfqd->queue))))
>
> expire_non_wr = zzz;
>

The solution you propose is the first that came to my mind. But then I went for a clumsy macro-based solution because: 1) the whole function is all about evaluating one long logical expression, and 2) the macro-based solution exploits short-circuit evaluation fully, minimizing the number of steps executed. For example, with async queues only one condition is evaluated.

Defining three variables instead entails that all of them are computed on every invocation, even though most of the time there is no need to.

Would this gain be negligible (sorry for my ignorance), or is it enough to justify these unusual macros?
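Just to make the comparison concrete, here is a toy, self-contained sketch of the two variants (the identifiers are stand-ins, not the actual bfq code):

#include <stdbool.h>

/* Stand-ins for the real predicates; assume they are non-trivial. */
static bool differentiated_weights(void) { return false; }
static bool queue_nonrot(void)           { return true;  }

static bool hw_tag = true;
static int wr_busy_queues;

/* Macro variant: the predicates expand in place, so && / ||
 * short-circuiting decides which of them run at all (none beyond
 * hw_tag when hw_tag is false). */
#define symmetric_scenario (!differentiated_weights())
#define cond_for_expiring_non_wr \
        (hw_tag && (wr_busy_queues > 0 || \
                    (symmetric_scenario && queue_nonrot())))

bool with_macros(void)
{
        return cond_for_expiring_non_wr;
}

/* Variable variant, written naively: differentiated_weights() is
 * called on every invocation, even when wr_busy_queues > 0 already
 * settles the result. */
bool with_variables(void)
{
        bool symmetric = !differentiated_weights();

        return hw_tag &&
               (wr_busy_queues > 0 || (symmetric && queue_nonrot()));
}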

>>
>> return bfq_bfqq_sync(bfqq) && (
>> bfqq->wr_coeff > 1 ||
>> /**
>> + * struct bfq_weight_counter - counter of the number of all active entities
>> + * with a given weight.
>> + * @weight: weight of the entities that this counter refers to.
>> + * @num_active: number of active entities with this weight.
>> + * @weights_node: weights tree member (see bfq_data's @queue_weights_tree
>> + * and @group_weights_tree).
>> + */
>> +struct bfq_weight_counter {
>> + short int weight;
>> + unsigned int num_active;
>> + struct rb_node weights_node;
>> +};
>
> This is way over-engineered. In most cases, the only time you get the
> same weight on all IO issuers would be when everybody is on the
> default ioprio so might as well simply count the number of non-default
> ioprios. It'd be one integer instead of a tree of counters.
>

Reply below.

>> @@ -306,6 +322,22 @@ enum bfq_device_speed {
>> * @rq_pos_tree: rbtree sorted by next_request position, used when
>> * determining if two or more queues have interleaving
>> * requests (see bfq_close_cooperator()).
>> + * @active_numerous_groups: number of bfq_groups containing more than one
>> + * active @bfq_entity.
>
> You can safely assume that on any system which uses blkcg, the above
> counter is >1.
>
> This optimization may be theoretically interesting but doesn't seem
> practical at all and would make the system behave distinctively
> differently depending on something which is extremely subtle and seems
> completely unrelated. Furthermore, on any system which uses blkcg,
> ext4, btrfs or has any task which has non-zero nice value, it won't
> make any difference. Its value is only theoretical.
>

Turning on idling unconditionally when blkcg is used is one of the first solutions we considered. But there seem to be practical scenarios where this would cause an unjustified loss of throughput. The main example for us was ulatencyd, which AFAIK creates one group for each process and, by default, assigns the same weight to all processes. But the assigned weight is not the one associated with the default ioprio.

I do not know precisely how widespread a mechanism like ulatencyd is, but, in the symmetric scenario it creates, the throughput on, e.g., an HDD would drop by half if the workload were mostly random and we removed the more complex mechanism we set up.
Wouldn't this be bad?

> Another thing to consider is that virtually all remotely modern
> devices, rotational or not, are queued. At this point, it's rather
> pointless to design one behavior for !queued and another for queued.
> Things should just be designed for queued devices.

I am sorry for expressing doubts again (mainly because of my ignorance), but a few months ago I had to work with some portable devices for a company specializing in ARM systems. As their HDD they were using a Toshiba MK6006GAH. If I remember correctly, this device had no NCQ. Removing the differentiated behavior of bfq with respect to queued/!queued devices would have caused just a loss of throughput with this slow device, instead of the improvements we obtained.

> I don't know what
> the solution is but given that the benefits of NCQ for rotational
> devices are extremely limited, sticking with the single request model in
> most cases and maybe allowing queued operation for specific workloads
> might be a better approach. As for ssds, just do something simple.
> It's highly likely that most ssds won't travel this code path in the
> near future anyway.

This is the point that worries me most. As I pointed out in one of my previous emails, dispatching requests to an SSD without control causes high latencies, or even complete unresponsiveness (Figure 8 in
http://algogroup.unimore.it/people/paolo/disk_sched/extra_results.php
or Figure 9 in
http://algogroup.unimore.it/people/paolo/disk_sched/results.php).

I am of course aware that efficiency is a critical issue with fast devices, and it is probably destined to become more and more critical in the future. But, as a user, I would be definitely unhappy with a system that can, e.g., update itself in one minute instead of five but, during that minute, may become unresponsive. In particular, I would not be pleased to buy a more expensive SSD and get a much less responsive system than the one I had with a cheaper HDD and a fully working bfq.

Thanks,
Paolo

>
> Thanks.
>
> --
> tejun


--
Paolo Valente
Algogroup
Dipartimento di Fisica, Informatica e Matematica
Via Campi, 213/B
41125 Modena - Italy
homepage: http://algogroup.unimore.it/people/paolo/


2014-06-03 17:11:30

by Tejun Heo

Subject: Re: [PATCH RFC - TAKE TWO - 11/12] block, bfq: boost the throughput on NCQ-capable flash-based devices

Hello,

On Mon, Jun 02, 2014 at 11:26:07AM +0200, Paolo Valente wrote:
> >> #define cond_for_expiring_non_wr (bfqd->hw_tag && \
> >> - bfqd->wr_busy_queues > 0)
> >> + (bfqd->wr_busy_queues > 0 || \
> >> + (symmetric_scenario && \
> >> + blk_queue_nonrot(bfqd->queue))))
> >
> > expire_non_wr = zzz;
> >
>
> The solution you propose is the first that came to my mind. But then
> I went for a clumsy macro-based solution because: 1) the whole
> function is all about evaluating one long logical expression, and 2)
> the macro-based solution exploits short-circuit evaluation fully,
> minimizing the number of steps executed. For example, with async
> queues only one condition is evaluated.
>
> Defining three variables instead entails that all of them are
> computed on every invocation, even though most of the time there is
> no need to.
>
> Would this gain be negligible (sorry for my ignorance), or is it
> enough to justify these unusual macros?

The compiler should be able to optimize those to basically the same
code. AFAICS, everything the code tests is trivially known to be
without side-effect to the compiler. Besides, even if the compiler
generates slightly less efficient code, which it shouldn't, it's
highly unlikely that this level of micro CPU cycle optimization would
be measurable for something as heavy as [bc]fq.
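IOW, the variable-based form would just be something like the
following (a sketch, reusing the symbols from your patch):

        bool symmetric_scenario, expire_non_wr;

#ifdef CONFIG_CGROUP_BFQIO
        symmetric_scenario = !bfqd->active_numerous_groups &&
                             !bfq_differentiated_weights(bfqd);
#else
        symmetric_scenario = !bfq_differentiated_weights(bfqd);
#endif
        expire_non_wr = bfqd->hw_tag &&
                        (bfqd->wr_busy_queues > 0 ||
                         (symmetric_scenario &&
                          blk_queue_nonrot(bfqd->queue)));

The compiler can sink each of those computations to its first use, so
the generated code should match what the macros produce.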

> > This optimization may be theoretically interesting but doesn't seem
> > practical at all and would make the system behave distinctively
> > differently depending on something which is extremely subtle and seems
> > completely unrelated. Furthermore, on any system which uses blkcg,
> > ext4, btrfs or has any task which has non-zero nice value, it won't
> > make any difference. Its value is only theoretical.
>
> Turning on idling unconditionally when blkcg is used is one of the
> first solutions we considered. But there seem to be practical
> scenarios where this would cause an unjustified loss of
> throughput. The main example for us was ulatencyd, which AFAIK
> creates one group for each process and, by default, assigns the same
> weight to all processes. But the assigned weight is not the one
> associated with the default ioprio.

Isn't the optimization "not idling" when these conditions are met?
Shouldn't the comparison be against the benefit of "not idling
selectively" vs "always idling" when blkcg is in use?

Another problem there is that this depends not only on the number of
processes but also on the number of threads in each. cgroup is moving away from
allowing threads of a single process in different cgroups, so this
means that the operation can fluctuate in a very unexpected manner.

I'm not really convinced about the approach. With rotating disks, we
know that allowing queue depth > 1 generally lowers both throughput and
responsiveness and brings benefits in quite restricted cases. It
seems rather backwards to always allow QD > 1 and then try to optimize
in an attempt to recover what's lost. Wouldn't it make far more sense
to actively maintain QD == 1 by default and allow QD > 1 in specific
cases where it can be determined to be more beneficial than harmful?

> I do not know precisely how widespread a mechanism like ulatencyd
> is, but, in the symmetric scenario it creates, the throughput on,
> e.g., an HDD would drop by half if the workload were mostly random
> and we removed the more complex mechanism we set up. Wouldn't this
> be bad?

It looks like a lot of complexity for optimization for a very
specific, likely unreliable (in terms of its triggering condition),
use case. The triggering condition is just too specific.

> > Another thing to consider is that virtually all remotely modern
> > devices, rotational or not, are queued. At this point, it's rather
> > pointless to design one behavior for !queued and another for queued.
> > Things should just be designed for queued devices.
>
> I am sorry for expressing doubts again (mainly because of my
> ignorance), but a few months ago I had to work with some portable
> devices for a company specializing in ARM systems. As their HDD they
> were using a Toshiba MK6006GAH. If I remember correctly, this device
> had no NCQ. Removing the differentiated behavior of bfq with respect
> to queued/!queued devices would have caused just a loss of throughput
> with this slow device, instead of the improvements we obtained.

Heh, that's a 60GB ATA-100 hard drive. Had no idea those are still
being produced. However, my point still is that the design should be
focused on queued devices. They're predominant in the market and
it'll only continue to become more so. What bothers me is that the
scheduler essentially loses control and shows sub-optimal behavior on
queued devices by default and that's how it's gonna perform in the
vast majority of use cases.

> > I don't know what
> the solution is but given that the benefits of NCQ for rotational
> devices are extremely limited, sticking with the single request model in
> > most cases and maybe allowing queued operation for specific workloads
> > might be a better approach. As for ssds, just do something simple.
> > It's highly likely that most ssds won't travel this code path in the
> > near future anyway.
>
> This is the point that worries me most. As I pointed out in one of my previous emails, dispatching requests to an SSD without control causes high latencies, or even complete unresponsiveness (Figure 8 in
> http://algogroup.unimore.it/people/paolo/disk_sched/extra_results.php
> or Figure 9 in
> http://algogroup.unimore.it/people/paolo/disk_sched/results.php).
>
> I am of course aware that efficiency is a critical issue with fast
> devices, and it is probably destined to become more and more critical
> in the future. But, as a user, I would be definitely unhappy with a
> system that can, e.g., update itself in one minute instead of five
> but, during that minute, may become unresponsive. In particular, I
> would not be pleased to buy a more expensive SSD and get a much less
> responsive system than the one I had with a cheaper HDD and a fully
> working bfq.

blk-mq is right around the corner and newer devices won't travel this
path at all. Hopefully, ahci will be served through blk-mq too when
it's connected to ssds, so its usefulness for high performance
devices will diminish rather quickly over the coming several years. It
sure would be nice to still be able to carry some optimizations but it
does shift the trade-off balance in terms of how much extra complexity
is justified.

Thanks.

--
tejun

2014-06-04 07:29:28

by Paolo Valente

Subject: Re: [PATCH RFC - TAKE TWO - 11/12] block, bfq: boost the throughput on NCQ-capable flash-based devices


On 3 Jun 2014, at 19:11, Tejun Heo <[email protected]> wrote:

> Hello,
>
> On Mon, Jun 02, 2014 at 11:26:07AM +0200, Paolo Valente wrote:
>>>> #define cond_for_expiring_non_wr (bfqd->hw_tag && \
>>>> - bfqd->wr_busy_queues > 0)
>>>> + (bfqd->wr_busy_queues > 0 || \
>>>> + (symmetric_scenario && \
>>>> + blk_queue_nonrot(bfqd->queue))))
>>>
>>> expire_non_wr = zzz;
>>>
>>
>> The solution you propose is the first that came to my mind. But then
>> I went for a clumsy macro-based solution because: 1) the whole
>> function is all about evaluating one long logical expression, and 2)
>> the macro-based solution exploits short-circuit evaluation fully,
>> minimizing the number of steps executed. For example, with async
>> queues only one condition is evaluated.
>>
>> Defining three variables instead entails that all of them are
>> computed on every invocation, even though most of the time there is
>> no need to.
>>
>> Would this gain be negligible (sorry for my ignorance), or is it
>> enough to justify these unusual macros?
>
> The compiler should be able to optimize those to basically the same
> code. AFAICS, everything the code tests is trivially known to be
> without side-effect to the compiler. Besides, even if the compiler
> generates slightly less efficient code, which it shouldn't, it's
> highly unlikely that this level of micro CPU cycle optimization would
> be measurable for something as heavy as [bc]fq.
>

Thanks a lot for answering this question.

>>> This optimization may be theoretically interesting but doesn't seem
>>> practical at all and would make the system behave distinctively
>>> differently depending on something which is extremely subtle and seems
>>> completely unrelated. Furthermore, on any system which uses blkcg,
>>> ext4, btrfs or has any task which has non-zero nice value, it won't
>>> make any difference. Its value is only theoretical.
>>
>> Turning on idling unconditionally when blkcg is used is one of the
>> first solutions we considered. But there seem to be practical
>> scenarios where this would cause an unjustified loss of
>> throughput. The main example for us was ulatencyd, which AFAIK
>> creates one group for each process and, by default, assigns the same
>> weight to all processes. But the assigned weight is not the one
>> associated with the default ioprio.
>
> Isn't the optimization "not idling" when these conditions are met?

Yes, not idling is the way to go in this case.

> Shouldn't the comparison be against the benefit of "not idling
> selectively" vs "always idling" when blkcg is in use?
>

Exactly. I’m sorry if I wrote things/sentences that did not make this point clear. Maybe this lack of clarity is a further consequence of the annoying “not not” scheme adopted in the code and in the comments.

> Another problem there is that this depends not only on the number of
> processes but also on the number of threads in each. cgroup is moving away from
> allowing threads of a single process in different cgroups, so this
> means that the operation can fluctuate in a very unexpected manner.
>
> I'm not really convinced about the approach. With rotating disks, we
> know that allowing queue depth > 1 generally lowers both throughput and
> responsiveness and brings benefits in quite restricted cases. It
> seems rather backwards to always allow QD > 1 and then try to optimize
> in an attempt to recover what's lost. Wouldn't it make far more sense
> to actively maintain QD == 1 by default and allow QD > 1 in specific
> cases where it can be determined to be more beneficial than harmful?
>

Although QD == 1 is not explicitly set as the default, what you suggest is exactly what bfq does.

>> I do not know precisely how widespread a mechanism like ulatencyd
>> is, but, in the symmetric scenario it creates, the throughput on,
>> e.g., an HDD would drop by half if the workload were mostly random
>> and we removed the more complex mechanism we set up. Wouldn't this
>> be bad?
>
> It looks like a lot of complexity for optimization for a very
> specific, likely unreliable (in terms of its triggering condition),
> use case. The triggering condition is just too specific.

Actually, over the last years we have been asked several times to improve random-I/O performance on HDDs, by people who measured, for the typical tasks performed by their machines, much lower throughput than with the other schedulers. Major problems have been reported for server workloads (database, web) and for btrfs. According to the feedback received after introducing this optimization in bfq, those problems finally seem to be gone.

>
>>> Another thing to consider is that virtually all remotely modern
>>> devices, rotational or not, are queued. At this point, it's rather
>>> pointless to design one behavior for !queued and another for queued.
>>> Things should just be designed for queued devices.
>>
>> I am sorry for expressing doubts again (mainly because of my
>> ignorance), but a few months ago I had to work with some portable
>> devices for a company specializing in ARM systems. As their HDD they
>> were using a Toshiba MK6006GAH. If I remember correctly, this device
>> had no NCQ. Removing the differentiated behavior of bfq with respect
>> to queued/!queued devices would have caused just a loss of throughput
>> with this slow device, instead of the improvements we obtained.
>
> Heh, that's a 60GB ATA-100 hard drive. Had no idea those are still
> being produced. However, my point still is that the design should be
> focused on queued devices. They're predominant in the market and
> it'll only continue to become more so. What bothers me is that the
> scheduler essentially loses control and shows sub-optimal behavior on
> queued devices by default and that's how it's gonna perform in the
> vast majority of use cases.
>

According to our experiments, this differentiated behavior of bfq is only beneficial: apparently there is no performance loss with respect to any of the other schedulers, in any scenario.

>>> I don't know what
>>> the solution is but given that the benefits of NCQ for rotational
>>> devices are extremely limited, sticking with the single request model in
>>> most cases and maybe allowing queued operation for specific workloads
>>> might be a better approach. As for ssds, just do something simple.
>>> It's highly likely that most ssds won't travel this code path in the
>>> near future anyway.
>>
>> This is the point that worries me most. As I pointed out in one of my previous emails, dispatching requests to an SSD without control causes high latencies, or even complete unresponsiveness (Figure 8 in
>> http://algogroup.unimore.it/people/paolo/disk_sched/extra_results.php
>> or Figure 9 in
>> http://algogroup.unimore.it/people/paolo/disk_sched/results.php).
>>
>> I am of course aware that efficiency is a critical issue with fast
>> devices, and it is probably destined to become more and more critical
>> in the future. But, as a user, I would be definitely unhappy with a
>> system that can, e.g., update itself in one minute instead of five
>> but, during that minute, may become unresponsive. In particular, I
>> would not be pleased to buy a more expensive SSD and get a much less
>> responsive system than the one I had with a cheaper HDD and a fully
>> working bfq.
>
> blk-mq is right around the corner and newer devices won't travel this
> path at all. Hopefully, ahci will be served through blk-mq too when
> it's connected to ssds, so its usefulness for high performance
> devices will diminish rather quickly over the coming several years. It
> sure would be nice to still be able to carry some optimizations but it
> does shift the trade-off balance in terms of how much extra complexity
> is justified.
>

For users (like me) who do not want to lose responsiveness in return for a shorter duration of operations that usually run in the background, the best solution is IMHO to leave them the choice between preserving responsiveness and squeezing every MB/s out of the device and/or keeping the load on every core low.

Besides, turning back to bfq, if its low-latency heuristics are disabled for non-rotational devices, then, according to our results with slower devices, such as SD Cards and eMMCs, latency easily becomes unbearable, with no throughput gain.

Thanks,
Paolo

> Thanks.
>
> --
> tejun


--
Paolo Valente
Algogroup
Dipartimento di Fisica, Informatica e Matematica
Via Campi, 213/B
41125 Modena - Italy
homepage: http://algogroup.unimore.it/people/paolo/

2014-06-04 13:56:33

by Tejun Heo

Subject: Re: [PATCH RFC - TAKE TWO - 11/12] block, bfq: boost the throughput on NCQ-capable flash-based devices

Hello, Paolo.

On Wed, Jun 04, 2014 at 09:29:20AM +0200, Paolo Valente wrote:
> > Shouldn't the comparison be against the benefit of "not idling
> > selectively" vs "always idling" when blkcg is in use?
> >
>
> Exactly. I’m sorry if I wrote things/sentences that did not make
> this point clear. Maybe this lack of clarity is a further consequence
> of the annoying “not not” scheme adopted in the code and in the
> comments.

Ah, no, it was just me misreading the message.

> > I'm not really convinced about the approach. With rotating disks, we
> > know that allowing queue depth > 1 generally lowers both throughput and
> > responsiveness and brings benefits in quite restricted cases. It
> > seems rather backwards to always allow QD > 1 and then try to optimize
> > in an attempt to recover what's lost. Wouldn't it make far more sense
> > to actively maintain QD == 1 by default and allow QD > 1 in specific
> > cases where it can be determined to be more beneficial than harmful?
>
> Although QD == 1 is not explicitly set as the default, what you suggest is exactly what bfq does.

I see.

> >> I do not know precisely how widespread a mechanism like ulatencyd
> >> is, but, in the symmetric scenario it creates, the throughput on,
> >> e.g., an HDD would drop by half if the workload were mostly random
> >> and we removed the more complex mechanism we set up. Wouldn't this
> >> be bad?
> >
> > It looks like a lot of complexity for optimization for a very
> > specific, likely unreliable (in terms of its triggering condition),
> > use case. The triggering condition is just too specific.
>
> Actually, over the last years we have been asked several times to
> improve random-I/O performance on HDDs, by people who measured, for
> the typical tasks performed by their machines, much lower throughput
> than with the other schedulers. Major problems have been reported
> for server workloads (database, web) and for btrfs. According to
> the feedback received after introducing this optimization in bfq,
> those problems finally seem to be gone.

I see. The equal priority part can probably work in enough cases to
be meaningful given that it just depends on the busy queues having the
same weight instead of everything in the system. It'd be nice to note
that in the comment tho.

I'm still quite skeptical about the cgroup part tho. The triggering
condition is too specific and fragile. If I'm reading the bfq blkcg
implementation correctly, it seems to be applying the scheduling
algorithm recursively walking down the tree one level at a time. cfq
does it differently. cfq flattens the hierarchy by calculating the
nested weight of each active leaf queue and schedule all of them from
the same service tree. IOW, the scheduling algorithm per se doesn't care
about the hierarchy. All it sees are differing weights competing
equally regardless of the hierarchical structure.

If the same strategy can be applied to bfq, possibly the same strategy
of checking whether all the active queues have the same weight can be
used regardless of blkcg? That'd be simpler and a lot more robust.
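
Conceptually the flattening is nothing more than this (a pure sketch,
not the actual cfq code):

struct sched_node {
        struct sched_node *parent;      /* NULL for the root */
        unsigned int weight;            /* weight of this node */
        unsigned int children_weight;   /* sum of active children's weights */
};

/*
 * Effective weight of an active leaf queue: scale some fixed base by
 * the leaf's share of its group, then by the group's share of its
 * parent, and so on up to the root.  All leaves can then be scheduled
 * from a single service tree using just these effective weights.
 */
static unsigned int nested_weight(struct sched_node *leaf, unsigned int base)
{
        struct sched_node *n;
        unsigned int w = base;

        for (n = leaf; n->parent; n = n->parent)
                w = w * n->weight / n->parent->children_weight;

        return w;
}

The price is that these effective weights have to be recomputed
whenever the set of active queues changes.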

Another thing I'm curious about is that the logic that you're using to
disable idling assumes that the disk will serve the queued commands
more or less in a fair manner over time, right? If so, why does it
matter whether the queues have the same weight? Shouldn't the
bandwidth scheduling eventually make them converge to the specified
weights over time? Isn't the wr_coeff > 1 test enough for maintaining
reasonable responsiveness?

> Besides, turning back to bfq, if its low-latency heuristics are
> disabled for non-rotational devices, then, according to our results
> with slower devices, such as SD Cards and eMMCs, latency easily
> becomes unbearable, with no throughput gain.

Hmmm... looking at the nonrot optimizations again, so yeah, assuming
the weight counting is necessary for NCQ hdds, the part specific to
ssds isn't that big. You probably wanna sequence it the other way
around tho. This really should be primarily about disks at this
point.

The thing which still makes me cringe is how it scatters
blk_queue_nonrot() tests across multiple places without clear
explanation on what's going on. A bfqq being constantly seeky or not
doesn't have much to do with whether the device is rotational or not.
Its effect does and I don't think avoiding the overhead of keeping the
counters is meaningful. Things like this make the code a lot harder
to maintain in the long term as code is organized according to
seemingly arbitrary optimization rather than semantic structure. So,
let's please keep the accounting and optimization separate.

Thanks.

--
tejun

2014-06-16 10:46:59

by Paolo Valente

Subject: Re: [PATCH RFC - TAKE TWO - 11/12] block, bfq: boost the throughput on NCQ-capable flash-based devices


On 4 Jun 2014, at 15:56, Tejun Heo <[email protected]> wrote:

> Hello, Paolo.

Hello, and sorry for the late reply.

> […]
>>
>> Actually, over the last years we have been asked several times to
>> improve random-I/O performance on HDDs, by people who measured, for
>> the typical tasks performed by their machines, much lower throughput
>> than with the other schedulers. Major problems have been reported
>> for server workloads (database, web) and for btrfs. According to
>> the feedback received after introducing this optimization in bfq,
>> those problems finally seem to be gone.
>
> I see. The equal priority part can probably work in enough cases to
> be meaningful given that it just depends on the busy queues having the
> same weight instead of everything in the system. It'd be nice to note
> that in the comment tho.
>
> I'm still quite skeptical about the cgroup part tho. The triggering
> condition is too specific and fragile. If I'm reading the bfq blkcg
> implementation correctly, it seems to be applying the scheduling
> algorithm recursively walking down the tree one level at a time.

Yes.

> cfq
> does it differently. cfq flattens the hierarchy by calculating the
> nested weight of each active leaf queue and schedule all of them from
> the same service tree. IOW, the scheduling algorithm per se doesn't care
> about the hierarchy. All it sees are differing weights competing
> equally regardless of the hierarchical structure.
>

To preserve the desired distribution of the device throughput (or time), this
scheme entails updating weights every time the set of active queues changes.
For example (sorry for the trivial example, but I just want to make sure that I am
not misunderstanding what you are telling me), suppose that two groups,
A and B, are reserved 50% of the throughput each, and that the first group contains
two processes, whereas the second group contains just one process. Apart from the
additional per-queue interventions of cfq to improve latency or throughput, the
queues of the two processes in group A are reserved 50% of the group throughput
each. It follows that, if all three queues are backlogged, then a correct weight
distribution for a flat underlying scheduler is, e.g., 25 and 25 for the two processes
in group A, and 50 for the process in group B.

But, if one of the queues in group A becomes idle, then the correct weights
of the still-active queues become, e.g., 50 and 50.
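
Spelled out as flattened weights (the queue names qA1, qA2 and qB are
made up; the arithmetic is just the example above):

all three queues backlogged:
    w(qA1) = 50% * 1/2 = 25%   w(qA2) = 50% * 1/2 = 25%   w(qB) = 50%
one queue of group A goes idle:
    w(qA1) = 50% * 1/1 = 50%                               w(qB) = 50%

so a flat underlying scheduler has to update w(qA1) on every arrival
or departure of a queue in group A.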

Changing weights this way has basically no influence on the per-request latency
guarantees of cfq, because cfq is based on a round-robin scheme, and the latter
already suffers from a large deviation with respect to an ideal service. In contrast,
things change dramatically with an accurate scheduler, such as the internal
B-WF2Q+ scheduler of BFQ. In that case, as explained, e.g., here for packet
scheduling (but the problem is exactly the same)

http://www.algogroup.unimore.it/people/paolo/dynWF2Q+/dynWF2Q+.pdf

weight changes would just break service guarantees, and lead to the same
deviation as a round-robin scheduler. As proved in the same
document, to preserve guarantees, weight updates should be delayed/concealed
in a non-trivial (although computationally cheap) way.

> If the same strategy can be applied to bfq, possibly the same strategy
> of checking whether all the active queues have the same weight can be
> used regardless of blkcg? That'd be simpler and a lot more robust.
>

Unfortunately, because of the above problems, this scheme would break
service guarantees with an accurate scheduler such as bfq.

The hierarchical scheme used by bfq does not suffer from this problem,
also because it does implement the mechanisms described in the
above document. In particular, it allows these mechanisms to be
implemented in a quite simple way, whereas things would become
more convoluted after flattening the hierarchy.

If useful, I am willing to provide more details, although these aspects are most
certainly quite theoretical and boring.

> Another thing I'm curious about is that the logic that you're using to
> disable idling assumes that the disk will serve the queued commands
> more or less in a fair manner over time, right? If so, why does it
> matter whether the queues have the same weight? Shouldn't the
> bandwidth scheduling eventually make them converge to the specified
> weights over time? Isn't the wr_coeff > 1 test enough for maintaining
> reasonable responsiveness?

Unfortunately, this assumption is exactly one of the main mistakes that, when I
first defined bfq with Fabio, we found in most of the existing research I/O
schedulers. More precisely, your statement is true for async queues, but fails
for sync queues. The problem is that the (only) way in which a process pushes
a scheduler into giving it its reserved throughput is by issuing requests at a
higher rate than that at which they are served. But, with sync queues this
just cannot happen. In particular, postponing the service of a sync request
delays the arrival of the next, not-yet-arrived, sync request of the same
process. Intuitively, to the scheduler it is as if the process were happy with
the throughput it is receiving, because it happens to issue requests exactly
at that rate.

This problem is described in more detail, for example, in Section 5.3 of
http://www.algogroup.unimore.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf

bfq deals with this problem by properly back-shifting request timestamps when
needed. But idling is necessary for this mechanism to work (unless more
complex solutions are adopted).
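
Just to give the intuition, here is a minimal sketch of the idea (not
the actual bfq code; in particular, the bound that bfq puts on how far
timestamps may be shifted back is omitted):

typedef unsigned long long u64;

struct entity {
        u64 start, finish;      /* virtual start/finish timestamps */
        u64 weight;
};

/*
 * Plain WF2Q+ would timestamp a newly backlogged flow with
 * start = max(system virtual time, previous finish), which silently
 * charges a sync queue for the time during which its next request
 * simply could not arrive.  Back-shifting timestamps the queue as if
 * it had remained backlogged all along:
 */
static void requeue_sync(struct entity *e, u64 service)
{
        e->start  = e->finish;  /* may lie in the past ("back-shifted") */
        e->finish = e->start + service / e->weight;
}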

>
>> Besides, turning back to bfq, if its low-latency heuristics are
>> disabled for non-rotational devices, then, according to our results
>> with slower devices, such as SD Cards and eMMCs, latency easily
>> becomes unbearable, with no throughput gain.
>
> Hmmm... looking at the nonrot optimizations again, so yeah, assuming
> the weight counting is necessary for NCQ hdds, the part specific to
> ssds isn't that big. You probably wanna sequence it the other way
> around tho. This really should be primarily about disks at this
> point.
>
> The thing which still makes me cringe is how it scatters
> blk_queue_nonrot() tests across multiple places without clear
> explanation on what's going on. A bfqq being constantly seeky or not
> doesn't have much to do with whether the device is rotational or not.
> Its effect does and I don't think avoiding the overhead of keeping the
> counters is meaningful. Things like this make the code a lot harder
> to maintain in the long term as code is organized according to
> seemingly arbitrary optimization rather than semantic structure. So,
> let's please keep the accounting and optimization separate.
>

Added to the list of changes to make, thanks.

Paolo

> Thanks.
>
> --
> tejun


--
Paolo Valente
Algogroup
Dipartimento di Fisica, Informatica e Matematica
Via Campi, 213/B
41125 Modena - Italy
homepage: http://algogroup.unimore.it/people/paolo/

2014-06-19 01:14:51

by Tejun Heo

Subject: Re: [PATCH RFC - TAKE TWO - 11/12] block, bfq: boost the throughput on NCQ-capable flash-based devices

Hello,

On Mon, Jun 16, 2014 at 12:46:36PM +0200, Paolo Valente wrote:
> To preserve the desired distribution of the device throughput (or time), this
> scheme entails updating weights every time the set of active queues changes.
> For example (sorry for the trivial example, but I just want to make sure that I am
> not misunderstanding what you are telling me), suppose that two groups,
> A and B, are reserved 50% of the throughput each, and that the first group contains
> two processes, whereas the second group contains just one process. Apart from the
> additional per-queue interventions of cfq to improve latency or throughput, the
> queues of the two processes in group A are reserved 50% of the group throughput
> each. It follows that, if all three queues are backlogged, then a correct weight
> distribution for a flat underlying scheduler is, e.g., 25 and 25 for the two processes
> in group A, and 50 for the process in group B.
>
> But, if one of the queues in group A becomes idle, then the correct weights
> of the still-active queues become, e.g., 50 and 50.

Yeap, that's what cfq is doing now. Flattening priorities to the top
level each time the set of active queues changes. It probably is
introducing weird artifacts to scheduling but given the amount of
existing noise I don't think it'd make a noticeable difference.

> Changing weights this way has basically no influence on the per-request latency
> guarantees of cfq, because cfq is based on a round-robin scheme, and the latter
> already suffers from a large deviation with respect to an ideal service. In contrast,
> things change dramatically with an accurate scheduler, such as the internal
> B-WF2Q+ scheduler of BFQ. In that case, as explained, e.g., here for packet
> scheduling (but the problem is exactly the same)
>
> http://www.algogroup.unimore.it/people/paolo/dynWF2Q+/dynWF2Q+.pdf
>
> weight changes would just break service guarantees, and lead to the same
> deviation as a round-robin scheduler. As proved in the same
> document, to preserve guarantees, weight updates should be delayed/concealed
> in a non-trivial (although computationally cheap) way.

Ah, bummer. Flattening is so much easier to deal with.

> If useful, I am willing to provide more details, although these aspects are most
> certainly quite theoretical and boring.

Including a simplified intuitive explanation of why fully hierarchical
scheduling is necessary, with a reference to the more detailed
explanation, would be nice.

> > Another thing I'm curious about is that the logic that you're using to
> > disable idling assumes that the disk will serve the queued commands
> > more or less in a fair manner over time, right? If so, why does it
> > matter whether the queues have the same weight? Shouldn't the
> > bandwidth scheduling eventually make them converge to the specified
> > weights over time? Isn't the wr_coeff > 1 test enough for maintaining
> > reasonable responsiveness?
>
> Unfortunately, this assumption is exactly one of the main mistakes
> that, when I first defined bfq with Fabio, we found in most of the
> existing research I/O schedulers. More precisely, your statement is
> true for async queues, but fails for sync queues. The problem is that
> the (only) way in which a process pushes a scheduler into giving it
> its reserved throughput is by issuing requests at a higher rate than
> that at which they are served. But, with sync queues this just cannot
> happen. In particular, postponing the service of a sync request
> delays the arrival of the next, not-yet-arrived, sync request of the
> same process. Intuitively, to the scheduler it is as if the process
> were happy with the throughput it is receiving, because it happens to
> issue requests exactly at that rate.
>
> This problem is described in more detail, for example, in Section 5.3 of
> http://www.algogroup.unimore.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf
>
> bfq deals with this problem by properly back-shifting request timestamps when
> needed. But idling is necessary for this mechanism to work (unless more
> complex solutions are adopted).

Oh... I understand why idling is necessary to actually implement io
priorities. I was wondering about the optimization for queued devices
and why it matters whether the active queues have equal weight or not.
If the optimization can be used for three queues of weight 1, why
can't it be used for weights 1 and 2?

Thanks.

--
tejun