2018-04-24 14:32:33

by Paolo Valente

Subject: Re: testing io.low limit for blk-throttle



> On 23 Apr 2018, at 11:01, Joseph Qi <[email protected]> wrote:
>
>
>
> On 18/4/23 15:35, Paolo Valente wrote:
>>
>>
>>> On 23 Apr 2018, at 08:05, Joseph Qi <[email protected]> wrote:
>>>
>>> Hi Paolo,
>>
>> Hi Joseph,
>> thanks for chiming in.
>>
>>> What's your idle and latency config?
>>
>> I didn't set them at all, as the only (explicit) requirement in my
>> basic test is that one of the groups is guaranteed a minimum bps.
>>
>>
>>> IMO, io.low will allow others run more bandwidth if cgroup's average
>>> idle time is high or latency is low.
>>
>> What you say here makes me think that I simply misunderstood the
>> purpose of io.low. So, here is my problem/question: "I only need to
>> guarantee at least a minimum bandwidth, in bps, to a group. Is the
>> io.low limit the way to go?"
>>
>> I know that I can use just io.max (unless I misunderstood the goal of
>> io.max too :( ), but my extra purpose would be to not waste bandwidth
>> when some group is idle. Yet, as for now, io.low is not working even
>> for the first, simpler goal, i.e., guaranteeing a minimum bandwidth to
>> one group when all groups are active.
>>
>> Am I getting something wrong?
>>
>> Otherwise, if there are some special values for idle and latency
>> parameters that would make throttle work for my test, I'll be of
>> course happy to try them.
>>
> I think you can try idle time with 1000us for all cgroups, and latency
> target 100us for cgroup with low limit 100MB/s and 2000us for cgroups
> with low limit 10MB/s. That means cgroup with low latency target will
> be preferred.
> BTW, from my experience the parameters are not easy to set because
> they are strongly correlated to the cgroup IO behavior.
>

+Tejun (I guess he might be interested in the results below)

Hi Joseph,
thanks for chiming in. Your suggestion did work!

At first, I thought I had also understood the use of latency from the
outcome of your suggestion: "want low limit really guaranteed for a
group? set target latency to a low value for it." But then, as a
crosscheck, I repeated the same exact test, but reversing target
latencies: I gave 2000 to the interfered (the group with 100MB/s
limit) and 100 to the interferers. And the interfered still got more
than 100MB/s! So I exaggerated: 20000 to the interfered.
Same outcome :(

I tried really many other combinations, to try to figure this out, but
results seemed more or less random w.r.t. latency values. I
didn't even start to test different values for idle.

So, the only sound lesson that I seem to have learned is: if I want
low limits to be enforced, I have to set target latency and idle
explicitly. The actual values of latencies matter little, or not at
all. At least this holds for my simple tests.

At any rate, thanks to your help, Joseph, I could move to the most
interesting part for me: how effective is blk-throttle with low
limits? I could well be wrong again, but my results do not seem that
good. With the simplest type of non-toy example I considered, I
recorded throughput losses, apparently caused mainly by blk-throttle,
and ranging from 64% to 75%.

Here is a worst-case example. For each step, I'm reporting below the
command by which you can reproduce that step with the
thr-lat-with-interference benchmark of the S suite [1]. I just split
bandwidth equally among five groups, on my SSD. The device showed a
peak rate of ~515MB/s in this test, so I set rbps to 100MB/s for each
group (and tried various values, and combinations of values, for the
target latency, without any effect on the results). To begin, I made
every group do sequential reads. Everything worked perfectly fine.

But then I made one group do random I/O [2], and troubles began. Even
if the group doing random I/O was given a target latency of 100usec
(or lower), while the others had a target latency of 2000usec, the poor
random-I/O group got only 4.7 MB/s! (A single process doing 4k sync
random I/O reaches 25MB/s on my SSD.)

I guess things broke because low limits did not comply any longer with
the lower speed that device reached with the new, mixed workload: the
device reached 376MB/s, while the sum of the low limits was 500MB/s.
BTW the 'fault' for this loss of throughput lay not only with the device
and the workload: if I switched throttling off, then the device still
reached its peak rate, although granting only 1.3MB/s to the
random-I/O group.

So, to comply with the 376MB/s, I lowered the low limits to 74MB/s per
group (to avoid a too tight 75MB/s) [3]. A little better: the
random-I/O group got 7.2 MB/s. But the total throughput went down
further, to 289MB/s, and became again lower than the sum of the low
limits. Most certainly, this time the throughput went down mainly
because blk-throttling was serving the random I/O more than before.

To make a long story short, I ended up setting just 12MB/s as the low
limit for each group [4]. The random-I/O group was finally happy,
with a revitalizing 12.77MB/s. But the total throughput dropped down
to 127MB/s, i.e., ~25% of the peak rate of the device. Now the
throughput loss seemed undoubtedly the 'fault' of blk-throttle.
The latter was evidently over-throttling some group.

To sum up, for my device, 12MB/s seems to be the highest value for
which low limits can be guaranteed. But setting these limits entails
a high cost: if just one group really does random I/O, then 75% of the
throughput is lost.
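
The figures reported in the three steps above can be cross-checked with a
little arithmetic. The following is only a sketch over the numbers quoted
in this mail (nothing is newly measured); the names PEAK_RATE, GROUPS and
summarize are made up for illustration.

```python
# Figures quoted in the mail above, all in MB/s. Nothing here is newly measured.
PEAK_RATE = 515.0  # approximate peak rate of the SSD in this test
GROUPS = 5         # one interfered group plus four interferers

# (low limit per group, total throughput observed, random-I/O group throughput)
steps = [
    (100.0, 376.0, 4.7),
    (74.0, 289.0, 7.2),
    (12.0, 127.0, 12.77),
]


def summarize(low_limit, total, rand_bw):
    """Return derived figures for one step of the experiment."""
    return {
        "sum_of_limits": low_limit * GROUPS,
        "limits_fit_in_total": low_limit * GROUPS <= total,
        "loss_vs_peak": 1.0 - total / PEAK_RATE,
        "rand_meets_limit": rand_bw >= low_limit,
    }


for low_limit, total, rand_bw in steps:
    s = summarize(low_limit, total, rand_bw)
    print(
        f"low={low_limit:g} sum={s['sum_of_limits']:g} total={total:g} "
        f"loss={s['loss_vs_peak']:.0%} "
        f"limits_fit={s['limits_fit_in_total']} "
        f"rand_ok={s['rand_meets_limit']}"
    )
```

Only in the last step does the sum of the low limits fit within the
measured total throughput, and only there does the random-I/O group reach
its limit; the loss with respect to the peak rate is then about 75%.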

There would be other issues too. For example, 12MB/s might be too
little for the needs of some group in some time period. This fact would
make it extremely difficult, if at all possible, to set low limits that
comply with the needs of more dynamic (and probably more
realistic) workloads than the above one.

I think this is all, sorry for the long mail, I tried to shrink it as
much as possible. Looking forward to some feedback.

Thanks,
Paolo

[1] https://github.com/Algodev-github/S
[2] sudo ./thr-lat-with-interference.sh -b t -n 4 -w 100M -W 100M -t randread -L 2000
[3] sudo ./thr-lat-with-interference.sh -b t -n 4 -w 74M -W 74M -t randread -L 2000
[4] sudo ./thr-lat-with-interference.sh -b t -n 4 -w 12M -W 12M -t randread -L 2000



2018-04-25 12:16:04

by Joseph Qi

Subject: Re: testing io.low limit for blk-throttle

Hi Paolo,

On 18/4/24 20:12, Paolo Valente wrote:
>
>
>> On 23 Apr 2018, at 11:01, Joseph Qi <[email protected]> wrote:
>>
>>
>>
>> On 18/4/23 15:35, Paolo Valente wrote:
>>>
>>>
>>>> On 23 Apr 2018, at 08:05, Joseph Qi <[email protected]> wrote:
>>>>
>>>> Hi Paolo,
>>>
>>> Hi Joseph,
>>> thanks for chiming in.
>>>
>>>> What's your idle and latency config?
>>>
>>> I didn't set them at all, as the only (explicit) requirement in my
>>> basic test is that one of the groups is guaranteed a minimum bps.
>>>
>>>
>>>> IMO, io.low will allow others run more bandwidth if cgroup's average
>>>> idle time is high or latency is low.
>>>
>>> What you say here makes me think that I simply misunderstood the
>>> purpose of io.low. So, here is my problem/question: "I only need to
>>> guarantee at least a minimum bandwidth, in bps, to a group. Is the
>>> io.low limit the way to go?"
>>>
>>> I know that I can use just io.max (unless I misunderstood the goal of
>>> io.max too :( ), but my extra purpose would be to not waste bandwidth
>>> when some group is idle. Yet, as for now, io.low is not working even
>>> for the first, simpler goal, i.e., guaranteeing a minimum bandwidth to
>>> one group when all groups are active.
>>>
>>> Am I getting something wrong?
>>>
>>> Otherwise, if there are some special values for idle and latency
>>> parameters that would make throttle work for my test, I'll be of
>>> course happy to try them.
>>>
>> I think you can try idle time with 1000us for all cgroups, and latency
>> target 100us for cgroup with low limit 100MB/s and 2000us for cgroups
>> with low limit 10MB/s. That means cgroup with low latency target will
>> be preferred.
>> BTW, from my experience the parameters are not easy to set because
>> they are strongly correlated to the cgroup IO behavior.
>>
>
> +Tejun (I guess he might be interested in the results below)
>
> Hi Joseph,
> thanks for chiming in. Your suggestion did work!
>
> At first, I thought I had also understood the use of latency from the
> outcome of your suggestion: "want low limit really guaranteed for a
> group? set target latency to a low value for it." But then, as a
> crosscheck, I repeated the same exact test, but reversing target
> latencies: I gave 2000 to the interfered (the group with 100MB/s
> limit) and 100 to the interferers. And the interfered still got more
> than 100MB/s! So I exaggerated: 20000 to the interfered.
> Same outcome :(
>
> I tried really many other combinations, to try to figure this out, but
> results seemed more or less random w.r.t. latency values. I
> didn't even start to test different values for idle.
>
> So, the only sound lesson that I seem to have learned is: if I want
> low limits to be enforced, I have to set target latency and idle
> explicitly. The actual values of latencies matter little, or not at
> all. At least this holds for my simple tests.
>
> At any rate, thanks to your help, Joseph, I could move to the most
> interesting part for me: how effective is blk-throttle with low
> limits? I could well be wrong again, but my results do not seem that
> good. With the simplest type of non-toy example I considered, I
> recorded throughput losses, apparently caused mainly by blk-throttle,
> and ranging from 64% to 75%.
>
> Here is a worst-case example. For each step, I'm reporting below the
> command by which you can reproduce that step with the
> thr-lat-with-interference benchmark of the S suite [1]. I just split
> bandwidth equally among five groups, on my SSD. The device showed a
> peak rate of ~515MB/s in this test, so I set rbps to 100MB/s for each
> group (and tried various values, and combinations of values, for the
> target latency, without any effect on the results). To begin, I made
> every group do sequential reads. Everything worked perfectly fine.
>
> But then I made one group do random I/O [2], and troubles began. Even
> if the group doing random I/O was given a target latency of 100usec
> (or lower), while the others had a target latency of 2000usec, the poor
> random-I/O group got only 4.7 MB/s! (A single process doing 4k sync
> random I/O reaches 25MB/s on my SSD.)
>
> I guess things broke because low limits did not comply any longer with
> the lower speed that device reached with the new, mixed workload: the
> device reached 376MB/s, while the sum of the low limits was 500MB/s.
> BTW the 'fault' for this loss of throughput lay not only with the device
> and the workload: if I switched throttling off, then the device still
> reached its peak rate, although granting only 1.3MB/s to the
> random-I/O group.
>
> So, to comply with the 376MB/s, I lowered the low limits to 74MB/s per
> group (to avoid a too tight 75MB/s) [3]. A little better: the
> random-I/O group got 7.2 MB/s. But the total throughput went down
> further, to 289MB/s, and became again lower than the sum of the low
> limits. Most certainly, this time the throughput went down mainly
> because blk-throttling was serving the random I/O more than before.
>
> To make a long story short, I ended up setting just 12MB/s as the low
> limit for each group [4]. The random-I/O group was finally happy,
> with a revitalizing 12.77MB/s. But the total throughput dropped down
> to 127MB/s, i.e., ~25% of the peak rate of the device. Now the
> throughput loss seemed undoubtedly the 'fault' of blk-throttle.
> The latter was evidently over-throttling some group.
>
> To sum up, for my device, 12MB/s seems to be the highest value for
> which low limits can be guaranteed. But setting these limits entails
> a high cost: if just one group really does random I/O, then 75% of the
> throughput is lost.
>
> There would be other issues too. For example, 12MB/s might be too
> little for the needs of some group in some time period. This fact would
> make it extremely difficult, if at all possible, to set low limits that
> comply with the needs of more dynamic (and probably more
> realistic) workloads than the above one.
>
Could you run blktrace as well when testing your case? There are several
throtl traces that help analyze whether the problem is caused by frequent
upgrades/downgrades.
If all cgroups are just running under low, I'm afraid the case you
tested has something to do with how the SSD handles mixed-workload I/O.

Thanks,
Joseph

> I think this is all, sorry for the long mail, I tried to shrink it as
> much as possible. Looking forward to some feedback.
>
> Thanks,
> Paolo
>
> [1] https://github.com/Algodev-github/S
> [2] sudo ./thr-lat-with-interference.sh -b t -n 4 -w 100M -W 100M -t randread -L 2000
> [3] sudo ./thr-lat-with-interference.sh -b t -n 4 -w 74M -W 74M -t randread -L 2000
> [4] sudo ./thr-lat-with-interference.sh -b t -n 4 -w 12M -W 12M -t randread -L 2000
>

2018-04-26 18:33:25

by Tejun Heo

Subject: Re: testing io.low limit for blk-throttle

Hello,

On Tue, Apr 24, 2018 at 02:12:51PM +0200, Paolo Valente wrote:
> +Tejun (I guess he might be interested in the results below)

Our experiments didn't work out too well either. At this point, it
isn't clear whether io.low will ever leave experimental state. We're
trying to find a working solution.

Thanks.

--
tejun

2018-04-27 02:11:43

by jianchao.wang

Subject: Re: testing io.low limit for blk-throttle

Hi Tejun and Joseph

On 04/27/2018 02:32 AM, Tejun Heo wrote:
> Hello,
>
> On Tue, Apr 24, 2018 at 02:12:51PM +0200, Paolo Valente wrote:
>> +Tejun (I guess he might be interested in the results below)
>
> Our experiments didn't work out too well either. At this point, it
> isn't clear whether io.low will ever leave experimental state. We're
> trying to find a working solution.

Would you please take a look at the following two patches?

https://marc.info/?l=linux-block&m=152325456307423&w=2
https://marc.info/?l=linux-block&m=152325457607425&w=2

In addition, when I tested blk-throtl io.low on an NVMe card, I found
that, even if the iops had been lower than the io.low limit for a while,
the downgrade always failed, due to the group not being idle.

tg->latency_target && tg->bio_cnt &&
tg->bad_bio_cnt * 5 < tg->bio_cn

The latency always looks good, even when the sum of the two groups' iops has
reached the maximum. So I disabled this check in my test; with that, plus the
2 patches above, io.low basically works.
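
For readers who want to experiment with the condition quoted above outside
the kernel, here is a minimal sketch of that expression alone. It is not
the kernel function it comes from, and the last identifier, which appears
truncated in the mail as bio_cn, is assumed here to be the same counter as
bio_cnt. The function name is made up for illustration.

```python
def latency_check_passes(latency_target, bio_cnt, bad_bio_cnt):
    """Evaluate only the expression quoted in the mail.

    It holds when a latency target is set, at least one bio has been
    counted, and fewer than one in five counted bios was a "bad" one.
    ASSUMPTION: the truncated `bio_cn` in the mail stands for `bio_cnt`.
    """
    return bool(latency_target and bio_cnt and bad_bio_cnt * 5 < bio_cnt)


# With 100 counted bios, the check stops holding once 20 of them are bad.
for bad in (0, 19, 20, 50):
    print(bad, latency_check_passes(10, 100, bad))
```

The threshold is strict: exactly one bad bio in five is already enough to
make the expression false.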

My NVMe card's max bps is ~600M, and max iops is ~160k.
Here is my config
io.low riops=50000 wiops=50000 rbps=209715200 wbps=209715200 idle=200 latency=10
io.max riops=150000
There are two cgroups in my test; both of them have the same config.
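
To make the units in the config above easier to read, the sketch below
parses the io.low settings as quoted and converts the byte rates. The
helper names are made up for illustration, and the device number "259:0"
is purely hypothetical: the mail does not say which device the limits were
applied to.

```python
# The io.low settings exactly as quoted in the mail.
CONFIG = "riops=50000 wiops=50000 rbps=209715200 wbps=209715200 idle=200 latency=10"


def parse_limits(text):
    """Turn 'key=value key=value ...' into a dict of integers."""
    return {key: int(value) for key, value in
            (item.split("=", 1) for item in text.split())}


def format_limit_line(device, limits):
    """Build a 'MAJ:MIN key=value ...' line for a given device number."""
    return device + " " + " ".join(f"{k}={v}" for k, v in limits.items())


limits = parse_limits(CONFIG)

# rbps/wbps are in bytes per second; express them in MiB/s.
print("rbps in MiB/s:", limits["rbps"] / 2**20)

# Combined low-limit read iops of the two identically configured cgroups.
print("sum of riops low limits:", 2 * limits["riops"])

# "259:0" is a HYPOTHETICAL device number, used only to show the line shape.
print(format_limit_line("259:0", limits))
```

With two cgroups, the read iops low limits add up to 100k, which is below
the ~160k maximum quoted for the card.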

I say "basically works" because the iops of the two cgroups jump up and down.
For example, when I launched one fio test per cgroup, the iops fluctuated as follows:

group0 30k 50k 70k 60k 40k
group1 120k 100k 80k 90k 110k

However, if I launched two fio tests in only one cgroup, the iops of the two tests
stayed at about 70k~80k.
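
A quick bit of arithmetic over the samples above, as a sketch only: it
uses just the five values quoted per group, and the variable names are
made up for illustration.

```python
# iops samples quoted in the mail, in thousands.
group0 = [30, 50, 70, 60, 40]
group1 = [120, 100, 80, 90, 110]

# Combined iops of the two cgroups at each sample point.
totals = [a + b for a, b in zip(group0, group1)]

# How far each group swings between its lowest and highest sample.
swing0 = max(group0) - min(group0)
swing1 = max(group1) - min(group1)

print("totals:", totals)
print("group0 swing:", swing0, "group1 swing:", swing1)
```

In these samples the combined iops is the same at every point, 150k, a
little below the ~160k maximum quoted for the card, while each group on
its own swings by 40k.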

Could you help explain this scenario?

Thanks in advance
Jianchao

2018-04-27 02:42:26

by Joseph Qi

Subject: Re: testing io.low limit for blk-throttle

Hi Jianchao,

On 18/4/27 10:09, jianchao.wang wrote:
> Hi Tejun and Joseph
>
> On 04/27/2018 02:32 AM, Tejun Heo wrote:
>> Hello,
>>
>> On Tue, Apr 24, 2018 at 02:12:51PM +0200, Paolo Valente wrote:
>>> +Tejun (I guess he might be interested in the results below)
>>
>> Our experiments didn't work out too well either. At this point, it
>> isn't clear whether io.low will ever leave experimental state. We're
>> trying to find a working solution.
>
> Would you please take a look at the following two patches?
>
> https://marc.info/?l=linux-block&m=152325456307423&w=2
> https://marc.info/?l=linux-block&m=152325457607425&w=2
>
> In addition, when I tested blk-throtl io.low on an NVMe card, I found
> that, even if the iops had been lower than the io.low limit for a while,
> the downgrade always failed, due to the group not being idle.
>
> tg->latency_target && tg->bio_cnt &&
> tg->bad_bio_cnt * 5 < tg->bio_cn
>

I'm afraid the latency check is a must for io.low because, in my tests,
the idle time check alone only works for simple scenarios.

Yes, in some cases last_low_overflow_time does have problems.
As for not downgrading properly, I've also posted two patches before,
which are awaiting Shaohua's review. You can give them a try as well.

https://patchwork.kernel.org/patch/10177185/
https://patchwork.kernel.org/patch/10177187/

Thanks,
Joseph

> The latency always looks good, even when the sum of the two groups' iops has
> reached the maximum. So I disabled this check in my test; with that, plus the
> 2 patches above, io.low basically works.
>
> My NVMe card's max bps is ~600M, and max iops is ~160k.
> Here is my config
> io.low riops=50000 wiops=50000 rbps=209715200 wbps=209715200 idle=200 latency=10
> io.max riops=150000
> There are two cgroups in my test; both of them have the same config.
>
> I say "basically works" because the iops of the two cgroups jump up and down.
> For example, when I launched one fio test per cgroup, the iops fluctuated as follows:
>
> group0 30k 50k 70k 60k 40k
> group1 120k 100k 80k 90k 110k
>
> However, if I launched two fio tests in only one cgroup, the iops of the two tests
> stayed at about 70k~80k.
>
> Could you help explain this scenario?
>
> Thanks in advance
> Jianchao
>

2018-05-03 16:35:39

by Paolo Valente

Subject: Re: testing io.low limit for blk-throttle



> On 26 Apr 2018, at 20:32, Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Tue, Apr 24, 2018 at 02:12:51PM +0200, Paolo Valente wrote:
>> +Tejun (I guess he might be interested in the results below)
>
> Our experiments didn't work out too well either. At this point, it
> isn't clear whether io.low will ever leave experimental state. We're
> trying to find a working solution.
>

Thanks for this update, Tejun. I'm still working (very slowly) on a
survey of the current state of affairs in terms of bandwidth and
latency guarantees in the block layer. The synthesis of the results
I've collected so far is, more or less:

"The problem of reaching a high throughput and, at the same time,
guaranteeing bandwidth and latency is still unsolved, apart from
simple cases, such as homogeneous, constant workloads"

I'm sharing this in advance, because I don't want to risk underestimating
anybody's work. So, if anyone has examples of how, e.g., to
distribute I/O bandwidth as desired among heterogeneous workloads (for
instance, random vs sequential workloads) that might fluctuate over
time, without losing total throughput, please tell me, and I'll test
them.

Thanks,
Paolo

> Thanks.
>
> --
> tejun