2019-05-22 10:02:27

by Srivatsa S. Bhat

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

On 5/22/19 2:09 AM, Paolo Valente wrote:
>
> First, thank you very much for testing my patches, and, above all, for
> sharing those huge traces!
>
> According to your traces, the residual 20% lower throughput that you
> record is due to the fact that the BFQ injection mechanism takes a few
> hundredths of seconds to stabilize, at the beginning of the workload.
> During that setup time, the throughput is equal to the dreadful ~60-90 KB/s
> that you see without this new patch. After that time, there
> seems to be no loss according to the trace.
>
> The problem is that a loss lasting only a few hundredths of seconds is
> however not negligible for a write workload that lasts only 3-4
> seconds. Could you please try writing a larger file?
>

I tried running dd for longer (about 100 seconds), but still saw around
1.4 MB/s throughput with BFQ, and between 1.5 MB/s and 1.6 MB/s with
mq-deadline and noop. But I'm not too worried about that difference.
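
For anyone who wants to reproduce this comparison, here is a rough
sketch of how it can be driven; the device name, mount point, dd
parameters and the use of oflag=dsync are illustrative assumptions,
not the exact setup I used:

#!/usr/bin/env python3
# Rough sketch: run the same dd write under each I/O scheduler and print
# the throughput line that dd reports on stderr.  Device, paths and dd
# parameters are illustrative assumptions, not the exact test setup.
import subprocess

DEV = "sdb"                        # assumed test disk
TESTFILE = "/mnt/ext4/ddtest.img"  # assumed file on the ext4 mount

def set_scheduler(name):
    # The active I/O scheduler is selected by writing its name to sysfs.
    with open(f"/sys/block/{DEV}/queue/scheduler", "w") as f:
        f.write(name)

# "none" is the blk-mq counterpart of the legacy noop scheduler.
for sched in ("bfq", "mq-deadline", "none"):
    set_scheduler(sched)
    result = subprocess.run(
        ["dd", "if=/dev/zero", f"of={TESTFILE}", "bs=512",
         "count=200000", "oflag=dsync"],   # small synchronous writes (assumed)
        capture_output=True, text=True)
    print(sched + ":", result.stderr.strip().splitlines()[-1])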

> In addition, I wanted to ask you whether you measured BFQ throughput
> with traces disabled. This may make a difference.
>

The above result (1.4 MB/s) was obtained with traces disabled.

> After trying with a larger file, you can try with low_latency on.
> On my side, it causes results to become a little unstable across
> repetitions (which is expected).
>
With low_latency on, I get between 60 KB/s and 100 KB/s.
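
(For completeness, the low_latency knob lives in sysfs; a minimal
sketch of toggling it, again with an assumed device name:)

#!/usr/bin/env python3
# Minimal sketch: flip bfq's low_latency knob between runs.  "sdb" is an
# assumed device name; the knob is only present while bfq is the active
# scheduler.
DEV = "sdb"

def set_low_latency(enabled):
    with open(f"/sys/block/{DEV}/queue/iosched/low_latency", "w") as f:
        f.write("1" if enabled else "0")

set_low_latency(True)    # this run: ~60-100 KB/s
set_low_latency(False)   # previous runs: ~1.4 MB/s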

Regards,
Srivatsa
VMware Photon OS


2019-05-22 10:55:55

by Paolo Valente

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller



> On 22 May 2019, at 12:01, Srivatsa S. Bhat <[email protected]> wrote:
>
> On 5/22/19 2:09 AM, Paolo Valente wrote:
>>
>> First, thank you very much for testing my patches, and, above all, for
>> sharing those huge traces!
>>
>> According to your traces, the residual 20% lower throughput that you
>> record is due to the fact that the BFQ injection mechanism takes a few
>> hundredths of seconds to stabilize, at the beginning of the workload.
>> During that setup time, the throughput is equal to the dreadful ~60-90 KB/s
>> that you see without this new patch. After that time, there
>> seems to be no loss according to the trace.
>>
>> The problem is that a loss lasting only a few hundredths of seconds is
>> however not negligible for a write workload that lasts only 3-4
>> seconds. Could you please try writing a larger file?
>>
>
> I tried running dd for longer (about 100 seconds), but still saw around
> 1.4 MB/s throughput with BFQ, and between 1.5 MB/s and 1.6 MB/s with
> mq-deadline and noop.

Ok, then the cause now is the periodic reset of the mechanism.

It would be super easy to fill this gap by just gearing the mechanism
toward very aggressive injection. The problem is maintaining control.
As you can imagine from the performance gap between CFQ (or BFQ with
malfunctioning injection) and BFQ with this fix, it is very hard to
maximize throughput while at the same time preserving control over
per-group I/O.

On the bright side, you might be interested in one of the benefits
that BFQ gives in return for this ~10% loss of throughput, in a
scenario that may be important for you (judging from the affiliation
you report): from ~500% to ~1000% higher throughput when you have to
serve the I/O of multiple VMs, while at the very least guaranteeing
that no VM is starved [1]. The same holds with multiple clients or
containers, and in general with any set of entities that may compete
for storage.

[1] https://www.linaro.org/blog/io-bandwidth-management-for-production-quality-services/

> But I'm not too worried about that difference.
>
>> In addition, I wanted to ask you whether you measured BFQ throughput
>> with traces disabled. This may make a difference.
>>
>
> The above result (1.4 MB/s) was obtained with traces disabled.
>
>> After trying with a larger file, you can try with low_latency on.
>> On my side, it causes results to become a little unstable across
>> repetitions (which is expected).
>>
> With low_latency on, I get between 60 KB/s and 100 KB/s.
>

Gosh, a full regression. Fortunately, it is simply meaningless to use
low_latency in a scenario where the goal is to guarantee per-group
bandwidths. To reach their goals, the low-latency heuristics modify
the I/O schedule with respect to the schedule that best honors group
weights and boosts throughput. So, as recommended in the BFQ
documentation, just switch low_latency off if you want to control I/O
with groups. It may still make sense to leave low_latency on in some
specific cases, but I won't bother you with those here.

However, I feel bad about such a low throughput :) Would you be so
kind as to provide me with a trace?

Thanks,
Paolo

> Regards,
> Srivatsa
> VMware Photon OS


2019-05-23 02:31:25

by Srivatsa S. Bhat

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

On 5/22/19 3:54 AM, Paolo Valente wrote:
>
>
>> On 22 May 2019, at 12:01, Srivatsa S. Bhat <[email protected]> wrote:
>>
>> On 5/22/19 2:09 AM, Paolo Valente wrote:
>>>
>>> First, thank you very much for testing my patches, and, above all, for
>>> sharing those huge traces!
>>>
>>> According to your traces, the residual 20% lower throughput that you
>>> record is due to the fact that the BFQ injection mechanism takes a few
>>> hundredths of seconds to stabilize, at the beginning of the workload.
>>> During that setup time, the throughput is equal to the dreadful ~60-90 KB/s
>>> that you see without this new patch. After that time, there
>>> seems to be no loss according to the trace.
>>>
>>> The problem is that a loss lasting only a few hundredths of seconds is
>>> however not negligible for a write workload that lasts only 3-4
>>> seconds. Could you please try writing a larger file?
>>>
>>
>> I tried running dd for longer (about 100 seconds), but still saw around
>> 1.4 MB/s throughput with BFQ, and between 1.5 MB/s and 1.6 MB/s with
>> mq-deadline and noop.
>
> Ok, then the cause now is the periodic reset of the mechanism.
>
> It would be super easy to fill this gap by just gearing the mechanism
> toward very aggressive injection. The problem is maintaining control.
> As you can imagine from the performance gap between CFQ (or BFQ with
> malfunctioning injection) and BFQ with this fix, it is very hard to
> maximize throughput while at the same time preserving control over
> per-group I/O.
>

Ah, I see. Just to make sure that this fix doesn't overly optimize for
total throughput (because of the testcase we've been using) and end up
causing regressions in per-group I/O control, I ran a test with
multiple simultaneous dd instances, each writing to a different
portion of the filesystem (well separated, to induce seeks), and each
dd task bound to its own blkio cgroup. I saw similar results with and
without this patch, and the throughput was equally distributed among
all the dd tasks.
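
For completeness, a rough sketch of this multi-dd setup is below; it
assumes cgroup v1 with the blkio controller mounted at
/sys/fs/cgroup/blkio, and the group names, target paths and dd
parameters are illustrative rather than the exact ones I used:

#!/usr/bin/env python3
# Rough sketch of the fairness test: several dd writers, each placed in
# its own blkio cgroup and writing to a well-separated location.  Paths,
# group names and dd parameters are illustrative assumptions.
import os
import subprocess

CG_ROOT = "/sys/fs/cgroup/blkio"   # assumes a cgroup v1 blkio hierarchy
NUM_TASKS = 4

procs = []
for i in range(NUM_TASKS):
    cg = os.path.join(CG_ROOT, f"ddtest{i}")
    os.makedirs(cg, exist_ok=True)

    def enter_cgroup(path=cg):
        # Runs in the child between fork and exec, so dd starts its I/O
        # already confined to its own blkio cgroup.
        with open(os.path.join(path, "cgroup.procs"), "w") as f:
            f.write(str(os.getpid()))

    target = f"/mnt/ext4/dir{i}/test.img"   # well-separated target files
    p = subprocess.Popen(
        ["dd", "if=/dev/zero", f"of={target}", "bs=512",
         "count=100000", "oflag=dsync"],
        stderr=subprocess.PIPE, text=True, preexec_fn=enter_cgroup)
    procs.append(p)

for p in procs:
    _, err = p.communicate()
    print(err.strip().splitlines()[-1])   # dd's throughput summary line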

> On the bright side, you might be interested in one of the benefits
> that BFQ gives in return for this ~10% loss of throughput, in a
> scenario that may be important for you (judging from the affiliation
> you report): from ~500% to ~1000% higher throughput when you have to
> serve the I/O of multiple VMs, while at the very least guaranteeing
> that no VM is starved [1]. The same holds with multiple clients or
> containers, and in general with any set of entities that may compete
> for storage.
>
> [1] https://www.linaro.org/blog/io-bandwidth-management-for-production-quality-services/
>

Great article! :) Thank you for sharing it!

>> But I'm not too worried about that difference.
>>
>>> In addition, I wanted to ask you whether you measured BFQ throughput
>>> with traces disabled. This may make a difference.
>>>
>>
>> The above result (1.4 MB/s) was obtained with traces disabled.
>>
>>> After trying with a larger file, you can try with low_latency on.
>>> On my side, it causes results to become a little unstable across
>>> repetitions (which is expected).
>>>
>> With low_latency on, I get between 60 KB/s and 100 KB/s.
>>
>
> Gosh, a full regression. Fortunately, it is simply meaningless to use
> low_latency in a scenario where the goal is to guarantee per-group
> bandwidths. To reach their goals, the low-latency heuristics modify
> the I/O schedule with respect to the schedule that best honors group
> weights and boosts throughput. So, as recommended in the BFQ
> documentation, just switch low_latency off if you want to control I/O
> with groups. It may still make sense to leave low_latency on in some
> specific cases, but I won't bother you with those here.
>

My main concern here is about Linux's I/O performance out of the box,
i.e., with all default settings, which are:

- cgroups and blkio enabled (systemd default)
- blkio non-root cgroups in use (this is the implicit systemd behavior
if docker is installed; i.e., it runs tasks under user.slice)
- I/O scheduler with blkio group sched support: bfq
- bfq default configuration: low_latency = 1

If this yields throughput that is 10x-30x lower than what is
achievable, I think we should either fix the code (if possible) or
change the defaults so that they don't lead to this performance
collapse (perhaps default low_latency to 0 when bfq group scheduling
is in use?).
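
(For reference, a quick sketch for checking that out-of-the-box state
on a given machine; the device name is an assumption:)

#!/usr/bin/env python3
# Quick sketch: inspect the defaults listed above on a running system.
# "sdb" is an assumed device name.
DEV = "sdb"

# The active scheduler is shown in square brackets, e.g. "mq-deadline [bfq] none".
with open(f"/sys/block/{DEV}/queue/scheduler") as f:
    print("scheduler:", f.read().strip())

# bfq's low_latency knob (present only while bfq is active); 1 by default.
try:
    with open(f"/sys/block/{DEV}/queue/iosched/low_latency") as f:
        print("low_latency:", f.read().strip())
except FileNotFoundError:
    print("low_latency: n/a (bfq not active)")

# Which cgroups the current shell is running in (e.g. under user.slice).
with open("/proc/self/cgroup") as f:
    print(f.read().strip())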

> However, I feel bad about such a low throughput :) Would you be so
> kind as to provide me with a trace?
>
Certainly! Short runs of dd resulted in a lot of variation in the
throughput (between 60 KB/s and 1 MB/s), so I increased dd's runtime
to get repeatable numbers (~70 KB/s). As a result, the trace file
(trace-bfq-boost-injection-low-latency-71KBps) is quite large, and
is available here:

https://www.dropbox.com/s/svqfbv0idcg17pn/bfq-traces.tar.gz?dl=0

Also, I'm very happy to run additional tests or experiments to help
track down this issue. So, please don't hesitate to let me know if
you'd like me to try anything else or get you additional traces etc. :)

Thank you!

Regards,
Srivatsa
VMware Photon OS

2019-05-23 23:35:24

by Srivatsa S. Bhat

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

On 5/22/19 7:30 PM, Srivatsa S. Bhat wrote:
> On 5/22/19 3:54 AM, Paolo Valente wrote:
>>
>>
>>> On 22 May 2019, at 12:01, Srivatsa S. Bhat <[email protected]> wrote:
>>>
>>> On 5/22/19 2:09 AM, Paolo Valente wrote:
>>>>
>>>> First, thank you very much for testing my patches, and, above all, for
>>>> sharing those huge traces!
>>>>
>>>> According to your traces, the residual 20% lower throughput that you
>>>> record is due to the fact that the BFQ injection mechanism takes a few
>>>> hundredths of seconds to stabilize, at the beginning of the workload.
>>>> During that setup time, the throughput is equal to the dreadful ~60-90 KB/s
>>>> that you see without this new patch. After that time, there
>>>> seems to be no loss according to the trace.
>>>>
>>>> The problem is that a loss lasting only a few hundredths of seconds is
>>>> however not negligible for a write workload that lasts only 3-4
>>>> seconds. Could you please try writing a larger file?
>>>>
>>>
>>> I tried running dd for longer (about 100 seconds), but still saw around
>>> 1.4 MB/s throughput with BFQ, and between 1.5 MB/s and 1.6 MB/s with
>>> mq-deadline and noop.
>>
>> Ok, then the cause now is the periodic reset of the mechanism.
>>
>> It would be super easy to fill this gap by just gearing the mechanism
>> toward very aggressive injection. The problem is maintaining control.
>> As you can imagine from the performance gap between CFQ (or BFQ with
>> malfunctioning injection) and BFQ with this fix, it is very hard to
>> maximize throughput while at the same time preserving control over
>> per-group I/O.
>>
>
> Ah, I see. Just to make sure that this fix doesn't overly optimize for
> total throughput (because of the testcase we've been using) and end up
> causing regressions in per-group I/O control, I ran a test with
> multiple simultaneous dd instances, each writing to a different
> portion of the filesystem (well separated, to induce seeks), and each
> dd task bound to its own blkio cgroup. I saw similar results with and
> without this patch, and the throughput was equally distributed among
> all the dd tasks.
>
Actually, it turns out that I ran the dd tasks directly on the block
device for this experiment, and not on top of ext4. I'll redo this on
ext4 and report back soon.

Regards,
Srivatsa
VMware Photon OS

2019-05-30 08:40:48

by Srivatsa S. Bhat

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

On 5/23/19 4:32 PM, Srivatsa S. Bhat wrote:
> On 5/22/19 7:30 PM, Srivatsa S. Bhat wrote:
>> On 5/22/19 3:54 AM, Paolo Valente wrote:
>>>
>>>
>>>> On 22 May 2019, at 12:01, Srivatsa S. Bhat <[email protected]> wrote:
>>>>
>>>> On 5/22/19 2:09 AM, Paolo Valente wrote:
>>>>>
>>>>> First, thank you very much for testing my patches, and, above all, for
>>>>> sharing those huge traces!
>>>>>
>>>>> According to your traces, the residual 20% lower throughput that you
>>>>> record is due to the fact that the BFQ injection mechanism takes a few
>>>>> hundredths of seconds to stabilize, at the beginning of the workload.
>>>>> During that setup time, the throughput is equal to the dreadful ~60-90 KB/s
>>>>> that you see without this new patch. After that time, there
>>>>> seems to be no loss according to the trace.
>>>>>
>>>>> The problem is that a loss lasting only a few hundredths of seconds is
>>>>> however not negligible for a write workload that lasts only 3-4
>>>>> seconds. Could you please try writing a larger file?
>>>>>
>>>>
>>>> I tried running dd for longer (about 100 seconds), but still saw around
>>>> 1.4 MB/s throughput with BFQ, and between 1.5 MB/s and 1.6 MB/s with
>>>> mq-deadline and noop.
>>>
>>> Ok, then the cause now is the periodic reset of the mechanism.
>>>
>>> It would be super easy to fill this gap by just gearing the mechanism
>>> toward very aggressive injection. The problem is maintaining control.
>>> As you can imagine from the performance gap between CFQ (or BFQ with
>>> malfunctioning injection) and BFQ with this fix, it is very hard to
>>> maximize throughput while at the same time preserving control over
>>> per-group I/O.
>>>
>>
>> Ah, I see. Just to make sure that this fix doesn't overly optimize for
>> total throughput (because of the testcase we've been using) and end up
>> causing regressions in per-group I/O control, I ran a test with
>> multiple simultaneous dd instances, each writing to a different
>> portion of the filesystem (well separated, to induce seeks), and each
>> dd task bound to its own blkio cgroup. I saw similar results with and
>> without this patch, and the throughput was equally distributed among
>> all the dd tasks.
>>
> Actually, it turns out that I ran the dd tasks directly on the block
> device for this experiment, and not on top of ext4. I'll redo this on
> ext4 and report back soon.
>

With all your patches applied (including waker detection for the low
latency case), I ran four simultaneous dd instances, each writing to a
different ext4 partition, and each dd task bound to its own blkio
cgroup. The throughput continued to be well distributed among the dd
tasks, as shown below (I increased dd's block size from 512B to 8KB
for these experiments to get double-digit throughput numbers, so as to
make comparisons easier).

bfq with low_latency = 1:

819200000 bytes (819 MB, 781 MiB) copied, 16452.6 s, 49.8 kB/s
819200000 bytes (819 MB, 781 MiB) copied, 17139.6 s, 47.8 kB/s
819200000 bytes (819 MB, 781 MiB) copied, 17251.7 s, 47.5 kB/s
819200000 bytes (819 MB, 781 MiB) copied, 17384 s, 47.1 kB/s

bfq with low_latency = 0:

819200000 bytes (819 MB, 781 MiB) copied, 16257.9 s, 50.4 kB/s
819200000 bytes (819 MB, 781 MiB) copied, 17204.5 s, 47.6 kB/s
819200000 bytes (819 MB, 781 MiB) copied, 17220.6 s, 47.6 kB/s
819200000 bytes (819 MB, 781 MiB) copied, 17348.1 s, 47.2 kB/s
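
As a sanity check on the numbers above: each instance wrote the same
819200000 bytes, and dividing that by the elapsed times reproduces the
rates dd reports, with the fastest and slowest task differing by only
about 5-6% in both configurations. For example:

# Sanity check: bytes / seconds, in dd's decimal kB.
for t in (16452.6, 17139.6, 17251.7, 17384.0):   # low_latency = 1
    print(round(819200000 / t / 1000, 1))        # -> 49.8, 47.8, 47.5, 47.1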

Regards,
Srivatsa
VMware Photon OS