2019-05-18 19:04:08

by Paolo Valente

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

I've addressed these issues in my last batch of improvements for BFQ, which landed in the upcoming 5.2. If you give it a try, and still see the problem, then I'll be glad to reproduce it, and hopefully fix it for you.

Thanks,
Paolo

> On 18 May 2019, at 00:16, Srivatsa S. Bhat <[email protected]> wrote:
>
>
> Hi,
>
> One of my colleagues noticed up to a 10x-30x drop in I/O throughput
> running the following command, with the CFQ I/O scheduler:
>
> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>
> Throughput with CFQ: 60 KB/s
> Throughput with noop or deadline: 1.5 MB/s - 2 MB/s
>
> I spent some time looking into it and found that this is caused by the
> undesirable interaction between 4 different components:
>
> - blkio cgroup controller enabled
> - ext4 with the jbd2 kthread running in the root blkio cgroup
> - dd running on ext4, in any other blkio cgroup than that of jbd2
> - CFQ I/O scheduler with defaults for slice_idle and group_idle
>
>
> When docker is enabled, systemd creates a blkio cgroup called
> system.slice to run system services (and docker) under it, and a
> separate blkio cgroup called user.slice for user processes. So, when
> dd is invoked, it runs under user.slice.
>
> The dd command above includes the dsync flag, which performs an
> fdatasync after every write to the output file. Since dd is writing to
> a file on ext4, jbd2 will be active, committing transactions
> corresponding to those fdatasync requests from dd. (In other words, dd
> depends on jbd2 in order to make forward progress.) But jbd2, being a
> kernel thread, runs in the root blkio cgroup, as opposed to dd, which
> runs under user.slice.
>
> Now, if the I/O scheduler in use for the underlying block device is
> CFQ, then its inter-queue/inter-group idling takes effect (via the
> slice_idle and group_idle parameters, both of which default to 8ms).
> Therefore, every time CFQ switches between processing requests from dd
> vs jbd2, this 8ms idle time is injected, which slows down the overall
> throughput tremendously!
>
> To verify this theory, I tried various experiments, and in all cases,
> the 4 pre-conditions mentioned above were necessary to reproduce this
> performance drop. For example, if I used an XFS filesystem (which
> doesn't use a separate kthread like jbd2 for journaling), or if I dd'ed
> directly to a block device, I couldn't reproduce the performance
> issue. Similarly, running dd in the root blkio cgroup (where jbd2
> runs) also gets full performance; as does using the noop or deadline
> I/O schedulers; or even CFQ itself, with slice_idle and group_idle set
> to zero.
>
> These results were reproduced on a Linux VM (kernel v4.19) on ESXi,
> both with virtualized storage as well as with disk pass-through,
> backed by a rotational hard disk in both cases. The same problem was
> also seen with the BFQ I/O scheduler in kernel v5.1.
>
> Searching for any earlier discussions of this problem, I found an old
> thread on LKML that encountered this behavior [1], as well as a docker
> github issue [2] with similar symptoms (mentioned later in the
> thread).
>
> So, I'm curious to know if this is a well-understood problem and if
> anybody has any thoughts on how to fix it.
>
> Thank you very much!
>
>
> [1]. https://lkml.org/lkml/2015/11/19/359
>
> [2]. https://github.com/moby/moby/issues/21485
> https://github.com/moby/moby/issues/21485#issuecomment-222941103
>
> Regards,
> Srivatsa



2019-05-18 19:55:04

by Theodore Ts'o

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

On Sat, May 18, 2019 at 08:39:54PM +0200, Paolo Valente wrote:
> I've addressed these issues in my last batch of improvements for
> BFQ, which landed in the upcoming 5.2. If you give it a try, and
> still see the problem, then I'll be glad to reproduce it, and
> hopefully fix it for you.

Hi Paolo, I'm curious if you could give a quick summary about what you
changed in BFQ?

I was considering adding support so that when userspace calls fsync(2)
or fdatasync(2), we attach the process's CSS to the transaction, and
then charge all of the journal metadata writes to the process's CSS. If
there are multiple fsync's batched into the transaction, the first
process which forced the early transaction commit would get charged
the entire journal write. OTOH, journal writes are sequential I/O, so
the amount of disk time for writing the journal is going to be
relatively small, and the work from other cgroups in it is going to be
minimal, especially if they hadn't issued an fsync() themselves.

In the case where you have three cgroups all issuing fsync(2) and they
all landed in the same jbd2 transaction thanks to commit batching, in
the ideal world we would split up the disk time usage equally across
those three cgroups. But it's probably not worth doing that...

That being said, we probably do need some BFQ support, since in the
case where we have multiple processes doing buffered writes w/o fsync,
we do charge the data=ordered writeback to each block cgroup. Worse,
the commit can't complete until all of the data integrity
writebacks have completed. And if there are N cgroups with dirty
inodes, and slice_idle set to 8ms, there is going to be 8*N ms worth
of idle time tacked onto the commit time.

If we charge the journal I/O to the cgroup, and there's only one
process doing the

dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync

then we don't need to worry about this failure mode, since both the
journal I/O and the data writeback will be hitting the same cgroup.
But that's arguably an artificial use case, and much more commonly
there will be multiple cgroups all trying to do at least some file system
I/O.
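
For anyone who wants to measure that multiple-cgroup case, a minimal
sketch (cgroup v1 paths assumed; N and the file names are arbitrary)
would be to run the same dsync dd from N separate blkio cgroups and
watch the per-cgroup idle time pile up around each commit:

N=4
for i in $(seq 1 $N); do
    mkdir -p /sys/fs/cgroup/blkio/grp$i
    (
        # move this subshell (and hence its dd) into its own blkio cgroup
        echo $BASHPID > /sys/fs/cgroup/blkio/grp$i/cgroup.procs
        dd if=/dev/zero of=/root/test$i.img bs=512 count=10000 oflag=dsync
    ) &
done
wait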

- Ted

2019-05-18 20:52:48

by Srivatsa S. Bhat

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

On 5/18/19 11:39 AM, Paolo Valente wrote:
> I've addressed these issues in my last batch of improvements for BFQ,
> which landed in the upcoming 5.2. If you give it a try, and still see
> the problem, then I'll be glad to reproduce it, and hopefully fix it
> for you.
>

Hi Paolo,

Thank you for looking into this!

I just tried current mainline at commit 72cf0b07, but unfortunately
didn't see any improvement:

dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync

With mq-deadline, I get:

5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.90981 s, 1.3 MB/s

With bfq, I get:
5120000 bytes (5.1 MB, 4.9 MiB) copied, 84.8216 s, 60.4 kB/s
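
Incidentally, 84.8 s for 10000 dsync'ed 512-byte writes works out to
roughly 8.5 ms per write, which looks consistent with one ~8 ms idle
period being paid per write. For reference, the idling knobs mentioned
in the original report live under the iosched directory; /dev/sdc below
is just an example:

# with CFQ on the legacy block layer (as in the original v4.19 report),
# zeroing these two knobs is the workaround mentioned in the report:
echo 0 > /sys/block/sdc/queue/iosched/slice_idle
echo 0 > /sys/block/sdc/queue/iosched/group_idle
# bfq exposes a slice_idle knob (also 8 ms by default) at the same place:
cat /sys/block/sdc/queue/iosched/slice_idle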

Please let me know if any more info about my setup might be helpful.

Thank you!

Regards,
Srivatsa
VMware Photon OS

>
>> On 18 May 2019, at 00:16, Srivatsa S. Bhat <[email protected]> wrote:
>>
>>
>> Hi,
>>
>> One of my colleagues noticed upto 10x - 30x drop in I/O throughput
>> running the following command, with the CFQ I/O scheduler:
>>
>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflags=dsync
>>
>> Throughput with CFQ: 60 KB/s
>> Throughput with noop or deadline: 1.5 MB/s - 2 MB/s
>>
>> I spent some time looking into it and found that this is caused by the
>> undesirable interaction between 4 different components:
>>
>> - blkio cgroup controller enabled
>> - ext4 with the jbd2 kthread running in the root blkio cgroup
>> - dd running on ext4, in any other blkio cgroup than that of jbd2
>> - CFQ I/O scheduler with defaults for slice_idle and group_idle
>>
>>
>> When docker is enabled, systemd creates a blkio cgroup called
>> system.slice to run system services (and docker) under it, and a
>> separate blkio cgroup called user.slice for user processes. So, when
>> dd is invoked, it runs under user.slice.
>>
>> The dd command above includes the dsync flag, which performs an
>> fdatasync after every write to the output file. Since dd is writing to
>> a file on ext4, jbd2 will be active, committing transactions
>> corresponding to those fdatasync requests from dd. (In other words, dd
>> depends on jdb2, in order to make forward progress). But jdb2 being a
>> kernel thread, runs in the root blkio cgroup, as opposed to dd, which
>> runs under user.slice.
>>
>> Now, if the I/O scheduler in use for the underlying block device is
>> CFQ, then its inter-queue/inter-group idling takes effect (via the
>> slice_idle and group_idle parameters, both of which default to 8ms).
>> Therefore, everytime CFQ switches between processing requests from dd
>> vs jbd2, this 8ms idle time is injected, which slows down the overall
>> throughput tremendously!
>>
>> To verify this theory, I tried various experiments, and in all cases,
>> the 4 pre-conditions mentioned above were necessary to reproduce this
>> performance drop. For example, if I used an XFS filesystem (which
>> doesn't use a separate kthread like jbd2 for journaling), or if I dd'ed
>> directly to a block device, I couldn't reproduce the performance
>> issue. Similarly, running dd in the root blkio cgroup (where jbd2
>> runs) also gets full performance; as does using the noop or deadline
>> I/O schedulers; or even CFQ itself, with slice_idle and group_idle set
>> to zero.
>>
>> These results were reproduced on a Linux VM (kernel v4.19) on ESXi,
>> both with virtualized storage as well as with disk pass-through,
>> backed by a rotational hard disk in both cases. The same problem was
>> also seen with the BFQ I/O scheduler in kernel v5.1.
>>
>> Searching for any earlier discussions of this problem, I found an old
>> thread on LKML that encountered this behavior [1], as well as a docker
>> github issue [2] with similar symptoms (mentioned later in the
>> thread).
>>
>> So, I'm curious to know if this is a well-understood problem and if
>> anybody has any thoughts on how to fix it.
>>
>> Thank you very much!
>>
>>
>> [1]. https://lkml.org/lkml/2015/11/19/359
>>
>> [2]. https://github.com/moby/moby/issues/21485
>> https://github.com/moby/moby/issues/21485#issuecomment-222941103
>>
>> Regards,
>> Srivatsa
>

2019-05-20 10:31:23

by Jan Kara

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

On Sat 18-05-19 15:28:47, Theodore Ts'o wrote:
> On Sat, May 18, 2019 at 08:39:54PM +0200, Paolo Valente wrote:
> > I've addressed these issues in my last batch of improvements for
> > BFQ, which landed in the upcoming 5.2. If you give it a try, and
> > still see the problem, then I'll be glad to reproduce it, and
> > hopefully fix it for you.
>
> Hi Paolo, I'm curious if you could give a quick summary about what you
> changed in BFQ?
>
> I was considering adding support so that if userspace calls fsync(2)
> or fdatasync(2), to attach the process's CSS to the transaction, and
> then charge all of the journal metadata writes the process's CSS. If
> there are multiple fsync's batched into the transaction, the first
> process which forced the early transaction commit would get charged
> the entire journal write. OTOH, journal writes are sequential I/O, so
> the amount of disk time for writing the journal is going to be
> relatively small, and especially, the fact that work from other
> cgroups is going to be minimal, especially if hadn't issued an
> fsync().

But this makes priority-inversion problems with the ext4 journal worse,
doesn't it? If we submit the journal commit in the blkio cgroup of some
random process, it may get throttled, which then effectively blocks the
whole filesystem. Or do you want to implement a more complex
back-pressure mechanism where you'd just account to a different blkio
cgroup during the journal commit and then throttle at a different
point, where you are not blocking other tasks from making progress?
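
To make the concern concrete, here is a hypothetical sketch using the
cgroup v1 blk-throttle interface (the 8:0 device number and the 64 KB/s
limit are just examples):

mkdir /sys/fs/cgroup/blkio/slowgrp
echo "8:0 65536" > /sys/fs/cgroup/blkio/slowgrp/blkio.throttle.write_bps_device
echo $$ > /sys/fs/cgroup/blkio/slowgrp/cgroup.procs
# If journal commits were charged to this cgroup, an fsync() issued from
# any other cgroup that has to wait for such a commit would now be stuck
# behind the 64 KB/s limit as well.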

> In the case where you have three cgroups all issuing fsync(2) and they
> all landed in the same jbd2 transaction thanks to commit batching, in
> the ideal world we would split up the disk time usage equally across
> those three cgroups. But it's probably not worth doing that...
>
> That being said, we probably do need some BFQ support, since in the
> case where we have multiple processes doing buffered writes w/o fsync,
> we do charnge the data=ordered writeback to each block cgroup. Worse,
> the commit can't complete until the all of the data integrity
> writebacks have completed. And if there are N cgroups with dirty
> inodes, and slice_idle set to 8ms, there is going to be 8*N ms worth
> of idle time tacked onto the commit time.

Yeah. At least in some cases, we know there won't be any more IO from a
particular cgroup in the near future (e.g. transaction commit completing,
or when the layers above the IO scheduler already know which IO they
are going to submit next), and in that case idling is just a waste of
time. But so far I haven't decided what a reasonably clean interface
for this should look like, one that isn't specific to a particular IO
scheduler implementation.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-05-20 11:39:13

by Paolo Valente

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller



> On 18 May 2019, at 22:50, Srivatsa S. Bhat <[email protected]> wrote:
>
> On 5/18/19 11:39 AM, Paolo Valente wrote:
>> I've addressed these issues in my last batch of improvements for BFQ,
>> which landed in the upcoming 5.2. If you give it a try, and still see
>> the problem, then I'll be glad to reproduce it, and hopefully fix it
>> for you.
>>
>
> Hi Paolo,
>
> Thank you for looking into this!
>
> I just tried current mainline at commit 72cf0b07, but unfortunately
> didn't see any improvement:
>
> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>
> With mq-deadline, I get:
>
> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.90981 s, 1.3 MB/s
>
> With bfq, I get:
> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 84.8216 s, 60.4 kB/s
>

Hi Srivatsa,
thanks for reproducing this on mainline. I seem to have reproduced a
bonsai-tree version of this issue. Before digging into the block
trace, I'd like to ask you for some feedback.

First, in my test, the total throughput of the disk happens to be
about 20 times as high as that enjoyed by dd, regardless of the I/O
scheduler. I guess this massive overhead is normal with dsync, but
I'd like to know whether it is about the same on your side. This will
help me understand whether I'll actually be analyzing the same
problem as yours.

Second, the commands I used follow. Do they implement your test case
correctly?

[root@localhost tmp]# mkdir /sys/fs/cgroup/blkio/testgrp
[root@localhost tmp]# echo $BASHPID > /sys/fs/cgroup/blkio/testgrp/cgroup.procs
[root@localhost tmp]# cat /sys/block/sda/queue/scheduler
[mq-deadline] bfq none
[root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
10000+0 record dentro
10000+0 record fuori
5120000 bytes (5,1 MB, 4,9 MiB) copied, 14,6892 s, 349 kB/s
[root@localhost tmp]# echo bfq > /sys/block/sda/queue/scheduler
[root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
10000+0 record dentro
10000+0 record fuori
5120000 bytes (5,1 MB, 4,9 MiB) copied, 20,1953 s, 254 kB/s
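
A further sanity check that may be worth running on both of our setups
(the jbd2 thread and device names below are just examples) is to verify
that dd and the journal thread really end up in different blkio groups:

grep blkio /proc/$(pgrep -o jbd2)/cgroup   # cgroup of the jbd2 kthread
grep blkio /proc/self/cgroup               # cgroup of the shell running dd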

Thanks,
Paolo

> Please let me know if any more info about my setup might be helpful.
>
> Thank you!
>
> Regards,
> Srivatsa
> VMware Photon OS
>
>>
>>> On 18 May 2019, at 00:16, Srivatsa S. Bhat <[email protected]> wrote:
>>>
>>>
>>> Hi,
>>>
>>> One of my colleagues noticed upto 10x - 30x drop in I/O throughput
>>> running the following command, with the CFQ I/O scheduler:
>>>
>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflags=dsync
>>>
>>> Throughput with CFQ: 60 KB/s
>>> Throughput with noop or deadline: 1.5 MB/s - 2 MB/s
>>>
>>> I spent some time looking into it and found that this is caused by the
>>> undesirable interaction between 4 different components:
>>>
>>> - blkio cgroup controller enabled
>>> - ext4 with the jbd2 kthread running in the root blkio cgroup
>>> - dd running on ext4, in any other blkio cgroup than that of jbd2
>>> - CFQ I/O scheduler with defaults for slice_idle and group_idle
>>>
>>>
>>> When docker is enabled, systemd creates a blkio cgroup called
>>> system.slice to run system services (and docker) under it, and a
>>> separate blkio cgroup called user.slice for user processes. So, when
>>> dd is invoked, it runs under user.slice.
>>>
>>> The dd command above includes the dsync flag, which performs an
>>> fdatasync after every write to the output file. Since dd is writing to
>>> a file on ext4, jbd2 will be active, committing transactions
>>> corresponding to those fdatasync requests from dd. (In other words, dd
>>> depends on jdb2, in order to make forward progress). But jdb2 being a
>>> kernel thread, runs in the root blkio cgroup, as opposed to dd, which
>>> runs under user.slice.
>>>
>>> Now, if the I/O scheduler in use for the underlying block device is
>>> CFQ, then its inter-queue/inter-group idling takes effect (via the
>>> slice_idle and group_idle parameters, both of which default to 8ms).
>>> Therefore, everytime CFQ switches between processing requests from dd
>>> vs jbd2, this 8ms idle time is injected, which slows down the overall
>>> throughput tremendously!
>>>
>>> To verify this theory, I tried various experiments, and in all cases,
>>> the 4 pre-conditions mentioned above were necessary to reproduce this
>>> performance drop. For example, if I used an XFS filesystem (which
>>> doesn't use a separate kthread like jbd2 for journaling), or if I dd'ed
>>> directly to a block device, I couldn't reproduce the performance
>>> issue. Similarly, running dd in the root blkio cgroup (where jbd2
>>> runs) also gets full performance; as does using the noop or deadline
>>> I/O schedulers; or even CFQ itself, with slice_idle and group_idle set
>>> to zero.
>>>
>>> These results were reproduced on a Linux VM (kernel v4.19) on ESXi,
>>> both with virtualized storage as well as with disk pass-through,
>>> backed by a rotational hard disk in both cases. The same problem was
>>> also seen with the BFQ I/O scheduler in kernel v5.1.
>>>
>>> Searching for any earlier discussions of this problem, I found an old
>>> thread on LKML that encountered this behavior [1], as well as a docker
>>> github issue [2] with similar symptoms (mentioned later in the
>>> thread).
>>>
>>> So, I'm curious to know if this is a well-understood problem and if
>>> anybody has any thoughts on how to fix it.
>>>
>>> Thank you very much!
>>>
>>>
>>> [1]. https://lkml.org/lkml/2015/11/19/359
>>>
>>> [2]. https://github.com/moby/moby/issues/21485
>>> https://github.com/moby/moby/issues/21485#issuecomment-222941103
>>>
>>> Regards,
>>> Srivatsa
>>
>



2019-05-20 11:40:15

by Paolo Valente

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller



> On 18 May 2019, at 21:28, Theodore Ts'o <[email protected]> wrote:
>
> On Sat, May 18, 2019 at 08:39:54PM +0200, Paolo Valente wrote:
>> I've addressed these issues in my last batch of improvements for
>> BFQ, which landed in the upcoming 5.2. If you give it a try, and
>> still see the problem, then I'll be glad to reproduce it, and
>> hopefully fix it for you.
>
> Hi Paolo, I'm curious if you could give a quick summary about what you
> changed in BFQ?
>

Here is the idea: while idling for a process, inject I/O from other
processes, to such an extent that no harm is caused to the process for
which we are idling. Details in this LWN article:
https://lwn.net/Articles/784267/
in section "Improving extra-service injection".

> I was considering adding support so that if userspace calls fsync(2)
> or fdatasync(2), to attach the process's CSS to the transaction, and
> then charge all of the journal metadata writes the process's CSS. If
> there are multiple fsync's batched into the transaction, the first
> process which forced the early transaction commit would get charged
> the entire journal write. OTOH, journal writes are sequential I/O, so
> the amount of disk time for writing the journal is going to be
> relatively small, and especially, the fact that work from other
> cgroups is going to be minimal, especially if hadn't issued an
> fsync().
>

Yeah, that's a longstanding and difficult instance of the general
too-short-blanket problem. Jan has already highlighted one of the
main issues in his reply. I'll add a design issue (from my point of
view): I'd find it a little odd that explicit sync transactions have an
owner to charge, while generic buffered writes do not.

I think Andrea Righi addressed related issues in his recent patch
proposal [1], so I've CCed him too.

[1] https://lkml.org/lkml/2019/3/9/220

> In the case where you have three cgroups all issuing fsync(2) and they
> all landed in the same jbd2 transaction thanks to commit batching, in
> the ideal world we would split up the disk time usage equally across
> those three cgroups. But it's probably not worth doing that...
>
> That being said, we probably do need some BFQ support, since in the
> case where we have multiple processes doing buffered writes w/o fsync,
> we do charnge the data=ordered writeback to each block cgroup. Worse,
> the commit can't complete until the all of the data integrity
> writebacks have completed. And if there are N cgroups with dirty
> inodes, and slice_idle set to 8ms, there is going to be 8*N ms worth
> of idle time tacked onto the commit time.
>

Jan already wrote part of what I wanted to reply here, so I'll
continue from his reply.

Thanks,
Paolo

> If we charge the journal I/O to the cgroup, and there's only one
> process doing the
>
> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflags=dsync
>
> then we don't need to worry about this failure mode, since both the
> journal I/O and the data writeback will be hitting the same cgroup.
> But that's arguably an artificial use case, and much more commonly
> there will be multiple cgroups all trying to at least some file system
> I/O.
>
> - Ted



2019-05-20 11:41:34

by Paolo Valente

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller



> On 20 May 2019, at 11:15, Jan Kara <[email protected]> wrote:
>
> On Sat 18-05-19 15:28:47, Theodore Ts'o wrote:
>> On Sat, May 18, 2019 at 08:39:54PM +0200, Paolo Valente wrote:
>>> I've addressed these issues in my last batch of improvements for
>>> BFQ, which landed in the upcoming 5.2. If you give it a try, and
>>> still see the problem, then I'll be glad to reproduce it, and
>>> hopefully fix it for you.
>>
>> Hi Paolo, I'm curious if you could give a quick summary about what you
>> changed in BFQ?
>>
>> I was considering adding support so that if userspace calls fsync(2)
>> or fdatasync(2), to attach the process's CSS to the transaction, and
>> then charge all of the journal metadata writes the process's CSS. If
>> there are multiple fsync's batched into the transaction, the first
>> process which forced the early transaction commit would get charged
>> the entire journal write. OTOH, journal writes are sequential I/O, so
>> the amount of disk time for writing the journal is going to be
>> relatively small, and especially, the fact that work from other
>> cgroups is going to be minimal, especially if hadn't issued an
>> fsync().
>
> But this makes priority-inversion problems with ext4 journal worse, doesn't
> it? If we submit journal commit in blkio cgroup of some random process, it
> may get throttled which then effectively blocks the whole filesystem. Or do
> you want to implement a more complex back-pressure mechanism where you'd
> just account to different blkio cgroup during journal commit and then
> throttle as different point where you are not blocking other tasks from
> progress?
>
>> In the case where you have three cgroups all issuing fsync(2) and they
>> all landed in the same jbd2 transaction thanks to commit batching, in
>> the ideal world we would split up the disk time usage equally across
>> those three cgroups. But it's probably not worth doing that...
>>
>> That being said, we probably do need some BFQ support, since in the
>> case where we have multiple processes doing buffered writes w/o fsync,
>> we do charnge the data=ordered writeback to each block cgroup. Worse,
>> the commit can't complete until the all of the data integrity
>> writebacks have completed. And if there are N cgroups with dirty
>> inodes, and slice_idle set to 8ms, there is going to be 8*N ms worth
>> of idle time tacked onto the commit time.
>
> Yeah. At least in some cases, we know there won't be any more IO from a
> particular cgroup in the near future (e.g. transaction commit completing,
> or when the layers above IO scheduler already know which IO they are going
> to submit next) and in that case idling is just a waste of time.

Yep. Issues like this are targeted exactly by the improvement I
mentioned in my previous reply.

> But so far
> I haven't decided how should look a reasonably clean interface for this
> that isn't specific to a particular IO scheduler implementation.
>

That's an interesting point. So far, I've assumed that nobody would
tell BFQ anything. But if you guys think that such communication may be
acceptable to some degree, then I'd be glad to try to come up with some
solution. For instance: some hook that any I/O scheduler may export, if
meaningful.

Thanks,
Paolo

> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR



2019-05-20 22:46:22

by Srivatsa S. Bhat

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

On 5/20/19 3:19 AM, Paolo Valente wrote:
>
>
>> On 18 May 2019, at 22:50, Srivatsa S. Bhat <[email protected]> wrote:
>>
>> On 5/18/19 11:39 AM, Paolo Valente wrote:
>>> I've addressed these issues in my last batch of improvements for BFQ,
>>> which landed in the upcoming 5.2. If you give it a try, and still see
>>> the problem, then I'll be glad to reproduce it, and hopefully fix it
>>> for you.
>>>
>>
>> Hi Paolo,
>>
>> Thank you for looking into this!
>>
>> I just tried current mainline at commit 72cf0b07, but unfortunately
>> didn't see any improvement:
>>
>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>
>> With mq-deadline, I get:
>>
>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.90981 s, 1.3 MB/s
>>
>> With bfq, I get:
>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 84.8216 s, 60.4 kB/s
>>
>
> Hi Srivatsa,
> thanks for reproducing this on mainline. I seem to have reproduced a
> bonsai-tree version of this issue. Before digging into the block
> trace, I'd like to ask you for some feedback.
>
> First, in my test, the total throughput of the disk happens to be
> about 20 times as high as that enjoyed by dd, regardless of the I/O
> scheduler. I guess this massive overhead is normal with dsync, but
> I'd like know whether it is about the same on your side. This will
> help me understand whether I'll actually be analyzing about the same
> problem as yours.
>

Do you mean to say the throughput obtained by dd'ing directly to the
block device (bypassing the filesystem)? That does give me a 20x
speedup with bs=512, but much more with a bigger block size (achieving
a max throughput of about 110 MB/s).

dd if=/dev/zero of=/dev/sdc bs=512 count=10000 conv=fsync
10000+0 records in
10000+0 records out
5120000 bytes (5.1 MB, 4.9 MiB) copied, 0.15257 s, 33.6 MB/s

dd if=/dev/zero of=/dev/sdc bs=4k count=10000 conv=fsync
10000+0 records in
10000+0 records out
40960000 bytes (41 MB, 39 MiB) copied, 0.395081 s, 104 MB/s

I'm testing this on a Toshiba MG03ACA1 (1TB) hard disk.
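
(For what it's worth, the two runs above use conv=fsync, i.e. a single
flush at the end; a raw-device run closer to the filesystem test would
flush after every write, along the lines of:

dd if=/dev/zero of=/dev/sdc bs=512 count=10000 oflag=dsync

with /dev/sdc being a scratch disk, since this overwrites its first
sectors.)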

> Second, the commands I used follow. Do they implement your test case
> correctly?
>
> [root@localhost tmp]# mkdir /sys/fs/cgroup/blkio/testgrp
> [root@localhost tmp]# echo $BASHPID > /sys/fs/cgroup/blkio/testgrp/cgroup.procs
> [root@localhost tmp]# cat /sys/block/sda/queue/scheduler
> [mq-deadline] bfq none
> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
> 10000+0 record dentro
> 10000+0 record fuori
> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 14,6892 s, 349 kB/s
> [root@localhost tmp]# echo bfq > /sys/block/sda/queue/scheduler
> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
> 10000+0 record dentro
> 10000+0 record fuori
> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 20,1953 s, 254 kB/s
>

Yes, this is indeed the testcase, although I see a much bigger
drop in performance with bfq, compared to the results from
your setup.

Regards,
Srivatsa

2019-05-21 06:25:53

by Paolo Valente

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller



> On 21 May 2019, at 00:45, Srivatsa S. Bhat <[email protected]> wrote:
>
> On 5/20/19 3:19 AM, Paolo Valente wrote:
>>
>>
>>> On 18 May 2019, at 22:50, Srivatsa S. Bhat <[email protected]> wrote:
>>>
>>> On 5/18/19 11:39 AM, Paolo Valente wrote:
>>>> I've addressed these issues in my last batch of improvements for BFQ,
>>>> which landed in the upcoming 5.2. If you give it a try, and still see
>>>> the problem, then I'll be glad to reproduce it, and hopefully fix it
>>>> for you.
>>>>
>>>
>>> Hi Paolo,
>>>
>>> Thank you for looking into this!
>>>
>>> I just tried current mainline at commit 72cf0b07, but unfortunately
>>> didn't see any improvement:
>>>
>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>>
>>> With mq-deadline, I get:
>>>
>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.90981 s, 1.3 MB/s
>>>
>>> With bfq, I get:
>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 84.8216 s, 60.4 kB/s
>>>
>>
>> Hi Srivatsa,
>> thanks for reproducing this on mainline. I seem to have reproduced a
>> bonsai-tree version of this issue. Before digging into the block
>> trace, I'd like to ask you for some feedback.
>>
>> First, in my test, the total throughput of the disk happens to be
>> about 20 times as high as that enjoyed by dd, regardless of the I/O
>> scheduler. I guess this massive overhead is normal with dsync, but
>> I'd like know whether it is about the same on your side. This will
>> help me understand whether I'll actually be analyzing about the same
>> problem as yours.
>>
>
> Do you mean to say the throughput obtained by dd'ing directly to the
> block device (bypassing the filesystem)?

No no, I mean simply what follows.

1) in one terminal:
[root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
10000+0 record dentro
10000+0 record fuori
5120000 bytes (5,1 MB, 4,9 MiB) copied, 14,6892 s, 349 kB/s

2) In a second terminal, while the dd is in progress in the first
terminal:
$ iostat -tmd /dev/sda 3
Linux 5.1.0+ (localhost.localdomain) 20/05/2019 _x86_64_ (2 CPU)

...
20/05/2019 11:40:17
Device tps MB_read/s MB_wrtn/s MB_read MB_wrtn
sda 2288,00 0,00 9,77 0 29

20/05/2019 11:40:20
Device tps MB_read/s MB_wrtn/s MB_read MB_wrtn
sda 2325,33 0,00 9,93 0 29

20/05/2019 11:40:23
Device tps MB_read/s MB_wrtn/s MB_read MB_wrtn
sda 2351,33 0,00 10,05 0 30
...

As you can see, the overall throughput (~10 MB/s) is more than 20
times as high as the dd throughput (~350 KB/s). But the dd is the
only source of I/O.

Do you also see such a huge difference?
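
In case it is useful, the idle gaps can also be observed directly with
blktrace while the dd is in progress (device name is just an example):

blktrace -d /dev/sda -w 10 -o - | blkparse -i - > trace.txt
# look for ~8 ms gaps between the completion (C) of a write from dd and
# the dispatch (D) of the following write from jbd2, and vice versa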

Thanks,
Paolo

> That does give me a 20x
> speedup with bs=512, but much more with a bigger block size (achieving
> a max throughput of about 110 MB/s).
>
> dd if=/dev/zero of=/dev/sdc bs=512 count=10000 conv=fsync
> 10000+0 records in
> 10000+0 records out
> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 0.15257 s, 33.6 MB/s
>
> dd if=/dev/zero of=/dev/sdc bs=4k count=10000 conv=fsync
> 10000+0 records in
> 10000+0 records out
> 40960000 bytes (41 MB, 39 MiB) copied, 0.395081 s, 104 MB/s
>
> I'm testing this on a Toshiba MG03ACA1 (1TB) hard disk.
>
>> Second, the commands I used follow. Do they implement your test case
>> correctly?
>>
>> [root@localhost tmp]# mkdir /sys/fs/cgroup/blkio/testgrp
>> [root@localhost tmp]# echo $BASHPID > /sys/fs/cgroup/blkio/testgrp/cgroup.procs
>> [root@localhost tmp]# cat /sys/block/sda/queue/scheduler
>> [mq-deadline] bfq none
>> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>> 10000+0 record dentro
>> 10000+0 record fuori
>> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 14,6892 s, 349 kB/s
>> [root@localhost tmp]# echo bfq > /sys/block/sda/queue/scheduler
>> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>> 10000+0 record dentro
>> 10000+0 record fuori
>> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 20,1953 s, 254 kB/s
>>
>
> Yes, this is indeed the testcase, although I see a much bigger
> drop in performance with bfq, compared to the results from
> your setup.
>
> Regards,
> Srivatsa



2019-05-21 07:40:06

by Andrea Righi

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

On Mon, May 20, 2019 at 12:38:32PM +0200, Paolo Valente wrote:
...
> > I was considering adding support so that if userspace calls fsync(2)
> > or fdatasync(2), to attach the process's CSS to the transaction, and
> > then charge all of the journal metadata writes the process's CSS. If
> > there are multiple fsync's batched into the transaction, the first
> > process which forced the early transaction commit would get charged
> > the entire journal write. OTOH, journal writes are sequential I/O, so
> > the amount of disk time for writing the journal is going to be
> > relatively small, and especially, the fact that work from other
> > cgroups is going to be minimal, especially if hadn't issued an
> > fsync().
> >
>
> Yeah, that's a longstanding and difficult instance of the general
> too-short-blanket problem. Jan has already highlighted one of the
> main issues in his reply. I'll add a design issue (from my point of
> view): I'd find a little odd that explicit sync transactions have an
> owner to charge, while generic buffered writes have not.
>
> I think Andrea Righi addressed related issues in his recent patch
> proposal [1], so I've CCed him too.
>
> [1] https://lkml.org/lkml/2019/3/9/220

If journal metadata writes are submitted using a process's CSS, the
commit may be throttled, and that can also indirectly throttle other
"high-priority" blkio cgroups, so I think that logic alone isn't enough.

We have discussed this priority-inversion problem with Josef and Tejun
(adding both of them in cc), the idea that seemed most reasonable was to
temporarily boost the priority of blkio cgroups when there are multiple
sync(2) waiters in the system.

More exactly, when I/O is going to be throttled for a specific blkio
cgroup, if there's any other blkio cgroup waiting for writeback I/O,
no throttling is applied (this logic can be refined by saving a list of
blkio sync(2) waiters and taking the highest I/O rate among them).

In addition to that Tejun mentioned that he would like to see a better
sync(2) isolation done at the fs namespace level. This last part still
needs to be defined and addressed.

However, even the simple logic above "no throttling if there's any other
sync(2) waiter" can already prevent big system lockups (see for example
the simple test case that I suggested here https://lkml.org/lkml/2019/),
so I think having this change alone would be a nice improvement already:

https://lkml.org/lkml/2019/3/9/220

Thanks,
-Andrea

>
> > In the case where you have three cgroups all issuing fsync(2) and they
> > all landed in the same jbd2 transaction thanks to commit batching, in
> > the ideal world we would split up the disk time usage equally across
> > those three cgroups. But it's probably not worth doing that...
> >
> > That being said, we probably do need some BFQ support, since in the
> > case where we have multiple processes doing buffered writes w/o fsync,
> > we do charnge the data=ordered writeback to each block cgroup. Worse,
> > the commit can't complete until the all of the data integrity
> > writebacks have completed. And if there are N cgroups with dirty
> > inodes, and slice_idle set to 8ms, there is going to be 8*N ms worth
> > of idle time tacked onto the commit time.
> >
>
> Jan already wrote part of what I wanted to reply here, so I'll
> continue from his reply.
>
> Thanks,
> Paolo
>
> > If we charge the journal I/O to the cgroup, and there's only one
> > process doing the
> >
> > dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflags=dsync
> >
> > then we don't need to worry about this failure mode, since both the
> > journal I/O and the data writeback will be hitting the same cgroup.
> > But that's arguably an artificial use case, and much more commonly
> > there will be multiple cgroups all trying to at least some file system
> > I/O.
> >
> > - Ted
>



2019-05-21 11:26:03

by Paolo Valente

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller


> Before digging into the block
> trace, I'd like to ask you for some feedback.
>
> First, in my test, the total throughput of the disk happens to be
> about 20 times as high as that enjoyed by dd, regardless of the I/O
> scheduler. I guess this massive overhead is normal with dsync, but
> I'd like know whether it is about the same on your side. This will
> help me understand whether I'll actually be analyzing about the same
> problem as yours.
>
> Second, the commands I used follow. Do they implement your test case
> correctly?
>
> [root@localhost tmp]# mkdir /sys/fs/cgroup/blkio/testgrp
> [root@localhost tmp]# echo $BASHPID > /sys/fs/cgroup/blkio/testgrp/cgroup.procs
> [root@localhost tmp]# cat /sys/block/sda/queue/scheduler
> [mq-deadline] bfq none
> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
> 10000+0 record dentro
> 10000+0 record fuori
> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 14,6892 s, 349 kB/s
> [root@localhost tmp]# echo bfq > /sys/block/sda/queue/scheduler
> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
> 10000+0 record dentro
> 10000+0 record fuori
> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 20,1953 s, 254 kB/s
>
> Thanks,
> Paolo
>
>> Please let me know if any more info about my setup might be helpful.
>>
>> Thank you!
>>
>> Regards,
>> Srivatsa
>> VMware Photon OS
>>
>>>
>>>> On 18 May 2019, at 00:16, Srivatsa S. Bhat <[email protected]> wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> One of my colleagues noticed upto 10x - 30x drop in I/O throughput
>>>> running the following command, with the CFQ I/O scheduler:
>>>>
>>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflags=dsync
>>>>
>>>> Throughput with CFQ: 60 KB/s
>>>> Throughput with noop or deadline: 1.5 MB/s - 2 MB/s
>>>>
>>>> I spent some time looking into it and found that this is caused by the
>>>> undesirable interaction between 4 different components:
>>>>
>>>> - blkio cgroup controller enabled
>>>> - ext4 with the jbd2 kthread running in the root blkio cgroup
>>>> - dd running on ext4, in any other blkio cgroup than that of jbd2
>>>> - CFQ I/O scheduler with defaults for slice_idle and group_idle
>>>>
>>>>
>>>> When docker is enabled, systemd creates a blkio cgroup called
>>>> system.slice to run system services (and docker) under it, and a
>>>> separate blkio cgroup called user.slice for user processes. So, when
>>>> dd is invoked, it runs under user.slice.
>>>>
>>>> The dd command above includes the dsync flag, which performs an
>>>> fdatasync after every write to the output file. Since dd is writing to
>>>> a file on ext4, jbd2 will be active, committing transactions
>>>> corresponding to those fdatasync requests from dd. (In other words, dd
>>>> depends on jdb2, in order to make forward progress). But jdb2 being a
>>>> kernel thread, runs in the root blkio cgroup, as opposed to dd, which
>>>> runs under user.slice.
>>>>
>>>> Now, if the I/O scheduler in use for the underlying block device is
>>>> CFQ, then its inter-queue/inter-group idling takes effect (via the
>>>> slice_idle and group_idle parameters, both of which default to 8ms).
>>>> Therefore, everytime CFQ switches between processing requests from dd
>>>> vs jbd2, this 8ms idle time is injected, which slows down the overall
>>>> throughput tremendously!
>>>>
>>>> To verify this theory, I tried various experiments, and in all cases,
>>>> the 4 pre-conditions mentioned above were necessary to reproduce this
>>>> performance drop. For example, if I used an XFS filesystem (which
>>>> doesn't use a separate kthread like jbd2 for journaling), or if I dd'ed
>>>> directly to a block device, I couldn't reproduce the performance
>>>> issue. Similarly, running dd in the root blkio cgroup (where jbd2
>>>> runs) also gets full performance; as does using the noop or deadline
>>>> I/O schedulers; or even CFQ itself, with slice_idle and group_idle set
>>>> to zero.
>>>>
>>>> These results were reproduced on a Linux VM (kernel v4.19) on ESXi,
>>>> both with virtualized storage as well as with disk pass-through,
>>>> backed by a rotational hard disk in both cases. The same problem was
>>>> also seen with the BFQ I/O scheduler in kernel v5.1.
>>>>
>>>> Searching for any earlier discussions of this problem, I found an old
>>>> thread on LKML that encountered this behavior [1], as well as a docker
>>>> github issue [2] with similar symptoms (mentioned later in the
>>>> thread).
>>>>
>>>> So, I'm curious to know if this is a well-understood problem and if
>>>> anybody has any thoughts on how to fix it.
>>>>
>>>> Thank you very much!
>>>>
>>>>
>>>> [1]. https://lkml.org/lkml/2015/11/19/359
>>>>
>>>> [2]. https://github.com/moby/moby/issues/21485
>>>> https://github.com/moby/moby/issues/21485#issuecomment-222941103
>>>>
>>>> Regards,
>>>> Srivatsa


Attachments:
dsync_test.sh (1.96 kB)

2019-05-21 13:21:37

by Paolo Valente

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller



> Looking forward to your trace,
> Paolo
>
> <dsync_test.sh>
>> Before digging into the block
>> trace, I'd like to ask you for some feedback.
>>
>> First, in my test, the total throughput of the disk happens to be
>> about 20 times as high as that enjoyed by dd, regardless of the I/O
>> scheduler. I guess this massive overhead is normal with dsync, but
>> I'd like know whether it is about the same on your side. This will
>> help me understand whether I'll actually be analyzing about the same
>> problem as yours.
>>
>> Second, the commands I used follow. Do they implement your test case
>> correctly?
>>
>> [root@localhost tmp]# mkdir /sys/fs/cgroup/blkio/testgrp
>> [root@localhost tmp]# echo $BASHPID > /sys/fs/cgroup/blkio/testgrp/cgroup.procs
>> [root@localhost tmp]# cat /sys/block/sda/queue/scheduler
>> [mq-deadline] bfq none
>> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>> 10000+0 record dentro
>> 10000+0 record fuori
>> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 14,6892 s, 349 kB/s
>> [root@localhost tmp]# echo bfq > /sys/block/sda/queue/scheduler
>> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>> 10000+0 record dentro
>> 10000+0 record fuori
>> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 20,1953 s, 254 kB/s
>>
>> Thanks,
>> Paolo
>>
>>> Please let me know if any more info about my setup might be helpful.
>>>
>>> Thank you!
>>>
>>> Regards,
>>> Srivatsa
>>> VMware Photon OS
>>>
>>>>
>>>>> On 18 May 2019, at 00:16, Srivatsa S. Bhat <[email protected]> wrote:
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> One of my colleagues noticed upto 10x - 30x drop in I/O throughput
>>>>> running the following command, with the CFQ I/O scheduler:
>>>>>
>>>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflags=dsync
>>>>>
>>>>> Throughput with CFQ: 60 KB/s
>>>>> Throughput with noop or deadline: 1.5 MB/s - 2 MB/s
>>>>>
>>>>> I spent some time looking into it and found that this is caused by the
>>>>> undesirable interaction between 4 different components:
>>>>>
>>>>> - blkio cgroup controller enabled
>>>>> - ext4 with the jbd2 kthread running in the root blkio cgroup
>>>>> - dd running on ext4, in any other blkio cgroup than that of jbd2
>>>>> - CFQ I/O scheduler with defaults for slice_idle and group_idle
>>>>>
>>>>>
>>>>> When docker is enabled, systemd creates a blkio cgroup called
>>>>> system.slice to run system services (and docker) under it, and a
>>>>> separate blkio cgroup called user.slice for user processes. So, when
>>>>> dd is invoked, it runs under user.slice.
>>>>>
>>>>> The dd command above includes the dsync flag, which performs an
>>>>> fdatasync after every write to the output file. Since dd is writing to
>>>>> a file on ext4, jbd2 will be active, committing transactions
>>>>> corresponding to those fdatasync requests from dd. (In other words, dd
>>>>> depends on jdb2, in order to make forward progress). But jdb2 being a
>>>>> kernel thread, runs in the root blkio cgroup, as opposed to dd, which
>>>>> runs under user.slice.
>>>>>
>>>>> Now, if the I/O scheduler in use for the underlying block device is
>>>>> CFQ, then its inter-queue/inter-group idling takes effect (via the
>>>>> slice_idle and group_idle parameters, both of which default to 8ms).
>>>>> Therefore, everytime CFQ switches between processing requests from dd
>>>>> vs jbd2, this 8ms idle time is injected, which slows down the overall
>>>>> throughput tremendously!
>>>>>
>>>>> To verify this theory, I tried various experiments, and in all cases,
>>>>> the 4 pre-conditions mentioned above were necessary to reproduce this
>>>>> performance drop. For example, if I used an XFS filesystem (which
>>>>> doesn't use a separate kthread like jbd2 for journaling), or if I dd'ed
>>>>> directly to a block device, I couldn't reproduce the performance
>>>>> issue. Similarly, running dd in the root blkio cgroup (where jbd2
>>>>> runs) also gets full performance; as does using the noop or deadline
>>>>> I/O schedulers; or even CFQ itself, with slice_idle and group_idle set
>>>>> to zero.
>>>>>
>>>>> These results were reproduced on a Linux VM (kernel v4.19) on ESXi,
>>>>> both with virtualized storage as well as with disk pass-through,
>>>>> backed by a rotational hard disk in both cases. The same problem was
>>>>> also seen with the BFQ I/O scheduler in kernel v5.1.
>>>>>
>>>>> Searching for any earlier discussions of this problem, I found an old
>>>>> thread on LKML that encountered this behavior [1], as well as a docker
>>>>> github issue [2] with similar symptoms (mentioned later in the
>>>>> thread).
>>>>>
>>>>> So, I'm curious to know if this is a well-understood problem and if
>>>>> anybody has any thoughts on how to fix it.
>>>>>
>>>>> Thank you very much!
>>>>>
>>>>>
>>>>> [1]. https://lkml.org/lkml/2015/11/19/359
>>>>>
>>>>> [2]. https://github.com/moby/moby/issues/21485
>>>>> https://github.com/moby/moby/issues/21485#issuecomment-222941103
>>>>>
>>>>> Regards,
>>>>> Srivatsa


Attachments:
0001-block-bfq-add-logs-and-BUG_ONs.patch.gz (26.65 kB)

2019-05-21 16:22:27

by Paolo Valente

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller



> Thanks,
> Paolo
>
> <0001-block-bfq-add-logs-and-BUG_ONs.patch.gz>
>
>> Looking forward to your trace,
>> Paolo
>>
>> <dsync_test.sh>
>>> Before digging into the block
>>> trace, I'd like to ask you for some feedback.
>>>
>>> First, in my test, the total throughput of the disk happens to be
>>> about 20 times as high as that enjoyed by dd, regardless of the I/O
>>> scheduler. I guess this massive overhead is normal with dsync, but
>>> I'd like know whether it is about the same on your side. This will
>>> help me understand whether I'll actually be analyzing about the same
>>> problem as yours.
>>>
>>> Second, the commands I used follow. Do they implement your test case
>>> correctly?
>>>
>>> [root@localhost tmp]# mkdir /sys/fs/cgroup/blkio/testgrp
>>> [root@localhost tmp]# echo $BASHPID > /sys/fs/cgroup/blkio/testgrp/cgroup.procs
>>> [root@localhost tmp]# cat /sys/block/sda/queue/scheduler
>>> [mq-deadline] bfq none
>>> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>> 10000+0 record dentro
>>> 10000+0 record fuori
>>> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 14,6892 s, 349 kB/s
>>> [root@localhost tmp]# echo bfq > /sys/block/sda/queue/scheduler
>>> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>> 10000+0 record dentro
>>> 10000+0 record fuori
>>> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 20,1953 s, 254 kB/s
>>>
>>> Thanks,
>>> Paolo
>>>
>>>> Please let me know if any more info about my setup might be helpful.
>>>>
>>>> Thank you!
>>>>
>>>> Regards,
>>>> Srivatsa
>>>> VMware Photon OS
>>>>
>>>>>
>>>>>> On 18 May 2019, at 00:16, Srivatsa S. Bhat <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> One of my colleagues noticed upto 10x - 30x drop in I/O throughput
>>>>>> running the following command, with the CFQ I/O scheduler:
>>>>>>
>>>>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflags=dsync
>>>>>>
>>>>>> Throughput with CFQ: 60 KB/s
>>>>>> Throughput with noop or deadline: 1.5 MB/s - 2 MB/s
>>>>>>
>>>>>> I spent some time looking into it and found that this is caused by the
>>>>>> undesirable interaction between 4 different components:
>>>>>>
>>>>>> - blkio cgroup controller enabled
>>>>>> - ext4 with the jbd2 kthread running in the root blkio cgroup
>>>>>> - dd running on ext4, in any other blkio cgroup than that of jbd2
>>>>>> - CFQ I/O scheduler with defaults for slice_idle and group_idle
>>>>>>
>>>>>>
>>>>>> When docker is enabled, systemd creates a blkio cgroup called
>>>>>> system.slice to run system services (and docker) under it, and a
>>>>>> separate blkio cgroup called user.slice for user processes. So, when
>>>>>> dd is invoked, it runs under user.slice.
>>>>>>
>>>>>> The dd command above includes the dsync flag, which performs an
>>>>>> fdatasync after every write to the output file. Since dd is writing to
>>>>>> a file on ext4, jbd2 will be active, committing transactions
>>>>>> corresponding to those fdatasync requests from dd. (In other words, dd
>>>>>> depends on jdb2, in order to make forward progress). But jdb2 being a
>>>>>> kernel thread, runs in the root blkio cgroup, as opposed to dd, which
>>>>>> runs under user.slice.
>>>>>>
>>>>>> Now, if the I/O scheduler in use for the underlying block device is
>>>>>> CFQ, then its inter-queue/inter-group idling takes effect (via the
>>>>>> slice_idle and group_idle parameters, both of which default to 8ms).
>>>>>> Therefore, everytime CFQ switches between processing requests from dd
>>>>>> vs jbd2, this 8ms idle time is injected, which slows down the overall
>>>>>> throughput tremendously!
>>>>>>
>>>>>> To verify this theory, I tried various experiments, and in all cases,
>>>>>> the 4 pre-conditions mentioned above were necessary to reproduce this
>>>>>> performance drop. For example, if I used an XFS filesystem (which
>>>>>> doesn't use a separate kthread like jbd2 for journaling), or if I dd'ed
>>>>>> directly to a block device, I couldn't reproduce the performance
>>>>>> issue. Similarly, running dd in the root blkio cgroup (where jbd2
>>>>>> runs) also gets full performance; as does using the noop or deadline
>>>>>> I/O schedulers; or even CFQ itself, with slice_idle and group_idle set
>>>>>> to zero.
>>>>>>
>>>>>> These results were reproduced on a Linux VM (kernel v4.19) on ESXi,
>>>>>> both with virtualized storage as well as with disk pass-through,
>>>>>> backed by a rotational hard disk in both cases. The same problem was
>>>>>> also seen with the BFQ I/O scheduler in kernel v5.1.
>>>>>>
>>>>>> Searching for any earlier discussions of this problem, I found an old
>>>>>> thread on LKML that encountered this behavior [1], as well as a docker
>>>>>> github issue [2] with similar symptoms (mentioned later in the
>>>>>> thread).
>>>>>>
>>>>>> So, I'm curious to know if this is a well-understood problem and if
>>>>>> anybody has any thoughts on how to fix it.
>>>>>>
>>>>>> Thank you very much!
>>>>>>
>>>>>>
>>>>>> [1]. https://lkml.org/lkml/2015/11/19/359
>>>>>>
>>>>>> [2]. https://github.com/moby/moby/issues/21485
>>>>>> https://github.com/moby/moby/issues/21485#issuecomment-222941103
>>>>>>
>>>>>> Regards,
>>>>>> Srivatsa


Attachments:
dsync_test.sh (1.86 kB)
0001-block-bfq-boost-injection.patch.gz (2.40 kB)

2019-05-21 16:49:37

by Theodore Ts'o

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

On Mon, May 20, 2019 at 11:15:58AM +0200, Jan Kara wrote:
> But this makes priority-inversion problems with ext4 journal worse, doesn't
> it? If we submit journal commit in blkio cgroup of some random process, it
> may get throttled which then effectively blocks the whole filesystem. Or do
> you want to implement a more complex back-pressure mechanism where you'd
> just account to different blkio cgroup during journal commit and then
> throttle as different point where you are not blocking other tasks from
> progress?

Good point, yes, it can. It depends on what cgroup the file system is
mounted in (and hence what cgroup the jbd2 kernel thread is in). If it
was mounted in the root cgroup, then the jbd2 thread is going to be
completely unthrottled (except for the data=ordered writebacks, which
will be charged to the cgroups which wrote those pages), so the only
thing which is nuking us will be the slice_idle timeout --- both for
the writebacks (which could get charged to N different cgroups, with
disastrous effects --- and this is going to be true for any file
system on a syncfs(2) call as well) and switching between the jbd2
thread's cgroup and the writeback cgroup.

One thing the I/O scheduler could do is use the synchronous flag as a
hint that it should ix-nay on the idle-way. Or maybe we need to have
a different way to signal this to the jbd2 thread, since I do
recognize that this issue is ext4-specific, *because* we do the
transaction handling in a separate thread, and because of the
data=ordered scheme, both of which are unique to ext4. So exempting
synchronous writes from cgroup control doesn't make sense for other
file systems.

So maybe a special flag meaning "entangled writes", where the
sched_idle hacks should get suppressed for the data=ordered
writebacks, but we still charge the block I/O to the relevant CSS's?

I could also imagine that, if there were some way the file system could
track whether all of the file system modifications were charged to a
single cgroup, we could in that case charge the journal I/O to that
cgroup?

> Yeah. At least in some cases, we know there won't be any more IO from a
> particular cgroup in the near future (e.g. transaction commit completing,
> or when the layers above IO scheduler already know which IO they are going
> to submit next) and in that case idling is just a waste of time. But so far
> I haven't decided how should look a reasonably clean interface for this
> that isn't specific to a particular IO scheduler implementation.

The best I've come up with is some way of signalling that all of the
writes coming from the jbd2 commit are entangled, probably via a bio
flag.

If we don't have cgroup support, the other thing we could do is assume
that the jbd2 thread should always be in the root (unconstrained)
cgroup, and then force all writes, including data=ordered writebacks,
to be in jbd2's cgroup. But that would make the block cgroup
controls trivially bypassable by an application, which could just be
fsync-happy and exempt all of its buffered I/O writes from cgroup
control. So that's probably not a great way to go --- but it would at
least fix this particular performance issue. :-/

- Ted

2019-05-21 17:38:36

by Paolo Valente

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller



> On 21 May 2019, at 18:21, Paolo Valente <[email protected]> wrote:
>
>
>
>> On 21 May 2019, at 15:20, Paolo Valente <[email protected]> wrote:
>>
>>
>>
>>> On 21 May 2019, at 13:25, Paolo Valente <[email protected]> wrote:
>>>
>>>
>>>
>>>> Il giorno 20 mag 2019, alle ore 12:19, Paolo Valente <[email protected]> ha scritto:
>>>>
>>>>
>>>>
>>>>> Il giorno 18 mag 2019, alle ore 22:50, Srivatsa S. Bhat <[email protected]> ha scritto:
>>>>>
>>>>> On 5/18/19 11:39 AM, Paolo Valente wrote:
>>>>>> I've addressed these issues in my last batch of improvements for BFQ,
>>>>>> which landed in the upcoming 5.2. If you give it a try, and still see
>>>>>> the problem, then I'll be glad to reproduce it, and hopefully fix it
>>>>>> for you.
>>>>>>
>>>>>
>>>>> Hi Paolo,
>>>>>
>>>>> Thank you for looking into this!
>>>>>
>>>>> I just tried current mainline at commit 72cf0b07, but unfortunately
>>>>> didn't see any improvement:
>>>>>
>>>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>>>>
>>>>> With mq-deadline, I get:
>>>>>
>>>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.90981 s, 1.3 MB/s
>>>>>
>>>>> With bfq, I get:
>>>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 84.8216 s, 60.4 kB/s
>>>>>
>>>>
>>>> Hi Srivatsa,
>>>> thanks for reproducing this on mainline. I seem to have reproduced a
>>>> bonsai-tree version of this issue.
>>>
>>> Hi again Srivatsa,
>>> I've analyzed the trace, and I've found the cause of the loss of
>>> throughput on my side. To find out whether it is the same cause as
>>> on your side, I've prepared a script that executes your test and takes
>>> a trace during the test. If that's ok for you, could you please
>>> - change the value for the DEVS parameter in the attached script, if
>>> needed
>>> - execute the script
>>> - send me the trace file that the script will leave in your working
>>> dir
>>>
>>
>> Sorry, I forgot to add that I also need you to, first, apply the
>> attached patch (it will make BFQ generate the log I need).
>>
>
> Sorry again :) This time for attaching one more patch. This is
> basically a blind fix attempt, based on what I see in my VM.
>
> So, instead of only sending me a trace, could you please:
> 1) apply this new patch on top of the one I attached in my previous email
> 2) repeat your test and report results

One last thing (I swear!): as you can see from my script, I have
tested only the case low_latency=0 so far. So please, for the moment,
run your test with low_latency=0. You can find the full path to this
parameter in, e.g., my script.
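
For reference, selecting bfq and turning off low_latency by hand looks
like this (sda here is just an example; use the device you set in the
DEVS parameter):

    # switch the device to bfq, then disable its low-latency heuristics
    echo bfq > /sys/block/sda/queue/scheduler
    echo 0 > /sys/block/sda/queue/iosched/low_latency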

Thanks,
Paolo

> 3) regardless of whether bfq performance improves, take a trace with
> my script (I've attached a new version that doesn't risk outputting an
> annoying error message like the previous one did)
>
> Thanks,
> Paolo
>
> <dsync_test.sh><0001-block-bfq-boost-injection.patch.gz>
>
>> Thanks,
>> Paolo
>>
>> <0001-block-bfq-add-logs-and-BUG_ONs.patch.gz>
>>
>>> Looking forward to your trace,
>>> Paolo
>>>
>>> <dsync_test.sh>
>>>> Before digging into the block
>>>> trace, I'd like to ask you for some feedback.
>>>>
>>>> First, in my test, the total throughput of the disk happens to be
>>>> about 20 times as high as that enjoyed by dd, regardless of the I/O
>>>> scheduler. I guess this massive overhead is normal with dsync, but
>>>> I'd like to know whether it is about the same on your side. This will
>>>> help me understand whether I'll actually be analyzing the same
>>>> problem as yours.
>>>>
>>>> Second, the commands I used follow. Do they implement your test case
>>>> correctly?
>>>>
>>>> [root@localhost tmp]# mkdir /sys/fs/cgroup/blkio/testgrp
>>>> [root@localhost tmp]# echo $BASHPID > /sys/fs/cgroup/blkio/testgrp/cgroup.procs
>>>> [root@localhost tmp]# cat /sys/block/sda/queue/scheduler
>>>> [mq-deadline] bfq none
>>>> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>>> 10000+0 records in
>>>> 10000+0 records out
>>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 14.6892 s, 349 kB/s
>>>> [root@localhost tmp]# echo bfq > /sys/block/sda/queue/scheduler
>>>> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>>> 10000+0 records in
>>>> 10000+0 records out
>>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 20.1953 s, 254 kB/s
>>>>
>>>> Thanks,
>>>> Paolo
>>>>
>>>>> Please let me know if any more info about my setup might be helpful.
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Regards,
>>>>> Srivatsa
>>>>> VMware Photon OS
>>>>>



2019-05-21 22:52:44

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

[ Resending this mail with a dropbox link to the traces (instead
of a file attachment), since it didn't go through the last time. ]

On 5/21/19 10:38 AM, Paolo Valente wrote:
>
>> So, instead of only sending me a trace, could you please:
>> 1) apply this new patch on top of the one I attached in my previous email
>> 2) repeat your test and report results
>
> One last thing (I swear!): as you can see from my script, I tested the
> case low_latency=0 so far. So please, for the moment, do your test
> with low_latency=0. You find the whole path to this parameter in,
> e.g., my script.
>
No problem! :) Thank you for sharing patches for me to test!

I have good news :) Your patch improves the throughput significantly
when low_latency = 0.

Without any patch:

dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
10000+0 records in
10000+0 records out
5120000 bytes (5.1 MB, 4.9 MiB) copied, 58.0915 s, 88.1 kB/s


With both patches applied:

dd if=/dev/zero of=/root/test0.img bs=512 count=10000 oflag=dsync
10000+0 records in
10000+0 records out
5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.87487 s, 1.3 MB/s

The performance is still not as good as mq-deadline (which achieves
1.6 MB/s), but this is a huge improvement for BFQ nonetheless!

A tarball with the trace output from the 2 scenarios you requested,
one with only the debug patch applied (trace-bfq-add-logs-and-BUG_ONs),
and another with both patches applied (trace-bfq-boost-injection) is
available here:

https://www.dropbox.com/s/pdf07vi7afido7e/bfq-traces.tar.gz?dl=0

Thank you!

Regards,
Srivatsa
VMware Photon OS

2019-05-22 08:06:48

by Paolo Valente

[permalink] [raw]
Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller



> Il giorno 22 mag 2019, alle ore 00:51, Srivatsa S. Bhat <[email protected]> ha scritto:
>
> [ Resending this mail with a dropbox link to the traces (instead
> of a file attachment), since it didn't go through the last time. ]
>
> On 5/21/19 10:38 AM, Paolo Valente wrote:
>>
>>> So, instead of only sending me a trace, could you please:
>>> 1) apply this new patch on top of the one I attached in my previous email
>>> 2) repeat your test and report results
>>
>> One last thing (I swear!): as you can see from my script, I tested the
>> case low_latency=0 so far. So please, for the moment, do your test
>> with low_latency=0. You find the whole path to this parameter in,
>> e.g., my script.
>>
> No problem! :) Thank you for sharing patches for me to test!
>
> I have good news :) Your patch improves the throughput significantly
> when low_latency = 0.
>
> Without any patch:
>
> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
> 10000+0 records in
> 10000+0 records out
> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 58.0915 s, 88.1 kB/s
>
>
> With both patches applied:
>
> dd if=/dev/zero of=/root/test0.img bs=512 count=10000 oflag=dsync
> 10000+0 records in
> 10000+0 records out
> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.87487 s, 1.3 MB/s
>
> The performance is still not as good as mq-deadline (which achieves
> 1.6 MB/s), but this is a huge improvement for BFQ nonetheless!
>
> A tarball with the trace output from the 2 scenarios you requested,
> one with only the debug patch applied (trace-bfq-add-logs-and-BUG_ONs),
> and another with both patches applied (trace-bfq-boost-injection) is
> available here:
>
> https://www.dropbox.com/s/pdf07vi7afido7e/bfq-traces.tar.gz?dl=0
>

Hi Srivatsa,
I've seen the bugzilla entry you've created. I'm a little confused
about how best to proceed. Shall we move this discussion to the
bugzilla, or should we continue it here, where it started, and then
update the bugzilla?

Let me know,
Paolo

> Thank you!
>
> Regards,
> Srivatsa
> VMware Photon OS



2019-05-22 09:03:35

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

On 5/22/19 1:05 AM, Paolo Valente wrote:
>
>
>> Il giorno 22 mag 2019, alle ore 00:51, Srivatsa S. Bhat <[email protected]> ha scritto:
>>
>> [ Resending this mail with a dropbox link to the traces (instead
>> of a file attachment), since it didn't go through the last time. ]
>>
>> On 5/21/19 10:38 AM, Paolo Valente wrote:
>>>
>>>> So, instead of only sending me a trace, could you please:
>>>> 1) apply this new patch on top of the one I attached in my previous email
>>>> 2) repeat your test and report results
>>>
>>> One last thing (I swear!): as you can see from my script, I tested the
>>> case low_latency=0 so far. So please, for the moment, do your test
>>> with low_latency=0. You find the whole path to this parameter in,
>>> e.g., my script.
>>>
>> No problem! :) Thank you for sharing patches for me to test!
>>
>> I have good news :) Your patch improves the throughput significantly
>> when low_latency = 0.
>>
>> Without any patch:
>>
>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>> 10000+0 records in
>> 10000+0 records out
>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 58.0915 s, 88.1 kB/s
>>
>>
>> With both patches applied:
>>
>> dd if=/dev/zero of=/root/test0.img bs=512 count=10000 oflag=dsync
>> 10000+0 records in
>> 10000+0 records out
>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.87487 s, 1.3 MB/s
>>
>> The performance is still not as good as mq-deadline (which achieves
>> 1.6 MB/s), but this is a huge improvement for BFQ nonetheless!
>>
>> A tarball with the trace output from the 2 scenarios you requested,
>> one with only the debug patch applied (trace-bfq-add-logs-and-BUG_ONs),
>> and another with both patches applied (trace-bfq-boost-injection) is
>> available here:
>>
>> https://www.dropbox.com/s/pdf07vi7afido7e/bfq-traces.tar.gz?dl=0
>>
>
> Hi Srivatsa,
> I've seen the bugzilla you've created. I'm a little confused on how
> to better proceed. Shall we move this discussion to the bugzilla, or
> should we continue this discussion here, where it has started, and
> then update the bugzilla?
>

Let's continue here on LKML itself. The only reason I created the
bugzilla entry was to attach the tarball of the traces, assuming
that it would let me upload a 20 MB file (since the email attachment
didn't go through). But bugzilla's file size limit is much smaller
than that, so it didn't work out either, and I resorted to using
dropbox. So we don't need the bugzilla entry anymore; I might as well
close it to avoid confusion.

Regards,
Srivatsa
VMware Photon OS

2019-05-22 09:13:33

by Paolo Valente

[permalink] [raw]
Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller



> Il giorno 22 mag 2019, alle ore 11:02, Srivatsa S. Bhat <[email protected]> ha scritto:
>
> On 5/22/19 1:05 AM, Paolo Valente wrote:
>>
>>
>>> Il giorno 22 mag 2019, alle ore 00:51, Srivatsa S. Bhat <[email protected]> ha scritto:
>>>
>>> [ Resending this mail with a dropbox link to the traces (instead
>>> of a file attachment), since it didn't go through the last time. ]
>>>
>>> On 5/21/19 10:38 AM, Paolo Valente wrote:
>>>>
>>>>> So, instead of only sending me a trace, could you please:
>>>>> 1) apply this new patch on top of the one I attached in my previous email
>>>>> 2) repeat your test and report results
>>>>
>>>> One last thing (I swear!): as you can see from my script, I tested the
>>>> case low_latency=0 so far. So please, for the moment, do your test
>>>> with low_latency=0. You find the whole path to this parameter in,
>>>> e.g., my script.
>>>>
>>> No problem! :) Thank you for sharing patches for me to test!
>>>
>>> I have good news :) Your patch improves the throughput significantly
>>> when low_latency = 0.
>>>
>>> Without any patch:
>>>
>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>> 10000+0 records in
>>> 10000+0 records out
>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 58.0915 s, 88.1 kB/s
>>>
>>>
>>> With both patches applied:
>>>
>>> dd if=/dev/zero of=/root/test0.img bs=512 count=10000 oflag=dsync
>>> 10000+0 records in
>>> 10000+0 records out
>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.87487 s, 1.3 MB/s
>>>
>>> The performance is still not as good as mq-deadline (which achieves
>>> 1.6 MB/s), but this is a huge improvement for BFQ nonetheless!
>>>
>>> A tarball with the trace output from the 2 scenarios you requested,
>>> one with only the debug patch applied (trace-bfq-add-logs-and-BUG_ONs),
>>> and another with both patches applied (trace-bfq-boost-injection) is
>>> available here:
>>>
>>> https://www.dropbox.com/s/pdf07vi7afido7e/bfq-traces.tar.gz?dl=0
>>>
>>
>> Hi Srivatsa,
>> I've seen the bugzilla you've created. I'm a little confused on how
>> to better proceed. Shall we move this discussion to the bugzilla, or
>> should we continue this discussion here, where it has started, and
>> then update the bugzilla?
>>
>
> Let's continue here on LKML itself.

Just done :)

> The only reason I created the
> bugzilla entry is to attach the tarball of the traces, assuming
> that it would allow me to upload a 20 MB file (since email attachment
> didn't work). But bugzilla's file restriction is much smaller than
> that, so it didn't work out either, and I resorted to using dropbox.
> So we don't need the bugzilla entry anymore; I might as well close it
> to avoid confusion.
>

No no, don't close it: it can reach people who don't use LKML. We
just have to remember to report back there at the end of this. BTW, I
also think that the bug is incorrectly filed against 5.1, while all
these tests and results concern 5.2-rcX.

Thanks,
Paolo

> Regards,
> Srivatsa
> VMware Photon OS



2019-05-22 10:04:02

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

On 5/22/19 2:12 AM, Paolo Valente wrote:
>
>> Il giorno 22 mag 2019, alle ore 11:02, Srivatsa S. Bhat <[email protected]> ha scritto:
>>
>>
>> Let's continue here on LKML itself.
>
> Just done :)
>
>> The only reason I created the
>> bugzilla entry is to attach the tarball of the traces, assuming
>> that it would allow me to upload a 20 MB file (since email attachment
>> didn't work). But bugzilla's file restriction is much smaller than
>> that, so it didn't work out either, and I resorted to using dropbox.
>> So we don't need the bugzilla entry anymore; I might as well close it
>> to avoid confusion.
>>
>
> No no, don't close it: it can reach people that don't use LKML. We
> just have to remember to report back at the end of this.

Ah, good point!

> BTW, I also
> think that the bug is incorrectly filed against 5.1, while all these
> tests and results concern 5.2-rcX.
>

Fixed now, thank you for pointing out!

Regards,
Srivatsa
VMware Photon OS