2019-05-17 23:16:55

by Srivatsa S. Bhat

Subject: CFQ idling kills I/O performance on ext4 with blkio cgroup controller


Hi,

One of my colleagues noticed up to a 10x-30x drop in I/O throughput when
running the following command with the CFQ I/O scheduler:

dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync

Throughput with CFQ: 60 KB/s
Throughput with noop or deadline: 1.5 MB/s - 2 MB/s

I spent some time looking into it and found that this is caused by an
undesirable interaction between four different components:

- blkio cgroup controller enabled
- ext4 with the jbd2 kthread running in the root blkio cgroup
- dd running on ext4, in any other blkio cgroup than that of jbd2
- CFQ I/O scheduler with defaults for slice_idle and group_idle
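
(For completeness, this is roughly how I checked the scheduler-related
preconditions; "sdb" below is just a placeholder for whichever disk
backs the filesystem, and the exact commands are only illustrative:

  # cat /sys/block/sdb/queue/scheduler
  # cat /sys/block/sdb/queue/iosched/slice_idle
  # cat /sys/block/sdb/queue/iosched/group_idle

The first command should show cfq as the selected scheduler, and the
other two should both report the default of 8, i.e. 8ms of idling.)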


When docker is enabled, systemd creates a blkio cgroup called
system.slice to run system services (and docker) under it, and a
separate blkio cgroup called user.slice for user processes. So, when
dd is invoked, it runs under user.slice.

The dd command above uses oflag=dsync, which performs an fdatasync
after every write to the output file. Since dd is writing to a file on
ext4, jbd2 will be active, committing transactions corresponding to
those fdatasync requests from dd. (In other words, dd depends on jbd2
in order to make forward progress.) But jbd2, being a kernel thread,
runs in the root blkio cgroup, as opposed to dd, which runs under
user.slice.
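
(A quick way to see this split, while the dd above is running, is to
compare the blkio cgroup of the two processes. The commands below are
only a sketch, assuming the v1 blkio controller and that pgrep finds
the right jbd2 thread:

  # grep blkio /proc/$(pidof dd)/cgroup
  # grep blkio /proc/$(pgrep jbd2 | head -n1)/cgroup

The first should show a path under user.slice, while the second should
show the root blkio cgroup.)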

Now, if the I/O scheduler in use for the underlying block device is
CFQ, then its inter-queue/inter-group idling takes effect (via the
slice_idle and group_idle parameters, both of which default to 8ms).
Therefore, every time CFQ switches between processing requests from dd
and jbd2, this 8ms idle time is injected, which slows down the overall
throughput tremendously!

To verify this theory, I tried various experiments, and in all cases,
the 4 pre-conditions mentioned above were necessary to reproduce this
performance drop. For example, if I used an XFS filesystem (which
doesn't use a separate kthread like jbd2 for journaling), or if I dd'ed
directly to a block device, I couldn't reproduce the performance
issue. Similarly, running dd in the root blkio cgroup (where jbd2
runs) also restores full performance, as does using the noop or
deadline I/O schedulers, or even CFQ itself with slice_idle and
group_idle set to zero.
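
(Concretely, those last two workarounds amount to something like the
following, again with sdb as a stand-in for the actual device:

  # echo deadline > /sys/block/sdb/queue/scheduler

or, keeping CFQ but disabling its idling:

  # echo cfq > /sys/block/sdb/queue/scheduler
  # echo 0 > /sys/block/sdb/queue/iosched/slice_idle
  # echo 0 > /sys/block/sdb/queue/iosched/group_idle

Either of these restores the ~1.5 MB/s - 2 MB/s dd throughput in my runs.)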

These results were reproduced on a Linux VM (kernel v4.19) on ESXi,
both with virtualized storage as well as with disk pass-through,
backed by a rotational hard disk in both cases. The same problem was
also seen with the BFQ I/O scheduler in kernel v5.1.

Searching for any earlier discussions of this problem, I found an old
thread on LKML that encountered this behavior [1], as well as a docker
github issue [2] with similar symptoms (mentioned later in the
thread).

So, I'm curious to know if this is a well-understood problem and if
anybody has any thoughts on how to fix it.

Thank you very much!


[1]. https://lkml.org/lkml/2015/11/19/359

[2]. https://github.com/moby/moby/issues/21485
https://github.com/moby/moby/issues/21485#issuecomment-222941103

Regards,
Srivatsa


2019-05-21 07:21:29

by Srivatsa S. Bhat

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

On 5/20/19 11:23 PM, Paolo Valente wrote:
>
>
>> Il giorno 21 mag 2019, alle ore 00:45, Srivatsa S. Bhat <[email protected]> ha scritto:
>>
>> On 5/20/19 3:19 AM, Paolo Valente wrote:
>>>
>>>
>>>> Il giorno 18 mag 2019, alle ore 22:50, Srivatsa S. Bhat <[email protected]> ha scritto:
>>>>
>>>> On 5/18/19 11:39 AM, Paolo Valente wrote:
>>>>> I've addressed these issues in my last batch of improvements for BFQ,
>>>>> which landed in the upcoming 5.2. If you give it a try, and still see
>>>>> the problem, then I'll be glad to reproduce it, and hopefully fix it
>>>>> for you.
>>>>>
>>>>
>>>> Hi Paolo,
>>>>
>>>> Thank you for looking into this!
>>>>
>>>> I just tried current mainline at commit 72cf0b07, but unfortunately
>>>> didn't see any improvement:
>>>>
>>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>>>
>>>> With mq-deadline, I get:
>>>>
>>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.90981 s, 1.3 MB/s
>>>>
>>>> With bfq, I get:
>>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 84.8216 s, 60.4 kB/s
>>>>
>>>
>>> Hi Srivatsa,
>>> thanks for reproducing this on mainline. I seem to have reproduced a
>>> bonsai-tree version of this issue. Before digging into the block
>>> trace, I'd like to ask you for some feedback.
>>>
>>> First, in my test, the total throughput of the disk happens to be
>>> about 20 times as high as that enjoyed by dd, regardless of the I/O
>>> scheduler. I guess this massive overhead is normal with dsync, but
>>> I'd like to know whether it is about the same on your side. This will
>>> help me understand whether I'll actually be analyzing about the same
>>> problem as yours.
>>>
>>
>> Do you mean to say the throughput obtained by dd'ing directly to the
>> block device (bypassing the filesystem)?
>
> No no, I mean simply what follows.
>
> 1) in one terminal:
> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
> 10000+0 record dentro
> 10000+0 record fuori
> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 14,6892 s, 349 kB/s
>
> 2) In a second terminal, while the dd is in progress in the first
> terminal:
> $ iostat -tmd /dev/sda 3
> Linux 5.1.0+ (localhost.localdomain) 20/05/2019 _x86_64_ (2 CPU)
>
> ...
> 20/05/2019 11:40:17
> Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sda            2288,00         0,00         9,77          0         29
>
> 20/05/2019 11:40:20
> Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sda            2325,33         0,00         9,93          0         29
>
> 20/05/2019 11:40:23
> Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sda            2351,33         0,00        10,05          0         30
> ...
>
> As you can see, the overall throughput (~10 MB/s) is more than 20
> times as high as the dd throughput (~350 KB/s). But the dd is the
> only source of I/O.
>
> Do you also see such a huge difference?
>
Ah, I see what you mean. Yes, I get a huge difference as well:

I/O scheduler      dd throughput    Total throughput (via iostat)
-------------      -------------    -----------------------------
mq-deadline
   or               1.6 MB/s         50 MB/s  (30x)
 kyber

bfq                  60 KB/s          1 MB/s  (16x)


Regards,
Srivatsa
VMware Photon OS

2019-05-21 09:12:11

by Jan Kara

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

On Tue 21-05-19 08:23:05, Paolo Valente wrote:
> > Il giorno 21 mag 2019, alle ore 00:45, Srivatsa S. Bhat <[email protected]> ha scritto:
> >
> > On 5/20/19 3:19 AM, Paolo Valente wrote:
> >>
> >>
> >>> Il giorno 18 mag 2019, alle ore 22:50, Srivatsa S. Bhat <[email protected]> ha scritto:
> >>>
> >>> On 5/18/19 11:39 AM, Paolo Valente wrote:
> >>>> I've addressed these issues in my last batch of improvements for BFQ,
> >>>> which landed in the upcoming 5.2. If you give it a try, and still see
> >>>> the problem, then I'll be glad to reproduce it, and hopefully fix it
> >>>> for you.
> >>>>
> >>>
> >>> Hi Paolo,
> >>>
> >>> Thank you for looking into this!
> >>>
> >>> I just tried current mainline at commit 72cf0b07, but unfortunately
> >>> didn't see any improvement:
> >>>
> >>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
> >>>
> >>> With mq-deadline, I get:
> >>>
> >>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.90981 s, 1.3 MB/s
> >>>
> >>> With bfq, I get:
> >>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 84.8216 s, 60.4 kB/s
> >>>
> >>
> >> Hi Srivatsa,
> >> thanks for reproducing this on mainline. I seem to have reproduced a
> >> bonsai-tree version of this issue. Before digging into the block
> >> trace, I'd like to ask you for some feedback.
> >>
> >> First, in my test, the total throughput of the disk happens to be
> >> about 20 times as high as that enjoyed by dd, regardless of the I/O
> >> scheduler. I guess this massive overhead is normal with dsync, but
> >> I'd like to know whether it is about the same on your side. This will
> >> help me understand whether I'll actually be analyzing about the same
> >> problem as yours.
> >>
> >
> > Do you mean to say the throughput obtained by dd'ing directly to the
> > block device (bypassing the filesystem)?
>
> No no, I mean simply what follows.
>
> 1) in one terminal:
> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
> 10000+0 record dentro
> 10000+0 record fuori
> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 14,6892 s, 349 kB/s
>
> 2) In a second terminal, while the dd is in progress in the first
> terminal:
> $ iostat -tmd /dev/sda 3
> Linux 5.1.0+ (localhost.localdomain) 20/05/2019 _x86_64_ (2 CPU)
>
> ...
> 20/05/2019 11:40:17
> Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sda            2288,00         0,00         9,77          0         29
>
> 20/05/2019 11:40:20
> Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sda            2325,33         0,00         9,93          0         29
>
> 20/05/2019 11:40:23
> Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sda            2351,33         0,00        10,05          0         30
> ...
>
> As you can see, the overall throughput (~10 MB/s) is more than 20
> times as high as the dd throughput (~350 KB/s). But the dd is the
> only source of I/O.

Yes, and that's expected. It just shows how inefficient small synchronous IO
is. Look, dd(1) writes 512 bytes at a time. From the FS point of view we have
to write: full fs block with data (+4KB), inode to journal (+4KB), journal
descriptor block (+4KB), journal superblock (+4KB), transaction commit block
(+4KB) - so that's 20KB, just off the top of my head, to write 512 bytes...
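
(A rough cross-check of those numbers: 10000 synchronous 512-byte
writes times ~20KB of data plus journal traffic each comes to roughly
200MB hitting the disk for only ~5MB of useful payload, i.e. around 40x
write amplification, which is the same order of magnitude as the
16x-30x gap between iostat and dd throughput reported earlier in the
thread.)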

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-05-21 18:21:51

by Josef Bacik

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

On Tue, May 21, 2019 at 12:48:14PM -0400, Theodore Ts'o wrote:
> On Mon, May 20, 2019 at 11:15:58AM +0200, Jan Kara wrote:
> > But this makes priority-inversion problems with ext4 journal worse, doesn't
> > it? If we submit journal commit in blkio cgroup of some random process, it
> > may get throttled which then effectively blocks the whole filesystem. Or do
> > you want to implement a more complex back-pressure mechanism where you'd
> > just account to different blkio cgroup during journal commit and then
> > throttle as different point where you are not blocking other tasks from
> > progress?
>
> Good point, yes, it can. It depends in what cgroup the file system is
> mounted (and hence what cgroup the jbd2 kernel thread is on). If it
> was mounted in the root cgroup, then jbd2 thread is going to be
> completely unthrottled (except for the data=ordered writebacks, which
> will be charged to the cgroup which write those pages) so the only
> thing which is nuking us will be the slice_idle timeout --- both for
> the writebacks (which could get charged to N different cgroups, with
> disastrous effects --- and this is going to be true for any file
> system on a syncfs(2) call as well) and switching between the jbd2
> thread's cgroup and the writeback cgroup.
>
> One thing the I/O scheduler could do is use the synchronous flag as a
> hint that it should ix-nay on the idle-way. Or maybe we need to have
> a different way to signal this to the jbd2 thread, since I do
> recognize that this issue is ext4-specific, *because* we do the
> transaction handling in a separate thread, and because of the
> data=ordered scheme, both of which are unique to ext4. So exempting
> synchronous writes from cgroup control doesn't make sense for other
> file systems.
>
> So maybe a special flag meaning "entangled writes", where the
> sched_idle hacks should get suppressed for the data=ordered
> writebacks, but we still charge the block I/O to the relevant CSS's?
>
> I could also imagine if there was some way that file system could
> track whether all of the file system modifications were charged to a
> single cgroup, we could in that case charge it to that cgroup?
>
> > Yeah. At least in some cases, we know there won't be any more IO from a
> > particular cgroup in the near future (e.g. transaction commit completing,
> > or when the layers above IO scheduler already know which IO they are going
> > to submit next) and in that case idling is just a waste of time. But so far
> > I haven't decided how should look a reasonably clean interface for this
> > that isn't specific to a particular IO scheduler implementation.
>
> The best I've come up with is some way of signalling that all of the
> writes coming from the jbd2 commit are entangled, probably via a bio
> flag.
>
> If we don't have cgroup support, the other thing we could do is assume
> that the jbd2 thread should always be in the root (unconstrained)
> cgroup, and then force all writes, include data=ordered writebacks, to
> be in the jbd2's cgroup. But that would make the block cgroup
> controls trivially bypassable by an application, which could just be
> fsync-happy and exempt all of its buffered I/O writes from cgroup
> control. So that's probably not a great way to go --- but it would at
> least fix this particular performance issue. :-/
>

Chris is adding a REQ_ROOT (or something) flag that means don't throttle me now,
but the blkcg attached to the bio is the one that is responsible for this
IO. Then for io.latency we'll let the io go through unmolested but it gets
counted to the right cgroup, and then if we're exceeding latency guarantees we
have the ability to schedule throttling for that cgroup in a safer place. This
would eliminate the data=ordered issue for ext4: you guys keep doing what you
are doing and we'll handle throttling elsewhere, just so long as the bios are
tagged with the correct source, all is well. Thanks,

Josef

2019-05-22 09:11:32

by Paolo Valente

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller



> Il giorno 22 mag 2019, alle ore 10:05, Paolo Valente <[email protected]> ha scritto:
>
>
>
>> Il giorno 22 mag 2019, alle ore 00:51, Srivatsa S. Bhat <[email protected]> ha scritto:
>>
>> [ Resending this mail with a dropbox link to the traces (instead
>> of a file attachment), since it didn't go through the last time. ]
>>
>> On 5/21/19 10:38 AM, Paolo Valente wrote:
>>>
>>>> So, instead of only sending me a trace, could you please:
>>>> 1) apply this new patch on top of the one I attached in my previous email
>>>> 2) repeat your test and report results
>>>
>>> One last thing (I swear!): as you can see from my script, I tested the
>>> case low_latency=0 so far. So please, for the moment, do your test
>>> with low_latency=0. You find the whole path to this parameter in,
>>> e.g., my script.
>>>
>> No problem! :) Thank you for sharing patches for me to test!
>>
>> I have good news :) Your patch improves the throughput significantly
>> when low_latency = 0.
>>
>> Without any patch:
>>
>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>> 10000+0 records in
>> 10000+0 records out
>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 58.0915 s, 88.1 kB/s
>>
>>
>> With both patches applied:
>>
>> dd if=/dev/zero of=/root/test0.img bs=512 count=10000 oflag=dsync
>> 10000+0 records in
>> 10000+0 records out
>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.87487 s, 1.3 MB/s
>>
>> The performance is still not as good as mq-deadline (which achieves
>> 1.6 MB/s), but this is a huge improvement for BFQ nonetheless!
>>
>> A tarball with the trace output from the 2 scenarios you requested,
>> one with only the debug patch applied (trace-bfq-add-logs-and-BUG_ONs),
>> and another with both patches applied (trace-bfq-boost-injection) is
>> available here:
>>
>> https://www.dropbox.com/s/pdf07vi7afido7e/bfq-traces.tar.gz?dl=0
>>
>
> Hi Srivatsa,
> I've seen the bugzilla you've created. I'm a little confused on how
> to better proceed. Shall we move this discussion to the bugzilla, or
> should we continue this discussion here, where it has started, and
> then update the bugzilla?
>

Ok, I've received some feedback on this point, and I'll continue the
discussion here. Then I'll report back on the bugzilla.

First, thank you very much for testing my patches, and, above all, for
sharing those huge traces!

According to your traces, the residual 20% lower throughput that you
record is due to the fact that the BFQ injection mechanism takes a few
hundredths of a second to stabilize at the beginning of the workload.
During that setup time, the throughput is equal to the dreadful ~60-90 KB/s
that you see without this new patch. After that time, there
seems to be no loss according to the trace.

The problem is that even a loss lasting only a few hundredths of a
second is not negligible for a write workload that lasts only 3-4
seconds. Could you please try writing a larger file?

In addition, I wanted to ask you whether you measured BFQ throughput
with traces disabled. This may make a difference.

After trying with a larger file, you can also try with low_latency on.
On my side, it causes results to become a little unstable across
repetitions (which is expected).
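
Just to be concrete, I mean something like the following; sdb is only a
placeholder for your device, and the larger count is arbitrary:

# dd if=/dev/zero of=/root/test.img bs=512 count=100000 oflag=dsync
# cat /sys/block/sdb/queue/iosched/low_latency
# echo 1 > /sys/block/sdb/queue/iosched/low_latency

That is, first repeat the test with a ~10x longer write, then switch
low_latency back on and repeat it again.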

Thanks,
Paolo


> Let me know,
> Paolo
>
>> Thank you!
>>
>> Regards,
>> Srivatsa
>> VMware Photon OS



2019-05-29 01:11:21

by Srivatsa S. Bhat

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

On 5/23/19 11:51 PM, Paolo Valente wrote:
>
>> Il giorno 24 mag 2019, alle ore 01:43, Srivatsa S. Bhat <[email protected]> ha scritto:
>>
>> When trying to run multiple dd tasks simultaneously, I get the kernel
>> panic shown below (mainline is fine, without these patches).
>>
>
> Could you please provide me somehow with a list *(bfq_serv_to_charge+0x21) ?
>

Hi Paolo,

Sorry for the delay! Here you go:

(gdb) list *(bfq_serv_to_charge+0x21)
0xffffffff814bad91 is in bfq_serv_to_charge (./include/linux/blkdev.h:919).
914
915 extern unsigned int blk_rq_err_bytes(const struct request *rq);
916
917 static inline unsigned int blk_rq_sectors(const struct request *rq)
918 {
919 return blk_rq_bytes(rq) >> SECTOR_SHIFT;
920 }
921
922 static inline unsigned int blk_rq_cur_sectors(const struct request *rq)
923 {
(gdb)


For some reason, I've not been able to reproduce this issue after
reporting it here. (Perhaps I got lucky when I hit the kernel panic
a bunch of times last week).

I'll test with your fix applied and see how it goes.

Thank you!

Regards,
Srivatsa

>
>> [ 568.232231] BUG: kernel NULL pointer dereference, address: 0000000000000024
>> [ 568.232257] #PF: supervisor read access in kernel mode
>> [ 568.232273] #PF: error_code(0x0000) - not-present page
>> [ 568.232289] PGD 0 P4D 0
>> [ 568.232299] Oops: 0000 [#1] SMP PTI
>> [ 568.232312] CPU: 0 PID: 1029 Comm: dd Tainted: G E 5.1.0-io-dbg-4+ #6
>> [ 568.232334] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
>> [ 568.232388] RIP: 0010:bfq_serv_to_charge+0x21/0x50
>> [ 568.232404] Code: ff e8 c3 5e bc ff 0f 1f 00 0f 1f 44 00 00 48 8b 86 20 01 00 00 55 48 89 e5 53 48 89 fb a8 40 75 09 83 be a0 01 00 00 01 76 09 <8b> 43 24 c1 e8 09 5b 5d c3 48 8b 7e 08 e8 5d fd ff ff 84 c0 75 ea
>> [ 568.232473] RSP: 0018:ffffa73a42dab750 EFLAGS: 00010002
>> [ 568.232489] RAX: 0000000000001052 RBX: 0000000000000000 RCX: ffffa73a42dab7a0
>> [ 568.232510] RDX: ffffa73a42dab657 RSI: ffff8b7b6ba2ab70 RDI: 0000000000000000
>> [ 568.232530] RBP: ffffa73a42dab758 R08: 0000000000000000 R09: 0000000000000001
>> [ 568.232551] R10: 0000000000000000 R11: ffffa73a42dab7a0 R12: ffff8b7b6aed3800
>> [ 568.232571] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8b7b6aed3800
>> [ 568.232592] FS: 00007fb5b0724540(0000) GS:ffff8b7b6f800000(0000) knlGS:0000000000000000
>> [ 568.232615] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 568.232632] CR2: 0000000000000024 CR3: 00000004266be002 CR4: 00000000001606f0
>> [ 568.232690] Call Trace:
>> [ 568.232703] bfq_select_queue+0x781/0x1000
>> [ 568.232717] bfq_dispatch_request+0x1d7/0xd60
>> [ 568.232731] ? bfq_bfqq_handle_idle_busy_switch.isra.36+0x2cd/0xb20
>> [ 568.232751] blk_mq_do_dispatch_sched+0xa8/0xe0
>> [ 568.232765] blk_mq_sched_dispatch_requests+0xe3/0x150
>> [ 568.232783] __blk_mq_run_hw_queue+0x56/0x100
>> [ 568.232798] __blk_mq_delay_run_hw_queue+0x107/0x160
>> [ 568.232814] blk_mq_run_hw_queue+0x75/0x190
>> [ 568.232828] blk_mq_sched_insert_requests+0x7a/0x100
>> [ 568.232844] blk_mq_flush_plug_list+0x1d7/0x280
>> [ 568.232859] blk_flush_plug_list+0xc2/0xe0
>> [ 568.232872] blk_finish_plug+0x2c/0x40
>> [ 568.232886] ext4_writepages+0x592/0xe60
>> [ 568.233381] ? ext4_mark_iloc_dirty+0x52b/0x860
>> [ 568.233851] do_writepages+0x3c/0xd0
>> [ 568.234304] ? ext4_mark_inode_dirty+0x1a0/0x1a0
>> [ 568.234748] ? do_writepages+0x3c/0xd0
>> [ 568.235197] ? __generic_write_end+0x4e/0x80
>> [ 568.235644] __filemap_fdatawrite_range+0xa5/0xe0
>> [ 568.236089] ? __filemap_fdatawrite_range+0xa5/0xe0
>> [ 568.236533] ? ext4_da_write_end+0x13c/0x280
>> [ 568.236983] file_write_and_wait_range+0x5a/0xb0
>> [ 568.237407] ext4_sync_file+0x11e/0x3e0
>> [ 568.237819] vfs_fsync_range+0x48/0x80
>> [ 568.238217] ext4_file_write_iter+0x234/0x3d0
>> [ 568.238610] ? _cond_resched+0x19/0x40
>> [ 568.238982] new_sync_write+0x112/0x190
>> [ 568.239347] __vfs_write+0x29/0x40
>> [ 568.239705] vfs_write+0xb1/0x1a0
>> [ 568.240078] ksys_write+0x89/0xc0
>> [ 568.240428] __x64_sys_write+0x1a/0x20
>> [ 568.240771] do_syscall_64+0x5b/0x140
>> [ 568.241115] entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> [ 568.241456] RIP: 0033:0x7fb5b02325f4
>> [ 568.241787] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 48 8d 05 09 11 2d 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 49 89 d4 53 48 89 f5
>> [ 568.242842] RSP: 002b:00007ffcb12e2968 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>> [ 568.243220] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb5b02325f4
>> [ 568.243616] RDX: 0000000000000200 RSI: 000055698f2ad000 RDI: 0000000000000001
>> [ 568.244026] RBP: 0000000000000200 R08: 0000000000000004 R09: 0000000000000003
>> [ 568.244401] R10: 00007fb5b04feca0 R11: 0000000000000246 R12: 000055698f2ad000
>> [ 568.244775] R13: 0000000000000000 R14: 0000000000000000 R15: 000055698f2ad000
>> [ 568.245154] Modules linked in: xt_MASQUERADE(E) nf_conntrack_netlink(E) nfnetlink(E) xfrm_user(E) xfrm_algo(E) xt_addrtype(E) br_netfilter(E) bridge(E) stp(E) llc(E) overlay(E) vmw_vsock_vmci_transport(E) vsock(E) ip6table_filter(E) ip6_tables(E) xt_conntrack(E) iptable_mangle(E) iptable_nat(E) nf_nat(E) iptable_filter
>> [ 568.248651] CR2: 0000000000000024
>> [ 568.249142] ---[ end trace 0ddd315e0a5bdfba ]---
>>

2019-05-29 07:42:48

by Paolo Valente

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller



> Il giorno 29 mag 2019, alle ore 03:09, Srivatsa S. Bhat <[email protected]> ha scritto:
>
> On 5/23/19 11:51 PM, Paolo Valente wrote:
>>
>>> Il giorno 24 mag 2019, alle ore 01:43, Srivatsa S. Bhat <[email protected]> ha scritto:
>>>
>>> When trying to run multiple dd tasks simultaneously, I get the kernel
>>> panic shown below (mainline is fine, without these patches).
>>>
>>
>> Could you please provide me somehow with a list *(bfq_serv_to_charge+0x21) ?
>>
>
> Hi Paolo,
>
> Sorry for the delay! Here you go:
>
> (gdb) list *(bfq_serv_to_charge+0x21)
> 0xffffffff814bad91 is in bfq_serv_to_charge (./include/linux/blkdev.h:919).
> 914
> 915 extern unsigned int blk_rq_err_bytes(const struct request *rq);
> 916
> 917 static inline unsigned int blk_rq_sectors(const struct request *rq)
> 918 {
> 919 return blk_rq_bytes(rq) >> SECTOR_SHIFT;
> 920 }
> 921
> 922 static inline unsigned int blk_rq_cur_sectors(const struct request *rq)
> 923 {
> (gdb)
>
>
> For some reason, I've not been able to reproduce this issue after
> reporting it here. (Perhaps I got lucky when I hit the kernel panic
> a bunch of times last week).
>
> I'll test with your fix applied and see how it goes.
>

Great! the offending line above gives me hope that my fix is correct.
If no more failures occur, then I'm eager (and a little worried ...)
to see how it goes with throughput :)

Thanks,
Paolo

> Thank you!
>
> Regards,
> Srivatsa
>
>>
>>> [ 568.232231] BUG: kernel NULL pointer dereference, address: 0000000000000024
>>> [ 568.232257] #PF: supervisor read access in kernel mode
>>> [ 568.232273] #PF: error_code(0x0000) - not-present page
>>> [ 568.232289] PGD 0 P4D 0
>>> [ 568.232299] Oops: 0000 [#1] SMP PTI
>>> [ 568.232312] CPU: 0 PID: 1029 Comm: dd Tainted: G E 5.1.0-io-dbg-4+ #6
>>> [ 568.232334] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
>>> [ 568.232388] RIP: 0010:bfq_serv_to_charge+0x21/0x50
>>> [ 568.232404] Code: ff e8 c3 5e bc ff 0f 1f 00 0f 1f 44 00 00 48 8b 86 20 01 00 00 55 48 89 e5 53 48 89 fb a8 40 75 09 83 be a0 01 00 00 01 76 09 <8b> 43 24 c1 e8 09 5b 5d c3 48 8b 7e 08 e8 5d fd ff ff 84 c0 75 ea
>>> [ 568.232473] RSP: 0018:ffffa73a42dab750 EFLAGS: 00010002
>>> [ 568.232489] RAX: 0000000000001052 RBX: 0000000000000000 RCX: ffffa73a42dab7a0
>>> [ 568.232510] RDX: ffffa73a42dab657 RSI: ffff8b7b6ba2ab70 RDI: 0000000000000000
>>> [ 568.232530] RBP: ffffa73a42dab758 R08: 0000000000000000 R09: 0000000000000001
>>> [ 568.232551] R10: 0000000000000000 R11: ffffa73a42dab7a0 R12: ffff8b7b6aed3800
>>> [ 568.232571] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8b7b6aed3800
>>> [ 568.232592] FS: 00007fb5b0724540(0000) GS:ffff8b7b6f800000(0000) knlGS:0000000000000000
>>> [ 568.232615] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 568.232632] CR2: 0000000000000024 CR3: 00000004266be002 CR4: 00000000001606f0
>>> [ 568.232690] Call Trace:
>>> [ 568.232703] bfq_select_queue+0x781/0x1000
>>> [ 568.232717] bfq_dispatch_request+0x1d7/0xd60
>>> [ 568.232731] ? bfq_bfqq_handle_idle_busy_switch.isra.36+0x2cd/0xb20
>>> [ 568.232751] blk_mq_do_dispatch_sched+0xa8/0xe0
>>> [ 568.232765] blk_mq_sched_dispatch_requests+0xe3/0x150
>>> [ 568.232783] __blk_mq_run_hw_queue+0x56/0x100
>>> [ 568.232798] __blk_mq_delay_run_hw_queue+0x107/0x160
>>> [ 568.232814] blk_mq_run_hw_queue+0x75/0x190
>>> [ 568.232828] blk_mq_sched_insert_requests+0x7a/0x100
>>> [ 568.232844] blk_mq_flush_plug_list+0x1d7/0x280
>>> [ 568.232859] blk_flush_plug_list+0xc2/0xe0
>>> [ 568.232872] blk_finish_plug+0x2c/0x40
>>> [ 568.232886] ext4_writepages+0x592/0xe60
>>> [ 568.233381] ? ext4_mark_iloc_dirty+0x52b/0x860
>>> [ 568.233851] do_writepages+0x3c/0xd0
>>> [ 568.234304] ? ext4_mark_inode_dirty+0x1a0/0x1a0
>>> [ 568.234748] ? do_writepages+0x3c/0xd0
>>> [ 568.235197] ? __generic_write_end+0x4e/0x80
>>> [ 568.235644] __filemap_fdatawrite_range+0xa5/0xe0
>>> [ 568.236089] ? __filemap_fdatawrite_range+0xa5/0xe0
>>> [ 568.236533] ? ext4_da_write_end+0x13c/0x280
>>> [ 568.236983] file_write_and_wait_range+0x5a/0xb0
>>> [ 568.237407] ext4_sync_file+0x11e/0x3e0
>>> [ 568.237819] vfs_fsync_range+0x48/0x80
>>> [ 568.238217] ext4_file_write_iter+0x234/0x3d0
>>> [ 568.238610] ? _cond_resched+0x19/0x40
>>> [ 568.238982] new_sync_write+0x112/0x190
>>> [ 568.239347] __vfs_write+0x29/0x40
>>> [ 568.239705] vfs_write+0xb1/0x1a0
>>> [ 568.240078] ksys_write+0x89/0xc0
>>> [ 568.240428] __x64_sys_write+0x1a/0x20
>>> [ 568.240771] do_syscall_64+0x5b/0x140
>>> [ 568.241115] entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>> [ 568.241456] RIP: 0033:0x7fb5b02325f4
>>> [ 568.241787] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 48 8d 05 09 11 2d 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 49 89 d4 53 48 89 f5
>>> [ 568.242842] RSP: 002b:00007ffcb12e2968 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>>> [ 568.243220] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb5b02325f4
>>> [ 568.243616] RDX: 0000000000000200 RSI: 000055698f2ad000 RDI: 0000000000000001
>>> [ 568.244026] RBP: 0000000000000200 R08: 0000000000000004 R09: 0000000000000003
>>> [ 568.244401] R10: 00007fb5b04feca0 R11: 0000000000000246 R12: 000055698f2ad000
>>> [ 568.244775] R13: 0000000000000000 R14: 0000000000000000 R15: 000055698f2ad000
>>> [ 568.245154] Modules linked in: xt_MASQUERADE(E) nf_conntrack_netlink(E) nfnetlink(E) xfrm_user(E) xfrm_algo(E) xt_addrtype(E) br_netfilter(E) bridge(E) stp(E) llc(E) overlay(E) vmw_vsock_vmci_transport(E) vsock(E) ip6table_filter(E) ip6_tables(E) xt_conntrack(E) iptable_mangle(E) iptable_nat(E) nf_nat(E) iptable_filter
>>> [ 568.248651] CR2: 0000000000000024
>>> [ 568.249142] ---[ end trace 0ddd315e0a5bdfba ]---
>>>



2019-05-30 10:47:34

by Paolo Valente

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller



> Il giorno 30 mag 2019, alle ore 10:29, Srivatsa S. Bhat <[email protected]> ha scritto:
>
> On 5/29/19 12:41 AM, Paolo Valente wrote:
>>
>>
>>> Il giorno 29 mag 2019, alle ore 03:09, Srivatsa S. Bhat <[email protected]> ha scritto:
>>>
>>> On 5/23/19 11:51 PM, Paolo Valente wrote:
>>>>
>>>>> Il giorno 24 mag 2019, alle ore 01:43, Srivatsa S. Bhat <[email protected]> ha scritto:
>>>>>
>>>>> When trying to run multiple dd tasks simultaneously, I get the kernel
>>>>> panic shown below (mainline is fine, without these patches).
>>>>>
>>>>
>>>> Could you please provide me somehow with a list *(bfq_serv_to_charge+0x21) ?
>>>>
>>>
>>> Hi Paolo,
>>>
>>> Sorry for the delay! Here you go:
>>>
>>> (gdb) list *(bfq_serv_to_charge+0x21)
>>> 0xffffffff814bad91 is in bfq_serv_to_charge (./include/linux/blkdev.h:919).
>>> 914
>>> 915 extern unsigned int blk_rq_err_bytes(const struct request *rq);
>>> 916
>>> 917 static inline unsigned int blk_rq_sectors(const struct request *rq)
>>> 918 {
>>> 919 return blk_rq_bytes(rq) >> SECTOR_SHIFT;
>>> 920 }
>>> 921
>>> 922 static inline unsigned int blk_rq_cur_sectors(const struct request *rq)
>>> 923 {
>>> (gdb)
>>>
>>>
>>> For some reason, I've not been able to reproduce this issue after
>>> reporting it here. (Perhaps I got lucky when I hit the kernel panic
>>> a bunch of times last week).
>>>
>>> I'll test with your fix applied and see how it goes.
>>>
>>
>> Great! the offending line above gives me hope that my fix is correct.
>> If no more failures occur, then I'm eager (and a little worried ...)
>> to see how it goes with throughput :)
>>
>
> Your fix held up well under my testing :)
>

Great!

> As for throughput, with low_latency = 1, I get around 1.4 MB/s with
> bfq (vs 1.6 MB/s with mq-deadline). This is a huge improvement
> compared to what it was before (70 KB/s).
>

That's beautiful news!

So, now we have the best of both worlds: maximum throughput and
total control over I/O (including minimum latency for interactive and
soft real-time applications). Besides, no manual configuration is
needed. Of course, this holds unless/until you find other flaws ... ;)

> With tracing on, the throughput is a bit lower (as expected I guess),
> about 1 MB/s, and the corresponding trace file
> (trace-waker-detection-1MBps) is available at:
>
> https://www.dropbox.com/s/3roycp1zwk372zo/bfq-traces.tar.gz?dl=0
>

Thank you for the new trace. I've analyzed it carefully, and, as I
imagined, this residual 12% throughput loss is due to a couple of
heuristics that occasionally get something wrong. Most likely, ~12%
is the worst-case loss, and if one repeats the tests, the loss may be
much lower in some runs.

I think it is very hard to eliminate this fluctuation while keeping
full I/O control. But, who knows, I might have some lucky idea in the
future.

At any rate, since you pointed out that you are interested in
out-of-the-box performance, let me complete the context: in case
low_latency is left set, one gets, in return for this 12% loss,
a) at least 1000% higher responsiveness, e.g., 1000% lower start-up
times of applications under load [1];
b) 500-1000% higher throughput in multi-client server workloads, as I
already pointed out [2].

I'm going to prepare complete patches. In addition, if ok for you,
I'll report these results on the bug you created. Then I guess we can
close it.

[1] https://algo.ing.unimo.it/people/paolo/disk_sched/results.php
[2] https://www.linaro.org/blog/io-bandwidth-management-for-production-quality-services/

> Thank you so much for your tireless efforts in fixing this issue!
>

I did enjoy working on this with you: your test case and your support
enabled me to make important improvements. So, thank you very much
for your collaboration so far,
Paolo


> Regards,
> Srivatsa
> VMware Photon OS



2019-06-02 07:36:26

by Srivatsa S. Bhat

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

On 5/30/19 3:45 AM, Paolo Valente wrote:
>
>
>> Il giorno 30 mag 2019, alle ore 10:29, Srivatsa S. Bhat <[email protected]> ha scritto:
>>
[...]
>>
>> Your fix held up well under my testing :)
>>
>
> Great!
>
>> As for throughput, with low_latency = 1, I get around 1.4 MB/s with
>> bfq (vs 1.6 MB/s with mq-deadline). This is a huge improvement
>> compared to what it was before (70 KB/s).
>>
>
> That's beautiful news!
>
> So, now we have the best of both worlds: maximum throughput and
> total control over I/O (including minimum latency for interactive and
> soft real-time applications). Besides, no manual configuration is
> needed. Of course, this holds unless/until you find other flaws ... ;)
>

Indeed, that's awesome! :)

>> With tracing on, the throughput is a bit lower (as expected I guess),
>> about 1 MB/s, and the corresponding trace file
>> (trace-waker-detection-1MBps) is available at:
>>
>> https://www.dropbox.com/s/3roycp1zwk372zo/bfq-traces.tar.gz?dl=0
>>
>
> Thank you for the new trace. I've analyzed it carefully, and, as I
> imagined, this residual 12% throughput loss is due to a couple of
> heuristics that occasionally get something wrong. Most likely, ~12%
> is the worst-case loss, and if one repeats the tests, the loss may be
> much lower in some runs.
>

Ah, I see.

> I think it is very hard to eliminate this fluctuation while keeping
> full I/O control. But, who knows, I might have some lucky idea in the
> future.
>

:)

> At any rate, since you pointed out that you are interested in
> out-of-the-box performance, let me complete the context: in case
> low_latency is left set, one gets, in return for this 12% loss,
> a) at least 1000% higher responsiveness, e.g., 1000% lower start-up
> times of applications under load [1];
> b) 500-1000% higher throughput in multi-client server workloads, as I
> already pointed out [2].
>

I'm very happy that you could solve the problem without having to
compromise on any of the performance characteristics/features of BFQ!


> I'm going to prepare complete patches. In addition, if ok for you,
> I'll report these results on the bug you created. Then I guess we can
> close it.
>

Sounds great!

> [1] https://algo.ing.unimo.it/people/paolo/disk_sched/results.php
> [2] https://www.linaro.org/blog/io-bandwidth-management-for-production-quality-services/
>
>> Thank you so much for your tireless efforts in fixing this issue!
>>
>
> I did enjoy working on this with you: your test case and your support
> enabled me to make important improvements. So, thank you very much
> for your collaboration so far,
> Paolo

My pleasure! :)

Regards,
Srivatsa
VMware Photon OS

2019-06-12 05:18:15

by Srivatsa S. Bhat

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

On 6/2/19 12:04 AM, Srivatsa S. Bhat wrote:
> On 5/30/19 3:45 AM, Paolo Valente wrote:
>>
[...]
>> At any rate, since you pointed out that you are interested in
>> out-of-the-box performance, let me complete the context: in case
>> low_latency is left set, one gets, in return for this 12% loss,
>> a) at least 1000% higher responsiveness, e.g., 1000% lower start-up
>> times of applications under load [1];
>> b) 500-1000% higher throughput in multi-client server workloads, as I
>> already pointed out [2].
>>
>
> I'm very happy that you could solve the problem without having to
> compromise on any of the performance characteristics/features of BFQ!
>
>
>> I'm going to prepare complete patches. In addition, if ok for you,
>> I'll report these results on the bug you created. Then I guess we can
>> close it.
>>
>
> Sounds great!
>

Hi Paolo,

Hope you are doing great!

I was wondering if you got a chance to post these patches to LKML for
review and inclusion... (No hurry, of course!)

Also, since your fixes address the performance issues in BFQ, do you
have any thoughts on whether they can be adapted to CFQ as well, to
benefit the older stable kernels that still support CFQ?

Thank you!

Regards,
Srivatsa
VMware Photon OS

2019-06-12 19:37:27

by Srivatsa S. Bhat

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller


[ Adding Greg to CC ]

On 6/12/19 6:04 AM, Jan Kara wrote:
> On Tue 11-06-19 15:34:48, Srivatsa S. Bhat wrote:
>> On 6/2/19 12:04 AM, Srivatsa S. Bhat wrote:
>>> On 5/30/19 3:45 AM, Paolo Valente wrote:
>>>>
>> [...]
>>>> At any rate, since you pointed out that you are interested in
>>>> out-of-the-box performance, let me complete the context: in case
>>>> low_latency is left set, one gets, in return for this 12% loss,
>>>> a) at least 1000% higher responsiveness, e.g., 1000% lower start-up
>>>> times of applications under load [1];
>>>> b) 500-1000% higher throughput in multi-client server workloads, as I
>>>> already pointed out [2].
>>>>
>>>
>>> I'm very happy that you could solve the problem without having to
>>> compromise on any of the performance characteristics/features of BFQ!
>>>
>>>
>>>> I'm going to prepare complete patches. In addition, if ok for you,
>>>> I'll report these results on the bug you created. Then I guess we can
>>>> close it.
>>>>
>>>
>>> Sounds great!
>>>
>>
>> Hi Paolo,
>>
>> Hope you are doing great!
>>
>> I was wondering if you got a chance to post these patches to LKML for
>> review and inclusion... (No hurry, of course!)
>>
>> Also, since your fixes address the performance issues in BFQ, do you
>> have any thoughts on whether they can be adapted to CFQ as well, to
>> benefit the older stable kernels that still support CFQ?
>
> Since CFQ doesn't exist in current upstream kernel anymore, I seriously
> doubt you'll be able to get any performance improvements for it in the
> stable kernels...
>

I suspected as much, but that seems unfortunate though. The latest LTS
kernel is based on 4.19, which still supports CFQ. It would have been
great to have a process to address significant issues on older
kernels too.

Greg, do you have any thoughts on this? The context is that both the
CFQ and BFQ I/O schedulers have issues that cause I/O throughput to
suffer up to 10x-30x on certain workloads and system configurations,
as reported in [1].

In this thread, Paolo posted patches to fix BFQ performance on
mainline. However CFQ suffers from the same performance collapse, but
CFQ was removed from the kernel in v5.0. So obviously the usual stable
backporting path won't work here for several reasons:

1. There won't be a mainline commit to backport from, as CFQ no
longer exists in mainline.

2. This is not a security/stability fix, and is likely to involve
invasive changes.

I was wondering if there was a way to address the performance issues
in CFQ in the older stable kernels (including the latest LTS 4.19),
despite the above constraints, since the performance drop is much too
significant. I guess not, but thought I'd ask :-)

[1]. https://lore.kernel.org/lkml/[email protected]/


Regards,
Srivatsa
VMware Photon OS

2019-06-13 16:24:49

by Jens Axboe

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

On 6/12/19 1:36 PM, Srivatsa S. Bhat wrote:
>
> [ Adding Greg to CC ]
>
> On 6/12/19 6:04 AM, Jan Kara wrote:
>> On Tue 11-06-19 15:34:48, Srivatsa S. Bhat wrote:
>>> On 6/2/19 12:04 AM, Srivatsa S. Bhat wrote:
>>>> On 5/30/19 3:45 AM, Paolo Valente wrote:
>>>>>
>>> [...]
>>>>> At any rate, since you pointed out that you are interested in
>>>>> out-of-the-box performance, let me complete the context: in case
>>>>> low_latency is left set, one gets, in return for this 12% loss,
>>>>> a) at least 1000% higher responsiveness, e.g., 1000% lower start-up
>>>>> times of applications under load [1];
>>>>> b) 500-1000% higher throughput in multi-client server workloads, as I
>>>>> already pointed out [2].
>>>>>
>>>>
>>>> I'm very happy that you could solve the problem without having to
>>>> compromise on any of the performance characteristics/features of BFQ!
>>>>
>>>>
>>>>> I'm going to prepare complete patches. In addition, if ok for you,
>>>>> I'll report these results on the bug you created. Then I guess we can
>>>>> close it.
>>>>>
>>>>
>>>> Sounds great!
>>>>
>>>
>>> Hi Paolo,
>>>
>>> Hope you are doing great!
>>>
>>> I was wondering if you got a chance to post these patches to LKML for
>>> review and inclusion... (No hurry, of course!)
>>>
>>> Also, since your fixes address the performance issues in BFQ, do you
>>> have any thoughts on whether they can be adapted to CFQ as well, to
>>> benefit the older stable kernels that still support CFQ?
>>
>> Since CFQ doesn't exist in current upstream kernel anymore, I seriously
>> doubt you'll be able to get any performance improvements for it in the
>> stable kernels...
>>
>
> I suspected as much, but that seems unfortunate though. The latest LTS
> kernel is based on 4.19, which still supports CFQ. It would have been
> great to have a process to address significant issues on older
> kernels too.
>
> Greg, do you have any thoughts on this? The context is that both the
> CFQ and BFQ I/O schedulers have issues that cause I/O throughput to
> suffer up to 10x-30x on certain workloads and system configurations,
> as reported in [1].
>
> In this thread, Paolo posted patches to fix BFQ performance on
> mainline. However CFQ suffers from the same performance collapse, but
> CFQ was removed from the kernel in v5.0. So obviously the usual stable
> backporting path won't work here for several reasons:
>
> 1. There won't be a mainline commit to backport from, as CFQ no
> longer exists in mainline.
>
> 2. This is not a security/stability fix, and is likely to involve
> invasive changes.
>
> I was wondering if there was a way to address the performance issues
> in CFQ in the older stable kernels (including the latest LTS 4.19),
> despite the above constraints, since the performance drop is much too
> significant. I guess not, but thought I'd ask :-)
>
> [1]. https://lore.kernel.org/lkml/[email protected]/

This issue has always been there. There will be no specific patches made
for stable for something that doesn't even exist in the newer kernels.

--
Jens Axboe

2019-06-13 16:46:25

by Paolo Valente

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller



> Il giorno 12 giu 2019, alle ore 00:34, Srivatsa S. Bhat <[email protected]> ha scritto:
>
> On 6/2/19 12:04 AM, Srivatsa S. Bhat wrote:
>> On 5/30/19 3:45 AM, Paolo Valente wrote:
>>>
> [...]
>>> At any rate, since you pointed out that you are interested in
>>> out-of-the-box performance, let me complete the context: in case
>>> low_latency is left set, one gets, in return for this 12% loss,
>>> a) at least 1000% higher responsiveness, e.g., 1000% lower start-up
>>> times of applications under load [1];
>>> b) 500-1000% higher throughput in multi-client server workloads, as I
>>> already pointed out [2].
>>>
>>
>> I'm very happy that you could solve the problem without having to
>> compromise on any of the performance characteristics/features of BFQ!
>>
>>
>>> I'm going to prepare complete patches. In addition, if ok for you,
>>> I'll report these results on the bug you created. Then I guess we can
>>> close it.
>>>
>>
>> Sounds great!
>>
>
> Hi Paolo,
>

Hi

> Hope you are doing great!
>

Sort of, thanks :)

> I was wondering if you got a chance to post these patches to LKML for
> review and inclusion... (No hurry, of course!)
>


I'm having troubles testing these new patches on 5.2-rc4. As it
happened with the first release candidates for 5.1, the CPU of my test
machine (Intel Core [email protected]) is so slowed down that results
are heavily distorted with every I/O scheduler.

Unfortunately, I'm not competent enough to spot the cause of this
regression in a feasible amount of time. I hope it'll go away with
next release candidates, or I'll test on 5.1.

> Also, since your fixes address the performance issues in BFQ, do you
> have any thoughts on whether they can be adapted to CFQ as well, to
> benefit the older stable kernels that still support CFQ?
>

I have implemented my fixes on top of the existing throughput-boosting
infrastructure of BFQ. CFQ doesn't have such an infrastructure.

If you need I/O control with older kernels, you may want to check my
version of BFQ for legacy block, named bfq-sq and available in this
repo:
https://github.com/Algodev-github/bfq-mq/

I'm willing to provide you with any information or help if needed.

Thanks,
Paolo


> Thank you!
>
> Regards,
> Srivatsa
> VMware Photon OS



2019-06-13 19:14:02

by Srivatsa S. Bhat

Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

On 6/12/19 10:46 PM, Paolo Valente wrote:
>
>> Il giorno 12 giu 2019, alle ore 00:34, Srivatsa S. Bhat <[email protected]> ha scritto:
>>
[...]
>>
>> Hi Paolo,
>>
>
> Hi
>
>> Hope you are doing great!
>>
>
> Sort of, thanks :)
>
>> I was wondering if you got a chance to post these patches to LKML for
>> review and inclusion... (No hurry, of course!)
>>
>
>
> I'm having troubles testing these new patches on 5.2-rc4. As it
> happened with the first release candidates for 5.1, the CPU of my test
> machine (Intel Core [email protected]) is so slowed down that results
> are heavily distorted with every I/O scheduler.
>

Oh, that's unfortunate!

> Unfortunately, I'm not competent enough to spot the cause of this
> regression in a feasible amount of time. I hope it'll go away with
> next release candidates, or I'll test on 5.1.
>

Sounds good to me!

>> Also, since your fixes address the performance issues in BFQ, do you
>> have any thoughts on whether they can be adapted to CFQ as well, to
>> benefit the older stable kernels that still support CFQ?
>>
>
> I have implemented my fixes on top of the existing throughput-boosting
> infrastructure of BFQ. CFQ doesn't have such an infrastructure.
>
> If you need I/O control with older kernels, you may want to check my
> version of BFQ for legacy block, named bfq-sq and available in this
> repo:
> https://github.com/Algodev-github/bfq-mq/
>

Great! Thank you for sharing this!

> I'm willing to provide you with any information or help if needed.
>
Thank you!

Regards,
Srivatsa
VMware Photon OS