2009-11-16 20:50:55

by Alan D. Brunelle

Subject: Re: [RFC] Block IO Controller V2 - some results

Hi Vivek:

I'm finding some things that don't quite seem right - executive
summary:

o I think the apportionment algorithm doesn't work consistently well
for writes.

o I think there are problems with significant performance loss when
doing random I/Os.

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

Test configuration: HP dl585 (32-way quad-core AMD Opteron processors +
128GB RAM + 4 FC HBAs + 4 MSA 1000's (each exporting 3 multi-disk
striped LUNs)). Running with Jens Axboe's remotes/origin/for-2.6.33
branch (at commit 8721c81f6480e2c9acbf92078383953f825d1057) w/out and
w/ your V2 patch.

The test: 12 Ext3 file systems (1 per disk), each file system has eight
8GB files on it. Doing simple fio runs in various modes and I/O
directions: random or sequential, read or write or read/write (80%/20%).
Using 2, 4 or 8 processes per file system (each process working on a
different file). Here is a sample fio command file:

[global]
ioengine=sync
size=8g
overwrite=0
runtime=120
bs=256k
readwrite=write
[/mnt/sdl/data.7]
filename=/mnt/sdl/data.7

I'm then using cgroups that have IO weights as follows:

/cgroup/test0/blkio.weight 100
/cgroup/test1/blkio.weight 200
/cgroup/test2/blkio.weight 300
/cgroup/test3/blkio.weight 400
/cgroup/test4/blkio.weight 500
/cgroup/test5/blkio.weight 600
/cgroup/test6/blkio.weight 700
/cgroup/test7/blkio.weight 800

There were 12 x N total processes running in the system for each test,
and each file system would have N processes, each working on a different
file in that file system. The N processes were assigned to increasing
test groups: process 0 was in test0's group working on file 0 in a file
system; process 1 was in test1's group working on file 1 in a file
system; and so on.

Before each test I drop caches & umount/mount the filesystem anew.
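
For reference, the per-test setup boils down to something like this
(group names, mount points and the PID plumbing here are illustrative,
not the actual script):

  # create the cgroups and assign the weights (blkio controller
  # mounted on /cgroup)
  for i in $(seq 0 7); do
      mkdir -p /cgroup/test$i
      echo $(( (i + 1) * 100 )) > /cgroup/test$i/blkio.weight
  done

  # before each run: flush dirty data, drop caches and remount
  # (shown for one of the 12 file systems)
  sync
  echo 3 > /proc/sys/vm/drop_caches
  umount /mnt/sdl
  mount /dev/sdl /mnt/sdl

  # process i of each file system is moved into test$i before fio
  # starts issuing I/O
  echo $fio_pid > /cgroup/test$i/tasks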

In the following tables:

'base' - means a kernel generated from Jens' branch (-no- patching)

'ioc off' - means a kernel generated w/ your patches added but -no-
other settings (no CGROUP stuff mounted or enabled)

'ioc no idle' - means the ioc kernel w/ CGROUP stuff enabled
-but- /sys/block/sd*/queue/iosched/cgroup_idle = 0

'ioc idle' - means the ioc kernel w/ CGROUP stuff enabled
-and- /sys/block/sd*/queue/iosched/cgroup_idle = 1

Modes: random or sequential

RdWr: rd==read, wr==write, rdwr==80%read & 20%write

N: Number of processes per disk

testX: Processes sharing a task group (when enabled)

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

The first thing to do is to check for correctness: when the I/O
controller is enabled do we see correctly apportioned I/O?

At the tail end of the e-mail I've placed three (3) tables showing the
state where -no- differences should be seen between the various "task"
groups in terms of performance ("level playing field"), and sure enough
no differences were seen. These were done basically as a "control" set
of tests - the script being used didn't have any inherent biases in
it.[1]

This table shows the cases where we should see a difference based upon
weights:

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
ioc idle rnd rd 2 2.8 6.3
ioc idle rnd rd 4 0.7 1.5 2.5 3.5
ioc idle rnd rd 8 0.2 0.4 0.5 0.8 0.9 1.2 1.4 1.7

ioc idle rnd wr 2 38.2 192.7
ioc idle rnd wr 4 1.0 17.7 38.1 204.5
ioc idle rnd wr 8 0.3 0.6 0.9 1.5 2.2 16.3 16.6 208.3

ioc idle rnd rdwr 2 4.9 11.3
ioc idle rnd rdwr 4 0.9 2.4 4.3 6.2
ioc idle rnd rdwr 8 0.2 0.5 0.8 1.1 1.4 1.8 2.2 2.7


ioc idle seq rd 2 221.0 386.4
ioc idle seq rd 4 69.8 128.1 183.2 226.8
ioc idle seq rd 8 21.4 40.0 55.6 70.8 85.2 98.3 111.6 121.9

ioc idle seq wr 2 398.6 391.6
ioc idle seq wr 4 219.0 214.5 214.1 214.5
ioc idle seq wr 8 107.6 106.8 104.7 102.5 99.5 99.5 100.5 100.8

ioc idle seq rdwr 2 196.8 340.9
ioc idle seq rdwr 4 64.0 109.6 148.7 183.5
ioc idle seq rdwr 8 22.6 36.6 48.8 61.1 70.3 78.5 84.9 94.3

In general, we do see throughput increasing in weight order, but I don't
think the proportions are correct in all cases. (With the weights above,
the two-process runs should split roughly 1:2, and in the eight-process
runs test7 should get about 800/3600, i.e. roughly 22%, of the
aggregate.)

In the random tests, for example, the read distribution looks pretty
decent, but random writes are all off - for some reason the most heavily
weighted group is getting a disproportionately large percentage of the
I/O bandwidth.

For the sequential loads, the reads look "OK" - not quite proportionally
correct when we have 8 processes running against the devices, but on the
whole things look ok. Sequential writes are not working well at all:
the distribution is relatively flat.

I _think_ this is pointing to some real problems in the write cases for
both random & sequential I/Os.

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

The next thing to look at is the "penalty" for the additional code: how
much bandwidth we lose for the capability added. Here is the sum of the
system's throughput for the various tests:

---- ---- - ----------- ----------- ----------- -----------
Mode RdWr N base ioc off ioc no idle ioc idle
---- ---- - ----------- ----------- ----------- -----------
rnd rd 2 17.3 17.1 9.4 9.1
rnd rd 4 27.1 27.1 8.1 8.2
rnd rd 8 37.1 37.1 6.8 7.1

rnd wr 2 296.5 243.7 290.2 230.9
rnd wr 4 287.3 280.7 270.4 261.3
rnd wr 8 272.5 273.1 237.7 246.5

rnd rdwr 2 27.4 27.7 16.1 16.2
rnd rdwr 4 38.3 39.3 13.5 13.9
rnd rdwr 8 62.0 61.5 10.0 10.7

seq rd 2 610.2 608.1 610.7 607.4
seq rd 4 608.4 601.5 609.3 608.0
seq rd 8 605.7 603.7 605.0 604.8

seq wr 2 840.3 850.2 836.8 790.2
seq wr 4 886.8 891.6 868.2 862.2
seq wr 8 865.1 887.1 832.1 822.0

seq rdwr 2 536.2 550.0 538.1 537.7
seq rdwr 4 595.3 605.7 512.9 505.8
seq rdwr 8 617.3 628.5 526.6 497.1

The sequential runs look very good - not much variance across the board.

The random results look horrible, especially when reads are involved:
the first two columns (base & ioc off) are very similar; however, note
the significant drop in overall system performance once the
io-controller CGROUP stuff gets involved - the more processes involved,
the more performance is lost.

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

I'm going to spend some time drilling down into three specific tests:

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
ioc idle rnd wr 2 38.2 192.7
ioc idle seq wr 2 398.6 391.6

I can use this test to see why random writes are so disproportionately
apportioned - it should be 2-to-1 but we are seeing something like
6-to-1. And then I can look at why sequential writes are flat.

and:

---- ---- - ----------- ----------- ----------- -----------
Mode RdWr N base ioc off ioc no idle ioc idle
---- ---- - ----------- ----------- ----------- -----------
rnd rd 2 17.3 17.1 9.4 9.1

I will try to find out why we are seeing such a loss in system
performance...

Regards,
Alan D. Brunelle
Hewlett-Packard / Linux Kernel Technology Team

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
[1] Three tables showing how the I/O load was distributed when either
there was no I/O controller code, or it was turned off, or cgroup_idle
was turned off. All looks sane - with the exception of the ioc-enabled
kernel with no-idle set: for random writes there appear to be some
differences, but not an appreciable amount.

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
base rnd rd 2 8.6 8.6
base rnd rd 4 6.8 6.8 6.8 6.7
base rnd rd 8 4.7 4.6 4.6 4.6 4.6 4.6 4.6 4.6

base rnd wr 2 150.4 146.1
base rnd wr 4 75.2 74.8 68.1 69.2
base rnd wr 8 36.2 39.3 29.6 35.9 32.9 37.0 29.6 32.2

base rnd rdwr 2 13.7 13.7
base rnd rdwr 4 9.6 9.6 9.6 9.6
base rnd rdwr 8 7.8 7.8 7.7 7.8 7.8 7.7 7.7 7.8


base seq rd 2 306.2 304.0
base seq rd 4 150.1 152.4 151.9 154.0
base seq rd 8 77.2 75.9 75.9 73.9 77.0 75.7 75.0 74.9

base seq wr 2 420.2 420.1
base seq wr 4 220.5 222.5 221.9 221.9
base seq wr 8 108.2 108.8 107.8 107.7 108.7 108.5 108.1 107.2

base seq rdwr 2 268.4 267.8
base seq rdwr 4 148.9 150.6 147.8 148.0
base seq rdwr 8 78.0 77.7 76.3 76.0 79.1 77.9 74.3 77.9

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
ioc off rnd rd 2 8.6 8.6
ioc off rnd rd 4 6.8 6.8 6.7 6.7
ioc off rnd rd 8 4.7 4.6 4.6 4.7 4.6 4.6 4.6 4.6

ioc off rnd wr 2 112.6 131.1
ioc off rnd wr 4 64.9 67.8 79.9 68.1
ioc off rnd wr 8 35.1 39.5 31.5 32.0 36.1 34.5 30.8 33.5

ioc off rnd rdwr 2 13.8 13.8
ioc off rnd rdwr 4 9.8 9.8 9.9 9.8
ioc off rnd rdwr 8 7.7 7.7 7.7 7.7 7.7 7.7 7.7 7.7


ioc off seq rd 2 303.1 305.0
ioc off seq rd 4 150.8 151.6 149.0 150.2
ioc off seq rd 8 77.0 76.3 74.5 74.0 77.9 75.5 74.0 74.6

ioc off seq wr 2 424.6 425.5
ioc off seq wr 4 223.0 222.4 223.9 222.3
ioc off seq wr 8 110.8 112.0 111.3 109.6 111.7 111.3 110.8 109.7

ioc off seq rdwr 2 274.3 275.8
ioc off seq rdwr 4 151.3 154.8 149.0 150.6
ioc off seq rdwr 8 81.1 80.6 77.8 74.8 81.0 78.5 77.0 77.7

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
ioc no idle rnd rd 2 4.7 4.7
ioc no idle rnd rd 4 2.0 2.0 2.0 2.0
ioc no idle rnd rd 8 0.9 0.9 0.8 0.8 0.8 0.8 0.9 0.9

ioc no idle rnd wr 2 144.8 145.4
ioc no idle rnd wr 4 73.2 65.9 65.5 65.8
ioc no idle rnd wr 8 35.5 52.5 26.2 31.0 25.5 19.3 25.1 22.6

ioc no idle rnd rdwr 2 8.1 8.1
ioc no idle rnd rdwr 4 3.4 3.4 3.4 3.4
ioc no idle rnd rdwr 8 1.3 1.3 1.3 1.2 1.2 1.3 1.2 1.3


ioc no idle seq rd 2 304.1 306.6
ioc no idle seq rd 4 152.1 154.5 149.8 153.0
ioc no idle seq rd 8 75.8 75.8 75.2 75.1 75.5 75.3 75.7 76.5

ioc no idle seq wr 2 418.6 418.2
ioc no idle seq wr 4 217.7 217.7 215.4 217.4
ioc no idle seq wr 8 105.5 105.8 105.8 103.4 102.9 103.1 102.7 102.8

ioc no idle seq rdwr 2 269.2 269.0
ioc no idle seq rdwr 4 130.0 126.4 127.8 128.6
ioc no idle seq rdwr 8 67.2 66.6 65.4 65.0 65.3 64.8 65.7 66.5




2009-11-16 21:19:16

by Vivek Goyal

Subject: Re: [RFC] Block IO Controller V2 - some results

On Mon, Nov 16, 2009 at 03:51:00PM -0500, Alan D. Brunelle wrote:
> Hi Vivek:
>
> I'm finding some things that don't quite seem right - executive
> summary:

Hi Alan,

Thanks a lot for such extensive testing and test results. I am still
digesting the results, but I thought I would make a quick note about
writes. This patchset works only for sync IO. If you are performing
buffered writes then you will not see any service differentiation.
Providing support for the buffered write path is on the TODO list.

>
> o I think the apportionment algorithm doesn't work consistently well
> for writes.
>
> o I think there are problems with significant performance loss when
> doing random I/Os.

This concerns me. I had a quick look, and as per your results, even with
group_idle=0 you are seeing this regression. I guess this might be coming
from the fact that we idle on the sync-noidle workload per group and that
idling becomes significant as the number of groups increases.

Thanks
Vivek

>
> ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
>
> Test configuration: HP dl585 (32-way quad-core AMD Opteron processors +
> 128GB RAM + 4 FC HBAs + 4 MSA 1000's (each exporting 3 multi-disk
> striped LUNs)). Running with Jens Axboe's remotes/origin/for-2.6.33
> branch (at commit 8721c81f6480e2c9acbf92078383953f825d1057)) w/out and
> w/ your V2 patch.
>
> The test: 12 Ext3 file systems (1 per disk), each file system has eight
> 8GB files on it. Doing simple fio runs in various modes and I/O
> directions: random or sequential, read or write or read/write (80%/20%).
> Using 2, 4 or 8 processes per file system (each process working on a
> different file). Here is a sample fio command file:
>
> [global]
> ioengine=sync
> size=8g
> overwrite=0
> runtime=120
> bs=256k
> readwrite=write
> [/mnt/sdl/data.7]
> filename=/mnt/sdl/data.7
>
> I'm then using cgroups that have IO weights as follows:
>
> /cgroup/test0/blkio.weight 100
> /cgroup/test1/blkio.weight 200
> /cgroup/test2/blkio.weight 300
> /cgroup/test3/blkio.weight 400
> /cgroup/test4/blkio.weight 500
> /cgroup/test5/blkio.weight 600
> /cgroup/test6/blkio.weight 700
> /cgroup/test7/blkio.weight 800
>
> There were 12 X N total processes running in the system for each test,
> and each file system would have N process working on a different file in
> that file system. The N processes would be assigned to increasing test
> groups: process 0 will be in test0's group and working on file 0 in a
> file system; process 1 will be in test1's group and working on file 1 in
> a file system; and so on.
>
> Before each test I drop caches & umount/mount the filesystem anew.
>
> In the following tables:
>
> 'base' - means a kernel generated from Jens' branch (-no- patching)
>
> 'ioc off' - means a kernel generated w/ your patches added but -no-
> other settings (no CGROUP stuff mounted or enabled)
>
> 'ioc no idle' - means the ioc kernel w/ CGROUP stuff enabled
> -but- /sys/block/sd*/queue/iosched/cgroup_idle = 0
>
> 'ioc idle' - means the ioc kernel w/ CGROUP stuff enabled
> -and- /sys/block/sd*/queue/iosched/cgroup_idle = 1
>
> Modes: random or sequential
>
> RdWr: rd==read, wr==write, rdwr==80%read & 20%write
>
> N: Number of processes per disk
>
> testX: Processes sharing a task group (when enabled)
>
> ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
>
> The first thing to do is to check for correctness: when the I/O
> controller is enabled do we see correctly apportioned I/O?
>
> At the tail end of the e-mail I've placed three (3) tables showing the
> state where -no- differences should be seen between the various "task"
> groups in terms of performance ("level playing field"), and sure enough
> no differences were seen. These were done basically as a "control" set
> of tests - the script being used didn't have any inherent biases in
> it.[1]
>
> This table shows the cases where we should see a difference based upon
> weights:
>
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> ioc idle rnd rd 2 2.8 6.3
> ioc idle rnd rd 4 0.7 1.5 2.5 3.5
> ioc idle rnd rd 8 0.2 0.4 0.5 0.8 0.9 1.2 1.4 1.7
>
> ioc idle rnd wr 2 38.2 192.7
> ioc idle rnd wr 4 1.0 17.7 38.1 204.5
> ioc idle rnd wr 8 0.3 0.6 0.9 1.5 2.2 16.3 16.6 208.3
>
> ioc idle rnd rdwr 2 4.9 11.3
> ioc idle rnd rdwr 4 0.9 2.4 4.3 6.2
> ioc idle rnd rdwr 8 0.2 0.5 0.8 1.1 1.4 1.8 2.2 2.7
>
>
> ioc idle seq rd 2 221.0 386.4
> ioc idle seq rd 4 69.8 128.1 183.2 226.8
> ioc idle seq rd 8 21.4 40.0 55.6 70.8 85.2 98.3 111.6 121.9
>
> ioc idle seq wr 2 398.6 391.6
> ioc idle seq wr 4 219.0 214.5 214.1 214.5
> ioc idle seq wr 8 107.6 106.8 104.7 102.5 99.5 99.5 100.5 100.8
>
> ioc idle seq rdwr 2 196.8 340.9
> ioc idle seq rdwr 4 64.0 109.6 148.7 183.5
> ioc idle seq rdwr 8 22.6 36.6 48.8 61.1 70.3 78.5 84.9 94.3
>
> In general, we do see weights associated in correctly increasing order,
> but I don't think the proportions are done correctly in all cases.
>
> In the random tests for example, the read distribution looks pretty
> decent, but random writes are all off - for some reason the highest
> priority (most heavily weighted) is getting a disproportionately large
> percentage of the I/O bandwidth.
>
> For the sequential loads, the reads look "OK" - not quite correctly fair
> when we have 8 processes running against the devices, but on the whole
> things look ok. Sequential writes are not working well at all:
> relatively flat distribution.
>
> I _think_ this is pointing to some real problems in both the write cases
> for both random & sequential I/Os.
>
> ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
>
> The next thing to look at is to see what the "penalty" is for the
> additional code: see how much bandwidth we lose for the capability
> added. Here we see the sum of the system's throughput for the various
> tests:
>
> ---- ---- - ----------- ----------- ----------- -----------
> Mode RdWr N base ioc off ioc no idle ioc idle
> ---- ---- - ----------- ----------- ----------- -----------
> rnd rd 2 17.3 17.1 9.4 9.1
> rnd rd 4 27.1 27.1 8.1 8.2
> rnd rd 8 37.1 37.1 6.8 7.1
>
> rnd wr 2 296.5 243.7 290.2 230.9
> rnd wr 4 287.3 280.7 270.4 261.3
> rnd wr 8 272.5 273.1 237.7 246.5
>
> rnd rdwr 2 27.4 27.7 16.1 16.2
> rnd rdwr 4 38.3 39.3 13.5 13.9
> rnd rdwr 8 62.0 61.5 10.0 10.7
>
> seq rd 2 610.2 608.1 610.7 607.4
> seq rd 4 608.4 601.5 609.3 608.0
> seq rd 8 605.7 603.7 605.0 604.8
>
> seq wr 2 840.3 850.2 836.8 790.2
> seq wr 4 886.8 891.6 868.2 862.2
> seq wr 8 865.1 887.1 832.1 822.0
>
> seq rdwr 2 536.2 550.0 538.1 537.7
> seq rdwr 4 595.3 605.7 512.9 505.8
> seq rdwr 8 617.3 628.5 526.6 497.1
>
> The sequential runs look very good - not much variance across the board.
>
> The random results look horrible, especially when reads are involved:
> The first two columns (base & ioc off) are very similar, however note
> the significant drop in overall system performance once the
> io-controller CGROUP stuff gets involved - the more processes involved
> the more performance is lost.
>
> ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
>
> I'm going to spend some time drilling down into three specific tests:
>
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> ioc idle rnd wr 2 38.2 192.7
> ioc idle seq wr 2 398.6 391.6
>
> This test I can use to see why random writes are so disproportionately
> apportioned - it should be 2-to-1 but we are seeing something like
> 6-to-1. And then I can look at why sequential writes are flat.
>
> and:
>
> ---- ---- - ----------- ----------- ----------- -----------
> Mode RdWr N base ioc off ioc no idle ioc idle
> ---- ---- - ----------- ----------- ----------- -----------
> rnd rd 2 17.3 17.1 9.4 9.1
>
> I will try to find out why we are seeing such a loss in system
> performance...
>
> Regards,
> Alan D. Brunelle
> Hewlett-Packard / Linux Kernel Technology Team
>
> ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
> [1] Three tables showing the I/O load distributed when either there was
> no I/O controller code or when it was turned off or when cgroup_idle was
> turned off. All looks sane - with the exception of the ioc-enabled
> kernel with no-idle set - for random writes it appears like there is
> some differences, but not an appreciable amount?
>
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> base rnd rd 2 8.6 8.6
> base rnd rd 4 6.8 6.8 6.8 6.7
> base rnd rd 8 4.7 4.6 4.6 4.6 4.6 4.6 4.6 4.6
>
> base rnd wr 2 150.4 146.1
> base rnd wr 4 75.2 74.8 68.1 69.2
> base rnd wr 8 36.2 39.3 29.6 35.9 32.9 37.0 29.6 32.2
>
> base rnd rdwr 2 13.7 13.7
> base rnd rdwr 4 9.6 9.6 9.6 9.6
> base rnd rdwr 8 7.8 7.8 7.7 7.8 7.8 7.7 7.7 7.8
>
>
> base seq rd 2 306.2 304.0
> base seq rd 4 150.1 152.4 151.9 154.0
> base seq rd 8 77.2 75.9 75.9 73.9 77.0 75.7 75.0 74.9
>
> base seq wr 2 420.2 420.1
> base seq wr 4 220.5 222.5 221.9 221.9
> base seq wr 8 108.2 108.8 107.8 107.7 108.7 108.5 108.1 107.2
>
> base seq rdwr 2 268.4 267.8
> base seq rdwr 4 148.9 150.6 147.8 148.0
> base seq rdwr 8 78.0 77.7 76.3 76.0 79.1 77.9 74.3 77.9
>
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> ioc off rnd rd 2 8.6 8.6
> ioc off rnd rd 4 6.8 6.8 6.7 6.7
> ioc off rnd rd 8 4.7 4.6 4.6 4.7 4.6 4.6 4.6 4.6
>
> ioc off rnd wr 2 112.6 131.1
> ioc off rnd wr 4 64.9 67.8 79.9 68.1
> ioc off rnd wr 8 35.1 39.5 31.5 32.0 36.1 34.5 30.8 33.5
>
> ioc off rnd rdwr 2 13.8 13.8
> ioc off rnd rdwr 4 9.8 9.8 9.9 9.8
> ioc off rnd rdwr 8 7.7 7.7 7.7 7.7 7.7 7.7 7.7 7.7
>
>
> ioc off seq rd 2 303.1 305.0
> ioc off seq rd 4 150.8 151.6 149.0 150.2
> ioc off seq rd 8 77.0 76.3 74.5 74.0 77.9 75.5 74.0 74.6
>
> ioc off seq wr 2 424.6 425.5
> ioc off seq wr 4 223.0 222.4 223.9 222.3
> ioc off seq wr 8 110.8 112.0 111.3 109.6 111.7 111.3 110.8 109.7
>
> ioc off seq rdwr 2 274.3 275.8
> ioc off seq rdwr 4 151.3 154.8 149.0 150.6
> ioc off seq rdwr 8 81.1 80.6 77.8 74.8 81.0 78.5 77.0 77.7
>
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> ioc no idle rnd rd 2 4.7 4.7
> ioc no idle rnd rd 4 2.0 2.0 2.0 2.0
> ioc no idle rnd rd 8 0.9 0.9 0.8 0.8 0.8 0.8 0.9 0.9
>
> ioc no idle rnd wr 2 144.8 145.4
> ioc no idle rnd wr 4 73.2 65.9 65.5 65.8
> ioc no idle rnd wr 8 35.5 52.5 26.2 31.0 25.5 19.3 25.1 22.6
>
> ioc no idle rnd rdwr 2 8.1 8.1
> ioc no idle rnd rdwr 4 3.4 3.4 3.4 3.4
> ioc no idle rnd rdwr 8 1.3 1.3 1.3 1.2 1.2 1.3 1.2 1.3
>
>
> ioc no idle seq rd 2 304.1 306.6
> ioc no idle seq rd 4 152.1 154.5 149.8 153.0
> ioc no idle seq rd 8 75.8 75.8 75.2 75.1 75.5 75.3 75.7 76.5
>
> ioc no idle seq wr 2 418.6 418.2
> ioc no idle seq wr 4 217.7 217.7 215.4 217.4
> ioc no idle seq wr 8 105.5 105.8 105.8 103.4 102.9 103.1 102.7 102.8
>
> ioc no idle seq rdwr 2 269.2 269.0
> ioc no idle seq rdwr 4 130.0 126.4 127.8 128.6
> ioc no idle seq rdwr 8 67.2 66.6 65.4 65.0 65.3 64.8 65.7 66.5
>
>
>

2009-11-16 21:32:10

by Alan D. Brunelle

Subject: Re: [RFC] Block IO Controller V2 - some results

On Mon, 2009-11-16 at 16:14 -0500, Vivek Goyal wrote:
> On Mon, Nov 16, 2009 at 03:51:00PM -0500, Alan D. Brunelle wrote:
> > Hi Vivek:
> >
> > I'm finding some things that don't quite seem right - executive
> > summary:
>
> Hi Alan,
>
> Thanks a lot for such an extensive testing and test results. I am still
> digesting the results but I thought I will make a quick note about writes.
> This patchset works only for sync IO. If you are performing buffered
> writes then you will not see any service differentiation. Providing
> support for buffered write path is in TODO list.

Ah, I thought you meant sync I/O versus async I/O. So do you mean that
the testing should use _direct_ I/O (bypassing the cache)?

>
> >
> > o I think the apportionment algorithm doesn't work consistently well
> > for writes.
> >
> > o I think there are problems with significant performance loss when
> > doing random I/Os.
>
> This concerns me. I had a quick look and as per your results, even with
> group_idle=0 you are seeing this regression. I guess this might be coming
> from the fact that we idle on sync-noidle workload per group and that
> idling becomes significant as number of groups increase.
>
> Thanks
> Vivek

2009-11-16 21:38:54

by Vivek Goyal

Subject: Re: [RFC] Block IO Controller V2 - some results

On Mon, Nov 16, 2009 at 04:32:15PM -0500, Alan D. Brunelle wrote:
> On Mon, 2009-11-16 at 16:14 -0500, Vivek Goyal wrote:
> > On Mon, Nov 16, 2009 at 03:51:00PM -0500, Alan D. Brunelle wrote:
> > > Hi Vivek:
> > >
> > > I'm finding some things that don't quite seem right - executive
> > > summary:
> >
> > Hi Alan,
> >
> > Thanks a lot for such an extensive testing and test results. I am still
> > digesting the results but I thought I will make a quick note about writes.
> > This patchset works only for sync IO. If you are performing buffered
> > writes then you will not see any service differentiation. Providing
> > support for buffered write path is in TODO list.
>
> Ah, I thought you meant sync I/O versus async I/O. So do you mean that
> the testing should use _direct_ I/O (bypassing the cache)?

Only for writes. Reads will show up as sync IO at the CFQ level anyway,
so that's not a problem. You can choose to test those either as direct
IO or let them go through the page cache.

For writes, you need to use direct IO if you are looking for service
differentiation with the current patchset.
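
Something along these lines (untested, just a sketch of your existing
job with the write path switched to direct IO) should exercise the path
where the controller can differentiate:

  fio --name=data.7 --filename=/mnt/sdl/data.7 --ioengine=sync \
      --direct=1 --size=8g --overwrite=0 --runtime=120 --bs=256k \
      --readwrite=write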

Thanks
Vivek

>
> >
> > >
> > > o I think the apportionment algorithm doesn't work consistently well
> > > for writes.
> > >
> > > o I think there are problems with significant performance loss when
> > > doing random I/Os.
> >
> > This concerns me. I had a quick look and as per your results, even with
> > group_idle=0 you are seeing this regression. I guess this might be coming
> > from the fact that we idle on sync-noidle workload per group and that
> > idling becomes significant as number of groups increase.
> >
> > Thanks
> > Vivek

2009-11-16 22:20:22

by Vivek Goyal

Subject: Re: [RFC] Block IO Controller V2 - some results

On Mon, Nov 16, 2009 at 03:51:00PM -0500, Alan D. Brunelle wrote:

[..]
> ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
>
> The next thing to look at is to see what the "penalty" is for the
> additional code: see how much bandwidth we lose for the capability
> added. Here we see the sum of the system's throughput for the various
> tests:
>
> ---- ---- - ----------- ----------- ----------- -----------
> Mode RdWr N base ioc off ioc no idle ioc idle
> ---- ---- - ----------- ----------- ----------- -----------
> rnd rd 2 17.3 17.1 9.4 9.1
> rnd rd 4 27.1 27.1 8.1 8.2
> rnd rd 8 37.1 37.1 6.8 7.1
>

Hi Alan,

This seems to be the most notable result in terms of performance degradation.

I ran two random readers on a locally attached SATA disk. There I in
fact gain performance, because we now perform fewer seeks as we allocate
a continuous slice to one group and then move on to the next group.

But in your setup it looks like there is a striped set of disks, the
seek cost is lower, and waiting per group for the sync-noidle workload
is hurting instead.

One simple way to test that would be to set slice_idle=0 so that CFQ
does not try to do any idling at all. Can you please re-run the above
test? This will help in figuring out whether the above performance
regression is coming from idling on the sync-noidle workload group per
cgroup or not.
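
E.g. something like this (per device, just a sketch):

  for f in /sys/block/sd*/queue/iosched/slice_idle; do echo 0 > $f; done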

Above numbers are in what units?

Thanks
Vivek

2009-11-17 12:38:45

by Alan D. Brunelle

Subject: Re: [RFC] Block IO Controller V2 - some results

On Mon, 2009-11-16 at 17:18 -0500, Vivek Goyal wrote:
> On Mon, Nov 16, 2009 at 03:51:00PM -0500, Alan D. Brunelle wrote:
>
> [..]
> > ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
> >
> > The next thing to look at is to see what the "penalty" is for the
> > additional code: see how much bandwidth we lose for the capability
> > added. Here we see the sum of the system's throughput for the various
> > tests:
> >
> > ---- ---- - ----------- ----------- ----------- -----------
> > Mode RdWr N base ioc off ioc no idle ioc idle
> > ---- ---- - ----------- ----------- ----------- -----------
> > rnd rd 2 17.3 17.1 9.4 9.1
> > rnd rd 4 27.1 27.1 8.1 8.2
> > rnd rd 8 37.1 37.1 6.8 7.1
> >
>
> Hi Alan,
>
> This seems to be the most notable result in terms of performance degradation.
>
> I ran two random readers on a locally attached SATA disk. There in fact
> I gain in terms of performance because we perform less number of seeks
> now as we allocate a continous slice to one group and then move onto
> next group.
>
> But in your setup it looks like there is a striped set of disks and seek
> cost is less and waiting per group for sync-noidle workload is hurting
> instead.


That is correct - there are 4 back-end buses on an MSA1000, and each LUN
that is exported is constructed from 1 drive from each bus (hardware
striped RAID). [There is _no_ SW RAID involved.]


>
> One simple way to test that would be to set slice_idle=0 so that CFQ does
> not try to do any idling at all. Can you please re-run above test. This
> will help in figuring out whether above performance regression is coming
> from idling on sync-noidle workload group per cgroup or not.

I'll put that in the queue - first I'm going to re-run w/ synchronous
direct I/O for the writes. I'm also going to pare this down to just
doing 2-process-per-disk runs (to simplify results & speed up tests).
Once we get that working better, I can expand things back out.

>
> Above numbers are in what units?

These are in MiB/second (derived from the FIO output).

>
> Thanks
> Vivek


2009-11-17 14:15:56

by Vivek Goyal

Subject: Re: [RFC] Block IO Controller V2 - some results

On Tue, Nov 17, 2009 at 07:38:47AM -0500, Alan D. Brunelle wrote:
> On Mon, 2009-11-16 at 17:18 -0500, Vivek Goyal wrote:
> > On Mon, Nov 16, 2009 at 03:51:00PM -0500, Alan D. Brunelle wrote:
> >
> > [..]
> > > ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
> > >
> > > The next thing to look at is to see what the "penalty" is for the
> > > additional code: see how much bandwidth we lose for the capability
> > > added. Here we see the sum of the system's throughput for the various
> > > tests:
> > >
> > > ---- ---- - ----------- ----------- ----------- -----------
> > > Mode RdWr N base ioc off ioc no idle ioc idle
> > > ---- ---- - ----------- ----------- ----------- -----------
> > > rnd rd 2 17.3 17.1 9.4 9.1
> > > rnd rd 4 27.1 27.1 8.1 8.2
> > > rnd rd 8 37.1 37.1 6.8 7.1
> > >
> >
> > Hi Alan,
> >
> > This seems to be the most notable result in terms of performance degradation.
> >
> > I ran two random readers on a locally attached SATA disk. There in fact
> > I gain in terms of performance because we perform less number of seeks
> > now as we allocate a continous slice to one group and then move onto
> > next group.
> >
> > But in your setup it looks like there is a striped set of disks and seek
> > cost is less and waiting per group for sync-noidle workload is hurting
> > instead.
>
>
> That is correct - there are 4 back-end buses on an MSA1000, and each LUN
> that is exported is constructed from 1 drive from each bus (hardware
> striped RAID). [There is _no_ SW RAID involved.]
>
>
> >
> > One simple way to test that would be to set slice_idle=0 so that CFQ does
> > not try to do any idling at all. Can you please re-run above test. This
> > will help in figuring out whether above performance regression is coming
> > from idling on sync-noidle workload group per cgroup or not.
>
> I'll put that in the queue - first I'm going to re-run w/ synchronous
> direct I/O for the writes. I'm also going to pair this down to just
> doing 2-processes per disk runs (to simplify results & speed up tests).
> Once we get that working better, I can expand things back out.

Ok, the only thing to watch out for is the number of request
descriptors. I think at some point with writes, you will consume all
the request descriptors and things will become serialized after that.
We will need support for per-group request descriptors to solve this,
but that patch will come later; it is on the TODO list. Boosting the
number of request descriptors per queue should help, though.
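
For example (a sketch; assuming the default of 128 requests per queue
is what gets exhausted):

  for f in /sys/block/sd*/queue/nr_requests; do echo 512 > $f; done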


Regarding the reduced throughput for the random IO case, ideally we
should not idle on the sync-noidle group on this hardware, as this seems
to be fast, NCQ-supporting hardware. But I guess we might not be
detecting the queue depth properly, which leads to idling on the
per-group sync-noidle workload and forces the queue depth to be 1.

I am also trying to setup a higher end system here and will do some
experiments.

Thanks
Vivek

2009-11-17 16:17:49

by Corrado Zoccolo

Subject: Re: [RFC] Block IO Controller V2 - some results

Hi Vivek,
the performance drop reported by Alan was my main concern about your
approach. Probably you should mention/document somewhere that when the
number of groups is too large, there is a large decrease in random read
performance.

However, we can check a few things:
* is this kernel built with HZ < 1000? The smallest idle CFQ will do
is given by 2/HZ, so running with a small HZ will increase the impact
of idling (e.g. 8 ms at HZ=250 versus 2 ms at HZ=1000).
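
A quick way to check (assuming the distro ships the build config under
/boot):

  grep CONFIG_HZ /boot/config-$(uname -r)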

On Tue, Nov 17, 2009 at 3:14 PM, Vivek Goyal <[email protected]> wrote:
> Regarding the reduced throughput for random IO case, ideally we should not
> idle on sync-noidle group on this hardware as this seems to be a fast NCQ
> supporting hardware. But I guess we might not be detecting the queue depth
> properly which leads to idling on per group sync-noidle workload and
> forces the queue depth to be 1.

* This can be ruled out testing my NCQ detection fix patch
(http://groups.google.com/group/linux.kernel/browse_thread/thread/3b62f0665f0912b6/34ec9456c7da1bb7?lnk=raot)

However, my feeling is that the real problem is having multiple
separate sync-noidle trees.
Inter-group idle is marginal, since each sync-noidle tree already has
its end-of-tree idle enabled for rotational devices (the difference in
the table is in fact small).
> ---- ---- - ----------- ----------- ----------- -----------
> Mode RdWr N base ioc off ioc no idle ioc idle
> ---- ---- - ----------- ----------- ----------- -----------
> rnd rd 2 17.3 17.1 9.4 9.1
> rnd rd 4 27.1 27.1 8.1 8.2
> rnd rd 8 37.1 37.1 6.8 7.1

2 random readers without groups have bw = 17.3; this means that a
single random reader will have bw > 8.6 (since the two readers usually
go in parallel when no groups are involved, unless two random reads are
actually queued to the same disk).

When the random readers are in separate groups, we give the full disk
to only one at a time, so the max aggregate bw achievable is the bw of
a single random reader, less an overhead proportional to the number of
groups. This is compatible with the numbers.

So, another thing to mention in the docs is that having one process
per group is not a good idea (cfq already has I/O priorities to deal
with single processes). Groups are coarse-grained entities, and they
should really be used when you need to get fairness between groups of
processes.

* Another thing to try is setting rotational = 0, since even
with NCQ correctly detected, if the device is rotational, we still
introduce some idle delays (that are good in the root group, but not
when you have multiple groups).

>
> I am also trying to setup a higher end system here and will do some
> experiments.
>
> Thanks
> Vivek

Thanks,
Corrado

2009-11-17 16:42:12

by Vivek Goyal

Subject: Re: [RFC] Block IO Controller V2 - some results

On Tue, Nov 17, 2009 at 05:17:53PM +0100, Corrado Zoccolo wrote:
> Hi Vivek,
> the performance drop reported by Alan was my main concern about your
> approach. Probably you should mention/document somewhere that when the
> number of groups is too large, there is large decrease in random read
> performance.
>

Hi Corrado,

I thought more about it. We idle on the sync-noidle group only in the
case of rotational media not supporting NCQ (hw_tag = 0). So for all
the fast hardware out there (SSDs and fast arrays), we should not be
idling on the sync-noidle group and hence should see no additional
idling per group.

This is all subject to the fact that we have done a good job in
detecting the queue depth and have updated hw_tag accordingly.

On slower rotational hardware, where we will actually do idling on
sync-noidle per group, idling can in fact help you because it will
reduce the number of seeks (as it does on my locally connected SATA
disk).

> However, we can check few things:
> * is this kernel built with HZ < 1000? The smallest idle CFQ will do
> is given by 2/HZ, so running with a small HZ will increase the impact
> of idling.
>
> On Tue, Nov 17, 2009 at 3:14 PM, Vivek Goyal <[email protected]> wrote:
> > Regarding the reduced throughput for random IO case, ideally we should not
> > idle on sync-noidle group on this hardware as this seems to be a fast NCQ
> > supporting hardware. But I guess we might not be detecting the queue depth
> > properly which leads to idling on per group sync-noidle workload and
> > forces the queue depth to be 1.
>
> * This can be ruled out testing my NCQ detection fix patch
> (http://groups.google.com/group/linux.kernel/browse_thread/thread/3b62f0665f0912b6/34ec9456c7da1bb7?lnk=raot)

This will be a good patch to test here. Alan, can you also apply this
patch and see if we see any improvement?

My core concern is that the hardware Alan is testing on is fast,
NCQ-supporting hardware, so we should see hw_tag=1 and hence no idling
on the sync-noidle group should happen.

>
> However, my feeling is that the real problem is having multiple
> separate sync-noidle trees.
> Inter group idle is marginal, since each sync-noidle tree already has
> its end-of-tree idle enabled for rotational devices (The difference in
> the table is in fact small).
> > ---- ---- - ----------- ----------- ----------- -----------
> > Mode RdWr N base ioc off ioc no idle ioc idle
> > ---- ---- - ----------- ----------- ----------- -----------
> > rnd rd 2 17.3 17.1 9.4 9.1
> > rnd rd 4 27.1 27.1 8.1 8.2
> > rnd rd 8 37.1 37.1 6.8 7.1
>
> 2 random readers without groups have bw = 17.3 ; this means that a
> single random reader will have bw > 8.6 (since the two readers go
> usually in parallel when no groups are involved, unless two random
> reads are actually queued to the same disk).
>

Agreed. Without groups I guess we are driving a queue depth of 2, hence
the two random readers are able to work in parallel. Because this is a
striped array of multiple disks, there are chances that reads will
happen on different disks and we can support more random readers in
parallel without dropping the throughput of the box.

> When the random readers are in separate groups, we give the full disk
> to only one at a time, so the max aggregate bw achievable is the bw of
> a single random reader less the overhead proportional to number of
> groups. This is compatible with the numbers.
>

Yes it is, but with group_idle=0 we don't wait for a group to get
backlogged. So in that case we should have been driving a queue depth of
2 and allowing both groups to go in parallel. But looking at Alan's
numbers with group_idle=0, he is not achieving close to 17MB/s, and I
suspect this is coming from the fact that hw_tag=0 somehow and we are
idling on the sync-noidle workload, hence effectively driving a queue
depth of 1.

> So, an other thing to mention in the docs is that having one process
> per group is not a good idea (cfq already has I/O priorities to deal
> with single processes). Groups are coarse grain entities, and they
> should really be used when you need to get fairness between groups of
> processes.
>

I think the number of processes in a group is more dynamic information
that changes with time. For example, if we put a virtual machine in a
group, the number of processes will vary depending on what the virtual
machine is doing.

I think group_idle is a more controllable parameter here. If some group
has a higher weight but low load (like a single process running), should
we slow down the whole array and give that group exclusive access, or
should we just let the slow group go away and continue to dispatch from
the rest of the more active (but possibly lower-weight) groups? In the
first case our latencies would probably be better compared to the
second case.

But the more I look at it, the more it sounds like waiting for slow
groups on fast arrays is not very good. It might make sense on
rotational hardware with a single disk head, though.

> * An other thing to do is to try setting rotational = 0, since even

What is rotational=0? I can't find any such tunable variable.

Thanks
Vivek

> with NCQ correctly detected, if the device is rotational, we still
> introduce some idle delays (that are good in the root group, but not
> when you have multiple groups).
>
> >
> > I am also trying to setup a higher end system here and will do some
> > experiments.
> >
> > Thanks
> > Vivek
>
> Thanks,
> Corrado

2009-11-17 16:44:59

by Alan D. Brunelle

Subject: Re: [RFC] Block IO Controller V2 - some results

On Tue, 2009-11-17 at 17:17 +0100, Corrado Zoccolo wrote:
> Hi Vivek,
> the performance drop reported by Alan was my main concern about your
> approach. Probably you should mention/document somewhere that when the
> number of groups is too large, there is large decrease in random read
> performance.
>
> However, we can check few things:
> * is this kernel built with HZ < 1000? The smallest idle CFQ will do
> is given by 2/HZ, so running with a small HZ will increase the impact
> of idling.

FYI:

CONFIG_NO_HZ=y
CONFIG_HZ_1000=y
CONFIG_HZ=1000


2009-11-17 17:30:06

by Alan D. Brunelle

Subject: Re: [RFC] Block IO Controller V2 - some results

On Tue, 2009-11-17 at 11:40 -0500, Vivek Goyal wrote:
> On Tue, Nov 17, 2009 at 05:17:53PM +0100, Corrado Zoccolo wrote:
> > Hi Vivek,
> > the performance drop reported by Alan was my main concern about your
> > approach. Probably you should mention/document somewhere that when the
> > number of groups is too large, there is large decrease in random read
> > performance.
> >
>
> Hi Corrodo,
>
> I thought more about it. We idle on sync-noidle group only in case of
> rotational media not supporting NCQ (hw_tag = 0). So for all the fast
> hardware out there (SSD and fast arrays), we should not be idling on
> sync-noidle group hence should not additional idling per group.
>
> This is all subjected to the fact that we have done a good job in
> detecting the queue depth and have updated hw_tag accordingly.
>
> On slower rotational hardware, where we will actually do idling on
> sync-noidle per group, idling can infact help you because it will reduce
> the number of seeks (As it does on my locally connected SATA disk).
>
> > However, we can check few things:
> > * is this kernel built with HZ < 1000? The smallest idle CFQ will do
> > is given by 2/HZ, so running with a small HZ will increase the impact
> > of idling.
> >
> > On Tue, Nov 17, 2009 at 3:14 PM, Vivek Goyal <[email protected]> wrote:
> > > Regarding the reduced throughput for random IO case, ideally we should not
> > > idle on sync-noidle group on this hardware as this seems to be a fast NCQ
> > > supporting hardware. But I guess we might not be detecting the queue depth
> > > properly which leads to idling on per group sync-noidle workload and
> > > forces the queue depth to be 1.
> >
> > * This can be ruled out testing my NCQ detection fix patch
> > (http://groups.google.com/group/linux.kernel/browse_thread/thread/3b62f0665f0912b6/34ec9456c7da1bb7?lnk=raot)
>
> This will be a good patch to test here. Alan, can you also apply this
> patch and see if we see any improvement.

Vivek: Do you want me to move this over to the V3 version & apply this
patch, or stick w/ V2?

Thanks,
Alan

2009-11-17 17:46:31

by Vivek Goyal

Subject: Re: [RFC] Block IO Controller V2 - some results

On Tue, Nov 17, 2009 at 12:30:07PM -0500, Alan D. Brunelle wrote:
> On Tue, 2009-11-17 at 11:40 -0500, Vivek Goyal wrote:
> > On Tue, Nov 17, 2009 at 05:17:53PM +0100, Corrado Zoccolo wrote:
> > > Hi Vivek,
> > > the performance drop reported by Alan was my main concern about your
> > > approach. Probably you should mention/document somewhere that when the
> > > number of groups is too large, there is large decrease in random read
> > > performance.
> > >
> >
> > Hi Corrodo,
> >
> > I thought more about it. We idle on sync-noidle group only in case of
> > rotational media not supporting NCQ (hw_tag = 0). So for all the fast
> > hardware out there (SSD and fast arrays), we should not be idling on
> > sync-noidle group hence should not additional idling per group.
> >
> > This is all subjected to the fact that we have done a good job in
> > detecting the queue depth and have updated hw_tag accordingly.
> >
> > On slower rotational hardware, where we will actually do idling on
> > sync-noidle per group, idling can infact help you because it will reduce
> > the number of seeks (As it does on my locally connected SATA disk).
> >
> > > However, we can check few things:
> > > * is this kernel built with HZ < 1000? The smallest idle CFQ will do
> > > is given by 2/HZ, so running with a small HZ will increase the impact
> > > of idling.
> > >
> > > On Tue, Nov 17, 2009 at 3:14 PM, Vivek Goyal <[email protected]> wrote:
> > > > Regarding the reduced throughput for random IO case, ideally we should not
> > > > idle on sync-noidle group on this hardware as this seems to be a fast NCQ
> > > > supporting hardware. But I guess we might not be detecting the queue depth
> > > > properly which leads to idling on per group sync-noidle workload and
> > > > forces the queue depth to be 1.
> > >
> > > * This can be ruled out testing my NCQ detection fix patch
> > > (http://groups.google.com/group/linux.kernel/browse_thread/thread/3b62f0665f0912b6/34ec9456c7da1bb7?lnk=raot)
> >
> > This will be a good patch to test here. Alan, can you also apply this
> > patch and see if we see any improvement.
>
> Vivek: Do you want me to move this over to the V3 version & apply this
> patch, or stick w/ V2?

Alan,

Anything is good. V3 is not very different from V2. Maybe move to V3
with the above patch applied and see if it helps.

At the end of the day, you will not see an improvement with
group_idle=1, as each group gets exclusive access to the underlying
array. But I am expecting to see an improvement with group_idle=0.

Thanks
Vivek

2009-11-17 20:38:12

by Alan D. Brunelle

Subject: Re: [RFC] Block IO Controller V2 - some results

Hi Vivek -

I've updated the runs - the results are shown at the end; I culled out
the write-related runs (haven't converted to direct I/O yet for those).

Next steps: Going to refresh to V3 of the patches and add in

http://groups.google.com/group/linux.kernel/browse_thread/thread/3b62f0665f0912b6/34ec9456c7da1bb7?lnk=raot

and convert the write-side to doing direct I/O (that would include the
rdwr tests, btw).

----------------------------------------------------------------------

I've relabeled things:

i0 = io control generated in the kernel, but not enabled
i1 = io control generated in the kernel, and is enabled

g0 = group_idle=0
g1 = group_idle=1 (default)

s8 = slice_idle=8 (default)
s0 = slice_idle=0

It looks like when the io control stuff is enabled we have random read
problems.

When it is enabled and slice_idle is set to 0 we see sequential reads
drop noticeably:

---- ---- - --------- --------- --------- --------- --------- ---------
Mode RdWr N base i0,gX,sX i1,g1,s8 i1,g0,s8 i1,g1,s0 i1,g0,s0
---- ---- - --------- --------- --------- --------- --------- ---------
rnd rd 2 17.3 17.1 9.1 9.3 9.4 9.5
rnd rd 4 27.1 27.1 8.2 8.0 8.0 8.0
rnd rd 8 37.1 37.1 7.1 6.8 6.7 6.8

seq rd 2 610.2 608.1 607.7 611.0 551.1 550.3
seq rd 4 608.4 601.5 607.7 609.6 549.6 550.0
seq rd 8 605.7 603.7 604.0 605.6 547.2 546.7

===============================================================

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
base rnd rd 2 8.6 8.6
base rnd rd 4 6.8 6.8 6.8 6.7
base rnd rd 8 4.7 4.6 4.6 4.6 4.6 4.6 4.6 4.6

base seq rd 2 306.2 304.0
base seq rd 4 150.1 152.4 151.9 154.0
base seq rd 8 77.2 75.9 75.9 73.9 77.0 75.7 75.0 74.9

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
i0,gX,sX rnd rd 2 8.6 8.6
i0,gX,sX rnd rd 4 6.8 6.8 6.7 6.7
i0,gX,sX rnd rd 8 4.7 4.6 4.6 4.7 4.6 4.6 4.6 4.6

i0,gX,sX seq rd 2 303.1 305.0
i0,gX,sX seq rd 4 150.8 151.6 149.0 150.2
i0,gX,sX seq rd 8 77.0 76.3 74.5 74.0 77.9 75.5 74.0 74.6

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
i1,g1,s8 rnd rd 2 2.8 6.3
i1,g1,s8 rnd rd 4 0.7 1.5 2.5 3.5
i1,g1,s8 rnd rd 8 0.2 0.4 0.5 0.7 0.9 1.2 1.4 1.7

i1,g1,s8 seq rd 2 221.6 386.1
i1,g1,s8 seq rd 4 70.6 128.1 181.7 227.3
i1,g1,s8 seq rd 8 21.4 40.3 55.8 71.2 85.1 99.1 109.4 121.7

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
i1,g0,s8 rnd rd 2 4.7 4.7
i1,g0,s8 rnd rd 4 2.0 2.0 2.0 2.0
i1,g0,s8 rnd rd 8 0.9 0.9 0.9 0.8 0.8 0.8 0.9 0.9

i1,g0,s8 seq rd 2 305.9 305.0
i1,g0,s8 seq rd 4 154.0 153.3 151.2 151.1
i1,g0,s8 seq rd 8 76.1 76.1 74.5 74.9 75.4 75.8 76.1 76.6

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
i1,g1,s0 rnd rd 2 4.7 4.7
i1,g1,s0 rnd rd 4 2.0 2.0 2.0 2.0
i1,g1,s0 rnd rd 8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.9

i1,g1,s0 seq rd 2 275.6 275.5
i1,g1,s0 seq rd 4 136.0 137.5 137.4 138.7
i1,g1,s0 seq rd 8 68.6 68.6 68.5 68.6 67.1 68.6 68.3 68.9

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
i1,g0,s0 rnd rd 2 4.7 4.7
i1,g0,s0 rnd rd 4 2.0 2.0 2.0 2.0
i1,g0,s0 rnd rd 8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.9

i1,g0,s0 seq rd 2 275.2 275.1
i1,g0,s0 seq rd 4 136.7 137.2 137.5 138.7
i1,g0,s0 seq rd 8 68.8 68.2 68.5 68.4 67.2 68.8 68.4 68.5

2009-11-17 20:59:34

by Corrado Zoccolo

Subject: Re: [RFC] Block IO Controller V2 - some results

On Tue, Nov 17, 2009 at 5:40 PM, Vivek Goyal <[email protected]> wrote:
>
> I thought more about it. We idle on sync-noidle group only in case of
> rotational media not supporting NCQ (hw_tag = 0). So for all the fast
> hardware out there (SSD and fast arrays), we should not be idling on
> sync-noidle group hence should not additional idling per group.

Your description is not complete. The relevant code is:

    /*
     * are we servicing noidle tree, and there are more queues?
     * non-rotational or NCQ: no idle
     * non-NCQ rotational   : very small idle, to allow fair
     *                        distribution of slice time for a process
     *                        doing back-to-back seeks.
     */
    st = service_tree_for(cfqq->cfqg, cfqd->serving_prio,
                          SYNC_NOIDLE_WORKLOAD, cfqd);
    if (!wait_busy && cfqd->serving_type == SYNC_NOIDLE_WORKLOAD
        && st->count > 0) {
            if (blk_queue_nonrot(cfqd->queue) || cfqd->hw_tag)
                    return false; /* here we don't idle */
            sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
    }
    /* here we idle */

Notice the condition on count > 0? We will still idle "at the end" of
the no-idle service tree, i.e. when there are no more no-idle queues.
Without this idle, we won't get fair behaviour for no-idle queues.
This idle is enabled regardless of NCQ for rotational media. It is
only disabled on NCQ SSDs (the whole function is skipped in that
case).
So, having more than one no-idle service tree, as in your approach to
groups, introduces the problem we see.

>
> This is all subjected to the fact that we have done a good job in
> detecting the queue depth and have updated hw_tag accordingly.
>
> On slower rotational hardware, where we will actually do idling on
> sync-noidle per group, idling can infact help you because it will reduce
> the number of seeks (As it does on my locally connected SATA disk).
Right. We will do a small idle between no-idle queues, and a larger
one at the end.

>> However, we can check few things:
>> * is this kernel built with HZ < 1000? The smallest idle CFQ will do
>> is given by 2/HZ, so running with a small HZ will increase the impact
>> of idling.
>>
>> On Tue, Nov 17, 2009 at 3:14 PM, Vivek Goyal <[email protected]> wrote:
>> > Regarding the reduced throughput for random IO case, ideally we should not
>> > idle on sync-noidle group on this hardware as this seems to be a fast NCQ
>> > supporting hardware. But I guess we might not be detecting the queue depth
>> > properly which leads to idling on per group sync-noidle workload and
>> > forces the queue depth to be 1.
>>
>> * This can be ruled out testing my NCQ detection fix patch
>> (http://groups.google.com/group/linux.kernel/browse_thread/thread/3b62f0665f0912b6/34ec9456c7da1bb7?lnk=raot)
>
> This will be a good patch to test here. Alan, can you also apply this
> patch and see if we see any improvement.
>
> My core concern is that hardware Alan is testing on is a fast NCQ
> supporting hardware and we should see hw_tag=1 and hence no idling on
> sync-noidle group should happen.
>
See my explanation above.

>>
>> However, my feeling is that the real problem is having multiple
>> separate sync-noidle trees.
>> Inter group idle is marginal, since each sync-noidle tree already has
>> its end-of-tree idle enabled for rotational devices (The difference in
>> the table is in fact small).
>> > ---- ---- - ----------- ----------- ----------- -----------
>> > Mode RdWr N    base       ioc off   ioc no idle  ioc idle
>> > ---- ---- - ----------- ----------- ----------- -----------
>> >  rnd   rd 2        17.3        17.1         9.4         9.1
>> >  rnd   rd 4        27.1        27.1         8.1         8.2
>> >  rnd   rd 8        37.1        37.1         6.8         7.1
>>
>> 2 random readers without groups have bw = 17.3 ; this means that a
>> single random reader will have bw > 8.6 (since the two readers go
>> usually in parallel when no groups are involved, unless two random
>> reads are actually queued to the same disk).
>>
>
> Agreed. Without groups I guess we are driving queue depth as 2 hence
> two random readers are able to work in paralle. Because this is striped
> array of multiple disks, there are chances that reads will happen on
> different disks and we can support more random readers in parallel without
> dropping the throughput of box.
>
>> When the random readers are in separate groups, we give the full disk
>> to only one at a time, so the max aggregate bw achievable is the bw of
>> a single random reader less the overhead proportional to number of
>> groups. This is compatible with the numbers.
>>
>
> Yes it is but with group_idle=0, we don't wait for a group to get
> backlogged. So in that case we should have been driving queue depth as 2
> and allow both the groups go in parallel. But looking at Alan's number
> with with group_ilde=0, he is not achieving close to 17MB/s and I suspect
> this is coming from that fact that hw_tag=0 somehow and we are idling on
> sync-nodile workload hence effectively driving queue depth as 1.
It's because of the idle at the end of the service tree.
If you recall, I commented about the group idle being useless, since
we already do an idle in any case, even after the no-idle service
tree.

>
>> So, an other thing to mention in the docs is that having one process
>> per group is not a good idea (cfq already has I/O priorities to deal
>> with single processes). Groups are coarse grain entities, and they
>> should really be used when you need to get fairness between groups of
>> processes.
>>
>
> I think number of processes in the group will be a more dynamic
> information that changes with time. For example, if we put a virtual
> machine in a group, number of processes will vary depending on what
> virtual machine is doing.
Yes, this is one scenario. Another would be that each user has a group,
so the bandwidth is divided evenly between users, and a user cannot
steal bandwidth by using more processes.
The point is that using groups to control priorities between processes
is not the best option.

>
> I think group_idle is a more controllable parameter here. If some group
> has higher weight but low load (like single process running), then should
> we slow down the whole array and give the group exclusive access, or we
> continue we just let slow group go away and continue to dispatch from rest
> of the more active (but possibly low weight) groups. In first case
> probably our latencies might be better as comapred to second case.
As Alan's test shows, disabling group_idle doesn't actually improve
the situation, since we already idle at the end of each tree.
What you want is that if a no-idle service_tree is finished, you
should immediately jump to another no-idle service tree.
Unfortunately, this would be equivalent to having just one global
no-idle service tree, so the counter-arguments you made against my
proposals still apply here.
You can either get isolation, or performance. Not both at the same time.

> But the more I look at it, waiting for slow groups on fast arrays
> does not sound very good. It might make sense on rotational hardware with
> a single disk head, though.
Especially if they don't have NCQ. In fact, in that case, the no-idle
service tree still has a small idle for each queue, so they are
handled pretty much like the sequential queues.
For NCQ, maybe sending more requests in parallel will still get better
overall latency, since the head movements will be optimized across
queues in that case.
>
>> * An other thing to do is to try setting rotational = 0, since even
>
> what is rotational=0? Can't find any such tunable variable?

[corrado@et2 ~]$ cat /sys/block/sda/queue/rotational
1

It is a per-queue sysfs tunable; writing 0 there marks the queue as
non-rotational (what blk_queue_nonrot() checks), which is what an SSD
should report.

Thanks,
Corrado

> Thanks
> Vivek

2009-11-17 22:41:48

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] Block IO Controller V2 - some results

On Tue, Nov 17, 2009 at 09:59:38PM +0100, Corrado Zoccolo wrote:
> On Tue, Nov 17, 2009 at 5:40 PM, Vivek Goyal <[email protected]> wrote:
> >
> > I thought more about it. We idle on sync-noidle group only in case of
> > rotational media not supporting NCQ (hw_tag = 0). So for all the fast
> > hardware out there (SSD and fast arrays), we should not be idling on
> > sync-noidle group, hence should not do additional idling per group.
>
> Your description is not complete. The relevant code is:
>
> /* are we servicing noidle tree, and there are more queues?
>  * non-rotational or NCQ: no idle
>  * non-NCQ rotational : very small idle, to allow
>  * fair distribution of slice time for a process doing back-to-back
>  * seeks.
>  */
> st = service_tree_for(cfqq->cfqg, cfqd->serving_prio,
> SYNC_NOIDLE_WORKLOAD, cfqd);
> if (!wait_busy && cfqd->serving_type == SYNC_NOIDLE_WORKLOAD
> && st->count > 0) {
> if (blk_queue_nonrot(cfqd->queue) || cfqd->hw_tag)
> return false; /* here we don't idle */
> sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
> }
> /* here we idle */
>
> Notice the condition on count > 0? We will still idle "at the end" of
> the no-idle service tree, i.e. when there are no more no-idle queues.

Ok, now I understand it better. I had missed the st->count part. So if
there are other sync-noidle queues backlogged (st->count > 0), then we
don't idle on the same process to get more requests if hw_tag=1 or it is an
SSD, and move on to the next sync-noidle process to dispatch requests from.

But if this is the last cfqq on the service tree under this workload, we will
still idle on the service tree/workload type and not start dispatching
requests from another service tree (of the same prio class).

> Without this idle, we won't get fair behaviour for no-idle queues.
> This idle is enabled regardless of NCQ for rotational media. It is
> only disabled on NCQ SSDs (the whole function is skipped in that
> case).

So if I have a fast storage array with NCQ, we will still idle and not
let sync-idle queues or async queues get to dispatch. Anyway, that's a
side issue for the moment.

> So, having more than one no-idle service tree, as in your approach to
> groups, introduces the problem we see.
>

True, having multiple no-idle workloads is a problem here. I can't think of
a solution. Putting the workload type on top is not logically good either,
because then the workload type determines the share of the disk/array, which
is unintuitive.

I guess I will document this issue with random IO workloads.

Maybe we can do a little optimization: in cfq_should_idle(), I can
check whether there are other competing sync and async queues in the cfq_group
or not. If there are no competing queues then we don't have to idle on the
sync-noidle service tree. That said, we might still
want to idle on the group as a whole to make sure a single random reader
gets good latencies and is not overwhelmed by other groups running
sequential readers.

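A rough sketch of the check I have in mind (field names follow the V2/V3
patches but this is just illustrative, not the actual change):

static bool cfqg_has_competing_queues(struct cfq_queue *cfqq)
{
        /*
         * cfqq sits on its group's sync-noidle tree; if that tree holds
         * every queue of the group, there are no sync-idle or async
         * queues to protect against, so the end-of-tree idle could be
         * skipped when other groups are ready to dispatch.
         */
        return cfqq->service_tree->count != cfqq->cfqg->nr_cfqq;
}
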
> >
> > This is all subjected to the fact that we have done a good job in
> > detecting the queue depth and have updated hw_tag accordingly.
> >
> > On slower rotational hardware, where we will actually do idling on
> > sync-noidle per group, idling can infact help you because it will reduce
> > the number of seeks (As it does on my locally connected SATA disk).
> Right. We will do a small idle between no-idle queues, and a larger
> one at the end.

If we do want to do a small idle between no-idle queues, why do you allow
preemption of one sync-noidle queue by another sync-noidle queue?

IOW, what's the point of waiting for a small period between queues? They are
anyway random seeky readers.

Idling between queues can help a bit if we have a sync-noidle reader and
multiple sync-noidle writers. A sync-noidle reader can still witness
higher latencies if multiple libaio-driven sync writers are present. We
discussed this issue briefly in private mail. But at the moment, allowing
preemption will wipe out that advantage.

>
> >> However, we can check few things:
> >> * is this kernel built with HZ < 1000? The smallest idle CFQ will do
> >> is given by 2/HZ, so running with a small HZ will increase the impact
> >> of idling.
> >>
> >> On Tue, Nov 17, 2009 at 3:14 PM, Vivek Goyal <[email protected]> wrote:
> >> > Regarding the reduced throughput for random IO case, ideally we should not
> >> > idle on sync-noidle group on this hardware as this seems to be a fast NCQ
> >> > supporting hardware. But I guess we might not be detecting the queue depth
> >> > properly which leads to idling on per group sync-noidle workload and
> >> > forces the queue depth to be 1.
> >>
> >> * This can be ruled out testing my NCQ detection fix patch
> >> (http://groups.google.com/group/linux.kernel/browse_thread/thread/3b62f0665f0912b6/34ec9456c7da1bb7?lnk=raot)
> >
> > This will be a good patch to test here. Alan, can you also apply this
> > patch and see if we see any improvement.
> >
> > My core concern is that hardware Alan is testing on is a fast NCQ
> > supporting hardware and we should see hw_tag=1 and hence no idling on
> > sync-noidle group should happen.
> >
> See my explanation above.

I understand now to some extent. One question still remains, though: why do
we choose to idle on fast arrays? The faster the array (backed by
more disks), the more harmful the idling becomes.

Maybe using your dynamic cfq tuning patches might help here. If the average
read time is low, then drive deeper queue depths; otherwise reduce the
queue depth, as the underlying device/array can't handle that much.

>
> >>
> >> However, my feeling is that the real problem is having multiple
> >> separate sync-noidle trees.
> >> Inter group idle is marginal, since each sync-noidle tree already has
> >> its end-of-tree idle enabled for rotational devices (The difference in
> >> the table is in fact small).
> >> > ---- ---- - ----------- ----------- ----------- -----------
> >> > Mode RdWr N    base       ioc off   ioc no idle  ioc idle
> >> > ---- ---- - ----------- ----------- ----------- -----------
> >> >  rnd   rd 2        17.3        17.1         9.4         9.1
> >> >  rnd   rd 4        27.1        27.1         8.1         8.2
> >> >  rnd   rd 8        37.1        37.1         6.8         7.1
> >>
> >> 2 random readers without groups have bw = 17.3 ; this means that a
> >> single random reader will have bw > 8.6 (since the two readers go
> >> usually in parallel when no groups are involved, unless two random
> >> reads are actually queued to the same disk).
> >>
> >
> > Agreed. Without groups I guess we are driving queue depth as 2 hence
> > two random readers are able to work in parallel. Because this is striped
> > array of multiple disks, there are chances that reads will happen on
> > different disks and we can support more random readers in parallel without
> > dropping the throughput of box.
> >
> >> When the random readers are in separate groups, we give the full disk
> >> to only one at a time, so the max aggregate bw achievable is the bw of
> >> a single random reader less the overhead proportional to number of
> >> groups. This is compatible with the numbers.
> >>
> >
> > Yes it is but with group_idle=0, we don't wait for a group to get
> > backlogged. So in that case we should have been driving queue depth as 2
> > and allow both the groups go in parallel. But looking at Alan's number
> > with group_idle=0, he is not achieving close to 17MB/s and I suspect
> > this is coming from the fact that hw_tag=0 somehow and we are idling on
> > sync-noidle workload hence effectively driving queue depth as 1.
> Its because of the idle at the end of the service tree.
> If you recall, I commented about the group idle being useless, since
> we already do an idle in any case, even after the no-idle service
> tree.
>

I am still trying to understand your patches fully. So are you going to
idle even on sync-idle and async trees? In cfq_should_idle(), I don't see
any distinction between various kind of trees so it looks like we are
going to idle on async and sync-idle trees also? That looks unnecessary?

Regular idle does not work if the slice has expired. There are situations with
sync-idle readers where I need to wait for the next request for the group to get
backlogged. So it is not useless; it kicks in only in a few circumstances.

> >
> >> So, an other thing to mention in the docs is that having one process
> >> per group is not a good idea (cfq already has I/O priorities to deal
> >> with single processes). Groups are coarse grain entities, and they
> >> should really be used when you need to get fairness between groups of
> >> processes.
> >>
> >
> > I think number of processes in the group will be a more dynamic
> > information that changes with time. For example, if we put a virtual
> > machine in a group, number of processes will vary depending on what
> > virtual machine is doing.
> Yes, this is one scenario. An other would be each user has a group, so
> the bandwidth is divided evenly between users, and an user cannot
> steal bandwidth by using more processes.
> The point is that using groups to control priorities between processes
> is not the best option.
>
> >
> > I think group_idle is a more controllable parameter here. If some group
> > has higher weight but low load (like single process running), then should
> > we slow down the whole array and give the group exclusive access, or we
> > continue we just let slow group go away and continue to dispatch from rest
> > of the more active (but possibly low weight) groups. In first case
> > probably our latencies might be better as compared to second case.
> As Alan's test shows, disabling group_idle doesn't actually improve
> the situation, since we already idle at the end of each tree.
> What you want is that if a no-idle service_tree is finished, you
> should immediately jump to an other no-idle service tree.
> Unfortunately, this would be equivalent to having just one global
> no-idle service tree, so the counter-arguments you did for my
> proposals still apply here.
> You can either get isolation, or performance. Not both at the same time.

Agreed.

Thanks
Vivek

2009-11-17 23:11:04

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: [RFC] Block IO Controller V2 - some results

On Tue, Nov 17, 2009 at 11:38 PM, Vivek Goyal <[email protected]> wrote:
>
> Ok, now I understand it better. I had missed the st->count part. So if
> there are other sync-noidle queues backlogged (st->count > 0), then we
> don't idle on same process to get more request, if hw_tag=1 or it is an SSD
> and move on to next sync-noidle process to dispatch requests from.
Yes.
>
> But if this is last cfqq on the service tree under this workload, we will
> still idle on the service tree/workload type and not start dispatching
> request from other service tree (of same prio class).
Yes.
>
>> Without this idle, we won't get fair behaviour for no-idle queues.
>> This idle is enabled regardless of NCQ for rotational media. It is
>> only disabled on NCQ SSDs (the whole function is skipped in that
>> case).
>
> So If I have a fast storage array with NCQ, we will still idle and not
> let sync-idle queues or async queues get to dispatch. Anyway, that's a
> side issue for the moment.
It is intended. If we don't idle, random readers will dispatch just
once and then the sequential readers will monopolize the disk for too
much time. This was the former CFQ behaviour, and various tests showed
an improvement with this idle.

>> So, having more than one no-idle service tree, as in your approach to
>> groups, introduces the problem we see.
>>
>
> True, having multiple no-idle workload is problem here. Can't think of
> a solution. Putting workload type on top also is not logically good where
> workload type determines the share of disk/array. This is so unintuitive.
If you think that sequential and random are incommensurable, then it
becomes natural to do all the weighting and the scheduling
independently.
> I guess I will document this issue with random IO workload issue.
>
> May be we can do little optimization in the sense, in cfq_should_idle(), I can
> check if there are other competing sync and async queues in the cfq_group or
> not. If there are no competing queues then we don't have to idle on the
> sync-noidle service tree. That's a different thing that we might still
> want to idle on the group as a whole to make sure a single random reader
> has got good latencies and is not overwhelmed by other groups running
> sequential readers.
It will not change the outcome. You just rename the end of tree idle
as group idle, but the performance drop is the same.
>> >
>> > This is all subjected to the fact that we have done a good job in
>> > detecting the queue depth and have updated hw_tag accordingly.
>> >
>> > On slower rotational hardware, where we will actually do idling on
>> > sync-noidle per group, idling can infact help you because it will reduce
>> > the number of seeks (As it does on my locally connected SATA disk).
>> Right. We will do a small idle between no-idle queues, and a larger
>> one at the end.
>
> If we do want to do a small idle between no-idle queues, why do you allow
> preemption of one sync-noidle queue with other sync-noidle queue.
The preemption is useful when you are waiting on an empty tree. In
that case, any random request is good enough.
In the non-NCQ case, where we can idle even if the service tree is not
empty, I forgot to add the check. Good point.

>
> IOW, what's the point of waiting for small period between queues? They are
> anyway random seeky readers.
Smaller seeks take less time. If your random readers are reading from
contiguous files, they will be doing small seeks, so you still get an
improvement waiting a bit.

>
> Idling between queues can help a bit if we have sync-noidle reader and
> multiple sync-noidle sync writers. A sync-noidle reader can still witness
> higher latencies if multiple libaio driven sync writers are present. We
> discussed this issue briefly in private mail. But at the moment, allowing
> preemption will wipe out that advantage.
This applies also if you do random reads at a deeper depth, e.g. using
libaio or just posix_fadvise/readahead.
My proposed solution for this is to classify those queues as idling,
to get the usual time based fairness.

>
> I understand now up to some extent. One question still remains though is
> that why do we choose to idle on fast arrays. Faster the array (backed by
> more disks), more harmful the idling becomes.
Not if you do it just once every scheduling turn, and you obtain
fairness for random readers in this way.
On a fast rotational array, to obtain high BW, you have two options:
* large sequential read
* many parallel random reads
So it is better to devote the full array in turn to each sequential
task, and then for some time, to all the remaining random ones.
>
> Maybe using your dynamic cfq tuning patches might help here. If average
> read time is less, than driver deeper queue depths otherwise reduce the
> queue depth as underlying device/array can't handle that much.

In autotuning, I'll allow breaking sequentiality only if random
requests are serviced in less than 0.5 ms on average.
Otherwise, I'll still prefer to allocate a contiguous timeslice for
each sequential reader, and another one for all random ones.
Clearly, the time to idle for each process, and the contiguous
timeslice, will be proportional to the penalty incurred by a seek, so
I measure the average seek time for that purpose.

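In rough pseudo-C, the rule will be something along these lines (purely
illustrative; the actual autotuning patches gather the statistics
differently):

/* break sequentiality only when random requests are cheap to service */
static bool can_break_sequentiality(u64 avg_rand_service_ns)
{
        return avg_rand_service_ns < 500000;    /* < 0.5 ms */
}

/* never idle longer than what an average seek would cost us */
static u64 idle_time_ns(u64 avg_seek_ns, u64 max_idle_ns)
{
        return min(avg_seek_ns, max_idle_ns);
}
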
> I am still trying to understand your patches fully. So are you going to
> idle even on sync-idle and async trees? In cfq_should_idle(), I don't see
> any distinction between various kind of trees so it looks like we are
> going to idle on async and sync-idle trees also? That looks unnecessary?
For me, the idle on the end of a service tree is equivalent to an idle
on a queue.
Since sequential sync queues already have their idle, no additional idle is introduced.
For async, since they are always preempted by sync of the same priority,
the idle at the end just protects from lower priority class queues.

>
> Regular idle does not work if slice has expired. There are situations with
> sync-idle readers that I need to wait for next request for group to get
> backlogged. So it is not useless. It does kick-in only in few circumstances.
Are those circumstances worth the extra complexity?
If the only case is when there is just one process doing I/O in a
high weight group,
wouldn't just increasing this process' slice above the usual 100ms do
the trick, with less complexity?

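Something like this, as a sketch (made-up helper; the real slice logic in
cfq_prio_to_slice() is more involved, and 500 as the default blkio weight
is my assumption):

static unsigned int weighted_slice(unsigned int base_slice_ms,
                                   unsigned int group_weight)
{
        /* e.g. a weight-800 group's lone process gets 160ms instead of 100ms */
        return base_slice_ms * group_weight / 500;
}
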
>> You can either get isolation, or performance. Not both at the same time.
>
> Agreed.
>
> Thanks
> Vivek
>

Thanks,
Corrado

2009-11-18 15:34:12

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] Block IO Controller V2 - some results

On Tue, Nov 17, 2009 at 07:38:47AM -0500, Alan D. Brunelle wrote:
> On Mon, 2009-11-16 at 17:18 -0500, Vivek Goyal wrote:
> > On Mon, Nov 16, 2009 at 03:51:00PM -0500, Alan D. Brunelle wrote:
> >
> > [..]
> > > ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
> > >
> > > The next thing to look at is to see what the "penalty" is for the
> > > additional code: see how much bandwidth we lose for the capability
> > > added. Here we see the sum of the system's throughput for the various
> > > tests:
> > >
> > > ---- ---- - ----------- ----------- ----------- -----------
> > > Mode RdWr N base ioc off ioc no idle ioc idle
> > > ---- ---- - ----------- ----------- ----------- -----------
> > > rnd rd 2 17.3 17.1 9.4 9.1
> > > rnd rd 4 27.1 27.1 8.1 8.2
> > > rnd rd 8 37.1 37.1 6.8 7.1
> > >
> >

Hi Alan,

I got hold of a better system where a few disks are striped. On this system
I can also see the performance drop, and it does come from the fact that
we idle on an empty sync-noidle service tree. So with multiple groups we
increase the number of sync-noidle trees, hence more idling and hence
reduced throughput.

With this patch, I had done a little optimization where I don't idle on
a sync-noidle service tree if there are no competing sync-idle or async
queues.

What that means is that if a group is doing only random IO and it does not
have sufficient IO to keep the disk busy, then we will move on to the next group
and start dispatching from that. So in your tests, with this patch you
should see better results for random reads with group_idle=0. With
group_idle=1, results will remain the same as we continue to idle on the
sync-noidle service tree.

It is working for me. Can you please try it out? You can just apply this
patch on top of V3. (You don't have to apply the hw_tag patch from Corrado.)

Thanks
Vivek


o Now we don't remove the queue from the service tree until we expire it, even if
the queue is empty. So st->count = 0 is not a valid state while we have a cfqq
at hand. Fix it in arm_slice_timer().

o wait_busy gets set only if the group is empty. If st->count > 1, then the group is
not empty and wait_busy will not be set. Remove that extra check.

o There is no need to idle on the async service tree as it is backlogged most of
the time if writes are on. Those queues don't get deleted, hence don't wait
on the async service tree. Similarly, don't wait on the sync-idle service tree as
we do idling on individual queues if need be. Fixed cfq_should_idle().

o Currently we wait on sync-noidle service tree so that sync-noidle type of
workload does not get swamped by sync-idle or async type of workload. Don't
do this idling if there are no sync-idle or async type of queues in the group
and there are other groups to dispatch the requests from and user has decided
not to wait on slow groups to achieve better throughput. (group_idle=0).

This will make sure if some group is doing just random IO and does not
have sufficient IO to keep the disk busy, we will move onto other groups to
dispatch the requests from and utilize the storage better.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 35 +++++++++++++++++++----------------
1 file changed, 19 insertions(+), 16 deletions(-)

Index: linux6/block/cfq-iosched.c
===================================================================
--- linux6.orig/block/cfq-iosched.c 2009-11-17 14:44:09.000000000 -0500
+++ linux6/block/cfq-iosched.c 2009-11-18 10:09:33.000000000 -0500
@@ -899,6 +899,10 @@ static bool cfq_should_idle(struct cfq_d
{
enum wl_prio_t prio = cfqq_prio(cfqq);
struct cfq_rb_root *service_tree = cfqq->service_tree;
+ struct cfq_group *cfqg = cfqq->cfqg;
+
+ BUG_ON(!service_tree);
+ BUG_ON(!service_tree->count);

/* We never do for idle class queues. */
if (prio == IDLE_WORKLOAD)
@@ -908,18 +912,18 @@ static bool cfq_should_idle(struct cfq_d
if (cfq_cfqq_idle_window(cfqq))
return true;

+ /* Don't idle on async and sync-idle service trees */
+ if (cfqd->serving_type != SYNC_NOIDLE_WORKLOAD)
+ return false;
/*
- * Otherwise, we do only if they are the last ones
- * in their service tree.
+ * If there are other competing groups present, don't wait on service
+ * tree if this is last queue in the group and there are no other
+ * competing queues (sync-idle or async) queues present
*/
- if (!service_tree)
- service_tree = service_tree_for(cfqq->cfqg, prio,
- cfqq_type(cfqq), cfqd);
-
- if (service_tree->count == 0)
- return true;
-
- return (service_tree->count == 1 && cfq_rb_first(service_tree) == cfqq);
+ if (cfqd->nr_groups > 1 && !cfqd->cfq_group_idle)
+ return (service_tree->count == 1 && cfqg->nr_cfqq > 1);
+ else
+ return service_tree->count == 1;
}

#ifdef CONFIG_CFQ_GROUP_IOSCHED
@@ -1102,9 +1106,6 @@ static inline bool cfqq_should_wait_busy
if (!RB_EMPTY_ROOT(&cfqq->sort_list) || cfqq->cfqg->nr_cfqq > 1)
return false;

- if (!cfq_should_idle(cfqq->cfqd, cfqq))
- return false;
-
return true;
}

@@ -1801,7 +1802,10 @@ static bool cfq_arm_slice_timer(struct c
/*
* idle is disabled, either manually or by past process history
*/
- if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq))
+ if (!cfqd->cfq_slice_idle)
+ return false;
+
+ if (!cfq_should_idle(cfqd, cfqq) && !wait_busy)
return false;

/*
@@ -1837,8 +1841,7 @@ static bool cfq_arm_slice_timer(struct c
*/
st = service_tree_for(cfqq->cfqg, cfqd->serving_prio,
SYNC_NOIDLE_WORKLOAD, cfqd);
- if (!wait_busy && cfqd->serving_type == SYNC_NOIDLE_WORKLOAD
- && st->count > 0) {
+ if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD && st->count > 1) {
if (blk_queue_nonrot(cfqd->queue) || cfqd->hw_tag)
return false;
sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));

2009-11-18 16:20:15

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: [RFC] Block IO Controller V2 - some results

Hi Vivek,
On Wed, Nov 18, 2009 at 4:32 PM, Vivek Goyal <[email protected]> wrote:
> o Currently we wait on sync-noidle service tree so that sync-noidle type of
>  workload does not get swamped by sync-idle or async type of workload. Don't
>  do this idling if there are no sync-idle or async type of queues in the group
>  and there are other groups to dispatch the requests from and user has decided
>  not to wait on slow groups to achieve better throughput. (group_idle=0).
>
>  This will make sure if some group is doing just random IO and does not
>  have sufficient IO to keep the disk busy, we will move onto other groups to
>  dispatch the requests from and utilize the storage better.
>
This group will be treated unfairly, if the other groups are doing
sequential I/O:
It will dispatch one request every 100ms (at best), and every 300ms at worst.
I can't see how this is any better than having a centralized service
tree for all sync-noidle queues.

Probably it is better to just say:
* if the user wants isolation (group_idle should be named
group_isolation), the no-idle queues go into the group no-idle tree,
and a proper idling is ensured
* if the user doesn't want isolation, but performance, then the
no-idle queues go into the root group no-idle tree, for which the end
of tree idle should be ensured. This won't affect the sync-idle
queues, for which group weighting will still work unaffected.

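In code, the routing would be roughly (a sketch only; the tunable and
field names are illustrative, not a patch):

static struct cfq_group *noidle_home_group(struct cfq_data *cfqd,
                                           struct cfq_queue *cfqq)
{
        /* isolation: keep the no-idle queue on its own group's tree */
        if (cfqd->cfq_group_isolation)
                return cfqq->cfqg;

        /*
         * performance: park it on the root group's no-idle tree, so a
         * single end-of-tree idle covers all the random readers
         */
        return &cfqd->root_group;
}
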
Corrado

2009-11-18 22:58:10

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] Block IO Controller V2 - some results

On Wed, Nov 18, 2009 at 05:20:12PM +0100, Corrado Zoccolo wrote:
> Hi Vivek,
> On Wed, Nov 18, 2009 at 4:32 PM, Vivek Goyal <[email protected]> wrote:
> > o Currently we wait on sync-noidle service tree so that sync-noidle type of
> >  workload does not get swamped by sync-idle or async type of workload. Don't
> >  do this idling if there are no sync-idle or async type of queues in the group
> >  and there are other groups to dispatch the requests from and user has decided
> >  not to wait on slow groups to achieve better throughput. (group_idle=0).
> >
> >  This will make sure if some group is doing just random IO and does not
> >  have sufficient IO to keep the disk busy, we will move onto other groups to
> >  dispatch the requests from and utilize the storage better.
> >
> This group will be treated unfairly, if the other groups are doing
> sequential I/O:
> It will dispatch one request every 100ms (at best), and every 300ms at worst.
> I can't see how this is any better than having a centralized service
> tree for all sync-noidle queues.
>
> Probably it is better to just say:
> * if the user wants isolation (group_idle should be named
> group_isolation), the no-idle queues go into the group no-idle tree,
> and a proper idling is ensured
> * if the user doesn't want isolation, but performance, then the
> no-idle queues go into the root group no-idle tree, for which the end
> of tree idle should be ensured. This won't affect the sync-idle
> queues, for which group weighting will still work unaffected.

Moving all the queues to the root group is one way to solve the issue, though
the problem still remains if there are 7-8 sequential workload groups operating
with low_latency=0. In that case, after every dispatch round of the sync-noidle
workload in the root group, the next round might be much more than 300ms away,
bumping up the max latencies of the sync-noidle workload.

I think one of the core problems seems to be that I always put the group at
the end of the service tree. Instead I should let the group be deleted from the
service tree if it does not have sufficient IO, and when it comes back
again, try to put it at the beginning of the tree according to its weight so
that not all is lost and it gets to dispatch IO sooner.

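Roughly, the idea is (sketch only, helper names are made up):

/* where to (re)insert a group on the group service tree */
static u64 cfqg_insert_key(struct cfq_rb_root *st, bool returning_early)
{
        if (returning_early)
                return min_key(st);     /* front: it still has share left */
        return max_key(st) + 1;         /* back: it just used a long slice */
}
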
This way, the groups which have been using long slices (either because
they are running a sync-idle workload or because they have sufficient IO
to keep the disk busy) will be towards the later end of the service tree, and the
groups which are new or which have lost their share because they
dispatched a small amount of IO and got deleted will be put at the front of the tree.

This way sync-noidle queues in a group will not lose out because of
sync-idle IO happening in other groups.

I have written a couple of small patches and am still testing them to see
whether they work fine in various configurations.

Will post patches after some testing.

Thanks
Vivek

2009-11-18 23:35:09

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: [RFC] Block IO Controller V2 - some results

Hi Vivek,
On Wed, Nov 18, 2009 at 11:56 PM, Vivek Goyal <[email protected]> wrote:
> Moving all the queues to root group is one way to solve the issue. Though
> problem still remains if there are 7-8 sequential workload groups operating
> with low_latency=0. In that case after every dispatch round of sync-noidle
> workload in root group, next round might be much more than 300ms, hence
> bumping up the max latencies of sync-noidle workload.

I think that this is the desired behaviour: low_latency=0 means that
latency is less important than throughput, so I wouldn't worry about
it.

>
> I think one of the core problem seems to be that I always put the group at
> the end of service tree. Instead I should let the group delete from
> service tree if it does not have sufficient IO, and when it comes back
> again, try to put it in the beginning of tree according to weight so
> that not all is lost and it gets to dispatch IO sooner.

It is similar to how the queues are put in service tree in cfq without groups.
If a queue had some remaining slice, it is prioritized w.r.t. ones
that consumed their slice completely, by giving it a lower key.

> This way, the groups which have been using long slices (either because
> they are running sync-idle workload or because they have sufficient IO
> to keep the disk busy), will be towards later end of service tree and the
> groups which are new or which have lost their share because they have
> dispatched a small IO and got deleted, will be put at the front of tree.
>
> This way sync-noidle queues in a group will not lose out because of
> sync-idle IO happening in other groups.

It is ok if you have group idling, but if you disable it (and end of
tree idle), it will be similar to how CFQ was before my patch set (and
experiments showed that the approach was inferior to grouping no-idle
together), without the service differentiation benefit introduced by
your idling.
So I still prefer the binary choice: either you want fairness (by
idling) or performance (by putting all no-idle queues together).

>
> I have written couple of small patches and still testing it out to see
> whether it is working fine in various configurations.
>
> Will post patches after some testing.
>
> Thanks
> Vivek
>

Thanks
Corrado

2009-11-19 00:06:49

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] Block IO Controller V2 - some results

On Wed, Nov 18, 2009 at 12:11:06AM +0100, Corrado Zoccolo wrote:
> On Tue, Nov 17, 2009 at 11:38 PM, Vivek Goyal <[email protected]> wrote:
> >
> > Ok, now I understand it better. I had missed the st->count part. So if
> > there are other sync-noidle queues backlogged (st->count > 0), then we
> > don't idle on same process to get more request, if hw_tag=1 or it is an SSD
> > and move on to next sync-noidle process to dispatch requests from.
> Yes.
> >
> > But if this is last cfqq on the service tree under this workload, we will
> > still idle on the service tree/workload type and not start dispatching
> > request from other service tree (of same prio class).
> Yes.
> >
> >> Without this idle, we won't get fair behaviour for no-idle queues.
> >> This idle is enabled regardless of NCQ for rotational media. It is
> >> only disabled on NCQ SSDs (the whole function is skipped in that
> >> case).
> >
> > So If I have a fast storage array with NCQ, we will still idle and not
> > let sync-idle queues or async queues get to dispatch. Anyway, that's a
> > side issue for the moment.
> It is intended. If we don't idle, random readers will dispatch just
> once and then the sequential readers will monopolize the disk for too
> much time. This was the former CFQ behaviour, and various tests showed
> an improvement with this idle.

Actually I was assuming that we will not idle even on sync-idle queues
on fast arrays. I am wondering what happens when we are running a sync-idle
workload on a storage array with lots of disks. By letting only one
process/queue dispatch IO, are we not introducing a lot of serialization
here, and can we get more out of the array by dispatching requests from more
sync-idle queues and stopping if it becomes seek bound?

I got an array of 5 striped disks. I did some experiments with
slice_idle=0. I don't see a performance improvement in the case of buffered reads. As
I increase the number of processes the seek cost is significant and total
throughput drops. I think the readahead logic reads in bigger block sizes, and
that should keep all the disks busy, hence no gains here.

I did see some gains if I was doing direct sequential IO with 4K block
size. With slice_idle=8 following is total throughput with 1,2,4,8,16
readers.

16,278KB/s 16,173KB/s 15,891KB/s 15,678KB/s 15,847KB/s

With slice_idle=0, following is total throughput.

16,206KB/s 22,851KB/s 26,368KB/s 29,197KB/s 28,806KB/s

So by allowing more sequential direct readers to dispatch simultaneously,
I can get more out of array in this case.

>
> >> So, having more than one no-idle service tree, as in your approach to
> >> groups, introduces the problem we see.
> >>
> >
> > True, having multiple no-idle workload is problem here. Can't think of
> > a solution. Putting workload type on top also is not logically good where
> > workload type determines the share of disk/array. This is so unintuitive.
> If you think that sequential and random are incommensurable, then it
> becomes natural to do all the weighting and the scheduling
> independently.

I am not ruling out the option of keeping the workload type on top. For
determining the workload share out of 300ms, we can use the total of the group
weights on all three workload trees and proportion out the share. That
way at least the number of queues in a group doesn't change the share of the group
based on workload type.

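The arithmetic would be roughly (sketch only):

/*
 * Each workload type gets a share of the ~300ms scheduling round in
 * proportion to the summed weight of the groups queued on it.
 */
static unsigned int workload_share_ms(unsigned int weight_on_this_workload,
                                      unsigned int weight_on_all_workloads)
{
        return 300 * weight_on_this_workload / weight_on_all_workloads;
}
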
That will still leave the issue of a disk's share being decided by the number of
groups doing a particular type of IO.

I am first trying to make the more intuitive thing work. If it does not
work reasonably, we can always switch to the groups-within-workload method.

> > I guess I will document this issue with random IO workload issue.
> >
> > May be we can do little optimization in the sense, in cfq_should_idle(), I can
> > check if there are other competing sync and async queues in the cfq_group or
> > not. If there are no competing queues then we don't have to idle on the
> > sync-noidle service tree. That's a different thing that we might still
> > want to idle on the group as a whole to make sure a single random reader
> > has got good latencies and is not overwhelmed by other groups running
> > sequential readers.
> It will not change the outcome. You just rename the end of tree idle
> as group idle, but the performance drop is the same.

True. I am trying to clean up the group_idle=0 path so that we don't incur
this additional wait penalty there. Also trying to see if I can make
group_idle=0 the default so that we get reasonable isolation and reasonable
performance from the box.

> >> >
> >> > This is all subjected to the fact that we have done a good job in
> >> > detecting the queue depth and have updated hw_tag accordingly.
> >> >
> >> > On slower rotational hardware, where we will actually do idling on
> >> > sync-noidle per group, idling can infact help you because it will reduce
> >> > the number of seeks (As it does on my locally connected SATA disk).
> >> Right. We will do a small idle between no-idle queues, and a larger
> >> one at the end.
> >
> > If we do want to do a small idle between no-idle queues, why do you allow
> > preemption of one sync-noidle queue with other sync-noidle queue.
> The preemption is useful when you are waiting on an empty tree. In
> that case, any random request is good enough.
> In the non-NCQ case, where we can idle even if the service tree is not
> empty, I forgot to add the check. Good point.
>
> >
> > IOW, what's the point of waiting for small period between queues? They are
> > anyway random seeky readers.
> Smaller seeks take less time. If your random readers are reading from
> contiguous files, they will be doing small seeks, so you still get an
> improvement waiting a bit.

I am not sure if waiting a bit between queues on non-rotational media is
working, because even in select_queue we should expire the current
sync-noidle queue and move on to the next sync-noidle queue.

>
> >
> > Idling between queues can help a bit if we have sync-noidle reader and
> > multiple sync-noidle sync writers. A sync-noidle reader can still witness
> > higher latencies if multiple libaio driven sync writers are present. We
> > discussed this issue briefly in private mail. But at the moment, allowing
> > preemption will wipe out that advantage.
> This applies also if you do random reads at a deeper depth, e.g. using
> libaio or just posix_fadvise/readahead.
> My proposed solution for this is to classify those queues are idling,
> to get the usual time based fairness.

But these sync reads/writes could just be random, and you will suffer a
severe performance penalty if you treat them as sync-idle queues. It is
like going back to enabling idling on a random seeky reader.

>
> >
> > I understand now up to some extent. One question still remains though is
> > that why do we choose to idle on fast arrays. Faster the array (backed by
> > more disks), more harmful the idling becomes.
> Not if you do it just once every scheduling turn, and you obtain
> fairness for random readers in this way.
> On a fast rotational array, to obtain high BW, you have two options:
> * large sequential read
> * many parallel random reads
> So it is better to devote the full array in turn to each sequential
> task, and then for some time, to all the remaining random ones.

Doing a group idle on many parallel random reads is fine. I am pointing
towards idling on sequential reads. If these are buffered sequential reads,
things are probably fine. But what if these are direct IO with a
smaller block size? Are we not keeping the array underutilized here?

> >
> > Maybe using your dynamic cfq tuning patches might help here. If average
> > read time is less, than driver deeper queue depths otherwise reduce the
> > queue depth as underlying device/array can't handle that much.
>
> In autotuning, I'll allow breaking sequentiality only if random
> requests are serviced in less than 0.5 ms on average.
> Otherwise, I'll still prefer to allocate a contiguous timeslice for
> each sequential reader, and an other one for all random ones.
> Clearly, the time to idle for each process, and the contiguous
> timeslice, will be proportional to the penalty incurred by a seek, so
> I measure the average seek time for that purpose.

Ok. I have yet to test your patch.

>
> > I am still trying to understand your patches fully. So are you going to
> > idle even on sync-idle and async trees? In cfq_should_idle(), I don't see
> > any distinction between various kind of trees so it looks like we are
> > going to idle on async and sync-idle trees also? That looks unnecessary?
> For me, the idle on the end of a service tree is equivalent to an idle
> on a queue.
> Since sequential sync already have their idle, no additional idle is introduced.

If CFQ disables idling on a queue in the middle of a slice, we will continue
to idle on it and not take the service tree change into account till the end of the
slice. This is something very minor. More than that, I think idling on every
type of service tree is confusing.

> For async, since they are always preempted by sync of the same priority,
> the idle at the end just protects from lower priority class queues.

I thought async queues are always preempted by sync irrespective of priority
class or priority (with the exception of the idle class). Even RT asyncs are
preempted by BE sync queues. So waiting on an async queue does not make sense.

>
> >
> > Regular idle does not work if slice has expired. There are situations with
> > sync-idle readers that I need to wait for next request for group to get
> > backlogged. So it is not useless. It does kick-in only in few circumstances.
> Are those circumstances worth the extra complexity?
> If the only case is when there is just one process doing I/O in an
> high weight group,
> wouldn't just increase this process' slice above the usual 100ms do
> the trick, with less complexity?

That does not work all the time. If a sync queue within the group is preempted by
another queue in the group (a sync queue containing metadata), then we
lose track of the old queue, and the metadata queue will expire almost immediately
afterwards, leading to deletion of the group.

I ran into these interesting issues in the past when I tried
something similar. Will have a look at it again.

Thanks
Vivek

2009-11-19 16:59:11

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] Block IO Controller V2 - some results

On Tue, Nov 17, 2009 at 03:38:09PM -0500, Alan D. Brunelle wrote:
> Hi Vivek -
>
> I've updated the runs - the results are shown at the end, I culled out
> the write-related runs (haven't converted to direct I/O yet for those).
>
> Next steps: Going to refresh to V3 of the patches and add in
>
> http://groups.google.com/group/linux.kernel/browse_thread/thread/3b62f0665f0912b6/34ec9456c7da1bb7?lnk=raot
>
> and convert the write-side to doing direct I/O (that would include the
> rdwr tests, btw).
>
> ----------------------------------------------------------------------
>
> I've relabeled things:
>
> i0 = io control generated in the kernel, but not enabled
> i1 = io control generated in the kernel, and is enabled
>
> g0 = group_idle=0
> g1 = group_idle=1 (default)
>
> s8 = slide_idle=8 (default)
> s0 = slice_idle=0
>
> It looks like when the io control stuff is enabled we have random read
> problems.
>
> When it is enabled and slice_idle is set to 0 we see sequential reads
> drop noticeably:

I think the drop in sequential read throughput is an effect of slice_idle=0 and
not of the io controller stuff.

Thanks
Vivek



>
> ---- ---- - --------- --------- --------- --------- --------- ---------
> Mode RdWr N base i0,gX,sX i1,g1,s8 i1,g0,s8 i1,g1,s0 i1,g0,s0
> ---- ---- - --------- --------- --------- --------- --------- ---------
> rnd rd 2 17.3 17.1 9.1 9.3 9.4 9.5
> rnd rd 4 27.1 27.1 8.2 8.0 8.0 8.0
> rnd rd 8 37.1 37.1 7.1 6.8 6.7 6.8
>
> seq rd 2 610.2 608.1 607.7 611.0 551.1 550.3
> seq rd 4 608.4 601.5 607.7 609.6 549.6 550.0
> seq rd 8 605.7 603.7 604.0 605.6 547.2 546.7
>
> ===============================================================
>
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> base rnd rd 2 8.6 8.6
> base rnd rd 4 6.8 6.8 6.8 6.7
> base rnd rd 8 4.7 4.6 4.6 4.6 4.6 4.6 4.6 4.6
>
> base seq rd 2 306.2 304.0
> base seq rd 4 150.1 152.4 151.9 154.0
> base seq rd 8 77.2 75.9 75.9 73.9 77.0 75.7 75.0 74.9
>
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> i0,gX,sX rnd rd 2 8.6 8.6
> i0,gX,sX rnd rd 4 6.8 6.8 6.7 6.7
> i0,gX,sX rnd rd 8 4.7 4.6 4.6 4.7 4.6 4.6 4.6 4.6
>
> i0,gX,sX seq rd 2 303.1 305.0
> i0,gX,sX seq rd 4 150.8 151.6 149.0 150.2
> i0,gX,sX seq rd 8 77.0 76.3 74.5 74.0 77.9 75.5 74.0 74.6
>
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> i1,g1,s8 rnd rd 2 2.8 6.3
> i1,g1,s8 rnd rd 4 0.7 1.5 2.5 3.5
> i1,g1,s8 rnd rd 8 0.2 0.4 0.5 0.7 0.9 1.2 1.4 1.7
>
> i1,g1,s8 seq rd 2 221.6 386.1
> i1,g1,s8 seq rd 4 70.6 128.1 181.7 227.3
> i1,g1,s8 seq rd 8 21.4 40.3 55.8 71.2 85.1 99.1 109.4 121.7
>
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> i1,g0,s8 rnd rd 2 4.7 4.7
> i1,g0,s8 rnd rd 4 2.0 2.0 2.0 2.0
> i1,g0,s8 rnd rd 8 0.9 0.9 0.9 0.8 0.8 0.8 0.9 0.9
>
> i1,g0,s8 seq rd 2 305.9 305.0
> i1,g0,s8 seq rd 4 154.0 153.3 151.2 151.1
> i1,g0,s8 seq rd 8 76.1 76.1 74.5 74.9 75.4 75.8 76.1 76.6
>
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> i1,g1,s0 rnd rd 2 4.7 4.7
> i1,g1,s0 rnd rd 4 2.0 2.0 2.0 2.0
> i1,g1,s0 rnd rd 8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.9
>
> i1,g1,s0 seq rd 2 275.6 275.5
> i1,g1,s0 seq rd 4 136.0 137.5 137.4 138.7
> i1,g1,s0 seq rd 8 68.6 68.6 68.5 68.6 67.1 68.6 68.3 68.9
>
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> i1,g0,s0 rnd rd 2 4.7 4.7
> i1,g0,s0 rnd rd 4 2.0 2.0 2.0 2.0
> i1,g0,s0 rnd rd 8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.9
>
> i1,g0,s0 seq rd 2 275.2 275.1
> i1,g0,s0 seq rd 4 136.7 137.2 137.5 138.7
> i1,g0,s0 seq rd 8 68.8 68.2 68.5 68.4 67.2 68.8 68.4 68.5
>

2009-11-19 20:12:52

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: [RFC] Block IO Controller V2 - some results

Hi Vivek,
On Thu, Nov 19, 2009 at 1:04 AM, Vivek Goyal <[email protected]> wrote:
>
> Actually I was assuming that we will not idle even on sync-idle queues
> on fast arrays. I am wondering what happens when we are running sync-idle
> workload on an storage array with lots of disks. By letting only one
> process/queue dispatch IO, are we not introducing lot of serialization
> here and can we get more out of array by dispatching requests from more
> sync-idle queues and stop if it becomes seek bound.
As Alan's test with slice_idle = 0 showed, not idling will quickly
reduce sequential throughput as soon as you add more readers.
So we prefer to idle even on fast arrays.

>
> I got an array of 5 striped disks. I did some experiments with
> slice_idle=0. I don't see performance improvement in case of buffered reads. As
> I increase the number of processes seek cost is significant and total
> throughput drops. I think readahead logic reads in bigger block sizes, and
> that should keep all the disks busy hence no gains here.
>
> I did see some gains if I was doing direct sequential IO with 4K block
> size. With slice_idle=8 following is total throughput with 1,2,4,8,16
> readers.
>
> 16,278KB/s 16,173KB/s 15,891KB/s 15,678KB/s 15,847KB/s
>
> With slice_idle=0, following is total throughput.
>
> 16,206KB/s 22,851KB/s 26,368KB/s 29,197KB/s 28,806KB/s
>
> So by allowing more sequential direct readers to dispatch simultaneously,
> I can get more out of array in this case.

Right, but I don't think this scenario is realistic. The people that
are using direct I/O are smart enough not to submit a single small
request at a time, especially if they are doing sequential I/O. They
will probably use large requests and/or aio to submit multiple
requests at a time.

>>
>> >> So, having more than one no-idle service tree, as in your approach to
>> >> groups, introduces the problem we see.
>> >>
>> >
>> > True, having multiple no-idle workload is problem here. Can't think of
>> > a solution. Putting workload type on top also is not logically good where
>> > workload type determines the share of disk/array. This is so unintuitive.
>> If you think that sequential and random are incommensurable, then it
>> becomes natural to do all the weighting and the scheduling
>> independently.
>
> I am not ruling out the option of keeping workload type on top. For
> determining the workload share out of 300ms, we can use total of group
> weights on all the tree workload trees and proportion out the share. That
> way at least number of queues in a group don't change the share of group
> based on workload type.
>
> That will still leave the issue of a disk share being decided by number of
> groups doing a particular type of IO.
>
> I am first trying to make the more intutive thing work. If it does not
> work reasonably, we can always switch to groups with-in workload method.
>
> I am not sure if waiting a bit between queues on non-rotational media is
> working. Because even in select queue, we should expire the current
> sync-noidle queue and move onto next sync-noidle queue.
The small idle is enabled only on non-NCQ rotational media.

>> My proposed solution for this is to classify those queues are idling,
>> to get the usual time based fairness.
>
> But these sync reads/writes could just be random and you will suffer the
> severe performance penalty if you treat these as sync-idle queues? It is
> like going back to enabling idling on random seeky reader.

Here, again, I suppose that if someone is using aio or readahead, then
he is already optimizing for full utilization of the disks. In that
case, giving him a full slice of exclusive access to the disks is the
best thing to do, and exactly what he expects.

>>
>> >
>> > I understand now up to some extent. One question still remains though is
>> > that why do we choose to idle on fast arrays. Faster the array (backed by
>> > more disks), more harmful the idling becomes.
>> Not if you do it just once every scheduling turn, and you obtain
>> fairness for random readers in this way.
>> On a fast rotational array, to obtain high BW, you have two options:
>> * large sequential read
>> * many parallel random reads
>> So it is better to devote the full array in turn to each sequential
>> task, and then for some time, to all the remaining random ones.
>
> Doing a group idle on many parallel random reads is fine. I am pointing
> towards idling on sequential reads. If it is buffered sequential reads
> things are probably fine. But what about if these are direct IO, with
> smaller block size. Are we not keeping array underutilized here?

The underutilization would appear even if the application is run alone
on an uncontended disk, so it is not unreasonable to ask the programmer
to do his homework and optimize the application in this case.

>> >
> >> > Maybe using your dynamic cfq tuning patches might help here. If average
>> > read time is less, than driver deeper queue depths otherwise reduce the
>> > queue depth as underlying device/array can't handle that much.
>>
>> In autotuning, I'll allow breaking sequentiality only if random
>> requests are serviced in less than 0.5 ms on average.
>> Otherwise, I'll still prefer to allocate a contiguous timeslice for
>> each sequential reader, and an other one for all random ones.
>> Clearly, the time to idle for each process, and the contiguous
>> timeslice, will be proportional to the penalty incurred by a seek, so
>> I measure the average seek time for that purpose.
>
> Ok. I have yet to test your patch.
>
>>
>> > I am still trying to understand your patches fully. So are you going to
>> > idle even on sync-idle and async trees? In cfq_should_idle(), I don't see
>> > any distinction between various kind of trees so it looks like we are
>> > going to idle on async and sync-idle trees also? That looks unnecessary?
>> For me, the idle on the end of a service tree is equivalent to an idle
>> on a queue.
>> Since sequential sync already have their idle, no additional idle is introduced.
>
> If CFQ disables idling on a queue in the middle of slice, we will continue
> to idle on it and not take service tree change into account till end of
> slice. This is something very minor. More than that, I think it is
> confusing, on every type of service tree.

Maybe not changing our mind in the middle is better. The seek average can
increase sometimes when files are fragmented or you need to fetch
metadata, but the queue could still be mostly sequential. So you
should not disable idling as soon as the seek average increases.

>> For async, since they are always preempted by sync of the same priority,
>> the idle at the end just protects from lower priority class queues.
>
> I thought async are always preempted by sync irrespective of priority
> class or priority (With the exception of idle class). Even RT asyncs are
> preempted by BE sync queues. So waiting on async queue does not make sense.

It simplifies the code not having to special case something that
doesn't change the outcome.

>>
>> >
>> > Regular idle does not work if slice has expired. There are situations with
>> > sync-idle readers that I need to wait for next request for group to get
>> > backlogged. So it is not useless. It does kick-in only in few circumstances.
>> Are those circumstances worth the extra complexity?
>> If the only case is when there is just one process doing I/O in an
>> high weight group,
>> wouldn't just increase this process' slice above the usual 100ms do
>> the trick, with less complexity?
>
> Does not work all the time. If sync queue with-in group is preempted by
> another queue in the group (a sync queue containing meta data), then we
> lose track of old queue and meta data queue will expire almost immediately
> after hence leading to deletion of group.
I think this is the correct behaviour.
If you have a queue that is preempted during idle, the time spent servicing
the preempting queue, which will usually imply a large seek, will be
enough for the old queue to become backlogged again. The time to service
this large seek will be comparable with the idle time, so if the old
queue is not backlogged when the pre-empting queue is expired, then
it is correct to remove the group and switch to a new one.

>
> I had ran into these interesting issues in the past when I had tried
> something similar. Will have a look at it again.
I have another idea that could help reduce the code duplication in
the IO controller patches, and could support hierarchical groups.

A scheduler (like CFQ) deals with scheduling entities. Currently, the
scheduling entities for CFQ are the queues, but you should note that,
with Jeff's patch for queue merging, we already did some steps in the
direction of them being generalized.
Now, a group would be an other scheduling entity.

To define a scheduling entity you need to define a few fundamental operations:
* classify() -> workload
* may_queue() -> int
* get_next_rq() -> request
* should_idle -> bool
* get_slice() -> int
* get_schedule_key() -> int
* residual_slice(int)
* add_request(request)

You will have queues, merged queues and groups, and they will all be
scheduled by the existing CFQ scheduling algorithm.
Then groups can implement their internal scheduling policy in the
get_next_rq(). You could reuse most of the CFQ scheduling inside a
group, as well as have the option to use a different I/O scheduler
within some groups.
If you reuse CFQ, then you could have sub groups inside a group.
You could also have a special case implementation for single queue
group, where you just forward all calls to the internal queue (but
scaling the slice/schedule_key according to weight), so any
discrepancy between a single group queue and a simple queue in the
root group will disappear.

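In rough C, the interface could look something like this (names are mine,
purely a sketch of the idea):

struct io_sched_entity;

struct io_sched_entity_ops {
        int (*classify)(struct io_sched_entity *e);     /* -> workload type */
        int (*may_queue)(struct io_sched_entity *e, struct request *rq);
        struct request *(*get_next_rq)(struct io_sched_entity *e);
        bool (*should_idle)(struct io_sched_entity *e);
        unsigned int (*get_slice)(struct io_sched_entity *e);
        u64 (*get_schedule_key)(struct io_sched_entity *e);
        void (*residual_slice)(struct io_sched_entity *e, unsigned int left);
        void (*add_request)(struct io_sched_entity *e, struct request *rq);
};

/* queues, merged queues and groups would all embed this */
struct io_sched_entity {
        const struct io_sched_entity_ops *ops;
        unsigned int weight;    /* used to scale slices / schedule keys */
};

A group's get_next_rq() would then run its own (possibly CFQ-like)
scheduler over its child entities, which is where hierarchical groups and
per-group I/O schedulers would come from.
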
Thanks
Corrado

> Thanks
> Vivek
>



--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------

2009-11-20 14:20:20

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] Block IO Controller V2 - some results

On Thu, Nov 19, 2009 at 12:35:12AM +0100, Corrado Zoccolo wrote:
> Hi Vivek,
> On Wed, Nov 18, 2009 at 11:56 PM, Vivek Goyal <[email protected]> wrote:
> > Moving all the queues to root group is one way to solve the issue. Though
> > problem still remains if there are 7-8 sequential workload groups operating
> > with low_latency=0. In that case after every dispatch round of sync-noidle
> > workload in root group, next round might be much more than 300ms, hence
> > bumping up the max latencies of sync-noidle workload.
>
> I think that this is the desired behaviour: low_latency=0 means that
> latency is less important than throughput, so I wouldn't worry about
> it.
>
> >
> > I think one of the core problem seems to be that I always put the group at
> > the end of service tree. Instead I should let the group delete from
> > service tree if it does not have sufficient IO, and when it comes back
> > again, try to put it in the beginning of tree according to weight so
> > that not all is lost and it gets to dispatch IO sooner.
>
> It is similar to how the queues are put in service tree in cfq without groups.
> If a queue had some remaining slice, it is prioritized w.r.t. ones
> that consumed their slice completely, by giving it a lower key.
>
> > This way, the groups which have been using long slices (either because
> > they are running sync-idle workload or because they have sufficient IO
> > to keep the disk busy), will be towards later end of service tree and the
> > groups which are new or which have lost their share because they have
> > dispatched a small IO and got deleted, will be put at the front of tree.
> >
> > This way sync-noidle queues in a group will not loose out because of
> > sync-idle IO happening in other groups.
>
> It is ok if you have group idling, but if you disable it (and end of
> tree idle), it will be similar to how CFQ was before my patch set (and
> experiments showed that the approach was inferior to grouping no-idle
> together), without the service differentiation benefit introduced by
> your idling.
> So I still prefer the binary choice: either you want fairness (by
> idling) or performance (by putting all no-idle queues together).

Hi Corrado,

I liked the idea of putting all the sync-noidle queues together in the
root group to achieve better throughput and implemented a small patch.

It works fine for random readers. But when I run multiple direct random
writers in one group vs a random reader in another group, I am getting
strange behavior. The random reader moves to the root group as
sync-noidle workload. The random writers are largely sync queues and
remain in the other group, but many a time they also jump into the root
group and preempt the random reader.

Anyway, with 4 random writers and 1 random reader running for 30 seconds
in the root group I get the following.

rw: 59,963KB/s
rr: 66KB/s

But if these are put in separate groups test1 and test2 then:

rw: 30,587KB/s
rr: 23KB/s

I can understand the drop in rw throughput, as it has been put under a
group of weight 500. But rr runs in the root group with weight 1000 and
should have received much higher BW; instead it ends up losing.

I am staring hard at blktrace output to figure out what's happening.
One thing noticeable so far is that without the cgroup stuff we seem to
interleave dispatches from the random reader and the random writers much
better than with the cgroup stuff.

Thanks
Vivek


---
block/cfq-iosched.c | 37 ++++++++++++++++++++++++++++++++++++-
1 file changed, 36 insertions(+), 1 deletion(-)

Index: linux6/block/cfq-iosched.c
===================================================================
--- linux6.orig/block/cfq-iosched.c 2009-11-19 21:38:51.000000000 -0500
+++ linux6/block/cfq-iosched.c 2009-11-19 21:38:53.000000000 -0500
@@ -142,6 +142,7 @@ struct cfq_queue {
struct cfq_rb_root *service_tree;
struct cfq_queue *new_cfqq;
struct cfq_group *cfqg;
+ struct cfq_group *orig_cfqg;
/* Sectors dispatched in current dispatch round */
unsigned long nr_sectors;
};
@@ -266,6 +267,7 @@ struct cfq_data {
unsigned int cfq_slice_idle;
unsigned int cfq_latency;
unsigned int cfq_group_idle;
+ unsigned int cfq_group_isolation;

struct list_head cic_list;

@@ -1139,9 +1141,35 @@ static void cfq_service_tree_add(struct
struct cfq_rb_root *service_tree;
int left;
int new_cfqq = 1;
+ int group_changed = 0;
+
+ if (!cfqd->cfq_group_isolation
+ && cfqq_type(cfqq) == SYNC_NOIDLE_WORKLOAD
+ && cfqq->cfqg && cfqq->cfqg != &cfqd->root_group) {
+ /* Move this cfq to root group */
+ cfq_log_cfqq(cfqd, cfqq, "moving to root group");
+ if (!RB_EMPTY_NODE(&cfqq->rb_node))
+ cfq_group_service_tree_del(cfqd, cfqq->cfqg);
+ cfqq->orig_cfqg = cfqq->cfqg;
+ cfqq->cfqg = &cfqd->root_group;
+ atomic_inc(&cfqd->root_group.ref);
+ group_changed = 1;
+ } else if (!cfqd->cfq_group_isolation
+ && cfqq_type(cfqq) == SYNC_WORKLOAD && cfqq->orig_cfqg) {
+ /* cfqq is sequential now needs to go to its original group */
+ BUG_ON(cfqq->cfqg != &cfqd->root_group);
+ if (!RB_EMPTY_NODE(&cfqq->rb_node))
+ cfq_group_service_tree_del(cfqd, cfqq->cfqg);
+ cfq_put_cfqg(cfqq->cfqg);
+ cfqq->cfqg = cfqq->orig_cfqg;
+ cfqq->orig_cfqg = NULL;
+ group_changed = 1;
+ cfq_log_cfqq(cfqd, cfqq, "moved to origin group");
+ }

service_tree = service_tree_for(cfqq->cfqg, cfqq_prio(cfqq),
cfqq_type(cfqq), cfqd);
+
if (cfq_class_idle(cfqq)) {
rb_key = CFQ_IDLE_DELAY;
parent = rb_last(&service_tree->rb);
@@ -1209,7 +1237,7 @@ static void cfq_service_tree_add(struct
rb_link_node(&cfqq->rb_node, parent, p);
rb_insert_color(&cfqq->rb_node, &service_tree->rb);
service_tree->count++;
- if (add_front || !new_cfqq)
+ if ((add_front || !new_cfqq) && !group_changed)
return;
cfq_group_service_tree_add(cfqd, cfqq->cfqg);
}
@@ -2379,6 +2407,9 @@ static void cfq_put_queue(struct cfq_que

kmem_cache_free(cfq_pool, cfqq);
cfq_put_cfqg(cfqg);
+
+ if (cfqq->orig_cfqg)
+ cfq_put_cfqg(cfqq->orig_cfqg);
}

/*
@@ -3661,6 +3692,7 @@ static void *cfq_init_queue(struct reque
cfqd->cfq_slice_idle = cfq_slice_idle;
cfqd->cfq_latency = 1;
cfqd->cfq_group_idle = 1;
+ cfqd->cfq_group_isolation = 0;
cfqd->hw_tag = 1;
cfqd->last_end_sync_rq = jiffies;
return cfqd;
@@ -3732,6 +3764,7 @@ SHOW_FUNCTION(cfq_slice_async_show, cfqd
SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
SHOW_FUNCTION(cfq_low_latency_show, cfqd->cfq_latency, 0);
SHOW_FUNCTION(cfq_group_idle_show, cfqd->cfq_group_idle, 0);
+SHOW_FUNCTION(cfq_group_isolation_show, cfqd->cfq_group_isolation, 0);
#undef SHOW_FUNCTION

#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
@@ -3765,6 +3798,7 @@ STORE_FUNCTION(cfq_slice_async_rq_store,
UINT_MAX, 0);
STORE_FUNCTION(cfq_low_latency_store, &cfqd->cfq_latency, 0, 1, 0);
STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, 1, 0);
+STORE_FUNCTION(cfq_group_isolation_store, &cfqd->cfq_group_isolation, 0, 1, 0);
#undef STORE_FUNCTION

#define CFQ_ATTR(name) \
@@ -3782,6 +3816,7 @@ static struct elv_fs_entry cfq_attrs[] =
CFQ_ATTR(slice_idle),
CFQ_ATTR(low_latency),
CFQ_ATTR(group_idle),
+ CFQ_ATTR(group_isolation),
__ATTR_NULL
};

2009-11-20 14:28:24

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: [RFC] Block IO Controller V2 - some results

Hi Vivek,
On Fri, Nov 20, 2009 at 3:18 PM, Vivek Goyal <[email protected]> wrote:
> Hi Corrado,
>
> I liked the idea of putting all the sync-noidle queues together in root
> group to achieve better throughput and implemeted a small patch.
>
> It works fine for random readers. But when I do multiple direct random writers
> in one group vs a random reader in other group, I am getting strange
> behavior. Random reader moves to root group as sync-noidle workload. But
> random writers are largely sync queues in remain in other group. But many
> a times also jump into root group and preempt random reader.

Can you try the attached patches?
They fix the problems you identified with no-idle preemption and deep
seeky queues.
With those, you should not see this jumping any more.
I'll send them to Jens as soon as he comes back from vacation.

Corrado

> Anyway, with 4 random writers and 1 random reader running for 30 seconds
> in root group I get following.
>
> rw: 59,963KB/s
> rr: 66KB/s
>
> But if these are put in seprate groups test1 and test2 then
>
> rw: 30,587KB/s
> rr: 23KB/s
>
> I can understand the drop in rw throughput as it has been put under a
> group of weight 500. But rr will run in root group with weight 1000 and
> should have received much higher BW, instead it ends up loosing.
>
> Staring hard at blktrace output to figure out what's happening. One thing
> noticeable so far is that without cgroup stuff we seem to be interleaving
> dispatch from random reader and random writer much better as compared to
> with cgroup stuff.
>
> Thanks
> Vivek
>


Attachments:
0001-cfq-iosched-fix-no-idle-preemption-logic.patch (1.21 kB)
0002-cfq-iosched-idling-on-deep-seeky-sync-queues.patch (1.48 kB)

2009-11-20 15:06:18

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] Block IO Controller V2 - some results

On Fri, Nov 20, 2009 at 03:28:27PM +0100, Corrado Zoccolo wrote:
> Hi Vivek,
> On Fri, Nov 20, 2009 at 3:18 PM, Vivek Goyal <[email protected]> wrote:
> > Hi Corrado,
> >
> > I liked the idea of putting all the sync-noidle queues together in root
> > group to achieve better throughput and implemeted a small patch.
> >
> > It works fine for random readers. But when I do multiple direct random writers
> > in one group vs a random reader in other group, I am getting strange
> > behavior. Random reader moves to root group as sync-noidle workload. But
> > random writers are largely sync queues in remain in other group. But many
> > a times also jump into root group and preempt random reader.
>
> can you try the attached patches?
> They fix the problems you identified about no-idle preemption, and
> deep seeky queues.
> With those, you should not see this jumping any more.
> I'll send them to Jens as soon has he comes back from vacation.
>
> Corrado
>
> > Anyway, with 4 random writers and 1 random reader running for 30 seconds
> > in root group I get following.
> >
> > rw: 59,963KB/s
> > rr: 66KB/s
> >
> > But if these are put in seprate groups test1 and test2 then
> >
> > rw: 30,587KB/s
> > rr: 23KB/s
> >

I quickly tried your new patches, which keep idling enabled on deep
seeky sync queues so that such a queue does not jump around too much and
consume share in both the sync and sync-noidle workloads.

Here are new results.

Without cgroup.

rw: 58,571KB/s
rr: 83KB/s

With cgroup:

rw: 32,525KB/s
rr: 25KB/s

So without cgroups it looks like the random reader gained a bit, and
that's a good thing.

With cgroups, the problem still persists. I am wondering why both are
losing. It looks like I am idling somewhere, otherwise at least one of
them should have gained.

Thanks
Vivek

2009-11-20 18:32:28

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: [RFC] Block IO Controller V2 - some results

On Fri, Nov 20, 2009 at 4:04 PM, Vivek Goyal <[email protected]> wrote:
> On Fri, Nov 20, 2009 at 03:28:27PM +0100, Corrado Zoccolo wrote:
>> Hi Vivek,
>> On Fri, Nov 20, 2009 at 3:18 PM, Vivek Goyal <[email protected]> wrote:
>> > Hi Corrado,
>> >
>> > I liked the idea of putting all the sync-noidle queues together in root
>> > group to achieve better throughput and implemeted a small patch.
>> >
>> > It works fine for random readers. But when I do multiple direct random writers
>> > in one group vs a random reader in other group, I am getting strange
>> > behavior. Random reader moves to root group as sync-noidle workload. But
>> > random writers are largely sync queues in remain in other group. But many
>> > a times also jump into root group and preempt random reader.
>>
>> can you try the attached patches?
>> They fix the problems you identified about no-idle preemption, and
>> deep seeky queues.
>> With those, you should not see this jumping any more.
>> I'll send them to Jens as soon has he comes back from vacation.
>>
>> Corrado
>>
>> > Anyway, with 4 random writers and 1 random reader running for 30 seconds
>> > in root group I get following.
>> >
>> > rw: 59,963KB/s
>> > rr: 66KB/s
>> >
>> > But if these are put in seprate groups test1 and test2 then
>> >
>> > rw: 30,587KB/s
>> > rr: 23KB/s
>> >
>
> I quickly tried your new patches to try to keep idling enabled on deep
> seeky sync queues so that it does not jump around too much and consume
> share both in sync workload and sync-noidle workload.
>
> Here are new results.
>
> Without cgroup.
>
> rw: 58,571KB/s
> rr: 83KB/s
>
> With cgroup:
>
> rw: 32,525KB/s
> rr: 25KB/s
>
> So without cgroup it looks like that random reader gained a bit and that's
> a good thing.

Great.
> With cgroup, problem still persists. I am wondering why both are loosing.
> Looks like I am idling somewhere otherwise at least one person should have
> gained.
With just 2 groups (one is the root), you can't be idling 50% of the
time. How is the disk utilization during the test?

Note that you can lose even if you're not idling enough.
How does this workload fare with noop or deadline?

Thanks
Corrado

>
> Thanks
> Vivek
>



--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
Tales of Power - C. Castaneda

2009-11-20 18:44:41

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] Block IO Controller V2 - some results

On Fri, Nov 20, 2009 at 07:32:28PM +0100, Corrado Zoccolo wrote:
> On Fri, Nov 20, 2009 at 4:04 PM, Vivek Goyal <[email protected]> wrote:
> > On Fri, Nov 20, 2009 at 03:28:27PM +0100, Corrado Zoccolo wrote:
> >> Hi Vivek,
> >> On Fri, Nov 20, 2009 at 3:18 PM, Vivek Goyal <[email protected]> wrote:
> >> > Hi Corrado,
> >> >
> >> > I liked the idea of putting all the sync-noidle queues together in root
> >> > group to achieve better throughput and implemeted a small patch.
> >> >
> >> > It works fine for random readers. But when I do multiple direct random writers
> >> > in one group vs a random reader in other group, I am getting strange
> >> > behavior. Random reader moves to root group as sync-noidle workload. But
> >> > random writers are largely sync queues in remain in other group. But many
> >> > a times also jump into root group and preempt random reader.
> >>
> >> can you try the attached patches?
> >> They fix the problems you identified about no-idle preemption, and
> >> deep seeky queues.
> >> With those, you should not see this jumping any more.
> >> I'll send them to Jens as soon has he comes back from vacation.
> >>
> >> Corrado
> >>
> >> > Anyway, with 4 random writers and 1 random reader running for 30 seconds
> >> > in root group I get following.
> >> >
> >> > rw: 59,963KB/s
> >> > rr: 66KB/s
> >> >
> >> > But if these are put in seprate groups test1 and test2 then
> >> >
> >> > rw: 30,587KB/s
> >> > rr: 23KB/s
> >> >
> >
> > I quickly tried your new patches to try to keep idling enabled on deep
> > seeky sync queues so that it does not jump around too much and consume
> > share both in sync workload and sync-noidle workload.
> >
> > Here are new results.
> >
> > Without cgroup.
> >
> > rw: 58,571KB/s
> > rr: 83KB/s
> >
> > With cgroup:
> >
> > rw: 32,525KB/s
> > rr: 25KB/s
> >
> > So without cgroup it looks like that random reader gained a bit and that's
> > a good thing.
>
> Great.

Should we also take the "cfqq->dispatched" requests into account when
determining whether we should enable idling on deep-queue random seeky
readers?
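
For illustration only, something like the following toy model (the field
names and the threshold of 4 are made up; this is not the actual CFQ
code):

#include <stdbool.h>

struct toy_queue {
	int queued;		/* requests still sitting in the scheduler */
	int dispatched;		/* requests already sent to the device */
	bool seeky;		/* classified as a random/seeky queue */
};

static bool deep_enough_to_idle(const struct toy_queue *q)
{
	/* Counting q->dispatched keeps the queue looking "deep" while the
	 * device drains the burst, instead of flipping the classification
	 * as soon as the scheduler queue momentarily empties. */
	return q->seeky && (q->queued + q->dispatched) >= 4;
}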

Adding that helps me a bit in the cgroup setup, but I still see sync
seeky random writers switching between sync and sync-noidle very
frequently.

In fact I think that's part of the reason why it is slow. Out of the 4
random seeky writers, 1 will switch groups so often that it does not
drive enough queue depth. The rest of them seem to be running in the
other group.

In fact, sometimes this same writer will jump to the second group, get a
time slice, and then jump back to the root group and again get a time
slice in the sync-noidle category. This preempts the reader in the root
group while, at the same time, not driving higher queue depths, as the
rest of the writers are in other groups.

So frequent switching of the random seeky writer queue's type from
sync --> sync-noidle --> sync seems to be one of the hurting factors.

But when I started taking cfqq->dispatched into account as well, the
share of the random writers increased when running without cgroups. So
it is kind of puzzling.

But in general, we need to stabilize the type of a queue; it should not
vary so fast, given that the nature of the workload/queue has not
changed.

> > With cgroup, problem still persists. I am wondering why both are loosing.
> > Looks like I am idling somewhere otherwise at least one person should have
> > gained.
> With just 2 groups (one is the root), you can't be idling 50% of the
> time. How is the disk utilization during the test?
>

How do I measure the utilization of the array?

> Note that you can lose even if you're not idling enough.
> How does this workload fare with noop or deadline.
> Thanks
> Corrado
>
> >
> > Thanks
> > Vivek
> >
>
>
>
> --
> __________________________________________________________________________
>
> dott. Corrado Zoccolo mailto:[email protected]
> PhD - Department of Computer Science - University of Pisa, Italy
> --------------------------------------------------------------------------
> The self-confidence of a warrior is not the self-confidence of the average
> man. The average man seeks certainty in the eyes of the onlooker and calls
> that self-confidence. The warrior seeks impeccability in his own eyes and
> calls that humbleness.
> Tales of Power - C. Castaneda

2009-11-20 19:50:49

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: [RFC] Block IO Controller V2 - some results

On Fri, Nov 20, 2009 at 7:42 PM, Vivek Goyal <[email protected]> wrote:
>
> Should we also take into account the "cfqq->dispatched" request in
> determining whether we should enable idling on deep queue random seeky
> readers?

Probably, but I think the most important thing is to do some averaging
or hysteresis, so that the completion of a single request doesn't cause
a workload switch.
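
One possible shape for such averaging, as a rough sketch (the
fixed-point weighting and the threshold are purely illustrative
assumptions):

/* Toy fixed-point moving average of the observed queue depth, so that a
 * single completion does not flip the workload classification. */
struct depth_avg {
	unsigned int avg8;	/* average depth, scaled by 8 */
};

static void depth_sample(struct depth_avg *d, unsigned int depth)
{
	/* new_avg = 7/8 * old_avg + 1/8 * sample (kept scaled by 8) */
	d->avg8 = (d->avg8 * 7 + depth * 8) / 8;
}

static int depth_is_deep(const struct depth_avg *d)
{
	return d->avg8 >= 4 * 8;	/* average depth of 4 or more */
}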

>
> Adding that helps me a bit in cgroup setup but I still see sync seeky random
> writers switching between sync and sync-noidle so frequently.
>
> In fact I think that's part of the reason why it is slow. Out of 4, 1
> random seeky reader will switch group so often and then will not drive
> the enough queue depth. Rest of them seem to be running in other group.
>
> In fact sometimes this same writer will jump to second group, get the time
> slice and then jump back to root group and then again get the time slice
> in sync-noidle category. This will preempt the reader in root group at the
> same time will not drive higher queue depths as rest of the writers are
> in other groups.
>
> So frequent switching of type of random seeky reqder queue from
> sync --> sync-noidle--->sync seems to be one of the hurting factors.
>
> But when I started taking cfqq->dispatched also in account, share of
> random writers increased when running without cgroups. So it is kind of
> puzzling.
>
> But in general, we need to stablize the type of a queue and it should
> not vary so fast, given the fact nature of the workload/queue has not
> changed.
Yes. I'll work on a better patch.

>> > With cgroup, problem still persists. I am wondering why both are loosing.
>> > Looks like I am idling somewhere otherwise at least one person should have
>> > gained.
>> With just 2 groups (one is the root), you can't be idling 50% of the
>> time. How is the disk utilization during the test?
>>
>
> How do I measure the utilization of the array?
On software RAID (with md), 'iostat -x 2' shows the utilization of
the md device as well as the individual components, so you can see if
the array is fully utilized. On hardware RAID you can't see the
individual disks, so it is less useful.

Thanks
Corrado

>> Note that you can lose even if you're not idling enough.
>> How does this workload fare with noop or deadline.
>> Thanks
>> Corrado
>>
>> >
>> > Thanks
>> > Vivek
>> >
>>
>>
>>
>> --
>> __________________________________________________________________________
>>
>> dott. Corrado Zoccolo                          mailto:[email protected]
>> PhD - Department of Computer Science - University of Pisa, Italy
>> --------------------------------------------------------------------------
>> The self-confidence of a warrior is not the self-confidence of the average
>> man. The average man seeks certainty in the eyes of the onlooker and calls
>> that self-confidence. The warrior seeks impeccability in his own eyes and
>> calls that humbleness.
>>                                Tales of Power - C. Castaneda
>



--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
Tales of Power - C. Castaneda

2009-11-21 18:04:27

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: [RFC] Block IO Controller V2 - some results

Hi Vivek,
On Fri, Nov 20, 2009 at 8:50 PM, Corrado Zoccolo <[email protected]> wrote:
> On Fri, Nov 20, 2009 at 7:42 PM, Vivek Goyal <[email protected]> wrote:
>>
>> Should we also take into account the "cfqq->dispatched" request in
>> determining whether we should enable idling on deep queue random seeky
>> readers?
>
> Probably, but I think the most important thing is to do some averaging
> or hysteresis, so the completion of a single request doesn't cause the
> switch of a workload.

Can you test the new version of the idling patch?
I record the fact that the queue had a large depth in a flag, which is
reset only when the idle times out (so at the end of the burst).
Idling is enabled if that flag is set (and the think time is acceptable).
This should fix the switching behaviour you observed.

I decided not to count cfqq->dispatched to determine the depth.
This way, when all the queues in the system are random, idling is
enabled only if the request queue builds up faster than it can be
consumed.
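
A minimal sketch of the flag-based approach described above (names and
the threshold are made up for illustration; this is not the attached
patch):

#include <stdbool.h>

/* The "deep" flag is sticky: it is set when the queue builds up a large
 * depth and cleared only when idling times out at the end of the burst. */
struct sticky_queue {
	int queued;		/* requests waiting in the scheduler */
	bool deep;		/* queue has shown a large depth */
	bool thinktime_ok;	/* think time small enough to idle on */
};

static void on_request_added(struct sticky_queue *q)
{
	if (++q->queued >= 4)
		q->deep = true;		/* remember the burst */
}

static void on_request_dispatched(struct sticky_queue *q)
{
	q->queued--;			/* depth drops, but 'deep' stays set */
}

static bool should_idle(const struct sticky_queue *q)
{
	return q->deep && q->thinktime_ok;
}

static void on_idle_timeout(struct sticky_queue *q)
{
	q->deep = false;		/* burst is over, forget it */
}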

Thanks,
Corrado


Attachments:
0002-cfq-iosched-idling-on-deep-seeky-sync-queues.patch (2.38 kB)

2009-11-23 15:21:36

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] Block IO Controller V2 - some results

On Sat, Nov 21, 2009 at 06:57:47PM +0100, Corrado Zoccolo wrote:
> Hi Vivek,
> On Fri, Nov 20, 2009 at 8:50 PM, Corrado Zoccolo <[email protected]> wrote:
> > On Fri, Nov 20, 2009 at 7:42 PM, Vivek Goyal <[email protected]> wrote:
> >>
> >> Should we also take into account the "cfqq->dispatched" request in
> >> determining whether we should enable idling on deep queue random seeky
> >> readers?
> >
> > Probably, but I think the most important thing is to do some averaging
> > or hysteresis, so the completion of a single request doesn't cause the
> > switch of a workload.
>
> can you test the new version of the idling patch?
> I register the fact that the queue had a large depth in a flag, that
> is reset only when the idle times out (so at the end of the burst).
> Idling is enabled if that flag is set (and think time is acceptable).
> This should fix the switching behaviour you observed.
>
> I decided to not count cfqq->dispatched to determine the depth.
> In this way, when all queues in the system are random the idling is
> enabled only if the requests queue builds up faster than it can be
> consumed.

Hi Corrado,

This patch seems to be working much better at marking the random writer
queue as sync and not interfering with the sync-noidle workload.

So the frequent migration of the random writer queue across groups has
stopped.

But there seems to be a different issue now: after some time, the random
writer queue stops generating enough traffic and gets deleted after one
request, and the root group then runs the random reader for some time.
So it basically changes the ratio in which random writers and random
readers get their disk share.

I guess part of the dependency comes from kjournald, which is in the
root group. But there is something else too, because I don't see this
happening when there are no cgroups. I will do more debugging on this.

Thanks
Vivek

2009-11-23 16:22:06

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: [RFC] Block IO Controller V2 - some results

Hi Vivek,
On Mon, Nov 23, 2009 at 4:19 PM, Vivek Goyal <[email protected]> wrote:
> Hi Corrado,
>
> This patch seems to be working much better in marking the random writer
> queue as sync and not interefere with sync-noidle workload.
>
> So frequent migration of random writer queue across group has stopped.
Great. I think this is good regardless of cgroups, since it fixes the
fairness issue you reported regarding deep AIO sync writers against
random readers, so I'll send it to Jens.

> But there seems to be a different issue now after sometime, random writer
> queue stops generating enough traffic and gets deleted after one request
> and root group now runs random reader for sometime. So it basically
> changes the ratio in which random writers and random readers get disk
> share.
>
> I guess part of the dependency comes from kjournald which is in root
> group. But there is something else too because I don't see this happening
> when there are no cgroups. I will do more debugging on this.
Ok.

Thanks,
Corrado

> Thanks
> Vivek
>