2009-04-13 04:06:04

by Ryo Tsuruta

[permalink] [raw]
Subject: dm-ioband: Test results.

Hi Alasdair and all,

I did more tests on dm-ioband and I've posted the test items and
results on my website. The results are very good.
http://people.valinux.co.jp/~ryov/dm-ioband/test/test-items.xls

I hope someone will test dm-ioband and report back to the dm-devel
mailing list.

Alasdair, could you please merge dm-ioband into upstream? Or could
you please tell me why dm-ioband can't be merged?

Thanks,
Ryo Tsuruta

To know the details of dm-ioband:
http://people.valinux.co.jp/~ryov/dm-ioband/

RPM packages for RHEL5 and CentOS5 are available:
http://people.valinux.co.jp/~ryov/dm-ioband/binary.html


2009-04-13 14:46:38

by Vivek Goyal

[permalink] [raw]
Subject: Re: dm-ioband: Test results.

On Mon, Apr 13, 2009 at 01:05:52PM +0900, Ryo Tsuruta wrote:
> Hi Alasdair and all,
>
> I did more tests on dm-ioband and I've posted the test items and
> results on my website. The results are very good.
> http://people.valinux.co.jp/~ryov/dm-ioband/test/test-items.xls
>

Hi Ryo,

I quickly looked at the xls sheet. Most of the test cases seem to be
direct IO. Have you done testing with buffered writes/async writes and
been able to provide service differentiation between cgroups?

For example, two "dd" threads running in two cgroups doing writes.

Thanks
Vivek

2009-04-14 02:50:16

by Vivek Goyal

[permalink] [raw]
Subject: Re: dm-ioband: Test results.

On Mon, Apr 13, 2009 at 10:46:26AM -0400, Vivek Goyal wrote:
> On Mon, Apr 13, 2009 at 01:05:52PM +0900, Ryo Tsuruta wrote:
> > Hi Alasdair and all,
> >
> > I did more tests on dm-ioband and I've posted the test items and
> > results on my website. The results are very good.
> > http://people.valinux.co.jp/~ryov/dm-ioband/test/test-items.xls
> >
>
> Hi Ryo,
>
> I quickly looked at the xls sheet. Most of the test cases seem to be
> direct IO. Have you done testing with buffered writes/async writes and
> been able to provide service differentiation between cgroups?
>
> For example, two "dd" threads running in two cgroups doing writes.
>

Just realized that last time I replied to the wrong mail id. Ryo, this
time you should get the mail. This is a reply to my original reply.

Also, I wanted to test-run your patches. How do I do that? I see that
your patches are based on the dm quilt tree. I downloaded the dm patches
and tried to apply them on top of 2.6.30-rc1, but that failed, so I
can't apply your patches right now.

So what's the simplest way of testing your changes on the latest kernels?

Thanks
Vivek

Applying patch dm-add-request-based-facility.patch
patching file drivers/md/dm-table.c
Hunk #1 succeeded at 1021 (offset 29 lines).
patching file drivers/md/dm.c
Hunk #1 succeeded at 90 (offset 6 lines).
Hunk #2 succeeded at 175 (offset -1 lines).
Hunk #3 succeeded at 427 (offset 6 lines).
Hunk #4 succeeded at 648 (offset 10 lines).
Hunk #5 succeeded at 1246 with fuzz 2 (offset 20 lines).
Hunk #6 succeeded at 1273 (offset 3 lines).
Hunk #7 succeeded at 1615 with fuzz 1 (offset 20 lines).
Hunk #8 succeeded at 2016 with fuzz 1 (offset 14 lines).
Hunk #9 succeeded at 2141 (offset 33 lines).
Hunk #10 FAILED at 2321.
Hunk #11 FAILED at 2336.
Hunk #12 succeeded at 2388 with fuzz 2 (offset 29 lines).
2 out of 12 hunks FAILED -- rejects in file drivers/md/dm.c
patching file drivers/md/dm.h
patching file include/linux/device-mapper.h
Hunk #1 succeeded at 230 (offset -1 lines).
Patch dm-add-request-based-facility.patch does not apply (enforce with -f)

> Thanks
> Vivek

2009-04-14 05:27:42

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: dm-ioband: Test results.

Hi Vivek,

> Also I wanted to test run your patches. How do I do that? I see that your
> patches are based on dm quilt tree. I downloaded the dm-patches and tried
> to apply on top of 2.6.30-rc1 but it failed. So can't apply your patches
> now.

> Applying patch dm-add-request-based-facility.patch

"dm-add-request-based-facility.patch" is not my patch. To apply my
patch, you need to edit the series file and comment out the patch
files after "mm" section and my patch like this:

############################################################
# Marker corresponding to end of -mm tree.
############################################################

#dm-table-fix-alignment-to-hw_sector.patch
mm

# An attempt to get UML to work with dm.
uml-fixes.patch

############################################################
# May need more work or testing, but close to being ready.
############################################################

# Under review
#dm-exception-store-generalize-table-args.patch
#dm-snapshot-new-ctr-table-format.patch
#dm-snapshot-cleanup.patch

#dm-raid1-add-clustering.patch
dm-add-ioband.patch

Then run the "quilt push dm-add-ioband.patch" command. The patch
applies this way, but it causes a compile error because it is against
the previous dm tree.
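
For reference, the whole sequence looks roughly like this (just a
sketch; the directory names are only examples):

cd linux-2.6.30-rc1
ln -s /path/to/dm-patches patches   # point quilt at the dm patch series
vi patches/series                   # comment out patches as shown above
quilt push dm-add-ioband.patch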

> So what's the simplest way of testing your changes on latest kernels?

So I've uploaded a patch against 2.6.30-rc1. Please use it if you
like. http://people.valinux.co.jp/~ryov/dm-ioband/
The bio-cgroup patch against 2.6.30-rc1 will be uploaded within a few
days. I'm looking forward to your feedback on dm-ioband.

Thanks,
Ryo Tsuruta

2009-04-14 09:30:34

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [dm-devel] Re: dm-ioband: Test results.

Hi Vivek,

> I quickly looked at the xls sheet. Most of the test cases seem to be
> direct IO. Have you done testing with buffered writes/async writes and
> been able to provide service differentiation between cgroups?
>
> For example, two "dd" threads running in two cgroups doing writes.

Thanks for taking a look at the sheet. Two "dd" threads alone can't
generate enough I/O load to make dm-ioband start bandwidth control, so
I did a buffered write test with "fio" instead. The following is the
script that I actually used for the test.

#!/bin/bash
sync
echo 1 > /proc/sys/vm/drop_caches
arg="--size=64m --rw=write --numjobs=50 --group_reporting"
echo $$ > /cgroup/1/tasks
fio $arg --name=ioband1 --directory=/mnt1 --output=ioband1.log &
echo $$ > /cgroup/2/tasks
fio $arg --name=ioband2 --directory=/mnt2 --output=ioband2.log &
echo $$ > /cgroup/tasks
wait

I created two dm devices so that I could easily monitor the throughput
of each cgroup with iostat, and gave a weight of 200 to cgroup1 and 100
to cgroup2, which means cgroup1 can use twice the bandwidth of cgroup2.
The following is part of the iostat output; dm-0 and dm-1 correspond to
ioband1 and ioband2. You can see the bandwidth is split according to
the weights (e.g. 28392 vs. 14376 Blk_wrtn in the first interval,
roughly 2:1).

avg-cpu: %user %nice %system %iowait %steal %idle
0.99 0.00 6.44 92.57 0.00 0.00

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
dm-0 3549.00 0.00 28392.00 0 28392
dm-1 1797.00 0.00 14376.00 0 14376

avg-cpu: %user %nice %system %iowait %steal %idle
1.01 0.00 4.02 94.97 0.00 0.00

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
dm-0 3919.00 0.00 31352.00 0 31352
dm-1 1925.00 0.00 15400.00 0 15400

avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 5.97 94.03 0.00 0.00

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
dm-0 3534.00 0.00 28272.00 0 28272
dm-1 1773.00 0.00 14184.00 0 14184

avg-cpu: %user %nice %system %iowait %steal %idle
0.50 0.00 6.00 93.50 0.00 0.00

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
dm-0 4053.00 0.00 32424.00 0 32424
dm-1 2039.00 8.00 16304.00 8 16304
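
If anyone wants to reproduce this, the numbers above came from plain
iostat; something like the following is enough (dm-0 and dm-1 are just
the minor numbers the two ioband devices got on my machine):

dmsetup ls    # map the ioband device names to dm-N minor numbers
iostat 1      # then watch the corresponding dm-N lines while fio runs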

Thanks,
Ryo Tsuruta

2009-04-15 04:40:33

by Vivek Goyal

[permalink] [raw]
Subject: Re: dm-ioband: Test results.

On Mon, Apr 13, 2009 at 01:05:52PM +0900, Ryo Tsuruta wrote:
> Hi Alasdair and all,
>
> I did more tests on dm-ioband and I've posted the test items and
> results on my website. The results are very good.
> http://people.valinux.co.jp/~ryov/dm-ioband/test/test-items.xls
>
> I hope someone will test dm-ioband and report back to the dm-devel
> mailing list.
>

Hi Ryo,

I have been able to take your patch for the 2.6.30-rc1 kernel and have
started doing some testing for reads. Hopefully you will provide
bio-cgroup patches soon so that I can do some write testing as well.

In the beginning of this mail I am listing some basic test results, and
in the later part I am raising some of my concerns with this patchset.

My test setup:
--------------
I have one SATA drive with two partitions, /dev/sdd1 and /dev/sdd2, and
I have created ext3 file systems on these partitions. I created one
ioband device "ioband1" with weight 40 on /dev/sdd1 and another ioband
device "ioband2" with weight 10 on /dev/sdd2.

1) I think an RT task within a group does not get its fair share (all
the BW available as long as the RT task is backlogged).

I launched one RT task reading a 2G file in the ioband1 group and in
parallel launched more readers in the ioband1 group. The ioband2 group
did not have any IO going. Following are the results with and without
ioband.

A) 1 RT prio 0 + 1 BE prio 4 reader

dm-ioband
2147483648 bytes (2.1 GB) copied, 39.4701 s, 54.4 MB/s
2147483648 bytes (2.1 GB) copied, 71.8034 s, 29.9 MB/s

without-dm-ioband
2147483648 bytes (2.1 GB) copied, 35.3677 s, 60.7 MB/s
2147483648 bytes (2.1 GB) copied, 70.8214 s, 30.3 MB/s

B) 1 RT prio 0 + 2 BE prio 4 reader

dm-ioband
2147483648 bytes (2.1 GB) copied, 43.8305 s, 49.0 MB/s
2147483648 bytes (2.1 GB) copied, 135.395 s, 15.9 MB/s
2147483648 bytes (2.1 GB) copied, 136.545 s, 15.7 MB/s

without-dm-ioband
2147483648 bytes (2.1 GB) copied, 35.3177 s, 60.8 MB/s
2147483648 bytes (2.1 GB) copied, 124.793 s, 17.2 MB/s
2147483648 bytes (2.1 GB) copied, 126.267 s, 17.0 MB/s

C) 1 RT prio 0 + 3 BE prio 4 reader

dm-ioband
2147483648 bytes (2.1 GB) copied, 48.8159 s, 44.0 MB/s
2147483648 bytes (2.1 GB) copied, 185.848 s, 11.6 MB/s
2147483648 bytes (2.1 GB) copied, 188.171 s, 11.4 MB/s
2147483648 bytes (2.1 GB) copied, 189.537 s, 11.3 MB/s

without-dm-ioband
2147483648 bytes (2.1 GB) copied, 35.2928 s, 60.8 MB/s
2147483648 bytes (2.1 GB) copied, 169.929 s, 12.6 MB/s
2147483648 bytes (2.1 GB) copied, 172.486 s, 12.5 MB/s
2147483648 bytes (2.1 GB) copied, 172.817 s, 12.4 MB/s

D) 1 RT prio 0 + 4 BE prio 4 readers
dm-ioband
2147483648 bytes (2.1 GB) copied, 51.4279 s, 41.8 MB/s
2147483648 bytes (2.1 GB) copied, 260.29 s, 8.3 MB/s
2147483648 bytes (2.1 GB) copied, 261.824 s, 8.2 MB/s
2147483648 bytes (2.1 GB) copied, 261.981 s, 8.2 MB/s
2147483648 bytes (2.1 GB) copied, 262.372 s, 8.2 MB/s

without-dm-ioband
2147483648 bytes (2.1 GB) copied, 35.4213 s, 60.6 MB/s
2147483648 bytes (2.1 GB) copied, 215.784 s, 10.0 MB/s
2147483648 bytes (2.1 GB) copied, 218.706 s, 9.8 MB/s
2147483648 bytes (2.1 GB) copied, 220.12 s, 9.8 MB/s
2147483648 bytes (2.1 GB) copied, 220.57 s, 9.7 MB/s

Notice that with dm-ioband, as the number of readers increases, the
finish time of the RT task also increases. But without dm-ioband the
finish time of the RT task remains more or less constant even as the
number of readers grows.

For some reason overall throughput also seems to be lower with
dm-ioband. Because ioband2 is not doing any IO, I expected that tasks
in ioband1 would get the full disk BW and that throughput would not
drop.

I have not debugged it, but I guess it might be coming from the fact
that there are no separate queues for RT tasks. Bios from all the tasks
can be buffered on a single queue in a cgroup, and that might be
causing RT requests to hide behind the BE tasks' requests?

General thoughts about dm-ioband
================================
- Implementing control at the second level has the advantage that one
does not have to muck with IO scheduler code. But then it also has the
disadvantage that there is no communication with the IO scheduler.

- dm-ioband is buffering bios at a higher layer and then doing FIFO
release of these bios. This FIFO release can lead to priority inversion
problems in certain cases where RT requests end up way behind BE
requests, or to reader starvation where reader bios get hidden behind
writer bios, etc. These are hard-to-notice issues in user space. I
guess the RT results above do highlight the RT task problems. I am
still working on other test cases to see if I can show the problem.

- dm-ioband does this extra grouping logic using dm messages. Why is
the cgroup infrastructure not sufficient to meet your needs, like
grouping tasks based on uid etc.? I think we should get rid of all the
extra grouping logic and just use cgroups for the grouping information.

- Why do we need to specify bio cgroup ids to dm-ioband externally with
the help of dm messages? A user should be able to just create the
cgroups, put the tasks in the right cgroup, and then everything should
just work.

- Why do we have to put another dm-ioband device on top of every
partition or existing device-mapper device to control it? Is it
possible to do this control in the make_request function of the request
queue so that we don't end up creating additional dm devices? I had
posted a crude RFC patch as a proof of concept but did not continue the
development because of the fundamental issue of FIFO release of
buffered bios.

http://lkml.org/lkml/2008/11/6/227

Can you please have a look and provide feedback about why we cannot go
in the direction of the above patches and why we need to create an
additional dm device?

I think in its current form dm-ioband is hard to configure, and we
should look for ways to simplify the configuration.

- I personally think that even group IO scheduling should be done at
the IO scheduler level, and we should not break IO scheduling down into
two parts where group scheduling is done by a higher-level IO scheduler
sitting in the dm layer and IO scheduling among tasks within a group is
done by the actual IO scheduler.

But this also means more work, as one has to muck around with the core
IO schedulers to make them cgroup aware and also make sure existing
functionality is not broken. I posted the patches here.

http://lkml.org/lkml/2009/3/11/486

Can you please let us know why the IO scheduler based approach does not
work for you?

Jens, it would be nice to hear your opinion about two-level vs.
one-level control. Do you think the common-layer approach is the way to
go, where one can control things more tightly, or is FIFO release of
bios from a second-level controller fine, so that we can live with this
additional serialization in the layer just above the IO scheduler?

- There is no notion of RT cgroups. So even if one wants to run an RT
task in the root cgroup to make sure it gets full access to the disk,
it can't do that. It has to share the BW with other competing groups.

- dm-ioband controls the amount of IO done per second. Will a seeky
process not run away with more disk time?

Additionally, at the group level we will provide fairness in terms of
the amount of IO (number of blocks transferred, etc.), while within a
group CFQ will try to provide fairness in terms of disk access time
slices. I don't even know whether it is a matter of concern or not. I
was thinking that one uniform policy over the hierarchical scheduling
tree would probably have been better. Just thinking out loud.....

Thanks
Vivek

> Alasdair, could you please merge dm-ioband into upstream? Or could
> you please tell me why dm-ioband can't be merged?
>
> Thanks,
> Ryo Tsuruta
>
> To know the details of dm-ioband:
> http://people.valinux.co.jp/~ryov/dm-ioband/
>
> RPM packages for RHEL5 and CentOS5 are available:
> http://people.valinux.co.jp/~ryov/dm-ioband/binary.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2009-04-15 13:38:45

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: dm-ioband: Test results.

Hi Vivek,

> In the beginning of the mail, i am listing some basic test results and
> in later part of mail I am raising some of my concerns with this patchset.

I did a similar test and got different results from yours. I'll reply
later to the later part of your mail.

> My test setup:
> --------------
> I have got one SATA driver with two partitions /dev/sdd1 and /dev/sdd2 on
> that. I have created ext3 file systems on these partitions. Created one
> ioband device "ioband1" with weight 40 on /dev/sdd1 and another ioband
> device "ioband2" with weight 10 on /dev/sdd2.
>
> 1) I think an RT task with-in a group does not get its fair share (all
> the BW available as long as RT task is backlogged).
>
> I launched one RT read task of 2G file in ioband1 group and in parallel
> launched more readers in ioband1 group. ioband2 group did not have any
> io going. Following are results with and without ioband.
>
> A) 1 RT prio 0 + 1 BE prio 4 reader
>
> dm-ioband
> 2147483648 bytes (2.1 GB) copied, 39.4701 s, 54.4 MB/s
> 2147483648 bytes (2.1 GB) copied, 71.8034 s, 29.9 MB/s
>
> without-dm-ioband
> 2147483648 bytes (2.1 GB) copied, 35.3677 s, 60.7 MB/s
> 2147483648 bytes (2.1 GB) copied, 70.8214 s, 30.3 MB/s
>
> B) 1 RT prio 0 + 2 BE prio 4 reader
>
> dm-ioband
> 2147483648 bytes (2.1 GB) copied, 43.8305 s, 49.0 MB/s
> 2147483648 bytes (2.1 GB) copied, 135.395 s, 15.9 MB/s
> 2147483648 bytes (2.1 GB) copied, 136.545 s, 15.7 MB/s
>
> without-dm-ioband
> 2147483648 bytes (2.1 GB) copied, 35.3177 s, 60.8 MB/s
> 2147483648 bytes (2.1 GB) copied, 124.793 s, 17.2 MB/s
> 2147483648 bytes (2.1 GB) copied, 126.267 s, 17.0 MB/s
>
> C) 1 RT prio 0 + 3 BE prio 4 reader
>
> dm-ioband
> 2147483648 bytes (2.1 GB) copied, 48.8159 s, 44.0 MB/s
> 2147483648 bytes (2.1 GB) copied, 185.848 s, 11.6 MB/s
> 2147483648 bytes (2.1 GB) copied, 188.171 s, 11.4 MB/s
> 2147483648 bytes (2.1 GB) copied, 189.537 s, 11.3 MB/s
>
> without-dm-ioband
> 2147483648 bytes (2.1 GB) copied, 35.2928 s, 60.8 MB/s
> 2147483648 bytes (2.1 GB) copied, 169.929 s, 12.6 MB/s
> 2147483648 bytes (2.1 GB) copied, 172.486 s, 12.5 MB/s
> 2147483648 bytes (2.1 GB) copied, 172.817 s, 12.4 MB/s
>
> C) 1 RT prio 0 + 3 BE prio 4 reader
> dm-ioband
> 2147483648 bytes (2.1 GB) copied, 51.4279 s, 41.8 MB/s
> 2147483648 bytes (2.1 GB) copied, 260.29 s, 8.3 MB/s
> 2147483648 bytes (2.1 GB) copied, 261.824 s, 8.2 MB/s
> 2147483648 bytes (2.1 GB) copied, 261.981 s, 8.2 MB/s
> 2147483648 bytes (2.1 GB) copied, 262.372 s, 8.2 MB/s
>
> without-dm-ioband
> 2147483648 bytes (2.1 GB) copied, 35.4213 s, 60.6 MB/s
> 2147483648 bytes (2.1 GB) copied, 215.784 s, 10.0 MB/s
> 2147483648 bytes (2.1 GB) copied, 218.706 s, 9.8 MB/s
> 2147483648 bytes (2.1 GB) copied, 220.12 s, 9.8 MB/s
> 2147483648 bytes (2.1 GB) copied, 220.57 s, 9.7 MB/s
>
> Notice that with dm-ioband as number of readers are increasing, finish
> time of RT tasks is also increasing. But without dm-ioband finish time
> of RT tasks remains more or less constat even with increase in number
> of readers.
>
> For some reason overall throughput also seems to be less with dm-ioband.
> Because ioband2 is not doing any IO, i expected that tasks in ioband1
> will get full disk BW and throughput will not drop.
>
> I have not debugged it but I guess it might be coming from the fact that
> there are no separate queues for RT tasks. bios from all the tasks can be
> buffered on a single queue in a cgroup and that might be causing RT
> request to hide behind BE tasks' request?

I followed your setup and ran the following script on my machine.

#!/bin/sh
echo 1 > /proc/sys/vm/drop_caches
ionice -c1 -n0 dd if=/mnt1/2g.1 of=/dev/null &
ionice -c2 -n4 dd if=/mnt1/2g.2 of=/dev/null &
ionice -c2 -n4 dd if=/mnt1/2g.3 of=/dev/null &
ionice -c2 -n4 dd if=/mnt1/2g.4 of=/dev/null &
wait

I got different results, and there is no significant difference in
each dd's throughput between w/ and w/o dm-ioband.

A) 1 RT prio 0 + 1 BE prio 4 reader
w/ dm-ioband
2147483648 bytes (2.1 GB) copied, 64.0764 seconds, 33.5 MB/s
2147483648 bytes (2.1 GB) copied, 99.0757 seconds, 21.7 MB/s
w/o dm-ioband
2147483648 bytes (2.1 GB) copied, 62.3575 seconds, 34.4 MB/s
2147483648 bytes (2.1 GB) copied, 98.5804 seconds, 21.8 MB/s

B) 1 RT prio 0 + 2 BE prio 4 reader
w/ dm-ioband
2147483648 bytes (2.1 GB) copied, 64.5634 seconds, 33.3 MB/s
2147483648 bytes (2.1 GB) copied, 220.372 seconds, 9.7 MB/s
2147483648 bytes (2.1 GB) copied, 222.174 seconds, 9.7 MB/s
w/o dm-ioband
2147483648 bytes (2.1 GB) copied, 62.3036 seconds, 34.5 MB/s
2147483648 bytes (2.1 GB) copied, 226.315 seconds, 9.5 MB/s
2147483648 bytes (2.1 GB) copied, 229.064 seconds, 9.4 MB/s

C) 1 RT prio 0 + 3 BE prio 4 reader
w/ dm-ioband
2147483648 bytes (2.1 GB) copied, 66.7155 seconds, 32.2 MB/s
2147483648 bytes (2.1 GB) copied, 306.524 seconds, 7.0 MB/s
2147483648 bytes (2.1 GB) copied, 306.627 seconds, 7.0 MB/s
2147483648 bytes (2.1 GB) copied, 306.971 seconds, 7.0 MB/s
w/o dm-ioband
2147483648 bytes (2.1 GB) copied, 66.1144 seconds, 32.5 MB/s
2147483648 bytes (2.1 GB) copied, 305.5 seconds, 7.0 MB/s
2147483648 bytes (2.1 GB) copied, 306.469 seconds, 7.0 MB/s
2147483648 bytes (2.1 GB) copied, 307.63 seconds, 7.0 MB/s

The results show that the effect of the single queue is too small to
matter and that dm-ioband doesn't break CFQ's classification and
priorities. What do you think about my results?

Thanks,
Ryo Tsuruta

2009-04-15 14:13:34

by Vivek Goyal

[permalink] [raw]
Subject: Re: dm-ioband: Test results.

On Wed, Apr 15, 2009 at 10:38:32PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
>
> > In the beginning of the mail, i am listing some basic test results and
> > in later part of mail I am raising some of my concerns with this patchset.
>
> I did a similar test and got different results to yours. I'll reply
> later about the later part of your mail.
>
> > My test setup:
> > --------------
> > I have got one SATA driver with two partitions /dev/sdd1 and /dev/sdd2 on
> > that. I have created ext3 file systems on these partitions. Created one
> > ioband device "ioband1" with weight 40 on /dev/sdd1 and another ioband
> > device "ioband2" with weight 10 on /dev/sdd2.
> >
> > 1) I think an RT task with-in a group does not get its fair share (all
> > the BW available as long as RT task is backlogged).
> >
> > I launched one RT read task of 2G file in ioband1 group and in parallel
> > launched more readers in ioband1 group. ioband2 group did not have any
> > io going. Following are results with and without ioband.
> >
> > A) 1 RT prio 0 + 1 BE prio 4 reader
> >
> > dm-ioband
> > 2147483648 bytes (2.1 GB) copied, 39.4701 s, 54.4 MB/s
> > 2147483648 bytes (2.1 GB) copied, 71.8034 s, 29.9 MB/s
> >
> > without-dm-ioband
> > 2147483648 bytes (2.1 GB) copied, 35.3677 s, 60.7 MB/s
> > 2147483648 bytes (2.1 GB) copied, 70.8214 s, 30.3 MB/s
> >
> > B) 1 RT prio 0 + 2 BE prio 4 reader
> >
> > dm-ioband
> > 2147483648 bytes (2.1 GB) copied, 43.8305 s, 49.0 MB/s
> > 2147483648 bytes (2.1 GB) copied, 135.395 s, 15.9 MB/s
> > 2147483648 bytes (2.1 GB) copied, 136.545 s, 15.7 MB/s
> >
> > without-dm-ioband
> > 2147483648 bytes (2.1 GB) copied, 35.3177 s, 60.8 MB/s
> > 2147483648 bytes (2.1 GB) copied, 124.793 s, 17.2 MB/s
> > 2147483648 bytes (2.1 GB) copied, 126.267 s, 17.0 MB/s
> >
> > C) 1 RT prio 0 + 3 BE prio 4 reader
> >
> > dm-ioband
> > 2147483648 bytes (2.1 GB) copied, 48.8159 s, 44.0 MB/s
> > 2147483648 bytes (2.1 GB) copied, 185.848 s, 11.6 MB/s
> > 2147483648 bytes (2.1 GB) copied, 188.171 s, 11.4 MB/s
> > 2147483648 bytes (2.1 GB) copied, 189.537 s, 11.3 MB/s
> >
> > without-dm-ioband
> > 2147483648 bytes (2.1 GB) copied, 35.2928 s, 60.8 MB/s
> > 2147483648 bytes (2.1 GB) copied, 169.929 s, 12.6 MB/s
> > 2147483648 bytes (2.1 GB) copied, 172.486 s, 12.5 MB/s
> > 2147483648 bytes (2.1 GB) copied, 172.817 s, 12.4 MB/s
> >
> > C) 1 RT prio 0 + 3 BE prio 4 reader
> > dm-ioband
> > 2147483648 bytes (2.1 GB) copied, 51.4279 s, 41.8 MB/s
> > 2147483648 bytes (2.1 GB) copied, 260.29 s, 8.3 MB/s
> > 2147483648 bytes (2.1 GB) copied, 261.824 s, 8.2 MB/s
> > 2147483648 bytes (2.1 GB) copied, 261.981 s, 8.2 MB/s
> > 2147483648 bytes (2.1 GB) copied, 262.372 s, 8.2 MB/s
> >
> > without-dm-ioband
> > 2147483648 bytes (2.1 GB) copied, 35.4213 s, 60.6 MB/s
> > 2147483648 bytes (2.1 GB) copied, 215.784 s, 10.0 MB/s
> > 2147483648 bytes (2.1 GB) copied, 218.706 s, 9.8 MB/s
> > 2147483648 bytes (2.1 GB) copied, 220.12 s, 9.8 MB/s
> > 2147483648 bytes (2.1 GB) copied, 220.57 s, 9.7 MB/s
> >
> > Notice that with dm-ioband as number of readers are increasing, finish
> > time of RT tasks is also increasing. But without dm-ioband finish time
> > of RT tasks remains more or less constat even with increase in number
> > of readers.
> >
> > For some reason overall throughput also seems to be less with dm-ioband.
> > Because ioband2 is not doing any IO, i expected that tasks in ioband1
> > will get full disk BW and throughput will not drop.
> >
> > I have not debugged it but I guess it might be coming from the fact that
> > there are no separate queues for RT tasks. bios from all the tasks can be
> > buffered on a single queue in a cgroup and that might be causing RT
> > request to hide behind BE tasks' request?
>
> I followed your setup and ran the following script on my machine.
>
> #!/bin/sh
> echo 1 > /proc/sys/vm/drop_caches
> ionice -c1 -n0 dd if=/mnt1/2g.1 of=/dev/null &
> ionice -c2 -n4 dd if=/mnt1/2g.2 of=/dev/null &
> ionice -c2 -n4 dd if=/mnt1/2g.3 of=/dev/null &
> ionice -c2 -n4 dd if=/mnt1/2g.4 of=/dev/null &
> wait
>
> I got different results and there is no siginificant difference each
> dd's throughput between w/ and w/o dm-ioband.
>
> A) 1 RT prio 0 + 1 BE prio 4 reader
> w/ dm-ioband
> 2147483648 bytes (2.1 GB) copied, 64.0764 seconds, 33.5 MB/s
> 2147483648 bytes (2.1 GB) copied, 99.0757 seconds, 21.7 MB/s
> w/o dm-ioband
> 2147483648 bytes (2.1 GB) copied, 62.3575 seconds, 34.4 MB/s
> 2147483648 bytes (2.1 GB) copied, 98.5804 seconds, 21.8 MB/s
>
> B) 1 RT prio 0 + 2 BE prio 4 reader
> w/ dm-ioband
> 2147483648 bytes (2.1 GB) copied, 64.5634 seconds, 33.3 MB/s
> 2147483648 bytes (2.1 GB) copied, 220.372 seconds, 9.7 MB/s
> 2147483648 bytes (2.1 GB) copied, 222.174 seconds, 9.7 MB/s
> w/o dm-ioband
> 2147483648 bytes (2.1 GB) copied, 62.3036 seconds, 34.5 MB/s
> 2147483648 bytes (2.1 GB) copied, 226.315 seconds, 9.5 MB/s
> 2147483648 bytes (2.1 GB) copied, 229.064 seconds, 9.4 MB/s
>
> C) 1 RT prio 0 + 3 BE prio 4 reader
> w/ dm-ioband
> 2147483648 bytes (2.1 GB) copied, 66.7155 seconds, 32.2 MB/s
> 2147483648 bytes (2.1 GB) copied, 306.524 seconds, 7.0 MB/s
> 2147483648 bytes (2.1 GB) copied, 306.627 seconds, 7.0 MB/s
> 2147483648 bytes (2.1 GB) copied, 306.971 seconds, 7.0 MB/s
> w/o dm-ioband
> 2147483648 bytes (2.1 GB) copied, 66.1144 seconds, 32.5 MB/s
> 2147483648 bytes (2.1 GB) copied, 305.5 seconds, 7.0 MB/s
> 2147483648 bytes (2.1 GB) copied, 306.469 seconds, 7.0 MB/s
> 2147483648 bytes (2.1 GB) copied, 307.63 seconds, 7.0 MB/s
>
> The results show that the effect of the single queue is too small and
> dm-ioband doesn't break CFQ's classification and priority.
> What do you think about my results?

Hmm, strange. We are getting different results. Maybe it is some
configuration/setup issue.

What does your ioband setup look like? Have you created at least one
more competing ioband device? I think it is only in that case that you
get the ad-hoc logic of waiting for the group which has not finished
its tokens yet, and you end up buffering the bios in a FIFO.

If you have not already done so, can you just create two partitions on
your disk, say sda1 and sda2, create two ioband devices with weights
of, say, 95 and 5 (95% of the disk for the first partition and 5% for
the other), and then run the above test on the first ioband device.
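
Just as a sketch (this uses the same table format as my own setup;
adjust the device names to your disk):

echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1 0 0 none" \
     "weight 0 :95" | dmsetup create ioband1
echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1 0 0 none" \
     "weight 0 :5" | dmsetup create ioband2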

So how does this proportional weight thing work? If I have two ioband
devices with weights 80 and 20 and there is no IO happening on the
second device, should the first device get all the BW?

I will re-run my tests.

Secondly, from a technical point of view, how do you explain the fact
that FIFO release of bios does not break the notion of CFQ priority?
The moment you buffer the bios in a single queue and start doing FIFO
dispatch, you lose the notion of one bio being more important than
another.

It's a separate matter that in practice this might not be easily
visible.

Thanks
Vivek

2009-04-15 16:17:40

by Vivek Goyal

[permalink] [raw]
Subject: Re: dm-ioband: Test results.

On Wed, Apr 15, 2009 at 10:38:32PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
>
> > In the beginning of the mail, i am listing some basic test results and
> > in later part of mail I am raising some of my concerns with this patchset.
>
> I did a similar test and got different results to yours. I'll reply
> later about the later part of your mail.
>
> > My test setup:
> > --------------
> > I have got one SATA driver with two partitions /dev/sdd1 and /dev/sdd2 on
> > that. I have created ext3 file systems on these partitions. Created one
> > ioband device "ioband1" with weight 40 on /dev/sdd1 and another ioband
> > device "ioband2" with weight 10 on /dev/sdd2.
> >
> > 1) I think an RT task with-in a group does not get its fair share (all
> > the BW available as long as RT task is backlogged).
> >
> > I launched one RT read task of 2G file in ioband1 group and in parallel
> > launched more readers in ioband1 group. ioband2 group did not have any
> > io going. Following are results with and without ioband.
> >
> > A) 1 RT prio 0 + 1 BE prio 4 reader
> >
> > dm-ioband
> > 2147483648 bytes (2.1 GB) copied, 39.4701 s, 54.4 MB/s
> > 2147483648 bytes (2.1 GB) copied, 71.8034 s, 29.9 MB/s
> >
> > without-dm-ioband
> > 2147483648 bytes (2.1 GB) copied, 35.3677 s, 60.7 MB/s
> > 2147483648 bytes (2.1 GB) copied, 70.8214 s, 30.3 MB/s
> >
> > B) 1 RT prio 0 + 2 BE prio 4 reader
> >
> > dm-ioband
> > 2147483648 bytes (2.1 GB) copied, 43.8305 s, 49.0 MB/s
> > 2147483648 bytes (2.1 GB) copied, 135.395 s, 15.9 MB/s
> > 2147483648 bytes (2.1 GB) copied, 136.545 s, 15.7 MB/s
> >
> > without-dm-ioband
> > 2147483648 bytes (2.1 GB) copied, 35.3177 s, 60.8 MB/s
> > 2147483648 bytes (2.1 GB) copied, 124.793 s, 17.2 MB/s
> > 2147483648 bytes (2.1 GB) copied, 126.267 s, 17.0 MB/s
> >
> > C) 1 RT prio 0 + 3 BE prio 4 reader
> >
> > dm-ioband
> > 2147483648 bytes (2.1 GB) copied, 48.8159 s, 44.0 MB/s
> > 2147483648 bytes (2.1 GB) copied, 185.848 s, 11.6 MB/s
> > 2147483648 bytes (2.1 GB) copied, 188.171 s, 11.4 MB/s
> > 2147483648 bytes (2.1 GB) copied, 189.537 s, 11.3 MB/s
> >
> > without-dm-ioband
> > 2147483648 bytes (2.1 GB) copied, 35.2928 s, 60.8 MB/s
> > 2147483648 bytes (2.1 GB) copied, 169.929 s, 12.6 MB/s
> > 2147483648 bytes (2.1 GB) copied, 172.486 s, 12.5 MB/s
> > 2147483648 bytes (2.1 GB) copied, 172.817 s, 12.4 MB/s
> >
> > C) 1 RT prio 0 + 3 BE prio 4 reader
> > dm-ioband
> > 2147483648 bytes (2.1 GB) copied, 51.4279 s, 41.8 MB/s
> > 2147483648 bytes (2.1 GB) copied, 260.29 s, 8.3 MB/s
> > 2147483648 bytes (2.1 GB) copied, 261.824 s, 8.2 MB/s
> > 2147483648 bytes (2.1 GB) copied, 261.981 s, 8.2 MB/s
> > 2147483648 bytes (2.1 GB) copied, 262.372 s, 8.2 MB/s
> >
> > without-dm-ioband
> > 2147483648 bytes (2.1 GB) copied, 35.4213 s, 60.6 MB/s
> > 2147483648 bytes (2.1 GB) copied, 215.784 s, 10.0 MB/s
> > 2147483648 bytes (2.1 GB) copied, 218.706 s, 9.8 MB/s
> > 2147483648 bytes (2.1 GB) copied, 220.12 s, 9.8 MB/s
> > 2147483648 bytes (2.1 GB) copied, 220.57 s, 9.7 MB/s
> >
> > Notice that with dm-ioband as number of readers are increasing, finish
> > time of RT tasks is also increasing. But without dm-ioband finish time
> > of RT tasks remains more or less constat even with increase in number
> > of readers.
> >
> > For some reason overall throughput also seems to be less with dm-ioband.
> > Because ioband2 is not doing any IO, i expected that tasks in ioband1
> > will get full disk BW and throughput will not drop.
> >
> > I have not debugged it but I guess it might be coming from the fact that
> > there are no separate queues for RT tasks. bios from all the tasks can be
> > buffered on a single queue in a cgroup and that might be causing RT
> > request to hide behind BE tasks' request?
>
> I followed your setup and ran the following script on my machine.
>
> #!/bin/sh
> echo 1 > /proc/sys/vm/drop_caches
> ionice -c1 -n0 dd if=/mnt1/2g.1 of=/dev/null &
> ionice -c2 -n4 dd if=/mnt1/2g.2 of=/dev/null &
> ionice -c2 -n4 dd if=/mnt1/2g.3 of=/dev/null &
> ionice -c2 -n4 dd if=/mnt1/2g.4 of=/dev/null &
> wait
>
> I got different results and there is no siginificant difference each
> dd's throughput between w/ and w/o dm-ioband.
>
> A) 1 RT prio 0 + 1 BE prio 4 reader
> w/ dm-ioband
> 2147483648 bytes (2.1 GB) copied, 64.0764 seconds, 33.5 MB/s
> 2147483648 bytes (2.1 GB) copied, 99.0757 seconds, 21.7 MB/s
> w/o dm-ioband
> 2147483648 bytes (2.1 GB) copied, 62.3575 seconds, 34.4 MB/s
> 2147483648 bytes (2.1 GB) copied, 98.5804 seconds, 21.8 MB/s
>
> B) 1 RT prio 0 + 2 BE prio 4 reader
> w/ dm-ioband
> 2147483648 bytes (2.1 GB) copied, 64.5634 seconds, 33.3 MB/s
> 2147483648 bytes (2.1 GB) copied, 220.372 seconds, 9.7 MB/s
> 2147483648 bytes (2.1 GB) copied, 222.174 seconds, 9.7 MB/s
> w/o dm-ioband
> 2147483648 bytes (2.1 GB) copied, 62.3036 seconds, 34.5 MB/s
> 2147483648 bytes (2.1 GB) copied, 226.315 seconds, 9.5 MB/s
> 2147483648 bytes (2.1 GB) copied, 229.064 seconds, 9.4 MB/s
>
> C) 1 RT prio 0 + 3 BE prio 4 reader
> w/ dm-ioband
> 2147483648 bytes (2.1 GB) copied, 66.7155 seconds, 32.2 MB/s
> 2147483648 bytes (2.1 GB) copied, 306.524 seconds, 7.0 MB/s
> 2147483648 bytes (2.1 GB) copied, 306.627 seconds, 7.0 MB/s
> 2147483648 bytes (2.1 GB) copied, 306.971 seconds, 7.0 MB/s
> w/o dm-ioband
> 2147483648 bytes (2.1 GB) copied, 66.1144 seconds, 32.5 MB/s
> 2147483648 bytes (2.1 GB) copied, 305.5 seconds, 7.0 MB/s
> 2147483648 bytes (2.1 GB) copied, 306.469 seconds, 7.0 MB/s
> 2147483648 bytes (2.1 GB) copied, 307.63 seconds, 7.0 MB/s
>
> The results show that the effect of the single queue is too small and
> dm-ioband doesn't break CFQ's classification and priority.

OK, one more round of testing, a little different this time. Instead
of progressively increasing the number of competing readers, I have run
with a constant number of readers multiple times.

Again, I created two partitions, /dev/sdd1 and /dev/sdd2, created two
ioband devices, and assigned weights of 40 and 10 respectively. All my
IO is being done only on the first ioband device and there is no IO
happening on the second partition.

I use the following to create the ioband devices:

echo "0 $(blockdev --getsize /dev/sdd1) ioband /dev/sdd1 1 0 0 none" \
     "weight 0 :40" | dmsetup create ioband1
echo "0 $(blockdev --getsize /dev/sdd2) ioband /dev/sdd2 1 0 0 none" \
     "weight 0 :10" | dmsetup create ioband2

mount /dev/mapper/ioband1 /mnt/sdd1
mount /dev/mapper/ioband2 /mnt/sdd2

Following is dmsetup output.

# dmsetup status
ioband2: 0 38025855 ioband 1 -1 150 13 186 1 0 8
ioband1: 0 40098177 ioband 1 -1 335056 819 80342386 1 0 8

Following is my actual script to run multiple reads.

sync
echo 3 > /proc/sys/vm/drop_caches
ionice -c 1 -n 0 dd if=/mnt/sdd1/testzerofile1 of=/dev/null &
ionice -c 2 -n 4 dd if=/mnt/sdd1/testzerofile2 of=/dev/null &
ionice -c 2 -n 4 dd if=/mnt/sdd1/testzerofile3 of=/dev/null &
ionice -c 2 -n 4 dd if=/mnt/sdd1/testzerofile4 of=/dev/null &
ionice -c 2 -n 4 dd if=/mnt/sdd1/testzerofile5 of=/dev/null &

Following is the output of 4 runs of reads with and without dm-ioband:

1 RT process at prio 0 and 4 BE processes at prio 4.

First run
----------
without dm-ioband

2147483648 bytes (2.1 GB) copied, 35.3428 s, 60.8 MB/s
2147483648 bytes (2.1 GB) copied, 215.446 s, 10.0 MB/s
2147483648 bytes (2.1 GB) copied, 218.269 s, 9.8 MB/s
2147483648 bytes (2.1 GB) copied, 219.433 s, 9.8 MB/s
2147483648 bytes (2.1 GB) copied, 220.033 s, 9.8 MB/s

with dm-ioband

2147483648 bytes (2.1 GB) copied, 48.4239 s, 44.3 MB/s
2147483648 bytes (2.1 GB) copied, 257.943 s, 8.3 MB/s
2147483648 bytes (2.1 GB) copied, 258.385 s, 8.3 MB/s
2147483648 bytes (2.1 GB) copied, 258.778 s, 8.3 MB/s
2147483648 bytes (2.1 GB) copied, 259.81 s, 8.3 MB/s

Second run
----------
without dm-ioband
2147483648 bytes (2.1 GB) copied, 35.4003 s, 60.7 MB/s
2147483648 bytes (2.1 GB) copied, 217.204 s, 9.9 MB/s
2147483648 bytes (2.1 GB) copied, 218.336 s, 9.8 MB/s
2147483648 bytes (2.1 GB) copied, 219.75 s, 9.8 MB/s
2147483648 bytes (2.1 GB) copied, 219.816 s, 9.8 MB/s

with dm-ioband
2147483648 bytes (2.1 GB) copied, 49.7719 s, 43.1 MB/s
2147483648 bytes (2.1 GB) copied, 254.118 s, 8.5 MB/s
2147483648 bytes (2.1 GB) copied, 255.7 s, 8.4 MB/s
2147483648 bytes (2.1 GB) copied, 256.512 s, 8.4 MB/s
2147483648 bytes (2.1 GB) copied, 256.581 s, 8.4 MB/s

third run
---------
without dm-ioband
2147483648 bytes (2.1 GB) copied, 35.426 s, 60.6 MB/s
2147483648 bytes (2.1 GB) copied, 218.4 s, 9.8 MB/s
2147483648 bytes (2.1 GB) copied, 221.074 s, 9.7 MB/s
2147483648 bytes (2.1 GB) copied, 222.421 s, 9.7 MB/s
2147483648 bytes (2.1 GB) copied, 222.489 s, 9.7 MB/s

with dm-ioband
2147483648 bytes (2.1 GB) copied, 51.5454 s, 41.7 MB/s
2147483648 bytes (2.1 GB) copied, 261.481 s, 8.2 MB/s
2147483648 bytes (2.1 GB) copied, 261.567 s, 8.2 MB/s
2147483648 bytes (2.1 GB) copied, 263.048 s, 8.2 MB/s
2147483648 bytes (2.1 GB) copied, 264.204 s, 8.1 MB/s

fourth run
----------
without dm-ioband
2147483648 bytes (2.1 GB) copied, 35.4676 s, 60.5 MB/s
2147483648 bytes (2.1 GB) copied, 217.752 s, 9.9 MB/s
2147483648 bytes (2.1 GB) copied, 219.693 s, 9.8 MB/s
2147483648 bytes (2.1 GB) copied, 221.921 s, 9.7 MB/s
2147483648 bytes (2.1 GB) copied, 222.18 s, 9.7 MB/s

with dm-ioband
2147483648 bytes (2.1 GB) copied, 46.1355 s, 46.5 MB/s
2147483648 bytes (2.1 GB) copied, 253.84 s, 8.5 MB/s
2147483648 bytes (2.1 GB) copied, 256.282 s, 8.4 MB/s
2147483648 bytes (2.1 GB) copied, 256.356 s, 8.4 MB/s
2147483648 bytes (2.1 GB) copied, 256.679 s, 8.4 MB/s


Do let me know if you think there is something wrong with my
configuration.

First of all, I still notice that there is a significant performance
drop here.

Secondly, notice how much the finish time of the RT task varies with
dm-ioband, while it is very stable with plain CFQ:

with dm-ioband     48.4239   49.7719   51.5454   46.1355
without dm-ioband  35.3428   35.4003   35.426    35.4676

Thanks
Vivek

2009-04-15 17:04:48

by Vivek Goyal

[permalink] [raw]
Subject: Re: [dm-devel] Re: dm-ioband: Test results.

On Tue, Apr 14, 2009 at 06:30:22PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
>
> > I quickly looked at the xls sheet. Most of the test cases seem to be
> > direct IO. Have you done testing with buffered writes/async writes and
> > been able to provide service differentiation between cgroups?
> >
> > For example, two "dd" threads running in two cgroups doing writes.
>
> Thanks for taking a look at the sheet. I did a buffered write test
> with "fio." Only two "dd" threads can't generate enough I/O load to
> make dm-ioband start bandwidth control. The following is a script that
> I actually used for the test.
>
> #!/bin/bash
> sync
> echo 1 > /proc/sys/vm/drop_caches
> arg="--size=64m --rw=write --numjobs=50 --group_reporting"
> echo $$ > /cgroup/1/tasks
> fio $arg --name=ioband1 --directory=/mnt1 --output=ioband1.log &
> echo $$ > /cgroup/2/tasks
> fio $arg --name=ioband2 --directory=/mnt2 --output=ioband2.log &
> echo $$ > /cgroup/tasks
> wait
>

Ryo,

Can you also send bio-cgroup patches which apply to 2.6.30-rc1, so
that I can do testing for async writes?

Why have you split the regular patch and the bio-cgroup patch? Do you
want to address only reads and sync writes?

In the above test case, do these "fio" jobs finish at different times?
In my testing I see that two dd's generate a lot of traffic at the IO
scheduler level, but the traffic seems to be bursty. When the
higher-weight process has done some IO, it seems to disappear for 0.2
to 1 seconds, and in that time the other writer gets to do a lot of IO
and erases any service difference provided so far.

I am not sure where this higher-priority writer is blocked, and that
needs to be looked into. But I am sure that you will also face the same
issue.

Thanks
Vivek

> I created two dm-devices to easily monitor the throughput of each
> cgroup by iostat, and gave weights of 200 for cgroup1 and 100 for
> cgroup2 that means cgroup1 can use twice bandwidth of cgroup2. The
> following is a part of the output of iostat. dm-0 and dm-1 corresponds
> to ioband1 and ioband2. You can see the bandwidth is according to the
> weights.
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.99 0.00 6.44 92.57 0.00 0.00
>
> Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
> dm-0 3549.00 0.00 28392.00 0 28392
> dm-1 1797.00 0.00 14376.00 0 14376
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 1.01 0.00 4.02 94.97 0.00 0.00
>
> Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
> dm-0 3919.00 0.00 31352.00 0 31352
> dm-1 1925.00 0.00 15400.00 0 15400
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.00 0.00 5.97 94.03 0.00 0.00
>
> Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
> dm-0 3534.00 0.00 28272.00 0 28272
> dm-1 1773.00 0.00 14184.00 0 14184
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.50 0.00 6.00 93.50 0.00 0.00
>
> Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
> dm-0 4053.00 0.00 32424.00 0 32424
> dm-1 2039.00 8.00 16304.00 8 16304
>

> Thanks,
> Ryo Tsuruta

2009-04-16 02:48:01

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [dm-devel] Re: dm-ioband: Test results.

Hi Vivek,

> General thoughts about dm-ioband
> ================================
> - Implementing control at second level has the advantage tha one does not
> have to muck with IO scheduler code. But then it also has the
> disadvantage that there is no communication with IO scheduler.
>
> - dm-ioband is buffering bio at higher layer and then doing FIFO release
> of these bios. This FIFO release can lead to priority inversion problems
> in certain cases where RT requests are way behind BE requests or
> reader starvation where reader bios are getting hidden behind writer
> bios etc. These are hard to notice issues in user space. I guess above
> RT results do highlight the RT task problems. I am still working on
> other test cases and see if i can show the probelm.
>
> - dm-ioband does this extra grouping logic using dm messages. Why
> cgroup infrastructure is not sufficient to meet your needs like
> grouping tasks based on uid etc? I think we should get rid of all
> the extra grouping logic and just use cgroup for grouping information.

I want to be able to use dm-ioband even without cgroups, and to give
dm-ioband the flexibility to support various types of objects.

> - Why do we need to specify bio cgroup ids to the dm-ioband externally with
> the help of dm messages? A user should be able to just create the
> cgroups, put the tasks in right cgroup and then everything should
> just work fine.

This is to make it easy to handle cgroups in dm-ioband, and it keeps
the code simple.

> - Why do we have to put another dm-ioband device on top of every partition
> or existing device mapper device to control it? Is it possible to do
> this control on make_request function of the reuqest queue so that
> we don't end up creating additional dm devices? I had posted the crude
> RFC patch as proof of concept but did not continue the development
> because of fundamental issue of FIFO release of buffered bios.
>
> http://lkml.org/lkml/2008/11/6/227
>
> Can you please have a look and provide feedback about why we can not
> go in the direction of the above patches and why do we need to create
> additional dm device.
>
> I think in current form, dm-ioband is hard to configure and we should
> look for ways simplify configuration.

This can be solved by using a tool or a small script.
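
For example, a tiny wrapper along these lines (just a sketch, using the
same dm table format as elsewhere in this thread) would hide the
dmsetup details:

#!/bin/sh
# usage: ioband-create <name> <device> <weight>   (sketch only)
name=$1; dev=$2; weight=$3
echo "0 $(blockdev --getsize "$dev") ioband $dev 1 0 0 none" \
     "weight 0 :$weight" | dmsetup create "$name"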

> - I personally think that even group IO scheduling should be done at
> IO scheduler level and we should not break down IO scheduling in two
> parts where group scheduling is done by higher level IO scheduler
> sitting in dm layer and io scheduling among tasks with-in groups is
> done by actual IO scheduler.
>
> But this also means more work as one has to muck around with core IO
> scheduler's to make them cgroup aware and also make sure existing
> functionality is not broken. I posted the patches here.
>
> http://lkml.org/lkml/2009/3/11/486
>
> Can you please let us know that why does IO scheduler based approach
> does not work for you?

I think your approach is not bad, but I've made it my purpose to
control the disk bandwidth of virtual machines with device-mapper and
dm-ioband.
I think device-mapper is a well-designed system for the following
reasons:
- It can easily add new functions to a block device.
- There is no need to muck around with the existing kernel code.
- dm devices are detachable; they have no effect on the system if a
user doesn't use them.
So I think dm-ioband and your IO controller can coexist. What do you
think about it?

> Jens, it would be nice to hear your opinion about two level vs one
> level conrol. Do you think that common layer approach is the way
> to go where one can control things more tightly or FIFO release of bios
> from second level controller is fine and we can live with this additional serialization in the layer above just above IO scheduler?
>
> - There is no notion of RT cgroups. So even if one wants to run an RT
> task in root cgroup to make sure to get full access of disk, it can't
> do that. It has to share the BW with other competing groups.
>
> - dm-ioband controls amount of IO done per second. Will a seeky process
> not run away more disk time?

Could you elaborate on this? dm-ioband doesn't control it per second.

> Additionally, at group level we will provide fairness in terms of amount
> of IO (number of blocks transferred etc) and with-in group cfq will try
> to provide fairness in terms of disk access time slices. I don't even
> know whether it is a matter of concern or not. I was thinking that
> probably one uniform policy on the hierarchical scheduling tree would
> have probably been better. Just thinking loud.....
>
> Thanks
> Vivek

Thanks,
Ryo Tsuruta

2009-04-16 12:56:41

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [dm-devel] Re: dm-ioband: Test results.

Hi Vivek,

> How does your ioband setup looks like. Have you created at least one more
> competing ioband device? Because I think only in that case you have got
> this ad-hoc logic of waiting for the group which has not finished the
> tokens yet and you will end up buffering the bio in a FIFO.

I created two ioband devices and ran the dd commands only on the first
device.

> Do let me know if you think there is something wrong with my
> configuration.

From a quick look at your configuration, there seems to be no problem.

> Can you also send bio-cgroup patches which apply to 2.6.30-rc1 so that
> I can do testing for async writes.

I've just posted the patches to the related mailing lists. Please try
them.

> Why have you split the regular patch and bio-cgroup patch? Do you want
> to address only reads and sync writes?

As a first step, my goal is to merge dm-ioband into device-mapper, and
bio-cgroup is not necessary in all situations, such as bandwidth
control on a per-partition basis.

I'll also try to do more tests and report back to you.

Thank you for your help,
Ryo Tsuruta

2009-04-16 13:33:19

by Vivek Goyal

[permalink] [raw]
Subject: Re: [dm-devel] Re: dm-ioband: Test results.

On Thu, Apr 16, 2009 at 09:56:30PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
>
> > How does your ioband setup looks like. Have you created at least one more
> > competing ioband device? Because I think only in that case you have got
> > this ad-hoc logic of waiting for the group which has not finished the
> > tokens yet and you will end up buffering the bio in a FIFO.
>
> I created two ioband devices and ran the dd commands only on the first
> device.

OK. So please do let me know how to debug it further. At the moment I
think it is a problem with dm-ioband, most likely coming either from
the buffering of bios in a single queue or from delays introduced by
the waiting mechanism that lets the slowest process catch up.

I have looked at that code 2-3 times but never understood it fully. I
will give it another try. I think the code quality there needs to be
improved.

>
> > Do let me know if you think there is something wrong with my
> > configuration.
>
> >From a quick look at your configuration, there seems to be no problem.
>
> > Can you also send bio-cgroup patches which apply to 2.6.30-rc1 so that
> > I can do testing for async writes.
>
> I've just posted the patches to related mailing lists. Please try it.

Which mailing list have you posted them to? I am assuming you sent
them to the dm-devel list. Please send all the postings to both the
lkml and dm-devel lists for some time while we are discussing the
fundamental issues, which are also of concern from a generic IO
controller point of view and are not limited to dm only.

>
> > Why have you split the regular patch and bio-cgroup patch? Do you want
> > to address only reads and sync writes?
>
> For the first step, my goal is to merge dm-ioband into device-mapper,
> and bio-cgroup is not necessary for all situations such as bandwidth
> control on a per partition basis.

IIUC, bio-cgroup is necessary to account for async writes; otherwise
writes will be accounted to the submitting task. Andrew Morton clearly
mentioned in one of the mails that writes have been our biggest problem
and that he wants to see a clear solution for handling async writes. So
please don't split up the two patches; keep them together.

If you are not accounting for async writes, what kind of usage do you
have in mind? Any practical workload will have both reads and writes
going. So if a customer creates even two groups, say A and B, and both
also have async writes going, what kind of guarantees will you offer to
these guys?

IOW, with just sync bio handling as your first step, what kind of usage
scenario are you covering?

Secondly, per-partition control sounds a bit excessive. Why is per-disk
control not sufficient? That's where the real contention for resources
is. And even if you really want the equivalent of per-partition
control, one should be able to achieve it with two levels of cgroup
hierarchy.

          root
         /    \
    sda1g    sda2g

So if there are two partitions on a disk, just create two groups, put
the processes doing IO to partition sda1 in group sda1g and the
processes doing IO to partition sda2 in sda2g, and assign weights to
the groups according to how you want the IO to be distributed between
the two partitions.
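
A rough sketch, assuming the cgroup filesystem is mounted at /cgroup as
in the scripts earlier in this thread (the name of the weight knob
depends on the controller, so I am leaving that part out):

# $SDA1_PID and $SDA2_PID are placeholders for the workload process ids
mkdir /cgroup/sda1g /cgroup/sda2g
echo $SDA1_PID > /cgroup/sda1g/tasks
echo $SDA2_PID > /cgroup/sda2g/tasks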

But in the end, I think doing per-partition control is excessive. If
you really want that kind of isolation, then carve another
device/logical unit out of the storage array, create a separate device,
and do IO on that.

Thanks
Vivek

>
> I'll also try to do more test and report you back.
>

> Thank you for your help,
> Ryo Tsuruta

2009-04-16 14:13:53

by Vivek Goyal

[permalink] [raw]
Subject: Re: [dm-devel] Re: dm-ioband: Test results.

On Thu, Apr 16, 2009 at 11:47:50AM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
>
> > General thoughts about dm-ioband
> > ================================
> > - Implementing control at second level has the advantage tha one does not
> > have to muck with IO scheduler code. But then it also has the
> > disadvantage that there is no communication with IO scheduler.
> >
> > - dm-ioband is buffering bio at higher layer and then doing FIFO release
> > of these bios. This FIFO release can lead to priority inversion problems
> > in certain cases where RT requests are way behind BE requests or
> > reader starvation where reader bios are getting hidden behind writer
> > bios etc. These are hard to notice issues in user space. I guess above
> > RT results do highlight the RT task problems. I am still working on
> > other test cases and see if i can show the probelm.
> >
> > - dm-ioband does this extra grouping logic using dm messages. Why
> > cgroup infrastructure is not sufficient to meet your needs like
> > grouping tasks based on uid etc? I think we should get rid of all
> > the extra grouping logic and just use cgroup for grouping information.
>
> I want to use dm-ioband even without cgroup and to make dm-ioband has
> flexibility to support various type of objects.

That's the core question. We all know that you want to use it that
way, but the point is that that does not sound like the right way. The
cgroup infrastructure was created for precisely this reason: to allow
arbitrary grouping of tasks in a hierarchical manner. The kind of
grouping you are doing, like uid-based grouping, you can easily do with
cgroups as well. In fact, I have written a pam plugin and contributed
it to the libcg project (a user-space library) to put a uid's tasks
automatically into a specified cgroup upon login to help the admin.

By not using cgroups and instead creating additional grouping
mechanisms in the dm layer, I don't think we are helping anybody. We
are just increasing the complexity without any proper justification.
The only reason I have heard so far is "I want it that way" or "This is
my goal". That kind of reasoning does not help.

>
> > - Why do we need to specify bio cgroup ids to the dm-ioband externally with
> > the help of dm messages? A user should be able to just create the
> > cgroups, put the tasks in right cgroup and then everything should
> > just work fine.
>
> This is because to handle cgroup on dm-ioband easily and it keeps the
> code simple.

But it becomes a configuration nightmare. cgroups are the way to group
tasks from a resource-management perspective. Please use that and don't
create additional ways of grouping which increase configuration
complexity. If you think there are deficiencies in the cgroup
infrastructure and it can't handle your case, then please enhance the
cgroup infrastructure to cover that case.

>
> > - Why do we have to put another dm-ioband device on top of every partition
> > or existing device mapper device to control it? Is it possible to do
> > this control on make_request function of the reuqest queue so that
> > we don't end up creating additional dm devices? I had posted the crude
> > RFC patch as proof of concept but did not continue the development
> > because of fundamental issue of FIFO release of buffered bios.
> >
> > http://lkml.org/lkml/2008/11/6/227
> >
> > Can you please have a look and provide feedback about why we can not
> > go in the direction of the above patches and why do we need to create
> > additional dm device.
> >
> > I think in current form, dm-ioband is hard to configure and we should
> > look for ways simplify configuration.
>
> This can be solved by using a tool or a small script.
>

libcg is trying to provide a generic helper library so that all the
user-space management programs can use it to control resource
controllers which use cgroups. Now, by not using cgroups, an admin will
have to come up with an entirely different set of scripts for the IO
controller? That does not make much sense.

Please also answer the rest of the question above: why do we need to
put an additional device-mapper device on every device we want to
control, and why can't we do it by providing a hook into the
make_request function of the queue instead of putting an additional
device-mapper device on top?

Why do you think that would not turn out to be a simpler approach?

> > - I personally think that even group IO scheduling should be done at
> > IO scheduler level and we should not break down IO scheduling in two
> > parts where group scheduling is done by higher level IO scheduler
> > sitting in dm layer and io scheduling among tasks with-in groups is
> > done by actual IO scheduler.
> >
> > But this also means more work as one has to muck around with core IO
> > scheduler's to make them cgroup aware and also make sure existing
> > functionality is not broken. I posted the patches here.
> >
> > http://lkml.org/lkml/2009/3/11/486
> >
> > Can you please let us know that why does IO scheduler based approach
> > does not work for you?
>
> I think your approach is not bad, but I've made it my purpose to
> control disk bandwidth of virtual machines by device-mapper and
> dm-ioband.

What do you mean by "I have made it my purpose"? Its not about that
I have decided to do something in a specific way and I will do it
only that way.

I think open source development is more about that this is the problem
statement and we discuss openly and experiment with various approaches
and then a approach which works for most of the people is accepted.

If you say that providing "IO control infrastructure in linux kernel"
is my goal, I can very well relate to it. But if you say providng "IO
control infrastructure only through dm-ioband, only through device-mapper
infrastructure" is my goal, then it is hard to digest.

I also have same concern and that is control the IO resources for
virtual machines. And IO schduler modification based approach as as well as
hooking into make_request function approach will achive the same
goal.

Here we are having a technical discussion about interfaces and what's the
best way do that. And not looking at other approches and not having an
open discussion about merits and demerits of all the approaches and not
willing to change the direction does not help.

> I think device-mapper is a well designed system for the following
> reasons:
> - It can easily add new functions to a block device.
> - No need to muck around with the existing kernel code.

Not touching the core code makes life simple and is an advantage. But
remember that it comes at the cost of FIFO dispatch and possible unwanted
scenarios with an underlying IO scheduler like CFQ. I already demonstrated
that with one RT example.

Hooking into the make_request function gives us the same advantage with
simpler configuration, and there is no need to put an extra dm device on
every device.

> - dm-devices are detachable. It doesn't make any effects on the
> system if a user doesn't use it.

Even with the make_request approach, one could enable/disable the IO
controller by writing 0/1 to a file.

So why are you not open to experimenting with the make_request-hook approach
and trying to make it work? It would meet your requirements while at the same
time achieving the goals of not touching the core IO scheduler, elevator and
block layer code. It would also be simple to enable/disable IO control. We
would not have to put an additional dm device on every device, and we would
not have to come up with additional grouping mechanisms; we could use the
cgroup interfaces instead. A sketch of what that could look like follows.
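
To make the comparison concrete, this is a sketch of what enabling and
disabling control might look like under the make_request-hook approach. The
per-queue sysfs file name is purely an assumption for illustration; nothing
like it exists today.

************************************************************
# Hypothetical per-queue switch for a make_request-based controller
# (the "io_controller" attribute name is an assumption).
echo 1 > /sys/block/sda/queue/io_controller    # enable control on sda
echo 0 > /sys/block/sda/queue/io_controller    # disable it again
# No dm device is stacked on top of the disk, so /dev/sda1 stays mounted,
# and grouping/weights come from the cgroup hierarchy, not dm messages.
************************************************************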

> So I think dm-ioband and your IO controller can coexist. What do you
> think about it?

Yes, they can. I am not against that. But I don't think that dm-ioband is
currently in the right shape, for the various reasons I have been citing in
these mails.

>
> > Jens, it would be nice to hear your opinion about two level vs one
> > level conrol. Do you think that common layer approach is the way
> > to go where one can control things more tightly or FIFO release of bios
> > from second level controller is fine and we can live with this additional serialization in the layer above just above IO scheduler?
> >
> > - There is no notion of RT cgroups. So even if one wants to run an RT
> > task in root cgroup to make sure to get full access of disk, it can't
> > do that. It has to share the BW with other competing groups.
> >
> > - dm-ioband controls amount of IO done per second. Will a seeky process
> > not run away more disk time?
>
> Could you elaborate on this? dm-ioband doesn't control it per second.
>

There are two ways to view fairness.

- Fairness in terms of amount of sectors/data transferred.
- Fairness in terms of disk access time one gets.

In the first case, if there is a seeky process doing IO, it will run away
with a lot more disk time than a process doing sequential IO. Some people
consider that unfair, and I think that's the reason CFQ provides fairness in
terms of disk time slices and not in terms of the number of sectors
transferred.

Now, with any two-level scheme, at the higher layer the only easy way to
provide fairness is in terms of sectors transferred, while the underlying
CFQ will be working on providing fairness in terms of disk time slices.
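
A rough back-of-the-envelope illustration of how far apart the two notions can
be (the throughput numbers below are assumptions, not measurements from the
tests in this thread):

************************************************************
# A seeky reader doing 4 KiB random reads at ~120 IOPS moves ~0.5 MiB/s,
# while a sequential reader streams ~60 MiB/s. Equal sectors therefore
# means wildly unequal disk time.
awk 'BEGIN {
        seeky_mbps = 120 * 4 / 1024;   # ~0.47 MiB/s from random IO
        seq_mbps   = 60;               # MiB/s from sequential IO
        printf "seeky IO needs %.0fx more disk time per MiB\n",
               seq_mbps / seeky_mbps;
}'
************************************************************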

Thanks
Vivek

> > Additionally, at group level we will provide fairness in terms of amount
> > of IO (number of blocks transferred etc) and with-in group cfq will try
> > to provide fairness in terms of disk access time slices. I don't even
> > know whether it is a matter of concern or not. I was thinking that
> > probably one uniform policy on the hierarchical scheduling tree would
> > have probably been better. Just thinking loud.....
> >
> > Thanks
> > Vivek
>
> Thanks,
> Ryo Tsuruta

2009-04-16 20:25:15

by Nauman Rafique

[permalink] [raw]
Subject: Re: [dm-devel] Re: dm-ioband: Test results.

On Thu, Apr 16, 2009 at 7:11 AM, Vivek Goyal <[email protected]> wrote:
> On Thu, Apr 16, 2009 at 11:47:50AM +0900, Ryo Tsuruta wrote:
>> Hi Vivek,
>>
>> > General thoughts about dm-ioband
>> > ================================
>> > - Implementing control at second level has the advantage tha one does not
>> >   have to muck with IO scheduler code. But then it also has the
>> >   disadvantage that there is no communication with IO scheduler.
>> >
>> > - dm-ioband is buffering bio at higher layer and then doing FIFO release
>> >   of these bios. This FIFO release can lead to priority inversion problems
>> >   in certain cases where RT requests are way behind BE requests or
>> >   reader starvation where reader bios are getting hidden behind writer
>> >   bios etc. These are hard to notice issues in user space. I guess above
>> >   RT results do highlight the RT task problems. I am still working on
>> >   other test cases and see if i can show the probelm.

Ryo, I could not agree more with Vivek here. At Google, we have very
stringent requirements for the latency of our RT requests. If RT requests
get queued in any higher layer (behind BE requests), all bets are off. For
this particular reason I don't favor doing IO control at two layers. The
upper layer (dm-ioband in this case) would have to make sure that RT
requests are released immediately, irrespective of its state (FIFO queuing
and tokens held), and the lower layer (the IO scheduling layer) has to do
the same. This requirement is not specific to us; I have seen similar
comments from filesystem folks here previously, in the context of metadata
updates being submitted as RT. Basically, the semantics of the RT class have
to be preserved by any solution that is built on top of the CFQ scheduler.

>> >
>> > - dm-ioband does this extra grouping logic using dm messages. Why
>> >   cgroup infrastructure is not sufficient to meet your needs like
>> >   grouping tasks based on uid etc? I think we should get rid of all
>> >   the extra grouping logic and just use cgroup for grouping information.
>>
>> I want to use dm-ioband even without cgroup and to make dm-ioband has
>> flexibility to support various type of objects.
>
> That's the core question. We all know that you want to use it that way.
> But the point is that does not sound the right way. cgroup infrastructure
> has been created for the precise reason to allow arbitrary grouping of
> tasks in hierarchical manner. The kind of grouping you are doing like
> uid based, you can easily do with cgroups also. In fact I have written
> a pam plugin and contributed to libcg project (user space library) to
> put a uid's task automatically in a specified cgroup upon login to help
> the admin.
>
> By not using cgroups and creating additional grouping mechanisms in the
> dm layer I don't think we are helping anybody. We are just increasing
> the complexity for no reason without any proper justification. The only
> reason I have heard so far is "I want it that way" or "This is my goal".
> This kind of reasoning does not help.
>
>>
>> > - Why do we need to specify bio cgroup ids to the dm-ioband externally with
>> >   the help of dm messages? A user should be able to just create the
>> >   cgroups, put the tasks in right cgroup and then everything should
>> >   just work fine.
>>
>> This is because to handle cgroup on dm-ioband easily and it keeps the
>> code simple.
>
> But it becomes the configuration nightmare. cgroup is the way for grouping
> tasks from resource management perspective. Please use that and don't
> create additional ways of grouping which increase configuration
> complexity. If you think there are deficiencies in cgroup infrastructure
> and it can't handle your case, then please enhance cgroup infrstructure to
> meet that case.
>
>>
>> > - Why do we have to put another dm-ioband device on top of every partition
>> >   or existing device mapper device to control it? Is it possible to do
>> >   this control on make_request function of the reuqest queue so that
>> >   we don't end up creating additional dm devices? I had posted the crude
>> >   RFC patch as proof of concept but did not continue the development
>> >   because of fundamental issue of FIFO release of buffered bios.
>> >
>> >     http://lkml.org/lkml/2008/11/6/227
>> >
>> >   Can you please have a look and provide feedback about why we can not
>> >   go in the direction of the above patches and why do we need to create
>> >   additional dm device.
>> >
>> >   I think in current form, dm-ioband is hard to configure and we should
>> >   look for ways simplify configuration.
>>
>> This can be solved by using a tool or a small script.
>>
>
> libcg is trying to provide generic helper library so that all the
> user space management programs can use it to control resource controllers
> which are using cgroup. Now by not using cgroup, an admin shall have to
> come up with entirely different set of scripts for IO controller? That
> does not make too much of sense.
>
> Please also answer rest of the question above. Why do we need to put
> additional device mapper device on every device we want to control and
> why can't we do it by providing a hook into make_request function of
> the queue and not putting additional device mapper device.
>
> Why do you think that it will not turn out to be a simpler approach?
>
>> > - I personally think that even group IO scheduling should be done at
>> >   IO scheduler level and we should not break down IO scheduling in two
>> >   parts where group scheduling is done by higher level IO scheduler
>> >   sitting in dm layer and io scheduling among tasks with-in groups is
>> >   done by actual IO scheduler.
>> >
>> >   But this also means more work as one has to muck around with core IO
>> >   scheduler's to make them cgroup aware and also make sure existing
>> >   functionality is not broken. I posted the patches here.
>> >
>> >     http://lkml.org/lkml/2009/3/11/486
>> >
>> >   Can you please let us know that why does IO scheduler based approach
>> >   does not work for you?
>>
>> I think your approach is not bad, but I've made it my purpose to
>> control disk bandwidth of virtual machines by device-mapper and
>> dm-ioband.
>
> What do you mean by "I have made it my purpose"? Its not about that
> I have decided to do something in a specific way and I will do it
> only that way.
>
> I think open source development is more about that this is the problem
> statement and we discuss openly and experiment with various approaches
> and then a approach which works for most of the people is accepted.
>
> If you say that providing "IO control infrastructure in linux kernel"
> is my goal, I can very well relate to it. But if you say providng "IO
> control infrastructure only through dm-ioband, only through device-mapper
> infrastructure" is my goal, then it is hard to digest.
>
> I also have same concern and that is control the IO resources for
> virtual machines. And IO schduler modification based approach as as well as
> hooking into make_request function approach will achive the same
> goal.
>
> Here we are having a technical discussion about interfaces and what's the
> best way do that. And not looking at other approches and not having an
> open discussion about merits and demerits of all the approaches and not
> willing to change the direction does not help.
>
>> I think device-mapper is a well designed system for the following
>> reasons:
>>  - It can easily add new functions to a block device.
>>  - No need to muck around with the existing kernel code.
>
> Not touching the core code makes life simple and is an advantage. But
> remember that it comes at a cost of FIFO dispatch and possible unwanted
> scnerios with underlying ioscheduoer like CFQ. I already demonstrated that
> with one RT example.
>
> But then hooking into make_request_function will give us same advantage
> with simpler configuration and there is no need of putting extra dm
> device on every device.
>
>>  - dm-devices are detachable. It doesn't make any effects on the
>>    system if a user doesn't use it.
>
> Even wth make_request approach, one could enable/disable io controller
> by writing 0/1 to a file.
>
> So why are you not open to experimenting with hooking into make_request
> function approach and try to make it work? It would meet your requirements
> at the same time achive the goals of not touching the core IO scheduler,
> elevator and block layer code etc.? It will also be simple to
> enable/disable IO control. We shall not have to put additional dm device
> on every device. We shall not have to come up with additional grouping
> mechanisms and can use cgroup interfaces etc.
>
>> So I think dm-ioband and your IO controller can coexist. What do you
>> think about it?
>
> Yes they can. I am not against that. But I don't think that dm-ioband
> currently is in the right shape for various reasons have been citing
> in the mails.
>
>>
>> >   Jens, it would be nice to hear your opinion about two level vs one
>> >   level conrol. Do you think that common layer approach is the way
>> >   to go where one can control things more tightly or FIFO release of bios
>> >   from second level controller is fine and we can live with this additional serialization in the layer above just above IO scheduler?
>> >
>> > - There is no notion of RT cgroups. So even if one wants to run an RT
>> >   task in root cgroup to make sure to get full access of disk, it can't
>> >   do that. It has to share the BW with other competing groups.
>> >
>> > - dm-ioband controls amount of IO done per second. Will a seeky process
>> >   not run away more disk time?
>>
>> Could you elaborate on this? dm-ioband doesn't control it per second.
>>
>
> There are two ways to view fairness.
>
> - Fairness in terms of amount of sectors/data transferred.
> - Fairness in terms of disk access time one gets.
>
> In first case, if there is a seeky process doing IO, it will run away
> with lot more disk time than a process doing sequential IO. Some people
> consider it unfair and I think that's the reason CFQ provides fairness
> in terms of disk time slices and not in terms of number of sectors
> transferred.
>
> Now with any two level of scheme, at higher layer only easy way to
> provide fairness is in terms of secotrs transferred and underlying
> CFQ will be working on providing fairness in terms of disk slices.
>
> Thanks
> Vivek
>
>> >   Additionally, at group level we will provide fairness in terms of amount
>> >   of IO (number of blocks transferred etc) and with-in group cfq will try
>> >   to provide fairness in terms of disk access time slices. I don't even
>> >   know whether it is a matter of concern or not. I was thinking that
>> >   probably one uniform policy on the hierarchical scheduling tree would
>> >   have probably been better. Just thinking loud.....
>> >
>> > Thanks
>> > Vivek
>>
>> Thanks,
>> Ryo Tsuruta
>

2009-04-16 20:59:17

by Vivek Goyal

[permalink] [raw]
Subject: Re: dm-ioband: Test results.

On Mon, Apr 13, 2009 at 01:05:52PM +0900, Ryo Tsuruta wrote:
> Hi Alasdair and all,
>
> I did more tests on dm-ioband and I've posted the test items and
> results on my website. The results are very good.
> http://people.valinux.co.jp/~ryov/dm-ioband/test/test-items.xls
>
> I hope someone will test dm-ioband and report back to the dm-devel
> mailing list.
>

OK, here are more test results. This time I am trying to see how fairness
is provided for async writes and how it impacts throughput.

I have created two partitions /dev/sda1 and /dev/sda2. Two ioband devices
ioband1 and ioband2 on /dev/sda1 and /dev/sda2 respectively with weights
40 and 40.

#dmsetup status
ioband2: 0 38025855 ioband 1 -1 150 8 186 1 0 8
ioband1: 0 40098177 ioband 1 -1 150 8 186 1 0 8

I ran the following two fio jobs, one in each partition.

************************************************************
echo cfq > /sys/block/sdd/queue/scheduler
sync
echo 3 > /proc/sys/vm/drop_caches

fio_args="--size=64m --rw=write --numjobs=50 --group_reporting"
time fio $fio_args --name=test1 --directory=/mnt/sdd1/fio/ \
     --output=test1.log &
time fio $fio_args --name=test2 --directory=/mnt/sdd2/fio/ \
     --output=test2.log &
wait
*****************************************************************

Following are fio job finish times with and without dm-ioband

                    first job      second job
without dm-ioband   3m29.947s      4m1.436s
with dm-ioband      8m42.532s      8m43.328s

This is a more than 100% performance regression in this particular setup:
both jobs take over twice as long to finish.

I think this regression is introduced because we wait too long for the
slower group to catch up, to make the proportionate numbers look right, and
so we choke the writes even when the device is free.

It is a hard problem to solve because async write traffic is bursty when
seen at the block layer, and we do not necessarily see a higher amount of
write traffic dispatched from the higher-priority process/group. So what does
one do? Wait for the other groups to catch up so the proportions look right,
and hence let the disk go idle and kill performance? Or just continue and not
idle too much (a small amount of idling, like 8ms for a sync queue, might
still be OK)?
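
For reference, the idle window CFQ already uses for sync queues is visible
and tunable per device; the device name below just follows the test setup
above.

************************************************************
# CFQ's per-queue idle window for sync queues, in milliseconds (default 8).
cat /sys/block/sdd/queue/iosched/slice_idle
# Shrinking it trades some sync fairness for throughput on seek-bound loads.
echo 4 > /sys/block/sdd/queue/iosched/slice_idle
************************************************************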

I think there might not be much benefit in maintaining an artificial notion
of the proportionate ratio at the cost of killing performance. We should
instead audit the async write path and see where the higher-weight
application/group gets stuck.

In my simple two-dd test, I could see bursty traffic from the high-priority
app that would sometimes disappear for 0.2 to 0.8 seconds. If I wait for the
higher-priority group to catch up during that window, I end up keeping the
disk idle for up to 0.8 seconds and killing performance. I guess the better
way is not to wait that long (even if that gives the application the
impression that the IO scheduler is not assigning disk time proportionately)
and, over a period of time, see if we can fix some things in the async write
path so that the IO scheduler sees smoother traffic.

Thoughts?

Thanks
Vivek


> Alasdair, could you please merge dm-ioband into upstream? Or could
> you please tell me why dm-ioband can't be merged?
>
> Thanks,
> Ryo Tsuruta
>
> To know the details of dm-ioband:
> http://people.valinux.co.jp/~ryov/dm-ioband/
>
> RPM packages for RHEL5 and CentOS5 are available:
> http://people.valinux.co.jp/~ryov/dm-ioband/binary.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2009-04-17 02:14:54

by Vivek Goyal

[permalink] [raw]
Subject: Re: dm-ioband: Test results.

On Thu, Apr 16, 2009 at 04:57:20PM -0400, Vivek Goyal wrote:
> On Mon, Apr 13, 2009 at 01:05:52PM +0900, Ryo Tsuruta wrote:
> > Hi Alasdair and all,
> >
> > I did more tests on dm-ioband and I've posted the test items and
> > results on my website. The results are very good.
> > http://people.valinux.co.jp/~ryov/dm-ioband/test/test-items.xls
> >
> > I hope someone will test dm-ioband and report back to the dm-devel
> > mailing list.
> >
>

OK, one more test. This time the goal is to show that with a single queue and
FIFO dispatch a writer can easily starve a reader.

I have created two partitions /dev/sda1 and /dev/sda2. Two ioband devices
ioband1 and ioband2 on /dev/sda1 and /dev/sda2 respectively with weights
40 and 20.

I am launching an aggressive writer dd with prio 7 (Best effort) and a
reader with prio 0 (Best effort).

Following is my script.

****************************************************************
rm /mnt/sdd1/aggressivewriter

sync
echo 3 > /proc/sys/vm/drop_caches

#launch a hostile writer
ionice -c2 -n7 dd if=/dev/zero of=/mnt/sdd1/aggressivewriter bs=4K count=524288 conv=fdatasync &

# Reader
ionice -c 2 -n 0 dd if=/mnt/sdd1/testzerofile1 of=/dev/null &
wait $!
echo "reader finished"
**********************************************************************

Following are the results without and with dm-ioband

Without dm-ioband
-----------------
First run
2147483648 bytes (2.1 GB) copied, 46.4747 s, 46.2 MB/s (Reader)
reader finished
2147483648 bytes (2.1 GB) copied, 87.9293 s, 24.4 MB/s (Writer)

Second run
2147483648 bytes (2.1 GB) copied, 47.6461 s, 45.1 MB/s (Reader)
reader finished
2147483648 bytes (2.1 GB) copied, 89.0781 s, 24.1 MB/s (Writer)

Third run
2147483648 bytes (2.1 GB) copied, 51.0624 s, 42.1 MB/s (Reader)
reader finished
2147483648 bytes (2.1 GB) copied, 91.9507 s, 23.4 MB/s (Writer)

With dm-ioband
--------------
2147483648 bytes (2.1 GB) copied, 54.895 s, 39.1 MB/s (Writer)
2147483648 bytes (2.1 GB) copied, 88.6323 s, 24.2 MB/s (Reader)
reader finished

2147483648 bytes (2.1 GB) copied, 62.6102 s, 34.3 MB/s (Writer)
2147483648 bytes (2.1 GB) copied, 91.6662 s, 23.4 MB/s (Reader)
reader finished

2147483648 bytes (2.1 GB) copied, 58.9928 s, 36.4 MB/s (Writer)
2147483648 bytes (2.1 GB) copied, 90.6707 s, 23.7 MB/s (Reader)
reader finished

I have marked which dd finished first. I determine this with the help of the
wait command and also by monitoring "iostat -d 5 sdd1" to see how the IO
rates vary.

Notice that with dm-ioband it is a complete reversal of fortunes: the reader
is completely starved by the aggressive writer. I think this one you should
be able to reproduce easily with the script.

I don't understand how a single queue and FIFO dispatch do not break the
notion of CFQ classes and priorities.

Thanks
Vivek

> Ok, here are more test results. This time I am trying to see how fairness
> is provided for async writes and how does it impact throughput.
>
> I have created two partitions /dev/sda1 and /dev/sda2. Two ioband devices
> ioband1 and ioband2 on /dev/sda1 and /dev/sda2 respectively with weights
> 40 and 40.
>
> #dmsetup status
> ioband2: 0 38025855 ioband 1 -1 150 8 186 1 0 8
> ioband1: 0 40098177 ioband 1 -1 150 8 186 1 0 8
>
> I ran following two fio jobs. One job in each partition.
>
> ************************************************************
> echo cfq > /sys/block/sdd/queue/scheduler
> sync
> echo 3 > /proc/sys/vm/drop_caches
>
> fio_args="--size=64m --rw=write --numjobs=50 --group_reporting"
> time fio $fio_args --name=test1 --directory=/mnt/sdd1/fio/
> --output=test1.log &
> time fio $fio_args --name=test2 --directory=/mnt/sdd2/fio/
> --output=test2.log &
> wait
> *****************************************************************
>
> Following are fio job finish times with and without dm-ioband
>
> first job second job
> without dm-ioband 3m29.947s 4m1.436s
> with dm-ioband 8m42.532s 8m43.328s
>
> This sounds like 100% performance regression in this particular setup.
>
> I think this regression is introduced because we are waiting for too
> long for slower group to catch up to make sure proportionate numbers
> look right and choke the writes even if deviec is free.
>
> It is an hard to solve problem because the async writes traffic is
> bursty when seen at block layer and we not necessarily see higher amount of
> writer traffic dispatched from higher prio process/group. So what does one
> do? Wait for other groups to catch up to show right proportionate numbers
> and hence let the disk be idle and kill the performance. Or just continue
> and not idle too much (a small amount of idling like 8ms for sync queue
> might still be ok).
>
> I think there might not be much benefit in providing artificial notion
> of maintaining proportionate ratio and kill the performance. We should
> instead try to audit async write path and see where the higher weight
> application/group is stuck.
>
> In my simple two dd test, I could see bursty traffic from high prio app and
> then it would sometimes disappear for .2 to .8 seconds. In that duration if I
> wait for higher priority group to catch up that I will end up keeping disk
> idle for .8 seconds and kill performance. I guess better way is to not wait
> that long (even if it means that to application it might give the impression
> that io scheduler is not doing the job right in assiginig proportionate disk)
> and over a period of time see if we can fix some things in async write path
> for more smooth traffic to io scheduler.
>
> Thoughts?
>
> Thanks
> Vivek
>
>
> > Alasdair, could you please merge dm-ioband into upstream? Or could
> > you please tell me why dm-ioband can't be merged?
> >
> > Thanks,
> > Ryo Tsuruta
> >
> > To know the details of dm-ioband:
> > http://people.valinux.co.jp/~ryov/dm-ioband/
> >
> > RPM packages for RHEL5 and CentOS5 are available:
> > http://people.valinux.co.jp/~ryov/dm-ioband/binary.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/

2009-04-17 02:29:54

by Vivek Goyal

[permalink] [raw]
Subject: Re: dm-ioband: Test results.

On Thu, Apr 16, 2009 at 10:11:39PM -0400, Vivek Goyal wrote:
> On Thu, Apr 16, 2009 at 04:57:20PM -0400, Vivek Goyal wrote:
> > On Mon, Apr 13, 2009 at 01:05:52PM +0900, Ryo Tsuruta wrote:
> > > Hi Alasdair and all,
> > >
> > > I did more tests on dm-ioband and I've posted the test items and
> > > results on my website. The results are very good.
> > > http://people.valinux.co.jp/~ryov/dm-ioband/test/test-items.xls
> > >
> > > I hope someone will test dm-ioband and report back to the dm-devel
> > > mailing list.
> > >
> >
>
> Ok, one more test. This time to show that with single queue and FIFO
> dispatch a writer can easily starve the reader.
>
> I have created two partitions /dev/sda1 and /dev/sda2. Two ioband devices
> ioband1 and ioband2 on /dev/sda1 and /dev/sda2 respectively with weights
> 40 and 20.
>
> I am launching an aggressive writer dd with prio 7 (Best effort) and a
> reader with prio 0 (Best effort).
>
> Following is my script.
>
> ****************************************************************
> rm /mnt/sdd1/aggressivewriter
>
> sync
> echo 3 > /proc/sys/vm/drop_caches
>
> #launch an hostile writer
> ionice -c2 -n7 dd if=/dev/zero of=/mnt/sdd1/aggressivewriter bs=4K count=524288 conv=fdatasync &
>
> # Reader
> ionice -c 2 -n 0 dd if=/mnt/sdd1/testzerofile1 of=/dev/null &
> wait $!
> echo "reader finished"
> **********************************************************************

More results. Same reader/writer test as above; the only variation is that
the reader's class is changed from BE to RT.

ionice -c 1 -n 0 dd if=/mnt/sdd1/testzerofile1 of=/dev/null &

Even changing the reader's class does not help. The writer still starves the
reader.

Without dm-ioband
=================
First run
2147483648 bytes (2.1 GB) copied, 43.9096 s, 48.9 MB/s (Reader)
reader finished
2147483648 bytes (2.1 GB) copied, 85.9094 s, 25.0 MB/s (Writer)

Second run
2147483648 bytes (2.1 GB) copied, 40.2446 s, 53.4 MB/s (Reader)
reader finished
2147483648 bytes (2.1 GB) copied, 82.723 s, 26.0 MB/s (Writer)

With dm-ioband
==============
First run
2147483648 bytes (2.1 GB) copied, 69.0272 s, 31.1 MB/s (Writer)
2147483648 bytes (2.1 GB) copied, 89.3037 s, 24.0 MB/s (Reader)
reader finished

Second run
2147483648 bytes (2.1 GB) copied, 64.8751 s, 33.1 MB/s (Writer)
2147483648 bytes (2.1 GB) copied, 89.0273 s, 24.1 MB/s (Reader)
reader finished

Thanks
Vivek

2009-04-20 08:30:16

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [dm-devel] Re: dm-ioband: Test results.

Hi Vivek and Nauman,

> On Thu, Apr 16, 2009 at 7:11 AM, Vivek Goyal <[email protected]> wrote:
> > On Thu, Apr 16, 2009 at 11:47:50AM +0900, Ryo Tsuruta wrote:
> >> Hi Vivek,
> >>
> >> > General thoughts about dm-ioband
> >> > ================================
> >> > - Implementing control at second level has the advantage tha one does not
> >> >   have to muck with IO scheduler code. But then it also has the
> >> >   disadvantage that there is no communication with IO scheduler.
> >> >
> >> > - dm-ioband is buffering bio at higher layer and then doing FIFO release
> >> >   of these bios. This FIFO release can lead to priority inversion problems
> >> >   in certain cases where RT requests are way behind BE requests or
> >> >   reader starvation where reader bios are getting hidden behind writer
> >> >   bios etc. These are hard to notice issues in user space. I guess above
> >> >   RT results do highlight the RT task problems. I am still working on
> >> >   other test cases and see if i can show the probelm.
>
> Ryo, I could not agree more with Vivek here. At Google, we have very
> stringent requirement for latency of our RT requests. If RT requests
> get queued in any higher layer (behind BE requests), all bets are off.
> I don't find doing IO control at two layer for this particular reason.
> The upper layer (dm-ioband in this case) would have to make sure that
> RT requests are released immediately, irrespective of the state (FIFO
> queuing and tokens held). And the lower layer (IO scheduling layer)
> has to do the same. This requirement is not specific to us. I have
> seen similar comments from filesystem folks here previously, in the
> context of metadata updates being submitted as RT. Basically, the
> semantics of RT class has to be preserved by any solution that is
> build on top of CFQ scheduler.

I could see the priority inversion by running Vivek's script, and I
understand how RT requests have to be handled. I'll create a patch which
makes dm-ioband cooperate with the CFQ scheduler. However, do you think we
need some kind of limitation on processes that belong to the RT class, to
prevent them from depleting the bandwidth?

Thanks,
Ryo Tsuruta

2009-04-20 09:08:05

by Nauman Rafique

[permalink] [raw]
Subject: Re: [dm-devel] Re: dm-ioband: Test results.

On Mon, Apr 20, 2009 at 1:29 AM, Ryo Tsuruta <[email protected]> wrote:
> Hi Vivek and Nauman,
>
>> On Thu, Apr 16, 2009 at 7:11 AM, Vivek Goyal <[email protected]> wrote:
>> > On Thu, Apr 16, 2009 at 11:47:50AM +0900, Ryo Tsuruta wrote:
>> >> Hi Vivek,
>> >>
>> >> > General thoughts about dm-ioband
>> >> > ================================
>> >> > - Implementing control at second level has the advantage tha one does not
>> >> >   have to muck with IO scheduler code. But then it also has the
>> >> >   disadvantage that there is no communication with IO scheduler.
>> >> >
>> >> > - dm-ioband is buffering bio at higher layer and then doing FIFO release
>> >> >   of these bios. This FIFO release can lead to priority inversion problems
>> >> >   in certain cases where RT requests are way behind BE requests or
>> >> >   reader starvation where reader bios are getting hidden behind writer
>> >> >   bios etc. These are hard to notice issues in user space. I guess above
>> >> >   RT results do highlight the RT task problems. I am still working on
>> >> >   other test cases and see if i can show the probelm.
>>
>> Ryo, I could not agree more with Vivek here. At Google, we have very
>> stringent requirement for latency of our RT requests. If RT requests
>> get queued in any higher layer (behind BE requests), all bets are off.
>> I don't find doing IO control at two layer for this particular reason.
>> The upper layer (dm-ioband in this case) would have to make sure that
>> RT requests are released immediately, irrespective of the state (FIFO
>> queuing and tokens held). And the lower layer (IO scheduling layer)
>> has to do the same. This requirement is not specific to us. I have
>> seen similar comments from filesystem folks here previously, in the
>> context of metadata updates being submitted as RT. Basically, the
>> semantics of RT class has to be preserved by any solution that is
>> build on top of CFQ scheduler.
>
> I could see the priority inversion by running Vivek's script and I
> understand how RT requests has to be handled. I'll create a patch
> which makes dm-ioband cooperates with CFQ scheduler. However, do you
> think we need some kind of limitation on processes which belong to the
> RT class to prevent the processes from depleting bandwidth?

If you are talking about starvation that could be caused by RT tasks,
you are right. We need some mechanism to introduce starvation
prevention, but I think that is an issue that can be tackled once we
decide where to do bandwidth control.

The real question is, once you create a version of dm-ioband that
co-operates with CFQ scheduler, how that solution would compare with
the patch set Vivek has posted? In my opinion, we need to converge to
one solution as soon as possible, so that we can work on it together
to refine and test it.

>
> Thanks,
> Ryo Tsuruta
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2009-04-20 21:38:03

by Vivek Goyal

[permalink] [raw]
Subject: Re: [dm-devel] Re: dm-ioband: Test results.

On Mon, Apr 20, 2009 at 05:29:59PM +0900, Ryo Tsuruta wrote:
> Hi Vivek and Nauman,
>
> > On Thu, Apr 16, 2009 at 7:11 AM, Vivek Goyal <[email protected]> wrote:
> > > On Thu, Apr 16, 2009 at 11:47:50AM +0900, Ryo Tsuruta wrote:
> > >> Hi Vivek,
> > >>
> > >> > General thoughts about dm-ioband
> > >> > ================================
> > >> > - Implementing control at second level has the advantage tha one does not
> > >> >   have to muck with IO scheduler code. But then it also has the
> > >> >   disadvantage that there is no communication with IO scheduler.
> > >> >
> > >> > - dm-ioband is buffering bio at higher layer and then doing FIFO release
> > >> >   of these bios. This FIFO release can lead to priority inversion problems
> > >> >   in certain cases where RT requests are way behind BE requests or
> > >> >   reader starvation where reader bios are getting hidden behind writer
> > >> >   bios etc. These are hard to notice issues in user space. I guess above
> > >> >   RT results do highlight the RT task problems. I am still working on
> > >> >   other test cases and see if i can show the probelm.
> >
> > Ryo, I could not agree more with Vivek here. At Google, we have very
> > stringent requirement for latency of our RT requests. If RT requests
> > get queued in any higher layer (behind BE requests), all bets are off.
> > I don't find doing IO control at two layer for this particular reason.
> > The upper layer (dm-ioband in this case) would have to make sure that
> > RT requests are released immediately, irrespective of the state (FIFO
> > queuing and tokens held). And the lower layer (IO scheduling layer)
> > has to do the same. This requirement is not specific to us. I have
> > seen similar comments from filesystem folks here previously, in the
> > context of metadata updates being submitted as RT. Basically, the
> > semantics of RT class has to be preserved by any solution that is
> > build on top of CFQ scheduler.
>
> I could see the priority inversion by running Vivek's script and I
> understand how RT requests has to be handled. I'll create a patch
> which makes dm-ioband cooperates with CFQ scheduler. However, do you
> think we need some kind of limitation on processes which belong to the
> RT class to prevent the processes from depleting bandwidth?

I think to begin with, we can keep the same behavior as CFQ. An RT task
can starve other tasks.

But we should provide two configurations and let the user choose either one.
If the RT task is in the root group, it will starve other sibling
tasks/groups. If it is within a cgroup, then it will only starve its siblings
within that cgroup and will not impact other cgroups.

What I mean is the following.

          root
         /    \
       RT    group1

In the above configuration the RT task will starve everybody else.

          root
         /    \
    group1    group2
    /    \
  RT      BE

In the above configuration the RT task will starve only its sibling in group1
but will not starve the tasks in group2 or in the root.
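
In cgroup terms, the two layouts above are just different placements of the
RT task. A minimal sketch, assuming a cgroup-aware IO controller mounted at
/cgroup (the controller itself is what is being discussed, not something that
exists today; the PID variables are placeholders):

************************************************************
# Layout 1: RT task lives directly in the root group -> it can starve everyone.
echo $RT_PID > /cgroup/tasks

# Layout 2: RT task confined to group1 -> it starves only its siblings there;
# group2 and the root keep their share.
mkdir -p /cgroup/group1 /cgroup/group2
echo $RT_PID    > /cgroup/group1/tasks
echo $BE_PID    > /cgroup/group1/tasks
echo $OTHER_PID > /cgroup/group2/tasks
************************************************************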

Thanks
Vivek

2009-04-21 12:06:26

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [dm-devel] Re: dm-ioband: Test results.

Hi Nauman,

> >> >> > General thoughts about dm-ioband
> >> >> > ================================
> >> >> > - Implementing control at second level has the advantage tha one does not
> >> >> >   have to muck with IO scheduler code. But then it also has the
> >> >> >   disadvantage that there is no communication with IO scheduler.
> >> >> >
> >> >> > - dm-ioband is buffering bio at higher layer and then doing FIFO release
> >> >> >   of these bios. This FIFO release can lead to priority inversion problems
> >> >> >   in certain cases where RT requests are way behind BE requests or
> >> >> >   reader starvation where reader bios are getting hidden behind writer
> >> >> >   bios etc. These are hard to notice issues in user space. I guess above
> >> >> >   RT results do highlight the RT task problems. I am still working on
> >> >> >   other test cases and see if i can show the probelm.
> >>
> >> Ryo, I could not agree more with Vivek here. At Google, we have very
> >> stringent requirement for latency of our RT requests. If RT requests
> >> get queued in any higher layer (behind BE requests), all bets are off.
> >> I don't find doing IO control at two layer for this particular reason.
> >> The upper layer (dm-ioband in this case) would have to make sure that
> >> RT requests are released immediately, irrespective of the state (FIFO
> >> queuing and tokens held). And the lower layer (IO scheduling layer)
> >> has to do the same. This requirement is not specific to us. I have
> >> seen similar comments from filesystem folks here previously, in the
> >> context of metadata updates being submitted as RT. Basically, the
> >> semantics of RT class has to be preserved by any solution that is
> >> build on top of CFQ scheduler.
> >
> > I could see the priority inversion by running Vivek's script and I
> > understand how RT requests has to be handled. I'll create a patch
> > which makes dm-ioband cooperates with CFQ scheduler. However, do you
> > think we need some kind of limitation on processes which belong to the
> > RT class to prevent the processes from depleting bandwidth?
>
> If you are talking about starvation that could be caused by RT tasks,
> you are right. We need some mechanism to introduce starvation
> prevention, but I think that is an issue that can be tackled once we
> decide where to do bandwidth control.
>
> The real question is, once you create a version of dm-ioband that
> co-operates with CFQ scheduler, how that solution would compare with
> the patch set Vivek has posted? In my opinion, we need to converge to
> one solution as soon as possible, so that we can work on it together
> to refine and test it.

I think I can help with your work, but I want to continue the development
of dm-ioband, because dm-ioband actually works well and I think it has some
advantages over other IO controllers.
- It can use without cgroup.
- It can control bandwidth on a per partition basis.
- The driver module can be replaced without stopping the system.

Thanks,
Ryo Tsuruta

2009-04-21 12:11:05

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [dm-devel] Re: dm-ioband: Test results.

Hi Nauman,

> > The real question is, once you create a version of dm-ioband that
> > co-operates with CFQ scheduler, how that solution would compare with
> > the patch set Vivek has posted? In my opinion, we need to converge to
> > one solution as soon as possible, so that we can work on it together
> > to refine and test it.
>
> I think I can do some help for your work. but I want to continue the
> development of dm-ioband, because dm-ioband actually works well and
> I think it has some advantages against other IO controllers.
> - It can use without cgroup.
> - It can control bandwidth on a per partition basis.
> - The driver module can be replaced without stopping the system.

In addition, dm-ioband can run on RHEL5.

Thanks,
Ryo Tsuruta

2009-04-21 12:18:39

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [dm-devel] Re: dm-ioband: Test results.

Hi Vivek,

> > I could see the priority inversion by running Vivek's script and I
> > understand how RT requests has to be handled. I'll create a patch
> > which makes dm-ioband cooperates with CFQ scheduler. However, do you
> > think we need some kind of limitation on processes which belong to the
> > RT class to prevent the processes from depleting bandwidth?
>
> I think to begin with, we can keep the same behavior as CFQ. An RT task
> can starve other tasks.
>
> But we should provide two configurations and user can choose any one.
> If RT task is in root group, it will starve other sibling tasks/groups. If
> it is with-in a cgroup, then it will starve its sibling only with-in that
> cgroup and will not impact other cgroups.
>
> What I mean is following.
>
>           root
>          /    \
>        RT    group1
>
> In above configuration RT task will starve everybody else.
>
>           root
>          /    \
>     group1    group2
>     /    \
>   RT      BE
>
> In above configuration RT task will starve only sibling in group1 but
> will not starve the tasks in group2 or in root.

Thanks for the suggestion. I'll try this way when dm-ioband supports
hierarchical grouping.

Thanks,
Ryo Tsuruta

2009-04-21 13:58:06

by Mike Snitzer

[permalink] [raw]
Subject: Re: dm-ioband: Test results.

On Tue, Apr 21 2009 at 8:10am -0400,
Ryo Tsuruta <[email protected]> wrote:

> Hi Nauman,
>
> > > The real question is, once you create a version of dm-ioband that
> > > co-operates with CFQ scheduler, how that solution would compare with
> > > the patch set Vivek has posted? In my opinion, we need to converge to
> > > one solution as soon as possible, so that we can work on it together
> > > to refine and test it.
> >
> > I think I can do some help for your work. but I want to continue the
> > development of dm-ioband, because dm-ioband actually works well and
> > I think it has some advantages against other IO controllers.
> > - It can use without cgroup.
> > - It can control bandwidth on a per partition basis.
> > - The driver module can be replaced without stopping the system.
>
> In addition, dm-ioband can run on the RHEL5.

RHEL5 compatibility does not matter relative to merging an I/O bandwidth
controller upstream. So both the "can [be] use without cgroup" and "can
run on RHEL5" features do not help your cause of getting dm-ioband
merged upstream. In fact these features serve as distractions.

Mike

2009-04-21 14:16:32

by Vivek Goyal

[permalink] [raw]
Subject: Re: dm-ioband: Test results.

On Tue, Apr 21, 2009 at 09:57:23AM -0400, Mike Snitzer wrote:
> On Tue, Apr 21 2009 at 8:10am -0400,
> Ryo Tsuruta <[email protected]> wrote:
>
> > Hi Nauman,
> >
> > > > The real question is, once you create a version of dm-ioband that
> > > > co-operates with CFQ scheduler, how that solution would compare with
> > > > the patch set Vivek has posted? In my opinion, we need to converge to
> > > > one solution as soon as possible, so that we can work on it together
> > > > to refine and test it.
> > >
> > > I think I can do some help for your work. but I want to continue the
> > > development of dm-ioband, because dm-ioband actually works well and
> > > I think it has some advantages against other IO controllers.
> > > - It can use without cgroup.
> > > - It can control bandwidth on a per partition basis.
> > > - The driver module can be replaced without stopping the system.
> >
> > In addition, dm-ioband can run on the RHEL5.
>
> RHEL5 compatibility does not matter relative to merging an I/O bandwidth
> controller upstream. So both the "can [be] use without cgroup" and "can
> run on RHEL5" features do not help your cause of getting dm-ioband
> merged upstream. In fact these features serve as distractions.

Exactly. I don't think that "it can be used without cgroup" is a feature or
an advantage. To me it is a disadvantage and should be fixed. cgroup is the
standard mechanism for grouping tasks arbitrarily, and we should use it to
make things work instead of coming up with our own ways of grouping things
and terming that an advantage.

What do you mean by "The driver module can be replaced without stopping the
system"? I guess you mean that one does not have to reboot the system to
remove an ioband device? So if one decides to stop using IO control, one has
to remove the ioband devices, remount the filesystems and restart the
applications?

With the cgroup approach, if one does not want things to be classified, a
user can simply move all the tasks to the root group and things will be fine.
No remounting, no stopping of applications, etc. So this also does not look
like an advantage; it sounds like a disadvantage.

Thanks
Vivek

2009-04-22 00:49:29

by Li Zefan

[permalink] [raw]
Subject: Re: dm-ioband: Test results.

Vivek Goyal wrote:
> On Tue, Apr 21, 2009 at 09:57:23AM -0400, Mike Snitzer wrote:
>> On Tue, Apr 21 2009 at 8:10am -0400,
>> Ryo Tsuruta <[email protected]> wrote:
>>
>>> Hi Nauman,
>>>
>>>>> The real question is, once you create a version of dm-ioband that
>>>>> co-operates with CFQ scheduler, how that solution would compare with
>>>>> the patch set Vivek has posted? In my opinion, we need to converge to
>>>>> one solution as soon as possible, so that we can work on it together
>>>>> to refine and test it.
>>>> I think I can do some help for your work. but I want to continue the
>>>> development of dm-ioband, because dm-ioband actually works well and
>>>> I think it has some advantages against other IO controllers.
>>>> - It can use without cgroup.
>>>> - It can control bandwidth on a per partition basis.
>>>> - The driver module can be replaced without stopping the system.
>>> In addition, dm-ioband can run on the RHEL5.
>> RHEL5 compatibility does not matter relative to merging an I/O bandwidth
>> controller upstream. So both the "can [be] use without cgroup" and "can
>> run on RHEL5" features do not help your cause of getting dm-ioband
>> merged upstream. In fact these features serve as distractions.
>
> Exactly. I don't think that "it can be used without cgroup" is a feature
> or advantage. To me it is a disadvantage and should be fixed. cgroup is
> standard mechanism to group tasks arbitrarily and we should use that to make
> things working instead of coming up with own ways of grouping things and
> terming it as advantage.
>

I agree. In the case of the CPU scheduler, there are both the user-based
group scheduler and the cgroup-based group scheduler, but Peter said he would
like to see the user-based group scheduler removed.

> What do you mean by "The driver module can be replaced without stopping
> the system"? I guess you mean that one does not have to reboot the system
> to remove ioband device? So if one decides to not use the cgroup, then
> one shall have to remove the ioband devices, remount the filesystems and
> restart the application?
>
> With cgroup approach, if one does not want things to be classified, a user
> can simply move all the tasks to root group and things will be fine. No
> remounting, no application stopping etc. So this also does not look like
> an advantage instead sounds like an disadvantage.
>
> Thanks
> Vivek

2009-04-22 03:14:54

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [dm-devel] Re: dm-ioband: Test results.

Hi,

From: Vivek Goyal <[email protected]>
Subject: [dm-devel] Re: dm-ioband: Test results.
Date: Tue, 21 Apr 2009 10:16:07 -0400

> On Tue, Apr 21, 2009 at 09:57:23AM -0400, Mike Snitzer wrote:
> > On Tue, Apr 21 2009 at 8:10am -0400,
> > Ryo Tsuruta <[email protected]> wrote:
> >
> > > Hi Nauman,
> > >
> > > > > The real question is, once you create a version of dm-ioband that
> > > > > co-operates with CFQ scheduler, how that solution would compare with
> > > > > the patch set Vivek has posted? In my opinion, we need to converge to
> > > > > one solution as soon as possible, so that we can work on it together
> > > > > to refine and test it.
> > > >
> > > > I think I can do some help for your work. but I want to continue the
> > > > development of dm-ioband, because dm-ioband actually works well and
> > > > I think it has some advantages against other IO controllers.
> > > > - It can use without cgroup.
> > > > - It can control bandwidth on a per partition basis.
> > > > - The driver module can be replaced without stopping the system.
> > >
> > > In addition, dm-ioband can run on the RHEL5.
> >
> > RHEL5 compatibility does not matter relative to merging an I/O bandwidth
> > controller upstream. So both the "can [be] use without cgroup" and "can
> > run on RHEL5" features do not help your cause of getting dm-ioband
> > merged upstream. In fact these features serve as distractions.
>
> Exactly. I don't think that "it can be used without cgroup" is a feature
> or advantage. To me it is a disadvantage and should be fixed. cgroup is
> standard mechanism to group tasks arbitrarily and we should use that to make
> things working instead of coming up with own ways of grouping things and
> terming it as advantage.
>
> What do you mean by "The driver module can be replaced without stopping
> the system"? I guess you mean that one does not have to reboot the system
> to remove ioband device? So if one decides to not use the cgroup, then
> one shall have to remove the ioband devices, remount the filesystems and
> restart the application?

Device-mapper has a feature that can replace an intermediate mapping without
unmounting the device, like the following.

--------------------- ---------------------
| /mnt | | /mnt |
|---------------------| |---------------------|
| /dev/mapper/ioband1 | | /dev/mapper/ioband1 |
|---------------------| |---------------------|
| dm-ioband | <==> | dm-linear |
|---------------------| |---------------------|
| /dev/sda1 | | /dev/sda1 |
--------------------- ---------------------

So we can safely unload the dm-ioband module and update it.
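
For concreteness, the swap above maps onto standard dmsetup operations. A
minimal sketch, assuming the module is named dm-ioband and leaving out the
exact dm-ioband table line since its arguments are target-specific:

************************************************************
# Swap the live ioband1 mapping over to dm-linear without unmounting /mnt.
SECTORS=$(blockdev --getsz /dev/sda1)
dmsetup suspend ioband1                               # quiesce in-flight IO
dmsetup load ioband1 --table "0 $SECTORS linear /dev/sda1 0"
dmsetup resume ioband1                                # /mnt stays mounted
# With no users left, the module can be unloaded, updated and reloaded:
rmmod dm-ioband && modprobe dm-ioband
# A second load/resume with an ioband table puts the new target back in place.
************************************************************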

> With cgroup approach, if one does not want things to be classified, a user
> can simply move all the tasks to root group and things will be fine. No
> remounting, no application stopping etc. So this also does not look like
> an advantage instead sounds like an disadvantage.
>
> Thanks
> Vivek
>
> --
> dm-devel mailing list
> [email protected]
> https://www.redhat.com/mailman/listinfo/dm-devel

2009-04-22 15:32:22

by Mike Snitzer

[permalink] [raw]
Subject: Re: dm-ioband: Test results.

On Tue, Apr 21 2009 at 11:14pm -0400,
Ryo Tsuruta <[email protected]> wrote:

> Hi,
>
> From: Vivek Goyal <[email protected]>
> Subject: [dm-devel] Re: dm-ioband: Test results.
> Date: Tue, 21 Apr 2009 10:16:07 -0400
>
> > On Tue, Apr 21, 2009 at 09:57:23AM -0400, Mike Snitzer wrote:
> > > On Tue, Apr 21 2009 at 8:10am -0400,
> > > Ryo Tsuruta <[email protected]> wrote:
> > >
> > > > Hi Nauman,
> > > >
> > > > > > The real question is, once you create a version of dm-ioband that
> > > > > > co-operates with CFQ scheduler, how that solution would compare with
> > > > > > the patch set Vivek has posted? In my opinion, we need to converge to
> > > > > > one solution as soon as possible, so that we can work on it together
> > > > > > to refine and test it.
> > > > >
> > > > > I think I can do some help for your work. but I want to continue the
> > > > > development of dm-ioband, because dm-ioband actually works well and
> > > > > I think it has some advantages against other IO controllers.
> > > > > - It can use without cgroup.
> > > > > - It can control bandwidth on a per partition basis.
> > > > > - The driver module can be replaced without stopping the system.
> > > >
> > > > In addition, dm-ioband can run on the RHEL5.
> > >
> > > RHEL5 compatibility does not matter relative to merging an I/O bandwidth
> > > controller upstream. So both the "can [be] use without cgroup" and "can
> > > run on RHEL5" features do not help your cause of getting dm-ioband
> > > merged upstream. In fact these features serve as distractions.
> >
> > Exactly. I don't think that "it can be used without cgroup" is a feature
> > or advantage. To me it is a disadvantage and should be fixed. cgroup is
> > standard mechanism to group tasks arbitrarily and we should use that to make
> > things working instead of coming up with own ways of grouping things and
> > terming it as advantage.
> >
> > What do you mean by "The driver module can be replaced without stopping
> > the system"? I guess you mean that one does not have to reboot the system
> > to remove ioband device? So if one decides to not use the cgroup, then
> > one shall have to remove the ioband devices, remount the filesystems and
> > restart the application?
>
> Device-mapper has a feature that can replace an intermediate module
> without unmount the device like the following.
>
> --------------------- ---------------------
> | /mnt | | /mnt |
> |---------------------| |---------------------|
> | /dev/mapper/ioband1 | | /dev/mapper/ioband1 |
> |---------------------| |---------------------|
> | dm-ioband | <==> | dm-linear |
> |---------------------| |---------------------|
> | /dev/sda1 | | /dev/sda1 |
> --------------------- ---------------------
>
> So we can safely unload the dm-ioband module and update it.
>
> > With cgroup approach, if one does not want things to be classified, a user
> > can simply move all the tasks to root group and things will be fine. No
> > remounting, no application stopping etc. So this also does not look like
> > an advantage instead sounds like an disadvantage.
> >
> > Thanks
> > Vivek

Ryo,

Why is it that you repeatedly ignore concern/discussion about your
determination to continue using a custom grouping mechanism? It is this
type of excess layering that serves no purpose other than to facilitate
out-of-tree use-cases. dm-ioband would take a big step closer to being
merged upstream if you took others' feedback and showed more willingness
to work through the outstanding issues.

Mike

2009-04-27 10:30:21

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: dm-ioband: Test results.

Hi Mike,

> Why is it that you repeatedly ignore concern/discussion about your
> determination to continue using a custom grouping mechanism? It is this
> type of excess layering that serves no purpose other than to facilitate
> out-of-tree use-cases. dm-ioband would take a big step closer to being
> merged upstream if you took others' feedback and showed more willingness
> to work through the outstanding issues.

I think dm-ioband's approach is one simple way to handle cgroups, because
the current cgroup infrastructure has no way to manage a kernel module's
resources. Please tell me if you have any good ideas for how dm-ioband could
handle cgroups.

Thanks,
Ryo Tsuruta

2009-04-27 12:44:32

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: dm-ioband: Test results.

Hi Alasdair,

From: Mike Snitzer <[email protected]>
Subject: Re: dm-ioband: Test results.
Date: Wed, 22 Apr 2009 11:18:06 -0400

> Why is it that you repeatedly ignore concern/discussion about your
> determination to continue using a custom grouping mechanism? It is this
> type of excess layering that serves no purpose other than to facilitate
> out-of-tree use-cases. dm-ioband would take a big step closer to being
> merged upstream if you took others' feedback and showed more willingness
> to work through the outstanding issues.

Could you please let me know your opinion about merging dm-ioband upstream?
What do I have to do in order to get dm-ioband merged?

Thanks,
Ryo Tsuruta

2009-04-27 13:03:59

by Mike Snitzer

[permalink] [raw]
Subject: Re: dm-ioband: Test results.

On Mon, Apr 27 2009 at 6:30am -0400,
Ryo Tsuruta <[email protected]> wrote:

> Hi Mike,
>
> > Why is it that you repeatedly ignore concern/discussion about your
> > determination to continue using a custom grouping mechanism? It is this
> > type of excess layering that serves no purpose other than to facilitate
> > out-of-tree use-cases. dm-ioband would take a big step closer to being
> > merged upstream if you took others' feedback and showed more willingness
> > to work through the outstanding issues.
>
> I think dm-ioband's approach is one simple way to handle cgroup
> because the current cgroup has no way to manage kernel module's
> resources. Please tell me if you have any good ideas to handle
> cgroup by dm-ioband.

If you'd like to keep dm-ioband modular, then I'd say the appropriate cgroup
interfaces need to be exposed for module use (symbols exported, etc.). No
other controller has needed to be modular, but if you think it is a
requirement for dm-ioband (to facilitate updates, etc.) then I have to
believe it is doable.

Mike