Date: Wed, 15 Apr 2009 00:37:59 -0400
From: Vivek Goyal
To: Ryo Tsuruta
Cc: agk@redhat.com, dm-devel@redhat.com, linux-kernel@vger.kernel.org,
    Jens Axboe, Fernando Luis Vázquez Cao, Nauman Rafique, Jeff Moyer,
    Balbir Singh
Subject: Re: dm-ioband: Test results.
Message-ID: <20090415043759.GA8349@redhat.com>
In-Reply-To: <20090413.130552.226792299.ryov@valinux.co.jp>

On Mon, Apr 13, 2009 at 01:05:52PM +0900, Ryo Tsuruta wrote:
> Hi Alasdair and all,
>
> I did more tests on dm-ioband and I've posted the test items and
> results on my website. The results are very good.
> http://people.valinux.co.jp/~ryov/dm-ioband/test/test-items.xls
>
> I hope someone will test dm-ioband and report back to the dm-devel
> mailing list.
>

Hi Ryo,

I have been able to take your patch for the 2.6.30-rc1 kernel and have
started doing some testing for reads. Hopefully you will provide the
bio-cgroup patches soon so that I can do some write testing as well.

At the beginning of the mail I am listing some basic test results, and
in the later part I am raising some of my concerns with this patchset.

My test setup:
--------------
I have one SATA drive with two partitions, /dev/sdd1 and /dev/sdd2, and
I have created ext3 file systems on these partitions. I created one
ioband device "ioband1" with weight 40 on /dev/sdd1 and another ioband
device "ioband2" with weight 10 on /dev/sdd2.

1) I think an RT task within a group does not get its fair share (all
the BW available as long as the RT task is backlogged).

I launched one RT read task of a 2G file in the ioband1 group and in
parallel launched more readers in the ioband1 group. The ioband2 group
did not have any IO going.
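Roughly, the setup and the test look like the following (the ioband
table lines follow the example in the dm-ioband documentation; the
mount points, file names and reader counts are just illustrative):

  # create the two ioband devices, weight 40 and weight 10
  echo "0 $(blockdev --getsize /dev/sdd1) ioband /dev/sdd1 1 0 0 none" \
       "weight 0 :40" | dmsetup create ioband1
  echo "0 $(blockdev --getsize /dev/sdd2) ioband /dev/sdd2 1 0 0 none" \
       "weight 0 :10" | dmsetup create ioband2
  mount /dev/mapper/ioband1 /mnt/io1
  mount /dev/mapper/ioband2 /mnt/io2

  # 1 RT (class 1, prio 0) reader + N BE (class 2, prio 4) readers,
  # all reading 2G files through ioband1; ioband2 stays idle
  ionice -c1 -n0 dd if=/mnt/io1/file0 of=/dev/null bs=1M &
  for i in 1 2 3; do
          ionice -c2 -n4 dd if=/mnt/io1/file$i of=/dev/null bs=1M &
  done
  wait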
Following are the results with and without dm-ioband.

A) 1 RT prio 0 + 1 BE prio 4 reader

dm-ioband
2147483648 bytes (2.1 GB) copied, 39.4701 s, 54.4 MB/s
2147483648 bytes (2.1 GB) copied, 71.8034 s, 29.9 MB/s

without-dm-ioband
2147483648 bytes (2.1 GB) copied, 35.3677 s, 60.7 MB/s
2147483648 bytes (2.1 GB) copied, 70.8214 s, 30.3 MB/s

B) 1 RT prio 0 + 2 BE prio 4 readers

dm-ioband
2147483648 bytes (2.1 GB) copied, 43.8305 s, 49.0 MB/s
2147483648 bytes (2.1 GB) copied, 135.395 s, 15.9 MB/s
2147483648 bytes (2.1 GB) copied, 136.545 s, 15.7 MB/s

without-dm-ioband
2147483648 bytes (2.1 GB) copied, 35.3177 s, 60.8 MB/s
2147483648 bytes (2.1 GB) copied, 124.793 s, 17.2 MB/s
2147483648 bytes (2.1 GB) copied, 126.267 s, 17.0 MB/s

C) 1 RT prio 0 + 3 BE prio 4 readers

dm-ioband
2147483648 bytes (2.1 GB) copied, 48.8159 s, 44.0 MB/s
2147483648 bytes (2.1 GB) copied, 185.848 s, 11.6 MB/s
2147483648 bytes (2.1 GB) copied, 188.171 s, 11.4 MB/s
2147483648 bytes (2.1 GB) copied, 189.537 s, 11.3 MB/s

without-dm-ioband
2147483648 bytes (2.1 GB) copied, 35.2928 s, 60.8 MB/s
2147483648 bytes (2.1 GB) copied, 169.929 s, 12.6 MB/s
2147483648 bytes (2.1 GB) copied, 172.486 s, 12.5 MB/s
2147483648 bytes (2.1 GB) copied, 172.817 s, 12.4 MB/s

D) 1 RT prio 0 + 4 BE prio 4 readers

dm-ioband
2147483648 bytes (2.1 GB) copied, 51.4279 s, 41.8 MB/s
2147483648 bytes (2.1 GB) copied, 260.29 s, 8.3 MB/s
2147483648 bytes (2.1 GB) copied, 261.824 s, 8.2 MB/s
2147483648 bytes (2.1 GB) copied, 261.981 s, 8.2 MB/s
2147483648 bytes (2.1 GB) copied, 262.372 s, 8.2 MB/s

without-dm-ioband
2147483648 bytes (2.1 GB) copied, 35.4213 s, 60.6 MB/s
2147483648 bytes (2.1 GB) copied, 215.784 s, 10.0 MB/s
2147483648 bytes (2.1 GB) copied, 218.706 s, 9.8 MB/s
2147483648 bytes (2.1 GB) copied, 220.12 s, 9.8 MB/s
2147483648 bytes (2.1 GB) copied, 220.57 s, 9.7 MB/s

Notice that with dm-ioband, as the number of readers increases, the
finish time of the RT task also increases. But without dm-ioband the
finish time of the RT task remains more or less constant even as the
number of readers grows.

For some reason the overall throughput also seems to be lower with
dm-ioband. Because ioband2 is not doing any IO, I expected that the
tasks in ioband1 would get the full disk BW and throughput would not
drop.

I have not debugged it, but I guess it might be coming from the fact
that there are no separate queues for RT tasks. bios from all the tasks
can be buffered on a single queue in a cgroup, and that might be causing
RT requests to hide behind BE tasks' requests.

General thoughts about dm-ioband
================================
- Implementing control at the second level has the advantage that one
does not have to muck with IO scheduler code. But then it also has the
disadvantage that there is no communication with the IO scheduler.

- dm-ioband is buffering bios at a higher layer and then doing FIFO
release of these bios. This FIFO release can lead to priority inversion
problems in certain cases where RT requests are way behind BE requests,
or to reader starvation where reader bios get hidden behind writer bios,
etc. These issues are hard to notice from user space. I guess the RT
results above do highlight the RT task problem. I am still working on
other test cases to see if I can show the problem.

- dm-ioband does this extra grouping logic using dm messages. Why is the
cgroup infrastructure not sufficient to meet your needs, like grouping
tasks based on uid etc.? I think we should get rid of all the extra
grouping logic and just use cgroups for grouping information.
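To make the above concrete: today the per-group setup goes through
dmsetup messages. If I am reading the dm-ioband documentation right,
cgroup-based grouping looks something like this (the bio-cgroup ids and
weights below are made up):

  # switch ioband1 to cgroup-based grouping and register two groups
  # by their bio-cgroup ids, with weights 80 and 20
  dmsetup message ioband1 0 type cgroup
  dmsetup message ioband1 0 attach 2
  dmsetup message ioband1 0 attach 3
  dmsetup message ioband1 0 weight 2:80
  dmsetup message ioband1 0 weight 3:20

So for every cgroup one has to dig out the bio-cgroup id and feed it to
the dm device by hand, which is the kind of extra step I mean.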
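What I would expect instead is that grouping and weights come straight
from the cgroup filesystem. Purely as a hypothetical illustration
(neither the "bio" subsystem mount option nor the "bio.weight" file
exists in this form in the current patchset), something like:

  # hypothetical cgroup-only configuration, no dm messages needed
  mount -t cgroup -o bio none /cgroup
  mkdir /cgroup/group1 /cgroup/group2
  echo 40 > /cgroup/group1/bio.weight    # made-up weight knob
  echo 10 > /cgroup/group2/bio.weight
  echo $$ > /cgroup/group1/tasks         # move this shell into group1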
- Why do we need to specify bio-cgroup ids to dm-ioband externally with
the help of dm messages? A user should be able to just create the
cgroups, put the tasks in the right cgroup, and then everything should
just work.

- Why do we have to put another dm-ioband device on top of every
partition or existing device-mapper device to control it? Is it possible
to do this control in the make_request function of the request queue so
that we don't end up creating additional dm devices? I had posted a
crude RFC patch as a proof of concept but did not continue the
development because of the fundamental issue of FIFO release of buffered
bios.

http://lkml.org/lkml/2008/11/6/227

Can you please have a look and provide feedback about why we cannot go
in the direction of the above patches and why we need to create an
additional dm device?

I think in its current form dm-ioband is hard to configure, and we
should look for ways to simplify the configuration.

- I personally think that even group IO scheduling should be done at the
IO scheduler level, and we should not break IO scheduling down into two
parts where group scheduling is done by a higher-level IO scheduler
sitting in the dm layer and IO scheduling among tasks within a group is
done by the actual IO scheduler. But this also means more work, as one
has to muck around with the core IO schedulers to make them cgroup aware
and also make sure existing functionality is not broken. I posted the
patches here.

http://lkml.org/lkml/2009/3/11/486

Can you please let us know why the IO scheduler based approach does not
work for you?

Jens, it would be nice to hear your opinion about two-level vs one-level
control. Do you think that the common-layer approach is the way to go,
where one can control things more tightly, or is FIFO release of bios
from a second-level controller fine, so that we can live with this
additional serialization in the layer just above the IO scheduler?

- There is no notion of RT cgroups. So even if one wants to run an RT
task in the root cgroup to make sure it gets full access to the disk, it
can't do that. It has to share the BW with other competing groups.

- dm-ioband controls the amount of IO done per second. Will a seeky
process not run away with more disk time? Additionally, at the group
level we will provide fairness in terms of the amount of IO (number of
blocks transferred etc.), while within a group CFQ will try to provide
fairness in terms of disk access time slices. I don't even know whether
it is a matter of concern or not. I was thinking that one uniform policy
over the hierarchical scheduling tree would probably have been better.
Just thinking out loud.....

Thanks
Vivek

> Alasdair, could you please merge dm-ioband into upstream? Or could
> you please tell me why dm-ioband can't be merged?
>
> Thanks,
> Ryo Tsuruta
>
> To know the details of dm-ioband:
> http://people.valinux.co.jp/~ryov/dm-ioband/
>
> RPM packages for RHEL5 and CentOS5 are available:
> http://people.valinux.co.jp/~ryov/dm-ioband/binary.html