Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757048AbZAVQPL (ORCPT ); Thu, 22 Jan 2009 11:15:11 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1753700AbZAVQO5 (ORCPT ); Thu, 22 Jan 2009 11:14:57 -0500 Received: from mx2.redhat.com ([66.187.237.31]:36432 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751161AbZAVQO4 (ORCPT ); Thu, 22 Jan 2009 11:14:56 -0500 Date: Thu, 22 Jan 2009 11:12:18 -0500 From: Vivek Goyal To: Ryo Tsuruta , dm-devel@redhat.com, Alasdair G Kergon Cc: linux kernel mailing list , containers@lists.linux-foundation.org, Nauman Rafique , dpshah@google.com, lizf@cn.fujitsu.com, mikew@google.com, fchecconi@gmail.com, paolo.valente@unimore.it, jens.axboe@oracle.com, fernando@intellilink.co.jp, s-uchida@ap.jp.nec.com, taka@valinux.co.jp, guijianfeng@cn.fujitsu.com, arozansk@redhat.com, jmoyer@redhat.com, Rik Van Riel , Peter Zijlstra , Paul Menage , Balbir Singh , Dhaval Giani , Chris Wright Subject: Re: [dm-devel] [PATCH 1/2] dm-ioband: I/O bandwidth controller v1.10.0: Source code and patch Message-ID: <20090122161218.GA28795@redhat.com> References: <20090120.141022.193708059.ryov@valinux.co.jp> <20090120.141114.226778416.ryov@valinux.co.jp> <20090120155334.GH9859@agk.fab.redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090120155334.GH9859@agk.fab.redhat.com> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 7785 Lines: 171 On Tue, Jan 20, 2009 at 03:53:34PM +0000, Alasdair G Kergon wrote: > So, what needs to be reviewed? > > > 1. General style/layout/naming cleanup. > - It's pretty good compared to a lot of patches that get sent, but there are > still a few things we can improve. > > Lindent is throwing up some stuff (but remember it doesn't get things > perfect so don't make all the changes it recommends). > Remove double blank lines, plenty of unnecessary braces, "unsigned int" -> > "unsigned". > Review the names - should a few more things get a DM_ or dm_ prefix? > Are all the names consistent and as self-explanatory as they can reasonably be? > (e.g. a variable called 'new' - new what?) > > > 2. A high-level review. > Review the documentation and what the code does and how it does it. > - Does it make sense to add this to the kernel in this way? > CCing lkml, containers mailing list and other folks who might be interested in the thread. It is a long mail. You have been warned. :-) Here are some of my thoughts and also summary of some of the past discussions about dm-ioband either on lkml or off lkml. Following is one of the relevant link of past discussion on lkml about this. http://lkml.org/lkml/2008/11/6/227 At this point of time looks like there are two schools of thought regarding how IO controller should be implemented. The first one believes (dm-ioband) that io controller should be implemented as 2 level approach where higher level of IO control is done by this dm-ioband driver and lower level of scheduling is done by elevator/iosched code (noop, deadline, AS and cfq). Second school of thought (me, nauman from google and may be others) belive that introducing another level of IO control at higher layer breaks the assumptions of lower level scheduling hence we should be doing IO control and IO scheduing both in single layer and that is at elevator layer. Before I dive into details how assumptions are broken, let me discuss requirment part a bit. We seem to be differing on requirement part also. I think that we need IO control only at the point where real contetion is and not on every logical block device where there is no real contetion for the resource. Real contention for the resource is at the end node physical device where device is slow and then arises the need of some kind of resource control. I am not very sure why dm-ioband folks want to enable IO control on any xyz block device but in the past I got two responses. 1. Need to control end devices which don't have any elevator attached. 2. Need to do IO control for devices which are effectively network backed. for example, an NFS mounted file loop mounted as a block device. I don't fully understand the first requirement. Which are the device drviers that don't use any of the standard ioschedulers? I am not aware of any in-kernel drviers and I am assuming it will be binary drivers. If that's the case, then those binary drviers need to be modified to take advantange of IO control provided by elevator layer. Regarding the second requirement I think this sounds more like a network controller issue. Again the real contention is at network layer and not at logical block device. So at this point of time my understanding is that most common case for IO resource control is at the end devices in the system and it can be controlled by one level of IO control and scheudling. Please correct me if that's not the case from requirement point of view. Having said that even if we really find genuine cases where we need to control IO on any xyz block device, then we should be able to come up with generic IO controller which can reuse some of the code from 1 level controller. I am not against that and I think probably 1 level IO controller and a generic IO controller can co-exist. But there are few points which I find little odd about dm-ioband. Why generic IO controller is not good for every case ==================================================== To my knowledge, there have been two generic controller implementations. One is dm-ioband and other is an RFC patch by me. Following is the link. http://lkml.org/lkml/2008/11/6/227 The biggest issue with generic controller is that they can buffer the bio's at higher layer (once a cgroup is backed up) and then later release those bios in FIFO manner. This can conflict with unerlying IO scheduler's assumptions. Following example comes to mind. - If there is one task of io priority 0 in a cgroup and rest of the tasks are of io prio 7. All the tasks belong to best effort class. If tasks of lower priority (7) do lot of IO, then due to buffering there is a chance that IO from lower prio tasks is seen by CFQ first and io from higher prio task is not seen by cfq for quite some time hence that task not getting it fair share with in the cgroup. Similar situation can arise with RT tasks also. Some of the issues with dm-ioband implementation =============================================== - Breaks the assumptions of underlying IO schedulers. - There is no notion of task classes. So tasks of all the classes are at same level from resource contention point of view. The only thing which differentiates them is cgroup weight. Which does not answer the question that an RT task or RT cgroup should starve the peer cgroup if need be as RT cgroup should get priority access. - Because of FIFO release of buffered bios, it is possible that task of lower priority gets more IO done than the task of higher priority. - Task grouping logic - We already have the notion of cgroup where tasks can be grouped in hierarhical manner. dm-ioband does not make full use of that and comes up with own mechansim of grouping tasks (apart from cgroup). And there are odd ways of specifying cgroup id which configuring the dm-ioband device. I think once somebody has created the cgroup hieararchy, any IO controller logic should be able to internally read that hiearchy and provide control. There should not be need of any other configuration utity on top of cgroup. My RFC patches had done that. - Need of a dm device for every device we want to control - This requirement looks odd. It forces everybody to use dm-tools and if there are lots of disks in the system, configuation is pain. - Does it support hiearhical grouping? - I have not looked very closely at dm-ioband patches about this and had asked ryo a question about this (no response). Ryo does, dm-ioband support hierarhical grouping configuration? Summary ======= - IMHO, for common case we don't need a generic IO controller and by implementing an IO controller at elevator layer with close coupling to io schedulers, we should be able to achive the goal. Currently there is work in progress (off the list) by me, nauman, Fabio, Paolo and others to implement a common IO control layer which can be used by all the four IO schedulers without too much of code duplication. Hopefully in next 2-3 weeks we should be able to post the initial patches for RFC. - Even if there are cases for controlling a xyz block device, we can have a generic io controller also to cover that case. Ideally this controller should not be used by devices which use standard io schedulers. IMHO, dm-ioband as few odd points as mentioned above when it comes to generic controller and I think those should be addressed if we can really justify the need of a generic IO controller. Your comments are welcome. Thanks Vivek -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/