Date: Mon, 28 Sep 2009 23:22:55 -0400
From: Vivek Goyal
To: Nauman Rafique
Cc: linux-kernel@vger.kernel.org, jens.axboe@oracle.com,
	containers@lists.linux-foundation.org, dm-devel@redhat.com,
	dpshah@google.com, lizf@cn.fujitsu.com, mikew@google.com,
	fchecconi@gmail.com, paolo.valente@unimore.it, ryov@valinux.co.jp,
	fernando@oss.ntt.co.jp, s-uchida@ap.jp.nec.com, taka@valinux.co.jp,
	guijianfeng@cn.fujitsu.com, jmoyer@redhat.com,
	dhaval@linux.vnet.ibm.com, balbir@linux.vnet.ibm.com,
	righi.andrea@gmail.com, m-ikeda@ds.jp.nec.com, agk@redhat.com,
	akpm@linux-foundation.org, peterz@infradead.org, jmarchan@redhat.com,
	torvalds@linux-foundation.org, mingo@elte.hu, riel@redhat.com,
	yoshikawa.takuya@oss.ntt.co.jp
Subject: Re: IO scheduler based IO controller V10
Message-ID: <20090929032255.GA10664@redhat.com>
References: <1253820332-10246-1-git-send-email-vgoyal@redhat.com>

On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:
> Hi Vivek,
> Me, Divyesh, Fernando and Yoshikawa had a chance to have a chat with
> Jens about IO controller during Linux Plumbers Conference '09. Jens
> expressed his concerns about the size and complexity of the patches. I
> believe that is a reasonable concern.
> We talked about things that
> could be done to reduce the size of the patches. The requirement that
> the "solution has to work with all IO schedulers" seems like a
> secondary concern at this point; and it came out as one thing that can
> help to reduce the size of the patch set.

Initially doing cgroup based IO control only for CFQ should help a lot
in reducing the patchset size.

> Another possibility is to
> use a simpler scheduling algorithm e.g. weighted round robin, instead
> of BFQ scheduler. BFQ indeed has great properties, but we cannot deny
> the fact that it is complex to understand, and might be cumbersome to
> maintain.

I have already gotten rid of the core of BFQ. The remaining part is the
idle tree and data structures. I will see how I can simplify it further.

> Also, hierarchical scheduling is something that could be
> unnecessary in the first set of patches, even though cgroups are
> hierarchical in nature.

Sure. Though I don't think that a lot of code is there because of the
hierarchical nature. If we solve the issue at the CFQ layer, we have to
maintain at least two levels: one for queues and the other for groups.
So even the simplest solution becomes almost hierarchical in nature. But
I will still see how to get rid of some code here too...

>
> We are starting from a point where there is no cgroup based IO
> scheduling in the kernel. And it is probably not reasonable to satisfy
> all IO scheduling related requirements in one patch set. We can start
> with something simple, and build on top of that. So a very simple
> patch set that enables cgroup based proportional scheduling for CFQ
> seems like the way to go at this point.

Sure, we can start with CFQ only. But a bigger question we need to
answer is whether CFQ is the right place to solve the issue. Jens, do
you think that CFQ is the right place to solve the problem?

Andrew seems to favor a high level approach so that IO schedulers are
less complex and we can provide fairness at high level logical devices
also.

I will again try to summarize my understanding so far about the
pros/cons of each approach and then we can take the discussion forward.

Fairness in terms of size of IO or disk time used
=================================================
On seeky media, fairness in terms of disk time can get us better
results than fairness in terms of size of IO or number of IOs.

If we implement some kind of time based solution at a higher layer,
then that higher layer should know how much time each group used. We
can probably do some kind of timestamping in the bio to get a sense of
when it got into the disk and when it finished. But on multi-queue
hardware there can be multiple requests in the disk, either from the
same queue or from different queues, and with a pure timestamping based
approach I could not so far think of how, at a high level, we would get
an idea of who used how much time.

So this is the first point of contention: how do we want to provide
fairness? In terms of disk time used, or in terms of size of IO/number
of IOs?

Max bandwidth Controller or Proportional bandwidth controller
=============================================================
What is our primary requirement here? A weight based proportional
bandwidth controller, where we can use the resources optimally and any
kind of throttling kicks in only if there is contention for the disk?

Or do we want max bandwidth control, where a group is not allowed to
exceed its limit even if the disk is free?

Or do we need both? I would think that at some point we will need both,
but we can start with proportional bandwidth control first.

Fairness for higher level logical devices
=========================================
Do we want good fairness numbers for higher level logical devices also,
or is it sufficient to provide fairness at leaf nodes? Providing
fairness at leaf nodes can help us use the resources optimally, and in
the process we can get fairness at the higher level also in many cases.
But do we want strict fairness numbers on higher level logical devices
even if it means sub-optimal usage of the underlying physical devices?

I think that for proportional bandwidth control, it should be OK to
provide fairness at the leaf nodes, but for max bandwidth control it
might make more sense to provide fairness at the higher level. Consider
a case where, on a striped device, a customer wants to limit a group to
30MB/s. With leaf node control, if every leaf node provides 30MB/s, it
might accumulate to much more than the specified rate at the logical
device.

Latency Control and strong isolation between groups
===================================================
Do we want better latencies and stronger isolation between groups?

I think if the problem is solved at the IO scheduler level, we can
achieve better latency control and hence stronger isolation between
groups.

Higher level solutions should find it hard to provide the same kind of
latency control and isolation between groups as an IO scheduler based
solution.

Fairness for buffered writes
============================
Doing io control at any place below the page cache has the disadvantage
that the page cache might not dispatch more writes from a higher weight
group, hence the higher weight group might not see more IO done. Andrew
says that we don't have a solution to this problem in the kernel and he
would like to see it handled properly.

The only way to solve this seems to be to slow down the writers before
they write into the page cache. The IO throttling patch handled it by
slowing down a writer if it crossed the max specified rate. Other
suggestions have come in the form of a dirty_ratio per memory cgroup, or
a separate cgroup controller altogether where some kind of per group
write limit can be specified.

So whether the solution is implemented at the IO scheduler layer or at
the device mapper layer, both will have to rely on another controller
being co-mounted to handle buffered writes properly.
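
To put numbers on the striped device case from the logical device
section above, here is a rough sketch. Only the 30MB/s limit comes from
that example; the 4-disk stripe, the 50 MB/s of demand per disk and the
function names are assumptions made up for illustration, and this is not
code from any of the patch sets.

```python
def aggregate_with_leaf_caps(cap_mb_s, leaf_demands_mb_s):
    """Rate seen at the logical device when every leaf enforces the cap on its own."""
    return sum(min(cap_mb_s, demand) for demand in leaf_demands_mb_s)


def aggregate_with_logical_cap(cap_mb_s, leaf_demands_mb_s):
    """Rate seen at the logical device when the cap is enforced once, above the leaves."""
    return min(cap_mb_s, sum(leaf_demands_mb_s))


# Hypothetical setup: a group limited to 30 MB/s on a 4-disk stripe,
# able to drive 50 MB/s to each underlying disk.
demands = [50, 50, 50, 50]
leaf_total = aggregate_with_leaf_caps(30, demands)
logical_total = aggregate_with_logical_cap(30, demands)
print(leaf_total, logical_total)
```

With these made-up numbers the group gets four times the intended rate
under leaf node control, while a limit applied at the logical device
holds it to the specified 30 MB/s.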

Fairness within group
=====================
One of the issues with a higher level controller is how to do fair
throttling so that fairness within a group is not impacted, especially
making sure that we don't break the notion of ioprio of the processes
within the group.

The io throttling patch in particular was very bad in terms of prio
within a group: throttling treated everyone equally and the difference
between process prios disappeared.

Reads Vs Writes
===============
A higher level control will most likely change the ratio in which reads
and writes are dispatched to disk within a group. So far that has been
decided by the IO scheduler, but with higher level groups doing
throttling and possibly buffering the bios and releasing them later,
they will have to come up with their own policy on the proportion in
which reads and writes should be dispatched. In case of IO scheduler
based control, all the queuing takes place at the IO scheduler and it
still retains control of the ratio in which reads and writes are
dispatched.

Summary
=======

- An io scheduler based io controller can provide better latencies,
  stronger isolation between groups and time based fairness, and will
  not interfere with io scheduler policies like class, ioprio and
  reader vs writer issues.

  But it can not guarantee fairness at higher level logical devices.
  Especially in case of max bw control, leaf node control does not
  sound like the most appropriate thing.

- IO throttling provides max bw control in terms of absolute rate. It
  has the advantage that it can provide control at higher level logical
  devices and also control buffered writes without the need for an
  additional co-mounted controller. But it does only max bw control and
  not proportional control, so one might not be using resources
  optimally. It loses the sense of task prio and class within a group,
  as any of the tasks can be throttled within the group.
  Because throttling does not kick in till you hit the max bw limit, it
  should find it hard to provide the same latencies as io scheduler
  based control.

- dm-ioband also has the advantage that it can provide fairness at
  higher level logical devices.

  But fairness is provided only in terms of size of IO or number of
  IOs. No time based fairness. It is very throughput oriented and does
  not throttle a high speed group if the other group is running a slow
  random reader. This results in bad latencies for the random reader
  group and weaker isolation between groups.

  Also it does not provide fairness if a group is not continuously
  backlogged. So if one is running 1-2 dd/sequential readers in the
  group, one does not get fairness until the workload is increased to a
  point where the group becomes continuously backlogged. This also
  results in poor latencies and limited fairness.

At this point it does not look like a single IO controller can cover
all the scenarios/requirements. This means a few things to me.

- Drop some of the requirements and go with one implementation which
  meets the reduced set of requirements.

- Have more than one IO controller implementation in the kernel. One
  for lower level control for better latencies, stronger isolation and
  optimal resource usage, and another one for fairness at higher level
  logical devices and max bandwidth control.

  And let the user decide which one to use based on his/her needs.

- Come up with a more intelligent way of doing IO control where a
  single controller covers all the cases.

At this point, I am more inclined towards option 2 of having more than
one implementation in the kernel. :-) (Until and unless we can
brainstorm and come up with ideas to make option 3 happen).

>
> It would be great if we discuss our plans on the mailing list, so we
> can get early feedback from everyone.

This is what comes to my mind so far. Please add to the list if I have
missed some points. Also correct me if I am wrong about the pros/cons
of the approaches.
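
As a rough illustration of the time based vs size based fairness point
in the summary, here is a toy model of two equal weight groups sharing
one disk. The rates (a sequential reader at 100 MB/s and a random
reader at 1 MB/s when each has the disk to itself) and the function
names are assumptions for the sketch. It also assumes the disk serves
one group at a time, which ignores queue depth.

```python
def size_fair(rates_mb_s):
    """Equal bytes per group: return (throughput MB/s, disk time share) per group."""
    # Every group moves the same x MB per second of wall time, and the
    # disk time fractions x / rate have to add up to 1.
    x = 1.0 / sum(1.0 / rate for rate in rates_mb_s)
    return [(x, x / rate) for rate in rates_mb_s]


def time_fair(rates_mb_s):
    """Equal disk time per group: return (throughput MB/s, disk time share) per group."""
    share = 1.0 / len(rates_mb_s)
    return [(rate * share, share) for rate in rates_mb_s]


# Hypothetical standalone rates: sequential reader, random reader.
rates = [100.0, 1.0]
by_size = size_fair(rates)
by_time = time_fair(rates)
for name, result in (("size", by_size), ("time", by_time)):
    for rate, (throughput, share) in zip(rates, result):
        print(f"{name} fair: {rate:6.1f} MB/s group gets "
              f"{throughput:6.2f} MB/s using {share:.1%} of disk time")
```

In this model equal sizes pull the sequential reader down to roughly
the random reader's rate, because the random reader ends up holding the
disk about 99% of the time. Equal time slices leave each group at half
of its standalone rate.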

Thoughts/ideas/opinions are welcome...

Thanks
Vivek