Subject: Re: RFC: I/O bandwidth controller (was Re: Too many I/O controller patches)
From: Fernando Luis Vázquez Cao
To: balbir@linux.vnet.ibm.com
Cc: Dave Hansen, xen-devel@lists.xensource.com, uchida@ap.jp.nec.com,
    containers@lists.linux-foundation.org, linux-kernel@vger.kernel.org,
    virtualization@lists.linux-foundation.org, dm-devel@redhat.com,
    agk@sourceware.org, ngupta@google.com, Andrea Righi
Organization: NTT Open Source Software Center
Date: Thu, 07 Aug 2008 11:44:10 +0900

On Wed, 2008-08-06 at 22:12 +0530, Balbir Singh wrote:
> > 1. & 2.- Cgroups-aware I/O scheduling (being able to define arbitrary
> > groupings of processes and treat each group as a single scheduling
> > identity)
> >
> > We obviously need this because our final goal is to be able to control
> > the IO generated by a Linux container.
> > The good news is that we already
> > have the cgroups infrastructure so, regarding this problem, we would
> > just have to transform our I/O bandwidth controller into a cgroup
> > subsystem.
> >
> > This seems to be the easiest part, but the current cgroups
> > infrastructure has some limitations when it comes to dealing with block
> > devices: impossibility of creating/removing certain control structures
> > dynamically and hardcoding of subsystems (i.e. resource controllers).
> > This makes it difficult to handle block devices that can be hotplugged
> > and go away at any time (this applies not only to usb storage but also
> > to some SATA and SCSI devices). To cope with this situation properly we
> > would need hotplug support in cgroups, but, as suggested before and
> > discussed in the past (see (0) below), there are some limitations.
> >
> > Even in the non-hotplug case it would be nice if we could treat each
> > block I/O device as an independent resource, which means we could do
> > things like allocating I/O bandwidth on a per-device basis. As long as
> > performance is not compromised too much, adding some kind of basic
> > hotplug support to cgroups is probably worth it.
> >
>
> Won't that get too complex. What if the user has thousands of disks with several
> partitions on each?

As Dave pointed out, I just think that we should allow each disk to be
treated separately. To avoid the administration nightmare you mention,
adding block device grouping capabilities should suffice to solve most
of the issues.

> > 6.- I/O tracking
> >
> > This is arguably the most important part, since to perform I/O control
> > we need to be able to determine where the I/O is coming from.
> >
> > Reads are trivial because they are served in the context of the task
> > that generated the I/O. But most writes are performed by pdflush,
> > kswapd, and friends so performing I/O control just in the synchronous
> > I/O path would lead to large inaccuracy.
> > To get this right we would need
> > to track ownership all the way up to the pagecache page. In other words,
> > it is necessary to track who is dirtying pages so that when they are
> > written to disk the right task is charged for that I/O.
> >
> > Fortunately, such tracking of pages is one of the things the existing
> > memory resource controller is doing to control memory usage. This is a
> > clever observation which has a useful implication: if the rather
> > imbricated tracking and accounting parts of the memory resource
> > controller were split the I/O controller could leverage the existing
> > infrastructure to track buffered and asynchronous I/O. This is exactly
> > what the bio-cgroup (see (6) below) patches set out to do.
> >
>
> Are you suggesting that the IO and memory controller should always be bound
> together?

That is a really good question. The I/O tracking patches split the
memory controller into two functional parts: (1) page tracking and (2)
memory accounting/cgroup policy enforcement. By doing so the memory
controller specific code can be separated from the rest, which,
admittedly, will not benefit the memory controller a great deal but,
hopefully, we can get cleaner code that is easier to maintain. The
important thing, though, is that with this separation the page tracking
bits can be easily reused by any subsystem that needs to keep track of
pages, and the I/O controller is certainly one such candidate.

Synchronous I/O is easy to deal with because everything is done in the
context of the task that generated the I/O, but buffered I/O and
asynchronous I/O are problematic. However, with the observation that the
owner of an I/O request happens to be the owner of the pages the I/O
buffers of that request reside in, it becomes clear that pdflush and
friends could use that information to determine who the originator of
the I/O is and handle the I/O request accordingly.

Going back to your question, with the current I/O tracking patches the
I/O controller would be bound to the page tracking functionality of
cgroups (page_cgroup), not the memory controller. We would not even need
to compile the memory controller. The dependency on cgroups would still
be there though.

As an aside, I guess that with some effort we could get rid of this
dependency by providing some basic tracking capabilities even when the
cgroups infrastructure is not being used. By doing so traditional I/O
schedulers such as CFQ could benefit from proper I/O tracking
capabilities without using cgroups. Of course, if the kernel has cgroups
support compiled in, the cgroups I/O tracking would be used instead (this
idea was inspired by CFS' group scheduling, which works both with and
without cgroups support). I am currently trying to implement this.

> > (6) Tsuruta-san and Takahashi-san's I/O tracking patches: http://lkml.org/lkml/2008/8/4/90
> > *** How to move on
> >
> > As discussed before, it probably makes sense to have both a block layer
> > I/O controller and a elevator-based one, and they could certainly
> > cohabitate. As discussed before, all of them need I/O tracking
> > capabilities so I would like to suggest the plan below to get things
> > started:
> >
> > - Improve the I/O tracking patches (see (6) above) until they are in
> > mergeable shape.
>
> Yes, I agree with this step as being the first step. May be extending the
> current task I/O accounting to cgroups could be done as a part of this.

Yes, makes sense.

> > - Fix CFQ and AS to use the new I/O tracking functionality to show its
> > benefits. If the performance impact is acceptable this should suffice to
> > convince the respective maintainer and get the I/O tracking patches
> > merged.
> > - Implement a block layer resource controller.
> > dm-ioband is a working
> > solution and feature rich but its dependency on the dm infrastructure is
> > likely to find opposition (the dm layer does not handle barriers
> > properly and the maximum size of I/O requests can be limited in some
> > cases). In such a case, we could either try to build a standalone
> > resource controller based on dm-ioband (which would probably hook into
> > generic_make_request) or try to come up with something new.
> > - If the I/O tracking patches make it into the kernel we could move on
> > and try to get the Cgroup extensions to CFQ and AS mentioned before (see
> > (1), (2), and (3) above for details) merged.
> > - Delegate the task of controlling the rate at which a task can
> > generate dirty pages to the memory controller.
> >
> > This RFC is somewhat vague but my feeling is that we build some
> > consensus on the goals and basic design aspects before delving into
> > implementation details.
> >
> > I would appreciate your comments and feedback.
>
> Very nice summary

Thank you!