Date: Sat, 18 Apr 2009 21:35:16 -0700
Subject: Re: IO controller discussion (Was: Re: [PATCH 01/10] Documentation)
From: Nauman Rafique
To: Vivek Goyal
Cc: Andrea Righi, Andrew Morton, dpshah@google.com, lizf@cn.fujitsu.com,
	mikew@google.com, fchecconi@gmail.com, paolo.valente@unimore.it,
	jens.axboe@oracle.com, ryov@valinux.co.jp, fernando@intellilink.co.jp,
	s-uchida@ap.jp.nec.com, taka@valinux.co.jp, guijianfeng@cn.fujitsu.com,
	arozansk@redhat.com, jmoyer@redhat.com, oz-kernel@redhat.com,
	dhaval@linux.vnet.ibm.com, balbir@linux.vnet.ibm.com,
	linux-kernel@vger.kernel.org, containers@lists.linux-foundation.org,
	menage@google.com, peterz@infradead.org, John Wilkes

On Fri, Apr 17, 2009 at 7:13 AM, Vivek Goyal wrote:
> On Fri, Apr 17, 2009 at 11:37:28AM +0200, Andrea Righi wrote:
>> On Thu, Apr 16, 2009 at 02:37:53PM -0400, Vivek Goyal wrote:
>> > > I think it would be possible to implement both proportional and
>> > > limiting rules at the same level (e.g., the IO scheduler), but we
>> > > also need to address the memory consumption problem (I still need
>> > > to review your patchset in detail and I'm going to test it soon :),
>> > > so I don't know if you have already addressed this issue).
>> > >
>> >
>> > Can you please elaborate a bit on this? Are you concerned that the
>> > data structures created to solve the problem consume a lot of memory?
>>
>> Sorry, I was not very clear here. By memory consumption I mean wasting
>> memory on dirty pages that are hard/slow to reclaim, or on pending IO
>> requests.
>>
>> If there's only a global limit on dirty pages, any cgroup can exhaust
>> that limit and cause other cgroups/processes to block when they try to
>> write to disk.
>>
>> But, OK, the IO controller is probably not the best place to implement
>> such functionality. I should rework the per-cgroup dirty_ratio patchset:
>>
>> https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
>>
>> Last time we focused too much on the best interface for defining the
>> dirty page limit, and I never re-posted an updated version of this
>> patchset. Now I think we can simply provide the same
>> dirty_ratio/dirty_bytes interface that we provide globally, but per
>> cgroup.
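A per-cgroup dirty_ratio/dirty_bytes knob sounds reasonable to me. To
make the idea concrete, here is a toy userspace sketch of the check
such a controller would make on the dirtying path (all names and
numbers are made up for illustration; this is not the real memcg or
writeback code):

/* Toy model of a per-cgroup dirty page limit. */
#include <stdio.h>
#include <stdbool.h>

struct toy_cgroup {
	const char *name;
	unsigned long dirty_pages;	/* pages currently dirty */
	unsigned long dirty_limit;	/* per-group cap from dirty_bytes */
};

/*
 * What a per-group balance_dirty_pages() check might look like:
 * refuse (i.e. throttle the dirtier) once this group alone is over
 * its limit, independent of the global dirty threshold.
 */
static bool may_dirty_page(struct toy_cgroup *cg)
{
	if (cg->dirty_pages >= cg->dirty_limit)
		return false;	/* caller must throttle / start writeback */
	cg->dirty_pages++;
	return true;
}

int main(void)
{
	struct toy_cgroup a = { "A", 0, 3 };

	for (int i = 0; i < 5; i++)
		printf("A dirties page %d: %s\n", i,
		       may_dirty_page(&a) ? "ok" : "throttled");
	return 0;
}

The point is just that a group hits its own limit and gets throttled
long before the global dirty threshold comes into play, so one group
can no longer exhaust the global limit for everyone else.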
>> > > IOW if we simply don't dispatch requests and we don't throttle the
>> > > tasks in the cgroup that exceeds its limit, how do we avoid the
>> > > waste of memory due to the succeeding IO requests and the
>> > > increasingly dirty pages in the page cache (which are also hard to
>> > > reclaim)? I may be wrong, but I think we talked about this problem
>> > > in a previous email... sorry, I can't find the discussion in my
>> > > mail archives.
>> > >
>> > > IMHO a nice approach would be to measure IO consumption at the IO
>> > > scheduler level, and control IO by applying proportional weights /
>> > > absolute limits _both_ at the IO scheduler/elevator level _and_ at
>> > > the same time block the tasks from dirtying memory that would
>> > > generate additional IO requests.
>> > >
>> > > Anyway, there's no need to provide this with a single IO
>> > > controller; we could split the problem into two parts: 1) provide a
>> > > proportional / absolute IO controller in the IO schedulers and 2)
>> > > allow setting, for example, a maximum limit of dirty pages for each
>> > > cgroup.
>> > >
>> >
>> > I think setting a maximum limit on dirty pages is an interesting
>> > thought. It sounds as if the memory controller could handle it?
>>
>> Exactly, same as above.
>
> Thinking more about it: the memory controller can probably enforce the
> upper limit, but that would not easily translate into a fixed upper
> async write rate. Until the process hits the page cache limit or is
> slowed down by dirty page writeout, it can get a very high async write
> BW.
>
> So a memory controller page cache limit will help, but it would not
> directly translate into what the max bw limit patches are doing.
>
> Even if we do max bw control at the IO scheduler level, async writes
> are problematic again. The IO controller will not be able to throttle
> the process until it sees the actual write request. On big-memory
> systems, writeout might not happen for some time, and till then the
> process will see a high throughput.
>
> So doing async write throttling at a higher layer, and not at the IO
> scheduler layer, gives us the opportunity to produce more accurate
> results.
>
> For sync requests, I think IO scheduler max bw control should work
> fine.
>
> BTW, Andrea, what is the use case for your patches? Andrew had
> mentioned that some people are already using them. I am curious to
> know whether a proportional BW controller would solve the
> issues/requirements of these people, or whether they specifically
> need traffic shaping and a max bw controller.
>
> [..]
>> > > > Can you please give a little more detail here on how QoS
>> > > > requirements are not met with proportional weights?
>> > >
>> > > With proportional weights the whole bandwidth is allocated if no
>> > > one else is using it. When IO is submitted, other tasks with a
>> > > higher weight can be forced to sleep until the IO generated by the
>> > > low weight tasks is completely dispatched. Or we run into some
>> > > variant of the priority inversion problem.
>> >
>> > Hmm..., I am not very sure here. When the admin is allocating the
>> > weights, he has the whole picture. He knows how many groups are
>> > contending for the disk and what the worst case scenario could be.
>> > So if I have two groups, A and B, with weights 1 and 2 and both are
>> > contending, then as an admin one would expect group A to get 33% of
>> > the BW in the worst case (if group B is continuously backlogged). If
>> > B is not contending, then A can get 100% of the BW. So while
>> > configuring the system, will one not plan for the worst case (33%
>> > for A and 66% for B)?
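That matches my understanding: with proportional weights, the only
hard guarantee is the worst-case share weight/total_weight, and
anything above that is opportunistic. A trivial sketch of that
arithmetic (the group names and weights are just the example above):

#include <stdio.h>

int main(void)
{
	/* Groups A and B from the example above. */
	unsigned int weights[] = { 1, 2 };
	unsigned int total = 0;
	int n = sizeof(weights) / sizeof(weights[0]);

	for (int i = 0; i < n; i++)
		total += weights[i];
	for (int i = 0; i < n; i++)
		printf("group %c: guaranteed %.1f%% of disk time when all "
		       "groups are backlogged, up to 100%% when alone\n",
		       'A' + i, 100.0 * weights[i] / total);
	return 0;
}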
>> OK, I'm quite convinced.. :)
>>
>> To a large degree, if we want to provide a BW reservation strategy, we
>> must provide an interface that allows cgroups to ask for time slices,
>> such as max/min 5 IO requests every 50ms or something like that.
>> Probably the same functionality can be achieved by deriving the time
>> slices from weights, percentages or absolute BW limits.
>
> OK, I would like to split this into two parts.
>
> I think providing a minimum guarantee in absolute terms, like 5 IO
> requests every 50ms, will be very hard because the IO scheduler has no
> control over how many competitors there are. An easier thing would be
> to have minimum guarantees on a share basis. For a minimum BW (disk
> time slice) guarantee, the admin will have to create the right cgroup
> hierarchy and assign weights properly, and then the admin can
> calculate what % of the disk slice a particular group will get as a
> minimum guarantee. (It is actually more complicated than this, as
> there are time slices which are not accounted to any group: during a
> queue switch CFQ starts the time slice counting only after the first
> request has completed, to offset the impact of seeking and, I guess,
> also NCQ.)

I agree with Vivek that absolute metrics like 5 IO requests every 50ms
might be hard to offer. But 'x ms of disk time every y ms, for a given
cgroup' might be a desirable goal. That said, for now we can focus on
weight based allocation of disk time, and leave such goals for the
future.

> I think it should be possible to give max bandwidth guarantees in
> absolute terms, like io/s or sectors/sec or MB/sec etc, because the
> only thing the IO scheduler has to do is to not allow dispatch from a
> particular queue once it has crossed its limit, and then either let
> the disk idle or move on to the next eligible queue.
>
> The only issue here will be async writes. A max bw guarantee for async
> writes at the IO scheduler level might not mean much to the
> application because of the page cache.
>
> Thanks
> Vivek
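The dispatch-side limiting you describe is essentially a per-queue
token bucket. A toy userspace sketch of that gate, just to pin down
the semantics (all names and numbers are hypothetical, not taken from
any real scheduler):

#include <stdio.h>
#include <stdbool.h>

struct toy_queue {
	const char *name;
	unsigned long budget;	/* sectors allowed per accounting window */
	unsigned long used;	/* sectors dispatched in this window */
};

/*
 * Gate a dispatch: refuse once the queue has used up its budget, at
 * which point the scheduler idles or picks the next eligible queue.
 */
static bool may_dispatch(struct toy_queue *q, unsigned long sectors)
{
	if (q->used + sectors > q->budget)
		return false;
	q->used += sectors;
	return true;
}

/* Called at the start of each accounting window, e.g. every 100ms. */
static void new_window(struct toy_queue *q)
{
	q->used = 0;
}

int main(void)
{
	struct toy_queue q = { "grp-A", 1024, 0 };	/* ~512KB/window */

	for (int i = 0; i < 4; i++)
		printf("dispatch 400 sectors: %s\n",
		       may_dispatch(&q, 400) ? "ok" : "over limit");
	new_window(&q);
	printf("next window: %s\n",
	       may_dispatch(&q, 400) ? "ok" : "over limit");
	return 0;
}

And as you say, for buffered writes this gate only fires at writeout
time, which is why the throttling the application actually feels has
to happen higher up, at page-dirtying time.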