Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753494Ab1C2NQV (ORCPT ); Tue, 29 Mar 2011 09:16:21 -0400 Received: from mail-iw0-f174.google.com ([209.85.214.174]:44258 "EHLO mail-iw0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750932Ab1C2NQU convert rfc822-to-8bit (ORCPT ); Tue, 29 Mar 2011 09:16:20 -0400 DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type:content-transfer-encoding; b=ObN8KMFW0k9HAwDK68myZESgWbxZJtRF5A1wxG70CNchSsDkG0nYkxls+DuBkgWQeU O/afLtx2Gv6/m3s9n4MPh47z6qCu/alQn6fQyIF8Gxuhr6X2re1htL/e8LVhkuD5k1xV TNKDB7+xeQfgDSyS1ZUpzXMyNtQXEpL+6dIvM= MIME-Version: 1.0 In-Reply-To: <20110329111858.GF30671@tiehlicka.suse.cz> References: <20110328093957.089007035@suse.cz> <20110328200332.17fb4b78.kamezawa.hiroyu@jp.fujitsu.com> <20110328114430.GE5693@tiehlicka.suse.cz> <20110329090924.6a565ef3.kamezawa.hiroyu@jp.fujitsu.com> <20110329073232.GB30671@tiehlicka.suse.cz> <20110329165117.179d87f9.kamezawa.hiroyu@jp.fujitsu.com> <20110329085942.GD30671@tiehlicka.suse.cz> <20110329184119.219f7d7b.kamezawa.hiroyu@jp.fujitsu.com> <20110329111858.GF30671@tiehlicka.suse.cz> From: Zhu Yanhai Date: Tue, 29 Mar 2011 21:15:59 +0800 Message-ID: Subject: Re: [RFC 0/3] Implementation of cgroup isolation To: Michal Hocko Cc: KAMEZAWA Hiroyuki , linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 7755 Lines: 201 Michal, Maybe what we need here is some kind of trade-off? Let's say a new configuable parameter reserve_limit, for the cgroups which want to have some guarantee in the memory resource, we have: limit_in_bytes > soft_limit > reserve_limit MEM[limit_in_bytes..soft_limit] are the bytes that I'm willing to contribute to the others if they are short of memory. MEM[soft_limit..reserve_limit] are the bytes that I can afford if the others are still eager for memory after I gave them MEM[limit_in_bytes..soft_limit]. MEM[reserve_limit..0] are the bytes which is a must for me to guarantee QoS. Nobody is allowed to steal them. And reserve_limit is 0 by default for the cgroups who don't care about Qos. Then the reclaim path also needs some changes, i.e, balance_pgdat(): 1) call mem_cgroup_soft_limit_reclaim(), if nr_reclaimed is meet, goto finish. 2) shrink the global LRU list, and skip the pages which belong to the cgroup who have set a reserve_limit. if nr_reclaimed is meet, goto finish. 3) shrink the cgroups who have set a reserve_limit, and leave them with only the reserve_limit bytes they need. if nr_reclaimed is meet, goto finish. 4) OOM Does it make sense? Thanks, Zhu Yanhai 2011/3/29 Michal Hocko : > On Tue 29-03-11 18:41:19, KAMEZAWA Hiroyuki wrote: >> On Tue, 29 Mar 2011 10:59:43 +0200 >> Michal Hocko wrote: >> >> > On Tue 29-03-11 16:51:17, KAMEZAWA Hiroyuki wrote: > [...] >> > > My opinions is to enhance softlimit is better. >> > >> > I will look how softlimit can be enhanced to match the expectations but >> > I'm kind of suspicious it can handle workloads where heuristics simply >> > cannot guess that the resident memory is important even though it wasn't >> > touched for a long time. >> > >> >> I think we recommend mlock() or hugepagefs to pin application's work area >> in usual. And mm guyes have did hardwork to work mm better even without >> memory cgroup under realisitic workloads. > > Agreed. Whenever this approach is possible we recomend the same thing. > >> If your worload is realistic but _important_ anonymous memory is swapped out, >> it's problem of global VM rather than memcg. > > I would disagree with you on that. The important thing is that it can be > defined from many perspectives. One is the kernel which considers long > unused memory as not _that_ important. And it makes a perfect sense for > most workloads. > An important memory for an application can be something that would > considerably increase the latency just because the memory got paged out > (be it swap or the storage) because it contains pre-computed > data that have a big initial costs. > As you can see there is no mention about the time from the application > POV because it can depend on the incoming requests which you cannot > control. > >> If you add 'isolate' per process, okay, I'll agree to add isolate per memcg. > > What do you mean by isolate per process? > > [...] >> > > > OK, I have tried to explain that in one of the (2nd) patch description. >> > > > If I move all task from the root group to other group(s) and keep the >> > > > primary application in the root group I would achieve some isolation as >> > > > well. That is very much true. >> > > >> > > Okay, then, current works well. >> > > >> > > > But then there is only one such a group. >> > > >> > > I can't catch what you mean. you can create limitless cgroup, anywhere. >> > > Can't you ? >> > >> > This is not about limits. This is about global vs. per-cgroup reclaim >> > and how much they interact together. >> > >> > The everything-in-groups approach with the "primary" service in the root >> > group (or call it unlimited) works just because all the memory activity >> > (but the primary service) is caped with the limits so the rest of the >> > memory can be used by the service. Moreover, in order this to work the >> > limit for other groups would be smaller then the working set of the >> > primary service. >> > >> > Even if you created a limitless group for other important service they >> > would still interact together and if one goes wild the other would >> > suffer from that. >> > >> >> .........I can't understad what is the problem when global reclaim >> runs just because an application wasn't limited ...or memory are >> overcomitted. > > I am not sure I understand but what I see as a problem is when unrelated > memory activity triggers reclaim and it pushes out the memory of a > process group just because the heuristics done by the reclaim algorithm > do not pick up the right memory - and honestly, no heuristic will fit > all requirements. Isolation can protect from an unrelated activity > without new heuristics. > > [...] >> If softlimit (after some improvement) isn't enough, please add some other. >> >> What I think of is >> >> 1. need to "guarantee" memory usages in future. >>    "first come, first served" is not good for admins. > > this is not in scope of these patchsets but I agree that it would be > nice to have this guarantee > >> 2. need to handle zone memory shortage. Using memory migration >>    between zones will be necessary to avoid pageout. > > I am not sure I understand. > >> >> 3. need a knob to say "please reclaim from my own cgroup rather than >>    affecting others (if usage > some(soft)limit)." > > Isn't this handled already and enhanced by the per-cgroup background > reclaim patches? > >> >> > [...] >> > > > > I think you should put tasks in root cgroup to somewhere. It works perfect >> > > > > against OOM. And if memory are hidden by isolation, OOM will happen easier. >> > > > >> > > > Why do you think that it would happen easier? Isn't it similar (from OOM >> > > > POV) as if somebody mlocked that memory? >> > > > >> > > >> > > if global lru scan cannot find victim memory, oom happens. >> > >> > Yes, but this will happen with mlocked memory as well, right? >> > >> Yes, of course. >> >> Anyway, I'll Nack to simple "first come, first served" isolation. >> Please implement garantee, which is reliable and admin can use safely. > > Isolation is not about future guarantee. It is rather after you have it > you can rely it will stay in unless in-group activity pushes it out. > >> mlock() has similar problem, So, I recommend hugetlbfs to customers, >> admin can schedule it at boot time. >> (the number of users of hugetlbfs is tend to be one app. (oracle)) > > What if we decide that hugetlbfs won't be pinned into memory in future? > >> >> I'll be absent, tomorrow. >> >> I think you'll come LSF/MM summit and from the schedule, you'll have >> a joint session with Ying as "Memcg LRU management and isolation". > > I didn't have plans to do a session actively, but I can certainly join > to talk and will be happy to discuss this topic. > >> >> IIUC, "LRU management" is a google's performance improvement topic. >> >> It's ok for me to talk only about 'isolation'  1st in earlier session. >> If you want, please ask James to move session and overlay 1st memory >> cgroup session. (I think you saw e-mail from James.) > > Yeah, I can do that. > > Thanks > -- > Michal Hocko > SUSE Labs > SUSE LINUX s.r.o. > Lihovarska 1060/12 > 190 00 Praha 9 > Czech Republic > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org.  For more info on Linux MM, > see: http://www.linux-mm.org/ . > Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ > Don't email: email@kvack.org > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/