Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756310Ab0ATH3V (ORCPT ); Wed, 20 Jan 2010 02:29:21 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1756140Ab0ATH3T (ORCPT ); Wed, 20 Jan 2010 02:29:19 -0500 Received: from TYO201.gate.nec.co.jp ([202.32.8.193]:36296 "EHLO tyo201.gate.nec.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756287Ab0ATH3T (ORCPT ); Wed, 20 Jan 2010 02:29:19 -0500 Date: Wed, 20 Jan 2010 16:15:33 +0900 From: Daisuke Nishimura To: balbir@linux.vnet.ibm.com Cc: KAMEZAWA Hiroyuki , "linux-mm@kvack.org" , Andrew Morton , "linux-kernel@vger.kernel.org" , Daisuke Nishimura Subject: Re: [RFC] Shared page accounting for memory cgroup Message-Id: <20100120161533.6d83f607.nishimura@mxp.nes.nec.co.jp> In-Reply-To: <20100120130902.865d8269.nishimura@mxp.nes.nec.co.jp> References: <20100104093528.04846521.kamezawa.hiroyu@jp.fujitsu.com> <20100107083440.GS3059@balbir.in.ibm.com> <20100107174814.ad6820db.kamezawa.hiroyu@jp.fujitsu.com> <20100107180800.7b85ed10.kamezawa.hiroyu@jp.fujitsu.com> <20100107092736.GW3059@balbir.in.ibm.com> <20100108084727.429c40fc.kamezawa.hiroyu@jp.fujitsu.com> <661de9471001171130p2b0ac061he6f3dab9ef46fd06@mail.gmail.com> <20100118094920.151e1370.nishimura@mxp.nes.nec.co.jp> <4B541B44.3090407@linux.vnet.ibm.com> <20100119102208.59a16397.nishimura@mxp.nes.nec.co.jp> <661de9471001181749y2fe22a15j1c01c94aa1838e99@mail.gmail.com> <20100119113443.562e38ba.nishimura@mxp.nes.nec.co.jp> <4B552C89.8000004@linux.vnet.ibm.com> <20100120130902.865d8269.nishimura@mxp.nes.nec.co.jp> Organization: NEC Soft, Ltd. X-Mailer: Sylpheed 2.6.0 (GTK+ 2.10.14; i686-pc-mingw32) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6963 Lines: 156 On Wed, 20 Jan 2010 13:09:02 +0900, Daisuke Nishimura wrote: > On Tue, 19 Jan 2010 09:22:41 +0530, Balbir Singh wrote: > > On Tuesday 19 January 2010 08:04 AM, Daisuke Nishimura wrote: > > > On Tue, 19 Jan 2010 07:19:42 +0530, Balbir Singh wrote: > > >> On Tue, Jan 19, 2010 at 6:52 AM, Daisuke Nishimura > > >> wrote: > > >> [snip] > > >>>> Correct, file cache is almost always considered shared, so it has > > >>>> > > >>>> 1. non-private or shared usage of 10MB > > >>>> 2. 10 MB of file cache > > >>>> > > >>>>> I don't think "non private usage" is appropriate to this value. > > >>>>> Why don't you just show "sum_of_each_process_rss" ? I think it would be easier > > >>>>> to understand for users. > > >>>> > > >>>> Here is my concern > > >>>> > > >>>> 1. The gap between looking at memcg stat and sum of all RSS is way > > >>>> higher in user space > > >>>> 2. Summing up all rss without walking the tasks atomically can and > > >>>> will lead to consistency issues. Data can be stale as long as it > > >>>> represents a consistent snapshot of data > > >>>> > > >>>> We need to differentiate between > > >>>> > > >>>> 1. Data snapshot (taken at a time, but valid at that point) > > >>>> 2. Data taken from different sources that does not form a uniform > > >>>> snapshot, because the timestamping of the each of the collected data > > >>>> items is different > > >>>> > > >>> Hmm, I'm sorry I can't understand why you need "difference". > > >>> IOW, what can users or middlewares know by the value in the above case > > >>> (0MB in 01 and 10MB in 02)? I've read this thread, but I can't understande about > > >>> this point... Why can this value mean some of the groups are "heavy" ? > > >>> > > >> > > >> Consider a default cgroup that is not root and assume all applications > > >> move there initially. Now with a lot of shared memory, > > >> the default cgroup will be the first one to page in a lot of the > > >> memory and its usage will be very high. Without the concept of > > >> showing how much is non-private, how does one decide if the default > > >> cgroup is using a lot of memory or sharing it? How > > >> do we decide on limits of a cgroup without knowing its actual usage - > > >> PSS equivalent for a region of memory for a task. > > >> > > > As for limit, I think we should decide it based on the actual usage because > > > we account and limit the accual usage. Why we should take account of the sum of rss ? > > > > I am talking of non-private pages or potentially shared pages - which is > > derived as follows > > > > sum_of_all_rss - (rss + file_mapped) (from .stat file) > > > > file cache is considered to be shared always > > > > > > > I agree that we'd better not to ignore the sum of rss completely, but could you show me > > > how the value 0MB/10MB can be used to caluculate the limit in 01/02 in detail ? > > > > In your example, usage shows that the real usage of the cgroup is 20 MB > > for 01 and 10 MB for 02. > right. > > > Today we show that we are using 40MB instead of > > 30MB (when summed). > Sorry, I can't understand here. > If we sum usage_in_bytes in both groups, it would be 30MB. > If we sum "actual rss(rss_file, rss_anon) via stat file" in both groups, it would be 30M. > If we sum "total rss(rss_file, rss_anon) of all process via mm_counter" in both groups, > it would be 40MB. > > > If an administrator has to make a decision to say > > add more resources, the one with 20MB would be the right place w.r.t. > > memory. > > > You mean he would add the additional resource to 00, right? Then, > the smaller "shared_usage_in_bytes" is, the more likely an administrator should > add additional resources to the group ? > > But when both /cgroup/memory/aa and /cgroup/memory/bb has 20MB as acutual usage, > and aa has 10MB "shared"(used by multiple processes *in aa*) usage while bb has none, > "shared_usage_in_bytes" is 10MB in aa and 0MB in bb(please consider there is > no "shared" usage between aa and bb). > Should an administrator consider bb is heavier than aa ? I don't think so. > > IOW, "shared_usage_in_bytes" doesn't have any consistent meaning about which > group is unfairly "heavy". > > The problem here is, "shared_usage_in_bytes" doesn't show neither one of nor the sum > of the following value(*IFF* we have only one cgroup, "shared_usage_in_bytes" would > mean a), but it has no use in real case). > > a) memory usage used by multiple processes inside this group. > b) memory usage used by both processes inside this and another group. > c) memory usage not used by any processes inside this group, but used by > that of in another group. > > IMHO, we should take account of all the above values to determine which group > is unfairly "heavy". I agree that the bigger the size of a) is, the bigger > "shared_usage_in_bytes" of the group would be, but we cannot know any information about > the size of b) by it, becase those usages are included in both actual usage(rss via stat) > and sum of rss(via mm_counter). To make matters warse, "shared_usage_in_bytes" has > the opposite meaning about b), i.e., the more a processe in some group(foo) has actual > charges in *another* group(baa), the bigger "shared_usage_in_bytes" in "foo" would be > (as 00 and 01 in my example). > > I would agree with you if you add interfaces to show some hints to users about above values, > but "shared_usage_in_bytes" doesn't meet it at all. > This is just an idea(At least, we need interfaces to read and reset them). diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 385e29b..bf601f2 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -83,6 +83,8 @@ enum mem_cgroup_stat_index { used by soft limit implementation */ MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out. used by threshold implementation */ + MEM_CGROUP_STAT_SHARED_IN_GROUP, + MEM_CGROUP_STAT_SHARED_FROM_OTHERS, MEM_CGROUP_STAT_NSTATS, }; @@ -1707,8 +1709,25 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem, lock_page_cgroup(pc); if (unlikely(PageCgroupUsed(pc))) { + struct mem_cgroup *charged = pc->mem_cgroup; + struct mem_cgroup_stat *stat; + struct mem_cgroup_stat_cpu *cpustat; + int cpu; + int shared_type; + unlock_page_cgroup(pc); mem_cgroup_cancel_charge(mem); + + stat = &charged->stat; + cpu = get_cpu(); + cpustat = &stat->cpustat[cpu]; + if (charged == mem) + shared_type = MEM_CGROUP_STAT_SHARED_IN_GROUP; + else + shared_type = MEM_CGROUP_STAT_SHARED_FROM_OTHERS; + __mem_cgroup_stat_add_safe(cpustat, shared_type, 1); + put_cpu(); + return; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/