Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752544Ab1FBPBw (ORCPT ); Thu, 2 Jun 2011 11:01:52 -0400 Received: from zene.cmpxchg.org ([85.214.230.12]:37435 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751870Ab1FBPBu (ORCPT ); Thu, 2 Jun 2011 11:01:50 -0400 Date: Thu, 2 Jun 2011 17:01:23 +0200 From: Johannes Weiner To: Hiroyuki Kamezawa Cc: KAMEZAWA Hiroyuki , Daisuke Nishimura , Balbir Singh , Ying Han , Michal Hocko , Andrew Morton , Rik van Riel , Minchan Kim , KOSAKI Motohiro , Mel Gorman , Greg Thelen , Michel Lespinasse , linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [patch 2/8] mm: memcg-aware global reclaim Message-ID: <20110602150123.GE28684@cmpxchg.org> References: <1306909519-7286-1-git-send-email-hannes@cmpxchg.org> <1306909519-7286-3-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6039 Lines: 158 On Thu, Jun 02, 2011 at 10:59:01PM +0900, Hiroyuki Kamezawa wrote: > 2011/6/1 Johannes Weiner : > > @@ -1381,6 +1373,97 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg) > > ? ? ? ?return min(limit, memsw); > > ?} > > > > +/** > > + * mem_cgroup_hierarchy_walk - iterate over a memcg hierarchy > > + * @root: starting point of the hierarchy > > + * @prev: previous position or NULL > > + * > > + * Caller must hold a reference to @root. ?While this function will > > + * return @root as part of the walk, it will never increase its > > + * reference count. > > + * > > + * Caller must clean up with mem_cgroup_stop_hierarchy_walk() when it > > + * stops the walk potentially before the full round trip. > > + */ > > +struct mem_cgroup *mem_cgroup_hierarchy_walk(struct mem_cgroup *root, > > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct mem_cgroup *prev) > > +{ > > + ? ? ? struct mem_cgroup *mem; > > + > > + ? ? ? if (mem_cgroup_disabled()) > > + ? ? ? ? ? ? ? return NULL; > > + > > + ? ? ? if (!root) > > + ? ? ? ? ? ? ? root = root_mem_cgroup; > > + ? ? ? /* > > + ? ? ? ?* Even without hierarchy explicitely enabled in the root > > + ? ? ? ?* memcg, it is the ultimate parent of all memcgs. > > + ? ? ? ?*/ > > + ? ? ? if (!(root == root_mem_cgroup || root->use_hierarchy)) > > + ? ? ? ? ? ? ? return root; > > Hmm, because ROOT cgroup has no limit and control, if root=root_mem_cgroup, > we do full hierarchy scan always. Right ? What it essentially means is that all existing memcgs in the system contribute to the usage of root_mem_cgroup. If there is global memory pressure, we need to consider reclaiming from every single memcg in the system. > > +static unsigned long mem_cgroup_reclaim(struct mem_cgroup *mem, > > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_t gfp_mask, > > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long flags) > > +{ > > + ? ? ? unsigned long total = 0; > > + ? ? ? bool noswap = false; > > + ? ? ? int loop; > > + > > + ? ? ? if ((flags & MEM_CGROUP_RECLAIM_NOSWAP) || mem->memsw_is_minimum) > > + ? ? ? ? ? ? ? noswap = true; > > + ? ? ? for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) { > > + ? ? ? ? ? ? ? drain_all_stock_async(); > > In recent patch, I removed this call here because this wakes up > kworker too much. > I will post that patch as a bugfix. So, please adjust this call > somewhere which is > not called frequently. Okay, please CC me when you send out the bugfix. > > @@ -1927,8 +1980,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask, > > ? ? ? ?if (!(gfp_mask & __GFP_WAIT)) > > ? ? ? ? ? ? ? ?return CHARGE_WOULDBLOCK; > > > > - ? ? ? ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL, > > - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_mask, flags); > > + ? ? ? ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags); > > ? ? ? ?if (mem_cgroup_margin(mem_over_limit) >= nr_pages) > > ? ? ? ? ? ? ? ?return CHARGE_RETRY; > > ? ? ? ?/* > > It seems this clean-up around hierarchy and softlimit can be in an > independent patch, no ? Hm, why do you think it's a cleanup? The hierarchical target reclaim code is moved to vmscan.c and as a result the entry points for hard limit and soft limit reclaim differ. This is why the original function, mem_cgroup_hierarchical_reclaim() has to be split into two parts. > > @@ -1943,6 +1976,31 @@ restart: > > ? ? ? ?throttle_vm_writeout(sc->gfp_mask); > > ?} > > > > +static void shrink_zone(int priority, struct zone *zone, > > + ? ? ? ? ? ? ? ? ? ? ? struct scan_control *sc) > > +{ > > + ? ? ? unsigned long nr_reclaimed_before = sc->nr_reclaimed; > > + ? ? ? struct mem_cgroup *root = sc->target_mem_cgroup; > > + ? ? ? struct mem_cgroup *first, *mem = NULL; > > + > > + ? ? ? first = mem = mem_cgroup_hierarchy_walk(root, mem); > > Hmm, I think we should add some scheduling here, later. > (as select a group over softlimit or select a group which has > easily reclaimable pages on this zone.) > > This name as hierarchy_walk() sounds like "full scan in round-robin, always". > Could you find better name ? Okay, I'll try. > > + ? ? ? for (;;) { > > + ? ? ? ? ? ? ? unsigned long nr_reclaimed; > > + > > + ? ? ? ? ? ? ? sc->mem_cgroup = mem; > > + ? ? ? ? ? ? ? do_shrink_zone(priority, zone, sc); > > + > > + ? ? ? ? ? ? ? nr_reclaimed = sc->nr_reclaimed - nr_reclaimed_before; > > + ? ? ? ? ? ? ? if (nr_reclaimed >= sc->nr_to_reclaim) > > + ? ? ? ? ? ? ? ? ? ? ? break; > > what this calculation means ? Shouldn't we do this quit based on the > number of "scan" > rather than "reclaimed" ? It aborts the loop once sc->nr_to_reclaim pages have been reclaimed from that zone during that hierarchy walk, to prevent overreclaim. If you have unbalanced sizes of memcgs in the system, it is not desirable to have every reclaimer scan all memcgs, but let those quit early that have made some progress on the bigger memcgs. It's essentially a forward progagation of the same check in do_shrink_zone(). It trades absolute fairness for average reclaim latency. Note that kswapd sets the reclaim target to infinity, so this optimization applies only to direct reclaimers. > > + ? ? ? ? ? ? ? mem = mem_cgroup_hierarchy_walk(root, mem); > > + ? ? ? ? ? ? ? if (mem == first) > > + ? ? ? ? ? ? ? ? ? ? ? break; > > Why we quit loop ? get_scan_count() for traditional global reclaim returns the scan target for the zone. With this per-memcg reclaimer, get_scan_count() will return scan targets for the respective per-memcg zone subsizes. So once we have gone through all memcgs, we should have scanned the amount of pages that global reclaim would have deemed sensible for that zone at that priority level. As such, this is the exit condition based on scan count you referred to above. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/