Date: Thu, 9 Jun 2011 19:23:47 +0200
From: Johannes Weiner
To: Minchan Kim
Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
    Michal Hocko, Andrew Morton, Rik van Riel, KOSAKI Motohiro,
    Mel Gorman, Greg Thelen, Michel Lespinasse, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org
Subject: Re: [patch 2/8] mm: memcg-aware global reclaim
Message-ID: <20110609172347.GB20333@cmpxchg.org>
In-Reply-To: <20110609154839.GF4878@barrios-laptop>

On Fri, Jun 10, 2011 at 12:48:39AM +0900, Minchan Kim wrote:
> On Wed, Jun 01, 2011 at 08:25:13AM +0200, Johannes Weiner wrote:
> > When a memcg hits its hard limit, hierarchical target reclaim is
> > invoked, which goes through all contributing memcgs in the hierarchy
> > below the offending memcg and reclaims from the respective per-memcg
> > lru lists.  This distributes pressure fairly among all involved
> > memcgs, and pages are aged with respect to their list buddies.
> >
> > When global memory pressure arises, however, all this is dropped
> > overboard.  Pages are reclaimed based on global lru lists that have
> > nothing to do with container-internal age, and some memcgs may be
> > reclaimed from much more than others.
> >
> > This patch makes traditional global reclaim consider container
> > boundaries and no longer scan the global lru lists.  For each zone
> > scanned, the memcg hierarchy is walked and pages are reclaimed from
> > the per-memcg lru lists of the respective zone.  For now, the
> > hierarchy walk is bounded to one full round-trip through the
> > hierarchy, or until the number of reclaimed pages reaches the
> > overall reclaim target, whichever comes first.
> >
> > Conceptually, global memory pressure is then treated as if the root
> > memcg had hit its limit.  Since all existing memcgs contribute to
> > the usage of the root memcg, global reclaim is nothing more than
> > target reclaim starting from the root memcg.  The code is mostly the
> > same for both cases, except for a few heuristics and statistics that
> > do not always apply.  They are distinguished by a newly introduced
> > global_reclaim() primitive.
> >
> > One implication of this change is that pages have to be linked to
> > the lru lists of the root memcg again, which could be optimized away
> > with the old scheme.  The costs are not measurable, though, even
> > with worst-case microbenchmarks.
> >
> > As global reclaim no longer relies on global lru lists, this change
> > is also in preparation to remove those completely.

[cut diff]

> I haven't looked at it all yet, and you might change the logic in
> later patches.  If I understand this patch right, it does round-robin
> reclaim across all memcgs when global memory pressure happens.
>
> Let's consider the case where memcg sizes are unbalanced.
>
> If A-memcg has lots of LRU pages, its scan count for reclaim would be
> bigger, so the chance to reclaim its pages would be higher.  If we
> reclaim from A-memcg, we can easily reclaim the number of pages we
> want and break out.  The next reclaim will happen at some point and
> will start from B-memcg, the memcg after the A-memcg we reclaimed
> from successfully before.
> But unfortunately, B-memcg has a small LRU, so its scan count would
> be small, and the small memcg's LRU aging would be faster than the
> bigger memcg's.  It means a small memcg's working set can be evicted
> more easily than a big memcg's.  My point is that we should not move
> on to the next memcg so easily.  We have to consider the memcg LRU
> size.

I may be missing something, but you said yourself that B has a smaller
scan count compared to A, so the aging speed should be proportional to
the respective sizes.  The number of pages scanned per iteration is
essentially

	number of lru pages in memcg-zone >> priority

so each round we scan from B and from A in proportion to their
respective LRU sizes.  It's the exact same logic we have traditionally
applied to distribute pressure fairly among zones and equalize their
aging speed.

Is that what you meant, or are we talking past each other?