2011-06-01 06:25:54

by Johannes Weiner

Subject: [patch 0/8] mm: memcg naturalization -rc2

Hi,

this is the second version of the memcg naturalization series. The
notable changes since the first submission are:

o the hierarchy walk is now intermittent and will abort and
remember the last scanned child after sc->nr_to_reclaim pages
have been reclaimed during the walk in one zone (Rik)

o the global lru lists are never scanned when memcg is enabled
after #2 'memcg-aware global reclaim', which makes this patch
self-sufficient and complete without requiring the per-memcg lru
lists to be exclusive (Michal)

o renamed sc->memcg and sc->current_memcg to sc->target_mem_cgroup
and sc->mem_cgroup and fixed their documentation; I hope they are
easier to understand now (Rik)

o the reclaim statistic counters have been renamed. There is no
longer a distinction between 'pgfree' and 'pgsteal'; both are now
'pgreclaim', and 'kswapd' has been replaced by 'background'

o fixed a nasty crash in the hierarchical soft limit check that
happened during global reclaim in memcgs that are hierarchical
but have no hierarchical parents themselves

o properly implemented the memcg-aware unevictable page rescue
scanner; there were several blatant bugs in there

o documentation on new public interfaces

Thanks for your input on the first version.

I ran microbenchmarks (sparse file catting, essentially) to stress
reclaim and LRU operations. There is no measurable overhead for
!CONFIG_MEMCG, memcg disabled during boot, memcg enabled but no
configured groups, and hard limit reclaim.

I also ran single-threaded kernbench runs in four unlimited memcgs in
parallel, contained in a hard-limited hierarchical parent that put
constant pressure on the workload. There is no measurable difference
in runtime, the pgpgin/pgpgout counters, or fairness among memcgs in
this test compared to an unpatched kernel. This needs more
evaluation, though, especially with a higher number of memcgs.

The soft limit changes are also shown to work insofar as it is
possible to prioritize between children in a hierarchy under
pressure: in the kernbench setup described above, with staggered soft
limits on the groups, the runtime differences corresponded directly
to the soft limit settings. This still needs quantification, though.

Based on v2.6.39.

include/linux/memcontrol.h | 91 +++--
include/linux/mm_inline.h | 14 +-
include/linux/mmzone.h | 10 +-
include/linux/page_cgroup.h | 36 --
include/linux/swap.h | 4 -
mm/memcontrol.c | 889 ++++++++++++++-----------------------------
mm/page_alloc.c | 2 +-
mm/page_cgroup.c | 38 +--
mm/swap.c | 20 +-
mm/vmscan.c | 296 ++++++++-------
10 files changed, 536 insertions(+), 864 deletions(-)


2011-06-01 06:25:45

by Johannes Weiner

Subject: [patch 1/8] memcg: remove unused retry signal from reclaim

If the memcg reclaim code detects the target memcg below its limit,
it exits and returns a guaranteed non-zero value so that the charge
is retried.

Nowadays, the charge side checks the memcg limit itself and does not
rely on this non-zero return value trick.

This patch removes it. The reclaim code will now always return the
true number of pages it reclaimed on its own.
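
To make the retry semantics concrete, here is a small self-contained
userspace model of the idea. It is an illustration only: the counter
structure and the charge()/reclaim() helpers are made up and are not
the kernel functions; the point is merely that the caller decides
whether to retry by checking the margin itself, so the reclaim return
value no longer needs the artificial "+1".

#include <stdbool.h>
#include <stdio.h>

/* toy stand-in for a res_counter with a hard limit */
struct counter { unsigned long usage, limit; };

static unsigned long margin(struct counter *c)
{
	return c->limit > c->usage ? c->limit - c->usage : 0;
}

/* toy stand-in for reclaim: frees some pages, returns how many */
static unsigned long reclaim(struct counter *c, unsigned long want)
{
	unsigned long freed = want / 2 + 1;

	if (freed > c->usage)
		freed = c->usage;
	c->usage -= freed;
	return freed;
}

static bool charge(struct counter *c, unsigned long nr_pages)
{
	int retries = 5;

	while (retries--) {
		/* the charge side checks the margin itself ... */
		if (margin(c) >= nr_pages) {
			c->usage += nr_pages;
			return true;
		}
		/* ... so reclaim's return value needs no "+1" trick */
		if (!reclaim(c, nr_pages))
			break;
	}
	return false;
}

int main(void)
{
	struct counter memcg = { .usage = 95, .limit = 100 };

	printf("charge of 10 pages %s\n",
	       charge(&memcg, 10) ? "succeeded" : "failed");
	return 0;
}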

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Ying Han <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
---
mm/memcontrol.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 010f916..bf5ab87 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1503,7 +1503,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
if (!res_counter_soft_limit_excess(&root_mem->res))
return total;
} else if (mem_cgroup_margin(root_mem))
- return 1 + total;
+ return total;
}
return total;
}
--
1.7.5.2

2011-06-01 06:27:22

by Johannes Weiner

Subject: [patch 2/8] mm: memcg-aware global reclaim

When a memcg hits its hard limit, hierarchical target reclaim is
invoked, which goes through all contributing memcgs in the hierarchy
below the offending memcg and reclaims from the respective per-memcg
lru lists. This distributes pressure fairly among all involved
memcgs, and pages are aged with respect to their list buddies.

When global memory pressure arises, however, all this is dropped
overboard. Pages are reclaimed based on global lru lists that have
nothing to do with container-internal age, and some memcgs may be
reclaimed from much more than others.

This patch makes traditional global reclaim consider container
boundaries and no longer scan the global lru lists. For each zone
scanned, the memcg hierarchy is walked and pages are reclaimed from
the per-memcg lru lists of the respective zone. For now, the
hierarchy walk is bounded to one full round-trip through the
hierarchy, or until the number of reclaimed pages reaches the
overall reclaim target, whichever comes first.

Conceptually, global memory pressure is then treated as if the root
memcg had hit its limit. Since all existing memcgs contribute to the
usage of the root memcg, global reclaim is nothing more than target
reclaim starting from the root memcg. The code is mostly the same for
both cases, except for a few heuristics and statistics that do not
always apply. They are distinguished by a newly introduced
global_reclaim() primitive.

One implication of this change is that pages have to be linked to the
lru lists of the root memcg again, which could be optimized away with
the old scheme. The costs are not measurable, though, even with
worst-case microbenchmarks.

As global reclaim no longer relies on the global lru lists, this
change also prepares for removing them completely.
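
To illustrate the bounded walk outside of kernel context, here is a
compile-checked userspace sketch. The ring of four memcgs and the
fixed per-memcg reclaim progress are stand-ins; only the two exit
conditions (reclaim target reached, or one full round trip back to
the first memcg) mirror what shrink_zone() does in this patch.

#include <stdio.h>

#define NR_MEMCGS 4

struct scan_control { unsigned long nr_to_reclaim, nr_reclaimed; };

/* stand-in for mem_cgroup_hierarchy_walk(): cycle through a fixed ring */
static int hierarchy_walk(int prev)
{
	return (prev + 1) % NR_MEMCGS;
}

/* stand-in for do_shrink_zone() against one memcg's lru lists */
static void shrink_one(int memcg, struct scan_control *sc)
{
	sc->nr_reclaimed += 8;	/* pretend each memcg yields a few pages */
	printf("reclaimed from memcg %d, total %lu\n",
	       memcg, sc->nr_reclaimed);
}

static void shrink_zone(struct scan_control *sc)
{
	unsigned long before = sc->nr_reclaimed;
	int first, mem;

	first = mem = hierarchy_walk(-1);
	for (;;) {
		shrink_one(mem, sc);
		/* stop once the overall target for this zone is met ... */
		if (sc->nr_reclaimed - before >= sc->nr_to_reclaim)
			break;
		mem = hierarchy_walk(mem);
		/* ... or after one full round trip through the hierarchy */
		if (mem == first)
			break;
	}
}

int main(void)
{
	struct scan_control sc = { .nr_to_reclaim = 32 };

	shrink_zone(&sc);
	return 0;
}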

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 15 ++++
mm/memcontrol.c | 176 ++++++++++++++++++++++++++++----------------
mm/vmscan.c | 121 ++++++++++++++++++++++--------
3 files changed, 218 insertions(+), 94 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5e9840f5..332b0a6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -101,6 +101,10 @@ mem_cgroup_prepare_migration(struct page *page,
extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
struct page *oldpage, struct page *newpage, bool migration_ok);

+struct mem_cgroup *mem_cgroup_hierarchy_walk(struct mem_cgroup *,
+ struct mem_cgroup *);
+void mem_cgroup_stop_hierarchy_walk(struct mem_cgroup *, struct mem_cgroup *);
+
/*
* For memory reclaim.
*/
@@ -321,6 +325,17 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
return NULL;
}

+static inline struct mem_cgroup *mem_cgroup_hierarchy_walk(struct mem_cgroup *r,
+ struct mem_cgroup *m)
+{
+ return NULL;
+}
+
+static inline void mem_cgroup_stop_hierarchy_walk(struct mem_cgroup *r,
+ struct mem_cgroup *m)
+{
+}
+
static inline void
mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bf5ab87..850176e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -313,8 +313,8 @@ static bool move_file(void)
}

/*
- * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
- * limit reclaim to prevent infinite loops, if they ever occur.
+ * Maximum loops in reclaim, used for soft limit reclaim to prevent
+ * infinite loops, if they ever occur.
*/
#define MEM_CGROUP_MAX_RECLAIM_LOOPS (100)
#define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2)
@@ -340,7 +340,7 @@ enum charge_type {
#define OOM_CONTROL (0)

/*
- * Reclaim flags for mem_cgroup_hierarchical_reclaim
+ * Reclaim flags
*/
#define MEM_CGROUP_RECLAIM_NOSWAP_BIT 0x0
#define MEM_CGROUP_RECLAIM_NOSWAP (1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
@@ -846,8 +846,6 @@ void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
/* huge page split is done under lru_lock. so, we have no races. */
MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
- if (mem_cgroup_is_root(pc->mem_cgroup))
- return;
VM_BUG_ON(list_empty(&pc->lru));
list_del_init(&pc->lru);
}
@@ -872,13 +870,11 @@ void mem_cgroup_rotate_reclaimable_page(struct page *page)
return;

pc = lookup_page_cgroup(page);
- /* unused or root page is not rotated. */
+ /* unused page is not rotated. */
if (!PageCgroupUsed(pc))
return;
/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
smp_rmb();
- if (mem_cgroup_is_root(pc->mem_cgroup))
- return;
mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
list_move_tail(&pc->lru, &mz->lists[lru]);
}
@@ -892,13 +888,11 @@ void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
return;

pc = lookup_page_cgroup(page);
- /* unused or root page is not rotated. */
+ /* unused page is not rotated. */
if (!PageCgroupUsed(pc))
return;
/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
smp_rmb();
- if (mem_cgroup_is_root(pc->mem_cgroup))
- return;
mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
list_move(&pc->lru, &mz->lists[lru]);
}
@@ -920,8 +914,6 @@ void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
/* huge page split is done under lru_lock. so, we have no races. */
MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
SetPageCgroupAcctLRU(pc);
- if (mem_cgroup_is_root(pc->mem_cgroup))
- return;
list_add(&pc->lru, &mz->lists[lru]);
}

@@ -1381,6 +1373,97 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
return min(limit, memsw);
}

+/**
+ * mem_cgroup_hierarchy_walk - iterate over a memcg hierarchy
+ * @root: starting point of the hierarchy
+ * @prev: previous position or NULL
+ *
+ * Caller must hold a reference to @root. While this function will
+ * return @root as part of the walk, it will never increase its
+ * reference count.
+ *
+ * Caller must clean up with mem_cgroup_stop_hierarchy_walk() when it
+ * stops the walk potentially before the full round trip.
+ */
+struct mem_cgroup *mem_cgroup_hierarchy_walk(struct mem_cgroup *root,
+ struct mem_cgroup *prev)
+{
+ struct mem_cgroup *mem = NULL;
+
+ if (mem_cgroup_disabled())
+ return NULL;
+
+ if (!root)
+ root = root_mem_cgroup;
+ /*
+ * Even without hierarchy explicitly enabled in the root
+ * memcg, it is the ultimate parent of all memcgs.
+ */
+ if (!(root == root_mem_cgroup || root->use_hierarchy))
+ return root;
+ if (prev && prev != root)
+ css_put(&prev->css);
+ do {
+ int id = root->last_scanned_child;
+ struct cgroup_subsys_state *css;
+
+ rcu_read_lock();
+ css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
+ if (css && (css == &root->css || css_tryget(css)))
+ mem = container_of(css, struct mem_cgroup, css);
+ rcu_read_unlock();
+ if (!css)
+ id = 0;
+ root->last_scanned_child = id;
+ } while (!mem);
+ return mem;
+}
+
+/**
+ * mem_cgroup_stop_hierarchy_walk - clean up after partial hierarchy walk
+ * @root: starting point in the hierarchy
+ * @mem: last position during the walk
+ */
+void mem_cgroup_stop_hierarchy_walk(struct mem_cgroup *root,
+ struct mem_cgroup *mem)
+{
+ if (mem && mem != root)
+ css_put(&mem->css);
+}
+
+static unsigned long mem_cgroup_reclaim(struct mem_cgroup *mem,
+ gfp_t gfp_mask,
+ unsigned long flags)
+{
+ unsigned long total = 0;
+ bool noswap = false;
+ int loop;
+
+ if ((flags & MEM_CGROUP_RECLAIM_NOSWAP) || mem->memsw_is_minimum)
+ noswap = true;
+ for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
+ drain_all_stock_async();
+ total += try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap,
+ get_swappiness(mem));
+ /*
+ * Avoid freeing too much when shrinking to resize the
+ * limit. XXX: Shouldn't the margin check be enough?
+ */
+ if (total && (flags & MEM_CGROUP_RECLAIM_SHRINK))
+ break;
+ if (mem_cgroup_margin(mem))
+ break;
+ /*
+ * If we have not been able to reclaim anything after
+ * two reclaim attempts, there may be no reclaimable
+ * pages in this hierarchy.
+ */
+ if (loop && !total)
+ break;
+ }
+ return total;
+}
+
/*
* Visit the first child (need not be the first child as per the ordering
* of the cgroup list, since we track last_scanned_child) of @mem and use
@@ -1418,29 +1501,14 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
return ret;
}

-/*
- * Scan the hierarchy if needed to reclaim memory. We remember the last child
- * we reclaimed from, so that we don't end up penalizing one child extensively
- * based on its position in the children list.
- *
- * root_mem is the original ancestor that we've been reclaim from.
- *
- * We give up and return to the caller when we visit root_mem twice.
- * (other groups can be removed while we're walking....)
- *
- * If shrink==true, for avoiding to free too much, this returns immedieately.
- */
-static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
- struct zone *zone,
- gfp_t gfp_mask,
- unsigned long reclaim_options)
+static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_mem,
+ struct zone *zone,
+ gfp_t gfp_mask)
{
struct mem_cgroup *victim;
int ret, total = 0;
int loop = 0;
- bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
- bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
- bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
+ bool noswap = false;
unsigned long excess;

excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
@@ -1461,7 +1529,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
* anything, it might because there are
* no reclaimable pages under this hierarchy
*/
- if (!check_soft || !total) {
+ if (!total) {
css_put(&victim->css);
break;
}
@@ -1483,26 +1551,11 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
css_put(&victim->css);
continue;
}
- /* we use swappiness of local cgroup */
- if (check_soft)
- ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
- noswap, get_swappiness(victim), zone);
- else
- ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
- noswap, get_swappiness(victim));
+ ret = mem_cgroup_shrink_node_zone(victim, gfp_mask, noswap,
+ get_swappiness(victim), zone);
css_put(&victim->css);
- /*
- * At shrinking usage, we can't check we should stop here or
- * reclaim more. It's depends on callers. last_scanned_child
- * will work enough for keeping fairness under tree.
- */
- if (shrink)
- return ret;
total += ret;
- if (check_soft) {
- if (!res_counter_soft_limit_excess(&root_mem->res))
- return total;
- } else if (mem_cgroup_margin(root_mem))
+ if (!res_counter_soft_limit_excess(&root_mem->res))
return total;
}
return total;
@@ -1927,8 +1980,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
if (!(gfp_mask & __GFP_WAIT))
return CHARGE_WOULDBLOCK;

- ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
- gfp_mask, flags);
+ ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
return CHARGE_RETRY;
/*
@@ -3085,7 +3137,7 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,

/*
* A call to try to shrink memory usage on charge failure at shmem's swapin.
- * Calling hierarchical_reclaim is not enough because we should update
+ * Calling reclaim is not enough because we should update
* last_oom_jiffies to prevent pagefault_out_of_memory from invoking global OOM.
* Moreover considering hierarchy, we should reclaim from the mem_over_limit,
* not from the memcg which this page would be charged to.
@@ -3167,7 +3219,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
int enlarge;

/*
- * For keeping hierarchical_reclaim simple, how long we should retry
+ * For keeping reclaim simple, how long we should retry
* is depends on callers. We set our retry-count to be function
* of # of children which we should visit in this loop.
*/
@@ -3210,8 +3262,8 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
if (!ret)
break;

- mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
- MEM_CGROUP_RECLAIM_SHRINK);
+ mem_cgroup_reclaim(memcg, GFP_KERNEL,
+ MEM_CGROUP_RECLAIM_SHRINK);
curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
/* Usage is reduced ? */
if (curusage >= oldusage)
@@ -3269,9 +3321,9 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
if (!ret)
break;

- mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
- MEM_CGROUP_RECLAIM_NOSWAP |
- MEM_CGROUP_RECLAIM_SHRINK);
+ mem_cgroup_reclaim(memcg, GFP_KERNEL,
+ MEM_CGROUP_RECLAIM_NOSWAP |
+ MEM_CGROUP_RECLAIM_SHRINK);
curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
/* Usage is reduced ? */
if (curusage >= oldusage)
@@ -3311,9 +3363,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
if (!mz)
break;

- reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
- gfp_mask,
- MEM_CGROUP_RECLAIM_SOFT);
+ reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone, gfp_mask);
nr_reclaimed += reclaimed;
spin_lock(&mctz->lock);

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8bfd450..7e9bfca 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -104,7 +104,16 @@ struct scan_control {
*/
reclaim_mode_t reclaim_mode;

- /* Which cgroup do we reclaim from */
+ /*
+ * The memory cgroup that hit its hard limit and is the
+ * primary target of this reclaim invocation.
+ */
+ struct mem_cgroup *target_mem_cgroup;
+
+ /*
+ * The memory cgroup that is currently being scanned as a
+ * child and contributor to the usage of target_mem_cgroup.
+ */
struct mem_cgroup *mem_cgroup;

/*
@@ -154,9 +163,36 @@ static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);

#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-#define scanning_global_lru(sc) (!(sc)->mem_cgroup)
+/**
+ * global_reclaim - whether reclaim is global or due to memcg hard limit
+ * @sc: scan control of this reclaim invocation
+ */
+static bool global_reclaim(struct scan_control *sc)
+{
+ return !sc->target_mem_cgroup;
+}
+/**
+ * scanning_global_lru - whether scanning global lrus or per-memcg lrus
+ * @sc: scan control of this reclaim invocation
+ */
+static bool scanning_global_lru(struct scan_control *sc)
+{
+ /*
+ * Unless memory cgroups are disabled on boot, the traditional
+ * global lru lists are never scanned and reclaim will always
+ * operate on the per-memcg lru lists.
+ */
+ return mem_cgroup_disabled();
+}
#else
-#define scanning_global_lru(sc) (1)
+static bool global_reclaim(struct scan_control *sc)
+{
+ return true;
+}
+static bool scanning_global_lru(struct scan_control *sc)
+{
+ return true;
+}
#endif

static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
@@ -1228,7 +1264,7 @@ static int too_many_isolated(struct zone *zone, int file,
if (current_is_kswapd())
return 0;

- if (!scanning_global_lru(sc))
+ if (!global_reclaim(sc))
return 0;

if (file) {
@@ -1397,13 +1433,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
ISOLATE_BOTH : ISOLATE_INACTIVE,
zone, 0, file);
- zone->pages_scanned += nr_scanned;
- if (current_is_kswapd())
- __count_zone_vm_events(PGSCAN_KSWAPD, zone,
- nr_scanned);
- else
- __count_zone_vm_events(PGSCAN_DIRECT, zone,
- nr_scanned);
} else {
nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
&page_list, &nr_scanned, sc->order,
@@ -1411,10 +1440,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
ISOLATE_BOTH : ISOLATE_INACTIVE,
zone, sc->mem_cgroup,
0, file);
- /*
- * mem_cgroup_isolate_pages() keeps track of
- * scanned pages on its own.
- */
+ }
+
+ if (global_reclaim(sc)) {
+ zone->pages_scanned += nr_scanned;
+ if (current_is_kswapd())
+ __count_zone_vm_events(PGSCAN_KSWAPD, zone,
+ nr_scanned);
+ else
+ __count_zone_vm_events(PGSCAN_DIRECT, zone,
+ nr_scanned);
}

if (nr_taken == 0) {
@@ -1520,18 +1555,16 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
&pgscanned, sc->order,
ISOLATE_ACTIVE, zone,
1, file);
- zone->pages_scanned += pgscanned;
} else {
nr_taken = mem_cgroup_isolate_pages(nr_pages, &l_hold,
&pgscanned, sc->order,
ISOLATE_ACTIVE, zone,
sc->mem_cgroup, 1, file);
- /*
- * mem_cgroup_isolate_pages() keeps track of
- * scanned pages on its own.
- */
}

+ if (global_reclaim(sc))
+ zone->pages_scanned += pgscanned;
+
reclaim_stat->recent_scanned[file] += nr_taken;

__count_zone_vm_events(PGREFILL, zone, pgscanned);
@@ -1752,7 +1785,7 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
file = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);

- if (scanning_global_lru(sc)) {
+ if (global_reclaim(sc)) {
free = zone_page_state(zone, NR_FREE_PAGES);
/* If we have very few page cache pages,
force-scan anon pages. */
@@ -1889,8 +1922,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
-static void shrink_zone(int priority, struct zone *zone,
- struct scan_control *sc)
+static void do_shrink_zone(int priority, struct zone *zone,
+ struct scan_control *sc)
{
unsigned long nr[NR_LRU_LISTS];
unsigned long nr_to_scan;
@@ -1943,6 +1976,31 @@ restart:
throttle_vm_writeout(sc->gfp_mask);
}

+static void shrink_zone(int priority, struct zone *zone,
+ struct scan_control *sc)
+{
+ unsigned long nr_reclaimed_before = sc->nr_reclaimed;
+ struct mem_cgroup *root = sc->target_mem_cgroup;
+ struct mem_cgroup *first, *mem = NULL;
+
+ first = mem = mem_cgroup_hierarchy_walk(root, mem);
+ for (;;) {
+ unsigned long nr_reclaimed;
+
+ sc->mem_cgroup = mem;
+ do_shrink_zone(priority, zone, sc);
+
+ nr_reclaimed = sc->nr_reclaimed - nr_reclaimed_before;
+ if (nr_reclaimed >= sc->nr_to_reclaim)
+ break;
+
+ mem = mem_cgroup_hierarchy_walk(root, mem);
+ if (mem == first)
+ break;
+ }
+ mem_cgroup_stop_hierarchy_walk(root, mem);
+}
+
/*
* This is the direct reclaim path, for page-allocating processes. We only
* try to reclaim pages from zones which will satisfy the caller's allocation
@@ -1973,7 +2031,7 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
* Take care memory controller reclaiming has small influence
* to global LRU.
*/
- if (scanning_global_lru(sc)) {
+ if (global_reclaim(sc)) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
@@ -2038,7 +2096,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
get_mems_allowed();
delayacct_freepages_start();

- if (scanning_global_lru(sc))
+ if (global_reclaim(sc))
count_vm_event(ALLOCSTALL);

for (priority = DEF_PRIORITY; priority >= 0; priority--) {
@@ -2050,7 +2108,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
* Don't shrink slabs when reclaiming memory from
* over limit cgroups
*/
- if (scanning_global_lru(sc)) {
+ if (global_reclaim(sc)) {
unsigned long lru_pages = 0;
for_each_zone_zonelist(zone, z, zonelist,
gfp_zone(sc->gfp_mask)) {
@@ -2111,7 +2169,7 @@ out:
return 0;

/* top priority shrink_zones still had more to do? don't OOM, then */
- if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
+ if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc))
return 1;

return 0;
@@ -2129,7 +2187,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
.may_swap = 1,
.swappiness = vm_swappiness,
.order = order,
- .mem_cgroup = NULL,
+ .target_mem_cgroup = NULL,
.nodemask = nodemask,
};

@@ -2158,6 +2216,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
.may_swap = !noswap,
.swappiness = swappiness,
.order = 0,
+ .target_mem_cgroup = mem,
.mem_cgroup = mem,
};
sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
@@ -2174,7 +2233,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
* will pick up pages from other mem cgroup's as well. We hack
* the priority and make it zero.
*/
- shrink_zone(0, zone, &sc);
+ do_shrink_zone(0, zone, &sc);

trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);

@@ -2195,7 +2254,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
.nr_to_reclaim = SWAP_CLUSTER_MAX,
.swappiness = swappiness,
.order = 0,
- .mem_cgroup = mem_cont,
+ .target_mem_cgroup = mem_cont,
.nodemask = NULL, /* we don't care the placement */
};

@@ -2333,7 +2392,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
.nr_to_reclaim = ULONG_MAX,
.swappiness = vm_swappiness,
.order = order,
- .mem_cgroup = NULL,
+ .target_mem_cgroup = NULL,
};
loop_again:
total_scanned = 0;
--
1.7.5.2

2011-06-01 06:27:25

by Johannes Weiner

Subject: [patch 3/8] memcg: reclaim statistics

Currently, there are no statistics whatsoever that would give an
insight into how memory is reclaimed from specific memcgs.

This patch introduces statistics that break down into the following
categories.

1. Limit-triggered direct reclaim

pgscan_direct_limit
pgreclaim_direct_limit

These counters indicate the number of pages scanned and reclaimed
directly by tasks that needed to allocate memory while the memcg
had reached its hard limit.

2. Limit-triggered background reclaim

pgscan_background_limit
pgreclaim_background_limit

These counters indicate the number of pages scanned and reclaimed
by a kernel thread while the memcg's usage was coming close to the
hard limit, so as to prevent allocators from having to drop into
direct reclaim.

There is currently no mechanism in the kernel that would increase
those counters, but per-memcg watermark reclaim, which is in the
works, would fall into this category.

3. Hierarchy-triggered direct reclaim

pgscan_direct_hierarchy
pgreclaim_direct_hierarchy

These counters indicate the number of pages scanned and reclaimed
directly by tasks that needed to allocate memory while hierarchical
parents of the memcg were experiencing a memory shortage.

For now, this could be either because of a hard limit in the
parents, or because of global memory pressure.

4. Hierarchy-triggered background reclaim

pgscan_background_hierarchy
pgreclaim_background_hierarchy

These counters indicate the number of pages scanned and reclaimed
by a kernel thread while one of the memcg's hierarchical parents was
coming close to running out of memory.

For now, this only accounts for the work done by kswapd to balance
zones, but per-memcg watermark reclaim, which is in the works, would
also fall into this category.

The counters for limit-triggered reclaim always indicate pressure
that exists within the memcg, i.e. whether the workload is too big
for its container. The counters for hierarchy-triggered reclaim, on
the other hand, indicate pressure from outside the memcg, such as the
limit of a parent or a physical memory shortage. This distinction
helps locate the cause of a thrashing workload in the hierarchy.

In addition, the distinction between direct and background reclaim
shows how well background reclaim can keep up or whether it is
overwhelmed and forces allocators into direct reclaim.
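
The eight counters are laid out so that the right event index can be
computed from three flags, as the RECLAIM_* defines in the patch
suggest. The following stand-alone snippet (userspace, illustration
only) mirrors that encoding and prints which counters a kswapd pass
on behalf of a hierarchical parent would bump:

#include <stdbool.h>
#include <stdio.h>

#define RECLAIM_RECLAIMED	1
#define RECLAIM_BACKGROUND	2
#define RECLAIM_HIERARCHY	4

static const char *counter_names[8] = {
	"pgscan_direct_limit",
	"pgreclaim_direct_limit",
	"pgscan_background_limit",
	"pgreclaim_background_limit",
	"pgscan_direct_hierarchy",
	"pgreclaim_direct_hierarchy",
	"pgscan_background_hierarchy",
	"pgreclaim_background_hierarchy",
};

/* base + [!]hierarchy + [!]background + [!]reclaimed, as in the patch */
static int reclaim_index(bool background, bool hierarchy, bool reclaimed)
{
	int idx = 0;

	if (hierarchy)
		idx += RECLAIM_HIERARCHY;
	if (background)
		idx += RECLAIM_BACKGROUND;
	if (reclaimed)
		idx += RECLAIM_RECLAIMED;
	return idx;
}

int main(void)
{
	/* kswapd reclaiming on behalf of a hierarchical parent */
	printf("scanned pages go to:   %s\n",
	       counter_names[reclaim_index(true, true, false)]);
	printf("reclaimed pages go to: %s\n",
	       counter_names[reclaim_index(true, true, true)]);
	return 0;
}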

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 9 ++++++
mm/memcontrol.c | 61 ++++++++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 6 ++++
3 files changed, 76 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 332b0a6..8f402b9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -108,6 +108,8 @@ void mem_cgroup_stop_hierarchy_walk(struct mem_cgroup *, struct mem_cgroup *);
/*
* For memory reclaim.
*/
+void mem_cgroup_count_reclaim(struct mem_cgroup *, bool, bool,
+ unsigned long, unsigned long);
int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
@@ -293,6 +295,13 @@ static inline bool mem_cgroup_disabled(void)
return true;
}

+static inline void mem_cgroup_count_reclaim(struct mem_cgroup *mem,
+ bool background, bool hierarchy,
+ unsigned long scanned,
+ unsigned long reclaimed)
+{
+}
+
static inline int
mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 850176e..983efe4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -90,10 +90,24 @@ enum mem_cgroup_stat_index {
MEM_CGROUP_STAT_NSTATS,
};

+#define RECLAIM_RECLAIMED 1
+#define RECLAIM_BACKGROUND 2
+#define RECLAIM_HIERARCHY 4
+
enum mem_cgroup_events_index {
MEM_CGROUP_EVENTS_PGPGIN, /* # of pages paged in */
MEM_CGROUP_EVENTS_PGPGOUT, /* # of pages paged out */
MEM_CGROUP_EVENTS_COUNT, /* # of pages paged in/out */
+ RECLAIM_BASE,
+ /* base + [!]hierarchy + [!]background + [!]reclaimed */
+ PGSCAN_DIRECT_LIMIT = RECLAIM_BASE,
+ PGRECLAIM_DIRECT_LIMIT,
+ PGSCAN_BACKGROUND_LIMIT,
+ PGRECLAIM_BACKGROUND_LIMIT,
+ PGSCAN_DIRECT_HIERARCHY,
+ PGRECLAIM_DIRECT_HIERARCHY,
+ PGSCAN_BACKGROUND_HIERARCHY,
+ PGRECLAIM_BACKGROUND_HIERARCHY,
MEM_CGROUP_EVENTS_NSTATS,
};
/*
@@ -585,6 +599,21 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_SWAPOUT], val);
}

+void mem_cgroup_count_reclaim(struct mem_cgroup *mem,
+ bool background, bool hierarchy,
+ unsigned long scanned, unsigned long reclaimed)
+{
+ unsigned int base = RECLAIM_BASE;
+
+ if (hierarchy)
+ base += RECLAIM_HIERARCHY;
+ if (background)
+ base += RECLAIM_BACKGROUND;
+
+ this_cpu_add(mem->stat->events[base], scanned);
+ this_cpu_add(mem->stat->events[base + RECLAIM_RECLAIMED], reclaimed);
+}
+
static unsigned long mem_cgroup_read_events(struct mem_cgroup *mem,
enum mem_cgroup_events_index idx)
{
@@ -3821,6 +3850,14 @@ enum {
MCS_FILE_MAPPED,
MCS_PGPGIN,
MCS_PGPGOUT,
+ MCS_PGSCAN_DIRECT_LIMIT,
+ MCS_PGRECLAIM_DIRECT_LIMIT,
+ MCS_PGSCAN_BACKGROUND_LIMIT,
+ MCS_PGRECLAIM_BACKGROUND_LIMIT,
+ MCS_PGSCAN_DIRECT_HIERARCHY,
+ MCS_PGRECLAIM_DIRECT_HIERARCHY,
+ MCS_PGSCAN_BACKGROUND_HIERARCHY,
+ MCS_PGRECLAIM_BACKGROUND_HIERARCHY,
MCS_SWAP,
MCS_INACTIVE_ANON,
MCS_ACTIVE_ANON,
@@ -3843,6 +3880,14 @@ struct {
{"mapped_file", "total_mapped_file"},
{"pgpgin", "total_pgpgin"},
{"pgpgout", "total_pgpgout"},
+ {"pgscan_direct_limit", "total_pgscan_direct_limit"},
+ {"pgreclaim_direct_limit", "total_pgreclaim_direct_limit"},
+ {"pgscan_background_limit", "total_pgscan_background_limit"},
+ {"pgreclaim_background_limit", "total_pgreclaim_background_limit"},
+ {"pgscan_direct_hierarchy", "total_pgscan_direct_hierarchy"},
+ {"pgreclaim_direct_hierarchy", "total_pgreclaim_direct_hierarchy"},
+ {"pgscan_background_hierarchy", "total_pgscan_background_hierarchy"},
+ {"pgreclaim_background_hierarchy", "total_pgreclaim_background_hierarchy"},
{"swap", "total_swap"},
{"inactive_anon", "total_inactive_anon"},
{"active_anon", "total_active_anon"},
@@ -3868,6 +3913,22 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
s->stat[MCS_PGPGIN] += val;
val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGPGOUT);
s->stat[MCS_PGPGOUT] += val;
+ val = mem_cgroup_read_events(mem, PGSCAN_DIRECT_LIMIT);
+ s->stat[MCS_PGSCAN_DIRECT_LIMIT] += val;
+ val = mem_cgroup_read_events(mem, PGRECLAIM_DIRECT_LIMIT);
+ s->stat[MCS_PGRECLAIM_DIRECT_LIMIT] += val;
+ val = mem_cgroup_read_events(mem, PGSCAN_BACKGROUND_LIMIT);
+ s->stat[MCS_PGSCAN_BACKGROUND_LIMIT] += val;
+ val = mem_cgroup_read_events(mem, PGRECLAIM_BACKGROUND_LIMIT);
+ s->stat[MCS_PGRECLAIM_BACKGROUND_LIMIT] += val;
+ val = mem_cgroup_read_events(mem, PGSCAN_DIRECT_HIERARCHY);
+ s->stat[MCS_PGSCAN_DIRECT_HIERARCHY] += val;
+ val = mem_cgroup_read_events(mem, PGRECLAIM_DIRECT_HIERARCHY);
+ s->stat[MCS_PGRECLAIM_DIRECT_HIERARCHY] += val;
+ val = mem_cgroup_read_events(mem, PGSCAN_BACKGROUND_HIERARCHY);
+ s->stat[MCS_PGSCAN_BACKGROUND_HIERARCHY] += val;
+ val = mem_cgroup_read_events(mem, PGRECLAIM_BACKGROUND_HIERARCHY);
+ s->stat[MCS_PGRECLAIM_BACKGROUND_HIERARCHY] += val;
if (do_swap_account) {
val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
s->stat[MCS_SWAP] += val * PAGE_SIZE;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7e9bfca..c7d4b44 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1985,10 +1985,16 @@ static void shrink_zone(int priority, struct zone *zone,

first = mem = mem_cgroup_hierarchy_walk(root, mem);
for (;;) {
+ unsigned long reclaimed = sc->nr_reclaimed;
+ unsigned long scanned = sc->nr_scanned;
unsigned long nr_reclaimed;

sc->mem_cgroup = mem;
do_shrink_zone(priority, zone, sc);
+ mem_cgroup_count_reclaim(mem, current_is_kswapd(),
+ mem != root, /* limit or hierarchy? */
+ sc->nr_scanned - scanned,
+ sc->nr_reclaimed - reclaimed);

nr_reclaimed = sc->nr_reclaimed - nr_reclaimed_before;
if (nr_reclaimed >= sc->nr_to_reclaim)
--
1.7.5.2

2011-06-01 06:27:05

by Johannes Weiner

Subject: [patch 4/8] memcg: rework soft limit reclaim

Currently, soft limit reclaim is entered from kswapd, where it selects
the memcg with the biggest soft limit excess in absolute bytes, and
reclaims pages from it with maximum aggressiveness (priority 0).

This has the following disadvantages:

1. because of the aggressiveness, kswapd can be stalled on a memcg
that is hard to reclaim from for a long time, sending the rest of
the allocators into direct reclaim in the meantime.

2. it only considers the biggest offender (in absolute bytes, no
less, which is impractical for setups with different-sized memcgs)
and does not apply any pressure at all on other memcgs in excess.

3. because it is only invoked from kswapd, the soft limit is
meaningful during global memory pressure, but it is not taken into
account during hierarchical target reclaim where it could allow
prioritizing memcgs as well. So while it does hierarchical
reclaim once triggered, it is not a truly hierarchical mechanism.

Here is a different approach. Instead of having a soft limit reclaim
cycle separate from the rest of reclaim, this patch ensures that each
time a group of memcgs is reclaimed - be it because of global memory
pressure or because of a hard limit - memcgs that exceed their soft
limit, or contribute to the soft limit excess of one of their parents,
are reclaimed from at a higher priority than their siblings.

This results in the following:

1. all relevant memcgs are scanned with increasing priority during
memory pressure. The primary goal is to free pages, not to punish
soft limit offenders.

2. increased pressure is applied to all memcgs in excess of their
soft limit, not only the biggest offender.

3. the soft limit becomes meaningful for target reclaim as well,
where it allows prioritizing children of a hierarchy when the
parent hits its limit.

4. direct reclaim now also applies increased soft limit pressure,
not just kswapd anymore.
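
Below is a minimal userspace model of the priority bump, for
illustration only. The tiny hand-built memcg "tree" and the excess
check are stand-ins, and the special-casing of the root memcg is
omitted; what it loosely follows is the walk from the scanned memcg
up to the reclaim root and the one-step boost of the effective scan
priority that this patch adds to shrink_zone().

#include <stdbool.h>
#include <stdio.h>

struct memcg {
	const char *name;
	unsigned long usage, soft_limit;
	struct memcg *parent;
};

static bool soft_limit_excess(struct memcg *mem)
{
	return mem->usage > mem->soft_limit;
}

/*
 * Does @mem exceed its own soft limit, or contribute to the excess
 * of an ancestor up to (and including) the reclaim root?
 */
static bool soft_limit_exceeded(struct memcg *root, struct memcg *mem)
{
	for (;;) {
		if (soft_limit_excess(mem))
			return true;
		if (mem == root || !mem->parent)
			return false;
		mem = mem->parent;
	}
}

int main(void)
{
	struct memcg root = { "root", 300, 1000, NULL };
	struct memcg a    = { "A",    250,  100, &root };	/* in excess */
	struct memcg a1   = { "A/1",   50,  200, &a };		/* child of offender */
	struct memcg *scan[] = { &a, &a1 };
	int priority = 12;
	int i;

	for (i = 0; i < 2; i++) {
		int epriority = priority;

		if (soft_limit_exceeded(&root, scan[i]))
			epriority -= 1;	/* scan this memcg more aggressively */
		printf("%s: effective priority %d\n", scan[i]->name, epriority);
	}
	return 0;
}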

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 7 +++++++
mm/memcontrol.c | 26 ++++++++++++++++++++++++++
mm/vmscan.c | 8 ++++++--
3 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8f402b9..7d99e87 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -104,6 +104,7 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
struct mem_cgroup *mem_cgroup_hierarchy_walk(struct mem_cgroup *,
struct mem_cgroup *);
void mem_cgroup_stop_hierarchy_walk(struct mem_cgroup *, struct mem_cgroup *);
+bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *, struct mem_cgroup *);

/*
* For memory reclaim.
@@ -345,6 +346,12 @@ static inline void mem_cgroup_stop_hierarchy_walk(struct mem_cgroup *r,
{
}

+static inline bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *root,
+ struct mem_cgroup *mem)
+{
+ return false;
+}
+
static inline void
mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 983efe4..94f77cc3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1460,6 +1460,32 @@ void mem_cgroup_stop_hierarchy_walk(struct mem_cgroup *root,
css_put(&mem->css);
}

+/**
+ * mem_cgroup_soft_limit_exceeded - check if a memcg (hierarchically)
+ * exceeds a soft limit
+ * @root: highest ancestor of @mem to consider
+ * @mem: memcg to check for excess
+ *
+ * The function indicates whether @mem has exceeded its own soft
+ * limit, or contributes to the soft limit excess of one of its
+ * parents in the hierarchy below @root.
+ */
+bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *root,
+ struct mem_cgroup *mem)
+{
+ for (;;) {
+ if (mem == root_mem_cgroup)
+ return false;
+ if (res_counter_soft_limit_excess(&mem->res))
+ return true;
+ if (mem == root)
+ return false;
+ mem = parent_mem_cgroup(mem);
+ if (!mem)
+ return false;
+ }
+}
+
static unsigned long mem_cgroup_reclaim(struct mem_cgroup *mem,
gfp_t gfp_mask,
unsigned long flags)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c7d4b44..0163840 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1988,9 +1988,13 @@ static void shrink_zone(int priority, struct zone *zone,
unsigned long reclaimed = sc->nr_reclaimed;
unsigned long scanned = sc->nr_scanned;
unsigned long nr_reclaimed;
+ int epriority = priority;
+
+ if (mem_cgroup_soft_limit_exceeded(root, mem))
+ epriority -= 1;

sc->mem_cgroup = mem;
- do_shrink_zone(priority, zone, sc);
+ do_shrink_zone(epriority, zone, sc);
mem_cgroup_count_reclaim(mem, current_is_kswapd(),
mem != root, /* limit or hierarchy? */
sc->nr_scanned - scanned,
@@ -2480,7 +2484,7 @@ loop_again:
* Call soft limit reclaim before calling shrink_zone.
* For now we ignore the return value
*/
- mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask);
+ //mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask);

/*
* We put equal pressure on every zone, unless
--
1.7.5.2

2011-06-01 06:25:58

by Johannes Weiner

Subject: [patch 5/8] memcg: remove unused soft limit code

This should be merged into the previous patch, which is, however,
easier to read and review without all this deletion noise.

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 9 -
include/linux/swap.h | 4 -
mm/memcontrol.c | 418 --------------------------------------------
mm/vmscan.c | 44 -----
4 files changed, 0 insertions(+), 475 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7d99e87..cb02c00 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -150,8 +150,6 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
mem_cgroup_update_page_stat(page, idx, -1);
}

-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
- gfp_t gfp_mask);
u64 mem_cgroup_get_limit(struct mem_cgroup *mem);

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -368,13 +366,6 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
}

static inline
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
- gfp_t gfp_mask)
-{
- return 0;
-}
-
-static inline
u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
{
return 0;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a5c6da5..885cf19 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -254,10 +254,6 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
unsigned int swappiness);
-extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
- gfp_t gfp_mask, bool noswap,
- unsigned int swappiness,
- struct zone *zone);
extern int __isolate_lru_page(struct page *page, int mode, int file);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 94f77cc3..78ae4dd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -34,7 +34,6 @@
#include <linux/rcupdate.h>
#include <linux/limits.h>
#include <linux/mutex.h>
-#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/swap.h>
#include <linux/swapops.h>
@@ -141,12 +140,6 @@ struct mem_cgroup_per_zone {
unsigned long count[NR_LRU_LISTS];

struct zone_reclaim_stat reclaim_stat;
- struct rb_node tree_node; /* RB tree node */
- unsigned long long usage_in_excess;/* Set to the value by which */
- /* the soft limit is exceeded*/
- bool on_tree;
- struct mem_cgroup *mem; /* Back pointer, we cannot */
- /* use container_of */
};
/* Macro for accessing counter */
#define MEM_CGROUP_ZSTAT(mz, idx) ((mz)->count[(idx)])
@@ -159,26 +152,6 @@ struct mem_cgroup_lru_info {
struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
};

-/*
- * Cgroups above their limits are maintained in a RB-Tree, independent of
- * their hierarchy representation
- */
-
-struct mem_cgroup_tree_per_zone {
- struct rb_root rb_root;
- spinlock_t lock;
-};
-
-struct mem_cgroup_tree_per_node {
- struct mem_cgroup_tree_per_zone rb_tree_per_zone[MAX_NR_ZONES];
-};
-
-struct mem_cgroup_tree {
- struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
-};
-
-static struct mem_cgroup_tree soft_limit_tree __read_mostly;
-
struct mem_cgroup_threshold {
struct eventfd_ctx *eventfd;
u64 threshold;
@@ -326,12 +299,7 @@ static bool move_file(void)
&mc.to->move_charge_at_immigrate);
}

-/*
- * Maximum loops in reclaim, used for soft limit reclaim to prevent
- * infinite loops, if they ever occur.
- */
#define MEM_CGROUP_MAX_RECLAIM_LOOPS (100)
-#define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2)

enum charge_type {
MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
@@ -388,164 +356,6 @@ page_cgroup_zoneinfo(struct mem_cgroup *mem, struct page *page)
return mem_cgroup_zoneinfo(mem, nid, zid);
}

-static struct mem_cgroup_tree_per_zone *
-soft_limit_tree_node_zone(int nid, int zid)
-{
- return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
-}
-
-static struct mem_cgroup_tree_per_zone *
-soft_limit_tree_from_page(struct page *page)
-{
- int nid = page_to_nid(page);
- int zid = page_zonenum(page);
-
- return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
-}
-
-static void
-__mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
- struct mem_cgroup_per_zone *mz,
- struct mem_cgroup_tree_per_zone *mctz,
- unsigned long long new_usage_in_excess)
-{
- struct rb_node **p = &mctz->rb_root.rb_node;
- struct rb_node *parent = NULL;
- struct mem_cgroup_per_zone *mz_node;
-
- if (mz->on_tree)
- return;
-
- mz->usage_in_excess = new_usage_in_excess;
- if (!mz->usage_in_excess)
- return;
- while (*p) {
- parent = *p;
- mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
- tree_node);
- if (mz->usage_in_excess < mz_node->usage_in_excess)
- p = &(*p)->rb_left;
- /*
- * We can't avoid mem cgroups that are over their soft
- * limit by the same amount
- */
- else if (mz->usage_in_excess >= mz_node->usage_in_excess)
- p = &(*p)->rb_right;
- }
- rb_link_node(&mz->tree_node, parent, p);
- rb_insert_color(&mz->tree_node, &mctz->rb_root);
- mz->on_tree = true;
-}
-
-static void
-__mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
- struct mem_cgroup_per_zone *mz,
- struct mem_cgroup_tree_per_zone *mctz)
-{
- if (!mz->on_tree)
- return;
- rb_erase(&mz->tree_node, &mctz->rb_root);
- mz->on_tree = false;
-}
-
-static void
-mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
- struct mem_cgroup_per_zone *mz,
- struct mem_cgroup_tree_per_zone *mctz)
-{
- spin_lock(&mctz->lock);
- __mem_cgroup_remove_exceeded(mem, mz, mctz);
- spin_unlock(&mctz->lock);
-}
-
-
-static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
-{
- unsigned long long excess;
- struct mem_cgroup_per_zone *mz;
- struct mem_cgroup_tree_per_zone *mctz;
- int nid = page_to_nid(page);
- int zid = page_zonenum(page);
- mctz = soft_limit_tree_from_page(page);
-
- /*
- * Necessary to update all ancestors when hierarchy is used.
- * because their event counter is not touched.
- */
- for (; mem; mem = parent_mem_cgroup(mem)) {
- mz = mem_cgroup_zoneinfo(mem, nid, zid);
- excess = res_counter_soft_limit_excess(&mem->res);
- /*
- * We have to update the tree if mz is on RB-tree or
- * mem is over its softlimit.
- */
- if (excess || mz->on_tree) {
- spin_lock(&mctz->lock);
- /* if on-tree, remove it */
- if (mz->on_tree)
- __mem_cgroup_remove_exceeded(mem, mz, mctz);
- /*
- * Insert again. mz->usage_in_excess will be updated.
- * If excess is 0, no tree ops.
- */
- __mem_cgroup_insert_exceeded(mem, mz, mctz, excess);
- spin_unlock(&mctz->lock);
- }
- }
-}
-
-static void mem_cgroup_remove_from_trees(struct mem_cgroup *mem)
-{
- int node, zone;
- struct mem_cgroup_per_zone *mz;
- struct mem_cgroup_tree_per_zone *mctz;
-
- for_each_node_state(node, N_POSSIBLE) {
- for (zone = 0; zone < MAX_NR_ZONES; zone++) {
- mz = mem_cgroup_zoneinfo(mem, node, zone);
- mctz = soft_limit_tree_node_zone(node, zone);
- mem_cgroup_remove_exceeded(mem, mz, mctz);
- }
- }
-}
-
-static struct mem_cgroup_per_zone *
-__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
-{
- struct rb_node *rightmost = NULL;
- struct mem_cgroup_per_zone *mz;
-
-retry:
- mz = NULL;
- rightmost = rb_last(&mctz->rb_root);
- if (!rightmost)
- goto done; /* Nothing to reclaim from */
-
- mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node);
- /*
- * Remove the node now but someone else can add it back,
- * we will to add it back at the end of reclaim to its correct
- * position in the tree.
- */
- __mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
- if (!res_counter_soft_limit_excess(&mz->mem->res) ||
- !css_tryget(&mz->mem->css))
- goto retry;
-done:
- return mz;
-}
-
-static struct mem_cgroup_per_zone *
-mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
-{
- struct mem_cgroup_per_zone *mz;
-
- spin_lock(&mctz->lock);
- mz = __mem_cgroup_largest_soft_limit_node(mctz);
- spin_unlock(&mctz->lock);
- return mz;
-}
-
/*
* Implementation Note: reading percpu statistics for memcg.
*
@@ -583,15 +393,6 @@ static long mem_cgroup_read_stat(struct mem_cgroup *mem,
return val;
}

-static long mem_cgroup_local_usage(struct mem_cgroup *mem)
-{
- long ret;
-
- ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
- ret += mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);
- return ret;
-}
-
static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
bool charge)
{
@@ -710,7 +511,6 @@ static void memcg_check_events(struct mem_cgroup *mem, struct page *page)
__mem_cgroup_target_update(mem, MEM_CGROUP_TARGET_THRESH);
if (unlikely(__memcg_event_check(mem,
MEM_CGROUP_TARGET_SOFTLIMIT))){
- mem_cgroup_update_tree(mem, page);
__mem_cgroup_target_update(mem,
MEM_CGROUP_TARGET_SOFTLIMIT);
}
@@ -1520,103 +1320,6 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *mem,
}

/*
- * Visit the first child (need not be the first child as per the ordering
- * of the cgroup list, since we track last_scanned_child) of @mem and use
- * that to reclaim free pages from.
- */
-static struct mem_cgroup *
-mem_cgroup_select_victim(struct mem_cgroup *root_mem)
-{
- struct mem_cgroup *ret = NULL;
- struct cgroup_subsys_state *css;
- int nextid, found;
-
- if (!root_mem->use_hierarchy) {
- css_get(&root_mem->css);
- ret = root_mem;
- }
-
- while (!ret) {
- rcu_read_lock();
- nextid = root_mem->last_scanned_child + 1;
- css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css,
- &found);
- if (css && css_tryget(css))
- ret = container_of(css, struct mem_cgroup, css);
-
- rcu_read_unlock();
- /* Updates scanning parameter */
- if (!css) {
- /* this means start scan from ID:1 */
- root_mem->last_scanned_child = 0;
- } else
- root_mem->last_scanned_child = found;
- }
-
- return ret;
-}
-
-static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_mem,
- struct zone *zone,
- gfp_t gfp_mask)
-{
- struct mem_cgroup *victim;
- int ret, total = 0;
- int loop = 0;
- bool noswap = false;
- unsigned long excess;
-
- excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
-
- /* If memsw_is_minimum==1, swap-out is of-no-use. */
- if (root_mem->memsw_is_minimum)
- noswap = true;
-
- while (1) {
- victim = mem_cgroup_select_victim(root_mem);
- if (victim == root_mem) {
- loop++;
- if (loop >= 1)
- drain_all_stock_async();
- if (loop >= 2) {
- /*
- * If we have not been able to reclaim
- * anything, it might because there are
- * no reclaimable pages under this hierarchy
- */
- if (!total) {
- css_put(&victim->css);
- break;
- }
- /*
- * We want to do more targeted reclaim.
- * excess >> 2 is not to excessive so as to
- * reclaim too much, nor too less that we keep
- * coming back to reclaim from this cgroup
- */
- if (total >= (excess >> 2) ||
- (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) {
- css_put(&victim->css);
- break;
- }
- }
- }
- if (!mem_cgroup_local_usage(victim)) {
- /* this cgroup's local usage == 0 */
- css_put(&victim->css);
- continue;
- }
- ret = mem_cgroup_shrink_node_zone(victim, gfp_mask, noswap,
- get_swappiness(victim), zone);
- css_put(&victim->css);
- total += ret;
- if (!res_counter_soft_limit_excess(&root_mem->res))
- return total;
- }
- return total;
-}
-
-/*
* Check OOM-Killer is already running under our hierarchy.
* If someone is running, return false.
*/
@@ -2310,8 +2013,6 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
unlock_page_cgroup(pc);
/*
* "charge_statistics" updated event counter. Then, check it.
- * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
- * if they exceeds softlimit.
*/
memcg_check_events(mem, page);
}
@@ -3391,94 +3092,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
return ret;
}

-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
- gfp_t gfp_mask)
-{
- unsigned long nr_reclaimed = 0;
- struct mem_cgroup_per_zone *mz, *next_mz = NULL;
- unsigned long reclaimed;
- int loop = 0;
- struct mem_cgroup_tree_per_zone *mctz;
- unsigned long long excess;
-
- if (order > 0)
- return 0;
-
- mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone));
- /*
- * This loop can run a while, specially if mem_cgroup's continuously
- * keep exceeding their soft limit and putting the system under
- * pressure
- */
- do {
- if (next_mz)
- mz = next_mz;
- else
- mz = mem_cgroup_largest_soft_limit_node(mctz);
- if (!mz)
- break;
-
- reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone, gfp_mask);
- nr_reclaimed += reclaimed;
- spin_lock(&mctz->lock);
-
- /*
- * If we failed to reclaim anything from this memory cgroup
- * it is time to move on to the next cgroup
- */
- next_mz = NULL;
- if (!reclaimed) {
- do {
- /*
- * Loop until we find yet another one.
- *
- * By the time we get the soft_limit lock
- * again, someone might have aded the
- * group back on the RB tree. Iterate to
- * make sure we get a different mem.
- * mem_cgroup_largest_soft_limit_node returns
- * NULL if no other cgroup is present on
- * the tree
- */
- next_mz =
- __mem_cgroup_largest_soft_limit_node(mctz);
- if (next_mz == mz) {
- css_put(&next_mz->mem->css);
- next_mz = NULL;
- } else /* next_mz == NULL or other memcg */
- break;
- } while (1);
- }
- __mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
- excess = res_counter_soft_limit_excess(&mz->mem->res);
- /*
- * One school of thought says that we should not add
- * back the node to the tree if reclaim returns 0.
- * But our reclaim could return 0, simply because due
- * to priority we are exposing a smaller subset of
- * memory to reclaim from. Consider this as a longer
- * term TODO.
- */
- /* If excess == 0, no tree ops */
- __mem_cgroup_insert_exceeded(mz->mem, mz, mctz, excess);
- spin_unlock(&mctz->lock);
- css_put(&mz->mem->css);
- loop++;
- /*
- * Could not reclaim anything and there are no more
- * mem cgroups to try or we seem to be looping without
- * reclaiming anything.
- */
- if (!nr_reclaimed &&
- (next_mz == NULL ||
- loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
- break;
- } while (!nr_reclaimed);
- if (next_mz)
- css_put(&next_mz->mem->css);
- return nr_reclaimed;
-}
-
/*
* This routine traverse page_cgroup in given list and drop them all.
* *And* this routine doesn't reclaim page itself, just removes page_cgroup.
@@ -4548,9 +4161,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
mz = &pn->zoneinfo[zone];
for_each_lru(l)
INIT_LIST_HEAD(&mz->lists[l]);
- mz->usage_in_excess = 0;
- mz->on_tree = false;
- mz->mem = mem;
}
return 0;
}
@@ -4603,7 +4213,6 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
{
int node;

- mem_cgroup_remove_from_trees(mem);
free_css_id(&mem_cgroup_subsys, &mem->css);

for_each_node_state(node, N_POSSIBLE)
@@ -4658,31 +4267,6 @@ static void __init enable_swap_cgroup(void)
}
#endif

-static int mem_cgroup_soft_limit_tree_init(void)
-{
- struct mem_cgroup_tree_per_node *rtpn;
- struct mem_cgroup_tree_per_zone *rtpz;
- int tmp, node, zone;
-
- for_each_node_state(node, N_POSSIBLE) {
- tmp = node;
- if (!node_state(node, N_NORMAL_MEMORY))
- tmp = -1;
- rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp);
- if (!rtpn)
- return 1;
-
- soft_limit_tree.rb_tree_per_node[node] = rtpn;
-
- for (zone = 0; zone < MAX_NR_ZONES; zone++) {
- rtpz = &rtpn->rb_tree_per_zone[zone];
- rtpz->rb_root = RB_ROOT;
- spin_lock_init(&rtpz->lock);
- }
- }
- return 0;
-}
-
static struct cgroup_subsys_state * __ref
mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
{
@@ -4704,8 +4288,6 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
enable_swap_cgroup();
parent = NULL;
root_mem_cgroup = mem;
- if (mem_cgroup_soft_limit_tree_init())
- goto free_out;
for_each_possible_cpu(cpu) {
struct memcg_stock_pcp *stock =
&per_cpu(memcg_stock, cpu);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0163840..7d74e48 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2213,43 +2213,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
}

#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-
-unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
- gfp_t gfp_mask, bool noswap,
- unsigned int swappiness,
- struct zone *zone)
-{
- struct scan_control sc = {
- .nr_to_reclaim = SWAP_CLUSTER_MAX,
- .may_writepage = !laptop_mode,
- .may_unmap = 1,
- .may_swap = !noswap,
- .swappiness = swappiness,
- .order = 0,
- .target_mem_cgroup = mem,
- .mem_cgroup = mem,
- };
- sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
- (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
-
- trace_mm_vmscan_memcg_softlimit_reclaim_begin(0,
- sc.may_writepage,
- sc.gfp_mask);
-
- /*
- * NOTE: Although we can get the priority field, using it
- * here is not a good idea, since it limits the pages we can scan.
- * if we don't reclaim here, the shrink_zone from balance_pgdat
- * will pick up pages from other mem cgroup's as well. We hack
- * the priority and make it zero.
- */
- do_shrink_zone(0, zone, &sc);
-
- trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
-
- return sc.nr_reclaimed;
-}
-
unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
gfp_t gfp_mask,
bool noswap,
@@ -2479,13 +2442,6 @@ loop_again:
continue;

sc.nr_scanned = 0;
-
- /*
- * Call soft limit reclaim before calling shrink_zone.
- * For now we ignore the return value
- */
- //mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask);
-
/*
* We put equal pressure on every zone, unless
* one zone has way too many pages free
--
1.7.5.2

2011-06-01 06:25:49

by Johannes Weiner

Subject: [patch 6/8] vmscan: change zone_nr_lru_pages to take memcg instead of scan control

This function only uses sc->mem_cgroup from the scan control. Change
it to take a memcg argument directly, so callsites without an actual
reclaim context can use it as well.
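
For illustration, here is a tiny self-contained model of the new
calling convention (userspace, not kernel code): with the memcg
passed explicitly, a caller that has no scan_control can query
per-memcg counts directly, and passing NULL falls back to the
zone-wide counts.

#include <stdio.h>

enum lru_list { LRU_INACTIVE_ANON, LRU_ACTIVE_ANON, LRU_INACTIVE_FILE,
		LRU_ACTIVE_FILE, NR_LRU_LISTS };

struct memcg { unsigned long lru_pages[NR_LRU_LISTS]; };
struct zone  { unsigned long lru_pages[NR_LRU_LISTS]; };

/* model of the new signature: memcg counts if @mem is set, else zone counts */
static unsigned long zone_nr_lru_pages(struct zone *zone, struct memcg *mem,
				       enum lru_list lru)
{
	if (mem)
		return mem->lru_pages[lru];
	return zone->lru_pages[lru];
}

int main(void)
{
	struct zone zone = { .lru_pages = { 100, 50, 400, 300 } };
	struct memcg mem = { .lru_pages = {  10,  5,  40,  30 } };

	/* a callsite without a reclaim context passes the memcg directly */
	printf("memcg inactive file pages: %lu\n",
	       zone_nr_lru_pages(&zone, &mem, LRU_INACTIVE_FILE));
	printf("zone inactive file pages:  %lu\n",
	       zone_nr_lru_pages(&zone, NULL, LRU_INACTIVE_FILE));
	return 0;
}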

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 22 ++++++++++++----------
1 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7d74e48..9c51ec8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -205,10 +205,11 @@ static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
}

static unsigned long zone_nr_lru_pages(struct zone *zone,
- struct scan_control *sc, enum lru_list lru)
+ struct mem_cgroup *mem,
+ enum lru_list lru)
{
- if (!scanning_global_lru(sc))
- return mem_cgroup_zone_nr_pages(sc->mem_cgroup, zone, lru);
+ if (mem)
+ return mem_cgroup_zone_nr_pages(mem, zone, lru);

return zone_page_state(zone, NR_LRU_BASE + lru);
}
@@ -1780,10 +1781,10 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
goto out;
}

- anon = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_ANON) +
- zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON);
- file = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
- zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);
+ anon = zone_nr_lru_pages(zone, sc->mem_cgroup, LRU_ACTIVE_ANON) +
+ zone_nr_lru_pages(zone, sc->mem_cgroup, LRU_INACTIVE_ANON);
+ file = zone_nr_lru_pages(zone, sc->mem_cgroup, LRU_ACTIVE_FILE) +
+ zone_nr_lru_pages(zone, sc->mem_cgroup, LRU_INACTIVE_FILE);

if (global_reclaim(sc)) {
free = zone_page_state(zone, NR_FREE_PAGES);
@@ -1846,7 +1847,7 @@ out:
int file = is_file_lru(l);
unsigned long scan;

- scan = zone_nr_lru_pages(zone, sc, l);
+ scan = zone_nr_lru_pages(zone, sc->mem_cgroup, l);
if (priority || noswap) {
scan >>= priority;
scan = div64_u64(scan * fraction[file], denominator);
@@ -1903,8 +1904,9 @@ static inline bool should_continue_reclaim(struct zone *zone,
* inactive lists are large enough, continue reclaiming
*/
pages_for_compaction = (2UL << sc->order);
- inactive_lru_pages = zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON) +
- zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);
+ inactive_lru_pages =
+ zone_nr_lru_pages(zone, sc->mem_cgroup, LRU_INACTIVE_ANON) +
+ zone_nr_lru_pages(zone, sc->mem_cgroup, LRU_INACTIVE_FILE);
if (sc->nr_reclaimed < pages_for_compaction &&
inactive_lru_pages > pages_for_compaction)
return true;
--
1.7.5.2

2011-06-01 06:27:19

by Johannes Weiner

[permalink] [raw]
Subject: [patch 7/8] vmscan: memcg-aware unevictable page rescue scanner

Once the per-memcg lru lists are exclusive, the unevictable page
rescue scanner can no longer work on the global zone lru lists.

This converts it to go through all memcgs and scan their respective
unevictable lists instead.
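
The resulting structure is roughly the following sketch, where
scan_unevictable_list() is only a placeholder for the batched scanning
that the vmscan.c hunk below does inline:

	struct mem_cgroup *first, *mem = NULL;

	first = mem = mem_cgroup_hierarchy_walk(NULL, mem);
	do {
		/* check this memcg's unevictable pages in the zone */
		scan_unevictable_list(zone, mem);
		mem = mem_cgroup_hierarchy_walk(NULL, mem);
	} while (mem != first);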

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 2 +
mm/memcontrol.c | 11 +++++++++
mm/vmscan.c | 53 +++++++++++++++++++++++++++----------------
3 files changed, 46 insertions(+), 20 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index cb02c00..56c1def 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -60,6 +60,8 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);

extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
+struct page *mem_cgroup_lru_to_page(struct zone *, struct mem_cgroup *,
+ enum lru_list);
extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 78ae4dd..d9d1a7e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -656,6 +656,17 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
* When moving account, the page is not on LRU. It's isolated.
*/

+struct page *mem_cgroup_lru_to_page(struct zone *zone, struct mem_cgroup *mem,
+ enum lru_list lru)
+{
+ struct mem_cgroup_per_zone *mz;
+ struct page_cgroup *pc;
+
+ mz = mem_cgroup_zoneinfo(mem, zone_to_nid(zone), zone_idx(zone));
+ pc = list_entry(mz->lists[lru].prev, struct page_cgroup, lru);
+ return lookup_cgroup_page(pc);
+}
+
void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
{
struct page_cgroup *pc;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c51ec8..23fd2b1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3233,6 +3233,14 @@ void scan_mapping_unevictable_pages(struct address_space *mapping)

}

+static struct page *lru_tailpage(struct zone *zone, struct mem_cgroup *mem,
+ enum lru_list lru)
+{
+ if (mem)
+ return mem_cgroup_lru_to_page(zone, mem, lru);
+ return lru_to_page(&zone->lru[lru].list);
+}
+
/**
* scan_zone_unevictable_pages - check unevictable list for evictable pages
* @zone - zone of which to scan the unevictable list
@@ -3246,32 +3254,37 @@ void scan_mapping_unevictable_pages(struct address_space *mapping)
#define SCAN_UNEVICTABLE_BATCH_SIZE 16UL /* arbitrary lock hold batch size */
static void scan_zone_unevictable_pages(struct zone *zone)
{
- struct list_head *l_unevictable = &zone->lru[LRU_UNEVICTABLE].list;
- unsigned long scan;
- unsigned long nr_to_scan = zone_page_state(zone, NR_UNEVICTABLE);
+ struct mem_cgroup *first, *mem = NULL;

- while (nr_to_scan > 0) {
- unsigned long batch_size = min(nr_to_scan,
- SCAN_UNEVICTABLE_BATCH_SIZE);
+ first = mem = mem_cgroup_hierarchy_walk(NULL, mem);
+ do {
+ unsigned long nr_to_scan;

- spin_lock_irq(&zone->lru_lock);
- for (scan = 0; scan < batch_size; scan++) {
- struct page *page = lru_to_page(l_unevictable);
+ nr_to_scan = zone_nr_lru_pages(zone, mem, LRU_UNEVICTABLE);
+ while (nr_to_scan > 0) {
+ unsigned long batch_size;
+ unsigned long scan;

- if (!trylock_page(page))
- continue;
+ batch_size = min(nr_to_scan,
+ SCAN_UNEVICTABLE_BATCH_SIZE);

- prefetchw_prev_lru_page(page, l_unevictable, flags);
-
- if (likely(PageLRU(page) && PageUnevictable(page)))
- check_move_unevictable_page(page, zone);
+ spin_lock_irq(&zone->lru_lock);
+ for (scan = 0; scan < batch_size; scan++) {
+ struct page *page;

- unlock_page(page);
+ page = lru_tailpage(zone, mem, LRU_UNEVICTABLE);
+ if (!trylock_page(page))
+ continue;
+ if (likely(PageLRU(page) &&
+ PageUnevictable(page)))
+ check_move_unevictable_page(page, zone);
+ unlock_page(page);
+ }
+ spin_unlock_irq(&zone->lru_lock);
+ nr_to_scan -= batch_size;
}
- spin_unlock_irq(&zone->lru_lock);
-
- nr_to_scan -= batch_size;
- }
+ mem = mem_cgroup_hierarchy_walk(NULL, mem);
+ } while (mem != first);
}


--
1.7.5.2

2011-06-01 06:26:23

by Johannes Weiner

[permalink] [raw]
Subject: [patch 8/8] mm: make per-memcg lru lists exclusive

All lru list walkers have been converted to operate on per-memcg
lists, the global per-zone lists are no longer required.

This patch makes the per-memcg lists exclusive and removes the global
lists from memcg-enabled kernels.

The per-memcg lists now link up the page descriptors directly, which
unifies and simplifies the list isolation code of page reclaim, and it
also saves a full doubly-linked list head for every page in the system.

At the core of this change is the introduction of the lruvec
structure, an array of all lru list heads. It exists for each zone
globally, and for each zone per memcg. All lru list operations are
now done in generic code against lruvecs, with the memcg lru list
primitives only doing accounting and returning the proper lruvec for
the currently scanned memcg on isolation, or for the respective page
on putback.
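
Schematically (an illustrative sketch only, the real definitions are in
the hunks below), adding a page to an lru now follows this pattern:

	struct lruvec {
		struct list_head lists[NR_LRU_LISTS];
	};

	/* the memcg hook accounts the page and picks the right lruvec... */
	lruvec = mem_cgroup_lru_add_list(zone, page, lru);
	/* ...and generic code does the physical linking */
	list_add(&page->lru, &lruvec->lists[lru]);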

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 53 ++++-----
include/linux/mm_inline.h | 14 ++-
include/linux/mmzone.h | 10 +-
include/linux/page_cgroup.h | 36 ------
mm/memcontrol.c | 271 ++++++++++++++++++-------------------------
mm/page_alloc.c | 2 +-
mm/page_cgroup.c | 38 +------
mm/swap.c | 20 ++--
mm/vmscan.c | 88 ++++++--------
9 files changed, 207 insertions(+), 325 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 56c1def..d3837f0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -20,6 +20,7 @@
#ifndef _LINUX_MEMCONTROL_H
#define _LINUX_MEMCONTROL_H
#include <linux/cgroup.h>
+#include <linux/mmzone.h>
struct mem_cgroup;
struct page_cgroup;
struct page;
@@ -30,13 +31,6 @@ enum mem_cgroup_page_stat_item {
MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
};

-extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
- struct list_head *dst,
- unsigned long *scanned, int order,
- int mode, struct zone *z,
- struct mem_cgroup *mem_cont,
- int active, int file);
-
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
/*
* All "charge" functions with gfp_mask should use GFP_KERNEL or
@@ -60,15 +54,14 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);

extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
-struct page *mem_cgroup_lru_to_page(struct zone *, struct mem_cgroup *,
- enum lru_list);
-extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
-extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
-extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
-extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
-extern void mem_cgroup_del_lru(struct page *page);
-extern void mem_cgroup_move_lists(struct page *page,
- enum lru_list from, enum lru_list to);
+
+struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
+struct lruvec *mem_cgroup_lru_add_list(struct zone *, struct page *,
+ enum lru_list);
+void mem_cgroup_lru_del_list(struct page *, enum lru_list);
+void mem_cgroup_lru_del(struct page *);
+struct lruvec *mem_cgroup_lru_move_lists(struct zone *, struct page *,
+ enum lru_list, enum lru_list);

/* For coalescing uncharge for reducing memcg' overhead*/
extern void mem_cgroup_uncharge_start(void);
@@ -214,33 +207,33 @@ static inline int mem_cgroup_shmem_charge_fallback(struct page *page,
return 0;
}

-static inline void mem_cgroup_add_lru_list(struct page *page, int lru)
-{
-}
-
-static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
+static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
+ struct mem_cgroup *mem)
{
- return ;
+ return &zone->lruvec;
}

-static inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
+static inline struct lruvec *mem_cgroup_lru_add_list(struct zone *zone,
+ struct page *page,
+ enum lru_list lru)
{
- return ;
+ return &zone->lruvec;
}

-static inline void mem_cgroup_rotate_lru_list(struct page *page, int lru)
+static inline void mem_cgroup_lru_del_list(struct page *page, enum lru_list lru)
{
- return ;
}

-static inline void mem_cgroup_del_lru(struct page *page)
+static inline void mem_cgroup_lru_del(struct page *page)
{
- return ;
}

-static inline void
-mem_cgroup_move_lists(struct page *page, enum lru_list from, enum lru_list to)
+static inline struct lruvec *mem_cgroup_lru_move_lists(struct zone *zone,
+ struct page *page,
+ enum lru_list from,
+ enum lru_list to)
{
+ return &zone->lruvec;
}

static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 8f7d247..43d5d9f 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -25,23 +25,27 @@ static inline void
__add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
struct list_head *head)
{
+ /* NOTE: Caller must ensure @head is on the right lruvec! */
+ mem_cgroup_lru_add_list(zone, page, l);
list_add(&page->lru, head);
__mod_zone_page_state(zone, NR_LRU_BASE + l, hpage_nr_pages(page));
- mem_cgroup_add_lru_list(page, l);
}

static inline void
add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
{
- __add_page_to_lru_list(zone, page, l, &zone->lru[l].list);
+ struct lruvec *lruvec = mem_cgroup_lru_add_list(zone, page, l);
+
+ list_add(&page->lru, &lruvec->lists[l]);
+ __mod_zone_page_state(zone, NR_LRU_BASE + l, hpage_nr_pages(page));
}

static inline void
del_page_from_lru_list(struct zone *zone, struct page *page, enum lru_list l)
{
+ mem_cgroup_lru_del_list(page, l);
list_del(&page->lru);
__mod_zone_page_state(zone, NR_LRU_BASE + l, -hpage_nr_pages(page));
- mem_cgroup_del_lru_list(page, l);
}

/**
@@ -64,7 +68,6 @@ del_page_from_lru(struct zone *zone, struct page *page)
{
enum lru_list l;

- list_del(&page->lru);
if (PageUnevictable(page)) {
__ClearPageUnevictable(page);
l = LRU_UNEVICTABLE;
@@ -75,8 +78,9 @@ del_page_from_lru(struct zone *zone, struct page *page)
l += LRU_ACTIVE;
}
}
+ mem_cgroup_lru_del_list(page, l);
+ list_del(&page->lru);
__mod_zone_page_state(zone, NR_LRU_BASE + l, -hpage_nr_pages(page));
- mem_cgroup_del_lru_list(page, l);
}

/**
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e56f835..c2ddce5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -158,6 +158,10 @@ static inline int is_unevictable_lru(enum lru_list l)
return (l == LRU_UNEVICTABLE);
}

+struct lruvec {
+ struct list_head lists[NR_LRU_LISTS];
+};
+
enum zone_watermarks {
WMARK_MIN,
WMARK_LOW,
@@ -344,10 +348,8 @@ struct zone {
ZONE_PADDING(_pad1_)

/* Fields commonly accessed by the page reclaim scanner */
- spinlock_t lru_lock;
- struct zone_lru {
- struct list_head list;
- } lru[NR_LRU_LISTS];
+ spinlock_t lru_lock;
+ struct lruvec lruvec;

struct zone_reclaim_stat reclaim_stat;

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 961ecc7..a42ddf9 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -31,7 +31,6 @@ enum {
struct page_cgroup {
unsigned long flags;
struct mem_cgroup *mem_cgroup;
- struct list_head lru; /* per cgroup LRU list */
};

void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
@@ -49,7 +48,6 @@ static inline void __init page_cgroup_init(void)
#endif

struct page_cgroup *lookup_page_cgroup(struct page *page);
-struct page *lookup_cgroup_page(struct page_cgroup *pc);

#define TESTPCGFLAG(uname, lname) \
static inline int PageCgroup##uname(struct page_cgroup *pc) \
@@ -121,40 +119,6 @@ static inline void move_unlock_page_cgroup(struct page_cgroup *pc,
bit_spin_unlock(PCG_MOVE_LOCK, &pc->flags);
local_irq_restore(*flags);
}
-
-#ifdef CONFIG_SPARSEMEM
-#define PCG_ARRAYID_WIDTH SECTIONS_SHIFT
-#else
-#define PCG_ARRAYID_WIDTH NODES_SHIFT
-#endif
-
-#if (PCG_ARRAYID_WIDTH > BITS_PER_LONG - NR_PCG_FLAGS)
-#error Not enough space left in pc->flags to store page_cgroup array IDs
-#endif
-
-/* pc->flags: ARRAY-ID | FLAGS */
-
-#define PCG_ARRAYID_MASK ((1UL << PCG_ARRAYID_WIDTH) - 1)
-
-#define PCG_ARRAYID_OFFSET (BITS_PER_LONG - PCG_ARRAYID_WIDTH)
-/*
- * Zero the shift count for non-existent fields, to prevent compiler
- * warnings and ensure references are optimized away.
- */
-#define PCG_ARRAYID_SHIFT (PCG_ARRAYID_OFFSET * (PCG_ARRAYID_WIDTH != 0))
-
-static inline void set_page_cgroup_array_id(struct page_cgroup *pc,
- unsigned long id)
-{
- pc->flags &= ~(PCG_ARRAYID_MASK << PCG_ARRAYID_SHIFT);
- pc->flags |= (id & PCG_ARRAYID_MASK) << PCG_ARRAYID_SHIFT;
-}
-
-static inline unsigned long page_cgroup_array_id(struct page_cgroup *pc)
-{
- return (pc->flags >> PCG_ARRAYID_SHIFT) & PCG_ARRAYID_MASK;
-}
-
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct page_cgroup;

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d9d1a7e..4a365b7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -133,10 +133,7 @@ struct mem_cgroup_stat_cpu {
* per-zone information in memory controller.
*/
struct mem_cgroup_per_zone {
- /*
- * spin_lock to protect the per cgroup LRU
- */
- struct list_head lists[NR_LRU_LISTS];
+ struct lruvec lruvec;
unsigned long count[NR_LRU_LISTS];

struct zone_reclaim_stat reclaim_stat;
@@ -642,6 +639,26 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
return (mem == root_mem_cgroup);
}

+/**
+ * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
+ * @zone: zone of the wanted lruvec
+ * @mem: memcg of the wanted lruvec
+ *
+ * Returns the lru list vector holding pages for the given @zone and
+ * @mem. This can be the global zone lruvec, if the memory controller
+ * is disabled.
+ */
+struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone, struct mem_cgroup *mem)
+{
+ struct mem_cgroup_per_zone *mz;
+
+ if (mem_cgroup_disabled())
+ return &zone->lruvec;
+
+ mz = mem_cgroup_zoneinfo(mem, zone_to_nid(zone), zone_idx(zone));
+ return &mz->lruvec;
+}
+
/*
* Following LRU functions are allowed to be used without PCG_LOCK.
* Operations are called by routine of global LRU independently from memcg.
@@ -656,21 +673,74 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
* When moving account, the page is not on LRU. It's isolated.
*/

-struct page *mem_cgroup_lru_to_page(struct zone *zone, struct mem_cgroup *mem,
- enum lru_list lru)
+/**
+ * mem_cgroup_lru_add_list - account for adding an lru page and return lruvec
+ * @zone: zone of the page
+ * @page: the page itself
+ * @lru: target lru list
+ *
+ * This function must be called when a page is to be added to an lru
+ * list.
+ *
+ * Returns the lruvec to hold @page, the callsite is responsible for
+ * physically linking the page to &lruvec->lists[@lru].
+ */
+struct lruvec *mem_cgroup_lru_add_list(struct zone *zone, struct page *page,
+ enum lru_list lru)
{
struct mem_cgroup_per_zone *mz;
struct page_cgroup *pc;
+ struct mem_cgroup *mem;

- mz = mem_cgroup_zoneinfo(mem, zone_to_nid(zone), zone_idx(zone));
- pc = list_entry(mz->lists[lru].prev, struct page_cgroup, lru);
- return lookup_cgroup_page(pc);
+ if (mem_cgroup_disabled())
+ return &zone->lruvec;
+
+ pc = lookup_page_cgroup(page);
+ VM_BUG_ON(PageCgroupAcctLRU(pc));
+ if (PageCgroupUsed(pc)) {
+ /* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
+ smp_rmb();
+ mem = pc->mem_cgroup;
+ } else {
+ /*
+ * If the page is no longer charged, add it to the
+ * root memcg's lru. Either it will be freed soon, or
+ * it will get charged again and the charger will
+ * relink it to the right list.
+ */
+ mem = root_mem_cgroup;
+ }
+ mz = page_cgroup_zoneinfo(mem, page);
+ /*
+ * We do not account for uncharged pages: they are linked to
+ * root_mem_cgroup but when the page is unlinked upon free,
+ * accounting would be done against pc->mem_cgroup.
+ */
+ if (PageCgroupUsed(pc)) {
+ /*
+ * Huge page splitting is serialized through the lru
+ * lock, so compound_order() is stable here.
+ */
+ MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
+ SetPageCgroupAcctLRU(pc);
+ }
+ return &mz->lruvec;
}

-void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
+/**
+ * mem_cgroup_lru_del_list - account for removing an lru page
+ * @page: page to unlink
+ * @lru: lru list the page is sitting on
+ *
+ * This function must be called when a page is to be removed from an
+ * lru list.
+ *
+ * The callsite is responsible for physically unlinking &@page->lru.
+ */
+void mem_cgroup_lru_del_list(struct page *page, enum lru_list lru)
{
- struct page_cgroup *pc;
struct mem_cgroup_per_zone *mz;
+ struct page_cgroup *pc;

if (mem_cgroup_disabled())
return;
@@ -686,75 +756,35 @@ void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
/* huge page split is done under lru_lock. so, we have no races. */
MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
- VM_BUG_ON(list_empty(&pc->lru));
- list_del_init(&pc->lru);
}

-void mem_cgroup_del_lru(struct page *page)
+void mem_cgroup_lru_del(struct page *page)
{
- mem_cgroup_del_lru_list(page, page_lru(page));
+ mem_cgroup_lru_del_list(page, page_lru(page));
}

-/*
- * Writeback is about to end against a page which has been marked for immediate
- * reclaim. If it still appears to be reclaimable, move it to the tail of the
- * inactive list.
+/**
+ * mem_cgroup_lru_move_lists - account for moving a page between lru lists
+ * @zone: zone of the page
+ * @page: page to move
+ * @from: current lru list
+ * @to: new lru list
+ *
+ * This function must be called when a page is moved between lru
+ * lists, or rotated on the same lru list.
+ *
+ * Returns the lruvec to hold @page in the future, the callsite is
+ * responsible for physically relinking the page to
+ * &lruvec->lists[@to].
*/
-void mem_cgroup_rotate_reclaimable_page(struct page *page)
-{
- struct mem_cgroup_per_zone *mz;
- struct page_cgroup *pc;
- enum lru_list lru = page_lru(page);
-
- if (mem_cgroup_disabled())
- return;
-
- pc = lookup_page_cgroup(page);
- /* unused page is not rotated. */
- if (!PageCgroupUsed(pc))
- return;
- /* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
- smp_rmb();
- mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
- list_move_tail(&pc->lru, &mz->lists[lru]);
-}
-
-void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
+struct lruvec *mem_cgroup_lru_move_lists(struct zone *zone,
+ struct page *page,
+ enum lru_list from,
+ enum lru_list to)
{
- struct mem_cgroup_per_zone *mz;
- struct page_cgroup *pc;
-
- if (mem_cgroup_disabled())
- return;
-
- pc = lookup_page_cgroup(page);
- /* unused page is not rotated. */
- if (!PageCgroupUsed(pc))
- return;
- /* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
- smp_rmb();
- mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
- list_move(&pc->lru, &mz->lists[lru]);
-}
-
-void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
-{
- struct page_cgroup *pc;
- struct mem_cgroup_per_zone *mz;
-
- if (mem_cgroup_disabled())
- return;
- pc = lookup_page_cgroup(page);
- VM_BUG_ON(PageCgroupAcctLRU(pc));
- if (!PageCgroupUsed(pc))
- return;
- /* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
- smp_rmb();
- mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
- /* huge page split is done under lru_lock. so, we have no races. */
- MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
- SetPageCgroupAcctLRU(pc);
- list_add(&pc->lru, &mz->lists[lru]);
+ /* TODO: this could be optimized, especially if from == to */
+ mem_cgroup_lru_del_list(page, from);
+ return mem_cgroup_lru_add_list(zone, page, to);
}

/*
@@ -786,7 +816,7 @@ static void mem_cgroup_lru_del_before_commit(struct page *page)
* is guarded by lock_page() because the page is SwapCache.
*/
if (!PageCgroupUsed(pc))
- mem_cgroup_del_lru_list(page, page_lru(page));
+ del_page_from_lru(zone, page);
spin_unlock_irqrestore(&zone->lru_lock, flags);
}

@@ -800,22 +830,11 @@ static void mem_cgroup_lru_add_after_commit(struct page *page)
if (likely(!PageLRU(page)))
return;
spin_lock_irqsave(&zone->lru_lock, flags);
- /* link when the page is linked to LRU but page_cgroup isn't */
if (PageLRU(page) && !PageCgroupAcctLRU(pc))
- mem_cgroup_add_lru_list(page, page_lru(page));
+ add_page_to_lru_list(zone, page, page_lru(page));
spin_unlock_irqrestore(&zone->lru_lock, flags);
}

-
-void mem_cgroup_move_lists(struct page *page,
- enum lru_list from, enum lru_list to)
-{
- if (mem_cgroup_disabled())
- return;
- mem_cgroup_del_lru_list(page, from);
- mem_cgroup_add_lru_list(page, to);
-}
-
int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
{
int ret;
@@ -935,67 +954,6 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
return &mz->reclaim_stat;
}

-unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
- struct list_head *dst,
- unsigned long *scanned, int order,
- int mode, struct zone *z,
- struct mem_cgroup *mem_cont,
- int active, int file)
-{
- unsigned long nr_taken = 0;
- struct page *page;
- unsigned long scan;
- LIST_HEAD(pc_list);
- struct list_head *src;
- struct page_cgroup *pc, *tmp;
- int nid = zone_to_nid(z);
- int zid = zone_idx(z);
- struct mem_cgroup_per_zone *mz;
- int lru = LRU_FILE * file + active;
- int ret;
-
- BUG_ON(!mem_cont);
- mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
- src = &mz->lists[lru];
-
- scan = 0;
- list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
- if (scan >= nr_to_scan)
- break;
-
- if (unlikely(!PageCgroupUsed(pc)))
- continue;
-
- page = lookup_cgroup_page(pc);
-
- if (unlikely(!PageLRU(page)))
- continue;
-
- scan++;
- ret = __isolate_lru_page(page, mode, file);
- switch (ret) {
- case 0:
- list_move(&page->lru, dst);
- mem_cgroup_del_lru(page);
- nr_taken += hpage_nr_pages(page);
- break;
- case -EBUSY:
- /* we don't affect global LRU but rotate in our LRU */
- mem_cgroup_rotate_lru_list(page, page_lru(page));
- break;
- default:
- break;
- }
- }
-
- *scanned = scan;
-
- trace_mm_vmscan_memcg_isolate(0, nr_to_scan, scan, nr_taken,
- 0, 0, 0, mode);
-
- return nr_taken;
-}
-
#define mem_cgroup_from_res_counter(counter, member) \
container_of(counter, struct mem_cgroup, member)

@@ -3110,22 +3068,23 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
int node, int zid, enum lru_list lru)
{
- struct zone *zone;
struct mem_cgroup_per_zone *mz;
- struct page_cgroup *pc, *busy;
unsigned long flags, loop;
struct list_head *list;
+ struct page *busy;
+ struct zone *zone;
int ret = 0;

zone = &NODE_DATA(node)->node_zones[zid];
mz = mem_cgroup_zoneinfo(mem, node, zid);
- list = &mz->lists[lru];
+ list = &mz->lruvec.lists[lru];

loop = MEM_CGROUP_ZSTAT(mz, lru);
/* give some margin against EBUSY etc...*/
loop += 256;
busy = NULL;
while (loop--) {
+ struct page_cgroup *pc;
struct page *page;

ret = 0;
@@ -3134,16 +3093,16 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
spin_unlock_irqrestore(&zone->lru_lock, flags);
break;
}
- pc = list_entry(list->prev, struct page_cgroup, lru);
- if (busy == pc) {
- list_move(&pc->lru, list);
+ page = list_entry(list->prev, struct page, lru);
+ if (busy == page) {
+ list_move(&page->lru, list);
busy = NULL;
spin_unlock_irqrestore(&zone->lru_lock, flags);
continue;
}
spin_unlock_irqrestore(&zone->lru_lock, flags);

- page = lookup_cgroup_page(pc);
+ pc = lookup_page_cgroup(page);

ret = mem_cgroup_move_parent(page, pc, mem, GFP_KERNEL);
if (ret == -ENOMEM)
@@ -3151,7 +3110,7 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,

if (ret == -EBUSY || ret == -EINVAL) {
/* found lock contention or "pc" is obsolete. */
- busy = pc;
+ busy = page;
cond_resched();
} else
busy = NULL;
@@ -4171,7 +4130,7 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
for (zone = 0; zone < MAX_NR_ZONES; zone++) {
mz = &pn->zoneinfo[zone];
for_each_lru(l)
- INIT_LIST_HEAD(&mz->lists[l]);
+ INIT_LIST_HEAD(&mz->lruvec.lists[l]);
}
return 0;
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3f8bce2..9da238d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4289,7 +4289,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,

zone_pcp_init(zone);
for_each_lru(l) {
- INIT_LIST_HEAD(&zone->lru[l].list);
+ INIT_LIST_HEAD(&zone->lruvec.lists[l]);
zone->reclaim_stat.nr_saved_scan[l] = 0;
}
zone->reclaim_stat.recent_rotated[0] = 0;
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 2daadc3..916c6f9 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -11,12 +11,10 @@
#include <linux/swapops.h>
#include <linux/kmemleak.h>

-static void __meminit init_page_cgroup(struct page_cgroup *pc, unsigned long id)
+static void __meminit init_page_cgroup(struct page_cgroup *pc)
{
pc->flags = 0;
- set_page_cgroup_array_id(pc, id);
pc->mem_cgroup = NULL;
- INIT_LIST_HEAD(&pc->lru);
}
static unsigned long total_usage;

@@ -42,19 +40,6 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
return base + offset;
}

-struct page *lookup_cgroup_page(struct page_cgroup *pc)
-{
- unsigned long pfn;
- struct page *page;
- pg_data_t *pgdat;
-
- pgdat = NODE_DATA(page_cgroup_array_id(pc));
- pfn = pc - pgdat->node_page_cgroup + pgdat->node_start_pfn;
- page = pfn_to_page(pfn);
- VM_BUG_ON(pc != lookup_page_cgroup(page));
- return page;
-}
-
static int __init alloc_node_page_cgroup(int nid)
{
struct page_cgroup *base, *pc;
@@ -75,7 +60,7 @@ static int __init alloc_node_page_cgroup(int nid)
return -ENOMEM;
for (index = 0; index < nr_pages; index++) {
pc = base + index;
- init_page_cgroup(pc, nid);
+ init_page_cgroup(pc);
}
NODE_DATA(nid)->node_page_cgroup = base;
total_usage += table_size;
@@ -117,19 +102,6 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
return section->page_cgroup + pfn;
}

-struct page *lookup_cgroup_page(struct page_cgroup *pc)
-{
- struct mem_section *section;
- struct page *page;
- unsigned long nr;
-
- nr = page_cgroup_array_id(pc);
- section = __nr_to_section(nr);
- page = pfn_to_page(pc - section->page_cgroup);
- VM_BUG_ON(pc != lookup_page_cgroup(page));
- return page;
-}
-
static void *__init_refok alloc_page_cgroup(size_t size, int nid)
{
void *addr = NULL;
@@ -167,11 +139,9 @@ static int __init_refok init_section_page_cgroup(unsigned long pfn)
struct page_cgroup *base, *pc;
struct mem_section *section;
unsigned long table_size;
- unsigned long nr;
int nid, index;

- nr = pfn_to_section_nr(pfn);
- section = __nr_to_section(nr);
+ section = __pfn_to_section(pfn);

if (section->page_cgroup)
return 0;
@@ -194,7 +164,7 @@ static int __init_refok init_section_page_cgroup(unsigned long pfn)

for (index = 0; index < PAGES_PER_SECTION; index++) {
pc = base + index;
- init_page_cgroup(pc, nr);
+ init_page_cgroup(pc);
}

section->page_cgroup = base - pfn;
diff --git a/mm/swap.c b/mm/swap.c
index 5602f1a..0a5a93b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -209,12 +209,14 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
static void pagevec_move_tail_fn(struct page *page, void *arg)
{
int *pgmoved = arg;
- struct zone *zone = page_zone(page);

if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
enum lru_list lru = page_lru_base_type(page);
- list_move_tail(&page->lru, &zone->lru[lru].list);
- mem_cgroup_rotate_reclaimable_page(page);
+ struct lruvec *lruvec;
+
+ lruvec = mem_cgroup_lru_move_lists(page_zone(page),
+ page, lru, lru);
+ list_move_tail(&page->lru, &lruvec->lists[lru]);
(*pgmoved)++;
}
}
@@ -420,12 +422,13 @@ static void lru_deactivate_fn(struct page *page, void *arg)
*/
SetPageReclaim(page);
} else {
+ struct lruvec *lruvec;
/*
* The page's writeback ends up during pagevec
* We moves tha page into tail of inactive.
*/
- list_move_tail(&page->lru, &zone->lru[lru].list);
- mem_cgroup_rotate_reclaimable_page(page);
+ lruvec = mem_cgroup_lru_move_lists(zone, page, lru, lru);
+ list_move_tail(&page->lru, &lruvec->lists[lru]);
__count_vm_event(PGROTATED);
}

@@ -597,7 +600,6 @@ void lru_add_page_tail(struct zone* zone,
int active;
enum lru_list lru;
const int file = 0;
- struct list_head *head;

VM_BUG_ON(!PageHead(page));
VM_BUG_ON(PageCompound(page_tail));
@@ -617,10 +619,10 @@ void lru_add_page_tail(struct zone* zone,
}
update_page_reclaim_stat(zone, page_tail, file, active);
if (likely(PageLRU(page)))
- head = page->lru.prev;
+ __add_page_to_lru_list(zone, page_tail, lru,
+ page->lru.prev);
else
- head = &zone->lru[lru].list;
- __add_page_to_lru_list(zone, page_tail, lru, head);
+ add_page_to_lru_list(zone, page_tail, lru);
} else {
SetPageUnevictable(page_tail);
add_page_to_lru_list(zone, page_tail, LRU_UNEVICTABLE);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 23fd2b1..87e1fcb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1080,15 +1080,14 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,

switch (__isolate_lru_page(page, mode, file)) {
case 0:
+ mem_cgroup_lru_del(page);
list_move(&page->lru, dst);
- mem_cgroup_del_lru(page);
nr_taken += hpage_nr_pages(page);
break;

case -EBUSY:
/* else it is being freed elsewhere */
list_move(&page->lru, src);
- mem_cgroup_rotate_lru_list(page, page_lru(page));
continue;

default:
@@ -1138,8 +1137,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
break;

if (__isolate_lru_page(cursor_page, mode, file) == 0) {
+ mem_cgroup_lru_del(cursor_page);
list_move(&cursor_page->lru, dst);
- mem_cgroup_del_lru(cursor_page);
nr_taken += hpage_nr_pages(page);
nr_lumpy_taken++;
if (PageDirty(cursor_page))
@@ -1168,19 +1167,22 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
return nr_taken;
}

-static unsigned long isolate_pages_global(unsigned long nr,
- struct list_head *dst,
- unsigned long *scanned, int order,
- int mode, struct zone *z,
- int active, int file)
+static unsigned long isolate_pages(unsigned long nr,
+ struct list_head *dst,
+ unsigned long *scanned, int order,
+ int mode, struct zone *z,
+ int active, int file,
+ struct mem_cgroup *mem)
{
+ struct lruvec *lruvec = mem_cgroup_zone_lruvec(z, mem);
int lru = LRU_BASE;
+
if (active)
lru += LRU_ACTIVE;
if (file)
lru += LRU_FILE;
- return isolate_lru_pages(nr, &z->lru[lru].list, dst, scanned, order,
- mode, file);
+ return isolate_lru_pages(nr, &lruvec->lists[lru], dst,
+ scanned, order, mode, file);
}

/*
@@ -1428,20 +1430,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
lru_add_drain();
spin_lock_irq(&zone->lru_lock);

- if (scanning_global_lru(sc)) {
- nr_taken = isolate_pages_global(nr_to_scan,
- &page_list, &nr_scanned, sc->order,
- sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
- ISOLATE_BOTH : ISOLATE_INACTIVE,
- zone, 0, file);
- } else {
- nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
- &page_list, &nr_scanned, sc->order,
- sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
+ nr_taken = isolate_pages(nr_to_scan,
+ &page_list, &nr_scanned, sc->order,
+ sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
ISOLATE_BOTH : ISOLATE_INACTIVE,
- zone, sc->mem_cgroup,
- 0, file);
- }
+ zone, 0, file, sc->mem_cgroup);

if (global_reclaim(sc)) {
zone->pages_scanned += nr_scanned;
@@ -1514,13 +1507,15 @@ static void move_active_pages_to_lru(struct zone *zone,
pagevec_init(&pvec, 1);

while (!list_empty(list)) {
+ struct lruvec *lruvec;
+
page = lru_to_page(list);

VM_BUG_ON(PageLRU(page));
SetPageLRU(page);

- list_move(&page->lru, &zone->lru[lru].list);
- mem_cgroup_add_lru_list(page, lru);
+ lruvec = mem_cgroup_lru_add_list(zone, page, lru);
+ list_move(&page->lru, &lruvec->lists[lru]);
pgmoved += hpage_nr_pages(page);

if (!pagevec_add(&pvec, page) || list_empty(list)) {
@@ -1551,17 +1546,10 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,

lru_add_drain();
spin_lock_irq(&zone->lru_lock);
- if (scanning_global_lru(sc)) {
- nr_taken = isolate_pages_global(nr_pages, &l_hold,
- &pgscanned, sc->order,
- ISOLATE_ACTIVE, zone,
- 1, file);
- } else {
- nr_taken = mem_cgroup_isolate_pages(nr_pages, &l_hold,
- &pgscanned, sc->order,
- ISOLATE_ACTIVE, zone,
- sc->mem_cgroup, 1, file);
- }
+ nr_taken = isolate_pages(nr_pages, &l_hold,
+ &pgscanned, sc->order,
+ ISOLATE_ACTIVE, zone,
+ 1, file, sc->mem_cgroup);

if (global_reclaim(sc))
zone->pages_scanned += pgscanned;
@@ -3154,16 +3142,18 @@ int page_evictable(struct page *page, struct vm_area_struct *vma)
*/
static void check_move_unevictable_page(struct page *page, struct zone *zone)
{
- VM_BUG_ON(PageActive(page));
+ struct lruvec *lruvec;

+ VM_BUG_ON(PageActive(page));
retry:
ClearPageUnevictable(page);
if (page_evictable(page, NULL)) {
enum lru_list l = page_lru_base_type(page);

+ lruvec = mem_cgroup_lru_move_lists(zone, page,
+ LRU_UNEVICTABLE, l);
__dec_zone_state(zone, NR_UNEVICTABLE);
- list_move(&page->lru, &zone->lru[l].list);
- mem_cgroup_move_lists(page, LRU_UNEVICTABLE, l);
+ list_move(&page->lru, &lruvec->lists[l]);
__inc_zone_state(zone, NR_INACTIVE_ANON + l);
__count_vm_event(UNEVICTABLE_PGRESCUED);
} else {
@@ -3171,8 +3161,9 @@ retry:
* rotate unevictable list
*/
SetPageUnevictable(page);
- list_move(&page->lru, &zone->lru[LRU_UNEVICTABLE].list);
- mem_cgroup_rotate_lru_list(page, LRU_UNEVICTABLE);
+ lruvec = mem_cgroup_lru_move_lists(zone, page, LRU_UNEVICTABLE,
+ LRU_UNEVICTABLE);
+ list_move(&page->lru, &lruvec->lists[LRU_UNEVICTABLE]);
if (page_evictable(page, NULL))
goto retry;
}
@@ -3233,14 +3224,6 @@ void scan_mapping_unevictable_pages(struct address_space *mapping)

}

-static struct page *lru_tailpage(struct zone *zone, struct mem_cgroup *mem,
- enum lru_list lru)
-{
- if (mem)
- return mem_cgroup_lru_to_page(zone, mem, lru);
- return lru_to_page(&zone->lru[lru].list);
-}
-
/**
* scan_zone_unevictable_pages - check unevictable list for evictable pages
* @zone - zone of which to scan the unevictable list
@@ -3259,8 +3242,13 @@ static void scan_zone_unevictable_pages(struct zone *zone)
first = mem = mem_cgroup_hierarchy_walk(NULL, mem);
do {
unsigned long nr_to_scan;
+ struct list_head *list;
+ struct lruvec *lruvec;

nr_to_scan = zone_nr_lru_pages(zone, mem, LRU_UNEVICTABLE);
+ lruvec = mem_cgroup_zone_lruvec(zone, mem);
+ list = &lruvec->lists[LRU_UNEVICTABLE];
+
while (nr_to_scan > 0) {
unsigned long batch_size;
unsigned long scan;
@@ -3272,7 +3260,7 @@ static void scan_zone_unevictable_pages(struct zone *zone)
for (scan = 0; scan < batch_size; scan++) {
struct page *page;

- page = lru_tailpage(zone, mem, LRU_UNEVICTABLE);
+ page = lru_to_page(list);
if (!trylock_page(page))
continue;
if (likely(PageLRU(page) &&
--
1.7.5.2

2011-06-01 23:52:50

by Hiroyuki Kamezawa

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

2011/6/1 Johannes Weiner <[email protected]>:
> [...]

Hmm, I welcome and will review these patches, but there are some points I want to raise.

1. Is there no more conflict with Ying's work?
Could you explain what she has and what you don't have in this v2?
If Ying's series has something good that should be merged into your set, please
include it.

2. It's required to see performance scores in the commit log.

3. I think dirty_ratio should be the first big patch to be merged. (But...hmm, Greg?)
My patches for asynchronous reclaim are not very important. I can rework them.

4. This work can be split into some smaller pieces:
a) fixes for the current code and cleanups
a') statistics
b) soft limit rework
c) changes to global reclaim

I like the (a)->(b)->(c) order, and during (b) you can merge your work
with Ying's.
And for a'), I'd like to add a new file, memory.reclaim_stat, as I've
already shown, and allow resetting it.

Hmm, how about splitting patch 2/8 into small patches and seeing what happens in
3.2 or 3.3? In the meantime, we can make the soft limit work better.
(And once we do 2/8, our direction will be fixed towards removing the
global LRU.)

5. Please write documentation to explain what the new LRU does.

BTW, after this work, the lru lists of the ROOT cgroup come back. I may need to check
the code paths which test whether a memcg is ROOT or not. Because we removed many atomic
ops in memcg, I wonder whether the ROOT cgroup can be accounted again.

Thanks,
-Kame

2011-06-02 00:36:07

by Greg Thelen

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

On Wed, Jun 1, 2011 at 4:52 PM, Hiroyuki Kamezawa
<[email protected]> wrote:
> 2011/6/1 Johannes Weiner <[email protected]>:
>> [...]
>
> Hmm, I welcome and will review this patches but.....some points I want to say.
>
> 1. No more conflict with Ying's work ?
>    Could you explain what she has and what you don't in this v2 ?
>    If Ying's one has something good to be merged to your set, please
> include it.
>
> 2. it's required to see performance score in commit log.
>
> 3. I think dirty_ratio as 1st big patch to be merged. (But...hmm..Greg ?
>    My patches for asynchronous reclaim is not very important. I can rework it.

I am testing the next version (v8) of the memcg dirty ratio patches. I expect
to have it posted for review later this week.

> 4. This work can be splitted into some small works.
>     a) fix for current code and clean ups
>     a') statistics
>     b) soft limit rework
>     c) change global reclaim
>
>  I like (a)->(b)->(c) order. and while (b) you can merge your work
> with Ying's one.
>  And for a') , I'd like to add a new file memory.reclaim_stat as I've
> already shown.
>  and allow resetting.
>
>  Hmm, how about splitting patch 2/8 into small patches and see what happens in
>  3.2 or 3.3 ? While that, we can make softlimit works better.
>  (and once we do 2/8, our direction will be fixed to the direction to
> remove global LRU.)
>
> 5. please write documentation to explain what new LRU do.
>
> BTW, after this work, lists of ROOT cgroup comes again. I may need to check
> codes which see memcg is ROOT or not. Because we removed many atomic
> ops in memcg, I wonder ROOT cgroup can be accounted again..
>
> Thanks,
> -Kame
>

2011-06-02 04:05:24

by Ying Han

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

On Wed, Jun 1, 2011 at 4:52 PM, Hiroyuki Kamezawa
<[email protected]> wrote:
> 2011/6/1 Johannes Weiner <[email protected]>:
>> [...]
>
> Hmm, I welcome and will review this patches but.....some points I want to say.
>
> 1. No more conflict with Ying's work ?
>    Could you explain what she has and what you don't in this v2 ?
>    If Ying's one has something good to be merged to your set, please
> include it.

The patch I sent out last time was a rework of soft_limit reclaim.
It converts the RB-tree-based selection into a round-robin walk over a
per-zone linked list of all memcgs that exceed their soft limit.

I will apply this patchset and try to test it. After that I will have a
better idea of whether or not it is covered here.

> 2. it's required to see performance score in commit log.
>
> 3. I think dirty_ratio as 1st big patch to be merged. (But...hmm..Greg ?
>    My patches for asynchronous reclaim is not very important. I can rework it.
>
> 4. This work can be splitted into some small works.
>     a) fix for current code and clean ups

>     a') statistics

>     b) soft limit rework

>     c) change global reclaim

My last patchset starts with a patch reverting the RB-tree
implementation of the soft_limit reclaim, and the new round-robin
implementation then comes in the following patches.

I like the ordering here, and it is consistent with the plan we
discussed earlier at LSF: changing global reclaim would be the last
step, once the changes before it have been well understood and tested.

Sorry if that is already how it is done here; I will read through the patchset.

--Ying
>
>  I like (a)->(b)->(c) order. and while (b) you can merge your work
> with Ying's one.
>  And for a') , I'd like to add a new file memory.reclaim_stat as I've
> already shown.
>  and allow resetting.
>
>  Hmm, how about splitting patch 2/8 into small patches and see what happens in
>  3.2 or 3.3 ? While that, we can make softlimit works better.
>  (and once we do 2/8, our direction will be fixed to the direction to
> remove global LRU.)
>
> 5. please write documentation to explain what new LRU do.
>
> BTW, after this work, lists of ROOT cgroup comes again. I may need to check
> codes which see memcg is ROOT or not. Because we removed many atomic
> ops in memcg, I wonder ROOT cgroup can be accounted again..
>
> Thanks,
> -Kame
>

2011-06-02 05:37:35

by Ying Han

[permalink] [raw]
Subject: Re: [patch 4/8] memcg: rework soft limit reclaim

On Tue, May 31, 2011 at 11:25 PM, Johannes Weiner <[email protected]> wrote:
> Currently, soft limit reclaim is entered from kswapd, where it selects
> the memcg with the biggest soft limit excess in absolute bytes, and
> reclaims pages from it with maximum aggressiveness (priority 0).
>
> This has the following disadvantages:
>
>    1. because of the aggressiveness, kswapd can be stalled on a memcg
>    that is hard to reclaim from for a long time, sending the rest of
>    the allocators into direct reclaim in the meantime.
>
>    2. it only considers the biggest offender (in absolute bytes, no
>    less, so very unhandy for setups with different-sized memcgs) and
>    does not apply any pressure at all on other memcgs in excess.
>
>    3. because it is only invoked from kswapd, the soft limit is
>    meaningful during global memory pressure, but it is not taken into
>    account during hierarchical target reclaim where it could allow
>    prioritizing memcgs as well.  So while it does hierarchical
>    reclaim once triggered, it is not a truly hierarchical mechanism.
>
> Here is a different approach.  Instead of having a soft limit reclaim
> cycle separate from the rest of reclaim, this patch ensures that each
> time a group of memcgs is reclaimed - be it because of global memory
> pressure or because of a hard limit - memcgs that exceed their soft
> limit, or contribute to the soft limit excess of one of their parents,
> are reclaimed from at a higher priority than their siblings.
>
> This results in the following:
>
>    1. all relevant memcgs are scanned with increasing priority during
>    memory pressure.  The primary goal is to free pages, not to punish
>    soft limit offenders.
>
>    2. increased pressure is applied to all memcgs in excess of their
>    soft limit, not only the biggest offender.
>
>    3. the soft limit becomes meaningful for target reclaim as well,
>    where it allows prioritizing children of a hierarchy when the
>    parent hits its limit.
>
>    4. direct reclaim now also applies increased soft limit pressure,
>    not just kswapd anymore.

So I see now that we removed the logic of doing per-zone soft_limit
reclaim entirely (including the next patch). Instead we are iterating
the whole memcg hierarchy under global memory pressure.

Is there a reason we didn't keep the per-zone memcg list, which allows
us to scan only the memcgs with pages on that zone?

--Ying



> [...]

2011-06-02 07:34:27

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

On Thu, Jun 02, 2011 at 08:52:47AM +0900, Hiroyuki Kamezawa wrote:
> 2011/6/1 Johannes Weiner <[email protected]>:
> > [...]
>
> Hmm, I welcome and will review these patches, but... some points I want to say.
>
> 1. No more conflict with Ying's work ?
> Could you explain what she has and what you don't in this v2 ?
> If Ying's one has something good to be merged to your set, please
> include it.

The problem is that the solution we came up with at LSF, i.e. the
one-dimensional linked list of soft limit-exceeding memcgs, is not
adequate to represent the hierarchy structure of memcgs.

My solution is fundamentally different, so I don't really see possible
synergy between the patch series right now.

This was the conclusion last time:
http://marc.info/?l=linux-mm&m=130564056215365&w=2

> 2. it's required to see performance score in commit log.

The patch series is not a performance optimization. But I can include
it to prove there are no regressions.

> 4. This work can be split into some smaller pieces.
> a) fix for current code and clean ups
> a') statistics
> b) soft limit rework
> c) change global reclaim
>
> I like (a)->(b)->(c) order, and while (b) you can merge your work
> with Ying's one.
> And for a'), I'd like to add a new file memory.reclaim_stat as I've
> already shown, and allow resetting.

Resetting reclaim statistics is a nice idea, let me have a look.
Sorry, I am a bit behind on reviewing other patches...

> Hmm, how about splitting patch 2/8 into small patches and see what happens in
> 3.2 or 3.3 ? While that, we can make softlimit works better.
> (and once we do 2/8, our direction will be fixed to the direction to
> remove global LRU.)

Do you have specific parts in mind that could go stand-alone?

One thing I can think of is splitting up those parts:

1. move /target/ reclaim to generic code

2. convert /global/ reclaim from global lru to hierarchy reclaim
including root_mem_cgroup

> 5. please write documentation to explain what new LRU do.

Ok.

> BTW, after this work, lists of ROOT cgroup comes again. I may need to check
> codes which see memcg is ROOT or not. Because we removed many atomic
> ops in memcg, I wonder ROOT cgroup can be accounted again..

Oh, please do if you can find the time. The memcg lru rules are
scary!

2011-06-02 07:50:48

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

On Wed, Jun 01, 2011 at 09:05:18PM -0700, Ying Han wrote:
> On Wed, Jun 1, 2011 at 4:52 PM, Hiroyuki Kamezawa
> <[email protected]> wrote:
> > 2011/6/1 Johannes Weiner <[email protected]>:
> >
> > Hmm, I welcome and will review these patches, but... some points I want to say.
> >
> > 1. No more conflict with Ying's work ?
> >    Could you explain what she has and what you don't in this v2 ?
> >    If Ying's one has something good to be merged to your set, please
> > include it.
>
> My patch I sent out last time was doing a rework of soft_limit reclaim.
> It converts the RB-tree-based implementation to a linked-list,
> round-robin walk of all memcgs that are over their soft limit, per zone.
>
> I will apply this patch and try to test it. After that I will get a
> better idea of whether or not it is being covered here.

Thanks!!

> > 4. This work can be split into some smaller pieces.
> >     a) fix for current code and clean ups
>
> >     a') statistics
>
> >     b) soft limit rework
>
> >     c) change global reclaim
>
> My last patchset starts with a patch reverting the RB-tree
> implementation of the soft_limit reclaim, and then the new round-robin
> implementation comes in the following patches.
>
> I like the ordering here, and that is consistent w/ the plan we
> discussed earlier at LSF. Changing the global reclaim would be the last
> step when the changes before that have been well understood and tested.
>
> Sorry if that is how it is done here. I will read through the patchset.

It's not. The way I implemented soft limits depends on global reclaim
performing hierarchical reclaim. I don't see how I can reverse the
order with this dependency.

2011-06-02 09:06:55

by Hiroyuki Kamezawa

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

2011/6/2 Johannes Weiner <[email protected]>:
> On Thu, Jun 02, 2011 at 08:52:47AM +0900, Hiroyuki Kamezawa wrote:
>> 2011/6/1 Johannes Weiner <[email protected]>:
>> Hmm, I welcome and will review these patches, but... some points I want to say.
>>
>> 1. No more conflict with Ying's work ?
>>     Could you explain what she has and what you don't in this v2 ?
>>     If Ying's one has something good to be merged to your set, please
>> include it.
>
> The problem is that the solution we came up with at LSF, i.e. the
> one-dimensional linked list of soft limit-exceeding memcgs, is not
> adequate to represent the hierarchy structure of memcgs.
>
> My solution is fundamentally different, so I don't really see possible
> synergy between the patch series right now.
>
> This was the conclusion last time:
> http://marc.info/?l=linux-mm&m=130564056215365&w=2
>

Hmm, will look.

IIUC, the current design of the per-zone tree is for supporting the current
policy in an efficient way: "pick up the largest usage excess memcg".

If we change policy, it's natural to make changes in implementation.


>> 2. it's required to see performance score in commit log.
>
> The patch series is not a performance optimization. But I can include
> it to prove there are no regressions.
>
yes, it's helpful.


>> 4. This work can be split into some smaller pieces.
>>      a) fix for current code and clean ups
>>      a') statistics
>>      b) soft limit rework
>>      c) change global reclaim
>>
>>   I like (a)->(b)->(c) order, and while (b) you can merge your work
>> with Ying's one.
>>   And for a'), I'd like to add a new file memory.reclaim_stat as I've
>> already shown, and allow resetting.
>
> Resetting reclaim statistics is a nice idea, let me have a look.
> Sorry, I am a bit behind on reviewing other patches...
>
I think I'll cut out the patch and merge it before my full work.


>>   Hmm, how about splitting patch 2/8 into small patches and see what happens in
>>   3.2 or 3.3 ? While that, we can make softlimit works better.
>>   (and once we do 2/8, our direction will be fixed to the direction to
>> remove global LRU.)
>
> Do you have specific parts in mind that could go stand-alone?
>
> One thing I can think of is splitting up those parts:
>
>  1. move /target/ reclaim to generic code
>
>  2. convert /global/ reclaim from global lru to hierarchy reclaim
>     including root_mem_cgroup
>

Hmm, at brief look
patch 2/8
- hierarchy walk rewrite code should be stand alone and can be merged
1st, as clean-up
- root cgroup LRU handling was required for performance. I think we
removed tons of
atomic ops and can remove that special handling personally. But this change of
root cgroup handling should be in separate patch. with performance report.
....

I'll do close look later, sorry.
-Kame



>> 5. please write documentation to explain what new LRU do.
>
> Ok.
>
>> BTW, after this work, lists of ROOT cgroup comes again. I may need to check
>> codes which see memcg is ROOT or not. Because we removed many atomic
>> ops in memcg, I wonder ROOT cgroup can be accounted again..
>
> Oh, please do if you can find the time. The memcg lru rules are
> scary!
>

IIRC, It was requested by Red*at ;)

Thanks,
-Kame

2011-06-02 10:00:37

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

On Thu, Jun 02, 2011 at 06:06:51PM +0900, Hiroyuki Kamezawa wrote:
> 2011/6/2 Johannes Weiner <[email protected]>:
> > On Thu, Jun 02, 2011 at 08:52:47AM +0900, Hiroyuki Kamezawa wrote:
> >> 2011/6/1 Johannes Weiner <[email protected]>:
> >
> > The patch series is not a performance optimization. But I can include
> > it to prove there are no regressions.
> >
> yes, it's helpful.

Okay.

> >>   Hmm, how about splitting patch 2/8 into small patches and see what happens in
> >>   3.2 or 3.3 ? While that, we can make softlimit works better.
> >>   (and once we do 2/8, our direction will be fixed to the direction to
> >> remove global LRU.)
> >
> > Do you have specific parts in mind that could go stand-alone?
> >
> > One thing I can think of is splitting up those parts:
> >
> >  1. move /target/ reclaim to generic code
> >
> >  2. convert /global/ reclaim from global lru to hierarchy reclaim
> >     including root_mem_cgroup
>
> Hmm, at brief look
> patch 2/8
> - hierarchy walk rewrite code should be stand alone and can be merged
> 1st, as clean-up

You mean introducing mem_cgroup_hierarchy_walk() and making use of it in
mem_cgroup_hierarchical_reclaim() as a first step?
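
The walker is a resumable iterator: each call returns the next group after the
previously scanned one, and that position is remembered so a later pass over the
same zone continues where the last one stopped instead of always hammering the
first child. Here is a toy userspace model of that iterator idea -- not the code
from the series; struct hierarchy, NR_GROUPS, the reclaimable field and the
numbers in main() are invented stand-ins.

/*
 * Toy model of a resumable, round-robin hierarchy walk.  The real
 * hierarchy is a tree of memcgs; a flat array is enough to show how
 * the remembered cursor and the two termination conditions interact.
 */
#include <stdio.h>

#define NR_GROUPS 4

struct mem_cgroup {
	const char *name;
	unsigned long reclaimable;
};

struct hierarchy {
	struct mem_cgroup groups[NR_GROUPS];
	int last_scanned;	/* resume point for the next walk */
};

/* Return the group after @prev, or resume after the remembered cursor. */
static struct mem_cgroup *hierarchy_walk(struct hierarchy *h,
					 struct mem_cgroup *prev)
{
	int next;

	if (!prev)
		next = (h->last_scanned + 1) % NR_GROUPS;
	else
		next = ((int)(prev - h->groups) + 1) % NR_GROUPS;
	h->last_scanned = next;
	return &h->groups[next];
}

int main(void)
{
	struct hierarchy h = {
		.groups = {
			{ "A", 10 }, { "B", 40 }, { "C", 5 }, { "D", 25 },
		},
		.last_scanned = NR_GROUPS - 1,	/* first walk starts at A */
	};
	unsigned long nr_to_reclaim = 32, nr_reclaimed = 0;
	struct mem_cgroup *first, *mem;

	first = mem = hierarchy_walk(&h, NULL);
	do {
		nr_reclaimed += mem->reclaimable;	/* "shrink" this group */
		printf("scanned %s, total %lu\n", mem->name, nr_reclaimed);
		if (nr_reclaimed >= nr_to_reclaim)
			break;			/* abort; resume here next time */
		mem = hierarchy_walk(&h, mem);
	} while (mem != first);

	return 0;
}

The break models the intermittent walk: scanning stops once nr_to_reclaim is
met, the cursor stays where it is, and one full round trip back to the first
visited group is the other termination condition.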

> - root cgroup LRU handling was required for performance. I think we
> removed tons of
> atomic ops and can remove that special handling personally. But this change of
> root cgroup handling should be in separate patch. with performance report.

I disagree.

Looking at the whole patch series, linking ungrouped process pages
to the root_mem_cgroup is traded against

1. linking ungrouped process pages to the global LRU

2. linking grouped process pages to both the global LRU and the
memcg LRU

The comparison you propose is neither fair nor relevant because it
would never make sense to merge that patch without the others.

2011-06-02 12:59:43

by Hiroyuki Kamezawa

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

2011/6/2 Johannes Weiner <[email protected]>:
> On Thu, Jun 02, 2011 at 06:06:51PM +0900, Hiroyuki Kamezawa wrote:
>> 2011/6/2 Johannes Weiner <[email protected]>:
>> > On Thu, Jun 02, 2011 at 08:52:47AM +0900, Hiroyuki Kamezawa wrote:
>> >>   Hmm, how about splitting patch 2/8 into small patches and see what happens in
>> >>   3.2 or 3.3 ? While that, we can make softlimit works better.
>> >>   (and once we do 2/8, our direction will be fixed to the direction to
>> >> remove global LRU.)
>> >
>> > Do you have specific parts in mind that could go stand-alone?
>> >
>> > One thing I can think of is splitting up those parts:
>> >
>> >  1. move /target/ reclaim to generic code
>> >
>> >  2. convert /global/ reclaim from global lru to hierarchy reclaim
>> >     including root_mem_cgroup
>>
>> Hmm, at brief look
>> patch 2/8
>>  - hierarchy walk rewrite code should be stand alone and can be merged
>> 1st, as clean-up
>
> You mean introducing mem_cgroup_hierarchy_walk() and making use of it in
> mem_cgroup_hierarchical_reclaim() as a first step?
>

Yes. I like to cut out a patch from a series and forward it to mainline,
and make the series smaller, in some way...


>>  - root cgroup LRU handling was required for performance. I think we
>> removed tons of
>>   atomic ops and can remove that special handling personally. But this change of
>>   root cgroup handling should be in separate patch. with performance report.
>
> I disagree.
>
> Looking at the whole patch series, linking ungrouped process pages
> to the root_mem_cgroup is traded against
>
>   1. linking ungrouped process pages to the global LRU
>
>   2. linking grouped process pages to both the global LRU and the
>      memcg LRU
>
> The comparison you propose is neither fair nor relevant because it
> would never make sense to merge that patch without the others.

If you show there is no performance regression when
- memory cgroup is configured,
- it's not disabled by boot option,
- there is only the ROOT cgroup,
then I'd like to see the score.


It seems your current series is a mixture of 2 works:
"re-design of softlimit" and "removal of global LRU".
I don't understand why you need the 2 works at once.

The above test is for the latter. You need another justification for the former.
So, I'd like to ask you to divide the series into 2 series.

Thanks,
-Kame

2011-06-02 13:17:05

by Hiroyuki Kamezawa

[permalink] [raw]
Subject: Re: [patch 8/8] mm: make per-memcg lru lists exclusive

2011/6/1 Johannes Weiner <[email protected]>:
> All lru list walkers have been converted to operate on per-memcg
> lists, the global per-zone lists are no longer required.
>
> This patch makes the per-memcg lists exclusive and removes the global
> lists from memcg-enabled kernels.
>
> The per-memcg lists now string up page descriptors directly, which
> unifies/simplifies the list isolation code of page reclaim as well as
> it saves a full double-linked list head for each page in the system.
>
> At the core of this change is the introduction of the lruvec
> structure, an array of all lru list heads. It exists for each zone
> globally, and for each zone per memcg. All lru list operations are
> now done in generic code against lruvecs, with the memcg lru list
> primitives only doing accounting and returning the proper lruvec for
> the currently scanned memcg on isolation, or for the respective page
> on putback.
>
> Signed-off-by: Johannes Weiner <[email protected]>


Could you divide this into
- introduce lruvec
- don't record section information into pc->flags, because we see
  "page" on memcg LRU and there is no requirement to get page from "pc"
- remove pc->lru completely
?
Thanks,
-Kame

> ---
>  include/linux/memcontrol.h  |   53 ++++-----
>  include/linux/mm_inline.h   |   14 ++-
>  include/linux/mmzone.h      |   10 +-
>  include/linux/page_cgroup.h |   36 ------
>  mm/memcontrol.c             |  271 ++++++++++++++++++-------------------
>  mm/page_alloc.c             |    2 +-
>  mm/page_cgroup.c            |   38 +------
>  mm/swap.c                   |   20 ++--
>  mm/vmscan.c                 |   88 ++++++--------
>  9 files changed, 207 insertions(+), 325 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 56c1def..d3837f0 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -20,6 +20,7 @@
> ?#ifndef _LINUX_MEMCONTROL_H
> ?#define _LINUX_MEMCONTROL_H
> ?#include <linux/cgroup.h>
> +#include <linux/mmzone.h>
> ?struct mem_cgroup;
> ?struct page_cgroup;
> ?struct page;
> @@ -30,13 +31,6 @@ enum mem_cgroup_page_stat_item {
> ? ? ? ?MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
> ?};
>
> -extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct list_head *dst,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long *scanned, int order,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? int mode, struct zone *z,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct mem_cgroup *mem_cont,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? int active, int file);
> -
> ?#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> ?/*
> ?* All "charge" functions with gfp_mask should use GFP_KERNEL or
> @@ -60,15 +54,14 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
>
> ?extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?gfp_t gfp_mask);
> -struct page *mem_cgroup_lru_to_page(struct zone *, struct mem_cgroup *,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? enum lru_list);
> -extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
> -extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
> -extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
> -extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
> -extern void mem_cgroup_del_lru(struct page *page);
> -extern void mem_cgroup_move_lists(struct page *page,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? enum lru_list from, enum lru_list to);
> +
> +struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
> +struct lruvec *mem_cgroup_lru_add_list(struct zone *, struct page *,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?enum lru_list);
> +void mem_cgroup_lru_del_list(struct page *, enum lru_list);
> +void mem_cgroup_lru_del(struct page *);
> +struct lruvec *mem_cgroup_lru_move_lists(struct zone *, struct page *,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?enum lru_list, enum lru_list);
>
> ?/* For coalescing uncharge for reducing memcg' overhead*/
> ?extern void mem_cgroup_uncharge_start(void);
> @@ -214,33 +207,33 @@ static inline int mem_cgroup_shmem_charge_fallback(struct page *page,
> ? ? ? ?return 0;
> ?}
>
> -static inline void mem_cgroup_add_lru_list(struct page *page, int lru)
> -{
> -}
> -
> -static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
> +static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct mem_cgroup *mem)
> ?{
> - ? ? ? return ;
> + ? ? ? return &zone->lruvec;
> ?}
>
> -static inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
> +static inline struct lruvec *mem_cgroup_lru_add_list(struct zone *zone,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct page *page,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?enum lru_list lru)
> ?{
> - ? ? ? return ;
> + ? ? ? return &zone->lruvec;
> ?}
>
> -static inline void mem_cgroup_rotate_lru_list(struct page *page, int lru)
> +static inline void mem_cgroup_lru_del_list(struct page *page, enum lru_list lru)
> ?{
> - ? ? ? return ;
> ?}
>
> -static inline void mem_cgroup_del_lru(struct page *page)
> +static inline void mem_cgroup_lru_del(struct page *page)
> ?{
> - ? ? ? return ;
> ?}
>
> -static inline void
> -mem_cgroup_move_lists(struct page *page, enum lru_list from, enum lru_list to)
> +static inline struct lruvec *mem_cgroup_lru_move_lists(struct zone *zone,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct page *page,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?enum lru_list from,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?enum lru_list to)
> ?{
> + ? ? ? return &zone->lruvec;
> ?}
>
> ?static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 8f7d247..43d5d9f 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -25,23 +25,27 @@ static inline void
> ?__add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
> ? ? ? ? ? ? ? ? ? ? ? struct list_head *head)
> ?{
> + ? ? ? /* NOTE: Caller must ensure @head is on the right lruvec! */
> + ? ? ? mem_cgroup_lru_add_list(zone, page, l);
> ? ? ? ?list_add(&page->lru, head);
> ? ? ? ?__mod_zone_page_state(zone, NR_LRU_BASE + l, hpage_nr_pages(page));
> - ? ? ? mem_cgroup_add_lru_list(page, l);
> ?}
>
> ?static inline void
> ?add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
> ?{
> - ? ? ? __add_page_to_lru_list(zone, page, l, &zone->lru[l].list);
> + ? ? ? struct lruvec *lruvec = mem_cgroup_lru_add_list(zone, page, l);
> +
> + ? ? ? list_add(&page->lru, &lruvec->lists[l]);
> + ? ? ? __mod_zone_page_state(zone, NR_LRU_BASE + l, hpage_nr_pages(page));
> ?}
>
> ?static inline void
> ?del_page_from_lru_list(struct zone *zone, struct page *page, enum lru_list l)
> ?{
> + ? ? ? mem_cgroup_lru_del_list(page, l);
> ? ? ? ?list_del(&page->lru);
> ? ? ? ?__mod_zone_page_state(zone, NR_LRU_BASE + l, -hpage_nr_pages(page));
> - ? ? ? mem_cgroup_del_lru_list(page, l);
> ?}
>
> ?/**
> @@ -64,7 +68,6 @@ del_page_from_lru(struct zone *zone, struct page *page)
> ?{
> ? ? ? ?enum lru_list l;
>
> - ? ? ? list_del(&page->lru);
> ? ? ? ?if (PageUnevictable(page)) {
> ? ? ? ? ? ? ? ?__ClearPageUnevictable(page);
> ? ? ? ? ? ? ? ?l = LRU_UNEVICTABLE;
> @@ -75,8 +78,9 @@ del_page_from_lru(struct zone *zone, struct page *page)
> ? ? ? ? ? ? ? ? ? ? ? ?l += LRU_ACTIVE;
> ? ? ? ? ? ? ? ?}
> ? ? ? ?}
> + ? ? ? mem_cgroup_lru_del_list(page, l);
> + ? ? ? list_del(&page->lru);
> ? ? ? ?__mod_zone_page_state(zone, NR_LRU_BASE + l, -hpage_nr_pages(page));
> - ? ? ? mem_cgroup_del_lru_list(page, l);
> ?}
>
> ?/**
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index e56f835..c2ddce5 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -158,6 +158,10 @@ static inline int is_unevictable_lru(enum lru_list l)
> ? ? ? ?return (l == LRU_UNEVICTABLE);
> ?}
>
> +struct lruvec {
> + ? ? ? struct list_head lists[NR_LRU_LISTS];
> +};
> +
> ?enum zone_watermarks {
> ? ? ? ?WMARK_MIN,
> ? ? ? ?WMARK_LOW,
> @@ -344,10 +348,8 @@ struct zone {
> ? ? ? ?ZONE_PADDING(_pad1_)
>
> ? ? ? ?/* Fields commonly accessed by the page reclaim scanner */
> - ? ? ? spinlock_t ? ? ? ? ? ? ?lru_lock;
> - ? ? ? struct zone_lru {
> - ? ? ? ? ? ? ? struct list_head list;
> - ? ? ? } lru[NR_LRU_LISTS];
> + ? ? ? spinlock_t ? ? ? ? ? ? ?lru_lock;
> + ? ? ? struct lruvec ? ? ? ? ? lruvec;
>
> ? ? ? ?struct zone_reclaim_stat reclaim_stat;
>
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 961ecc7..a42ddf9 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -31,7 +31,6 @@ enum {
> ?struct page_cgroup {
> ? ? ? ?unsigned long flags;
> ? ? ? ?struct mem_cgroup *mem_cgroup;
> - ? ? ? struct list_head lru; ? ? ? ? ? /* per cgroup LRU list */
> ?};
>
> ?void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
> @@ -49,7 +48,6 @@ static inline void __init page_cgroup_init(void)
> ?#endif
>
> ?struct page_cgroup *lookup_page_cgroup(struct page *page);
> -struct page *lookup_cgroup_page(struct page_cgroup *pc);
>
> ?#define TESTPCGFLAG(uname, lname) ? ? ? ? ? ? ? ? ? ? ?\
> ?static inline int PageCgroup##uname(struct page_cgroup *pc) ? ?\
> @@ -121,40 +119,6 @@ static inline void move_unlock_page_cgroup(struct page_cgroup *pc,
> ? ? ? ?bit_spin_unlock(PCG_MOVE_LOCK, &pc->flags);
> ? ? ? ?local_irq_restore(*flags);
> ?}
> -
> -#ifdef CONFIG_SPARSEMEM
> -#define PCG_ARRAYID_WIDTH ? ? ?SECTIONS_SHIFT
> -#else
> -#define PCG_ARRAYID_WIDTH ? ? ?NODES_SHIFT
> -#endif
> -
> -#if (PCG_ARRAYID_WIDTH > BITS_PER_LONG - NR_PCG_FLAGS)
> -#error Not enough space left in pc->flags to store page_cgroup array IDs
> -#endif
> -
> -/* pc->flags: ARRAY-ID | FLAGS */
> -
> -#define PCG_ARRAYID_MASK ? ? ? ((1UL << PCG_ARRAYID_WIDTH) - 1)
> -
> -#define PCG_ARRAYID_OFFSET ? ? (BITS_PER_LONG - PCG_ARRAYID_WIDTH)
> -/*
> - * Zero the shift count for non-existent fields, to prevent compiler
> - * warnings and ensure references are optimized away.
> - */
> -#define PCG_ARRAYID_SHIFT ? ? ?(PCG_ARRAYID_OFFSET * (PCG_ARRAYID_WIDTH != 0))
> -
> -static inline void set_page_cgroup_array_id(struct page_cgroup *pc,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long id)
> -{
> - ? ? ? pc->flags &= ~(PCG_ARRAYID_MASK << PCG_ARRAYID_SHIFT);
> - ? ? ? pc->flags |= (id & PCG_ARRAYID_MASK) << PCG_ARRAYID_SHIFT;
> -}
> -
> -static inline unsigned long page_cgroup_array_id(struct page_cgroup *pc)
> -{
> - ? ? ? return (pc->flags >> PCG_ARRAYID_SHIFT) & PCG_ARRAYID_MASK;
> -}
> -
> ?#else /* CONFIG_CGROUP_MEM_RES_CTLR */
> ?struct page_cgroup;
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d9d1a7e..4a365b7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -133,10 +133,7 @@ struct mem_cgroup_stat_cpu {
> ?* per-zone information in memory controller.
> ?*/
> ?struct mem_cgroup_per_zone {
> - ? ? ? /*
> - ? ? ? ?* spin_lock to protect the per cgroup LRU
> - ? ? ? ?*/
> - ? ? ? struct list_head ? ? ? ?lists[NR_LRU_LISTS];
> + ? ? ? struct lruvec ? ? ? ? ? lruvec;
> ? ? ? ?unsigned long ? ? ? ? ? count[NR_LRU_LISTS];
>
> ? ? ? ?struct zone_reclaim_stat reclaim_stat;
> @@ -642,6 +639,26 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
> ? ? ? ?return (mem == root_mem_cgroup);
> ?}
>
> +/**
> + * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
> + * @zone: zone of the wanted lruvec
> + * @mem: memcg of the wanted lruvec
> + *
> + * Returns the lru list vector holding pages for the given @zone and
> + * @mem. ?This can be the global zone lruvec, if the memory controller
> + * is disabled.
> + */
> +struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone, struct mem_cgroup *mem)
> +{
> + ? ? ? struct mem_cgroup_per_zone *mz;
> +
> + ? ? ? if (mem_cgroup_disabled())
> + ? ? ? ? ? ? ? return &zone->lruvec;
> +
> + ? ? ? mz = mem_cgroup_zoneinfo(mem, zone_to_nid(zone), zone_idx(zone));
> + ? ? ? return &mz->lruvec;
> +}
> +
> ?/*
> ?* Following LRU functions are allowed to be used without PCG_LOCK.
> ?* Operations are called by routine of global LRU independently from memcg.
> @@ -656,21 +673,74 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
> ?* When moving account, the page is not on LRU. It's isolated.
> ?*/
>
> -struct page *mem_cgroup_lru_to_page(struct zone *zone, struct mem_cgroup *mem,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? enum lru_list lru)
> +/**
> + * mem_cgroup_lru_add_list - account for adding an lru page and return lruvec
> + * @zone: zone of the page
> + * @page: the page itself
> + * @lru: target lru list
> + *
> + * This function must be called when a page is to be added to an lru
> + * list.
> + *
> + * Returns the lruvec to hold @page, the callsite is responsible for
> + * physically linking the page to &lruvec->lists[@lru].
> + */
> +struct lruvec *mem_cgroup_lru_add_list(struct zone *zone, struct page *page,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?enum lru_list lru)
> ?{
> ? ? ? ?struct mem_cgroup_per_zone *mz;
> ? ? ? ?struct page_cgroup *pc;
> + ? ? ? struct mem_cgroup *mem;
>
> - ? ? ? mz = mem_cgroup_zoneinfo(mem, zone_to_nid(zone), zone_idx(zone));
> - ? ? ? pc = list_entry(mz->lists[lru].prev, struct page_cgroup, lru);
> - ? ? ? return lookup_cgroup_page(pc);
> + ? ? ? if (mem_cgroup_disabled())
> + ? ? ? ? ? ? ? return &zone->lruvec;
> +
> + ? ? ? pc = lookup_page_cgroup(page);
> + ? ? ? VM_BUG_ON(PageCgroupAcctLRU(pc));
> + ? ? ? if (PageCgroupUsed(pc)) {
> + ? ? ? ? ? ? ? /* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
> + ? ? ? ? ? ? ? smp_rmb();
> + ? ? ? ? ? ? ? mem = pc->mem_cgroup;
> + ? ? ? } else {
> + ? ? ? ? ? ? ? /*
> + ? ? ? ? ? ? ? ?* If the page is no longer charged, add it to the
> + ? ? ? ? ? ? ? ?* root memcg's lru. ?Either it will be freed soon, or
> + ? ? ? ? ? ? ? ?* it will get charged again and the charger will
> + ? ? ? ? ? ? ? ?* relink it to the right list.
> + ? ? ? ? ? ? ? ?*/
> + ? ? ? ? ? ? ? mem = root_mem_cgroup;
> + ? ? ? }
> + ? ? ? mz = page_cgroup_zoneinfo(mem, page);
> + ? ? ? /*
> + ? ? ? ?* We do not account for uncharged pages: they are linked to
> + ? ? ? ?* root_mem_cgroup but when the page is unlinked upon free,
> + ? ? ? ?* accounting would be done against pc->mem_cgroup.
> + ? ? ? ?*/
> + ? ? ? if (PageCgroupUsed(pc)) {
> + ? ? ? ? ? ? ? /*
> + ? ? ? ? ? ? ? ?* Huge page splitting is serialized through the lru
> + ? ? ? ? ? ? ? ?* lock, so compound_order() is stable here.
> + ? ? ? ? ? ? ? ?*/
> + ? ? ? ? ? ? ? MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
> + ? ? ? ? ? ? ? SetPageCgroupAcctLRU(pc);
> + ? ? ? }
> + ? ? ? return &mz->lruvec;
> ?}
>
> -void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
> +/**
> + * mem_cgroup_lru_del_list - account for removing an lru page
> + * @page: page to unlink
> + * @lru: lru list the page is sitting on
> + *
> + * This function must be called when a page is to be removed from an
> + * lru list.
> + *
> + * The callsite is responsible for physically unlinking &@page->lru.
> + */
> +void mem_cgroup_lru_del_list(struct page *page, enum lru_list lru)
> ?{
> - ? ? ? struct page_cgroup *pc;
> ? ? ? ?struct mem_cgroup_per_zone *mz;
> + ? ? ? struct page_cgroup *pc;
>
> ? ? ? ?if (mem_cgroup_disabled())
> ? ? ? ? ? ? ? ?return;
> @@ -686,75 +756,35 @@ void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
> ? ? ? ?mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
> ? ? ? ?/* huge page split is done under lru_lock. so, we have no races. */
> ? ? ? ?MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
> - ? ? ? VM_BUG_ON(list_empty(&pc->lru));
> - ? ? ? list_del_init(&pc->lru);
> ?}
>
> -void mem_cgroup_del_lru(struct page *page)
> +void mem_cgroup_lru_del(struct page *page)
> ?{
> - ? ? ? mem_cgroup_del_lru_list(page, page_lru(page));
> + ? ? ? mem_cgroup_lru_del_list(page, page_lru(page));
> ?}
>
> -/*
> - * Writeback is about to end against a page which has been marked for immediate
> - * reclaim. ?If it still appears to be reclaimable, move it to the tail of the
> - * inactive list.
> +/**
> + * mem_cgroup_lru_move_lists - account for moving a page between lru lists
> + * @zone: zone of the page
> + * @page: page to move
> + * @from: current lru list
> + * @to: new lru list
> + *
> + * This function must be called when a page is moved between lru
> + * lists, or rotated on the same lru list.
> + *
> + * Returns the lruvec to hold @page in the future, the callsite is
> + * responsible for physically relinking the page to
> + * &lruvec->lists[@to].
> ?*/
> -void mem_cgroup_rotate_reclaimable_page(struct page *page)
> -{
> - ? ? ? struct mem_cgroup_per_zone *mz;
> - ? ? ? struct page_cgroup *pc;
> - ? ? ? enum lru_list lru = page_lru(page);
> -
> - ? ? ? if (mem_cgroup_disabled())
> - ? ? ? ? ? ? ? return;
> -
> - ? ? ? pc = lookup_page_cgroup(page);
> - ? ? ? /* unused page is not rotated. */
> - ? ? ? if (!PageCgroupUsed(pc))
> - ? ? ? ? ? ? ? return;
> - ? ? ? /* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
> - ? ? ? smp_rmb();
> - ? ? ? mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
> - ? ? ? list_move_tail(&pc->lru, &mz->lists[lru]);
> -}
> -
> -void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
> +struct lruvec *mem_cgroup_lru_move_lists(struct zone *zone,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct page *page,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?enum lru_list from,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?enum lru_list to)
> ?{
> - ? ? ? struct mem_cgroup_per_zone *mz;
> - ? ? ? struct page_cgroup *pc;
> -
> - ? ? ? if (mem_cgroup_disabled())
> - ? ? ? ? ? ? ? return;
> -
> - ? ? ? pc = lookup_page_cgroup(page);
> - ? ? ? /* unused page is not rotated. */
> - ? ? ? if (!PageCgroupUsed(pc))
> - ? ? ? ? ? ? ? return;
> - ? ? ? /* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
> - ? ? ? smp_rmb();
> - ? ? ? mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
> - ? ? ? list_move(&pc->lru, &mz->lists[lru]);
> -}
> -
> -void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
> -{
> - ? ? ? struct page_cgroup *pc;
> - ? ? ? struct mem_cgroup_per_zone *mz;
> -
> - ? ? ? if (mem_cgroup_disabled())
> - ? ? ? ? ? ? ? return;
> - ? ? ? pc = lookup_page_cgroup(page);
> - ? ? ? VM_BUG_ON(PageCgroupAcctLRU(pc));
> - ? ? ? if (!PageCgroupUsed(pc))
> - ? ? ? ? ? ? ? return;
> - ? ? ? /* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
> - ? ? ? smp_rmb();
> - ? ? ? mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
> - ? ? ? /* huge page split is done under lru_lock. so, we have no races. */
> - ? ? ? MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
> - ? ? ? SetPageCgroupAcctLRU(pc);
> - ? ? ? list_add(&pc->lru, &mz->lists[lru]);
> + ? ? ? /* TODO: this could be optimized, especially if from == to */
> + ? ? ? mem_cgroup_lru_del_list(page, from);
> + ? ? ? return mem_cgroup_lru_add_list(zone, page, to);
> ?}
>
> ?/*
> @@ -786,7 +816,7 @@ static void mem_cgroup_lru_del_before_commit(struct page *page)
> ? ? ? ? * is guarded by lock_page() because the page is SwapCache.
> ? ? ? ? */
> ? ? ? ?if (!PageCgroupUsed(pc))
> - ? ? ? ? ? ? ? mem_cgroup_del_lru_list(page, page_lru(page));
> + ? ? ? ? ? ? ? del_page_from_lru(zone, page);
> ? ? ? ?spin_unlock_irqrestore(&zone->lru_lock, flags);
> ?}
>
> @@ -800,22 +830,11 @@ static void mem_cgroup_lru_add_after_commit(struct page *page)
> ? ? ? ?if (likely(!PageLRU(page)))
> ? ? ? ? ? ? ? ?return;
> ? ? ? ?spin_lock_irqsave(&zone->lru_lock, flags);
> - ? ? ? /* link when the page is linked to LRU but page_cgroup isn't */
> ? ? ? ?if (PageLRU(page) && !PageCgroupAcctLRU(pc))
> - ? ? ? ? ? ? ? mem_cgroup_add_lru_list(page, page_lru(page));
> + ? ? ? ? ? ? ? add_page_to_lru_list(zone, page, page_lru(page));
> ? ? ? ?spin_unlock_irqrestore(&zone->lru_lock, flags);
> ?}
>
> -
> -void mem_cgroup_move_lists(struct page *page,
> - ? ? ? ? ? ? ? ? ? ? ? ? ?enum lru_list from, enum lru_list to)
> -{
> - ? ? ? if (mem_cgroup_disabled())
> - ? ? ? ? ? ? ? return;
> - ? ? ? mem_cgroup_del_lru_list(page, from);
> - ? ? ? mem_cgroup_add_lru_list(page, to);
> -}
> -
> ?int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
> ?{
> ? ? ? ?int ret;
> @@ -935,67 +954,6 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
> ? ? ? ?return &mz->reclaim_stat;
> ?}
>
> -unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct list_head *dst,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long *scanned, int order,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? int mode, struct zone *z,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct mem_cgroup *mem_cont,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? int active, int file)
> -{
> - ? ? ? unsigned long nr_taken = 0;
> - ? ? ? struct page *page;
> - ? ? ? unsigned long scan;
> - ? ? ? LIST_HEAD(pc_list);
> - ? ? ? struct list_head *src;
> - ? ? ? struct page_cgroup *pc, *tmp;
> - ? ? ? int nid = zone_to_nid(z);
> - ? ? ? int zid = zone_idx(z);
> - ? ? ? struct mem_cgroup_per_zone *mz;
> - ? ? ? int lru = LRU_FILE * file + active;
> - ? ? ? int ret;
> -
> - ? ? ? BUG_ON(!mem_cont);
> - ? ? ? mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
> - ? ? ? src = &mz->lists[lru];
> -
> - ? ? ? scan = 0;
> - ? ? ? list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
> - ? ? ? ? ? ? ? if (scan >= nr_to_scan)
> - ? ? ? ? ? ? ? ? ? ? ? break;
> -
> - ? ? ? ? ? ? ? if (unlikely(!PageCgroupUsed(pc)))
> - ? ? ? ? ? ? ? ? ? ? ? continue;
> -
> - ? ? ? ? ? ? ? page = lookup_cgroup_page(pc);
> -
> - ? ? ? ? ? ? ? if (unlikely(!PageLRU(page)))
> - ? ? ? ? ? ? ? ? ? ? ? continue;
> -
> - ? ? ? ? ? ? ? scan++;
> - ? ? ? ? ? ? ? ret = __isolate_lru_page(page, mode, file);
> - ? ? ? ? ? ? ? switch (ret) {
> - ? ? ? ? ? ? ? case 0:
> - ? ? ? ? ? ? ? ? ? ? ? list_move(&page->lru, dst);
> - ? ? ? ? ? ? ? ? ? ? ? mem_cgroup_del_lru(page);
> - ? ? ? ? ? ? ? ? ? ? ? nr_taken += hpage_nr_pages(page);
> - ? ? ? ? ? ? ? ? ? ? ? break;
> - ? ? ? ? ? ? ? case -EBUSY:
> - ? ? ? ? ? ? ? ? ? ? ? /* we don't affect global LRU but rotate in our LRU */
> - ? ? ? ? ? ? ? ? ? ? ? mem_cgroup_rotate_lru_list(page, page_lru(page));
> - ? ? ? ? ? ? ? ? ? ? ? break;
> - ? ? ? ? ? ? ? default:
> - ? ? ? ? ? ? ? ? ? ? ? break;
> - ? ? ? ? ? ? ? }
> - ? ? ? }
> -
> - ? ? ? *scanned = scan;
> -
> - ? ? ? trace_mm_vmscan_memcg_isolate(0, nr_to_scan, scan, nr_taken,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 0, 0, 0, mode);
> -
> - ? ? ? return nr_taken;
> -}
> -
> ?#define mem_cgroup_from_res_counter(counter, member) ? \
> ? ? ? ?container_of(counter, struct mem_cgroup, member)
>
> @@ -3110,22 +3068,23 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> ?static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?int node, int zid, enum lru_list lru)
> ?{
> - ? ? ? struct zone *zone;
> ? ? ? ?struct mem_cgroup_per_zone *mz;
> - ? ? ? struct page_cgroup *pc, *busy;
> ? ? ? ?unsigned long flags, loop;
> ? ? ? ?struct list_head *list;
> + ? ? ? struct page *busy;
> + ? ? ? struct zone *zone;
> ? ? ? ?int ret = 0;
>
> ? ? ? ?zone = &NODE_DATA(node)->node_zones[zid];
> ? ? ? ?mz = mem_cgroup_zoneinfo(mem, node, zid);
> - ? ? ? list = &mz->lists[lru];
> + ? ? ? list = &mz->lruvec.lists[lru];
>
> ? ? ? ?loop = MEM_CGROUP_ZSTAT(mz, lru);
> ? ? ? ?/* give some margin against EBUSY etc...*/
> ? ? ? ?loop += 256;
> ? ? ? ?busy = NULL;
> ? ? ? ?while (loop--) {
> + ? ? ? ? ? ? ? struct page_cgroup *pc;
> ? ? ? ? ? ? ? ?struct page *page;
>
> ? ? ? ? ? ? ? ?ret = 0;
> @@ -3134,16 +3093,16 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
> ? ? ? ? ? ? ? ? ? ? ? ?spin_unlock_irqrestore(&zone->lru_lock, flags);
> ? ? ? ? ? ? ? ? ? ? ? ?break;
> ? ? ? ? ? ? ? ?}
> - ? ? ? ? ? ? ? pc = list_entry(list->prev, struct page_cgroup, lru);
> - ? ? ? ? ? ? ? if (busy == pc) {
> - ? ? ? ? ? ? ? ? ? ? ? list_move(&pc->lru, list);
> + ? ? ? ? ? ? ? page = list_entry(list->prev, struct page, lru);
> + ? ? ? ? ? ? ? if (busy == page) {
> + ? ? ? ? ? ? ? ? ? ? ? list_move(&page->lru, list);
> ? ? ? ? ? ? ? ? ? ? ? ?busy = NULL;
> ? ? ? ? ? ? ? ? ? ? ? ?spin_unlock_irqrestore(&zone->lru_lock, flags);
> ? ? ? ? ? ? ? ? ? ? ? ?continue;
> ? ? ? ? ? ? ? ?}
> ? ? ? ? ? ? ? ?spin_unlock_irqrestore(&zone->lru_lock, flags);
>
> - ? ? ? ? ? ? ? page = lookup_cgroup_page(pc);
> + ? ? ? ? ? ? ? pc = lookup_page_cgroup(page);
>
> ? ? ? ? ? ? ? ?ret = mem_cgroup_move_parent(page, pc, mem, GFP_KERNEL);
> ? ? ? ? ? ? ? ?if (ret == -ENOMEM)
> @@ -3151,7 +3110,7 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
>
> ? ? ? ? ? ? ? ?if (ret == -EBUSY || ret == -EINVAL) {
> ? ? ? ? ? ? ? ? ? ? ? ?/* found lock contention or "pc" is obsolete. */
> - ? ? ? ? ? ? ? ? ? ? ? busy = pc;
> + ? ? ? ? ? ? ? ? ? ? ? busy = page;
> ? ? ? ? ? ? ? ? ? ? ? ?cond_resched();
> ? ? ? ? ? ? ? ?} else
> ? ? ? ? ? ? ? ? ? ? ? ?busy = NULL;
> @@ -4171,7 +4130,7 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
> ? ? ? ?for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> ? ? ? ? ? ? ? ?mz = &pn->zoneinfo[zone];
> ? ? ? ? ? ? ? ?for_each_lru(l)
> - ? ? ? ? ? ? ? ? ? ? ? INIT_LIST_HEAD(&mz->lists[l]);
> + ? ? ? ? ? ? ? ? ? ? ? INIT_LIST_HEAD(&mz->lruvec.lists[l]);
> ? ? ? ?}
> ? ? ? ?return 0;
> ?}
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3f8bce2..9da238d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4289,7 +4289,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>
> ? ? ? ? ? ? ? ?zone_pcp_init(zone);
> ? ? ? ? ? ? ? ?for_each_lru(l) {
> - ? ? ? ? ? ? ? ? ? ? ? INIT_LIST_HEAD(&zone->lru[l].list);
> + ? ? ? ? ? ? ? ? ? ? ? INIT_LIST_HEAD(&zone->lruvec.lists[l]);
> ? ? ? ? ? ? ? ? ? ? ? ?zone->reclaim_stat.nr_saved_scan[l] = 0;
> ? ? ? ? ? ? ? ?}
> ? ? ? ? ? ? ? ?zone->reclaim_stat.recent_rotated[0] = 0;
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index 2daadc3..916c6f9 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -11,12 +11,10 @@
> ?#include <linux/swapops.h>
> ?#include <linux/kmemleak.h>
>
> -static void __meminit init_page_cgroup(struct page_cgroup *pc, unsigned long id)
> +static void __meminit init_page_cgroup(struct page_cgroup *pc)
> ?{
> ? ? ? ?pc->flags = 0;
> - ? ? ? set_page_cgroup_array_id(pc, id);
> ? ? ? ?pc->mem_cgroup = NULL;
> - ? ? ? INIT_LIST_HEAD(&pc->lru);
> ?}
> ?static unsigned long total_usage;
>
> @@ -42,19 +40,6 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
> ? ? ? ?return base + offset;
> ?}
>
> -struct page *lookup_cgroup_page(struct page_cgroup *pc)
> -{
> - ? ? ? unsigned long pfn;
> - ? ? ? struct page *page;
> - ? ? ? pg_data_t *pgdat;
> -
> - ? ? ? pgdat = NODE_DATA(page_cgroup_array_id(pc));
> - ? ? ? pfn = pc - pgdat->node_page_cgroup + pgdat->node_start_pfn;
> - ? ? ? page = pfn_to_page(pfn);
> - ? ? ? VM_BUG_ON(pc != lookup_page_cgroup(page));
> - ? ? ? return page;
> -}
> -
> ?static int __init alloc_node_page_cgroup(int nid)
> ?{
> ? ? ? ?struct page_cgroup *base, *pc;
> @@ -75,7 +60,7 @@ static int __init alloc_node_page_cgroup(int nid)
> ? ? ? ? ? ? ? ?return -ENOMEM;
> ? ? ? ?for (index = 0; index < nr_pages; index++) {
> ? ? ? ? ? ? ? ?pc = base + index;
> - ? ? ? ? ? ? ? init_page_cgroup(pc, nid);
> + ? ? ? ? ? ? ? init_page_cgroup(pc);
> ? ? ? ?}
> ? ? ? ?NODE_DATA(nid)->node_page_cgroup = base;
> ? ? ? ?total_usage += table_size;
> @@ -117,19 +102,6 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
> ? ? ? ?return section->page_cgroup + pfn;
> ?}
>
> -struct page *lookup_cgroup_page(struct page_cgroup *pc)
> -{
> - ? ? ? struct mem_section *section;
> - ? ? ? struct page *page;
> - ? ? ? unsigned long nr;
> -
> - ? ? ? nr = page_cgroup_array_id(pc);
> - ? ? ? section = __nr_to_section(nr);
> - ? ? ? page = pfn_to_page(pc - section->page_cgroup);
> - ? ? ? VM_BUG_ON(pc != lookup_page_cgroup(page));
> - ? ? ? return page;
> -}
> -
> ?static void *__init_refok alloc_page_cgroup(size_t size, int nid)
> ?{
> ? ? ? ?void *addr = NULL;
> @@ -167,11 +139,9 @@ static int __init_refok init_section_page_cgroup(unsigned long pfn)
> ? ? ? ?struct page_cgroup *base, *pc;
> ? ? ? ?struct mem_section *section;
> ? ? ? ?unsigned long table_size;
> - ? ? ? unsigned long nr;
> ? ? ? ?int nid, index;
>
> - ? ? ? nr = pfn_to_section_nr(pfn);
> - ? ? ? section = __nr_to_section(nr);
> + ? ? ? section = __pfn_to_section(pfn);
>
> ? ? ? ?if (section->page_cgroup)
> ? ? ? ? ? ? ? ?return 0;
> @@ -194,7 +164,7 @@ static int __init_refok init_section_page_cgroup(unsigned long pfn)
>
> ? ? ? ?for (index = 0; index < PAGES_PER_SECTION; index++) {
> ? ? ? ? ? ? ? ?pc = base + index;
> - ? ? ? ? ? ? ? init_page_cgroup(pc, nr);
> + ? ? ? ? ? ? ? init_page_cgroup(pc);
> ? ? ? ?}
>
> ? ? ? ?section->page_cgroup = base - pfn;
> diff --git a/mm/swap.c b/mm/swap.c
> index 5602f1a..0a5a93b 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -209,12 +209,14 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
> ?static void pagevec_move_tail_fn(struct page *page, void *arg)
> ?{
> ? ? ? ?int *pgmoved = arg;
> - ? ? ? struct zone *zone = page_zone(page);
>
> ? ? ? ?if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
> ? ? ? ? ? ? ? ?enum lru_list lru = page_lru_base_type(page);
> - ? ? ? ? ? ? ? list_move_tail(&page->lru, &zone->lru[lru].list);
> - ? ? ? ? ? ? ? mem_cgroup_rotate_reclaimable_page(page);
> + ? ? ? ? ? ? ? struct lruvec *lruvec;
> +
> + ? ? ? ? ? ? ? lruvec = mem_cgroup_lru_move_lists(page_zone(page),
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?page, lru, lru);
> + ? ? ? ? ? ? ? list_move_tail(&page->lru, &lruvec->lists[lru]);
> ? ? ? ? ? ? ? ?(*pgmoved)++;
> ? ? ? ?}
> ?}
> @@ -420,12 +422,13 @@ static void lru_deactivate_fn(struct page *page, void *arg)
> ? ? ? ? ? ? ? ? */
> ? ? ? ? ? ? ? ?SetPageReclaim(page);
> ? ? ? ?} else {
> + ? ? ? ? ? ? ? struct lruvec *lruvec;
> ? ? ? ? ? ? ? ?/*
> ? ? ? ? ? ? ? ? * The page's writeback ends up during pagevec
> ? ? ? ? ? ? ? ? * We moves tha page into tail of inactive.
> ? ? ? ? ? ? ? ? */
> - ? ? ? ? ? ? ? list_move_tail(&page->lru, &zone->lru[lru].list);
> - ? ? ? ? ? ? ? mem_cgroup_rotate_reclaimable_page(page);
> + ? ? ? ? ? ? ? lruvec = mem_cgroup_lru_move_lists(zone, page, lru, lru);
> + ? ? ? ? ? ? ? list_move_tail(&page->lru, &lruvec->lists[lru]);
> ? ? ? ? ? ? ? ?__count_vm_event(PGROTATED);
> ? ? ? ?}
>
> @@ -597,7 +600,6 @@ void lru_add_page_tail(struct zone* zone,
> ? ? ? ?int active;
> ? ? ? ?enum lru_list lru;
> ? ? ? ?const int file = 0;
> - ? ? ? struct list_head *head;
>
> ? ? ? ?VM_BUG_ON(!PageHead(page));
> ? ? ? ?VM_BUG_ON(PageCompound(page_tail));
> @@ -617,10 +619,10 @@ void lru_add_page_tail(struct zone* zone,
> ? ? ? ? ? ? ? ?}
> ? ? ? ? ? ? ? ?update_page_reclaim_stat(zone, page_tail, file, active);
> ? ? ? ? ? ? ? ?if (likely(PageLRU(page)))
> - ? ? ? ? ? ? ? ? ? ? ? head = page->lru.prev;
> + ? ? ? ? ? ? ? ? ? ? ? __add_page_to_lru_list(zone, page_tail, lru,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?page->lru.prev);
> ? ? ? ? ? ? ? ?else
> - ? ? ? ? ? ? ? ? ? ? ? head = &zone->lru[lru].list;
> - ? ? ? ? ? ? ? __add_page_to_lru_list(zone, page_tail, lru, head);
> + ? ? ? ? ? ? ? ? ? ? ? add_page_to_lru_list(zone, page_tail, lru);
> ? ? ? ?} else {
> ? ? ? ? ? ? ? ?SetPageUnevictable(page_tail);
> ? ? ? ? ? ? ? ?add_page_to_lru_list(zone, page_tail, LRU_UNEVICTABLE);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 23fd2b1..87e1fcb 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1080,15 +1080,14 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>
> ? ? ? ? ? ? ? ?switch (__isolate_lru_page(page, mode, file)) {
> ? ? ? ? ? ? ? ?case 0:
> + ? ? ? ? ? ? ? ? ? ? ? mem_cgroup_lru_del(page);
> ? ? ? ? ? ? ? ? ? ? ? ?list_move(&page->lru, dst);
> - ? ? ? ? ? ? ? ? ? ? ? mem_cgroup_del_lru(page);
> ? ? ? ? ? ? ? ? ? ? ? ?nr_taken += hpage_nr_pages(page);
> ? ? ? ? ? ? ? ? ? ? ? ?break;
>
> ? ? ? ? ? ? ? ?case -EBUSY:
> ? ? ? ? ? ? ? ? ? ? ? ?/* else it is being freed elsewhere */
> ? ? ? ? ? ? ? ? ? ? ? ?list_move(&page->lru, src);
> - ? ? ? ? ? ? ? ? ? ? ? mem_cgroup_rotate_lru_list(page, page_lru(page));
> ? ? ? ? ? ? ? ? ? ? ? ?continue;
>
> ? ? ? ? ? ? ? ?default:
> @@ -1138,8 +1137,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?break;
>
> ? ? ? ? ? ? ? ? ? ? ? ?if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? mem_cgroup_lru_del(cursor_page);
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?list_move(&cursor_page->lru, dst);
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? mem_cgroup_del_lru(cursor_page);
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?nr_taken += hpage_nr_pages(page);
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?nr_lumpy_taken++;
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?if (PageDirty(cursor_page))
> @@ -1168,19 +1167,22 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> ? ? ? ?return nr_taken;
> ?}
>
> -static unsigned long isolate_pages_global(unsigned long nr,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct list_head *dst,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long *scanned, int order,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? int mode, struct zone *z,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? int active, int file)
> +static unsigned long isolate_pages(unsigned long nr,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct list_head *dst,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?unsigned long *scanned, int order,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?int mode, struct zone *z,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?int active, int file,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct mem_cgroup *mem)
> ?{
> + ? ? ? struct lruvec *lruvec = mem_cgroup_zone_lruvec(z, mem);
> ? ? ? ?int lru = LRU_BASE;
> +
> ? ? ? ?if (active)
> ? ? ? ? ? ? ? ?lru += LRU_ACTIVE;
> ? ? ? ?if (file)
> ? ? ? ? ? ? ? ?lru += LRU_FILE;
> - ? ? ? return isolate_lru_pages(nr, &z->lru[lru].list, dst, scanned, order,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? mode, file);
> + ? ? ? return isolate_lru_pages(nr, &lruvec->lists[lru], dst,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?scanned, order, mode, file);
> ?}
>
> ?/*
> @@ -1428,20 +1430,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> ? ? ? ?lru_add_drain();
> ? ? ? ?spin_lock_irq(&zone->lru_lock);
>
> - ? ? ? if (scanning_global_lru(sc)) {
> - ? ? ? ? ? ? ? nr_taken = isolate_pages_global(nr_to_scan,
> - ? ? ? ? ? ? ? ? ? ? ? &page_list, &nr_scanned, sc->order,
> - ? ? ? ? ? ? ? ? ? ? ? sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ISOLATE_BOTH : ISOLATE_INACTIVE,
> - ? ? ? ? ? ? ? ? ? ? ? zone, 0, file);
> - ? ? ? } else {
> - ? ? ? ? ? ? ? nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
> - ? ? ? ? ? ? ? ? ? ? ? &page_list, &nr_scanned, sc->order,
> - ? ? ? ? ? ? ? ? ? ? ? sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
> + ? ? ? nr_taken = isolate_pages(nr_to_scan,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?&page_list, &nr_scanned, sc->order,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?ISOLATE_BOTH : ISOLATE_INACTIVE,
> - ? ? ? ? ? ? ? ? ? ? ? zone, sc->mem_cgroup,
> - ? ? ? ? ? ? ? ? ? ? ? 0, file);
> - ? ? ? }
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?zone, 0, file, sc->mem_cgroup);
>
> ? ? ? ?if (global_reclaim(sc)) {
> ? ? ? ? ? ? ? ?zone->pages_scanned += nr_scanned;
> @@ -1514,13 +1507,15 @@ static void move_active_pages_to_lru(struct zone *zone,
> ? ? ? ?pagevec_init(&pvec, 1);
>
> ? ? ? ?while (!list_empty(list)) {
> + ? ? ? ? ? ? ? struct lruvec *lruvec;
> +
> ? ? ? ? ? ? ? ?page = lru_to_page(list);
>
> ? ? ? ? ? ? ? ?VM_BUG_ON(PageLRU(page));
> ? ? ? ? ? ? ? ?SetPageLRU(page);
>
> - ? ? ? ? ? ? ? list_move(&page->lru, &zone->lru[lru].list);
> - ? ? ? ? ? ? ? mem_cgroup_add_lru_list(page, lru);
> + ? ? ? ? ? ? ? lruvec = mem_cgroup_lru_add_list(zone, page, lru);
> + ? ? ? ? ? ? ? list_move(&page->lru, &lruvec->lists[lru]);
> ? ? ? ? ? ? ? ?pgmoved += hpage_nr_pages(page);
>
> ? ? ? ? ? ? ? ?if (!pagevec_add(&pvec, page) || list_empty(list)) {
> @@ -1551,17 +1546,10 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
>
> ? ? ? ?lru_add_drain();
> ? ? ? ?spin_lock_irq(&zone->lru_lock);
> - ? ? ? if (scanning_global_lru(sc)) {
> - ? ? ? ? ? ? ? nr_taken = isolate_pages_global(nr_pages, &l_hold,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? &pgscanned, sc->order,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ISOLATE_ACTIVE, zone,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 1, file);
> - ? ? ? } else {
> - ? ? ? ? ? ? ? nr_taken = mem_cgroup_isolate_pages(nr_pages, &l_hold,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? &pgscanned, sc->order,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ISOLATE_ACTIVE, zone,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? sc->mem_cgroup, 1, file);
> - ? ? ? }
> + ? ? ? nr_taken = isolate_pages(nr_pages, &l_hold,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?&pgscanned, sc->order,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?ISOLATE_ACTIVE, zone,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?1, file, sc->mem_cgroup);
>
> ? ? ? ?if (global_reclaim(sc))
> ? ? ? ? ? ? ? ?zone->pages_scanned += pgscanned;
> @@ -3154,16 +3142,18 @@ int page_evictable(struct page *page, struct vm_area_struct *vma)
> ?*/
> ?static void check_move_unevictable_page(struct page *page, struct zone *zone)
> ?{
> - ? ? ? VM_BUG_ON(PageActive(page));
> + ? ? ? struct lruvec *lruvec;
>
> + ? ? ? VM_BUG_ON(PageActive(page));
> ?retry:
> ? ? ? ?ClearPageUnevictable(page);
> ? ? ? ?if (page_evictable(page, NULL)) {
> ? ? ? ? ? ? ? ?enum lru_list l = page_lru_base_type(page);
>
> + ? ? ? ? ? ? ? lruvec = mem_cgroup_lru_move_lists(zone, page,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?LRU_UNEVICTABLE, l);
> ? ? ? ? ? ? ? ?__dec_zone_state(zone, NR_UNEVICTABLE);
> - ? ? ? ? ? ? ? list_move(&page->lru, &zone->lru[l].list);
> - ? ? ? ? ? ? ? mem_cgroup_move_lists(page, LRU_UNEVICTABLE, l);
> + ? ? ? ? ? ? ? list_move(&page->lru, &lruvec->lists[l]);
> ? ? ? ? ? ? ? ?__inc_zone_state(zone, NR_INACTIVE_ANON + l);
> ? ? ? ? ? ? ? ?__count_vm_event(UNEVICTABLE_PGRESCUED);
> ? ? ? ?} else {
> @@ -3171,8 +3161,9 @@ retry:
> ? ? ? ? ? ? ? ? * rotate unevictable list
> ? ? ? ? ? ? ? ? */
> ? ? ? ? ? ? ? ?SetPageUnevictable(page);
> - ? ? ? ? ? ? ? list_move(&page->lru, &zone->lru[LRU_UNEVICTABLE].list);
> - ? ? ? ? ? ? ? mem_cgroup_rotate_lru_list(page, LRU_UNEVICTABLE);
> + ? ? ? ? ? ? ? lruvec = mem_cgroup_lru_move_lists(zone, page, LRU_UNEVICTABLE,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?LRU_UNEVICTABLE);
> + ? ? ? ? ? ? ? list_move(&page->lru, &lruvec->lists[LRU_UNEVICTABLE]);
> ? ? ? ? ? ? ? ?if (page_evictable(page, NULL))
> ? ? ? ? ? ? ? ? ? ? ? ?goto retry;
> ? ? ? ?}
> @@ -3233,14 +3224,6 @@ void scan_mapping_unevictable_pages(struct address_space *mapping)
>
> ?}
>
> -static struct page *lru_tailpage(struct zone *zone, struct mem_cgroup *mem,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?enum lru_list lru)
> -{
> - ? ? ? if (mem)
> - ? ? ? ? ? ? ? return mem_cgroup_lru_to_page(zone, mem, lru);
> - ? ? ? return lru_to_page(&zone->lru[lru].list);
> -}
> -
> ?/**
> ?* scan_zone_unevictable_pages - check unevictable list for evictable pages
> ?* @zone - zone of which to scan the unevictable list
> @@ -3259,8 +3242,13 @@ static void scan_zone_unevictable_pages(struct zone *zone)
> ? ? ? ?first = mem = mem_cgroup_hierarchy_walk(NULL, mem);
> ? ? ? ?do {
> ? ? ? ? ? ? ? ?unsigned long nr_to_scan;
> + ? ? ? ? ? ? ? struct list_head *list;
> + ? ? ? ? ? ? ? struct lruvec *lruvec;
>
> ? ? ? ? ? ? ? ?nr_to_scan = zone_nr_lru_pages(zone, mem, LRU_UNEVICTABLE);
> + ? ? ? ? ? ? ? lruvec = mem_cgroup_zone_lruvec(zone, mem);
> + ? ? ? ? ? ? ? list = &lruvec->lists[LRU_UNEVICTABLE];
> +
> ? ? ? ? ? ? ? ?while (nr_to_scan > 0) {
> ? ? ? ? ? ? ? ? ? ? ? ?unsigned long batch_size;
> ? ? ? ? ? ? ? ? ? ? ? ?unsigned long scan;
> @@ -3272,7 +3260,7 @@ static void scan_zone_unevictable_pages(struct zone *zone)
> ? ? ? ? ? ? ? ? ? ? ? ?for (scan = 0; scan < batch_size; scan++) {
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct page *page;
>
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? page = lru_tailpage(zone, mem, LRU_UNEVICTABLE);
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? page = lru_to_page(list);
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?if (!trylock_page(page))
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?continue;
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?if (likely(PageLRU(page) &&
> --
> 1.7.5.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at ?http://www.tux.org/lkml/
>

2011-06-02 13:27:18

by Hiroyuki Kamezawa

[permalink] [raw]
Subject: Re: [patch 7/8] vmscan: memcg-aware unevictable page rescue scanner

2011/6/1 Johannes Weiner <[email protected]>:
> Once the per-memcg lru lists are exclusive, the unevictable page
> rescue scanner can no longer work on the global zone lru lists.
>
> This converts it to go through all memcgs and scan their respective
> unevictable lists instead.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Hm, isn't it better to have only one GLOBAL LRU for unevictable pages?
memcg only needs a counter for unevictable pages, and the LRU does not need
to be per-memcg because we don't reclaim from it...

Thanks,
-Kame

2011-06-02 13:30:51

by Hiroyuki Kamezawa

[permalink] [raw]
Subject: Re: [patch 6/8] vmscan: change zone_nr_lru_pages to take memcg instead of scan control

2011/6/1 Johannes Weiner <[email protected]>:
> This function only uses sc->mem_cgroup from the scan control.  Change
> it to take a memcg argument directly, so callsites without an actual
> reclaim context can use it as well.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

I wonder whether this can be cut out and can be merged immediately, no?
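
For context, the change being acked here is just a matter of which argument
the helper takes; roughly the following (a sketch based on the call sites
quoted elsewhere in this thread, not the exact hunk):

    /* before: a scan control is required purely for sc->mem_cgroup */
    static unsigned long zone_nr_lru_pages(struct zone *zone,
					   struct scan_control *sc,
					   enum lru_list lru);

    /* after: take the memcg directly, so callers without a reclaim
     * context -- such as the unevictable rescue scanner -- can use it:
     *
     *	nr_to_scan = zone_nr_lru_pages(zone, mem, LRU_UNEVICTABLE);
     */
    static unsigned long zone_nr_lru_pages(struct zone *zone,
					   struct mem_cgroup *mem,
					   enum lru_list lru);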

2011-06-02 13:59:30

by Hiroyuki Kamezawa

[permalink] [raw]
Subject: Re: [patch 2/8] mm: memcg-aware global reclaim

2011/6/1 Johannes Weiner <[email protected]>:
> When a memcg hits its hard limit, hierarchical target reclaim is
> invoked, which goes through all contributing memcgs in the hierarchy
> below the offending memcg and reclaims from the respective per-memcg
> lru lists.  This distributes pressure fairly among all involved
> memcgs, and pages are aged with respect to their list buddies.
>
> When global memory pressure arises, however, all this is dropped
> overboard.  Pages are reclaimed based on global lru lists that have
> nothing to do with container-internal age, and some memcgs may be
> reclaimed from much more than others.
>
> This patch makes traditional global reclaim consider container
> boundaries and no longer scan the global lru lists.  For each zone
> scanned, the memcg hierarchy is walked and pages are reclaimed from
> the per-memcg lru lists of the respective zone.  For now, the
> hierarchy walk is bounded to one full round-trip through the
> hierarchy, or until the number of reclaimed pages reaches the overall
> reclaim target, whichever comes first.
>
> Conceptually, global memory pressure is then treated as if the root
> memcg had hit its limit.  Since all existing memcgs contribute to the
> usage of the root memcg, global reclaim is nothing more than target
> reclaim starting from the root memcg.  The code is mostly the same for
> both cases, except for a few heuristics and statistics that do not
> always apply.  They are distinguished by a newly introduced
> global_reclaim() primitive.
>
> One implication of this change is that pages have to be linked to the
> lru lists of the root memcg again, which could be optimized away with
> the old scheme.  The costs are not measurable, though, even with
> worst-case microbenchmarks.
>
> As global reclaim no longer relies on global lru lists, this change is
> also in preparation to remove those completely.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
>  include/linux/memcontrol.h |   15 ++++
>  mm/memcontrol.c            |  176 ++++++++++++++++++++++----------------
>  mm/vmscan.c                |  121 ++++++++++++++++++++--------
>  3 files changed, 218 insertions(+), 94 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 5e9840f5..332b0a6 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -101,6 +101,10 @@ mem_cgroup_prepare_migration(struct page *page,
> ?extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
> ? ? ? ?struct page *oldpage, struct page *newpage, bool migration_ok);
>
> +struct mem_cgroup *mem_cgroup_hierarchy_walk(struct mem_cgroup *,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct mem_cgroup *);
> +void mem_cgroup_stop_hierarchy_walk(struct mem_cgroup *, struct mem_cgroup *);
> +
> ?/*
> ?* For memory reclaim.
> ?*/
> @@ -321,6 +325,17 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
> ? ? ? ?return NULL;
> ?}
>
> +static inline struct mem_cgroup *mem_cgroup_hierarchy_walk(struct mem_cgroup *r,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct mem_cgroup *m)
> +{
> + ? ? ? return NULL;
> +}
> +
> +static inline void mem_cgroup_stop_hierarchy_walk(struct mem_cgroup *r,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct mem_cgroup *m)
> +{
> +}
> +
> ?static inline void
> ?mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
> ?{
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index bf5ab87..850176e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -313,8 +313,8 @@ static bool move_file(void)
> ?}
>
> ?/*
> - * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
> - * limit reclaim to prevent infinite loops, if they ever occur.
> + * Maximum loops in reclaim, used for soft limit reclaim to prevent
> + * infinite loops, if they ever occur.
> ?*/
> ?#define ? ? ? ?MEM_CGROUP_MAX_RECLAIM_LOOPS ? ? ? ? ? ?(100)
> ?#define ? ? ? ?MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2)
> @@ -340,7 +340,7 @@ enum charge_type {
> ?#define OOM_CONTROL ? ? ? ? ? ?(0)
>
> ?/*
> - * Reclaim flags for mem_cgroup_hierarchical_reclaim
> + * Reclaim flags
> ?*/
> ?#define MEM_CGROUP_RECLAIM_NOSWAP_BIT ?0x0
> ?#define MEM_CGROUP_RECLAIM_NOSWAP ? ? ?(1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
> @@ -846,8 +846,6 @@ void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
> ? ? ? ?mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
> ? ? ? ?/* huge page split is done under lru_lock. so, we have no races. */
> ? ? ? ?MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
> - ? ? ? if (mem_cgroup_is_root(pc->mem_cgroup))
> - ? ? ? ? ? ? ? return;
> ? ? ? ?VM_BUG_ON(list_empty(&pc->lru));
> ? ? ? ?list_del_init(&pc->lru);
> ?}
> @@ -872,13 +870,11 @@ void mem_cgroup_rotate_reclaimable_page(struct page *page)
> ? ? ? ? ? ? ? ?return;
>
> ? ? ? ?pc = lookup_page_cgroup(page);
> - ? ? ? /* unused or root page is not rotated. */
> + ? ? ? /* unused page is not rotated. */
> ? ? ? ?if (!PageCgroupUsed(pc))
> ? ? ? ? ? ? ? ?return;
> ? ? ? ?/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
> ? ? ? ?smp_rmb();
> - ? ? ? if (mem_cgroup_is_root(pc->mem_cgroup))
> - ? ? ? ? ? ? ? return;
> ? ? ? ?mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
> ? ? ? ?list_move_tail(&pc->lru, &mz->lists[lru]);
> ?}
> @@ -892,13 +888,11 @@ void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
> ? ? ? ? ? ? ? ?return;
>
> ? ? ? ?pc = lookup_page_cgroup(page);
> - ? ? ? /* unused or root page is not rotated. */
> + ? ? ? /* unused page is not rotated. */
> ? ? ? ?if (!PageCgroupUsed(pc))
> ? ? ? ? ? ? ? ?return;
> ? ? ? ?/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
> ? ? ? ?smp_rmb();
> - ? ? ? if (mem_cgroup_is_root(pc->mem_cgroup))
> - ? ? ? ? ? ? ? return;
> ? ? ? ?mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
> ? ? ? ?list_move(&pc->lru, &mz->lists[lru]);
> ?}
> @@ -920,8 +914,6 @@ void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
> ? ? ? ?/* huge page split is done under lru_lock. so, we have no races. */
> ? ? ? ?MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
> ? ? ? ?SetPageCgroupAcctLRU(pc);
> - ? ? ? if (mem_cgroup_is_root(pc->mem_cgroup))
> - ? ? ? ? ? ? ? return;
> ? ? ? ?list_add(&pc->lru, &mz->lists[lru]);
> ?}
>
> @@ -1381,6 +1373,97 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
> ? ? ? ?return min(limit, memsw);
> ?}
>
> +/**
> + * mem_cgroup_hierarchy_walk - iterate over a memcg hierarchy
> + * @root: starting point of the hierarchy
> + * @prev: previous position or NULL
> + *
> + * Caller must hold a reference to @root. ?While this function will
> + * return @root as part of the walk, it will never increase its
> + * reference count.
> + *
> + * Caller must clean up with mem_cgroup_stop_hierarchy_walk() when it
> + * stops the walk potentially before the full round trip.
> + */
> +struct mem_cgroup *mem_cgroup_hierarchy_walk(struct mem_cgroup *root,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct mem_cgroup *prev)
> +{
> + ? ? ? struct mem_cgroup *mem;
> +
> + ? ? ? if (mem_cgroup_disabled())
> + ? ? ? ? ? ? ? return NULL;
> +
> + ? ? ? if (!root)
> + ? ? ? ? ? ? ? root = root_mem_cgroup;
> + ? ? ? /*
> + ? ? ? ?* Even without hierarchy explicitely enabled in the root
> + ? ? ? ?* memcg, it is the ultimate parent of all memcgs.
> + ? ? ? ?*/
> + ? ? ? if (!(root == root_mem_cgroup || root->use_hierarchy))
> + ? ? ? ? ? ? ? return root;

Hmm, because the ROOT cgroup has no limit and no control, if root == root_mem_cgroup
we always do a full hierarchy scan. Right?


> + ? ? ? if (prev && prev != root)
> + ? ? ? ? ? ? ? css_put(&prev->css);
> + ? ? ? do {
> + ? ? ? ? ? ? ? int id = root->last_scanned_child;
> + ? ? ? ? ? ? ? struct cgroup_subsys_state *css;
> +
> + ? ? ? ? ? ? ? rcu_read_lock();
> + ? ? ? ? ? ? ? css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
> + ? ? ? ? ? ? ? if (css && (css == &root->css || css_tryget(css)))
> + ? ? ? ? ? ? ? ? ? ? ? mem = container_of(css, struct mem_cgroup, css);
> + ? ? ? ? ? ? ? rcu_read_unlock();
> + ? ? ? ? ? ? ? if (!css)
> + ? ? ? ? ? ? ? ? ? ? ? id = 0;
> + ? ? ? ? ? ? ? root->last_scanned_child = id;
> + ? ? ? } while (!mem);
> + ? ? ? return mem;
> +}
> +
> +/**
> + * mem_cgroup_stop_hierarchy_walk - clean up after partial hierarchy walk
> + * @root: starting point in the hierarchy
> + * @mem: last position during the walk
> + */
> +void mem_cgroup_stop_hierarchy_walk(struct mem_cgroup *root,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct mem_cgroup *mem)
> +{
> + ? ? ? if (mem && mem != root)
> + ? ? ? ? ? ? ? css_put(&mem->css);
> +}

Recently I have been wondering whether it's better to use cgroup_exclude_rmdir() and
cgroup_release_and_wakeup_rmdir() for this hierarchy scan... hm.


> +
> +static unsigned long mem_cgroup_reclaim(struct mem_cgroup *mem,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_t gfp_mask,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long flags)
> +{
> + ? ? ? unsigned long total = 0;
> + ? ? ? bool noswap = false;
> + ? ? ? int loop;
> +
> + ? ? ? if ((flags & MEM_CGROUP_RECLAIM_NOSWAP) || mem->memsw_is_minimum)
> + ? ? ? ? ? ? ? noswap = true;
> + ? ? ? for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
> + ? ? ? ? ? ? ? drain_all_stock_async();

In a recent patch, I removed this call here because it wakes up the
kworker too much.  I will post that patch as a bugfix.  So, please move
this call somewhere that is not invoked as frequently.


> + ? ? ? ? ? ? ? total += try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? get_swappiness(mem));
> + ? ? ? ? ? ? ? /*
> + ? ? ? ? ? ? ? ?* Avoid freeing too much when shrinking to resize the
> + ? ? ? ? ? ? ? ?* limit. ?XXX: Shouldn't the margin check be enough?
> + ? ? ? ? ? ? ? ?*/
> + ? ? ? ? ? ? ? if (total && (flags & MEM_CGROUP_RECLAIM_SHRINK))
> + ? ? ? ? ? ? ? ? ? ? ? break;
> + ? ? ? ? ? ? ? if (mem_cgroup_margin(mem))
> + ? ? ? ? ? ? ? ? ? ? ? break;
> + ? ? ? ? ? ? ? /*
> + ? ? ? ? ? ? ? ?* If we have not been able to reclaim anything after
> + ? ? ? ? ? ? ? ?* two reclaim attempts, there may be no reclaimable
> + ? ? ? ? ? ? ? ?* pages in this hierarchy.
> + ? ? ? ? ? ? ? ?*/
> + ? ? ? ? ? ? ? if (loop && !total)
> + ? ? ? ? ? ? ? ? ? ? ? break;
> + ? ? ? }
> + ? ? ? return total;
> +}
> +
> ?/*
> ?* Visit the first child (need not be the first child as per the ordering
> ?* of the cgroup list, since we track last_scanned_child) of @mem and use
> @@ -1418,29 +1501,14 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
> ? ? ? ?return ret;
> ?}
>
> -/*
> - * Scan the hierarchy if needed to reclaim memory. We remember the last child
> - * we reclaimed from, so that we don't end up penalizing one child extensively
> - * based on its position in the children list.
> - *
> - * root_mem is the original ancestor that we've been reclaim from.
> - *
> - * We give up and return to the caller when we visit root_mem twice.
> - * (other groups can be removed while we're walking....)
> - *
> - * If shrink==true, for avoiding to free too much, this returns immedieately.
> - */
> -static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct zone *zone,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_t gfp_mask,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long reclaim_options)
> +static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_mem,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct zone *zone,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?gfp_t gfp_mask)
> ?{
> ? ? ? ?struct mem_cgroup *victim;
> ? ? ? ?int ret, total = 0;
> ? ? ? ?int loop = 0;
> - ? ? ? bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
> - ? ? ? bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
> - ? ? ? bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
> + ? ? ? bool noswap = false;
> ? ? ? ?unsigned long excess;
>
> ? ? ? ?excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
> @@ -1461,7 +1529,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? * anything, it might because there are
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? * no reclaimable pages under this hierarchy
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? */
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? if (!check_soft || !total) {
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? if (!total) {
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?css_put(&victim->css);
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?break;
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?}
> @@ -1483,26 +1551,11 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> ? ? ? ? ? ? ? ? ? ? ? ?css_put(&victim->css);
> ? ? ? ? ? ? ? ? ? ? ? ?continue;
> ? ? ? ? ? ? ? ?}
> - ? ? ? ? ? ? ? /* we use swappiness of local cgroup */
> - ? ? ? ? ? ? ? if (check_soft)
> - ? ? ? ? ? ? ? ? ? ? ? ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? noswap, get_swappiness(victim), zone);
> - ? ? ? ? ? ? ? else
> - ? ? ? ? ? ? ? ? ? ? ? ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? noswap, get_swappiness(victim));
> + ? ? ? ? ? ? ? ret = mem_cgroup_shrink_node_zone(victim, gfp_mask, noswap,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? get_swappiness(victim), zone);
> ? ? ? ? ? ? ? ?css_put(&victim->css);
> - ? ? ? ? ? ? ? /*
> - ? ? ? ? ? ? ? ?* At shrinking usage, we can't check we should stop here or
> - ? ? ? ? ? ? ? ?* reclaim more. It's depends on callers. last_scanned_child
> - ? ? ? ? ? ? ? ?* will work enough for keeping fairness under tree.
> - ? ? ? ? ? ? ? ?*/
> - ? ? ? ? ? ? ? if (shrink)
> - ? ? ? ? ? ? ? ? ? ? ? return ret;
> ? ? ? ? ? ? ? ?total += ret;
> - ? ? ? ? ? ? ? if (check_soft) {
> - ? ? ? ? ? ? ? ? ? ? ? if (!res_counter_soft_limit_excess(&root_mem->res))
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? return total;
> - ? ? ? ? ? ? ? } else if (mem_cgroup_margin(root_mem))
> + ? ? ? ? ? ? ? if (!res_counter_soft_limit_excess(&root_mem->res))
> ? ? ? ? ? ? ? ? ? ? ? ?return total;
> ? ? ? ?}
> ? ? ? ?return total;
> @@ -1927,8 +1980,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
> ? ? ? ?if (!(gfp_mask & __GFP_WAIT))
> ? ? ? ? ? ? ? ?return CHARGE_WOULDBLOCK;
>
> - ? ? ? ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_mask, flags);
> + ? ? ? ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
> ? ? ? ?if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> ? ? ? ? ? ? ? ?return CHARGE_RETRY;
> ? ? ? ?/*

It seems this clean-up around hierarchy and softlimit can be in an
independent patch, no ?


> @@ -3085,7 +3137,7 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
>
> ?/*
> ?* A call to try to shrink memory usage on charge failure at shmem's swapin.
> - * Calling hierarchical_reclaim is not enough because we should update
> + * Calling reclaim is not enough because we should update
> ?* last_oom_jiffies to prevent pagefault_out_of_memory from invoking global OOM.
> ?* Moreover considering hierarchy, we should reclaim from the mem_over_limit,
> ?* not from the memcg which this page would be charged to.
> @@ -3167,7 +3219,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> ? ? ? ?int enlarge;
>
> ? ? ? ?/*
> - ? ? ? ?* For keeping hierarchical_reclaim simple, how long we should retry
> + ? ? ? ?* For keeping reclaim simple, how long we should retry
> ? ? ? ? * is depends on callers. We set our retry-count to be function
> ? ? ? ? * of # of children which we should visit in this loop.
> ? ? ? ? */
> @@ -3210,8 +3262,8 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> ? ? ? ? ? ? ? ?if (!ret)
> ? ? ? ? ? ? ? ? ? ? ? ?break;
>
> - ? ? ? ? ? ? ? mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? MEM_CGROUP_RECLAIM_SHRINK);
> + ? ? ? ? ? ? ? mem_cgroup_reclaim(memcg, GFP_KERNEL,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?MEM_CGROUP_RECLAIM_SHRINK);
> ? ? ? ? ? ? ? ?curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
> ? ? ? ? ? ? ? ?/* Usage is reduced ? */
> ? ? ? ? ? ? ? ?if (curusage >= oldusage)
> @@ -3269,9 +3321,9 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> ? ? ? ? ? ? ? ?if (!ret)
> ? ? ? ? ? ? ? ? ? ? ? ?break;
>
> - ? ? ? ? ? ? ? mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? MEM_CGROUP_RECLAIM_NOSWAP |
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? MEM_CGROUP_RECLAIM_SHRINK);
> + ? ? ? ? ? ? ? mem_cgroup_reclaim(memcg, GFP_KERNEL,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?MEM_CGROUP_RECLAIM_NOSWAP |
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?MEM_CGROUP_RECLAIM_SHRINK);
> ? ? ? ? ? ? ? ?curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
> ? ? ? ? ? ? ? ?/* Usage is reduced ? */
> ? ? ? ? ? ? ? ?if (curusage >= oldusage)
> @@ -3311,9 +3363,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> ? ? ? ? ? ? ? ?if (!mz)
> ? ? ? ? ? ? ? ? ? ? ? ?break;
>
> - ? ? ? ? ? ? ? reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_mask,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? MEM_CGROUP_RECLAIM_SOFT);
> + ? ? ? ? ? ? ? reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone, gfp_mask);
> ? ? ? ? ? ? ? ?nr_reclaimed += reclaimed;
> ? ? ? ? ? ? ? ?spin_lock(&mctz->lock);
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8bfd450..7e9bfca 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -104,7 +104,16 @@ struct scan_control {
> ? ? ? ? */
> ? ? ? ?reclaim_mode_t reclaim_mode;
>
> - ? ? ? /* Which cgroup do we reclaim from */
> + ? ? ? /*
> + ? ? ? ?* The memory cgroup that hit its hard limit and is the
> + ? ? ? ?* primary target of this reclaim invocation.
> + ? ? ? ?*/
> + ? ? ? struct mem_cgroup *target_mem_cgroup;
> +
> + ? ? ? /*
> + ? ? ? ?* The memory cgroup that is currently being scanned as a
> + ? ? ? ?* child and contributor to the usage of target_mem_cgroup.
> + ? ? ? ?*/
> ? ? ? ?struct mem_cgroup *mem_cgroup;
>
> ? ? ? ?/*
> @@ -154,9 +163,36 @@ static LIST_HEAD(shrinker_list);
> ?static DECLARE_RWSEM(shrinker_rwsem);
>
> ?#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> -#define scanning_global_lru(sc) ? ? ? ?(!(sc)->mem_cgroup)
> +/**
> + * global_reclaim - whether reclaim is global or due to memcg hard limit
> + * @sc: scan control of this reclaim invocation
> + */
> +static bool global_reclaim(struct scan_control *sc)
> +{
> + ? ? ? return !sc->target_mem_cgroup;
> +}
> +/**
> + * scanning_global_lru - whether scanning global lrus or per-memcg lrus
> + * @sc: scan control of this reclaim invocation
> + */
> +static bool scanning_global_lru(struct scan_control *sc)
> +{
> + ? ? ? /*
> + ? ? ? ?* Unless memory cgroups are disabled on boot, the traditional
> + ? ? ? ?* global lru lists are never scanned and reclaim will always
> + ? ? ? ?* operate on the per-memcg lru lists.
> + ? ? ? ?*/
> + ? ? ? return mem_cgroup_disabled();
> +}
> ?#else
> -#define scanning_global_lru(sc) ? ? ? ?(1)
> +static bool global_reclaim(struct scan_control *sc)
> +{
> + ? ? ? return true;
> +}
> +static bool scanning_global_lru(struct scan_control *sc)
> +{
> + ? ? ? return true;
> +}
> ?#endif
>
> ?static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> @@ -1228,7 +1264,7 @@ static int too_many_isolated(struct zone *zone, int file,
> ? ? ? ?if (current_is_kswapd())
> ? ? ? ? ? ? ? ?return 0;
>
> - ? ? ? if (!scanning_global_lru(sc))
> + ? ? ? if (!global_reclaim(sc))
> ? ? ? ? ? ? ? ?return 0;
>
> ? ? ? ?if (file) {
> @@ -1397,13 +1433,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> ? ? ? ? ? ? ? ? ? ? ? ?sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?ISOLATE_BOTH : ISOLATE_INACTIVE,
> ? ? ? ? ? ? ? ? ? ? ? ?zone, 0, file);
> - ? ? ? ? ? ? ? zone->pages_scanned += nr_scanned;
> - ? ? ? ? ? ? ? if (current_is_kswapd())
> - ? ? ? ? ? ? ? ? ? ? ? __count_zone_vm_events(PGSCAN_KSWAPD, zone,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?nr_scanned);
> - ? ? ? ? ? ? ? else
> - ? ? ? ? ? ? ? ? ? ? ? __count_zone_vm_events(PGSCAN_DIRECT, zone,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?nr_scanned);
> ? ? ? ?} else {
> ? ? ? ? ? ? ? ?nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
> ? ? ? ? ? ? ? ? ? ? ? ?&page_list, &nr_scanned, sc->order,
> @@ -1411,10 +1440,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?ISOLATE_BOTH : ISOLATE_INACTIVE,
> ? ? ? ? ? ? ? ? ? ? ? ?zone, sc->mem_cgroup,
> ? ? ? ? ? ? ? ? ? ? ? ?0, file);
> - ? ? ? ? ? ? ? /*
> - ? ? ? ? ? ? ? ?* mem_cgroup_isolate_pages() keeps track of
> - ? ? ? ? ? ? ? ?* scanned pages on its own.
> - ? ? ? ? ? ? ? ?*/
> + ? ? ? }
> +
> + ? ? ? if (global_reclaim(sc)) {
> + ? ? ? ? ? ? ? zone->pages_scanned += nr_scanned;
> + ? ? ? ? ? ? ? if (current_is_kswapd())
> + ? ? ? ? ? ? ? ? ? ? ? __count_zone_vm_events(PGSCAN_KSWAPD, zone,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?nr_scanned);
> + ? ? ? ? ? ? ? else
> + ? ? ? ? ? ? ? ? ? ? ? __count_zone_vm_events(PGSCAN_DIRECT, zone,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?nr_scanned);
> ? ? ? ?}
>
> ? ? ? ?if (nr_taken == 0) {
> @@ -1520,18 +1555,16 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?&pgscanned, sc->order,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?ISOLATE_ACTIVE, zone,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?1, file);
> - ? ? ? ? ? ? ? zone->pages_scanned += pgscanned;
> ? ? ? ?} else {
> ? ? ? ? ? ? ? ?nr_taken = mem_cgroup_isolate_pages(nr_pages, &l_hold,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?&pgscanned, sc->order,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?ISOLATE_ACTIVE, zone,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?sc->mem_cgroup, 1, file);
> - ? ? ? ? ? ? ? /*
> - ? ? ? ? ? ? ? ?* mem_cgroup_isolate_pages() keeps track of
> - ? ? ? ? ? ? ? ?* scanned pages on its own.
> - ? ? ? ? ? ? ? ?*/
> ? ? ? ?}
>
> + ? ? ? if (global_reclaim(sc))
> + ? ? ? ? ? ? ? zone->pages_scanned += pgscanned;
> +
> ? ? ? ?reclaim_stat->recent_scanned[file] += nr_taken;
>
> ? ? ? ?__count_zone_vm_events(PGREFILL, zone, pgscanned);
> @@ -1752,7 +1785,7 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
> ? ? ? ?file ?= zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
> ? ? ? ? ? ? ? ?zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);
>
> - ? ? ? if (scanning_global_lru(sc)) {
> + ? ? ? if (global_reclaim(sc)) {
> ? ? ? ? ? ? ? ?free ?= zone_page_state(zone, NR_FREE_PAGES);
> ? ? ? ? ? ? ? ?/* If we have very few page cache pages,
> ? ? ? ? ? ? ? ? ? force-scan anon pages. */
> @@ -1889,8 +1922,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
> ?/*
> ?* This is a basic per-zone page freer. ?Used by both kswapd and direct reclaim.
> ?*/
> -static void shrink_zone(int priority, struct zone *zone,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct scan_control *sc)
> +static void do_shrink_zone(int priority, struct zone *zone,
> + ? ? ? ? ? ? ? ? ? ? ? ? ?struct scan_control *sc)
> ?{
> ? ? ? ?unsigned long nr[NR_LRU_LISTS];
> ? ? ? ?unsigned long nr_to_scan;
> @@ -1943,6 +1976,31 @@ restart:
> ? ? ? ?throttle_vm_writeout(sc->gfp_mask);
> ?}
>
> +static void shrink_zone(int priority, struct zone *zone,
> + ? ? ? ? ? ? ? ? ? ? ? struct scan_control *sc)
> +{
> + ? ? ? unsigned long nr_reclaimed_before = sc->nr_reclaimed;
> + ? ? ? struct mem_cgroup *root = sc->target_mem_cgroup;
> + ? ? ? struct mem_cgroup *first, *mem = NULL;
> +
> + ? ? ? first = mem = mem_cgroup_hierarchy_walk(root, mem);

Hmm, I think we should add some scheduling here, later
(e.g. selecting a group over its soft limit, or a group which has
easily reclaimable pages in this zone).

The name hierarchy_walk() sounds like "full scan in round-robin, always".
Could you find a better name?

> + ? ? ? for (;;) {
> + ? ? ? ? ? ? ? unsigned long nr_reclaimed;
> +
> + ? ? ? ? ? ? ? sc->mem_cgroup = mem;
> + ? ? ? ? ? ? ? do_shrink_zone(priority, zone, sc);
> +
> + ? ? ? ? ? ? ? nr_reclaimed = sc->nr_reclaimed - nr_reclaimed_before;
> + ? ? ? ? ? ? ? if (nr_reclaimed >= sc->nr_to_reclaim)
> + ? ? ? ? ? ? ? ? ? ? ? break;

What does this calculation mean?  Shouldn't we base this early exit on the
number of pages "scanned" rather than "reclaimed"?

> +
> + ? ? ? ? ? ? ? mem = mem_cgroup_hierarchy_walk(root, mem);
> + ? ? ? ? ? ? ? if (mem == first)
> + ? ? ? ? ? ? ? ? ? ? ? break;

Why do we quit the loop here?

> + ? ? ? }
> + ? ? ? mem_cgroup_stop_hierarchy_walk(root, mem);
> +}



Thanks,
-Kame

2011-06-02 14:24:31

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 8/8] mm: make per-memcg lru lists exclusive

On Thu, Jun 02, 2011 at 10:16:59PM +0900, Hiroyuki Kamezawa wrote:
> 2011/6/1 Johannes Weiner <[email protected]>:
> > All lru list walkers have been converted to operate on per-memcg
> > lists, the global per-zone lists are no longer required.
> >
> > This patch makes the per-memcg lists exclusive and removes the global
> > lists from memcg-enabled kernels.
> >
> > The per-memcg lists now string up page descriptors directly, which
> > unifies/simplifies the list isolation code of page reclaim as well as
> > it saves a full double-linked list head for each page in the system.
> >
> > At the core of this change is the introduction of the lruvec
> > structure, an array of all lru list heads.  It exists for each zone
> > globally, and for each zone per memcg.  All lru list operations are
> > now done in generic code against lruvecs, with the memcg lru list
> > primitives only doing accounting and returning the proper lruvec for
> > the currently scanned memcg on isolation, or for the respective page
> > on putback.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
>
>
> could you divide this into
>   - introduce lruvec
>   - don't record section information into pc->flags because we see
>     "page" on memcg LRU and there is no requirement to get page from "pc".
>   - remove pc->lru completely

Yes, that makes sense. It shall be fixed in the next version.
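
For reference, the lruvec at the core of patch 8/8 boils down to something
like the sketch below; the structure and the mem_cgroup_lru_move_lists()
call follow the hunks quoted in this thread, while the wrapper function is
purely illustrative:

    struct lruvec {
	    struct list_head lists[NR_LRU_LISTS];
    };

    /*
     * One lruvec is embedded in each zone, and one in each per-memcg,
     * per-zone structure.  Generic LRU code only ever operates on a
     * lruvec; the memcg primitive accounts the move and hands back the
     * lruvec the page belongs on:
     */
    static void move_page_between_lrus(struct zone *zone, struct page *page,
				       enum lru_list from, enum lru_list to)
    {
	    struct lruvec *lruvec;

	    lruvec = mem_cgroup_lru_move_lists(zone, page, from, to);
	    list_move(&page->lru, &lruvec->lists[to]);
    }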

2011-06-02 14:28:00

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 7/8] vmscan: memcg-aware unevictable page rescue scanner

On Thu, Jun 02, 2011 at 10:27:15PM +0900, Hiroyuki Kamezawa wrote:
> 2011/6/1 Johannes Weiner <[email protected]>:
> > Once the per-memcg lru lists are exclusive, the unevictable page
> > rescue scanner can no longer work on the global zone lru lists.
> >
> > This converts it to go through all memcgs and scan their respective
> > unevictable lists instead.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
>
> Hm, isn't it better to have only one GLOBAL LRU for unevictable pages ?
> memcg only needs counter for unevictable pages and LRU is not necessary
> to be per memcg because we don't reclaim it...

That's true, and I will look into it. But keep in mind that it needs
special-casing that one list type from all the others, so maybe it's
just easier to keep it like this.

2011-06-02 14:29:27

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 6/8] vmscan: change zone_nr_lru_pages to take memcg instead of scan control

On Thu, Jun 02, 2011 at 10:30:48PM +0900, Hiroyuki Kamezawa wrote:
> 2011/6/1 Johannes Weiner <[email protected]>:
> > This function only uses sc->mem_cgroup from the scan control. ?Change
> > it to take a memcg argument directly, so callsites without an actual
> > reclaim context can use it as well.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
>
> I wonder whether this can be cut out and can be merged immediately, no?

I don't see anything standing in the way of that. OTOH, all current
users have scan controls, so it's not really urgent, either.

2011-06-02 15:01:52

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 2/8] mm: memcg-aware global reclaim

On Thu, Jun 02, 2011 at 10:59:01PM +0900, Hiroyuki Kamezawa wrote:
> 2011/6/1 Johannes Weiner <[email protected]>:
> > @@ -1381,6 +1373,97 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
> > ? ? ? ?return min(limit, memsw);
> > ?}
> >
> > +/**
> > + * mem_cgroup_hierarchy_walk - iterate over a memcg hierarchy
> > + * @root: starting point of the hierarchy
> > + * @prev: previous position or NULL
> > + *
> > + * Caller must hold a reference to @root. ?While this function will
> > + * return @root as part of the walk, it will never increase its
> > + * reference count.
> > + *
> > + * Caller must clean up with mem_cgroup_stop_hierarchy_walk() when it
> > + * stops the walk potentially before the full round trip.
> > + */
> > +struct mem_cgroup *mem_cgroup_hierarchy_walk(struct mem_cgroup *root,
> > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct mem_cgroup *prev)
> > +{
> > + ? ? ? struct mem_cgroup *mem;
> > +
> > + ? ? ? if (mem_cgroup_disabled())
> > + ? ? ? ? ? ? ? return NULL;
> > +
> > + ? ? ? if (!root)
> > + ? ? ? ? ? ? ? root = root_mem_cgroup;
> > + ? ? ? /*
> > + ? ? ? ?* Even without hierarchy explicitely enabled in the root
> > + ? ? ? ?* memcg, it is the ultimate parent of all memcgs.
> > + ? ? ? ?*/
> > + ? ? ? if (!(root == root_mem_cgroup || root->use_hierarchy))
> > + ? ? ? ? ? ? ? return root;
>
> Hmm, because ROOT cgroup has no limit and control, if root=root_mem_cgroup,
> we do full hierarchy scan always. Right ?

What it essentially means is that all existing memcgs in the system
contribute to the usage of root_mem_cgroup.

If there is global memory pressure, we need to consider reclaiming
from every single memcg in the system.

> > +static unsigned long mem_cgroup_reclaim(struct mem_cgroup *mem,
> > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_t gfp_mask,
> > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long flags)
> > +{
> > + ? ? ? unsigned long total = 0;
> > + ? ? ? bool noswap = false;
> > + ? ? ? int loop;
> > +
> > + ? ? ? if ((flags & MEM_CGROUP_RECLAIM_NOSWAP) || mem->memsw_is_minimum)
> > + ? ? ? ? ? ? ? noswap = true;
> > + ? ? ? for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
> > + ? ? ? ? ? ? ? drain_all_stock_async();
>
> In recent patch, I removed this call here because this wakes up
> kworker too much.
> I will post that patch as a bugfix. So, please adjust this call
> somewhere which is
> not called frequently.

Okay, please CC me when you send out the bugfix.

> > @@ -1927,8 +1980,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
> > ? ? ? ?if (!(gfp_mask & __GFP_WAIT))
> > ? ? ? ? ? ? ? ?return CHARGE_WOULDBLOCK;
> >
> > - ? ? ? ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> > - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_mask, flags);
> > + ? ? ? ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
> > ? ? ? ?if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> > ? ? ? ? ? ? ? ?return CHARGE_RETRY;
> > ? ? ? ?/*
>
> It seems this clean-up around hierarchy and softlimit can be in an
> independent patch, no ?

Hm, why do you think it's a cleanup? The hierarchical target reclaim
code is moved to vmscan.c and as a result the entry points for hard
limit and soft limit reclaim differ. This is why the original
function, mem_cgroup_hierarchical_reclaim() has to be split into two
parts.
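
In signature form, that split looks roughly like this (bodies omitted, taken
from the hunks quoted above):

    /* hard limit / charge path: no zone context, retries based on flags */
    static unsigned long mem_cgroup_reclaim(struct mem_cgroup *mem,
					    gfp_t gfp_mask,
					    unsigned long flags);

    /* soft limit path: reclaims from one zone on behalf of root_mem */
    static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_mem,
				       struct zone *zone,
				       gfp_t gfp_mask);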

> > @@ -1943,6 +1976,31 @@ restart:
> > ? ? ? ?throttle_vm_writeout(sc->gfp_mask);
> > ?}
> >
> > +static void shrink_zone(int priority, struct zone *zone,
> > + ? ? ? ? ? ? ? ? ? ? ? struct scan_control *sc)
> > +{
> > + ? ? ? unsigned long nr_reclaimed_before = sc->nr_reclaimed;
> > + ? ? ? struct mem_cgroup *root = sc->target_mem_cgroup;
> > + ? ? ? struct mem_cgroup *first, *mem = NULL;
> > +
> > + ? ? ? first = mem = mem_cgroup_hierarchy_walk(root, mem);
>
> Hmm, I think we should add some scheduling here, later.
> (as select a group over softlimit or select a group which has
> easily reclaimable pages on this zone.)
>
> This name as hierarchy_walk() sounds like "full scan in round-robin, always".
> Could you find better name ?

Okay, I'll try.

> > + ? ? ? for (;;) {
> > + ? ? ? ? ? ? ? unsigned long nr_reclaimed;
> > +
> > + ? ? ? ? ? ? ? sc->mem_cgroup = mem;
> > + ? ? ? ? ? ? ? do_shrink_zone(priority, zone, sc);
> > +
> > + ? ? ? ? ? ? ? nr_reclaimed = sc->nr_reclaimed - nr_reclaimed_before;
> > + ? ? ? ? ? ? ? if (nr_reclaimed >= sc->nr_to_reclaim)
> > + ? ? ? ? ? ? ? ? ? ? ? break;
>
> what this calculation means ? Shouldn't we do this quit based on the
> number of "scan"
> rather than "reclaimed" ?

It aborts the loop once sc->nr_to_reclaim pages have been reclaimed
from that zone during that hierarchy walk, to prevent overreclaim.

If you have unbalanced sizes of memcgs in the system, it is not
desirable to have every reclaimer scan all memcgs, but let those quit
early that have made some progress on the bigger memcgs.

It's essentially a forward propagation of the same check in
do_shrink_zone().  It trades absolute fairness for average reclaim
latency.

Note that kswapd sets the reclaim target to infinity, so this
optimization applies only to direct reclaimers.

> > + ? ? ? ? ? ? ? mem = mem_cgroup_hierarchy_walk(root, mem);
> > + ? ? ? ? ? ? ? if (mem == first)
> > + ? ? ? ? ? ? ? ? ? ? ? break;
>
> Why we quit loop ?

get_scan_count() for traditional global reclaim returns the scan
target for the zone.

With this per-memcg reclaimer, get_scan_count() will return scan
targets for the respective per-memcg zone subsizes.

So once we have gone through all memcgs, we should have scanned the
amount of pages that global reclaim would have deemed sensible for
that zone at that priority level.

As such, this is the exit condition based on scan count you referred
to above.
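
Putting both answers together, the walk from the quoted patch reads roughly
like this with the exit conditions spelled out (a sketch of the posted hunk,
not a final version):

    static void shrink_zone(int priority, struct zone *zone,
			    struct scan_control *sc)
    {
	    unsigned long nr_reclaimed_before = sc->nr_reclaimed;
	    struct mem_cgroup *root = sc->target_mem_cgroup;
	    struct mem_cgroup *first, *mem = NULL;

	    first = mem = mem_cgroup_hierarchy_walk(root, mem);
	    for (;;) {
		    unsigned long nr_reclaimed;

		    sc->mem_cgroup = mem;
		    do_shrink_zone(priority, zone, sc);

		    /*
		     * Exit 1: enough progress on this zone during this
		     * walk.  Only direct reclaimers have a finite
		     * nr_to_reclaim; kswapd sets it to infinity and thus
		     * always does the full round trip.
		     */
		    nr_reclaimed = sc->nr_reclaimed - nr_reclaimed_before;
		    if (nr_reclaimed >= sc->nr_to_reclaim)
			    break;

		    /*
		     * Exit 2: full round trip.  Each memcg was scanned in
		     * proportion to its share of the zone, so the sum is
		     * what global reclaim would have scanned for this
		     * zone at this priority level.
		     */
		    mem = mem_cgroup_hierarchy_walk(root, mem);
		    if (mem == first)
			    break;
	    }
	    mem_cgroup_stop_hierarchy_walk(root, mem);
    }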

2011-06-02 15:51:45

by Ying Han

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

On Thu, Jun 2, 2011 at 12:50 AM, Johannes Weiner <[email protected]> wrote:
> On Wed, Jun 01, 2011 at 09:05:18PM -0700, Ying Han wrote:
>> On Wed, Jun 1, 2011 at 4:52 PM, Hiroyuki Kamezawa
>> <[email protected]> wrote:
>> > 2011/6/1 Johannes Weiner <[email protected]>:
>> >> Hi,
>> >>
>> >> this is the second version of the memcg naturalization series. ?The
>> >> notable changes since the first submission are:
>> >>
>> >> ? ?o the hierarchy walk is now intermittent and will abort and
>> >> ? ? ?remember the last scanned child after sc->nr_to_reclaim pages
>> >> ? ? ?have been reclaimed during the walk in one zone (Rik)
>> >>
>> >> ? ?o the global lru lists are never scanned when memcg is enabled
>> >> ? ? ?after #2 'memcg-aware global reclaim', which makes this patch
>> >> ? ? ?self-sufficient and complete without requiring the per-memcg lru
>> >> ? ? ?lists to be exclusive (Michal)
>> >>
>> >> ? ?o renamed sc->memcg and sc->current_memcg to sc->target_mem_cgroup
>> >> ? ? ?and sc->mem_cgroup and fixed their documentation, I hope this is
>> >> ? ? ?better understandable now (Rik)
>> >>
>> >> ? ?o the reclaim statistic counters have been renamed. ?there is no
>> >> ? ? ?more distinction between 'pgfree' and 'pgsteal', it is now
>> >> ? ? ?'pgreclaim' in both cases; 'kswapd' has been replaced by
>> >> ? ? ?'background'
>> >>
>> >> ? ?o fixed a nasty crash in the hierarchical soft limit check that
>> >> ? ? ?happened during global reclaim in memcgs that are hierarchical
>> >> ? ? ?but have no hierarchical parents themselves
>> >>
>> >> ? ?o properly implemented the memcg-aware unevictable page rescue
>> >> ? ? ?scanner, there were several blatant bugs in there
>> >>
>> >> ? ?o documentation on new public interfaces
>> >>
>> >> Thanks for your input on the first version.
>> >>
>> >> I ran microbenchmarks (sparse file catting, essentially) to stress
>> >> reclaim and LRU operations. ?There is no measurable overhead for
>> >> !CONFIG_MEMCG, memcg disabled during boot, memcg enabled but no
>> >> configured groups, and hard limit reclaim.
>> >>
>> >> I also ran single-threaded kernbenchs in four unlimited memcgs in
>> >> parallel, contained in a hard-limited hierarchical parent that put
>> >> constant pressure on the workload. ?There is no measurable difference
>> >> in runtime, the pgpgin/pgpgout counters, and fairness among memcgs in
>> >> this test compared to an unpatched kernel. ?Needs more evaluation,
>> >> especially with a higher number of memcgs.
>> >>
>> >> The soft limit changes are also proven to work in so far that it is
>> >> possible to prioritize between children in a hierarchy under pressure
>> >> and that runtime differences corresponded directly to the soft limit
>> >> settings in the previously described kernbench setup with staggered
>> >> soft limits on the groups, but this needs quantification.
>> >>
>> >> Based on v2.6.39.
>> >>
>> >
>> > Hmm, I welcome and will review this patches but.....some points I want to say.
>> >
>> > 1. No more conflict with Ying's work ?
>> > ? ?Could you explain what she has and what you don't in this v2 ?
>> > ? ?If Ying's one has something good to be merged to your set, please
>> > include it.
>>
>> My patch I sent out last time was doing rework of soft_limit reclaim.
>> It convert the RB-tree based to
>> a linked list round-robin fashion of all memcgs across their soft
>> limit per-zone.
>>
>> I will apply this patch and try to test it. After that i will get
>> better idea whether or not it is being covered here.
>
> Thanks!!
>
>> > 4. This work can be splitted into some small works.
>> > ? ? a) fix for current code and clean ups
>>
>> > ? ? a') statistics
>>
>> > ? ? b) soft limit rework
>>
>> > ? ? c) change global reclaim
>>
>> My last patchset starts with a patch reverting the RB-tree
>> implementation of the soft_limit
>> reclaim, and then the new round-robin implementation comes on the
>> following patches.
>>
>> I like the ordering here, and that is consistent w/ the plan we
>> discussed earlier in LSF. Changing
>> the global reclaim would be the last step when the changes before that
>> have been well understood
>> and tested.
>>
>> Sorry If that is how it is done here. I will read through the patchset.
>
> It's not.  The way I implemented soft limits depends on global reclaim
> performing hierarchical reclaim.  I don't see how I can reverse the
> order with this dependency.

That is something I don't quite get yet, and I may need a closer look
into the patchset.  The current design of soft_limit doesn't do reclaim
hierarchically but instead links the memcgs together on a per-zone basis.

However, in this patchset we change that design and do a hierarchy walk
of the memcg tree.  Can we clarify more on why we made the design change?
I can see the current design provides an efficient way to pick the one
memcg over its soft limit under shrink_zone().

--Ying

>

2011-06-02 15:54:42

by Hiroyuki Kamezawa

[permalink] [raw]
Subject: Re: [patch 8/8] mm: make per-memcg lru lists exclusive

2011/6/2 Johannes Weiner <[email protected]>:
> On Thu, Jun 02, 2011 at 10:16:59PM +0900, Hiroyuki Kamezawa wrote:
>> 2011/6/1 Johannes Weiner <[email protected]>:
>> > All lru list walkers have been converted to operate on per-memcg
>> > lists, the global per-zone lists are no longer required.
>> >
>> > This patch makes the per-memcg lists exclusive and removes the global
>> > lists from memcg-enabled kernels.
>> >
>> > The per-memcg lists now string up page descriptors directly, which
>> > unifies/simplifies the list isolation code of page reclaim as well as
>> > it saves a full double-linked list head for each page in the system.
>> >
>> > At the core of this change is the introduction of the lruvec
>> > structure, an array of all lru list heads. ?It exists for each zone
>> > globally, and for each zone per memcg. ?All lru list operations are
>> > now done in generic code against lruvecs, with the memcg lru list
>> > primitives only doing accounting and returning the proper lruvec for
>> > the currently scanned memcg on isolation, or for the respective page
>> > on putback.
>> >
>> > Signed-off-by: Johannes Weiner <[email protected]>
>>
>>
>> could you divide this into
>> ? - introduce lruvec
>> ? - don't record section? information into pc->flags because we see
>> "page" on memcg LRU
>> ? ? and there is no requirement to get page from "pc".
>> ? - remove pc->lru completely
>
> Yes, that makes sense. ?It shall be fixed in the next version.
>

BTW, IIUC, transparent hugepage has code that links a page to another
page's page->lru directly, and Minchan's recent work does the same kind
of trick.

But that may put a page onto the wrong memcg's list if we link a page to
another page's page->lru, because the two pages may be in different
cgroups.

Could you check whether there are more places that link a page's
page->lru to a nearby page's page->lru?  I'm not sure there are other
such places... but we need to be careful.

Thanks,
-Kame
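
The pattern being warned about is, schematically, something like the
following (a hypothetical example, not a specific call site in THP or
Minchan's series):

    /*
     * Queueing one page behind another page's list entry.  With
     * exclusive per-memcg lru lists, "page" ends up on whatever lruvec
     * "anchor" currently sits on -- which may belong to a different
     * memcg than the one "page" is charged to.
     */
    static void queue_behind(struct page *anchor, struct page *page)
    {
	    list_add_tail(&page->lru, &anchor->lru);
    }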

2011-06-02 16:14:16

by Hiroyuki Kamezawa

[permalink] [raw]
Subject: Re: [patch 2/8] mm: memcg-aware global reclaim

2011/6/3 Johannes Weiner <[email protected]>:
> On Thu, Jun 02, 2011 at 10:59:01PM +0900, Hiroyuki Kamezawa wrote:
>> 2011/6/1 Johannes Weiner <[email protected]>:

>> > @@ -1927,8 +1980,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
>> > ? ? ? ?if (!(gfp_mask & __GFP_WAIT))
>> > ? ? ? ? ? ? ? ?return CHARGE_WOULDBLOCK;
>> >
>> > - ? ? ? ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
>> > - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_mask, flags);
>> > + ? ? ? ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
>> > ? ? ? ?if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
>> > ? ? ? ? ? ? ? ?return CHARGE_RETRY;
>> > ? ? ? ?/*
>>
>> It seems this clean-up around hierarchy and softlimit can be in an
>> independent patch, no ?
>
> Hm, why do you think it's a cleanup? ?The hierarchical target reclaim
> code is moved to vmscan.c and as a result the entry points for hard
> limit and soft limit reclaim differ. ?This is why the original
> function, mem_cgroup_hierarchical_reclaim() has to be split into two
> parts.
>
If functionality is unchanged, I think it's a clean-up.
I agree with moving the hierarchy walk to vmscan.c, but that can be done
as a clean-up patch against the current code
(make the current try_to_free_mem_cgroup_pages() use this code),
and then you can write a patch which only includes the core
logic/purpose of this patch:
"use the root cgroup's LRU for global reclaim and make global reclaim a
full scan of memcgs."

In short, I felt this patch is long... and maybe the watchers of -mm are
not interested in the rewrite of the hierarchy walk but are very
interested in the changes to shrink_zone() itself.



>> > @@ -1943,6 +1976,31 @@ restart:
>> > ? ? ? ?throttle_vm_writeout(sc->gfp_mask);
>> > ?}
>> >
>> > +static void shrink_zone(int priority, struct zone *zone,
>> > + ? ? ? ? ? ? ? ? ? ? ? struct scan_control *sc)
>> > +{
>> > + ? ? ? unsigned long nr_reclaimed_before = sc->nr_reclaimed;
>> > + ? ? ? struct mem_cgroup *root = sc->target_mem_cgroup;
>> > + ? ? ? struct mem_cgroup *first, *mem = NULL;
>> > +
>> > + ? ? ? first = mem = mem_cgroup_hierarchy_walk(root, mem);
>>
>> Hmm, I think we should add some scheduling here, later.
>> (as select a group over softlimit or select a group which has
>> ?easily reclaimable pages on this zone.)
>>
>> This name as hierarchy_walk() sounds like "full scan in round-robin, always".
>> Could you find better name ?
>
> Okay, I'll try.
>
>> > + ? ? ? for (;;) {
>> > + ? ? ? ? ? ? ? unsigned long nr_reclaimed;
>> > +
>> > + ? ? ? ? ? ? ? sc->mem_cgroup = mem;
>> > + ? ? ? ? ? ? ? do_shrink_zone(priority, zone, sc);
>> > +
>> > + ? ? ? ? ? ? ? nr_reclaimed = sc->nr_reclaimed - nr_reclaimed_before;
>> > + ? ? ? ? ? ? ? if (nr_reclaimed >= sc->nr_to_reclaim)
>> > + ? ? ? ? ? ? ? ? ? ? ? break;
>>
>> what this calculation means ? ?Shouldn't we do this quit based on the
>> number of "scan"
>> rather than "reclaimed" ?
>
> It aborts the loop once sc->nr_to_reclaim pages have been reclaimed
> from that zone during that hierarchy walk, to prevent overreclaim.
>
> If you have unbalanced sizes of memcgs in the system, it is not
> desirable to have every reclaimer scan all memcgs, but let those quit
> early that have made some progress on the bigger memcgs.
>
Hmm, why not if (sc->nr_reclaimed >= sc->nr_to_reclaim) ?

I'm sorry if I'm missing something.


> It's essentially a forward progagation of the same check in
> do_shrink_zone(). ?It trades absolute fairness for average reclaim
> latency.
>
> Note that kswapd sets the reclaim target to infinity, so this
> optimization applies only to direct reclaimers.
>
>> > + ? ? ? ? ? ? ? mem = mem_cgroup_hierarchy_walk(root, mem);
>> > + ? ? ? ? ? ? ? if (mem == first)
>> > + ? ? ? ? ? ? ? ? ? ? ? break;
>>
>> Why we quit loop ??
>
> get_scan_count() for traditional global reclaim returns the scan
> target for the zone.
>
> With this per-memcg reclaimer, get_scan_count() will return scan
> targets for the respective per-memcg zone subsizes.
>
> So once we have gone through all memcgs, we should have scanned the
> amount of pages that global reclaim would have deemed sensible for
> that zone at that priority level.
>
> As such, this is the exit condition based on scan count you referred
> to above.
>
That's what I want as a comment in the code.

Thanks,
-Kame

2011-06-02 17:29:35

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 2/8] mm: memcg-aware global reclaim

On Fri, Jun 03, 2011 at 01:14:12AM +0900, Hiroyuki Kamezawa wrote:
> 2011/6/3 Johannes Weiner <[email protected]>:
> > On Thu, Jun 02, 2011 at 10:59:01PM +0900, Hiroyuki Kamezawa wrote:
> >> 2011/6/1 Johannes Weiner <[email protected]>:
>
> >> > @@ -1927,8 +1980,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
> >> > ? ? ? ?if (!(gfp_mask & __GFP_WAIT))
> >> > ? ? ? ? ? ? ? ?return CHARGE_WOULDBLOCK;
> >> >
> >> > - ? ? ? ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> >> > - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_mask, flags);
> >> > + ? ? ? ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
> >> > ? ? ? ?if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> >> > ? ? ? ? ? ? ? ?return CHARGE_RETRY;
> >> > ? ? ? ?/*
> >>
> >> It seems this clean-up around hierarchy and softlimit can be in an
> >> independent patch, no ?
> >
> > Hm, why do you think it's a cleanup? ?The hierarchical target reclaim
> > code is moved to vmscan.c and as a result the entry points for hard
> > limit and soft limit reclaim differ. ?This is why the original
> > function, mem_cgroup_hierarchical_reclaim() has to be split into two
> > parts.
> >
> If functionality is unchanged, I think it's clean up.
> I agree to move hierarchy walk to vmscan.c. but it can be done as
> a clean up patch for current code.
> (Make current try_to_free_mem_cgroup_pages() to use this code.)
> and then, you can write a patch which only includes a core
> logic/purpose of this patch
> "use root cgroup's LRU for global and make global reclaim as full-scan
> of memcgroup."
>
> In short, I felt this patch is long... and maybe the watchers of -mm are
> not interested in the rewrite of the hierarchy walk but are very
> interested in the changes to shrink_zone() itself.

But the split up is, unfortunately, a change in functionality. The
current code selects one memcg and reclaims all zones on all priority
levels on behalf of that memcg. My code changes that such that it
reclaims a bunch of memcgs from the hierarchy for each zone and
priority level instead. From memcgs -> priorities -> zones to
priorities -> zones -> memcgs.
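
In loop form, that reordering is roughly the following (pseudocode only; the
for_each_* iterators and shrink() are illustrative names, not real
interfaces):

    /* old: pick one memcg, run the full priority/zone cycle for it */
    for_each_selected_memcg(memcg)
	    for (priority = DEF_PRIORITY; priority >= 0; priority--)
		    for_each_zone(zone)
			    shrink(zone, memcg, priority);

    /* new: per priority and zone, walk the memcg hierarchy */
    for (priority = DEF_PRIORITY; priority >= 0; priority--)
	    for_each_zone(zone)
		    for_each_memcg_in_hierarchy(root, memcg)
			    shrink(zone, memcg, priority);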

I don't want to pass that off as a cleanup.

But it is long, I agree with you. I'll split up the 'move
hierarchical target reclaim to generic code' from 'make global reclaim
hierarchical' and see if this makes the changes more straight-forward.

Because I suspect the perceived unwieldiness does not stem from the
amount of lines changed, but from the number of different logical
changes.

> >> > + ? ? ? for (;;) {
> >> > + ? ? ? ? ? ? ? unsigned long nr_reclaimed;
> >> > +
> >> > + ? ? ? ? ? ? ? sc->mem_cgroup = mem;
> >> > + ? ? ? ? ? ? ? do_shrink_zone(priority, zone, sc);
> >> > +
> >> > + ? ? ? ? ? ? ? nr_reclaimed = sc->nr_reclaimed - nr_reclaimed_before;
> >> > + ? ? ? ? ? ? ? if (nr_reclaimed >= sc->nr_to_reclaim)
> >> > + ? ? ? ? ? ? ? ? ? ? ? break;
> >>
> >> what this calculation means ? ?Shouldn't we do this quit based on the
> >> number of "scan"
> >> rather than "reclaimed" ?
> >
> > It aborts the loop once sc->nr_to_reclaim pages have been reclaimed
> > from that zone during that hierarchy walk, to prevent overreclaim.
> >
> > If you have unbalanced sizes of memcgs in the system, it is not
> > desirable to have every reclaimer scan all memcgs, but let those quit
> > early that have made some progress on the bigger memcgs.
> >
> Hmm, why not if (sc->nr_reclaimed >= sc->nr_to_reclaim) ?
>
> I'm sorry if I miss something..

It's a bit awkward and undocumented, I'm afraid. The loop is like
this:

    for each zone:
        for each memcg:
            shrink
            if sc->nr_reclaimed >= sc->nr_to_reclaim:
                break

sc->nr_reclaimed is never reset, so once you reclaimed enough pages
from one zone, you will only try the first memcg in all the other
zones, which might well be empty, so no pressure at all on subsequent
zones.

That's why I use the per-zone delta like this:

    for each zone:
        before = sc->nr_reclaimed
        for each memcg:
            shrink
            if sc->nr_reclaimed - before >= sc->nr_to_reclaim

which still ensures on one hand that we don't keep hammering a zone if
we reclaimed the overall reclaim target already, but on the other hand
that we apply some pressure to the other zones as well.

It's the same concept as in do_shrink_zone(). It breaks the loop when

nr_reclaimed >= sc->nr_to_reclaim

where nr_reclaimed refers to the number of pages reclaimed from the
current zone, not the accumulated total of the whole reclaim cycle.

> >> > + ? ? ? ? ? ? ? mem = mem_cgroup_hierarchy_walk(root, mem);
> >> > + ? ? ? ? ? ? ? if (mem == first)
> >> > + ? ? ? ? ? ? ? ? ? ? ? break;
> >>
> >> Why we quit loop ??
> >
> > get_scan_count() for traditional global reclaim returns the scan
> > target for the zone.
> >
> > With this per-memcg reclaimer, get_scan_count() will return scan
> > targets for the respective per-memcg zone subsizes.
> >
> > So once we have gone through all memcgs, we should have scanned the
> > amount of pages that global reclaim would have deemed sensible for
> > that zone at that priority level.
> >
> > As such, this is the exit condition based on scan count you referred
> > to above.
> >
> That's what I want as a comment in codes.

Will do, for both exit conditions ;-)

2011-06-02 17:52:04

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

On Thu, Jun 02, 2011 at 08:51:39AM -0700, Ying Han wrote:
> On Thu, Jun 2, 2011 at 12:50 AM, Johannes Weiner <[email protected]> wrote:
> > On Wed, Jun 01, 2011 at 09:05:18PM -0700, Ying Han wrote:
> >> On Wed, Jun 1, 2011 at 4:52 PM, Hiroyuki Kamezawa
> >> <[email protected]> wrote:
> >> > 2011/6/1 Johannes Weiner <[email protected]>:
> >> >> Hi,
> >> >>
> >> >> this is the second version of the memcg naturalization series. ?The
> >> >> notable changes since the first submission are:
> >> >>
> >> >> ? ?o the hierarchy walk is now intermittent and will abort and
> >> >> ? ? ?remember the last scanned child after sc->nr_to_reclaim pages
> >> >> ? ? ?have been reclaimed during the walk in one zone (Rik)
> >> >>
> >> >> ? ?o the global lru lists are never scanned when memcg is enabled
> >> >> ? ? ?after #2 'memcg-aware global reclaim', which makes this patch
> >> >> ? ? ?self-sufficient and complete without requiring the per-memcg lru
> >> >> ? ? ?lists to be exclusive (Michal)
> >> >>
> >> >> ? ?o renamed sc->memcg and sc->current_memcg to sc->target_mem_cgroup
> >> >> ? ? ?and sc->mem_cgroup and fixed their documentation, I hope this is
> >> >> ? ? ?better understandable now (Rik)
> >> >>
> >> >> ? ?o the reclaim statistic counters have been renamed. ?there is no
> >> >> ? ? ?more distinction between 'pgfree' and 'pgsteal', it is now
> >> >> ? ? ?'pgreclaim' in both cases; 'kswapd' has been replaced by
> >> >> ? ? ?'background'
> >> >>
> >> >> ? ?o fixed a nasty crash in the hierarchical soft limit check that
> >> >> ? ? ?happened during global reclaim in memcgs that are hierarchical
> >> >> ? ? ?but have no hierarchical parents themselves
> >> >>
> >> >> ? ?o properly implemented the memcg-aware unevictable page rescue
> >> >> ? ? ?scanner, there were several blatant bugs in there
> >> >>
> >> >> ? ?o documentation on new public interfaces
> >> >>
> >> >> Thanks for your input on the first version.
> >> >>
> >> >> I ran microbenchmarks (sparse file catting, essentially) to stress
> >> >> reclaim and LRU operations. ?There is no measurable overhead for
> >> >> !CONFIG_MEMCG, memcg disabled during boot, memcg enabled but no
> >> >> configured groups, and hard limit reclaim.
> >> >>
> >> >> I also ran single-threaded kernbenchs in four unlimited memcgs in
> >> >> parallel, contained in a hard-limited hierarchical parent that put
> >> >> constant pressure on the workload. ?There is no measurable difference
> >> >> in runtime, the pgpgin/pgpgout counters, and fairness among memcgs in
> >> >> this test compared to an unpatched kernel. ?Needs more evaluation,
> >> >> especially with a higher number of memcgs.
> >> >>
> >> >> The soft limit changes are also proven to work in so far that it is
> >> >> possible to prioritize between children in a hierarchy under pressure
> >> >> and that runtime differences corresponded directly to the soft limit
> >> >> settings in the previously described kernbench setup with staggered
> >> >> soft limits on the groups, but this needs quantification.
> >> >>
> >> >> Based on v2.6.39.
> >> >>
> >> >
> >> > Hmm, I welcome and will review this patches but.....some points I want to say.
> >> >
> >> > 1. No more conflict with Ying's work ?
> >> > ? ?Could you explain what she has and what you don't in this v2 ?
> >> > ? ?If Ying's one has something good to be merged to your set, please
> >> > include it.
> >>
> >> My patch I sent out last time was doing rework of soft_limit reclaim.
> >> It convert the RB-tree based to
> >> a linked list round-robin fashion of all memcgs across their soft
> >> limit per-zone.
> >>
> >> I will apply this patch and try to test it. After that i will get
> >> better idea whether or not it is being covered here.
> >
> > Thanks!!
> >
> >> > 4. This work can be splitted into some small works.
> >> > ? ? a) fix for current code and clean ups
> >>
> >> > ? ? a') statistics
> >>
> >> > ? ? b) soft limit rework
> >>
> >> > ? ? c) change global reclaim
> >>
> >> My last patchset starts with a patch reverting the RB-tree
> >> implementation of the soft_limit
> >> reclaim, and then the new round-robin implementation comes on the
> >> following patches.
> >>
> >> I like the ordering here, and that is consistent w/ the plan we
> >> discussed earlier in LSF. Changing
> >> the global reclaim would be the last step when the changes before that
> >> have been well understood
> >> and tested.
> >>
> >> Sorry If that is how it is done here. I will read through the patchset.
> >
> > It's not. ?The way I implemented soft limits depends on global reclaim
> > performing hierarchical reclaim. ?I don't see how I can reverse the
> > order with this dependency.
>
> That is something I don't quite get yet, and maybe need a closer look
> into the patchset. The current design of
> soft_limit doesn't do reclaim hierarchically but instead links the
> memcgs together on per-zone basis.
>
> However on this patchset, we changed that design and doing
> hierarchy_walk of the memcg tree. Can we clarify more on why we made
> the design change? I can see the current design provides a efficient
> way to pick the one memcg over-their-soft-limit under shrink_zone().

The question is whether we even want it to work that way. I outlined
that in the changelog of the soft limit rework patch.

As I see it, the soft limit should not exist solely to punish a memcg,
but to prioritize memcgs in case hierarchical pressure exists. I am
arguing that the focus should be on relieving the pressure, rather
than beating the living crap out of the single-biggest offender. Keep
in mind the scenarios where the biggest offender has a lot of dirty,
hard-to-reclaim pages while there are other, unsoftlimited groups that
have large amounts of easily reclaimable cache of questionable future
value. I believe going only for soft-limit offenders is too extreme,
and going only for the single-biggest one is outright nuts.

The second point, which I already made last time, is that there is no
hierarchy support in the current scheme. If you have a group with
two subgroups, it makes sense to soft limit one subgroup against the
other when the parent hits its limit. This is not possible otherwise.

The third point was that the amount of code to actually support the
questionable behaviour of picking the biggest offender is gigantic
compared to naturally hooking soft limit reclaim into regular reclaim.

The implementation is not yet proven to be satisfactory; I only sent it
out this early, and with this particular series, because I wanted people
to stop merging reclaim statistics that may not even be supportable in
the long run.

I agree with Andrew: we either need to prove it's the way to go, or
prove that we never want to do it like this, before we start adding
statistics that commit us to one way or the other.

2011-06-02 17:57:21

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 8/8] mm: make per-memcg lru lists exclusive

On Fri, Jun 03, 2011 at 12:54:39AM +0900, Hiroyuki Kamezawa wrote:
> 2011/6/2 Johannes Weiner <[email protected]>:
> > On Thu, Jun 02, 2011 at 10:16:59PM +0900, Hiroyuki Kamezawa wrote:
> >> 2011/6/1 Johannes Weiner <[email protected]>:
> >> > All lru list walkers have been converted to operate on per-memcg
> >> > lists, the global per-zone lists are no longer required.
> >> >
> >> > This patch makes the per-memcg lists exclusive and removes the global
> >> > lists from memcg-enabled kernels.
> >> >
> >> > The per-memcg lists now string up page descriptors directly, which
> >> > unifies/simplifies the list isolation code of page reclaim as well as
> >> > it saves a full double-linked list head for each page in the system.
> >> >
> >> > At the core of this change is the introduction of the lruvec
> >> > structure, an array of all lru list heads. It exists for each zone
> >> > globally, and for each zone per memcg. All lru list operations are
> >> > now done in generic code against lruvecs, with the memcg lru list
> >> > primitives only doing accounting and returning the proper lruvec for
> >> > the currently scanned memcg on isolation, or for the respective page
> >> > on putback.
> >> >
> >> > Signed-off-by: Johannes Weiner <[email protected]>
> >>
> >>
> >> could you divide this into
> >>   - introduce lruvec
> >>   - don't record section information into pc->flags because we see
> >> "page" on memcg LRU
> >>     and there is no requirement to get page from "pc".
> >>   - remove pc->lru completely
> >
> > Yes, that makes sense. It shall be fixed in the next version.
> >
>
> BTW, IIUC, Transparent hugepage has a code to link a page to the
> page->lru directly.

[...]

> But it may put a page onto wrong memcgs if we do link a page to
> another page's page->lru
> because 2 pages may be in different cgroup each other.

Yes, I noticed that. If it splits a huge page, it does not just add
the tailpages to the lru head, but it links them next to the head
page.

But I don't see how those pages could ever be in different memcgs?
Pages with page->mapping pointing to the same anon_vma are always in
the same memcg, AFAIU, though that has only been true since broken
COWs get their own anon_vma.

>
> Could you check there are more codes which does link page->lru to nearby page's
> page->lru ? Now, I'm not sure there are other codes....but we need care.

I'll double-check again. It's a tricky area, but I thought I covered
all cases. Never hurts to reassure, though.

2011-06-02 21:02:38

by Ying Han

[permalink] [raw]
Subject: Re: [patch 7/8] vmscan: memcg-aware unevictable page rescue scanner

On Thu, Jun 2, 2011 at 6:27 AM, Hiroyuki Kamezawa
<[email protected]> wrote:
> 2011/6/1 Johannes Weiner <[email protected]>:
>> Once the per-memcg lru lists are exclusive, the unevictable page
>> rescue scanner can no longer work on the global zone lru lists.
>>
>> This converts it to go through all memcgs and scan their respective
>> unevictable lists instead.
>>
>> Signed-off-by: Johannes Weiner <[email protected]>
>
> Hm, isn't it better to have only one GLOBAL LRU for unevictable pages ?
> memcg only needs counter for unevictable pages and LRU is not necessary
> to be per memcg because we don't reclaim it...

Hmm. Are we suggesting keeping one unevictable LRU list for all
memcgs? Then we would have exclusive lrus only for file and anon.
If so, we have not made all the lru lists exclusive, which is
critical later for improving the zone->lru_lock contention across
the memcgs.

Sorry if I misinterpreted the suggestion here.

--Ying


> Thanks,
> -Kame
>

2011-06-02 21:56:12

by Ying Han

[permalink] [raw]
Subject: Re: [patch 4/8] memcg: rework soft limit reclaim

On Tue, May 31, 2011 at 11:25 PM, Johannes Weiner <[email protected]> wrote:
> Currently, soft limit reclaim is entered from kswapd, where it selects
> the memcg with the biggest soft limit excess in absolute bytes, and
> reclaims pages from it with maximum aggressiveness (priority 0).
>
> This has the following disadvantages:
>
>    1. because of the aggressiveness, kswapd can be stalled on a memcg
>    that is hard to reclaim from for a long time, sending the rest of
>    the allocators into direct reclaim in the meantime.
>
>    2. it only considers the biggest offender (in absolute bytes, no
>    less, so very unhandy for setups with different-sized memcgs) and
>    does not apply any pressure at all on other memcgs in excess.
>
>    3. because it is only invoked from kswapd, the soft limit is
>    meaningful during global memory pressure, but it is not taken into
>    account during hierarchical target reclaim where it could allow
>    prioritizing memcgs as well. So while it does hierarchical
>    reclaim once triggered, it is not a truly hierarchical mechanism.
>
> Here is a different approach. Instead of having a soft limit reclaim
> cycle separate from the rest of reclaim, this patch ensures that each
> time a group of memcgs is reclaimed - be it because of global memory
> pressure or because of a hard limit - memcgs that exceed their soft
> limit, or contribute to the soft limit excess of one of their parents,
> are reclaimed from at a higher priority than their siblings.
>
> This results in the following:
>
>    1. all relevant memcgs are scanned with increasing priority during
>    memory pressure. The primary goal is to free pages, not to punish
>    soft limit offenders.
>
>    2. increased pressure is applied to all memcgs in excess of their
>    soft limit, not only the biggest offender.
>
>    3. the soft limit becomes meaningful for target reclaim as well,
>    where it allows prioritizing children of a hierarchy when the
>    parent hits its limit.
>
>    4. direct reclaim now also applies increased soft limit pressure,
>    not just kswapd anymore.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
>  include/linux/memcontrol.h |    7 +++++++
>  mm/memcontrol.c            |   26 ++++++++++++++++++++++++++
>  mm/vmscan.c                |    8 ++++++--
>  3 files changed, 39 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 8f402b9..7d99e87 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -104,6 +104,7 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
>  struct mem_cgroup *mem_cgroup_hierarchy_walk(struct mem_cgroup *,
>                                               struct mem_cgroup *);
>  void mem_cgroup_stop_hierarchy_walk(struct mem_cgroup *, struct mem_cgroup *);
> +bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *, struct mem_cgroup *);
>
>  /*
>   * For memory reclaim.
> @@ -345,6 +346,12 @@ static inline void mem_cgroup_stop_hierarchy_walk(struct mem_cgroup *r,
>  {
>  }
>
> +static inline bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *root,
> +                                                  struct mem_cgroup *mem)
> +{
> +        return false;
> +}
> +
>  static inline void
>  mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 983efe4..94f77cc3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1460,6 +1460,32 @@ void mem_cgroup_stop_hierarchy_walk(struct mem_cgroup *root,
>                 css_put(&mem->css);
>  }
>
> +/**
> + * mem_cgroup_soft_limit_exceeded - check if a memcg (hierarchically)
> + *                                  exceeds a soft limit
> + * @root: highest ancestor of @mem to consider
> + * @mem: memcg to check for excess
> + *
> + * The function indicates whether @mem has exceeded its own soft
> + * limit, or contributes to the soft limit excess of one of its
> + * parents in the hierarchy below @root.
> + */
> +bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *root,
> +                                    struct mem_cgroup *mem)
> +{
> +        for (;;) {
> +                if (mem == root_mem_cgroup)
> +                        return false;
> +                if (res_counter_soft_limit_excess(&mem->res))
> +                        return true;
> +                if (mem == root)
> +                        return false;
> +                mem = parent_mem_cgroup(mem);
> +                if (!mem)
> +                        return false;
> +        }
> +}
> +
>  static unsigned long mem_cgroup_reclaim(struct mem_cgroup *mem,
>                                          gfp_t gfp_mask,
>                                          unsigned long flags)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c7d4b44..0163840 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1988,9 +1988,13 @@ static void shrink_zone(int priority, struct zone *zone,
>                 unsigned long reclaimed = sc->nr_reclaimed;
>                 unsigned long scanned = sc->nr_scanned;
>                 unsigned long nr_reclaimed;
> +               int epriority = priority;
> +
> +               if (mem_cgroup_soft_limit_exceeded(root, mem))
> +                       epriority -= 1;

Here we grant the ability to shrink from all the memcgs, but only
raise the priority for those that exceed the soft_limit. That is a
design change for the "soft_limit", which gives a hint as to which
memcgs to reclaim from first under global memory pressure.

--Ying


>
>                 sc->mem_cgroup = mem;
> -               do_shrink_zone(priority, zone, sc);
> +               do_shrink_zone(epriority, zone, sc);
>                 mem_cgroup_count_reclaim(mem, current_is_kswapd(),
>                                          mem != root, /* limit or hierarchy? */
>                                          sc->nr_scanned - scanned,
> @@ -2480,7 +2484,7 @@ loop_again:
>                         * Call soft limit reclaim before calling shrink_zone.
>                         * For now we ignore the return value
>                         */
> -                       mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask);
> +                       //mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask);
>
>                        /*
>                         * We put equal pressure on every zone, unless
> --
> 1.7.5.2
>
>

2011-06-02 22:01:37

by Hiroyuki Kamezawa

[permalink] [raw]
Subject: Re: [patch 7/8] vmscan: memcg-aware unevictable page rescue scanner

2011/6/3 Ying Han <[email protected]>:
> On Thu, Jun 2, 2011 at 6:27 AM, Hiroyuki Kamezawa
> <[email protected]> wrote:
>> 2011/6/1 Johannes Weiner <[email protected]>:
>>> Once the per-memcg lru lists are exclusive, the unevictable page
>>> rescue scanner can no longer work on the global zone lru lists.
>>>
>>> This converts it to go through all memcgs and scan their respective
>>> unevictable lists instead.
>>>
>>> Signed-off-by: Johannes Weiner <[email protected]>
>>
>> Hm, isn't it better to have only one GLOBAL LRU for unevictable pages ?
>> memcg only needs counter for unevictable pages and LRU is not necessary
>> to be per memcg because we don't reclaim it...
>
> Hmm. Are we suggesting to keep one un-evictable LRU list for all
> memcgs? So we will have
> exclusive lru only for file and anon. If so, we are not done to make
> all the lru list being exclusive
> which is critical later to improve the zone->lru_lock contention
> across the memcgs
>
considering lrulock, yes, maybe you're right.

> Sorry If i misinterpret the suggestion here
>

My concern is I don't know for what purpose this function is used ..


Thanks,
-Kame

2011-06-02 22:19:32

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 7/8] vmscan: memcg-aware unevictable page rescue scanner

On Fri, Jun 03, 2011 at 07:01:34AM +0900, Hiroyuki Kamezawa wrote:
> 2011/6/3 Ying Han <[email protected]>:
> > On Thu, Jun 2, 2011 at 6:27 AM, Hiroyuki Kamezawa
> > <[email protected]> wrote:
> >> 2011/6/1 Johannes Weiner <[email protected]>:
> >>> Once the per-memcg lru lists are exclusive, the unevictable page
> >>> rescue scanner can no longer work on the global zone lru lists.
> >>>
> >>> This converts it to go through all memcgs and scan their respective
> >>> unevictable lists instead.
> >>>
> >>> Signed-off-by: Johannes Weiner <[email protected]>
> >>
> >> Hm, isn't it better to have only one GLOBAL LRU for unevictable pages ?
> >> memcg only needs counter for unevictable pages and LRU is not necessary
> >> to be per memcg because we don't reclaim it...
> >
> > Hmm. Are we suggesting to keep one un-evictable LRU list for all
> > memcgs? So we will have
> > exclusive lru only for file and anon. If so, we are not done to make
> > all the lru list being exclusive
> > which is critical later to improve the zone->lru_lock contention
> > across the memcgs
> >
> considering lrulock, yes, maybe you're right.

That's one of the complications.

> > Sorry If i misinterpret the suggestion here
> >
>
> My concern is I don't know for what purpose this function is used ..

I am not sure how it's supposed to be used, either. But it's
documented to be a 'really big hammer' and it's kicked off from
userspace. So I suppose having the thing go through all memcgs bears
a low risk of being a problem. My suggestion is we go that way until
someone complains.

2011-06-02 23:15:42

by Hiroyuki Kamezawa

[permalink] [raw]
Subject: Re: [patch 7/8] vmscan: memcg-aware unevictable page rescue scanner

2011/6/3 Johannes Weiner <[email protected]>:
> On Fri, Jun 03, 2011 at 07:01:34AM +0900, Hiroyuki Kamezawa wrote:
>> 2011/6/3 Ying Han <[email protected]>:
>> > On Thu, Jun 2, 2011 at 6:27 AM, Hiroyuki Kamezawa
>> > <[email protected]> wrote:
>> >> 2011/6/1 Johannes Weiner <[email protected]>:
>> >>> Once the per-memcg lru lists are exclusive, the unevictable page
>> >>> rescue scanner can no longer work on the global zone lru lists.
>> >>>
>> >>> This converts it to go through all memcgs and scan their respective
>> >>> unevictable lists instead.
>> >>>
>> >>> Signed-off-by: Johannes Weiner <[email protected]>
>> >>
>> >> Hm, isn't it better to have only one GLOBAL LRU for unevictable pages ?
>> >> memcg only needs counter for unevictable pages and LRU is not necessary
>> >> to be per memcg because we don't reclaim it...
>> >
>> > Hmm. Are we suggesting to keep one un-evictable LRU list for all
>> > memcgs? So we will have
>> > exclusive lru only for file and anon. If so, we are not done to make
>> > all the lru list being exclusive
>> > which is critical later to improve the zone->lru_lock contention
>> > across the memcgs
>> >
>> considering lrulock, yes, maybe you're right.
>
> That's one of the complications.
>
>> > Sorry If i misinterpret the suggestion here
>> >
>>
>> My concern is I don't know for what purpose this function is used ..
>
> I am not sure how it's supposed to be used, either. But it's
> documented to be a 'really big hammer' and it's kicked off from
> userspace. So I suppose having the thing go through all memcgs bears
> a low risk of being a problem. My suggestion is we go that way until
> someone complains.

Ok. Please go with memcg local unevictable lru.

-kame

2011-06-03 05:08:29

by Ying Han

[permalink] [raw]
Subject: Re: [patch 7/8] vmscan: memcg-aware unevictable page rescue scanner

On Thu, Jun 2, 2011 at 3:19 PM, Johannes Weiner <[email protected]> wrote:
> On Fri, Jun 03, 2011 at 07:01:34AM +0900, Hiroyuki Kamezawa wrote:
>> 2011/6/3 Ying Han <[email protected]>:
>> > On Thu, Jun 2, 2011 at 6:27 AM, Hiroyuki Kamezawa
>> > <[email protected]> wrote:
>> >> 2011/6/1 Johannes Weiner <[email protected]>:
>> >>> Once the per-memcg lru lists are exclusive, the unevictable page
>> >>> rescue scanner can no longer work on the global zone lru lists.
>> >>>
>> >>> This converts it to go through all memcgs and scan their respective
>> >>> unevictable lists instead.
>> >>>
>> >>> Signed-off-by: Johannes Weiner <[email protected]>
>> >>
>> >> Hm, isn't it better to have only one GLOBAL LRU for unevictable pages ?
>> >> memcg only needs counter for unevictable pages and LRU is not necessary
>> >> to be per memcg because we don't reclaim it...
>> >
>> > Hmm. Are we suggesting to keep one un-evictable LRU list for all
>> > memcgs? So we will have
>> > exclusive lru only for file and anon. If so, we are not done to make
>> > all the lru list being exclusive
>> > which is critical later to improve the zone->lru_lock contention
>> > across the memcgs
>> >
>> considering lrulock, yes, maybe you're right.
>
> That's one of the complications.

That should be achievable if we make all the per-memcg lrus
exclusive. Then we can switch the global zone->lru_lock to a
per-memcg-per-zone lru_lock. We have a prototype patch doing
something like that, but we will wait for this effort to be
discussed and reviewed.

--Ying

>
>> > Sorry If i misinterpret the suggestion here
>> >
>>
>> My concern is I don't know for what purpose this function is used ..
>
> I am not sure how it's supposed to be used, either. But it's
> documented to be a 'really big hammer' and it's kicked off from
> userspace. So I suppose having the thing go through all memcgs bears
> a low risk of being a problem. My suggestion is we go that way until
> someone complains.
>

2011-06-03 05:25:36

by Ying Han

[permalink] [raw]
Subject: Re: [patch 4/8] memcg: rework soft limit reclaim

On Thu, Jun 2, 2011 at 2:55 PM, Ying Han <[email protected]> wrote:
> On Tue, May 31, 2011 at 11:25 PM, Johannes Weiner <[email protected]> wrote:
[...]
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index c7d4b44..0163840 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1988,9 +1988,13 @@ static void shrink_zone(int priority, struct zone *zone,
>>                 unsigned long reclaimed = sc->nr_reclaimed;
>>                 unsigned long scanned = sc->nr_scanned;
>>                 unsigned long nr_reclaimed;
>> +               int epriority = priority;
>> +
>> +               if (mem_cgroup_soft_limit_exceeded(root, mem))
>> +                       epriority -= 1;
>
> Here we grant the ability to shrink from all the memcgs, but only
> higher the priority for those exceed the soft_limit. That is a design
> change
> for the "soft_limit" which giving a hint to which memcgs to reclaim
> from first under global memory pressure.


Basically, we shouldn't reclaim from a memcg under its soft_limit
unless we have trouble reclaim pages from others. Something like the
following makes better sense:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bdc2fd3..b82ba8c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1989,6 +1989,8 @@ restart:
         throttle_vm_writeout(sc->gfp_mask);
 }
 
+#define MEMCG_SOFTLIMIT_RECLAIM_PRIORITY        2
+
 static void shrink_zone(int priority, struct zone *zone,
                         struct scan_control *sc)
 {
@@ -2001,13 +2003,13 @@ static void shrink_zone(int priority, struct zone *zone,
                 unsigned long reclaimed = sc->nr_reclaimed;
                 unsigned long scanned = sc->nr_scanned;
                 unsigned long nr_reclaimed;
-                int epriority = priority;
 
-                if (mem_cgroup_soft_limit_exceeded(root, mem))
-                        epriority -= 1;
+                if (!mem_cgroup_soft_limit_exceeded(root, mem) &&
+                    priority > MEMCG_SOFTLIMIT_RECLAIM_PRIORITY)
+                        continue;
 
                 sc->mem_cgroup = mem;
-                do_shrink_zone(epriority, zone, sc);
+                do_shrink_zone(priority, zone, sc);
                 mem_cgroup_count_reclaim(mem, current_is_kswapd(),
                                          mem != root, /* limit or hierarchy? */
                                          sc->nr_scanned - scanned,

--Ying
>
> --Ying
>
>

2011-06-07 12:25:37

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch 2/8] mm: memcg-aware global reclaim

A few small nitpicks:

> +struct mem_cgroup *mem_cgroup_hierarchy_walk(struct mem_cgroup *root,
> + struct mem_cgroup *prev)
> +{
> + struct mem_cgroup *mem;
> +
> + if (mem_cgroup_disabled())
> + return NULL;
> +
> + if (!root)
> + root = root_mem_cgroup;
> + /*
> + * Even without hierarchy explicitely enabled in the root
> + * memcg, it is the ultimate parent of all memcgs.
> + */
> + if (!(root == root_mem_cgroup || root->use_hierarchy))
> + return root;

The logic here reads a bit weird, why not simply:

/*
* Even without hierarchy explicitely enabled in the root
* memcg, it is the ultimate parent of all memcgs.
*/
if (!root || root == root_mem_cgroup)
return root_mem_cgroup;
if (root->use_hierarchy)
return root;


> /*
> * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
> */
> -static void shrink_zone(int priority, struct zone *zone,
> - struct scan_control *sc)
> +static void do_shrink_zone(int priority, struct zone *zone,
> + struct scan_control *sc)

It actually is the per-memcg shrinker now, and thus should be called
shrink_memcg.

> + sc->mem_cgroup = mem;
> + do_shrink_zone(priority, zone, sc);

And passing the mem_cgroup explicitly instead of hiding it in the
scan_control would make that much more obvious. If there's a good
reason to pass it in the structure, the same probably applies to the
zone and priority, too.

Shouldn't we also have a non-cgroups stub of shrink_zone to directly
call do_shrink_zone/shrink_memcg with a NULL memcg and thus optimize
the whole loop away for it?

2011-06-07 12:42:31

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch 8/8] mm: make per-memcg lru lists exclusive

On Wed, Jun 01, 2011 at 08:25:19AM +0200, Johannes Weiner wrote:
> All lru list walkers have been converted to operate on per-memcg
> lists, the global per-zone lists are no longer required.
>
> This patch makes the per-memcg lists exclusive and removes the global
> lists from memcg-enabled kernels.
>
> The per-memcg lists now string up page descriptors directly, which
> unifies/simplifies the list isolation code of page reclaim as well as
> it saves a full double-linked list head for each page in the system.
>
> At the core of this change is the introduction of the lruvec
> structure, an array of all lru list heads. It exists for each zone
> globally, and for each zone per memcg. All lru list operations are
> now done in generic code against lruvecs, with the memcg lru list
> primitives only doing accounting and returning the proper lruvec for
> the currently scanned memcg on isolation, or for the respective page
> on putback.

Wouldn't it be simpler if we always have a stub mem_cgroup_per_zone
structure even for non-memcg kernels, and always operate on a
single instance per node of those for non-memcg kernels? In effect the
lruvec almost is something like that, just adding another layer of
abstraction.
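
Something along these lines, purely as a sketch (the lruvec field is
borrowed from your description, the rest is an assumption):

        /*
         * Sketch of the idea only, not the posted code: a minimal
         * mem_cgroup_per_zone that exists even on !CONFIG_MEMCG kernels,
         * so generic code always operates on the same structure.
         */
        struct mem_cgroup_per_zone {
                struct lruvec lruvec;   /* the per-lru list heads */
        };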

> static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 8f7d247..43d5d9f 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -25,23 +25,27 @@ static inline void
> __add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
> struct list_head *head)
> {
> + /* NOTE: Caller must ensure @head is on the right lruvec! */
> + mem_cgroup_lru_add_list(zone, page, l);
> list_add(&page->lru, head);
>
> __mod_zone_page_state(zone, NR_LRU_BASE + l, hpage_nr_pages(page));
> - mem_cgroup_add_lru_list(page, l);
> }

This already has been a borderline-useful function before, but with the
new changes it's not a useful helper. Either add the code surrounding
it including the PageLRU check and the normal add_page_to_lru_list
into a new page_update_lru_pos or similar helper, or just opencode these
bits in the only caller with a comment documenting why we are doing it.

I would tend towards the opencoding variant.

2011-06-08 03:53:25

by Ying Han

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

On Thu, Jun 2, 2011 at 10:51 AM, Johannes Weiner <[email protected]> wrote:
>
> On Thu, Jun 02, 2011 at 08:51:39AM -0700, Ying Han wrote:
> > On Thu, Jun 2, 2011 at 12:50 AM, Johannes Weiner <[email protected]> wrote:
> > > On Wed, Jun 01, 2011 at 09:05:18PM -0700, Ying Han wrote:
> > >> On Wed, Jun 1, 2011 at 4:52 PM, Hiroyuki Kamezawa
> > >> <[email protected]> wrote:
> > >> > 2011/6/1 Johannes Weiner <[email protected]>:
[...]
> > >> >
> > >> > Hmm, I welcome and will review this patches but.....some points I want to say.
> > >> >
> > >> > 1. No more conflict with Ying's work ?
> > >> >    Could you explain what she has and what you don't in this v2 ?
> > >> >    If Ying's one has something good to be merged to your set, please
> > >> > include it.
> > >>
> > >> My patch I sent out last time was doing rework of soft_limit reclaim.
> > >> It convert the RB-tree based to
> > >> a linked list round-robin fashion of all memcgs across their soft
> > >> limit per-zone.
> > >>
> > >> I will apply this patch and try to test it. After that i will get
> > >> better idea whether or not it is being covered here.
> > >
> > > Thanks!!
> > >
> > >> > 4. This work can be splitted into some small works.
> > >> >     a) fix for current code and clean ups
> > >>
> > >> >     a') statistics
> > >>
> > >> >     b) soft limit rework
> > >>
> > >> >     c) change global reclaim
> > >>
> > >> My last patchset starts with a patch reverting the RB-tree
> > >> implementation of the soft_limit
> > >> reclaim, and then the new round-robin implementation comes on the
> > >> following patches.
> > >>
> > >> I like the ordering here, and that is consistent w/ the plan we
> > >> discussed earlier in LSF. Changing
> > >> the global reclaim would be the last step when the changes before that
> > >> have been well understood
> > >> and tested.
> > >>
> > >> Sorry If that is how it is done here. I will read through the patchset.
> > >
> > > It's not. The way I implemented soft limits depends on global reclaim
> > > performing hierarchical reclaim. I don't see how I can reverse the
> > > order with this dependency.
> >
> > That is something I don't quite get yet, and maybe need a closer look
> > into the patchset. The current design of
> > soft_limit doesn't do reclaim hierarchically but instead links the
> > memcgs together on per-zone basis.
> >
> > However on this patchset, we changed that design and doing
> > hierarchy_walk of the memcg tree. Can we clarify more on why we made
> > the design change? I can see the current design provides a efficient
> > way to pick the one memcg over-their-soft-limit under shrink_zone().
>
> > The question is whether we even want it to work that way. I outlined
> > that in the changelog of the soft limit rework patch.
> >
> > As I see it, the soft limit should not exist solely to punish a memcg,
> > but to prioritize memcgs in case hierarchical pressure exists. I am
> > arguing that the focus should be on relieving the pressure, rather
> > than beating the living crap out of the single-biggest offender. Keep
> > in mind the scenarios where the biggest offender has a lot of dirty,
> > hard-to-reclaim pages while there are other, unsoftlimited groups that
> > have large amounts of easily reclaimable cache of questionable future
> > value. I believe only going for soft-limit excessors is too extreme,
> > only for the single-biggest one outright nuts.
> >
> > The second point I made last time already is that there is no
> > hierarchy support with that current scheme. If you have a group with
> > two subgroups, it makes sense to soft limit one subgroup against the
> > other when the parent hits its limit. This is not possible otherwise.
>
> The third point was that the amount of code to actually support the
> questionable behaviour of picking the biggest offender is gigantic
> compared to naturally hooking soft limit reclaim into regular reclaim.

Ok, thank you for the detailed clarification. After reading through
the patchset more closely, I do agree that it integrates memcg
reclaim better with the rest of the vm reclaim code. So I don't have
an objection at this point to proceeding in this direction. However,
three of my concerns still remain:

1. Whether or not we introduced extra overhead for each shrink_zone()
under global memory pressure. We used to have quick access to the
memcgs to reclaim from, namely those that have pages charged on the
zone. Now we need to do a hierarchy walk over all memcgs on the
system. This requires more testing, and more data would be helpful.

2. The way we treat the per-memcg soft_limit is changed in this patch.
Same comment as I made on the other patch: we shouldn't change the
definition of a user API (soft_limit_in_bytes in this case). So I
attached a patch to fix that, where we only reclaim from the ones
under their soft_limit once the reclaim priority has dropped to a
certain level. Please consider.

3. Please break this patchset into different patchsets. One way to
break it could be:

a) code which is less relevant to this effort and should be merged
first early regardless
b) code added in vm reclaim supporting the following changes
c) rework soft limit reclaim
d) make per-memcg lru lists exclusive

I should have the patch posted soon which breaks the zone->lru lock
for memcg reclaim. That patch should come after everything listed
above.

Thanks
--Ying
>
> The implementation is not proven to be satisfactory, I only sent it
> out so early and with this particular series because I wanted people
> to stop merging reclaim statistics that may not even be supportable in
> the long run.
>
> I agree with Andrew: we either need to prove it's the way to go, or
> prove that we never want to do it like this. ?Before we start adding
> statistics that commit us to one way or the other.
>

2011-06-08 08:54:43

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 8/8] mm: make per-memcg lru lists exclusive

On Tue, Jun 07, 2011 at 08:42:13AM -0400, Christoph Hellwig wrote:
> On Wed, Jun 01, 2011 at 08:25:19AM +0200, Johannes Weiner wrote:
> > All lru list walkers have been converted to operate on per-memcg
> > lists, the global per-zone lists are no longer required.
> >
> > This patch makes the per-memcg lists exclusive and removes the global
> > lists from memcg-enabled kernels.
> >
> > The per-memcg lists now string up page descriptors directly, which
> > unifies/simplifies the list isolation code of page reclaim as well as
> > it saves a full double-linked list head for each page in the system.
> >
> > At the core of this change is the introduction of the lruvec
> > structure, an array of all lru list heads. It exists for each zone
> > globally, and for each zone per memcg. All lru list operations are
> > now done in generic code against lruvecs, with the memcg lru list
> > primitives only doing accounting and returning the proper lruvec for
> > the currently scanned memcg on isolation, or for the respective page
> > on putback.
>
> Wouldn't it be simpler if we always have a stub mem_cgroup_per_zone
> structure even for non-memcg kernels, and always operate on a
> single instance per node of those for non-memcg kernels? In effect the
> lruvec almost is something like that, just adding another layer of
> abstraction.

I assume you meant 'single instance per zone'; the lruvec is this. It
exists per zone and per mem_cgroup_per_zone so there is no difference
between memcg kernels and non-memcg ones in generic code. But maybe
you really meant 'node' and I just don't get it? Care to elaborate a
bit more?

> > static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
> > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> > index 8f7d247..43d5d9f 100644
> > --- a/include/linux/mm_inline.h
> > +++ b/include/linux/mm_inline.h
> > @@ -25,23 +25,27 @@ static inline void
> > __add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
> > struct list_head *head)
> > {
> > + /* NOTE: Caller must ensure @head is on the right lruvec! */
> > + mem_cgroup_lru_add_list(zone, page, l);
> > list_add(&page->lru, head);
> >
> > __mod_zone_page_state(zone, NR_LRU_BASE + l, hpage_nr_pages(page));
> > - mem_cgroup_add_lru_list(page, l);
> > }
>
> This already has been a borderline-useful function before, but with the
> new changes it's not a useful helper. Either add the code surrounding
> it includeing the PageLRU check and the normal add_page_to_lru_list
> into a new page_update_lru_pos or similar helper, or just opencode these
> bits in the only caller with a comment documenting why we are doing it.
>
> I would tend towards the opencoding variant.

It's only one user, I'll opencode it. That also makes for a nice
opportunity to document at the current callsite why the lruvec is
guaranteed to be the right one.
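
Roughly in this direction, as a sketch only (the caller name and the
comment wording are assumptions, not the final patch):

        /*
         * Sketch: the helper opencoded at its single caller.  @head and
         * @tail are charged to the same memcg, so the head's list position
         * is guaranteed to be on the right lruvec.
         */
        static void add_tail_to_lru_sketch(struct zone *zone, struct page *head,
                                           struct page *tail, enum lru_list lru)
        {
                mem_cgroup_lru_add_list(zone, tail, lru);
                list_add(&tail->lru, &head->lru);
                __mod_zone_page_state(zone, NR_LRU_BASE + lru,
                                      hpage_nr_pages(tail));
        }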

2011-06-08 09:31:11

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 2/8] mm: memcg-aware global reclaim

On Tue, Jun 07, 2011 at 08:25:19AM -0400, Christoph Hellwig wrote:
> A few small nitpicks:
>
> > +struct mem_cgroup *mem_cgroup_hierarchy_walk(struct mem_cgroup *root,
> > + struct mem_cgroup *prev)
> > +{
> > + struct mem_cgroup *mem;
> > +
> > + if (mem_cgroup_disabled())
> > + return NULL;
> > +
> > + if (!root)
> > + root = root_mem_cgroup;
> > + /*
> > + * Even without hierarchy explicitely enabled in the root
> > + * memcg, it is the ultimate parent of all memcgs.
> > + */
> > + if (!(root == root_mem_cgroup || root->use_hierarchy))
> > + return root;
>
> The logic here reads a bit weird, why not simply:
>
> /*
> * Even without hierarchy explicitely enabled in the root
> * memcg, it is the ultimate parent of all memcgs.
> */
> if (!root || root == root_mem_cgroup)
> return root_mem_cgroup;
> if (root->use_hierarchy)
> return root;

What you are proposing is not equivalent, so... case in point! It's
meant to do the hierarchy walk for when foo->use_hierarchy, obviously,
but ALSO for root_mem_cgroup, which is parent to everyone else even
without use_hierarchy set. I changed it to read like this:

if (!root)
root = root_mem_cgroup;
if (!root->use_hierarchy && root != root_mem_cgroup)
return root;
/* actually iterate hierarchy */

Does that make more sense?

Another alternative would be

if (root->use_hierarchy || root == root_mem_cgroup) {
/* most of the function body */
}

but that quickly ends up with ugly linewraps...

> > /*
> > * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
> > */
> > -static void shrink_zone(int priority, struct zone *zone,
> > - struct scan_control *sc)
> > +static void do_shrink_zone(int priority, struct zone *zone,
> > + struct scan_control *sc)
>
> It actually is the per-memcg shrinker now, and thus should be called
> shrink_memcg.

Per-zone per-memcg, actually. shrink_zone_memcg?

> > + sc->mem_cgroup = mem;
> > + do_shrink_zone(priority, zone, sc);
>
> Any passing the mem_cgroup explicitly instead of hiding it in the
> scan_control would make that much more obvious. If there's a good
> reason to pass it in the structure the same probably applies to the
> zone and priority, too.

Stack frame size, I guess. But unreadable code can't be the answer to
this problem. I'll try to pass it explicitly and see what the damage
is.
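
For instance something like this, sketch only (the name and parameter
order are just assumptions to see how it reads):

        static void shrink_zone_memcg(int priority, struct zone *zone,
                                      struct mem_cgroup *mem,
                                      struct scan_control *sc)
        {
                /* do_shrink_zone() still reads it from the scan_control */
                sc->mem_cgroup = mem;
                do_shrink_zone(priority, zone, sc);
        }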

> Shouldn't we also have a non-cgroups stub of shrink_zone to directly
> call do_shrink_zone/shrink_memcg with a NULL memcg and thus optimize
> the whole loop away for it?

On !CONFIG_MEMCG, the code in shrink_zone() looks effectively like
this:

first = mem = NULL;
for (;;) {
sc->mem_cgroup = mem;
do_shrink_zone()
if (reclaimed enough)
break;
mem = NULL;
if (first == mem)
break;
}

I have gcc version 4.6.0 20110530 (Red Hat 4.6.0-9) (GCC) on this
machine, and it manages to optimize the loop away completely.

The only increase in code size I could see was from all callers having
to do the extra sc->mem_cgroup = NULL. But I guess there is no way
around this.
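
The stub you suggest would look about like this (sketch, assuming the
config symbol of that era; not part of the posted series, since gcc
already removes the loop):

        #ifndef CONFIG_CGROUP_MEM_RES_CTLR
        static void shrink_zone(int priority, struct zone *zone,
                                struct scan_control *sc)
        {
                sc->mem_cgroup = NULL;
                do_shrink_zone(priority, zone, sc);
        }
        #endif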

2011-06-08 15:05:00

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 8/8] mm: make per-memcg lru lists exclusive

On Thu 02-06-11 19:57:02, Johannes Weiner wrote:
> On Fri, Jun 03, 2011 at 12:54:39AM +0900, Hiroyuki Kamezawa wrote:
> > 2011/6/2 Johannes Weiner <[email protected]>:
> > > On Thu, Jun 02, 2011 at 10:16:59PM +0900, Hiroyuki Kamezawa wrote:
[...]
>
> > But it may put a page onto wrong memcgs if we do link a page to
> > another page's page->lru
> > because 2 pages may be in different cgroup each other.
>
> Yes, I noticed that. If it splits a huge page, it does not just add
> the tailpages to the lru head, but it links them next to the head
> page.
>
> But I don't see how those pages could ever be in different memcgs?
> pages with page->mapping pointing to the same anon_vma are always in
> the same memcg, AFAIU.

A process can be moved to another memcg, and without
move_charge_at_immigrate all previously faulted pages stay in the
original group, while all new (not yet faulted) pages get charged to
the new group even though the mapping doesn't change. I guess this
might happen with thp tail pages as well. But I do not think this is
a problem. The original group already got charged for the huge page,
so we can keep all tail pages in it.

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-08 15:32:45

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

On Tue, Jun 07, 2011 at 08:53:21PM -0700, Ying Han wrote:
> On Thu, Jun 2, 2011 at 10:51 AM, Johannes Weiner <[email protected]> wrote:
> >
> > On Thu, Jun 02, 2011 at 08:51:39AM -0700, Ying Han wrote:
> > > However on this patchset, we changed that design and doing
> > > hierarchy_walk of the memcg tree. Can we clarify more on why we made
> > > the design change? I can see the current design provides a efficient
> > > way to pick the one memcg over-their-soft-limit under shrink_zone().
> >
[...]
>
> Ok, thank you for detailed clarification. After reading through the
> patchset more closely, I do agree that it makes
> better integration of memcg reclaim to the other part of vm reclaim
> code. So I don't have objection at this point to
> proceed w/ this direction. However, three of my concerns still remains:
>
> 1. Whether or not we introduced extra overhead for each shrink_zone()
> under global memory pressure. We used to have quick
> access of memcgs to reclaim from who has pages charged on the zone.
> Now we need to do hierarchy_walk for all memcgs on the system. This
> requires more testing and more data results would be helpful

That's a nice description for "we went ahead and reclaimed pages from
a zone without any regard for memory control groups" ;-)

But OTOH I agree with you of course, we may well have to visit a
number of memcgs before finding any that have memory allocated from
the zone we are trying to reclaim from.

> 2. The way we treat the per-memcg soft_limit is changed in this patch.
> The same comment I made on the following patch where we shouldn't
> change the definition of user API (soft_limit_in_bytes in this case).
> So I attached the patch to fix that where we should only go to the
> ones under their soft_limit above certain reclaim priority. Please
> consider.

Here is your proposal from the other mail:

: Basically, we shouldn't reclaim from a memcg under its soft_limit
: unless we have trouble reclaim pages from others. Something like the
: following makes better sense:
:
: diff --git a/mm/vmscan.c b/mm/vmscan.c
: index bdc2fd3..b82ba8c 100644
: --- a/mm/vmscan.c
: +++ b/mm/vmscan.c
: @@ -1989,6 +1989,8 @@ restart:
: throttle_vm_writeout(sc->gfp_mask);
: }
:
: +#define MEMCG_SOFTLIMIT_RECLAIM_PRIORITY 2
: +
: static void shrink_zone(int priority, struct zone *zone,
: struct scan_control *sc)
: {
: @@ -2001,13 +2003,13 @@ static void shrink_zone(int priority, struct zone *zone,
: unsigned long reclaimed = sc->nr_reclaimed;
: unsigned long scanned = sc->nr_scanned;
: unsigned long nr_reclaimed;
: - int epriority = priority;
:
: - if (mem_cgroup_soft_limit_exceeded(root, mem))
: - epriority -= 1;
: + if (!mem_cgroup_soft_limit_exceeded(root, mem) &&
: + priority > MEMCG_SOFTLIMIT_RECLAIM_PRIORITY)
: + continue;

I am not sure if you are serious or playing devil's advocate here,
because it exacerbates the problem you are concerned about in 1. by
orders of magnitude.

Starting priority is 12. If you have no groups over soft limit, you
iterate the whole hierarchy 10 times (priorities 12 down to 3, given
the proposed cutoff of 2) before you even begin to think of
reclaiming something.

I guess it would make much more sense to evaluate if reclaiming from
memcgs while there are others exceeding their soft limit is even a
problem. Otherwise this discussion is pretty pointless.

> 3. Please break this patchset into different patchsets. One way to
> break it could be:

Yes, that makes a ton of sense. Kame suggested the same thing; there
are too many goals in this series.

> a) code which is less relevant to this effort and should be merged
> first early regardless
> b) code added in vm reclaim supporting the following changes
> c) rework soft limit reclaim

I dropped that for now..

> d) make per-memcg lru lists exclusive

..and focus on this one instead.

> I should have the patch posted soon which breaks the zone->lru lock
> for memcg reclaim. That patch should come after everything listed
> above.

Yeah, the lru lock fits perfectly into struct lruvec.
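
As a sketch of what that could look like (the field layout is an
assumption, not the posted series):

        struct lruvec {
                spinlock_t              lru_lock;  /* would replace zone->lru_lock */
                struct list_head        lists[NR_LRU_LISTS];
        };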

2011-06-09 01:14:06

by Rik van Riel

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

On 06/01/2011 08:35 PM, Greg Thelen wrote:
> On Wed, Jun 1, 2011 at 4:52 PM, Hiroyuki Kamezawa
> <[email protected]> wrote:

>> 1. No more conflict with Ying's work ?
>> Could you explain what she has and what you don't in this v2 ?
>> If Ying's one has something good to be merged to your set, please
>> include it.
>>
>> 2. it's required to see performance score in commit log.
>>
>> 3. I think dirty_ratio as 1st big patch to be merged. (But...hmm..Greg ?
>> My patches for asynchronous reclaim is not very important. I can rework it.
>
> I am testing the next version (v8) of the memcg dirty ratio patches. I expect
> to have it posted for review later this week.

Sounds like you guys might need a common git tree to
cooperate on memcg work, and not step on each other's
toes quite as often :)

A git tree has the added benefit of not continuously
trying to throw out each other's work, but building
on top of it.

--
All rights reversed

2011-06-09 01:16:10

by Rik van Riel

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

On 06/02/2011 08:59 AM, Hiroyuki Kamezawa wrote:

> It seems your current series is a mixture of 2 works as
> "re-desgin of softlimit" and "removal of global LRU".
> I don't understand why you need 2 works at once.

That seems pretty obvious.

With the global LRU gone, the only way to reclaim
pages in a global fashion (because the zone is low
on memory), is to reclaim from all the memcgs in
the zone.

Doing that requires that the softlimit stuff is
changed, and not only the biggest offender is
attacked.

--
All rights reversed

2011-06-09 03:52:15

by Ying Han

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

On Wed, Jun 8, 2011 at 8:32 AM, Johannes Weiner <[email protected]> wrote:
> On Tue, Jun 07, 2011 at 08:53:21PM -0700, Ying Han wrote:
>> On Thu, Jun 2, 2011 at 10:51 AM, Johannes Weiner <[email protected]> wrote:
>> >
>> > On Thu, Jun 02, 2011 at 08:51:39AM -0700, Ying Han wrote:
>> > > However on this patchset, we changed that design and doing
>> > > hierarchy_walk of the memcg tree. Can we clarify more on why we made
>> > > the design change? I can see the current design provides a efficient
>> > > way to pick the one memcg over-their-soft-limit under shrink_zone().
>> >
>> > The question is whether we even want it to work that way. I outlined
>> > that in the changelog of the soft limit rework patch.
>> >
[...]
>>
>> Ok, thank you for detailed clarification. After reading through the
>> patchset more closely, I do agree that it makes
>> better integration of memcg reclaim to the other part of vm reclaim
>> code. So I don't have objection at this point to
>> proceed w/ this direction. However, three of my concerns still remains:
>>
>> 1. Whether or not we introduced extra overhead for each shrink_zone()
>> under global memory pressure. We used to have quick
>> access of memcgs to reclaim from who has pages charged on the zone.
>> Now we need to do hierarchy_walk for all memcgs on the system. This
>> requires more testing and more data results would be helpful
>
> That's a nice description for "we went ahead and reclaimed pages from
> a zone without any regard for memory control groups" ;-)
>
> But OTOH I agree with you of course, we may well have to visit a
> number of memcgs before finding any that have memory allocated from
> the zone we are trying to reclaim from.
>
>> 2. The way we treat the per-memcg soft_limit is changed in this patch.
>> It is the same comment I made on the following patch: we shouldn't
>> change the definition of a user API (soft_limit_in_bytes in this case).
>> So I attached a patch to fix that, where we only go to the ones under
>> their soft_limit above a certain reclaim priority. Please consider.
>
> Here is your proposal from the other mail:
>
> : Basically, we shouldn't reclaim from a memcg under its soft_limit
> : unless we have trouble reclaiming pages from others. Something like the
> : following makes better sense:
> :
> : diff --git a/mm/vmscan.c b/mm/vmscan.c
> : index bdc2fd3..b82ba8c 100644
> : --- a/mm/vmscan.c
> : +++ b/mm/vmscan.c
> : @@ -1989,6 +1989,8 @@ restart:
> :         throttle_vm_writeout(sc->gfp_mask);
> :  }
> :
> : +#define MEMCG_SOFTLIMIT_RECLAIM_PRIORITY       2
> : +
> :  static void shrink_zone(int priority, struct zone *zone,
> :                                 struct scan_control *sc)
> :  {
> : @@ -2001,13 +2003,13 @@ static void shrink_zone(int priority, struct zone *zone,
> :                 unsigned long reclaimed = sc->nr_reclaimed;
> :                 unsigned long scanned = sc->nr_scanned;
> :                 unsigned long nr_reclaimed;
> : -               int epriority = priority;
> :
> : -               if (mem_cgroup_soft_limit_exceeded(root, mem))
> : -                       epriority -= 1;
> : +               if (!mem_cgroup_soft_limit_exceeded(root, mem) &&
> : +                               priority > MEMCG_SOFTLIMIT_RECLAIM_PRIORITY)
> : +                       continue;
>
> I am not sure if you are serious or playing devil's advocate here,
> because it exacerbates the problem you are concerned about in 1. by
> orders of magnitude.

No, the two are different issues. The first one is a performance
concern of detailed implementation, while the second one is a design
concern. On the second, I would like us not to change the kernel API
spec (soft_limit) in this case :)

> Starting priority is 12. If you have no groups over soft limit, you
> iterate the whole hierarchy 10 times before you even begin to think of
> reclaiming something.

I agree that the patch I posted might make the performance issue even
worse, which we need to look into next. But I just wanted to demonstrate
my understanding of reclaim based on soft_limit.

>
> I guess it would make much more sense to evaluate if reclaiming from
> memcgs while there are others exceeding their soft limit is even a
> problem. Otherwise this discussion is pretty pointless.

AFAIK it is a problem since it changes the spec of the kernel API
memory.soft_limit_in_bytes. That value is set per-memcg; all the
pages allocated above it are best effort and targeted for reclaim
prior to others.

>
>> 3. Please break this patchset into different patchsets. One way to
>> break it could be:
>
> Yes, that makes a ton of sense. Kame suggested the same thing, there
> are too many goals in this series.
>
>> a) code which is less relevant to this effort and should be merged
>> first early regardless
>> b) code added in vm reclaim supporting the following changes
>> c) rework soft limit reclaim
>
> I dropped that for now..

Ok, we can make the soft_limit reclaim into a separate patch, including
cleaning up the current implementation. I can probably pick that up
after we agree on how to do global reclaim based on soft_limit (last
comment).

>
>> d) make per-memcg lru lists exclusive
>
> ..and focus on this one instead.

>
>> I should have the patch posted soon which breaks the zone->lru lock
>> for memcg reclaim. That patch should come after everything listed
>> above.
>
> Yeah, the lru lock fits perfectly into struct lruvec.
>
I posted that patch today, please take a look when you get a chance.
That patch relies on d), so I will wait until the previous patches are
merged.

--Ying

2011-06-09 08:35:37

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

On Wed, Jun 08, 2011 at 08:52:03PM -0700, Ying Han wrote:
> On Wed, Jun 8, 2011 at 8:32 AM, Johannes Weiner <[email protected]> wrote:
> > On Tue, Jun 07, 2011 at 08:53:21PM -0700, Ying Han wrote:
> >> 2. The way we treat the per-memcg soft_limit is changed in this patch.
> >> It is the same comment I made on the following patch: we shouldn't
> >> change the definition of a user API (soft_limit_in_bytes in this case).
> >> So I attached a patch to fix that, where we only go to the ones under
> >> their soft_limit above a certain reclaim priority. Please consider.
> >
> > Here is your proposal from the other mail:
> >
> > : Basically, we shouldn't reclaim from a memcg under its soft_limit
> > : unless we have trouble reclaiming pages from others. Something like the
> > : following makes better sense:
> > :
> > : diff --git a/mm/vmscan.c b/mm/vmscan.c
> > : index bdc2fd3..b82ba8c 100644
> > : --- a/mm/vmscan.c
> > : +++ b/mm/vmscan.c
> > : @@ -1989,6 +1989,8 @@ restart:
> > :         throttle_vm_writeout(sc->gfp_mask);
> > :  }
> > :
> > : +#define MEMCG_SOFTLIMIT_RECLAIM_PRIORITY       2
> > : +
> > :  static void shrink_zone(int priority, struct zone *zone,
> > :                                 struct scan_control *sc)
> > :  {
> > : @@ -2001,13 +2003,13 @@ static void shrink_zone(int priority, struct zone *zone,
> > :                 unsigned long reclaimed = sc->nr_reclaimed;
> > :                 unsigned long scanned = sc->nr_scanned;
> > :                 unsigned long nr_reclaimed;
> > : -               int epriority = priority;
> > :
> > : -               if (mem_cgroup_soft_limit_exceeded(root, mem))
> > : -                       epriority -= 1;
> > : +               if (!mem_cgroup_soft_limit_exceeded(root, mem) &&
> > : +                               priority > MEMCG_SOFTLIMIT_RECLAIM_PRIORITY)
> > : +                       continue;
> >
> > I am not sure if you are serious or playing devil's advocate here,
> > because it exacerbates the problem you are concerned about in 1. by
> > orders of magnitude.
>
> No, the two are different issues. The first one is a performance
> concern of detailed implementation, while the second one is a design
> concern.

Got ya.

> > I guess it would make much more sense to evaluate if reclaiming from
> > memcgs while there are others exceeding their soft limit is even a
> > problem. Otherwise this discussion is pretty pointless.
>
> AFAIK it is a problem since it changes the spec of the kernel API
> memory.soft_limit_in_bytes. That value is set per-memcg; all the
> pages allocated above it are best effort and targeted for reclaim
> prior to others.

That's not really true. Quoting the documentation:

When the system detects memory contention or low memory, control groups
are pushed back to their soft limits. If the soft limit of each control
group is very high, they are pushed back as much as possible to make
sure that one control group does not starve the others of memory.

I am language lawyering here, but I don't think it says it won't touch
other memcgs at all while there are memcgs exceeding their soft limit.

It would be a lie about the current code in the first place, which
does soft limit reclaim and then regular reclaim, no matter the
outcome of the soft limit reclaim cycle. It will go for the soft
limit first, but after an allocation under pressure the VM is likely
to have reclaimed from other memcgs as well.

I saw your patch to fix that and break out of reclaim if soft limit
reclaim did enough. But this fix is not much newer than my changes.

The second part of this is:

Please note that soft limits is a best effort feature, it comes with
no guarantees, but it does its best to make sure that when memory is
heavily contended for, memory is allocated based on the soft limit
hints/setup. Currently soft limit based reclaim is setup such that
it gets invoked from balance_pgdat (kswapd).

It's not the pages-over-soft-limit that are best effort. It says that
it tries its best to take soft limits into account while reclaiming.

My code does that, so I don't think we are breaking any promises
currently made in the documentation.

But much more important than keeping documentation promises is not to
break actual users. So if you are yourself a user of soft limits,
test the new code pretty please and complain if it breaks your setup!

2011-06-09 08:43:21

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

On Wed, Jun 08, 2011 at 09:15:46PM -0400, Rik van Riel wrote:
> On 06/02/2011 08:59 AM, Hiroyuki Kamezawa wrote:
>
> >It seems your current series is a mixture of 2 works as
> >"re-desgin of softlimit" and "removal of global LRU".
> >I don't understand why you need 2 works at once.
>
> That seems pretty obvious.
>
> With the global LRU gone, the only way to reclaim
> pages in a global fashion (because the zone is low
> on memory), is to reclaim from all the memcgs in
> the zone.

That is correct.

> Doing that requires that the softlimit stuff is
> changed, and not only the biggest offender is
> attacked.

I think it's much more natural to do it that way, but it's not a
requirement as such. We could just keep the extra soft limit reclaim
invocation in kswapd that looks for the biggest offender and the
hierarchy below it, then does a direct call to do_shrink_zone() to
bypass the generic hierarchy walk.

It's not very nice to have that kind of code duplication, but it's
possible to leave it like that for now.

2011-06-09 09:23:40

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch 8/8] mm: make per-memcg lru lists exclusive

On Wed, Jun 08, 2011 at 10:54:00AM +0200, Johannes Weiner wrote:
> > Wouldn't it be simpler if we always have a stub mem_cgroup_per_zone
> > structure even for non-memcg kernels, and always operate on a
> > single instance per node of those for non-memcg kernels? In effect the
> > lruvec almost is something like that, just adding another layer of
> > abstraction.
>
> I assume you meant 'single instance per zone'; the lruvec is this.

Yes, sorry.

> It
> exists per zone and per mem_cgroup_per_zone so there is no difference
> between memcg kernels and non-memcg ones in generic code. But maybe
> you really meant 'node' and I just don't get it? Care to elaborate a
> bit more?

My suggestion was to not bother with adding the new lruvec concept,
but make sure we always have struct mem_cgroup_per_zone around even
for non-memcg kernels, thus making the code even more similar whether
cgroups are used or not, and avoiding keeping the superfluous lruvec
in the zone around for the cgroup case. Basically always keeping
a minimal stub memcg infrastructure around.

This is really just off the top of my head, so it might not actually
be feasible, but it's similar to how we do things elsewhere in the
kernel.
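
To illustrate the direction (a rough, hypothetical sketch only; the
fields are guesses based on today's mem_cgroup_per_zone, not code from
this series):

/* minimal stub that would exist even for !CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup_per_zone {
        struct list_head         lists[NR_LRU_LISTS];  /* per-LRU lists */
        unsigned long            count[NR_LRU_LISTS];  /* pages per list */
        struct zone_reclaim_stat reclaim_stat;
};

Without memcg there would be exactly one static instance per zone, so
generic reclaim code always operates on the same structure and never
needs to special-case the cgroup setup.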

2011-06-09 09:26:32

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch 2/8] mm: memcg-aware global reclaim

On Wed, Jun 08, 2011 at 11:30:46AM +0200, Johannes Weiner wrote:
> On Tue, Jun 07, 2011 at 08:25:19AM -0400, Christoph Hellwig wrote:
> > A few small nitpicks:
> >
> > > +struct mem_cgroup *mem_cgroup_hierarchy_walk(struct mem_cgroup *root,
> > > + struct mem_cgroup *prev)
> > > +{
> > > + struct mem_cgroup *mem;
> > > +
> > > + if (mem_cgroup_disabled())
> > > + return NULL;
> > > +
> > > + if (!root)
> > > + root = root_mem_cgroup;
> > > + /*
> > > + * Even without hierarchy explicitely enabled in the root
> > > + * memcg, it is the ultimate parent of all memcgs.
> > > + */
> > > + if (!(root == root_mem_cgroup || root->use_hierarchy))
> > > + return root;
> >
> > The logic here reads a bit weird, why not simply:
> >
> > /*
> > * Even without hierarchy explicitely enabled in the root
> > * memcg, it is the ultimate parent of all memcgs.
> > */
> > if (!root || root == root_mem_cgroup)
> > return root_mem_cgroup;
> > if (root->use_hierarchy)
> > return root;
>
> What you are proposing is not equivalent, so... case in point! It's
> meant to do the hierarchy walk for when foo->use_hierarchy, obviously,
> but ALSO for root_mem_cgroup, which is parent to everyone else even
> without use_hierarchy set. I changed it to read like this:
>
> if (!root)
> root = root_mem_cgroup;
> if (!root->use_hierarchy && root != root_mem_cgroup)
> return root;
> /* actually iterate hierarchy */
>
> Does that make more sense?

It does, sorry for misparsing it. The thing that I really hated was
the conditional assignment of root. Can we clean this up somehow
by making the caller pass root_mem_cgroup in the case where it
passes root right now, or at least always pass NULL when it means
root_mem_cgroup?

Not really that important in the end, it just irked me when I looked
over it, especially the conditional assignment of root to root_mem_cgroup,
and then a little later checking for the equality of the two.

Thinking about it it's probably better left as-is for now to not
complicate the series, and maybe revisit it later once things have
settled a bit.

> > It actually is the per-memcg shrinker now, and thus should be called
> > shrink_memcg.
>
> Per-zone per-memcg, actually. shrink_zone_memcg?

Sounds fine to me.

> I have gcc version 4.6.0 20110530 (Red Hat 4.6.0-9) (GCC) on this
> machine, and it manages to optimize the loop away completely.

Ok, good enough.

2011-06-09 09:31:48

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

On Thu, Jun 09, 2011 at 10:43:00AM +0200, Johannes Weiner wrote:
> I think it's much more natural to do it that way, but it's not a
> requirement as such. We could just keep the extra soft limit reclaim
> invocation in kswapd that looks for the biggest offender and the
> hierarchy below it, then does a direct call to do_shrink_zone() to
> bypass the generic hierarchy walk.
>
> It's not very nice to have that kind of code duplication, but it's
> possible to leave it like that for now.

Unless there is a really good reason please kill it. It just means more
codepathes that eat away tons of stack in the reclaim path, and we
already have far too much of those, and more code that needs fixing for
all the reclaim issues we have. Nevermind that the cgroups code
generally gets a lot less testing, so the QA overhead is also much
worse.

2011-06-09 13:12:10

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 2/8] mm: memcg-aware global reclaim

On Wed 01-06-11 08:25:13, Johannes Weiner wrote:
[...]

Just a minor thing. I am really slow at reviewing these days due to
other work that has to be done...

> +struct mem_cgroup *mem_cgroup_hierarchy_walk(struct mem_cgroup *root,
> + struct mem_cgroup *prev)
> +{
> + struct mem_cgroup *mem;

You want mem = NULL here because you might end up using it uninitialized
AFAICS (css_get_next can return NULL).
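
I.e., presumably just:

        struct mem_cgroup *mem = NULL;

so that the while (!mem) loop condition never reads an uninitialized
value when css_get_next() comes back empty.
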

> +
> + if (mem_cgroup_disabled())
> + return NULL;
> +
> + if (!root)
> + root = root_mem_cgroup;
> + /*
> + * Even without hierarchy explicitely enabled in the root
> + * memcg, it is the ultimate parent of all memcgs.
> + */
> + if (!(root == root_mem_cgroup || root->use_hierarchy))
> + return root;
> + if (prev && prev != root)
> + css_put(&prev->css);
> + do {
> + int id = root->last_scanned_child;
> + struct cgroup_subsys_state *css;
> +
> + rcu_read_lock();
> + css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
> + if (css && (css == &root->css || css_tryget(css)))
> + mem = container_of(css, struct mem_cgroup, css);
> + rcu_read_unlock();
> + if (!css)
> + id = 0;
> + root->last_scanned_child = id;
> + } while (!mem);
> + return mem;

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-09 13:46:31

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 2/8] mm: memcg-aware global reclaim

On Thu, Jun 09, 2011 at 03:12:03PM +0200, Michal Hocko wrote:
> On Wed 01-06-11 08:25:13, Johannes Weiner wrote:
> [...]
>
> Just a minor thing. I am really slow at reviewing these days due to
> other work that has to be done...
>
> > +struct mem_cgroup *mem_cgroup_hierarchy_walk(struct mem_cgroup *root,
> > + struct mem_cgroup *prev)
> > +{
> > + struct mem_cgroup *mem;
>
> You want mem = NULL here because you might end up using it uninitialized
> AFAICS (css_get_next can return NULL).

Thanks for pointing it out. It was introduced when I switched from
using @prev to continue the search to using root->last_scanned_child.

It's fixed now.

2011-06-09 14:01:43

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 2/8] mm: memcg-aware global reclaim

On Thu 02-06-11 19:29:05, Johannes Weiner wrote:
> On Fri, Jun 03, 2011 at 01:14:12AM +0900, Hiroyuki Kamezawa wrote:
> > 2011/6/3 Johannes Weiner <[email protected]>:
> > > On Thu, Jun 02, 2011 at 10:59:01PM +0900, Hiroyuki Kamezawa wrote:
> > >> 2011/6/1 Johannes Weiner <[email protected]>:
> >
> > >> > @@ -1927,8 +1980,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
> > >> >        if (!(gfp_mask & __GFP_WAIT))
> > >> >                return CHARGE_WOULDBLOCK;
> > >> >
> > >> > -       ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> > >> > -                                             gfp_mask, flags);
> > >> > +       ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
> > >> >        if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> > >> >                return CHARGE_RETRY;
> > >> >        /*
> > >>
> > >> It seems this clean-up around hierarchy and softlimit can be in an
> > >> independent patch, no ?
> > >
> > > Hm, why do you think it's a cleanup? The hierarchical target reclaim
> > > code is moved to vmscan.c and as a result the entry points for hard
> > > limit and soft limit reclaim differ. This is why the original
> > > function, mem_cgroup_hierarchical_reclaim() has to be split into two
> > > parts.
> > >
> > If functionality is unchanged, I think it's a clean-up.
> > I agree with moving the hierarchy walk to vmscan.c, but it can be done
> > as a clean-up patch for the current code.
> > (Make the current try_to_free_mem_cgroup_pages() use this code.)
> > Then you can write a patch which only includes the core
> > logic/purpose of this patch:
> > "use the root cgroup's LRU for global reclaim and make global reclaim a
> > full scan of memcgroups."
> >
> > In short, I felt this patch is long... and maybe watchers of -mm are
> > not interested in the rewrite of the hierarchy walk, but are very much
> > interested in the changes to shrink_zone() itself.
>
> But the split up is, unfortunately, a change in functionality. The
> current code selects one memcg and reclaims all zones on all priority
> levels on behalf of that memcg. My code changes that such that it
> reclaims a bunch of memcgs from the hierarchy for each zone and
> priority level instead. From memcgs -> priorities -> zones to
> priorities -> zones -> memcgs.

I think you should mention this in the change log; it nicely describes
the core of the change.

>
> I don't want to pass that off as a cleanup.
>
> But it is long, I agree with you. I'll split up the 'move
> hierarchical target reclaim to generic code' from 'make global reclaim
> hierarchical' and see if this makes the changes more straight-forward.
>
> Because I suspect the perceived unwieldiness does not stem from the
> amount of lines changed, but from the number of different logical
> changes.

Agreed.

>
> > >> > +       for (;;) {
> > >> > +               unsigned long nr_reclaimed;
> > >> > +
> > >> > +               sc->mem_cgroup = mem;
> > >> > +               do_shrink_zone(priority, zone, sc);
> > >> > +
> > >> > +               nr_reclaimed = sc->nr_reclaimed - nr_reclaimed_before;
> > >> > +               if (nr_reclaimed >= sc->nr_to_reclaim)
> > >> > +                       break;
> > >>
> > >> What does this calculation mean? Shouldn't we quit based on the
> > >> number of "scanned" pages
> > >> rather than "reclaimed" ones?
> > >
> > > It aborts the loop once sc->nr_to_reclaim pages have been reclaimed
> > > from that zone during that hierarchy walk, to prevent overreclaim.
> > >
> > > If you have unbalanced sizes of memcgs in the system, it is not
> > > desirable to have every reclaimer scan all memcgs, but let those quit
> > > early that have made some progress on the bigger memcgs.
> > >
> > Hmm, why not if (sc->nr_reclaimed >= sc->nr_to_reclaim) ?
> >
> > I'm sorry if I am missing something..
>
> It's a bit awkward and undocumented, I'm afraid. The loop is like
> this:
>
> for each zone:
>     for each memcg:
>         shrink
>         if sc->nr_reclaimed >= sc->nr_to_reclaim:
>             break
>
> sc->nr_reclaimed is never reset, so once you reclaimed enough pages
> from one zone, you will only try the first memcg in all the other
> zones, which might well be empty, so no pressure at all on subsequent
> zones.
>
> That's why I use the per-zone delta like this:
>
> for each zone:
>     before = sc->nr_reclaimed
>     for each memcg:
>         shrink
>         if sc->nr_reclaimed - before >= sc->nr_to_reclaim:
>             break
>
> which still ensures on one hand that we don't keep hammering a zone if
> we reclaimed the overall reclaim target already, but on the other hand
> that we apply some pressure to the other zones as well.
>
> It's the same concept as in do_shrink_zone(). It breaks the loop when
>
> nr_reclaimed >= sc->nr_to_reclaim

Maybe you could make do_shrink_zone return the number of reclaimed
pages. It's true that it would require yet another nr_reclaimed variable
in that function, but it would be more straightforward IMO.
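
Something along these lines, as an untested sketch on top of this patch,
just to illustrate the idea:

/* let the zone shrinker report its own progress */
static unsigned long do_shrink_zone(int priority, struct zone *zone,
                                    struct scan_control *sc)
{
        unsigned long nr_before = sc->nr_reclaimed;

        /* ... existing scan loop unchanged ... */

        return sc->nr_reclaimed - nr_before;
}

static void shrink_zone(int priority, struct zone *zone,
                        struct scan_control *sc)
{
        struct mem_cgroup *root = sc->target_mem_cgroup;
        struct mem_cgroup *first, *mem = NULL;
        unsigned long nr_reclaimed = 0;

        first = mem = mem_cgroup_hierarchy_walk(root, mem);
        for (;;) {
                sc->mem_cgroup = mem;
                /* accumulate this zone's progress directly */
                nr_reclaimed += do_shrink_zone(priority, zone, sc);
                if (nr_reclaimed >= sc->nr_to_reclaim)
                        break;
                mem = mem_cgroup_hierarchy_walk(root, mem);
                if (mem == first)
                        break;
        }
        mem_cgroup_stop_hierarchy_walk(root, mem);
}
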
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-09 15:00:33

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 4/8] memcg: rework soft limit reclaim

On Thu 02-06-11 22:25:29, Ying Han wrote:
> On Thu, Jun 2, 2011 at 2:55 PM, Ying Han <[email protected]> wrote:
> > On Tue, May 31, 2011 at 11:25 PM, Johannes Weiner <[email protected]> wrote:
> >> Currently, soft limit reclaim is entered from kswapd, where it selects
[...]
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index c7d4b44..0163840 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1988,9 +1988,13 @@ static void shrink_zone(int priority, struct zone *zone,
> >>                unsigned long reclaimed = sc->nr_reclaimed;
> >>                unsigned long scanned = sc->nr_scanned;
> >>                unsigned long nr_reclaimed;
> >> +               int epriority = priority;
> >> +
> >> +               if (mem_cgroup_soft_limit_exceeded(root, mem))
> >> +                       epriority -= 1;
> >
> > Here we grant the ability to shrink from all the memcgs, but only
> > raise the priority for those that exceed the soft_limit. That is a
> > design change for the "soft_limit", which gives a hint about which
> > memcgs to reclaim from first under global memory pressure.
>
>
> Basically, we shouldn't reclaim from a memcg under its soft_limit
> unless we have trouble reclaiming pages from others.

Agreed.

> Something like the following makes better sense:
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bdc2fd3..b82ba8c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1989,6 +1989,8 @@ restart:
>         throttle_vm_writeout(sc->gfp_mask);
>  }
>
> +#define MEMCG_SOFTLIMIT_RECLAIM_PRIORITY       2
> +
>  static void shrink_zone(int priority, struct zone *zone,
>                                 struct scan_control *sc)
>  {
> @@ -2001,13 +2003,13 @@ static void shrink_zone(int priority, struct zone *zone,
>                 unsigned long reclaimed = sc->nr_reclaimed;
>                 unsigned long scanned = sc->nr_scanned;
>                 unsigned long nr_reclaimed;
> -               int epriority = priority;
>
> -               if (mem_cgroup_soft_limit_exceeded(root, mem))
> -                       epriority -= 1;
> +               if (!mem_cgroup_soft_limit_exceeded(root, mem) &&
> +                               priority > MEMCG_SOFTLIMIT_RECLAIM_PRIORITY)
> +                       continue;

yes, this makes sense but I am not sure about the right(tm) value of the
MEMCG_SOFTLIMIT_RECLAIM_PRIORITY. 2 sounds too low. You would do quite a
lot of loops
(DEFAULT_PRIORITY-MEMCG_SOFTLIMIT_RECLAIM_PRIORITY) * zones * memcg_count
without any progress (assuming that all of them are under soft limit
which doesn't sound like a totally artificial configuration) until you
allow reclaiming from groups that are under soft limit. Then, when you
finally get to reclaiming, you scan rather aggressively.

Maybe something like 3/4 of DEFAULT_PRIORITY? You would get 3 times
over all (unbalanced) zones and all cgroups that are above the limit
(scanning max{1/4096+1/2048+1/1024, 3*SWAP_CLUSTER_MAX} of the LRUs for
each cgroup) which could be enough to collect the low hanging fruit.
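
Assuming DEFAULT_PRIORITY above means DEF_PRIORITY (12) from mm/vmscan.c,
that would be roughly (sketch only):

#define MEMCG_SOFTLIMIT_RECLAIM_PRIORITY        (DEF_PRIORITY * 3 / 4)  /* 9 */

i.e. the first three priority rounds would only look at groups above
their soft limit before everything else becomes fair game.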
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-09 15:48:52

by Minchan Kim

[permalink] [raw]
Subject: Re: [patch 2/8] mm: memcg-aware global reclaim

Hi Hannes,

I have a comment.
Please look at the bottom of the mail.

On Wed, Jun 01, 2011 at 08:25:13AM +0200, Johannes Weiner wrote:
> When a memcg hits its hard limit, hierarchical target reclaim is
> invoked, which goes through all contributing memcgs in the hierarchy
> below the offending memcg and reclaims from the respective per-memcg
> lru lists. This distributes pressure fairly among all involved
> memcgs, and pages are aged with respect to their list buddies.
>
> When global memory pressure arises, however, all this is dropped
> overboard. Pages are reclaimed based on global lru lists that have
> nothing to do with container-internal age, and some memcgs may be
> reclaimed from much more than others.
>
> This patch makes traditional global reclaim consider container
> boundaries and no longer scan the global lru lists. For each zone
> scanned, the memcg hierarchy is walked and pages are reclaimed from
> the per-memcg lru lists of the respective zone. For now, the
> hierarchy walk is bounded to one full round-trip through the
> hierarchy, or if the number of reclaimed pages reaches the overall
> reclaim target, whichever comes first.
>
> Conceptually, global memory pressure is then treated as if the root
> memcg had hit its limit. Since all existing memcgs contribute to the
> usage of the root memcg, global reclaim is nothing more than target
> reclaim starting from the root memcg. The code is mostly the same for
> both cases, except for a few heuristics and statistics that do not
> always apply. They are distinguished by a newly introduced
> global_reclaim() primitive.
>
> One implication of this change is that pages have to be linked to the
> lru lists of the root memcg again, which could be optimized away with
> the old scheme. The costs are not measurable, though, even with
> worst-case microbenchmarks.
>
> As global reclaim no longer relies on global lru lists, this change is
> also in preparation to remove those completely.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> include/linux/memcontrol.h | 15 ++++
> mm/memcontrol.c | 176 ++++++++++++++++++++++++++++----------------
> mm/vmscan.c | 121 ++++++++++++++++++++++--------
> 3 files changed, 218 insertions(+), 94 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 5e9840f5..332b0a6 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -101,6 +101,10 @@ mem_cgroup_prepare_migration(struct page *page,
> extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
> struct page *oldpage, struct page *newpage, bool migration_ok);
>
> +struct mem_cgroup *mem_cgroup_hierarchy_walk(struct mem_cgroup *,
> + struct mem_cgroup *);
> +void mem_cgroup_stop_hierarchy_walk(struct mem_cgroup *, struct mem_cgroup *);
> +
> /*
> * For memory reclaim.
> */
> @@ -321,6 +325,17 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
> return NULL;
> }
>
> +static inline struct mem_cgroup *mem_cgroup_hierarchy_walk(struct mem_cgroup *r,
> + struct mem_cgroup *m)
> +{
> + return NULL;
> +}
> +
> +static inline void mem_cgroup_stop_hierarchy_walk(struct mem_cgroup *r,
> + struct mem_cgroup *m)
> +{
> +}
> +
> static inline void
> mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
> {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index bf5ab87..850176e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -313,8 +313,8 @@ static bool move_file(void)
> }
>
> /*
> - * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
> - * limit reclaim to prevent infinite loops, if they ever occur.
> + * Maximum loops in reclaim, used for soft limit reclaim to prevent
> + * infinite loops, if they ever occur.
> */
> #define MEM_CGROUP_MAX_RECLAIM_LOOPS (100)
> #define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2)
> @@ -340,7 +340,7 @@ enum charge_type {
> #define OOM_CONTROL (0)
>
> /*
> - * Reclaim flags for mem_cgroup_hierarchical_reclaim
> + * Reclaim flags
> */
> #define MEM_CGROUP_RECLAIM_NOSWAP_BIT 0x0
> #define MEM_CGROUP_RECLAIM_NOSWAP (1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
> @@ -846,8 +846,6 @@ void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
> mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
> /* huge page split is done under lru_lock. so, we have no races. */
> MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
> - if (mem_cgroup_is_root(pc->mem_cgroup))
> - return;
> VM_BUG_ON(list_empty(&pc->lru));
> list_del_init(&pc->lru);
> }
> @@ -872,13 +870,11 @@ void mem_cgroup_rotate_reclaimable_page(struct page *page)
> return;
>
> pc = lookup_page_cgroup(page);
> - /* unused or root page is not rotated. */
> + /* unused page is not rotated. */
> if (!PageCgroupUsed(pc))
> return;
> /* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
> smp_rmb();
> - if (mem_cgroup_is_root(pc->mem_cgroup))
> - return;
> mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
> list_move_tail(&pc->lru, &mz->lists[lru]);
> }
> @@ -892,13 +888,11 @@ void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
> return;
>
> pc = lookup_page_cgroup(page);
> - /* unused or root page is not rotated. */
> + /* unused page is not rotated. */
> if (!PageCgroupUsed(pc))
> return;
> /* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
> smp_rmb();
> - if (mem_cgroup_is_root(pc->mem_cgroup))
> - return;
> mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
> list_move(&pc->lru, &mz->lists[lru]);
> }
> @@ -920,8 +914,6 @@ void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
> /* huge page split is done under lru_lock. so, we have no races. */
> MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
> SetPageCgroupAcctLRU(pc);
> - if (mem_cgroup_is_root(pc->mem_cgroup))
> - return;
> list_add(&pc->lru, &mz->lists[lru]);
> }
>
> @@ -1381,6 +1373,97 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
> return min(limit, memsw);
> }
>
> +/**
> + * mem_cgroup_hierarchy_walk - iterate over a memcg hierarchy
> + * @root: starting point of the hierarchy
> + * @prev: previous position or NULL
> + *
> + * Caller must hold a reference to @root. While this function will
> + * return @root as part of the walk, it will never increase its
> + * reference count.
> + *
> + * Caller must clean up with mem_cgroup_stop_hierarchy_walk() when it
> + * stops the walk potentially before the full round trip.
> + */
> +struct mem_cgroup *mem_cgroup_hierarchy_walk(struct mem_cgroup *root,
> + struct mem_cgroup *prev)
> +{
> + struct mem_cgroup *mem;
> +
> + if (mem_cgroup_disabled())
> + return NULL;
> +
> + if (!root)
> + root = root_mem_cgroup;
> + /*
> + * Even without hierarchy explicitely enabled in the root
> + * memcg, it is the ultimate parent of all memcgs.
> + */
> + if (!(root == root_mem_cgroup || root->use_hierarchy))
> + return root;
> + if (prev && prev != root)
> + css_put(&prev->css);
> + do {
> + int id = root->last_scanned_child;
> + struct cgroup_subsys_state *css;
> +
> + rcu_read_lock();
> + css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
> + if (css && (css == &root->css || css_tryget(css)))
> + mem = container_of(css, struct mem_cgroup, css);
> + rcu_read_unlock();
> + if (!css)
> + id = 0;
> + root->last_scanned_child = id;
> + } while (!mem);
> + return mem;
> +}
> +
> +/**
> + * mem_cgroup_stop_hierarchy_walk - clean up after partial hierarchy walk
> + * @root: starting point in the hierarchy
> + * @mem: last position during the walk
> + */
> +void mem_cgroup_stop_hierarchy_walk(struct mem_cgroup *root,
> + struct mem_cgroup *mem)
> +{
> + if (mem && mem != root)
> + css_put(&mem->css);
> +}
> +
> +static unsigned long mem_cgroup_reclaim(struct mem_cgroup *mem,
> + gfp_t gfp_mask,
> + unsigned long flags)
> +{
> + unsigned long total = 0;
> + bool noswap = false;
> + int loop;
> +
> + if ((flags & MEM_CGROUP_RECLAIM_NOSWAP) || mem->memsw_is_minimum)
> + noswap = true;
> + for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
> + drain_all_stock_async();
> + total += try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap,
> + get_swappiness(mem));
> + /*
> + * Avoid freeing too much when shrinking to resize the
> + * limit. XXX: Shouldn't the margin check be enough?
> + */
> + if (total && (flags & MEM_CGROUP_RECLAIM_SHRINK))
> + break;
> + if (mem_cgroup_margin(mem))
> + break;
> + /*
> + * If we have not been able to reclaim anything after
> + * two reclaim attempts, there may be no reclaimable
> + * pages in this hierarchy.
> + */
> + if (loop && !total)
> + break;
> + }
> + return total;
> +}
> +
> /*
> * Visit the first child (need not be the first child as per the ordering
> * of the cgroup list, since we track last_scanned_child) of @mem and use
> @@ -1418,29 +1501,14 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
> return ret;
> }
>
> -/*
> - * Scan the hierarchy if needed to reclaim memory. We remember the last child
> - * we reclaimed from, so that we don't end up penalizing one child extensively
> - * based on its position in the children list.
> - *
> - * root_mem is the original ancestor that we've been reclaim from.
> - *
> - * We give up and return to the caller when we visit root_mem twice.
> - * (other groups can be removed while we're walking....)
> - *
> - * If shrink==true, for avoiding to free too much, this returns immedieately.
> - */
> -static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> - struct zone *zone,
> - gfp_t gfp_mask,
> - unsigned long reclaim_options)
> +static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_mem,
> + struct zone *zone,
> + gfp_t gfp_mask)
> {
> struct mem_cgroup *victim;
> int ret, total = 0;
> int loop = 0;
> - bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
> - bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
> - bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
> + bool noswap = false;
> unsigned long excess;
>
> excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
> @@ -1461,7 +1529,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> * anything, it might because there are
> * no reclaimable pages under this hierarchy
> */
> - if (!check_soft || !total) {
> + if (!total) {
> css_put(&victim->css);
> break;
> }
> @@ -1483,26 +1551,11 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> css_put(&victim->css);
> continue;
> }
> - /* we use swappiness of local cgroup */
> - if (check_soft)
> - ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
> - noswap, get_swappiness(victim), zone);
> - else
> - ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> - noswap, get_swappiness(victim));
> + ret = mem_cgroup_shrink_node_zone(victim, gfp_mask, noswap,
> + get_swappiness(victim), zone);
> css_put(&victim->css);
> - /*
> - * At shrinking usage, we can't check we should stop here or
> - * reclaim more. It's depends on callers. last_scanned_child
> - * will work enough for keeping fairness under tree.
> - */
> - if (shrink)
> - return ret;
> total += ret;
> - if (check_soft) {
> - if (!res_counter_soft_limit_excess(&root_mem->res))
> - return total;
> - } else if (mem_cgroup_margin(root_mem))
> + if (!res_counter_soft_limit_excess(&root_mem->res))
> return total;
> }
> return total;
> @@ -1927,8 +1980,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
> if (!(gfp_mask & __GFP_WAIT))
> return CHARGE_WOULDBLOCK;
>
> - ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> - gfp_mask, flags);
> + ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
> if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> return CHARGE_RETRY;
> /*
> @@ -3085,7 +3137,7 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
>
> /*
> * A call to try to shrink memory usage on charge failure at shmem's swapin.
> - * Calling hierarchical_reclaim is not enough because we should update
> + * Calling reclaim is not enough because we should update
> * last_oom_jiffies to prevent pagefault_out_of_memory from invoking global OOM.
> * Moreover considering hierarchy, we should reclaim from the mem_over_limit,
> * not from the memcg which this page would be charged to.
> @@ -3167,7 +3219,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> int enlarge;
>
> /*
> - * For keeping hierarchical_reclaim simple, how long we should retry
> + * For keeping reclaim simple, how long we should retry
> * is depends on callers. We set our retry-count to be function
> * of # of children which we should visit in this loop.
> */
> @@ -3210,8 +3262,8 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> if (!ret)
> break;
>
> - mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
> - MEM_CGROUP_RECLAIM_SHRINK);
> + mem_cgroup_reclaim(memcg, GFP_KERNEL,
> + MEM_CGROUP_RECLAIM_SHRINK);
> curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
> /* Usage is reduced ? */
> if (curusage >= oldusage)
> @@ -3269,9 +3321,9 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> if (!ret)
> break;
>
> - mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
> - MEM_CGROUP_RECLAIM_NOSWAP |
> - MEM_CGROUP_RECLAIM_SHRINK);
> + mem_cgroup_reclaim(memcg, GFP_KERNEL,
> + MEM_CGROUP_RECLAIM_NOSWAP |
> + MEM_CGROUP_RECLAIM_SHRINK);
> curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
> /* Usage is reduced ? */
> if (curusage >= oldusage)
> @@ -3311,9 +3363,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> if (!mz)
> break;
>
> - reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
> - gfp_mask,
> - MEM_CGROUP_RECLAIM_SOFT);
> + reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone, gfp_mask);
> nr_reclaimed += reclaimed;
> spin_lock(&mctz->lock);
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8bfd450..7e9bfca 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -104,7 +104,16 @@ struct scan_control {
> */
> reclaim_mode_t reclaim_mode;
>
> - /* Which cgroup do we reclaim from */
> + /*
> + * The memory cgroup that hit its hard limit and is the
> + * primary target of this reclaim invocation.
> + */
> + struct mem_cgroup *target_mem_cgroup;
> +
> + /*
> + * The memory cgroup that is currently being scanned as a
> + * child and contributor to the usage of target_mem_cgroup.
> + */
> struct mem_cgroup *mem_cgroup;
>
> /*
> @@ -154,9 +163,36 @@ static LIST_HEAD(shrinker_list);
> static DECLARE_RWSEM(shrinker_rwsem);
>
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> -#define scanning_global_lru(sc) (!(sc)->mem_cgroup)
> +/**
> + * global_reclaim - whether reclaim is global or due to memcg hard limit
> + * @sc: scan control of this reclaim invocation
> + */
> +static bool global_reclaim(struct scan_control *sc)
> +{
> + return !sc->target_mem_cgroup;
> +}
> +/**
> + * scanning_global_lru - whether scanning global lrus or per-memcg lrus
> + * @sc: scan control of this reclaim invocation
> + */
> +static bool scanning_global_lru(struct scan_control *sc)
> +{
> + /*
> + * Unless memory cgroups are disabled on boot, the traditional
> + * global lru lists are never scanned and reclaim will always
> + * operate on the per-memcg lru lists.
> + */
> + return mem_cgroup_disabled();
> +}
> #else
> -#define scanning_global_lru(sc) (1)
> +static bool global_reclaim(struct scan_control *sc)
> +{
> + return true;
> +}
> +static bool scanning_global_lru(struct scan_control *sc)
> +{
> + return true;
> +}
> #endif
>
> static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> @@ -1228,7 +1264,7 @@ static int too_many_isolated(struct zone *zone, int file,
> if (current_is_kswapd())
> return 0;
>
> - if (!scanning_global_lru(sc))
> + if (!global_reclaim(sc))
> return 0;
>
> if (file) {
> @@ -1397,13 +1433,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
> ISOLATE_BOTH : ISOLATE_INACTIVE,
> zone, 0, file);
> - zone->pages_scanned += nr_scanned;
> - if (current_is_kswapd())
> - __count_zone_vm_events(PGSCAN_KSWAPD, zone,
> - nr_scanned);
> - else
> - __count_zone_vm_events(PGSCAN_DIRECT, zone,
> - nr_scanned);
> } else {
> nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
> &page_list, &nr_scanned, sc->order,
> @@ -1411,10 +1440,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> ISOLATE_BOTH : ISOLATE_INACTIVE,
> zone, sc->mem_cgroup,
> 0, file);
> - /*
> - * mem_cgroup_isolate_pages() keeps track of
> - * scanned pages on its own.
> - */
> + }
> +
> + if (global_reclaim(sc)) {
> + zone->pages_scanned += nr_scanned;
> + if (current_is_kswapd())
> + __count_zone_vm_events(PGSCAN_KSWAPD, zone,
> + nr_scanned);
> + else
> + __count_zone_vm_events(PGSCAN_DIRECT, zone,
> + nr_scanned);
> }
>
> if (nr_taken == 0) {
> @@ -1520,18 +1555,16 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
> &pgscanned, sc->order,
> ISOLATE_ACTIVE, zone,
> 1, file);
> - zone->pages_scanned += pgscanned;
> } else {
> nr_taken = mem_cgroup_isolate_pages(nr_pages, &l_hold,
> &pgscanned, sc->order,
> ISOLATE_ACTIVE, zone,
> sc->mem_cgroup, 1, file);
> - /*
> - * mem_cgroup_isolate_pages() keeps track of
> - * scanned pages on its own.
> - */
> }
>
> + if (global_reclaim(sc))
> + zone->pages_scanned += pgscanned;
> +
> reclaim_stat->recent_scanned[file] += nr_taken;
>
> __count_zone_vm_events(PGREFILL, zone, pgscanned);
> @@ -1752,7 +1785,7 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
> file = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
> zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);
>
> - if (scanning_global_lru(sc)) {
> + if (global_reclaim(sc)) {
> free = zone_page_state(zone, NR_FREE_PAGES);
> /* If we have very few page cache pages,
> force-scan anon pages. */
> @@ -1889,8 +1922,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
> /*
> * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
> */
> -static void shrink_zone(int priority, struct zone *zone,
> - struct scan_control *sc)
> +static void do_shrink_zone(int priority, struct zone *zone,
> + struct scan_control *sc)
> {
> unsigned long nr[NR_LRU_LISTS];
> unsigned long nr_to_scan;
> @@ -1943,6 +1976,31 @@ restart:
> throttle_vm_writeout(sc->gfp_mask);
> }
>
> +static void shrink_zone(int priority, struct zone *zone,
> + struct scan_control *sc)
> +{
> + unsigned long nr_reclaimed_before = sc->nr_reclaimed;
> + struct mem_cgroup *root = sc->target_mem_cgroup;
> + struct mem_cgroup *first, *mem = NULL;
> +
> + first = mem = mem_cgroup_hierarchy_walk(root, mem);
> + for (;;) {
> + unsigned long nr_reclaimed;
> +
> + sc->mem_cgroup = mem;
> + do_shrink_zone(priority, zone, sc);
> +
> + nr_reclaimed = sc->nr_reclaimed - nr_reclaimed_before;
> + if (nr_reclaimed >= sc->nr_to_reclaim)
> + break;
> +
> + mem = mem_cgroup_hierarchy_walk(root, mem);
> + if (mem == first)
> + break;
> + }
> + mem_cgroup_stop_hierarchy_walk(root, mem);
> +}
> +
> /*
> * This is the direct reclaim path, for page-allocating processes. We only
> * try to reclaim pages from zones which will satisfy the caller's allocation
> @@ -1973,7 +2031,7 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
> * Take care memory controller reclaiming has small influence
> * to global LRU.
> */
> - if (scanning_global_lru(sc)) {
> + if (global_reclaim(sc)) {
> if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> continue;
> if (zone->all_unreclaimable && priority != DEF_PRIORITY)
> @@ -2038,7 +2096,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> get_mems_allowed();
> delayacct_freepages_start();
>
> - if (scanning_global_lru(sc))
> + if (global_reclaim(sc))
> count_vm_event(ALLOCSTALL);
>
> for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> @@ -2050,7 +2108,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> * Don't shrink slabs when reclaiming memory from
> * over limit cgroups
> */
> - if (scanning_global_lru(sc)) {
> + if (global_reclaim(sc)) {
> unsigned long lru_pages = 0;
> for_each_zone_zonelist(zone, z, zonelist,
> gfp_zone(sc->gfp_mask)) {
> @@ -2111,7 +2169,7 @@ out:
> return 0;
>
> /* top priority shrink_zones still had more to do? don't OOM, then */
> - if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
> + if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc))
> return 1;
>
> return 0;
> @@ -2129,7 +2187,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> .may_swap = 1,
> .swappiness = vm_swappiness,
> .order = order,
> - .mem_cgroup = NULL,
> + .target_mem_cgroup = NULL,
> .nodemask = nodemask,
> };
>
> @@ -2158,6 +2216,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> .may_swap = !noswap,
> .swappiness = swappiness,
> .order = 0,
> + .target_mem_cgroup = mem,
> .mem_cgroup = mem,
> };
> sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> @@ -2174,7 +2233,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> * will pick up pages from other mem cgroup's as well. We hack
> * the priority and make it zero.
> */
> - shrink_zone(0, zone, &sc);
> + do_shrink_zone(0, zone, &sc);
>
> trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
>
> @@ -2195,7 +2254,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> .nr_to_reclaim = SWAP_CLUSTER_MAX,
> .swappiness = swappiness,
> .order = 0,
> - .mem_cgroup = mem_cont,
> + .target_mem_cgroup = mem_cont,
> .nodemask = NULL, /* we don't care the placement */
> };
>
> @@ -2333,7 +2392,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> .nr_to_reclaim = ULONG_MAX,
> .swappiness = vm_swappiness,
> .order = order,
> - .mem_cgroup = NULL,
> + .target_mem_cgroup = NULL,
> };
> loop_again:
> total_scanned = 0;
> --
> 1.7.5.2
>

I haven't looked at it all yet, and you might change the logic in later
patches. If I understand this patch right, it does round-robin reclaim
over all memcgs when global memory pressure happens.

Let's consider a case with unbalanced memcg sizes.

If A-memcg has lots of LRU pages, its scan count for reclaim would be
bigger, so the chance of reclaiming pages from it would be higher.
If we reclaim from A-memcg, we can easily reclaim the number of pages we
want and break. The next reclaim will happen at some point and will
start at B-memcg, right after the A-memcg we reclaimed from successfully
before. But unfortunately B-memcg has a small LRU, so its scan count
would be small, and the small memcg's LRU ages faster than the bigger
memcg's. It means the small memcg's working set can be evicted more
easily than the big memcg's. My point is that we should not move on to
the next memcg so easily. We have to consider the memcg LRU size.

It is a big change compared to the old LRU aging.
I think the LRU is meaningful when we have lots of pages globally.
I really like unifying global and memcg reclaim, but if we change the
old behaviour, we have to prove that it doesn't hurt the old LRU aging.
I hope I am missing something.
--
Kind regards
Minchan Kim

2011-06-09 16:58:19

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 2/8] mm: memcg-aware global reclaim

On Thu, Jun 09, 2011 at 05:26:17AM -0400, Christoph Hellwig wrote:
> On Wed, Jun 08, 2011 at 11:30:46AM +0200, Johannes Weiner wrote:
> > On Tue, Jun 07, 2011 at 08:25:19AM -0400, Christoph Hellwig wrote:
> > > A few small nitpicks:
> > >
> > > > +struct mem_cgroup *mem_cgroup_hierarchy_walk(struct mem_cgroup *root,
> > > > + struct mem_cgroup *prev)
> > > > +{
> > > > + struct mem_cgroup *mem;
> > > > +
> > > > + if (mem_cgroup_disabled())
> > > > + return NULL;
> > > > +
> > > > + if (!root)
> > > > + root = root_mem_cgroup;
> > > > + /*
> > > > + * Even without hierarchy explicitely enabled in the root
> > > > + * memcg, it is the ultimate parent of all memcgs.
> > > > + */
> > > > + if (!(root == root_mem_cgroup || root->use_hierarchy))
> > > > + return root;
> > >
> > > The logic here reads a bit weird, why not simply:
> > >
> > > /*
> > > * Even without hierarchy explicitely enabled in the root
> > > * memcg, it is the ultimate parent of all memcgs.
> > > */
> > > if (!root || root == root_mem_cgroup)
> > > return root_mem_cgroup;
> > > if (root->use_hierarchy)
> > > return root;
> >
> > What you are proposing is not equivalent, so... case in point! It's
> > meant to do the hierarchy walk for when foo->use_hierarchy, obviously,
> > but ALSO for root_mem_cgroup, which is parent to everyone else even
> > without use_hierarchy set. I changed it to read like this:
> >
> > if (!root)
> > root = root_mem_cgroup;
> > if (!root->use_hierarchy && root != root_mem_cgroup)
> > return root;
> > /* actually iterate hierarchy */
> >
> > Does that make more sense?
>
> It does, sorry for misparsing it. The thing that I really hated was
> the conditional assignment of root. Can we clean this up somehow
> by making the caller pass root_mem_cgroup in the case where it
> passes root right now, or at least always pass NULL when it means
> root_mem_cgroup?
>
> Not really that important in the end, it just irked me when I looked
> over it, especially the conditional assignment of root to root_mem_cgroup,
> and then a little later checking for the equality of the two.

Yeah, the assignment is an ugly interface fixup because
root_mem_cgroup is local to memcontrol.c, as is struct mem_cgroup as a
whole.

I'll look into your suggestion from the other mail of making struct
mem_cgroup and struct mem_cgroup_per_zone always available, and have
everyone operate against root_mem_cgroup per default.

> Thinking about it it's probably better left as-is for now to not
> complicate the series, and maybe revisit it later once things have
> settled a bit.

I may take you up on that if this approach turns out to require more
change than is sensible to add to this series.

I'll at least add an

/* XXX: until vmscan.c knows about root_mem_cgroup */

or so, if this is the case, to explain the temporary nastiness.

2011-06-09 17:24:23

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 2/8] mm: memcg-aware global reclaim

On Fri, Jun 10, 2011 at 12:48:39AM +0900, Minchan Kim wrote:
> On Wed, Jun 01, 2011 at 08:25:13AM +0200, Johannes Weiner wrote:
> > When a memcg hits its hard limit, hierarchical target reclaim is
> > invoked, which goes through all contributing memcgs in the hierarchy
> > below the offending memcg and reclaims from the respective per-memcg
> > lru lists. This distributes pressure fairly among all involved
> > memcgs, and pages are aged with respect to their list buddies.
> >
> > When global memory pressure arises, however, all this is dropped
> > overboard. Pages are reclaimed based on global lru lists that have
> > nothing to do with container-internal age, and some memcgs may be
> > reclaimed from much more than others.
> >
> > This patch makes traditional global reclaim consider container
> > boundaries and no longer scan the global lru lists. For each zone
> > scanned, the memcg hierarchy is walked and pages are reclaimed from
> > the per-memcg lru lists of the respective zone. For now, the
> > hierarchy walk is bounded to one full round-trip through the
> > hierarchy, or if the number of reclaimed pages reaches the overall
> > reclaim target, whichever comes first.
> >
> > Conceptually, global memory pressure is then treated as if the root
> > memcg had hit its limit. Since all existing memcgs contribute to the
> > usage of the root memcg, global reclaim is nothing more than target
> > reclaim starting from the root memcg. The code is mostly the same for
> > both cases, except for a few heuristics and statistics that do not
> > always apply. They are distinguished by a newly introduced
> > global_reclaim() primitive.
> >
> > One implication of this change is that pages have to be linked to the
> > lru lists of the root memcg again, which could be optimized away with
> > the old scheme. The costs are not measurable, though, even with
> > worst-case microbenchmarks.
> >
> > As global reclaim no longer relies on global lru lists, this change is
> > also in preparation to remove those completely.

[cut diff]

> I haven't looked at it all yet, and you might change the logic in later
> patches. If I understand this patch right, it does round-robin reclaim
> over all memcgs when global memory pressure happens.
>
> Let's consider a case with unbalanced memcg sizes.
>
> If A-memcg has lots of LRU pages, its scan count for reclaim would be
> bigger, so the chance of reclaiming pages from it would be higher.
> If we reclaim from A-memcg, we can easily reclaim the number of pages we
> want and break. The next reclaim will happen at some point and will
> start at B-memcg, right after the A-memcg we reclaimed from successfully
> before. But unfortunately B-memcg has a small LRU, so its scan count
> would be small, and the small memcg's LRU ages faster than the bigger
> memcg's. It means the small memcg's working set can be evicted more
> easily than the big memcg's. My point is that we should not move on to
> the next memcg so easily. We have to consider the memcg LRU size.

I may be missing something, but you said yourself that B had a smaller
scan count compared to A, so the aging speed should be proportional to
respective size.

The number of pages scanned per iteration is essentially

number of lru pages in memcg-zone >> priority

so we scan fewer pages from B than from A each round, in proportion to
their sizes.
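
As a rough sketch of that argument only (not the exact get_scan_count()
math; memcg_zone_scan_target() is a made-up name):

static unsigned long memcg_zone_scan_target(struct zone *zone,
                                            struct scan_control *sc,
                                            enum lru_list lru, int priority)
{
        /* each memcg-zone LRU is scanned in proportion to its size */
        return zone_nr_lru_pages(zone, sc, lru) >> priority;
}

A memcg with half the LRU pages in a zone therefore gets roughly half
the scan pressure per round, and both age at the same relative speed.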

It's the exact same logic we have been applying traditionally to
distribute pressure fairly among zones to equalize their aging speed.

Is that what you meant or are we talking past each other?

2011-06-09 17:36:56

by Ying Han

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

On Thu, Jun 9, 2011 at 1:35 AM, Johannes Weiner <[email protected]> wrote:
> On Wed, Jun 08, 2011 at 08:52:03PM -0700, Ying Han wrote:
>> On Wed, Jun 8, 2011 at 8:32 AM, Johannes Weiner <[email protected]> wrote:
>> > On Tue, Jun 07, 2011 at 08:53:21PM -0700, Ying Han wrote:
>> >> 2. The way we treat the per-memcg soft_limit is changed in this patch.
>> >> It is the same comment I made on the following patch: we shouldn't
>> >> change the definition of a user API (soft_limit_in_bytes in this case).
>> >> So I attached a patch to fix that, where we only go to the ones under
>> >> their soft_limit above a certain reclaim priority. Please consider.
>> >
>> > Here is your proposal from the other mail:
>> >
>> > : Basically, we shouldn't reclaim from a memcg under its soft_limit
>> > : unless we have trouble reclaiming pages from others. Something like the
>> > : following makes better sense:
>> > :
>> > : diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > : index bdc2fd3..b82ba8c 100644
>> > : --- a/mm/vmscan.c
>> > : +++ b/mm/vmscan.c
>> > : @@ -1989,6 +1989,8 @@ restart:
>> > :         throttle_vm_writeout(sc->gfp_mask);
>> > :  }
>> > :
>> > : +#define MEMCG_SOFTLIMIT_RECLAIM_PRIORITY       2
>> > : +
>> > :  static void shrink_zone(int priority, struct zone *zone,
>> > :                                 struct scan_control *sc)
>> > :  {
>> > : @@ -2001,13 +2003,13 @@ static void shrink_zone(int priority, struct zone *zone,
>> > :                 unsigned long reclaimed = sc->nr_reclaimed;
>> > :                 unsigned long scanned = sc->nr_scanned;
>> > :                 unsigned long nr_reclaimed;
>> > : -               int epriority = priority;
>> > :
>> > : -               if (mem_cgroup_soft_limit_exceeded(root, mem))
>> > : -                       epriority -= 1;
>> > : +               if (!mem_cgroup_soft_limit_exceeded(root, mem) &&
>> > : +                               priority > MEMCG_SOFTLIMIT_RECLAIM_PRIORITY)
>> > : +                       continue;
>> >
>> > I am not sure if you are serious or playing devil's advocate here,
>> > because it exacerbates the problem you are concerned about in 1. by
>> > orders of magnitude.
>>
>> No, the two are different issues. The first one is a performance
>> concern of detailed implementation, while the second one is a design
>> concern.
>
> Got ya.
>
>> > I guess it would make much more sense to evaluate if reclaiming from
>> > memcgs while there are others exceeding their soft limit is even a
>> > problem. ?Otherwise this discussion is pretty pointless.
>>
>> AFAIK it is a problem since it changes the spec of kernel API
>> memory.soft_limit_in_bytes. That value is set per-memcg which all the
>> pages allocated above that are best effort and targeted to reclaim
>> prior to others.
>
> That's not really true. ?Quoting the documentation:
>
> ? ?When the system detects memory contention or low memory, control groups
> ? ?are pushed back to their soft limits. If the soft limit of each control
> ? ?group is very high, they are pushed back as much as possible to make
> ? ?sure that one control group does not starve the others of memory.
>
> I am language lawyering here, but I don't think it says it won't touch
> other memcgs at all while there are memcgs exceeding their soft limit.

Well... :) I would say that the documentation of soft_limit needs a lot
of work, especially after all the discussions we had after LSF.

The RFC I sent after our discussion has the following documentation;
I only cut & paste the content relevant to our conversation here:

What is "soft_limit"?
The "soft_limit" was introduced in memcg to support over-committing the
memory resource on the host. Each cgroup can be configured with a
"hard_limit", where it will be throttled or OOM killed for going over
the limit. However, the allocation can go above the "soft_limit" as
long as there is no memory contention. The "soft_limit" is the kernel
mechanism for re-distributing spare memory resources among cgroups.

What do we have now?
The current implementation of softlimit is based on a per-zone RB tree,
where only the cgroup exceeding its soft_limit the most is selected
for reclaim.

It makes less sense to reclaim from only one cgroup rather than
reclaiming from all cgroups based on a calculated proportion. This is
required for fairness.

Proposed design:
round-robin across the cgroups that have memory allocated on the zone
and also exceed their configured softlimit.

There was a question on how to do zone balancing w/o a global LRU. This
could be solved by building another cgroup list per-zone, where we also
link cgroups under their soft_limit. We won't scan that list unless the
first list has been exhausted and the number of free pages is still
under the high_wmark.

Since the per-zone memcg list design is being replaced by your
patchset, some of the details don't apply. But the concept still
remains: we would like to scan some memcgs (those above their
soft_limit) first.
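To make the round-robin part concrete, a rough sketch in plain C,
illustrative only; memcg_over_soft_limit() and the excessors[] array
are made-up stand-ins, not existing kernel interfaces:

struct memcg {
        unsigned long usage;
        unsigned long soft_limit;
};

static int memcg_over_soft_limit(const struct memcg *m)
{
        return m->usage > m->soft_limit;
}

/*
 * Pick the next soft-limit excessor after the one visited last time,
 * wrapping around; -1 means none is over its limit and we fall back
 * to the second (under-limit) list.
 */
static int pick_next_excessor(const struct memcg *excessors, int nr, int last)
{
        int i;

        for (i = 1; i <= nr; i++) {
                int idx = (last + i) % nr;

                if (memcg_over_soft_limit(&excessors[idx]))
                        return idx;
        }
        return -1;
}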

>
> It would be a lie about the current code in the first place, which
> does soft limit reclaim and then regular reclaim, no matter the
> outcome of the soft limit reclaim cycle. ?It will go for the soft
> limit first, but after an allocation under pressure the VM is likely
> to have reclaimed from other memcgs as well.
>
> I saw your patch to fix that and break out of reclaim if soft limit
> reclaim did enough. ?But this fix is not much newer than my changes.

My soft_limit patch was developed in parallel with your patchset, and
most of that wouldn't apply here.
Is that what you are referring to?

>
> The second part of this is:
>
> ? ?Please note that soft limits is a best effort feature, it comes with
> ? ?no guarantees, but it does its best to make sure that when memory is
> ? ?heavily contended for, memory is allocated based on the soft limit
> ? ?hints/setup. Currently soft limit based reclaim is setup such that
> ? ?it gets invoked from balance_pgdat (kswapd).

We had a patch merged which adds the soft_limit reclaim in the global ttfp as well.

memcg-add-the-soft_limit-reclaim-in-global-direct-reclaim.patch

> It's not the pages-over-soft-limit that are best effort. ?It says that
> it tries its best to take soft limits into account while reclaiming.
Hmm. Both cases are true. The best-effort pages I am referring to means
"the pages above the soft_limit are targeted for reclaim first under
memory contention".

>
> My code does that, so I don't think we are breaking any promises
> currently made in the documentation.
>
> But much more important than keeping documentation promises is not to
> break actual users. ?So if you are yourself a user of soft limits,
> test the new code pretty please and complain if it breaks your setup!

Yes, I've been running tests on your patchset, but haven't gotten into
specific configurations yet. But I don't think it is hard to generate
the following scenario:

On a 32G machine, under root I have three cgroups with a 20G hard_limit:
cgroup-A: soft_limit 1g, usage 20g with clean file pages
cgroup-B: soft_limit 10g, usage 5g with clean file pages
cgroup-C: soft_limit 10g, usage 5g with clean file pages

I would assume reclaiming from cgroup-A should be sufficient under
global memory pressure, and no pages need to be reclaimed from B or
C, especially since both of them have memory usage under their
soft_limit.

I see we also have a discussion on soft_limit reclaim with Michal on
[patch 4], so I might start working on that.

--Ying




>

2011-06-09 18:37:14

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

On Thu, Jun 09, 2011 at 10:36:47AM -0700, Ying Han wrote:
> On Thu, Jun 9, 2011 at 1:35 AM, Johannes Weiner <[email protected]> wrote:
> > On Wed, Jun 08, 2011 at 08:52:03PM -0700, Ying Han wrote:
> >> On Wed, Jun 8, 2011 at 8:32 AM, Johannes Weiner <[email protected]> wrote:
> >> > I guess it would make much more sense to evaluate if reclaiming from
> >> > memcgs while there are others exceeding their soft limit is even a
> >> > problem. ?Otherwise this discussion is pretty pointless.
> >>
> >> AFAIK it is a problem since it changes the spec of kernel API
> >> memory.soft_limit_in_bytes. That value is set per-memcg which all the
> >> pages allocated above that are best effort and targeted to reclaim
> >> prior to others.
> >
> > That's not really true. ?Quoting the documentation:
> >
> > ? ?When the system detects memory contention or low memory, control groups
> > ? ?are pushed back to their soft limits. If the soft limit of each control
> > ? ?group is very high, they are pushed back as much as possible to make
> > ? ?sure that one control group does not starve the others of memory.
> >
> > I am language lawyering here, but I don't think it says it won't touch
> > other memcgs at all while there are memcgs exceeding their soft limit.
>
> Well... :) I would say that the documentation of soft_limit needs lots
> of work especially after lots of discussions we have after the LSF.
>
> The RFC i sent after our discussion has the following documentation,
> and I only cut & paste the content relevant to our conversation here:
>
> What is "soft_limit"?
> The "soft_limit was introduced in memcg to support over-committing the
> memory resource on the host. Each cgroup can be configured with
> "hard_limit", where it will be throttled or OOM killed by going over
> the limit. However, the allocation can go above the "soft_limit" as
> long as there is no memory contention. The "soft_limit" is the kernel
> mechanism for re-distributing spare memory resource among cgroups.
>
> What we have now?
> The current implementation of softlimit is based on per-zone RB tree,
> where only the cgroup exceeds the soft_limit the most being selected
> for reclaim.
>
> It makes less sense to only reclaim from one cgroup rather than
> reclaiming all cgroups based on calculated propotion. This is required
> for fairness.
>
> Proposed design:
> round-robin across the cgroups where they have memory allocated on the
> zone and also exceed the softlimit configured.
>
> there was a question on how to do zone balancing w/o global LRU. This
> could be solved by building another cgroup list per-zone, where we
> also link cgroups under their soft_limit. We won't scan the list
> unless the first list being exhausted and
> the free pages is still under the high_wmark.
>
> Since the per-zone memcg list design is being replaced by your
> patchset, some of the details doesn't apply. But the concept still
> remains where we would like to scan some memcgs first (above
> soft_limit) .

I think the most important thing we wanted was to round-robin scan all
soft limit excessors instead of just the biggest one. I understood
this is the biggest fault with soft limits right now.

We came up with maintaining a list of excessors, rather than a tree,
and from this particular implementation followed naturally that this
list is scanned BEFORE we look at other memcgs at all.

This is a nice to have, but it was never the primary problem with the
soft limit implementation, as far as I understood.

> > It would be a lie about the current code in the first place, which
> > does soft limit reclaim and then regular reclaim, no matter the
> > outcome of the soft limit reclaim cycle. ?It will go for the soft
> > limit first, but after an allocation under pressure the VM is likely
> > to have reclaimed from other memcgs as well.
> >
> > I saw your patch to fix that and break out of reclaim if soft limit
> > reclaim did enough. ?But this fix is not much newer than my changes.
>
> My soft_limit patch was developed in parallel with your patchset, and
> most of that wouldn't apply here.
> Is that what you are referring to?

No, I meant that the current behaviour is old and we are only changing
it now, so we are not really breaking backward compatibility.

> > The second part of this is:
> >
> > ? ?Please note that soft limits is a best effort feature, it comes with
> > ? ?no guarantees, but it does its best to make sure that when memory is
> > ? ?heavily contended for, memory is allocated based on the soft limit
> > ? ?hints/setup. Currently soft limit based reclaim is setup such that
> > ? ?it gets invoked from balance_pgdat (kswapd).
>
> We had patch merged which add the soft_limit reclaim also in the global ttfp.
>
> memcg-add-the-soft_limit-reclaim-in-global-direct-reclaim.patch
>
> > It's not the pages-over-soft-limit that are best effort. ?It says that
> > it tries its best to take soft limits into account while reclaiming.
> Hmm. Both cases are true. The best effort pages I referring to means
> "the page above the soft_limit are targeted to reclaim first under
> memory contention"

I really don't know where you are taking this from. That is neither
documented anywhere, nor is it the current behaviour.

Yeah, currently the soft limit reclaim cycle precedes the generic
reclaim cycle. But the end result is that other memcgs are reclaimed
from as well in both cases. The exact timing is irrelevant.

And this has been the case for a long time, so I don't think my rework
breaks existing users in that regard.

> > My code does that, so I don't think we are breaking any promises
> > currently made in the documentation.
> >
> > But much more important than keeping documentation promises is not to
> > break actual users. ?So if you are yourself a user of soft limits,
> > test the new code pretty please and complain if it breaks your setup!
>
> Yes, I've been running tests on your patchset, but not getting into
> specific configurations yet. But I don't think it is hard to generate
> the following scenario:
>
> on 32G machine, under root I have three cgroups with 20G hard_limit and
> cgroup-A: soft_limit 1g, usage 20g with clean file pages
> cgroup-B: soft_limit 10g, usage 5g with clean file pages
> cgroup-C: soft_limit 10g, usage 5g with clean file pages
>
> I would assume reclaiming from cgroup-A should be sufficient under
> global memory pressure, and no pages needs to be reclaimed from B or
> C, especially both of them have memory usage under their soft_limit.

Keep in mind that memcgs are scanned proportionally to their size,
that we start out with relatively low scan counts, and that the
priority levels are a logarithmic scale.

The formula is essentially this:

(usage / PAGE_SIZE) >> priority

which means that we would scan as follows, with decreased soft limit
priority for A:

A: ((20 << 30) >> 12) >> 11 = 2560 pages
B: (( 5 << 30) >> 12) >> 12 =  320 pages
C:                          =  320 pages.

So even if B and C are scanned, they are only shrunk by a bit over a
megabyte tops. For decreasing levels (if they are reached at all if
there is clean cache around):

A: 20M 40M 80M 160M ...
B: 2M 4M 8M 16M ...

While it would be sufficient to reclaim only from A, actually
reclaiming from B and C is not a big deal in practice, I would
suspect.
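(A quick userspace check of those numbers, plain C rather than the
actual kernel code; the extra pressure on A is modelled simply as one
priority level less:)

#include <stdio.h>

int main(void)
{
        unsigned long long a = (20ULL << 30) >> 12;     /* 20G of usage, in 4k pages */
        unsigned long long b = (5ULL << 30) >> 12;      /*  5G of usage, in 4k pages */

        printf("A: %llu pages\n", a >> 11);     /* over its soft limit: priority - 1 */
        printf("B: %llu pages\n", b >> 12);
        printf("C: %llu pages\n", b >> 12);
        return 0;
}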

2011-06-09 21:38:37

by Ying Han

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

On Thu, Jun 9, 2011 at 11:36 AM, Johannes Weiner <[email protected]> wrote:
> On Thu, Jun 09, 2011 at 10:36:47AM -0700, Ying Han wrote:
>> On Thu, Jun 9, 2011 at 1:35 AM, Johannes Weiner <[email protected]> wrote:
>> > On Wed, Jun 08, 2011 at 08:52:03PM -0700, Ying Han wrote:
>> >> On Wed, Jun 8, 2011 at 8:32 AM, Johannes Weiner <[email protected]> wrote:
>> >> > I guess it would make much more sense to evaluate if reclaiming from
>> >> > memcgs while there are others exceeding their soft limit is even a
>> >> > problem. ?Otherwise this discussion is pretty pointless.
>> >>
>> >> AFAIK it is a problem since it changes the spec of kernel API
>> >> memory.soft_limit_in_bytes. That value is set per-memcg which all the
>> >> pages allocated above that are best effort and targeted to reclaim
>> >> prior to others.
>> >
>> > That's not really true. ?Quoting the documentation:
>> >
>> > ? ?When the system detects memory contention or low memory, control groups
>> > ? ?are pushed back to their soft limits. If the soft limit of each control
>> > ? ?group is very high, they are pushed back as much as possible to make
>> > ? ?sure that one control group does not starve the others of memory.
>> >
>> > I am language lawyering here, but I don't think it says it won't touch
>> > other memcgs at all while there are memcgs exceeding their soft limit.
>>
>> Well... :) I would say that the documentation of soft_limit needs lots
>> of work especially after lots of discussions we have after the LSF.
>>
>> The RFC i sent after our discussion has the following documentation,
>> and I only cut & paste the content relevant to our conversation here:
>>
>> What is "soft_limit"?
>> The "soft_limit was introduced in memcg to support over-committing the
>> memory resource on the host. Each cgroup can be configured with
>> "hard_limit", where it will be throttled or OOM killed by going over
>> the limit. However, the allocation can go above the "soft_limit" as
>> long as there is no memory contention. The "soft_limit" is the kernel
>> mechanism for re-distributing spare memory resource among cgroups.
>>
>> What we have now?
>> The current implementation of softlimit is based on per-zone RB tree,
>> where only the cgroup exceeds the soft_limit the most being selected
>> for reclaim.
>>
>> It makes less sense to only reclaim from one cgroup rather than
>> reclaiming all cgroups based on calculated propotion. This is required
>> for fairness.
>>
>> Proposed design:
>> round-robin across the cgroups where they have memory allocated on the
>> zone and also exceed the softlimit configured.
>>
>> there was a question on how to do zone balancing w/o global LRU. This
>> could be solved by building another cgroup list per-zone, where we
>> also link cgroups under their soft_limit. We won't scan the list
>> unless the first list being exhausted and
>> the free pages is still under the high_wmark.
>>
>> Since the per-zone memcg list design is being replaced by your
>> patchset, some of the details doesn't apply. But the concept still
>> remains where we would like to scan some memcgs first (above
>> soft_limit) .
>
> I think the most important thing we wanted was to round-robin scan all
> soft limit excessors instead of just the biggest one. ?I understood
> this is the biggest fault with soft limits right now.
>
> We came up with maintaining a list of excessors, rather than a tree,
> and from this particular implementation followed naturally that this
> list is scanned BEFORE we look at other memcgs at all.
>
> This is a nice to have, but it was never the primary problem with the
> soft limit implementation, as far as I understood.
>
>> > It would be a lie about the current code in the first place, which
>> > does soft limit reclaim and then regular reclaim, no matter the
>> > outcome of the soft limit reclaim cycle. ?It will go for the soft
>> > limit first, but after an allocation under pressure the VM is likely
>> > to have reclaimed from other memcgs as well.
>> >
>> > I saw your patch to fix that and break out of reclaim if soft limit
>> > reclaim did enough. ?But this fix is not much newer than my changes.
>>
>> My soft_limit patch was developed in parallel with your patchset, and
>> most of that wouldn't apply here.
>> Is that what you are referring to?
>
> No, I meant that the current behaviour is old and we are only changing
> it only now, so we are not really breaking backward compatibility.
>
>> > The second part of this is:
>> >
>> > ? ?Please note that soft limits is a best effort feature, it comes with
>> > ? ?no guarantees, but it does its best to make sure that when memory is
>> > ? ?heavily contended for, memory is allocated based on the soft limit
>> > ? ?hints/setup. Currently soft limit based reclaim is setup such that
>> > ? ?it gets invoked from balance_pgdat (kswapd).
>>
>> We had patch merged which add the soft_limit reclaim also in the global ttfp.
>>
>> memcg-add-the-soft_limit-reclaim-in-global-direct-reclaim.patch
>>
>> > It's not the pages-over-soft-limit that are best effort. ?It says that
>> > it tries its best to take soft limits into account while reclaiming.
>> Hmm. Both cases are true. The best effort pages I referring to means
>> "the page above the soft_limit are targeted to reclaim first under
>> memory contention"
>
> I really don't know where you are taking this from. ?That is neither
> documented anywhere, nor is it the current behaviour.
>
> Yeah, currently the soft limit reclaim cycle preceeds the generic
> reclaim cycle. ?But the end result is that other memcgs are reclaimed
> from as well in both cases. ?The exact timing is irrelevant.
>
> And this has been the case for a long time, so I don't think my rework
> breaks existing users in that regard.
>
>> > My code does that, so I don't think we are breaking any promises
>> > currently made in the documentation.
>> >
>> > But much more important than keeping documentation promises is not to
>> > break actual users. ?So if you are yourself a user of soft limits,
>> > test the new code pretty please and complain if it breaks your setup!
>>
>> Yes, I've been running tests on your patchset, but not getting into
>> specific configurations yet. But I don't think it is hard to generate
>> the following scenario:
>>
>> on 32G machine, under root I have three cgroups with 20G hard_limit and
>> cgroup-A: soft_limit 1g, usage 20g with clean file pages
>> cgroup-B: soft_limit 10g, usage 5g with clean file pages
>> cgroup-C: soft_limit 10g, usage 5g with clean file pages
>>
>> I would assume reclaiming from cgroup-A should be sufficient under
>> global memory pressure, and no pages needs to be reclaimed from B or
>> C, especially both of them have memory usage under their soft_limit.
>
> Keep in mind that memcgs are scanned proportionally to their size,
> that we start out with relatively low scan counts, and that the
> priority levels are a logarithmic scale.
>
> The formula is essentially this:
>
>        (usage / PAGE_SIZE) >> priority
>
> which means that we would scan as follows, with decreased soft limit
> priority for A:
>
>        A: ((20 << 30) >> 12) >> 11 = 2560 pages
>        B: (( 5 << 30) >> 12) >> 12 =  320 pages
>        C:                          =  320 pages.
>
> So even if B and C are scanned, they are only shrunk by a bit over a
> megabyte tops.  For decreasing levels (if they are reached at all if
> there is clean cache around):
>
>        A: 20M 40M 80M 160M ...
>        B:  2M  4M  8M  16M ...
>
> While it would be sufficient to reclaim only from A, actually
> reclaiming from B and C is not a big deal in practice, I would
> suspect.

One way to think about how the user will set the soft_limit (in our
case, for example) is to set the soft_limit to the working set size
of the cgroup (by doing working set estimation).

The soft_limit will be readjusted at run time based on the workload.
In that case, shrinking memory from B/C has a potential performance
impact on the application. It doesn't mean we can never reclaim pages
from them, but shrinking from A (whose usage is 19G above its
soft_limit) first will provide better predictability of performance.

--Ying

>

2011-06-09 22:30:49

by Ying Han

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

On Thu, Jun 9, 2011 at 11:36 AM, Johannes Weiner <[email protected]> wrote:
> On Thu, Jun 09, 2011 at 10:36:47AM -0700, Ying Han wrote:
>> On Thu, Jun 9, 2011 at 1:35 AM, Johannes Weiner <[email protected]> wrote:
>> > On Wed, Jun 08, 2011 at 08:52:03PM -0700, Ying Han wrote:
>> >> On Wed, Jun 8, 2011 at 8:32 AM, Johannes Weiner <[email protected]> wrote:
>> >> > I guess it would make much more sense to evaluate if reclaiming from
>> >> > memcgs while there are others exceeding their soft limit is even a
>> >> > problem. ?Otherwise this discussion is pretty pointless.
>> >>
>> >> AFAIK it is a problem since it changes the spec of kernel API
>> >> memory.soft_limit_in_bytes. That value is set per-memcg which all the
>> >> pages allocated above that are best effort and targeted to reclaim
>> >> prior to others.
>> >
>> > That's not really true. ?Quoting the documentation:
>> >
>> > ? ?When the system detects memory contention or low memory, control groups
>> > ? ?are pushed back to their soft limits. If the soft limit of each control
>> > ? ?group is very high, they are pushed back as much as possible to make
>> > ? ?sure that one control group does not starve the others of memory.
>> >
>> > I am language lawyering here, but I don't think it says it won't touch
>> > other memcgs at all while there are memcgs exceeding their soft limit.
>>
>> Well... :) I would say that the documentation of soft_limit needs lots
>> of work especially after lots of discussions we have after the LSF.
>>
>> The RFC i sent after our discussion has the following documentation,
>> and I only cut & paste the content relevant to our conversation here:
>>
>> What is "soft_limit"?
>> The "soft_limit was introduced in memcg to support over-committing the
>> memory resource on the host. Each cgroup can be configured with
>> "hard_limit", where it will be throttled or OOM killed by going over
>> the limit. However, the allocation can go above the "soft_limit" as
>> long as there is no memory contention. The "soft_limit" is the kernel
>> mechanism for re-distributing spare memory resource among cgroups.
>>
>> What we have now?
>> The current implementation of softlimit is based on per-zone RB tree,
>> where only the cgroup exceeds the soft_limit the most being selected
>> for reclaim.
>>
>> It makes less sense to only reclaim from one cgroup rather than
>> reclaiming all cgroups based on calculated propotion. This is required
>> for fairness.
>>
>> Proposed design:
>> round-robin across the cgroups where they have memory allocated on the
>> zone and also exceed the softlimit configured.
>>
>> there was a question on how to do zone balancing w/o global LRU. This
>> could be solved by building another cgroup list per-zone, where we
>> also link cgroups under their soft_limit. We won't scan the list
>> unless the first list being exhausted and
>> the free pages is still under the high_wmark.
>>
>> Since the per-zone memcg list design is being replaced by your
>> patchset, some of the details doesn't apply. But the concept still
>> remains where we would like to scan some memcgs first (above
>> soft_limit) .
>
> I think the most important thing we wanted was to round-robin scan all
> soft limit excessors instead of just the biggest one. ?I understood
> this is the biggest fault with soft limits right now.
>
> We came up with maintaining a list of excessors, rather than a tree,
> and from this particular implementation followed naturally that this
> list is scanned BEFORE we look at other memcgs at all.
>
> This is a nice to have, but it was never the primary problem with the
> soft limit implementation, as far as I understood.
>
>> > It would be a lie about the current code in the first place, which
>> > does soft limit reclaim and then regular reclaim, no matter the
>> > outcome of the soft limit reclaim cycle. ?It will go for the soft
>> > limit first, but after an allocation under pressure the VM is likely
>> > to have reclaimed from other memcgs as well.
>> >
>> > I saw your patch to fix that and break out of reclaim if soft limit
>> > reclaim did enough. ?But this fix is not much newer than my changes.
>>
>> My soft_limit patch was developed in parallel with your patchset, and
>> most of that wouldn't apply here.
>> Is that what you are referring to?
>
> No, I meant that the current behaviour is old and we are only changing
> it only now, so we are not really breaking backward compatibility.
>
>> > The second part of this is:
>> >
>> > ? ?Please note that soft limits is a best effort feature, it comes with
>> > ? ?no guarantees, but it does its best to make sure that when memory is
>> > ? ?heavily contended for, memory is allocated based on the soft limit
>> > ? ?hints/setup. Currently soft limit based reclaim is setup such that
>> > ? ?it gets invoked from balance_pgdat (kswapd).
>>
>> We had patch merged which add the soft_limit reclaim also in the global ttfp.
>>
>> memcg-add-the-soft_limit-reclaim-in-global-direct-reclaim.patch
>>
>> > It's not the pages-over-soft-limit that are best effort. ?It says that
>> > it tries its best to take soft limits into account while reclaiming.
>> Hmm. Both cases are true. The best effort pages I referring to means
>> "the page above the soft_limit are targeted to reclaim first under
>> memory contention"
>
> I really don't know where you are taking this from. ?That is neither
> documented anywhere, nor is it the current behaviour.

I got the email from Andrew on May 27 and you were cc'ed :)
Anyway, I just forwarded you that one.

--Ying

>
> Yeah, currently the soft limit reclaim cycle preceeds the generic
> reclaim cycle. ?But the end result is that other memcgs are reclaimed
> from as well in both cases. ?The exact timing is irrelevant.
>
> And this has been the case for a long time, so I don't think my rework
> breaks existing users in that regard.
>
>> > My code does that, so I don't think we are breaking any promises
>> > currently made in the documentation.
>> >
>> > But much more important than keeping documentation promises is not to
>> > break actual users. ?So if you are yourself a user of soft limits,
>> > test the new code pretty please and complain if it breaks your setup!
>>
>> Yes, I've been running tests on your patchset, but not getting into
>> specific configurations yet. But I don't think it is hard to generate
>> the following scenario:
>>
>> on 32G machine, under root I have three cgroups with 20G hard_limit and
>> cgroup-A: soft_limit 1g, usage 20g with clean file pages
>> cgroup-B: soft_limit 10g, usage 5g with clean file pages
>> cgroup-C: soft_limit 10g, usage 5g with clean file pages
>>
>> I would assume reclaiming from cgroup-A should be sufficient under
>> global memory pressure, and no pages needs to be reclaimed from B or
>> C, especially both of them have memory usage under their soft_limit.
>
> Keep in mind that memcgs are scanned proportionally to their size,
> that we start out with relatively low scan counts, and that the
> priority levels are a logarithmic scale.
>
> The formula is essentially this:
>
>        (usage / PAGE_SIZE) >> priority
>
> which means that we would scan as follows, with decreased soft limit
> priority for A:
>
>        A: ((20 << 30) >> 12) >> 11 = 2560 pages
>        B: (( 5 << 30) >> 12) >> 12 =  320 pages
>        C:                          =  320 pages.
>
> So even if B and C are scanned, they are only shrunk by a bit over a
> megabyte tops.  For decreasing levels (if they are reached at all if
> there is clean cache around):
>
>        A: 20M 40M 80M 160M ...
>        B:  2M  4M  8M  16M ...
>
> While it would be sufficient to reclaim only from A, actually
> reclaiming from B and C is not a big deal in practice, I would
> suspect.
>

2011-06-09 23:32:22

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

On Thu, Jun 09, 2011 at 03:30:27PM -0700, Ying Han wrote:
> On Thu, Jun 9, 2011 at 11:36 AM, Johannes Weiner <[email protected]> wrote:
> > On Thu, Jun 09, 2011 at 10:36:47AM -0700, Ying Han wrote:
> >> On Thu, Jun 9, 2011 at 1:35 AM, Johannes Weiner <[email protected]> wrote:
> >> > On Wed, Jun 08, 2011 at 08:52:03PM -0700, Ying Han wrote:
> >> >> On Wed, Jun 8, 2011 at 8:32 AM, Johannes Weiner <[email protected]> wrote:
> >> >> > I guess it would make much more sense to evaluate if reclaiming from
> >> >> > memcgs while there are others exceeding their soft limit is even a
> >> >> > problem. ?Otherwise this discussion is pretty pointless.
> >> >>
> >> >> AFAIK it is a problem since it changes the spec of kernel API
> >> >> memory.soft_limit_in_bytes. That value is set per-memcg which all the
> >> >> pages allocated above that are best effort and targeted to reclaim
> >> >> prior to others.
> >> >
> >> > That's not really true. ?Quoting the documentation:
> >> >
> >> > ? ?When the system detects memory contention or low memory, control groups
> >> > ? ?are pushed back to their soft limits. If the soft limit of each control
> >> > ? ?group is very high, they are pushed back as much as possible to make
> >> > ? ?sure that one control group does not starve the others of memory.
> >> >
> >> > I am language lawyering here, but I don't think it says it won't touch
> >> > other memcgs at all while there are memcgs exceeding their soft limit.
> >>
> >> Well... :) I would say that the documentation of soft_limit needs lots
> >> of work especially after lots of discussions we have after the LSF.
> >>
> >> The RFC i sent after our discussion has the following documentation,
> >> and I only cut & paste the content relevant to our conversation here:
> >>
> >> What is "soft_limit"?
> >> The "soft_limit was introduced in memcg to support over-committing the
> >> memory resource on the host. Each cgroup can be configured with
> >> "hard_limit", where it will be throttled or OOM killed by going over
> >> the limit. However, the allocation can go above the "soft_limit" as
> >> long as there is no memory contention. The "soft_limit" is the kernel
> >> mechanism for re-distributing spare memory resource among cgroups.
> >>
> >> What we have now?
> >> The current implementation of softlimit is based on per-zone RB tree,
> >> where only the cgroup exceeds the soft_limit the most being selected
> >> for reclaim.
> >>
> >> It makes less sense to only reclaim from one cgroup rather than
> >> reclaiming all cgroups based on calculated propotion. This is required
> >> for fairness.
> >>
> >> Proposed design:
> >> round-robin across the cgroups where they have memory allocated on the
> >> zone and also exceed the softlimit configured.
> >>
> >> there was a question on how to do zone balancing w/o global LRU. This
> >> could be solved by building another cgroup list per-zone, where we
> >> also link cgroups under their soft_limit. We won't scan the list
> >> unless the first list being exhausted and
> >> the free pages is still under the high_wmark.
> >>
> >> Since the per-zone memcg list design is being replaced by your
> >> patchset, some of the details doesn't apply. But the concept still
> >> remains where we would like to scan some memcgs first (above
> >> soft_limit) .
> >
> > I think the most important thing we wanted was to round-robin scan all
> > soft limit excessors instead of just the biggest one. ?I understood
> > this is the biggest fault with soft limits right now.
> >
> > We came up with maintaining a list of excessors, rather than a tree,
> > and from this particular implementation followed naturally that this
> > list is scanned BEFORE we look at other memcgs at all.
> >
> > This is a nice to have, but it was never the primary problem with the
> > soft limit implementation, as far as I understood.
> >
> >> > It would be a lie about the current code in the first place, which
> >> > does soft limit reclaim and then regular reclaim, no matter the
> >> > outcome of the soft limit reclaim cycle. ?It will go for the soft
> >> > limit first, but after an allocation under pressure the VM is likely
> >> > to have reclaimed from other memcgs as well.
> >> >
> >> > I saw your patch to fix that and break out of reclaim if soft limit
> >> > reclaim did enough. ?But this fix is not much newer than my changes.
> >>
> >> My soft_limit patch was developed in parallel with your patchset, and
> >> most of that wouldn't apply here.
> >> Is that what you are referring to?
> >
> > No, I meant that the current behaviour is old and we are only changing
> > it only now, so we are not really breaking backward compatibility.
> >
> >> > The second part of this is:
> >> >
> >> > ? ?Please note that soft limits is a best effort feature, it comes with
> >> > ? ?no guarantees, but it does its best to make sure that when memory is
> >> > ? ?heavily contended for, memory is allocated based on the soft limit
> >> > ? ?hints/setup. Currently soft limit based reclaim is setup such that
> >> > ? ?it gets invoked from balance_pgdat (kswapd).
> >>
> >> We had patch merged which add the soft_limit reclaim also in the global ttfp.
> >>
> >> memcg-add-the-soft_limit-reclaim-in-global-direct-reclaim.patch
> >>
> >> > It's not the pages-over-soft-limit that are best effort. ?It says that
> >> > it tries its best to take soft limits into account while reclaiming.
> >> Hmm. Both cases are true. The best effort pages I referring to means
> >> "the page above the soft_limit are targeted to reclaim first under
> >> memory contention"
> >
> > I really don't know where you are taking this from. ?That is neither
> > documented anywhere, nor is it the current behaviour.
>
> I got the email from andrew on may 27 and you were on the cc-ed :)
> Anyway, i just forwarded you that one.

I wasn't asking about this patch at all... This is the conversation:

Me:

> >> > It's not the pages-over-soft-limit that are best effort. ?It says that
> >> > it tries its best to take soft limits into account while reclaiming.

You:

> >> Hmm. Both cases are true. The best effort pages I referring to means
> >> "the page above the soft_limit are targeted to reclaim first under
> >> memory contention"

Me:

> > I really don't know where you are taking this from. ?That is neither
> > documented anywhere, nor is it the current behaviour.

And this is still my question.

Current: scan up to all pages of the biggest soft limit offender, then
reclaim from random memcgs (because of the global LRU).

After my patch: scan all memcgs according to their size, with double
the pressure on those over their soft limit.
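(Only to illustrate what "double the pressure" means here, not the
exact code: one extra right-shift level doubles the scan target for a
soft limit excessor.)

static unsigned long scan_target(unsigned long lru_pages, int priority,
                                 int over_soft_limit)
{
        /* one extra shift level, i.e. twice the scan target, for excessors */
        if (over_soft_limit && priority > 0)
                priority--;
        return lru_pages >> priority;
}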

Please tell me exactly how my patch regresses existing behaviour, a
user interface, a documented feature, etc.

If you have an even better idea, please propose it.

2011-06-09 23:41:16

by Minchan Kim

[permalink] [raw]
Subject: Re: [patch 2/8] mm: memcg-aware global reclaim

On Fri, Jun 10, 2011 at 2:23 AM, Johannes Weiner <[email protected]> wrote:
> On Fri, Jun 10, 2011 at 12:48:39AM +0900, Minchan Kim wrote:
>> On Wed, Jun 01, 2011 at 08:25:13AM +0200, Johannes Weiner wrote:
>> > When a memcg hits its hard limit, hierarchical target reclaim is
>> > invoked, which goes through all contributing memcgs in the hierarchy
>> > below the offending memcg and reclaims from the respective per-memcg
>> > lru lists.  This distributes pressure fairly among all involved
>> > memcgs, and pages are aged with respect to their list buddies.
>> >
>> > When global memory pressure arises, however, all this is dropped
>> > overboard.  Pages are reclaimed based on global lru lists that have
>> > nothing to do with container-internal age, and some memcgs may be
>> > reclaimed from much more than others.
>> >
>> > This patch makes traditional global reclaim consider container
>> > boundaries and no longer scan the global lru lists.  For each zone
>> > scanned, the memcg hierarchy is walked and pages are reclaimed from
>> > the per-memcg lru lists of the respective zone.  For now, the
>> > hierarchy walk is bounded to one full round-trip through the
>> > hierarchy, or if the number of reclaimed pages reach the overall
>> > reclaim target, whichever comes first.
>> >
>> > Conceptually, global memory pressure is then treated as if the root
>> > memcg had hit its limit.  Since all existing memcgs contribute to the
>> > usage of the root memcg, global reclaim is nothing more than target
>> > reclaim starting from the root memcg.  The code is mostly the same for
>> > both cases, except for a few heuristics and statistics that do not
>> > always apply.  They are distinguished by a newly introduced
>> > global_reclaim() primitive.
>> >
>> > One implication of this change is that pages have to be linked to the
>> > lru lists of the root memcg again, which could be optimized away with
>> > the old scheme.  The costs are not measurable, though, even with
>> > worst-case microbenchmarks.
>> >
>> > As global reclaim no longer relies on global lru lists, this change is
>> > also in preparation to remove those completely.
>
> [cut diff]
>
>> I didn't look at all, still. You might change the logic later patches.
>> If I understand this patch right, it does round-robin reclaim in all memcgs
>> when global memory pressure happens.
>>
>> Let's consider this memcg size unbalance case.
>>
>> If A-memcg has lots of LRU pages, scanning count for reclaim would be bigger
>> so the chance to reclaim the pages would be higher.
>> If we reclaim A-memcg, we can reclaim the number of pages we want easily and break.
>> Next reclaim will happen at some time and reclaim will start the B-memcg of A-memcg
>> we reclaimed successfully before. But unfortunately B-memcg has small lru so
>> scanning count would be small and small memcg's LRU aging is higher than bigger memcg.
>> It means small memcg's working set can be evicted easily than big memcg.
>> my point is that we should not set next memcg easily.
>> We have to consider memcg LRU size.
>
> I may be missing something, but you said yourself that B had a smaller
> scan count compared to A, so the aging speed should be proportional to
> respective size.
>
> The number of pages scanned per iteration is essentially
>
>        number of lru pages in memcg-zone >> priority
>
> so we scan relatively more pages from B than from A each round.
>
> It's the exact same logic we have been applying traditionally to
> distribute pressure fairly among zones to equalize their aging speed.
>
> Is that what you meant or are we talking past each other?

True, if we can reclaim pages easily (ie, at the default priority) in all memcgs.
But let's think about it.
Normally the direct reclaim path reclaims only SWAP_CLUSTER_MAX pages.
If we have a small memcg, the scan window is smaller, so reclaiming at
a given priority is harder than in a bigger memcg. That means a small
memcg raises the priority more easily and might even end up in lumpy
reclaim or compaction under global memory pressure, which can churn
the whole LRU order. :(
Of course, we have a bailout routine, so we might keep such unfair
aging effects small, but it's not the same as the old behavior (ie, a
single LRU list with globally fair aging as the priority is raised).




--
Kind regards,
Minchan Kim

2011-06-09 23:48:40

by Minchan Kim

[permalink] [raw]
Subject: Re: [patch 2/8] mm: memcg-aware global reclaim

On Fri, Jun 10, 2011 at 8:41 AM, Minchan Kim <[email protected]> wrote:
> On Fri, Jun 10, 2011 at 2:23 AM, Johannes Weiner <[email protected]> wrote:
>> On Fri, Jun 10, 2011 at 12:48:39AM +0900, Minchan Kim wrote:
>>> On Wed, Jun 01, 2011 at 08:25:13AM +0200, Johannes Weiner wrote:
>>> > When a memcg hits its hard limit, hierarchical target reclaim is
>>> > invoked, which goes through all contributing memcgs in the hierarchy
>>> > below the offending memcg and reclaims from the respective per-memcg
>>> > lru lists.  This distributes pressure fairly among all involved
>>> > memcgs, and pages are aged with respect to their list buddies.
>>> >
>>> > When global memory pressure arises, however, all this is dropped
>>> > overboard.  Pages are reclaimed based on global lru lists that have
>>> > nothing to do with container-internal age, and some memcgs may be
>>> > reclaimed from much more than others.
>>> >
>>> > This patch makes traditional global reclaim consider container
>>> > boundaries and no longer scan the global lru lists.  For each zone
>>> > scanned, the memcg hierarchy is walked and pages are reclaimed from
>>> > the per-memcg lru lists of the respective zone.  For now, the
>>> > hierarchy walk is bounded to one full round-trip through the
>>> > hierarchy, or if the number of reclaimed pages reach the overall
>>> > reclaim target, whichever comes first.
>>> >
>>> > Conceptually, global memory pressure is then treated as if the root
>>> > memcg had hit its limit.  Since all existing memcgs contribute to the
>>> > usage of the root memcg, global reclaim is nothing more than target
>>> > reclaim starting from the root memcg.  The code is mostly the same for
>>> > both cases, except for a few heuristics and statistics that do not
>>> > always apply.  They are distinguished by a newly introduced
>>> > global_reclaim() primitive.
>>> >
>>> > One implication of this change is that pages have to be linked to the
>>> > lru lists of the root memcg again, which could be optimized away with
>>> > the old scheme.  The costs are not measurable, though, even with
>>> > worst-case microbenchmarks.
>>> >
>>> > As global reclaim no longer relies on global lru lists, this change is
>>> > also in preparation to remove those completely.
>>
>> [cut diff]
>>
>>> I didn't look at all, still. You might change the logic later patches.
>>> If I understand this patch right, it does round-robin reclaim in all memcgs
>>> when global memory pressure happens.
>>>
>>> Let's consider this memcg size unbalance case.
>>>
>>> If A-memcg has lots of LRU pages, scanning count for reclaim would be bigger
>>> so the chance to reclaim the pages would be higher.
>>> If we reclaim A-memcg, we can reclaim the number of pages we want easily and break.
>>> Next reclaim will happen at some time and reclaim will start the B-memcg of A-memcg
>>> we reclaimed successfully before. But unfortunately B-memcg has small lru so
>>> scanning count would be small and small memcg's LRU aging is higher than bigger memcg.
>>> It means small memcg's working set can be evicted easily than big memcg.
>>> my point is that we should not set next memcg easily.
>>> We have to consider memcg LRU size.
>>
>> I may be missing something, but you said yourself that B had a smaller
>> scan count compared to A, so the aging speed should be proportional to
>> respective size.
>>
>> The number of pages scanned per iteration is essentially
>>
>>        number of lru pages in memcg-zone >> priority
>>
>> so we scan relatively more pages from B than from A each round.
>>
>> It's the exact same logic we have been applying traditionally to
>> distribute pressure fairly among zones to equalize their aging speed.
>>
>> Is that what you meant or are we talking past each other?
>
> True if we can reclaim pages easily(ie, default priority) in all memcgs.
> But let's think about it.
> Normally direct reclaim path reclaims only SWAP_CLUSTER_MAX size.
> If we have small memcg, scan window size would be smaller and it is
> likely to be hard reclaim in the priority compared to bigger memcg. It
> means it can raise priority easily in small memcg and even it might
> call lumpy or compaction in case of global memory pressure. It can
> churn all LRU order. :(
> Of course, we have bailout routine so we might make such unfair aging
> effect small but it's not same with old behavior(ie, single LRU list,
> fair aging POV global according to priority raise)

To make it fair, how about turning over to a different memcg before
raising the priority?
It would keep the aging speed fair, although it could cause high
contention on lru_lock. :(



--
Kind regards,
Minchan Kim

2011-06-10 00:17:12

by Ying Han

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

On Thu, Jun 9, 2011 at 4:31 PM, Johannes Weiner <[email protected]> wrote:
> On Thu, Jun 09, 2011 at 03:30:27PM -0700, Ying Han wrote:
>> On Thu, Jun 9, 2011 at 11:36 AM, Johannes Weiner <[email protected]> wrote:
>> > On Thu, Jun 09, 2011 at 10:36:47AM -0700, Ying Han wrote:
>> >> On Thu, Jun 9, 2011 at 1:35 AM, Johannes Weiner <[email protected]> wrote:
>> >> > On Wed, Jun 08, 2011 at 08:52:03PM -0700, Ying Han wrote:
>> >> >> On Wed, Jun 8, 2011 at 8:32 AM, Johannes Weiner <[email protected]> wrote:
>> >> >> > I guess it would make much more sense to evaluate if reclaiming from
>> >> >> > memcgs while there are others exceeding their soft limit is even a
>> >> >> > problem. ?Otherwise this discussion is pretty pointless.
>> >> >>
>> >> >> AFAIK it is a problem since it changes the spec of kernel API
>> >> >> memory.soft_limit_in_bytes. That value is set per-memcg which all the
>> >> >> pages allocated above that are best effort and targeted to reclaim
>> >> >> prior to others.
>> >> >
>> >> > That's not really true. ?Quoting the documentation:
>> >> >
>> >> > ? ?When the system detects memory contention or low memory, control groups
>> >> > ? ?are pushed back to their soft limits. If the soft limit of each control
>> >> > ? ?group is very high, they are pushed back as much as possible to make
>> >> > ? ?sure that one control group does not starve the others of memory.
>> >> >
>> >> > I am language lawyering here, but I don't think it says it won't touch
>> >> > other memcgs at all while there are memcgs exceeding their soft limit.
>> >>
>> >> Well... :) I would say that the documentation of soft_limit needs lots
>> >> of work especially after lots of discussions we have after the LSF.
>> >>
>> >> The RFC i sent after our discussion has the following documentation,
>> >> and I only cut & paste the content relevant to our conversation here:
>> >>
>> >> What is "soft_limit"?
>> >> The "soft_limit was introduced in memcg to support over-committing the
>> >> memory resource on the host. Each cgroup can be configured with
>> >> "hard_limit", where it will be throttled or OOM killed by going over
>> >> the limit. However, the allocation can go above the "soft_limit" as
>> >> long as there is no memory contention. The "soft_limit" is the kernel
>> >> mechanism for re-distributing spare memory resource among cgroups.
>> >>
>> >> What we have now?
>> >> The current implementation of softlimit is based on per-zone RB tree,
>> >> where only the cgroup exceeds the soft_limit the most being selected
>> >> for reclaim.
>> >>
>> >> It makes less sense to only reclaim from one cgroup rather than
>> >> reclaiming all cgroups based on calculated propotion. This is required
>> >> for fairness.
>> >>
>> >> Proposed design:
>> >> round-robin across the cgroups where they have memory allocated on the
>> >> zone and also exceed the softlimit configured.
>> >>
>> >> there was a question on how to do zone balancing w/o global LRU. This
>> >> could be solved by building another cgroup list per-zone, where we
>> >> also link cgroups under their soft_limit. We won't scan the list
>> >> unless the first list being exhausted and
>> >> the free pages is still under the high_wmark.
>> >>
>> >> Since the per-zone memcg list design is being replaced by your
>> >> patchset, some of the details doesn't apply. But the concept still
>> >> remains where we would like to scan some memcgs first (above
>> >> soft_limit) .
>> >
>> > I think the most important thing we wanted was to round-robin scan all
>> > soft limit excessors instead of just the biggest one. ?I understood
>> > this is the biggest fault with soft limits right now.
>> >
>> > We came up with maintaining a list of excessors, rather than a tree,
>> > and from this particular implementation followed naturally that this
>> > list is scanned BEFORE we look at other memcgs at all.
>> >
>> > This is a nice to have, but it was never the primary problem with the
>> > soft limit implementation, as far as I understood.
>> >
>> >> > It would be a lie about the current code in the first place, which
>> >> > does soft limit reclaim and then regular reclaim, no matter the
>> >> > outcome of the soft limit reclaim cycle. ?It will go for the soft
>> >> > limit first, but after an allocation under pressure the VM is likely
>> >> > to have reclaimed from other memcgs as well.
>> >> >
>> >> > I saw your patch to fix that and break out of reclaim if soft limit
>> >> > reclaim did enough. ?But this fix is not much newer than my changes.
>> >>
>> >> My soft_limit patch was developed in parallel with your patchset, and
>> >> most of that wouldn't apply here.
>> >> Is that what you are referring to?
>> >
>> > No, I meant that the current behaviour is old and we are only changing
>> > it only now, so we are not really breaking backward compatibility.
>> >
>> >> > The second part of this is:
>> >> >
>> >> > ? ?Please note that soft limits is a best effort feature, it comes with
>> >> > ? ?no guarantees, but it does its best to make sure that when memory is
>> >> > ? ?heavily contended for, memory is allocated based on the soft limit
>> >> > ? ?hints/setup. Currently soft limit based reclaim is setup such that
>> >> > ? ?it gets invoked from balance_pgdat (kswapd).
>> >>
>> >> We had patch merged which add the soft_limit reclaim also in the global ttfp.
>> >>
>> >> memcg-add-the-soft_limit-reclaim-in-global-direct-reclaim.patch
>> >>
>> >> > It's not the pages-over-soft-limit that are best effort. ?It says that
>> >> > it tries its best to take soft limits into account while reclaiming.
>> >> Hmm. Both cases are true. The best effort pages I referring to means
>> >> "the page above the soft_limit are targeted to reclaim first under
>> >> memory contention"
>> >
>> > I really don't know where you are taking this from. ?That is neither
>> > documented anywhere, nor is it the current behaviour.
>>
>> I got the email from andrew on may 27 and you were on the cc-ed :)
>> Anyway, i just forwarded you that one.
>
> I wasn't asking about this patch at all... ?This is the conversation:
>
> Me:
>
>> >> > It's not the pages-over-soft-limit that are best effort. ?It says that
>> >> > it tries its best to take soft limits into account while reclaiming.
>
> You:
>
>> >> Hmm. Both cases are true. The best effort pages I referring to means
>> >> "the page above the soft_limit are targeted to reclaim first under
>> >> memory contention"
>
> Me:
>
>> > I really don't know where you are taking this from. ?That is neither
>> > documented anywhere, nor is it the current behaviour.
>
> And this is still my question.
>
> Current: scan up to all pages of the biggest soft limit offender, then
> reclaim from random memcgs (because of the global LRU).
agree.

>
> After my patch: scan all memcgs according to their size, with double
> the pressure on those over their soft limit.
agree.
>
> Please tell me exactly how my patch regresses existing behaviour, a
> user interface, a documented feature, etc.
>

Ok, thank you for clarifying it. Now I understand where the confusion is.

I agree that your patch doesn't regress from what we currently have.
What I referred to earlier was an improvement over the current design,
so we were comparing against two different baselines.

Please go ahead with your patch; I don't have a problem with it now.
I will propose the soft_limit reclaim improvement as a separate
thread.

Thanks

--Ying

> If you have an even better idea, please propose it.
>

2011-06-10 00:34:51

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 2/8] mm: memcg-aware global reclaim

On Fri, Jun 10, 2011 at 08:47:55AM +0900, Minchan Kim wrote:
> On Fri, Jun 10, 2011 at 8:41 AM, Minchan Kim <[email protected]> wrote:
> > On Fri, Jun 10, 2011 at 2:23 AM, Johannes Weiner <[email protected]> wrote:
> >> On Fri, Jun 10, 2011 at 12:48:39AM +0900, Minchan Kim wrote:
> >>> On Wed, Jun 01, 2011 at 08:25:13AM +0200, Johannes Weiner wrote:
> >>> > When a memcg hits its hard limit, hierarchical target reclaim is
> >>> > invoked, which goes through all contributing memcgs in the hierarchy
> >>> > below the offending memcg and reclaims from the respective per-memcg
> >>> > lru lists. ?This distributes pressure fairly among all involved
> >>> > memcgs, and pages are aged with respect to their list buddies.
> >>> >
> >>> > When global memory pressure arises, however, all this is dropped
> >>> > overboard. ?Pages are reclaimed based on global lru lists that have
> >>> > nothing to do with container-internal age, and some memcgs may be
> >>> > reclaimed from much more than others.
> >>> >
> >>> > This patch makes traditional global reclaim consider container
> >>> > boundaries and no longer scan the global lru lists. ?For each zone
> >>> > scanned, the memcg hierarchy is walked and pages are reclaimed from
> >>> > the per-memcg lru lists of the respective zone. ?For now, the
> >>> > hierarchy walk is bounded to one full round-trip through the
> >>> > hierarchy, or if the number of reclaimed pages reach the overall
> >>> > reclaim target, whichever comes first.
> >>> >
> >>> > Conceptually, global memory pressure is then treated as if the root
> >>> > memcg had hit its limit. ?Since all existing memcgs contribute to the
> >>> > usage of the root memcg, global reclaim is nothing more than target
> >>> > reclaim starting from the root memcg. ?The code is mostly the same for
> >>> > both cases, except for a few heuristics and statistics that do not
> >>> > always apply. ?They are distinguished by a newly introduced
> >>> > global_reclaim() primitive.
> >>> >
> >>> > One implication of this change is that pages have to be linked to the
> >>> > lru lists of the root memcg again, which could be optimized away with
> >>> > the old scheme. ?The costs are not measurable, though, even with
> >>> > worst-case microbenchmarks.
> >>> >
> >>> > As global reclaim no longer relies on global lru lists, this change is
> >>> > also in preparation to remove those completely.
> >>
> >> [cut diff]
> >>
> >>> I didn't look at all, still. You might change the logic later patches.
> >>> If I understand this patch right, it does round-robin reclaim in all memcgs
> >>> when global memory pressure happens.
> >>>
> >>> Let's consider this memcg size unbalance case.
> >>>
> >>> If A-memcg has lots of LRU pages, scanning count for reclaim would be bigger
> >>> so the chance to reclaim the pages would be higher.
> >>> If we reclaim A-memcg, we can reclaim the number of pages we want easily and break.
> >>> Next reclaim will happen at some time and reclaim will start the B-memcg of A-memcg
> >>> we reclaimed successfully before. But unfortunately B-memcg has small lru so
> >>> scanning count would be small and small memcg's LRU aging is higher than bigger memcg.
> >>> It means small memcg's working set can be evicted easily than big memcg.
> >>> my point is that we should not set next memcg easily.
> >>> We have to consider memcg LRU size.
> >>
> >> I may be missing something, but you said yourself that B had a smaller
> >> scan count compared to A, so the aging speed should be proportional to
> >> respective size.
> >>
> >> The number of pages scanned per iteration is essentially
> >>
> >>        number of lru pages in memcg-zone >> priority
> >>
> >> so we scan relatively more pages from B than from A each round.
> >>
> >> It's the exact same logic we have been applying traditionally to
> >> distribute pressure fairly among zones to equalize their aging speed.
> >>
> >> Is that what you meant or are we talking past each other?
> >
> > True if we can reclaim pages easily(ie, default priority) in all memcgs.
> > But let's think about it.
> > Normally direct reclaim path reclaims only SWAP_CLUSTER_MAX size.
> > If we have small memcg, scan window size would be smaller and it is
> > likely to be hard reclaim in the priority compared to bigger memcg. It
> > means it can raise priority easily in small memcg and even it might
> > call lumpy or compaction in case of global memory pressure. It can
> > churn all LRU order. :(
> > Of course, we have bailout routine so we might make such unfair aging
> > effect small but it's not same with old behavior(ie, single LRU list,
> > fair aging POV global according to priority raise)
>
> To make fair, how about considering turn over different memcg before
> raise up priority?
> It can make aging speed fairly while it can make high contention of
> lru_lock. :(

Actually, the way you describe it is how it used to work for limit
reclaim before my patches. It would select one memcg, then reclaim
with increasing priority until SWAP_CLUSTER_MAX were reclaimed.

        memcg = select_victim()
        for each prio:
          for each zone:
            shrink_zone(prio, zone, sc = { .mem_cgroup = memcg })

What it's supposed to do with my patches is scan all memcgs in the
hierarchy at the same priority. If it hasn't made progress, it will
increase the priority and iterate again over the hierarchy.

        for each prio:
          for each zone:
            for each memcg:
              do_shrink_zone(prio, zone, sc = { .mem_cgroup = memcg })
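
To make that iteration order concrete, here is a minimal C sketch of
what shrink_zone() ends up doing; the hierarchy iterator name below is
illustrative only, not the exact helper from the series:

static void shrink_zone(int priority, struct zone *zone,
                        struct scan_control *sc)
{
        struct mem_cgroup *root = sc->target_mem_cgroup;
        struct mem_cgroup *mem = NULL;

        /* Walk every memcg in the hierarchy at the same priority. */
        while ((mem = mem_cgroup_hierarchy_walk(root, mem))) {
                sc->mem_cgroup = mem;
                /*
                 * The per-lru scan window is roughly
                 * zone_nr_lru_pages(mem, zone, lru) >> priority,
                 * so bigger groups see proportionally more pressure.
                 */
                do_shrink_zone(priority, zone, sc);
                /* Bail out once the overall reclaim target is met. */
                if (sc->nr_reclaimed >= sc->nr_to_reclaim)
                        break;
        }
}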

2011-06-10 00:48:40

by Minchan Kim

[permalink] [raw]
Subject: Re: [patch 2/8] mm: memcg-aware global reclaim

On Fri, Jun 10, 2011 at 9:34 AM, Johannes Weiner <[email protected]> wrote:
> On Fri, Jun 10, 2011 at 08:47:55AM +0900, Minchan Kim wrote:
>> On Fri, Jun 10, 2011 at 8:41 AM, Minchan Kim <[email protected]> wrote:
>> > On Fri, Jun 10, 2011 at 2:23 AM, Johannes Weiner <[email protected]> wrote:
>> >> On Fri, Jun 10, 2011 at 12:48:39AM +0900, Minchan Kim wrote:
>> >>> On Wed, Jun 01, 2011 at 08:25:13AM +0200, Johannes Weiner wrote:
>> >>> > When a memcg hits its hard limit, hierarchical target reclaim is
>> >>> > invoked, which goes through all contributing memcgs in the hierarchy
>> >>> > below the offending memcg and reclaims from the respective per-memcg
>> >>> > lru lists.  This distributes pressure fairly among all involved
>> >>> > memcgs, and pages are aged with respect to their list buddies.
>> >>> >
>> >>> > When global memory pressure arises, however, all this is dropped
>> >>> > overboard.  Pages are reclaimed based on global lru lists that have
>> >>> > nothing to do with container-internal age, and some memcgs may be
>> >>> > reclaimed from much more than others.
>> >>> >
>> >>> > This patch makes traditional global reclaim consider container
>> >>> > boundaries and no longer scan the global lru lists.  For each zone
>> >>> > scanned, the memcg hierarchy is walked and pages are reclaimed from
>> >>> > the per-memcg lru lists of the respective zone.  For now, the
>> >>> > hierarchy walk is bounded to one full round-trip through the
>> >>> > hierarchy, or if the number of reclaimed pages reach the overall
>> >>> > reclaim target, whichever comes first.
>> >>> >
>> >>> > Conceptually, global memory pressure is then treated as if the root
>> >>> > memcg had hit its limit.  Since all existing memcgs contribute to the
>> >>> > usage of the root memcg, global reclaim is nothing more than target
>> >>> > reclaim starting from the root memcg.  The code is mostly the same for
>> >>> > both cases, except for a few heuristics and statistics that do not
>> >>> > always apply.  They are distinguished by a newly introduced
>> >>> > global_reclaim() primitive.
>> >>> >
>> >>> > One implication of this change is that pages have to be linked to the
>> >>> > lru lists of the root memcg again, which could be optimized away with
>> >>> > the old scheme.  The costs are not measurable, though, even with
>> >>> > worst-case microbenchmarks.
>> >>> >
>> >>> > As global reclaim no longer relies on global lru lists, this change is
>> >>> > also in preparation to remove those completely.
>> >>
>> >> [cut diff]
>> >>
>> >>> I didn't look at all, still. You might change the logic later patches.
>> >>> If I understand this patch right, it does round-robin reclaim in all memcgs
>> >>> when global memory pressure happens.
>> >>>
>> >>> Let's consider this memcg size unbalance case.
>> >>>
>> >>> If A-memcg has lots of LRU pages, scanning count for reclaim would be bigger
>> >>> so the chance to reclaim the pages would be higher.
>> >>> If we reclaim A-memcg, we can reclaim the number of pages we want easily and break.
>> >>> Next reclaim will happen at some time and reclaim will start the B-memcg of A-memcg
>> >>> we reclaimed successfully before. But unfortunately B-memcg has small lru so
>> >>> scanning count would be small and small memcg's LRU aging is higher than bigger memcg.
>> >>> It means small memcg's working set can be evicted easily than big memcg.
>> >>> my point is that we should not set next memcg easily.
>> >>> We have to consider memcg LRU size.
>> >>
>> >> I may be missing something, but you said yourself that B had a smaller
>> >> scan count compared to A, so the aging speed should be proportional to
>> >> respective size.
>> >>
>> >> The number of pages scanned per iteration is essentially
>> >>
>> >>        number of lru pages in memcg-zone >> priority
>> >>
>> >> so we scan relatively more pages from B than from A each round.
>> >>
>> >> It's the exact same logic we have been applying traditionally to
>> >> distribute pressure fairly among zones to equalize their aging speed.
>> >>
>> >> Is that what you meant or are we talking past each other?
>> >
>> > True if we can reclaim pages easily(ie, default priority) in all memcgs.
>> > But let's think about it.
>> > Normally direct reclaim path reclaims only SWAP_CLUSTER_MAX size.
>> > If we have small memcg, scan window size would be smaller and it is
>> > likely to be hard reclaim in the priority compared to bigger memcg. It
>> > means it can raise priority easily in small memcg and even it might
>> > call lumpy or compaction in case of global memory pressure. It can
>> > churn all LRU order. :(
>> > Of course, we have bailout routine so we might make such unfair aging
>> > effect small but it's not same with old behavior(ie, single LRU list,
>> > fair aging POV global according to priority raise)
>>
>> To make fair, how about considering turn over different memcg before
>> raise up priority?
>> It can make aging speed fairly while it can make high contention of
>> lru_lock. :(
>
> Actually, the way you describe it is how it used to work for limit
> reclaim before my patches.  It would select one memcg, then reclaim
> with increasing priority until SWAP_CLUSTER_MAX were reclaimed.
>
>        memcg = select_victim()
>        for each prio:
>          for each zone:
>            shrink_zone(prio, zone, sc = { .mem_cgroup = memcg })
>
> What it's supposed to do with my patches is scan all memcgs in the
> hierarchy at the same priority.  If it hasn't made progress, it will
> increase the priority and iterate again over the hierarchy.
>
>        for each prio:
>          for each zone:
>            for each memcg:
>              do_shrink_zone(prio, zone, sc = { .mem_cgroup = memcg })
>
>

Right you are. I got confused with the old behavior, which wasn't good.
Your way is very desirable to me and my concern disappears.
Thanks, Hannes.

--
Kind regards,
Minchan Kim

2011-06-10 07:36:52

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 4/8] memcg: rework soft limit reclaim

On Thu 09-06-11 17:00:26, Michal Hocko wrote:
> On Thu 02-06-11 22:25:29, Ying Han wrote:
> > On Thu, Jun 2, 2011 at 2:55 PM, Ying Han <[email protected]> wrote:
> > > On Tue, May 31, 2011 at 11:25 PM, Johannes Weiner <[email protected]> wrote:
> > >> Currently, soft limit reclaim is entered from kswapd, where it selects
> [...]
> > >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> > >> index c7d4b44..0163840 100644
> > >> --- a/mm/vmscan.c
> > >> +++ b/mm/vmscan.c
> > >> @@ -1988,9 +1988,13 @@ static void shrink_zone(int priority, struct zone *zone,
> > >>                unsigned long reclaimed = sc->nr_reclaimed;
> > >>                unsigned long scanned = sc->nr_scanned;
> > >>                unsigned long nr_reclaimed;
> > >> +               int epriority = priority;
> > >> +
> > >> +               if (mem_cgroup_soft_limit_exceeded(root, mem))
> > >> +                       epriority -= 1;
> > >
> > > Here we grant the ability to shrink from all the memcgs, but only
> > > higher the priority for those exceed the soft_limit. That is a design
> > > change
> > > for the "soft_limit" which giving a hint to which memcgs to reclaim
> > > from first under global memory pressure.
> >
> >
> > Basically, we shouldn't reclaim from a memcg under its soft_limit
> > unless we have trouble reclaim pages from others.
>
> Agreed.
>
> > Something like the following makes better sense:
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index bdc2fd3..b82ba8c 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1989,6 +1989,8 @@ restart:
> >         throttle_vm_writeout(sc->gfp_mask);
> >  }
> >
> > +#define MEMCG_SOFTLIMIT_RECLAIM_PRIORITY       2
> > +
> >  static void shrink_zone(int priority, struct zone *zone,
> >                                 struct scan_control *sc)
> >  {
> > @@ -2001,13 +2003,13 @@ static void shrink_zone(int priority, struct zone *zone,
> >                 unsigned long reclaimed = sc->nr_reclaimed;
> >                 unsigned long scanned = sc->nr_scanned;
> >                 unsigned long nr_reclaimed;
> > -               int epriority = priority;
> >
> > -               if (mem_cgroup_soft_limit_exceeded(root, mem))
> > -                       epriority -= 1;
> > +               if (!mem_cgroup_soft_limit_exceeded(root, mem) &&
> > +                               priority > MEMCG_SOFTLIMIT_RECLAIM_PRIORITY)
> > +                       continue;
>
> yes, this makes sense but I am not sure about the right(tm) value of the
> MEMCG_SOFTLIMIT_RECLAIM_PRIORITY. 2 sounds too low.

There is also another problem. I have just realized that this code path
is shared with the cgroup direct reclaim. We shouldn't care about soft
limit in such a situation. It would be just a wasting of cycles. So we
have to:

if (current_is_kswapd() &&
        !mem_cgroup_soft_limit_exceeded(root, mem) &&
        priority > MEMCG_SOFTLIMIT_RECLAIM_PRIORITY)
        continue;

Maybe the condition would have to be more complex for per-cgroup
background reclaim, though.
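
Put together, the check being discussed might end up looking roughly
like the sketch below; the helper name and its placement are purely
illustrative (not from the series as posted), and the threshold value
is still an open question:

/*
 * Illustrative sketch only: meant to be called from the memcg loop in
 * shrink_zone() as "if (soft_limit_skip(root, mem, priority)) continue;".
 */
static bool soft_limit_skip(struct mem_cgroup *root,
                            struct mem_cgroup *mem, int priority)
{
        /* Only global (background) reclaim cares about soft limits. */
        if (!current_is_kswapd())
                return false;
        /* Groups over their soft limit are always fair game. */
        if (mem_cgroup_soft_limit_exceeded(root, mem))
                return false;
        /* Spare under-limit groups until the priority gets desperate. */
        return priority > MEMCG_SOFTLIMIT_RECLAIM_PRIORITY;
}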

> You would do quite a
> lot of loops
> (DEFAULT_PRIORITY-MEMCG_SOFTLIMIT_RECLAIM_PRIORITY) * zones * memcg_count
> without any progress (assuming that all of them are under soft limit
> which doesn't sound like a totally artificial configuration) until you
> allow reclaiming from groups that are under soft limit. Then, when you
> finally get to reclaiming, you scan rather aggressively.
>
> Maybe something like 3/4 of DEFAULT_PRIORITY? You would get 3 times
> over all (unbalanced) zones and all cgroups that are above the limit
> (scanning max{1/4096+1/2048+1/1024, 3*SWAP_CLUSTER_MAX} of the LRUs for
> each cgroup) which could be enough to collect the low hanging fruit.

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-13 09:26:48

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 5/8] memcg: remove unused soft limit code

On Wed 01-06-11 08:25:16, Johannes Weiner wrote:
> This should be merged into the previous patch, which is, however, more
> readable and reviewable without all this deletion noise.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> include/linux/memcontrol.h | 9 -
> include/linux/swap.h | 4 -
> mm/memcontrol.c | 418 --------------------------------------------
> mm/vmscan.c | 44 -----
> 4 files changed, 0 insertions(+), 475 deletions(-)

Heh, that is what I call a nice clean up ;)
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-13 09:29:20

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 6/8] vmscan: change zone_nr_lru_pages to take memcg instead of scan control

On Wed 01-06-11 08:25:17, Johannes Weiner wrote:
> This function only uses sc->mem_cgroup from the scan control. Change
> it to take a memcg argument directly, so callsites without an actual
> reclaim context can use it as well.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Michal Hocko <[email protected]>
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-13 09:42:09

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 7/8] vmscan: memcg-aware unevictable page rescue scanner

On Wed 01-06-11 08:25:18, Johannes Weiner wrote:
> Once the per-memcg lru lists are exclusive, the unevictable page
> rescue scanner can no longer work on the global zone lru lists.
>
> This converts it to go through all memcgs and scan their respective
> unevictable lists instead.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Just a minor naming thing.

Other than that looks good to me.
Reviewed-by: Michal Hocko <[email protected]>

> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
[...]
> +struct page *mem_cgroup_lru_to_page(struct zone *zone, struct mem_cgroup *mem,
> + enum lru_list lru)
> +{
> + struct mem_cgroup_per_zone *mz;
> + struct page_cgroup *pc;
> +
> + mz = mem_cgroup_zoneinfo(mem, zone_to_nid(zone), zone_idx(zone));
> + pc = list_entry(mz->lists[lru].prev, struct page_cgroup, lru);
> + return lookup_cgroup_page(pc);
> +}
> +
[...]
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3233,6 +3233,14 @@ void scan_mapping_unevictable_pages(struct address_space *mapping)
>
> }
>
> +static struct page *lru_tailpage(struct zone *zone, struct mem_cgroup *mem,
> + enum lru_list lru)
> +{
> + if (mem)
> + return mem_cgroup_lru_to_page(zone, mem, lru);
> + return lru_to_page(&zone->lru[lru].list);
> +}

Wouldn't it be better to have those names consistent?
mem_cgroup_lru_tailpage vs lru_tailpage?

[...]
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-13 09:47:09

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

On Wed 01-06-11 08:25:11, Johannes Weiner wrote:
> Hi,
>
> this is the second version of the memcg naturalization series. The
> notable changes since the first submission are:
>
> o the hierarchy walk is now intermittent and will abort and
> remember the last scanned child after sc->nr_to_reclaim pages
> have been reclaimed during the walk in one zone (Rik)
>
> o the global lru lists are never scanned when memcg is enabled
> after #2 'memcg-aware global reclaim', which makes this patch
> self-sufficient and complete without requiring the per-memcg lru
> lists to be exclusive (Michal)
>
> o renamed sc->memcg and sc->current_memcg to sc->target_mem_cgroup
> and sc->mem_cgroup and fixed their documentation, I hope this is
> better understandable now (Rik)
>
> o the reclaim statistic counters have been renamed. there is no
> more distinction between 'pgfree' and 'pgsteal', it is now
> 'pgreclaim' in both cases; 'kswapd' has been replaced by
> 'background'
>
> o fixed a nasty crash in the hierarchical soft limit check that
> happened during global reclaim in memcgs that are hierarchical
> but have no hierarchical parents themselves
>
> o properly implemented the memcg-aware unevictable page rescue
> scanner, there were several blatant bugs in there
>
> o documentation on new public interfaces
>
> Thanks for your input on the first version.

I have finally got through the whole series, sorry that it took so long,
and I have to say that I like it. There is just one issue I can see that
was already discussed by you and Ying regarding further soft reclaim
enhancement. I think it will be much better if that one comes as a
separate patch though.

So thank you for this work and I am looking forward to a new version.
I will try to give it some testing as well.
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-13 10:31:10

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 7/8] vmscan: memcg-aware unevictable page rescue scanner

On Mon, Jun 13, 2011 at 11:42:03AM +0200, Michal Hocko wrote:
> On Wed 01-06-11 08:25:18, Johannes Weiner wrote:
> > Once the per-memcg lru lists are exclusive, the unevictable page
> > rescue scanner can no longer work on the global zone lru lists.
> >
> > This converts it to go through all memcgs and scan their respective
> > unevictable lists instead.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
>
> Just a minor naming thing.
>
> Other than that looks good to me.
> Reviewed-by: Michal Hocko <[email protected]>
>
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> [...]
> > +struct page *mem_cgroup_lru_to_page(struct zone *zone, struct mem_cgroup *mem,
> > + enum lru_list lru)
> > +{
> > + struct mem_cgroup_per_zone *mz;
> > + struct page_cgroup *pc;
> > +
> > + mz = mem_cgroup_zoneinfo(mem, zone_to_nid(zone), zone_idx(zone));
> > + pc = list_entry(mz->lists[lru].prev, struct page_cgroup, lru);
> > + return lookup_cgroup_page(pc);
> > +}
> > +
> [...]
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -3233,6 +3233,14 @@ void scan_mapping_unevictable_pages(struct address_space *mapping)
> >
> > }
> >
> > +static struct page *lru_tailpage(struct zone *zone, struct mem_cgroup *mem,
> > + enum lru_list lru)
> > +{
> > + if (mem)
> > + return mem_cgroup_lru_to_page(zone, mem, lru);
> > + return lru_to_page(&zone->lru[lru].list);
> > +}
>
> Wouldn't it be better to have those names consistent?
> mem_cgroup_lru_tailpage vs lru_tailpage?

It's bad naming alright, but what is the wrapper for both of them
supposed to be called then?

Note that this function is only temporary, though; that's why I did
not spend much time on looking for a better name.

When the per-memcg lru lists finally become exclusive, this is removed
and the function converted to work on lruvecs.

Would you be okay with just adding an /* XXX */ to the function in
this patch that mentions that it's only temporary?
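
Something along these lines, purely to illustrate the comment I have in
mind (exact wording to be bikeshedded):

/*
 * XXX: temporary wrapper; it goes away once the per-memcg lru lists
 * are exclusive and this can operate on lruvecs directly.
 */
static struct page *lru_tailpage(struct zone *zone, struct mem_cgroup *mem,
                                 enum lru_list lru)
{
        if (mem)
                return mem_cgroup_lru_to_page(zone, mem, lru);
        return lru_to_page(&zone->lru[lru].list);
}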

2011-06-13 10:35:28

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

On Mon, Jun 13, 2011 at 11:47:04AM +0200, Michal Hocko wrote:
> On Wed 01-06-11 08:25:11, Johannes Weiner wrote:
> > Hi,
> >
> > this is the second version of the memcg naturalization series. The
> > notable changes since the first submission are:
> >
> > o the hierarchy walk is now intermittent and will abort and
> > remember the last scanned child after sc->nr_to_reclaim pages
> > have been reclaimed during the walk in one zone (Rik)
> >
> > o the global lru lists are never scanned when memcg is enabled
> > after #2 'memcg-aware global reclaim', which makes this patch
> > self-sufficient and complete without requiring the per-memcg lru
> > lists to be exclusive (Michal)
> >
> > o renamed sc->memcg and sc->current_memcg to sc->target_mem_cgroup
> > and sc->mem_cgroup and fixed their documentation, I hope this is
> > better understandable now (Rik)
> >
> > o the reclaim statistic counters have been renamed. there is no
> > more distinction between 'pgfree' and 'pgsteal', it is now
> > 'pgreclaim' in both cases; 'kswapd' has been replaced by
> > 'background'
> >
> > o fixed a nasty crash in the hierarchical soft limit check that
> > happened during global reclaim in memcgs that are hierarchical
> > but have no hierarchical parents themselves
> >
> > o properly implemented the memcg-aware unevictable page rescue
> > scanner, there were several blatant bugs in there
> >
> > o documentation on new public interfaces
> >
> > Thanks for your input on the first version.
>
> I have finally got through the whole series, sorry that it took so long,
> and I have to say that I like it. There is just one issue I can see that
> was already discussed by you and Ying regarding further soft reclaim
> enhancement. I think it will be much better if that one comes as a
> separate patch though.

People have been arguing in both directions. I share the sentiment
that the soft limit rework is a separate thing, though, and will
make this series purely about the exclusive per-memcg lru lists.

Once this is done, the soft limit stuff should follow immediately.

> So thank you for this work and I am looking forward to a new version.
> I will try to give it some testing as well.

Thanks for your input and testing!

2011-06-13 11:19:08

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 7/8] vmscan: memcg-aware unevictable page rescue scanner

On Mon 13-06-11 12:30:32, Johannes Weiner wrote:
> On Mon, Jun 13, 2011 at 11:42:03AM +0200, Michal Hocko wrote:
> > On Wed 01-06-11 08:25:18, Johannes Weiner wrote:
> > > Once the per-memcg lru lists are exclusive, the unevictable page
> > > rescue scanner can no longer work on the global zone lru lists.
> > >
> > > This converts it to go through all memcgs and scan their respective
> > > unevictable lists instead.
> > >
> > > Signed-off-by: Johannes Weiner <[email protected]>
> >
> > Just a minor naming thing.
> >
> > Other than that looks good to me.
> > Reviewed-by: Michal Hocko <[email protected]>
> >
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > [...]
> > > +struct page *mem_cgroup_lru_to_page(struct zone *zone, struct mem_cgroup *mem,
> > > + enum lru_list lru)
> > > +{
> > > + struct mem_cgroup_per_zone *mz;
> > > + struct page_cgroup *pc;
> > > +
> > > + mz = mem_cgroup_zoneinfo(mem, zone_to_nid(zone), zone_idx(zone));
> > > + pc = list_entry(mz->lists[lru].prev, struct page_cgroup, lru);
> > > + return lookup_cgroup_page(pc);
> > > +}
> > > +
> > [...]
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -3233,6 +3233,14 @@ void scan_mapping_unevictable_pages(struct address_space *mapping)
> > >
> > > }
> > >
> > > +static struct page *lru_tailpage(struct zone *zone, struct mem_cgroup *mem,
> > > + enum lru_list lru)
> > > +{
> > > + if (mem)
> > > + return mem_cgroup_lru_to_page(zone, mem, lru);
> > > + return lru_to_page(&zone->lru[lru].list);
> > > +}
> >
> > Wouldn't it be better to have those names consistent?
> > mem_cgroup_lru_tailpage vs lru_tailpage?
>
> It's bad naming alright, but what is the wrapper for both of them
> supposed to be called then?
>
> Note that this function is only temporary, though; that's why I did
> not spend much time on looking for a better name.
>
> When the per-memcg lru lists finally become exclusive, this is removed
> and the function converted to work on lruvecs.
>
> Would you be okay with just adding an /* XXX */ to the function in
> this patch that mentions that it's only temporary?

Sure. No biggie about the naming. It just caught my eye while reading the
patch because, while lru_tailpage is clear about which end of the list is
returned, the memcg counterpart is not that clear.

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-15 22:48:34

by Ying Han

[permalink] [raw]
Subject: Re: [patch 4/8] memcg: rework soft limit reclaim

On Thu, Jun 9, 2011 at 8:00 AM, Michal Hocko <[email protected]> wrote:
> On Thu 02-06-11 22:25:29, Ying Han wrote:
>> On Thu, Jun 2, 2011 at 2:55 PM, Ying Han <[email protected]> wrote:
>> > On Tue, May 31, 2011 at 11:25 PM, Johannes Weiner <[email protected]> wrote:
>> >> Currently, soft limit reclaim is entered from kswapd, where it selects
> [...]
>> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> >> index c7d4b44..0163840 100644
>> >> --- a/mm/vmscan.c
>> >> +++ b/mm/vmscan.c
>> >> @@ -1988,9 +1988,13 @@ static void shrink_zone(int priority, struct zone *zone,
>> >>                unsigned long reclaimed = sc->nr_reclaimed;
>> >>                unsigned long scanned = sc->nr_scanned;
>> >>                unsigned long nr_reclaimed;
>> >> +               int epriority = priority;
>> >> +
>> >> +               if (mem_cgroup_soft_limit_exceeded(root, mem))
>> >> +                       epriority -= 1;
>> >
>> > Here we grant the ability to shrink from all the memcgs, but only
>> > higher the priority for those exceed the soft_limit. That is a design
>> > change
>> > for the "soft_limit" which giving a hint to which memcgs to reclaim
>> > from first under global memory pressure.
>>
>>
>> Basically, we shouldn't reclaim from a memcg under its soft_limit
>> unless we have trouble reclaim pages from others.
>
> Agreed.
>
>> Something like the following makes better sense:
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index bdc2fd3..b82ba8c 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1989,6 +1989,8 @@ restart:
>>         throttle_vm_writeout(sc->gfp_mask);
>>  }
>>
>> +#define MEMCG_SOFTLIMIT_RECLAIM_PRIORITY       2
>> +
>>  static void shrink_zone(int priority, struct zone *zone,
>>                                 struct scan_control *sc)
>>  {
>> @@ -2001,13 +2003,13 @@ static void shrink_zone(int priority, struct zone *zone,
>>                 unsigned long reclaimed = sc->nr_reclaimed;
>>                 unsigned long scanned = sc->nr_scanned;
>>                 unsigned long nr_reclaimed;
>> -               int epriority = priority;
>>
>> -               if (mem_cgroup_soft_limit_exceeded(root, mem))
>> -                       epriority -= 1;
>> +               if (!mem_cgroup_soft_limit_exceeded(root, mem) &&
>> +                               priority > MEMCG_SOFTLIMIT_RECLAIM_PRIORITY)
>> +                       continue;
>
> yes, this makes sense but I am not sure about the right(tm) value of the
> MEMCG_SOFTLIMIT_RECLAIM_PRIORITY. 2 sounds too low. You would do quite a
> lot of loops
> (DEFAULT_PRIORITY-MEMCG_SOFTLIMIT_RECLAIM_PRIORITY) * zones * memcg_count
> without any progress (assuming that all of them are under soft limit
> which doesn't sound like a totally artificial configuration) until you
> allow reclaiming from groups that are under soft limit. Then, when you
> finally get to reclaiming, you scan rather aggressively.

Fair enough, something smarter is definitely needed :)

>
> Maybe something like 3/4 of DEFAULT_PRIORITY? You would get 3 times
> over all (unbalanced) zones and all cgroups that are above the limit
> (scanning max{1/4096+1/2048+1/1024, 3*SWAP_CLUSTER_MAX} of the LRUs for
> each cgroup) which could be enough to collect the low hanging fruit.

Hmm, that sounds more reasonable than the initial proposal.

For the same worst case where all the memcgs are below their soft
limit, we need to scan 3 times over all the memcgs before actually
doing anything. For that condition, I cannot think of anything that
solves the problem completely unless we keep a separate per-zone list
of memcgs (like what we do currently).

--Ying

> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic
>

2011-06-15 22:58:05

by Ying Han

[permalink] [raw]
Subject: Re: [patch 4/8] memcg: rework soft limit reclaim

On Fri, Jun 10, 2011 at 12:36 AM, Michal Hocko <[email protected]> wrote:
> On Thu 09-06-11 17:00:26, Michal Hocko wrote:
>> On Thu 02-06-11 22:25:29, Ying Han wrote:
>> > On Thu, Jun 2, 2011 at 2:55 PM, Ying Han <[email protected]> wrote:
>> > > On Tue, May 31, 2011 at 11:25 PM, Johannes Weiner <[email protected]> wrote:
>> > >> Currently, soft limit reclaim is entered from kswapd, where it selects
>> [...]
>> > >> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > >> index c7d4b44..0163840 100644
>> > >> --- a/mm/vmscan.c
>> > >> +++ b/mm/vmscan.c
>> > >> @@ -1988,9 +1988,13 @@ static void shrink_zone(int priority, struct zone *zone,
>> > >> ? ? ? ? ? ? ? ?unsigned long reclaimed = sc->nr_reclaimed;
>> > >> ? ? ? ? ? ? ? ?unsigned long scanned = sc->nr_scanned;
>> > >> ? ? ? ? ? ? ? ?unsigned long nr_reclaimed;
>> > >> + ? ? ? ? ? ? ? int epriority = priority;
>> > >> +
>> > >> + ? ? ? ? ? ? ? if (mem_cgroup_soft_limit_exceeded(root, mem))
>> > >> + ? ? ? ? ? ? ? ? ? ? ? epriority -= 1;
>> > >
>> > > Here we grant the ability to shrink from all the memcgs, but only
>> > > higher the priority for those exceed the soft_limit. That is a design
>> > > change
>> > > for the "soft_limit" which giving a hint to which memcgs to reclaim
>> > > from first under global memory pressure.
>> >
>> >
>> > Basically, we shouldn't reclaim from a memcg under its soft_limit
>> > unless we have trouble reclaim pages from others.
>>
>> Agreed.
>>
>> > Something like the following makes better sense:
>> >
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index bdc2fd3..b82ba8c 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -1989,6 +1989,8 @@ restart:
>> > ? ? ? ? throttle_vm_writeout(sc->gfp_mask);
>> > ?}
>> >
>> > +#define MEMCG_SOFTLIMIT_RECLAIM_PRIORITY ? ? ? 2
>> > +
>> > ?static void shrink_zone(int priority, struct zone *zone,
>> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct scan_control *sc)
>> > ?{
>> > @@ -2001,13 +2003,13 @@ static void shrink_zone(int priority, struct zone *zone,
>> > ? ? ? ? ? ? ? ? unsigned long reclaimed = sc->nr_reclaimed;
>> > ? ? ? ? ? ? ? ? unsigned long scanned = sc->nr_scanned;
>> > ? ? ? ? ? ? ? ? unsigned long nr_reclaimed;
>> > - ? ? ? ? ? ? ? int epriority = priority;
>> >
>> > - ? ? ? ? ? ? ? if (mem_cgroup_soft_limit_exceeded(root, mem))
>> > - ? ? ? ? ? ? ? ? ? ? ? epriority -= 1;
>> > + ? ? ? ? ? ? ? if (!mem_cgroup_soft_limit_exceeded(root, mem) &&
>> > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? priority > MEMCG_SOFTLIMIT_RECLAIM_PRIORITY)
>> > + ? ? ? ? ? ? ? ? ? ? ? continue;
>>
>> yes, this makes sense but I am not sure about the right(tm) value of the
>> MEMCG_SOFTLIMIT_RECLAIM_PRIORITY. 2 sounds too low.
>
> There is also another problem. I have just realized that this code path
> is shared with the cgroup direct reclaim. We shouldn't care about soft
> limit in such a situation. It would be just a wasting of cycles. So we
> have to:
>
> if (current_is_kswapd() &&
>        !mem_cgroup_soft_limit_exceeded(root, mem) &&
>        priority > MEMCG_SOFTLIMIT_RECLAIM_PRIORITY)
>        continue;

Agreed.

>
> Maybe the condition would have to be more complex for per-cgroup
> background reclaim, though.

That would be the same logic for per-memcg direct reclaim. In general,
we don't consider the soft_limit
unless there is global memory pressure. So the condition could be something like:

> if (global_reclaim(sc) &&
>        !mem_cgroup_soft_limit_exceeded(root, mem) &&
>        priority > MEMCG_SOFTLIMIT_RECLAIM_PRIORITY)
>        continue;

make sense?

Thanks

--Ying
>
>> You would do quite a
>> lot of loops
>> (DEFAULT_PRIORITY-MEMCG_SOFTLIMIT_RECLAIM_PRIORITY) * zones * memcg_count
>> without any progress (assuming that all of them are under soft limit
>> which doesn't sound like a totally artificial configuration) until you
>> allow reclaiming from groups that are under soft limit. Then, when you
>> finally get to reclaiming, you scan rather aggressively.
>>
>> Maybe something like 3/4 of DEFAULT_PRIORITY? You would get 3 times
>> over all (unbalanced) zones and all cgroups that are above the limit
>> (scanning max{1/4096+1/2048+1/1024, 3*SWAP_CLUSTER_MAX} of the LRUs for
>> each cgroup) which could be enough to collect the low hanging fruit.
>
> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic
>

2011-06-16 00:33:38

by Ying Han

[permalink] [raw]
Subject: Re: [patch 4/8] memcg: rework soft limit reclaim

On Wed, Jun 15, 2011 at 3:57 PM, Ying Han <[email protected]> wrote:
> On Fri, Jun 10, 2011 at 12:36 AM, Michal Hocko <[email protected]> wrote:
>> On Thu 09-06-11 17:00:26, Michal Hocko wrote:
>>> On Thu 02-06-11 22:25:29, Ying Han wrote:
>>> > On Thu, Jun 2, 2011 at 2:55 PM, Ying Han <[email protected]> wrote:
>>> > > On Tue, May 31, 2011 at 11:25 PM, Johannes Weiner <[email protected]> wrote:
>>> > >> Currently, soft limit reclaim is entered from kswapd, where it selects
>>> [...]
>>> > >> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> > >> index c7d4b44..0163840 100644
>>> > >> --- a/mm/vmscan.c
>>> > >> +++ b/mm/vmscan.c
>>> > >> @@ -1988,9 +1988,13 @@ static void shrink_zone(int priority, struct zone *zone,
>>> > >>                unsigned long reclaimed = sc->nr_reclaimed;
>>> > >>                unsigned long scanned = sc->nr_scanned;
>>> > >>                unsigned long nr_reclaimed;
>>> > >> +               int epriority = priority;
>>> > >> +
>>> > >> +               if (mem_cgroup_soft_limit_exceeded(root, mem))
>>> > >> +                       epriority -= 1;
>>> > >
>>> > > Here we grant the ability to shrink from all the memcgs, but only
>>> > > higher the priority for those exceed the soft_limit. That is a design
>>> > > change
>>> > > for the "soft_limit" which giving a hint to which memcgs to reclaim
>>> > > from first under global memory pressure.
>>> >
>>> >
>>> > Basically, we shouldn't reclaim from a memcg under its soft_limit
>>> > unless we have trouble reclaim pages from others.
>>>
>>> Agreed.
>>>
>>> > Something like the following makes better sense:
>>> >
>>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> > index bdc2fd3..b82ba8c 100644
>>> > --- a/mm/vmscan.c
>>> > +++ b/mm/vmscan.c
>>> > @@ -1989,6 +1989,8 @@ restart:
>>> >         throttle_vm_writeout(sc->gfp_mask);
>>> >  }
>>> >
>>> > +#define MEMCG_SOFTLIMIT_RECLAIM_PRIORITY       2
>>> > +
>>> >  static void shrink_zone(int priority, struct zone *zone,
>>> >                                 struct scan_control *sc)
>>> >  {
>>> > @@ -2001,13 +2003,13 @@ static void shrink_zone(int priority, struct zone *zone,
>>> >                 unsigned long reclaimed = sc->nr_reclaimed;
>>> >                 unsigned long scanned = sc->nr_scanned;
>>> >                 unsigned long nr_reclaimed;
>>> > -               int epriority = priority;
>>> >
>>> > -               if (mem_cgroup_soft_limit_exceeded(root, mem))
>>> > -                       epriority -= 1;
>>> > +               if (!mem_cgroup_soft_limit_exceeded(root, mem) &&
>>> > +                               priority > MEMCG_SOFTLIMIT_RECLAIM_PRIORITY)
>>> > +                       continue;
>>>
>>> yes, this makes sense but I am not sure about the right(tm) value of the
>>> MEMCG_SOFTLIMIT_RECLAIM_PRIORITY. 2 sounds too low.
>>
>> There is also another problem. I have just realized that this code path
>> is shared with the cgroup direct reclaim. We shouldn't care about soft
>> limit in such a situation. It would be just a wasting of cycles. So we
>> have to:
>>
>> if (current_is_kswapd() &&
>>        !mem_cgroup_soft_limit_exceeded(root, mem) &&
>>        priority > MEMCG_SOFTLIMIT_RECLAIM_PRIORITY)
>>        continue;
>
> Agreed.
>
>>
>> Maybe the condition would have to be more complex for per-cgroup
>> background reclaim, though.
>
> That would be the same logic for per-memcg direct reclaim. In general,
> we don't consider the soft_limit
> unless there is global memory pressure. So the condition could be something like:
>
>> if (global_reclaim(sc) &&
>>        !mem_cgroup_soft_limit_exceeded(root, mem) &&
>>        priority > MEMCG_SOFTLIMIT_RECLAIM_PRIORITY)
>>        continue;
>
> make sense?

Also

+bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *mem)
+{
+ return res_counter_soft_limit_excess(&mem->res);
+}

--Ying
>
> Thanks
>
> --Ying
>>
>>> You would do quite a
>>> lot of loops
>>> (DEFAULT_PRIORITY-MEMCG_SOFTLIMIT_RECLAIM_PRIORITY) * zones * memcg_count
>>> without any progress (assuming that all of them are under soft limit
>>> which doesn't sound like a totally artificial configuration) until you
>>> allow reclaiming from groups that are under soft limit. Then, when you
>>> finally get to reclaiming, you scan rather aggressively.
>>>
>>> Maybe something like 3/4 of DEFAULT_PRIORITY? You would get 3 times
>>> over all (unbalanced) zones and all cgroups that are above the limit
>>> (scanning max{1/4096+1/2048+1/1024, 3*SWAP_CLUSTER_MAX} of the LRUs for
>>> each cgroup) which could be enough to collect the low hanging fruit.
>>
>> --
>> Michal Hocko
>> SUSE Labs
>> SUSE LINUX s.r.o.
>> Lihovarska 1060/12
>> 190 00 Praha 9
>> Czech Republic
>>
>

2011-06-16 11:42:10

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 4/8] memcg: rework soft limit reclaim

On Wed 15-06-11 15:48:25, Ying Han wrote:
> On Thu, Jun 9, 2011 at 8:00 AM, Michal Hocko <[email protected]> wrote:
> > On Thu 02-06-11 22:25:29, Ying Han wrote:
[...]
> > yes, this makes sense but I am not sure about the right(tm) value of the
> > MEMCG_SOFTLIMIT_RECLAIM_PRIORITY. 2 sounds too low. You would do quite a
> > lot of loops
> > (DEFAULT_PRIORITY-MEMCG_SOFTLIMIT_RECLAIM_PRIORITY) * zones * memcg_count
> > without any progress (assuming that all of them are under soft limit
> > which doesn't sound like a totally artificial configuration) until you
> > allow reclaiming from groups that are under soft limit. Then, when you
> > finally get to reclaiming, you scan rather aggressively.
>
> Fair enough, something smarter is definitely needed :)
>
> >
> > Maybe something like 3/4 of DEFAULT_PRIORITY? You would get 3 times
> > over all (unbalanced) zones and all cgroups that are above the limit
> > (scanning max{1/4096+1/2048+1/1024, 3*SWAP_CLUSTER_MAX} of the LRUs for
> > each cgroup) which could be enough to collect the low hanging fruit.
>
> Hmm, that sounds more reasonable than the initial proposal.
>
> For the same worst case where all the memcgs are below their soft
> limit, we need to scan 3 times over all the memcgs before actually doing

it is not scanning that we do there; we just walk through all existing
memcgs. I think the real issue here is how much we scan once we start
doing something useful. Maybe even DEFAULT_PRIORITY-3 is too much as
well. Dunno.
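
To put rough numbers on it (assuming the usual DEFAULT_PRIORITY of 12,
so a 3/4 cutoff of 9): under-soft-limit groups would be skipped at
priorities 12, 11 and 10, while the over-limit groups get scanned about

        lru/4096 + lru/2048 + lru/1024 = 7/4096 of their lru pages

per zone in that window (with at least SWAP_CLUSTER_MAX pages per
pass) before anybody else is touched.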
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-16 11:45:44

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 4/8] memcg: rework soft limit reclaim

On Wed 15-06-11 15:57:59, Ying Han wrote:
> On Fri, Jun 10, 2011 at 12:36 AM, Michal Hocko <[email protected]> wrote:
> > On Thu 09-06-11 17:00:26, Michal Hocko wrote:
> >> On Thu 02-06-11 22:25:29, Ying Han wrote:
> >> > On Thu, Jun 2, 2011 at 2:55 PM, Ying Han <[email protected]> wrote:
> >> > > On Tue, May 31, 2011 at 11:25 PM, Johannes Weiner <[email protected]> wrote:
> >> > >> Currently, soft limit reclaim is entered from kswapd, where it selects
> >> [...]
> >> > >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> > >> index c7d4b44..0163840 100644
> >> > >> --- a/mm/vmscan.c
> >> > >> +++ b/mm/vmscan.c
> >> > >> @@ -1988,9 +1988,13 @@ static void shrink_zone(int priority, struct zone *zone,
> >> > >>                unsigned long reclaimed = sc->nr_reclaimed;
> >> > >>                unsigned long scanned = sc->nr_scanned;
> >> > >>                unsigned long nr_reclaimed;
> >> > >> +               int epriority = priority;
> >> > >> +
> >> > >> +               if (mem_cgroup_soft_limit_exceeded(root, mem))
> >> > >> +                       epriority -= 1;
> >> > >
> >> > > Here we grant the ability to shrink from all the memcgs, but only
> >> > > higher the priority for those exceed the soft_limit. That is a design
> >> > > change
> >> > > for the "soft_limit" which giving a hint to which memcgs to reclaim
> >> > > from first under global memory pressure.
> >> >
> >> >
> >> > Basically, we shouldn't reclaim from a memcg under its soft_limit
> >> > unless we have trouble reclaim pages from others.
> >>
> >> Agreed.
> >>
> >> > Something like the following makes better sense:
> >> >
> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> > index bdc2fd3..b82ba8c 100644
> >> > --- a/mm/vmscan.c
> >> > +++ b/mm/vmscan.c
> >> > @@ -1989,6 +1989,8 @@ restart:
> >> >         throttle_vm_writeout(sc->gfp_mask);
> >> >  }
> >> >
> >> > +#define MEMCG_SOFTLIMIT_RECLAIM_PRIORITY       2
> >> > +
> >> >  static void shrink_zone(int priority, struct zone *zone,
> >> >                                 struct scan_control *sc)
> >> >  {
> >> > @@ -2001,13 +2003,13 @@ static void shrink_zone(int priority, struct zone *zone,
> >> >                 unsigned long reclaimed = sc->nr_reclaimed;
> >> >                 unsigned long scanned = sc->nr_scanned;
> >> >                 unsigned long nr_reclaimed;
> >> > -               int epriority = priority;
> >> >
> >> > -               if (mem_cgroup_soft_limit_exceeded(root, mem))
> >> > -                       epriority -= 1;
> >> > +               if (!mem_cgroup_soft_limit_exceeded(root, mem) &&
> >> > +                               priority > MEMCG_SOFTLIMIT_RECLAIM_PRIORITY)
> >> > +                       continue;
> >>
> >> yes, this makes sense but I am not sure about the right(tm) value of the
> >> MEMCG_SOFTLIMIT_RECLAIM_PRIORITY. 2 sounds too low.
> >
> > There is also another problem. I have just realized that this code path
> > is shared with the cgroup direct reclaim. We shouldn't care about soft
> > limit in such a situation. It would be just a wasting of cycles. So we
> > have to:
> >
> > if (current_is_kswapd() &&
> >        !mem_cgroup_soft_limit_exceeded(root, mem) &&
> >        priority > MEMCG_SOFTLIMIT_RECLAIM_PRIORITY)
> >        continue;
>
> Agreed.
>
> >
> > Maybe the condition would have to be more complex for per-cgroup
> > background reclaim, though.
>
> That would be the same logic for per-memcg direct reclaim. In general,
> we don't consider soft_limit
> unless the global memory pressure. So the condition could be something like:
>
> > if (global_reclaim(sc) &&
> >        !mem_cgroup_soft_limit_exceeded(root, mem) &&
> >        priority > MEMCG_SOFTLIMIT_RECLAIM_PRIORITY)
> >        continue;
>
> make sense?

Yes, that seems more consistent.

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic