2011-05-12 14:54:48

by Johannes Weiner

Subject: [rfc patch 0/6] mm: memcg naturalization

Hi!

Here is a patch series that is a result of the memcg discussions on
LSF (memcg-aware global reclaim, global lru removal, struct
page_cgroup reduction, soft limit implementation) and the recent
feature discussions on linux-mm.

The long-term idea is to have memcg no longer bolted onto the side of
the mm code, but to integrate it as much as possible, so that there is
a native understanding of containers and the traditional !memcg setup
is just a single group. This series is an approach in that direction.

It is a rather early snapshot, WIP, barely tested etc., but I wanted
to get your opinions before pursuing it further. It is also part of
my counter-argument to the proposals to add memcg-reclaim-related
user interfaces at this point in time, so I wanted to push this out
the door before things are merged into .40.

The patches are quite big; I am still looking for things to factor
and split out, sorry for that. Documentation is on its way as well ;)

#1 and #2 are boring preparatory work. #3 makes traditional reclaim
in vmscan.c memcg-aware, which is a prerequisite both for the removal
of the global lru in #5 and for the way I reimplemented soft limit
reclaim in #6.

The diffstat so far looks like this:

include/linux/memcontrol.h | 84 +++--
include/linux/mm_inline.h | 15 +-
include/linux/mmzone.h | 10 +-
include/linux/page_cgroup.h | 35 --
include/linux/swap.h | 4 -
mm/memcontrol.c | 860 +++++++++++++------------------------------
mm/page_alloc.c | 2 +-
mm/page_cgroup.c | 39 +--
mm/swap.c | 20 +-
mm/vmscan.c | 273 +++++++--------
10 files changed, 452 insertions(+), 890 deletions(-)

It is based on .39-rc7 because of the memcg churn in -mm, but I'll
rebase it in the near future.

Discuss!

Hannes


2011-05-12 14:54:50

by Johannes Weiner

Subject: [rfc patch 1/6] memcg: remove unused retry signal from reclaim

If the memcg reclaim code detects that the target memcg is below its
limit, it exits and returns a guaranteed non-zero value so that the
charge is retried.

Nowadays, the charge side checks the memcg limit itself and does not
rely on this non-zero return value trick.

This patch removes it. The reclaim code will now always return the
true number of pages it reclaimed on its own.

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/memcontrol.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 010f916..bf5ab87 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1503,7 +1503,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
if (!res_counter_soft_limit_excess(&root_mem->res))
return total;
} else if (mem_cgroup_margin(root_mem))
- return 1 + total;
+ return total;
}
return total;
}
--
1.7.5.1

2011-05-12 14:54:47

by Johannes Weiner

Subject: [rfc patch 2/6] vmscan: make distinction between memcg reclaim and LRU list selection

The reclaim code currently uses a single predicate both for whether it
reclaims on behalf of a memory cgroup and for whether it is scanning
the global LRU list or a memory cgroup LRU list.

Up to now, the two have always coincided, but subsequent patches will
change things such that global reclaim scans memory cgroup lists.

This patch adds a new predicate that distinguishes global reclaim from
memory cgroup reclaim, and converts all callsites that are actually
about global reclaim heuristics rather than strict LRU list selection.
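
For illustration, this just restates the two helpers added in the hunk
below, with their intended semantics spelled out as comments; once
global reclaim scans memcg lists, they answer different questions:

        /* Was this reclaim invocation triggered globally or by a memcg limit? */
        static bool global_reclaim(struct scan_control *sc)
        {
                return !sc->memcg;
        }

        /* Which LRU list are we walking right now, global or per-memcg? */
        static bool scanning_global_lru(struct scan_control *sc)
        {
                return !sc->current_memcg;
        }

Zone-wide heuristics and vmstat accounting (too_many_isolated(),
ALLOCSTALL, the PGSCAN/PGSTEAL zone events) switch to global_reclaim(),
while per-list lookups and reclaim statistics keep using
scanning_global_lru().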

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 96 ++++++++++++++++++++++++++++++++++------------------------
1 files changed, 56 insertions(+), 40 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f6b435c..ceeb2a5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -104,8 +104,12 @@ struct scan_control {
*/
reclaim_mode_t reclaim_mode;

- /* Which cgroup do we reclaim from */
- struct mem_cgroup *mem_cgroup;
+ /*
+ * The memory cgroup we reclaim on behalf of, and the one we
+ * are currently reclaiming from.
+ */
+ struct mem_cgroup *memcg;
+ struct mem_cgroup *current_memcg;

/*
* Nodemask of nodes allowed by the caller. If NULL, all nodes
@@ -154,16 +158,24 @@ static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);

#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-#define scanning_global_lru(sc) (!(sc)->mem_cgroup)
+static bool global_reclaim(struct scan_control *sc)
+{
+ return !sc->memcg;
+}
+static bool scanning_global_lru(struct scan_control *sc)
+{
+ return !sc->current_memcg;
+}
#else
-#define scanning_global_lru(sc) (1)
+static bool global_reclaim(struct scan_control *sc) { return 1; }
+static bool scanning_global_lru(struct scan_control *sc) { return 1; }
#endif

static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
struct scan_control *sc)
{
if (!scanning_global_lru(sc))
- return mem_cgroup_get_reclaim_stat(sc->mem_cgroup, zone);
+ return mem_cgroup_get_reclaim_stat(sc->current_memcg, zone);

return &zone->reclaim_stat;
}
@@ -172,7 +184,7 @@ static unsigned long zone_nr_lru_pages(struct zone *zone,
struct scan_control *sc, enum lru_list lru)
{
if (!scanning_global_lru(sc))
- return mem_cgroup_zone_nr_pages(sc->mem_cgroup, zone, lru);
+ return mem_cgroup_zone_nr_pages(sc->current_memcg, zone, lru);

return zone_page_state(zone, NR_LRU_BASE + lru);
}
@@ -635,7 +647,7 @@ static enum page_references page_check_references(struct page *page,
int referenced_ptes, referenced_page;
unsigned long vm_flags;

- referenced_ptes = page_referenced(page, 1, sc->mem_cgroup, &vm_flags);
+ referenced_ptes = page_referenced(page, 1, sc->current_memcg, &vm_flags);
referenced_page = TestClearPageReferenced(page);

/* Lumpy reclaim - ignore references */
@@ -1228,7 +1240,7 @@ static int too_many_isolated(struct zone *zone, int file,
if (current_is_kswapd())
return 0;

- if (!scanning_global_lru(sc))
+ if (!global_reclaim(sc))
return 0;

if (file) {
@@ -1397,6 +1409,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
ISOLATE_BOTH : ISOLATE_INACTIVE,
zone, 0, file);
+ } else {
+ nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
+ &page_list, &nr_scanned, sc->order,
+ sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
+ ISOLATE_BOTH : ISOLATE_INACTIVE,
+ zone, sc->current_memcg,
+ 0, file);
+ }
+
+ if (global_reclaim(sc)) {
zone->pages_scanned += nr_scanned;
if (current_is_kswapd())
__count_zone_vm_events(PGSCAN_KSWAPD, zone,
@@ -1404,17 +1426,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
else
__count_zone_vm_events(PGSCAN_DIRECT, zone,
nr_scanned);
- } else {
- nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
- &page_list, &nr_scanned, sc->order,
- sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
- ISOLATE_BOTH : ISOLATE_INACTIVE,
- zone, sc->mem_cgroup,
- 0, file);
- /*
- * mem_cgroup_isolate_pages() keeps track of
- * scanned pages on its own.
- */
}

if (nr_taken == 0) {
@@ -1435,9 +1446,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
}

local_irq_disable();
- if (current_is_kswapd())
- __count_vm_events(KSWAPD_STEAL, nr_reclaimed);
- __count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
+ if (global_reclaim(sc)) {
+ if (current_is_kswapd())
+ __count_vm_events(KSWAPD_STEAL, nr_reclaimed);
+ __count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
+ }

putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);

@@ -1520,18 +1533,16 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
&pgscanned, sc->order,
ISOLATE_ACTIVE, zone,
1, file);
- zone->pages_scanned += pgscanned;
} else {
nr_taken = mem_cgroup_isolate_pages(nr_pages, &l_hold,
&pgscanned, sc->order,
ISOLATE_ACTIVE, zone,
- sc->mem_cgroup, 1, file);
- /*
- * mem_cgroup_isolate_pages() keeps track of
- * scanned pages on its own.
- */
+ sc->current_memcg, 1, file);
}

+ if (global_reclaim(sc))
+ zone->pages_scanned += pgscanned;
+
reclaim_stat->recent_scanned[file] += nr_taken;

__count_zone_vm_events(PGREFILL, zone, pgscanned);
@@ -1552,7 +1563,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
continue;
}

- if (page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) {
+ if (page_referenced(page, 0, sc->current_memcg, &vm_flags)) {
nr_rotated += hpage_nr_pages(page);
/*
* Identify referenced, file-backed active pages and
@@ -1629,7 +1640,7 @@ static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)
if (scanning_global_lru(sc))
low = inactive_anon_is_low_global(zone);
else
- low = mem_cgroup_inactive_anon_is_low(sc->mem_cgroup);
+ low = mem_cgroup_inactive_anon_is_low(sc->current_memcg);
return low;
}
#else
@@ -1672,7 +1683,7 @@ static int inactive_file_is_low(struct zone *zone, struct scan_control *sc)
if (scanning_global_lru(sc))
low = inactive_file_is_low_global(zone);
else
- low = mem_cgroup_inactive_file_is_low(sc->mem_cgroup);
+ low = mem_cgroup_inactive_file_is_low(sc->current_memcg);
return low;
}

@@ -1752,7 +1763,7 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
file = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);

- if (scanning_global_lru(sc)) {
+ if (global_reclaim(sc)) {
free = zone_page_state(zone, NR_FREE_PAGES);
/* If we have very few page cache pages,
force-scan anon pages. */
@@ -1903,6 +1914,8 @@ restart:
nr_scanned = sc->nr_scanned;
get_scan_count(zone, sc, nr, priority);

+ sc->current_memcg = sc->memcg;
+
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
for_each_evictable_lru(l) {
@@ -1941,6 +1954,9 @@ restart:
goto restart;

throttle_vm_writeout(sc->gfp_mask);
+
+ /* For good measure, noone higher up the stack should look at it */
+ sc->current_memcg = NULL;
}

/*
@@ -1973,7 +1989,7 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
* Take care memory controller reclaiming has small influence
* to global LRU.
*/
- if (scanning_global_lru(sc)) {
+ if (global_reclaim(sc)) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
@@ -2038,7 +2054,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
get_mems_allowed();
delayacct_freepages_start();

- if (scanning_global_lru(sc))
+ if (global_reclaim(sc))
count_vm_event(ALLOCSTALL);

for (priority = DEF_PRIORITY; priority >= 0; priority--) {
@@ -2050,7 +2066,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
* Don't shrink slabs when reclaiming memory from
* over limit cgroups
*/
- if (scanning_global_lru(sc)) {
+ if (global_reclaim(sc)) {
unsigned long lru_pages = 0;
for_each_zone_zonelist(zone, z, zonelist,
gfp_zone(sc->gfp_mask)) {
@@ -2111,7 +2127,7 @@ out:
return 0;

/* top priority shrink_zones still had more to do? don't OOM, then */
- if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
+ if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc))
return 1;

return 0;
@@ -2129,7 +2145,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
.may_swap = 1,
.swappiness = vm_swappiness,
.order = order,
- .mem_cgroup = NULL,
+ .memcg = NULL,
.nodemask = nodemask,
};

@@ -2158,7 +2174,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
.may_swap = !noswap,
.swappiness = swappiness,
.order = 0,
- .mem_cgroup = mem,
+ .memcg = mem,
};
sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
@@ -2195,7 +2211,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
.nr_to_reclaim = SWAP_CLUSTER_MAX,
.swappiness = swappiness,
.order = 0,
- .mem_cgroup = mem_cont,
+ .memcg = mem_cont,
.nodemask = NULL, /* we don't care the placement */
};

@@ -2333,7 +2349,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
.nr_to_reclaim = ULONG_MAX,
.swappiness = vm_swappiness,
.order = order,
- .mem_cgroup = NULL,
+ .memcg = NULL,
};
loop_again:
total_scanned = 0;
--
1.7.5.1

2011-05-12 14:56:06

by Johannes Weiner

Subject: [rfc patch 3/6] mm: memcg-aware global reclaim

A page charged to a memcg is linked to an LRU list specific to that
memcg. At the same time, traditional global reclaim is oblivious to
memcgs, and all pages are also linked to a global per-zone list.

This patch changes traditional global reclaim to iterate over all
existing memcgs, so that it no longer relies on the global list being
present.

This is one step forward in integrating memcg code better into the
rest of memory management. It is also a prerequisite to get rid of
the global per-zone lru lists.

RFC:

The algorithm implemented in this patch is very naive. For each zone
scanned at each priority level, it iterates over all existing memcgs
and considers them for scanning.

This is just a prototype and I have not optimized it yet, because I am
unsure about the maximum number of memcgs that still constitutes a sane
configuration relative to the machine size.

It is perfectly fair since all memcgs are scanned at each priority
level.

On my 4G quadcore laptop with 1000 memcgs, a significant amount of CPU
time was spent just iterating memcgs during reclaim. But it cannot
really be claimed that the old code was much better, either: global
LRU reclaim could mean that a few hundred memcgs were emptied out
completely while others stayed untouched.

I am open to solutions that trade fairness against CPU time, but I
don't want an extreme in either direction. One idea would be to break
out early once a number of memcgs have been successfully reclaimed
from, and to remember the last one scanned; see the sketch below.
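
A very rough sketch of that idea (hypothetical, not part of this
series; the cut-off constant and the resume bookkeeping are made up
for illustration) on top of the shrink_zone() loop introduced below:

        static void shrink_zone(int priority, struct zone *zone,
                                struct scan_control *sc)
        {
                struct mem_cgroup *root = sc->memcg;
                struct mem_cgroup *mem = NULL;
                int memcgs_reclaimed = 0;

                do {
                        unsigned long reclaimed = sc->nr_reclaimed;

                        mem_cgroup_hierarchy_walk(root, &mem);
                        sc->current_memcg = mem;
                        do_shrink_zone(priority, zone, sc);
                        /*
                         * Hypothetical cut-off: stop after a handful of
                         * successfully reclaimed memcgs.  Breaking out
                         * here would also have to remember 'mem' as the
                         * resume point for the next invocation and drop
                         * the css reference the walk may still hold.
                         */
                        if (sc->nr_reclaimed > reclaimed &&
                            ++memcgs_reclaimed >= MEMCG_RECLAIM_CUTOFF)
                                break;
                } while (mem != root);

                sc->current_memcg = NULL;
        }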

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 7 ++
mm/memcontrol.c | 148 +++++++++++++++++++++++++++++---------------
mm/vmscan.c | 21 +++++--
3 files changed, 120 insertions(+), 56 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5e9840f5..58728c7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -104,6 +104,7 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
/*
* For memory reclaim.
*/
+void mem_cgroup_hierarchy_walk(struct mem_cgroup *, struct mem_cgroup **);
int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
@@ -289,6 +290,12 @@ static inline bool mem_cgroup_disabled(void)
return true;
}

+static inline void mem_cgroup_hierarchy_walk(struct mem_cgroup *start,
+ struct mem_cgroup **iter)
+{
+ *iter = start;
+}
+
static inline int
mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bf5ab87..edcd55a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -313,7 +313,7 @@ static bool move_file(void)
}

/*
- * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
+ * Maximum loops in mem_cgroup_soft_reclaim(), used for soft
* limit reclaim to prevent infinite loops, if they ever occur.
*/
#define MEM_CGROUP_MAX_RECLAIM_LOOPS (100)
@@ -339,16 +339,6 @@ enum charge_type {
/* Used for OOM nofiier */
#define OOM_CONTROL (0)

-/*
- * Reclaim flags for mem_cgroup_hierarchical_reclaim
- */
-#define MEM_CGROUP_RECLAIM_NOSWAP_BIT 0x0
-#define MEM_CGROUP_RECLAIM_NOSWAP (1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
-#define MEM_CGROUP_RECLAIM_SHRINK_BIT 0x1
-#define MEM_CGROUP_RECLAIM_SHRINK (1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
-#define MEM_CGROUP_RECLAIM_SOFT_BIT 0x2
-#define MEM_CGROUP_RECLAIM_SOFT (1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
-
static void mem_cgroup_get(struct mem_cgroup *mem);
static void mem_cgroup_put(struct mem_cgroup *mem);
static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
@@ -1381,6 +1371,86 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
return min(limit, memsw);
}

+void mem_cgroup_hierarchy_walk(struct mem_cgroup *start,
+ struct mem_cgroup **iter)
+{
+ struct mem_cgroup *mem = *iter;
+ int id;
+
+ if (!start)
+ start = root_mem_cgroup;
+ /*
+ * Even without hierarchy explicitely enabled in the root
+ * memcg, it is the ultimate parent of all memcgs.
+ */
+ if (!(start == root_mem_cgroup || start->use_hierarchy)) {
+ *iter = start;
+ return;
+ }
+
+ if (!mem)
+ id = css_id(&start->css);
+ else {
+ id = css_id(&mem->css);
+ css_put(&mem->css);
+ mem = NULL;
+ }
+
+ do {
+ struct cgroup_subsys_state *css;
+
+ rcu_read_lock();
+ css = css_get_next(&mem_cgroup_subsys, id+1, &start->css, &id);
+ /*
+ * The caller must already have a reference to the
+ * starting point of this hierarchy walk, do not grab
+ * another one. This way, the loop can be finished
+ * when the hierarchy root is returned, without any
+ * further cleanup required.
+ */
+ if (css && (css == &start->css || css_tryget(css)))
+ mem = container_of(css, struct mem_cgroup, css);
+ rcu_read_unlock();
+ if (!css)
+ id = 0;
+ } while (!mem);
+
+ if (mem == root_mem_cgroup)
+ mem = NULL;
+
+ *iter = mem;
+}
+
+static unsigned long mem_cgroup_target_reclaim(struct mem_cgroup *mem,
+ gfp_t gfp_mask,
+ bool noswap,
+ bool shrink)
+{
+ unsigned long total = 0;
+ int loop;
+
+ if (mem->memsw_is_minimum)
+ noswap = true;
+
+ for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
+ drain_all_stock_async();
+ total += try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap,
+ get_swappiness(mem));
+ if (total && shrink)
+ break;
+ if (mem_cgroup_margin(mem))
+ break;
+ /*
+ * If we have not been able to reclaim anything after
+ * two reclaim attempts, there may be no reclaimable
+ * pages under this hierarchy.
+ */
+ if (loop && !total)
+ break;
+ }
+ return total;
+}
+
/*
* Visit the first child (need not be the first child as per the ordering
* of the cgroup list, since we track last_scanned_child) of @mem and use
@@ -1427,21 +1497,16 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
*
* We give up and return to the caller when we visit root_mem twice.
* (other groups can be removed while we're walking....)
- *
- * If shrink==true, for avoiding to free too much, this returns immedieately.
*/
-static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
- struct zone *zone,
- gfp_t gfp_mask,
- unsigned long reclaim_options)
+static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_mem,
+ struct zone *zone,
+ gfp_t gfp_mask)
{
struct mem_cgroup *victim;
int ret, total = 0;
int loop = 0;
- bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
- bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
- bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
unsigned long excess;
+ bool noswap = false;

excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;

@@ -1461,7 +1526,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
* anything, it might because there are
* no reclaimable pages under this hierarchy
*/
- if (!check_soft || !total) {
+ if (!total) {
css_put(&victim->css);
break;
}
@@ -1484,25 +1549,11 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
continue;
}
/* we use swappiness of local cgroup */
- if (check_soft)
- ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
+ ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
noswap, get_swappiness(victim), zone);
- else
- ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
- noswap, get_swappiness(victim));
css_put(&victim->css);
- /*
- * At shrinking usage, we can't check we should stop here or
- * reclaim more. It's depends on callers. last_scanned_child
- * will work enough for keeping fairness under tree.
- */
- if (shrink)
- return ret;
total += ret;
- if (check_soft) {
- if (!res_counter_soft_limit_excess(&root_mem->res))
- return total;
- } else if (mem_cgroup_margin(root_mem))
+ if (!res_counter_soft_limit_excess(&root_mem->res))
return total;
}
return total;
@@ -1897,7 +1948,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
unsigned long csize = nr_pages * PAGE_SIZE;
struct mem_cgroup *mem_over_limit;
struct res_counter *fail_res;
- unsigned long flags = 0;
+ bool noswap = false;
int ret;

ret = res_counter_charge(&mem->res, csize, &fail_res);
@@ -1911,7 +1962,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,

res_counter_uncharge(&mem->res, csize);
mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
- flags |= MEM_CGROUP_RECLAIM_NOSWAP;
+ noswap = true;
} else
mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
/*
@@ -1927,8 +1978,8 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
if (!(gfp_mask & __GFP_WAIT))
return CHARGE_WOULDBLOCK;

- ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
- gfp_mask, flags);
+ ret = mem_cgroup_target_reclaim(mem_over_limit, gfp_mask,
+ noswap, false);
if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
return CHARGE_RETRY;
/*
@@ -3085,7 +3136,7 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,

/*
* A call to try to shrink memory usage on charge failure at shmem's swapin.
- * Calling hierarchical_reclaim is not enough because we should update
+ * Calling target_reclaim is not enough because we should update
* last_oom_jiffies to prevent pagefault_out_of_memory from invoking global OOM.
* Moreover considering hierarchy, we should reclaim from the mem_over_limit,
* not from the memcg which this page would be charged to.
@@ -3167,7 +3218,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
int enlarge;

/*
- * For keeping hierarchical_reclaim simple, how long we should retry
+ * For keeping target_reclaim simple, how long we should retry
* is depends on callers. We set our retry-count to be function
* of # of children which we should visit in this loop.
*/
@@ -3210,8 +3261,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
if (!ret)
break;

- mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
- MEM_CGROUP_RECLAIM_SHRINK);
+ mem_cgroup_target_reclaim(memcg, GFP_KERNEL, false, false);
curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
/* Usage is reduced ? */
if (curusage >= oldusage)
@@ -3269,9 +3319,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
if (!ret)
break;

- mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
- MEM_CGROUP_RECLAIM_NOSWAP |
- MEM_CGROUP_RECLAIM_SHRINK);
+ mem_cgroup_target_reclaim(memcg, GFP_KERNEL, true, false);
curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
/* Usage is reduced ? */
if (curusage >= oldusage)
@@ -3311,9 +3359,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
if (!mz)
break;

- reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
- gfp_mask,
- MEM_CGROUP_RECLAIM_SOFT);
+ reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone, gfp_mask);
nr_reclaimed += reclaimed;
spin_lock(&mctz->lock);

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ceeb2a5..e2a3647 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1900,8 +1900,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
-static void shrink_zone(int priority, struct zone *zone,
- struct scan_control *sc)
+static void do_shrink_zone(int priority, struct zone *zone,
+ struct scan_control *sc)
{
unsigned long nr[NR_LRU_LISTS];
unsigned long nr_to_scan;
@@ -1914,8 +1914,6 @@ restart:
nr_scanned = sc->nr_scanned;
get_scan_count(zone, sc, nr, priority);

- sc->current_memcg = sc->memcg;
-
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
for_each_evictable_lru(l) {
@@ -1954,6 +1952,19 @@ restart:
goto restart;

throttle_vm_writeout(sc->gfp_mask);
+}
+
+static void shrink_zone(int priority, struct zone *zone,
+ struct scan_control *sc)
+{
+ struct mem_cgroup *root = sc->memcg;
+ struct mem_cgroup *mem = NULL;
+
+ do {
+ mem_cgroup_hierarchy_walk(root, &mem);
+ sc->current_memcg = mem;
+ do_shrink_zone(priority, zone, sc);
+ } while (mem != root);

/* For good measure, noone higher up the stack should look at it */
sc->current_memcg = NULL;
@@ -2190,7 +2201,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
* will pick up pages from other mem cgroup's as well. We hack
* the priority and make it zero.
*/
- shrink_zone(0, zone, &sc);
+ do_shrink_zone(0, zone, &sc);

trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);

--
1.7.5.1

2011-05-12 14:56:36

by Johannes Weiner

Subject: [rfc patch 4/6] memcg: reclaim statistics

TODO: write proper changelog. Here is an excerpt from
http://lkml.kernel.org/r/[email protected]:

: 1. Limit-triggered direct reclaim
:
: The memory cgroup hits its limit and the task does direct reclaim from
: its own memcg. We probably want statistics for this separately from
: background reclaim to see how successful background reclaim is, the
: same reason we have this separation in the global vmstat as well.
:
: pgscan_direct_limit
: pgfree_direct_limit
:
: 2. Limit-triggered background reclaim
:
: This is the watermark-based asynchronous reclaim that is currently in
: discussion. It's triggered by the memcg breaching its watermark,
: which is relative to its hard limit. I named it kswapd because I
: still think kswapd should do this job, but it is all open for
: discussion, obviously. Treat it as meaning 'background' or
: 'asynchronous'.
:
: pgscan_kswapd_limit
: pgfree_kswapd_limit
:
: 3. Hierarchy-triggered direct reclaim
:
: A condition outside the memcg leads to a task directly reclaiming from
: this memcg. This could be global memory pressure for example, but
: also a parent cgroup hitting its limit. It's probably helpful to
: think of global memory pressure as the root cgroup hitting its
: limit, conceptually. We don't have that yet, but this could be the
: direct softlimit reclaim Ying mentioned above.
:
: pgscan_direct_hierarchy
: pgsteal_direct_hierarchy
:
: 4. Hierarchy-triggered background reclaim
:
: An outside condition leads to kswapd reclaiming from this memcg, like
: kswapd doing softlimit pushback due to global memory pressure.
:
: pgscan_kswapd_hierarchy
: pgsteal_kswapd_hierarchy
:
: ---
:
: With these stats in place, you can see how much pressure there is on
: your memcg hierarchy. This includes machine utilization and whether
: you overcommitted too much on a global level, which shows up as a lot
: of reclaim activity in the hierarchical stats.
:
: With the limit-based stats, you can see the amount of internal
: pressure of memcgs, which shows you if you overcommitted on a local
: level.
:
: And for both cases, you can also see the effectiveness of background
: reclaim by comparing the direct and the kswapd stats.
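
The four cases map onto per-memcg event counters via simple flag
arithmetic in this patch; a condensed illustration of the scheme from
the hunks below:

        #define RECLAIM_RECLAIMED	1  /* pgfree/pgsteal as opposed to pgscan */
        #define RECLAIM_HIERARCHY	2  /* outside pressure as opposed to own limit */
        #define RECLAIM_KSWAPD		4  /* background as opposed to direct reclaim */

        /* Example: kswapd doing hierarchical (softlimit) pushback: */
        unsigned int base = RECLAIM_BASE + RECLAIM_KSWAPD + RECLAIM_HIERARCHY;
        /* base == PGSCAN_KSWAPD_HIERARCHY */
        /* base + RECLAIM_RECLAIMED == PGSTEAL_KSWAPD_HIERARCHY */

mem_cgroup_count_reclaim() then bumps events[base] by the number of
pages scanned and events[base + RECLAIM_RECLAIMED] by the number of
pages reclaimed.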

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 9 ++++++
mm/memcontrol.c | 63 ++++++++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 7 +++++
3 files changed, 79 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 58728c7..a4c84db 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -105,6 +105,8 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
* For memory reclaim.
*/
void mem_cgroup_hierarchy_walk(struct mem_cgroup *, struct mem_cgroup **);
+void mem_cgroup_count_reclaim(struct mem_cgroup *, bool, bool,
+ unsigned long, unsigned long);
int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
@@ -296,6 +298,13 @@ static inline void mem_cgroup_hierarchy_walk(struct mem_cgroup *start,
*iter = start;
}

+static inline void mem_cgroup_count_reclaim(struct mem_cgroup *mem,
+ bool kswapd, bool hierarchy,
+ unsigned long scanned,
+ unsigned long reclaimed)
+{
+}
+
static inline int
mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index edcd55a..d762706 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -90,10 +90,24 @@ enum mem_cgroup_stat_index {
MEM_CGROUP_STAT_NSTATS,
};

+#define RECLAIM_RECLAIMED 1
+#define RECLAIM_HIERARCHY 2
+#define RECLAIM_KSWAPD 4
+
enum mem_cgroup_events_index {
MEM_CGROUP_EVENTS_PGPGIN, /* # of pages paged in */
MEM_CGROUP_EVENTS_PGPGOUT, /* # of pages paged out */
MEM_CGROUP_EVENTS_COUNT, /* # of pages paged in/out */
+ RECLAIM_BASE,
+ PGSCAN_DIRECT_LIMIT = RECLAIM_BASE,
+ PGFREE_DIRECT_LIMIT = RECLAIM_BASE + RECLAIM_RECLAIMED,
+ PGSCAN_DIRECT_HIERARCHY = RECLAIM_BASE + RECLAIM_HIERARCHY,
+ PGSTEAL_DIRECT_HIERARCHY = RECLAIM_BASE + RECLAIM_HIERARCHY + RECLAIM_RECLAIMED,
+ /* you know the drill... */
+ PGSCAN_KSWAPD_LIMIT,
+ PGFREE_KSWAPD_LIMIT,
+ PGSCAN_KSWAPD_HIERARCHY,
+ PGSTEAL_KSWAPD_HIERARCHY,
MEM_CGROUP_EVENTS_NSTATS,
};
/*
@@ -575,6 +589,23 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_SWAPOUT], val);
}

+void mem_cgroup_count_reclaim(struct mem_cgroup *mem,
+ bool kswapd, bool hierarchy,
+ unsigned long scanned, unsigned long reclaimed)
+{
+ unsigned int base = RECLAIM_BASE;
+
+ if (!mem)
+ mem = root_mem_cgroup;
+ if (kswapd)
+ base += RECLAIM_KSWAPD;
+ if (hierarchy)
+ base += RECLAIM_HIERARCHY;
+
+ this_cpu_add(mem->stat->events[base], scanned);
+ this_cpu_add(mem->stat->events[base + RECLAIM_RECLAIMED], reclaimed);
+}
+
static unsigned long mem_cgroup_read_events(struct mem_cgroup *mem,
enum mem_cgroup_events_index idx)
{
@@ -3817,6 +3848,14 @@ enum {
MCS_FILE_MAPPED,
MCS_PGPGIN,
MCS_PGPGOUT,
+ MCS_PGSCAN_DIRECT_LIMIT,
+ MCS_PGFREE_DIRECT_LIMIT,
+ MCS_PGSCAN_DIRECT_HIERARCHY,
+ MCS_PGSTEAL_DIRECT_HIERARCHY,
+ MCS_PGSCAN_KSWAPD_LIMIT,
+ MCS_PGFREE_KSWAPD_LIMIT,
+ MCS_PGSCAN_KSWAPD_HIERARCHY,
+ MCS_PGSTEAL_KSWAPD_HIERARCHY,
MCS_SWAP,
MCS_INACTIVE_ANON,
MCS_ACTIVE_ANON,
@@ -3839,6 +3878,14 @@ struct {
{"mapped_file", "total_mapped_file"},
{"pgpgin", "total_pgpgin"},
{"pgpgout", "total_pgpgout"},
+ {"pgscan_direct_limit", "total_pgscan_direct_limit"},
+ {"pgfree_direct_limit", "total_pgfree_direct_limit"},
+ {"pgscan_direct_hierarchy", "total_pgscan_direct_hierarchy"},
+ {"pgsteal_direct_hierarchy", "total_pgsteal_direct_hierarchy"},
+ {"pgscan_kswapd_limit", "total_pgscan_kswapd_limit"},
+ {"pgfree_kswapd_limit", "total_pgfree_kswapd_limit"},
+ {"pgscan_kswapd_hierarchy", "total_pgscan_kswapd_hierarchy"},
+ {"pgsteal_kswapd_hierarchy", "total_pgsteal_kswapd_hierarchy"},
{"swap", "total_swap"},
{"inactive_anon", "total_inactive_anon"},
{"active_anon", "total_active_anon"},
@@ -3864,6 +3911,22 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
s->stat[MCS_PGPGIN] += val;
val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGPGOUT);
s->stat[MCS_PGPGOUT] += val;
+ val = mem_cgroup_read_events(mem, PGSCAN_DIRECT_LIMIT);
+ s->stat[MCS_PGSCAN_DIRECT_LIMIT] += val;
+ val = mem_cgroup_read_events(mem, PGFREE_DIRECT_LIMIT);
+ s->stat[MCS_PGFREE_DIRECT_LIMIT] += val;
+ val = mem_cgroup_read_events(mem, PGSCAN_DIRECT_HIERARCHY);
+ s->stat[MCS_PGSCAN_DIRECT_HIERARCHY] += val;
+ val = mem_cgroup_read_events(mem, PGSTEAL_DIRECT_HIERARCHY);
+ s->stat[MCS_PGSTEAL_DIRECT_HIERARCHY] += val;
+ val = mem_cgroup_read_events(mem, PGSCAN_KSWAPD_LIMIT);
+ s->stat[MCS_PGSCAN_KSWAPD_LIMIT] += val;
+ val = mem_cgroup_read_events(mem, PGFREE_KSWAPD_LIMIT);
+ s->stat[MCS_PGFREE_KSWAPD_LIMIT] += val;
+ val = mem_cgroup_read_events(mem, PGSCAN_KSWAPD_HIERARCHY);
+ s->stat[MCS_PGSCAN_KSWAPD_HIERARCHY] += val;
+ val = mem_cgroup_read_events(mem, PGSTEAL_KSWAPD_HIERARCHY);
+ s->stat[MCS_PGSTEAL_KSWAPD_HIERARCHY] += val;
if (do_swap_account) {
val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
s->stat[MCS_SWAP] += val * PAGE_SIZE;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e2a3647..0e45ceb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1961,9 +1961,16 @@ static void shrink_zone(int priority, struct zone *zone,
struct mem_cgroup *mem = NULL;

do {
+ unsigned long reclaimed = sc->nr_reclaimed;
+ unsigned long scanned = sc->nr_scanned;
+
mem_cgroup_hierarchy_walk(root, &mem);
sc->current_memcg = mem;
do_shrink_zone(priority, zone, sc);
+ mem_cgroup_count_reclaim(mem, current_is_kswapd(),
+ mem != root, /* limit or hierarchy? */
+ sc->nr_scanned - scanned,
+ sc->nr_reclaimed - reclaimed);
} while (mem != root);

/* For good measure, noone higher up the stack should look at it */
--
1.7.5.1

2011-05-12 14:55:10

by Johannes Weiner

Subject: [rfc patch 5/6] memcg: remove global LRU list

Since the VM now has means to do global reclaim from the per-memcg lru
lists, the global LRU list is no longer required.

This saves two linked list pointers per page, since all pages are now
on only one list. Also, the memcg LRU lists now link pages directly
instead of page_cgroup descriptors, which gets rid of having to find
the way back from a page_cgroup to its page.

A big change in behaviour is that pages are no longer aged on a global
level. Instead, they are aged with respect to the other pages in the
same memcg, where the aging speed is determined by global memory
pressure and the size of the memcg itself.

[ TO EVALUATE: this should bring more fairness to reclaim in setups
with differently sized memcgs, and distribute pressure proportionally
among memcgs instead of reclaiming only from the one that has the
oldest pages on a global level. There is potential for unfairness if
unused pages are hiding in small memcgs that are never scanned while
reclaim goes only after a single, much bigger memcg. The severity of
this also scales with the number of memcgs relative to the amount of
physical memory, so it again boils down to the question of what the
sane maximum number of memcgs on the system is. ]

The patch introduces an lruvec structure that exists for both global
zones and for each zone per memcg. All lru operations are now done in
generic code, with the memcg lru primitives only doing accounting and
returning the proper lruvec for the currently scanned memcg on
isolation, or for the respective page on putback.

The code that scans and rescues unevictable pages in a specific zone
had to be converted to iterate over all memcgs as well.
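
The resulting pattern for generic LRU code looks roughly like this (a
condensed sketch of the lruvec and add_page_to_lru_list() changes
further down, not additional code):

        struct lruvec {
                struct list_head lists[NR_LRU_LISTS];
        };

        static inline void
        add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
        {
                struct lruvec *lruvec;

                /* memcg does the accounting and returns the right list set... */
                lruvec = mem_cgroup_lru_add_list(zone, page, l);
                /* ...while the linking itself happens on struct page. */
                list_add(&page->lru, &lruvec->lists[l]);
                __mod_zone_page_state(zone, NR_LRU_BASE + l, hpage_nr_pages(page));
        }

With !CONFIG_CGROUP_MEM_RES_CTLR or memcg disabled,
mem_cgroup_lru_add_list() simply returns &zone->lruvec, so the global
case falls out naturally.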

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 52 ++++-----
include/linux/mm_inline.h | 15 ++-
include/linux/mmzone.h | 10 +-
include/linux/page_cgroup.h | 35 ------
mm/memcontrol.c | 251 +++++++++++++++---------------------------
mm/page_alloc.c | 2 +-
mm/page_cgroup.c | 39 +------
mm/swap.c | 20 ++--
mm/vmscan.c | 149 ++++++++++++--------------
9 files changed, 213 insertions(+), 360 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a4c84db..65163c2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -20,6 +20,7 @@
#ifndef _LINUX_MEMCONTROL_H
#define _LINUX_MEMCONTROL_H
#include <linux/cgroup.h>
+#include <linux/mmzone.h>
struct mem_cgroup;
struct page_cgroup;
struct page;
@@ -30,13 +31,6 @@ enum mem_cgroup_page_stat_item {
MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
};

-extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
- struct list_head *dst,
- unsigned long *scanned, int order,
- int mode, struct zone *z,
- struct mem_cgroup *mem_cont,
- int active, int file);
-
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
/*
* All "charge" functions with gfp_mask should use GFP_KERNEL or
@@ -60,13 +54,13 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);

extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
-extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
-extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
-extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
-extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
-extern void mem_cgroup_del_lru(struct page *page);
-extern void mem_cgroup_move_lists(struct page *page,
- enum lru_list from, enum lru_list to);
+struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
+struct lruvec *mem_cgroup_lru_add_list(struct zone *, struct page *,
+ enum lru_list);
+void mem_cgroup_lru_del_list(struct zone *, struct page *, enum lru_list);
+void mem_cgroup_lru_del(struct zone *, struct page *);
+struct lruvec *mem_cgroup_lru_move_lists(struct zone *, struct page *,
+ enum lru_list, enum lru_list);

/* For coalescing uncharge for reducing memcg' overhead*/
extern void mem_cgroup_uncharge_start(void);
@@ -210,33 +204,35 @@ static inline int mem_cgroup_shmem_charge_fallback(struct page *page,
return 0;
}

-static inline void mem_cgroup_add_lru_list(struct page *page, int lru)
+static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
+ struct mem_cgroup *mem)
{
+ return &zone->lruvec;
}

-static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
+static inline struct lruvec *mem_cgroup_lru_add_list(struct zone *zone,
+ struct page *page,
+ enum lru_list lru)
{
- return ;
+ return &zone->lruvec;
}

-static inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
+static inline void mem_cgroup_lru_del_list(struct zone *zone,
+ struct page *page,
+ enum lru_list lru)
{
- return ;
}

-static inline void mem_cgroup_rotate_lru_list(struct page *page, int lru)
+static inline void mem_cgroup_lru_del(struct zone *zone, struct page *page)
{
- return ;
}

-static inline void mem_cgroup_del_lru(struct page *page)
-{
- return ;
-}
-
-static inline void
-mem_cgroup_move_lists(struct page *page, enum lru_list from, enum lru_list to)
+static inline struct lruvec *mem_cgroup_lru_move_lists(struct zone *zone,
+ struct page *page,
+ enum lru_list from,
+ enum lru_list to)
{
+ return &zone->lruvec;
}

static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 8f7d247..ca794f3 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -25,23 +25,28 @@ static inline void
__add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
struct list_head *head)
{
+ /* NOTE! Caller must ensure @head is on the right lruvec! */
+ mem_cgroup_lru_add_list(zone, page, l);
list_add(&page->lru, head);
__mod_zone_page_state(zone, NR_LRU_BASE + l, hpage_nr_pages(page));
- mem_cgroup_add_lru_list(page, l);
}

static inline void
add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
{
- __add_page_to_lru_list(zone, page, l, &zone->lru[l].list);
+ struct lruvec *lruvec;
+
+ lruvec = mem_cgroup_lru_add_list(zone, page, l);
+ list_add(&page->lru, &lruvec->lists[l]);
+ __mod_zone_page_state(zone, NR_LRU_BASE + l, hpage_nr_pages(page));
}

static inline void
del_page_from_lru_list(struct zone *zone, struct page *page, enum lru_list l)
{
+ mem_cgroup_lru_del_list(zone, page, l);
list_del(&page->lru);
__mod_zone_page_state(zone, NR_LRU_BASE + l, -hpage_nr_pages(page));
- mem_cgroup_del_lru_list(page, l);
}

/**
@@ -64,7 +69,6 @@ del_page_from_lru(struct zone *zone, struct page *page)
{
enum lru_list l;

- list_del(&page->lru);
if (PageUnevictable(page)) {
__ClearPageUnevictable(page);
l = LRU_UNEVICTABLE;
@@ -75,8 +79,9 @@ del_page_from_lru(struct zone *zone, struct page *page)
l += LRU_ACTIVE;
}
}
+ mem_cgroup_lru_del_list(zone, page, l);
+ list_del(&page->lru);
__mod_zone_page_state(zone, NR_LRU_BASE + l, -hpage_nr_pages(page));
- mem_cgroup_del_lru_list(page, l);
}

/**
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e56f835..c2ddce5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -158,6 +158,10 @@ static inline int is_unevictable_lru(enum lru_list l)
return (l == LRU_UNEVICTABLE);
}

+struct lruvec {
+ struct list_head lists[NR_LRU_LISTS];
+};
+
enum zone_watermarks {
WMARK_MIN,
WMARK_LOW,
@@ -344,10 +348,8 @@ struct zone {
ZONE_PADDING(_pad1_)

/* Fields commonly accessed by the page reclaim scanner */
- spinlock_t lru_lock;
- struct zone_lru {
- struct list_head list;
- } lru[NR_LRU_LISTS];
+ spinlock_t lru_lock;
+ struct lruvec lruvec;

struct zone_reclaim_stat reclaim_stat;

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 961ecc7..2e7cbc5 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -31,7 +31,6 @@ enum {
struct page_cgroup {
unsigned long flags;
struct mem_cgroup *mem_cgroup;
- struct list_head lru; /* per cgroup LRU list */
};

void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
@@ -49,7 +48,6 @@ static inline void __init page_cgroup_init(void)
#endif

struct page_cgroup *lookup_page_cgroup(struct page *page);
-struct page *lookup_cgroup_page(struct page_cgroup *pc);

#define TESTPCGFLAG(uname, lname) \
static inline int PageCgroup##uname(struct page_cgroup *pc) \
@@ -122,39 +120,6 @@ static inline void move_unlock_page_cgroup(struct page_cgroup *pc,
local_irq_restore(*flags);
}

-#ifdef CONFIG_SPARSEMEM
-#define PCG_ARRAYID_WIDTH SECTIONS_SHIFT
-#else
-#define PCG_ARRAYID_WIDTH NODES_SHIFT
-#endif
-
-#if (PCG_ARRAYID_WIDTH > BITS_PER_LONG - NR_PCG_FLAGS)
-#error Not enough space left in pc->flags to store page_cgroup array IDs
-#endif
-
-/* pc->flags: ARRAY-ID | FLAGS */
-
-#define PCG_ARRAYID_MASK ((1UL << PCG_ARRAYID_WIDTH) - 1)
-
-#define PCG_ARRAYID_OFFSET (BITS_PER_LONG - PCG_ARRAYID_WIDTH)
-/*
- * Zero the shift count for non-existent fields, to prevent compiler
- * warnings and ensure references are optimized away.
- */
-#define PCG_ARRAYID_SHIFT (PCG_ARRAYID_OFFSET * (PCG_ARRAYID_WIDTH != 0))
-
-static inline void set_page_cgroup_array_id(struct page_cgroup *pc,
- unsigned long id)
-{
- pc->flags &= ~(PCG_ARRAYID_MASK << PCG_ARRAYID_SHIFT);
- pc->flags |= (id & PCG_ARRAYID_MASK) << PCG_ARRAYID_SHIFT;
-}
-
-static inline unsigned long page_cgroup_array_id(struct page_cgroup *pc)
-{
- return (pc->flags >> PCG_ARRAYID_SHIFT) & PCG_ARRAYID_MASK;
-}
-
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct page_cgroup;

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d762706..f5d90ba 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -134,10 +134,7 @@ struct mem_cgroup_stat_cpu {
* per-zone information in memory controller.
*/
struct mem_cgroup_per_zone {
- /*
- * spin_lock to protect the per cgroup LRU
- */
- struct list_head lists[NR_LRU_LISTS];
+ struct lruvec lruvec;
unsigned long count[NR_LRU_LISTS];

struct zone_reclaim_stat reclaim_stat;
@@ -834,6 +831,24 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
return (mem == root_mem_cgroup);
}

+struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone, struct mem_cgroup *mem)
+{
+ struct mem_cgroup_per_zone *mz;
+ int nid, zid;
+
+ /* Pages are on the zone's own lru lists */
+ if (mem_cgroup_disabled())
+ return &zone->lruvec;
+
+ if (!mem)
+ mem = root_mem_cgroup;
+
+ nid = zone_to_nid(zone);
+ zid = zone_idx(zone);
+ mz = mem_cgroup_zoneinfo(mem, nid, zid);
+ return &mz->lruvec;
+}
+
/*
* Following LRU functions are allowed to be used without PCG_LOCK.
* Operations are called by routine of global LRU independently from memcg.
@@ -848,10 +863,43 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
* When moving account, the page is not on LRU. It's isolated.
*/

-void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
+struct lruvec *mem_cgroup_lru_add_list(struct zone *zone, struct page *page,
+ enum lru_list lru)
{
+ struct mem_cgroup_per_zone *mz;
struct page_cgroup *pc;
+ struct mem_cgroup *mem;
+
+ if (mem_cgroup_disabled())
+ return &zone->lruvec;
+
+ pc = lookup_page_cgroup(page);
+ VM_BUG_ON(PageCgroupAcctLRU(pc));
+ if (PageCgroupUsed(pc)) {
+ /* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
+ smp_rmb();
+ mem = pc->mem_cgroup;
+ } else {
+ /*
+ * If the page is uncharged, add it to the root's lru.
+ * Either it will be freed soon, or it will get
+ * charged again and the charger will relink it to the
+ * right list.
+ */
+ mem = root_mem_cgroup;
+ }
+ mz = page_cgroup_zoneinfo(mem, page);
+ /* huge page split is done under lru_lock. so, we have no races. */
+ MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
+ SetPageCgroupAcctLRU(pc);
+ return &mz->lruvec;
+}
+
+void mem_cgroup_lru_del_list(struct zone *zone, struct page *page,
+ enum lru_list lru)
+{
struct mem_cgroup_per_zone *mz;
+ struct page_cgroup *pc;

if (mem_cgroup_disabled())
return;
@@ -867,83 +915,21 @@ void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
/* huge page split is done under lru_lock. so, we have no races. */
MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
- if (mem_cgroup_is_root(pc->mem_cgroup))
- return;
- VM_BUG_ON(list_empty(&pc->lru));
- list_del_init(&pc->lru);
}

-void mem_cgroup_del_lru(struct page *page)
+void mem_cgroup_lru_del(struct zone *zone, struct page *page)
{
- mem_cgroup_del_lru_list(page, page_lru(page));
+ mem_cgroup_lru_del_list(zone, page, page_lru(page));
}

-/*
- * Writeback is about to end against a page which has been marked for immediate
- * reclaim. If it still appears to be reclaimable, move it to the tail of the
- * inactive list.
- */
-void mem_cgroup_rotate_reclaimable_page(struct page *page)
+struct lruvec *mem_cgroup_lru_move_lists(struct zone *zone,
+ struct page *page,
+ enum lru_list from,
+ enum lru_list to)
{
- struct mem_cgroup_per_zone *mz;
- struct page_cgroup *pc;
- enum lru_list lru = page_lru(page);
-
- if (mem_cgroup_disabled())
- return;
-
- pc = lookup_page_cgroup(page);
- /* unused or root page is not rotated. */
- if (!PageCgroupUsed(pc))
- return;
- /* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
- smp_rmb();
- if (mem_cgroup_is_root(pc->mem_cgroup))
- return;
- mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
- list_move_tail(&pc->lru, &mz->lists[lru]);
-}
-
-void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
-{
- struct mem_cgroup_per_zone *mz;
- struct page_cgroup *pc;
-
- if (mem_cgroup_disabled())
- return;
-
- pc = lookup_page_cgroup(page);
- /* unused or root page is not rotated. */
- if (!PageCgroupUsed(pc))
- return;
- /* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
- smp_rmb();
- if (mem_cgroup_is_root(pc->mem_cgroup))
- return;
- mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
- list_move(&pc->lru, &mz->lists[lru]);
-}
-
-void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
-{
- struct page_cgroup *pc;
- struct mem_cgroup_per_zone *mz;
-
- if (mem_cgroup_disabled())
- return;
- pc = lookup_page_cgroup(page);
- VM_BUG_ON(PageCgroupAcctLRU(pc));
- if (!PageCgroupUsed(pc))
- return;
- /* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
- smp_rmb();
- mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
- /* huge page split is done under lru_lock. so, we have no races. */
- MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
- SetPageCgroupAcctLRU(pc);
- if (mem_cgroup_is_root(pc->mem_cgroup))
- return;
- list_add(&pc->lru, &mz->lists[lru]);
+ /* TODO: could be optimized, especially if from == to */
+ mem_cgroup_lru_del_list(zone, page, from);
+ return mem_cgroup_lru_add_list(zone, page, to);
}

/*
@@ -975,7 +961,7 @@ static void mem_cgroup_lru_del_before_commit(struct page *page)
* is guarded by lock_page() because the page is SwapCache.
*/
if (!PageCgroupUsed(pc))
- mem_cgroup_del_lru_list(page, page_lru(page));
+ del_page_from_lru(zone, page);
spin_unlock_irqrestore(&zone->lru_lock, flags);
}

@@ -989,22 +975,11 @@ static void mem_cgroup_lru_add_after_commit(struct page *page)
if (likely(!PageLRU(page)))
return;
spin_lock_irqsave(&zone->lru_lock, flags);
- /* link when the page is linked to LRU but page_cgroup isn't */
if (PageLRU(page) && !PageCgroupAcctLRU(pc))
- mem_cgroup_add_lru_list(page, page_lru(page));
+ add_page_to_lru_list(zone, page, page_lru(page));
spin_unlock_irqrestore(&zone->lru_lock, flags);
}

-
-void mem_cgroup_move_lists(struct page *page,
- enum lru_list from, enum lru_list to)
-{
- if (mem_cgroup_disabled())
- return;
- mem_cgroup_del_lru_list(page, from);
- mem_cgroup_add_lru_list(page, to);
-}
-
int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
{
int ret;
@@ -1063,6 +1038,9 @@ int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
unsigned long present_pages[2];
unsigned long inactive_ratio;

+ if (!memcg)
+ memcg = root_mem_cgroup;
+
inactive_ratio = calc_inactive_ratio(memcg, present_pages);

inactive = present_pages[0];
@@ -1079,6 +1057,9 @@ int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
unsigned long active;
unsigned long inactive;

+ if (!memcg)
+ memcg = root_mem_cgroup;
+
inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);

@@ -1091,8 +1072,12 @@ unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
{
int nid = zone_to_nid(zone);
int zid = zone_idx(zone);
- struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+ struct mem_cgroup_per_zone *mz;
+
+ if (!memcg)
+ memcg = root_mem_cgroup;

+ mz = mem_cgroup_zoneinfo(memcg, nid, zid);
return MEM_CGROUP_ZSTAT(mz, lru);
}

@@ -1101,8 +1086,12 @@ struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
{
int nid = zone_to_nid(zone);
int zid = zone_idx(zone);
- struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+ struct mem_cgroup_per_zone *mz;
+
+ if (!memcg)
+ memcg = root_mem_cgroup;

+ mz = mem_cgroup_zoneinfo(memcg, nid, zid);
return &mz->reclaim_stat;
}

@@ -1124,67 +1113,6 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
return &mz->reclaim_stat;
}

-unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
- struct list_head *dst,
- unsigned long *scanned, int order,
- int mode, struct zone *z,
- struct mem_cgroup *mem_cont,
- int active, int file)
-{
- unsigned long nr_taken = 0;
- struct page *page;
- unsigned long scan;
- LIST_HEAD(pc_list);
- struct list_head *src;
- struct page_cgroup *pc, *tmp;
- int nid = zone_to_nid(z);
- int zid = zone_idx(z);
- struct mem_cgroup_per_zone *mz;
- int lru = LRU_FILE * file + active;
- int ret;
-
- BUG_ON(!mem_cont);
- mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
- src = &mz->lists[lru];
-
- scan = 0;
- list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
- if (scan >= nr_to_scan)
- break;
-
- if (unlikely(!PageCgroupUsed(pc)))
- continue;
-
- page = lookup_cgroup_page(pc);
-
- if (unlikely(!PageLRU(page)))
- continue;
-
- scan++;
- ret = __isolate_lru_page(page, mode, file);
- switch (ret) {
- case 0:
- list_move(&page->lru, dst);
- mem_cgroup_del_lru(page);
- nr_taken += hpage_nr_pages(page);
- break;
- case -EBUSY:
- /* we don't affect global LRU but rotate in our LRU */
- mem_cgroup_rotate_lru_list(page, page_lru(page));
- break;
- default:
- break;
- }
- }
-
- *scanned = scan;
-
- trace_mm_vmscan_memcg_isolate(0, nr_to_scan, scan, nr_taken,
- 0, 0, 0, mode);
-
- return nr_taken;
-}
-
#define mem_cgroup_from_res_counter(counter, member) \
container_of(counter, struct mem_cgroup, member)

@@ -3458,22 +3386,23 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
int node, int zid, enum lru_list lru)
{
- struct zone *zone;
struct mem_cgroup_per_zone *mz;
- struct page_cgroup *pc, *busy;
unsigned long flags, loop;
struct list_head *list;
+ struct page *busy;
+ struct zone *zone;
int ret = 0;

zone = &NODE_DATA(node)->node_zones[zid];
mz = mem_cgroup_zoneinfo(mem, node, zid);
- list = &mz->lists[lru];
+ list = &mz->lruvec.lists[lru];

loop = MEM_CGROUP_ZSTAT(mz, lru);
/* give some margin against EBUSY etc...*/
loop += 256;
busy = NULL;
while (loop--) {
+ struct page_cgroup *pc;
struct page *page;

ret = 0;
@@ -3482,16 +3411,16 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
spin_unlock_irqrestore(&zone->lru_lock, flags);
break;
}
- pc = list_entry(list->prev, struct page_cgroup, lru);
- if (busy == pc) {
- list_move(&pc->lru, list);
+ page = list_entry(list->prev, struct page, lru);
+ if (busy == page) {
+ list_move(&page->lru, list);
busy = NULL;
spin_unlock_irqrestore(&zone->lru_lock, flags);
continue;
}
spin_unlock_irqrestore(&zone->lru_lock, flags);

- page = lookup_cgroup_page(pc);
+ pc = lookup_page_cgroup(page);

ret = mem_cgroup_move_parent(page, pc, mem, GFP_KERNEL);
if (ret == -ENOMEM)
@@ -3499,7 +3428,7 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,

if (ret == -EBUSY || ret == -EINVAL) {
/* found lock contention or "pc" is obsolete. */
- busy = pc;
+ busy = page;
cond_resched();
} else
busy = NULL;
@@ -4519,7 +4448,7 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
for (zone = 0; zone < MAX_NR_ZONES; zone++) {
mz = &pn->zoneinfo[zone];
for_each_lru(l)
- INIT_LIST_HEAD(&mz->lists[l]);
+ INIT_LIST_HEAD(&mz->lruvec.lists[l]);
mz->usage_in_excess = 0;
mz->on_tree = false;
mz->mem = mem;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9f8a97b..4099e8c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4262,7 +4262,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,

zone_pcp_init(zone);
for_each_lru(l) {
- INIT_LIST_HEAD(&zone->lru[l].list);
+ INIT_LIST_HEAD(&zone->lruvec.lists[l]);
zone->reclaim_stat.nr_saved_scan[l] = 0;
}
zone->reclaim_stat.recent_rotated[0] = 0;
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 9905501..313e1d7 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -11,12 +11,10 @@
#include <linux/swapops.h>
#include <linux/kmemleak.h>

-static void __meminit init_page_cgroup(struct page_cgroup *pc, unsigned long id)
+static void __meminit init_page_cgroup(struct page_cgroup *pc)
{
pc->flags = 0;
- set_page_cgroup_array_id(pc, id);
pc->mem_cgroup = NULL;
- INIT_LIST_HEAD(&pc->lru);
}
static unsigned long total_usage;

@@ -42,19 +40,6 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
return base + offset;
}

-struct page *lookup_cgroup_page(struct page_cgroup *pc)
-{
- unsigned long pfn;
- struct page *page;
- pg_data_t *pgdat;
-
- pgdat = NODE_DATA(page_cgroup_array_id(pc));
- pfn = pc - pgdat->node_page_cgroup + pgdat->node_start_pfn;
- page = pfn_to_page(pfn);
- VM_BUG_ON(pc != lookup_page_cgroup(page));
- return page;
-}
-
static int __init alloc_node_page_cgroup(int nid)
{
struct page_cgroup *base, *pc;
@@ -75,7 +60,7 @@ static int __init alloc_node_page_cgroup(int nid)
return -ENOMEM;
for (index = 0; index < nr_pages; index++) {
pc = base + index;
- init_page_cgroup(pc, nid);
+ init_page_cgroup(pc);
}
NODE_DATA(nid)->node_page_cgroup = base;
total_usage += table_size;
@@ -117,19 +102,6 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
return section->page_cgroup + pfn;
}

-struct page *lookup_cgroup_page(struct page_cgroup *pc)
-{
- struct mem_section *section;
- struct page *page;
- unsigned long nr;
-
- nr = page_cgroup_array_id(pc);
- section = __nr_to_section(nr);
- page = pfn_to_page(pc - section->page_cgroup);
- VM_BUG_ON(pc != lookup_page_cgroup(page));
- return page;
-}
-
static void *__init_refok alloc_page_cgroup(size_t size, int nid)
{
void *addr = NULL;
@@ -167,12 +139,9 @@ static int __init_refok init_section_page_cgroup(unsigned long pfn)
struct page_cgroup *base, *pc;
struct mem_section *section;
unsigned long table_size;
- unsigned long nr;
int nid, index;

- nr = pfn_to_section_nr(pfn);
- section = __nr_to_section(nr);
-
+ section = __pfn_to_section(pfn);
if (section->page_cgroup)
return 0;

@@ -194,7 +163,7 @@ static int __init_refok init_section_page_cgroup(unsigned long pfn)

for (index = 0; index < PAGES_PER_SECTION; index++) {
pc = base + index;
- init_page_cgroup(pc, nr);
+ init_page_cgroup(pc);
}

section->page_cgroup = base - pfn;
diff --git a/mm/swap.c b/mm/swap.c
index a448db3..12095a0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -209,12 +209,14 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
static void pagevec_move_tail_fn(struct page *page, void *arg)
{
int *pgmoved = arg;
- struct zone *zone = page_zone(page);

if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
enum lru_list lru = page_lru_base_type(page);
- list_move_tail(&page->lru, &zone->lru[lru].list);
- mem_cgroup_rotate_reclaimable_page(page);
+ struct lruvec *lruvec;
+
+ lruvec = mem_cgroup_lru_move_lists(page_zone(page),
+ page, lru, lru);
+ list_move_tail(&page->lru, &lruvec->lists[lru]);
(*pgmoved)++;
}
}
@@ -417,12 +419,13 @@ static void lru_deactivate_fn(struct page *page, void *arg)
*/
SetPageReclaim(page);
} else {
+ struct lruvec *lruvec;
/*
* The page's writeback ends up during pagevec
* We moves tha page into tail of inactive.
*/
- list_move_tail(&page->lru, &zone->lru[lru].list);
- mem_cgroup_rotate_reclaimable_page(page);
+ lruvec = mem_cgroup_lru_move_lists(zone, page, lru, lru);
+ list_move_tail(&page->lru, &lruvec->lists[lru]);
__count_vm_event(PGROTATED);
}

@@ -594,7 +597,6 @@ void lru_add_page_tail(struct zone* zone,
int active;
enum lru_list lru;
const int file = 0;
- struct list_head *head;

VM_BUG_ON(!PageHead(page));
VM_BUG_ON(PageCompound(page_tail));
@@ -614,10 +616,10 @@ void lru_add_page_tail(struct zone* zone,
}
update_page_reclaim_stat(zone, page_tail, file, active);
if (likely(PageLRU(page)))
- head = page->lru.prev;
+ __add_page_to_lru_list(zone, page_tail, lru,
+ page->lru.prev);
else
- head = &zone->lru[lru].list;
- __add_page_to_lru_list(zone, page_tail, lru, head);
+ add_page_to_lru_list(zone, page_tail, lru);
} else {
SetPageUnevictable(page_tail);
add_page_to_lru_list(zone, page_tail, LRU_UNEVICTABLE);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0e45ceb..0381a5d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -162,34 +162,27 @@ static bool global_reclaim(struct scan_control *sc)
{
return !sc->memcg;
}
-static bool scanning_global_lru(struct scan_control *sc)
-{
- return !sc->current_memcg;
-}
#else
static bool global_reclaim(struct scan_control *sc) { return 1; }
-static bool scanning_global_lru(struct scan_control *sc) { return 1; }
#endif

static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
struct scan_control *sc)
{
- if (!scanning_global_lru(sc))
- return mem_cgroup_get_reclaim_stat(sc->current_memcg, zone);
-
- return &zone->reclaim_stat;
+ if (mem_cgroup_disabled())
+ return &zone->reclaim_stat;
+ return mem_cgroup_get_reclaim_stat(sc->current_memcg, zone);
}

static unsigned long zone_nr_lru_pages(struct zone *zone,
- struct scan_control *sc, enum lru_list lru)
+ struct scan_control *sc,
+ enum lru_list lru)
{
- if (!scanning_global_lru(sc))
- return mem_cgroup_zone_nr_pages(sc->current_memcg, zone, lru);
-
- return zone_page_state(zone, NR_LRU_BASE + lru);
+ if (mem_cgroup_disabled())
+ return zone_page_state(zone, NR_LRU_BASE + lru);
+ return mem_cgroup_zone_nr_pages(sc->current_memcg, zone, lru);
}

-
/*
* Add a shrinker callback to be called from the vm
*/
@@ -1055,15 +1048,14 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,

switch (__isolate_lru_page(page, mode, file)) {
case 0:
+ mem_cgroup_lru_del(page_zone(page), page);
list_move(&page->lru, dst);
- mem_cgroup_del_lru(page);
nr_taken += hpage_nr_pages(page);
break;

case -EBUSY:
/* else it is being freed elsewhere */
list_move(&page->lru, src);
- mem_cgroup_rotate_lru_list(page, page_lru(page));
continue;

default:
@@ -1113,8 +1105,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
break;

if (__isolate_lru_page(cursor_page, mode, file) == 0) {
+ mem_cgroup_lru_del(page_zone(cursor_page),
+ cursor_page);
list_move(&cursor_page->lru, dst);
- mem_cgroup_del_lru(cursor_page);
nr_taken += hpage_nr_pages(page);
nr_lumpy_taken++;
if (PageDirty(cursor_page))
@@ -1143,19 +1136,22 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
return nr_taken;
}

-static unsigned long isolate_pages_global(unsigned long nr,
- struct list_head *dst,
- unsigned long *scanned, int order,
- int mode, struct zone *z,
- int active, int file)
+static unsigned long isolate_pages(unsigned long nr,
+ struct list_head *dst,
+ unsigned long *scanned, int order,
+ int mode, struct zone *z,
+ int active, int file,
+ struct mem_cgroup *mem)
{
+ struct lruvec *lruvec = mem_cgroup_zone_lruvec(z, mem);
int lru = LRU_BASE;
+
if (active)
lru += LRU_ACTIVE;
if (file)
lru += LRU_FILE;
- return isolate_lru_pages(nr, &z->lru[lru].list, dst, scanned, order,
- mode, file);
+ return isolate_lru_pages(nr, &lruvec->lists[lru], dst,
+ scanned, order, mode, file);
}

/*
@@ -1403,20 +1399,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
lru_add_drain();
spin_lock_irq(&zone->lru_lock);

- if (scanning_global_lru(sc)) {
- nr_taken = isolate_pages_global(nr_to_scan,
- &page_list, &nr_scanned, sc->order,
- sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
- ISOLATE_BOTH : ISOLATE_INACTIVE,
- zone, 0, file);
- } else {
- nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
+ nr_taken = isolate_pages(nr_to_scan,
&page_list, &nr_scanned, sc->order,
sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
ISOLATE_BOTH : ISOLATE_INACTIVE,
- zone, sc->current_memcg,
- 0, file);
- }
+ zone, 0, file, sc->current_memcg);

if (global_reclaim(sc)) {
zone->pages_scanned += nr_scanned;
@@ -1491,13 +1478,15 @@ static void move_active_pages_to_lru(struct zone *zone,
pagevec_init(&pvec, 1);

while (!list_empty(list)) {
+ struct lruvec *lruvec;
+
page = lru_to_page(list);

VM_BUG_ON(PageLRU(page));
SetPageLRU(page);

- list_move(&page->lru, &zone->lru[lru].list);
- mem_cgroup_add_lru_list(page, lru);
+ lruvec = mem_cgroup_lru_add_list(zone, page, lru);
+ list_move(&page->lru, &lruvec->lists[lru]);
pgmoved += hpage_nr_pages(page);

if (!pagevec_add(&pvec, page) || list_empty(list)) {
@@ -1528,17 +1517,10 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,

lru_add_drain();
spin_lock_irq(&zone->lru_lock);
- if (scanning_global_lru(sc)) {
- nr_taken = isolate_pages_global(nr_pages, &l_hold,
- &pgscanned, sc->order,
- ISOLATE_ACTIVE, zone,
- 1, file);
- } else {
- nr_taken = mem_cgroup_isolate_pages(nr_pages, &l_hold,
- &pgscanned, sc->order,
- ISOLATE_ACTIVE, zone,
- sc->current_memcg, 1, file);
- }
+ nr_taken = isolate_pages(nr_pages, &l_hold,
+ &pgscanned, sc->order,
+ ISOLATE_ACTIVE, zone,
+ 1, file, sc->current_memcg);

if (global_reclaim(sc))
zone->pages_scanned += pgscanned;
@@ -1628,8 +1610,6 @@ static int inactive_anon_is_low_global(struct zone *zone)
*/
static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)
{
- int low;
-
/*
* If we don't have swap space, anonymous page deactivation
* is pointless.
@@ -1637,11 +1617,9 @@ static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)
if (!total_swap_pages)
return 0;

- if (scanning_global_lru(sc))
- low = inactive_anon_is_low_global(zone);
- else
- low = mem_cgroup_inactive_anon_is_low(sc->current_memcg);
- return low;
+ if (mem_cgroup_disabled())
+ return inactive_anon_is_low_global(zone);
+ return mem_cgroup_inactive_anon_is_low(sc->current_memcg);
}
#else
static inline int inactive_anon_is_low(struct zone *zone,
@@ -1678,13 +1656,9 @@ static int inactive_file_is_low_global(struct zone *zone)
*/
static int inactive_file_is_low(struct zone *zone, struct scan_control *sc)
{
- int low;
-
- if (scanning_global_lru(sc))
- low = inactive_file_is_low_global(zone);
- else
- low = mem_cgroup_inactive_file_is_low(sc->current_memcg);
- return low;
+ if (mem_cgroup_disabled())
+ return inactive_file_is_low_global(zone);
+ return mem_cgroup_inactive_file_is_low(sc->current_memcg);
}

static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
@@ -3161,16 +3135,18 @@ int page_evictable(struct page *page, struct vm_area_struct *vma)
*/
static void check_move_unevictable_page(struct page *page, struct zone *zone)
{
- VM_BUG_ON(PageActive(page));
+ struct lruvec *lruvec;

+ VM_BUG_ON(PageActive(page));
retry:
ClearPageUnevictable(page);
if (page_evictable(page, NULL)) {
enum lru_list l = page_lru_base_type(page);

__dec_zone_state(zone, NR_UNEVICTABLE);
- list_move(&page->lru, &zone->lru[l].list);
- mem_cgroup_move_lists(page, LRU_UNEVICTABLE, l);
+ lruvec = mem_cgroup_lru_move_lists(zone, page,
+ LRU_UNEVICTABLE, l);
+ list_move(&page->lru, &lruvec->lists[l]);
__inc_zone_state(zone, NR_INACTIVE_ANON + l);
__count_vm_event(UNEVICTABLE_PGRESCUED);
} else {
@@ -3178,8 +3154,9 @@ retry:
* rotate unevictable list
*/
SetPageUnevictable(page);
- list_move(&page->lru, &zone->lru[LRU_UNEVICTABLE].list);
- mem_cgroup_rotate_lru_list(page, LRU_UNEVICTABLE);
+ lruvec = mem_cgroup_lru_move_lists(zone, page, LRU_UNEVICTABLE,
+ LRU_UNEVICTABLE);
+ list_move(&page->lru, &lruvec->lists[LRU_UNEVICTABLE]);
if (page_evictable(page, NULL))
goto retry;
}
@@ -3253,29 +3230,37 @@ void scan_mapping_unevictable_pages(struct address_space *mapping)
#define SCAN_UNEVICTABLE_BATCH_SIZE 16UL /* arbitrary lock hold batch size */
static void scan_zone_unevictable_pages(struct zone *zone)
{
- struct list_head *l_unevictable = &zone->lru[LRU_UNEVICTABLE].list;
- unsigned long scan;
unsigned long nr_to_scan = zone_page_state(zone, NR_UNEVICTABLE);

while (nr_to_scan > 0) {
unsigned long batch_size = min(nr_to_scan,
SCAN_UNEVICTABLE_BATCH_SIZE);
+ struct mem_cgroup *mem = NULL;

- spin_lock_irq(&zone->lru_lock);
- for (scan = 0; scan < batch_size; scan++) {
- struct page *page = lru_to_page(l_unevictable);
-
- if (!trylock_page(page))
- continue;
+ do {
+ struct list_head *list;
+ struct lruvec *lruvec;
+ unsigned long scan;

- prefetchw_prev_lru_page(page, l_unevictable, flags);
+ mem_cgroup_hierarchy_walk(NULL, &mem);
+ spin_lock_irq(&zone->lru_lock);
+ lruvec = mem_cgroup_zone_lruvec(zone, mem);
+ list = &lruvec->lists[LRU_UNEVICTABLE];
+ for (scan = 0; scan < batch_size; scan++) {
+ struct page *page = lru_to_page(list);

- if (likely(PageLRU(page) && PageUnevictable(page)))
+ if (!trylock_page(page))
+ continue;
+ prefetchw_prev_lru_page(page, list, flags);
+ if (unlikely(!PageLRU(page)))
+ continue;
+ if (unlikely(!PageUnevictable(page)))
+ continue;
check_move_unevictable_page(page, zone);
-
- unlock_page(page);
- }
- spin_unlock_irq(&zone->lru_lock);
+ unlock_page(page);
+ }
+ spin_unlock_irq(&zone->lru_lock);
+ } while (mem);

nr_to_scan -= batch_size;
}
--
1.7.5.1

2011-05-12 14:54:53

by Johannes Weiner

[permalink] [raw]
Subject: [rfc patch 6/6] memcg: rework soft limit reclaim

The current soft limit reclaim algorithm is entered from kswapd. It
selects the memcg that exceeds its soft limit the most in absolute
bytes and reclaims from it most aggressively (priority 0).

This has several disadvantages:

1. because of the aggressiveness, kswapd can be stalled for a
long time on a memcg that is hard to reclaim before it moves
on to other pages.

2. it only considers the biggest violator (in absolute bytes!)
and does not put extra pressure on other memcgs in excess.

3. it needs a ton of code to quickly find the target

This patch removes all the explicit soft limit target selection and
instead hooks into the hierarchical memcg walk that is done by direct
reclaim and kswapd balancing. If it encounters a memcg that exceeds
its soft limit, or contributes to the soft limit excess in one of its
hierarchy parents, it scans the memcg one priority level below the
current reclaim priority.

1. the primary goal is to reclaim pages, not to punish soft
limit violators at any price

2. increased pressure is applied to all violators, not just
the biggest one

3. the soft limit is no longer only meaningful on global
memory pressure, but considered for any hierarchical reclaim.
This means that even for hard limit reclaim, the children in
excess of their soft limit experience more pressure compared
to their siblings

4. direct reclaim now also applies more pressure on memcgs in
soft limit excess, not only kswapd

5. the implementation is only a few lines of straightforward
code
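
In essence, the hook amounts to this per memcg visited by the
hierarchy walk (condensed from the shrink_zone hunk further down;
the reclaim statistics accounting is omitted):

	int epriority = priority;

	mem_cgroup_hierarchy_walk(root, &mem);
	sc->current_memcg = mem;
	/* memcgs in soft limit excess are scanned one priority level harder */
	if (mem_cgroup_soft_limit_exceeded(root, mem))
		epriority -= 1;
	do_shrink_zone(epriority, zone, sc);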

RFC: since there is no longer a reliable way of counting the pages
reclaimed solely because of an exceeded soft limit, this patch
conflicts with Ying's exporting of exactly this number to userspace.

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 16 +-
include/linux/swap.h | 4 -
mm/memcontrol.c | 450 +++-----------------------------------------
mm/vmscan.c | 48 +-----
4 files changed, 34 insertions(+), 484 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 65163c2..b0c7323 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -99,6 +99,7 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
* For memory reclaim.
*/
void mem_cgroup_hierarchy_walk(struct mem_cgroup *, struct mem_cgroup **);
+bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *, struct mem_cgroup *);
void mem_cgroup_count_reclaim(struct mem_cgroup *, bool, bool,
unsigned long, unsigned long);
int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
@@ -140,8 +141,6 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
mem_cgroup_update_page_stat(page, idx, -1);
}

-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
- gfp_t gfp_mask);
u64 mem_cgroup_get_limit(struct mem_cgroup *mem);

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -294,6 +293,12 @@ static inline void mem_cgroup_hierarchy_walk(struct mem_cgroup *start,
*iter = start;
}

+static inline bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *root,
+ struct mem_cgroup *mem)
+{
+ return 0;
+}
+
static inline void mem_cgroup_count_reclaim(struct mem_cgroup *mem,
bool kswapd, bool hierarchy,
unsigned long scanned,
@@ -349,13 +354,6 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
}

static inline
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
- gfp_t gfp_mask)
-{
- return 0;
-}
-
-static inline
u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
{
return 0;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a5c6da5..885cf19 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -254,10 +254,6 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
unsigned int swappiness);
-extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
- gfp_t gfp_mask, bool noswap,
- unsigned int swappiness,
- struct zone *zone);
extern int __isolate_lru_page(struct page *page, int mode, int file);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f5d90ba..b0c6dd5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -34,7 +34,6 @@
#include <linux/rcupdate.h>
#include <linux/limits.h>
#include <linux/mutex.h>
-#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/swap.h>
#include <linux/swapops.h>
@@ -138,12 +137,6 @@ struct mem_cgroup_per_zone {
unsigned long count[NR_LRU_LISTS];

struct zone_reclaim_stat reclaim_stat;
- struct rb_node tree_node; /* RB tree node */
- unsigned long long usage_in_excess;/* Set to the value by which */
- /* the soft limit is exceeded*/
- bool on_tree;
- struct mem_cgroup *mem; /* Back pointer, we cannot */
- /* use container_of */
};
/* Macro for accessing counter */
#define MEM_CGROUP_ZSTAT(mz, idx) ((mz)->count[(idx)])
@@ -156,26 +149,6 @@ struct mem_cgroup_lru_info {
struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
};

-/*
- * Cgroups above their limits are maintained in a RB-Tree, independent of
- * their hierarchy representation
- */
-
-struct mem_cgroup_tree_per_zone {
- struct rb_root rb_root;
- spinlock_t lock;
-};
-
-struct mem_cgroup_tree_per_node {
- struct mem_cgroup_tree_per_zone rb_tree_per_zone[MAX_NR_ZONES];
-};
-
-struct mem_cgroup_tree {
- struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
-};
-
-static struct mem_cgroup_tree soft_limit_tree __read_mostly;
-
struct mem_cgroup_threshold {
struct eventfd_ctx *eventfd;
u64 threshold;
@@ -323,12 +296,7 @@ static bool move_file(void)
&mc.to->move_charge_at_immigrate);
}

-/*
- * Maximum loops in mem_cgroup_soft_reclaim(), used for soft
- * limit reclaim to prevent infinite loops, if they ever occur.
- */
#define MEM_CGROUP_MAX_RECLAIM_LOOPS (100)
-#define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2)

enum charge_type {
MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
@@ -375,164 +343,6 @@ page_cgroup_zoneinfo(struct mem_cgroup *mem, struct page *page)
return mem_cgroup_zoneinfo(mem, nid, zid);
}

-static struct mem_cgroup_tree_per_zone *
-soft_limit_tree_node_zone(int nid, int zid)
-{
- return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
-}
-
-static struct mem_cgroup_tree_per_zone *
-soft_limit_tree_from_page(struct page *page)
-{
- int nid = page_to_nid(page);
- int zid = page_zonenum(page);
-
- return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
-}
-
-static void
-__mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
- struct mem_cgroup_per_zone *mz,
- struct mem_cgroup_tree_per_zone *mctz,
- unsigned long long new_usage_in_excess)
-{
- struct rb_node **p = &mctz->rb_root.rb_node;
- struct rb_node *parent = NULL;
- struct mem_cgroup_per_zone *mz_node;
-
- if (mz->on_tree)
- return;
-
- mz->usage_in_excess = new_usage_in_excess;
- if (!mz->usage_in_excess)
- return;
- while (*p) {
- parent = *p;
- mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
- tree_node);
- if (mz->usage_in_excess < mz_node->usage_in_excess)
- p = &(*p)->rb_left;
- /*
- * We can't avoid mem cgroups that are over their soft
- * limit by the same amount
- */
- else if (mz->usage_in_excess >= mz_node->usage_in_excess)
- p = &(*p)->rb_right;
- }
- rb_link_node(&mz->tree_node, parent, p);
- rb_insert_color(&mz->tree_node, &mctz->rb_root);
- mz->on_tree = true;
-}
-
-static void
-__mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
- struct mem_cgroup_per_zone *mz,
- struct mem_cgroup_tree_per_zone *mctz)
-{
- if (!mz->on_tree)
- return;
- rb_erase(&mz->tree_node, &mctz->rb_root);
- mz->on_tree = false;
-}
-
-static void
-mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
- struct mem_cgroup_per_zone *mz,
- struct mem_cgroup_tree_per_zone *mctz)
-{
- spin_lock(&mctz->lock);
- __mem_cgroup_remove_exceeded(mem, mz, mctz);
- spin_unlock(&mctz->lock);
-}
-
-
-static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
-{
- unsigned long long excess;
- struct mem_cgroup_per_zone *mz;
- struct mem_cgroup_tree_per_zone *mctz;
- int nid = page_to_nid(page);
- int zid = page_zonenum(page);
- mctz = soft_limit_tree_from_page(page);
-
- /*
- * Necessary to update all ancestors when hierarchy is used.
- * because their event counter is not touched.
- */
- for (; mem; mem = parent_mem_cgroup(mem)) {
- mz = mem_cgroup_zoneinfo(mem, nid, zid);
- excess = res_counter_soft_limit_excess(&mem->res);
- /*
- * We have to update the tree if mz is on RB-tree or
- * mem is over its softlimit.
- */
- if (excess || mz->on_tree) {
- spin_lock(&mctz->lock);
- /* if on-tree, remove it */
- if (mz->on_tree)
- __mem_cgroup_remove_exceeded(mem, mz, mctz);
- /*
- * Insert again. mz->usage_in_excess will be updated.
- * If excess is 0, no tree ops.
- */
- __mem_cgroup_insert_exceeded(mem, mz, mctz, excess);
- spin_unlock(&mctz->lock);
- }
- }
-}
-
-static void mem_cgroup_remove_from_trees(struct mem_cgroup *mem)
-{
- int node, zone;
- struct mem_cgroup_per_zone *mz;
- struct mem_cgroup_tree_per_zone *mctz;
-
- for_each_node_state(node, N_POSSIBLE) {
- for (zone = 0; zone < MAX_NR_ZONES; zone++) {
- mz = mem_cgroup_zoneinfo(mem, node, zone);
- mctz = soft_limit_tree_node_zone(node, zone);
- mem_cgroup_remove_exceeded(mem, mz, mctz);
- }
- }
-}
-
-static struct mem_cgroup_per_zone *
-__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
-{
- struct rb_node *rightmost = NULL;
- struct mem_cgroup_per_zone *mz;
-
-retry:
- mz = NULL;
- rightmost = rb_last(&mctz->rb_root);
- if (!rightmost)
- goto done; /* Nothing to reclaim from */
-
- mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node);
- /*
- * Remove the node now but someone else can add it back,
- * we will to add it back at the end of reclaim to its correct
- * position in the tree.
- */
- __mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
- if (!res_counter_soft_limit_excess(&mz->mem->res) ||
- !css_tryget(&mz->mem->css))
- goto retry;
-done:
- return mz;
-}
-
-static struct mem_cgroup_per_zone *
-mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
-{
- struct mem_cgroup_per_zone *mz;
-
- spin_lock(&mctz->lock);
- mz = __mem_cgroup_largest_soft_limit_node(mctz);
- spin_unlock(&mctz->lock);
- return mz;
-}
-
/*
* Implementation Note: reading percpu statistics for memcg.
*
@@ -570,15 +380,6 @@ static long mem_cgroup_read_stat(struct mem_cgroup *mem,
return val;
}

-static long mem_cgroup_local_usage(struct mem_cgroup *mem)
-{
- long ret;
-
- ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
- ret += mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);
- return ret;
-}
-
static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
bool charge)
{
@@ -699,7 +500,6 @@ static void memcg_check_events(struct mem_cgroup *mem, struct page *page)
__mem_cgroup_target_update(mem, MEM_CGROUP_TARGET_THRESH);
if (unlikely(__memcg_event_check(mem,
MEM_CGROUP_TARGET_SOFTLIMIT))){
- mem_cgroup_update_tree(mem, page);
__mem_cgroup_target_update(mem,
MEM_CGROUP_TARGET_SOFTLIMIT);
}
@@ -1380,6 +1180,29 @@ void mem_cgroup_hierarchy_walk(struct mem_cgroup *start,
*iter = mem;
}

+bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *root,
+ struct mem_cgroup *mem)
+{
+ /* root_mem_cgroup never exceeds its soft limit */
+ if (!mem)
+ return false;
+ if (!root)
+ root = root_mem_cgroup;
+ /*
+ * See whether the memcg in question exceeds its soft limit
+ * directly, or contributes to the soft limit excess in the
+ * hierarchy below the given root.
+ */
+ while (mem != root) {
+ if (res_counter_soft_limit_excess(&mem->res))
+ return true;
+ if (!mem->use_hierarchy)
+ break;
+ mem = mem_cgroup_from_cont(mem->css.cgroup->parent);
+ }
+ return false;
+}
+
static unsigned long mem_cgroup_target_reclaim(struct mem_cgroup *mem,
gfp_t gfp_mask,
bool noswap,
@@ -1411,114 +1234,6 @@ static unsigned long mem_cgroup_target_reclaim(struct mem_cgroup *mem,
}

/*
- * Visit the first child (need not be the first child as per the ordering
- * of the cgroup list, since we track last_scanned_child) of @mem and use
- * that to reclaim free pages from.
- */
-static struct mem_cgroup *
-mem_cgroup_select_victim(struct mem_cgroup *root_mem)
-{
- struct mem_cgroup *ret = NULL;
- struct cgroup_subsys_state *css;
- int nextid, found;
-
- if (!root_mem->use_hierarchy) {
- css_get(&root_mem->css);
- ret = root_mem;
- }
-
- while (!ret) {
- rcu_read_lock();
- nextid = root_mem->last_scanned_child + 1;
- css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css,
- &found);
- if (css && css_tryget(css))
- ret = container_of(css, struct mem_cgroup, css);
-
- rcu_read_unlock();
- /* Updates scanning parameter */
- if (!css) {
- /* this means start scan from ID:1 */
- root_mem->last_scanned_child = 0;
- } else
- root_mem->last_scanned_child = found;
- }
-
- return ret;
-}
-
-/*
- * Scan the hierarchy if needed to reclaim memory. We remember the last child
- * we reclaimed from, so that we don't end up penalizing one child extensively
- * based on its position in the children list.
- *
- * root_mem is the original ancestor that we've been reclaim from.
- *
- * We give up and return to the caller when we visit root_mem twice.
- * (other groups can be removed while we're walking....)
- */
-static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_mem,
- struct zone *zone,
- gfp_t gfp_mask)
-{
- struct mem_cgroup *victim;
- int ret, total = 0;
- int loop = 0;
- unsigned long excess;
- bool noswap = false;
-
- excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
-
- /* If memsw_is_minimum==1, swap-out is of-no-use. */
- if (root_mem->memsw_is_minimum)
- noswap = true;
-
- while (1) {
- victim = mem_cgroup_select_victim(root_mem);
- if (victim == root_mem) {
- loop++;
- if (loop >= 1)
- drain_all_stock_async();
- if (loop >= 2) {
- /*
- * If we have not been able to reclaim
- * anything, it might because there are
- * no reclaimable pages under this hierarchy
- */
- if (!total) {
- css_put(&victim->css);
- break;
- }
- /*
- * We want to do more targeted reclaim.
- * excess >> 2 is not to excessive so as to
- * reclaim too much, nor too less that we keep
- * coming back to reclaim from this cgroup
- */
- if (total >= (excess >> 2) ||
- (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) {
- css_put(&victim->css);
- break;
- }
- }
- }
- if (!mem_cgroup_local_usage(victim)) {
- /* this cgroup's local usage == 0 */
- css_put(&victim->css);
- continue;
- }
- /* we use swappiness of local cgroup */
- ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
- noswap, get_swappiness(victim), zone);
- css_put(&victim->css);
- total += ret;
- if (!res_counter_soft_limit_excess(&root_mem->res))
- return total;
- }
- return total;
-}
-
-/*
* Check OOM-Killer is already running under our hierarchy.
* If someone is running, return false.
*/
@@ -3291,94 +3006,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
return ret;
}

-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
- gfp_t gfp_mask)
-{
- unsigned long nr_reclaimed = 0;
- struct mem_cgroup_per_zone *mz, *next_mz = NULL;
- unsigned long reclaimed;
- int loop = 0;
- struct mem_cgroup_tree_per_zone *mctz;
- unsigned long long excess;
-
- if (order > 0)
- return 0;
-
- mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone));
- /*
- * This loop can run a while, specially if mem_cgroup's continuously
- * keep exceeding their soft limit and putting the system under
- * pressure
- */
- do {
- if (next_mz)
- mz = next_mz;
- else
- mz = mem_cgroup_largest_soft_limit_node(mctz);
- if (!mz)
- break;
-
- reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone, gfp_mask);
- nr_reclaimed += reclaimed;
- spin_lock(&mctz->lock);
-
- /*
- * If we failed to reclaim anything from this memory cgroup
- * it is time to move on to the next cgroup
- */
- next_mz = NULL;
- if (!reclaimed) {
- do {
- /*
- * Loop until we find yet another one.
- *
- * By the time we get the soft_limit lock
- * again, someone might have aded the
- * group back on the RB tree. Iterate to
- * make sure we get a different mem.
- * mem_cgroup_largest_soft_limit_node returns
- * NULL if no other cgroup is present on
- * the tree
- */
- next_mz =
- __mem_cgroup_largest_soft_limit_node(mctz);
- if (next_mz == mz) {
- css_put(&next_mz->mem->css);
- next_mz = NULL;
- } else /* next_mz == NULL or other memcg */
- break;
- } while (1);
- }
- __mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
- excess = res_counter_soft_limit_excess(&mz->mem->res);
- /*
- * One school of thought says that we should not add
- * back the node to the tree if reclaim returns 0.
- * But our reclaim could return 0, simply because due
- * to priority we are exposing a smaller subset of
- * memory to reclaim from. Consider this as a longer
- * term TODO.
- */
- /* If excess == 0, no tree ops */
- __mem_cgroup_insert_exceeded(mz->mem, mz, mctz, excess);
- spin_unlock(&mctz->lock);
- css_put(&mz->mem->css);
- loop++;
- /*
- * Could not reclaim anything and there are no more
- * mem cgroups to try or we seem to be looping without
- * reclaiming anything.
- */
- if (!nr_reclaimed &&
- (next_mz == NULL ||
- loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
- break;
- } while (!nr_reclaimed);
- if (next_mz)
- css_put(&next_mz->mem->css);
- return nr_reclaimed;
-}
-
/*
* This routine traverse page_cgroup in given list and drop them all.
* *And* this routine doesn't reclaim page itself, just removes page_cgroup.
@@ -4449,9 +4076,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
mz = &pn->zoneinfo[zone];
for_each_lru(l)
INIT_LIST_HEAD(&mz->lruvec.lists[l]);
- mz->usage_in_excess = 0;
- mz->on_tree = false;
- mz->mem = mem;
}
return 0;
}
@@ -4504,7 +4128,6 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
{
int node;

- mem_cgroup_remove_from_trees(mem);
free_css_id(&mem_cgroup_subsys, &mem->css);

for_each_node_state(node, N_POSSIBLE)
@@ -4559,31 +4182,6 @@ static void __init enable_swap_cgroup(void)
}
#endif

-static int mem_cgroup_soft_limit_tree_init(void)
-{
- struct mem_cgroup_tree_per_node *rtpn;
- struct mem_cgroup_tree_per_zone *rtpz;
- int tmp, node, zone;
-
- for_each_node_state(node, N_POSSIBLE) {
- tmp = node;
- if (!node_state(node, N_NORMAL_MEMORY))
- tmp = -1;
- rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp);
- if (!rtpn)
- return 1;
-
- soft_limit_tree.rb_tree_per_node[node] = rtpn;
-
- for (zone = 0; zone < MAX_NR_ZONES; zone++) {
- rtpz = &rtpn->rb_tree_per_zone[zone];
- rtpz->rb_root = RB_ROOT;
- spin_lock_init(&rtpz->lock);
- }
- }
- return 0;
-}
-
static struct cgroup_subsys_state * __ref
mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
{
@@ -4605,8 +4203,6 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
enable_swap_cgroup();
parent = NULL;
root_mem_cgroup = mem;
- if (mem_cgroup_soft_limit_tree_init())
- goto free_out;
for_each_possible_cpu(cpu) {
struct memcg_stock_pcp *stock =
&per_cpu(memcg_stock, cpu);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0381a5d..2b701e0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1937,10 +1937,13 @@ static void shrink_zone(int priority, struct zone *zone,
do {
unsigned long reclaimed = sc->nr_reclaimed;
unsigned long scanned = sc->nr_scanned;
+ int epriority = priority;

mem_cgroup_hierarchy_walk(root, &mem);
sc->current_memcg = mem;
- do_shrink_zone(priority, zone, sc);
+ if (mem_cgroup_soft_limit_exceeded(root, mem))
+ epriority -= 1;
+ do_shrink_zone(epriority, zone, sc);
mem_cgroup_count_reclaim(mem, current_is_kswapd(),
mem != root, /* limit or hierarchy? */
sc->nr_scanned - scanned,
@@ -2153,42 +2156,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
}

#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-
-unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
- gfp_t gfp_mask, bool noswap,
- unsigned int swappiness,
- struct zone *zone)
-{
- struct scan_control sc = {
- .nr_to_reclaim = SWAP_CLUSTER_MAX,
- .may_writepage = !laptop_mode,
- .may_unmap = 1,
- .may_swap = !noswap,
- .swappiness = swappiness,
- .order = 0,
- .memcg = mem,
- };
- sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
- (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
-
- trace_mm_vmscan_memcg_softlimit_reclaim_begin(0,
- sc.may_writepage,
- sc.gfp_mask);
-
- /*
- * NOTE: Although we can get the priority field, using it
- * here is not a good idea, since it limits the pages we can scan.
- * if we don't reclaim here, the shrink_zone from balance_pgdat
- * will pick up pages from other mem cgroup's as well. We hack
- * the priority and make it zero.
- */
- do_shrink_zone(0, zone, &sc);
-
- trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
-
- return sc.nr_reclaimed;
-}
-
unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
gfp_t gfp_mask,
bool noswap,
@@ -2418,13 +2385,6 @@ loop_again:
continue;

sc.nr_scanned = 0;
-
- /*
- * Call soft limit reclaim before calling shrink_zone.
- * For now we ignore the return value
- */
- mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask);
-
/*
* We put equal pressure on every zone, unless
* one zone has way too many pages free
--
1.7.5.1

2011-05-12 15:38:20

by Rik van Riel

[permalink] [raw]
Subject: Re: [rfc patch 1/6] memcg: remove unused retry signal from reclaim

On 05/12/2011 10:53 AM, Johannes Weiner wrote:
> If the memcg reclaim code detects the target memcg below its limit it
> exits and returns a guaranteed non-zero value so that the charge is
> retried.
>
> Nowadays, the charge side checks the memcg limit itself and does not
> rely on this non-zero return value trick.
>
> This patch removes it. The reclaim code will now always return the
> true number of pages it reclaimed on its own.
>
> Signed-off-by: Johannes Weiner<[email protected]>

Acked-by: Rik van Riel<[email protected]>

2011-05-12 15:33:50

by Rik van Riel

[permalink] [raw]
Subject: Re: [rfc patch 2/6] vmscan: make distinction between memcg reclaim and LRU list selection

On 05/12/2011 10:53 AM, Johannes Weiner wrote:
> The reclaim code has a single predicate for whether it currently
> reclaims on behalf of a memory cgroup, as well as whether it is
> reclaiming from the global LRU list or a memory cgroup LRU list.
>
> Up to now, both cases always coincide, but subsequent patches will
> change things such that global reclaim will scan memory cgroup lists.
>
> This patch adds a new predicate that tells global reclaim from memory
> cgroup reclaim, and then changes all callsites that are actually about
> global reclaim heuristics rather than strict LRU list selection.
>
> Signed-off-by: Johannes Weiner<[email protected]>
> ---
> mm/vmscan.c | 96 ++++++++++++++++++++++++++++++++++------------------------
> 1 files changed, 56 insertions(+), 40 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f6b435c..ceeb2a5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -104,8 +104,12 @@ struct scan_control {
> */
> reclaim_mode_t reclaim_mode;
>
> - /* Which cgroup do we reclaim from */
> - struct mem_cgroup *mem_cgroup;
> + /*
> + * The memory cgroup we reclaim on behalf of, and the one we
> + * are currently reclaiming from.
> + */
> + struct mem_cgroup *memcg;
> + struct mem_cgroup *current_memcg;

I can't say I'm fond of these names. I had to read the
rest of the patch to figure out that the old mem_cgroup
got renamed to current_memcg.

Would it be better to call them my_memcg and reclaim_memcg?

Maybe somebody else has better suggestions...

Other than the naming, no objection.

2011-05-12 16:04:31

by Johannes Weiner

[permalink] [raw]
Subject: Re: [rfc patch 2/6] vmscan: make distinction between memcg reclaim and LRU list selection

On Thu, May 12, 2011 at 11:33:13AM -0400, Rik van Riel wrote:
> On 05/12/2011 10:53 AM, Johannes Weiner wrote:
> >The reclaim code has a single predicate for whether it currently
> >reclaims on behalf of a memory cgroup, as well as whether it is
> >reclaiming from the global LRU list or a memory cgroup LRU list.
> >
> >Up to now, both cases always coincide, but subsequent patches will
> >change things such that global reclaim will scan memory cgroup lists.
> >
> >This patch adds a new predicate that tells global reclaim from memory
> >cgroup reclaim, and then changes all callsites that are actually about
> >global reclaim heuristics rather than strict LRU list selection.
> >
> >Signed-off-by: Johannes Weiner<[email protected]>
> >---
> > mm/vmscan.c | 96 ++++++++++++++++++++++++++++++++++------------------------
> > 1 files changed, 56 insertions(+), 40 deletions(-)
> >
> >diff --git a/mm/vmscan.c b/mm/vmscan.c
> >index f6b435c..ceeb2a5 100644
> >--- a/mm/vmscan.c
> >+++ b/mm/vmscan.c
> >@@ -104,8 +104,12 @@ struct scan_control {
> > */
> > reclaim_mode_t reclaim_mode;
> >
> >- /* Which cgroup do we reclaim from */
> >- struct mem_cgroup *mem_cgroup;
> >+ /*
> >+ * The memory cgroup we reclaim on behalf of, and the one we
> >+ * are currently reclaiming from.
> >+ */
> >+ struct mem_cgroup *memcg;
> >+ struct mem_cgroup *current_memcg;
>
> I can't say I'm fond of these names. I had to read the
> rest of the patch to figure out that the old mem_cgroup
> got renamed to current_memcg.

To clarify: sc->memcg will be the memcg that hit the hard limit and is
the main target of this reclaim invocation. current_memcg is the
iterator over the hierarchy below the target.
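
As a comment sketch (illustration only, not part of the patch):

	struct scan_control {
		/* ... */
		/* reclaim target: the memcg that hit its hard limit,
		 * NULL for global reclaim; fixed for the whole invocation */
		struct mem_cgroup *memcg;
		/* hierarchy-walk cursor: the memcg whose LRU lists are
		 * being scanned right now */
		struct mem_cgroup *current_memcg;
	};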

I realize this change in particular was placed a bit unfortunately in
the series in terms of understanding; I just wanted to keep the
mem_cgroup to current_memcg renaming out of the next patch. There is
probably a better way, I'll fix it up and improve the comment.

> Would it be better to call them my_memcg and reclaim_memcg?
>
> Maybe somebody else has better suggestions...

Yes, suggestions welcome. I'm not too fond of the naming, either.

> Other than the naming, no objection.

Thanks, Rik.

Hannes

2011-05-12 16:05:42

by Rik van Riel

[permalink] [raw]
Subject: Re: [rfc patch 3/6] mm: memcg-aware global reclaim

On 05/12/2011 10:53 AM, Johannes Weiner wrote:

> I am open to solutions that trade fairness against CPU-time but don't
> want to have an extreme in either direction. Maybe break out early if
> a number of memcgs has been successfully reclaimed from and remember
> the last one scanned.

The way we used to deal with this when we did per-process
virtual scanning (before rmap) was to scan the process at
the head of the list.

After we were done with that process, it got moved to the
back of the list. If enough had been scanned, we bailed
out of the scanning code altogether; if more needed to
be scanned, we moved on to the next process.

Doing a list move after scanning a bunch of pages in the
LRU lists of a cgroup isn't nearly as expensive as having
to scan all the cgroups.
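
A minimal sketch of that pattern with the <linux/list.h> helpers
("target", "scan_some" and the list itself are made-up names for
illustration, not anything in this series):

	struct target {
		struct list_head list;
	};

	/* hypothetical: scan a batch of pages from @t, return pages scanned */
	static unsigned long scan_some(struct target *t);

	static unsigned long round_robin_scan(struct list_head *targets,
					      unsigned long nr_to_scan)
	{
		unsigned long nr_scanned = 0;

		while (nr_scanned < nr_to_scan && !list_empty(targets)) {
			struct target *t = list_first_entry(targets,
							    struct target, list);

			/* scan from the head of the list... */
			nr_scanned += scan_some(t);
			/* ...then rotate it to the tail so others get a turn */
			list_move_tail(&t->list, targets);
		}
		return nr_scanned;
	}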

2011-05-12 18:41:42

by Ying Han

[permalink] [raw]
Subject: Re: [rfc patch 6/6] memcg: rework soft limit reclaim

Hi Johannes:

Thank you for the patchset, and I will definitely spend time reading
it through later today.

Also, I have a patchset which implements the round-robin soft_limit
reclaim as we discussed at LSF. Before I read through this set, I
don't know whether we are taking a similar approach or not. My
implementation is only a first step: it replaces the RB-tree based
soft_limit reclaim with a linked-list round-robin. Feel free to
comment on it.

--Ying

> - ? ? ? ? ? ? ? ?* term TODO.
> - ? ? ? ? ? ? ? ?*/
> - ? ? ? ? ? ? ? /* If excess == 0, no tree ops */
> - ? ? ? ? ? ? ? __mem_cgroup_insert_exceeded(mz->mem, mz, mctz, excess);
> - ? ? ? ? ? ? ? spin_unlock(&mctz->lock);
> - ? ? ? ? ? ? ? css_put(&mz->mem->css);
> - ? ? ? ? ? ? ? loop++;
> - ? ? ? ? ? ? ? /*
> - ? ? ? ? ? ? ? ?* Could not reclaim anything and there are no more
> - ? ? ? ? ? ? ? ?* mem cgroups to try or we seem to be looping without
> - ? ? ? ? ? ? ? ?* reclaiming anything.
> - ? ? ? ? ? ? ? ?*/
> - ? ? ? ? ? ? ? if (!nr_reclaimed &&
> - ? ? ? ? ? ? ? ? ? ? ? (next_mz == NULL ||
> - ? ? ? ? ? ? ? ? ? ? ? loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
> - ? ? ? ? ? ? ? ? ? ? ? break;
> - ? ? ? } while (!nr_reclaimed);
> - ? ? ? if (next_mz)
> - ? ? ? ? ? ? ? css_put(&next_mz->mem->css);
> - ? ? ? return nr_reclaimed;
> -}
> -
> ?/*
> ?* This routine traverse page_cgroup in given list and drop them all.
> ?* *And* this routine doesn't reclaim page itself, just removes page_cgroup.
> @@ -4449,9 +4076,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
> ? ? ? ? ? ? ? ?mz = &pn->zoneinfo[zone];
> ? ? ? ? ? ? ? ?for_each_lru(l)
> ? ? ? ? ? ? ? ? ? ? ? ?INIT_LIST_HEAD(&mz->lruvec.lists[l]);
> - ? ? ? ? ? ? ? mz->usage_in_excess = 0;
> - ? ? ? ? ? ? ? mz->on_tree = false;
> - ? ? ? ? ? ? ? mz->mem = mem;
> ? ? ? ?}
> ? ? ? ?return 0;
> ?}
> @@ -4504,7 +4128,6 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
> ?{
> ? ? ? ?int node;
>
> - ? ? ? mem_cgroup_remove_from_trees(mem);
> ? ? ? ?free_css_id(&mem_cgroup_subsys, &mem->css);
>
> ? ? ? ?for_each_node_state(node, N_POSSIBLE)
> @@ -4559,31 +4182,6 @@ static void __init enable_swap_cgroup(void)
> ?}
> ?#endif
>
> -static int mem_cgroup_soft_limit_tree_init(void)
> -{
> - ? ? ? struct mem_cgroup_tree_per_node *rtpn;
> - ? ? ? struct mem_cgroup_tree_per_zone *rtpz;
> - ? ? ? int tmp, node, zone;
> -
> - ? ? ? for_each_node_state(node, N_POSSIBLE) {
> - ? ? ? ? ? ? ? tmp = node;
> - ? ? ? ? ? ? ? if (!node_state(node, N_NORMAL_MEMORY))
> - ? ? ? ? ? ? ? ? ? ? ? tmp = -1;
> - ? ? ? ? ? ? ? rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp);
> - ? ? ? ? ? ? ? if (!rtpn)
> - ? ? ? ? ? ? ? ? ? ? ? return 1;
> -
> - ? ? ? ? ? ? ? soft_limit_tree.rb_tree_per_node[node] = rtpn;
> -
> - ? ? ? ? ? ? ? for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> - ? ? ? ? ? ? ? ? ? ? ? rtpz = &rtpn->rb_tree_per_zone[zone];
> - ? ? ? ? ? ? ? ? ? ? ? rtpz->rb_root = RB_ROOT;
> - ? ? ? ? ? ? ? ? ? ? ? spin_lock_init(&rtpz->lock);
> - ? ? ? ? ? ? ? }
> - ? ? ? }
> - ? ? ? return 0;
> -}
> -
> ?static struct cgroup_subsys_state * __ref
> ?mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> ?{
> @@ -4605,8 +4203,6 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> ? ? ? ? ? ? ? ?enable_swap_cgroup();
> ? ? ? ? ? ? ? ?parent = NULL;
> ? ? ? ? ? ? ? ?root_mem_cgroup = mem;
> - ? ? ? ? ? ? ? if (mem_cgroup_soft_limit_tree_init())
> - ? ? ? ? ? ? ? ? ? ? ? goto free_out;
> ? ? ? ? ? ? ? ?for_each_possible_cpu(cpu) {
> ? ? ? ? ? ? ? ? ? ? ? ?struct memcg_stock_pcp *stock =
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?&per_cpu(memcg_stock, cpu);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0381a5d..2b701e0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1937,10 +1937,13 @@ static void shrink_zone(int priority, struct zone *zone,
> ? ? ? ?do {
> ? ? ? ? ? ? ? ?unsigned long reclaimed = sc->nr_reclaimed;
> ? ? ? ? ? ? ? ?unsigned long scanned = sc->nr_scanned;
> + ? ? ? ? ? ? ? int epriority = priority;
>
> ? ? ? ? ? ? ? ?mem_cgroup_hierarchy_walk(root, &mem);
> ? ? ? ? ? ? ? ?sc->current_memcg = mem;
> - ? ? ? ? ? ? ? do_shrink_zone(priority, zone, sc);
> + ? ? ? ? ? ? ? if (mem_cgroup_soft_limit_exceeded(root, mem))
> + ? ? ? ? ? ? ? ? ? ? ? epriority -= 1;
> + ? ? ? ? ? ? ? do_shrink_zone(epriority, zone, sc);
> ? ? ? ? ? ? ? ?mem_cgroup_count_reclaim(mem, current_is_kswapd(),
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? mem != root, /* limit or hierarchy? */
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? sc->nr_scanned - scanned,
> @@ -2153,42 +2156,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> ?}
>
> ?#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> -
> -unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_t gfp_mask, bool noswap,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned int swappiness,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct zone *zone)
> -{
> - ? ? ? struct scan_control sc = {
> - ? ? ? ? ? ? ? .nr_to_reclaim = SWAP_CLUSTER_MAX,
> - ? ? ? ? ? ? ? .may_writepage = !laptop_mode,
> - ? ? ? ? ? ? ? .may_unmap = 1,
> - ? ? ? ? ? ? ? .may_swap = !noswap,
> - ? ? ? ? ? ? ? .swappiness = swappiness,
> - ? ? ? ? ? ? ? .order = 0,
> - ? ? ? ? ? ? ? .memcg = mem,
> - ? ? ? };
> - ? ? ? sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> - ? ? ? ? ? ? ? ? ? ? ? (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
> -
> - ? ? ? trace_mm_vmscan_memcg_softlimit_reclaim_begin(0,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? sc.may_writepage,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? sc.gfp_mask);
> -
> - ? ? ? /*
> - ? ? ? ?* NOTE: Although we can get the priority field, using it
> - ? ? ? ?* here is not a good idea, since it limits the pages we can scan.
> - ? ? ? ?* if we don't reclaim here, the shrink_zone from balance_pgdat
> - ? ? ? ?* will pick up pages from other mem cgroup's as well. We hack
> - ? ? ? ?* the priority and make it zero.
> - ? ? ? ?*/
> - ? ? ? do_shrink_zone(0, zone, &sc);
> -
> - ? ? ? trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
> -
> - ? ? ? return sc.nr_reclaimed;
> -}
> -
> ?unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_t gfp_mask,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? bool noswap,
> @@ -2418,13 +2385,6 @@ loop_again:
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?continue;
>
> ? ? ? ? ? ? ? ? ? ? ? ?sc.nr_scanned = 0;
> -
> - ? ? ? ? ? ? ? ? ? ? ? /*
> - ? ? ? ? ? ? ? ? ? ? ? ?* Call soft limit reclaim before calling shrink_zone.
> - ? ? ? ? ? ? ? ? ? ? ? ?* For now we ignore the return value
> - ? ? ? ? ? ? ? ? ? ? ? ?*/
> - ? ? ? ? ? ? ? ? ? ? ? mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask);
> -
> ? ? ? ? ? ? ? ? ? ? ? ?/*
> ? ? ? ? ? ? ? ? ? ? ? ? * We put equal pressure on every zone, unless
> ? ? ? ? ? ? ? ? ? ? ? ? * one zone has way too many pages free
> --
> 1.7.5.1
>
>

2011-05-12 23:51:36

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [rfc patch 1/6] memcg: remove unused retry signal from reclaim

On Thu, 12 May 2011 16:53:53 +0200
Johannes Weiner <[email protected]> wrote:

> If the memcg reclaim code detects the target memcg below its limit it
> exits and returns a guaranteed non-zero value so that the charge is
> retried.
>
> Nowadays, the charge side checks the memcg limit itself and does not
> rely on this non-zero return value trick.
>
> This patch removes it. The reclaim code will now always return the
> true number of pages it reclaimed on its own.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2011-05-12 23:57:13

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [rfc patch 2/6] vmscan: make distinction between memcg reclaim and LRU list selection

On Thu, 12 May 2011 16:53:54 +0200
Johannes Weiner <[email protected]> wrote:

> The reclaim code has a single predicate for whether it currently
> reclaims on behalf of a memory cgroup, as well as whether it is
> reclaiming from the global LRU list or a memory cgroup LRU list.
>
> Up to now, both cases always coincide, but subsequent patches will
> change things such that global reclaim will scan memory cgroup lists.
>
> This patch adds a new predicate that tells global reclaim from memory
> cgroup reclaim, and then changes all callsites that are actually about
> global reclaim heuristics rather than strict LRU list selection.
>
> Signed-off-by: Johannes Weiner <[email protected]>


Hmm, isn't it better to merge this to patches where the meaning of
new variable gets clearer ?

> ---
> mm/vmscan.c | 96 ++++++++++++++++++++++++++++++++++------------------------
> 1 files changed, 56 insertions(+), 40 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f6b435c..ceeb2a5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -104,8 +104,12 @@ struct scan_control {
> */
> reclaim_mode_t reclaim_mode;
>
> - /* Which cgroup do we reclaim from */
> - struct mem_cgroup *mem_cgroup;
> + /*
> + * The memory cgroup we reclaim on behalf of, and the one we
> + * are currently reclaiming from.
> + */
> + struct mem_cgroup *memcg;
> + struct mem_cgroup *current_memcg;
>

I wonder if you avoid renaming the existing one, the patch will
be clearer...



> /*
> * Nodemask of nodes allowed by the caller. If NULL, all nodes
> @@ -154,16 +158,24 @@ static LIST_HEAD(shrinker_list);
> static DECLARE_RWSEM(shrinker_rwsem);
>
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> -#define scanning_global_lru(sc) (!(sc)->mem_cgroup)
> +static bool global_reclaim(struct scan_control *sc)
> +{
> + return !sc->memcg;
> +}
> +static bool scanning_global_lru(struct scan_control *sc)
> +{
> + return !sc->current_memcg;
> +}


Could you add comments ?

Thanks,
-Kame

2011-05-13 00:11:35

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [rfc patch 3/6] mm: memcg-aware global reclaim

On Thu, 12 May 2011 16:53:55 +0200
Johannes Weiner <[email protected]> wrote:

> A page charged to a memcg is linked to a lru list specific to that
> memcg. At the same time, traditional global reclaim is obvlivious to
> memcgs, and all the pages are also linked to a global per-zone list.
>
> This patch changes traditional global reclaim to iterate over all
> existing memcgs, so that it no longer relies on the global list being
> present.
>
> This is one step forward in integrating memcg code better into the
> rest of memory management. It is also a prerequisite to get rid of
> the global per-zone lru lists.
>


As I said, I don't want removing global reclaim until dirty_ratio support and
better softlimit algorithm, at least. Current my concern is dirty_ratio,
if you want to speed up, please help Greg and implement dirty_ratio first.

BTW, could you separate clean up code and your new logic? The 1st half of
the code seems to be just a clean up and seems nice. But, IIUC, someone
changed the arguments from a chunk of params to flags... in some patch.
...
commit 75822b4495b62e8721e9b88e3cf9e653a0c85b73
Author: Balbir Singh <[email protected]>
Date: Wed Sep 23 15:56:38 2009 -0700

memory controller: soft limit refactor reclaim flags

Refactor mem_cgroup_hierarchical_reclaim()

Refactor the arguments passed to mem_cgroup_hierarchical_reclaim() into
flags, so that new parameters don't have to be passed as we make the
reclaim routine more flexible

...

Balbir ? Both are ok to me, please ask him.


And hmm...

+ do {
+ mem_cgroup_hierarchy_walk(root, &mem);
+ sc->current_memcg = mem;
+ do_shrink_zone(priority, zone, sc);
+ } while (mem != root);

This move hierarchy walk from memcontrol.c to vmscan.c ?

About moving hierarchy walk, I may say okay...because my patch does this, too.

But....doesn't this reclaim too much memory if hierarchy is very deep ?
Could you add some 'quit' path ?


Thanks,
-Kame

2011-05-13 00:47:38

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [rfc patch 3/6] mm: memcg-aware global reclaim

On Thu, 12 May 2011 16:53:55 +0200
Johannes Weiner <[email protected]> wrote:

> A page charged to a memcg is linked to a lru list specific to that
> memcg. At the same time, traditional global reclaim is oblivious to
> memcgs, and all the pages are also linked to a global per-zone list.
>
> This patch changes traditional global reclaim to iterate over all
> existing memcgs, so that it no longer relies on the global list being
> present.
>
> This is one step forward in integrating memcg code better into the
> rest of memory management. It is also a prerequisite to get rid of
> the global per-zone lru lists.
>
> RFC:
>
> The algorithm implemented in this patch is very naive. For each zone
> scanned at each priority level, it iterates over all existing memcgs
> and considers them for scanning.
>
> This is just a prototype and I did not optimize it yet because I am
> unsure about the maximum number of memcgs that still constitute a sane
> configuration in comparison to the machine size.
>
> It is perfectly fair since all memcgs are scanned at each priority
> level.
>
> On my 4G quadcore laptop with 1000 memcgs, a significant amount of CPU
> time was spent just iterating memcgs during reclaim. But it can not
> really be claimed that the old code was much better, either: global
> LRU reclaim could mean that a few hundred memcgs would have been
> emptied out completely, while others stayed untouched.
>
> I am open to solutions that trade fairness against CPU-time but don't
> want to have an extreme in either direction. Maybe break out early if
> a number of memcgs has been successfully reclaimed from and remember
> the last one scanned.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> include/linux/memcontrol.h | 7 ++
> mm/memcontrol.c | 148 +++++++++++++++++++++++++++++---------------
> mm/vmscan.c | 21 +++++--
> 3 files changed, 120 insertions(+), 56 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 5e9840f5..58728c7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -104,6 +104,7 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
> /*
> * For memory reclaim.
> */
> +void mem_cgroup_hierarchy_walk(struct mem_cgroup *, struct mem_cgroup **);
> int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
> int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
> unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> @@ -289,6 +290,12 @@ static inline bool mem_cgroup_disabled(void)
> return true;
> }
>
> +static inline void mem_cgroup_hierarchy_walk(struct mem_cgroup *start,
> + struct mem_cgroup **iter)
> +{
> + *iter = start;
> +}
> +
> static inline int
> mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
> {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index bf5ab87..edcd55a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -313,7 +313,7 @@ static bool move_file(void)
> }
>
> /*
> - * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
> + * Maximum loops in mem_cgroup_soft_reclaim(), used for soft
> * limit reclaim to prevent infinite loops, if they ever occur.
> */
> #define MEM_CGROUP_MAX_RECLAIM_LOOPS (100)
> @@ -339,16 +339,6 @@ enum charge_type {
> /* Used for OOM nofiier */
> #define OOM_CONTROL (0)
>
> -/*
> - * Reclaim flags for mem_cgroup_hierarchical_reclaim
> - */
> -#define MEM_CGROUP_RECLAIM_NOSWAP_BIT 0x0
> -#define MEM_CGROUP_RECLAIM_NOSWAP (1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
> -#define MEM_CGROUP_RECLAIM_SHRINK_BIT 0x1
> -#define MEM_CGROUP_RECLAIM_SHRINK (1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
> -#define MEM_CGROUP_RECLAIM_SOFT_BIT 0x2
> -#define MEM_CGROUP_RECLAIM_SOFT (1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
> -
> static void mem_cgroup_get(struct mem_cgroup *mem);
> static void mem_cgroup_put(struct mem_cgroup *mem);
> static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
> @@ -1381,6 +1371,86 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
> return min(limit, memsw);
> }
>
> +void mem_cgroup_hierarchy_walk(struct mem_cgroup *start,
> + struct mem_cgroup **iter)
> +{
> + struct mem_cgroup *mem = *iter;
> + int id;
> +
> + if (!start)
> + start = root_mem_cgroup;
> + /*
> + * Even without hierarchy explicitely enabled in the root
> + * memcg, it is the ultimate parent of all memcgs.
> + */
> + if (!(start == root_mem_cgroup || start->use_hierarchy)) {
> + *iter = start;
> + return;
> + }
> +
> + if (!mem)
> + id = css_id(&start->css);
> + else {
> + id = css_id(&mem->css);
> + css_put(&mem->css);
> + mem = NULL;
> + }
> +
> + do {
> + struct cgroup_subsys_state *css;
> +
> + rcu_read_lock();
> + css = css_get_next(&mem_cgroup_subsys, id+1, &start->css, &id);
> + /*
> + * The caller must already have a reference to the
> + * starting point of this hierarchy walk, do not grab
> + * another one. This way, the loop can be finished
> + * when the hierarchy root is returned, without any
> + * further cleanup required.
> + */
> + if (css && (css == &start->css || css_tryget(css)))
> + mem = container_of(css, struct mem_cgroup, css);
> + rcu_read_unlock();
> + if (!css)
> + id = 0;
> + } while (!mem);
> +
> + if (mem == root_mem_cgroup)
> + mem = NULL;
> +
> + *iter = mem;
> +}
> +
> +static unsigned long mem_cgroup_target_reclaim(struct mem_cgroup *mem,
> + gfp_t gfp_mask,
> + bool noswap,
> + bool shrink)
> +{
> + unsigned long total = 0;
> + int loop;
> +
> + if (mem->memsw_is_minimum)
> + noswap = true;
> +
> + for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
> + drain_all_stock_async();
> + total += try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap,
> + get_swappiness(mem));
> + if (total && shrink)
> + break;
> + if (mem_cgroup_margin(mem))
> + break;
> + /*
> + * If we have not been able to reclaim anything after
> + * two reclaim attempts, there may be no reclaimable
> + * pages under this hierarchy.
> + */
> + if (loop && !total)
> + break;
> + }
> + return total;
> +}
> +
> /*
> * Visit the first child (need not be the first child as per the ordering
> * of the cgroup list, since we track last_scanned_child) of @mem and use
> @@ -1427,21 +1497,16 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
> *
> * We give up and return to the caller when we visit root_mem twice.
> * (other groups can be removed while we're walking....)
> - *
> - * If shrink==true, for avoiding to free too much, this returns immedieately.
> */
> -static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> - struct zone *zone,
> - gfp_t gfp_mask,
> - unsigned long reclaim_options)
> +static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_mem,
> + struct zone *zone,
> + gfp_t gfp_mask)
> {
> struct mem_cgroup *victim;
> int ret, total = 0;
> int loop = 0;
> - bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
> - bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
> - bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
> unsigned long excess;
> + bool noswap = false;
>
> excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
>
> @@ -1461,7 +1526,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> * anything, it might because there are
> * no reclaimable pages under this hierarchy
> */
> - if (!check_soft || !total) {
> + if (!total) {
> css_put(&victim->css);
> break;
> }
> @@ -1484,25 +1549,11 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> continue;
> }
> /* we use swappiness of local cgroup */
> - if (check_soft)
> - ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
> + ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
> noswap, get_swappiness(victim), zone);
> - else
> - ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> - noswap, get_swappiness(victim));
> css_put(&victim->css);
> - /*
> - * At shrinking usage, we can't check we should stop here or
> - * reclaim more. It's depends on callers. last_scanned_child
> - * will work enough for keeping fairness under tree.
> - */
> - if (shrink)
> - return ret;
> total += ret;
> - if (check_soft) {
> - if (!res_counter_soft_limit_excess(&root_mem->res))
> - return total;
> - } else if (mem_cgroup_margin(root_mem))
> + if (!res_counter_soft_limit_excess(&root_mem->res))
> return total;
> }
> return total;
> @@ -1897,7 +1948,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
> unsigned long csize = nr_pages * PAGE_SIZE;
> struct mem_cgroup *mem_over_limit;
> struct res_counter *fail_res;
> - unsigned long flags = 0;
> + bool noswap = false;
> int ret;
>
> ret = res_counter_charge(&mem->res, csize, &fail_res);
> @@ -1911,7 +1962,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
>
> res_counter_uncharge(&mem->res, csize);
> mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
> - flags |= MEM_CGROUP_RECLAIM_NOSWAP;
> + noswap = true;
> } else
> mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> /*
> @@ -1927,8 +1978,8 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
> if (!(gfp_mask & __GFP_WAIT))
> return CHARGE_WOULDBLOCK;
>
> - ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> - gfp_mask, flags);
> + ret = mem_cgroup_target_reclaim(mem_over_limit, gfp_mask,
> + noswap, false);
> if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> return CHARGE_RETRY;
> /*
> @@ -3085,7 +3136,7 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
>
> /*
> * A call to try to shrink memory usage on charge failure at shmem's swapin.
> - * Calling hierarchical_reclaim is not enough because we should update
> + * Calling target_reclaim is not enough because we should update
> * last_oom_jiffies to prevent pagefault_out_of_memory from invoking global OOM.
> * Moreover considering hierarchy, we should reclaim from the mem_over_limit,
> * not from the memcg which this page would be charged to.
> @@ -3167,7 +3218,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> int enlarge;
>
> /*
> - * For keeping hierarchical_reclaim simple, how long we should retry
> + * For keeping target_reclaim simple, how long we should retry
> * is depends on callers. We set our retry-count to be function
> * of # of children which we should visit in this loop.
> */
> @@ -3210,8 +3261,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> if (!ret)
> break;
>
> - mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
> - MEM_CGROUP_RECLAIM_SHRINK);
> + mem_cgroup_target_reclaim(memcg, GFP_KERNEL, false, false);
> curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
> /* Usage is reduced ? */
> if (curusage >= oldusage)
> @@ -3269,9 +3319,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> if (!ret)
> break;
>
> - mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
> - MEM_CGROUP_RECLAIM_NOSWAP |
> - MEM_CGROUP_RECLAIM_SHRINK);
> + mem_cgroup_target_reclaim(memcg, GFP_KERNEL, true, false);
> curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
> /* Usage is reduced ? */
> if (curusage >= oldusage)
> @@ -3311,9 +3359,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> if (!mz)
> break;
>
> - reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
> - gfp_mask,
> - MEM_CGROUP_RECLAIM_SOFT);
> + reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone, gfp_mask);
> nr_reclaimed += reclaimed;
> spin_lock(&mctz->lock);
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ceeb2a5..e2a3647 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1900,8 +1900,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
> /*
> * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
> */
> -static void shrink_zone(int priority, struct zone *zone,
> - struct scan_control *sc)
> +static void do_shrink_zone(int priority, struct zone *zone,
> + struct scan_control *sc)
> {
> unsigned long nr[NR_LRU_LISTS];
> unsigned long nr_to_scan;
> @@ -1914,8 +1914,6 @@ restart:
> nr_scanned = sc->nr_scanned;
> get_scan_count(zone, sc, nr, priority);
>
> - sc->current_memcg = sc->memcg;
> -
> while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> nr[LRU_INACTIVE_FILE]) {
> for_each_evictable_lru(l) {
> @@ -1954,6 +1952,19 @@ restart:
> goto restart;
>
> throttle_vm_writeout(sc->gfp_mask);
> +}
> +
> +static void shrink_zone(int priority, struct zone *zone,
> + struct scan_control *sc)
> +{
> + struct mem_cgroup *root = sc->memcg;
> + struct mem_cgroup *mem = NULL;
> +
> + do {
> + mem_cgroup_hierarchy_walk(root, &mem);
> + sc->current_memcg = mem;
> + do_shrink_zone(priority, zone, sc);

If I don't miss something, css_put() against mem->css will be required somewhere.

Thanks,
-Kame

2011-05-13 06:55:27

by Johannes Weiner

[permalink] [raw]
Subject: Re: [rfc patch 3/6] mm: memcg-aware global reclaim

On Fri, May 13, 2011 at 09:40:50AM +0900, KAMEZAWA Hiroyuki wrote:
> > @@ -1954,6 +1952,19 @@ restart:
> > goto restart;
> >
> > throttle_vm_writeout(sc->gfp_mask);
> > +}
> > +
> > +static void shrink_zone(int priority, struct zone *zone,
> > + struct scan_control *sc)
> > +{
> > + struct mem_cgroup *root = sc->memcg;
> > + struct mem_cgroup *mem = NULL;
> > +
> > + do {
> > + mem_cgroup_hierarchy_walk(root, &mem);
> > + sc->current_memcg = mem;
> > + do_shrink_zone(priority, zone, sc);
>
> If I don't miss something, css_put() against mem->css will be required somewhere.

That's a bit of a hack. mem_cgroup_hierarchy_walk() always does
css_put() on *mem before advancing to the next child.

At the last iteration, it returns mem == root. Since the caller must
have a reference on root to begin with, it does not css_get() root.

So when mem == root, there are no outstanding references from the walk
anymore.

This only works since it always does the full hierarchy walk, so it's
going away anyway when the hierarchy walk becomes intermittent.
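
For reference, the caller side of that contract, condensed from the 3/6 hunk
quoted above; the comments spell out the reference ownership and are
annotation, not part of the posted patch:

static void shrink_zone(int priority, struct zone *zone,
                        struct scan_control *sc)
{
        struct mem_cgroup *root = sc->memcg;    /* caller already holds this css */
        struct mem_cgroup *mem = NULL;

        do {
                /*
                 * Drops the reference taken on the previous *mem and
                 * returns the next child with a fresh css_tryget() --
                 * except for the hierarchy root itself (or NULL for
                 * global reclaim), which is returned without taking
                 * another reference.
                 */
                mem_cgroup_hierarchy_walk(root, &mem);
                sc->current_memcg = mem;
                do_shrink_zone(priority, zone, sc);
        } while (mem != root);  /* no walk references left at this point */
}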

2011-05-13 06:59:13

by Johannes Weiner

[permalink] [raw]
Subject: Re: [rfc patch 2/6] vmscan: make distinction between memcg reclaim and LRU list selection

On Fri, May 13, 2011 at 08:50:27AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 12 May 2011 16:53:54 +0200
> Johannes Weiner <[email protected]> wrote:
>
> > The reclaim code has a single predicate for whether it currently
> > reclaims on behalf of a memory cgroup, as well as whether it is
> > reclaiming from the global LRU list or a memory cgroup LRU list.
> >
> > Up to now, both cases always coincide, but subsequent patches will
> > change things such that global reclaim will scan memory cgroup lists.
> >
> > This patch adds a new predicate that tells global reclaim from memory
> > cgroup reclaim, and then changes all callsites that are actually about
> > global reclaim heuristics rather than strict LRU list selection.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
>
>
> Hmm, isn't it better to merge this to patches where the meaning of
> new variable gets clearer ?

I apologize for the confusing order. I am going to merge them.

> > mm/vmscan.c | 96 ++++++++++++++++++++++++++++++++++------------------------
> > 1 files changed, 56 insertions(+), 40 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index f6b435c..ceeb2a5 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -104,8 +104,12 @@ struct scan_control {
> > */
> > reclaim_mode_t reclaim_mode;
> >
> > - /* Which cgroup do we reclaim from */
> > - struct mem_cgroup *mem_cgroup;
> > + /*
> > + * The memory cgroup we reclaim on behalf of, and the one we
> > + * are currently reclaiming from.
> > + */
> > + struct mem_cgroup *memcg;
> > + struct mem_cgroup *current_memcg;
> >
>
> I wonder if you avoid renaming the existing one, the patch will
> be clearer...

I renamed it mostly because I thought current_mem_cgroup too long.
It's probably best if both get more descriptive names.

> > @@ -154,16 +158,24 @@ static LIST_HEAD(shrinker_list);
> > static DECLARE_RWSEM(shrinker_rwsem);
> >
> > #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > -#define scanning_global_lru(sc) (!(sc)->mem_cgroup)
> > +static bool global_reclaim(struct scan_control *sc)
> > +{
> > + return !sc->memcg;
> > +}
> > +static bool scanning_global_lru(struct scan_control *sc)
> > +{
> > + return !sc->current_memcg;
> > +}
>
>
> Could you add comments ?

Yes, I will.

Thanks for your input!
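
For reference, one way the commented predicates could read; the comments are
suggestions, not taken from the posted patch:

#ifdef CONFIG_CGROUP_MEM_RES_CTLR
/* Is this reclaim on behalf of the whole machine, not a memcg limit? */
static bool global_reclaim(struct scan_control *sc)
{
        return !sc->memcg;
}
/* Are we scanning the global LRU lists, as opposed to a memcg's lists? */
static bool scanning_global_lru(struct scan_control *sc)
{
        return !sc->current_memcg;
}
#endif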

2011-05-13 07:09:00

by Johannes Weiner

[permalink] [raw]
Subject: Re: [rfc patch 3/6] mm: memcg-aware global reclaim

On Thu, May 12, 2011 at 12:19:45PM -0700, Ying Han wrote:
> On Thu, May 12, 2011 at 7:53 AM, Johannes Weiner <[email protected]> wrote:
>
> > A page charged to a memcg is linked to a lru list specific to that
> > memcg. At the same time, traditional global reclaim is oblivious to
> > memcgs, and all the pages are also linked to a global per-zone list.
> >
> > This patch changes traditional global reclaim to iterate over all
> > existing memcgs, so that it no longer relies on the global list being
> > present.
> >
>
> > This is one step forward in integrating memcg code better into the
> > rest of memory management. It is also a prerequisite to get rid
> > of the global per-zone lru lists.
> >
> Sorry If i misunderstood something here. I assume this patch has not
> much to do with the global soft_limit reclaim, but only allow the
> system only scan per-memcg lru under global memory pressure.

I see you found 6/6 in the meantime :) Did it answer your question?

> > The algorithm implemented in this patch is very naive. For each zone
> > scanned at each priority level, it iterates over all existing memcgs
> > and considers them for scanning.
> >
> > This is just a prototype and I did not optimize it yet because I am
> > unsure about the maximum number of memcgs that still constitute a sane
> > configuration in comparison to the machine size.
>
> So we also scan memcg which has no page allocated on this zone? I
> will read the following patch in case i missed something here :)

The old hierarchy walk skipped a memcg if it had no local pages at
all. I thought this was a rather unlikely situation and ripped it
out.

It will not loop persistently over a specific memcg and node
combination, like soft limit reclaim does at the moment.

Since this is much deeper integrated in memory reclaim now, it
benefits from all the existing mechanisms and will calculate the scan
target based on the number of lru pages on memcg->zone->lru, and do
nothing if there are no pages there.
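
Roughly, the per-LRU scan goal is derived from the scanned memcg's own LRU
size in that zone, so an empty memcg/zone combination generates no work. A
minimal sketch of the idea; the argument order of mem_cgroup_zone_nr_pages()
is assumed, since only its first parameter is visible in the posted diff:

/* Illustrative sketch only -- not the patch's get_scan_count(). */
static void memcg_scan_goals(struct mem_cgroup *mem, struct zone *zone,
                             int priority, unsigned long nr[NR_LRU_LISTS])
{
        enum lru_list l;

        for_each_evictable_lru(l) {
                unsigned long size;

                /* assumed signature: (memcg, zone, lru) */
                size = mem_cgroup_zone_nr_pages(mem, zone, l);
                nr[l] = size >> priority;       /* no pages here -> no scan work */
        }
}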

2011-05-13 07:18:52

by Johannes Weiner

[permalink] [raw]
Subject: Re: [rfc patch 3/6] mm: memcg-aware global reclaim

On Fri, May 13, 2011 at 09:04:50AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 12 May 2011 16:53:55 +0200
> Johannes Weiner <[email protected]> wrote:
>
> > A page charged to a memcg is linked to a lru list specific to that
> > memcg. At the same time, traditional global reclaim is oblivious to
> > memcgs, and all the pages are also linked to a global per-zone list.
> >
> > This patch changes traditional global reclaim to iterate over all
> > existing memcgs, so that it no longer relies on the global list being
> > present.
> >
> > This is one step forward in integrating memcg code better into the
> > rest of memory management. It is also a prerequisite to get rid of
> > the global per-zone lru lists.
>
> As I said, I don't want removing global reclaim until dirty_ratio support and
> better softlimit algorithm, at least. Current my concern is dirty_ratio,
> if you want to speed up, please help Greg and implement dirty_ratio first.

As I said, I am not proposing this for integration now. It was more
like asking if people were okay with this direction before we put
things in place that could be in the way of the long-term plan.

Note that 6/6 is an attempt to improve the soft limit algorithm.

> BTW, could you separate clean up code and your new logic? The 1st half of
> the code seems to be just a clean up and seems nice. But, IIUC, someone
> changed the arguments from a chunk of params to flags... in some patch.

Sorry again, I know that the series is pretty unorganized.

> + do {
> + mem_cgroup_hierarchy_walk(root, &mem);
> + sc->current_memcg = mem;
> + do_shrink_zone(priority, zone, sc);
> + } while (mem != root);
>
> This move hierarchy walk from memcontrol.c to vmscan.c ?
>
> About moving hierarchy walk, I may say okay...because my patch does this, too.
>
> But....doesn't this reclaim too much memory if hierarchy is very deep ?
> Could you add some 'quit' path ?

Yes, I think I'll just reinstate the logic from
mem_cgroup_select_victim() to remember the last child, and add an exit
condition based on the number of reclaimed pages.

This was also suggested by Rik in this thread already.
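
A rough sketch of what such a quit path could look like; the cutoff and the
early reference drop are illustrative, not part of the posted series:

static void shrink_zone(int priority, struct zone *zone,
                        struct scan_control *sc)
{
        struct mem_cgroup *root = sc->memcg;
        struct mem_cgroup *mem = NULL;
        unsigned long reclaimed = sc->nr_reclaimed;

        do {
                /* would resume at root's last scanned child */
                mem_cgroup_hierarchy_walk(root, &mem);
                sc->current_memcg = mem;
                do_shrink_zone(priority, zone, sc);
                /* made enough progress in this zone: stop walking children */
                if (sc->nr_reclaimed - reclaimed >= sc->nr_to_reclaim) {
                        /* an intermittent walk must drop its own reference */
                        if (mem != root)
                                css_put(&mem->css);
                        break;
                }
        } while (mem != root);
}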

2011-05-13 07:21:16

by Johannes Weiner

[permalink] [raw]
Subject: Re: [rfc patch 0/6] mm: memcg naturalization

On Thu, May 12, 2011 at 11:53:37AM -0700, Ying Han wrote:
> On Thu, May 12, 2011 at 7:53 AM, Johannes Weiner <[email protected]> wrote:
>
> > Hi!
> >
> > Here is a patch series that is a result of the memcg discussions on
> > LSF (memcg-aware global reclaim, global lru removal, struct
> > page_cgroup reduction, soft limit implementation) and the recent
> > feature discussions on linux-mm.
> >
> > The long-term idea is to have memcgs no longer bolted to the side of
> > the mm code, but integrate it as much as possible such that there is a
> > native understanding of containers, and that the traditional !memcg
> > setup is just a singular group. This series is an approach in that
> > direction.
> >
> > It is a rather early snapshot, WIP, barely tested etc., but I wanted
> > to get your opinions before further pursuing it. It is also part of
> > my counter-argument to the proposals of adding memcg-reclaim-related
> > user interfaces at this point in time, so I wanted to push this out
> > the door before things are merged into .40.
> >
>
> The memcg-reclaim-related user interface I assume was the watermark
> configurable tunable we were talking about in the per-memcg
> background reclaim patch. I think we got some agreement to remove
> the watermark tunable at the first step. But the newly added
> memory.soft_limit_async_reclaim as you proposed seems to be a usable
> interface.

Actually, I meant the soft limit reclaim statistics. There is a
comment about that in the 6/6 changelog.

2011-05-13 09:23:14

by Michal Hocko

[permalink] [raw]
Subject: Re: [rfc patch 1/6] memcg: remove unused retry signal from reclaim

On Thu 12-05-11 16:53:53, Johannes Weiner wrote:
> If the memcg reclaim code detects the target memcg below its limit it
> exits and returns a guaranteed non-zero value so that the charge is
> retried.
>
> Nowadays, the charge side checks the memcg limit itself and does not
> rely on this non-zero return value trick.
>
> This patch removes it. The reclaim code will now always return the
> true number of pages it reclaimed on its own.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Makes sense
Reviewed-by: Michal Hocko <[email protected]>

> ---
> mm/memcontrol.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 010f916..bf5ab87 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1503,7 +1503,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> if (!res_counter_soft_limit_excess(&root_mem->res))
> return total;
> } else if (mem_cgroup_margin(root_mem))
> - return 1 + total;
> + return total;
> }
> return total;
> }
> --
> 1.7.5.1
>

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-05-13 09:53:12

by Michal Hocko

[permalink] [raw]
Subject: Re: [rfc patch 3/6] mm: memcg-aware global reclaim

On Thu 12-05-11 16:53:55, Johannes Weiner wrote:
> A page charged to a memcg is linked to a lru list specific to that
> memcg. At the same time, traditional global reclaim is oblivious to
> memcgs, and all the pages are also linked to a global per-zone list.
>
> This patch changes traditional global reclaim to iterate over all
> existing memcgs, so that it no longer relies on the global list being
> present.

At LSF we have discussed that we should keep a list of over-(soft)limit
cgroups in a list which would be the first target for reclaiming (in
round-robin fashion). If we are not able to reclaim enough from those
(the list becomes empty) we should fallback to the all groups reclaim
(what you did in this patchset).
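
A minimal sketch of that scheme, in which the list, its lock and the
soft_limit_link member are all hypothetical:

static LIST_HEAD(soft_limit_list);              /* hypothetical */
static DEFINE_SPINLOCK(soft_limit_lock);        /* hypothetical */

/* Pick the next over-soft-limit group in round-robin order. */
static struct mem_cgroup *pick_soft_limit_victim(void)
{
        struct mem_cgroup *mem = NULL;

        spin_lock(&soft_limit_lock);
        if (!list_empty(&soft_limit_list)) {
                /* soft_limit_link would be a new member of struct mem_cgroup */
                mem = list_first_entry(&soft_limit_list,
                                       struct mem_cgroup, soft_limit_link);
                list_move_tail(&mem->soft_limit_link, &soft_limit_list);
                css_get(&mem->css);
        }
        spin_unlock(&soft_limit_lock);
        return mem;     /* NULL: list empty, fall back to all-groups reclaim */
}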

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-05-13 09:53:51

by Michal Hocko

[permalink] [raw]
Subject: Re: [rfc patch 5/6] memcg: remove global LRU list

On Thu 12-05-11 16:53:57, Johannes Weiner wrote:
> Since the VM now has means to do global reclaim from the per-memcg lru
> lists, the global LRU list is no longer required.

Shouldn't this one be at the end of the series?

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-05-13 10:29:20

by Johannes Weiner

[permalink] [raw]
Subject: Re: [rfc patch 3/6] mm: memcg-aware global reclaim

On Fri, May 13, 2011 at 11:53:08AM +0200, Michal Hocko wrote:
> On Thu 12-05-11 16:53:55, Johannes Weiner wrote:
> > A page charged to a memcg is linked to a lru list specific to that
> > memcg. At the same time, traditional global reclaim is oblivious to
> > memcgs, and all the pages are also linked to a global per-zone list.
> >
> > This patch changes traditional global reclaim to iterate over all
> > existing memcgs, so that it no longer relies on the global list being
> > present.
>
> At LSF we have discussed that we should keep a list of over-(soft)limit
> cgroups in a list which would be the first target for reclaiming (in
> round-robin fashion). If we are not able to reclaim enough from those
> (the list becomes empty) we should fallback to the all groups reclaim
> (what you did in this patchset).

This would be on top or instead of 6/6. This, 3/6, is indepent of
soft limit reclaim. It is mainly in preparation to remove the global
LRU.

2011-05-13 10:36:29

by Johannes Weiner

[permalink] [raw]
Subject: Re: [rfc patch 5/6] memcg: remove global LRU list

On Fri, May 13, 2011 at 11:53:48AM +0200, Michal Hocko wrote:
> On Thu 12-05-11 16:53:57, Johannes Weiner wrote:
> > Since the VM now has means to do global reclaim from the per-memcg lru
> > lists, the global LRU list is no longer required.
>
> Shouldn't this one be at the end of the series?

I don't really have an opinion. Why do you think it should?

2011-05-13 11:01:29

by Michal Hocko

[permalink] [raw]
Subject: Re: [rfc patch 5/6] memcg: remove global LRU list

On Fri 13-05-11 12:36:08, Johannes Weiner wrote:
> On Fri, May 13, 2011 at 11:53:48AM +0200, Michal Hocko wrote:
> > On Thu 12-05-11 16:53:57, Johannes Weiner wrote:
> > > Since the VM now has means to do global reclaim from the per-memcg lru
> > > lists, the global LRU list is no longer required.
> >
> > Shouldn't this one be at the end of the series?
>
> I don't really have an opinion. Why do you think it should?

It is the last step in my eyes and maybe we want to keep the global
LRU as a fallback for some time, just to get an impression (with some
tracepoints) of how well the per-cgroup reclaim goes.

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-05-13 11:02:30

by Michal Hocko

[permalink] [raw]
Subject: Re: [rfc patch 3/6] mm: memcg-aware global reclaim

On Fri 13-05-11 12:28:58, Johannes Weiner wrote:
> On Fri, May 13, 2011 at 11:53:08AM +0200, Michal Hocko wrote:
> > On Thu 12-05-11 16:53:55, Johannes Weiner wrote:
> > > A page charged to a memcg is linked to a lru list specific to that
> > > memcg. At the same time, traditional global reclaim is oblivious to
> > > memcgs, and all the pages are also linked to a global per-zone list.
> > >
> > > This patch changes traditional global reclaim to iterate over all
> > > existing memcgs, so that it no longer relies on the global list being
> > > present.
> >
> > At LSF we have discussed that we should keep a list of over-(soft)limit
> > cgroups in a list which would be the first target for reclaiming (in
> > round-robin fashion). If we are not able to reclaim enough from those
> > (the list becomes empty) we should fallback to the all groups reclaim
> > (what you did in this patchset).
>
> This would be on top or instead of 6/6. This, 3/6, is indepent of
> soft limit reclaim. It is mainly in preparation to remove the global
> LRU.

OK.

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-05-16 10:35:50

by Balbir Singh

[permalink] [raw]
Subject: Re: [rfc patch 0/6] mm: memcg naturalization

* Johannes Weiner <[email protected]> [2011-05-12 16:53:52]:

> Hi!
>
> Here is a patch series that is a result of the memcg discussions on
> LSF (memcg-aware global reclaim, global lru removal, struct
> page_cgroup reduction, soft limit implementation) and the recent
> feature discussions on linux-mm.
>
> The long-term idea is to have memcgs no longer bolted to the side of
> the mm code, but integrate it as much as possible such that there is a
> native understanding of containers, and that the traditional !memcg
> setup is just a singular group. This series is an approach in that
> direction.
>
> It is a rather early snapshot, WIP, barely tested etc., but I wanted
> to get your opinions before further pursuing it. It is also part of
> my counter-argument to the proposals of adding memcg-reclaim-related
> user interfaces at this point in time, so I wanted to push this out
> the door before things are merged into .40.
>
> The patches are quite big, I am still looking for things to factor and
> split out, sorry for this. Documentation is on its way as well ;)
>
> #1 and #2 are boring preparational work. #3 makes traditional reclaim
> in vmscan.c memcg-aware, which is a prerequisite for both removal of
> the global lru in #5 and the way I reimplemented soft limit reclaim in
> #6.

A large part of the acceptance would be based on what the test results
for common mm benchmarks show.

--
Three Cheers,
Balbir

2011-05-16 10:58:06

by Johannes Weiner

[permalink] [raw]
Subject: Re: [rfc patch 0/6] mm: memcg naturalization

On Mon, May 16, 2011 at 04:00:34PM +0530, Balbir Singh wrote:
> * Johannes Weiner <[email protected]> [2011-05-12 16:53:52]:
>
> > Hi!
> >
> > Here is a patch series that is a result of the memcg discussions on
> > LSF (memcg-aware global reclaim, global lru removal, struct
> > page_cgroup reduction, soft limit implementation) and the recent
> > feature discussions on linux-mm.
> >
> > The long-term idea is to have memcgs no longer bolted to the side of
> > the mm code, but integrate it as much as possible such that there is a
> > native understanding of containers, and that the traditional !memcg
> > setup is just a singular group. This series is an approach in that
> > direction.
> >
> > It is a rather early snapshot, WIP, barely tested etc., but I wanted
> > to get your opinions before further pursuing it. It is also part of
> > my counter-argument to the proposals of adding memcg-reclaim-related
> > user interfaces at this point in time, so I wanted to push this out
> > the door before things are merged into .40.
> >
> > The patches are quite big, I am still looking for things to factor and
> > split out, sorry for this. Documentation is on its way as well ;)
> >
> > #1 and #2 are boring preparational work. #3 makes traditional reclaim
> > in vmscan.c memcg-aware, which is a prerequisite for both removal of
> > the global lru in #5 and the way I reimplemented soft limit reclaim in
> > #6.
>
> A large part of the acceptance would be based on what the test results
> for common mm benchmarks show.

I will try to ensure the following things:

1. will not degrade performance on !CONFIG_MEMCG kernels

2. will not degrade performance on CONFIG_MEMCG kernels without
configured memcgs. This might be the most important one as most
desktop/server distributions enable the memory controller per default

3. will not degrade overall performance of workloads running
concurrently in separate memory control groups. I expect some shifts,
however, that even out performance differences.

Please let me know what you consider common mm benchmarks.

Thanks!

Hannes

2011-05-16 22:36:39

by Andrew Morton

[permalink] [raw]
Subject: Re: [rfc patch 2/6] vmscan: make distinction between memcg reclaim and LRU list selection

On Fri, 13 May 2011 08:58:54 +0200
Johannes Weiner <[email protected]> wrote:

> > > @@ -154,16 +158,24 @@ static LIST_HEAD(shrinker_list);
> > > static DECLARE_RWSEM(shrinker_rwsem);
> > >
> > > #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > > -#define scanning_global_lru(sc) (!(sc)->mem_cgroup)
> > > +static bool global_reclaim(struct scan_control *sc)
> > > +{
> > > + return !sc->memcg;
> > > +}
> > > +static bool scanning_global_lru(struct scan_control *sc)
> > > +{
> > > + return !sc->current_memcg;
> > > +}
> >
> >
> > Could you add comments ?

oy, that's my job.

> Yes, I will.

> +static bool global_reclaim(struct scan_control *sc) { return 1; }
> +static bool scanning_global_lru(struct scan_control *sc) { return 1; }

s/1/true/

And we may as well format the functions properly?

And it would be nice for the names of the functions to identify what
subsystem they belong to: memcg_global_reclaim() or such. Although
that's already been a bit messed up in memcg (and in the VM generally).
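
Put together, those suggestions would make the !CONFIG_CGROUP_MEM_RES_CTLR
stubs read roughly as follows (illustrative; whether the names also grow a
memcg_ prefix is left open):

#else   /* !CONFIG_CGROUP_MEM_RES_CTLR */
static bool global_reclaim(struct scan_control *sc)
{
        return true;
}

static bool scanning_global_lru(struct scan_control *sc)
{
        return true;
}
#endif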

2011-05-16 23:11:06

by Johannes Weiner

[permalink] [raw]
Subject: Re: [rfc patch 4/6] memcg: reclaim statistics

On Thu, May 12, 2011 at 12:33:50PM -0700, Ying Han wrote:
> On Thu, May 12, 2011 at 7:53 AM, Johannes Weiner <[email protected]> wrote:
>
> > TODO: write proper changelog. Here is an excerpt from
> > http://lkml.kernel.org/r/[email protected]:
> >
> > : 1. Limit-triggered direct reclaim
> > :
> > : The memory cgroup hits its limit and the task does direct reclaim from
> > : its own memcg. We probably want statistics for this separately from
> > : background reclaim to see how successful background reclaim is, the
> > : same reason we have this separation in the global vmstat as well.
> > :
> > : pgscan_direct_limit
> > : pgfree_direct_limit
> >
>
> Can we use "pgsteal_" instead? Not a big fan of the naming, but I want to make
> them consistent with other stats.

Actually, I thought what KAME-san said made sense. 'Stealing' is a
good fit for reclaim due to outside pressure. But if the memcg is
target-reclaimed from the inside because it hit the limit, is
'stealing' the appropriate term?
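
For illustration only, the two naming variants for the limit-triggered direct
counters side by side; neither exists in any tree:

/*
 * Illustration only.  The question above is whether the second counter of
 * the pair should read "free" (the memcg reclaims from itself) or "steal"
 * (matching the global vmstat wording for reclaim due to outside pressure).
 */
static const char * const memcg_limit_direct_stats[] = {
        "pgscan_direct_limit",
        "pgfree_direct_limit",          /* or "pgsteal_direct_limit" */
};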

> > : 2. Limit-triggered background reclaim
> > :
> > : This is the watermark-based asynchroneous reclaim that is currently in
> > : discussion. It's triggered by the memcg breaching its watermark,
> > : which is relative to its hard-limit. I named it kswapd because I
> > : still think kswapd should do this job, but it is all open for
> > : discussion, obviously. Treat it as meaning 'background' or
> > : 'asynchroneous'.
> > :
> > : pgscan_kswapd_limit
> > : pgfree_kswapd_limit
> >
>
> Kame might have these stats in the per-memcg bg reclaim patch. Just mentioning it
> here since it will make the later merge
> a bit harder.

I'll have a look, thanks for the heads up.

> > : 3. Hierarchy-triggered direct reclaim
> > :
> > : A condition outside the memcg leads to a task directly reclaiming from
> > : this memcg. This could be global memory pressure for example, but
> > : also a parent cgroup hitting its limit. It's probably helpful to
> > : assume global memory pressure meaning that the root cgroup hit its
> > : limit, conceptually. We don't have that yet, but this could be the
> > : direct softlimit reclaim Ying mentioned above.
> > :
> > : pgscan_direct_hierarchy
> > : pgsteal_direct_hierarchy
> >
>
> The stats for soft_limit reclaim from global ttfp have been merged in mmotm
> i believe as the following:
>
> "soft_direct_steal"
> "soft_direct_scan"
>
> I wonder we might want to separate that out from the other case where the
> reclaim is from the parent triggers its limit.

The way I implemented soft limits in 6/6 is to increase pressure on
exceeding children whenever hierarchical reclaim is taking place.

This changes soft limit from

Global memory pressure: reclaim from exceeding memcg(s) first

to

Memory pressure on a memcg: reclaim from all its children,
with increased pressure on those exceeding their soft limit
(where global memory pressure means root_mem_cgroup and all
existing memcgs are considered its children)

which makes the soft limit much more generic and more powerful, as it
allows the admin to prioritize reclaim throughout the hierarchy, not
only for global memory pressure. Consider one memcg with two
subgroups. You can now prioritize reclaim to prefer one subgroup over
another through soft limiting.
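
The mechanism behind this is visible in the 6/6 hunk quoted earlier in the
thread: during any hierarchical reclaim pass, a child over its soft limit is
simply scanned at a more aggressive priority. Expressed as a standalone
helper (the posted patch open-codes this in shrink_zone(); the helper name is
illustrative), the rule is roughly:

static int soft_limit_priority(struct mem_cgroup *root,
                               struct mem_cgroup *mem, int priority)
{
        /* a lower priority value means a larger fraction of the LRUs is scanned */
        if (mem_cgroup_soft_limit_exceeded(root, mem))
                return priority - 1;
        return priority;
}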

This is one reason why I think that the approach of maintaining a
global list of memcgs that exceed their soft limits is an inferior
approach; it does not take the hierarchy into account at all.

This scheme would not provide a natural way of counting pages that
were reclaimed because of the soft limit, and thus I still oppose the
merging of soft limit counters.

Hannes

2011-05-17 06:32:50

by Balbir Singh

[permalink] [raw]
Subject: Re: [rfc patch 0/6] mm: memcg naturalization

* Johannes Weiner <[email protected]> [2011-05-16 12:57:29]:

> On Mon, May 16, 2011 at 04:00:34PM +0530, Balbir Singh wrote:
> > * Johannes Weiner <[email protected]> [2011-05-12 16:53:52]:
> >
> > > Hi!
> > >
> > > Here is a patch series that is a result of the memcg discussions on
> > > LSF (memcg-aware global reclaim, global lru removal, struct
> > > page_cgroup reduction, soft limit implementation) and the recent
> > > feature discussions on linux-mm.
> > >
> > > The long-term idea is to have memcgs no longer bolted to the side of
> > > the mm code, but integrate it as much as possible such that there is a
> > > native understanding of containers, and that the traditional !memcg
> > > setup is just a singular group. This series is an approach in that
> > > direction.
> > >
> > > It is a rather early snapshot, WIP, barely tested etc., but I wanted
> > > to get your opinions before further pursuing it. It is also part of
> > > my counter-argument to the proposals of adding memcg-reclaim-related
> > > user interfaces at this point in time, so I wanted to push this out
> > > the door before things are merged into .40.
> > >
> > > The patches are quite big, I am still looking for things to factor and
> > > split out, sorry for this. Documentation is on its way as well ;)
> > >
> > > #1 and #2 are boring preparational work. #3 makes traditional reclaim
> > > in vmscan.c memcg-aware, which is a prerequisite for both removal of
> > > the global lru in #5 and the way I reimplemented soft limit reclaim in
> > > #6.
> >
> > A large part of the acceptance would be based on what the test results
> > for common mm benchmarks show.
>
> I will try to ensure the following things:
>
> 1. will not degrade performance on !CONFIG_MEMCG kernels
>
> 2. will not degrade performance on CONFIG_MEMCG kernels without
> configured memcgs. This might be the most important one as most
> desktop/server distributions enable the memory controller per default
>
> 3. will not degrade overall performance of workloads running
> concurrently in separate memory control groups. I expect some shifts,
> however, that even out performance differences.
>
> Please let me know what you consider common mm benchmarks.

1, 2 and 3 do sound nice, what workload do you intend to run? We used
reaim, lmbench, page fault rate based tests.

--
Three Cheers,
Balbir

2011-05-17 06:38:14

by Ying Han

[permalink] [raw]
Subject: Re: [rfc patch 2/6] vmscan: make distinction between memcg reclaim and LRU list selection

On Thu, May 12, 2011 at 9:03 AM, Johannes Weiner <[email protected]> wrote:
> On Thu, May 12, 2011 at 11:33:13AM -0400, Rik van Riel wrote:
>> On 05/12/2011 10:53 AM, Johannes Weiner wrote:
>> >The reclaim code has a single predicate for whether it currently
>> >reclaims on behalf of a memory cgroup, as well as whether it is
>> >reclaiming from the global LRU list or a memory cgroup LRU list.
>> >
>> >Up to now, both cases always coincide, but subsequent patches will
>> >change things such that global reclaim will scan memory cgroup lists.
>> >
>> >This patch adds a new predicate that tells global reclaim from memory
>> >cgroup reclaim, and then changes all callsites that are actually about
>> >global reclaim heuristics rather than strict LRU list selection.
>> >
>> >Signed-off-by: Johannes Weiner<[email protected]>
>> >---
>> >  mm/vmscan.c |   96 ++++++++++++++++++++++++++++++++++------------------------
>> >  1 files changed, 56 insertions(+), 40 deletions(-)
>> >
>> >diff --git a/mm/vmscan.c b/mm/vmscan.c
>> >index f6b435c..ceeb2a5 100644
>> >--- a/mm/vmscan.c
>> >+++ b/mm/vmscan.c
>> >@@ -104,8 +104,12 @@ struct scan_control {
>> >         */
>> >        reclaim_mode_t reclaim_mode;
>> >
>> >-       /* Which cgroup do we reclaim from */
>> >-       struct mem_cgroup *mem_cgroup;
>> >+       /*
>> >+        * The memory cgroup we reclaim on behalf of, and the one we
>> >+        * are currently reclaiming from.
>> >+        */
>> >+       struct mem_cgroup *memcg;
>> >+       struct mem_cgroup *current_memcg;
>>
>> I can't say I'm fond of these names.  I had to read the
>> rest of the patch to figure out that the old mem_cgroup
>> got renamed to current_memcg.
>
> To clarify: sc->memcg will be the memcg that hit the hard limit and is
> the main target of this reclaim invocation.  current_memcg is the
> iterator over the hierarchy below the target.

I would assume the new variable memcg is a renaming of the
"mem_cgroup", indicating which cgroup we reclaim on behalf of.
About the "current_memcg", I couldn't find where it is indicated to
be the current cgroup under the hierarchy below the "memcg".

Both mem_cgroup_shrink_node_zone() and try_to_free_mem_cgroup_pages()
are called within mem_cgroup_hierarchical_reclaim(), and sc->memcg is
initialized with the victim passed down, which is already a memcg
within the hierarchy.

--Ying


> I realize this change in particular was placed a bit unfortunately in
> the series in terms of understanding; I just wanted to keep the
> mem_cgroup to current_memcg renaming out of the next patch. There is
> probably a better way, I'll fix it up and improve the comment.
>
>> Would it be better to call them my_memcg and reclaim_memcg?
>>
>> Maybe somebody else has better suggestions...
>
> Yes, suggestions welcome. I'm not too fond of the naming, either.
>
>> Other than the naming, no objection.
>
> Thanks, Rik.
>
>        Hannes
>

2011-05-17 07:42:57

by Johannes Weiner

[permalink] [raw]
Subject: Re: [rfc patch 4/6] memcg: reclaim statistics

On Mon, May 16, 2011 at 05:20:31PM -0700, Ying Han wrote:
> On Mon, May 16, 2011 at 4:10 PM, Johannes Weiner <[email protected]> wrote:
>
> > On Thu, May 12, 2011 at 12:33:50PM -0700, Ying Han wrote:
> > > The stats for soft_limit reclaim from global ttfp have been merged into
> > > mmotm, I believe, as the following:
> > >
> > > "soft_direct_steal"
> > > "soft_direct_scan"
> > >
> > > I wonder whether we might want to separate that out from the other case,
> > > where the reclaim is triggered by the parent hitting its limit.
> >
> > The way I implemented soft limits in 6/6 is to increase pressure on
> > exceeding children whenever hierarchical reclaim is taking place.
> >
> > This changes soft limit from
> >
> > Global memory pressure: reclaim from exceeding memcg(s) first
> >
> > to
> >
> > Memory pressure on a memcg: reclaim from all its children,
> > with increased pressure on those exceeding their soft limit
> > (where global memory pressure means root_mem_cgroup and all
> > existing memcgs are considered its children)
> >
> > which makes the soft limit much more generic and more powerful, as it
> > allows the admin to prioritize reclaim throughout the hierarchy, not
> > only for global memory pressure. Consider one memcg with two
> > subgroups. You can now prioritize reclaim to prefer one subgroup over
> > another through soft limiting.
> >
> > This is one reason why I think that the approach of maintaining a
> > global list of memcgs that exceed their soft limits is an inferior
> > approach; it does not take the hierarchy into account at all.
> >
> > This scheme would not provide a natural way of counting pages that
> > were reclaimed because of the soft limit, and thus I still oppose the
> > merging of soft limit counters.
>
> The proposal we discussed during LSF (implemented in the patch "memcg:
> revisit soft_limit reclaim on contention") takes hierarchical reclaim
> into consideration. The memcg is linked into the list if it exceeds its
> soft_limit, and the per-memcg soft_limit reclaim calls
> mem_cgroup_hierarchical_reclaim().

It does hierarchical soft limit reclaim once triggered, but I meant
that soft limits themselves have no hierarchical meaning. Say you
have the following hierarchy:

        root_mem_cgroup
         /           \
       aaa           bbb
      /   \         /   \
    a1     a2     b1     b2
     |
   a1-1

Consider that aaa and a1 have a soft limit set. If global memory
pressure arose, aaa and all its children would be pushed back with the
current scheme, the one you are proposing, and the one I am proposing.

But now consider aaa hitting its hard limit. Regular target reclaim
will be triggered, and a1, a2, and a1-1 will be scanned equally from
hierarchical reclaim. That a1 is in excess of its soft limit is not
considered at all.

With what I am proposing, a1 and a1-1 would be pushed back more
aggressively than a2, because a1 is in excess of its soft limit and
a1-1 is contributing to that.

It would mean that given a group of siblings, you distribute the
pressure weighted by the soft limit configuration, independent of the
kind of hierarchical/external pressure (global memory scarcity or
parent hit the hard limit).
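
To make that concrete, here is a minimal sketch of such a weighted
hierarchy walk. It is illustrative only: the helpers and fields it uses
(for_each_memcg_child(), soft_limit_excess(), shrink_memcg(),
sc->nr_to_scan) are made up for this example and are not the actual
series code.

        /*
         * Illustrative sketch only; all helpers are hypothetical.
         * Every child gets scanned, but a child in excess of its
         * soft limit gets scanned harder than its siblings.
         */
        static void shrink_hierarchy(struct mem_cgroup *target,
                                     struct scan_control *sc)
        {
                struct mem_cgroup *memcg;

                for_each_memcg_child(memcg, target) {
                        unsigned long nr = sc->nr_to_scan;

                        /* soft limit excess translates into extra pressure */
                        if (soft_limit_excess(memcg))
                                nr *= 2;

                        sc->current_memcg = memcg;
                        shrink_memcg(memcg, nr, sc);
                }
        }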

It's much easier to understand if you think of global memory pressure
to mean that root_mem_cgroup hit its hard limit, and that all existing
memcgs are hierarchically below the root_mem_cgroup. Although it is
technically not implemented that way, that would be the consistent
model.

My proposal is a generic and native way of enforcing soft limits: a
memcg hit its hard limit, reclaim from the hierarchy below it, prefer
those in excess of their soft limit.

While yours is special-cased to immediate descendants of the
root_mem_cgroup.

> The current "soft_steal" and "soft_scan" count pages being stolen/scanned
> inside mem_cgroup_hierarchical_reclaim() with check_soft checking, which
> then counts pages being reclaimed because of the soft_limit and also
> counts the hierarchical reclaim.

Yeah, I understand that. What I am saying is that in my code, every
time a hierarchy of memcgs is scanned (global memory reclaim, target
reclaim, kswapd or direct, it's all the same), a memcg that is in
excess of its soft limit is put under more pressure than its siblings.

There is no stand-alone 'now, go reclaim soft limits' cycle anymore.
As such, it would be impossible to maintain that counter.

2011-05-17 08:11:23

by Johannes Weiner

[permalink] [raw]
Subject: Re: [rfc patch 0/6] mm: memcg naturalization

On Mon, May 16, 2011 at 05:53:04PM -0700, Ying Han wrote:
> On Fri, May 13, 2011 at 12:20 AM, Johannes Weiner <[email protected]>wrote:
>
> > On Thu, May 12, 2011 at 11:53:37AM -0700, Ying Han wrote:
> > > On Thu, May 12, 2011 at 7:53 AM, Johannes Weiner <[email protected]>
> > wrote:
> > >
> > > > Hi!
> > > >
> > > > Here is a patch series that is a result of the memcg discussions on
> > > > LSF (memcg-aware global reclaim, global lru removal, struct
> > > > page_cgroup reduction, soft limit implementation) and the recent
> > > > feature discussions on linux-mm.
> > > >
> > > > The long-term idea is to have memcgs no longer bolted to the side of
> > > > the mm code, but integrate it as much as possible such that there is a
> > > > native understanding of containers, and that the traditional !memcg
> > > > setup is just a singular group. This series is an approach in that
> > > > direction.
> >
>
> This sounds like a good long-term plan. Now I wonder whether we should
> take it step by step by doing:
>
> 1. improving the existing soft_limit reclaim from RB-tree based to
> linked-list based, in a round-robin fashion.
> We can keep the existing APIs and only change the underlying
> implementation of mem_cgroup_soft_limit_reclaim().
>
> 2. removing the global lru list after the first step is proven to be
> efficient.
>
> 3. then integrating memcg reclaim better into the mm code.

I chose to go the other way because it did not seem more complex to me
and fixed many things we had planned anyway: deeper integration, a
better soft limit implementation (including better pressure
distribution and enforcement also from direct reclaim, not just
kswapd), global lru removal, etc.

That ground work was a bit unwieldy and I think quite some confusion
ensued, but I am currently reorganizing, cleaning up, and documenting.
I expect the next version to be much easier to understand.

The three steps are still this:

1. make traditional reclaim memcg-aware.

2. improve soft limit based on 1.

3. remove global lru based on 1.

But 1. already effectively disables the global LRU for memcg-enabled
kernels, so 3. can be deferred until we are comfortable with 1.

Hannes

2011-05-17 08:25:52

by Johannes Weiner

[permalink] [raw]
Subject: Re: [rfc patch 2/6] vmscan: make distinction between memcg reclaim and LRU list selection

On Mon, May 16, 2011 at 11:38:07PM -0700, Ying Han wrote:
> On Thu, May 12, 2011 at 9:03 AM, Johannes Weiner <[email protected]> wrote:
> > On Thu, May 12, 2011 at 11:33:13AM -0400, Rik van Riel wrote:
> >> On 05/12/2011 10:53 AM, Johannes Weiner wrote:
> >> >The reclaim code has a single predicate for whether it currently
> >> >reclaims on behalf of a memory cgroup, as well as whether it is
> >> >reclaiming from the global LRU list or a memory cgroup LRU list.
> >> >
> >> >Up to now, both cases always coincide, but subsequent patches will
> >> >change things such that global reclaim will scan memory cgroup lists.
> >> >
> >> >This patch adds a new predicate that tells global reclaim from memory
> >> >cgroup reclaim, and then changes all callsites that are actually about
> >> >global reclaim heuristics rather than strict LRU list selection.
> >> >
> >> >Signed-off-by: Johannes Weiner<[email protected]>
> >> >---
> >> >  mm/vmscan.c |   96 ++++++++++++++++++++++++++++++++++------------------------
> >> >  1 files changed, 56 insertions(+), 40 deletions(-)
> >> >
> >> >diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> >index f6b435c..ceeb2a5 100644
> >> >--- a/mm/vmscan.c
> >> >+++ b/mm/vmscan.c
> >> >@@ -104,8 +104,12 @@ struct scan_control {
> >> >      */
> >> >     reclaim_mode_t reclaim_mode;
> >> >
> >> >-    /* Which cgroup do we reclaim from */
> >> >-    struct mem_cgroup *mem_cgroup;
> >> >+    /*
> >> >+     * The memory cgroup we reclaim on behalf of, and the one we
> >> >+     * are currently reclaiming from.
> >> >+     */
> >> >+    struct mem_cgroup *memcg;
> >> >+    struct mem_cgroup *current_memcg;
> >>
> >> I can't say I'm fond of these names. I had to read the
> >> rest of the patch to figure out that the old mem_cgroup
> >> got renamed to current_memcg.
> >
> > To clarify: sc->memcg will be the memcg that hit the hard limit and is
> > the main target of this reclaim invocation. current_memcg is the
> > iterator over the hierarchy below the target.
>
> I would assume the new variable memcg is a renaming of "mem_cgroup",
> indicating which cgroup we reclaim on behalf of.

The thing is, mem_cgroup would mean both the group we are reclaiming
on behalf of AND the group we are currently reclaiming from. Because
the hierarchy walk was implemented in memcontrol.c, vmscan.c only ever
saw one cgroup at a time.

> As for "current_memcg", I couldn't find where it is set to the
> current cgroup in the hierarchy below "memcg".

It's codified in shrink_zone().

        for each child of sc->memcg:
                sc->current_memcg = child
                reclaim(sc)
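
Roughly, in C, that loop would look like this (sketch only: the child
iterator mem_cgroup_next_child() is made up for illustration, while
do_shrink_zone() and the sc fields follow the series):

        /* Sketch of the walk above; mem_cgroup_next_child() is hypothetical. */
        static void shrink_zone(int priority, struct zone *zone,
                                struct scan_control *sc)
        {
                struct mem_cgroup *child = NULL;

                while ((child = mem_cgroup_next_child(sc->memcg, child))) {
                        /* the group whose pages are reclaimed right now */
                        sc->current_memcg = child;
                        do_shrink_zone(priority, zone, sc);
                }
        }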

In the new version I named (and documented) them:

        sc->target_mem_cgroup: the entry point into the hierarchy, set
        by the functions that have the scan control structure on their
        stack. That's the one hitting its hard limit.

        sc->mem_cgroup: the current position in the hierarchy below
        sc->target_mem_cgroup. That's the one that actively gets its
        pages reclaimed.
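
In scan_control terms, that amounts to something like this (a sketch of
the intended documentation, not the literal hunk from the next version):

        struct scan_control {
                /* ... other fields ... */

                /*
                 * The memcg that hit its hard limit; the entry point
                 * into the hierarchy for this reclaim invocation.
                 */
                struct mem_cgroup *target_mem_cgroup;

                /*
                 * Current position in the hierarchy below
                 * target_mem_cgroup; the group whose pages actively
                 * get reclaimed.
                 */
                struct mem_cgroup *mem_cgroup;
        };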

> Both mem_cgroup_shrink_node_zone() and try_to_free_mem_cgroup_pages()
> are called within mem_cgroup_hierarchical_reclaim(), and sc->memcg is
> initialized with the victim passed down, which is already a memcg
> within the hierarchy.

I changed mem_cgroup_shrink_node_zone() to use do_shrink_zone(), and
mem_cgroup_hierarchical_reclaim() no longer calls
try_to_free_mem_cgroup_pages().

So there is no hierarchy walk triggered from within a hierarchy walk.

I just noticed that there is, however, a bug in that
mem_cgroup_shrink_node_zone() does not initialize sc->current_memcg.
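
Something along these lines should plug that hole, I think (sketch
only; other initializers omitted, and the argument name mem is assumed):

        /*
         * Sketch of the missing initialization in
         * mem_cgroup_shrink_node_zone(): no hierarchy walk is wanted
         * here, so both fields point at the same group.
         */
        struct scan_control sc = {
                .memcg          = mem,
                .current_memcg  = mem,
                /* ... other fields as before ... */
        };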

2011-05-17 13:56:08

by Rik van Riel

[permalink] [raw]
Subject: Re: [rfc patch 4/6] memcg: reclaim statistics

On 05/17/2011 03:42 AM, Johannes Weiner wrote:

> It does hierarchical soft limit reclaim once triggered, but I meant
> that soft limits themselves have no hierarchical meaning. Say you
> have the following hierarchy:
>
>         root_mem_cgroup
>          /           \
>        aaa           bbb
>       /   \         /   \
>     a1     a2     b1     b2
>      |
>    a1-1
>
> Consider that aaa and a1 have a soft limit set. If global memory
> pressure arose, aaa and all its children would be pushed back with the
> current scheme, the one you are proposing, and the one I am proposing.
>
> But now consider aaa hitting its hard limit. Regular target reclaim
> will be triggered, and a1, a2, and a1-1 will be scanned equally from
> hierarchical reclaim. That a1 is in excess of its soft limit is not
> considered at all.
>
> With what I am proposing, a1 and a1-1 would be pushed back more
> aggressively than a2, because a1 is in excess of its soft limit and
> a1-1 is contributing to that.

Ying, I think Johannes has a good point. I do not see
a way to enforce the limits properly with the scheme we
came up with at LSF, in the hierarchical scenario above.

There may be a way, but until we think of it, I suspect
it will be better to go with Johannes's scheme for now.

--
All rights reversed