2011-06-16 03:55:22

by Kamezawa Hiroyuki

Subject: [PATCH 0/7] memcg numa node scan update.


In the last month, I added a round-robin scan of NUMA nodes at
hitting the limit, and wrote that "a better algorithm is needed."

Here is an update. Because some of the patches are bugfixes, I may
split them out as independent patches.

Patches 6-7/7 implement the selection logic.

==
Tested on an 8-CPU/24GB system, which has 2 nodes.
Memory is limited to 300MB and httpd runs under it.
httpd's working set is 4096 files/600MB.

Then, 40960 accesses are made by apache-bench and we see how much memory
reclaim costs. Because an httpd thread doesn't consume much CPU, the number
of working threads is not balanced between NUMA nodes, so the file caches
will not be balanced either.

[round-robin]
[kamezawa@bluextal ~]$ cat /cgroup/memory/test/memory.scan_stat
scanned_pages_by_limit 550740
freed_pages_by_limit 206473
elapsed_ns_by_limit 9485418834

[After patch]
scanned_pages_by_limit 521520
freed_pages_by_limit 199330
elapsed_ns_by_limit 7904913234

I can see the elapsed time is decreased.
Tests on bigger machines are welcome.

Thanks,
-Kame


2011-06-16 03:58:41

by Kamezawa Hiroyuki

Subject: [PATCH 1/7] Fix mem_cgroup_hierarchical_reclaim() to do stable hierarchy walk.

The patch is against mmotm-06-15.
==
From e58c243f3a5e5ace225a366b4f9d4dfdb0254e28 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Wed, 15 Jun 2011 11:27:04 +0900
Subject: [PATCH 1/7] Fix mem_cgroup_hierarchical_reclaim() to do stable hierarchy walk.

Now, mem_cgroup_hierarchical_reclaim() walks memory cgroups under a tree
from a saved point (root_mem->last_scanned_child) until it visits
root_mem (the top of the hierarchy tree) twice.

This means an unstable walk. Assume a tree consisting of 6 groups:

Root-A-B-C-D-E.

When you start a scan from Root:
Root->A->B->C->D->E->Root ==> ends after scanning 6 groups.

When you start a scan from "A":
A->B->C->D->E->Root->A->B->C->D->E->Root ==> ends after scanning 11 groups.

This is unstable. This patch fixes it so that a stable number of groups is
visited at every scan: all groups are visited exactly once. In the above case,
A->B->C->D->E->Root ==> end.

By this, the core loop can be much cleaner.
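
For illustration only, here is a small userspace toy (not kernel code) that
reproduces the counts above for every possible starting point; with the
fixed rule, every scan visits exactly 6 groups:

#include <stdio.h>

/*
 * Toy model: 6 groups in round-robin order Root(0), A(1), ..., E(5).
 * Count how many groups the old rule ("stop when Root is seen twice")
 * scans, depending on the saved starting point.
 */
static int old_rule_scans(int start)
{
	int pos = start, scanned = 0, root_visits = 0;

	while (1) {
		if (pos == 0 && ++root_visits == 2)
			break;
		scanned++;
		pos = (pos + 1) % 6;
	}
	return scanned;
}

int main(void)
{
	int start;

	for (start = 0; start < 6; start++)
		printf("start=%d: old rule scans %d groups\n",
		       start, old_rule_scans(start));
	return 0;
}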

This patch also moves drain_all_stock_async() out of the loop. Then,
it will be called once when a memcg hits its limit.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 85 +++++++++++++++++++++++++++-----------------------------
1 file changed, 42 insertions(+), 43 deletions(-)

Index: mmotm-0615/mm/memcontrol.c
===================================================================
--- mmotm-0615.orig/mm/memcontrol.c
+++ mmotm-0615/mm/memcontrol.c
@@ -1641,8 +1641,8 @@ int mem_cgroup_select_victim_node(struct
*
* root_mem is the original ancestor that we've been reclaim from.
*
- * We give up and return to the caller when we visit root_mem twice.
- * (other groups can be removed while we're walking....)
+ * We give up and return to the caller when we visit enough memcgs.
+ * (Typically, we visit the whole memcg tree)
*
* If shrink==true, for avoiding to free too much, this returns immedieately.
*/
@@ -1660,6 +1660,7 @@ static int mem_cgroup_hierarchical_recla
bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
unsigned long excess;
unsigned long nr_scanned;
+ int visit;

excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;

@@ -1667,41 +1668,28 @@ static int mem_cgroup_hierarchical_recla
if (!check_soft && root_mem->memsw_is_minimum)
noswap = true;

- while (1) {
+again:
+ if (!shrink) {
+ visit = 0;
+ for_each_mem_cgroup_tree(victim, root_mem)
+ visit++;
+ } else {
+ /*
+ * At shrinking, we check the usage again in caller side.
+ * so, visit children one by one.
+ */
+ visit = 1;
+ }
+ /*
+ * We are not draining per cpu cached charges during soft limit reclaim
+ * because global reclaim doesn't care about charges. It tries to free
+ * some memory and charges will not give any.
+ */
+ if (!check_soft)
+ drain_all_stock_async(root_mem);
+
+ while (visit--) {
victim = mem_cgroup_select_victim(root_mem);
- if (victim == root_mem) {
- loop++;
- /*
- * We are not draining per cpu cached charges during
- * soft limit reclaim because global reclaim doesn't
- * care about charges. It tries to free some memory and
- * charges will not give any.
- */
- if (!check_soft && loop >= 1)
- drain_all_stock_async(root_mem);
- if (loop >= 2) {
- /*
- * If we have not been able to reclaim
- * anything, it might because there are
- * no reclaimable pages under this hierarchy
- */
- if (!check_soft || !total) {
- css_put(&victim->css);
- break;
- }
- /*
- * We want to do more targeted reclaim.
- * excess >> 2 is not to excessive so as to
- * reclaim too much, nor too less that we keep
- * coming back to reclaim from this cgroup
- */
- if (total >= (excess >> 2) ||
- (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) {
- css_put(&victim->css);
- break;
- }
- }
- }
if (!mem_cgroup_local_usage(victim)) {
/* this cgroup's local usage == 0 */
css_put(&victim->css);
@@ -1717,13 +1705,7 @@ static int mem_cgroup_hierarchical_recla
ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
noswap, get_swappiness(victim));
css_put(&victim->css);
- /*
- * At shrinking usage, we can't check we should stop here or
- * reclaim more. It's depends on callers. last_scanned_child
- * will work enough for keeping fairness under tree.
- */
- if (shrink)
- return ret;
+
total += ret;
if (check_soft) {
if (!res_counter_soft_limit_excess(&root_mem->res))
@@ -1731,6 +1713,23 @@ static int mem_cgroup_hierarchical_recla
} else if (mem_cgroup_margin(root_mem))
return total;
}
+ /*
+ * Basically, softlimit reclaim does deep scan for targeted reclaim. But
+ * if we have not been able to reclaim anything, it might because
+ * there are no reclaimable pages under this hierarchy. So, we don't
+ * retry if total == 0.
+ */
+ if (check_soft && total) {
+ /*
+ * We want to do more targeted reclaim. excess >> 2 is not to
+ * excessive so as to reclaim too much, nor too less that we
+ * keep coming back to reclaim from this cgroup
+ */
+ if (total < (excess >> 2) &&
+ (loop <= MEM_CGROUP_MAX_RECLAIM_LOOPS))
+ goto again;
+ }
+
return total;
}

2011-06-16 03:59:22

by Kamezawa Hiroyuki

Subject: [PATCH 2/7] export memory cgroup's swappiness by mem_cgroup_swappiness()

From 6f9c40172947fb92ab0ea6f7d73d577473879636 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Wed, 15 Jun 2011 12:06:31 +0900
Subject: [PATCH 2/7] export memory cgroup's swappiness by mem_cgroup_swappiness()

Each memory cgroup has a 'swappiness' value which can be accessed by
get_swappiness(memcg). The major user is try_to_free_mem_cgroup_pages(),
and the swappiness is passed as an argument.

It's now a static function, but some planned updates will need to get
the swappiness from files other than memcontrol.c.
This patch exports get_swappiness() as mem_cgroup_swappiness().
With this, we can remove the swappiness argument from
try_to_free_mem_cgroup_pages().

I think this makes sense because the passed swappiness always comes from
the memory cgroup passed as an argument, and this duplication of
arguments is not very good.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/memcontrol.h | 1 +
include/linux/swap.h | 4 +---
mm/memcontrol.c | 14 ++++++--------
mm/vmscan.c | 8 +++-----
4 files changed, 11 insertions(+), 16 deletions(-)

Index: mmotm-0615/include/linux/memcontrol.h
===================================================================
--- mmotm-0615.orig/include/linux/memcontrol.h
+++ mmotm-0615/include/linux/memcontrol.h
@@ -110,6 +110,7 @@ extern void mem_cgroup_end_migration(str
int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
+unsigned int mem_cgroup_swappiness(struct mem_cgroup *memcg);
unsigned long mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg,
struct zone *zone,
enum lru_list lru);
Index: mmotm-0615/include/linux/swap.h
===================================================================
--- mmotm-0615.orig/include/linux/swap.h
+++ mmotm-0615/include/linux/swap.h
@@ -254,11 +254,9 @@ static inline void lru_cache_add_file(st
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
- gfp_t gfp_mask, bool noswap,
- unsigned int swappiness);
+ gfp_t gfp_mask, bool noswap);
extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
- unsigned int swappiness,
struct zone *zone,
unsigned long *nr_scanned);
extern int __isolate_lru_page(struct page *page, int mode, int file);
Index: mmotm-0615/mm/memcontrol.c
===================================================================
--- mmotm-0615.orig/mm/memcontrol.c
+++ mmotm-0615/mm/memcontrol.c
@@ -1325,7 +1325,7 @@ static unsigned long mem_cgroup_margin(s
return margin >> PAGE_SHIFT;
}

-static unsigned int get_swappiness(struct mem_cgroup *memcg)
+unsigned int mem_cgroup_swappiness(struct mem_cgroup *memcg)
{
struct cgroup *cgrp = memcg->css.cgroup;

@@ -1698,12 +1698,11 @@ again:
/* we use swappiness of local cgroup */
if (check_soft) {
ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
- noswap, get_swappiness(victim), zone,
- &nr_scanned);
+ noswap, zone, &nr_scanned);
*total_scanned += nr_scanned;
} else
ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
- noswap, get_swappiness(victim));
+ noswap);
css_put(&victim->css);

total += ret;
@@ -3758,8 +3757,7 @@ try_to_free:
ret = -EINTR;
goto out;
}
- progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
- false, get_swappiness(mem));
+ progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL, false);
if (!progress) {
nr_retries--;
/* maybe some writeback is necessary */
@@ -4221,7 +4219,7 @@ static u64 mem_cgroup_swappiness_read(st
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);

- return get_swappiness(memcg);
+ return mem_cgroup_swappiness(memcg);
}

static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
@@ -4930,7 +4928,7 @@ mem_cgroup_create(struct cgroup_subsys *
INIT_LIST_HEAD(&mem->oom_notify);

if (parent)
- mem->swappiness = get_swappiness(parent);
+ mem->swappiness = mem_cgroup_swappiness(parent);
atomic_set(&mem->refcnt, 1);
mem->move_charge_at_immigrate = 0;
mutex_init(&mem->thresholds_lock);
Index: mmotm-0615/mm/vmscan.c
===================================================================
--- mmotm-0615.orig/mm/vmscan.c
+++ mmotm-0615/mm/vmscan.c
@@ -2200,7 +2200,6 @@ unsigned long try_to_free_pages(struct z

unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
- unsigned int swappiness,
struct zone *zone,
unsigned long *nr_scanned)
{
@@ -2210,7 +2209,7 @@ unsigned long mem_cgroup_shrink_node_zon
.may_writepage = !laptop_mode,
.may_unmap = 1,
.may_swap = !noswap,
- .swappiness = swappiness,
+ .swappiness = mem_cgroup_swappiness(mem),
.order = 0,
.mem_cgroup = mem,
};
@@ -2239,8 +2238,7 @@ unsigned long mem_cgroup_shrink_node_zon

unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
gfp_t gfp_mask,
- bool noswap,
- unsigned int swappiness)
+ bool noswap)
{
struct zonelist *zonelist;
unsigned long nr_reclaimed;
@@ -2250,7 +2248,7 @@ unsigned long try_to_free_mem_cgroup_pag
.may_unmap = 1,
.may_swap = !noswap,
.nr_to_reclaim = SWAP_CLUSTER_MAX,
- .swappiness = swappiness,
+ .swappiness = mem_cgroup_swappiness(mem_cont),
.order = 0,
.mem_cgroup = mem_cont,
.nodemask = NULL, /* we don't care the placement */

2011-06-16 04:00:18

by Kamezawa Hiroyuki

Subject: [PATCH 3/7] memcg: add memory.scan_stat

From e08990dd9ada13cf236bec1ef44b207436434b8e Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Wed, 15 Jun 2011 14:11:01 +0900
Subject: [PATCH 3/7] memcg: add memory.scan_stat

The commit log of commit 0ae5e89 "memcg: count the soft_limit reclaim in..."
says it adds scanning stats to the memory.stat file. But it doesn't, because
we considered that we needed to reach a consensus for such new APIs.

This patch is a trial to add memory.scan_stat. This shows
- the number of scanned pages
- the number of reclaimed pages
- the elapsed time (including sleep/pause time)
for both direct/soft reclaim and the shrinking caused by changing the limit
or force_empty.

The biggest difference from Ying's original one is that this file
can be reset by a write, as

 # echo 0 > ...../memory.scan_stat

[kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
scanned_pages_by_limit 358470
freed_pages_by_limit 180795
elapsed_ns_by_limit 21629927
scanned_pages_by_system 0
freed_pages_by_system 0
elapsed_ns_by_system 0
scanned_pages_by_shrink 76646
freed_pages_by_shrink 38355
elappsed_ns_by_shrink 31990670
total_scanned_pages_by_limit 358470
total_freed_pages_by_limit 180795
total_elapsed_ns_by_hierarchical 216299275
total_scanned_pages_by_system 0
total_freed_pages_by_system 0
total_elapsed_ns_by_system 0
total_scanned_pages_by_shrink 76646
total_freed_pages_by_shrink 38355
total_elapsed_ns_by_shrink 31990670

total_xxxx is for hierarchy management.

This will be useful for further memcg development and needs to be
developed before we do some complicated rework of LRU/softlimit
management.

For now, scan/free/elapsed_by_system is incomplete, but future work by
Johannes et al. will fill in the remaining information, and then we can
look into problems of isolation between memcgs.
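
The per-context bookkeeping in this patch uses one base index per context
(limit/system/shrink) plus fixed offsets for the freed/elapsed fields. A
standalone sketch of that indexing scheme, for illustration only
(simplified, no locking):

#include <stdio.h>

enum {
	SCAN_BY_LIMIT,		/* base index of the "by_limit" context */
	FREED_BY_LIMIT,
	ELAPSED_BY_LIMIT,

	SCAN_BY_SYSTEM,
	FREED_BY_SYSTEM,
	ELAPSED_BY_SYSTEM,

	SCAN_BY_SHRINK,
	FREED_BY_SHRINK,
	ELAPSED_BY_SHRINK,
	NR_SCANSTATS,
};
#define __FREED		(1)	/* offset from the context's base index */
#define __ELAPSED	(2)

static unsigned long stats[NR_SCANSTATS];

/* record one reclaim pass; context is SCAN_BY_LIMIT/SYSTEM/SHRINK */
static void record_scanstat(int context, unsigned long nr_scanned,
			    unsigned long nr_freed, unsigned long elapsed_ns)
{
	stats[context] += nr_scanned;
	stats[context + __FREED] += nr_freed;
	stats[context + __ELAPSED] += elapsed_ns;
}

int main(void)
{
	record_scanstat(SCAN_BY_LIMIT, 100, 40, 5000);
	printf("scanned_pages_by_limit %lu\n", stats[SCAN_BY_LIMIT]);
	printf("freed_pages_by_limit %lu\n", stats[FREED_BY_LIMIT]);
	printf("elapsed_ns_by_limit %lu\n", stats[ELAPSED_BY_LIMIT]);
	return 0;
}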

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Documentation/cgroups/memory.txt | 33 +++++++++
include/linux/memcontrol.h | 16 ++++
include/linux/swap.h | 6 -
mm/memcontrol.c | 135 +++++++++++++++++++++++++++++++++++++--
mm/vmscan.c | 27 ++++++-
5 files changed, 199 insertions(+), 18 deletions(-)

Index: mmotm-0615/Documentation/cgroups/memory.txt
===================================================================
--- mmotm-0615.orig/Documentation/cgroups/memory.txt
+++ mmotm-0615/Documentation/cgroups/memory.txt
@@ -380,7 +380,7 @@ will be charged as a new owner of it.

5.2 stat file

-memory.stat file includes following statistics
+5.2.1 memory.stat file includes following statistics

# per-memory cgroup local status
cache - # of bytes of page cache memory.
@@ -438,6 +438,37 @@ Note:
file_mapped is accounted only when the memory cgroup is owner of page
cache.)

+5.2.2 memory.scan_stat
+
+memory.scan_stat includes statistics information for memory scanning and
+freeing, reclaiming. The statistics shows memory scanning information since
+memory cgroup creation and can be reset to 0 by writing 0 as
+
+ #echo 0 > ../memory.scan_stat
+
+This file contains following statistics.
+
+scanned_pages_by_limit - # of scanned pages at hitting limit.
+freed_pages_by_limit - # of freed pages at hitting limit.
+elapsed_ns_by_limit - nano sec of elappsed time at LRU scan at
+ hitting limit.(this includes sleep time.)
+
+scanned_pages_by_system - # of scanned pages by the kernel.
+ (Now, this value means global memory reclaim
+ caused by system memory shortage with a hint
+ of softlimit. "No soft limit" case will be
+ supported in future.)
+freed_pages_by_system - # of freed pages by the kernel.
+elapsed_ns_by_system - nano sec of elappsed time by kernel.
+
+scanned_pages_by_shrink - # of scanned pages by shrinking.
+ (i.e. changes of limit, force_empty, etc.)
+freed_pages_by_shrink - # of freed pages by shirkining.
+elappsed_ns_by_shrink - nano sec of elappsed time at shrinking.
+
+total_xxx includes the statistics of children scanning caused by the cgroup.
+
+
5.3 swappiness

Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
Index: mmotm-0615/include/linux/memcontrol.h
===================================================================
--- mmotm-0615.orig/include/linux/memcontrol.h
+++ mmotm-0615/include/linux/memcontrol.h
@@ -120,6 +120,22 @@ struct zone_reclaim_stat*
mem_cgroup_get_reclaim_stat_from_page(struct page *page);
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);
+struct memcg_scanrecord {
+ struct mem_cgroup *mem; /* scanend memory cgroup */
+ struct mem_cgroup *root; /* scan target hierarchy root */
+ int context; /* scanning context (see memcontrol.c) */
+ unsigned long nr_scanned; /* the number of scanned pages */
+ unsigned long nr_freed; /* the number of freed pages */
+ unsigned long elappsed; /* nsec of time elapsed while scanning */
+};
+
+extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
+ gfp_t gfp_mask, bool noswap,
+ struct memcg_scanrecord *rec);
+extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
+ gfp_t gfp_mask, bool noswap,
+ struct zone *zone,
+ struct memcg_scanrecord *rec);

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
Index: mmotm-0615/include/linux/swap.h
===================================================================
--- mmotm-0615.orig/include/linux/swap.h
+++ mmotm-0615/include/linux/swap.h
@@ -253,12 +253,6 @@ static inline void lru_cache_add_file(st
/* linux/mm/vmscan.c */
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
-extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
- gfp_t gfp_mask, bool noswap);
-extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
- gfp_t gfp_mask, bool noswap,
- struct zone *zone,
- unsigned long *nr_scanned);
extern int __isolate_lru_page(struct page *page, int mode, int file);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
Index: mmotm-0615/mm/memcontrol.c
===================================================================
--- mmotm-0615.orig/mm/memcontrol.c
+++ mmotm-0615/mm/memcontrol.c
@@ -203,6 +203,57 @@ struct mem_cgroup_eventfd_list {
static void mem_cgroup_threshold(struct mem_cgroup *mem);
static void mem_cgroup_oom_notify(struct mem_cgroup *mem);

+enum {
+ SCAN_BY_LIMIT,
+ FREED_BY_LIMIT,
+ ELAPSED_BY_LIMIT,
+
+ SCAN_BY_SYSTEM,
+ FREED_BY_SYSTEM,
+ ELAPSED_BY_SYSTEM,
+
+ SCAN_BY_SHRINK,
+ FREED_BY_SHRINK,
+ ELAPSED_BY_SHRINK,
+ NR_SCANSTATS,
+};
+#define __FREED (1)
+#define __ELAPSED (2)
+
+struct scanstat {
+ spinlock_t lock;
+ unsigned long stats[NR_SCANSTATS]; /* local statistics */
+ unsigned long totalstats[NR_SCANSTATS]; /* hierarchical */
+};
+
+const char *scanstat_string[NR_SCANSTATS] = {
+ "scanned_pages_by_limit",
+ "freed_pages_by_limit",
+ "elapsed_ns_by_limit",
+
+ "scanned_pages_by_system",
+ "freed_pages_by_system",
+ "elapsed_ns_by_system",
+
+ "scanned_pages_by_shrink",
+ "freed_pages_by_shrink",
+ "elappsed_ns_by_shrink",
+};
+
+const char *total_scanstat_string[NR_SCANSTATS] = {
+ "total_scanned_pages_by_limit",
+ "total_freed_pages_by_limit",
+ "total_elapsed_ns_by_hierarchical",
+
+ "total_scanned_pages_by_system",
+ "total_freed_pages_by_system",
+ "total_elapsed_ns_by_system",
+
+ "total_scanned_pages_by_shrink",
+ "total_freed_pages_by_shrink",
+ "total_elapsed_ns_by_shrink",
+};
+
/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
@@ -264,7 +315,8 @@ struct mem_cgroup {

/* For oom notifier event fd */
struct list_head oom_notify;
-
+ /* For recording LRU-scan statistics */
+ struct scanstat scanstat;
/*
* Should we move charges of a task when a task is moved into this
* mem_cgroup ? And what type of charges should we move ?
@@ -1634,6 +1686,28 @@ int mem_cgroup_select_victim_node(struct
}
#endif

+
+
+static void mem_cgroup_record_scanstat(struct memcg_scanrecord *rec)
+{
+ struct mem_cgroup *mem;
+ int context = rec->context;
+
+ mem = rec->mem;
+ spin_lock(&mem->scanstat.lock);
+ mem->scanstat.stats[context] += rec->nr_scanned;
+ mem->scanstat.stats[context + __FREED] += rec->nr_freed;
+ mem->scanstat.stats[context + __ELAPSED] += rec->elappsed;
+ spin_unlock(&mem->scanstat.lock);
+
+ mem = rec->root;
+ spin_lock(&mem->scanstat.lock);
+ mem->scanstat.totalstats[context] += rec->nr_scanned;
+ mem->scanstat.totalstats[context + __FREED] += rec->nr_freed;
+ mem->scanstat.totalstats[context + __ELAPSED] += rec->elappsed;
+ spin_unlock(&mem->scanstat.lock);
+}
+
/*
* Scan the hierarchy if needed to reclaim memory. We remember the last child
* we reclaimed from, so that we don't end up penalizing one child extensively
@@ -1659,8 +1733,8 @@ static int mem_cgroup_hierarchical_recla
bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
unsigned long excess;
- unsigned long nr_scanned;
int visit;
+ struct memcg_scanrecord rec;

excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;

@@ -1668,6 +1742,15 @@ static int mem_cgroup_hierarchical_recla
if (!check_soft && root_mem->memsw_is_minimum)
noswap = true;

+ if (shrink)
+ rec.context = SCAN_BY_SHRINK;
+ else if (check_soft)
+ rec.context = SCAN_BY_SYSTEM;
+ else
+ rec.context = SCAN_BY_LIMIT;
+
+ rec.root = root_mem;
+
again:
if (!shrink) {
visit = 0;
@@ -1695,14 +1778,19 @@ again:
css_put(&victim->css);
continue;
}
+ rec.mem = victim;
+ rec.nr_scanned = 0;
+ rec.nr_freed = 0;
+ rec.elappsed = 0;
/* we use swappiness of local cgroup */
if (check_soft) {
ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
- noswap, zone, &nr_scanned);
- *total_scanned += nr_scanned;
+ noswap, zone, &rec);
+ *total_scanned += rec.nr_scanned;
} else
ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
- noswap);
+ noswap, &rec);
+ mem_cgroup_record_scanstat(&rec);
css_put(&victim->css);

total += ret;
@@ -3757,7 +3845,8 @@ try_to_free:
ret = -EINTR;
goto out;
}
- progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL, false);
+ progress = try_to_free_mem_cgroup_pages(mem,
+ GFP_KERNEL, false, NULL);
if (!progress) {
nr_retries--;
/* maybe some writeback is necessary */
@@ -4599,6 +4688,34 @@ static int mem_control_numa_stat_open(st
}
#endif /* CONFIG_NUMA */

+static int mem_cgroup_scan_stat_read(struct cgroup *cgrp,
+ struct cftype *cft,
+ struct cgroup_map_cb *cb)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+ int i;
+
+ for (i = 0; i < NR_SCANSTATS; i++)
+ cb->fill(cb, scanstat_string[i], mem->scanstat.stats[i]);
+ for (i = 0; i < NR_SCANSTATS; i++)
+ cb->fill(cb, total_scanstat_string[i],
+ mem->scanstat.totalstats[i]);
+ return 0;
+}
+
+static int mem_cgroup_reset_scan_stat(struct cgroup *cgrp,
+ unsigned int event)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+
+ spin_lock(&mem->scanstat.lock);
+ memset(&mem->scanstat.stats, 0, sizeof(mem->scanstat.stats));
+ memset(&mem->scanstat.totalstats, 0, sizeof(mem->scanstat.totalstats));
+ spin_unlock(&mem->scanstat.lock);
+ return 0;
+}
+
+
static struct cftype mem_cgroup_files[] = {
{
.name = "usage_in_bytes",
@@ -4669,6 +4786,11 @@ static struct cftype mem_cgroup_files[]
.mode = S_IRUGO,
},
#endif
+ {
+ .name = "scan_stat",
+ .read_map = mem_cgroup_scan_stat_read,
+ .trigger = mem_cgroup_reset_scan_stat,
+ },
};

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
@@ -4932,6 +5054,7 @@ mem_cgroup_create(struct cgroup_subsys *
atomic_set(&mem->refcnt, 1);
mem->move_charge_at_immigrate = 0;
mutex_init(&mem->thresholds_lock);
+ spin_lock_init(&mem->scanstat.lock);
return &mem->css;
free_out:
__mem_cgroup_free(mem);
Index: mmotm-0615/mm/vmscan.c
===================================================================
--- mmotm-0615.orig/mm/vmscan.c
+++ mmotm-0615/mm/vmscan.c
@@ -2199,9 +2199,9 @@ unsigned long try_to_free_pages(struct z
#ifdef CONFIG_CGROUP_MEM_RES_CTLR

unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
- gfp_t gfp_mask, bool noswap,
- struct zone *zone,
- unsigned long *nr_scanned)
+ gfp_t gfp_mask, bool noswap,
+ struct zone *zone,
+ struct memcg_scanrecord *rec)
{
struct scan_control sc = {
.nr_scanned = 0,
@@ -2213,6 +2213,7 @@ unsigned long mem_cgroup_shrink_node_zon
.order = 0,
.mem_cgroup = mem,
};
+ unsigned long start, end;

sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
@@ -2221,6 +2222,7 @@ unsigned long mem_cgroup_shrink_node_zon
sc.may_writepage,
sc.gfp_mask);

+ start = sched_clock();
/*
* NOTE: Although we can get the priority field, using it
* here is not a good idea, since it limits the pages we can scan.
@@ -2229,19 +2231,27 @@ unsigned long mem_cgroup_shrink_node_zon
* the priority and make it zero.
*/
shrink_zone(0, zone, &sc);
+ end = sched_clock();
+
+ if (rec) {
+ rec->nr_scanned += sc.nr_scanned;
+ rec->nr_freed += sc.nr_reclaimed;
+ rec->elappsed += end - start;
+ }

trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);

- *nr_scanned = sc.nr_scanned;
return sc.nr_reclaimed;
}

unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
gfp_t gfp_mask,
- bool noswap)
+ bool noswap,
+ struct memcg_scanrecord *rec)
{
struct zonelist *zonelist;
unsigned long nr_reclaimed;
+ unsigned long start, end;
int nid;
struct scan_control sc = {
.may_writepage = !laptop_mode,
@@ -2259,6 +2269,7 @@ unsigned long try_to_free_mem_cgroup_pag
.gfp_mask = sc.gfp_mask,
};

+ start = sched_clock();
/*
* Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't
* take care of from where we get pages. So the node where we start the
@@ -2273,6 +2284,12 @@ unsigned long try_to_free_mem_cgroup_pag
sc.gfp_mask);

nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink);
+ end = sched_clock();
+ if (rec) {
+ rec->nr_scanned += sc.nr_scanned;
+ rec->nr_freed += sc.nr_reclaimed;
+ rec->elappsed += end - start;
+ }

trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);

2011-06-16 04:01:05

by Kamezawa Hiroyuki

Subject: [PATCH 4/7] memcg: update numa information based on event counter

From 88090fe10e225ad8769ba0ea01692b7314e8b973 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Wed, 15 Jun 2011 16:19:46 +0900
Subject: [PATCH 4/7] memcg: update numa information based on event counter

Commit 889976 adds a NUMA node round-robin for memcg. But the information
is updated only once per 10 seconds.

This patch changes the update trigger from jiffies to memcg's event count.
After this patch, the NUMA scan information will be updated when

- the number of pagein/out events is larger than 3% of the limit
or
- the number of pagein/out events is larger than 16k
  (== 64MB of pagein/pageout if pagesize == 4k.)

The counter mem->numascan_update is the sum of the per-cpu event counters.
When a task hits the limit, it checks mem->numascan_update. If it's over
min(3% of limit, 16k), the NUMA information will be updated.
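
A minimal sketch of the threshold arithmetic, for illustration only
(userspace, units are pages and events):

#include <stdio.h>

#define NUMASCAN_UPDATE_RATIO	(3)		/* 3% of the limit */
#define NUMASCAN_UPDATE_THRESH	(16384UL)	/* 16k pagein/pageout events */

/* returns non-zero when the numa scan information should be refreshed */
static int should_update(unsigned long long limit_pages, unsigned long events)
{
	unsigned long long thresh;

	thresh = limit_pages * NUMASCAN_UPDATE_RATIO / 100;
	if (thresh > NUMASCAN_UPDATE_THRESH)
		thresh = NUMASCAN_UPDATE_THRESH;	/* min(3% of limit, 16k) */
	return events >= thresh;
}

int main(void)
{
	/* a 300MB limit with 4k pages is 76800 pages, so 3% is 2304 events */
	printf("%d\n", should_update(76800, 2500));	/* 1: refresh the nodemask */
	printf("%d\n", should_update(76800, 1000));	/* 0: keep the old one */
	return 0;
}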

This patch also adds a mutex for updating the information. This allows us
to avoid unnecessary scans.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 51 +++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 45 insertions(+), 6 deletions(-)

Index: mmotm-0615/mm/memcontrol.c
===================================================================
--- mmotm-0615.orig/mm/memcontrol.c
+++ mmotm-0615/mm/memcontrol.c
@@ -108,10 +108,12 @@ enum mem_cgroup_events_index {
enum mem_cgroup_events_target {
MEM_CGROUP_TARGET_THRESH,
MEM_CGROUP_TARGET_SOFTLIMIT,
+ MEM_CGROUP_TARGET_NUMASCAN,
MEM_CGROUP_NTARGETS,
};
#define THRESHOLDS_EVENTS_TARGET (128)
#define SOFTLIMIT_EVENTS_TARGET (1024)
+#define NUMASCAN_EVENTS_TARGET (1024)

struct mem_cgroup_stat_cpu {
long count[MEM_CGROUP_STAT_NSTATS];
@@ -288,8 +290,9 @@ struct mem_cgroup {
int last_scanned_node;
#if MAX_NUMNODES > 1
nodemask_t scan_nodes;
- unsigned long next_scan_node_update;
+ struct mutex numascan_mutex;
#endif
+ atomic_t numascan_update;
/*
* Should the accounting and control be hierarchical, per subtree?
*/
@@ -741,6 +744,9 @@ static void __mem_cgroup_target_update(s
case MEM_CGROUP_TARGET_SOFTLIMIT:
next = val + SOFTLIMIT_EVENTS_TARGET;
break;
+ case MEM_CGROUP_TARGET_NUMASCAN:
+ next = val + NUMASCAN_EVENTS_TARGET;
+ break;
default:
return;
}
@@ -764,6 +770,13 @@ static void memcg_check_events(struct me
__mem_cgroup_target_update(mem,
MEM_CGROUP_TARGET_SOFTLIMIT);
}
+ if (unlikely(__memcg_event_check(mem,
+ MEM_CGROUP_TARGET_NUMASCAN))) {
+ atomic_add(MEM_CGROUP_TARGET_NUMASCAN,
+ &mem->numascan_update);
+ __mem_cgroup_target_update(mem,
+ MEM_CGROUP_TARGET_NUMASCAN);
+ }
}
}

@@ -1616,17 +1629,32 @@ mem_cgroup_select_victim(struct mem_cgro
/*
* Always updating the nodemask is not very good - even if we have an empty
* list or the wrong list here, we can start from some node and traverse all
- * nodes based on the zonelist. So update the list loosely once per 10 secs.
+ * nodes based on the zonelist.
*
+ * The counter of mem->numascan_update is updated once per
+ * NUMASCAN_EVENTS_TARGET. We update the numa information when we see
+ * the number of event is larger than 3% of limit or 64MB pagein/pageout.
*/
+#define NUMASCAN_UPDATE_RATIO (3)
+#define NUMASCAN_UPDATE_THRESH (16384UL) /* 16k events of pagein/pageout */
static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem)
{
int nid;
-
- if (time_after(mem->next_scan_node_update, jiffies))
+ unsigned long long limit;
+ /* if no limit, we never reach here */
+ limit = res_counter_read_u64(&mem->res, RES_LIMIT);
+ limit /= PAGE_SIZE;
+ /* 3% of limit */
+ limit = (limit * NUMASCAN_UPDATE_RATIO/100UL);
+ limit = min_t(unsigned long long, limit, NUMASCAN_UPDATE_THRESH);
+ /*
+ * If the number of pagein/out event is larger than 3% of limit or
+ * 64MB pagein/out, refresh numa information.
+ */
+ if (atomic_read(&mem->numascan_update) < limit ||
+ !mutex_trylock(&mem->numascan_mutex))
return;
-
- mem->next_scan_node_update = jiffies + 10*HZ;
+ atomic_set(&mem->numascan_update, 0);
/* make a nodemask where this memcg uses memory from */
mem->scan_nodes = node_states[N_HIGH_MEMORY];

@@ -1642,6 +1670,7 @@ static void mem_cgroup_may_update_nodema
continue;
node_clear(nid, mem->scan_nodes);
}
+ mutex_unlock(&mem->numascan_mutex);
}

/*
@@ -1679,11 +1708,20 @@ int mem_cgroup_select_victim_node(struct
return node;
}

+static void mem_cgroup_numascan_init(struct mem_cgroup *mem)
+{
+ mutex_init(&mem->numascan_mutex);
+}
+
#else
int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
{
return 0;
}
+static void mem_cgroup_numascan_init(struct mem_cgroup *mem)
+{
+ return 0;
+}
#endif


@@ -5054,6 +5092,7 @@ mem_cgroup_create(struct cgroup_subsys *
atomic_set(&mem->refcnt, 1);
mem->move_charge_at_immigrate = 0;
mutex_init(&mem->thresholds_lock);
+ mem_cgroup_numascan_init(mem);
spin_lock_init(&mem->scanstat.lock);
return &mem->css;
free_out:

2011-06-16 04:01:44

by Kamezawa Hiroyuki

Subject: [PATCH 5/7] Fix imprecise check in mem_cgroup_local_usage()

From fcfc6ee9847b0b2571cd6e9847572d7c70e1e2b2 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Thu, 16 Jun 2011 09:23:54 +0900
Subject: [PATCH 5/7] Fix imprecise check in mem_cgroup_local_usage()

Now, mem_cgroup_local_usage(memcg) is used as a hint for scanning the memory
cgroup hierarchy. If it returns true, the memcg has some reclaimable memory.

But this function doesn't take care of
- unevictable pages
- anon pages on a swapless system.

This patch fixes the function to use LRU information.
For NUMA, to avoid scanning, the NUMA scan bitmap is used. If it's
empty, a more precise check is done.
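
As a rough sketch of the decision, for illustration only (the real code
reads the per-memcg LRU counters and, on NUMA, first consults
mem->scan_nodes):

#include <stdbool.h>
#include <stdio.h>

/* stand-ins for the per-memcg LRU counters the patch reads */
struct lru_stat {
	unsigned long inactive_file, active_file;
	unsigned long inactive_anon, active_anon;
};

static bool reclaimable(const struct lru_stat *lru, bool noswap,
			unsigned long total_swap_pages)
{
	if (lru->inactive_file || lru->active_file)
		return true;		/* file pages can be reclaimed */
	if (noswap || !total_swap_pages)
		return false;		/* anon pages are pinned without swap */
	return lru->inactive_anon || lru->active_anon;
}

int main(void)
{
	struct lru_stat only_anon = { .inactive_anon = 100 };

	printf("%d\n", reclaimable(&only_anon, false, 0));	/* 0: swapless */
	printf("%d\n", reclaimable(&only_anon, false, 1000));	/* 1 */
	return 0;
}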

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 43 +++++++++++++++++++++++++++++++++----------
1 files changed, 33 insertions(+), 10 deletions(-)

Index: mmotm-0615/mm/memcontrol.c
===================================================================
--- mmotm-0615.orig/mm/memcontrol.c
+++ mmotm-0615/mm/memcontrol.c
@@ -632,15 +632,6 @@ static long mem_cgroup_read_stat(struct
return val;
}

-static long mem_cgroup_local_usage(struct mem_cgroup *mem)
-{
- long ret;
-
- ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
- ret += mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);
- return ret;
-}
-
static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
bool charge)
{
@@ -1713,6 +1704,23 @@ static void mem_cgroup_numascan_init(str
mutex_init(&mem->numascan_mutex);
}

+static bool mem_cgroup_reclaimable(struct mem_cgroup *mem, bool noswap)
+{
+ if (!nodes_empty(mem->scan_nodes))
+ return true;
+ /* slow path */
+ if (mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_FILE))
+ return true;
+ if (mem_cgroup_get_local_zonestat(mem, LRU_ACTIVE_FILE))
+ return true;
+ if (noswap || !total_swap_pages)
+ return false;
+ if (mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON))
+ return true;
+ if (mem_cgroup_get_local_zonestat(mem, LRU_ACTIVE_ANON))
+ return true;
+ return false;
+}
#else
int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
{
@@ -1722,6 +1730,21 @@ static void mem_cgroup_numascan_init(str
{
return 0;
}
+
+static bool mem_cgroup_reclaimable(struct mem_cgroup *mem, bool noswap)
+{
+ if (mem_cgroup_get_zonestat_node(mem, 0, LRU_INACTIVE_FILE))
+ return true;
+ if (mem_cgroup_get_zonestat_node(mem, 0, LRU_ACTIVE_FILE))
+ return true;
+ if (noswap || !total_swap_pages)
+ return false;
+ if (mem_cgroup_get_zonestat_node(mem, 0, LRU_INACTIVE_ANON))
+ return true;
+ if (mem_cgroup_get_zonestat_node(mem, 0, LRU_ACTIVE_ANON))
+ return true;
+ return false;
+}
#endif


@@ -1811,7 +1834,7 @@ again:

while (visit--) {
victim = mem_cgroup_select_victim(root_mem);
- if (!mem_cgroup_local_usage(victim)) {
+ if (!mem_cgroup_reclaimable(victim, noswap)) {
/* this cgroup's local usage == 0 */
css_put(&victim->css);
continue;

2011-06-16 04:03:34

by Kamezawa Hiroyuki

Subject: [PATCH 6/7] memcg: calc NUMA node's weight for scan.

From fb8aaa2c5f7fd99dfcb5d2ecb3c1226a58caafea Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Thu, 16 Jun 2011 10:05:46 +0900
Subject: [PATCH 6/7] memcg: calc NUMA node's weight for scan.

Now, by commit 889976, the NUMA node scan of memcg is round-robin.
As the commit log says, "a better algorithm is needed".

For implementing good scheduling, one of the required things is
defining the importance of each node for LRU scanning.

This patch defines each node's weight for scan as

swappiness = (memcg's swappiness)? memcg's swappiness : 1
FILE = inactive_file + (inactive_file_is_low)? active_file : 0
ANON = inactive_anon + (inactive_anon_is_low)? active_anon : 0

weight = (FILE * (200-swappiness) + ANON * swappiness)/200.
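
As a worked sketch of the formula, for illustration only (userspace
arithmetic with made-up counter values; FILE and ANON are the sums
defined above):

#include <stdio.h>

/* weight = (FILE * (200 - swappiness) + ANON * swappiness) / 200 */
static unsigned long node_weight(unsigned long file, unsigned long anon,
				 unsigned int swappiness, int no_swap_room)
{
	if (!swappiness)
		swappiness = 1;	/* avoid ignoring anon completely */
	if (no_swap_room)
		return file;	/* swapless or memsw full: only file matters */
	return (file * (200 - swappiness) + anon * swappiness) / 200;
}

int main(void)
{
	/* e.g. 60000 file pages, 20000 anon pages, default swappiness 60 */
	printf("weight = %lu\n", node_weight(60000, 20000, 60, 0));
	return 0;
}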

Note: After we have dirty page accounting per memcg, we can make use of
dirty page information. (A very dirty node should be skipped...)

The following patch will implement scheduling using this weight.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 50 insertions(+), 8 deletions(-)

Index: mmotm-0615/mm/memcontrol.c
===================================================================
--- mmotm-0615.orig/mm/memcontrol.c
+++ mmotm-0615/mm/memcontrol.c
@@ -144,10 +144,12 @@ struct mem_cgroup_per_zone {

struct mem_cgroup_per_node {
struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
+ unsigned long weight;
};

struct mem_cgroup_lru_info {
struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
+ unsigned long total_weight;
};

/*
@@ -1617,6 +1619,33 @@ mem_cgroup_select_victim(struct mem_cgro

#if MAX_NUMNODES > 1

+static unsigned long mem_cgroup_numascan_weight(struct mem_cgroup *mem,
+ int nid, bool inactive_file_low,
+ bool inactive_anon_low)
+{
+ unsigned int swappiness = mem_cgroup_swappiness(mem);
+ unsigned long file, anon, weight;
+
+ /* swappiness == 0 needs some care for avoiding very heavy scanning */
+ if (!swappiness)
+ swappiness = 1;
+
+ file = mem_cgroup_get_zonestat_node(mem, nid, LRU_INACTIVE_FILE);
+ if (inactive_file_low)
+ file += mem_cgroup_get_zonestat_node(mem, nid, LRU_ACTIVE_FILE);
+
+ anon = mem_cgroup_get_zonestat_node(mem, nid, LRU_INACTIVE_ANON);
+ if (inactive_anon_low)
+ anon += mem_cgroup_get_zonestat_node(mem, nid, LRU_ACTIVE_ANON);
+
+ if (!total_swap_pages || !res_counter_margin(&mem->memsw))
+ weight = file;
+ else
+ weight = (file * (200 - swappiness) + anon * swappiness)/200;
+ mem->info.nodeinfo[nid]->weight = weight;
+ return weight;
+}
+
/*
* Always updating the nodemask is not very good - even if we have an empty
* list or the wrong list here, we can start from some node and traverse all
@@ -1630,6 +1659,7 @@ mem_cgroup_select_victim(struct mem_cgro
#define NUMASCAN_UPDATE_THRESH (16384UL) /* 16k events of pagein/pageout */
static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem)
{
+ bool inactive_file_low, inactive_anon_low;
int nid;
unsigned long long limit;
/* if no limit, we never reach here */
@@ -1649,17 +1679,20 @@ static void mem_cgroup_may_update_nodema
/* make a nodemask where this memcg uses memory from */
mem->scan_nodes = node_states[N_HIGH_MEMORY];

+ inactive_file_low = mem_cgroup_inactive_file_is_low(mem);
+ inactive_anon_low = mem_cgroup_inactive_anon_is_low(mem);
+ mem->info.total_weight = 0;
+
for_each_node_mask(nid, node_states[N_HIGH_MEMORY]) {
+ unsigned long weight;

- if (mem_cgroup_get_zonestat_node(mem, nid, LRU_INACTIVE_FILE) ||
- mem_cgroup_get_zonestat_node(mem, nid, LRU_ACTIVE_FILE))
- continue;
+ weight = mem_cgroup_numascan_weight(mem, nid,
+ inactive_file_low,
+ inactive_anon_low);
+ if (!weight)
+ node_clear(nid, mem->scan_nodes);

- if (total_swap_pages &&
- (mem_cgroup_get_zonestat_node(mem, nid, LRU_INACTIVE_ANON) ||
- mem_cgroup_get_zonestat_node(mem, nid, LRU_ACTIVE_ANON)))
- continue;
- node_clear(nid, mem->scan_nodes);
+ mem->info.total_weight += weight;
}
mutex_unlock(&mem->numascan_mutex);
}
@@ -4295,6 +4328,15 @@ static int mem_control_numa_stat_show(st
seq_printf(m, " N%d=%lu", nid, node_nr);
}
seq_putc(m, '\n');
+
+ seq_printf(m, "scan_weight=%lu", mem_cont->info.total_weight);
+ for_each_node_state(nid, N_HIGH_MEMORY) {
+ unsigned long weight;
+
+ weight = mem_cont->info.nodeinfo[nid]->weight;
+ seq_printf(m, " N%d=%lu", nid, weight);
+ }
+ seq_putc(m, '\n');
return 0;
}
#endif /* CONFIG_NUMA */

2011-06-16 04:04:46

by Kamezawa Hiroyuki

Subject: [PATCH 7/7] memcg: proportional fair victim node selection

From 4fbd49697456c227c86f1d5b46f2cd2169bf1c5b Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Thu, 16 Jun 2011 11:25:23 +0900
Subject: [PATCH 7/7] memcg: proportional fair node victim selection

Commit 889976 implements a round-robin scan of NUMA nodes for
LRU scanning of memcg at hitting the limit.
But round-robin is not very good.

This patch implements a proportionally fair victim selection of nodes
rather than round-robin. The selection is fair with respect to each node's weight.

Each node's weight is calculated periodically and we build a per-node
scheduling entity as

total_ticket = 0;
for_each_node(node)
node->ticket_start = total_ticket;
node->ticket_end = total_ticket + this_node's_weight()
total_ticket = node->ticket_end;

Then, each node has an amount of tickets in proportion to its own weight.

When selecting a victim, a random number is drawn and the node whose
[ticket_start, ticket_end) range contains that number is selected as the
victim. This is a lottery scheduling algorithm.

For quick search of victim, this patch uses bsearch().
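
For illustration, a small userspace analogue of the ticket build and the
bsearch() lookup described above (hypothetical node weights, and stdlib
rand() standing in for random32()):

#include <stdio.h>
#include <stdlib.h>

struct nsched {
	int nid;
	unsigned long start;	/* ticket_start */
	unsigned long ticket;	/* number of tickets == this node's weight */
};

static int node_weight_compare(const void *key, const void *elt)
{
	unsigned long lottery = (unsigned long)key;
	const struct nsched *ns = elt;

	if (lottery < ns->start)
		return -1;
	if (lottery >= ns->start + ns->ticket)
		return 1;
	return 0;
}

int main(void)
{
	/* two nodes with weights 300 and 100: node 0 should win ~75% */
	unsigned long weights[2] = { 300, 100 }, total = 0, hits[2] = { 0, 0 };
	struct nsched nsched[2];
	unsigned long i;

	for (i = 0; i < 2; i++) {
		nsched[i].nid = i;
		nsched[i].start = total;	/* ticket_start */
		nsched[i].ticket = weights[i];	/* ticket_end - ticket_start */
		total += weights[i];
	}

	srand(1);
	for (i = 0; i < 100000; i++) {
		unsigned long lottery = (unsigned long)rand() % total;
		struct nsched *ns = bsearch((void *)lottery, nsched, 2,
					    sizeof(struct nsched),
					    node_weight_compare);
		hits[ns->nid]++;
	}
	printf("node 0 selected %lu times, node 1 selected %lu times\n",
	       hits[0], hits[1]);
	return 0;
}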

Test result:
on an 8-CPU box with 2 nodes.
Memory is limited to 300MB and httpd runs with a 4096 files/600MB working set.
(Normalized) random accesses are made by apache-bench and scan_stat is checked.
The test makes 40960 requests.
(Because an httpd thread uses just 10% CPU, the number of threads will
not be balanced between nodes. Then, the file caches will not be balanced
between nodes either.)

[Before patch]
[kamezawa@bluextal ~]$ cat /cgroup/memory/test/memory.scan_stat
scanned_pages_by_limit 550740
freed_pages_by_limit 206473
elapsed_ns_by_limit 9485418834

[After patch]
scanned_pages_by_limit 521520
freed_pages_by_limit 199330
elapsed_ns_by_limit 7904913234

The total number of scanned pages and the elapsed time for scanning are
decreased in this test.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 116 ++++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 92 insertions(+), 24 deletions(-)

Index: mmotm-0615/mm/memcontrol.c
===================================================================
--- mmotm-0615.orig/mm/memcontrol.c
+++ mmotm-0615/mm/memcontrol.c
@@ -49,6 +49,8 @@
#include <linux/page_cgroup.h>
#include <linux/cpu.h>
#include <linux/oom.h>
+#include <linux/random.h>
+#include <linux/bsearch.h>
#include "internal.h"

#include <asm/uaccess.h>
@@ -149,7 +151,14 @@ struct mem_cgroup_per_node {

struct mem_cgroup_lru_info {
struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
- unsigned long total_weight;
+ u32 total_weight;
+};
+
+/* a structure for numa victim selection scheduling */
+struct mem_cgroup_numasched {
+ int nid;
+ u32 start;
+ u32 ticket;
};

/*
@@ -289,10 +298,11 @@ struct mem_cgroup {
* reclaimed from.
*/
int last_scanned_child;
- int last_scanned_node;
#if MAX_NUMNODES > 1
nodemask_t scan_nodes;
struct mutex numascan_mutex;
+ struct mem_cgroup_numasched *nsched;
+ int nr_nsched;
#endif
atomic_t numascan_update;
/*
@@ -1660,8 +1670,10 @@ static unsigned long mem_cgroup_numascan
static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem)
{
bool inactive_file_low, inactive_anon_low;
- int nid;
- unsigned long long limit;
+ int nid, factor;
+ unsigned long long limit, usage;
+ struct mem_cgroup_numasched *ns;
+
/* if no limit, we never reach here */
limit = res_counter_read_u64(&mem->res, RES_LIMIT);
limit /= PAGE_SIZE;
@@ -1683,13 +1695,41 @@ static void mem_cgroup_may_update_nodema
inactive_anon_low = mem_cgroup_inactive_anon_is_low(mem);
mem->info.total_weight = 0;

+ /*
+ * mem->nsched is an array of nodes with weight. If weight==0,
+ * the nid will not appear in the array. Each ns entry has
+ * [start, ticket] pair and the array. This ticket will be used
+ * for lottery.
+ */
+ ns = mem->nsched;
+ mem->nr_nsched = 0;
+
+ /*
+ * We use random32() for lottery. so, we'd like to make
+ * total_weight smaller than u32. calculate the factor to limit
+ * total_weight, here.
+ */
+ usage = res_counter_read_u64(&mem->res, RES_USAGE);
+ factor = 0;
+
+ while ((usage >> factor) > 0x7fffffff)
+ factor++;
+
for_each_node_mask(nid, node_states[N_HIGH_MEMORY]) {
unsigned long weight;

weight = mem_cgroup_numascan_weight(mem, nid,
inactive_file_low,
inactive_anon_low);
- if (!weight)
+ if (weight) {
+ /* we'd like to fit total_weight to u32. */
+ weight = (weight >> factor) + 1;
+ ns->nid = nid;
+ ns->start = mem->info.total_weight;
+ ns->ticket = weight;
+ ns++;
+ mem->nr_nsched++;
+ } else
node_clear(nid, mem->scan_nodes);

mem->info.total_weight += weight;
@@ -1707,34 +1747,58 @@ static void mem_cgroup_may_update_nodema
* hit limits, it will see a contention on a node. But freeing from remote
* node means more costs for memory reclaim because of memory latency.
*
- * Now, we use round-robin. Better algorithm is welcomed.
+ * Now, we use lottery scheduling based on each node's weight.
*/
+int node_weight_compare(const void *key, const void *elt)
+{
+ const unsigned long lottery = (unsigned long)key;
+ const struct mem_cgroup_numasched *ns = elt;
+
+ if (lottery < ns->start)
+ return -1;
+ if (lottery >= ns->start + ns->ticket)
+ return 1;
+ return 0;
+}
+
int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
{
+ unsigned long lottery;
+ struct mem_cgroup_numasched *ns;
int node;

mem_cgroup_may_update_nodemask(mem);
- node = mem->last_scanned_node;
+ if (!mutex_trylock(&mem->numascan_mutex))
+ return numa_node_id();

- node = next_node(node, mem->scan_nodes);
- if (node == MAX_NUMNODES)
- node = first_node(mem->scan_nodes);
- /*
- * We call this when we hit limit, not when pages are added to LRU.
- * No LRU may hold pages because all pages are UNEVICTABLE or
- * memcg is too small and all pages are not on LRU. In that case,
- * we use curret node.
- */
- if (unlikely(node == MAX_NUMNODES))
- node = numa_node_id();
+ lottery = random32()/mem->info.total_weight;
+ ns = bsearch((void *)lottery, mem->nsched, mem->nr_nsched,
+ sizeof(struct mem_cgroup_numasched), node_weight_compare);

- mem->last_scanned_node = node;
+ if (unlikely(ns == NULL))
+ node = numa_node_id();
+ else
+ node = ns->nid;
+ mutex_unlock(&mem->numascan_mutex);
return node;
}

-static void mem_cgroup_numascan_init(struct mem_cgroup *mem)
+static bool mem_cgroup_numascan_init(struct mem_cgroup *mem)
{
+ int size;
+
mutex_init(&mem->numascan_mutex);
+
+ size = sizeof(struct mem_cgroup_numasched) * num_possible_cpus();
+ mem->nsched = kmalloc(size, GFP_KERNEL);
+ if (!mem->nsched)
+ return false;
+ return true;
+}
+
+static void mem_cgroup_numascan_free(struct mem_cgroup *mem)
+{
+ kfree(mem->nsched);
}

static bool mem_cgroup_reclaimable(struct mem_cgroup *mem, bool noswap)
@@ -1759,9 +1823,12 @@ int mem_cgroup_select_victim_node(struct
{
return 0;
}
-static void mem_cgroup_numascan_init(struct mem_cgroup *mem)
+static bool mem_cgroup_numascan_init(struct mem_cgroup *mem)
+{
+ return true;
+}
+static void mem_cgroup_numascan_free(struct mem_cgroup *mem)
{
- return 0;
}

static bool mem_cgroup_reclaimable(struct mem_cgroup *mem, bool noswap)
@@ -5024,6 +5091,7 @@ static void __mem_cgroup_free(struct mem

for_each_node_state(node, N_POSSIBLE)
free_mem_cgroup_per_zone_info(mem, node);
+ mem_cgroup_numascan_free(mem);

free_percpu(mem->stat);
if (sizeof(struct mem_cgroup) < PAGE_SIZE)
@@ -5113,6 +5181,8 @@ mem_cgroup_create(struct cgroup_subsys *
for_each_node_state(node, N_POSSIBLE)
if (alloc_mem_cgroup_per_zone_info(mem, node))
goto free_out;
+ if (!mem_cgroup_numascan_init(mem))
+ goto free_out;

/* root ? */
if (cont->parent == NULL) {
@@ -5149,7 +5219,6 @@ mem_cgroup_create(struct cgroup_subsys *
res_counter_init(&mem->memsw, NULL);
}
mem->last_scanned_child = 0;
- mem->last_scanned_node = MAX_NUMNODES;
INIT_LIST_HEAD(&mem->oom_notify);

if (parent)
@@ -5157,7 +5226,6 @@ mem_cgroup_create(struct cgroup_subsys *
atomic_set(&mem->refcnt, 1);
mem->move_charge_at_immigrate = 0;
mutex_init(&mem->thresholds_lock);
- mem_cgroup_numascan_init(mem);
spin_lock_init(&mem->scanstat.lock);
return &mem->css;
free_out:

2011-06-17 22:04:26

by Ying Han

Subject: Re: [PATCH 3/7] memcg: add memory.scan_stat

On Wed, Jun 15, 2011 at 8:53 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> From e08990dd9ada13cf236bec1ef44b207436434b8e Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Wed, 15 Jun 2011 14:11:01 +0900
> Subject: [PATCH 3/7] memcg: add memory.scan_stat
>
> The commit log of commit 0ae5e89 "memcg: count the soft_limit reclaim in..."
> says it adds scanning stats to the memory.stat file. But it doesn't, because
> we considered that we needed to reach a consensus for such new APIs.
>
> This patch is a trial to add memory.scan_stat. This shows
> - the number of scanned pages
> - the number of reclaimed pages
> - the elapsed time (including sleep/pause time)
> for both direct/soft reclaim and the shrinking caused by changing the limit
> or force_empty.
>
> The biggest difference from Ying's original one is that this file
> can be reset by a write, as
>
> # echo 0 > ...../memory.scan_stat
>
> [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
> scanned_pages_by_limit 358470
> freed_pages_by_limit 180795
> elapsed_ns_by_limit 21629927
> scanned_pages_by_system 0
> freed_pages_by_system 0
> elapsed_ns_by_system 0
> scanned_pages_by_shrink 76646
> freed_pages_by_shrink 38355
> elappsed_ns_by_shrink 31990670
> total_scanned_pages_by_limit 358470
> total_freed_pages_by_limit 180795
> total_elapsed_ns_by_hierarchical 216299275
> total_scanned_pages_by_system 0
> total_freed_pages_by_system 0
> total_elapsed_ns_by_system 0
> total_scanned_pages_by_shrink 76646
> total_freed_pages_by_shrink 38355
> total_elapsed_ns_by_shrink 31990670
>
> total_xxxx is for hierarchy management.
>
> This will be useful for further memcg development and needs to be
> developed before we do some complicated rework of LRU/softlimit
> management.

Agreed. Actually we are also looking into adding a per-memcg API for
adding visibility of the page reclaim path. It would be helpful for us to
settle on the API first.

I am not a fan of names, but how about
"/dev/cgroup/memory/memory.reclaim_stat" ?

>
> For now, scan/free/elapsed_by_system is incomplete, but future work by
> Johannes et al. will fill in the remaining information, and then we can
> look into problems of isolation between memcgs.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> Documentation/cgroups/memory.txt | 33 +++++++++
> include/linux/memcontrol.h | 16 ++++
> include/linux/swap.h | 6 -
> mm/memcontrol.c | 135 +++++++++++++++++++++++++++++++++++++--
> mm/vmscan.c | 27 ++++++-
> 5 files changed, 199 insertions(+), 18 deletions(-)
>
> Index: mmotm-0615/Documentation/cgroups/memory.txt
> ===================================================================
> --- mmotm-0615.orig/Documentation/cgroups/memory.txt
> +++ mmotm-0615/Documentation/cgroups/memory.txt
> @@ -380,7 +380,7 @@ will be charged as a new owner of it.
>
> 5.2 stat file
>
> -memory.stat file includes following statistics
> +5.2.1 memory.stat file includes following statistics
>
> # per-memory cgroup local status
> cache - # of bytes of page cache memory.
> @@ -438,6 +438,37 @@ Note:
> file_mapped is accounted only when the memory cgroup is owner of page
> cache.)
>
> +5.2.2 memory.scan_stat
> +
> +memory.scan_stat includes statistics information for memory scanning and
> +freeing, reclaiming. The statistics shows memory scanning information since
> +memory cgroup creation and can be reset to 0 by writing 0 as
> +
> + #echo 0 > ../memory.scan_stat
> +
> +This file contains following statistics.
> +
> +scanned_pages_by_limit - # of scanned pages at hitting limit.
> +freed_pages_by_limit - # of freed pages at hitting limit.

How are those stats different from Johannes's patch? I feel we should keep
them in this API instead of memory.stat:
"pgscan_direct_limit"
"pgreclaim_direct_limit"

> +elapsed_ns_by_limit - nano sec of elappsed time at LRU scan at
> + hitting limit.(this includes sleep time.)


> +
> +scanned_pages_by_system - # of scanned pages by the kernel.
> + (Now, this value means global memory reclaim
> + caused by system memory shortage with a hint
> + of softlimit. "No soft limit" case will be
> + supported in future.)
> +freed_pages_by_system - # of freed pages by the kernel.

The same for the following, which I assume have the same meaning as:
"pgscan_direct_hierarchy"
"pgreclaim_direct_hierarchy"

> +elapsed_ns_by_system - nano sec of elappsed time by kernel.
> +
> +scanned_pages_by_shrink - # of scanned pages by shrinking.
> + (i.e. changes of limit, force_empty, etc.)
> +freed_pages_by_shrink - # of freed pages by shirkining.

So those stats are not included in the ones above?

--Ying

> +}
> +
> +static int mem_cgroup_reset_scan_stat(struct cgroup *cgrp,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned int event)
> +{
> + ? ? ? struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> +
> + ? ? ? spin_lock(&mem->scanstat.lock);
> + ? ? ? memset(&mem->scanstat.stats, 0, sizeof(mem->scanstat.stats));
> + ? ? ? memset(&mem->scanstat.totalstats, 0, sizeof(mem->scanstat.totalstats));
> + ? ? ? spin_unlock(&mem->scanstat.lock);
> + ? ? ? return 0;
> +}
> +
> +
> ?static struct cftype mem_cgroup_files[] = {
> ? ? ? ?{
> ? ? ? ? ? ? ? ?.name = "usage_in_bytes",
> @@ -4669,6 +4786,11 @@ static struct cftype mem_cgroup_files[]
> ? ? ? ? ? ? ? ?.mode = S_IRUGO,
> ? ? ? ?},
> ?#endif
> + ? ? ? {
> + ? ? ? ? ? ? ? .name = "scan_stat",
> + ? ? ? ? ? ? ? .read_map = mem_cgroup_scan_stat_read,
> + ? ? ? ? ? ? ? .trigger = mem_cgroup_reset_scan_stat,
> + ? ? ? },
> ?};
>
> ?#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> @@ -4932,6 +5054,7 @@ mem_cgroup_create(struct cgroup_subsys *
> ? ? ? ?atomic_set(&mem->refcnt, 1);
> ? ? ? ?mem->move_charge_at_immigrate = 0;
> ? ? ? ?mutex_init(&mem->thresholds_lock);
> + ? ? ? spin_lock_init(&mem->scanstat.lock);
> ? ? ? ?return &mem->css;
> ?free_out:
> ? ? ? ?__mem_cgroup_free(mem);
> Index: mmotm-0615/mm/vmscan.c
> ===================================================================
> --- mmotm-0615.orig/mm/vmscan.c
> +++ mmotm-0615/mm/vmscan.c
> @@ -2199,9 +2199,9 @@ unsigned long try_to_free_pages(struct z
> ?#ifdef CONFIG_CGROUP_MEM_RES_CTLR
>
> ?unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_t gfp_mask, bool noswap,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct zone *zone,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long *nr_scanned)
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_t gfp_mask, bool noswap,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct zone *zone,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct memcg_scanrecord *rec)
> ?{
> ? ? ? ?struct scan_control sc = {
> ? ? ? ? ? ? ? ?.nr_scanned = 0,
> @@ -2213,6 +2213,7 @@ unsigned long mem_cgroup_shrink_node_zon
> ? ? ? ? ? ? ? ?.order = 0,
> ? ? ? ? ? ? ? ?.mem_cgroup = mem,
> ? ? ? ?};
> + ? ? ? unsigned long start, end;
>
> ? ? ? ?sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> ? ? ? ? ? ? ? ? ? ? ? ?(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
> @@ -2221,6 +2222,7 @@ unsigned long mem_cgroup_shrink_node_zon
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?sc.may_writepage,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?sc.gfp_mask);
>
> + ? ? ? start = sched_clock();
> ? ? ? ?/*
> ? ? ? ? * NOTE: Although we can get the priority field, using it
> ? ? ? ? * here is not a good idea, since it limits the pages we can scan.
> @@ -2229,19 +2231,27 @@ unsigned long mem_cgroup_shrink_node_zon
> ? ? ? ? * the priority and make it zero.
> ? ? ? ? */
> ? ? ? ?shrink_zone(0, zone, &sc);
> + ? ? ? end = sched_clock();
> +
> + ? ? ? if (rec) {
> + ? ? ? ? ? ? ? rec->nr_scanned += sc.nr_scanned;
> + ? ? ? ? ? ? ? rec->nr_freed += sc.nr_reclaimed;
> + ? ? ? ? ? ? ? rec->elappsed += end - start;
> + ? ? ? }
>
> ? ? ? ?trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
>
> - ? ? ? *nr_scanned = sc.nr_scanned;
> ? ? ? ?return sc.nr_reclaimed;
> ?}
>
> ?unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_t gfp_mask,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?bool noswap)
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?bool noswap,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct memcg_scanrecord *rec)
> ?{
> ? ? ? ?struct zonelist *zonelist;
> ? ? ? ?unsigned long nr_reclaimed;
> + ? ? ? unsigned long start, end;
> ? ? ? ?int nid;
> ? ? ? ?struct scan_control sc = {
> ? ? ? ? ? ? ? ?.may_writepage = !laptop_mode,
> @@ -2259,6 +2269,7 @@ unsigned long try_to_free_mem_cgroup_pag
> ? ? ? ? ? ? ? ?.gfp_mask = sc.gfp_mask,
> ? ? ? ?};
>
> + ? ? ? start = sched_clock();
> ? ? ? ?/*
> ? ? ? ? * Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't
> ? ? ? ? * take care of from where we get pages. So the node where we start the
> @@ -2273,6 +2284,12 @@ unsigned long try_to_free_mem_cgroup_pag
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?sc.gfp_mask);
>
> ? ? ? ?nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink);
> + ? ? ? end = sched_clock();
> + ? ? ? if (rec) {
> + ? ? ? ? ? ? ? rec->nr_scanned += sc.nr_scanned;
> + ? ? ? ? ? ? ? rec->nr_freed += sc.nr_reclaimed;
> + ? ? ? ? ? ? ? rec->elappsed += end - start;
> + ? ? ? }
>
> ? ? ? ?trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
>
>
>
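(For reference, a worked example of the context/offset indexing used above; this is an
illustration, not part of the patch. rec.context selects the first entry of a triplet in
the enum, and __FREED/__ELAPSED offset into it:

        rec.context = SCAN_BY_LIMIT;   /* == 0 */
        stats[rec.context]             /* SCAN_BY_LIMIT    -> "scanned_pages_by_limit" */
        stats[rec.context + __FREED]   /* FREED_BY_LIMIT   -> "freed_pages_by_limit"   */
        stats[rec.context + __ELAPSED] /* ELAPSED_BY_LIMIT -> "elapsed_ns_by_limit"    */

SCAN_BY_SYSTEM (== 3) and SCAN_BY_SHRINK (== 6) select the other two triplets the same
way, which is why each scan/freed/elapsed group is kept contiguous in the enum.)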

2011-06-17 22:27:43

by Ying Han

[permalink] [raw]
Subject: Re: [PATCH 5/7] Fix not good check of mem_cgroup_local_usage()

On Wed, Jun 15, 2011 at 8:54 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> From fcfc6ee9847b0b2571cd6e9847572d7c70e1e2b2 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Thu, 16 Jun 2011 09:23:54 +0900
> Subject: [PATCH 5/7] Fix not good check of mem_cgroup_local_usage()
>
> Now, mem_cgroup_local_usage(memcg) is used as hint for scanning memory
> cgroup hierarchy. If it returns true, the memcg has some reclaimable memory.
>
> But this function doesn't take care of
> ?- unevictable pages
> ?- anon pages on swapless system.
>
> This patch fixes the function to use LRU information.
> For NUMA, for avoid scanning, numa scan bitmap is used. If it's
> empty, some more precise check will be done.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> ?mm/memcontrol.c | ? 43 +++++++++++++++++++++++++++++++++----------
> ?1 files changed, 33 insertions(+), 10 deletions(-)
>
> Index: mmotm-0615/mm/memcontrol.c
> ===================================================================
> --- mmotm-0615.orig/mm/memcontrol.c
> +++ mmotm-0615/mm/memcontrol.c
> @@ -632,15 +632,6 @@ static long mem_cgroup_read_stat(struct
> ? ? ? ?return val;
> ?}
>
> -static long mem_cgroup_local_usage(struct mem_cgroup *mem)
> -{
> - ? ? ? long ret;
> -
> - ? ? ? ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
> - ? ? ? ret += mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);
> - ? ? ? return ret;
> -}
> -
> ?static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? bool charge)
> ?{
> @@ -1713,6 +1704,23 @@ static void mem_cgroup_numascan_init(str
> ? ? ? ?mutex_init(&mem->numascan_mutex);
> ?}
>
> +static bool mem_cgroup_reclaimable(struct mem_cgroup *mem, bool noswap)
> +{
> + ? ? ? if (!nodes_empty(mem->scan_nodes))
> + ? ? ? ? ? ? ? return true;
> + ? ? ? /* slow path */
> + ? ? ? if (mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_FILE))
> + ? ? ? ? ? ? ? return true;
> + ? ? ? if (mem_cgroup_get_local_zonestat(mem, LRU_ACTIVE_FILE))
> + ? ? ? ? ? ? ? return true;

Wondering if we can simplify this like:

if (mem_cgroup_nr_file_lru_pages(mem))
return true;


> + ? ? ? if (noswap || !total_swap_pages)
> + ? ? ? ? ? ? ? return false;
> + ? ? ? if (mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON))
> + ? ? ? ? ? ? ? return true;
> + ? ? ? if (mem_cgroup_get_local_zonestat(mem, LRU_ACTIVE_ANON))
> + ? ? ? ? ? ? ? return true;

the same:
if (mem_cgroup_nr_anon_lru_pages(mem))
return true;

> + ? ? ? return false;
> +}

The two functions above are part of the memory.numa_stat patch, which is in
mmotm, I believe. The functionality feels a bit duplicated, except for
the noswap parameter and scan_nodes.
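A minimal sketch of the simplification suggested above, assuming the
mem_cgroup_nr_file_lru_pages()/mem_cgroup_nr_anon_lru_pages() helpers from the
numa_stat patch are available (illustration only, not the posted code):

static bool mem_cgroup_reclaimable(struct mem_cgroup *mem, bool noswap)
{
        /* fast path: the numa scan bitmap already says some node has pages */
        if (!nodes_empty(mem->scan_nodes))
                return true;
        /* file pages are reclaimable regardless of swap */
        if (mem_cgroup_nr_file_lru_pages(mem))
                return true;
        /* anon pages only count when swap is usable */
        if (noswap || !total_swap_pages)
                return false;
        return mem_cgroup_nr_anon_lru_pages(mem) != 0;
}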

--Ying

> ?#else
> ?int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
> ?{
> @@ -1722,6 +1730,21 @@ static void mem_cgroup_numascan_init(str
> ?{
> ? ? ? ?return 0;
> ?}
> +
> +static bool mem_cgroup_reclaimable(struct mem_cgroup *mem, bool noswap)
> +{
> + ? ? ? if (mem_cgroup_get_zonestat_node(mem, 0, LRU_INACTIVE_FILE))
> + ? ? ? ? ? ? ? return true;
> + ? ? ? if (mem_cgroup_get_zonestat_node(mem, 0, LRU_ACTIVE_FILE))
> + ? ? ? ? ? ? ? return true;
> + ? ? ? if (noswap || !total_swap_pages)
> + ? ? ? ? ? ? ? return false;
> + ? ? ? if (mem_cgroup_get_zonestat_node(mem, 0, LRU_INACTIVE_ANON))
> + ? ? ? ? ? ? ? return true;
> + ? ? ? if (mem_cgroup_get_zonestat_node(mem, 0, LRU_ACTIVE_ANON))
> + ? ? ? ? ? ? ? return true;
> + ? ? ? return false;
> +}
> ?#endif
>
>
> @@ -1811,7 +1834,7 @@ again:
>
> ? ? ? ?while (visit--) {
> ? ? ? ? ? ? ? ?victim = mem_cgroup_select_victim(root_mem);
> - ? ? ? ? ? ? ? if (!mem_cgroup_local_usage(victim)) {
> + ? ? ? ? ? ? ? if (!mem_cgroup_reclaimable(victim, noswap)) {
> ? ? ? ? ? ? ? ? ? ? ? ?/* this cgroup's local usage == 0 */
> ? ? ? ? ? ? ? ? ? ? ? ?css_put(&victim->css);
> ? ? ? ? ? ? ? ? ? ? ? ?continue;
>
>

2011-06-19 23:48:35

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 3/7] memcg: add memory.scan_stat

On Fri, 17 Jun 2011 15:04:18 -0700
Ying Han <[email protected]> wrote:

> On Wed, Jun 15, 2011 at 8:53 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > From e08990dd9ada13cf236bec1ef44b207436434b8e Mon Sep 17 00:00:00 2001
> > From: KAMEZAWA Hiroyuki <[email protected]>
> > Date: Wed, 15 Jun 2011 14:11:01 +0900
> > Subject: [PATCH 3/7] memcg: add memory.scan_stat
> >
> > commit log of commit 0ae5e89 " memcg: count the soft_limit reclaim in..."
> > says it adds scanning stats to memory.stat file. But it doesn't because
> > we considered we needed to make a concensus for such new APIs.
> >
> > This patch is a trial to add memory.scan_stat. This shows
> >  - the number of scanned pages
> >  - the number of recleimed pages
> >  - the number of elaplsed time (including sleep/pause time)
> >  for both of direct/soft reclaim and shrinking caused by changing limit
> >  or force_empty.
> >
> > The biggest difference with oringinal Ying's one is that this file
> > can be reset by some write, as
> >
> >  # echo 0 ...../memory.scan_stat
> >
> > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
> > scanned_pages_by_limit 358470
> > freed_pages_by_limit 180795
> > elapsed_ns_by_limit 21629927
> > scanned_pages_by_system 0
> > freed_pages_by_system 0
> > elapsed_ns_by_system 0
> > scanned_pages_by_shrink 76646
> > freed_pages_by_shrink 38355
> > elappsed_ns_by_shrink 31990670
> > total_scanned_pages_by_limit 358470
> > total_freed_pages_by_limit 180795
> > total_elapsed_ns_by_hierarchical 216299275
> > total_scanned_pages_by_system 0
> > total_freed_pages_by_system 0
> > total_elapsed_ns_by_system 0
> > total_scanned_pages_by_shrink 76646
> > total_freed_pages_by_shrink 38355
> > total_elapsed_ns_by_shrink 31990670
> >
> > total_xxxx is for hierarchy management.
> >
> > This will be useful for further memcg developments and need to be
> > developped before we do some complicated rework on LRU/softlimit
> > management.
>
> Agreed. Actually we are also looking into adding a per-memcg API for
> adding visibility of
> page reclaim path. It would be helpful for us to settle w/ the API first.
>
> I am not a fan of names, but how about
> "/dev/cgroup/memory/memory.reclaim_stat" ?
>

Hm, ok, I have no favorite.


> >
> > Now, scan/free/elapsed_by_system is incomplete but future works of
> > Johannes at el. will fill remaining information and then, we can
> > look into problems of isolation between memcgs.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> >  Documentation/cgroups/memory.txt |   33 +++++++++
> >  include/linux/memcontrol.h       |   16 ++++
> >  include/linux/swap.h             |    6 -
> >  mm/memcontrol.c                  |  135 +++++++++++++++++++++++++++++++++++++--
> >  mm/vmscan.c                      |   27 ++++++-
> >  5 files changed, 199 insertions(+), 18 deletions(-)
> >
> > Index: mmotm-0615/Documentation/cgroups/memory.txt
> > ===================================================================
> > --- mmotm-0615.orig/Documentation/cgroups/memory.txt
> > +++ mmotm-0615/Documentation/cgroups/memory.txt
> > @@ -380,7 +380,7 @@ will be charged as a new owner of it.
> >
> >  5.2 stat file
> >
> > -memory.stat file includes following statistics
> > +5.2.1 memory.stat file includes following statistics
> >
> >  # per-memory cgroup local status
> >  cache          - # of bytes of page cache memory.
> > @@ -438,6 +438,37 @@ Note:
> >         file_mapped is accounted only when the memory cgroup is owner of page
> >         cache.)
> >
> > +5.2.2 memory.scan_stat
> > +
> > +memory.scan_stat includes statistics information for memory scanning and
> > +freeing, reclaiming. The statistics shows memory scanning information since
> > +memory cgroup creation and can be reset to 0 by writing 0 as
> > +
> > + #echo 0 > ../memory.scan_stat
> > +
> > +This file contains following statistics.
> > +
> > +scanned_pages_by_limit - # of scanned pages at hitting limit.
> > +freed_pages_by_limit   - # of freed pages at hitting limit.
>
> How those stats different from Johannes's patch? I feel we should keep
> them into this API instead of memory.stat
> "pgscan_direct_limit"
> "pgreclaim_direct_limit"
>

The unit of the number is unclear from that name.
And I can't tell what "direct_limit" means; only "limit" is meaningful.



> > +elapsed_ns_by_limit    - nano sec of elappsed time at LRU scan at
> > +                                  hitting limit.(this includes sleep time.)
>
>
> > +
> > +scanned_pages_by_system        - # of scanned pages by the kernel.
> > +                         (Now, this value means global memory reclaim
> > +                           caused by system memory shortage with a hint
> > +                          of softlimit. "No soft limit" case will be
> > +                          supported in future.)
> > +freed_pages_by_system  - # of freed pages by the kernel.
>
> The same for the following which I assume the same meaning with:
> "pgscan_direct_hierarchy"
> "pgreclaim_direct_hierarchy"
>

That doesn't make sense. What does "hierarchy" mean here?

The two above are for showing the "amount of scanned/reclaimed memory caused by the
system's memory pressure".
(For now, it just shows "softlimit" information until Johannes' work arrives.)

For hierarchy information, I have the "total_xxxx_by_xxxx" parameters.

> > +elapsed_ns_by_system   - nano sec of elappsed time by kernel.
> > +
> > +scanned_pages_by_shrink        - # of scanned pages by shrinking.
> > +                                 (i.e. changes of limit, force_empty, etc.)
> > +freed_pages_by_shrink  - # of freed pages by shirkining.
>
> So those stats are not included in the ones above?
>

Yes.

Thanks,
-Kame


2011-06-19 23:51:58

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 5/7] Fix not good check of mem_cgroup_local_usage()

On Fri, 17 Jun 2011 15:27:36 -0700
Ying Han <[email protected]> wrote:

> On Wed, Jun 15, 2011 at 8:54 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > From fcfc6ee9847b0b2571cd6e9847572d7c70e1e2b2 Mon Sep 17 00:00:00 2001
> > From: KAMEZAWA Hiroyuki <[email protected]>
> > Date: Thu, 16 Jun 2011 09:23:54 +0900
> > Subject: [PATCH 5/7] Fix not good check of mem_cgroup_local_usage()
> >
> > Now, mem_cgroup_local_usage(memcg) is used as hint for scanning memory
> > cgroup hierarchy. If it returns true, the memcg has some reclaimable memory.
> >
> > But this function doesn't take care of
> >  - unevictable pages
> >  - anon pages on swapless system.
> >
> > This patch fixes the function to use LRU information.
> > For NUMA, for avoid scanning, numa scan bitmap is used. If it's
> > empty, some more precise check will be done.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> >  mm/memcontrol.c |   43 +++++++++++++++++++++++++++++++++----------
> >  1 files changed, 33 insertions(+), 10 deletions(-)
> >
> > Index: mmotm-0615/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-0615.orig/mm/memcontrol.c
> > +++ mmotm-0615/mm/memcontrol.c
> > @@ -632,15 +632,6 @@ static long mem_cgroup_read_stat(struct
> >        return val;
> >  }
> >
> > -static long mem_cgroup_local_usage(struct mem_cgroup *mem)
> > -{
> > -       long ret;
> > -
> > -       ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
> > -       ret += mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);
> > -       return ret;
> > -}
> > -
> >  static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
> >                                         bool charge)
> >  {
> > @@ -1713,6 +1704,23 @@ static void mem_cgroup_numascan_init(str
> >        mutex_init(&mem->numascan_mutex);
> >  }
> >
> > +static bool mem_cgroup_reclaimable(struct mem_cgroup *mem, bool noswap)
> > +{
> > +       if (!nodes_empty(mem->scan_nodes))
> > +               return true;
> > +       /* slow path */
> > +       if (mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_FILE))
> > +               return true;
> > +       if (mem_cgroup_get_local_zonestat(mem, LRU_ACTIVE_FILE))
> > +               return true;
>
> Wondering if we can simplify this like:
>
> if (mem_cgroup_nr_file_lru_pages(mem))
> return true;
>
>
> > +       if (noswap || !total_swap_pages)
> > +               return false;
> > +       if (mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON))
> > +               return true;
> > +       if (mem_cgroup_get_local_zonestat(mem, LRU_ACTIVE_ANON))
> > +               return true;
>
> the same:
> if (mem_cgroup_nr_anon_lru_pages(mem))
> return true;
>
> > +       return false;
> > +}
>
> The two functions above are part of memory.numa_stat patch which is in
> mmotm i believe. Just feel the functionality a bit duplicate except
> the noswap parameter and scan_nodes.
>

Ah, I didn't notice such a function.


Hm, considering it more, I think we don't have to scan all nodes and
sum the numbers, because all we check is whether pages == 0 or
pages != 0.

I'll make an update.
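One way to express that early-exit idea (a sketch under the assumption that the
per-node check reuses mem_cgroup_get_zonestat_node(); this is not the updated
patch itself):

static bool mem_cgroup_node_reclaimable(struct mem_cgroup *mem, int nid,
                                        bool noswap)
{
        if (mem_cgroup_get_zonestat_node(mem, nid, LRU_INACTIVE_FILE) ||
            mem_cgroup_get_zonestat_node(mem, nid, LRU_ACTIVE_FILE))
                return true;
        if (noswap || !total_swap_pages)
                return false;
        return mem_cgroup_get_zonestat_node(mem, nid, LRU_INACTIVE_ANON) ||
               mem_cgroup_get_zonestat_node(mem, nid, LRU_ACTIVE_ANON);
}

static bool mem_cgroup_reclaimable(struct mem_cgroup *mem, bool noswap)
{
        int nid;

        /* stop at the first node that still has candidate pages */
        for_each_node_state(nid, N_HIGH_MEMORY)
                if (mem_cgroup_node_reclaimable(mem, nid, noswap))
                        return true;
        return false;
}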

Thanks,
-Kame



2011-06-20 04:09:41

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 3/7] memcg: add memory.scan_stat

On Mon, 20 Jun 2011 08:41:23 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Fri, 17 Jun 2011 15:04:18 -0700
> Ying Han <[email protected]> wrote:
>
> > On Wed, Jun 15, 2011 at 8:53 PM, KAMEZAWA Hiroyuki
> > <[email protected]> wrote:
> > > From e08990dd9ada13cf236bec1ef44b207436434b8e Mon Sep 17 00:00:00 2001
> > > From: KAMEZAWA Hiroyuki <[email protected]>
> > > Date: Wed, 15 Jun 2011 14:11:01 +0900
> > > Subject: [PATCH 3/7] memcg: add memory.scan_stat
> > >
> > > commit log of commit 0ae5e89 " memcg: count the soft_limit reclaim in..."
> > > says it adds scanning stats to memory.stat file. But it doesn't because
> > > we considered we needed to make a concensus for such new APIs.
> > >
> > > This patch is a trial to add memory.scan_stat. This shows
> > >  - the number of scanned pages
> > >  - the number of recleimed pages
> > >  - the number of elaplsed time (including sleep/pause time)
> > >  for both of direct/soft reclaim and shrinking caused by changing limit
> > >  or force_empty.
> > >
> > > The biggest difference with oringinal Ying's one is that this file
> > > can be reset by some write, as
> > >
> > >  # echo 0 ...../memory.scan_stat
> > >
> > > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
> > > scanned_pages_by_limit 358470
> > > freed_pages_by_limit 180795
> > > elapsed_ns_by_limit 21629927
> > > scanned_pages_by_system 0
> > > freed_pages_by_system 0
> > > elapsed_ns_by_system 0
> > > scanned_pages_by_shrink 76646
> > > freed_pages_by_shrink 38355
> > > elappsed_ns_by_shrink 31990670
> > > total_scanned_pages_by_limit 358470
> > > total_freed_pages_by_limit 180795
> > > total_elapsed_ns_by_hierarchical 216299275
> > > total_scanned_pages_by_system 0
> > > total_freed_pages_by_system 0
> > > total_elapsed_ns_by_system 0
> > > total_scanned_pages_by_shrink 76646
> > > total_freed_pages_by_shrink 38355
> > > total_elapsed_ns_by_shrink 31990670
> > >
> > > total_xxxx is for hierarchy management.
> > >
> > > This will be useful for further memcg developments and need to be
> > > developped before we do some complicated rework on LRU/softlimit
> > > management.
> >
> > Agreed. Actually we are also looking into adding a per-memcg API for
> > adding visibility of
> > page reclaim path. It would be helpful for us to settle w/ the API first.
> >
> > I am not a fan of names, but how about
> > "/dev/cgroup/memory/memory.reclaim_stat" ?
> >
>
> Hm, ok, I have no favorite.
>
>

If I rename, I'll just rename the file to "reclaim_stat" but won't rename the
internal structures, because there is already a "struct reclaim_stat".

Hm, to be honest, I don't like the name "reclaim_stat".
(Because in most cases the pages are not freed for reclaim, but for
hitting the limit.)

memory.vmscan_info ?

Thanks,
-Kame

2011-06-20 06:59:18

by Ying Han

[permalink] [raw]
Subject: Re: [PATCH 3/7] memcg: add memory.scan_stat

On Sunday, June 19, 2011, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Mon, 20 Jun 2011 08:41:23 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
>> On Fri, 17 Jun 2011 15:04:18 -0700
>> Ying Han <[email protected]> wrote:
>>
>> > On Wed, Jun 15, 2011 at 8:53 PM, KAMEZAWA Hiroyuki
>> > <[email protected]> wrote:
>> > > From e08990dd9ada13cf236bec1ef44b207436434b8e Mon Sep 17 00:00:00 2001
>> > > From: KAMEZAWA Hiroyuki <[email protected]>
>> > > Date: Wed, 15 Jun 2011 14:11:01 +0900
>> > > Subject: [PATCH 3/7] memcg: add memory.scan_stat
>> > >
>> > > commit log of commit 0ae5e89 " memcg: count the soft_limit reclaim in..."
>> > > says it adds scanning stats to memory.stat file. But it doesn't because
>> > > we considered we needed to make a concensus for such new APIs.
>> > >
>> > > This patch is a trial to add memory.scan_stat. This shows
>> > > ?- the number of scanned pages
>> > > ?- the number of recleimed pages
>> > > ?- the number of elaplsed time (including sleep/pause time)
>> > > ?for both of direct/soft reclaim and shrinking caused by changing limit
>> > > ?or force_empty.
>> > >
>> > > The biggest difference with oringinal Ying's one is that this file
>> > > can be reset by some write, as
>> > >
>> > > ?# echo 0 ...../memory.scan_stat
>> > >
>> > > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
>> > > scanned_pages_by_limit 358470
>> > > freed_pages_by_limit 180795
>> > > elapsed_ns_by_limit 21629927
>> > > scanned_pages_by_system 0
>> > > freed_pages_by_system 0
>> > > elapsed_ns_by_system 0
>> > > scanned_pages_by_shrink 76646
>> > > freed_pages_by_shrink 38355
>> > > elappsed_ns_by_shrink 31990670
>> > > total_scanned_pages_by_limit 358470
>> > > total_freed_pages_by_limit 180795
>> > > total_elapsed_ns_by_hierarchical 216299275
>> > > total_scanned_pages_by_system 0
>> > > total_freed_pages_by_system 0
>> > > total_elapsed_ns_by_system 0
>> > > total_scanned_pages_by_shrink 76646
>> > > total_freed_pages_by_shrink 38355
>> > > total_elapsed_ns_by_shrink 31990670
>> > >
>> > > total_xxxx is for hierarchy management.
>> > >
>> > > This will be useful for further memcg developments and need to be
>> > > developped before we do some complicated rework on LRU/softlimit
>> > > management.
>> >
>> > Agreed. Actually we are also looking into adding a per-memcg API for
>> > adding visibility of
>> > page reclaim path. It would be helpful for us to settle w/ the API first.
>> >
>> > I am not a fan of names, but how about
>> > "/dev/cgroup/memory/memory.reclaim_stat" ?
>> >
>>
>> Hm, ok, I have no favorite.
>>
>>
>
> If I rename, I'll just rename file name as "reclaim_stat" but doesn't
> rename internal structures because there is already "struct reclaim_stat".
>
> Hm, to be honest, I don't like the name "reclaim_stat".
> (Because in most case, the pages are not freed for reclaim, but for
> ?hitting limit.)
>
> memory.vmscan_info ?

No objection on the name. I will look into the other parts of the patch.

Thanks

--ying
>
> Thanks,
> -Kame
>
>

2011-06-21 06:52:27

by Ying Han

[permalink] [raw]
Subject: Re: [PATCH 3/7] memcg: add memory.scan_stat

On Mon, Jun 20, 2011 at 11:49 PM, Ying Han <[email protected]> wrote:
>
>
> On Wed, Jun 15, 2011 at 8:53 PM, KAMEZAWA Hiroyuki <[email protected]> wrote:
>>
>> From e08990dd9ada13cf236bec1ef44b207436434b8e Mon Sep 17 00:00:00 2001
>> From: KAMEZAWA Hiroyuki <[email protected]>
>> Date: Wed, 15 Jun 2011 14:11:01 +0900
>> Subject: [PATCH 3/7] memcg: add memory.scan_stat
>>
>> commit log of commit 0ae5e89 " memcg: count the soft_limit reclaim in..."
>> says it adds scanning stats to memory.stat file. But it doesn't because
>> we considered we needed to make a concensus for such new APIs.
>>
>> This patch is a trial to add memory.scan_stat. This shows
>> ?- the number of scanned pages
>> ?- the number of recleimed pages
>> ?- the number of elaplsed time (including sleep/pause time)
>> ?for both of direct/soft reclaim and shrinking caused by changing limit
>> ?or force_empty.
>>
>> The biggest difference with oringinal Ying's one is that this file
>> can be reset by some write, as
>>
>> ?# echo 0 ...../memory.scan_stat
>>
>> [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
>> scanned_pages_by_limit 358470
>> freed_pages_by_limit 180795
>> elapsed_ns_by_limit 21629927
>> scanned_pages_by_system 0
>> freed_pages_by_system 0
>> elapsed_ns_by_system 0
>> scanned_pages_by_shrink 76646
>> freed_pages_by_shrink 38355
>> elappsed_ns_by_shrink 31990670
>
> elapsed?
>
>>
>> total_scanned_pages_by_limit 358470
>> total_freed_pages_by_limit 180795
>> total_elapsed_ns_by_hierarchical 216299275
>> total_scanned_pages_by_system 0
>> total_freed_pages_by_system 0
>> total_elapsed_ns_by_system 0
>> total_scanned_pages_by_shrink 76646
>> total_freed_pages_by_shrink 38355
>> total_elapsed_ns_by_shrink 31990670
>>
>> total_xxxx is for hierarchy management.
>
> For some reason, i feel the opposite where the local stat (like "scanned_pages_by_limit") are reclaimed under hierarchical reclaim. The total_xxx stats are only incremented for root_mem which is the cgroup triggers the hierarchical reclaim. So:
> total_scanned_pages_by_limit: number of pages being scanned while the memcg hits its limit
> scanned_pages_by_limit: number of pages being scanned while one of the memcg's ancestors hits its limit
> am i missing something?
>
>>
>> This will be useful for further memcg developments and need to be
>> developped before we do some complicated rework on LRU/softlimit
>> management.
>>
>> Now, scan/free/elapsed_by_system is incomplete but future works of
>> Johannes at el. will fill remaining information and then, we can
>> look into problems of isolation between memcgs.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>> ---
>> ?Documentation/cgroups/memory.txt | ? 33 +++++++++
>> ?include/linux/memcontrol.h ? ? ? | ? 16 ++++
>> ?include/linux/swap.h ? ? ? ? ? ? | ? ?6 -
>> ?mm/memcontrol.c ? ? ? ? ? ? ? ? ?| ?135 +++++++++++++++++++++++++++++++++++++--
>> ?mm/vmscan.c ? ? ? ? ? ? ? ? ? ? ?| ? 27 ++++++-
>> ?5 files changed, 199 insertions(+), 18 deletions(-)
>>
>> Index: mmotm-0615/Documentation/cgroups/memory.txt
>> ===================================================================
>> --- mmotm-0615.orig/Documentation/cgroups/memory.txt
>> +++ mmotm-0615/Documentation/cgroups/memory.txt
>> @@ -380,7 +380,7 @@ will be charged as a new owner of it.
>>
>> ?5.2 stat file
>>
>> -memory.stat file includes following statistics
>> +5.2.1 memory.stat file includes following statistics
>>
>> ?# per-memory cgroup local status
>> ?cache ? ? ? ? ?- # of bytes of page cache memory.
>> @@ -438,6 +438,37 @@ Note:
>> ? ? ? ? file_mapped is accounted only when the memory cgroup is owner of page
>> ? ? ? ? cache.)
>>
>> +5.2.2 memory.scan_stat
>> +
>> +memory.scan_stat includes statistics information for memory scanning and
>> +freeing, reclaiming. The statistics shows memory scanning information since
>> +memory cgroup creation and can be reset to 0 by writing 0 as
>> +
>> + #echo 0 > ../memory.scan_stat
>> +
>> +This file contains following statistics.
>> +
>> +scanned_pages_by_limit - # of scanned pages at hitting limit.
>> +freed_pages_by_limit ? - # of freed pages at hitting limit.
>> +elapsed_ns_by_limit ? ?- nano sec of elappsed time at LRU scan at
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?hitting limit.(this includes sleep time.)
>
> elapsed?
>>
>> +
>> +scanned_pages_by_system ? ? ? ?- # of scanned pages by the kernel.
>> + ? ? ? ? ? ? ? ? ? ? ? ? (Now, this value means global memory reclaim
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? caused by system memory shortage with a hint
>> + ? ? ? ? ? ? ? ? ? ? ? ? ?of softlimit. "No soft limit" case will be
>> + ? ? ? ? ? ? ? ? ? ? ? ? ?supported in future.)
>> +freed_pages_by_system ?- # of freed pages by the kernel.
>> +elapsed_ns_by_system ? - nano sec of elappsed time by kernel.
>> +
>> +scanned_pages_by_shrink ? ? ? ?- # of scanned pages by shrinking.
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? (i.e. changes of limit, force_empty, etc.)
>> +freed_pages_by_shrink ?- # of freed pages by shirkining.
>> +elappsed_ns_by_shrink ?- nano sec of elappsed time at shrinking.
>
> elapsed?
>>
>> +
>> +total_xxx includes the statistics of children scanning caused by the cgroup.
>
> based on the code inspection, the total_xxx also includes the cgroup's scan stat as well.
>>
>> +
>> +
>> ?5.3 swappiness
>>
>> ?Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
>> Index: mmotm-0615/include/linux/memcontrol.h
>> ===================================================================
>> --- mmotm-0615.orig/include/linux/memcontrol.h
>> +++ mmotm-0615/include/linux/memcontrol.h
>> @@ -120,6 +120,22 @@ struct zone_reclaim_stat*
>> ?mem_cgroup_get_reclaim_stat_from_page(struct page *page);
>> ?extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct task_struct *p);
>> +struct memcg_scanrecord {
>> + ? ? ? struct mem_cgroup *mem; /* scanend memory cgroup */
>> + ? ? ? struct mem_cgroup *root; /* scan target hierarchy root */
>> + ? ? ? int context; ? ? ? ? ? ?/* scanning context (see memcontrol.c) */
>> + ? ? ? unsigned long nr_scanned; /* the number of scanned pages */
>> + ? ? ? unsigned long nr_freed; /* the number of freed pages */
>> + ? ? ? unsigned long elappsed; /* nsec of time elapsed while scanning */
>
> elapsed?
>>
>> +};
>> +
>> +extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_t gfp_mask, bool noswap,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct memcg_scanrecord *rec);
>> +extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_t gfp_mask, bool noswap,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct zone *zone,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct memcg_scanrecord *rec);
>>
>> ?#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
>> ?extern int do_swap_account;
>> Index: mmotm-0615/include/linux/swap.h
>> ===================================================================
>> --- mmotm-0615.orig/include/linux/swap.h
>> +++ mmotm-0615/include/linux/swap.h
>> @@ -253,12 +253,6 @@ static inline void lru_cache_add_file(st
>> ?/* linux/mm/vmscan.c */
>> ?extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?gfp_t gfp_mask, nodemask_t *mask);
>> -extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
>> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_t gfp_mask, bool noswap);
>> -extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
>> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_t gfp_mask, bool noswap,
>> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct zone *zone,
>> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long *nr_scanned);
>> ?extern int __isolate_lru_page(struct page *page, int mode, int file);
>> ?extern unsigned long shrink_all_memory(unsigned long nr_pages);
>> ?extern int vm_swappiness;
>> Index: mmotm-0615/mm/memcontrol.c
>> ===================================================================
>> --- mmotm-0615.orig/mm/memcontrol.c
>> +++ mmotm-0615/mm/memcontrol.c
>> @@ -203,6 +203,57 @@ struct mem_cgroup_eventfd_list {
>> ?static void mem_cgroup_threshold(struct mem_cgroup *mem);
>> ?static void mem_cgroup_oom_notify(struct mem_cgroup *mem);
>>
>> +enum {
>> + ? ? ? SCAN_BY_LIMIT,
>> + ? ? ? FREED_BY_LIMIT,
>> + ? ? ? ELAPSED_BY_LIMIT,
>> +
>> + ? ? ? SCAN_BY_SYSTEM,
>> + ? ? ? FREED_BY_SYSTEM,
>> + ? ? ? ELAPSED_BY_SYSTEM,
>> +
>> + ? ? ? SCAN_BY_SHRINK,
>> + ? ? ? FREED_BY_SHRINK,
>> + ? ? ? ELAPSED_BY_SHRINK,
>> + ? ? ? NR_SCANSTATS,
>> +};
>> +#define __FREED ? ? ? ? ? ? ? ?(1)
>> +#define ? ? ? ?__ELAPSED ? ? ? (2)
>
> /tab/space/
>
>>
>> +
>> +struct scanstat {
>> + ? ? ? spinlock_t ? ? ?lock;
>> + ? ? ? unsigned long ? stats[NR_SCANSTATS]; ? ?/* local statistics */
>> + ? ? ? unsigned long ? totalstats[NR_SCANSTATS]; ? /* hierarchical */
>> +};
>> +
>> +const char *scanstat_string[NR_SCANSTATS] = {
>> + ? ? ? "scanned_pages_by_limit",
>> + ? ? ? "freed_pages_by_limit",
>> + ? ? ? "elapsed_ns_by_limit",
>> +
>> + ? ? ? "scanned_pages_by_system",
>> + ? ? ? "freed_pages_by_system",
>> + ? ? ? "elapsed_ns_by_system",
>> +
>> + ? ? ? "scanned_pages_by_shrink",
>> + ? ? ? "freed_pages_by_shrink",
>> + ? ? ? "elappsed_ns_by_shrink",
>
> elapsed?
>>
>> +};
>> +
>> +const char *total_scanstat_string[NR_SCANSTATS] = {
>> + ? ? ? "total_scanned_pages_by_limit",
>> + ? ? ? "total_freed_pages_by_limit",
>> + ? ? ? "total_elapsed_ns_by_hierarchical",
>
> typo?
>
>>
>> +
>> + ? ? ? "total_scanned_pages_by_system",
>> + ? ? ? "total_freed_pages_by_system",
>> + ? ? ? "total_elapsed_ns_by_system",
>> +
>> + ? ? ? "total_scanned_pages_by_shrink",
>> + ? ? ? "total_freed_pages_by_shrink",
>> + ? ? ? "total_elapsed_ns_by_shrink",
>> +};
>> +
>> ?/*
>> ?* The memory controller data structure. The memory controller controls both
>> ?* page cache and RSS per cgroup. We would eventually like to provide
>> @@ -264,7 +315,8 @@ struct mem_cgroup {
>>
>> ? ? ? ?/* For oom notifier event fd */
>> ? ? ? ?struct list_head oom_notify;
>> -
>> + ? ? ? /* For recording LRU-scan statistics */
>> + ? ? ? struct scanstat scanstat;
>> ? ? ? ?/*
>> ? ? ? ? * Should we move charges of a task when a task is moved into this
>> ? ? ? ? * mem_cgroup ? And what type of charges should we move ?
>> @@ -1634,6 +1686,28 @@ int mem_cgroup_select_victim_node(struct
>> ?}
>> ?#endif
>>
>> +
>> +
>> +static void mem_cgroup_record_scanstat(struct memcg_scanrecord *rec)
>> +{
>> + ? ? ? struct mem_cgroup *mem;
>> + ? ? ? int context = rec->context;
>> +
>> + ? ? ? mem = rec->mem;
>> + ? ? ? spin_lock(&mem->scanstat.lock);
>> + ? ? ? mem->scanstat.stats[context] += rec->nr_scanned;
>> + ? ? ? mem->scanstat.stats[context + __FREED] += rec->nr_freed;
>> + ? ? ? mem->scanstat.stats[context + __ELAPSED] += rec->elappsed;
>
> elapsed?
>
>> + ? ? ? spin_unlock(&mem->scanstat.lock);
>> +
>> + ? ? ? mem = rec->root;
>> + ? ? ? spin_lock(&mem->scanstat.lock);
>> + ? ? ? mem->scanstat.totalstats[context] += rec->nr_scanned;
>> + ? ? ? mem->scanstat.totalstats[context + __FREED] += rec->nr_freed;
>> + ? ? ? mem->scanstat.totalstats[context + __ELAPSED] += rec->elappsed;
>
> elapsed?
>>
>> + ? ? ? spin_unlock(&mem->scanstat.lock);
>> +}
>> +
>> ?/*
>> ?* Scan the hierarchy if needed to reclaim memory. We remember the last child
>> ?* we reclaimed from, so that we don't end up penalizing one child extensively
>> @@ -1659,8 +1733,8 @@ static int mem_cgroup_hierarchical_recla
>> ? ? ? ?bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
>> ? ? ? ?bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
>> ? ? ? ?unsigned long excess;
>> - ? ? ? unsigned long nr_scanned;
>> ? ? ? ?int visit;
>> + ? ? ? struct memcg_scanrecord rec;
>>
>> ? ? ? ?excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
>>
>> @@ -1668,6 +1742,15 @@ static int mem_cgroup_hierarchical_recla
>> ? ? ? ?if (!check_soft && root_mem->memsw_is_minimum)
>> ? ? ? ? ? ? ? ?noswap = true;
>>
>> + ? ? ? if (shrink)
>> + ? ? ? ? ? ? ? rec.context = SCAN_BY_SHRINK;
>> + ? ? ? else if (check_soft)
>> + ? ? ? ? ? ? ? rec.context = SCAN_BY_SYSTEM;
>> + ? ? ? else
>> + ? ? ? ? ? ? ? rec.context = SCAN_BY_LIMIT;
>> +
>> + ? ? ? rec.root = root_mem;
>
>
>
>>
>> +
>> ?again:
>> ? ? ? ?if (!shrink) {
>> ? ? ? ? ? ? ? ?visit = 0;
>> @@ -1695,14 +1778,19 @@ again:
>> ? ? ? ? ? ? ? ? ? ? ? ?css_put(&victim->css);
>> ? ? ? ? ? ? ? ? ? ? ? ?continue;
>> ? ? ? ? ? ? ? ?}
>> + ? ? ? ? ? ? ? rec.mem = victim;
>> + ? ? ? ? ? ? ? rec.nr_scanned = 0;
>> + ? ? ? ? ? ? ? rec.nr_freed = 0;
>> + ? ? ? ? ? ? ? rec.elappsed = 0;
>
> elapsed?
>>
>> ? ? ? ? ? ? ? ?/* we use swappiness of local cgroup */
>> ? ? ? ? ? ? ? ?if (check_soft) {
>> ? ? ? ? ? ? ? ? ? ? ? ?ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
>> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? noswap, zone, &nr_scanned);
>> - ? ? ? ? ? ? ? ? ? ? ? *total_scanned += nr_scanned;
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? noswap, zone, &rec);
>> + ? ? ? ? ? ? ? ? ? ? ? *total_scanned += rec.nr_scanned;
>> ? ? ? ? ? ? ? ?} else
>> ? ? ? ? ? ? ? ? ? ? ? ?ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
>> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? noswap);
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? noswap, &rec);
>> + ? ? ? ? ? ? ? mem_cgroup_record_scanstat(&rec);
>> ? ? ? ? ? ? ? ?css_put(&victim->css);
>>
>> ? ? ? ? ? ? ? ?total += ret;
>> @@ -3757,7 +3845,8 @@ try_to_free:
>> ? ? ? ? ? ? ? ? ? ? ? ?ret = -EINTR;
>> ? ? ? ? ? ? ? ? ? ? ? ?goto out;
>> ? ? ? ? ? ? ? ?}
>> - ? ? ? ? ? ? ? progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL, false);
>> + ? ? ? ? ? ? ? progress = try_to_free_mem_cgroup_pages(mem,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? GFP_KERNEL, false, NULL);
>
> So we don't record the stat for force_empty case?
>
>>
>> ? ? ? ? ? ? ? ?if (!progress) {
>> ? ? ? ? ? ? ? ? ? ? ? ?nr_retries--;
>> ? ? ? ? ? ? ? ? ? ? ? ?/* maybe some writeback is necessary */
>> @@ -4599,6 +4688,34 @@ static int mem_control_numa_stat_open(st
>> ?}
>> ?#endif /* CONFIG_NUMA */
>>
>> +static int mem_cgroup_scan_stat_read(struct cgroup *cgrp,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct cftype *cft,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct cgroup_map_cb *cb)
>> +{
>> + ? ? ? struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
>> + ? ? ? int i;
>> +
>> + ? ? ? for (i = 0; i < NR_SCANSTATS; i++)
>> + ? ? ? ? ? ? ? cb->fill(cb, scanstat_string[i], mem->scanstat.stats[i]);
>> + ? ? ? for (i = 0; i < NR_SCANSTATS; i++)
>> + ? ? ? ? ? ? ? cb->fill(cb, total_scanstat_string[i],
>> + ? ? ? ? ? ? ? ? ? ? ? mem->scanstat.totalstats[i]);
>> + ? ? ? return 0;
>> +}
>> +
>> +static int mem_cgroup_reset_scan_stat(struct cgroup *cgrp,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned int event)
>> +{
>> + ? ? ? struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
>> +
>> + ? ? ? spin_lock(&mem->scanstat.lock);
>> + ? ? ? memset(&mem->scanstat.stats, 0, sizeof(mem->scanstat.stats));
>> + ? ? ? memset(&mem->scanstat.totalstats, 0, sizeof(mem->scanstat.totalstats));
>> + ? ? ? spin_unlock(&mem->scanstat.lock);
>> + ? ? ? return 0;
>> +}
>> +
>> +
>> ?static struct cftype mem_cgroup_files[] = {
>> ? ? ? ?{
>> ? ? ? ? ? ? ? ?.name = "usage_in_bytes",
>> @@ -4669,6 +4786,11 @@ static struct cftype mem_cgroup_files[]
>> ? ? ? ? ? ? ? ?.mode = S_IRUGO,
>> ? ? ? ?},
>> ?#endif
>> + ? ? ? {
>> + ? ? ? ? ? ? ? .name = "scan_stat",
>> + ? ? ? ? ? ? ? .read_map = mem_cgroup_scan_stat_read,
>> + ? ? ? ? ? ? ? .trigger = mem_cgroup_reset_scan_stat,
>> + ? ? ? },
>> ?};
>>
>> ?#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
>> @@ -4932,6 +5054,7 @@ mem_cgroup_create(struct cgroup_subsys *
>> ? ? ? ?atomic_set(&mem->refcnt, 1);
>> ? ? ? ?mem->move_charge_at_immigrate = 0;
>> ? ? ? ?mutex_init(&mem->thresholds_lock);
>> + ? ? ? spin_lock_init(&mem->scanstat.lock);
>> ? ? ? ?return &mem->css;
>> ?free_out:
>> ? ? ? ?__mem_cgroup_free(mem);
>> Index: mmotm-0615/mm/vmscan.c
>> ===================================================================
>> --- mmotm-0615.orig/mm/vmscan.c
>> +++ mmotm-0615/mm/vmscan.c
>> @@ -2199,9 +2199,9 @@ unsigned long try_to_free_pages(struct z
>> ?#ifdef CONFIG_CGROUP_MEM_RES_CTLR
>>
>> ?unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
>> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_t gfp_mask, bool noswap,
>> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct zone *zone,
>> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long *nr_scanned)
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_t gfp_mask, bool noswap,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct zone *zone,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct memcg_scanrecord *rec)
>> ?{
>> ? ? ? ?struct scan_control sc = {
>> ? ? ? ? ? ? ? ?.nr_scanned = 0,
>> @@ -2213,6 +2213,7 @@ unsigned long mem_cgroup_shrink_node_zon
>> ? ? ? ? ? ? ? ?.order = 0,
>> ? ? ? ? ? ? ? ?.mem_cgroup = mem,
>> ? ? ? ?};
>> + ? ? ? unsigned long start, end;
>>
>> ? ? ? ?sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
>> ? ? ? ? ? ? ? ? ? ? ? ?(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
>> @@ -2221,6 +2222,7 @@ unsigned long mem_cgroup_shrink_node_zon
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?sc.may_writepage,
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?sc.gfp_mask);
>>
>> + ? ? ? start = sched_clock();
>> ? ? ? ?/*
>> ? ? ? ? * NOTE: Although we can get the priority field, using it
>> ? ? ? ? * here is not a good idea, since it limits the pages we can scan.
>> @@ -2229,19 +2231,27 @@ unsigned long mem_cgroup_shrink_node_zon
>> ? ? ? ? * the priority and make it zero.
>> ? ? ? ? */
>> ? ? ? ?shrink_zone(0, zone, &sc);
>> + ? ? ? end = sched_clock();
>> +
>> + ? ? ? if (rec) {
>> + ? ? ? ? ? ? ? rec->nr_scanned += sc.nr_scanned;
>
>
>>
>> + ? ? ? ? ? ? ? rec->nr_freed += sc.nr_reclaimed;
>> + ? ? ? ? ? ? ? rec->elappsed += end - start;
>
> elapsed?
>>
>> + ? ? ? }
>>
>> ? ? ? ?trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
>>
>> - ? ? ? *nr_scanned = sc.nr_scanned;
>> ? ? ? ?return sc.nr_reclaimed;
>> ?}
>>
>> ?unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_t gfp_mask,
>> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?bool noswap)
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?bool noswap,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct memcg_scanrecord *rec)
>> ?{
>> ? ? ? ?struct zonelist *zonelist;
>> ? ? ? ?unsigned long nr_reclaimed;
>> + ? ? ? unsigned long start, end;
>> ? ? ? ?int nid;
>> ? ? ? ?struct scan_control sc = {
>> ? ? ? ? ? ? ? ?.may_writepage = !laptop_mode,
>> @@ -2259,6 +2269,7 @@ unsigned long try_to_free_mem_cgroup_pag
>> ? ? ? ? ? ? ? ?.gfp_mask = sc.gfp_mask,
>> ? ? ? ?};
>>
>> + ? ? ? start = sched_clock();
>> ? ? ? ?/*
>> ? ? ? ? * Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't
>> ? ? ? ? * take care of from where we get pages. So the node where we start the
>> @@ -2273,6 +2284,12 @@ unsigned long try_to_free_mem_cgroup_pag
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?sc.gfp_mask);
>>
>> ? ? ? ?nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink);
>> + ? ? ? end = sched_clock();
>> + ? ? ? if (rec) {
>> + ? ? ? ? ? ? ? rec->nr_scanned += sc.nr_scanned;
>
> sc.nr_scanned only contains the nr_scanned of the last priority pass in do_try_to_free_pages(). We need to reset it to total_scanned before returning. I am looking at v3.0-rc3.
>
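(The loop being referred to looks roughly like this in that era's
do_try_to_free_pages() -- paraphrased, not a verbatim quote:

        for (priority = DEF_PRIORITY; priority >= 0; priority--) {
                sc->nr_scanned = 0;             /* reset on every pass */
                shrink_zones(priority, zonelist, sc);
                total_scanned += sc->nr_scanned;
                ...
        }

so on return sc.nr_scanned only reflects the last pass. One possible fix -- an
assumption, not the follow-up that was actually posted -- is to copy the
accumulated total back, e.g. "sc->nr_scanned = total_scanned;", before
do_try_to_free_pages() returns, so rec->nr_scanned gets the full count.)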
>>
>> + ? ? ? ? ? ? ? rec->nr_freed += sc.nr_reclaimed;
>> + ? ? ? ? ? ? ? rec->elappsed += end - start;
>
> elapsed?
>>
>> + ? ? ? }
>>
>> ? ? ? ?trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
>>
>>
> --Ying

2011-06-22 00:27:36

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 3/7] memcg: add memory.scan_stat

On Mon, 20 Jun 2011 23:49:54 -0700
Ying Han <[email protected]> wrote:

> On Wed, Jun 15, 2011 at 8:53 PM, KAMEZAWA Hiroyuki <
> [email protected]> wrote:
>
> > From e08990dd9ada13cf236bec1ef44b207436434b8e Mon Sep 17 00:00:00 2001
> > From: KAMEZAWA Hiroyuki <[email protected]>
> > Date: Wed, 15 Jun 2011 14:11:01 +0900
> > Subject: [PATCH 3/7] memcg: add memory.scan_stat
> >
> > commit log of commit 0ae5e89 " memcg: count the soft_limit reclaim in..."
> > says it adds scanning stats to memory.stat file. But it doesn't because
> > we considered we needed to make a concensus for such new APIs.
> >
> > This patch is a trial to add memory.scan_stat. This shows
> > - the number of scanned pages
> > - the number of recleimed pages
> > - the number of elaplsed time (including sleep/pause time)
> > for both of direct/soft reclaim and shrinking caused by changing limit
> > or force_empty.
> >
> > The biggest difference with oringinal Ying's one is that this file
> > can be reset by some write, as
> >
> > # echo 0 ...../memory.scan_stat
> >
> > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
> > scanned_pages_by_limit 358470
> > freed_pages_by_limit 180795
> > elapsed_ns_by_limit 21629927
> > scanned_pages_by_system 0
> > freed_pages_by_system 0
> > elapsed_ns_by_system 0
> > scanned_pages_by_shrink 76646
> > freed_pages_by_shrink 38355
> > elappsed_ns_by_shrink 31990670
> >
>
> elapsed?
>

You're right.

>
> > total_scanned_pages_by_limit 358470
> > total_freed_pages_by_limit 180795
> > total_elapsed_ns_by_hierarchical 216299275
> > total_scanned_pages_by_system 0
> > total_freed_pages_by_system 0
> > total_elapsed_ns_by_system 0
> > total_scanned_pages_by_shrink 76646
> > total_freed_pages_by_shrink 38355
> > total_elapsed_ns_by_shrink 31990670
> >
> > total_xxxx is for hierarchy management.
> >
>
> For some reason, i feel the opposite where the local stat (like
> "scanned_pages_by_limit") are reclaimed under hierarchical reclaim. The
> total_xxx stats are only incremented for root_mem which is the cgroup
> triggers the hierarchical reclaim. So:
>
> total_scanned_pages_by_limit: number of pages being scanned while the memcg
> hits its limit
> scanned_pages_by_limit: number of pages being scanned while one of the
> memcg's ancestor hits its limit
>
> am i missing something?
>

scanned_pages_by_limit: scanning caused by hitting the limit of the memcg itself or of one of its ancestors.
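Concretely, with a hierarchy A -> A/B and reclaim triggered by A hitting its limit,
mem_cgroup_record_scanstat() is called with rec.root = A and rec.mem = the victim,
so (an illustration of the posted code, with hypothetical group names):

        /* victim is A/B */
        B->scanstat.stats[SCAN_BY_LIMIT]      += rec.nr_scanned;  /* local      */
        A->scanstat.totalstats[SCAN_BY_LIMIT] += rec.nr_scanned;  /* total_xxx  */

When A itself is the victim, both its local and total counters grow, which is why
total_xxx includes the group's own scanning as well.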




>
> >
> > This will be useful for further memcg developments and need to be
> > developped before we do some complicated rework on LRU/softlimit
> > management.
> >
> > Now, scan/free/elapsed_by_system is incomplete but future works of
> > Johannes at el. will fill remaining information and then, we can
> > look into problems of isolation between memcgs.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > Documentation/cgroups/memory.txt | 33 +++++++++
> > include/linux/memcontrol.h | 16 ++++
> > include/linux/swap.h | 6 -
> > mm/memcontrol.c | 135
> > +++++++++++++++++++++++++++++++++++++--
> > mm/vmscan.c | 27 ++++++-
> > 5 files changed, 199 insertions(+), 18 deletions(-)
> >
> > Index: mmotm-0615/Documentation/cgroups/memory.txt
> > ===================================================================
> > --- mmotm-0615.orig/Documentation/cgroups/memory.txt
> > +++ mmotm-0615/Documentation/cgroups/memory.txt
> > @@ -380,7 +380,7 @@ will be charged as a new owner of it.
> >
> > 5.2 stat file
> >
> > -memory.stat file includes following statistics
> > +5.2.1 memory.stat file includes following statistics
> >
> > # per-memory cgroup local status
> > cache - # of bytes of page cache memory.
> > @@ -438,6 +438,37 @@ Note:
> > file_mapped is accounted only when the memory cgroup is owner of
> > page
> > cache.)
> >
> > +5.2.2 memory.scan_stat
> > +
> > +memory.scan_stat includes statistics information for memory scanning and
> > +freeing, reclaiming. The statistics shows memory scanning information
> > since
> > +memory cgroup creation and can be reset to 0 by writing 0 as
> > +
> > + #echo 0 > ../memory.scan_stat
> > +
> > +This file contains following statistics.
> > +
> > +scanned_pages_by_limit - # of scanned pages at hitting limit.
> > +freed_pages_by_limit - # of freed pages at hitting limit.
> > +elapsed_ns_by_limit - nano sec of elappsed time at LRU scan at
> > + hitting limit.(this includes sleep
> > time.)
> >
> elapsed?
>
> > +
> > +scanned_pages_by_system - # of scanned pages by the kernel.
> > + (Now, this value means global memory reclaim
> > + caused by system memory shortage with a hint
> > + of softlimit. "No soft limit" case will be
> > + supported in future.)
> > +freed_pages_by_system - # of freed pages by the kernel.
> > +elapsed_ns_by_system - nano sec of elappsed time by kernel.
> > +
> > +scanned_pages_by_shrink - # of scanned pages by shrinking.
> > + (i.e. changes of limit, force_empty,
> > etc.)
> > +freed_pages_by_shrink - # of freed pages by shirkining.
> > +elappsed_ns_by_shrink - nano sec of elappsed time at shrinking.
> >
> elapsed?
>
> > +
> > +total_xxx includes the statistics of children scanning caused by the
> > cgroup.
> >
>
> based on the code inspection, the total_xxx also includes the cgroup's scan
> stat as well.
>

yes.


> +
> > +
> > 5.3 swappiness
> >
> > Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups
> > only.
> > Index: mmotm-0615/include/linux/memcontrol.h
> > ===================================================================
> > --- mmotm-0615.orig/include/linux/memcontrol.h
> > +++ mmotm-0615/include/linux/memcontrol.h
> > @@ -120,6 +120,22 @@ struct zone_reclaim_stat*
> > mem_cgroup_get_reclaim_stat_from_page(struct page *page);
> > extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> > struct task_struct *p);
> > +struct memcg_scanrecord {
> > + struct mem_cgroup *mem; /* scanend memory cgroup */
> > + struct mem_cgroup *root; /* scan target hierarchy root */
> > + int context; /* scanning context (see memcontrol.c) */
> > + unsigned long nr_scanned; /* the number of scanned pages */
> > + unsigned long nr_freed; /* the number of freed pages */
> > + unsigned long elappsed; /* nsec of time elapsed while scanning */
> >
> elapsed?
>
> > +};
> > +
> > +extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> > + gfp_t gfp_mask, bool
> > noswap,
> > + struct memcg_scanrecord
> > *rec);
> > +extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> > + gfp_t gfp_mask, bool
> > noswap,
> > + struct zone *zone,
> > + struct memcg_scanrecord
> > *rec);
> >
> > #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > extern int do_swap_account;
> > Index: mmotm-0615/include/linux/swap.h
> > ===================================================================
> > --- mmotm-0615.orig/include/linux/swap.h
> > +++ mmotm-0615/include/linux/swap.h
> > @@ -253,12 +253,6 @@ static inline void lru_cache_add_file(st
> > /* linux/mm/vmscan.c */
> > extern unsigned long try_to_free_pages(struct zonelist *zonelist, int
> > order,
> > gfp_t gfp_mask, nodemask_t *mask);
> > -extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> > - gfp_t gfp_mask, bool
> > noswap);
> > -extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> > - gfp_t gfp_mask, bool
> > noswap,
> > - struct zone *zone,
> > - unsigned long *nr_scanned);
> > extern int __isolate_lru_page(struct page *page, int mode, int file);
> > extern unsigned long shrink_all_memory(unsigned long nr_pages);
> > extern int vm_swappiness;
> > Index: mmotm-0615/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-0615.orig/mm/memcontrol.c
> > +++ mmotm-0615/mm/memcontrol.c
> > @@ -203,6 +203,57 @@ struct mem_cgroup_eventfd_list {
> > static void mem_cgroup_threshold(struct mem_cgroup *mem);
> > static void mem_cgroup_oom_notify(struct mem_cgroup *mem);
> >
> > +enum {
> > + SCAN_BY_LIMIT,
> > + FREED_BY_LIMIT,
> > + ELAPSED_BY_LIMIT,
> > +
> > + SCAN_BY_SYSTEM,
> > + FREED_BY_SYSTEM,
> > + ELAPSED_BY_SYSTEM,
> > +
> > + SCAN_BY_SHRINK,
> > + FREED_BY_SHRINK,
> > + ELAPSED_BY_SHRINK,
> > + NR_SCANSTATS,
> > +};
> > +#define __FREED (1)
> > +#define __ELAPSED (2)
> >
>
> /tab/space/
>
>
> > +
> > +struct scanstat {
> > + spinlock_t lock;
> > + unsigned long stats[NR_SCANSTATS]; /* local statistics */
> > + unsigned long totalstats[NR_SCANSTATS]; /* hierarchical */
> > +};
> > +
> > +const char *scanstat_string[NR_SCANSTATS] = {
> > + "scanned_pages_by_limit",
> > + "freed_pages_by_limit",
> > + "elapsed_ns_by_limit",
> > +
> > + "scanned_pages_by_system",
> > + "freed_pages_by_system",
> > + "elapsed_ns_by_system",
> > +
> > + "scanned_pages_by_shrink",
> > + "freed_pages_by_shrink",
> > + "elappsed_ns_by_shrink",
> >
> elapsed?
>
> > +};
> > +
> > +const char *total_scanstat_string[NR_SCANSTATS] = {
> > + "total_scanned_pages_by_limit",
> > + "total_freed_pages_by_limit",
> > + "total_elapsed_ns_by_hierarchical",
> >
>
> typo?
>
>
> > +
> > + "total_scanned_pages_by_system",
> > + "total_freed_pages_by_system",
> > + "total_elapsed_ns_by_system",
> > +
> > + "total_scanned_pages_by_shrink",
> > + "total_freed_pages_by_shrink",
> > + "total_elapsed_ns_by_shrink",
> > +};
> > +
> > /*
> > * The memory controller data structure. The memory controller controls
> > both
> > * page cache and RSS per cgroup. We would eventually like to provide
> > @@ -264,7 +315,8 @@ struct mem_cgroup {
> >
> > /* For oom notifier event fd */
> > struct list_head oom_notify;
> > -
> > + /* For recording LRU-scan statistics */
> > + struct scanstat scanstat;
> > /*
> > * Should we move charges of a task when a task is moved into this
> > * mem_cgroup ? And what type of charges should we move ?
> > @@ -1634,6 +1686,28 @@ int mem_cgroup_select_victim_node(struct
> > }
> > #endif
> >
> > +
> > +
> > +static void mem_cgroup_record_scanstat(struct memcg_scanrecord *rec)
> > +{
> > + struct mem_cgroup *mem;
> > + int context = rec->context;
> > +
> > + mem = rec->mem;
> > + spin_lock(&mem->scanstat.lock);
> > + mem->scanstat.stats[context] += rec->nr_scanned;
> > + mem->scanstat.stats[context + __FREED] += rec->nr_freed;
> > + mem->scanstat.stats[context + __ELAPSED] += rec->elappsed;
> >
> elapsed?
>
>
> + spin_unlock(&mem->scanstat.lock);
> > +
> > + mem = rec->root;
> > + spin_lock(&mem->scanstat.lock);
> > + mem->scanstat.totalstats[context] += rec->nr_scanned;
> > + mem->scanstat.totalstats[context + __FREED] += rec->nr_freed;
> > + mem->scanstat.totalstats[context + __ELAPSED] += rec->elappsed;
> >
>
> elapsed?
>
> > + spin_unlock(&mem->scanstat.lock);
> > +}
> > +
> > /*
> > * Scan the hierarchy if needed to reclaim memory. We remember the last
> > child
> > * we reclaimed from, so that we don't end up penalizing one child
> > extensively
> > @@ -1659,8 +1733,8 @@ static int mem_cgroup_hierarchical_recla
> > bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
> > bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
> > unsigned long excess;
> > - unsigned long nr_scanned;
> > int visit;
> > + struct memcg_scanrecord rec;
> >
> > excess = res_counter_soft_limit_excess(&root_mem->res) >>
> > PAGE_SHIFT;
> >
> > @@ -1668,6 +1742,15 @@ static int mem_cgroup_hierarchical_recla
> > if (!check_soft && root_mem->memsw_is_minimum)
> > noswap = true;
> >
> > + if (shrink)
> > + rec.context = SCAN_BY_SHRINK;
> > + else if (check_soft)
> > + rec.context = SCAN_BY_SYSTEM;
> > + else
> > + rec.context = SCAN_BY_LIMIT;
> > +
> > + rec.root = root_mem;
> >
>
>
>
> > +
> > again:
> > if (!shrink) {
> > visit = 0;
> > @@ -1695,14 +1778,19 @@ again:
> > css_put(&victim->css);
> > continue;
> > }
> > + rec.mem = victim;
> > + rec.nr_scanned = 0;
> > + rec.nr_freed = 0;
> > + rec.elappsed = 0;
> >
> elapsed?
>
> > /* we use swappiness of local cgroup */
> > if (check_soft) {
> > ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
> > - noswap, zone, &nr_scanned);
> > - *total_scanned += nr_scanned;
> > + noswap, zone, &rec);
> > + *total_scanned += rec.nr_scanned;
> > } else
> > ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> > - noswap);
> > + noswap, &rec);
> > + mem_cgroup_record_scanstat(&rec);
> > css_put(&victim->css);
> >
> > total += ret;
> > @@ -3757,7 +3845,8 @@ try_to_free:
> > ret = -EINTR;
> > goto out;
> > }
> > - progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
> > false);
> > + progress = try_to_free_mem_cgroup_pages(mem,
> > + GFP_KERNEL, false, NULL);
> >
>
> So we don't record the stat for force_empty case?
>

Yes, for now. force_empty is used only for rmdir(). I don't think a log is
necessary for a disappearing cgroup.


>
> > if (!progress) {
> > nr_retries--;
> > /* maybe some writeback is necessary */
> > @@ -4599,6 +4688,34 @@ static int mem_control_numa_stat_open(st
> > }
> > #endif /* CONFIG_NUMA */
> >
> > +static int mem_cgroup_scan_stat_read(struct cgroup *cgrp,
> > + struct cftype *cft,
> > + struct cgroup_map_cb *cb)
> > +{
> > + struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> > + int i;
> > +
> > + for (i = 0; i < NR_SCANSTATS; i++)
> > + cb->fill(cb, scanstat_string[i], mem->scanstat.stats[i]);
> > + for (i = 0; i < NR_SCANSTATS; i++)
> > + cb->fill(cb, total_scanstat_string[i],
> > + mem->scanstat.totalstats[i]);
> > + return 0;
> > +}
> > +
> > +static int mem_cgroup_reset_scan_stat(struct cgroup *cgrp,
> > + unsigned int event)
> > +{
> > + struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> > +
> > + spin_lock(&mem->scanstat.lock);
> > + memset(&mem->scanstat.stats, 0, sizeof(mem->scanstat.stats));
> > + memset(&mem->scanstat.totalstats, 0,
> > sizeof(mem->scanstat.totalstats));
> > + spin_unlock(&mem->scanstat.lock);
> > + return 0;
> > +}
> > +
> > +
> > static struct cftype mem_cgroup_files[] = {
> > {
> > .name = "usage_in_bytes",
> > @@ -4669,6 +4786,11 @@ static struct cftype mem_cgroup_files[]
> > .mode = S_IRUGO,
> > },
> > #endif
> > + {
> > + .name = "scan_stat",
> > + .read_map = mem_cgroup_scan_stat_read,
> > + .trigger = mem_cgroup_reset_scan_stat,
> > + },
> > };
> >
> > #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > @@ -4932,6 +5054,7 @@ mem_cgroup_create(struct cgroup_subsys *
> > atomic_set(&mem->refcnt, 1);
> > mem->move_charge_at_immigrate = 0;
> > mutex_init(&mem->thresholds_lock);
> > + spin_lock_init(&mem->scanstat.lock);
> > return &mem->css;
> > free_out:
> > __mem_cgroup_free(mem);
> > Index: mmotm-0615/mm/vmscan.c
> > ===================================================================
> > --- mmotm-0615.orig/mm/vmscan.c
> > +++ mmotm-0615/mm/vmscan.c
> > @@ -2199,9 +2199,9 @@ unsigned long try_to_free_pages(struct z
> > #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> >
> > unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> > - gfp_t gfp_mask, bool
> > noswap,
> > - struct zone *zone,
> > - unsigned long *nr_scanned)
> > + gfp_t gfp_mask, bool noswap,
> > + struct zone *zone,
> > + struct memcg_scanrecord *rec)
> > {
> > struct scan_control sc = {
> > .nr_scanned = 0,
> > @@ -2213,6 +2213,7 @@ unsigned long mem_cgroup_shrink_node_zon
> > .order = 0,
> > .mem_cgroup = mem,
> > };
> > + unsigned long start, end;
> >
> > sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> > (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
> > @@ -2221,6 +2222,7 @@ unsigned long mem_cgroup_shrink_node_zon
> > sc.may_writepage,
> > sc.gfp_mask);
> >
> > + start = sched_clock();
> > /*
> > * NOTE: Although we can get the priority field, using it
> > * here is not a good idea, since it limits the pages we can scan.
> > @@ -2229,19 +2231,27 @@ unsigned long mem_cgroup_shrink_node_zon
> > * the priority and make it zero.
> > */
> > shrink_zone(0, zone, &sc);
> > + end = sched_clock();
> > +
> > + if (rec) {
> > + rec->nr_scanned += sc.nr_scanned;
> >
>
>
>
> > + rec->nr_freed += sc.nr_reclaimed;
> > + rec->elappsed += end - start;
> >
> elapsed?
>
> > + }
> >
> > trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
> >
> > - *nr_scanned = sc.nr_scanned;
> > return sc.nr_reclaimed;
> > }
> >
> > unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> > gfp_t gfp_mask,
> > - bool noswap)
> > + bool noswap,
> > + struct memcg_scanrecord *rec)
> > {
> > struct zonelist *zonelist;
> > unsigned long nr_reclaimed;
> > + unsigned long start, end;
> > int nid;
> > struct scan_control sc = {
> > .may_writepage = !laptop_mode,
> > @@ -2259,6 +2269,7 @@ unsigned long try_to_free_mem_cgroup_pag
> > .gfp_mask = sc.gfp_mask,
> > };
> >
> > + start = sched_clock();
> > /*
> > * Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't
> > * take care of from where we get pages. So the node where we start
> > the
> > @@ -2273,6 +2284,12 @@ unsigned long try_to_free_mem_cgroup_pag
> > sc.gfp_mask);
> >
> > nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink);
> > + end = sched_clock();
> > + if (rec) {
> > + rec->nr_scanned += sc.nr_scanned;
> >
>
> sc.nr_scanned only contains the nr_scanned of last
> priority do_try_to_free_pages(). we need to reset it to total_scanned before
> return. I am looking at v3.0-rc3 .
>

Hm, OK. Then total_reclaimed in the softlimit path is buggy, too.
I'll check.
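(Editorially, the fix being discussed amounts to something like the fragment below. This is a sketch only, assuming the v3.0-era do_try_to_free_pages() priority loop; it is not the actual mmotm code.)

	/* inside do_try_to_free_pages(): accumulate across priorities */
	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
		sc->nr_scanned = 0;
		shrink_zones(priority, zonelist, sc);
		total_scanned += sc->nr_scanned;
		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
			break;
		/* ... shrink_slab(), writeback throttling etc. elided ... */
	}
	/*
	 * Report the accumulated count back through sc so the memcg caller
	 * records the total, not just the last priority's value.
	 */
	sc->nr_scanned = total_scanned;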

Thanks,
-Kame

2011-06-22 15:15:28

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/7] Fix mem_cgroup_hierarchical_reclaim() to do stable hierarchy walk.

On Thu 16-06-11 12:51:41, KAMEZAWA Hiroyuki wrote:
[...]
> @@ -1667,41 +1668,28 @@ static int mem_cgroup_hierarchical_recla
> if (!check_soft && root_mem->memsw_is_minimum)
> noswap = true;
>
> - while (1) {
> +again:
> + if (!shrink) {
> + visit = 0;
> + for_each_mem_cgroup_tree(victim, root_mem)
> + visit++;
> + } else {
> + /*
> + * At shrinking, we check the usage again in caller side.
> + * so, visit children one by one.
> + */
> + visit = 1;
> + }
> + /*
> + * We are not draining per cpu cached charges during soft limit reclaim
> + * because global reclaim doesn't care about charges. It tries to free
> + * some memory and charges will not give any.
> + */
> + if (!check_soft)
> + drain_all_stock_async(root_mem);
> +
> + while (visit--) {

This is racy, isn't it? What prevents some groups from disappearing in the
meantime? We would reclaim from the groups that are left more than we want.

Why cannot we simply do something like (totally untested):

Index: linus_tree/mm/memcontrol.c
===================================================================
--- linus_tree.orig/mm/memcontrol.c 2011-06-22 17:11:54.000000000 +0200
+++ linus_tree/mm/memcontrol.c 2011-06-22 17:13:05.000000000 +0200
@@ -1652,7 +1652,7 @@ static int mem_cgroup_hierarchical_recla
unsigned long reclaim_options,
unsigned long *total_scanned)
{
- struct mem_cgroup *victim;
+ struct mem_cgroup *victim, *first_victim = NULL;
int ret, total = 0;
int loop = 0;
bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
@@ -1669,6 +1669,11 @@ static int mem_cgroup_hierarchical_recla

while (1) {
victim = mem_cgroup_select_victim(root_mem);
+ if (!first_victim)
+ first_victim = victim;
+ else if (first_victim == victim)
+ break;
+
if (victim == root_mem) {
loop++;
/*
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-22 15:22:22

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/7] export memory cgroup's swappines by mem_cgroup_swappiness()

On Thu 16-06-11 12:52:22, KAMEZAWA Hiroyuki wrote:
> From 6f9c40172947fb92ab0ea6f7d73d577473879636 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Wed, 15 Jun 2011 12:06:31 +0900
> Subject: [PATCH 2/7] export memory cgroup's swappines by mem_cgroup_swappiness()
>
> Each memory cgroup has 'swappiness' value and it can be accessed by
> get_swappiness(memcg). The major user is try_to_free_mem_cgroup_pages()
> and swappiness is passed by argument.
>
> It's now static function but some planned updates will need to
> get swappiness from files other than memcontrol.c
> This patch exports get_swappiness() as mem_cgroup_swappiness().
> By this, we can remove the argument of swapiness from try_to_fre...
>
> I think this makes sense because passed swapiness is always from memory
> cgroup passed as an argument and this duplication of argument is
> not very good.

Yes, that makes sense, and it makes memcg reclaim look more like global reclaim.

>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Reviewed-by: Michal Hocko <[email protected]>

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-22 15:53:14

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 4/7] memcg: update numa information based on event counter

On Thu 16-06-11 12:54:00, KAMEZAWA Hiroyuki wrote:
> From 88090fe10e225ad8769ba0ea01692b7314e8b973 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Wed, 15 Jun 2011 16:19:46 +0900
> Subject: [PATCH 4/7] memcg: update numa information based on event counter
>
> commit 889976 adds an numa node round-robin for memcg. But the information
> is updated once per 10sec.
>
> This patch changes the update trigger from jiffies to memcg's event count.
> After this patch, numa scan information will be updated when
>
> - the number of pagein/out events is larger than 3% of limit
> or
> - the number of pagein/out events is larger than 16k
> (==64MB pagein/pageout if pagesize==4k.)
>
> The counter of mem->numascan_update the sum of percpu events counter.
> When a task hits limit, it checks mem->numascan_update. If it's over
> min(3% of limit, 16k), numa information will be updated.

Yes, I like the event-based approach more than the original (time-based)
one.

>
> This patch also adds mutex for updating information. This will allow us
> to avoid unnecessary scan.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> mm/memcontrol.c | 51 +++++++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 45 insertions(+), 6 deletions(-)
>
> Index: mmotm-0615/mm/memcontrol.c
> ===================================================================
> --- mmotm-0615.orig/mm/memcontrol.c
> +++ mmotm-0615/mm/memcontrol.c
> @@ -108,10 +108,12 @@ enum mem_cgroup_events_index {
> enum mem_cgroup_events_target {
> MEM_CGROUP_TARGET_THRESH,
> MEM_CGROUP_TARGET_SOFTLIMIT,
> + MEM_CGROUP_TARGET_NUMASCAN,

Shouldn't it be defined only for MAX_NUMNODES > 1?

> MEM_CGROUP_NTARGETS,
> };
> #define THRESHOLDS_EVENTS_TARGET (128)
> #define SOFTLIMIT_EVENTS_TARGET (1024)
> +#define NUMASCAN_EVENTS_TARGET (1024)
>
> struct mem_cgroup_stat_cpu {
> long count[MEM_CGROUP_STAT_NSTATS];
> @@ -288,8 +290,9 @@ struct mem_cgroup {
> int last_scanned_node;
> #if MAX_NUMNODES > 1
> nodemask_t scan_nodes;
> - unsigned long next_scan_node_update;
> + struct mutex numascan_mutex;
> #endif
> + atomic_t numascan_update;

Why is it outside the #ifdef?

> /*
> * Should the accounting and control be hierarchical, per subtree?
> */
> @@ -741,6 +744,9 @@ static void __mem_cgroup_target_update(s
> case MEM_CGROUP_TARGET_SOFTLIMIT:
> next = val + SOFTLIMIT_EVENTS_TARGET;
> break;
> + case MEM_CGROUP_TARGET_NUMASCAN:
> + next = val + NUMASCAN_EVENTS_TARGET;
> + break;

MAX_NUMNODES > 1

> default:
> return;
> }
> @@ -764,6 +770,13 @@ static void memcg_check_events(struct me
> __mem_cgroup_target_update(mem,
> MEM_CGROUP_TARGET_SOFTLIMIT);
> }
> + if (unlikely(__memcg_event_check(mem,
> + MEM_CGROUP_TARGET_NUMASCAN))) {
> + atomic_add(MEM_CGROUP_TARGET_NUMASCAN,
> + &mem->numascan_update);
> + __mem_cgroup_target_update(mem,
> + MEM_CGROUP_TARGET_NUMASCAN);
> + }
> }

again MAX_NUMNODES > 1

> }
>
> @@ -1616,17 +1629,32 @@ mem_cgroup_select_victim(struct mem_cgro
> /*
> * Always updating the nodemask is not very good - even if we have an empty
> * list or the wrong list here, we can start from some node and traverse all
> - * nodes based on the zonelist. So update the list loosely once per 10 secs.
> + * nodes based on the zonelist.
> *
> + * The counter of mem->numascan_update is updated once per
> + * NUMASCAN_EVENTS_TARGET. We update the numa information when we see
> + * the number of event is larger than 3% of limit or 64MB pagein/pageout.
> */
> +#define NUMASCAN_UPDATE_RATIO (3)
> +#define NUMASCAN_UPDATE_THRESH (16384UL) /* 16k events of pagein/pageout */
> static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem)
> {
> int nid;
> -
> - if (time_after(mem->next_scan_node_update, jiffies))
> + unsigned long long limit;
> + /* if no limit, we never reach here */
> + limit = res_counter_read_u64(&mem->res, RES_LIMIT);
> + limit /= PAGE_SIZE;
> + /* 3% of limit */
> + limit = (limit * NUMASCAN_UPDATE_RATIO/100UL);
> + limit = min_t(unsigned long long, limit, NUMASCAN_UPDATE_THRESH);
> + /*
> + * If the number of pagein/out event is larger than 3% of limit or
> + * 64MB pagein/out, refresh numa information.
> + */
> + if (atomic_read(&mem->numascan_update) < limit ||
> + !mutex_trylock(&mem->numascan_mutex))
> return;

I am not sure whether a mutex is not overkill here. What about using an
atomic operation instead?

> -
> - mem->next_scan_node_update = jiffies + 10*HZ;
> + atomic_set(&mem->numascan_update, 0);
> /* make a nodemask where this memcg uses memory from */
> mem->scan_nodes = node_states[N_HIGH_MEMORY];
>
> @@ -1642,6 +1670,7 @@ static void mem_cgroup_may_update_nodema
> continue;
> node_clear(nid, mem->scan_nodes);
> }
> + mutex_unlock(&mem->numascan_mutex);
> }

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-22 15:59:00

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 5/7] Fix not good check of mem_cgroup_local_usage()

On Thu 16-06-11 12:54:43, KAMEZAWA Hiroyuki wrote:
> From fcfc6ee9847b0b2571cd6e9847572d7c70e1e2b2 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Thu, 16 Jun 2011 09:23:54 +0900
> Subject: [PATCH 5/7] Fix not good check of mem_cgroup_local_usage()
>
> Now, mem_cgroup_local_usage(memcg) is used as hint for scanning memory
> cgroup hierarchy. If it returns true, the memcg has some reclaimable memory.
>
> But this function doesn't take care of
> - unevictable pages
> - anon pages on swapless system.
>
> This patch fixes the function to use LRU information.
> For NUMA, for avoid scanning, numa scan bitmap is used. If it's
> empty, some more precise check will be done.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> mm/memcontrol.c | 43 +++++++++++++++++++++++++++++++++----------
> 1 files changed, 33 insertions(+), 10 deletions(-)
>
> Index: mmotm-0615/mm/memcontrol.c
> ===================================================================
> --- mmotm-0615.orig/mm/memcontrol.c
> +++ mmotm-0615/mm/memcontrol.c
> @@ -632,15 +632,6 @@ static long mem_cgroup_read_stat(struct
> return val;
> }
>
> -static long mem_cgroup_local_usage(struct mem_cgroup *mem)
> -{
> - long ret;
> -
> - ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
> - ret += mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);
> - return ret;
> -}
> -
> static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
> bool charge)
> {
> @@ -1713,6 +1704,23 @@ static void mem_cgroup_numascan_init(str
> mutex_init(&mem->numascan_mutex);
> }
>
> +static bool mem_cgroup_reclaimable(struct mem_cgroup *mem, bool noswap)
> +{
> + if (!nodes_empty(mem->scan_nodes))
> + return true;

How does a non-empty node mask guarantee that there is some reclaimable memory?

> + /* slow path */
> + if (mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_FILE))
> + return true;
> + if (mem_cgroup_get_local_zonestat(mem, LRU_ACTIVE_FILE))
> + return true;
> + if (noswap || !total_swap_pages)
> + return false;
> + if (mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON))
> + return true;
> + if (mem_cgroup_get_local_zonestat(mem, LRU_ACTIVE_ANON))
> + return true;
> + return false;
> +}
> #else
> int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
> {
> @@ -1722,6 +1730,21 @@ static void mem_cgroup_numascan_init(str
> {
> return 0;
> }
> +
> +static bool mem_cgroup_reclaimable(struct mem_cgroup *mem, bool noswap)
> +{
> + if (mem_cgroup_get_zonestat_node(mem, 0, LRU_INACTIVE_FILE))
> + return true;
> + if (mem_cgroup_get_zonestat_node(mem, 0, LRU_ACTIVE_FILE))
> + return true;
> + if (noswap || !total_swap_pages)
> + return false;
> + if (mem_cgroup_get_zonestat_node(mem, 0, LRU_INACTIVE_ANON))
> + return true;
> + if (mem_cgroup_get_zonestat_node(mem, 0, LRU_ACTIVE_ANON))
> + return true;
> + return false;
> +}
> #endif

Code duplication doesn't look good. What about a common helper?
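For example, something like the following untested sketch (the helper name is made up here; it assumes mem_cgroup_get_zonestat_node() can also be used for the !NUMA case with nid 0):

static bool mem_cgroup_node_reclaimable(struct mem_cgroup *mem, int nid,
					bool noswap)
{
	if (mem_cgroup_get_zonestat_node(mem, nid, LRU_INACTIVE_FILE) ||
	    mem_cgroup_get_zonestat_node(mem, nid, LRU_ACTIVE_FILE))
		return true;
	if (noswap || !total_swap_pages)
		return false;
	if (mem_cgroup_get_zonestat_node(mem, nid, LRU_INACTIVE_ANON) ||
	    mem_cgroup_get_zonestat_node(mem, nid, LRU_ACTIVE_ANON))
		return true;
	return false;
}

Then both variants of mem_cgroup_reclaimable() would reduce to calling this, looping over the candidate nodes in the NUMA case.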
>
>
> @@ -1811,7 +1834,7 @@ again:
>
> while (visit--) {
> victim = mem_cgroup_select_victim(root_mem);
> - if (!mem_cgroup_local_usage(victim)) {
> + if (!mem_cgroup_reclaimable(victim, noswap)) {
> /* this cgroup's local usage == 0 */
> css_put(&victim->css);
> continue;
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-22 18:33:35

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/7] Fix mem_cgroup_hierarchical_reclaim() to do stable hierarchy walk.

On Wed 22-06-11 17:15:00, Michal Hocko wrote:
> On Thu 16-06-11 12:51:41, KAMEZAWA Hiroyuki wrote:
> [...]
> > @@ -1667,41 +1668,28 @@ static int mem_cgroup_hierarchical_recla
> > if (!check_soft && root_mem->memsw_is_minimum)
> > noswap = true;
> >
> > - while (1) {
> > +again:
> > + if (!shrink) {
> > + visit = 0;
> > + for_each_mem_cgroup_tree(victim, root_mem)
> > + visit++;
> > + } else {
> > + /*
> > + * At shrinking, we check the usage again in caller side.
> > + * so, visit children one by one.
> > + */
> > + visit = 1;
> > + }
> > + /*
> > + * We are not draining per cpu cached charges during soft limit reclaim
> > + * because global reclaim doesn't care about charges. It tries to free
> > + * some memory and charges will not give any.
> > + */
> > + if (!check_soft)
> > + drain_all_stock_async(root_mem);
> > +
> > + while (visit--) {
>
> This is racy, isn't it? What prevents some groups to disapear in the
> meantime? We would reclaim from those that are left more that we want.
>
> Why cannot we simply do something like (totally untested):
>
> Index: linus_tree/mm/memcontrol.c
> ===================================================================
> --- linus_tree.orig/mm/memcontrol.c 2011-06-22 17:11:54.000000000 +0200
> +++ linus_tree/mm/memcontrol.c 2011-06-22 17:13:05.000000000 +0200
> @@ -1652,7 +1652,7 @@ static int mem_cgroup_hierarchical_recla
> unsigned long reclaim_options,
> unsigned long *total_scanned)
> {
> - struct mem_cgroup *victim;
> + struct mem_cgroup *victim, *first_victim = NULL;
> int ret, total = 0;
> int loop = 0;
> bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
> @@ -1669,6 +1669,11 @@ static int mem_cgroup_hierarchical_recla
>
> while (1) {
> victim = mem_cgroup_select_victim(root_mem);
> + if (!first_victim)
> + first_victim = victim;
> + else if (first_victim == victim)
> + break;

this will obviously need css_get and css_put to make sure that the group
doesn't disappear in the meantime.

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-23 06:28:16

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 1/7] Fix mem_cgroup_hierarchical_reclaim() to do stable hierarchy walk.

On Wed, 22 Jun 2011 20:33:31 +0200
Michal Hocko <[email protected]> wrote:

> On Wed 22-06-11 17:15:00, Michal Hocko wrote:
> > On Thu 16-06-11 12:51:41, KAMEZAWA Hiroyuki wrote:
> > [...]
> > > @@ -1667,41 +1668,28 @@ static int mem_cgroup_hierarchical_recla
> > > if (!check_soft && root_mem->memsw_is_minimum)
> > > noswap = true;
> > >
> > > - while (1) {
> > > +again:
> > > + if (!shrink) {
> > > + visit = 0;
> > > + for_each_mem_cgroup_tree(victim, root_mem)
> > > + visit++;
> > > + } else {
> > > + /*
> > > + * At shrinking, we check the usage again in caller side.
> > > + * so, visit children one by one.
> > > + */
> > > + visit = 1;
> > > + }
> > > + /*
> > > + * We are not draining per cpu cached charges during soft limit reclaim
> > > + * because global reclaim doesn't care about charges. It tries to free
> > > + * some memory and charges will not give any.
> > > + */
> > > + if (!check_soft)
> > > + drain_all_stock_async(root_mem);
> > > +
> > > + while (visit--) {
> >
> > This is racy, isn't it? What prevents some groups to disapear in the
> > meantime? We would reclaim from those that are left more that we want.
> >
> > Why cannot we simply do something like (totally untested):
> >
> > Index: linus_tree/mm/memcontrol.c
> > ===================================================================
> > --- linus_tree.orig/mm/memcontrol.c 2011-06-22 17:11:54.000000000 +0200
> > +++ linus_tree/mm/memcontrol.c 2011-06-22 17:13:05.000000000 +0200
> > @@ -1652,7 +1652,7 @@ static int mem_cgroup_hierarchical_recla
> > unsigned long reclaim_options,
> > unsigned long *total_scanned)
> > {
> > - struct mem_cgroup *victim;
> > + struct mem_cgroup *victim, *first_victim = NULL;
> > int ret, total = 0;
> > int loop = 0;
> > bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
> > @@ -1669,6 +1669,11 @@ static int mem_cgroup_hierarchical_recla
> >
> > while (1) {
> > victim = mem_cgroup_select_victim(root_mem);
> > + if (!first_victim)
> > + first_victim = victim;
> > + else if (first_victim == victim)
> > + break;
>
> this will obviously need css_get and css_put to make sure that the group
> doesn't disappear in the meantime.
>

I forget why we didn't do it this way. Hmm, OK, I'll use this style.
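(For reference, a rough and untested sketch of that loop-termination style, with the reference handling simplified:)

	struct mem_cgroup *victim, *first_victim = NULL;

	while (1) {
		victim = mem_cgroup_select_victim(root_mem);
		if (!first_victim) {
			/* pin the starting point so it cannot go away */
			first_victim = victim;
			css_get(&first_victim->css);
		} else if (victim == first_victim) {
			/* we came back to where we started: stop */
			css_put(&victim->css);
			break;
		}
		/* ... reclaim from victim and css_put() it as before ... */
	}
	css_put(&first_victim->css);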

Thanks,
-Kame



2011-06-23 06:34:39

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 4/7] memcg: update numa information based on event counter

On Wed, 22 Jun 2011 17:53:09 +0200
Michal Hocko <[email protected]> wrote:

> On Thu 16-06-11 12:54:00, KAMEZAWA Hiroyuki wrote:
> > From 88090fe10e225ad8769ba0ea01692b7314e8b973 Mon Sep 17 00:00:00 2001
> > From: KAMEZAWA Hiroyuki <[email protected]>
> > Date: Wed, 15 Jun 2011 16:19:46 +0900
> > Subject: [PATCH 4/7] memcg: update numa information based on event counter
> >
> > commit 889976 adds an numa node round-robin for memcg. But the information
> > is updated once per 10sec.
> >
> > This patch changes the update trigger from jiffies to memcg's event count.
> > After this patch, numa scan information will be updated when
> >
> > - the number of pagein/out events is larger than 3% of limit
> > or
> > - the number of pagein/out events is larger than 16k
> > (==64MB pagein/pageout if pagesize==4k.)
> >
> > The counter of mem->numascan_update the sum of percpu events counter.
> > When a task hits limit, it checks mem->numascan_update. If it's over
> > min(3% of limit, 16k), numa information will be updated.
>
> Yes, I like the event based approach more than the origin (time) based
> one.
>
> >
> > This patch also adds mutex for updating information. This will allow us
> > to avoid unnecessary scan.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > mm/memcontrol.c | 51 +++++++++++++++++++++++++++++++++++++++++++++------
> > 1 file changed, 45 insertions(+), 6 deletions(-)
> >
> > Index: mmotm-0615/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-0615.orig/mm/memcontrol.c
> > +++ mmotm-0615/mm/memcontrol.c
> > @@ -108,10 +108,12 @@ enum mem_cgroup_events_index {
> > enum mem_cgroup_events_target {
> > MEM_CGROUP_TARGET_THRESH,
> > MEM_CGROUP_TARGET_SOFTLIMIT,
> > + MEM_CGROUP_TARGET_NUMASCAN,
>
> Shouldn't it be defined only for MAX_NUMNODES > 1
>

Hmm, yes. But I want to reduce #ifdefs.


> > MEM_CGROUP_NTARGETS,
> > };
> > #define THRESHOLDS_EVENTS_TARGET (128)
> > #define SOFTLIMIT_EVENTS_TARGET (1024)
> > +#define NUMASCAN_EVENTS_TARGET (1024)
> >
> > struct mem_cgroup_stat_cpu {
> > long count[MEM_CGROUP_STAT_NSTATS];
> > @@ -288,8 +290,9 @@ struct mem_cgroup {
> > int last_scanned_node;
> > #if MAX_NUMNODES > 1
> > nodemask_t scan_nodes;
> > - unsigned long next_scan_node_update;
> > + struct mutex numascan_mutex;
> > #endif
> > + atomic_t numascan_update;
>
> Why it is out of ifdef?
>

This was for avoiding #ifdef in mem_cgroup_create()...but it's not used now.
I'll fix this.



> > /*
> > * Should the accounting and control be hierarchical, per subtree?
> > */
> > @@ -741,6 +744,9 @@ static void __mem_cgroup_target_update(s
> > case MEM_CGROUP_TARGET_SOFTLIMIT:
> > next = val + SOFTLIMIT_EVENTS_TARGET;
> > break;
> > + case MEM_CGROUP_TARGET_NUMASCAN:
> > + next = val + NUMASCAN_EVENTS_TARGET;
> > + break;
>
> MAX_NUMNODES > 1
>
> > default:
> > return;
> > }
> > @@ -764,6 +770,13 @@ static void memcg_check_events(struct me
> > __mem_cgroup_target_update(mem,
> > MEM_CGROUP_TARGET_SOFTLIMIT);
> > }
> > + if (unlikely(__memcg_event_check(mem,
> > + MEM_CGROUP_TARGET_NUMASCAN))) {
> > + atomic_add(MEM_CGROUP_TARGET_NUMASCAN,
> > + &mem->numascan_update);
> > + __mem_cgroup_target_update(mem,
> > + MEM_CGROUP_TARGET_NUMASCAN);
> > + }
> > }
>
> again MAX_NUMNODES > 1
>

Hmm, ok, I will add #ifdef only here.



> > }
> >
> > @@ -1616,17 +1629,32 @@ mem_cgroup_select_victim(struct mem_cgro
> > /*
> > * Always updating the nodemask is not very good - even if we have an empty
> > * list or the wrong list here, we can start from some node and traverse all
> > - * nodes based on the zonelist. So update the list loosely once per 10 secs.
> > + * nodes based on the zonelist.
> > *
> > + * The counter of mem->numascan_update is updated once per
> > + * NUMASCAN_EVENTS_TARGET. We update the numa information when we see
> > + * the number of event is larger than 3% of limit or 64MB pagein/pageout.
> > */
> > +#define NUMASCAN_UPDATE_RATIO (3)
> > +#define NUMASCAN_UPDATE_THRESH (16384UL) /* 16k events of pagein/pageout */
> > static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem)
> > {
> > int nid;
> > -
> > - if (time_after(mem->next_scan_node_update, jiffies))
> > + unsigned long long limit;
> > + /* if no limit, we never reach here */
> > + limit = res_counter_read_u64(&mem->res, RES_LIMIT);
> > + limit /= PAGE_SIZE;
> > + /* 3% of limit */
> > + limit = (limit * NUMASCAN_UPDATE_RATIO/100UL);
> > + limit = min_t(unsigned long long, limit, NUMASCAN_UPDATE_THRESH);
> > + /*
> > + * If the number of pagein/out event is larger than 3% of limit or
> > + * 64MB pagein/out, refresh numa information.
> > + */
> > + if (atomic_read(&mem->numascan_update) < limit ||
> > + !mutex_trylock(&mem->numascan_mutex))
> > return;
>
> I am not sure whether a mutex is not overkill here. What about using an
> atomic operation instead?
>

I think a mutex is more informative than an atomic counter for code readers.
If the overhead is not significant, I'd like to use the mutex.

Thanks,
-Kame

2011-06-23 08:12:12

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 4/7] memcg: update numa information based on event counter

On Thu 23-06-11 15:27:34, KAMEZAWA Hiroyuki wrote:
> On Wed, 22 Jun 2011 17:53:09 +0200
> Michal Hocko <[email protected]> wrote:
>
> > On Thu 16-06-11 12:54:00, KAMEZAWA Hiroyuki wrote:
[...]
> > > @@ -1616,17 +1629,32 @@ mem_cgroup_select_victim(struct mem_cgro
> > > /*
> > > * Always updating the nodemask is not very good - even if we have an empty
> > > * list or the wrong list here, we can start from some node and traverse all
> > > - * nodes based on the zonelist. So update the list loosely once per 10 secs.
> > > + * nodes based on the zonelist.
> > > *
> > > + * The counter of mem->numascan_update is updated once per
> > > + * NUMASCAN_EVENTS_TARGET. We update the numa information when we see
> > > + * the number of event is larger than 3% of limit or 64MB pagein/pageout.
> > > */
> > > +#define NUMASCAN_UPDATE_RATIO (3)
> > > +#define NUMASCAN_UPDATE_THRESH (16384UL) /* 16k events of pagein/pageout */
> > > static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem)
> > > {
> > > int nid;
> > > -
> > > - if (time_after(mem->next_scan_node_update, jiffies))
> > > + unsigned long long limit;
> > > + /* if no limit, we never reach here */
> > > + limit = res_counter_read_u64(&mem->res, RES_LIMIT);
> > > + limit /= PAGE_SIZE;
> > > + /* 3% of limit */
> > > + limit = (limit * NUMASCAN_UPDATE_RATIO/100UL);
> > > + limit = min_t(unsigned long long, limit, NUMASCAN_UPDATE_THRESH);
> > > + /*
> > > + * If the number of pagein/out event is larger than 3% of limit or
> > > + * 64MB pagein/out, refresh numa information.
> > > + */
> > > + if (atomic_read(&mem->numascan_update) < limit ||
> > > + !mutex_trylock(&mem->numascan_mutex))
> > > return;
> >
> > I am not sure whether a mutex is not overkill here. What about using an
> > atomic operation instead?
> >
>
> I think mutex is informative than atomic counter for code readers.
> If influence of overhead is not big, I'd like to use mutex.

I do not have a strong opinion on that. mem_cgroup is not such a
widespread structure that we need to worry about every single byte. On the
other hand, an atomic test&set would do the same thing. We are already using
atomic operations to manipulate numascan_update, so making the whole thing
atomic-based sounds natural.
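Something along these lines, perhaps (a sketch only; the numascan_updating flag is a made-up field and memory barriers are omitted):

	if (atomic_read(&mem->numascan_update) < limit)
		return;
	/* let only one task refresh the nodemask; the others simply skip it */
	if (test_and_set_bit(0, &mem->numascan_updating))
		return;
	atomic_set(&mem->numascan_update, 0);
	/* ... rebuild mem->scan_nodes as before ... */
	clear_bit(0, &mem->numascan_updating);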

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-23 13:35:27

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 6/7] memcg: calc NUMA node's weight for scan.

On Thu 16-06-11 12:56:33, KAMEZAWA Hiroyuki wrote:
> From fb8aaa2c5f7fd99dfcb5d2ecb3c1226a58caafea Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Thu, 16 Jun 2011 10:05:46 +0900
> Subject: [PATCH 6/7] memcg: calc NUMA node's weight for scan.
>
> Now, by commit 889976, numa node scan of memcg is in round-robin.
> As commit log says, "a better algorithm is needed".
>
> for implementing some good scheduling, one of required things is
> defining importance of each node at LRU scanning.
>
> This patch defines each node's weight for scan as
>
> swappiness = (memcg's swappiness)? memcg's swappiness : 1
> FILE = inactive_file + (inactive_file_is_low)? active_file : 0
> ANON = inactive_anon + (inactive_anon_is_low)? active_anon : 0
>
> weight = (FILE * (200-swappiness) + ANON * swappiness)/200.

Shouldn't we consider the node size?
If we have a node which is almost full of file cache, and another node
which is much bigger and mostly occupied by anonymous memory, then the
latter node might end up with a higher weight.

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-23 13:48:53

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 7/7] memcg: proportional fair vicitm node selection

On Thu 16-06-11 12:57:41, KAMEZAWA Hiroyuki wrote:
> From 4fbd49697456c227c86f1d5b46f2cd2169bf1c5b Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Thu, 16 Jun 2011 11:25:23 +0900
> Subject: [PATCH 7/7] memcg: proportional fair node vicitm selection
>
> commit 889976 implements a round-robin scan of numa nodes for
> LRU scanning of memcg at hitting limit.
> But, round-robin is not very good.
>
> This patch implements a proportionally fair victim selection of nodes
> rather than round-robin. The logic is fair against each node's weight.
>
> Each node's weight is calculated periodically and we build an node's
> scheduling entity as
>
> total_ticket = 0;
> for_each_node(node)
> node->ticket_start = total_ticket;
> node->ticket_end = total_ticket + this_node's_weight()
> total_ticket = node->ticket_end;
>
> Then, each nodes has some amounts of tickets in proportion to its own weight.
>
> At selecting victim, a random number is selected and the node which contains
> the random number in [ticket_start, ticket_end) is selected as vicitm.
> This is a lottery scheduling algorithm.
>
> For quick search of victim, this patch uses bsearch().
>
> Test result:
> on 8cpu box with 2 nodes.
> limit memory to be 300MB and run httpd for 4096files/600MB working set.
> do (normalized) random access by apache-bench and see scan_stat.
> The test makes 40960 request. and see scan_stat.
> (Because a httpd thread just use 10% cpu, the number of threads will
> not be balanced between nodes. Then, file caches will not be balanced
> between nodes.)

Have you also tried to test with balanced nodes? I mean, is there any
measurable overhead?

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-23 14:10:14

by Hiroyuki Kamezawa

[permalink] [raw]
Subject: Re: [PATCH 7/7] memcg: proportional fair vicitm node selection

2011/6/23 Michal Hocko <[email protected]>:
> On Thu 16-06-11 12:57:41, KAMEZAWA Hiroyuki wrote:
>> From 4fbd49697456c227c86f1d5b46f2cd2169bf1c5b Mon Sep 17 00:00:00 2001
>> From: KAMEZAWA Hiroyuki <[email protected]>
>> Date: Thu, 16 Jun 2011 11:25:23 +0900
>> Subject: [PATCH 7/7] memcg: proportional fair node vicitm selection
>>
>> commit 889976 implements a round-robin scan of numa nodes for
>> LRU scanning of memcg at hitting limit.
>> But, round-robin is not very good.
>>
>> This patch implements a proportionally fair victim selection of nodes
>> rather than round-robin. The logic is fair against each node's weight.
>>
>> Each node's weight is calculated periodically and we build an node's
>> scheduling entity as
>>
>>      total_ticket = 0;
>>      for_each_node(node)
>>              node->ticket_start = total_ticket;
>>              node->ticket_end   = total_ticket + this_node's_weight()
>>              total_ticket = node->ticket_end;
>>
>> Then, each nodes has some amounts of tickets in proportion to its own weight.
>>
>> At selecting victim, a random number is selected and the node which contains
>> the random number in [ticket_start, ticket_end) is selected as vicitm.
>> This is a lottery scheduling algorithm.
>>
>> For quick search of victim, this patch uses bsearch().
>>
>> Test result:
>>   on 8cpu box with 2 nodes.
>>   limit memory to be 300MB and run httpd for 4096files/600MB working set.
>>   do (normalized) random access by apache-bench and see scan_stat.
>>   The test makes 40960 request. and see scan_stat.
>>   (Because a httpd thread just use 10% cpu, the number of threads will
>>    not be balanced between nodes. Then, file caches will not be balanced
>>    between nodes.)
>
> Have you also tried to test with balanced nodes? I mean, is there any
> measurable overhead?
>

Not enough yet. I was looking into an OOM problem this week :).

I may need to set up another fake_numa + cpuset configuration
for the measurements. In the usual path, the new overhead is random32() and
bsearch(). I'll run some tests.
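(For reference, the hot path of the selection boils down to roughly the sketch below. The struct and helper names here are placeholders, not the real fields in the patch.)

struct numa_ticket {
	int nid;
	unsigned long ticket_start;
	unsigned long ticket_end;	/* node owns [ticket_start, ticket_end) */
};

static int ticket_cmp(const void *key, const void *elt)
{
	unsigned long ticket = *(const unsigned long *)key;
	const struct numa_ticket *t = elt;

	if (ticket < t->ticket_start)
		return -1;
	if (ticket >= t->ticket_end)
		return 1;
	return 0;
}

static int select_victim_node(struct numa_ticket *tickets, int nr_tickets,
			      unsigned long total_ticket)
{
	unsigned long ticket = random32() % total_ticket;
	struct numa_ticket *t;

	t = bsearch(&ticket, tickets, nr_tickets, sizeof(*tickets), ticket_cmp);
	return t ? t->nid : numa_node_id();
}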

Thanks,
-Kame

2011-06-23 14:27:13

by Hiroyuki Kamezawa

[permalink] [raw]
Subject: Re: [PATCH 6/7] memcg: calc NUMA node's weight for scan.

2011/6/23 Michal Hocko <[email protected]>:
> On Thu 16-06-11 12:56:33, KAMEZAWA Hiroyuki wrote:
>> From fb8aaa2c5f7fd99dfcb5d2ecb3c1226a58caafea Mon Sep 17 00:00:00 2001
>> From: KAMEZAWA Hiroyuki <[email protected]>
>> Date: Thu, 16 Jun 2011 10:05:46 +0900
>> Subject: [PATCH 6/7] memcg: calc NUMA node's weight for scan.
>>
>> Now, by commit 889976, numa node scan of memcg is in round-robin.
>> As commit log says, "a better algorithm is needed".
>>
>> for implementing some good scheduling, one of required things is
>> defining importance of each node at LRU scanning.
>>
>> This patch defines each node's weight for scan as
>>
>> swappiness = (memcg's swappiness)? memcg's swappiness : 1
>> FILE = inactive_file + (inactive_file_is_low)? active_file : 0
>> ANON = inactive_anon + (inactive_anon_is_low)? active_anon : 0
>>
>> weight = (FILE * (200-swappiness) + ANON * swappiness)/200.
>
> Shouldn't we consider the node size?

The above checks FILE+ANON... that is the size of the node.

> If we have a node which is almost full with file cache and then other
> node wich is much bigger and it is mostly occupied by anonymous memory
> than the other node might end up with higher weight.

I use proportional fair scheduling in the next patch, and I expect we will not
see heavy starvation in node balancing. And if inactive_anon_is_low(),
the weight of an anon-only node will jump up.

But yes, other "weight" calculations are possible. The point of this patch
series is to introduce a scheduler which can handle such a "weight" value.
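(Stated as code, the per-node weight from patch 6/7 is roughly the following sketch; the helper and its spelled-out arguments are for illustration only.)

static unsigned long node_scan_weight(unsigned long inactive_file,
				unsigned long active_file, bool file_is_low,
				unsigned long inactive_anon,
				unsigned long active_anon, bool anon_is_low,
				unsigned int swappiness)
{
	unsigned long file, anon;

	if (!swappiness)
		swappiness = 1;
	file = inactive_file + (file_is_low ? active_file : 0);
	anon = inactive_anon + (anon_is_low ? active_anon : 0);
	/* weight = (FILE * (200 - swappiness) + ANON * swappiness) / 200 */
	return (file * (200 - swappiness) + anon * swappiness) / 200;
}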

Thanks,
-Kame

2011-06-23 14:30:10

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 7/7] memcg: proportional fair vicitm node selection

On Thu 23-06-11 23:10:11, Hiroyuki Kamezawa wrote:
> 2011/6/23 Michal Hocko <[email protected]>:
> > On Thu 16-06-11 12:57:41, KAMEZAWA Hiroyuki wrote:
> >> From 4fbd49697456c227c86f1d5b46f2cd2169bf1c5b Mon Sep 17 00:00:00 2001
> >> From: KAMEZAWA Hiroyuki <[email protected]>
> >> Date: Thu, 16 Jun 2011 11:25:23 +0900
> >> Subject: [PATCH 7/7] memcg: proportional fair node vicitm selection
> >>
> >> commit 889976 implements a round-robin scan of numa nodes for
> >> LRU scanning of memcg at hitting limit.
> >> But, round-robin is not very good.
> >>
> >> This patch implements a proportionally fair victim selection of nodes
> >> rather than round-robin. The logic is fair against each node's weight.
> >>
> >> Each node's weight is calculated periodically and we build an node's
> >> scheduling entity as
> >>
> >>      total_ticket = 0;
> >>      for_each_node(node)
> >>              node->ticket_start = total_ticket;
> >>              node->ticket_end   = total_ticket + this_node's_weight()
> >>              total_ticket = node->ticket_end;
> >>
> >> Then, each nodes has some amounts of tickets in proportion to its own weight.
> >>
> >> At selecting victim, a random number is selected and the node which contains
> >> the random number in [ticket_start, ticket_end) is selected as vicitm.
> >> This is a lottery scheduling algorithm.
> >>
> >> For quick search of victim, this patch uses bsearch().
> >>
> >> Test result:
> >>   on 8cpu box with 2 nodes.
> >>   limit memory to be 300MB and run httpd for 4096files/600MB working set.
> >>   do (normalized) random access by apache-bench and see scan_stat.
> >>   The test makes 40960 request. and see scan_stat.
> >>   (Because a httpd thread just use 10% cpu, the number of threads will
> >>    not be balanced between nodes. Then, file caches will not be balanced
> >>    between nodes.)
> >
> > Have you also tried to test with balanced nodes? I mean, is there any
> > measurable overhead?
> >
>
> Not enough yet. I checked OOM trouble this week :).
>
> I may need to make another fake_numa setup + cpuset
> to measurements.

What if you just use NUMA rotor for page cache?

> In usual path, new overhead is random32() and
> bsearch(). I'll do some.
>
> Thanks,
> -Kame

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-23 22:25:22

by Hiroyuki Kamezawa

[permalink] [raw]
Subject: Re: [PATCH 7/7] memcg: proportional fair vicitm node selection

2011/6/23 Michal Hocko <[email protected]>:
> On Thu 23-06-11 23:10:11, Hiroyuki Kamezawa wrote:
>> 2011/6/23 Michal Hocko <[email protected]>:
>> > On Thu 16-06-11 12:57:41, KAMEZAWA Hiroyuki wrote:
>> >> From 4fbd49697456c227c86f1d5b46f2cd2169bf1c5b Mon Sep 17 00:00:00 2001
>> >> From: KAMEZAWA Hiroyuki <[email protected]>
>> >> Date: Thu, 16 Jun 2011 11:25:23 +0900
>> >> Subject: [PATCH 7/7] memcg: proportional fair node vicitm selection
>> >>
>> >> commit 889976 implements a round-robin scan of numa nodes for
>> >> LRU scanning of memcg at hitting limit.
>> >> But, round-robin is not very good.
>> >>
>> >> This patch implements a proportionally fair victim selection of nodes
>> >> rather than round-robin. The logic is fair against each node's weight.
>> >>
>> >> Each node's weight is calculated periodically and we build an node's
>> >> scheduling entity as
>> >>
>> >>      total_ticket = 0;
>> >>      for_each_node(node)
>> >>              node->ticket_start = total_ticket;
>> >>              node->ticket_end   = total_ticket + this_node's_weight()
>> >>              total_ticket = node->ticket_end;
>> >>
>> >> Then, each nodes has some amounts of tickets in proportion to its own weight.
>> >>
>> >> At selecting victim, a random number is selected and the node which contains
>> >> the random number in [ticket_start, ticket_end) is selected as vicitm.
>> >> This is a lottery scheduling algorithm.
>> >>
>> >> For quick search of victim, this patch uses bsearch().
>> >>
>> >> Test result:
>> >>   on 8cpu box with 2 nodes.
>> >>   limit memory to be 300MB and run httpd for 4096files/600MB working set.
>> >>   do (normalized) random access by apache-bench and see scan_stat.
>> >>   The test makes 40960 request. and see scan_stat.
>> >>   (Because a httpd thread just use 10% cpu, the number of threads will
>> >>    not be balanced between nodes. Then, file caches will not be balanced
>> >>    between nodes.)
>> >
>> > Have you also tried to test with balanced nodes? I mean, is there any
>> > measurable overhead?
>> >
>>
>> Not enough yet. I checked OOM trouble this week :).
>>
>> I may need to make another fake_numa setup + cpuset
>> to measurements.
>
> What if you just use NUMA rotor for page cache?
>

OK, I'll try it next week. Thank you for the suggestion.

Thanks,
-Kame

2011-06-24 21:40:49

by Ying Han

[permalink] [raw]
Subject: Re: [PATCH 3/7] memcg: add memory.scan_stat

On Tue, Jun 21, 2011 at 5:20 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Mon, 20 Jun 2011 23:49:54 -0700
> Ying Han <[email protected]> wrote:
>
>> On Wed, Jun 15, 2011 at 8:53 PM, KAMEZAWA Hiroyuki <
>> [email protected]> wrote:
>>
>> > From e08990dd9ada13cf236bec1ef44b207436434b8e Mon Sep 17 00:00:00 2001
>> > From: KAMEZAWA Hiroyuki <[email protected]>
>> > Date: Wed, 15 Jun 2011 14:11:01 +0900
>> > Subject: [PATCH 3/7] memcg: add memory.scan_stat
>> >
>> > commit log of commit 0ae5e89 " memcg: count the soft_limit reclaim in..."
>> > says it adds scanning stats to memory.stat file. But it doesn't because
>> > we considered we needed to make a concensus for such new APIs.
>> >
>> > This patch is a trial to add memory.scan_stat. This shows
>> >  - the number of scanned pages
>> >  - the number of recleimed pages
>> >  - the number of elaplsed time (including sleep/pause time)
>> >  for both of direct/soft reclaim and shrinking caused by changing limit
>> >  or force_empty.
>> >
>> > The biggest difference with oringinal Ying's one is that this file
>> > can be reset by some write, as
>> >
>> > ?# echo 0 ...../memory.scan_stat
>> >
>> > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
>> > scanned_pages_by_limit 358470
>> > freed_pages_by_limit 180795
>> > elapsed_ns_by_limit 21629927
>> > scanned_pages_by_system 0
>> > freed_pages_by_system 0
>> > elapsed_ns_by_system 0
>> > scanned_pages_by_shrink 76646
>> > freed_pages_by_shrink 38355
>> > elappsed_ns_by_shrink 31990670
>> >
>>
>> elapsed?
>>
>
> you'r right.
>
>>
>> > total_scanned_pages_by_limit 358470
>> > total_freed_pages_by_limit 180795
>> > total_elapsed_ns_by_hierarchical 216299275
>> > total_scanned_pages_by_system 0
>> > total_freed_pages_by_system 0
>> > total_elapsed_ns_by_system 0
>> > total_scanned_pages_by_shrink 76646
>> > total_freed_pages_by_shrink 38355
>> > total_elapsed_ns_by_shrink 31990670
>> >
>> > total_xxxx is for hierarchy management.
>> >
>>
>> For some reason, i feel the opposite where the local stat (like
>> "scanned_pages_by_limit") are reclaimed under hierarchical reclaim. The
>> total_xxx stats are only incremented for root_mem which is the cgroup
>> triggers the hierarchical reclaim. So:
>>
>> total_scanned_pages_by_limit: number of pages being scanned while the memcg
>> hits its limit
>> scanned_pages_by_limit: number of pages being scanned while one of the
>> memcg's ancestor hits its limit
>>
>> am i missing something?
>>
>
> scanned_pages_by_limit: one of ancestors and itself's limit.

yes, and that is what I understood.
>
>
>
>
>>
>> >
>> > This will be useful for further memcg developments and need to be
>> > developped before we do some complicated rework on LRU/softlimit
>> > management.
>> >
>> > Now, scan/free/elapsed_by_system is incomplete but future works of
>> > Johannes at el. will fill remaining information and then, we can
>> > look into problems of isolation between memcgs.
>> >
>> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>> > ---
>> > ?Documentation/cgroups/memory.txt | ? 33 +++++++++
>> > ?include/linux/memcontrol.h ? ? ? | ? 16 ++++
>> > ?include/linux/swap.h ? ? ? ? ? ? | ? ?6 -
>> > ?mm/memcontrol.c ? ? ? ? ? ? ? ? ?| ?135
>> > +++++++++++++++++++++++++++++++++++++--
>> > ?mm/vmscan.c ? ? ? ? ? ? ? ? ? ? ?| ? 27 ++++++-
>> > ?5 files changed, 199 insertions(+), 18 deletions(-)
>> >
>> > Index: mmotm-0615/Documentation/cgroups/memory.txt
>> > ===================================================================
>> > --- mmotm-0615.orig/Documentation/cgroups/memory.txt
>> > +++ mmotm-0615/Documentation/cgroups/memory.txt
>> > @@ -380,7 +380,7 @@ will be charged as a new owner of it.
>> >
>> > ?5.2 stat file
>> >
>> > -memory.stat file includes following statistics
>> > +5.2.1 memory.stat file includes following statistics
>> >
>> > ?# per-memory cgroup local status
>> > ?cache ? ? ? ? ?- # of bytes of page cache memory.
>> > @@ -438,6 +438,37 @@ Note:
>> > ? ? ? ? file_mapped is accounted only when the memory cgroup is owner of
>> > page
>> > ? ? ? ? cache.)
>> >
>> > +5.2.2 memory.scan_stat
>> > +
>> > +memory.scan_stat includes statistics information for memory scanning and
>> > +freeing, reclaiming. The statistics shows memory scanning information
>> > since
>> > +memory cgroup creation and can be reset to 0 by writing 0 as
>> > +
>> > + #echo 0 > ../memory.scan_stat
>> > +
>> > +This file contains following statistics.
>> > +
>> > +scanned_pages_by_limit - # of scanned pages at hitting limit.
>> > +freed_pages_by_limit ? - # of freed pages at hitting limit.
>> > +elapsed_ns_by_limit ? ?- nano sec of elappsed time at LRU scan at
>> > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?hitting limit.(this includes sleep
>> > time.)
>> >
>> elapsed?
>>
>> > +
>> > +scanned_pages_by_system ? ? ? ?- # of scanned pages by the kernel.
>> > + ? ? ? ? ? ? ? ? ? ? ? ? (Now, this value means global memory reclaim
>> > + ? ? ? ? ? ? ? ? ? ? ? ? ? caused by system memory shortage with a hint
>> > + ? ? ? ? ? ? ? ? ? ? ? ? ?of softlimit. "No soft limit" case will be
>> > + ? ? ? ? ? ? ? ? ? ? ? ? ?supported in future.)
>> > +freed_pages_by_system ?- # of freed pages by the kernel.
>> > +elapsed_ns_by_system ? - nano sec of elappsed time by kernel.
>> > +
>> > +scanned_pages_by_shrink ? ? ? ?- # of scanned pages by shrinking.
>> > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? (i.e. changes of limit, force_empty,
>> > etc.)
>> > +freed_pages_by_shrink ?- # of freed pages by shirkining.
>> > +elappsed_ns_by_shrink ?- nano sec of elappsed time at shrinking.
>> >
>> elapsed?
>>
>> > +
>> > +total_xxx includes the statistics of children scanning caused by the
>> > cgroup.
>> >
>>
>> based on the code inspection, the total_xxx also includes the cgroup's scan
>> stat as well.
>>
>
> yes.
>
>
>> +
>> > +
>> > ?5.3 swappiness
>> >
>> > ?Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups
>> > only.
>> > Index: mmotm-0615/include/linux/memcontrol.h
>> > ===================================================================
>> > --- mmotm-0615.orig/include/linux/memcontrol.h
>> > +++ mmotm-0615/include/linux/memcontrol.h
>> > @@ -120,6 +120,22 @@ struct zone_reclaim_stat*
>> > ?mem_cgroup_get_reclaim_stat_from_page(struct page *page);
>> > ?extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct task_struct *p);
>> > +struct memcg_scanrecord {
>> > +	struct mem_cgroup *mem; /* scanend memory cgroup */
>> > +	struct mem_cgroup *root; /* scan target hierarchy root */
>> > +	int context;		/* scanning context (see memcontrol.c) */
>> > +	unsigned long nr_scanned; /* the number of scanned pages */
>> > +	unsigned long nr_freed; /* the number of freed pages */
>> > +	unsigned long elappsed; /* nsec of time elapsed while scanning */
>> >
>> elapsed?
>>
>> > +};
>> > +
>> > +extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
>> > +						gfp_t gfp_mask, bool noswap,
>> > +						struct memcg_scanrecord *rec);
>> > +extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
>> > +						gfp_t gfp_mask, bool noswap,
>> > +						struct zone *zone,
>> > +						struct memcg_scanrecord *rec);
>> >
>> > ?#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
>> > ?extern int do_swap_account;
>> > Index: mmotm-0615/include/linux/swap.h
>> > ===================================================================
>> > --- mmotm-0615.orig/include/linux/swap.h
>> > +++ mmotm-0615/include/linux/swap.h
>> > @@ -253,12 +253,6 @@ static inline void lru_cache_add_file(st
>> > ?/* linux/mm/vmscan.c */
>> > ?extern unsigned long try_to_free_pages(struct zonelist *zonelist, int
>> > order,
>> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?gfp_t gfp_mask, nodemask_t *mask);
>> > -extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
>> > - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_t gfp_mask, bool
>> > noswap);
>> > -extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
>> > - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_t gfp_mask, bool
>> > noswap,
>> > - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct zone *zone,
>> > - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long *nr_scanned);
>> > ?extern int __isolate_lru_page(struct page *page, int mode, int file);
>> > ?extern unsigned long shrink_all_memory(unsigned long nr_pages);
>> > ?extern int vm_swappiness;
>> > Index: mmotm-0615/mm/memcontrol.c
>> > ===================================================================
>> > --- mmotm-0615.orig/mm/memcontrol.c
>> > +++ mmotm-0615/mm/memcontrol.c
>> > @@ -203,6 +203,57 @@ struct mem_cgroup_eventfd_list {
>> > ?static void mem_cgroup_threshold(struct mem_cgroup *mem);
>> > ?static void mem_cgroup_oom_notify(struct mem_cgroup *mem);
>> >
>> > +enum {
>> > +	SCAN_BY_LIMIT,
>> > +	FREED_BY_LIMIT,
>> > +	ELAPSED_BY_LIMIT,
>> > +
>> > +	SCAN_BY_SYSTEM,
>> > +	FREED_BY_SYSTEM,
>> > +	ELAPSED_BY_SYSTEM,
>> > +
>> > +	SCAN_BY_SHRINK,
>> > +	FREED_BY_SHRINK,
>> > +	ELAPSED_BY_SHRINK,
>> > +	NR_SCANSTATS,
>> > +};
>> > +#define __FREED		(1)
>> > +#define	__ELAPSED	(2)
>> >
>>
>> /tab/space/
>>
>>
>> > +
>> > +struct scanstat {
>> > +	spinlock_t	lock;
>> > +	unsigned long	stats[NR_SCANSTATS];	/* local statistics */
>> > +	unsigned long	totalstats[NR_SCANSTATS];   /* hierarchical */
>> > +};

I wonder why we don't extend the mem_cgroup_stat_cpu struct; then we
can use per-cpu counters like the other stats.

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b7d2d79..5b8bbe9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -138,6 +138,7 @@ struct mem_cgroup_stat_cpu {
 	long count[MEM_CGROUP_STAT_NSTATS];
 	unsigned long events[MEM_CGROUP_EVENTS_NSTATS];
 	unsigned long targets[MEM_CGROUP_NTARGETS];
+	unsigned long reclaim_stats[MEMCG_RECLAIM_NSTATS];
 };
 
 /*
@@ -557,6 +558,101 @@ static void __mem_cgroup_target_update(struct mem_cgroup *mem, int target)
 	this_cpu_write(mem->stat->targets[target], next);
 }
 
+void mem_cgroup_update_reclaim_stat(struct mem_cgroup *mem,
+				enum mem_cgroup_reclaim_stat_index idx,
+				unsigned long num)
+{
+	this_cpu_add(mem->stat->reclaim_stats[idx], num);
+}
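Reading the counter back would then sum over CPUs the same way mem_cgroup_read_stat() does, e.g. (a sketch; the enum index and the helper name are hypothetical):

static unsigned long mem_cgroup_read_reclaim_stat(struct mem_cgroup *mem,
				enum mem_cgroup_reclaim_stat_index idx)
{
	unsigned long val = 0;
	int cpu;

	for_each_online_cpu(cpu)
		val += per_cpu_ptr(mem->stat, cpu)->reclaim_stats[idx];
	return val;
}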


>> > +
>> > +const char *scanstat_string[NR_SCANSTATS] = {
>> > +	"scanned_pages_by_limit",
>> > +	"freed_pages_by_limit",
>> > +	"elapsed_ns_by_limit",
>> > +
>> > +	"scanned_pages_by_system",
>> > +	"freed_pages_by_system",
>> > +	"elapsed_ns_by_system",
>> > +
>> > +	"scanned_pages_by_shrink",
>> > +	"freed_pages_by_shrink",
>> > +	"elappsed_ns_by_shrink",
>> >
>> elapsed?
>>
>> > +};
>> > +
>> > +const char *total_scanstat_string[NR_SCANSTATS] = {
>> > +	"total_scanned_pages_by_limit",
>> > +	"total_freed_pages_by_limit",
>> > +	"total_elapsed_ns_by_hierarchical",
>> >
>>
>> typo?
>>
>>
>> > +
>> > +	"total_scanned_pages_by_system",
>> > +	"total_freed_pages_by_system",
>> > +	"total_elapsed_ns_by_system",
>> > +
>> > +	"total_scanned_pages_by_shrink",
>> > +	"total_freed_pages_by_shrink",
>> > +	"total_elapsed_ns_by_shrink",
>> > +};
>> > +
>> > ?/*
>> > ?* The memory controller data structure. The memory controller controls
>> > both
>> > ?* page cache and RSS per cgroup. We would eventually like to provide
>> > @@ -264,7 +315,8 @@ struct mem_cgroup {
>> >
>> > ? ? ? ?/* For oom notifier event fd */
>> > ? ? ? ?struct list_head oom_notify;
>> > -
>> > + ? ? ? /* For recording LRU-scan statistics */
>> > + ? ? ? struct scanstat scanstat;

Can we add this to mem_cgroup_stat_cpu ?

>> > ? ? ? ?/*
>> > ? ? ? ? * Should we move charges of a task when a task is moved into this
>> > ? ? ? ? * mem_cgroup ? And what type of charges should we move ?
>> > @@ -1634,6 +1686,28 @@ int mem_cgroup_select_victim_node(struct
>> > ?}
>> > ?#endif
>> >
>> > +
>> > +
>> > +static void mem_cgroup_record_scanstat(struct memcg_scanrecord *rec)
>> > +{
>> > + ? ? ? struct mem_cgroup *mem;
>> > + ? ? ? int context = rec->context;
>> > +
>> > + ? ? ? mem = rec->mem;
>> > + ? ? ? spin_lock(&mem->scanstat.lock);
>> > + ? ? ? mem->scanstat.stats[context] += rec->nr_scanned;
>> > + ? ? ? mem->scanstat.stats[context + __FREED] += rec->nr_freed;
>> > + ? ? ? mem->scanstat.stats[context + __ELAPSED] += rec->elappsed;
>> >
>> elapsed?
>>
>>
>> + ? ? ? spin_unlock(&mem->scanstat.lock);
>> > +
>> > + ? ? ? mem = rec->root;
>> > + ? ? ? spin_lock(&mem->scanstat.lock);
>> > + ? ? ? mem->scanstat.totalstats[context] += rec->nr_scanned;
>> > + ? ? ? mem->scanstat.totalstats[context + __FREED] += rec->nr_freed;
>> > + ? ? ? mem->scanstat.totalstats[context + __ELAPSED] += rec->elappsed;
>> >
>>
>> elapsed?
>>
>> > +       spin_unlock(&mem->scanstat.lock);
>> > +}
>> > +
>> >  /*
>> >   * Scan the hierarchy if needed to reclaim memory. We remember the last child
>> >   * we reclaimed from, so that we don't end up penalizing one child extensively
>> > @@ -1659,8 +1733,8 @@ static int mem_cgroup_hierarchical_recla
>> >         bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
>> >         bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
>> >         unsigned long excess;
>> > -       unsigned long nr_scanned;
>> >         int visit;
>> > +       struct memcg_scanrecord rec;
>> >
>> >         excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
>> >
>> > @@ -1668,6 +1742,15 @@ static int mem_cgroup_hierarchical_recla
>> >         if (!check_soft && root_mem->memsw_is_minimum)
>> >                 noswap = true;
>> >
>> > +       if (shrink)
>> > +               rec.context = SCAN_BY_SHRINK;
>> > +       else if (check_soft)
>> > +               rec.context = SCAN_BY_SYSTEM;
>> > +       else
>> > +               rec.context = SCAN_BY_LIMIT;
>> > +
>> > +       rec.root = root_mem;
>> >
>>
>>
>>
>> > +
>> >  again:
>> >         if (!shrink) {
>> >                 visit = 0;
>> > @@ -1695,14 +1778,19 @@ again:
>> >                         css_put(&victim->css);
>> >                         continue;
>> >                 }
>> > +               rec.mem = victim;
>> > +               rec.nr_scanned = 0;
>> > +               rec.nr_freed = 0;
>> > +               rec.elappsed = 0;
>> >
>> elapsed?
>>
>> >                 /* we use swappiness of local cgroup */
>> >                 if (check_soft) {
>> >                         ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
>> > -                               noswap, zone, &nr_scanned);
>> > -                       *total_scanned += nr_scanned;
>> > +                               noswap, zone, &rec);
>> > +                       *total_scanned += rec.nr_scanned;
>> >                 } else
>> >                         ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
>> > -                                                               noswap);
>> > +                                                               noswap, &rec);
>> > +               mem_cgroup_record_scanstat(&rec);
>> >                 css_put(&victim->css);
>> >
>> >                 total += ret;
>> > @@ -3757,7 +3845,8 @@ try_to_free:
>> >                         ret = -EINTR;
>> >                         goto out;
>> >                 }
>> > -               progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL, false);
>> > +               progress = try_to_free_mem_cgroup_pages(mem,
>> > +                               GFP_KERNEL, false, NULL);
>> >
>>
>> So we don't record the stat for the force_empty case?
>>
>
> Yes, for now. force_empty is used only for rmdir(). I don't think a log is
> necessary for a disappearing cgroup.

We might need to add that case later, since force_empty can also be used
to apply external pressure to the memcg.
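Just to illustrate, something along these lines could be wired into the
try_to_free loop of mem_cgroup_force_empty() later. This is only a sketch:
SCAN_BY_FORCE_EMPTY is made up here for the example and is not part of the
posted patch.

                /* record force_empty reclaim like the other contexts */
                struct memcg_scanrecord rec = {
                        .mem = mem,
                        .root = mem,
                        .context = SCAN_BY_FORCE_EMPTY,
                };

                progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
                                                        false, &rec);
                mem_cgroup_record_scanstat(&rec);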
>
>
>>
>> >                 if (!progress) {
>> >                         nr_retries--;
>> >                         /* maybe some writeback is necessary */
>> > @@ -4599,6 +4688,34 @@ static int mem_control_numa_stat_open(st
>> >  }
>> >  #endif /* CONFIG_NUMA */
>> >
>> > +static int mem_cgroup_scan_stat_read(struct cgroup *cgrp,
>> > +                               struct cftype *cft,
>> > +                               struct cgroup_map_cb *cb)
>> > +{
>> > +       struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
>> > +       int i;
>> > +
>> > +       for (i = 0; i < NR_SCANSTATS; i++)
>> > +               cb->fill(cb, scanstat_string[i], mem->scanstat.stats[i]);
>> > +       for (i = 0; i < NR_SCANSTATS; i++)
>> > +               cb->fill(cb, total_scanstat_string[i],
>> > +                       mem->scanstat.totalstats[i]);
>> > +       return 0;
>> > +}
>> > +
>> > +static int mem_cgroup_reset_scan_stat(struct cgroup *cgrp,
>> > +                               unsigned int event)
>> > +{
>> > +       struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
>> > +
>> > +       spin_lock(&mem->scanstat.lock);
>> > +       memset(&mem->scanstat.stats, 0, sizeof(mem->scanstat.stats));
>> > +       memset(&mem->scanstat.totalstats, 0, sizeof(mem->scanstat.totalstats));
>> > +       spin_unlock(&mem->scanstat.lock);
>> > +       return 0;
>> > +}
>> > +
>> > +
>> >  static struct cftype mem_cgroup_files[] = {
>> >         {
>> >                 .name = "usage_in_bytes",
>> > @@ -4669,6 +4786,11 @@ static struct cftype mem_cgroup_files[]
>> >                 .mode = S_IRUGO,
>> >         },
>> >  #endif
>> > +       {
>> > +               .name = "scan_stat",
>> > +               .read_map = mem_cgroup_scan_stat_read,
>> > +               .trigger = mem_cgroup_reset_scan_stat,
>> > +       },
>> >  };
>> >
>> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
>> > @@ -4932,6 +5054,7 @@ mem_cgroup_create(struct cgroup_subsys *
>> >         atomic_set(&mem->refcnt, 1);
>> >         mem->move_charge_at_immigrate = 0;
>> >         mutex_init(&mem->thresholds_lock);
>> > +       spin_lock_init(&mem->scanstat.lock);
>> >         return &mem->css;
>> >  free_out:
>> >         __mem_cgroup_free(mem);
>> > Index: mmotm-0615/mm/vmscan.c
>> > ===================================================================
>> > --- mmotm-0615.orig/mm/vmscan.c
>> > +++ mmotm-0615/mm/vmscan.c
>> > @@ -2199,9 +2199,9 @@ unsigned long try_to_free_pages(struct z
>> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>> >
>> >  unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
>> > -                                               gfp_t gfp_mask, bool noswap,
>> > -                                               struct zone *zone,
>> > -                                               unsigned long *nr_scanned)
>> > +                                       gfp_t gfp_mask, bool noswap,
>> > +                                       struct zone *zone,
>> > +                                       struct memcg_scanrecord *rec)
>> >  {
>> >         struct scan_control sc = {
>> >                 .nr_scanned = 0,
>> > @@ -2213,6 +2213,7 @@ unsigned long mem_cgroup_shrink_node_zon
>> >                 .order = 0,
>> >                 .mem_cgroup = mem,
>> >         };
>> > +       unsigned long start, end;
>> >
>> >         sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
>> >                         (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
>> > @@ -2221,6 +2222,7 @@ unsigned long mem_cgroup_shrink_node_zon
>> >                                                       sc.may_writepage,
>> >                                                       sc.gfp_mask);
>> >
>> > +       start = sched_clock();
>> >         /*
>> >          * NOTE: Although we can get the priority field, using it
>> >          * here is not a good idea, since it limits the pages we can scan.
>> > @@ -2229,19 +2231,27 @@ unsigned long mem_cgroup_shrink_node_zon
>> >          * the priority and make it zero.
>> >          */
>> >         shrink_zone(0, zone, &sc);
>> > +       end = sched_clock();
>> > +
>> > +       if (rec) {
>> > +               rec->nr_scanned += sc.nr_scanned;
>> >
>>
>>
>>
>> > +               rec->nr_freed += sc.nr_reclaimed;
>> > +               rec->elappsed += end - start;
>> >
>> elapsed?
>>
>> > +       }
>> >
>> >         trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
>> >
>> > -       *nr_scanned = sc.nr_scanned;
>> >         return sc.nr_reclaimed;
>> >  }
>> >
>> >  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>> >                                            gfp_t gfp_mask,
>> > -                                          bool noswap)
>> > +                                          bool noswap,
>> > +                                          struct memcg_scanrecord *rec)
>> >  {
>> >         struct zonelist *zonelist;
>> >         unsigned long nr_reclaimed;
>> > +       unsigned long start, end;
>> >         int nid;
>> >         struct scan_control sc = {
>> >                 .may_writepage = !laptop_mode,
>> > @@ -2259,6 +2269,7 @@ unsigned long try_to_free_mem_cgroup_pag
>> >                 .gfp_mask = sc.gfp_mask,
>> >         };
>> >
>> > +       start = sched_clock();
>> >         /*
>> >          * Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't
>> >          * take care of from where we get pages. So the node where we start the
>> > @@ -2273,6 +2284,12 @@ unsigned long try_to_free_mem_cgroup_pag
>> >                                            sc.gfp_mask);
>> >
>> >         nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink);
>> > +       end = sched_clock();
>> > +       if (rec) {
>> > +               rec->nr_scanned += sc.nr_scanned;
>> >
>>
>> sc.nr_scanned only contains the nr_scanned of the last priority pass in
>> do_try_to_free_pages(). We need to set it to total_scanned before
>> returning. I am looking at v3.0-rc3.
>>
>
> Hm, OK. Then total_reclaimed in the softlimit path is buggy, too.
> I'll check.
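One way to address it (just a sketch against v3.0-rc3, there may be a
better place for this): do_try_to_free_pages() already accumulates
total_scanned across the priority loop for its writeback heuristic, so it
could publish that cumulative count through scan_control before returning,
and callers such as try_to_free_mem_cgroup_pages() would then see the total
rather than only the last priority's value.

        /* at the end of do_try_to_free_pages(), before returning */
        sc->nr_scanned = total_scanned;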

Thanks

--Ying
>
> Thanks,
> -Kame
>
>

2011-06-27 01:57:27

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 3/7] memcg: add memory.scan_stat

On Fri, 24 Jun 2011 14:40:42 -0700
Ying Han <[email protected]> wrote:

> On Tue, Jun 21, 2011 at 5:20 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > On Mon, 20 Jun 2011 23:49:54 -0700
> > Ying Han <[email protected]> wrote:
> >
> >> On Wed, Jun 15, 2011 at 8:53 PM, KAMEZAWA Hiroyuki <
> >> [email protected]> wrote:
> >>
> >> > From e08990dd9ada13cf236bec1ef44b207436434b8e Mon Sep 17 00:00:00 2001
> >> > From: KAMEZAWA Hiroyuki <[email protected]>
> >> > Date: Wed, 15 Jun 2011 14:11:01 +0900
> >> > Subject: [PATCH 3/7] memcg: add memory.scan_stat

> >> > +
> >> > +struct scanstat {
> >> > +       spinlock_t      lock;
> >> > +       unsigned long   stats[NR_SCANSTATS];    /* local statistics */
> >> > +       unsigned long   totalstats[NR_SCANSTATS];   /* hierarchical */
> >> > +};
>
> I wonder why we don't extend the mem_cgroup_stat_cpu struct; then we
> could use per-cpu counters like the other statistics.
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b7d2d79..5b8bbe9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -138,6 +138,7 @@ struct mem_cgroup_stat_cpu {
>         long count[MEM_CGROUP_STAT_NSTATS];
>         unsigned long events[MEM_CGROUP_EVENTS_NSTATS];
>         unsigned long targets[MEM_CGROUP_NTARGETS];
> +       unsigned long reclaim_stats[MEMCG_RECLAIM_NSTATS];
>  };
>
>

Hmm, is there enough benefit to justify consuming 72 bytes per cpu and
making the read side slower for these rarely updated counters?
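(For reference, the 72 bytes are just the nine counters of the patch --
scanned/freed/elapsed for the limit, system and shrink contexts -- assuming
unsigned long counters on a 64-bit kernel:

        /* 3 contexts * 3 counters * sizeof(unsigned long) = 9 * 8 = 72 bytes per cpu */

so every additional context would add another 24 bytes per cpu.)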

Thanks,
-Kame