2011-05-26 05:18:13

by Kamezawa Hiroyuki

Subject: [RFC][PATCH v3 0/10] memcg async reclaim


It's the merge window now... I'm just dumping my patch queue to hear others' ideas.
I wonder whether I should wait until dirty_ratio for memcg is queued into mmotm...
I'll be busy with LinuxCon Japan etc. next week.

This series is on top of mmotm-May-11 plus some patches queued for mmotm, such as numa_stat.

This is a patch set for memcg to keep a margin to the limit in the background.
By keeping some margin to the limit in the background, applications can
avoid foreground memory reclaim at charge(), and this helps latency.

Main changes from v2 are:
- use SCHED_IDLE.
- removed most of the heuristic code. Now the code is very simple.

By using SCHED_IDLE, async memory reclaim consumes only ~0.3%(?) of cpu
if the system is truly busy, but can use much CPU when the cpu is idle.
Because my purpose is to reduce latency without affecting other running
applications, SCHED_IDLE fits this work.

If the application needs to stop for some I/O or event, background memory
reclaim will cull memory while the system is idle.

Performance:
Running an httpd (apache) under a 300M limit, and accessing a 600MB working
set with a normal-distribution access pattern via apache-bench (ab).
apache-bench's concurrency was 4 and it did 40960 accesses.
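
(For reference, the run was driven roughly like the following; the URL and
the cgroup path are illustrative, not the exact ones used. The async_control
knob is the one added in patch 5.

  # ab -c 4 -n 40960 http://localhost/somefile
  # echo 1 > /cgroup/memory/httpd/memory.async_control   <- only for the async run
)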

Without async reclaim:
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 2
Processing: 30 37 28.3 32 1793
Waiting: 28 35 25.5 31 1792
Total: 30 37 28.4 32 1793

Percentage of the requests served within a certain time (ms)
50% 32
66% 32
75% 33
80% 34
90% 39
95% 60
98% 100
99% 133
100% 1793 (longest request)

With async reclaim:
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 2
Processing: 30 35 12.3 32 678
Waiting: 28 34 12.0 31 658
Total: 30 35 12.3 32 678

Percentage of the requests served within a certain time (ms)
50% 32
66% 32
75% 33
80% 34
90% 39
95% 49
98% 71
99% 86
100% 678 (longest request)


It seems the latency is stabilized by hiding memory reclaim.

The scores for memory reclaim were as follows.
See patch 10 for the meaning of each member.

== without async reclaim ==
recent_scan_success_ratio 44
limit_scan_pages 388463
limit_freed_pages 162238
limit_elapsed_ns 13852159231
soft_scan_pages 0
soft_freed_pages 0
soft_elapsed_ns 0
margin_scan_pages 0
margin_freed_pages 0
margin_elapsed_ns 0

== with async reclaim ==
recent_scan_success_ratio 6
limit_scan_pages 0
limit_freed_pages 0
limit_elapsed_ns 0
soft_scan_pages 0
soft_freed_pages 0
soft_elapsed_ns 0
margin_scan_pages 1295556
margin_freed_pages 122450
margin_elapsed_ns 644881521


In this case, the SCHED_IDLE workqueue can reclaim enough memory for the httpd.

I may need to dig into why scan_success_ratio is so different in the two cases.
I guess the difference in elapsed_ns is because several threads enter
memory reclaim when async reclaim doesn't run. But maybe not...



Thanks,
-Kame




2011-05-26 05:22:23

by Kamezawa Hiroyuki

Subject: [RFC][PATCH v3 1/10] check reclaimable in hierarchy walk


I may post this patch stand-alone, later.
==
Check whether a memcg has reclaimable pages at select_victim().

Now, with the help of the memcg->scan_nodes bitmap, we can check whether a
memcg has reclaimable pages with an easy test of nodes_empty(&mem->scan_nodes).

mem->scan_nodes is a bitmap showing which nodes hold reclaimable memory for
the memcg; it is updated periodically.

This patch makes use of scan_nodes and modifies the hierarchy walk at memory
shrinking in the following way:

- check scan_nodes in mem_cgroup_select_victim()
- mem_cgroup_select_victim() returns NULL if no memcg is reclaimable.
- force update of scan_nodes.
- rename mem_cgroup_select_victim() to mem_cgroup_select_get_victim()
 to show that the refcnt is incremented.

This makes the hierarchy walk better.

And this allows removing the mem_cgroup_local_usage() check which was used
for the same purpose. That check was wrong because it cannot handle
unevictable pages or tmpfs-vs-swapless information.
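
As a rough sketch (illustrative only, not part of the patch), the calling
pattern the new helper is designed for looks like this:

	struct mem_cgroup *victim;

	for (;;) {
		victim = mem_cgroup_select_get_victim(root_mem);
		if (!victim)
			break;	/* no memcg in the hierarchy is reclaimable */
		/* ... shrink some pages from victim ... */
		css_put(&victim->css);	/* drop the reference taken by the helper */
		if (mem_cgroup_margin(root_mem))
			break;	/* enough room below the limit again */
	}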

Changelog:
- added in v3.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 165 +++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 110 insertions(+), 55 deletions(-)

Index: memcg_async/mm/memcontrol.c
===================================================================
--- memcg_async.orig/mm/memcontrol.c
+++ memcg_async/mm/memcontrol.c
@@ -584,15 +584,6 @@ static long mem_cgroup_read_stat(struct
return val;
}

-static long mem_cgroup_local_usage(struct mem_cgroup *mem)
-{
- long ret;
-
- ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
- ret += mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);
- return ret;
-}
-
static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
bool charge)
{
@@ -1555,43 +1546,6 @@ u64 mem_cgroup_get_limit(struct mem_cgro
return min(limit, memsw);
}

-/*
- * Visit the first child (need not be the first child as per the ordering
- * of the cgroup list, since we track last_scanned_child) of @mem and use
- * that to reclaim free pages from.
- */
-static struct mem_cgroup *
-mem_cgroup_select_victim(struct mem_cgroup *root_mem)
-{
- struct mem_cgroup *ret = NULL;
- struct cgroup_subsys_state *css;
- int nextid, found;
-
- if (!root_mem->use_hierarchy) {
- css_get(&root_mem->css);
- ret = root_mem;
- }
-
- while (!ret) {
- rcu_read_lock();
- nextid = root_mem->last_scanned_child + 1;
- css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css,
- &found);
- if (css && css_tryget(css))
- ret = container_of(css, struct mem_cgroup, css);
-
- rcu_read_unlock();
- /* Updates scanning parameter */
- if (!css) {
- /* this means start scan from ID:1 */
- root_mem->last_scanned_child = 0;
- } else
- root_mem->last_scanned_child = found;
- }
-
- return ret;
-}
-
#if MAX_NUMNODES > 1

/*
@@ -1600,11 +1554,11 @@ mem_cgroup_select_victim(struct mem_cgro
* nodes based on the zonelist. So update the list loosely once per 10 secs.
*
*/
-static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem)
+static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem, bool force)
{
int nid;

- if (time_after(mem->next_scan_node_update, jiffies))
+ if (!force && time_after(mem->next_scan_node_update, jiffies))
return;

mem->next_scan_node_update = jiffies + 10*HZ;
@@ -1641,7 +1595,7 @@ int mem_cgroup_select_victim_node(struct
{
int node;

- mem_cgroup_may_update_nodemask(mem);
+ mem_cgroup_may_update_nodemask(mem, false);
node = mem->last_scanned_node;

node = next_node(node, mem->scan_nodes);
@@ -1660,13 +1614,117 @@ int mem_cgroup_select_victim_node(struct
return node;
}

+/**
+ * mem_cgroup_has_reclaimable
+ * @memcg: the memcg
+ *
+ * The caller can test whether the memcg has reclaimable pages.
+ *
+ * This function checks whether the memcg has reclaimable pages using the
+ * memcg->scan_nodes bitmap. This bitmap is updated periodically and
+ * indicates which nodes hold reclaimable memory for the memcg.
+ * Although this is a rough test and the result is not very precise, we
+ * don't have to scan all nodes and don't have to take locks.
+ *
+ * For non-NUMA, this checks reclaimable pages on zones because we don't
+ * update scan_nodes. (see below)
+ */
+static bool mem_cgroup_has_reclaimable(struct mem_cgroup *memcg)
+{
+ return !nodes_empty(memcg->scan_nodes);
+}
+
#else
+
+static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem, bool force)
+{
+}
+
int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
{
return 0;
}
+
+static bool mem_cgroup_has_reclaimable(struct mem_cgroup *memcg)
+{
+ unsigned long nr;
+ int zid;
+
+ for (zid = NODE_DATA(0)->nr_zones - 1; zid >= 0; zid--)
+ if (mem_cgroup_zone_reclaimable_pages(memcg, 0, zid))
+ break;
+ if (zid < 0)
+ return false;
+ return true;
+}
#endif

+/**
+ * mem_cgroup_select_get_victim
+ * @root_mem: the root memcg of the hierarchy which should be shrunk.
+ *
+ * Visit the children of root_mem one by one. If the routine finds a memcg
+ * which contains reclaimable pages, it returns it with refcnt +1. The
+ * scan is done round-robin and 'the next start point' is saved into
+ * root_mem->last_scanned_child. If no reclaimable memcg is found, returns NULL.
+ */
+static struct mem_cgroup *
+mem_cgroup_select_get_victim(struct mem_cgroup *root_mem)
+{
+ struct mem_cgroup *ret = NULL;
+ struct cgroup_subsys_state *css;
+ int nextid, found;
+ bool second_visit = false;
+
+ if (!root_mem->use_hierarchy)
+ goto return_root;
+
+ while (!ret) {
+ rcu_read_lock();
+ nextid = root_mem->last_scanned_child + 1;
+ css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css,
+ &found);
+ if (css && css_tryget(css))
+ ret = container_of(css, struct mem_cgroup, css);
+
+ rcu_read_unlock();
+ /* Updates scanning parameter */
+ if (!css) { /* Indicates we scanned the last node of tree */
+ /*
+ * If no memcg has reclaimable pages, we may enter
+ * an infinite loop. Exit here if we have reached the
+ * end of the hierarchy tree twice.
+ */
+ if (second_visit)
+ return NULL;
+ /* this means start scan from ID:1 */
+ root_mem->last_scanned_child = 0;
+ second_visit = true;
+ } else
+ root_mem->last_scanned_child = found;
+ if (css && ret) {
+ /*
+ * Check whether the memcg has reclaimable memory. Force an
+ * update of the information in case we are failing due to
+ * stale cached bitmask information.
+ */
+ if (second_visit)
+ mem_cgroup_may_update_nodemask(ret, true);
+
+ if (!mem_cgroup_has_reclaimable(ret)) {
+ css_put(css);
+ ret = NULL;
+ }
+ }
+ }
+
+ return ret;
+return_root:
+ css_get(&root_mem->css);
+ return root_mem;
+}
+
+
/*
* Scan the hierarchy if needed to reclaim memory. We remember the last child
* we reclaimed from, so that we don't end up penalizing one child extensively
@@ -1705,7 +1763,9 @@ static int mem_cgroup_hierarchical_recla
is_kswapd = true;

while (1) {
- victim = mem_cgroup_select_victim(root_mem);
+ victim = mem_cgroup_select_get_victim(root_mem);
+ if (!victim)
+ return total;
if (victim == root_mem) {
loop++;
if (loop >= 1)
@@ -1733,11 +1793,6 @@ static int mem_cgroup_hierarchical_recla
}
}
}
- if (!mem_cgroup_local_usage(victim)) {
- /* this cgroup's local usage == 0 */
- css_put(&victim->css);
- continue;
- }
/* we use swappiness of local cgroup */
if (check_soft) {
ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,

2011-05-26 05:24:55

by Kamezawa Hiroyuki

Subject: [RFC][PATCH v3 2/10] memcg: fix cached charge drain ratio


IIUC, this is a bugfix.
=
Memory cgroup caches charges per cpu to avoid heavy access to the
res_counter. At memory reclaim, these caches are drained in an
asynchronous way.

On an SMP system, if a memcg hits its limit heavily, this draining is
called too frequently and you'll see tons of kworkers...
Reduce it.

By this patch,
- drain_all_stock_async is called only after the 1st trial of reclaim fails.
- drain_all_stock_async checks whether the "cached" information is related
 to the memory reclaim target.
- drain_all_stock_async checks a flag per cpu to decide whether to drain.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 42 +++++++++++++++++++++++-------------------
1 file changed, 23 insertions(+), 19 deletions(-)

Index: memcg_async/mm/memcontrol.c
===================================================================
--- memcg_async.orig/mm/memcontrol.c
+++ memcg_async/mm/memcontrol.c
@@ -367,7 +367,7 @@ enum charge_type {
static void mem_cgroup_get(struct mem_cgroup *mem);
static void mem_cgroup_put(struct mem_cgroup *mem);
static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
-static void drain_all_stock_async(void);
+static void drain_all_stock_async(struct mem_cgroup *mem);

static struct mem_cgroup_per_zone *
mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
@@ -1768,8 +1768,6 @@ static int mem_cgroup_hierarchical_recla
return total;
if (victim == root_mem) {
loop++;
- if (loop >= 1)
- drain_all_stock_async();
if (loop >= 2) {
/*
* If we have not been able to reclaim
@@ -1818,6 +1816,8 @@ static int mem_cgroup_hierarchical_recla
return total;
} else if (mem_cgroup_margin(root_mem))
return total;
+ /* we failed with the first memcg, drain cached ones. */
+ drain_all_stock_async(root_mem);
}
return total;
}
@@ -2029,9 +2029,10 @@ struct memcg_stock_pcp {
struct mem_cgroup *cached; /* this never be root cgroup */
unsigned int nr_pages;
struct work_struct work;
+ unsigned long flags;
+#define STOCK_FLUSHING (0)
};
static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
-static atomic_t memcg_drain_count;

/*
* Try to consume stocked charge on this cpu. If success, one page is consumed
@@ -2078,7 +2079,9 @@ static void drain_stock(struct memcg_sto
static void drain_local_stock(struct work_struct *dummy)
{
struct memcg_stock_pcp *stock = &__get_cpu_var(memcg_stock);
+
drain_stock(stock);
+ clear_bit(STOCK_FLUSHING, &stock->flags);
}

/*
@@ -2103,36 +2106,37 @@ static void refill_stock(struct mem_cgro
* expects some charges will be back to res_counter later but cannot wait for
* it.
*/
-static void drain_all_stock_async(void)
+static void drain_all_stock_async(struct mem_cgroup *root_mem)
{
int cpu;
- /* This function is for scheduling "drain" in asynchronous way.
- * The result of "drain" is not directly handled by callers. Then,
- * if someone is calling drain, we don't have to call drain more.
- * Anyway, WORK_STRUCT_PENDING check in queue_work_on() will catch if
- * there is a race. We just do loose check here.
- */
- if (atomic_read(&memcg_drain_count))
- return;
/* Notify other cpus that system-wide "drain" is running */
- atomic_inc(&memcg_drain_count);
get_online_cpus();
for_each_online_cpu(cpu) {
struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
- schedule_work_on(cpu, &stock->work);
+ struct mem_cgroup *mem;
+
+ rcu_read_lock();
+ mem = stock->cached;
+ if (!mem) {
+ rcu_read_unlock();
+ continue;
+ }
+ if ((mem == root_mem ||
+ css_is_ancestor(&mem->css, &root_mem->css))) {
+ rcu_read_unlock();
+ if (!test_and_set_bit(STOCK_FLUSHING, &stock->flags))
+ schedule_work_on(cpu, &stock->work);
+ } else
+ rcu_read_unlock();
}
put_online_cpus();
- atomic_dec(&memcg_drain_count);
- /* We don't wait for flush_work */
}

/* This is a synchronous drain interface. */
static void drain_all_stock_sync(void)
{
/* called when force_empty is called */
- atomic_inc(&memcg_drain_count);
schedule_on_each_cpu(drain_local_stock);
- atomic_dec(&memcg_drain_count);
}

/*

2011-05-26 05:26:03

by Kamezawa Hiroyuki

Subject: [RFC][PATCH v3 3/10] memcg: a test whether zone is reclaimable or not

From: Ying Han <[email protected]>

The number of reclaimable pages per zone is useful information for
controlling the memory reclaim schedule. This patch exports it.

Changelog v2->v3:
- added comments.

Signed-off-by: Ying Han <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/memcontrol.h | 2 ++
mm/memcontrol.c | 24 ++++++++++++++++++++++++
2 files changed, 26 insertions(+)

Index: memcg_async/mm/memcontrol.c
===================================================================
--- memcg_async.orig/mm/memcontrol.c
+++ memcg_async/mm/memcontrol.c
@@ -1240,6 +1240,30 @@ static unsigned long mem_cgroup_nr_lru_p
}
#endif /* CONFIG_NUMA */

+/**
+ * mem_cgroup_zone_reclaimable_pages
+ * @memcg: the memcg
+ * @nid : node index to be checked.
+ * @zid : zone index to be checked.
+ *
+ * This function returns the number of reclaimable pages on a zone for the
+ * given memcg. Reclaimable pages include file caches, and anonymous pages
+ * if swap is available; unevictable pages are never included.
+ */
+unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
+ int nid, int zid)
+{
+ unsigned long nr;
+ struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+
+ nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
+ MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_FILE);
+ if (nr_swap_pages > 0)
+ nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
+ MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);
+ return nr;
+}
+
struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
struct zone *zone)
{
Index: memcg_async/include/linux/memcontrol.h
===================================================================
--- memcg_async.orig/include/linux/memcontrol.h
+++ memcg_async/include/linux/memcontrol.h
@@ -109,6 +109,8 @@ extern void mem_cgroup_end_migration(str
*/
int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
+unsigned long
+mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg, int nid, int zid);
int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
unsigned long mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg,
struct zone *zone,

2011-05-26 05:27:01

by Kamezawa Hiroyuki

Subject: [RFC][PATCH v3 4/10] memcg: export swappiness

From: Ying Han <[email protected]>
Change mem_cgroup's swappiness interface.

Now, memcg's swappiness interface is defined as 'static' and the
value is passed as an argument to try_to_free_xxxx...

This patch adds a function mem_cgroup_swappiness(), exports it, and
reduces the number of arguments. This interface will be used by async
reclaim later.

I think a function is better than passing arguments because it makes
clearer where scan_control's swappiness comes from.

Changelog: v2->v3
- added comments.

Signed-off-by: Ying Han <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/memcontrol.h | 2 ++
include/linux/swap.h | 4 +---
mm/memcontrol.c | 21 +++++++++++++--------
mm/vmscan.c | 9 ++++-----
4 files changed, 20 insertions(+), 16 deletions(-)

Index: memcg_async/mm/memcontrol.c
===================================================================
--- memcg_async.orig/mm/memcontrol.c
+++ memcg_async/mm/memcontrol.c
@@ -1373,7 +1373,14 @@ static unsigned long mem_cgroup_margin(s
return margin >> PAGE_SHIFT;
}

-static unsigned int get_swappiness(struct mem_cgroup *memcg)
+/**
+ * mem_cgroup_swappiness
+ * @memcg: the memcg
+ *
+ * Returns the user-defined swappiness of the memory cgroup. The root
+ * cgroup always uses the system value.
+ */
+unsigned int mem_cgroup_swappiness(struct mem_cgroup *memcg)
{
struct cgroup *cgrp = memcg->css.cgroup;

@@ -1818,14 +1825,13 @@ static int mem_cgroup_hierarchical_recla
/* we use swappiness of local cgroup */
if (check_soft) {
ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
- noswap, get_swappiness(victim), zone,
- &nr_scanned);
+ noswap, zone, &nr_scanned);
*total_scanned += nr_scanned;
mem_cgroup_soft_steal(victim, is_kswapd, ret);
mem_cgroup_soft_scan(victim, is_kswapd, nr_scanned);
} else
ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
- noswap, get_swappiness(victim));
+ noswap);
css_put(&victim->css);
/*
* At shrinking usage, we can't check we should stop here or
@@ -3854,8 +3860,7 @@ try_to_free:
ret = -EINTR;
goto out;
}
- progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
- false, get_swappiness(mem));
+ progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL, false);
if (!progress) {
nr_retries--;
/* maybe some writeback is necessary */
@@ -4333,7 +4338,7 @@ static u64 mem_cgroup_swappiness_read(st
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);

- return get_swappiness(memcg);
+ return mem_cgroup_swappiness(memcg);
}

static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
@@ -5041,7 +5046,7 @@ mem_cgroup_create(struct cgroup_subsys *
INIT_LIST_HEAD(&mem->oom_notify);

if (parent)
- mem->swappiness = get_swappiness(parent);
+ mem->swappiness = mem_cgroup_swappiness(parent);
atomic_set(&mem->refcnt, 1);
mem->move_charge_at_immigrate = 0;
mutex_init(&mem->thresholds_lock);
Index: memcg_async/include/linux/swap.h
===================================================================
--- memcg_async.orig/include/linux/swap.h
+++ memcg_async/include/linux/swap.h
@@ -252,11 +252,9 @@ static inline void lru_cache_add_file(st
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
- gfp_t gfp_mask, bool noswap,
- unsigned int swappiness);
+ gfp_t gfp_mask, bool noswap);
extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
- unsigned int swappiness,
struct zone *zone,
unsigned long *nr_scanned);
extern int __isolate_lru_page(struct page *page, int mode, int file);
Index: memcg_async/mm/vmscan.c
===================================================================
--- memcg_async.orig/mm/vmscan.c
+++ memcg_async/mm/vmscan.c
@@ -2182,7 +2182,6 @@ unsigned long try_to_free_pages(struct z

unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
- unsigned int swappiness,
struct zone *zone,
unsigned long *nr_scanned)
{
@@ -2192,7 +2191,6 @@ unsigned long mem_cgroup_shrink_node_zon
.may_writepage = !laptop_mode,
.may_unmap = 1,
.may_swap = !noswap,
- .swappiness = swappiness,
.order = 0,
.mem_cgroup = mem,
};
@@ -2200,6 +2198,8 @@ unsigned long mem_cgroup_shrink_node_zon
sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);

+ sc.swappiness = mem_cgroup_swappiness(mem);
+
trace_mm_vmscan_memcg_softlimit_reclaim_begin(0,
sc.may_writepage,
sc.gfp_mask);
@@ -2221,8 +2221,7 @@ unsigned long mem_cgroup_shrink_node_zon

unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
gfp_t gfp_mask,
- bool noswap,
- unsigned int swappiness)
+ bool noswap)
{
struct zonelist *zonelist;
unsigned long nr_reclaimed;
@@ -2232,7 +2231,6 @@ unsigned long try_to_free_mem_cgroup_pag
.may_unmap = 1,
.may_swap = !noswap,
.nr_to_reclaim = SWAP_CLUSTER_MAX,
- .swappiness = swappiness,
.order = 0,
.mem_cgroup = mem_cont,
.nodemask = NULL, /* we don't care the placement */
@@ -2249,6 +2247,7 @@ unsigned long try_to_free_mem_cgroup_pag
* scan does not need to be the current node.
*/
nid = mem_cgroup_select_victim_node(mem_cont);
+ sc.swappiness = mem_cgroup_swappiness(mem_cont);

zonelist = NODE_DATA(nid)->node_zonelists;

Index: memcg_async/include/linux/memcontrol.h
===================================================================
--- memcg_async.orig/include/linux/memcontrol.h
+++ memcg_async/include/linux/memcontrol.h
@@ -112,6 +112,7 @@ int mem_cgroup_inactive_file_is_low(stru
unsigned long
mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg, int nid, int zid);
int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
+unsigned int mem_cgroup_swappiness(struct mem_cgroup *memcg);
unsigned long mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg,
struct zone *zone,
enum lru_list lru);
@@ -121,6 +122,7 @@ struct zone_reclaim_stat*
mem_cgroup_get_reclaim_stat_from_page(struct page *page);
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);
+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
#endif

2011-05-26 05:30:13

by Kamezawa Hiroyuki

Subject: [RFC][PATCH v3 5/10] memcg keep margin to limit in background


Some magic numbers still exist... but...
==
When memcg is used, applications can see memory reclaim latency caused
by the memcg limit. In general it's unavoidable and means the user's
limit setting is wrong (too tight).

There is a class of applications which use a lot of clean file cache and
do interactive jobs. For such applications, if the kernel helps memory
reclaim in the background, the application latency can be hidden to some
extent. (It depends on how the applications sleep.)

This patch adds a control knob to enable/disable kernel help for keeping
a margin to the limit in the background.

If a user writes
# echo 1 > /memory.async_control

the memcg tries to keep some free space (called the margin) below the limit
in the background. The size of the margin is calculated dynamically; the
value is chosen to reduce the chance that applications hit the limit.
(For now, it just does a random walk.)
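
To make the behaviour concrete with the constants used below (an
illustrative worked example, not measured data): with a 300M limit the
margin is capped at min(5% of the limit, 64M) = 15M. The margin starts at
4M; each time a charge hits the limit it is doubled (4M -> 8M -> 16M,
clamped to 15M), and while nobody hits the limit it shrinks back in 256K
steps per check.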


Changelog v2 -> v3:
- totally reworked.
- calculate margin to the limit in dynamic way.
- divided user interface and internal flags.
- added comments.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Documentation/cgroups/memory.txt | 44 ++++++++
mm/memcontrol.c | 192 ++++++++++++++++++++++++++++++++++++++-
2 files changed, 234 insertions(+), 2 deletions(-)

Index: memcg_async/mm/memcontrol.c
===================================================================
--- memcg_async.orig/mm/memcontrol.c
+++ memcg_async/mm/memcontrol.c
@@ -115,10 +115,12 @@ enum mem_cgroup_events_index {
enum mem_cgroup_events_target {
MEM_CGROUP_TARGET_THRESH,
MEM_CGROUP_TARGET_SOFTLIMIT,
+ MEM_CGROUP_TARGET_KEEP_MARGIN,
MEM_CGROUP_NTARGETS,
};
#define THRESHOLDS_EVENTS_TARGET (128)
#define SOFTLIMIT_EVENTS_TARGET (1024)
+#define KEEP_MARGIN_EVENTS_TARGET (512)

struct mem_cgroup_stat_cpu {
long count[MEM_CGROUP_STAT_NSTATS];
@@ -210,6 +212,10 @@ struct mem_cgroup_eventfd_list {
static void mem_cgroup_threshold(struct mem_cgroup *mem);
static void mem_cgroup_oom_notify(struct mem_cgroup *mem);

+static void mem_cgroup_reset_margin_to_limit(struct mem_cgroup *mem);
+static void mem_cgroup_update_margin_to_limit(struct mem_cgroup *mem);
+static void mem_cgroup_may_async_reclaim(struct mem_cgroup *mem);
+
/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
@@ -278,6 +284,15 @@ struct mem_cgroup {
*/
unsigned long move_charge_at_immigrate;
/*
+ * Checks for async reclaim.
+ */
+ unsigned long margin_to_limit_pages; /* margin to limit */
+ spinlock_t update_margin_lock;
+ unsigned long async_flags;
+#define AUTO_KEEP_MARGIN_ENABLED (0) /* user enabled async reclaim */
+#define FAILED_TO_KEEP_MARGIN (1) /* someone hit limit */
+
+ /*
* percpu counter.
*/
struct mem_cgroup_stat_cpu *stat;
@@ -713,6 +728,9 @@ static void __mem_cgroup_target_update(s
case MEM_CGROUP_TARGET_SOFTLIMIT:
next = val + SOFTLIMIT_EVENTS_TARGET;
break;
+ case MEM_CGROUP_TARGET_KEEP_MARGIN:
+ next = val + KEEP_MARGIN_EVENTS_TARGET;
+ break;
default:
return;
}
@@ -736,6 +754,12 @@ static void memcg_check_events(struct me
__mem_cgroup_target_update(mem,
MEM_CGROUP_TARGET_SOFTLIMIT);
}
+ /* update margin-to-limit and run async reclaim if necessary */
+ if (__memcg_event_check(mem, MEM_CGROUP_TARGET_KEEP_MARGIN)) {
+ mem_cgroup_may_async_reclaim(mem);
+ __mem_cgroup_target_update(mem,
+ MEM_CGROUP_TARGET_KEEP_MARGIN);
+ }
}
}

@@ -2267,8 +2291,10 @@ static int mem_cgroup_do_charge(struct m
* of regular pages (CHARGE_BATCH), or a single regular page (1).
*
* Never reclaim on behalf of optional batching, retry with a
- * single page instead.
+ * single page instead. But mark that we hit the limit and give a hint
+ * to auto_keep_margin.
*/
+ set_bit(FAILED_TO_KEEP_MARGIN, &mem->async_flags);
if (nr_pages == CHARGE_BATCH)
return CHARGE_RETRY;

@@ -3552,6 +3578,7 @@ static int mem_cgroup_resize_limit(struc
memcg->memsw_is_minimum = true;
else
memcg->memsw_is_minimum = false;
+ mem_cgroup_reset_margin_to_limit(memcg);
}
mutex_unlock(&set_limit_mutex);

@@ -3729,6 +3756,131 @@ unsigned long mem_cgroup_soft_limit_recl
return nr_reclaimed;
}

+
+/*
+ * Auto-keep-margin and Dynamic auto margin calculation.
+ *
+ * When an application hits a memcg's limit, it needs to scan the LRU and
+ * reclaim memory. This means extra latency is added by setting a memcg limit.
+ * For some classes of applications, kernel help, i.e. freeing pages in the
+ * background, works well and can reduce their latency and stabilize their work.
+ *
+ * The problem here is what amount of margin should be kept to keep applications
+ * from hitting the limit. In general, the margin to the limit should be as small
+ * as possible because the user wants to use memory up to the limit he defined.
+ * But a small margin is just a small help.
+ * Below is the code for calculating the margin to the limit in a dynamic way.
+ * The margin is determined by the size of the limit and the workload.
+ *
+ * Initially, the margin is set to MIN_MARGIN_TO_LIMIT and the kernel tries to
+ * keep that many free bytes. If someone hits the limit and failcnt increases,
+ * this margin is doubled. The kernel periodically checks the status. If it
+ * finds free space is enough, it decreases the margin by MARGIN_SHRINK_STEP.
+ * This means the margin window grows exponentially (to catch rapid workloads)
+ * but shrinks linearly.
+ *
+ * This feature is enabled only when AUTO_KEEP_MARGIN_ENABLED is set.
+ */
+#define MIN_MARGIN_TO_LIMIT ((4*1024*1024) >> PAGE_SHIFT)
+#define MAX_MARGIN_TO_LIMIT ((64*1024*1024) >> PAGE_SHIFT)
+#define MAX_MARGIN_LIMIT_RATIO (5) /* 5% of limit */
+#define MARGIN_SHRINK_STEP ((256 * 1024) >> PAGE_SHIFT)
+
+enum {
+ MARGIN_RESET, /* used when limit is set */
+ MARGIN_ENLARGE, /* called when margin seems not enough */
+ MARGIN_SHRINK, /* called when margin seems enough */
+};
+
+static void
+__mem_cgroup_update_limit_margin(struct mem_cgroup *mem, int action)
+{
+ u64 max_margin, limit;
+
+ /*
+ * Note: this function is racy. But the race will be harmless.
+ */
+
+ limit = res_counter_read_u64(&mem->res, RES_LIMIT) >> PAGE_SHIFT;
+
+ max_margin = min(limit * MAX_MARGIN_LIMIT_RATIO/100,
+ (u64) MAX_MARGIN_TO_LIMIT);
+
+ switch (action) {
+ case MARGIN_RESET:
+ mem->margin_to_limit_pages = MIN_MARGIN_TO_LIMIT;
+ if (mem->margin_to_limit_pages > max_margin)
+ mem->margin_to_limit_pages = max_margin;
+ break;
+ case MARGIN_ENLARGE:
+ if (mem->margin_to_limit_pages < max_margin)
+ mem->margin_to_limit_pages *= 2;
+ if (mem->margin_to_limit_pages > max_margin)
+ mem->margin_to_limit_pages = max_margin;
+ break;
+ case MARGIN_SHRINK:
+ if (mem->margin_to_limit_pages > MIN_MARGIN_TO_LIMIT)
+ mem->margin_to_limit_pages -= MARGIN_SHRINK_STEP;
+ if (mem->margin_to_limit_pages < MIN_MARGIN_TO_LIMIT)
+ mem->margin_to_limit_pages = MIN_MARGIN_TO_LIMIT;
+ break;
+ }
+ return;
+}
+
+/*
+ * Called by percpu event counter.
+ */
+static void mem_cgroup_update_margin_to_limit(struct mem_cgroup *mem)
+{
+ if (!test_bit(AUTO_KEEP_MARGIN_ENABLED, &mem->async_flags))
+ return;
+ /* If someone does update, we don't need to update */
+ if (!spin_trylock(&mem->update_margin_lock))
+ return;
+ /*
+ * If someone hits the limit, enlarge the margin. If no one hits it
+ * and the free margin is above the minimum, shrink it.
+ */
+ if (test_and_clear_bit(FAILED_TO_KEEP_MARGIN, &mem->async_flags))
+ __mem_cgroup_update_limit_margin(mem, MARGIN_ENLARGE);
+ else if (mem_cgroup_margin(mem) > MIN_MARGIN_TO_LIMIT)
+ __mem_cgroup_update_limit_margin(mem, MARGIN_SHRINK);
+
+ spin_unlock(&mem->update_margin_lock);
+ return;
+}
+
+/*
+ * Called when the limit changes.
+ */
+static void mem_cgroup_reset_margin_to_limit(struct mem_cgroup *mem)
+{
+ spin_lock(&mem->update_margin_lock);
+ __mem_cgroup_update_limit_margin(mem, MARGIN_RESET);
+ spin_unlock(&mem->update_margin_lock);
+}
+
+/*
+ * Run an asynchronous memory reclaim on the memcg.
+ */
+static void mem_cgroup_schedule_async_reclaim(struct mem_cgroup *mem)
+{
+}
+
+/*
+ * Check the memcg's flags and, if the margin to the limit is smaller than
+ * margin_to_limit_pages, schedule asynchronous memory reclaim in the background.
+ */
+static void mem_cgroup_may_async_reclaim(struct mem_cgroup *mem)
+{
+ if (!test_bit(AUTO_KEEP_MARGIN_ENABLED, &mem->async_flags))
+ return;
+ mem_cgroup_update_margin_to_limit(mem);
+ if (mem_cgroup_margin(mem) < mem->margin_to_limit_pages)
+ mem_cgroup_schedule_async_reclaim(mem);
+}
+
/*
* This routine traverse page_cgroup in given list and drop them all.
* *And* this routine doesn't reclaim page itself, just removes page_cgroup.
@@ -4329,11 +4481,43 @@ static int mem_control_stat_show(struct
cb->fill(cb, "recent_scanned_anon", recent_scanned[0]);
cb->fill(cb, "recent_scanned_file", recent_scanned[1]);
}
+ cb->fill(cb, "margin_to_limit",
+ (u64)mem_cont->margin_to_limit_pages << PAGE_SHIFT);
#endif

return 0;
}

+/*
+ * User flags for async_control are a subset of mem->async_flags. But
+ * they need to be defined independently to hide implementation details.
+ */
+#define USER_AUTO_KEEP_MARGIN_ENABLE (0)
+static int mem_cgroup_async_control_write(struct cgroup *cgrp,
+ struct cftype *cft, u64 val)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+ unsigned long user_flag = val;
+
+ if (test_bit(USER_AUTO_KEEP_MARGIN_ENABLE, &user_flag))
+ set_bit(AUTO_KEEP_MARGIN_ENABLED, &mem->async_flags);
+ else
+ clear_bit(AUTO_KEEP_MARGIN_ENABLED, &mem->async_flags);
+ mem_cgroup_reset_margin_to_limit(mem);
+ return 0;
+}
+
+static u64 mem_cgroup_async_control_read(struct cgroup *cgrp,
+ struct cftype *cft)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+ unsigned long val = 0;
+
+ if (test_bit(AUTO_KEEP_MARGIN_ENABLED, &mem->async_flags))
+ set_bit(USER_AUTO_KEEP_MARGIN_ENABLE, &val);
+ return (u64)val;
+}
+
static u64 mem_cgroup_swappiness_read(struct cgroup *cgrp, struct cftype *cft)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
@@ -4816,6 +5000,11 @@ static struct cftype memsw_cgroup_files[
.trigger = mem_cgroup_reset,
.read_u64 = mem_cgroup_read,
},
+ {
+ .name = "async_control",
+ .read_u64 = mem_cgroup_async_control_read,
+ .write_u64 = mem_cgroup_async_control_write,
+ },
};

static int register_memsw_files(struct cgroup *cont, struct cgroup_subsys *ss)
@@ -5049,6 +5238,7 @@ mem_cgroup_create(struct cgroup_subsys *
mem->swappiness = mem_cgroup_swappiness(parent);
atomic_set(&mem->refcnt, 1);
mem->move_charge_at_immigrate = 0;
+ spin_lock_init(&mem->update_margin_lock);
mutex_init(&mem->thresholds_lock);
return &mem->css;
free_out:
Index: memcg_async/Documentation/cgroups/memory.txt
===================================================================
--- memcg_async.orig/Documentation/cgroups/memory.txt
+++ memcg_async/Documentation/cgroups/memory.txt
@@ -70,6 +70,7 @@ Brief summary of control files.
(See sysctl's vm.swappiness)
memory.move_charge_at_immigrate # set/show controls of moving charges
memory.oom_control # set/show oom controls.
+ memory.async_control # set control for asynchronous memory reclaim

1. History

@@ -433,6 +434,7 @@ recent_rotated_anon - VM internal parame
recent_rotated_file - VM internal parameter. (see mm/vmscan.c)
recent_scanned_anon - VM internal parameter. (see mm/vmscan.c)
recent_scanned_file - VM internal parameter. (see mm/vmscan.c)
+margin_to_limit - The margin to limit to be kept.

Memo:
recent_rotated means recent frequency of LRU rotation.
@@ -664,7 +666,47 @@ At reading, current status of OOM is sho
under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
be stopped.)

-11. TODO
+11. Asynchronous memory reclaim
+
+When running applications which use a lot of file cache, memory reclaim
+latency shows up once the memory cgroup hits its limit. By shrinking usage in
+the background, this latency can be hidden with kernel help if enough cpu is free.
+
+Memory cgroup provides a method for asynchronous memory reclaim which frees
+memory before hitting the limit. With this, some classes of applications can
+reduce latency effectively and show good/stable performance. For example, if an
+application reads data from files bigger than the limit, freeing memory in the
+background will reduce read latency.
+
+(*) Please note that even if the latency is hidden, CPU is used in the background.
+ So, asynchronous memory reclaim works effectively only when you have
+ extra unused CPU, or when applications tend to sleep. On a UP host, the context
+ switches caused by the background job can just make performance worse.
+ So, if you see this feature doesn't help your application, please leave it
+ turned off.
+
+11.1 memory.async_control
+
+memory.async_control is a control for asynchronous memory reclaim and
+represented as bitmask of controls.
+
+ bit 0 ....user control of automatic keep margin to limit (see below)
+
+ bit 0:
+ Automatic keep-margin-to-limit is a feature to keep free space below the
+ limit by freeing memory in the background. The size of the margin is
+ calculated by the kernel automatically and it changes with information
+ about the jobs.
+
+ This feature can be enabled by
+
+ echo 1 > memory.async_control
+
+ Note: This feature is not propagated to children automatically. This
+ may be conservative, but it is a required limitation to avoid using
+ too much cpu.
+
+12. TODO

1. Add support for accounting huge pages (as a separate controller)
2. Make per-cgroup scanner reclaim not-shared pages first

2011-05-26 05:31:25

by Kamezawa Hiroyuki

Subject: [RFC][PATCH v3 6/10] memcg: auto keep margin in background, workqueue core.


Core code for controlling the workqueue that keeps a margin to the limit
in the background. For controlling the work, this patch adds 2 flags.

ASYNC_WORKER_RUNNING indicates that the worker for the memcg is scheduled
and "no new work needs to be added". ASYNC_WORKER_SHOULD_STOP indicates
that someone is trying to remove the memcg and "async reclaim should stop".
Because the worker holds a reference count on the memcg, stopping the work
is required when removing the cgroup.

The memory cgroup's automatic keep-margin-to-limit work is scheduled on the
memcg_async_shrinker workqueue, which is configured as WQ_UNBOUND.

The shrinker core, mem_cgroup_shrink_rate_limited(), will be implemented in
the following patches.

Changelog:
- added comments and renamed flags.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/swap.h | 3 +
mm/memcontrol.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++--
mm/vmscan.c | 6 +++
3 files changed, 87 insertions(+), 2 deletions(-)

Index: memcg_async/mm/memcontrol.c
===================================================================
--- memcg_async.orig/mm/memcontrol.c
+++ memcg_async/mm/memcontrol.c
@@ -288,10 +288,12 @@ struct mem_cgroup {
*/
unsigned long margin_to_limit_pages; /* margin to limit */
spinlock_t update_margin_lock;
- unsigned long async_flags;
+ struct delayed_work async_work;
+ unsigned long async_flags;
#define AUTO_KEEP_MARGIN_ENABLED (0) /* user enabled async reclaim */
#define FAILED_TO_KEEP_MARGIN (1) /* someone hit limit */
-
+#define ASYNC_WORKER_RUNNING (2) /* a worker runs */
+#define ASYNC_WORKER_SHOULD_STOP (3) /* worker thread should stop */
/*
* percpu counter.
*/
@@ -3862,10 +3864,82 @@ static void mem_cgroup_reset_margin_to_l
}

/*
+ * Code to reclaim memory in the background with the help of a kworker.
+ * Memory cgroup uses an UNBOUND workqueue.
+ */
+struct workqueue_struct *memcg_async_shrinker;
+
+static int memcg_async_shrinker_init(void)
+{
+ memcg_async_shrinker = alloc_workqueue("memcg_async",
+ WQ_MEM_RECLAIM | WQ_UNBOUND | WQ_FREEZABLE, 0);
+ return 0;
+}
+module_init(memcg_async_shrinker_init);
+/*
+ * Called from rmdir() path and stop asynchronous worker because
+ * it has an extra reference count.
+ */
+static void mem_cgroup_stop_async_worker(struct mem_cgroup *mem)
+{
+ /* The worker will stop when it sees this flag */
+ set_bit(ASYNC_WORKER_SHOULD_STOP, &mem->async_flags);
+ flush_delayed_work(&mem->async_work);
+ clear_bit(ASYNC_WORKER_SHOULD_STOP, &mem->async_flags);
+}
+
+/*
+ * Reclaim memory in an asynchronous way. This function is for keeping
+ * enough margin to the limit in the background. If the margin is big enough,
+ * or someone tries to delete the cgroup, stop reclaiming.
+ * If the margin is still small after shrinking memory, reschedule itself.
+ */
+static void mem_cgroup_async_shrink_worker(struct work_struct *work)
+{
+ struct delayed_work *dw = to_delayed_work(work);
+ struct mem_cgroup *mem;
+ int delay = 0;
+ long nr_to_reclaim;
+
+ mem = container_of(dw, struct mem_cgroup, async_work);
+
+ if (!test_bit(AUTO_KEEP_MARGIN_ENABLED, &mem->async_flags) ||
+ test_bit(ASYNC_WORKER_SHOULD_STOP, &mem->async_flags))
+ goto finish_scan;
+
+ nr_to_reclaim = mem->margin_to_limit_pages - mem_cgroup_margin(mem);
+
+ if (nr_to_reclaim > 0)
+ mem_cgroup_shrink_rate_limited(mem, nr_to_reclaim);
+ else
+ goto finish_scan;
+ /* If margin is big enough, stop */
+ if (mem_cgroup_margin(mem) >= mem->margin_to_limit_pages)
+ goto finish_scan;
+ /* If someone tries to rmdir(), we should stop */
+ if (test_bit(ASYNC_WORKER_SHOULD_STOP, &mem->async_flags))
+ goto finish_scan;
+
+ queue_delayed_work(memcg_async_shrinker, &mem->async_work, delay);
+ return;
+finish_scan:
+ cgroup_release_and_wakeup_rmdir(&mem->css);
+ clear_bit(ASYNC_WORKER_RUNNING, &mem->async_flags);
+ return;
+}
+
+/*
* Run an asynchronous memory reclaim on the memcg.
*/
static void mem_cgroup_schedule_async_reclaim(struct mem_cgroup *mem)
{
+ if (test_and_set_bit(ASYNC_WORKER_RUNNING, &mem->async_flags))
+ return;
+ cgroup_exclude_rmdir(&mem->css);
+ if (!queue_delayed_work(memcg_async_shrinker, &mem->async_work, 0)) {
+ cgroup_release_and_wakeup_rmdir(&mem->css);
+ clear_bit(ASYNC_WORKER_RUNNING, &mem->async_flags);
+ }
}

/*
@@ -3960,6 +4034,7 @@ static int mem_cgroup_force_empty(struct
move_account:
do {
ret = -EBUSY;
+ mem_cgroup_stop_async_worker(mem);
if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
goto out;
ret = -EINTR;
@@ -5239,6 +5314,7 @@ mem_cgroup_create(struct cgroup_subsys *
atomic_set(&mem->refcnt, 1);
mem->move_charge_at_immigrate = 0;
spin_lock_init(&mem->update_margin_lock);
+ INIT_DELAYED_WORK(&mem->async_work, mem_cgroup_async_shrink_worker);
mutex_init(&mem->thresholds_lock);
return &mem->css;
free_out:
Index: memcg_async/include/linux/swap.h
===================================================================
--- memcg_async.orig/include/linux/swap.h
+++ memcg_async/include/linux/swap.h
@@ -257,6 +257,9 @@ extern unsigned long mem_cgroup_shrink_n
gfp_t gfp_mask, bool noswap,
struct zone *zone,
unsigned long *nr_scanned);
+extern void mem_cgroup_shrink_rate_limited(struct mem_cgroup *mem,
+ unsigned long nr_to_reclaim);
+
extern int __isolate_lru_page(struct page *page, int mode, int file);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
Index: memcg_async/mm/vmscan.c
===================================================================
--- memcg_async.orig/mm/vmscan.c
+++ memcg_async/mm/vmscan.c
@@ -2261,6 +2261,12 @@ unsigned long try_to_free_mem_cgroup_pag

return nr_reclaimed;
}
+
+void mem_cgroup_shrink_rate_limited(struct mem_cgroup *mem,
+ unsigned long nr_to_reclaim)
+{
+}
+
#endif

/*

2011-05-26 05:37:20

by Kamezawa Hiroyuki

Subject: [RFC][PATCH v3 7/10] workqueue: add WQ_IDLEPRI


When this idea came to me, I wondered which is better: maintaining memcg's
own thread pool, or adding support in the workqueue code for generic use.
In general, I feel enhancing the generic one is better... so I wrote this one.
==
This patch adds a new workqueue class, WQ_IDLEPRI.

The worker thread for this workqueue has the SCHED_IDLE scheduling
policy and doesn't use (too much) CPU if there are other active threads.
IOW, unless the system is idle, the work will not make progress.

When scheduling asynchronous work that helps reduce application latency,
it's good to use the system's idle time. The CPU time which was used in
the application's context is moved to fill the idle time of the system.

Applications can hide their latency by shifting the cpu time for the work
so that it is done in idle time. This will be used by the memory cgroup
to hide memory reclaim latency.
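
As a minimal sketch of how a user of this class would look (the names below
are illustrative; memcg's actual usage is in the last hunk of this patch):

	static struct workqueue_struct *my_idle_wq;
	static struct work_struct my_work;

	static void my_work_fn(struct work_struct *work)
	{
		/* runs with SCHED_IDLE, i.e. only when the cpus have idle time */
	}

	static int __init my_init(void)
	{
		my_idle_wq = alloc_workqueue("my_idle",
					     WQ_UNBOUND | WQ_IDLEPRI, 0);
		if (!my_idle_wq)
			return -ENOMEM;
		INIT_WORK(&my_work, my_work_fn);
		queue_work(my_idle_wq, &my_work);
		return 0;
	}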

I may be missing something... any comments are welcome.

NOTE 1: SCHED_IDLE is just the lowest priority of SCHED_OTHER.
NOTE 2: It may be better to add cond_resched() somewhere in the worker
thread, but I couldn't find the best place.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Documentation/workqueue.txt | 10 ++++
include/linux/workqueue.h | 8 ++-
kernel/workqueue.c | 101 +++++++++++++++++++++++++++++++++-----------
mm/memcontrol.c | 3 -
4 files changed, 93 insertions(+), 29 deletions(-)

Index: memcg_async/include/linux/workqueue.h
===================================================================
--- memcg_async.orig/include/linux/workqueue.h
+++ memcg_async/include/linux/workqueue.h
@@ -56,7 +56,8 @@ enum {

/* special cpu IDs */
WORK_CPU_UNBOUND = NR_CPUS,
- WORK_CPU_NONE = NR_CPUS + 1,
+ WORK_CPU_IDLEPRI = NR_CPUS + 1,
+ WORK_CPU_NONE = NR_CPUS + 2,
WORK_CPU_LAST = WORK_CPU_NONE,

/*
@@ -254,9 +255,10 @@ enum {
WQ_MEM_RECLAIM = 1 << 3, /* may be used for memory reclaim */
WQ_HIGHPRI = 1 << 4, /* high priority */
WQ_CPU_INTENSIVE = 1 << 5, /* cpu instensive workqueue */
+ WQ_IDLEPRI = 1 << 6, /* the lowest priority in scheduler*/

- WQ_DYING = 1 << 6, /* internal: workqueue is dying */
- WQ_RESCUER = 1 << 7, /* internal: workqueue has rescuer */
+ WQ_DYING = 1 << 7, /* internal: workqueue is dying */
+ WQ_RESCUER = 1 << 8, /* internal: workqueue has rescuer */

WQ_MAX_ACTIVE = 512, /* I like 512, better ideas? */
WQ_MAX_UNBOUND_PER_CPU = 4, /* 4 * #cpus for unbound wq */
Index: memcg_async/kernel/workqueue.c
===================================================================
--- memcg_async.orig/kernel/workqueue.c
+++ memcg_async/kernel/workqueue.c
@@ -61,9 +61,11 @@ enum {
WORKER_REBIND = 1 << 5, /* mom is home, come back */
WORKER_CPU_INTENSIVE = 1 << 6, /* cpu intensive */
WORKER_UNBOUND = 1 << 7, /* worker is unbound */
+ WORKER_IDLEPRI = 1 << 8, /* idle (SCHED_IDLE) priority worker */

WORKER_NOT_RUNNING = WORKER_PREP | WORKER_ROGUE | WORKER_REBIND |
- WORKER_CPU_INTENSIVE | WORKER_UNBOUND,
+ WORKER_CPU_INTENSIVE | WORKER_UNBOUND |
+ WORKER_IDLEPRI,

/* gcwq->trustee_state */
TRUSTEE_START = 0, /* start */
@@ -276,14 +278,25 @@ static inline int __next_gcwq_cpu(int cp
}
if (sw & 2)
return WORK_CPU_UNBOUND;
- }
+ if (sw & 4)
+ return WORK_CPU_IDLEPRI;
+ } else if (cpu == WORK_CPU_UNBOUND && (sw & 4))
+ return WORK_CPU_IDLEPRI;
return WORK_CPU_NONE;
}

static inline int __next_wq_cpu(int cpu, const struct cpumask *mask,
struct workqueue_struct *wq)
{
- return __next_gcwq_cpu(cpu, mask, !(wq->flags & WQ_UNBOUND) ? 1 : 2);
+ int sw = 1;
+
+ if (wq->flags & WQ_UNBOUND) {
+ if (!(wq->flags & WQ_IDLEPRI))
+ sw = 2;
+ else
+ sw = 4;
+ }
+ return __next_gcwq_cpu(cpu, mask, sw);
}

/*
@@ -294,20 +307,21 @@ static inline int __next_wq_cpu(int cpu,
* specific CPU. The following iterators are similar to
* for_each_*_cpu() iterators but also considers the unbound gcwq.
*
- * for_each_gcwq_cpu() : possible CPUs + WORK_CPU_UNBOUND
- * for_each_online_gcwq_cpu() : online CPUs + WORK_CPU_UNBOUND
- * for_each_cwq_cpu() : possible CPUs for bound workqueues,
- * WORK_CPU_UNBOUND for unbound workqueues
+ * for_each_gcwq_cpu() : possible CPUs + WORK_CPU_UNBOUND + IDLEPRI
+ * for_each_online_gcwq_cpu() : online CPUs + WORK_CPU_UNBOUND + IDLEPRI
+ * for_each_cwq_cpu() : possible CPUs for bound workqueues,
+ * WORK_CPU_UNBOUND for unbound workqueues
+ * IDLEPRI for idle workqueues.
*/
#define for_each_gcwq_cpu(cpu) \
- for ((cpu) = __next_gcwq_cpu(-1, cpu_possible_mask, 3); \
+ for ((cpu) = __next_gcwq_cpu(-1, cpu_possible_mask, 7); \
(cpu) < WORK_CPU_NONE; \
- (cpu) = __next_gcwq_cpu((cpu), cpu_possible_mask, 3))
+ (cpu) = __next_gcwq_cpu((cpu), cpu_possible_mask, 7))

#define for_each_online_gcwq_cpu(cpu) \
- for ((cpu) = __next_gcwq_cpu(-1, cpu_online_mask, 3); \
+ for ((cpu) = __next_gcwq_cpu(-1, cpu_online_mask, 7); \
(cpu) < WORK_CPU_NONE; \
- (cpu) = __next_gcwq_cpu((cpu), cpu_online_mask, 3))
+ (cpu) = __next_gcwq_cpu((cpu), cpu_online_mask, 7))

#define for_each_cwq_cpu(cpu, wq) \
for ((cpu) = __next_wq_cpu(-1, cpu_possible_mask, (wq)); \
@@ -451,22 +465,34 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(ato
static struct global_cwq unbound_global_cwq;
static atomic_t unbound_gcwq_nr_running = ATOMIC_INIT(0); /* always 0 */

+/*
+ * Global cpu workqueue and nr_running for idle gcwq. The idle gcwq is
+ * always online, has GCWQ_DISASSOCIATED set, and all its workers have
+ * WORKER_UNBOUND and WORKER_IDLEPRI set.
+ */
+static struct global_cwq unbound_idle_global_cwq;
+static atomic_t unbound_idle_gcwq_nr_running = ATOMIC_INIT(0); /* always 0 */
+
static int worker_thread(void *__worker);

static struct global_cwq *get_gcwq(unsigned int cpu)
{
- if (cpu != WORK_CPU_UNBOUND)
+ if (cpu < WORK_CPU_UNBOUND)
return &per_cpu(global_cwq, cpu);
- else
+ else if (cpu == WORK_CPU_UNBOUND)
return &unbound_global_cwq;
+ else
+ return &unbound_idle_global_cwq;
}

static atomic_t *get_gcwq_nr_running(unsigned int cpu)
{
- if (cpu != WORK_CPU_UNBOUND)
+ if (cpu < WORK_CPU_UNBOUND)
return &per_cpu(gcwq_nr_running, cpu);
- else
+ else if (cpu == WORK_CPU_UNBOUND)
return &unbound_gcwq_nr_running;
+ else
+ return &unbound_idle_gcwq_nr_running;
}

static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
@@ -480,7 +506,8 @@ static struct cpu_workqueue_struct *get_
return wq->cpu_wq.single;
#endif
}
- } else if (likely(cpu == WORK_CPU_UNBOUND))
+ } else if (likely(cpu == WORK_CPU_UNBOUND ||
+ cpu == WORK_CPU_IDLEPRI))
return wq->cpu_wq.single;
return NULL;
}
@@ -563,7 +590,9 @@ static struct global_cwq *get_work_gcwq(
if (cpu == WORK_CPU_NONE)
return NULL;

- BUG_ON(cpu >= nr_cpu_ids && cpu != WORK_CPU_UNBOUND);
+ BUG_ON(cpu >= nr_cpu_ids
+ && cpu != WORK_CPU_UNBOUND
+ && cpu != WORK_CPU_IDLEPRI);
return get_gcwq(cpu);
}

@@ -599,6 +628,10 @@ static bool keep_working(struct global_c
{
atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);

+ if (unlikely((gcwq->cpu == WORK_CPU_IDLEPRI)) &&
+ need_resched())
+ return false;
+
return !list_empty(&gcwq->worklist) &&
(atomic_read(nr_running) <= 1 ||
gcwq->flags & GCWQ_HIGHPRI_PENDING);
@@ -1025,9 +1058,12 @@ static void __queue_work(unsigned int cp
}
} else
spin_lock_irqsave(&gcwq->lock, flags);
- } else {
+ } else if (!(wq->flags & WQ_IDLEPRI)) {
gcwq = get_gcwq(WORK_CPU_UNBOUND);
spin_lock_irqsave(&gcwq->lock, flags);
+ } else {
+ gcwq = get_gcwq(WORK_CPU_IDLEPRI);
+ spin_lock_irqsave(&gcwq->lock, flags);
}

/* gcwq determined, get cwq and queue */
@@ -1160,8 +1196,10 @@ int queue_delayed_work_on(int cpu, struc
lcpu = gcwq->cpu;
else
lcpu = raw_smp_processor_id();
- } else
+ } else if (!(wq->flags & WQ_IDLEPRI))
lcpu = WORK_CPU_UNBOUND;
+ else
+ lcpu = WORK_CPU_IDLEPRI;

set_work_cwq(work, get_cwq(lcpu, wq), 0);

@@ -1352,6 +1390,7 @@ static struct worker *alloc_worker(void)
static struct worker *create_worker(struct global_cwq *gcwq, bool bind)
{
bool on_unbound_cpu = gcwq->cpu == WORK_CPU_UNBOUND;
+ bool on_idle_cpu = gcwq->cpu == WORK_CPU_IDLEPRI;
struct worker *worker = NULL;
int id = -1;

@@ -1371,14 +1410,17 @@ static struct worker *create_worker(stru
worker->gcwq = gcwq;
worker->id = id;

- if (!on_unbound_cpu)
+ if (!on_unbound_cpu && !on_idle_cpu)
worker->task = kthread_create_on_node(worker_thread,
worker,
cpu_to_node(gcwq->cpu),
"kworker/%u:%d", gcwq->cpu, id);
- else
+ else if (!on_idle_cpu)
worker->task = kthread_create(worker_thread, worker,
"kworker/u:%d", id);
+ else
+ worker->task = kthread_create(worker_thread, worker,
+ "kworker/i:%d", id);
if (IS_ERR(worker->task))
goto fail;

@@ -1387,12 +1429,14 @@ static struct worker *create_worker(stru
* online later on. Make sure every worker has
* PF_THREAD_BOUND set.
*/
- if (bind && !on_unbound_cpu)
+ if (bind && !on_unbound_cpu && !on_idle_cpu)
kthread_bind(worker->task, gcwq->cpu);
else {
worker->task->flags |= PF_THREAD_BOUND;
if (on_unbound_cpu)
worker->flags |= WORKER_UNBOUND;
+ if (on_idle_cpu)
+ worker->flags |= WORKER_IDLEPRI;
}

return worker;
@@ -1496,7 +1540,7 @@ static bool send_mayday(struct work_stru
/* mayday mayday mayday */
cpu = cwq->gcwq->cpu;
/* WORK_CPU_UNBOUND can't be set in cpumask, use cpu 0 instead */
- if (cpu == WORK_CPU_UNBOUND)
+ if ((cpu == WORK_CPU_UNBOUND) || (cpu == WORK_CPU_IDLEPRI))
cpu = 0;
if (!mayday_test_and_set_cpu(cpu, wq->mayday_mask))
wake_up_process(wq->rescuer->task);
@@ -1935,6 +1979,11 @@ static int worker_thread(void *__worker)

/* tell the scheduler that this is a workqueue worker */
worker->task->flags |= PF_WQ_WORKER;
+ /* if worker is for IDLEPRI, set scheduler */
+ if (worker->flags & WORKER_IDLEPRI) {
+ struct sched_param param = { .sched_priority = 0 };
+ sched_setscheduler(current, SCHED_IDLE, &param);
+ }
woke_up:
spin_lock_irq(&gcwq->lock);

@@ -2912,8 +2961,9 @@ struct workqueue_struct *__alloc_workque
/*
* Workqueues which may be used during memory reclaim should
* have a rescuer to guarantee forward progress.
+ * But IDLE workqueue will not have any rescuer.
*/
- if (flags & WQ_MEM_RECLAIM)
+ if ((flags & WQ_MEM_RECLAIM) && !(flags & WQ_IDLEPRI))
flags |= WQ_RESCUER;

/*
@@ -3775,7 +3825,8 @@ static int __init init_workqueues(void)
struct global_cwq *gcwq = get_gcwq(cpu);
struct worker *worker;

- if (cpu != WORK_CPU_UNBOUND)
+ if ((cpu != WORK_CPU_UNBOUND) &&
+ (cpu != WORK_CPU_IDLEPRI))
gcwq->flags &= ~GCWQ_DISASSOCIATED;
worker = create_worker(gcwq, true);
BUG_ON(!worker);
Index: memcg_async/Documentation/workqueue.txt
===================================================================
--- memcg_async.orig/Documentation/workqueue.txt
+++ memcg_async/Documentation/workqueue.txt
@@ -247,6 +247,16 @@ resources, scheduled and executed.
highpri CPU-intensive wq start execution as soon as resources
are available and don't affect execution of other work items.

+ WQ_UNBOUND | WQ_IDLEPRI
+ A special case of unbound wq: the worker thread for this workqueue
+ runs with the lowest priority, SCHED_IDLE. Most characteristics are
+ the same as an UNBOUND workqueue, but the thread's priority is SCHED_IDLE.
+ This is useful when you want to run work that hides application latency
+ by making use of the system's idle time. Because the scheduling
+ priority of this class of workqueue is the minimum, you must assume that
+ the work may not run for a long time when the system's cpus are busy.
+ Also, unlike an UNBOUND WQ, this will not have rescuer threads.
+
@max_active:

@max_active determines the maximum number of execution contexts per
Index: memcg_async/mm/memcontrol.c
===================================================================
--- memcg_async.orig/mm/memcontrol.c
+++ memcg_async/mm/memcontrol.c
@@ -3872,7 +3872,8 @@ struct workqueue_struct *memcg_async_shr
static int memcg_async_shrinker_init(void)
{
memcg_async_shrinker = alloc_workqueue("memcg_async",
- WQ_MEM_RECLAIM | WQ_UNBOUND | WQ_FREEZABLE, 0);
+ WQ_MEM_RECLAIM | WQ_UNBOUND | WQ_IDLEPRI | WQ_FREEZABLE,
+ 0);
return 0;
}
module_init(memcg_async_shrinker_init);

2011-05-26 05:39:48

by Kamezawa Hiroyuki

Subject: [RFC][PATCH v3 8/10] memcg: scan ratio calculation


==
This patch adds a function to calculate the reclaim/scan ratio based on
recent scans.
This will be shown via the memory.reclaim_stat interface in a later patch.
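
The ratio itself is simply reclaimed * 100 / (scanned + 1) over the decayed
counters; as an illustrative example, a memcg that reclaimed 50 of the last
1000 pages it scanned reports a ratio of about 5.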

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/swap.h | 8 +-
mm/memcontrol.c | 137 +++++++++++++++++++++++++++++++++++++++++++++++----
mm/vmscan.c | 9 ++-
3 files changed, 138 insertions(+), 16 deletions(-)

Index: memcg_async/mm/memcontrol.c
===================================================================
--- memcg_async.orig/mm/memcontrol.c
+++ memcg_async/mm/memcontrol.c
@@ -73,7 +73,6 @@ static int really_do_swap_account __init
#define do_swap_account (0)
#endif

-
/*
* Statistics for memory cgroup.
*/
@@ -215,6 +214,7 @@ static void mem_cgroup_oom_notify(struct
static void mem_cgroup_reset_margin_to_limit(struct mem_cgroup *mem);
static void mem_cgroup_update_margin_to_limit(struct mem_cgroup *mem);
static void mem_cgroup_may_async_reclaim(struct mem_cgroup *mem);
+static void mem_cgroup_reflesh_scan_ratio(struct mem_cgroup *mem);

/*
* The memory controller data structure. The memory controller controls both
@@ -294,6 +294,12 @@ struct mem_cgroup {
#define FAILED_TO_KEEP_MARGIN (1) /* someone hit limit */
#define ASYNC_WORKER_RUNNING (2) /* a worker runs */
#define ASYNC_WORKER_SHOULD_STOP (3) /* worker thread should stop */
+
+ /* For calculating scan success ratio */
+ spinlock_t scan_stat_lock;
+ unsigned long scanned;
+ unsigned long reclaimed;
+ unsigned long next_scanratio_update;
/*
* percpu counter.
*/
@@ -758,6 +764,7 @@ static void memcg_check_events(struct me
}
/* update margin-to-limit and run async reclaim if necessary */
if (__memcg_event_check(mem, MEM_CGROUP_TARGET_KEEP_MARGIN)) {
+ mem_cgroup_reflesh_scan_ratio(mem);
mem_cgroup_may_async_reclaim(mem);
__mem_cgroup_target_update(mem,
MEM_CGROUP_TARGET_KEEP_MARGIN);
@@ -1417,6 +1424,96 @@ unsigned int mem_cgroup_swappiness(struc
return memcg->swappiness;
}

+static void __mem_cgroup_update_scan_ratio(struct mem_cgroup *mem,
+ unsigned long scanned,
+ unsigned long reclaimed)
+{
+ unsigned long limit;
+
+ limit = res_counter_read_u64(&mem->res, RES_LIMIT) >> PAGE_SHIFT;
+ spin_lock(&mem->scan_stat_lock);
+ mem->scanned += scanned;
+ mem->reclaimed += reclaimed;
+ /* avoid overflow */
+ if (mem->scanned > limit) {
+ mem->scanned /= 2;
+ mem->reclaimed /= 2;
+ }
+ spin_unlock(&mem->scan_stat_lock);
+}
+
+/**
+ * mem_cgroup_update_scan_ratio
+ * @mem: the memcg
+ * @root: root memcg of the hierarchy walk.
+ * @scanned: scanned pages
+ * @reclaimed: reclaimed pages.
+ *
+ * Record the scan/reclaim statistics both to the child memcg and to its
+ * root mem cgroup, which is the reclaim target. This value is used to
+ * detect congestion and to determine sleep time at memory reclaim.
+ */
+
+static void mem_cgroup_update_scan_ratio(struct mem_cgroup *mem,
+ struct mem_cgroup *root,
+ unsigned long scanned,
+ unsigned long reclaimed)
+{
+ __mem_cgroup_update_scan_ratio(mem, scanned, reclaimed);
+ if (mem != root)
+ __mem_cgroup_update_scan_ratio(root, scanned, reclaimed);
+
+}
+
+/*
+ * The workload can change over time. This routine forgets old information
+ * to some extent. It is triggered by the event counter (i.e. some amount of
+ * pagein/pageout events) and is rate limited to once per minute.
+ *
+ * By this, the most recent minute of information is weighted twice as
+ * heavily as older information.
+ */
+static void mem_cgroup_reflesh_scan_ratio(struct mem_cgroup *mem)
+{
+ struct cgroup *parent;
+ /* Update all parent's information if they are old */
+ while (1) {
+ if (time_after(mem->next_scanratio_update, jiffies))
+ break;
+ mem->next_scanratio_update = jiffies + HZ*60;
+ spin_lock(&mem->scan_stat_lock);
+ mem->scanned /= 2;
+ mem->reclaimed /= 2;
+ spin_unlock(&mem->scan_stat_lock);
+ if (!mem->use_hierarchy)
+ break;
+ parent = mem->css.cgroup->parent;
+ if (!parent)
+ break;
+ mem = mem_cgroup_from_cont(parent);
+ }
+}
+
+/**
+ * mem_cgroup_scan_ratio:
+ * @mem: the mem cgroup
+ *
+ * Returns the recent reclaim/scan ratio. If this is low, memory is filled
+ * with active (or dirty) pages. If high, memory includes inactive, unnecessary
+ * file pages. This can be a hint for admins about whether the limit is correct.
+ */
+static int mem_cgroup_scan_ratio(struct mem_cgroup *mem)
+{
+ int scan_success_ratio;
+
+ spin_lock(&mem->scan_stat_lock);
+ scan_success_ratio = mem->reclaimed * 100 / (mem->scanned + 1);
+ spin_unlock(&mem->scan_stat_lock);
+
+ return scan_success_ratio;
+}
+
+
static void mem_cgroup_start_move(struct mem_cgroup *mem)
{
int cpu;
@@ -1855,9 +1952,14 @@ static int mem_cgroup_hierarchical_recla
*total_scanned += nr_scanned;
mem_cgroup_soft_steal(victim, is_kswapd, ret);
mem_cgroup_soft_scan(victim, is_kswapd, nr_scanned);
- } else
+ mem_cgroup_update_scan_ratio(victim,
+ root_mem, nr_scanned, ret);
+ } else {
ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
- noswap);
+ noswap, &nr_scanned);
+ mem_cgroup_update_scan_ratio(victim,
+ root_mem, nr_scanned, ret);
+ }
css_put(&victim->css);
/*
* At shrinking usage, we can't check we should stop here or
@@ -3895,12 +3997,14 @@ static void mem_cgroup_stop_async_worker
* someone tries to delete cgroup, stop reclaim.
* If margin is big even after shrink memory, reschedule itself again.
*/
+
static void mem_cgroup_async_shrink_worker(struct work_struct *work)
{
struct delayed_work *dw = to_delayed_work(work);
- struct mem_cgroup *mem;
- int delay = 0;
+ struct mem_cgroup *mem, *victim;
long nr_to_reclaim;
+ unsigned long nr_scanned, nr_reclaimed;
+ int delay = 0;

mem = container_of(dw, struct mem_cgroup, async_work);

@@ -3910,12 +4014,22 @@ static void mem_cgroup_async_shrink_work

nr_to_reclaim = mem->margin_to_limit_pages - mem_cgroup_margin(mem);

- if (nr_to_reclaim > 0)
- mem_cgroup_shrink_rate_limited(mem, nr_to_reclaim);
- else
+ if (nr_to_reclaim <= 0)
+ goto finish_scan;
+
+ /* select a memcg under hierarchy */
+ victim = mem_cgroup_select_get_victim(mem);
+ if (!victim)
goto finish_scan;
+
+ nr_reclaimed = mem_cgroup_shrink_rate_limited(victim, nr_to_reclaim,
+ &nr_scanned);
+ mem_cgroup_update_scan_ratio(victim, mem, nr_scanned, nr_reclaimed);
+ css_put(&victim->css);
+
/* If margin is enough big, stop */
- if (mem_cgroup_margin(mem) >= mem->margin_to_limit_pages)
+ nr_to_reclaim = mem->margin_to_limit_pages - mem_cgroup_margin(mem);
+ if (nr_to_reclaim <= 0)
goto finish_scan;
/* If someone tries to rmdir(), we should stop */
if (test_bit(ASYNC_WORKER_SHOULD_STOP, &mem->async_flags))
@@ -4083,12 +4197,14 @@ try_to_free:
shrink = 1;
while (nr_retries && mem->res.usage > 0) {
int progress;
+ unsigned long nr_scanned;

if (signal_pending(current)) {
ret = -EINTR;
goto out;
}
- progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL, false);
+ progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
+ false, &nr_scanned);
if (!progress) {
nr_retries--;
/* maybe some writeback is necessary */
@@ -5315,6 +5431,7 @@ mem_cgroup_create(struct cgroup_subsys *
atomic_set(&mem->refcnt, 1);
mem->move_charge_at_immigrate = 0;
spin_lock_init(&mem->update_margin_lock);
+ spin_lock_init(&mem->scan_stat_lock);
INIT_DELAYED_WORK(&mem->async_work, mem_cgroup_async_shrink_worker);
mutex_init(&mem->thresholds_lock);
return &mem->css;
Index: memcg_async/include/linux/swap.h
===================================================================
--- memcg_async.orig/include/linux/swap.h
+++ memcg_async/include/linux/swap.h
@@ -252,13 +252,15 @@ static inline void lru_cache_add_file(st
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
- gfp_t gfp_mask, bool noswap);
+ gfp_t gfp_mask, bool noswap,
+ unsigned long *nr_scanned);
extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
struct zone *zone,
unsigned long *nr_scanned);
-extern void mem_cgroup_shrink_rate_limited(struct mem_cgroup *mem,
- unsigned long nr_to_reclaim);
+extern unsigned long mem_cgroup_shrink_rate_limited(struct mem_cgroup *mem,
+ unsigned long nr_to_reclaim,
+ unsigned long *nr_scanned);

extern int __isolate_lru_page(struct page *page, int mode, int file);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
Index: memcg_async/mm/vmscan.c
===================================================================
--- memcg_async.orig/mm/vmscan.c
+++ memcg_async/mm/vmscan.c
@@ -2221,7 +2221,8 @@ unsigned long mem_cgroup_shrink_node_zon

unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
gfp_t gfp_mask,
- bool noswap)
+ bool noswap,
+ unsigned long *nr_scanned)
{
struct zonelist *zonelist;
unsigned long nr_reclaimed;
@@ -2258,12 +2259,14 @@ unsigned long try_to_free_mem_cgroup_pag
nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink);

trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
+ *nr_scanned = sc.nr_scanned;

return nr_reclaimed;
}

-void mem_cgroup_shrink_rate_limited(struct mem_cgroup *mem,
- unsigned long nr_to_reclaim)
+unsigned long mem_cgroup_shrink_rate_limited(struct mem_cgroup *mem,
+ unsigned long nr_to_reclaim,
+ unsigned long *nr_scanned)
{
}

2011-05-26 05:41:54

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [RFC][PATCH v3 9/10] memcg: scan limited memory reclaim


Better name is welcomed ;(

==
rate limited memory LRU scanning for memcg.

This patch implements a routine for asynchronous memory reclaim for a memory
cgroup, which is triggered when the usage gets close to the limit.
This patch includes only the code for memory freeing.

Asynchronous memory reclaim can help reduce latency because memory reclaim
proceeds while the application is waiting or computing something.

To do memory reclaim asynchronously, we need some thread or worker.
Unlike nodes or zones, memcgs can be created on demand and a system may have
thousands of memcgs, so the number of jobs for memcg asynchronous memory
reclaim can be large in theory. The node kswapd code therefore doesn't fit
well, and some scheduling at the memcg layer would be welcome.

This patch implements LRU scanning where the number of pages scanned is
limited. When mem_cgroup_shrink_rate_limited() is called, it scans at most
MEMCG_SCAN_LIMIT (2MB worth of) pages per call. By keeping each call this
short, round-robin over memcgs can be implemented.
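
For a sense of scale (assuming a 4KiB page size), 2MB works out to
2*1024*1024 / 4096 = 512 pages scanned per call, of which at most half
(256 pages) are reclaimed before the worker returns and the next memcg gets
its turn.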

Changelog:
- dropped most of the unexplained heuristic code.
- added more comments.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/memcontrol.h | 2
mm/memcontrol.c | 4 -
mm/vmscan.c | 153 +++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 153 insertions(+), 6 deletions(-)

Index: memcg_async/mm/vmscan.c
===================================================================
--- memcg_async.orig/mm/vmscan.c
+++ memcg_async/mm/vmscan.c
@@ -106,6 +106,7 @@ struct scan_control {

/* Which cgroup do we reclaim from */
struct mem_cgroup *mem_cgroup;
+ unsigned long scan_limit; /* async reclaim uses static scan rate */

/*
* Nodemask of nodes allowed by the caller. If NULL, all nodes
@@ -1722,7 +1723,7 @@ static unsigned long shrink_list(enum lr
static void get_scan_count(struct zone *zone, struct scan_control *sc,
unsigned long *nr, int priority)
{
- unsigned long anon, file, free;
+ unsigned long anon, file, free, total_scan;
unsigned long anon_prio, file_prio;
unsigned long ap, fp;
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
@@ -1812,6 +1813,8 @@ static void get_scan_count(struct zone *
fraction[1] = fp;
denominator = ap + fp + 1;
out:
+ total_scan = 0;
+
for_each_evictable_lru(l) {
int file = is_file_lru(l);
unsigned long scan;
@@ -1838,6 +1841,20 @@ out:
scan = SWAP_CLUSTER_MAX;
}
nr[l] = scan;
+ total_scan += nr[l];
+ }
+ /*
+ * Asynchronous reclaim for memcg uses a static scan rate to avoid
+ * consuming too much cpu on one memcg. Adjust the scan counts to fit
+ * into scan_limit.
+ */
+ if (!scanning_global_lru(sc) && (total_scan > sc->scan_limit)) {
+ for_each_evictable_lru(l) {
+ if (nr[l] < SWAP_CLUSTER_MAX)
+ continue;
+ nr[l] = div64_u64(nr[l] * sc->scan_limit, total_scan);
+ nr[l] = max((unsigned long)SWAP_CLUSTER_MAX, nr[l]);
+ }
}
}

@@ -1943,6 +1960,11 @@ restart:
*/
if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY)
break;
+ /*
+ * static scan rate memory reclaim ?
+ */
+ if (sc->nr_scanned > sc->scan_limit)
+ break;
}
sc->nr_reclaimed += nr_reclaimed;

@@ -2162,6 +2184,7 @@ unsigned long try_to_free_pages(struct z
.order = order,
.mem_cgroup = NULL,
.nodemask = nodemask,
+ .scan_limit = ULONG_MAX,
};
struct shrink_control shrink = {
.gfp_mask = sc.gfp_mask,
@@ -2193,6 +2216,7 @@ unsigned long mem_cgroup_shrink_node_zon
.may_swap = !noswap,
.order = 0,
.mem_cgroup = mem,
+ .scan_limit = ULONG_MAX,
};

sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
@@ -2237,6 +2261,7 @@ unsigned long try_to_free_mem_cgroup_pag
.nodemask = NULL, /* we don't care the placement */
.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
+ .scan_limit = ULONG_MAX,
};
struct shrink_control shrink = {
.gfp_mask = sc.gfp_mask,
@@ -2264,12 +2289,129 @@ unsigned long try_to_free_mem_cgroup_pag
return nr_reclaimed;
}

-unsigned long mem_cgroup_shrink_rate_limited(struct mem_cgroup *mem,
- unsigned long nr_to_reclaim,
- unsigned long *nr_scanned)
+/*
+ * Routines for static scan rate memory reclaim for memory cgroup.
+ *
+ * Because asynchronous memory reclaim is served by the kernel as a background
+ * service to reduce latency, we don't want to scan as much as a priority=0
+ * scan by kswapd. We scan at most MEMCG_SCAN_LIMIT pages per iteration
+ * and free at most MEMCG_SCAN_LIMIT/2 pages. Then we check our success rate
+ * and return the information to the caller.
+ */
+
+static void shrink_mem_cgroup_node(int nid,
+ int priority, struct scan_control *sc)
{
+ unsigned long this_scanned = 0;
+ unsigned long this_reclaimed = 0;
+ int i;
+
+ for (i = 0; i < NODE_DATA(nid)->nr_zones; i++) {
+ struct zone *zone = NODE_DATA(nid)->node_zones + i;
+
+ if (!populated_zone(zone))
+ continue;
+ if (!mem_cgroup_zone_reclaimable_pages(sc->mem_cgroup, nid, i))
+ continue;
+ /* If the recent scan didn't go well, do writepage */
+ sc->nr_scanned = 0;
+ sc->nr_reclaimed = 0;
+ shrink_zone(priority, zone, sc);
+ this_scanned += sc->nr_scanned;
+ this_reclaimed += sc->nr_reclaimed;
+ if ((sc->nr_to_reclaim < this_reclaimed) ||
+ (sc->scan_limit < this_scanned))
+ break;
+ if (need_resched())
+ break;
+ }
+ sc->nr_scanned = this_scanned;
+ sc->nr_reclaimed = this_reclaimed;
+ return;
}

+/**
+ * mem_cgroup_shrink_rate_limited
+ * @mem : the mem cgroup to be scanned.
+ * @required: number of required pages to be freed
+ * @nr_scanned: total number of scanned pages will be returned by this.
+ *
+ * This is a memory reclaim routine designed for background memory shrinking
+ * for memcg. The main idea is to do a limited scan to implement round-robin
+ * work per memcg. This routine scans at most MEMCG_SCAN_LIMIT pages per
+ * iteration and reclaims at most MEMCG_SCAN_LIMIT/2 pages per scan.
+ * The value of MEMCG_SCAN_LIMIT is fairly arbitrary as long as it is small
+ * enough; here we scan 2MB worth of memory per iteration. If the scan is not
+ * enough for the caller, it will call this again.
+ * This routine's memory scan success rate is reported to the caller and
+ * the caller will adjust the next call.
+ */
+#define MEMCG_SCAN_LIMIT (2*1024*1024/PAGE_SIZE)
+
+unsigned long mem_cgroup_shrink_rate_limited(struct mem_cgroup *mem,
+ unsigned long required,
+ unsigned long *nr_scanned)
+{
+ int nid, priority;
+ unsigned long total_scanned, total_reclaimed, reclaim_target;
+ struct scan_control sc = {
+ .gfp_mask = GFP_HIGHUSER_MOVABLE,
+ .may_unmap = 1,
+ .may_swap = 1,
+ .order = 0,
+ /* we don't writepage in our scan. but kick flusher threads */
+ .may_writepage = 0,
+ };
+
+ total_scanned = 0;
+ total_reclaimed = 0;
+ reclaim_target = min(required, MEMCG_SCAN_LIMIT/2L);
+ sc.swappiness = mem_cgroup_swappiness(mem);
+
+ current->flags |= PF_SWAPWRITE;
+ /*
+ * We can use an arbitrary priority for our run because we only scan
+ * up to MEMCG_SCAN_LIMIT pages and reclaim at most half of that.
+ * But we need an early-give-up chance to avoid hogging the cpu.
+ * So, start from the mildest priority and get more aggressive over iterations.
+ */
+ priority = DEF_PRIORITY;
+
+ /* select a node to scan */
+ nid = mem_cgroup_select_victim_node(mem);
+ /* We do scan until scanning up to scan_limit. */
+ while ((total_scanned < MEMCG_SCAN_LIMIT) &&
+ (total_reclaimed < reclaim_target)) {
+
+ if (!mem_cgroup_has_reclaimable(mem))
+ break;
+ sc.mem_cgroup = mem;
+ sc.nr_scanned = 0;
+ sc.nr_reclaimed = 0;
+ sc.scan_limit = MEMCG_SCAN_LIMIT - total_scanned;
+ sc.nr_to_reclaim = reclaim_target - total_reclaimed;
+ shrink_mem_cgroup_node(nid, priority, &sc);
+ total_scanned += sc.nr_scanned;
+ total_reclaimed += sc.nr_reclaimed;
+ if (sc.nr_scanned < SWAP_CLUSTER_MAX) { /* no page ? */
+ nid = mem_cgroup_select_victim_node(mem);
+ priority = DEF_PRIORITY;
+ }
+ /*
+ * If priority == 0, swappiness will be ignored.
+ * we should avoid it.
+ */
+ if (priority > 1)
+ priority--;
+ }
+ /* if scan rate was not good, wake flusher thread */
+ if (total_scanned > total_reclaimed * 2)
+ wakeup_flusher_threads(total_scanned - total_reclaimed);
+
+ current->flags &= ~PF_SWAPWRITE;
+ *nr_scanned = total_scanned;
+ return total_reclaimed;
+}
#endif

/*
@@ -2393,6 +2535,7 @@ static unsigned long balance_pgdat(pg_da
.swappiness = vm_swappiness,
.order = order,
.mem_cgroup = NULL,
+ .scan_limit = ULONG_MAX,
};
struct shrink_control shrink = {
.gfp_mask = sc.gfp_mask,
@@ -2851,6 +2994,7 @@ unsigned long shrink_all_memory(unsigned
.hibernation_mode = 1,
.swappiness = vm_swappiness,
.order = 0,
+ .scan_limit = ULONG_MAX,
};
struct shrink_control shrink = {
.gfp_mask = sc.gfp_mask,
@@ -3038,6 +3182,7 @@ static int __zone_reclaim(struct zone *z
.gfp_mask = gfp_mask,
.swappiness = vm_swappiness,
.order = order,
+ .scan_limit = ULONG_MAX,
};
struct shrink_control shrink = {
.gfp_mask = sc.gfp_mask,
Index: memcg_async/include/linux/memcontrol.h
===================================================================
--- memcg_async.orig/include/linux/memcontrol.h
+++ memcg_async/include/linux/memcontrol.h
@@ -123,6 +123,8 @@ mem_cgroup_get_reclaim_stat_from_page(st
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);

+extern bool mem_cgroup_has_reclaimable(struct mem_cgroup *memcg);
+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
#endif
Index: memcg_async/mm/memcontrol.c
===================================================================
--- memcg_async.orig/mm/memcontrol.c
+++ memcg_async/mm/memcontrol.c
@@ -1783,7 +1783,7 @@ int mem_cgroup_select_victim_node(struct
* For non-NUMA, this cheks reclaimable pages on zones because we don't
* update scan_nodes.(see below)
*/
-static bool mem_cgroup_has_reclaimable(struct mem_cgroup *memcg)
+bool mem_cgroup_has_reclaimable(struct mem_cgroup *memcg)
{
return !nodes_empty(memcg->scan_nodes);
}
@@ -1799,7 +1799,7 @@ int mem_cgroup_select_victim_node(struct
return 0;
}

-static bool mem_cgroup_has_reclaimable(struct mem_cgroup *memcg)
+bool mem_cgroup_has_reclaimable(struct mem_cgroup *memcg)
{
unsigned long nr;
int zid;

2011-05-26 05:43:24

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [RFC][PATCH v3 10/10] memcg : reclaim statistics


This patch adds a file memory.reclaim_stat.

This file shows following.
==
recent_scan_success_ratio 12 # recent reclaim/scan ratio.
limit_scan_pages 671 # scan caused by hitting limit.
limit_freed_pages 538 # freed pages by limit_scan
limit_elapsed_ns 518555076 # elapsed time in LRU scanning by limit.
soft_scan_pages 0 # scan caused by softlimit.
soft_freed_pages 0 # freed pages by soft_scan.
soft_elapsed_ns 0 # elapsed time in LRU scanning by softlimit.
margin_scan_pages 16744221 # scan caused by auto-keep-margin
margin_freed_pages 565943 # freed pages by auto-keep-margin.
margin_elapsed_ns 5545388791 # elapsed time in LRU scanning by auto-keep-margin
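
For example, in the sample above the limit-triggered scans freed 538 of the
671 pages they scanned (roughly 80%), while the auto-keep-margin scans freed
565943 of 16744221 (about 3%); recent_scan_success_ratio is the decayed ratio
over all recent scans combined, not a lifetime per-type counter.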

This patch adds a new file rather than adding more stats to memory.stat. By
doing so, it supports resetting the accounting with:

# echo 0 > .../memory.reclaim_stat

This is useful for debugging and tuning.

TODO:
- add Documentation.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 79 insertions(+), 8 deletions(-)

Index: memcg_async/mm/memcontrol.c
===================================================================
--- memcg_async.orig/mm/memcontrol.c
+++ memcg_async/mm/memcontrol.c
@@ -216,6 +216,13 @@ static void mem_cgroup_update_margin_to_
static void mem_cgroup_may_async_reclaim(struct mem_cgroup *mem);
static void mem_cgroup_reflesh_scan_ratio(struct mem_cgroup *mem);

+enum scan_type {
+ LIMIT_SCAN, /* scan memory because memcg hits limit */
+ SOFT_SCAN, /* scan memory because of soft limit */
+ MARGIN_SCAN, /* scan memory for making margin to limit */
+ NR_SCAN_TYPES,
+};
+
/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
@@ -300,6 +307,13 @@ struct mem_cgroup {
unsigned long scanned;
unsigned long reclaimed;
unsigned long next_scanratio_update;
+ /* For statistics */
+ struct {
+ unsigned long nr_scanned_pages;
+ unsigned long nr_reclaimed_pages;
+ unsigned long elapsed_ns;
+ } scan_stat[NR_SCAN_TYPES];
+
/*
* percpu counter.
*/
@@ -1426,7 +1440,9 @@ unsigned int mem_cgroup_swappiness(struc

static void __mem_cgroup_update_scan_ratio(struct mem_cgroup *mem,
unsigned long scanned,
- unsigned long reclaimed)
+ unsigned long reclaimed,
+ unsigned long elapsed,
+ enum scan_type type)
{
unsigned long limit;

@@ -1439,6 +1455,9 @@ static void __mem_cgroup_update_scan_rat
mem->scanned /= 2;
mem->reclaimed /= 2;
}
+ mem->scan_stat[type].nr_scanned_pages += scanned;
+ mem->scan_stat[type].nr_reclaimed_pages += reclaimed;
+ mem->scan_stat[type].elapsed_ns += elapsed;
spin_unlock(&mem->scan_stat_lock);
}

@@ -1448,6 +1467,8 @@ static void __mem_cgroup_update_scan_rat
* @root : root memcg of hierarchy walk.
* @scanned : scanned pages
* @reclaimed: reclaimed pages.
+ * @elapsed: used time for memory reclaim
+ * @type : scan type as LIMIT_SCAN, SOFT_SCAN, MARGIN_SCAN.
*
 * record the scan/reclaim ratio to the memcg, both to the child and to its
 * root mem cgroup, which is the reclaim target. This value is used to detect
 * congestion and to determine sleep time at memory reclaim.
static void mem_cgroup_update_scan_ratio(struct mem_cgroup *mem,
struct mem_cgroup *root,
unsigned long scanned,
- unsigned long reclaimed)
+ unsigned long reclaimed,
+ unsigned long elapsed,
+ int type)
{
- __mem_cgroup_update_scan_ratio(mem, scanned, reclaimed);
+ __mem_cgroup_update_scan_ratio(mem, scanned, reclaimed, elapsed, type);
if (mem != root)
- __mem_cgroup_update_scan_ratio(root, scanned, reclaimed);
+ __mem_cgroup_update_scan_ratio(root, scanned, reclaimed,
+ elapsed, type);

}

@@ -1906,6 +1930,7 @@ static int mem_cgroup_hierarchical_recla
bool is_kswapd = false;
unsigned long excess;
unsigned long nr_scanned;
+ unsigned long start, end, elapsed;

excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;

@@ -1947,18 +1972,24 @@ static int mem_cgroup_hierarchical_recla
}
/* we use swappiness of local cgroup */
if (check_soft) {
+ start = sched_clock();
ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
noswap, zone, &nr_scanned);
+ end = sched_clock();
+ elapsed = end - start;
*total_scanned += nr_scanned;
mem_cgroup_soft_steal(victim, is_kswapd, ret);
mem_cgroup_soft_scan(victim, is_kswapd, nr_scanned);
mem_cgroup_update_scan_ratio(victim,
- root_mem, nr_scanned, ret);
+ root_mem, nr_scanned, ret, elapsed, SOFT_SCAN);
} else {
+ start = sched_clock();
ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
noswap, &nr_scanned);
+ end = sched_clock();
+ elapsed = end - start;
mem_cgroup_update_scan_ratio(victim,
- root_mem, nr_scanned, ret);
+ root_mem, nr_scanned, ret, elapsed, LIMIT_SCAN);
}
css_put(&victim->css);
/*
@@ -4003,7 +4034,7 @@ static void mem_cgroup_async_shrink_work
struct delayed_work *dw = to_delayed_work(work);
struct mem_cgroup *mem, *victim;
long nr_to_reclaim;
- unsigned long nr_scanned, nr_reclaimed;
+ unsigned long nr_scanned, nr_reclaimed, start, end;
int delay = 0;

mem = container_of(dw, struct mem_cgroup, async_work);
@@ -4022,9 +4053,12 @@ static void mem_cgroup_async_shrink_work
if (!victim)
goto finish_scan;

+ start = sched_clock();
nr_reclaimed = mem_cgroup_shrink_rate_limited(victim, nr_to_reclaim,
&nr_scanned);
- mem_cgroup_update_scan_ratio(victim, mem, nr_scanned, nr_reclaimed);
+ end = sched_clock();
+ mem_cgroup_update_scan_ratio(victim, mem, nr_scanned, nr_reclaimed,
+ end - start, MARGIN_SCAN);
css_put(&victim->css);

/* If margin is enough big, stop */
@@ -4680,6 +4714,38 @@ static int mem_control_stat_show(struct
return 0;
}

+static int mem_cgroup_reclaim_stat_read(struct cgroup *cont, struct cftype *cft,
+ struct cgroup_map_cb *cb)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+ u64 val;
+ int i; /* for indexing scan_stat[] */
+
+ val = mem->reclaimed * 100 / mem->scanned;
+ cb->fill(cb, "recent_scan_success_ratio", val);
+ i = LIMIT_SCAN;
+ cb->fill(cb, "limit_scan_pages", mem->scan_stat[i].nr_scanned_pages);
+ cb->fill(cb, "limit_freed_pages", mem->scan_stat[i].nr_reclaimed_pages);
+ cb->fill(cb, "limit_elapsed_ns", mem->scan_stat[i].elapsed_ns);
+ i = SOFT_SCAN;
+ cb->fill(cb, "soft_scan_pages", mem->scan_stat[i].nr_scanned_pages);
+ cb->fill(cb, "soft_freed_pages", mem->scan_stat[i].nr_reclaimed_pages);
+ cb->fill(cb, "soft_elapsed_ns", mem->scan_stat[i].elapsed_ns);
+ i = MARGIN_SCAN;
+ cb->fill(cb, "margin_scan_pages", mem->scan_stat[i].nr_scanned_pages);
+ cb->fill(cb, "margin_freed_pages", mem->scan_stat[i].nr_reclaimed_pages);
+ cb->fill(cb, "margin_elapsed_ns", mem->scan_stat[i].elapsed_ns);
+ return 0;
+}
+
+static int mem_cgroup_reclaim_stat_reset(struct cgroup *cgrp, unsigned int event)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+ memset(mem->scan_stat, 0, sizeof(mem->scan_stat));
+ return 0;
+}
+
+
/*
* User flags for async_control is a subset of mem->async_flags. But
* this needs to be defined independently to hide implemation details.
@@ -5163,6 +5229,11 @@ static struct cftype mem_cgroup_files[]
.open = mem_control_numa_stat_open,
},
#endif
+ {
+ .name = "reclaim_stat",
+ .read_map = mem_cgroup_reclaim_stat_read,
+ .trigger = mem_cgroup_reclaim_stat_reset,
+ }
};

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP

2011-05-26 09:38:14

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 7/10] workqueue: add WQ_IDLEPRI

Hello, KAMEZAWA.

On Thu, May 26, 2011 at 02:30:24PM +0900, KAMEZAWA Hiroyuki wrote:
> When this idea came to me, I wonder which is better to maintain
> memcg's thread pool or add support in workqueue for generic use. In
> genral, I feel enhancing genric one is better...so, wrote this one.

Sure, if it's something which can be useful for other users, it makes
sense to make it generic.

> Index: memcg_async/include/linux/workqueue.h
> ===================================================================
> --- memcg_async.orig/include/linux/workqueue.h
> +++ memcg_async/include/linux/workqueue.h
> @@ -56,7 +56,8 @@ enum {
>
> /* special cpu IDs */
> WORK_CPU_UNBOUND = NR_CPUS,
> - WORK_CPU_NONE = NR_CPUS + 1,
> + WORK_CPU_IDLEPRI = NR_CPUS + 1,
> + WORK_CPU_NONE = NR_CPUS + 2,
> WORK_CPU_LAST = WORK_CPU_NONE,

Hmmm... so, you're defining another fake CPU a la unbound CPU. I'm
not sure whether it's really necessary to create its own worker pool
tho. The reason why SCHED_OTHER is necessary is because it may
consume large amount of CPU cycles. Workqueue already has UNBOUND -
for an unbound one, workqueue code simply acts as generic worker pool
provider and everything other than work item dispatching and worker
management are deferred to scheduler and the workqueue user.

Is there any reason memcg can't just use UNBOUND workqueue and set
scheduling priority when the work item starts and restore it when it's
done? If it's gonna be using UNBOUND at all, I don't think changing
scheduling policy would be a noticeable overhead and I find having
separate worker pools depending on scheduling priority somewhat silly.

We can add a mechanism to manage work item scheduler priority to
workqueue if necessary tho, I think. But that would be per-workqueue
attribute which is applied during execution, not something per-gcwq.

Thanks.

--
tejun

2011-05-26 10:37:19

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 7/10] workqueue: add WQ_IDLEPRI

On Thu, 26 May 2011 11:38:08 +0200
Tejun Heo <[email protected]> wrote:

> Hello, KAMEZAWA.
>
> On Thu, May 26, 2011 at 02:30:24PM +0900, KAMEZAWA Hiroyuki wrote:
> > When this idea came to me, I wonder which is better to maintain
> > memcg's thread pool or add support in workqueue for generic use. In
> > genral, I feel enhancing genric one is better...so, wrote this one.
>
> Sure, if it's something which can be useful for other users, it makes
> sense to make it generic.
>
Thank you for review.


> > Index: memcg_async/include/linux/workqueue.h
> > ===================================================================
> > --- memcg_async.orig/include/linux/workqueue.h
> > +++ memcg_async/include/linux/workqueue.h
> > @@ -56,7 +56,8 @@ enum {
> >
> > /* special cpu IDs */
> > WORK_CPU_UNBOUND = NR_CPUS,
> > - WORK_CPU_NONE = NR_CPUS + 1,
> > + WORK_CPU_IDLEPRI = NR_CPUS + 1,
> > + WORK_CPU_NONE = NR_CPUS + 2,
> > WORK_CPU_LAST = WORK_CPU_NONE,
>
> Hmmm... so, you're defining another fake CPU a la unbound CPU. I'm
> not sure whether it's really necessary to create its own worker pool
> tho. The reason why SCHED_OTHER is necessary is because it may
> consume large amount of CPU cycles. Workqueue already has UNBOUND -
> for an unbound one, workqueue code simply acts as generic worker pool
> provider and everything other than work item dispatching and worker
> management are deferred to scheduler and the workqueue user.
>
yes.

> Is there any reason memcg can't just use UNBOUND workqueue and set
> scheduling priority when the work item starts and restore it when it's
> done?

I thought of that. But I didn't do it because I wasn't sure how others
would feel about changing the existing workqueue priority... and I was
curious to learn how workqueue works.

> If it's gonna be using UNBOUND at all, I don't think changing
> scheduling policy would be a noticeable overhead and I find having
> separate worker pools depending on scheduling priority somewhat silly.
>
ok.

> We can add a mechanism to manage work item scheduler priority to
> workqueue if necessary tho, I think. But that would be per-workqueue
> attribute which is applied during execution, not something per-gcwq.
>

In the next version, I'll try something like this:
==
process_one_work(...) {
.....
spin_unlock_irq(&gcwq->lock);
.....
if (cwq->wq->flags & WQ_IDLEPRI) {
set_scheduler(...SCHED_IDLE...)
cond_resched();
scheduler_switched = true;
}
f(work)
if (scheduler_switched)
set_scheduler(...SCHED_OTHER...)
spin_lock_irq(&gcwq->lock);
}
==
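
A slightly more concrete version of the same sketch (illustrative only; it
assumes sched_setscheduler_nocheck() can be used at this point and that
SCHED_NORMAL is the right policy to restore for SCHED_OTHER workers):
==
/* inside process_one_work(), sketch only */
bool scheduler_switched = false;

spin_unlock_irq(&gcwq->lock);
...
if (cwq->wq->flags & WQ_IDLEPRI) {
	struct sched_param param = { .sched_priority = 0 };

	/* run this work item at idle priority */
	sched_setscheduler_nocheck(current, SCHED_IDLE, &param);
	cond_resched();
	scheduler_switched = true;
}

f(work);	/* the work item's callback */

if (scheduler_switched) {
	struct sched_param param = { .sched_priority = 0 };

	/* restore normal priority before going back to the worker pool */
	sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
}
spin_lock_irq(&gcwq->lock);
==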
Patch size will be much smaller. (Should I do this in memcg's code ??)

Thank you for your advice.

Thanks,
-Kame

2011-05-26 10:57:10

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 7/10] workqueue: add WQ_IDLEPRI

On Thu, 26 May 2011 19:30:18 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Thu, 26 May 2011 11:38:08 +0200
> Tejun Heo <[email protected]> wrote:
>
> > Hello, KAMEZAWA.
> >
> > On Thu, May 26, 2011 at 02:30:24PM +0900, KAMEZAWA Hiroyuki wrote:
> > > When this idea came to me, I wonder which is better to maintain
> > > memcg's thread pool or add support in workqueue for generic use. In
> > > genral, I feel enhancing genric one is better...so, wrote this one.
> >
> > Sure, if it's something which can be useful for other users, it makes
> > sense to make it generic.
> >
> Thank you for review.
>
>
> > > Index: memcg_async/include/linux/workqueue.h
> > > ===================================================================
> > > --- memcg_async.orig/include/linux/workqueue.h
> > > +++ memcg_async/include/linux/workqueue.h
> > > @@ -56,7 +56,8 @@ enum {
> > >
> > > /* special cpu IDs */
> > > WORK_CPU_UNBOUND = NR_CPUS,
> > > - WORK_CPU_NONE = NR_CPUS + 1,
> > > + WORK_CPU_IDLEPRI = NR_CPUS + 1,
> > > + WORK_CPU_NONE = NR_CPUS + 2,
> > > WORK_CPU_LAST = WORK_CPU_NONE,
> >
> > Hmmm... so, you're defining another fake CPU a la unbound CPU. I'm
> > not sure whether it's really necessary to create its own worker pool
> > tho. The reason why SCHED_OTHER is necessary is because it may
> > consume large amount of CPU cycles. Workqueue already has UNBOUND -
> > for an unbound one, workqueue code simply acts as generic worker pool
> > provider and everything other than work item dispatching and worker
> > management are deferred to scheduler and the workqueue user.
> >
> yes.
>
> > Is there any reason memcg can't just use UNBOUND workqueue and set
> > scheduling priority when the work item starts and restore it when it's
> > done?
>
> I thought of that. But I didn't do that because I wasn't sure how others
> will think about changing exisitng workqueue priority...and I was curious
> to know how workqueue works.
>
> > If it's gonna be using UNBOUND at all, I don't think changing
> > scheduling policy would be a noticeable overhead and I find having
> > separate worker pools depending on scheduling priority somewhat silly.
> >
> ok.
>
> > We can add a mechanism to manage work item scheduler priority to
> > workqueue if necessary tho, I think. But that would be per-workqueue
> > attribute which is applied during execution, not something per-gcwq.
> >
>
> In the next version, I'll try some like..
> ==
> process_one_work(...) {
> .....
> spin_unlock_irq(&gcwq->lock);
> .....
> if (cwq->wq->flags & WQ_IDLEPRI) {
> set_scheduler(...SCHED_IDLE...)
> cond_resched();
> scheduler_switched = true;
> }
> f(work)
> if (scheduler_switched)
> set_scheduler(...SCHED_OTHER...)
> spin_lock_irq(&gcwq->lock);
> }
> ==
> Patch size will be much smaller. (Should I do this in memcg's code ??)
>

BTW, my concern is that if f(work) is short enough, the effect of SCHED_IDLE will
never be seen because the SCHED_OTHER -> SCHED_IDLE -> SCHED_OTHER switch is very
fast. The changed "weight" in CFS never affects the next vruntime calculation of
the thread, so the work will show the same behavior as with SCHED_OTHER.

I'm sorry if I misunderstand CFS and setscheduler().

Thanks,
-Kame



2011-05-26 11:44:14

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 7/10] workqueue: add WQ_IDLEPRI

Hello,

On Thu, May 26, 2011 at 07:50:19PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 26 May 2011 19:30:18 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
> > In the next version, I'll try some like..
> > ==
> > process_one_work(...) {
> > .....
> > spin_unlock_irq(&gcwq->lock);
> > .....
> > if (cwq->wq->flags & WQ_IDLEPRI) {
> > set_scheduler(...SCHED_IDLE...)
> > cond_resched();
> > scheduler_switched = true;
> > }
> > f(work)
> > if (scheduler_switched)
> > set_scheduler(...SCHED_OTHER...)
> > spin_lock_irq(&gcwq->lock);
> > }
> > ==
> > Patch size will be much smaller. (Should I do this in memcg's code ??)
> >
>
> BTW, my concern is that if f(work) is enough short,effect of SCHED_IDLE will never
> be found because SCHED_OTHER -> SCHED_IDLE -> SCHED_OTHER switch is very fast.
> Changed "weight" of CFQ never affects the next calculation of vruntime..of the
> thread and the work will show the same behavior with SCHED_OTHER.
>
> I'm sorry if I misunderstand CFQ and setscheduler().

Hmm... I'm not too familiar there either but,

* If prio is lowered (you're gonna lower it too, right?),
prio_changed_fair() is called which in turn does resched_task() as
necessary.

* More importantly, for short work items, it's likely to not matter at
all. If you can determine beforehand that it's not gonna take very
long time, queueing on system_wq would be more efficient.

Thanks.

--
tejun

2011-05-26 23:48:35

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 7/10] workqueue: add WQ_IDLEPRI

On Thu, 26 May 2011 13:44:06 +0200
Tejun Heo <[email protected]> wrote:

> Hello,
>
> On Thu, May 26, 2011 at 07:50:19PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 26 May 2011 19:30:18 +0900
> > KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > In the next version, I'll try some like..
> > > ==
> > > process_one_work(...) {
> > > .....
> > > spin_unlock_irq(&gcwq->lock);
> > > .....
> > > if (cwq->wq->flags & WQ_IDLEPRI) {
> > > set_scheduler(...SCHED_IDLE...)
> > > cond_resched();
> > > scheduler_switched = true;
> > > }
> > > f(work)
> > > if (scheduler_switched)
> > > set_scheduler(...SCHED_OTHER...)
> > > spin_lock_irq(&gcwq->lock);
> > > }
> > > ==
> > > Patch size will be much smaller. (Should I do this in memcg's code ??)
> > >
> >
> > BTW, my concern is that if f(work) is enough short,effect of SCHED_IDLE will never
> > be found because SCHED_OTHER -> SCHED_IDLE -> SCHED_OTHER switch is very fast.
> > Changed "weight" of CFQ never affects the next calculation of vruntime..of the
> > thread and the work will show the same behavior with SCHED_OTHER.
> >
> > I'm sorry if I misunderstand CFQ and setscheduler().
>
> Hmm... I'm not too familiar there either but,
>
> * If prio is lowered (you're gonna lower it too, right?),
> prio_changed_fair() is called which in turn does resched_task() as
> necessary.
>
> * More importantly, for short work items, it's likely to not matter at
> all. If you can determine beforehand that it's not gonna take very
> long time, queueing on system_wq would be more efficient.
>
> Thanks.
>

OK. Right now I use the following style:

(short work) -> requeue -> (short work) -> requeue

I'll change it to:

(set SCHED_IDLE) -> long work (until the end) -> (set SCHED_OTHER)

Then I'll see the behavior I want.

Thanks.
-Kame



2011-05-27 01:19:29

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 10/10] memcg : reclaim statistics

On Thu, 26 May 2011 18:17:04 -0700
Ying Han <[email protected]> wrote:

> Hi Kame:
>
> I applied the patch on top of mmotm-2011-05-12-15-52. After boot up, i
> keep getting the following crash by reading the
> /dev/cgroup/memory/memory.reclaim_stat
>
> [ 200.776366] Kernel panic - not syncing: Fatal exception
> [ 200.781591] Pid: 7535, comm: cat Tainted: G D W 2.6.39-mcg-DEV #130
> [ 200.788463] Call Trace:
> [ 200.790916] [<ffffffff81405a75>] panic+0x91/0x194
> [ 200.797096] [<ffffffff81408ac8>] oops_end+0xae/0xbe
> [ 200.803450] [<ffffffff810398d3>] die+0x5a/0x63
> [ 200.809366] [<ffffffff81408561>] do_trap+0x121/0x130
> [ 200.814427] [<ffffffff81037fe6>] do_divide_error+0x90/0x99
> [#1] SMP
> [ 200.821395] [<ffffffff81112bcb>] ? mem_cgroup_reclaim_stat_read+0x28/0xf0
> [ 200.829624] [<ffffffff81104509>] ? page_add_new_anon_rmap+0x7e/0x90
> [ 200.837372] [<ffffffff810fb7f8>] ? handle_pte_fault+0x28a/0x775
> [ 200.844773] [<ffffffff8140f0f5>] divide_error+0x15/0x20
> [ 200.851471] [<ffffffff81112bcb>] ? mem_cgroup_reclaim_stat_read+0x28/0xf0
> [ 200.859729] [<ffffffff810a4a01>] cgroup_seqfile_show+0x38/0x46
> [ 200.867036] [<ffffffff810a4d72>] ? cgroup_lock+0x17/0x17
> [ 200.872444] [<ffffffff81133f2c>] seq_read+0x182/0x361
> [ 200.878984] [<ffffffff8111a0c4>] vfs_read+0xab/0x107
> [ 200.885403] [<ffffffff8111a1e0>] sys_read+0x4a/0x6e
> [ 200.891764] [<ffffffff8140f469>] sysenter_dispatch+0x7/0x27
>
> I will debug it, but like to post here in case i missed some patches in between.
>

Maybe mem->scanned is 0. It must be mem->scanned + 1. Thank you for the report.

Thanks,
-kame

> --Ying
>
> On Wed, May 25, 2011 at 10:36 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> >
> > This patch adds a file memory.reclaim_stat.
> >
> > This file shows following.
> > ==
> > recent_scan_success_ratio  12 # recent reclaim/scan ratio.
> > limit_scan_pages 671          # scan caused by hitting limit.
> > limit_freed_pages 538         # freed pages by limit_scan
> > limit_elapsed_ns 518555076    # elapsed time in LRU scanning by limit.
> > soft_scan_pages 0             # scan caused by softlimit.
> > soft_freed_pages 0            # freed pages by soft_scan.
> > soft_elapsed_ns 0             # elapsed time in LRU scanning by softlimit.
> > margin_scan_pages 16744221    # scan caused by auto-keep-margin
> > margin_freed_pages 565943     # freed pages by auto-keep-margin.
> > margin_elapsed_ns 5545388791  # elapsed time in LRU scanning by auto-keep-margin
> >
> > This patch adds a new file rather than adding more stats to memory.stat. By it,
> > this support "reset" accounting by
> >
> >  # echo 0 > .../memory.reclaim_stat
> >
> > This is good for debug and tuning.
> >
> > TODO:
> >  - add Documentaion.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> >  mm/memcontrol.c |   87 ++++++++++++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 79 insertions(+), 8 deletions(-)
> >
> > Index: memcg_async/mm/memcontrol.c
> > ===================================================================
> > --- memcg_async.orig/mm/memcontrol.c
> > +++ memcg_async/mm/memcontrol.c
> > @@ -216,6 +216,13 @@ static void mem_cgroup_update_margin_to_
> >  static void mem_cgroup_may_async_reclaim(struct mem_cgroup *mem);
> >  static void mem_cgroup_reflesh_scan_ratio(struct mem_cgroup *mem);
> >
> > +enum scan_type {
> > +       LIMIT_SCAN,     /* scan memory because memcg hits limit */
> > +       SOFT_SCAN,      /* scan memory because of soft limit */
> > +       MARGIN_SCAN,    /* scan memory for making margin to limit */
> > +       NR_SCAN_TYPES,
> > +};
> > +
> >  /*
> >  * The memory controller data structure. The memory controller controls both
> >  * page cache and RSS per cgroup. We would eventually like to provide
> > @@ -300,6 +307,13 @@ struct mem_cgroup {
> >        unsigned long   scanned;
> >        unsigned long   reclaimed;
> >        unsigned long   next_scanratio_update;
> > +       /* For statistics */
> > +       struct {
> > +               unsigned long nr_scanned_pages;
> > +               unsigned long nr_reclaimed_pages;
> > +               unsigned long elapsed_ns;
> > +       } scan_stat[NR_SCAN_TYPES];
> > +
> >        /*
> >         * percpu counter.
> >         */
> > @@ -1426,7 +1440,9 @@ unsigned int mem_cgroup_swappiness(struc
> >
> >  static void __mem_cgroup_update_scan_ratio(struct mem_cgroup *mem,
> >                                unsigned long scanned,
> > -                               unsigned long reclaimed)
> > +                               unsigned long reclaimed,
> > +                               unsigned long elapsed,
> > +                               enum scan_type type)
> >  {
> >        unsigned long limit;
> >
> > @@ -1439,6 +1455,9 @@ static void __mem_cgroup_update_scan_rat
> >                mem->scanned /= 2;
> >                mem->reclaimed /= 2;
> >        }
> > +       mem->scan_stat[type].nr_scanned_pages += scanned;
> > +       mem->scan_stat[type].nr_reclaimed_pages += reclaimed;
> > +       mem->scan_stat[type].elapsed_ns += elapsed;
> >        spin_unlock(&mem->scan_stat_lock);
> >  }
> >
> > @@ -1448,6 +1467,8 @@ static void __mem_cgroup_update_scan_rat
> >  * @root : root memcg of hierarchy walk.
> >  * @scanned : scanned pages
> >  * @reclaimed: reclaimed pages.
> > + * @elapsed: used time for memory reclaim
> > + * @type : scan type as LIMIT_SCAN, SOFT_SCAN, MARGIN_SCAN.
> >  *
> >  * record scan/reclaim ratio to the memcg both to a child and it's root
> >  * mem cgroup, which is a reclaim target. This value is used for
> > @@ -1457,11 +1478,14 @@ static void __mem_cgroup_update_scan_rat
> >  static void mem_cgroup_update_scan_ratio(struct mem_cgroup *mem,
> >                                  struct mem_cgroup *root,
> >                                unsigned long scanned,
> > -                               unsigned long reclaimed)
> > +                               unsigned long reclaimed,
> > +                               unsigned long elapsed,
> > +                               int type)
> >  {
> > -       __mem_cgroup_update_scan_ratio(mem, scanned, reclaimed);
> > +       __mem_cgroup_update_scan_ratio(mem, scanned, reclaimed, elapsed, type);
> >        if (mem != root)
> > -               __mem_cgroup_update_scan_ratio(root, scanned, reclaimed);
> > +               __mem_cgroup_update_scan_ratio(root, scanned, reclaimed,
> > +                                       elapsed, type);
> >
> >  }
> >
> > @@ -1906,6 +1930,7 @@ static int mem_cgroup_hierarchical_recla
> >        bool is_kswapd = false;
> >        unsigned long excess;
> >        unsigned long nr_scanned;
> > +       unsigned long start, end, elapsed;
> >
> >        excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
> >
> > @@ -1947,18 +1972,24 @@ static int mem_cgroup_hierarchical_recla
> >                }
> >                /* we use swappiness of local cgroup */
> >                if (check_soft) {
> > +                       start = sched_clock();
> >                        ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
> >                                noswap, zone, &nr_scanned);
> > +                       end = sched_clock();
> > +                       elapsed = end - start;
> >                        *total_scanned += nr_scanned;
> >                        mem_cgroup_soft_steal(victim, is_kswapd, ret);
> >                        mem_cgroup_soft_scan(victim, is_kswapd, nr_scanned);
> >                        mem_cgroup_update_scan_ratio(victim,
> > -                                       root_mem, nr_scanned, ret);
> > +                               root_mem, nr_scanned, ret, elapsed, SOFT_SCAN);
> >                } else {
> > +                       start = sched_clock();
> >                        ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> >                                        noswap, &nr_scanned);
> > +                       end = sched_clock();
> > +                       elapsed = end - start;
> >                        mem_cgroup_update_scan_ratio(victim,
> > -                                       root_mem, nr_scanned, ret);
> > +                               root_mem, nr_scanned, ret, elapsed, LIMIT_SCAN);
> >                }
> >                css_put(&victim->css);
> >                /*
> > @@ -4003,7 +4034,7 @@ static void mem_cgroup_async_shrink_work
> >        struct delayed_work *dw = to_delayed_work(work);
> >        struct mem_cgroup *mem, *victim;
> >        long nr_to_reclaim;
> > -       unsigned long nr_scanned, nr_reclaimed;
> > +       unsigned long nr_scanned, nr_reclaimed, start, end;
> >        int delay = 0;
> >
> >        mem = container_of(dw, struct mem_cgroup, async_work);
> > @@ -4022,9 +4053,12 @@ static void mem_cgroup_async_shrink_work
> >        if (!victim)
> >                goto finish_scan;
> >
> > +       start = sched_clock();
> >        nr_reclaimed = mem_cgroup_shrink_rate_limited(victim, nr_to_reclaim,
> >                                        &nr_scanned);
> > -       mem_cgroup_update_scan_ratio(victim, mem, nr_scanned, nr_reclaimed);
> > +       end = sched_clock();
> > +       mem_cgroup_update_scan_ratio(victim, mem, nr_scanned, nr_reclaimed,
> > +                       end - start, MARGIN_SCAN);
> >        css_put(&victim->css);
> >
> >        /* If margin is enough big, stop */
> > @@ -4680,6 +4714,38 @@ static int mem_control_stat_show(struct
> >        return 0;
> >  }
> >
> > +static int mem_cgroup_reclaim_stat_read(struct cgroup *cont, struct cftype *cft,
> > +                                struct cgroup_map_cb *cb)
> > +{
> > +       struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> > +       u64 val;
> > +       int i; /* for indexing scan_stat[] */
> > +
> > +       val = mem->reclaimed * 100 / mem->scanned;
> > +       cb->fill(cb, "recent_scan_success_ratio", val);
> > +       i  = LIMIT_SCAN;
> > +       cb->fill(cb, "limit_scan_pages", mem->scan_stat[i].nr_scanned_pages);
> > +       cb->fill(cb, "limit_freed_pages", mem->scan_stat[i].nr_reclaimed_pages);
> > +       cb->fill(cb, "limit_elapsed_ns", mem->scan_stat[i].elapsed_ns);
> > +       i = SOFT_SCAN;
> > +       cb->fill(cb, "soft_scan_pages", mem->scan_stat[i].nr_scanned_pages);
> > +       cb->fill(cb, "soft_freed_pages", mem->scan_stat[i].nr_reclaimed_pages);
> > +       cb->fill(cb, "soft_elapsed_ns", mem->scan_stat[i].elapsed_ns);
> > +       i = MARGIN_SCAN;
> > +       cb->fill(cb, "margin_scan_pages", mem->scan_stat[i].nr_scanned_pages);
> > +       cb->fill(cb, "margin_freed_pages", mem->scan_stat[i].nr_reclaimed_pages);
> > +       cb->fill(cb, "margin_elapsed_ns", mem->scan_stat[i].elapsed_ns);
> > +       return 0;
> > +}
> > +
> > +static int mem_cgroup_reclaim_stat_reset(struct cgroup *cgrp, unsigned int event)
> > +{
> > +       struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> > +       memset(mem->scan_stat, 0, sizeof(mem->scan_stat));
> > +       return 0;
> > +}
> > +
> > +
> >  /*
> >  * User flags for async_control is a subset of mem->async_flags. But
> >  * this needs to be defined independently to hide implemation details.
> > @@ -5163,6 +5229,11 @@ static struct cftype mem_cgroup_files[]
> >                .open = mem_control_numa_stat_open,
> >        },
> >  #endif
> > +       {
> > +               .name = "reclaim_stat",
> > +               .read_map = mem_cgroup_reclaim_stat_read,
> > +               .trigger = mem_cgroup_reclaim_stat_reset,
> > +       }
> >  };
> >
> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2011-05-27 01:21:45

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 10/10] memcg : reclaim statistics

On Thu, 26 May 2011 18:17:04 -0700
Ying Han <[email protected]> wrote:

> Hi Kame:
>
> I applied the patch on top of mmotm-2011-05-12-15-52. After boot up, i
> keep getting the following crash by reading the
> /dev/cgroup/memory/memory.reclaim_stat
>
> [ 200.776366] Kernel panic - not syncing: Fatal exception
> [ 200.781591] Pid: 7535, comm: cat Tainted: G D W 2.6.39-mcg-DEV #130
> [ 200.788463] Call Trace:
> [ 200.790916] [<ffffffff81405a75>] panic+0x91/0x194
> [ 200.797096] [<ffffffff81408ac8>] oops_end+0xae/0xbe
> [ 200.803450] [<ffffffff810398d3>] die+0x5a/0x63
> [ 200.809366] [<ffffffff81408561>] do_trap+0x121/0x130
> [ 200.814427] [<ffffffff81037fe6>] do_divide_error+0x90/0x99
> [#1] SMP
> [ 200.821395] [<ffffffff81112bcb>] ? mem_cgroup_reclaim_stat_read+0x28/0xf0
> [ 200.829624] [<ffffffff81104509>] ? page_add_new_anon_rmap+0x7e/0x90
> [ 200.837372] [<ffffffff810fb7f8>] ? handle_pte_fault+0x28a/0x775
> [ 200.844773] [<ffffffff8140f0f5>] divide_error+0x15/0x20
> [ 200.851471] [<ffffffff81112bcb>] ? mem_cgroup_reclaim_stat_read+0x28/0xf0
> [ 200.859729] [<ffffffff810a4a01>] cgroup_seqfile_show+0x38/0x46
> [ 200.867036] [<ffffffff810a4d72>] ? cgroup_lock+0x17/0x17
> [ 200.872444] [<ffffffff81133f2c>] seq_read+0x182/0x361
> [ 200.878984] [<ffffffff8111a0c4>] vfs_read+0xab/0x107
> [ 200.885403] [<ffffffff8111a1e0>] sys_read+0x4a/0x6e
> [ 200.891764] [<ffffffff8140f469>] sysenter_dispatch+0x7/0x27
>
> I will debug it, but like to post here in case i missed some patches in between.
>

mem->scanned must be 0, so
mem->reclaimed * 100 / mem->scanned causes the error.

It should be mem->reclaimed * 100 / (mem->scanned + 1).

I'll fix it. Thank you for reporting.

Thanks,
-Kame


> --Ying
>
> On Wed, May 25, 2011 at 10:36 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> >
> > This patch adds a file memory.reclaim_stat.
> >
> > This file shows following.
> > ==
> > recent_scan_success_ratio  12 # recent reclaim/scan ratio.
> > limit_scan_pages 671          # scan caused by hitting limit.
> > limit_freed_pages 538         # freed pages by limit_scan
> > limit_elapsed_ns 518555076    # elapsed time in LRU scanning by limit.
> > soft_scan_pages 0             # scan caused by softlimit.
> > soft_freed_pages 0            # freed pages by soft_scan.
> > soft_elapsed_ns 0             # elapsed time in LRU scanning by softlimit.
> > margin_scan_pages 16744221    # scan caused by auto-keep-margin
> > margin_freed_pages 565943     # freed pages by auto-keep-margin.
> > margin_elapsed_ns 5545388791  # elapsed time in LRU scanning by auto-keep-margin
> >
> > This patch adds a new file rather than adding more stats to memory.stat. By it,
> > this support "reset" accounting by
> >
> >  # echo 0 > .../memory.reclaim_stat
> >
> > This is good for debug and tuning.
> >
> > TODO:
> >  - add Documentaion.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> >  mm/memcontrol.c |   87 ++++++++++++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 79 insertions(+), 8 deletions(-)
> >
> > Index: memcg_async/mm/memcontrol.c
> > ===================================================================
> > --- memcg_async.orig/mm/memcontrol.c
> > +++ memcg_async/mm/memcontrol.c
> > @@ -216,6 +216,13 @@ static void mem_cgroup_update_margin_to_
> >  static void mem_cgroup_may_async_reclaim(struct mem_cgroup *mem);
> >  static void mem_cgroup_reflesh_scan_ratio(struct mem_cgroup *mem);
> >
> > +enum scan_type {
> > +       LIMIT_SCAN,     /* scan memory because memcg hits limit */
> > +       SOFT_SCAN,      /* scan memory because of soft limit */
> > +       MARGIN_SCAN,    /* scan memory for making margin to limit */
> > +       NR_SCAN_TYPES,
> > +};
> > +
> >  /*
> >  * The memory controller data structure. The memory controller controls both
> >  * page cache and RSS per cgroup. We would eventually like to provide
> > @@ -300,6 +307,13 @@ struct mem_cgroup {
> >        unsigned long   scanned;
> >        unsigned long   reclaimed;
> >        unsigned long   next_scanratio_update;
> > +       /* For statistics */
> > +       struct {
> > +               unsigned long nr_scanned_pages;
> > +               unsigned long nr_reclaimed_pages;
> > +               unsigned long elapsed_ns;
> > +       } scan_stat[NR_SCAN_TYPES];
> > +
> >        /*
> >         * percpu counter.
> >         */
> > @@ -1426,7 +1440,9 @@ unsigned int mem_cgroup_swappiness(struc
> >
> >  static void __mem_cgroup_update_scan_ratio(struct mem_cgroup *mem,
> >                                unsigned long scanned,
> > -                               unsigned long reclaimed)
> > +                               unsigned long reclaimed,
> > +                               unsigned long elapsed,
> > +                               enum scan_type type)
> >  {
> >        unsigned long limit;
> >
> > @@ -1439,6 +1455,9 @@ static void __mem_cgroup_update_scan_rat
> >                mem->scanned /= 2;
> >                mem->reclaimed /= 2;
> >        }
> > +       mem->scan_stat[type].nr_scanned_pages += scanned;
> > +       mem->scan_stat[type].nr_reclaimed_pages += reclaimed;
> > +       mem->scan_stat[type].elapsed_ns += elapsed;
> >        spin_unlock(&mem->scan_stat_lock);
> >  }
> >
> > @@ -1448,6 +1467,8 @@ static void __mem_cgroup_update_scan_rat
> >  * @root : root memcg of hierarchy walk.
> >  * @scanned : scanned pages
> >  * @reclaimed: reclaimed pages.
> > + * @elapsed: used time for memory reclaim
> > + * @type : scan type as LIMIT_SCAN, SOFT_SCAN, MARGIN_SCAN.
> >  *
> >  * record scan/reclaim ratio to the memcg both to a child and it's root
> >  * mem cgroup, which is a reclaim target. This value is used for
> > @@ -1457,11 +1478,14 @@ static void __mem_cgroup_update_scan_rat
> >  static void mem_cgroup_update_scan_ratio(struct mem_cgroup *mem,
> >                                  struct mem_cgroup *root,
> >                                unsigned long scanned,
> > -                               unsigned long reclaimed)
> > +                               unsigned long reclaimed,
> > +                               unsigned long elapsed,
> > +                               int type)
> >  {
> > -       __mem_cgroup_update_scan_ratio(mem, scanned, reclaimed);
> > +       __mem_cgroup_update_scan_ratio(mem, scanned, reclaimed, elapsed, type);
> >        if (mem != root)
> > -               __mem_cgroup_update_scan_ratio(root, scanned, reclaimed);
> > +               __mem_cgroup_update_scan_ratio(root, scanned, reclaimed,
> > +                                       elapsed, type);
> >
> >  }
> >
> > @@ -1906,6 +1930,7 @@ static int mem_cgroup_hierarchical_recla
> >        bool is_kswapd = false;
> >        unsigned long excess;
> >        unsigned long nr_scanned;
> > +       unsigned long start, end, elapsed;
> >
> >        excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
> >
> > @@ -1947,18 +1972,24 @@ static int mem_cgroup_hierarchical_recla
> >                }
> >                /* we use swappiness of local cgroup */
> >                if (check_soft) {
> > +                       start = sched_clock();
> >                        ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
> >                                noswap, zone, &nr_scanned);
> > +                       end = sched_clock();
> > +                       elapsed = end - start;
> >                        *total_scanned += nr_scanned;
> >                        mem_cgroup_soft_steal(victim, is_kswapd, ret);
> >                        mem_cgroup_soft_scan(victim, is_kswapd, nr_scanned);
> >                        mem_cgroup_update_scan_ratio(victim,
> > -                                       root_mem, nr_scanned, ret);
> > +                               root_mem, nr_scanned, ret, elapsed, SOFT_SCAN);
> >                } else {
> > +                       start = sched_clock();
> >                        ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> >                                        noswap, &nr_scanned);
> > +                       end = sched_clock();
> > +                       elapsed = end - start;
> >                        mem_cgroup_update_scan_ratio(victim,
> > -                                       root_mem, nr_scanned, ret);
> > +                               root_mem, nr_scanned, ret, elapsed, LIMIT_SCAN);
> >                }
> >                css_put(&victim->css);
> >                /*
> > @@ -4003,7 +4034,7 @@ static void mem_cgroup_async_shrink_work
> >        struct delayed_work *dw = to_delayed_work(work);
> >        struct mem_cgroup *mem, *victim;
> >        long nr_to_reclaim;
> > -       unsigned long nr_scanned, nr_reclaimed;
> > +       unsigned long nr_scanned, nr_reclaimed, start, end;
> >        int delay = 0;
> >
> >        mem = container_of(dw, struct mem_cgroup, async_work);
> > @@ -4022,9 +4053,12 @@ static void mem_cgroup_async_shrink_work
> >        if (!victim)
> >                goto finish_scan;
> >
> > +       start = sched_clock();
> >        nr_reclaimed = mem_cgroup_shrink_rate_limited(victim, nr_to_reclaim,
> >                                        &nr_scanned);
> > -       mem_cgroup_update_scan_ratio(victim, mem, nr_scanned, nr_reclaimed);
> > +       end = sched_clock();
> > +       mem_cgroup_update_scan_ratio(victim, mem, nr_scanned, nr_reclaimed,
> > +                       end - start, MARGIN_SCAN);
> >        css_put(&victim->css);
> >
> >        /* If margin is enough big, stop */
> > @@ -4680,6 +4714,38 @@ static int mem_control_stat_show(struct
> >        return 0;
> >  }
> >
> > +static int mem_cgroup_reclaim_stat_read(struct cgroup *cont, struct cftype *cft,
> > +                                struct cgroup_map_cb *cb)
> > +{
> > +       struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> > +       u64 val;
> > +       int i; /* for indexing scan_stat[] */
> > +
> > +       val = mem->reclaimed * 100 / mem->scanned;
> > +       cb->fill(cb, "recent_scan_success_ratio", val);
> > +       i  = LIMIT_SCAN;
> > +       cb->fill(cb, "limit_scan_pages", mem->scan_stat[i].nr_scanned_pages);
> > +       cb->fill(cb, "limit_freed_pages", mem->scan_stat[i].nr_reclaimed_pages);
> > +       cb->fill(cb, "limit_elapsed_ns", mem->scan_stat[i].elapsed_ns);
> > +       i = SOFT_SCAN;
> > +       cb->fill(cb, "soft_scan_pages", mem->scan_stat[i].nr_scanned_pages);
> > +       cb->fill(cb, "soft_freed_pages", mem->scan_stat[i].nr_reclaimed_pages);
> > +       cb->fill(cb, "soft_elapsed_ns", mem->scan_stat[i].elapsed_ns);
> > +       i = MARGIN_SCAN;
> > +       cb->fill(cb, "margin_scan_pages", mem->scan_stat[i].nr_scanned_pages);
> > +       cb->fill(cb, "margin_freed_pages", mem->scan_stat[i].nr_reclaimed_pages);
> > +       cb->fill(cb, "margin_elapsed_ns", mem->scan_stat[i].elapsed_ns);
> > +       return 0;
> > +}
> > +
> > +static int mem_cgroup_reclaim_stat_reset(struct cgroup *cgrp, unsigned int event)
> > +{
> > +       struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> > +       memset(mem->scan_stat, 0, sizeof(mem->scan_stat));
> > +       return 0;
> > +}
> > +
> > +
> >  /*
> >  * User flags for async_control is a subset of mem->async_flags. But
> >  * this needs to be defined independently to hide implemation details.
> > @@ -5163,6 +5229,11 @@ static struct cftype mem_cgroup_files[]
> >                .open = mem_control_numa_stat_open,
> >        },
> >  #endif
> > +       {
> > +               .name = "reclaim_stat",
> > +               .read_map = mem_cgroup_reclaim_stat_read,
> > +               .trigger = mem_cgroup_reclaim_stat_reset,
> > +       }
> >  };
> >
> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> >

2011-05-27 01:17:10

by Ying Han

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 10/10] memcg : reclaim statistics

Hi Kame:

I applied the patch on top of mmotm-2011-05-12-15-52. After boot-up, I
keep getting the following crash when reading
/dev/cgroup/memory/memory.reclaim_stat:

[ 200.776366] Kernel panic - not syncing: Fatal exception
[ 200.781591] Pid: 7535, comm: cat Tainted: G D W 2.6.39-mcg-DEV #130
[ 200.788463] Call Trace:
[ 200.790916] [<ffffffff81405a75>] panic+0x91/0x194
[ 200.797096] [<ffffffff81408ac8>] oops_end+0xae/0xbe
[ 200.803450] [<ffffffff810398d3>] die+0x5a/0x63
[ 200.809366] [<ffffffff81408561>] do_trap+0x121/0x130
[ 200.814427] [<ffffffff81037fe6>] do_divide_error+0x90/0x99
[#1] SMP
[ 200.821395] [<ffffffff81112bcb>] ? mem_cgroup_reclaim_stat_read+0x28/0xf0
[ 200.829624] [<ffffffff81104509>] ? page_add_new_anon_rmap+0x7e/0x90
[ 200.837372] [<ffffffff810fb7f8>] ? handle_pte_fault+0x28a/0x775
[ 200.844773] [<ffffffff8140f0f5>] divide_error+0x15/0x20
[ 200.851471] [<ffffffff81112bcb>] ? mem_cgroup_reclaim_stat_read+0x28/0xf0
[ 200.859729] [<ffffffff810a4a01>] cgroup_seqfile_show+0x38/0x46
[ 200.867036] [<ffffffff810a4d72>] ? cgroup_lock+0x17/0x17
[ 200.872444] [<ffffffff81133f2c>] seq_read+0x182/0x361
[ 200.878984] [<ffffffff8111a0c4>] vfs_read+0xab/0x107
[ 200.885403] [<ffffffff8111a1e0>] sys_read+0x4a/0x6e
[ 200.891764] [<ffffffff8140f469>] sysenter_dispatch+0x7/0x27

I will debug it, but I'd like to post it here in case I missed some patches in between.

--Ying

On Wed, May 25, 2011 at 10:36 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
>
> This patch adds a file memory.reclaim_stat.
>
> This file shows the following.
> ==
> recent_scan_success_ratio 12  # recent reclaim/scan ratio.
> limit_scan_pages 671          # scan caused by hitting limit.
> limit_freed_pages 538         # freed pages by limit_scan.
> limit_elapsed_ns 518555076    # elapsed time in LRU scanning by limit.
> soft_scan_pages 0             # scan caused by softlimit.
> soft_freed_pages 0            # freed pages by soft_scan.
> soft_elapsed_ns 0             # elapsed time in LRU scanning by softlimit.
> margin_scan_pages 16744221    # scan caused by auto-keep-margin.
> margin_freed_pages 565943     # freed pages by auto-keep-margin.
> margin_elapsed_ns 5545388791  # elapsed time in LRU scanning by auto-keep-margin.
>
> This patch adds a new file rather than adding more stats to memory.stat.
> With this, the accounting can be reset by
>
>  # echo 0 > .../memory.reclaim_stat
>
> This is good for debugging and tuning.
>
> TODO:
>  - add Documentation.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
>  mm/memcontrol.c |   87 ++++++++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 79 insertions(+), 8 deletions(-)
>
> Index: memcg_async/mm/memcontrol.c
> ===================================================================
> --- memcg_async.orig/mm/memcontrol.c
> +++ memcg_async/mm/memcontrol.c
> @@ -216,6 +216,13 @@ static void mem_cgroup_update_margin_to_
> ?static void mem_cgroup_may_async_reclaim(struct mem_cgroup *mem);
> ?static void mem_cgroup_reflesh_scan_ratio(struct mem_cgroup *mem);
>
> +enum scan_type {
> + ? ? ? LIMIT_SCAN, ? ? /* scan memory because memcg hits limit */
> + ? ? ? SOFT_SCAN, ? ? ?/* scan memory because of soft limit */
> + ? ? ? MARGIN_SCAN, ? ?/* scan memory for making margin to limit */
> + ? ? ? NR_SCAN_TYPES,
> +};
> +
> ?/*
> ?* The memory controller data structure. The memory controller controls both
> ?* page cache and RSS per cgroup. We would eventually like to provide
> @@ -300,6 +307,13 @@ struct mem_cgroup {
> ? ? ? ?unsigned long ? scanned;
> ? ? ? ?unsigned long ? reclaimed;
> ? ? ? ?unsigned long ? next_scanratio_update;
> + ? ? ? /* For statistics */
> + ? ? ? struct {
> + ? ? ? ? ? ? ? unsigned long nr_scanned_pages;
> + ? ? ? ? ? ? ? unsigned long nr_reclaimed_pages;
> + ? ? ? ? ? ? ? unsigned long elapsed_ns;
> + ? ? ? } scan_stat[NR_SCAN_TYPES];
> +
> ? ? ? ?/*
> ? ? ? ? * percpu counter.
> ? ? ? ? */
> @@ -1426,7 +1440,9 @@ unsigned int mem_cgroup_swappiness(struc
>
> ?static void __mem_cgroup_update_scan_ratio(struct mem_cgroup *mem,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?unsigned long scanned,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long reclaimed)
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long reclaimed,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long elapsed,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? enum scan_type type)
> ?{
> ? ? ? ?unsigned long limit;
>
> @@ -1439,6 +1455,9 @@ static void __mem_cgroup_update_scan_rat
> ? ? ? ? ? ? ? ?mem->scanned /= 2;
> ? ? ? ? ? ? ? ?mem->reclaimed /= 2;
> ? ? ? ?}
> + ? ? ? mem->scan_stat[type].nr_scanned_pages += scanned;
> + ? ? ? mem->scan_stat[type].nr_reclaimed_pages += reclaimed;
> + ? ? ? mem->scan_stat[type].elapsed_ns += elapsed;
> ? ? ? ?spin_unlock(&mem->scan_stat_lock);
> ?}
>
> @@ -1448,6 +1467,8 @@ static void __mem_cgroup_update_scan_rat
> ?* @root : root memcg of hierarchy walk.
> ?* @scanned : scanned pages
> ?* @reclaimed: reclaimed pages.
> + * @elapsed: used time for memory reclaim
> + * @type : scan type as LIMIT_SCAN, SOFT_SCAN, MARGIN_SCAN.
> ?*
> ?* record scan/reclaim ratio to the memcg both to a child and it's root
> ?* mem cgroup, which is a reclaim target. This value is used for
> @@ -1457,11 +1478,14 @@ static void __mem_cgroup_update_scan_rat
> ?static void mem_cgroup_update_scan_ratio(struct mem_cgroup *mem,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct mem_cgroup *root,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?unsigned long scanned,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long reclaimed)
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long reclaimed,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long elapsed,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? int type)
> ?{
> - ? ? ? __mem_cgroup_update_scan_ratio(mem, scanned, reclaimed);
> + ? ? ? __mem_cgroup_update_scan_ratio(mem, scanned, reclaimed, elapsed, type);
> ? ? ? ?if (mem != root)
> - ? ? ? ? ? ? ? __mem_cgroup_update_scan_ratio(root, scanned, reclaimed);
> + ? ? ? ? ? ? ? __mem_cgroup_update_scan_ratio(root, scanned, reclaimed,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? elapsed, type);
>
> ?}
>
> @@ -1906,6 +1930,7 @@ static int mem_cgroup_hierarchical_recla
> ? ? ? ?bool is_kswapd = false;
> ? ? ? ?unsigned long excess;
> ? ? ? ?unsigned long nr_scanned;
> + ? ? ? unsigned long start, end, elapsed;
>
> ? ? ? ?excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
>
> @@ -1947,18 +1972,24 @@ static int mem_cgroup_hierarchical_recla
> ? ? ? ? ? ? ? ?}
> ? ? ? ? ? ? ? ?/* we use swappiness of local cgroup */
> ? ? ? ? ? ? ? ?if (check_soft) {
> + ? ? ? ? ? ? ? ? ? ? ? start = sched_clock();
> ? ? ? ? ? ? ? ? ? ? ? ?ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?noswap, zone, &nr_scanned);
> + ? ? ? ? ? ? ? ? ? ? ? end = sched_clock();
> + ? ? ? ? ? ? ? ? ? ? ? elapsed = end - start;
> ? ? ? ? ? ? ? ? ? ? ? ?*total_scanned += nr_scanned;
> ? ? ? ? ? ? ? ? ? ? ? ?mem_cgroup_soft_steal(victim, is_kswapd, ret);
> ? ? ? ? ? ? ? ? ? ? ? ?mem_cgroup_soft_scan(victim, is_kswapd, nr_scanned);
> ? ? ? ? ? ? ? ? ? ? ? ?mem_cgroup_update_scan_ratio(victim,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? root_mem, nr_scanned, ret);
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? root_mem, nr_scanned, ret, elapsed, SOFT_SCAN);
> ? ? ? ? ? ? ? ?} else {
> + ? ? ? ? ? ? ? ? ? ? ? start = sched_clock();
> ? ? ? ? ? ? ? ? ? ? ? ?ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?noswap, &nr_scanned);
> + ? ? ? ? ? ? ? ? ? ? ? end = sched_clock();
> + ? ? ? ? ? ? ? ? ? ? ? elapsed = end - start;
> ? ? ? ? ? ? ? ? ? ? ? ?mem_cgroup_update_scan_ratio(victim,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? root_mem, nr_scanned, ret);
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? root_mem, nr_scanned, ret, elapsed, LIMIT_SCAN);
> ? ? ? ? ? ? ? ?}
> ? ? ? ? ? ? ? ?css_put(&victim->css);
> ? ? ? ? ? ? ? ?/*
> @@ -4003,7 +4034,7 @@ static void mem_cgroup_async_shrink_work
> ? ? ? ?struct delayed_work *dw = to_delayed_work(work);
> ? ? ? ?struct mem_cgroup *mem, *victim;
> ? ? ? ?long nr_to_reclaim;
> - ? ? ? unsigned long nr_scanned, nr_reclaimed;
> + ? ? ? unsigned long nr_scanned, nr_reclaimed, start, end;
> ? ? ? ?int delay = 0;
>
> ? ? ? ?mem = container_of(dw, struct mem_cgroup, async_work);
> @@ -4022,9 +4053,12 @@ static void mem_cgroup_async_shrink_work
> ? ? ? ?if (!victim)
> ? ? ? ? ? ? ? ?goto finish_scan;
>
> + ? ? ? start = sched_clock();
> ? ? ? ?nr_reclaimed = mem_cgroup_shrink_rate_limited(victim, nr_to_reclaim,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?&nr_scanned);
> - ? ? ? mem_cgroup_update_scan_ratio(victim, mem, nr_scanned, nr_reclaimed);
> + ? ? ? end = sched_clock();
> + ? ? ? mem_cgroup_update_scan_ratio(victim, mem, nr_scanned, nr_reclaimed,
> + ? ? ? ? ? ? ? ? ? ? ? end - start, MARGIN_SCAN);
> ? ? ? ?css_put(&victim->css);
>
> ? ? ? ?/* If margin is enough big, stop */
> @@ -4680,6 +4714,38 @@ static int mem_control_stat_show(struct
> ? ? ? ?return 0;
> ?}
>
> +static int mem_cgroup_reclaim_stat_read(struct cgroup *cont, struct cftype *cft,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct cgroup_map_cb *cb)
> +{
> + ? ? ? struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> + ? ? ? u64 val;
> + ? ? ? int i; /* for indexing scan_stat[] */
> +
> + ? ? ? val = mem->reclaimed * 100 / mem->scanned;
> + ? ? ? cb->fill(cb, "recent_scan_success_ratio", val);
> + ? ? ? i ?= LIMIT_SCAN;
> + ? ? ? cb->fill(cb, "limit_scan_pages", mem->scan_stat[i].nr_scanned_pages);
> + ? ? ? cb->fill(cb, "limit_freed_pages", mem->scan_stat[i].nr_reclaimed_pages);
> + ? ? ? cb->fill(cb, "limit_elapsed_ns", mem->scan_stat[i].elapsed_ns);
> + ? ? ? i = SOFT_SCAN;
> + ? ? ? cb->fill(cb, "soft_scan_pages", mem->scan_stat[i].nr_scanned_pages);
> + ? ? ? cb->fill(cb, "soft_freed_pages", mem->scan_stat[i].nr_reclaimed_pages);
> + ? ? ? cb->fill(cb, "soft_elapsed_ns", mem->scan_stat[i].elapsed_ns);
> + ? ? ? i = MARGIN_SCAN;
> + ? ? ? cb->fill(cb, "margin_scan_pages", mem->scan_stat[i].nr_scanned_pages);
> + ? ? ? cb->fill(cb, "margin_freed_pages", mem->scan_stat[i].nr_reclaimed_pages);
> + ? ? ? cb->fill(cb, "margin_elapsed_ns", mem->scan_stat[i].elapsed_ns);
> + ? ? ? return 0;
> +}
> +
> +static int mem_cgroup_reclaim_stat_reset(struct cgroup *cgrp, unsigned int event)
> +{
> + ? ? ? struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> + ? ? ? memset(mem->scan_stat, 0, sizeof(mem->scan_stat));
> + ? ? ? return 0;
> +}
> +
> +
> ?/*
> ?* User flags for async_control is a subset of mem->async_flags. But
> ?* this needs to be defined independently to hide implemation details.
> @@ -5163,6 +5229,11 @@ static struct cftype mem_cgroup_files[]
> ? ? ? ? ? ? ? ?.open = mem_control_numa_stat_open,
> ? ? ? ?},
> ?#endif
> + ? ? ? {
> + ? ? ? ? ? ? ? .name = "reclaim_stat",
> + ? ? ? ? ? ? ? .read_map = mem_cgroup_reclaim_stat_read,
> + ? ? ? ? ? ? ? .trigger = mem_cgroup_reclaim_stat_reset,
> + ? ? ? }
> ?};
>
> ?#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
>

2011-05-27 01:22:23

by Ying Han

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 10/10] memcg : reclaim statistics

On Thu, May 26, 2011 at 6:14 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Thu, 26 May 2011 18:17:04 -0700
> Ying Han <[email protected]> wrote:
>
>> Hi Kame:
>>
>> I applied the patch on top of mmotm-2011-05-12-15-52. After boot up, i
>> keep getting the following crash by reading the
>> /dev/cgroup/memory/memory.reclaim_stat
>>
>> [ ?200.776366] Kernel panic - not syncing: Fatal exception
>> [ ?200.781591] Pid: 7535, comm: cat Tainted: G ? ? ?D W ? 2.6.39-mcg-DEV #130
>> [ ?200.788463] Call Trace:
>> [ ?200.790916] ?[<ffffffff81405a75>] panic+0x91/0x194
>> [ ?200.797096] ?[<ffffffff81408ac8>] oops_end+0xae/0xbe
>> [ ?200.803450] ?[<ffffffff810398d3>] die+0x5a/0x63
>> [ ?200.809366] ?[<ffffffff81408561>] do_trap+0x121/0x130
>> [ ?200.814427] ?[<ffffffff81037fe6>] do_divide_error+0x90/0x99
>> [#1] SMP
>> [ ?200.821395] ?[<ffffffff81112bcb>] ? mem_cgroup_reclaim_stat_read+0x28/0xf0
>> [ ?200.829624] ?[<ffffffff81104509>] ? page_add_new_anon_rmap+0x7e/0x90
>> [ ?200.837372] ?[<ffffffff810fb7f8>] ? handle_pte_fault+0x28a/0x775
>> [ ?200.844773] ?[<ffffffff8140f0f5>] divide_error+0x15/0x20
>> [ ?200.851471] ?[<ffffffff81112bcb>] ? mem_cgroup_reclaim_stat_read+0x28/0xf0
>> [ ?200.859729] ?[<ffffffff810a4a01>] cgroup_seqfile_show+0x38/0x46
>> [ ?200.867036] ?[<ffffffff810a4d72>] ? cgroup_lock+0x17/0x17
>> [ ?200.872444] ?[<ffffffff81133f2c>] seq_read+0x182/0x361
>> [ ?200.878984] ?[<ffffffff8111a0c4>] vfs_read+0xab/0x107
>> [ ?200.885403] ?[<ffffffff8111a1e0>] sys_read+0x4a/0x6e
>> [ ?200.891764] ?[<ffffffff8140f469>] sysenter_dispatch+0x7/0x27
>>
>> I will debug it, but like to post here in case i missed some patches in between.
>>
>
> It must be that mem->scanned is 0, and
>     mem->reclaimed * 100 / mem->scanned causes the divide error.
>
> It should be mem->reclaimed * 100 / (mem->scanned + 1).
>
> I'll fix. thank you for reporting.

That is what I thought :) I will apply the change for now
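For reference, the guarded read then looks roughly like this (a minimal sketch of
the fix being discussed here, not the final patch):

    /* the +1 avoids the divide-by-zero when nothing has been scanned yet */
    val = mem->reclaimed * 100 / (mem->scanned + 1);
    cb->fill(cb, "recent_scan_success_ratio", val);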

--Ying
>
> Thanks,
> -Kame
>
>
>> --Ying
>>
>> On Wed, May 25, 2011 at 10:36 PM, KAMEZAWA Hiroyuki
>> <[email protected]> wrote:
>> >
>> > This patch adds a file memory.reclaim_stat.
>> >
>> > This file shows following.
>> > ==
>> > recent_scan_success_ratio ?12 # recent reclaim/scan ratio.
>> > limit_scan_pages 671 ? ? ? ? ?# scan caused by hitting limit.
>> > limit_freed_pages 538 ? ? ? ? # freed pages by limit_scan
>> > limit_elapsed_ns 518555076 ? ?# elapsed time in LRU scanning by limit.
>> > soft_scan_pages 0 ? ? ? ? ? ? # scan caused by softlimit.
>> > soft_freed_pages 0 ? ? ? ? ? ?# freed pages by soft_scan.
>> > soft_elapsed_ns 0 ? ? ? ? ? ? # elapsed time in LRU scanning by softlimit.
>> > margin_scan_pages 16744221 ? ?# scan caused by auto-keep-margin
>> > margin_freed_pages 565943 ? ? # freed pages by auto-keep-margin.
>> > margin_elapsed_ns 5545388791 ?# elapsed time in LRU scanning by auto-keep-margin
>> >
>> > This patch adds a new file rather than adding more stats to memory.stat. By it,
>> > this support "reset" accounting by
>> >
>> > ?# echo 0 > .../memory.reclaim_stat
>> >
>> > This is good for debug and tuning.
>> >
>> > TODO:
>> > ?- add Documentaion.
>> >
>> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>> > ---
>> > ?mm/memcontrol.c | ? 87 ++++++++++++++++++++++++++++++++++++++++++++++++++------
>> > ?1 file changed, 79 insertions(+), 8 deletions(-)
>> >
>> > Index: memcg_async/mm/memcontrol.c
>> > ===================================================================
>> > --- memcg_async.orig/mm/memcontrol.c
>> > +++ memcg_async/mm/memcontrol.c
>> > @@ -216,6 +216,13 @@ static void mem_cgroup_update_margin_to_
>> > ?static void mem_cgroup_may_async_reclaim(struct mem_cgroup *mem);
>> > ?static void mem_cgroup_reflesh_scan_ratio(struct mem_cgroup *mem);
>> >
>> > +enum scan_type {
>> > + ? ? ? LIMIT_SCAN, ? ? /* scan memory because memcg hits limit */
>> > + ? ? ? SOFT_SCAN, ? ? ?/* scan memory because of soft limit */
>> > + ? ? ? MARGIN_SCAN, ? ?/* scan memory for making margin to limit */
>> > + ? ? ? NR_SCAN_TYPES,
>> > +};
>> > +
>> > ?/*
>> > ?* The memory controller data structure. The memory controller controls both
>> > ?* page cache and RSS per cgroup. We would eventually like to provide
>> > @@ -300,6 +307,13 @@ struct mem_cgroup {
>> > ? ? ? ?unsigned long ? scanned;
>> > ? ? ? ?unsigned long ? reclaimed;
>> > ? ? ? ?unsigned long ? next_scanratio_update;
>> > + ? ? ? /* For statistics */
>> > + ? ? ? struct {
>> > + ? ? ? ? ? ? ? unsigned long nr_scanned_pages;
>> > + ? ? ? ? ? ? ? unsigned long nr_reclaimed_pages;
>> > + ? ? ? ? ? ? ? unsigned long elapsed_ns;
>> > + ? ? ? } scan_stat[NR_SCAN_TYPES];
>> > +
>> > ? ? ? ?/*
>> > ? ? ? ? * percpu counter.
>> > ? ? ? ? */
>> > @@ -1426,7 +1440,9 @@ unsigned int mem_cgroup_swappiness(struc
>> >
>> > ?static void __mem_cgroup_update_scan_ratio(struct mem_cgroup *mem,
>> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?unsigned long scanned,
>> > - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long reclaimed)
>> > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long reclaimed,
>> > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long elapsed,
>> > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? enum scan_type type)
>> > ?{
>> > ? ? ? ?unsigned long limit;
>> >
>> > @@ -1439,6 +1455,9 @@ static void __mem_cgroup_update_scan_rat
>> > ? ? ? ? ? ? ? ?mem->scanned /= 2;
>> > ? ? ? ? ? ? ? ?mem->reclaimed /= 2;
>> > ? ? ? ?}
>> > + ? ? ? mem->scan_stat[type].nr_scanned_pages += scanned;
>> > + ? ? ? mem->scan_stat[type].nr_reclaimed_pages += reclaimed;
>> > + ? ? ? mem->scan_stat[type].elapsed_ns += elapsed;
>> > ? ? ? ?spin_unlock(&mem->scan_stat_lock);
>> > ?}
>> >
>> > @@ -1448,6 +1467,8 @@ static void __mem_cgroup_update_scan_rat
>> > ?* @root : root memcg of hierarchy walk.
>> > ?* @scanned : scanned pages
>> > ?* @reclaimed: reclaimed pages.
>> > + * @elapsed: used time for memory reclaim
>> > + * @type : scan type as LIMIT_SCAN, SOFT_SCAN, MARGIN_SCAN.
>> > ?*
>> > ?* record scan/reclaim ratio to the memcg both to a child and it's root
>> > ?* mem cgroup, which is a reclaim target. This value is used for
>> > @@ -1457,11 +1478,14 @@ static void __mem_cgroup_update_scan_rat
>> > ?static void mem_cgroup_update_scan_ratio(struct mem_cgroup *mem,
>> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct mem_cgroup *root,
>> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?unsigned long scanned,
>> > - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long reclaimed)
>> > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long reclaimed,
>> > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long elapsed,
>> > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? int type)
>> > ?{
>> > - ? ? ? __mem_cgroup_update_scan_ratio(mem, scanned, reclaimed);
>> > + ? ? ? __mem_cgroup_update_scan_ratio(mem, scanned, reclaimed, elapsed, type);
>> > ? ? ? ?if (mem != root)
>> > - ? ? ? ? ? ? ? __mem_cgroup_update_scan_ratio(root, scanned, reclaimed);
>> > + ? ? ? ? ? ? ? __mem_cgroup_update_scan_ratio(root, scanned, reclaimed,
>> > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? elapsed, type);
>> >
>> > ?}
>> >
>> > @@ -1906,6 +1930,7 @@ static int mem_cgroup_hierarchical_recla
>> > ? ? ? ?bool is_kswapd = false;
>> > ? ? ? ?unsigned long excess;
>> > ? ? ? ?unsigned long nr_scanned;
>> > + ? ? ? unsigned long start, end, elapsed;
>> >
>> > ? ? ? ?excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
>> >
>> > @@ -1947,18 +1972,24 @@ static int mem_cgroup_hierarchical_recla
>> > ? ? ? ? ? ? ? ?}
>> > ? ? ? ? ? ? ? ?/* we use swappiness of local cgroup */
>> > ? ? ? ? ? ? ? ?if (check_soft) {
>> > + ? ? ? ? ? ? ? ? ? ? ? start = sched_clock();
>> > ? ? ? ? ? ? ? ? ? ? ? ?ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
>> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?noswap, zone, &nr_scanned);
>> > + ? ? ? ? ? ? ? ? ? ? ? end = sched_clock();
>> > + ? ? ? ? ? ? ? ? ? ? ? elapsed = end - start;
>> > ? ? ? ? ? ? ? ? ? ? ? ?*total_scanned += nr_scanned;
>> > ? ? ? ? ? ? ? ? ? ? ? ?mem_cgroup_soft_steal(victim, is_kswapd, ret);
>> > ? ? ? ? ? ? ? ? ? ? ? ?mem_cgroup_soft_scan(victim, is_kswapd, nr_scanned);
>> > ? ? ? ? ? ? ? ? ? ? ? ?mem_cgroup_update_scan_ratio(victim,
>> > - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? root_mem, nr_scanned, ret);
>> > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? root_mem, nr_scanned, ret, elapsed, SOFT_SCAN);
>> > ? ? ? ? ? ? ? ?} else {
>> > + ? ? ? ? ? ? ? ? ? ? ? start = sched_clock();
>> > ? ? ? ? ? ? ? ? ? ? ? ?ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
>> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?noswap, &nr_scanned);
>> > + ? ? ? ? ? ? ? ? ? ? ? end = sched_clock();
>> > + ? ? ? ? ? ? ? ? ? ? ? elapsed = end - start;
>> > ? ? ? ? ? ? ? ? ? ? ? ?mem_cgroup_update_scan_ratio(victim,
>> > - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? root_mem, nr_scanned, ret);
>> > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? root_mem, nr_scanned, ret, elapsed, LIMIT_SCAN);
>> > ? ? ? ? ? ? ? ?}
>> > ? ? ? ? ? ? ? ?css_put(&victim->css);
>> > ? ? ? ? ? ? ? ?/*
>> > @@ -4003,7 +4034,7 @@ static void mem_cgroup_async_shrink_work
>> > ? ? ? ?struct delayed_work *dw = to_delayed_work(work);
>> > ? ? ? ?struct mem_cgroup *mem, *victim;
>> > ? ? ? ?long nr_to_reclaim;
>> > - ? ? ? unsigned long nr_scanned, nr_reclaimed;
>> > + ? ? ? unsigned long nr_scanned, nr_reclaimed, start, end;
>> > ? ? ? ?int delay = 0;
>> >
>> > ? ? ? ?mem = container_of(dw, struct mem_cgroup, async_work);
>> > @@ -4022,9 +4053,12 @@ static void mem_cgroup_async_shrink_work
>> > ? ? ? ?if (!victim)
>> > ? ? ? ? ? ? ? ?goto finish_scan;
>> >
>> > + ? ? ? start = sched_clock();
>> > ? ? ? ?nr_reclaimed = mem_cgroup_shrink_rate_limited(victim, nr_to_reclaim,
>> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?&nr_scanned);
>> > - ? ? ? mem_cgroup_update_scan_ratio(victim, mem, nr_scanned, nr_reclaimed);
>> > + ? ? ? end = sched_clock();
>> > + ? ? ? mem_cgroup_update_scan_ratio(victim, mem, nr_scanned, nr_reclaimed,
>> > + ? ? ? ? ? ? ? ? ? ? ? end - start, MARGIN_SCAN);
>> > ? ? ? ?css_put(&victim->css);
>> >
>> > ? ? ? ?/* If margin is enough big, stop */
>> > @@ -4680,6 +4714,38 @@ static int mem_control_stat_show(struct
>> > ? ? ? ?return 0;
>> > ?}
>> >
>> > +static int mem_cgroup_reclaim_stat_read(struct cgroup *cont, struct cftype *cft,
>> > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct cgroup_map_cb *cb)
>> > +{
>> > + ? ? ? struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
>> > + ? ? ? u64 val;
>> > + ? ? ? int i; /* for indexing scan_stat[] */
>> > +
>> > + ? ? ? val = mem->reclaimed * 100 / mem->scanned;
>> > + ? ? ? cb->fill(cb, "recent_scan_success_ratio", val);
>> > + ? ? ? i ?= LIMIT_SCAN;
>> > + ? ? ? cb->fill(cb, "limit_scan_pages", mem->scan_stat[i].nr_scanned_pages);
>> > + ? ? ? cb->fill(cb, "limit_freed_pages", mem->scan_stat[i].nr_reclaimed_pages);
>> > + ? ? ? cb->fill(cb, "limit_elapsed_ns", mem->scan_stat[i].elapsed_ns);
>> > + ? ? ? i = SOFT_SCAN;
>> > + ? ? ? cb->fill(cb, "soft_scan_pages", mem->scan_stat[i].nr_scanned_pages);
>> > + ? ? ? cb->fill(cb, "soft_freed_pages", mem->scan_stat[i].nr_reclaimed_pages);
>> > + ? ? ? cb->fill(cb, "soft_elapsed_ns", mem->scan_stat[i].elapsed_ns);
>> > + ? ? ? i = MARGIN_SCAN;
>> > + ? ? ? cb->fill(cb, "margin_scan_pages", mem->scan_stat[i].nr_scanned_pages);
>> > + ? ? ? cb->fill(cb, "margin_freed_pages", mem->scan_stat[i].nr_reclaimed_pages);
>> > + ? ? ? cb->fill(cb, "margin_elapsed_ns", mem->scan_stat[i].elapsed_ns);
>> > + ? ? ? return 0;
>> > +}
>> > +
>> > +static int mem_cgroup_reclaim_stat_reset(struct cgroup *cgrp, unsigned int event)
>> > +{
>> > + ? ? ? struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
>> > + ? ? ? memset(mem->scan_stat, 0, sizeof(mem->scan_stat));
>> > + ? ? ? return 0;
>> > +}
>> > +
>> > +
>> > ?/*
>> > ?* User flags for async_control is a subset of mem->async_flags. But
>> > ?* this needs to be defined independently to hide implemation details.
>> > @@ -5163,6 +5229,11 @@ static struct cftype mem_cgroup_files[]
>> > ? ? ? ? ? ? ? ?.open = mem_control_numa_stat_open,
>> > ? ? ? ?},
>> > ?#endif
>> > + ? ? ? {
>> > + ? ? ? ? ? ? ? .name = "reclaim_stat",
>> > + ? ? ? ? ? ? ? .read_map = mem_cgroup_reclaim_stat_read,
>> > + ? ? ? ? ? ? ? .trigger = mem_cgroup_reclaim_stat_reset,
>> > + ? ? ? }
>> > ?};
>> >
>> > ?#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
>> >
>
>

2011-05-27 01:49:32

by Ying Han

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 0/10] memcg async reclaim

On Wed, May 25, 2011 at 10:10 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
>
> It's now merge window...I just dump my patch queue to hear other's idea.
> I wonder I should wait until dirty_ratio for memcg is queued to mmotm...
> I'll be busy with LinuxCon Japan etc...in the next week.
>
> This patch is onto mmotm-May-11 + some patches queued in mmotm, as numa_stat.
>
> This is a patch for memcg to keep margin to the limit in background.
> By keeping some margin to the limit in background, application can
> avoid foreground memory reclaim at charge() and this will help latency.
>
> Main changes from v2 is.
> ?- use SCHED_IDLE.
> ?- removed most of heuristic codes. Now, code is very simple.
>
> By using SCHED_IDLE, async memory reclaim can only consume 0.3%? of cpu
> if the system is truely busy but can use much CPU if the cpu is idle.
> Because my purpose is for reducing latency without affecting other running
> applications, SCHED_IDLE fits this work.
>
> If application need to stop by some I/O or event, background memory reclaim
> will cull memory while the system is idle.
>
> Perforemce:
> ?Running an httpd (apache) under 300M limit. And access 600MB working set
> ?with normalized distribution access by apatch-bench.
> ?apatch bench's concurrency was 4 and did 40960 accesses.
>
> Without async reclaim:
> Connection Times (ms)
> ? ? ? ? ? ? ?min ?mean[+/-sd] median ? max
> Connect: ? ? ? ?0 ? ?0 ? 0.0 ? ? ?0 ? ? ? 2
> Processing: ? ?30 ? 37 ?28.3 ? ? 32 ? ?1793
> Waiting: ? ? ? 28 ? 35 ?25.5 ? ? 31 ? ?1792
> Total: ? ? ? ? 30 ? 37 ?28.4 ? ? 32 ? ?1793
>
> Percentage of the requests served within a certain time (ms)
> ?50% ? ? 32
> ?66% ? ? 32
> ?75% ? ? 33
> ?80% ? ? 34
> ?90% ? ? 39
> ?95% ? ? 60
> ?98% ? ?100
> ?99% ? ?133
> ?100% ? 1793 (longest request)
>
> With async reclaim:
> Connection Times (ms)
> ? ? ? ? ? ? ?min ?mean[+/-sd] median ? max
> Connect: ? ? ? ?0 ? ?0 ? 0.0 ? ? ?0 ? ? ? 2
> Processing: ? ?30 ? 35 ?12.3 ? ? 32 ? ? 678
> Waiting: ? ? ? 28 ? 34 ?12.0 ? ? 31 ? ? 658
> Total: ? ? ? ? 30 ? 35 ?12.3 ? ? 32 ? ? 678
>
> Percentage of the requests served within a certain time (ms)
> ?50% ? ? 32
> ?66% ? ? 32
> ?75% ? ? 33
> ?80% ? ? 34
> ?90% ? ? 39
> ?95% ? ? 49
> ?98% ? ? 71
> ?99% ? ? 86
> ?100% ? ?678 (longest request)
>
>
> It seems latency is stabilized by hiding memory reclaim.
>
> The score for memory reclaim was following.
> See patch 10 for meaning of each member.
>
> == without async reclaim ==
> recent_scan_success_ratio 44
> limit_scan_pages 388463
> limit_freed_pages 162238
> limit_elapsed_ns 13852159231
> soft_scan_pages 0
> soft_freed_pages 0
> soft_elapsed_ns 0
> margin_scan_pages 0
> margin_freed_pages 0
> margin_elapsed_ns 0
>
> == with async reclaim ==
> recent_scan_success_ratio 6
> limit_scan_pages 0
> limit_freed_pages 0
> limit_elapsed_ns 0
> soft_scan_pages 0
> soft_freed_pages 0
> soft_elapsed_ns 0
> margin_scan_pages 1295556
> margin_freed_pages 122450
> margin_elapsed_ns 644881521
>
>
> For this case, SCHED_IDLE workqueue can reclaim enough memory to the httpd.
>
> I may need to dig why scan_success_ratio is far different in the both case.
> I guess the difference of epalsed_ns is because several threads enter
> memory reclaim when async reclaim doesn't run. But may not...
>


Hmm... I noticed a very strange behavior in a simple test with the patch set.

Test:
I created a 4g memcg and started doing cat. The memcg gets OOM-killed
as soon as it reaches its hard limit. We shouldn't hit OOM even
without async reclaim.

Again, I will read through the patch, but I'd like to post the test result first.

$ echo $$ >/dev/cgroup/memory/A/tasks
$ cat /dev/cgroup/memory/A/memory.limit_in_bytes
4294967296

$ time cat /export/hdc3/dd_A/tf0 > /dev/zero
Killed

real 0m53.565s
user 0m0.061s
sys 0m4.814s

Here is the OOM log:

May 26 18:43:00 kernel: [ 963.489112] cat invoked oom-killer:
gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
May 26 18:43:00 kernel: [ 963.489121] Pid: 9425, comm: cat Tainted:
G W 2.6.39-mcg-DEV #131
May 26 18:43:00 kernel: [ 963.489123] Call Trace:
May 26 18:43:00 kernel: [ 963.489134] [<ffffffff810e3512>]
dump_header+0x82/0x1af
May 26 18:43:00 kernel: [ 963.489137] [<ffffffff810e33ca>] ?
spin_lock+0xe/0x10
May 26 18:43:00 kernel: [ 963.489140] [<ffffffff810e33f9>] ?
find_lock_task_mm+0x2d/0x67
May 26 18:43:00 kernel: [ 963.489143] [<ffffffff810e38dd>]
oom_kill_process+0x50/0x27b
May 26 18:43:00 kernel: [ 963.489155] [<ffffffff810e3dc6>]
mem_cgroup_out_of_memory+0x9a/0xe4
May 26 18:43:00 kernel: [ 963.489160] [<ffffffff811153aa>]
mem_cgroup_handle_oom+0x134/0x1fe
May 26 18:43:00 kernel: [ 963.489163] [<ffffffff81114a72>] ?
__mem_cgroup_insert_exceeded+0x83/0x83
May 26 18:43:00 kernel: [ 963.489176] [<ffffffff811166e9>]
__mem_cgroup_try_charge.clone.3+0x368/0x43a
May 26 18:43:00 kernel: [ 963.489179] [<ffffffff81117586>]
mem_cgroup_cache_charge+0x95/0x123
May 26 18:43:00 kernel: [ 963.489183] [<ffffffff810e16d8>]
add_to_page_cache_locked+0x42/0x114
May 26 18:43:00 kernel: [ 963.489185] [<ffffffff810e17db>]
add_to_page_cache_lru+0x31/0x5f
May 26 18:43:00 kernel: [ 963.489189] [<ffffffff81145636>]
mpage_readpages+0xb6/0x132
May 26 18:43:00 kernel: [ 963.489194] [<ffffffff8119992f>] ?
noalloc_get_block_write+0x24/0x24
May 26 18:43:00 kernel: [ 963.489197] [<ffffffff8119992f>] ?
noalloc_get_block_write+0x24/0x24
May 26 18:43:00 kernel: [ 963.489201] [<ffffffff81036742>] ?
__switch_to+0x160/0x212
May 26 18:43:00 kernel: [ 963.489205] [<ffffffff811978b2>]
ext4_readpages+0x1d/0x1f
May 26 18:43:00 kernel: [ 963.489209] [<ffffffff810e8d4b>]
__do_page_cache_readahead+0x144/0x1e3
May 26 18:43:00 kernel: [ 963.489212] [<ffffffff810e8e0b>]
ra_submit+0x21/0x25
May 26 18:43:00 kernel: [ 963.489215] [<ffffffff810e9075>]
ondemand_readahead+0x18c/0x19f
May 26 18:43:00 kernel: [ 963.489218] [<ffffffff810e9105>]
page_cache_async_readahead+0x7d/0x86
May 26 18:43:00 kernel: [ 963.489221] [<ffffffff810e2b7e>]
generic_file_aio_read+0x2d8/0x5fe
May 26 18:43:00 kernel: [ 963.489225] [<ffffffff81119626>]
do_sync_read+0xcb/0x108
May 26 18:43:00 kernel: [ 963.489230] [<ffffffff811f168a>] ?
fsnotify_perm+0x66/0x72
May 26 18:43:00 kernel: [ 963.489233] [<ffffffff811f16f7>] ?
security_file_permission+0x2e/0x33
May 26 18:43:00 kernel: [ 963.489236] [<ffffffff8111a0c8>]
vfs_read+0xab/0x107
May 26 18:43:00 kernel: [ 963.489239] [<ffffffff8111a1e4>] sys_read+0x4a/0x6e
May 26 18:43:00 kernel: [ 963.489244] [<ffffffff8140f469>]
sysenter_dispatch+0x7/0x27
May 26 18:43:00 kernel: [ 963.489248] Task in /A killed as a result
of limit of /A
May 26 18:43:00 kernel: [ 963.489251] memory: usage 4194304kB, limit
4194304kB, failcnt 26
May 26 18:43:00 kernel: [ 963.489253] memory+swap: usage 0kB, limit
9007199254740991kB, failcnt 0

--Ying

>
> Thanks,
> -Kame
>
>
>
>
>

2011-05-27 02:23:30

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 0/10] memcg async reclaim

On Thu, 26 May 2011 18:49:26 -0700
Ying Han <[email protected]> wrote:

> On Wed, May 25, 2011 at 10:10 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> >
> > It's now merge window...I just dump my patch queue to hear other's idea.
> > I wonder I should wait until dirty_ratio for memcg is queued to mmotm...
> > I'll be busy with LinuxCon Japan etc...in the next week.
> >
> > This patch is onto mmotm-May-11 + some patches queued in mmotm, as numa_stat.
> >
> > This is a patch for memcg to keep margin to the limit in background.
> > By keeping some margin to the limit in background, application can
> > avoid foreground memory reclaim at charge() and this will help latency.
> >
> > Main changes from v2 is.
> >  - use SCHED_IDLE.
> >  - removed most of heuristic codes. Now, code is very simple.
> >
> > By using SCHED_IDLE, async memory reclaim can only consume 0.3%? of cpu
> > if the system is truely busy but can use much CPU if the cpu is idle.
> > Because my purpose is for reducing latency without affecting other running
> > applications, SCHED_IDLE fits this work.
> >
> > If application need to stop by some I/O or event, background memory reclaim
> > will cull memory while the system is idle.
> >
> > Perforemce:
> >  Running an httpd (apache) under 300M limit. And access 600MB working set
> >  with normalized distribution access by apatch-bench.
> >  apatch bench's concurrency was 4 and did 40960 accesses.
> >
> > Without async reclaim:
> > Connection Times (ms)
> >              min  mean[+/-sd] median   max
> > Connect:        0    0   0.0      0       2
> > Processing:    30   37  28.3     32    1793
> > Waiting:       28   35  25.5     31    1792
> > Total:         30   37  28.4     32    1793
> >
> > Percentage of the requests served within a certain time (ms)
> >  50%     32
> >  66%     32
> >  75%     33
> >  80%     34
> >  90%     39
> >  95%     60
> >  98%    100
> >  99%    133
> >  100%   1793 (longest request)
> >
> > With async reclaim:
> > Connection Times (ms)
> >              min  mean[+/-sd] median   max
> > Connect:        0    0   0.0      0       2
> > Processing:    30   35  12.3     32     678
> > Waiting:       28   34  12.0     31     658
> > Total:         30   35  12.3     32     678
> >
> > Percentage of the requests served within a certain time (ms)
> >  50%     32
> >  66%     32
> >  75%     33
> >  80%     34
> >  90%     39
> >  95%     49
> >  98%     71
> >  99%     86
> >  100%    678 (longest request)
> >
> >
> > It seems latency is stabilized by hiding memory reclaim.
> >
> > The score for memory reclaim was following.
> > See patch 10 for meaning of each member.
> >
> > == without async reclaim ==
> > recent_scan_success_ratio 44
> > limit_scan_pages 388463
> > limit_freed_pages 162238
> > limit_elapsed_ns 13852159231
> > soft_scan_pages 0
> > soft_freed_pages 0
> > soft_elapsed_ns 0
> > margin_scan_pages 0
> > margin_freed_pages 0
> > margin_elapsed_ns 0
> >
> > == with async reclaim ==
> > recent_scan_success_ratio 6
> > limit_scan_pages 0
> > limit_freed_pages 0
> > limit_elapsed_ns 0
> > soft_scan_pages 0
> > soft_freed_pages 0
> > soft_elapsed_ns 0
> > margin_scan_pages 1295556
> > margin_freed_pages 122450
> > margin_elapsed_ns 644881521
> >
> >
> > For this case, SCHED_IDLE workqueue can reclaim enough memory to the httpd.
> >
> > I may need to dig why scan_success_ratio is far different in the both case.
> > I guess the difference of epalsed_ns is because several threads enter
> > memory reclaim when async reclaim doesn't run. But may not...
> >
>
>
> Hmm.. I noticed a very strange behavior on a simple test w/ the patch set.
>
> Test:
> I created a 4g memcg and start doing cat. Then the memcg being OOM
> killed as soon as it reaches its hard_limit. We shouldn't hit OOM even
> w/o async-reclaim.
>
> Again, I will read through the patch. But like to post the test result first.
>
> $ echo $$ >/dev/cgroup/memory/A/tasks
> $ cat /dev/cgroup/memory/A/memory.limit_in_bytes
> 4294967296
>
> $ time cat /export/hdc3/dd_A/tf0 > /dev/zero
> Killed
>

I did the same kind of test without any problem... but OK, I'll do more tests
later.



> real 0m53.565s
> user 0m0.061s
> sys 0m4.814s
>
> Here is the OOM log:
>
> May 26 18:43:00 kernel: [ 963.489112] cat invoked oom-killer:
> gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
> May 26 18:43:00 kernel: [ 963.489121] Pid: 9425, comm: cat Tainted:
> G W 2.6.39-mcg-DEV #131
> May 26 18:43:00 kernel: [ 963.489123] Call Trace:
> May 26 18:43:00 kernel: [ 963.489134] [<ffffffff810e3512>]
> dump_header+0x82/0x1af
> May 26 18:43:00 kernel: [ 963.489137] [<ffffffff810e33ca>] ?
> spin_lock+0xe/0x10
> May 26 18:43:00 kernel: [ 963.489140] [<ffffffff810e33f9>] ?
> find_lock_task_mm+0x2d/0x67
> May 26 18:43:00 kernel: [ 963.489143] [<ffffffff810e38dd>]
> oom_kill_process+0x50/0x27b
> May 26 18:43:00 kernel: [ 963.489155] [<ffffffff810e3dc6>]
> mem_cgroup_out_of_memory+0x9a/0xe4
> May 26 18:43:00 kernel: [ 963.489160] [<ffffffff811153aa>]
> mem_cgroup_handle_oom+0x134/0x1fe
> May 26 18:43:00 kernel: [ 963.489163] [<ffffffff81114a72>] ?
> __mem_cgroup_insert_exceeded+0x83/0x83
> May 26 18:43:00 kernel: [ 963.489176] [<ffffffff811166e9>]
> __mem_cgroup_try_charge.clone.3+0x368/0x43a
> May 26 18:43:00 kernel: [ 963.489179] [<ffffffff81117586>]
> mem_cgroup_cache_charge+0x95/0x123
> May 26 18:43:00 kernel: [ 963.489183] [<ffffffff810e16d8>]
> add_to_page_cache_locked+0x42/0x114
> May 26 18:43:00 kernel: [ 963.489185] [<ffffffff810e17db>]
> add_to_page_cache_lru+0x31/0x5f
> May 26 18:43:00 kernel: [ 963.489189] [<ffffffff81145636>]
> mpage_readpages+0xb6/0x132
> May 26 18:43:00 kernel: [ 963.489194] [<ffffffff8119992f>] ?
> noalloc_get_block_write+0x24/0x24
> May 26 18:43:00 kernel: [ 963.489197] [<ffffffff8119992f>] ?
> noalloc_get_block_write+0x24/0x24
> May 26 18:43:00 kernel: [ 963.489201] [<ffffffff81036742>] ?
> __switch_to+0x160/0x212
> May 26 18:43:00 kernel: [ 963.489205] [<ffffffff811978b2>]
> ext4_readpages+0x1d/0x1f
> May 26 18:43:00 kernel: [ 963.489209] [<ffffffff810e8d4b>]
> __do_page_cache_readahead+0x144/0x1e3
> May 26 18:43:00 kernel: [ 963.489212] [<ffffffff810e8e0b>]
> ra_submit+0x21/0x25
> May 26 18:43:00 kernel: [ 963.489215] [<ffffffff810e9075>]
> ondemand_readahead+0x18c/0x19f
> May 26 18:43:00 kernel: [ 963.489218] [<ffffffff810e9105>]
> page_cache_async_readahead+0x7d/0x86
> May 26 18:43:00 kernel: [ 963.489221] [<ffffffff810e2b7e>]
> generic_file_aio_read+0x2d8/0x5fe
> May 26 18:43:00 kernel: [ 963.489225] [<ffffffff81119626>]
> do_sync_read+0xcb/0x108
> May 26 18:43:00 kernel: [ 963.489230] [<ffffffff811f168a>] ?
> fsnotify_perm+0x66/0x72
> May 26 18:43:00 kernel: [ 963.489233] [<ffffffff811f16f7>] ?
> security_file_permission+0x2e/0x33
> May 26 18:43:00 kernel: [ 963.489236] [<ffffffff8111a0c8>]
> vfs_read+0xab/0x107
> May 26 18:43:00 kernel: [ 963.489239] [<ffffffff8111a1e4>] sys_read+0x4a/0x6e
> May 26 18:43:00 kernel: [ 963.489244] [<ffffffff8140f469>]
> sysenter_dispatch+0x7/0x27
> May 26 18:43:00 kernel: [ 963.489248] Task in /A killed as a result
> of limit of /A
> May 26 18:43:00 kernel: [ 963.489251] memory: usage 4194304kB, limit
> 4194304kB, failcnt 26
> May 26 18:43:00 kernel: [ 963.489253] memory+swap: usage 0kB, limit
> 9007199254740991kB, failcnt 0
>

Hmm, why is memory+swap usage 0kB here...

In this set, I used mem_cgroup_margin() rather than res_counter_margin().
Hmm, did you disable swap accounting? If so, I may be missing something.

Thanks,
-Kame



2011-05-27 02:55:26

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 0/10] memcg async reclaim

On Thu, 26 May 2011 18:49:26 -0700
Ying Han <[email protected]> wrote:

> On Wed, May 25, 2011 at 10:10 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> >
> > It's now merge window...I just dump my patch queue to hear other's idea.
> > I wonder I should wait until dirty_ratio for memcg is queued to mmotm...
> > I'll be busy with LinuxCon Japan etc...in the next week.
> >
> > This patch is onto mmotm-May-11 + some patches queued in mmotm, as numa_stat.
> >
> > This is a patch for memcg to keep margin to the limit in background.
> > By keeping some margin to the limit in background, application can
> > avoid foreground memory reclaim at charge() and this will help latency.
> >
> > Main changes from v2 is.
> >  - use SCHED_IDLE.
> >  - removed most of heuristic codes. Now, code is very simple.
> >
> > By using SCHED_IDLE, async memory reclaim can only consume 0.3%? of cpu
> > if the system is truely busy but can use much CPU if the cpu is idle.
> > Because my purpose is for reducing latency without affecting other running
> > applications, SCHED_IDLE fits this work.
> >
> > If application need to stop by some I/O or event, background memory reclaim
> > will cull memory while the system is idle.
> >
> > Perforemce:
> >  Running an httpd (apache) under 300M limit. And access 600MB working set
> >  with normalized distribution access by apatch-bench.
> >  apatch bench's concurrency was 4 and did 40960 accesses.
> >
> > Without async reclaim:
> > Connection Times (ms)
> >              min  mean[+/-sd] median   max
> > Connect:        0    0   0.0      0       2
> > Processing:    30   37  28.3     32    1793
> > Waiting:       28   35  25.5     31    1792
> > Total:         30   37  28.4     32    1793
> >
> > Percentage of the requests served within a certain time (ms)
> >  50%     32
> >  66%     32
> >  75%     33
> >  80%     34
> >  90%     39
> >  95%     60
> >  98%    100
> >  99%    133
> >  100%   1793 (longest request)
> >
> > With async reclaim:
> > Connection Times (ms)
> >              min  mean[+/-sd] median   max
> > Connect:        0    0   0.0      0       2
> > Processing:    30   35  12.3     32     678
> > Waiting:       28   34  12.0     31     658
> > Total:         30   35  12.3     32     678
> >
> > Percentage of the requests served within a certain time (ms)
> >  50%     32
> >  66%     32
> >  75%     33
> >  80%     34
> >  90%     39
> >  95%     49
> >  98%     71
> >  99%     86
> >  100%    678 (longest request)
> >
> >
> > It seems latency is stabilized by hiding memory reclaim.
> >
> > The score for memory reclaim was following.
> > See patch 10 for meaning of each member.
> >
> > == without async reclaim ==
> > recent_scan_success_ratio 44
> > limit_scan_pages 388463
> > limit_freed_pages 162238
> > limit_elapsed_ns 13852159231
> > soft_scan_pages 0
> > soft_freed_pages 0
> > soft_elapsed_ns 0
> > margin_scan_pages 0
> > margin_freed_pages 0
> > margin_elapsed_ns 0
> >
> > == with async reclaim ==
> > recent_scan_success_ratio 6
> > limit_scan_pages 0
> > limit_freed_pages 0
> > limit_elapsed_ns 0
> > soft_scan_pages 0
> > soft_freed_pages 0
> > soft_elapsed_ns 0
> > margin_scan_pages 1295556
> > margin_freed_pages 122450
> > margin_elapsed_ns 644881521
> >
> >
> > For this case, SCHED_IDLE workqueue can reclaim enough memory to the httpd.
> >
> > I may need to dig why scan_success_ratio is far different in the both case.
> > I guess the difference of epalsed_ns is because several threads enter
> > memory reclaim when async reclaim doesn't run. But may not...
> >
>
>
> Hmm.. I noticed a very strange behavior on a simple test w/ the patch set.
>
> Test:
> I created a 4g memcg and start doing cat. Then the memcg being OOM
> killed as soon as it reaches its hard_limit. We shouldn't hit OOM even
> w/o async-reclaim.
>
> Again, I will read through the patch. But like to post the test result first.
>
> $ echo $$ >/dev/cgroup/memory/A/tasks
> $ cat /dev/cgroup/memory/A/memory.limit_in_bytes
> 4294967296
>
> $ time cat /export/hdc3/dd_A/tf0 > /dev/zero
> Killed
>
> real 0m53.565s
> user 0m0.061s
> sys 0m4.814s
>

Hmm, what I see is
==
[root@bluextal kamezawa]# ls -l test/1G
-rw-rw-r--. 1 kamezawa kamezawa 1053261824 May 13 13:58 test/1G
[root@bluextal kamezawa]# mkdir /cgroup/memory/A
[root@bluextal kamezawa]# echo 0 > /cgroup/memory/A/tasks
[root@bluextal kamezawa]# echo 300M > /cgroup/memory/A/memory.limit_in_bytes
[root@bluextal kamezawa]# echo 1 > /cgroup/memory/A/memory.async_control
[root@bluextal kamezawa]# cat test/1G > /dev/null
[root@bluextal kamezawa]# cat /cgroup/memory/A/memory.reclaim_stat
recent_scan_success_ratio 83
limit_scan_pages 82
limit_freed_pages 49
limit_elapsed_ns 242507
soft_scan_pages 0
soft_freed_pages 0
soft_elapsed_ns 0
margin_scan_pages 218630
margin_freed_pages 181598
margin_elapsed_ns 117466604
[root@bluextal kamezawa]#
==

I'll turn off swapaccount and try again.

Thanks,
-Kame


2011-05-27 03:12:35

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 0/10] memcg async reclaim

On Fri, 27 May 2011 11:48:37 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Thu, 26 May 2011 18:49:26 -0700
> Ying Han <[email protected]> wrote:

> > Hmm.. I noticed a very strange behavior on a simple test w/ the patch set.
> >
> > Test:
> > I created a 4g memcg and start doing cat. Then the memcg being OOM
> > killed as soon as it reaches its hard_limit. We shouldn't hit OOM even
> > w/o async-reclaim.
> >
> > Again, I will read through the patch. But like to post the test result first.
> >
> > $ echo $$ >/dev/cgroup/memory/A/tasks
> > $ cat /dev/cgroup/memory/A/memory.limit_in_bytes
> > 4294967296
> >
> > $ time cat /export/hdc3/dd_A/tf0 > /dev/zero
> > Killed
> >
> > real 0m53.565s
> > user 0m0.061s
> > sys 0m4.814s
> >
>
> Hmm, what I see is
> ==
> root@bluextal kamezawa]# ls -l test/1G
> -rw-rw-r--. 1 kamezawa kamezawa 1053261824 May 13 13:58 test/1G
> [root@bluextal kamezawa]# mkdir /cgroup/memory/A
> [root@bluextal kamezawa]# echo 0 > /cgroup/memory/A/tasks
> [root@bluextal kamezawa]# echo 300M > /cgroup/memory/A/memory.limit_in_bytes
> [root@bluextal kamezawa]# echo 1 > /cgroup/memory/A/memory.async_control
> [root@bluextal kamezawa]# cat test/1G > /dev/null
> [root@bluextal kamezawa]# cat /cgroup/memory/A/memory.reclaim_stat
> recent_scan_success_ratio 83
> limit_scan_pages 82
> limit_freed_pages 49
> limit_elapsed_ns 242507
> soft_scan_pages 0
> soft_freed_pages 0
> soft_elapsed_ns 0
> margin_scan_pages 218630
> margin_freed_pages 181598
> margin_elapsed_ns 117466604
> [root@bluextal kamezawa]#
> ==
>
> I'll turn off swapaccount and try again.
>

A bug found... I added the memory.async_control file to the memsw file set by
mistake, so async_control cannot be enabled when swapaccount=0. I'll fix that.
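The fix is basically to register the entry in the base file set rather than the
swap-accounting one, roughly like this (a minimal sketch, assuming the entry
currently sits in the memsw file array; the handler names below are placeholders,
not the ones from the actual patch):

    /* register in mem_cgroup_files[] instead of the memsw array,
     * so the file exists even when swapaccount=0 */
    {
            .name = "async_control",
            .read_u64 = mem_cgroup_async_control_read,   /* placeholder name */
            .write_u64 = mem_cgroup_async_control_write, /* placeholder name */
    },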

So, how did you enable async_control?

Thanks,
-Kame


2011-05-27 04:33:41

by Ying Han

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 0/10] memcg async reclaim

On Thu, May 26, 2011 at 7:16 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Thu, 26 May 2011 18:49:26 -0700
> Ying Han <[email protected]> wrote:
>
>> On Wed, May 25, 2011 at 10:10 PM, KAMEZAWA Hiroyuki
>> <[email protected]> wrote:
>> >
>> > It's now merge window...I just dump my patch queue to hear other's idea.
>> > I wonder I should wait until dirty_ratio for memcg is queued to mmotm...
>> > I'll be busy with LinuxCon Japan etc...in the next week.
>> >
>> > This patch is onto mmotm-May-11 + some patches queued in mmotm, as numa_stat.
>> >
>> > This is a patch for memcg to keep margin to the limit in background.
>> > By keeping some margin to the limit in background, application can
>> > avoid foreground memory reclaim at charge() and this will help latency.
>> >
>> > Main changes from v2 is.
>> > ?- use SCHED_IDLE.
>> > ?- removed most of heuristic codes. Now, code is very simple.
>> >
>> > By using SCHED_IDLE, async memory reclaim can only consume 0.3%? of cpu
>> > if the system is truely busy but can use much CPU if the cpu is idle.
>> > Because my purpose is for reducing latency without affecting other running
>> > applications, SCHED_IDLE fits this work.
>> >
>> > If application need to stop by some I/O or event, background memory reclaim
>> > will cull memory while the system is idle.
>> >
>> > Perforemce:
>> > ?Running an httpd (apache) under 300M limit. And access 600MB working set
>> > ?with normalized distribution access by apatch-bench.
>> > ?apatch bench's concurrency was 4 and did 40960 accesses.
>> >
>> > Without async reclaim:
>> > Connection Times (ms)
>> > ? ? ? ? ? ? ?min ?mean[+/-sd] median ? max
>> > Connect: ? ? ? ?0 ? ?0 ? 0.0 ? ? ?0 ? ? ? 2
>> > Processing: ? ?30 ? 37 ?28.3 ? ? 32 ? ?1793
>> > Waiting: ? ? ? 28 ? 35 ?25.5 ? ? 31 ? ?1792
>> > Total: ? ? ? ? 30 ? 37 ?28.4 ? ? 32 ? ?1793
>> >
>> > Percentage of the requests served within a certain time (ms)
>> > ?50% ? ? 32
>> > ?66% ? ? 32
>> > ?75% ? ? 33
>> > ?80% ? ? 34
>> > ?90% ? ? 39
>> > ?95% ? ? 60
>> > ?98% ? ?100
>> > ?99% ? ?133
>> > ?100% ? 1793 (longest request)
>> >
>> > With async reclaim:
>> > Connection Times (ms)
>> > ? ? ? ? ? ? ?min ?mean[+/-sd] median ? max
>> > Connect: ? ? ? ?0 ? ?0 ? 0.0 ? ? ?0 ? ? ? 2
>> > Processing: ? ?30 ? 35 ?12.3 ? ? 32 ? ? 678
>> > Waiting: ? ? ? 28 ? 34 ?12.0 ? ? 31 ? ? 658
>> > Total: ? ? ? ? 30 ? 35 ?12.3 ? ? 32 ? ? 678
>> >
>> > Percentage of the requests served within a certain time (ms)
>> > ?50% ? ? 32
>> > ?66% ? ? 32
>> > ?75% ? ? 33
>> > ?80% ? ? 34
>> > ?90% ? ? 39
>> > ?95% ? ? 49
>> > ?98% ? ? 71
>> > ?99% ? ? 86
>> > ?100% ? ?678 (longest request)
>> >
>> >
>> > It seems latency is stabilized by hiding memory reclaim.
>> >
>> > The score for memory reclaim was following.
>> > See patch 10 for meaning of each member.
>> >
>> > == without async reclaim ==
>> > recent_scan_success_ratio 44
>> > limit_scan_pages 388463
>> > limit_freed_pages 162238
>> > limit_elapsed_ns 13852159231
>> > soft_scan_pages 0
>> > soft_freed_pages 0
>> > soft_elapsed_ns 0
>> > margin_scan_pages 0
>> > margin_freed_pages 0
>> > margin_elapsed_ns 0
>> >
>> > == with async reclaim ==
>> > recent_scan_success_ratio 6
>> > limit_scan_pages 0
>> > limit_freed_pages 0
>> > limit_elapsed_ns 0
>> > soft_scan_pages 0
>> > soft_freed_pages 0
>> > soft_elapsed_ns 0
>> > margin_scan_pages 1295556
>> > margin_freed_pages 122450
>> > margin_elapsed_ns 644881521
>> >
>> >
>> > For this case, SCHED_IDLE workqueue can reclaim enough memory to the httpd.
>> >
>> > I may need to dig why scan_success_ratio is far different in the both case.
>> > I guess the difference of epalsed_ns is because several threads enter
>> > memory reclaim when async reclaim doesn't run. But may not...
>> >
>>
>>
>> Hmm.. I noticed a very strange behavior on a simple test w/ the patch set.
>>
>> Test:
>> I created a 4g memcg and start doing cat. Then the memcg being OOM
>> killed as soon as it reaches its hard_limit. We shouldn't hit OOM even
>> w/o async-reclaim.
>>
>> Again, I will read through the patch. But like to post the test result first.
>>
>> $ echo $$ >/dev/cgroup/memory/A/tasks
>> $ cat /dev/cgroup/memory/A/memory.limit_in_bytes
>> 4294967296
>>
>> $ time cat /export/hdc3/dd_A/tf0 > /dev/zero
>> Killed
>>
>
> I did the same kind of test without any problem...but ok, I'll do more test
> later.
>
>
>
>> real ?0m53.565s
>> user ?0m0.061s
>> sys ? 0m4.814s
>>
>> Here is the OOM log:
>>
>> May 26 18:43:00 ?kernel: [ ?963.489112] cat invoked oom-killer:
>> gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
>> May 26 18:43:00 ?kernel: [ ?963.489121] Pid: 9425, comm: cat Tainted:
>> G ? ? ? ?W ? 2.6.39-mcg-DEV #131
>> May 26 18:43:00 ?kernel: [ ?963.489123] Call Trace:
>> May 26 18:43:00 ?kernel: [ ?963.489134] ?[<ffffffff810e3512>]
>> dump_header+0x82/0x1af
>> May 26 18:43:00 ?kernel: [ ?963.489137] ?[<ffffffff810e33ca>] ?
>> spin_lock+0xe/0x10
>> May 26 18:43:00 ?kernel: [ ?963.489140] ?[<ffffffff810e33f9>] ?
>> find_lock_task_mm+0x2d/0x67
>> May 26 18:43:00 ?kernel: [ ?963.489143] ?[<ffffffff810e38dd>]
>> oom_kill_process+0x50/0x27b
>> May 26 18:43:00 ?kernel: [ ?963.489155] ?[<ffffffff810e3dc6>]
>> mem_cgroup_out_of_memory+0x9a/0xe4
>> May 26 18:43:00 ?kernel: [ ?963.489160] ?[<ffffffff811153aa>]
>> mem_cgroup_handle_oom+0x134/0x1fe
>> May 26 18:43:00 ?kernel: [ ?963.489163] ?[<ffffffff81114a72>] ?
>> __mem_cgroup_insert_exceeded+0x83/0x83
>> May 26 18:43:00 ?kernel: [ ?963.489176] ?[<ffffffff811166e9>]
>> __mem_cgroup_try_charge.clone.3+0x368/0x43a
>> May 26 18:43:00 ?kernel: [ ?963.489179] ?[<ffffffff81117586>]
>> mem_cgroup_cache_charge+0x95/0x123
>> May 26 18:43:00 ?kernel: [ ?963.489183] ?[<ffffffff810e16d8>]
>> add_to_page_cache_locked+0x42/0x114
>> May 26 18:43:00 ?kernel: [ ?963.489185] ?[<ffffffff810e17db>]
>> add_to_page_cache_lru+0x31/0x5f
>> May 26 18:43:00 ?kernel: [ ?963.489189] ?[<ffffffff81145636>]
>> mpage_readpages+0xb6/0x132
>> May 26 18:43:00 ?kernel: [ ?963.489194] ?[<ffffffff8119992f>] ?
>> noalloc_get_block_write+0x24/0x24
>> May 26 18:43:00 ?kernel: [ ?963.489197] ?[<ffffffff8119992f>] ?
>> noalloc_get_block_write+0x24/0x24
>> May 26 18:43:00 ?kernel: [ ?963.489201] ?[<ffffffff81036742>] ?
>> __switch_to+0x160/0x212
>> May 26 18:43:00 ?kernel: [ ?963.489205] ?[<ffffffff811978b2>]
>> ext4_readpages+0x1d/0x1f
>> May 26 18:43:00 ?kernel: [ ?963.489209] ?[<ffffffff810e8d4b>]
>> __do_page_cache_readahead+0x144/0x1e3
>> May 26 18:43:00 ?kernel: [ ?963.489212] ?[<ffffffff810e8e0b>]
>> ra_submit+0x21/0x25
>> May 26 18:43:00 ?kernel: [ ?963.489215] ?[<ffffffff810e9075>]
>> ondemand_readahead+0x18c/0x19f
>> May 26 18:43:00 ?kernel: [ ?963.489218] ?[<ffffffff810e9105>]
>> page_cache_async_readahead+0x7d/0x86
>> May 26 18:43:00 ?kernel: [ ?963.489221] ?[<ffffffff810e2b7e>]
>> generic_file_aio_read+0x2d8/0x5fe
>> May 26 18:43:00 ?kernel: [ ?963.489225] ?[<ffffffff81119626>]
>> do_sync_read+0xcb/0x108
>> May 26 18:43:00 ?kernel: [ ?963.489230] ?[<ffffffff811f168a>] ?
>> fsnotify_perm+0x66/0x72
>> May 26 18:43:00 ?kernel: [ ?963.489233] ?[<ffffffff811f16f7>] ?
>> security_file_permission+0x2e/0x33
>> May 26 18:43:00 ?kernel: [ ?963.489236] ?[<ffffffff8111a0c8>]
>> vfs_read+0xab/0x107
>> May 26 18:43:00 ?kernel: [ ?963.489239] ?[<ffffffff8111a1e4>] sys_read+0x4a/0x6e
>> May 26 18:43:00 ?kernel: [ ?963.489244] ?[<ffffffff8140f469>]
>> sysenter_dispatch+0x7/0x27
>> May 26 18:43:00 ?kernel: [ ?963.489248] Task in /A killed as a result
>> of limit of /A
>> May 26 18:43:00 ?kernel: [ ?963.489251] memory: usage 4194304kB, limit
>> 4194304kB, failcnt 26
>> May 26 18:43:00 ?kernel: [ ?963.489253] memory+swap: usage 0kB, limit
>> 9007199254740991kB, failcnt 0
>>
>
> Hmm, why memory+swap usage 0kb here...
>
> In this set, I used mem_cgroup_margin() rather than res_counter_margin().
> Hmm, do you disable swap accounting ? If so, I may miss some.

Yes, I disabled the swap accounting in .config:
# CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not set


Here is how I reproduce it:

$ mkdir /dev/cgroup/memory/D
$ echo 4g >/dev/cgroup/memory/D/memory.limit_in_bytes

$ cat /dev/cgroup/memory/D/memory.limit_in_bytes
4294967296

$ cat /dev/cgroup/memory/D/memory.
memory.async_control memory.max_usage_in_bytes
memory.soft_limit_in_bytes memory.use_hierarchy
memory.failcnt memory.move_charge_at_immigrate
memory.stat
memory.force_empty memory.oom_control
memory.swappiness
memory.limit_in_bytes memory.reclaim_stat
memory.usage_in_bytes

$ cat /dev/cgroup/memory/D/memory.async_control
0
$ echo 1 >/dev/cgroup/memory/D/memory.async_control
$ cat /dev/cgroup/memory/D/memory.async_control
1

$ echo $$ >/dev/cgroup/memory/D/tasks
$ cat /proc/4358/cgroup
3:memory:/D

$ time cat /export/hdc3/dd_A/tf0 > /dev/zero
Killed

--Ying


>
> Thanks,
> -Kame
>
>
>
>
>

2011-05-27 04:34:34

by Ying Han

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 0/10] memcg async reclaim

On Thu, May 26, 2011 at 8:05 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Fri, 27 May 2011 11:48:37 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
>> On Thu, 26 May 2011 18:49:26 -0700
>> Ying Han <[email protected]> wrote:
>
>> > Hmm.. I noticed a very strange behavior on a simple test w/ the patch set.
>> >
>> > Test:
>> > I created a 4g memcg and start doing cat. Then the memcg being OOM
>> > killed as soon as it reaches its hard_limit. We shouldn't hit OOM even
>> > w/o async-reclaim.
>> >
>> > Again, I will read through the patch. But like to post the test result first.
>> >
>> > $ echo $$ >/dev/cgroup/memory/A/tasks
>> > $ cat /dev/cgroup/memory/A/memory.limit_in_bytes
>> > 4294967296
>> >
>> > $ time cat /export/hdc3/dd_A/tf0 > /dev/zero
>> > Killed
>> >
>> > real  0m53.565s
>> > user  0m0.061s
>> > sys   0m4.814s
>> >
>>
>> Hmm, what I see is
>> ==
>> root@bluextal kamezawa]# ls -l test/1G
>> -rw-rw-r--. 1 kamezawa kamezawa 1053261824 May 13 13:58 test/1G
>> [root@bluextal kamezawa]# mkdir /cgroup/memory/A
>> [root@bluextal kamezawa]# echo 0 > /cgroup/memory/A/tasks
>> [root@bluextal kamezawa]# echo 300M > /cgroup/memory/A/memory.limit_in_bytes
>> [root@bluextal kamezawa]# echo 1 > /cgroup/memory/A/memory.async_control
>> [root@bluextal kamezawa]# cat test/1G > /dev/null
>> [root@bluextal kamezawa]# cat /cgroup/memory/A/memory.reclaim_stat
>> recent_scan_success_ratio 83
>> limit_scan_pages 82
>> limit_freed_pages 49
>> limit_elapsed_ns 242507
>> soft_scan_pages 0
>> soft_freed_pages 0
>> soft_elapsed_ns 0
>> margin_scan_pages 218630
>> margin_freed_pages 181598
>> margin_elapsed_ns 117466604
>> [root@bluextal kamezawa]#
>> ==
>>
>> I'll turn off swapaccount and try again.
>>
>
> A bug found....I added memory.async_control file to memsw.....file set by mistake.
> Then, async_control cannot be enabled when swapaccount=0. I'll fix that.

Yes, I have that change applied in my previous testing.
>
> So, how do you enabled async_control ?

$ echo 1 >/dev/cgroup/memory/D/memory.async_control


--Ying
>
> Thanks,
> -Kame
>
>
>
>

2011-05-27 04:41:23

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 0/10] memcg async reclaim

On Thu, 26 May 2011 21:33:32 -0700
Ying Han <[email protected]> wrote:

> On Thu, May 26, 2011 at 7:16 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > On Thu, 26 May 2011 18:49:26 -0700
> > Ying Han <[email protected]> wrote:
> >
> >> On Wed, May 25, 2011 at 10:10 PM, KAMEZAWA Hiroyuki
> >> <[email protected]> wrote:
> >> >
> >> > It's now merge window...I just dump my patch queue to hear other's idea.
> >> > I wonder I should wait until dirty_ratio for memcg is queued to mmotm...
> >> > I'll be busy with LinuxCon Japan etc...in the next week.
> >> >
> >> > This patch is onto mmotm-May-11 + some patches queued in mmotm, as numa_stat.
> >> >
> >> > This is a patch for memcg to keep margin to the limit in background.
> >> > By keeping some margin to the limit in background, application can
> >> > avoid foreground memory reclaim at charge() and this will help latency.
> >> >
> >> > Main changes from v2 is.
> >> >  - use SCHED_IDLE.
> >> >  - removed most of heuristic codes. Now, code is very simple.
> >> >
> >> > By using SCHED_IDLE, async memory reclaim can only consume 0.3%? of cpu
> >> > if the system is truely busy but can use much CPU if the cpu is idle.
> >> > Because my purpose is for reducing latency without affecting other running
> >> > applications, SCHED_IDLE fits this work.
> >> >
> >> > If application need to stop by some I/O or event, background memory reclaim
> >> > will cull memory while the system is idle.
> >> >
> >> > Perforemce:
> >> >  Running an httpd (apache) under 300M limit. And access 600MB working set
> >> >  with normalized distribution access by apatch-bench.
> >> >  apatch bench's concurrency was 4 and did 40960 accesses.
> >> >
> >> > Without async reclaim:
> >> > Connection Times (ms)
> >> >              min  mean[+/-sd] median   max
> >> > Connect:        0    0   0.0      0       2
> >> > Processing:    30   37  28.3     32    1793
> >> > Waiting:       28   35  25.5     31    1792
> >> > Total:         30   37  28.4     32    1793
> >> >
> >> > Percentage of the requests served within a certain time (ms)
> >> >  50%     32
> >> >  66%     32
> >> >  75%     33
> >> >  80%     34
> >> >  90%     39
> >> >  95%     60
> >> >  98%    100
> >> >  99%    133
> >> >  100%   1793 (longest request)
> >> >
> >> > With async reclaim:
> >> > Connection Times (ms)
> >> >              min  mean[+/-sd] median   max
> >> > Connect:        0    0   0.0      0       2
> >> > Processing:    30   35  12.3     32     678
> >> > Waiting:       28   34  12.0     31     658
> >> > Total:         30   35  12.3     32     678
> >> >
> >> > Percentage of the requests served within a certain time (ms)
> >> >  50%     32
> >> >  66%     32
> >> >  75%     33
> >> >  80%     34
> >> >  90%     39
> >> >  95%     49
> >> >  98%     71
> >> >  99%     86
> >> >  100%    678 (longest request)
> >> >
> >> >
> >> > It seems latency is stabilized by hiding memory reclaim.
> >> >
> >> > The score for memory reclaim was following.
> >> > See patch 10 for meaning of each member.
> >> >
> >> > == without async reclaim ==
> >> > recent_scan_success_ratio 44
> >> > limit_scan_pages 388463
> >> > limit_freed_pages 162238
> >> > limit_elapsed_ns 13852159231
> >> > soft_scan_pages 0
> >> > soft_freed_pages 0
> >> > soft_elapsed_ns 0
> >> > margin_scan_pages 0
> >> > margin_freed_pages 0
> >> > margin_elapsed_ns 0
> >> >
> >> > == with async reclaim ==
> >> > recent_scan_success_ratio 6
> >> > limit_scan_pages 0
> >> > limit_freed_pages 0
> >> > limit_elapsed_ns 0
> >> > soft_scan_pages 0
> >> > soft_freed_pages 0
> >> > soft_elapsed_ns 0
> >> > margin_scan_pages 1295556
> >> > margin_freed_pages 122450
> >> > margin_elapsed_ns 644881521
> >> >
> >> >
> >> > For this case, SCHED_IDLE workqueue can reclaim enough memory to the httpd.
> >> >
> >> > I may need to dig why scan_success_ratio is far different in the both case.
> >> > I guess the difference of epalsed_ns is because several threads enter
> >> > memory reclaim when async reclaim doesn't run. But may not...
> >> >
> >>
> >>
> >> Hmm.. I noticed a very strange behavior on a simple test w/ the patch set.
> >>
> >> Test:
> >> I created a 4g memcg and start doing cat. Then the memcg being OOM
> >> killed as soon as it reaches its hard_limit. We shouldn't hit OOM even
> >> w/o async-reclaim.
> >>
> >> Again, I will read through the patch. But like to post the test result first.
> >>
> >> $ echo $$ >/dev/cgroup/memory/A/tasks
> >> $ cat /dev/cgroup/memory/A/memory.limit_in_bytes
> >> 4294967296
> >>
> >> $ time cat /export/hdc3/dd_A/tf0 > /dev/zero
> >> Killed
> >>
> >
> > I did the same kind of test without any problem...but ok, I'll do more test
> > later.
> >
> >
> >
> >> real  0m53.565s
> >> user  0m0.061s
> >> sys   0m4.814s
> >>
> >> Here is the OOM log:
> >>
> >> May 26 18:43:00  kernel: [  963.489112] cat invoked oom-killer:
> >> gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
> >> May 26 18:43:00  kernel: [  963.489121] Pid: 9425, comm: cat Tainted:
> >> G        W   2.6.39-mcg-DEV #131
> >> May 26 18:43:00  kernel: [  963.489123] Call Trace:
> >> May 26 18:43:00  kernel: [  963.489134]  [<ffffffff810e3512>]
> >> dump_header+0x82/0x1af
> >> May 26 18:43:00  kernel: [  963.489137]  [<ffffffff810e33ca>] ?
> >> spin_lock+0xe/0x10
> >> May 26 18:43:00  kernel: [  963.489140]  [<ffffffff810e33f9>] ?
> >> find_lock_task_mm+0x2d/0x67
> >> May 26 18:43:00  kernel: [  963.489143]  [<ffffffff810e38dd>]
> >> oom_kill_process+0x50/0x27b
> >> May 26 18:43:00  kernel: [  963.489155]  [<ffffffff810e3dc6>]
> >> mem_cgroup_out_of_memory+0x9a/0xe4
> >> May 26 18:43:00  kernel: [  963.489160]  [<ffffffff811153aa>]
> >> mem_cgroup_handle_oom+0x134/0x1fe
> >> May 26 18:43:00  kernel: [  963.489163]  [<ffffffff81114a72>] ?
> >> __mem_cgroup_insert_exceeded+0x83/0x83
> >> May 26 18:43:00  kernel: [  963.489176]  [<ffffffff811166e9>]
> >> __mem_cgroup_try_charge.clone.3+0x368/0x43a
> >> May 26 18:43:00  kernel: [  963.489179]  [<ffffffff81117586>]
> >> mem_cgroup_cache_charge+0x95/0x123
> >> May 26 18:43:00  kernel: [  963.489183]  [<ffffffff810e16d8>]
> >> add_to_page_cache_locked+0x42/0x114
> >> May 26 18:43:00  kernel: [  963.489185]  [<ffffffff810e17db>]
> >> add_to_page_cache_lru+0x31/0x5f
> >> May 26 18:43:00  kernel: [  963.489189]  [<ffffffff81145636>]
> >> mpage_readpages+0xb6/0x132
> >> May 26 18:43:00  kernel: [  963.489194]  [<ffffffff8119992f>] ?
> >> noalloc_get_block_write+0x24/0x24
> >> May 26 18:43:00  kernel: [  963.489197]  [<ffffffff8119992f>] ?
> >> noalloc_get_block_write+0x24/0x24
> >> May 26 18:43:00  kernel: [  963.489201]  [<ffffffff81036742>] ?
> >> __switch_to+0x160/0x212
> >> May 26 18:43:00  kernel: [  963.489205]  [<ffffffff811978b2>]
> >> ext4_readpages+0x1d/0x1f
> >> May 26 18:43:00  kernel: [  963.489209]  [<ffffffff810e8d4b>]
> >> __do_page_cache_readahead+0x144/0x1e3
> >> May 26 18:43:00  kernel: [  963.489212]  [<ffffffff810e8e0b>]
> >> ra_submit+0x21/0x25
> >> May 26 18:43:00  kernel: [  963.489215]  [<ffffffff810e9075>]
> >> ondemand_readahead+0x18c/0x19f
> >> May 26 18:43:00  kernel: [  963.489218]  [<ffffffff810e9105>]
> >> page_cache_async_readahead+0x7d/0x86
> >> May 26 18:43:00  kernel: [  963.489221]  [<ffffffff810e2b7e>]
> >> generic_file_aio_read+0x2d8/0x5fe
> >> May 26 18:43:00  kernel: [  963.489225]  [<ffffffff81119626>]
> >> do_sync_read+0xcb/0x108
> >> May 26 18:43:00  kernel: [  963.489230]  [<ffffffff811f168a>] ?
> >> fsnotify_perm+0x66/0x72
> >> May 26 18:43:00  kernel: [  963.489233]  [<ffffffff811f16f7>] ?
> >> security_file_permission+0x2e/0x33
> >> May 26 18:43:00  kernel: [  963.489236]  [<ffffffff8111a0c8>]
> >> vfs_read+0xab/0x107
> >> May 26 18:43:00  kernel: [  963.489239]  [<ffffffff8111a1e4>] sys_read+0x4a/0x6e
> >> May 26 18:43:00  kernel: [  963.489244]  [<ffffffff8140f469>]
> >> sysenter_dispatch+0x7/0x27
> >> May 26 18:43:00  kernel: [  963.489248] Task in /A killed as a result
> >> of limit of /A
> >> May 26 18:43:00  kernel: [  963.489251] memory: usage 4194304kB, limit
> >> 4194304kB, failcnt 26
> >> May 26 18:43:00  kernel: [  963.489253] memory+swap: usage 0kB, limit
> >> 9007199254740991kB, failcnt 0
> >>
> >
> > Hmm, why memory+swap usage 0kb here...
> >
> > In this set, I used mem_cgroup_margin() rather than res_counter_margin().
> > Hmm, do you disable swap accounting ? If so, I may miss some.
>
> Yes, I disabled the swap accounting in .config:
> # CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not set
>
>
> Here is how i reproduce it:
>
> $ mkdir /dev/cgroup/memory/D
> $ echo 4g >/dev/cgroup/memory/D/memory.limit_in_bytes
>
> $ cat /dev/cgroup/memory/D/memory.limit_in_bytes
> 4294967296
>
> $ cat /dev/cgroup/memory/D/memory.
> memory.async_control memory.max_usage_in_bytes
> memory.soft_limit_in_bytes memory.use_hierarchy
> memory.failcnt memory.move_charge_at_immigrate
> memory.stat
> memory.force_empty memory.oom_control
> memory.swappiness
> memory.limit_in_bytes memory.reclaim_stat
> memory.usage_in_bytes
>
> $ cat /dev/cgroup/memory/D/memory.async_control
> 0
> $ echo 1 >/dev/cgroup/memory/D/memory.async_control
> $ cat /dev/cgroup/memory/D/memory.async_control
> 1
>
> $ echo $$ >/dev/cgroup/memory/D/tasks
> $ cat /proc/4358/cgroup
> 3:memory:/D
>
> $ time cat /export/hdc3/dd_A/tf0 > /dev/zero
> Killed
>

If you applied my patches as-is, async_control can only be seen when the
swap controller is configured, because of a BUG in the patch.
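
(A rough sketch of the kind of mix-up being described here, assuming the
usual pair of cftype tables in memcontrol.c; the async_control entry and
handler names below are illustrative, not the actual patch.)

/* base memcg control files, always registered */
static struct cftype mem_cgroup_files[] = {
	/* ... */
	{
		.name = "async_control",		/* the file belongs here */
		.read_u64 = mem_cgroup_async_read,	/* illustrative handler name */
		.write_u64 = mem_cgroup_async_write,	/* illustrative handler name */
	},
};

/* memsw files, registered only when swap accounting is enabled */
static struct cftype memsw_cgroup_files[] = {
	/* ... */
	/*
	 * Placing async_control in this table by mistake hides the file
	 * whenever CONFIG_CGROUP_MEM_RES_CTLR_SWAP is off or the kernel
	 * boots with swapaccount=0.
	 */
};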

I could cat a 20G file under a 4G limit without any problem with the boot
option swapaccount=0. Is there no problem if async_control == 0?



Thanks,
-Kame


2011-05-27 04:49:17

by Ying Han

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 0/10] memcg async reclaim

On Thu, May 26, 2011 at 9:34 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Thu, 26 May 2011 21:33:32 -0700
> Ying Han <[email protected]> wrote:
>
>> On Thu, May 26, 2011 at 7:16 PM, KAMEZAWA Hiroyuki
>> <[email protected]> wrote:
>> > On Thu, 26 May 2011 18:49:26 -0700
>> > Ying Han <[email protected]> wrote:
>> >
>> >> On Wed, May 25, 2011 at 10:10 PM, KAMEZAWA Hiroyuki
>> >> <[email protected]> wrote:
>> >> >
>> >> > It's now merge window...I just dump my patch queue to hear other's idea.
>> >> > I wonder I should wait until dirty_ratio for memcg is queued to mmotm...
>> >> > I'll be busy with LinuxCon Japan etc...in the next week.
>> >> >
>> >> > This patch is onto mmotm-May-11 + some patches queued in mmotm, as numa_stat.
>> >> >
>> >> > This is a patch for memcg to keep margin to the limit in background.
>> >> > By keeping some margin to the limit in background, application can
>> >> > avoid foreground memory reclaim at charge() and this will help latency.
>> >> >
>> >> > Main changes from v2 is.
>> >> > ?- use SCHED_IDLE.
>> >> > ?- removed most of heuristic codes. Now, code is very simple.
>> >> >
>> >> > By using SCHED_IDLE, async memory reclaim can only consume 0.3%? of cpu
>> >> > if the system is truely busy but can use much CPU if the cpu is idle.
>> >> > Because my purpose is for reducing latency without affecting other running
>> >> > applications, SCHED_IDLE fits this work.
>> >> >
>> >> > If application need to stop by some I/O or event, background memory reclaim
>> >> > will cull memory while the system is idle.
>> >> >
>> >> > Perforemce:
>> >> > ?Running an httpd (apache) under 300M limit. And access 600MB working set
>> >> > ?with normalized distribution access by apatch-bench.
>> >> > ?apatch bench's concurrency was 4 and did 40960 accesses.
>> >> >
>> >> > Without async reclaim:
>> >> > Connection Times (ms)
>> >> > ? ? ? ? ? ? ?min ?mean[+/-sd] median ? max
>> >> > Connect: ? ? ? ?0 ? ?0 ? 0.0 ? ? ?0 ? ? ? 2
>> >> > Processing: ? ?30 ? 37 ?28.3 ? ? 32 ? ?1793
>> >> > Waiting: ? ? ? 28 ? 35 ?25.5 ? ? 31 ? ?1792
>> >> > Total: ? ? ? ? 30 ? 37 ?28.4 ? ? 32 ? ?1793
>> >> >
>> >> > Percentage of the requests served within a certain time (ms)
>> >> > ?50% ? ? 32
>> >> > ?66% ? ? 32
>> >> > ?75% ? ? 33
>> >> > ?80% ? ? 34
>> >> > ?90% ? ? 39
>> >> > ?95% ? ? 60
>> >> > ?98% ? ?100
>> >> > ?99% ? ?133
>> >> > ?100% ? 1793 (longest request)
>> >> >
>> >> > With async reclaim:
>> >> > Connection Times (ms)
>> >> > ? ? ? ? ? ? ?min ?mean[+/-sd] median ? max
>> >> > Connect: ? ? ? ?0 ? ?0 ? 0.0 ? ? ?0 ? ? ? 2
>> >> > Processing: ? ?30 ? 35 ?12.3 ? ? 32 ? ? 678
>> >> > Waiting: ? ? ? 28 ? 34 ?12.0 ? ? 31 ? ? 658
>> >> > Total: ? ? ? ? 30 ? 35 ?12.3 ? ? 32 ? ? 678
>> >> >
>> >> > Percentage of the requests served within a certain time (ms)
>> >> > ?50% ? ? 32
>> >> > ?66% ? ? 32
>> >> > ?75% ? ? 33
>> >> > ?80% ? ? 34
>> >> > ?90% ? ? 39
>> >> > ?95% ? ? 49
>> >> > ?98% ? ? 71
>> >> > ?99% ? ? 86
>> >> > ?100% ? ?678 (longest request)
>> >> >
>> >> >
>> >> > It seems latency is stabilized by hiding memory reclaim.
>> >> >
>> >> > The score for memory reclaim was following.
>> >> > See patch 10 for meaning of each member.
>> >> >
>> >> > == without async reclaim ==
>> >> > recent_scan_success_ratio 44
>> >> > limit_scan_pages 388463
>> >> > limit_freed_pages 162238
>> >> > limit_elapsed_ns 13852159231
>> >> > soft_scan_pages 0
>> >> > soft_freed_pages 0
>> >> > soft_elapsed_ns 0
>> >> > margin_scan_pages 0
>> >> > margin_freed_pages 0
>> >> > margin_elapsed_ns 0
>> >> >
>> >> > == with async reclaim ==
>> >> > recent_scan_success_ratio 6
>> >> > limit_scan_pages 0
>> >> > limit_freed_pages 0
>> >> > limit_elapsed_ns 0
>> >> > soft_scan_pages 0
>> >> > soft_freed_pages 0
>> >> > soft_elapsed_ns 0
>> >> > margin_scan_pages 1295556
>> >> > margin_freed_pages 122450
>> >> > margin_elapsed_ns 644881521
>> >> >
>> >> >
>> >> > For this case, SCHED_IDLE workqueue can reclaim enough memory to the httpd.
>> >> >
>> >> > I may need to dig why scan_success_ratio is far different in the both case.
>> >> > I guess the difference of epalsed_ns is because several threads enter
>> >> > memory reclaim when async reclaim doesn't run. But may not...
>> >> >
>> >>
>> >>
>> >> Hmm.. I noticed a very strange behavior on a simple test w/ the patch set.
>> >>
>> >> Test:
>> >> I created a 4g memcg and start doing cat. Then the memcg being OOM
>> >> killed as soon as it reaches its hard_limit. We shouldn't hit OOM even
>> >> w/o async-reclaim.
>> >>
>> >> Again, I will read through the patch. But like to post the test result first.
>> >>
>> >> $ echo $$ >/dev/cgroup/memory/A/tasks
>> >> $ cat /dev/cgroup/memory/A/memory.limit_in_bytes
>> >> 4294967296
>> >>
>> >> $ time cat /export/hdc3/dd_A/tf0 > /dev/zero
>> >> Killed
>> >>
>> >
>> > I did the same kind of test without any problem...but ok, I'll do more test
>> > later.
>> >
>> >
>> >
>> >> real ?0m53.565s
>> >> user ?0m0.061s
>> >> sys ? 0m4.814s
>> >>
>> >> Here is the OOM log:
>> >>
>> >> May 26 18:43:00 ?kernel: [ ?963.489112] cat invoked oom-killer:
>> >> gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
>> >> May 26 18:43:00 ?kernel: [ ?963.489121] Pid: 9425, comm: cat Tainted:
>> >> G ? ? ? ?W ? 2.6.39-mcg-DEV #131
>> >> May 26 18:43:00 ?kernel: [ ?963.489123] Call Trace:
>> >> May 26 18:43:00 ?kernel: [ ?963.489134] ?[<ffffffff810e3512>]
>> >> dump_header+0x82/0x1af
>> >> May 26 18:43:00 ?kernel: [ ?963.489137] ?[<ffffffff810e33ca>] ?
>> >> spin_lock+0xe/0x10
>> >> May 26 18:43:00 ?kernel: [ ?963.489140] ?[<ffffffff810e33f9>] ?
>> >> find_lock_task_mm+0x2d/0x67
>> >> May 26 18:43:00 ?kernel: [ ?963.489143] ?[<ffffffff810e38dd>]
>> >> oom_kill_process+0x50/0x27b
>> >> May 26 18:43:00 ?kernel: [ ?963.489155] ?[<ffffffff810e3dc6>]
>> >> mem_cgroup_out_of_memory+0x9a/0xe4
>> >> May 26 18:43:00 ?kernel: [ ?963.489160] ?[<ffffffff811153aa>]
>> >> mem_cgroup_handle_oom+0x134/0x1fe
>> >> May 26 18:43:00 ?kernel: [ ?963.489163] ?[<ffffffff81114a72>] ?
>> >> __mem_cgroup_insert_exceeded+0x83/0x83
>> >> May 26 18:43:00 ?kernel: [ ?963.489176] ?[<ffffffff811166e9>]
>> >> __mem_cgroup_try_charge.clone.3+0x368/0x43a
>> >> May 26 18:43:00 ?kernel: [ ?963.489179] ?[<ffffffff81117586>]
>> >> mem_cgroup_cache_charge+0x95/0x123
>> >> May 26 18:43:00 ?kernel: [ ?963.489183] ?[<ffffffff810e16d8>]
>> >> add_to_page_cache_locked+0x42/0x114
>> >> May 26 18:43:00 ?kernel: [ ?963.489185] ?[<ffffffff810e17db>]
>> >> add_to_page_cache_lru+0x31/0x5f
>> >> May 26 18:43:00 ?kernel: [ ?963.489189] ?[<ffffffff81145636>]
>> >> mpage_readpages+0xb6/0x132
>> >> May 26 18:43:00 ?kernel: [ ?963.489194] ?[<ffffffff8119992f>] ?
>> >> noalloc_get_block_write+0x24/0x24
>> >> May 26 18:43:00 ?kernel: [ ?963.489197] ?[<ffffffff8119992f>] ?
>> >> noalloc_get_block_write+0x24/0x24
>> >> May 26 18:43:00 ?kernel: [ ?963.489201] ?[<ffffffff81036742>] ?
>> >> __switch_to+0x160/0x212
>> >> May 26 18:43:00 ?kernel: [ ?963.489205] ?[<ffffffff811978b2>]
>> >> ext4_readpages+0x1d/0x1f
>> >> May 26 18:43:00 ?kernel: [ ?963.489209] ?[<ffffffff810e8d4b>]
>> >> __do_page_cache_readahead+0x144/0x1e3
>> >> May 26 18:43:00 ?kernel: [ ?963.489212] ?[<ffffffff810e8e0b>]
>> >> ra_submit+0x21/0x25
>> >> May 26 18:43:00 ?kernel: [ ?963.489215] ?[<ffffffff810e9075>]
>> >> ondemand_readahead+0x18c/0x19f
>> >> May 26 18:43:00 ?kernel: [ ?963.489218] ?[<ffffffff810e9105>]
>> >> page_cache_async_readahead+0x7d/0x86
>> >> May 26 18:43:00 ?kernel: [ ?963.489221] ?[<ffffffff810e2b7e>]
>> >> generic_file_aio_read+0x2d8/0x5fe
>> >> May 26 18:43:00 ?kernel: [ ?963.489225] ?[<ffffffff81119626>]
>> >> do_sync_read+0xcb/0x108
>> >> May 26 18:43:00 ?kernel: [ ?963.489230] ?[<ffffffff811f168a>] ?
>> >> fsnotify_perm+0x66/0x72
>> >> May 26 18:43:00 ?kernel: [ ?963.489233] ?[<ffffffff811f16f7>] ?
>> >> security_file_permission+0x2e/0x33
>> >> May 26 18:43:00 ?kernel: [ ?963.489236] ?[<ffffffff8111a0c8>]
>> >> vfs_read+0xab/0x107
>> >> May 26 18:43:00 ?kernel: [ ?963.489239] ?[<ffffffff8111a1e4>] sys_read+0x4a/0x6e
>> >> May 26 18:43:00 ?kernel: [ ?963.489244] ?[<ffffffff8140f469>]
>> >> sysenter_dispatch+0x7/0x27
>> >> May 26 18:43:00 ?kernel: [ ?963.489248] Task in /A killed as a result
>> >> of limit of /A
>> >> May 26 18:43:00 ?kernel: [ ?963.489251] memory: usage 4194304kB, limit
>> >> 4194304kB, failcnt 26
>> >> May 26 18:43:00 ?kernel: [ ?963.489253] memory+swap: usage 0kB, limit
>> >> 9007199254740991kB, failcnt 0
>> >>
>> >
>> > Hmm, why memory+swap usage 0kb here...
>> >
>> > In this set, I used mem_cgroup_margin() rather than res_counter_margin().
>> > Hmm, do you disable swap accounting ? If so, I may miss some.
>>
>> Yes, I disabled the swap accounting in .config:
>> # CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not set
>>
>>
>> Here is how i reproduce it:
>>
>> $ mkdir /dev/cgroup/memory/D
>> $ echo 4g >/dev/cgroup/memory/D/memory.limit_in_bytes
>>
>> $ cat /dev/cgroup/memory/D/memory.limit_in_bytes
>> 4294967296
>>
>> $ cat /dev/cgroup/memory/D/memory.
>> memory.async_control ? ? ? ? ? ? memory.max_usage_in_bytes
>> memory.soft_limit_in_bytes ? ? ? memory.use_hierarchy
>> memory.failcnt ? ? ? ? ? ? ? ? ? memory.move_charge_at_immigrate
>> memory.stat
>> memory.force_empty ? ? ? ? ? ? ? memory.oom_control
>> memory.swappiness
>> memory.limit_in_bytes ? ? ? ? ? ?memory.reclaim_stat
>> memory.usage_in_bytes
>>
>> $ cat /dev/cgroup/memory/D/memory.async_control
>> 0
>> $ echo 1 >/dev/cgroup/memory/D/memory.async_control
>> $ cat /dev/cgroup/memory/D/memory.async_control
>> 1
>>
>> $ echo $$ >/dev/cgroup/memory/D/tasks
>> $ cat /proc/4358/cgroup
>> 3:memory:/D
>>
>> $ time cat /export/hdc3/dd_A/tf0 > /dev/zero
>> Killed
>>
>
> If you applied my patches collectly, async_control can be seen if
> swap controller is configured because of BUG in patch.

I noticed the BUG at the very beginning, so all my tests already include that fix.

>
> I could cat 20G file under 4G limit without any problem with boot option
> swapaccount=0. no problem if async_control == 0 ?

$ cat /dev/cgroup/memory/D/memory.async_control
1

My .config has:
# CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not set

Not sure whether that makes a difference. I will turn it on and test next.

--Ying


>
>
>
> Thanks,
> -Kame
>
>
>
>

2011-05-27 05:48:01

by Ying Han

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 1/10] check reclaimable in hierarchy walk

On Wed, May 25, 2011 at 10:15 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
>
> I may post this patch as stand alone, later.
> ==
> Check memcg has reclaimable pages at select_victim().
>
> Now, with help of bitmap as memcg->scan_node, we can check whether memcg has
> reclaimable pages with easy test of node_empty(&mem->scan_nodes).
>
> mem->scan_nodes is a bitmap to show whether memcg contains reclaimable
> memory or not, which is updated periodically.
>
> This patch makes use of scan_nodes and modify hierarchy walk at memory
> shrinking in following way.
>
> ?- check scan_nodes in mem_cgroup_select_victim()
> ?- mem_cgroup_select_victim() returns NULL if no memcg is reclaimable.
> ?- force update of scan_nodes.
> ?- rename mem_cgroup_select_victim() to be mem_cgroup_select_get_victim()
> ? ?to show refcnt is +1.
>
> This will make hierarchy walk better.
>
> And this allows to remove mem_cgroup_local_pages() check which was used for
> the same purpose. But this function was wrong because it cannot handle
> information of unevictable pages and tmpfs v.s. swapless information.
>
> Changelog:
> ?- added since v3.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> ?mm/memcontrol.c | ?165 +++++++++++++++++++++++++++++++++++++-------------------
> ?1 file changed, 110 insertions(+), 55 deletions(-)
>
> Index: memcg_async/mm/memcontrol.c
> ===================================================================
> --- memcg_async.orig/mm/memcontrol.c
> +++ memcg_async/mm/memcontrol.c
> @@ -584,15 +584,6 @@ static long mem_cgroup_read_stat(struct
> ? ? ? ?return val;
> ?}
>
> -static long mem_cgroup_local_usage(struct mem_cgroup *mem)
> -{
> - ? ? ? long ret;
> -
> - ? ? ? ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
> - ? ? ? ret += mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);
> - ? ? ? return ret;
> -}
> -
> ?static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? bool charge)
> ?{
> @@ -1555,43 +1546,6 @@ u64 mem_cgroup_get_limit(struct mem_cgro
> ? ? ? ?return min(limit, memsw);
> ?}
>
> -/*
> - * Visit the first child (need not be the first child as per the ordering
> - * of the cgroup list, since we track last_scanned_child) of @mem and use
> - * that to reclaim free pages from.
> - */
> -static struct mem_cgroup *
> -mem_cgroup_select_victim(struct mem_cgroup *root_mem)
> -{
> - ? ? ? struct mem_cgroup *ret = NULL;
> - ? ? ? struct cgroup_subsys_state *css;
> - ? ? ? int nextid, found;
> -
> - ? ? ? if (!root_mem->use_hierarchy) {
> - ? ? ? ? ? ? ? css_get(&root_mem->css);
> - ? ? ? ? ? ? ? ret = root_mem;
> - ? ? ? }
> -
> - ? ? ? while (!ret) {
> - ? ? ? ? ? ? ? rcu_read_lock();
> - ? ? ? ? ? ? ? nextid = root_mem->last_scanned_child + 1;
> - ? ? ? ? ? ? ? css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?&found);
> - ? ? ? ? ? ? ? if (css && css_tryget(css))
> - ? ? ? ? ? ? ? ? ? ? ? ret = container_of(css, struct mem_cgroup, css);
> -
> - ? ? ? ? ? ? ? rcu_read_unlock();
> - ? ? ? ? ? ? ? /* Updates scanning parameter */
> - ? ? ? ? ? ? ? if (!css) {
> - ? ? ? ? ? ? ? ? ? ? ? /* this means start scan from ID:1 */
> - ? ? ? ? ? ? ? ? ? ? ? root_mem->last_scanned_child = 0;
> - ? ? ? ? ? ? ? } else
> - ? ? ? ? ? ? ? ? ? ? ? root_mem->last_scanned_child = found;
> - ? ? ? }
> -
> - ? ? ? return ret;
> -}
> -
> ?#if MAX_NUMNODES > 1
>
> ?/*
> @@ -1600,11 +1554,11 @@ mem_cgroup_select_victim(struct mem_cgro
> ?* nodes based on the zonelist. So update the list loosely once per 10 secs.
> ?*
> ?*/
> -static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem)
> +static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem, bool force)
> ?{
> ? ? ? ?int nid;
>
> - ? ? ? if (time_after(mem->next_scan_node_update, jiffies))
> + ? ? ? if (!force && time_after(mem->next_scan_node_update, jiffies))
> ? ? ? ? ? ? ? ?return;
>
> ? ? ? ?mem->next_scan_node_update = jiffies + 10*HZ;
> @@ -1641,7 +1595,7 @@ int mem_cgroup_select_victim_node(struct
> ?{
> ? ? ? ?int node;
>
> - ? ? ? mem_cgroup_may_update_nodemask(mem);
> + ? ? ? mem_cgroup_may_update_nodemask(mem, false);
> ? ? ? ?node = mem->last_scanned_node;
>
> ? ? ? ?node = next_node(node, mem->scan_nodes);
> @@ -1660,13 +1614,117 @@ int mem_cgroup_select_victim_node(struct
> ? ? ? ?return node;
> ?}
>
> +/**
> + * mem_cgroup_has_reclaimable
> + * @mem_cgroup : the mem_cgroup
> + *
> + * The caller can test whether the memcg has reclaimable pages.
> + *
> + * This function checks memcg has reclaimable pages or not with bitmap of
> + * memcg->scan_nodes. This bitmap is updated periodically and indicates
> + * which node has reclaimable memcg memory or not.
> + * Although this is a rough test and result is not very precise but we don't
> + * have to scan all nodes and don't have to use locks.
> + *
> + * For non-NUMA, this cheks reclaimable pages on zones because we don't
> + * update scan_nodes.(see below)
> + */
> +static bool mem_cgroup_has_reclaimable(struct mem_cgroup *memcg)
> +{
> + ? ? ? return !nodes_empty(memcg->scan_nodes);
> +}
> +
> ?#else
> +
> +static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem, bool force)
> +{
> +}
> +
> ?int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
> ?{
> ? ? ? ?return 0;
> ?}
> +
> +static bool mem_cgroup_has_reclaimable(struct mem_cgroup *memcg)
> +{
> + ? ? ? unsigned long nr;
> + ? ? ? int zid;
> +
> + ? ? ? for (zid = NODE_DATA(0)->nr_zones - 1; zid >= 0; zid--)
> + ? ? ? ? ? ? ? if (mem_cgroup_zone_reclaimable_pages(memcg, 0, zid))
> + ? ? ? ? ? ? ? ? ? ? ? break;
> + ? ? ? if (zid < 0)
> + ? ? ? ? ? ? ? return false;
> + ? ? ? return true;
> +}
> ?#endif

unused variable "nr".
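
For reference, a minimal sketch of the !NUMA branch with that unused local
dropped (same logic as the hunk quoted above, just cleaned up; not a new
patch):

static bool mem_cgroup_has_reclaimable(struct mem_cgroup *memcg)
{
	int zid;

	/*
	 * Without NUMA there is no scan_nodes bitmap, so walk the zones
	 * of node 0 and stop at the first one that still holds
	 * reclaimable pages charged to this memcg.
	 */
	for (zid = NODE_DATA(0)->nr_zones - 1; zid >= 0; zid--)
		if (mem_cgroup_zone_reclaimable_pages(memcg, 0, zid))
			break;
	return zid >= 0;
}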

--Ying
>
> +/**
> + * mem_cgroup_select_get_victim
> + * @root_mem: the root memcg of hierarchy which should be shrinked.
> + *
> + * Visit children of root_mem ony by one. If the routine finds a memcg
> + * which contains reclaimable pages, returns it with refcnt +1. The
> + * scan is done in round-robin and 'the next start point' is saved into
> + * mem->last_scanned_child. If no reclaimable memcg are found, returns NULL.
> + */
> +static struct mem_cgroup *
> +mem_cgroup_select_get_victim(struct mem_cgroup *root_mem)
> +{
> + ? ? ? struct mem_cgroup *ret = NULL;
> + ? ? ? struct cgroup_subsys_state *css;
> + ? ? ? int nextid, found;
> + ? ? ? bool second_visit = false;
> +
> + ? ? ? if (!root_mem->use_hierarchy)
> + ? ? ? ? ? ? ? goto return_root;
> +
> + ? ? ? while (!ret) {
> + ? ? ? ? ? ? ? rcu_read_lock();
> + ? ? ? ? ? ? ? nextid = root_mem->last_scanned_child + 1;
> + ? ? ? ? ? ? ? css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?&found);
> + ? ? ? ? ? ? ? if (css && css_tryget(css))
> + ? ? ? ? ? ? ? ? ? ? ? ret = container_of(css, struct mem_cgroup, css);
> +
> + ? ? ? ? ? ? ? rcu_read_unlock();
> + ? ? ? ? ? ? ? /* Updates scanning parameter */
> + ? ? ? ? ? ? ? if (!css) { /* Indicates we scanned the last node of tree */
> + ? ? ? ? ? ? ? ? ? ? ? /*
> + ? ? ? ? ? ? ? ? ? ? ? ?* If all memcg has no reclaimable pages, we may enter
> + ? ? ? ? ? ? ? ? ? ? ? ?* an infinite loop. Exit here if we reached the end
> + ? ? ? ? ? ? ? ? ? ? ? ?* of hierarchy tree twice.
> + ? ? ? ? ? ? ? ? ? ? ? ?*/
> + ? ? ? ? ? ? ? ? ? ? ? if (second_visit)
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? return NULL;
> + ? ? ? ? ? ? ? ? ? ? ? /* this means start scan from ID:1 */
> + ? ? ? ? ? ? ? ? ? ? ? root_mem->last_scanned_child = 0;
> + ? ? ? ? ? ? ? ? ? ? ? second_visit = true;
> + ? ? ? ? ? ? ? } else
> + ? ? ? ? ? ? ? ? ? ? ? root_mem->last_scanned_child = found;
> + ? ? ? ? ? ? ? if (css && ret) {
> + ? ? ? ? ? ? ? ? ? ? ? /*
> + ? ? ? ? ? ? ? ? ? ? ? ?* check memcg has reclaimable memory or not. Update
> + ? ? ? ? ? ? ? ? ? ? ? ?* information carefully if we might fail with cached
> + ? ? ? ? ? ? ? ? ? ? ? ?* bitmask information.
> + ? ? ? ? ? ? ? ? ? ? ? ?*/
> + ? ? ? ? ? ? ? ? ? ? ? if (second_visit)
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? mem_cgroup_may_update_nodemask(ret, true);
> +
> + ? ? ? ? ? ? ? ? ? ? ? if (!mem_cgroup_has_reclaimable(ret)) {
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? css_put(css);
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ret = NULL;
> + ? ? ? ? ? ? ? ? ? ? ? }
> + ? ? ? ? ? ? ? }
> + ? ? ? }
> +
> + ? ? ? return ret;
> +return_root:
> + ? ? ? css_get(&root_mem->css);
> + ? ? ? return root_mem;
> +}
> +
> +
> ?/*
> ?* Scan the hierarchy if needed to reclaim memory. We remember the last child
> ?* we reclaimed from, so that we don't end up penalizing one child extensively
> @@ -1705,7 +1763,9 @@ static int mem_cgroup_hierarchical_recla
> ? ? ? ? ? ? ? ?is_kswapd = true;
>
> ? ? ? ?while (1) {
> - ? ? ? ? ? ? ? victim = mem_cgroup_select_victim(root_mem);
> + ? ? ? ? ? ? ? victim = mem_cgroup_select_get_victim(root_mem);
> + ? ? ? ? ? ? ? if (!victim)
> + ? ? ? ? ? ? ? ? ? ? ? return total;
> ? ? ? ? ? ? ? ?if (victim == root_mem) {
> ? ? ? ? ? ? ? ? ? ? ? ?loop++;
> ? ? ? ? ? ? ? ? ? ? ? ?if (loop >= 1)
> @@ -1733,11 +1793,6 @@ static int mem_cgroup_hierarchical_recla
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?}
> ? ? ? ? ? ? ? ? ? ? ? ?}
> ? ? ? ? ? ? ? ?}
> - ? ? ? ? ? ? ? if (!mem_cgroup_local_usage(victim)) {
> - ? ? ? ? ? ? ? ? ? ? ? /* this cgroup's local usage == 0 */
> - ? ? ? ? ? ? ? ? ? ? ? css_put(&victim->css);
> - ? ? ? ? ? ? ? ? ? ? ? continue;
> - ? ? ? ? ? ? ? }
> ? ? ? ? ? ? ? ?/* we use swappiness of local cgroup */
> ? ? ? ? ? ? ? ?if (check_soft) {
> ? ? ? ? ? ? ? ? ? ? ? ?ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
>
>

2011-05-27 07:20:37

by Ying Han

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 0/10] memcg async reclaim

On Thu, May 26, 2011 at 9:49 PM, Ying Han <[email protected]> wrote:
> On Thu, May 26, 2011 at 9:34 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
>> On Thu, 26 May 2011 21:33:32 -0700
>> Ying Han <[email protected]> wrote:
>>
>>> On Thu, May 26, 2011 at 7:16 PM, KAMEZAWA Hiroyuki
>>> <[email protected]> wrote:
>>> > On Thu, 26 May 2011 18:49:26 -0700
>>> > Ying Han <[email protected]> wrote:
>>> >
>>> >> On Wed, May 25, 2011 at 10:10 PM, KAMEZAWA Hiroyuki
>>> >> <[email protected]> wrote:
>>> >> >
>>> >> > It's now merge window...I just dump my patch queue to hear other's idea.
>>> >> > I wonder I should wait until dirty_ratio for memcg is queued to mmotm...
>>> >> > I'll be busy with LinuxCon Japan etc...in the next week.
>>> >> >
>>> >> > This patch is onto mmotm-May-11 + some patches queued in mmotm, as numa_stat.
>>> >> >
>>> >> > This is a patch for memcg to keep margin to the limit in background.
>>> >> > By keeping some margin to the limit in background, application can
>>> >> > avoid foreground memory reclaim at charge() and this will help latency.
>>> >> >
>>> >> > Main changes from v2 is.
>>> >> > ?- use SCHED_IDLE.
>>> >> > ?- removed most of heuristic codes. Now, code is very simple.
>>> >> >
>>> >> > By using SCHED_IDLE, async memory reclaim can only consume 0.3%? of cpu
>>> >> > if the system is truely busy but can use much CPU if the cpu is idle.
>>> >> > Because my purpose is for reducing latency without affecting other running
>>> >> > applications, SCHED_IDLE fits this work.
>>> >> >
>>> >> > If application need to stop by some I/O or event, background memory reclaim
>>> >> > will cull memory while the system is idle.
>>> >> >
>>> >> > Perforemce:
>>> >> > ?Running an httpd (apache) under 300M limit. And access 600MB working set
>>> >> > ?with normalized distribution access by apatch-bench.
>>> >> > ?apatch bench's concurrency was 4 and did 40960 accesses.
>>> >> >
>>> >> > Without async reclaim:
>>> >> > Connection Times (ms)
>>> >> > ? ? ? ? ? ? ?min ?mean[+/-sd] median ? max
>>> >> > Connect: ? ? ? ?0 ? ?0 ? 0.0 ? ? ?0 ? ? ? 2
>>> >> > Processing: ? ?30 ? 37 ?28.3 ? ? 32 ? ?1793
>>> >> > Waiting: ? ? ? 28 ? 35 ?25.5 ? ? 31 ? ?1792
>>> >> > Total: ? ? ? ? 30 ? 37 ?28.4 ? ? 32 ? ?1793
>>> >> >
>>> >> > Percentage of the requests served within a certain time (ms)
>>> >> > ?50% ? ? 32
>>> >> > ?66% ? ? 32
>>> >> > ?75% ? ? 33
>>> >> > ?80% ? ? 34
>>> >> > ?90% ? ? 39
>>> >> > ?95% ? ? 60
>>> >> > ?98% ? ?100
>>> >> > ?99% ? ?133
>>> >> > ?100% ? 1793 (longest request)
>>> >> >
>>> >> > With async reclaim:
>>> >> > Connection Times (ms)
>>> >> > ? ? ? ? ? ? ?min ?mean[+/-sd] median ? max
>>> >> > Connect: ? ? ? ?0 ? ?0 ? 0.0 ? ? ?0 ? ? ? 2
>>> >> > Processing: ? ?30 ? 35 ?12.3 ? ? 32 ? ? 678
>>> >> > Waiting: ? ? ? 28 ? 34 ?12.0 ? ? 31 ? ? 658
>>> >> > Total: ? ? ? ? 30 ? 35 ?12.3 ? ? 32 ? ? 678
>>> >> >
>>> >> > Percentage of the requests served within a certain time (ms)
>>> >> > ?50% ? ? 32
>>> >> > ?66% ? ? 32
>>> >> > ?75% ? ? 33
>>> >> > ?80% ? ? 34
>>> >> > ?90% ? ? 39
>>> >> > ?95% ? ? 49
>>> >> > ?98% ? ? 71
>>> >> > ?99% ? ? 86
>>> >> > ?100% ? ?678 (longest request)
>>> >> >
>>> >> >
>>> >> > It seems latency is stabilized by hiding memory reclaim.
>>> >> >
>>> >> > The score for memory reclaim was following.
>>> >> > See patch 10 for meaning of each member.
>>> >> >
>>> >> > == without async reclaim ==
>>> >> > recent_scan_success_ratio 44
>>> >> > limit_scan_pages 388463
>>> >> > limit_freed_pages 162238
>>> >> > limit_elapsed_ns 13852159231
>>> >> > soft_scan_pages 0
>>> >> > soft_freed_pages 0
>>> >> > soft_elapsed_ns 0
>>> >> > margin_scan_pages 0
>>> >> > margin_freed_pages 0
>>> >> > margin_elapsed_ns 0
>>> >> >
>>> >> > == with async reclaim ==
>>> >> > recent_scan_success_ratio 6
>>> >> > limit_scan_pages 0
>>> >> > limit_freed_pages 0
>>> >> > limit_elapsed_ns 0
>>> >> > soft_scan_pages 0
>>> >> > soft_freed_pages 0
>>> >> > soft_elapsed_ns 0
>>> >> > margin_scan_pages 1295556
>>> >> > margin_freed_pages 122450
>>> >> > margin_elapsed_ns 644881521
>>> >> >
>>> >> >
>>> >> > For this case, SCHED_IDLE workqueue can reclaim enough memory to the httpd.
>>> >> >
>>> >> > I may need to dig why scan_success_ratio is far different in the both case.
>>> >> > I guess the difference of epalsed_ns is because several threads enter
>>> >> > memory reclaim when async reclaim doesn't run. But may not...
>>> >> >
>>> >>
>>> >>
>>> >> Hmm.. I noticed a very strange behavior on a simple test w/ the patch set.
>>> >>
>>> >> Test:
>>> >> I created a 4g memcg and start doing cat. Then the memcg being OOM
>>> >> killed as soon as it reaches its hard_limit. We shouldn't hit OOM even
>>> >> w/o async-reclaim.
>>> >>
>>> >> Again, I will read through the patch. But like to post the test result first.
>>> >>
>>> >> $ echo $$ >/dev/cgroup/memory/A/tasks
>>> >> $ cat /dev/cgroup/memory/A/memory.limit_in_bytes
>>> >> 4294967296
>>> >>
>>> >> $ time cat /export/hdc3/dd_A/tf0 > /dev/zero
>>> >> Killed
>>> >>
>>> >
>>> > I did the same kind of test without any problem...but ok, I'll do more test
>>> > later.
>>> >
>>> >
>>> >
>>> >> real ?0m53.565s
>>> >> user ?0m0.061s
>>> >> sys ? 0m4.814s
>>> >>
>>> >> Here is the OOM log:
>>> >>
>>> >> May 26 18:43:00 ?kernel: [ ?963.489112] cat invoked oom-killer:
>>> >> gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
>>> >> May 26 18:43:00 ?kernel: [ ?963.489121] Pid: 9425, comm: cat Tainted:
>>> >> G ? ? ? ?W ? 2.6.39-mcg-DEV #131
>>> >> May 26 18:43:00 ?kernel: [ ?963.489123] Call Trace:
>>> >> May 26 18:43:00 ?kernel: [ ?963.489134] ?[<ffffffff810e3512>]
>>> >> dump_header+0x82/0x1af
>>> >> May 26 18:43:00 ?kernel: [ ?963.489137] ?[<ffffffff810e33ca>] ?
>>> >> spin_lock+0xe/0x10
>>> >> May 26 18:43:00 ?kernel: [ ?963.489140] ?[<ffffffff810e33f9>] ?
>>> >> find_lock_task_mm+0x2d/0x67
>>> >> May 26 18:43:00 ?kernel: [ ?963.489143] ?[<ffffffff810e38dd>]
>>> >> oom_kill_process+0x50/0x27b
>>> >> May 26 18:43:00 ?kernel: [ ?963.489155] ?[<ffffffff810e3dc6>]
>>> >> mem_cgroup_out_of_memory+0x9a/0xe4
>>> >> May 26 18:43:00 ?kernel: [ ?963.489160] ?[<ffffffff811153aa>]
>>> >> mem_cgroup_handle_oom+0x134/0x1fe
>>> >> May 26 18:43:00 ?kernel: [ ?963.489163] ?[<ffffffff81114a72>] ?
>>> >> __mem_cgroup_insert_exceeded+0x83/0x83
>>> >> May 26 18:43:00 ?kernel: [ ?963.489176] ?[<ffffffff811166e9>]
>>> >> __mem_cgroup_try_charge.clone.3+0x368/0x43a
>>> >> May 26 18:43:00 ?kernel: [ ?963.489179] ?[<ffffffff81117586>]
>>> >> mem_cgroup_cache_charge+0x95/0x123
>>> >> May 26 18:43:00 ?kernel: [ ?963.489183] ?[<ffffffff810e16d8>]
>>> >> add_to_page_cache_locked+0x42/0x114
>>> >> May 26 18:43:00 ?kernel: [ ?963.489185] ?[<ffffffff810e17db>]
>>> >> add_to_page_cache_lru+0x31/0x5f
>>> >> May 26 18:43:00 ?kernel: [ ?963.489189] ?[<ffffffff81145636>]
>>> >> mpage_readpages+0xb6/0x132
>>> >> May 26 18:43:00 ?kernel: [ ?963.489194] ?[<ffffffff8119992f>] ?
>>> >> noalloc_get_block_write+0x24/0x24
>>> >> May 26 18:43:00 ?kernel: [ ?963.489197] ?[<ffffffff8119992f>] ?
>>> >> noalloc_get_block_write+0x24/0x24
>>> >> May 26 18:43:00 ?kernel: [ ?963.489201] ?[<ffffffff81036742>] ?
>>> >> __switch_to+0x160/0x212
>>> >> May 26 18:43:00 ?kernel: [ ?963.489205] ?[<ffffffff811978b2>]
>>> >> ext4_readpages+0x1d/0x1f
>>> >> May 26 18:43:00 ?kernel: [ ?963.489209] ?[<ffffffff810e8d4b>]
>>> >> __do_page_cache_readahead+0x144/0x1e3
>>> >> May 26 18:43:00 ?kernel: [ ?963.489212] ?[<ffffffff810e8e0b>]
>>> >> ra_submit+0x21/0x25
>>> >> May 26 18:43:00 ?kernel: [ ?963.489215] ?[<ffffffff810e9075>]
>>> >> ondemand_readahead+0x18c/0x19f
>>> >> May 26 18:43:00 ?kernel: [ ?963.489218] ?[<ffffffff810e9105>]
>>> >> page_cache_async_readahead+0x7d/0x86
>>> >> May 26 18:43:00 ?kernel: [ ?963.489221] ?[<ffffffff810e2b7e>]
>>> >> generic_file_aio_read+0x2d8/0x5fe
>>> >> May 26 18:43:00 ?kernel: [ ?963.489225] ?[<ffffffff81119626>]
>>> >> do_sync_read+0xcb/0x108
>>> >> May 26 18:43:00 ?kernel: [ ?963.489230] ?[<ffffffff811f168a>] ?
>>> >> fsnotify_perm+0x66/0x72
>>> >> May 26 18:43:00 ?kernel: [ ?963.489233] ?[<ffffffff811f16f7>] ?
>>> >> security_file_permission+0x2e/0x33
>>> >> May 26 18:43:00 ?kernel: [ ?963.489236] ?[<ffffffff8111a0c8>]
>>> >> vfs_read+0xab/0x107
>>> >> May 26 18:43:00 ?kernel: [ ?963.489239] ?[<ffffffff8111a1e4>] sys_read+0x4a/0x6e
>>> >> May 26 18:43:00 ?kernel: [ ?963.489244] ?[<ffffffff8140f469>]
>>> >> sysenter_dispatch+0x7/0x27
>>> >> May 26 18:43:00 ?kernel: [ ?963.489248] Task in /A killed as a result
>>> >> of limit of /A
>>> >> May 26 18:43:00 ?kernel: [ ?963.489251] memory: usage 4194304kB, limit
>>> >> 4194304kB, failcnt 26
>>> >> May 26 18:43:00 ?kernel: [ ?963.489253] memory+swap: usage 0kB, limit
>>> >> 9007199254740991kB, failcnt 0
>>> >>
>>> >
>>> > Hmm, why memory+swap usage 0kb here...
>>> >
>>> > In this set, I used mem_cgroup_margin() rather than res_counter_margin().
>>> > Hmm, do you disable swap accounting ? If so, I may miss some.
>>>
>>> Yes, I disabled the swap accounting in .config:
>>> # CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not set
>>>
>>>
>>> Here is how i reproduce it:
>>>
>>> $ mkdir /dev/cgroup/memory/D
>>> $ echo 4g >/dev/cgroup/memory/D/memory.limit_in_bytes
>>>
>>> $ cat /dev/cgroup/memory/D/memory.limit_in_bytes
>>> 4294967296
>>>
>>> $ cat /dev/cgroup/memory/D/memory.
>>> memory.async_control ? ? ? ? ? ? memory.max_usage_in_bytes
>>> memory.soft_limit_in_bytes ? ? ? memory.use_hierarchy
>>> memory.failcnt ? ? ? ? ? ? ? ? ? memory.move_charge_at_immigrate
>>> memory.stat
>>> memory.force_empty ? ? ? ? ? ? ? memory.oom_control
>>> memory.swappiness
>>> memory.limit_in_bytes ? ? ? ? ? ?memory.reclaim_stat
>>> memory.usage_in_bytes
>>>
>>> $ cat /dev/cgroup/memory/D/memory.async_control
>>> 0
>>> $ echo 1 >/dev/cgroup/memory/D/memory.async_control
>>> $ cat /dev/cgroup/memory/D/memory.async_control
>>> 1
>>>
>>> $ echo $$ >/dev/cgroup/memory/D/tasks
>>> $ cat /proc/4358/cgroup
>>> 3:memory:/D
>>>
>>> $ time cat /export/hdc3/dd_A/tf0 > /dev/zero
>>> Killed
>>>
>>
>> If you applied my patches collectly, async_control can be seen if
>> swap controller is configured because of BUG in patch.
>
> I noticed the BUG at the very beginning, so all my tests are having the fix.
>
>>
>> I could cat 20G file under 4G limit without any problem with boot option
>> swapaccount=0. no problem if async_control == 0 ?
>
> $ cat /dev/cgroup/memory/D/memory.async_control
> 1
>
> I have the .config
> # CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not set
>
> Not sure if that makes difference. I will test next to turn that on.

I figured out the problem and verified the fix. Our configurations probably
differ on the "#if MAX_NUMNODES > 1" side: with the typo below, inactive
file pages are never counted as reclaimable, so a memcg full of streaming
page cache can look as if it has nothing left to reclaim.

Please apply the following patch:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6a52699..0b88d71 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1217,7 +1217,7 @@ unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
 	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
 
 	nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
-		MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE);
+		MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_FILE);
 	if (nr_swap_pages > 0)
 		nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
 			MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);
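
For clarity, with that one-liner applied the helper from patch 3/10 reads
roughly as follows (a sketch of the end result, not a separate patch):

unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
						int nid, int zid)
{
	unsigned long nr;
	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);

	/* file pages on both the active and inactive lists are reclaimable */
	nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
		MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_FILE);
	/* anon pages count only when there is swap space to push them to */
	if (nr_swap_pages > 0)
		nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
			MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);
	return nr;
}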

--Ying

>
> --Ying
>
>
>>
>>
>>
>> Thanks,
>> -Kame
>>
>>
>>
>>
>

2011-05-27 07:21:36

by Ying Han

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 3/10] memcg: a test whether zone is reclaimable or not

On Wed, May 25, 2011 at 10:19 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> From: Ying Han <[email protected]>
>
> The number of reclaimable pages per zone is an useful information for
> controling memory reclaim schedule. This patch exports it.
>
> Changelog v2->v3:
> ?- added comments.
>
> Signed-off-by: Ying Han <[email protected]>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> ?include/linux/memcontrol.h | ? ?2 ++
> ?mm/memcontrol.c ? ? ? ? ? ?| ? 24 ++++++++++++++++++++++++
> ?2 files changed, 26 insertions(+)
>
> Index: memcg_async/mm/memcontrol.c
> ===================================================================
> --- memcg_async.orig/mm/memcontrol.c
> +++ memcg_async/mm/memcontrol.c
> @@ -1240,6 +1240,30 @@ static unsigned long mem_cgroup_nr_lru_p
> ?}
> ?#endif /* CONFIG_NUMA */
>
> +/**
> + * mem_cgroup_zone_reclaimable_pages
> + * @memcg: the memcg
> + * @nid ?: node index to be checked.
> + * @zid ?: zone index to be checked.
> + *
> + * This function returns the number reclaimable pages on a zone for given memcg.
> + * Reclaimable page includes file caches and anonymous pages if swap is
> + * avaliable and never includes unevictable pages.
> + */
> +unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? int nid, int zid)
> +{
> + ? ? ? unsigned long nr;
> + ? ? ? struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
> +
> + ? ? ? nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
> + ? ? ? ? ? ? ? MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE);
> + ? ? ? if (nr_swap_pages > 0)
> + ? ? ? ? ? ? ? nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
> + ? ? ? ? ? ? ? ? ? ? ? MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);
> + ? ? ? return nr;
> +}
> +
> ?struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct zone *zone)
> ?{
> Index: memcg_async/include/linux/memcontrol.h
> ===================================================================
> --- memcg_async.orig/include/linux/memcontrol.h
> +++ memcg_async/include/linux/memcontrol.h
> @@ -109,6 +109,8 @@ extern void mem_cgroup_end_migration(str
> ?*/
> ?int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
> ?int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
> +unsigned long
> +mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg, int nid, int zid);
> ?int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
> ?unsigned long mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct zone *zone,
>
>

Again, please apply the patch:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6a52699..0b88d71 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1217,7 +1217,7 @@ unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
 	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
 
 	nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
-		MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE);
+		MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_FILE);
 	if (nr_swap_pages > 0)
 		nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
 			MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);


Also, you need to move this patch earlier in the series, since patch 1/10 needs it.

--Ying

2011-05-27 08:32:16

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 3/10] memcg: a test whether zone is reclaimable or not

On Fri, 27 May 2011 00:21:31 -0700
Ying Han <[email protected]> wrote:

> On Wed, May 25, 2011 at 10:19 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > From: Ying Han <[email protected]>
> >
> > The number of reclaimable pages per zone is an useful information for
> > controling memory reclaim schedule. This patch exports it.
> >
> > Changelog v2->v3:
> >  - added comments.
> >
> > Signed-off-by: Ying Han <[email protected]>
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> >  include/linux/memcontrol.h |    2 ++
> >  mm/memcontrol.c            |   24 ++++++++++++++++++++++++
> >  2 files changed, 26 insertions(+)
> >
> > Index: memcg_async/mm/memcontrol.c
> > ===================================================================
> > --- memcg_async.orig/mm/memcontrol.c
> > +++ memcg_async/mm/memcontrol.c
> > @@ -1240,6 +1240,30 @@ static unsigned long mem_cgroup_nr_lru_p
> >  }
> >  #endif /* CONFIG_NUMA */
> >
> > +/**
> > + * mem_cgroup_zone_reclaimable_pages
> > + * @memcg: the memcg
> > + * @nid  : node index to be checked.
> > + * @zid  : zone index to be checked.
> > + *
> > + * This function returns the number reclaimable pages on a zone for given memcg.
> > + * Reclaimable page includes file caches and anonymous pages if swap is
> > + * avaliable and never includes unevictable pages.
> > + */
> > +unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
> > +                                               int nid, int zid)
> > +{
> > +       unsigned long nr;
> > +       struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
> > +
> > +       nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
> > +               MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE);
> > +       if (nr_swap_pages > 0)
> > +               nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
> > +                       MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);
> > +       return nr;
> > +}
> > +
> >  struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
> >                                                      struct zone *zone)
> >  {
> > Index: memcg_async/include/linux/memcontrol.h
> > ===================================================================
> > --- memcg_async.orig/include/linux/memcontrol.h
> > +++ memcg_async/include/linux/memcontrol.h
> > @@ -109,6 +109,8 @@ extern void mem_cgroup_end_migration(str
> >  */
> >  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
> >  int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
> > +unsigned long
> > +mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg, int nid, int zid);
> >  int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
> >  unsigned long mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg,
> >                                                struct zone *zone,
> >
> >
>
> Again, please apply the patch:
>

Nice catch. Thank you.

-Kame

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6a52699..0b88d71 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1217,7 +1217,7 @@ unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
>  	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
>  
>  	nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
> -		MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE);
> +		MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_FILE);
>  	if (nr_swap_pages > 0)
>  		nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
>  			MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);
>
>
> Also, you need to move this to up since patch 1/10 needs this.
>
> --Ying
>

2011-05-27 22:38:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 7/10] workqueue: add WQ_IDLEPRI

On Thu, 2011-05-26 at 11:38 +0200, Tejun Heo wrote:
>
> We can add a mechanism to manage work item scheduler priority to
> workqueue if necessary tho, I think. But that would be per-workqueue
> attribute which is applied during execution, not something per-gcwq.
>
Only if we then also make PI possible ;-)

2011-05-31 17:04:12

by Ying Han

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 0/10] memcg async reclaim

Some testing results on the patchset, based on mmotm-2011-05-12-15-52.

Test: create a 4g memcg on a 32g host, and change the default dirty
ratios to "10% dirty_ratio, 5% dirty_background_ratio" because there is
no per-memcg dirty limit yet.
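
The setup corresponds roughly to the following sequence (reconstructed
from the commands earlier in this thread; the cgroup mount point is the
one used above):

$ mkdir /dev/cgroup/memory/D
$ echo 4g >/dev/cgroup/memory/D/memory.limit_in_bytes
$ echo 1 >/dev/cgroup/memory/D/memory.async_control
$ echo $$ >/dev/cgroup/memory/D/tasks

$ echo 10 >/proc/sys/vm/dirty_ratio
$ echo 5 >/proc/sys/vm/dirty_background_ratio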

I did a simple streaming I/O test (read/write of a 20g file) and used the
page fault histogram to capture page fault latency under reclaim. In both
tests, enabling the per-memcg async reclaim adds no page fault latency
from the reclaim path. I will post the V2 of the patch right after this.

$ cat /export/hdc3/dd_A/tf0 > /dev/zero

w/o the async reclaim:
page reclaim latency histogram (us):
< 150 542
< 200 15402
< 250 481
< 300 23
< 350 1
< 400 0
< 450 0
< rest 0
real    4m26.604s
user    0m0.294s
sys     0m26.632s

with async reclaim:
page reclaim latency histogram (us):
< 150 0
< 200 0
< 250 0
< 300 0
< 350 1
< 400 1
< 450 0
< rest 1
real    4m26.605s
user    0m0.246s
sys     0m24.341s

$ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520

w/o async reclaim:
page reclaim latency histogram (us):
< 150 4
< 200 17382
< 250 3238
< 300 276
< 350 18
< 400 1
< 450 1
< rest 103
real    4m26.984s
user    0m4.964s
sys     1m8.718s

with async reclaim:
page reclaim latency histogram (us):
< 150 0
< 200 0
< 250 0
< 300 0
< 350 0
< 400 0
< 450 0
< rest 0
real    4m16.355s
user    0m4.593s
sys     1m5.896s

Thanks

--Ying

On Fri, May 27, 2011 at 12:20 AM, Ying Han <[email protected]> wrote:
> On Thu, May 26, 2011 at 9:49 PM, Ying Han <[email protected]> wrote:
>> On Thu, May 26, 2011 at 9:34 PM, KAMEZAWA Hiroyuki
>> <[email protected]> wrote:
>>> On Thu, 26 May 2011 21:33:32 -0700
>>> Ying Han <[email protected]> wrote:
>>>
>>>> On Thu, May 26, 2011 at 7:16 PM, KAMEZAWA Hiroyuki
>>>> <[email protected]> wrote:
>>>> > On Thu, 26 May 2011 18:49:26 -0700
>>>> > Ying Han <[email protected]> wrote:
>>>> >
>>>> >> On Wed, May 25, 2011 at 10:10 PM, KAMEZAWA Hiroyuki
>>>> >> <[email protected]> wrote:
>>>> >> >
>>>> >> > It's now merge window...I just dump my patch queue to hear other's idea.
>>>> >> > I wonder I should wait until dirty_ratio for memcg is queued to mmotm...
>>>> >> > I'll be busy with LinuxCon Japan etc...in the next week.
>>>> >> >
>>>> >> > This patch is onto mmotm-May-11 + some patches queued in mmotm, as numa_stat.
>>>> >> >
>>>> >> > This is a patch for memcg to keep margin to the limit in background.
>>>> >> > By keeping some margin to the limit in background, application can
>>>> >> > avoid foreground memory reclaim at charge() and this will help latency.
>>>> >> >
>>>> >> > Main changes from v2 is.
>>>> >> > ?- use SCHED_IDLE.
>>>> >> > ?- removed most of heuristic codes. Now, code is very simple.
>>>> >> >
>>>> >> > By using SCHED_IDLE, async memory reclaim can only consume 0.3%? of cpu
>>>> >> > if the system is truely busy but can use much CPU if the cpu is idle.
>>>> >> > Because my purpose is for reducing latency without affecting other running
>>>> >> > applications, SCHED_IDLE fits this work.
>>>> >> >
>>>> >> > If application need to stop by some I/O or event, background memory reclaim
>>>> >> > will cull memory while the system is idle.
>>>> >> >
>>>> >> > Perforemce:
>>>> >> > ?Running an httpd (apache) under 300M limit. And access 600MB working set
>>>> >> > ?with normalized distribution access by apatch-bench.
>>>> >> > ?apatch bench's concurrency was 4 and did 40960 accesses.
>>>> >> >
>>>> >> > Without async reclaim:
>>>> >> > Connection Times (ms)
>>>> >> > ? ? ? ? ? ? ?min ?mean[+/-sd] median ? max
>>>> >> > Connect: ? ? ? ?0 ? ?0 ? 0.0 ? ? ?0 ? ? ? 2
>>>> >> > Processing: ? ?30 ? 37 ?28.3 ? ? 32 ? ?1793
>>>> >> > Waiting: ? ? ? 28 ? 35 ?25.5 ? ? 31 ? ?1792
>>>> >> > Total: ? ? ? ? 30 ? 37 ?28.4 ? ? 32 ? ?1793
>>>> >> >
>>>> >> > Percentage of the requests served within a certain time (ms)
>>>> >> > ?50% ? ? 32
>>>> >> > ?66% ? ? 32
>>>> >> > ?75% ? ? 33
>>>> >> > ?80% ? ? 34
>>>> >> > ?90% ? ? 39
>>>> >> > ?95% ? ? 60
>>>> >> > ?98% ? ?100
>>>> >> > ?99% ? ?133
>>>> >> > ?100% ? 1793 (longest request)
>>>> >> >
>>>> >> > With async reclaim:
>>>> >> > Connection Times (ms)
>>>> >> > ? ? ? ? ? ? ?min ?mean[+/-sd] median ? max
>>>> >> > Connect: ? ? ? ?0 ? ?0 ? 0.0 ? ? ?0 ? ? ? 2
>>>> >> > Processing: ? ?30 ? 35 ?12.3 ? ? 32 ? ? 678
>>>> >> > Waiting: ? ? ? 28 ? 34 ?12.0 ? ? 31 ? ? 658
>>>> >> > Total: ? ? ? ? 30 ? 35 ?12.3 ? ? 32 ? ? 678
>>>> >> >
>>>> >> > Percentage of the requests served within a certain time (ms)
>>>> >> > ?50% ? ? 32
>>>> >> > ?66% ? ? 32
>>>> >> > ?75% ? ? 33
>>>> >> > ?80% ? ? 34
>>>> >> > ?90% ? ? 39
>>>> >> > ?95% ? ? 49
>>>> >> > ?98% ? ? 71
>>>> >> > ?99% ? ? 86
>>>> >> > ?100% ? ?678 (longest request)
>>>> >> >
>>>> >> >
>>>> >> > It seems latency is stabilized by hiding memory reclaim.
>>>> >> >
>>>> >> > The score for memory reclaim was following.
>>>> >> > See patch 10 for meaning of each member.
>>>> >> >
>>>> >> > == without async reclaim ==
>>>> >> > recent_scan_success_ratio 44
>>>> >> > limit_scan_pages 388463
>>>> >> > limit_freed_pages 162238
>>>> >> > limit_elapsed_ns 13852159231
>>>> >> > soft_scan_pages 0
>>>> >> > soft_freed_pages 0
>>>> >> > soft_elapsed_ns 0
>>>> >> > margin_scan_pages 0
>>>> >> > margin_freed_pages 0
>>>> >> > margin_elapsed_ns 0
>>>> >> >
>>>> >> > == with async reclaim ==
>>>> >> > recent_scan_success_ratio 6
>>>> >> > limit_scan_pages 0
>>>> >> > limit_freed_pages 0
>>>> >> > limit_elapsed_ns 0
>>>> >> > soft_scan_pages 0
>>>> >> > soft_freed_pages 0
>>>> >> > soft_elapsed_ns 0
>>>> >> > margin_scan_pages 1295556
>>>> >> > margin_freed_pages 122450
>>>> >> > margin_elapsed_ns 644881521
>>>> >> >
>>>> >> >
>>>> >> > For this case, SCHED_IDLE workqueue can reclaim enough memory to the httpd.
>>>> >> >
>>>> >> > I may need to dig why scan_success_ratio is far different in the both case.
>>>> >> > I guess the difference of epalsed_ns is because several threads enter
>>>> >> > memory reclaim when async reclaim doesn't run. But may not...
>>>> >> >
>>>> >>
>>>> >>
>>>> >> Hmm.. I noticed a very strange behavior on a simple test w/ the patch set.
>>>> >>
>>>> >> Test:
>>>> >> I created a 4g memcg and start doing cat. Then the memcg being OOM
>>>> >> killed as soon as it reaches its hard_limit. We shouldn't hit OOM even
>>>> >> w/o async-reclaim.
>>>> >>
>>>> >> Again, I will read through the patch. But like to post the test result first.
>>>> >>
>>>> >> $ echo $$ >/dev/cgroup/memory/A/tasks
>>>> >> $ cat /dev/cgroup/memory/A/memory.limit_in_bytes
>>>> >> 4294967296
>>>> >>
>>>> >> $ time cat /export/hdc3/dd_A/tf0 > /dev/zero
>>>> >> Killed
>>>> >>
>>>> >
>>>> > I did the same kind of test without any problem...but ok, I'll do more test
>>>> > later.
>>>> >
>>>> >
>>>> >
>>>> >> real    0m53.565s
>>>> >> user    0m0.061s
>>>> >> sys     0m4.814s
>>>> >>
>>>> >> Here is the OOM log:
>>>> >>
>>>> >> May 26 18:43:00  kernel: [  963.489112] cat invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
>>>> >> May 26 18:43:00  kernel: [  963.489121] Pid: 9425, comm: cat Tainted: G        W   2.6.39-mcg-DEV #131
>>>> >> May 26 18:43:00  kernel: [  963.489123] Call Trace:
>>>> >> May 26 18:43:00  kernel: [  963.489134]  [<ffffffff810e3512>] dump_header+0x82/0x1af
>>>> >> May 26 18:43:00  kernel: [  963.489137]  [<ffffffff810e33ca>] ? spin_lock+0xe/0x10
>>>> >> May 26 18:43:00  kernel: [  963.489140]  [<ffffffff810e33f9>] ? find_lock_task_mm+0x2d/0x67
>>>> >> May 26 18:43:00  kernel: [  963.489143]  [<ffffffff810e38dd>] oom_kill_process+0x50/0x27b
>>>> >> May 26 18:43:00  kernel: [  963.489155]  [<ffffffff810e3dc6>] mem_cgroup_out_of_memory+0x9a/0xe4
>>>> >> May 26 18:43:00  kernel: [  963.489160]  [<ffffffff811153aa>] mem_cgroup_handle_oom+0x134/0x1fe
>>>> >> May 26 18:43:00  kernel: [  963.489163]  [<ffffffff81114a72>] ? __mem_cgroup_insert_exceeded+0x83/0x83
>>>> >> May 26 18:43:00  kernel: [  963.489176]  [<ffffffff811166e9>] __mem_cgroup_try_charge.clone.3+0x368/0x43a
>>>> >> May 26 18:43:00  kernel: [  963.489179]  [<ffffffff81117586>] mem_cgroup_cache_charge+0x95/0x123
>>>> >> May 26 18:43:00  kernel: [  963.489183]  [<ffffffff810e16d8>] add_to_page_cache_locked+0x42/0x114
>>>> >> May 26 18:43:00  kernel: [  963.489185]  [<ffffffff810e17db>] add_to_page_cache_lru+0x31/0x5f
>>>> >> May 26 18:43:00  kernel: [  963.489189]  [<ffffffff81145636>] mpage_readpages+0xb6/0x132
>>>> >> May 26 18:43:00  kernel: [  963.489194]  [<ffffffff8119992f>] ? noalloc_get_block_write+0x24/0x24
>>>> >> May 26 18:43:00  kernel: [  963.489197]  [<ffffffff8119992f>] ? noalloc_get_block_write+0x24/0x24
>>>> >> May 26 18:43:00  kernel: [  963.489201]  [<ffffffff81036742>] ? __switch_to+0x160/0x212
>>>> >> May 26 18:43:00  kernel: [  963.489205]  [<ffffffff811978b2>] ext4_readpages+0x1d/0x1f
>>>> >> May 26 18:43:00  kernel: [  963.489209]  [<ffffffff810e8d4b>] __do_page_cache_readahead+0x144/0x1e3
>>>> >> May 26 18:43:00  kernel: [  963.489212]  [<ffffffff810e8e0b>] ra_submit+0x21/0x25
>>>> >> May 26 18:43:00  kernel: [  963.489215]  [<ffffffff810e9075>] ondemand_readahead+0x18c/0x19f
>>>> >> May 26 18:43:00  kernel: [  963.489218]  [<ffffffff810e9105>] page_cache_async_readahead+0x7d/0x86
>>>> >> May 26 18:43:00  kernel: [  963.489221]  [<ffffffff810e2b7e>] generic_file_aio_read+0x2d8/0x5fe
>>>> >> May 26 18:43:00  kernel: [  963.489225]  [<ffffffff81119626>] do_sync_read+0xcb/0x108
>>>> >> May 26 18:43:00  kernel: [  963.489230]  [<ffffffff811f168a>] ? fsnotify_perm+0x66/0x72
>>>> >> May 26 18:43:00  kernel: [  963.489233]  [<ffffffff811f16f7>] ? security_file_permission+0x2e/0x33
>>>> >> May 26 18:43:00  kernel: [  963.489236]  [<ffffffff8111a0c8>] vfs_read+0xab/0x107
>>>> >> May 26 18:43:00  kernel: [  963.489239]  [<ffffffff8111a1e4>] sys_read+0x4a/0x6e
>>>> >> May 26 18:43:00  kernel: [  963.489244]  [<ffffffff8140f469>] sysenter_dispatch+0x7/0x27
>>>> >> May 26 18:43:00  kernel: [  963.489248] Task in /A killed as a result of limit of /A
>>>> >> May 26 18:43:00  kernel: [  963.489251] memory: usage 4194304kB, limit 4194304kB, failcnt 26
>>>> >> May 26 18:43:00  kernel: [  963.489253] memory+swap: usage 0kB, limit 9007199254740991kB, failcnt 0
>>>> >>
>>>> >
>>>> > Hmm, why is memory+swap usage 0kB here...
>>>> >
>>>> > In this set, I used mem_cgroup_margin() rather than res_counter_margin().
>>>> > Hmm, did you disable swap accounting? If so, I may be missing something.
>>>>
>>>> Yes, I disabled the swap accounting in .config:
>>>> # CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not set
>>>>
>>>>
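A hypothetical, standalone model of the distinction raised above (assumed names and illustrative numbers only, not code from this series): a margin helper that also consults a memory+swap counter can report a smaller margin than the plain res counter does, and with swap accounting compiled out there is no memsw side to consult at all.

/*
 * Standalone userspace model, NOT the kernel code: shows how the
 * reported margin changes depending on whether a memsw counter is
 * taken into account.
 */
#include <stdio.h>

struct counter { unsigned long long usage, limit; };

static unsigned long long margin(const struct counter *c)
{
	return c->limit > c->usage ? c->limit - c->usage : 0;
}

int main(void)
{
	struct counter res   = { 3ULL << 30, 4ULL << 30 };  /* memory usage/limit */
	struct counter memsw = { 4ULL << 30, 4ULL << 30 };  /* memory+swap usage/limit */
	int swap_accounting = 0;  /* CONFIG_CGROUP_MEM_RES_CTLR_SWAP not set */

	unsigned long long m = margin(&res);
	if (swap_accounting && margin(&memsw) < m)
		m = margin(&memsw);

	printf("margin seen by reclaim trigger: %llu bytes\n", m);
	return 0;
}

Built as a normal userspace program, this prints a 1GB margin with swap accounting off and 0 with it on, which is the kind of difference the question above is probing.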
>>>> Here is how I reproduce it:
>>>>
>>>> $ mkdir /dev/cgroup/memory/D
>>>> $ echo 4g >/dev/cgroup/memory/D/memory.limit_in_bytes
>>>>
>>>> $ cat /dev/cgroup/memory/D/memory.limit_in_bytes
>>>> 4294967296
>>>>
>>>> $ cat /dev/cgroup/memory/D/memory.
>>>> memory.async_control             memory.max_usage_in_bytes
>>>> memory.soft_limit_in_bytes       memory.use_hierarchy
>>>> memory.failcnt                   memory.move_charge_at_immigrate
>>>> memory.stat
>>>> memory.force_empty               memory.oom_control
>>>> memory.swappiness
>>>> memory.limit_in_bytes            memory.reclaim_stat
>>>> memory.usage_in_bytes
>>>>
>>>> $ cat /dev/cgroup/memory/D/memory.async_control
>>>> 0
>>>> $ echo 1 >/dev/cgroup/memory/D/memory.async_control
>>>> $ cat /dev/cgroup/memory/D/memory.async_control
>>>> 1
>>>>
>>>> $ echo $$ >/dev/cgroup/memory/D/tasks
>>>> $ cat /proc/4358/cgroup
>>>> 3:memory:/D
>>>>
>>>> $ time cat /export/hdc3/dd_A/tf0 > /dev/zero
>>>> Killed
>>>>
>>>
>>> If you applied my patches correctly, async_control can only be seen if
>>> the swap controller is configured, because of a BUG in the patch.
>>
>> I noticed the BUG at the very beginning, so all my tests already include the fix.
>>
>>>
>>> I could cat a 20G file under a 4G limit without any problem with the boot
>>> option swapaccount=0. Is there no problem if async_control == 0?
>>
>> $ cat /dev/cgroup/memory/D/memory.async_control
>> 1
>>
>> I have the .config
>> # CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not set
>>
>> Not sure if that makes a difference. I will test next with that turned on.
>
> I know what the problem is and have verified the fix. Our configurations might
> differ on the "#if MAX_NUMNODES > 1".
>
> Please apply the following patch:
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6a52699..0b88d71 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1217,7 +1217,7 @@ unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
>  	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
>
>  	nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
> -		MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE);
> +		MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_FILE);
>  	if (nr_swap_pages > 0)
>  		nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
>  			MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);
>
> --Ying
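A minimal standalone sketch of why that one-word change matters for the streaming-read test (illustrative page counts, not figures from this thread): during a large cat, almost every file page sits on the inactive list, so adding NR_ACTIVE_FILE twice makes the group look as if it has nothing to reclaim.

/*
 * Standalone model, not the patched kernel source: compares the buggy
 * and fixed reclaimable-page estimates for a streaming read, where the
 * active file list is tiny and the inactive file list holds the bulk
 * of the cgroup's page cache.
 */
#include <stdio.h>

int main(void)
{
	unsigned long active_file   = 2048;     /* hypothetical counts */
	unsigned long inactive_file = 1000000;

	unsigned long before_fix = active_file + active_file;    /* ACTIVE counted twice */
	unsigned long after_fix  = active_file + inactive_file;  /* ACTIVE + INACTIVE */

	printf("reclaimable estimate before fix: %lu pages\n", before_fix);
	printf("reclaimable estimate after fix:  %lu pages\n", after_fix);
	return 0;
}

With the inactive-file term restored, the reclaimability check sees the page cache again and the charge path can reclaim instead of going straight to OOM.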
>
>>
>> --Ying
>>
>>
>>>
>>>
>>>
>>> Thanks,
>>> -Kame
>>>
>>>
>>>
>>>
>>
>