2011-05-20 03:44:44

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 0/8] memcg async reclaim v2


Since v1, I did some brush up and more tests.

main changes are
- disabled at default
- add a control file to enable it
- never allow enabled on UP machine.
- don't writepage at all (revisit this when dirty_ratio comes.)
- change handling of priorty and total scan, add more sleep chances.

But yes, maybe some more changes/tests will be needed and I don't want to
rush before next kernel version.

IIUC, what pointed out in previous post was "show numbers". Because this kind of
asyncronous reclaim just increase cpu usage and no help to latency, just makes
scores bad.

I tested with apatch bench in following way.

1. create cgroup /cgroup/memory/httpd
2. move httpd under it
3. create 4096 files under /var/www/html/data
each file's size is 160kb.
4. prepare a cgi scipt to acess 4096 files in random as
==
#!/usr/bin/python
# -*- coding: utf-8 -*-

import random

print "Content-Type: text/plain\n\n"

num = int(round(random.normalvariate(0.5, 0.1) * 4096))
filename = "/var/www/html/data/" + str(num)

with open(filename) as f:
buf = f.read(128*1024)
print "Hello world " + str(num) + "\n"
==
This reads random file and returns Hello World. I used "normalvariate()"
for getting normal distribution access to files.

By this, 160kb*4096 files of data is accessed in normal distribution.

5. access files by apatch bench
# ab -n 40960 -c 4 localhost:8080/cgi-bin/rand.py

This access files 40960 times with concurrency 4.
And see latency under memory cgroup.

I run apatch bench 3 times for each test and following scores are score of
3rd trial, we can think file cache is in good state....
(But number other than "max" seems to be stable.)

Note: httpd and apache bench runs on the same host.

A) No limit.

Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 2
Processing: 30 32 1.5 32 123
Waiting: 28 31 1.5 31 122
Total: 30 32 1.5 32 123

Percentage of the requests served within a certain time (ms)
50% 32
66% 32
75% 32
80% 33
90% 33
95% 33
98% 34
99% 35
100% 123 (longest request)

If no limit, most of access can be end around 32msecs. After this, I saw
memory.max_usage_in_bytes as mostly 600MB.


B) limit to 300MB and disable async reclaim.

Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 1
Processing: 29 35 35.6 31 3507
Waiting: 28 34 33.4 30 3506
Total: 30 35 35.6 31 3507

Percentage of the requests served within a certain time (ms)
50% 31
66% 32
75% 32
80% 32
90% 34
95% 43
98% 89
99% 134
100% 3507 (longest request)

When set limit, "max" latency can take various big value but latency goes
bad.

C) limit to 300MB and enable async reclaim.

Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 2
Processing: 29 33 6.9 32 279
Waiting: 27 32 6.8 31 275
Total: 29 33 6.9 32 279

Percentage of the requests served within a certain time (ms)
50% 32
66% 32
75% 33
80% 33
90% 37
95% 42
98% 51
99% 59
100% 279 (longest request)

It seems latency goes better and stable rather than test B).


If you want to see other numbers/tests, please let me know. set up is easy.

I think automatic asynchronous reclaim works effectively for some class of
applications and stabilize its work.

Thanks,
-Kame










2011-05-20 03:47:58

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 1/8] memcg: export zone reclaimable pages

From: Ying Han <[email protected]>

The number of reclaimable pages per zone is an useful information for
controling memory reclaim schedule. This patch exports it.

Signed-off-by: Ying Han <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/memcontrol.h | 2 ++
mm/memcontrol.c | 14 ++++++++++++++
2 files changed, 16 insertions(+)

Index: mmotm-May11/mm/memcontrol.c
===================================================================
--- mmotm-May11.orig/mm/memcontrol.c
+++ mmotm-May11/mm/memcontrol.c
@@ -1162,6 +1162,20 @@ unsigned long mem_cgroup_zone_nr_pages(s
return MEM_CGROUP_ZSTAT(mz, lru);
}

+unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
+ int nid, int zid)
+{
+ unsigned long nr;
+ struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+
+ nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
+ MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE);
+ if (nr_swap_pages > 0)
+ nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
+ MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);
+ return nr;
+}
+
struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
struct zone *zone)
{
Index: mmotm-May11/include/linux/memcontrol.h
===================================================================
--- mmotm-May11.orig/include/linux/memcontrol.h
+++ mmotm-May11/include/linux/memcontrol.h
@@ -108,6 +108,8 @@ extern void mem_cgroup_end_migration(str
*/
int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
+unsigned long
+mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg, int nid, int zid);
int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
struct zone *zone,

2011-05-20 03:48:58

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 2/8] memcg: easy check routine for reclaimable


A function for checking that a memcg has reclaimable pages. This makes
use of mem->scan_nodes when CONFIG_NUMA=y.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/memcontrol.h | 1 +
mm/memcontrol.c | 19 +++++++++++++++++++
2 files changed, 20 insertions(+)

Index: mmotm-May11/mm/memcontrol.c
===================================================================
--- mmotm-May11.orig/mm/memcontrol.c
+++ mmotm-May11/mm/memcontrol.c
@@ -1587,11 +1587,30 @@ int mem_cgroup_select_victim_node(struct
return node;
}

+bool mem_cgroup_test_reclaimable(struct mem_cgroup *memcg)
+{
+ mem_cgroup_may_update_nodemask(memcg);
+ return !nodes_empty(memcg->scan_nodes);
+}
+
#else
int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
{
return 0;
}
+
+bool mem_cgroup_test_reclaimable(struct mem_cgroup *memcg)
+{
+ unsigned long nr;
+ int zid;
+
+ for (zid = NODE_DATA(0)->nr_zones - 1; zid >= 0; zid--)
+ if (mem_cgroup_zone_reclaimable_pages(memcg, 0, zid))
+ break;
+ if (zid < 0)
+ return false;
+ return true;
+}
#endif

/*
Index: mmotm-May11/include/linux/memcontrol.h
===================================================================
--- mmotm-May11.orig/include/linux/memcontrol.h
+++ mmotm-May11/include/linux/memcontrol.h
@@ -110,6 +110,7 @@ int mem_cgroup_inactive_anon_is_low(stru
int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
unsigned long
mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg, int nid, int zid);
+bool mem_cgroup_test_reclaimable(struct mem_cgroup *memcg);
int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
struct zone *zone,

2011-05-20 03:50:03

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 0/8] memcg: clean up, export swapiness

From: Ying Han <[email protected]>
change mem_cgroup's swappiness interface.

Now, memcg's swappiness interface is defined as 'static' and
the value is passed as an argument to try_to_free_xxxx...

This patch adds an function mem_cgroup_swappiness() and export it,
reduce arguments. This interface will be used in async reclaim, later.

I think an function is better than passing arguments because it's
clearer where the swappiness comes from to scan_control.

Signed-off-by: Ying Han <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/memcontrol.h | 1 +
include/linux/swap.h | 4 +---
mm/memcontrol.c | 14 ++++++--------
mm/vmscan.c | 9 ++++-----
4 files changed, 12 insertions(+), 16 deletions(-)

Index: mmotm-May11/include/linux/memcontrol.h
===================================================================
--- mmotm-May11.orig/include/linux/memcontrol.h
+++ mmotm-May11/include/linux/memcontrol.h
@@ -112,6 +112,7 @@ unsigned long
mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg, int nid, int zid);
bool mem_cgroup_test_reclaimable(struct mem_cgroup *memcg);
int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
+unsigned int mem_cgroup_swappiness(struct mem_cgroup *memcg);
unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
struct zone *zone,
enum lru_list lru);
Index: mmotm-May11/mm/memcontrol.c
===================================================================
--- mmotm-May11.orig/mm/memcontrol.c
+++ mmotm-May11/mm/memcontrol.c
@@ -1285,7 +1285,7 @@ static unsigned long mem_cgroup_margin(s
return margin >> PAGE_SHIFT;
}

-static unsigned int get_swappiness(struct mem_cgroup *memcg)
+unsigned int mem_cgroup_swappiness(struct mem_cgroup *memcg)
{
struct cgroup *cgrp = memcg->css.cgroup;

@@ -1687,14 +1687,13 @@ static int mem_cgroup_hierarchical_recla
/* we use swappiness of local cgroup */
if (check_soft) {
ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
- noswap, get_swappiness(victim), zone,
- &nr_scanned);
+ noswap, zone, &nr_scanned);
*total_scanned += nr_scanned;
mem_cgroup_soft_steal(victim, is_kswapd, ret);
mem_cgroup_soft_scan(victim, is_kswapd, nr_scanned);
} else
ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
- noswap, get_swappiness(victim));
+ noswap);
css_put(&victim->css);
/*
* At shrinking usage, we can't check we should stop here or
@@ -3717,8 +3716,7 @@ try_to_free:
ret = -EINTR;
goto out;
}
- progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
- false, get_swappiness(mem));
+ progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL, false);
if (!progress) {
nr_retries--;
/* maybe some writeback is necessary */
@@ -4150,7 +4148,7 @@ static u64 mem_cgroup_swappiness_read(st
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);

- return get_swappiness(memcg);
+ return mem_cgroup_swappiness(memcg);
}

static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
@@ -4836,7 +4834,7 @@ mem_cgroup_create(struct cgroup_subsys *
INIT_LIST_HEAD(&mem->oom_notify);

if (parent)
- mem->swappiness = get_swappiness(parent);
+ mem->swappiness = mem_cgroup_swappiness(parent);
atomic_set(&mem->refcnt, 1);
mem->move_charge_at_immigrate = 0;
mutex_init(&mem->thresholds_lock);
Index: mmotm-May11/include/linux/swap.h
===================================================================
--- mmotm-May11.orig/include/linux/swap.h
+++ mmotm-May11/include/linux/swap.h
@@ -252,11 +252,9 @@ static inline void lru_cache_add_file(st
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
- gfp_t gfp_mask, bool noswap,
- unsigned int swappiness);
+ gfp_t gfp_mask, bool noswap);
extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
- unsigned int swappiness,
struct zone *zone,
unsigned long *nr_scanned);
extern int __isolate_lru_page(struct page *page, int mode, int file);
Index: mmotm-May11/mm/vmscan.c
===================================================================
--- mmotm-May11.orig/mm/vmscan.c
+++ mmotm-May11/mm/vmscan.c
@@ -2178,7 +2178,6 @@ unsigned long try_to_free_pages(struct z

unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
- unsigned int swappiness,
struct zone *zone,
unsigned long *nr_scanned)
{
@@ -2188,7 +2187,6 @@ unsigned long mem_cgroup_shrink_node_zon
.may_writepage = !laptop_mode,
.may_unmap = 1,
.may_swap = !noswap,
- .swappiness = swappiness,
.order = 0,
.mem_cgroup = mem,
};
@@ -2196,6 +2194,8 @@ unsigned long mem_cgroup_shrink_node_zon
sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);

+ sc.swappiness = mem_cgroup_swappiness(mem);
+
trace_mm_vmscan_memcg_softlimit_reclaim_begin(0,
sc.may_writepage,
sc.gfp_mask);
@@ -2217,8 +2217,7 @@ unsigned long mem_cgroup_shrink_node_zon

unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
gfp_t gfp_mask,
- bool noswap,
- unsigned int swappiness)
+ bool noswap)
{
struct zonelist *zonelist;
unsigned long nr_reclaimed;
@@ -2228,7 +2227,6 @@ unsigned long try_to_free_mem_cgroup_pag
.may_unmap = 1,
.may_swap = !noswap,
.nr_to_reclaim = SWAP_CLUSTER_MAX,
- .swappiness = swappiness,
.order = 0,
.mem_cgroup = mem_cont,
.nodemask = NULL, /* we don't care the placement */
@@ -2245,6 +2243,7 @@ unsigned long try_to_free_mem_cgroup_pag
* scan does not need to be the current node.
*/
nid = mem_cgroup_select_victim_node(mem_cont);
+ sc.swappiness = mem_cgroup_swappiness(mem_cont);

zonelist = NODE_DATA(nid)->node_zonelists;

2011-05-20 03:51:17

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 4/8] memcg: export release victim

Later change will call mem_cgroup_select_victim() from vmscan.c
Need to export an interface and add release_victim().

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/memcontrol.h | 2 ++
mm/memcontrol.c | 13 +++++++++----
2 files changed, 11 insertions(+), 4 deletions(-)

Index: mmotm-May11/mm/memcontrol.c
===================================================================
--- mmotm-May11.orig/mm/memcontrol.c
+++ mmotm-May11/mm/memcontrol.c
@@ -1487,7 +1487,7 @@ u64 mem_cgroup_get_limit(struct mem_cgro
* of the cgroup list, since we track last_scanned_child) of @mem and use
* that to reclaim free pages from.
*/
-static struct mem_cgroup *
+struct mem_cgroup *
mem_cgroup_select_victim(struct mem_cgroup *root_mem)
{
struct mem_cgroup *ret = NULL;
@@ -1519,6 +1519,11 @@ mem_cgroup_select_victim(struct mem_cgro
return ret;
}

+void mem_cgroup_release_victim(struct mem_cgroup *mem)
+{
+ css_put(&mem->css);
+}
+
#if MAX_NUMNODES > 1

/*
@@ -1663,7 +1668,7 @@ static int mem_cgroup_hierarchical_recla
* no reclaimable pages under this hierarchy
*/
if (!check_soft || !total) {
- css_put(&victim->css);
+ mem_cgroup_release_victim(victim);
break;
}
/*
@@ -1674,14 +1679,14 @@ static int mem_cgroup_hierarchical_recla
*/
if (total >= (excess >> 2) ||
(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) {
- css_put(&victim->css);
+ mem_cgroup_release_victim(victim);
break;
}
}
}
if (!mem_cgroup_local_usage(victim)) {
/* this cgroup's local usage == 0 */
- css_put(&victim->css);
+ mem_cgroup_release_victim(victim);
continue;
}
/* we use swappiness of local cgroup */
@@ -1694,7 +1699,7 @@ static int mem_cgroup_hierarchical_recla
} else
ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
noswap);
- css_put(&victim->css);
+ mem_cgroup_release_victim(victim);
/*
* At shrinking usage, we can't check we should stop here or
* reclaim more. It's depends on callers. last_scanned_child
Index: mmotm-May11/include/linux/memcontrol.h
===================================================================
--- mmotm-May11.orig/include/linux/memcontrol.h
+++ mmotm-May11/include/linux/memcontrol.h
@@ -122,6 +122,8 @@ struct zone_reclaim_stat*
mem_cgroup_get_reclaim_stat_from_page(struct page *page);
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);
+struct mem_cgroup *mem_cgroup_select_victim(struct mem_cgroup *mem);
+void mem_cgroup_release_victim(struct mem_cgroup *mem);
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
#endif

2011-05-20 03:53:36

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 6/8] memcg asynchronous memory reclaim interface

This patch adds a logic to keep usage margin to the limit in asynchronous way.
When the usage over some threshould (determined automatically), asynchronous
memory reclaim runs and shrink memory to limit - MEMCG_ASYNC_STOP_MARGIN.

By this, there will be no difference in total amount of usage of cpu to
scan the LRU but we'll have a chance to make use of wait time of applications
for freeing memory. For example, when an application read a file or socket,
to fill the newly alloated memory, it needs wait. Async reclaim can make use
of that time and give a chance to reduce latency by background works.

This patch only includes required hooks to trigger async reclaim and user interfaces.
Core logics will be in the following patches.

Changelog v1 -> v2:
- avoid async reclaim check when num_online_cpus() < 2.
- changed MEMCG_ASYNC_START_MARGIN to be 6 * HPAGE_SIZE.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Documentation/cgroups/memory.txt | 46 ++++++++++++++++++-
mm/memcontrol.c | 94 +++++++++++++++++++++++++++++++++++++++
2 files changed, 139 insertions(+), 1 deletion(-)

Index: mmotm-May11/mm/memcontrol.c
===================================================================
--- mmotm-May11.orig/mm/memcontrol.c
+++ mmotm-May11/mm/memcontrol.c
@@ -115,10 +115,12 @@ enum mem_cgroup_events_index {
enum mem_cgroup_events_target {
MEM_CGROUP_TARGET_THRESH,
MEM_CGROUP_TARGET_SOFTLIMIT,
+ MEM_CGROUP_TARGET_ASYNC,
MEM_CGROUP_NTARGETS,
};
#define THRESHOLDS_EVENTS_TARGET (128)
#define SOFTLIMIT_EVENTS_TARGET (1024)
+#define ASYNC_EVENTS_TARGET (512) /* assume x86-64's hpagesize */

struct mem_cgroup_stat_cpu {
long count[MEM_CGROUP_STAT_NSTATS];
@@ -211,6 +213,29 @@ static void mem_cgroup_threshold(struct
static void mem_cgroup_oom_notify(struct mem_cgroup *mem);

/*
+ * For example, with transparent hugepages, memory reclaim scan at hitting
+ * limit can very long as to reclaim HPAGE_SIZE of memory. This increases
+ * latency of page fault and may cause fallback. At usual page allocation,
+ * we'll see some (shorter) latency, too. To reduce latency, it's appreciated
+ * to free memory in background to make margin to the limit. This consumes
+ * cpu but we'll have a chance to make use of wait time of applications
+ * (read disk etc..) by asynchronous reclaim.
+ *
+ * This async reclaim tries to reclaim HPAGE_SIZE * 2 of pages when margin
+ * to the limit is smaller than HPAGE_SIZE * 2. This will be enabled
+ * automatically when the limit is set and it's greater than the threshold.
+ */
+#if HPAGE_SIZE != PAGE_SIZE
+#define MEMCG_ASYNC_LIMIT_THRESH (HPAGE_SIZE * 64)
+#define MEMCG_ASYNC_MARGIN (HPAGE_SIZE * 4)
+#else /* make the margin as 4M bytes */
+#define MEMCG_ASYNC_LIMIT_THRESH (128 * 1024 * 1024)
+#define MEMCG_ASYNC_MARGIN (8 * 1024 * 1024)
+#endif
+
+static void mem_cgroup_may_async_reclaim(struct mem_cgroup *mem);
+
+/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
* statistics based on the statistics developed by Rik Van Riel for clock-pro,
@@ -278,6 +303,12 @@ struct mem_cgroup {
*/
unsigned long move_charge_at_immigrate;
/*
+ * Checks for async reclaim.
+ */
+ unsigned long async_flags;
+#define AUTO_ASYNC_ENABLED (0)
+#define USE_AUTO_ASYNC (1)
+ /*
* percpu counter.
*/
struct mem_cgroup_stat_cpu *stat;
@@ -722,6 +753,9 @@ static void __mem_cgroup_target_update(s
case MEM_CGROUP_TARGET_SOFTLIMIT:
next = val + SOFTLIMIT_EVENTS_TARGET;
break;
+ case MEM_CGROUP_TARGET_ASYNC:
+ next = val + ASYNC_EVENTS_TARGET;
+ break;
default:
return;
}
@@ -745,6 +779,11 @@ static void memcg_check_events(struct me
__mem_cgroup_target_update(mem,
MEM_CGROUP_TARGET_SOFTLIMIT);
}
+ if (__memcg_event_check(mem, MEM_CGROUP_TARGET_ASYNC)) {
+ mem_cgroup_may_async_reclaim(mem);
+ __mem_cgroup_target_update(mem,
+ MEM_CGROUP_TARGET_ASYNC);
+ }
}
}

@@ -3365,6 +3404,23 @@ void mem_cgroup_print_bad_page(struct pa

static DEFINE_MUTEX(set_limit_mutex);

+/* When limit is changed, check async reclaim switch again */
+static void mem_cgroup_set_auto_async(struct mem_cgroup *mem, u64 val)
+{
+ if (!test_bit(AUTO_ASYNC_ENABLED, &mem->async_flags))
+ goto clear;
+ if (num_online_cpus() < 2)
+ goto clear;
+ if (val < MEMCG_ASYNC_LIMIT_THRESH)
+ goto clear;
+
+ set_bit(USE_AUTO_ASYNC, &mem->async_flags);
+ return;
+clear:
+ clear_bit(USE_AUTO_ASYNC, &mem->async_flags);
+ return;
+}
+
static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
unsigned long long val)
{
@@ -3413,6 +3469,7 @@ static int mem_cgroup_resize_limit(struc
memcg->memsw_is_minimum = true;
else
memcg->memsw_is_minimum = false;
+ mem_cgroup_set_auto_async(memcg, val);
}
mutex_unlock(&set_limit_mutex);

@@ -3590,6 +3647,15 @@ unsigned long mem_cgroup_soft_limit_recl
return nr_reclaimed;
}

+static void mem_cgroup_may_async_reclaim(struct mem_cgroup *mem)
+{
+ if (!test_bit(USE_AUTO_ASYNC, &mem->async_flags))
+ return;
+ if (res_counter_margin(&mem->res) <= MEMCG_ASYNC_MARGIN) {
+ /* Fill here */
+ }
+}
+
/*
* This routine traverse page_cgroup in given list and drop them all.
* *And* this routine doesn't reclaim page itself, just removes page_cgroup.
@@ -4149,6 +4215,29 @@ static int mem_control_stat_show(struct
return 0;
}

+static u64 mem_cgroup_async_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+
+ return mem->async_flags;
+}
+
+static int
+mem_cgroup_async_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+
+ if (val & (1 << AUTO_ASYNC_ENABLED))
+ set_bit(AUTO_ASYNC_ENABLED, &mem->async_flags);
+ else
+ clear_bit(AUTO_ASYNC_ENABLED, &mem->async_flags);
+
+ val = res_counter_read_u64(&mem->res, RES_LIMIT);
+ mem_cgroup_set_auto_async(mem, val);
+ return 0;
+}
+
+
static u64 mem_cgroup_swappiness_read(struct cgroup *cgrp, struct cftype *cft)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
@@ -4580,6 +4669,11 @@ static struct cftype mem_cgroup_files[]
.unregister_event = mem_cgroup_oom_unregister_event,
.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
},
+ {
+ .name = "async_control",
+ .read_u64 = mem_cgroup_async_read,
+ .write_u64 = mem_cgroup_async_write,
+ },
};

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
Index: mmotm-May11/Documentation/cgroups/memory.txt
===================================================================
--- mmotm-May11.orig/Documentation/cgroups/memory.txt
+++ mmotm-May11/Documentation/cgroups/memory.txt
@@ -70,6 +70,7 @@ Brief summary of control files.
(See sysctl's vm.swappiness)
memory.move_charge_at_immigrate # set/show controls of moving charges
memory.oom_control # set/show oom controls.
+ memory.async_control # set control for asynchronous memory reclaim

1. History

@@ -664,7 +665,50 @@ At reading, current status of OOM is sho
under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
be stopped.)

-11. TODO
+11. Asynchronous memory reclaim
+
+In some kind of applications which uses many file caches, once memory cgroup
+hit limit, following allocation of pages will hit limit again and the
+application may see huge latency because of memory reclaim.
+
+Memory cgroup provides a method for asynchronous memory reclaim for freeing
+memory before hitting limit. By this, some class of application can avoid
+memory reclaim latency effectively and show good performance. For example,
+if an application reads data from files bigger than limit, freeing memory
+in asynchrnous will reduce latency of read. But please note, even if
+latency decreased, the amount of total usage of CPU is unchanged. So,
+asynchronous memory reclaim works effectively only when you have extra unused
+CPU, applications tend to sleep. So, this feature only works on SMP.
+
+So, if you see this feature doesn't help your application, please let it
+turned off.
+
+
+11.1 memory.async_control
+
+memory.async_control is a control for asynchronous memory reclaim and
+represented as bitmask of controls.
+
+ bit 0 ....user control of automatic asynchronous memory reclaim(see below)
+ bit 1 ....indicate automatic asynchronous memory reclaim is really used.
+
+ * Automatic asynchronous memory reclaim is a feature to free pages to
+ some extent below the limit in background. When this runs, applications
+ can reduce latency at hit limit. (but please note, background reclaim
+ use cpu.)
+
+ This feature can be enabled by
+
+ echo 1 > memory.async_control
+
+ If successfully enabled, bit 1 of memory.async_control is set. Bit 1 may
+ not be set when the number of cpu is 1 or when the limit is too small.
+
+ Note: This feature is not propageted to childrens in automatic. This
+ may be conservative but required limitation to avoid using too much
+ cpus.
+
+12. TODO

1. Add support for accounting huge pages (as a separate controller)
2. Make per-cgroup scanner reclaim not-shared pages first

2011-05-20 03:54:41

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 7/8] memcg static scan reclaim for asyncrhonous reclaim

Ostatic scan rate async memory reclaim for memcg.

This patch implements a routine for asynchronous memory reclaim for memory
cgroup, which will be triggered when the usage is near to the limit.
This patch includes only code codes for memory freeing.

Asynchronous memory reclaim can be a help for reduce latency because
memory reclaim goes while an application need to wait or compute something.

To do memory reclaim in async, we need some thread or worker.
Unlike node or zones, memcg can be created on demand and there may be
a system with thousands of memcgs. So, the number of jobs for memcg
asynchronous memory reclaim can be big number in theory. So, node kswapd
codes doesn't fit well. And some scheduling on memcg layer will be appreciated.

This patch implements a static scan rate memory reclaim.
When shrink_mem_cgroup_static_scan() is called, it scans pages at most
MEMCG_STATIC_SCAN_LIMIT(2048) pages and returnes how memory shrinking
was hard. When the function returns false, the caller can assume memory
reclaim on the memcg seemed difficult and can add some scheduling delay
for the job.

Note:
- I think this concept can be used for enhancing softlimit, too.
But need more study.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/memcontrol.h | 2
include/linux/swap.h | 2
mm/memcontrol.c | 5 +
mm/vmscan.c | 171 ++++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 179 insertions(+), 1 deletion(-)

Index: mmotm-May11/mm/vmscan.c
===================================================================
--- mmotm-May11.orig/mm/vmscan.c
+++ mmotm-May11/mm/vmscan.c
@@ -106,6 +106,7 @@ struct scan_control {

/* Which cgroup do we reclaim from */
struct mem_cgroup *mem_cgroup;
+ unsigned long scan_limit; /* async reclaim uses static scan rate */

/*
* Nodemask of nodes allowed by the caller. If NULL, all nodes
@@ -1717,7 +1718,7 @@ static unsigned long shrink_list(enum lr
static void get_scan_count(struct zone *zone, struct scan_control *sc,
unsigned long *nr, int priority)
{
- unsigned long anon, file, free;
+ unsigned long anon, file, free, total_scan;
unsigned long anon_prio, file_prio;
unsigned long ap, fp;
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
@@ -1807,6 +1808,8 @@ static void get_scan_count(struct zone *
fraction[1] = fp;
denominator = ap + fp + 1;
out:
+ total_scan = 0;
+
for_each_evictable_lru(l) {
int file = is_file_lru(l);
unsigned long scan;
@@ -1833,6 +1836,20 @@ out:
scan = SWAP_CLUSTER_MAX;
}
nr[l] = scan;
+ total_scan += nr[l];
+ }
+ /*
+ * Asynchronous reclaim for memcg uses static scan rate for avoiding
+ * too much cpu consumption in a memcg. Adjust the scan count to fit
+ * into scan_limit.
+ */
+ if (total_scan > sc->scan_limit) {
+ for_each_evictable_lru(l) {
+ if (!nr[l] < SWAP_CLUSTER_MAX)
+ continue;
+ nr[l] = div64_u64(nr[l] * sc->scan_limit, total_scan);
+ nr[l] = max((unsigned long)SWAP_CLUSTER_MAX, nr[l]);
+ }
}
}

@@ -1938,6 +1955,11 @@ restart:
*/
if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY)
break;
+ /*
+ * static scan rate memory reclaim ?
+ */
+ if (sc->nr_scanned > sc->scan_limit)
+ break;
}
sc->nr_reclaimed += nr_reclaimed;

@@ -2158,6 +2180,7 @@ unsigned long try_to_free_pages(struct z
.order = order,
.mem_cgroup = NULL,
.nodemask = nodemask,
+ .scan_limit = ULONG_MAX,
};
struct shrink_control shrink = {
.gfp_mask = sc.gfp_mask,
@@ -2189,6 +2212,7 @@ unsigned long mem_cgroup_shrink_node_zon
.may_swap = !noswap,
.order = 0,
.mem_cgroup = mem,
+ .scan_limit = ULONG_MAX,
};

sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
@@ -2232,6 +2256,7 @@ unsigned long try_to_free_mem_cgroup_pag
.nodemask = NULL, /* we don't care the placement */
.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
+ .scan_limit = ULONG_MAX,
};
struct shrink_control shrink = {
.gfp_mask = sc.gfp_mask,
@@ -2257,6 +2282,147 @@ unsigned long try_to_free_mem_cgroup_pag

return nr_reclaimed;
}
+
+/*
+ * Routines for static scan rate memory reclaim for memory cgroup.
+ *
+ * Because asyncronous memory reclaim is served by the kernel as background
+ * service for reduce latency, we don't want to scan too much as priority=0
+ * scan of kswapd. We just scan MEMCG_ASYNCSCAN_LIMIT per iteration at most
+ * and frees MEMCG_ASYNCSCAN_LIMIT/2 of pages. Then, check our success rate
+ * and returns the information to the caller.
+ */
+
+static void shrink_mem_cgroup_node(int nid,
+ int priority, struct scan_control *sc)
+{
+ unsigned long this_scanned = 0;
+ unsigned long this_reclaimed = 0;
+ int i;
+
+ for (i = 0; i < NODE_DATA(nid)->nr_zones; i++) {
+ struct zone *zone = NODE_DATA(nid)->node_zones + i;
+
+ if (!populated_zone(zone))
+ continue;
+ if (!mem_cgroup_zone_reclaimable_pages(sc->mem_cgroup, nid, i))
+ continue;
+ /* If recent scan didn't go good, do writepate */
+ sc->nr_scanned = 0;
+ sc->nr_reclaimed = 0;
+ shrink_zone(priority, zone, sc);
+ this_scanned += sc->nr_scanned;
+ this_reclaimed += sc->nr_reclaimed;
+ if (this_reclaimed >= sc->nr_to_reclaim)
+ break;
+ if (sc->scan_limit < this_scanned)
+ break;
+ if (need_resched())
+ break;
+ }
+ sc->nr_scanned = this_scanned;
+ sc->nr_reclaimed = this_reclaimed;
+ return;
+}
+
+#define MEMCG_ASYNCSCAN_LIMIT (2048)
+
+bool mem_cgroup_shrink_static_scan(struct mem_cgroup *mem, long required)
+{
+ int nid, priority, noscan;
+ unsigned long total_scanned, total_reclaimed, reclaim_target;
+ struct scan_control sc = {
+ .gfp_mask = GFP_HIGHUSER_MOVABLE,
+ .may_unmap = 1,
+ .may_swap = 1,
+ .order = 0,
+ /* we don't writepage in our scan. but kick flusher threads */
+ .may_writepage = 0,
+ };
+ struct mem_cgroup *victim, *check_again;
+ bool congested = true;
+
+ total_scanned = 0;
+ total_reclaimed = 0;
+ reclaim_target = min(required, MEMCG_ASYNCSCAN_LIMIT/2L);
+ sc.swappiness = mem_cgroup_swappiness(mem);
+
+ noscan = 0;
+ check_again = NULL;
+
+ do {
+ victim = mem_cgroup_select_victim(mem);
+
+ if (!mem_cgroup_test_reclaimable(victim)) {
+ mem_cgroup_release_victim(victim);
+ /*
+ * if selected a hopeless victim again, give up.
+ */
+ if (check_again == victim)
+ goto out;
+ if (!check_again)
+ check_again = victim;
+ } else
+ check_again = NULL;
+ } while (check_again);
+
+ current->flags |= PF_SWAPWRITE;
+ /*
+ * We can use arbitrary priority for our run because we just scan
+ * up to MEMCG_ASYNCSCAN_LIMIT and reclaim only the half of it.
+ * But, we need to have early-give-up chance for avoid cpu hogging.
+ * So, start from a small priority and increase it.
+ */
+ priority = DEF_PRIORITY;
+
+ while ((total_scanned < MEMCG_ASYNCSCAN_LIMIT) &&
+ (total_reclaimed < reclaim_target)) {
+
+ /* select a node to scan */
+ nid = mem_cgroup_select_victim_node(victim);
+
+ sc.mem_cgroup = victim;
+ sc.nr_scanned = 0;
+ sc.scan_limit = MEMCG_ASYNCSCAN_LIMIT - total_scanned;
+ sc.nr_reclaimed = 0;
+ sc.nr_to_reclaim = reclaim_target - total_reclaimed;
+ shrink_mem_cgroup_node(nid, priority, &sc);
+ if (sc.nr_scanned) {
+ total_scanned += sc.nr_scanned;
+ total_reclaimed += sc.nr_reclaimed;
+ noscan = 0;
+ } else
+ noscan++;
+ mem_cgroup_release_victim(victim);
+ /* ok, check condition */
+ if (total_scanned > total_reclaimed * 2)
+ wakeup_flusher_threads(sc.nr_scanned);
+
+ if (mem_cgroup_async_should_stop(mem))
+ break;
+ /* If memory reclaim seems heavy, return that we're congested */
+ if (total_scanned > MEMCG_ASYNCSCAN_LIMIT/4 &&
+ total_scanned > total_reclaimed*8)
+ break;
+ /*
+ * The whole system is busy or some status update
+ * is not synched. It's better to wait for a while.
+ */
+ if ((noscan > 1) || (need_resched()))
+ break;
+ /* ok, we can do deeper scanning. */
+ priority--;
+ }
+ current->flags &= ~PF_SWAPWRITE;
+ /*
+ * If we successfully freed the half of target, report that
+ * memory reclaim went smoothly.
+ */
+ if (total_reclaimed > reclaim_target/2)
+ congested = false;
+out:
+ return congested;
+}
#endif

/*
@@ -2380,6 +2546,7 @@ static unsigned long balance_pgdat(pg_da
.swappiness = vm_swappiness,
.order = order,
.mem_cgroup = NULL,
+ .scan_limit = ULONG_MAX,
};
struct shrink_control shrink = {
.gfp_mask = sc.gfp_mask,
@@ -2839,6 +3006,7 @@ unsigned long shrink_all_memory(unsigned
.hibernation_mode = 1,
.swappiness = vm_swappiness,
.order = 0,
+ .scan_limit = ULONG_MAX,
};
struct shrink_control shrink = {
.gfp_mask = sc.gfp_mask,
@@ -3026,6 +3194,7 @@ static int __zone_reclaim(struct zone *z
.gfp_mask = gfp_mask,
.swappiness = vm_swappiness,
.order = order,
+ .scan_limit = ULONG_MAX,
};
struct shrink_control shrink = {
.gfp_mask = sc.gfp_mask,
Index: mmotm-May11/mm/memcontrol.c
===================================================================
--- mmotm-May11.orig/mm/memcontrol.c
+++ mmotm-May11/mm/memcontrol.c
@@ -3647,6 +3647,11 @@ unsigned long mem_cgroup_soft_limit_recl
return nr_reclaimed;
}

+bool mem_cgroup_async_should_stop(struct mem_cgroup *mem)
+{
+ return res_counter_margin(&mem->res) >= MEMCG_ASYNC_MARGIN;
+}
+
static void mem_cgroup_may_async_reclaim(struct mem_cgroup *mem)
{
if (!test_bit(USE_AUTO_ASYNC, &mem->async_flags))
Index: mmotm-May11/include/linux/memcontrol.h
===================================================================
--- mmotm-May11.orig/include/linux/memcontrol.h
+++ mmotm-May11/include/linux/memcontrol.h
@@ -124,6 +124,8 @@ extern void mem_cgroup_print_oom_info(st
struct task_struct *p);
struct mem_cgroup *mem_cgroup_select_victim(struct mem_cgroup *mem);
void mem_cgroup_release_victim(struct mem_cgroup *mem);
+bool mem_cgroup_async_should_stop(struct mem_cgroup *mem);
+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
#endif
Index: mmotm-May11/include/linux/swap.h
===================================================================
--- mmotm-May11.orig/include/linux/swap.h
+++ mmotm-May11/include/linux/swap.h
@@ -257,6 +257,8 @@ extern unsigned long mem_cgroup_shrink_n
gfp_t gfp_mask, bool noswap,
struct zone *zone,
unsigned long *nr_scanned);
+extern bool
+mem_cgroup_shrink_static_scan(struct mem_cgroup *mem, long required);
extern int __isolate_lru_page(struct page *page, int mode, int file);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;

2011-05-20 03:55:27

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 8/8] memcg asyncrhouns reclaim workqueue

workqueue for memory cgroup asynchronous memory shrinker.

This patch implements the workqueue of async shrinker routine. each
memcg has a work and only one work can be scheduled at the same time.

If shrinking memory doesn't goes well, delay will be added to the work.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 81 insertions(+), 3 deletions(-)

Index: mmotm-May11/mm/memcontrol.c
===================================================================
--- mmotm-May11.orig/mm/memcontrol.c
+++ mmotm-May11/mm/memcontrol.c
@@ -302,12 +302,17 @@ struct mem_cgroup {
* mem_cgroup ? And what type of charges should we move ?
*/
unsigned long move_charge_at_immigrate;
+
+ /* For asynchronous memory reclaim */
+ struct delayed_work async_work;
/*
* Checks for async reclaim.
*/
unsigned long async_flags;
#define AUTO_ASYNC_ENABLED (0)
#define USE_AUTO_ASYNC (1)
+#define ASYNC_NORESCHED (2) /* need to stop scanning */
+#define ASYNC_RUNNING (3) /* a work is in schedule or running. */
/*
* percpu counter.
*/
@@ -3647,6 +3652,78 @@ unsigned long mem_cgroup_soft_limit_recl
return nr_reclaimed;
}

+struct workqueue_struct *memcg_async_shrinker;
+
+static int memcg_async_shrinker_init(void)
+{
+ memcg_async_shrinker = alloc_workqueue("memcg_async",
+ WQ_MEM_RECLAIM | WQ_UNBOUND | WQ_FREEZABLE, 0);
+ return 0;
+}
+module_init(memcg_async_shrinker_init);
+
+static void mem_cgroup_async_shrink(struct work_struct *work)
+{
+ struct delayed_work *dw = to_delayed_work(work);
+ struct mem_cgroup *mem = container_of(dw,
+ struct mem_cgroup, async_work);
+ bool congested = false;
+ int delay = 0;
+ unsigned long long required, usage, limit, shrink_to;
+
+ limit = res_counter_read_u64(&mem->res, RES_LIMIT);
+ shrink_to = limit - MEMCG_ASYNC_MARGIN - PAGE_SIZE;
+ usage = res_counter_read_u64(&mem->res, RES_USAGE);
+ if (shrink_to <= usage) {
+ required = usage - shrink_to;
+ required = (required >> PAGE_SHIFT) + 1;
+ /*
+ * This scans some number of pages and returns that memory
+ * reclaim was slow or now. If slow, we add a delay as
+ * congestion_wait() in vmscan.c
+ */
+ congested = mem_cgroup_shrink_static_scan(mem, (long)required);
+ }
+ if (test_bit(ASYNC_NORESCHED, &mem->async_flags)
+ || mem_cgroup_async_should_stop(mem))
+ goto finish_scan;
+ /* If memory reclaim couldn't go well, add delay */
+ if (congested)
+ delay = HZ/10;
+
+ queue_delayed_work(memcg_async_shrinker, &mem->async_work, delay);
+ return;
+finish_scan:
+ cgroup_release_and_wakeup_rmdir(&mem->css);
+ clear_bit(ASYNC_RUNNING, &mem->async_flags);
+ return;
+}
+
+static void run_mem_cgroup_async_shrinker(struct mem_cgroup *mem)
+{
+ if (test_bit(ASYNC_NORESCHED, &mem->async_flags))
+ return;
+ if (test_and_set_bit(ASYNC_RUNNING, &mem->async_flags))
+ return;
+ cgroup_exclude_rmdir(&mem->css);
+ /*
+ * start reclaim with small delay. This delay will allow us to do job
+ * in batch.
+ */
+ if (!queue_delayed_work(memcg_async_shrinker, &mem->async_work, 1)) {
+ cgroup_release_and_wakeup_rmdir(&mem->css);
+ clear_bit(ASYNC_RUNNING, &mem->async_flags);
+ }
+ return;
+}
+
+static void stop_mem_cgroup_async_shrinker(struct mem_cgroup *mem)
+{
+ set_bit(ASYNC_NORESCHED, &mem->async_flags);
+ flush_delayed_work(&mem->async_work);
+ clear_bit(ASYNC_NORESCHED, &mem->async_flags);
+}
+
bool mem_cgroup_async_should_stop(struct mem_cgroup *mem)
{
return res_counter_margin(&mem->res) >= MEMCG_ASYNC_MARGIN;
@@ -3656,9 +3733,8 @@ static void mem_cgroup_may_async_reclaim
{
if (!test_bit(USE_AUTO_ASYNC, &mem->async_flags))
return;
- if (res_counter_margin(&mem->res) <= MEMCG_ASYNC_MARGIN) {
- /* Fill here */
- }
+ if (res_counter_margin(&mem->res) <= MEMCG_ASYNC_MARGIN)
+ run_mem_cgroup_async_shrinker(mem);
}

/*
@@ -3743,6 +3819,7 @@ move_account:
if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
goto out;
ret = -EINTR;
+ stop_mem_cgroup_async_shrinker(mem);
if (signal_pending(current))
goto out;
/* This is for making all *used* pages to be on LRU. */
@@ -4941,6 +5018,7 @@ mem_cgroup_create(struct cgroup_subsys *
mem->swappiness = mem_cgroup_swappiness(parent);
atomic_set(&mem->refcnt, 1);
mem->move_charge_at_immigrate = 0;
+ INIT_DELAYED_WORK(&mem->async_work, mem_cgroup_async_shrink);
mutex_init(&mem->thresholds_lock);
return &mem->css;
free_out:

2011-05-20 21:50:05

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/8] memcg: easy check routine for reclaimable

On Fri, 20 May 2011 12:42:12 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> +bool mem_cgroup_test_reclaimable(struct mem_cgroup *memcg)
> +{
> + unsigned long nr;
> + int zid;
> +
> + for (zid = NODE_DATA(0)->nr_zones - 1; zid >= 0; zid--)
> + if (mem_cgroup_zone_reclaimable_pages(memcg, 0, zid))
> + break;
> + if (zid < 0)
> + return false;
> + return true;
> +}

A wee bit of documentation would be nice. Perhaps improving the name
would suffice: mem_cgroup_has_reclaimable().

2011-05-20 21:49:56

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 6/8] memcg asynchronous memory reclaim interface

On Fri, 20 May 2011 12:46:36 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> This patch adds a logic to keep usage margin to the limit in asynchronous way.
> When the usage over some threshould (determined automatically), asynchronous
> memory reclaim runs and shrink memory to limit - MEMCG_ASYNC_STOP_MARGIN.
>
> By this, there will be no difference in total amount of usage of cpu to
> scan the LRU

This is not true if "don't writepage at all (revisit this when
dirty_ratio comes.)" is true. Skipping over dirty pages can cause
larger amounts of CPU consumption.

> but we'll have a chance to make use of wait time of applications
> for freeing memory. For example, when an application read a file or socket,
> to fill the newly alloated memory, it needs wait. Async reclaim can make use
> of that time and give a chance to reduce latency by background works.
>
> This patch only includes required hooks to trigger async reclaim and user interfaces.
> Core logics will be in the following patches.
>
>
> ...
>
> /*
> + * For example, with transparent hugepages, memory reclaim scan at hitting
> + * limit can very long as to reclaim HPAGE_SIZE of memory. This increases
> + * latency of page fault and may cause fallback. At usual page allocation,
> + * we'll see some (shorter) latency, too. To reduce latency, it's appreciated
> + * to free memory in background to make margin to the limit. This consumes
> + * cpu but we'll have a chance to make use of wait time of applications
> + * (read disk etc..) by asynchronous reclaim.
> + *
> + * This async reclaim tries to reclaim HPAGE_SIZE * 2 of pages when margin
> + * to the limit is smaller than HPAGE_SIZE * 2. This will be enabled
> + * automatically when the limit is set and it's greater than the threshold.
> + */
> +#if HPAGE_SIZE != PAGE_SIZE
> +#define MEMCG_ASYNC_LIMIT_THRESH (HPAGE_SIZE * 64)
> +#define MEMCG_ASYNC_MARGIN (HPAGE_SIZE * 4)
> +#else /* make the margin as 4M bytes */
> +#define MEMCG_ASYNC_LIMIT_THRESH (128 * 1024 * 1024)
> +#define MEMCG_ASYNC_MARGIN (8 * 1024 * 1024)
> +#endif

Document them, please. How are they used, what are their units.

> +static void mem_cgroup_may_async_reclaim(struct mem_cgroup *mem);
> +
> +/*
> * The memory controller data structure. The memory controller controls both
> * page cache and RSS per cgroup. We would eventually like to provide
> * statistics based on the statistics developed by Rik Van Riel for clock-pro,
> @@ -278,6 +303,12 @@ struct mem_cgroup {
> */
> unsigned long move_charge_at_immigrate;
> /*
> + * Checks for async reclaim.
> + */
> + unsigned long async_flags;
> +#define AUTO_ASYNC_ENABLED (0)
> +#define USE_AUTO_ASYNC (1)

These are really confusing. I looked at the implementation and at the
documentation file and I'm still scratching my head. I can't work out
why they exist. With the amount of effort I put into it ;)

Also, AUTO_ASYNC_ENABLED and USE_AUTO_ASYNC have practically the same
meaning, which doesn't help things.

Some careful description at this place in the code might help clear
things up.

Perhaps s/USE_AUTO_ASYNC/AUTO_ASYNC_IN_USE/ is what you meant.

>
> ...
>
> +static void mem_cgroup_may_async_reclaim(struct mem_cgroup *mem)
> +{
> + if (!test_bit(USE_AUTO_ASYNC, &mem->async_flags))
> + return;
> + if (res_counter_margin(&mem->res) <= MEMCG_ASYNC_MARGIN) {
> + /* Fill here */
> + }
> +}

I'd expect a function called foo_may_bar() to return a bool.

But given the lack of documentation and no-op implementation, I have o
idea what's happening here!

>
> ...
>

2011-05-20 21:50:47

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 7/8] memcg static scan reclaim for asyncrhonous reclaim

On Fri, 20 May 2011 12:47:53 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> Ostatic scan rate async memory reclaim for memcg.
>
> This patch implements a routine for asynchronous memory reclaim for memory
> cgroup, which will be triggered when the usage is near to the limit.
> This patch includes only code codes for memory freeing.
>
> Asynchronous memory reclaim can be a help for reduce latency because
> memory reclaim goes while an application need to wait or compute something.
>
> To do memory reclaim in async, we need some thread or worker.
> Unlike node or zones, memcg can be created on demand and there may be
> a system with thousands of memcgs. So, the number of jobs for memcg
> asynchronous memory reclaim can be big number in theory. So, node kswapd
> codes doesn't fit well. And some scheduling on memcg layer will be appreciated.
>
> This patch implements a static scan rate memory reclaim.
> When shrink_mem_cgroup_static_scan() is called, it scans pages at most
> MEMCG_STATIC_SCAN_LIMIT(2048) pages and returnes how memory shrinking
> was hard. When the function returns false, the caller can assume memory
> reclaim on the memcg seemed difficult and can add some scheduling delay
> for the job.

Fully and carefully define the new term "static scan rate"?

> Note:
> - I think this concept can be used for enhancing softlimit, too.
> But need more study.
>
>
> ...
>
> + total_scan += nr[l];
> + }
> + /*
> + * Asynchronous reclaim for memcg uses static scan rate for avoiding
> + * too much cpu consumption in a memcg. Adjust the scan count to fit
> + * into scan_limit.
> + */
> + if (total_scan > sc->scan_limit) {
> + for_each_evictable_lru(l) {
> + if (!nr[l] < SWAP_CLUSTER_MAX)

That statement doesn't do what you think it does!

> + continue;
> + nr[l] = div64_u64(nr[l] * sc->scan_limit, total_scan);
> + nr[l] = max((unsigned long)SWAP_CLUSTER_MAX, nr[l]);
> + }
> }

This gets included in CONFIG_CGROUP_MEM_RES_CTLR=n kernels. Needlessly?

It also has the potential to affect non-memcg behaviour at runtime.

> }
>
> @@ -1938,6 +1955,11 @@ restart:
> */
> if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY)
> break;
> + /*
> + * static scan rate memory reclaim ?

I still don't know what "static scan rate" means :(

> + */
> + if (sc->nr_scanned > sc->scan_limit)
> + break;
> }
> sc->nr_reclaimed += nr_reclaimed;
>
>
> ...
>
> +static void shrink_mem_cgroup_node(int nid,
> + int priority, struct scan_control *sc)
> +{
> + unsigned long this_scanned = 0;
> + unsigned long this_reclaimed = 0;
> + int i;
> +
> + for (i = 0; i < NODE_DATA(nid)->nr_zones; i++) {
> + struct zone *zone = NODE_DATA(nid)->node_zones + i;
> +
> + if (!populated_zone(zone))
> + continue;
> + if (!mem_cgroup_zone_reclaimable_pages(sc->mem_cgroup, nid, i))
> + continue;
> + /* If recent scan didn't go good, do writepate */
> + sc->nr_scanned = 0;
> + sc->nr_reclaimed = 0;
> + shrink_zone(priority, zone, sc);
> + this_scanned += sc->nr_scanned;
> + this_reclaimed += sc->nr_reclaimed;
> + if (this_reclaimed >= sc->nr_to_reclaim)
> + break;
> + if (sc->scan_limit < this_scanned)
> + break;
> + if (need_resched())
> + break;

Whoa! Explain?

> + }
> + sc->nr_scanned = this_scanned;
> + sc->nr_reclaimed = this_reclaimed;
> + return;
> +}
> +
> +#define MEMCG_ASYNCSCAN_LIMIT (2048)

Needs documentation. What happens if I set it to 1024?

> +bool mem_cgroup_shrink_static_scan(struct mem_cgroup *mem, long required)

Exported function has no interface documentation.

`required' appears to have units of "number of pages". Should be unsigned.

> +{
> + int nid, priority, noscan;

`noscan' is poorly named and distressingly mysterious. Basically I
don't have a clue what you're doing with this.

It should be unsigned.

> + unsigned long total_scanned, total_reclaimed, reclaim_target;
> + struct scan_control sc = {
> + .gfp_mask = GFP_HIGHUSER_MOVABLE,
> + .may_unmap = 1,
> + .may_swap = 1,
> + .order = 0,
> + /* we don't writepage in our scan. but kick flusher threads */
> + .may_writepage = 0,
> + };
> + struct mem_cgroup *victim, *check_again;
> + bool congested = true;
> +
> + total_scanned = 0;
> + total_reclaimed = 0;
> + reclaim_target = min(required, MEMCG_ASYNCSCAN_LIMIT/2L);
> + sc.swappiness = mem_cgroup_swappiness(mem);
> +
> + noscan = 0;
> + check_again = NULL;
> +
> + do {
> + victim = mem_cgroup_select_victim(mem);
> +
> + if (!mem_cgroup_test_reclaimable(victim)) {
> + mem_cgroup_release_victim(victim);
> + /*
> + * if selected a hopeless victim again, give up.
> + */
> + if (check_again == victim)
> + goto out;
> + if (!check_again)
> + check_again = victim;
> + } else
> + check_again = NULL;
> + } while (check_again);

What's all this trying to do?

> + current->flags |= PF_SWAPWRITE;
> + /*
> + * We can use arbitrary priority for our run because we just scan
> + * up to MEMCG_ASYNCSCAN_LIMIT and reclaim only the half of it.
> + * But, we need to have early-give-up chance for avoid cpu hogging.
> + * So, start from a small priority and increase it.
> + */
> + priority = DEF_PRIORITY;
> +
> + while ((total_scanned < MEMCG_ASYNCSCAN_LIMIT) &&
> + (total_reclaimed < reclaim_target)) {
> +
> + /* select a node to scan */
> + nid = mem_cgroup_select_victim_node(victim);
> +
> + sc.mem_cgroup = victim;
> + sc.nr_scanned = 0;
> + sc.scan_limit = MEMCG_ASYNCSCAN_LIMIT - total_scanned;
> + sc.nr_reclaimed = 0;
> + sc.nr_to_reclaim = reclaim_target - total_reclaimed;
> + shrink_mem_cgroup_node(nid, priority, &sc);
> + if (sc.nr_scanned) {
> + total_scanned += sc.nr_scanned;
> + total_reclaimed += sc.nr_reclaimed;
> + noscan = 0;
> + } else
> + noscan++;
> + mem_cgroup_release_victim(victim);
> + /* ok, check condition */
> + if (total_scanned > total_reclaimed * 2)
> + wakeup_flusher_threads(sc.nr_scanned);
> +
> + if (mem_cgroup_async_should_stop(mem))
> + break;
> + /* If memory reclaim seems heavy, return that we're congested */
> + if (total_scanned > MEMCG_ASYNCSCAN_LIMIT/4 &&
> + total_scanned > total_reclaimed*8)
> + break;
> + /*
> + * The whole system is busy or some status update
> + * is not synched. It's better to wait for a while.
> + */
> + if ((noscan > 1) || (need_resched()))
> + break;

So we bale out if there were two priority levels at which
shrink_mem_cgroup_node() didn't scan any pages? What on earth???

And what was the point in calling shrink_mem_cgroup_node() if it didn't
scan anything? I could understand using nr_reclaimed...

> + /* ok, we can do deeper scanning. */
> + priority--;
> + }
> + current->flags &= ~PF_SWAPWRITE;
> + /*
> + * If we successfully freed the half of target, report that
> + * memory reclaim went smoothly.
> + */
> + if (total_reclaimed > reclaim_target/2)
> + congested = false;
> +out:
> + return congested;
> +}
> #endif



I dunno, the whole thing seems sprinkled full of arbitrary assumptions
and guess-and-giggle magic numbers. I expect a lot of this stuff is
just unnecessary. And if it _is_ necessary then I'd expect there to
be lots of situations and corner cases in which it malfunctions,
because the magic numbers weren't tuned to that case.

2011-05-20 21:51:37

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 8/8] memcg asyncrhouns reclaim workqueue

On Fri, 20 May 2011 12:48:37 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> workqueue for memory cgroup asynchronous memory shrinker.
>
> This patch implements the workqueue of async shrinker routine. each
> memcg has a work and only one work can be scheduled at the same time.
>
> If shrinking memory doesn't goes well, delay will be added to the work.
>

When this code explodes (as it surely will), users will see large
amounts of CPU consumption in the work queue thread. We want to make
this as easy to debug as possible, so we should try to make the
workqueue's names mappable back onto their memcg's. And anything else
we can think of to help?

>
> ...
>
> +static void mem_cgroup_async_shrink(struct work_struct *work)
> +{
> + struct delayed_work *dw = to_delayed_work(work);
> + struct mem_cgroup *mem = container_of(dw,
> + struct mem_cgroup, async_work);
> + bool congested = false;
> + int delay = 0;
> + unsigned long long required, usage, limit, shrink_to;

There's a convention which is favored by some (and ignored by the
clueless ;)) which says "one definition per line".

The reason I like one-definition-per-line is that it leaves a little
room on the right where the programmer can explain the role of the
local.

Another advantage is that one can initialise it. eg:

unsigned long limit = res_counter_read_u64(&mem->res, RES_LIMIT);

That conveys useful information: the reader can see what it's
initialised with and can infer its use.

A third advantage is that it can now be made const, which conveys very
useful informtation and can prevent bugs.

A fourth advantage is that it makes later patches to this function more
readable and easier to apply when there are conflicts.


> + limit = res_counter_read_u64(&mem->res, RES_LIMIT);
> + shrink_to = limit - MEMCG_ASYNC_MARGIN - PAGE_SIZE;
> + usage = res_counter_read_u64(&mem->res, RES_USAGE);
> + if (shrink_to <= usage) {
> + required = usage - shrink_to;
> + required = (required >> PAGE_SHIFT) + 1;
> + /*
> + * This scans some number of pages and returns that memory
> + * reclaim was slow or now. If slow, we add a delay as
> + * congestion_wait() in vmscan.c
> + */
> + congested = mem_cgroup_shrink_static_scan(mem, (long)required);
> + }
> + if (test_bit(ASYNC_NORESCHED, &mem->async_flags)
> + || mem_cgroup_async_should_stop(mem))
> + goto finish_scan;
> + /* If memory reclaim couldn't go well, add delay */
> + if (congested)
> + delay = HZ/10;

Another magic number.

If Moore's law holds, we need to reduce this number by 1.4 each year.
Is this good?

> + queue_delayed_work(memcg_async_shrinker, &mem->async_work, delay);
> + return;
> +finish_scan:
> + cgroup_release_and_wakeup_rmdir(&mem->css);
> + clear_bit(ASYNC_RUNNING, &mem->async_flags);
> + return;
> +}
> +
> +static void run_mem_cgroup_async_shrinker(struct mem_cgroup *mem)
> +{
> + if (test_bit(ASYNC_NORESCHED, &mem->async_flags))
> + return;

I can't work out what ASYNC_NORESCHED does. Is its name well-chosen?

> + if (test_and_set_bit(ASYNC_RUNNING, &mem->async_flags))
> + return;
> + cgroup_exclude_rmdir(&mem->css);
> + /*
> + * start reclaim with small delay. This delay will allow us to do job
> + * in batch.

Explain more?

> + */
> + if (!queue_delayed_work(memcg_async_shrinker, &mem->async_work, 1)) {
> + cgroup_release_and_wakeup_rmdir(&mem->css);
> + clear_bit(ASYNC_RUNNING, &mem->async_flags);
> + }
> + return;
> +}
> +
>
> ...
>

2011-05-20 23:56:53

by Hiroyuki Kamezawa

[permalink] [raw]
Subject: Re: [PATCH 6/8] memcg asynchronous memory reclaim interface

2011/5/21 Andrew Morton <[email protected]>:
> On Fri, 20 May 2011 12:46:36 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
>> This patch adds a logic to keep usage margin to the limit in asynchronous way.
>> When the usage over some threshould (determined automatically), asynchronous
>> memory reclaim runs and shrink memory to limit - MEMCG_ASYNC_STOP_MARGIN.
>>
>> By this, there will be no difference in total amount of usage of cpu to
>> scan the LRU
>
> This is not true if "don't writepage at all (revisit this when
> dirty_ratio comes.)" is true. ?Skipping over dirty pages can cause
> larger amounts of CPU consumption.
>
>> but we'll have a chance to make use of wait time of applications
>> for freeing memory. For example, when an application read a file or socket,
>> to fill the newly alloated memory, it needs wait. Async reclaim can make use
>> of that time and give a chance to reduce latency by background works.
>>
>> This patch only includes required hooks to trigger async reclaim and user interfaces.
>> Core logics will be in the following patches.
>>
>>
>> ...
>>
>> ?/*
>> + * For example, with transparent hugepages, memory reclaim scan at hitting
>> + * limit can very long as to reclaim HPAGE_SIZE of memory. This increases
>> + * latency of page fault and may cause fallback. At usual page allocation,
>> + * we'll see some (shorter) latency, too. To reduce latency, it's appreciated
>> + * to free memory in background to make margin to the limit. This consumes
>> + * cpu but we'll have a chance to make use of wait time of applications
>> + * (read disk etc..) by asynchronous reclaim.
>> + *
>> + * This async reclaim tries to reclaim HPAGE_SIZE * 2 of pages when margin
>> + * to the limit is smaller than HPAGE_SIZE * 2. This will be enabled
>> + * automatically when the limit is set and it's greater than the threshold.
>> + */
>> +#if HPAGE_SIZE != PAGE_SIZE
>> +#define MEMCG_ASYNC_LIMIT_THRESH ? ? ?(HPAGE_SIZE * 64)
>> +#define MEMCG_ASYNC_MARGIN ? ? ? ? (HPAGE_SIZE * 4)
>> +#else /* make the margin as 4M bytes */
>> +#define MEMCG_ASYNC_LIMIT_THRESH ? ? ?(128 * 1024 * 1024)
>> +#define MEMCG_ASYNC_MARGIN ? ? ? ? ? ?(8 * 1024 * 1024)
>> +#endif
>
> Document them, please. ?How are they used, what are their units.
>

will do.


>> +static void mem_cgroup_may_async_reclaim(struct mem_cgroup *mem);
>> +
>> +/*
>> ? * The memory controller data structure. The memory controller controls both
>> ? * page cache and RSS per cgroup. We would eventually like to provide
>> ? * statistics based on the statistics developed by Rik Van Riel for clock-pro,
>> @@ -278,6 +303,12 @@ struct mem_cgroup {
>> ? ? ? ?*/
>> ? ? ? unsigned long ? move_charge_at_immigrate;
>> ? ? ? /*
>> + ? ? ?* Checks for async reclaim.
>> + ? ? ?*/
>> + ? ? unsigned long ? async_flags;
>> +#define AUTO_ASYNC_ENABLED ? (0)
>> +#define USE_AUTO_ASYNC ? ? ? ? ? ? ? (1)
>
> These are really confusing. ?I looked at the implementation and at the
> documentation file and I'm still scratching my head. ?I can't work out
> why they exist. ?With the amount of effort I put into it ;)
>
> Also, AUTO_ASYNC_ENABLED and USE_AUTO_ASYNC have practically the same
> meaning, which doesn't help things.
>
Ah, yes it's confusing.

> Some careful description at this place in the code might help clear
> things up.
>
yes, I'll fix and add text, consider better name.

> Perhaps s/USE_AUTO_ASYNC/AUTO_ASYNC_IN_USE/ is what you meant.
>
Ah, good name :)

>>
>> ...
>>
>> +static void mem_cgroup_may_async_reclaim(struct mem_cgroup *mem)
>> +{
>> + ? ? if (!test_bit(USE_AUTO_ASYNC, &mem->async_flags))
>> + ? ? ? ? ? ? return;
>> + ? ? if (res_counter_margin(&mem->res) <= MEMCG_ASYNC_MARGIN) {
>> + ? ? ? ? ? ? /* Fill here */
>> + ? ? }
>> +}
>
> I'd expect a function called foo_may_bar() to return a bool.
>
ok,

> But given the lack of documentation and no-op implementation, I have o
> idea what's happening here!
>
yes. Hmm, maybe adding an empty function here and comments on the
function will make this better.

Thank you for review.
-Kame

2011-05-20 23:58:05

by Hiroyuki Kamezawa

[permalink] [raw]
Subject: Re: [PATCH 2/8] memcg: easy check routine for reclaimable

2011/5/21 Andrew Morton <[email protected]>:
> On Fri, 20 May 2011 12:42:12 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
>> +bool mem_cgroup_test_reclaimable(struct mem_cgroup *memcg)
>> +{
>> + ? ? unsigned long nr;
>> + ? ? int zid;
>> +
>> + ? ? for (zid = NODE_DATA(0)->nr_zones - 1; zid >= 0; zid--)
>> + ? ? ? ? ? ? if (mem_cgroup_zone_reclaimable_pages(memcg, 0, zid))
>> + ? ? ? ? ? ? ? ? ? ? break;
>> + ? ? if (zid < 0)
>> + ? ? ? ? ? ? return false;
>> + ? ? return true;
>> +}
>
> A wee bit of documentation would be nice.

Yes, I'll add some.

> ?Perhaps improving the name
> would suffice: mem_cgroup_has_reclaimable().
>
ok, I will use that name.

-Kame

2011-05-21 00:23:53

by Hiroyuki Kamezawa

[permalink] [raw]
Subject: Re: [PATCH 7/8] memcg static scan reclaim for asyncrhonous reclaim

2011/5/21 Andrew Morton <[email protected]>:
> On Fri, 20 May 2011 12:47:53 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
>> Ostatic scan rate async memory reclaim for memcg.
>>
>> This patch implements a routine for asynchronous memory reclaim for memory
>> cgroup, which will be triggered when the usage is near to the limit.
>> This patch includes only code codes for memory freeing.
>>
>> Asynchronous memory reclaim can be a help for reduce latency because
>> memory reclaim goes while an application need to wait or compute something.
>>
>> To do memory reclaim in async, we need some thread or worker.
>> Unlike node or zones, memcg can be created on demand and there may be
>> a system with thousands of memcgs. So, the number of jobs for memcg
>> asynchronous memory reclaim can be big number in theory. So, node kswapd
>> codes doesn't fit well. And some scheduling on memcg layer will be appreciated.
>>
>> This patch implements a static scan rate memory reclaim.
>> When shrink_mem_cgroup_static_scan() is called, it scans pages at most
>> MEMCG_STATIC_SCAN_LIMIT(2048) pages and returnes how memory shrinking
>> was hard. When the function returns false, the caller can assume memory
>> reclaim on the memcg seemed difficult and can add some scheduling delay
>> for the job.
>
> Fully and carefully define the new term "static scan rate"?
>

Ah, yes. It's need to be explained.

>> Note:
>> ? - I think this concept can be used for enhancing softlimit, too.
>> ? ? But need more study.
>>
>>
>> ...
>>
>> + ? ? ? ? ? ? total_scan += nr[l];
>> + ? ? }
>> + ? ? /*
>> + ? ? ?* Asynchronous reclaim for memcg uses static scan rate for avoiding
>> + ? ? ?* too much cpu consumption in a memcg. Adjust the scan count to fit
>> + ? ? ?* into scan_limit.
>> + ? ? ?*/
>> + ? ? if (total_scan > sc->scan_limit) {
>> + ? ? ? ? ? ? for_each_evictable_lru(l) {
>> + ? ? ? ? ? ? ? ? ? ? if (!nr[l] < SWAP_CLUSTER_MAX)
>
> That statement doesn't do what you think it does!
>
....that's my bug. will be fixed or removed in the next version.

>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? continue;
>> + ? ? ? ? ? ? ? ? ? ? nr[l] = div64_u64(nr[l] * sc->scan_limit, total_scan);
>> + ? ? ? ? ? ? ? ? ? ? nr[l] = max((unsigned long)SWAP_CLUSTER_MAX, nr[l]);
>> + ? ? ? ? ? ? }
>> ? ? ? }
>
> This gets included in CONFIG_CGROUP_MEM_RES_CTLR=n kernels. ?Needlessly?
>
Yes, no global reclaim uses scan_limit, now. I'll add a
scanning_global_lru() check
and compiler can hide this.

> It also has the potential to affect non-memcg behaviour at runtime.
>
Hmm, if scan_limit is set by mistake....ok, I'll add scanning_global_lru().


>> ?}
>>
>> @@ -1938,6 +1955,11 @@ restart:
>> ? ? ? ? ? ? ? ?*/
>> ? ? ? ? ? ? ? if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY)
>> ? ? ? ? ? ? ? ? ? ? ? break;
>> + ? ? ? ? ? ? /*
>> + ? ? ? ? ? ? ?* static scan rate memory reclaim ?
>
> I still don't know what "static scan rate" means :(
>
static scan rate ....maybe my English skill is also bad ;(

Maybe I should name this as "stable scan rate per run" or "limited
scan reclaim",
it means when it's invoked, scan pages up to the scan_limit, at most.
In usual reclaim, it tries to reclaim some amount of pages and may need to scan
the whole memory. But this logic stops and returns to the caller when
hits scan_limit.


>> + ? ? ? ? ? ? ?*/
>> + ? ? ? ? ? ? if (sc->nr_scanned > sc->scan_limit)
>> + ? ? ? ? ? ? ? ? ? ? break;
>> ? ? ? }
>> ? ? ? sc->nr_reclaimed += nr_reclaimed;
>>
>>
>> ...
>>
>> +static void shrink_mem_cgroup_node(int nid,
>> + ? ? ? ? ? ? int priority, struct scan_control *sc)
>> +{
>> + ? ? unsigned long this_scanned = 0;
>> + ? ? unsigned long this_reclaimed = 0;
>> + ? ? int i;
>> +
>> + ? ? for (i = 0; i < NODE_DATA(nid)->nr_zones; i++) {
>> + ? ? ? ? ? ? struct zone *zone = NODE_DATA(nid)->node_zones + i;
>> +
>> + ? ? ? ? ? ? if (!populated_zone(zone))
>> + ? ? ? ? ? ? ? ? ? ? continue;
>> + ? ? ? ? ? ? if (!mem_cgroup_zone_reclaimable_pages(sc->mem_cgroup, nid, i))
>> + ? ? ? ? ? ? ? ? ? ? continue;
>> + ? ? ? ? ? ? /* If recent scan didn't go good, do writepate */
>> + ? ? ? ? ? ? sc->nr_scanned = 0;
>> + ? ? ? ? ? ? sc->nr_reclaimed = 0;
>> + ? ? ? ? ? ? shrink_zone(priority, zone, sc);
>> + ? ? ? ? ? ? this_scanned += sc->nr_scanned;
>> + ? ? ? ? ? ? this_reclaimed += sc->nr_reclaimed;
>> + ? ? ? ? ? ? if (this_reclaimed >= sc->nr_to_reclaim)
>> + ? ? ? ? ? ? ? ? ? ? break;
>> + ? ? ? ? ? ? if (sc->scan_limit < this_scanned)
>> + ? ? ? ? ? ? ? ? ? ? break;
>> + ? ? ? ? ? ? if (need_resched())
>> + ? ? ? ? ? ? ? ? ? ? break;
>
> Whoa! ?Explain?
>
>> + ? ? }
>> + ? ? sc->nr_scanned = this_scanned;
>> + ? ? sc->nr_reclaimed = this_reclaimed;
>> + ? ? return;
>> +}
>> +
>> +#define MEMCG_ASYNCSCAN_LIMIT ? ? ? ? ? ? ? ?(2048)
>
> Needs documentation. ?What happens if I set it to 1024?
>
I will do and explain why 2048 now.

>> +bool mem_cgroup_shrink_static_scan(struct mem_cgroup *mem, long required)
>
> Exported function has no interface documentation.
>
> `required' appears to have units of "number of pages". ?Should be unsigned.
>
>
yes, I'll fix and add documents.

> +{
>> + ? ? int nid, priority, noscan;
>
> `noscan' is poorly named and distressingly mysterious. ?Basically I
> don't have a clue what you're doing with this.
>
> It should be unsigned.
>

ok, I'll think of better name ....hmm, scan_failed or no_reclaimable_pages.


>> + ? ? unsigned long total_scanned, total_reclaimed, reclaim_target;
>> + ? ? struct scan_control sc = {
>> + ? ? ? ? ? ? .gfp_mask ? ? ?= GFP_HIGHUSER_MOVABLE,
>> + ? ? ? ? ? ? .may_unmap ? ? = 1,
>> + ? ? ? ? ? ? .may_swap ? ? ?= 1,
>> + ? ? ? ? ? ? .order ? ? ? ? = 0,
>> + ? ? ? ? ? ? /* we don't writepage in our scan. but kick flusher threads */
>> + ? ? ? ? ? ? .may_writepage = 0,
>> + ? ? };
>> + ? ? struct mem_cgroup *victim, *check_again;
>> + ? ? bool congested = true;
>> +
>> + ? ? total_scanned = 0;
>> + ? ? total_reclaimed = 0;
>> + ? ? reclaim_target = min(required, MEMCG_ASYNCSCAN_LIMIT/2L);
>> + ? ? sc.swappiness = mem_cgroup_swappiness(mem);
>> +
>> + ? ? noscan = 0;
>> + ? ? check_again = NULL;
>> +
>> + ? ? do {
>> + ? ? ? ? ? ? victim = mem_cgroup_select_victim(mem);
>> +
>> + ? ? ? ? ? ? if (!mem_cgroup_test_reclaimable(victim)) {
>> + ? ? ? ? ? ? ? ? ? ? mem_cgroup_release_victim(victim);
>> + ? ? ? ? ? ? ? ? ? ? /*
>> + ? ? ? ? ? ? ? ? ? ? ?* if selected a hopeless victim again, give up.
>> + ? ? ? ? ? ? ? ? ? ? ?*/
>> + ? ? ? ? ? ? ? ? ? ? if (check_again == victim)
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? goto out;
>> + ? ? ? ? ? ? ? ? ? ? if (!check_again)
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? check_again = victim;
>> + ? ? ? ? ? ? } else
>> + ? ? ? ? ? ? ? ? ? ? check_again = NULL;
>> + ? ? } while (check_again);
>
> What's all this trying to do?
>
This tries to do walk hierarchy of memcg and select a victim memcg
to be scanned under given memcg. But if all pages under the memcg is
unevictable,
we have no job. So, this gives up when the same memcg found twice, which is
unevictable. This works because current select_victim() is round-robin....

Yes, need to be documented.


>> + ? ? current->flags |= PF_SWAPWRITE;
>> + ? ? /*
>> + ? ? ?* We can use arbitrary priority for our run because we just scan
>> + ? ? ?* up to MEMCG_ASYNCSCAN_LIMIT and reclaim only the half of it.
>> + ? ? ?* But, we need to have early-give-up chance for avoid cpu hogging.
>> + ? ? ?* So, start from a small priority and increase it.
>> + ? ? ?*/
>> + ? ? priority = DEF_PRIORITY;
>> +
>> + ? ? while ((total_scanned < MEMCG_ASYNCSCAN_LIMIT) &&
>> + ? ? ? ? ? ? (total_reclaimed < reclaim_target)) {
>> +
>> + ? ? ? ? ? ? /* select a node to scan */
>> + ? ? ? ? ? ? nid = mem_cgroup_select_victim_node(victim);
>> +
>> + ? ? ? ? ? ? sc.mem_cgroup = victim;
>> + ? ? ? ? ? ? sc.nr_scanned = 0;
>> + ? ? ? ? ? ? sc.scan_limit = MEMCG_ASYNCSCAN_LIMIT - total_scanned;
>> + ? ? ? ? ? ? sc.nr_reclaimed = 0;
>> + ? ? ? ? ? ? sc.nr_to_reclaim = reclaim_target - total_reclaimed;
>> + ? ? ? ? ? ? shrink_mem_cgroup_node(nid, priority, &sc);
>> + ? ? ? ? ? ? if (sc.nr_scanned) {
>> + ? ? ? ? ? ? ? ? ? ? total_scanned += sc.nr_scanned;
>> + ? ? ? ? ? ? ? ? ? ? total_reclaimed += sc.nr_reclaimed;
>> + ? ? ? ? ? ? ? ? ? ? noscan = 0;
>> + ? ? ? ? ? ? } else
>> + ? ? ? ? ? ? ? ? ? ? noscan++;
>> + ? ? ? ? ? ? mem_cgroup_release_victim(victim);
>> + ? ? ? ? ? ? /* ok, check condition */
>> + ? ? ? ? ? ? if (total_scanned > total_reclaimed * 2)
>> + ? ? ? ? ? ? ? ? ? ? wakeup_flusher_threads(sc.nr_scanned);
>> +
>> + ? ? ? ? ? ? if (mem_cgroup_async_should_stop(mem))
>> + ? ? ? ? ? ? ? ? ? ? break;
>> + ? ? ? ? ? ? /* If memory reclaim seems heavy, return that we're congested */
>> + ? ? ? ? ? ? if (total_scanned > MEMCG_ASYNCSCAN_LIMIT/4 &&
>> + ? ? ? ? ? ? ? ? total_scanned > total_reclaimed*8)
>> + ? ? ? ? ? ? ? ? ? ? break;
>> + ? ? ? ? ? ? /*
>> + ? ? ? ? ? ? ?* The whole system is busy or some status update
>> + ? ? ? ? ? ? ?* is not synched. It's better to wait for a while.
>> + ? ? ? ? ? ? ?*/
>> + ? ? ? ? ? ? if ((noscan > 1) || (need_resched()))
>> + ? ? ? ? ? ? ? ? ? ? break;
>
> So we bale out if there were two priority levels at which
> shrink_mem_cgroup_node() didn't scan any pages? ?What on earth???
>
I thought there is a case that a memcg contains only ANON in a node
on swapless system or all pages are isolated or some...

> And what was the point in calling shrink_mem_cgroup_node() if it didn't
> scan anything?

I wonder if there are threads in synchronous relcaim we can have race.

> I could understand using nr_reclaimed...
>
I'll reconsider why I inserted this..(I might add this before fixing
get_scan_count()..0

Maybe I can remove this check because I don't hit this case, and later,
memcg can have some logic similar to zone->all_unreclaimable.



>> + ? ? ? ? ? ? /* ok, we can do deeper scanning. */
>> + ? ? ? ? ? ? priority--;
>> + ? ? }
>> + ? ? current->flags &= ~PF_SWAPWRITE;
>> + ? ? /*
>> + ? ? ?* If we successfully freed the half of target, report that
>> + ? ? ?* memory reclaim went smoothly.
>> + ? ? ?*/
>> + ? ? if (total_reclaimed > reclaim_target/2)
>> + ? ? ? ? ? ? congested = false;
>> +out:
>> + ? ? return congested;
>> +}
>> ?#endif
>
>
>
> I dunno, the whole thing seems sprinkled full of arbitrary assumptions
> and guess-and-giggle magic numbers. ?I expect a lot of this stuff is
> just unnecessary. ?And if it _is_ necessary then I'd expect there to
> be lots of situations and corner cases in which it malfunctions,
> because the magic numbers weren't tuned to that case.

Hmm, ok, I'll make this function simpler and add explanation to numbers.


Regards,
-Kame

2011-05-21 00:41:57

by Hiroyuki Kamezawa

[permalink] [raw]
Subject: Re: [PATCH 8/8] memcg asyncrhouns reclaim workqueue

2011/5/21 Andrew Morton <[email protected]>:
> On Fri, 20 May 2011 12:48:37 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
>> workqueue for memory cgroup asynchronous memory shrinker.
>>
>> This patch implements the workqueue of async shrinker routine. each
>> memcg has a work and only one work can be scheduled at the same time.
>>
>> If shrinking memory doesn't goes well, delay will be added to the work.
>>
>
> When this code explodes (as it surely will), users will see large
> amounts of CPU consumption in the work queue thread. ?We want to make
> this as easy to debug as possible, so we should try to make the
> workqueue's names mappable back onto their memcg's. ?And anything else
> we can think of to help?
>

I had a patch for showing per-memcg reclaim latency stats. It will be help.
I'll add it again to this set. I just dropped it because there are many patches
onto memory.stat in flight..


>>
>> ...
>>
>> +static void mem_cgroup_async_shrink(struct work_struct *work)
>> +{
>> + ? ? struct delayed_work *dw = to_delayed_work(work);
>> + ? ? struct mem_cgroup *mem = container_of(dw,
>> + ? ? ? ? ? ? ? ? ? ? struct mem_cgroup, async_work);
>> + ? ? bool congested = false;
>> + ? ? int delay = 0;
>> + ? ? unsigned long long required, usage, limit, shrink_to;
>
> There's a convention which is favored by some (and ignored by the
> clueless ;)) which says "one definition per line".
>
> The reason I like one-definition-per-line is that it leaves a little
> room on the right where the programmer can explain the role of the
> local.
>
> Another advantage is that one can initialise it. ?eg:
>
> ? ? ? ?unsigned long limit = res_counter_read_u64(&mem->res, RES_LIMIT);
>
> That conveys useful information: the reader can see what it's
> initialised with and can infer its use.
>
> A third advantage is that it can now be made const, which conveys very
> useful informtation and can prevent bugs.
>
> A fourth advantage is that it makes later patches to this function more
> readable and easier to apply when there are conflicts.
>
ok, I will fix.

>
>> + ? ? limit = res_counter_read_u64(&mem->res, RES_LIMIT);
>> + ? ? shrink_to = limit - MEMCG_ASYNC_MARGIN - PAGE_SIZE;
>> + ? ? usage = res_counter_read_u64(&mem->res, RES_USAGE);
>> + ? ? if (shrink_to <= usage) {
>> + ? ? ? ? ? ? required = usage - shrink_to;
>> + ? ? ? ? ? ? required = (required >> PAGE_SHIFT) + 1;
>> + ? ? ? ? ? ? /*
>> + ? ? ? ? ? ? ?* This scans some number of pages and returns that memory
>> + ? ? ? ? ? ? ?* reclaim was slow or now. If slow, we add a delay as
>> + ? ? ? ? ? ? ?* congestion_wait() in vmscan.c
>> + ? ? ? ? ? ? ?*/
>> + ? ? ? ? ? ? congested = mem_cgroup_shrink_static_scan(mem, (long)required);
>> + ? ? }
>> + ? ? if (test_bit(ASYNC_NORESCHED, &mem->async_flags)
>> + ? ? ? ? || mem_cgroup_async_should_stop(mem))
>> + ? ? ? ? ? ? goto finish_scan;
>> + ? ? /* If memory reclaim couldn't go well, add delay */
>> + ? ? if (congested)
>> + ? ? ? ? ? ? delay = HZ/10;
>
> Another magic number.
>
> If Moore's law holds, we need to reduce this number by 1.4 each year.
> Is this good?
>

not good. I just used the same magic number now used with wait_iff_congested.
Other than timer, I can use pagein/pageout event counter. If we have
dirty_ratio,
I may able to link this to dirty_ratio and wait until dirty_ratio is enough low.
Or, wake up again hit limit.

Do you have suggestion ?



>> + ? ? queue_delayed_work(memcg_async_shrinker, &mem->async_work, delay);
>> + ? ? return;
>> +finish_scan:
>> + ? ? cgroup_release_and_wakeup_rmdir(&mem->css);
>> + ? ? clear_bit(ASYNC_RUNNING, &mem->async_flags);
>> + ? ? return;
>> +}
>> +
>> +static void run_mem_cgroup_async_shrinker(struct mem_cgroup *mem)
>> +{
>> + ? ? if (test_bit(ASYNC_NORESCHED, &mem->async_flags))
>> + ? ? ? ? ? ? return;
>
> I can't work out what ASYNC_NORESCHED does. ?Is its name well-chosen?
>
how about BLOCK/STOP_ASYNC_RECLAIM ?

>> + ? ? if (test_and_set_bit(ASYNC_RUNNING, &mem->async_flags))
>> + ? ? ? ? ? ? return;
>> + ? ? cgroup_exclude_rmdir(&mem->css);
>> + ? ? /*
>> + ? ? ?* start reclaim with small delay. This delay will allow us to do job
>> + ? ? ?* in batch.
>
> Explain more?
>
yes, or I'll change this logic. I wanted to do low/high watermark
without "low" watermark...

>> + ? ? ?*/
>> + ? ? if (!queue_delayed_work(memcg_async_shrinker, &mem->async_work, 1)) {
>> + ? ? ? ? ? ? cgroup_release_and_wakeup_rmdir(&mem->css);
>> + ? ? ? ? ? ? clear_bit(ASYNC_RUNNING, &mem->async_flags);
>> + ? ? }
>> + ? ? return;
>> +}
>> +
>>
>> ...
>>
>

Thank you for review. I realized I need some amount of works. I'll add texts to
explain behavior and make codes simpler.


Thanks,
-Kame

2011-05-21 01:26:56

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 8/8] memcg asyncrhouns reclaim workqueue

On Sat, 21 May 2011 09:41:50 +0900 Hiroyuki Kamezawa <[email protected]> wrote:

> 2011/5/21 Andrew Morton <[email protected]>:
> > On Fri, 20 May 2011 12:48:37 +0900
> > KAMEZAWA Hiroyuki <[email protected]> wrote:
> >
> >> workqueue for memory cgroup asynchronous memory shrinker.
> >>
> >> This patch implements the workqueue of async shrinker routine. each
> >> memcg has a work and only one work can be scheduled at the same time.
> >>
> >> If shrinking memory doesn't goes well, delay will be added to the work.
> >>
> >
> > When this code explodes (as it surely will), users will see large
> > amounts of CPU consumption in the work queue thread. __We want to make
> > this as easy to debug as possible, so we should try to make the
> > workqueue's names mappable back onto their memcg's. __And anything else
> > we can think of to help?
> >
>
> I had a patch for showing per-memcg reclaim latency stats. It will be help.
> I'll add it again to this set. I just dropped it because there are many patches
> onto memory.stat in flight..

Will that patch help us when users report the memcg equivalent of
"kswapd uses 99% of CPU"?

> >
> >> + __ __ limit = res_counter_read_u64(&mem->res, RES_LIMIT);
> >> + __ __ shrink_to = limit - MEMCG_ASYNC_MARGIN - PAGE_SIZE;
> >> + __ __ usage = res_counter_read_u64(&mem->res, RES_USAGE);
> >> + __ __ if (shrink_to <= usage) {
> >> + __ __ __ __ __ __ required = usage - shrink_to;
> >> + __ __ __ __ __ __ required = (required >> PAGE_SHIFT) + 1;
> >> + __ __ __ __ __ __ /*
> >> + __ __ __ __ __ __ __* This scans some number of pages and returns that memory
> >> + __ __ __ __ __ __ __* reclaim was slow or now. If slow, we add a delay as
> >> + __ __ __ __ __ __ __* congestion_wait() in vmscan.c
> >> + __ __ __ __ __ __ __*/
> >> + __ __ __ __ __ __ congested = mem_cgroup_shrink_static_scan(mem, (long)required);
> >> + __ __ }
> >> + __ __ if (test_bit(ASYNC_NORESCHED, &mem->async_flags)
> >> + __ __ __ __ || mem_cgroup_async_should_stop(mem))
> >> + __ __ __ __ __ __ goto finish_scan;
> >> + __ __ /* If memory reclaim couldn't go well, add delay */
> >> + __ __ if (congested)
> >> + __ __ __ __ __ __ delay = HZ/10;
> >
> > Another magic number.
> >
> > If Moore's law holds, we need to reduce this number by 1.4 each year.
> > Is this good?
> >
>
> not good. I just used the same magic number now used with wait_iff_congested.
> Other than timer, I can use pagein/pageout event counter. If we have
> dirty_ratio,
> I may able to link this to dirty_ratio and wait until dirty_ratio is enough low.
> Or, wake up again hit limit.
>
> Do you have suggestion ?
>

mm.. It would be pretty easy to generate an estimate of "pages scanned
per second" from the contents of (and changes in) the scan_control.
Konwing that datum and knowing the number of pages in the memcg, we
should be able to come up with a delay period which scales
appropriately with CPU speed and with memory size?

Such a thing could be used to rationalise magic delays in other places,
hopefully.

>
> >> + __ __ queue_delayed_work(memcg_async_shrinker, &mem->async_work, delay);
> >> + __ __ return;
> >> +finish_scan:
> >> + __ __ cgroup_release_and_wakeup_rmdir(&mem->css);
> >> + __ __ clear_bit(ASYNC_RUNNING, &mem->async_flags);
> >> + __ __ return;
> >> +}
> >> +
> >> +static void run_mem_cgroup_async_shrinker(struct mem_cgroup *mem)
> >> +{
> >> + __ __ if (test_bit(ASYNC_NORESCHED, &mem->async_flags))
> >> + __ __ __ __ __ __ return;
> >
> > I can't work out what ASYNC_NORESCHED does. __Is its name well-chosen?
> >
> how about BLOCK/STOP_ASYNC_RECLAIM ?

I can't say - I don't know what it does! Or maybe I did, and immediately
forgot ;)

2011-05-23 00:32:47

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 8/8] memcg asyncrhouns reclaim workqueue

On Fri, 20 May 2011 18:26:40 -0700
Andrew Morton <[email protected]> wrote:

> On Sat, 21 May 2011 09:41:50 +0900 Hiroyuki Kamezawa <[email protected]> wrote:
>
> > 2011/5/21 Andrew Morton <[email protected]>:
> > > On Fri, 20 May 2011 12:48:37 +0900
> > > KAMEZAWA Hiroyuki <[email protected]> wrote:
> > >
> > >> workqueue for memory cgroup asynchronous memory shrinker.
> > >>
> > >> This patch implements the workqueue of async shrinker routine. each
> > >> memcg has a work and only one work can be scheduled at the same time.
> > >>
> > >> If shrinking memory doesn't goes well, delay will be added to the work.
> > >>
> > >
> > > When this code explodes (as it surely will), users will see large
> > > amounts of CPU consumption in the work queue thread. __We want to make
> > > this as easy to debug as possible, so we should try to make the
> > > workqueue's names mappable back onto their memcg's. __And anything else
> > > we can think of to help?
> > >
> >
> > I had a patch for showing per-memcg reclaim latency stats. It will be help.
> > I'll add it again to this set. I just dropped it because there are many patches
> > onto memory.stat in flight..
>
> Will that patch help us when users report the memcg equivalent of
> "kswapd uses 99% of CPU"?
>
I think so. Each memcg shows what amount of cpu is used.

But, maybe it's not an easy interface. I have several idea.


An idea I have is to rename task->comm by overwrite from kworker/u:%d as
to memcg/%d when the work is scheduled. I think this can be implemented in very
simple interface and flags to workqueue. Then, ps -elf can show what was goin on.
If necessary, I'll add a hardlimit of cpu usage for a work or I'll limit
the number of thread for memcg workqueue.

Considering there are user who uses 2000+ memcg on a system, a thread per a memcg
was not a choice to me. Another idea was thread poll or workqueue. Because thread
pool can be a poor reimplemenation of workqueue, I used workqueue.

I'll implement some idea in above to the next version.


> > >
> > >> + __ __ limit = res_counter_read_u64(&mem->res, RES_LIMIT);
> > >> + __ __ shrink_to = limit - MEMCG_ASYNC_MARGIN - PAGE_SIZE;
> > >> + __ __ usage = res_counter_read_u64(&mem->res, RES_USAGE);
> > >> + __ __ if (shrink_to <= usage) {
> > >> + __ __ __ __ __ __ required = usage - shrink_to;
> > >> + __ __ __ __ __ __ required = (required >> PAGE_SHIFT) + 1;
> > >> + __ __ __ __ __ __ /*
> > >> + __ __ __ __ __ __ __* This scans some number of pages and returns that memory
> > >> + __ __ __ __ __ __ __* reclaim was slow or now. If slow, we add a delay as
> > >> + __ __ __ __ __ __ __* congestion_wait() in vmscan.c
> > >> + __ __ __ __ __ __ __*/
> > >> + __ __ __ __ __ __ congested = mem_cgroup_shrink_static_scan(mem, (long)required);
> > >> + __ __ }
> > >> + __ __ if (test_bit(ASYNC_NORESCHED, &mem->async_flags)
> > >> + __ __ __ __ || mem_cgroup_async_should_stop(mem))
> > >> + __ __ __ __ __ __ goto finish_scan;
> > >> + __ __ /* If memory reclaim couldn't go well, add delay */
> > >> + __ __ if (congested)
> > >> + __ __ __ __ __ __ delay = HZ/10;
> > >
> > > Another magic number.
> > >
> > > If Moore's law holds, we need to reduce this number by 1.4 each year.
> > > Is this good?
> > >
> >
> > not good. I just used the same magic number now used with wait_iff_congested.
> > Other than timer, I can use pagein/pageout event counter. If we have
> > dirty_ratio,
> > I may able to link this to dirty_ratio and wait until dirty_ratio is enough low.
> > Or, wake up again hit limit.
> >
> > Do you have suggestion ?
> >
>
> mm.. It would be pretty easy to generate an estimate of "pages scanned
> per second" from the contents of (and changes in) the scan_control.

Hmm.

> Konwing that datum and knowing the number of pages in the memcg, we
> should be able to come up with a delay period which scales
> appropriately with CPU speed and with memory size?
>
> Such a thing could be used to rationalise magic delays in other places,
> hopefully.
>

Ok, I'll conder that. Thank you for nice idea.


> >
> > >> + __ __ queue_delayed_work(memcg_async_shrinker, &mem->async_work, delay);
> > >> + __ __ return;
> > >> +finish_scan:
> > >> + __ __ cgroup_release_and_wakeup_rmdir(&mem->css);
> > >> + __ __ clear_bit(ASYNC_RUNNING, &mem->async_flags);
> > >> + __ __ return;
> > >> +}
> > >> +
> > >> +static void run_mem_cgroup_async_shrinker(struct mem_cgroup *mem)
> > >> +{
> > >> + __ __ if (test_bit(ASYNC_NORESCHED, &mem->async_flags))
> > >> + __ __ __ __ __ __ return;
> > >
> > > I can't work out what ASYNC_NORESCHED does. __Is its name well-chosen?
> > >
> > how about BLOCK/STOP_ASYNC_RECLAIM ?
>
> I can't say - I don't know what it does! Or maybe I did, and immediately
> forgot ;)
>

I'll find a better name ;)

Thanks,
-Kame

2011-05-23 17:26:30

by Ying Han

[permalink] [raw]
Subject: Re: [PATCH 0/8] memcg: clean up, export swapiness

Hi Kame:

Is this patch part of the "memcg async reclaim v2" patchset? I am
trying to do some tests on top of that, but having hard time finding
the [PATCH 3/8] and [PATCH 5/8].

--Ying

On Thu, May 19, 2011 at 8:43 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> From: Ying Han <[email protected]>
> change mem_cgroup's swappiness interface.
>
> Now, memcg's swappiness interface is defined as 'static' and
> the value is passed as an argument to try_to_free_xxxx...
>
> This patch adds an function mem_cgroup_swappiness() and export it,
> reduce arguments. This interface will be used in async reclaim, later.
>
> I think an function is better than passing arguments because it's
> clearer where the swappiness comes from to scan_control.
>
> Signed-off-by: Ying Han <[email protected]>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> ?include/linux/memcontrol.h | ? ?1 +
> ?include/linux/swap.h ? ? ? | ? ?4 +---
> ?mm/memcontrol.c ? ? ? ? ? ?| ? 14 ++++++--------
> ?mm/vmscan.c ? ? ? ? ? ? ? ?| ? ?9 ++++-----
> ?4 files changed, 12 insertions(+), 16 deletions(-)
>
> Index: mmotm-May11/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-May11.orig/include/linux/memcontrol.h
> +++ mmotm-May11/include/linux/memcontrol.h
> @@ -112,6 +112,7 @@ unsigned long
> ?mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg, int nid, int zid);
> ?bool mem_cgroup_test_reclaimable(struct mem_cgroup *memcg);
> ?int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
> +unsigned int mem_cgroup_swappiness(struct mem_cgroup *memcg);
> ?unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct zone *zone,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? enum lru_list lru);
> Index: mmotm-May11/mm/memcontrol.c
> ===================================================================
> --- mmotm-May11.orig/mm/memcontrol.c
> +++ mmotm-May11/mm/memcontrol.c
> @@ -1285,7 +1285,7 @@ static unsigned long mem_cgroup_margin(s
> ? ? ? ?return margin >> PAGE_SHIFT;
> ?}
>
> -static unsigned int get_swappiness(struct mem_cgroup *memcg)
> +unsigned int mem_cgroup_swappiness(struct mem_cgroup *memcg)
> ?{
> ? ? ? ?struct cgroup *cgrp = memcg->css.cgroup;
>
> @@ -1687,14 +1687,13 @@ static int mem_cgroup_hierarchical_recla
> ? ? ? ? ? ? ? ?/* we use swappiness of local cgroup */
> ? ? ? ? ? ? ? ?if (check_soft) {
> ? ? ? ? ? ? ? ? ? ? ? ?ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? noswap, get_swappiness(victim), zone,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? &nr_scanned);
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? noswap, zone, &nr_scanned);
> ? ? ? ? ? ? ? ? ? ? ? ?*total_scanned += nr_scanned;
> ? ? ? ? ? ? ? ? ? ? ? ?mem_cgroup_soft_steal(victim, is_kswapd, ret);
> ? ? ? ? ? ? ? ? ? ? ? ?mem_cgroup_soft_scan(victim, is_kswapd, nr_scanned);
> ? ? ? ? ? ? ? ?} else
> ? ? ? ? ? ? ? ? ? ? ? ?ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? noswap, get_swappiness(victim));
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? noswap);
> ? ? ? ? ? ? ? ?css_put(&victim->css);
> ? ? ? ? ? ? ? ?/*
> ? ? ? ? ? ? ? ? * At shrinking usage, we can't check we should stop here or
> @@ -3717,8 +3716,7 @@ try_to_free:
> ? ? ? ? ? ? ? ? ? ? ? ?ret = -EINTR;
> ? ? ? ? ? ? ? ? ? ? ? ?goto out;
> ? ? ? ? ? ? ? ?}
> - ? ? ? ? ? ? ? progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? false, get_swappiness(mem));
> + ? ? ? ? ? ? ? progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL, false);
> ? ? ? ? ? ? ? ?if (!progress) {
> ? ? ? ? ? ? ? ? ? ? ? ?nr_retries--;
> ? ? ? ? ? ? ? ? ? ? ? ?/* maybe some writeback is necessary */
> @@ -4150,7 +4148,7 @@ static u64 mem_cgroup_swappiness_read(st
> ?{
> ? ? ? ?struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
>
> - ? ? ? return get_swappiness(memcg);
> + ? ? ? return mem_cgroup_swappiness(memcg);
> ?}
>
> ?static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
> @@ -4836,7 +4834,7 @@ mem_cgroup_create(struct cgroup_subsys *
> ? ? ? ?INIT_LIST_HEAD(&mem->oom_notify);
>
> ? ? ? ?if (parent)
> - ? ? ? ? ? ? ? mem->swappiness = get_swappiness(parent);
> + ? ? ? ? ? ? ? mem->swappiness = mem_cgroup_swappiness(parent);
> ? ? ? ?atomic_set(&mem->refcnt, 1);
> ? ? ? ?mem->move_charge_at_immigrate = 0;
> ? ? ? ?mutex_init(&mem->thresholds_lock);
> Index: mmotm-May11/include/linux/swap.h
> ===================================================================
> --- mmotm-May11.orig/include/linux/swap.h
> +++ mmotm-May11/include/linux/swap.h
> @@ -252,11 +252,9 @@ static inline void lru_cache_add_file(st
> ?extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?gfp_t gfp_mask, nodemask_t *mask);
> ?extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_t gfp_mask, bool noswap,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned int swappiness);
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_t gfp_mask, bool noswap);
> ?extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?gfp_t gfp_mask, bool noswap,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned int swappiness,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct zone *zone,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?unsigned long *nr_scanned);
> ?extern int __isolate_lru_page(struct page *page, int mode, int file);
> Index: mmotm-May11/mm/vmscan.c
> ===================================================================
> --- mmotm-May11.orig/mm/vmscan.c
> +++ mmotm-May11/mm/vmscan.c
> @@ -2178,7 +2178,6 @@ unsigned long try_to_free_pages(struct z
>
> ?unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?gfp_t gfp_mask, bool noswap,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned int swappiness,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct zone *zone,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?unsigned long *nr_scanned)
> ?{
> @@ -2188,7 +2187,6 @@ unsigned long mem_cgroup_shrink_node_zon
> ? ? ? ? ? ? ? ?.may_writepage = !laptop_mode,
> ? ? ? ? ? ? ? ?.may_unmap = 1,
> ? ? ? ? ? ? ? ?.may_swap = !noswap,
> - ? ? ? ? ? ? ? .swappiness = swappiness,
> ? ? ? ? ? ? ? ?.order = 0,
> ? ? ? ? ? ? ? ?.mem_cgroup = mem,
> ? ? ? ?};
> @@ -2196,6 +2194,8 @@ unsigned long mem_cgroup_shrink_node_zon
> ? ? ? ?sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> ? ? ? ? ? ? ? ? ? ? ? ?(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
>
> + ? ? ? sc.swappiness = mem_cgroup_swappiness(mem);
> +
> ? ? ? ?trace_mm_vmscan_memcg_softlimit_reclaim_begin(0,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?sc.may_writepage,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?sc.gfp_mask);
> @@ -2217,8 +2217,7 @@ unsigned long mem_cgroup_shrink_node_zon
>
> ?unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? gfp_t gfp_mask,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?bool noswap,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?unsigned int swappiness)
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?bool noswap)
> ?{
> ? ? ? ?struct zonelist *zonelist;
> ? ? ? ?unsigned long nr_reclaimed;
> @@ -2228,7 +2227,6 @@ unsigned long try_to_free_mem_cgroup_pag
> ? ? ? ? ? ? ? ?.may_unmap = 1,
> ? ? ? ? ? ? ? ?.may_swap = !noswap,
> ? ? ? ? ? ? ? ?.nr_to_reclaim = SWAP_CLUSTER_MAX,
> - ? ? ? ? ? ? ? .swappiness = swappiness,
> ? ? ? ? ? ? ? ?.order = 0,
> ? ? ? ? ? ? ? ?.mem_cgroup = mem_cont,
> ? ? ? ? ? ? ? ?.nodemask = NULL, /* we don't care the placement */
> @@ -2245,6 +2243,7 @@ unsigned long try_to_free_mem_cgroup_pag
> ? ? ? ? * scan does not need to be the current node.
> ? ? ? ? */
> ? ? ? ?nid = mem_cgroup_select_victim_node(mem_cont);
> + ? ? ? sc.swappiness = mem_cgroup_swappiness(mem_cont);
>
> ? ? ? ?zonelist = NODE_DATA(nid)->node_zonelists;
>
>
>

2011-05-23 23:36:26

by Ying Han

[permalink] [raw]
Subject: Re: [PATCH 6/8] memcg asynchronous memory reclaim interface

On Fri, May 20, 2011 at 4:56 PM, Hiroyuki Kamezawa
<[email protected]> wrote:
> 2011/5/21 Andrew Morton <[email protected]>:
>> On Fri, 20 May 2011 12:46:36 +0900
>> KAMEZAWA Hiroyuki <[email protected]> wrote:
>>
>>> This patch adds a logic to keep usage margin to the limit in asynchronous way.
>>> When the usage over some threshould (determined automatically), asynchronous
>>> memory reclaim runs and shrink memory to limit - MEMCG_ASYNC_STOP_MARGIN.
>>>
>>> By this, there will be no difference in total amount of usage of cpu to
>>> scan the LRU
>>
>> This is not true if "don't writepage at all (revisit this when
>> dirty_ratio comes.)" is true. ?Skipping over dirty pages can cause
>> larger amounts of CPU consumption.
>>
>>> but we'll have a chance to make use of wait time of applications
>>> for freeing memory. For example, when an application read a file or socket,
>>> to fill the newly alloated memory, it needs wait. Async reclaim can make use
>>> of that time and give a chance to reduce latency by background works.
>>>
>>> This patch only includes required hooks to trigger async reclaim and user interfaces.
>>> Core logics will be in the following patches.
>>>
>>>
>>> ...
>>>
>>> ?/*
>>> + * For example, with transparent hugepages, memory reclaim scan at hitting
>>> + * limit can very long as to reclaim HPAGE_SIZE of memory. This increases
>>> + * latency of page fault and may cause fallback. At usual page allocation,
>>> + * we'll see some (shorter) latency, too. To reduce latency, it's appreciated
>>> + * to free memory in background to make margin to the limit. This consumes
>>> + * cpu but we'll have a chance to make use of wait time of applications
>>> + * (read disk etc..) by asynchronous reclaim.
>>> + *
>>> + * This async reclaim tries to reclaim HPAGE_SIZE * 2 of pages when margin
>>> + * to the limit is smaller than HPAGE_SIZE * 2. This will be enabled
>>> + * automatically when the limit is set and it's greater than the threshold.
>>> + */
>>> +#if HPAGE_SIZE != PAGE_SIZE
>>> +#define MEMCG_ASYNC_LIMIT_THRESH ? ? ?(HPAGE_SIZE * 64)
>>> +#define MEMCG_ASYNC_MARGIN ? ? ? ? (HPAGE_SIZE * 4)
>>> +#else /* make the margin as 4M bytes */
>>> +#define MEMCG_ASYNC_LIMIT_THRESH ? ? ?(128 * 1024 * 1024)
>>> +#define MEMCG_ASYNC_MARGIN ? ? ? ? ? ?(8 * 1024 * 1024)
>>> +#endif
>>
>> Document them, please. ?How are they used, what are their units.
>>
>
> will do.
>
>
>>> +static void mem_cgroup_may_async_reclaim(struct mem_cgroup *mem);
>>> +
>>> +/*
>>> ? * The memory controller data structure. The memory controller controls both
>>> ? * page cache and RSS per cgroup. We would eventually like to provide
>>> ? * statistics based on the statistics developed by Rik Van Riel for clock-pro,
>>> @@ -278,6 +303,12 @@ struct mem_cgroup {
>>> ? ? ? ?*/
>>> ? ? ? unsigned long ? move_charge_at_immigrate;
>>> ? ? ? /*
>>> + ? ? ?* Checks for async reclaim.
>>> + ? ? ?*/
>>> + ? ? unsigned long ? async_flags;
>>> +#define AUTO_ASYNC_ENABLED ? (0)
>>> +#define USE_AUTO_ASYNC ? ? ? ? ? ? ? (1)
>>
>> These are really confusing. ?I looked at the implementation and at the
>> documentation file and I'm still scratching my head. ?I can't work out
>> why they exist. ?With the amount of effort I put into it ;)
>>
>> Also, AUTO_ASYNC_ENABLED and USE_AUTO_ASYNC have practically the same
>> meaning, which doesn't help things.
>>
> Ah, yes it's confusing.

Sorry I was confused by the memory.async_control interface. I assume
that is the knob to turn on/off the bg reclaim on per-memcg basis. But
when I tried to turn it off, it seems not working well:

$ cat /proc/7248/cgroup
3:memory:/A

$ cat /dev/cgroup/memory/A/memory.async_control
0

Then i can see the kworkers start running when the memcg A under
memory pressure. There was no other memcgs configured under root.

$ cat /dev/cgroup/memory/memory.async_control
0

--Ying



>> Some careful description at this place in the code might help clear
>> things up.
>>
> yes, I'll fix and add text, consider better name.
>
>> Perhaps s/USE_AUTO_ASYNC/AUTO_ASYNC_IN_USE/ is what you meant.
>>
> Ah, good name :)
>
>>>
>>> ...
>>>
>>> +static void mem_cgroup_may_async_reclaim(struct mem_cgroup *mem)
>>> +{
>>> + ? ? if (!test_bit(USE_AUTO_ASYNC, &mem->async_flags))
>>> + ? ? ? ? ? ? return;
>>> + ? ? if (res_counter_margin(&mem->res) <= MEMCG_ASYNC_MARGIN) {
>>> + ? ? ? ? ? ? /* Fill here */
>>> + ? ? }
>>> +}
>>
>> I'd expect a function called foo_may_bar() to return a bool.
>>
> ok,
>
>> But given the lack of documentation and no-op implementation, I have o
>> idea what's happening here!
>>
> yes. Hmm, maybe adding an empty function here and comments on the
> function will make this better.
>
> Thank you for review.
> -Kame
>

2011-05-24 00:02:28

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 0/8] memcg: clean up, export swapiness

On Mon, 23 May 2011 10:26:22 -0700
Ying Han <[email protected]> wrote:

> Hi Kame:
>
> Is this patch part of the "memcg async reclaim v2" patchset?

yes, I failed to change title...

I am
> trying to do some tests on top of that, but having hard time finding
> the [PATCH 3/8] and [PATCH 5/8].
>
PATCH 5 is attached.

I think I can send a simplified v3 until this Friday and will not do update until
the end of LinuxCon Japan (I'm now writing slides ;). Of course, it's merge
window and I don't want to push any new feature to Andrew.

Sorry for my inconvenience.

Thanks,
-Kame

==
This patch adds a logic to keep usage margin to the limit in asynchronous way.
When the usage over some threshould (determined automatically), asynchronous
memory reclaim runs and shrink memory to limit - MEMCG_ASYNC_STOP_MARGIN.

By this, there will be no difference in total amount of usage of cpu to
scan the LRU but we'll have a chance to make use of wait time of applications
for freeing memory. For example, when an application read a file or socket,
to fill the newly alloated memory, it needs wait. Async reclaim can make use
of that time and give a chance to reduce latency by background works.

This patch only includes required hooks to trigger async reclaim. Core logics
will be in the following patches.

Changelog v1 -> v2:
- avoid async reclaim check when num_online_cpus() < 2.
- changed MEMCG_ASYNC_START_MARGIN to be 6 * HPAGE_SIZE.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Documentation/cgroups/memory.txt | 46 ++++++++++++++++++-
mm/memcontrol.c | 94 +++++++++++++++++++++++++++++++++++++++
2 files changed, 139 insertions(+), 1 deletion(-)

Index: mmotm-May11/mm/memcontrol.c
===================================================================
--- mmotm-May11.orig/mm/memcontrol.c
+++ mmotm-May11/mm/memcontrol.c
@@ -115,10 +115,12 @@ enum mem_cgroup_events_index {
enum mem_cgroup_events_target {
MEM_CGROUP_TARGET_THRESH,
MEM_CGROUP_TARGET_SOFTLIMIT,
+ MEM_CGROUP_TARGET_ASYNC,
MEM_CGROUP_NTARGETS,
};
#define THRESHOLDS_EVENTS_TARGET (128)
#define SOFTLIMIT_EVENTS_TARGET (1024)
+#define ASYNC_EVENTS_TARGET (512) /* assume x86-64's hpagesize */

struct mem_cgroup_stat_cpu {
long count[MEM_CGROUP_STAT_NSTATS];
@@ -211,6 +213,29 @@ static void mem_cgroup_threshold(struct
static void mem_cgroup_oom_notify(struct mem_cgroup *mem);

/*
+ * For example, with transparent hugepages, memory reclaim scan at hitting
+ * limit can very long as to reclaim HPAGE_SIZE of memory. This increases
+ * latency of page fault and may cause fallback. At usual page allocation,
+ * we'll see some (shorter) latency, too. To reduce latency, it's appreciated
+ * to free memory in background to make margin to the limit. This consumes
+ * cpu but we'll have a chance to make use of wait time of applications
+ * (read disk etc..) by asynchronous reclaim.
+ *
+ * This async reclaim tries to reclaim HPAGE_SIZE * 2 of pages when margin
+ * to the limit is smaller than HPAGE_SIZE * 2. This will be enabled
+ * automatically when the limit is set and it's greater than the threshold.
+ */
+#if HPAGE_SIZE != PAGE_SIZE
+#define MEMCG_ASYNC_LIMIT_THRESH (HPAGE_SIZE * 64)
+#define MEMCG_ASYNC_MARGIN (HPAGE_SIZE * 4)
+#else /* make the margin as 4M bytes */
+#define MEMCG_ASYNC_LIMIT_THRESH (128 * 1024 * 1024)
+#define MEMCG_ASYNC_MARGIN (8 * 1024 * 1024)
+#endif
+
+static void mem_cgroup_may_async_reclaim(struct mem_cgroup *mem);
+
+/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
* statistics based on the statistics developed by Rik Van Riel for clock-pro,
@@ -278,6 +303,12 @@ struct mem_cgroup {
*/
unsigned long move_charge_at_immigrate;
/*
+ * Checks for async reclaim.
+ */
+ unsigned long async_flags;
+#define AUTO_ASYNC_ENABLED (0)
+#define USE_AUTO_ASYNC (1)
+ /*
* percpu counter.
*/
struct mem_cgroup_stat_cpu *stat;
@@ -722,6 +753,9 @@ static void __mem_cgroup_target_update(s
case MEM_CGROUP_TARGET_SOFTLIMIT:
next = val + SOFTLIMIT_EVENTS_TARGET;
break;
+ case MEM_CGROUP_TARGET_ASYNC:
+ next = val + ASYNC_EVENTS_TARGET;
+ break;
default:
return;
}
@@ -745,6 +779,11 @@ static void memcg_check_events(struct me
__mem_cgroup_target_update(mem,
MEM_CGROUP_TARGET_SOFTLIMIT);
}
+ if (__memcg_event_check(mem, MEM_CGROUP_TARGET_ASYNC)) {
+ mem_cgroup_may_async_reclaim(mem);
+ __mem_cgroup_target_update(mem,
+ MEM_CGROUP_TARGET_ASYNC);
+ }
}
}

@@ -3365,6 +3404,23 @@ void mem_cgroup_print_bad_page(struct pa

static DEFINE_MUTEX(set_limit_mutex);

+/* When limit is changed, check async reclaim switch again */
+static void mem_cgroup_set_auto_async(struct mem_cgroup *mem, u64 val)
+{
+ if (!test_bit(AUTO_ASYNC_ENABLED, &mem->async_flags))
+ goto clear;
+ if (num_online_cpus() < 2)
+ goto clear;
+ if (val < MEMCG_ASYNC_LIMIT_THRESH)
+ goto clear;
+
+ set_bit(USE_AUTO_ASYNC, &mem->async_flags);
+ return;
+clear:
+ clear_bit(USE_AUTO_ASYNC, &mem->async_flags);
+ return;
+}
+
static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
unsigned long long val)
{
@@ -3413,6 +3469,7 @@ static int mem_cgroup_resize_limit(struc
memcg->memsw_is_minimum = true;
else
memcg->memsw_is_minimum = false;
+ mem_cgroup_set_auto_async(memcg, val);
}
mutex_unlock(&set_limit_mutex);

@@ -3590,6 +3647,15 @@ unsigned long mem_cgroup_soft_limit_recl
return nr_reclaimed;
}

+static void mem_cgroup_may_async_reclaim(struct mem_cgroup *mem)
+{
+ if (!test_bit(USE_AUTO_ASYNC, &mem->async_flags))
+ return;
+ if (res_counter_margin(&mem->res) <= MEMCG_ASYNC_MARGIN) {
+ /* Fill here */
+ }
+}
+
/*
* This routine traverse page_cgroup in given list and drop them all.
* *And* this routine doesn't reclaim page itself, just removes page_cgroup.
@@ -4149,6 +4215,29 @@ static int mem_control_stat_show(struct
return 0;
}

+static u64 mem_cgroup_async_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+
+ return mem->async_flags;
+}
+
+static int
+mem_cgroup_async_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+
+ if (val & (1 << AUTO_ASYNC_ENABLED))
+ set_bit(AUTO_ASYNC_ENABLED, &mem->async_flags);
+ else
+ clear_bit(AUTO_ASYNC_ENABLED, &mem->async_flags);
+
+ val = res_counter_read_u64(&mem->res, RES_LIMIT);
+ mem_cgroup_set_auto_async(mem, val);
+ return 0;
+}
+
+
static u64 mem_cgroup_swappiness_read(struct cgroup *cgrp, struct cftype *cft)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
@@ -4580,6 +4669,11 @@ static struct cftype mem_cgroup_files[]
.unregister_event = mem_cgroup_oom_unregister_event,
.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
},
+ {
+ .name = "async_control",
+ .read_u64 = mem_cgroup_async_read,
+ .write_u64 = mem_cgroup_async_write,
+ },
};

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
Index: mmotm-May11/Documentation/cgroups/memory.txt
===================================================================
--- mmotm-May11.orig/Documentation/cgroups/memory.txt
+++ mmotm-May11/Documentation/cgroups/memory.txt
@@ -70,6 +70,7 @@ Brief summary of control files.
(See sysctl's vm.swappiness)
memory.move_charge_at_immigrate # set/show controls of moving charges
memory.oom_control # set/show oom controls.
+ memory.async_control # set control for asynchronous memory reclaim

1. History

@@ -664,7 +665,50 @@ At reading, current status of OOM is sho
under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
be stopped.)

-11. TODO
+11. Asynchronous memory reclaim
+
+In some kind of applications which uses many file caches, once memory cgroup
+hit limit, following allocation of pages will hit limit again and the
+application may see huge latency because of memory reclaim.
+
+Memory cgroup provides a method for asynchronous memory reclaim for freeing
+memory before hitting limit. By this, some class of application can avoid
+memory reclaim latency effectively and show good performance. For example,
+if an application reads data from files bigger than limit, freeing memory
+in asynchrnous will reduce latency of read. But please note, even if
+latency decreased, the amount of total usage of CPU is unchanged. So,
+asynchronous memory reclaim works effectively only when you have extra unused
+CPU, applications tend to sleep. So, this feature only works on SMP.
+
+So, if you see this feature doesn't help your application, please let it
+turned off.
+
+
+11.1 memory.async_control
+
+memory.async_control is a control for asynchronous memory reclaim and
+represented as bitmask of controls.
+
+ bit 0 ....user control of automatic asynchronous memory reclaim(see below)
+ bit 1 ....indicate automatic asynchronous memory reclaim is really used.
+
+ * Automatic asynchronous memory reclaim is a feature to free pages to
+ some extent below the limit in background. When this runs, applications
+ can reduce latency at hit limit. (but please note, background reclaim
+ use cpu.)
+
+ This feature can be enabled by
+
+ echo 1 > memory.async_control
+
+ If successfully enabled, bit 1 of memory.async_control is set. Bit 1 may
+ not be set when the number of cpu is 1 or when the limit is too small.
+
+ Note: This feature is not propageted to childrens in automatic. This
+ may be conservative but required limitation to avoid using too much
+ cpus.
+
+12. TODO

1. Add support for accounting huge pages (as a separate controller)
2. Make per-cgroup scanner reclaim not-shared pages first

2011-05-24 00:18:07

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 6/8] memcg asynchronous memory reclaim interface

On Mon, 23 May 2011 16:36:20 -0700
Ying Han <[email protected]> wrote:

> On Fri, May 20, 2011 at 4:56 PM, Hiroyuki Kamezawa
> <[email protected]> wrote:
> > 2011/5/21 Andrew Morton <[email protected]>:
> >> On Fri, 20 May 2011 12:46:36 +0900
> >> KAMEZAWA Hiroyuki <[email protected]> wrote:
> >>
> >>> This patch adds a logic to keep usage margin to the limit in asynchronous way.
> >>> When the usage over some threshould (determined automatically), asynchronous
> >>> memory reclaim runs and shrink memory to limit - MEMCG_ASYNC_STOP_MARGIN.
> >>>
> >>> By this, there will be no difference in total amount of usage of cpu to
> >>> scan the LRU
> >>
> >> This is not true if "don't writepage at all (revisit this when
> >> dirty_ratio comes.)" is true.  Skipping over dirty pages can cause
> >> larger amounts of CPU consumption.
> >>
> >>> but we'll have a chance to make use of wait time of applications
> >>> for freeing memory. For example, when an application read a file or socket,
> >>> to fill the newly alloated memory, it needs wait. Async reclaim can make use
> >>> of that time and give a chance to reduce latency by background works.
> >>>
> >>> This patch only includes required hooks to trigger async reclaim and user interfaces.
> >>> Core logics will be in the following patches.
> >>>
> >>>
> >>> ...
> >>>
> >>>  /*
> >>> + * For example, with transparent hugepages, memory reclaim scan at hitting
> >>> + * limit can very long as to reclaim HPAGE_SIZE of memory. This increases
> >>> + * latency of page fault and may cause fallback. At usual page allocation,
> >>> + * we'll see some (shorter) latency, too. To reduce latency, it's appreciated
> >>> + * to free memory in background to make margin to the limit. This consumes
> >>> + * cpu but we'll have a chance to make use of wait time of applications
> >>> + * (read disk etc..) by asynchronous reclaim.
> >>> + *
> >>> + * This async reclaim tries to reclaim HPAGE_SIZE * 2 of pages when margin
> >>> + * to the limit is smaller than HPAGE_SIZE * 2. This will be enabled
> >>> + * automatically when the limit is set and it's greater than the threshold.
> >>> + */
> >>> +#if HPAGE_SIZE != PAGE_SIZE
> >>> +#define MEMCG_ASYNC_LIMIT_THRESH      (HPAGE_SIZE * 64)
> >>> +#define MEMCG_ASYNC_MARGIN         (HPAGE_SIZE * 4)
> >>> +#else /* make the margin as 4M bytes */
> >>> +#define MEMCG_ASYNC_LIMIT_THRESH      (128 * 1024 * 1024)
> >>> +#define MEMCG_ASYNC_MARGIN            (8 * 1024 * 1024)
> >>> +#endif
> >>
> >> Document them, please.  How are they used, what are their units.
> >>
> >
> > will do.
> >
> >
> >>> +static void mem_cgroup_may_async_reclaim(struct mem_cgroup *mem);
> >>> +
> >>> +/*
> >>>   * The memory controller data structure. The memory controller controls both
> >>>   * page cache and RSS per cgroup. We would eventually like to provide
> >>>   * statistics based on the statistics developed by Rik Van Riel for clock-pro,
> >>> @@ -278,6 +303,12 @@ struct mem_cgroup {
> >>>        */
> >>>       unsigned long   move_charge_at_immigrate;
> >>>       /*
> >>> +      * Checks for async reclaim.
> >>> +      */
> >>> +     unsigned long   async_flags;
> >>> +#define AUTO_ASYNC_ENABLED   (0)
> >>> +#define USE_AUTO_ASYNC               (1)
> >>
> >> These are really confusing.  I looked at the implementation and at the
> >> documentation file and I'm still scratching my head.  I can't work out
> >> why they exist.  With the amount of effort I put into it ;)
> >>
> >> Also, AUTO_ASYNC_ENABLED and USE_AUTO_ASYNC have practically the same
> >> meaning, which doesn't help things.
> >>
> > Ah, yes it's confusing.
>
> Sorry I was confused by the memory.async_control interface. I assume
> that is the knob to turn on/off the bg reclaim on per-memcg basis. But
> when I tried to turn it off, it seems not working well:
>
> $ cat /proc/7248/cgroup
> 3:memory:/A
>
> $ cat /dev/cgroup/memory/A/memory.async_control
> 0
>

If enabled and kworker runs, this shows "3", for now.
I'll make this simpler in the next post.

> Then i can see the kworkers start running when the memcg A under
> memory pressure. There was no other memcgs configured under root.


What kworkers ? For example, many kworkers runs on ext4? on my host.
If kworker/u:x works, it may be for memcg (for my host)

Ok, I'll add statistics in v3.

Thanks,
-Kame

2011-05-24 00:26:16

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 0/8] memcg async reclaim v2

On Mon, 23 May 2011 15:38:31 -0700
Ying Han <[email protected]> wrote:

> Hi Kame:
>
> I applied and tested the patchset on top of mmotm-2011-05-12-15-52. I
> admit that I didn't look the patch closely yet, which I plan to do
> next. Now i have few quick questions based on the testing result:
>
> Test:
> 1) create a 2g memcg and enable async_control
> $ mkdir /dev/cgroup/memory/A
> $ echo 2g >/dev/cgroup/memory/A/memory.limit_in_bytes
> $ echo 1 >/dev/cgroup/memory/A/memory.async_control
>
> 2) read a 20g file in the memcg
> $ echo $$ >/dev/cgroup/memory/A/tasks
> $ time cat /export/hdc3/dd_A/tf0 > /dev/zero
>
> real 4m26.677s
> user 0m0.222s
> sys 0m28.481s
>
> Here are the questions:
>
> 1. I monitored the "top" while the test is running. The amount of
> cputime the kworkers take worries me, and the following top output
> stays pretty consistent while the "cat" is running/
>

memcg-async's kworker is kworker/u:x .....because of UNBOUND_WQ.
Then, kworker you see is for other purpose....Hmm, from trace log,
most of them are for "draining" per-cpu memcg cache. I'll prepare a patch.




> Tasks: 152 total, 2 running, 150 sleeping, 0 stopped, 0 zombie
> Cpu(s): 0.1%us, 1.2%sy, 0.0%ni, 87.6%id, 10.6%wa, 0.0%hi, 0.5%si, 0.0%st
> Mem: 32963480k total, 2694728k used, 30268752k free, 3888k buffers
> Swap: 0k total, 0k used, 0k free, 2316500k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 389 root 20 0 0 0 0 R 45 0.0 1:36.24
> kworker/3:1
> 23127 root 20 0 0 0 0 S 44 0.0 0:13.44
> kworker/4:2
> 393 root 20 0 0 0 0 S 43 0.0 2:02.28
> kworker/7:1
> 32 root 20 0 0 0 0 S 42 0.0 1:54.02
> kworker/6:0
> 1230 root 20 0 0 0 0 S 42 0.0 1:22.01
> kworker/2:2
> 23130 root 20 0 0 0 0 S 31 0.0 0:04.04
> kworker/0:2
> 391 root 20 0 0 0 0 S 22 0.0 1:45.79
> kworker/5:1
> 23109 root 20 0 3104 228 180 D 10 0.0 0:08.56 cat
>
> I attached the tracing output of the kworkers while they are running
> by doing the following:
>
> $ mount -t debugfs nodev /sys/kernel/debug/
> $ echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
> $ cat /sys/kernel/debug/tracing/trace_pipe > out.txt
>
> 2. I can not justify the cputime on the kworkers. I am looking for the
> patch which we exports the time before and after workitem on memcg
> basis. I recall we have that in previous post, sorry I missed that
> patch somehere.
>
> # cat /cgroup/memory/A/memory.stat
> ....
> direct_elapsed_ns 0
> wmark_elapsed_ns 103566424
> direct_scanned 0
> wmark_scanned 29303
> direct_freed 0
> wmark_freed 29290
>

I didn't include this for this version because you and others working on
memory.stat file. I wanted to avoid to add new mess ;)
I'll include it again in v3.



> 3. Here is the outout of memory.stat after the test, the last one is
> the memory.failcnt. As far as I remember, the failcnt is far higher
> than the result i got on previous testing (per-memcg-per-kswapd
> patch). This is all clean file pages which shouldn't be hard to
> reclaim.
>
> cache 2147151872
> rss 94208
> mapped_file 0
> pgpgin 5242945
> pgpgout 4718715
> pgfault 274
> pgmajfault 0
> 1050041
>
> Please let me know if the current version isn't ready for testing, and
> I will wait :)
>

This version has tweaked to be less cpu hogging than previous one. So,
hit_limit increases. I'll drop some tweakes I added in v2 for starting from
a simple one.

I'll post v3 in this week. But if dirty_ratio is ready, I think it should be
merged 1st. But it's merge window....

Thanks,
-Kame



2011-05-24 00:26:36

by Ying Han

[permalink] [raw]
Subject: Re: [PATCH 6/8] memcg asynchronous memory reclaim interface

On Mon, May 23, 2011 at 5:11 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Mon, 23 May 2011 16:36:20 -0700
> Ying Han <[email protected]> wrote:
>
>> On Fri, May 20, 2011 at 4:56 PM, Hiroyuki Kamezawa
>> <[email protected]> wrote:
>> > 2011/5/21 Andrew Morton <[email protected]>:
>> >> On Fri, 20 May 2011 12:46:36 +0900
>> >> KAMEZAWA Hiroyuki <[email protected]> wrote:
>> >>
>> >>> This patch adds a logic to keep usage margin to the limit in asynchronous way.
>> >>> When the usage over some threshould (determined automatically), asynchronous
>> >>> memory reclaim runs and shrink memory to limit - MEMCG_ASYNC_STOP_MARGIN.
>> >>>
>> >>> By this, there will be no difference in total amount of usage of cpu to
>> >>> scan the LRU
>> >>
>> >> This is not true if "don't writepage at all (revisit this when
>> >> dirty_ratio comes.)" is true. ?Skipping over dirty pages can cause
>> >> larger amounts of CPU consumption.
>> >>
>> >>> but we'll have a chance to make use of wait time of applications
>> >>> for freeing memory. For example, when an application read a file or socket,
>> >>> to fill the newly alloated memory, it needs wait. Async reclaim can make use
>> >>> of that time and give a chance to reduce latency by background works.
>> >>>
>> >>> This patch only includes required hooks to trigger async reclaim and user interfaces.
>> >>> Core logics will be in the following patches.
>> >>>
>> >>>
>> >>> ...
>> >>>
>> >>> ?/*
>> >>> + * For example, with transparent hugepages, memory reclaim scan at hitting
>> >>> + * limit can very long as to reclaim HPAGE_SIZE of memory. This increases
>> >>> + * latency of page fault and may cause fallback. At usual page allocation,
>> >>> + * we'll see some (shorter) latency, too. To reduce latency, it's appreciated
>> >>> + * to free memory in background to make margin to the limit. This consumes
>> >>> + * cpu but we'll have a chance to make use of wait time of applications
>> >>> + * (read disk etc..) by asynchronous reclaim.
>> >>> + *
>> >>> + * This async reclaim tries to reclaim HPAGE_SIZE * 2 of pages when margin
>> >>> + * to the limit is smaller than HPAGE_SIZE * 2. This will be enabled
>> >>> + * automatically when the limit is set and it's greater than the threshold.
>> >>> + */
>> >>> +#if HPAGE_SIZE != PAGE_SIZE
>> >>> +#define MEMCG_ASYNC_LIMIT_THRESH ? ? ?(HPAGE_SIZE * 64)
>> >>> +#define MEMCG_ASYNC_MARGIN ? ? ? ? (HPAGE_SIZE * 4)
>> >>> +#else /* make the margin as 4M bytes */
>> >>> +#define MEMCG_ASYNC_LIMIT_THRESH ? ? ?(128 * 1024 * 1024)
>> >>> +#define MEMCG_ASYNC_MARGIN ? ? ? ? ? ?(8 * 1024 * 1024)
>> >>> +#endif
>> >>
>> >> Document them, please. ?How are they used, what are their units.
>> >>
>> >
>> > will do.
>> >
>> >
>> >>> +static void mem_cgroup_may_async_reclaim(struct mem_cgroup *mem);
>> >>> +
>> >>> +/*
>> >>> ? * The memory controller data structure. The memory controller controls both
>> >>> ? * page cache and RSS per cgroup. We would eventually like to provide
>> >>> ? * statistics based on the statistics developed by Rik Van Riel for clock-pro,
>> >>> @@ -278,6 +303,12 @@ struct mem_cgroup {
>> >>> ? ? ? ?*/
>> >>> ? ? ? unsigned long ? move_charge_at_immigrate;
>> >>> ? ? ? /*
>> >>> + ? ? ?* Checks for async reclaim.
>> >>> + ? ? ?*/
>> >>> + ? ? unsigned long ? async_flags;
>> >>> +#define AUTO_ASYNC_ENABLED ? (0)
>> >>> +#define USE_AUTO_ASYNC ? ? ? ? ? ? ? (1)
>> >>
>> >> These are really confusing. ?I looked at the implementation and at the
>> >> documentation file and I'm still scratching my head. ?I can't work out
>> >> why they exist. ?With the amount of effort I put into it ;)
>> >>
>> >> Also, AUTO_ASYNC_ENABLED and USE_AUTO_ASYNC have practically the same
>> >> meaning, which doesn't help things.
>> >>
>> > Ah, yes it's confusing.
>>
>> Sorry I was confused by the memory.async_control interface. I assume
>> that is the knob to turn on/off the bg reclaim on per-memcg basis. But
>> when I tried to turn it off, it seems not working well:
>>
>> $ cat /proc/7248/cgroup
>> 3:memory:/A
>>
>> $ cat /dev/cgroup/memory/A/memory.async_control
>> 0
>>
>
> If enabled and kworker runs, this shows "3", for now.
> I'll make this simpler in the next post.
>
>> Then i can see the kworkers start running when the memcg A under
>> memory pressure. There was no other memcgs configured under root.
>
>
> What kworkers ? For example, many kworkers runs on ext4? on my host.
> If kworker/u:x works, it may be for memcg (for my host)

I am kind of sure they are kworkers from memcg. They start running
right after my test and then stop when i kill that test.

$ cat /dev/cgroup/memory/A/memory.limit_in_bytes
2147483648
$ cat /dev/cgroup/memory/A/memory.async_control
0


PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
393 root 20 0 0 0 0 S 54 0.0 1:30.36
kworker/7:1
391 root 20 0 0 0 0 S 51 0.0 1:42.35
kworker/5:1
390 root 20 0 0 0 0 S 43 0.0 1:45.55
kworker/4:1
11 root 20 0 0 0 0 S 40 0.0 1:36.98
kworker/1:0
14 root 20 0 0 0 0 S 36 0.0 1:47.04
kworker/0:1
389 root 20 0 0 0 0 S 24 0.0 0:47.35
kworker/3:1
20071 root 20 0 20.0g 497m 497m D 12 1.5 0:04.99 memtoy
392 root 20 0 0 0 0 S 10 0.0 1:26.43
kworker/6:1

--Ying

>
> Ok, I'll add statistics in v3.
>
> Thanks,
> -Kame
>
>