2009-10-02 04:57:50

by Kamezawa Hiroyuki

Subject: [PATCH 0/2] memcg: improving scalability by reducing lock contention at charge/uncharge

Hi,

This patch is against mmotm + softlimit fix patches.
(which are now in -rc git tree.)

In the latest -rc series, the kernel avoids accessing res_counter when
the cgroup is the root cgroup. This helps scalability when memcg is not used.

It's also necessary to improve scalability when memcg is actually used, and
this patch set is for that. Balbir's previous work shows that the biggest
obstacle to better scalability is memcg's res_counter. Then, there are 2 ways:

(1) make the counter itself scale well.
(2) avoid accessing the core counter as much as possible.

My first direction was (1). But no: there is no counter which is free
from false sharing when it needs system-wide fine-grained synchronization.
And res_counter carries several pieces of functionality... this makes (1)
difficult. A spin_lock (in the slow path) around the counter means tons of
cache-line invalidation even when we just access the counter without modifying it.

This patch series is for (2). It implements charge/uncharge in a batched
manner, coalescing accesses to res_counter at charge/uncharge by exploiting
the natural locality of access.

Tested for a month, and I got good reports from Balbir and Nishimura, thanks.
One concern is that this adds some members to the bottom of task_struct.
A better idea is welcome.

Following is the test result of continuous page faults on my 8-cpu box (x86-64).

A loop like this runs on all cpus in parallel for 60 secs.
==
while (1) {
x = mmap(NULL, MEGA, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);

for (off = 0; off < MEGA; off += PAGE_SIZE)
x[off]=0;
munmap(x, MEGA);
}
==
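For reference, here is a self-contained version of that loop as a compilable sketch. This is my reconstruction, not the original pagefault.c attached later in this thread; the 1 MiB size, the fd/error handling, and leaving the 60-second duration and per-cpu parallelism to a wrapper such as runpause.sh are assumptions.
==
/* Sketch of the fault loop above (not the original pagefault.c). */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

#define MEGA (1024UL * 1024UL)

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);

	while (1) {
		/* fresh anonymous mapping each round */
		char *x = mmap(NULL, MEGA, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (x == MAP_FAILED)
			return 1;
		/* touch every page: one minor fault (and one memcg charge) each */
		for (unsigned long off = 0; off < MEGA; off += page_size)
			x[off] = 0;
		/* unmap: the uncharges happen here */
		munmap(x, MEGA);
	}
	return 0;
}
==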
Please see the number of page faults. I think this is a good improvement.


[Before]
Performance counter stats for './runpause.sh' (5 runs):

474539.756944 task-clock-msecs # 7.890 CPUs ( +- 0.015% )
10284 context-switches # 0.000 M/sec ( +- 0.156% )
12 CPU-migrations # 0.000 M/sec ( +- 0.000% )
18425800 page-faults # 0.039 M/sec ( +- 0.107% )
1486296285360 cycles # 3132.080 M/sec ( +- 0.029% )
380334406216 instructions # 0.256 IPC ( +- 0.058% )
3274206662 cache-references # 6.900 M/sec ( +- 0.453% )
1272947699 cache-misses # 2.682 M/sec ( +- 0.118% )

60.147907341 seconds time elapsed ( +- 0.010% )

[After]
Performance counter stats for './runpause.sh' (5 runs):

474658.997489 task-clock-msecs # 7.891 CPUs ( +- 0.006% )
10250 context-switches # 0.000 M/sec ( +- 0.020% )
11 CPU-migrations # 0.000 M/sec ( +- 0.000% )
33177858 page-faults # 0.070 M/sec ( +- 0.152% )
1485264748476 cycles # 3129.120 M/sec ( +- 0.021% )
409847004519 instructions # 0.276 IPC ( +- 0.123% )
3237478723 cache-references # 6.821 M/sec ( +- 0.574% )
1182572827 cache-misses # 2.491 M/sec ( +- 0.179% )

60.151786309 seconds time elapsed ( +- 0.014% )

Regards,
-Kame


2009-10-02 05:03:43

by Kamezawa Hiroyuki

Subject: [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation


In a massively parallel environment, res_counter can be a performance
bottleneck. One strong technique to reduce lock contention is to reduce the
number of calls by coalescing several calls into one.

Considering charge/uncharge characteristics,
- charge is done one by one via demand paging.
- uncharge is done either
  - in chunks at munmap, truncate, exit, execve...
  - or one by one via vmscan/paging.

It seems we have a chance to coalesce uncharges at unmap/truncation.

This patch is for coalescing uncharge. To avoid scattering memcg's
structures into functions under mm/, this patch adds memcg batch uncharge
information to the task.

The degree of coalescing depends on the callers:
- at invalidate/truncate... the pagevec size
- at unmap... ZAP_BLOCK_SIZE
(memory itself is freed at this granularity.)
So we will not coalesce too much.
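As a purely illustrative userspace analogue of this caller pattern (my sketch with made-up names and an atomic counter standing in for res_counter, not the kernel code in the hunks below): the bracketed section accumulates uncharges locally and touches the shared counter only once at the end.
==
/* Userspace analogue of batched uncharge (illustration only). */
#include <stdatomic.h>

#define PAGE_SIZE 4096UL

static atomic_ulong usage;		/* stands in for the shared res_counter */

static _Thread_local struct {
	int do_batch;			/* nesting depth, like memcg_batch.do_batch */
	unsigned long bytes;		/* locally accumulated uncharge */
} batch;

static void uncharge_start(void)
{
	if (!batch.do_batch)
		batch.bytes = 0;
	batch.do_batch++;
}

static void uncharge_one_page(void)
{
	if (batch.do_batch)
		batch.bytes += PAGE_SIZE;		/* defer: no shared-counter access */
	else
		atomic_fetch_sub(&usage, PAGE_SIZE);	/* direct uncharge */
}

static void uncharge_end(void)
{
	if (!batch.do_batch)
		return;
	if (--batch.do_batch)
		return;					/* still nested */
	if (batch.bytes)
		atomic_fetch_sub(&usage, batch.bytes);	/* one shared update */
}

int main(void)
{
	atomic_store(&usage, 64 * PAGE_SIZE);
	uncharge_start();
	for (int i = 0; i < 16; i++)
		uncharge_one_page();	/* no shared-counter traffic here */
	uncharge_end();			/* single subtraction of 16 pages */
	return 0;
}
==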

Changelog(now):
- rebased onto the latest mmotm + softlimit fix patches.

Changelog(old):
- unified patch for callers
- added comments.
- made ->do_batch a bool.
- removed css_get() et al. We don't need it.


Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/memcontrol.h | 13 ++++++
include/linux/sched.h | 7 +++
mm/memcontrol.c | 91 ++++++++++++++++++++++++++++++++++++++++++---
mm/memory.c | 2
mm/truncate.c | 6 ++
5 files changed, 113 insertions(+), 6 deletions(-)

Index: mmotm-2.6.31-Sep28/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.31-Sep28.orig/include/linux/memcontrol.h
+++ mmotm-2.6.31-Sep28/include/linux/memcontrol.h
@@ -54,6 +54,11 @@ extern void mem_cgroup_rotate_lru_list(s
extern void mem_cgroup_del_lru(struct page *page);
extern void mem_cgroup_move_lists(struct page *page,
enum lru_list from, enum lru_list to);
+
+/* For coalescing uncharge for reducing memcg' overhead*/
+extern void mem_cgroup_uncharge_start(void);
+extern void mem_cgroup_uncharge_end(void);
+
extern void mem_cgroup_uncharge_page(struct page *page);
extern void mem_cgroup_uncharge_cache_page(struct page *page);
extern int mem_cgroup_shmem_charge_fallback(struct page *page,
@@ -151,6 +156,14 @@ static inline void mem_cgroup_cancel_cha
{
}

+static inline void mem_cgroup_uncharge_batch_start(void)
+{
+}
+
+static inline void mem_cgroup_uncharge_batch_start(void)
+{
+}
+
static inline void mem_cgroup_uncharge_page(struct page *page)
{
}
Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep28/mm/memcontrol.c
@@ -1826,6 +1826,49 @@ void mem_cgroup_cancel_charge_swapin(str
css_put(&mem->css);
}

+static void
+__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
+{
+ struct memcg_batch_info *batch = NULL;
+ bool uncharge_memsw = true;
+ /* If swapout, usage of swap doesn't decrease */
+ if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
+ uncharge_memsw = false;
+ /*
+ * do_batch > 0 when unmapping pages or inode invalidate/truncate.
+ * In those cases, all pages freed continously can be expected to be in
+ * the same cgroup and we have chance to coalesce uncharges.
+ * And, we do uncharge one by one if this is killed by OOM.
+ */
+ if (!current->memcg_batch.do_batch || test_thread_flag(TIF_MEMDIE))
+ goto direct_uncharge;
+
+ batch = &current->memcg_batch;
+ /*
+ * In usual, we do css_get() when we remember memcg pointer.
+ * But in this case, we keep res->usage until end of a series of
+ * uncharges. Then, it's ok to ignore memcg's refcnt.
+ */
+ if (!batch->memcg)
+ batch->memcg = mem;
+ /*
+ * In typical case, batch->memcg == mem. This means we can
+ * merge a series of uncharges to an uncharge of res_counter.
+ * If not, we uncharge res_counter ony by one.
+ */
+ if (batch->memcg != mem)
+ goto direct_uncharge;
+ /* remember freed charge and uncharge it later */
+ batch->pages += PAGE_SIZE;
+ if (uncharge_memsw)
+ batch->memsw += PAGE_SIZE;
+ return;
+direct_uncharge:
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ if (uncharge_memsw)
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ return;
+}

/*
* uncharge if !page_mapped(page)
@@ -1874,12 +1917,8 @@ __mem_cgroup_uncharge_common(struct page
break;
}

- if (!mem_cgroup_is_root(mem)) {
- res_counter_uncharge(&mem->res, PAGE_SIZE);
- if (do_swap_account &&
- (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
- res_counter_uncharge(&mem->memsw, PAGE_SIZE);
- }
+ if (!mem_cgroup_is_root(mem))
+ __do_uncharge(mem, ctype);
if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
mem_cgroup_swap_statistics(mem, true);
mem_cgroup_charge_statistics(mem, pc, false);
@@ -1925,6 +1964,46 @@ void mem_cgroup_uncharge_cache_page(stru
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
}

+/*
+ * batch_start/batch_end is called in unmap_page_range/invlidate/trucate.
+ * In that cases, pages are freed continuously and we can expect pages
+ * are in the same memcg. All these calls itself limits the number of
+ * pages freed at once, then uncharge_start/end() is called properly.
+ */
+
+void mem_cgroup_uncharge_start(void)
+{
+ if (!current->memcg_batch.do_batch) {
+ current->memcg_batch.memcg = NULL;
+ current->memcg_batch.pages = 0;
+ current->memcg_batch.memsw = 0;
+ }
+ current->memcg_batch.do_batch++;
+}
+
+void mem_cgroup_uncharge_end(void)
+{
+ struct mem_cgroup *mem;
+
+ if (!current->memcg_batch.do_batch)
+ return;
+
+ current->memcg_batch.do_batch--;
+ if (current->memcg_batch.do_batch) /* Nested ? */
+ return;
+
+ mem = current->memcg_batch.memcg;
+ if (!mem)
+ return;
+ /* This "mem" is valid bacause we hide charges behind us. */
+ if (current->memcg_batch.pages)
+ res_counter_uncharge(&mem->res, current->memcg_batch.pages);
+ if (current->memcg_batch.memsw)
+ res_counter_uncharge(&mem->memsw, current->memcg_batch.memsw);
+ /* Not necessary. but forget this pointer */
+ current->memcg_batch.memcg = NULL;
+}
+
#ifdef CONFIG_SWAP
/*
* called after __delete_from_swap_cache() and drop "page" account.
Index: mmotm-2.6.31-Sep28/include/linux/sched.h
===================================================================
--- mmotm-2.6.31-Sep28.orig/include/linux/sched.h
+++ mmotm-2.6.31-Sep28/include/linux/sched.h
@@ -1549,6 +1549,13 @@ struct task_struct {
unsigned long trace_recursion;
#endif /* CONFIG_TRACING */
unsigned long stack_start;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
+ struct memcg_batch_info {
+ int do_batch;
+ struct mem_cgroup *memcg;
+ long pages, memsw;
+ } memcg_batch;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
Index: mmotm-2.6.31-Sep28/mm/memory.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/mm/memory.c
+++ mmotm-2.6.31-Sep28/mm/memory.c
@@ -940,6 +940,7 @@ static unsigned long unmap_page_range(st
details = NULL;

BUG_ON(addr >= end);
+ mem_cgroup_uncharge_start();
tlb_start_vma(tlb, vma);
pgd = pgd_offset(vma->vm_mm, addr);
do {
@@ -952,6 +953,7 @@ static unsigned long unmap_page_range(st
zap_work, details);
} while (pgd++, addr = next, (addr != end && *zap_work > 0));
tlb_end_vma(tlb, vma);
+ mem_cgroup_uncharge_end();

return addr;
}
Index: mmotm-2.6.31-Sep28/mm/truncate.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/mm/truncate.c
+++ mmotm-2.6.31-Sep28/mm/truncate.c
@@ -272,6 +272,7 @@ void truncate_inode_pages_range(struct a
pagevec_release(&pvec);
break;
}
+ mem_cgroup_uncharge_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];

@@ -286,6 +287,7 @@ void truncate_inode_pages_range(struct a
unlock_page(page);
}
pagevec_release(&pvec);
+ mem_cgroup_uncharge_end();
}
}
EXPORT_SYMBOL(truncate_inode_pages_range);
@@ -327,6 +329,7 @@ unsigned long invalidate_mapping_pages(s
pagevec_init(&pvec, 0);
while (next <= end &&
pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+ mem_cgroup_uncharge_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
pgoff_t index;
@@ -354,6 +357,7 @@ unsigned long invalidate_mapping_pages(s
break;
}
pagevec_release(&pvec);
+ mem_cgroup_uncharge_end();
cond_resched();
}
return ret;
@@ -428,6 +432,7 @@ int invalidate_inode_pages2_range(struct
while (next <= end && !wrapped &&
pagevec_lookup(&pvec, mapping, next,
min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+ mem_cgroup_uncharge_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
pgoff_t page_index;
@@ -477,6 +482,7 @@ int invalidate_inode_pages2_range(struct
unlock_page(page);
}
pagevec_release(&pvec);
+ mem_cgroup_uncharge_end();
cond_resched();
}
return ret;

2009-10-02 05:05:57

by Kamezawa Hiroyuki

Subject: [PATCH 2/2] memcg: coalescing charges per cpu

From: KAMEZAWA Hiroyuki <[email protected]>

This is a patch for coalescing access to res_counter at charge time by
percpu caching. At charge, memcg charges 64 pages at once and remembers the
surplus in a percpu cache. Because it's a cache, drain/flush is done when
necessary.

This version uses a public percpu area. There are 2 benefits to using a
public percpu area:
1. The sum of stocked charges in the system is limited by the number of cpus,
   not by the number of memcgs. This gives better synchronization behaviour.
2. The drain code for flush/cpu hotplug is very easy (and quick).

The most important point of this patch is that we never touch res_counter
in the fast path. res_counter is a system-wide shared counter which is
modified very frequently; we should touch it as little as we can to avoid
false sharing.
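As a userspace analogue of the stock idea (my sketch with made-up names, not the kernel code below): a thread pre-charges a whole chunk against the shared counter, satisfies page-sized charges from its private stock, and goes back to the shared counter only when the stock runs dry or is drained.
==
/* Userspace analogue of the percpu charge stock (illustration only). */
#include <stdatomic.h>
#include <stdbool.h>

#define PAGE_SIZE	4096L
#define CHARGE_SIZE	(32 * PAGE_SIZE)	/* the same "32 pages" magic as the patch */

static atomic_long limit_left;			/* shared: bytes still chargeable */
static _Thread_local long stock;		/* private: pre-charged bytes */

static bool charge_shared(long bytes)
{
	if (atomic_fetch_sub(&limit_left, bytes) >= bytes)
		return true;			/* enough room was left */
	atomic_fetch_add(&limit_left, bytes);	/* over the limit, undo */
	return false;
}

/* Charge one page; hits the shared counter at most once per CHARGE_SIZE. */
static bool charge_page(void)
{
	if (stock >= PAGE_SIZE) {		/* like consume_stock() */
		stock -= PAGE_SIZE;
		return true;
	}
	if (charge_shared(CHARGE_SIZE)) {	/* like refill_stock() */
		stock += CHARGE_SIZE - PAGE_SIZE;
		return true;
	}
	return charge_shared(PAGE_SIZE);	/* reduce request size and retry */
}

/* Like drain_stock(): give the unused pre-charge back to the shared counter. */
static void drain_stock(void)
{
	atomic_fetch_add(&limit_left, stock);
	stock = 0;
}

int main(void)
{
	atomic_store(&limit_left, 1024 * PAGE_SIZE);
	for (int i = 0; i < 100; i++)
		charge_page();
	drain_stock();
	return 0;
}
==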

Changelog (new):
- rebased onto the latest mmotm
Changelog (old):
- moved the charge size check before the __GFP_WAIT check for avoiding unnecessary
- added asynchronous flush routine.
- fixed bugs pointed out by Nishimura-san.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 126 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 120 insertions(+), 6 deletions(-)

Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep28/mm/memcontrol.c
@@ -38,6 +38,7 @@
#include <linux/vmalloc.h>
#include <linux/mm_inline.h>
#include <linux/page_cgroup.h>
+#include <linux/cpu.h>
#include "internal.h"

#include <asm/uaccess.h>
@@ -275,6 +276,7 @@ enum charge_type {
static void mem_cgroup_get(struct mem_cgroup *mem);
static void mem_cgroup_put(struct mem_cgroup *mem);
static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
+static void drain_all_stock_async(void);

static struct mem_cgroup_per_zone *
mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
@@ -1137,6 +1139,8 @@ static int mem_cgroup_hierarchical_recla
victim = mem_cgroup_select_victim(root_mem);
if (victim == root_mem) {
loop++;
+ if (loop >= 1)
+ drain_all_stock_async();
if (loop >= 2) {
/*
* If we have not been able to reclaim
@@ -1258,6 +1262,103 @@ done:
unlock_page_cgroup(pc);
}

+/* size of first charge trial. "32" comes from vmscan.c's magic value */
+#define CHARGE_SIZE (32 * PAGE_SIZE)
+struct memcg_stock_pcp {
+ struct mem_cgroup *cached;
+ int charge;
+ struct work_struct work;
+};
+static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
+static DEFINE_MUTEX(memcg_drain_mutex);
+
+static bool consume_stock(struct mem_cgroup *mem)
+{
+ struct memcg_stock_pcp *stock;
+ bool ret = true;
+
+ stock = &get_cpu_var(memcg_stock);
+ if (mem == stock->cached && stock->charge)
+ stock->charge -= PAGE_SIZE;
+ else
+ ret = false;
+ put_cpu_var(memcg_stock);
+ return ret;
+}
+
+static void drain_stock(struct memcg_stock_pcp *stock)
+{
+ struct mem_cgroup *old = stock->cached;
+
+ if (stock->charge) {
+ res_counter_uncharge(&old->res, stock->charge);
+ if (do_swap_account)
+ res_counter_uncharge(&old->memsw, stock->charge);
+ }
+ stock->cached = NULL;
+ stock->charge = 0;
+}
+
+static void drain_local_stock(struct work_struct *dummy)
+{
+ struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
+ drain_stock(stock);
+ put_cpu_var(memcg_stock);
+}
+
+static void refill_stock(struct mem_cgroup *mem, int val)
+{
+ struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
+
+ if (stock->cached != mem) {
+ drain_stock(stock);
+ stock->cached = mem;
+ }
+ stock->charge += val;
+ put_cpu_var(memcg_stock);
+}
+
+static void drain_all_stock_async(void)
+{
+ int cpu;
+ /* Contention means someone tries to flush. */
+ if (!mutex_trylock(&memcg_drain_mutex))
+ return;
+ for_each_online_cpu(cpu) {
+ struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
+ if (work_pending(&stock->work))
+ continue;
+ INIT_WORK(&stock->work, drain_local_stock);
+ schedule_work_on(cpu, &stock->work);
+ }
+ mutex_unlock(&memcg_drain_mutex);
+ /* We don't wait for flush_work */
+}
+
+static void drain_all_stock_sync(void)
+{
+ /* called when force_empty is called */
+ mutex_lock(&memcg_drain_mutex);
+ schedule_on_each_cpu(drain_local_stock);
+ mutex_unlock(&memcg_drain_mutex);
+}
+
+static int __cpuinit memcg_stock_cpu_callback(struct notifier_block *nb,
+ unsigned long action,
+ void *hcpu)
+{
+#ifdef CONFIG_HOTPLUG_CPU
+ int cpu = (unsigned long)hcpu;
+ struct memcg_stock_pcp *stock;
+
+ if (action != CPU_DEAD)
+ return NOTIFY_OK;
+ stock = &per_cpu(memcg_stock, cpu);
+ drain_stock(stock);
+#endif
+ return NOTIFY_OK;
+}
+
/*
* Unlike exported interface, "oom" parameter is added. if oom==true,
* oom-killer can be invoked.
@@ -1269,6 +1370,7 @@ static int __mem_cgroup_try_charge(struc
struct mem_cgroup *mem, *mem_over_limit;
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
struct res_counter *fail_res;
+ int csize = CHARGE_SIZE;

if (unlikely(test_thread_flag(TIF_MEMDIE))) {
/* Don't account this! */
@@ -1293,23 +1395,25 @@ static int __mem_cgroup_try_charge(struc
return 0;

VM_BUG_ON(css_is_removed(&mem->css));
+ if (mem_cgroup_is_root(mem))
+ goto done;

while (1) {
int ret = 0;
unsigned long flags = 0;

- if (mem_cgroup_is_root(mem))
- goto done;
- ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
+ if (consume_stock(mem))
+ goto charged;
+
+ ret = res_counter_charge(&mem->res, csize, &fail_res);
if (likely(!ret)) {
if (!do_swap_account)
break;
- ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
- &fail_res);
+ ret = res_counter_charge(&mem->memsw, csize, &fail_res);
if (likely(!ret))
break;
/* mem+swap counter fails */
- res_counter_uncharge(&mem->res, PAGE_SIZE);
+ res_counter_uncharge(&mem->res, csize);
flags |= MEM_CGROUP_RECLAIM_NOSWAP;
mem_over_limit = mem_cgroup_from_res_counter(fail_res,
memsw);
@@ -1318,6 +1422,11 @@ static int __mem_cgroup_try_charge(struc
mem_over_limit = mem_cgroup_from_res_counter(fail_res,
res);

+ /* reduce request size and retry */
+ if (csize > PAGE_SIZE) {
+ csize = PAGE_SIZE;
+ continue;
+ }
if (!(gfp_mask & __GFP_WAIT))
goto nomem;

@@ -1347,6 +1456,9 @@ static int __mem_cgroup_try_charge(struc
goto nomem;
}
}
+ if (csize > PAGE_SIZE)
+ refill_stock(mem, csize - PAGE_SIZE);
+charged:
/*
* Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
* if they exceeds softlimit.
@@ -2463,6 +2575,7 @@ move_account:
goto out;
/* This is for making all *used* pages to be on LRU. */
lru_add_drain_all();
+ drain_all_stock_sync();
ret = 0;
for_each_node_state(node, N_HIGH_MEMORY) {
for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
@@ -3181,6 +3294,7 @@ mem_cgroup_create(struct cgroup_subsys *
root_mem_cgroup = mem;
if (mem_cgroup_soft_limit_tree_init())
goto free_out;
+ hotcpu_notifier(memcg_stock_cpu_callback, 0);

} else {
parent = mem_cgroup_from_cont(cont->parent);

2009-10-02 06:47:58

by Hiroshi Shimamoto

Subject: Re: [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation

KAMEZAWA Hiroyuki wrote:
> In massive parallel enviroment, res_counter can be a performance bottleneck.
> One strong techinque to reduce lock contention is reducing calls by
> coalescing some amount of calls into one.
>
> Considering charge/uncharge chatacteristic,
> - charge is done one by one via demand-paging.
> - uncharge is done by
> - in chunk at munmap, truncate, exit, execve...
> - one by one via vmscan/paging.
>
> It seems we have a chance in uncharge at unmap/truncation.
>
> This patch is a for coalescing uncharge. For avoiding scattering memcg's
> structure to functions under /mm, this patch adds memcg batch uncharge
> information to the task.
>
> The degree of coalescing depends on callers
> - at invalidate/trucate... pagevec size
> - at unmap ....ZAP_BLOCK_SIZE
> (memory itself will be freed in this degree.)
> Then, we'll not coalescing too much.
>
> Changelog(now):
> - rebased onto the latest mmotm + softlimit fix patches.
>
> Changelog(old):
> - unified patch for callers
> - added commetns.
> - make ->do_batch as bool.
> - removed css_get() at el. We don't need it.
>
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> include/linux/memcontrol.h | 13 ++++++
> include/linux/sched.h | 7 +++
> mm/memcontrol.c | 91 ++++++++++++++++++++++++++++++++++++++++++---
> mm/memory.c | 2
> mm/truncate.c | 6 ++
> 5 files changed, 113 insertions(+), 6 deletions(-)
>
> Index: mmotm-2.6.31-Sep28/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/include/linux/memcontrol.h
> +++ mmotm-2.6.31-Sep28/include/linux/memcontrol.h
> @@ -54,6 +54,11 @@ extern void mem_cgroup_rotate_lru_list(s
> extern void mem_cgroup_del_lru(struct page *page);
> extern void mem_cgroup_move_lists(struct page *page,
> enum lru_list from, enum lru_list to);
> +
> +/* For coalescing uncharge for reducing memcg' overhead*/
> +extern void mem_cgroup_uncharge_start(void);
> +extern void mem_cgroup_uncharge_end(void);
> +
> extern void mem_cgroup_uncharge_page(struct page *page);
> extern void mem_cgroup_uncharge_cache_page(struct page *page);
> extern int mem_cgroup_shmem_charge_fallback(struct page *page,
> @@ -151,6 +156,14 @@ static inline void mem_cgroup_cancel_cha
> {
> }
>
> +static inline void mem_cgroup_uncharge_batch_start(void)
> +{
> +}
> +
> +static inline void mem_cgroup_uncharge_batch_start(void)

mem_cgroup_uncharge_batch_end?


Thanks,
Hiroshi

> +{
> +}
> +
> static inline void mem_cgroup_uncharge_page(struct page *page)
> {
> }
> Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
> +++ mmotm-2.6.31-Sep28/mm/memcontrol.c
> @@ -1826,6 +1826,49 @@ void mem_cgroup_cancel_charge_swapin(str
> css_put(&mem->css);
> }
>
> +static void
> +__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
> +{
> + struct memcg_batch_info *batch = NULL;
> + bool uncharge_memsw = true;
> + /* If swapout, usage of swap doesn't decrease */
> + if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> + uncharge_memsw = false;
> + /*
> + * do_batch > 0 when unmapping pages or inode invalidate/truncate.
> + * In those cases, all pages freed continously can be expected to be in
> + * the same cgroup and we have chance to coalesce uncharges.
> + * And, we do uncharge one by one if this is killed by OOM.
> + */
> + if (!current->memcg_batch.do_batch || test_thread_flag(TIF_MEMDIE))
> + goto direct_uncharge;
> +
> + batch = &current->memcg_batch;
> + /*
> + * In usual, we do css_get() when we remember memcg pointer.
> + * But in this case, we keep res->usage until end of a series of
> + * uncharges. Then, it's ok to ignore memcg's refcnt.
> + */
> + if (!batch->memcg)
> + batch->memcg = mem;
> + /*
> + * In typical case, batch->memcg == mem. This means we can
> + * merge a series of uncharges to an uncharge of res_counter.
> + * If not, we uncharge res_counter ony by one.
> + */
> + if (batch->memcg != mem)
> + goto direct_uncharge;
> + /* remember freed charge and uncharge it later */
> + batch->pages += PAGE_SIZE;
> + if (uncharge_memsw)
> + batch->memsw += PAGE_SIZE;
> + return;
> +direct_uncharge:
> + res_counter_uncharge(&mem->res, PAGE_SIZE);
> + if (uncharge_memsw)
> + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> + return;
> +}
>
> /*
> * uncharge if !page_mapped(page)
> @@ -1874,12 +1917,8 @@ __mem_cgroup_uncharge_common(struct page
> break;
> }
>
> - if (!mem_cgroup_is_root(mem)) {
> - res_counter_uncharge(&mem->res, PAGE_SIZE);
> - if (do_swap_account &&
> - (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
> - res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> - }
> + if (!mem_cgroup_is_root(mem))
> + __do_uncharge(mem, ctype);
> if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> mem_cgroup_swap_statistics(mem, true);
> mem_cgroup_charge_statistics(mem, pc, false);
> @@ -1925,6 +1964,46 @@ void mem_cgroup_uncharge_cache_page(stru
> __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
> }
>
> +/*
> + * batch_start/batch_end is called in unmap_page_range/invlidate/trucate.
> + * In that cases, pages are freed continuously and we can expect pages
> + * are in the same memcg. All these calls itself limits the number of
> + * pages freed at once, then uncharge_start/end() is called properly.
> + */
> +
> +void mem_cgroup_uncharge_start(void)
> +{
> + if (!current->memcg_batch.do_batch) {
> + current->memcg_batch.memcg = NULL;
> + current->memcg_batch.pages = 0;
> + current->memcg_batch.memsw = 0;
> + }
> + current->memcg_batch.do_batch++;
> +}
> +
> +void mem_cgroup_uncharge_end(void)
> +{
> + struct mem_cgroup *mem;
> +
> + if (!current->memcg_batch.do_batch)
> + return;
> +
> + current->memcg_batch.do_batch--;
> + if (current->memcg_batch.do_batch) /* Nested ? */
> + return;
> +
> + mem = current->memcg_batch.memcg;
> + if (!mem)
> + return;
> + /* This "mem" is valid bacause we hide charges behind us. */
> + if (current->memcg_batch.pages)
> + res_counter_uncharge(&mem->res, current->memcg_batch.pages);
> + if (current->memcg_batch.memsw)
> + res_counter_uncharge(&mem->memsw, current->memcg_batch.memsw);
> + /* Not necessary. but forget this pointer */
> + current->memcg_batch.memcg = NULL;
> +}
> +
> #ifdef CONFIG_SWAP
> /*
> * called after __delete_from_swap_cache() and drop "page" account.
> Index: mmotm-2.6.31-Sep28/include/linux/sched.h
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/include/linux/sched.h
> +++ mmotm-2.6.31-Sep28/include/linux/sched.h
> @@ -1549,6 +1549,13 @@ struct task_struct {
> unsigned long trace_recursion;
> #endif /* CONFIG_TRACING */
> unsigned long stack_start;
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
> + struct memcg_batch_info {
> + int do_batch;
> + struct mem_cgroup *memcg;
> + long pages, memsw;
> + } memcg_batch;
> +#endif
> };
>
> /* Future-safe accessor for struct task_struct's cpus_allowed. */
> Index: mmotm-2.6.31-Sep28/mm/memory.c
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/mm/memory.c
> +++ mmotm-2.6.31-Sep28/mm/memory.c
> @@ -940,6 +940,7 @@ static unsigned long unmap_page_range(st
> details = NULL;
>
> BUG_ON(addr >= end);
> + mem_cgroup_uncharge_start();
> tlb_start_vma(tlb, vma);
> pgd = pgd_offset(vma->vm_mm, addr);
> do {
> @@ -952,6 +953,7 @@ static unsigned long unmap_page_range(st
> zap_work, details);
> } while (pgd++, addr = next, (addr != end && *zap_work > 0));
> tlb_end_vma(tlb, vma);
> + mem_cgroup_uncharge_end();
>
> return addr;
> }
> Index: mmotm-2.6.31-Sep28/mm/truncate.c
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/mm/truncate.c
> +++ mmotm-2.6.31-Sep28/mm/truncate.c
> @@ -272,6 +272,7 @@ void truncate_inode_pages_range(struct a
> pagevec_release(&pvec);
> break;
> }
> + mem_cgroup_uncharge_start();
> for (i = 0; i < pagevec_count(&pvec); i++) {
> struct page *page = pvec.pages[i];
>
> @@ -286,6 +287,7 @@ void truncate_inode_pages_range(struct a
> unlock_page(page);
> }
> pagevec_release(&pvec);
> + mem_cgroup_uncharge_end();
> }
> }
> EXPORT_SYMBOL(truncate_inode_pages_range);
> @@ -327,6 +329,7 @@ unsigned long invalidate_mapping_pages(s
> pagevec_init(&pvec, 0);
> while (next <= end &&
> pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> + mem_cgroup_uncharge_start();
> for (i = 0; i < pagevec_count(&pvec); i++) {
> struct page *page = pvec.pages[i];
> pgoff_t index;
> @@ -354,6 +357,7 @@ unsigned long invalidate_mapping_pages(s
> break;
> }
> pagevec_release(&pvec);
> + mem_cgroup_uncharge_end();
> cond_resched();
> }
> return ret;
> @@ -428,6 +432,7 @@ int invalidate_inode_pages2_range(struct
> while (next <= end && !wrapped &&
> pagevec_lookup(&pvec, mapping, next,
> min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
> + mem_cgroup_uncharge_start();
> for (i = 0; i < pagevec_count(&pvec); i++) {
> struct page *page = pvec.pages[i];
> pgoff_t page_index;
> @@ -477,6 +482,7 @@ int invalidate_inode_pages2_range(struct
> unlock_page(page);
> }
> pagevec_release(&pvec);
> + mem_cgroup_uncharge_end();
> cond_resched();
> }
> return ret;
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
>
>

2009-10-02 06:54:41

by Hiroshi Shimamoto

Subject: Re: [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation

Hiroshi Shimamoto wrote:
> KAMEZAWA Hiroyuki wrote:
>> In massive parallel enviroment, res_counter can be a performance bottleneck.
>> One strong techinque to reduce lock contention is reducing calls by
>> coalescing some amount of calls into one.
>>
>> Considering charge/uncharge chatacteristic,
>> - charge is done one by one via demand-paging.
>> - uncharge is done by
>> - in chunk at munmap, truncate, exit, execve...
>> - one by one via vmscan/paging.
>>
>> It seems we have a chance in uncharge at unmap/truncation.
>>
>> This patch is a for coalescing uncharge. For avoiding scattering memcg's
>> structure to functions under /mm, this patch adds memcg batch uncharge
>> information to the task.
>>
>> The degree of coalescing depends on callers
>> - at invalidate/trucate... pagevec size
>> - at unmap ....ZAP_BLOCK_SIZE
>> (memory itself will be freed in this degree.)
>> Then, we'll not coalescing too much.
>>
>> Changelog(now):
>> - rebased onto the latest mmotm + softlimit fix patches.
>>
>> Changelog(old):
>> - unified patch for callers
>> - added commetns.
>> - make ->do_batch as bool.
>> - removed css_get() at el. We don't need it.
>>
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>> ---
>> include/linux/memcontrol.h | 13 ++++++
>> include/linux/sched.h | 7 +++
>> mm/memcontrol.c | 91 ++++++++++++++++++++++++++++++++++++++++++---
>> mm/memory.c | 2
>> mm/truncate.c | 6 ++
>> 5 files changed, 113 insertions(+), 6 deletions(-)
>>
>> Index: mmotm-2.6.31-Sep28/include/linux/memcontrol.h
>> ===================================================================
>> --- mmotm-2.6.31-Sep28.orig/include/linux/memcontrol.h
>> +++ mmotm-2.6.31-Sep28/include/linux/memcontrol.h
>> @@ -54,6 +54,11 @@ extern void mem_cgroup_rotate_lru_list(s
>> extern void mem_cgroup_del_lru(struct page *page);
>> extern void mem_cgroup_move_lists(struct page *page,
>> enum lru_list from, enum lru_list to);
>> +
>> +/* For coalescing uncharge for reducing memcg' overhead*/
>> +extern void mem_cgroup_uncharge_start(void);
>> +extern void mem_cgroup_uncharge_end(void);
>> +
>> extern void mem_cgroup_uncharge_page(struct page *page);
>> extern void mem_cgroup_uncharge_cache_page(struct page *page);
>> extern int mem_cgroup_shmem_charge_fallback(struct page *page,
>> @@ -151,6 +156,14 @@ static inline void mem_cgroup_cancel_cha
>> {
>> }
>>
>> +static inline void mem_cgroup_uncharge_batch_start(void)
>> +{
>> +}
>> +
>> +static inline void mem_cgroup_uncharge_batch_start(void)
>
> mem_cgroup_uncharge_batch_end?

s/_batch// too?

Thanks,
Hiroshi

>
>
> Thanks,
> Hiroshi
>
>> +{
>> +}
>> +
>> static inline void mem_cgroup_uncharge_page(struct page *page)
>> {
>> }
>> Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
>> ===================================================================
>> --- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
>> +++ mmotm-2.6.31-Sep28/mm/memcontrol.c
>> @@ -1826,6 +1826,49 @@ void mem_cgroup_cancel_charge_swapin(str
>> css_put(&mem->css);
>> }
>>
>> +static void
>> +__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
>> +{
>> + struct memcg_batch_info *batch = NULL;
>> + bool uncharge_memsw = true;
>> + /* If swapout, usage of swap doesn't decrease */
>> + if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
>> + uncharge_memsw = false;
>> + /*
>> + * do_batch > 0 when unmapping pages or inode invalidate/truncate.
>> + * In those cases, all pages freed continously can be expected to be in
>> + * the same cgroup and we have chance to coalesce uncharges.
>> + * And, we do uncharge one by one if this is killed by OOM.
>> + */
>> + if (!current->memcg_batch.do_batch || test_thread_flag(TIF_MEMDIE))
>> + goto direct_uncharge;
>> +
>> + batch = &current->memcg_batch;
>> + /*
>> + * In usual, we do css_get() when we remember memcg pointer.
>> + * But in this case, we keep res->usage until end of a series of
>> + * uncharges. Then, it's ok to ignore memcg's refcnt.
>> + */
>> + if (!batch->memcg)
>> + batch->memcg = mem;
>> + /*
>> + * In typical case, batch->memcg == mem. This means we can
>> + * merge a series of uncharges to an uncharge of res_counter.
>> + * If not, we uncharge res_counter ony by one.
>> + */
>> + if (batch->memcg != mem)
>> + goto direct_uncharge;
>> + /* remember freed charge and uncharge it later */
>> + batch->pages += PAGE_SIZE;
>> + if (uncharge_memsw)
>> + batch->memsw += PAGE_SIZE;
>> + return;
>> +direct_uncharge:
>> + res_counter_uncharge(&mem->res, PAGE_SIZE);
>> + if (uncharge_memsw)
>> + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
>> + return;
>> +}
>>
>> /*
>> * uncharge if !page_mapped(page)
>> @@ -1874,12 +1917,8 @@ __mem_cgroup_uncharge_common(struct page
>> break;
>> }
>>
>> - if (!mem_cgroup_is_root(mem)) {
>> - res_counter_uncharge(&mem->res, PAGE_SIZE);
>> - if (do_swap_account &&
>> - (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
>> - res_counter_uncharge(&mem->memsw, PAGE_SIZE);
>> - }
>> + if (!mem_cgroup_is_root(mem))
>> + __do_uncharge(mem, ctype);
>> if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
>> mem_cgroup_swap_statistics(mem, true);
>> mem_cgroup_charge_statistics(mem, pc, false);
>> @@ -1925,6 +1964,46 @@ void mem_cgroup_uncharge_cache_page(stru
>> __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
>> }
>>
>> +/*
>> + * batch_start/batch_end is called in unmap_page_range/invlidate/trucate.
>> + * In that cases, pages are freed continuously and we can expect pages
>> + * are in the same memcg. All these calls itself limits the number of
>> + * pages freed at once, then uncharge_start/end() is called properly.
>> + */
>> +
>> +void mem_cgroup_uncharge_start(void)
>> +{
>> + if (!current->memcg_batch.do_batch) {
>> + current->memcg_batch.memcg = NULL;
>> + current->memcg_batch.pages = 0;
>> + current->memcg_batch.memsw = 0;
>> + }
>> + current->memcg_batch.do_batch++;
>> +}
>> +
>> +void mem_cgroup_uncharge_end(void)
>> +{
>> + struct mem_cgroup *mem;
>> +
>> + if (!current->memcg_batch.do_batch)
>> + return;
>> +
>> + current->memcg_batch.do_batch--;
>> + if (current->memcg_batch.do_batch) /* Nested ? */
>> + return;
>> +
>> + mem = current->memcg_batch.memcg;
>> + if (!mem)
>> + return;
>> + /* This "mem" is valid bacause we hide charges behind us. */
>> + if (current->memcg_batch.pages)
>> + res_counter_uncharge(&mem->res, current->memcg_batch.pages);
>> + if (current->memcg_batch.memsw)
>> + res_counter_uncharge(&mem->memsw, current->memcg_batch.memsw);
>> + /* Not necessary. but forget this pointer */
>> + current->memcg_batch.memcg = NULL;
>> +}
>> +
>> #ifdef CONFIG_SWAP
>> /*
>> * called after __delete_from_swap_cache() and drop "page" account.
>> Index: mmotm-2.6.31-Sep28/include/linux/sched.h
>> ===================================================================
>> --- mmotm-2.6.31-Sep28.orig/include/linux/sched.h
>> +++ mmotm-2.6.31-Sep28/include/linux/sched.h
>> @@ -1549,6 +1549,13 @@ struct task_struct {
>> unsigned long trace_recursion;
>> #endif /* CONFIG_TRACING */
>> unsigned long stack_start;
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
>> + struct memcg_batch_info {
>> + int do_batch;
>> + struct mem_cgroup *memcg;
>> + long pages, memsw;
>> + } memcg_batch;
>> +#endif
>> };
>>
>> /* Future-safe accessor for struct task_struct's cpus_allowed. */
>> Index: mmotm-2.6.31-Sep28/mm/memory.c
>> ===================================================================
>> --- mmotm-2.6.31-Sep28.orig/mm/memory.c
>> +++ mmotm-2.6.31-Sep28/mm/memory.c
>> @@ -940,6 +940,7 @@ static unsigned long unmap_page_range(st
>> details = NULL;
>>
>> BUG_ON(addr >= end);
>> + mem_cgroup_uncharge_start();
>> tlb_start_vma(tlb, vma);
>> pgd = pgd_offset(vma->vm_mm, addr);
>> do {
>> @@ -952,6 +953,7 @@ static unsigned long unmap_page_range(st
>> zap_work, details);
>> } while (pgd++, addr = next, (addr != end && *zap_work > 0));
>> tlb_end_vma(tlb, vma);
>> + mem_cgroup_uncharge_end();
>>
>> return addr;
>> }
>> Index: mmotm-2.6.31-Sep28/mm/truncate.c
>> ===================================================================
>> --- mmotm-2.6.31-Sep28.orig/mm/truncate.c
>> +++ mmotm-2.6.31-Sep28/mm/truncate.c
>> @@ -272,6 +272,7 @@ void truncate_inode_pages_range(struct a
>> pagevec_release(&pvec);
>> break;
>> }
>> + mem_cgroup_uncharge_start();
>> for (i = 0; i < pagevec_count(&pvec); i++) {
>> struct page *page = pvec.pages[i];
>>
>> @@ -286,6 +287,7 @@ void truncate_inode_pages_range(struct a
>> unlock_page(page);
>> }
>> pagevec_release(&pvec);
>> + mem_cgroup_uncharge_end();
>> }
>> }
>> EXPORT_SYMBOL(truncate_inode_pages_range);
>> @@ -327,6 +329,7 @@ unsigned long invalidate_mapping_pages(s
>> pagevec_init(&pvec, 0);
>> while (next <= end &&
>> pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
>> + mem_cgroup_uncharge_start();
>> for (i = 0; i < pagevec_count(&pvec); i++) {
>> struct page *page = pvec.pages[i];
>> pgoff_t index;
>> @@ -354,6 +357,7 @@ unsigned long invalidate_mapping_pages(s
>> break;
>> }
>> pagevec_release(&pvec);
>> + mem_cgroup_uncharge_end();
>> cond_resched();
>> }
>> return ret;
>> @@ -428,6 +432,7 @@ int invalidate_inode_pages2_range(struct
>> while (next <= end && !wrapped &&
>> pagevec_lookup(&pvec, mapping, next,
>> min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
>> + mem_cgroup_uncharge_start();
>> for (i = 0; i < pagevec_count(&pvec); i++) {
>> struct page *page = pvec.pages[i];
>> pgoff_t page_index;
>> @@ -477,6 +482,7 @@ int invalidate_inode_pages2_range(struct
>> unlock_page(page);
>> }
>> pagevec_release(&pvec);
>> + mem_cgroup_uncharge_end();
>> cond_resched();
>> }
>> return ret;
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>>
>>
>>
>
>

2009-10-02 07:04:37

by Kamezawa Hiroyuki

Subject: [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation (fixed compile bug)

On Fri, 02 Oct 2009 15:47:22 +0900
Hiroshi Shimamoto <[email protected]> wrote:
> > extern void mem_cgroup_uncharge_page(struct page *page);
> > extern void mem_cgroup_uncharge_cache_page(struct page *page);
> > extern int mem_cgroup_shmem_charge_fallback(struct page *page,
> > @@ -151,6 +156,14 @@ static inline void mem_cgroup_cancel_cha
> > {
> > }
> >
> > +static inline void mem_cgroup_uncharge_batch_start(void)
> > +{
> > +}
> > +
> > +static inline void mem_cgroup_uncharge_batch_start(void)
>
> mem_cgroup_uncharge_batch_end?
>
Yes, of course... lack of a !CONFIG_MEM_CGROUP build test after updating...
thank you!
==
In a massively parallel environment, res_counter can be a performance
bottleneck. One strong technique to reduce lock contention is to reduce the
number of calls by coalescing several calls into one.

Considering charge/uncharge characteristics,
- charge is done one by one via demand paging.
- uncharge is done either
  - in chunks at munmap, truncate, exit, execve...
  - or one by one via vmscan/paging.

It seems we have a chance to coalesce uncharges at unmap/truncation.

This patch is for coalescing uncharge. To avoid scattering memcg's
structures into functions under mm/, this patch adds memcg batch uncharge
information to the task.

The degree of coalescing depends on the callers:
- at invalidate/truncate... the pagevec size
- at unmap... ZAP_BLOCK_SIZE
(memory itself is freed at this granularity.)
So we will not coalesce too much.

Changelog(now):
- fixed the !CONFIG_MEM_CGROUP case.
- rebased onto the latest mmotm + softlimit fix patches.

Changelog(old):
- unified patch for callers
- added comments.
- made ->do_batch a bool.
- removed css_get() et al. We don't need it.


Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/memcontrol.h | 13 ++++++
include/linux/sched.h | 7 +++
mm/memcontrol.c | 91 ++++++++++++++++++++++++++++++++++++++++++---
mm/memory.c | 2
mm/truncate.c | 6 ++
5 files changed, 113 insertions(+), 6 deletions(-)

Index: mmotm-2.6.31-Sep28/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.31-Sep28.orig/include/linux/memcontrol.h
+++ mmotm-2.6.31-Sep28/include/linux/memcontrol.h
@@ -54,6 +54,11 @@ extern void mem_cgroup_rotate_lru_list(s
extern void mem_cgroup_del_lru(struct page *page);
extern void mem_cgroup_move_lists(struct page *page,
enum lru_list from, enum lru_list to);
+
+/* For coalescing uncharge for reducing memcg' overhead*/
+extern void mem_cgroup_uncharge_start(void);
+extern void mem_cgroup_uncharge_end(void);
+
extern void mem_cgroup_uncharge_page(struct page *page);
extern void mem_cgroup_uncharge_cache_page(struct page *page);
extern int mem_cgroup_shmem_charge_fallback(struct page *page,
@@ -151,6 +156,14 @@ static inline void mem_cgroup_cancel_cha
{
}

+static inline void mem_cgroup_uncharge_start(void)
+{
+}
+
+static inline void mem_cgroup_uncharge_end(void)
+{
+}
+
static inline void mem_cgroup_uncharge_page(struct page *page)
{
}
Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep28/mm/memcontrol.c
@@ -1826,6 +1826,49 @@ void mem_cgroup_cancel_charge_swapin(str
css_put(&mem->css);
}

+static void
+__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
+{
+ struct memcg_batch_info *batch = NULL;
+ bool uncharge_memsw = true;
+ /* If swapout, usage of swap doesn't decrease */
+ if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
+ uncharge_memsw = false;
+ /*
+ * do_batch > 0 when unmapping pages or inode invalidate/truncate.
+ * In those cases, all pages freed continously can be expected to be in
+ * the same cgroup and we have chance to coalesce uncharges.
+ * And, we do uncharge one by one if this is killed by OOM.
+ */
+ if (!current->memcg_batch.do_batch || test_thread_flag(TIF_MEMDIE))
+ goto direct_uncharge;
+
+ batch = &current->memcg_batch;
+ /*
+ * In usual, we do css_get() when we remember memcg pointer.
+ * But in this case, we keep res->usage until end of a series of
+ * uncharges. Then, it's ok to ignore memcg's refcnt.
+ */
+ if (!batch->memcg)
+ batch->memcg = mem;
+ /*
+ * In typical case, batch->memcg == mem. This means we can
+ * merge a series of uncharges to an uncharge of res_counter.
+ * If not, we uncharge res_counter ony by one.
+ */
+ if (batch->memcg != mem)
+ goto direct_uncharge;
+ /* remember freed charge and uncharge it later */
+ batch->pages += PAGE_SIZE;
+ if (uncharge_memsw)
+ batch->memsw += PAGE_SIZE;
+ return;
+direct_uncharge:
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ if (uncharge_memsw)
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ return;
+}

/*
* uncharge if !page_mapped(page)
@@ -1874,12 +1917,8 @@ __mem_cgroup_uncharge_common(struct page
break;
}

- if (!mem_cgroup_is_root(mem)) {
- res_counter_uncharge(&mem->res, PAGE_SIZE);
- if (do_swap_account &&
- (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
- res_counter_uncharge(&mem->memsw, PAGE_SIZE);
- }
+ if (!mem_cgroup_is_root(mem))
+ __do_uncharge(mem, ctype);
if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
mem_cgroup_swap_statistics(mem, true);
mem_cgroup_charge_statistics(mem, pc, false);
@@ -1925,6 +1964,46 @@ void mem_cgroup_uncharge_cache_page(stru
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
}

+/*
+ * batch_start/batch_end is called in unmap_page_range/invlidate/trucate.
+ * In that cases, pages are freed continuously and we can expect pages
+ * are in the same memcg. All these calls itself limits the number of
+ * pages freed at once, then uncharge_start/end() is called properly.
+ */
+
+void mem_cgroup_uncharge_start(void)
+{
+ if (!current->memcg_batch.do_batch) {
+ current->memcg_batch.memcg = NULL;
+ current->memcg_batch.pages = 0;
+ current->memcg_batch.memsw = 0;
+ }
+ current->memcg_batch.do_batch++;
+}
+
+void mem_cgroup_uncharge_end(void)
+{
+ struct mem_cgroup *mem;
+
+ if (!current->memcg_batch.do_batch)
+ return;
+
+ current->memcg_batch.do_batch--;
+ if (current->memcg_batch.do_batch) /* Nested ? */
+ return;
+
+ mem = current->memcg_batch.memcg;
+ if (!mem)
+ return;
+ /* This "mem" is valid bacause we hide charges behind us. */
+ if (current->memcg_batch.pages)
+ res_counter_uncharge(&mem->res, current->memcg_batch.pages);
+ if (current->memcg_batch.memsw)
+ res_counter_uncharge(&mem->memsw, current->memcg_batch.memsw);
+ /* Not necessary. but forget this pointer */
+ current->memcg_batch.memcg = NULL;
+}
+
#ifdef CONFIG_SWAP
/*
* called after __delete_from_swap_cache() and drop "page" account.
Index: mmotm-2.6.31-Sep28/include/linux/sched.h
===================================================================
--- mmotm-2.6.31-Sep28.orig/include/linux/sched.h
+++ mmotm-2.6.31-Sep28/include/linux/sched.h
@@ -1549,6 +1549,13 @@ struct task_struct {
unsigned long trace_recursion;
#endif /* CONFIG_TRACING */
unsigned long stack_start;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
+ struct memcg_batch_info {
+ int do_batch;
+ struct mem_cgroup *memcg;
+ long pages, memsw;
+ } memcg_batch;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
Index: mmotm-2.6.31-Sep28/mm/memory.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/mm/memory.c
+++ mmotm-2.6.31-Sep28/mm/memory.c
@@ -940,6 +940,7 @@ static unsigned long unmap_page_range(st
details = NULL;

BUG_ON(addr >= end);
+ mem_cgroup_uncharge_start();
tlb_start_vma(tlb, vma);
pgd = pgd_offset(vma->vm_mm, addr);
do {
@@ -952,6 +953,7 @@ static unsigned long unmap_page_range(st
zap_work, details);
} while (pgd++, addr = next, (addr != end && *zap_work > 0));
tlb_end_vma(tlb, vma);
+ mem_cgroup_uncharge_end();

return addr;
}
Index: mmotm-2.6.31-Sep28/mm/truncate.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/mm/truncate.c
+++ mmotm-2.6.31-Sep28/mm/truncate.c
@@ -272,6 +272,7 @@ void truncate_inode_pages_range(struct a
pagevec_release(&pvec);
break;
}
+ mem_cgroup_uncharge_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];

@@ -286,6 +287,7 @@ void truncate_inode_pages_range(struct a
unlock_page(page);
}
pagevec_release(&pvec);
+ mem_cgroup_uncharge_end();
}
}
EXPORT_SYMBOL(truncate_inode_pages_range);
@@ -327,6 +329,7 @@ unsigned long invalidate_mapping_pages(s
pagevec_init(&pvec, 0);
while (next <= end &&
pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+ mem_cgroup_uncharge_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
pgoff_t index;
@@ -354,6 +357,7 @@ unsigned long invalidate_mapping_pages(s
break;
}
pagevec_release(&pvec);
+ mem_cgroup_uncharge_end();
cond_resched();
}
return ret;
@@ -428,6 +432,7 @@ int invalidate_inode_pages2_range(struct
while (next <= end && !wrapped &&
pagevec_lookup(&pvec, mapping, next,
min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+ mem_cgroup_uncharge_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
pgoff_t page_index;
@@ -477,6 +482,7 @@ int invalidate_inode_pages2_range(struct
unlock_page(page);
}
pagevec_release(&pvec);
+ mem_cgroup_uncharge_end();
cond_resched();
}
return ret;

2009-10-02 07:06:55

by Kamezawa Hiroyuki

Subject: Re: [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation

On Fri, 02 Oct 2009 15:53:19 +0900
Hiroshi Shimamoto <[email protected]> wrote:

> Hiroshi Shimamoto wrote:
> > KAMEZAWA Hiroyuki wrote:

> >> +static inline void mem_cgroup_uncharge_batch_start(void)
> >> +{
> >> +}
> >> +
> >> +static inline void mem_cgroup_uncharge_batch_start(void)
> >
> > mem_cgroup_uncharge_batch_end?
>
> s/_batch// too?
>
thank you. fixed.

-Kame

2009-10-02 08:55:27

by Kamezawa Hiroyuki

Subject: Re: [PATCH 0/2] memcg: improving scalability by reducing lock contention at charge/uncharge

On Fri, 2 Oct 2009 13:55:31 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> Following is test result of continuous page-fault on my 8cpu box(x86-64).
>
> A loop like this runs on all cpus in parallel for 60secs.
> ==
> while (1) {
> x = mmap(NULL, MEGA, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
>
> for (off = 0; off < MEGA; off += PAGE_SIZE)
> x[off]=0;
> munmap(x, MEGA);
> }
> ==
> please see # of page faults. I think this is good improvement.
>
>
> [Before]
> Performance counter stats for './runpause.sh' (5 runs):
>
> 474539.756944 task-clock-msecs # 7.890 CPUs ( +- 0.015% )
> 10284 context-switches # 0.000 M/sec ( +- 0.156% )
> 12 CPU-migrations # 0.000 M/sec ( +- 0.000% )
> 18425800 page-faults # 0.039 M/sec ( +- 0.107% )
> 1486296285360 cycles # 3132.080 M/sec ( +- 0.029% )
> 380334406216 instructions # 0.256 IPC ( +- 0.058% )
> 3274206662 cache-references # 6.900 M/sec ( +- 0.453% )
> 1272947699 cache-misses # 2.682 M/sec ( +- 0.118% )
>
> 60.147907341 seconds time elapsed ( +- 0.010% )
>
> [After]
> Performance counter stats for './runpause.sh' (5 runs):
>
> 474658.997489 task-clock-msecs # 7.891 CPUs ( +- 0.006% )
> 10250 context-switches # 0.000 M/sec ( +- 0.020% )
> 11 CPU-migrations # 0.000 M/sec ( +- 0.000% )
> 33177858 page-faults # 0.070 M/sec ( +- 0.152% )
> 1485264748476 cycles # 3129.120 M/sec ( +- 0.021% )
> 409847004519 instructions # 0.276 IPC ( +- 0.123% )
> 3237478723 cache-references # 6.821 M/sec ( +- 0.574% )
> 1182572827 cache-misses # 2.491 M/sec ( +- 0.179% )
>
> 60.151786309 seconds time elapsed ( +- 0.014% )
>
BTW, this is a score in root cgroup.


473811.590852 task-clock-msecs # 7.878 CPUs ( +- 0.006% )
10257 context-switches # 0.000 M/sec ( +- 0.049% )
10 CPU-migrations # 0.000 M/sec ( +- 0.000% )
36418112 page-faults # 0.077 M/sec ( +- 0.195% )
1482880352588 cycles # 3129.684 M/sec ( +- 0.011% )
410948762898 instructions # 0.277 IPC ( +- 0.123% )
3182986911 cache-references # 6.718 M/sec ( +- 0.555% )
1147144023 cache-misses # 2.421 M/sec ( +- 0.137% )


Then,
36418112 x 100 / 33177858 = 109.8%, i.e. a child cgroup is about 9% slower than root.

But, hmm, this test is an extreme case (60 secs of continuous page faults on all cpus).
We may be able to do something more, but this score itself is not so bad, I think.
Results on more cpus are welcome. The programs I used are attached.

Thanks,
-Kame




Attachments:
pagefault.c (453.00 B)
runpause.sh (128.00 B)

2009-10-05 07:21:10

by Kamezawa Hiroyuki

Subject: Re: [PATCH 0/2] memcg: improving scalability by reducing lock contention at charge/uncharge

On Fri, 2 Oct 2009 17:53:10 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> > [After]
> > Performance counter stats for './runpause.sh' (5 runs):
> >
> > 474658.997489 task-clock-msecs # 7.891 CPUs ( +- 0.006% )
> > 10250 context-switches # 0.000 M/sec ( +- 0.020% )
> > 11 CPU-migrations # 0.000 M/sec ( +- 0.000% )
> > 33177858 page-faults # 0.070 M/sec ( +- 0.152% )
> > 1485264748476 cycles # 3129.120 M/sec ( +- 0.021% )
> > 409847004519 instructions # 0.276 IPC ( +- 0.123% )
> > 3237478723 cache-references # 6.821 M/sec ( +- 0.574% )
> > 1182572827 cache-misses # 2.491 M/sec ( +- 0.179% )
> >
> > 60.151786309 seconds time elapsed ( +- 0.014% )
> >
> BTW, this is a score in root cgroup.
>
>
> 473811.590852 task-clock-msecs # 7.878 CPUs ( +- 0.006% )
> 10257 context-switches # 0.000 M/sec ( +- 0.049% )
> 10 CPU-migrations # 0.000 M/sec ( +- 0.000% )
> 36418112 page-faults # 0.077 M/sec ( +- 0.195% )
> 1482880352588 cycles # 3129.684 M/sec ( +- 0.011% )
> 410948762898 instructions # 0.277 IPC ( +- 0.123% )
> 3182986911 cache-references # 6.718 M/sec ( +- 0.555% )
> 1147144023 cache-misses # 2.421 M/sec ( +- 0.137% )
>
>
> Then,
> 36418112 x 100 / 33177858 = 109% slower in children cgroup.
>

This is an additional patch, now under testing (just experimental).
Result of the above test:
==
[root cgroup]
37062405 page-faults # 0.078 M/sec ( +- 0.156% )
[children]
35876894 page-faults # 0.076 M/sec ( +- 0.233% )
==
Near to my target....

This patch adds bulk_css_put() and coalesces the css_put() calls in the
batched-uncharge path, avoiding frequent calls to css_put().

With the coalescing-uncharge patch, references to res_counter are reduced,
but css_put() is still called once per page.
Of course, we can coalesce plural css_put() calls into one call of bulk_css_put().

So this patch adds bulk_css_put(), reduces false sharing, and should have
a good effect on scalability.
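In isolation the idea is simply that dropping n references with one atomic subtraction is equivalent to n single drops; a hedged userspace sketch (my illustration, not the cgroup code):
==
/* Userspace sketch of bulk reference dropping (illustration only). */
#include <stdatomic.h>
#include <stdio.h>

static atomic_int refcnt;

static void get_refs(int n)
{
	atomic_fetch_add(&refcnt, n);
}

static void bulk_put(int n)		/* like bulk_css_put(css, n) */
{
	/* previous value == n means this call dropped the last references */
	if (atomic_fetch_sub(&refcnt, n) == n)
		printf("released\n");	/* release work goes here */
}

int main(void)
{
	get_refs(16);			/* e.g. one ref per uncharged page */
	bulk_put(16);			/* one atomic op instead of 16 */
	return 0;
}
==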

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

---
include/linux/cgroup.h | 10 ++++++++--
kernel/cgroup.c | 5 ++---
mm/memcontrol.c | 16 +++++++++++-----
3 files changed, 21 insertions(+), 10 deletions(-)

Index: mmotm-2.6.31-Sep28/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.31-Sep28.orig/include/linux/cgroup.h
+++ mmotm-2.6.31-Sep28/include/linux/cgroup.h
@@ -117,11 +117,17 @@ static inline bool css_tryget(struct cgr
* css_get() or css_tryget()
*/

-extern void __css_put(struct cgroup_subsys_state *css);
+extern void __css_put(struct cgroup_subsys_state *css, int val);
static inline void css_put(struct cgroup_subsys_state *css)
{
if (!test_bit(CSS_ROOT, &css->flags))
- __css_put(css);
+ __css_put(css, 1);
+}
+
+static inline void bulk_css_put(struct cgroup_subsys_state *css, int val)
+{
+ if (!test_bit(CSS_ROOT, &css->flags))
+ __css_put(css, val);
}

/* bits in struct cgroup flags field */
Index: mmotm-2.6.31-Sep28/kernel/cgroup.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/kernel/cgroup.c
+++ mmotm-2.6.31-Sep28/kernel/cgroup.c
@@ -3705,12 +3705,11 @@ static void check_for_release(struct cgr
}
}

-void __css_put(struct cgroup_subsys_state *css)
+void __css_put(struct cgroup_subsys_state *css, int val)
{
struct cgroup *cgrp = css->cgroup;
- int val;
rcu_read_lock();
- val = atomic_dec_return(&css->refcnt);
+ val = atomic_sub_return(val, &css->refcnt);
if (val == 1) {
if (notify_on_release(cgrp)) {
set_bit(CGRP_RELEASABLE, &cgrp->flags);
Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep28/mm/memcontrol.c
@@ -1977,8 +1977,14 @@ __do_uncharge(struct mem_cgroup *mem, co
return;
direct_uncharge:
res_counter_uncharge(&mem->res, PAGE_SIZE);
- if (uncharge_memsw)
+ if (uncharge_memsw) {
res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ /*
+ * swapout-uncharge do css_put() by itself. then we do
+ * css_put() only in this case.
+ */
+ css_put(&mem->css);
+ }
return;
}

@@ -2048,9 +2054,6 @@ __mem_cgroup_uncharge_common(struct page

if (mem_cgroup_soft_limit_check(mem))
mem_cgroup_update_tree(mem, page);
- /* at swapout, this memcg will be accessed to record to swap */
- if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
- css_put(&mem->css);

return mem;

@@ -2108,8 +2111,11 @@ void mem_cgroup_uncharge_end(void)
if (!mem)
return;
/* This "mem" is valid bacause we hide charges behind us. */
- if (current->memcg_batch.pages)
+ if (current->memcg_batch.pages) {
res_counter_uncharge(&mem->res, current->memcg_batch.pages);
+ bulk_css_put(&mem->css,
+ current->memcg_batch.pages >> PAGE_SHIFT);
+ }
if (current->memcg_batch.memsw)
res_counter_uncharge(&mem->memsw, current->memcg_batch.memsw);
/* Not necessary. but forget this pointer */




2009-10-05 10:38:15

by Balbir Singh

Subject: Re: [PATCH 0/2] memcg: improving scalability by reducing lock contention at charge/uncharge

* KAMEZAWA Hiroyuki <[email protected]> [2009-10-02 13:55:31]:

> Hi,
>
> This patch is against mmotm + softlimit fix patches.
> (which are now in -rc git tree.)
>
> In the latest -rc series, the kernel avoids accessing res_counter when
> cgroup is root cgroup. This helps scalabilty when memcg is not used.
>
> It's necessary to improve scalabilty even when memcg is used. This patch
> is for that. Previous Balbir's work shows that the biggest obstacles for
> better scalabilty is memcg's res_counter. Then, there are 2 ways.
>
> (1) make counter scale well.
> (2) avoid accessing core counter as much as possible.
>
> My first direction was (1). But no, there is no counter which is free
> from false sharing when it needs system-wide fine grain synchronization.
> And res_counter has several functionality...this makes (1) difficult.
> spin_lock (in slow path) around counter means tons of invalidation will
> happen even when we just access counter without modification.
>
> This patch series is for (2). This implements charge/uncharge in bached manner.
> This coalesces access to res_counter at charge/uncharge using nature of
> access locality.
>
> Tested for a month. And I got good reorts from Balbir and Nishimura, thanks.
> One concern is that this adds some members to the bottom of task_struct.
> Better idea is welcome.
>
> Following is test result of continuous page-fault on my 8cpu box(x86-64).
>
> A loop like this runs on all cpus in parallel for 60secs.
> ==
> while (1) {
> x = mmap(NULL, MEGA, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
>
> for (off = 0; off < MEGA; off += PAGE_SIZE)
> x[off]=0;
> munmap(x, MEGA);
> }
> ==
> please see # of page faults. I think this is good improvement.
>
>
> [Before]
> Performance counter stats for './runpause.sh' (5 runs):
>
> 474539.756944 task-clock-msecs # 7.890 CPUs ( +- 0.015% )
> 10284 context-switches # 0.000 M/sec ( +- 0.156% )
> 12 CPU-migrations # 0.000 M/sec ( +- 0.000% )
> 18425800 page-faults # 0.039 M/sec ( +- 0.107% )
> 1486296285360 cycles # 3132.080 M/sec ( +- 0.029% )
> 380334406216 instructions # 0.256 IPC ( +- 0.058% )
> 3274206662 cache-references # 6.900 M/sec ( +- 0.453% )
> 1272947699 cache-misses # 2.682 M/sec ( +- 0.118% )
>
> 60.147907341 seconds time elapsed ( +- 0.010% )
>
> [After]
> Performance counter stats for './runpause.sh' (5 runs):
>
> 474658.997489 task-clock-msecs # 7.891 CPUs ( +- 0.006% )
> 10250 context-switches # 0.000 M/sec ( +- 0.020% )
> 11 CPU-migrations # 0.000 M/sec ( +- 0.000% )
> 33177858 page-faults # 0.070 M/sec ( +- 0.152% )
> 1485264748476 cycles # 3129.120 M/sec ( +- 0.021% )
> 409847004519 instructions # 0.276 IPC ( +- 0.123% )
> 3237478723 cache-references # 6.821 M/sec ( +- 0.574% )
> 1182572827 cache-misses # 2.491 M/sec ( +- 0.179% )
>
> 60.151786309 seconds time elapsed ( +- 0.014% )
>

I agree; I liked the previous patchset, so let me re-review this one!
Definitely a good candidate for -mm.

--
Balbir

2009-10-08 22:18:44

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation (fixed compile bug)

On Fri, 2 Oct 2009 16:02:13 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

>
> ...
>
> In massive parallel enviroment, res_counter can be a performance bottleneck.
> One strong techinque to reduce lock contention is reducing calls by
> coalescing some amount of calls into one.
>
> Considering charge/uncharge chatacteristic,
> - charge is done one by one via demand-paging.
> - uncharge is done by
> - in chunk at munmap, truncate, exit, execve...
> - one by one via vmscan/paging.
>
> It seems we have a chance in uncharge at unmap/truncation.
>
> This patch is a for coalescing uncharge. For avoiding scattering memcg's
> structure to functions under /mm, this patch adds memcg batch uncharge
> information to the task.
>
> The degree of coalescing depends on callers
> - at invalidate/trucate... pagevec size
> - at unmap ....ZAP_BLOCK_SIZE
> (memory itself will be freed in this degree.)
> Then, we'll not coalescing too much.
>
>
> ...
>
> +static void
> +__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
> +{
> + struct memcg_batch_info *batch = NULL;
> + bool uncharge_memsw = true;
> + /* If swapout, usage of swap doesn't decrease */
> + if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> + uncharge_memsw = false;
> + /*
> + * do_batch > 0 when unmapping pages or inode invalidate/truncate.
> + * In those cases, all pages freed continously can be expected to be in
> + * the same cgroup and we have chance to coalesce uncharges.
> + * And, we do uncharge one by one if this is killed by OOM.
> + */
> + if (!current->memcg_batch.do_batch || test_thread_flag(TIF_MEMDIE))
> + goto direct_uncharge;
> +
> + batch = &current->memcg_batch;
> + /*
> + * In usual, we do css_get() when we remember memcg pointer.
> + * But in this case, we keep res->usage until end of a series of
> + * uncharges. Then, it's ok to ignore memcg's refcnt.
> + */
> + if (!batch->memcg)
> + batch->memcg = mem;
> + /*
> + * In typical case, batch->memcg == mem. This means we can
> + * merge a series of uncharges to an uncharge of res_counter.
> + * If not, we uncharge res_counter ony by one.
> + */
> + if (batch->memcg != mem)
> + goto direct_uncharge;
> + /* remember freed charge and uncharge it later */
> + batch->pages += PAGE_SIZE;

->pages is really confusingly named. It doesn't count pages, it counts
bytes!

We could call it `bytes', but perhaps charge_bytes would be more
communicative?

> +/*
> + * batch_start/batch_end is called in unmap_page_range/invlidate/trucate.
> + * In that cases, pages are freed continuously and we can expect pages
> + * are in the same memcg. All these calls itself limits the number of
> + * pages freed at once, then uncharge_start/end() is called properly.
> + */
> +
> +void mem_cgroup_uncharge_start(void)
> +{
> + if (!current->memcg_batch.do_batch) {
> + current->memcg_batch.memcg = NULL;
> + current->memcg_batch.pages = 0;
> + current->memcg_batch.memsw = 0;

what's memsw?

> + }
> + current->memcg_batch.do_batch++;
> +}
> +
>
> ...
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
> + struct memcg_batch_info {
> + int do_batch;
> + struct mem_cgroup *memcg;
> + long pages, memsw;
> + } memcg_batch;
> +#endif

I find the most valuable documentation is that which is devoted to the
data structures. This one didn't get any :(

Negative values of `pages' and `memsw' are meaningless, so it would be
better to give them an unsigned type. That matches the
res_counter_charge() expectations also.
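One possible shape after both comments above are addressed might be the
following (a sketch only, not the actual follow-up patch): documented fields,
byte-based names, unsigned types.
==
#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
	struct memcg_batch_info {
		int do_batch;	/* incremented at uncharge_start(), decremented at _end() */
		struct mem_cgroup *memcg;  /* memcg the current batch belongs to */
		unsigned long bytes;	   /* uncharged usage, in bytes */
		unsigned long memsw_bytes; /* uncharged mem+swap usage, in bytes */
	} memcg_batch;
#endif
==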

2009-10-08 22:27:39

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/2] memcg: coalescing charges per cpu

On Fri, 2 Oct 2009 14:03:43 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> From: KAMEZAWA Hiroyuki <[email protected]>
>
> This is a patch for coalescing access to res_counter at charging by
> percpu caching. At charge, memcg charges 64pages and remember it in
> percpu cache. Because it's cache, drain/flush is done if necessary.
>
> This version uses public percpu area.
> 2 benefits of using public percpu area.
> 1. Sum of stocked charge in the system is limited to # of cpus
> not to the number of memcg. This shows better synchonization.
> 2. drain code for flush/cpuhotplug is very easy (and quick)
>
> The most important point of this patch is that we never touch res_counter
> in fast path. The res_counter is system-wide shared counter which is modified
> very frequently. We shouldn't touch it as far as we can for avoiding
> false sharing.
>
> ...
>
> +/* size of first charge trial. "32" comes from vmscan.c's magic value */
> +#define CHARGE_SIZE (32 * PAGE_SIZE)
> +struct memcg_stock_pcp {
> + struct mem_cgroup *cached;
> + int charge;
> + struct work_struct work;
> +};
> +static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> +static DEFINE_MUTEX(memcg_drain_mutex);
> +
> +static bool consume_stock(struct mem_cgroup *mem)
> +{
> + struct memcg_stock_pcp *stock;
> + bool ret = true;
> +
> + stock = &get_cpu_var(memcg_stock);
> + if (mem == stock->cached && stock->charge)
> + stock->charge -= PAGE_SIZE;
> + else
> + ret = false;
> + put_cpu_var(memcg_stock);
> + return ret;
> +}

It's unobvious what the return value from this function means. A nice
comment would help.
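Something along these lines would capture the contract (wording is just a
sketch):
==
/*
 * consume_stock(): try to serve one page worth of charge from this
 * cpu's cached stock.  Returns true if PAGE_SIZE was taken from the
 * stock and the caller can skip res_counter_charge(); returns false
 * if the stock is empty or cached for a different memcg, in which
 * case the caller must charge the res_counter itself.
 */
==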

> +static void drain_stock(struct memcg_stock_pcp *stock)
> +{
> + struct mem_cgroup *old = stock->cached;
> +
> + if (stock->charge) {
> + res_counter_uncharge(&old->res, stock->charge);
> + if (do_swap_account)
> + res_counter_uncharge(&old->memsw, stock->charge);
> + }
> + stock->cached = NULL;
> + stock->charge = 0;
> +}
> +
> +static void drain_local_stock(struct work_struct *dummy)
> +{
> + struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
> + drain_stock(stock);
> + put_cpu_var(memcg_stock);
> +}

drain_local_stock() is only ever run by a thread which is pinned to a
particular CPU, so we can use plain old __get_cpu_var() here and remove
the put_cpu_var(), methinks. If this is done, a comment should be
added explaining why we can use that optimisation.
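Roughly like this, under the assumption that drain_local_stock() is only ever
queued via schedule_work_on() for the cpu whose stock it drains (a sketch,
untested):
==
static void drain_local_stock(struct work_struct *dummy)
{
	/*
	 * Always runs on the cpu whose stock it drains (queued with
	 * schedule_work_on()), so plain __get_cpu_var() without the
	 * get/put_cpu_var() pair is enough.
	 */
	struct memcg_stock_pcp *stock = &__get_cpu_var(memcg_stock);
	drain_stock(stock);
}
==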

> +static void refill_stock(struct mem_cgroup *mem, int val)
> +{
> + struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
> +
> + if (stock->cached != mem) {
> + drain_stock(stock);
> + stock->cached = mem;
> + }
> + stock->charge += val;
> + put_cpu_var(memcg_stock);
> +}
> +
> +static void drain_all_stock_async(void)
> +{
> + int cpu;
> + /* Contention means someone tries to flush. */
> + if (!mutex_trylock(&memcg_drain_mutex))
> + return;

Any time I see a trylock I ask "hm, what happens when it fails - what
are the consequences and how do we later rerun the aborted operation".

That's unobvious here and merits a comment.

> + for_each_online_cpu(cpu) {

Did we need get_online_cpus() here?

> + struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> + if (work_pending(&stock->work))
> + continue;
> + INIT_WORK(&stock->work, drain_local_stock);
> + schedule_work_on(cpu, &stock->work);
> + }
> + mutex_unlock(&memcg_drain_mutex);
> + /* We don't wait for flush_work */
> +}
> +
> +static void drain_all_stock_sync(void)
> +{
> + /* called when force_empty is called */
> + mutex_lock(&memcg_drain_mutex);
> + schedule_on_each_cpu(drain_local_stock);
> + mutex_unlock(&memcg_drain_mutex);
> +}

In fact it would be nice for each of these functions to have a brief
note describing its role.
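For drain_all_stock_async(), folding in the trylock explanation and the
get_online_cpus() protection might look like this (a sketch, not the posted
patch):
==
/* Flush everyone's cached charge back to the res_counter, asynchronously. */
static void drain_all_stock_async(void)
{
	int cpu;
	/*
	 * If the mutex is contended, someone else is already flushing all
	 * stocks; the cached charges will reach the res_counter soon anyway,
	 * so it is safe to simply give up here rather than retry.
	 */
	if (!mutex_trylock(&memcg_drain_mutex))
		return;
	/* keep the cpu online map stable while queueing per-cpu work */
	get_online_cpus();
	for_each_online_cpu(cpu) {
		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
		if (work_pending(&stock->work))
			continue;
		INIT_WORK(&stock->work, drain_local_stock);
		schedule_work_on(cpu, &stock->work);
	}
	put_online_cpus();
	mutex_unlock(&memcg_drain_mutex);
	/* asynchronous: we do not wait for the work items to finish */
}
==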

> +static int __cpuinit memcg_stock_cpu_callback(struct notifier_block *nb,
> + unsigned long action,
> + void *hcpu)
> +{
> +#ifdef CONFIG_HOTPLUG_CPU
> + int cpu = (unsigned long)hcpu;
> + struct memcg_stock_pcp *stock;
> +
> + if (action != CPU_DEAD)
> + return NOTIFY_OK;
> + stock = &per_cpu(memcg_stock, cpu);
> + drain_stock(stock);
> +#endif
> + return NOTIFY_OK;
> +}

Is the ifdef needed? Using hotcpu_notifier() should cause this entire
function to disappear from vmlinux when CONFIG_HOTPLUG_CPU=n.
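The ifdef-free form would just be the following (a sketch; the function should
be discarded by the compiler when hotcpu_notifier() degenerates to a no-op
with CONFIG_HOTPLUG_CPU=n):
==
static int __cpuinit memcg_stock_cpu_callback(struct notifier_block *nb,
					unsigned long action,
					void *hcpu)
{
	int cpu = (unsigned long)hcpu;
	struct memcg_stock_pcp *stock;

	if (action != CPU_DEAD)
		return NOTIFY_OK;
	stock = &per_cpu(memcg_stock, cpu);
	drain_stock(stock);
	return NOTIFY_OK;
}
==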

> /*
> * Unlike exported interface, "oom" parameter is added. if oom==true,
> * oom-killer can be invoked.
> @@ -1269,6 +1370,7 @@ static int __mem_cgroup_try_charge(struc
> struct mem_cgroup *mem, *mem_over_limit;
> int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> struct res_counter *fail_res;
> + int csize = CHARGE_SIZE;
>
> if (unlikely(test_thread_flag(TIF_MEMDIE))) {
> /* Don't account this! */
> @@ -1293,23 +1395,25 @@ static int __mem_cgroup_try_charge(struc
> return 0;
>
> VM_BUG_ON(css_is_removed(&mem->css));
> + if (mem_cgroup_is_root(mem))
> + goto done;
>
> while (1) {
> int ret = 0;
> unsigned long flags = 0;
>
> - if (mem_cgroup_is_root(mem))
> - goto done;
> - ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
> + if (consume_stock(mem))
> + goto charged;
> +
> + ret = res_counter_charge(&mem->res, csize, &fail_res);
> if (likely(!ret)) {
> if (!do_swap_account)
> break;
> - ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
> - &fail_res);
> + ret = res_counter_charge(&mem->memsw, csize, &fail_res);
> if (likely(!ret))
> break;
> /* mem+swap counter fails */
> - res_counter_uncharge(&mem->res, PAGE_SIZE);
> + res_counter_uncharge(&mem->res, csize);
> flags |= MEM_CGROUP_RECLAIM_NOSWAP;
> mem_over_limit = mem_cgroup_from_res_counter(fail_res,
> memsw);
> @@ -1318,6 +1422,11 @@ static int __mem_cgroup_try_charge(struc
> mem_over_limit = mem_cgroup_from_res_counter(fail_res,
> res);
>
> + /* reduce request size and retry */
> + if (csize > PAGE_SIZE) {
> + csize = PAGE_SIZE;
> + continue;
> + }
> if (!(gfp_mask & __GFP_WAIT))
> goto nomem;
>
> @@ -1347,6 +1456,9 @@ static int __mem_cgroup_try_charge(struc
> goto nomem;
> }
> }
> + if (csize > PAGE_SIZE)
> + refill_stock(mem, csize - PAGE_SIZE);
> +charged:
> /*
> * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
> * if they exceeds softlimit.
> @@ -2463,6 +2575,7 @@ move_account:
> goto out;
> /* This is for making all *used* pages to be on LRU. */
> lru_add_drain_all();
> + drain_all_stock_sync();
> ret = 0;
> for_each_node_state(node, N_HIGH_MEMORY) {
> for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
> @@ -3181,6 +3294,7 @@ mem_cgroup_create(struct cgroup_subsys *
> root_mem_cgroup = mem;
> if (mem_cgroup_soft_limit_tree_init())
> goto free_out;
> + hotcpu_notifier(memcg_stock_cpu_callback, 0);
>
> } else {
> parent = mem_cgroup_from_cont(cont->parent);

2009-10-08 23:51:54

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation (fixed compile bug)

On Thu, 8 Oct 2009 15:17:10 -0700
Andrew Morton <[email protected]> wrote:

> On Fri, 2 Oct 2009 16:02:13 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> >
> > ...
> >
> > In massive parallel enviroment, res_counter can be a performance bottleneck.
> > One strong techinque to reduce lock contention is reducing calls by
> > coalescing some amount of calls into one.
> >
> > Considering charge/uncharge chatacteristic,
> > - charge is done one by one via demand-paging.
> > - uncharge is done by
> > - in chunk at munmap, truncate, exit, execve...
> > - one by one via vmscan/paging.
> >
> > It seems we have a chance in uncharge at unmap/truncation.
> >
> > This patch is a for coalescing uncharge. For avoiding scattering memcg's
> > structure to functions under /mm, this patch adds memcg batch uncharge
> > information to the task.
> >
> > The degree of coalescing depends on callers
> > - at invalidate/trucate... pagevec size
> > - at unmap ....ZAP_BLOCK_SIZE
> > (memory itself will be freed in this degree.)
> > Then, we'll not coalescing too much.
> >
> >
> > ...
> >
> > +static void
> > +__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
> > +{
> > + struct memcg_batch_info *batch = NULL;
> > + bool uncharge_memsw = true;
> > + /* If swapout, usage of swap doesn't decrease */
> > + if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> > + uncharge_memsw = false;
> > + /*
> > + * do_batch > 0 when unmapping pages or inode invalidate/truncate.
> > + * In those cases, all pages freed continously can be expected to be in
> > + * the same cgroup and we have chance to coalesce uncharges.
> > + * And, we do uncharge one by one if this is killed by OOM.
> > + */
> > + if (!current->memcg_batch.do_batch || test_thread_flag(TIF_MEMDIE))
> > + goto direct_uncharge;
> > +
> > + batch = &current->memcg_batch;
> > + /*
> > + * In usual, we do css_get() when we remember memcg pointer.
> > + * But in this case, we keep res->usage until end of a series of
> > + * uncharges. Then, it's ok to ignore memcg's refcnt.
> > + */
> > + if (!batch->memcg)
> > + batch->memcg = mem;
> > + /*
> > + * In typical case, batch->memcg == mem. This means we can
> > + * merge a series of uncharges to an uncharge of res_counter.
> > + * If not, we uncharge res_counter ony by one.
> > + */
> > + if (batch->memcg != mem)
> > + goto direct_uncharge;
> > + /* remember freed charge and uncharge it later */
> > + batch->pages += PAGE_SIZE;
>
> ->pages is really confusingly named. It doesn't count pages, it counts
> bytes!
>
> We could call it `bytes', but perhaps charge_bytes would be more
> communicative?
>
Ah, I agree. I'll change this to "bytes".

> > +/*
> > + * batch_start/batch_end is called in unmap_page_range/invlidate/trucate.
> > + * In that cases, pages are freed continuously and we can expect pages
> > + * are in the same memcg. All these calls itself limits the number of
> > + * pages freed at once, then uncharge_start/end() is called properly.
> > + */
> > +
> > +void mem_cgroup_uncharge_start(void)
> > +{
> > + if (!current->memcg_batch.do_batch) {
> > + current->memcg_batch.memcg = NULL;
> > + current->memcg_batch.pages = 0;
> > + current->memcg_batch.memsw = 0;
>
> what's memsw?
>
Ah, memcontrol.c uses "memsw" in several places.

For example, the memory usage interface shown to the user is

memory.usage_in_bytes

and the memory+swap usage interface shown to the user is

memory.memsw.usage_in_bytes.

But, hmm, this struct is visible from sched.h then...

memsw_bytes or memory_and_swap_bytes ?



> > + }
> > + current->memcg_batch.do_batch++;
> > +}
> > +
> >
> > ...
> >
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
> > + struct memcg_batch_info {
> > + int do_batch;
> > + struct mem_cgroup *memcg;
> > + long pages, memsw;
> > + } memcg_batch;
> > +#endif
>
> I find the most valuable documetnation is that which is devoted to the
> data structures. This one didn't get any :(
>
> Negative values of `pages' and `memsw' are meaningless, so it would be
> better to give them an unsigned type. That matches the
> res_counter_charge() expectations also.
>
Agreed. I'll rewrite this part with appropriate comments.
Thank you for pointing that out.

Regards,
-Kame

2009-10-08 23:57:40

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 2/2] memcg: coalescing charges per cpu

On Thu, 8 Oct 2009 15:26:20 -0700
Andrew Morton <[email protected]> wrote:

> On Fri, 2 Oct 2009 14:03:43 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > This is a patch for coalescing access to res_counter at charging by
> > percpu caching. At charge, memcg charges 64pages and remember it in
> > percpu cache. Because it's cache, drain/flush is done if necessary.
> >
> > This version uses public percpu area.
> > 2 benefits of using public percpu area.
> > 1. Sum of stocked charge in the system is limited to # of cpus
> > not to the number of memcg. This shows better synchonization.
> > 2. drain code for flush/cpuhotplug is very easy (and quick)
> >
> > The most important point of this patch is that we never touch res_counter
> > in fast path. The res_counter is system-wide shared counter which is modified
> > very frequently. We shouldn't touch it as far as we can for avoiding
> > false sharing.
> >
> > ...
> >
> > +/* size of first charge trial. "32" comes from vmscan.c's magic value */
> > +#define CHARGE_SIZE (32 * PAGE_SIZE)
> > +struct memcg_stock_pcp {
> > + struct mem_cgroup *cached;
> > + int charge;
> > + struct work_struct work;
> > +};
> > +static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> > +static DEFINE_MUTEX(memcg_drain_mutex);
> > +
> > +static bool consume_stock(struct mem_cgroup *mem)
> > +{
> > + struct memcg_stock_pcp *stock;
> > + bool ret = true;
> > +
> > + stock = &get_cpu_var(memcg_stock);
> > + if (mem == stock->cached && stock->charge)
> > + stock->charge -= PAGE_SIZE;
> > + else
> > + ret = false;
> > + put_cpu_var(memcg_stock);
> > + return ret;
> > +}
>
> It's unobvious what the return value from this function means. A nice
> comment would help.

will do.

>
> > +static void drain_stock(struct memcg_stock_pcp *stock)
> > +{
> > + struct mem_cgroup *old = stock->cached;
> > +
> > + if (stock->charge) {
> > + res_counter_uncharge(&old->res, stock->charge);
> > + if (do_swap_account)
> > + res_counter_uncharge(&old->memsw, stock->charge);
> > + }
> > + stock->cached = NULL;
> > + stock->charge = 0;
> > +}
> > +
> > +static void drain_local_stock(struct work_struct *dummy)
> > +{
> > + struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
> > + drain_stock(stock);
> > + put_cpu_var(memcg_stock);
> > +}
>
> drain_local_stock() is only ever run by a thread which is pinned to a
> particular CPU, so we can use plain old __get_cpu_var() here and remove
> the put_cpu_var(), methinks. If this is done, a comment shuold be
> added explaining why we can use that optimisation.
>

Ok. I'll try and test that.


> > +static void refill_stock(struct mem_cgroup *mem, int val)
> > +{
> > + struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
> > +
> > + if (stock->cached != mem) {
> > + drain_stock(stock);
> > + stock->cached = mem;
> > + }
> > + stock->charge += val;
> > + put_cpu_var(memcg_stock);
> > +}
> > +
> > +static void drain_all_stock_async(void)
> > +{
> > + int cpu;
> > + /* Contention means someone tries to flush. */
> > + if (!mutex_trylock(&memcg_drain_mutex))
> > + return;
>
> Any time I see a trylock I ask "hm, what happens when it fails - what
> are the consequences and how to we later rerun the aborted operation".
>
I'll add more comments.

> That's unobvious here and merits a comment.
>
> > + for_each_online_cpu(cpu) {
>
> Did we need get_online_cpus() here?
>

Ahhhh, yes. It's necessary.

> > + struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> > + if (work_pending(&stock->work))
> > + continue;
> > + INIT_WORK(&stock->work, drain_local_stock);
> > + schedule_work_on(cpu, &stock->work);
> > + }
> > + mutex_unlock(&memcg_drain_mutex);
> > + /* We don't wait for flush_work */
> > +}
> > +
> > +static void drain_all_stock_sync(void)
> > +{
> > + /* called when force_empty is called */
> > + mutex_lock(&memcg_drain_mutex);
> > + schedule_on_each_cpu(drain_local_stock);
> > + mutex_unlock(&memcg_drain_mutex);
> > +}
>
> In fact it would be nice were each of these functions to have a brief
> note describing their role.
>
Okay. I'll add.

> > +static int __cpuinit memcg_stock_cpu_callback(struct notifier_block *nb,
> > + unsigned long action,
> > + void *hcpu)
> > +{
> > +#ifdef CONFIG_HOTPLUG_CPU
> > + int cpu = (unsigned long)hcpu;
> > + struct memcg_stock_pcp *stock;
> > +
> > + if (action != CPU_DEAD)
> > + return NOTIFY_OK;
> > + stock = &per_cpu(memcg_stock, cpu);
> > + drain_stock(stock);
> > +#endif
> > + return NOTIFY_OK;
> > +}
>
> Is the ifdef needed? Using hotcpu_notifier() should cause this entire
> function to disappear from vmlinux when CONFIG_HOTPLUG_CPU=n.
>
Ah, I didn't notice that. I'll try.

Thank you.

Regards,
-Kame

> > /*
> > * Unlike exported interface, "oom" parameter is added. if oom==true,
> > * oom-killer can be invoked.
> > @@ -1269,6 +1370,7 @@ static int __mem_cgroup_try_charge(struc
> > struct mem_cgroup *mem, *mem_over_limit;
> > int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> > struct res_counter *fail_res;
> > + int csize = CHARGE_SIZE;
> >
> > if (unlikely(test_thread_flag(TIF_MEMDIE))) {
> > /* Don't account this! */
> > @@ -1293,23 +1395,25 @@ static int __mem_cgroup_try_charge(struc
> > return 0;
> >
> > VM_BUG_ON(css_is_removed(&mem->css));
> > + if (mem_cgroup_is_root(mem))
> > + goto done;
> >
> > while (1) {
> > int ret = 0;
> > unsigned long flags = 0;
> >
> > - if (mem_cgroup_is_root(mem))
> > - goto done;
> > - ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
> > + if (consume_stock(mem))
> > + goto charged;
> > +
> > + ret = res_counter_charge(&mem->res, csize, &fail_res);
> > if (likely(!ret)) {
> > if (!do_swap_account)
> > break;
> > - ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
> > - &fail_res);
> > + ret = res_counter_charge(&mem->memsw, csize, &fail_res);
> > if (likely(!ret))
> > break;
> > /* mem+swap counter fails */
> > - res_counter_uncharge(&mem->res, PAGE_SIZE);
> > + res_counter_uncharge(&mem->res, csize);
> > flags |= MEM_CGROUP_RECLAIM_NOSWAP;
> > mem_over_limit = mem_cgroup_from_res_counter(fail_res,
> > memsw);
> > @@ -1318,6 +1422,11 @@ static int __mem_cgroup_try_charge(struc
> > mem_over_limit = mem_cgroup_from_res_counter(fail_res,
> > res);
> >
> > + /* reduce request size and retry */
> > + if (csize > PAGE_SIZE) {
> > + csize = PAGE_SIZE;
> > + continue;
> > + }
> > if (!(gfp_mask & __GFP_WAIT))
> > goto nomem;
> >
> > @@ -1347,6 +1456,9 @@ static int __mem_cgroup_try_charge(struc
> > goto nomem;
> > }
> > }
> > + if (csize > PAGE_SIZE)
> > + refill_stock(mem, csize - PAGE_SIZE);
> > +charged:
> > /*
> > * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
> > * if they exceeds softlimit.
> > @@ -2463,6 +2575,7 @@ move_account:
> > goto out;
> > /* This is for making all *used* pages to be on LRU. */
> > lru_add_drain_all();
> > + drain_all_stock_sync();
> > ret = 0;
> > for_each_node_state(node, N_HIGH_MEMORY) {
> > for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
> > @@ -3181,6 +3294,7 @@ mem_cgroup_create(struct cgroup_subsys *
> > root_mem_cgroup = mem;
> > if (mem_cgroup_soft_limit_tree_init())
> > goto free_out;
> > + hotcpu_notifier(memcg_stock_cpu_callback, 0);
> >
> > } else {
> > parent = mem_cgroup_from_cont(cont->parent);
>

2009-10-09 04:02:26

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation

* KAMEZAWA Hiroyuki <[email protected]> [2009-10-02 14:01:26]:

>
> In massive parallel enviroment, res_counter can be a performance bottleneck.
> One strong techinque to reduce lock contention is reducing calls by
> coalescing some amount of calls into one.
>
> Considering charge/uncharge chatacteristic,
> - charge is done one by one via demand-paging.
> - uncharge is done by
> - in chunk at munmap, truncate, exit, execve...
> - one by one via vmscan/paging.
>
> It seems we have a chance in uncharge at unmap/truncation.

A chance to improve scalability?

>
> This patch is a for coalescing uncharge. For avoiding scattering memcg's
> structure to functions under /mm, this patch adds memcg batch uncharge
> information to the task.
>

Is there a reason for associating the batch with the task rather than
making it per-cpu or per-memcg? Per-memcg, I suspect, would add some locking
overhead, and per-cpu would require synchronization across cpus while
uncharging; is that where per-task helps? I suspect per-mm and
per-signal would have the same issues.

> The degree of coalescing depends on callers
> - at invalidate/trucate... pagevec size
> - at unmap ....ZAP_BLOCK_SIZE
> (memory itself will be freed in this degree.)
> Then, we'll not coalescing too much.
>
> Changelog(now):
> - rebased onto the latest mmotm + softlimit fix patches.
>
> Changelog(old):
> - unified patch for callers
> - added commetns.
> - make ->do_batch as bool.
> - removed css_get() at el. We don't need it.
>
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> include/linux/memcontrol.h | 13 ++++++
> include/linux/sched.h | 7 +++
> mm/memcontrol.c | 91 ++++++++++++++++++++++++++++++++++++++++++---
> mm/memory.c | 2
> mm/truncate.c | 6 ++
> 5 files changed, 113 insertions(+), 6 deletions(-)
>
> Index: mmotm-2.6.31-Sep28/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/include/linux/memcontrol.h
> +++ mmotm-2.6.31-Sep28/include/linux/memcontrol.h
> @@ -54,6 +54,11 @@ extern void mem_cgroup_rotate_lru_list(s
> extern void mem_cgroup_del_lru(struct page *page);
> extern void mem_cgroup_move_lists(struct page *page,
> enum lru_list from, enum lru_list to);
> +
> +/* For coalescing uncharge for reducing memcg' overhead*/
> +extern void mem_cgroup_uncharge_start(void);
> +extern void mem_cgroup_uncharge_end(void);
> +
> extern void mem_cgroup_uncharge_page(struct page *page);
> extern void mem_cgroup_uncharge_cache_page(struct page *page);
> extern int mem_cgroup_shmem_charge_fallback(struct page *page,
> @@ -151,6 +156,14 @@ static inline void mem_cgroup_cancel_cha
> {
> }
>
> +static inline void mem_cgroup_uncharge_batch_start(void)
> +{
> +}
> +
> +static inline void mem_cgroup_uncharge_batch_start(void)
> +{
> +}
> +
> static inline void mem_cgroup_uncharge_page(struct page *page)
> {
> }
> Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
> +++ mmotm-2.6.31-Sep28/mm/memcontrol.c
> @@ -1826,6 +1826,49 @@ void mem_cgroup_cancel_charge_swapin(str
> css_put(&mem->css);
> }
>
> +static void
> +__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
> +{
> + struct memcg_batch_info *batch = NULL;
> + bool uncharge_memsw = true;
> + /* If swapout, usage of swap doesn't decrease */
> + if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> + uncharge_memsw = false;
> + /*
> + * do_batch > 0 when unmapping pages or inode invalidate/truncate.
> + * In those cases, all pages freed continously can be expected to be in
> + * the same cgroup and we have chance to coalesce uncharges.
> + * And, we do uncharge one by one if this is killed by OOM.
> + */
> + if (!current->memcg_batch.do_batch || test_thread_flag(TIF_MEMDIE))
> + goto direct_uncharge;

Should we also not uncharge the current batch when the task is dying?

> +
> + batch = &current->memcg_batch;
> + /*
> + * In usual, we do css_get() when we remember memcg pointer.
> + * But in this case, we keep res->usage until end of a series of
> + * uncharges. Then, it's ok to ignore memcg's refcnt.
> + */
> + if (!batch->memcg)
> + batch->memcg = mem;
> + /*
> + * In typical case, batch->memcg == mem. This means we can
> + * merge a series of uncharges to an uncharge of res_counter.
> + * If not, we uncharge res_counter ony by one.
> + */
> + if (batch->memcg != mem)
> + goto direct_uncharge;
> + /* remember freed charge and uncharge it later */
> + batch->pages += PAGE_SIZE;
> + if (uncharge_memsw)
> + batch->memsw += PAGE_SIZE;
> + return;
> +direct_uncharge:
> + res_counter_uncharge(&mem->res, PAGE_SIZE);
> + if (uncharge_memsw)
> + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> + return;
> +}
>
> /*
> * uncharge if !page_mapped(page)
> @@ -1874,12 +1917,8 @@ __mem_cgroup_uncharge_common(struct page
> break;
> }
>
> - if (!mem_cgroup_is_root(mem)) {
> - res_counter_uncharge(&mem->res, PAGE_SIZE);
> - if (do_swap_account &&
> - (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
> - res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> - }
> + if (!mem_cgroup_is_root(mem))
> + __do_uncharge(mem, ctype);
> if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> mem_cgroup_swap_statistics(mem, true);
> mem_cgroup_charge_statistics(mem, pc, false);
> @@ -1925,6 +1964,46 @@ void mem_cgroup_uncharge_cache_page(stru
> __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
> }
>
> +/*
> + * batch_start/batch_end is called in unmap_page_range/invlidate/trucate.
> + * In that cases, pages are freed continuously and we can expect pages
> + * are in the same memcg. All these calls itself limits the number of
> + * pages freed at once, then uncharge_start/end() is called properly.
> + */
> +
> +void mem_cgroup_uncharge_start(void)
> +{
> + if (!current->memcg_batch.do_batch) {
> + current->memcg_batch.memcg = NULL;
> + current->memcg_batch.pages = 0;
> + current->memcg_batch.memsw = 0;
> + }
> + current->memcg_batch.do_batch++;
> +}
> +
> +void mem_cgroup_uncharge_end(void)
> +{
> + struct mem_cgroup *mem;
> +
> + if (!current->memcg_batch.do_batch)
> + return;
> +
> + current->memcg_batch.do_batch--;
> + if (current->memcg_batch.do_batch) /* Nested ? */
> + return;
> +
> + mem = current->memcg_batch.memcg;
> + if (!mem)
> + return;
> + /* This "mem" is valid bacause we hide charges behind us. */
> + if (current->memcg_batch.pages)
> + res_counter_uncharge(&mem->res, current->memcg_batch.pages);
> + if (current->memcg_batch.memsw)
> + res_counter_uncharge(&mem->memsw, current->memcg_batch.memsw);
> + /* Not necessary. but forget this pointer */
> + current->memcg_batch.memcg = NULL;
> +}
> +
> #ifdef CONFIG_SWAP
> /*
> * called after __delete_from_swap_cache() and drop "page" account.
> Index: mmotm-2.6.31-Sep28/include/linux/sched.h
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/include/linux/sched.h
> +++ mmotm-2.6.31-Sep28/include/linux/sched.h
> @@ -1549,6 +1549,13 @@ struct task_struct {
> unsigned long trace_recursion;
> #endif /* CONFIG_TRACING */
> unsigned long stack_start;
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
> + struct memcg_batch_info {
> + int do_batch;
> + struct mem_cgroup *memcg;
> + long pages, memsw;
> + } memcg_batch;
> +#endif
> };
>
> /* Future-safe accessor for struct task_struct's cpus_allowed. */
> Index: mmotm-2.6.31-Sep28/mm/memory.c
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/mm/memory.c
> +++ mmotm-2.6.31-Sep28/mm/memory.c
> @@ -940,6 +940,7 @@ static unsigned long unmap_page_range(st
> details = NULL;
>
> BUG_ON(addr >= end);
> + mem_cgroup_uncharge_start();
> tlb_start_vma(tlb, vma);
> pgd = pgd_offset(vma->vm_mm, addr);
> do {
> @@ -952,6 +953,7 @@ static unsigned long unmap_page_range(st
> zap_work, details);
> } while (pgd++, addr = next, (addr != end && *zap_work > 0));
> tlb_end_vma(tlb, vma);
> + mem_cgroup_uncharge_end();
>
> return addr;
> }
> Index: mmotm-2.6.31-Sep28/mm/truncate.c
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/mm/truncate.c
> +++ mmotm-2.6.31-Sep28/mm/truncate.c
> @@ -272,6 +272,7 @@ void truncate_inode_pages_range(struct a
> pagevec_release(&pvec);
> break;
> }
> + mem_cgroup_uncharge_start();
> for (i = 0; i < pagevec_count(&pvec); i++) {
> struct page *page = pvec.pages[i];
>
> @@ -286,6 +287,7 @@ void truncate_inode_pages_range(struct a
> unlock_page(page);
> }
> pagevec_release(&pvec);
> + mem_cgroup_uncharge_end();
> }
> }
> EXPORT_SYMBOL(truncate_inode_pages_range);
> @@ -327,6 +329,7 @@ unsigned long invalidate_mapping_pages(s
> pagevec_init(&pvec, 0);
> while (next <= end &&
> pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> + mem_cgroup_uncharge_start();
> for (i = 0; i < pagevec_count(&pvec); i++) {
> struct page *page = pvec.pages[i];
> pgoff_t index;
> @@ -354,6 +357,7 @@ unsigned long invalidate_mapping_pages(s
> break;
> }
> pagevec_release(&pvec);
> + mem_cgroup_uncharge_end();
> cond_resched();
> }
> return ret;
> @@ -428,6 +432,7 @@ int invalidate_inode_pages2_range(struct
> while (next <= end && !wrapped &&
> pagevec_lookup(&pvec, mapping, next,
> min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
> + mem_cgroup_uncharge_start();
> for (i = 0; i < pagevec_count(&pvec); i++) {
> struct page *page = pvec.pages[i];
> pgoff_t page_index;
> @@ -477,6 +482,7 @@ int invalidate_inode_pages2_range(struct
> unlock_page(page);
> }
> pagevec_release(&pvec);
> + mem_cgroup_uncharge_end();
> cond_resched();
> }
> return ret;
>

The patch overall looks good, just some questions about it.

--
Balbir

2009-10-09 04:16:29

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH 2/2] memcg: coalescing charges per cpu

* KAMEZAWA Hiroyuki <[email protected]> [2009-10-02 14:03:43]:

> From: KAMEZAWA Hiroyuki <[email protected]>
>
> This is a patch for coalescing access to res_counter at charging by
> percpu caching. At charge, memcg charges 64pages and remember it in
> percpu cache. Because it's cache, drain/flush is done if necessary.
>
> This version uses public percpu area.
> 2 benefits of using public percpu area.
> 1. Sum of stocked charge in the system is limited to # of cpus
> not to the number of memcg. This shows better synchonization.
> 2. drain code for flush/cpuhotplug is very easy (and quick)
>
> The most important point of this patch is that we never touch res_counter
> in fast path. The res_counter is system-wide shared counter which is modified
> very frequently. We shouldn't touch it as far as we can for avoiding
> false sharing.
>
> Changelog (new):
> - rabased onto the latest mmotm
> Changelog (old):
> - moved charge size check before __GFP_WAIT check for avoiding unnecesary
> - added asynchronous flush routine.
> - fixed bugs pointed out by Nishimura-san.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> mm/memcontrol.c | 126 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 120 insertions(+), 6 deletions(-)
>
> Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
> +++ mmotm-2.6.31-Sep28/mm/memcontrol.c
> @@ -38,6 +38,7 @@
> #include <linux/vmalloc.h>
> #include <linux/mm_inline.h>
> #include <linux/page_cgroup.h>
> +#include <linux/cpu.h>
> #include "internal.h"
>
> #include <asm/uaccess.h>
> @@ -275,6 +276,7 @@ enum charge_type {
> static void mem_cgroup_get(struct mem_cgroup *mem);
> static void mem_cgroup_put(struct mem_cgroup *mem);
> static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
> +static void drain_all_stock_async(void);
>
> static struct mem_cgroup_per_zone *
> mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> @@ -1137,6 +1139,8 @@ static int mem_cgroup_hierarchical_recla
> victim = mem_cgroup_select_victim(root_mem);
> if (victim == root_mem) {
> loop++;
> + if (loop >= 1)
> + drain_all_stock_async();
> if (loop >= 2) {
> /*
> * If we have not been able to reclaim
> @@ -1258,6 +1262,103 @@ done:
> unlock_page_cgroup(pc);
> }
>
> +/* size of first charge trial. "32" comes from vmscan.c's magic value */
> +#define CHARGE_SIZE (32 * PAGE_SIZE)

Should this then be SWAP_CLUSTER_MAX instead of 32?

> +struct memcg_stock_pcp {
> + struct mem_cgroup *cached;
> + int charge;
> + struct work_struct work;
> +};
> +static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> +static DEFINE_MUTEX(memcg_drain_mutex);
> +
> +static bool consume_stock(struct mem_cgroup *mem)
> +{
> + struct memcg_stock_pcp *stock;
> + bool ret = true;
> +
> + stock = &get_cpu_var(memcg_stock);
> + if (mem == stock->cached && stock->charge)
> + stock->charge -= PAGE_SIZE;
> + else
> + ret = false;
> + put_cpu_var(memcg_stock);
> + return ret;
> +}


Shouldn't consume_stock() and the routines below check for mem_cgroup_is_root()?

> +
> +static void drain_stock(struct memcg_stock_pcp *stock)
> +{
> + struct mem_cgroup *old = stock->cached;
> +
> + if (stock->charge) {
> + res_counter_uncharge(&old->res, stock->charge);
> + if (do_swap_account)
> + res_counter_uncharge(&old->memsw, stock->charge);
> + }
> + stock->cached = NULL;
> + stock->charge = 0;
> +}
> +
> +static void drain_local_stock(struct work_struct *dummy)
> +{
> + struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
> + drain_stock(stock);
> + put_cpu_var(memcg_stock);
> +}
> +
> +static void refill_stock(struct mem_cgroup *mem, int val)
> +{
> + struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
> +
> + if (stock->cached != mem) {
> + drain_stock(stock);
> + stock->cached = mem;
> + }

More comments here would help

> + stock->charge += val;
> + put_cpu_var(memcg_stock);
> +}
> +
> +static void drain_all_stock_async(void)
> +{
> + int cpu;
> + /* Contention means someone tries to flush. */
> + if (!mutex_trylock(&memcg_drain_mutex))
> + return;
> + for_each_online_cpu(cpu) {
> + struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> + if (work_pending(&stock->work))
> + continue;
> + INIT_WORK(&stock->work, drain_local_stock);
> + schedule_work_on(cpu, &stock->work);
> + }
> + mutex_unlock(&memcg_drain_mutex);
> + /* We don't wait for flush_work */
> +}
> +
> +static void drain_all_stock_sync(void)
> +{
> + /* called when force_empty is called */
> + mutex_lock(&memcg_drain_mutex);
> + schedule_on_each_cpu(drain_local_stock);
> + mutex_unlock(&memcg_drain_mutex);
> +}
> +
> +static int __cpuinit memcg_stock_cpu_callback(struct notifier_block *nb,
> + unsigned long action,
> + void *hcpu)
> +{
> +#ifdef CONFIG_HOTPLUG_CPU
> + int cpu = (unsigned long)hcpu;
> + struct memcg_stock_pcp *stock;
> +
> + if (action != CPU_DEAD)
> + return NOTIFY_OK;
> + stock = &per_cpu(memcg_stock, cpu);
> + drain_stock(stock);
> +#endif
> + return NOTIFY_OK;
> +}
> +
> /*
> * Unlike exported interface, "oom" parameter is added. if oom==true,
> * oom-killer can be invoked.
> @@ -1269,6 +1370,7 @@ static int __mem_cgroup_try_charge(struc
> struct mem_cgroup *mem, *mem_over_limit;
> int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> struct res_counter *fail_res;
> + int csize = CHARGE_SIZE;
>
> if (unlikely(test_thread_flag(TIF_MEMDIE))) {
> /* Don't account this! */
> @@ -1293,23 +1395,25 @@ static int __mem_cgroup_try_charge(struc
> return 0;
>
> VM_BUG_ON(css_is_removed(&mem->css));
> + if (mem_cgroup_is_root(mem))
> + goto done;
>
> while (1) {
> int ret = 0;
> unsigned long flags = 0;
>
> - if (mem_cgroup_is_root(mem))
> - goto done;
> - ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
> + if (consume_stock(mem))
> + goto charged;
> +
> + ret = res_counter_charge(&mem->res, csize, &fail_res);
> if (likely(!ret)) {
> if (!do_swap_account)
> break;
> - ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
> - &fail_res);
> + ret = res_counter_charge(&mem->memsw, csize, &fail_res);
> if (likely(!ret))
> break;
> /* mem+swap counter fails */
> - res_counter_uncharge(&mem->res, PAGE_SIZE);
> + res_counter_uncharge(&mem->res, csize);
> flags |= MEM_CGROUP_RECLAIM_NOSWAP;
> mem_over_limit = mem_cgroup_from_res_counter(fail_res,
> memsw);
> @@ -1318,6 +1422,11 @@ static int __mem_cgroup_try_charge(struc
> mem_over_limit = mem_cgroup_from_res_counter(fail_res,
> res);
>
> + /* reduce request size and retry */
> + if (csize > PAGE_SIZE) {
> + csize = PAGE_SIZE;
> + continue;
> + }
> if (!(gfp_mask & __GFP_WAIT))
> goto nomem;
>
> @@ -1347,6 +1456,9 @@ static int __mem_cgroup_try_charge(struc
> goto nomem;
> }
> }
> + if (csize > PAGE_SIZE)
> + refill_stock(mem, csize - PAGE_SIZE);
> +charged:
> /*
> * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
> * if they exceeds softlimit.
> @@ -2463,6 +2575,7 @@ move_account:
> goto out;
> /* This is for making all *used* pages to be on LRU. */
> lru_add_drain_all();
> + drain_all_stock_sync();
> ret = 0;
> for_each_node_state(node, N_HIGH_MEMORY) {
> for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
> @@ -3181,6 +3294,7 @@ mem_cgroup_create(struct cgroup_subsys *
> root_mem_cgroup = mem;
> if (mem_cgroup_soft_limit_tree_init())
> goto free_out;
> + hotcpu_notifier(memcg_stock_cpu_callback, 0);
>
> } else {
> parent = mem_cgroup_from_cont(cont->parent);
>

I tested earlier versions of the patchset and they worked absolutely
fine for me!


--
Balbir

2009-10-09 04:20:56

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation

On Fri, 9 Oct 2009 09:31:42 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2009-10-02 14:01:26]:
>
> >
> > In massive parallel enviroment, res_counter can be a performance bottleneck.
> > One strong techinque to reduce lock contention is reducing calls by
> > coalescing some amount of calls into one.
> >
> > Considering charge/uncharge chatacteristic,
> > - charge is done one by one via demand-paging.
> > - uncharge is done by
> > - in chunk at munmap, truncate, exit, execve...
> > - one by one via vmscan/paging.
> >
> > It seems we have a chance in uncharge at unmap/truncation.
>
> A chance to improve scalability?

yes.


> >
> > This patch is a for coalescing uncharge. For avoiding scattering memcg's
> > structure to functions under /mm, this patch adds memcg batch uncharge
> > information to the task.
> >
>
> Is there a reason for associating batch with the task rather than
> per-cpu or per-memcg? per-memcg,

Per-cpu is bad because we can't be robust against task migration.
(I mean a per-cpu version cannot be as clean/simple as this patch.)

Per-memcg doesn't work well. If we design it per-memcg, we'd need some logic
like a percpu counter, and its batch length cannot be big enough.
And we'd need finer-grained synchronization than this patch.


> I suspect would add some locking
> overhead, per-cpu would require synchronization across cpu's while
> uncharging, is that where per-task helps?

Yes, that's one of the reasons for this design.

> I suspect per-mm, per-signal will have the issues above.
Per-mm and per-signal can't help: multiple threads can do file truncations (and unmaps)
in parallel.

The biggest reason for per-task is that I'd like to change this behavior based on
the calling _context_. A per-task batch means we can make use of the caller's context.
For example, when uncharge is called by try_to_unmap() or __remove_mapping() in
vmscan.c, we do a direct uncharge.
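To make the context point concrete, the calling pattern is roughly the
following (the wrapper name here is made up; the real call sites are
unmap_page_range() and the pagevec loops in mm/truncate.c):
==
/* illustrative wrapper, not a function in the patch */
static void uncharge_pages_of_current_task(struct page **pages, int nr)
{
	int i;

	mem_cgroup_uncharge_start();	/* current->memcg_batch.do_batch++ */
	for (i = 0; i < nr; i++)
		mem_cgroup_uncharge_page(pages[i]); /* accumulates in current */
	mem_cgroup_uncharge_end();	/* one res_counter_uncharge() for the batch */
}
==
Because the batch state lives in "current", no locking is needed, the batch
follows the task across cpus, and callers such as vmscan that never call
mem_cgroup_uncharge_start() automatically fall back to direct uncharge.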


>
> > The degree of coalescing depends on callers
> > - at invalidate/trucate... pagevec size
> > - at unmap ....ZAP_BLOCK_SIZE
> > (memory itself will be freed in this degree.)
> > Then, we'll not coalescing too much.
> >
> > Changelog(now):
> > - rebased onto the latest mmotm + softlimit fix patches.
> >
> > Changelog(old):
> > - unified patch for callers
> > - added commetns.
> > - make ->do_batch as bool.
> > - removed css_get() at el. We don't need it.
> >
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > include/linux/memcontrol.h | 13 ++++++
> > include/linux/sched.h | 7 +++
> > mm/memcontrol.c | 91 ++++++++++++++++++++++++++++++++++++++++++---
> > mm/memory.c | 2
> > mm/truncate.c | 6 ++
> > 5 files changed, 113 insertions(+), 6 deletions(-)
> >
> > Index: mmotm-2.6.31-Sep28/include/linux/memcontrol.h
> > ===================================================================
> > --- mmotm-2.6.31-Sep28.orig/include/linux/memcontrol.h
> > +++ mmotm-2.6.31-Sep28/include/linux/memcontrol.h
> > @@ -54,6 +54,11 @@ extern void mem_cgroup_rotate_lru_list(s
> > extern void mem_cgroup_del_lru(struct page *page);
> > extern void mem_cgroup_move_lists(struct page *page,
> > enum lru_list from, enum lru_list to);
> > +
> > +/* For coalescing uncharge for reducing memcg' overhead*/
> > +extern void mem_cgroup_uncharge_start(void);
> > +extern void mem_cgroup_uncharge_end(void);
> > +
> > extern void mem_cgroup_uncharge_page(struct page *page);
> > extern void mem_cgroup_uncharge_cache_page(struct page *page);
> > extern int mem_cgroup_shmem_charge_fallback(struct page *page,
> > @@ -151,6 +156,14 @@ static inline void mem_cgroup_cancel_cha
> > {
> > }
> >
> > +static inline void mem_cgroup_uncharge_batch_start(void)
> > +{
> > +}
> > +
> > +static inline void mem_cgroup_uncharge_batch_start(void)
> > +{
> > +}
> > +
> > static inline void mem_cgroup_uncharge_page(struct page *page)
> > {
> > }
> > Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
> > +++ mmotm-2.6.31-Sep28/mm/memcontrol.c
> > @@ -1826,6 +1826,49 @@ void mem_cgroup_cancel_charge_swapin(str
> > css_put(&mem->css);
> > }
> >
> > +static void
> > +__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
> > +{
> > + struct memcg_batch_info *batch = NULL;
> > + bool uncharge_memsw = true;
> > + /* If swapout, usage of swap doesn't decrease */
> > + if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> > + uncharge_memsw = false;
> > + /*
> > + * do_batch > 0 when unmapping pages or inode invalidate/truncate.
> > + * In those cases, all pages freed continously can be expected to be in
> > + * the same cgroup and we have chance to coalesce uncharges.
> > + * And, we do uncharge one by one if this is killed by OOM.
> > + */
> > + if (!current->memcg_batch.do_batch || test_thread_flag(TIF_MEMDIE))
> > + goto direct_uncharge;
>
> Should we also not uncharge the current batch when the task is dying?
>
TIF_MEMDIE means an OOM kill is in progress. I take the "reduce usage as fast
as possible" approach here.

> > +
> > + batch = &current->memcg_batch;
> > + /*
> > + * In usual, we do css_get() when we remember memcg pointer.
> > + * But in this case, we keep res->usage until end of a series of
> > + * uncharges. Then, it's ok to ignore memcg's refcnt.
> > + */
> > + if (!batch->memcg)
> > + batch->memcg = mem;
> > + /*
> > + * In typical case, batch->memcg == mem. This means we can
> > + * merge a series of uncharges to an uncharge of res_counter.
> > + * If not, we uncharge res_counter ony by one.
> > + */
> > + if (batch->memcg != mem)
> > + goto direct_uncharge;
> > + /* remember freed charge and uncharge it later */
> > + batch->pages += PAGE_SIZE;
> > + if (uncharge_memsw)
> > + batch->memsw += PAGE_SIZE;
> > + return;
> > +direct_uncharge:
> > + res_counter_uncharge(&mem->res, PAGE_SIZE);
> > + if (uncharge_memsw)
> > + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> > + return;
> > +}
> >
> > /*
> > * uncharge if !page_mapped(page)
> > @@ -1874,12 +1917,8 @@ __mem_cgroup_uncharge_common(struct page
> > break;
> > }
> >
> > - if (!mem_cgroup_is_root(mem)) {
> > - res_counter_uncharge(&mem->res, PAGE_SIZE);
> > - if (do_swap_account &&
> > - (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
> > - res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> > - }
> > + if (!mem_cgroup_is_root(mem))
> > + __do_uncharge(mem, ctype);
> > if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> > mem_cgroup_swap_statistics(mem, true);
> > mem_cgroup_charge_statistics(mem, pc, false);
> > @@ -1925,6 +1964,46 @@ void mem_cgroup_uncharge_cache_page(stru
> > __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
> > }
> >
> > +/*
> > + * batch_start/batch_end is called in unmap_page_range/invlidate/trucate.
> > + * In that cases, pages are freed continuously and we can expect pages
> > + * are in the same memcg. All these calls itself limits the number of
> > + * pages freed at once, then uncharge_start/end() is called properly.
> > + */
> > +
> > +void mem_cgroup_uncharge_start(void)
> > +{
> > + if (!current->memcg_batch.do_batch) {
> > + current->memcg_batch.memcg = NULL;
> > + current->memcg_batch.pages = 0;
> > + current->memcg_batch.memsw = 0;
> > + }
> > + current->memcg_batch.do_batch++;
> > +}
> > +
> > +void mem_cgroup_uncharge_end(void)
> > +{
> > + struct mem_cgroup *mem;
> > +
> > + if (!current->memcg_batch.do_batch)
> > + return;
> > +
> > + current->memcg_batch.do_batch--;
> > + if (current->memcg_batch.do_batch) /* Nested ? */
> > + return;
> > +
> > + mem = current->memcg_batch.memcg;
> > + if (!mem)
> > + return;
> > + /* This "mem" is valid bacause we hide charges behind us. */
> > + if (current->memcg_batch.pages)
> > + res_counter_uncharge(&mem->res, current->memcg_batch.pages);
> > + if (current->memcg_batch.memsw)
> > + res_counter_uncharge(&mem->memsw, current->memcg_batch.memsw);
> > + /* Not necessary. but forget this pointer */
> > + current->memcg_batch.memcg = NULL;
> > +}
> > +
> > #ifdef CONFIG_SWAP
> > /*
> > * called after __delete_from_swap_cache() and drop "page" account.
> > Index: mmotm-2.6.31-Sep28/include/linux/sched.h
> > ===================================================================
> > --- mmotm-2.6.31-Sep28.orig/include/linux/sched.h
> > +++ mmotm-2.6.31-Sep28/include/linux/sched.h
> > @@ -1549,6 +1549,13 @@ struct task_struct {
> > unsigned long trace_recursion;
> > #endif /* CONFIG_TRACING */
> > unsigned long stack_start;
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
> > + struct memcg_batch_info {
> > + int do_batch;
> > + struct mem_cgroup *memcg;
> > + long pages, memsw;
> > + } memcg_batch;
> > +#endif
> > };
> >
> > /* Future-safe accessor for struct task_struct's cpus_allowed. */
> > Index: mmotm-2.6.31-Sep28/mm/memory.c
> > ===================================================================
> > --- mmotm-2.6.31-Sep28.orig/mm/memory.c
> > +++ mmotm-2.6.31-Sep28/mm/memory.c
> > @@ -940,6 +940,7 @@ static unsigned long unmap_page_range(st
> > details = NULL;
> >
> > BUG_ON(addr >= end);
> > + mem_cgroup_uncharge_start();
> > tlb_start_vma(tlb, vma);
> > pgd = pgd_offset(vma->vm_mm, addr);
> > do {
> > @@ -952,6 +953,7 @@ static unsigned long unmap_page_range(st
> > zap_work, details);
> > } while (pgd++, addr = next, (addr != end && *zap_work > 0));
> > tlb_end_vma(tlb, vma);
> > + mem_cgroup_uncharge_end();
> >
> > return addr;
> > }
> > Index: mmotm-2.6.31-Sep28/mm/truncate.c
> > ===================================================================
> > --- mmotm-2.6.31-Sep28.orig/mm/truncate.c
> > +++ mmotm-2.6.31-Sep28/mm/truncate.c
> > @@ -272,6 +272,7 @@ void truncate_inode_pages_range(struct a
> > pagevec_release(&pvec);
> > break;
> > }
> > + mem_cgroup_uncharge_start();
> > for (i = 0; i < pagevec_count(&pvec); i++) {
> > struct page *page = pvec.pages[i];
> >
> > @@ -286,6 +287,7 @@ void truncate_inode_pages_range(struct a
> > unlock_page(page);
> > }
> > pagevec_release(&pvec);
> > + mem_cgroup_uncharge_end();
> > }
> > }
> > EXPORT_SYMBOL(truncate_inode_pages_range);
> > @@ -327,6 +329,7 @@ unsigned long invalidate_mapping_pages(s
> > pagevec_init(&pvec, 0);
> > while (next <= end &&
> > pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> > + mem_cgroup_uncharge_start();
> > for (i = 0; i < pagevec_count(&pvec); i++) {
> > struct page *page = pvec.pages[i];
> > pgoff_t index;
> > @@ -354,6 +357,7 @@ unsigned long invalidate_mapping_pages(s
> > break;
> > }
> > pagevec_release(&pvec);
> > + mem_cgroup_uncharge_end();
> > cond_resched();
> > }
> > return ret;
> > @@ -428,6 +432,7 @@ int invalidate_inode_pages2_range(struct
> > while (next <= end && !wrapped &&
> > pagevec_lookup(&pvec, mapping, next,
> > min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
> > + mem_cgroup_uncharge_start();
> > for (i = 0; i < pagevec_count(&pvec); i++) {
> > struct page *page = pvec.pages[i];
> > pgoff_t page_index;
> > @@ -477,6 +482,7 @@ int invalidate_inode_pages2_range(struct
> > unlock_page(page);
> > }
> > pagevec_release(&pvec);
> > + mem_cgroup_uncharge_end();
> > cond_resched();
> > }
> > return ret;
> >
>
> The patch overall looks good, just some questions about it.
>

Thanks,
-Kame

2009-10-09 04:29:15

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 2/2] memcg: coalescing charges per cpu

On Fri, 9 Oct 2009 09:45:36 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2009-10-02 14:03:43]:
>
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > This is a patch for coalescing access to res_counter at charging by
> > percpu caching. At charge, memcg charges 64pages and remember it in
> > percpu cache. Because it's cache, drain/flush is done if necessary.
> >
> > This version uses public percpu area.
> > 2 benefits of using public percpu area.
> > 1. Sum of stocked charge in the system is limited to # of cpus
> > not to the number of memcg. This shows better synchonization.
> > 2. drain code for flush/cpuhotplug is very easy (and quick)
> >
> > The most important point of this patch is that we never touch res_counter
> > in fast path. The res_counter is system-wide shared counter which is modified
> > very frequently. We shouldn't touch it as far as we can for avoiding
> > false sharing.
> >
> > Changelog (new):
> > - rabased onto the latest mmotm
> > Changelog (old):
> > - moved charge size check before __GFP_WAIT check for avoiding unnecesary
> > - added asynchronous flush routine.
> > - fixed bugs pointed out by Nishimura-san.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > mm/memcontrol.c | 126 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 120 insertions(+), 6 deletions(-)
> >
> > Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
> > +++ mmotm-2.6.31-Sep28/mm/memcontrol.c
> > @@ -38,6 +38,7 @@
> > #include <linux/vmalloc.h>
> > #include <linux/mm_inline.h>
> > #include <linux/page_cgroup.h>
> > +#include <linux/cpu.h>
> > #include "internal.h"
> >
> > #include <asm/uaccess.h>
> > @@ -275,6 +276,7 @@ enum charge_type {
> > static void mem_cgroup_get(struct mem_cgroup *mem);
> > static void mem_cgroup_put(struct mem_cgroup *mem);
> > static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
> > +static void drain_all_stock_async(void);
> >
> > static struct mem_cgroup_per_zone *
> > mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> > @@ -1137,6 +1139,8 @@ static int mem_cgroup_hierarchical_recla
> > victim = mem_cgroup_select_victim(root_mem);
> > if (victim == root_mem) {
> > loop++;
> > + if (loop >= 1)
> > + drain_all_stock_async();
> > if (loop >= 2) {
> > /*
> > * If we have not been able to reclaim
> > @@ -1258,6 +1262,103 @@ done:
> > unlock_page_cgroup(pc);
> > }
> >
> > +/* size of first charge trial. "32" comes from vmscan.c's magic value */
> > +#define CHARGE_SIZE (32 * PAGE_SIZE)
>
> Should this then be SWAP_CLUSTER_MAX instead of 32?
>
I'm not sure what number is best here, so I just borrowed a number from vmscan
(probably not a very bad number).
I'd like to keep this as it is for a while but explain why I selected it in more
detail, as in:
==
SWAP_CLUSTER_MAX (32) is the number of pages vmscan.c reclaims in one turn,
so we borrow that number here. Considering scalability, this should
be proportional to the number of cpus, but that is a TODO for now.
Maybe we cannot tune this until we get machines with 32 or 64+ cpus...


> > +struct memcg_stock_pcp {
> > + struct mem_cgroup *cached;
> > + int charge;
> > + struct work_struct work;
> > +};
> > +static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> > +static DEFINE_MUTEX(memcg_drain_mutex);
> > +
> > +static bool consume_stock(struct mem_cgroup *mem)
> > +{
> > + struct memcg_stock_pcp *stock;
> > + bool ret = true;
> > +
> > + stock = &get_cpu_var(memcg_stock);
> > + if (mem == stock->cached && stock->charge)
> > + stock->charge -= PAGE_SIZE;
> > + else
> > + ret = false;
> > + put_cpu_var(memcg_stock);
> > + return ret;
> > +}
>
>
> Shouldn't consume stock and routines below check for memcg_is_root()?
>
It is never called when mem_cgroup_is_root() is true; the caller checks that
before calling consume_stock().
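
For reference, the relevant ordering from the __mem_cgroup_try_charge() hunk
further down in this patch (excerpt only):
==
	if (mem_cgroup_is_root(mem))
		goto done;	/* root skips res_counter and the stock */
	...
	if (consume_stock(mem))
		goto charged;
==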


> > +
> > +static void drain_stock(struct memcg_stock_pcp *stock)
> > +{
> > + struct mem_cgroup *old = stock->cached;
> > +
> > + if (stock->charge) {
> > + res_counter_uncharge(&old->res, stock->charge);
> > + if (do_swap_account)
> > + res_counter_uncharge(&old->memsw, stock->charge);
> > + }
> > + stock->cached = NULL;
> > + stock->charge = 0;
> > +}
> > +
> > +static void drain_local_stock(struct work_struct *dummy)
> > +{
> > + struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
> > + drain_stock(stock);
> > + put_cpu_var(memcg_stock);
> > +}
> > +
> > +static void refill_stock(struct mem_cgroup *mem, int val)
> > +{
> > + struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
> > +
> > + if (stock->cached != mem) {
> > + drain_stock(stock);
> > + stock->cached = mem;
> > + }
>
> More comments here would help
>
Ah, yes. I'll add.
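
Something along these lines, as an illustrative sketch (final wording may differ):
==
	/*
	 * The leftover of a bulk (CHARGE_SIZE) charge is cached per cpu.
	 * A cpu caches stock for at most one memcg at a time; when the
	 * cached memcg changes, the old stock is returned to its
	 * res_counter before we start caching for the new one.
	 */
	if (stock->cached != mem) {
		drain_stock(stock);
		stock->cached = mem;
	}
==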


> > + stock->charge += val;
> > + put_cpu_var(memcg_stock);
> > +}
> > +
> > +static void drain_all_stock_async(void)
> > +{
> > + int cpu;
> > + /* Contention means someone tries to flush. */
> > + if (!mutex_trylock(&memcg_drain_mutex))
> > + return;
> > + for_each_online_cpu(cpu) {
> > + struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> > + if (work_pending(&stock->work))
> > + continue;
> > + INIT_WORK(&stock->work, drain_local_stock);
> > + schedule_work_on(cpu, &stock->work);
> > + }
> > + mutex_unlock(&memcg_drain_mutex);
> > + /* We don't wait for flush_work */
> > +}
> > +
> > +static void drain_all_stock_sync(void)
> > +{
> > + /* called when force_empty is called */
> > + mutex_lock(&memcg_drain_mutex);
> > + schedule_on_each_cpu(drain_local_stock);
> > + mutex_unlock(&memcg_drain_mutex);
> > +}
> > +
> > +static int __cpuinit memcg_stock_cpu_callback(struct notifier_block *nb,
> > + unsigned long action,
> > + void *hcpu)
> > +{
> > +#ifdef CONFIG_HOTPLUG_CPU
> > + int cpu = (unsigned long)hcpu;
> > + struct memcg_stock_pcp *stock;
> > +
> > + if (action != CPU_DEAD)
> > + return NOTIFY_OK;
> > + stock = &per_cpu(memcg_stock, cpu);
> > + drain_stock(stock);
> > +#endif
> > + return NOTIFY_OK;
> > +}
> > +
> > /*
> > * Unlike exported interface, "oom" parameter is added. if oom==true,
> > * oom-killer can be invoked.
> > @@ -1269,6 +1370,7 @@ static int __mem_cgroup_try_charge(struc
> > struct mem_cgroup *mem, *mem_over_limit;
> > int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> > struct res_counter *fail_res;
> > + int csize = CHARGE_SIZE;
> >
> > if (unlikely(test_thread_flag(TIF_MEMDIE))) {
> > /* Don't account this! */
> > @@ -1293,23 +1395,25 @@ static int __mem_cgroup_try_charge(struc
> > return 0;
> >
> > VM_BUG_ON(css_is_removed(&mem->css));
> > + if (mem_cgroup_is_root(mem))
> > + goto done;
> >
> > while (1) {
> > int ret = 0;
> > unsigned long flags = 0;
> >
> > - if (mem_cgroup_is_root(mem))
> > - goto done;
> > - ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
> > + if (consume_stock(mem))
> > + goto charged;
> > +
> > + ret = res_counter_charge(&mem->res, csize, &fail_res);
> > if (likely(!ret)) {
> > if (!do_swap_account)
> > break;
> > - ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
> > - &fail_res);
> > + ret = res_counter_charge(&mem->memsw, csize, &fail_res);
> > if (likely(!ret))
> > break;
> > /* mem+swap counter fails */
> > - res_counter_uncharge(&mem->res, PAGE_SIZE);
> > + res_counter_uncharge(&mem->res, csize);
> > flags |= MEM_CGROUP_RECLAIM_NOSWAP;
> > mem_over_limit = mem_cgroup_from_res_counter(fail_res,
> > memsw);
> > @@ -1318,6 +1422,11 @@ static int __mem_cgroup_try_charge(struc
> > mem_over_limit = mem_cgroup_from_res_counter(fail_res,
> > res);
> >
> > + /* reduce request size and retry */
> > + if (csize > PAGE_SIZE) {
> > + csize = PAGE_SIZE;
> > + continue;
> > + }
> > if (!(gfp_mask & __GFP_WAIT))
> > goto nomem;
> >
> > @@ -1347,6 +1456,9 @@ static int __mem_cgroup_try_charge(struc
> > goto nomem;
> > }
> > }
> > + if (csize > PAGE_SIZE)
> > + refill_stock(mem, csize - PAGE_SIZE);
> > +charged:
> > /*
> > * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
> > * if they exceeds softlimit.
> > @@ -2463,6 +2575,7 @@ move_account:
> > goto out;
> > /* This is for making all *used* pages to be on LRU. */
> > lru_add_drain_all();
> > + drain_all_stock_sync();
> > ret = 0;
> > for_each_node_state(node, N_HIGH_MEMORY) {
> > for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
> > @@ -3181,6 +3294,7 @@ mem_cgroup_create(struct cgroup_subsys *
> > root_mem_cgroup = mem;
> > if (mem_cgroup_soft_limit_tree_init())
> > goto free_out;
> > + hotcpu_notifier(memcg_stock_cpu_callback, 0);
> >
> > } else {
> > parent = mem_cgroup_from_cont(cont->parent);
> >
>
> I tested earlier versions of the patchset and they worked absolutely
> fine for me!
>
Thanks! I'll update the commentary.

Regards,
-Kame

2009-10-12 11:39:13

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH 0/2] memcg: improving scalability by reducing lock contention at charge/uncharge

* Ying Han <[email protected]> [2009-10-11 11:34:39]:

> 2009/10/10 KAMEZAWA Hiroyuki <[email protected]>
>
> > Ying Han wrote:
> > > Hi KAMEZAWA-san: I tested your patch set based on 2.6.32-rc3 but I don't
> > > see much improvement on the page-fault rate.
> > > Here are the numbers I got:
> > >
> > > [Before]
> > > Performance counter stats for './runpause.sh 10' (5 runs):
> > >
> > >    226272.271246  task-clock-msecs   #     3.768 CPUs    ( +-  0.193% )
> > >             4424  context-switches   #     0.000 M/sec   ( +- 14.418% )
> > >               25  CPU-migrations     #     0.000 M/sec   ( +- 23.077% )
> > >         80499059  page-faults        #     0.356 M/sec   ( +-  2.586% )
> > >     499246232482  cycles             #  2206.396 M/sec   ( +-  0.055% )
> > >     193036122022  instructions       #     0.387 IPC     ( +-  0.281% )
> > >      76548856038  cache-references   #   338.304 M/sec   ( +-  0.832% )
> > >        480196860  cache-misses       #     2.122 M/sec   ( +-  2.741% )
> > >
> > >     60.051646892  seconds time elapsed   ( +-  0.010% )
> > >
> > > [After]
> > > Performance counter stats for './runpause.sh 10' (5 runs):
> > >
> > >    226491.338475  task-clock-msecs   #     3.772 CPUs    ( +-  0.176% )
> > >             3377  context-switches   #     0.000 M/sec   ( +- 14.713% )
> > >               12  CPU-migrations     #     0.000 M/sec   ( +- 23.077% )
> > >         81867014  page-faults        #     0.361 M/sec   ( +-  3.201% )
> > >     499835798750  cycles             #  2206.865 M/sec   ( +-  0.036% )
> > >     196685031865  instructions       #     0.393 IPC     ( +-  0.286% )
> > >      81143829910  cache-references   #   358.265 M/sec   ( +-  0.428% )
> > >        119362559  cache-misses       #     0.527 M/sec   ( +-  5.291% )
> > >
> > >     60.048917062  seconds time elapsed   ( +-  0.010% )
> > >
> > > I ran it on a 4-core machine with 16G of RAM, and I modified
> > > runpause.sh to fork 4 page-fault processes instead of 8. I mounted the
> > > cgroup filesystem with only the memory subsystem enabled and ran the
> > > test in the root cgroup.
> > >
> > > I believe we might have different running environments, including the
> > > cgroup configuration. Any suggestions?
> > >
> >
> > This patch series only affects "child" cgroups. Sorry, I should have
> > written that more clearly. It has no effect on root.
> >
>
> OK, thanks for making it clearer. :) Do you mind posting the cgroup+memcg
> configuration you are running on your host?
>
> Thanks
>

Yes, root was fixed by another patchset that is now in mainline. Another check
is to see whether the res_counter lock shows up in /proc/lock_stat.

--
Balbir