2009-09-30 10:06:36

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 0/2] memcg: replace memcg's per cpu status counter with array counter like vmstat

Hi,

In the current implementation, memcg uses its own percpu counters for counting
events and the number of RSS and CACHE pages. The counter is maintained per cpu
without any synchronization, unlike vm_stat[] or percpu_counter. So, this is an
update-is-fast-but-read-is-slow counter.
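
As a rough sketch (simplified from the code that patch 2 removes, with
hypothetical names), the current scheme looks like this:

struct my_stat_cpu {
	s64 count[NR_STATS];	/* one slot per statistic, per cpu */
};

/* update: touch only this cpu's slot (caller disables preemption) */
static void my_stat_add(struct my_stat_cpu *cpustat, int idx, int val)
{
	cpustat[smp_processor_id()].count[idx] += val;
}

/* read: must walk every possible cpu, so it gets slower on big machines */
static s64 my_stat_read(struct my_stat_cpu *cpustat, int idx)
{
	s64 ret = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		ret += cpustat[cpu].count[idx];
	return ret;
}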

Because "read" for these counter was only done by memory.stat file, I thought
read-side-slowness was acceptable. Amount of memory usage, which affects
memory limit check, can be read by memory.usage_in_bytes. It's maintained
by res_counter.

But in the current -rc, the root memcg's memory usage is calculated from this per cpu
counter, and the read-side slowness may become a problem if it is read frequently.

And, given recent discussion, I wonder whether we should maintain NR_DIRTY etc.
in memcg. So, a slow-read counter will not match our requirements, I guess.
I want a counter like vm_stat[] in memcg.

These 2 patches introduce a vm_stat[]-like counter into memcg.
This is just an idea-level implementation, but I think it is not so bad.

I confirmed this patch works well. I'm now thinking about how to test performance...

Any comments are welcome.
This patch is on top of mmotm + some of my own patches... so this is just an RFC.

Regards,
-Kame


2009-09-30 10:11:57

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 1/2] percpu array counter like vmstat

This patch implements a percpu counter array. Unlike percpu_counter, its design
is based on vm_stat[], the array counter used for zone statistics. It is an
array of percpu counters.

The user can define a counter array as
struct foobar {
.....
DEFINE_ARRAY_COUNTER(name, NR_ELEMENTS);
....
};
and there will be an array of counters with NR_ELEMENTS entries.

A macro GET_ARC() is also provided. Users can read this counter with

array_counter_read(GET_ARC(&foobar.name), NAME_OF_ELEMENT)

and can add a value with

array_counter_add(GET_ARC(&foobar.name), NAME_OF_ELEMENT, val)
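
As a fuller usage sketch (hypothetical names; this mirrors how patch 2 uses the
interface):

enum my_stat_index {
	MY_STAT_FOO,
	MY_STAT_BAR,
	NR_MY_ELEMENTS,
};

struct foobar {
	/* embeds NR_MY_ELEMENTS counters in the struct */
	DEFINE_ARRAY_COUNTER(name, NR_MY_ELEMENTS);
};

static struct foobar fb;

static int __init foobar_setup(void)
{
	/* allocates the percpu part; returns -ENOMEM on failure */
	return array_counter_init(GET_ARC(&fb.name), NR_MY_ELEMENTS);
}

static void foobar_event(void)
{
	/* batched percpu update, folded into the global atomic value */
	array_counter_add(GET_ARC(&fb.name), MY_STAT_FOO, 1);
}

static long foobar_read_foo(void)
{
	/* reads the (approximate) global value */
	return array_counter_read(GET_ARC(&fb.name), MY_STAT_FOO);
}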

My first purpose in writing this is to replace memcg's private percpu counter
array with this. (Then, memcg can use a generic percpu object.)

I placed this counter in the same files as lib/percpu_counter.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/percpu_counter.h | 108 +++++++++++++++++++++++++++++++++++++++++
lib/percpu_counter.c | 83 +++++++++++++++++++++++++++++++
2 files changed, 191 insertions(+)

Index: mmotm-2.6.31-Sep28/include/linux/percpu_counter.h
===================================================================
--- mmotm-2.6.31-Sep28.orig/include/linux/percpu_counter.h
+++ mmotm-2.6.31-Sep28/include/linux/percpu_counter.h
@@ -77,6 +77,67 @@ static inline s64 percpu_counter_read_po
return 1;
}

+/*
+ * A counter for implementing a counter array like vmstat[]. Usage and behavior
+ * are similar to percpu_counter, but this one is good for handling multiple
+ * statistics. The idea comes from the vm_stat[] array implementation.
+ * atomic_long_t implies this is a 32bit counter on 32bit architectures, and
+ * this is intentional. If you want 64bit, please use percpu_counter. This
+ * does not provide a function like percpu_counter_sum().
+ *
+ * To avoid unnecessary access, it's recommended to use this via the macros.
+ * You can use this counter in a struct as follows:
+ *
+ * struct xxxx {
+ * ......
+ * DEFINE_ARRAY_COUNTER(counter, NR_MY_ELEMENTS);
+ * .....
+ * };
+ * Then, the counter array is defined within the above struct xxxx. In many
+ * cases, the address of counter[idx] can be calculated by the compiler easily.
+ *
+ * To access it, the GET_ARC() macro is provided. It can be used in the
+ * following style:
+ * array_counter_add(GET_ARC(&object->counter), idx, num);
+ * array_counter_read(GET_ARC(&object->counter), idx)
+ */
+
+struct pad_array_counter { /* Don't use this struct directly */
+ s8 batch;
+ s8 *counters;
+#ifdef CONFIG_HOTPLUG_CPU
+ struct list_head list;
+#endif
+ int elements;
+} ____cacheline_aligned_in_smp;
+
+struct array_counter {
+ struct pad_array_counter v;
+ atomic_long_t counters[0];
+};
+/* For static size definitions */
+
+#define DEFINE_ARRAY_COUNTER(name, elements) \
+ struct {\
+ struct array_counter ac;\
+ long __counters[(elements)];} name;
+
+#define GET_ARC(x) (&(x)->ac)
+
+#define INIT_ARC(x,s) do { \
+ memset((x), 0, sizeof(*(x))); \
+ array_counter_init(&(x)->ac, (s));\
+} while (0)
+
+extern int array_counter_init(struct array_counter *ac, int size);
+extern void array_counter_destroy(struct array_counter *ac);
+extern void array_counter_add(struct array_counter *ac, int idx, int val);
+
+static inline long array_counter_read(struct array_counter *ac, int idx)
+{
+ return atomic_long_read(&ac->counters[idx]);
+}
+
#else

struct percpu_counter {
@@ -129,6 +190,44 @@ static inline s64 percpu_counter_sum(str
return percpu_counter_read(fbc);
}

+struct array_counter {
+ int elements;
+ long counters[0];
+};
+/* For static size definitions (please see CONFIG_SMP case) */
+#define DEFINE_ARRAY_COUNTER(name, elements) \
+ struct {\
+ struct array_counter ac;\
+ long __counters[(elements)];\
+ } name;
+#define GET_ARC(x) (&(x)->ac)
+#define INIT_ARC(x,s) do { \
+ memset((x), 0, sizeof(*(x))); \
+ array_counter_init(&(x)->ac, (s));\
+} while (0)
+
+static inline void
+array_counter_init(struct array_counter *ac, int size)
+{
+ ac->elements = size;
+}
+static inline void array_counter_destroy(struct array_counter *ac)
+{
+ /* nothing to do */
+}
+
+static inline
+void array_counter_add(struct array_counter *ac, int idx, int val)
+{
+ ac->counters[idx] += val;
+}
+
+static inline
+long array_counter_read(struct array_counter *ac, int idx)
+{
+ return ac->counters[idx];
+}
+
#endif /* CONFIG_SMP */

static inline void percpu_counter_inc(struct percpu_counter *fbc)
@@ -146,4 +245,13 @@ static inline void percpu_counter_sub(st
percpu_counter_add(fbc, -amount);
}

+static inline void array_counter_inc(struct array_counter *ac, int idx)
+{
+ array_counter_add(ac, idx, 1);
+}
+
+static inline void array_counter_dec(struct array_counter *ac, int idx)
+{
+ array_counter_add(ac, idx, -1);
+}
#endif /* _LINUX_PERCPU_COUNTER_H */
Index: mmotm-2.6.31-Sep28/lib/percpu_counter.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/lib/percpu_counter.c
+++ mmotm-2.6.31-Sep28/lib/percpu_counter.c
@@ -144,3 +144,86 @@ static int __init percpu_counter_startup
return 0;
}
module_init(percpu_counter_startup);
+
+
+static LIST_HEAD(array_counters);
+static DEFINE_MUTEX(array_counters_lock);
+
+int array_counter_init(struct array_counter *ac, int size)
+{
+ ac->v.elements = size;
+ ac->v.counters = alloc_percpu(s8);
+ if (!ac->v.counters)
+ return -ENOMEM;
+ ac->v.batch = percpu_counter_batch;
+
+#ifdef CONFIG_HOTPLUG_CPU
+ mutex_lock(&array_counters_lock);
+ list_add(&ac->v.list, &array_counters);
+ mutex_unlock(&array_counters_lock);
+#endif
+ return 0;
+}
+
+void array_counter_destroy(struct array_counter *ac)
+{
+#ifdef CONFIG_HOTPLUG_CPU
+ mutex_lock(&array_counters_lock);
+ list_del(&ac->v.list);
+ mutex_unlock(&array_counters_lock);
+#endif
+ free_percpu(ac->v.counters);
+ ac->v.counters = NULL;
+}
+
+void array_counter_add(struct array_counter *ac, int idx, int val)
+{
+ s8 *pcount;
+ long count;
+ int cpu = get_cpu();
+
+ pcount = per_cpu_ptr(ac->v.counters, cpu);
+ count = pcount[idx] + val;
+ if ((count >= ac->v.batch) || (-count >= ac->v.batch)) {
+ atomic_long_add(count, &ac->counters[idx]);
+ pcount[idx] = 0;
+ } else
+ pcount[idx] = (s8)count;
+ put_cpu();
+}
+
+
+static int __cpuinit array_counter_hotcpu_callback(struct notifier_block *nb,
+ unsigned long action, void *hcpu)
+{
+#ifdef CONFIG_HOTPLUG_CPU
+ unsigned int cpu;
+ struct pad_array_counter *pac;
+ int idx;
+ if (action != CPU_DEAD)
+ return NOTIFY_OK;
+
+ cpu = (unsigned long)hcpu;
+ mutex_lock(&array_counters_lock);
+ list_for_each_entry(pac, &array_counters, list) {
+ s8 *pcount;
+ struct array_counter *ac;
+
+ ac = container_of(pac, struct array_counter, v);
+ pcount = per_cpu_ptr(pac->counters, cpu);
+ for (idx = 0; idx < pac->elements; idx++) {
+ atomic_long_add(pcount[idx], &ac->counters[idx]);
+ pcount[idx] = 0;
+ }
+ }
+ mutex_unlock(&array_counters_lock);
+#endif
+ return NOTIFY_OK;
+}
+
+static int __init array_counter_startup(void)
+{
+ hotcpu_notifier(array_counter_hotcpu_callback, 0);
+ return 0;
+}
+module_init(array_counter_startup);

2009-09-30 10:14:22

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 2/2] memcg: use generic percpu array_counter

Replace memcg's private percpu counter with array_counter().
I also made the counter array index names shorter for easier reading.
With this patch, one of memcg's dirty parts goes away.

Side effect:
This makes the event counter a global counter, not a per-cpu counter.
Maybe a good side effect for softlimit updates.


Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 216 +++++++++++++++++++-------------------------------------
1 file changed, 75 insertions(+), 141 deletions(-)

Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep28/mm/memcontrol.c
@@ -39,6 +39,7 @@
#include <linux/mm_inline.h>
#include <linux/page_cgroup.h>
#include <linux/cpu.h>
+#include <linux/percpu_counter.h>
#include "internal.h"

#include <asm/uaccess.h>
@@ -56,7 +57,6 @@ static int really_do_swap_account __init
#endif

static DEFINE_MUTEX(memcg_tasklist); /* can be hold under cgroup_mutex */
-#define SOFTLIMIT_EVENTS_THRESH (1000)

/*
* Statistics for memory cgroup.
@@ -65,67 +65,17 @@ enum mem_cgroup_stat_index {
/*
* For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
*/
- MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
- MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
- MEM_CGROUP_STAT_MAPPED_FILE, /* # of pages charged as file rss */
- MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
- MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
- MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use */
- MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+ MEMCG_STAT_CACHE, /* # of pages charged as cache */
+ MEMCG_STAT_RSS, /* # of pages charged as anon rss */
+ MEMCG_STAT_MAPPED_FILE, /* # of pages charged as file rss */
+ MEMCG_STAT_PGPGIN, /* # of pages paged in */
+ MEMCG_STAT_PGPGOUT, /* # of pages paged out */
+ MEMCG_STAT_EVENTS, /* sum of pagein + pageout for internal use */
+ MEMCG_STAT_SWAPOUT, /* # of pages, swapped out */

- MEM_CGROUP_STAT_NSTATS,
+ MEMCG_STAT_NSTATS,
};

-struct mem_cgroup_stat_cpu {
- s64 count[MEM_CGROUP_STAT_NSTATS];
-} ____cacheline_aligned_in_smp;
-
-struct mem_cgroup_stat {
- struct mem_cgroup_stat_cpu cpustat[0];
-};
-
-static inline void
-__mem_cgroup_stat_reset_safe(struct mem_cgroup_stat_cpu *stat,
- enum mem_cgroup_stat_index idx)
-{
- stat->count[idx] = 0;
-}
-
-static inline s64
-__mem_cgroup_stat_read_local(struct mem_cgroup_stat_cpu *stat,
- enum mem_cgroup_stat_index idx)
-{
- return stat->count[idx];
-}
-
-/*
- * For accounting under irq disable, no need for increment preempt count.
- */
-static inline void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat_cpu *stat,
- enum mem_cgroup_stat_index idx, int val)
-{
- stat->count[idx] += val;
-}
-
-static s64 mem_cgroup_read_stat(struct mem_cgroup_stat *stat,
- enum mem_cgroup_stat_index idx)
-{
- int cpu;
- s64 ret = 0;
- for_each_possible_cpu(cpu)
- ret += stat->cpustat[cpu].count[idx];
- return ret;
-}
-
-static s64 mem_cgroup_local_usage(struct mem_cgroup_stat *stat)
-{
- s64 ret;
-
- ret = mem_cgroup_read_stat(stat, MEM_CGROUP_STAT_CACHE);
- ret += mem_cgroup_read_stat(stat, MEM_CGROUP_STAT_RSS);
- return ret;
-}
-
/*
* per-zone information in memory controller.
*/
@@ -223,13 +173,13 @@ struct mem_cgroup {

unsigned int swappiness;

+ unsigned long softlimit_prev_event;
+#define SOFTLIMIT_EVENTS_THRESH (4000)
/* set when res.limit == memsw.limit */
bool memsw_is_minimum;

- /*
- * statistics. This must be placed at the end of memcg.
- */
- struct mem_cgroup_stat stat;
+ /* statistics. using percpu array counter */
+ DEFINE_ARRAY_COUNTER(stat, MEMCG_STAT_NSTATS);
};

/*
@@ -369,20 +319,17 @@ mem_cgroup_remove_exceeded(struct mem_cg

static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem)
{
- bool ret = false;
- int cpu;
- s64 val;
- struct mem_cgroup_stat_cpu *cpustat;
-
- cpu = get_cpu();
- cpustat = &mem->stat.cpustat[cpu];
- val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_EVENTS);
- if (unlikely(val > SOFTLIMIT_EVENTS_THRESH)) {
- __mem_cgroup_stat_reset_safe(cpustat, MEM_CGROUP_STAT_EVENTS);
- ret = true;
+ unsigned long val;
+ val = (unsigned long)
+ array_counter_read(GET_ARC(&mem->stat), MEMCG_STAT_EVENTS);
+
+ if (unlikely(time_after(val, mem->softlimit_prev_event +
+ SOFTLIMIT_EVENTS_THRESH))) {
+ /* there can be a race, but it is not necessary to correct it. */
+ mem->softlimit_prev_event = val;
+ return true;
}
- put_cpu();
- return ret;
+ return false;
}

static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
@@ -480,14 +427,10 @@ mem_cgroup_largest_soft_limit_node(struc
static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
bool charge)
{
- int val = (charge) ? 1 : -1;
- struct mem_cgroup_stat *stat = &mem->stat;
- struct mem_cgroup_stat_cpu *cpustat;
- int cpu = get_cpu();
-
- cpustat = &stat->cpustat[cpu];
- __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_SWAPOUT, val);
- put_cpu();
+ if (charge)
+ array_counter_inc(GET_ARC(&mem->stat), MEMCG_STAT_SWAPOUT);
+ else
+ array_counter_dec(GET_ARC(&mem->stat), MEMCG_STAT_SWAPOUT);
}

static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
@@ -495,24 +438,28 @@ static void mem_cgroup_charge_statistics
bool charge)
{
int val = (charge) ? 1 : -1;
- struct mem_cgroup_stat *stat = &mem->stat;
- struct mem_cgroup_stat_cpu *cpustat;
- int cpu = get_cpu();
+ struct array_counter *ac = GET_ARC(&mem->stat);

- cpustat = &stat->cpustat[cpu];
if (PageCgroupCache(pc))
- __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_CACHE, val);
+ array_counter_add(ac, MEMCG_STAT_CACHE, val);
else
- __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_RSS, val);
+ array_counter_add(ac, MEMCG_STAT_RSS, val);

if (charge)
- __mem_cgroup_stat_add_safe(cpustat,
- MEM_CGROUP_STAT_PGPGIN_COUNT, 1);
+ array_counter_inc(ac, MEMCG_STAT_PGPGIN);
else
- __mem_cgroup_stat_add_safe(cpustat,
- MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
- __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_EVENTS, 1);
- put_cpu();
+ array_counter_inc(ac, MEMCG_STAT_PGPGOUT);
+ array_counter_inc(ac, MEMCG_STAT_EVENTS);
+}
+
+/* This function ignores hierarchy */
+static unsigned long mem_cgroup_my_usage(struct mem_cgroup *mem)
+{
+ unsigned long ret;
+
+ ret = array_counter_read(GET_ARC(&mem->stat), MEMCG_STAT_RSS);
+ ret += array_counter_read(GET_ARC(&mem->stat), MEMCG_STAT_CACHE);
+ return ret;
}

static unsigned long mem_cgroup_get_local_zonestat(struct mem_cgroup *mem,
@@ -1164,7 +1111,7 @@ static int mem_cgroup_hierarchical_recla
}
}
}
- if (!mem_cgroup_local_usage(&victim->stat)) {
+ if (!mem_cgroup_my_usage(victim)) {
/* this cgroup's local usage == 0 */
css_put(&victim->css);
continue;
@@ -1230,9 +1177,6 @@ static void record_last_oom(struct mem_c
void mem_cgroup_update_mapped_file_stat(struct page *page, int val)
{
struct mem_cgroup *mem;
- struct mem_cgroup_stat *stat;
- struct mem_cgroup_stat_cpu *cpustat;
- int cpu;
struct page_cgroup *pc;

if (!page_is_file_cache(page))
@@ -1250,14 +1194,7 @@ void mem_cgroup_update_mapped_file_stat(
if (!PageCgroupUsed(pc))
goto done;

- /*
- * Preemption is already disabled, we don't need get_cpu()
- */
- cpu = smp_processor_id();
- stat = &mem->stat;
- cpustat = &stat->cpustat[cpu];
-
- __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
+ array_counter_add(GET_ARC(&mem->stat), MEMCG_STAT_MAPPED_FILE, val);
done:
unlock_page_cgroup(pc);
}
@@ -1598,9 +1535,6 @@ static int mem_cgroup_move_account(struc
int nid, zid;
int ret = -EBUSY;
struct page *page;
- int cpu;
- struct mem_cgroup_stat *stat;
- struct mem_cgroup_stat_cpu *cpustat;

VM_BUG_ON(from == to);
VM_BUG_ON(PageLRU(pc->page));
@@ -1625,18 +1559,10 @@ static int mem_cgroup_move_account(struc

page = pc->page;
if (page_is_file_cache(page) && page_mapped(page)) {
- cpu = smp_processor_id();
/* Update mapped_file data for mem_cgroup "from" */
- stat = &from->stat;
- cpustat = &stat->cpustat[cpu];
- __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE,
- -1);
-
+ array_counter_dec(GET_ARC(&from->stat), MEMCG_STAT_MAPPED_FILE);
/* Update mapped_file data for mem_cgroup "to" */
- stat = &to->stat;
- cpustat = &stat->cpustat[cpu];
- __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE,
- 1);
+ array_counter_inc(GET_ARC(&to->stat), MEMCG_STAT_MAPPED_FILE);
}

if (do_swap_account && !mem_cgroup_is_root(from))
@@ -2687,7 +2613,7 @@ static int
mem_cgroup_get_idx_stat(struct mem_cgroup *mem, void *data)
{
struct mem_cgroup_idx_data *d = data;
- d->val += mem_cgroup_read_stat(&mem->stat, d->idx);
+ d->val += array_counter_read(GET_ARC(&mem->stat), d->idx);
return 0;
}

@@ -2714,10 +2640,10 @@ static u64 mem_cgroup_read(struct cgroup
case _MEM:
if (name == RES_USAGE && mem_cgroup_is_root(mem)) {
mem_cgroup_get_recursive_idx_stat(mem,
- MEM_CGROUP_STAT_CACHE, &idx_val);
+ MEMCG_STAT_CACHE, &idx_val);
val = idx_val;
mem_cgroup_get_recursive_idx_stat(mem,
- MEM_CGROUP_STAT_RSS, &idx_val);
+ MEMCG_STAT_RSS, &idx_val);
val += idx_val;
val <<= PAGE_SHIFT;
} else
@@ -2726,13 +2652,13 @@ static u64 mem_cgroup_read(struct cgroup
case _MEMSWAP:
if (name == RES_USAGE && mem_cgroup_is_root(mem)) {
mem_cgroup_get_recursive_idx_stat(mem,
- MEM_CGROUP_STAT_CACHE, &idx_val);
+ MEMCG_STAT_CACHE, &idx_val);
val = idx_val;
mem_cgroup_get_recursive_idx_stat(mem,
- MEM_CGROUP_STAT_RSS, &idx_val);
+ MEMCG_STAT_RSS, &idx_val);
val += idx_val;
mem_cgroup_get_recursive_idx_stat(mem,
- MEM_CGROUP_STAT_SWAPOUT, &idx_val);
+ MEMCG_STAT_SWAPOUT, &idx_val);
val <<= PAGE_SHIFT;
} else
val = res_counter_read_u64(&mem->memsw, name);
@@ -2892,18 +2818,19 @@ static int mem_cgroup_get_local_stat(str
s64 val;

/* per cpu stat */
- val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_CACHE);
+ val = array_counter_read(GET_ARC(&mem->stat), MEMCG_STAT_CACHE);
s->stat[MCS_CACHE] += val * PAGE_SIZE;
- val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
+ val = array_counter_read(GET_ARC(&mem->stat), MEMCG_STAT_RSS);
s->stat[MCS_RSS] += val * PAGE_SIZE;
- val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_MAPPED_FILE);
+ val = array_counter_read(GET_ARC(&mem->stat), MEMCG_STAT_MAPPED_FILE);
s->stat[MCS_MAPPED_FILE] += val * PAGE_SIZE;
- val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_PGPGIN_COUNT);
+ val = array_counter_read(GET_ARC(&mem->stat), MEMCG_STAT_PGPGIN);
s->stat[MCS_PGPGIN] += val;
- val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_PGPGOUT_COUNT);
+ val = array_counter_read(GET_ARC(&mem->stat), MEMCG_STAT_PGPGOUT);
s->stat[MCS_PGPGOUT] += val;
if (do_swap_account) {
- val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_SWAPOUT);
+ val = array_counter_read(GET_ARC(&mem->stat),
+ MEMCG_STAT_SWAPOUT);
s->stat[MCS_SWAP] += val * PAGE_SIZE;
}

@@ -3162,16 +3089,10 @@ static void free_mem_cgroup_per_zone_inf
kfree(mem->info.nodeinfo[node]);
}

-static int mem_cgroup_size(void)
-{
- int cpustat_size = nr_cpu_ids * sizeof(struct mem_cgroup_stat_cpu);
- return sizeof(struct mem_cgroup) + cpustat_size;
-}
-
static struct mem_cgroup *mem_cgroup_alloc(void)
{
struct mem_cgroup *mem;
- int size = mem_cgroup_size();
+ int size = sizeof(struct mem_cgroup);

if (size < PAGE_SIZE)
mem = kmalloc(size, GFP_KERNEL);
@@ -3180,6 +3101,17 @@ static struct mem_cgroup *mem_cgroup_all

if (mem)
memset(mem, 0, size);
+ if (!mem)
+ return NULL;
+
+ if (array_counter_init(GET_ARC(&mem->stat), MEMCG_STAT_NSTATS)) {
+ /* can't allocate percpu counter */
+ if (size < PAGE_SIZE)
+ kfree(mem);
+ else
+ vfree(mem);
+ mem = NULL;
+ }
return mem;
}

@@ -3204,7 +3136,9 @@ static void __mem_cgroup_free(struct mem
for_each_node_state(node, N_POSSIBLE)
free_mem_cgroup_per_zone_info(mem, node);

- if (mem_cgroup_size() < PAGE_SIZE)
+ array_counter_destroy(GET_ARC(&mem->stat));
+
+ if (sizeof(*mem) < PAGE_SIZE)
kfree(mem);
else
vfree(mem);

2009-09-30 23:34:35

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH 1/2] percpu array counter like vmstat

On Wed, 30 Sep 2009 19:09:43 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> +int array_counter_init(struct array_counter *ac, int size)
> +{
> + ac->v.elements = size;
> + ac->v.counters = alloc_percpu(s8);
This is a bug, of course...
should be
ac->v.counters = __alloc_percpu(size, __alignof__(char));

Regards,
-Kame

2009-10-01 00:03:08

by Daisuke Nishimura

Subject: Re: [RFC][PATCH 1/2] percpu array counter like vmstat

> On Wed, 30 Sep 2009 19:09:43 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
>> +int array_counter_init(struct array_counter *ac, int size)
>> +{
>> + ac->v.elements = size;
>> + ac->v.counters = alloc_percpu(s8);
> This is a bug, of course...
Yes, I was confused at that point and was about to point it out :)

> should be
> ac->v.counters = __alloc_percpu(size, __alignof__(char));
>
__alloc_percpu(size * sizeof(s8), __alignof__(s8)) would be better?
There is no actual difference, though.


Thanks,
Daisuke Nishimura.

2009-10-01 00:56:46

by Daisuke Nishimura

Subject: Re: [RFC][PATCH 0/2] memcg: replace memcg's per cpu status counter with array counter like vmstat

On Wed, 30 Sep 2009 19:04:17 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> Hi,
>
> In the current implementation, memcg uses its own percpu counters for counting
> events and the number of RSS and CACHE pages. The counter is maintained per cpu
> without any synchronization, unlike vm_stat[] or percpu_counter. So, this is an
> update-is-fast-but-read-is-slow counter.
>
> Because "read" of these counters was only done via the memory.stat file, I thought
> the read-side slowness was acceptable. The amount of memory usage, which affects
> the memory limit check, can be read via memory.usage_in_bytes, which is maintained
> by res_counter.
>
> But in the current -rc, the root memcg's memory usage is calculated from this per cpu
> counter, and the read-side slowness may become a problem if it is read frequently.
>
> And, given recent discussion, I wonder whether we should maintain NR_DIRTY etc.
> in memcg. So, a slow-read counter will not match our requirements, I guess.
> I want a counter like vm_stat[] in memcg.
>
I see your concern.

But IMHO, it would be better to explain why we need a new percpu array counter
instead of using an array of percpu_counters (size, or consolidation of related
counters?), IOW, what the benefit of the percpu array counter is.


Thanks,
Daisuke Nishimura.

> These 2 patches introduce a vm_stat[]-like counter into memcg.
> This is just an idea-level implementation, but I think it is not so bad.
>
> I confirmed this patch works well. I'm now thinking about how to test performance...
>
> Any comments are welcome.
> This patch is on top of mmotm + some of my own patches... so this is just an RFC.
>
> Regards,
> -Kame
>

2009-10-01 01:31:24

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH 0/2] memcg: replace memcg's per cpu status counter with array counter like vmstat

On Thu, 1 Oct 2009 09:45:14 +0900
Daisuke Nishimura <[email protected]> wrote:

> On Wed, 30 Sep 2009 19:04:17 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > Hi,
> >
> > In the current implementation, memcg uses its own percpu counters for counting
> > events and the number of RSS and CACHE pages. The counter is maintained per cpu
> > without any synchronization, unlike vm_stat[] or percpu_counter. So, this is an
> > update-is-fast-but-read-is-slow counter.
> >
> > Because "read" of these counters was only done via the memory.stat file, I thought
> > the read-side slowness was acceptable. The amount of memory usage, which affects
> > the memory limit check, can be read via memory.usage_in_bytes, which is maintained
> > by res_counter.
> >
> > But in the current -rc, the root memcg's memory usage is calculated from this per cpu
> > counter, and the read-side slowness may become a problem if it is read frequently.
> >
> > And, given recent discussion, I wonder whether we should maintain NR_DIRTY etc.
> > in memcg. So, a slow-read counter will not match our requirements, I guess.
> > I want a counter like vm_stat[] in memcg.
> >
> I see your concern.
>
> But IMHO, it would be better to explain why we need a new percpu array counter
> instead of using an array of percpu_counters (size, or consolidation of related
> counters?), IOW, what the benefit of the percpu array counter is.
>
Ok.
An array of 4 percpu_counters means a struct like the following:

lock 4bytes (int)
count 8bytes
list_head 16bytes
pointer to percpu 8bytes
lock ...
count
list_head
pointer to percpu
lock
count
list_head
pointer to percpu
lock
count
list_head
pointer to percpu

36x4 = 144 bytes, and this has 4 spinlocks over 2 cache lines.
4 spinlocks mean that if one of the "batch" values expires on a cpu, all of the
cache above will be invalidated. Most of the read-only data will be lost.

Aligning each percpu counter to a cacheline to avoid false sharing means this
will use 4 cachelines + the percpu area.
That's bad.

An array counter with 4 entries is:
s8 batch 4bytes (will be aligned)
pointer to percpu 8bytes
elements 4bytes
list_head 16bytes
==== cacheline aligned here ==== 128bytes
atomic_long_t 4x8 == 32bytes
==== should be aligned to cache? maybe yes ====

Then, this will occupy 2 cachelines + the percpu area.
There is no false sharing in the read-only area.
All writes are done with one (locked) access.

Hmm... I may have to think more about archs which do not have atomic_xxx ops.

Considering that sets of counters can be updated at once, an array of
percpu_counters is not a good choice, I think.
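
(As a rough sketch of the comparison in struct form, assuming patch 1 is
applied; sizes are as listed above and the "stats_*" names are only for
illustration:)

#include <linux/percpu_counter.h>

/*
 * Option A: an array of 4 independent percpu_counters. Each entry carries its
 * own lock, count, list linkage and percpu pointer (~36 bytes as listed above),
 * so the hot fields of all four are spread over about 2 cache lines and a
 * single batch flush dirties data shared with the other entries.
 */
struct stats_as_percpu_counters {
	struct percpu_counter c[4];
};

/*
 * Option B: one array_counter with 4 entries. The read-mostly bookkeeping
 * (batch, percpu pointer, elements, list) sits cacheline aligned by itself,
 * the 4 atomic_long_t sums follow it, and an update is a single atomic op.
 */
struct stats_as_array_counter {
	DEFINE_ARRAY_COUNTER(stat, 4);
};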

Thanks,
-Kame