2010-08-02 10:16:13

by Kamezawa Hiroyuki

Subject: [PATCH -mm 0/5] towards I/O aware memory cgroup v3.


This is v3. I removed the terrible garbage from v2 and tested it. (No big changes.)

Now, it's the merge window, so I'll have to maintain this in my own tree for a while.
I'll continue to update it. Maybe we can make new progress after LinuxCon.
(And I'll be busy for a while.)

This set has 2+1 purposes:
1. re-design struct page_cgroup and make room for a blkio-cgroup ID.
2. implement a quick updating method for memcg's file stats.
3. (optionally) use spinlock instead of bit_spinlock.

Plans after this.

1. check the influence of Mel's new writeback method.
I think we'll hit OOM more easily. IIUC, memory cgroup needs a kswapd-like thread
to do background writeback against low/high watermarks.
(By this, we can control the priority of the background writeout thread
via CFS. This is very good.)

2. implement dirty_ratio.
Greg Thelen is working on this now. One of the biggest problems with the previous
attempt was the cost of updating the statistics. I think this patch set can reduce it.

3. record the blkio cgroup's ID.
Ikeda posted a patch for this. IIUC, it requires some consideration of (swap-in) readahead
when assigning IDs, but it seemed good in general.

That is the order of importance in my mind, but all of the above can be done in parallel.

Beyond that, some people have problems with file-cache control. If that needs to use
account migration, we have to take care of races.


Thanks,
-Kame


2010-08-02 10:18:03

by Kamezawa Hiroyuki

Subject: [PATCH -mm 1/5] quick lookup memcg by ID

From: KAMEZAWA Hiroyuki <[email protected]>

Now, memory cgroup has an ID per cgroup and makes use of it for
- hierarchy walk,
- swap recording.

This patch makes more use of it. The final purpose is
to replace the page_cgroup->mem_cgroup pointer with an unsigned short.

This patch caches a pointer to each memcg in an array. With this, we
don't have to call css_lookup(), which requires a radix-tree walk.
This saves some overhead when looking up a memcg by its ID.

Changelog: 20100730
- fixed rcu_read_unlock() placement.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
init/Kconfig | 11 +++++++++++
mm/memcontrol.c | 48 ++++++++++++++++++++++++++++++++++--------------
2 files changed, 45 insertions(+), 14 deletions(-)

Index: mmotm-0727/mm/memcontrol.c
===================================================================
--- mmotm-0727.orig/mm/memcontrol.c
+++ mmotm-0727/mm/memcontrol.c
@@ -292,6 +292,30 @@ static bool move_file(void)
&mc.to->move_charge_at_immigrate);
}

+/* 0 is unused */
+static atomic_t mem_cgroup_num;
+#define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
+static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
+
+static struct mem_cgroup *id_to_memcg(unsigned short id)
+{
+ /*
+ * This array is set to NULL when mem_cgroup is freed.
+ * IOW, there are no more references && rcu_synchronized().
+ * This lookup-caching is safe.
+ */
+ if (unlikely(!mem_cgroups[id])) {
+ struct cgroup_subsys_state *css;
+
+ rcu_read_lock();
+ css = css_lookup(&mem_cgroup_subsys, id);
+ rcu_read_unlock();
+ if (!css)
+ return NULL;
+ mem_cgroups[id] = container_of(css, struct mem_cgroup, css);
+ }
+ return mem_cgroups[id];
+}
/*
* Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
* limit reclaim to prevent infinite loops, if they ever occur.
@@ -1824,18 +1848,7 @@ static void mem_cgroup_cancel_charge(str
* it's concern. (dropping refcnt from swap can be called against removed
* memcg.)
*/
-static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
-{
- struct cgroup_subsys_state *css;

- /* ID 0 is unused ID */
- if (!id)
- return NULL;
- css = css_lookup(&mem_cgroup_subsys, id);
- if (!css)
- return NULL;
- return container_of(css, struct mem_cgroup, css);
-}

struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
{
@@ -1856,7 +1869,7 @@ struct mem_cgroup *try_get_mem_cgroup_fr
ent.val = page_private(page);
id = lookup_swap_cgroup(ent);
rcu_read_lock();
- mem = mem_cgroup_lookup(id);
+ mem = id_to_memcg(id);
if (mem && !css_tryget(&mem->css))
mem = NULL;
rcu_read_unlock();
@@ -2208,7 +2221,7 @@ __mem_cgroup_commit_charge_swapin(struct

id = swap_cgroup_record(ent, 0);
rcu_read_lock();
- memcg = mem_cgroup_lookup(id);
+ memcg = id_to_memcg(id);
if (memcg) {
/*
* This recorded memcg can be obsolete one. So, avoid
@@ -2472,7 +2485,7 @@ void mem_cgroup_uncharge_swap(swp_entry_

id = swap_cgroup_record(ent, 0);
rcu_read_lock();
- memcg = mem_cgroup_lookup(id);
+ memcg = id_to_memcg(id);
if (memcg) {
/*
* We uncharge this because swap is freed.
@@ -3988,6 +4001,9 @@ static struct mem_cgroup *mem_cgroup_all
struct mem_cgroup *mem;
int size = sizeof(struct mem_cgroup);

+ if (atomic_read(&mem_cgroup_num) == NR_MEMCG_GROUPS)
+ return NULL;
+
/* Can be very big if MAX_NUMNODES is very big */
if (size < PAGE_SIZE)
mem = kmalloc(size, GFP_KERNEL);
@@ -4025,7 +4041,10 @@ static void __mem_cgroup_free(struct mem
int node;

mem_cgroup_remove_from_trees(mem);
+ /* No more lookup against this ID */
+ mem_cgroups[css_id(&mem->css)] = NULL;
free_css_id(&mem_cgroup_subsys, &mem->css);
+ atomic_dec(&mem_cgroup_num);

for_each_node_state(node, N_POSSIBLE)
free_mem_cgroup_per_zone_info(mem, node);
@@ -4162,6 +4181,7 @@ mem_cgroup_create(struct cgroup_subsys *
atomic_set(&mem->refcnt, 1);
mem->move_charge_at_immigrate = 0;
mutex_init(&mem->thresholds_lock);
+ atomic_inc(&mem_cgroup_num);
return &mem->css;
free_out:
__mem_cgroup_free(mem);
Index: mmotm-0727/init/Kconfig
===================================================================
--- mmotm-0727.orig/init/Kconfig
+++ mmotm-0727/init/Kconfig
@@ -594,6 +594,17 @@ config CGROUP_MEM_RES_CTLR_SWAP
Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
size is 4096bytes, 512k per 1Gbytes of swap.

+config MEM_CGROUP_MAX_GROUPS
+ int "Maximum number of memory cgroups on a system"
+ range 1 65535
+ default 8192 if 64BIT
+ default 2048 if 32BIT
+ help
+ Memory cgroup has limitation of the number of groups created.
+ Please select your favorite value. The more you allow, the more
+ memory will be consumed. This consumes vmalloc() area, so,
+ this should be small on 32bit arch.
+
menuconfig CGROUP_SCHED
bool "Group CPU scheduler"
depends on EXPERIMENTAL && CGROUPS

2010-08-02 10:19:08

by Kamezawa Hiroyuki

Subject: [PATCH -mm 2/5] use ID in page cgroup

From: KAMEZAWA Hiroyuki <[email protected]>

Now, the address of a memory cgroup can be calculated from its ID without complex operations.
This patch replaces pc->mem_cgroup, turning it from a pointer into an unsigned short.
On 64bit architectures, this gives us 6 more bytes of room per page_cgroup.
2 bytes are used for blkio-cgroup's page tracking. The other 4 bytes will be used for
some light-weight concurrent access.

We may be able to move this ID into the flags field, but ...let's go step by step.

Changelog: 20100730
- fixed some garbage added by debug code in early stage

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/page_cgroup.h | 3 ++-
mm/memcontrol.c | 32 +++++++++++++++++++-------------
mm/page_cgroup.c | 2 +-
3 files changed, 22 insertions(+), 15 deletions(-)

Index: mmotm-0727/include/linux/page_cgroup.h
===================================================================
--- mmotm-0727.orig/include/linux/page_cgroup.h
+++ mmotm-0727/include/linux/page_cgroup.h
@@ -12,7 +12,8 @@
*/
struct page_cgroup {
unsigned long flags;
- struct mem_cgroup *mem_cgroup;
+ unsigned short mem_cgroup; /* ID of assigned memory cgroup */
+ unsigned short blk_cgroup; /* Not Used..but will be. */
struct page *page;
struct list_head lru; /* per cgroup LRU list */
};
Index: mmotm-0727/mm/page_cgroup.c
===================================================================
--- mmotm-0727.orig/mm/page_cgroup.c
+++ mmotm-0727/mm/page_cgroup.c
@@ -15,7 +15,7 @@ static void __meminit
__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
{
pc->flags = 0;
- pc->mem_cgroup = NULL;
+ pc->mem_cgroup = 0;
pc->page = pfn_to_page(pfn);
INIT_LIST_HEAD(&pc->lru);
}
Index: mmotm-0727/mm/memcontrol.c
===================================================================
--- mmotm-0727.orig/mm/memcontrol.c
+++ mmotm-0727/mm/memcontrol.c
@@ -379,7 +379,7 @@ struct cgroup_subsys_state *mem_cgroup_c
static struct mem_cgroup_per_zone *
page_cgroup_zoneinfo(struct page_cgroup *pc)
{
- struct mem_cgroup *mem = pc->mem_cgroup;
+ struct mem_cgroup *mem = id_to_memcg(pc->mem_cgroup);
int nid = page_cgroup_nid(pc);
int zid = page_cgroup_zid(pc);

@@ -721,6 +721,11 @@ static inline bool mem_cgroup_is_root(st
return (mem == root_mem_cgroup);
}

+static inline bool mem_cgroup_is_rootid(unsigned short id)
+{
+ return (id == 1);
+}
+
/*
* Following LRU functions are allowed to be used without PCG_LOCK.
* Operations are called by routine of global LRU independently from memcg.
@@ -753,7 +758,7 @@ void mem_cgroup_del_lru_list(struct page
*/
mz = page_cgroup_zoneinfo(pc);
MEM_CGROUP_ZSTAT(mz, lru) -= 1;
- if (mem_cgroup_is_root(pc->mem_cgroup))
+ if (mem_cgroup_is_rootid(pc->mem_cgroup))
return;
VM_BUG_ON(list_empty(&pc->lru));
list_del_init(&pc->lru);
@@ -780,7 +785,7 @@ void mem_cgroup_rotate_lru_list(struct p
*/
smp_rmb();
/* unused or root page is not rotated. */
- if (!PageCgroupUsed(pc) || mem_cgroup_is_root(pc->mem_cgroup))
+ if (!PageCgroupUsed(pc) || mem_cgroup_is_rootid(pc->mem_cgroup))
return;
mz = page_cgroup_zoneinfo(pc);
list_move(&pc->lru, &mz->lists[lru]);
@@ -806,7 +811,7 @@ void mem_cgroup_add_lru_list(struct page
mz = page_cgroup_zoneinfo(pc);
MEM_CGROUP_ZSTAT(mz, lru) += 1;
SetPageCgroupAcctLRU(pc);
- if (mem_cgroup_is_root(pc->mem_cgroup))
+ if (mem_cgroup_is_rootid(pc->mem_cgroup))
return;
list_add(&pc->lru, &mz->lists[lru]);
}
@@ -1474,7 +1479,7 @@ void mem_cgroup_update_file_mapped(struc
return;

lock_page_cgroup(pc);
- mem = pc->mem_cgroup;
+ mem = id_to_memcg(pc->mem_cgroup);
if (!mem || !PageCgroupUsed(pc))
goto done;

@@ -1862,7 +1867,7 @@ struct mem_cgroup *try_get_mem_cgroup_fr
pc = lookup_page_cgroup(page);
lock_page_cgroup(pc);
if (PageCgroupUsed(pc)) {
- mem = pc->mem_cgroup;
+ mem = id_to_memcg(pc->mem_cgroup);
if (mem && !css_tryget(&mem->css))
mem = NULL;
} else if (PageSwapCache(page)) {
@@ -1898,7 +1903,7 @@ static void __mem_cgroup_commit_charge(s
return;
}

- pc->mem_cgroup = mem;
+ pc->mem_cgroup = css_id(&mem->css);
/*
* We access a page_cgroup asynchronously without lock_page_cgroup().
* Especially when a page_cgroup is taken from a page, pc->mem_cgroup
@@ -1956,7 +1961,7 @@ static void __mem_cgroup_move_account(st
VM_BUG_ON(PageLRU(pc->page));
VM_BUG_ON(!PageCgroupLocked(pc));
VM_BUG_ON(!PageCgroupUsed(pc));
- VM_BUG_ON(pc->mem_cgroup != from);
+ VM_BUG_ON(id_to_memcg(pc->mem_cgroup) != from);

if (PageCgroupFileMapped(pc)) {
/* Update mapped_file data for mem_cgroup */
@@ -1971,7 +1976,7 @@ static void __mem_cgroup_move_account(st
mem_cgroup_cancel_charge(from);

/* caller should have done css_get */
- pc->mem_cgroup = to;
+ pc->mem_cgroup = css_id(&to->css);
mem_cgroup_charge_statistics(to, pc, true);
/*
* We charges against "to" which may not have any tasks. Then, "to"
@@ -1991,7 +1996,7 @@ static int mem_cgroup_move_account(struc
{
int ret = -EINVAL;
lock_page_cgroup(pc);
- if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
+ if (PageCgroupUsed(pc) && id_to_memcg(pc->mem_cgroup) == from) {
__mem_cgroup_move_account(pc, from, to, uncharge);
ret = 0;
}
@@ -2330,7 +2335,7 @@ __mem_cgroup_uncharge_common(struct page

lock_page_cgroup(pc);

- mem = pc->mem_cgroup;
+ mem = id_to_memcg(pc->mem_cgroup);

if (!PageCgroupUsed(pc))
goto unlock_out;
@@ -2575,7 +2580,7 @@ int mem_cgroup_prepare_migration(struct
pc = lookup_page_cgroup(page);
lock_page_cgroup(pc);
if (PageCgroupUsed(pc)) {
- mem = pc->mem_cgroup;
+ mem = id_to_memcg(pc->mem_cgroup);
css_get(&mem->css);
/*
* At migrating an anonymous page, its mapcount goes down
@@ -4398,7 +4403,8 @@ static int is_target_pte_for_mc(struct v
* mem_cgroup_move_account() checks the pc is valid or not under
* the lock.
*/
- if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
+ if (PageCgroupUsed(pc) &&
+ id_to_memcg(pc->mem_cgroup) == mc.from) {
ret = MC_TARGET_PAGE;
if (target)
target->page = page;

2010-08-02 10:20:54

by Kamezawa Hiroyuki

Subject: [PATCH -mm 3/5] memcg scalable file stat accounting method

From: KAMEZAWA Hiroyuki <[email protected]>

When accounting file events per memory cgroup, we need to find the memory cgroup
via page_cgroup->mem_cgroup. Now, we use lock_page_cgroup().

But, considering the contexts in which page_cgroups for file pages are accessed,
we can use an alternative light-weight mutual exclusion in most cases.
When handling file caches, the only race we have to take care of is "moving"
an account, IOW, overwriting page_cgroup->mem_cgroup. Because file status
updates are done while the page cache is in a stable state, we don't have to
take care of races with charge/uncharge.

Unlike charge/uncharge, "move" does not happen frequently. It happens only on
rmdir() and task-moving (with a special setting).
This patch adds a race checker for file-cache-status accounting vs. account
moving. A new per-cpu, per-memcg counter, MEM_CGROUP_ON_MOVE, is added.
The routine for account move:
1. Increments the counter before starting the move.
2. Calls synchronize_rcu().
3. Decrements the counter after the end of the move.
By this, the file-status-counting routine can check whether it needs to call
lock_page_cgroup(). In most cases, it doesn't need to (see the sketch below).
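
For reference, here is a condensed sketch of the reader-side fast path this
patch introduces (assembled from the hunks below; the statistics update, the
!PageCgroupUsed check and error handling are elided):

	rcu_read_lock();
	mem = id_to_memcg(pc->mem_cgroup);
	if (mem && mem_cgroup_is_moved(mem)) {
		/* a move may be in flight: fall back to the page_cgroup lock */
		lock_page_cgroup(pc);
		need_lock = true;
		mem = id_to_memcg(pc->mem_cgroup); /* re-read under the lock */
	}
	/* ... update the per-cpu file statistics of "mem" here ... */
	if (need_lock)
		unlock_page_cgroup(pc);
	rcu_read_unlock();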


Changelog: 20100730
- some cleanup.
Changelog: 20100729
- replaced __this_cpu_xxx() with this_cpu_xxx()
(because we no longer always hold the spinlock, preemption may be enabled here)
- added VM_BUG_ON().

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 66 insertions(+), 12 deletions(-)

Index: mmotm-0727/mm/memcontrol.c
===================================================================
--- mmotm-0727.orig/mm/memcontrol.c
+++ mmotm-0727/mm/memcontrol.c
@@ -88,6 +88,7 @@ enum mem_cgroup_stat_index {
MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */
+ MEM_CGROUP_ON_MOVE, /* A check for locking move account/status */

MEM_CGROUP_STAT_NSTATS,
};
@@ -1074,7 +1075,49 @@ static unsigned int get_swappiness(struc
return swappiness;
}

-/* A routine for testing mem is not under move_account */
+static void mem_cgroup_start_move(struct mem_cgroup *mem)
+{
+ int cpu;
+ /* for fast checking in mem_cgroup_update_file_stat() etc..*/
+ spin_lock(&mc.lock);
+ for_each_possible_cpu(cpu)
+ per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
+ spin_unlock(&mc.lock);
+
+ synchronize_rcu();
+}
+
+static void mem_cgroup_end_move(struct mem_cgroup *mem)
+{
+ int cpu;
+
+ if (!mem)
+ return;
+ /* for fast checking in mem_cgroup_update_file_stat() etc..*/
+ spin_lock(&mc.lock);
+ for_each_possible_cpu(cpu)
+ per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) -= 1;
+ spin_unlock(&mc.lock);
+}
+
+/*
+ * mem_cgroup_is_moved -- checks whether a cgroup is the mc.from target.
+ * Used for avoiding races.
+ * mem_cgroup_under_move -- checks whether a cgroup is mc.from or mc.to, or
+ * under the hierarchy of either. Used for waiting under
+ * memory pressure.
+ * The result of is_moved can be trusted until rcu_read_unlock().
+ * The caller must do
+ * rcu_read_lock();
+ * result = mem_cgroup_is_moved();
+ * .....make use of result here....
+ * rcu_read_unlock();
+ */
+static bool mem_cgroup_is_moved(struct mem_cgroup *mem)
+{
+ VM_BUG_ON(!rcu_read_lock_held());
+ return this_cpu_read(mem->stat->count[MEM_CGROUP_ON_MOVE]) > 0;
+}

static bool mem_cgroup_under_move(struct mem_cgroup *mem)
{
@@ -1473,29 +1516,36 @@ void mem_cgroup_update_file_mapped(struc
{
struct mem_cgroup *mem;
struct page_cgroup *pc;
+ bool need_lock = false;

pc = lookup_page_cgroup(page);
if (unlikely(!pc))
return;
-
- lock_page_cgroup(pc);
+ rcu_read_lock();
mem = id_to_memcg(pc->mem_cgroup);
- if (!mem || !PageCgroupUsed(pc))
+ if (likely(mem)) {
+ if (mem_cgroup_is_moved(mem)) {
+ /* need to serialize with move_account */
+ lock_page_cgroup(pc);
+ need_lock = true;
+ mem = id_to_memcg(pc->mem_cgroup);
+ if (unlikely(!mem))
+ goto done;
+ }
+ }
+ if (unlikely(!PageCgroupUsed(pc)))
goto done;
-
- /*
- * Preemption is already disabled. We can use __this_cpu_xxx
- */
if (val > 0) {
- __this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+ this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
SetPageCgroupFileMapped(pc);
} else {
- __this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+ this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
ClearPageCgroupFileMapped(pc);
}
-
done:
- unlock_page_cgroup(pc);
+ if (need_lock)
+ unlock_page_cgroup(pc);
+ rcu_read_unlock();
}

/*
@@ -3034,6 +3084,7 @@ move_account:
lru_add_drain_all();
drain_all_stock_sync();
ret = 0;
+ mem_cgroup_start_move(mem);
for_each_node_state(node, N_HIGH_MEMORY) {
for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
enum lru_list l;
@@ -3047,6 +3098,7 @@ move_account:
if (ret)
break;
}
+ mem_cgroup_end_move(mem);
memcg_oom_recover(mem);
/* it seems parent cgroup doesn't have enough mem */
if (ret == -ENOMEM)
@@ -4513,6 +4565,7 @@ static void mem_cgroup_clear_mc(void)
mc.to = NULL;
mc.moving_task = NULL;
spin_unlock(&mc.lock);
+ mem_cgroup_end_move(from);
memcg_oom_recover(from);
memcg_oom_recover(to);
wake_up_all(&mc.waitq);
@@ -4543,6 +4596,7 @@ static int mem_cgroup_can_attach(struct
VM_BUG_ON(mc.moved_charge);
VM_BUG_ON(mc.moved_swap);
VM_BUG_ON(mc.moving_task);
+ mem_cgroup_start_move(from);
spin_lock(&mc.lock);
mc.from = from;
mc.to = mem;

2010-08-02 10:22:09

by Kamezawa Hiroyuki

Subject: [PATCH -mm 4/5] memcg generic file stat accounting interface.

From: KAMEZAWA Hiroyuki <[email protected]>

Prepare for adding new statuses around file caches (dirty, writeback, etc.).
Use a unified macro and more generic names.
All counters will have the same rule for updating.

Changelog:
- clean up and moved mem_cgroup_stat_index to header file.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/memcontrol.h | 23 ++++++++++++++++++++++
include/linux/page_cgroup.h | 12 +++++------
mm/memcontrol.c | 46 ++++++++++++++++++--------------------------
3 files changed, 48 insertions(+), 33 deletions(-)

Index: mmotm-0727/include/linux/memcontrol.h
===================================================================
--- mmotm-0727.orig/include/linux/memcontrol.h
+++ mmotm-0727/include/linux/memcontrol.h
@@ -25,6 +25,29 @@ struct page_cgroup;
struct page;
struct mm_struct;

+/*
+ * Per-cpu Statistics for memory cgroup.
+ */
+enum mem_cgroup_stat_index {
+ /*
+ * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+ */
+ MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
+ MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
+ MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
+ MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
+ MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+ MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */
+ MEM_CGROUP_ON_MOVE, /* A check for locking move account/status */
+ /* About file-stat please see memcontrol.h */
+ MEM_CGROUP_FSTAT_BASE,
+ MEM_CGROUP_FSTAT_FILE_MAPPED = MEM_CGROUP_FSTAT_BASE,
+ MEM_CGROUP_FSTAT_END,
+ MEM_CGROUP_STAT_NSTATS = MEM_CGROUP_FSTAT_END,
+};
+
+#define MEMCG_FSTAT_IDX(idx) ((idx) - MEM_CGROUP_FSTAT_BASE)
+
extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
struct list_head *dst,
unsigned long *scanned, int order,
Index: mmotm-0727/mm/memcontrol.c
===================================================================
--- mmotm-0727.orig/mm/memcontrol.c
+++ mmotm-0727/mm/memcontrol.c
@@ -74,24 +74,6 @@ static int really_do_swap_account __init
#define THRESHOLDS_EVENTS_THRESH (7) /* once in 128 */
#define SOFTLIMIT_EVENTS_THRESH (10) /* once in 1024 */

-/*
- * Statistics for memory cgroup.
- */
-enum mem_cgroup_stat_index {
- /*
- * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
- */
- MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
- MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
- MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
- MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
- MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
- MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
- MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */
- MEM_CGROUP_ON_MOVE, /* A check for locking move account/status */
-
- MEM_CGROUP_STAT_NSTATS,
-};

struct mem_cgroup_stat_cpu {
s64 count[MEM_CGROUP_STAT_NSTATS];
@@ -1512,7 +1494,8 @@ bool mem_cgroup_handle_oom(struct mem_cg
* Currently used to update mapped file statistics, but the routine can be
* generalized to update other statistics as well.
*/
-void mem_cgroup_update_file_mapped(struct page *page, int val)
+static void
+mem_cgroup_update_file_stat(struct page *page, unsigned int idx, int val)
{
struct mem_cgroup *mem;
struct page_cgroup *pc;
@@ -1536,11 +1519,11 @@ void mem_cgroup_update_file_mapped(struc
if (unlikely(!PageCgroupUsed(pc)))
goto done;
if (val > 0) {
- this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- SetPageCgroupFileMapped(pc);
+ this_cpu_inc(mem->stat->count[idx]);
+ set_bit(fflag_idx(MEMCG_FSTAT_IDX(idx)), &pc->flags);
} else {
- this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- ClearPageCgroupFileMapped(pc);
+ this_cpu_dec(mem->stat->count[idx]);
+ clear_bit(fflag_idx(MEMCG_FSTAT_IDX(idx)), &pc->flags);
}
done:
if (need_lock)
@@ -1548,6 +1531,12 @@ done:
rcu_read_unlock();
}

+void mem_cgroup_update_file_mapped(struct page *page, int val)
+{
+ return mem_cgroup_update_file_stat(page,
+ MEM_CGROUP_FSTAT_FILE_MAPPED, val);
+}
+
/*
* size of first charge trial. "32" comes from vmscan.c's magic value.
* TODO: maybe necessary to use big numbers in big irons.
@@ -2007,17 +1996,20 @@ static void __mem_cgroup_commit_charge(s
static void __mem_cgroup_move_account(struct page_cgroup *pc,
struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
{
+ int i;
VM_BUG_ON(from == to);
VM_BUG_ON(PageLRU(pc->page));
VM_BUG_ON(!PageCgroupLocked(pc));
VM_BUG_ON(!PageCgroupUsed(pc));
VM_BUG_ON(id_to_memcg(pc->mem_cgroup) != from);

- if (PageCgroupFileMapped(pc)) {
+ for (i = MEM_CGROUP_FSTAT_BASE; i < MEM_CGROUP_FSTAT_END; ++i) {
+ if (!test_bit(fflag_idx(MEMCG_FSTAT_IDX(i)), &pc->flags))
+ continue;
/* Update mapped_file data for mem_cgroup */
preempt_disable();
- __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+ __this_cpu_dec(from->stat->count[i]);
+ __this_cpu_inc(to->stat->count[i]);
preempt_enable();
}
mem_cgroup_charge_statistics(from, pc, false);
@@ -3447,7 +3439,7 @@ static int mem_cgroup_get_local_stat(str
s->stat[MCS_CACHE] += val * PAGE_SIZE;
val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
s->stat[MCS_RSS] += val * PAGE_SIZE;
- val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_MAPPED);
+ val = mem_cgroup_read_stat(mem, MEM_CGROUP_FSTAT_FILE_MAPPED);
s->stat[MCS_FILE_MAPPED] += val * PAGE_SIZE;
val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PGPGIN_COUNT);
s->stat[MCS_PGPGIN] += val;
Index: mmotm-0727/include/linux/page_cgroup.h
===================================================================
--- mmotm-0727.orig/include/linux/page_cgroup.h
+++ mmotm-0727/include/linux/page_cgroup.h
@@ -40,9 +40,14 @@ enum {
PCG_CACHE, /* charged as cache */
PCG_USED, /* this object is in use. */
PCG_ACCT_LRU, /* page has been accounted for */
- PCG_FILE_MAPPED, /* page is accounted as "mapped" */
PCG_MIGRATION, /* under page migration */
+ PCG_FILE_FLAGS, /* see memcontrol.h */
};
+/*
+ * file-stat flags are defined regarding to memcg's stat information.
+ * Here, just defines a macro for indexing
+ */
+#define fflag_idx(idx) ((idx) + PCG_FILE_FLAGS)

#define TESTPCGFLAG(uname, lname) \
static inline int PageCgroup##uname(struct page_cgroup *pc) \
@@ -76,11 +81,6 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
TESTPCGFLAG(AcctLRU, ACCT_LRU)
TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)

-
-SETPCGFLAG(FileMapped, FILE_MAPPED)
-CLEARPCGFLAG(FileMapped, FILE_MAPPED)
-TESTPCGFLAG(FileMapped, FILE_MAPPED)
-
SETPCGFLAG(Migration, MIGRATION)
CLEARPCGFLAG(Migration, MIGRATION)
TESTPCGFLAG(Migration, MIGRATION)

2010-08-02 10:24:59

by Kamezawa Hiroyuki

Subject: [PATCH -mm 5/5] memcg: use spinlock in page_cgroup instead of bit_spinlock

From: KAMEZAWA Hiroyuki <[email protected]>

This patch replaces bit_spinlock with spinlock. In general,
spinlock has better functionality than bit_spin_lock and we should use
it if we have room for it. On 64bit arches, we have 4 extra bytes.
Let's use them.
Expected effects:
- better code generation.
- ticket lock on x86-64
- para-virtualization aware lock
etc..

Changelog: 20100729
- fixed page_cgroup_is_locked().

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
--
---
include/linux/page_cgroup.h | 33 ++++++++++++++++++++++++++++++++-
mm/memcontrol.c | 2 +-
mm/page_cgroup.c | 3 +++
3 files changed, 36 insertions(+), 2 deletions(-)

Index: mmotm-0727/include/linux/page_cgroup.h
===================================================================
--- mmotm-0727.orig/include/linux/page_cgroup.h
+++ mmotm-0727/include/linux/page_cgroup.h
@@ -10,8 +10,14 @@
* All page cgroups are allocated at boot or memory hotplug event,
* then the page cgroup for pfn always exists.
*/
+#ifdef CONFIG_64BIT
+#define PCG_HAS_SPINLOCK
+#endif
struct page_cgroup {
unsigned long flags;
+#ifdef PCG_HAS_SPINLOCK
+ spinlock_t lock;
+#endif
unsigned short mem_cgroup; /* ID of assigned memory cgroup */
unsigned short blk_cgroup; /* Not Used..but will be. */
struct page *page;
@@ -36,7 +42,9 @@ struct page_cgroup *lookup_page_cgroup(s

enum {
/* flags for mem_cgroup */
- PCG_LOCK, /* page cgroup is locked */
+#ifndef PCG_HAS_SPINLOCK
+ PCG_LOCK, /* page cgroup is locked (see below also.)*/
+#endif
PCG_CACHE, /* charged as cache */
PCG_USED, /* this object is in use. */
PCG_ACCT_LRU, /* page has been accounted for */
@@ -65,8 +73,6 @@ static inline void ClearPageCgroup##unam
static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
{ return test_and_clear_bit(PCG_##lname, &pc->flags); }

-TESTPCGFLAG(Locked, LOCK)
-
/* Cache flag is set only once (at allocation) */
TESTPCGFLAG(Cache, CACHE)
CLEARPCGFLAG(Cache, CACHE)
@@ -95,6 +101,22 @@ static inline enum zone_type page_cgroup
return page_zonenum(pc->page);
}

+#ifdef PCG_HAS_SPINLOCK
+static inline void lock_page_cgroup(struct page_cgroup *pc)
+{
+ spin_lock(&pc->lock);
+}
+static inline void unlock_page_cgroup(struct page_cgroup *pc)
+{
+ spin_unlock(&pc->lock);
+}
+
+static inline bool page_cgroup_is_locked(struct page_cgroup *pc)
+{
+ return spin_is_locked(&pc->lock);
+}
+
+#else
static inline void lock_page_cgroup(struct page_cgroup *pc)
{
bit_spin_lock(PCG_LOCK, &pc->flags);
@@ -105,6 +127,14 @@ static inline void unlock_page_cgroup(st
bit_spin_unlock(PCG_LOCK, &pc->flags);
}

+static inline bool page_cgroup_is_locked(struct page_cgroup *pc)
+{
+ return bit_spin_is_locked(PCG_LOCK, &pc->flags);
+}
+TESTPCGFLAG(Locked, LOCK)
+
+#endif
+
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct page_cgroup;

Index: mmotm-0727/mm/page_cgroup.c
===================================================================
--- mmotm-0727.orig/mm/page_cgroup.c
+++ mmotm-0727/mm/page_cgroup.c
@@ -18,6 +18,9 @@ __init_page_cgroup(struct page_cgroup *p
pc->mem_cgroup = 0;
pc->page = pfn_to_page(pfn);
INIT_LIST_HEAD(&pc->lru);
+#ifdef PCG_HAS_SPINLOCK
+ spin_lock_init(&pc->lock);
+#endif
}
static unsigned long total_usage;

Index: mmotm-0727/mm/memcontrol.c
===================================================================
--- mmotm-0727.orig/mm/memcontrol.c
+++ mmotm-0727/mm/memcontrol.c
@@ -1999,7 +1999,7 @@ static void __mem_cgroup_move_account(st
int i;
VM_BUG_ON(from == to);
VM_BUG_ON(PageLRU(pc->page));
- VM_BUG_ON(!PageCgroupLocked(pc));
+ VM_BUG_ON(!page_cgroup_is_locked(pc));
VM_BUG_ON(!PageCgroupUsed(pc));
VM_BUG_ON(id_to_memcg(pc->mem_cgroup) != from);

2010-08-03 02:36:59

by Balbir Singh

Subject: Re: [PATCH -mm 0/5] towards I/O aware memory cgroup v3.

* KAMEZAWA Hiroyuki <[email protected]> [2010-08-02 19:11:13]:

>
> This is v3. removed terrble garbages from v2 and tested.(no big changes)
>
> Now, it's merge-window and I'll have to maintain this in my box for a while.
> I'll continue to update this. Maybe we can make new progress after LinuxCon.
> (And I'll be busy for a while.)
>


I was catching up with my inbox and did not realize you had moved on to v3,
hence I reviewed v1 first.

> This set has 2+1 purposes.
> 1. re-desgin struct page_cgroup and makes room for blocckio-cgroup ID.
> 2. implement quick updating method for memcg's file stat.
> 3. optionally? use spin_lock instead of bit_spinlock.
>
> Plans after this.
>
> 1. check influence of Mel's new writeback method.
> I think we'll see OOM easier. IIUC, memory cgroup needs a thread like kswapd
> to do background writeback or low-high watermark.
> (By this, we can control priority of background writeout thread priority
> by CFS. This is very good.)

Agreed, background watermark based reclaim is something we should look
at.

>
> 2. implementing dirty_ratio.
> Now, Greg Thelen is working on. One of biggest problems of previous trial was
> update cost of status. I think this patch set can reduce it.

That is good news

>
> 3. record blockio cgroup's ID.
> Ikeda posted one. IIUC, it requires some consideration on (swapin)readahead
> for assigning IDs. But it seemed to be good in general.
>
> Importance is in this order in my mind. But all aboves can be done in parallel.
>
> Beyond that, some guys has problem with file-cache-control. If it need to use
> account migration, we have to take care of races.
>

--
Three Cheers,
Balbir

2010-08-03 03:22:25

by Balbir Singh

Subject: Re: [PATCH -mm 1/5] quick lookup memcg by ID

* KAMEZAWA Hiroyuki <[email protected]> [2010-08-02 19:13:04]:

> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Now, memory cgroup has an ID per cgroup and make use of it at
> - hierarchy walk,
> - swap recording.
>
> This patch is for making more use of it. The final purpose is
> to replace page_cgroup->mem_cgroup's pointer to an unsigned short.
>
> This patch caches a pointer of memcg in an array. By this, we
> don't have to call css_lookup() which requires radix-hash walk.
> This saves some amount of memory footprint at lookup memcg via id.
>

It is a memory versus speed tradeoff, but if the number of created
cgroups is low, it might not be all that slow, besides we do that for
swap_cgroup anyway - no?


--
Three Cheers,
Balbir

2010-08-03 03:26:53

by Kamezawa Hiroyuki

Subject: Re: [PATCH -mm 1/5] quick lookup memcg by ID

On Tue, 3 Aug 2010 08:52:16 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2010-08-02 19:13:04]:
>
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > Now, memory cgroup has an ID per cgroup and make use of it at
> > - hierarchy walk,
> > - swap recording.
> >
> > This patch is for making more use of it. The final purpose is
> > to replace page_cgroup->mem_cgroup's pointer to an unsigned short.
> >
> > This patch caches a pointer of memcg in an array. By this, we
> > don't have to call css_lookup() which requires radix-hash walk.
> > This saves some amount of memory footprint at lookup memcg via id.
> >
>
> It is a memory versus speed tradeoff, but if the number of created
> cgroups is low, it might not be all that slow, besides we do that for
> swap_cgroup anyway - no?
>

In the following patch, pc->mem_cgroup is changed from a pointer to an ID.
Then, this lookup happens in lru_add/del, for example.
And, by this, we can place all lookup-related things in __read_mostly.
With css_lookup(), we can't do that and have to worry about cache
behavior.

I hear there are users who create 2000+ cgroups, so considering only the
"low number" user here is not important.
This patch helps us get _stable_ performance even when there are
many cgroups.

Thanks,
-Kame

2010-08-03 03:33:34

by Balbir Singh

Subject: Re: [PATCH -mm 3/5] memcg scalable file stat accounting method

* KAMEZAWA Hiroyuki <[email protected]> [2010-08-02 19:15:59]:

> From: KAMEZAWA Hiroyuki <[email protected]>
>
> At accounting file events per memory cgroup, we need to find memory cgroup
> via page_cgroup->mem_cgroup. Now, we use lock_page_cgroup().
>
> But, considering the context which page-cgroup for files are accessed,
> we can use alternative light-weight mutual execusion in the most case.
> At handling file-caches, the only race we have to take care of is "moving"
> account, IOW, overwriting page_cgroup->mem_cgroup. Because file status
> update is done while the page-cache is in stable state, we don't have to
> take care of race with charge/uncharge.
>
> Unlike charge/uncharge, "move" happens not so frequently. It happens only when
> rmdir() and task-moving (with a special settings.)
> This patch adds a race-checker for file-cache-status accounting v.s. account
> moving. The new per-cpu-per-memcg counter MEM_CGROUP_ON_MOVE is added.
> The routine for account move
> 1. Increment it before start moving
> 2. Call synchronize_rcu()
> 3. Decrement it after the end of moving.
> By this, file-status-counting routine can check it needs to call
> lock_page_cgroup(). In most case, I doesn't need to call it.
>
>
> Changelog: 20100730
> - some cleanup.
> Changelog: 20100729
> - replaced __this_cpu_xxx() with this_cpu_xxx
> (because we don't call spinlock)
> - added VM_BUG_ON().
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> mm/memcontrol.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 66 insertions(+), 12 deletions(-)
>
> Index: mmotm-0727/mm/memcontrol.c
> ===================================================================
> --- mmotm-0727.orig/mm/memcontrol.c
> +++ mmotm-0727/mm/memcontrol.c
> @@ -88,6 +88,7 @@ enum mem_cgroup_stat_index {
> MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
> MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */
> + MEM_CGROUP_ON_MOVE, /* A check for locking move account/status */
>
> MEM_CGROUP_STAT_NSTATS,
> };
> @@ -1074,7 +1075,49 @@ static unsigned int get_swappiness(struc
> return swappiness;
> }
>
> -/* A routine for testing mem is not under move_account */
> +static void mem_cgroup_start_move(struct mem_cgroup *mem)
> +{
> + int cpu;
> + /* for fast checking in mem_cgroup_update_file_stat() etc..*/
> + spin_lock(&mc.lock);
> + for_each_possible_cpu(cpu)
> + per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;

Is for_each_possible_cpu() really required? Won't online cpus suffice? There
can be a race if a hotplug event happens between start and end of the move;
shouldn't we handle that? My concern is that with something like 1024
cpus possible today, we might need to optimize this further.

May be we can do this first and optimize later.

--
Three Cheers,
Balbir

2010-08-03 03:38:44

by Balbir Singh

Subject: Re: [PATCH -mm 1/5] quick lookup memcg by ID

* KAMEZAWA Hiroyuki <[email protected]> [2010-08-03 12:21:58]:

> On Tue, 3 Aug 2010 08:52:16 +0530
> Balbir Singh <[email protected]> wrote:
>
> > * KAMEZAWA Hiroyuki <[email protected]> [2010-08-02 19:13:04]:
> >
> > > From: KAMEZAWA Hiroyuki <[email protected]>
> > >
> > > Now, memory cgroup has an ID per cgroup and make use of it at
> > > - hierarchy walk,
> > > - swap recording.
> > >
> > > This patch is for making more use of it. The final purpose is
> > > to replace page_cgroup->mem_cgroup's pointer to an unsigned short.
> > >
> > > This patch caches a pointer of memcg in an array. By this, we
> > > don't have to call css_lookup() which requires radix-hash walk.
> > > This saves some amount of memory footprint at lookup memcg via id.
> > >
> >
> > It is a memory versus speed tradeoff, but if the number of created
> > cgroups is low, it might not be all that slow, besides we do that for
> > swap_cgroup anyway - no?
> >
>
> In following patch, pc->page_cgroup is changed from pointer to ID.
> Then, this lookup happens in lru_add/del, for example.
> And, by this, we can place all lookup related things to __read_mostly.
> With css_lookup(), we can't do it and have to be afraid of cache
> behavior.
>

OK, fair enough

> I hear there are a users who create 2000+ cgroups and considering
> about "low number" user here is not important.
> This patch is a help for getting _stable_ performance even when there are
> many cgroups.
>

I've heard of one such user on the libcgroup mailing list.

--
Three Cheers,
Balbir

2010-08-03 03:44:31

by Kamezawa Hiroyuki

Subject: Re: [PATCH -mm 3/5] memcg scalable file stat accounting method

On Tue, 3 Aug 2010 09:03:27 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2010-08-02 19:15:59]:
>
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > At accounting file events per memory cgroup, we need to find memory cgroup
> > via page_cgroup->mem_cgroup. Now, we use lock_page_cgroup().
> >
> > But, considering the context which page-cgroup for files are accessed,
> > we can use alternative light-weight mutual execusion in the most case.
> > At handling file-caches, the only race we have to take care of is "moving"
> > account, IOW, overwriting page_cgroup->mem_cgroup. Because file status
> > update is done while the page-cache is in stable state, we don't have to
> > take care of race with charge/uncharge.
> >
> > Unlike charge/uncharge, "move" happens not so frequently. It happens only when
> > rmdir() and task-moving (with a special settings.)
> > This patch adds a race-checker for file-cache-status accounting v.s. account
> > moving. The new per-cpu-per-memcg counter MEM_CGROUP_ON_MOVE is added.
> > The routine for account move
> > 1. Increment it before start moving
> > 2. Call synchronize_rcu()
> > 3. Decrement it after the end of moving.
> > By this, file-status-counting routine can check it needs to call
> > lock_page_cgroup(). In most case, I doesn't need to call it.
> >
> >
> > Changelog: 20100730
> > - some cleanup.
> > Changelog: 20100729
> > - replaced __this_cpu_xxx() with this_cpu_xxx
> > (because we don't call spinlock)
> > - added VM_BUG_ON().
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > mm/memcontrol.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++---------
> > 1 file changed, 66 insertions(+), 12 deletions(-)
> >
> > Index: mmotm-0727/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-0727.orig/mm/memcontrol.c
> > +++ mmotm-0727/mm/memcontrol.c
> > @@ -88,6 +88,7 @@ enum mem_cgroup_stat_index {
> > MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
> > MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> > MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */
> > + MEM_CGROUP_ON_MOVE, /* A check for locking move account/status */
> >
> > MEM_CGROUP_STAT_NSTATS,
> > };
> > @@ -1074,7 +1075,49 @@ static unsigned int get_swappiness(struc
> > return swappiness;
> > }
> >
> > -/* A routine for testing mem is not under move_account */
> > +static void mem_cgroup_start_move(struct mem_cgroup *mem)
> > +{
> > + int cpu;
> > + /* for fast checking in mem_cgroup_update_file_stat() etc..*/
> > + spin_lock(&mc.lock);
> > + for_each_possible_cpu(cpu)
> > + per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
>
> Is for_each_possible really required? Won't online cpus suffice? There
> can be a race if a hotplug event happens between start and end move,
> shouldn't we handle that. My concern is that with something like 1024
> cpus possible today, we might need to optimize this further.
>
Yes, I have the same concern. But I don't have any justification for disabling
cpu hotplug while moving pages; it may take several msec.

> May be we can do this first and optimize later.
>
Maybe. For now, a cpu-hotplug event handler tends to be noise for this patch.
I would like to do it later.

Thanks,
-Kame

2010-08-03 03:45:23

by Balbir Singh

Subject: Re: [PATCH -mm 2/5] use ID in page cgroup

* KAMEZAWA Hiroyuki <[email protected]> [2010-08-02 19:14:10]:

> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Now, addresses of memory cgroup can be calculated by their ID without complex.
> This patch relplaces pc->mem_cgroup from a pointer to a unsigned short.
> On 64bit architecture, this offers us more 6bytes room per page_cgroup.
> Use 2bytes for blkio-cgroup's page tracking. More 4bytes will be used for
> some light-weight concurrent access.
>
> We may able to move this id onto flags field but ...go step by step.
>
> Changelog: 20100730
> - fixed some garbage added by debug code in early stage
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> include/linux/page_cgroup.h | 3 ++-
> mm/memcontrol.c | 32 +++++++++++++++++++-------------
> mm/page_cgroup.c | 2 +-
> 3 files changed, 22 insertions(+), 15 deletions(-)
>
> Index: mmotm-0727/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-0727.orig/include/linux/page_cgroup.h
> +++ mmotm-0727/include/linux/page_cgroup.h
> @@ -12,7 +12,8 @@
> */
> struct page_cgroup {
> unsigned long flags;
> - struct mem_cgroup *mem_cgroup;
> + unsigned short mem_cgroup; /* ID of assigned memory cgroup */
> + unsigned short blk_cgroup; /* Not Used..but will be. */
> struct page *page;
> struct list_head lru; /* per cgroup LRU list */
> };

Can I recommend that on 64 bit systems we merge the flags, mem_cgroup
and blk_cgroup into one 8 byte value? We could use
__attribute__((packed)) and do something like this:

struct page_cgroup {
	unsigned int flags;
	unsigned short mem_cgroup;
	unsigned short blk_cgroup;
	...
} __attribute__((packed));

Then we need to make sure we don't use more than 32 bits for flags,
which is very much under control at the moment.

This will save us 8 bytes in total on 64 bit systems and nothing on 32
bit systems, but will enable blkio cgroup to co-exist.


--
Three Cheers,
Balbir

2010-08-03 03:53:29

by Kamezawa Hiroyuki

Subject: Re: [PATCH -mm 2/5] use ID in page cgroup

On Tue, 3 Aug 2010 09:15:13 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2010-08-02 19:14:10]:
>
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > Now, addresses of memory cgroup can be calculated by their ID without complex.
> > This patch relplaces pc->mem_cgroup from a pointer to a unsigned short.
> > On 64bit architecture, this offers us more 6bytes room per page_cgroup.
> > Use 2bytes for blkio-cgroup's page tracking. More 4bytes will be used for
> > some light-weight concurrent access.
> >
> > We may able to move this id onto flags field but ...go step by step.
> >
> > Changelog: 20100730
> > - fixed some garbage added by debug code in early stage
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > include/linux/page_cgroup.h | 3 ++-
> > mm/memcontrol.c | 32 +++++++++++++++++++-------------
> > mm/page_cgroup.c | 2 +-
> > 3 files changed, 22 insertions(+), 15 deletions(-)
> >
> > Index: mmotm-0727/include/linux/page_cgroup.h
> > ===================================================================
> > --- mmotm-0727.orig/include/linux/page_cgroup.h
> > +++ mmotm-0727/include/linux/page_cgroup.h
> > @@ -12,7 +12,8 @@
> > */
> > struct page_cgroup {
> > unsigned long flags;
> > - struct mem_cgroup *mem_cgroup;
> > + unsigned short mem_cgroup; /* ID of assigned memory cgroup */
> > + unsigned short blk_cgroup; /* Not Used..but will be. */
> > struct page *page;
> > struct list_head lru; /* per cgroup LRU list */
> > };
>
> Can I recommend that on 64 bit systems, we merge the flag, mem_cgroup
> and blk_cgroup into one 8 byte value. We could use
> __attribute("packed") and do something like this
>

It's a next step.

> struct page_cgroup {
> unsigned int flags;
> unsigned short mem_cgroup;
> unsigned short blk_cgroup;
> ...
> } __attribute(("packed"));
>
> Then we need to make sure we don't use more that 32 bits for flags,
> which is very much under control at the moment.
>
set_bit() requires a "long" as its argument, so some more trickery is required.

And, IIUC, packing implies that
pc->mem_cgroup = mem_cgroup_id; or
pc->blk_cgroup = blk_cgroup_id; will race with
set/clear_bit(BIT_XXXX, &pc->flags)

This "packing" is not very easy; we have to consider all possible combinations
of operations.
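
For illustration only (this is not from any posted patch, just the packed
layout under discussion, with the elided fields filled in from patch 2's
struct page_cgroup):

	struct page_cgroup {
		unsigned int	flags;		/* set_bit()/clear_bit() take an
						 * unsigned long *, so a 32-bit
						 * flags field needs extra tricks */
		unsigned short	mem_cgroup;	/* written by a plain store */
		unsigned short	blk_cgroup;	/* written by a plain store */
		struct page	*page;
		struct list_head lru;
	} __attribute__((packed));

With this layout the id fields and the flag bits share one 64-bit word, so a
plain "pc->mem_cgroup = id" and a concurrent set_bit()/clear_bit() on
&pc->flags operate on the same machine word; that is the combination Kame
points out would have to be treated as racy unless both sides are serialized.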

> This will save us 8 bytes in total on 64 bit systems and nothing on 32
> bit systems, but will enable blkio cgroup to co-exist.
>

Yes. But I have concerns about race conditions. To do that, we need
patches 3-5. (But patch 5 adds a spinlock, so there is no 8-byte reduction.)

Let me go step by step. I'm _really_ afraid of race conditions.

Thanks,
-Kame

2010-08-03 04:03:23

by Balbir Singh

Subject: Re: [PATCH -mm 4/5] memcg generic file stat accounting interface.

* KAMEZAWA Hiroyuki <[email protected]> [2010-08-02 19:17:15]:

> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Preparing for adding new status arounf file caches.(dirty, writeback,etc..)
> Using a unified macro and more generic names.
> All counters will have the same rule for updating.
>
> Changelog:
> - clean up and moved mem_cgroup_stat_index to header file.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> include/linux/memcontrol.h | 23 ++++++++++++++++++++++
> include/linux/page_cgroup.h | 12 +++++------
> mm/memcontrol.c | 46 ++++++++++++++++++--------------------------
> 3 files changed, 48 insertions(+), 33 deletions(-)
>
> Index: mmotm-0727/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-0727.orig/include/linux/memcontrol.h
> +++ mmotm-0727/include/linux/memcontrol.h
> @@ -25,6 +25,29 @@ struct page_cgroup;
> struct page;
> struct mm_struct;
>
> +/*
> + * Per-cpu Statistics for memory cgroup.
> + */
> +enum mem_cgroup_stat_index {
> + /*
> + * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> + */
> + MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
> + MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
> + MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
> + MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
> + MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> + MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */
> + MEM_CGROUP_ON_MOVE, /* A check for locking move account/status */
> + /* About file-stat please see memcontrol.h */

Isn't this memcontrol.h?

> + MEM_CGROUP_FSTAT_BASE,
> + MEM_CGROUP_FSTAT_FILE_MAPPED = MEM_CGROUP_FSTAT_BASE,
> + MEM_CGROUP_FSTAT_END,
> + MEM_CGROUP_STAT_NSTATS = MEM_CGROUP_FSTAT_END,
> +};
> +
> +#define MEMCG_FSTAT_IDX(idx) ((idx) - MEM_CGROUP_FSTAT_BASE)
> +
> extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> struct list_head *dst,
> unsigned long *scanned, int order,
> Index: mmotm-0727/mm/memcontrol.c
> ===================================================================
> --- mmotm-0727.orig/mm/memcontrol.c
> +++ mmotm-0727/mm/memcontrol.c
> @@ -74,24 +74,6 @@ static int really_do_swap_account __init
> #define THRESHOLDS_EVENTS_THRESH (7) /* once in 128 */
> #define SOFTLIMIT_EVENTS_THRESH (10) /* once in 1024 */
>
> -/*
> - * Statistics for memory cgroup.
> - */
> -enum mem_cgroup_stat_index {
> - /*
> - * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> - */
> - MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
> - MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
> - MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
> - MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
> - MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
> - MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> - MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */
> - MEM_CGROUP_ON_MOVE, /* A check for locking move account/status */
> -
> - MEM_CGROUP_STAT_NSTATS,
> -};
>
> struct mem_cgroup_stat_cpu {
> s64 count[MEM_CGROUP_STAT_NSTATS];
> @@ -1512,7 +1494,8 @@ bool mem_cgroup_handle_oom(struct mem_cg
> * Currently used to update mapped file statistics, but the routine can be
> * generalized to update other statistics as well.
> */
> -void mem_cgroup_update_file_mapped(struct page *page, int val)
> +static void
> +mem_cgroup_update_file_stat(struct page *page, unsigned int idx, int val)
> {
> struct mem_cgroup *mem;
> struct page_cgroup *pc;
> @@ -1536,11 +1519,11 @@ void mem_cgroup_update_file_mapped(struc
> if (unlikely(!PageCgroupUsed(pc)))
> goto done;
> if (val > 0) {
> - this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> - SetPageCgroupFileMapped(pc);
> + this_cpu_inc(mem->stat->count[idx]);
> + set_bit(fflag_idx(MEMCG_FSTAT_IDX(idx)), &pc->flags);

Do we use the bit in pc->flags? Otherwise, is there an advantage to
creating a separate index for the other stats that block I/O needs?

--
Three Cheers,
Balbir

2010-08-03 04:06:55

by Balbir Singh

Subject: Re: [PATCH -mm 5/5] memcg: use spinlock in page_cgroup instead of bit_spinlock

* KAMEZAWA Hiroyuki <[email protected]> [2010-08-02 19:20:06]:

> From: KAMEZAWA Hiroyuki <[email protected]>
>
> This patch replaces bit_spinlock with spinlock. In general,
> spinlock has good functinality than bit_spin_lock and we should use
> it if we have a room for it. In 64bit arch, we have extra 4bytes.
> Let's use it.
> expected effects:
> - use better codes.
> - ticket lock on x86-64
> - para-vitualization aware lock
> etc..
>
> Chagelog: 20090729
> - fixed page_cgroup_is_locked().
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> --

The additional space usage is a big concern, I think saving space
would be of highest priority. I understand the expected benefits, but
a spinlock_t per page_cgroup is quite expensive at the moment. If
anything I think it should be a config option under CONFIG_DEBUG or
something else to play with and see the side effects.

--
Three Cheers,
Balbir

2010-08-03 04:29:03

by Kamezawa Hiroyuki

Subject: Re: [PATCH -mm 4/5] memcg generic file stat accounting interface.

On Tue, 3 Aug 2010 09:33:04 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2010-08-02 19:17:15]:
>
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > Preparing for adding new status arounf file caches.(dirty, writeback,etc..)
> > Using a unified macro and more generic names.
> > All counters will have the same rule for updating.
> >
> > Changelog:
> > - clean up and moved mem_cgroup_stat_index to header file.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > include/linux/memcontrol.h | 23 ++++++++++++++++++++++
> > include/linux/page_cgroup.h | 12 +++++------
> > mm/memcontrol.c | 46 ++++++++++++++++++--------------------------
> > 3 files changed, 48 insertions(+), 33 deletions(-)
> >
> > Index: mmotm-0727/include/linux/memcontrol.h
> > ===================================================================
> > --- mmotm-0727.orig/include/linux/memcontrol.h
> > +++ mmotm-0727/include/linux/memcontrol.h
> > @@ -25,6 +25,29 @@ struct page_cgroup;
> > struct page;
> > struct mm_struct;
> >
> > +/*
> > + * Per-cpu Statistics for memory cgroup.
> > + */
> > +enum mem_cgroup_stat_index {
> > + /*
> > + * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> > + */
> > + MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
> > + MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
> > + MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
> > + MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
> > + MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> > + MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */
> > + MEM_CGROUP_ON_MOVE, /* A check for locking move account/status */
> > + /* About file-stat please see memcontrol.h */
>
> Isn't this memcontrol.h?
>
Ahhhh, it's garbage. Sorry.

> > + MEM_CGROUP_FSTAT_BASE,
> > + MEM_CGROUP_FSTAT_FILE_MAPPED = MEM_CGROUP_FSTAT_BASE,
> > + MEM_CGROUP_FSTAT_END,
> > + MEM_CGROUP_STAT_NSTATS = MEM_CGROUP_FSTAT_END,
> > +};
> > +
> > +#define MEMCG_FSTAT_IDX(idx) ((idx) - MEM_CGROUP_FSTAT_BASE)
> > +
> > extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> > struct list_head *dst,
> > unsigned long *scanned, int order,
> > Index: mmotm-0727/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-0727.orig/mm/memcontrol.c
> > +++ mmotm-0727/mm/memcontrol.c
> > @@ -74,24 +74,6 @@ static int really_do_swap_account __init
> > #define THRESHOLDS_EVENTS_THRESH (7) /* once in 128 */
> > #define SOFTLIMIT_EVENTS_THRESH (10) /* once in 1024 */
> >
> > -/*
> > - * Statistics for memory cgroup.
> > - */
> > -enum mem_cgroup_stat_index {
> > - /*
> > - * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> > - */
> > - MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
> > - MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
> > - MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
> > - MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
> > - MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
> > - MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> > - MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */
> > - MEM_CGROUP_ON_MOVE, /* A check for locking move account/status */
> > -
> > - MEM_CGROUP_STAT_NSTATS,
> > -};
> >
> > struct mem_cgroup_stat_cpu {
> > s64 count[MEM_CGROUP_STAT_NSTATS];
> > @@ -1512,7 +1494,8 @@ bool mem_cgroup_handle_oom(struct mem_cg
> > * Currently used to update mapped file statistics, but the routine can be
> > * generalized to update other statistics as well.
> > */
> > -void mem_cgroup_update_file_mapped(struct page *page, int val)
> > +static void
> > +mem_cgroup_update_file_stat(struct page *page, unsigned int idx, int val)
> > {
> > struct mem_cgroup *mem;
> > struct page_cgroup *pc;
> > @@ -1536,11 +1519,11 @@ void mem_cgroup_update_file_mapped(struc
> > if (unlikely(!PageCgroupUsed(pc)))
> > goto done;
> > if (val > 0) {
> > - this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > - SetPageCgroupFileMapped(pc);
> > + this_cpu_inc(mem->stat->count[idx]);
> > + set_bit(fflag_idx(MEMCG_FSTAT_IDX(idx)), &pc->flags);
>
> Do we use the bit in pc->flags, otherwise is there an advantage of
> creating a separate index for the other stats the block I/O needs?
>
??? We are using the bit in pc->flags.

Should I use SetPageFileMapped() etc. and drop this patch?
I don't want to add a switch(idx) to call SetPageFileMapped() etc.


Thanks,
-Kame

2010-08-03 04:30:57

by Kamezawa Hiroyuki

Subject: Re: [PATCH -mm 5/5] memcg: use spinlock in page_cgroup instead of bit_spinlock

On Tue, 3 Aug 2010 09:36:45 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2010-08-02 19:20:06]:
>
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > This patch replaces bit_spinlock with spinlock. In general,
> > spinlock has good functinality than bit_spin_lock and we should use
> > it if we have a room for it. In 64bit arch, we have extra 4bytes.
> > Let's use it.
> > expected effects:
> > - use better codes.
> > - ticket lock on x86-64
> > - para-vitualization aware lock
> > etc..
> >
> > Chagelog: 20090729
> > - fixed page_cgroup_is_locked().
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > --
>
> The additional space usage is a big concern, I think saving space
> would be of highest priority. I understand the expected benefits, but
> a spinlock_t per page_cgroup is quite expensive at the moment. If
> anything I think it should be a config option under CONFIG_DEBUG or
> something else to play with and see the side effects.
>

Hmm. As I already wrote, packing the id into flags is not easy.
Shall I leave a 4-byte _pad space for a while and drop this patch?

I don't like adding CONFIG_DEBUG to this core code.

Thanks,
-Kame

2010-08-03 04:37:06

by Daisuke Nishimura

Subject: Re: [PATCH -mm 1/5] quick lookup memcg by ID

Hi.

Thank you for all of your work.

Several comments are inlined.

On Mon, 2 Aug 2010 19:13:04 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Now, memory cgroup has an ID per cgroup and make use of it at
> - hierarchy walk,
> - swap recording.
>
> This patch is for making more use of it. The final purpose is
> to replace page_cgroup->mem_cgroup's pointer to an unsigned short.
>
> This patch caches a pointer of memcg in an array. By this, we
> don't have to call css_lookup() which requires radix-hash walk.
> This saves some amount of memory footprint at lookup memcg via id.
>
> Changelog: 20100730
> - fixed rcu_read_unlock() placement.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> init/Kconfig | 11 +++++++++++
> mm/memcontrol.c | 48 ++++++++++++++++++++++++++++++++++--------------
> 2 files changed, 45 insertions(+), 14 deletions(-)
>
> Index: mmotm-0727/mm/memcontrol.c
> ===================================================================
> --- mmotm-0727.orig/mm/memcontrol.c
> +++ mmotm-0727/mm/memcontrol.c
> @@ -292,6 +292,30 @@ static bool move_file(void)
> &mc.to->move_charge_at_immigrate);
> }
>
> +/* 0 is unused */
> +static atomic_t mem_cgroup_num;
> +#define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
> +static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
> +
> +static struct mem_cgroup *id_to_memcg(unsigned short id)
> +{
> + /*
> + * This array is set to NULL when mem_cgroup is freed.
> + * IOW, there are no more references && rcu_synchronized().
> + * This lookup-caching is safe.
> + */
> + if (unlikely(!mem_cgroups[id])) {
> + struct cgroup_subsys_state *css;
> +
> + rcu_read_lock();
> + css = css_lookup(&mem_cgroup_subsys, id);
> + rcu_read_unlock();
> + if (!css)
> + return NULL;
> + mem_cgroups[id] = container_of(css, struct mem_cgroup, css);
> + }
> + return mem_cgroups[id];
> +}
id_to_memcg() seems to be called under rcu_read_lock() already, so I think
rcu_read_lock()/unlock() would be unnecessary.

> Index: mmotm-0727/init/Kconfig
> ===================================================================
> --- mmotm-0727.orig/init/Kconfig
> +++ mmotm-0727/init/Kconfig
> @@ -594,6 +594,17 @@ config CGROUP_MEM_RES_CTLR_SWAP
> Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
> size is 4096bytes, 512k per 1Gbytes of swap.
>
> +config MEM_CGROUP_MAX_GROUPS
> + int "Maximum number of memory cgroups on a system"
> + range 1 65535
> + default 8192 if 64BIT
> + default 2048 if 32BIT
> + help
> + Memory cgroup has limitation of the number of groups created.
> + Please select your favorite value. The more you allow, the more
> + memory will be consumed. This consumes vmalloc() area, so,
> + this should be small on 32bit arch.
> +
We don't use vmalloc() area in this version :)


Thanks,
Daisuke Nishimura.

2010-08-03 04:42:21

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH -mm 1/5] quick lookup memcg by ID

On Tue, 3 Aug 2010 13:31:09 +0900
Daisuke Nishimura <[email protected]> wrote:

> Hi.
>
> Thank you for all of your works.
>
> Several comments are inlined.
>
> On Mon, 2 Aug 2010 19:13:04 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > Now, memory cgroup has an ID per cgroup and make use of it at
> > - hierarchy walk,
> > - swap recording.
> >
> > This patch is for making more use of it. The final purpose is
> > to replace page_cgroup->mem_cgroup's pointer to an unsigned short.
> >
> > This patch caches a pointer of memcg in an array. By this, we
> > don't have to call css_lookup() which requires radix-hash walk.
> > This saves some amount of memory footprint at lookup memcg via id.
> >
> > Changelog: 20100730
> > - fixed rcu_read_unlock() placement.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > init/Kconfig | 11 +++++++++++
> > mm/memcontrol.c | 48 ++++++++++++++++++++++++++++++++++--------------
> > 2 files changed, 45 insertions(+), 14 deletions(-)
> >
> > Index: mmotm-0727/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-0727.orig/mm/memcontrol.c
> > +++ mmotm-0727/mm/memcontrol.c
> > @@ -292,6 +292,30 @@ static bool move_file(void)
> > &mc.to->move_charge_at_immigrate);
> > }
> >
> > +/* 0 is unused */
> > +static atomic_t mem_cgroup_num;
> > +#define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
> > +static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
> > +
> > +static struct mem_cgroup *id_to_memcg(unsigned short id)
> > +{
> > + /*
> > + * This array is set to NULL when mem_cgroup is freed.
> > + * IOW, there are no more references && rcu_synchronized().
> > + * This lookup-caching is safe.
> > + */
> > + if (unlikely(!mem_cgroups[id])) {
> > + struct cgroup_subsys_state *css;
> > +
> > + rcu_read_lock();
> > + css = css_lookup(&mem_cgroup_subsys, id);
> > + rcu_read_unlock();
> > + if (!css)
> > + return NULL;
> > + mem_cgroups[id] = container_of(css, struct mem_cgroup, css);
> > + }
> > + return mem_cgroups[id];
> > +}
> id_to_memcg() seems to be called under rcu_read_lock() already, so I think
> rcu_read_lock()/unlock() would be unnecessary.
>

Maybe. I thought about which is better to add:

VM_BUG_ON(!rcu_read_lock_held());
or
rcu_read_lock()
..
rcu_read_unlock()

Do you like the former? If so, it's ok to remove the rcu_read_lock()/unlock().
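
Roughly, the former would look like this (just a sketch, reusing the lookup
code from the hunk above and assuming every caller already holds
rcu_read_lock()):

static struct mem_cgroup *id_to_memcg(unsigned short id)
{
	VM_BUG_ON(!rcu_read_lock_held());
	/* cache the css lookup result; cleared when the memcg is freed */
	if (unlikely(!mem_cgroups[id])) {
		struct cgroup_subsys_state *css;

		css = css_lookup(&mem_cgroup_subsys, id);
		if (!css)
			return NULL;
		mem_cgroups[id] = container_of(css, struct mem_cgroup, css);
	}
	return mem_cgroups[id];
}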



> > Index: mmotm-0727/init/Kconfig
> > ===================================================================
> > --- mmotm-0727.orig/init/Kconfig
> > +++ mmotm-0727/init/Kconfig
> > @@ -594,6 +594,17 @@ config CGROUP_MEM_RES_CTLR_SWAP
> > Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
> > size is 4096bytes, 512k per 1Gbytes of swap.
> >
> > +config MEM_CGROUP_MAX_GROUPS
> > + int "Maximum number of memory cgroups on a system"
> > + range 1 65535
> > + default 8192 if 64BIT
> > + default 2048 if 32BIT
> > + help
> > + Memory cgroup has limitation of the number of groups created.
> > + Please select your favorite value. The more you allow, the more
> > + memory will be consumed. This consumes vmalloc() area, so,
> > + this should be small on 32bit arch.
> > +
> We don't use vmalloc() area in this version :)
>
Oh, yes. Thank you, I'll fix it.

Thanks,
-Kame

2010-08-03 04:56:59

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [PATCH -mm 1/5] quick lookup memcg by ID

On Tue, 3 Aug 2010 13:37:23 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Tue, 3 Aug 2010 13:31:09 +0900
> Daisuke Nishimura <[email protected]> wrote:
>
(snip)
> > > +/* 0 is unused */
> > > +static atomic_t mem_cgroup_num;
> > > +#define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
> > > +static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
> > > +
> > > +static struct mem_cgroup *id_to_memcg(unsigned short id)
> > > +{
> > > + /*
> > > + * This array is set to NULL when mem_cgroup is freed.
> > > + * IOW, there are no more references && rcu_synchronized().
> > > + * This lookup-caching is safe.
> > > + */
> > > + if (unlikely(!mem_cgroups[id])) {
> > > + struct cgroup_subsys_state *css;
> > > +
> > > + rcu_read_lock();
> > > + css = css_lookup(&mem_cgroup_subsys, id);
> > > + rcu_read_unlock();
> > > + if (!css)
> > > + return NULL;
> > > + mem_cgroups[id] = container_of(css, struct mem_cgroup, css);
> > > + }
> > > + return mem_cgroups[id];
> > > +}
> > id_to_memcg() seems to be called under rcu_read_lock() already, so I think
> > rcu_read_lock()/unlock() would be unnecessary.
> >
>
> Maybe. I thought about which is better to add:
>
> VM_BUG_ON(!rcu_read_lock_held());
> or
> rcu_read_lock()
> ..
> rcu_read_unlock()
>
> Do you like the former? If so, it's ok to remove the rcu_read_lock()/unlock().
>
Yes, I personally like the former.

2010-08-03 04:59:11

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH -mm 1/5] quick lookup memcg by ID

On Tue, 3 Aug 2010 13:51:29 +0900
Daisuke Nishimura <[email protected]> wrote:

> On Tue, 3 Aug 2010 13:37:23 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > On Tue, 3 Aug 2010 13:31:09 +0900
> > Daisuke Nishimura <[email protected]> wrote:
> >
> (snip)
> > > > +/* 0 is unused */
> > > > +static atomic_t mem_cgroup_num;
> > > > +#define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
> > > > +static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
> > > > +
> > > > +static struct mem_cgroup *id_to_memcg(unsigned short id)
> > > > +{
> > > > + /*
> > > > + * This array is set to NULL when mem_cgroup is freed.
> > > > + * IOW, there are no more references && rcu_synchronized().
> > > > + * This lookup-caching is safe.
> > > > + */
> > > > + if (unlikely(!mem_cgroups[id])) {
> > > > + struct cgroup_subsys_state *css;
> > > > +
> > > > + rcu_read_lock();
> > > > + css = css_lookup(&mem_cgroup_subsys, id);
> > > > + rcu_read_unlock();
> > > > + if (!css)
> > > > + return NULL;
> > > > + mem_cgroups[id] = container_of(css, struct mem_cgroup, css);
> > > > + }
> > > > + return mem_cgroups[id];
> > > > +}
> > > id_to_memcg() seems to be called under rcu_read_lock() already, so I think
> > > rcu_read_lock()/unlock() would be unnecessary.
> > >
> >
> > Maybe. I thought about which is better to add:
> >
> > VM_BUG_ON(!rcu_read_lock_held());
> > or
> > rcu_read_lock()
> > ..
> > rcu_read_unlock()
> >
> > Do you like the former? If so, it's ok to remove the rcu_read_lock()/unlock().
> >
> Yes, I personally like the former.

ok, will rewrite in that style.

-Kame

2010-08-03 05:09:58

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [PATCH -mm 1/5] quick lookup memcg by ID

On Tue, 3 Aug 2010 13:54:13 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Tue, 3 Aug 2010 13:51:29 +0900
> Daisuke Nishimura <[email protected]> wrote:
>
> > On Tue, 3 Aug 2010 13:37:23 +0900
> > KAMEZAWA Hiroyuki <[email protected]> wrote:
> >
> > > On Tue, 3 Aug 2010 13:31:09 +0900
> > > Daisuke Nishimura <[email protected]> wrote:
> > >
> > (snip)
> > > > > +/* 0 is unused */
> > > > > +static atomic_t mem_cgroup_num;
> > > > > +#define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
> > > > > +static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
> > > > > +
> > > > > +static struct mem_cgroup *id_to_memcg(unsigned short id)
> > > > > +{
> > > > > + /*
> > > > > + * This array is set to NULL when mem_cgroup is freed.
> > > > > + * IOW, there are no more references && rcu_synchronized().
> > > > > + * This lookup-caching is safe.
> > > > > + */
> > > > > + if (unlikely(!mem_cgroups[id])) {
> > > > > + struct cgroup_subsys_state *css;
> > > > > +
> > > > > + rcu_read_lock();
> > > > > + css = css_lookup(&mem_cgroup_subsys, id);
> > > > > + rcu_read_unlock();
> > > > > + if (!css)
> > > > > + return NULL;
> > > > > + mem_cgroups[id] = container_of(css, struct mem_cgroup, css);
> > > > > + }
> > > > > + return mem_cgroups[id];
> > > > > +}
> > > > id_to_memcg() seems to be called under rcu_read_lock() already, so I think
> > > > rcu_read_lock()/unlock() would be unnecessary.
> > > >
Ah, looking into [2/5], this would not necessarily be true..
So, it might be better to leave it as it is.

Sorry for the noise.

> > >
> > > Maybe. I thought about which is better to add:
> > >
> > > VM_BUG_ON(!rcu_read_lock_held());
> > > or
> > > rcu_read_lock()
> > > ..
> > > rcu_read_unlock()
> > >
> > > Do you like the former? If so, it's ok to remove the rcu_read_lock()/unlock().
> > >
> > Yes, I personally like the former.
>
> ok, will rewrite in that style.
>
> -Kame
>

2010-08-04 01:02:37

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [PATCH -mm 3/5] memcg scalable file stat accounting method

> @@ -1074,7 +1075,49 @@ static unsigned int get_swappiness(struc
> return swappiness;
> }
>
> -/* A routine for testing mem is not under move_account */
> +static void mem_cgroup_start_move(struct mem_cgroup *mem)
> +{
> + int cpu;
> + /* for fast checking in mem_cgroup_update_file_stat() etc..*/
> + spin_lock(&mc.lock);
> + for_each_possible_cpu(cpu)
> + per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
> + spin_unlock(&mc.lock);
> +
> + synchronize_rcu();
> +}
> +
> +static void mem_cgroup_end_move(struct mem_cgroup *mem)
> +{
> + int cpu;
> +
> + if (!mem)
> + return;
Is this check necessary?

Thanks,
Daisuke Nishimura.

2010-08-04 01:16:45

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH -mm 3/5] memcg scalable file stat accounting method

On Wed, 4 Aug 2010 09:55:13 +0900
Daisuke Nishimura <[email protected]> wrote:

> > @@ -1074,7 +1075,49 @@ static unsigned int get_swappiness(struc
> > return swappiness;
> > }
> >
> > -/* A routine for testing mem is not under move_account */
> > +static void mem_cgroup_start_move(struct mem_cgroup *mem)
> > +{
> > + int cpu;
> > + /* for fast checking in mem_cgroup_update_file_stat() etc..*/
> > + spin_lock(&mc.lock);
> > + for_each_possible_cpu(cpu)
> > + per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
> > + spin_unlock(&mc.lock);
> > +
> > + synchronize_rcu();
> > +}
> > +
> > +static void mem_cgroup_end_move(struct mem_cgroup *mem)
> > +{
> > + int cpu;
> > +
> > + if (!mem)
> > + return;
> Is this check necessary?
>

Yes, I hit NULL here.
That happens in the migration=off case, IIRC.
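
FWIW, end_move is roughly the mirror of the start_move hunk quoted above
(a sketch, not the exact patch), so the check is just an early-out for the
case where nothing was actually moved:

static void mem_cgroup_end_move(struct mem_cgroup *mem)
{
	int cpu;

	/* can be NULL when no charge was actually moved (migration=off) */
	if (!mem)
		return;
	spin_lock(&mc.lock);
	for_each_possible_cpu(cpu)
		per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) -= 1;
	spin_unlock(&mc.lock);
}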

Thanks,
-Kame

> Thanks,
> Daisuke Nishimura.
>

2010-08-04 01:29:08

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [PATCH -mm 3/5] memcg scalable file stat accounting method

On Wed, 4 Aug 2010 10:11:50 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Wed, 4 Aug 2010 09:55:13 +0900
> Daisuke Nishimura <[email protected]> wrote:
>
> > > @@ -1074,7 +1075,49 @@ static unsigned int get_swappiness(struc
> > > return swappiness;
> > > }
> > >
> > > -/* A routine for testing mem is not under move_account */
> > > +static void mem_cgroup_start_move(struct mem_cgroup *mem)
> > > +{
> > > + int cpu;
> > > + /* for fast checking in mem_cgroup_update_file_stat() etc..*/
> > > + spin_lock(&mc.lock);
> > > + for_each_possible_cpu(cpu)
> > > + per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
> > > + spin_unlock(&mc.lock);
> > > +
> > > + synchronize_rcu();
> > > +}
> > > +
> > > +static void mem_cgroup_end_move(struct mem_cgroup *mem)
> > > +{
> > > + int cpu;
> > > +
> > > + if (!mem)
> > > + return;
> > Is this check necessary?
> >
>
> Yes, I hit NULL here.
> That happens migration=off case, IIRC.
>
Ah, you're right.
Thank you for your clarification.

Daisuke Nishimura.