2010-08-20 10:01:02

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH] memcg: towards I/O aware memcg v5

This is v5.

Sorry for delaying...but I had time for resetting myself and..several
changes are added. I think this version is simpler than v4.

Major changes from v4 is
a) added kernel/cgroup.c hooks again. (for b)
b) make RCU aware. previous version seems dangerous in an extreme case.

Then, codes are updated. Most of changes are related to RCU.

Patch brief view:
1. add hooks to kernel/cgroup.c for ID management.
2. use ID-array in memcg.
3. record ID to page_cgroup rather than pointer.
4. make update_file_mapped to be RCU aware routine instead of spinlock.
5. make update_file_mapped as general-purpose function.


Thanks,
-Kame


2010-08-20 10:03:16

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 1/5] cgroup: ID notification call back

CC'ed to Paul Menage and Li Zefan.
==
From: KAMEZAWA Hiroyuki <[email protected]>

When cgroup subsystem use ID (ss->use_id==1), each css's ID is assigned
after successful call of ->create(). css_ID is tightly coupled with
css struct itself but it is allocated by ->create() call, IOW,
per-subsystem special allocations.

To know css_id before creation, this patch adds id_attached() callback.
after css_ID allocation. This will be used by memory cgroup's quick lookup
routine.

Maybe you can think of other implementations as
- pass ID to ->create()
or
- add post_create()
etc...
But when considering dirtiness of codes, this straightforward patch seems
good to me. If someone wants post_create(), this patch can be replaced.

Changelog: 20100820
- new approarch.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Documentation/cgroups/cgroups.txt | 10 ++++++++++
include/linux/cgroup.h | 1 +
kernel/cgroup.c | 5 ++++-
3 files changed, 15 insertions(+), 1 deletion(-)

Index: mmotm-0811/Documentation/cgroups/cgroups.txt
===================================================================
--- mmotm-0811.orig/Documentation/cgroups/cgroups.txt
+++ mmotm-0811/Documentation/cgroups/cgroups.txt
@@ -621,6 +621,16 @@ and root cgroup. Currently this will onl
the default hierarchy (which never has sub-cgroups) and a hierarchy
that is being created/destroyed (and hence has no sub-cgroups).

+void id_attached(struct cgroup_subsys *ss, struct cgroup *root)
+(cgroup_mutex and ss->hierarchy_mutex held by caller)
+(called only when ss->use_id=1)
+
+Called when css_id is attached to css. Because css_id is assigned
+against "css", css_id is not available until ->create() is called.
+If subsystem wants to make use of ID at createtion time, use
+this handler. This handler will be called after css_id is assigned
+to css. Not necessary to be implemented in usual(see memcontrol.c)
+
4. Questions
============

Index: mmotm-0811/include/linux/cgroup.h
===================================================================
--- mmotm-0811.orig/include/linux/cgroup.h
+++ mmotm-0811/include/linux/cgroup.h
@@ -475,6 +475,7 @@ struct cgroup_subsys {
struct cgroup *cgrp);
void (*post_clone)(struct cgroup_subsys *ss, struct cgroup *cgrp);
void (*bind)(struct cgroup_subsys *ss, struct cgroup *root);
+ void (*id_attached)(struct cgroup_subsys *ss, struct cgroup *cgrp);

int subsys_id;
int active;
Index: mmotm-0811/kernel/cgroup.c
===================================================================
--- mmotm-0811.orig/kernel/cgroup.c
+++ mmotm-0811/kernel/cgroup.c
@@ -4618,6 +4618,8 @@ static int __init_or_module cgroup_init_
newid->stack[0] = newid->id;
newid->css = rootcss;
rootcss->id = newid;
+ if (ss->id_attached)
+ ss->id_attached(ss, dummytop);
return 0;
}

@@ -4646,7 +4648,8 @@ static int alloc_css_id(struct cgroup_su
* see cgroup_populate_dir()
*/
rcu_assign_pointer(child_css->id, child_id);
-
+ if (ss->id_attached)
+ ss->id_attached(ss, child);
return 0;
}

2010-08-20 10:04:16

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 2/5] memcg: use array and ID for quick look up

From: KAMEZAWA Hiroyuki <[email protected]>

Now, memory cgroup has an ID per cgroup and make use of it at
- hierarchy walk,
- swap recording.

This patch is for making more use of it. The final purpose is
to replace page_cgroup->mem_cgroup's pointer to an unsigned short.

This patch caches a pointer of memcg in an array. By this, we
don't have to call css_lookup() which requires radix-hash walk.
This saves some amount of memory footprint at lookup memcg via id.

Changelog: 20100811
- adjusted onto mmotm-2010-08-11
- fixed RCU related parts.
- use attach_id() callback.

Changelog: 20100804
- fixed description in init/Kconfig

Changelog: 20100730
- fixed rcu_read_unlock() placement.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
init/Kconfig | 10 ++++++
mm/memcontrol.c | 83 ++++++++++++++++++++++++++++++++++++++++++--------------
2 files changed, 73 insertions(+), 20 deletions(-)

Index: mmotm-0811/mm/memcontrol.c
===================================================================
--- mmotm-0811.orig/mm/memcontrol.c
+++ mmotm-0811/mm/memcontrol.c
@@ -195,6 +195,7 @@ static void mem_cgroup_oom_notify(struct
*/
struct mem_cgroup {
struct cgroup_subsys_state css;
+ int valid; /* for checking validness under RCU access.*/
/*
* the counter to account for memory usage
*/
@@ -294,6 +295,29 @@ static bool move_file(void)
&mc.to->move_charge_at_immigrate);
}

+/* 0 is unused */
+static atomic_t mem_cgroup_num;
+#define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
+static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
+
+/* Must be called under rcu_read_lock */
+static struct mem_cgroup *id_to_memcg(unsigned short id)
+{
+ struct mem_cgroup *ret;
+ /* see mem_cgroup_free() */
+ ret = rcu_dereference_check(mem_cgroups[id], rch_read_lock_held());
+ if (likely(ret && ret->valid))
+ return ret;
+ return NULL;
+}
+
+static void register_memcg_id(struct mem_cgroup *mem)
+{
+ int id = css_id(&mem->css);
+ rcu_assign_pointer(mem_cgroups[id], mem);
+ VM_BUG_ON(!mem->valid);
+}
+
/*
* Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
* limit reclaim to prevent infinite loops, if they ever occur.
@@ -1847,18 +1871,7 @@ static void mem_cgroup_cancel_charge(str
* it's concern. (dropping refcnt from swap can be called against removed
* memcg.)
*/
-static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
-{
- struct cgroup_subsys_state *css;

- /* ID 0 is unused ID */
- if (!id)
- return NULL;
- css = css_lookup(&mem_cgroup_subsys, id);
- if (!css)
- return NULL;
- return container_of(css, struct mem_cgroup, css);
-}

struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
{
@@ -1879,7 +1892,7 @@ struct mem_cgroup *try_get_mem_cgroup_fr
ent.val = page_private(page);
id = lookup_swap_cgroup(ent);
rcu_read_lock();
- mem = mem_cgroup_lookup(id);
+ mem = id_to_memcg(id);
if (mem && !css_tryget(&mem->css))
mem = NULL;
rcu_read_unlock();
@@ -2231,7 +2244,7 @@ __mem_cgroup_commit_charge_swapin(struct

id = swap_cgroup_record(ent, 0);
rcu_read_lock();
- memcg = mem_cgroup_lookup(id);
+ memcg = id_to_memcg(id);
if (memcg) {
/*
* This recorded memcg can be obsolete one. So, avoid
@@ -2240,9 +2253,10 @@ __mem_cgroup_commit_charge_swapin(struct
if (!mem_cgroup_is_root(memcg))
res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
mem_cgroup_swap_statistics(memcg, false);
+ rcu_read_unlock();
mem_cgroup_put(memcg);
- }
- rcu_read_unlock();
+ } else
+ rcu_read_unlock();
}
/*
* At swapin, we may charge account against cgroup which has no tasks.
@@ -2495,7 +2509,7 @@ void mem_cgroup_uncharge_swap(swp_entry_

id = swap_cgroup_record(ent, 0);
rcu_read_lock();
- memcg = mem_cgroup_lookup(id);
+ memcg = id_to_memcg(id);
if (memcg) {
/*
* We uncharge this because swap is freed.
@@ -2504,9 +2518,10 @@ void mem_cgroup_uncharge_swap(swp_entry_
if (!mem_cgroup_is_root(memcg))
res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
mem_cgroup_swap_statistics(memcg, false);
+ rcu_read_unlock();
mem_cgroup_put(memcg);
- }
- rcu_read_unlock();
+ } else
+ rcu_read_unlock();
}

/**
@@ -4010,6 +4025,9 @@ static struct mem_cgroup *mem_cgroup_all
struct mem_cgroup *mem;
int size = sizeof(struct mem_cgroup);

+ if (atomic_read(&mem_cgroup_num) == NR_MEMCG_GROUPS)
+ return NULL;
+
/* Can be very big if MAX_NUMNODES is very big */
if (size < PAGE_SIZE)
mem = kmalloc(size, GFP_KERNEL);
@@ -4020,6 +4038,7 @@ static struct mem_cgroup *mem_cgroup_all
return NULL;

memset(mem, 0, size);
+ mem->valid = 1;
mem->stat = alloc_percpu(struct mem_cgroup_stat_cpu);
if (!mem->stat) {
if (size < PAGE_SIZE)
@@ -4049,6 +4068,7 @@ static void __mem_cgroup_free(struct mem
mem_cgroup_remove_from_trees(mem);
free_css_id(&mem_cgroup_subsys, &mem->css);

+ atomic_dec(&mem_cgroup_num);
for_each_node_state(node, N_POSSIBLE)
free_mem_cgroup_per_zone_info(mem, node);

@@ -4059,6 +4079,19 @@ static void __mem_cgroup_free(struct mem
vfree(mem);
}

+static void mem_cgroup_free(struct mem_cgroup *mem)
+{
+ /* No more lookup */
+ mem->valid = 0;
+ rcu_assign_pointer(mem_cgroups[css_id(&mem->css)], NULL);
+ /*
+ * Because we call vfree() etc...use synchronize_rcu() rather than
+ * call_rcu();
+ */
+ synchronize_rcu();
+ __mem_cgroup_free(mem);
+}
+
static void mem_cgroup_get(struct mem_cgroup *mem)
{
atomic_inc(&mem->refcnt);
@@ -4068,7 +4101,7 @@ static void __mem_cgroup_put(struct mem_
{
if (atomic_sub_and_test(count, &mem->refcnt)) {
struct mem_cgroup *parent = parent_mem_cgroup(mem);
- __mem_cgroup_free(mem);
+ mem_cgroup_free(mem);
if (parent)
mem_cgroup_put(parent);
}
@@ -4184,13 +4217,22 @@ mem_cgroup_create(struct cgroup_subsys *
atomic_set(&mem->refcnt, 1);
mem->move_charge_at_immigrate = 0;
mutex_init(&mem->thresholds_lock);
+ atomic_inc(&mem_cgroup_num);
return &mem->css;
free_out:
- __mem_cgroup_free(mem);
+ mem_cgroup_free(mem);
root_mem_cgroup = NULL;
return ERR_PTR(error);
}

+static void
+mem_cgroup_id_attached(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+
+ register_memcg_id(mem);
+}
+
static int mem_cgroup_pre_destroy(struct cgroup_subsys *ss,
struct cgroup *cont)
{
@@ -4714,6 +4756,7 @@ struct cgroup_subsys mem_cgroup_subsys =
.can_attach = mem_cgroup_can_attach,
.cancel_attach = mem_cgroup_cancel_attach,
.attach = mem_cgroup_move_task,
+ .id_attached = mem_cgroup_id_attached,
.early_init = 0,
.use_id = 1,
};
Index: mmotm-0811/init/Kconfig
===================================================================
--- mmotm-0811.orig/init/Kconfig
+++ mmotm-0811/init/Kconfig
@@ -594,6 +594,16 @@ config CGROUP_MEM_RES_CTLR_SWAP
Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
size is 4096bytes, 512k per 1Gbytes of swap.

+config MEM_CGROUP_MAX_GROUPS
+ int "Maximum number of memory cgroups on a system"
+ range 1 65535
+ default 8192 if 64BIT
+ default 2048 if 32BIT
+ help
+ Memory cgroup has limitation of the number of groups created.
+ Please select your favorite value. The more you allow, the more
+ memory(a pointer per group) will be consumed.
+
menuconfig CGROUP_SCHED
bool "Group CPU scheduler"
depends on EXPERIMENTAL && CGROUPS

2010-08-20 10:06:35

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH] memcg: use ID in page_cgroup


I have an idea to remove page_cgroup->page pointer, 8bytes reduction per page.
But it will be after this work.
==
From: KAMEZAWA Hiroyuki <[email protected]>

Now, addresses of memory cgroup can be calculated by their ID without complex.
This patch relplaces pc->mem_cgroup from a pointer to a unsigned short.
On 64bit architecture, this offers us more 6bytes room per page_cgroup.
Use 2bytes for blkio-cgroup's page tracking. More 4bytes will be used for
some light-weight concurrent access.

We may able to move this id onto flags field but ...go step by step.

Changelog: 20100811
- using new rcu APIs, as rcu_dereference_check() etc.
Changelog: 20100804
- added comments to page_cgroup.h
Changelog: 20100730
- fixed some garbage added by debug code in early stage

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/page_cgroup.h | 6 ++++-
mm/memcontrol.c | 52 +++++++++++++++++++++++++++-----------------
mm/page_cgroup.c | 2 -
3 files changed, 38 insertions(+), 22 deletions(-)

Index: mmotm-0811/include/linux/page_cgroup.h
===================================================================
--- mmotm-0811.orig/include/linux/page_cgroup.h
+++ mmotm-0811/include/linux/page_cgroup.h
@@ -9,10 +9,14 @@
* page_cgroup helps us identify information about the cgroup
* All page cgroups are allocated at boot or memory hotplug event,
* then the page cgroup for pfn always exists.
+ *
+ * TODO: It seems ID for cgroup can be packed into "flags". But there will
+ * be race between assigning ID <-> set/clear flags. Please be careful.
*/
struct page_cgroup {
unsigned long flags;
- struct mem_cgroup *mem_cgroup;
+ unsigned short mem_cgroup; /* ID of assigned memory cgroup */
+ unsigned short blk_cgroup; /* Not Used..but will be. */
struct page *page;
struct list_head lru; /* per cgroup LRU list */
};
Index: mmotm-0811/mm/page_cgroup.c
===================================================================
--- mmotm-0811.orig/mm/page_cgroup.c
+++ mmotm-0811/mm/page_cgroup.c
@@ -15,7 +15,7 @@ static void __meminit
__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
{
pc->flags = 0;
- pc->mem_cgroup = NULL;
+ pc->mem_cgroup = 0;
pc->page = pfn_to_page(pfn);
INIT_LIST_HEAD(&pc->lru);
}
Index: mmotm-0811/mm/memcontrol.c
===================================================================
--- mmotm-0811.orig/mm/memcontrol.c
+++ mmotm-0811/mm/memcontrol.c
@@ -300,12 +300,13 @@ static atomic_t mem_cgroup_num;
#define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;

-/* Must be called under rcu_read_lock */
-static struct mem_cgroup *id_to_memcg(unsigned short id)
+/* Must be called under rcu_read_lock, set safe==true if under lock */
+static struct mem_cgroup *id_to_memcg(unsigned short id, bool safe)
{
struct mem_cgroup *ret;
/* see mem_cgroup_free() */
- ret = rcu_dereference_check(mem_cgroups[id], rch_read_lock_held());
+ ret = rcu_dereference_check(mem_cgroups[id],
+ rch_read_lock_held() || safe);
if (likely(ret && ret->valid))
return ret;
return NULL;
@@ -381,7 +382,12 @@ struct cgroup_subsys_state *mem_cgroup_c
static struct mem_cgroup_per_zone *
page_cgroup_zoneinfo(struct page_cgroup *pc)
{
- struct mem_cgroup *mem = pc->mem_cgroup;
+ /*
+ * The caller should guarantee this "pc" is under lock. In typical
+ * case, this function is called by lru function with zone->lru_lock.
+ * It is a safe access.
+ */
+ struct mem_cgroup *mem = id_to_memcg(pc->mem_cgroup, true);
int nid = page_cgroup_nid(pc);
int zid = page_cgroup_zid(pc);

@@ -723,6 +729,11 @@ static inline bool mem_cgroup_is_root(st
return (mem == root_mem_cgroup);
}

+static inline bool mem_cgroup_is_rootid(unsigned short id)
+{
+ return (id == 1);
+}
+
/*
* Following LRU functions are allowed to be used without PCG_LOCK.
* Operations are called by routine of global LRU independently from memcg.
@@ -755,7 +766,7 @@ void mem_cgroup_del_lru_list(struct page
*/
mz = page_cgroup_zoneinfo(pc);
MEM_CGROUP_ZSTAT(mz, lru) -= 1;
- if (mem_cgroup_is_root(pc->mem_cgroup))
+ if (mem_cgroup_is_rootid(pc->mem_cgroup))
return;
VM_BUG_ON(list_empty(&pc->lru));
list_del_init(&pc->lru);
@@ -782,7 +793,7 @@ void mem_cgroup_rotate_lru_list(struct p
*/
smp_rmb();
/* unused or root page is not rotated. */
- if (!PageCgroupUsed(pc) || mem_cgroup_is_root(pc->mem_cgroup))
+ if (!PageCgroupUsed(pc) || mem_cgroup_is_rootid(pc->mem_cgroup))
return;
mz = page_cgroup_zoneinfo(pc);
list_move(&pc->lru, &mz->lists[lru]);
@@ -808,7 +819,7 @@ void mem_cgroup_add_lru_list(struct page
mz = page_cgroup_zoneinfo(pc);
MEM_CGROUP_ZSTAT(mz, lru) += 1;
SetPageCgroupAcctLRU(pc);
- if (mem_cgroup_is_root(pc->mem_cgroup))
+ if (mem_cgroup_is_rootid(pc->mem_cgroup))
return;
list_add(&pc->lru, &mz->lists[lru]);
}
@@ -1497,7 +1508,7 @@ void mem_cgroup_update_file_mapped(struc
return;

lock_page_cgroup(pc);
- mem = pc->mem_cgroup;
+ mem = id_to_memcg(pc->mem_cgroup, true);
if (!mem || !PageCgroupUsed(pc))
goto done;

@@ -1885,14 +1896,14 @@ struct mem_cgroup *try_get_mem_cgroup_fr
pc = lookup_page_cgroup(page);
lock_page_cgroup(pc);
if (PageCgroupUsed(pc)) {
- mem = pc->mem_cgroup;
+ mem = id_to_memcg(pc->mem_cgroup, true);
if (mem && !css_tryget(&mem->css))
mem = NULL;
} else if (PageSwapCache(page)) {
ent.val = page_private(page);
id = lookup_swap_cgroup(ent);
rcu_read_lock();
- mem = id_to_memcg(id);
+ mem = id_to_memcg(id, false);
if (mem && !css_tryget(&mem->css))
mem = NULL;
rcu_read_unlock();
@@ -1921,7 +1932,7 @@ static void __mem_cgroup_commit_charge(s
return;
}

- pc->mem_cgroup = mem;
+ pc->mem_cgroup = css_id(&mem->css);
/*
* We access a page_cgroup asynchronously without lock_page_cgroup().
* Especially when a page_cgroup is taken from a page, pc->mem_cgroup
@@ -1979,7 +1990,7 @@ static void __mem_cgroup_move_account(st
VM_BUG_ON(PageLRU(pc->page));
VM_BUG_ON(!PageCgroupLocked(pc));
VM_BUG_ON(!PageCgroupUsed(pc));
- VM_BUG_ON(pc->mem_cgroup != from);
+ VM_BUG_ON(id_to_memcg(pc->mem_cgroup, true) != from);

if (PageCgroupFileMapped(pc)) {
/* Update mapped_file data for mem_cgroup */
@@ -1994,7 +2005,7 @@ static void __mem_cgroup_move_account(st
mem_cgroup_cancel_charge(from);

/* caller should have done css_get */
- pc->mem_cgroup = to;
+ pc->mem_cgroup = css_id(&to->css);
mem_cgroup_charge_statistics(to, pc, true);
/*
* We charges against "to" which may not have any tasks. Then, "to"
@@ -2014,11 +2025,11 @@ static int mem_cgroup_move_account(struc
{
int ret = -EINVAL;
lock_page_cgroup(pc);
- if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
+ if (PageCgroupUsed(pc) && id_to_memcg(pc->mem_cgroup, true) == from) {
__mem_cgroup_move_account(pc, from, to, uncharge);
ret = 0;
}
- unlock_page_cgroup(pc);
+ rcu_read_unlock();
/*
* check events
*/
@@ -2244,7 +2255,7 @@ __mem_cgroup_commit_charge_swapin(struct

id = swap_cgroup_record(ent, 0);
rcu_read_lock();
- memcg = id_to_memcg(id);
+ memcg = id_to_memcg(id, false);
if (memcg) {
/*
* This recorded memcg can be obsolete one. So, avoid
@@ -2354,7 +2365,7 @@ __mem_cgroup_uncharge_common(struct page

lock_page_cgroup(pc);

- mem = pc->mem_cgroup;
+ mem = id_to_memcg(pc->mem_cgroup, true);

if (!PageCgroupUsed(pc))
goto unlock_out;
@@ -2509,7 +2520,7 @@ void mem_cgroup_uncharge_swap(swp_entry_

id = swap_cgroup_record(ent, 0);
rcu_read_lock();
- memcg = id_to_memcg(id);
+ memcg = id_to_memcg(id, false);
if (memcg) {
/*
* We uncharge this because swap is freed.
@@ -2600,7 +2611,7 @@ int mem_cgroup_prepare_migration(struct
pc = lookup_page_cgroup(page);
lock_page_cgroup(pc);
if (PageCgroupUsed(pc)) {
- mem = pc->mem_cgroup;
+ mem = id_to_memcg(pc->mem_cgroup, true);
css_get(&mem->css);
/*
* At migrating an anonymous page, its mapcount goes down
@@ -4442,7 +4453,8 @@ static int is_target_pte_for_mc(struct v
* mem_cgroup_move_account() checks the pc is valid or not under
* the lock.
*/
- if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
+ if (PageCgroupUsed(pc) &&
+ id_to_memcg(pc->mem_cgroup, true) == mc.from) {
ret = MC_TARGET_PAGE;
if (target)
target->page = page;

2010-08-20 10:07:55

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 4/5] memcg: lockless update of file_mapped

No changes from v4.
==
From: KAMEZAWA Hiroyuki <[email protected]>

At accounting file events per memory cgroup, we need to find memory cgroup
via page_cgroup->mem_cgroup. Now, we use lock_page_cgroup().

But, considering the context which page-cgroup for files are accessed,
we can use alternative light-weight mutual execusion in the most case.
At handling file-caches, the only race we have to take care of is "moving"
account, IOW, overwriting page_cgroup->mem_cgroup. Because file status
update is done while the page-cache is in stable state, we don't have to
take care of race with charge/uncharge.

Unlike charge/uncharge, "move" happens not so frequently. It happens only when
rmdir() and task-moving (with a special settings.)
This patch adds a race-checker for file-cache-status accounting v.s. account
moving. The new per-cpu-per-memcg counter MEM_CGROUP_ON_MOVE is added.
The routine for account move
1. Increment it before start moving
2. Call synchronize_rcu()
3. Decrement it after the end of moving.
By this, file-status-counting routine can check it needs to call
lock_page_cgroup(). In most case, I doesn't need to call it.


Changelog: 20100804
- added a comment for possible optimization hint.
Changelog: 20100730
- some cleanup.
Changelog: 20100729
- replaced __this_cpu_xxx() with this_cpu_xxx
(because we don't call spinlock)
- added VM_BUG_ON().

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 67 insertions(+), 12 deletions(-)

Index: mmotm-0811/mm/memcontrol.c
===================================================================
--- mmotm-0811.orig/mm/memcontrol.c
+++ mmotm-0811/mm/memcontrol.c
@@ -90,6 +90,7 @@ enum mem_cgroup_stat_index {
MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */
+ MEM_CGROUP_ON_MOVE, /* A check for locking move account/status */

MEM_CGROUP_STAT_NSTATS,
};
@@ -1086,7 +1087,50 @@ static unsigned int get_swappiness(struc
return swappiness;
}

-/* A routine for testing mem is not under move_account */
+static void mem_cgroup_start_move(struct mem_cgroup *mem)
+{
+ int cpu;
+ /* for fast checking in mem_cgroup_update_file_stat() etc..*/
+ spin_lock(&mc.lock);
+ /* TODO: Can we optimize this by for_each_online_cpu() ? */
+ for_each_possible_cpu(cpu)
+ per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
+ spin_unlock(&mc.lock);
+
+ synchronize_rcu();
+}
+
+static void mem_cgroup_end_move(struct mem_cgroup *mem)
+{
+ int cpu;
+
+ if (!mem)
+ return;
+ /* for fast checking in mem_cgroup_update_file_stat() etc..*/
+ spin_lock(&mc.lock);
+ for_each_possible_cpu(cpu)
+ per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) -= 1;
+ spin_unlock(&mc.lock);
+}
+
+/*
+ * mem_cgroup_is_moved -- checking a cgroup is mc.from target or not.
+ * used for avoiding race.
+ * mem_cgroup_under_move -- checking a cgroup is mc.from or mc.to or
+ * under hierarchy of them. used for waiting at
+ * memory pressure.
+ * Result of is_moved can be trusted until the end of rcu_read_unlock().
+ * The caller must do
+ * rcu_read_lock();
+ * result = mem_cgroup_is_moved();
+ * .....make use of result here....
+ * rcu_read_unlock();
+ */
+static bool mem_cgroup_is_moved(struct mem_cgroup *mem)
+{
+ VM_BUG_ON(!rcu_read_lock_held());
+ return this_cpu_read(mem->stat->count[MEM_CGROUP_ON_MOVE]) > 0;
+}

static bool mem_cgroup_under_move(struct mem_cgroup *mem)
{
@@ -1502,29 +1546,36 @@ void mem_cgroup_update_file_mapped(struc
{
struct mem_cgroup *mem;
struct page_cgroup *pc;
+ bool need_lock = false;

pc = lookup_page_cgroup(page);
if (unlikely(!pc))
return;
-
- lock_page_cgroup(pc);
+ rcu_read_lock();
mem = id_to_memcg(pc->mem_cgroup, true);
- if (!mem || !PageCgroupUsed(pc))
+ if (likely(mem)) {
+ if (mem_cgroup_is_moved(mem)) {
+ /* need to serialize with move_account */
+ lock_page_cgroup(pc);
+ need_lock = true;
+ mem = id_to_memcg(pc->mem_cgroup, true);
+ if (unlikely(!mem))
+ goto done;
+ }
+ }
+ if (unlikely(!PageCgroupUsed(pc)))
goto done;
-
- /*
- * Preemption is already disabled. We can use __this_cpu_xxx
- */
if (val > 0) {
- __this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+ this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
SetPageCgroupFileMapped(pc);
} else {
- __this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+ this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
ClearPageCgroupFileMapped(pc);
}
-
done:
- unlock_page_cgroup(pc);
+ if (need_lock)
+ unlock_page_cgroup(pc);
+ rcu_read_unlock();
}

/*
@@ -3064,6 +3115,7 @@ move_account:
lru_add_drain_all();
drain_all_stock_sync();
ret = 0;
+ mem_cgroup_start_move(mem);
for_each_node_state(node, N_HIGH_MEMORY) {
for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
enum lru_list l;
@@ -3077,6 +3129,7 @@ move_account:
if (ret)
break;
}
+ mem_cgroup_end_move(mem);
memcg_oom_recover(mem);
/* it seems parent cgroup doesn't have enough mem */
if (ret == -ENOMEM)
@@ -4563,6 +4616,7 @@ static void mem_cgroup_clear_mc(void)
mc.to = NULL;
mc.moving_task = NULL;
spin_unlock(&mc.lock);
+ mem_cgroup_end_move(from);
memcg_oom_recover(from);
memcg_oom_recover(to);
wake_up_all(&mc.waitq);
@@ -4593,6 +4647,7 @@ static int mem_cgroup_can_attach(struct
VM_BUG_ON(mc.moved_charge);
VM_BUG_ON(mc.moved_swap);
VM_BUG_ON(mc.moving_task);
+ mem_cgroup_start_move(from);
spin_lock(&mc.lock);
mc.from = from;
mc.to = mem;

2010-08-20 10:08:57

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 5/5] memcg: generic file accounting update function

No changes from v4.
==
From: KAMEZAWA Hiroyuki <[email protected]>

Preparing for adding new status arounf file caches.(dirty, writeback,etc..)
Using a unified macro and more generic names.
All counters will have the same rule for updating.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/memcontrol.h | 24 ++++++++++++++++++++++
include/linux/page_cgroup.h | 19 ++++++++++++------
mm/memcontrol.c | 46 ++++++++++++++++++--------------------------
3 files changed, 56 insertions(+), 33 deletions(-)

Index: mmotm-0811/include/linux/memcontrol.h
===================================================================
--- mmotm-0811.orig/include/linux/memcontrol.h
+++ mmotm-0811/include/linux/memcontrol.h
@@ -25,6 +25,30 @@ struct page_cgroup;
struct page;
struct mm_struct;

+/*
+ * Per-cpu Statistics for memory cgroup.
+ */
+enum mem_cgroup_stat_index {
+ /*
+ * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+ */
+ MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
+ MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
+ MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
+ MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
+ MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+ MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */
+ MEM_CGROUP_ON_MOVE, /* A check for locking move account/status */
+ /* When you add new member for file-stat, please update page_cgroup.h */
+ MEM_CGROUP_FSTAT_BASE,
+ MEM_CGROUP_FSTAT_FILE_MAPPED = MEM_CGROUP_FSTAT_BASE,
+ MEM_CGROUP_FSTAT_END,
+ MEM_CGROUP_STAT_NSTATS = MEM_CGROUP_FSTAT_END,
+};
+
+#define MEMCG_FSTAT_IDX(idx) ((idx) - MEM_CGROUP_FSTAT_BASE)
+#define NR_FILE_FLAGS_MEMCG ((MEM_CGROUP_FSTAT_END - MEM_CGROUP_FSTAT_BASE))
+
extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
struct list_head *dst,
unsigned long *scanned, int order,
Index: mmotm-0811/mm/memcontrol.c
===================================================================
--- mmotm-0811.orig/mm/memcontrol.c
+++ mmotm-0811/mm/memcontrol.c
@@ -76,24 +76,6 @@ static int really_do_swap_account __init
#define THRESHOLDS_EVENTS_THRESH (7) /* once in 128 */
#define SOFTLIMIT_EVENTS_THRESH (10) /* once in 1024 */

-/*
- * Statistics for memory cgroup.
- */
-enum mem_cgroup_stat_index {
- /*
- * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
- */
- MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
- MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
- MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
- MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
- MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
- MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
- MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */
- MEM_CGROUP_ON_MOVE, /* A check for locking move account/status */
-
- MEM_CGROUP_STAT_NSTATS,
-};

struct mem_cgroup_stat_cpu {
s64 count[MEM_CGROUP_STAT_NSTATS];
@@ -1545,7 +1527,8 @@ bool mem_cgroup_handle_oom(struct mem_cg
* Currently used to update mapped file statistics, but the routine can be
* generalized to update other statistics as well.
*/
-void mem_cgroup_update_file_mapped(struct page *page, int val)
+static void
+mem_cgroup_update_file_stat(struct page *page, unsigned int idx, int val)
{
struct mem_cgroup *mem;
struct page_cgroup *pc;
@@ -1569,11 +1552,11 @@ void mem_cgroup_update_file_mapped(struc
if (unlikely(!PageCgroupUsed(pc)))
goto done;
if (val > 0) {
- this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- SetPageCgroupFileMapped(pc);
+ this_cpu_inc(mem->stat->count[idx]);
+ set_bit(fflag_idx(MEMCG_FSTAT_IDX(idx)), &pc->flags);
} else {
- this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- ClearPageCgroupFileMapped(pc);
+ this_cpu_dec(mem->stat->count[idx]);
+ clear_bit(fflag_idx(MEMCG_FSTAT_IDX(idx)), &pc->flags);
}
done:
if (need_lock)
@@ -1581,6 +1564,12 @@ done:
rcu_read_unlock();
}

+void mem_cgroup_update_file_mapped(struct page *page, int val)
+{
+ return mem_cgroup_update_file_stat(page,
+ MEM_CGROUP_FSTAT_FILE_MAPPED, val);
+}
+
/*
* size of first charge trial. "32" comes from vmscan.c's magic value.
* TODO: maybe necessary to use big numbers in big irons.
@@ -2040,17 +2029,20 @@ static void __mem_cgroup_commit_charge(s
static void __mem_cgroup_move_account(struct page_cgroup *pc,
struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
{
+ int i;
VM_BUG_ON(from == to);
VM_BUG_ON(PageLRU(pc->page));
VM_BUG_ON(!PageCgroupLocked(pc));
VM_BUG_ON(!PageCgroupUsed(pc));
VM_BUG_ON(id_to_memcg(pc->mem_cgroup, true) != from);

- if (PageCgroupFileMapped(pc)) {
+ for (i = MEM_CGROUP_FSTAT_BASE; i < MEM_CGROUP_FSTAT_END; ++i) {
+ if (!test_bit(fflag_idx(MEMCG_FSTAT_IDX(i)), &pc->flags))
+ continue;
/* Update mapped_file data for mem_cgroup */
preempt_disable();
- __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+ __this_cpu_dec(from->stat->count[i]);
+ __this_cpu_inc(to->stat->count[i]);
preempt_enable();
}
mem_cgroup_charge_statistics(from, pc, false);
@@ -3479,7 +3471,7 @@ static int mem_cgroup_get_local_stat(str
s->stat[MCS_CACHE] += val * PAGE_SIZE;
val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
s->stat[MCS_RSS] += val * PAGE_SIZE;
- val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_MAPPED);
+ val = mem_cgroup_read_stat(mem, MEM_CGROUP_FSTAT_FILE_MAPPED);
s->stat[MCS_FILE_MAPPED] += val * PAGE_SIZE;
val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PGPGIN_COUNT);
s->stat[MCS_PGPGIN] += val;
Index: mmotm-0811/include/linux/page_cgroup.h
===================================================================
--- mmotm-0811.orig/include/linux/page_cgroup.h
+++ mmotm-0811/include/linux/page_cgroup.h
@@ -3,6 +3,7 @@

#ifdef CONFIG_CGROUP_MEM_RES_CTLR
#include <linux/bit_spinlock.h>
+#include <linux/memcontrol.h> /* for flags */
/*
* Page Cgroup can be considered as an extended mem_map.
* A page_cgroup page is associated with every page descriptor. The
@@ -43,10 +44,22 @@ enum {
PCG_CACHE, /* charged as cache */
PCG_USED, /* this object is in use. */
PCG_ACCT_LRU, /* page has been accounted for */
- PCG_FILE_MAPPED, /* page is accounted as "mapped" */
PCG_MIGRATION, /* under page migration */
+ PCG_FILE_FLAGS_MEMCG, /* see memcontrol.h */
+ PCG_FILE_FLAGS_MEMCG_END
+ = PCG_FILE_FLAGS_MEMCG + NR_FILE_FLAGS_MEMCG - 1,
};

+/*
+ * file-stat flags are defined regarding to memcg's stat information.
+ * Here, just defines a macro for indexing
+ */
+static inline int fflag_idx(int idx)
+{
+ VM_BUG_ON((idx) >= NR_FILE_FLAGS_MEMCG);
+ return (idx) + PCG_FILE_FLAGS_MEMCG;
+}
+
#define TESTPCGFLAG(uname, lname) \
static inline int PageCgroup##uname(struct page_cgroup *pc) \
{ return test_bit(PCG_##lname, &pc->flags); }
@@ -79,11 +92,6 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
TESTPCGFLAG(AcctLRU, ACCT_LRU)
TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)

-
-SETPCGFLAG(FileMapped, FILE_MAPPED)
-CLEARPCGFLAG(FileMapped, FILE_MAPPED)
-TESTPCGFLAG(FileMapped, FILE_MAPPED)
-
SETPCGFLAG(Migration, MIGRATION)
CLEARPCGFLAG(Migration, MIGRATION)
TESTPCGFLAG(Migration, MIGRATION)

2010-08-20 10:10:35

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] memcg: use ID in page_cgroup


But title....this was [PATCH 3/5].
Sorry.
-kame

2010-08-23 03:39:44

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [PATCH 2/5] memcg: use array and ID for quick look up

Hi,

> +/* 0 is unused */
> +static atomic_t mem_cgroup_num;
> +#define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
> +static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
> +
> +/* Must be called under rcu_read_lock */
> +static struct mem_cgroup *id_to_memcg(unsigned short id)
> +{
> + struct mem_cgroup *ret;
> + /* see mem_cgroup_free() */
> + ret = rcu_dereference_check(mem_cgroups[id], rch_read_lock_held());
> + if (likely(ret && ret->valid))
> + return ret;
> + return NULL;
> +}
> +
I prefer "mem" to "ret".

> @@ -2231,7 +2244,7 @@ __mem_cgroup_commit_charge_swapin(struct
>
> id = swap_cgroup_record(ent, 0);
> rcu_read_lock();
> - memcg = mem_cgroup_lookup(id);
> + memcg = id_to_memcg(id);
> if (memcg) {
> /*
> * This recorded memcg can be obsolete one. So, avoid
> @@ -2240,9 +2253,10 @@ __mem_cgroup_commit_charge_swapin(struct
> if (!mem_cgroup_is_root(memcg))
> res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> mem_cgroup_swap_statistics(memcg, false);
> + rcu_read_unlock();
> mem_cgroup_put(memcg);
> - }
> - rcu_read_unlock();
> + } else
> + rcu_read_unlock();
> }
> /*
> * At swapin, we may charge account against cgroup which has no tasks.
> @@ -2495,7 +2509,7 @@ void mem_cgroup_uncharge_swap(swp_entry_
>
> id = swap_cgroup_record(ent, 0);
> rcu_read_lock();
> - memcg = mem_cgroup_lookup(id);
> + memcg = id_to_memcg(id);
> if (memcg) {
> /*
> * We uncharge this because swap is freed.
> @@ -2504,9 +2518,10 @@ void mem_cgroup_uncharge_swap(swp_entry_
> if (!mem_cgroup_is_root(memcg))
> res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> mem_cgroup_swap_statistics(memcg, false);
> + rcu_read_unlock();
> mem_cgroup_put(memcg);
> - }
> - rcu_read_unlock();
> + } else
> + rcu_read_unlock();
> }
>
> /**
Could you explain why we need rcu_read_unlock() before mem_cgroup_put() ?
I suspect that it's because mem_cgroup_put() can free the memcg, but do we
need mem->valid then ?


Thanks,
Daisuke Nishimura.

2010-08-23 05:52:50

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [PATCH] memcg: use ID in page_cgroup

On Fri, 20 Aug 2010 19:01:32 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

>
> I have an idea to remove page_cgroup->page pointer, 8bytes reduction per page.
> But it will be after this work.
Another off topic. I think we can reduce the size of mem_cgroup by packing
some boolean members into one "unsinged long flags".

> @@ -300,12 +300,13 @@ static atomic_t mem_cgroup_num;
> #define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
> static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
>
> -/* Must be called under rcu_read_lock */
> -static struct mem_cgroup *id_to_memcg(unsigned short id)
> +/* Must be called under rcu_read_lock, set safe==true if under lock */
Do you mean, "Set safe==true if we can ensure by some locks that the id can be
safely dereferenced without rcu_read_lock", right ?

> +static struct mem_cgroup *id_to_memcg(unsigned short id, bool safe)
> {
> struct mem_cgroup *ret;
> /* see mem_cgroup_free() */
> - ret = rcu_dereference_check(mem_cgroups[id], rch_read_lock_held());
> + ret = rcu_dereference_check(mem_cgroups[id],
> + rch_read_lock_held() || safe);
> if (likely(ret && ret->valid))
> return ret;
> return NULL;

(snip)
> @@ -723,6 +729,11 @@ static inline bool mem_cgroup_is_root(st
> return (mem == root_mem_cgroup);
> }
>
> +static inline bool mem_cgroup_is_rootid(unsigned short id)
> +{
> + return (id == 1);
> +}
> +
It might be better to add

BUG_ON(newid->id != 1)

in cgroup.c::cgroup_init_idr().


Thanks,
Daisuke Nishimura.

2010-08-23 09:00:47

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [PATCH 4/5] memcg: lockless update of file_mapped

This patch looks good to me, but I have one question.

Why do we need to acquire sc.lock inside mem_cgroup_(start|end)_move() ?
These functions doesn't access mc.*.

Thanks,
Daisuke Nishimura.

On Fri, 20 Aug 2010 19:02:56 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> No changes from v4.
> ==
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> At accounting file events per memory cgroup, we need to find memory cgroup
> via page_cgroup->mem_cgroup. Now, we use lock_page_cgroup().
>
> But, considering the context which page-cgroup for files are accessed,
> we can use alternative light-weight mutual execusion in the most case.
> At handling file-caches, the only race we have to take care of is "moving"
> account, IOW, overwriting page_cgroup->mem_cgroup. Because file status
> update is done while the page-cache is in stable state, we don't have to
> take care of race with charge/uncharge.
>
> Unlike charge/uncharge, "move" happens not so frequently. It happens only when
> rmdir() and task-moving (with a special settings.)
> This patch adds a race-checker for file-cache-status accounting v.s. account
> moving. The new per-cpu-per-memcg counter MEM_CGROUP_ON_MOVE is added.
> The routine for account move
> 1. Increment it before start moving
> 2. Call synchronize_rcu()
> 3. Decrement it after the end of moving.
> By this, file-status-counting routine can check it needs to call
> lock_page_cgroup(). In most case, I doesn't need to call it.
>
>
> Changelog: 20100804
> - added a comment for possible optimization hint.
> Changelog: 20100730
> - some cleanup.
> Changelog: 20100729
> - replaced __this_cpu_xxx() with this_cpu_xxx
> (because we don't call spinlock)
> - added VM_BUG_ON().
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> mm/memcontrol.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 67 insertions(+), 12 deletions(-)
>
> Index: mmotm-0811/mm/memcontrol.c
> ===================================================================
> --- mmotm-0811.orig/mm/memcontrol.c
> +++ mmotm-0811/mm/memcontrol.c
> @@ -90,6 +90,7 @@ enum mem_cgroup_stat_index {
> MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
> MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */
> + MEM_CGROUP_ON_MOVE, /* A check for locking move account/status */
>
> MEM_CGROUP_STAT_NSTATS,
> };
> @@ -1086,7 +1087,50 @@ static unsigned int get_swappiness(struc
> return swappiness;
> }
>
> -/* A routine for testing mem is not under move_account */
> +static void mem_cgroup_start_move(struct mem_cgroup *mem)
> +{
> + int cpu;
> + /* for fast checking in mem_cgroup_update_file_stat() etc..*/
> + spin_lock(&mc.lock);
> + /* TODO: Can we optimize this by for_each_online_cpu() ? */
> + for_each_possible_cpu(cpu)
> + per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
> + spin_unlock(&mc.lock);
> +
> + synchronize_rcu();
> +}
> +
> +static void mem_cgroup_end_move(struct mem_cgroup *mem)
> +{
> + int cpu;
> +
> + if (!mem)
> + return;
> + /* for fast checking in mem_cgroup_update_file_stat() etc..*/
> + spin_lock(&mc.lock);
> + for_each_possible_cpu(cpu)
> + per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) -= 1;
> + spin_unlock(&mc.lock);
> +}
> +
> +/*
> + * mem_cgroup_is_moved -- checking a cgroup is mc.from target or not.
> + * used for avoiding race.
> + * mem_cgroup_under_move -- checking a cgroup is mc.from or mc.to or
> + * under hierarchy of them. used for waiting at
> + * memory pressure.
> + * Result of is_moved can be trusted until the end of rcu_read_unlock().
> + * The caller must do
> + * rcu_read_lock();
> + * result = mem_cgroup_is_moved();
> + * .....make use of result here....
> + * rcu_read_unlock();
> + */
> +static bool mem_cgroup_is_moved(struct mem_cgroup *mem)
> +{
> + VM_BUG_ON(!rcu_read_lock_held());
> + return this_cpu_read(mem->stat->count[MEM_CGROUP_ON_MOVE]) > 0;
> +}
>
> static bool mem_cgroup_under_move(struct mem_cgroup *mem)
> {
> @@ -1502,29 +1546,36 @@ void mem_cgroup_update_file_mapped(struc
> {
> struct mem_cgroup *mem;
> struct page_cgroup *pc;
> + bool need_lock = false;
>
> pc = lookup_page_cgroup(page);
> if (unlikely(!pc))
> return;
> -
> - lock_page_cgroup(pc);
> + rcu_read_lock();
> mem = id_to_memcg(pc->mem_cgroup, true);
> - if (!mem || !PageCgroupUsed(pc))
> + if (likely(mem)) {
> + if (mem_cgroup_is_moved(mem)) {
> + /* need to serialize with move_account */
> + lock_page_cgroup(pc);
> + need_lock = true;
> + mem = id_to_memcg(pc->mem_cgroup, true);
> + if (unlikely(!mem))
> + goto done;
> + }
> + }
> + if (unlikely(!PageCgroupUsed(pc)))
> goto done;
> -
> - /*
> - * Preemption is already disabled. We can use __this_cpu_xxx
> - */
> if (val > 0) {
> - __this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> + this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> SetPageCgroupFileMapped(pc);
> } else {
> - __this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> + this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> ClearPageCgroupFileMapped(pc);
> }
> -
> done:
> - unlock_page_cgroup(pc);
> + if (need_lock)
> + unlock_page_cgroup(pc);
> + rcu_read_unlock();
> }
>
> /*
> @@ -3064,6 +3115,7 @@ move_account:
> lru_add_drain_all();
> drain_all_stock_sync();
> ret = 0;
> + mem_cgroup_start_move(mem);
> for_each_node_state(node, N_HIGH_MEMORY) {
> for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
> enum lru_list l;
> @@ -3077,6 +3129,7 @@ move_account:
> if (ret)
> break;
> }
> + mem_cgroup_end_move(mem);
> memcg_oom_recover(mem);
> /* it seems parent cgroup doesn't have enough mem */
> if (ret == -ENOMEM)
> @@ -4563,6 +4616,7 @@ static void mem_cgroup_clear_mc(void)
> mc.to = NULL;
> mc.moving_task = NULL;
> spin_unlock(&mc.lock);
> + mem_cgroup_end_move(from);
> memcg_oom_recover(from);
> memcg_oom_recover(to);
> wake_up_all(&mc.waitq);
> @@ -4593,6 +4647,7 @@ static int mem_cgroup_can_attach(struct
> VM_BUG_ON(mc.moved_charge);
> VM_BUG_ON(mc.moved_swap);
> VM_BUG_ON(mc.moving_task);
> + mem_cgroup_start_move(from);
> spin_lock(&mc.lock);
> mc.from = from;
> mc.to = mem;
>

2010-08-23 23:54:18

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 4/5] memcg: lockless update of file_mapped

On Mon, 23 Aug 2010 17:50:15 +0900
Daisuke Nishimura <[email protected]> wrote:

> This patch looks good to me, but I have one question.
>
> Why do we need to acquire sc.lock inside mem_cgroup_(start|end)_move() ?
> These functions doesn't access mc.*.
>

just reusing a lock to update status. If you don't like, I'll add a new lock.

Thanks,
-Kame

> Thanks,
> Daisuke Nishimura.
>
> On Fri, 20 Aug 2010 19:02:56 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > No changes from v4.
> > ==
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > At accounting file events per memory cgroup, we need to find memory cgroup
> > via page_cgroup->mem_cgroup. Now, we use lock_page_cgroup().
> >
> > But, considering the context which page-cgroup for files are accessed,
> > we can use alternative light-weight mutual execusion in the most case.
> > At handling file-caches, the only race we have to take care of is "moving"
> > account, IOW, overwriting page_cgroup->mem_cgroup. Because file status
> > update is done while the page-cache is in stable state, we don't have to
> > take care of race with charge/uncharge.
> >
> > Unlike charge/uncharge, "move" happens not so frequently. It happens only when
> > rmdir() and task-moving (with a special settings.)
> > This patch adds a race-checker for file-cache-status accounting v.s. account
> > moving. The new per-cpu-per-memcg counter MEM_CGROUP_ON_MOVE is added.
> > The routine for account move
> > 1. Increment it before start moving
> > 2. Call synchronize_rcu()
> > 3. Decrement it after the end of moving.
> > By this, file-status-counting routine can check it needs to call
> > lock_page_cgroup(). In most case, I doesn't need to call it.
> >
> >
> > Changelog: 20100804
> > - added a comment for possible optimization hint.
> > Changelog: 20100730
> > - some cleanup.
> > Changelog: 20100729
> > - replaced __this_cpu_xxx() with this_cpu_xxx
> > (because we don't call spinlock)
> > - added VM_BUG_ON().
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > mm/memcontrol.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++---------
> > 1 file changed, 67 insertions(+), 12 deletions(-)
> >
> > Index: mmotm-0811/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-0811.orig/mm/memcontrol.c
> > +++ mmotm-0811/mm/memcontrol.c
> > @@ -90,6 +90,7 @@ enum mem_cgroup_stat_index {
> > MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
> > MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> > MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */
> > + MEM_CGROUP_ON_MOVE, /* A check for locking move account/status */
> >
> > MEM_CGROUP_STAT_NSTATS,
> > };
> > @@ -1086,7 +1087,50 @@ static unsigned int get_swappiness(struc
> > return swappiness;
> > }
> >
> > -/* A routine for testing mem is not under move_account */
> > +static void mem_cgroup_start_move(struct mem_cgroup *mem)
> > +{
> > + int cpu;
> > + /* for fast checking in mem_cgroup_update_file_stat() etc..*/
> > + spin_lock(&mc.lock);
> > + /* TODO: Can we optimize this by for_each_online_cpu() ? */
> > + for_each_possible_cpu(cpu)
> > + per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
> > + spin_unlock(&mc.lock);
> > +
> > + synchronize_rcu();
> > +}
> > +
> > +static void mem_cgroup_end_move(struct mem_cgroup *mem)
> > +{
> > + int cpu;
> > +
> > + if (!mem)
> > + return;
> > + /* for fast checking in mem_cgroup_update_file_stat() etc..*/
> > + spin_lock(&mc.lock);
> > + for_each_possible_cpu(cpu)
> > + per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) -= 1;
> > + spin_unlock(&mc.lock);
> > +}
> > +
> > +/*
> > + * mem_cgroup_is_moved -- checking a cgroup is mc.from target or not.
> > + * used for avoiding race.
> > + * mem_cgroup_under_move -- checking a cgroup is mc.from or mc.to or
> > + * under hierarchy of them. used for waiting at
> > + * memory pressure.
> > + * Result of is_moved can be trusted until the end of rcu_read_unlock().
> > + * The caller must do
> > + * rcu_read_lock();
> > + * result = mem_cgroup_is_moved();
> > + * .....make use of result here....
> > + * rcu_read_unlock();
> > + */
> > +static bool mem_cgroup_is_moved(struct mem_cgroup *mem)
> > +{
> > + VM_BUG_ON(!rcu_read_lock_held());
> > + return this_cpu_read(mem->stat->count[MEM_CGROUP_ON_MOVE]) > 0;
> > +}
> >
> > static bool mem_cgroup_under_move(struct mem_cgroup *mem)
> > {
> > @@ -1502,29 +1546,36 @@ void mem_cgroup_update_file_mapped(struc
> > {
> > struct mem_cgroup *mem;
> > struct page_cgroup *pc;
> > + bool need_lock = false;
> >
> > pc = lookup_page_cgroup(page);
> > if (unlikely(!pc))
> > return;
> > -
> > - lock_page_cgroup(pc);
> > + rcu_read_lock();
> > mem = id_to_memcg(pc->mem_cgroup, true);
> > - if (!mem || !PageCgroupUsed(pc))
> > + if (likely(mem)) {
> > + if (mem_cgroup_is_moved(mem)) {
> > + /* need to serialize with move_account */
> > + lock_page_cgroup(pc);
> > + need_lock = true;
> > + mem = id_to_memcg(pc->mem_cgroup, true);
> > + if (unlikely(!mem))
> > + goto done;
> > + }
> > + }
> > + if (unlikely(!PageCgroupUsed(pc)))
> > goto done;
> > -
> > - /*
> > - * Preemption is already disabled. We can use __this_cpu_xxx
> > - */
> > if (val > 0) {
> > - __this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > + this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > SetPageCgroupFileMapped(pc);
> > } else {
> > - __this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > + this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > ClearPageCgroupFileMapped(pc);
> > }
> > -
> > done:
> > - unlock_page_cgroup(pc);
> > + if (need_lock)
> > + unlock_page_cgroup(pc);
> > + rcu_read_unlock();
> > }
> >
> > /*
> > @@ -3064,6 +3115,7 @@ move_account:
> > lru_add_drain_all();
> > drain_all_stock_sync();
> > ret = 0;
> > + mem_cgroup_start_move(mem);
> > for_each_node_state(node, N_HIGH_MEMORY) {
> > for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
> > enum lru_list l;
> > @@ -3077,6 +3129,7 @@ move_account:
> > if (ret)
> > break;
> > }
> > + mem_cgroup_end_move(mem);
> > memcg_oom_recover(mem);
> > /* it seems parent cgroup doesn't have enough mem */
> > if (ret == -ENOMEM)
> > @@ -4563,6 +4616,7 @@ static void mem_cgroup_clear_mc(void)
> > mc.to = NULL;
> > mc.moving_task = NULL;
> > spin_unlock(&mc.lock);
> > + mem_cgroup_end_move(from);
> > memcg_oom_recover(from);
> > memcg_oom_recover(to);
> > wake_up_all(&mc.waitq);
> > @@ -4593,6 +4647,7 @@ static int mem_cgroup_can_attach(struct
> > VM_BUG_ON(mc.moved_charge);
> > VM_BUG_ON(mc.moved_swap);
> > VM_BUG_ON(mc.moving_task);
> > + mem_cgroup_start_move(from);
> > spin_lock(&mc.lock);
> > mc.from = from;
> > mc.to = mem;
> >
>

2010-08-23 23:56:13

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 2/5] memcg: use array and ID for quick look up

On Mon, 23 Aug 2010 12:35:33 +0900
Daisuke Nishimura <[email protected]> wrote:

> Hi,
>
> > +/* 0 is unused */
> > +static atomic_t mem_cgroup_num;
> > +#define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
> > +static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
> > +
> > +/* Must be called under rcu_read_lock */
> > +static struct mem_cgroup *id_to_memcg(unsigned short id)
> > +{
> > + struct mem_cgroup *ret;
> > + /* see mem_cgroup_free() */
> > + ret = rcu_dereference_check(mem_cgroups[id], rch_read_lock_held());
> > + if (likely(ret && ret->valid))
> > + return ret;
> > + return NULL;
> > +}
> > +
> I prefer "mem" to "ret".
>
Hmm, ok.


> > @@ -2231,7 +2244,7 @@ __mem_cgroup_commit_charge_swapin(struct
> >
> > id = swap_cgroup_record(ent, 0);
> > rcu_read_lock();
> > - memcg = mem_cgroup_lookup(id);
> > + memcg = id_to_memcg(id);
> > if (memcg) {
> > /*
> > * This recorded memcg can be obsolete one. So, avoid
> > @@ -2240,9 +2253,10 @@ __mem_cgroup_commit_charge_swapin(struct
> > if (!mem_cgroup_is_root(memcg))
> > res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> > mem_cgroup_swap_statistics(memcg, false);
> > + rcu_read_unlock();
> > mem_cgroup_put(memcg);
> > - }
> > - rcu_read_unlock();
> > + } else
> > + rcu_read_unlock();
> > }
> > /*
> > * At swapin, we may charge account against cgroup which has no tasks.
> > @@ -2495,7 +2509,7 @@ void mem_cgroup_uncharge_swap(swp_entry_
> >
> > id = swap_cgroup_record(ent, 0);
> > rcu_read_lock();
> > - memcg = mem_cgroup_lookup(id);
> > + memcg = id_to_memcg(id);
> > if (memcg) {
> > /*
> > * We uncharge this because swap is freed.
> > @@ -2504,9 +2518,10 @@ void mem_cgroup_uncharge_swap(swp_entry_
> > if (!mem_cgroup_is_root(memcg))
> > res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> > mem_cgroup_swap_statistics(memcg, false);
> > + rcu_read_unlock();
> > mem_cgroup_put(memcg);
> > - }
> > - rcu_read_unlock();
> > + } else
> > + rcu_read_unlock();
> > }
> >
> > /**
> Could you explain why we need rcu_read_unlock() before mem_cgroup_put() ?
> I suspect that it's because mem_cgroup_put() can free the memcg, but do we
> need mem->valid then ?
>
mem_cgroup_put() may call synchronize_rcu(). So, we have to unlock before it.

Thanks,
-Kame


2010-08-23 23:57:46

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] memcg: use ID in page_cgroup

On Mon, 23 Aug 2010 14:32:37 +0900
Daisuke Nishimura <[email protected]> wrote:

> On Fri, 20 Aug 2010 19:01:32 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> >
> > I have an idea to remove page_cgroup->page pointer, 8bytes reduction per page.
> > But it will be after this work.
> Another off topic. I think we can reduce the size of mem_cgroup by packing
> some boolean members into one "unsinged long flags".
>
I'll use "flags" to remove struct page pointer.


> > @@ -300,12 +300,13 @@ static atomic_t mem_cgroup_num;
> > #define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
> > static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
> >
> > -/* Must be called under rcu_read_lock */
> > -static struct mem_cgroup *id_to_memcg(unsigned short id)
> > +/* Must be called under rcu_read_lock, set safe==true if under lock */
> Do you mean, "Set safe==true if we can ensure by some locks that the id can be
> safely dereferenced without rcu_read_lock", right ?
>

yes. that's just for rcu_deerference_check().


> > +static struct mem_cgroup *id_to_memcg(unsigned short id, bool safe)
> > {
> > struct mem_cgroup *ret;
> > /* see mem_cgroup_free() */
> > - ret = rcu_dereference_check(mem_cgroups[id], rch_read_lock_held());
> > + ret = rcu_dereference_check(mem_cgroups[id],
> > + rch_read_lock_held() || safe);
> > if (likely(ret && ret->valid))
> > return ret;
> > return NULL;
>
> (snip)
> > @@ -723,6 +729,11 @@ static inline bool mem_cgroup_is_root(st
> > return (mem == root_mem_cgroup);
> > }
> >
> > +static inline bool mem_cgroup_is_rootid(unsigned short id)
> > +{
> > + return (id == 1);
> > +}
> > +
> It might be better to add
>
> BUG_ON(newid->id != 1)
>
> in cgroup.c::cgroup_init_idr().
>

Why ??

Thanks,
-Kame


2010-08-24 00:33:33

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [PATCH 2/5] memcg: use array and ID for quick look up

> > > @@ -2231,7 +2244,7 @@ __mem_cgroup_commit_charge_swapin(struct
> > >
> > > id = swap_cgroup_record(ent, 0);
> > > rcu_read_lock();
> > > - memcg = mem_cgroup_lookup(id);
> > > + memcg = id_to_memcg(id);
> > > if (memcg) {
> > > /*
> > > * This recorded memcg can be obsolete one. So, avoid
> > > @@ -2240,9 +2253,10 @@ __mem_cgroup_commit_charge_swapin(struct
> > > if (!mem_cgroup_is_root(memcg))
> > > res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> > > mem_cgroup_swap_statistics(memcg, false);
> > > + rcu_read_unlock();
> > > mem_cgroup_put(memcg);
> > > - }
> > > - rcu_read_unlock();
> > > + } else
> > > + rcu_read_unlock();
> > > }
> > > /*
> > > * At swapin, we may charge account against cgroup which has no tasks.
> > > @@ -2495,7 +2509,7 @@ void mem_cgroup_uncharge_swap(swp_entry_
> > >
> > > id = swap_cgroup_record(ent, 0);
> > > rcu_read_lock();
> > > - memcg = mem_cgroup_lookup(id);
> > > + memcg = id_to_memcg(id);
> > > if (memcg) {
> > > /*
> > > * We uncharge this because swap is freed.
> > > @@ -2504,9 +2518,10 @@ void mem_cgroup_uncharge_swap(swp_entry_
> > > if (!mem_cgroup_is_root(memcg))
> > > res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> > > mem_cgroup_swap_statistics(memcg, false);
> > > + rcu_read_unlock();
> > > mem_cgroup_put(memcg);
> > > - }
> > > - rcu_read_unlock();
> > > + } else
> > > + rcu_read_unlock();
> > > }
> > >
> > > /**
> > Could you explain why we need rcu_read_unlock() before mem_cgroup_put() ?
> > I suspect that it's because mem_cgroup_put() can free the memcg, but do we
> > need mem->valid then ?
> >
> mem_cgroup_put() may call synchronize_rcu(). So, we have to unlock before it.
>
Ah, I see. Thank you for your explanation.

Daisuke Nishimura.

2010-08-24 00:34:05

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [PATCH 4/5] memcg: lockless update of file_mapped

On Tue, 24 Aug 2010 08:49:16 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Mon, 23 Aug 2010 17:50:15 +0900
> Daisuke Nishimura <[email protected]> wrote:
>
> > This patch looks good to me, but I have one question.
> >
> > Why do we need to acquire sc.lock inside mem_cgroup_(start|end)_move() ?
> > These functions doesn't access mc.*.
> >
>
> just reusing a lock to update status. If you don't like, I'll add a new lock.
>
I see. I think it would be enough just to add some comments about it.


Thanks,
Daisuke Nishimura.

2010-08-24 01:18:54

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [PATCH] memcg: use ID in page_cgroup

> > > @@ -723,6 +729,11 @@ static inline bool mem_cgroup_is_root(st
> > > return (mem == root_mem_cgroup);
> > > }
> > >
> > > +static inline bool mem_cgroup_is_rootid(unsigned short id)
> > > +{
> > > + return (id == 1);
> > > +}
> > > +
> > It might be better to add
> >
> > BUG_ON(newid->id != 1)
> >
> > in cgroup.c::cgroup_init_idr().
> >
>
> Why ??
>
Just to make sure that the root css has id==1. mem_cgroup_is_rootid() make
use of the fact.
I'm sorry if I miss something.

Thanks,
Daisuke Nishimura.

2010-08-24 02:01:13

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] memcg: use ID in page_cgroup

On Tue, 24 Aug 2010 10:14:25 +0900
Daisuke Nishimura <[email protected]> wrote:

> > > > @@ -723,6 +729,11 @@ static inline bool mem_cgroup_is_root(st
> > > > return (mem == root_mem_cgroup);
> > > > }
> > > >
> > > > +static inline bool mem_cgroup_is_rootid(unsigned short id)
> > > > +{
> > > > + return (id == 1);
> > > > +}
> > > > +
> > > It might be better to add
> > >
> > > BUG_ON(newid->id != 1)
> > >
> > > in cgroup.c::cgroup_init_idr().
> > >
> >
> > Why ??
> >
> Just to make sure that the root css has id==1. mem_cgroup_is_rootid() make
> use of the fact.
> I'm sorry if I miss something.
>

Hmm. The function allocating ID does

4530 static struct css_id *get_new_cssid(struct cgroup_subsys *ss, int depth)
4531 {
==
4546 spin_lock(&ss->id_lock);
4547 /* Don't use 0. allocates an ID of 1-65535 */
4548 error = idr_get_new_above(&ss->idr, newid, 1, &myid);
4549 spin_unlock(&ss->id_lock);
==

and allocates ID above "1", always.

Adding BUG_ON(newid->id != 1) will mean that we doubt the bitmap function and
consider possibility that new->id == 0.

But, we're 100% sure that it never happens.

I don't think adding a comment is a right thing to do.

Thanks,
-Kame

2010-08-24 04:08:52

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [PATCH] memcg: use ID in page_cgroup

On Tue, 24 Aug 2010 10:54:05 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Tue, 24 Aug 2010 10:14:25 +0900
> Daisuke Nishimura <[email protected]> wrote:
>
> > > > > @@ -723,6 +729,11 @@ static inline bool mem_cgroup_is_root(st
> > > > > return (mem == root_mem_cgroup);
> > > > > }
> > > > >
> > > > > +static inline bool mem_cgroup_is_rootid(unsigned short id)
> > > > > +{
> > > > > + return (id == 1);
> > > > > +}
> > > > > +
> > > > It might be better to add
> > > >
> > > > BUG_ON(newid->id != 1)
> > > >
> > > > in cgroup.c::cgroup_init_idr().
> > > >
> > >
> > > Why ??
> > >
> > Just to make sure that the root css has id==1. mem_cgroup_is_rootid() make
> > use of the fact.
> > I'm sorry if I miss something.
> >
>
> Hmm. The function allocating ID does
>
> 4530 static struct css_id *get_new_cssid(struct cgroup_subsys *ss, int depth)
> 4531 {
> ==
> 4546 spin_lock(&ss->id_lock);
> 4547 /* Don't use 0. allocates an ID of 1-65535 */
> 4548 error = idr_get_new_above(&ss->idr, newid, 1, &myid);
> 4549 spin_unlock(&ss->id_lock);
> ==
>
> and allocates ID above "1", always.
>
> Adding BUG_ON(newid->id != 1) will mean that we doubt the bitmap function and
> consider possibility that new->id == 0.
>
> But, we're 100% sure that it never happens.
>
> I don't think adding a comment is a right thing to do.
>
Okey, I don't have strong requirement to add BUG_ON() anyway.

These patches looks good to me except for some minor points I've commented.

Thanks,
Daisuke Nishimura.

2010-08-24 06:10:21

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] memcg: use ID in page_cgroup

On Tue, 24 Aug 2010 13:04:02 +0900
Daisuke Nishimura <[email protected]> wrote:

> On Tue, 24 Aug 2010 10:54:05 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > On Tue, 24 Aug 2010 10:14:25 +0900
> > Daisuke Nishimura <[email protected]> wrote:
> >
> > > > > > @@ -723,6 +729,11 @@ static inline bool mem_cgroup_is_root(st
> > > > > > return (mem == root_mem_cgroup);
> > > > > > }
> > > > > >
> > > > > > +static inline bool mem_cgroup_is_rootid(unsigned short id)
> > > > > > +{
> > > > > > + return (id == 1);
> > > > > > +}
> > > > > > +
> > > > > It might be better to add
> > > > >
> > > > > BUG_ON(newid->id != 1)
> > > > >
> > > > > in cgroup.c::cgroup_init_idr().
> > > > >
> > > >
> > > > Why ??
> > > >
> > > Just to make sure that the root css has id==1. mem_cgroup_is_rootid() make
> > > use of the fact.
> > > I'm sorry if I miss something.
> > >
> >
> > Hmm. The function allocating ID does
> >
> > 4530 static struct css_id *get_new_cssid(struct cgroup_subsys *ss, int depth)
> > 4531 {
> > ==
> > 4546 spin_lock(&ss->id_lock);
> > 4547 /* Don't use 0. allocates an ID of 1-65535 */
> > 4548 error = idr_get_new_above(&ss->idr, newid, 1, &myid);
> > 4549 spin_unlock(&ss->id_lock);
> > ==
> >
> > and allocates ID above "1", always.
> >
> > Adding BUG_ON(newid->id != 1) will mean that we doubt the bitmap function and
> > consider possibility that new->id == 0.
> >
> > But, we're 100% sure that it never happens.
> >
> > I don't think adding a comment is a right thing to do.
> >
> Okey, I don't have strong requirement to add BUG_ON() anyway.
>
> These patches looks good to me except for some minor points I've commented.
>

Thank you for review. I'll repost after applying your feedback.

-Kame

2010-08-24 07:19:37

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH 1/5] cgroup: ID notification call back

KAMEZAWA Hiroyuki <[email protected]> writes:

> CC'ed to Paul Menage and Li Zefan.
> ==
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> When cgroup subsystem use ID (ss->use_id==1), each css's ID is assigned
> after successful call of ->create(). css_ID is tightly coupled with
> css struct itself but it is allocated by ->create() call, IOW,
> per-subsystem special allocations.
>
> To know css_id before creation, this patch adds id_attached() callback.
> after css_ID allocation. This will be used by memory cgroup's quick lookup
> routine.
>
> Maybe you can think of other implementations as
> - pass ID to ->create()
> or
> - add post_create()
> etc...
> But when considering dirtiness of codes, this straightforward patch seems
> good to me. If someone wants post_create(), this patch can be replaced.
>
> Changelog: 20100820
> - new approarch.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> Documentation/cgroups/cgroups.txt | 10 ++++++++++
> include/linux/cgroup.h | 1 +
> kernel/cgroup.c | 5 ++++-
> 3 files changed, 15 insertions(+), 1 deletion(-)
>
> Index: mmotm-0811/Documentation/cgroups/cgroups.txt
> ===================================================================
> --- mmotm-0811.orig/Documentation/cgroups/cgroups.txt
> +++ mmotm-0811/Documentation/cgroups/cgroups.txt
> @@ -621,6 +621,16 @@ and root cgroup. Currently this will onl
> the default hierarchy (which never has sub-cgroups) and a hierarchy
> that is being created/destroyed (and hence has no sub-cgroups).
>
> +void id_attached(struct cgroup_subsys *ss, struct cgroup *root)
> +(cgroup_mutex and ss->hierarchy_mutex held by caller)
> +(called only when ss->use_id=1)
> +
> +Called when css_id is attached to css. Because css_id is assigned
> +against "css", css_id is not available until ->create() is called.
> +If subsystem wants to make use of ID at createtion time, use

Minor spelling correction: s/createtion/creation/

> +this handler. This handler will be called after css_id is assigned
> +to css. Not necessary to be implemented in usual(see memcontrol.c)

Maybe this sounds better?

"This handler is not usually needed. See memcontrol.c for an example."

> +
> 4. Questions
> ============
>
> Index: mmotm-0811/include/linux/cgroup.h
> ===================================================================
> --- mmotm-0811.orig/include/linux/cgroup.h
> +++ mmotm-0811/include/linux/cgroup.h
> @@ -475,6 +475,7 @@ struct cgroup_subsys {
> struct cgroup *cgrp);
> void (*post_clone)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> void (*bind)(struct cgroup_subsys *ss, struct cgroup *root);
> + void (*id_attached)(struct cgroup_subsys *ss, struct cgroup *cgrp);
>
> int subsys_id;
> int active;
> Index: mmotm-0811/kernel/cgroup.c
> ===================================================================
> --- mmotm-0811.orig/kernel/cgroup.c
> +++ mmotm-0811/kernel/cgroup.c
> @@ -4618,6 +4618,8 @@ static int __init_or_module cgroup_init_
> newid->stack[0] = newid->id;
> newid->css = rootcss;
> rootcss->id = newid;
> + if (ss->id_attached)
> + ss->id_attached(ss, dummytop);
> return 0;
> }
>
> @@ -4646,7 +4648,8 @@ static int alloc_css_id(struct cgroup_su
> * see cgroup_populate_dir()
> */
> rcu_assign_pointer(child_css->id, child_id);
> -
> + if (ss->id_attached)
> + ss->id_attached(ss, child);
> return 0;
> }
>

2010-08-24 07:23:13

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 1/5] cgroup: ID notification call back

On Tue, 24 Aug 2010 00:19:01 -0700
Greg Thelen <[email protected]> wrote:

> KAMEZAWA Hiroyuki <[email protected]> writes:
>
> > CC'ed to Paul Menage and Li Zefan.
> > ==
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > When cgroup subsystem use ID (ss->use_id==1), each css's ID is assigned
> > after successful call of ->create(). css_ID is tightly coupled with
> > css struct itself but it is allocated by ->create() call, IOW,
> > per-subsystem special allocations.
> >
> > To know css_id before creation, this patch adds id_attached() callback.
> > after css_ID allocation. This will be used by memory cgroup's quick lookup
> > routine.
> >
> > Maybe you can think of other implementations as
> > - pass ID to ->create()
> > or
> > - add post_create()
> > etc...
> > But when considering dirtiness of codes, this straightforward patch seems
> > good to me. If someone wants post_create(), this patch can be replaced.
> >
> > Changelog: 20100820
> > - new approarch.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > Documentation/cgroups/cgroups.txt | 10 ++++++++++
> > include/linux/cgroup.h | 1 +
> > kernel/cgroup.c | 5 ++++-
> > 3 files changed, 15 insertions(+), 1 deletion(-)
> >
> > Index: mmotm-0811/Documentation/cgroups/cgroups.txt
> > ===================================================================
> > --- mmotm-0811.orig/Documentation/cgroups/cgroups.txt
> > +++ mmotm-0811/Documentation/cgroups/cgroups.txt
> > @@ -621,6 +621,16 @@ and root cgroup. Currently this will onl
> > the default hierarchy (which never has sub-cgroups) and a hierarchy
> > that is being created/destroyed (and hence has no sub-cgroups).
> >
> > +void id_attached(struct cgroup_subsys *ss, struct cgroup *root)
> > +(cgroup_mutex and ss->hierarchy_mutex held by caller)
> > +(called only when ss->use_id=1)
> > +
> > +Called when css_id is attached to css. Because css_id is assigned
> > +against "css", css_id is not available until ->create() is called.
> > +If subsystem wants to make use of ID at createtion time, use
>
> Minor spelling correction: s/createtion/creation/
>
ok.

> > +this handler. This handler will be called after css_id is assigned
> > +to css. Not necessary to be implemented in usual(see memcontrol.c)
>
> Maybe this sounds better?
>
> "This handler is not usually needed. See memcontrol.c for an example."
>

Sure. will fix.

> > +
> > 4. Questions
> > ============
> >
> > Index: mmotm-0811/include/linux/cgroup.h
> > ===================================================================
> > --- mmotm-0811.orig/include/linux/cgroup.h
> > +++ mmotm-0811/include/linux/cgroup.h
> > @@ -475,6 +475,7 @@ struct cgroup_subsys {
> > struct cgroup *cgrp);
> > void (*post_clone)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> > void (*bind)(struct cgroup_subsys *ss, struct cgroup *root);
> > + void (*id_attached)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> >
> > int subsys_id;
> > int active;
> > Index: mmotm-0811/kernel/cgroup.c
> > ===================================================================
> > --- mmotm-0811.orig/kernel/cgroup.c
> > +++ mmotm-0811/kernel/cgroup.c
> > @@ -4618,6 +4618,8 @@ static int __init_or_module cgroup_init_
> > newid->stack[0] = newid->id;
> > newid->css = rootcss;
> > rootcss->id = newid;
> > + if (ss->id_attached)
> > + ss->id_attached(ss, dummytop);
> > return 0;
> > }
> >
> > @@ -4646,7 +4648,8 @@ static int alloc_css_id(struct cgroup_su
> > * see cgroup_populate_dir()
> > */
> > rcu_assign_pointer(child_css->id, child_id);
> > -
> > + if (ss->id_attached)
> > + ss->id_attached(ss, child);
> > return 0;
> > }
> >
>

Thank you for review.

-Kame

2010-08-24 07:46:04

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH 2/5] memcg: use array and ID for quick look up

KAMEZAWA Hiroyuki <[email protected]> writes:

> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Now, memory cgroup has an ID per cgroup and make use of it at
> - hierarchy walk,
> - swap recording.
>
> This patch is for making more use of it. The final purpose is
> to replace page_cgroup->mem_cgroup's pointer to an unsigned short.
>
> This patch caches a pointer of memcg in an array. By this, we
> don't have to call css_lookup() which requires radix-hash walk.
> This saves some amount of memory footprint at lookup memcg via id.
>
> Changelog: 20100811
> - adjusted onto mmotm-2010-08-11
> - fixed RCU related parts.
> - use attach_id() callback.
>
> Changelog: 20100804
> - fixed description in init/Kconfig
>
> Changelog: 20100730
> - fixed rcu_read_unlock() placement.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> init/Kconfig | 10 ++++++
> mm/memcontrol.c | 83 ++++++++++++++++++++++++++++++++++++++++++--------------
> 2 files changed, 73 insertions(+), 20 deletions(-)
>
> Index: mmotm-0811/mm/memcontrol.c
> ===================================================================
> --- mmotm-0811.orig/mm/memcontrol.c
> +++ mmotm-0811/mm/memcontrol.c
> @@ -195,6 +195,7 @@ static void mem_cgroup_oom_notify(struct
> */
> struct mem_cgroup {
> struct cgroup_subsys_state css;
> + int valid; /* for checking validness under RCU access.*/
> /*
> * the counter to account for memory usage
> */
> @@ -294,6 +295,29 @@ static bool move_file(void)
> &mc.to->move_charge_at_immigrate);
> }
>
> +/* 0 is unused */
> +static atomic_t mem_cgroup_num;
> +#define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
> +static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
> +
> +/* Must be called under rcu_read_lock */
> +static struct mem_cgroup *id_to_memcg(unsigned short id)
> +{
> + struct mem_cgroup *ret;
> + /* see mem_cgroup_free() */
> + ret = rcu_dereference_check(mem_cgroups[id], rch_read_lock_held());

I think this be rcu_read_lock_held() instead of rch_read_lock_held()?

> + if (likely(ret && ret->valid))
> + return ret;
> + return NULL;
> +}
> +
> +static void register_memcg_id(struct mem_cgroup *mem)
> +{
> + int id = css_id(&mem->css);
> + rcu_assign_pointer(mem_cgroups[id], mem);
> + VM_BUG_ON(!mem->valid);
> +}
> +
> /*
> * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
> * limit reclaim to prevent infinite loops, if they ever occur.
> @@ -1847,18 +1871,7 @@ static void mem_cgroup_cancel_charge(str
> * it's concern. (dropping refcnt from swap can be called against removed
> * memcg.)
> */
> -static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
> -{
> - struct cgroup_subsys_state *css;
>
> - /* ID 0 is unused ID */
> - if (!id)
> - return NULL;
> - css = css_lookup(&mem_cgroup_subsys, id);
> - if (!css)
> - return NULL;
> - return container_of(css, struct mem_cgroup, css);
> -}
>
> struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
> {
> @@ -1879,7 +1892,7 @@ struct mem_cgroup *try_get_mem_cgroup_fr
> ent.val = page_private(page);
> id = lookup_swap_cgroup(ent);
> rcu_read_lock();
> - mem = mem_cgroup_lookup(id);
> + mem = id_to_memcg(id);
> if (mem && !css_tryget(&mem->css))
> mem = NULL;
> rcu_read_unlock();
> @@ -2231,7 +2244,7 @@ __mem_cgroup_commit_charge_swapin(struct
>
> id = swap_cgroup_record(ent, 0);
> rcu_read_lock();
> - memcg = mem_cgroup_lookup(id);
> + memcg = id_to_memcg(id);
> if (memcg) {
> /*
> * This recorded memcg can be obsolete one. So, avoid
> @@ -2240,9 +2253,10 @@ __mem_cgroup_commit_charge_swapin(struct
> if (!mem_cgroup_is_root(memcg))
> res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> mem_cgroup_swap_statistics(memcg, false);
> + rcu_read_unlock();
> mem_cgroup_put(memcg);
> - }
> - rcu_read_unlock();
> + } else
> + rcu_read_unlock();
> }
> /*
> * At swapin, we may charge account against cgroup which has no tasks.
> @@ -2495,7 +2509,7 @@ void mem_cgroup_uncharge_swap(swp_entry_
>
> id = swap_cgroup_record(ent, 0);
> rcu_read_lock();
> - memcg = mem_cgroup_lookup(id);
> + memcg = id_to_memcg(id);
> if (memcg) {
> /*
> * We uncharge this because swap is freed.
> @@ -2504,9 +2518,10 @@ void mem_cgroup_uncharge_swap(swp_entry_
> if (!mem_cgroup_is_root(memcg))
> res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> mem_cgroup_swap_statistics(memcg, false);
> + rcu_read_unlock();
> mem_cgroup_put(memcg);
> - }
> - rcu_read_unlock();
> + } else
> + rcu_read_unlock();
> }
>
> /**
> @@ -4010,6 +4025,9 @@ static struct mem_cgroup *mem_cgroup_all
> struct mem_cgroup *mem;
> int size = sizeof(struct mem_cgroup);
>
> + if (atomic_read(&mem_cgroup_num) == NR_MEMCG_GROUPS)
> + return NULL;
> +
> /* Can be very big if MAX_NUMNODES is very big */
> if (size < PAGE_SIZE)
> mem = kmalloc(size, GFP_KERNEL);
> @@ -4020,6 +4038,7 @@ static struct mem_cgroup *mem_cgroup_all
> return NULL;
>
> memset(mem, 0, size);
> + mem->valid = 1;
> mem->stat = alloc_percpu(struct mem_cgroup_stat_cpu);
> if (!mem->stat) {
> if (size < PAGE_SIZE)
> @@ -4049,6 +4068,7 @@ static void __mem_cgroup_free(struct mem
> mem_cgroup_remove_from_trees(mem);
> free_css_id(&mem_cgroup_subsys, &mem->css);
>
> + atomic_dec(&mem_cgroup_num);
> for_each_node_state(node, N_POSSIBLE)
> free_mem_cgroup_per_zone_info(mem, node);
>
> @@ -4059,6 +4079,19 @@ static void __mem_cgroup_free(struct mem
> vfree(mem);
> }
>
> +static void mem_cgroup_free(struct mem_cgroup *mem)
> +{
> + /* No more lookup */
> + mem->valid = 0;
> + rcu_assign_pointer(mem_cgroups[css_id(&mem->css)], NULL);
> + /*
> + * Because we call vfree() etc...use synchronize_rcu() rather than
> + * call_rcu();
> + */
> + synchronize_rcu();
> + __mem_cgroup_free(mem);
> +}
> +
> static void mem_cgroup_get(struct mem_cgroup *mem)
> {
> atomic_inc(&mem->refcnt);
> @@ -4068,7 +4101,7 @@ static void __mem_cgroup_put(struct mem_
> {
> if (atomic_sub_and_test(count, &mem->refcnt)) {
> struct mem_cgroup *parent = parent_mem_cgroup(mem);
> - __mem_cgroup_free(mem);
> + mem_cgroup_free(mem);
> if (parent)
> mem_cgroup_put(parent);
> }
> @@ -4184,13 +4217,22 @@ mem_cgroup_create(struct cgroup_subsys *
> atomic_set(&mem->refcnt, 1);
> mem->move_charge_at_immigrate = 0;
> mutex_init(&mem->thresholds_lock);
> + atomic_inc(&mem_cgroup_num);
> return &mem->css;
> free_out:
> - __mem_cgroup_free(mem);
> + mem_cgroup_free(mem);
> root_mem_cgroup = NULL;
> return ERR_PTR(error);
> }
>
> +static void
> +mem_cgroup_id_attached(struct cgroup_subsys *ss, struct cgroup *cont)
> +{
> + struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> +
> + register_memcg_id(mem);
> +}
> +
> static int mem_cgroup_pre_destroy(struct cgroup_subsys *ss,
> struct cgroup *cont)
> {
> @@ -4714,6 +4756,7 @@ struct cgroup_subsys mem_cgroup_subsys =
> .can_attach = mem_cgroup_can_attach,
> .cancel_attach = mem_cgroup_cancel_attach,
> .attach = mem_cgroup_move_task,
> + .id_attached = mem_cgroup_id_attached,
> .early_init = 0,
> .use_id = 1,
> };
> Index: mmotm-0811/init/Kconfig
> ===================================================================
> --- mmotm-0811.orig/init/Kconfig
> +++ mmotm-0811/init/Kconfig
> @@ -594,6 +594,16 @@ config CGROUP_MEM_RES_CTLR_SWAP
> Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
> size is 4096bytes, 512k per 1Gbytes of swap.
>
> +config MEM_CGROUP_MAX_GROUPS
> + int "Maximum number of memory cgroups on a system"
> + range 1 65535
> + default 8192 if 64BIT
> + default 2048 if 32BIT
> + help
> + Memory cgroup has limitation of the number of groups created.
> + Please select your favorite value. The more you allow, the more
> + memory(a pointer per group) will be consumed.
> +
> menuconfig CGROUP_SCHED
> bool "Group CPU scheduler"
> depends on EXPERIMENTAL && CGROUPS

2010-08-24 07:46:25

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] memcg: towards I/O aware memcg v5

* KAMEZAWA Hiroyuki <[email protected]> [2010-08-20 18:55:52]:

> This is v5.
>
> Sorry for delaying...but I had time for resetting myself and..several
> changes are added. I think this version is simpler than v4.
>
> Major changes from v4 is
> a) added kernel/cgroup.c hooks again. (for b)
> b) make RCU aware. previous version seems dangerous in an extreme case.
>
> Then, codes are updated. Most of changes are related to RCU.
>
> Patch brief view:
> 1. add hooks to kernel/cgroup.c for ID management.
> 2. use ID-array in memcg.
> 3. record ID to page_cgroup rather than pointer.
> 4. make update_file_mapped to be RCU aware routine instead of spinlock.
> 5. make update_file_mapped as general-purpose function.
>

Thanks for being persistent, will review the patches with comments in
the relevant patches.

--
Three Cheers,
Balbir

2010-08-24 07:47:11

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 2/5] memcg: use array and ID for quick look up

On Tue, 24 Aug 2010 00:44:59 -0700
Greg Thelen <[email protected]> wrote:

> KAMEZAWA Hiroyuki <[email protected]> writes:
>
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > Now, memory cgroup has an ID per cgroup and make use of it at
> > - hierarchy walk,
> > - swap recording.
> >
> > This patch is for making more use of it. The final purpose is
> > to replace page_cgroup->mem_cgroup's pointer to an unsigned short.
> >
> > This patch caches a pointer of memcg in an array. By this, we
> > don't have to call css_lookup() which requires radix-hash walk.
> > This saves some amount of memory footprint at lookup memcg via id.
> >
> > Changelog: 20100811
> > - adjusted onto mmotm-2010-08-11
> > - fixed RCU related parts.
> > - use attach_id() callback.
> >
> > Changelog: 20100804
> > - fixed description in init/Kconfig
> >
> > Changelog: 20100730
> > - fixed rcu_read_unlock() placement.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > init/Kconfig | 10 ++++++
> > mm/memcontrol.c | 83 ++++++++++++++++++++++++++++++++++++++++++--------------
> > 2 files changed, 73 insertions(+), 20 deletions(-)
> >
> > Index: mmotm-0811/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-0811.orig/mm/memcontrol.c
> > +++ mmotm-0811/mm/memcontrol.c
> > @@ -195,6 +195,7 @@ static void mem_cgroup_oom_notify(struct
> > */
> > struct mem_cgroup {
> > struct cgroup_subsys_state css;
> > + int valid; /* for checking validness under RCU access.*/
> > /*
> > * the counter to account for memory usage
> > */
> > @@ -294,6 +295,29 @@ static bool move_file(void)
> > &mc.to->move_charge_at_immigrate);
> > }
> >
> > +/* 0 is unused */
> > +static atomic_t mem_cgroup_num;
> > +#define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
> > +static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
> > +
> > +/* Must be called under rcu_read_lock */
> > +static struct mem_cgroup *id_to_memcg(unsigned short id)
> > +{
> > + struct mem_cgroup *ret;
> > + /* see mem_cgroup_free() */
> > + ret = rcu_dereference_check(mem_cgroups[id], rch_read_lock_held());
>
> I think this be rcu_read_lock_held() instead of rch_read_lock_held()?
>
yes, mayb overwritten by following patch.. thank you for finding.


Thanks,
-Kame

2010-08-24 07:48:12

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH] memcg: use ID in page_cgroup

KAMEZAWA Hiroyuki <[email protected]> writes:

> I have an idea to remove page_cgroup->page pointer, 8bytes reduction per page.
> But it will be after this work.
> ==
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Now, addresses of memory cgroup can be calculated by their ID without complex.
> This patch relplaces pc->mem_cgroup from a pointer to a unsigned short.
> On 64bit architecture, this offers us more 6bytes room per page_cgroup.
> Use 2bytes for blkio-cgroup's page tracking. More 4bytes will be used for
> some light-weight concurrent access.
>
> We may able to move this id onto flags field but ...go step by step.
>
> Changelog: 20100811
> - using new rcu APIs, as rcu_dereference_check() etc.
> Changelog: 20100804
> - added comments to page_cgroup.h
> Changelog: 20100730
> - fixed some garbage added by debug code in early stage
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> include/linux/page_cgroup.h | 6 ++++-
> mm/memcontrol.c | 52 +++++++++++++++++++++++++++-----------------
> mm/page_cgroup.c | 2 -
> 3 files changed, 38 insertions(+), 22 deletions(-)
>
> Index: mmotm-0811/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-0811.orig/include/linux/page_cgroup.h
> +++ mmotm-0811/include/linux/page_cgroup.h
> @@ -9,10 +9,14 @@
> * page_cgroup helps us identify information about the cgroup
> * All page cgroups are allocated at boot or memory hotplug event,
> * then the page cgroup for pfn always exists.
> + *
> + * TODO: It seems ID for cgroup can be packed into "flags". But there will
> + * be race between assigning ID <-> set/clear flags. Please be careful.
> */
> struct page_cgroup {
> unsigned long flags;
> - struct mem_cgroup *mem_cgroup;
> + unsigned short mem_cgroup; /* ID of assigned memory cgroup */
> + unsigned short blk_cgroup; /* Not Used..but will be. */
> struct page *page;
> struct list_head lru; /* per cgroup LRU list */
> };
> Index: mmotm-0811/mm/page_cgroup.c
> ===================================================================
> --- mmotm-0811.orig/mm/page_cgroup.c
> +++ mmotm-0811/mm/page_cgroup.c
> @@ -15,7 +15,7 @@ static void __meminit
> __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
> {
> pc->flags = 0;
> - pc->mem_cgroup = NULL;
> + pc->mem_cgroup = 0;
> pc->page = pfn_to_page(pfn);
> INIT_LIST_HEAD(&pc->lru);
> }
> Index: mmotm-0811/mm/memcontrol.c
> ===================================================================
> --- mmotm-0811.orig/mm/memcontrol.c
> +++ mmotm-0811/mm/memcontrol.c
> @@ -300,12 +300,13 @@ static atomic_t mem_cgroup_num;
> #define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
> static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
>
> -/* Must be called under rcu_read_lock */
> -static struct mem_cgroup *id_to_memcg(unsigned short id)
> +/* Must be called under rcu_read_lock, set safe==true if under lock */
> +static struct mem_cgroup *id_to_memcg(unsigned short id, bool safe)
> {
> struct mem_cgroup *ret;
> /* see mem_cgroup_free() */
> - ret = rcu_dereference_check(mem_cgroups[id], rch_read_lock_held());
> + ret = rcu_dereference_check(mem_cgroups[id],
> + rch_read_lock_held() || safe);
> if (likely(ret && ret->valid))
> return ret;
> return NULL;
> @@ -381,7 +382,12 @@ struct cgroup_subsys_state *mem_cgroup_c
> static struct mem_cgroup_per_zone *
> page_cgroup_zoneinfo(struct page_cgroup *pc)
> {
> - struct mem_cgroup *mem = pc->mem_cgroup;
> + /*
> + * The caller should guarantee this "pc" is under lock. In typical
> + * case, this function is called by lru function with zone->lru_lock.
> + * It is a safe access.
> + */
> + struct mem_cgroup *mem = id_to_memcg(pc->mem_cgroup, true);
> int nid = page_cgroup_nid(pc);
> int zid = page_cgroup_zid(pc);
>
> @@ -723,6 +729,11 @@ static inline bool mem_cgroup_is_root(st
> return (mem == root_mem_cgroup);
> }
>
> +static inline bool mem_cgroup_is_rootid(unsigned short id)
> +{
> + return (id == 1);
> +}
> +
> /*
> * Following LRU functions are allowed to be used without PCG_LOCK.
> * Operations are called by routine of global LRU independently from memcg.
> @@ -755,7 +766,7 @@ void mem_cgroup_del_lru_list(struct page
> */
> mz = page_cgroup_zoneinfo(pc);
> MEM_CGROUP_ZSTAT(mz, lru) -= 1;
> - if (mem_cgroup_is_root(pc->mem_cgroup))
> + if (mem_cgroup_is_rootid(pc->mem_cgroup))
> return;
> VM_BUG_ON(list_empty(&pc->lru));
> list_del_init(&pc->lru);
> @@ -782,7 +793,7 @@ void mem_cgroup_rotate_lru_list(struct p
> */
> smp_rmb();
> /* unused or root page is not rotated. */
> - if (!PageCgroupUsed(pc) || mem_cgroup_is_root(pc->mem_cgroup))
> + if (!PageCgroupUsed(pc) || mem_cgroup_is_rootid(pc->mem_cgroup))
> return;
> mz = page_cgroup_zoneinfo(pc);
> list_move(&pc->lru, &mz->lists[lru]);
> @@ -808,7 +819,7 @@ void mem_cgroup_add_lru_list(struct page
> mz = page_cgroup_zoneinfo(pc);
> MEM_CGROUP_ZSTAT(mz, lru) += 1;
> SetPageCgroupAcctLRU(pc);
> - if (mem_cgroup_is_root(pc->mem_cgroup))
> + if (mem_cgroup_is_rootid(pc->mem_cgroup))
> return;
> list_add(&pc->lru, &mz->lists[lru]);
> }
> @@ -1497,7 +1508,7 @@ void mem_cgroup_update_file_mapped(struc
> return;
>
> lock_page_cgroup(pc);
> - mem = pc->mem_cgroup;
> + mem = id_to_memcg(pc->mem_cgroup, true);
> if (!mem || !PageCgroupUsed(pc))
> goto done;
>
> @@ -1885,14 +1896,14 @@ struct mem_cgroup *try_get_mem_cgroup_fr
> pc = lookup_page_cgroup(page);
> lock_page_cgroup(pc);
> if (PageCgroupUsed(pc)) {
> - mem = pc->mem_cgroup;
> + mem = id_to_memcg(pc->mem_cgroup, true);
> if (mem && !css_tryget(&mem->css))
> mem = NULL;
> } else if (PageSwapCache(page)) {
> ent.val = page_private(page);
> id = lookup_swap_cgroup(ent);
> rcu_read_lock();
> - mem = id_to_memcg(id);
> + mem = id_to_memcg(id, false);
> if (mem && !css_tryget(&mem->css))
> mem = NULL;
> rcu_read_unlock();
> @@ -1921,7 +1932,7 @@ static void __mem_cgroup_commit_charge(s
> return;
> }
>
> - pc->mem_cgroup = mem;
> + pc->mem_cgroup = css_id(&mem->css);
> /*
> * We access a page_cgroup asynchronously without lock_page_cgroup().
> * Especially when a page_cgroup is taken from a page, pc->mem_cgroup
> @@ -1979,7 +1990,7 @@ static void __mem_cgroup_move_account(st
> VM_BUG_ON(PageLRU(pc->page));
> VM_BUG_ON(!PageCgroupLocked(pc));

Should this be VM_BUG_ON(!rcu_read_lock_held())? I suspect that
mem_cgroup_move_account() should grab rcu read lock (see my comment below).

> VM_BUG_ON(!PageCgroupUsed(pc));
> - VM_BUG_ON(pc->mem_cgroup != from);
> + VM_BUG_ON(id_to_memcg(pc->mem_cgroup, true) != from);
>
> if (PageCgroupFileMapped(pc)) {
> /* Update mapped_file data for mem_cgroup */
> @@ -1994,7 +2005,7 @@ static void __mem_cgroup_move_account(st
> mem_cgroup_cancel_charge(from);
>
> /* caller should have done css_get */
> - pc->mem_cgroup = to;
> + pc->mem_cgroup = css_id(&to->css);
> mem_cgroup_charge_statistics(to, pc, true);
> /*
> * We charges against "to" which may not have any tasks. Then, "to"
> @@ -2014,11 +2025,11 @@ static int mem_cgroup_move_account(struc
> {
> int ret = -EINVAL;
> lock_page_cgroup(pc);

Should this be rcu_read_lock() instead of lock_page_cgroup()?

> - if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
> + if (PageCgroupUsed(pc) && id_to_memcg(pc->mem_cgroup, true) == from) {
> __mem_cgroup_move_account(pc, from, to, uncharge);
> ret = 0;
> }
> - unlock_page_cgroup(pc);
> + rcu_read_unlock();
> /*
> * check events
> */
> @@ -2244,7 +2255,7 @@ __mem_cgroup_commit_charge_swapin(struct
>
> id = swap_cgroup_record(ent, 0);
> rcu_read_lock();
> - memcg = id_to_memcg(id);
> + memcg = id_to_memcg(id, false);
> if (memcg) {
> /*
> * This recorded memcg can be obsolete one. So, avoid
> @@ -2354,7 +2365,7 @@ __mem_cgroup_uncharge_common(struct page
>
> lock_page_cgroup(pc);
>
> - mem = pc->mem_cgroup;
> + mem = id_to_memcg(pc->mem_cgroup, true);
>
> if (!PageCgroupUsed(pc))
> goto unlock_out;
> @@ -2509,7 +2520,7 @@ void mem_cgroup_uncharge_swap(swp_entry_
>
> id = swap_cgroup_record(ent, 0);
> rcu_read_lock();
> - memcg = id_to_memcg(id);
> + memcg = id_to_memcg(id, false);
> if (memcg) {
> /*
> * We uncharge this because swap is freed.
> @@ -2600,7 +2611,7 @@ int mem_cgroup_prepare_migration(struct
> pc = lookup_page_cgroup(page);
> lock_page_cgroup(pc);
> if (PageCgroupUsed(pc)) {
> - mem = pc->mem_cgroup;
> + mem = id_to_memcg(pc->mem_cgroup, true);
> css_get(&mem->css);
> /*
> * At migrating an anonymous page, its mapcount goes down
> @@ -4442,7 +4453,8 @@ static int is_target_pte_for_mc(struct v
> * mem_cgroup_move_account() checks the pc is valid or not under
> * the lock.
> */
> - if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
> + if (PageCgroupUsed(pc) &&
> + id_to_memcg(pc->mem_cgroup, true) == mc.from) {
> ret = MC_TARGET_PAGE;
> if (target)
> target->page = page;

2010-08-24 07:56:18

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] memcg: use ID in page_cgroup

On Tue, 24 Aug 2010 00:47:50 -0700
Greg Thelen <[email protected]> wrote:

> KAMEZAWA Hiroyuki <[email protected]> writes:
>
> > I have an idea to remove page_cgroup->page pointer, 8bytes reduction per page.
> > But it will be after this work.
> > ==
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > Now, addresses of memory cgroup can be calculated by their ID without complex.
> > This patch relplaces pc->mem_cgroup from a pointer to a unsigned short.
> > On 64bit architecture, this offers us more 6bytes room per page_cgroup.
> > Use 2bytes for blkio-cgroup's page tracking. More 4bytes will be used for
> > some light-weight concurrent access.
> >
> > We may able to move this id onto flags field but ...go step by step.
> >
> > Changelog: 20100811
> > - using new rcu APIs, as rcu_dereference_check() etc.
> > Changelog: 20100804
> > - added comments to page_cgroup.h
> > Changelog: 20100730
> > - fixed some garbage added by debug code in early stage
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > include/linux/page_cgroup.h | 6 ++++-
> > mm/memcontrol.c | 52 +++++++++++++++++++++++++++-----------------
> > mm/page_cgroup.c | 2 -
> > 3 files changed, 38 insertions(+), 22 deletions(-)
> >
> > Index: mmotm-0811/include/linux/page_cgroup.h
> > ===================================================================
> > --- mmotm-0811.orig/include/linux/page_cgroup.h
> > +++ mmotm-0811/include/linux/page_cgroup.h
> > @@ -9,10 +9,14 @@
> > * page_cgroup helps us identify information about the cgroup
> > * All page cgroups are allocated at boot or memory hotplug event,
> > * then the page cgroup for pfn always exists.
> > + *
> > + * TODO: It seems ID for cgroup can be packed into "flags". But there will
> > + * be race between assigning ID <-> set/clear flags. Please be careful.
> > */
> > struct page_cgroup {
> > unsigned long flags;
> > - struct mem_cgroup *mem_cgroup;
> > + unsigned short mem_cgroup; /* ID of assigned memory cgroup */
> > + unsigned short blk_cgroup; /* Not Used..but will be. */
> > struct page *page;
> > struct list_head lru; /* per cgroup LRU list */
> > };
> > Index: mmotm-0811/mm/page_cgroup.c
> > ===================================================================
> > --- mmotm-0811.orig/mm/page_cgroup.c
> > +++ mmotm-0811/mm/page_cgroup.c
> > @@ -15,7 +15,7 @@ static void __meminit
> > __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
> > {
> > pc->flags = 0;
> > - pc->mem_cgroup = NULL;
> > + pc->mem_cgroup = 0;
> > pc->page = pfn_to_page(pfn);
> > INIT_LIST_HEAD(&pc->lru);
> > }
> > Index: mmotm-0811/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-0811.orig/mm/memcontrol.c
> > +++ mmotm-0811/mm/memcontrol.c
> > @@ -300,12 +300,13 @@ static atomic_t mem_cgroup_num;
> > #define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
> > static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
> >
> > -/* Must be called under rcu_read_lock */
> > -static struct mem_cgroup *id_to_memcg(unsigned short id)
> > +/* Must be called under rcu_read_lock, set safe==true if under lock */
> > +static struct mem_cgroup *id_to_memcg(unsigned short id, bool safe)
> > {
> > struct mem_cgroup *ret;
> > /* see mem_cgroup_free() */
> > - ret = rcu_dereference_check(mem_cgroups[id], rch_read_lock_held());
> > + ret = rcu_dereference_check(mem_cgroups[id],
> > + rch_read_lock_held() || safe);
> > if (likely(ret && ret->valid))
> > return ret;
> > return NULL;
> > @@ -381,7 +382,12 @@ struct cgroup_subsys_state *mem_cgroup_c
> > static struct mem_cgroup_per_zone *
> > page_cgroup_zoneinfo(struct page_cgroup *pc)
> > {
> > - struct mem_cgroup *mem = pc->mem_cgroup;
> > + /*
> > + * The caller should guarantee this "pc" is under lock. In typical
> > + * case, this function is called by lru function with zone->lru_lock.
> > + * It is a safe access.
> > + */
> > + struct mem_cgroup *mem = id_to_memcg(pc->mem_cgroup, true);
> > int nid = page_cgroup_nid(pc);
> > int zid = page_cgroup_zid(pc);
> >
> > @@ -723,6 +729,11 @@ static inline bool mem_cgroup_is_root(st
> > return (mem == root_mem_cgroup);
> > }
> >
> > +static inline bool mem_cgroup_is_rootid(unsigned short id)
> > +{
> > + return (id == 1);
> > +}
> > +
> > /*
> > * Following LRU functions are allowed to be used without PCG_LOCK.
> > * Operations are called by routine of global LRU independently from memcg.
> > @@ -755,7 +766,7 @@ void mem_cgroup_del_lru_list(struct page
> > */
> > mz = page_cgroup_zoneinfo(pc);
> > MEM_CGROUP_ZSTAT(mz, lru) -= 1;
> > - if (mem_cgroup_is_root(pc->mem_cgroup))
> > + if (mem_cgroup_is_rootid(pc->mem_cgroup))
> > return;
> > VM_BUG_ON(list_empty(&pc->lru));
> > list_del_init(&pc->lru);
> > @@ -782,7 +793,7 @@ void mem_cgroup_rotate_lru_list(struct p
> > */
> > smp_rmb();
> > /* unused or root page is not rotated. */
> > - if (!PageCgroupUsed(pc) || mem_cgroup_is_root(pc->mem_cgroup))
> > + if (!PageCgroupUsed(pc) || mem_cgroup_is_rootid(pc->mem_cgroup))
> > return;
> > mz = page_cgroup_zoneinfo(pc);
> > list_move(&pc->lru, &mz->lists[lru]);
> > @@ -808,7 +819,7 @@ void mem_cgroup_add_lru_list(struct page
> > mz = page_cgroup_zoneinfo(pc);
> > MEM_CGROUP_ZSTAT(mz, lru) += 1;
> > SetPageCgroupAcctLRU(pc);
> > - if (mem_cgroup_is_root(pc->mem_cgroup))
> > + if (mem_cgroup_is_rootid(pc->mem_cgroup))
> > return;
> > list_add(&pc->lru, &mz->lists[lru]);
> > }
> > @@ -1497,7 +1508,7 @@ void mem_cgroup_update_file_mapped(struc
> > return;
> >
> > lock_page_cgroup(pc);
> > - mem = pc->mem_cgroup;
> > + mem = id_to_memcg(pc->mem_cgroup, true);
> > if (!mem || !PageCgroupUsed(pc))
> > goto done;
> >
> > @@ -1885,14 +1896,14 @@ struct mem_cgroup *try_get_mem_cgroup_fr
> > pc = lookup_page_cgroup(page);
> > lock_page_cgroup(pc);
> > if (PageCgroupUsed(pc)) {
> > - mem = pc->mem_cgroup;
> > + mem = id_to_memcg(pc->mem_cgroup, true);
> > if (mem && !css_tryget(&mem->css))
> > mem = NULL;
> > } else if (PageSwapCache(page)) {
> > ent.val = page_private(page);
> > id = lookup_swap_cgroup(ent);
> > rcu_read_lock();
> > - mem = id_to_memcg(id);
> > + mem = id_to_memcg(id, false);
> > if (mem && !css_tryget(&mem->css))
> > mem = NULL;
> > rcu_read_unlock();
> > @@ -1921,7 +1932,7 @@ static void __mem_cgroup_commit_charge(s
> > return;
> > }
> >
> > - pc->mem_cgroup = mem;
> > + pc->mem_cgroup = css_id(&mem->css);
> > /*
> > * We access a page_cgroup asynchronously without lock_page_cgroup().
> > * Especially when a page_cgroup is taken from a page, pc->mem_cgroup
> > @@ -1979,7 +1990,7 @@ static void __mem_cgroup_move_account(st
> > VM_BUG_ON(PageLRU(pc->page));
> > VM_BUG_ON(!PageCgroupLocked(pc));
>
> Should this be VM_BUG_ON(!rcu_read_lock_held())? I suspect that
> mem_cgroup_move_account() should grab rcu read lock (see my comment below).
>

No. (see below)


> > VM_BUG_ON(!PageCgroupUsed(pc));
> > - VM_BUG_ON(pc->mem_cgroup != from);
> > + VM_BUG_ON(id_to_memcg(pc->mem_cgroup, true) != from);
> >
> > if (PageCgroupFileMapped(pc)) {
> > /* Update mapped_file data for mem_cgroup */
> > @@ -1994,7 +2005,7 @@ static void __mem_cgroup_move_account(st
> > mem_cgroup_cancel_charge(from);
> >
> > /* caller should have done css_get */
> > - pc->mem_cgroup = to;
> > + pc->mem_cgroup = css_id(&to->css);
> > mem_cgroup_charge_statistics(to, pc, true);
> > /*
> > * We charges against "to" which may not have any tasks. Then, "to"
> > @@ -2014,11 +2025,11 @@ static int mem_cgroup_move_account(struc
> > {
> > int ret = -EINVAL;
> > lock_page_cgroup(pc);
>
> Should this be rcu_read_lock() instead of lock_page_cgroup()?
>

No.
- lock_page_cgroup() is for keeping page_cgroup's status stable.
- rcu_read_lock() is for delaying to discard mem_cgroup object.

rcu_read_lock() is just for delaying to discard object, not for avoiding
racy updates. All _updates_ requires proper lock or speculative logic as
atomic_inc_not_zero() etc... Basically, RCU is for avoiding use-after-free.

Thanks,
-Kame

2010-08-24 08:04:45

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] memcg: towards I/O aware memcg v5

On Tue, 24 Aug 2010 13:16:17 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2010-08-20 18:55:52]:
>
> > This is v5.
> >
> > Sorry for delaying...but I had time for resetting myself and..several
> > changes are added. I think this version is simpler than v4.
> >
> > Major changes from v4 is
> > a) added kernel/cgroup.c hooks again. (for b)
> > b) make RCU aware. previous version seems dangerous in an extreme case.
> >
> > Then, codes are updated. Most of changes are related to RCU.
> >
> > Patch brief view:
> > 1. add hooks to kernel/cgroup.c for ID management.
> > 2. use ID-array in memcg.
> > 3. record ID to page_cgroup rather than pointer.
> > 4. make update_file_mapped to be RCU aware routine instead of spinlock.
> > 5. make update_file_mapped as general-purpose function.
> >
>
> Thanks for being persistent, will review the patches with comments in
> the relevant patches.
>

Thank you. I may be able to post v6, tomorrow.

-Kame

2010-08-24 08:35:59

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH] memcg: use ID in page_cgroup

KAMEZAWA Hiroyuki <[email protected]> writes:

> On Tue, 24 Aug 2010 00:47:50 -0700
> Greg Thelen <[email protected]> wrote:
>
>> KAMEZAWA Hiroyuki <[email protected]> writes:
>>
>> > I have an idea to remove page_cgroup->page pointer, 8bytes reduction per page.
>> > But it will be after this work.
>> > ==
>> > From: KAMEZAWA Hiroyuki <[email protected]>
>> >
>> > Now, addresses of memory cgroup can be calculated by their ID without complex.
>> > This patch relplaces pc->mem_cgroup from a pointer to a unsigned short.
>> > On 64bit architecture, this offers us more 6bytes room per page_cgroup.
>> > Use 2bytes for blkio-cgroup's page tracking. More 4bytes will be used for
>> > some light-weight concurrent access.
>> >
>> > We may able to move this id onto flags field but ...go step by step.
>> >
>> > Changelog: 20100811
>> > - using new rcu APIs, as rcu_dereference_check() etc.
>> > Changelog: 20100804
>> > - added comments to page_cgroup.h
>> > Changelog: 20100730
>> > - fixed some garbage added by debug code in early stage
>> >
>> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>> > ---
>> > include/linux/page_cgroup.h | 6 ++++-
>> > mm/memcontrol.c | 52 +++++++++++++++++++++++++++-----------------
>> > mm/page_cgroup.c | 2 -
>> > 3 files changed, 38 insertions(+), 22 deletions(-)
>> >
>> > Index: mmotm-0811/include/linux/page_cgroup.h
>> > ===================================================================
>> > --- mmotm-0811.orig/include/linux/page_cgroup.h
>> > +++ mmotm-0811/include/linux/page_cgroup.h
>> > @@ -9,10 +9,14 @@
>> > * page_cgroup helps us identify information about the cgroup
>> > * All page cgroups are allocated at boot or memory hotplug event,
>> > * then the page cgroup for pfn always exists.
>> > + *
>> > + * TODO: It seems ID for cgroup can be packed into "flags". But there will
>> > + * be race between assigning ID <-> set/clear flags. Please be careful.
>> > */
>> > struct page_cgroup {
>> > unsigned long flags;
>> > - struct mem_cgroup *mem_cgroup;
>> > + unsigned short mem_cgroup; /* ID of assigned memory cgroup */
>> > + unsigned short blk_cgroup; /* Not Used..but will be. */
>> > struct page *page;
>> > struct list_head lru; /* per cgroup LRU list */
>> > };
>> > Index: mmotm-0811/mm/page_cgroup.c
>> > ===================================================================
>> > --- mmotm-0811.orig/mm/page_cgroup.c
>> > +++ mmotm-0811/mm/page_cgroup.c
>> > @@ -15,7 +15,7 @@ static void __meminit
>> > __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
>> > {
>> > pc->flags = 0;
>> > - pc->mem_cgroup = NULL;
>> > + pc->mem_cgroup = 0;
>> > pc->page = pfn_to_page(pfn);
>> > INIT_LIST_HEAD(&pc->lru);
>> > }
>> > Index: mmotm-0811/mm/memcontrol.c
>> > ===================================================================
>> > --- mmotm-0811.orig/mm/memcontrol.c
>> > +++ mmotm-0811/mm/memcontrol.c
>> > @@ -300,12 +300,13 @@ static atomic_t mem_cgroup_num;
>> > #define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
>> > static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
>> >
>> > -/* Must be called under rcu_read_lock */
>> > -static struct mem_cgroup *id_to_memcg(unsigned short id)
>> > +/* Must be called under rcu_read_lock, set safe==true if under lock */
>> > +static struct mem_cgroup *id_to_memcg(unsigned short id, bool safe)
>> > {
>> > struct mem_cgroup *ret;
>> > /* see mem_cgroup_free() */
>> > - ret = rcu_dereference_check(mem_cgroups[id], rch_read_lock_held());
>> > + ret = rcu_dereference_check(mem_cgroups[id],
>> > + rch_read_lock_held() || safe);
>> > if (likely(ret && ret->valid))
>> > return ret;
>> > return NULL;
>> > @@ -381,7 +382,12 @@ struct cgroup_subsys_state *mem_cgroup_c
>> > static struct mem_cgroup_per_zone *
>> > page_cgroup_zoneinfo(struct page_cgroup *pc)
>> > {
>> > - struct mem_cgroup *mem = pc->mem_cgroup;
>> > + /*
>> > + * The caller should guarantee this "pc" is under lock. In typical
>> > + * case, this function is called by lru function with zone->lru_lock.
>> > + * It is a safe access.
>> > + */
>> > + struct mem_cgroup *mem = id_to_memcg(pc->mem_cgroup, true);
>> > int nid = page_cgroup_nid(pc);
>> > int zid = page_cgroup_zid(pc);
>> >
>> > @@ -723,6 +729,11 @@ static inline bool mem_cgroup_is_root(st
>> > return (mem == root_mem_cgroup);
>> > }
>> >
>> > +static inline bool mem_cgroup_is_rootid(unsigned short id)
>> > +{
>> > + return (id == 1);
>> > +}
>> > +
>> > /*
>> > * Following LRU functions are allowed to be used without PCG_LOCK.
>> > * Operations are called by routine of global LRU independently from memcg.
>> > @@ -755,7 +766,7 @@ void mem_cgroup_del_lru_list(struct page
>> > */
>> > mz = page_cgroup_zoneinfo(pc);
>> > MEM_CGROUP_ZSTAT(mz, lru) -= 1;
>> > - if (mem_cgroup_is_root(pc->mem_cgroup))
>> > + if (mem_cgroup_is_rootid(pc->mem_cgroup))
>> > return;
>> > VM_BUG_ON(list_empty(&pc->lru));
>> > list_del_init(&pc->lru);
>> > @@ -782,7 +793,7 @@ void mem_cgroup_rotate_lru_list(struct p
>> > */
>> > smp_rmb();
>> > /* unused or root page is not rotated. */
>> > - if (!PageCgroupUsed(pc) || mem_cgroup_is_root(pc->mem_cgroup))
>> > + if (!PageCgroupUsed(pc) || mem_cgroup_is_rootid(pc->mem_cgroup))
>> > return;
>> > mz = page_cgroup_zoneinfo(pc);
>> > list_move(&pc->lru, &mz->lists[lru]);
>> > @@ -808,7 +819,7 @@ void mem_cgroup_add_lru_list(struct page
>> > mz = page_cgroup_zoneinfo(pc);
>> > MEM_CGROUP_ZSTAT(mz, lru) += 1;
>> > SetPageCgroupAcctLRU(pc);
>> > - if (mem_cgroup_is_root(pc->mem_cgroup))
>> > + if (mem_cgroup_is_rootid(pc->mem_cgroup))
>> > return;
>> > list_add(&pc->lru, &mz->lists[lru]);
>> > }
>> > @@ -1497,7 +1508,7 @@ void mem_cgroup_update_file_mapped(struc
>> > return;
>> >
>> > lock_page_cgroup(pc);
>> > - mem = pc->mem_cgroup;
>> > + mem = id_to_memcg(pc->mem_cgroup, true);
>> > if (!mem || !PageCgroupUsed(pc))
>> > goto done;
>> >
>> > @@ -1885,14 +1896,14 @@ struct mem_cgroup *try_get_mem_cgroup_fr
>> > pc = lookup_page_cgroup(page);
>> > lock_page_cgroup(pc);
>> > if (PageCgroupUsed(pc)) {
>> > - mem = pc->mem_cgroup;
>> > + mem = id_to_memcg(pc->mem_cgroup, true);
>> > if (mem && !css_tryget(&mem->css))
>> > mem = NULL;
>> > } else if (PageSwapCache(page)) {
>> > ent.val = page_private(page);
>> > id = lookup_swap_cgroup(ent);
>> > rcu_read_lock();
>> > - mem = id_to_memcg(id);
>> > + mem = id_to_memcg(id, false);
>> > if (mem && !css_tryget(&mem->css))
>> > mem = NULL;
>> > rcu_read_unlock();
>> > @@ -1921,7 +1932,7 @@ static void __mem_cgroup_commit_charge(s
>> > return;
>> > }
>> >
>> > - pc->mem_cgroup = mem;
>> > + pc->mem_cgroup = css_id(&mem->css);
>> > /*
>> > * We access a page_cgroup asynchronously without lock_page_cgroup().
>> > * Especially when a page_cgroup is taken from a page, pc->mem_cgroup
>> > @@ -1979,7 +1990,7 @@ static void __mem_cgroup_move_account(st
>> > VM_BUG_ON(PageLRU(pc->page));
>> > VM_BUG_ON(!PageCgroupLocked(pc));
>>
>> Should this be VM_BUG_ON(!rcu_read_lock_held())? I suspect that
>> mem_cgroup_move_account() should grab rcu read lock (see my comment below).
>>
>
> No. (see below)
>
>
>> > VM_BUG_ON(!PageCgroupUsed(pc));
>> > - VM_BUG_ON(pc->mem_cgroup != from);
>> > + VM_BUG_ON(id_to_memcg(pc->mem_cgroup, true) != from);
>> >
>> > if (PageCgroupFileMapped(pc)) {
>> > /* Update mapped_file data for mem_cgroup */
>> > @@ -1994,7 +2005,7 @@ static void __mem_cgroup_move_account(st
>> > mem_cgroup_cancel_charge(from);
>> >
>> > /* caller should have done css_get */
>> > - pc->mem_cgroup = to;
>> > + pc->mem_cgroup = css_id(&to->css);
>> > mem_cgroup_charge_statistics(to, pc, true);
>> > /*
>> > * We charges against "to" which may not have any tasks. Then, "to"
>> > @@ -2014,11 +2025,11 @@ static int mem_cgroup_move_account(struc
>> > {
>> > int ret = -EINVAL;
>> > lock_page_cgroup(pc);
>>
>> Should this be rcu_read_lock() instead of lock_page_cgroup()?
>>
>
> No.
> - lock_page_cgroup() is for keeping page_cgroup's status stable.
> - rcu_read_lock() is for delaying to discard mem_cgroup object.
>
> rcu_read_lock() is just for delaying to discard object, not for avoiding
> racy updates. All _updates_ requires proper lock or speculative logic as
> atomic_inc_not_zero() etc... Basically, RCU is for avoiding use-after-free.


Thanks for the info. Referring to your original patch:
> @@ -2014,11 +2025,11 @@ static int mem_cgroup_move_account(struc
> {
> int ret = -EINVAL;
> lock_page_cgroup(pc);
> - if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
> + if (PageCgroupUsed(pc) && id_to_memcg(pc->mem_cgroup, true) == from) {
> __mem_cgroup_move_account(pc, from, to, uncharge);
> ret = 0;
> }
> - unlock_page_cgroup(pc);
> + rcu_read_unlock();
>

It seems like mem_cgroup_move_account() is not balanced. Why is
lock_page_cgroup(pc) used to lock but rcu_read_unlock() used to unlock?

2010-08-24 08:43:47

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] memcg: use ID in page_cgroup

On Tue, 24 Aug 2010 01:35:37 -0700
Greg Thelen <[email protected]> wrote:

> > rcu_read_lock() is just for delaying to discard object, not for avoiding
> > racy updates. All _updates_ requires proper lock or speculative logic as
> > atomic_inc_not_zero() etc... Basically, RCU is for avoiding use-after-free.
>
>
> Thanks for the info. Referring to your original patch:
> > @@ -2014,11 +2025,11 @@ static int mem_cgroup_move_account(struc
> > {
> > int ret = -EINVAL;
> > lock_page_cgroup(pc);
> > - if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
> > + if (PageCgroupUsed(pc) && id_to_memcg(pc->mem_cgroup, true) == from) {
> > __mem_cgroup_move_account(pc, from, to, uncharge);
> > ret = 0;
> > }
> > - unlock_page_cgroup(pc);
> > + rcu_read_unlock();
> >
>
> It seems like mem_cgroup_move_account() is not balanced. Why is
> lock_page_cgroup(pc) used to lock but rcu_read_unlock() used to unlock?
>

Nice catch. It's bug. It seems my eyes were corrupted..
Will be fixed in the next version. Sorry for bad code.

Thanks,
-Kame

2010-08-24 09:00:03

by Li Zefan

[permalink] [raw]
Subject: Re: [PATCH 1/5] cgroup: ID notification call back

KAMEZAWA Hiroyuki wrote:
> CC'ed to Paul Menage and Li Zefan.
> ==
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> When cgroup subsystem use ID (ss->use_id==1), each css's ID is assigned
> after successful call of ->create(). css_ID is tightly coupled with
> css struct itself but it is allocated by ->create() call, IOW,
> per-subsystem special allocations.
>
> To know css_id before creation, this patch adds id_attached() callback.
> after css_ID allocation. This will be used by memory cgroup's quick lookup
> routine.
>
> Maybe you can think of other implementations as
> - pass ID to ->create()
> or
> - add post_create()
> etc...
> But when considering dirtiness of codes, this straightforward patch seems
> good to me. If someone wants post_create(), this patch can be replaced.
>
> Changelog: 20100820
> - new approarch.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Acked-by: Li Zefan <[email protected]>

...
>
> Index: mmotm-0811/include/linux/cgroup.h
> ===================================================================
> --- mmotm-0811.orig/include/linux/cgroup.h
> +++ mmotm-0811/include/linux/cgroup.h
> @@ -475,6 +475,7 @@ struct cgroup_subsys {
> struct cgroup *cgrp);
> void (*post_clone)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> void (*bind)(struct cgroup_subsys *ss, struct cgroup *root);
> + void (*id_attached)(struct cgroup_subsys *ss, struct cgroup *cgrp);

Maybe pass the id number to id_attached() is better.

And actually the @ss argument is not necessary, because the memcg's
id_attached() handler of course knows it's dealing with the memory
cgroup subsystem.

So I suspect we can just remove all the @ss from all the callbacks..

2010-08-25 00:03:56

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 1/5] cgroup: ID notification call back

On Tue, 24 Aug 2010 17:04:52 +0800
Li Zefan <[email protected]> wrote:

> KAMEZAWA Hiroyuki wrote:
> > CC'ed to Paul Menage and Li Zefan.
> > ==
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > When cgroup subsystem use ID (ss->use_id==1), each css's ID is assigned
> > after successful call of ->create(). css_ID is tightly coupled with
> > css struct itself but it is allocated by ->create() call, IOW,
> > per-subsystem special allocations.
> >
> > To know css_id before creation, this patch adds id_attached() callback.
> > after css_ID allocation. This will be used by memory cgroup's quick lookup
> > routine.
> >
> > Maybe you can think of other implementations as
> > - pass ID to ->create()
> > or
> > - add post_create()
> > etc...
> > But when considering dirtiness of codes, this straightforward patch seems
> > good to me. If someone wants post_create(), this patch can be replaced.
> >
> > Changelog: 20100820
> > - new approarch.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> Acked-by: Li Zefan <[email protected]>
>
> ...
> >
> > Index: mmotm-0811/include/linux/cgroup.h
> > ===================================================================
> > --- mmotm-0811.orig/include/linux/cgroup.h
> > +++ mmotm-0811/include/linux/cgroup.h
> > @@ -475,6 +475,7 @@ struct cgroup_subsys {
> > struct cgroup *cgrp);
> > void (*post_clone)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> > void (*bind)(struct cgroup_subsys *ss, struct cgroup *root);
> > + void (*id_attached)(struct cgroup_subsys *ss, struct cgroup *cgrp);
>
> Maybe pass the id number to id_attached() is better.
>
ok.

> And actually the @ss argument is not necessary, because the memcg's
> id_attached() handler of course knows it's dealing with the memory
> cgroup subsystem.
>
> So I suspect we can just remove all the @ss from all the callbacks..
>

I agree. But could you write "remove ss" patch later ?
It seems an design change. I leave "ss" as this for now.

Thanks
-Kame

2010-08-25 00:09:32

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 1/5] cgroup: ID notification call back

On Fri, Aug 20, 2010 at 2:58 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> CC'ed to Paul Menage and Li Zefan.
> ==
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> When cgroup subsystem use ID (ss->use_id==1), each css's ID is assigned
> after successful call of ->create(). css_ID is tightly coupled with
> css struct itself but it is allocated by ->create() call, IOW,
> per-subsystem special allocations.
>
> To know css_id before creation, this patch adds id_attached() callback.
> after css_ID allocation. This will be used by memory cgroup's quick lookup
> routine.
>
> Maybe you can think of other implementations as
> ? ? ? ?- pass ID to ->create()
> ? ? ? ?or
> ? ? ? ?- add post_create()
> ? ? ? ?etc...
> But when considering dirtiness of codes, this straightforward patch seems
> good to me. If someone wants post_create(), this patch can be replaced.

I think I'd prefer the approach where any necessary css_ids are
allocated prior to calling any create methods (which gives the
additional advantage of removing the need to roll back partial
creation of a cgroup in the event of alloc_css_id() failing) and then
passed in to the create() method. The main cgroups framework would
still be responsible for actually filling the css->id field with the
allocated id.

Paul

2010-08-25 00:11:29

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 1/5] cgroup: ID notification call back

On Tue, Aug 24, 2010 at 2:04 AM, Li Zefan <[email protected]> wrote:
>
> Maybe pass the id number to id_attached() is better.
>
> And actually the @ss argument is not necessary, because the memcg's
> id_attached() handler of course knows it's dealing with the memory
> cgroup subsystem.
>
> So I suspect we can just remove all the @ss from all the callbacks..

Yes, I don't think any subsystem uses these. They dated originally
from when, as part of the initial cgroups framwork, I included a
library that could wrap a mostly-unmodified CKRM resource controller
into a cgroups subsystem, at which point the callback code didn't
necessarily know which subsystem it was being called for. But that's
obsolete now.

Paul

2010-08-25 00:22:51

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 1/5] cgroup: ID notification call back

On Tue, 24 Aug 2010 17:11:22 -0700
Paul Menage <[email protected]> wrote:

> On Tue, Aug 24, 2010 at 2:04 AM, Li Zefan <[email protected]> wrote:
> >
> > Maybe pass the id number to id_attached() is better.
> >
> > And actually the @ss argument is not necessary, because the memcg's
> > id_attached() handler of course knows it's dealing with the memory
> > cgroup subsystem.
> >
> > So I suspect we can just remove all the @ss from all the callbacks..
>
> Yes, I don't think any subsystem uses these. They dated originally
> from when, as part of the initial cgroups framwork, I included a
> library that could wrap a mostly-unmodified CKRM resource controller
> into a cgroups subsystem, at which point the callback code didn't
> necessarily know which subsystem it was being called for. But that's
> obsolete now.
>

Hmm, then, should I remove it in the next version, or leave it as it is
now and should be removed by another total clean up ?

(IOW, mixture of inconsistent interface is O.K. ?)

Thanks,
-Kame

2010-08-25 00:25:17

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 1/5] cgroup: ID notification call back

On Tue, 24 Aug 2010 17:09:25 -0700
Paul Menage <[email protected]> wrote:

> On Fri, Aug 20, 2010 at 2:58 AM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > CC'ed to Paul Menage and Li Zefan.
> > ==
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > When cgroup subsystem use ID (ss->use_id==1), each css's ID is assigned
> > after successful call of ->create(). css_ID is tightly coupled with
> > css struct itself but it is allocated by ->create() call, IOW,
> > per-subsystem special allocations.
> >
> > To know css_id before creation, this patch adds id_attached() callback.
> > after css_ID allocation. This will be used by memory cgroup's quick lookup
> > routine.
> >
> > Maybe you can think of other implementations as
> >        - pass ID to ->create()
> >        or
> >        - add post_create()
> >        etc...
> > But when considering dirtiness of codes, this straightforward patch seems
> > good to me. If someone wants post_create(), this patch can be replaced.
>
> I think I'd prefer the approach where any necessary css_ids are
> allocated prior to calling any create methods (which gives the
> additional advantage of removing the need to roll back partial
> creation of a cgroup in the event of alloc_css_id() failing) and then
> passed in to the create() method. The main cgroups framework would
> still be responsible for actually filling the css->id field with the
> allocated id.
>

Hmm, sure. I'll change the ->create() interface. O.K. ?

Thanks,
-Kame

2010-08-25 00:25:37

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 1/5] cgroup: ID notification call back

On Tue, Aug 24, 2010 at 5:17 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
>
> Hmm, then, should I remove it in the next version, or leave it as it is
> now and should be removed by another total clean up ?

I think these are two separate issues. A cleanup to remove the ss
parameters from the existing callback methods would be separate from
your proposed id callback, which I think is better done (as I
mentioned above) by passing the id to create()

Paul

2010-08-25 00:34:43

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 1/5] cgroup: ID notification call back

On Tue, Aug 24, 2010 at 5:20 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
>
> Hmm, sure. I'll change the ->create() interface. ?O.K. ?
>

Hmm. An alternative (possibly cleaner) would be:

1) add a css_size field in cgroup_subsys that contains the size of the
per-subsystem structure
2) change cgroups to allocate and populate the css *before* calling
create(), since it now knows the actual size
3) simplify all the subsystem create() methods since they no longer
have to worry about allocation or out-of-memory handling
4) also add a top_css field in cgroups that allows cpusets to use the
statically-allocated top_cpuset since it's initialized prior to memory
allocation being reliable

This avoids us having to pass in any new parameters to the create()
method in future since they can be populated in the CSS.

Paul

2010-08-25 00:42:49

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 1/5] cgroup: ID notification call back

On Tue, 24 Aug 2010 17:34:38 -0700
Paul Menage <[email protected]> wrote:

> On Tue, Aug 24, 2010 at 5:20 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> >
> > Hmm, sure. I'll change the ->create() interface.  O.K. ?
> >
>
> Hmm. An alternative (possibly cleaner) would be:
>
> 1) add a css_size field in cgroup_subsys that contains the size of the
> per-subsystem structure
> 2) change cgroups to allocate and populate the css *before* calling
> create(), since it now knows the actual size
> 3) simplify all the subsystem create() methods since they no longer
> have to worry about allocation or out-of-memory handling
> 4) also add a top_css field in cgroups that allows cpusets to use the
> statically-allocated top_cpuset since it's initialized prior to memory
> allocation being reliable
>
> This avoids us having to pass in any new parameters to the create()
> method in future since they can be populated in the CSS.
>

Ou...I'm sorry but I would like to use attach_id() for this time.
Forgive me, above seems a big change.
I'd like to write a series of patch to do above, later.
At least, to do a trial.

Thanks,
-Kame

2010-08-25 00:46:08

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 1/5] cgroup: ID notification call back

On Tue, Aug 24, 2010 at 5:37 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
>
> Ou...I'm sorry but I would like to use attach_id() for this time.
> Forgive me, above seems a big change.
> I'd like to write a series of patch to do above, later.
> At least, to do a trial.
>

Sure, for testing outside the tree. I don't think introducing new code
into mainline that's intentionally ugly and cutting corners is a good
idea.

Paul

2010-08-25 01:08:12

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 1/5] cgroup: ID notification call back

On Tue, 24 Aug 2010 17:46:03 -0700
Paul Menage <[email protected]> wrote:

> On Tue, Aug 24, 2010 at 5:37 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> >
> > Ou...I'm sorry but I would like to use attach_id() for this time.
> > Forgive me, above seems a big change.
> > I'd like to write a series of patch to do above, later.
> > At least, to do a trial.
> >
>
> Sure, for testing outside the tree. I don't think introducing new code
> into mainline that's intentionally ugly and cutting corners is a good
> idea.
>

Hmm. How this pseudo code looks like ? This passes "new id" via
cgroup->subsys[array] at creation. (Using union will be better, maybe).

---
include/linux/cgroup.h | 1 +
kernel/cgroup.c | 26 +++++++++++++++++++-------
2 files changed, 20 insertions(+), 7 deletions(-)

Index: mmotm-0811/include/linux/cgroup.h
===================================================================
--- mmotm-0811.orig/include/linux/cgroup.h
+++ mmotm-0811/include/linux/cgroup.h
@@ -508,6 +508,7 @@ struct cgroup_subsys {
struct cgroupfs_root *root;
struct list_head sibling;
/* used when use_id == true */
+ int max_id;
struct idr idr;
spinlock_t id_lock;

Index: mmotm-0811/kernel/cgroup.c
===================================================================
--- mmotm-0811.orig/kernel/cgroup.c
+++ mmotm-0811/kernel/cgroup.c
@@ -3335,19 +3335,31 @@ static long cgroup_create(struct cgroup
set_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);

for_each_subsys(root, ss) {
- struct cgroup_subsys_state *css = ss->create(ss, cgrp);
+ struct cgroup_subsys_state *css;

- if (IS_ERR(css)) {
- err = PTR_ERR(css);
- goto err_destroy;
- }
- init_cgroup_css(css, ss, cgrp);
if (ss->use_id) {
err = alloc_css_id(ss, parent, cgrp);
if (err)
goto err_destroy;
+ /*
+ * Here, created css_id is recorded into
+ * cgrp->subsys[ss->subsys_id]
+ * array and passed to subsystem.
+ */
+ } else
+ cgrp->subsys[ss->subsys_id] = NULL;
+
+ css = ss->create(ss, cgrp);
+
+ if (IS_ERR(css)) {
+ /* forget preallocated id */
+ if (cgrp->subsys[ss->subsys_id])
+ free_css_id_direct((struct css_id *)
+ cgrp->subsys[ss->subsys_id]);
+ err = PTR_ERR(css);
+ goto err_destroy;
}
- /* At error, ->destroy() callback has to free assigned ID. */
+ init_cgroup_css(css, ss, cgrp);
}

cgroup_lock_hierarchy(root);


2010-08-25 01:35:09

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 1/5] cgroup: ID notification call back

On Tue, Aug 24, 2010 at 6:03 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
>
> Hmm. How this pseudo code looks like ? This passes "new id" via
> cgroup->subsys[array] at creation. (Using union will be better, maybe).
>

That's rather ugly. I was thinking of something more like this. (Not
even compiled yet, and the only subsystem updated is cpuset).

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ed3e92e..063d9f2 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -458,8 +458,7 @@ void cgroup_release_and_wakeup_rmdir(struct
cgroup_subsys_state *css);
*/

struct cgroup_subsys {
- struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss,
- struct cgroup *cgrp);
+ int (*create)(struct cgroup_subsys *ss, struct cgroup *cgrp);
int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
@@ -513,6 +512,12 @@ struct cgroup_subsys {

/* should be defined only by modular subsystems */
struct module *module;
+
+ /* Total size of the subsystem's CSS object */
+ size_t css_size;
+
+ /* If non-NULL, the CSS to use for the root cgroup */
+ struct cgroup_subsys_state *root_css;
};

#define SUBSYS(_x) extern struct cgroup_subsys _x ## _subsys;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 192f88c..c589a41 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3307,6 +3307,7 @@ static long cgroup_create(struct cgroup *parent,
struct dentry *dentry,
mode_t mode)
{
struct cgroup *cgrp;
+ struct cgroup_subsys_state *new_css[CGROUP_SUBSYS_COUNT] = {};
struct cgroupfs_root *root = parent->root;
int err = 0;
struct cgroup_subsys *ss;
@@ -3325,6 +3326,16 @@ static long cgroup_create(struct cgroup
*parent, struct dentry *dentry,

mutex_lock(&cgroup_mutex);

+ for_each_subsys(root, ss) {
+ int id = ss->subsys_id;
+ new_css[id] = kzalloc(ss->css_size, GFP_KERNEL);
+ if (!new_css) {
+ /* Failed to allocate memory */
+ err = -ENOMEM;
+ goto err_destroy;
+ }
+ }
+
init_cgroup_housekeeping(cgrp);

cgrp->parent = parent;
@@ -3335,19 +3346,19 @@ static long cgroup_create(struct cgroup
*parent, struct dentry *dentry,
set_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);

for_each_subsys(root, ss) {
- struct cgroup_subsys_state *css = ss->create(ss, cgrp);
-
- if (IS_ERR(css)) {
- err = PTR_ERR(css);
- goto err_destroy;
- }
- init_cgroup_css(css, ss, cgrp);
+ int id = ss->subsys_id;
+ init_cgroup_css(new_css[id], ss, cgrp);
if (ss->use_id) {
err = alloc_css_id(ss, parent, cgrp);
if (err)
goto err_destroy;
}
- /* At error, ->destroy() callback has to free assigned ID. */
+ err = ss->create(ss, cgrp);
+ if (err) {
+ free_css_id(ss, css->id);
+ goto err_destroy;
+ }
+ new_css[id] = NULL;
}

cgroup_lock_hierarchy(root);
@@ -3380,7 +3391,10 @@ static long cgroup_create(struct cgroup
*parent, struct dentry *dentry,
err_destroy:

for_each_subsys(root, ss) {
- if (cgrp->subsys[ss->subsys_id])
+ int id = ss->subsys_id;
+ if (new_css[id])
+ kfree(new_css[id]);
+ else if (cgrp->subsys[id])
ss->destroy(ss, cgrp);
}

@@ -3607,11 +3621,16 @@ static void __init cgroup_init_subsys(struct
cgroup_subsys *ss)
/* Create the top cgroup state for this subsystem */
list_add(&ss->sibling, &rootnode.subsys_list);
ss->root = &rootnode;
- css = ss->create(ss, dummytop);
+ if (ss->root_css)
+ css = ss->root_css;
+ else
+ css = kzalloc(ss->css_size, GFP_KERNEL);
/* We don't handle early failures gracefully */
- BUG_ON(IS_ERR(css));
+ BUG_ON(!css);
init_cgroup_css(css, ss, dummytop);

+ BUG_ON(ss->create(ss, dummytop));
+
/* Update the init_css_set to contain a subsys
* pointer to this state - since the subsystem is
* newly registered, all tasks and hence the
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index b23c097..7720a79 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1871,24 +1871,12 @@ static void cpuset_post_clone(struct cgroup_subsys *ss,
* cont: control group that the new cpuset will be part of
*/

-static struct cgroup_subsys_state *cpuset_create(
- struct cgroup_subsys *ss,
- struct cgroup *cont)
+static int cpuset_create(struct cgroup_subsys *ss, struct cgroup *cont)
{
- struct cpuset *cs;
- struct cpuset *parent;
-
- if (!cont->parent) {
- return &top_cpuset.css;
- }
- parent = cgroup_cs(cont->parent);
- cs = kmalloc(sizeof(*cs), GFP_KERNEL);
- if (!cs)
- return ERR_PTR(-ENOMEM);
- if (!alloc_cpumask_var(&cs->cpus_allowed, GFP_KERNEL)) {
- kfree(cs);
- return ERR_PTR(-ENOMEM);
- }
+ struct cpuset *cs = cgroup_cs(cont);
+ struct cpuset *parent = cgroup_cs(cont->parent);
+ if (!alloc_cpumask_var(&cs->cpus_allowed, GFP_KERNEL))
+ return -ENOMEM;

cs->flags = 0;
if (is_spread_page(parent))
@@ -1903,7 +1891,7 @@ static struct cgroup_subsys_state *cpuset_create(

cs->parent = parent;
number_of_cpusets++;
- return &cs->css ;
+ return 0;
}

/*
@@ -1934,6 +1922,8 @@ struct cgroup_subsys cpuset_subsys = {
.post_clone = cpuset_post_clone,
.subsys_id = cpuset_subsys_id,
.early_init = 1,
+ .css_size = sizeof(struct cpuset),
+ .root_css = &top_cpuset.css;
};

/**

2010-08-25 01:47:44

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 1/5] cgroup: ID notification call back

On Tue, 24 Aug 2010 18:35:00 -0700
Paul Menage <[email protected]> wrote:

> On Tue, Aug 24, 2010 at 6:03 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> >
> > Hmm. How this pseudo code looks like ? This passes "new id" via
> > cgroup->subsys[array] at creation. (Using union will be better, maybe).
> >
>
> That's rather ugly. I was thinking of something more like this. (Not
> even compiled yet, and the only subsystem updated is cpuset).
>

Hmm, but placing css and subsystem's its own structure in different cache line
can increase cacheline/TLB miss, I think.

I wonder I should stop this patch series and do small thing.
I prefer to call alloc_css_id() by ->create() call by subsys's its own decistion
is much better and cleaner. (as my original design)

mem_cgroup_create()
{

cgroup_attach_css_id(ss, cgrp, &mem->css);
}

And then, there will be no difficulty.

Do we have to call alloc_css_id() in kernel/cgroup.c ?

Thanks,
-Kame

2010-08-25 01:52:26

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 1/5] cgroup: ID notification call back

On Tue, Aug 24, 2010 at 6:42 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
>
> Hmm, but placing css and subsystem's its own structure in different cache line
> can increase cacheline/TLB miss, I think.

My patch shouldn't affect the memory placement of any structures.
struct cgroup_subsys_state is still embedded in the per-subsystem
state.

>
> Do we have to call alloc_css_id() in kernel/cgroup.c ?

I guess not, if no-one's using it except for memcg. The general
approach of allocating the CSS in cgroup.c rather than in every
subsystem is something that I'd like to do separately, though.

Paul

2010-08-25 02:34:43

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 1/5] cgroup: ID notification call back

On Tue, 24 Aug 2010 18:52:20 -0700
Paul Menage <[email protected]> wrote:

> On Tue, Aug 24, 2010 at 6:42 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> >
> > Hmm, but placing css and subsystem's its own structure in different cache line
> > can increase cacheline/TLB miss, I think.
>
> My patch shouldn't affect the memory placement of any structures.
> struct cgroup_subsys_state is still embedded in the per-subsystem
> state.
>
> >
> > Do we have to call alloc_css_id() in kernel/cgroup.c ?
>
> I guess not, if no-one's using it except for memcg. The general
> approach of allocating the CSS in cgroup.c rather than in every
> subsystem is something that I'd like to do separately, though.
>

I'll use this for v6.
When you move css's allcation up to kernel/cgroup.c, this can be moved, too.
My regret is that I should argue CSS_ID is very tightly coupled with css itself
and should use this design from the 1st version.

-Kame

==

---
block/blk-cgroup.c | 6 ++++++
include/linux/cgroup.h | 16 +++++++++-------
kernel/cgroup.c | 35 +++++++++--------------------------
mm/memcontrol.c | 4 ++++
4 files changed, 28 insertions(+), 33 deletions(-)

Index: mmotm-0811/kernel/cgroup.c
===================================================================
--- mmotm-0811.orig/kernel/cgroup.c
+++ mmotm-0811/kernel/cgroup.c
@@ -770,9 +770,6 @@ static struct backing_dev_info cgroup_ba
.capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK,
};

-static int alloc_css_id(struct cgroup_subsys *ss,
- struct cgroup *parent, struct cgroup *child);
-
static struct inode *cgroup_new_inode(mode_t mode, struct super_block *sb)
{
struct inode *inode = new_inode(sb);
@@ -3257,7 +3254,8 @@ static void init_cgroup_css(struct cgrou
css->cgroup = cgrp;
atomic_set(&css->refcnt, 1);
css->flags = 0;
- css->id = NULL;
+ if (!ss->use_id)
+ css->id = NULL;
if (cgrp == dummytop)
set_bit(CSS_ROOT, &css->flags);
BUG_ON(cgrp->subsys[ss->subsys_id]);
@@ -3342,11 +3340,6 @@ static long cgroup_create(struct cgroup
goto err_destroy;
}
init_cgroup_css(css, ss, cgrp);
- if (ss->use_id) {
- err = alloc_css_id(ss, parent, cgrp);
- if (err)
- goto err_destroy;
- }
/* At error, ->destroy() callback has to free assigned ID. */
}

@@ -3709,17 +3702,6 @@ int __init_or_module cgroup_load_subsys(

/* our new subsystem will be attached to the dummy hierarchy. */
init_cgroup_css(css, ss, dummytop);
- /* init_idr must be after init_cgroup_css because it sets css->id. */
- if (ss->use_id) {
- int ret = cgroup_init_idr(ss, css);
- if (ret) {
- dummytop->subsys[ss->subsys_id] = NULL;
- ss->destroy(ss, dummytop);
- subsys[i] = NULL;
- mutex_unlock(&cgroup_mutex);
- return ret;
- }
- }

/*
* Now we need to entangle the css into the existing css_sets. unlike
@@ -3888,8 +3870,6 @@ int __init cgroup_init(void)
struct cgroup_subsys *ss = subsys[i];
if (!ss->early_init)
cgroup_init_subsys(ss);
- if (ss->use_id)
- cgroup_init_idr(ss, init_css_set.subsys[ss->subsys_id]);
}

/* Add init_css_set to the hash table */
@@ -4603,8 +4583,8 @@ err_out:

}

-static int __init_or_module cgroup_init_idr(struct cgroup_subsys *ss,
- struct cgroup_subsys_state *rootcss)
+static int cgroup_init_idr(struct cgroup_subsys *ss,
+ struct cgroup_subsys_state *rootcss)
{
struct css_id *newid;

@@ -4621,13 +4601,16 @@ static int __init_or_module cgroup_init_
return 0;
}

-static int alloc_css_id(struct cgroup_subsys *ss, struct cgroup *parent,
- struct cgroup *child)
+int alloc_css_id(struct cgroup_subsys *ss, struct cgroup *cgrp)
{
int subsys_id, i, depth = 0;
+ struct cgroup *parent = cgrp->parent;
struct cgroup_subsys_state *parent_css, *child_css;
struct css_id *child_id, *parent_id;

+ if (cgrp->parent == NULL)
+ return cgroup_init_idr(ss, cgrp->subsys[ss->subsys_id]);
+
subsys_id = ss->subsys_id;
parent_css = parent->subsys[subsys_id];
child_css = child->subsys[subsys_id];
Index: mmotm-0811/include/linux/cgroup.h
===================================================================
--- mmotm-0811.orig/include/linux/cgroup.h
+++ mmotm-0811/include/linux/cgroup.h
@@ -583,9 +583,11 @@ int cgroup_attach_task_current_cg(struct
/*
* CSS ID is ID for cgroup_subsys_state structs under subsys. This only works
* if cgroup_subsys.use_id == true. It can be used for looking up and scanning.
- * CSS ID is assigned at cgroup allocation (create) automatically
- * and removed when subsys calls free_css_id() function. This is because
- * the lifetime of cgroup_subsys_state is subsys's matter.
+ * CSS ID must be assigned by subsys itself at cgroup creation and deleted
+ * when subsys calls free_css_id() function. This is because the life time of
+ * of cgroup_subsys_state is subsys's matter.
+ *
+ * ID->css look up is available after cgroup's directory is populated.
*
* Looking up and scanning function should be called under rcu_read_lock().
* Taking cgroup_mutex()/hierarchy_mutex() is not necessary for following calls.
@@ -593,10 +595,10 @@ int cgroup_attach_task_current_cg(struct
* destroyed". The caller should check css and cgroup's status.
*/

-/*
- * Typically Called at ->destroy(), or somewhere the subsys frees
- * cgroup_subsys_state.
- */
+/* Should be called in ->create() by subsys itself */
+int alloc_css_id(struct cgroup_subsys *ss, struct cgroup *child);
+
+/* Typically Called at ->destroy(), or somewhere the subsys frees css */
void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css);

/* Find a cgroup_subsys_state which has given ID */
Index: mmotm-0811/mm/memcontrol.c
===================================================================
--- mmotm-0811.orig/mm/memcontrol.c
+++ mmotm-0811/mm/memcontrol.c
@@ -4141,6 +4141,10 @@ mem_cgroup_create(struct cgroup_subsys *
if (alloc_mem_cgroup_per_zone_info(mem, node))
goto free_out;

+ if (alloc_css_id(ss, cont))
+ goto free_out;
+ /* Here, css_id(&mem->css) works. but css_lookup(id)->mem doesn't */
+
/* root ? */
if (cont->parent == NULL) {
int cpu;
Index: mmotm-0811/block/blk-cgroup.c
===================================================================
--- mmotm-0811.orig/block/blk-cgroup.c
+++ mmotm-0811/block/blk-cgroup.c
@@ -958,6 +958,7 @@ blkiocg_create(struct cgroup_subsys *sub
{
struct blkio_cgroup *blkcg;
struct cgroup *parent = cgroup->parent;
+ int ret;

if (!parent) {
blkcg = &blkio_root_cgroup;
@@ -971,6 +972,11 @@ blkiocg_create(struct cgroup_subsys *sub
blkcg = kzalloc(sizeof(*blkcg), GFP_KERNEL);
if (!blkcg)
return ERR_PTR(-ENOMEM);
+ ret = alloc_css_id(subsys, cgroup);
+ if (ret) {
+ kfree(blkcg);
+ return ret;
+ }

blkcg->weight = BLKIO_WEIGHT_DEFAULT;
done: