LinuxLists.cc - [PATCH 0/5 -mm][memcg] towards I/O aware memory cgroup v4

2010-08-05 09:49:37

Subject: [PATCH 0/5 -mm][memcg] towards I/O aware memory cgroup v4

This is v4.

Major changes from v3 is
- dropped spinlock per page_cgroup (then, 4byte free space.)
- added more comments
- clean up.

This set has 2 purposes.
1. re-desgin struct page_cgroup and makes room for blocckio-cgroup ID.
2. implement quick updating method for memcg's file stat.

1. check influence of Mel's new writeback method.
I think we'll see OOM easier. IIUC, memory cgroup needs a thread like kswapd
to do background writeback or low-high watermark.
(By this, we can control priority of background writeout thread priority
by CFS. This is very good.)

2. implementing dirty_ratio.
Now, Greg Thelen is working on. One of biggest problems of previous trial was
update cost of status. I think this patch set can reduce it.

About reducing size of struct page_cgroup:
We have several choices. So, plz wait.
I don't think packing memcgid, blockio-id to pc->flags is a good choice.
(I'm afraid of races and new limitations added by that.)

One idea: free spaces in pc->flags can be used...but it's better to store some
very stable/staic value.
One idea is store pfn or section-ID or node-id there. Then, we can remove
pc->page pointer. Considering page_cgroup's usage, page->page_cgroup is
a usual operation but page_cgroup->page is not. (only used at memory recalim)
Then, we can implement a function page_cgroup_to_page() and remove
page_cgroup->page pointer. For this, free space in pc->flags can be used.

After MM-Summit, I have another event until Aug16, and will be busy for a while.
(And cannot read e-mail box on my office) If you want to contact me, please
e-mail kahi at mte biglobe ne jp.

Thanks,
-Kame

2010-08-05 10:02:16

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: [PATCH 1/4 -mm][memcg] quick ID lookup in memcg

From: KAMEZAWA Hiroyuki <[email protected]>

Now, memory cgroup has an ID per cgroup and make use of it at
- hierarchy walk,
- swap recording.

This patch is for making more use of it. The final purpose is
to replace page_cgroup->mem_cgroup's pointer to an unsigned short.

This patch caches a pointer of memcg in an array. By this, we
don't have to call css_lookup() which requires radix-hash walk.
This saves some amount of memory footprint at lookup memcg via id.

Changelog: 20100804
- fixed description in init/Kconfig

Changelog: 20100730
- fixed rcu_read_unlock() placement.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
init/Kconfig | 10 ++++++++++
mm/memcontrol.c | 48 ++++++++++++++++++++++++++++++++++--------------
2 files changed, 44 insertions(+), 14 deletions(-)

Index: mmotm-0727/mm/memcontrol.c
===================================================================
--- mmotm-0727.orig/mm/memcontrol.c
+++ mmotm-0727/mm/memcontrol.c
@@ -292,6 +292,30 @@ static bool move_file(void)
&mc.to->move_charge_at_immigrate);
}

+/* 0 is unused */
+static atomic_t mem_cgroup_num;
+#define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
+static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
+
+static struct mem_cgroup *id_to_memcg(unsigned short id)
+{
+ /*
+ * This array is set to NULL when mem_cgroup is freed.
+ * IOW, there are no more references && rcu_synchronized().
+ * This lookup-caching is safe.
+ */
+ if (unlikely(!mem_cgroups[id])) {
+ struct cgroup_subsys_state *css;
+
+ rcu_read_lock();
+ css = css_lookup(&mem_cgroup_subsys, id);
+ rcu_read_unlock();
+ if (!css)
+ return NULL;
+ mem_cgroups[id] = container_of(css, struct mem_cgroup, css);
+ }
+ return mem_cgroups[id];
+}
/*
* Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
* limit reclaim to prevent infinite loops, if they ever occur.
@@ -1824,18 +1848,7 @@ static void mem_cgroup_cancel_charge(str
* it's concern. (dropping refcnt from swap can be called against removed
* memcg.)
*/
-static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
-{
- struct cgroup_subsys_state *css;

- /* ID 0 is unused ID */
- if (!id)
- return NULL;
- css = css_lookup(&mem_cgroup_subsys, id);
- if (!css)
- return NULL;
- return container_of(css, struct mem_cgroup, css);
-}

struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
{
@@ -1856,7 +1869,7 @@ struct mem_cgroup *try_get_mem_cgroup_fr
ent.val = page_private(page);
id = lookup_swap_cgroup(ent);
rcu_read_lock();
- mem = mem_cgroup_lookup(id);
+ mem = id_to_memcg(id);
if (mem && !css_tryget(&mem->css))
mem = NULL;
rcu_read_unlock();
@@ -2208,7 +2221,7 @@ __mem_cgroup_commit_charge_swapin(struct

id = swap_cgroup_record(ent, 0);
rcu_read_lock();
- memcg = mem_cgroup_lookup(id);
+ memcg = id_to_memcg(id);
if (memcg) {
/*
* This recorded memcg can be obsolete one. So, avoid
@@ -2472,7 +2485,7 @@ void mem_cgroup_uncharge_swap(swp_entry_

id = swap_cgroup_record(ent, 0);
rcu_read_lock();
- memcg = mem_cgroup_lookup(id);
+ memcg = id_to_memcg(id);
if (memcg) {
/*
* We uncharge this because swap is freed.
@@ -3988,6 +4001,9 @@ static struct mem_cgroup *mem_cgroup_all
struct mem_cgroup *mem;
int size = sizeof(struct mem_cgroup);

+ if (atomic_read(&mem_cgroup_num) == NR_MEMCG_GROUPS)
+ return NULL;
+
/* Can be very big if MAX_NUMNODES is very big */
if (size < PAGE_SIZE)
mem = kmalloc(size, GFP_KERNEL);
@@ -4025,7 +4041,10 @@ static void __mem_cgroup_free(struct mem
int node;

mem_cgroup_remove_from_trees(mem);
+ /* No more lookup against this ID */
+ mem_cgroups[css_id(&mem->css)] = NULL;
free_css_id(&mem_cgroup_subsys, &mem->css);
+ atomic_dec(&mem_cgroup_num);

for_each_node_state(node, N_POSSIBLE)
free_mem_cgroup_per_zone_info(mem, node);
@@ -4162,6 +4181,7 @@ mem_cgroup_create(struct cgroup_subsys *
atomic_set(&mem->refcnt, 1);
mem->move_charge_at_immigrate = 0;
mutex_init(&mem->thresholds_lock);
+ atomic_inc(&mem_cgroup_num);
return &mem->css;
free_out:
__mem_cgroup_free(mem);
Index: mmotm-0727/init/Kconfig
===================================================================
--- mmotm-0727.orig/init/Kconfig
+++ mmotm-0727/init/Kconfig
@@ -594,6 +594,16 @@ config CGROUP_MEM_RES_CTLR_SWAP
Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
size is 4096bytes, 512k per 1Gbytes of swap.

+config MEM_CGROUP_MAX_GROUPS
+ int "Maximum number of memory cgroups on a system"
+ range 1 65535
+ default 8192 if 64BIT
+ default 2048 if 32BIT
+ help
+ Memory cgroup has limitation of the number of groups created.
+ Please select your favorite value. The more you allow, the more
+ memory(a pointer per group) will be consumed.
+
menuconfig CGROUP_SCHED
bool "Group CPU scheduler"
depends on EXPERIMENTAL && CGROUPS

2010-08-05 10:02:55

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: [PATCH 2/4 -mm][memcg] use id in page cgroup

From: KAMEZAWA Hiroyuki <[email protected]>

Now, addresses of memory cgroup can be calculated by their ID without complex.
This patch relplaces pc->mem_cgroup from a pointer to a unsigned short.
On 64bit architecture, this offers us more 6bytes room per page_cgroup.
Use 2bytes for blkio-cgroup's page tracking. More 4bytes will be used for
some light-weight concurrent access.

We may able to move this id onto flags field but ...go step by step.

Changelog: 20100804
- added comments to page_cgroup.h
Changelog: 20100730
- fixed some garbage added by debug code in early stage

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/page_cgroup.h | 6 +++++-
mm/memcontrol.c | 32 +++++++++++++++++++-------------
mm/page_cgroup.c | 2 +-
3 files changed, 25 insertions(+), 15 deletions(-)

Index: mmotm-0727/include/linux/page_cgroup.h
===================================================================
--- mmotm-0727.orig/include/linux/page_cgroup.h
+++ mmotm-0727/include/linux/page_cgroup.h
@@ -9,10 +9,14 @@
* page_cgroup helps us identify information about the cgroup
* All page cgroups are allocated at boot or memory hotplug event,
* then the page cgroup for pfn always exists.
+ *
+ * TODO: It seems ID for cgroup can be packed into "flags". But there will
+ * be race between assigning ID <-> set/clear flags. Please be careful.
*/
struct page_cgroup {
unsigned long flags;
- struct mem_cgroup *mem_cgroup;
+ unsigned short mem_cgroup; /* ID of assigned memory cgroup */
+ unsigned short blk_cgroup; /* Not Used..but will be. */
struct page *page;
struct list_head lru; /* per cgroup LRU list */
};
Index: mmotm-0727/mm/page_cgroup.c
===================================================================
--- mmotm-0727.orig/mm/page_cgroup.c
+++ mmotm-0727/mm/page_cgroup.c
@@ -15,7 +15,7 @@ static void __meminit
__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
{
pc->flags = 0;
- pc->mem_cgroup = NULL;
+ pc->mem_cgroup = 0;
pc->page = pfn_to_page(pfn);
INIT_LIST_HEAD(&pc->lru);
}
Index: mmotm-0727/mm/memcontrol.c
===================================================================
--- mmotm-0727.orig/mm/memcontrol.c
+++ mmotm-0727/mm/memcontrol.c
@@ -379,7 +379,7 @@ struct cgroup_subsys_state *mem_cgroup_c
static struct mem_cgroup_per_zone *
page_cgroup_zoneinfo(struct page_cgroup *pc)
{
- struct mem_cgroup *mem = pc->mem_cgroup;
+ struct mem_cgroup *mem = id_to_memcg(pc->mem_cgroup);
int nid = page_cgroup_nid(pc);
int zid = page_cgroup_zid(pc);

@@ -721,6 +721,11 @@ static inline bool mem_cgroup_is_root(st
return (mem == root_mem_cgroup);
}

+static inline bool mem_cgroup_is_rootid(unsigned short id)
+{
+ return (id == 1);
+}
+
/*
* Following LRU functions are allowed to be used without PCG_LOCK.
* Operations are called by routine of global LRU independently from memcg.
@@ -753,7 +758,7 @@ void mem_cgroup_del_lru_list(struct page
*/
mz = page_cgroup_zoneinfo(pc);
MEM_CGROUP_ZSTAT(mz, lru) -= 1;
- if (mem_cgroup_is_root(pc->mem_cgroup))
+ if (mem_cgroup_is_rootid(pc->mem_cgroup))
return;
VM_BUG_ON(list_empty(&pc->lru));
list_del_init(&pc->lru);
@@ -780,7 +785,7 @@ void mem_cgroup_rotate_lru_list(struct p
*/
smp_rmb();
/* unused or root page is not rotated. */
- if (!PageCgroupUsed(pc) || mem_cgroup_is_root(pc->mem_cgroup))
+ if (!PageCgroupUsed(pc) || mem_cgroup_is_rootid(pc->mem_cgroup))
return;
mz = page_cgroup_zoneinfo(pc);
list_move(&pc->lru, &mz->lists[lru]);
@@ -806,7 +811,7 @@ void mem_cgroup_add_lru_list(struct page
mz = page_cgroup_zoneinfo(pc);
MEM_CGROUP_ZSTAT(mz, lru) += 1;
SetPageCgroupAcctLRU(pc);
- if (mem_cgroup_is_root(pc->mem_cgroup))
+ if (mem_cgroup_is_rootid(pc->mem_cgroup))
return;
list_add(&pc->lru, &mz->lists[lru]);
}
@@ -1474,7 +1479,7 @@ void mem_cgroup_update_file_mapped(struc
return;

lock_page_cgroup(pc);
- mem = pc->mem_cgroup;
+ mem = id_to_memcg(pc->mem_cgroup);
if (!mem || !PageCgroupUsed(pc))
goto done;

@@ -1862,7 +1867,7 @@ struct mem_cgroup *try_get_mem_cgroup_fr
pc = lookup_page_cgroup(page);
lock_page_cgroup(pc);
if (PageCgroupUsed(pc)) {
- mem = pc->mem_cgroup;
+ mem = id_to_memcg(pc->mem_cgroup);
if (mem && !css_tryget(&mem->css))
mem = NULL;
} else if (PageSwapCache(page)) {
@@ -1898,7 +1903,7 @@ static void __mem_cgroup_commit_charge(s
return;
}

- pc->mem_cgroup = mem;
+ pc->mem_cgroup = css_id(&mem->css);
/*
* We access a page_cgroup asynchronously without lock_page_cgroup().
* Especially when a page_cgroup is taken from a page, pc->mem_cgroup
@@ -1956,7 +1961,7 @@ static void __mem_cgroup_move_account(st
VM_BUG_ON(PageLRU(pc->page));
VM_BUG_ON(!PageCgroupLocked(pc));
VM_BUG_ON(!PageCgroupUsed(pc));
- VM_BUG_ON(pc->mem_cgroup != from);
+ VM_BUG_ON(id_to_memcg(pc->mem_cgroup) != from);

if (PageCgroupFileMapped(pc)) {
/* Update mapped_file data for mem_cgroup */
@@ -1971,7 +1976,7 @@ static void __mem_cgroup_move_account(st
mem_cgroup_cancel_charge(from);

/* caller should have done css_get */
- pc->mem_cgroup = to;
+ pc->mem_cgroup = css_id(&to->css);
mem_cgroup_charge_statistics(to, pc, true);
/*
* We charges against "to" which may not have any tasks. Then, "to"
@@ -1991,7 +1996,7 @@ static int mem_cgroup_move_account(struc
{
int ret = -EINVAL;
lock_page_cgroup(pc);
- if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
+ if (PageCgroupUsed(pc) && id_to_memcg(pc->mem_cgroup) == from) {
__mem_cgroup_move_account(pc, from, to, uncharge);
ret = 0;
}
@@ -2330,7 +2335,7 @@ __mem_cgroup_uncharge_common(struct page

lock_page_cgroup(pc);

- mem = pc->mem_cgroup;
+ mem = id_to_memcg(pc->mem_cgroup);

if (!PageCgroupUsed(pc))
goto unlock_out;
@@ -2575,7 +2580,7 @@ int mem_cgroup_prepare_migration(struct
pc = lookup_page_cgroup(page);
lock_page_cgroup(pc);
if (PageCgroupUsed(pc)) {
- mem = pc->mem_cgroup;
+ mem = id_to_memcg(pc->mem_cgroup);
css_get(&mem->css);
/*
* At migrating an anonymous page, its mapcount goes down
@@ -4398,7 +4403,8 @@ static int is_target_pte_for_mc(struct v
* mem_cgroup_move_account() checks the pc is valid or not under
* the lock.
*/
- if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
+ if (PageCgroupUsed(pc) &&
+ id_to_memcg(pc->mem_cgroup) == mc.from) {
ret = MC_TARGET_PAGE;
if (target)
target->page = page;

2010-08-05 10:06:52

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: [PATCH 3/4 -mm][memcg] scalable file status update logic without race

From: KAMEZAWA Hiroyuki <[email protected]>

At accounting file events per memory cgroup, we need to find memory cgroup
via page_cgroup->mem_cgroup. Now, we use lock_page_cgroup().

But, considering the context which page-cgroup for files are accessed,
we can use alternative light-weight mutual execusion in the most case.
At handling file-caches, the only race we have to take care of is "moving"
account, IOW, overwriting page_cgroup->mem_cgroup. Because file status
update is done while the page-cache is in stable state, we don't have to
take care of race with charge/uncharge.

Unlike charge/uncharge, "move" happens not so frequently. It happens only when
rmdir() and task-moving (with a special settings.)
This patch adds a race-checker for file-cache-status accounting v.s. account
moving. The new per-cpu-per-memcg counter MEM_CGROUP_ON_MOVE is added.
The routine for account move
1. Increment it before start moving
2. Call synchronize_rcu()
3. Decrement it after the end of moving.
By this, file-status-counting routine can check it needs to call
lock_page_cgroup(). In most case, I doesn't need to call it.

Changelog: 20100804
- added a comment for possible optimization hint. (for_each_possible...)
(but this is not needed to be quick..now)
Changelog: 20100730
- some cleanup.
Changelog: 20100729
- replaced __this_cpu_xxx() with this_cpu_xxx
(because we don't call spinlock)
- added VM_BUG_ON().

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 67 insertions(+), 12 deletions(-)

Index: mmotm-0727/mm/memcontrol.c
===================================================================
--- mmotm-0727.orig/mm/memcontrol.c
+++ mmotm-0727/mm/memcontrol.c
@@ -88,6 +88,7 @@ enum mem_cgroup_stat_index {
MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */
+ MEM_CGROUP_ON_MOVE, /* A check for locking move account/status */

MEM_CGROUP_STAT_NSTATS,
};
@@ -1074,7 +1075,50 @@ static unsigned int get_swappiness(struc
return swappiness;
}

-/* A routine for testing mem is not under move_account */
+static void mem_cgroup_start_move(struct mem_cgroup *mem)
+{
+ int cpu;
+ /* for fast checking in mem_cgroup_update_file_stat() etc..*/
+ spin_lock(&mc.lock);
+ /* TODO: Can we optimize this by for_each_online_cpu() ? */
+ for_each_possible_cpu(cpu)
+ per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
+ spin_unlock(&mc.lock);
+
+ synchronize_rcu();
+}
+
+static void mem_cgroup_end_move(struct mem_cgroup *mem)
+{
+ int cpu;
+
+ if (!mem)
+ return;
+ /* for fast checking in mem_cgroup_update_file_stat() etc..*/
+ spin_lock(&mc.lock);
+ for_each_possible_cpu(cpu)
+ per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) -= 1;
+ spin_unlock(&mc.lock);
+}
+
+/*
+ * mem_cgroup_is_moved -- checking a cgroup is mc.from target or not.
+ * used for avoiding race.
+ * mem_cgroup_under_move -- checking a cgroup is mc.from or mc.to or
+ * under hierarchy of them. used for waiting at
+ * memory pressure.
+ * Result of is_moved can be trusted until the end of rcu_read_unlock().
+ * The caller must do
+ * rcu_read_lock();
+ * result = mem_cgroup_is_moved();
+ * .....make use of result here....
+ * rcu_read_unlock();
+ */
+static bool mem_cgroup_is_moved(struct mem_cgroup *mem)
+{
+ VM_BUG_ON(!rcu_read_lock_held());
+ return this_cpu_read(mem->stat->count[MEM_CGROUP_ON_MOVE]) > 0;
+}

static bool mem_cgroup_under_move(struct mem_cgroup *mem)
{
@@ -1473,29 +1517,36 @@ void mem_cgroup_update_file_mapped(struc
{
struct mem_cgroup *mem;
struct page_cgroup *pc;
+ bool need_lock = false;

pc = lookup_page_cgroup(page);
if (unlikely(!pc))
return;
-
- lock_page_cgroup(pc);
+ rcu_read_lock();
mem = id_to_memcg(pc->mem_cgroup);
- if (!mem || !PageCgroupUsed(pc))
+ if (likely(mem)) {
+ if (mem_cgroup_is_moved(mem)) {
+ /* need to serialize with move_account */
+ lock_page_cgroup(pc);
+ need_lock = true;
+ mem = id_to_memcg(pc->mem_cgroup);
+ if (unlikely(!mem))
+ goto done;
+ }
+ }
+ if (unlikely(!PageCgroupUsed(pc)))
goto done;
-
- /*
- * Preemption is already disabled. We can use __this_cpu_xxx
- */
if (val > 0) {
- __this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+ this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
SetPageCgroupFileMapped(pc);
} else {
- __this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+ this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
ClearPageCgroupFileMapped(pc);
}
-
done:
- unlock_page_cgroup(pc);
+ if (need_lock)
+ unlock_page_cgroup(pc);
+ rcu_read_unlock();
}

/*
@@ -3034,6 +3085,7 @@ move_account:
lru_add_drain_all();
drain_all_stock_sync();
ret = 0;
+ mem_cgroup_start_move(mem);
for_each_node_state(node, N_HIGH_MEMORY) {
for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
enum lru_list l;
@@ -3047,6 +3099,7 @@ move_account:
if (ret)
break;
}
+ mem_cgroup_end_move(mem);
memcg_oom_recover(mem);
/* it seems parent cgroup doesn't have enough mem */
if (ret == -ENOMEM)
@@ -4513,6 +4566,7 @@ static void mem_cgroup_clear_mc(void)
mc.to = NULL;
mc.moving_task = NULL;
spin_unlock(&mc.lock);
+ mem_cgroup_end_move(from);
memcg_oom_recover(from);
memcg_oom_recover(to);
wake_up_all(&mc.waitq);
@@ -4543,6 +4597,7 @@ static int mem_cgroup_can_attach(struct
VM_BUG_ON(mc.moved_charge);
VM_BUG_ON(mc.moved_swap);
VM_BUG_ON(mc.moving_task);
+ mem_cgroup_start_move(from);
spin_lock(&mc.lock);
mc.from = from;
mc.to = mem;

2010-08-05 10:08:27

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: [PATCH 4/4 -mm][memcg] generic file-stat accounting interface for memcg

From: KAMEZAWA Hiroyuki <[email protected]>

Preparing for adding new status arounf file caches.(dirty, writeback,etc..)
Using a unified macro and more generic names.
All counters will have the same rule for updating.

Changelog:
- resrructured for clean up.
- added VM_BUG_ON().

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/memcontrol.h | 24 ++++++++++++++++++++++
include/linux/page_cgroup.h | 19 ++++++++++++------
mm/memcontrol.c | 46 ++++++++++++++++++--------------------------
3 files changed, 56 insertions(+), 33 deletions(-)

Index: mmotm-0727/include/linux/memcontrol.h
===================================================================
--- mmotm-0727.orig/include/linux/memcontrol.h
+++ mmotm-0727/include/linux/memcontrol.h
@@ -25,6 +25,30 @@ struct page_cgroup;
struct page;
struct mm_struct;

+/*
+ * Per-cpu Statistics for memory cgroup.
+ */
+enum mem_cgroup_stat_index {
+ /*
+ * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+ */
+ MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
+ MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
+ MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
+ MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
+ MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+ MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */
+ MEM_CGROUP_ON_MOVE, /* A check for locking move account/status */
+ /* When you add new member for file-stat, please update page_cgroup.h */
+ MEM_CGROUP_FSTAT_BASE,
+ MEM_CGROUP_FSTAT_FILE_MAPPED = MEM_CGROUP_FSTAT_BASE,
+ MEM_CGROUP_FSTAT_END,
+ MEM_CGROUP_STAT_NSTATS = MEM_CGROUP_FSTAT_END,
+};
+
+#define MEMCG_FSTAT_IDX(idx) ((idx) - MEM_CGROUP_FSTAT_BASE)
+#define NR_FILE_FLAGS_MEMCG ((MEM_CGROUP_FSTAT_END - MEM_CGROUP_FSTAT_BASE))
+
extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
struct list_head *dst,
unsigned long *scanned, int order,
Index: mmotm-0727/mm/memcontrol.c
===================================================================
--- mmotm-0727.orig/mm/memcontrol.c
+++ mmotm-0727/mm/memcontrol.c
@@ -74,24 +74,6 @@ static int really_do_swap_account __init
#define THRESHOLDS_EVENTS_THRESH (7) /* once in 128 */
#define SOFTLIMIT_EVENTS_THRESH (10) /* once in 1024 */

-/*
- * Statistics for memory cgroup.
- */
-enum mem_cgroup_stat_index {
- /*
- * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
- */
- MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
- MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
- MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
- MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
- MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
- MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
- MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */
- MEM_CGROUP_ON_MOVE, /* A check for locking move account/status */
-
- MEM_CGROUP_STAT_NSTATS,
-};

struct mem_cgroup_stat_cpu {
s64 count[MEM_CGROUP_STAT_NSTATS];
@@ -1513,7 +1495,8 @@ bool mem_cgroup_handle_oom(struct mem_cg
* Currently used to update mapped file statistics, but the routine can be
* generalized to update other statistics as well.
*/
-void mem_cgroup_update_file_mapped(struct page *page, int val)
+static void
+mem_cgroup_update_file_stat(struct page *page, unsigned int idx, int val)
{
struct mem_cgroup *mem;
struct page_cgroup *pc;
@@ -1537,11 +1520,11 @@ void mem_cgroup_update_file_mapped(struc
if (unlikely(!PageCgroupUsed(pc)))
goto done;
if (val > 0) {
- this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- SetPageCgroupFileMapped(pc);
+ this_cpu_inc(mem->stat->count[idx]);
+ set_bit(fflag_idx(MEMCG_FSTAT_IDX(idx)), &pc->flags);
} else {
- this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- ClearPageCgroupFileMapped(pc);
+ this_cpu_dec(mem->stat->count[idx]);
+ clear_bit(fflag_idx(MEMCG_FSTAT_IDX(idx)), &pc->flags);
}
done:
if (need_lock)
@@ -1549,6 +1532,12 @@ done:
rcu_read_unlock();
}

+void mem_cgroup_update_file_mapped(struct page *page, int val)
+{
+ return mem_cgroup_update_file_stat(page,
+ MEM_CGROUP_FSTAT_FILE_MAPPED, val);
+}
+
/*
* size of first charge trial. "32" comes from vmscan.c's magic value.
* TODO: maybe necessary to use big numbers in big irons.
@@ -2008,17 +1997,20 @@ static void __mem_cgroup_commit_charge(s
static void __mem_cgroup_move_account(struct page_cgroup *pc,
struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
{
+ int i;
VM_BUG_ON(from == to);
VM_BUG_ON(PageLRU(pc->page));
VM_BUG_ON(!PageCgroupLocked(pc));
VM_BUG_ON(!PageCgroupUsed(pc));
VM_BUG_ON(id_to_memcg(pc->mem_cgroup) != from);

- if (PageCgroupFileMapped(pc)) {
+ for (i = MEM_CGROUP_FSTAT_BASE; i < MEM_CGROUP_FSTAT_END; ++i) {
+ if (!test_bit(fflag_idx(MEMCG_FSTAT_IDX(i)), &pc->flags))
+ continue;
/* Update mapped_file data for mem_cgroup */
preempt_disable();
- __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+ __this_cpu_dec(from->stat->count[i]);
+ __this_cpu_inc(to->stat->count[i]);
preempt_enable();
}
mem_cgroup_charge_statistics(from, pc, false);
@@ -3448,7 +3440,7 @@ static int mem_cgroup_get_local_stat(str
s->stat[MCS_CACHE] += val * PAGE_SIZE;
val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
s->stat[MCS_RSS] += val * PAGE_SIZE;
- val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_MAPPED);
+ val = mem_cgroup_read_stat(mem, MEM_CGROUP_FSTAT_FILE_MAPPED);
s->stat[MCS_FILE_MAPPED] += val * PAGE_SIZE;
val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PGPGIN_COUNT);
s->stat[MCS_PGPGIN] += val;
Index: mmotm-0727/include/linux/page_cgroup.h
===================================================================
--- mmotm-0727.orig/include/linux/page_cgroup.h
+++ mmotm-0727/include/linux/page_cgroup.h
@@ -3,6 +3,7 @@

#ifdef CONFIG_CGROUP_MEM_RES_CTLR
#include <linux/bit_spinlock.h>
+#include <linux/memcontrol.h> /* for flags */
/*
* Page Cgroup can be considered as an extended mem_map.
* A page_cgroup page is associated with every page descriptor. The
@@ -43,10 +44,21 @@ enum {
PCG_CACHE, /* charged as cache */
PCG_USED, /* this object is in use. */
PCG_ACCT_LRU, /* page has been accounted for */
- PCG_FILE_MAPPED, /* page is accounted as "mapped" */
PCG_MIGRATION, /* under page migration */
+ PCG_FILE_FLAGS_MEMCG, /* see memcontrol.h */
+ PCG_FILE_FLAGS_MEMCG_END
+ = PCG_FILE_FLAGS_MEMCG + NR_FILE_FLAGS_MEMCG - 1,
};

+/*
+ * file-stat flags are defined regarding to memcg's stat information.
+ * Here, just defines a macro for indexing
+ */
+static inline int fflag_idx (int idx) {
+ VM_BUG_ON((idx)>=NR_FILE_FLAGS_MEMCG);
+ return ((idx) + PCG_FILE_FLAGS_MEMCG);
+}
+
#define TESTPCGFLAG(uname, lname) \
static inline int PageCgroup##uname(struct page_cgroup *pc) \
{ return test_bit(PCG_##lname, &pc->flags); }
@@ -79,11 +91,6 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
TESTPCGFLAG(AcctLRU, ACCT_LRU)
TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)

-
-SETPCGFLAG(FileMapped, FILE_MAPPED)
-CLEARPCGFLAG(FileMapped, FILE_MAPPED)
-TESTPCGFLAG(FileMapped, FILE_MAPPED)
-
SETPCGFLAG(Migration, MIGRATION)
CLEARPCGFLAG(Migration, MIGRATION)
TESTPCGFLAG(Migration, MIGRATION)

2010-08-06 04:13:30

by Greg Thelen

[permalink] [raw]

Subject: Re: [PATCH 1/4 -mm][memcg] quick ID lookup in memcg

KAMEZAWA Hiroyuki <[email protected]> writes:

> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Now, memory cgroup has an ID per cgroup and make use of it at
> - hierarchy walk,
> - swap recording.
>
> This patch is for making more use of it. The final purpose is
> to replace page_cgroup->mem_cgroup's pointer to an unsigned short.
>
> This patch caches a pointer of memcg in an array. By this, we
> don't have to call css_lookup() which requires radix-hash walk.
> This saves some amount of memory footprint at lookup memcg via id.
>
> Changelog: 20100804
> - fixed description in init/Kconfig
>
> Changelog: 20100730
> - fixed rcu_read_unlock() placement.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> init/Kconfig | 10 ++++++++++
> mm/memcontrol.c | 48 ++++++++++++++++++++++++++++++++++--------------
> 2 files changed, 44 insertions(+), 14 deletions(-)
>
> Index: mmotm-0727/mm/memcontrol.c
> ===================================================================
> --- mmotm-0727.orig/mm/memcontrol.c
> +++ mmotm-0727/mm/memcontrol.c
> @@ -292,6 +292,30 @@ static bool move_file(void)
> &mc.to->move_charge_at_immigrate);
> }
>
> +/* 0 is unused */
> +static atomic_t mem_cgroup_num;
> +#define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
> +static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
> +
> +static struct mem_cgroup *id_to_memcg(unsigned short id)
> +{
> + /*
> + * This array is set to NULL when mem_cgroup is freed.
> + * IOW, there are no more references && rcu_synchronized().
> + * This lookup-caching is safe.
> + */
> + if (unlikely(!mem_cgroups[id])) {
> + struct cgroup_subsys_state *css;
> +
> + rcu_read_lock();
> + css = css_lookup(&mem_cgroup_subsys, id);
> + rcu_read_unlock();
> + if (!css)
> + return NULL;
> + mem_cgroups[id] = container_of(css, struct mem_cgroup, css);
> + }
> + return mem_cgroups[id];
> +}

I am worried that id may be larger than CONFIG_MEM_CGROUP_MAX_GROUPS and
cause an illegal array index. I see that
mem_cgroup_uncharge_swapcache() uses css_id() to compute 'id'.
mem_cgroup_num ensures that there are never more than
CONFIG_MEM_CGROUP_MAX_GROUPS memcg active. But do we have guarantee
that the that all of the css_id of each active memcg are less than
NR_MEMCG_GROUPS?

> /*
> * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
> * limit reclaim to prevent infinite loops, if they ever occur.
> @@ -1824,18 +1848,7 @@ static void mem_cgroup_cancel_charge(str
> * it's concern. (dropping refcnt from swap can be called against removed
> * memcg.)
> */
> -static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
> -{
> - struct cgroup_subsys_state *css;
>
> - /* ID 0 is unused ID */
> - if (!id)
> - return NULL;
> - css = css_lookup(&mem_cgroup_subsys, id);
> - if (!css)
> - return NULL;
> - return container_of(css, struct mem_cgroup, css);
> -}
>
> struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
> {
> @@ -1856,7 +1869,7 @@ struct mem_cgroup *try_get_mem_cgroup_fr
> ent.val = page_private(page);
> id = lookup_swap_cgroup(ent);
> rcu_read_lock();
> - mem = mem_cgroup_lookup(id);
> + mem = id_to_memcg(id);
> if (mem && !css_tryget(&mem->css))
> mem = NULL;
> rcu_read_unlock();
> @@ -2208,7 +2221,7 @@ __mem_cgroup_commit_charge_swapin(struct
>
> id = swap_cgroup_record(ent, 0);
> rcu_read_lock();
> - memcg = mem_cgroup_lookup(id);
> + memcg = id_to_memcg(id);
> if (memcg) {
> /*
> * This recorded memcg can be obsolete one. So, avoid
> @@ -2472,7 +2485,7 @@ void mem_cgroup_uncharge_swap(swp_entry_
>
> id = swap_cgroup_record(ent, 0);
> rcu_read_lock();
> - memcg = mem_cgroup_lookup(id);
> + memcg = id_to_memcg(id);
> if (memcg) {
> /*
> * We uncharge this because swap is freed.
> @@ -3988,6 +4001,9 @@ static struct mem_cgroup *mem_cgroup_all
> struct mem_cgroup *mem;
> int size = sizeof(struct mem_cgroup);
>
> + if (atomic_read(&mem_cgroup_num) == NR_MEMCG_GROUPS)
> + return NULL;
> +

I think that multiple tasks to be simultaneously running
mem_cgroup_create(). Therefore more than NR_MEMCG_GROUPS memcg may be
created.

> /* Can be very big if MAX_NUMNODES is very big */
> if (size < PAGE_SIZE)
> mem = kmalloc(size, GFP_KERNEL);
> @@ -4025,7 +4041,10 @@ static void __mem_cgroup_free(struct mem
> int node;
>
> mem_cgroup_remove_from_trees(mem);
> + /* No more lookup against this ID */
> + mem_cgroups[css_id(&mem->css)] = NULL;
> free_css_id(&mem_cgroup_subsys, &mem->css);
> + atomic_dec(&mem_cgroup_num);
>
> for_each_node_state(node, N_POSSIBLE)
> free_mem_cgroup_per_zone_info(mem, node);
> @@ -4162,6 +4181,7 @@ mem_cgroup_create(struct cgroup_subsys *
> atomic_set(&mem->refcnt, 1);
> mem->move_charge_at_immigrate = 0;
> mutex_init(&mem->thresholds_lock);
> + atomic_inc(&mem_cgroup_num);
> return &mem->css;
> free_out:
> __mem_cgroup_free(mem);
> Index: mmotm-0727/init/Kconfig
> ===================================================================
> --- mmotm-0727.orig/init/Kconfig
> +++ mmotm-0727/init/Kconfig
> @@ -594,6 +594,16 @@ config CGROUP_MEM_RES_CTLR_SWAP
> Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
> size is 4096bytes, 512k per 1Gbytes of swap.
>
> +config MEM_CGROUP_MAX_GROUPS
> + int "Maximum number of memory cgroups on a system"
> + range 1 65535
> + default 8192 if 64BIT
> + default 2048 if 32BIT
> + help
> + Memory cgroup has limitation of the number of groups created.
> + Please select your favorite value. The more you allow, the more
> + memory(a pointer per group) will be consumed.
> +
> menuconfig CGROUP_SCHED
> bool "Group CPU scheduler"
> depends on EXPERIMENTAL && CGROUPS

2010-08-06 04:15:55

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: Re: [PATCH 1/4 -mm][memcg] quick ID lookup in memcg

On Thu, 05 Aug 2010 21:12:50 -0700
Greg Thelen <[email protected]> wrote:

> KAMEZAWA Hiroyuki <[email protected]> writes:
>
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > Now, memory cgroup has an ID per cgroup and make use of it at
> > - hierarchy walk,
> > - swap recording.
> >
> > This patch is for making more use of it. The final purpose is
> > to replace page_cgroup->mem_cgroup's pointer to an unsigned short.
> >
> > This patch caches a pointer of memcg in an array. By this, we
> > don't have to call css_lookup() which requires radix-hash walk.
> > This saves some amount of memory footprint at lookup memcg via id.
> >
> > Changelog: 20100804
> > - fixed description in init/Kconfig
> >
> > Changelog: 20100730
> > - fixed rcu_read_unlock() placement.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > init/Kconfig | 10 ++++++++++
> > mm/memcontrol.c | 48 ++++++++++++++++++++++++++++++++++--------------
> > 2 files changed, 44 insertions(+), 14 deletions(-)
> >
> > Index: mmotm-0727/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-0727.orig/mm/memcontrol.c
> > +++ mmotm-0727/mm/memcontrol.c
> > @@ -292,6 +292,30 @@ static bool move_file(void)
> > &mc.to->move_charge_at_immigrate);
> > }
> >
> > +/* 0 is unused */
> > +static atomic_t mem_cgroup_num;
> > +#define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
> > +static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
> > +
> > +static struct mem_cgroup *id_to_memcg(unsigned short id)
> > +{
> > + /*
> > + * This array is set to NULL when mem_cgroup is freed.
> > + * IOW, there are no more references && rcu_synchronized().
> > + * This lookup-caching is safe.
> > + */
> > + if (unlikely(!mem_cgroups[id])) {
> > + struct cgroup_subsys_state *css;
> > +
> > + rcu_read_lock();
> > + css = css_lookup(&mem_cgroup_subsys, id);
> > + rcu_read_unlock();
> > + if (!css)
> > + return NULL;
> > + mem_cgroups[id] = container_of(css, struct mem_cgroup, css);
> > + }
> > + return mem_cgroups[id];
> > +}
>
> I am worried that id may be larger than CONFIG_MEM_CGROUP_MAX_GROUPS and
> cause an illegal array index. I see that
> mem_cgroup_uncharge_swapcache() uses css_id() to compute 'id'.
> mem_cgroup_num ensures that there are never more than
> CONFIG_MEM_CGROUP_MAX_GROUPS memcg active. But do we have guarantee
> that the that all of the css_id of each active memcg are less than
> NR_MEMCG_GROUPS?
>
Yes. kernel/cgroup.c's ID assign routine use the smallest number, always.

> > /*
> > * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
> > * limit reclaim to prevent infinite loops, if they ever occur.
> > @@ -1824,18 +1848,7 @@ static void mem_cgroup_cancel_charge(str
> > * it's concern. (dropping refcnt from swap can be called against removed
> > * memcg.)
> > */
> > -static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
> > -{
> > - struct cgroup_subsys_state *css;
> >
> > - /* ID 0 is unused ID */
> > - if (!id)
> > - return NULL;
> > - css = css_lookup(&mem_cgroup_subsys, id);
> > - if (!css)
> > - return NULL;
> > - return container_of(css, struct mem_cgroup, css);
> > -}
> >
> > struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
> > {
> > @@ -1856,7 +1869,7 @@ struct mem_cgroup *try_get_mem_cgroup_fr
> > ent.val = page_private(page);
> > id = lookup_swap_cgroup(ent);
> > rcu_read_lock();
> > - mem = mem_cgroup_lookup(id);
> > + mem = id_to_memcg(id);
> > if (mem && !css_tryget(&mem->css))
> > mem = NULL;
> > rcu_read_unlock();
> > @@ -2208,7 +2221,7 @@ __mem_cgroup_commit_charge_swapin(struct
> >
> > id = swap_cgroup_record(ent, 0);
> > rcu_read_lock();
> > - memcg = mem_cgroup_lookup(id);
> > + memcg = id_to_memcg(id);
> > if (memcg) {
> > /*
> > * This recorded memcg can be obsolete one. So, avoid
> > @@ -2472,7 +2485,7 @@ void mem_cgroup_uncharge_swap(swp_entry_
> >
> > id = swap_cgroup_record(ent, 0);
> > rcu_read_lock();
> > - memcg = mem_cgroup_lookup(id);
> > + memcg = id_to_memcg(id);
> > if (memcg) {
> > /*
> > * We uncharge this because swap is freed.
> > @@ -3988,6 +4001,9 @@ static struct mem_cgroup *mem_cgroup_all
> > struct mem_cgroup *mem;
> > int size = sizeof(struct mem_cgroup);
> >
> > + if (atomic_read(&mem_cgroup_num) == NR_MEMCG_GROUPS)
> > + return NULL;
> > +
>
> I think that multiple tasks to be simultaneously running
> mem_cgroup_create(). Therefore more than NR_MEMCG_GROUPS memcg may be
> created.
>

No. cgroup_mutex() is held.

Thanks,
-Kame

2010-08-06 04:37:54

by Greg Thelen

[permalink] [raw]

Subject: Re: [PATCH 1/4 -mm][memcg] quick ID lookup in memcg

On Thu, Aug 5, 2010 at 9:10 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Thu, 05 Aug 2010 21:12:50 -0700
> Greg Thelen <[email protected]> wrote:
>
>> KAMEZAWA Hiroyuki <[email protected]> writes:
>>
>> > From: KAMEZAWA Hiroyuki <[email protected]>
>> >
>> > Now, memory cgroup has an ID per cgroup and make use of it at
>> > ?- hierarchy walk,
>> > ?- swap recording.
>> >
>> > This patch is for making more use of it. The final purpose is
>> > to replace page_cgroup->mem_cgroup's pointer to an unsigned short.
>> >
>> > This patch caches a pointer of memcg in an array. By this, we
>> > don't have to call css_lookup() which requires radix-hash walk.
>> > This saves some amount of memory footprint at lookup memcg via id.
>> >
>> > Changelog: 20100804
>> > ?- fixed description in init/Kconfig
>> >
>> > Changelog: 20100730
>> > ?- fixed rcu_read_unlock() placement.
>> >
>> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>> > ---
>> > ?init/Kconfig ? ?| ? 10 ++++++++++
>> > ?mm/memcontrol.c | ? 48 ++++++++++++++++++++++++++++++++++--------------
>> > ?2 files changed, 44 insertions(+), 14 deletions(-)
>> >
>> > Index: mmotm-0727/mm/memcontrol.c
>> > ===================================================================
>> > --- mmotm-0727.orig/mm/memcontrol.c
>> > +++ mmotm-0727/mm/memcontrol.c
>> > @@ -292,6 +292,30 @@ static bool move_file(void)
>> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? &mc.to->move_charge_at_immigrate);
>> > ?}
>> >
>> > +/* 0 is unused */
>> > +static atomic_t mem_cgroup_num;
>> > +#define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
>> > +static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
>> > +
>> > +static struct mem_cgroup *id_to_memcg(unsigned short id)
>> > +{
>> > + ? /*
>> > + ? ?* This array is set to NULL when mem_cgroup is freed.
>> > + ? ?* IOW, there are no more references && rcu_synchronized().
>> > + ? ?* This lookup-caching is safe.
>> > + ? ?*/
>> > + ? if (unlikely(!mem_cgroups[id])) {
>> > + ? ? ? ? ? struct cgroup_subsys_state *css;
>> > +
>> > + ? ? ? ? ? rcu_read_lock();
>> > + ? ? ? ? ? css = css_lookup(&mem_cgroup_subsys, id);
>> > + ? ? ? ? ? rcu_read_unlock();
>> > + ? ? ? ? ? if (!css)
>> > + ? ? ? ? ? ? ? ? ? return NULL;
>> > + ? ? ? ? ? mem_cgroups[id] = container_of(css, struct mem_cgroup, css);
>> > + ? }
>> > + ? return mem_cgroups[id];
>> > +}
>>
>> I am worried that id may be larger than CONFIG_MEM_CGROUP_MAX_GROUPS and
>> cause an illegal array index. ?I see that
>> mem_cgroup_uncharge_swapcache() uses css_id() to compute 'id'.
>> mem_cgroup_num ensures that there are never more than
>> CONFIG_MEM_CGROUP_MAX_GROUPS memcg active. ?But do we have guarantee
>> that the that all of the css_id of each active memcg are less than
>> NR_MEMCG_GROUPS?
>>
> Yes. kernel/cgroup.c's ID assign routine use the smallest number, always.
>
>
>
>> > ?/*
>> > ? * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
>> > ? * limit reclaim to prevent infinite loops, if they ever occur.
>> > @@ -1824,18 +1848,7 @@ static void mem_cgroup_cancel_charge(str
>> > ? * it's concern. (dropping refcnt from swap can be called against removed
>> > ? * memcg.)
>> > ? */
>> > -static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
>> > -{
>> > - ? struct cgroup_subsys_state *css;
>> >
>> > - ? /* ID 0 is unused ID */
>> > - ? if (!id)
>> > - ? ? ? ? ? return NULL;
>> > - ? css = css_lookup(&mem_cgroup_subsys, id);
>> > - ? if (!css)
>> > - ? ? ? ? ? return NULL;
>> > - ? return container_of(css, struct mem_cgroup, css);
>> > -}
>> >
>> > ?struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
>> > ?{
>> > @@ -1856,7 +1869,7 @@ struct mem_cgroup *try_get_mem_cgroup_fr
>> > ? ? ? ? ? ? ent.val = page_private(page);
>> > ? ? ? ? ? ? id = lookup_swap_cgroup(ent);
>> > ? ? ? ? ? ? rcu_read_lock();
>> > - ? ? ? ? ? mem = mem_cgroup_lookup(id);
>> > + ? ? ? ? ? mem = id_to_memcg(id);
>> > ? ? ? ? ? ? if (mem && !css_tryget(&mem->css))
>> > ? ? ? ? ? ? ? ? ? ? mem = NULL;
>> > ? ? ? ? ? ? rcu_read_unlock();
>> > @@ -2208,7 +2221,7 @@ __mem_cgroup_commit_charge_swapin(struct
>> >
>> > ? ? ? ? ? ? id = swap_cgroup_record(ent, 0);
>> > ? ? ? ? ? ? rcu_read_lock();
>> > - ? ? ? ? ? memcg = mem_cgroup_lookup(id);
>> > + ? ? ? ? ? memcg = id_to_memcg(id);
>> > ? ? ? ? ? ? if (memcg) {
>> > ? ? ? ? ? ? ? ? ? ? /*
>> > ? ? ? ? ? ? ? ? ? ? ?* This recorded memcg can be obsolete one. So, avoid
>> > @@ -2472,7 +2485,7 @@ void mem_cgroup_uncharge_swap(swp_entry_
>> >
>> > ? ? id = swap_cgroup_record(ent, 0);
>> > ? ? rcu_read_lock();
>> > - ? memcg = mem_cgroup_lookup(id);
>> > + ? memcg = id_to_memcg(id);
>> > ? ? if (memcg) {
>> > ? ? ? ? ? ? /*
>> > ? ? ? ? ? ? ?* We uncharge this because swap is freed.
>> > @@ -3988,6 +4001,9 @@ static struct mem_cgroup *mem_cgroup_all
>> > ? ? struct mem_cgroup *mem;
>> > ? ? int size = sizeof(struct mem_cgroup);
>> >
>> > + ? if (atomic_read(&mem_cgroup_num) == NR_MEMCG_GROUPS)
>> > + ? ? ? ? ? return NULL;
>> > +
>>
>> I think that multiple tasks to be simultaneously running
>> mem_cgroup_create(). ?Therefore more than NR_MEMCG_GROUPS memcg may be
>> created.
>>
>
> No. cgroup_mutex() is held.
>
> Thanks,
> -Kame
>
>

I see that now. Thank you clarification. I am doing some testing on
the patches now.