2008-11-05 08:17:25

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 0/6] memcg updates (05/Nov)

Weekly (RFC) update for memcg.

This set includes

1. change force_empty to do move account rather than forget all
2. swap cache handling
3. mem+swap controller kconfig
4. swap_cgroup for remembering swap account information
5. mem+swap controller core
6. synchronize memcg's LRU and global LRU.

"1" is already sent, "6" is a newcomer.
I'd like to push out "2" or "2-5" in the next week (if no bugs.)

After 6, the next candidates are
- dirty_ratio handler
- account move at task move.

Some more explanation about the purpose of "6" (see details in the patch itself).
Currently, one of the most complicated pieces of logic in memcg is LRU handling.
Because which lru_head a page sits on depends on the page_cgroup->mem_cgroup
pointer, we have to take a lock as follows even under zone->lru_lock.
==
pc = lookup_page_cgroup(page);
if (!trylock_page_cgroup(pc))
return -EBUSY;

if (PageCgroupUsed(pc)) {
struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
....some operation on LRU.
spin_unlock_irqrestore(&mz->lru_lock, flags);
}
unlock_page_cgroup(pc);
==
Sigh..

After "6", page_cgroup's LRU management can be done independently to some extent.
== as
(zone->lru_lock is held here)
pc = lookup_page_cgroup(page);
list operation on pc.
(unlock zone->lru_lock)
==
This should be good for maintenance, and as a bonus, we can make use of
isolate_lru_page() when doing racy operations:

isolate_lru_page(page);
pc = lookup_page_cgroup(page);
do some jobs.
putback_lru_page(page);

Maybe this will help to implement "account move at task move".

Thanks,
-Kame




2008-11-05 08:19:21

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 1/6] memcg: move all accounts to parent at rmdir()

This is no different from the patch I sent earlier.
==
This patch provides a function to move account information of a page between
mem_cgroups and rewrite force_empty to make use of this.

This moving of page_cgroup is done with
- the lru_lock of the source/destination mem_cgroup held.
- lock_page_cgroup() held.

Then, a routine which touches pc->mem_cgroup without lock_page_cgroup() should
confirm that pc->mem_cgroup is still valid. Typical code looks like the following.

(while page is not under lock_page())
mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
if (pc->mem_cgroup == mem)
...../* some list handling */
spin_unlock_irqrestore(&mz->lru_lock, flags);

Of course, a better way is
lock_page_cgroup(pc);
....
unlock_page_cgroup(pc);

But you should check the lock nesting and avoid deadlock.

If you handle page_cgroup from mem_cgroup's LRU under mz->lru_lock,
you don't have to worry about what pc->mem_cgroup points to.
Moved pages are added to the tail of the LRU, not to the head.

Expected users of this routine are:
- force_empty (rmdir)
- moving tasks between cgroups (for moving account information.)
- hierarchy (maybe useful.)

force_empty (rmdir) uses this move_account and moves pages to the parent.
This "move" will not cause OOM (I added an "oom" parameter to try_charge().)

If the parent is busy (not enough memory), force_empty calls
try_to_free_mem_cgroup_pages() and reduces usage. A toy sketch of this
"move first, free if necessary" flow follows.
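
For illustration only, here is a toy user-space sketch of that "move first,
free if necessary, give up with -EBUSY on the second pass" flow. The stub
functions and the numbers in them are made up; only the control flow mirrors
mem_cgroup_force_empty() in this patch.
==
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stubs: pretend 3 pages move fine, then the parent hits its
 * limit once; "reclaim" then makes room so the remaining moves succeed. */
static int usage = 5;
static bool parent_full = true;

static int move_one_page_to_parent(void)
{
	if (usage == 2 && parent_full)
		return -12;		/* -ENOMEM: parent hit its limit */
	usage--;
	return 0;
}

static void try_to_free_pages_stub(void)
{
	parent_full = false;		/* reclaim made room */
}

/* Mirrors the shape of mem_cgroup_force_empty() below: move accounts to the
 * parent; on -ENOMEM reclaim and retry, but give up with -EBUSY if we reach
 * the reclaim path a second time. */
static int force_empty(void)
{
	bool shrink = false;

	while (usage > 0) {
		int ret = move_one_page_to_parent();

		if (ret == 0)
			continue;
		if (shrink)
			return -16;	/* -EBUSY: already tried reclaim once */
		shrink = true;
		try_to_free_pages_stub();
	}
	return 0;
}

int main(void)
{
	printf("force_empty() = %d, remaining usage = %d\n",
	       force_empty(), usage);
	return 0;
}
==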

The purpose of this behavior is to
- fix the "forget all" behavior of force_empty and avoid leaking accounting.
- keep pages in memory as much as possible, by "moving first, freeing only
if necessary".

Adding a switch to change the behavior of force_empty to
- free first, move if necessary
- free all; if there are mlocked/busy pages, return -EBUSY.
is under consideration.

This patch removes memory.force_empty file, a brutal debug-only interface.

Changelog: (v8) -> (v9)
- fixed excessive use of __mem_cgroup_try_charge().
- explained "page cache moves, too" in Documentation.

Changelog: (v6) -> (v8)
- removed memory.force_empty file which was provided only for debug.

Changelog: (v5) -> (v6)
- removed unnecessary check.
- do all under lock_page_cgroup().
- removed res_counter_charge() from move function itself.
(and modifies try_charge() function.)
- add argument to add_list() to specify to add page_cgroup head or tail.
- merged with force_empty patch. (to answer who is user? question)

Changelog: (v4) -> (v5)
- check for lock_page() is removed.
- rewrote description.

Changelog: (v2) -> (v4)
- added lock_page_cgroup().
- moved out from new-force-empty patch.
- added how-to-use text.
- fixed race in __mem_cgroup_uncharge_common().

Reviewed-by: Daisuke Nishimura <[email protected]>
Tested-by: Daisuke Nishimura <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Documentation/controllers/memory.txt | 12 -
mm/memcontrol.c | 277 ++++++++++++++++++++++++++---------
2 files changed, 214 insertions(+), 75 deletions(-)

Index: mmotm-2.6.28-rc2+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/memcontrol.c
+++ mmotm-2.6.28-rc2+/mm/memcontrol.c
@@ -257,7 +257,7 @@ static void __mem_cgroup_remove_list(str
}

static void __mem_cgroup_add_list(struct mem_cgroup_per_zone *mz,
- struct page_cgroup *pc)
+ struct page_cgroup *pc, bool hot)
{
int lru = LRU_BASE;

@@ -271,7 +271,10 @@ static void __mem_cgroup_add_list(struct
}

MEM_CGROUP_ZSTAT(mz, lru) += 1;
- list_add(&pc->lru, &mz->lists[lru]);
+ if (hot)
+ list_add(&pc->lru, &mz->lists[lru]);
+ else
+ list_add_tail(&pc->lru, &mz->lists[lru]);

mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
}
@@ -467,21 +470,12 @@ unsigned long mem_cgroup_isolate_pages(u
return nr_taken;
}

-
-/**
- * mem_cgroup_try_charge - get charge of PAGE_SIZE.
- * @mm: an mm_struct which is charged against. (when *memcg is NULL)
- * @gfp_mask: gfp_mask for reclaim.
- * @memcg: a pointer to memory cgroup which is charged against.
- *
- * charge against memory cgroup pointed by *memcg. if *memcg == NULL, estimated
- * memory cgroup from @mm is got and stored in *memcg.
- *
- * Returns 0 if success. -ENOMEM at failure.
+/*
+ * Unlike exported interface, "oom" parameter is added. if oom==true,
+ * oom-killer can be invoked.
*/
-
-int mem_cgroup_try_charge(struct mm_struct *mm,
- gfp_t gfp_mask, struct mem_cgroup **memcg)
+static int __mem_cgroup_try_charge(struct mm_struct *mm,
+ gfp_t gfp_mask, struct mem_cgroup **memcg, bool oom)
{
struct mem_cgroup *mem;
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
@@ -528,7 +522,8 @@ int mem_cgroup_try_charge(struct mm_stru
continue;

if (!nr_retries--) {
- mem_cgroup_out_of_memory(mem, gfp_mask);
+ if (oom)
+ mem_cgroup_out_of_memory(mem, gfp_mask);
goto nomem;
}
}
@@ -538,6 +533,25 @@ nomem:
return -ENOMEM;
}

+/**
+ * mem_cgroup_try_charge - get charge of PAGE_SIZE.
+ * @mm: an mm_struct which is charged against. (when *memcg is NULL)
+ * @gfp_mask: gfp_mask for reclaim.
+ * @memcg: a pointer to memory cgroup which is charged against.
+ *
+ * charge against memory cgroup pointed by *memcg. if *memcg == NULL, estimated
+ * memory cgroup from @mm is got and stored in *memcg.
+ *
+ * Returns 0 if success. -ENOMEM at failure.
+ * This call can invoke OOM-Killer.
+ */
+
+int mem_cgroup_try_charge(struct mm_struct *mm,
+ gfp_t mask, struct mem_cgroup **memcg)
+{
+ return __mem_cgroup_try_charge(mm, mask, memcg, true);
+}
+
/*
* commit a charge got by mem_cgroup_try_charge() and makes page_cgroup to be
* USED state. If already USED, uncharge and return.
@@ -571,11 +585,109 @@ static void __mem_cgroup_commit_charge(s
mz = page_cgroup_zoneinfo(pc);

spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_add_list(mz, pc);
+ __mem_cgroup_add_list(mz, pc, true);
spin_unlock_irqrestore(&mz->lru_lock, flags);
unlock_page_cgroup(pc);
}

+/**
+ * mem_cgroup_move_account - move account of the page
+ * @pc: page_cgroup of the page.
+ * @from: mem_cgroup which the page is moved from.
+ * @to: mem_cgroup which the page is moved to. @from != @to.
+ *
+ * The caller must confirm following.
+ * 1. disable irq.
+ * 2. lru_lock of old mem_cgroup(@from) should be held.
+ *
+ * returns 0 at success,
+ * returns -EBUSY when lock is busy or "pc" is unstable.
+ *
+ * This function does "uncharge" from old cgroup but doesn't do "charge" to
+ * new cgroup. It should be done by a caller.
+ */
+
+static int mem_cgroup_move_account(struct page_cgroup *pc,
+ struct mem_cgroup *from, struct mem_cgroup *to)
+{
+ struct mem_cgroup_per_zone *from_mz, *to_mz;
+ int nid, zid;
+ int ret = -EBUSY;
+
+ VM_BUG_ON(!irqs_disabled());
+ VM_BUG_ON(from == to);
+
+ nid = page_cgroup_nid(pc);
+ zid = page_cgroup_zid(pc);
+ from_mz = mem_cgroup_zoneinfo(from, nid, zid);
+ to_mz = mem_cgroup_zoneinfo(to, nid, zid);
+
+
+ if (!trylock_page_cgroup(pc))
+ return ret;
+
+ if (!PageCgroupUsed(pc))
+ goto out;
+
+ if (pc->mem_cgroup != from)
+ goto out;
+
+ if (spin_trylock(&to_mz->lru_lock)) {
+ __mem_cgroup_remove_list(from_mz, pc);
+ css_put(&from->css);
+ res_counter_uncharge(&from->res, PAGE_SIZE);
+ pc->mem_cgroup = to;
+ css_get(&to->css);
+ __mem_cgroup_add_list(to_mz, pc, false);
+ ret = 0;
+ spin_unlock(&to_mz->lru_lock);
+ }
+out:
+ unlock_page_cgroup(pc);
+ return ret;
+}
+
+/*
+ * move charges to its parent.
+ */
+
+static int mem_cgroup_move_parent(struct page_cgroup *pc,
+ struct mem_cgroup *child,
+ gfp_t gfp_mask)
+{
+ struct cgroup *cg = child->css.cgroup;
+ struct cgroup *pcg = cg->parent;
+ struct mem_cgroup *parent;
+ struct mem_cgroup_per_zone *mz;
+ unsigned long flags;
+ int ret;
+
+ /* Is ROOT ? */
+ if (!pcg)
+ return -EINVAL;
+
+ parent = mem_cgroup_from_cont(pcg);
+
+ ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false);
+ if (ret)
+ return ret;
+
+ mz = mem_cgroup_zoneinfo(child,
+ page_cgroup_nid(pc), page_cgroup_zid(pc));
+
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ ret = mem_cgroup_move_account(pc, child, parent);
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+
+ /* drop extra refcnt */
+ css_put(&parent->css);
+ /* uncharge if move fails */
+ if (ret)
+ res_counter_uncharge(&parent->res, PAGE_SIZE);
+
+ return ret;
+}
+
/*
* Charge the memory controller for page usage.
* Return
@@ -597,7 +709,7 @@ static int mem_cgroup_charge_common(stru
prefetchw(pc);

mem = memcg;
- ret = mem_cgroup_try_charge(mm, gfp_mask, &mem);
+ ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true);
if (ret)
return ret;

@@ -898,46 +1010,52 @@ int mem_cgroup_resize_limit(struct mem_c
* This routine traverse page_cgroup in given list and drop them all.
* *And* this routine doesn't reclaim page itself, just removes page_cgroup.
*/
-#define FORCE_UNCHARGE_BATCH (128)
-static void mem_cgroup_force_empty_list(struct mem_cgroup *mem,
+static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
struct mem_cgroup_per_zone *mz,
enum lru_list lru)
{
- struct page_cgroup *pc;
- struct page *page;
- int count = FORCE_UNCHARGE_BATCH;
+ struct page_cgroup *pc, *busy;
unsigned long flags;
+ unsigned long loop;
struct list_head *list;
+ int ret = 0;

list = &mz->lists[lru];

- spin_lock_irqsave(&mz->lru_lock, flags);
- while (!list_empty(list)) {
- pc = list_entry(list->prev, struct page_cgroup, lru);
- page = pc->page;
- if (!PageCgroupUsed(pc))
+ loop = MEM_CGROUP_ZSTAT(mz, lru);
+ /* give some margin against EBUSY etc...*/
+ loop += 256;
+ busy = NULL;
+ while (loop--) {
+ ret = 0;
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ if (list_empty(list)) {
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
break;
- get_page(page);
+ }
+ pc = list_entry(list->prev, struct page_cgroup, lru);
+ if (busy == pc) {
+ list_move(&pc->lru, list);
+ busy = NULL;
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+ continue;
+ }
spin_unlock_irqrestore(&mz->lru_lock, flags);
- /*
- * Check if this page is on LRU. !LRU page can be found
- * if it's under page migration.
- */
- if (PageLRU(page)) {
- __mem_cgroup_uncharge_common(page,
- MEM_CGROUP_CHARGE_TYPE_FORCE);
- put_page(page);
- if (--count <= 0) {
- count = FORCE_UNCHARGE_BATCH;
- cond_resched();
- }
- } else {
- spin_lock_irqsave(&mz->lru_lock, flags);
+
+ ret = mem_cgroup_move_parent(pc, mem, GFP_HIGHUSER_MOVABLE);
+ if (ret == -ENOMEM)
break;
- }
- spin_lock_irqsave(&mz->lru_lock, flags);
+
+ if (ret == -EBUSY || ret == -EINVAL) {
+ /* found lock contention or "pc" is obsolete. */
+ busy = pc;
+ cond_resched();
+ } else
+ busy = NULL;
}
- spin_unlock_irqrestore(&mz->lru_lock, flags);
+ if (!ret && !list_empty(list))
+ return -EBUSY;
+ return ret;
}

/*
@@ -946,34 +1064,68 @@ static void mem_cgroup_force_empty_list(
*/
static int mem_cgroup_force_empty(struct mem_cgroup *mem)
{
- int ret = -EBUSY;
- int node, zid;
+ int ret;
+ int node, zid, shrink;
+ int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;

css_get(&mem->css);
- /*
- * page reclaim code (kswapd etc..) will move pages between
- * active_list <-> inactive_list while we don't take a lock.
- * So, we have to do loop here until all lists are empty.
- */
+
+ shrink = 0;
+move_account:
while (mem->res.usage > 0) {
+ ret = -EBUSY;
if (atomic_read(&mem->css.cgroup->count) > 0)
goto out;
+
/* This is for making all *used* pages to be on LRU. */
lru_add_drain_all();
- for_each_node_state(node, N_POSSIBLE)
- for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+ ret = 0;
+ for_each_node_state(node, N_POSSIBLE) {
+ for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
struct mem_cgroup_per_zone *mz;
enum lru_list l;
mz = mem_cgroup_zoneinfo(mem, node, zid);
- for_each_lru(l)
- mem_cgroup_force_empty_list(mem, mz, l);
+ for_each_lru(l) {
+ ret = mem_cgroup_force_empty_list(mem,
+ mz, l);
+ if (ret)
+ break;
+ }
}
+ if (ret)
+ break;
+ }
+ /* it seems parent cgroup doesn't have enough mem */
+ if (ret == -ENOMEM)
+ goto try_to_free;
cond_resched();
}
ret = 0;
out:
css_put(&mem->css);
return ret;
+
+try_to_free:
+ /* returns EBUSY if we come here twice. */
+ if (shrink) {
+ ret = -EBUSY;
+ goto out;
+ }
+ /* try to free all pages in this cgroup */
+ shrink = 1;
+ while (nr_retries && mem->res.usage > 0) {
+ int progress;
+ progress = try_to_free_mem_cgroup_pages(mem,
+ GFP_HIGHUSER_MOVABLE);
+ if (!progress)
+ nr_retries--;
+
+ }
+ /* try move_account...there may be some *locked* pages. */
+ if (mem->res.usage)
+ goto move_account;
+ ret = 0;
+ goto out;
}

static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
@@ -1022,11 +1174,6 @@ static int mem_cgroup_reset(struct cgrou
return 0;
}

-static int mem_force_empty_write(struct cgroup *cont, unsigned int event)
-{
- return mem_cgroup_force_empty(mem_cgroup_from_cont(cont));
-}
-
static const struct mem_cgroup_stat_desc {
const char *msg;
u64 unit;
@@ -1103,10 +1250,6 @@ static struct cftype mem_cgroup_files[]
.read_u64 = mem_cgroup_read,
},
{
- .name = "force_empty",
- .trigger = mem_force_empty_write,
- },
- {
.name = "stat",
.read_map = mem_control_stat_show,
},
Index: mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.28-rc2+.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
@@ -207,12 +207,6 @@ exceeded.
The memory.stat file gives accounting information. Now, the number of
caches, RSS and Active pages/Inactive pages are shown.

-The memory.force_empty gives an interface to drop *all* charges by force.
-
-# echo 1 > memory.force_empty
-
-will drop all charges in cgroup. Currently, this is maintained for test.
-
4. Testing

Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11].
@@ -242,8 +236,10 @@ reclaimed.

A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
cgroup might have some charge associated with it, even though all
-tasks have migrated away from it. Such charges are automatically dropped at
-rmdir() if there are no tasks.
+tasks have migrated away from it.
+Such charges are moved to its parent as much as possible and freed if the
+parent is full. Both RSS and CACHE pages are moved to the parent.
+If both of them are busy, rmdir() returns -EBUSY.

5. TODO

2008-11-05 08:21:00

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 2/6] memcg: handle swap cache

SwapCache support for memory resource controller (memcg)

Before the mem+swap controller, memcg itself should handle SwapCache in a proper way.

In the current memcg, SwapCache is simply leaked and a user can create tons of
SwapCache. This is a leak of accounting and should be handled.

SwapCache accounting is done as follows (a small sketch of the uncharge
decision comes after this list).

charge (anon)
- charged when it's mapped.
(because of readahead, charging at add_to_swap_cache() is not sane)
uncharge (anon)
- uncharged when it's dropped from swapcache and fully unmapped.
This means it's not uncharged at unmap.
Note: deletion from swap cache at swap-in is done after rmap information
is established.
charge (shmem)
- charged at swap-in. This prevents a charge at add_to_page_cache().

uncharge (shmem)
- uncharged when it's dropped from swapcache and not on shmem's
radix-tree.
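
For a quick reference, a small stand-alone C sketch of that uncharge decision
for a page leaving swap cache. The predicate names are invented for
illustration only; the real check is the SWAPOUT case added to
__mem_cgroup_uncharge_common() below.
==
#include <stdbool.h>
#include <stdio.h>

/*
 * Should a page be uncharged when it is deleted from swap cache?
 *  - anon page:  only if it is fully unmapped.
 *  - shmem page: only if it is no longer on shmem's radix-tree
 *    (i.e. page->mapping is gone).
 */
static bool uncharge_on_swapcache_delete(bool anon, bool mapped,
					 bool on_shmem_radix_tree)
{
	if (anon)
		return !mapped;
	return !on_shmem_radix_tree;
}

int main(void)
{
	/* anon, still mapped somewhere -> keep the charge */
	printf("%d\n", uncharge_on_swapcache_delete(true, true, false));
	/* anon, fully unmapped -> uncharge */
	printf("%d\n", uncharge_on_swapcache_delete(true, false, false));
	/* shmem, dropped from the radix-tree -> uncharge */
	printf("%d\n", uncharge_on_swapcache_delete(false, false, false));
	return 0;
}
==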

At migration, the check against the 'old page' is modified to handle shmem.

Compared to the old version discussed before (which caused trouble), we have
the advantages of
- the PCG_USED bit.
- simpler migration handling.

So the situation is much easier than several months ago, maybe.

Changelog (v1) -> (v2)
- use lock_page() when we handle unlocked SwapCache.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

---
Documentation/controllers/memory.txt | 5 ++
include/linux/swap.h | 16 ++++++++
mm/memcontrol.c | 67 +++++++++++++++++++++++++++++++----
mm/shmem.c | 18 ++++++++-
mm/swap_state.c | 1
5 files changed, 99 insertions(+), 8 deletions(-)

Index: mmotm-2.6.28-rc2+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/memcontrol.c
+++ mmotm-2.6.28-rc2+/mm/memcontrol.c
@@ -21,6 +21,7 @@
#include <linux/memcontrol.h>
#include <linux/cgroup.h>
#include <linux/mm.h>
+#include <linux/pagemap.h>
#include <linux/smp.h>
#include <linux/page-flags.h>
#include <linux/backing-dev.h>
@@ -140,6 +141,7 @@ enum charge_type {
MEM_CGROUP_CHARGE_TYPE_MAPPED,
MEM_CGROUP_CHARGE_TYPE_SHMEM, /* used by page migration of shmem */
MEM_CGROUP_CHARGE_TYPE_FORCE, /* used by force_empty */
+ MEM_CGROUP_CHARGE_TYPE_SWAPOUT, /* for accounting swapcache */
NR_CHARGE_TYPE,
};

@@ -781,6 +783,33 @@ int mem_cgroup_cache_charge(struct page
MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
}

+#ifdef CONFIG_SWAP
+int mem_cgroup_cache_charge_swapin(struct page *page,
+ struct mm_struct *mm, gfp_t mask, bool locked)
+{
+ int ret = 0;
+
+ if (mem_cgroup_subsys.disabled)
+ return 0;
+ if (unlikely(!mm))
+ mm = &init_mm;
+ if (!locked)
+ lock_page(page);
+ /*
+ * If not locked, the page can be dropped from SwapCache until
+ * we reach here.
+ */
+ if (PageSwapCache(page)) {
+ ret = mem_cgroup_charge_common(page, mm, mask,
+ MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
+ }
+ if (!locked)
+ unlock_page(page);
+
+ return ret;
+}
+#endif
+
void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
{
struct page_cgroup *pc;
@@ -818,6 +847,9 @@ __mem_cgroup_uncharge_common(struct page
if (mem_cgroup_subsys.disabled)
return;

+ if (PageSwapCache(page))
+ return;
+
/*
* Check if our page_cgroup is valid
*/
@@ -826,12 +858,26 @@ __mem_cgroup_uncharge_common(struct page
return;

lock_page_cgroup(pc);
- if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED && page_mapped(page))
- || !PageCgroupUsed(pc)) {
- /* This happens at race in zap_pte_range() and do_swap_page()*/
- unlock_page_cgroup(pc);
- return;
+
+ if (!PageCgroupUsed(pc))
+ goto unlock_out;
+
+ switch(ctype) {
+ case MEM_CGROUP_CHARGE_TYPE_MAPPED:
+ if (page_mapped(page))
+ goto unlock_out;
+ break;
+ case MEM_CGROUP_CHARGE_TYPE_SWAPOUT:
+ if (!PageAnon(page)) { /* Shared memory */
+ if (page->mapping && !page_is_file_cache(page))
+ goto unlock_out;
+ } else if (page_mapped(page)) /* Anon */
+ goto unlock_out;
+ break;
+ default:
+ break;
}
+
ClearPageCgroupUsed(pc);
mem = pc->mem_cgroup;

@@ -845,6 +891,10 @@ __mem_cgroup_uncharge_common(struct page
css_put(&mem->css);

return;
+
+unlock_out:
+ unlock_page_cgroup(pc);
+ return;
}

void mem_cgroup_uncharge_page(struct page *page)
@@ -864,6 +914,11 @@ void mem_cgroup_uncharge_cache_page(stru
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
}

+void mem_cgroup_uncharge_swapcache(struct page *page)
+{
+ __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
+}
+
/*
* Before starting migration, account PAGE_SIZE to mem_cgroup that the old
* page belongs to.
@@ -921,7 +976,7 @@ void mem_cgroup_end_migration(struct mem
ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;

/* unused page is not on radix-tree now. */
- if (unused && ctype != MEM_CGROUP_CHARGE_TYPE_MAPPED)
+ if (unused)
__mem_cgroup_uncharge_common(unused, ctype);

pc = lookup_page_cgroup(target);
Index: mmotm-2.6.28-rc2+/mm/swap_state.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/swap_state.c
+++ mmotm-2.6.28-rc2+/mm/swap_state.c
@@ -119,6 +119,7 @@ void __delete_from_swap_cache(struct pag
total_swapcache_pages--;
__dec_zone_page_state(page, NR_FILE_PAGES);
INC_CACHE_INFO(del_total);
+ mem_cgroup_uncharge_swapcache(page);
}

/**
Index: mmotm-2.6.28-rc2+/include/linux/swap.h
===================================================================
--- mmotm-2.6.28-rc2+.orig/include/linux/swap.h
+++ mmotm-2.6.28-rc2+/include/linux/swap.h
@@ -332,6 +332,22 @@ static inline void disable_swap_token(vo
put_swap_token(swap_token_mm);
}

+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+extern int mem_cgroup_cache_charge_swapin(struct page *page,
+ struct mm_struct *mm, gfp_t mask, bool locked);
+extern void mem_cgroup_uncharge_swapcache(struct page *page);
+#else
+static inline
+int mem_cgroup_cache_charge_swapin(struct page *page,
+ struct mm_struct *mm, gfp_t mask, bool locked)
+{
+ return 0;
+}
+static inline void mem_cgroup_uncharge_swapcache(struct page *page)
+{
+}
+#endif
+
#else /* CONFIG_SWAP */

#define total_swap_pages 0
Index: mmotm-2.6.28-rc2+/mm/shmem.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/shmem.c
+++ mmotm-2.6.28-rc2+/mm/shmem.c
@@ -920,8 +920,12 @@ found:
error = 1;
if (!inode)
goto out;
- /* Charge page using GFP_HIGHUSER_MOVABLE while we can wait */
- error = mem_cgroup_cache_charge(page, current->mm, GFP_HIGHUSER_MOVABLE);
+ /*
+ * Charge page using GFP_HIGHUSER_MOVABLE while we can wait.
+ * charged back to the user(not to caller) when swap account is used.
+ */
+ error = mem_cgroup_cache_charge_swapin(page,
+ current->mm, GFP_HIGHUSER_MOVABLE, true);
if (error)
goto out;
error = radix_tree_preload(GFP_KERNEL);
@@ -1258,6 +1262,16 @@ repeat:
goto repeat;
}
wait_on_page_locked(swappage);
+ /*
+ * We want to avoid charge at add_to_page_cache().
+ * charge against this swap cache here.
+ */
+ if (mem_cgroup_cache_charge_swapin(swappage,
+ current->mm, gfp, false)) {
+ page_cache_release(swappage);
+ error = -ENOMEM;
+ goto failed;
+ }
page_cache_release(swappage);
goto repeat;
}
Index: mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.28-rc2+.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
@@ -137,6 +137,11 @@ behind this approach is that a cgroup th
page will eventually get charged for it (once it is uncharged from
the cgroup that brought it in -- this will happen on memory pressure).

+Exception: When you do swapoff and thereby force swapped-out pages of
+shmem (tmpfs) back into memory, charges for those pages are accounted against
+the caller of swapoff rather than the users of shmem.
+
+
2.4 Reclaim

Each cgroup maintains a per cgroup LRU that consists of an active

2008-11-05 08:21:37

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 3/6] memcg : mem+swap controller kconfig

Experimental.

Config and control variable for mem+swap controller.

This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP
(memory resource controller swap extension.)

For accounting swap, it's obvious that we have to use additional memory
to remember "who uses swap", and this adds more overhead.
So it's better to offer a "choice" to users.

This patch adds 2 ways to enable or disable the swap extension:
- a CONFIG option
- a boot option ("noswapaccount")
A sketch of how the two knobs combine is shown below.
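
A minimal user-space sketch of how the two knobs are intended to combine.
The variable names are taken from the patch; the plain #if merely stands in
for the real Kconfig symbol, and the boot-option parsing is left out.
==
#include <stdio.h>

#define CGROUP_MEM_RES_CTLR_SWAP 1	/* stands in for the Kconfig symbol */

#if CGROUP_MEM_RES_CTLR_SWAP
/* runtime switch; the "noswapaccount" boot option clears really_do_swap_account */
static int really_do_swap_account = 1;
static int do_swap_account;

static void enable_swap_cgroup(void)
{
	if (really_do_swap_account)
		do_swap_account = 1;
}
#else
/* compiled out entirely: a constant 0 lets the compiler drop the swap code */
#define do_swap_account (0)
static void enable_swap_cgroup(void) { }
#endif

int main(void)
{
	enable_swap_cgroup();
	printf("swap accounting %s\n", do_swap_account ? "on" : "off");
	return 0;
}
==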

Changelog: v1 -> v2
- fixed typo.
- make the default value of "do_swap_account" 0 and turn it to 1
later if configured.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>


Documentation/kernel-parameters.txt | 3 +++
include/linux/memcontrol.h | 3 +++
init/Kconfig | 17 +++++++++++++++++
mm/memcontrol.c | 32 ++++++++++++++++++++++++++++++++
4 files changed, 55 insertions(+)

Index: mmotm-2.6.28-rc2+/init/Kconfig
===================================================================
--- mmotm-2.6.28-rc2+.orig/init/Kconfig
+++ mmotm-2.6.28-rc2+/init/Kconfig
@@ -428,6 +428,23 @@ config CGROUP_MEM_RES_CTLR
config MM_OWNER
bool

+config CGROUP_MEM_RES_CTLR_SWAP
+ bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)"
+ depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
+ help
+ Add swap management feature to memory resource controller. When you
+ enable this, you can limit mem+swap usage per cgroup. In other words,
+ when you disable this, the memory resource controller takes no care of
+ swap usage...a process can exhaust all of the swap. This extension
+ is useful when you want to avoid exhausting swap, but it
+ adds more overhead and consumes memory for remembering information.
+ Especially if you use a 32bit system or a small-memory system, please
+ be careful about enabling this. When memory resource controller
+ is disabled by boot option, this will be automatically disabled and
+ there will be no overhead from this. Even when you set this config=y,
+ if boot option "noswapaccount" is set, swap will not be accounted.
+
+
endmenu

config SYSFS_DEPRECATED
Index: mmotm-2.6.28-rc2+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/memcontrol.c
+++ mmotm-2.6.28-rc2+/mm/memcontrol.c
@@ -41,6 +41,15 @@
struct cgroup_subsys mem_cgroup_subsys __read_mostly;
#define MEM_CGROUP_RECLAIM_RETRIES 5

+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+/* Turned on only when memory cgroup is enabled && really_do_swap_account = 1 */
+int do_swap_account __read_mostly;
+static int really_do_swap_account __initdata = 1; /* for remember boot option*/
+#else
+#define do_swap_account (0)
+#endif
+
+
/*
* Statistics for memory cgroup.
*/
@@ -1370,6 +1379,18 @@ static void mem_cgroup_free(struct mem_c
}


+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+static void __init enable_swap_cgroup(void)
+{
+ if (!mem_cgroup_subsys.disabled && really_do_swap_account)
+ do_swap_account = 1;
+}
+#else
+static void __init enable_swap_cgroup(void)
+{
+}
+#endif
+
static struct cgroup_subsys_state *
mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
{
@@ -1378,6 +1399,7 @@ mem_cgroup_create(struct cgroup_subsys *

if (unlikely((cont->parent) == NULL)) {
mem = &init_mem_cgroup;
+ enable_swap_cgroup();
} else {
mem = mem_cgroup_alloc();
if (!mem)
@@ -1461,3 +1483,13 @@ struct cgroup_subsys mem_cgroup_subsys =
.attach = mem_cgroup_move_task,
.early_init = 0,
};
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+
+static int __init disable_swap_account(char *s)
+{
+ really_do_swap_account = 0;
+ return 1;
+}
+__setup("noswapaccount", disable_swap_account);
+#endif
Index: mmotm-2.6.28-rc2+/Documentation/kernel-parameters.txt
===================================================================
--- mmotm-2.6.28-rc2+.orig/Documentation/kernel-parameters.txt
+++ mmotm-2.6.28-rc2+/Documentation/kernel-parameters.txt
@@ -1543,6 +1543,9 @@ and is between 256 and 4096 characters.

nosoftlockup [KNL] Disable the soft-lockup detector.

+ noswapaccount [KNL] Disable accounting of swap in memory resource
+ controller. (See Documentation/controllers/memory.txt)
+
nosync [HW,M68K] Disables sync negotiation for all devices.

notsc [BUGS=X86-32] Disable Time Stamp Counter
Index: mmotm-2.6.28-rc2+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-rc2+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-rc2+/include/linux/memcontrol.h
@@ -77,6 +77,9 @@ extern void mem_cgroup_record_reclaim_pr
extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
int priority, enum lru_list lru);

+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+extern int do_swap_account;
+#endif

#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;

2008-11-05 08:22:28

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 4/6] memcg : swap cgroup

For accounting swap, we need a record per swap entry, at least.

This patch adds the following functions:
- swap_cgroup_swapon() .... called from swapon
- swap_cgroup_swapoff() ... called at the end of swapoff.

- swap_cgroup_record() .... record information of a swap entry.
- lookup_swap_cgroup() .... look up information of a swap entry.

This patch just implements "how to record information"; there is no actual
method for limiting the usage of swap yet. These routines use a flat table
for record and lookup. A "wise" lookup structure like a radix-tree requires
memory allocation when inserting new records, but swap-out is usually
called under memory shortage (or when memcg hits its limit).
So I used static allocation. (Maybe dynamic allocation is not very
hard, but it adds an extra memory allocation in the memory-shortage path.)


Note1: Here we use a pointer to record the information, and this means
8 bytes per swap entry. I think we can reduce this when we
create an "id of cgroup" in the range of 0-65535 or 0-255.

Note2: The array of swap_cgroup pages is allocated from HIGHMEM; maybe good for x86-32.
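
For a feel of the table layout described above, here is a small stand-alone
C program mirroring the SC_PER_PAGE/SC_POS_MASK arithmetic used by
swap_cgroup_record()/lookup_swap_cgroup(). 4KB pages and 8-byte pointers are
assumed, and the swap size and offset are just example numbers.
==
#include <stdio.h>

#define PAGE_SIZE	4096UL			/* assumed */
#define SC_SIZE		sizeof(void *)		/* sizeof(struct swap_cgroup): one pointer */
#define SC_PER_PAGE	(PAGE_SIZE / SC_SIZE)
#define SC_POS_MASK	(SC_PER_PAGE - 1)

int main(void)
{
	unsigned long max_pages = 1UL << 19;	/* e.g. a 2GB swap device */
	unsigned long length = max_pages / SC_PER_PAGE + 1;
	unsigned long offset = 123456;		/* some swp_offset(ent) */

	/* the record/lookup routines index the flat table as: */
	unsigned long idx = offset / SC_PER_PAGE;	/* which backing page   */
	unsigned long pos = offset & SC_POS_MASK;	/* slot inside the page */

	printf("entries per page: %lu\n", (unsigned long)SC_PER_PAGE);
	printf("offset %lu -> map[%lu], slot %lu\n", offset, idx, pos);
	printf("overhead: %lu map pages (+ %lu bytes of pointer array)\n",
	       length, (unsigned long)(length * sizeof(void *)));
	return 0;
}
==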

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

include/linux/page_cgroup.h | 35 +++++++
mm/page_cgroup.c | 201 ++++++++++++++++++++++++++++++++++++++++++++
mm/swapfile.c | 8 +
3 files changed, 244 insertions(+)

Index: mmotm-2.6.28-rc2+/mm/page_cgroup.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/page_cgroup.c
+++ mmotm-2.6.28-rc2+/mm/page_cgroup.c
@@ -8,6 +8,8 @@
#include <linux/memory.h>
#include <linux/vmalloc.h>
#include <linux/cgroup.h>
+#include <linux/swapops.h>
+#include <linux/highmem.h>

static void __meminit
__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
@@ -254,3 +256,202 @@ void __init pgdat_page_cgroup_init(struc
}

#endif
+
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+
+DEFINE_MUTEX(swap_cgroup_mutex);
+struct swap_cgroup_ctrl {
+ spinlock_t lock;
+ struct page **map;
+ unsigned long length;
+};
+
+struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
+
+/*
+ * This 8bytes seems big..maybe we can reduce this when we can use "id" for
+ * cgroup rather than pointer.
+ */
+struct swap_cgroup {
+ struct mem_cgroup *val;
+};
+#define SC_PER_PAGE (PAGE_SIZE/sizeof(struct swap_cgroup))
+#define SC_POS_MASK (SC_PER_PAGE - 1)
+
+/*
+ * allocate buffer for swap_cgroup.
+ */
+static int swap_cgroup_prepare(int type)
+{
+ struct page *page;
+ struct swap_cgroup_ctrl *ctrl;
+ unsigned long idx, max;
+
+ if (!do_swap_account)
+ return 0;
+ ctrl = &swap_cgroup_ctrl[type];
+
+ for (idx = 0; idx < ctrl->length; idx++) {
+ page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
+ if (!page)
+ goto not_enough_page;
+ ctrl->map[idx] = page;
+ }
+ return 0;
+not_enough_page:
+ max = idx;
+ for (idx = 0; idx < max; idx++)
+ __free_page(ctrl->map[idx]);
+
+ return -ENOMEM;
+}
+
+/**
+ * swap_cgroup_record - record mem_cgroup for this swp_entry.
+ * @ent: swap entry to be recorded into
+ * @mem: mem_cgroup to be recorded
+ *
+ * Returns old value at success, NULL at failure.
+ * (Of course, old value can be NULL.)
+ */
+struct mem_cgroup *swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem)
+{
+ unsigned long flags;
+ int type = swp_type(ent);
+ unsigned long offset = swp_offset(ent);
+ unsigned long idx = offset / SC_PER_PAGE;
+ unsigned long pos = offset & SC_POS_MASK;
+ struct swap_cgroup_ctrl *ctrl;
+ struct page *mappage;
+ struct swap_cgroup *sc;
+ struct mem_cgroup *old;
+
+ if (!do_swap_account)
+ return NULL;
+
+ ctrl = &swap_cgroup_ctrl[type];
+
+ mappage = ctrl->map[idx];
+ spin_lock_irqsave(&ctrl->lock, flags);
+ sc = kmap_atomic(mappage, KM_USER0);
+ sc += pos;
+ old = sc->val;
+ sc->val = mem;
+ kunmap_atomic(mappage, KM_USER0);
+ spin_unlock_irqrestore(&ctrl->lock, flags);
+ return old;
+}
+
+/**
+ * lookup_swap_cgroup - lookup mem_cgroup tied to swap entry
+ * @ent: swap entry to be looked up.
+ *
+ * Returns pointer to mem_cgroup at success. NULL at failure.
+ */
+struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent)
+{
+ int type = swp_type(ent);
+ unsigned long flags;
+ unsigned long offset = swp_offset(ent);
+ unsigned long idx = offset / SC_PER_PAGE;
+ unsigned long pos = offset & SC_POS_MASK;
+ struct swap_cgroup_ctrl *ctrl;
+ struct page *mappage;
+ struct swap_cgroup *sc;
+ struct mem_cgroup *ret;
+
+ if (!do_swap_account)
+ return NULL;
+
+ ctrl = &swap_cgroup_ctrl[type];
+
+ mappage = ctrl->map[idx];
+
+ spin_lock_irqsave(&ctrl->lock, flags);
+ sc = kmap_atomic(mappage, KM_USER0);
+ sc += pos;
+ ret = sc->val;
+ kunmap_atomic(mappage, KM_USER0);
+ spin_unlock_irqrestore(&ctrl->lock, flags);
+ return ret;
+}
+
+int swap_cgroup_swapon(int type, unsigned long max_pages)
+{
+ void *array;
+ unsigned long array_size;
+ unsigned long length;
+ struct swap_cgroup_ctrl *ctrl;
+
+ if (!do_swap_account)
+ return 0;
+
+ length = ((max_pages/SC_PER_PAGE) + 1);
+ array_size = length * sizeof(void *);
+
+ array = vmalloc(array_size);
+ if (!array)
+ goto nomem;
+
+ memset(array, 0, array_size);
+ ctrl = &swap_cgroup_ctrl[type];
+ mutex_lock(&swap_cgroup_mutex);
+ ctrl->length = length;
+ ctrl->map = array;
+ if (swap_cgroup_prepare(type)) {
+ /* memory shortage */
+ ctrl->map = NULL;
+ ctrl->length = 0;
+ vfree(array);
+ mutex_unlock(&swap_cgroup_mutex);
+ goto nomem;
+ }
+ mutex_unlock(&swap_cgroup_mutex);
+
+ printk(KERN_INFO
+ "swap_cgroup: uses %ld bytes vmalloc and %ld bytes buffres\n",
+ array_size, length * PAGE_SIZE);
+ printk(KERN_INFO
+ "swap_cgroup can be disabled by noswapaccount boot option.\n");
+
+ return 0;
+nomem:
+ printk(KERN_INFO "couldn't allocate enough memory for swap_cgroup.\n");
+ printk(KERN_INFO
+ "swap_cgroup can be disabled by noswapaccount boot option\n");
+ return -ENOMEM;
+}
+
+void swap_cgroup_swapoff(int type)
+{
+ int i;
+ struct swap_cgroup_ctrl *ctrl;
+
+ if (!do_swap_account)
+ return;
+
+ mutex_lock(&swap_cgroup_mutex);
+ ctrl = &swap_cgroup_ctrl[type];
+ if (ctrl->map) {
+ for (i = 0; i < ctrl->length; i++) {
+ struct page *page = ctrl->map[i];
+ if (page)
+ __free_page(page);
+ }
+ vfree(ctrl->map);
+ ctrl->map = NULL;
+ ctrl->length = 0;
+ }
+ mutex_unlock(&swap_cgroup_mutex);
+}
+
+static int __init swap_cgroup_init(void)
+{
+ int i;
+ for (i = 0; i < MAX_SWAPFILES; i++)
+ spin_lock_init(&swap_cgroup_ctrl[i].lock);
+ return 0;
+}
+late_initcall(swap_cgroup_init);
+#endif
Index: mmotm-2.6.28-rc2+/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.28-rc2+.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.28-rc2+/include/linux/page_cgroup.h
@@ -105,4 +105,39 @@ static inline void page_cgroup_init(void
}

#endif
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+#include <linux/swap.h>
+extern struct mem_cgroup *
+swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem);
+extern struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent);
+extern int swap_cgroup_swapon(int type, unsigned long max_pages);
+extern void swap_cgroup_swapoff(int type);
+#else
+#include <linux/swap.h>
+
+static inline
+struct mem_cgroup *swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem)
+{
+ return NULL;
+}
+
+static inline
+struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent)
+{
+ return NULL;
+}
+
+static inline int
+swap_cgroup_swapon(int type, unsigned long max_pages)
+{
+ return 0;
+}
+
+static inline void swap_cgroup_swapoff(int type)
+{
+ return;
+}
+
+#endif
#endif
Index: mmotm-2.6.28-rc2+/mm/swapfile.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/swapfile.c
+++ mmotm-2.6.28-rc2+/mm/swapfile.c
@@ -32,6 +32,7 @@
#include <asm/pgtable.h>
#include <asm/tlbflush.h>
#include <linux/swapops.h>
+#include <linux/page_cgroup.h>

static DEFINE_SPINLOCK(swap_lock);
static unsigned int nr_swapfiles;
@@ -1345,6 +1346,9 @@ asmlinkage long sys_swapoff(const char _
spin_unlock(&swap_lock);
mutex_unlock(&swapon_mutex);
vfree(swap_map);
+ /* Destroy swap account information */
+ swap_cgroup_swapoff(type);
+
inode = mapping->host;
if (S_ISBLK(inode->i_mode)) {
struct block_device *bdev = I_BDEV(inode);
@@ -1669,6 +1673,10 @@ asmlinkage long sys_swapon(const char __
nr_good_pages = swap_header->info.last_page -
swap_header->info.nr_badpages -
1 /* header page */;
+
+ if (!error)
+ error = swap_cgroup_swapon(type, maxpages);
+
if (error)
goto bad_swap;
}

2008-11-05 08:24:10

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 5/6] memcg: mem+swap controller

Mem+Swap controller core.

This patch implements a per-cgroup limit for the usage of memory+swap.
Although there is SwapCache, double counting of swap-cache and
swap-entry is avoided.

The Mem+Swap controller works as follows.
- memory usage is limited by memory.limit_in_bytes.
- memory + swap usage is limited by memory.memsw.limit_in_bytes.


This has the following benefit.
- A user can limit the total resource usage of mem+swap.

Without this, because the memory resource controller doesn't take care of
swap usage, a process can exhaust all the swap (by a memory leak).
We can avoid this case.

Also, swap is a shared resource, but it cannot be reclaimed (moved back to
memory) until it is used again. This characteristic can be trouble when the
memory is divided into parts by cpuset or memcg.
Assume group A and group B.
After some applications run, the system can end up as:

Group A -- very large free memory space but occupies 99% of swap.
Group B -- under memory shortage but cannot use swap...it's nearly full.

The ability to set an appropriate swap limit for each group is required.

Maybe someone wonders "why not swap but mem+swap?"

- The global LRU (kswapd) can swap out arbitrary pages. Swap-out means
moving an account from memory to swap...there is no change in the usage of
mem+swap.

In other words, when we want to limit the usage of swap without affecting
the global LRU, a mem+swap limit is better than just limiting swap.


Accounting target information is stored in swap_cgroup, which is a
per-swap-entry record.

Charging is done as follows (a toy walk-through of the counters comes after
this list).
map
- charge page and memsw.

unmap
- uncharge page/memsw if not SwapCache.

swap-out (__delete_from_swap_cache)
- uncharge page.
- record mem_cgroup information into swap_cgroup.

swap-in (do_swap_page)
- charged as page and memsw.
The record in swap_cgroup is cleared and
memsw accounting is decremented.

swap-free (swap_free())
- if the swap entry is freed, memsw is uncharged by PAGE_SIZE.

After this, the usual memory resource controller handles SwapCache.
(It was a lacking (ignored) feature in the current memcg but must be handled.)

There are people who work under never-swap environments and consider swap as
something bad. For such people, this mem+swap controller extension is just
overhead. This overhead can be avoided by a config or boot option.
(See Kconfig; the details are not in this patch.)

Changelog: v1 -> v2
- fixed typos
- fixed migration of anon pages.
- fixed uncharge to check USED bit always.
- code for swapcache is moved to another patch.
- added "noswap" argument to try_to_free_mem_cgroup_pages
- fixed lock_page around mem_cgroup_charge_cache_swap()
- fixed failcnt file.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>


Documentation/controllers/memory.txt | 29 ++
include/linux/memcontrol.h | 11 -
include/linux/swap.h | 14 +
mm/memcontrol.c | 340 +++++++++++++++++++++++++++++++----
mm/memory.c | 3
mm/swap_state.c | 5
mm/swapfile.c | 11 -
mm/vmscan.c | 6
8 files changed, 375 insertions(+), 44 deletions(-)

Index: mmotm-2.6.28-rc2+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/memcontrol.c
+++ mmotm-2.6.28-rc2+/mm/memcontrol.c
@@ -132,6 +132,10 @@ struct mem_cgroup {
*/
struct res_counter res;
/*
+ * the counter to account for mem+swap usage.
+ */
+ struct res_counter memsw;
+ /*
* Per cgroup active and inactive list, similar to the
* per zone LRU lists.
*/
@@ -142,6 +146,12 @@ struct mem_cgroup {
* statistics.
*/
struct mem_cgroup_stat stat;
+
+ /*
+ * used for counting reference from swap_cgroup.
+ */
+ int obsolete;
+ atomic_t swapref;
};
static struct mem_cgroup init_mem_cgroup;

@@ -168,6 +178,16 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
0, /* FORCE */
};

+
+/* for encoding cft->private value on file */
+#define _MEM (0)
+#define _MEMSWAP (1)
+#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val))
+#define MEMFILE_TYPE(val) (((val) >> 16) & 0xffff)
+#define MEMFILE_ATTR(val) ((val) & 0xffff)
+
+static void mem_cgroup_forget_swapref(struct mem_cgroup *mem);
+
/*
* Always modified under lru lock. Then, not necessary to preempt_disable()
*/
@@ -514,12 +534,25 @@ static int __mem_cgroup_try_charge(struc
css_get(&mem->css);
}

+ while (1) {
+ int ret;
+ bool noswap = false;

- while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
+ ret = res_counter_charge(&mem->res, PAGE_SIZE);
+ if (likely(!ret)) {
+ if (!do_swap_account)
+ break;
+ ret = res_counter_charge(&mem->memsw, PAGE_SIZE);
+ if (likely(!ret))
+ break;
+ /* mem+swap counter fails */
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ noswap = true;
+ }
if (!(gfp_mask & __GFP_WAIT))
goto nomem;

- if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
+ if (try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap))
continue;

/*
@@ -528,8 +561,13 @@ static int __mem_cgroup_try_charge(struc
* moved to swap cache or just unmapped from the cgroup.
* Check the limit again to see if the reclaim reduced the
* current usage of the cgroup before giving up
+ *
*/
- if (res_counter_check_under_limit(&mem->res))
+ if (!do_swap_account &&
+ res_counter_check_under_limit(&mem->res))
+ continue;
+ if (do_swap_account &&
+ res_counter_check_under_limit(&mem->memsw))
continue;

if (!nr_retries--) {
@@ -583,6 +621,8 @@ static void __mem_cgroup_commit_charge(s
if (unlikely(PageCgroupUsed(pc))) {
unlock_page_cgroup(pc);
res_counter_uncharge(&mem->res, PAGE_SIZE);
+ if (do_swap_account)
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
css_put(&mem->css);
return;
}
@@ -647,6 +687,8 @@ static int mem_cgroup_move_account(struc
__mem_cgroup_remove_list(from_mz, pc);
css_put(&from->css);
res_counter_uncharge(&from->res, PAGE_SIZE);
+ if (do_swap_account)
+ res_counter_uncharge(&from->memsw, PAGE_SIZE);
pc->mem_cgroup = to;
css_get(&to->css);
__mem_cgroup_add_list(to_mz, pc, false);
@@ -693,8 +735,11 @@ static int mem_cgroup_move_parent(struct
/* drop extra refcnt */
css_put(&parent->css);
/* uncharge if move fails */
- if (ret)
+ if (ret) {
res_counter_uncharge(&parent->res, PAGE_SIZE);
+ if (do_swap_account)
+ res_counter_uncharge(&parent->memsw, PAGE_SIZE);
+ }

return ret;
}
@@ -792,7 +837,34 @@ int mem_cgroup_cache_charge(struct page
MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
}

+int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
+ struct page *page,
+ gfp_t mask, struct mem_cgroup **ptr)
+{
+ struct mem_cgroup *mem;
+ swp_entry_t ent;
+
+ if (mem_cgroup_subsys.disabled)
+ return 0;
+
+ if (!do_swap_account)
+ goto charge_cur_mm;
+
+ ent.val = page_private(page);
+
+ mem = lookup_swap_cgroup(ent);
+ if (!mem || mem->obsolete)
+ goto charge_cur_mm;
+ *ptr = mem;
+ return __mem_cgroup_try_charge(NULL, mask, ptr, true);
+charge_cur_mm:
+ if (unlikely(!mm))
+ mm = &init_mm;
+ return __mem_cgroup_try_charge(mm, mask, ptr, true);
+}
+
#ifdef CONFIG_SWAP
+
int mem_cgroup_cache_charge_swapin(struct page *page,
struct mm_struct *mm, gfp_t mask, bool locked)
{
@@ -809,8 +881,28 @@ int mem_cgroup_cache_charge_swapin(struc
* we reach here.
*/
if (PageSwapCache(page)) {
+ struct mem_cgroup *mem = NULL;
+ swp_entry_t ent;
+
+ ent.val = page_private(page);
+ if (do_swap_account) {
+ mem = lookup_swap_cgroup(ent);
+ if (mem && mem->obsolete)
+ mem = NULL;
+ if (mem)
+ mm = NULL;
+ }
ret = mem_cgroup_charge_common(page, mm, mask,
- MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
+ MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
+
+ if (!ret && do_swap_account) {
+ /* avoid double counting */
+ mem = swap_cgroup_record(ent, NULL);
+ if (mem) {
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ mem_cgroup_forget_swapref(mem);
+ }
+ }
}
if (!locked)
unlock_page(page);
@@ -829,6 +921,23 @@ void mem_cgroup_commit_charge_swapin(str
return;
pc = lookup_page_cgroup(page);
__mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
+ /*
+ * Now swap is on-memory. This means this page may be
+ * counted both as mem and swap....double count.
+ * Fix it by uncharging from memsw. This SwapCache is stable
+ * because we're still under lock_page().
+ */
+ if (do_swap_account) {
+ swp_entry_t ent = {.val = page_private(page)};
+ struct mem_cgroup *memcg;
+ memcg = swap_cgroup_record(ent, NULL);
+ if (memcg) {
+ /* If memcg is obsolete, memcg can be != ptr */
+ res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+ mem_cgroup_forget_swapref(memcg);
+ }
+
+ }
}

void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
@@ -838,6 +947,7 @@ void mem_cgroup_cancel_charge_swapin(str
if (!mem)
return;
res_counter_uncharge(&mem->res, PAGE_SIZE);
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
css_put(&mem->css);
}

@@ -845,29 +955,31 @@ void mem_cgroup_cancel_charge_swapin(str
/*
* uncharge if !page_mapped(page)
*/
-static void
+static struct mem_cgroup *
__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
{
struct page_cgroup *pc;
- struct mem_cgroup *mem;
+ struct mem_cgroup *mem = NULL;
struct mem_cgroup_per_zone *mz;
unsigned long flags;

if (mem_cgroup_subsys.disabled)
- return;
+ return NULL;

if (PageSwapCache(page))
- return;
+ return NULL;

/*
* Check if our page_cgroup is valid
*/
pc = lookup_page_cgroup(page);
if (unlikely(!pc || !PageCgroupUsed(pc)))
- return;
+ return NULL;

lock_page_cgroup(pc);

+ mem = pc->mem_cgroup;
+
if (!PageCgroupUsed(pc))
goto unlock_out;

@@ -887,8 +999,11 @@ __mem_cgroup_uncharge_common(struct page
break;
}

+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+
ClearPageCgroupUsed(pc);
- mem = pc->mem_cgroup;

mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
@@ -896,14 +1011,13 @@ __mem_cgroup_uncharge_common(struct page
spin_unlock_irqrestore(&mz->lru_lock, flags);
unlock_page_cgroup(pc);

- res_counter_uncharge(&mem->res, PAGE_SIZE);
css_put(&mem->css);

- return;
+ return mem;

unlock_out:
unlock_page_cgroup(pc);
- return;
+ return NULL;
}

void mem_cgroup_uncharge_page(struct page *page)
@@ -923,10 +1037,42 @@ void mem_cgroup_uncharge_cache_page(stru
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
}

-void mem_cgroup_uncharge_swapcache(struct page *page)
+/*
+ * called from __delete_from_swap_cache() and drop "page" account.
+ * memcg information is recorded to swap_cgroup of "ent"
+ */
+void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
{
- __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
+ struct mem_cgroup *memcg;
+
+ memcg = __mem_cgroup_uncharge_common(page,
+ MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
+ /* record memcg information */
+ if (do_swap_account && memcg) {
+ swap_cgroup_record(ent, memcg);
+ atomic_inc(&memcg->swapref);
+ }
+}
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+/*
+ * called from swap_entry_free(). remove record in swap_cgroup and
+ * uncharge "memsw" account.
+ */
+void mem_cgroup_uncharge_swap(swp_entry_t ent)
+{
+ struct mem_cgroup *memcg;
+
+ if (!do_swap_account)
+ return;
+
+ memcg = swap_cgroup_record(ent, NULL);
+ if (memcg) {
+ res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+ mem_cgroup_forget_swapref(memcg);
+ }
}
+#endif

/*
* Before starting migration, account PAGE_SIZE to mem_cgroup that the old
@@ -1035,7 +1181,7 @@ int mem_cgroup_shrink_usage(struct mm_st
rcu_read_unlock();

do {
- progress = try_to_free_mem_cgroup_pages(mem, gfp_mask);
+ progress = try_to_free_mem_cgroup_pages(mem, gfp_mask, true);
progress += res_counter_check_under_limit(&mem->res);
} while (!progress && --retry);

@@ -1062,13 +1208,55 @@ int mem_cgroup_resize_limit(struct mem_c
break;
}
progress = try_to_free_mem_cgroup_pages(memcg,
- GFP_HIGHUSER_MOVABLE);
+ GFP_HIGHUSER_MOVABLE, false);
if (!progress)
retry_count--;
}
return ret;
}

+int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
+ unsigned long long val)
+{
+ int retry_count = MEM_CGROUP_RECLAIM_RETRIES;
+ unsigned long flags;
+ u64 memlimit, oldusage, curusage;
+ int ret;
+
+ if (!do_swap_account)
+ return -EINVAL;
+
+ while (retry_count) {
+ if (signal_pending(current)) {
+ ret = -EINTR;
+ break;
+ }
+ /*
+ * Rather than hiding it all in some function, I do this in an
+ * open-coded manner so you can see what it really does.
+ * We have to guarantee mem->res.limit < mem->memsw.limit.
+ */
+ spin_lock_irqsave(&memcg->res.lock, flags);
+ memlimit = memcg->res.limit;
+ if (memlimit > val) {
+ spin_unlock_irqrestore(&memcg->res.lock, flags);
+ ret = -EINVAL;
+ break;
+ }
+ ret = res_counter_set_limit(&memcg->memsw, val);
+ oldusage = memcg->memsw.usage;
+ spin_unlock_irqrestore(&memcg->res.lock, flags);
+
+ if (!ret)
+ break;
+ try_to_free_mem_cgroup_pages(memcg, GFP_HIGHUSER_MOVABLE, true);
+ curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
+ if (curusage >= oldusage)
+ retry_count--;
+ }
+ return ret;
+}
+

/*
* This routine traverse page_cgroup in given list and drop them all.
@@ -1180,7 +1368,7 @@ try_to_free:
while (nr_retries && mem->res.usage > 0) {
int progress;
progress = try_to_free_mem_cgroup_pages(mem,
- GFP_HIGHUSER_MOVABLE);
+ GFP_HIGHUSER_MOVABLE, false);
if (!progress)
nr_retries--;

@@ -1194,8 +1382,25 @@ try_to_free:

static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
{
- return res_counter_read_u64(&mem_cgroup_from_cont(cont)->res,
- cft->private);
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+ u64 val = 0;
+ int type, name;
+
+ type = MEMFILE_TYPE(cft->private);
+ name = MEMFILE_ATTR(cft->private);
+ switch (type) {
+ case _MEM:
+ val = res_counter_read_u64(&mem->res, name);
+ break;
+ case _MEMSWAP:
+ if (do_swap_account)
+ val = res_counter_read_u64(&mem->memsw, name);
+ break;
+ default:
+ BUG();
+ break;
+ }
+ return val;
}
/*
* The user of this function is...
@@ -1205,15 +1410,22 @@ static int mem_cgroup_write(struct cgrou
const char *buffer)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+ int type, name;
unsigned long long val;
int ret;

- switch (cft->private) {
+ type = MEMFILE_TYPE(cft->private);
+ name = MEMFILE_ATTR(cft->private);
+ switch (name) {
case RES_LIMIT:
/* This function does all necessary parse...reuse it */
ret = res_counter_memparse_write_strategy(buffer, &val);
- if (!ret)
+ if (ret)
+ break;
+ if (type == _MEM)
ret = mem_cgroup_resize_limit(memcg, val);
+ else
+ ret = mem_cgroup_resize_memsw_limit(memcg, val);
break;
default:
ret = -EINVAL; /* should be BUG() ? */
@@ -1225,14 +1437,23 @@ static int mem_cgroup_write(struct cgrou
static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
{
struct mem_cgroup *mem;
+ int type, name;

mem = mem_cgroup_from_cont(cont);
- switch (event) {
+ type = MEMFILE_TYPE(event);
+ name = MEMFILE_ATTR(event);
+ switch (name) {
case RES_MAX_USAGE:
- res_counter_reset_max(&mem->res);
+ if (type == _MEM)
+ res_counter_reset_max(&mem->res);
+ else
+ res_counter_reset_max(&mem->memsw);
break;
case RES_FAILCNT:
- res_counter_reset_failcnt(&mem->res);
+ if (type == _MEM)
+ res_counter_reset_failcnt(&mem->res);
+ else
+ res_counter_reset_failcnt(&mem->memsw);
break;
}
return 0;
@@ -1286,30 +1507,33 @@ static int mem_control_stat_show(struct
cb->fill(cb, "unevictable", unevictable * PAGE_SIZE);

}
+ /* showing refs from disk-swap */
+ cb->fill(cb, "swap_on_disk", atomic_read(&mem_cont->swapref)
+ * PAGE_SIZE);
return 0;
}

static struct cftype mem_cgroup_files[] = {
{
.name = "usage_in_bytes",
- .private = RES_USAGE,
+ .private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
.read_u64 = mem_cgroup_read,
},
{
.name = "max_usage_in_bytes",
- .private = RES_MAX_USAGE,
+ .private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
.trigger = mem_cgroup_reset,
.read_u64 = mem_cgroup_read,
},
{
.name = "limit_in_bytes",
- .private = RES_LIMIT,
+ .private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
.write_string = mem_cgroup_write,
.read_u64 = mem_cgroup_read,
},
{
.name = "failcnt",
- .private = RES_FAILCNT,
+ .private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
.trigger = mem_cgroup_reset,
.read_u64 = mem_cgroup_read,
},
@@ -1317,6 +1541,31 @@ static struct cftype mem_cgroup_files[]
.name = "stat",
.read_map = mem_control_stat_show,
},
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+ {
+ .name = "memsw.usage_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
+ .read_u64 = mem_cgroup_read,
+ },
+ {
+ .name = "memsw.max_usage_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEMSWAP, RES_MAX_USAGE),
+ .trigger = mem_cgroup_reset,
+ .read_u64 = mem_cgroup_read,
+ },
+ {
+ .name = "memsw.limit_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT),
+ .write_string = mem_cgroup_write,
+ .read_u64 = mem_cgroup_read,
+ },
+ {
+ .name = "memsw.failcnt",
+ .private = MEMFILE_PRIVATE(_MEMSWAP, RES_FAILCNT),
+ .trigger = mem_cgroup_reset,
+ .read_u64 = mem_cgroup_read,
+ },
+#endif
};

static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
@@ -1370,14 +1619,43 @@ static struct mem_cgroup *mem_cgroup_all
return mem;
}

+/*
+ * At destroying mem_cgroup, references from swap_cgroup can remain.
+ * (scanning all at force_empty is too costly...)
+ *
+ * Instead of clearing all references at force_empty, we remember
+ * the number of reference from swap_cgroup and free mem_cgroup when
+ * it goes down to 0.
+ *
+ * When mem_cgroup is destroyed, mem->obsolete will be set to 1 and
+ * entries which point to this memcg will be ignored at swapin.
+ *
+ * Removal of cgroup itself succeeds regardless of refs from swap.
+ */
+
static void mem_cgroup_free(struct mem_cgroup *mem)
{
+ if (do_swap_account) {
+ if (atomic_read(&mem->swapref) > 0)
+ return;
+ }
if (sizeof(*mem) < PAGE_SIZE)
kfree(mem);
else
vfree(mem);
}

+static void mem_cgroup_forget_swapref(struct mem_cgroup *mem)
+{
+ if (!do_swap_account)
+ return;
+ if (atomic_dec_and_test(&mem->swapref)) {
+ if (!mem->obsolete)
+ return;
+ mem_cgroup_free(mem);
+ }
+}
+

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
static void __init enable_swap_cgroup(void)
@@ -1407,6 +1685,7 @@ mem_cgroup_create(struct cgroup_subsys *
}

res_counter_init(&mem->res);
+ res_counter_init(&mem->memsw);

for_each_node_state(node, N_POSSIBLE)
if (alloc_mem_cgroup_per_zone_info(mem, node))
@@ -1425,6 +1704,7 @@ static void mem_cgroup_pre_destroy(struc
struct cgroup *cont)
{
struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+ mem->obsolete = 1;
mem_cgroup_force_empty(mem);
}

Index: mmotm-2.6.28-rc2+/mm/swapfile.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/swapfile.c
+++ mmotm-2.6.28-rc2+/mm/swapfile.c
@@ -271,8 +271,9 @@ out:
return NULL;
}

-static int swap_entry_free(struct swap_info_struct *p, unsigned long offset)
+static int swap_entry_free(struct swap_info_struct *p, swp_entry_t ent)
{
+ unsigned long offset = swp_offset(ent);
int count = p->swap_map[offset];

if (count < SWAP_MAP_MAX) {
@@ -287,6 +288,7 @@ static int swap_entry_free(struct swap_i
swap_list.next = p - swap_info;
nr_swap_pages++;
p->inuse_pages--;
+ mem_cgroup_uncharge_swap(ent);
}
}
return count;
@@ -302,7 +304,7 @@ void swap_free(swp_entry_t entry)

p = swap_info_get(entry);
if (p) {
- swap_entry_free(p, swp_offset(entry));
+ swap_entry_free(p, entry);
spin_unlock(&swap_lock);
}
}
@@ -421,7 +423,7 @@ void free_swap_and_cache(swp_entry_t ent

p = swap_info_get(entry);
if (p) {
- if (swap_entry_free(p, swp_offset(entry)) == 1) {
+ if (swap_entry_free(p, entry) == 1) {
page = find_get_page(&swapper_space, entry.val);
if (page && !trylock_page(page)) {
page_cache_release(page);
@@ -536,7 +538,8 @@ static int unuse_pte(struct vm_area_stru
pte_t *pte;
int ret = 1;

- if (mem_cgroup_try_charge(vma->vm_mm, GFP_HIGHUSER_MOVABLE, &ptr))
+ if (mem_cgroup_try_charge_swapin(vma->vm_mm, page,
+ GFP_HIGHUSER_MOVABLE, &ptr))
ret = -ENOMEM;

pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
Index: mmotm-2.6.28-rc2+/mm/swap_state.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/swap_state.c
+++ mmotm-2.6.28-rc2+/mm/swap_state.c
@@ -17,6 +17,7 @@
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
#include <linux/migrate.h>
+#include <linux/page_cgroup.h>

#include <asm/pgtable.h>

@@ -108,6 +109,8 @@ int add_to_swap_cache(struct page *page,
*/
void __delete_from_swap_cache(struct page *page)
{
+ swp_entry_t ent = {.val = page_private(page)};
+
BUG_ON(!PageLocked(page));
BUG_ON(!PageSwapCache(page));
BUG_ON(PageWriteback(page));
@@ -119,7 +122,7 @@ void __delete_from_swap_cache(struct pag
total_swapcache_pages--;
__dec_zone_page_state(page, NR_FILE_PAGES);
INC_CACHE_INFO(del_total);
- mem_cgroup_uncharge_swapcache(page);
+ mem_cgroup_uncharge_swapcache(page, ent);
}

/**
Index: mmotm-2.6.28-rc2+/include/linux/swap.h
===================================================================
--- mmotm-2.6.28-rc2+.orig/include/linux/swap.h
+++ mmotm-2.6.28-rc2+/include/linux/swap.h
@@ -213,7 +213,7 @@ static inline void lru_cache_add_active_
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask);
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
- gfp_t gfp_mask);
+ gfp_t gfp_mask, bool noswap);
extern int __isolate_lru_page(struct page *page, int mode, int file);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
@@ -335,7 +335,7 @@ static inline void disable_swap_token(vo
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
extern int mem_cgroup_cache_charge_swapin(struct page *page,
struct mm_struct *mm, gfp_t mask, bool locked);
-extern void mem_cgroup_uncharge_swapcache(struct page *page);
+extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
#else
static inline
int mem_cgroup_cache_charge_swapin(struct page *page,
@@ -343,7 +343,15 @@ int mem_cgroup_cache_charge_swapin(struc
{
return 0;
}
-static inline void mem_cgroup_uncharge_swapcache(struct page *page)
+static inline void
+mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
+{
+}
+#endif
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
+#else
+static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
{
}
#endif
Index: mmotm-2.6.28-rc2+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-rc2+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-rc2+/include/linux/memcontrol.h
@@ -32,6 +32,8 @@ extern int mem_cgroup_newpage_charge(str
/* for swap handling */
extern int mem_cgroup_try_charge(struct mm_struct *mm,
gfp_t gfp_mask, struct mem_cgroup **ptr);
+extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
+ struct page *page, gfp_t mask, struct mem_cgroup **ptr);
extern void mem_cgroup_commit_charge_swapin(struct page *page,
struct mem_cgroup *ptr);
extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
@@ -80,7 +82,6 @@ extern long mem_cgroup_calc_reclaim(stru
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
#endif
-
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;

@@ -97,7 +98,13 @@ static inline int mem_cgroup_cache_charg
}

static inline int mem_cgroup_try_charge(struct mm_struct *mm,
- gfp_t gfp_mask, struct mem_cgroup **ptr)
+ gfp_t gfp_mask, struct mem_cgroup **ptr)
+{
+ return 0;
+}
+
+static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
+ struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr)
{
return 0;
}
Index: mmotm-2.6.28-rc2+/mm/memory.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/memory.c
+++ mmotm-2.6.28-rc2+/mm/memory.c
@@ -2324,7 +2324,8 @@ static int do_swap_page(struct mm_struct
lock_page(page);
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);

- if (mem_cgroup_try_charge(mm, GFP_HIGHUSER_MOVABLE, &ptr) == -ENOMEM) {
+ if (mem_cgroup_try_charge_swapin(mm, page,
+ GFP_HIGHUSER_MOVABLE, &ptr) == -ENOMEM) {
ret = VM_FAULT_OOM;
unlock_page(page);
goto out;
Index: mmotm-2.6.28-rc2+/mm/vmscan.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/vmscan.c
+++ mmotm-2.6.28-rc2+/mm/vmscan.c
@@ -1669,7 +1669,8 @@ unsigned long try_to_free_pages(struct z
#ifdef CONFIG_CGROUP_MEM_RES_CTLR

unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
- gfp_t gfp_mask)
+ gfp_t gfp_mask,
+ bool noswap)
{
struct scan_control sc = {
.may_writepage = !laptop_mode,
@@ -1682,6 +1683,9 @@ unsigned long try_to_free_mem_cgroup_pag
};
struct zonelist *zonelist;

+ if (noswap)
+ sc.may_swap = 0;
+
sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
zonelist = NODE_DATA(numa_node_id())->node_zonelists;
Index: mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.28-rc2+.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
@@ -137,12 +137,32 @@ behind this approach is that a cgroup th
page will eventually get charged for it (once it is uncharged from
the cgroup that brought it in -- this will happen on memory pressure).

-Exception: When you do swapoff and make swapped-out pages of shmem(tmpfs) to
+Exception: if CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not used:
+When you do swapoff and make swapped-out pages of shmem(tmpfs) to
be backed into memory in force, charges for pages are accounted against the
caller of swapoff rather than the users of shmem.


-2.4 Reclaim
+2.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP)
+The Swap Extension allows you to record charges for swap. A swapped-in page
+is charged back to the cgroup that originally allocated it, if possible.
+
+When swap is accounted, the following files are added.
+ - memory.memsw.usage_in_bytes.
+ - memory.memsw.limit_in_bytes.
+
+The usage of mem+swap is limited by memsw.limit_in_bytes.
+
+Note: why 'mem+swap' rather than just swap?
+The global LRU (kswapd) can swap out arbitrary pages. Swapping a page out
+only moves its charge from memory to swap; the usage of mem+swap does not
+change.
+
+In other words, when we want to limit swap usage without affecting the
+global LRU, a mem+swap limit is better than a plain swap limit from the
+OS's point of view.
+
+2.5 Reclaim

Each cgroup maintains a per cgroup LRU that consists of an active
and inactive list. When a cgroup goes over its limit, we first try
@@ -246,6 +266,11 @@ Such charges are moved to its parent as
is full. Both of RSS and CACHES are moved to parent.
If both of them are busy, rmdir() returns -EBUSY.

+Charges recorded in swap information are not updated when a cgroup is removed.
+The recorded information is effectively discarded, and a cgroup which later
+uses the swap (swapcache) will be charged as its new owner.
+
+
5. TODO

1. Add support for accounting huge pages (as a separate controller)

2008-11-05 08:24:55

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [RFC][PATCH 6/6] memcg: synchronized LRU

A big patch for changing memcg's LRU semantics.

Now,
- page_cgroup is linked to mem_cgroup's own LRU (per zone).

- The LRU of page_cgroup is not synchronized with the global LRU.

- page and page_cgroup are one-to-one and statically allocated.

- To find which LRU a page_cgroup is on, you have to check pc->mem_cgroup, as in
- lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);

- SwapCache is handled.

And when we handle the LRU list of page_cgroup, we do the following.

pc = lookup_page_cgroup(page);
lock_page_cgroup(pc); .....................(1)
mz = page_cgroup_zoneinfo(pc);
spin_lock(&mz->lru_lock);
.....add to LRU
spin_unlock(&mz->lru_lock);
unlock_page_cgroup(pc);

But (1) is a spinlock and we have to worry about deadlock against zone->lru_lock.
So trylock() is used at (1) for now. Without (1), we can't trust that "mz" is correct.

This is an attempt to remove this dirty nesting of locks.
This patch replaces mz->lru_lock with zone->lru_lock.
Then, the above sequence can be written as

spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
mem_cgroup_add/remove/etc_lru() {
pc = lookup_page_cgroup(page);
mz = page_cgroup_zoneinfo(pc);
if (PageCgroupUsed(pc)) {
....add to LRU
}
spin_unlock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU

This is much simpler.
(*) We're safe even if we don't take lock_page_cgroup(pc), because:
1. pc->mem_cgroup can only be modified
- at charge.
- at account_move().
2. At charge,
the PCG_USED bit is not set before pc->mem_cgroup is fixed.
3. At account_move(),
the page is isolated and not on the LRU.
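
For illustration, the intended pairing of the barriers added by this patch
is roughly as follows (a simplified sketch only; see the real hunks in
__mem_cgroup_commit_charge() and mem_cgroup_add_lru_list() below):
==
	/* charge side */
	pc->mem_cgroup = mem;
	smp_wmb();				/* publish mem_cgroup before PCG_USED */
	pc->flags = pcg_default_flags[ctype];	/* sets PCG_USED */

	/* LRU side, called under zone->lru_lock */
	pc = lookup_page_cgroup(page);
	smp_rmb();				/* pairs with the smp_wmb() above */
	if (!PageCgroupUsed(pc))		/* not charged yet -> skip */
		return;
	mz = page_cgroup_zoneinfo(pc);		/* pc->mem_cgroup is safe to use */
==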

Pros.
- easy to maintain.
- memcg can make use of the laziness of pagevec.
- we don't have to duplicate the LRU/Active/Unevictable bits in page_cgroup.
- the LRU status of memcg will be synchronized with the global LRU's.
- the number of locks is reduced.
- account_move() is simplified very much.
Cons.
- may increase the cost of LRU rotation.
(no impact if memcg is not configured.)

Changelog v0 -> v1
- fixed statistics.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

fs/splice.c | 1
include/linux/memcontrol.h | 29 +++
include/linux/mm_inline.h | 3
include/linux/page_cgroup.h | 17 --
mm/memcontrol.c | 337 +++++++++++++++++++-------------------------
mm/page_cgroup.c | 1
mm/swap.c | 1
mm/vmscan.c | 9 -
8 files changed, 186 insertions(+), 212 deletions(-)

Index: mmotm-2.6.28-rc2+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/memcontrol.c
+++ mmotm-2.6.28-rc2+/mm/memcontrol.c
@@ -35,6 +35,7 @@
#include <linux/vmalloc.h>
#include <linux/mm_inline.h>
#include <linux/page_cgroup.h>
+#include "internal.h"

#include <asm/uaccess.h>

@@ -99,7 +100,6 @@ struct mem_cgroup_per_zone {
/*
* spin_lock to protect the per cgroup LRU
*/
- spinlock_t lru_lock;
struct list_head lists[NR_LRU_LISTS];
unsigned long count[NR_LRU_LISTS];
};
@@ -136,17 +136,19 @@ struct mem_cgroup {
*/
struct res_counter memsw;
/*
+ * statistics.
+ */
+ struct mem_cgroup_stat stat;
+ /*
* Per cgroup active and inactive list, similar to the
- * per zone LRU lists.
+ * per zone LRU lists. This area is read_mostly.
*/
struct mem_cgroup_lru_info info;
-
- int prev_priority; /* for recording reclaim priority */
/*
- * statistics.
+ * the reclaim priority used in the last reclaim is remembered
+ * here and reused as a hint for the next reclaim.
*/
- struct mem_cgroup_stat stat;
-
+ int prev_priority; /* for recording reclaim priority */
/*
* used for counting reference from swap_cgroup.
*/
@@ -167,14 +169,12 @@ enum charge_type {
/* only for here (for easy reading.) */
#define PCGF_CACHE (1UL << PCG_CACHE)
#define PCGF_USED (1UL << PCG_USED)
-#define PCGF_ACTIVE (1UL << PCG_ACTIVE)
#define PCGF_LOCK (1UL << PCG_LOCK)
-#define PCGF_FILE (1UL << PCG_FILE)
static const unsigned long
pcg_default_flags[NR_CHARGE_TYPE] = {
- PCGF_CACHE | PCGF_FILE | PCGF_USED | PCGF_LOCK, /* File Cache */
- PCGF_ACTIVE | PCGF_USED | PCGF_LOCK, /* Anon */
- PCGF_ACTIVE | PCGF_CACHE | PCGF_USED | PCGF_LOCK, /* Shmem */
+ PCGF_CACHE | PCGF_USED | PCGF_LOCK, /* File Cache */
+ PCGF_USED | PCGF_LOCK, /* Anon */
+ PCGF_CACHE | PCGF_USED | PCGF_LOCK, /* Shmem */
0, /* FORCE */
};

@@ -188,9 +188,6 @@ pcg_default_flags[NR_CHARGE_TYPE] = {

static void mem_cgroup_forget_swapref(struct mem_cgroup *mem);

-/*
- * Always modified under lru lock. Then, not necessary to preempt_disable()
- */
static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
struct page_cgroup *pc,
bool charge)
@@ -198,10 +195,9 @@ static void mem_cgroup_charge_statistics
int val = (charge)? 1 : -1;
struct mem_cgroup_stat *stat = &mem->stat;
struct mem_cgroup_stat_cpu *cpustat;
+ int cpu = get_cpu();

- VM_BUG_ON(!irqs_disabled());
-
- cpustat = &stat->cpustat[smp_processor_id()];
+ cpustat = &stat->cpustat[cpu];
if (PageCgroupCache(pc))
__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_CACHE, val);
else
@@ -213,6 +209,7 @@ static void mem_cgroup_charge_statistics
else
__mem_cgroup_stat_add_safe(cpustat,
MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
+ put_cpu();
}

static struct mem_cgroup_per_zone *
@@ -267,80 +264,95 @@ struct mem_cgroup *mem_cgroup_from_task(
struct mem_cgroup, css);
}

-static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
- struct page_cgroup *pc)
-{
- int lru = LRU_BASE;
+/*
+ * The following LRU functions may be used without holding PCG_LOCK.
+ * They are called by global LRU routines independently of memcg.
+ * What we have to take care of here is the validity of pc->mem_cgroup.
+ *
+ * pc->mem_cgroup changes when
+ * 1. charging
+ * 2. moving an account
+ * Typically, "charge" is done before add-to-lru. The exception is SwapCache,
+ * which is added to the LRU before it is charged.
+ * If the PCG_USED bit is not set, the page_cgroup is not added to this LRU.
+ * When moving an account, the page is isolated and not on the LRU.
+ */

- if (PageCgroupUnevictable(pc))
- lru = LRU_UNEVICTABLE;
- else {
- if (PageCgroupActive(pc))
- lru += LRU_ACTIVE;
- if (PageCgroupFile(pc))
- lru += LRU_FILE;
- }
+void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
+{
+ struct page_cgroup *pc;
+ struct mem_cgroup *mem;
+ struct mem_cgroup_per_zone *mz;

+ if (mem_cgroup_subsys.disabled)
+ return;
+ pc = lookup_page_cgroup(page);
+ /* can happen while we handle swapcache. */
+ if (list_empty(&pc->lru))
+ return;
+ mz = page_cgroup_zoneinfo(pc);
+ mem = pc->mem_cgroup;
MEM_CGROUP_ZSTAT(mz, lru) -= 1;
-
- mem_cgroup_charge_statistics(pc->mem_cgroup, pc, false);
- list_del(&pc->lru);
+ list_del_init(&pc->lru);
+ return;
}

-static void __mem_cgroup_add_list(struct mem_cgroup_per_zone *mz,
- struct page_cgroup *pc, bool hot)
+void mem_cgroup_del_lru(struct page *page)
{
- int lru = LRU_BASE;
+ mem_cgroup_del_lru_list(page, page_lru(page));
+}

- if (PageCgroupUnevictable(pc))
- lru = LRU_UNEVICTABLE;
- else {
- if (PageCgroupActive(pc))
- lru += LRU_ACTIVE;
- if (PageCgroupFile(pc))
- lru += LRU_FILE;
- }
+void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
+{
+ struct mem_cgroup_per_zone *mz;
+ struct page_cgroup *pc;

- MEM_CGROUP_ZSTAT(mz, lru) += 1;
- if (hot)
- list_add(&pc->lru, &mz->lists[lru]);
- else
- list_add_tail(&pc->lru, &mz->lists[lru]);
+ if (mem_cgroup_subsys.disabled)
+ return;

- mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
+ pc = lookup_page_cgroup(page);
+ smp_rmb();
+ /* unused page is not rotated. */
+ if (!PageCgroupUsed(pc))
+ return;
+ mz = page_cgroup_zoneinfo(pc);
+ list_move(&pc->lru, &mz->lists[lru]);
}

-static void __mem_cgroup_move_lists(struct page_cgroup *pc, enum lru_list lru)
+void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
{
- struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
- int active = PageCgroupActive(pc);
- int file = PageCgroupFile(pc);
- int unevictable = PageCgroupUnevictable(pc);
- enum lru_list from = unevictable ? LRU_UNEVICTABLE :
- (LRU_FILE * !!file + !!active);
+ struct page_cgroup *pc;
+ struct mem_cgroup_per_zone *mz;

- if (lru == from)
+ if (mem_cgroup_subsys.disabled)
+ return;
+ pc = lookup_page_cgroup(page);
+ /* barrier to sync with "charge" */
+ smp_rmb();
+ if (!PageCgroupUsed(pc))
return;

- MEM_CGROUP_ZSTAT(mz, from) -= 1;
- /*
- * However this is done under mz->lru_lock, another flags, which
- * are not related to LRU, will be modified from out-of-lock.
- * We have to use atomic set/clear flags.
- */
- if (is_unevictable_lru(lru)) {
- ClearPageCgroupActive(pc);
- SetPageCgroupUnevictable(pc);
- } else {
- if (is_active_lru(lru))
- SetPageCgroupActive(pc);
- else
- ClearPageCgroupActive(pc);
- ClearPageCgroupUnevictable(pc);
- }
-
+ mz = page_cgroup_zoneinfo(pc);
MEM_CGROUP_ZSTAT(mz, lru) += 1;
- list_move(&pc->lru, &mz->lists[lru]);
+ list_add(&pc->lru, &mz->lists[lru]);
+}
+/*
+ * Used to add swapcache to the LRU. Be careful when calling this function:
+ * zone->lru_lock must not be held and irqs must not be disabled.
+ */
+static void mem_cgroup_lru_fixup(struct page *page)
+{
+ if (!isolate_lru_page(page))
+ putback_lru_page(page);
+}
+
+void mem_cgroup_move_lists(struct page *page,
+ enum lru_list from, enum lru_list to)
+{
+ if (mem_cgroup_subsys.disabled)
+ return;
+ mem_cgroup_del_lru_list(page, from);
+ mem_cgroup_add_lru_list(page, to);
}

int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
@@ -354,37 +366,6 @@ int task_in_mem_cgroup(struct task_struc
}

/*
- * This routine assumes that the appropriate zone's lru lock is already held
- */
-void mem_cgroup_move_lists(struct page *page, enum lru_list lru)
-{
- struct page_cgroup *pc;
- struct mem_cgroup_per_zone *mz;
- unsigned long flags;
-
- if (mem_cgroup_subsys.disabled)
- return;
-
- /*
- * We cannot lock_page_cgroup while holding zone's lru_lock,
- * because other holders of lock_page_cgroup can be interrupted
- * with an attempt to rotate_reclaimable_page. But we cannot
- * safely get to page_cgroup without it, so just try_lock it:
- * mem_cgroup_isolate_pages allows for page left on wrong list.
- */
- pc = lookup_page_cgroup(page);
- if (!trylock_page_cgroup(pc))
- return;
- if (pc && PageCgroupUsed(pc)) {
- mz = page_cgroup_zoneinfo(pc);
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_move_lists(pc, lru);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
- }
- unlock_page_cgroup(pc);
-}
-
-/*
* Calculate mapped_ratio under memory controller. This will be used in
* vmscan.c for deteremining we have to reclaim mapped pages.
*/
@@ -463,40 +444,24 @@ unsigned long mem_cgroup_isolate_pages(u
mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
src = &mz->lists[lru];

- spin_lock(&mz->lru_lock);
scan = 0;
list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
if (scan >= nr_to_scan)
break;
+
+ page = pc->page;
if (unlikely(!PageCgroupUsed(pc)))
continue;
- page = pc->page;
-
if (unlikely(!PageLRU(page)))
continue;

- /*
- * TODO: play better with lumpy reclaim, grabbing anything.
- */
- if (PageUnevictable(page) ||
- (PageActive(page) && !active) ||
- (!PageActive(page) && active)) {
- __mem_cgroup_move_lists(pc, page_lru(page));
- continue;
- }
-
scan++;
- list_move(&pc->lru, &pc_list);
-
if (__isolate_lru_page(page, mode, file) == 0) {
list_move(&page->lru, dst);
nr_taken++;
}
}

- list_splice(&pc_list, src);
- spin_unlock(&mz->lru_lock);
-
*scanned = scan;
return nr_taken;
}
@@ -610,9 +575,6 @@ static void __mem_cgroup_commit_charge(s
struct page_cgroup *pc,
enum charge_type ctype)
{
- struct mem_cgroup_per_zone *mz;
- unsigned long flags;
-
/* try_charge() can return NULL to *memcg, taking care of it. */
if (!mem)
return;
@@ -627,17 +589,11 @@ static void __mem_cgroup_commit_charge(s
return;
}
pc->mem_cgroup = mem;
- /*
- * If a page is accounted as a page cache, insert to inactive list.
- * If anon, insert to active list.
- */
+ smp_wmb();
pc->flags = pcg_default_flags[ctype];

- mz = page_cgroup_zoneinfo(pc);
+ mem_cgroup_charge_statistics(mem, pc, true);

- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_add_list(mz, pc, true);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
unlock_page_cgroup(pc);
}

@@ -648,8 +604,7 @@ static void __mem_cgroup_commit_charge(s
* @to: mem_cgroup which the page is moved to. @from != @to.
*
* The caller must confirm following.
- * 1. disable irq.
- * 2. lru_lock of old mem_cgroup(@from) should be held.
+ * - page is not on LRU (isolate_lru_page() is useful.)
*
* returns 0 at success,
* returns -EBUSY when lock is busy or "pc" is unstable.
@@ -665,15 +620,14 @@ static int mem_cgroup_move_account(struc
int nid, zid;
int ret = -EBUSY;

- VM_BUG_ON(!irqs_disabled());
VM_BUG_ON(from == to);
+ VM_BUG_ON(PageLRU(pc->page));

nid = page_cgroup_nid(pc);
zid = page_cgroup_zid(pc);
from_mz = mem_cgroup_zoneinfo(from, nid, zid);
to_mz = mem_cgroup_zoneinfo(to, nid, zid);

-
if (!trylock_page_cgroup(pc))
return ret;

@@ -683,18 +637,15 @@ static int mem_cgroup_move_account(struc
if (pc->mem_cgroup != from)
goto out;

- if (spin_trylock(&to_mz->lru_lock)) {
- __mem_cgroup_remove_list(from_mz, pc);
- css_put(&from->css);
- res_counter_uncharge(&from->res, PAGE_SIZE);
- if (do_swap_account)
- res_counter_uncharge(&from->memsw, PAGE_SIZE);
- pc->mem_cgroup = to;
- css_get(&to->css);
- __mem_cgroup_add_list(to_mz, pc, false);
- ret = 0;
- spin_unlock(&to_mz->lru_lock);
- }
+ css_put(&from->css);
+ res_counter_uncharge(&from->res, PAGE_SIZE);
+ mem_cgroup_charge_statistics(from, pc, false);
+ if (do_swap_account)
+ res_counter_uncharge(&from->memsw, PAGE_SIZE);
+ pc->mem_cgroup = to;
+ mem_cgroup_charge_statistics(to, pc, true);
+ css_get(&to->css);
+ ret = 0;
out:
unlock_page_cgroup(pc);
return ret;
@@ -708,39 +659,47 @@ static int mem_cgroup_move_parent(struct
struct mem_cgroup *child,
gfp_t gfp_mask)
{
+ struct page *page = pc->page;
struct cgroup *cg = child->css.cgroup;
struct cgroup *pcg = cg->parent;
struct mem_cgroup *parent;
- struct mem_cgroup_per_zone *mz;
- unsigned long flags;
int ret;

/* Is ROOT ? */
if (!pcg)
return -EINVAL;

+
parent = mem_cgroup_from_cont(pcg);

+
ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false);
if (ret)
return ret;

- mz = mem_cgroup_zoneinfo(child,
- page_cgroup_nid(pc), page_cgroup_zid(pc));
+ if (!get_page_unless_zero(page))
+ return -EBUSY;
+
+ ret = isolate_lru_page(page);
+
+ if (ret)
+ goto cancel;

- spin_lock_irqsave(&mz->lru_lock, flags);
ret = mem_cgroup_move_account(pc, child, parent);
- spin_unlock_irqrestore(&mz->lru_lock, flags);

- /* drop extra refcnt */
+ /* drop the extra refcnt taken by try_charge() (move_account takes one) */
css_put(&parent->css);
- /* uncharge if move fails */
- if (ret) {
- res_counter_uncharge(&parent->res, PAGE_SIZE);
- if (do_swap_account)
- res_counter_uncharge(&parent->memsw, PAGE_SIZE);
+ putback_lru_page(page);
+ if (!ret) {
+ put_page(page);
+ return 0;
}
-
+ /* uncharge if move fails */
+cancel:
+ res_counter_uncharge(&parent->res, PAGE_SIZE);
+ if (do_swap_account)
+ res_counter_uncharge(&parent->memsw, PAGE_SIZE);
+ put_page(page);
return ret;
}

@@ -906,6 +865,8 @@ int mem_cgroup_cache_charge_swapin(struc
}
if (!locked)
unlock_page(page);
+ /* add this page(page_cgroup) to the LRU we want. */
+ mem_cgroup_lru_fixup(page);

return ret;
}
@@ -938,6 +899,8 @@ void mem_cgroup_commit_charge_swapin(str
}

}
+ /* add this page(page_cgroup) to the LRU we want. */
+ mem_cgroup_lru_fixup(page);
}

void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
@@ -961,7 +924,6 @@ __mem_cgroup_uncharge_common(struct page
struct page_cgroup *pc;
struct mem_cgroup *mem = NULL;
struct mem_cgroup_per_zone *mz;
- unsigned long flags;

if (mem_cgroup_subsys.disabled)
return NULL;
@@ -1003,12 +965,10 @@ __mem_cgroup_uncharge_common(struct page
if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
res_counter_uncharge(&mem->memsw, PAGE_SIZE);

+ mem_cgroup_charge_statistics(mem, pc, false);
ClearPageCgroupUsed(pc);

mz = page_cgroup_zoneinfo(pc);
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_remove_list(mz, pc);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
unlock_page_cgroup(pc);

css_put(&mem->css);
@@ -1257,21 +1217,22 @@ int mem_cgroup_resize_memsw_limit(struct
return ret;
}

-
/*
* This routine traverse page_cgroup in given list and drop them all.
* *And* this routine doesn't reclaim page itself, just removes page_cgroup.
*/
static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
- struct mem_cgroup_per_zone *mz,
- enum lru_list lru)
+ int node, int zid, enum lru_list lru)
{
+ struct zone *zone;
+ struct mem_cgroup_per_zone *mz;
struct page_cgroup *pc, *busy;
- unsigned long flags;
- unsigned long loop;
+ unsigned long flags, loop;
struct list_head *list;
int ret = 0;

+ zone = &NODE_DATA(node)->node_zones[zid];
+ mz = mem_cgroup_zoneinfo(mem, node, zid);
list = &mz->lists[lru];

loop = MEM_CGROUP_ZSTAT(mz, lru);
@@ -1280,19 +1241,19 @@ static int mem_cgroup_force_empty_list(s
busy = NULL;
while (loop--) {
ret = 0;
- spin_lock_irqsave(&mz->lru_lock, flags);
+ spin_lock_irqsave(&zone->lru_lock, flags);
if (list_empty(list)) {
- spin_unlock_irqrestore(&mz->lru_lock, flags);
+ spin_unlock_irqrestore(&zone->lru_lock, flags);
break;
}
pc = list_entry(list->prev, struct page_cgroup, lru);
if (busy == pc) {
list_move(&pc->lru, list);
busy = 0;
- spin_unlock_irqrestore(&mz->lru_lock, flags);
+ spin_unlock_irqrestore(&zone->lru_lock, flags);
continue;
}
- spin_unlock_irqrestore(&mz->lru_lock, flags);
+ spin_unlock_irqrestore(&zone->lru_lock, flags);

ret = mem_cgroup_move_parent(pc, mem, GFP_HIGHUSER_MOVABLE);
if (ret == -ENOMEM)
@@ -1305,6 +1266,7 @@ static int mem_cgroup_force_empty_list(s
} else
busy = NULL;
}
+
if (!ret && !list_empty(list))
return -EBUSY;
return ret;
@@ -1334,12 +1296,10 @@ move_account:
ret = 0;
for_each_node_state(node, N_POSSIBLE) {
for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
- struct mem_cgroup_per_zone *mz;
enum lru_list l;
- mz = mem_cgroup_zoneinfo(mem, node, zid);
for_each_lru(l) {
ret = mem_cgroup_force_empty_list(mem,
- mz, l);
+ node, zid, l);
if (ret)
break;
}
@@ -1373,6 +1333,7 @@ try_to_free:
nr_retries--;

}
+ lru_add_drain();
/* try move_account...there may be some *locked* pages. */
if (mem->res.usage)
goto move_account;
@@ -1593,7 +1554,6 @@ static int alloc_mem_cgroup_per_zone_inf

for (zone = 0; zone < MAX_NR_ZONES; zone++) {
mz = &pn->zoneinfo[zone];
- spin_lock_init(&mz->lru_lock);
for_each_lru(l)
INIT_LIST_HEAD(&mz->lists[l]);
}
@@ -1635,10 +1595,17 @@ static struct mem_cgroup *mem_cgroup_all

static void mem_cgroup_free(struct mem_cgroup *mem)
{
+ int node;
+
if (do_swap_account) {
if (atomic_read(&mem->swapref) > 0)
return;
}
+
+
+ for_each_node_state(node, N_POSSIBLE)
+ free_mem_cgroup_per_zone_info(mem, node);
+
if (sizeof(*mem) < PAGE_SIZE)
kfree(mem);
else
@@ -1711,12 +1678,6 @@ static void mem_cgroup_pre_destroy(struc
static void mem_cgroup_destroy(struct cgroup_subsys *ss,
struct cgroup *cont)
{
- int node;
- struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
-
- for_each_node_state(node, N_POSSIBLE)
- free_mem_cgroup_per_zone_info(mem, node);
-
mem_cgroup_free(mem_cgroup_from_cont(cont));
}

Index: mmotm-2.6.28-rc2+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-rc2+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-rc2+/include/linux/memcontrol.h
@@ -40,7 +40,12 @@ extern void mem_cgroup_cancel_charge_swa

extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
-extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
+extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
+extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
+extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
+extern void mem_cgroup_del_lru(struct page *page);
+extern void mem_cgroup_move_lists(struct page *page,
+ enum lru_list from, enum lru_list to);
extern void mem_cgroup_uncharge_page(struct page *page);
extern void mem_cgroup_uncharge_cache_page(struct page *page);
extern int mem_cgroup_shrink_usage(struct mm_struct *mm, gfp_t gfp_mask);
@@ -131,7 +136,27 @@ static inline int mem_cgroup_shrink_usag
return 0;
}

-static inline void mem_cgroup_move_lists(struct page *page, bool active)
+static inline void mem_cgroup_add_lru_list(struct page *page, int lru)
+{
+}
+
+static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
+{
+ return ;
+}
+
+static inline void mem_cgroup_rotate_lru_list(struct page *page, int lru)
+{
+ return ;
+}
+
+static inline void mem_cgroup_del_lru(struct page *page)
+{
+ return ;
+}
+
+static inline void
+mem_cgroup_move_lists(struct page *page, enum lru_list from, enum lru_list to)
{
}

Index: mmotm-2.6.28-rc2+/include/linux/mm_inline.h
===================================================================
--- mmotm-2.6.28-rc2+.orig/include/linux/mm_inline.h
+++ mmotm-2.6.28-rc2+/include/linux/mm_inline.h
@@ -28,6 +28,7 @@ add_page_to_lru_list(struct zone *zone,
{
list_add(&page->lru, &zone->lru[l].list);
__inc_zone_state(zone, NR_LRU_BASE + l);
+ mem_cgroup_add_lru_list(page, l);
}

static inline void
@@ -35,6 +36,7 @@ del_page_from_lru_list(struct zone *zone
{
list_del(&page->lru);
__dec_zone_state(zone, NR_LRU_BASE + l);
+ mem_cgroup_del_lru_list(page, l);
}

static inline void
@@ -54,6 +56,7 @@ del_page_from_lru(struct zone *zone, str
l += page_is_file_cache(page);
}
__dec_zone_state(zone, NR_LRU_BASE + l);
+ mem_cgroup_del_lru_list(page, l);
}

/**
Index: mmotm-2.6.28-rc2+/mm/page_cgroup.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/page_cgroup.c
+++ mmotm-2.6.28-rc2+/mm/page_cgroup.c
@@ -17,6 +17,7 @@ __init_page_cgroup(struct page_cgroup *p
pc->flags = 0;
pc->mem_cgroup = NULL;
pc->page = pfn_to_page(pfn);
+ INIT_LIST_HEAD(&pc->lru);
}
static unsigned long total_usage;

Index: mmotm-2.6.28-rc2+/fs/splice.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/fs/splice.c
+++ mmotm-2.6.28-rc2+/fs/splice.c
@@ -21,6 +21,7 @@
#include <linux/file.h>
#include <linux/pagemap.h>
#include <linux/splice.h>
+#include <linux/memcontrol.h>
#include <linux/mm_inline.h>
#include <linux/swap.h>
#include <linux/writeback.h>
Index: mmotm-2.6.28-rc2+/mm/vmscan.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/vmscan.c
+++ mmotm-2.6.28-rc2+/mm/vmscan.c
@@ -516,7 +516,6 @@ redo:
lru = LRU_UNEVICTABLE;
add_page_to_unevictable_list(page);
}
- mem_cgroup_move_lists(page, lru);

/*
* page's status can change while we move it among lru. If an evictable
@@ -551,7 +550,6 @@ void putback_lru_page(struct page *page)

lru = !!TestClearPageActive(page) + page_is_file_cache(page);
lru_cache_add_lru(page, lru);
- mem_cgroup_move_lists(page, lru);
put_page(page);
}
#endif /* CONFIG_UNEVICTABLE_LRU */
@@ -823,6 +821,7 @@ int __isolate_lru_page(struct page *page
return ret;

ret = -EBUSY;
+
if (likely(get_page_unless_zero(page))) {
/*
* Be careful not to clear PageLRU until after we're
@@ -831,6 +830,7 @@ int __isolate_lru_page(struct page *page
*/
ClearPageLRU(page);
ret = 0;
+ mem_cgroup_del_lru(page);
}

return ret;
@@ -1144,7 +1144,6 @@ static unsigned long shrink_inactive_lis
SetPageLRU(page);
lru = page_lru(page);
add_page_to_lru_list(zone, page, lru);
- mem_cgroup_move_lists(page, lru);
if (PageActive(page) && scan_global_lru(sc)) {
int file = !!page_is_file_cache(page);
zone->recent_rotated[file]++;
@@ -1277,7 +1276,7 @@ static void shrink_active_list(unsigned
ClearPageActive(page);

list_move(&page->lru, &zone->lru[lru].list);
- mem_cgroup_move_lists(page, lru);
+ mem_cgroup_add_lru_list(page, lru);
pgmoved++;
if (!pagevec_add(&pvec, page)) {
__mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved);
@@ -2436,6 +2435,7 @@ retry:

__dec_zone_state(zone, NR_UNEVICTABLE);
list_move(&page->lru, &zone->lru[l].list);
+ mem_cgroup_move_lists(page, LRU_UNEVICTABLE, l);
__inc_zone_state(zone, NR_INACTIVE_ANON + l);
__count_vm_event(UNEVICTABLE_PGRESCUED);
} else {
@@ -2444,6 +2444,7 @@ retry:
*/
SetPageUnevictable(page);
list_move(&page->lru, &zone->lru[LRU_UNEVICTABLE].list);
+ mem_cgroup_rotate_lru_list(page, LRU_UNEVICTABLE);
if (page_evictable(page, NULL))
goto retry;
}
Index: mmotm-2.6.28-rc2+/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.28-rc2+.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.28-rc2+/include/linux/page_cgroup.h
@@ -26,10 +26,6 @@ enum {
PCG_LOCK, /* page cgroup is locked */
PCG_CACHE, /* charged as cache */
PCG_USED, /* this object is in use. */
- /* flags for LRU placement */
- PCG_ACTIVE, /* page is active in this cgroup */
- PCG_FILE, /* page is file system backed */
- PCG_UNEVICTABLE, /* page is unevictableable */
};

#define TESTPCGFLAG(uname, lname) \
@@ -50,19 +46,6 @@ TESTPCGFLAG(Cache, CACHE)
TESTPCGFLAG(Used, USED)
CLEARPCGFLAG(Used, USED)

-/* LRU management flags (from global-lru definition) */
-TESTPCGFLAG(File, FILE)
-SETPCGFLAG(File, FILE)
-CLEARPCGFLAG(File, FILE)
-
-TESTPCGFLAG(Active, ACTIVE)
-SETPCGFLAG(Active, ACTIVE)
-CLEARPCGFLAG(Active, ACTIVE)
-
-TESTPCGFLAG(Unevictable, UNEVICTABLE)
-SETPCGFLAG(Unevictable, UNEVICTABLE)
-CLEARPCGFLAG(Unevictable, UNEVICTABLE)
-
static inline int page_cgroup_nid(struct page_cgroup *pc)
{
return page_to_nid(pc->page);
Index: mmotm-2.6.28-rc2+/mm/swap.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/swap.c
+++ mmotm-2.6.28-rc2+/mm/swap.c
@@ -168,7 +168,6 @@ void activate_page(struct page *page)
lru += LRU_ACTIVE;
add_page_to_lru_list(zone, page, lru);
__count_vm_event(PGACTIVATE);
- mem_cgroup_move_lists(page, lru);

zone->recent_rotated[!!file]++;
zone->recent_scanned[!!file]++;

2008-11-06 06:58:19

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/6] memcg updates (05/Nov)

KAMEZAWA Hiroyuki wrote:
> Weekly (RFC) update for memcg.
>
> This set includes
>
> 1. change force_empty to do move account rather than forget all

I would like this to be selectable, please. We don't want to break behaviour and
not everyone would like to pay the cost of movement.

> 2. swap cache handling
> 3. mem+swap controller kconfig
> 4. swap_cgroup for rememver swap account information
> 5. mem+swap controller core
> 6. synchronize memcg's LRU and global LRU.
>
> "1" is already sent, "6" is a newcomer.
> I'd like to push out "2" or "2-5" in the next week (if no bugs.)
>
> after 6, next candidates are
> - dirty_ratio handler
> - account move at task move.
>
> Some more explanation about purpose of "6". (see details in patch itself)
> Now, one of complicated logic in memcg is LRU handling. Because the place of
> lru_head depends on page_cgroup->mem_cgroup pointer, we have to take
> lock as following even under zone->lru_lock.
> ==
> pc = lookup_page_cgroup(page);
> if (!trylock_page_cgroup(pc))
> return -EBUSY;
>
> if (PageCgroupUsed(pc)) {
> struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
> spin_lock_irqsave(&mz->lru_lock, flags);
> ....some operation on LRU.
> spin_unlock_irqrestore(&mz->lru_lock, flags);
> }
> unlock_page_cgroup(pc);
> ==
> Sigh..
>
> After "6", page_cgroup's LRU management can be done independently to some extent.
> == as
> (zone->lru_lock is held here)
> pc = lookup_page_cgroup(page);
> list operation on pc.
> (unlock zone->lru_lock)
> ==
> Maybe good for maintainance and as a bonus, we can make use of isolate_lru_page() when
> doing some racy operation.
>
> isolate_lru_page(page);
> pc = lookup_page_cgroup(page);
> do some jobs.
> putback_lru_page(page);
>
> Maybe this will be a help to implement "account move at task move".

Sounds promising!

--
Balbir

2008-11-06 07:06:00

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/6] memcg updates (05/Nov)

On Thu, 06 Nov 2008 12:24:11 +0530
Balbir Singh <[email protected]> wrote:

> KAMEZAWA Hiroyuki wrote:
> > Weekly (RFC) update for memcg.
> >
> > This set includes
> >
> > 1. change force_empty to do move account rather than forget all
>
> I would like this to be selectable, please. We don't want to break behaviour and
> not everyone would like to pay the cost of movement.
>
The current behavior is just broken ;)

Hmm. I have an option in my stack to do either
- call try_to_free_pages() only, or
- call try_to_free_pages(), and only when memory is locked, move it to the parent.

OK? *forget all* is not an option.

Thanks,
-Kame

> > 2. swap cache handling
> > 3. mem+swap controller kconfig
> > 4. swap_cgroup for rememver swap account information
> > 5. mem+swap controller core
> > 6. synchronize memcg's LRU and global LRU.
> >
> > "1" is already sent, "6" is a newcomer.
> > I'd like to push out "2" or "2-5" in the next week (if no bugs.)
> >
> > after 6, next candidates are
> > - dirty_ratio handler
> > - account move at task move.
> >
> > Some more explanation about purpose of "6". (see details in patch itself)
> > Now, one of complicated logic in memcg is LRU handling. Because the place of
> > lru_head depends on page_cgroup->mem_cgroup pointer, we have to take
> > lock as following even under zone->lru_lock.
> > ==
> > pc = lookup_page_cgroup(page);
> > if (!trylock_page_cgroup(pc))
> > return -EBUSY;
> >
> > if (PageCgroupUsed(pc)) {
> > struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
> > spin_lock_irqsave(&mz->lru_lock, flags);
> > ....some operation on LRU.
> > spin_unlock_irqrestore(&mz->lru_lock, flags);
> > }
> > unlock_page_cgroup(pc);
> > ==
> > Sigh..
> >
> > After "6", page_cgroup's LRU management can be done independently to some extent.
> > == as
> > (zone->lru_lock is held here)
> > pc = lookup_page_cgroup(page);
> > list operation on pc.
> > (unlock zone->lru_lock)
> > ==
> > Maybe good for maintainance and as a bonus, we can make use of isolate_lru_page() when
> > doing some racy operation.
> >
> > isolate_lru_page(page);
> > pc = lookup_page_cgroup(page);
> > do some jobs.
> > putback_lru_page(page);
> >
> > Maybe this will be a help to implement "account move at task move".
>
> Sounds promising!
>
> --
> Balbir
>

2008-11-06 10:42:50

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [RFC][PATCH 7/6] memcg: add attribute (for changing behavior of rmdir)

On Thu, 06 Nov 2008 12:24:11 +0530
Balbir Singh <[email protected]> wrote:

> KAMEZAWA Hiroyuki wrote:
> > Weekly (RFC) update for memcg.
> >
> > This set includes
> >
> > 1. change force_empty to do move account rather than forget all
>
> I would like this to be selectable, please. We don't want to break behaviour and
> not everyone would like to pay the cost of movement.

How about a patch like this? I'd like to move this to [2/7], if possible.
It obviously needs painful rework; if I find it difficult, I'll schedule this as [7/7].

BTW, the cost of movement itself is not far from the cost of force_empty.

If you can't see why "forget" is bad, please consider it for one more day.

==
This patch adds an attribute file to the memory resource controller.

The memory.attribute file allows attributes to be set/cleared as follows.
 #echo attribute option > memory.attribute

This patch implements one attribute:

 #echo on_rmdir [keep | drop] > memory.attribute

When on_rmdir=keep, memory remaining in the memcg will be moved up to the
parent at rmdir. This is fast.
When on_rmdir=drop, memory remaining in the memcg will be freed.

Characteristics of keep:
- fast.
- Doesn't cause unnecessary freeing of memory (page cache).
(IOW, page cache for temporary files or other unnecessary pages will be kept.)
Characteristics of drop:
- slow.
- No influence on its parent; all page caches will be dropped.
(IOW, page cache for libc or other important pages will be dropped.)
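
Example usage (the /cgroups mount point and group names below are only an
example):

 #echo on_rmdir keep > /cgroups/group_A/memory.attribute
 #rmdir /cgroups/group_A    <- remaining pages are moved up to the parent

 #echo on_rmdir drop > /cgroups/group_B/memory.attribute
 #rmdir /cgroups/group_B    <- remaining pages are freed (moved only if busy)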

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Documentation/controllers/memory.txt | 22 +++++
mm/memcontrol.c | 134 ++++++++++++++++++++++++++++++++++-
2 files changed, 154 insertions(+), 2 deletions(-)

Index: mmotm-2.6.28-rc2-Oct30/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-rc2-Oct30.orig/mm/memcontrol.c
+++ mmotm-2.6.28-rc2-Oct30/mm/memcontrol.c
@@ -35,6 +35,7 @@
#include <linux/vmalloc.h>
#include <linux/mm_inline.h>
#include <linux/page_cgroup.h>
+#include <linux/ctype.h>
#include "internal.h"

#include <asm/uaccess.h>
@@ -146,6 +147,10 @@ struct mem_cgroup {
*/
int prev_priority; /* for recording reclaim priority */
/*
+ * attribute
+ */
+ char drop_on_rmdir;
+ /*
* used for counting reference from swap_cgroup.
*/
int obsolete;
@@ -182,6 +187,22 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
#define MEMFILE_TYPE(val) (((val) >> 16) & 0xffff)
#define MEMFILE_ATTR(val) ((val) & 0xffff)

+/*
+ * Attributes for memcg. The default value comes from the parent;
+ * the root cgroup defaults to on_rmdir=drop.
+ */
+enum {
+ MEMCG_ATTR_ON_RMDIR, /* drop_all if true, default is true. */
+ MEMCG_LAST_ATTR,
+};
+/* we may have to check status under racy situation. use global mutex. */
+DEFINE_MUTEX(memcg_attr_mutex);
+
+static char *memcg_attribute_names[MEMCG_LAST_ATTR] = {
+ "on_rmdir",
+};
+
+
static void mem_cgroup_forget_swapref(struct mem_cgroup *mem);

static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
@@ -1294,6 +1315,10 @@ static int mem_cgroup_force_empty(struct
css_get(&mem->css);

shrink = 0;
+ /* If this is true, free all orphan pages on LRU as much as possible */
+ if (mem->drop_on_rmdir)
+ goto try_to_free;
+
move_account:
while (mem->res.usage > 0) {
ret = -EBUSY;
@@ -1311,6 +1336,9 @@ move_account:
/* it seems parent cgroup doesn't have enough mem */
if (ret == -ENOMEM)
goto try_to_free;
+ ret = -EINTR;
+ if (signal_pending(current))
+ goto out;
cond_resched();
}
ret = 0;
@@ -1332,6 +1360,10 @@ try_to_free:
GFP_HIGHUSER_MOVABLE, false);
if (!progress)
nr_retries--;
+ ret = -EINTR;
+ if (signal_pending(current))
+ goto out;
+ cond_resched();

}
lru_add_drain();
@@ -1475,6 +1507,95 @@ static int mem_control_stat_show(struct
return 0;
}

+
+/*
+ * Assumes
+ * #echo attribute [option] > memory.attribute
+ */
+static int
+parse_attr_option(char *buffer, char **attr, char **option, char **end)
+{
+ char *c = buffer;
+
+ *attr = NULL;
+ *option = NULL;
+ /* skip white space */
+ for (; *c && isspace(*c);c++);
+ /* found NULL ? */
+ if (!*c)
+ return -EINVAL;
+ *attr = c;
+ /* skip attribute */
+ for (; *c && !isspace(*c);c++);
+ /* skip space */
+ for (; *c && isspace(*c);c++);
+ /* pass pointer to option */
+ *option = c;
+ for (; *c; c++);
+ *end = c;
+ return 0;
+
+}
+
+static int mem_cgroup_write_attr(struct cgroup *cont,
+ struct cftype *cft,
+ const char *buffer)
+{
+ int i, len;
+ char *attr, *option, *end;
+ int ret = -EINVAL;
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+
+ mutex_lock(&memcg_attr_mutex);
+ /* parse attribute option */
+ ret = parse_attr_option((char*)buffer, &attr, &option, &end);
+ if (ret)
+ goto out;
+ for (i = 0; i < MEMCG_LAST_ATTR; i++) {
+
+ len = strlen(memcg_attribute_names[i]);
+ if ((option - attr) < len)
+ continue;
+ if (!strncmp(memcg_attribute_names[i], attr, len))
+ break;
+ }
+ ret = -EINVAL;
+ if (i == MEMCG_LAST_ATTR)
+ goto out;
+ switch(i) {
+ case MEMCG_ATTR_ON_RMDIR:
+ if ((end - option) < 4)
+ break;
+ ret = 0;
+ if (strncmp(option, "keep", 4) == 0)
+ mem->drop_on_rmdir = 0;
+ else if (strncmp(option, "drop", 4) == 0)
+ mem->drop_on_rmdir = 1;
+ else
+ ret = -EINVAL;
+ break;
+ }
+out:
+ mutex_unlock(&memcg_attr_mutex);
+ return ret;
+}
+
+static int mem_cgroup_read_attr(struct cgroup *cont, struct cftype *cft,
+ struct seq_file *m)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+ char *s;
+
+ s = memcg_attribute_names[MEMCG_ATTR_ON_RMDIR];
+
+ if (mem->drop_on_rmdir)
+ seq_printf(m, "%s drop\n",s);
+ else
+ seq_printf(m, "%s keep\n",s);
+
+ return 0;
+}
+
static struct cftype mem_cgroup_files[] = {
{
.name = "usage_in_bytes",
@@ -1503,6 +1624,11 @@ static struct cftype mem_cgroup_files[]
.name = "stat",
.read_map = mem_control_stat_show,
},
+ {
+ .name = "attribute",
+ .write_string = mem_cgroup_write_attr,
+ .read_seq_string = mem_cgroup_read_attr,
+ },
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
{
.name = "memsw.usage_in_bytes",
@@ -1640,20 +1766,26 @@ static void __init enable_swap_cgroup(vo
static struct cgroup_subsys_state *
mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
{
- struct mem_cgroup *mem;
+ struct mem_cgroup *mem, *parent;
int node;

if (unlikely((cont->parent) == NULL)) {
mem = &init_mem_cgroup;
enable_swap_cgroup();
+ parent = NULL;
+ mem->drop_on_rmdir = 1; /* default is drop */
} else {
mem = mem_cgroup_alloc();
if (!mem)
return ERR_PTR(-ENOMEM);
+ parent = mem_cgroup_from_cont(cont->parent);
}

res_counter_init(&mem->res);
res_counter_init(&mem->memsw);
+ /* inherit */
+ if (parent)
+ mem->drop_on_rmdir = parent->drop_on_rmdir;

for_each_node_state(node, N_POSSIBLE)
if (alloc_mem_cgroup_per_zone_info(mem, node))
Index: mmotm-2.6.28-rc2-Oct30/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.28-rc2-Oct30.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.28-rc2-Oct30/Documentation/controllers/memory.txt
@@ -270,8 +270,28 @@ Charges recorded in swap information is
Recorded information is effectively discarded and a cgroup which uses swap
(swapcache) will be charged as a new owner of it.

+5. Attributes
+The memory.attribute file is provided to set per-memcg attributes.
+You can specify an attribute by
+ #echo attribute option > memory.attribute
+
+5.1 on_rmdir
+Sets the behavior of the memcg at rmdir (cgroup destruction); default is "drop".
+
+ 5.1.1 drop
+ #echo on_rmdir drop > memory.attribute
+ This is the default. All pages in this memcg will be freed.
+ If pages are locked or too busy, they will be moved up to the parent.
+ Useful when you want to drop (large) page caches used in this memcg.
+
+ 5.1.2 keep
+ #echo on_rmdir keep > memory.attribute
+ All pages in this memcg will be moved to the parent.
+ Useful when you don't want to drop page caches used in this memcg;
+ page caches for a library or DB accessed by this memcg can be kept
+ in memory.

-5. TODO
+6. TODO

1. Add support for accounting huge pages (as a separate controller)
2. Make per-cgroup scanner reclaim not-shared pages first

2008-11-06 11:49:00

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/6] memcg : mem+swap controller kconfig

On Wed, 5 Nov 2008 17:20:50 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> Experimental.
>
> Config and control variable for mem+swap controller.
>
> This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> (memory resource controller swap extension.)
>
> For accounting swap, it's obvious that we have to use additional memory
> to remember "who uses swap". This adds more overhead.
> So, it's better to offer "choice" to users. This patch adds 2 choices.
>
> This patch adds 2 parameters to enable swap extension or not.
> - CONFIG
> - boot option
>
> Changelog: v1 -> v2
> - fixed typo.
> - make default value of "do_swap_account" to be 0 and turned on 1
> later if configured.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
looks good to me.

Reviewed-by: Daisuke Nishimura <[email protected]>

>
> Documentation/kernel-parameters.txt | 3 +++
> include/linux/memcontrol.h | 3 +++
> init/Kconfig | 17 +++++++++++++++++
> mm/memcontrol.c | 32 ++++++++++++++++++++++++++++++++
> 4 files changed, 55 insertions(+)
>
> Index: mmotm-2.6.28-rc2+/init/Kconfig
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/init/Kconfig
> +++ mmotm-2.6.28-rc2+/init/Kconfig
> @@ -428,6 +428,23 @@ config CGROUP_MEM_RES_CTLR
> config MM_OWNER
> bool
>
> +config CGROUP_MEM_RES_CTLR_SWAP
> + bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)"
> + depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
> + help
> + Add swap management feature to memory resource controller. When you
> + enable this, you can limit mem+swap usage per cgroup. In other words,
> + when you disable this, memory resource controller has no cares to
> + usage of swap...a process can exhaust all of the swap. This extension
> + is useful when you want to avoid exhaustion swap but this itself
> + adds more overheads and consumes memory for remembering information.
> + Especially if you use 32bit system or small memory system, please
> + be careful about enabling this. When memory resource controller
> + is disabled by boot option, this will be automatically disabled and
> + there will be no overhead from this. Even when you set this config=y,
> + if boot option "noswapaccount" is set, swap will not be accounted.
> +
> +
> endmenu
>
> config SYSFS_DEPRECATED
> Index: mmotm-2.6.28-rc2+/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/mm/memcontrol.c
> +++ mmotm-2.6.28-rc2+/mm/memcontrol.c
> @@ -41,6 +41,15 @@
> struct cgroup_subsys mem_cgroup_subsys __read_mostly;
> #define MEM_CGROUP_RECLAIM_RETRIES 5
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +/* Turned on only when memory cgroup is enabled && really_do_swap_account = 0 */
> +int do_swap_account __read_mostly;
> +static int really_do_swap_account __initdata = 1; /* for remember boot option*/
> +#else
> +#define do_swap_account (0)
> +#endif
> +
> +
> /*
> * Statistics for memory cgroup.
> */
> @@ -1370,6 +1379,18 @@ static void mem_cgroup_free(struct mem_c
> }
>
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +static void __init enable_swap_cgroup(void)
> +{
> + if (!mem_cgroup_subsys.disabled && really_do_swap_account)
> + do_swap_account = 1;
> +}
> +#else
> +static void __init enable_swap_cgroup(void)
> +{
> +}
> +#endif
> +
> static struct cgroup_subsys_state *
> mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> {
> @@ -1378,6 +1399,7 @@ mem_cgroup_create(struct cgroup_subsys *
>
> if (unlikely((cont->parent) == NULL)) {
> mem = &init_mem_cgroup;
> + enable_swap_cgroup();
> } else {
> mem = mem_cgroup_alloc();
> if (!mem)
> @@ -1461,3 +1483,13 @@ struct cgroup_subsys mem_cgroup_subsys =
> .attach = mem_cgroup_move_task,
> .early_init = 0,
> };
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +
> +static int __init disable_swap_account(char *s)
> +{
> + really_do_swap_account = 0;
> + return 1;
> +}
> +__setup("noswapaccount", disable_swap_account);
> +#endif
> Index: mmotm-2.6.28-rc2+/Documentation/kernel-parameters.txt
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/Documentation/kernel-parameters.txt
> +++ mmotm-2.6.28-rc2+/Documentation/kernel-parameters.txt
> @@ -1543,6 +1543,9 @@ and is between 256 and 4096 characters.
>
> nosoftlockup [KNL] Disable the soft-lockup detector.
>
> + noswapaccount [KNL] Disable accounting of swap in memory resource
> + controller. (See Documentation/controllers/memory.txt)
> +
> nosync [HW,M68K] Disables sync negotiation for all devices.
>
> notsc [BUGS=X86-32] Disable Time Stamp Counter
> Index: mmotm-2.6.28-rc2+/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/include/linux/memcontrol.h
> +++ mmotm-2.6.28-rc2+/include/linux/memcontrol.h
> @@ -77,6 +77,9 @@ extern void mem_cgroup_record_reclaim_pr
> extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
> int priority, enum lru_list lru);
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +extern int do_swap_account;
> +#endif
>
> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> struct mem_cgroup;
>

2008-11-06 11:49:39

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/6] memcg : swap cgroup

On Wed, 5 Nov 2008 17:21:41 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> For accounting swap, we need a record per swap entry, at least.
>
> This patch adds following function.
> - swap_cgroup_swapon() .... called from swapon
> - swap_cgroup_swapoff() ... called at the end of swapoff.
>
> - swap_cgroup_record() .... record information of swap entry.
> - swap_cgroup_lookup() .... lookup information of swap entry.
>
> This patch just implements "how to record information". No actual
> method for limit the usage of swap. These routine uses flat table
> to record and lookup. A "wise" lookup system like radix-tree requires
> memory allocation at new records, but swap-out is usually
> called under memory shortage (or memcg hits limit.)
> So, I used static allocation. (maybe dynamic allocation is not very
> hard but it adds additional memory allocation in memory shortage path.)
>
>
> Note1: In this, we use pointer to record information and this means
> 8bytes per swap entry. I think we can reduce this when we
> create "id of cgroup" in the range of 0-65535 or 0-255.
>
> Note2: array of swap_cgroup is allocated from HIGHMEM. maybe good for x86-32.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> include/linux/page_cgroup.h | 35 +++++++
> mm/page_cgroup.c | 201 ++++++++++++++++++++++++++++++++++++++++++++
> mm/swapfile.c | 8 +
> 3 files changed, 244 insertions(+)
>
Is there any reason why they are defined not in memcontrol.[ch]
but in page_cgroup.[ch]?

> Index: mmotm-2.6.28-rc2+/mm/page_cgroup.c
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/mm/page_cgroup.c
> +++ mmotm-2.6.28-rc2+/mm/page_cgroup.c
> @@ -8,6 +8,8 @@
> #include <linux/memory.h>
> #include <linux/vmalloc.h>
> #include <linux/cgroup.h>
> +#include <linux/swapops.h>
> +#include <linux/highmem.h>
>
> static void __meminit
> __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
> @@ -254,3 +256,202 @@ void __init pgdat_page_cgroup_init(struc
> }
>
> #endif
> +
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +
> +DEFINE_MUTEX(swap_cgroup_mutex);
> +struct swap_cgroup_ctrl {
> + spinlock_t lock;
> + struct page **map;
> + unsigned long length;
> +};
> +
> +struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
> +
> +/*
> + * This 8bytes seems big..maybe we can reduce this when we can use "id" for
> + * cgroup rather than pointer.
> + */
> +struct swap_cgroup {
> + struct mem_cgroup *val;
> +};
> +#define SC_PER_PAGE (PAGE_SIZE/sizeof(struct swap_cgroup))
> +#define SC_POS_MASK (SC_PER_PAGE - 1)
> +
> +/*
> + * allocate buffer for swap_cgroup.
> + */
> +static int swap_cgroup_prepare(int type)
> +{
> + struct page *page;
> + struct swap_cgroup_ctrl *ctrl;
> + unsigned long idx, max;
> +
> + if (!do_swap_account)
> + return 0;
> + ctrl = &swap_cgroup_ctrl[type];
> +
> + for (idx = 0; idx < ctrl->length; idx++) {
> + page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
> + if (!page)
> + goto not_enough_page;
> + ctrl->map[idx] = page;
> + }
> + return 0;
> +not_enough_page:
> + max = idx;
> + for (idx = 0; idx < max; idx++)
> + __free_page(ctrl->map[idx]);
> +
> + return -ENOMEM;
> +}
> +
> +/**
> + * swap_cgroup_record - record mem_cgroup for this swp_entry.
> + * @ent: swap entry to be recorded into
> + * @mem: mem_cgroup to be recorded
> + *
> + * Returns old value at success, NULL at failure.
> + * (Of course, old value can be NULL.)
> + */
> +struct mem_cgroup *swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem)
> +{
> + unsigned long flags;
> + int type = swp_type(ent);
> + unsigned long offset = swp_offset(ent);
> + unsigned long idx = offset / SC_PER_PAGE;
> + unsigned long pos = offset & SC_POS_MASK;
> + struct swap_cgroup_ctrl *ctrl;
> + struct page *mappage;
> + struct swap_cgroup *sc;
> + struct mem_cgroup *old;
> +
> + if (!do_swap_account)
> + return NULL;
> +
> + ctrl = &swap_cgroup_ctrl[type];
> +
> + mappage = ctrl->map[idx];
> + spin_lock_irqsave(&ctrl->lock, flags);
> + sc = kmap_atomic(mappage, KM_USER0);
> + sc += pos;
> + old = sc->val;
> + sc->val = mem;
> + kunmap_atomic(mappage, KM_USER0);
> + spin_unlock_irqrestore(&ctrl->lock, flags);
> + return old;
> +}
> +
> +/**
> + * lookup_swap_cgroup - lookup mem_cgroup tied to swap entry
> + * @ent: swap entry to be looked up.
> + *
> + * Returns pointer to mem_cgroup at success. NULL at failure.
> + */
> +struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent)
> +{
> + int type = swp_type(ent);
> + unsigned long flags;
> + unsigned long offset = swp_offset(ent);
> + unsigned long idx = offset / SC_PER_PAGE;
> + unsigned long pos = offset & SC_POS_MASK;
> + struct swap_cgroup_ctrl *ctrl;
> + struct page *mappage;
> + struct swap_cgroup *sc;
> + struct mem_cgroup *ret;
> +
> + if (!do_swap_account)
> + return NULL;
> +
> + ctrl = &swap_cgroup_ctrl[type];
> +
> + mappage = ctrl->map[idx];
> +
> + spin_lock_irqsave(&ctrl->lock, flags);
> + sc = kmap_atomic(mappage, KM_USER0);
> + sc += pos;
> + ret = sc->val;
> + kunmap_atomic(mappage, KM_USER0);
> + spin_unlock_irqrestore(&ctrl->lock, flags);
> + return ret;
> +}
> +
> +int swap_cgroup_swapon(int type, unsigned long max_pages)
> +{
> + void *array;
> + unsigned long array_size;
> + unsigned long length;
> + struct swap_cgroup_ctrl *ctrl;
> +
> + if (!do_swap_account)
> + return 0;
> +
> + length = ((max_pages/SC_PER_PAGE) + 1);
> + array_size = length * sizeof(void *);
> +
> + array = vmalloc(array_size);
> + if (!array)
> + goto nomem;
> +
> + memset(array, 0, array_size);
> + ctrl = &swap_cgroup_ctrl[type];
> + mutex_lock(&swap_cgroup_mutex);
> + ctrl->length = length;
> + ctrl->map = array;
> + if (swap_cgroup_prepare(type)) {
> + /* memory shortage */
> + ctrl->map = NULL;
> + ctrl->length = 0;
> + vfree(array);
> + mutex_unlock(&swap_cgroup_mutex);
> + goto nomem;
> + }
> + mutex_unlock(&swap_cgroup_mutex);
> +
> + printk(KERN_INFO
> + "swap_cgroup: uses %ld bytes vmalloc and %ld bytes buffres\n",
> + array_size, length * PAGE_SIZE);
> + printk(KERN_INFO
> + "swap_cgroup can be disabled by noswapaccount boot option.\n");
> +
> + return 0;
> +nomem:
> + printk(KERN_INFO "couldn't allocate enough memory for swap_cgroup.\n");
> + printk(KERN_INFO
> + "swap_cgroup can be disabled by noswapaccount boot option\n");
> + return -ENOMEM;
> +}
> +
> +void swap_cgroup_swapoff(int type)
> +{
> + int i;
> + struct swap_cgroup_ctrl *ctrl;
> +
> + if (!do_swap_account)
> + return;
> +
> + mutex_lock(&swap_cgroup_mutex);
> + if (ctrl->map) {
> + ctrl = &swap_cgroup_ctrl[type];
This line should be before "if (ctrl->map)" (otherwise "ctrl" is used uninitialized!).
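
Something like this (just a sketch of the reordering being suggested, not
the exact fix):

	mutex_lock(&swap_cgroup_mutex);
	ctrl = &swap_cgroup_ctrl[type];
	if (ctrl->map) {
		...
	}
	mutex_unlock(&swap_cgroup_mutex);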


Thanks,
Daisuke Nishimura.

> + for (i = 0; i < ctrl->length; i++) {
> + struct page *page = ctrl->map[i];
> + if (page)
> + __free_page(page);
> + }
> + vfree(ctrl->map);
> + ctrl->map = NULL;
> + ctrl->length = 0;
> + }
> + mutex_unlock(&swap_cgroup_mutex);
> +}
> +
> +static int __init swap_cgroup_init(void)
> +{
> + int i;
> + for (i = 0; i < MAX_SWAPFILES; i++)
> + spin_lock_init(&swap_cgroup_ctrl[i].lock);
> + return 0;
> +}
> +late_initcall(swap_cgroup_init);
> +#endif
> Index: mmotm-2.6.28-rc2+/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/include/linux/page_cgroup.h
> +++ mmotm-2.6.28-rc2+/include/linux/page_cgroup.h
> @@ -105,4 +105,39 @@ static inline void page_cgroup_init(void
> }
>
> #endif
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +#include <linux/swap.h>
> +extern struct mem_cgroup *
> +swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem);
> +extern struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent);
> +extern int swap_cgroup_swapon(int type, unsigned long max_pages);
> +extern void swap_cgroup_swapoff(int type);
> +#else
> +#include <linux/swap.h>
> +
> +static inline
> +struct mem_cgroup *swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem)
> +{
> + return NULL;
> +}
> +
> +static inline
> +struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent)
> +{
> + return NULL;
> +}
> +
> +static inline int
> +swap_cgroup_swapon(int type, unsigned long max_pages)
> +{
> + return 0;
> +}
> +
> +static inline void swap_cgroup_swapoff(int type)
> +{
> + return;
> +}
> +
> +#endif
> #endif
> Index: mmotm-2.6.28-rc2+/mm/swapfile.c
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/mm/swapfile.c
> +++ mmotm-2.6.28-rc2+/mm/swapfile.c
> @@ -32,6 +32,7 @@
> #include <asm/pgtable.h>
> #include <asm/tlbflush.h>
> #include <linux/swapops.h>
> +#include <linux/page_cgroup.h>
>
> static DEFINE_SPINLOCK(swap_lock);
> static unsigned int nr_swapfiles;
> @@ -1345,6 +1346,9 @@ asmlinkage long sys_swapoff(const char _
> spin_unlock(&swap_lock);
> mutex_unlock(&swapon_mutex);
> vfree(swap_map);
> + /* Destroy swap account information */
> + swap_cgroup_swapoff(type);
> +
> inode = mapping->host;
> if (S_ISBLK(inode->i_mode)) {
> struct block_device *bdev = I_BDEV(inode);
> @@ -1669,6 +1673,10 @@ asmlinkage long sys_swapon(const char __
> nr_good_pages = swap_header->info.last_page -
> swap_header->info.nr_badpages -
> 1 /* header page */;
> +
> + if (!error)
> + error = swap_cgroup_swapon(type, maxpages);
> +
> if (error)
> goto bad_swap;
> }
>
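
For clarity, a minimal usage sketch of the swap_cgroup_record()/lookup_swap_cgroup()
pair added above; the callers and the clear-by-recording-NULL convention are
illustrative assumptions, not part of this patch (patch 5/6 wires up the real callers):
==
        /* at swap-out: remember which mem_cgroup owned the page */
        swap_cgroup_record(ent, mem);

        /* at swap-in: look up the owner and clear the record */
        struct mem_cgroup *owner = lookup_swap_cgroup(ent);
        if (owner)
                swap_cgroup_record(ent, NULL);
==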

2008-11-06 12:00:27

by Hugh Dickins

[permalink] [raw]
Subject: Re: [RFC][PATCH 7/6] memcg: add attribute (for changing behavior of rmdir)

On Thu, 6 Nov 2008, KAMEZAWA Hiroyuki wrote:
> On Thu, 06 Nov 2008 12:24:11 +0530
> Balbir Singh <[email protected]> wrote:
> > KAMEZAWA Hiroyuki wrote:
> > >
> > > 1. change force_empty to do move account rather than forget all
> >
> > I would like this to be selectable, please. We don't want to break behaviour and
> > not everyone would like to pay the cost of movement.
>
> How about a patch like this ? I'd like to move this as [2/7], if possible.
> It obviously needs painful rework. If I found it difficult, schedule this as [7/7].
>
> BTW, cost of movement itself is not far from cost for force_empty.
>
> If you can't find why "forget" is bad, please consider one more day.

My recollection from a year ago is that force_empty totally violated
the rules established elsewhere, making a nonsense of it all: once a
force_empty had been done, you couldn't really be sure of anything
(perhaps it deserved a Taint flag).

Without studying your proposals at all, I do believe you've a good
chance of creating a sensible and consistent force_empty by moving
account, and abandoning the old "forget all" approach completely.

Hugh

2008-11-06 12:44:35

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/6] memcg : swap cgroup

Daisuke Nishimura said:
> On Wed, 5 Nov 2008 17:21:41 +0900, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
>> Note1: In this, we use pointer to record information and this means
>> 8bytes per swap entry. I think we can reduce this when we
>> create "id of cgroup" in the range of 0-65535 or 0-255.
>>
>> Note2: array of swap_cgroup is allocated from HIGHMEM. maybe good for
>> x86-32.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>>
>> include/linux/page_cgroup.h | 35 +++++++
>> mm/page_cgroup.c | 201
>> ++++++++++++++++++++++++++++++++++++++++++++
>> mm/swapfile.c | 8 +
>> 3 files changed, 244 insertions(+)
>>
> Is there any reason why they are defined not in memcontrol.[ch]
> but in page_cgroup.[ch]?
>
no strong reason. just because this is not core logic for accounting.
do you want to see this in memcontrol.c?

>> +void swap_cgroup_swapoff(int type)
>> +{
>> + int i;
>> + struct swap_cgroup_ctrl *ctrl;
>> +
>> + if (!do_swap_account)
>> + return;
>> +
>> + mutex_lock(&swap_cgroup_mutex);
>> + if (ctrl->map) {
>> + ctrl = &swap_cgroup_ctrl[type];
> This line should be before "if (ctrl->map)"(otherwise "ctrl" will be
> undefined!).
>
Oh... maybe a mistake when refreshing the patch... blame me.

Thanks,
-Kame
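
For reference, a sketch of swap_cgroup_swapoff() with that one-line fix applied
(the assignment moved above the map check; everything else as in the posted patch):
==
void swap_cgroup_swapoff(int type)
{
        int i;
        struct swap_cgroup_ctrl *ctrl;

        if (!do_swap_account)
                return;

        mutex_lock(&swap_cgroup_mutex);
        ctrl = &swap_cgroup_ctrl[type]; /* assign before testing ctrl->map */
        if (ctrl->map) {
                for (i = 0; i < ctrl->length; i++) {
                        struct page *page = ctrl->map[i];
                        if (page)
                                __free_page(page);
                }
                vfree(ctrl->map);
                ctrl->map = NULL;
                ctrl->length = 0;
        }
        mutex_unlock(&swap_cgroup_mutex);
}
==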

2008-11-06 12:47:35

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 7/6] memcg: add attribute (for changing behavior of rmdir)

Hugh Dickins said:
> On Thu, 6 Nov 2008, KAMEZAWA Hiroyuki wrote:
>> On Thu, 06 Nov 2008 12:24:11 +0530
>> Balbir Singh <[email protected]> wrote:
>> > KAMEZAWA Hiroyuki wrote:
>> > >
>> > > 1. change force_empty to do move account rather than forget all
>> >
>> > I would like this to be selectable, please. We don't want to break
>> behaviour and
>> > not everyone would like to pay the cost of movement.
>>
>> How about a patch like this ? I'd like to move this as [2/7], if
>> possible.
>> It obviously needs painful rework. If I found it difficult, schedule
>> this as [7/7].
>>
>> BTW, cost of movement itself is not far from cost for force_empty.
>>
>> If you can't find why "forget" is bad, please consider one more day.
>
> My recollection from a year ago is that force_empty totally violated
> the rules established elsewhere, making a nonsense of it all: once a
> force_empty had been done, you couldn't really be sure of anything
> (perhaps it deserved a Taint flag).
>
yes. that was terrible. (but it was necessary because we accounted
pages that were not on the LRU, among other reasons.)

> Without studying your proposals at all, I do believe you've a good
> chance of creating a sensible and consistent force_empty by moving
> account, and abandoning the old "forget all" approach completely.
>

yes. I'm very encouraged. thanks!
After patch [1/6]
-> move all at force empty
After this
-> move all or free (not forget) all.

-Kame

2008-11-06 13:47:28

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH 7/6] memcg: add attribute (for changing behavior of rmdir)

KAMEZAWA Hiroyuki wrote:
> On Thu, 06 Nov 2008 12:24:11 +0530
> Balbir Singh <[email protected]> wrote:
>
>> KAMEZAWA Hiroyuki wrote:
>>> Weekly (RFC) update for memcg.
>>>
>>> This set includes
>>>
>>> 1. change force_empty to do move account rather than forget all
>> I would like this to be selectable, please. We don't want to break behaviour and
>> not everyone would like to pay the cost of movement.
>
> How about a patch like this ? I'd like to move this as [2/7], if possible.
> It obviously needs painful rework. If I found it difficult, schedule this as [7/7].
>
> BTW, cost of movement itself is not far from cost for force_empty.
>
> If you can't find why "forget" is bad, please consider one more day.

The attributes seem quite reasonable, I've taken a quick look, not done a full
review or test.

--
Balbir

2008-11-06 14:30:54

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 7/6] memcg: add attribute (for changing behavior of rmdir)

Balbir Singh said:
> KAMEZAWA Hiroyuki wrote:
>> BTW, cost of movement itself is not far from cost for force_empty.
>>
>> If you can't find why "forget" is bad, please consider one more day.
>
> The attributes seem quite reasonable, I've taken a quick look, not done a
> full
> review or test.
>
Thanks, I'll go ahead in this direction.

By the way, should we keep "one value per file" for attributes?
If so, I'll add a new file just for this.
Currently, my patch allows plural attributes to be set in a single file.

Thanks,
-Kame

2008-11-07 01:13:40

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 7/6] memcg: add attribute (for changing behavior of rmdir)

On Thu, 6 Nov 2008 23:30:28 +0900 (JST)
"KAMEZAWA Hiroyuki" <[email protected]> wrote:

> Balbir Singh said:
> > KAMEZAWA Hiroyuki wrote:
> >> BTW, cost of movement itself is not far from cost for force_empty.
> >>
> >> If you can't find why "forget" is bad, please consider one more day.
> >
> > The attributes seem quite reasonable, I've taken a quick look, not done a
> > full
> > review or test.
> >
> Thanks, I'll go ahead in this direction.
>
It seems Andrew picked the account_move patch into his queue. I'll post this as
an add-on to mmotm next week. A patch-for-test can be posted within hours if
things go well...

Thanks,
-Kame

2008-11-07 01:21:27

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/6] memcg : swap cgroup

On Thu, 6 Nov 2008 21:44:21 +0900 (JST), "KAMEZAWA Hiroyuki" <[email protected]> wrote:
> Daisuke Nishimura said:
> > On Wed, 5 Nov 2008 17:21:41 +0900, KAMEZAWA Hiroyuki
> > <[email protected]> wrote:
> >> Note1: In this, we use pointer to record information and this means
> >> 8bytes per swap entry. I think we can reduce this when we
> >> create "id of cgroup" in the range of 0-65535 or 0-255.
> >>
> >> Note2: array of swap_cgroup is allocated from HIGHMEM. maybe good for
> >> x86-32.
> >>
> >> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> >>
> >> include/linux/page_cgroup.h | 35 +++++++
> >> mm/page_cgroup.c | 201
> >> ++++++++++++++++++++++++++++++++++++++++++++
> >> mm/swapfile.c | 8 +
> >> 3 files changed, 244 insertions(+)
> >>
> > Is there any reason why they are defined not in memcontrol.[ch]
> > but in page_cgroup.[ch]?
> >
> no strong reason. just because this is not core logic for acccounting.
> do you want to see this in memcontrol.c ?
>
I just felt it strange because they are not "page_cgroup".
I don't have any strong request to move them to memcontrol.c.


Thanks,
Daisuke Nishimura.

2008-11-07 09:08:15

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/6] memcg: handle swap cache

On Wed, 5 Nov 2008 17:20:09 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> SwapCache support for memory resource controller (memcg)
>
> Before mem+swap controller, memcg itself should handle SwapCache in proper way.
>
> In current memcg, SwapCache is just leaked and the user can create tons of
> SwapCache. This is a leak of account and should be handled.
>
> SwapCache accounting is done as following.
>
> charge (anon)
> - charged when it's mapped.
> (because of readahead, charge at add_to_swap_cache() is not sane)
> uncharge (anon)
> - uncharged when it's dropped from swapcache and fully unmapped.
> means it's not uncharged at unmap.
> Note: delete from swap cache at swap-in is done after rmap information
> is established.
> charge (shmem)
> - charged at swap-in. this prevents charge at add_to_page_cache().
>
> uncharge (shmem)
> - uncharged when it's dropped from swapcache and not on shmem's
> radix-tree.
>
> at migration, check against 'old page' is modified to handle shmem.
>
> Comparing to the old version discussed (and caused troubles), we have
> advantages of
> - PCG_USED bit.
> - simple migrating handling.
>
> So, situation is much easier than several months ago, maybe.
>
> Changelog (v1) -> (v2)
> - use lock_page() when we handle unlocked SwapCache.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
I tested this version under swap in/out activity with page migration/rmdir,
and it worked w/o errors for more than 24 hours.

Reviewed-by: Daisuke Nishimura <[email protected]>
Tested-by: Daisuke Nishimura <[email protected]>


Thanks,
Daisuke Nishimura.

> ---
> Documentation/controllers/memory.txt | 5 ++
> include/linux/swap.h | 16 ++++++++
> mm/memcontrol.c | 67 +++++++++++++++++++++++++++++++----
> mm/shmem.c | 18 ++++++++-
> mm/swap_state.c | 1
> 5 files changed, 99 insertions(+), 8 deletions(-)
>
> Index: mmotm-2.6.28-rc2+/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/mm/memcontrol.c
> +++ mmotm-2.6.28-rc2+/mm/memcontrol.c
> @@ -21,6 +21,7 @@
> #include <linux/memcontrol.h>
> #include <linux/cgroup.h>
> #include <linux/mm.h>
> +#include <linux/pagemap.h>
> #include <linux/smp.h>
> #include <linux/page-flags.h>
> #include <linux/backing-dev.h>
> @@ -140,6 +141,7 @@ enum charge_type {
> MEM_CGROUP_CHARGE_TYPE_MAPPED,
> MEM_CGROUP_CHARGE_TYPE_SHMEM, /* used by page migration of shmem */
> MEM_CGROUP_CHARGE_TYPE_FORCE, /* used by force_empty */
> + MEM_CGROUP_CHARGE_TYPE_SWAPOUT, /* for accounting swapcache */
> NR_CHARGE_TYPE,
> };
>
> @@ -781,6 +783,33 @@ int mem_cgroup_cache_charge(struct page
> MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
> }
>
> +#ifdef CONFIG_SWAP
> +int mem_cgroup_cache_charge_swapin(struct page *page,
> + struct mm_struct *mm, gfp_t mask, bool locked)
> +{
> + int ret = 0;
> +
> + if (mem_cgroup_subsys.disabled)
> + return 0;
> + if (unlikely(!mm))
> + mm = &init_mm;
> + if (!locked)
> + lock_page(page);
> + /*
> + * If not locked, the page can be dropped from SwapCache until
> + * we reach here.
> + */
> + if (PageSwapCache(page)) {
> + ret = mem_cgroup_charge_common(page, mm, mask,
> + MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
> + }
> + if (!locked)
> + unlock_page(page);
> +
> + return ret;
> +}
> +#endif
> +
> void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> {
> struct page_cgroup *pc;
> @@ -818,6 +847,9 @@ __mem_cgroup_uncharge_common(struct page
> if (mem_cgroup_subsys.disabled)
> return;
>
> + if (PageSwapCache(page))
> + return;
> +
> /*
> * Check if our page_cgroup is valid
> */
> @@ -826,12 +858,26 @@ __mem_cgroup_uncharge_common(struct page
> return;
>
> lock_page_cgroup(pc);
> - if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED && page_mapped(page))
> - || !PageCgroupUsed(pc)) {
> - /* This happens at race in zap_pte_range() and do_swap_page()*/
> - unlock_page_cgroup(pc);
> - return;
> +
> + if (!PageCgroupUsed(pc))
> + goto unlock_out;
> +
> + switch(ctype) {
> + case MEM_CGROUP_CHARGE_TYPE_MAPPED:
> + if (page_mapped(page))
> + goto unlock_out;
> + break;
> + case MEM_CGROUP_CHARGE_TYPE_SWAPOUT:
> + if (!PageAnon(page)) { /* Shared memory */
> + if (page->mapping && !page_is_file_cache(page))
> + goto unlock_out;
> + } else if (page_mapped(page)) /* Anon */
> + goto unlock_out;
> + break;
> + default:
> + break;
> }
> +
> ClearPageCgroupUsed(pc);
> mem = pc->mem_cgroup;
>
> @@ -845,6 +891,10 @@ __mem_cgroup_uncharge_common(struct page
> css_put(&mem->css);
>
> return;
> +
> +unlock_out:
> + unlock_page_cgroup(pc);
> + return;
> }
>
> void mem_cgroup_uncharge_page(struct page *page)
> @@ -864,6 +914,11 @@ void mem_cgroup_uncharge_cache_page(stru
> __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
> }
>
> +void mem_cgroup_uncharge_swapcache(struct page *page)
> +{
> + __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
> +}
> +
> /*
> * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
> * page belongs to.
> @@ -921,7 +976,7 @@ void mem_cgroup_end_migration(struct mem
> ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
>
> /* unused page is not on radix-tree now. */
> - if (unused && ctype != MEM_CGROUP_CHARGE_TYPE_MAPPED)
> + if (unused)
> __mem_cgroup_uncharge_common(unused, ctype);
>
> pc = lookup_page_cgroup(target);
> Index: mmotm-2.6.28-rc2+/mm/swap_state.c
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/mm/swap_state.c
> +++ mmotm-2.6.28-rc2+/mm/swap_state.c
> @@ -119,6 +119,7 @@ void __delete_from_swap_cache(struct pag
> total_swapcache_pages--;
> __dec_zone_page_state(page, NR_FILE_PAGES);
> INC_CACHE_INFO(del_total);
> + mem_cgroup_uncharge_swapcache(page);
> }
>
> /**
> Index: mmotm-2.6.28-rc2+/include/linux/swap.h
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/include/linux/swap.h
> +++ mmotm-2.6.28-rc2+/include/linux/swap.h
> @@ -332,6 +332,22 @@ static inline void disable_swap_token(vo
> put_swap_token(swap_token_mm);
> }
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +extern int mem_cgroup_cache_charge_swapin(struct page *page,
> + struct mm_struct *mm, gfp_t mask, bool locked);
> +extern void mem_cgroup_uncharge_swapcache(struct page *page);
> +#else
> +static inline
> +int mem_cgroup_cache_charge_swapin(struct page *page,
> + struct mm_struct *mm, gfp_t mask, bool locked)
> +{
> + return 0;
> +}
> +static inline void mem_cgroup_uncharge_swapcache(struct page *page)
> +{
> +}
> +#endif
> +
> #else /* CONFIG_SWAP */
>
> #define total_swap_pages 0
> Index: mmotm-2.6.28-rc2+/mm/shmem.c
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/mm/shmem.c
> +++ mmotm-2.6.28-rc2+/mm/shmem.c
> @@ -920,8 +920,12 @@ found:
> error = 1;
> if (!inode)
> goto out;
> - /* Charge page using GFP_HIGHUSER_MOVABLE while we can wait */
> - error = mem_cgroup_cache_charge(page, current->mm, GFP_HIGHUSER_MOVABLE);
> + /*
> + * Charge page using GFP_HIGHUSER_MOVABLE while we can wait.
> + * charged back to the user(not to caller) when swap account is used.
> + */
> + error = mem_cgroup_cache_charge_swapin(page,
> + current->mm, GFP_HIGHUSER_MOVABLE, true);
> if (error)
> goto out;
> error = radix_tree_preload(GFP_KERNEL);
> @@ -1258,6 +1262,16 @@ repeat:
> goto repeat;
> }
> wait_on_page_locked(swappage);
> + /*
> + * We want to avoid charge at add_to_page_cache().
> + * charge against this swap cache here.
> + */
> + if (mem_cgroup_cache_charge_swapin(swappage,
> + current->mm, gfp, false)) {
> + page_cache_release(swappage);
> + error = -ENOMEM;
> + goto failed;
> + }
> page_cache_release(swappage);
> goto repeat;
> }
> Index: mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/Documentation/controllers/memory.txt
> +++ mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
> @@ -137,6 +137,11 @@ behind this approach is that a cgroup th
> page will eventually get charged for it (once it is uncharged from
> the cgroup that brought it in -- this will happen on memory pressure).
>
> +Exception: When you do swapoff and make swapped-out pages of shmem(tmpfs) to
> +be backed into memory in force, charges for pages are accounted against the
> +caller of swapoff rather than the users of shmem.
> +
> +
> 2.4 Reclaim
>
> Each cgroup maintains a per cgroup LRU that consists of an active
>

2008-11-07 09:08:55

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/6] memcg: mem+swap controller

On Wed, 5 Nov 2008 17:23:16 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> Mem+Swap controller core.
>
> This patch implements per cgroup limit for usage of memory+swap.
> However there are SwapCache, double counting of swap-cache and
> swap-entry is avoided.
>
> Mem+Swap controller works as following.
> - memory usage is limited by memory.limit_in_bytes.
> - memory + swap usage is limited by memory.memsw_limit_in_bytes.
>
>
> This has following benefits.
> - A user can limit total resource usage of mem+swap.
>
> Without this, because memory resource controller doesn't take care of
> usage of swap, a process can exhaust all the swap (by memory leak.)
> We can avoid this case.
>
> And Swap is shared resource but it cannot be reclaimed (goes back to memory)
> until it's used. This characteristic can be trouble when the memory
> is divided into some parts by cpuset or memcg.
> Assume group A and group B.
> After some application executes, the system can be..
>
> Group A -- very large free memory space but occupy 99% of swap.
> Group B -- under memory shortage but cannot use swap...it's nearly full.
>
> Ability to set appropriate swap limit for each group is required.
>
> Maybe someone wonder "why not swap but mem+swap ?"
>
> - The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
> to move account from memory to swap...there is no change in usage of
> mem+swap.
>
> In other words, when we want to limit the usage of swap without affecting
> global LRU, mem+swap limit is better than just limiting swap.
>
>
> Accounting target information is stored in swap_cgroup which is
> per swap entry record.
>
> Charge is done as following.
> map
> - charge page and memsw.
>
> unmap
> - uncharge page/memsw if not SwapCache.
>
> swap-out (__delete_from_swap_cache)
> - uncharge page
> - record mem_cgroup information to swap_cgroup.
>
> swap-in (do_swap_page)
> - charged as page and memsw.
> record in swap_cgroup is cleared.
> memsw accounting is decremented.
>
> swap-free (swap_free())
> - if swap entry is freed, memsw is uncharged by PAGE_SIZE.
>
>
> After this, usual memory resource controller handles SwapCache.
> (It was lacked(ignored) feature in current memcg but must be handled.)
>
SwapCache has been handled in [2/6] already :)

(snip)
> @@ -514,12 +534,25 @@ static int __mem_cgroup_try_charge(struc
> css_get(&mem->css);
> }
>
> + while (1) {
> + int ret;
> + bool noswap = false;
>
> - while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
> + ret = res_counter_charge(&mem->res, PAGE_SIZE);
> + if (likely(!ret)) {
> + if (!do_swap_account)
> + break;
> + ret = res_counter_charge(&mem->memsw, PAGE_SIZE);
> + if (likely(!ret))
> + break;
> + /* mem+swap counter fails */
> + res_counter_uncharge(&mem->res, PAGE_SIZE);
> + noswap = true;
> + }
> if (!(gfp_mask & __GFP_WAIT))
> goto nomem;
>
> - if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
> + if (try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap))
> continue;
>
> /*
I have two comments about try_charge.

1. It would be better, if possible, to avoid charging memsw at swapin (and uncharging
it again in mem_cgroup_cache_charge_swapin/mem_cgroup_commit_charge_swapin).
How about adding a new argument "charge_memsw"? (it has many args already now...)
A sketch of that variant follows after item 2.
2. Should we use swap when exceeding mem.limit while mem.limit == memsw.limit?
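
A hypothetical sketch of the charge loop with such a "charge_memsw" argument
(the new argument and the skip condition are only the suggestion above, not
code from the posted patch):
==
        /* inside __mem_cgroup_try_charge(), assuming a new bool parameter
         * "charge_memsw" has been added to its argument list */
        while (1) {
                int ret;
                bool noswap = false;

                ret = res_counter_charge(&mem->res, PAGE_SIZE);
                if (likely(!ret)) {
                        if (!do_swap_account || !charge_memsw)
                                break;  /* e.g. swapin: memsw already charged */
                        ret = res_counter_charge(&mem->memsw, PAGE_SIZE);
                        if (likely(!ret))
                                break;
                        /* mem+swap counter fails */
                        res_counter_uncharge(&mem->res, PAGE_SIZE);
                        noswap = true;
                }
                if (!(gfp_mask & __GFP_WAIT))
                        goto nomem;

                if (try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap))
                        continue;
                /* ...rest of the loop unchanged... */
        }
==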

(snip)
> void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
> @@ -838,6 +947,7 @@ void mem_cgroup_cancel_charge_swapin(str
> if (!mem)
> return;
> res_counter_uncharge(&mem->res, PAGE_SIZE);
> + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> css_put(&mem->css);
> }
>
"if (do_swap_account)" is needed before uncharging memsw.

(snip)
> static struct cftype mem_cgroup_files[] = {
> {
> .name = "usage_in_bytes",
> - .private = RES_USAGE,
> + .private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
> .read_u64 = mem_cgroup_read,
> },
> {
> .name = "max_usage_in_bytes",
> - .private = RES_MAX_USAGE,
> + .private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
> .trigger = mem_cgroup_reset,
> .read_u64 = mem_cgroup_read,
> },
> {
> .name = "limit_in_bytes",
> - .private = RES_LIMIT,
> + .private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
> .write_string = mem_cgroup_write,
> .read_u64 = mem_cgroup_read,
> },
> {
> .name = "failcnt",
> - .private = RES_FAILCNT,
> + .private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
> .trigger = mem_cgroup_reset,
> .read_u64 = mem_cgroup_read,
> },
> @@ -1317,6 +1541,31 @@ static struct cftype mem_cgroup_files[]
> .name = "stat",
> .read_map = mem_control_stat_show,
> },
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> + {
> + .name = "memsw.usage_in_bytes",
> + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
> + .read_u64 = mem_cgroup_read,
> + },
> + {
> + .name = "memsw.max_usage_in_bytes",
> + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_MAX_USAGE),
> + .trigger = mem_cgroup_reset,
> + .read_u64 = mem_cgroup_read,
> + },
> + {
> + .name = "memsw.limit_in_bytes",
> + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT),
> + .write_string = mem_cgroup_write,
> + .read_u64 = mem_cgroup_read,
> + },
> + {
> + .name = "memsw.failcnt",
> + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_FAILCNT),
> + .trigger = mem_cgroup_reset,
> + .read_u64 = mem_cgroup_read,
> + },
> +#endif
> };
>
IMHO, it would be better to define those "memsw.*" files as memsw_cgroup_files[],
and change mem_cgroup_populate() like:

static int mem_cgroup_populate(struct cgroup_subsys *ss,
				struct cgroup *cont)
{
	int ret;

	ret = cgroup_add_files(cont, ss, mem_cgroup_files,
				ARRAY_SIZE(mem_cgroup_files));
	if (!ret && do_swap_account)
		ret = cgroup_add_files(cont, ss, memsw_cgroup_files,
					ARRAY_SIZE(memsw_cgroup_files));

	return ret;
}

so that those files appear only when swap accounting is enabled.


Thanks,
Daisuke Nishimura.

2008-11-07 09:13:59

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/6] memcg: handle swap cache

On Fri, 7 Nov 2008 17:53:03 +0900
Daisuke Nishimura <[email protected]> wrote:

> On Wed, 5 Nov 2008 17:20:09 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > SwapCache support for memory resource controller (memcg)
> >
> > Before mem+swap controller, memcg itself should handle SwapCache in proper way.
> >
> > In current memcg, SwapCache is just leaked and the user can create tons of
> > SwapCache. This is a leak of account and should be handled.
> >
> > SwapCache accounting is done as following.
> >
> > charge (anon)
> > - charged when it's mapped.
> > (because of readahead, charge at add_to_swap_cache() is not sane)
> > uncharge (anon)
> > - uncharged when it's dropped from swapcache and fully unmapped.
> > means it's not uncharged at unmap.
> > Note: delete from swap cache at swap-in is done after rmap information
> > is established.
> > charge (shmem)
> > - charged at swap-in. this prevents charge at add_to_page_cache().
> >
> > uncharge (shmem)
> > - uncharged when it's dropped from swapcache and not on shmem's
> > radix-tree.
> >
> > at migration, check against 'old page' is modified to handle shmem.
> >
> > Comparing to the old version discussed (and caused troubles), we have
> > advantages of
> > - PCG_USED bit.
> > - simple migrating handling.
> >
> > So, situation is much easier than several months ago, maybe.
> >
> > Changelog (v1) -> (v2)
> > - use lock_page() when we handle unlocked SwapCache.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> >
> I tested this version under swap in/out activity with page migration/rmdir,
> and it worked w/o errors for more than 24 hours.
>
> Reviewed-by: Daisuke Nishimura <[email protected]>
> Tested-by: Daisuke Nishimura <[email protected]>
>
Thank you!

-Kame

2008-11-07 09:20:58

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/6] memcg: mem+swap controller

On Fri, 7 Nov 2008 18:02:48 +0900
Daisuke Nishimura <[email protected]> wrote:

> On Wed, 5 Nov 2008 17:23:16 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > Mem+Swap controller core.
> >
> > This patch implements per cgroup limit for usage of memory+swap.
> > However there are SwapCache, double counting of swap-cache and
> > swap-entry is avoided.
> >
> > Mem+Swap controller works as following.
> > - memory usage is limited by memory.limit_in_bytes.
> > - memory + swap usage is limited by memory.memsw_limit_in_bytes.
> >
> >
> > This has following benefits.
> > - A user can limit total resource usage of mem+swap.
> >
> > Without this, because memory resource controller doesn't take care of
> > usage of swap, a process can exhaust all the swap (by memory leak.)
> > We can avoid this case.
> >
> > And Swap is shared resource but it cannot be reclaimed (goes back to memory)
> > until it's used. This characteristic can be trouble when the memory
> > is divided into some parts by cpuset or memcg.
> > Assume group A and group B.
> > After some application executes, the system can be..
> >
> > Group A -- very large free memory space but occupy 99% of swap.
> > Group B -- under memory shortage but cannot use swap...it's nearly full.
> >
> > Ability to set appropriate swap limit for each group is required.
> >
> > Maybe someone wonder "why not swap but mem+swap ?"
> >
> > - The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
> > to move account from memory to swap...there is no change in usage of
> > mem+swap.
> >
> > In other words, when we want to limit the usage of swap without affecting
> > global LRU, mem+swap limit is better than just limiting swap.
> >
> >
> > Accounting target information is stored in swap_cgroup which is
> > per swap entry record.
> >
> > Charge is done as following.
> > map
> > - charge page and memsw.
> >
> > unmap
> > - uncharge page/memsw if not SwapCache.
> >
> > swap-out (__delete_from_swap_cache)
> > - uncharge page
> > - record mem_cgroup information to swap_cgroup.
> >
> > swap-in (do_swap_page)
> > - charged as page and memsw.
> > record in swap_cgroup is cleared.
> > memsw accounting is decremented.
> >
> > swap-free (swap_free())
> > - if swap entry is freed, memsw is uncharged by PAGE_SIZE.
> >
> >
> > After this, usual memory resource controller handles SwapCache.
> > (It was lacked(ignored) feature in current memcg but must be handled.)
> >
> SwapCache has been handled in [2/6] already :)
>
yes. I'll rewrite this.


> (snip)
> > @@ -514,12 +534,25 @@ static int __mem_cgroup_try_charge(struc
> > css_get(&mem->css);
> > }
> >
> > + while (1) {
> > + int ret;
> > + bool noswap = false;
> >
> > - while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
> > + ret = res_counter_charge(&mem->res, PAGE_SIZE);
> > + if (likely(!ret)) {
> > + if (!do_swap_account)
> > + break;
> > + ret = res_counter_charge(&mem->memsw, PAGE_SIZE);
> > + if (likely(!ret))
> > + break;
> > + /* mem+swap counter fails */
> > + res_counter_uncharge(&mem->res, PAGE_SIZE);
> > + noswap = true;
> > + }
> > if (!(gfp_mask & __GFP_WAIT))
> > goto nomem;
> >
> > - if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
> > + if (try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap))
> > continue;
> >
> > /*
> I have two comment about try_charge.
>
> 1. It would be better if possible to avoid charging memsw at swapin (and uncharging
> it again at mem_cgroup_cache_charge_swapin/mem_cgroup_commit_charge_swapin).
> How about adding a new argument "charge_memsw" ? (it has many args already now...)

Hmm, maybe possible and good. I'll consider this again.

> 2. Should we use swap when exceeding mem.limit but mem.limit == memsw.limit ?
>
I'd like to put that special case on the "TODO" list. Hmm...
maybe setting noswap=true in that case is enough, but we have to be careful.


> (snip)
> > void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
> > @@ -838,6 +947,7 @@ void mem_cgroup_cancel_charge_swapin(str
> > if (!mem)
> > return;
> > res_counter_uncharge(&mem->res, PAGE_SIZE);
> > + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> > css_put(&mem->css);
> > }
> >
> "if (do_swap_account)" is needed before uncharging memsw.
>
good catch !

> (snip)
> > static struct cftype mem_cgroup_files[] = {
> > {
> > .name = "usage_in_bytes",
> > - .private = RES_USAGE,
> > + .private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
> > .read_u64 = mem_cgroup_read,
> > },
> > {
> > .name = "max_usage_in_bytes",
> > - .private = RES_MAX_USAGE,
> > + .private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
> > .trigger = mem_cgroup_reset,
> > .read_u64 = mem_cgroup_read,
> > },
> > {
> > .name = "limit_in_bytes",
> > - .private = RES_LIMIT,
> > + .private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
> > .write_string = mem_cgroup_write,
> > .read_u64 = mem_cgroup_read,
> > },
> > {
> > .name = "failcnt",
> > - .private = RES_FAILCNT,
> > + .private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
> > .trigger = mem_cgroup_reset,
> > .read_u64 = mem_cgroup_read,
> > },
> > @@ -1317,6 +1541,31 @@ static struct cftype mem_cgroup_files[]
> > .name = "stat",
> > .read_map = mem_control_stat_show,
> > },
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > + {
> > + .name = "memsw.usage_in_bytes",
> > + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
> > + .read_u64 = mem_cgroup_read,
> > + },
> > + {
> > + .name = "memsw.max_usage_in_bytes",
> > + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_MAX_USAGE),
> > + .trigger = mem_cgroup_reset,
> > + .read_u64 = mem_cgroup_read,
> > + },
> > + {
> > + .name = "memsw.limit_in_bytes",
> > + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT),
> > + .write_string = mem_cgroup_write,
> > + .read_u64 = mem_cgroup_read,
> > + },
> > + {
> > + .name = "memsw.failcnt",
> > + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_FAILCNT),
> > + .trigger = mem_cgroup_reset,
> > + .read_u64 = mem_cgroup_read,
> > + },
> > +#endif
> > };
> >
> IMHO, it would be better to define those "memsw.*" files as memsw_cgroup_files[],
> and change mem_cgroup_populate() like:
>
> static int mem_cgroup_populate(struct cgroup_subsys *ss,
> struct cgroup *cont)
> {
> int ret;
>
> ret = cgroup_add_files(cont, ss, mem_cgroup_files,
> ARRAY_SIZE(mem_cgroup_files));
> if (!ret && do_swap_account)
> ret = cgroup_add_files(cont, ss, memsw_cgroup_files,
> ARRAY_SIZE(memsw_cgroup_files));
>
> return ret;
> }
>
> so that those files appear only when swap accounting is enabled.
>

Nice idea. I'll try that.

Thanks,
-Kame

2008-11-07 13:31:47

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/6] memcg: mem+swap controller

> @@ -1286,30 +1507,33 @@ static int mem_control_stat_show(struct
> cb->fill(cb, "unevictable", unevictable * PAGE_SIZE);
>
> }
> + /* showing refs from disk-swap */
> + cb->fill(cb, "swap_on_disk", atomic_read(&mem_cont->swapref)
> + * PAGE_SIZE);
> return 0;
> }
>
"if (do_swap_account)" is needed here too.


Thanks,
Daisuke Nishimura.

2008-11-07 13:34:53

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/6] memcg: mem+swap controller

> > > static struct cftype mem_cgroup_files[] = {
> > > {
> > > .name = "usage_in_bytes",
> > > - .private = RES_USAGE,
> > > + .private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
> > > .read_u64 = mem_cgroup_read,
> > > },
> > > {
> > > .name = "max_usage_in_bytes",
> > > - .private = RES_MAX_USAGE,
> > > + .private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
> > > .trigger = mem_cgroup_reset,
> > > .read_u64 = mem_cgroup_read,
> > > },
> > > {
> > > .name = "limit_in_bytes",
> > > - .private = RES_LIMIT,
> > > + .private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
> > > .write_string = mem_cgroup_write,
> > > .read_u64 = mem_cgroup_read,
> > > },
> > > {
> > > .name = "failcnt",
> > > - .private = RES_FAILCNT,
> > > + .private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
> > > .trigger = mem_cgroup_reset,
> > > .read_u64 = mem_cgroup_read,
> > > },
> > > @@ -1317,6 +1541,31 @@ static struct cftype mem_cgroup_files[]
> > > .name = "stat",
> > > .read_map = mem_control_stat_show,
> > > },
> > > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > > + {
> > > + .name = "memsw.usage_in_bytes",
> > > + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
> > > + .read_u64 = mem_cgroup_read,
> > > + },
> > > + {
> > > + .name = "memsw.max_usage_in_bytes",
> > > + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_MAX_USAGE),
> > > + .trigger = mem_cgroup_reset,
> > > + .read_u64 = mem_cgroup_read,
> > > + },
> > > + {
> > > + .name = "memsw.limit_in_bytes",
> > > + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT),
> > > + .write_string = mem_cgroup_write,
> > > + .read_u64 = mem_cgroup_read,
> > > + },
> > > + {
> > > + .name = "memsw.failcnt",
> > > + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_FAILCNT),
> > > + .trigger = mem_cgroup_reset,
> > > + .read_u64 = mem_cgroup_read,
> > > + },
> > > +#endif
> > > };
> > >
> > IMHO, it would be better to define those "memsw.*" files as memsw_cgroup_files[],
> > and change mem_cgroup_populate() like:
> >
> > static int mem_cgroup_populate(struct cgroup_subsys *ss,
> > struct cgroup *cont)
> > {
> > int ret;
> >
> > ret = cgroup_add_files(cont, ss, mem_cgroup_files,
> > ARRAY_SIZE(mem_cgroup_files));
> > if (!ret && do_swap_account)
> > ret = cgroup_add_files(cont, ss, memsw_cgroup_files,
> > ARRAY_SIZE(memsw_cgroup_files));
> >
> > return ret;
> > }
> >
> > so that those files appear only when swap accounting is enabled.
> >
>
> Nice idea. I'll try that.
>
I made a patch for this.

Please merge this if it looks good to you.

I've confirmed that the memsw.* files aren't created with noswapaccount,
and that this compiles with !CONFIG_CGROUP_MEM_RES_CTLR_SWAP.


Signed-off-by: Daisuke Nishimura <[email protected]>
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 27f1772..03dfc46 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1542,6 +1542,9 @@ static struct cftype mem_cgroup_files[] = {
.name = "stat",
.read_map = mem_control_stat_show,
},
+};
+
+static struct cftype swap_cgroup_files[] = {
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
{
.name = "memsw.usage_in_bytes",
@@ -1724,8 +1727,14 @@ static void mem_cgroup_destroy(struct cgroup_subsys *ss,
static int mem_cgroup_populate(struct cgroup_subsys *ss,
struct cgroup *cont)
{
- return cgroup_add_files(cont, ss, mem_cgroup_files,
+ int ret;
+ ret = cgroup_add_files(cont, ss, mem_cgroup_files,
ARRAY_SIZE(mem_cgroup_files));
+ if (!ret && do_swap_account)
+ ret = cgroup_add_files(cont, ss, swap_cgroup_files,
+ ARRAY_SIZE(swap_cgroup_files));
+
+ return ret;
}

static void mem_cgroup_move_task(struct cgroup_subsys *ss,

2008-11-10 04:44:08

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/6] memcg: mem+swap controller

> @@ -1062,13 +1208,55 @@ int mem_cgroup_resize_limit(struct mem_c
> break;
> }
> progress = try_to_free_mem_cgroup_pages(memcg,
> - GFP_HIGHUSER_MOVABLE);
> + GFP_HIGHUSER_MOVABLE, false);
> if (!progress)
> retry_count--;
> }
> return ret;
> }
>
mem_cgroup_resize_limit() should verify that mem.limit <= memsw.limit
as mem_cgroup_resize_memsw_limit() does.
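
A minimal sketch of the missing check, assuming the new limit arrives as "val"
and that memsw can be read with res_counter_read_u64() (both assumptions;
mirror whatever mem_cgroup_resize_memsw_limit() actually does):
==
        if (do_swap_account &&
            val > res_counter_read_u64(&memcg->memsw, RES_LIMIT))
                return -EINVAL; /* mem.limit must not exceed memsw.limit */
==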


Thanks,
Daisuke Nishimura.

2008-11-10 07:04:17

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/6] memcg: mem+swap controller

On Mon, 10 Nov 2008 13:30:54 +0900
Daisuke Nishimura <[email protected]> wrote:

> > @@ -1062,13 +1208,55 @@ int mem_cgroup_resize_limit(struct mem_c
> > break;
> > }
> > progress = try_to_free_mem_cgroup_pages(memcg,
> > - GFP_HIGHUSER_MOVABLE);
> > + GFP_HIGHUSER_MOVABLE, false);
> > if (!progress)
> > retry_count--;
> > }
> > return ret;
> > }
> >
> mem_cgroup_resize_limit() should verify that mem.limit <= memsw.limit
> as mem_cgroup_resize_memsw_limit() does.
>
>
nice catch. will be fixed in the next version.

Thanks a lot!

-Kame