This patch is memcg update set. (the last post was 23/Oct)
This patch is agaisnt 2.6.28-rc2-mm1. (mmotm at 30/Oct)
Major changes from previous one.
- dropped lazy_lru related patches. It's complicated and I'd like to
design it again. (Nishimura-san reported a bug to me.)
Fortunately, no big Hunk.
- added a patch for swap_cache. (cut out from mem+swap controller.)
IMHO, this is necessary to do. But previous trial (in summer) caused tons of
troubles. I think currrent memcg's code is better than old ones and I hope
this trial will be much easier.
and fixed typos and logics.
brief patch description is here. I'd like to send [1/5] in the next week.
Review and test is welcome. [2-5/5] of patches should be tested more.
And I'd like to clean up them before sending...
[1/5] change force_empty to do move account.
- current force_empty just drops page_cgroup reference. But this means
leak of accounting. After this patch, remaining pages at rmdir() will be
moved to parent group.
[2/5] account swap cache
- when processes exceed limit of memory, swap-cache can be created.
Now, swap-cache is not accounted and can be spread over hundreds of mega
bytes. I don't think this is sane. This patch makes swap_cache to be accounted.
[3/5] mem+swap controller config.
- add Kconfig for mem+swap controller.
[4/5] swap_cgroup
- create a buffer to remember account information of swap.
Because my x86-32 test environment is poor, x86-32 test report is welcome.
This patch uses HIGHMEM to rememver information.
[5/5]
- mem+swap controller.
After this patch, a page swapped-in will be accounted against an original group
which allocated memory swapped-out.
This is not swap controller but mem+swap controller. This limits sum of
pages + swaps_on_disk. By this,
- global LRU's swap-out will not hit the limit. (page is converted to swap.)
- a group of processes leaking memory will be effectively blocked.
TODO list:
- more optimization
- add "shrink_usage" file
- remove mem_cgroup_per_zone->lock and use zone->lru_lock ?
- consider behavior of oom-kill again.
- need some notifier to tell memory shortage as mem_notity ?
- dirty page accountng and throttle.
- we have to decied how to handle HUGE_PAGE
- kernel support for hierarchy ?
- help bio_cgroup people.
Oh, too long ;)
Thanks,
-Kame
This patch provides a function to move account information of a page between
mem_cgroups and rewrite force_empty to make use of this.
This moving of page_cgroup is done under
- lru_lock of source/destination mem_cgroup is held.
- lock_page_cgroup() is held.
Then, a routine which touches pc->mem_cgroup without lock_page_cgroup() should
confirm pc->mem_cgroup is still valid or not. Typical code can be following.
(while page is not under lock_page())
mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc)
spin_lock_irqsave(&mz->lru_lock);
if (pc->mem_cgroup == mem)
...../* some list handling */
spin_unlock_irqrestore(&mz->lru_lock);
Of course, better way is
lock_page_cgroup(pc);
....
unlock_page_cgroup(pc);
But you should confirm the nest of lock and avoid deadlock.
If you treats page_cgroup from mem_cgroup's LRU under mz->lru_lock,
you don't have to worry about what pc->mem_cgroup points to.
moved pages are added to head of lru, not to tail.
Expected users of this routine is:
- force_empty (rmdir)
- moving tasks between cgroup (for moving account information.)
- hierarchy (maybe useful.)
force_empty(rmdir) uses this move_account and move pages to its parent.
This "move" will not cause OOM (I added "oom" parameter to try_charge().)
If the parent is busy (not enough memory), force_empty calls try_to_free_page()
and reduce usage.
Purpose of this behavior is
- Fix "forget all" behavior of force_empty and avoid leak of accounting.
- By "moving first, free if necessary", keep pages on memory as much as
possible.
Adding a switch to change behavior of force_empty to
- free first, move if necessary
- free all, if there is mlocked/busy pages, return -EBUSY.
is under consideration.
This patch removes memory.force_empty file, a brutal debug-only interface.
Changelog: (v8) -> (v9)
- fixed typos.
Changelog: (v6) -> (v8)
- removed memory.force_empty file which was provided only for debug.
Changelog: (v5) -> (v6)
- removed unnecessary check.
- do all under lock_page_cgroup().
- removed res_counter_charge() from move function itself.
(and modifies try_charge() function.)
- add argument to add_list() to specify to add page_cgroup head or tail.
- merged with force_empty patch. (to answer who is user? question)
Changelog: (v4) -> (v5)
- check for lock_page() is removed.
- rewrote description.
Changelog: (v2) -> (v4)
- added lock_page_cgroup().
- moved out from new-force-empty patch.
- added how-to-use text.
- fixed race in __mem_cgroup_uncharge_common().
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Documentation/controllers/memory.txt | 12 -
mm/memcontrol.c | 277 ++++++++++++++++++++++++++---------
2 files changed, 214 insertions(+), 75 deletions(-)
Index: mmotm-2.6.28-rc2+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/memcontrol.c
+++ mmotm-2.6.28-rc2+/mm/memcontrol.c
@@ -257,7 +257,7 @@ static void __mem_cgroup_remove_list(str
}
static void __mem_cgroup_add_list(struct mem_cgroup_per_zone *mz,
- struct page_cgroup *pc)
+ struct page_cgroup *pc, bool hot)
{
int lru = LRU_BASE;
@@ -271,7 +271,10 @@ static void __mem_cgroup_add_list(struct
}
MEM_CGROUP_ZSTAT(mz, lru) += 1;
- list_add(&pc->lru, &mz->lists[lru]);
+ if (hot)
+ list_add(&pc->lru, &mz->lists[lru]);
+ else
+ list_add_tail(&pc->lru, &mz->lists[lru]);
mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
}
@@ -467,21 +470,12 @@ unsigned long mem_cgroup_isolate_pages(u
return nr_taken;
}
-
-/**
- * mem_cgroup_try_charge - get charge of PAGE_SIZE.
- * @mm: an mm_struct which is charged against. (when *memcg is NULL)
- * @gfp_mask: gfp_mask for reclaim.
- * @memcg: a pointer to memory cgroup which is charged against.
- *
- * charge against memory cgroup pointed by *memcg. if *memcg == NULL, estimated
- * memory cgroup from @mm is got and stored in *memcg.
- *
- * Returns 0 if success. -ENOMEM at failure.
+/*
+ * Unlike exported interface, "oom" parameter is added. if oom==true,
+ * oom-killer can be invoked.
*/
-
-int mem_cgroup_try_charge(struct mm_struct *mm,
- gfp_t gfp_mask, struct mem_cgroup **memcg)
+static int __mem_cgroup_try_charge(struct mm_struct *mm,
+ gfp_t gfp_mask, struct mem_cgroup **memcg, bool oom)
{
struct mem_cgroup *mem;
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
@@ -528,7 +522,8 @@ int mem_cgroup_try_charge(struct mm_stru
continue;
if (!nr_retries--) {
- mem_cgroup_out_of_memory(mem, gfp_mask);
+ if (oom)
+ mem_cgroup_out_of_memory(mem, gfp_mask);
goto nomem;
}
}
@@ -538,6 +533,25 @@ nomem:
return -ENOMEM;
}
+/**
+ * mem_cgroup_try_charge - get charge of PAGE_SIZE.
+ * @mm: an mm_struct which is charged against. (when *memcg is NULL)
+ * @gfp_mask: gfp_mask for reclaim.
+ * @memcg: a pointer to memory cgroup which is charged against.
+ *
+ * charge against memory cgroup pointed by *memcg. if *memcg == NULL, estimated
+ * memory cgroup from @mm is got and stored in *memcg.
+ *
+ * Returns 0 if success. -ENOMEM at failure.
+ * This call can invoke OOM-Killer.
+ */
+
+int mem_cgroup_try_charge(struct mm_struct *mm,
+ gfp_t mask, struct mem_cgroup **memcg)
+{
+ return __mem_cgroup_try_charge(mm, mask, memcg, true);
+}
+
/*
* commit a charge got by mem_cgroup_try_charge() and makes page_cgroup to be
* USED state. If already USED, uncharge and return.
@@ -571,11 +585,109 @@ static void __mem_cgroup_commit_charge(s
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_add_list(mz, pc);
+ __mem_cgroup_add_list(mz, pc, true);
spin_unlock_irqrestore(&mz->lru_lock, flags);
unlock_page_cgroup(pc);
}
+/**
+ * mem_cgroup_move_account - move account of the page
+ * @pc: page_cgroup of the page.
+ * @from: mem_cgroup which the page is moved from.
+ * @to: mem_cgroup which the page is moved to. @from != @to.
+ *
+ * The caller must confirm following.
+ * 1. disable irq.
+ * 2. lru_lock of old mem_cgroup(@from) should be held.
+ *
+ * returns 0 at success,
+ * returns -EBUSY when lock is busy or "pc" is unstable.
+ *
+ * This function does "uncharge" from old cgroup but doesn't do "charge" to
+ * new cgroup. It should be done by a caller.
+ */
+
+static int mem_cgroup_move_account(struct page_cgroup *pc,
+ struct mem_cgroup *from, struct mem_cgroup *to)
+{
+ struct mem_cgroup_per_zone *from_mz, *to_mz;
+ int nid, zid;
+ int ret = -EBUSY;
+
+ VM_BUG_ON(!irqs_disabled());
+ VM_BUG_ON(from == to);
+
+ nid = page_cgroup_nid(pc);
+ zid = page_cgroup_zid(pc);
+ from_mz = mem_cgroup_zoneinfo(from, nid, zid);
+ to_mz = mem_cgroup_zoneinfo(to, nid, zid);
+
+
+ if (!trylock_page_cgroup(pc))
+ return ret;
+
+ if (!PageCgroupUsed(pc))
+ goto out;
+
+ if (pc->mem_cgroup != from)
+ goto out;
+
+ if (spin_trylock(&to_mz->lru_lock)) {
+ __mem_cgroup_remove_list(from_mz, pc);
+ css_put(&from->css);
+ res_counter_uncharge(&from->res, PAGE_SIZE);
+ pc->mem_cgroup = to;
+ css_get(&to->css);
+ __mem_cgroup_add_list(to_mz, pc, false);
+ ret = 0;
+ spin_unlock(&to_mz->lru_lock);
+ }
+out:
+ unlock_page_cgroup(pc);
+ return ret;
+}
+
+/*
+ * move charges to its parent.
+ */
+
+static int mem_cgroup_move_parent(struct page_cgroup *pc,
+ struct mem_cgroup *child,
+ gfp_t gfp_mask)
+{
+ struct cgroup *cg = child->css.cgroup;
+ struct cgroup *pcg = cg->parent;
+ struct mem_cgroup *parent;
+ struct mem_cgroup_per_zone *mz;
+ unsigned long flags;
+ int ret;
+
+ /* Is ROOT ? */
+ if (!pcg)
+ return -EINVAL;
+
+ parent = mem_cgroup_from_cont(pcg);
+
+ ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false);
+ if (ret)
+ return ret;
+
+ mz = mem_cgroup_zoneinfo(child,
+ page_cgroup_nid(pc), page_cgroup_zid(pc));
+
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ ret = mem_cgroup_move_account(pc, child, parent);
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+
+ /* drop extra refcnt */
+ css_put(&parent->css);
+ /* uncharge if move fails */
+ if (ret)
+ res_counter_uncharge(&parent->res, PAGE_SIZE);
+
+ return ret;
+}
+
/*
* Charge the memory controller for page usage.
* Return
@@ -597,7 +709,7 @@ static int mem_cgroup_charge_common(stru
prefetchw(pc);
mem = memcg;
- ret = mem_cgroup_try_charge(mm, gfp_mask, &mem);
+ ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true);
if (ret)
return ret;
@@ -898,46 +1010,52 @@ int mem_cgroup_resize_limit(struct mem_c
* This routine traverse page_cgroup in given list and drop them all.
* *And* this routine doesn't reclaim page itself, just removes page_cgroup.
*/
-#define FORCE_UNCHARGE_BATCH (128)
-static void mem_cgroup_force_empty_list(struct mem_cgroup *mem,
+static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
struct mem_cgroup_per_zone *mz,
enum lru_list lru)
{
- struct page_cgroup *pc;
- struct page *page;
- int count = FORCE_UNCHARGE_BATCH;
+ struct page_cgroup *pc, *busy;
unsigned long flags;
+ unsigned long loop;
struct list_head *list;
+ int ret = 0;
list = &mz->lists[lru];
- spin_lock_irqsave(&mz->lru_lock, flags);
- while (!list_empty(list)) {
- pc = list_entry(list->prev, struct page_cgroup, lru);
- page = pc->page;
- if (!PageCgroupUsed(pc))
+ loop = MEM_CGROUP_ZSTAT(mz, lru);
+ /* give some margin against EBUSY etc...*/
+ loop += 256;
+ busy = NULL;
+ while (loop--) {
+ ret = 0;
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ if (list_empty(list)) {
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
break;
- get_page(page);
+ }
+ pc = list_entry(list->prev, struct page_cgroup, lru);
+ if (busy == pc) {
+ list_move(&pc->lru, list);
+ busy = 0;
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+ continue;
+ }
spin_unlock_irqrestore(&mz->lru_lock, flags);
- /*
- * Check if this page is on LRU. !LRU page can be found
- * if it's under page migration.
- */
- if (PageLRU(page)) {
- __mem_cgroup_uncharge_common(page,
- MEM_CGROUP_CHARGE_TYPE_FORCE);
- put_page(page);
- if (--count <= 0) {
- count = FORCE_UNCHARGE_BATCH;
- cond_resched();
- }
- } else {
- spin_lock_irqsave(&mz->lru_lock, flags);
+
+ ret = mem_cgroup_move_parent(pc, mem, GFP_HIGHUSER_MOVABLE);
+ if (ret == -ENOMEM)
break;
- }
- spin_lock_irqsave(&mz->lru_lock, flags);
+
+ if (ret == -EBUSY || ret == -EINVAL) {
+ /* found lock contention or "pc" is obsolete. */
+ busy = pc;
+ cond_resched();
+ } else
+ busy = NULL;
}
- spin_unlock_irqrestore(&mz->lru_lock, flags);
+ if (!ret && !list_empty(list))
+ return -EBUSY;
+ return ret;
}
/*
@@ -946,34 +1064,68 @@ static void mem_cgroup_force_empty_list(
*/
static int mem_cgroup_force_empty(struct mem_cgroup *mem)
{
- int ret = -EBUSY;
- int node, zid;
+ int ret;
+ int node, zid, shrink;
+ int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
css_get(&mem->css);
- /*
- * page reclaim code (kswapd etc..) will move pages between
- * active_list <-> inactive_list while we don't take a lock.
- * So, we have to do loop here until all lists are empty.
- */
+
+ shrink = 0;
+move_account:
while (mem->res.usage > 0) {
+ ret = -EBUSY;
if (atomic_read(&mem->css.cgroup->count) > 0)
goto out;
+
/* This is for making all *used* pages to be on LRU. */
lru_add_drain_all();
- for_each_node_state(node, N_POSSIBLE)
- for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+ ret = 0;
+ for_each_node_state(node, N_POSSIBLE) {
+ for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
struct mem_cgroup_per_zone *mz;
enum lru_list l;
mz = mem_cgroup_zoneinfo(mem, node, zid);
- for_each_lru(l)
- mem_cgroup_force_empty_list(mem, mz, l);
+ for_each_lru(l) {
+ ret = mem_cgroup_force_empty_list(mem,
+ mz, l);
+ if (ret)
+ break;
+ }
}
+ if (ret)
+ break;
+ }
+ /* it seems parent cgroup doesn't have enough mem */
+ if (ret == -ENOMEM)
+ goto try_to_free;
cond_resched();
}
ret = 0;
out:
css_put(&mem->css);
return ret;
+
+try_to_free:
+ /* returns EBUSY if we come here twice. */
+ if (shrink) {
+ ret = -EBUSY;
+ goto out;
+ }
+ /* try to free all pages in this cgroup */
+ shrink = 1;
+ while (nr_retries && mem->res.usage > 0) {
+ int progress;
+ progress = try_to_free_mem_cgroup_pages(mem,
+ GFP_HIGHUSER_MOVABLE);
+ if (!progress)
+ nr_retries--;
+
+ }
+ /* try move_account...there may be some *locked* pages. */
+ if (mem->res.usage)
+ goto move_account;
+ ret = 0;
+ goto out;
}
static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
@@ -1022,11 +1174,6 @@ static int mem_cgroup_reset(struct cgrou
return 0;
}
-static int mem_force_empty_write(struct cgroup *cont, unsigned int event)
-{
- return mem_cgroup_force_empty(mem_cgroup_from_cont(cont));
-}
-
static const struct mem_cgroup_stat_desc {
const char *msg;
u64 unit;
@@ -1103,10 +1250,6 @@ static struct cftype mem_cgroup_files[]
.read_u64 = mem_cgroup_read,
},
{
- .name = "force_empty",
- .trigger = mem_force_empty_write,
- },
- {
.name = "stat",
.read_map = mem_control_stat_show,
},
Index: mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.28-rc2+.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
@@ -207,12 +207,6 @@ exceeded.
The memory.stat file gives accounting information. Now, the number of
caches, RSS and Active pages/Inactive pages are shown.
-The memory.force_empty gives an interface to drop *all* charges by force.
-
-# echo 1 > memory.force_empty
-
-will drop all charges in cgroup. Currently, this is maintained for test.
-
4. Testing
Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11].
@@ -242,8 +236,10 @@ reclaimed.
A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
cgroup might have some charge associated with it, even though all
-tasks have migrated away from it. Such charges are automatically dropped at
-rmdir() if there are no tasks.
+tasks have migrated away from it.
+Such charges are moved to its parent as much as possible and freed if parent
+is full.
+If both of them are busy, rmdir() returns -EBUSY.
5. TODO
SwapCache support for memory resource controller (memcg)
Before mem+swap controller, memcg itself should handle SwapCache in proper way.
In current memcg, SwapCache is just leaked and the user can create tons of
SwapCache. This is leak of account and should be handled.
SwapCache accounting is done as following.
charge (anon)
- charged when it's mapped.
(because of readahead, charge at add_to_swap_cache() is not sane)
uncharge (anon)
- uncharged when it's dropped from swapcache and fully unmapped.
means it's not uncharged at unmap.
Note: delete from swap cache at swap-in is done after rmap information
is established.
charge (shmem)
- charged at swap-in. this prevents charge at add_to_page_cache().
uncharge (shmem)
- uncharged when it's dropped from swapcache and not on shmem's
radix-tree.
at migartion, check against 'old page' is modified to handle shmem.
Compareing to the old version discussed (and caused troubles), we have
advanteges of
- PCG_USED bit.
- simple migraton handling.
So, situation is much easier than several months ago, maybe.
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Documentation/controllers/memory.txt | 5 ++
include/linux/swap.h | 16 ++++++++
mm/memcontrol.c | 66 +++++++++++++++++++++++++++++++----
mm/shmem.c | 18 ++++++++-
mm/swap_state.c | 1
5 files changed, 98 insertions(+), 8 deletions(-)
Index: mmotm-2.6.28-rc2+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/memcontrol.c
+++ mmotm-2.6.28-rc2+/mm/memcontrol.c
@@ -21,6 +21,7 @@
#include <linux/memcontrol.h>
#include <linux/cgroup.h>
#include <linux/mm.h>
+#include <linux/pagemap.h>
#include <linux/smp.h>
#include <linux/page-flags.h>
#include <linux/backing-dev.h>
@@ -140,6 +141,7 @@ enum charge_type {
MEM_CGROUP_CHARGE_TYPE_MAPPED,
MEM_CGROUP_CHARGE_TYPE_SHMEM, /* used by page migration of shmem */
MEM_CGROUP_CHARGE_TYPE_FORCE, /* used by force_empty */
+ MEM_CGROUP_CHARGE_TYPE_SWAPOUT, /* for accounting swapcache */
NR_CHARGE_TYPE,
};
@@ -781,6 +783,32 @@ int mem_cgroup_cache_charge(struct page
MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
}
+#ifdef CONFIG_SWAP
+int mem_cgroup_cache_charge_swapin(struct page *page,
+ struct mm_struct *mm, gfp_t mask)
+{
+ int ret = 0;
+
+ if (mem_cgroup_subsys.disabled)
+ return 0;
+ if (unlikely(!mm))
+ mm = &init_mm;
+
+ ret = mem_cgroup_charge_common(page, mm, mask,
+ MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
+ /*
+ * The page may be dropped from SwapCache because we don't have
+ * lock_page().This may cause charge-after-uncharge trouble.
+ * Fix it up here. (the caller have refcnt to this page and
+ * page itself is guaranteed not to be freed.)
+ */
+ if (ret && !PageSwapCache(page))
+ mem_cgroup_uncharge_swapcache(page);
+
+ return ret;
+}
+#endif
+
void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
{
struct page_cgroup *pc;
@@ -818,6 +846,9 @@ __mem_cgroup_uncharge_common(struct page
if (mem_cgroup_subsys.disabled)
return;
+ if (PageSwapCache(page))
+ return;
+
/*
* Check if our page_cgroup is valid
*/
@@ -826,12 +857,26 @@ __mem_cgroup_uncharge_common(struct page
return;
lock_page_cgroup(pc);
- if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED && page_mapped(page))
- || !PageCgroupUsed(pc)) {
- /* This happens at race in zap_pte_range() and do_swap_page()*/
- unlock_page_cgroup(pc);
- return;
+
+ if (!PageCgroupUsed(pc))
+ goto unlock_out;
+
+ switch(ctype) {
+ case MEM_CGROUP_CHARGE_TYPE_MAPPED:
+ if (page_mapped(page))
+ goto unlock_out;
+ break;
+ case MEM_CGROUP_CHARGE_TYPE_SWAPOUT:
+ if (!PageAnon(page)) { /* Shared memory */
+ if (page->mapping && !page_is_file_cache(page))
+ goto unlock_out;
+ } else if (page_mapped(page)) /* Anon */
+ goto unlock_out;
+ break;
+ default:
+ break;
}
+
ClearPageCgroupUsed(pc);
mem = pc->mem_cgroup;
@@ -845,6 +890,10 @@ __mem_cgroup_uncharge_common(struct page
css_put(&mem->css);
return;
+
+unlock_out:
+ unlock_page_cgroup(pc);
+ return;
}
void mem_cgroup_uncharge_page(struct page *page)
@@ -864,6 +913,11 @@ void mem_cgroup_uncharge_cache_page(stru
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
}
+void mem_cgroup_uncharge_swapcache(struct page *page)
+{
+ __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
+}
+
/*
* Before starting migration, account PAGE_SIZE to mem_cgroup that the old
* page belongs to.
@@ -921,7 +975,7 @@ void mem_cgroup_end_migration(struct mem
ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
/* unused page is not on radix-tree now. */
- if (unused && ctype != MEM_CGROUP_CHARGE_TYPE_MAPPED)
+ if (unused)
__mem_cgroup_uncharge_common(unused, ctype);
pc = lookup_page_cgroup(target);
Index: mmotm-2.6.28-rc2+/mm/swap_state.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/swap_state.c
+++ mmotm-2.6.28-rc2+/mm/swap_state.c
@@ -119,6 +119,7 @@ void __delete_from_swap_cache(struct pag
total_swapcache_pages--;
__dec_zone_page_state(page, NR_FILE_PAGES);
INC_CACHE_INFO(del_total);
+ mem_cgroup_uncharge_swapcache(page);
}
/**
Index: mmotm-2.6.28-rc2+/include/linux/swap.h
===================================================================
--- mmotm-2.6.28-rc2+.orig/include/linux/swap.h
+++ mmotm-2.6.28-rc2+/include/linux/swap.h
@@ -332,6 +332,22 @@ static inline void disable_swap_token(vo
put_swap_token(swap_token_mm);
}
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+extern int mem_cgroup_cache_charge_swapin(struct page *page,
+ struct mm_struct *mm, gfp_t mask);
+extern void mem_cgroup_uncharge_swapcache(struct page *page);
+#else
+static inline
+int mem_cgroup_cache_charge_swapin(struct page *page,
+ struct mm_struct *mm, gfp_t mask)
+{
+ return 0;
+}
+static inline void mem_cgroup_uncharge_swapcache(struct page *page)
+{
+}
+#endif
+
#else /* CONFIG_SWAP */
#define total_swap_pages 0
Index: mmotm-2.6.28-rc2+/mm/shmem.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/shmem.c
+++ mmotm-2.6.28-rc2+/mm/shmem.c
@@ -920,8 +920,12 @@ found:
error = 1;
if (!inode)
goto out;
- /* Charge page using GFP_HIGHUSER_MOVABLE while we can wait */
- error = mem_cgroup_cache_charge(page, current->mm, GFP_HIGHUSER_MOVABLE);
+ /*
+ * Charge page using GFP_HIGHUSER_MOVABLE while we can wait.
+ * charged back to the user(not to caller) when swap account is used.
+ */
+ error = mem_cgroup_cache_charge_swapin(page,
+ current->mm, GFP_HIGHUSER_MOVABLE);
if (error)
goto out;
error = radix_tree_preload(GFP_KERNEL);
@@ -1258,6 +1262,16 @@ repeat:
goto repeat;
}
wait_on_page_locked(swappage);
+ /*
+ * We want to avoid charge at add_to_page_cache().
+ * charge against this swap cache here.
+ */
+ if (mem_cgroup_cache_charge_swapin(swappage,
+ current->mm, gfp)) {
+ page_cache_release(swappage);
+ error = -ENOMEM;
+ goto failed;
+ }
page_cache_release(swappage);
goto repeat;
}
Index: mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.28-rc2+.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
@@ -137,6 +137,11 @@ behind this approach is that a cgroup th
page will eventually get charged for it (once it is uncharged from
the cgroup that brought it in -- this will happen on memory pressure).
+Exception: When you do swapoff and make swapped-out pages of shmem(tmpfs) to
+be backed into memory in force, charges for pages are accounted against the
+caller of swapoff rather than the users of shmem.
+
+
2.4 Reclaim
Each cgroup maintains a per cgroup LRU that consists of an active
Experimental.
Config and control variable for mem+swap controller.
This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP
(memory resource controller swap extension.)
For accounting swap, it's obvious that we have to use additional memory
to remember "who uses swap". This adds more overhead.
So, it's better to offer "choice" to users. This patch adds 2 choices.
This patch adds 2 parameters to enable swap extension or not.
- CONFIG
- boot option
Changelog: v1 -> v2
- fixed typo.
- make default value of "do_swap_account" to be 0 and turned on 1
later if configured.
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Documentation/kernel-parameters.txt | 3 +++
include/linux/memcontrol.h | 3 +++
init/Kconfig | 17 +++++++++++++++++
mm/memcontrol.c | 32 ++++++++++++++++++++++++++++++++
4 files changed, 55 insertions(+)
Index: mmotm-2.6.28-rc2+/init/Kconfig
===================================================================
--- mmotm-2.6.28-rc2+.orig/init/Kconfig
+++ mmotm-2.6.28-rc2+/init/Kconfig
@@ -428,6 +428,23 @@ config CGROUP_MEM_RES_CTLR
config MM_OWNER
bool
+config CGROUP_MEM_RES_CTLR_SWAP
+ bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)"
+ depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
+ help
+ Add swap management feature to memory resource controller. When you
+ enable this, you can limit mem+swap usage per cgroup. In other words,
+ when you disable this, memory resource controller has no cares to
+ usage of swap...a process can exhaust all of the swap. This extension
+ is useful when you want to avoid exhaustion swap but this itself
+ adds more overheads and consumes memory for remembering information.
+ Especially if you use 32bit system or small memory system, please
+ be careful about enabling this. When memory resource controller
+ is disabled by boot option, this will be automatically disabled and
+ there will be no overhead from this. Even when you set this config=y,
+ if boot option "noswapaccount" is set, swap will not be accounted.
+
+
endmenu
config SYSFS_DEPRECATED
Index: mmotm-2.6.28-rc2+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/memcontrol.c
+++ mmotm-2.6.28-rc2+/mm/memcontrol.c
@@ -41,6 +41,15 @@
struct cgroup_subsys mem_cgroup_subsys __read_mostly;
#define MEM_CGROUP_RECLAIM_RETRIES 5
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+/* Turned on only when memory cgroup is enabled && reallY_do_swap_account = 0 */
+int do_swap_account __read_mostly;
+static int really_do_swap_account __initdata = 1; /* for remember boot option*/
+#else
+#define do_swap_account (0)
+#endif
+
+
/*
* Statistics for memory cgroup.
*/
@@ -1369,6 +1378,18 @@ static void mem_cgroup_free(struct mem_c
}
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+static void __init enable_swap_cgroup(void)
+{
+ if (really_do_swap_account)
+ do_swap_account = 1;
+}
+#else
+static void __init enable_swap_cgroup(void)
+{
+}
+#endif
+
static struct cgroup_subsys_state *
mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
{
@@ -1377,6 +1398,7 @@ mem_cgroup_create(struct cgroup_subsys *
if (unlikely((cont->parent) == NULL)) {
mem = &init_mem_cgroup;
+ enable_swap_cgroup();
} else {
mem = mem_cgroup_alloc();
if (!mem)
@@ -1460,3 +1482,13 @@ struct cgroup_subsys mem_cgroup_subsys =
.attach = mem_cgroup_move_task,
.early_init = 0,
};
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+
+static int __init disable_swap_account(char *s)
+{
+ really_do_swap_account = 0;
+ return 1;
+}
+__setup("noswapaccount", disable_swap_account);
+#endif
Index: mmotm-2.6.28-rc2+/Documentation/kernel-parameters.txt
===================================================================
--- mmotm-2.6.28-rc2+.orig/Documentation/kernel-parameters.txt
+++ mmotm-2.6.28-rc2+/Documentation/kernel-parameters.txt
@@ -1543,6 +1543,9 @@ and is between 256 and 4096 characters.
nosoftlockup [KNL] Disable the soft-lockup detector.
+ noswapaccount [KNL] Disable accounting of swap in memory resource
+ controller. (See Documentation/controllers/memory.txt)
+
nosync [HW,M68K] Disables sync negotiation for all devices.
notsc [BUGS=X86-32] Disable Time Stamp Counter
Index: mmotm-2.6.28-rc2+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-rc2+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-rc2+/include/linux/memcontrol.h
@@ -77,6 +77,9 @@ extern void mem_cgroup_record_reclaim_pr
extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
int priority, enum lru_list lru);
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+extern int do_swap_account;
+#endif
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;
For accounting swap, we need a record per swap entry, at least.
This patch adds following function.
- swap_cgroup_swapon() .... called from swapon
- swap_cgroup_swapoff() ... called at the end of swapoff.
- swap_cgroup_record() .... record information of swap entry.
returns old value.
- swap_cgroup_lookup() .... lookup information of swap entry.
This patch just implements "how to record information". No actual
method for limit the usage of swap. These routine uses flat table
to record and lookup. "wise" lookup system like radix-tree requires
requires memory allocation at new records but swap-out is usually
called under memory shortage (or memcg hits limit.)
So, I used static allocation. (maybe dynamic allocation is not very
hard but it adds additional memory allocation in memory shortage path.)
Note1: In this, we use pointer to record information and this means
8bytes per swap entry. I think we can reduce this when we
create "id of cgroup" in the range of 0-65535 or 0-255.
Note2: array of swap_cgroup is allocated from HIGHMEM. maybe good for x86-32.
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
include/linux/page_cgroup.h | 35 +++++++
mm/page_cgroup.c | 199 ++++++++++++++++++++++++++++++++++++++++++++
mm/swapfile.c | 8 +
3 files changed, 242 insertions(+)
Index: mmotm-2.6.28-rc2+/mm/page_cgroup.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/page_cgroup.c
+++ mmotm-2.6.28-rc2+/mm/page_cgroup.c
@@ -8,6 +8,8 @@
#include <linux/memory.h>
#include <linux/vmalloc.h>
#include <linux/cgroup.h>
+#include <linux/swapops.h>
+#include <linux/highmem.h>
static void __meminit
__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
@@ -254,3 +256,200 @@ void __init pgdat_page_cgroup_init(struc
}
#endif
+
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+
+DEFINE_MUTEX(swap_cgroup_mutex);
+struct swap_cgroup_ctrl {
+ spinlock_t lock;
+ struct page **map;
+ unsigned long length;
+};
+
+struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
+
+/*
+ * This 8bytes seems big..maybe we can reduce this when we can use "id" for
+ * cgroup rather than pointer.
+ */
+struct swap_cgroup {
+ struct mem_cgroup *val;
+};
+#define SC_PER_PAGE (PAGE_SIZE/sizeof(struct swap_cgroup))
+#define SC_POS_MASK (SC_PER_PAGE - 1)
+
+/*
+ * allocate buffer for swap_cgroup.
+ */
+static int swap_cgroup_prepare(int type)
+{
+ struct page *page;
+ struct swap_cgroup_ctrl *ctrl;
+ unsigned long idx, max;
+
+ if (!do_swap_account)
+ return 0;
+ ctrl = &swap_cgroup_ctrl[type];
+
+ for (idx = 0; idx < ctrl->length; idx++) {
+ page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
+ if (!page)
+ goto not_enough_page;
+ ctrl->map[idx] = page;
+ }
+ return 0;
+not_enough_page:
+ max = idx;
+ for (idx = 0; idx < max; idx++)
+ __free_page(ctrl->map[idx]);
+
+ return -ENOMEM;
+}
+
+/**
+ * swap_cgroup_record - record mem_cgroup for this swp_entry.
+ * @ent: swap entry to be recorded into
+ * @mem: mem_cgroup to be recorded
+ *
+ * Returns old value at success, NULL at failure.
+ * (Of course, old value can be NULL.)
+ */
+struct mem_cgroup *swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem)
+{
+ unsigned long flags;
+ int type = swp_type(ent);
+ unsigned long offset = swp_offset(ent);
+ unsigned long idx = offset / SC_PER_PAGE;
+ unsigned long pos = offset & SC_POS_MASK;
+ struct swap_cgroup_ctrl *ctrl;
+ struct page *mappage;
+ struct swap_cgroup *sc;
+ struct mem_cgroup *old;
+
+ if (!do_swap_account)
+ return NULL;
+
+ ctrl = &swap_cgroup_ctrl[type];
+
+ mappage = ctrl->map[idx];
+ spin_lock_irqsave(&ctrl->lock, flags);
+ sc = kmap_atomic(mappage, KM_USER0);
+ sc += pos;
+ old = sc->val;
+ sc->val = mem;
+ kunmap_atomic(mappage, KM_USER0);
+ spin_unlock_irqrestore(&ctrl->lock, flags);
+ return old;
+}
+
+/**
+ * lookup_swap_cgroup - lookup mem_cgroup tied to swap entry
+ * @ent: swap entry to be looked up.
+ *
+ * Returns pointer to mem_cgroup at success. NULL at failure.
+ */
+struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent)
+{
+ int type = swp_type(ent);
+ unsigned long flags;
+ unsigned long offset = swp_offset(ent);
+ unsigned long idx = offset / SC_PER_PAGE;
+ unsigned long pos = offset & SC_POS_MASK;
+ struct swap_cgroup_ctrl *ctrl;
+ struct page *mappage;
+ struct swap_cgroup *sc;
+ struct mem_cgroup *ret;
+
+ if (!do_swap_account)
+ return NULL;
+
+ ctrl = &swap_cgroup_ctrl[type];
+
+ mappage = ctrl->map[idx];
+
+ spin_lock_irqsave(&ctrl->lock, flags);
+ sc = kmap_atomic(mappage, KM_USER0);
+ sc += pos;
+ ret = sc->val;
+ kunmap_atomic(mapppage, KM_USER0);
+ spin_unlock_irqrestore(&ctrl->lock, flags);
+ return ret;
+}
+
+int swap_cgroup_swapon(int type, unsigned long max_pages)
+{
+ void *array;
+ unsigned long array_size;
+ unsigned long length;
+ struct swap_cgroup_ctrl *ctrl;
+
+ if (!do_swap_account)
+ return 0;
+
+ length = ((max_pages/SC_PER_PAGE) + 1);
+ array_size = length * sizeof(void *);
+
+ array = vmalloc(array_size);
+ if (!array)
+ goto nomem;
+
+ memset(array, 0, array_size);
+ ctrl = &swap_cgroup_ctrl[type];
+ mutex_lock(&swap_cgroup_mutex);
+ ctrl->length = length;
+ ctrl->map = array;
+ if (swap_cgroup_prepare(type)) {
+ /* memory shortage */
+ ctrl->map = NULL;
+ ctrl->length = 0;
+ vfree(array);
+ mutex_unlock(&swap_cgroup_mutex);
+ goto nomem;
+ }
+ mutex_unlock(&swap_cgroup_mutex);
+
+ printk(KERN_INFO
+ "swap_cgroup: uses %ld bytes vmalloc and %ld bytes buffres\n",
+ array_size, length * PAGE_SIZE);
+ printk(KERN_INFO
+ "swap_cgroup can be disabled by noswapaccount boot option.\n");
+
+ return 0;
+nomem:
+ printk(KERN_INFO "couldn't allocate enough memory for swap_cgroup.\n");
+ printk(KERN_INFO
+ "swap_cgroup can be disabled by noswapaccount boot option\n");
+ return -ENOMEM;
+}
+
+void swap_cgroup_swapoff(int type)
+{
+ int i;
+ struct swap_cgroup_ctrl *ctrl;
+
+ if (!do_swap_account)
+ return;
+
+ mutex_lock(&swap_cgroup_mutex);
+ ctrl = &swap_cgroup_ctrl[type];
+ for (i = 0; i < ctrl->length; i++) {
+ struct page *page = ctrl->map[i];
+ if (page)
+ __free_page(page);
+ }
+ vfree(ctrl->map);
+ ctrl->map = NULL;
+ ctrl->length = 0;
+ mutex_unlock(&swap_cgroup_mutex);
+}
+
+static int __init swap_cgroup_init(void)
+{
+ int i;
+ for (i = 0; i < MAX_SWAPFILES; i++)
+ spin_lock_init(&swap_cgroup_ctrl[i].lock);
+ return 0;
+}
+late_initcall(swap_cgroup_init);
+#endif
Index: mmotm-2.6.28-rc2+/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.28-rc2+.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.28-rc2+/include/linux/page_cgroup.h
@@ -105,4 +105,39 @@ static inline void page_cgroup_init(void
}
#endif
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+#include <linux/swap.h>
+extern struct mem_cgroup *
+swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem);
+extern struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent);
+extern int swap_cgroup_swapon(int type, unsigned long max_pages);
+extern void swap_cgroup_swapoff(int type);
+#else
+#include <linux/swap.h>
+
+static inline
+struct mem_cgroup *swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem)
+{
+ return NULL;
+}
+
+static inline
+struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent)
+{
+ return NULL;
+}
+
+static inline int
+swap_cgroup_swapon(int type, unsigned long max_pages)
+{
+ return 0;
+}
+
+static inline void swap_cgroup_swapoff(int type)
+{
+ return;
+}
+
+#endif
#endif
Index: mmotm-2.6.28-rc2+/mm/swapfile.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/swapfile.c
+++ mmotm-2.6.28-rc2+/mm/swapfile.c
@@ -32,6 +32,7 @@
#include <asm/pgtable.h>
#include <asm/tlbflush.h>
#include <linux/swapops.h>
+#include <linux/page_cgroup.h>
static DEFINE_SPINLOCK(swap_lock);
static unsigned int nr_swapfiles;
@@ -1345,6 +1346,9 @@ asmlinkage long sys_swapoff(const char _
spin_unlock(&swap_lock);
mutex_unlock(&swapon_mutex);
vfree(swap_map);
+ /* Destroy swap account informatin */
+ swap_cgroup_swapoff(type);
+
inode = mapping->host;
if (S_ISBLK(inode->i_mode)) {
struct block_device *bdev = I_BDEV(inode);
@@ -1669,6 +1673,10 @@ asmlinkage long sys_swapon(const char __
nr_good_pages = swap_header->info.last_page -
swap_header->info.nr_badpages -
1 /* header page */;
+
+ if (!error)
+ error = swap_cgroup_swapon(type, maxpages);
+
if (error)
goto bad_swap;
}
Mem+Swap controller core.
This patch implements per cgroup limit for usage of memory+swap.
However there are SwapCache, double counting of swap-cache and
swap-entry is avoided.
Mem+Swap controller works as following.
- memory usage is limited by memory.limit_in_bytes.
- memory + swap usage is limited by memory.memsw_limit_in_bytes.
This has following benefits.
- A user can limit total resource usage of mem+swap.
Without this, because memory resource controller doesn't take care of
usage of swap, a process can exhaust all the swap (by memory leak.)
We can avoid this case.
And Swap is shared resource but it cannot be reclaimed (goes back to memory)
until it's used. This characteristic can be trouble when the memory
is divided into some parts by cpuset or memcg.
Assume group A and group B.
After some application executes, the system can be..
Group A -- very large free memory space but occupy 99% of swap.
Group B -- under memory shortage but cannot use swap...it's nearly full.
Ability to set appropriate swap limit for each group is required.
Maybe someone wonder "why not swap but mem+swap ?"
- The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
to move account from memory to swap...there is no change in usage of
mem+swap.
In other words, when we want to limit the usage of swap without affecting
global LRU, mem+swap limit is better than just limiting swap.
Accounting target information is stored in swap_cgroup which is
per swap entry record.
Charge is done as following.
map
- charge page and memsw.
unmap
- uncharge page/memsw if not SwapCache.
swap-out (__delete_from_swap_cache)
- uncharge page
- record mem_cgroup information to swap_cgroup.
swap-in (do_swap_page)
- charged as page and memsw.
record in swap_cgroup is cleared.
memsw accounting is decremented.
swap-free (swap_free())
- if swap entry is freed, memsw is uncharged by PAGE_SIZE.
After this, usual memory resource controller handles SwapCache.
(It was lacked(ignored) feature in current memcg but must be handled.)
There are people work under never-swap environments and consider swap as
something bad. For such people, this mem+swap controller extension is just an
overhead. This overhead is avoided by config or boot option.
(see Kconfig. detail is not in this patch.)
Changelog: v1 -> v2
- fixed typos
- fixed migration of anon pages.
- fixed uncharge to check USED bit always.
- code for swapcache is moved to another patch.
- added "noswap" argument to try_to_free_mem_cgroup_pages
- fixed lock_page around mem_cgroup_charge_cache_swap()
- fixed "failcnt" file.
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Documentation/controllers/memory.txt | 29 ++
include/linux/memcontrol.h | 11 -
include/linux/swap.h | 18 +
mm/memcontrol.c | 361 +++++++++++++++++++++++++++++++----
mm/memory.c | 3
mm/shmem.c | 4
mm/swap_state.c | 5
mm/swapfile.c | 11 -
mm/vmscan.c | 6
9 files changed, 391 insertions(+), 57 deletions(-)
Index: mmotm-2.6.28-rc2+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/memcontrol.c
+++ mmotm-2.6.28-rc2+/mm/memcontrol.c
@@ -132,6 +132,10 @@ struct mem_cgroup {
*/
struct res_counter res;
/*
+ * the counter to account for mem+swap usage.
+ */
+ struct res_counter memsw;
+ /*
* Per cgroup active and inactive list, similar to the
* per zone LRU lists.
*/
@@ -142,6 +146,12 @@ struct mem_cgroup {
* statistics.
*/
struct mem_cgroup_stat stat;
+
+ /*
+ * used for counting reference from swap_cgroup.
+ */
+ int obsolete;
+ atomic_t swapref;
};
static struct mem_cgroup init_mem_cgroup;
@@ -168,6 +178,16 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
0, /* FORCE */
};
+
+/* for encoding cft->private value on file */
+#define _MEM (0)
+#define _MEMSWAP (1)
+#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val))
+#define MEMFILE_TYPE(val) (((val) >> 16) & 0xffff)
+#define MEMFILE_ATTR(val) ((val) & 0xffff)
+
+static void mem_cgroup_forget_swapref(struct mem_cgroup *mem);
+
/*
* Always modified under lru lock. Then, not necessary to preempt_disable()
*/
@@ -514,12 +534,25 @@ static int __mem_cgroup_try_charge(struc
css_get(&mem->css);
}
+ while (1) {
+ int ret;
+ bool noswap = false;
- while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
+ ret = res_counter_charge(&mem->res, PAGE_SIZE);
+ if (likely(!ret)) {
+ if (!do_swap_account)
+ break;
+ ret = res_counter_charge(&mem->memsw, PAGE_SIZE);
+ if (likely(!ret))
+ break;
+ /* mem+swap counter fails */
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ noswap = true;
+ }
if (!(gfp_mask & __GFP_WAIT))
goto nomem;
- if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
+ if (try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap))
continue;
/*
@@ -528,8 +561,13 @@ static int __mem_cgroup_try_charge(struc
* moved to swap cache or just unmapped from the cgroup.
* Check the limit again to see if the reclaim reduced the
* current usage of the cgroup before giving up
+ *
*/
- if (res_counter_check_under_limit(&mem->res))
+ if (!do_swap_account &&
+ res_counter_check_under_limit(&mem->res))
+ continue;
+ if (do_swap_account &&
+ res_counter_check_under_limit(&mem->memsw))
continue;
if (!nr_retries--) {
@@ -583,6 +621,8 @@ static void __mem_cgroup_commit_charge(s
if (unlikely(PageCgroupUsed(pc))) {
unlock_page_cgroup(pc);
res_counter_uncharge(&mem->res, PAGE_SIZE);
+ if (do_swap_account)
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
css_put(&mem->css);
return;
}
@@ -647,6 +687,8 @@ static int mem_cgroup_move_account(struc
__mem_cgroup_remove_list(from_mz, pc);
css_put(&from->css);
res_counter_uncharge(&from->res, PAGE_SIZE);
+ if (do_swap_account)
+ res_counter_uncharge(&from->memsw, PAGE_SIZE);
pc->mem_cgroup = to;
css_get(&to->css);
__mem_cgroup_add_list(to_mz, pc, false);
@@ -693,8 +735,11 @@ static int mem_cgroup_move_parent(struct
/* drop extra refcnt */
css_put(&parent->css);
/* uncharge if move fails */
- if (ret)
+ if (ret) {
res_counter_uncharge(&parent->res, PAGE_SIZE);
+ if (do_swap_account)
+ res_counter_uncharge(&parent->memsw, PAGE_SIZE);
+ }
return ret;
}
@@ -792,27 +837,77 @@ int mem_cgroup_cache_charge(struct page
MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
}
+int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
+ struct page *page, gfp_t mask, struct mem_cgroup **ptr)
+{
+ struct mem_cgroup *mem;
+ swp_entry_t ent;
+
+ if (mem_cgroup_subsys.disabled)
+ return 0;
+
+ if (!do_swap_account)
+ goto charge_cur_mm;
+
+ ent.val = page_private(page);
+
+ mem = lookup_swap_cgroup(ent);
+ if (!mem || mem->obsolete)
+ goto charge_cur_mm;
+ *ptr = mem;
+ return __mem_cgroup_try_charge(NULL, mask, ptr, true);
+charge_cur_mm:
+ if (unlikely(!mm))
+ mm = &init_mm;
+ return __mem_cgroup_try_charge(mm, mask, ptr, true);
+}
+
+
+
#ifdef CONFIG_SWAP
int mem_cgroup_cache_charge_swapin(struct page *page,
- struct mm_struct *mm, gfp_t mask)
+ struct mm_struct *mm, gfp_t mask, bool locked)
{
int ret = 0;
+ swp_entry_t ent;
+ struct mem_cgroup *mem = NULL;
if (mem_cgroup_subsys.disabled)
return 0;
+
if (unlikely(!mm))
mm = &init_mm;
+ if (!locked) {
+ lock_page(page);
+ if (!PageSwapCache(page))
+ goto unlock_out;
+ }
+ ent.val = page_private(page);
+
+ if (do_swap_account) {
+ mem = lookup_swap_cgroup(ent);
+ if (mem && mem->obsolete)
+ mem = NULL;
+ if (mem)
+ mm = NULL;
+ }
ret = mem_cgroup_charge_common(page, mm, mask,
- MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
- /*
- * The page may be dropped from SwapCache because we don't have
- * lock_page().This may cause charge-after-uncharge trouble.
- * Fix it up here. (the caller have refcnt to this page and
- * page itself is guaranteed not to be freed.)
- */
- if (ret && !PageSwapCache(page))
- mem_cgroup_uncharge_swapcache(page);
+ MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
+ if (!ret && do_swap_account) {
+ /*
+ * At this point, we successfully charge both for mem and swap.
+ * fix this double counting, here.
+ */
+ mem = swap_cgroup_record(ent, NULL);
+ if (mem) {
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ mem_cgroup_forget_swapref(mem);
+ }
+ }
+unlock_out:
+ if (!locked)
+ unlock_page(page);
return ret;
}
@@ -828,6 +923,23 @@ void mem_cgroup_commit_charge_swapin(str
return;
pc = lookup_page_cgroup(page);
__mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
+ /*
+ * Now swap is on-memory. This means this page may be
+ * counted both as mem and swap....double count.
+ * Fix it by uncharging from memsw. This SwapCache is stable
+ * because we're still under lock_page().
+ */
+ if (do_swap_account) {
+ swp_entry_t ent = {.val = page_private(page)};
+ struct mem_cgroup *memcg;
+ memcg = swap_cgroup_record(ent, NULL);
+ if (memcg) {
+ /* If memcg is obsolete, memcg can be != ptr */
+ res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+ mem_cgroup_forget_swapref(memcg);
+ }
+
+ }
}
void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
@@ -837,6 +949,7 @@ void mem_cgroup_cancel_charge_swapin(str
if (!mem)
return;
res_counter_uncharge(&mem->res, PAGE_SIZE);
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
css_put(&mem->css);
}
@@ -844,29 +957,31 @@ void mem_cgroup_cancel_charge_swapin(str
/*
* uncharge if !page_mapped(page)
*/
-static void
+static struct mem_cgroup *
__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
{
struct page_cgroup *pc;
- struct mem_cgroup *mem;
+ struct mem_cgroup *mem = NULL;
struct mem_cgroup_per_zone *mz;
unsigned long flags;
if (mem_cgroup_subsys.disabled)
- return;
+ return NULL;
if (PageSwapCache(page))
- return;
+ return NULL;
/*
* Check if our page_cgroup is valid
*/
pc = lookup_page_cgroup(page);
if (unlikely(!pc || !PageCgroupUsed(pc)))
- return;
+ return NULL;
lock_page_cgroup(pc);
+ mem = pc->mem_cgroup;
+
if (!PageCgroupUsed(pc))
goto unlock_out;
@@ -886,8 +1001,11 @@ __mem_cgroup_uncharge_common(struct page
break;
}
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+
ClearPageCgroupUsed(pc);
- mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
@@ -895,14 +1013,13 @@ __mem_cgroup_uncharge_common(struct page
spin_unlock_irqrestore(&mz->lru_lock, flags);
unlock_page_cgroup(pc);
- res_counter_uncharge(&mem->res, PAGE_SIZE);
css_put(&mem->css);
- return;
+ return mem;
unlock_out:
unlock_page_cgroup(pc);
- return;
+ return NULL;
}
void mem_cgroup_uncharge_page(struct page *page)
@@ -922,11 +1039,43 @@ void mem_cgroup_uncharge_cache_page(stru
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
}
-void mem_cgroup_uncharge_swapcache(struct page *page)
+/*
+ * called from __delete_from_swap_cache() and drop "page" account.
+ * memcg information is recorded to swap_cgroup of "ent"
+ */
+void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
{
- __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
+ struct mem_cgroup *memcg;
+
+ memcg = __mem_cgroup_uncharge_common(page,
+ MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
+ /* record memcg information */
+ if (do_swap_account && memcg) {
+ swap_cgroup_record(ent, memcg);
+ atomic_inc(&memcg->swapref);
+ }
}
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+/*
+ * called from swap_entry_free(). remove record in swap_cgroup and
+ * uncharge "memsw" account.
+ */
+void mem_cgroup_uncharge_swap(swp_entry_t ent)
+{
+ struct mem_cgroup *memcg;
+
+ if (!do_swap_account)
+ return;
+
+ memcg = swap_cgroup_record(ent, NULL);
+ if (memcg) {
+ res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+ mem_cgroup_forget_swapref(memcg);
+ }
+}
+#endif
+
/*
* Before starting migration, account PAGE_SIZE to mem_cgroup that the old
* page belongs to.
@@ -1034,7 +1183,7 @@ int mem_cgroup_shrink_usage(struct mm_st
rcu_read_unlock();
do {
- progress = try_to_free_mem_cgroup_pages(mem, gfp_mask);
+ progress = try_to_free_mem_cgroup_pages(mem, gfp_mask, true);
progress += res_counter_check_under_limit(&mem->res);
} while (!progress && --retry);
@@ -1061,13 +1210,55 @@ int mem_cgroup_resize_limit(struct mem_c
break;
}
progress = try_to_free_mem_cgroup_pages(memcg,
- GFP_HIGHUSER_MOVABLE);
+ GFP_HIGHUSER_MOVABLE, false);
if (!progress)
retry_count--;
}
return ret;
}
+int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
+ unsigned long long val)
+{
+ int retry_count = MEM_CGROUP_RECLAIM_RETRIES;
+ unsigned long flags;
+ u64 memlimit, oldusage, curusage;
+ int ret;
+
+ if (!do_swap_account)
+ return -EINVAL;
+
+ while (retry_count) {
+ if (signal_pending(current)) {
+ ret = -EINTR;
+ break;
+ }
+ /*
+ * Rather than hide all in some function, I do this in
+ * open coded manner. You see what this really does.
+ * We have to guarantee mem->res.limit < mem->memsw.limit.
+ */
+ spin_lock_irqsave(&memcg->res.lock, flags);
+ memlimit = memcg->res.limit;
+ if (memlimit > val) {
+ spin_unlock_irqrestore(&memcg->res.lock, flags);
+ ret = -EINVAL;
+ break;
+ }
+ ret = res_counter_set_limit(&memcg->memsw, val);
+ oldusage = memcg->memsw.usage;
+ spin_unlock_irqrestore(&memcg->res.lock, flags);
+
+ if (!ret)
+ break;
+ try_to_free_mem_cgroup_pages(memcg, GFP_HIGHUSER_MOVABLE, true);
+ curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
+ if (curusage >= oldusage)
+ retry_count--;
+ }
+ return ret;
+}
+
/*
* This routine traverse page_cgroup in given list and drop them all.
@@ -1179,7 +1370,7 @@ try_to_free:
while (nr_retries && mem->res.usage > 0) {
int progress;
progress = try_to_free_mem_cgroup_pages(mem,
- GFP_HIGHUSER_MOVABLE);
+ GFP_HIGHUSER_MOVABLE, false);
if (!progress)
nr_retries--;
@@ -1193,8 +1384,25 @@ try_to_free:
static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
{
- return res_counter_read_u64(&mem_cgroup_from_cont(cont)->res,
- cft->private);
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+ u64 val = 0;
+ int type, name;
+
+ type = MEMFILE_TYPE(cft->private);
+ name = MEMFILE_ATTR(cft->private);
+ switch (type) {
+ case _MEM:
+ val = res_counter_read_u64(&mem->res, name);
+ break;
+ case _MEMSWAP:
+ if (do_swap_account)
+ val = res_counter_read_u64(&mem->memsw, name);
+ break;
+ default:
+ BUG();
+ break;
+ }
+ return val;
}
/*
* The user of this function is...
@@ -1204,15 +1412,22 @@ static int mem_cgroup_write(struct cgrou
const char *buffer)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+ int type, name;
unsigned long long val;
int ret;
- switch (cft->private) {
+ type = MEMFILE_TYPE(cft->private);
+ name = MEMFILE_ATTR(cft->private);
+ switch (name) {
case RES_LIMIT:
/* This function does all necessary parse...reuse it */
ret = res_counter_memparse_write_strategy(buffer, &val);
- if (!ret)
+ if (ret)
+ break;
+ if (type == _MEM)
ret = mem_cgroup_resize_limit(memcg, val);
+ else
+ ret = mem_cgroup_resize_memsw_limit(memcg, val);
break;
default:
ret = -EINVAL; /* should be BUG() ? */
@@ -1224,14 +1439,23 @@ static int mem_cgroup_write(struct cgrou
static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
{
struct mem_cgroup *mem;
+ int type, name;
mem = mem_cgroup_from_cont(cont);
- switch (event) {
+ type = MEMFILE_TYPE(event);
+ name = MEMFILE_ATTR(event);
+ switch (name) {
case RES_MAX_USAGE:
- res_counter_reset_max(&mem->res);
+ if (type == _MEM)
+ res_counter_reset_max(&mem->res);
+ else
+ res_counter_reset_max(&mem->memsw);
break;
case RES_FAILCNT:
- res_counter_reset_failcnt(&mem->res);
+ if (type == _MEM)
+ res_counter_reset_failcnt(&mem->res);
+ else
+ res_counter_reset_failcnt(&mem->memsw);
break;
}
return 0;
@@ -1285,30 +1509,33 @@ static int mem_control_stat_show(struct
cb->fill(cb, "unevictable", unevictable * PAGE_SIZE);
}
+ /* showing refs from disk-swap */
+ cb->fill(cb, "swap_on_disk", atomic_read(&mem_cont->swapref)
+ * PAGE_SIZE);
return 0;
}
static struct cftype mem_cgroup_files[] = {
{
.name = "usage_in_bytes",
- .private = RES_USAGE,
+ .private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
.read_u64 = mem_cgroup_read,
},
{
.name = "max_usage_in_bytes",
- .private = RES_MAX_USAGE,
+ .private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
.trigger = mem_cgroup_reset,
.read_u64 = mem_cgroup_read,
},
{
.name = "limit_in_bytes",
- .private = RES_LIMIT,
+ .private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
.write_string = mem_cgroup_write,
.read_u64 = mem_cgroup_read,
},
{
.name = "failcnt",
- .private = RES_FAILCNT,
+ .private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
.trigger = mem_cgroup_reset,
.read_u64 = mem_cgroup_read,
},
@@ -1316,6 +1543,31 @@ static struct cftype mem_cgroup_files[]
.name = "stat",
.read_map = mem_control_stat_show,
},
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+ {
+ .name = "memsw.usage_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
+ .read_u64 = mem_cgroup_read,
+ },
+ {
+ .name = "memsw.max_usage_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEMSWAP, RES_MAX_USAGE),
+ .trigger = mem_cgroup_reset,
+ .read_u64 = mem_cgroup_read,
+ },
+ {
+ .name = "memsw.limit_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT),
+ .write_string = mem_cgroup_write,
+ .read_u64 = mem_cgroup_read,
+ },
+ {
+ .name = "memsw.failcnt",
+ .private = MEMFILE_PRIVATE(_MEMSWAP, RES_FAILCNT),
+ .trigger = mem_cgroup_reset,
+ .read_u64 = mem_cgroup_read,
+ },
+#endif
};
static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
@@ -1369,14 +1621,43 @@ static struct mem_cgroup *mem_cgroup_all
return mem;
}
+/*
+ * At destroying mem_cgroup, references from swap_cgroup can remain.
+ * (scanning all at force_empty is too costly...)
+ *
+ * Instead of clearing all references at force_empty, we remember
+ * the number of reference from swap_cgroup and free mem_cgroup when
+ * it goes down to 0.
+ *
+ * When mem_cgroup is destroyed, mem->obsolete will be set to 0 and
+ * entry which points to this memcg will be ignore at swapin.
+ *
+ * Removal of cgroup itself succeeds regardless of refs from swap.
+ */
+
static void mem_cgroup_free(struct mem_cgroup *mem)
{
+ if (do_swap_account) {
+ if (atomic_read(&mem->swapref) > 0)
+ return;
+ }
if (sizeof(*mem) < PAGE_SIZE)
kfree(mem);
else
vfree(mem);
}
+static void mem_cgroup_forget_swapref(struct mem_cgroup *mem)
+{
+ if (!do_swap_account)
+ return;
+ if (atomic_dec_and_test(&mem->swapref)) {
+ if (!mem->obsolete)
+ return;
+ mem_cgroup_free(mem);
+ }
+}
+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
static void __init enable_swap_cgroup(void)
@@ -1406,6 +1687,7 @@ mem_cgroup_create(struct cgroup_subsys *
}
res_counter_init(&mem->res);
+ res_counter_init(&mem->memsw);
for_each_node_state(node, N_POSSIBLE)
if (alloc_mem_cgroup_per_zone_info(mem, node))
@@ -1424,6 +1706,7 @@ static void mem_cgroup_pre_destroy(struc
struct cgroup *cont)
{
struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+ mem->obsolete = 1;
mem_cgroup_force_empty(mem);
}
Index: mmotm-2.6.28-rc2+/mm/swapfile.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/swapfile.c
+++ mmotm-2.6.28-rc2+/mm/swapfile.c
@@ -271,8 +271,9 @@ out:
return NULL;
}
-static int swap_entry_free(struct swap_info_struct *p, unsigned long offset)
+static int swap_entry_free(struct swap_info_struct *p, swp_entry_t ent)
{
+ unsigned long offset = swp_offset(ent);
int count = p->swap_map[offset];
if (count < SWAP_MAP_MAX) {
@@ -287,6 +288,7 @@ static int swap_entry_free(struct swap_i
swap_list.next = p - swap_info;
nr_swap_pages++;
p->inuse_pages--;
+ mem_cgroup_uncharge_swap(ent);
}
}
return count;
@@ -302,7 +304,7 @@ void swap_free(swp_entry_t entry)
p = swap_info_get(entry);
if (p) {
- swap_entry_free(p, swp_offset(entry));
+ swap_entry_free(p, entry);
spin_unlock(&swap_lock);
}
}
@@ -421,7 +423,7 @@ void free_swap_and_cache(swp_entry_t ent
p = swap_info_get(entry);
if (p) {
- if (swap_entry_free(p, swp_offset(entry)) == 1) {
+ if (swap_entry_free(p, entry) == 1) {
page = find_get_page(&swapper_space, entry.val);
if (page && !trylock_page(page)) {
page_cache_release(page);
@@ -536,7 +538,8 @@ static int unuse_pte(struct vm_area_stru
pte_t *pte;
int ret = 1;
- if (mem_cgroup_try_charge(vma->vm_mm, GFP_HIGHUSER_MOVABLE, &ptr))
+ if (mem_cgroup_try_charge_swapin(vma->vm_mm, page,
+ GFP_HIGHUSER_MOVABLE, &ptr))
ret = -ENOMEM;
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
Index: mmotm-2.6.28-rc2+/mm/swap_state.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/swap_state.c
+++ mmotm-2.6.28-rc2+/mm/swap_state.c
@@ -17,6 +17,7 @@
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
#include <linux/migrate.h>
+#include <linux/page_cgroup.h>
#include <asm/pgtable.h>
@@ -108,6 +109,8 @@ int add_to_swap_cache(struct page *page,
*/
void __delete_from_swap_cache(struct page *page)
{
+ swp_entry_t ent = {.val = page_private(page)};
+
BUG_ON(!PageLocked(page));
BUG_ON(!PageSwapCache(page));
BUG_ON(PageWriteback(page));
@@ -119,7 +122,7 @@ void __delete_from_swap_cache(struct pag
total_swapcache_pages--;
__dec_zone_page_state(page, NR_FILE_PAGES);
INC_CACHE_INFO(del_total);
- mem_cgroup_uncharge_swapcache(page);
+ mem_cgroup_uncharge_swapcache(page, ent);
}
/**
Index: mmotm-2.6.28-rc2+/include/linux/swap.h
===================================================================
--- mmotm-2.6.28-rc2+.orig/include/linux/swap.h
+++ mmotm-2.6.28-rc2+/include/linux/swap.h
@@ -213,7 +213,7 @@ static inline void lru_cache_add_active_
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask);
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
- gfp_t gfp_mask);
+ gfp_t gfp_mask, bool noswap);
extern int __isolate_lru_page(struct page *page, int mode, int file);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
@@ -334,16 +334,24 @@ static inline void disable_swap_token(vo
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
extern int mem_cgroup_cache_charge_swapin(struct page *page,
- struct mm_struct *mm, gfp_t mask);
-extern void mem_cgroup_uncharge_swapcache(struct page *page);
+ struct mm_struct *mm, gfp_t mask, bool locked);
+extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
#else
static inline
int mem_cgroup_cache_charge_swapin(struct page *page,
- struct mm_struct *mm, gfp_t mask)
+ struct mm_struct *mm, gfp_t mask, bool locked)
{
return 0;
}
-static inline void mem_cgroup_uncharge_swapcache(struct page *page)
+static inline void
+mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
+{
+}
+#endif
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
+#else
+static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
{
}
#endif
Index: mmotm-2.6.28-rc2+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-rc2+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-rc2+/include/linux/memcontrol.h
@@ -32,6 +32,8 @@ extern int mem_cgroup_newpage_charge(str
/* for swap handling */
extern int mem_cgroup_try_charge(struct mm_struct *mm,
gfp_t gfp_mask, struct mem_cgroup **ptr);
+extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
+ struct page *page, gfp_t mask, struct mem_cgroup **ptr);
extern void mem_cgroup_commit_charge_swapin(struct page *page,
struct mem_cgroup *ptr);
extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
@@ -80,7 +82,6 @@ extern long mem_cgroup_calc_reclaim(stru
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
#endif
-
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;
@@ -97,7 +98,13 @@ static inline int mem_cgroup_cache_charg
}
static inline int mem_cgroup_try_charge(struct mm_struct *mm,
- gfp_t gfp_mask, struct mem_cgroup **ptr)
+ gfp_t gfp_mask, struct mem_cgroup **ptr)
+{
+ return 0;
+}
+
+static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
+ struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr)
{
return 0;
}
Index: mmotm-2.6.28-rc2+/mm/memory.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/memory.c
+++ mmotm-2.6.28-rc2+/mm/memory.c
@@ -2324,7 +2324,8 @@ static int do_swap_page(struct mm_struct
lock_page(page);
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
- if (mem_cgroup_try_charge(mm, GFP_HIGHUSER_MOVABLE, &ptr) == -ENOMEM) {
+ if (mem_cgroup_try_charge_swapin(mm, page,
+ GFP_HIGHUSER_MOVABLE, &ptr) == -ENOMEM) {
ret = VM_FAULT_OOM;
unlock_page(page);
goto out;
Index: mmotm-2.6.28-rc2+/mm/vmscan.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/vmscan.c
+++ mmotm-2.6.28-rc2+/mm/vmscan.c
@@ -1669,7 +1669,8 @@ unsigned long try_to_free_pages(struct z
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
- gfp_t gfp_mask)
+ gfp_t gfp_mask,
+ bool noswap)
{
struct scan_control sc = {
.may_writepage = !laptop_mode,
@@ -1682,6 +1683,9 @@ unsigned long try_to_free_mem_cgroup_pag
};
struct zonelist *zonelist;
+ if (noswap)
+ sc.may_swap = 0;
+
sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
zonelist = NODE_DATA(numa_node_id())->node_zonelists;
Index: mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.28-rc2+.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
@@ -137,12 +137,32 @@ behind this approach is that a cgroup th
page will eventually get charged for it (once it is uncharged from
the cgroup that brought it in -- this will happen on memory pressure).
-Exception: When you do swapoff and make swapped-out pages of shmem(tmpfs) to
+Exception: If CONFIG_CGROUP_CGROUP_MEM_RES_CTLR_SWAP is not used..
+When you do swapoff and make swapped-out pages of shmem(tmpfs) to
be backed into memory in force, charges for pages are accounted against the
caller of swapoff rather than the users of shmem.
-2.4 Reclaim
+2.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP)
+Swap Extension allows you to record charge for swap. A swapped-in page is
+charged back to original page allocator if possible.
+
+When swap is accounted, following files are added.
+ - memory.memsw.usage_in_bytes.
+ - memory.memsw.limit_in_bytes.
+
+usage of mem+swap is limited by memsw.limit_in_bytes.
+
+Note: why 'mem+swap' rather than swap.
+The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
+to move account from memory to swap...there is no change in usage of
+mem+swap.
+
+In other words, when we want to limit the usage of swap without affecting
+global LRU, mem+swap limit is better than just limiting swap from OS point
+of view.
+
+2.5 Reclaim
Each cgroup maintains a per cgroup LRU that consists of an active
and inactive list. When a cgroup goes over its limit, we first try
@@ -246,6 +266,11 @@ Such charges are moved to its parent as
is full.
If both of them are busy, rmdir() returns -EBUSY.
+Charges recorded in swap information is not updated at removal of cgroup.
+Recorded information is effectively discarded and a cgroup which uses swap
+(swapcache) will be charged as a new owner of it.
+
+
5. TODO
1. Add support for accounting huge pages (as a separate controller)
Index: mmotm-2.6.28-rc2+/mm/shmem.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/shmem.c
+++ mmotm-2.6.28-rc2+/mm/shmem.c
@@ -925,7 +925,7 @@ found:
* charged back to the user(not to caller) when swap account is used.
*/
error = mem_cgroup_cache_charge_swapin(page,
- current->mm, GFP_HIGHUSER_MOVABLE);
+ current->mm, GFP_HIGHUSER_MOVABLE, true);
if (error)
goto out;
error = radix_tree_preload(GFP_KERNEL);
@@ -1267,7 +1267,7 @@ repeat:
* charge against this swap cache here.
*/
if (mem_cgroup_cache_charge_swapin(swappage,
- current->mm, gfp)) {
+ current->mm, gfp, false)) {
page_cache_release(swappage);
error = -ENOMEM;
goto failed;
On Fri, 31 Oct 2008 11:52:41 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> This patch provides a function to move account information of a page between
> mem_cgroups and rewrite force_empty to make use of this.
>
> This moving of page_cgroup is done under
> - lru_lock of source/destination mem_cgroup is held.
> - lock_page_cgroup() is held.
>
> Then, a routine which touches pc->mem_cgroup without lock_page_cgroup() should
> confirm pc->mem_cgroup is still valid or not. Typical code can be following.
>
> (while page is not under lock_page())
> mem = pc->mem_cgroup;
> mz = page_cgroup_zoneinfo(pc)
> spin_lock_irqsave(&mz->lru_lock);
> if (pc->mem_cgroup == mem)
> ...../* some list handling */
> spin_unlock_irqrestore(&mz->lru_lock);
>
> Of course, better way is
> lock_page_cgroup(pc);
> ....
> unlock_page_cgroup(pc);
>
> But you should confirm the nest of lock and avoid deadlock.
>
> If you treats page_cgroup from mem_cgroup's LRU under mz->lru_lock,
> you don't have to worry about what pc->mem_cgroup points to.
> moved pages are added to head of lru, not to tail.
>
> Expected users of this routine is:
> - force_empty (rmdir)
> - moving tasks between cgroup (for moving account information.)
> - hierarchy (maybe useful.)
>
> force_empty(rmdir) uses this move_account and move pages to its parent.
> This "move" will not cause OOM (I added "oom" parameter to try_charge().)
>
> If the parent is busy (not enough memory), force_empty calls try_to_free_page()
> and reduce usage.
>
> Purpose of this behavior is
> - Fix "forget all" behavior of force_empty and avoid leak of accounting.
> - By "moving first, free if necessary", keep pages on memory as much as
> possible.
>
> Adding a switch to change behavior of force_empty to
> - free first, move if necessary
> - free all, if there is mlocked/busy pages, return -EBUSY.
> is under consideration.
>
> This patch removes memory.force_empty file, a brutal debug-only interface.
>
> Changelog: (v8) -> (v9)
> - fixed typos.
>
> Changelog: (v6) -> (v8)
> - removed memory.force_empty file which was provided only for debug.
>
> Changelog: (v5) -> (v6)
> - removed unnecessary check.
> - do all under lock_page_cgroup().
> - removed res_counter_charge() from move function itself.
> (and modifies try_charge() function.)
> - add argument to add_list() to specify to add page_cgroup head or tail.
> - merged with force_empty patch. (to answer who is user? question)
>
> Changelog: (v4) -> (v5)
> - check for lock_page() is removed.
> - rewrote description.
>
> Changelog: (v2) -> (v4)
> - added lock_page_cgroup().
> - moved out from new-force-empty patch.
> - added how-to-use text.
> - fixed race in __mem_cgroup_uncharge_common().
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
Previous version(v8) of this patch worked well under my rmdir test
for more than 24 hours, and there is no practical difference from then.
(I tested this version too for a few hours just to make sure.)
Reviewed-by: Daisuke Nishimura <[email protected]>
Tested-by: Daisuke Nishimura <[email protected]>
Thanks,
Daisuke Nishimura.
> Documentation/controllers/memory.txt | 12 -
> mm/memcontrol.c | 277 ++++++++++++++++++++++++++---------
> 2 files changed, 214 insertions(+), 75 deletions(-)
>
> Index: mmotm-2.6.28-rc2+/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/mm/memcontrol.c
> +++ mmotm-2.6.28-rc2+/mm/memcontrol.c
> @@ -257,7 +257,7 @@ static void __mem_cgroup_remove_list(str
> }
>
> static void __mem_cgroup_add_list(struct mem_cgroup_per_zone *mz,
> - struct page_cgroup *pc)
> + struct page_cgroup *pc, bool hot)
> {
> int lru = LRU_BASE;
>
> @@ -271,7 +271,10 @@ static void __mem_cgroup_add_list(struct
> }
>
> MEM_CGROUP_ZSTAT(mz, lru) += 1;
> - list_add(&pc->lru, &mz->lists[lru]);
> + if (hot)
> + list_add(&pc->lru, &mz->lists[lru]);
> + else
> + list_add_tail(&pc->lru, &mz->lists[lru]);
>
> mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
> }
> @@ -467,21 +470,12 @@ unsigned long mem_cgroup_isolate_pages(u
> return nr_taken;
> }
>
> -
> -/**
> - * mem_cgroup_try_charge - get charge of PAGE_SIZE.
> - * @mm: an mm_struct which is charged against. (when *memcg is NULL)
> - * @gfp_mask: gfp_mask for reclaim.
> - * @memcg: a pointer to memory cgroup which is charged against.
> - *
> - * charge against memory cgroup pointed by *memcg. if *memcg == NULL, estimated
> - * memory cgroup from @mm is got and stored in *memcg.
> - *
> - * Returns 0 if success. -ENOMEM at failure.
> +/*
> + * Unlike exported interface, "oom" parameter is added. if oom==true,
> + * oom-killer can be invoked.
> */
> -
> -int mem_cgroup_try_charge(struct mm_struct *mm,
> - gfp_t gfp_mask, struct mem_cgroup **memcg)
> +static int __mem_cgroup_try_charge(struct mm_struct *mm,
> + gfp_t gfp_mask, struct mem_cgroup **memcg, bool oom)
> {
> struct mem_cgroup *mem;
> int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> @@ -528,7 +522,8 @@ int mem_cgroup_try_charge(struct mm_stru
> continue;
>
> if (!nr_retries--) {
> - mem_cgroup_out_of_memory(mem, gfp_mask);
> + if (oom)
> + mem_cgroup_out_of_memory(mem, gfp_mask);
> goto nomem;
> }
> }
> @@ -538,6 +533,25 @@ nomem:
> return -ENOMEM;
> }
>
> +/**
> + * mem_cgroup_try_charge - get charge of PAGE_SIZE.
> + * @mm: an mm_struct which is charged against. (when *memcg is NULL)
> + * @gfp_mask: gfp_mask for reclaim.
> + * @memcg: a pointer to memory cgroup which is charged against.
> + *
> + * charge against memory cgroup pointed by *memcg. if *memcg == NULL, estimated
> + * memory cgroup from @mm is got and stored in *memcg.
> + *
> + * Returns 0 if success. -ENOMEM at failure.
> + * This call can invoke OOM-Killer.
> + */
> +
> +int mem_cgroup_try_charge(struct mm_struct *mm,
> + gfp_t mask, struct mem_cgroup **memcg)
> +{
> + return __mem_cgroup_try_charge(mm, mask, memcg, true);
> +}
> +
> /*
> * commit a charge got by mem_cgroup_try_charge() and makes page_cgroup to be
> * USED state. If already USED, uncharge and return.
> @@ -571,11 +585,109 @@ static void __mem_cgroup_commit_charge(s
> mz = page_cgroup_zoneinfo(pc);
>
> spin_lock_irqsave(&mz->lru_lock, flags);
> - __mem_cgroup_add_list(mz, pc);
> + __mem_cgroup_add_list(mz, pc, true);
> spin_unlock_irqrestore(&mz->lru_lock, flags);
> unlock_page_cgroup(pc);
> }
>
> +/**
> + * mem_cgroup_move_account - move account of the page
> + * @pc: page_cgroup of the page.
> + * @from: mem_cgroup which the page is moved from.
> + * @to: mem_cgroup which the page is moved to. @from != @to.
> + *
> + * The caller must confirm following.
> + * 1. disable irq.
> + * 2. lru_lock of old mem_cgroup(@from) should be held.
> + *
> + * returns 0 at success,
> + * returns -EBUSY when lock is busy or "pc" is unstable.
> + *
> + * This function does "uncharge" from old cgroup but doesn't do "charge" to
> + * new cgroup. It should be done by a caller.
> + */
> +
> +static int mem_cgroup_move_account(struct page_cgroup *pc,
> + struct mem_cgroup *from, struct mem_cgroup *to)
> +{
> + struct mem_cgroup_per_zone *from_mz, *to_mz;
> + int nid, zid;
> + int ret = -EBUSY;
> +
> + VM_BUG_ON(!irqs_disabled());
> + VM_BUG_ON(from == to);
> +
> + nid = page_cgroup_nid(pc);
> + zid = page_cgroup_zid(pc);
> + from_mz = mem_cgroup_zoneinfo(from, nid, zid);
> + to_mz = mem_cgroup_zoneinfo(to, nid, zid);
> +
> +
> + if (!trylock_page_cgroup(pc))
> + return ret;
> +
> + if (!PageCgroupUsed(pc))
> + goto out;
> +
> + if (pc->mem_cgroup != from)
> + goto out;
> +
> + if (spin_trylock(&to_mz->lru_lock)) {
> + __mem_cgroup_remove_list(from_mz, pc);
> + css_put(&from->css);
> + res_counter_uncharge(&from->res, PAGE_SIZE);
> + pc->mem_cgroup = to;
> + css_get(&to->css);
> + __mem_cgroup_add_list(to_mz, pc, false);
> + ret = 0;
> + spin_unlock(&to_mz->lru_lock);
> + }
> +out:
> + unlock_page_cgroup(pc);
> + return ret;
> +}
> +
> +/*
> + * move charges to its parent.
> + */
> +
> +static int mem_cgroup_move_parent(struct page_cgroup *pc,
> + struct mem_cgroup *child,
> + gfp_t gfp_mask)
> +{
> + struct cgroup *cg = child->css.cgroup;
> + struct cgroup *pcg = cg->parent;
> + struct mem_cgroup *parent;
> + struct mem_cgroup_per_zone *mz;
> + unsigned long flags;
> + int ret;
> +
> + /* Is ROOT ? */
> + if (!pcg)
> + return -EINVAL;
> +
> + parent = mem_cgroup_from_cont(pcg);
> +
> + ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false);
> + if (ret)
> + return ret;
> +
> + mz = mem_cgroup_zoneinfo(child,
> + page_cgroup_nid(pc), page_cgroup_zid(pc));
> +
> + spin_lock_irqsave(&mz->lru_lock, flags);
> + ret = mem_cgroup_move_account(pc, child, parent);
> + spin_unlock_irqrestore(&mz->lru_lock, flags);
> +
> + /* drop extra refcnt */
> + css_put(&parent->css);
> + /* uncharge if move fails */
> + if (ret)
> + res_counter_uncharge(&parent->res, PAGE_SIZE);
> +
> + return ret;
> +}
> +
> /*
> * Charge the memory controller for page usage.
> * Return
> @@ -597,7 +709,7 @@ static int mem_cgroup_charge_common(stru
> prefetchw(pc);
>
> mem = memcg;
> - ret = mem_cgroup_try_charge(mm, gfp_mask, &mem);
> + ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true);
> if (ret)
> return ret;
>
> @@ -898,46 +1010,52 @@ int mem_cgroup_resize_limit(struct mem_c
> * This routine traverse page_cgroup in given list and drop them all.
> * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
> */
> -#define FORCE_UNCHARGE_BATCH (128)
> -static void mem_cgroup_force_empty_list(struct mem_cgroup *mem,
> +static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
> struct mem_cgroup_per_zone *mz,
> enum lru_list lru)
> {
> - struct page_cgroup *pc;
> - struct page *page;
> - int count = FORCE_UNCHARGE_BATCH;
> + struct page_cgroup *pc, *busy;
> unsigned long flags;
> + unsigned long loop;
> struct list_head *list;
> + int ret = 0;
>
> list = &mz->lists[lru];
>
> - spin_lock_irqsave(&mz->lru_lock, flags);
> - while (!list_empty(list)) {
> - pc = list_entry(list->prev, struct page_cgroup, lru);
> - page = pc->page;
> - if (!PageCgroupUsed(pc))
> + loop = MEM_CGROUP_ZSTAT(mz, lru);
> + /* give some margin against EBUSY etc...*/
> + loop += 256;
> + busy = NULL;
> + while (loop--) {
> + ret = 0;
> + spin_lock_irqsave(&mz->lru_lock, flags);
> + if (list_empty(list)) {
> + spin_unlock_irqrestore(&mz->lru_lock, flags);
> break;
> - get_page(page);
> + }
> + pc = list_entry(list->prev, struct page_cgroup, lru);
> + if (busy == pc) {
> + list_move(&pc->lru, list);
> + busy = 0;
> + spin_unlock_irqrestore(&mz->lru_lock, flags);
> + continue;
> + }
> spin_unlock_irqrestore(&mz->lru_lock, flags);
> - /*
> - * Check if this page is on LRU. !LRU page can be found
> - * if it's under page migration.
> - */
> - if (PageLRU(page)) {
> - __mem_cgroup_uncharge_common(page,
> - MEM_CGROUP_CHARGE_TYPE_FORCE);
> - put_page(page);
> - if (--count <= 0) {
> - count = FORCE_UNCHARGE_BATCH;
> - cond_resched();
> - }
> - } else {
> - spin_lock_irqsave(&mz->lru_lock, flags);
> +
> + ret = mem_cgroup_move_parent(pc, mem, GFP_HIGHUSER_MOVABLE);
> + if (ret == -ENOMEM)
> break;
> - }
> - spin_lock_irqsave(&mz->lru_lock, flags);
> +
> + if (ret == -EBUSY || ret == -EINVAL) {
> + /* found lock contention or "pc" is obsolete. */
> + busy = pc;
> + cond_resched();
> + } else
> + busy = NULL;
> }
> - spin_unlock_irqrestore(&mz->lru_lock, flags);
> + if (!ret && !list_empty(list))
> + return -EBUSY;
> + return ret;
> }
>
> /*
> @@ -946,34 +1064,68 @@ static void mem_cgroup_force_empty_list(
> */
> static int mem_cgroup_force_empty(struct mem_cgroup *mem)
> {
> - int ret = -EBUSY;
> - int node, zid;
> + int ret;
> + int node, zid, shrink;
> + int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
>
> css_get(&mem->css);
> - /*
> - * page reclaim code (kswapd etc..) will move pages between
> - * active_list <-> inactive_list while we don't take a lock.
> - * So, we have to do loop here until all lists are empty.
> - */
> +
> + shrink = 0;
> +move_account:
> while (mem->res.usage > 0) {
> + ret = -EBUSY;
> if (atomic_read(&mem->css.cgroup->count) > 0)
> goto out;
> +
> /* This is for making all *used* pages to be on LRU. */
> lru_add_drain_all();
> - for_each_node_state(node, N_POSSIBLE)
> - for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> + ret = 0;
> + for_each_node_state(node, N_POSSIBLE) {
> + for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
> struct mem_cgroup_per_zone *mz;
> enum lru_list l;
> mz = mem_cgroup_zoneinfo(mem, node, zid);
> - for_each_lru(l)
> - mem_cgroup_force_empty_list(mem, mz, l);
> + for_each_lru(l) {
> + ret = mem_cgroup_force_empty_list(mem,
> + mz, l);
> + if (ret)
> + break;
> + }
> }
> + if (ret)
> + break;
> + }
> + /* it seems parent cgroup doesn't have enough mem */
> + if (ret == -ENOMEM)
> + goto try_to_free;
> cond_resched();
> }
> ret = 0;
> out:
> css_put(&mem->css);
> return ret;
> +
> +try_to_free:
> + /* returns EBUSY if we come here twice. */
> + if (shrink) {
> + ret = -EBUSY;
> + goto out;
> + }
> + /* try to free all pages in this cgroup */
> + shrink = 1;
> + while (nr_retries && mem->res.usage > 0) {
> + int progress;
> + progress = try_to_free_mem_cgroup_pages(mem,
> + GFP_HIGHUSER_MOVABLE);
> + if (!progress)
> + nr_retries--;
> +
> + }
> + /* try move_account...there may be some *locked* pages. */
> + if (mem->res.usage)
> + goto move_account;
> + ret = 0;
> + goto out;
> }
>
> static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
> @@ -1022,11 +1174,6 @@ static int mem_cgroup_reset(struct cgrou
> return 0;
> }
>
> -static int mem_force_empty_write(struct cgroup *cont, unsigned int event)
> -{
> - return mem_cgroup_force_empty(mem_cgroup_from_cont(cont));
> -}
> -
> static const struct mem_cgroup_stat_desc {
> const char *msg;
> u64 unit;
> @@ -1103,10 +1250,6 @@ static struct cftype mem_cgroup_files[]
> .read_u64 = mem_cgroup_read,
> },
> {
> - .name = "force_empty",
> - .trigger = mem_force_empty_write,
> - },
> - {
> .name = "stat",
> .read_map = mem_control_stat_show,
> },
> Index: mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/Documentation/controllers/memory.txt
> +++ mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
> @@ -207,12 +207,6 @@ exceeded.
> The memory.stat file gives accounting information. Now, the number of
> caches, RSS and Active pages/Inactive pages are shown.
>
> -The memory.force_empty gives an interface to drop *all* charges by force.
> -
> -# echo 1 > memory.force_empty
> -
> -will drop all charges in cgroup. Currently, this is maintained for test.
> -
> 4. Testing
>
> Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11].
> @@ -242,8 +236,10 @@ reclaimed.
>
> A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
> cgroup might have some charge associated with it, even though all
> -tasks have migrated away from it. Such charges are automatically dropped at
> -rmdir() if there are no tasks.
> +tasks have migrated away from it.
> +Such charges are moved to its parent as much as possible and freed if parent
> +is full.
> +If both of them are busy, rmdir() returns -EBUSY.
>
> 5. TODO
>
>
This one is just an experimental. (means may cause panic easily)
==
Experimental!
Now,
- page and page_cgroup is one-to-one and statically allocated.
Then, page-to-pc and pc-to-page is always valid.
- we can trust PCG_USED bit.
Then, it seems that there is a chance to synchronize mem cgroup's lru
and global lru....in other words, remove mz->lru_lock and depends on zone->lru_lock.
This patch also modifies mem_cgroup's LRU to be just a sub-lru of zone....
memcg's LRU is not independent from global LRU after this.
Pros.
- easy maintainance.
- we can reduce types of lock.
- we don't have to duplicate LRU_ flags.
- memcg can make use of laziness of pagevec.
- when we make use of zone->lru_lock, the lock is not per memcg and
this is much robust than mz->lru_lock which depends on pc->mem_cgroup value.
Cons.
- may increase cost of LRU rotation.
(no impact if memcg is not configured.)
Signed-off-by: KAMEZAWA Hiruyoki <[email protected]>
fs/splice.c | 1
include/linux/memcontrol.h | 29 +++-
include/linux/mm_inline.h | 3
include/linux/page_cgroup.h | 17 --
mm/memcontrol.c | 307 +++++++++++++++++---------------------------
mm/page_cgroup.c | 1
mm/swap.c | 1
mm/vmscan.c | 9 -
8 files changed, 162 insertions(+), 206 deletions(-)
Index: mmotm-2.6.28-rc2-Oct30/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-rc2-Oct30.orig/mm/memcontrol.c
+++ mmotm-2.6.28-rc2-Oct30/mm/memcontrol.c
@@ -35,6 +35,7 @@
#include <linux/vmalloc.h>
#include <linux/mm_inline.h>
#include <linux/page_cgroup.h>
+#include "internal.h"
#include <asm/uaccess.h>
@@ -99,7 +100,6 @@ struct mem_cgroup_per_zone {
/*
* spin_lock to protect the per cgroup LRU
*/
- spinlock_t lru_lock;
struct list_head lists[NR_LRU_LISTS];
unsigned long count[NR_LRU_LISTS];
};
@@ -167,14 +167,12 @@ enum charge_type {
/* only for here (for easy reading.) */
#define PCGF_CACHE (1UL << PCG_CACHE)
#define PCGF_USED (1UL << PCG_USED)
-#define PCGF_ACTIVE (1UL << PCG_ACTIVE)
#define PCGF_LOCK (1UL << PCG_LOCK)
-#define PCGF_FILE (1UL << PCG_FILE)
static const unsigned long
pcg_default_flags[NR_CHARGE_TYPE] = {
- PCGF_CACHE | PCGF_FILE | PCGF_USED | PCGF_LOCK, /* File Cache */
- PCGF_ACTIVE | PCGF_USED | PCGF_LOCK, /* Anon */
- PCGF_ACTIVE | PCGF_CACHE | PCGF_USED | PCGF_LOCK, /* Shmem */
+ PCGF_CACHE | PCGF_USED | PCGF_LOCK, /* File Cache */
+ PCGF_USED | PCGF_LOCK, /* Anon */
+ PCGF_CACHE | PCGF_USED | PCGF_LOCK, /* Shmem */
0, /* FORCE */
};
@@ -188,9 +186,6 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
static void mem_cgroup_forget_swapref(struct mem_cgroup *mem);
-/*
- * Always modified under lru lock. Then, not necessary to preempt_disable()
- */
static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
struct page_cgroup *pc,
bool charge)
@@ -198,10 +193,9 @@ static void mem_cgroup_charge_statistics
int val = (charge)? 1 : -1;
struct mem_cgroup_stat *stat = &mem->stat;
struct mem_cgroup_stat_cpu *cpustat;
+ int cpu = get_cpu();
- VM_BUG_ON(!irqs_disabled());
-
- cpustat = &stat->cpustat[smp_processor_id()];
+ cpustat = &stat->cpustat[cpu];
if (PageCgroupCache(pc))
__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_CACHE, val);
else
@@ -213,6 +207,7 @@ static void mem_cgroup_charge_statistics
else
__mem_cgroup_stat_add_safe(cpustat,
MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
+ put_cpu();
}
static struct mem_cgroup_per_zone *
@@ -267,80 +262,91 @@ struct mem_cgroup *mem_cgroup_from_task(
struct mem_cgroup, css);
}
-static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
- struct page_cgroup *pc)
-{
- int lru = LRU_BASE;
+/*
+ * Following LRU functions are allowed to be used without PCG_LOCK.
+ * Operations are called by routine of global LRU independently from memcg.
+ * What we have to take care of here is validness of pc->mem_cgroup.
+ *
+ * Changes to pc->mem_cgroup happens when
+ * 1. charge
+ * 2. moving account
+ * In typical case, "charge" is done before add-to-lru. Exception is SwapCache.
+ * It is added to LRU before charge.
+ * If PCG_USED bit is not set, page_cgroup is not added to this private LRU.
+ * When moving account, the page is not on LRU. It's isolated.
+ */
- if (PageCgroupUnevictable(pc))
- lru = LRU_UNEVICTABLE;
- else {
- if (PageCgroupActive(pc))
- lru += LRU_ACTIVE;
- if (PageCgroupFile(pc))
- lru += LRU_FILE;
- }
+void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
+{
+ struct page_cgroup *pc;
+ struct mem_cgroup *mem;
+ struct mem_cgroup_per_zone *mz;
+ if (mem_cgroup_subsys.disabled)
+ return;
+ pc = lookup_page_cgroup(page);
+ /* can happen while we handle swapcache */
+ if (list_empty(&pc->lru))
+ return;
+ mz = page_cgroup_zoneinfo(pc);
+ mem = pc->mem_cgroup;
MEM_CGROUP_ZSTAT(mz, lru) -= 1;
-
- mem_cgroup_charge_statistics(pc->mem_cgroup, pc, false);
- list_del(&pc->lru);
+ list_del_init(&pc->lru);
+ return;
}
-static void __mem_cgroup_add_list(struct mem_cgroup_per_zone *mz,
- struct page_cgroup *pc, bool hot)
+void mem_cgroup_del_lru(struct page *page)
{
- int lru = LRU_BASE;
+ mem_cgroup_del_lru_list(page, page_lru(page));
+}
- if (PageCgroupUnevictable(pc))
- lru = LRU_UNEVICTABLE;
- else {
- if (PageCgroupActive(pc))
- lru += LRU_ACTIVE;
- if (PageCgroupFile(pc))
- lru += LRU_FILE;
- }
+void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
+{
+ struct mem_cgroup_per_zone *mz;
+ struct page_cgroup *pc;
- MEM_CGROUP_ZSTAT(mz, lru) += 1;
- if (hot)
- list_add(&pc->lru, &mz->lists[lru]);
- else
- list_add_tail(&pc->lru, &mz->lists[lru]);
+ if (mem_cgroup_subsys.disabled)
+ return;
- mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
+ pc = lookup_page_cgroup(page);
+ mz = page_cgroup_zoneinfo(pc);
+ list_move(&pc->lru, &mz->lists[lru]);
}
-static void __mem_cgroup_move_lists(struct page_cgroup *pc, enum lru_list lru)
+void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
{
- struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
- int active = PageCgroupActive(pc);
- int file = PageCgroupFile(pc);
- int unevictable = PageCgroupUnevictable(pc);
- enum lru_list from = unevictable ? LRU_UNEVICTABLE :
- (LRU_FILE * !!file + !!active);
+ struct page_cgroup *pc;
+ struct mem_cgroup_per_zone *mz;
- if (lru == from)
+ if (mem_cgroup_subsys.disabled)
+ return;
+ pc = lookup_page_cgroup(page);
+ /* barrier to sync with "charge" */
+ smp_rmb();
+ if (!PageCgroupUsed(pc))
return;
- MEM_CGROUP_ZSTAT(mz, from) -= 1;
- /*
- * However this is done under mz->lru_lock, another flags, which
- * are not related to LRU, will be modified from out-of-lock.
- * We have to use atomic set/clear flags.
- */
- if (is_unevictable_lru(lru)) {
- ClearPageCgroupActive(pc);
- SetPageCgroupUnevictable(pc);
- } else {
- if (is_active_lru(lru))
- SetPageCgroupActive(pc);
- else
- ClearPageCgroupActive(pc);
- ClearPageCgroupUnevictable(pc);
- }
-
+ mz = page_cgroup_zoneinfo(pc);
MEM_CGROUP_ZSTAT(mz, lru) += 1;
- list_move(&pc->lru, &mz->lists[lru]);
+ list_add(&pc->lru, &mz->lists[lru]);
+}
+/*
+ * To add swapcache into LRU. Be careful to all this function.
+ * zone->lru_lock shouldn't be held and irq must not be disabled.
+ */
+static void mem_cgroup_lru_fixup(struct page *page)
+{
+ if (!isolate_lru_page(page))
+ putback_lru_page(page);
+}
+
+void mem_cgroup_move_lists(struct page *page,
+ enum lru_list from, enum lru_list to)
+{
+ if (mem_cgroup_subsys.disabled)
+ return;
+ mem_cgroup_del_lru_list(page, from);
+ mem_cgroup_add_lru_list(page, to);
}
int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
@@ -354,37 +360,6 @@ int task_in_mem_cgroup(struct task_struc
}
/*
- * This routine assumes that the appropriate zone's lru lock is already held
- */
-void mem_cgroup_move_lists(struct page *page, enum lru_list lru)
-{
- struct page_cgroup *pc;
- struct mem_cgroup_per_zone *mz;
- unsigned long flags;
-
- if (mem_cgroup_subsys.disabled)
- return;
-
- /*
- * We cannot lock_page_cgroup while holding zone's lru_lock,
- * because other holders of lock_page_cgroup can be interrupted
- * with an attempt to rotate_reclaimable_page. But we cannot
- * safely get to page_cgroup without it, so just try_lock it:
- * mem_cgroup_isolate_pages allows for page left on wrong list.
- */
- pc = lookup_page_cgroup(page);
- if (!trylock_page_cgroup(pc))
- return;
- if (pc && PageCgroupUsed(pc)) {
- mz = page_cgroup_zoneinfo(pc);
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_move_lists(pc, lru);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
- }
- unlock_page_cgroup(pc);
-}
-
-/*
* Calculate mapped_ratio under memory controller. This will be used in
* vmscan.c for deteremining we have to reclaim mapped pages.
*/
@@ -463,40 +438,24 @@ unsigned long mem_cgroup_isolate_pages(u
mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
src = &mz->lists[lru];
- spin_lock(&mz->lru_lock);
scan = 0;
list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
if (scan >= nr_to_scan)
break;
+
+ page = pc->page;
if (unlikely(!PageCgroupUsed(pc)))
continue;
- page = pc->page;
-
if (unlikely(!PageLRU(page)))
continue;
- /*
- * TODO: play better with lumpy reclaim, grabbing anything.
- */
- if (PageUnevictable(page) ||
- (PageActive(page) && !active) ||
- (!PageActive(page) && active)) {
- __mem_cgroup_move_lists(pc, page_lru(page));
- continue;
- }
-
scan++;
- list_move(&pc->lru, &pc_list);
-
if (__isolate_lru_page(page, mode, file) == 0) {
list_move(&page->lru, dst);
nr_taken++;
}
}
- list_splice(&pc_list, src);
- spin_unlock(&mz->lru_lock);
-
*scanned = scan;
return nr_taken;
}
@@ -610,9 +569,6 @@ static void __mem_cgroup_commit_charge(s
struct page_cgroup *pc,
enum charge_type ctype)
{
- struct mem_cgroup_per_zone *mz;
- unsigned long flags;
-
/* try_charge() can return NULL to *memcg, taking care of it. */
if (!mem)
return;
@@ -627,17 +583,11 @@ static void __mem_cgroup_commit_charge(s
return;
}
pc->mem_cgroup = mem;
- /*
- * If a page is accounted as a page cache, insert to inactive list.
- * If anon, insert to active list.
- */
+ smp_wmb();
pc->flags = pcg_default_flags[ctype];
- mz = page_cgroup_zoneinfo(pc);
+ mem_cgroup_charge_statistics(mem, pc, true);
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_add_list(mz, pc, true);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
unlock_page_cgroup(pc);
}
@@ -648,8 +598,7 @@ static void __mem_cgroup_commit_charge(s
* @to: mem_cgroup which the page is moved to. @from != @to.
*
* The caller must confirm following.
- * 1. disable irq.
- * 2. lru_lock of old mem_cgroup(@from) should be held.
+ * - page is not on LRU
*
* returns 0 at success,
* returns -EBUSY when lock is busy or "pc" is unstable.
@@ -665,15 +614,14 @@ static int mem_cgroup_move_account(struc
int nid, zid;
int ret = -EBUSY;
- VM_BUG_ON(!irqs_disabled());
VM_BUG_ON(from == to);
+ VM_BUG_ON(PageLRU(pc->page));
nid = page_cgroup_nid(pc);
zid = page_cgroup_zid(pc);
from_mz = mem_cgroup_zoneinfo(from, nid, zid);
to_mz = mem_cgroup_zoneinfo(to, nid, zid);
-
if (!trylock_page_cgroup(pc))
return ret;
@@ -683,18 +631,14 @@ static int mem_cgroup_move_account(struc
if (pc->mem_cgroup != from)
goto out;
- if (spin_trylock(&to_mz->lru_lock)) {
- __mem_cgroup_remove_list(from_mz, pc);
- css_put(&from->css);
- res_counter_uncharge(&from->res, PAGE_SIZE);
- if (do_swap_account)
- res_counter_uncharge(&from->memsw, PAGE_SIZE);
- pc->mem_cgroup = to;
- css_get(&to->css);
- __mem_cgroup_add_list(to_mz, pc, false);
- ret = 0;
- spin_unlock(&to_mz->lru_lock);
- }
+ css_put(&from->css);
+ res_counter_uncharge(&from->res, PAGE_SIZE);
+ mem_cgroup_charge_statistics(from, pc, false);
+ if (do_swap_account)
+ res_counter_uncharge(&from->memsw, PAGE_SIZE);
+ pc->mem_cgroup = to;
+ css_get(&to->css);
+ ret = 0;
out:
unlock_page_cgroup(pc);
return ret;
@@ -708,11 +652,10 @@ static int mem_cgroup_move_parent(struct
struct mem_cgroup *child,
gfp_t gfp_mask)
{
+ struct page *page = pc->page;
struct cgroup *cg = child->css.cgroup;
struct cgroup *pcg = cg->parent;
struct mem_cgroup *parent;
- struct mem_cgroup_per_zone *mz;
- unsigned long flags;
int ret;
/* Is ROOT ? */
@@ -725,22 +668,23 @@ static int mem_cgroup_move_parent(struct
if (ret)
return ret;
- mz = mem_cgroup_zoneinfo(child,
- page_cgroup_nid(pc), page_cgroup_zid(pc));
+ ret = isolate_lru_page(page);
+
+ if (ret)
+ goto cancel;
- spin_lock_irqsave(&mz->lru_lock, flags);
ret = mem_cgroup_move_account(pc, child, parent);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
- /* drop extra refcnt */
+ /* drop extra refcnt by try_charge() (move_account increment one) */
css_put(&parent->css);
+ putback_lru_page(page);
+ if (!ret)
+ return 0;
/* uncharge if move fails */
- if (ret) {
- res_counter_uncharge(&parent->res, PAGE_SIZE);
- if (do_swap_account)
- res_counter_uncharge(&parent->memsw, PAGE_SIZE);
- }
-
+cancel:
+ res_counter_uncharge(&parent->res, PAGE_SIZE);
+ if (do_swap_account)
+ res_counter_uncharge(&parent->memsw, PAGE_SIZE);
return ret;
}
@@ -905,6 +849,7 @@ int mem_cgroup_cache_charge_swapin(struc
mem_cgroup_forget_swapref(mem);
}
}
+ mem_cgroup_lru_fixup(page);
unlock_out:
if (!locked)
unlock_page(page);
@@ -940,6 +885,7 @@ void mem_cgroup_commit_charge_swapin(str
}
}
+ mem_cgroup_lru_fixup(page);
}
void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
@@ -963,7 +909,6 @@ __mem_cgroup_uncharge_common(struct page
struct page_cgroup *pc;
struct mem_cgroup *mem = NULL;
struct mem_cgroup_per_zone *mz;
- unsigned long flags;
if (mem_cgroup_subsys.disabled)
return NULL;
@@ -1008,9 +953,6 @@ __mem_cgroup_uncharge_common(struct page
ClearPageCgroupUsed(pc);
mz = page_cgroup_zoneinfo(pc);
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_remove_list(mz, pc);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
unlock_page_cgroup(pc);
css_put(&mem->css);
@@ -1259,21 +1201,22 @@ int mem_cgroup_resize_memsw_limit(struct
return ret;
}
-
/*
* This routine traverse page_cgroup in given list and drop them all.
* *And* this routine doesn't reclaim page itself, just removes page_cgroup.
*/
static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
- struct mem_cgroup_per_zone *mz,
- enum lru_list lru)
+ int node, int zid, enum lru_list lru)
{
+ struct zone *zone;
+ struct mem_cgroup_per_zone *mz;
struct page_cgroup *pc, *busy;
- unsigned long flags;
- unsigned long loop;
+ unsigned long flags, loop;
struct list_head *list;
int ret = 0;
+ zone = &NODE_DATA(node)->node_zones[zid];
+ mz = mem_cgroup_zoneinfo(mem, node, zid);
list = &mz->lists[lru];
loop = MEM_CGROUP_ZSTAT(mz, lru);
@@ -1282,19 +1225,19 @@ static int mem_cgroup_force_empty_list(s
busy = NULL;
while (loop--) {
ret = 0;
- spin_lock_irqsave(&mz->lru_lock, flags);
+ spin_lock_irqsave(&zone->lru_lock, flags);
if (list_empty(list)) {
- spin_unlock_irqrestore(&mz->lru_lock, flags);
+ spin_unlock_irqrestore(&zone->lru_lock, flags);
break;
}
pc = list_entry(list->prev, struct page_cgroup, lru);
if (busy == pc) {
list_move(&pc->lru, list);
busy = 0;
- spin_unlock_irqrestore(&mz->lru_lock, flags);
+ spin_unlock_irqrestore(&zone->lru_lock, flags);
continue;
}
- spin_unlock_irqrestore(&mz->lru_lock, flags);
+ spin_unlock_irqrestore(&zone->lru_lock, flags);
ret = mem_cgroup_move_parent(pc, mem, GFP_HIGHUSER_MOVABLE);
if (ret == -ENOMEM)
@@ -1307,6 +1250,7 @@ static int mem_cgroup_force_empty_list(s
} else
busy = NULL;
}
+
if (!ret && !list_empty(list))
return -EBUSY;
return ret;
@@ -1336,12 +1280,10 @@ move_account:
ret = 0;
for_each_node_state(node, N_POSSIBLE) {
for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
- struct mem_cgroup_per_zone *mz;
enum lru_list l;
- mz = mem_cgroup_zoneinfo(mem, node, zid);
for_each_lru(l) {
ret = mem_cgroup_force_empty_list(mem,
- mz, l);
+ node, zid, l);
if (ret)
break;
}
@@ -1375,6 +1317,7 @@ try_to_free:
nr_retries--;
}
+ lru_add_drain();
/* try move_account...there may be some *locked* pages. */
if (mem->res.usage)
goto move_account;
@@ -1595,7 +1538,6 @@ static int alloc_mem_cgroup_per_zone_inf
for (zone = 0; zone < MAX_NR_ZONES; zone++) {
mz = &pn->zoneinfo[zone];
- spin_lock_init(&mz->lru_lock);
for_each_lru(l)
INIT_LIST_HEAD(&mz->lists[l]);
}
@@ -1637,10 +1579,17 @@ static struct mem_cgroup *mem_cgroup_all
static void mem_cgroup_free(struct mem_cgroup *mem)
{
+ int node;
+
if (do_swap_account) {
if (atomic_read(&mem->swapref) > 0)
return;
}
+
+
+ for_each_node_state(node, N_POSSIBLE)
+ free_mem_cgroup_per_zone_info(mem, node);
+
if (sizeof(*mem) < PAGE_SIZE)
kfree(mem);
else
@@ -1713,12 +1662,6 @@ static void mem_cgroup_pre_destroy(struc
static void mem_cgroup_destroy(struct cgroup_subsys *ss,
struct cgroup *cont)
{
- int node;
- struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
-
- for_each_node_state(node, N_POSSIBLE)
- free_mem_cgroup_per_zone_info(mem, node);
-
mem_cgroup_free(mem_cgroup_from_cont(cont));
}
Index: mmotm-2.6.28-rc2-Oct30/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-rc2-Oct30.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-rc2-Oct30/include/linux/memcontrol.h
@@ -40,7 +40,12 @@ extern void mem_cgroup_cancel_charge_swa
extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
-extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
+extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
+extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
+extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
+extern void mem_cgroup_del_lru(struct page *page);
+extern void mem_cgroup_move_lists(struct page *page,
+ enum lru_list from, enum lru_list to);
extern void mem_cgroup_uncharge_page(struct page *page);
extern void mem_cgroup_uncharge_cache_page(struct page *page);
extern int mem_cgroup_shrink_usage(struct mm_struct *mm, gfp_t gfp_mask);
@@ -131,7 +136,27 @@ static inline int mem_cgroup_shrink_usag
return 0;
}
-static inline void mem_cgroup_move_lists(struct page *page, bool active)
+static inline void mem_cgroup_add_lru_list(struct page *page, int lru)
+{
+}
+
+static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
+{
+ return ;
+}
+
+static inline void mem_cgroup_rotate_lru_list(struct page *page, int lru)
+{
+ return ;
+}
+
+static inline void mem_cgroup_del_lru(struct page *page)
+{
+ return ;
+}
+
+static inline void
+mem_cgroup_move_lists(struct page *page, enum lru_list from, enum lru_list to)
{
}
Index: mmotm-2.6.28-rc2-Oct30/include/linux/mm_inline.h
===================================================================
--- mmotm-2.6.28-rc2-Oct30.orig/include/linux/mm_inline.h
+++ mmotm-2.6.28-rc2-Oct30/include/linux/mm_inline.h
@@ -28,6 +28,7 @@ add_page_to_lru_list(struct zone *zone,
{
list_add(&page->lru, &zone->lru[l].list);
__inc_zone_state(zone, NR_LRU_BASE + l);
+ mem_cgroup_add_lru_list(page, l);
}
static inline void
@@ -35,6 +36,7 @@ del_page_from_lru_list(struct zone *zone
{
list_del(&page->lru);
__dec_zone_state(zone, NR_LRU_BASE + l);
+ mem_cgroup_del_lru_list(page, l);
}
static inline void
@@ -54,6 +56,7 @@ del_page_from_lru(struct zone *zone, str
l += page_is_file_cache(page);
}
__dec_zone_state(zone, NR_LRU_BASE + l);
+ mem_cgroup_del_lru_list(page, l);
}
/**
Index: mmotm-2.6.28-rc2-Oct30/mm/page_cgroup.c
===================================================================
--- mmotm-2.6.28-rc2-Oct30.orig/mm/page_cgroup.c
+++ mmotm-2.6.28-rc2-Oct30/mm/page_cgroup.c
@@ -17,6 +17,7 @@ __init_page_cgroup(struct page_cgroup *p
pc->flags = 0;
pc->mem_cgroup = NULL;
pc->page = pfn_to_page(pfn);
+ INIT_LIST_HEAD(&pc->lru);
}
static unsigned long total_usage;
Index: mmotm-2.6.28-rc2-Oct30/fs/splice.c
===================================================================
--- mmotm-2.6.28-rc2-Oct30.orig/fs/splice.c
+++ mmotm-2.6.28-rc2-Oct30/fs/splice.c
@@ -21,6 +21,7 @@
#include <linux/file.h>
#include <linux/pagemap.h>
#include <linux/splice.h>
+#include <linux/memcontrol.h>
#include <linux/mm_inline.h>
#include <linux/swap.h>
#include <linux/writeback.h>
Index: mmotm-2.6.28-rc2-Oct30/mm/vmscan.c
===================================================================
--- mmotm-2.6.28-rc2-Oct30.orig/mm/vmscan.c
+++ mmotm-2.6.28-rc2-Oct30/mm/vmscan.c
@@ -516,7 +516,6 @@ redo:
lru = LRU_UNEVICTABLE;
add_page_to_unevictable_list(page);
}
- mem_cgroup_move_lists(page, lru);
/*
* page's status can change while we move it among lru. If an evictable
@@ -551,7 +550,6 @@ void putback_lru_page(struct page *page)
lru = !!TestClearPageActive(page) + page_is_file_cache(page);
lru_cache_add_lru(page, lru);
- mem_cgroup_move_lists(page, lru);
put_page(page);
}
#endif /* CONFIG_UNEVICTABLE_LRU */
@@ -823,6 +821,7 @@ int __isolate_lru_page(struct page *page
return ret;
ret = -EBUSY;
+
if (likely(get_page_unless_zero(page))) {
/*
* Be careful not to clear PageLRU until after we're
@@ -831,6 +830,7 @@ int __isolate_lru_page(struct page *page
*/
ClearPageLRU(page);
ret = 0;
+ mem_cgroup_del_lru(page);
}
return ret;
@@ -1144,7 +1144,6 @@ static unsigned long shrink_inactive_lis
SetPageLRU(page);
lru = page_lru(page);
add_page_to_lru_list(zone, page, lru);
- mem_cgroup_move_lists(page, lru);
if (PageActive(page) && scan_global_lru(sc)) {
int file = !!page_is_file_cache(page);
zone->recent_rotated[file]++;
@@ -1277,7 +1276,7 @@ static void shrink_active_list(unsigned
ClearPageActive(page);
list_move(&page->lru, &zone->lru[lru].list);
- mem_cgroup_move_lists(page, lru);
+ mem_cgroup_add_lru_list(page, lru);
pgmoved++;
if (!pagevec_add(&pvec, page)) {
__mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved);
@@ -2436,6 +2435,7 @@ retry:
__dec_zone_state(zone, NR_UNEVICTABLE);
list_move(&page->lru, &zone->lru[l].list);
+ mem_cgroup_move_lists(page, LRU_UNEVICTABLE, l);
__inc_zone_state(zone, NR_INACTIVE_ANON + l);
__count_vm_event(UNEVICTABLE_PGRESCUED);
} else {
@@ -2444,6 +2444,7 @@ retry:
*/
SetPageUnevictable(page);
list_move(&page->lru, &zone->lru[LRU_UNEVICTABLE].list);
+ mem_cgroup_rotate_lru_list(page, LRU_UNEVICTABLE);
if (page_evictable(page, NULL))
goto retry;
}
Index: mmotm-2.6.28-rc2-Oct30/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.28-rc2-Oct30.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.28-rc2-Oct30/include/linux/page_cgroup.h
@@ -26,10 +26,6 @@ enum {
PCG_LOCK, /* page cgroup is locked */
PCG_CACHE, /* charged as cache */
PCG_USED, /* this object is in use. */
- /* flags for LRU placement */
- PCG_ACTIVE, /* page is active in this cgroup */
- PCG_FILE, /* page is file system backed */
- PCG_UNEVICTABLE, /* page is unevictableable */
};
#define TESTPCGFLAG(uname, lname) \
@@ -50,19 +46,6 @@ TESTPCGFLAG(Cache, CACHE)
TESTPCGFLAG(Used, USED)
CLEARPCGFLAG(Used, USED)
-/* LRU management flags (from global-lru definition) */
-TESTPCGFLAG(File, FILE)
-SETPCGFLAG(File, FILE)
-CLEARPCGFLAG(File, FILE)
-
-TESTPCGFLAG(Active, ACTIVE)
-SETPCGFLAG(Active, ACTIVE)
-CLEARPCGFLAG(Active, ACTIVE)
-
-TESTPCGFLAG(Unevictable, UNEVICTABLE)
-SETPCGFLAG(Unevictable, UNEVICTABLE)
-CLEARPCGFLAG(Unevictable, UNEVICTABLE)
-
static inline int page_cgroup_nid(struct page_cgroup *pc)
{
return page_to_nid(pc->page);
Index: mmotm-2.6.28-rc2-Oct30/mm/swap.c
===================================================================
--- mmotm-2.6.28-rc2-Oct30.orig/mm/swap.c
+++ mmotm-2.6.28-rc2-Oct30/mm/swap.c
@@ -168,7 +168,6 @@ void activate_page(struct page *page)
lru += LRU_ACTIVE;
add_page_to_lru_list(zone, page, lru);
__count_vm_event(PGACTIVATE);
- mem_cgroup_move_lists(page, lru);
zone->recent_rotated[!!file]++;
zone->recent_scanned[!!file]++;
On Thu, Oct 30, 2008 at 6:52 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> This patch provides a function to move account information of a page between
> mem_cgroups and rewrite force_empty to make use of this.
One part of this that wasn't clear to me - if a cgroup has a lot of
unmapped pagecache sitting around but no tasks, and we try to rmdir
it, then all the pagecache gets moved to the parent too? Or just the
stray mapped pages?
> @@ -597,7 +709,7 @@ static int mem_cgroup_charge_common(stru
> prefetchw(pc);
>
> mem = memcg;
> - ret = mem_cgroup_try_charge(mm, gfp_mask, &mem);
> + ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true);
Isn't this the same as the definition of mem_cgroup_try_charge()? So
you could leave it as-is?
Paul
On Mon, 3 Nov 2008 22:15:41 -0800
Paul Menage <[email protected]> wrote:
> On Thu, Oct 30, 2008 at 6:52 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > This patch provides a function to move account information of a page between
> > mem_cgroups and rewrite force_empty to make use of this.
>
> One part of this that wasn't clear to me - if a cgroup has a lot of
> unmapped pagecache sitting around but no tasks, and we try to rmdir
> it, then all the pagecache gets moved to the parent too? Or just the
> stray mapped pages?
>
All page caches will be moved to.
(I'll add "shrink_usage" interface to drop page caches, later.)
> > @@ -597,7 +709,7 @@ static int mem_cgroup_charge_common(stru
> > prefetchw(pc);
> >
> > mem = memcg;
> > - ret = mem_cgroup_try_charge(mm, gfp_mask, &mem);
> > + ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true);
>
> Isn't this the same as the definition of mem_cgroup_try_charge()? So
> you could leave it as-is?
>
try_charge is called by other places....swapin.
Thanks,
-Kame
On Mon, Nov 3, 2008 at 10:17 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
>> >
>> > mem = memcg;
>> > - ret = mem_cgroup_try_charge(mm, gfp_mask, &mem);
>> > + ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true);
>>
>> Isn't this the same as the definition of mem_cgroup_try_charge()? So
>> you could leave it as-is?
>>
> try_charge is called by other places....swapin.
>
No, I mean here you can call mem_cgroup_try_charge(...) rather than
__mem_cgroup_try_charge(..., true).
Paul
On Mon, 3 Nov 2008 22:23:11 -0800
Paul Menage <[email protected]> wrote:
> On Mon, Nov 3, 2008 at 10:17 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> >> >
> >> > mem = memcg;
> >> > - ret = mem_cgroup_try_charge(mm, gfp_mask, &mem);
> >> > + ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true);
> >>
> >> Isn't this the same as the definition of mem_cgroup_try_charge()? So
> >> you could leave it as-is?
> >>
> > try_charge is called by other places....swapin.
> >
>
> No, I mean here you can call mem_cgroup_try_charge(...) rather than
> __mem_cgroup_try_charge(..., true).
>
you're right. will remove this change.
Thanks,
-Kame
On Fri, 31 Oct 2008 11:54:11 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> SwapCache support for memory resource controller (memcg)
>
> Before mem+swap controller, memcg itself should handle SwapCache in proper way.
>
> In current memcg, SwapCache is just leaked and the user can create tons of
> SwapCache. This is leak of account and should be handled.
>
> SwapCache accounting is done as following.
>
> charge (anon)
> - charged when it's mapped.
> (because of readahead, charge at add_to_swap_cache() is not sane)
> uncharge (anon)
> - uncharged when it's dropped from swapcache and fully unmapped.
> means it's not uncharged at unmap.
> Note: delete from swap cache at swap-in is done after rmap information
> is established.
> charge (shmem)
> - charged at swap-in. this prevents charge at add_to_page_cache().
>
> uncharge (shmem)
> - uncharged when it's dropped from swapcache and not on shmem's
> radix-tree.
>
> at migartion, check against 'old page' is modified to handle shmem.
>
> Compareing to the old version discussed (and caused troubles), we have
> advanteges of
> - PCG_USED bit.
> - simple migraton handling.
>
> So, situation is much easier than several months ago, maybe.
>
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> ---
> Documentation/controllers/memory.txt | 5 ++
> include/linux/swap.h | 16 ++++++++
> mm/memcontrol.c | 66 +++++++++++++++++++++++++++++++----
> mm/shmem.c | 18 ++++++++-
> mm/swap_state.c | 1
> 5 files changed, 98 insertions(+), 8 deletions(-)
>
> Index: mmotm-2.6.28-rc2+/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/mm/memcontrol.c
> +++ mmotm-2.6.28-rc2+/mm/memcontrol.c
> @@ -21,6 +21,7 @@
> #include <linux/memcontrol.h>
> #include <linux/cgroup.h>
> #include <linux/mm.h>
> +#include <linux/pagemap.h>
> #include <linux/smp.h>
> #include <linux/page-flags.h>
> #include <linux/backing-dev.h>
> @@ -140,6 +141,7 @@ enum charge_type {
> MEM_CGROUP_CHARGE_TYPE_MAPPED,
> MEM_CGROUP_CHARGE_TYPE_SHMEM, /* used by page migration of shmem */
> MEM_CGROUP_CHARGE_TYPE_FORCE, /* used by force_empty */
> + MEM_CGROUP_CHARGE_TYPE_SWAPOUT, /* for accounting swapcache */
> NR_CHARGE_TYPE,
> };
>
> @@ -781,6 +783,32 @@ int mem_cgroup_cache_charge(struct page
> MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
> }
>
> +#ifdef CONFIG_SWAP
> +int mem_cgroup_cache_charge_swapin(struct page *page,
> + struct mm_struct *mm, gfp_t mask)
> +{
> + int ret = 0;
> +
> + if (mem_cgroup_subsys.disabled)
> + return 0;
> + if (unlikely(!mm))
> + mm = &init_mm;
> +
> + ret = mem_cgroup_charge_common(page, mm, mask,
> + MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
> + /*
> + * The page may be dropped from SwapCache because we don't have
> + * lock_page().This may cause charge-after-uncharge trouble.
> + * Fix it up here. (the caller have refcnt to this page and
> + * page itself is guaranteed not to be freed.)
> + */
> + if (ret && !PageSwapCache(page))
> + mem_cgroup_uncharge_swapcache(page);
> +
Hmm.. after [5/5], mem_cgroup_cache_charge_swapin has 'locked' parameter,
calls lock_page(if !locked), and checks PageSwapCache under page lock.
Why not doing it in this patch?
Thanks,
Daisuke Nishimura.
> + return ret;
> +}
> +#endif
> +
> void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> {
> struct page_cgroup *pc;
> @@ -818,6 +846,9 @@ __mem_cgroup_uncharge_common(struct page
> if (mem_cgroup_subsys.disabled)
> return;
>
> + if (PageSwapCache(page))
> + return;
> +
> /*
> * Check if our page_cgroup is valid
> */
> @@ -826,12 +857,26 @@ __mem_cgroup_uncharge_common(struct page
> return;
>
> lock_page_cgroup(pc);
> - if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED && page_mapped(page))
> - || !PageCgroupUsed(pc)) {
> - /* This happens at race in zap_pte_range() and do_swap_page()*/
> - unlock_page_cgroup(pc);
> - return;
> +
> + if (!PageCgroupUsed(pc))
> + goto unlock_out;
> +
> + switch(ctype) {
> + case MEM_CGROUP_CHARGE_TYPE_MAPPED:
> + if (page_mapped(page))
> + goto unlock_out;
> + break;
> + case MEM_CGROUP_CHARGE_TYPE_SWAPOUT:
> + if (!PageAnon(page)) { /* Shared memory */
> + if (page->mapping && !page_is_file_cache(page))
> + goto unlock_out;
> + } else if (page_mapped(page)) /* Anon */
> + goto unlock_out;
> + break;
> + default:
> + break;
> }
> +
> ClearPageCgroupUsed(pc);
> mem = pc->mem_cgroup;
>
> @@ -845,6 +890,10 @@ __mem_cgroup_uncharge_common(struct page
> css_put(&mem->css);
>
> return;
> +
> +unlock_out:
> + unlock_page_cgroup(pc);
> + return;
> }
>
> void mem_cgroup_uncharge_page(struct page *page)
> @@ -864,6 +913,11 @@ void mem_cgroup_uncharge_cache_page(stru
> __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
> }
>
> +void mem_cgroup_uncharge_swapcache(struct page *page)
> +{
> + __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
> +}
> +
> /*
> * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
> * page belongs to.
> @@ -921,7 +975,7 @@ void mem_cgroup_end_migration(struct mem
> ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
>
> /* unused page is not on radix-tree now. */
> - if (unused && ctype != MEM_CGROUP_CHARGE_TYPE_MAPPED)
> + if (unused)
> __mem_cgroup_uncharge_common(unused, ctype);
>
> pc = lookup_page_cgroup(target);
> Index: mmotm-2.6.28-rc2+/mm/swap_state.c
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/mm/swap_state.c
> +++ mmotm-2.6.28-rc2+/mm/swap_state.c
> @@ -119,6 +119,7 @@ void __delete_from_swap_cache(struct pag
> total_swapcache_pages--;
> __dec_zone_page_state(page, NR_FILE_PAGES);
> INC_CACHE_INFO(del_total);
> + mem_cgroup_uncharge_swapcache(page);
> }
>
> /**
> Index: mmotm-2.6.28-rc2+/include/linux/swap.h
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/include/linux/swap.h
> +++ mmotm-2.6.28-rc2+/include/linux/swap.h
> @@ -332,6 +332,22 @@ static inline void disable_swap_token(vo
> put_swap_token(swap_token_mm);
> }
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +extern int mem_cgroup_cache_charge_swapin(struct page *page,
> + struct mm_struct *mm, gfp_t mask);
> +extern void mem_cgroup_uncharge_swapcache(struct page *page);
> +#else
> +static inline
> +int mem_cgroup_cache_charge_swapin(struct page *page,
> + struct mm_struct *mm, gfp_t mask)
> +{
> + return 0;
> +}
> +static inline void mem_cgroup_uncharge_swapcache(struct page *page)
> +{
> +}
> +#endif
> +
> #else /* CONFIG_SWAP */
>
> #define total_swap_pages 0
> Index: mmotm-2.6.28-rc2+/mm/shmem.c
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/mm/shmem.c
> +++ mmotm-2.6.28-rc2+/mm/shmem.c
> @@ -920,8 +920,12 @@ found:
> error = 1;
> if (!inode)
> goto out;
> - /* Charge page using GFP_HIGHUSER_MOVABLE while we can wait */
> - error = mem_cgroup_cache_charge(page, current->mm, GFP_HIGHUSER_MOVABLE);
> + /*
> + * Charge page using GFP_HIGHUSER_MOVABLE while we can wait.
> + * charged back to the user(not to caller) when swap account is used.
> + */
> + error = mem_cgroup_cache_charge_swapin(page,
> + current->mm, GFP_HIGHUSER_MOVABLE);
> if (error)
> goto out;
> error = radix_tree_preload(GFP_KERNEL);
> @@ -1258,6 +1262,16 @@ repeat:
> goto repeat;
> }
> wait_on_page_locked(swappage);
> + /*
> + * We want to avoid charge at add_to_page_cache().
> + * charge against this swap cache here.
> + */
> + if (mem_cgroup_cache_charge_swapin(swappage,
> + current->mm, gfp)) {
> + page_cache_release(swappage);
> + error = -ENOMEM;
> + goto failed;
> + }
> page_cache_release(swappage);
> goto repeat;
> }
> Index: mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/Documentation/controllers/memory.txt
> +++ mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
> @@ -137,6 +137,11 @@ behind this approach is that a cgroup th
> page will eventually get charged for it (once it is uncharged from
> the cgroup that brought it in -- this will happen on memory pressure).
>
> +Exception: When you do swapoff and make swapped-out pages of shmem(tmpfs) to
> +be backed into memory in force, charges for pages are accounted against the
> +caller of swapoff rather than the users of shmem.
> +
> +
> 2.4 Reclaim
>
> Each cgroup maintains a per cgroup LRU that consists of an active
>
On Fri, 31 Oct 2008 11:55:10 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> Experimental.
>
> Config and control variable for mem+swap controller.
>
> This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> (memory resource controller swap extension.)
>
> For accounting swap, it's obvious that we have to use additional memory
> to remember "who uses swap". This adds more overhead.
> So, it's better to offer "choice" to users. This patch adds 2 choices.
>
> This patch adds 2 parameters to enable swap extension or not.
> - CONFIG
> - boot option
>
> Changelog: v1 -> v2
> - fixed typo.
> - make default value of "do_swap_account" to be 0 and turned on 1
> later if configured.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
>
> Documentation/kernel-parameters.txt | 3 +++
> include/linux/memcontrol.h | 3 +++
> init/Kconfig | 17 +++++++++++++++++
> mm/memcontrol.c | 32 ++++++++++++++++++++++++++++++++
> 4 files changed, 55 insertions(+)
>
> Index: mmotm-2.6.28-rc2+/init/Kconfig
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/init/Kconfig
> +++ mmotm-2.6.28-rc2+/init/Kconfig
> @@ -428,6 +428,23 @@ config CGROUP_MEM_RES_CTLR
> config MM_OWNER
> bool
>
> +config CGROUP_MEM_RES_CTLR_SWAP
> + bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)"
> + depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
> + help
> + Add swap management feature to memory resource controller. When you
> + enable this, you can limit mem+swap usage per cgroup. In other words,
> + when you disable this, memory resource controller has no cares to
> + usage of swap...a process can exhaust all of the swap. This extension
> + is useful when you want to avoid exhaustion swap but this itself
> + adds more overheads and consumes memory for remembering information.
> + Especially if you use 32bit system or small memory system, please
> + be careful about enabling this. When memory resource controller
> + is disabled by boot option, this will be automatically disabled and
> + there will be no overhead from this. Even when you set this config=y,
> + if boot option "noswapaccount" is set, swap will not be accounted.
> +
> +
> endmenu
>
> config SYSFS_DEPRECATED
> Index: mmotm-2.6.28-rc2+/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/mm/memcontrol.c
> +++ mmotm-2.6.28-rc2+/mm/memcontrol.c
> @@ -41,6 +41,15 @@
> struct cgroup_subsys mem_cgroup_subsys __read_mostly;
> #define MEM_CGROUP_RECLAIM_RETRIES 5
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +/* Turned on only when memory cgroup is enabled && reallY_do_swap_account = 0 */
> +int do_swap_account __read_mostly;
> +static int really_do_swap_account __initdata = 1; /* for remember boot option*/
> +#else
> +#define do_swap_account (0)
> +#endif
> +
> +
> /*
> * Statistics for memory cgroup.
> */
> @@ -1369,6 +1378,18 @@ static void mem_cgroup_free(struct mem_c
> }
>
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +static void __init enable_swap_cgroup(void)
> +{
> + if (really_do_swap_account)
> + do_swap_account = 1;
> +}
I think check for !mem_cgroup_subsys.disabled is also needed here.
Thanks,
Daisuke Nishimura.
> +#else
> +static void __init enable_swap_cgroup(void)
> +{
> +}
> +#endif
> +
> static struct cgroup_subsys_state *
> mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> {
> @@ -1377,6 +1398,7 @@ mem_cgroup_create(struct cgroup_subsys *
>
> if (unlikely((cont->parent) == NULL)) {
> mem = &init_mem_cgroup;
> + enable_swap_cgroup();
> } else {
> mem = mem_cgroup_alloc();
> if (!mem)
> @@ -1460,3 +1482,13 @@ struct cgroup_subsys mem_cgroup_subsys =
> .attach = mem_cgroup_move_task,
> .early_init = 0,
> };
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +
> +static int __init disable_swap_account(char *s)
> +{
> + really_do_swap_account = 0;
> + return 1;
> +}
> +__setup("noswapaccount", disable_swap_account);
> +#endif
> Index: mmotm-2.6.28-rc2+/Documentation/kernel-parameters.txt
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/Documentation/kernel-parameters.txt
> +++ mmotm-2.6.28-rc2+/Documentation/kernel-parameters.txt
> @@ -1543,6 +1543,9 @@ and is between 256 and 4096 characters.
>
> nosoftlockup [KNL] Disable the soft-lockup detector.
>
> + noswapaccount [KNL] Disable accounting of swap in memory resource
> + controller. (See Documentation/controllers/memory.txt)
> +
> nosync [HW,M68K] Disables sync negotiation for all devices.
>
> notsc [BUGS=X86-32] Disable Time Stamp Counter
> Index: mmotm-2.6.28-rc2+/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/include/linux/memcontrol.h
> +++ mmotm-2.6.28-rc2+/include/linux/memcontrol.h
> @@ -77,6 +77,9 @@ extern void mem_cgroup_record_reclaim_pr
> extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
> int priority, enum lru_list lru);
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +extern int do_swap_account;
> +#endif
>
> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> struct mem_cgroup;
>
On Tue, 4 Nov 2008 17:42:01 +0900
Daisuke Nishimura <[email protected]> wrote:
> > +#ifdef CONFIG_SWAP
> > +int mem_cgroup_cache_charge_swapin(struct page *page,
> > + struct mm_struct *mm, gfp_t mask)
> > +{
> > + int ret = 0;
> > +
> > + if (mem_cgroup_subsys.disabled)
> > + return 0;
> > + if (unlikely(!mm))
> > + mm = &init_mm;
> > +
> > + ret = mem_cgroup_charge_common(page, mm, mask,
> > + MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
> > + /*
> > + * The page may be dropped from SwapCache because we don't have
> > + * lock_page().This may cause charge-after-uncharge trouble.
> > + * Fix it up here. (the caller have refcnt to this page and
> > + * page itself is guaranteed not to be freed.)
> > + */
> > + if (ret && !PageSwapCache(page))
> > + mem_cgroup_uncharge_swapcache(page);
> > +
> Hmm.. after [5/5], mem_cgroup_cache_charge_swapin has 'locked' parameter,
> calls lock_page(if !locked), and checks PageSwapCache under page lock.
>
> Why not doing it in this patch?
>
My intention is to guard swap_cgroup by lock_page() against SwapCache.
In Mem+Swap controller. we get "memcg" from information in page->private.
I think we need lock_page(), there.
But here, we don't refer page->private information.
I think we don't need lock_page() because there is no inofrmation we depends on.
Thanks,
-Kame
On Tue, 4 Nov 2008 17:54:06 +0900
Daisuke Nishimura <[email protected]> wrote:
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > +static void __init enable_swap_cgroup(void)
> > +{
> > + if (really_do_swap_account)
> > + do_swap_account = 1;
> > +}
> I think check for !mem_cgroup_subsys.disabled is also needed here.
>
Hmm, mem_cgroup_create() is called even when disabled ?
.......seems so.
Ok, will fix. thank you for checking it.
Regards,
-Kame
>
> Thanks,
> Daisuke Nishimura.
>
> > +#else
> > +static void __init enable_swap_cgroup(void)
> > +{
> > +}
> > +#endif
> > +
> > static struct cgroup_subsys_state *
> > mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> > {
> > @@ -1377,6 +1398,7 @@ mem_cgroup_create(struct cgroup_subsys *
> >
> > if (unlikely((cont->parent) == NULL)) {
> > mem = &init_mem_cgroup;
> > + enable_swap_cgroup();
> > } else {
> > mem = mem_cgroup_alloc();
> > if (!mem)
> > @@ -1460,3 +1482,13 @@ struct cgroup_subsys mem_cgroup_subsys =
> > .attach = mem_cgroup_move_task,
> > .early_init = 0,
> > };
> > +
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > +
> > +static int __init disable_swap_account(char *s)
> > +{
> > + really_do_swap_account = 0;
> > + return 1;
> > +}
> > +__setup("noswapaccount", disable_swap_account);
> > +#endif
> > Index: mmotm-2.6.28-rc2+/Documentation/kernel-parameters.txt
> > ===================================================================
> > --- mmotm-2.6.28-rc2+.orig/Documentation/kernel-parameters.txt
> > +++ mmotm-2.6.28-rc2+/Documentation/kernel-parameters.txt
> > @@ -1543,6 +1543,9 @@ and is between 256 and 4096 characters.
> >
> > nosoftlockup [KNL] Disable the soft-lockup detector.
> >
> > + noswapaccount [KNL] Disable accounting of swap in memory resource
> > + controller. (See Documentation/controllers/memory.txt)
> > +
> > nosync [HW,M68K] Disables sync negotiation for all devices.
> >
> > notsc [BUGS=X86-32] Disable Time Stamp Counter
> > Index: mmotm-2.6.28-rc2+/include/linux/memcontrol.h
> > ===================================================================
> > --- mmotm-2.6.28-rc2+.orig/include/linux/memcontrol.h
> > +++ mmotm-2.6.28-rc2+/include/linux/memcontrol.h
> > @@ -77,6 +77,9 @@ extern void mem_cgroup_record_reclaim_pr
> > extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
> > int priority, enum lru_list lru);
> >
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > +extern int do_swap_account;
> > +#endif
> >
> > #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> > struct mem_cgroup;
> >
>
On Tue, 4 Nov 2008 18:04:29 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> On Tue, 4 Nov 2008 17:42:01 +0900
> Daisuke Nishimura <[email protected]> wrote:
>
> > > +#ifdef CONFIG_SWAP
> > > +int mem_cgroup_cache_charge_swapin(struct page *page,
> > > + struct mm_struct *mm, gfp_t mask)
> > > +{
> > > + int ret = 0;
> > > +
> > > + if (mem_cgroup_subsys.disabled)
> > > + return 0;
> > > + if (unlikely(!mm))
> > > + mm = &init_mm;
> > > +
> > > + ret = mem_cgroup_charge_common(page, mm, mask,
> > > + MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
> > > + /*
> > > + * The page may be dropped from SwapCache because we don't have
> > > + * lock_page().This may cause charge-after-uncharge trouble.
> > > + * Fix it up here. (the caller have refcnt to this page and
> > > + * page itself is guaranteed not to be freed.)
> > > + */
> > > + if (ret && !PageSwapCache(page))
> > > + mem_cgroup_uncharge_swapcache(page);
> > > +
> > Hmm.. after [5/5], mem_cgroup_cache_charge_swapin has 'locked' parameter,
> > calls lock_page(if !locked), and checks PageSwapCache under page lock.
> >
> > Why not doing it in this patch?
> >
>
> My intention is to guard swap_cgroup by lock_page() against SwapCache.
> In Mem+Swap controller. we get "memcg" from information in page->private.
> I think we need lock_page(), there.
>
> But here, we don't refer page->private information.
> I think we don't need lock_page() because there is no inofrmation we depends on.
>
I just thought it would be simpler to check PageSwapCache after holding
page lock rather than to handle the case that the page might be removed from
swap cache.
And to be honest, I can't understand the "charge-after-uncharge trouble".
Could you explain more?
Thanks,
Dasiuke Nishimura.
On Tue, 4 Nov 2008 19:28:22 +0900
Daisuke Nishimura <[email protected]> wrote:
> On Tue, 4 Nov 2008 18:04:29 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > On Tue, 4 Nov 2008 17:42:01 +0900
> > Daisuke Nishimura <[email protected]> wrote:
> >
> > > > +#ifdef CONFIG_SWAP
> > > > +int mem_cgroup_cache_charge_swapin(struct page *page,
> > > > + struct mm_struct *mm, gfp_t mask)
> > > > +{
> > > > + int ret = 0;
> > > > +
> > > > + if (mem_cgroup_subsys.disabled)
> > > > + return 0;
> > > > + if (unlikely(!mm))
> > > > + mm = &init_mm;
> > > > +
> > > > + ret = mem_cgroup_charge_common(page, mm, mask,
> > > > + MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
> > > > + /*
> > > > + * The page may be dropped from SwapCache because we don't have
> > > > + * lock_page().This may cause charge-after-uncharge trouble.
> > > > + * Fix it up here. (the caller have refcnt to this page and
> > > > + * page itself is guaranteed not to be freed.)
> > > > + */
> > > > + if (ret && !PageSwapCache(page))
> > > > + mem_cgroup_uncharge_swapcache(page);
> > > > +
> > > Hmm.. after [5/5], mem_cgroup_cache_charge_swapin has 'locked' parameter,
> > > calls lock_page(if !locked), and checks PageSwapCache under page lock.
> > >
> > > Why not doing it in this patch?
> > >
> >
> > My intention is to guard swap_cgroup by lock_page() against SwapCache.
> > In Mem+Swap controller. we get "memcg" from information in page->private.
> > I think we need lock_page(), there.
> >
> > But here, we don't refer page->private information.
> > I think we don't need lock_page() because there is no inofrmation we depends on.
> >
> I just thought it would be simpler to check PageSwapCache after holding
> page lock rather than to handle the case that the page might be removed from
> swap cache.
>
> And to be honest, I can't understand the "charge-after-uncharge trouble".
> Could you explain more?
>
Maybe typical case is following.
__delete_from_swapcache can happen while the page is unlocked.
==
some other thread.
page = shmem_swapin()
swapin_readahead();
# page is SwapCache here.
# but this page is not locked.
___delete_from_swapcache(page)
# This is not SwapCache. => uncharge swapcache.
mem_cgroup_charge_cache_swapin();
{
charge(); # charged this page but we don't know this is still swapcache.
if (!PageSwapCache(page)) {
# Oh we should unroll this.
}
}
=
Thanks,
-Kame
On Wed, 5 Nov 2008 09:07:49 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:
> On Tue, 4 Nov 2008 19:28:22 +0900
> Daisuke Nishimura <[email protected]> wrote:
>
> > On Tue, 4 Nov 2008 18:04:29 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > On Tue, 4 Nov 2008 17:42:01 +0900
> > > Daisuke Nishimura <[email protected]> wrote:
> > >
> > > > > +#ifdef CONFIG_SWAP
> > > > > +int mem_cgroup_cache_charge_swapin(struct page *page,
> > > > > + struct mm_struct *mm, gfp_t mask)
> > > > > +{
> > > > > + int ret = 0;
> > > > > +
> > > > > + if (mem_cgroup_subsys.disabled)
> > > > > + return 0;
> > > > > + if (unlikely(!mm))
> > > > > + mm = &init_mm;
> > > > > +
> > > > > + ret = mem_cgroup_charge_common(page, mm, mask,
> > > > > + MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
> > > > > + /*
> > > > > + * The page may be dropped from SwapCache because we don't have
> > > > > + * lock_page().This may cause charge-after-uncharge trouble.
> > > > > + * Fix it up here. (the caller have refcnt to this page and
> > > > > + * page itself is guaranteed not to be freed.)
> > > > > + */
> > > > > + if (ret && !PageSwapCache(page))
> > > > > + mem_cgroup_uncharge_swapcache(page);
> > > > > +
> > > > Hmm.. after [5/5], mem_cgroup_cache_charge_swapin has 'locked' parameter,
> > > > calls lock_page(if !locked), and checks PageSwapCache under page lock.
> > > >
> > > > Why not doing it in this patch?
> > > >
> > >
> > > My intention is to guard swap_cgroup by lock_page() against SwapCache.
> > > In Mem+Swap controller. we get "memcg" from information in page->private.
> > > I think we need lock_page(), there.
> > >
> > > But here, we don't refer page->private information.
> > > I think we don't need lock_page() because there is no inofrmation we depends on.
> > >
> > I just thought it would be simpler to check PageSwapCache after holding
> > page lock rather than to handle the case that the page might be removed from
> > swap cache.
> >
> > And to be honest, I can't understand the "charge-after-uncharge trouble".
> > Could you explain more?
> >
I'll add lock_page() here to make this simple.
Thanks,
-Kame
> Maybe typical case is following.
> __delete_from_swapcache can happen while the page is unlocked.
> ==
> some other thread.
> page = shmem_swapin()
> swapin_readahead();
> # page is SwapCache here.
> # but this page is not locked.
> ___delete_from_swapcache(page)
> # This is not SwapCache. => uncharge swapcache.
> mem_cgroup_charge_cache_swapin();
> {
> charge(); # charged this page but we don't know this is still swapcache.
> if (!PageSwapCache(page)) {
> # Oh we should unroll this.
> }
> }
> =
>
> Thanks,
> -Kame
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>