2008-10-10 09:00:16

by Kamezawa Hiroyuki

Subject: [PATCH 0/5] memcg: more updates (still under test) v7

This is a just-for-test set.

The ready-to-go set (v7 http://marc.info/?l=linux-mm&m=122353662107309&w=2)
can be applied to the latest mmotm (stamp-2008-10-09-21-35) with one small, easy
hunk (the change in mm_types.h shows a hunk), so I don't resend it now.

While it seems I have to maintain the ready-to-go set by myself until the end of
the merge window, I'd like to share the patches on my stack and restart the
Mem+Swap controller and related work.

This set includes the following patches:

[1/5] ....charge/commit/cancel protocol
[2/5] ....fix for page migration handling (I hope this will be a silver bullet.)
[3/5] ....new force_empty() and move_account()
[4/5] ....lazy lru free
[5/5] ....lazy lru add.

I'm still testing this but am dumping it now before the weekend (it works well as
far as my tests go). I'll restart the Mem+Swap controller from scratch next week.

If you want me to do other work rather than the Mem+Swap controller, don't
hesitate to ask. I may not notice that some important work should be done before
these complicated tasks.
Tasks in my queue are:
- Mem+Swap controller.
- an interface to shrink memory usage of a group.
- move accounting at task move (maybe complicated.)
- swappiness support (after split-lru is settled.)
- dirty_ratio (Andrea Righi's work will go.)
- priority for memory reclaim. (can we?)
- hierarchy support. (needs discussion of the trade-off between performance and function.)
- NUMA statistics (this needs cgroup interface enhancements.)

And maybe we need more performance checks.

Bye,
-Kame





2008-10-10 09:02:34

by Kamezawa Hiroyuki

Subject: [PATCH 1/5] memcg: charge-commit-cancel

There is a small race in do_swap_page(). When the page being swapped in is
charged, the mapcount can be greater than 0. But, at the same time, some process
(sharing it) can call unmap, making the mapcount go from 1 to 0, and the page is
uncharged.

        CPUA                            CPUB
                                        mapcount == 1.
  (1) charge if mapcount==0             zap_pte_range()
                                        (2) mapcount 1 => 0.
                                        (3) uncharge(). (success)
  (4) set page's rmap()
      mapcount 0 => 1

Then, this swap page's accounting is leaked.

To fix this, I added a new interface.
- precharge
account to res_counter by PAGE_SIZE and try to free pages if necessary.
- commit
register page_cgroup and add to LRU if necessary.
- cancel
uncharge PAGE_SIZE because of do_swap_page failure.


CPUA
(1) charge (always)
(2) set page's rmap (mapcount > 0)
(3) commit the charge after set_pte(); this checks whether the charge was necessary or not.

This protocol uses the PCG_USED bit on page_cgroup to avoid over-accounting.
The usual mem_cgroup_charge_common() does precharge -> commit in one step.
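
A minimal usage sketch for the swap-in case (this just mirrors the do_swap_page()
hunks in the diff below; error paths are abbreviated):

	struct mem_cgroup *ptr = NULL;

	if (mem_cgroup_try_charge(mm, GFP_KERNEL, &ptr) == -ENOMEM)
		return VM_FAULT_OOM;			/* precharge failed */
	/* ... install the pte ... */
	page_add_anon_rmap(page, vma, address);
	mem_cgroup_commit_charge_swapin(page, ptr);	/* commit after rmap is set */

	/* and on a failure path before the pte is set: */
	mem_cgroup_cancel_charge_swapin(ptr);		/* give the precharge back */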

And this patch also adds the following functions to clarify all charges.

- mem_cgroup_newpage_charge() ....replacement for mem_cgroup_charge(),
  called against newly allocated anon pages.

- mem_cgroup_charge_migrate_fixup()
  called only from remove_migration_ptes().
  We'll have to rewrite this later. (This patch just keeps the old behavior.)
  This function will be removed by an additional patch to make migration
  clearer.

Good for clarifying "what we do".

Then, we have the following 4 charge points:
- newpage
- swapin
- add-to-cache.
- migration.

precharge/commit/cancel can be used in other places too:
- shmem (and other places that need precharge),
- move_account (force_empty), etc.
- migration.

We'll revisit these later.

Changelog v5 -> v6:
- added newpage_charge() and migrate_fixup().
- renamed functions for swap-in from "swap" to "swapin"
- added a more precise description.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

include/linux/memcontrol.h | 35 +++++++++-
mm/memcontrol.c | 151 +++++++++++++++++++++++++++++++++++----------
mm/memory.c | 12 ++-
mm/migrate.c | 2
mm/swapfile.c | 6 +
5 files changed, 165 insertions(+), 41 deletions(-)

Index: mmotm-2.6.27-rc8+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.27-rc8+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.27-rc8+/include/linux/memcontrol.h
@@ -27,8 +27,17 @@ struct mm_struct;

#ifdef CONFIG_CGROUP_MEM_RES_CTLR

-extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
+extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
+extern int mem_cgroup_charge_migrate_fixup(struct page *page,
+ struct mm_struct *mm, gfp_t gfp_mask);
+/* for swap handling */
+extern int mem_cgroup_try_charge(struct mm_struct *mm,
+ gfp_t gfp_mask, struct mem_cgroup **ptr);
+extern void mem_cgroup_commit_charge_swapin(struct page *page,
+ struct mem_cgroup *ptr);
+extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
+
extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
@@ -71,7 +80,9 @@ extern long mem_cgroup_calc_reclaim(stru


#else /* CONFIG_CGROUP_MEM_RES_CTLR */
-static inline int mem_cgroup_charge(struct page *page,
+struct mem_cgroup;
+
+static inline int mem_cgroup_newpage_charge(struct page *page,
struct mm_struct *mm, gfp_t gfp_mask)
{
return 0;
@@ -83,6 +94,26 @@ static inline int mem_cgroup_cache_charg
return 0;
}

+static inline int mem_cgroup_charge_migrate_fixup(struct page *page,
+ struct mm_struct *mm, gfp_t gfp_mask)
+{
+ return 0;
+}
+
+static int mem_cgroup_try_charge(struct mm_struct *mm,
+ gfp_t gfp_mask, struct mem_cgroup **ptr)
+{
+ return 0;
+}
+
+static void mem_cgroup_commit_charge_swapin(struct page *page,
+ struct mem_cgroup *ptr)
+{
+}
+static void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr)
+{
+}
+
static inline void mem_cgroup_uncharge_page(struct page *page)
{
}
Index: mmotm-2.6.27-rc8+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc8+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc8+/mm/memcontrol.c
@@ -467,35 +467,31 @@ unsigned long mem_cgroup_isolate_pages(u
return nr_taken;
}

-/*
- * Charge the memory controller for page usage.
- * Return
- * 0 if the charge was successful
- * < 0 if the cgroup is over its limit
+
+/**
+ * mem_cgroup_try_charge - get charge of PAGE_SIZE.
+ * @mm: an mm_struct which is charged against. (when *memcg is NULL)
+ * @gfp_mask: gfp_mask for reclaim.
+ * @memcg: a pointer to memory cgroup which is charged against.
+ *
+ * charge aginst memory cgroup pointed by *memcg. if *memcg == NULL, estimated
+ * memory cgroup from @mm is got and stored in *memcg.
+ *
+ * Retruns 0 if success. -ENOMEM at failure.
*/
-static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, enum charge_type ctype,
- struct mem_cgroup *memcg)
+
+int mem_cgroup_try_charge(struct mm_struct *mm,
+ gfp_t gfp_mask, struct mem_cgroup **memcg)
{
struct mem_cgroup *mem;
- struct page_cgroup *pc;
- unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
- struct mem_cgroup_per_zone *mz;
- unsigned long flags;
-
- pc = lookup_page_cgroup(page);
- /* can happen at boot */
- if (unlikely(!pc))
- return 0;
- prefetchw(pc);
+ int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
/*
* We always charge the cgroup the mm_struct belongs to.
* The mm_struct's mem_cgroup changes on task migration if the
* thread group leader migrates. It's possible that mm is not
* set, if so charge the init_mm (happens for pagecache usage).
*/
-
- if (likely(!memcg)) {
+ if (likely(!*memcg)) {
rcu_read_lock();
mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
if (unlikely(!mem)) {
@@ -506,15 +502,17 @@ static int mem_cgroup_charge_common(stru
* For every charge from the cgroup, increment reference count
*/
css_get(&mem->css);
+ *memcg = mem;
rcu_read_unlock();
} else {
- mem = memcg;
- css_get(&memcg->css);
+ mem = *memcg;
+ css_get(&mem->css);
}

+
while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
if (!(gfp_mask & __GFP_WAIT))
- goto out;
+ goto nomem;

if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
continue;
@@ -531,18 +529,33 @@ static int mem_cgroup_charge_common(stru

if (!nr_retries--) {
mem_cgroup_out_of_memory(mem, gfp_mask);
- goto out;
+ goto nomem;
}
}
+ return 0;
+nomem:
+ css_put(&mem->css);
+ return -ENOMEM;
+}

+/*
+ * commit a charge got by mem_cgroup_try_charge() and makes page_cgroup to be
+ * USED state. If already USED, uncharge and return.
+ */
+
+static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
+ struct page_cgroup *pc,
+ enum charge_type ctype)
+{
+ struct mem_cgroup_per_zone *mz;
+ unsigned long flags;

lock_page_cgroup(pc);
if (unlikely(PageCgroupUsed(pc))) {
unlock_page_cgroup(pc);
res_counter_uncharge(&mem->res, PAGE_SIZE);
css_put(&mem->css);
-
- goto done;
+ return;
}
pc->mem_cgroup = mem;
/*
@@ -557,15 +570,39 @@ static int mem_cgroup_charge_common(stru
__mem_cgroup_add_list(mz, pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
unlock_page_cgroup(pc);
+}

-done:
+/*
+ * Charge the memory controller for page usage.
+ * Return
+ * 0 if the charge was successful
+ * < 0 if the cgroup is over its limit
+ */
+static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
+ gfp_t gfp_mask, enum charge_type ctype,
+ struct mem_cgroup *memcg)
+{
+ struct mem_cgroup *mem;
+ struct page_cgroup *pc;
+ int ret;
+
+ pc = lookup_page_cgroup(page);
+ /* can happen at boot */
+ if (unlikely(!pc))
+ return 0;
+ prefetchw(pc);
+
+ mem = memcg;
+ ret = mem_cgroup_try_charge(mm, gfp_mask, &mem);
+ if (ret)
+ return ret;
+
+ __mem_cgroup_commit_charge(mem, pc, ctype);
return 0;
-out:
- css_put(&mem->css);
- return -ENOMEM;
}

-int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
+int mem_cgroup_newpage_charge(struct page *page,
+ struct mm_struct *mm, gfp_t gfp_mask)
{
if (mem_cgroup_subsys.disabled)
return 0;
@@ -586,6 +623,34 @@ int mem_cgroup_charge(struct page *page,
MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
}

+/*
+ * same as mem_cgroup_newpage_charge(), now.
+ * But what we assume is different from newpage, and this is special case.
+ * treat this in special function. easy for maintainance.
+ */
+
+int mem_cgroup_charge_migrate_fixup(struct page *page,
+ struct mm_struct *mm, gfp_t gfp_mask)
+{
+ if (mem_cgroup_subsys.disabled)
+ return 0;
+
+ if (PageCompound(page))
+ return 0;
+
+ if (page_mapped(page) || (page->mapping && !PageAnon(page)))
+ return 0;
+
+ if (unlikely(!mm))
+ mm = &init_mm;
+
+ return mem_cgroup_charge_common(page, mm, gfp_mask,
+ MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
+}
+
+
+
+
int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask)
{
@@ -628,6 +693,30 @@ int mem_cgroup_cache_charge(struct page
MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
}

+
+void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
+{
+ struct page_cgroup *pc;
+
+ if (mem_cgroup_subsys.disabled)
+ return;
+ if (!ptr)
+ return;
+ pc = lookup_page_cgroup(page);
+ __mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
+}
+
+void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
+{
+ if (mem_cgroup_subsys.disabled)
+ return;
+ if (!mem)
+ return;
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ css_put(&mem->css);
+}
+
+
/*
* uncharge if !page_mapped(page)
*/
Index: mmotm-2.6.27-rc8+/mm/memory.c
===================================================================
--- mmotm-2.6.27-rc8+.orig/mm/memory.c
+++ mmotm-2.6.27-rc8+/mm/memory.c
@@ -1891,7 +1891,7 @@ gotten:
cow_user_page(new_page, old_page, address, vma);
__SetPageUptodate(new_page);

- if (mem_cgroup_charge(new_page, mm, GFP_KERNEL))
+ if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
goto oom_free_new;

/*
@@ -2287,6 +2287,7 @@ static int do_swap_page(struct mm_struct
struct page *page;
swp_entry_t entry;
pte_t pte;
+ struct mem_cgroup *ptr = NULL;
int ret = 0;

if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
@@ -2325,7 +2326,7 @@ static int do_swap_page(struct mm_struct
lock_page(page);
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);

- if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
+ if (mem_cgroup_try_charge(mm, GFP_KERNEL, &ptr) == -ENOMEM) {
ret = VM_FAULT_OOM;
unlock_page(page);
goto out;
@@ -2355,6 +2356,7 @@ static int do_swap_page(struct mm_struct
flush_icache_page(vma, page);
set_pte_at(mm, address, page_table, pte);
page_add_anon_rmap(page, vma, address);
+ mem_cgroup_commit_charge_swapin(page, ptr);

swap_free(entry);
if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
@@ -2375,7 +2377,7 @@ unlock:
out:
return ret;
out_nomap:
- mem_cgroup_uncharge_page(page);
+ mem_cgroup_cancel_charge_swapin(ptr);
pte_unmap_unlock(page_table, ptl);
unlock_page(page);
page_cache_release(page);
@@ -2405,7 +2407,7 @@ static int do_anonymous_page(struct mm_s
goto oom;
__SetPageUptodate(page);

- if (mem_cgroup_charge(page, mm, GFP_KERNEL))
+ if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))
goto oom_free_page;

entry = mk_pte(page, vma->vm_page_prot);
@@ -2498,7 +2500,7 @@ static int __do_fault(struct mm_struct *
ret = VM_FAULT_OOM;
goto out;
}
- if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
+ if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
ret = VM_FAULT_OOM;
page_cache_release(page);
goto out;
Index: mmotm-2.6.27-rc8+/mm/migrate.c
===================================================================
--- mmotm-2.6.27-rc8+.orig/mm/migrate.c
+++ mmotm-2.6.27-rc8+/mm/migrate.c
@@ -133,7 +133,7 @@ static void remove_migration_pte(struct
* be reliable, and this charge can actually fail: oh well, we don't
* make the situation any worse by proceeding as if it had succeeded.
*/
- mem_cgroup_charge(new, mm, GFP_ATOMIC);
+ mem_cgroup_charge_migrate_fixup(new, mm, GFP_ATOMIC);

get_page(new);
pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
Index: mmotm-2.6.27-rc8+/mm/swapfile.c
===================================================================
--- mmotm-2.6.27-rc8+.orig/mm/swapfile.c
+++ mmotm-2.6.27-rc8+/mm/swapfile.c
@@ -530,17 +530,18 @@ unsigned int count_swap_pages(int type,
static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, swp_entry_t entry, struct page *page)
{
+ struct mem_cgroup *ptr;
spinlock_t *ptl;
pte_t *pte;
int ret = 1;

- if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
+ if (mem_cgroup_try_charge(vma->vm_mm, GFP_KERNEL, &ptr))
ret = -ENOMEM;

pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
if (unlikely(!pte_same(*pte, swp_entry_to_pte(entry)))) {
if (ret > 0)
- mem_cgroup_uncharge_page(page);
+ mem_cgroup_cancel_charge_swapin(ptr);
ret = 0;
goto out;
}
@@ -550,6 +551,7 @@ static int unuse_pte(struct vm_area_stru
set_pte_at(vma->vm_mm, addr, pte,
pte_mkold(mk_pte(page, vma->vm_page_prot)));
page_add_anon_rmap(page, vma, addr);
+ mem_cgroup_commit_charge_swapin(page, ptr);
swap_free(entry);
/*
* Move the page to the active list so it is not

2008-10-10 09:04:12

by Kamezawa Hiroyuki

Subject: [PATCH 2/5] memcg: migration account fix

Now, management of "charge" under page migration is done in the following
manner. (Assume we migrate page contents from oldpage to newpage.)

before
- "newpage" is charged before migration.
at success.
- "oldpage" is uncharged at somewhere(unmap, radix-tree-replace)
at failure
- "newpage" is uncharged.
- "oldpage" is charged if necessary (*1)

But (*1) is not reliable... because of GFP_ATOMIC.

This patch tries to change the behavior as follows, using charge/commit/cancel ops:

before
- charge PAGE_SIZE (no target page)
success
- commit charge against "newpage".
failure
- commit charge against "oldpage".
(PCG_USED bit works effectively to avoid double-counting)

if "oldpage" or "newpage" is obsolete, uncharge it.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

include/linux/memcontrol.h | 19 ++++-------
mm/memcontrol.c | 75 +++++++++++++++++++++++++++++----------------
mm/migrate.c | 42 ++++++++-----------------
3 files changed, 70 insertions(+), 66 deletions(-)

Index: mmotm-2.6.27-rc8+/mm/migrate.c
===================================================================
--- mmotm-2.6.27-rc8+.orig/mm/migrate.c
+++ mmotm-2.6.27-rc8+/mm/migrate.c
@@ -121,20 +121,6 @@ static void remove_migration_pte(struct
if (!is_migration_entry(entry) || migration_entry_to_page(entry) != old)
goto out;

- /*
- * Yes, ignore the return value from a GFP_ATOMIC mem_cgroup_charge.
- * Failure is not an option here: we're now expected to remove every
- * migration pte, and will cause crashes otherwise. Normally this
- * is not an issue: mem_cgroup_prepare_migration bumped up the old
- * page_cgroup count for safety, that's now attached to the new page,
- * so this charge should just be another incrementation of the count,
- * to keep in balance with rmap.c's mem_cgroup_uncharging. But if
- * there's been a force_empty, those reference counts may no longer
- * be reliable, and this charge can actually fail: oh well, we don't
- * make the situation any worse by proceeding as if it had succeeded.
- */
- mem_cgroup_charge_migrate_fixup(new, mm, GFP_ATOMIC);
-
get_page(new);
pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
if (is_write_migration_entry(entry))
@@ -382,9 +368,6 @@ static void migrate_page_copy(struct pag
anon = PageAnon(page);
page->mapping = NULL;

- if (!anon) /* This page was removed from radix-tree. */
- mem_cgroup_uncharge_cache_page(page);
-
/*
* If any waiters have accumulated on the new page then
* wake them up.
@@ -621,6 +604,7 @@ static int unmap_and_move(new_page_t get
struct page *newpage = get_new_page(page, private, &result);
int rcu_locked = 0;
int charge = 0;
+ struct mem_cgroup *mem;

if (!newpage)
return -ENOMEM;
@@ -630,24 +614,26 @@ static int unmap_and_move(new_page_t get
goto move_newpage;
}

- charge = mem_cgroup_prepare_migration(page, newpage);
- if (charge == -ENOMEM) {
- rc = -ENOMEM;
- goto move_newpage;
- }
/* prepare cgroup just returns 0 or -ENOMEM */
- BUG_ON(charge);
-
rc = -EAGAIN;
+
if (!trylock_page(page)) {
if (!force)
goto move_newpage;
lock_page(page);
}

+ /* precharge against new page */
+ charge = mem_cgroup_prepare_migration(page, &mem);
+ if (charge == -ENOMEM) {
+ rc = -ENOMEM;
+ goto unlock;
+ }
+ BUG_ON(charge);
+
if (PageWriteback(page)) {
if (!force)
- goto unlock;
+ goto uncharge;
wait_on_page_writeback(page);
}
/*
@@ -700,7 +686,9 @@ static int unmap_and_move(new_page_t get
rcu_unlock:
if (rcu_locked)
rcu_read_unlock();
-
+uncharge:
+ if (!charge)
+ mem_cgroup_end_migration(mem, page, newpage);
unlock:
unlock_page(page);

@@ -716,8 +704,6 @@ unlock:
}

move_newpage:
- if (!charge)
- mem_cgroup_end_migration(newpage);

/*
* Move the new page to the LRU. If migration was not successful
Index: mmotm-2.6.27-rc8+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.27-rc8+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.27-rc8+/include/linux/memcontrol.h
@@ -29,8 +29,6 @@ struct mm_struct;

extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
-extern int mem_cgroup_charge_migrate_fixup(struct page *page,
- struct mm_struct *mm, gfp_t gfp_mask);
/* for swap handling */
extern int mem_cgroup_try_charge(struct mm_struct *mm,
gfp_t gfp_mask, struct mem_cgroup **ptr);
@@ -60,8 +58,9 @@ extern struct mem_cgroup *mem_cgroup_fro
((cgroup) == mem_cgroup_from_task((mm)->owner))

extern int
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage);
-extern void mem_cgroup_end_migration(struct page *page);
+mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr);
+extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
+ struct page *oldpage, struct page *newpage);

/*
* For memory reclaim.
@@ -94,12 +93,6 @@ static inline int mem_cgroup_cache_charg
return 0;
}

-static inline int mem_cgroup_charge_migrate_fixup(struct page *page,
- struct mm_struct *mm, gfp_t gfp_mask)
-{
- return 0;
-}
-
static int mem_cgroup_try_charge(struct mm_struct *mm,
gfp_t gfp_mask, struct mem_cgroup **ptr)
{
@@ -143,12 +136,14 @@ static inline int task_in_mem_cgroup(str
}

static inline int
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
+mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
{
return 0;
}

-static inline void mem_cgroup_end_migration(struct page *page)
+static inline void mem_cgroup_end_migration(struct mem_cgroup *mem,
+ struct page *oldpage,
+ struct page *newpage)
{
}

Index: mmotm-2.6.27-rc8+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc8+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc8+/mm/memcontrol.c
@@ -623,34 +623,6 @@ int mem_cgroup_newpage_charge(struct pag
MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
}

-/*
- * same as mem_cgroup_newpage_charge(), now.
- * But what we assume is different from newpage, and this is special case.
- * treat this in special function. easy for maintainance.
- */
-
-int mem_cgroup_charge_migrate_fixup(struct page *page,
- struct mm_struct *mm, gfp_t gfp_mask)
-{
- if (mem_cgroup_subsys.disabled)
- return 0;
-
- if (PageCompound(page))
- return 0;
-
- if (page_mapped(page) || (page->mapping && !PageAnon(page)))
- return 0;
-
- if (unlikely(!mm))
- mm = &init_mm;
-
- return mem_cgroup_charge_common(page, mm, gfp_mask,
- MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
-}
-
-
-
-
int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask)
{
@@ -778,13 +750,13 @@ void mem_cgroup_uncharge_cache_page(stru
}

/*
- * Before starting migration, account against new page.
+ * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
+ * page belongs to.
*/
-int mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
+int mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
{
struct page_cgroup *pc;
struct mem_cgroup *mem = NULL;
- enum charge_type ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
int ret = 0;

if (mem_cgroup_subsys.disabled)
@@ -795,43 +767,67 @@ int mem_cgroup_prepare_migration(struct
if (PageCgroupUsed(pc)) {
mem = pc->mem_cgroup;
css_get(&mem->css);
- if (PageCgroupCache(pc)) {
- if (page_is_file_cache(page))
- ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
- else
- ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
- }
}
unlock_page_cgroup(pc);
+
if (mem) {
- ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
- ctype, mem);
+ ret = mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem);
css_put(&mem->css);
+ *ptr = mem;
}
return ret;
}

/* remove redundant charge if migration failed*/
-void mem_cgroup_end_migration(struct page *newpage)
+void mem_cgroup_end_migration(struct mem_cgroup *mem,
+ struct page *oldpage, struct page *newpage)
{
+ struct page *target, *unused;
+ struct page_cgroup *pc;
+ enum charge_type ctype;
+
+ /* at migration success, oldpage->mapping is NULL. */
+ if (oldpage->mapping) {
+ target = oldpage;
+ unused = NULL;
+ } else {
+ target = newpage;
+ unused = oldpage;
+ }
+
+ if (PageAnon(target))
+ ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
+ else if (page_is_file_cache(target))
+ ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
+ else
+ ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
+
+ /* unused page is not on radix-tree now. */
+ if (unused && ctype != MEM_CGROUP_CHARGE_TYPE_MAPPED)
+ __mem_cgroup_uncharge_common(unused, ctype);
+
+ pc = lookup_page_cgroup(target);
/*
- * At success, page->mapping is not NULL.
- * special rollback care is necessary when
- * 1. at migration failure. (newpage->mapping is cleared in this case)
- * 2. the newpage was moved but not remapped again because the task
- * exits and the newpage is obsolete. In this case, the new page
- * may be a swapcache. So, we just call mem_cgroup_uncharge_page()
- * always for avoiding mess. The page_cgroup will be removed if
- * unnecessary. File cache pages is still on radix-tree. Don't
- * care it.
+ * __mem_cgroup_commit_charge() check PCG_USED bit of page_cgroup.
+ * So, double-counting is effectively avoided.
*/
- if (!newpage->mapping)
- __mem_cgroup_uncharge_common(newpage,
- MEM_CGROUP_CHARGE_TYPE_FORCE);
- else if (PageAnon(newpage))
- mem_cgroup_uncharge_page(newpage);
+ __mem_cgroup_commit_charge(mem, pc, ctype);
+
+ /*
+ * Both of oldpage and newpage are still under lock_page().
+ * Then, we don't have to care about race in radix-tree.
+ * But we have to be careful that this page is unmapped or not.
+ *
+ * There is a case for !page_mapped(). At the start of
+ * migration, oldpage was mapped. But now, it's zapped.
+ * But we know *target* page is not freed/reused under us.
+ * mem_cgroup_uncharge_page() does all necessary checks.
+ */
+ if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
+ mem_cgroup_uncharge_page(target);
}

+
/*
* A call to try to shrink memory usage under specified resource controller.
* This is typically used for page reclaiming for shmem for reducing side

2008-10-10 09:04:58

by Kamezawa Hiroyuki

Subject: [PATCH 3/5] memcg: more updates (still under test) v7

This patch provides a function to move the accounting information of a page
between mem_cgroups and rewrites force_empty to make use of it.

This moving of page_cgroup is done while
- the lru_lock of the source/destination mem_cgroup is held, and
- lock_page_cgroup() is held.

Then, a routine which touches pc->mem_cgroup without lock_page_cgroup() should
confirm whether pc->mem_cgroup is still valid. Typical code can be as follows.

(while the page is not under lock_page())
	mem = pc->mem_cgroup;
	mz = page_cgroup_zoneinfo(pc);
	spin_lock_irqsave(&mz->lru_lock, flags);
	if (pc->mem_cgroup == mem)
		..... /* some list handling */
	spin_unlock_irqrestore(&mz->lru_lock, flags);

Of course, a better way is
	lock_page_cgroup(pc);
	....
	unlock_page_cgroup(pc);

But you should confirm the lock nesting and avoid deadlock.

If you handle page_cgroups taken from a mem_cgroup's LRU under mz->lru_lock,
you don't have to worry about what pc->mem_cgroup points to.
Moved pages are added to the head of the LRU, not to the tail.

Expected users of this routine are:
- force_empty (rmdir)
- moving tasks between cgroups (for moving accounting information)
- hierarchy (maybe useful)

force_empty (rmdir) uses this move_account and moves pages to the group's parent.
This "move" will not cause OOM (I added an "oom" parameter to try_charge()).

If the parent is busy (not enough memory), force_empty calls
try_to_free_mem_cgroup_pages() and reduces usage.
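
Condensed, the retry logic of the new force_empty (see mem_cgroup_force_empty()
in the diff below) is roughly the following; the per-node/zone/lru iteration and
the "shrink" bookkeeping are omitted:

	while (mem->res.usage > 0) {
		/* move each page_cgroup to the parent via mem_cgroup_move_parent() */
		ret = mem_cgroup_force_empty_list(mem, mz, lru);
		if (ret == -ENOMEM)
			/* parent is full: reclaim from this group, then retry the move */
			try_to_free_mem_cgroup_pages(mem, GFP_HIGHUSER_MOVABLE);
	}
	/* reaching the reclaim path twice means we are stuck: return -EBUSY */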

The purpose of this behavior is to
- fix the "forget all" behavior of force_empty and avoid leaking accounting, and
- keep pages in memory as much as possible, by "moving first, freeing if
  necessary".

Adding a switch to change the behavior of force_empty to
- free first, move if necessary, or
- free all; if there are mlocked/busy pages, return -EBUSY
is under consideration.

Changelog: (v6) -> (v7)
Changelog: (v5) -> (v6)
- removed an unnecessary check.
- do everything under lock_page_cgroup().
- removed res_counter_charge() from the move function itself
  (and modified the try_charge() function).
- added an argument to add_list() to specify whether to add the page_cgroup at
  the head or the tail.
- merged with the force_empty patch (to answer the "who is the user?" question).

Changelog: (v4) -> (v5)
- check for lock_page() is removed.
- rewrote description.

Changelog: (v2) -> (v4)
- added lock_page_cgroup().
- split out from the new-force-empty patch.
- added how-to-use text.
- fixed race in __mem_cgroup_uncharge_common().

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Documentation/controllers/memory.txt | 10 -
mm/memcontrol.c | 267 +++++++++++++++++++++++++++--------
2 files changed, 216 insertions(+), 61 deletions(-)

Index: mmotm-2.6.27-rc8+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc8+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc8+/mm/memcontrol.c
@@ -257,7 +257,7 @@ static void __mem_cgroup_remove_list(str
}

static void __mem_cgroup_add_list(struct mem_cgroup_per_zone *mz,
- struct page_cgroup *pc)
+ struct page_cgroup *pc, bool hot)
{
int lru = LRU_BASE;

@@ -271,7 +271,10 @@ static void __mem_cgroup_add_list(struct
}

MEM_CGROUP_ZSTAT(mz, lru) += 1;
- list_add(&pc->lru, &mz->lists[lru]);
+ if (hot)
+ list_add(&pc->lru, &mz->lists[lru]);
+ else
+ list_add_tail(&pc->lru, &mz->lists[lru]);

mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
}
@@ -467,21 +470,12 @@ unsigned long mem_cgroup_isolate_pages(u
return nr_taken;
}

-
-/**
- * mem_cgroup_try_charge - get charge of PAGE_SIZE.
- * @mm: an mm_struct which is charged against. (when *memcg is NULL)
- * @gfp_mask: gfp_mask for reclaim.
- * @memcg: a pointer to memory cgroup which is charged against.
- *
- * charge aginst memory cgroup pointed by *memcg. if *memcg == NULL, estimated
- * memory cgroup from @mm is got and stored in *memcg.
- *
- * Retruns 0 if success. -ENOMEM at failure.
+/*
+ * Unlike exported interface, "oom" parameter is added. if oom==true,
+ * oom-killer can be invoked.
*/
-
-int mem_cgroup_try_charge(struct mm_struct *mm,
- gfp_t gfp_mask, struct mem_cgroup **memcg)
+static int __mem_cgroup_try_charge(struct mm_struct *mm,
+ gfp_t gfp_mask, struct mem_cgroup **memcg, bool oom)
{
struct mem_cgroup *mem;
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
@@ -528,7 +522,8 @@ int mem_cgroup_try_charge(struct mm_stru
continue;

if (!nr_retries--) {
- mem_cgroup_out_of_memory(mem, gfp_mask);
+ if (oom)
+ mem_cgroup_out_of_memory(mem, gfp_mask);
goto nomem;
}
}
@@ -538,6 +533,24 @@ nomem:
return -ENOMEM;
}

+/**
+ * mem_cgroup_try_charge - get charge of PAGE_SIZE.
+ * @mm: an mm_struct which is charged against. (when *memcg is NULL)
+ * @gfp_mask: gfp_mask for reclaim.
+ * @memcg: a pointer to memory cgroup which is charged against.
+ *
+ * charge aginst memory cgroup pointed by *memcg. if *memcg == NULL, estimated
+ * memory cgroup from @mm is got and stored in *memcg.
+ *
+ * Retruns 0 if success. -ENOMEM at failure.
+ */
+
+int mem_cgroup_try_charge(struct mm_struct *mm,
+ gfp_t mask, struct mem_cgroup **memcg)
+{
+ return __mem_cgroup_try_charge(mm, mask, memcg, false);
+}
+
/*
* commit a charge got by mem_cgroup_try_charge() and makes page_cgroup to be
* USED state. If already USED, uncharge and return.
@@ -567,11 +580,109 @@ static void __mem_cgroup_commit_charge(s
mz = page_cgroup_zoneinfo(pc);

spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_add_list(mz, pc);
+ __mem_cgroup_add_list(mz, pc, true);
spin_unlock_irqrestore(&mz->lru_lock, flags);
unlock_page_cgroup(pc);
}

+/**
+ * mem_cgroup_move_account - move account of the page
+ * @pc: page_cgroup of the page.
+ * @from: mem_cgroup which the page is moved from.
+ * @to: mem_cgroup which the page is moved to. @from != @to.
+ *
+ * The caller must confirm following.
+ * 1. disable irq.
+ * 2. lru_lock of old mem_cgroup(@from) should be held.
+ *
+ * returns 0 at success,
+ * returns -EBUSY when lock is busy or "pc" is unstable.
+ *
+ * This function does "uncharge" from old cgroup but doesn't do "charge" to
+ * new cgroup. It should be done by a caller.
+ */
+
+static int mem_cgroup_move_account(struct page_cgroup *pc,
+ struct mem_cgroup *from, struct mem_cgroup *to)
+{
+ struct mem_cgroup_per_zone *from_mz, *to_mz;
+ int nid, zid;
+ int ret = -EBUSY;
+
+ VM_BUG_ON(!irqs_disabled());
+ VM_BUG_ON(from == to);
+
+ nid = page_cgroup_nid(pc);
+ zid = page_cgroup_zid(pc);
+ from_mz = mem_cgroup_zoneinfo(from, nid, zid);
+ to_mz = mem_cgroup_zoneinfo(to, nid, zid);
+
+
+ if (!trylock_page_cgroup(pc))
+ return ret;
+
+ if (!PageCgroupUsed(pc))
+ goto out;
+
+ if (pc->mem_cgroup != from)
+ goto out;
+
+ if (spin_trylock(&to_mz->lru_lock)) {
+ __mem_cgroup_remove_list(from_mz, pc);
+ css_put(&from->css);
+ res_counter_uncharge(&from->res, PAGE_SIZE);
+ pc->mem_cgroup = to;
+ css_get(&to->css);
+ __mem_cgroup_add_list(to_mz, pc, false);
+ ret = 0;
+ spin_unlock(&to_mz->lru_lock);
+ }
+out:
+ unlock_page_cgroup(pc);
+ return ret;
+}
+
+/*
+ * move charges to its parent.
+ */
+
+static int mem_cgroup_move_parent(struct page_cgroup *pc,
+ struct mem_cgroup *child,
+ gfp_t gfp_mask)
+{
+ struct cgroup *cg = child->css.cgroup;
+ struct cgroup *pcg = cg->parent;
+ struct mem_cgroup *parent;
+ struct mem_cgroup_per_zone *mz;
+ unsigned long flags;
+ int ret;
+
+ /* Is ROOT ? */
+ if (!pcg)
+ return -EINVAL;
+
+ parent = mem_cgroup_from_cont(pcg);
+
+ ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false);
+ if (ret)
+ return ret;
+
+ mz = mem_cgroup_zoneinfo(child,
+ page_cgroup_nid(pc), page_cgroup_zid(pc));
+
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ ret = mem_cgroup_move_account(pc, child, parent);
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+
+ /* drop extra refcnt */
+ css_put(&parent->css);
+ /* uncharge if move fails */
+ if (ret)
+ res_counter_uncharge(&parent->res, PAGE_SIZE);
+
+ return ret;
+}
+
/*
* Charge the memory controller for page usage.
* Return
@@ -593,7 +704,7 @@ static int mem_cgroup_charge_common(stru
prefetchw(pc);

mem = memcg;
- ret = mem_cgroup_try_charge(mm, gfp_mask, &mem);
+ ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true);
if (ret)
return ret;

@@ -892,46 +1003,52 @@ int mem_cgroup_resize_limit(struct mem_c
* This routine traverse page_cgroup in given list and drop them all.
* *And* this routine doesn't reclaim page itself, just removes page_cgroup.
*/
-#define FORCE_UNCHARGE_BATCH (128)
-static void mem_cgroup_force_empty_list(struct mem_cgroup *mem,
+static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
struct mem_cgroup_per_zone *mz,
enum lru_list lru)
{
- struct page_cgroup *pc;
- struct page *page;
- int count = FORCE_UNCHARGE_BATCH;
+ struct page_cgroup *pc, *busy;
unsigned long flags;
+ unsigned long loop;
struct list_head *list;
+ int ret = 0;

list = &mz->lists[lru];

- spin_lock_irqsave(&mz->lru_lock, flags);
- while (!list_empty(list)) {
- pc = list_entry(list->prev, struct page_cgroup, lru);
- page = pc->page;
- if (!PageCgroupUsed(pc))
+ loop = MEM_CGROUP_ZSTAT(mz, lru);
+ /* give some margin against EBUSY etc...*/
+ loop += 256;
+ busy = NULL;
+ while (loop--) {
+ ret = 0;
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ if (list_empty(list)) {
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
break;
- get_page(page);
+ }
+ pc = list_entry(list->prev, struct page_cgroup, lru);
+ if (busy == pc) {
+ list_move(&pc->lru, list);
+ busy = 0;
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+ continue;
+ }
spin_unlock_irqrestore(&mz->lru_lock, flags);
- /*
- * Check if this page is on LRU. !LRU page can be found
- * if it's under page migration.
- */
- if (PageLRU(page)) {
- __mem_cgroup_uncharge_common(page,
- MEM_CGROUP_CHARGE_TYPE_FORCE);
- put_page(page);
- if (--count <= 0) {
- count = FORCE_UNCHARGE_BATCH;
- cond_resched();
- }
- } else {
- spin_lock_irqsave(&mz->lru_lock, flags);
+
+ ret = mem_cgroup_move_parent(pc, mem, GFP_HIGHUSER_MOVABLE);
+ if (ret == -ENOMEM)
break;
- }
- spin_lock_irqsave(&mz->lru_lock, flags);
+
+ if (ret == -EBUSY || ret == -EINVAL) {
+ /* found lock contention or "pc" is obsolete. */
+ busy = pc;
+ cond_resched();
+ } else
+ busy = NULL;
}
- spin_unlock_irqrestore(&mz->lru_lock, flags);
+ if (!ret && !list_empty(list))
+ return -EBUSY;
+ return ret;
}

/*
@@ -940,34 +1057,68 @@ static void mem_cgroup_force_empty_list(
*/
static int mem_cgroup_force_empty(struct mem_cgroup *mem)
{
- int ret = -EBUSY;
- int node, zid;
+ int ret;
+ int node, zid, shrink;
+ int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;

css_get(&mem->css);
- /*
- * page reclaim code (kswapd etc..) will move pages between
- * active_list <-> inactive_list while we don't take a lock.
- * So, we have to do loop here until all lists are empty.
- */
+
+ shrink = 0;
+move_account:
while (mem->res.usage > 0) {
+ ret = -EBUSY;
if (atomic_read(&mem->css.cgroup->count) > 0)
goto out;
+
/* This is for making all *used* pages to be on LRU. */
lru_add_drain_all();
- for_each_node_state(node, N_POSSIBLE)
- for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+ ret = 0;
+ for_each_node_state(node, N_POSSIBLE) {
+ for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
struct mem_cgroup_per_zone *mz;
enum lru_list l;
mz = mem_cgroup_zoneinfo(mem, node, zid);
- for_each_lru(l)
- mem_cgroup_force_empty_list(mem, mz, l);
+ for_each_lru(l) {
+ ret = mem_cgroup_force_empty_list(mem,
+ mz, l);
+ if (ret)
+ break;
+ }
}
+ if (ret)
+ break;
+ }
+ /* it seems parent cgroup doesn't have enough mem */
+ if (ret == -ENOMEM)
+ goto try_to_free;
cond_resched();
}
ret = 0;
out:
css_put(&mem->css);
return ret;
+
+try_to_free:
+ /* returns EBUSY if we come here twice. */
+ if (shrink) {
+ ret = -EBUSY;
+ goto out;
+ }
+ /* try to free all pages in this cgroup */
+ shrink = 1;
+ while (nr_retries && mem->res.usage > 0) {
+ int progress;
+ progress = try_to_free_mem_cgroup_pages(mem,
+ GFP_HIGHUSER_MOVABLE);
+ if (!progress)
+ nr_retries--;
+
+ }
+ /* try move_account...there may be some *locked* pages. */
+ if (mem->res.usage)
+ goto move_account;
+ ret = 0;
+ goto out;
}

static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
Index: mmotm-2.6.27-rc8+/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.27-rc8+.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.27-rc8+/Documentation/controllers/memory.txt
@@ -211,7 +211,9 @@ The memory.force_empty gives an interfac

# echo 1 > memory.force_empty

-will drop all charges in cgroup. Currently, this is maintained for test.
+Will move account to parent. if parent is full, will try to free pages.
+If both parent and child are busy, returns -EBUSY;
+This file, memory.force_empty, is just for debug purpose.

4. Testing

@@ -242,8 +244,10 @@ reclaimed.

A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
cgroup might have some charge associated with it, even though all
-tasks have migrated away from it. Such charges are automatically dropped at
-rmdir() if there are no tasks.
+tasks have migrated away from it.
+Such charges are moved to its parent as much as possible and freed if parent
+is full. (see force_empty)
+If both of them are busy, rmdir() returns -EBUSY.

5. TODO

2008-10-10 09:05:34

by Kamezawa Hiroyuki

Subject: [PATCH 4/5] memcg: lazy lru free

Free page_cgroups from their LRU in a batched manner.

When uncharge() is called, the page is pushed onto a per-cpu vector and removed
from the LRU later. This routine resembles the global LRU's pagevec.
This patch is half of the whole work and forms a set with the following
lazy-LRU-add patch.

After this, a pc with PageCgroupLRU(pc)==true is on the LRU.
This LRU bit is guarded by lru_lock().

PageCgroupUsed(pc) && PageCgroupLRU(pc) means "pc" is used and on the LRU.
This check makes sense only when both locks, lock_page_cgroup() and lru_lock(),
are acquired.

PageCgroupUsed(pc) && !PageCgroupLRU(pc) means "pc" is used but not on the LRU.
!PageCgroupUsed(pc) && PageCgroupLRU(pc) means "pc" is unused but still on the
LRU. An LRU walk routine should avoid touching it.
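
Expressed as code, the rules above are used like this (a sketch; the real
consumers are __release_page_cgroup() and mem_cgroup_move_account() in the diff):

	/* with both lock_page_cgroup(pc) and mz->lru_lock held: */
	if (PageCgroupUsed(pc) && PageCgroupLRU(pc)) {
		/* "pc" is charged and on this memcg's LRU */
	} else if (PageCgroupUsed(pc) && !PageCgroupLRU(pc)) {
		/* charged, but not (yet) linked on the LRU */
	} else if (!PageCgroupUsed(pc) && PageCgroupLRU(pc)) {
		/* uncharged, waiting for lazy removal: LRU walks must skip it */
	}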

Changelog (v6) -> (v7)
- added a check against races when reading pc->mem_cgroup without the lock.

Changelog (v5) -> (v6)
- fixed a race and added the PCG_LRU bit

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

include/linux/page_cgroup.h | 5 +
mm/memcontrol.c | 210 ++++++++++++++++++++++++++++++++++++++++----
2 files changed, 199 insertions(+), 16 deletions(-)

Index: mmotm-2.6.27-rc8+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc8+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc8+/mm/memcontrol.c
@@ -34,6 +34,7 @@
#include <linux/vmalloc.h>
#include <linux/mm_inline.h>
#include <linux/page_cgroup.h>
+#include <linux/cpu.h>

#include <asm/uaccess.h>

@@ -344,7 +345,7 @@ void mem_cgroup_move_lists(struct page *
pc = lookup_page_cgroup(page);
if (!trylock_page_cgroup(pc))
return;
- if (pc && PageCgroupUsed(pc)) {
+ if (pc && PageCgroupUsed(pc) && PageCgroupLRU(pc)) {
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_move_lists(pc, lru);
@@ -470,6 +471,129 @@ unsigned long mem_cgroup_isolate_pages(u
return nr_taken;
}

+
+#define MEMCG_PCPVEC_SIZE (14) /* size of pagevec */
+struct memcg_percpu_vec {
+ int nr;
+ int limit;
+ struct page_cgroup *vec[MEMCG_PCPVEC_SIZE];
+};
+static DEFINE_PER_CPU(struct memcg_percpu_vec, memcg_free_vec);
+
+static void
+__release_page_cgroup(struct memcg_percpu_vec *mpv)
+{
+ unsigned long flags;
+ struct mem_cgroup_per_zone *mz, *prev_mz;
+ struct page_cgroup *pc;
+ struct mem_cgroup *tmp;
+ int i, nr;
+
+ local_irq_save(flags);
+ nr = mpv->nr;
+ mpv->nr = 0;
+ prev_mz = NULL;
+ for (i = nr - 1; i >= 0; i--) {
+ pc = mpv->vec[i];
+ tmp = pc->mem_cgroup;
+ mz = mem_cgroup_zoneinfo(tmp,
+ page_cgroup_nid(pc), page_cgroup_zid(pc));
+ if (prev_mz != mz) {
+ if (prev_mz)
+ spin_unlock(&prev_mz->lru_lock);
+ prev_mz = mz;
+ spin_lock(&mz->lru_lock);
+ }
+ /*
+ * this "pc" may be charge()->uncharge() while we are waiting
+ * for this. But charge() path check LRU bit and remove this
+ * from LRU if necessary. So, tmp == pc->mem_cgroup can be
+ * considered to be always true...but logically, we should check
+ * it.
+ */
+ if (!PageCgroupUsed(pc)
+ && PageCgroupLRU(pc)
+ && tmp == pc->mem_cgroup) {
+ ClearPageCgroupLRU(pc);
+ __mem_cgroup_remove_list(mz, pc);
+ css_put(&pc->mem_cgroup->css);
+ }
+ }
+ if (prev_mz)
+ spin_unlock(&prev_mz->lru_lock);
+ local_irq_restore(flags);
+
+}
+
+static void
+release_page_cgroup(struct page_cgroup *pc)
+{
+ struct memcg_percpu_vec *mpv;
+
+ mpv = &get_cpu_var(memcg_free_vec);
+ mpv->vec[mpv->nr++] = pc;
+ if (mpv->nr >= mpv->limit)
+ __release_page_cgroup(mpv);
+ put_cpu_var(memcg_free_vec);
+}
+
+static void page_cgroup_start_cache_cpu(int cpu)
+{
+ struct memcg_percpu_vec *mpv;
+ mpv = &per_cpu(memcg_free_vec, cpu);
+ mpv->limit = MEMCG_PCPVEC_SIZE;
+}
+
+#ifdef CONFIG_HOTPLUG_CPU
+static void page_cgroup_stop_cache_cpu(int cpu)
+{
+ struct memcg_percpu_vec *mpv;
+ mpv = &per_cpu(memcg_free_vec, cpu);
+ mpv->limit = 0;
+}
+#endif
+
+
+/*
+ * Used when freeing memory resource controller to remove all
+ * page_cgroup (in obsolete list).
+ */
+static DEFINE_MUTEX(memcg_force_drain_mutex);
+
+static void drain_page_cgroup_local(struct work_struct *work)
+{
+ struct memcg_percpu_vec *mpv;
+ mpv = &get_cpu_var(memcg_free_vec);
+ __release_page_cgroup(mpv);
+ put_cpu_var(mpv);
+}
+
+static void drain_page_cgroup_cpu(int cpu)
+{
+ int local_cpu;
+ struct work_struct work;
+
+ local_cpu = get_cpu();
+ if (local_cpu == cpu) {
+ drain_page_cgroup_local(NULL);
+ put_cpu();
+ return;
+ }
+ put_cpu();
+
+ INIT_WORK(&work, drain_page_cgroup_local);
+ schedule_work_on(cpu, &work);
+ flush_work(&work);
+}
+
+static void drain_page_cgroup_all(void)
+{
+ mutex_lock(&memcg_force_drain_mutex);
+ schedule_on_each_cpu(drain_page_cgroup_local);
+ mutex_unlock(&memcg_force_drain_mutex);
+}
+
+
/*
* Unlike exported interface, "oom" parameter is added. if oom==true,
* oom-killer can be invoked.
@@ -564,25 +688,46 @@ static void __mem_cgroup_commit_charge(s
unsigned long flags;

lock_page_cgroup(pc);
+ /*
+ * USED bit is set after pc->mem_cgroup has valid value.
+ */
if (unlikely(PageCgroupUsed(pc))) {
unlock_page_cgroup(pc);
res_counter_uncharge(&mem->res, PAGE_SIZE);
css_put(&mem->css);
return;
}
+ /*
+ * This page_cgroup is not used but may be on LRU.
+ */
+ if (unlikely(PageCgroupLRU(pc))) {
+ /*
+ * pc->mem_cgroup has old information. force_empty() guarantee
+ * that we never see stale mem_cgroup here.
+ */
+ mz = page_cgroup_zoneinfo(pc);
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ if (PageCgroupLRU(pc)) {
+ ClearPageCgroupLRU(pc);
+ __mem_cgroup_remove_list(mz, pc);
+ css_put(&pc->mem_cgroup->css);
+ }
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+ }
+ /* Here, PCG_LRU bit is cleared */
pc->mem_cgroup = mem;
/*
- * If a page is accounted as a page cache, insert to inactive list.
- * If anon, insert to active list.
+ * below pcg_default_flags includes PCG_LOCK bit.
*/
pc->flags = pcg_default_flags[ctype];
+ unlock_page_cgroup(pc);

mz = page_cgroup_zoneinfo(pc);

spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_add_list(mz, pc, true);
+ SetPageCgroupLRU(pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
- unlock_page_cgroup(pc);
}

/**
@@ -621,7 +766,7 @@ static int mem_cgroup_move_account(struc
if (!trylock_page_cgroup(pc))
return ret;

- if (!PageCgroupUsed(pc))
+ if (!PageCgroupUsed(pc) || !PageCgroupLRU(pc))
goto out;

if (pc->mem_cgroup != from)
@@ -808,8 +953,6 @@ __mem_cgroup_uncharge_common(struct page
{
struct page_cgroup *pc;
struct mem_cgroup *mem;
- struct mem_cgroup_per_zone *mz;
- unsigned long flags;

if (mem_cgroup_subsys.disabled)
return;
@@ -830,16 +973,13 @@ __mem_cgroup_uncharge_common(struct page
}
ClearPageCgroupUsed(pc);
mem = pc->mem_cgroup;
-
- mz = page_cgroup_zoneinfo(pc);
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_remove_list(mz, pc);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
- unlock_page_cgroup(pc);
-
+ /*
+ * We must uncharge here because "reuse" can occur just after we
+ * unlock this.
+ */
res_counter_uncharge(&mem->res, PAGE_SIZE);
- css_put(&mem->css);
-
+ unlock_page_cgroup(pc);
+ release_page_cgroup(pc);
return;
}

@@ -1072,6 +1212,7 @@ move_account:

/* This is for making all *used* pages to be on LRU. */
lru_add_drain_all();
+ drain_page_cgroup_all();
ret = 0;
for_each_node_state(node, N_POSSIBLE) {
for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
@@ -1095,6 +1236,7 @@ move_account:
}
ret = 0;
out:
+ drain_page_cgroup_all();
css_put(&mem->css);
return ret;

@@ -1316,6 +1458,38 @@ static void mem_cgroup_free(struct mem_c
vfree(mem);
}

+static void mem_cgroup_init_pcp(int cpu)
+{
+ page_cgroup_start_cache_cpu(cpu);
+}
+
+static int cpu_memcgroup_callback(struct notifier_block *nb,
+ unsigned long action, void *hcpu)
+{
+ int cpu = (long)hcpu;
+
+ switch (action) {
+ case CPU_UP_PREPARE:
+ case CPU_UP_PREPARE_FROZEN:
+ mem_cgroup_init_pcp(cpu);
+ break;
+#ifdef CONFIG_HOTPLUG_CPU
+ case CPU_DOWN_PREPARE:
+ case CPU_DOWN_PREPARE_FROZEN:
+ page_cgroup_stop_cache_cpu(cpu);
+ drain_page_cgroup_cpu(cpu);
+ break;
+#endif
+ default:
+ break;
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block __refdata memcgroup_nb =
+{
+ .notifier_call = cpu_memcgroup_callback,
+};

static struct cgroup_subsys_state *
mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
@@ -1326,6 +1500,10 @@ mem_cgroup_create(struct cgroup_subsys *
if (unlikely((cont->parent) == NULL)) {
page_cgroup_init();
mem = &init_mem_cgroup;
+ cpu_memcgroup_callback(&memcgroup_nb,
+ (unsigned long)CPU_UP_PREPARE,
+ (void *)(long)smp_processor_id());
+ register_hotcpu_notifier(&memcgroup_nb);
} else {
mem = mem_cgroup_alloc();
if (!mem)
Index: mmotm-2.6.27-rc8+/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.27-rc8+.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.27-rc8+/include/linux/page_cgroup.h
@@ -26,6 +26,7 @@ enum {
PCG_LOCK, /* page cgroup is locked */
PCG_CACHE, /* charged as cache */
PCG_USED, /* this object is in use. */
+ PCG_LRU, /* on LRU */
/* flags for LRU placement */
PCG_ACTIVE, /* page is active in this cgroup */
PCG_FILE, /* page is file system backed */
@@ -50,6 +51,10 @@ TESTPCGFLAG(Cache, CACHE)
TESTPCGFLAG(Used, USED)
CLEARPCGFLAG(Used, USED)

+SETPCGFLAG(LRU, LRU)
+TESTPCGFLAG(LRU, LRU)
+CLEARPCGFLAG(LRU, LRU)
+
/* LRU management flags (from global-lru definition) */
TESTPCGFLAG(File, FILE)
SETPCGFLAG(File, FILE)

2008-10-10 09:06:29

by Kamezawa Hiroyuki

Subject: [PATCH 5/5] memcg: lazy lru add

Delay add_to_lru() and do it in a batched manner, like a pagevec.
For doing that, 2 flags are used: PCG_USED and PCG_LRU.

Because __set_page_cgroup_lru() itself doesn't take lock_page_cgroup(),
we need a sanity check inside lru_lock().
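
The sanity check relies on the following ordering (excerpted and condensed from
the hunks further down):

	/* charge side (__mem_cgroup_commit_charge), under lock_page_cgroup(pc): */
	pc->mem_cgroup = mem;
	smp_wmb();				/* publish ->mem_cgroup before the USED bit */
	pc->flags = pcg_default_flags[ctype];	/* sets PCG_USED (and PCG_LOCK) */

	/* add side (__set_page_cgroup_lru), under mz->lru_lock only: */
	mem = pc->mem_cgroup;
	if (PageCgroupUsed(pc) && !PageCgroupLRU(pc)) {
		smp_rmb();			/* pairs with the smp_wmb() above */
		if (likely(mem == pc->mem_cgroup)) {
			SetPageCgroupLRU(pc);
			__mem_cgroup_add_list(mz, pc, true);
		}
	}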

And this delaying makes css_put()/get() complicated. To make it clear:
* css_get() is called from mem_cgroup_add_list().
* css_put() is called from mem_cgroup_remove_list().
* a css_get()/css_put() pair is held across the try_charge()->commit/cancel sequence.


Changelog: v6 -> v7
- removed redundant css_put()

Changelog: v5 -> v6.
- css_get()/css_put() come back again...they are called via add_list()/remove_list().
- the PCG_LRU bit part is moved to the release_page_cgroup_lru() patch.
- avoid TestSet and just use lock_page_cgroup() etc.
- fixed the race condition we saw in v5 (smp_wmb() and USED-bit magic help us).

Changelog: v3 -> v5.
- removed css_get/put per page_cgroup struct.
Now, the *new* force_empty checks whether there are page_cgroups on the memcg.
We don't need to be afraid of leaks.

Changelog: v2 -> v3
- added TRANSIT flag and removed lock from core logic.
Changelog: v1 -> v2:
- renamed function name from use_page_cgroup to set_page_cgroup_lru().

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

mm/memcontrol.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 73 insertions(+), 11 deletions(-)

Index: mmotm-2.6.27-rc8+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc8+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc8+/mm/memcontrol.c
@@ -255,6 +255,7 @@ static void __mem_cgroup_remove_list(str

mem_cgroup_charge_statistics(pc->mem_cgroup, pc, false);
list_del(&pc->lru);
+ css_put(&pc->mem_cgroup->css);
}

static void __mem_cgroup_add_list(struct mem_cgroup_per_zone *mz,
@@ -278,6 +279,7 @@ static void __mem_cgroup_add_list(struct
list_add_tail(&pc->lru, &mz->lists[lru]);

mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
+ css_get(&pc->mem_cgroup->css);
}

static void __mem_cgroup_move_lists(struct page_cgroup *pc, enum lru_list lru)
@@ -479,6 +481,7 @@ struct memcg_percpu_vec {
struct page_cgroup *vec[MEMCG_PCPVEC_SIZE];
};
static DEFINE_PER_CPU(struct memcg_percpu_vec, memcg_free_vec);
+static DEFINE_PER_CPU(struct memcg_percpu_vec, memcg_add_vec);

static void
__release_page_cgroup(struct memcg_percpu_vec *mpv)
@@ -516,7 +519,6 @@ __release_page_cgroup(struct memcg_percp
&& tmp == pc->mem_cgroup) {
ClearPageCgroupLRU(pc);
__mem_cgroup_remove_list(mz, pc);
- css_put(&pc->mem_cgroup->css);
}
}
if (prev_mz)
@@ -526,10 +528,53 @@ __release_page_cgroup(struct memcg_percp
}

static void
+__set_page_cgroup_lru(struct memcg_percpu_vec *mpv)
+{
+ unsigned long flags;
+ struct mem_cgroup *mem;
+ struct mem_cgroup_per_zone *mz, *prev_mz;
+ struct page_cgroup *pc;
+ int i, nr;
+
+ local_irq_save(flags);
+ nr = mpv->nr;
+ mpv->nr = 0;
+ prev_mz = NULL;
+
+ for (i = nr - 1; i >= 0; i--) {
+ pc = mpv->vec[i];
+ mem = pc->mem_cgroup;
+ mz = page_cgroup_zoneinfo(pc);
+ if (prev_mz != mz) {
+ if (prev_mz)
+ spin_unlock(&prev_mz->lru_lock);
+ prev_mz = mz;
+ spin_lock(&mz->lru_lock);
+ }
+ if (PageCgroupUsed(pc) && !PageCgroupLRU(pc)) {
+ /*
+ * while we wait for lru_lock, uncharge()->charge()
+ * can occur. check here pc->mem_cgroup is what
+ * we expected or yet.
+ */
+ smp_rmb();
+ if (likely(mem == pc->mem_cgroup)) {
+ SetPageCgroupLRU(pc);
+ __mem_cgroup_add_list(mz, pc, true);
+ }
+ }
+ }
+
+ if (prev_mz)
+ spin_unlock(&prev_mz->lru_lock);
+ local_irq_restore(flags);
+
+}
+
+static void
release_page_cgroup(struct page_cgroup *pc)
{
struct memcg_percpu_vec *mpv;
-
mpv = &get_cpu_var(memcg_free_vec);
mpv->vec[mpv->nr++] = pc;
if (mpv->nr >= mpv->limit)
@@ -537,11 +582,25 @@ release_page_cgroup(struct page_cgroup *
put_cpu_var(memcg_free_vec);
}

+static void
+set_page_cgroup_lru(struct page_cgroup *pc)
+{
+ struct memcg_percpu_vec *mpv;
+
+ mpv = &get_cpu_var(memcg_add_vec);
+ mpv->vec[mpv->nr++] = pc;
+ if (mpv->nr >= mpv->limit)
+ __set_page_cgroup_lru(mpv);
+ put_cpu_var(memcg_add_vec);
+}
+
static void page_cgroup_start_cache_cpu(int cpu)
{
struct memcg_percpu_vec *mpv;
mpv = &per_cpu(memcg_free_vec, cpu);
mpv->limit = MEMCG_PCPVEC_SIZE;
+ mpv = &per_cpu(memcg_add_vec, cpu);
+ mpv->limit = MEMCG_PCPVEC_SIZE;
}

#ifdef CONFIG_HOTPLUG_CPU
@@ -550,6 +609,8 @@ static void page_cgroup_stop_cache_cpu(i
struct memcg_percpu_vec *mpv;
mpv = &per_cpu(memcg_free_vec, cpu);
mpv->limit = 0;
+ mpv = &per_cpu(memcg_add_vec, cpu);
+ mpv->limit = 0;
}
#endif

@@ -563,6 +624,9 @@ static DEFINE_MUTEX(memcg_force_drain_mu
static void drain_page_cgroup_local(struct work_struct *work)
{
struct memcg_percpu_vec *mpv;
+ mpv = &get_cpu_var(memcg_add_vec);
+ __set_page_cgroup_lru(mpv);
+ put_cpu_var(mpv);
mpv = &get_cpu_var(memcg_free_vec);
__release_page_cgroup(mpv);
put_cpu_var(mpv);
@@ -710,24 +774,24 @@ static void __mem_cgroup_commit_charge(s
if (PageCgroupLRU(pc)) {
ClearPageCgroupLRU(pc);
__mem_cgroup_remove_list(mz, pc);
- css_put(&pc->mem_cgroup->css);
}
spin_unlock_irqrestore(&mz->lru_lock, flags);
}
/* Here, PCG_LRU bit is cleared */
pc->mem_cgroup = mem;
/*
+ * We have to set pc->mem_cgroup before set USED bit for avoiding
+ * race with (delayed) __set_page_cgroup_lru() in other cpu.
+ */
+ smp_wmb();
+ /*
* below pcg_default_flags includes PCG_LOCK bit.
*/
pc->flags = pcg_default_flags[ctype];
unlock_page_cgroup(pc);

- mz = page_cgroup_zoneinfo(pc);
-
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_add_list(mz, pc, true);
- SetPageCgroupLRU(pc);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
+ set_page_cgroup_lru(pc);
+ css_put(&mem->css);
}

/**
@@ -774,10 +838,8 @@ static int mem_cgroup_move_account(struc

if (spin_trylock(&to_mz->lru_lock)) {
__mem_cgroup_remove_list(from_mz, pc);
- css_put(&from->css);
res_counter_uncharge(&from->res, PAGE_SIZE);
pc->mem_cgroup = to;
- css_get(&to->css);
__mem_cgroup_add_list(to_mz, pc, false);
ret = 0;
spin_unlock(&to_mz->lru_lock);

2008-10-14 06:19:42

by Daisuke Nishimura

Subject: Re: [PATCH 1/5] memcg: charge-commit-cancel

On Fri, 10 Oct 2008 18:01:50 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> There is a small race in do_swap_page(). When the page swapped-in is charged,
> the mapcount can be greater than 0. But, at the same time some process (shares
> it ) call unmap and make mapcount 1->0 and the page is uncharged.
>
> CPUA CPUB
> mapcount == 1.
> (1) charge if mapcount==0 zap_pte_range()
> (2) mapcount 1 => 0.
> (3) uncharge(). (success)
> (4) set page'rmap()
> mapcoint 0=>1
>
> Then, this swap page's account is leaked.
>
> For fixing this, I added a new interface.
> - precharge
> account to res_counter by PAGE_SIZE and try to free pages if necessary.
> - commit
> register page_cgroup and add to LRU if necessary.
> - cancel
> uncharge PAGE_SIZE because of do_swap_page failure.
>
>
> CPUA
> (1) charge (always)
> (2) set page's rmap (mapcount > 0)
> (3) commit charge was necessary or not after set_pte().
>
> This protocol uses PCG_USED bit on page_cgroup for avoiding over accounting.
> Usual mem_cgroup_charge_common() does precharge -> commit at a time.
>
> And this patch also adds following function to clarify all charges.
>
> - mem_cgroup_newpage_charge() ....replacement for mem_cgroup_charge()
> called against newly allocated anon pages.
>
> - mem_cgroup_charge_migrate_fixup()
> called only from remove_migration_ptes().
> we'll have to rewrite this later.(this patch just keeps old behavior)
> This function will be removed by additional patch to make migration
> clearer.
>
> Good for clarify "what we does"
>
> Then, we have 4 following charge points.
> - newpage
> - swapin
> - add-to-cache.
> - migration.
>
> precharge/commit/cancel can be used for other places,
> - shmem, (and other places need precharge.)
> - move_account(force_empty) etc...
> - migration.
>
> we'll revisit later.
>
> Changelog v5 -> v6:
> - added newpage_charge() and migrate_fixup().
> - renamed functions for swap-in from "swap" to "swapin"
> - add more precise description.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> include/linux/memcontrol.h | 35 +++++++++-
> mm/memcontrol.c | 151 +++++++++++++++++++++++++++++++++++----------
> mm/memory.c | 12 ++-
> mm/migrate.c | 2
> mm/swapfile.c | 6 +
> 5 files changed, 165 insertions(+), 41 deletions(-)
>
> Index: mmotm-2.6.27-rc8+/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-2.6.27-rc8+.orig/include/linux/memcontrol.h
> +++ mmotm-2.6.27-rc8+/include/linux/memcontrol.h
> @@ -27,8 +27,17 @@ struct mm_struct;
>
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>
> -extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
> +extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> gfp_t gfp_mask);
> +extern int mem_cgroup_charge_migrate_fixup(struct page *page,
> + struct mm_struct *mm, gfp_t gfp_mask);
> +/* for swap handling */
> +extern int mem_cgroup_try_charge(struct mm_struct *mm,
> + gfp_t gfp_mask, struct mem_cgroup **ptr);
> +extern void mem_cgroup_commit_charge_swapin(struct page *page,
> + struct mem_cgroup *ptr);
> +extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
> +
> extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> gfp_t gfp_mask);
> extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
> @@ -71,7 +80,9 @@ extern long mem_cgroup_calc_reclaim(stru
>
>
> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> -static inline int mem_cgroup_charge(struct page *page,
> +struct mem_cgroup;
> +
> +static inline int mem_cgroup_newpage_charge(struct page *page,
> struct mm_struct *mm, gfp_t gfp_mask)
> {
> return 0;
> @@ -83,6 +94,26 @@ static inline int mem_cgroup_cache_charg
> return 0;
> }
>
> +static inline int mem_cgroup_charge_migrate_fixup(struct page *page,
> + struct mm_struct *mm, gfp_t gfp_mask)
> +{
> + return 0;
> +}
> +
> +static int mem_cgroup_try_charge(struct mm_struct *mm,
> + gfp_t gfp_mask, struct mem_cgroup **ptr)
> +{
> + return 0;
> +}
> +
> +static void mem_cgroup_commit_charge_swapin(struct page *page,
> + struct mem_cgroup *ptr)
> +{
> +}
> +static void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr)
> +{
> +}
> +
> static inline void mem_cgroup_uncharge_page(struct page *page)
> {
> }
> Index: mmotm-2.6.27-rc8+/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.27-rc8+.orig/mm/memcontrol.c
> +++ mmotm-2.6.27-rc8+/mm/memcontrol.c
> @@ -467,35 +467,31 @@ unsigned long mem_cgroup_isolate_pages(u
> return nr_taken;
> }
>
> -/*
> - * Charge the memory controller for page usage.
> - * Return
> - * 0 if the charge was successful
> - * < 0 if the cgroup is over its limit
> +
> +/**
> + * mem_cgroup_try_charge - get charge of PAGE_SIZE.
> + * @mm: an mm_struct which is charged against. (when *memcg is NULL)
> + * @gfp_mask: gfp_mask for reclaim.
> + * @memcg: a pointer to memory cgroup which is charged against.
> + *
> + * charge aginst memory cgroup pointed by *memcg. if *memcg == NULL, estimated
> + * memory cgroup from @mm is got and stored in *memcg.
> + *
> + * Retruns 0 if success. -ENOMEM at failure.
> */
> -static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
> - gfp_t gfp_mask, enum charge_type ctype,
> - struct mem_cgroup *memcg)
> +
> +int mem_cgroup_try_charge(struct mm_struct *mm,
> + gfp_t gfp_mask, struct mem_cgroup **memcg)
> {
> struct mem_cgroup *mem;
> - struct page_cgroup *pc;
> - unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> - struct mem_cgroup_per_zone *mz;
> - unsigned long flags;
> -
> - pc = lookup_page_cgroup(page);
> - /* can happen at boot */
> - if (unlikely(!pc))
> - return 0;
> - prefetchw(pc);
> + int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> /*
> * We always charge the cgroup the mm_struct belongs to.
> * The mm_struct's mem_cgroup changes on task migration if the
> * thread group leader migrates. It's possible that mm is not
> * set, if so charge the init_mm (happens for pagecache usage).
> */
> -
> - if (likely(!memcg)) {
> + if (likely(!*memcg)) {
> rcu_read_lock();
> mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
> if (unlikely(!mem)) {
> @@ -506,15 +502,17 @@ static int mem_cgroup_charge_common(stru
> * For every charge from the cgroup, increment reference count
> */
> css_get(&mem->css);
> + *memcg = mem;
> rcu_read_unlock();
> } else {
> - mem = memcg;
> - css_get(&memcg->css);
> + mem = *memcg;
> + css_get(&mem->css);
> }
>
> +
> while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
> if (!(gfp_mask & __GFP_WAIT))
> - goto out;
> + goto nomem;
>
> if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
> continue;
> @@ -531,18 +529,33 @@ static int mem_cgroup_charge_common(stru
>
> if (!nr_retries--) {
> mem_cgroup_out_of_memory(mem, gfp_mask);
> - goto out;
> + goto nomem;
> }
> }
> + return 0;
> +nomem:
> + css_put(&mem->css);
> + return -ENOMEM;
> +}
>
> +/*
> + * commit a charge got by mem_cgroup_try_charge() and makes page_cgroup to be
> + * USED state. If already USED, uncharge and return.
> + */
> +
> +static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
> + struct page_cgroup *pc,
> + enum charge_type ctype)
> +{
> + struct mem_cgroup_per_zone *mz;
> + unsigned long flags;
>
mem_cgroup_try_charge may return with *memcg left NULL,
so __mem_cgroup_commit_charge should handle this case.
(mem_cgroup_commit_charge_swapin and mem_cgroup_cancel_charge_swapin
already handle it.)

I have actually seen a NULL-pointer-dereference BUG caused by this
while testing a previous version.
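
Something like this at the top of __mem_cgroup_commit_charge() is what I have
in mind (untested, just to illustrate against this patch):

	/* before taking lock_page_cgroup(pc) */
	if (unlikely(!mem))	/* try_charge() may succeed with *memcg left NULL */
		return;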

> lock_page_cgroup(pc);
> if (unlikely(PageCgroupUsed(pc))) {
> unlock_page_cgroup(pc);
> res_counter_uncharge(&mem->res, PAGE_SIZE);
> css_put(&mem->css);
> -
> - goto done;
> + return;
> }
> pc->mem_cgroup = mem;
> /*
> @@ -557,15 +570,39 @@ static int mem_cgroup_charge_common(stru
> __mem_cgroup_add_list(mz, pc);
> spin_unlock_irqrestore(&mz->lru_lock, flags);
> unlock_page_cgroup(pc);
> +}
>
> -done:
> +/*
> + * Charge the memory controller for page usage.
> + * Return
> + * 0 if the charge was successful
> + * < 0 if the cgroup is over its limit
> + */
> +static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
> + gfp_t gfp_mask, enum charge_type ctype,
> + struct mem_cgroup *memcg)
> +{
> + struct mem_cgroup *mem;
> + struct page_cgroup *pc;
> + int ret;
> +
> + pc = lookup_page_cgroup(page);
> + /* can happen at boot */
> + if (unlikely(!pc))
> + return 0;
> + prefetchw(pc);
> +
> + mem = memcg;
> + ret = mem_cgroup_try_charge(mm, gfp_mask, &mem);
> + if (ret)
> + return ret;
> +
> + __mem_cgroup_commit_charge(mem, pc, ctype);
> return 0;
> -out:
> - css_put(&mem->css);
> - return -ENOMEM;
> }
>
> -int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> +int mem_cgroup_newpage_charge(struct page *page,
> + struct mm_struct *mm, gfp_t gfp_mask)
> {
> if (mem_cgroup_subsys.disabled)
> return 0;
> @@ -586,6 +623,34 @@ int mem_cgroup_charge(struct page *page,
> MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
> }
>
> +/*
> + * same as mem_cgroup_newpage_charge(), now.
> + * But what we assume is different from newpage, and this is special case.
> + * treat this in special function. easy for maintainance.
> + */
> +
> +int mem_cgroup_charge_migrate_fixup(struct page *page,
> + struct mm_struct *mm, gfp_t gfp_mask)
> +{
> + if (mem_cgroup_subsys.disabled)
> + return 0;
> +
> + if (PageCompound(page))
> + return 0;
> +
> + if (page_mapped(page) || (page->mapping && !PageAnon(page)))
> + return 0;
> +
> + if (unlikely(!mm))
> + mm = &init_mm;
> +
> + return mem_cgroup_charge_common(page, mm, gfp_mask,
> + MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
> +}
> +
> +
> +
> +
> int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> gfp_t gfp_mask)
> {
> @@ -628,6 +693,30 @@ int mem_cgroup_cache_charge(struct page
> MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
> }
>
> +
> +void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> +{
> + struct page_cgroup *pc;
> +
> + if (mem_cgroup_subsys.disabled)
> + return;
> + if (!ptr)
> + return;
> + pc = lookup_page_cgroup(page);
> + __mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> +}
> +
> +void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
> +{
> + if (mem_cgroup_subsys.disabled)
> + return;
> + if (!mem)
> + return;
> + res_counter_uncharge(&mem->res, PAGE_SIZE);
> + css_put(&mem->css);
> +}
> +
> +
> /*
> * uncharge if !page_mapped(page)
> */
> Index: mmotm-2.6.27-rc8+/mm/memory.c
> ===================================================================
> --- mmotm-2.6.27-rc8+.orig/mm/memory.c
> +++ mmotm-2.6.27-rc8+/mm/memory.c
> @@ -1891,7 +1891,7 @@ gotten:
> cow_user_page(new_page, old_page, address, vma);
> __SetPageUptodate(new_page);
>
> - if (mem_cgroup_charge(new_page, mm, GFP_KERNEL))
> + if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
> goto oom_free_new;
>
> /*
> @@ -2287,6 +2287,7 @@ static int do_swap_page(struct mm_struct
> struct page *page;
> swp_entry_t entry;
> pte_t pte;
> + struct mem_cgroup *ptr = NULL;
> int ret = 0;
>
> if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
> @@ -2325,7 +2326,7 @@ static int do_swap_page(struct mm_struct
> lock_page(page);
> delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
>
> - if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
> + if (mem_cgroup_try_charge(mm, GFP_KERNEL, &ptr) == -ENOMEM) {
> ret = VM_FAULT_OOM;
> unlock_page(page);
> goto out;
> @@ -2355,6 +2356,7 @@ static int do_swap_page(struct mm_struct
> flush_icache_page(vma, page);
> set_pte_at(mm, address, page_table, pte);
> page_add_anon_rmap(page, vma, address);
> + mem_cgroup_commit_charge_swapin(page, ptr);
>
> swap_free(entry);
> if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> @@ -2375,7 +2377,7 @@ unlock:
> out:
> return ret;
> out_nomap:
> - mem_cgroup_uncharge_page(page);
> + mem_cgroup_cancel_charge_swapin(ptr);
> pte_unmap_unlock(page_table, ptl);
> unlock_page(page);
> page_cache_release(page);
> @@ -2405,7 +2407,7 @@ static int do_anonymous_page(struct mm_s
> goto oom;
> __SetPageUptodate(page);
>
> - if (mem_cgroup_charge(page, mm, GFP_KERNEL))
> + if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))
> goto oom_free_page;
>
> entry = mk_pte(page, vma->vm_page_prot);
> @@ -2498,7 +2500,7 @@ static int __do_fault(struct mm_struct *
> ret = VM_FAULT_OOM;
> goto out;
> }
> - if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
> + if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
> ret = VM_FAULT_OOM;
> page_cache_release(page);
> goto out;
> Index: mmotm-2.6.27-rc8+/mm/migrate.c
> ===================================================================
> --- mmotm-2.6.27-rc8+.orig/mm/migrate.c
> +++ mmotm-2.6.27-rc8+/mm/migrate.c
> @@ -133,7 +133,7 @@ static void remove_migration_pte(struct
> * be reliable, and this charge can actually fail: oh well, we don't
> * make the situation any worse by proceeding as if it had succeeded.
> */
> - mem_cgroup_charge(new, mm, GFP_ATOMIC);
> + mem_cgroup_charge_migrate_fixup(new, mm, GFP_ATOMIC);
>
> get_page(new);
> pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
> Index: mmotm-2.6.27-rc8+/mm/swapfile.c
> ===================================================================
> --- mmotm-2.6.27-rc8+.orig/mm/swapfile.c
> +++ mmotm-2.6.27-rc8+/mm/swapfile.c
> @@ -530,17 +530,18 @@ unsigned int count_swap_pages(int type,
> static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long addr, swp_entry_t entry, struct page *page)
> {
> + struct mem_cgroup *ptr;
should be initialized to NULL.
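i.e. something like (just to show what I mean):

	struct mem_cgroup *ptr = NULL;	/* try_charge() treats a non-NULL *ptr as the target cgroup */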


Thanks,
Daisuke Nishimura.

> spinlock_t *ptl;
> pte_t *pte;
> int ret = 1;
>
> - if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
> + if (mem_cgroup_try_charge(vma->vm_mm, GFP_KERNEL, &ptr))
> ret = -ENOMEM;
>
> pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> if (unlikely(!pte_same(*pte, swp_entry_to_pte(entry)))) {
> if (ret > 0)
> - mem_cgroup_uncharge_page(page);
> + mem_cgroup_cancel_charge_swapin(ptr);
> ret = 0;
> goto out;
> }
> @@ -550,6 +551,7 @@ static int unuse_pte(struct vm_area_stru
> set_pte_at(vma->vm_mm, addr, pte,
> pte_mkold(mk_pte(page, vma->vm_page_prot)));
> page_add_anon_rmap(page, vma, addr);
> + mem_cgroup_commit_charge_swapin(page, ptr);
> swap_free(entry);
> /*
> * Move the page to the active list so it is not
>

2008-10-15 00:32:37

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 1/5] memcg: charge-commit-cancel

On Tue, 14 Oct 2008 15:12:43 +0900
Daisuke Nishimura <[email protected]> wrote:

> On Fri, 10 Oct 2008 18:01:50 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > There is a small race in do_swap_page(). When the page swapped-in is charged,
> > the mapcount can be greater than 0. But, at the same time some process (shares
> > it ) call unmap and make mapcount 1->0 and the page is uncharged.
> >
> > CPUA CPUB
> > mapcount == 1.
> > (1) charge if mapcount==0 zap_pte_range()
> > (2) mapcount 1 => 0.
> > (3) uncharge(). (success)
> > (4) set page'rmap()
> > mapcoint 0=>1
> >
> > Then, this swap page's account is leaked.
> >
> > For fixing this, I added a new interface.
> > - precharge
> > account to res_counter by PAGE_SIZE and try to free pages if necessary.
> > - commit
> > register page_cgroup and add to LRU if necessary.
> > - cancel
> > uncharge PAGE_SIZE because of do_swap_page failure.
> >
> >
> > CPUA
> > (1) charge (always)
> > (2) set page's rmap (mapcount > 0)
> > (3) commit charge was necessary or not after set_pte().
> >
> > This protocol uses PCG_USED bit on page_cgroup for avoiding over accounting.
> > Usual mem_cgroup_charge_common() does precharge -> commit at a time.
> >
> > And this patch also adds following function to clarify all charges.
> >
> > - mem_cgroup_newpage_charge() ....replacement for mem_cgroup_charge()
> > called against newly allocated anon pages.
> >
> > - mem_cgroup_charge_migrate_fixup()
> > called only from remove_migration_ptes().
> > we'll have to rewrite this later.(this patch just keeps old behavior)
> > This function will be removed by additional patch to make migration
> > clearer.
> >
> > Good for clarify "what we does"
> >
> > Then, we have 4 following charge points.
> > - newpage
> > - swapin
> > - add-to-cache.
> > - migration.
> >
> > precharge/commit/cancel can be used for other places,
> > - shmem, (and other places need precharge.)
> > - move_account(force_empty) etc...
> > - migration.
> >
> > we'll revisit later.
> >
> > Changelog v5 -> v6:
> > - added newpage_charge() and migrate_fixup().
> > - renamed functions for swap-in from "swap" to "swapin"
> > - add more precise description.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> >
> > include/linux/memcontrol.h | 35 +++++++++-
> > mm/memcontrol.c | 151 +++++++++++++++++++++++++++++++++++----------
> > mm/memory.c | 12 ++-
> > mm/migrate.c | 2
> > mm/swapfile.c | 6 +
> > 5 files changed, 165 insertions(+), 41 deletions(-)
> >
> > Index: mmotm-2.6.27-rc8+/include/linux/memcontrol.h
> > ===================================================================
> > --- mmotm-2.6.27-rc8+.orig/include/linux/memcontrol.h
> > +++ mmotm-2.6.27-rc8+/include/linux/memcontrol.h
> > @@ -27,8 +27,17 @@ struct mm_struct;
> >
> > #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> >
> > -extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
> > +extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> > gfp_t gfp_mask);
> > +extern int mem_cgroup_charge_migrate_fixup(struct page *page,
> > + struct mm_struct *mm, gfp_t gfp_mask);
> > +/* for swap handling */
> > +extern int mem_cgroup_try_charge(struct mm_struct *mm,
> > + gfp_t gfp_mask, struct mem_cgroup **ptr);
> > +extern void mem_cgroup_commit_charge_swapin(struct page *page,
> > + struct mem_cgroup *ptr);
> > +extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
> > +
> > extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> > gfp_t gfp_mask);
> > extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
> > @@ -71,7 +80,9 @@ extern long mem_cgroup_calc_reclaim(stru
> >
> >
> > #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> > -static inline int mem_cgroup_charge(struct page *page,
> > +struct mem_cgroup;
> > +
> > +static inline int mem_cgroup_newpage_charge(struct page *page,
> > struct mm_struct *mm, gfp_t gfp_mask)
> > {
> > return 0;
> > @@ -83,6 +94,26 @@ static inline int mem_cgroup_cache_charg
> > return 0;
> > }
> >
> > +static inline int mem_cgroup_charge_migrate_fixup(struct page *page,
> > + struct mm_struct *mm, gfp_t gfp_mask)
> > +{
> > + return 0;
> > +}
> > +
> > +static int mem_cgroup_try_charge(struct mm_struct *mm,
> > + gfp_t gfp_mask, struct mem_cgroup **ptr)
> > +{
> > + return 0;
> > +}
> > +
> > +static void mem_cgroup_commit_charge_swapin(struct page *page,
> > + struct mem_cgroup *ptr)
> > +{
> > +}
> > +static void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr)
> > +{
> > +}
> > +
> > static inline void mem_cgroup_uncharge_page(struct page *page)
> > {
> > }
> > Index: mmotm-2.6.27-rc8+/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.27-rc8+.orig/mm/memcontrol.c
> > +++ mmotm-2.6.27-rc8+/mm/memcontrol.c
> > @@ -467,35 +467,31 @@ unsigned long mem_cgroup_isolate_pages(u
> > return nr_taken;
> > }
> >
> > -/*
> > - * Charge the memory controller for page usage.
> > - * Return
> > - * 0 if the charge was successful
> > - * < 0 if the cgroup is over its limit
> > +
> > +/**
> > + * mem_cgroup_try_charge - get charge of PAGE_SIZE.
> > + * @mm: an mm_struct which is charged against. (when *memcg is NULL)
> > + * @gfp_mask: gfp_mask for reclaim.
> > + * @memcg: a pointer to memory cgroup which is charged against.
> > + *
> > + * charge aginst memory cgroup pointed by *memcg. if *memcg == NULL, estimated
> > + * memory cgroup from @mm is got and stored in *memcg.
> > + *
> > + * Retruns 0 if success. -ENOMEM at failure.
> > */
> > -static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
> > - gfp_t gfp_mask, enum charge_type ctype,
> > - struct mem_cgroup *memcg)
> > +
> > +int mem_cgroup_try_charge(struct mm_struct *mm,
> > + gfp_t gfp_mask, struct mem_cgroup **memcg)
> > {
> > struct mem_cgroup *mem;
> > - struct page_cgroup *pc;
> > - unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> > - struct mem_cgroup_per_zone *mz;
> > - unsigned long flags;
> > -
> > - pc = lookup_page_cgroup(page);
> > - /* can happen at boot */
> > - if (unlikely(!pc))
> > - return 0;
> > - prefetchw(pc);
> > + int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> > /*
> > * We always charge the cgroup the mm_struct belongs to.
> > * The mm_struct's mem_cgroup changes on task migration if the
> > * thread group leader migrates. It's possible that mm is not
> > * set, if so charge the init_mm (happens for pagecache usage).
> > */
> > -
> > - if (likely(!memcg)) {
> > + if (likely(!*memcg)) {
> > rcu_read_lock();
> > mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
> > if (unlikely(!mem)) {
> > @@ -506,15 +502,17 @@ static int mem_cgroup_charge_common(stru
> > * For every charge from the cgroup, increment reference count
> > */
> > css_get(&mem->css);
> > + *memcg = mem;
> > rcu_read_unlock();
> > } else {
> > - mem = memcg;
> > - css_get(&memcg->css);
> > + mem = *memcg;
> > + css_get(&mem->css);
> > }
> >
> > +
> > while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
> > if (!(gfp_mask & __GFP_WAIT))
> > - goto out;
> > + goto nomem;
> >
> > if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
> > continue;
> > @@ -531,18 +529,33 @@ static int mem_cgroup_charge_common(stru
> >
> > if (!nr_retries--) {
> > mem_cgroup_out_of_memory(mem, gfp_mask);
> > - goto out;
> > + goto nomem;
> > }
> > }
> > + return 0;
> > +nomem:
> > + css_put(&mem->css);
> > + return -ENOMEM;
> > +}
> >
> > +/*
> > + * commit a charge got by mem_cgroup_try_charge() and makes page_cgroup to be
> > + * USED state. If already USED, uncharge and return.
> > + */
> > +
> > +static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
> > + struct page_cgroup *pc,
> > + enum charge_type ctype)
> > +{
> > + struct mem_cgroup_per_zone *mz;
> > + unsigned long flags;
> >
> mem_cgroup_try_charge may return with *memcg left NULL,
> so __mem_cgroup_commit_charge should handle this case.
> (mem_cgroup_commit_charge_swapin and mem_cgroup_cancel_charge_swapin
> already handle it.)
>
Hmm... I'll look into this again, thanks!


> I have actually seen a NULL-pointer-dereference BUG caused by this
> while testing a previous version.
>
> > lock_page_cgroup(pc);
> > if (unlikely(PageCgroupUsed(pc))) {
> > unlock_page_cgroup(pc);
> > res_counter_uncharge(&mem->res, PAGE_SIZE);
> > css_put(&mem->css);
> > -
> > - goto done;
> > + return;
> > }
> > pc->mem_cgroup = mem;
> > /*
> > @@ -557,15 +570,39 @@ static int mem_cgroup_charge_common(stru
> > __mem_cgroup_add_list(mz, pc);
> > spin_unlock_irqrestore(&mz->lru_lock, flags);
> > unlock_page_cgroup(pc);
> > +}
> >
> > -done:
> > +/*
> > + * Charge the memory controller for page usage.
> > + * Return
> > + * 0 if the charge was successful
> > + * < 0 if the cgroup is over its limit
> > + */
> > +static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
> > + gfp_t gfp_mask, enum charge_type ctype,
> > + struct mem_cgroup *memcg)
> > +{
> > + struct mem_cgroup *mem;
> > + struct page_cgroup *pc;
> > + int ret;
> > +
> > + pc = lookup_page_cgroup(page);
> > + /* can happen at boot */
> > + if (unlikely(!pc))
> > + return 0;
> > + prefetchw(pc);
> > +
> > + mem = memcg;
> > + ret = mem_cgroup_try_charge(mm, gfp_mask, &mem);
> > + if (ret)
> > + return ret;
> > +
> > + __mem_cgroup_commit_charge(mem, pc, ctype);
> > return 0;
> > -out:
> > - css_put(&mem->css);
> > - return -ENOMEM;
> > }
> >
> > -int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> > +int mem_cgroup_newpage_charge(struct page *page,
> > + struct mm_struct *mm, gfp_t gfp_mask)
> > {
> > if (mem_cgroup_subsys.disabled)
> > return 0;
> > @@ -586,6 +623,34 @@ int mem_cgroup_charge(struct page *page,
> > MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
> > }
> >
> > +/*
> > + * same as mem_cgroup_newpage_charge(), now.
> > + * But what we assume is different from newpage, and this is special case.
> > + * treat this in special function. easy for maintainance.
> > + */
> > +
> > +int mem_cgroup_charge_migrate_fixup(struct page *page,
> > + struct mm_struct *mm, gfp_t gfp_mask)
> > +{
> > + if (mem_cgroup_subsys.disabled)
> > + return 0;
> > +
> > + if (PageCompound(page))
> > + return 0;
> > +
> > + if (page_mapped(page) || (page->mapping && !PageAnon(page)))
> > + return 0;
> > +
> > + if (unlikely(!mm))
> > + mm = &init_mm;
> > +
> > + return mem_cgroup_charge_common(page, mm, gfp_mask,
> > + MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
> > +}
> > +
> > +
> > +
> > +
> > int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> > gfp_t gfp_mask)
> > {
> > @@ -628,6 +693,30 @@ int mem_cgroup_cache_charge(struct page
> > MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
> > }
> >
> > +
> > +void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> > +{
> > + struct page_cgroup *pc;
> > +
> > + if (mem_cgroup_subsys.disabled)
> > + return;
> > + if (!ptr)
> > + return;
> > + pc = lookup_page_cgroup(page);
> > + __mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> > +}
> > +
> > +void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
> > +{
> > + if (mem_cgroup_subsys.disabled)
> > + return;
> > + if (!mem)
> > + return;
> > + res_counter_uncharge(&mem->res, PAGE_SIZE);
> > + css_put(&mem->css);
> > +}
> > +
> > +
> > /*
> > * uncharge if !page_mapped(page)
> > */
> > Index: mmotm-2.6.27-rc8+/mm/memory.c
> > ===================================================================
> > --- mmotm-2.6.27-rc8+.orig/mm/memory.c
> > +++ mmotm-2.6.27-rc8+/mm/memory.c
> > @@ -1891,7 +1891,7 @@ gotten:
> > cow_user_page(new_page, old_page, address, vma);
> > __SetPageUptodate(new_page);
> >
> > - if (mem_cgroup_charge(new_page, mm, GFP_KERNEL))
> > + if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
> > goto oom_free_new;
> >
> > /*
> > @@ -2287,6 +2287,7 @@ static int do_swap_page(struct mm_struct
> > struct page *page;
> > swp_entry_t entry;
> > pte_t pte;
> > + struct mem_cgroup *ptr = NULL;
> > int ret = 0;
> >
> > if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
> > @@ -2325,7 +2326,7 @@ static int do_swap_page(struct mm_struct
> > lock_page(page);
> > delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
> >
> > - if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
> > + if (mem_cgroup_try_charge(mm, GFP_KERNEL, &ptr) == -ENOMEM) {
> > ret = VM_FAULT_OOM;
> > unlock_page(page);
> > goto out;
> > @@ -2355,6 +2356,7 @@ static int do_swap_page(struct mm_struct
> > flush_icache_page(vma, page);
> > set_pte_at(mm, address, page_table, pte);
> > page_add_anon_rmap(page, vma, address);
> > + mem_cgroup_commit_charge_swapin(page, ptr);
> >
> > swap_free(entry);
> > if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> > @@ -2375,7 +2377,7 @@ unlock:
> > out:
> > return ret;
> > out_nomap:
> > - mem_cgroup_uncharge_page(page);
> > + mem_cgroup_cancel_charge_swapin(ptr);
> > pte_unmap_unlock(page_table, ptl);
> > unlock_page(page);
> > page_cache_release(page);
> > @@ -2405,7 +2407,7 @@ static int do_anonymous_page(struct mm_s
> > goto oom;
> > __SetPageUptodate(page);
> >
> > - if (mem_cgroup_charge(page, mm, GFP_KERNEL))
> > + if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))
> > goto oom_free_page;
> >
> > entry = mk_pte(page, vma->vm_page_prot);
> > @@ -2498,7 +2500,7 @@ static int __do_fault(struct mm_struct *
> > ret = VM_FAULT_OOM;
> > goto out;
> > }
> > - if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
> > + if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
> > ret = VM_FAULT_OOM;
> > page_cache_release(page);
> > goto out;
> > Index: mmotm-2.6.27-rc8+/mm/migrate.c
> > ===================================================================
> > --- mmotm-2.6.27-rc8+.orig/mm/migrate.c
> > +++ mmotm-2.6.27-rc8+/mm/migrate.c
> > @@ -133,7 +133,7 @@ static void remove_migration_pte(struct
> > * be reliable, and this charge can actually fail: oh well, we don't
> > * make the situation any worse by proceeding as if it had succeeded.
> > */
> > - mem_cgroup_charge(new, mm, GFP_ATOMIC);
> > + mem_cgroup_charge_migrate_fixup(new, mm, GFP_ATOMIC);
> >
> > get_page(new);
> > pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
> > Index: mmotm-2.6.27-rc8+/mm/swapfile.c
> > ===================================================================
> > --- mmotm-2.6.27-rc8+.orig/mm/swapfile.c
> > +++ mmotm-2.6.27-rc8+/mm/swapfile.c
> > @@ -530,17 +530,18 @@ unsigned int count_swap_pages(int type,
> > static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
> > unsigned long addr, swp_entry_t entry, struct page *page)
> > {
> > + struct mem_cgroup *ptr;
> should be initialized to NULL.
>
Exactly.

Thanks,
-Kame

2008-10-15 08:24:27

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [PATCH 2/5] memcg: migration account fix

> @@ -795,43 +767,67 @@ int mem_cgroup_prepare_migration(struct
> if (PageCgroupUsed(pc)) {
> mem = pc->mem_cgroup;
> css_get(&mem->css);
> - if (PageCgroupCache(pc)) {
> - if (page_is_file_cache(page))
> - ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
> - else
> - ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
> - }
> }
> unlock_page_cgroup(pc);
> +
> if (mem) {
> - ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
> - ctype, mem);
> + ret = mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem);
> css_put(&mem->css);
> + *ptr = mem;
> }
> return ret;
> }
>
"*ptr = mem" should be outside of if(mem).
Otherwise, ptr would be kept unset when !PageCgroupUsed.
(unmap_and_move, caller of prepare_migration, doesn't initilize it.)
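
Something like this (untested) is what I mean:

	if (mem) {
		ret = mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem);
		css_put(&mem->css);
	}
	*ptr = mem;	/* stays NULL when the old page was not charged */
	return ret;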

And,

> /* remove redundant charge if migration failed*/
> -void mem_cgroup_end_migration(struct page *newpage)
> +void mem_cgroup_end_migration(struct mem_cgroup *mem,
> + struct page *oldpage, struct page *newpage)
> {
> + struct page *target, *unused;
> + struct page_cgroup *pc;
> + enum charge_type ctype;
> +
mem_cgroup_end_migration should handle the "mem == NULL" case
(just returning early would be enough).
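
e.g. just an early return at the top of the function (untested):

	if (!mem)	/* prepare_migration() charged nothing for this page */
		return;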

> + /* at migration success, oldpage->mapping is NULL. */
> + if (oldpage->mapping) {
> + target = oldpage;
> + unused = NULL;
> + } else {
> + target = newpage;
> + unused = oldpage;
> + }
> +
> + if (PageAnon(target))
> + ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
> + else if (page_is_file_cache(target))
> + ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
> + else
> + ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
> +
> + /* unused page is not on radix-tree now. */
> + if (unused && ctype != MEM_CGROUP_CHARGE_TYPE_MAPPED)
> + __mem_cgroup_uncharge_common(unused, ctype);
> +
> + pc = lookup_page_cgroup(target);
> /*
> - * At success, page->mapping is not NULL.
> - * special rollback care is necessary when
> - * 1. at migration failure. (newpage->mapping is cleared in this case)
> - * 2. the newpage was moved but not remapped again because the task
> - * exits and the newpage is obsolete. In this case, the new page
> - * may be a swapcache. So, we just call mem_cgroup_uncharge_page()
> - * always for avoiding mess. The page_cgroup will be removed if
> - * unnecessary. File cache pages is still on radix-tree. Don't
> - * care it.
> + * __mem_cgroup_commit_charge() check PCG_USED bit of page_cgroup.
> + * So, double-counting is effectively avoided.
> */
> - if (!newpage->mapping)
> - __mem_cgroup_uncharge_common(newpage,
> - MEM_CGROUP_CHARGE_TYPE_FORCE);
> - else if (PageAnon(newpage))
> - mem_cgroup_uncharge_page(newpage);
> + __mem_cgroup_commit_charge(mem, pc, ctype);
> +
> + /*
> + * Both of oldpage and newpage are still under lock_page().
> + * Then, we don't have to care about race in radix-tree.
> + * But we have to be careful that this page is unmapped or not.
> + *
> + * There is a case for !page_mapped(). At the start of
> + * migration, oldpage was mapped. But now, it's zapped.
> + * But we know *target* page is not freed/reused under us.
> + * mem_cgroup_uncharge_page() does all necessary checks.
> + */
> + if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
> + mem_cgroup_uncharge_page(target);
> }
>
> +
> /*
> * A call to try to shrink memory usage under specified resource controller.
> * This is typically used for page reclaiming for shmem for reducing side
>

BTW, I'm now testing the v7 patches with some fixes I've reported,
and they have worked well so far (for several hours) in my tests
(testing page migration and rmdir(force_empty) under swap-in/out activity).


Thanks,
Daisuke Nishimura.

2008-10-15 08:42:36

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 2/5] memcg: migration account fix

On Wed, 15 Oct 2008 17:16:55 +0900
Daisuke Nishimura <[email protected]> wrote:

> > @@ -795,43 +767,67 @@ int mem_cgroup_prepare_migration(struct
> > if (PageCgroupUsed(pc)) {
> > mem = pc->mem_cgroup;
> > css_get(&mem->css);
> > - if (PageCgroupCache(pc)) {
> > - if (page_is_file_cache(page))
> > - ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
> > - else
> > - ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
> > - }
> > }
> > unlock_page_cgroup(pc);
> > +
> > if (mem) {
> > - ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
> > - ctype, mem);
> > + ret = mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem);
> > css_put(&mem->css);
> > + *ptr = mem;
> > }
> > return ret;
> > }
> >
> "*ptr = mem" should be outside of if(mem).
> Otherwise, ptr would be kept unset when !PageCgroupUsed.
> (unmap_and_move, caller of prepare_migration, doesn't initilize it.)
>
Hmm, I see.

> And,
>
> > /* remove redundant charge if migration failed*/
> > -void mem_cgroup_end_migration(struct page *newpage)
> > +void mem_cgroup_end_migration(struct mem_cgroup *mem,
> > + struct page *oldpage, struct page *newpage)
> > {
> > + struct page *target, *unused;
> > + struct page_cgroup *pc;
> > + enum charge_type ctype;
> > +
> mem_cgroup_end_migration should handle the "mem == NULL" case
> (just returning early would be enough).
>
Yes, you're right.

> > + /* at migration success, oldpage->mapping is NULL. */
> > + if (oldpage->mapping) {
> > + target = oldpage;
> > + unused = NULL;
> > + } else {
> > + target = newpage;
> > + unused = oldpage;
> > + }
> > +
> > + if (PageAnon(target))
> > + ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
> > + else if (page_is_file_cache(target))
> > + ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
> > + else
> > + ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
> > +
> > + /* unused page is not on radix-tree now. */
> > + if (unused && ctype != MEM_CGROUP_CHARGE_TYPE_MAPPED)
> > + __mem_cgroup_uncharge_common(unused, ctype);
> > +
> > + pc = lookup_page_cgroup(target);
> > /*
> > - * At success, page->mapping is not NULL.
> > - * special rollback care is necessary when
> > - * 1. at migration failure. (newpage->mapping is cleared in this case)
> > - * 2. the newpage was moved but not remapped again because the task
> > - * exits and the newpage is obsolete. In this case, the new page
> > - * may be a swapcache. So, we just call mem_cgroup_uncharge_page()
> > - * always for avoiding mess. The page_cgroup will be removed if
> > - * unnecessary. File cache pages is still on radix-tree. Don't
> > - * care it.
> > + * __mem_cgroup_commit_charge() check PCG_USED bit of page_cgroup.
> > + * So, double-counting is effectively avoided.
> > */
> > - if (!newpage->mapping)
> > - __mem_cgroup_uncharge_common(newpage,
> > - MEM_CGROUP_CHARGE_TYPE_FORCE);
> > - else if (PageAnon(newpage))
> > - mem_cgroup_uncharge_page(newpage);
> > + __mem_cgroup_commit_charge(mem, pc, ctype);
> > +
> > + /*
> > + * Both of oldpage and newpage are still under lock_page().
> > + * Then, we don't have to care about race in radix-tree.
> > + * But we have to be careful that this page is unmapped or not.
> > + *
> > + * There is a case for !page_mapped(). At the start of
> > + * migration, oldpage was mapped. But now, it's zapped.
> > + * But we know *target* page is not freed/reused under us.
> > + * mem_cgroup_uncharge_page() does all necessary checks.
> > + */
> > + if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
> > + mem_cgroup_uncharge_page(target);
> > }
> >
> > +
> > /*
> > * A call to try to shrink memory usage under specified resource controller.
> > * This is typically used for page reclaiming for shmem for reducing side
> >
>
> BTW, I'm now testing the v7 patches with some fixes I've reported,
> and they have worked well so far (for several hours) in my tests
> (testing page migration and rmdir(force_empty) under swap-in/out activity).
>
Good to hear that :)

Thank you for all your help, patient review, and tests!

Thanks,
-Kame