2008-10-28 10:09:53

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 0/4][mmotm] memcg clean up

This set is easy clean up and fixes to current memory resource controller in mmotm.

Contents are
[1/4] make cgroup menu as submenu
[2/4] divide mem_cgroup's charge behavior to charge/commit/cancel
[3/4] fix gfp_mask of callers of mem_cgroup_charge_xxx
[4/4] make memcg's page migration handler simple.

pushed out from memcg updates posted at 23/Oct. These are easy part.
all comments are applied (and spell check is done...)

They are against today's mmotm and tested on x86-64.

Thanks,
-Kame


2008-10-28 10:10:51

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 1/4][mmotm] cgroup: make cgroup config as submenu

Making CGROUP related configs to be sub-menu.

This patch will making CGROUP related configs to be sub-menu and
making 1st level configs of "General Setup" shorter.

including following additional changes
- add help comment about CGROUPS and GROUP_SCHED.
- moved MM_OWNER config to the bottom.
(for good indent in menuconfig)

Changelog: v1->v2
- applied comments and fixed text.
- added precise "See Documentation..."

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>


init/Kconfig | 123 ++++++++++++++++++++++++++++++++---------------------------
1 file changed, 67 insertions(+), 56 deletions(-)

Index: mmotm-2.6.28rc2+/init/Kconfig
===================================================================
--- mmotm-2.6.28rc2+.orig/init/Kconfig
+++ mmotm-2.6.28rc2+/init/Kconfig
@@ -271,59 +271,6 @@ config LOG_BUF_SHIFT
13 => 8 KB
12 => 4 KB

-config CGROUPS
- bool "Control Group support"
- help
- This option will let you use process cgroup subsystems
- such as Cpusets
-
- Say N if unsure.
-
-config CGROUP_DEBUG
- bool "Example debug cgroup subsystem"
- depends on CGROUPS
- default n
- help
- This option enables a simple cgroup subsystem that
- exports useful debugging information about the cgroups
- framework
-
- Say N if unsure
-
-config CGROUP_NS
- bool "Namespace cgroup subsystem"
- depends on CGROUPS
- help
- Provides a simple namespace cgroup subsystem to
- provide hierarchical naming of sets of namespaces,
- for instance virtual servers and checkpoint/restart
- jobs.
-
-config CGROUP_FREEZER
- bool "control group freezer subsystem"
- depends on CGROUPS
- help
- Provides a way to freeze and unfreeze all tasks in a
- cgroup.
-
-config CGROUP_DEVICE
- bool "Device controller for cgroups"
- depends on CGROUPS && EXPERIMENTAL
- help
- Provides a cgroup implementing whitelists for devices which
- a process in the cgroup can mknod or open.
-
-config CPUSETS
- bool "Cpuset support"
- depends on SMP && CGROUPS
- help
- This option will let you create and manage CPUSETs which
- allow dynamically partitioning a system into sets of CPUs and
- Memory Nodes and assigning tasks to run only within those sets.
- This is primarily useful on large SMP or NUMA systems.
-
- Say N if unsure.
-
#
# Architectures with an unreliable sched_clock() should select this:
#
@@ -337,6 +284,8 @@ config GROUP_SCHED
help
This feature lets CPU scheduler recognize task groups and control CPU
bandwidth allocation to such task groups.
+ In order to create a group from arbitrary set of processes, use
+ CONFIG_CGROUPS. (See Control Group support.)

config FAIR_GROUP_SCHED
bool "Group scheduling for SCHED_OTHER"
@@ -379,6 +328,66 @@ config CGROUP_SCHED

endchoice

+menu "Control Group support"
+config CGROUPS
+ bool "Control Group support"
+ help
+ This option add support for grouping sets of processes together, for
+ use with process control subsystems such as Cpusets, CFS, memory
+ controls or device isolation.
+ See
+ - Documentation/cpusets.txt (Cpusets)
+ - Documentation/scheduler/sched-design-CFS.txt (CFS)
+ - Documentation/cgroups/ (features for grouping, isolation)
+ - Documentation/controllers/ (features for resource control)
+
+ Say N if unsure.
+
+config CGROUP_DEBUG
+ bool "Example debug cgroup subsystem"
+ depends on CGROUPS
+ default n
+ help
+ This option enables a simple cgroup subsystem that
+ exports useful debugging information about the cgroups
+ framework
+
+ Say N if unsure
+
+config CGROUP_NS
+ bool "Namespace cgroup subsystem"
+ depends on CGROUPS
+ help
+ Provides a simple namespace cgroup subsystem to
+ provide hierarchical naming of sets of namespaces,
+ for instance virtual servers and checkpoint/restart
+ jobs.
+
+config CGROUP_FREEZER
+ bool "control group freezer subsystem"
+ depends on CGROUPS
+ help
+ Provides a way to freeze and unfreeze all tasks in a
+ cgroup.
+
+config CGROUP_DEVICE
+ bool "Device controller for cgroups"
+ depends on CGROUPS && EXPERIMENTAL
+ help
+ Provides a cgroup implementing whitelists for devices which
+ a process in the cgroup can mknod or open.
+
+config CPUSETS
+ bool "Cpuset support"
+ depends on SMP && CGROUPS
+ help
+ This option will let you create and manage CPUSETs which
+ allow dynamically partitioning a system into sets of CPUs and
+ Memory Nodes and assigning tasks to run only within those sets.
+ This is primarily useful on large SMP or NUMA systems.
+
+ Say N if unsure.
+
config CGROUP_CPUACCT
bool "Simple CPU accounting cgroup subsystem"
depends on CGROUPS
@@ -393,9 +402,6 @@ config RESOURCE_COUNTERS
infrastructure that works with cgroups
depends on CGROUPS

-config MM_OWNER
- bool
-
config CGROUP_MEM_RES_CTLR
bool "Memory Resource Controller for Control Groups"
depends on CGROUPS && RESOURCE_COUNTERS
@@ -419,6 +425,11 @@ config CGROUP_MEM_RES_CTLR
This config option also selects MM_OWNER config option, which
could in turn add some fork/exit overhead.

+config MM_OWNER
+ bool
+
+endmenu
+
config SYSFS_DEPRECATED
bool

2008-10-28 10:12:18

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 2/4][mmotm] memcg: introduce charge-commit-cancel style of functions.

There is a small race in do_swap_page(). When the page swapped-in is charged,
the mapcount can be greater than 0. But, at the same time some process (shares
it ) call unmap and make mapcount 1->0 and the page is uncharged.

CPUA CPUB
mapcount == 1.
(1) charge if mapcount==0 zap_pte_range()
(2) mapcount 1 => 0.
(3) uncharge(). (success)
(4) set page's rmap()
mapcount 0=>1

Then, this swap page's account is leaked.

For fixing this, I added a new interface.
- charge
account to res_counter by PAGE_SIZE and try to free pages if necessary.
- commit
register page_cgroup and add to LRU if necessary.
- cancel
uncharge PAGE_SIZE because of do_swap_page failure.


CPUA
(1) charge (always)
(2) set page's rmap (mapcount > 0)
(3) commit charge was necessary or not after set_pte().

This protocol uses PCG_USED bit on page_cgroup for avoiding over accounting.
Usual mem_cgroup_charge_common() does charge -> commit at a time.

And this patch also adds following function to clarify all charges.

- mem_cgroup_newpage_charge() ....replacement for mem_cgroup_charge()
called against newly allocated anon pages.

- mem_cgroup_charge_migrate_fixup()
called only from remove_migration_ptes().
we'll have to rewrite this later.(this patch just keeps old behavior)
This function will be removed by additional patch to make migration
clearer.

Good for clarify "what we does"

Then, we have 4 following charge points.
- newpage
- swap-in
- add-to-cache.
- migration.

Changelog v8 -> v9
- fixed text
Changelog v7 -> v8
- handles that try_charge() sets NULL to *memcg.

Changelog v5 -> v7:
- added newpage_charge() and migrate_fixup().
- renamed functions for swap-in from "swap" to "swapin"
- add more precise description.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

include/linux/memcontrol.h | 35 +++++++++-
mm/memcontrol.c | 155 ++++++++++++++++++++++++++++++++++++---------
mm/memory.c | 12 ++-
mm/migrate.c | 2
mm/swapfile.c | 6 +
5 files changed, 169 insertions(+), 41 deletions(-)

Index: mmotm-2.6.28rc2+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28rc2+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28rc2+/include/linux/memcontrol.h
@@ -27,8 +27,17 @@ struct mm_struct;

#ifdef CONFIG_CGROUP_MEM_RES_CTLR

-extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
+extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
+extern int mem_cgroup_charge_migrate_fixup(struct page *page,
+ struct mm_struct *mm, gfp_t gfp_mask);
+/* for swap handling */
+extern int mem_cgroup_try_charge(struct mm_struct *mm,
+ gfp_t gfp_mask, struct mem_cgroup **ptr);
+extern void mem_cgroup_commit_charge_swapin(struct page *page,
+ struct mem_cgroup *ptr);
+extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
+
extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
@@ -71,7 +80,9 @@ extern long mem_cgroup_calc_reclaim(stru


#else /* CONFIG_CGROUP_MEM_RES_CTLR */
-static inline int mem_cgroup_charge(struct page *page,
+struct mem_cgroup;
+
+static inline int mem_cgroup_newpage_charge(struct page *page,
struct mm_struct *mm, gfp_t gfp_mask)
{
return 0;
@@ -83,6 +94,26 @@ static inline int mem_cgroup_cache_charg
return 0;
}

+static inline int mem_cgroup_charge_migrate_fixup(struct page *page,
+ struct mm_struct *mm, gfp_t gfp_mask)
+{
+ return 0;
+}
+
+static int mem_cgroup_try_charge(struct mm_struct *mm,
+ gfp_t gfp_mask, struct mem_cgroup **ptr)
+{
+ return 0;
+}
+
+static void mem_cgroup_commit_charge_swapin(struct page *page,
+ struct mem_cgroup *ptr)
+{
+}
+static void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr)
+{
+}
+
static inline void mem_cgroup_uncharge_page(struct page *page)
{
}
Index: mmotm-2.6.28rc2+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28rc2+.orig/mm/memcontrol.c
+++ mmotm-2.6.28rc2+/mm/memcontrol.c
@@ -467,35 +467,31 @@ unsigned long mem_cgroup_isolate_pages(u
return nr_taken;
}

-/*
- * Charge the memory controller for page usage.
- * Return
- * 0 if the charge was successful
- * < 0 if the cgroup is over its limit
+
+/**
+ * mem_cgroup_try_charge - get charge of PAGE_SIZE.
+ * @mm: an mm_struct which is charged against. (when *memcg is NULL)
+ * @gfp_mask: gfp_mask for reclaim.
+ * @memcg: a pointer to memory cgroup which is charged against.
+ *
+ * charge against memory cgroup pointed by *memcg. if *memcg == NULL, estimated
+ * memory cgroup from @mm is got and stored in *memcg.
+ *
+ * Returns 0 if success. -ENOMEM at failure.
*/
-static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, enum charge_type ctype,
- struct mem_cgroup *memcg)
+
+int mem_cgroup_try_charge(struct mm_struct *mm,
+ gfp_t gfp_mask, struct mem_cgroup **memcg)
{
struct mem_cgroup *mem;
- struct page_cgroup *pc;
- unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
- struct mem_cgroup_per_zone *mz;
- unsigned long flags;
-
- pc = lookup_page_cgroup(page);
- /* can happen at boot */
- if (unlikely(!pc))
- return 0;
- prefetchw(pc);
+ int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
/*
* We always charge the cgroup the mm_struct belongs to.
* The mm_struct's mem_cgroup changes on task migration if the
* thread group leader migrates. It's possible that mm is not
* set, if so charge the init_mm (happens for pagecache usage).
*/
-
- if (likely(!memcg)) {
+ if (likely(!*memcg)) {
rcu_read_lock();
mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
if (unlikely(!mem)) {
@@ -506,15 +502,17 @@ static int mem_cgroup_charge_common(stru
* For every charge from the cgroup, increment reference count
*/
css_get(&mem->css);
+ *memcg = mem;
rcu_read_unlock();
} else {
- mem = memcg;
- css_get(&memcg->css);
+ mem = *memcg;
+ css_get(&mem->css);
}

+
while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
if (!(gfp_mask & __GFP_WAIT))
- goto out;
+ goto nomem;

if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
continue;
@@ -531,18 +529,37 @@ static int mem_cgroup_charge_common(stru

if (!nr_retries--) {
mem_cgroup_out_of_memory(mem, gfp_mask);
- goto out;
+ goto nomem;
}
}
+ return 0;
+nomem:
+ css_put(&mem->css);
+ return -ENOMEM;
+}
+
+/*
+ * commit a charge got by mem_cgroup_try_charge() and makes page_cgroup to be
+ * USED state. If already USED, uncharge and return.
+ */

+static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
+ struct page_cgroup *pc,
+ enum charge_type ctype)
+{
+ struct mem_cgroup_per_zone *mz;
+ unsigned long flags;
+
+ /* try_charge() can return NULL to *memcg, taking care of it. */
+ if (!mem)
+ return;

lock_page_cgroup(pc);
if (unlikely(PageCgroupUsed(pc))) {
unlock_page_cgroup(pc);
res_counter_uncharge(&mem->res, PAGE_SIZE);
css_put(&mem->css);
-
- goto done;
+ return;
}
pc->mem_cgroup = mem;
/*
@@ -557,15 +574,39 @@ static int mem_cgroup_charge_common(stru
__mem_cgroup_add_list(mz, pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
unlock_page_cgroup(pc);
+}
+
+/*
+ * Charge the memory controller for page usage.
+ * Return
+ * 0 if the charge was successful
+ * < 0 if the cgroup is over its limit
+ */
+static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
+ gfp_t gfp_mask, enum charge_type ctype,
+ struct mem_cgroup *memcg)
+{
+ struct mem_cgroup *mem;
+ struct page_cgroup *pc;
+ int ret;

-done:
+ pc = lookup_page_cgroup(page);
+ /* can happen at boot */
+ if (unlikely(!pc))
+ return 0;
+ prefetchw(pc);
+
+ mem = memcg;
+ ret = mem_cgroup_try_charge(mm, gfp_mask, &mem);
+ if (ret)
+ return ret;
+
+ __mem_cgroup_commit_charge(mem, pc, ctype);
return 0;
-out:
- css_put(&mem->css);
- return -ENOMEM;
}

-int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
+int mem_cgroup_newpage_charge(struct page *page,
+ struct mm_struct *mm, gfp_t gfp_mask)
{
if (mem_cgroup_subsys.disabled)
return 0;
@@ -586,6 +627,34 @@ int mem_cgroup_charge(struct page *page,
MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
}

+/*
+ * same as mem_cgroup_newpage_charge(), now.
+ * But what we assume is different from newpage, and this is special case.
+ * treat this in special function. easy for maintenance.
+ */
+
+int mem_cgroup_charge_migrate_fixup(struct page *page,
+ struct mm_struct *mm, gfp_t gfp_mask)
+{
+ if (mem_cgroup_subsys.disabled)
+ return 0;
+
+ if (PageCompound(page))
+ return 0;
+
+ if (page_mapped(page) || (page->mapping && !PageAnon(page)))
+ return 0;
+
+ if (unlikely(!mm))
+ mm = &init_mm;
+
+ return mem_cgroup_charge_common(page, mm, gfp_mask,
+ MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
+}
+
+
+
+
int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask)
{
@@ -628,6 +697,30 @@ int mem_cgroup_cache_charge(struct page
MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
}

+
+void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
+{
+ struct page_cgroup *pc;
+
+ if (mem_cgroup_subsys.disabled)
+ return;
+ if (!ptr)
+ return;
+ pc = lookup_page_cgroup(page);
+ __mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
+}
+
+void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
+{
+ if (mem_cgroup_subsys.disabled)
+ return;
+ if (!mem)
+ return;
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ css_put(&mem->css);
+}
+
+
/*
* uncharge if !page_mapped(page)
*/
Index: mmotm-2.6.28rc2+/mm/memory.c
===================================================================
--- mmotm-2.6.28rc2+.orig/mm/memory.c
+++ mmotm-2.6.28rc2+/mm/memory.c
@@ -1889,7 +1889,7 @@ gotten:
cow_user_page(new_page, old_page, address, vma);
__SetPageUptodate(new_page);

- if (mem_cgroup_charge(new_page, mm, GFP_KERNEL))
+ if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
goto oom_free_new;

/*
@@ -2285,6 +2285,7 @@ static int do_swap_page(struct mm_struct
struct page *page;
swp_entry_t entry;
pte_t pte;
+ struct mem_cgroup *ptr = NULL;
int ret = 0;

if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
@@ -2323,7 +2324,7 @@ static int do_swap_page(struct mm_struct
lock_page(page);
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);

- if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
+ if (mem_cgroup_try_charge(mm, GFP_KERNEL, &ptr) == -ENOMEM) {
ret = VM_FAULT_OOM;
unlock_page(page);
goto out;
@@ -2353,6 +2354,7 @@ static int do_swap_page(struct mm_struct
flush_icache_page(vma, page);
set_pte_at(mm, address, page_table, pte);
page_add_anon_rmap(page, vma, address);
+ mem_cgroup_commit_charge_swapin(page, ptr);

swap_free(entry);
if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
@@ -2373,7 +2375,7 @@ unlock:
out:
return ret;
out_nomap:
- mem_cgroup_uncharge_page(page);
+ mem_cgroup_cancel_charge_swapin(ptr);
pte_unmap_unlock(page_table, ptl);
unlock_page(page);
page_cache_release(page);
@@ -2403,7 +2405,7 @@ static int do_anonymous_page(struct mm_s
goto oom;
__SetPageUptodate(page);

- if (mem_cgroup_charge(page, mm, GFP_KERNEL))
+ if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))
goto oom_free_page;

entry = mk_pte(page, vma->vm_page_prot);
@@ -2496,7 +2498,7 @@ static int __do_fault(struct mm_struct *
ret = VM_FAULT_OOM;
goto out;
}
- if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
+ if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
ret = VM_FAULT_OOM;
page_cache_release(page);
goto out;
Index: mmotm-2.6.28rc2+/mm/migrate.c
===================================================================
--- mmotm-2.6.28rc2+.orig/mm/migrate.c
+++ mmotm-2.6.28rc2+/mm/migrate.c
@@ -133,7 +133,7 @@ static void remove_migration_pte(struct
* be reliable, and this charge can actually fail: oh well, we don't
* make the situation any worse by proceeding as if it had succeeded.
*/
- mem_cgroup_charge(new, mm, GFP_ATOMIC);
+ mem_cgroup_charge_migrate_fixup(new, mm, GFP_ATOMIC);

get_page(new);
pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
Index: mmotm-2.6.28rc2+/mm/swapfile.c
===================================================================
--- mmotm-2.6.28rc2+.orig/mm/swapfile.c
+++ mmotm-2.6.28rc2+/mm/swapfile.c
@@ -530,17 +530,18 @@ unsigned int count_swap_pages(int type,
static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, swp_entry_t entry, struct page *page)
{
+ struct mem_cgroup *ptr = NULL;
spinlock_t *ptl;
pte_t *pte;
int ret = 1;

- if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
+ if (mem_cgroup_try_charge(vma->vm_mm, GFP_KERNEL, &ptr))
ret = -ENOMEM;

pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
if (unlikely(!pte_same(*pte, swp_entry_to_pte(entry)))) {
if (ret > 0)
- mem_cgroup_uncharge_page(page);
+ mem_cgroup_cancel_charge_swapin(ptr);
ret = 0;
goto out;
}
@@ -550,6 +551,7 @@ static int unuse_pte(struct vm_area_stru
set_pte_at(vma->vm_mm, addr, pte,
pte_mkold(mk_pte(page, vma->vm_page_prot)));
page_add_anon_rmap(page, vma, addr);
+ mem_cgroup_commit_charge_swapin(page, ptr);
swap_free(entry);
/*
* Move the page to the active list so it is not

2008-10-28 10:15:31

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 3/4][mmotm] memcg: fix gfp_mask of callers of charge

fix misuse of gfp_kernel.

Now, most of callers of mem_cgroup_charge_xxx functions uses GFP_KERNEL.

I think that this is from the fact that page_cgroup *was* dynamically allocated.

But now, we allocate all page_cgroup at boot. And mem_cgroup_try_to_free_pages()
reclaim memory from GFP_HIGHUSER_MOVABLE + specified GFP_RECLAIM_MASK.
* This is because we just want to reduce memory usage.
"Where we should reclaim from ?" is not a problem in memcg.

This patch modifies gfp masks to be GFP_HIGUSER_MOVABLE if possible.
Note: This patch is not for fixing behavior but for showing sane information
in source code.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>


mm/memcontrol.c | 8 +++++---
mm/memory.c | 9 +++++----
mm/shmem.c | 6 +++---
mm/swapfile.c | 2 +-
4 files changed, 14 insertions(+), 11 deletions(-)

Index: mmotm-2.6.28rc2+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28rc2+.orig/mm/memcontrol.c
+++ mmotm-2.6.28rc2+/mm/memcontrol.c
@@ -808,8 +808,9 @@ int mem_cgroup_prepare_migration(struct
}
unlock_page_cgroup(pc);
if (mem) {
- ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
- ctype, mem);
+ ret = mem_cgroup_charge_common(newpage, NULL,
+ GFP_HIGHUSER_MOVABLE,
+ ctype, mem);
css_put(&mem->css);
}
return ret;
@@ -888,7 +889,8 @@ int mem_cgroup_resize_limit(struct mem_c
ret = -EBUSY;
break;
}
- progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL);
+ progress = try_to_free_mem_cgroup_pages(memcg,
+ GFP_HIGHUSER_MOVABLE);
if (!progress)
retry_count--;
}
Index: mmotm-2.6.28rc2+/mm/memory.c
===================================================================
--- mmotm-2.6.28rc2+.orig/mm/memory.c
+++ mmotm-2.6.28rc2+/mm/memory.c
@@ -1889,7 +1889,7 @@ gotten:
cow_user_page(new_page, old_page, address, vma);
__SetPageUptodate(new_page);

- if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
+ if (mem_cgroup_newpage_charge(new_page, mm, GFP_HIGHUSER_MOVABLE))
goto oom_free_new;

/*
@@ -2324,7 +2324,7 @@ static int do_swap_page(struct mm_struct
lock_page(page);
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);

- if (mem_cgroup_try_charge(mm, GFP_KERNEL, &ptr) == -ENOMEM) {
+ if (mem_cgroup_try_charge(mm, GFP_HIGHUSER_MOVABLE, &ptr) == -ENOMEM) {
ret = VM_FAULT_OOM;
unlock_page(page);
goto out;
@@ -2405,7 +2405,7 @@ static int do_anonymous_page(struct mm_s
goto oom;
__SetPageUptodate(page);

- if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))
+ if (mem_cgroup_newpage_charge(page, mm, GFP_HIGHUSER_MOVABLE))
goto oom_free_page;

entry = mk_pte(page, vma->vm_page_prot);
@@ -2498,7 +2498,8 @@ static int __do_fault(struct mm_struct *
ret = VM_FAULT_OOM;
goto out;
}
- if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
+ if (mem_cgroup_newpage_charge(page,
+ mm, GFP_HIGHUSER_MOVABLE)) {
ret = VM_FAULT_OOM;
page_cache_release(page);
goto out;
Index: mmotm-2.6.28rc2+/mm/swapfile.c
===================================================================
--- mmotm-2.6.28rc2+.orig/mm/swapfile.c
+++ mmotm-2.6.28rc2+/mm/swapfile.c
@@ -535,7 +535,7 @@ static int unuse_pte(struct vm_area_stru
pte_t *pte;
int ret = 1;

- if (mem_cgroup_try_charge(vma->vm_mm, GFP_KERNEL, &ptr))
+ if (mem_cgroup_try_charge(vma->vm_mm, GFP_HIGHUSER_MOVABLE, &ptr))
ret = -ENOMEM;

pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
Index: mmotm-2.6.28rc2+/mm/shmem.c
===================================================================
--- mmotm-2.6.28rc2+.orig/mm/shmem.c
+++ mmotm-2.6.28rc2+/mm/shmem.c
@@ -920,8 +920,8 @@ found:
error = 1;
if (!inode)
goto out;
- /* Precharge page using GFP_KERNEL while we can wait */
- error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
+ /* Charge page using GFP_HIGHUSER_MOVABLE while we can wait */
+ error = mem_cgroup_cache_charge(page, current->mm, GFP_HIGHUSER_MOVABLE);
if (error)
goto out;
error = radix_tree_preload(GFP_KERNEL);
@@ -1371,7 +1371,7 @@ repeat:

/* Precharge page while we can wait, compensate after */
error = mem_cgroup_cache_charge(filepage, current->mm,
- gfp & ~__GFP_HIGHMEM);
+ GFP_HIGHUSER_MOVABLE);
if (error) {
page_cache_release(filepage);
shmem_unacct_blocks(info->flags, 1);

2008-10-28 10:16:26

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 4/4][mmotm] memcg: simple migration handling

Now, management of "charge" under page migration is done under following
manner. (Assume migrate page contents from oldpage to newpage)

before
- "newpage" is charged before migration.
at success.
- "oldpage" is uncharged at somewhere(unmap, radix-tree-replace)
at failure
- "newpage" is uncharged.
- "oldpage" is charged if necessary (*1)

But (*1) is not reliable....because of GFP_ATOMIC.

This patch tries to change behavior as following by charge/commit/cancel ops.

before
- charge PAGE_SIZE (no target page)
success
- commit charge against "newpage".
failure
- commit charge against "oldpage".
(PCG_USED bit works effectively to avoid double-counting)
- if "oldpage" is obsolete, cancel charge of PAGE_SIZE.

Changelog: v8 -> v9
- fixed text.
Changelog: v7 -> v8
- fixed memcg==NULL case in migration handling.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

include/linux/memcontrol.h | 19 ++-----
mm/memcontrol.c | 108 +++++++++++++++++++++------------------------
mm/migrate.c | 42 +++++------------
3 files changed, 73 insertions(+), 96 deletions(-)

Index: mmotm-2.6.28rc2+/mm/migrate.c
===================================================================
--- mmotm-2.6.28rc2+.orig/mm/migrate.c
+++ mmotm-2.6.28rc2+/mm/migrate.c
@@ -121,20 +121,6 @@ static void remove_migration_pte(struct
if (!is_migration_entry(entry) || migration_entry_to_page(entry) != old)
goto out;

- /*
- * Yes, ignore the return value from a GFP_ATOMIC mem_cgroup_charge.
- * Failure is not an option here: we're now expected to remove every
- * migration pte, and will cause crashes otherwise. Normally this
- * is not an issue: mem_cgroup_prepare_migration bumped up the old
- * page_cgroup count for safety, that's now attached to the new page,
- * so this charge should just be another incrementation of the count,
- * to keep in balance with rmap.c's mem_cgroup_uncharging. But if
- * there's been a force_empty, those reference counts may no longer
- * be reliable, and this charge can actually fail: oh well, we don't
- * make the situation any worse by proceeding as if it had succeeded.
- */
- mem_cgroup_charge_migrate_fixup(new, mm, GFP_ATOMIC);
-
get_page(new);
pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
if (is_write_migration_entry(entry))
@@ -382,9 +368,6 @@ static void migrate_page_copy(struct pag
anon = PageAnon(page);
page->mapping = NULL;

- if (!anon) /* This page was removed from radix-tree. */
- mem_cgroup_uncharge_cache_page(page);
-
/*
* If any waiters have accumulated on the new page then
* wake them up.
@@ -621,6 +604,7 @@ static int unmap_and_move(new_page_t get
struct page *newpage = get_new_page(page, private, &result);
int rcu_locked = 0;
int charge = 0;
+ struct mem_cgroup *mem;

if (!newpage)
return -ENOMEM;
@@ -630,24 +614,26 @@ static int unmap_and_move(new_page_t get
goto move_newpage;
}

- charge = mem_cgroup_prepare_migration(page, newpage);
- if (charge == -ENOMEM) {
- rc = -ENOMEM;
- goto move_newpage;
- }
/* prepare cgroup just returns 0 or -ENOMEM */
- BUG_ON(charge);
-
rc = -EAGAIN;
+
if (!trylock_page(page)) {
if (!force)
goto move_newpage;
lock_page(page);
}

+ /* charge against new page */
+ charge = mem_cgroup_prepare_migration(page, &mem);
+ if (charge == -ENOMEM) {
+ rc = -ENOMEM;
+ goto unlock;
+ }
+ BUG_ON(charge);
+
if (PageWriteback(page)) {
if (!force)
- goto unlock;
+ goto uncharge;
wait_on_page_writeback(page);
}
/*
@@ -700,7 +686,9 @@ static int unmap_and_move(new_page_t get
rcu_unlock:
if (rcu_locked)
rcu_read_unlock();
-
+uncharge:
+ if (!charge)
+ mem_cgroup_end_migration(mem, page, newpage);
unlock:
unlock_page(page);

@@ -716,8 +704,6 @@ unlock:
}

move_newpage:
- if (!charge)
- mem_cgroup_end_migration(newpage);

/*
* Move the new page to the LRU. If migration was not successful
Index: mmotm-2.6.28rc2+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28rc2+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28rc2+/include/linux/memcontrol.h
@@ -29,8 +29,6 @@ struct mm_struct;

extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
-extern int mem_cgroup_charge_migrate_fixup(struct page *page,
- struct mm_struct *mm, gfp_t gfp_mask);
/* for swap handling */
extern int mem_cgroup_try_charge(struct mm_struct *mm,
gfp_t gfp_mask, struct mem_cgroup **ptr);
@@ -60,8 +58,9 @@ extern struct mem_cgroup *mem_cgroup_fro
((cgroup) == mem_cgroup_from_task((mm)->owner))

extern int
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage);
-extern void mem_cgroup_end_migration(struct page *page);
+mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr);
+extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
+ struct page *oldpage, struct page *newpage);

/*
* For memory reclaim.
@@ -94,12 +93,6 @@ static inline int mem_cgroup_cache_charg
return 0;
}

-static inline int mem_cgroup_charge_migrate_fixup(struct page *page,
- struct mm_struct *mm, gfp_t gfp_mask)
-{
- return 0;
-}
-
static int mem_cgroup_try_charge(struct mm_struct *mm,
gfp_t gfp_mask, struct mem_cgroup **ptr)
{
@@ -143,12 +136,14 @@ static inline int task_in_mem_cgroup(str
}

static inline int
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
+mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
{
return 0;
}

-static inline void mem_cgroup_end_migration(struct page *page)
+static inline void mem_cgroup_end_migration(struct mem_cgroup *mem,
+ struct page *oldpage,
+ struct page *newpage)
{
}

Index: mmotm-2.6.28rc2+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28rc2+.orig/mm/memcontrol.c
+++ mmotm-2.6.28rc2+/mm/memcontrol.c
@@ -627,34 +627,6 @@ int mem_cgroup_newpage_charge(struct pag
MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
}

-/*
- * same as mem_cgroup_newpage_charge(), now.
- * But what we assume is different from newpage, and this is special case.
- * treat this in special function. easy for maintenance.
- */
-
-int mem_cgroup_charge_migrate_fixup(struct page *page,
- struct mm_struct *mm, gfp_t gfp_mask)
-{
- if (mem_cgroup_subsys.disabled)
- return 0;
-
- if (PageCompound(page))
- return 0;
-
- if (page_mapped(page) || (page->mapping && !PageAnon(page)))
- return 0;
-
- if (unlikely(!mm))
- mm = &init_mm;
-
- return mem_cgroup_charge_common(page, mm, gfp_mask,
- MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
-}
-
-
-
-
int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask)
{
@@ -697,7 +669,6 @@ int mem_cgroup_cache_charge(struct page
MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
}

-
void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
{
struct page_cgroup *pc;
@@ -782,13 +753,13 @@ void mem_cgroup_uncharge_cache_page(stru
}

/*
- * Before starting migration, account against new page.
+ * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
+ * page belongs to.
*/
-int mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
+int mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
{
struct page_cgroup *pc;
struct mem_cgroup *mem = NULL;
- enum charge_type ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
int ret = 0;

if (mem_cgroup_subsys.disabled)
@@ -799,42 +770,67 @@ int mem_cgroup_prepare_migration(struct
if (PageCgroupUsed(pc)) {
mem = pc->mem_cgroup;
css_get(&mem->css);
- if (PageCgroupCache(pc)) {
- if (page_is_file_cache(page))
- ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
- else
- ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
- }
}
unlock_page_cgroup(pc);
+
if (mem) {
- ret = mem_cgroup_charge_common(newpage, NULL,
- GFP_HIGHUSER_MOVABLE,
- ctype, mem);
+ ret = mem_cgroup_try_charge(NULL, GFP_HIGHUSER_MOVABLE, &mem);
css_put(&mem->css);
}
+ *ptr = mem;
return ret;
}

/* remove redundant charge if migration failed*/
-void mem_cgroup_end_migration(struct page *newpage)
+void mem_cgroup_end_migration(struct mem_cgroup *mem,
+ struct page *oldpage, struct page *newpage)
{
+ struct page *target, *unused;
+ struct page_cgroup *pc;
+ enum charge_type ctype;
+
+ if (!mem)
+ return;
+
+ /* at migration success, oldpage->mapping is NULL. */
+ if (oldpage->mapping) {
+ target = oldpage;
+ unused = NULL;
+ } else {
+ target = newpage;
+ unused = oldpage;
+ }
+
+ if (PageAnon(target))
+ ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
+ else if (page_is_file_cache(target))
+ ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
+ else
+ ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
+
+ /* unused page is not on radix-tree now. */
+ if (unused && ctype != MEM_CGROUP_CHARGE_TYPE_MAPPED)
+ __mem_cgroup_uncharge_common(unused, ctype);
+
+ pc = lookup_page_cgroup(target);
/*
- * At success, page->mapping is not NULL.
- * special rollback care is necessary when
- * 1. at migration failure. (newpage->mapping is cleared in this case)
- * 2. the newpage was moved but not remapped again because the task
- * exits and the newpage is obsolete. In this case, the new page
- * may be a swapcache. So, we just call mem_cgroup_uncharge_page()
- * always for avoiding mess. The page_cgroup will be removed if
- * unnecessary. File cache pages is still on radix-tree. Don't
- * care it.
+ * __mem_cgroup_commit_charge() check PCG_USED bit of page_cgroup.
+ * So, double-counting is effectively avoided.
+ */
+ __mem_cgroup_commit_charge(mem, pc, ctype);
+
+ /*
+ * Both of oldpage and newpage are still under lock_page().
+ * Then, we don't have to care about race in radix-tree.
+ * But we have to be careful that this page is unmapped or not.
+ *
+ * There is a case for !page_mapped(). At the start of
+ * migration, oldpage was mapped. But now, it's zapped.
+ * But we know *target* page is not freed/reused under us.
+ * mem_cgroup_uncharge_page() does all necessary checks.
*/
- if (!newpage->mapping)
- __mem_cgroup_uncharge_common(newpage,
- MEM_CGROUP_CHARGE_TYPE_FORCE);
- else if (PageAnon(newpage))
- mem_cgroup_uncharge_page(newpage);
+ if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
+ mem_cgroup_uncharge_page(target);
}

/*

2008-10-28 11:17:19

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [PATCH 1/4][mmotm] cgroup: make cgroup config as submenu

On Tue, 28 Oct 2008 19:10:08 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> Making CGROUP related configs to be sub-menu.
>
> This patch will making CGROUP related configs to be sub-menu and
> making 1st level configs of "General Setup" shorter.
>
> including following additional changes
> - add help comment about CGROUPS and GROUP_SCHED.
> - moved MM_OWNER config to the bottom.
> (for good indent in menuconfig)
>
> Changelog: v1->v2
> - applied comments and fixed text.
> - added precise "See Documentation..."
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
Reviewed-by: Daisuke Nishimura <[email protected]>

>
> init/Kconfig | 123 ++++++++++++++++++++++++++++++++---------------------------
> 1 file changed, 67 insertions(+), 56 deletions(-)
>
> Index: mmotm-2.6.28rc2+/init/Kconfig
> ===================================================================
> --- mmotm-2.6.28rc2+.orig/init/Kconfig
> +++ mmotm-2.6.28rc2+/init/Kconfig
> @@ -271,59 +271,6 @@ config LOG_BUF_SHIFT
> 13 => 8 KB
> 12 => 4 KB
>
> -config CGROUPS
> - bool "Control Group support"
> - help
> - This option will let you use process cgroup subsystems
> - such as Cpusets
> -
> - Say N if unsure.
> -
> -config CGROUP_DEBUG
> - bool "Example debug cgroup subsystem"
> - depends on CGROUPS
> - default n
> - help
> - This option enables a simple cgroup subsystem that
> - exports useful debugging information about the cgroups
> - framework
> -
> - Say N if unsure
> -
> -config CGROUP_NS
> - bool "Namespace cgroup subsystem"
> - depends on CGROUPS
> - help
> - Provides a simple namespace cgroup subsystem to
> - provide hierarchical naming of sets of namespaces,
> - for instance virtual servers and checkpoint/restart
> - jobs.
> -
> -config CGROUP_FREEZER
> - bool "control group freezer subsystem"
> - depends on CGROUPS
> - help
> - Provides a way to freeze and unfreeze all tasks in a
> - cgroup.
> -
> -config CGROUP_DEVICE
> - bool "Device controller for cgroups"
> - depends on CGROUPS && EXPERIMENTAL
> - help
> - Provides a cgroup implementing whitelists for devices which
> - a process in the cgroup can mknod or open.
> -
> -config CPUSETS
> - bool "Cpuset support"
> - depends on SMP && CGROUPS
> - help
> - This option will let you create and manage CPUSETs which
> - allow dynamically partitioning a system into sets of CPUs and
> - Memory Nodes and assigning tasks to run only within those sets.
> - This is primarily useful on large SMP or NUMA systems.
> -
> - Say N if unsure.
> -
> #
> # Architectures with an unreliable sched_clock() should select this:
> #
> @@ -337,6 +284,8 @@ config GROUP_SCHED
> help
> This feature lets CPU scheduler recognize task groups and control CPU
> bandwidth allocation to such task groups.
> + In order to create a group from arbitrary set of processes, use
> + CONFIG_CGROUPS. (See Control Group support.)
>
> config FAIR_GROUP_SCHED
> bool "Group scheduling for SCHED_OTHER"
> @@ -379,6 +328,66 @@ config CGROUP_SCHED
>
> endchoice
>
> +menu "Control Group support"
> +config CGROUPS
> + bool "Control Group support"
> + help
> + This option add support for grouping sets of processes together, for
> + use with process control subsystems such as Cpusets, CFS, memory
> + controls or device isolation.
> + See
> + - Documentation/cpusets.txt (Cpusets)
> + - Documentation/scheduler/sched-design-CFS.txt (CFS)
> + - Documentation/cgroups/ (features for grouping, isolation)
> + - Documentation/controllers/ (features for resource control)
> +
> + Say N if unsure.
> +
> +config CGROUP_DEBUG
> + bool "Example debug cgroup subsystem"
> + depends on CGROUPS
> + default n
> + help
> + This option enables a simple cgroup subsystem that
> + exports useful debugging information about the cgroups
> + framework
> +
> + Say N if unsure
> +
> +config CGROUP_NS
> + bool "Namespace cgroup subsystem"
> + depends on CGROUPS
> + help
> + Provides a simple namespace cgroup subsystem to
> + provide hierarchical naming of sets of namespaces,
> + for instance virtual servers and checkpoint/restart
> + jobs.
> +
> +config CGROUP_FREEZER
> + bool "control group freezer subsystem"
> + depends on CGROUPS
> + help
> + Provides a way to freeze and unfreeze all tasks in a
> + cgroup.
> +
> +config CGROUP_DEVICE
> + bool "Device controller for cgroups"
> + depends on CGROUPS && EXPERIMENTAL
> + help
> + Provides a cgroup implementing whitelists for devices which
> + a process in the cgroup can mknod or open.
> +
> +config CPUSETS
> + bool "Cpuset support"
> + depends on SMP && CGROUPS
> + help
> + This option will let you create and manage CPUSETs which
> + allow dynamically partitioning a system into sets of CPUs and
> + Memory Nodes and assigning tasks to run only within those sets.
> + This is primarily useful on large SMP or NUMA systems.
> +
> + Say N if unsure.
> +
> config CGROUP_CPUACCT
> bool "Simple CPU accounting cgroup subsystem"
> depends on CGROUPS
> @@ -393,9 +402,6 @@ config RESOURCE_COUNTERS
> infrastructure that works with cgroups
> depends on CGROUPS
>
> -config MM_OWNER
> - bool
> -
> config CGROUP_MEM_RES_CTLR
> bool "Memory Resource Controller for Control Groups"
> depends on CGROUPS && RESOURCE_COUNTERS
> @@ -419,6 +425,11 @@ config CGROUP_MEM_RES_CTLR
> This config option also selects MM_OWNER config option, which
> could in turn add some fork/exit overhead.
>
> +config MM_OWNER
> + bool
> +
> +endmenu
> +
> config SYSFS_DEPRECATED
> bool
>
>

2008-10-28 12:08:18

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [PATCH 2/4][mmotm] memcg: introduce charge-commit-cancel style of functions.

On Tue, 28 Oct 2008 19:11:25 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> There is a small race in do_swap_page(). When the page swapped-in is charged,
> the mapcount can be greater than 0. But, at the same time some process (shares
> it ) call unmap and make mapcount 1->0 and the page is uncharged.
>
> CPUA CPUB
> mapcount == 1.
> (1) charge if mapcount==0 zap_pte_range()
> (2) mapcount 1 => 0.
> (3) uncharge(). (success)
> (4) set page's rmap()
> mapcount 0=>1
>
> Then, this swap page's account is leaked.
>
> For fixing this, I added a new interface.
> - charge
> account to res_counter by PAGE_SIZE and try to free pages if necessary.
> - commit
> register page_cgroup and add to LRU if necessary.
> - cancel
> uncharge PAGE_SIZE because of do_swap_page failure.
>
>
> CPUA
> (1) charge (always)
> (2) set page's rmap (mapcount > 0)
> (3) commit charge was necessary or not after set_pte().
>
> This protocol uses PCG_USED bit on page_cgroup for avoiding over accounting.
> Usual mem_cgroup_charge_common() does charge -> commit at a time.
>
> And this patch also adds following function to clarify all charges.
>
> - mem_cgroup_newpage_charge() ....replacement for mem_cgroup_charge()
> called against newly allocated anon pages.
>
> - mem_cgroup_charge_migrate_fixup()
> called only from remove_migration_ptes().
> we'll have to rewrite this later.(this patch just keeps old behavior)
> This function will be removed by additional patch to make migration
> clearer.
>
> Good for clarify "what we does"
>
> Then, we have 4 following charge points.
> - newpage
> - swap-in
> - add-to-cache.
> - migration.
>
> Changelog v8 -> v9
> - fixed text
> Changelog v7 -> v8
> - handles that try_charge() sets NULL to *memcg.
>
> Changelog v5 -> v7:
> - added newpage_charge() and migrate_fixup().
> - renamed functions for swap-in from "swap" to "swapin"
> - add more precise description.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
Looks good to me.

Reviewed-by: Daisuke Nishimura <[email protected]>


> include/linux/memcontrol.h | 35 +++++++++-
> mm/memcontrol.c | 155 ++++++++++++++++++++++++++++++++++++---------
> mm/memory.c | 12 ++-
> mm/migrate.c | 2
> mm/swapfile.c | 6 +
> 5 files changed, 169 insertions(+), 41 deletions(-)
>
> Index: mmotm-2.6.28rc2+/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-2.6.28rc2+.orig/include/linux/memcontrol.h
> +++ mmotm-2.6.28rc2+/include/linux/memcontrol.h
> @@ -27,8 +27,17 @@ struct mm_struct;
>
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>
> -extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
> +extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> gfp_t gfp_mask);
> +extern int mem_cgroup_charge_migrate_fixup(struct page *page,
> + struct mm_struct *mm, gfp_t gfp_mask);
> +/* for swap handling */
> +extern int mem_cgroup_try_charge(struct mm_struct *mm,
> + gfp_t gfp_mask, struct mem_cgroup **ptr);
> +extern void mem_cgroup_commit_charge_swapin(struct page *page,
> + struct mem_cgroup *ptr);
> +extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
> +
> extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> gfp_t gfp_mask);
> extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
> @@ -71,7 +80,9 @@ extern long mem_cgroup_calc_reclaim(stru
>
>
> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> -static inline int mem_cgroup_charge(struct page *page,
> +struct mem_cgroup;
> +
> +static inline int mem_cgroup_newpage_charge(struct page *page,
> struct mm_struct *mm, gfp_t gfp_mask)
> {
> return 0;
> @@ -83,6 +94,26 @@ static inline int mem_cgroup_cache_charg
> return 0;
> }
>
> +static inline int mem_cgroup_charge_migrate_fixup(struct page *page,
> + struct mm_struct *mm, gfp_t gfp_mask)
> +{
> + return 0;
> +}
> +
> +static int mem_cgroup_try_charge(struct mm_struct *mm,
> + gfp_t gfp_mask, struct mem_cgroup **ptr)
> +{
> + return 0;
> +}
> +
> +static void mem_cgroup_commit_charge_swapin(struct page *page,
> + struct mem_cgroup *ptr)
> +{
> +}
> +static void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr)
> +{
> +}
> +
> static inline void mem_cgroup_uncharge_page(struct page *page)
> {
> }
> Index: mmotm-2.6.28rc2+/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.28rc2+.orig/mm/memcontrol.c
> +++ mmotm-2.6.28rc2+/mm/memcontrol.c
> @@ -467,35 +467,31 @@ unsigned long mem_cgroup_isolate_pages(u
> return nr_taken;
> }
>
> -/*
> - * Charge the memory controller for page usage.
> - * Return
> - * 0 if the charge was successful
> - * < 0 if the cgroup is over its limit
> +
> +/**
> + * mem_cgroup_try_charge - get charge of PAGE_SIZE.
> + * @mm: an mm_struct which is charged against. (when *memcg is NULL)
> + * @gfp_mask: gfp_mask for reclaim.
> + * @memcg: a pointer to memory cgroup which is charged against.
> + *
> + * charge against memory cgroup pointed by *memcg. if *memcg == NULL, estimated
> + * memory cgroup from @mm is got and stored in *memcg.
> + *
> + * Returns 0 if success. -ENOMEM at failure.
> */
> -static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
> - gfp_t gfp_mask, enum charge_type ctype,
> - struct mem_cgroup *memcg)
> +
> +int mem_cgroup_try_charge(struct mm_struct *mm,
> + gfp_t gfp_mask, struct mem_cgroup **memcg)
> {
> struct mem_cgroup *mem;
> - struct page_cgroup *pc;
> - unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> - struct mem_cgroup_per_zone *mz;
> - unsigned long flags;
> -
> - pc = lookup_page_cgroup(page);
> - /* can happen at boot */
> - if (unlikely(!pc))
> - return 0;
> - prefetchw(pc);
> + int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> /*
> * We always charge the cgroup the mm_struct belongs to.
> * The mm_struct's mem_cgroup changes on task migration if the
> * thread group leader migrates. It's possible that mm is not
> * set, if so charge the init_mm (happens for pagecache usage).
> */
> -
> - if (likely(!memcg)) {
> + if (likely(!*memcg)) {
> rcu_read_lock();
> mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
> if (unlikely(!mem)) {
> @@ -506,15 +502,17 @@ static int mem_cgroup_charge_common(stru
> * For every charge from the cgroup, increment reference count
> */
> css_get(&mem->css);
> + *memcg = mem;
> rcu_read_unlock();
> } else {
> - mem = memcg;
> - css_get(&memcg->css);
> + mem = *memcg;
> + css_get(&mem->css);
> }
>
> +
> while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
> if (!(gfp_mask & __GFP_WAIT))
> - goto out;
> + goto nomem;
>
> if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
> continue;
> @@ -531,18 +529,37 @@ static int mem_cgroup_charge_common(stru
>
> if (!nr_retries--) {
> mem_cgroup_out_of_memory(mem, gfp_mask);
> - goto out;
> + goto nomem;
> }
> }
> + return 0;
> +nomem:
> + css_put(&mem->css);
> + return -ENOMEM;
> +}
> +
> +/*
> + * commit a charge got by mem_cgroup_try_charge() and makes page_cgroup to be
> + * USED state. If already USED, uncharge and return.
> + */
>
> +static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
> + struct page_cgroup *pc,
> + enum charge_type ctype)
> +{
> + struct mem_cgroup_per_zone *mz;
> + unsigned long flags;
> +
> + /* try_charge() can return NULL to *memcg, taking care of it. */
> + if (!mem)
> + return;
>
> lock_page_cgroup(pc);
> if (unlikely(PageCgroupUsed(pc))) {
> unlock_page_cgroup(pc);
> res_counter_uncharge(&mem->res, PAGE_SIZE);
> css_put(&mem->css);
> -
> - goto done;
> + return;
> }
> pc->mem_cgroup = mem;
> /*
> @@ -557,15 +574,39 @@ static int mem_cgroup_charge_common(stru
> __mem_cgroup_add_list(mz, pc);
> spin_unlock_irqrestore(&mz->lru_lock, flags);
> unlock_page_cgroup(pc);
> +}
> +
> +/*
> + * Charge the memory controller for page usage.
> + * Return
> + * 0 if the charge was successful
> + * < 0 if the cgroup is over its limit
> + */
> +static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
> + gfp_t gfp_mask, enum charge_type ctype,
> + struct mem_cgroup *memcg)
> +{
> + struct mem_cgroup *mem;
> + struct page_cgroup *pc;
> + int ret;
>
> -done:
> + pc = lookup_page_cgroup(page);
> + /* can happen at boot */
> + if (unlikely(!pc))
> + return 0;
> + prefetchw(pc);
> +
> + mem = memcg;
> + ret = mem_cgroup_try_charge(mm, gfp_mask, &mem);
> + if (ret)
> + return ret;
> +
> + __mem_cgroup_commit_charge(mem, pc, ctype);
> return 0;
> -out:
> - css_put(&mem->css);
> - return -ENOMEM;
> }
>
> -int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> +int mem_cgroup_newpage_charge(struct page *page,
> + struct mm_struct *mm, gfp_t gfp_mask)
> {
> if (mem_cgroup_subsys.disabled)
> return 0;
> @@ -586,6 +627,34 @@ int mem_cgroup_charge(struct page *page,
> MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
> }
>
> +/*
> + * same as mem_cgroup_newpage_charge(), now.
> + * But what we assume is different from newpage, and this is special case.
> + * treat this in special function. easy for maintenance.
> + */
> +
> +int mem_cgroup_charge_migrate_fixup(struct page *page,
> + struct mm_struct *mm, gfp_t gfp_mask)
> +{
> + if (mem_cgroup_subsys.disabled)
> + return 0;
> +
> + if (PageCompound(page))
> + return 0;
> +
> + if (page_mapped(page) || (page->mapping && !PageAnon(page)))
> + return 0;
> +
> + if (unlikely(!mm))
> + mm = &init_mm;
> +
> + return mem_cgroup_charge_common(page, mm, gfp_mask,
> + MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
> +}
> +
> +
> +
> +
> int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> gfp_t gfp_mask)
> {
> @@ -628,6 +697,30 @@ int mem_cgroup_cache_charge(struct page
> MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
> }
>
> +
> +void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> +{
> + struct page_cgroup *pc;
> +
> + if (mem_cgroup_subsys.disabled)
> + return;
> + if (!ptr)
> + return;
> + pc = lookup_page_cgroup(page);
> + __mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> +}
> +
> +void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
> +{
> + if (mem_cgroup_subsys.disabled)
> + return;
> + if (!mem)
> + return;
> + res_counter_uncharge(&mem->res, PAGE_SIZE);
> + css_put(&mem->css);
> +}
> +
> +
> /*
> * uncharge if !page_mapped(page)
> */
> Index: mmotm-2.6.28rc2+/mm/memory.c
> ===================================================================
> --- mmotm-2.6.28rc2+.orig/mm/memory.c
> +++ mmotm-2.6.28rc2+/mm/memory.c
> @@ -1889,7 +1889,7 @@ gotten:
> cow_user_page(new_page, old_page, address, vma);
> __SetPageUptodate(new_page);
>
> - if (mem_cgroup_charge(new_page, mm, GFP_KERNEL))
> + if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
> goto oom_free_new;
>
> /*
> @@ -2285,6 +2285,7 @@ static int do_swap_page(struct mm_struct
> struct page *page;
> swp_entry_t entry;
> pte_t pte;
> + struct mem_cgroup *ptr = NULL;
> int ret = 0;
>
> if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
> @@ -2323,7 +2324,7 @@ static int do_swap_page(struct mm_struct
> lock_page(page);
> delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
>
> - if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
> + if (mem_cgroup_try_charge(mm, GFP_KERNEL, &ptr) == -ENOMEM) {
> ret = VM_FAULT_OOM;
> unlock_page(page);
> goto out;
> @@ -2353,6 +2354,7 @@ static int do_swap_page(struct mm_struct
> flush_icache_page(vma, page);
> set_pte_at(mm, address, page_table, pte);
> page_add_anon_rmap(page, vma, address);
> + mem_cgroup_commit_charge_swapin(page, ptr);
>
> swap_free(entry);
> if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> @@ -2373,7 +2375,7 @@ unlock:
> out:
> return ret;
> out_nomap:
> - mem_cgroup_uncharge_page(page);
> + mem_cgroup_cancel_charge_swapin(ptr);
> pte_unmap_unlock(page_table, ptl);
> unlock_page(page);
> page_cache_release(page);
> @@ -2403,7 +2405,7 @@ static int do_anonymous_page(struct mm_s
> goto oom;
> __SetPageUptodate(page);
>
> - if (mem_cgroup_charge(page, mm, GFP_KERNEL))
> + if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))
> goto oom_free_page;
>
> entry = mk_pte(page, vma->vm_page_prot);
> @@ -2496,7 +2498,7 @@ static int __do_fault(struct mm_struct *
> ret = VM_FAULT_OOM;
> goto out;
> }
> - if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
> + if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
> ret = VM_FAULT_OOM;
> page_cache_release(page);
> goto out;
> Index: mmotm-2.6.28rc2+/mm/migrate.c
> ===================================================================
> --- mmotm-2.6.28rc2+.orig/mm/migrate.c
> +++ mmotm-2.6.28rc2+/mm/migrate.c
> @@ -133,7 +133,7 @@ static void remove_migration_pte(struct
> * be reliable, and this charge can actually fail: oh well, we don't
> * make the situation any worse by proceeding as if it had succeeded.
> */
> - mem_cgroup_charge(new, mm, GFP_ATOMIC);
> + mem_cgroup_charge_migrate_fixup(new, mm, GFP_ATOMIC);
>
> get_page(new);
> pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
> Index: mmotm-2.6.28rc2+/mm/swapfile.c
> ===================================================================
> --- mmotm-2.6.28rc2+.orig/mm/swapfile.c
> +++ mmotm-2.6.28rc2+/mm/swapfile.c
> @@ -530,17 +530,18 @@ unsigned int count_swap_pages(int type,
> static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long addr, swp_entry_t entry, struct page *page)
> {
> + struct mem_cgroup *ptr = NULL;
> spinlock_t *ptl;
> pte_t *pte;
> int ret = 1;
>
> - if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
> + if (mem_cgroup_try_charge(vma->vm_mm, GFP_KERNEL, &ptr))
> ret = -ENOMEM;
>
> pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> if (unlikely(!pte_same(*pte, swp_entry_to_pte(entry)))) {
> if (ret > 0)
> - mem_cgroup_uncharge_page(page);
> + mem_cgroup_cancel_charge_swapin(ptr);
> ret = 0;
> goto out;
> }
> @@ -550,6 +551,7 @@ static int unuse_pte(struct vm_area_stru
> set_pte_at(vma->vm_mm, addr, pte,
> pte_mkold(mk_pte(page, vma->vm_page_prot)));
> page_add_anon_rmap(page, vma, addr);
> + mem_cgroup_commit_charge_swapin(page, ptr);
> swap_free(entry);
> /*
> * Move the page to the active list so it is not
>

2008-10-28 12:10:00

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [PATCH 3/4][mmotm] memcg: fix gfp_mask of callers of charge

On Tue, 28 Oct 2008 19:14:49 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> fix misuse of gfp_kernel.
>
> Now, most of callers of mem_cgroup_charge_xxx functions uses GFP_KERNEL.
>
> I think that this is from the fact that page_cgroup *was* dynamically allocated.
>
> But now, we allocate all page_cgroup at boot. And mem_cgroup_try_to_free_pages()
> reclaim memory from GFP_HIGHUSER_MOVABLE + specified GFP_RECLAIM_MASK.
> * This is because we just want to reduce memory usage.
> "Where we should reclaim from ?" is not a problem in memcg.
>
> This patch modifies gfp masks to be GFP_HIGUSER_MOVABLE if possible.
> Note: This patch is not for fixing behavior but for showing sane information
> in source code.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
>
> mm/memcontrol.c | 8 +++++---
> mm/memory.c | 9 +++++----
> mm/shmem.c | 6 +++---
> mm/swapfile.c | 2 +-
> 4 files changed, 14 insertions(+), 11 deletions(-)
>
> Index: mmotm-2.6.28rc2+/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.28rc2+.orig/mm/memcontrol.c
> +++ mmotm-2.6.28rc2+/mm/memcontrol.c
> @@ -808,8 +808,9 @@ int mem_cgroup_prepare_migration(struct
> }
> unlock_page_cgroup(pc);
> if (mem) {
> - ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
> - ctype, mem);
> + ret = mem_cgroup_charge_common(newpage, NULL,
> + GFP_HIGHUSER_MOVABLE,
> + ctype, mem);
> css_put(&mem->css);
> }
> return ret;
> @@ -888,7 +889,8 @@ int mem_cgroup_resize_limit(struct mem_c
> ret = -EBUSY;
> break;
> }
> - progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL);
> + progress = try_to_free_mem_cgroup_pages(memcg,
> + GFP_HIGHUSER_MOVABLE);
> if (!progress)
> retry_count--;
> }
> Index: mmotm-2.6.28rc2+/mm/memory.c
> ===================================================================
> --- mmotm-2.6.28rc2+.orig/mm/memory.c
> +++ mmotm-2.6.28rc2+/mm/memory.c
> @@ -1889,7 +1889,7 @@ gotten:
> cow_user_page(new_page, old_page, address, vma);
> __SetPageUptodate(new_page);
>
> - if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
> + if (mem_cgroup_newpage_charge(new_page, mm, GFP_HIGHUSER_MOVABLE))
> goto oom_free_new;
>
> /*
> @@ -2324,7 +2324,7 @@ static int do_swap_page(struct mm_struct
> lock_page(page);
> delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
>
> - if (mem_cgroup_try_charge(mm, GFP_KERNEL, &ptr) == -ENOMEM) {
> + if (mem_cgroup_try_charge(mm, GFP_HIGHUSER_MOVABLE, &ptr) == -ENOMEM) {
> ret = VM_FAULT_OOM;
> unlock_page(page);
> goto out;
> @@ -2405,7 +2405,7 @@ static int do_anonymous_page(struct mm_s
> goto oom;
> __SetPageUptodate(page);
>
> - if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))
> + if (mem_cgroup_newpage_charge(page, mm, GFP_HIGHUSER_MOVABLE))
> goto oom_free_page;
>
> entry = mk_pte(page, vma->vm_page_prot);
> @@ -2498,7 +2498,8 @@ static int __do_fault(struct mm_struct *
> ret = VM_FAULT_OOM;
> goto out;
> }
> - if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
> + if (mem_cgroup_newpage_charge(page,
> + mm, GFP_HIGHUSER_MOVABLE)) {
> ret = VM_FAULT_OOM;
> page_cache_release(page);
> goto out;
> Index: mmotm-2.6.28rc2+/mm/swapfile.c
> ===================================================================
> --- mmotm-2.6.28rc2+.orig/mm/swapfile.c
> +++ mmotm-2.6.28rc2+/mm/swapfile.c
> @@ -535,7 +535,7 @@ static int unuse_pte(struct vm_area_stru
> pte_t *pte;
> int ret = 1;
>
> - if (mem_cgroup_try_charge(vma->vm_mm, GFP_KERNEL, &ptr))
> + if (mem_cgroup_try_charge(vma->vm_mm, GFP_HIGHUSER_MOVABLE, &ptr))
> ret = -ENOMEM;
>
> pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> Index: mmotm-2.6.28rc2+/mm/shmem.c
> ===================================================================
> --- mmotm-2.6.28rc2+.orig/mm/shmem.c
> +++ mmotm-2.6.28rc2+/mm/shmem.c
> @@ -920,8 +920,8 @@ found:
> error = 1;
> if (!inode)
> goto out;
> - /* Precharge page using GFP_KERNEL while we can wait */
> - error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
> + /* Charge page using GFP_HIGHUSER_MOVABLE while we can wait */
> + error = mem_cgroup_cache_charge(page, current->mm, GFP_HIGHUSER_MOVABLE);
This line exceeds 80 characters.

Looks good to me except for that.

Reviewed-by: Daisuke Nishimura <[email protected]>


> if (error)
> goto out;
> error = radix_tree_preload(GFP_KERNEL);
> @@ -1371,7 +1371,7 @@ repeat:
>
> /* Precharge page while we can wait, compensate after */
> error = mem_cgroup_cache_charge(filepage, current->mm,
> - gfp & ~__GFP_HIGHMEM);
> + GFP_HIGHUSER_MOVABLE);
> if (error) {
> page_cache_release(filepage);
> shmem_unacct_blocks(info->flags, 1);
>

2008-10-28 12:47:19

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [PATCH 4/4][mmotm] memcg: simple migration handling

On Tue, 28 Oct 2008 19:15:32 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> Now, management of "charge" under page migration is done under following
> manner. (Assume migrate page contents from oldpage to newpage)
>
> before
> - "newpage" is charged before migration.
> at success.
> - "oldpage" is uncharged at somewhere(unmap, radix-tree-replace)
> at failure
> - "newpage" is uncharged.
> - "oldpage" is charged if necessary (*1)
>
> But (*1) is not reliable....because of GFP_ATOMIC.
>
> This patch tries to change behavior as following by charge/commit/cancel ops.
>
> before
> - charge PAGE_SIZE (no target page)
> success
> - commit charge against "newpage".
> failure
> - commit charge against "oldpage".
> (PCG_USED bit works effectively to avoid double-counting)
> - if "oldpage" is obsolete, cancel charge of PAGE_SIZE.
>
> Changelog: v8 -> v9
> - fixed text.
> Changelog: v7 -> v8
> - fixed memcg==NULL case in migration handling.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
Reviewed-by: Daisuke Nishimura <[email protected]>

I tested previous version of these patches of course, I'm now
testing this version just in case.


Thanks,
Daisuke Nishimura.

> include/linux/memcontrol.h | 19 ++-----
> mm/memcontrol.c | 108 +++++++++++++++++++++------------------------
> mm/migrate.c | 42 +++++------------
> 3 files changed, 73 insertions(+), 96 deletions(-)
>
> Index: mmotm-2.6.28rc2+/mm/migrate.c
> ===================================================================
> --- mmotm-2.6.28rc2+.orig/mm/migrate.c
> +++ mmotm-2.6.28rc2+/mm/migrate.c
> @@ -121,20 +121,6 @@ static void remove_migration_pte(struct
> if (!is_migration_entry(entry) || migration_entry_to_page(entry) != old)
> goto out;
>
> - /*
> - * Yes, ignore the return value from a GFP_ATOMIC mem_cgroup_charge.
> - * Failure is not an option here: we're now expected to remove every
> - * migration pte, and will cause crashes otherwise. Normally this
> - * is not an issue: mem_cgroup_prepare_migration bumped up the old
> - * page_cgroup count for safety, that's now attached to the new page,
> - * so this charge should just be another incrementation of the count,
> - * to keep in balance with rmap.c's mem_cgroup_uncharging. But if
> - * there's been a force_empty, those reference counts may no longer
> - * be reliable, and this charge can actually fail: oh well, we don't
> - * make the situation any worse by proceeding as if it had succeeded.
> - */
> - mem_cgroup_charge_migrate_fixup(new, mm, GFP_ATOMIC);
> -
> get_page(new);
> pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
> if (is_write_migration_entry(entry))
> @@ -382,9 +368,6 @@ static void migrate_page_copy(struct pag
> anon = PageAnon(page);
> page->mapping = NULL;
>
> - if (!anon) /* This page was removed from radix-tree. */
> - mem_cgroup_uncharge_cache_page(page);
> -
> /*
> * If any waiters have accumulated on the new page then
> * wake them up.
> @@ -621,6 +604,7 @@ static int unmap_and_move(new_page_t get
> struct page *newpage = get_new_page(page, private, &result);
> int rcu_locked = 0;
> int charge = 0;
> + struct mem_cgroup *mem;
>
> if (!newpage)
> return -ENOMEM;
> @@ -630,24 +614,26 @@ static int unmap_and_move(new_page_t get
> goto move_newpage;
> }
>
> - charge = mem_cgroup_prepare_migration(page, newpage);
> - if (charge == -ENOMEM) {
> - rc = -ENOMEM;
> - goto move_newpage;
> - }
> /* prepare cgroup just returns 0 or -ENOMEM */
> - BUG_ON(charge);
> -
> rc = -EAGAIN;
> +
> if (!trylock_page(page)) {
> if (!force)
> goto move_newpage;
> lock_page(page);
> }
>
> + /* charge against new page */
> + charge = mem_cgroup_prepare_migration(page, &mem);
> + if (charge == -ENOMEM) {
> + rc = -ENOMEM;
> + goto unlock;
> + }
> + BUG_ON(charge);
> +
> if (PageWriteback(page)) {
> if (!force)
> - goto unlock;
> + goto uncharge;
> wait_on_page_writeback(page);
> }
> /*
> @@ -700,7 +686,9 @@ static int unmap_and_move(new_page_t get
> rcu_unlock:
> if (rcu_locked)
> rcu_read_unlock();
> -
> +uncharge:
> + if (!charge)
> + mem_cgroup_end_migration(mem, page, newpage);
> unlock:
> unlock_page(page);
>
> @@ -716,8 +704,6 @@ unlock:
> }
>
> move_newpage:
> - if (!charge)
> - mem_cgroup_end_migration(newpage);
>
> /*
> * Move the new page to the LRU. If migration was not successful
> Index: mmotm-2.6.28rc2+/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-2.6.28rc2+.orig/include/linux/memcontrol.h
> +++ mmotm-2.6.28rc2+/include/linux/memcontrol.h
> @@ -29,8 +29,6 @@ struct mm_struct;
>
> extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> gfp_t gfp_mask);
> -extern int mem_cgroup_charge_migrate_fixup(struct page *page,
> - struct mm_struct *mm, gfp_t gfp_mask);
> /* for swap handling */
> extern int mem_cgroup_try_charge(struct mm_struct *mm,
> gfp_t gfp_mask, struct mem_cgroup **ptr);
> @@ -60,8 +58,9 @@ extern struct mem_cgroup *mem_cgroup_fro
> ((cgroup) == mem_cgroup_from_task((mm)->owner))
>
> extern int
> -mem_cgroup_prepare_migration(struct page *page, struct page *newpage);
> -extern void mem_cgroup_end_migration(struct page *page);
> +mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr);
> +extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
> + struct page *oldpage, struct page *newpage);
>
> /*
> * For memory reclaim.
> @@ -94,12 +93,6 @@ static inline int mem_cgroup_cache_charg
> return 0;
> }
>
> -static inline int mem_cgroup_charge_migrate_fixup(struct page *page,
> - struct mm_struct *mm, gfp_t gfp_mask)
> -{
> - return 0;
> -}
> -
> static int mem_cgroup_try_charge(struct mm_struct *mm,
> gfp_t gfp_mask, struct mem_cgroup **ptr)
> {
> @@ -143,12 +136,14 @@ static inline int task_in_mem_cgroup(str
> }
>
> static inline int
> -mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
> +mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
> {
> return 0;
> }
>
> -static inline void mem_cgroup_end_migration(struct page *page)
> +static inline void mem_cgroup_end_migration(struct mem_cgroup *mem,
> + struct page *oldpage,
> + struct page *newpage)
> {
> }
>
> Index: mmotm-2.6.28rc2+/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.28rc2+.orig/mm/memcontrol.c
> +++ mmotm-2.6.28rc2+/mm/memcontrol.c
> @@ -627,34 +627,6 @@ int mem_cgroup_newpage_charge(struct pag
> MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
> }
>
> -/*
> - * same as mem_cgroup_newpage_charge(), now.
> - * But what we assume is different from newpage, and this is special case.
> - * treat this in special function. easy for maintenance.
> - */
> -
> -int mem_cgroup_charge_migrate_fixup(struct page *page,
> - struct mm_struct *mm, gfp_t gfp_mask)
> -{
> - if (mem_cgroup_subsys.disabled)
> - return 0;
> -
> - if (PageCompound(page))
> - return 0;
> -
> - if (page_mapped(page) || (page->mapping && !PageAnon(page)))
> - return 0;
> -
> - if (unlikely(!mm))
> - mm = &init_mm;
> -
> - return mem_cgroup_charge_common(page, mm, gfp_mask,
> - MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
> -}
> -
> -
> -
> -
> int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> gfp_t gfp_mask)
> {
> @@ -697,7 +669,6 @@ int mem_cgroup_cache_charge(struct page
> MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
> }
>
> -
> void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> {
> struct page_cgroup *pc;
> @@ -782,13 +753,13 @@ void mem_cgroup_uncharge_cache_page(stru
> }
>
> /*
> - * Before starting migration, account against new page.
> + * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
> + * page belongs to.
> */
> -int mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
> +int mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
> {
> struct page_cgroup *pc;
> struct mem_cgroup *mem = NULL;
> - enum charge_type ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
> int ret = 0;
>
> if (mem_cgroup_subsys.disabled)
> @@ -799,42 +770,67 @@ int mem_cgroup_prepare_migration(struct
> if (PageCgroupUsed(pc)) {
> mem = pc->mem_cgroup;
> css_get(&mem->css);
> - if (PageCgroupCache(pc)) {
> - if (page_is_file_cache(page))
> - ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
> - else
> - ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
> - }
> }
> unlock_page_cgroup(pc);
> +
> if (mem) {
> - ret = mem_cgroup_charge_common(newpage, NULL,
> - GFP_HIGHUSER_MOVABLE,
> - ctype, mem);
> + ret = mem_cgroup_try_charge(NULL, GFP_HIGHUSER_MOVABLE, &mem);
> css_put(&mem->css);
> }
> + *ptr = mem;
> return ret;
> }
>
> /* remove redundant charge if migration failed*/
> -void mem_cgroup_end_migration(struct page *newpage)
> +void mem_cgroup_end_migration(struct mem_cgroup *mem,
> + struct page *oldpage, struct page *newpage)
> {
> + struct page *target, *unused;
> + struct page_cgroup *pc;
> + enum charge_type ctype;
> +
> + if (!mem)
> + return;
> +
> + /* at migration success, oldpage->mapping is NULL. */
> + if (oldpage->mapping) {
> + target = oldpage;
> + unused = NULL;
> + } else {
> + target = newpage;
> + unused = oldpage;
> + }
> +
> + if (PageAnon(target))
> + ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
> + else if (page_is_file_cache(target))
> + ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
> + else
> + ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
> +
> + /* unused page is not on radix-tree now. */
> + if (unused && ctype != MEM_CGROUP_CHARGE_TYPE_MAPPED)
> + __mem_cgroup_uncharge_common(unused, ctype);
> +
> + pc = lookup_page_cgroup(target);
> /*
> - * At success, page->mapping is not NULL.
> - * special rollback care is necessary when
> - * 1. at migration failure. (newpage->mapping is cleared in this case)
> - * 2. the newpage was moved but not remapped again because the task
> - * exits and the newpage is obsolete. In this case, the new page
> - * may be a swapcache. So, we just call mem_cgroup_uncharge_page()
> - * always for avoiding mess. The page_cgroup will be removed if
> - * unnecessary. File cache pages is still on radix-tree. Don't
> - * care it.
> + * __mem_cgroup_commit_charge() check PCG_USED bit of page_cgroup.
> + * So, double-counting is effectively avoided.
> + */
> + __mem_cgroup_commit_charge(mem, pc, ctype);
> +
> + /*
> + * Both of oldpage and newpage are still under lock_page().
> + * Then, we don't have to care about race in radix-tree.
> + * But we have to be careful that this page is unmapped or not.
> + *
> + * There is a case for !page_mapped(). At the start of
> + * migration, oldpage was mapped. But now, it's zapped.
> + * But we know *target* page is not freed/reused under us.
> + * mem_cgroup_uncharge_page() does all necessary checks.
> */
> - if (!newpage->mapping)
> - __mem_cgroup_uncharge_common(newpage,
> - MEM_CGROUP_CHARGE_TYPE_FORCE);
> - else if (PageAnon(newpage))
> - mem_cgroup_uncharge_page(newpage);
> + if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
> + mem_cgroup_uncharge_page(target);
> }
>
> /*
>