LinuxLists.cc - [RFC][PATCH 0/6] memcg updates (12/Nov/2008)

2008-11-12 03:26:57

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: [RFC][PATCH 0/6] memcg updates (12/Nov/2008)

Weekly updates on my queue.

Changes from previous (05/Nov)
- added "free all at rmdir" patch.
- fixed several bugs reported by Nishimura (Thanks!)
- many style bugs are fixed.

Brief description:
[1/6].. free all at rmdir (and add attribute to memcg.)
[2/6].. handle swap cache
[3/6].. mem+swap controller kconfig
[4/6].. swap_cgroup
[5/6].. mem+swap controller
[6/6].. synchrinized LRU (unify lru lock.)

I think it's near to a month to test this mem+swap controller internally.
It's getting better. Making progress in step by step works good.

I'll send [1/6] and [2/6] to Andrew, tomorrow or weekend.(please do final check).

If no acks to [1/6] (I haven't got any ;), I'll postpone it and reschedule as [7/6].

Thanks,
-Kame

2008-11-12 03:27:47

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: [RFC][PATCH 1/6] memcg: free all at rmdir

By memcg-move-all-accounts-to-parent-at-rmdir.patch, there is no leak
of memory usage. But obviously, there is an another choice.

This patch adds "free all at rmdir" and its attribute to memcg.

This memory.attribute file allows following to set/clear attribute.
#echo attribute option > memory.attribute

This patch implements an attribute

# on_rmdir [keep | drop] > memory.attribute.

When on_rmdir=keep, memory remaining in memcg will be moved up to parent
at rmdir. When on_rmdir=drop, memory remaining in memcg will be freed.

Characteristic of Keep.
- Doesn't cause unnecessary freeing of memory(page cache).
(IOW. page-cache for temporal files or some unnecessary pages will be kept.)
Characteristic of Drop.
- maybe do necessary write-back.
- All page caches and RSSs will be dropped.
(IOW. page-cache for libc or some important pages will be dropped.)
- If a patch which cannot be freed (mlocked etc..) is found, use move
logic and move it to the parent.

This patch also adds break for signal in the loop of force_empty.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Changelog v0->v1:
- added congestion wait to try_to_free_page() failure path of force_empty.
(If not, we'll see -EBUSY at rmdir at removing small cgroup.)

Documentation/controllers/memory.txt | 35 +++++++-
mm/memcontrol.c | 152 ++++++++++++++++++++++++++++++++++-
2 files changed, 180 insertions(+), 7 deletions(-)

Index: mmotm-2.6.28-Nov10/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Nov10/mm/memcontrol.c
@@ -30,6 +30,7 @@
#include <linux/swap.h>
#include <linux/spinlock.h>
#include <linux/fs.h>
+#include <linux/ctype.h>
#include <linux/seq_file.h>
#include <linux/vmalloc.h>
#include <linux/mm_inline.h>
@@ -132,6 +133,11 @@ struct mem_cgroup {
* statistics.
*/
struct mem_cgroup_stat stat;
+ /*
+ * attributes.
+ * on_rmdir ....0=free all 1=move all.
+ */
+ char on_rmdir;
};
static struct mem_cgroup init_mem_cgroup;

@@ -157,6 +163,22 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
0, /* FORCE */
};

+
+/*
+ * attribute for memcg default value comes from its parent.
+ * the root set all to false.
+ */
+enum {
+ MEMCG_ATTR_ON_RMDIR, /* drop_all if true, default is true. */
+ MEMCG_LAST_ATTR,
+};
+/* we may have to check status under racy situation. use global mutex. */
+DEFINE_MUTEX(memcg_attr_mutex);
+
+static char *memcg_attribute_names[MEMCG_LAST_ATTR] = {
+ "on_rmdir",
+};
+
/*
* Always modified under lru lock. Then, not necessary to preempt_disable()
*/
@@ -1072,11 +1094,17 @@ static int mem_cgroup_force_empty(struct

shrink = 0;
move_account:
+ /* should free all ? */
+ if (!mem->on_rmdir)
+ goto try_to_free;
+
while (mem->res.usage > 0) {
ret = -EBUSY;
if (atomic_read(&mem->css.cgroup->count) > 0)
goto out;
-
+ ret = -EINTR;
+ if (signal_pending(current))
+ goto out;
/* This is for making all *used* pages to be on LRU. */
lru_add_drain_all();
ret = 0;
@@ -1111,14 +1139,24 @@ try_to_free:
ret = -EBUSY;
goto out;
}
+ /* we call try-to-free pages for make this cgroup empty */
+ lru_add_drain_all();
/* try to free all pages in this cgroup */
shrink = 1;
while (nr_retries && mem->res.usage > 0) {
int progress;
+
+ if (signal_pending(current)) {
+ ret = -EINTR;
+ goto out;
+ }
progress = try_to_free_mem_cgroup_pages(mem,
GFP_HIGHUSER_MOVABLE);
- if (!progress)
+ if (!progress) {
nr_retries--;
+ /* maybe some writeback is necessary */
+ congestion_wait(WRITE, HZ/10);
+ }

}
/* try move_account...there may be some *locked* pages. */
@@ -1225,6 +1263,103 @@ static int mem_control_stat_show(struct
return 0;
}

+
+/*
+ * Expected Usage:
+ * #echo attribute [option] > memory.feature
+ */
+static int
+parse_attr_option(char *buffer, char **attr, char **option, char **end)
+{
+ char *c = buffer;
+
+ *attr = NULL;
+ *option = NULL;
+ /* skip white space */
+ while (*c && isspace(*c))
+ c++;
+ /* found NULL ? */
+ if (!*c)
+ return -EINVAL;
+ *attr = c;
+ /* skip attribute */
+ while (*c && !isspace(*c))
+ c++;
+ /* skip space */
+ while (*c && isspace(*c))
+ c++;
+ /* pass pointer to option */
+ *option = c;
+ while (*c)
+ c++;
+ *end = c;
+ return 0;
+
+}
+
+static int mem_cgroup_write_attr(struct cgroup *cont,
+ struct cftype *cft,
+ const char *buffer)
+{
+ int i, len;
+ char *attr, *option, *end;
+ int ret = -EINVAL;
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+
+ mutex_lock(&memcg_attr_mutex);
+ /* parse attribute option */
+ ret = parse_attr_option((char *)buffer, &attr, &option, &end);
+ if (ret)
+ goto out;
+
+ for (i = 0; i < MEMCG_LAST_ATTR; i++) {
+
+ len = strlen(memcg_attribute_names[i]);
+ if ((option - attr) < len)
+ continue;
+ if (!strncmp(memcg_attribute_names[i], attr, len))
+ break;
+ }
+
+ ret = -EINVAL;
+ if (i == MEMCG_LAST_ATTR)
+ goto out;
+ switch (i) {
+ case MEMCG_ATTR_ON_RMDIR:
+ if ((end - option) < 4)
+ break;
+ ret = 0;
+ if (strncmp(option, "keep", 4) == 0)
+ mem->on_rmdir = 1;
+ else if (strncmp(option, "drop", 4) == 0)
+ mem->on_rmdir = 0;
+ else
+ ret = -EINVAL;
+ break;
+ }
+out:
+ mutex_unlock(&memcg_attr_mutex);
+ return ret;
+}
+
+
+static int mem_cgroup_read_attr(struct cgroup *cont, struct cftype *cft,
+ struct seq_file *m)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+ char *s;
+
+ s = memcg_attribute_names[MEMCG_ATTR_ON_RMDIR];
+
+ if (!mem->on_rmdir)
+ seq_printf(m, "%s drop\n", s);
+ else
+ seq_printf(m, "%s keep\n", s);
+
+ return 0;
+}
+
+
static struct cftype mem_cgroup_files[] = {
{
.name = "usage_in_bytes",
@@ -1253,6 +1388,11 @@ static struct cftype mem_cgroup_files[]
.name = "stat",
.read_map = mem_control_stat_show,
},
+ {
+ .name = "attribute",
+ .write_string = mem_cgroup_write_attr,
+ .read_seq_string = mem_cgroup_read_attr,
+ },
};

static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
@@ -1318,18 +1458,24 @@ static void mem_cgroup_free(struct mem_c
static struct cgroup_subsys_state *
mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
{
- struct mem_cgroup *mem;
+ struct mem_cgroup *mem, *parent;
int node;

+ parent = NULL;
if (unlikely((cont->parent) == NULL)) {
mem = &init_mem_cgroup;
+ /* do this in explicit way. */
+ mem->on_rmdir = 0;
} else {
mem = mem_cgroup_alloc();
if (!mem)
return ERR_PTR(-ENOMEM);
+ parent = mem_cgroup_from_cont(cont->parent);
}

res_counter_init(&mem->res);
+ if (parent)
+ mem->on_rmdir = parent->on_rmdir;

for_each_node_state(node, N_POSSIBLE)
if (alloc_mem_cgroup_per_zone_info(mem, node))
Index: mmotm-2.6.28-Nov10/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.28-Nov10.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.28-Nov10/Documentation/controllers/memory.txt
@@ -237,11 +237,38 @@ reclaimed.
A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
cgroup might have some charge associated with it, even though all
tasks have migrated away from it.
-Such charges are moved to its parent as much as possible and freed if parent
-is full. Both of RSS and CACHES are moved to parent.
-If both of them are busy, rmdir() returns -EBUSY.
+Such charges are freed(at default) or moved to its parent. When moved,
+both of RSS and CACHES are moved to parent.
+If both of them are busy, rmdir() returns -EBUSY. See 5.1 Also.

-5. TODO
+5. Attributes.
+
+memory controller has some of attributes for customizing behavior.
+You can specify attribute by
+#echo attribute option > memory.attribute
+
+To see current value, read the file.
+#cat memory.attribute
+
+5.1 on_rmdir
+set behavior of memcg at rmdir (Removing cgroup) default is "drop".
+
+5.1.1 drop
+ #echo on_rmdir drop > memory.attribute
+ This is default. All pages on the memcg will be freed.
+ If pages are locked or too busy, they will be moved up to the parent.
+ Useful when you want to drop (large) page caches used in this memcg.
+ But some of in-use page cache can be dropped by this.
+
+5.1.2 keep
+ #echo on_rmdir keep > memory.attribute
+ All pages on the memcg will be moved to its parent.
+ Useful when you don't want to drop page caches used in this memcg.
+ You can keep page caches from some library or DB accessed by this
+ memcg on memory.
+
+
+6. TODO

1. Add support for accounting huge pages (as a separate controller)
2. Make per-cgroup scanner reclaim not-shared pages first

2008-11-12 03:28:24

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: [RFC][PATCH 2/6] memcg: account swapcache

SwapCache support for memory resource controller (memcg)

Before mem+swap controller, memcg itself should handle SwapCache in proper way.
This is cut-out from it.

In current memcg, SwapCache is just leaked and the user can create tons of
SwapCache. This is a leak of account and should be handled.

SwapCache accounting is done as following.

charge (anon)
- charged when it's mapped.
(because of readahead, charge at add_to_swap_cache() is not sane)
uncharge (anon)
- uncharged when it's dropped from swapcache and fully unmapped.
means it's not uncharged at unmap.
Note: delete from swap cache at swap-in is done after rmap information
is established.
charge (shmem)
- charged at swap-in. this prevents charge at add_to_page_cache().

uncharge (shmem)
- uncharged when it's dropped from swapcache and not on shmem's
radix-tree.

at migration, check against 'old page' is modified to handle shmem.

Comparing to the old version discussed (and caused troubles), we have
advantages of
- PCG_USED bit.
- simple migrating handling.

So, situation is much easier than several months ago, maybe.

Reviewed-by: Daisuke Nishimura <[email protected]>
Tested-by: Daisuke Nishimura <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

---
Documentation/controllers/memory.txt | 5 ++
include/linux/swap.h | 16 ++++++++
mm/memcontrol.c | 67 +++++++++++++++++++++++++++++++----
mm/shmem.c | 18 ++++++++-
mm/swap_state.c | 1
5 files changed, 99 insertions(+), 8 deletions(-)

Index: mmotm-2.6.28-Nov10/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Nov10/mm/memcontrol.c
@@ -21,6 +21,7 @@
#include <linux/memcontrol.h>
#include <linux/cgroup.h>
#include <linux/mm.h>
+#include <linux/pagemap.h>
#include <linux/smp.h>
#include <linux/page-flags.h>
#include <linux/backing-dev.h>
@@ -146,6 +147,7 @@ enum charge_type {
MEM_CGROUP_CHARGE_TYPE_MAPPED,
MEM_CGROUP_CHARGE_TYPE_SHMEM, /* used by page migration of shmem */
MEM_CGROUP_CHARGE_TYPE_FORCE, /* used by force_empty */
+ MEM_CGROUP_CHARGE_TYPE_SWAPOUT, /* for accounting swapcache */
NR_CHARGE_TYPE,
};

@@ -803,6 +805,33 @@ int mem_cgroup_cache_charge(struct page
MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
}

+#ifdef CONFIG_SWAP
+int mem_cgroup_cache_charge_swapin(struct page *page,
+ struct mm_struct *mm, gfp_t mask, bool locked)
+{
+ int ret = 0;
+
+ if (mem_cgroup_subsys.disabled)
+ return 0;
+ if (unlikely(!mm))
+ mm = &init_mm;
+ if (!locked)
+ lock_page(page);
+ /*
+ * If not locked, the page can be dropped from SwapCache until
+ * we reach here.
+ */
+ if (PageSwapCache(page)) {
+ ret = mem_cgroup_charge_common(page, mm, mask,
+ MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
+ }
+ if (!locked)
+ unlock_page(page);
+
+ return ret;
+}
+#endif
+
void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
{
struct page_cgroup *pc;
@@ -840,6 +869,9 @@ __mem_cgroup_uncharge_common(struct page
if (mem_cgroup_subsys.disabled)
return;

+ if (PageSwapCache(page))
+ return;
+
/*
* Check if our page_cgroup is valid
*/
@@ -848,12 +880,26 @@ __mem_cgroup_uncharge_common(struct page
return;

lock_page_cgroup(pc);
- if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED && page_mapped(page))
- || !PageCgroupUsed(pc)) {
- /* This happens at race in zap_pte_range() and do_swap_page()*/
- unlock_page_cgroup(pc);
- return;
+
+ if (!PageCgroupUsed(pc))
+ goto unlock_out;
+
+ switch (ctype) {
+ case MEM_CGROUP_CHARGE_TYPE_MAPPED:
+ if (page_mapped(page))
+ goto unlock_out;
+ break;
+ case MEM_CGROUP_CHARGE_TYPE_SWAPOUT:
+ if (!PageAnon(page)) { /* Shared memory */
+ if (page->mapping && !page_is_file_cache(page))
+ goto unlock_out;
+ } else if (page_mapped(page)) /* Anon */
+ goto unlock_out;
+ break;
+ default:
+ break;
}
+
ClearPageCgroupUsed(pc);
mem = pc->mem_cgroup;

@@ -867,6 +913,10 @@ __mem_cgroup_uncharge_common(struct page
css_put(&mem->css);

return;
+
+unlock_out:
+ unlock_page_cgroup(pc);
+ return;
}

void mem_cgroup_uncharge_page(struct page *page)
@@ -886,6 +936,11 @@ void mem_cgroup_uncharge_cache_page(stru
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
}

+void mem_cgroup_uncharge_swapcache(struct page *page)
+{
+ __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
+}
+
/*
* Before starting migration, account PAGE_SIZE to mem_cgroup that the old
* page belongs to.
@@ -943,7 +998,7 @@ void mem_cgroup_end_migration(struct mem
ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;

/* unused page is not on radix-tree now. */
- if (unused && ctype != MEM_CGROUP_CHARGE_TYPE_MAPPED)
+ if (unused)
__mem_cgroup_uncharge_common(unused, ctype);

pc = lookup_page_cgroup(target);
Index: mmotm-2.6.28-Nov10/mm/swap_state.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/swap_state.c
+++ mmotm-2.6.28-Nov10/mm/swap_state.c
@@ -119,6 +119,7 @@ void __delete_from_swap_cache(struct pag
total_swapcache_pages--;
__dec_zone_page_state(page, NR_FILE_PAGES);
INC_CACHE_INFO(del_total);
+ mem_cgroup_uncharge_swapcache(page);
}

/**
Index: mmotm-2.6.28-Nov10/include/linux/swap.h
===================================================================
--- mmotm-2.6.28-Nov10.orig/include/linux/swap.h
+++ mmotm-2.6.28-Nov10/include/linux/swap.h
@@ -335,6 +335,22 @@ static inline void disable_swap_token(vo
put_swap_token(swap_token_mm);
}

+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+extern int mem_cgroup_cache_charge_swapin(struct page *page,
+ struct mm_struct *mm, gfp_t mask, bool locked);
+extern void mem_cgroup_uncharge_swapcache(struct page *page);
+#else
+static inline
+int mem_cgroup_cache_charge_swapin(struct page *page,
+ struct mm_struct *mm, gfp_t mask, bool locked)
+{
+ return 0;
+}
+static inline void mem_cgroup_uncharge_swapcache(struct page *page)
+{
+}
+#endif
+
#else /* CONFIG_SWAP */

#define total_swap_pages 0
Index: mmotm-2.6.28-Nov10/mm/shmem.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/shmem.c
+++ mmotm-2.6.28-Nov10/mm/shmem.c
@@ -920,8 +920,12 @@ found:
error = 1;
if (!inode)
goto out;
- /* Charge page using GFP_HIGHUSER_MOVABLE while we can wait */
- error = mem_cgroup_cache_charge(page, current->mm, GFP_HIGHUSER_MOVABLE);
+ /*
+ * Charge page using GFP_HIGHUSER_MOVABLE while we can wait.
+ * charged back to the user(not to caller) when swap account is used.
+ */
+ error = mem_cgroup_cache_charge_swapin(page,
+ current->mm, GFP_HIGHUSER_MOVABLE, true);
if (error)
goto out;
error = radix_tree_preload(GFP_KERNEL);
@@ -1258,6 +1262,16 @@ repeat:
goto repeat;
}
wait_on_page_locked(swappage);
+ /*
+ * We want to avoid charge at add_to_page_cache().
+ * charge against this swap cache here.
+ */
+ if (mem_cgroup_cache_charge_swapin(swappage,
+ current->mm, gfp, false)) {
+ page_cache_release(swappage);
+ error = -ENOMEM;
+ goto failed;
+ }
page_cache_release(swappage);
goto repeat;
}
Index: mmotm-2.6.28-Nov10/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.28-Nov10.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.28-Nov10/Documentation/controllers/memory.txt
@@ -137,6 +137,11 @@ behind this approach is that a cgroup th
page will eventually get charged for it (once it is uncharged from
the cgroup that brought it in -- this will happen on memory pressure).

+Exception: When you do swapoff and make swapped-out pages of shmem(tmpfs) to
+be backed into memory in force, charges for pages are accounted against the
+caller of swapoff rather than the users of shmem.
+
+
2.4 Reclaim

Each cgroup maintains a per cgroup LRU that consists of an active

2008-11-12 03:29:22

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: [RFC][PATCH 3/6] memcg: mem+swap controller kconfig

Config and control variable for mem+swap controller.

This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP
(memory resource controller swap extension.)

For accounting swap, it's obvious that we have to use additional memory
to remember "who uses swap". This adds more overhead.
So, it's better to offer "choice" to users. This patch adds 2 choices.

This patch adds 2 parameters to enable swap extension or not.
- CONFIG
- boot option

Changelog: v2 -> v3
- adjusted to avoid HUNK.

Changelog: v1 -> v2
- fixed typo.
- make default value of "do_swap_account" to be 0 and turned on 1
later if configured.

Reviewed-by: Daisuke Nishimura <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Documentation/kernel-parameters.txt | 3 +++
include/linux/memcontrol.h | 3 +++
init/Kconfig | 17 +++++++++++++++++
mm/memcontrol.c | 32 ++++++++++++++++++++++++++++++++
4 files changed, 55 insertions(+)

Index: mmotm-2.6.28-Nov10/init/Kconfig
===================================================================
--- mmotm-2.6.28-Nov10.orig/init/Kconfig
+++ mmotm-2.6.28-Nov10/init/Kconfig
@@ -428,6 +428,23 @@ config CGROUP_MEM_RES_CTLR
config MM_OWNER
bool

+config CGROUP_MEM_RES_CTLR_SWAP
+ bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)"
+ depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
+ help
+ Add swap management feature to memory resource controller. When you
+ enable this, you can limit mem+swap usage per cgroup. In other words,
+ when you disable this, memory resource controller has no cares to
+ usage of swap...a process can exhaust all of the swap. This extension
+ is useful when you want to avoid exhaustion swap but this itself
+ adds more overheads and consumes memory for remembering information.
+ Especially if you use 32bit system or small memory system, please
+ be careful about enabling this. When memory resource controller
+ is disabled by boot option, this will be automatically disabled and
+ there will be no overhead from this. Even when you set this config=y,
+ if boot option "noswapaccount" is set, swap will not be accounted.
+
+
endmenu

config SYSFS_DEPRECATED
Index: mmotm-2.6.28-Nov10/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Nov10/mm/memcontrol.c
@@ -42,6 +42,15 @@
struct cgroup_subsys mem_cgroup_subsys __read_mostly;
#define MEM_CGROUP_RECLAIM_RETRIES 5

+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+/* Turned on only when memory cgroup is enabled && really_do_swap_account = 0 */
+int do_swap_account __read_mostly;
+static int really_do_swap_account __initdata = 1; /* for remember boot option*/
+#else
+#define do_swap_account (0)
+#endif
+
+
/*
* Statistics for memory cgroup.
*/
@@ -1510,6 +1519,18 @@ static void mem_cgroup_free(struct mem_c
}

+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+static void __init enable_swap_cgroup(void)
+{
+ if (!mem_cgroup_subsys.disabled && really_do_swap_account)
+ do_swap_account = 1;
+}
+#else
+static void __init enable_swap_cgroup(void)
+{
+}
+#endif
+
static struct cgroup_subsys_state *
mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
{
@@ -1521,6 +1542,7 @@ mem_cgroup_create(struct cgroup_subsys *
mem = &init_mem_cgroup;
/* do this in explicit way. */
mem->on_rmdir = 0;
+ enable_swap_cgroup();
} else {
mem = mem_cgroup_alloc();
if (!mem)
@@ -1607,3 +1629,13 @@ struct cgroup_subsys mem_cgroup_subsys =
.attach = mem_cgroup_move_task,
.early_init = 0,
};
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+
+static int __init disable_swap_account(char *s)
+{
+ really_do_swap_account = 0;
+ return 1;
+}
+__setup("noswapaccount", disable_swap_account);
+#endif
Index: mmotm-2.6.28-Nov10/Documentation/kernel-parameters.txt
===================================================================
--- mmotm-2.6.28-Nov10.orig/Documentation/kernel-parameters.txt
+++ mmotm-2.6.28-Nov10/Documentation/kernel-parameters.txt
@@ -1554,6 +1554,9 @@ and is between 256 and 4096 characters.

nosoftlockup [KNL] Disable the soft-lockup detector.

+ noswapaccount [KNL] Disable accounting of swap in memory resource
+ controller. (See Documentation/controllers/memory.txt)
+
nosync [HW,M68K] Disables sync negotiation for all devices.

notsc [BUGS=X86-32] Disable Time Stamp Counter
Index: mmotm-2.6.28-Nov10/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-Nov10.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-Nov10/include/linux/memcontrol.h
@@ -77,6 +77,9 @@ extern void mem_cgroup_record_reclaim_pr
extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
int priority, enum lru_list lru);

+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+extern int do_swap_account;
+#endif

#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;

2008-11-12 03:30:42

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: [RFC][PATCH 4/6] memcg: swap cgroup for remembering account

For accounting swap, we need a record per swap entry, at least.

This patch adds following function.
- swap_cgroup_swapon() .... called from swapon
- swap_cgroup_swapoff() ... called at the end of swapoff.

- swap_cgroup_record() .... record information of swap entry.
- swap_cgroup_lookup() .... lookup information of swap entry.

This patch just implements "how to record information". No actual
method for limit the usage of swap. These routine uses flat table
to record and lookup. "wise" lookup system like radix-tree requires
requires memory allocation at new records but swap-out is usually
called under memory shortage (or memcg hits limit.)
So, I used static allocation. (maybe dynamic allocation is not very
hard but it adds additional memory allocation in memory shortage path.)

Note1: In this, we use pointer to record information and this means
8bytes per swap entry. I think we can reduce this when we
create "id of cgroup" in the range of 0-65535 or 0-255.

Note2: array of swap_cgroup is allocated from HIGHMEM. maybe good for x86-32.

Changelog: v1 -> v2
- fixed bug in swapoff.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

include/linux/page_cgroup.h | 35 +++++++
mm/page_cgroup.c | 201 ++++++++++++++++++++++++++++++++++++++++++++
mm/swapfile.c | 8 +
3 files changed, 244 insertions(+)

Index: mmotm-2.6.28-Nov10/mm/page_cgroup.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/page_cgroup.c
+++ mmotm-2.6.28-Nov10/mm/page_cgroup.c
@@ -8,6 +8,8 @@
#include <linux/memory.h>
#include <linux/vmalloc.h>
#include <linux/cgroup.h>
+#include <linux/swapops.h>
+#include <linux/highmem.h>

static void __meminit
__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
@@ -251,3 +253,202 @@ void __init pgdat_page_cgroup_init(struc
}

#endif
+
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+
+DEFINE_MUTEX(swap_cgroup_mutex);
+struct swap_cgroup_ctrl {
+ spinlock_t lock;
+ struct page **map;
+ unsigned long length;
+};
+
+struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
+
+/*
+ * This 8bytes seems big..maybe we can reduce this when we can use "id" for
+ * cgroup rather than pointer.
+ */
+struct swap_cgroup {
+ struct mem_cgroup *val;
+};
+#define SC_PER_PAGE (PAGE_SIZE/sizeof(struct swap_cgroup))
+#define SC_POS_MASK (SC_PER_PAGE - 1)
+
+/*
+ * allocate buffer for swap_cgroup.
+ */
+static int swap_cgroup_prepare(int type)
+{
+ struct page *page;
+ struct swap_cgroup_ctrl *ctrl;
+ unsigned long idx, max;
+
+ if (!do_swap_account)
+ return 0;
+ ctrl = &swap_cgroup_ctrl[type];
+
+ for (idx = 0; idx < ctrl->length; idx++) {
+ page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
+ if (!page)
+ goto not_enough_page;
+ ctrl->map[idx] = page;
+ }
+ return 0;
+not_enough_page:
+ max = idx;
+ for (idx = 0; idx < max; idx++)
+ __free_page(ctrl->map[idx]);
+
+ return -ENOMEM;
+}
+
+/**
+ * swap_cgroup_record - record mem_cgroup for this swp_entry.
+ * @ent: swap entry to be recorded into
+ * @mem: mem_cgroup to be recorded
+ *
+ * Returns old value at success, NULL at failure.
+ * (Of course, old value can be NULL.)
+ */
+struct mem_cgroup *swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem)
+{
+ unsigned long flags;
+ int type = swp_type(ent);
+ unsigned long offset = swp_offset(ent);
+ unsigned long idx = offset / SC_PER_PAGE;
+ unsigned long pos = offset & SC_POS_MASK;
+ struct swap_cgroup_ctrl *ctrl;
+ struct page *mappage;
+ struct swap_cgroup *sc;
+ struct mem_cgroup *old;
+
+ if (!do_swap_account)
+ return NULL;
+
+ ctrl = &swap_cgroup_ctrl[type];
+
+ mappage = ctrl->map[idx];
+ spin_lock_irqsave(&ctrl->lock, flags);
+ sc = kmap_atomic(mappage, KM_USER0);
+ sc += pos;
+ old = sc->val;
+ sc->val = mem;
+ kunmap_atomic(mappage, KM_USER0);
+ spin_unlock_irqrestore(&ctrl->lock, flags);
+ return old;
+}
+
+/**
+ * lookup_swap_cgroup - lookup mem_cgroup tied to swap entry
+ * @ent: swap entry to be looked up.
+ *
+ * Returns pointer to mem_cgroup at success. NULL at failure.
+ */
+struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent)
+{
+ int type = swp_type(ent);
+ unsigned long flags;
+ unsigned long offset = swp_offset(ent);
+ unsigned long idx = offset / SC_PER_PAGE;
+ unsigned long pos = offset & SC_POS_MASK;
+ struct swap_cgroup_ctrl *ctrl;
+ struct page *mappage;
+ struct swap_cgroup *sc;
+ struct mem_cgroup *ret;
+
+ if (!do_swap_account)
+ return NULL;
+
+ ctrl = &swap_cgroup_ctrl[type];
+
+ mappage = ctrl->map[idx];
+
+ spin_lock_irqsave(&ctrl->lock, flags);
+ sc = kmap_atomic(mappage, KM_USER0);
+ sc += pos;
+ ret = sc->val;
+ kunmap_atomic(mapppage, KM_USER0);
+ spin_unlock_irqrestore(&ctrl->lock, flags);
+ return ret;
+}
+
+int swap_cgroup_swapon(int type, unsigned long max_pages)
+{
+ void *array;
+ unsigned long array_size;
+ unsigned long length;
+ struct swap_cgroup_ctrl *ctrl;
+
+ if (!do_swap_account)
+ return 0;
+
+ length = ((max_pages/SC_PER_PAGE) + 1);
+ array_size = length * sizeof(void *);
+
+ array = vmalloc(array_size);
+ if (!array)
+ goto nomem;
+
+ memset(array, 0, array_size);
+ ctrl = &swap_cgroup_ctrl[type];
+ mutex_lock(&swap_cgroup_mutex);
+ ctrl->length = length;
+ ctrl->map = array;
+ if (swap_cgroup_prepare(type)) {
+ /* memory shortage */
+ ctrl->map = NULL;
+ ctrl->length = 0;
+ vfree(array);
+ mutex_unlock(&swap_cgroup_mutex);
+ goto nomem;
+ }
+ mutex_unlock(&swap_cgroup_mutex);
+
+ printk(KERN_INFO
+ "swap_cgroup: uses %ld bytes vmalloc and %ld bytes buffres\n",
+ array_size, length * PAGE_SIZE);
+ printk(KERN_INFO
+ "swap_cgroup can be disabled by noswapaccount boot option.\n");
+
+ return 0;
+nomem:
+ printk(KERN_INFO "couldn't allocate enough memory for swap_cgroup.\n");
+ printk(KERN_INFO
+ "swap_cgroup can be disabled by noswapaccount boot option\n");
+ return -ENOMEM;
+}
+
+void swap_cgroup_swapoff(int type)
+{
+ int i;
+ struct swap_cgroup_ctrl *ctrl;
+
+ if (!do_swap_account)
+ return;
+
+ mutex_lock(&swap_cgroup_mutex);
+ ctrl = &swap_cgroup_ctrl[type];
+ if (ctrl->map) {
+ for (i = 0; i < ctrl->length; i++) {
+ struct page *page = ctrl->map[i];
+ if (page)
+ __free_page(page);
+ }
+ vfree(ctrl->map);
+ ctrl->map = NULL;
+ ctrl->length = 0;
+ }
+ mutex_unlock(&swap_cgroup_mutex);
+}
+
+static int __init swap_cgroup_init(void)
+{
+ int i;
+ for (i = 0; i < MAX_SWAPFILES; i++)
+ spin_lock_init(&swap_cgroup_ctrl[i].lock);
+ return 0;
+}
+late_initcall(swap_cgroup_init);
+#endif
Index: mmotm-2.6.28-Nov10/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.28-Nov10.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.28-Nov10/include/linux/page_cgroup.h
@@ -105,4 +105,39 @@ static inline void page_cgroup_init(void
}

#endif
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+#include <linux/swap.h>
+extern struct mem_cgroup *
+swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem);
+extern struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent);
+extern int swap_cgroup_swapon(int type, unsigned long max_pages);
+extern void swap_cgroup_swapoff(int type);
+#else
+#include <linux/swap.h>
+
+static inline
+struct mem_cgroup *swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem)
+{
+ return NULL;
+}
+
+static inline
+struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent)
+{
+ return NULL;
+}
+
+static inline int
+swap_cgroup_swapon(int type, unsigned long max_pages)
+{
+ return 0;
+}
+
+static inline void swap_cgroup_swapoff(int type)
+{
+ return;
+}
+
+#endif
#endif
Index: mmotm-2.6.28-Nov10/mm/swapfile.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/swapfile.c
+++ mmotm-2.6.28-Nov10/mm/swapfile.c
@@ -32,6 +32,7 @@
#include <asm/pgtable.h>
#include <asm/tlbflush.h>
#include <linux/swapops.h>
+#include <linux/page_cgroup.h>

static DEFINE_SPINLOCK(swap_lock);
static unsigned int nr_swapfiles;
@@ -1345,6 +1346,9 @@ asmlinkage long sys_swapoff(const char _
spin_unlock(&swap_lock);
mutex_unlock(&swapon_mutex);
vfree(swap_map);
+ /* Destroy swap account informatin */
+ swap_cgroup_swapoff(type);
+
inode = mapping->host;
if (S_ISBLK(inode->i_mode)) {
struct block_device *bdev = I_BDEV(inode);
@@ -1669,6 +1673,10 @@ asmlinkage long sys_swapon(const char __
nr_good_pages = swap_header->info.last_page -
swap_header->info.nr_badpages -
1 /* header page */;
+
+ if (!error)
+ error = swap_cgroup_swapon(type, maxpages);
+
if (error)
goto bad_swap;
}

2008-11-12 03:31:45

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: [RFC][PATCH 5/6] memcg: mem+swap controller

Mem+Swap controller core.

This patch implements per cgroup limit for usage of memory+swap.
However there are SwapCache, double counting of swap-cache and
swap-entry is avoided.

Mem+Swap controller works as following.
- memory usage is limited by memory.limit_in_bytes.
- memory + swap usage is limited by memory.memsw_limit_in_bytes.

This has following benefits.
- A user can limit total resource usage of mem+swap.

Without this, because memory resource controller doesn't take care of
usage of swap, a process can exhaust all the swap (by memory leak.)
We can avoid this case.

And Swap is shared resource but it cannot be reclaimed (goes back to memory)
until it's used. This characteristic can be trouble when the memory
is divided into some parts by cpuset or memcg.
Assume group A and group B.
After some application executes, the system can be..

Group A -- very large free memory space but occupy 99% of swap.
Group B -- under memory shortage but cannot use swap...it's nearly full.

Ability to set appropriate swap limit for each group is required.

Maybe someone wonder "why not swap but mem+swap ?"

- The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
to move account from memory to swap...there is no change in usage of
mem+swap.

In other words, when we want to limit the usage of swap without affecting
global LRU, mem+swap limit is better than just limiting swap.

Accounting target information is stored in swap_cgroup which is
per swap entry record.

Charge is done as following.
map
- charge page and memsw.

unmap
- uncharge page/memsw if not SwapCache.

swap-out (__delete_from_swap_cache)
- uncharge page
- record mem_cgroup information to swap_cgroup.

swap-in (do_swap_page)
- charged as page and memsw.
record in swap_cgroup is cleared.
memsw accounting is decremented.

swap-free (swap_free())
- if swap entry is freed, memsw is uncharged by PAGE_SIZE.

There are people work under never-swap environments and consider swap as
something bad. For such people, this mem+swap controller extension is just an
overhead. This overhead is avoided by config or boot option.
(see Kconfig. detail is not in this patch.)

TODO:
- maybe more optimization can be don in swap-in path. (but not very safe.)
But we just do simple accounting at this stage.

Changelog: v2 -> v3
- create memsw.* file only when do_swap_account==1.
- fixed cancel_charge_swapin().
- swap_on_disk in stat file is dropped.
- swapref is renamed to be refcnt. (so, this can be used for general use.)
- fixed resize_limit() to check that new limit don't exceed memsw.limit

Changelog: v1 -> v2
- fixed typos
- fixed migration of anon pages.
- fixed uncharge to check USED bit always.
- code for swapcache is moved to another patch.
- added "noswap" argument to try_to_free_mem_cgroup_pages
- fixed lock_page around mem_cgroup_charge_cache_swap()
- fixed failcnt file.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Documentation/controllers/memory.txt | 29 ++
include/linux/memcontrol.h | 11 -
include/linux/swap.h | 14 +
mm/memcontrol.c | 374 +++++++++++++++++++++++++++++++----
mm/memory.c | 3
mm/swap_state.c | 5
mm/swapfile.c | 11 -
mm/vmscan.c | 6
8 files changed, 406 insertions(+), 47 deletions(-)

Index: mmotm-2.6.28-Nov10/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Nov10/mm/memcontrol.c
@@ -133,6 +133,10 @@ struct mem_cgroup {
*/
struct res_counter res;
/*
+ * the counter to account for mem+swap usage.
+ */
+ struct res_counter memsw;
+ /*
* Per cgroup active and inactive list, similar to the
* per zone LRU lists.
*/
@@ -148,6 +152,12 @@ struct mem_cgroup {
* on_rmdir ....0=free all 1=move all.
*/
char on_rmdir;
+
+ /*
+ * for delayed freeing.
+ */
+ atomic_t refcnt;
+ int obsolete;
};
static struct mem_cgroup init_mem_cgroup;

@@ -190,6 +200,17 @@ static char *memcg_attribute_names[MEMCG
"on_rmdir",
};

+
+/* for encoding cft->private value on file */
+#define _MEM (0)
+#define _MEMSWAP (1)
+#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val))
+#define MEMFILE_TYPE(val) (((val) >> 16) & 0xffff)
+#define MEMFILE_ATTR(val) ((val) & 0xffff)
+
+static void mem_cgroup_get(struct mem_cgroup *mem);
+static void mem_cgroup_put(struct mem_cgroup *mem);
+
/*
* Always modified under lru lock. Then, not necessary to preempt_disable()
*/
@@ -508,7 +529,8 @@ unsigned long mem_cgroup_isolate_pages(u
* oom-killer can be invoked.
*/
static int __mem_cgroup_try_charge(struct mm_struct *mm,
- gfp_t gfp_mask, struct mem_cgroup **memcg, bool oom)
+ gfp_t gfp_mask, struct mem_cgroup **memcg,
+ bool oom)
{
struct mem_cgroup *mem;
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
@@ -536,12 +558,25 @@ static int __mem_cgroup_try_charge(struc
css_get(&mem->css);
}

+ while (1) {
+ int ret;
+ bool noswap = false;

- while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
+ ret = res_counter_charge(&mem->res, PAGE_SIZE);
+ if (likely(!ret)) {
+ if (!do_swap_account)
+ break;
+ ret = res_counter_charge(&mem->memsw, PAGE_SIZE);
+ if (likely(!ret))
+ break;
+ /* mem+swap counter fails */
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ noswap = true;
+ }
if (!(gfp_mask & __GFP_WAIT))
goto nomem;

- if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
+ if (try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap))
continue;

/*
@@ -550,8 +585,13 @@ static int __mem_cgroup_try_charge(struc
* moved to swap cache or just unmapped from the cgroup.
* Check the limit again to see if the reclaim reduced the
* current usage of the cgroup before giving up
+ *
*/
- if (res_counter_check_under_limit(&mem->res))
+ if (!do_swap_account &&
+ res_counter_check_under_limit(&mem->res))
+ continue;
+ if (do_swap_account &&
+ res_counter_check_under_limit(&mem->memsw))
continue;

if (!nr_retries--) {
@@ -605,6 +645,8 @@ static void __mem_cgroup_commit_charge(s
if (unlikely(PageCgroupUsed(pc))) {
unlock_page_cgroup(pc);
res_counter_uncharge(&mem->res, PAGE_SIZE);
+ if (do_swap_account)
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
css_put(&mem->css);
return;
}
@@ -669,6 +711,8 @@ static int mem_cgroup_move_account(struc
__mem_cgroup_remove_list(from_mz, pc);
css_put(&from->css);
res_counter_uncharge(&from->res, PAGE_SIZE);
+ if (do_swap_account)
+ res_counter_uncharge(&from->memsw, PAGE_SIZE);
pc->mem_cgroup = to;
css_get(&to->css);
__mem_cgroup_add_list(to_mz, pc, false);
@@ -715,8 +759,11 @@ static int mem_cgroup_move_parent(struct
/* drop extra refcnt */
css_put(&parent->css);
/* uncharge if move fails */
- if (ret)
+ if (ret) {
res_counter_uncharge(&parent->res, PAGE_SIZE);
+ if (do_swap_account)
+ res_counter_uncharge(&parent->memsw, PAGE_SIZE);
+ }

return ret;
}
@@ -814,7 +861,34 @@ int mem_cgroup_cache_charge(struct page
MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
}

+int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
+ struct page *page,
+ gfp_t mask, struct mem_cgroup **ptr)
+{
+ struct mem_cgroup *mem;
+ swp_entry_t ent;
+
+ if (mem_cgroup_subsys.disabled)
+ return 0;
+
+ if (!do_swap_account)
+ goto charge_cur_mm;
+
+ ent.val = page_private(page);
+
+ mem = lookup_swap_cgroup(ent);
+ if (!mem || mem->obsolete)
+ goto charge_cur_mm;
+ *ptr = mem;
+ return __mem_cgroup_try_charge(NULL, mask, ptr, true);
+charge_cur_mm:
+ if (unlikely(!mm))
+ mm = &init_mm;
+ return __mem_cgroup_try_charge(mm, mask, ptr, true);
+}
+
#ifdef CONFIG_SWAP
+
int mem_cgroup_cache_charge_swapin(struct page *page,
struct mm_struct *mm, gfp_t mask, bool locked)
{
@@ -831,8 +905,28 @@ int mem_cgroup_cache_charge_swapin(struc
* we reach here.
*/
if (PageSwapCache(page)) {
+ struct mem_cgroup *mem = NULL;
+ swp_entry_t ent;
+
+ ent.val = page_private(page);
+ if (do_swap_account) {
+ mem = lookup_swap_cgroup(ent);
+ if (mem && mem->obsolete)
+ mem = NULL;
+ if (mem)
+ mm = NULL;
+ }
ret = mem_cgroup_charge_common(page, mm, mask,
- MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
+ MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
+
+ if (!ret && do_swap_account) {
+ /* avoid double counting */
+ mem = swap_cgroup_record(ent, NULL);
+ if (mem) {
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ mem_cgroup_put(mem);
+ }
+ }
}
if (!locked)
unlock_page(page);
@@ -851,6 +945,23 @@ void mem_cgroup_commit_charge_swapin(str
return;
pc = lookup_page_cgroup(page);
__mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
+ /*
+ * Now swap is on-memory. This means this page may be
+ * counted both as mem and swap....double count.
+ * Fix it by uncharging from memsw. This SwapCache is stable
+ * because we're still under lock_page().
+ */
+ if (do_swap_account) {
+ swp_entry_t ent = {.val = page_private(page)};
+ struct mem_cgroup *memcg;
+ memcg = swap_cgroup_record(ent, NULL);
+ if (memcg) {
+ /* If memcg is obsolete, memcg can be != ptr */
+ res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+ mem_cgroup_put(memcg);
+ }
+
+ }
}

void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
@@ -860,6 +971,8 @@ void mem_cgroup_cancel_charge_swapin(str
if (!mem)
return;
res_counter_uncharge(&mem->res, PAGE_SIZE);
+ if (do_swap_account)
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
css_put(&mem->css);
}

@@ -867,29 +980,31 @@ void mem_cgroup_cancel_charge_swapin(str
/*
* uncharge if !page_mapped(page)
*/
-static void
+static struct mem_cgroup *
__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
{
struct page_cgroup *pc;
- struct mem_cgroup *mem;
+ struct mem_cgroup *mem = NULL;
struct mem_cgroup_per_zone *mz;
unsigned long flags;

if (mem_cgroup_subsys.disabled)
- return;
+ return NULL;

if (PageSwapCache(page))
- return;
+ return NULL;

/*
* Check if our page_cgroup is valid
*/
pc = lookup_page_cgroup(page);
if (unlikely(!pc || !PageCgroupUsed(pc)))
- return;
+ return NULL;

lock_page_cgroup(pc);

+ mem = pc->mem_cgroup;
+
if (!PageCgroupUsed(pc))
goto unlock_out;

@@ -909,8 +1024,11 @@ __mem_cgroup_uncharge_common(struct page
break;
}

+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+
ClearPageCgroupUsed(pc);
- mem = pc->mem_cgroup;

mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
@@ -918,14 +1036,13 @@ __mem_cgroup_uncharge_common(struct page
spin_unlock_irqrestore(&mz->lru_lock, flags);
unlock_page_cgroup(pc);

- res_counter_uncharge(&mem->res, PAGE_SIZE);
css_put(&mem->css);

- return;
+ return mem;

unlock_out:
unlock_page_cgroup(pc);
- return;
+ return NULL;
}

void mem_cgroup_uncharge_page(struct page *page)
@@ -945,10 +1062,42 @@ void mem_cgroup_uncharge_cache_page(stru
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
}

-void mem_cgroup_uncharge_swapcache(struct page *page)
+/*
+ * called from __delete_from_swap_cache() and drop "page" account.
+ * memcg information is recorded to swap_cgroup of "ent"
+ */
+void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
{
- __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
+ struct mem_cgroup *memcg;
+
+ memcg = __mem_cgroup_uncharge_common(page,
+ MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
+ /* record memcg information */
+ if (do_swap_account && memcg) {
+ swap_cgroup_record(ent, memcg);
+ mem_cgroup_get(memcg);
+ }
+}
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+/*
+ * called from swap_entry_free(). remove record in swap_cgroup and
+ * uncharge "memsw" account.
+ */
+void mem_cgroup_uncharge_swap(swp_entry_t ent)
+{
+ struct mem_cgroup *memcg;
+
+ if (!do_swap_account)
+ return;
+
+ memcg = swap_cgroup_record(ent, NULL);
+ if (memcg) {
+ res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+ mem_cgroup_put(memcg);
+ }
}
+#endif

/*
* Before starting migration, account PAGE_SIZE to mem_cgroup that the old
@@ -1057,7 +1206,7 @@ int mem_cgroup_shrink_usage(struct mm_st
rcu_read_unlock();

do {
- progress = try_to_free_mem_cgroup_pages(mem, gfp_mask);
+ progress = try_to_free_mem_cgroup_pages(mem, gfp_mask, true);
progress += res_counter_check_under_limit(&mem->res);
} while (!progress && --retry);

@@ -1074,6 +1223,11 @@ int mem_cgroup_resize_limit(struct mem_c
int progress;
int ret = 0;

+ if (do_swap_account) {
+ if (val > memcg->memsw.limit)
+ return -EINVAL;
+ }
+
while (res_counter_set_limit(&memcg->res, val)) {
if (signal_pending(current)) {
ret = -EINTR;
@@ -1084,13 +1238,55 @@ int mem_cgroup_resize_limit(struct mem_c
break;
}
progress = try_to_free_mem_cgroup_pages(memcg,
- GFP_HIGHUSER_MOVABLE);
+ GFP_HIGHUSER_MOVABLE, false);
if (!progress)
retry_count--;
}
return ret;
}

+int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
+ unsigned long long val)
+{
+ int retry_count = MEM_CGROUP_RECLAIM_RETRIES;
+ unsigned long flags;
+ u64 memlimit, oldusage, curusage;
+ int ret;
+
+ if (!do_swap_account)
+ return -EINVAL;
+
+ while (retry_count) {
+ if (signal_pending(current)) {
+ ret = -EINTR;
+ break;
+ }
+ /*
+ * Rather than hide all in some function, I do this in
+ * open coded manner. You see what this really does.
+ * We have to guarantee mem->res.limit < mem->memsw.limit.
+ */
+ spin_lock_irqsave(&memcg->res.lock, flags);
+ memlimit = memcg->res.limit;
+ if (memlimit > val) {
+ spin_unlock_irqrestore(&memcg->res.lock, flags);
+ ret = -EINVAL;
+ break;
+ }
+ ret = res_counter_set_limit(&memcg->memsw, val);
+ oldusage = memcg->memsw.usage;
+ spin_unlock_irqrestore(&memcg->res.lock, flags);
+
+ if (!ret)
+ break;
+ try_to_free_mem_cgroup_pages(memcg, GFP_HIGHUSER_MOVABLE, true);
+ curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
+ if (curusage >= oldusage)
+ retry_count--;
+ }
+ return ret;
+}
+

/*
* This routine traverse page_cgroup in given list and drop them all.
@@ -1215,7 +1411,7 @@ try_to_free:
goto out;
}
progress = try_to_free_mem_cgroup_pages(mem,
- GFP_HIGHUSER_MOVABLE);
+ GFP_HIGHUSER_MOVABLE, false);
if (!progress) {
nr_retries--;
/* maybe some writeback is necessary */
@@ -1232,8 +1428,25 @@ try_to_free:

static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
{
- return res_counter_read_u64(&mem_cgroup_from_cont(cont)->res,
- cft->private);
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+ u64 val = 0;
+ int type, name;
+
+ type = MEMFILE_TYPE(cft->private);
+ name = MEMFILE_ATTR(cft->private);
+ switch (type) {
+ case _MEM:
+ val = res_counter_read_u64(&mem->res, name);
+ break;
+ case _MEMSWAP:
+ if (do_swap_account)
+ val = res_counter_read_u64(&mem->memsw, name);
+ break;
+ default:
+ BUG();
+ break;
+ }
+ return val;
}
/*
* The user of this function is...
@@ -1243,15 +1456,22 @@ static int mem_cgroup_write(struct cgrou
const char *buffer)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+ int type, name;
unsigned long long val;
int ret;

- switch (cft->private) {
+ type = MEMFILE_TYPE(cft->private);
+ name = MEMFILE_ATTR(cft->private);
+ switch (name) {
case RES_LIMIT:
/* This function does all necessary parse...reuse it */
ret = res_counter_memparse_write_strategy(buffer, &val);
- if (!ret)
+ if (ret)
+ break;
+ if (type == _MEM)
ret = mem_cgroup_resize_limit(memcg, val);
+ else
+ ret = mem_cgroup_resize_memsw_limit(memcg, val);
break;
default:
ret = -EINVAL; /* should be BUG() ? */
@@ -1263,14 +1483,23 @@ static int mem_cgroup_write(struct cgrou
static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
{
struct mem_cgroup *mem;
+ int type, name;

mem = mem_cgroup_from_cont(cont);
- switch (event) {
+ type = MEMFILE_TYPE(event);
+ name = MEMFILE_ATTR(event);
+ switch (name) {
case RES_MAX_USAGE:
- res_counter_reset_max(&mem->res);
+ if (type == _MEM)
+ res_counter_reset_max(&mem->res);
+ else
+ res_counter_reset_max(&mem->memsw);
break;
case RES_FAILCNT:
- res_counter_reset_failcnt(&mem->res);
+ if (type == _MEM)
+ res_counter_reset_failcnt(&mem->res);
+ else
+ res_counter_reset_failcnt(&mem->memsw);
break;
}
return 0;
@@ -1427,24 +1656,24 @@ static int mem_cgroup_read_attr(struct c
static struct cftype mem_cgroup_files[] = {
{
.name = "usage_in_bytes",
- .private = RES_USAGE,
+ .private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
.read_u64 = mem_cgroup_read,
},
{
.name = "max_usage_in_bytes",
- .private = RES_MAX_USAGE,
+ .private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
.trigger = mem_cgroup_reset,
.read_u64 = mem_cgroup_read,
},
{
.name = "limit_in_bytes",
- .private = RES_LIMIT,
+ .private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
.write_string = mem_cgroup_write,
.read_u64 = mem_cgroup_read,
},
{
.name = "failcnt",
- .private = RES_FAILCNT,
+ .private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
.trigger = mem_cgroup_reset,
.read_u64 = mem_cgroup_read,
},
@@ -1459,6 +1688,47 @@ static struct cftype mem_cgroup_files[]
},
};

+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+static struct cftype memsw_cgroup_files[] = {
+ {
+ .name = "memsw.usage_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
+ .read_u64 = mem_cgroup_read,
+ },
+ {
+ .name = "memsw.max_usage_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEMSWAP, RES_MAX_USAGE),
+ .trigger = mem_cgroup_reset,
+ .read_u64 = mem_cgroup_read,
+ },
+ {
+ .name = "memsw.limit_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT),
+ .write_string = mem_cgroup_write,
+ .read_u64 = mem_cgroup_read,
+ },
+ {
+ .name = "memsw.failcnt",
+ .private = MEMFILE_PRIVATE(_MEMSWAP, RES_FAILCNT),
+ .trigger = mem_cgroup_reset,
+ .read_u64 = mem_cgroup_read,
+ },
+};
+
+static int register_memsw_files(struct cgroup *cont, struct cgroup_subsys *ss)
+{
+ if (!do_swap_account)
+ return 0;
+ return cgroup_add_files(cont, ss, memsw_cgroup_files,
+ ARRAY_SIZE(memsw_cgroup_files));
+};
+#else
+static int register_memsw_files(struct cgroup *cont, struct cgroup_subsys *ss)
+{
+ return 0;
+}
+#endif
+
static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
{
struct mem_cgroup_per_node *pn;
@@ -1510,14 +1780,44 @@ static struct mem_cgroup *mem_cgroup_all
return mem;
}

+/*
+ * At destroying mem_cgroup, references from swap_cgroup can remain.
+ * (scanning all at force_empty is too costly...)
+ *
+ * Instead of clearing all references at force_empty, we remember
+ * the number of reference from swap_cgroup and free mem_cgroup when
+ * it goes down to 0.
+ *
+ * When mem_cgroup is destroyed, mem->obsolete will be set to 0 and
+ * entry which points to this memcg will be ignore at swapin.
+ *
+ * Removal of cgroup itself succeeds regardless of refs from swap.
+ */
+
static void mem_cgroup_free(struct mem_cgroup *mem)
{
+ if (atomic_read(&mem->refcnt) > 0)
+ return;
if (sizeof(*mem) < PAGE_SIZE)
kfree(mem);
else
vfree(mem);
}

+static void mem_cgroup_get(struct mem_cgroup *mem)
+{
+ atomic_inc(&mem->refcnt);
+}
+
+static void mem_cgroup_put(struct mem_cgroup *mem)
+{
+ if (atomic_dec_and_test(&mem->refcnt)) {
+ if (!mem->obsolete)
+ return;
+ mem_cgroup_free(mem);
+ }
+}
+

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
static void __init enable_swap_cgroup(void)
@@ -1551,6 +1851,7 @@ mem_cgroup_create(struct cgroup_subsys *
}

res_counter_init(&mem->res);
+ res_counter_init(&mem->memsw);
if (parent)
mem->on_rmdir = parent->on_rmdir;

@@ -1571,6 +1872,7 @@ static void mem_cgroup_pre_destroy(struc
struct cgroup *cont)
{
struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+ mem->obsolete = 1;
mem_cgroup_force_empty(mem);
}

@@ -1589,8 +1891,14 @@ static void mem_cgroup_destroy(struct cg
static int mem_cgroup_populate(struct cgroup_subsys *ss,
struct cgroup *cont)
{
- return cgroup_add_files(cont, ss, mem_cgroup_files,
- ARRAY_SIZE(mem_cgroup_files));
+ int ret;
+
+ ret = cgroup_add_files(cont, ss, mem_cgroup_files,
+ ARRAY_SIZE(mem_cgroup_files));
+
+ if (!ret)
+ ret = register_memsw_files(cont, ss);
+ return ret;
}

static void mem_cgroup_move_task(struct cgroup_subsys *ss,
Index: mmotm-2.6.28-Nov10/mm/swapfile.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/swapfile.c
+++ mmotm-2.6.28-Nov10/mm/swapfile.c
@@ -271,8 +271,9 @@ out:
return NULL;
}

-static int swap_entry_free(struct swap_info_struct *p, unsigned long offset)
+static int swap_entry_free(struct swap_info_struct *p, swp_entry_t ent)
{
+ unsigned long offset = swp_offset(ent);
int count = p->swap_map[offset];

if (count < SWAP_MAP_MAX) {
@@ -287,6 +288,7 @@ static int swap_entry_free(struct swap_i
swap_list.next = p - swap_info;
nr_swap_pages++;
p->inuse_pages--;
+ mem_cgroup_uncharge_swap(ent);
}
}
return count;
@@ -302,7 +304,7 @@ void swap_free(swp_entry_t entry)

p = swap_info_get(entry);
if (p) {
- swap_entry_free(p, swp_offset(entry));
+ swap_entry_free(p, entry);
spin_unlock(&swap_lock);
}
}
@@ -421,7 +423,7 @@ void free_swap_and_cache(swp_entry_t ent

p = swap_info_get(entry);
if (p) {
- if (swap_entry_free(p, swp_offset(entry)) == 1) {
+ if (swap_entry_free(p, entry) == 1) {
page = find_get_page(&swapper_space, entry.val);
if (page && !trylock_page(page)) {
page_cache_release(page);
@@ -536,7 +538,8 @@ static int unuse_pte(struct vm_area_stru
pte_t *pte;
int ret = 1;

- if (mem_cgroup_try_charge(vma->vm_mm, GFP_HIGHUSER_MOVABLE, &ptr))
+ if (mem_cgroup_try_charge_swapin(vma->vm_mm, page,
+ GFP_HIGHUSER_MOVABLE, &ptr))
ret = -ENOMEM;

pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
Index: mmotm-2.6.28-Nov10/mm/swap_state.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/swap_state.c
+++ mmotm-2.6.28-Nov10/mm/swap_state.c
@@ -17,6 +17,7 @@
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
#include <linux/migrate.h>
+#include <linux/page_cgroup.h>

#include <asm/pgtable.h>

@@ -108,6 +109,8 @@ int add_to_swap_cache(struct page *page,
*/
void __delete_from_swap_cache(struct page *page)
{
+ swp_entry_t ent = {.val = page_private(page)};
+
BUG_ON(!PageLocked(page));
BUG_ON(!PageSwapCache(page));
BUG_ON(PageWriteback(page));
@@ -119,7 +122,7 @@ void __delete_from_swap_cache(struct pag
total_swapcache_pages--;
__dec_zone_page_state(page, NR_FILE_PAGES);
INC_CACHE_INFO(del_total);
- mem_cgroup_uncharge_swapcache(page);
+ mem_cgroup_uncharge_swapcache(page, ent);
}

/**
Index: mmotm-2.6.28-Nov10/include/linux/swap.h
===================================================================
--- mmotm-2.6.28-Nov10.orig/include/linux/swap.h
+++ mmotm-2.6.28-Nov10/include/linux/swap.h
@@ -213,7 +213,7 @@ static inline void lru_cache_add_active_
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask);
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
- gfp_t gfp_mask);
+ gfp_t gfp_mask, bool noswap);
extern int __isolate_lru_page(struct page *page, int mode, int file);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
@@ -338,7 +338,7 @@ static inline void disable_swap_token(vo
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
extern int mem_cgroup_cache_charge_swapin(struct page *page,
struct mm_struct *mm, gfp_t mask, bool locked);
-extern void mem_cgroup_uncharge_swapcache(struct page *page);
+extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
#else
static inline
int mem_cgroup_cache_charge_swapin(struct page *page,
@@ -346,7 +346,15 @@ int mem_cgroup_cache_charge_swapin(struc
{
return 0;
}
-static inline void mem_cgroup_uncharge_swapcache(struct page *page)
+static inline void
+mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
+{
+}
+#endif
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
+#else
+static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
{
}
#endif
Index: mmotm-2.6.28-Nov10/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-Nov10.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-Nov10/include/linux/memcontrol.h
@@ -32,6 +32,8 @@ extern int mem_cgroup_newpage_charge(str
/* for swap handling */
extern int mem_cgroup_try_charge(struct mm_struct *mm,
gfp_t gfp_mask, struct mem_cgroup **ptr);
+extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
+ struct page *page, gfp_t mask, struct mem_cgroup **ptr);
extern void mem_cgroup_commit_charge_swapin(struct page *page,
struct mem_cgroup *ptr);
extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
@@ -80,7 +82,6 @@ extern long mem_cgroup_calc_reclaim(stru
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
#endif
-
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;

@@ -97,7 +98,13 @@ static inline int mem_cgroup_cache_charg
}

static inline int mem_cgroup_try_charge(struct mm_struct *mm,
- gfp_t gfp_mask, struct mem_cgroup **ptr)
+ gfp_t gfp_mask, struct mem_cgroup **ptr)
+{
+ return 0;
+}
+
+static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
+ struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr)
{
return 0;
}
Index: mmotm-2.6.28-Nov10/mm/memory.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/memory.c
+++ mmotm-2.6.28-Nov10/mm/memory.c
@@ -2324,7 +2324,8 @@ static int do_swap_page(struct mm_struct
lock_page(page);
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);

- if (mem_cgroup_try_charge(mm, GFP_HIGHUSER_MOVABLE, &ptr) == -ENOMEM) {
+ if (mem_cgroup_try_charge_swapin(mm, page,
+ GFP_HIGHUSER_MOVABLE, &ptr) == -ENOMEM) {
ret = VM_FAULT_OOM;
unlock_page(page);
goto out;
Index: mmotm-2.6.28-Nov10/mm/vmscan.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/vmscan.c
+++ mmotm-2.6.28-Nov10/mm/vmscan.c
@@ -1718,7 +1718,8 @@ unsigned long try_to_free_pages(struct z
#ifdef CONFIG_CGROUP_MEM_RES_CTLR

unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
- gfp_t gfp_mask)
+ gfp_t gfp_mask,
+ bool noswap)
{
struct scan_control sc = {
.may_writepage = !laptop_mode,
@@ -1731,6 +1732,9 @@ unsigned long try_to_free_mem_cgroup_pag
};
struct zonelist *zonelist;

+ if (noswap)
+ sc.may_swap = 0;
+
sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
zonelist = NODE_DATA(numa_node_id())->node_zonelists;
Index: mmotm-2.6.28-Nov10/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.28-Nov10.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.28-Nov10/Documentation/controllers/memory.txt
@@ -137,12 +137,32 @@ behind this approach is that a cgroup th
page will eventually get charged for it (once it is uncharged from
the cgroup that brought it in -- this will happen on memory pressure).

-Exception: When you do swapoff and make swapped-out pages of shmem(tmpfs) to
+Exception: If CONFIG_CGROUP_CGROUP_MEM_RES_CTLR_SWAP is not used..
+When you do swapoff and make swapped-out pages of shmem(tmpfs) to
be backed into memory in force, charges for pages are accounted against the
caller of swapoff rather than the users of shmem.

-2.4 Reclaim
+2.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP)
+Swap Extension allows you to record charge for swap. A swapped-in page is
+charged back to original page allocator if possible.
+
+When swap is accounted, following files are added.
+ - memory.memsw.usage_in_bytes.
+ - memory.memsw.limit_in_bytes.
+
+usage of mem+swap is limited by memsw.limit_in_bytes.
+
+Note: why 'mem+swap' rather than swap.
+The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
+to move account from memory to swap...there is no change in usage of
+mem+swap.
+
+In other words, when we want to limit the usage of swap without affecting
+global LRU, mem+swap limit is better than just limiting swap from OS point
+of view.
+
+2.5 Reclaim

Each cgroup maintains a per cgroup LRU that consists of an active
and inactive list. When a cgroup goes over its limit, we first try
@@ -246,6 +266,11 @@ Such charges are freed(at default) or mo
both of RSS and CACHES are moved to parent.
If both of them are busy, rmdir() returns -EBUSY. See 5.1 Also.

+Charges recorded in swap information is not updated at removal of cgroup.
+Recorded information is discarded and a cgroup which uses swap (swapcache)
+will be charged as a new owner of it.
+
+
5. Attributes.

memory controller has some of attributes for customizing behavior.

2008-11-12 03:32:53

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: [RFC][PATCH 6/6] memcg: synchronized LRU

A patch for changing memcg's LRU semantics.

Now,
- page_cgroup is linked to mem_cgroup's its own LRU (per zone).

- LRU of page_cgroup is not synchronous with global LRU.

- page and page_cgroup is one-to-one and statically allocated.

- To find page_cgroup is on what LRU, you have to check pc->mem_cgroup as
- lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);

- SwapCache is handled.

And, when we handle LRU list of page_cgroup, we do following.

pc = lookup_page_cgroup(page);
lock_page_cgroup(pc); .....................(1)
mz = page_cgroup_zoneinfo(pc);
spin_lock(&mz->lru_lock);
.....add to LRU
spin_unlock(&mz->lru_lock);
unlock_page_cgroup(pc);

But (1) is spin_lock and we have to be afraid of dead-lock with zone->lru_lock.
So, trylock() is used at (1), now. Without (1), we can't trust "mz" is correct.

This is a trial to remove this dirty nesting of locks.
This patch changes mz->lru_lock to be zone->lru_lock.
Then, above sequence will be written as

spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
mem_cgroup_add/remove/etc_lru() {
pc = lookup_page_cgroup(page);
mz = page_cgroup_zoneinfo(pc);
if (PageCgroupUsed(pc)) {
....add to LRU
}
spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU

This is much simpler.
(*) We're safe even if we don't take lock_page_cgroup(pc). Because..
1. When pc->mem_cgroup can be modified.
- at charge.
- at account_move().
2. at charge
the PCG_USED bit is not set before pc->mem_cgroup is fixed.
3. at account_move()
the page is isolated and not on LRU.

Pros.
- easy for maintenance.
- memcg can make use of laziness of pagevec.
- we don't have to duplicated LRU/Active/Unevictable bit in page_cgroup.
- LRU status of memcg will be synchronized with global LRU's one.
- # of locks are reduced.
- account_move() is simplified very much.
Cons.
- may increase cost of LRU rotation.
(no impact if memcg is not configured.)

Changelog v0 -> v1
- fixed statistics.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

fs/splice.c | 1
include/linux/memcontrol.h | 29 +++
include/linux/mm_inline.h | 3
include/linux/page_cgroup.h | 17 --
mm/memcontrol.c | 332 +++++++++++++++++++-------------------------
mm/page_cgroup.c | 1
mm/swap.c | 1
mm/vmscan.c | 9 -
8 files changed, 183 insertions(+), 210 deletions(-)

Index: mmotm-2.6.28-Nov10/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Nov10/mm/memcontrol.c
@@ -36,6 +36,7 @@
#include <linux/vmalloc.h>
#include <linux/mm_inline.h>
#include <linux/page_cgroup.h>
+#include "internal.h"

#include <asm/uaccess.h>

@@ -100,7 +101,6 @@ struct mem_cgroup_per_zone {
/*
* spin_lock to protect the per cgroup LRU
*/
- spinlock_t lru_lock;
struct list_head lists[NR_LRU_LISTS];
unsigned long count[NR_LRU_LISTS];
};
@@ -137,6 +137,11 @@ struct mem_cgroup {
*/
struct res_counter memsw;
/*
+ * Statistics.
+ */
+ struct mem_cgroup_stat stat;
+ /*
+ * Below is read-mostly part.
* Per cgroup active and inactive list, similar to the
* per zone LRU lists.
*/
@@ -144,10 +149,6 @@ struct mem_cgroup {

int prev_priority; /* for recording reclaim priority */
/*
- * statistics.
- */
- struct mem_cgroup_stat stat;
- /*
* attributes.
* on_rmdir ....0=free all 1=move all.
*/
@@ -173,14 +174,12 @@ enum charge_type {
/* only for here (for easy reading.) */
#define PCGF_CACHE (1UL << PCG_CACHE)
#define PCGF_USED (1UL << PCG_USED)
-#define PCGF_ACTIVE (1UL << PCG_ACTIVE)
#define PCGF_LOCK (1UL << PCG_LOCK)
-#define PCGF_FILE (1UL << PCG_FILE)
static const unsigned long
pcg_default_flags[NR_CHARGE_TYPE] = {
- PCGF_CACHE | PCGF_FILE | PCGF_USED | PCGF_LOCK, /* File Cache */
- PCGF_ACTIVE | PCGF_USED | PCGF_LOCK, /* Anon */
- PCGF_ACTIVE | PCGF_CACHE | PCGF_USED | PCGF_LOCK, /* Shmem */
+ PCGF_CACHE | PCGF_USED | PCGF_LOCK, /* File Cache */
+ PCGF_USED | PCGF_LOCK, /* Anon */
+ PCGF_CACHE | PCGF_USED | PCGF_LOCK, /* Shmem */
0, /* FORCE */
};

@@ -211,9 +210,6 @@ static char *memcg_attribute_names[MEMCG
static void mem_cgroup_get(struct mem_cgroup *mem);
static void mem_cgroup_put(struct mem_cgroup *mem);

-/*
- * Always modified under lru lock. Then, not necessary to preempt_disable()
- */
static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
struct page_cgroup *pc,
bool charge)
@@ -221,10 +217,9 @@ static void mem_cgroup_charge_statistics
int val = (charge)? 1 : -1;
struct mem_cgroup_stat *stat = &mem->stat;
struct mem_cgroup_stat_cpu *cpustat;
+ int cpu = get_cpu();

- VM_BUG_ON(!irqs_disabled());
-
- cpustat = &stat->cpustat[smp_processor_id()];
+ cpustat = &stat->cpustat[cpu];
if (PageCgroupCache(pc))
__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_CACHE, val);
else
@@ -236,6 +231,7 @@ static void mem_cgroup_charge_statistics
else
__mem_cgroup_stat_add_safe(cpustat,
MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
+ put_cpu();
}

static struct mem_cgroup_per_zone *
@@ -290,80 +286,95 @@ struct mem_cgroup *mem_cgroup_from_task(
struct mem_cgroup, css);
}

-static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
- struct page_cgroup *pc)
-{
- int lru = LRU_BASE;
+/*
+ * Following LRU functions are allowed to be used without PCG_LOCK.
+ * Operations are called by routine of global LRU independently from memcg.
+ * What we have to take care of here is validness of pc->mem_cgroup.
+ *
+ * Changes to pc->mem_cgroup happens when
+ * 1. charge
+ * 2. moving account
+ * In typical case, "charge" is done before add-to-lru. Exception is SwapCache.
+ * It is added to LRU before charge.
+ * If PCG_USED bit is not set, page_cgroup is not added to this private LRU.
+ * When moving account, the page is not on LRU. It's isolated.
+ */

- if (PageCgroupUnevictable(pc))
- lru = LRU_UNEVICTABLE;
- else {
- if (PageCgroupActive(pc))
- lru += LRU_ACTIVE;
- if (PageCgroupFile(pc))
- lru += LRU_FILE;
- }
+void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
+{
+ struct page_cgroup *pc;
+ struct mem_cgroup *mem;
+ struct mem_cgroup_per_zone *mz;

+ if (mem_cgroup_subsys.disabled)
+ return;
+ pc = lookup_page_cgroup(page);
+ /* can happen while we handle swapcache. */
+ if (list_empty(&pc->lru))
+ return;
+ mz = page_cgroup_zoneinfo(pc);
+ mem = pc->mem_cgroup;
MEM_CGROUP_ZSTAT(mz, lru) -= 1;
-
- mem_cgroup_charge_statistics(pc->mem_cgroup, pc, false);
- list_del(&pc->lru);
+ list_del_init(&pc->lru);
+ return;
}

-static void __mem_cgroup_add_list(struct mem_cgroup_per_zone *mz,
- struct page_cgroup *pc, bool hot)
+void mem_cgroup_del_lru(struct page *page)
{
- int lru = LRU_BASE;
+ mem_cgroup_del_lru_list(page, page_lru(page));
+}

- if (PageCgroupUnevictable(pc))
- lru = LRU_UNEVICTABLE;
- else {
- if (PageCgroupActive(pc))
- lru += LRU_ACTIVE;
- if (PageCgroupFile(pc))
- lru += LRU_FILE;
- }
+void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
+{
+ struct mem_cgroup_per_zone *mz;
+ struct page_cgroup *pc;

- MEM_CGROUP_ZSTAT(mz, lru) += 1;
- if (hot)
- list_add(&pc->lru, &mz->lists[lru]);
- else
- list_add_tail(&pc->lru, &mz->lists[lru]);
+ if (mem_cgroup_subsys.disabled)
+ return;

- mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
+ pc = lookup_page_cgroup(page);
+ smp_rmb();
+ /* unused page is not rotated. */
+ if (!PageCgroupUsed(pc))
+ return;
+ mz = page_cgroup_zoneinfo(pc);
+ list_move(&pc->lru, &mz->lists[lru]);
}

-static void __mem_cgroup_move_lists(struct page_cgroup *pc, enum lru_list lru)
+void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
{
- struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
- int active = PageCgroupActive(pc);
- int file = PageCgroupFile(pc);
- int unevictable = PageCgroupUnevictable(pc);
- enum lru_list from = unevictable ? LRU_UNEVICTABLE :
- (LRU_FILE * !!file + !!active);
+ struct page_cgroup *pc;
+ struct mem_cgroup_per_zone *mz;

- if (lru == from)
+ if (mem_cgroup_subsys.disabled)
+ return;
+ pc = lookup_page_cgroup(page);
+ /* barrier to sync with "charge" */
+ smp_rmb();
+ if (!PageCgroupUsed(pc))
return;

- MEM_CGROUP_ZSTAT(mz, from) -= 1;
- /*
- * However this is done under mz->lru_lock, another flags, which
- * are not related to LRU, will be modified from out-of-lock.
- * We have to use atomic set/clear flags.
- */
- if (is_unevictable_lru(lru)) {
- ClearPageCgroupActive(pc);
- SetPageCgroupUnevictable(pc);
- } else {
- if (is_active_lru(lru))
- SetPageCgroupActive(pc);
- else
- ClearPageCgroupActive(pc);
- ClearPageCgroupUnevictable(pc);
- }
-
+ mz = page_cgroup_zoneinfo(pc);
MEM_CGROUP_ZSTAT(mz, lru) += 1;
- list_move(&pc->lru, &mz->lists[lru]);
+ list_add(&pc->lru, &mz->lists[lru]);
+}
+/*
+ * To add swapcache into LRU. Be careful to all this function.
+ * zone->lru_lock shouldn't be held and irq must not be disabled.
+ */
+static void mem_cgroup_lru_fixup(struct page *page)
+{
+ if (!isolate_lru_page(page))
+ putback_lru_page(page);
+}
+
+void mem_cgroup_move_lists(struct page *page,
+ enum lru_list from, enum lru_list to)
+{
+ if (mem_cgroup_subsys.disabled)
+ return;
+ mem_cgroup_del_lru_list(page, from);
+ mem_cgroup_add_lru_list(page, to);
}

int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
@@ -377,37 +388,6 @@ int task_in_mem_cgroup(struct task_struc
}

/*
- * This routine assumes that the appropriate zone's lru lock is already held
- */
-void mem_cgroup_move_lists(struct page *page, enum lru_list lru)
-{
- struct page_cgroup *pc;
- struct mem_cgroup_per_zone *mz;
- unsigned long flags;
-
- if (mem_cgroup_subsys.disabled)
- return;
-
- /*
- * We cannot lock_page_cgroup while holding zone's lru_lock,
- * because other holders of lock_page_cgroup can be interrupted
- * with an attempt to rotate_reclaimable_page. But we cannot
- * safely get to page_cgroup without it, so just try_lock it:
- * mem_cgroup_isolate_pages allows for page left on wrong list.
- */
- pc = lookup_page_cgroup(page);
- if (!trylock_page_cgroup(pc))
- return;
- if (pc && PageCgroupUsed(pc)) {
- mz = page_cgroup_zoneinfo(pc);
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_move_lists(pc, lru);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
- }
- unlock_page_cgroup(pc);
-}
-
-/*
* Calculate mapped_ratio under memory controller. This will be used in
* vmscan.c for deteremining we have to reclaim mapped pages.
*/
@@ -486,40 +466,24 @@ unsigned long mem_cgroup_isolate_pages(u
mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
src = &mz->lists[lru];

- spin_lock(&mz->lru_lock);
scan = 0;
list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
if (scan >= nr_to_scan)
break;
+
+ page = pc->page;
if (unlikely(!PageCgroupUsed(pc)))
continue;
- page = pc->page;
-
if (unlikely(!PageLRU(page)))
continue;

- /*
- * TODO: play better with lumpy reclaim, grabbing anything.
- */
- if (PageUnevictable(page) ||
- (PageActive(page) && !active) ||
- (!PageActive(page) && active)) {
- __mem_cgroup_move_lists(pc, page_lru(page));
- continue;
- }
-
scan++;
- list_move(&pc->lru, &pc_list);
-
if (__isolate_lru_page(page, mode, file) == 0) {
list_move(&page->lru, dst);
nr_taken++;
}
}

- list_splice(&pc_list, src);
- spin_unlock(&mz->lru_lock);
-
*scanned = scan;
return nr_taken;
}
@@ -634,9 +598,6 @@ static void __mem_cgroup_commit_charge(s
struct page_cgroup *pc,
enum charge_type ctype)
{
- struct mem_cgroup_per_zone *mz;
- unsigned long flags;
-
/* try_charge() can return NULL to *memcg, taking care of it. */
if (!mem)
return;
@@ -651,17 +612,11 @@ static void __mem_cgroup_commit_charge(s
return;
}
pc->mem_cgroup = mem;
- /*
- * If a page is accounted as a page cache, insert to inactive list.
- * If anon, insert to active list.
- */
+ smp_wmb();
pc->flags = pcg_default_flags[ctype];

- mz = page_cgroup_zoneinfo(pc);
+ mem_cgroup_charge_statistics(mem, pc, true);

- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_add_list(mz, pc, true);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
unlock_page_cgroup(pc);
}

@@ -672,8 +627,7 @@ static void __mem_cgroup_commit_charge(s
* @to: mem_cgroup which the page is moved to. @from != @to.
*
* The caller must confirm following.
- * 1. disable irq.
- * 2. lru_lock of old mem_cgroup(@from) should be held.
+ * - page is not on LRU (isolate_page() is useful.)
*
* returns 0 at success,
* returns -EBUSY when lock is busy or "pc" is unstable.
@@ -689,15 +643,14 @@ static int mem_cgroup_move_account(struc
int nid, zid;
int ret = -EBUSY;

- VM_BUG_ON(!irqs_disabled());
VM_BUG_ON(from == to);
+ VM_BUG_ON(PageLRU(pc->page));

nid = page_cgroup_nid(pc);
zid = page_cgroup_zid(pc);
from_mz = mem_cgroup_zoneinfo(from, nid, zid);
to_mz = mem_cgroup_zoneinfo(to, nid, zid);

-
if (!trylock_page_cgroup(pc))
return ret;

@@ -707,18 +660,15 @@ static int mem_cgroup_move_account(struc
if (pc->mem_cgroup != from)
goto out;

- if (spin_trylock(&to_mz->lru_lock)) {
- __mem_cgroup_remove_list(from_mz, pc);
- css_put(&from->css);
- res_counter_uncharge(&from->res, PAGE_SIZE);
- if (do_swap_account)
- res_counter_uncharge(&from->memsw, PAGE_SIZE);
- pc->mem_cgroup = to;
- css_get(&to->css);
- __mem_cgroup_add_list(to_mz, pc, false);
- ret = 0;
- spin_unlock(&to_mz->lru_lock);
- }
+ css_put(&from->css);
+ res_counter_uncharge(&from->res, PAGE_SIZE);
+ mem_cgroup_charge_statistics(from, pc, false);
+ if (do_swap_account)
+ res_counter_uncharge(&from->memsw, PAGE_SIZE);
+ pc->mem_cgroup = to;
+ mem_cgroup_charge_statistics(to, pc, true);
+ css_get(&to->css);
+ ret = 0;
out:
unlock_page_cgroup(pc);
return ret;
@@ -732,39 +682,47 @@ static int mem_cgroup_move_parent(struct
struct mem_cgroup *child,
gfp_t gfp_mask)
{
+ struct page *page = pc->page;
struct cgroup *cg = child->css.cgroup;
struct cgroup *pcg = cg->parent;
struct mem_cgroup *parent;
- struct mem_cgroup_per_zone *mz;
- unsigned long flags;
int ret;

/* Is ROOT ? */
if (!pcg)
return -EINVAL;

+
parent = mem_cgroup_from_cont(pcg);

+
ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false);
if (ret)
return ret;

- mz = mem_cgroup_zoneinfo(child,
- page_cgroup_nid(pc), page_cgroup_zid(pc));
+ if (!get_page_unless_zero(page))
+ return -EBUSY;
+
+ ret = isolate_lru_page(page);
+
+ if (ret)
+ goto cancel;

- spin_lock_irqsave(&mz->lru_lock, flags);
ret = mem_cgroup_move_account(pc, child, parent);
- spin_unlock_irqrestore(&mz->lru_lock, flags);

- /* drop extra refcnt */
+ /* drop extra refcnt by try_charge() (move_account increment one) */
css_put(&parent->css);
- /* uncharge if move fails */
- if (ret) {
- res_counter_uncharge(&parent->res, PAGE_SIZE);
- if (do_swap_account)
- res_counter_uncharge(&parent->memsw, PAGE_SIZE);
+ putback_lru_page(page);
+ if (!ret) {
+ put_page(page);
+ return 0;
}
-
+ /* uncharge if move fails */
+cancel:
+ res_counter_uncharge(&parent->res, PAGE_SIZE);
+ if (do_swap_account)
+ res_counter_uncharge(&parent->memsw, PAGE_SIZE);
+ put_page(page);
return ret;
}

@@ -930,6 +888,8 @@ int mem_cgroup_cache_charge_swapin(struc
}
if (!locked)
unlock_page(page);
+ /* add this page(page_cgroup) to the LRU we want. */
+ mem_cgroup_lru_fixup(page);

return ret;
}
@@ -962,6 +922,8 @@ void mem_cgroup_commit_charge_swapin(str
}

}
+ /* add this page(page_cgroup) to the LRU we want. */
+ mem_cgroup_lru_fixup(page);
}

void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
@@ -986,7 +948,6 @@ __mem_cgroup_uncharge_common(struct page
struct page_cgroup *pc;
struct mem_cgroup *mem = NULL;
struct mem_cgroup_per_zone *mz;
- unsigned long flags;

if (mem_cgroup_subsys.disabled)
return NULL;
@@ -1028,12 +989,10 @@ __mem_cgroup_uncharge_common(struct page
if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
res_counter_uncharge(&mem->memsw, PAGE_SIZE);

+ mem_cgroup_charge_statistics(mem, pc, false);
ClearPageCgroupUsed(pc);

mz = page_cgroup_zoneinfo(pc);
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_remove_list(mz, pc);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
unlock_page_cgroup(pc);

css_put(&mem->css);
@@ -1287,21 +1246,22 @@ int mem_cgroup_resize_memsw_limit(struct
return ret;
}

-
/*
* This routine traverse page_cgroup in given list and drop them all.
* *And* this routine doesn't reclaim page itself, just removes page_cgroup.
*/
static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
- struct mem_cgroup_per_zone *mz,
- enum lru_list lru)
+ int node, int zid, enum lru_list lru)
{
+ struct zone *zone;
+ struct mem_cgroup_per_zone *mz;
struct page_cgroup *pc, *busy;
- unsigned long flags;
- unsigned long loop;
+ unsigned long flags, loop;
struct list_head *list;
int ret = 0;

+ zone = &NODE_DATA(node)->node_zones[zid];
+ mz = mem_cgroup_zoneinfo(mem, node, zid);
list = &mz->lists[lru];

loop = MEM_CGROUP_ZSTAT(mz, lru);
@@ -1310,19 +1270,19 @@ static int mem_cgroup_force_empty_list(s
busy = NULL;
while (loop--) {
ret = 0;
- spin_lock_irqsave(&mz->lru_lock, flags);
+ spin_lock_irqsave(&zone->lru_lock, flags);
if (list_empty(list)) {
- spin_unlock_irqrestore(&mz->lru_lock, flags);
+ spin_unlock_irqrestore(&zone->lru_lock, flags);
break;
}
pc = list_entry(list->prev, struct page_cgroup, lru);
if (busy == pc) {
list_move(&pc->lru, list);
busy = 0;
- spin_unlock_irqrestore(&mz->lru_lock, flags);
+ spin_unlock_irqrestore(&zone->lru_lock, flags);
continue;
}
- spin_unlock_irqrestore(&mz->lru_lock, flags);
+ spin_unlock_irqrestore(&zone->lru_lock, flags);

ret = mem_cgroup_move_parent(pc, mem, GFP_HIGHUSER_MOVABLE);
if (ret == -ENOMEM)
@@ -1335,6 +1295,7 @@ static int mem_cgroup_force_empty_list(s
} else
busy = NULL;
}
+
if (!ret && !list_empty(list))
return -EBUSY;
return ret;
@@ -1370,12 +1331,10 @@ move_account:
ret = 0;
for_each_node_state(node, N_POSSIBLE) {
for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
- struct mem_cgroup_per_zone *mz;
enum lru_list l;
- mz = mem_cgroup_zoneinfo(mem, node, zid);
for_each_lru(l) {
ret = mem_cgroup_force_empty_list(mem,
- mz, l);
+ node, zid, l);
if (ret)
break;
}
@@ -1419,6 +1378,7 @@ try_to_free:
}

}
+ lru_add_drain();
/* try move_account...there may be some *locked* pages. */
if (mem->res.usage)
goto move_account;
@@ -1754,7 +1714,6 @@ static int alloc_mem_cgroup_per_zone_inf

for (zone = 0; zone < MAX_NR_ZONES; zone++) {
mz = &pn->zoneinfo[zone];
- spin_lock_init(&mz->lru_lock);
for_each_lru(l)
INIT_LIST_HEAD(&mz->lists[l]);
}
@@ -1796,8 +1755,15 @@ static struct mem_cgroup *mem_cgroup_all

static void mem_cgroup_free(struct mem_cgroup *mem)
{
+ int node;
+
if (atomic_read(&mem->refcnt) > 0)
return;
+
+
+ for_each_node_state(node, N_POSSIBLE)
+ free_mem_cgroup_per_zone_info(mem, node);
+
if (sizeof(*mem) < PAGE_SIZE)
kfree(mem);
else
@@ -1879,12 +1845,6 @@ static void mem_cgroup_pre_destroy(struc
static void mem_cgroup_destroy(struct cgroup_subsys *ss,
struct cgroup *cont)
{
- int node;
- struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
-
- for_each_node_state(node, N_POSSIBLE)
- free_mem_cgroup_per_zone_info(mem, node);
-
mem_cgroup_free(mem_cgroup_from_cont(cont));
}

Index: mmotm-2.6.28-Nov10/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-Nov10.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-Nov10/include/linux/memcontrol.h
@@ -40,7 +40,12 @@ extern void mem_cgroup_cancel_charge_swa

extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
-extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
+extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
+extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
+extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
+extern void mem_cgroup_del_lru(struct page *page);
+extern void mem_cgroup_move_lists(struct page *page,
+ enum lru_list from, enum lru_list to);
extern void mem_cgroup_uncharge_page(struct page *page);
extern void mem_cgroup_uncharge_cache_page(struct page *page);
extern int mem_cgroup_shrink_usage(struct mm_struct *mm, gfp_t gfp_mask);
@@ -131,7 +136,27 @@ static inline int mem_cgroup_shrink_usag
return 0;
}

-static inline void mem_cgroup_move_lists(struct page *page, bool active)
+static inline void mem_cgroup_add_lru_list(struct page *page, int lru)
+{
+}
+
+static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
+{
+ return ;
+}
+
+static inline void mem_cgroup_rotate_lru_list(struct page *page, int lru)
+{
+ return ;
+}
+
+static inline void mem_cgroup_del_lru(struct page *page)
+{
+ return ;
+}
+
+static inline void
+mem_cgroup_move_lists(struct page *page, enum lru_list from, enum lru_list to)
{
}

Index: mmotm-2.6.28-Nov10/include/linux/mm_inline.h
===================================================================
--- mmotm-2.6.28-Nov10.orig/include/linux/mm_inline.h
+++ mmotm-2.6.28-Nov10/include/linux/mm_inline.h
@@ -28,6 +28,7 @@ add_page_to_lru_list(struct zone *zone,
{
list_add(&page->lru, &zone->lru[l].list);
__inc_zone_state(zone, NR_LRU_BASE + l);
+ mem_cgroup_add_lru_list(page, l);
}

static inline void
@@ -35,6 +36,7 @@ del_page_from_lru_list(struct zone *zone
{
list_del(&page->lru);
__dec_zone_state(zone, NR_LRU_BASE + l);
+ mem_cgroup_del_lru_list(page, l);
}

static inline void
@@ -54,6 +56,7 @@ del_page_from_lru(struct zone *zone, str
l += page_is_file_cache(page);
}
__dec_zone_state(zone, NR_LRU_BASE + l);
+ mem_cgroup_del_lru_list(page, l);
}

/**
Index: mmotm-2.6.28-Nov10/mm/page_cgroup.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/page_cgroup.c
+++ mmotm-2.6.28-Nov10/mm/page_cgroup.c
@@ -17,6 +17,7 @@ __init_page_cgroup(struct page_cgroup *p
pc->flags = 0;
pc->mem_cgroup = NULL;
pc->page = pfn_to_page(pfn);
+ INIT_LIST_HEAD(&pc->lru);
}
static unsigned long total_usage;

Index: mmotm-2.6.28-Nov10/fs/splice.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/fs/splice.c
+++ mmotm-2.6.28-Nov10/fs/splice.c
@@ -21,6 +21,7 @@
#include <linux/file.h>
#include <linux/pagemap.h>
#include <linux/splice.h>
+#include <linux/memcontrol.h>
#include <linux/mm_inline.h>
#include <linux/swap.h>
#include <linux/writeback.h>
Index: mmotm-2.6.28-Nov10/mm/vmscan.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/vmscan.c
+++ mmotm-2.6.28-Nov10/mm/vmscan.c
@@ -565,7 +565,6 @@ redo:
lru = LRU_UNEVICTABLE;
add_page_to_unevictable_list(page);
}
- mem_cgroup_move_lists(page, lru);

/*
* page's status can change while we move it among lru. If an evictable
@@ -600,7 +599,6 @@ void putback_lru_page(struct page *page)

lru = !!TestClearPageActive(page) + page_is_file_cache(page);
lru_cache_add_lru(page, lru);
- mem_cgroup_move_lists(page, lru);
put_page(page);
}
#endif /* CONFIG_UNEVICTABLE_LRU */
@@ -872,6 +870,7 @@ int __isolate_lru_page(struct page *page
return ret;

ret = -EBUSY;
+
if (likely(get_page_unless_zero(page))) {
/*
* Be careful not to clear PageLRU until after we're
@@ -880,6 +879,7 @@ int __isolate_lru_page(struct page *page
*/
ClearPageLRU(page);
ret = 0;
+ mem_cgroup_del_lru(page);
}

return ret;
@@ -1193,7 +1193,6 @@ static unsigned long shrink_inactive_lis
SetPageLRU(page);
lru = page_lru(page);
add_page_to_lru_list(zone, page, lru);
- mem_cgroup_move_lists(page, lru);
if (PageActive(page) && scan_global_lru(sc)) {
int file = !!page_is_file_cache(page);
zone->recent_rotated[file]++;
@@ -1326,7 +1325,7 @@ static void shrink_active_list(unsigned
ClearPageActive(page);

list_move(&page->lru, &zone->lru[lru].list);
- mem_cgroup_move_lists(page, lru);
+ mem_cgroup_add_lru_list(page, lru);
pgmoved++;
if (!pagevec_add(&pvec, page)) {
__mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved);
@@ -2486,6 +2485,7 @@ retry:

__dec_zone_state(zone, NR_UNEVICTABLE);
list_move(&page->lru, &zone->lru[l].list);
+ mem_cgroup_move_lists(page, LRU_UNEVICTABLE, l);
__inc_zone_state(zone, NR_INACTIVE_ANON + l);
__count_vm_event(UNEVICTABLE_PGRESCUED);
} else {
@@ -2494,6 +2494,7 @@ retry:
*/
SetPageUnevictable(page);
list_move(&page->lru, &zone->lru[LRU_UNEVICTABLE].list);
+ mem_cgroup_rotate_lru_list(page, LRU_UNEVICTABLE);
if (page_evictable(page, NULL))
goto retry;
}
Index: mmotm-2.6.28-Nov10/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.28-Nov10.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.28-Nov10/include/linux/page_cgroup.h
@@ -26,10 +26,6 @@ enum {
PCG_LOCK, /* page cgroup is locked */
PCG_CACHE, /* charged as cache */
PCG_USED, /* this object is in use. */
- /* flags for LRU placement */
- PCG_ACTIVE, /* page is active in this cgroup */
- PCG_FILE, /* page is file system backed */
- PCG_UNEVICTABLE, /* page is unevictableable */
};

#define TESTPCGFLAG(uname, lname) \
@@ -50,19 +46,6 @@ TESTPCGFLAG(Cache, CACHE)
TESTPCGFLAG(Used, USED)
CLEARPCGFLAG(Used, USED)

-/* LRU management flags (from global-lru definition) */
-TESTPCGFLAG(File, FILE)
-SETPCGFLAG(File, FILE)
-CLEARPCGFLAG(File, FILE)
-
-TESTPCGFLAG(Active, ACTIVE)
-SETPCGFLAG(Active, ACTIVE)
-CLEARPCGFLAG(Active, ACTIVE)
-
-TESTPCGFLAG(Unevictable, UNEVICTABLE)
-SETPCGFLAG(Unevictable, UNEVICTABLE)
-CLEARPCGFLAG(Unevictable, UNEVICTABLE)
-
static inline int page_cgroup_nid(struct page_cgroup *pc)
{
return page_to_nid(pc->page);
Index: mmotm-2.6.28-Nov10/mm/swap.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/swap.c
+++ mmotm-2.6.28-Nov10/mm/swap.c
@@ -168,7 +168,6 @@ void activate_page(struct page *page)
lru += LRU_ACTIVE;
add_page_to_lru_list(zone, page, lru);
__count_vm_event(PGACTIVATE);
- mem_cgroup_move_lists(page, lru);

zone->recent_rotated[!!file]++;
zone->recent_scanned[!!file]++;

2008-11-12 03:34:14

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/6] memcg updates (12/Nov/2008)

On Wed, 12 Nov 2008 12:26:06 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> Weekly updates on my queue.
>
> Changes from previous (05/Nov)
> - added "free all at rmdir" patch.
> - fixed several bugs reported by Nishimura (Thanks!)
> - many style bugs are fixed.
>
> Brief description:
> [1/6].. free all at rmdir (and add attribute to memcg.)
> [2/6].. handle swap cache
> [3/6].. mem+swap controller kconfig
> [4/6].. swap_cgroup
> [5/6].. mem+swap controller
> [6/6].. synchrinized LRU (unify lru lock.)
>
> I think it's near to a month to test this mem+swap controller internally.
> It's getting better. Making progress in step by step works good.
>
> I'll send [1/6] and [2/6] to Andrew, tomorrow or weekend.(please do final check).
>
> If no acks to [1/6] (I haven't got any ;), I'll postpone it and reschedule as [7/6].
>
Ah, sorry.
All are based on "The mm-of-the-moment snapshot 2008-11-10-14-54".

Thanks,
-Kame

2008-11-12 04:00:39

by Balbir Singh

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/6] memcg updates (12/Nov/2008)

KAMEZAWA Hiroyuki wrote:
> Weekly updates on my queue.
>
> Changes from previous (05/Nov)
> - added "free all at rmdir" patch.
> - fixed several bugs reported by Nishimura (Thanks!)
> - many style bugs are fixed.
>
> Brief description:
> [1/6].. free all at rmdir (and add attribute to memcg.)
> [2/6].. handle swap cache
> [3/6].. mem+swap controller kconfig
> [4/6].. swap_cgroup
> [5/6].. mem+swap controller
> [6/6].. synchrinized LRU (unify lru lock.)
>
> I think it's near to a month to test this mem+swap controller internally.
> It's getting better. Making progress in step by step works good.
>
> I'll send [1/6] and [2/6] to Andrew, tomorrow or weekend.(please do final check).
>
> If no acks to [1/6] (I haven't got any ;), I'll postpone it and reschedule as [7/6].

Sorry, I have not looked at the patches yet, I am busy with the hierarchy
patches and number extraction.

--
Balbir

2008-11-12 04:23:16

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: Re: [RFC][PATCH 4/6] memcg: swap cgroup for remembering account

On Wed, 12 Nov 2008 13:17:01 +0900
Daisuke Nishimura <[email protected]> wrote:

> > +/**
> > + * lookup_swap_cgroup - lookup mem_cgroup tied to swap entry
> > + * @ent: swap entry to be looked up.
> > + *
> > + * Returns pointer to mem_cgroup at success. NULL at failure.
> > + */
> > +struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent)
> > +{
> > + int type = swp_type(ent);
> > + unsigned long flags;
> > + unsigned long offset = swp_offset(ent);
> > + unsigned long idx = offset / SC_PER_PAGE;
> > + unsigned long pos = offset & SC_POS_MASK;
> > + struct swap_cgroup_ctrl *ctrl;
> > + struct page *mappage;
> > + struct swap_cgroup *sc;
> > + struct mem_cgroup *ret;
> > +
> > + if (!do_swap_account)
> > + return NULL;
> > +
> > + ctrl = &swap_cgroup_ctrl[type];
> > +
> > + mappage = ctrl->map[idx];
> > +
> > + spin_lock_irqsave(&ctrl->lock, flags);
> > + sc = kmap_atomic(mappage, KM_USER0);
> > + sc += pos;
> > + ret = sc->val;
> > + kunmap_atomic(mapppage, KM_USER0);
> s/mapppage/mappage
>
> I don't know why I didn't notice this while testing previous version.
>
Ah...kmap_atomic() doesn't check its argument if HIGHMEM=n.
and mapppage disappears by macro.

I'm now preparing x86-32 test enviroment before goint further.

Thanks,
-Kame

>
> Thanks,
> Daisuke Nishimura.
>
> > + spin_unlock_irqrestore(&ctrl->lock, flags);
> > + return ret;
> > +}
> > +
>

2008-11-12 04:54:51

by Daisuke Nishimura

[permalink] [raw]

Subject: Re: [RFC][PATCH 4/6] memcg: swap cgroup for remembering account

> +/**
> + * lookup_swap_cgroup - lookup mem_cgroup tied to swap entry
> + * @ent: swap entry to be looked up.
> + *
> + * Returns pointer to mem_cgroup at success. NULL at failure.
> + */
> +struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent)
> +{
> + int type = swp_type(ent);
> + unsigned long flags;
> + unsigned long offset = swp_offset(ent);
> + unsigned long idx = offset / SC_PER_PAGE;
> + unsigned long pos = offset & SC_POS_MASK;
> + struct swap_cgroup_ctrl *ctrl;
> + struct page *mappage;
> + struct swap_cgroup *sc;
> + struct mem_cgroup *ret;
> +
> + if (!do_swap_account)
> + return NULL;
> +
> + ctrl = &swap_cgroup_ctrl[type];
> +
> + mappage = ctrl->map[idx];
> +
> + spin_lock_irqsave(&ctrl->lock, flags);
> + sc = kmap_atomic(mappage, KM_USER0);
> + sc += pos;
> + ret = sc->val;
> + kunmap_atomic(mapppage, KM_USER0);
s/mapppage/mappage

I don't know why I didn't notice this while testing previous version.

Thanks,
Daisuke Nishimura.

> + spin_unlock_irqrestore(&ctrl->lock, flags);
> + return ret;
> +}
> +

2008-11-12 06:08:45

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: Re: [RFC][PATCH 1/6] memcg: free all at rmdir

On Wed, 12 Nov 2008 12:26:56 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> move_account:
> + /* should free all ? */
> + if (!mem->on_rmdir)
> + goto try_to_free;
> +

This must be
==
if (!mem->on_rmdir)
goto try_to_free;
move_account:
==

Sorry.
-Kame

2008-11-13 00:08:42

by Andrew Morton

[permalink] [raw]

Subject: Re: [RFC][PATCH 1/6] memcg: free all at rmdir

On Wed, 12 Nov 2008 12:26:56 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> +5.1 on_rmdir
> +set behavior of memcg at rmdir (Removing cgroup) default is "drop".
> +
> +5.1.1 drop
> + #echo on_rmdir drop > memory.attribute
> + This is default. All pages on the memcg will be freed.
> + If pages are locked or too busy, they will be moved up to the parent.
> + Useful when you want to drop (large) page caches used in this memcg.
> + But some of in-use page cache can be dropped by this.
> +
> +5.1.2 keep
> + #echo on_rmdir keep > memory.attribute
> + All pages on the memcg will be moved to its parent.
> + Useful when you don't want to drop page caches used in this memcg.
> + You can keep page caches from some library or DB accessed by this
> + memcg on memory.

Would it not be more useful to implement a per-memcg version of
/proc/sys/vm/drop_caches? (One without drop_caches' locking bug,
hopefully).

If we do this then we can make the above "keep" behaviour non-optional,
and the operator gets to choose whether or not to drop the caches
before doing the rmdir.

Plus, we get a new per-memcg drop_caches capability. And it's a nicer
interface, and it doesn't have the obvious races which on_rmdir has,
etc.

hm?

2008-11-13 00:24:19

by Balbir Singh

[permalink] [raw]

Subject: Re: [RFC][PATCH 1/6] memcg: free all at rmdir

Andrew Morton wrote:
> On Wed, 12 Nov 2008 12:26:56 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
>> +5.1 on_rmdir
>> +set behavior of memcg at rmdir (Removing cgroup) default is "drop".
>> +
>> +5.1.1 drop
>> + #echo on_rmdir drop > memory.attribute
>> + This is default. All pages on the memcg will be freed.
>> + If pages are locked or too busy, they will be moved up to the parent.
>> + Useful when you want to drop (large) page caches used in this memcg.
>> + But some of in-use page cache can be dropped by this.
>> +
>> +5.1.2 keep
>> + #echo on_rmdir keep > memory.attribute
>> + All pages on the memcg will be moved to its parent.
>> + Useful when you don't want to drop page caches used in this memcg.
>> + You can keep page caches from some library or DB accessed by this
>> + memcg on memory.
>
> Would it not be more useful to implement a per-memcg version of
> /proc/sys/vm/drop_caches? (One without drop_caches' locking bug,
> hopefully).
>
> If we do this then we can make the above "keep" behaviour non-optional,
> and the operator gets to choose whether or not to drop the caches
> before doing the rmdir.
>
> Plus, we get a new per-memcg drop_caches capability. And it's a nicer
> interface, and it doesn't have the obvious races which on_rmdir has,
> etc.
>

Andrew, I suspect that will not be easy, since we don't track address spaces
that belong to a particular memcg. If page cache ends up being shared across
memcg's, dropping them would impact both mem cgroups.

--
Balbir

2008-11-13 00:47:03

by Andrew Morton

[permalink] [raw]

Subject: Re: [RFC][PATCH 1/6] memcg: free all at rmdir

On Thu, 13 Nov 2008 05:53:49 +0530
Balbir Singh <[email protected]> wrote:

> Andrew Morton wrote:
> > On Wed, 12 Nov 2008 12:26:56 +0900
> > KAMEZAWA Hiroyuki <[email protected]> wrote:
> >
> >> +5.1 on_rmdir
> >> +set behavior of memcg at rmdir (Removing cgroup) default is "drop".
> >> +
> >> +5.1.1 drop
> >> + #echo on_rmdir drop > memory.attribute
> >> + This is default. All pages on the memcg will be freed.
> >> + If pages are locked or too busy, they will be moved up to the parent.
> >> + Useful when you want to drop (large) page caches used in this memcg.
> >> + But some of in-use page cache can be dropped by this.
> >> +
> >> +5.1.2 keep
> >> + #echo on_rmdir keep > memory.attribute
> >> + All pages on the memcg will be moved to its parent.
> >> + Useful when you don't want to drop page caches used in this memcg.
> >> + You can keep page caches from some library or DB accessed by this
> >> + memcg on memory.
> >
> > Would it not be more useful to implement a per-memcg version of
> > /proc/sys/vm/drop_caches? (One without drop_caches' locking bug,
> > hopefully).
> >
> > If we do this then we can make the above "keep" behaviour non-optional,
> > and the operator gets to choose whether or not to drop the caches
> > before doing the rmdir.
> >
> > Plus, we get a new per-memcg drop_caches capability. And it's a nicer
> > interface, and it doesn't have the obvious races which on_rmdir has,
> > etc.
> >
>
> Andrew, I suspect that will not be easy, since we don't track address spaces
> that belong to a particular memcg. If page cache ends up being shared across
> memcg's, dropping them would impact both mem cgroups.
>

walk the LRUs?

2008-11-13 00:49:41

by Balbir Singh

[permalink] [raw]

Subject: Re: [RFC][PATCH 1/6] memcg: free all at rmdir

Andrew Morton wrote:
> On Thu, 13 Nov 2008 05:53:49 +0530
> Balbir Singh <[email protected]> wrote:
>
>> Andrew Morton wrote:
>>> On Wed, 12 Nov 2008 12:26:56 +0900
>>> KAMEZAWA Hiroyuki <[email protected]> wrote:
>>>
>>>> +5.1 on_rmdir
>>>> +set behavior of memcg at rmdir (Removing cgroup) default is "drop".
>>>> +
>>>> +5.1.1 drop
>>>> + #echo on_rmdir drop > memory.attribute
>>>> + This is default. All pages on the memcg will be freed.
>>>> + If pages are locked or too busy, they will be moved up to the parent.
>>>> + Useful when you want to drop (large) page caches used in this memcg.
>>>> + But some of in-use page cache can be dropped by this.
>>>> +
>>>> +5.1.2 keep
>>>> + #echo on_rmdir keep > memory.attribute
>>>> + All pages on the memcg will be moved to its parent.
>>>> + Useful when you don't want to drop page caches used in this memcg.
>>>> + You can keep page caches from some library or DB accessed by this
>>>> + memcg on memory.
>>> Would it not be more useful to implement a per-memcg version of
>>> /proc/sys/vm/drop_caches? (One without drop_caches' locking bug,
>>> hopefully).
>>>
>>> If we do this then we can make the above "keep" behaviour non-optional,
>>> and the operator gets to choose whether or not to drop the caches
>>> before doing the rmdir.
>>>
>>> Plus, we get a new per-memcg drop_caches capability. And it's a nicer
>>> interface, and it doesn't have the obvious races which on_rmdir has,
>>> etc.
>>>
>> Andrew, I suspect that will not be easy, since we don't track address spaces
>> that belong to a particular memcg. If page cache ends up being shared across
>> memcg's, dropping them would impact both mem cgroups.
>>
>
> walk the LRUs?

We do that for the force_empty() interface we have. Although we don't
differentiate between cache and RSS at the moment.

--
Balbir

2008-11-13 01:04:58

by Andrew Morton

[permalink] [raw]

Subject: Re: [RFC][PATCH 1/6] memcg: free all at rmdir

On Thu, 13 Nov 2008 06:18:56 +0530
Balbir Singh <[email protected]> wrote:

> Andrew Morton wrote:
> > On Thu, 13 Nov 2008 05:53:49 +0530
> > Balbir Singh <[email protected]> wrote:
> >
> >> Andrew Morton wrote:
> >>> On Wed, 12 Nov 2008 12:26:56 +0900
> >>> KAMEZAWA Hiroyuki <[email protected]> wrote:
> >>>
> >>>> +5.1 on_rmdir
> >>>> +set behavior of memcg at rmdir (Removing cgroup) default is "drop".
> >>>> +
> >>>> +5.1.1 drop
> >>>> + #echo on_rmdir drop > memory.attribute
> >>>> + This is default. All pages on the memcg will be freed.
> >>>> + If pages are locked or too busy, they will be moved up to the parent.
> >>>> + Useful when you want to drop (large) page caches used in this memcg.
> >>>> + But some of in-use page cache can be dropped by this.
> >>>> +
> >>>> +5.1.2 keep
> >>>> + #echo on_rmdir keep > memory.attribute
> >>>> + All pages on the memcg will be moved to its parent.
> >>>> + Useful when you don't want to drop page caches used in this memcg.
> >>>> + You can keep page caches from some library or DB accessed by this
> >>>> + memcg on memory.
> >>> Would it not be more useful to implement a per-memcg version of
> >>> /proc/sys/vm/drop_caches? (One without drop_caches' locking bug,
> >>> hopefully).
> >>>
> >>> If we do this then we can make the above "keep" behaviour non-optional,
> >>> and the operator gets to choose whether or not to drop the caches
> >>> before doing the rmdir.
> >>>
> >>> Plus, we get a new per-memcg drop_caches capability. And it's a nicer
> >>> interface, and it doesn't have the obvious races which on_rmdir has,
> >>> etc.
> >>>
> >> Andrew, I suspect that will not be easy, since we don't track address spaces
> >> that belong to a particular memcg. If page cache ends up being shared across
> >> memcg's, dropping them would impact both mem cgroups.
> >>
> >
> > walk the LRUs?
>
> We do that for the force_empty() interface we have. Although we don't
> differentiate between cache and RSS at the moment.

so.. what's wrong with using that (possibly with some
generalisation/enhancement)?

btw, mem_cgroup_force_empty_list() uses PageLRU() outside ->lru_lock.
That's racy, although afaict this race will only cause an accounting
error.

Or maybe not. What happens if
__mem_cgroup_uncharge_common()->__mem_cgroup_remove_list() is passed a
page which isn't on an LRU any more? boom?

2008-11-13 01:15:47

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: Re: [RFC][PATCH 1/6] memcg: free all at rmdir

On Wed, 12 Nov 2008 16:07:58 -0800
Andrew Morton <[email protected]> wrote:

> On Wed, 12 Nov 2008 12:26:56 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > +5.1 on_rmdir
> > +set behavior of memcg at rmdir (Removing cgroup) default is "drop".
> > +
> > +5.1.1 drop
> > + #echo on_rmdir drop > memory.attribute
> > + This is default. All pages on the memcg will be freed.
> > + If pages are locked or too busy, they will be moved up to the parent.
> > + Useful when you want to drop (large) page caches used in this memcg.
> > + But some of in-use page cache can be dropped by this.
> > +
> > +5.1.2 keep
> > + #echo on_rmdir keep > memory.attribute
> > + All pages on the memcg will be moved to its parent.
> > + Useful when you don't want to drop page caches used in this memcg.
> > + You can keep page caches from some library or DB accessed by this
> > + memcg on memory.
>
> Would it not be more useful to implement a per-memcg version of
> /proc/sys/vm/drop_caches? (One without drop_caches' locking bug,
> hopefully).
>
> If we do this then we can make the above "keep" behaviour non-optional,
> and the operator gets to choose whether or not to drop the caches
> before doing the rmdir.
>
> Plus, we get a new per-memcg drop_caches capability. And it's a nicer
> interface, and it doesn't have the obvious races which on_rmdir has,
> etc.
>
> hm?
>
In my plan, I'll add

memory.shrink_usage interface to do and allows

#echo 0M > memory.shrink_memory_usage
(you may swap tasks out if there is task..)

to drop pages.

Balbir, how do you think ? I've already removed "force_empty".

Thanks,
-Kame

2008-11-13 01:17:50

by Balbir Singh

[permalink] [raw]

Subject: Re: [RFC][PATCH 1/6] memcg: free all at rmdir

Andrew Morton wrote:
> On Thu, 13 Nov 2008 06:18:56 +0530
> Balbir Singh <[email protected]> wrote:
>
>> Andrew Morton wrote:
>>> On Thu, 13 Nov 2008 05:53:49 +0530
>>> Balbir Singh <[email protected]> wrote:
>>>
>>>> Andrew Morton wrote:
>>>>> On Wed, 12 Nov 2008 12:26:56 +0900
>>>>> KAMEZAWA Hiroyuki <[email protected]> wrote:
>>>>>
>>>>>> +5.1 on_rmdir
>>>>>> +set behavior of memcg at rmdir (Removing cgroup) default is "drop".
>>>>>> +
>>>>>> +5.1.1 drop
>>>>>> + #echo on_rmdir drop > memory.attribute
>>>>>> + This is default. All pages on the memcg will be freed.
>>>>>> + If pages are locked or too busy, they will be moved up to the parent.
>>>>>> + Useful when you want to drop (large) page caches used in this memcg.
>>>>>> + But some of in-use page cache can be dropped by this.
>>>>>> +
>>>>>> +5.1.2 keep
>>>>>> + #echo on_rmdir keep > memory.attribute
>>>>>> + All pages on the memcg will be moved to its parent.
>>>>>> + Useful when you don't want to drop page caches used in this memcg.
>>>>>> + You can keep page caches from some library or DB accessed by this
>>>>>> + memcg on memory.
>>>>> Would it not be more useful to implement a per-memcg version of
>>>>> /proc/sys/vm/drop_caches? (One without drop_caches' locking bug,
>>>>> hopefully).
>>>>>
>>>>> If we do this then we can make the above "keep" behaviour non-optional,
>>>>> and the operator gets to choose whether or not to drop the caches
>>>>> before doing the rmdir.
>>>>>
>>>>> Plus, we get a new per-memcg drop_caches capability. And it's a nicer
>>>>> interface, and it doesn't have the obvious races which on_rmdir has,
>>>>> etc.
>>>>>
>>>> Andrew, I suspect that will not be easy, since we don't track address spaces
>>>> that belong to a particular memcg. If page cache ends up being shared across
>>>> memcg's, dropping them would impact both mem cgroups.
>>>>
>>> walk the LRUs?
>> We do that for the force_empty() interface we have. Although we don't
>> differentiate between cache and RSS at the moment.
>
> so.. what's wrong with using that (possibly with some
> generalisation/enhancement)?
>

I am not against enhancing force_empty to drop caches, in fact it would be
useful. I've been testing swappiness of late in mainline and looking at patches
posted for memcg swappiness and the results have not been very good. I'm trying
to see if my testing is valid. A drop_caches interface will help me
debug/understand the problem better and even provide an alternative when I need
a brute force approach.

> btw, mem_cgroup_force_empty_list() uses PageLRU() outside ->lru_lock.
> That's racy, although afaict this race will only cause an accounting
> error.
>
> Or maybe not. What happens if
> __mem_cgroup_uncharge_common()->__mem_cgroup_remove_list() is passed a
> page which isn't on an LRU any more? boom?
>

IIRC, Kamezawa has been working on redoing force_empty interface. We are
reworking its internals as well.

--
Balbir

2008-11-13 01:20:56

by Balbir Singh

[permalink] [raw]

Subject: Re: [RFC][PATCH 1/6] memcg: free all at rmdir

KAMEZAWA Hiroyuki wrote:
> On Wed, 12 Nov 2008 16:07:58 -0800
> Andrew Morton <[email protected]> wrote:
>
>> On Wed, 12 Nov 2008 12:26:56 +0900
>> KAMEZAWA Hiroyuki <[email protected]> wrote:
>>
>>> +5.1 on_rmdir
>>> +set behavior of memcg at rmdir (Removing cgroup) default is "drop".
>>> +
>>> +5.1.1 drop
>>> + #echo on_rmdir drop > memory.attribute
>>> + This is default. All pages on the memcg will be freed.
>>> + If pages are locked or too busy, they will be moved up to the parent.
>>> + Useful when you want to drop (large) page caches used in this memcg.
>>> + But some of in-use page cache can be dropped by this.
>>> +
>>> +5.1.2 keep
>>> + #echo on_rmdir keep > memory.attribute
>>> + All pages on the memcg will be moved to its parent.
>>> + Useful when you don't want to drop page caches used in this memcg.
>>> + You can keep page caches from some library or DB accessed by this
>>> + memcg on memory.
>> Would it not be more useful to implement a per-memcg version of
>> /proc/sys/vm/drop_caches? (One without drop_caches' locking bug,
>> hopefully).
>>
>> If we do this then we can make the above "keep" behaviour non-optional,
>> and the operator gets to choose whether or not to drop the caches
>> before doing the rmdir.
>>
>> Plus, we get a new per-memcg drop_caches capability. And it's a nicer
>> interface, and it doesn't have the obvious races which on_rmdir has,
>> etc.
>>
>> hm?
>>
> In my plan, I'll add
>
> memory.shrink_usage interface to do and allows
>
> #echo 0M > memory.shrink_memory_usage
> (you may swap tasks out if there is task..)
>
> to drop pages.
>

So, shrink_memory_usage is just for dropping caches? I don't understand the part
about swap tasks out.

> Balbir, how do you think ? I've already removed "force_empty".

Have you? Won't that go against API/ABI compatibility guidelines. I would
recommend cc'ing linux-api as well. Sorry, I missed the patch that removes
force_empty. Me culpa.

--
Balbir

2008-11-13 01:28:43

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: Re: [RFC][PATCH 1/6] memcg: free all at rmdir

On Thu, 13 Nov 2008 06:50:40 +0530
Balbir Singh <[email protected]> wrote:

> KAMEZAWA Hiroyuki wrote:
> > On Wed, 12 Nov 2008 16:07:58 -0800
> > Andrew Morton <[email protected]> wrote:
> >
> >> On Wed, 12 Nov 2008 12:26:56 +0900
> >> KAMEZAWA Hiroyuki <[email protected]> wrote:
> >>
> >>> +5.1 on_rmdir
> >>> +set behavior of memcg at rmdir (Removing cgroup) default is "drop".
> >>> +
> >>> +5.1.1 drop
> >>> + #echo on_rmdir drop > memory.attribute
> >>> + This is default. All pages on the memcg will be freed.
> >>> + If pages are locked or too busy, they will be moved up to the parent.
> >>> + Useful when you want to drop (large) page caches used in this memcg.
> >>> + But some of in-use page cache can be dropped by this.
> >>> +
> >>> +5.1.2 keep
> >>> + #echo on_rmdir keep > memory.attribute
> >>> + All pages on the memcg will be moved to its parent.
> >>> + Useful when you don't want to drop page caches used in this memcg.
> >>> + You can keep page caches from some library or DB accessed by this
> >>> + memcg on memory.
> >> Would it not be more useful to implement a per-memcg version of
> >> /proc/sys/vm/drop_caches? (One without drop_caches' locking bug,
> >> hopefully).
> >>
> >> If we do this then we can make the above "keep" behaviour non-optional,
> >> and the operator gets to choose whether or not to drop the caches
> >> before doing the rmdir.
> >>
> >> Plus, we get a new per-memcg drop_caches capability. And it's a nicer
> >> interface, and it doesn't have the obvious races which on_rmdir has,
> >> etc.
> >>
> >> hm?
> >>
> > In my plan, I'll add
> >
> > memory.shrink_usage interface to do and allows
> >
> > #echo 0M > memory.shrink_memory_usage
> > (you may swap tasks out if there is task..)
> >
> > to drop pages.
> >
>
> So, shrink_memory_usage is just for dropping caches? I don't understand the part
> about swap tasks out.
>
No, just for shrinking usage. It can also drops ANON.

> > Balbir, how do you think ? I've already removed "force_empty".
>
> Have you? Won't that go against API/ABI compatibility guidelines. I would
> recommend cc'ing linux-api as well. Sorry, I missed the patch that removes
> force_empty. Me culpa.
>
account_move did that....

That was only for debug and is a hole for security of resource management.
So, removed.

Thanks,
-Kame

2008-11-13 02:31:29

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: Re: [RFC][PATCH 1/6] memcg: free all at rmdir

On Thu, 13 Nov 2008 06:47:31 +0530
Balbir Singh <[email protected]> wrote:

> > btw, mem_cgroup_force_empty_list() uses PageLRU() outside ->lru_lock.
> > That's racy, although afaict this race will only cause an accounting
> > error.
> >
> > Or maybe not. What happens if
> > __mem_cgroup_uncharge_common()->__mem_cgroup_remove_list() is passed a
> > page which isn't on an LRU any more? boom?
> >
>
> IIRC, Kamezawa has been working on redoing force_empty interface. We are
> reworking its internals as well.
>

PageLRU() is not used in account_move() version (in mmotm queue)
patches/memcg-move-all-acccounts-to-parent-at-rmdir.patch removes that.

We're now testing patch [6/6] which does

1. remove per-memcg-lru-lock
2. use zone->lru_lock instead of that.

Then, maintenance of this memcontrol.c will be much easier.

After patch [6/6]. account_move does

isolate_page(page);
move to other cgroup
putback_lru_page(page);

as other usual routine does.

Thanks,
-Kame

2008-11-13 02:49:59

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: Re: [RFC][PATCH 1/6] memcg: free all at rmdir

On Wed, 12 Nov 2008 16:07:58 -0800
Andrew Morton <[email protected]> wrote:
> If we do this then we can make the above "keep" behaviour non-optional,
> and the operator gets to choose whether or not to drop the caches
> before doing the rmdir.
>
> Plus, we get a new per-memcg drop_caches capability. And it's a nicer
> interface, and it doesn't have the obvious races which on_rmdir has,
> etc.
>
> hm?
>

Balbir, how would you want to do ?

I planned to post shrink_uage patch later (it's easy to be implemented) regardless
of acceptance of this patch.

So, I think we should add shrink_usage now and drop this is a way to go.
I think I can prepare patch soon. But I'd like to push handle-swap-cache patch
before introducing shrink_usage.

Then, posting following 2 patch for this week is my current intention.
[1/2] handle swap cache
[2/2] shrink_usage patch (instead of this patch)

Objection ?

Thanks,
-Kame

2008-11-13 03:15:36

by Balbir Singh

[permalink] [raw]

Subject: Re: [RFC][PATCH 1/6] memcg: free all at rmdir

KAMEZAWA Hiroyuki wrote:
> On Wed, 12 Nov 2008 16:07:58 -0800
> Andrew Morton <[email protected]> wrote:
>> If we do this then we can make the above "keep" behaviour non-optional,
>> and the operator gets to choose whether or not to drop the caches
>> before doing the rmdir.
>>
>> Plus, we get a new per-memcg drop_caches capability. And it's a nicer
>> interface, and it doesn't have the obvious races which on_rmdir has,
>> etc.
>>
>> hm?
>>
>
> Balbir, how would you want to do ?
>
> I planned to post shrink_uage patch later (it's easy to be implemented) regardless
> of acceptance of this patch.
>
> So, I think we should add shrink_usage now and drop this is a way to go.

I am a bit concerned about dropping stuff at will later. Ubuntu 8.10 has memory
controller enabled and we exposed memory.force_empty interface there and now
we've dropped it (bad on our part). I think we should have deprecated it and
dropped it later.

> I think I can prepare patch soon. But I'd like to push handle-swap-cache patch
> before introducing shrink_usage.
>
> Then, posting following 2 patch for this week is my current intention.
> [1/2] handle swap cache
> [2/2] shrink_usage patch (instead of this patch)
>
> Objection ?

No.. just be wary of breaking API, please!

--
Balbir

2008-11-13 03:23:30

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: Re: [RFC][PATCH 1/6] memcg: free all at rmdir

On Thu, 13 Nov 2008 08:44:59 +0530
Balbir Singh <[email protected]> wrote:

> KAMEZAWA Hiroyuki wrote:
> > On Wed, 12 Nov 2008 16:07:58 -0800
> > Andrew Morton <[email protected]> wrote:
> >> If we do this then we can make the above "keep" behaviour non-optional,
> >> and the operator gets to choose whether or not to drop the caches
> >> before doing the rmdir.
> >>
> >> Plus, we get a new per-memcg drop_caches capability. And it's a nicer
> >> interface, and it doesn't have the obvious races which on_rmdir has,
> >> etc.
> >>
> >> hm?
> >>
> >
> > Balbir, how would you want to do ?
> >
> > I planned to post shrink_uage patch later (it's easy to be implemented) regardless
> > of acceptance of this patch.
> >
> > So, I think we should add shrink_usage now and drop this is a way to go.
>
> I am a bit concerned about dropping stuff at will later. Ubuntu 8.10 has memory
> controller enabled and we exposed memory.force_empty interface there and now
> we've dropped it (bad on our part). I think we should have deprecated it and
> dropped it later.
>
I *was* documented as "for debug only".

Hmm, adding force_empty again to do "shrink usage to 0" is another choice.

ok?

Thanks,
-Kame

2008-11-13 03:27:24

by Balbir Singh

[permalink] [raw]

Subject: Re: [RFC][PATCH 1/6] memcg: free all at rmdir

KAMEZAWA Hiroyuki wrote:
> On Thu, 13 Nov 2008 08:44:59 +0530
> Balbir Singh <[email protected]> wrote:
>
>> KAMEZAWA Hiroyuki wrote:
>>> On Wed, 12 Nov 2008 16:07:58 -0800
>>> Andrew Morton <[email protected]> wrote:
>>>> If we do this then we can make the above "keep" behaviour non-optional,
>>>> and the operator gets to choose whether or not to drop the caches
>>>> before doing the rmdir.
>>>>
>>>> Plus, we get a new per-memcg drop_caches capability. And it's a nicer
>>>> interface, and it doesn't have the obvious races which on_rmdir has,
>>>> etc.
>>>>
>>>> hm?
>>>>
>>> Balbir, how would you want to do ?
>>>
>>> I planned to post shrink_uage patch later (it's easy to be implemented) regardless
>>> of acceptance of this patch.
>>>
>>> So, I think we should add shrink_usage now and drop this is a way to go.
>> I am a bit concerned about dropping stuff at will later. Ubuntu 8.10 has memory
>> controller enabled and we exposed memory.force_empty interface there and now
>> we've dropped it (bad on our part). I think we should have deprecated it and
>> dropped it later.
>>
> I *was* documented as "for debug only".
>

I know, but I suspect users won't. I am not blaming you in person, just trying
the find way to do such things. I think I've seen sysfs add a deprecated option
and put all deprecated files there, they are only created if someone cares about
them.

> Hmm, adding force_empty again to do "shrink usage to 0" is another choice.

Yes, sounds reasonable.

--
Balbir

2008-11-13 03:39:11

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: Re: [RFC][PATCH 1/6] memcg: free all at rmdir

On Thu, 13 Nov 2008 08:57:07 +0530
Balbir Singh <[email protected]> wrote:

> KAMEZAWA Hiroyuki wrote:
> > On Thu, 13 Nov 2008 08:44:59 +0530
> > Balbir Singh <[email protected]> wrote:
> >
> >> KAMEZAWA Hiroyuki wrote:
> >>> On Wed, 12 Nov 2008 16:07:58 -0800
> >>> Andrew Morton <[email protected]> wrote:
> >>>> If we do this then we can make the above "keep" behaviour non-optional,
> >>>> and the operator gets to choose whether or not to drop the caches
> >>>> before doing the rmdir.
> >>>>
> >>>> Plus, we get a new per-memcg drop_caches capability. And it's a nicer
> >>>> interface, and it doesn't have the obvious races which on_rmdir has,
> >>>> etc.
> >>>>
> >>>> hm?
> >>>>
> >>> Balbir, how would you want to do ?
> >>>
> >>> I planned to post shrink_uage patch later (it's easy to be implemented) regardless
> >>> of acceptance of this patch.
> >>>
> >>> So, I think we should add shrink_usage now and drop this is a way to go.
> >> I am a bit concerned about dropping stuff at will later. Ubuntu 8.10 has memory
> >> controller enabled and we exposed memory.force_empty interface there and now
> >> we've dropped it (bad on our part). I think we should have deprecated it and
> >> dropped it later.
> >>
> > I *was* documented as "for debug only".
> >
>
> I know, but I suspect users won't. I am not blaming you in person, just trying
> the find way to do such things. I think I've seen sysfs add a deprecated option
> and put all deprecated files there, they are only created if someone cares about
> them.
>
Ah, but it should be done in cgroup layer.

> > Hmm, adding force_empty again to do "shrink usage to 0" is another choice.
>
> Yes, sounds reasonable.
>
ok, will post a patch today.

Thanks,
-Kame