2008-08-19 08:24:44

by Kamezawa Hiroyuki

Subject: [PATCH -mm][preview] memcg: a patch series for next [0/9]

Hi,

This post shows what I'm working on at the moment.

This patch set is for the memory resource controller.
It has four purposes:
- improve performance of memcg.
- remove lock_page_cgroup()
- convert page_cgroup->flags manipulation to atomic bitops.
- support mem+swap controller.

This is still under test and the series is not well organised,
and the base tree is old (2.6.27-rc1-mm1). I'll rebase the set onto a newer mmotm tree.

This set may have some problems and draw some objections, but I think the direction is not bad.

Patch descriptions (the patch ordering is not ideal; I'll fix it in the next post):

[1/9] ... private_counter ... replace res_counter with my own counter.
This is for supporting the mem+swap controller.
(And I think memcg has somewhat different characteristics from the other
users of res_counter....)

[2/9] ... change-order-uncharge ...
This patch makes it easier to handle swap cache.

[3/9] ... atomic_flags
This patch converts operations on page_cgroup->flags to atomic ops.

[4/9] ... delayed freeing
delay freeing of page_cgroup at uncharge.

[5/9] ... RCU freeing of page_cgroup
free page_cgroup by RCU.

[6/9] ... lockless page_cgroup
remove lock_page_cgroup() and use RCU semantics.

[7/9] ... add prefetch
prefetch the per-zone LRU lock's cacheline before taking it.

[8/9] ... mem+swap controller base.
introduce mem+swap controller. A bit big patch....but have tons of TODO.
and have troubles. (it seems it's difficult to cause OOM killer.)

[9/9] ... mem+swap controller control files.
add mem+swap controller's control files.

I'd like to push patches [2,3,4,5,6,7] first.

Thanks,
-Kame


2008-08-19 08:31:29

by Kamezawa Hiroyuki

Subject: [PATCH -mm][preview] memcg: a patch series for next [1/9]

Replace res_counter with a new mem_counter to do more complex counting.
This patch is preparation for the mem+swap controller.
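
For readability, here is the charge path condensed from the hunks below
(illustrative sketch only; the real code is in the diff, and the counter is
page-based rather than byte-based):

    /* charge one page against the new counter, reclaiming on failure */
    while (mem_counter_charge(mem, 1)) {        /* returns -EBUSY when over the limit */
            if (!(gfp_mask & __GFP_WAIT))
                    goto out;

            if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
                    continue;

            /* reclaim may have uncharged pages even if it reports no progress */
            if (__mem_counter_check_under_limit(mem))
                    continue;

            if (!nr_retries--) {
                    mem_cgroup_out_of_memory(mem, gfp_mask);
                    goto out;
            }
    }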

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

mm/memcontrol.c | 160 ++++++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 139 insertions(+), 21 deletions(-)

Index: linux-2.6.27-rc1-mm1/mm/memcontrol.c
===================================================================
--- linux-2.6.27-rc1-mm1.orig/mm/memcontrol.c
+++ linux-2.6.27-rc1-mm1/mm/memcontrol.c
@@ -116,12 +116,20 @@ struct mem_cgroup_lru_info {
* no reclaim occurs from a cgroup at it's low water mark, this is
* a feature that will be implemented much later in the future.
*/
+struct mem_counter {
+ unsigned long pages_limit;
+ unsigned long pages;
+ unsigned long failcnt;
+ unsigned long max_usage;
+ spinlock_t lock;
+};
+
struct mem_cgroup {
struct cgroup_subsys_state css;
/*
* the counter to account for memory usage
*/
- struct res_counter res;
+ struct mem_counter res;
/*
* Per cgroup active and inactive list, similar to the
* per zone LRU lists.
@@ -181,6 +189,16 @@ enum charge_type {
MEM_CGROUP_CHARGE_TYPE_FORCE, /* used by force_empty */
};

+/* Private File ID for memcg */
+enum {
+ MEMCG_FILE_TYPE_PAGE_LIMIT,
+ MEMCG_FILE_TYPE_PAGE_USAGE,
+ MEMCG_FILE_TYPE_FAILCNT,
+ MEMCG_FILE_TYPE_MAX_USAGE,
+};
+
+
+
/*
* Always modified under lru lock. Then, not necessary to preempt_disable()
*/
@@ -279,6 +297,74 @@ static void unlock_page_cgroup(struct pa
bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
}

+/*
+ * counter for memory resource accounting.
+ *
+ */
+static void mem_counter_init(struct mem_cgroup *memcg)
+{
+ spin_lock_init(&memcg->res.lock);
+ memcg->res.pages = 0;
+ memcg->res.pages_limit = ~0UL;
+ memcg->res.failcnt = 0;
+}
+
+static int mem_counter_charge(struct mem_cgroup *memcg, long num)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&memcg->res.lock, flags);
+ if (memcg->res.pages + num > memcg->res.pages_limit) {
+ memcg->res.failcnt++;
+ spin_unlock_irqrestore(&memcg->res.lock, flags);
+ return -EBUSY;
+ }
+ memcg->res.pages += num;
+ if (memcg->res.pages > memcg->res.max_usage)
+ memcg->res.max_usage = memcg->res.pages;
+ spin_unlock_irqrestore(&memcg->res.lock, flags);
+ return 0;
+}
+
+static inline void mem_counter_uncharge(struct mem_cgroup *memcg, long num)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&memcg->res.lock, flags);
+ memcg->res.pages -= num;
+ BUG_ON(memcg->res.pages < 0);
+ spin_unlock_irqrestore(&memcg->res.lock, flags);
+}
+
+static int mem_counter_set_pages_limit(struct mem_cgroup *memcg,
+ unsigned long lim)
+{
+ unsigned long flags;
+ int ret = -EBUSY;
+
+ spin_lock_irqsave(&memcg->res.lock, flags);
+ if (memcg->res.pages < lim) {
+ memcg->res.pages_limit = lim;
+ ret = 0;
+ }
+ spin_unlock_irqrestore(&memcg->res.lock, flags);
+
+ return ret;
+}
+
+static int __mem_counter_check_under_limit(struct mem_cgroup *memcg)
+{
+ unsigned long flags;
+ int ret = 0;
+
+ spin_lock_irqsave(&memcg->res.lock, flags);
+ if (memcg->res.pages < memcg->res.pages_limit)
+ ret = 1;
+ spin_unlock_irqrestore(&memcg->res.lock, flags);
+
+ return ret;
+}
+
static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
struct page_cgroup *pc)
{
@@ -402,7 +488,7 @@ int mem_cgroup_calc_mapped_ratio(struct
* usage is recorded in bytes. But, here, we assume the number of
* physical pages can be represented by "long" on any arch.
*/
- total = (long) (mem->res.usage >> PAGE_SHIFT) + 1L;
+ total = (long) (mem->res.pages) + 1L;
rss = (long)mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
return (int)((rss * 100L) / total);
}
@@ -544,7 +630,7 @@ static int mem_cgroup_charge_common(stru
css_get(&memcg->css);
}

- while (res_counter_charge(&mem->res, PAGE_SIZE)) {
+ while (mem_counter_charge(mem, 1)) {
if (!(gfp_mask & __GFP_WAIT))
goto out;

@@ -558,7 +644,7 @@ static int mem_cgroup_charge_common(stru
* Check the limit again to see if the reclaim reduced the
* current usage of the cgroup before giving up
*/
- if (res_counter_check_under_limit(&mem->res))
+ if (__mem_counter_check_under_limit(mem))
continue;

if (!nr_retries--) {
@@ -585,7 +671,7 @@ static int mem_cgroup_charge_common(stru
lock_page_cgroup(page);
if (unlikely(page_get_page_cgroup(page))) {
unlock_page_cgroup(page);
- res_counter_uncharge(&mem->res, PAGE_SIZE);
+ mem_counter_uncharge(mem, 1);
css_put(&mem->css);
kmem_cache_free(page_cgroup_cache, pc);
goto done;
@@ -701,7 +787,7 @@ __mem_cgroup_uncharge_common(struct page
unlock_page_cgroup(page);

mem = pc->mem_cgroup;
- res_counter_uncharge(&mem->res, PAGE_SIZE);
+ mem_counter_uncharge(mem, 1);
css_put(&mem->css);

kmem_cache_free(page_cgroup_cache, pc);
@@ -807,8 +893,9 @@ int mem_cgroup_resize_limit(struct mem_c
int retry_count = MEM_CGROUP_RECLAIM_RETRIES;
int progress;
int ret = 0;
+ unsigned long pages = (unsigned long)(val >> PAGE_SHIFT);

- while (res_counter_set_limit(&memcg->res, val)) {
+ while (mem_counter_set_pages_limit(memcg, pages)) {
if (signal_pending(current)) {
ret = -EINTR;
break;
@@ -882,7 +969,7 @@ static int mem_cgroup_force_empty(struct
* active_list <-> inactive_list while we don't take a lock.
* So, we have to do loop here until all lists are empty.
*/
- while (mem->res.usage > 0) {
+ while (mem->res.pages > 0) {
if (atomic_read(&mem->css.cgroup->count) > 0)
goto out;
for_each_node_state(node, N_POSSIBLE)
@@ -902,13 +989,44 @@ out:

static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
{
- return res_counter_read_u64(&mem_cgroup_from_cont(cont)->res,
- cft->private);
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+ unsigned long long ret;
+
+ switch (cft->private) {
+ case MEMCG_FILE_TYPE_PAGE_USAGE:
+ ret = memcg->res.pages << PAGE_SHIFT;
+ break;
+ case MEMCG_FILE_TYPE_MAX_USAGE:
+ ret = memcg->res.max_usage << PAGE_SHIFT;
+ break;
+ case MEMCG_FILE_TYPE_PAGE_LIMIT:
+ ret = memcg->res.pages_limit << PAGE_SHIFT;
+ break;
+ case MEMCG_FILE_TYPE_FAILCNT:
+ ret = memcg->res.failcnt;
+ break;
+ default:
+ BUG();
+ }
+ return ret;
}
+
/*
* The user of this function is...
* RES_LIMIT.
*/
+
+static int call_memparse(const char *buf, unsigned long long *val)
+{
+ char *end;
+
+ *val = memparse((char *)buf, &end);
+ if (*end != '\0')
+ return -EINVAL;
+ *val = PAGE_ALIGN(*val);
+ return 0;
+}
+
static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
const char *buffer)
{
@@ -917,9 +1035,9 @@ static int mem_cgroup_write(struct cgrou
int ret;

switch (cft->private) {
- case RES_LIMIT:
+ case MEMCG_FILE_TYPE_PAGE_LIMIT:
/* This function does all necessary parse...reuse it */
- ret = res_counter_memparse_write_strategy(buffer, &val);
+ ret = call_memparse(buffer, &val);
if (!ret)
ret = mem_cgroup_resize_limit(memcg, val);
break;
@@ -936,11 +1054,11 @@ static int mem_cgroup_reset(struct cgrou

mem = mem_cgroup_from_cont(cont);
switch (event) {
- case RES_MAX_USAGE:
- res_counter_reset_max(&mem->res);
+ case MEMCG_FILE_TYPE_MAX_USAGE:
+ mem->res.max_usage = 0;
break;
- case RES_FAILCNT:
- res_counter_reset_failcnt(&mem->res);
+ case MEMCG_FILE_TYPE_FAILCNT:
+ mem->res.failcnt = 0;
break;
}
return 0;
@@ -1005,24 +1123,24 @@ static int mem_control_stat_show(struct
static struct cftype mem_cgroup_files[] = {
{
.name = "usage_in_bytes",
- .private = RES_USAGE,
+ .private = MEMCG_FILE_TYPE_PAGE_USAGE,
.read_u64 = mem_cgroup_read,
},
{
.name = "max_usage_in_bytes",
- .private = RES_MAX_USAGE,
+ .private = MEMCG_FILE_TYPE_MAX_USAGE,
.trigger = mem_cgroup_reset,
.read_u64 = mem_cgroup_read,
},
{
.name = "limit_in_bytes",
- .private = RES_LIMIT,
+ .private = MEMCG_FILE_TYPE_PAGE_LIMIT,
.write_string = mem_cgroup_write,
.read_u64 = mem_cgroup_read,
},
{
.name = "failcnt",
- .private = RES_FAILCNT,
+ .private = MEMCG_FILE_TYPE_FAILCNT,
.trigger = mem_cgroup_reset,
.read_u64 = mem_cgroup_read,
},
@@ -1111,7 +1229,7 @@ mem_cgroup_create(struct cgroup_subsys *
return ERR_PTR(-ENOMEM);
}

- res_counter_init(&mem->res);
+ mem_counter_init(mem);

for_each_node_state(node, N_POSSIBLE)
if (alloc_mem_cgroup_per_zone_info(mem, node))

2008-08-19 08:32:42

by Kamezawa Hiroyuki

Subject: [PATCH -mm][preview] memcg: a patch series for next [2/9]

This patch changes the placement of mem_cgroup_uncharge_cache_page().

After this patch, mem_cgroup_uncharge_cache_page() is called only after
page->mapping is cleared. This will make uncharge() handling easier in the future.
(Error-check code is also added.)

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

mm/filemap.c | 2 +-
mm/memcontrol.c | 1 +
mm/migrate.c | 13 ++++++++++---
3 files changed, 12 insertions(+), 4 deletions(-)

Index: linux-2.6.27-rc1-mm1/mm/filemap.c
===================================================================
--- linux-2.6.27-rc1-mm1.orig/mm/filemap.c
+++ linux-2.6.27-rc1-mm1/mm/filemap.c
@@ -116,12 +116,12 @@ void __remove_from_page_cache(struct pag
{
struct address_space *mapping = page->mapping;

- mem_cgroup_uncharge_cache_page(page);
radix_tree_delete(&mapping->page_tree, page->index);
page->mapping = NULL;
mapping->nrpages--;
__dec_zone_page_state(page, NR_FILE_PAGES);
BUG_ON(page_mapped(page));
+ mem_cgroup_uncharge_cache_page(page);

/*
* Some filesystems seem to re-dirty the page even after
Index: linux-2.6.27-rc1-mm1/mm/migrate.c
===================================================================
--- linux-2.6.27-rc1-mm1.orig/mm/migrate.c
+++ linux-2.6.27-rc1-mm1/mm/migrate.c
@@ -330,8 +330,6 @@ static int migrate_page_move_mapping(str
__inc_zone_page_state(newpage, NR_FILE_PAGES);

spin_unlock_irq(&mapping->tree_lock);
- if (!PageSwapCache(newpage))
- mem_cgroup_uncharge_cache_page(page);

return 0;
}
@@ -378,7 +376,16 @@ static void migrate_page_copy(struct pag
#endif
ClearPagePrivate(page);
set_page_private(page, 0);
- page->mapping = NULL;
+
+ /* PageAnon() checks page->mapping's bit */
+ if (PageAnon(page)) {
+ /* This page is uncharged in try_to_unmap() */
+ page->mapping = NULL;
+ } else {
+ /* This page was removed from radix-tree.*/
+ page->mapping = NULL;
+ mem_cgroup_uncharge_cache_page(page);
+ }

/*
* If any waiters have accumulated on the new page then
Index: linux-2.6.27-rc1-mm1/mm/memcontrol.c
===================================================================
--- linux-2.6.27-rc1-mm1.orig/mm/memcontrol.c
+++ linux-2.6.27-rc1-mm1/mm/memcontrol.c
@@ -804,6 +804,7 @@ void mem_cgroup_uncharge_page(struct pag
void mem_cgroup_uncharge_cache_page(struct page *page)
{
VM_BUG_ON(page_mapped(page));
+ VM_BUG_ON(page->mapping);
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
}

2008-08-19 08:33:28

by Kamezawa Hiroyuki

Subject: [PATCH -mm][preview] memcg: a patch series for next [3/9]


The coding style of this patch may not be good (I should use an enum, etc.).
This will be rewritten.

This patch adds functions that modify page_cgroup->flags using
set_bit/clear_bit/test_bit.

set/clear/test_bit is the usual way to manipulate flags and will reduce
ugly if statements. The "atomic" set_bit may add some overhead but allows
looser control of these flags (flag modification without locks!).
Of course, we don't have to use the atomic ops where we are convinced
there is no race.

This is a base patch for adding new flags.
(The flag names are slightly shortened.... the old ones were too long for 80 columns.)


Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>



---
mm/memcontrol.c | 82 +++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 55 insertions(+), 27 deletions(-)

Index: linux-2.6.27-rc1-mm1/mm/memcontrol.c
===================================================================
--- linux-2.6.27-rc1-mm1.orig/mm/memcontrol.c
+++ linux-2.6.27-rc1-mm1/mm/memcontrol.c
@@ -166,12 +166,35 @@ struct page_cgroup {
struct list_head lru; /* per cgroup LRU list */
struct page *page;
struct mem_cgroup *mem_cgroup;
- int flags;
+ unsigned long flags;
};
-#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
-#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
-#define PAGE_CGROUP_FLAG_FILE (0x4) /* page is file system backed */
-#define PAGE_CGROUP_FLAG_UNEVICTABLE (0x8) /* page is unevictableable */
+
+/* These 2 flags are unchanged during being used. */
+#define PAGE_CG_FLAG_CACHE (0) /* charged as cache */
+#define PAGE_CG_FLAG_FILE (1) /* page is file system backed */
+#define PAGE_CG_FLAG_ACTIVE (2) /* page is active in this cgroup */
+#define PAGE_CG_FLAG_UNEVICTABLE (3) /* page is unevictableable */
+
+static inline void page_cgroup_set_bit(struct page_cgroup *pc, int flag)
+{
+ set_bit(flag, &pc->flags);
+}
+
+static inline void __page_cgroup_set_bit(struct page_cgroup *pc, int flag)
+{
+ /* non-atomic variant: only for a page_cgroup not yet visible to others */
+ __set_bit(flag, &pc->flags);
+}
+
+static inline void page_cgroup_clear_bit(struct page_cgroup *pc, int flag)
+{
+ clear_bit(flag, &pc->flags);
+}
+
+static inline int page_cgroup_test_bit(struct page_cgroup *pc, int flag)
+{
+ return test_bit(flag, &pc->flags);
+}

static int page_cgroup_nid(struct page_cgroup *pc)
{
@@ -201,6 +224,9 @@ enum {

/*
* Always modified under lru lock. Then, not necessary to preempt_disable()
+ * "flags" passed to this function is a copy of pc->flags, but the flags
+ * checked in this function are permanent ones, i.e. never cleared once
+ * they are set. So this is safe.
*/
static void mem_cgroup_charge_statistics(struct mem_cgroup *mem, int flags,
bool charge)
@@ -209,7 +235,7 @@ static void mem_cgroup_charge_statistics
struct mem_cgroup_stat *stat = &mem->stat;

VM_BUG_ON(!irqs_disabled());
- if (flags & PAGE_CGROUP_FLAG_CACHE)
+ if (flags & (1 << PAGE_CG_FLAG_CACHE))
__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_CACHE, val);
else
__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_RSS, val);
@@ -370,12 +396,12 @@ static void __mem_cgroup_remove_list(str
{
int lru = LRU_BASE;

- if (pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE)
+ if (page_cgroup_test_bit(pc, PAGE_CG_FLAG_UNEVICTABLE))
lru = LRU_UNEVICTABLE;
else {
- if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+ if (page_cgroup_test_bit(pc, PAGE_CG_FLAG_ACTIVE))
lru += LRU_ACTIVE;
- if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+ if (page_cgroup_test_bit(pc, PAGE_CG_FLAG_FILE))
lru += LRU_FILE;
}

@@ -390,12 +416,12 @@ static void __mem_cgroup_add_list(struct
{
int lru = LRU_BASE;

- if (pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE)
+ if (page_cgroup_test_bit(pc, PAGE_CG_FLAG_UNEVICTABLE))
lru = LRU_UNEVICTABLE;
else {
- if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+ if (page_cgroup_test_bit(pc, PAGE_CG_FLAG_ACTIVE))
lru += LRU_ACTIVE;
- if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+ if (page_cgroup_test_bit(pc, PAGE_CG_FLAG_FILE))
lru += LRU_FILE;
}

@@ -408,9 +434,9 @@ static void __mem_cgroup_add_list(struct
static void __mem_cgroup_move_lists(struct page_cgroup *pc, enum lru_list lru)
{
struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
- int active = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
- int file = pc->flags & PAGE_CGROUP_FLAG_FILE;
- int unevictable = pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE;
+ int active = page_cgroup_test_bit(pc, PAGE_CG_FLAG_ACTIVE);
+ int file = page_cgroup_test_bit(pc, PAGE_CG_FLAG_FILE);
+ int unevictable = page_cgroup_test_bit(pc, PAGE_CG_FLAG_UNEVICTABLE);
enum lru_list from = unevictable ? LRU_UNEVICTABLE :
(LRU_FILE * !!file + !!active);

@@ -420,14 +446,15 @@ static void __mem_cgroup_move_lists(stru
MEM_CGROUP_ZSTAT(mz, from) -= 1;

if (is_unevictable_lru(lru)) {
- pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
- pc->flags |= PAGE_CGROUP_FLAG_UNEVICTABLE;
+ page_cgroup_clear_bit(pc, PAGE_CG_FLAG_ACTIVE);
+ page_cgroup_set_bit(pc, PAGE_CG_FLAG_UNEVICTABLE);
} else {
if (is_active_lru(lru))
- pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
- else
- pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
- pc->flags &= ~PAGE_CGROUP_FLAG_UNEVICTABLE;
+ page_cgroup_set_bit(pc, PAGE_CG_FLAG_ACTIVE);
+ else if (active)
+ page_cgroup_clear_bit(pc, PAGE_CG_FLAG_ACTIVE);
+ if (unevictable)
+ page_cgroup_clear_bit(pc, PAGE_CG_FLAG_UNEVICTABLE);
}

MEM_CGROUP_ZSTAT(mz, lru) += 1;
@@ -655,18 +682,19 @@ static int mem_cgroup_charge_common(stru

pc->mem_cgroup = mem;
pc->page = page;
+ pc->flags = 0;
/*
* If a page is accounted as a page cache, insert to inactive list.
* If anon, insert to active list.
*/
if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE) {
- pc->flags = PAGE_CGROUP_FLAG_CACHE;
+ __page_cgroup_set_bit(pc, PAGE_CG_FLAG_CACHE);
if (page_is_file_cache(page))
- pc->flags |= PAGE_CGROUP_FLAG_FILE;
+ __page_cgroup_set_bit(pc, PAGE_CG_FLAG_FILE);
else
- pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
+ __page_cgroup_set_bit(pc, PAGE_CG_FLAG_ACTIVE);
} else
- pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
+ __page_cgroup_set_bit(pc, PAGE_CG_FLAG_ACTIVE);

lock_page_cgroup(page);
if (unlikely(page_get_page_cgroup(page))) {
@@ -774,7 +802,7 @@ __mem_cgroup_uncharge_common(struct page
VM_BUG_ON(pc->page != page);

if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
- && ((pc->flags & PAGE_CGROUP_FLAG_CACHE)
+ && (page_cgroup_test_bit(pc, PAGE_CG_FLAG_CACHE)
|| page_mapped(page)))
goto unlock;

@@ -826,7 +854,7 @@ int mem_cgroup_prepare_migration(struct
if (pc) {
mem = pc->mem_cgroup;
css_get(&mem->css);
- if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
+ if (page_cgroup_test_bit(pc, PAGE_CG_FLAG_CACHE))
ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
}
unlock_page_cgroup(page);

2008-08-19 08:35:12

by Kamezawa Hiroyuki

Subject: [PATCH -mm][preview] memcg: a patch series for next [4/9]

Make freeing of page_cgroup at mem_cgroup_uncharge() lazy.

In mem_cgroup_uncharge_common(), we don't free the page_cgroup immediately;
we just link it onto a per-cpu free queue
and remove it later, once the queue crosses a threshold.

This patch is a base for the "free page_cgroup by RCU" patch.
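
The shape of the lazy-free path, condensed from the hunks below into one place
(sketch only; MEMCG_LRU_THRESH, memcg_lazy_lru and __mem_cgroup_drop_lru()
are the names used by this patch):

    /* uncharge side: queue the obsolete page_cgroup instead of freeing it */
    static void mem_cgroup_drop_lru(struct page_cgroup *pc)
    {
            struct mem_cgroup_lazy_lru *mll = &get_cpu_var(memcg_lazy_lru);
            int count;

            pc->next = mll->next;                   /* push onto the per-cpu list */
            mll->next = pc;
            count = ++mll->count;
            put_cpu_var(memcg_lazy_lru);

            if (count >= MEMCG_LRU_THRESH)          /* drain in batches of 16 */
                    __mem_cgroup_drop_lru();        /* takes each zone's lru_lock only
                                                       once per run of same-zone
                                                       page_cgroups in the batch */
    }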

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>


---
mm/memcontrol.c | 120 ++++++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 103 insertions(+), 17 deletions(-)

Index: linux-2.6.27-rc1-mm1/mm/memcontrol.c
===================================================================
--- linux-2.6.27-rc1-mm1.orig/mm/memcontrol.c
+++ linux-2.6.27-rc1-mm1/mm/memcontrol.c
@@ -167,6 +167,7 @@ struct page_cgroup {
struct page *page;
struct mem_cgroup *mem_cgroup;
unsigned long flags;
+ struct page_cgroup *next; /* used for Lazy LRU */
};

/* These 2 flags are unchanged during being used. */
@@ -174,6 +175,21 @@ struct page_cgroup {
#define PAGE_CG_FLAG_FILE (1) /* page is file system backed */
#define PAGE_CG_FLAG_ACTIVE (2) /* page is active in this cgroup */
#define PAGE_CG_FLAG_UNEVICTABLE (3) /* page is unevictableable */
+#define PAGE_CG_FLAG_OBSOLETE (4) /* page_cgroup is obsolete (queued for freeing) */
+
+#define MEMCG_LRU_THRESH (16)
+
+/*
+ * per-cpu slot for freeing page_cgroup in lazy way.
+ */
+
+struct mem_cgroup_lazy_lru {
+ int count;
+ struct page_cgroup *next;
+};
+
+DEFINE_PER_CPU(struct mem_cgroup_lazy_lru, memcg_lazy_lru);
+

static inline void page_cgroup_set_bit(struct page_cgroup *pc, int flag)
{
@@ -495,10 +511,12 @@ void mem_cgroup_move_lists(struct page *

pc = page_get_page_cgroup(page);
if (pc) {
- mz = page_cgroup_zoneinfo(pc);
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_move_lists(pc, lru);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
+ if (!page_cgroup_test_bit(pc, PAGE_CG_FLAG_OBSOLETE)) {
+ mz = page_cgroup_zoneinfo(pc);
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ __mem_cgroup_move_lists(pc, lru);
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+ }
}
unlock_page_cgroup(page);
}
@@ -592,6 +610,8 @@ unsigned long mem_cgroup_isolate_pages(u
if (unlikely(!PageLRU(page)))
continue;

+ if (page_cgroup_test_bit(pc, PAGE_CG_FLAG_OBSOLETE))
+ continue;
/*
* TODO: play better with lumpy reclaim, grabbing anything.
*/
@@ -618,6 +638,75 @@ unsigned long mem_cgroup_isolate_pages(u
return nr_taken;
}

+void __mem_cgroup_drop_lru(void)
+{
+ struct mem_cgroup *memcg;
+ struct page_cgroup *pc, *next;
+ struct mem_cgroup_per_zone *mz, *page_mz;
+ struct mem_cgroup_lazy_lru *mll;
+ unsigned long flags;
+
+ mll = &get_cpu_var(memcg_lazy_lru);
+ next = mll->next;
+ mll->next = NULL;
+ mll->count = 0;
+ put_cpu_var(memcg_lazy_lru);
+
+ mz = NULL;
+
+ local_irq_save(flags);
+ while (next) {
+ pc = next;
+ next = pc->next;
+ prefetch(next);
+ page_mz = page_cgroup_zoneinfo(pc);
+ memcg = pc->mem_cgroup;
+ if (page_mz != mz) {
+ if (mz)
+ spin_unlock(&mz->lru_lock);
+ mz = page_mz;
+ spin_lock(&mz->lru_lock);
+ }
+ __mem_cgroup_remove_list(mz, pc);
+ css_put(&memcg->css);
+ kmem_cache_free(page_cgroup_cache, pc);
+ }
+ if (mz)
+ spin_unlock(&mz->lru_lock);
+ local_irq_restore(flags);
+
+ return;
+}
+
+static void mem_cgroup_drop_lru(struct page_cgroup *pc)
+{
+ int count;
+ struct mem_cgroup_lazy_lru *mll;
+
+ mll = &get_cpu_var(memcg_lazy_lru);
+ pc->next = mll->next;
+ mll->next = pc;
+ count = ++mll->count;
+ put_cpu_var(memcg_lazy_lru);
+
+ if (count >= MEMCG_LRU_THRESH)
+ __mem_cgroup_drop_lru();
+}
+
+
+static DEFINE_MUTEX(memcg_force_drain_mutex);
+static void mem_cgroup_local_force_drain(struct work_struct *work)
+{
+ __mem_cgroup_drop_lru();
+}
+
+static void mem_cgroup_all_force_drain(struct mem_cgroup *memcg)
+{
+ mutex_lock(&memcg_force_drain_mutex);
+ schedule_on_each_cpu(mem_cgroup_local_force_drain);
+ mutex_unlock(&memcg_force_drain_mutex);
+}
+
/*
* Charge the memory controller for page usage.
* Return
@@ -629,10 +718,10 @@ static int mem_cgroup_charge_common(stru
struct mem_cgroup *memcg)
{
struct mem_cgroup *mem;
+ struct mem_cgroup_per_zone *mz;
struct page_cgroup *pc;
- unsigned long flags;
unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
- struct mem_cgroup_per_zone *mz;
+ unsigned long flags;

pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
if (unlikely(pc == NULL))
@@ -683,6 +772,7 @@ static int mem_cgroup_charge_common(stru
pc->mem_cgroup = mem;
pc->page = page;
pc->flags = 0;
+ pc->next = NULL;
/*
* If a page is accounted as a page cache, insert to inactive list.
* If anon, insert to active list.
@@ -712,6 +802,7 @@ static int mem_cgroup_charge_common(stru
spin_unlock_irqrestore(&mz->lru_lock, flags);

unlock_page_cgroup(page);
+
done:
return 0;
out:
@@ -785,8 +876,6 @@ __mem_cgroup_uncharge_common(struct page
{
struct page_cgroup *pc;
struct mem_cgroup *mem;
- struct mem_cgroup_per_zone *mz;
- unsigned long flags;

if (mem_cgroup_subsys.disabled)
return;
@@ -806,19 +895,14 @@ __mem_cgroup_uncharge_common(struct page
|| page_mapped(page)))
goto unlock;

- mz = page_cgroup_zoneinfo(pc);
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_remove_list(mz, pc);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
-
+ mem = pc->mem_cgroup;
+ prefetch(mem);
+ page_cgroup_set_bit(pc, PAGE_CG_FLAG_OBSOLETE);
page_assign_page_cgroup(page, NULL);
unlock_page_cgroup(page);
-
- mem = pc->mem_cgroup;
mem_counter_uncharge(mem, 1);
- css_put(&mem->css);
+ mem_cgroup_drop_lru(pc);

- kmem_cache_free(page_cgroup_cache, pc);
return;
unlock:
unlock_page_cgroup(page);
@@ -1011,6 +1095,7 @@ static int mem_cgroup_force_empty(struct
}
}
ret = 0;
+ mem_cgroup_all_force_drain(mem);
out:
css_put(&mem->css);
return ret;
@@ -1212,6 +1297,7 @@ static int alloc_mem_cgroup_per_zone_inf
for_each_lru(l)
INIT_LIST_HEAD(&mz->lists[l]);
}
+
return 0;
}

2008-08-19 08:35:40

by Kamezawa Hiroyuki

Subject: [PATCH -mm][preview] memcg: a patch series for next [5/9]

Make the freeing of page_cgroup an RCU routine.

This patch avoids freeing the per-cpu page_cgroup queue directly and
instead hands the free queue to an RCU callback.

This patch is a base patch for removing lock_page_cgroup().

With this, a page_cgroup object stays valid while rcu_read_lock() is held.
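
In other words, after this patch a reader can do the following without
lock_page_cgroup() (minimal sketch; use_page_cgroup() is just a placeholder,
the real lockless readers are introduced in the next patch):

    rcu_read_lock();
    pc = page_get_page_cgroup(page);
    if (pc && !page_cgroup_test_bit(pc, PAGE_CG_FLAG_OBSOLETE)) {
            /*
             * pc cannot be kmem_cache_free()d before rcu_read_unlock():
             * the uncharge path only queues it, and the queue is drained
             * from an RCU callback, i.e. after a grace period.
             */
            use_page_cgroup(pc);            /* placeholder for real work */
    }
    rcu_read_unlock();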

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

---
mm/memcontrol.c | 39 +++++++++++++++++++++++++++++++++------
1 file changed, 33 insertions(+), 6 deletions(-)

Index: linux-2.6.27-rc1-mm1/mm/memcontrol.c
===================================================================
--- linux-2.6.27-rc1-mm1.orig/mm/memcontrol.c
+++ linux-2.6.27-rc1-mm1/mm/memcontrol.c
@@ -638,21 +638,25 @@ unsigned long mem_cgroup_isolate_pages(u
return nr_taken;
}

-void __mem_cgroup_drop_lru(void)
+struct memcg_rcu_work {
+ struct rcu_head head;
+ struct page_cgroup *list;
+};
+
+
+void __mem_cgroup_drop_lru(struct rcu_head *head)
{
struct mem_cgroup *memcg;
struct page_cgroup *pc, *next;
struct mem_cgroup_per_zone *mz, *page_mz;
- struct mem_cgroup_lazy_lru *mll;
unsigned long flags;
+ struct memcg_rcu_work *work;

- mll = &get_cpu_var(memcg_lazy_lru);
- next = mll->next;
- mll->next = NULL;
- mll->count = 0;
- put_cpu_var(memcg_lazy_lru);
+ work = container_of(head, struct memcg_rcu_work, head);

+ next = work->list;
mz = NULL;
+ kfree(work);

local_irq_save(flags);
while (next) {
@@ -678,6 +682,27 @@ void __mem_cgroup_drop_lru(void)
return;
}

+static int mem_cgroup_drop_lru_rcu(void)
+{
+ struct mem_cgroup_lazy_lru *mll;
+ struct memcg_rcu_work *work;
+
+ work = kmalloc(sizeof(*work), GFP_ATOMIC);
+ if (!work)
+ return 1;
+
+ INIT_RCU_HEAD(&work->head);
+
+ mll = &get_cpu_var(memcg_lazy_lru);
+ work->list = mll->next;
+ mll->next = NULL;
+ mll->count = 0;
+ put_cpu_var(memcg_lazy_lru);
+ call_rcu(&work->head, __mem_cgroup_drop_lru);
+
+ return 0;
+}
+
static void mem_cgroup_drop_lru(struct page_cgroup *pc)
{
int count;
@@ -690,14 +715,17 @@ static void mem_cgroup_drop_lru(struct p
put_cpu_var(memcg_lazy_lru);

if (count >= MEMCG_LRU_THRESH)
- __mem_cgroup_drop_lru();
+ mem_cgroup_drop_lru_rcu();
}


static DEFINE_MUTEX(memcg_force_drain_mutex);
static void mem_cgroup_local_force_drain(struct work_struct *work)
{
- __mem_cgroup_drop_lru();
+ int ret;
+ do {
+ ret = mem_cgroup_drop_lru_rcu();
+ } while (ret);
}

static void mem_cgroup_all_force_drain(struct mem_cgroup *memcg)
@@ -705,6 +733,7 @@ static void mem_cgroup_all_force_drain(s
mutex_lock(&memcg_force_drain_mutex);
schedule_on_each_cpu(mem_cgroup_local_force_drain);
mutex_unlock(&memcg_force_drain_mutex);
+ synchronize_rcu();
}

/*

2008-08-19 08:36:52

by Kamezawa Hiroyuki

Subject: [PATCH -mm][preview] memcg: a patch series for next [6/9]

Experimental !!

This patch just removes lock_page_cgroup().
With RCU, it seems unnecessary....

Why it's safe without lock_page_cgroup().

Anon pages:
* pages are charged/uncharged only at first map/last unmap;
  page_mapcount() handles that.
  (At uncharge, the pte lock is always held in the racy case.)

Swap pages:
For the swap cache there would be a race, so
mem_cgroup_charge() is moved under lock_page().

File pages (not shmem):
* pages are charged/uncharged only when they are added to/removed from
  the radix tree. In these paths the page lock is always held.

install_page():
Is it worth charging a driver's mapped page, which is (maybe) not on the LRU?
Is it a target resource for memcg? I think not.
So I removed charge/uncharge from install_page().

Freeing of page_cgroup is done under RCU.
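
The uncharge (writer) side, condensed from the hunks below: the page_cgroup is
unpublished and marked obsolete while readers may still hold it under
rcu_read_lock(); the actual free happens only after a grace period
(sketch only, the exact checks are in the diff):

    pc = page_get_page_cgroup(page);                /* rcu_dereference()    */
    if (pc && !(page_mapped(page) ||
                page_cgroup_test_bit(pc, PAGE_CG_FLAG_CACHE))) {
            mem = pc->mem_cgroup;
            page_cgroup_set_bit(pc, PAGE_CG_FLAG_OBSOLETE); /* readers skip it */
            page_assign_page_cgroup(page, NULL);    /* rcu_assign_pointer() */
            mem_counter_uncharge(mem, 1);
            mem_cgroup_drop_lru(pc);                /* freed after a grace period */
    }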

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

---
include/linux/mm_types.h | 2 -
mm/memcontrol.c | 86 ++++++-----------------------------------------
mm/memory.c | 17 +++------
3 files changed, 19 insertions(+), 86 deletions(-)

Index: linux-2.6.27-rc1-mm1/include/linux/mm_types.h
===================================================================
--- linux-2.6.27-rc1-mm1.orig/include/linux/mm_types.h
+++ linux-2.6.27-rc1-mm1/include/linux/mm_types.h
@@ -93,7 +93,7 @@ struct page {
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
- unsigned long page_cgroup;
+ struct page_cgroup *page_cgroup;
#endif

#ifdef CONFIG_KMEMCHECK
Index: linux-2.6.27-rc1-mm1/mm/memcontrol.c
===================================================================
--- linux-2.6.27-rc1-mm1.orig/mm/memcontrol.c
+++ linux-2.6.27-rc1-mm1/mm/memcontrol.c
@@ -145,20 +145,6 @@ struct mem_cgroup {
static struct mem_cgroup init_mem_cgroup;

/*
- * We use the lower bit of the page->page_cgroup pointer as a bit spin
- * lock. We need to ensure that page->page_cgroup is at least two
- * byte aligned (based on comments from Nick Piggin). But since
- * bit_spin_lock doesn't actually set that lock bit in a non-debug
- * uniprocessor kernel, we should avoid setting it here too.
- */
-#define PAGE_CGROUP_LOCK_BIT 0x0
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
-#define PAGE_CGROUP_LOCK (1 << PAGE_CGROUP_LOCK_BIT)
-#else
-#define PAGE_CGROUP_LOCK 0x0
-#endif
-
-/*
* A page_cgroup page is associated with every page descriptor. The
* page_cgroup helps us identify information about the cgroup
*/
@@ -308,35 +294,14 @@ struct mem_cgroup *mem_cgroup_from_task(
struct mem_cgroup, css);
}

-static inline int page_cgroup_locked(struct page *page)
-{
- return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
{
- VM_BUG_ON(!page_cgroup_locked(page));
- page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
+ rcu_assign_pointer(page->page_cgroup, pc);
}

struct page_cgroup *page_get_page_cgroup(struct page *page)
{
- return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
-}
-
-static void lock_page_cgroup(struct page *page)
-{
- bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static int try_lock_page_cgroup(struct page *page)
-{
- return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static void unlock_page_cgroup(struct page *page)
-{
- bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
+ return rcu_dereference(page->page_cgroup);
}

/*
@@ -499,16 +464,7 @@ void mem_cgroup_move_lists(struct page *
if (mem_cgroup_subsys.disabled)
return;

- /*
- * We cannot lock_page_cgroup while holding zone's lru_lock,
- * because other holders of lock_page_cgroup can be interrupted
- * with an attempt to rotate_reclaimable_page. But we cannot
- * safely get to page_cgroup without it, so just try_lock it:
- * mem_cgroup_isolate_pages allows for page left on wrong list.
- */
- if (!try_lock_page_cgroup(page))
- return;
-
+ rcu_read_lock();
pc = page_get_page_cgroup(page);
if (pc) {
if (!page_cgroup_test_bit(pc, PAGE_CG_FLAG_OBSOLETE)) {
@@ -518,7 +474,7 @@ void mem_cgroup_move_lists(struct page *
spin_unlock_irqrestore(&mz->lru_lock, flags);
}
}
- unlock_page_cgroup(page);
+ rcu_read_unlock();
}

/*
@@ -815,24 +771,13 @@ static int mem_cgroup_charge_common(stru
} else
__page_cgroup_set_bit(pc, PAGE_CG_FLAG_ACTIVE);

- lock_page_cgroup(page);
- if (unlikely(page_get_page_cgroup(page))) {
- unlock_page_cgroup(page);
- mem_counter_uncharge(mem, 1);
- css_put(&mem->css);
- kmem_cache_free(page_cgroup_cache, pc);
- goto done;
- }
+ VM_BUG_ON(page->page_cgroup);
page_assign_page_cgroup(page, pc);
-
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_add_list(mz, pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);

- unlock_page_cgroup(page);
-
-done:
return 0;
out:
css_put(&mem->css);
@@ -874,20 +819,17 @@ int mem_cgroup_cache_charge(struct page
*
* For GFP_NOWAIT case, the page may be pre-charged before calling
* add_to_page_cache(). (See shmem.c) check it here and avoid to call
- * charge twice. (It works but has to pay a bit larger cost.)
+ * charge twice.
*/
if (!(gfp_mask & __GFP_WAIT)) {
struct page_cgroup *pc;

- lock_page_cgroup(page);
pc = page_get_page_cgroup(page);
if (pc) {
VM_BUG_ON(pc->page != page);
VM_BUG_ON(!pc->mem_cgroup);
- unlock_page_cgroup(page);
return 0;
}
- unlock_page_cgroup(page);
}

if (unlikely(!mm))
@@ -912,29 +854,25 @@ __mem_cgroup_uncharge_common(struct page
/*
* Check if our page_cgroup is valid
*/
- lock_page_cgroup(page);
pc = page_get_page_cgroup(page);
if (unlikely(!pc))
- goto unlock;
+ goto out;

VM_BUG_ON(pc->page != page);
+ VM_BUG_ON(page_cgroup_test_bit(pc, PAGE_CG_FLAG_OBSOLETE));

if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
&& (page_cgroup_test_bit(pc, PAGE_CG_FLAG_CACHE)
|| page_mapped(page)))
- goto unlock;
+ goto out;

mem = pc->mem_cgroup;
- prefetch(mem);
page_cgroup_set_bit(pc, PAGE_CG_FLAG_OBSOLETE);
page_assign_page_cgroup(page, NULL);
- unlock_page_cgroup(page);
mem_counter_uncharge(mem, 1);
mem_cgroup_drop_lru(pc);
-
+out:
return;
-unlock:
- unlock_page_cgroup(page);
}

void mem_cgroup_uncharge_page(struct page *page)
@@ -962,7 +900,7 @@ int mem_cgroup_prepare_migration(struct
if (mem_cgroup_subsys.disabled)
return 0;

- lock_page_cgroup(page);
+ rcu_read_lock();
pc = page_get_page_cgroup(page);
if (pc) {
mem = pc->mem_cgroup;
@@ -970,7 +908,7 @@ int mem_cgroup_prepare_migration(struct
if (page_cgroup_test_bit(pc, PAGE_CG_FLAG_CACHE))
ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
}
- unlock_page_cgroup(page);
+ rcu_read_unlock();
if (mem) {
ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
ctype, mem);
Index: linux-2.6.27-rc1-mm1/mm/memory.c
===================================================================
--- linux-2.6.27-rc1-mm1.orig/mm/memory.c
+++ linux-2.6.27-rc1-mm1/mm/memory.c
@@ -1325,18 +1325,14 @@ static int insert_page(struct vm_area_st
pte_t *pte;
spinlock_t *ptl;

- retval = mem_cgroup_charge(page, mm, GFP_KERNEL);
- if (retval)
- goto out;
-
retval = -EINVAL;
if (PageAnon(page))
- goto out_uncharge;
+ goto out;
retval = -ENOMEM;
flush_dcache_page(page);
pte = get_locked_pte(mm, addr, &ptl);
if (!pte)
- goto out_uncharge;
+ goto out;
retval = -EBUSY;
if (!pte_none(*pte))
goto out_unlock;
@@ -1352,8 +1348,6 @@ static int insert_page(struct vm_area_st
return retval;
out_unlock:
pte_unmap_unlock(pte, ptl);
-out_uncharge:
- mem_cgroup_uncharge_page(page);
out:
return retval;
}
@@ -2328,15 +2322,16 @@ static int do_swap_page(struct mm_struct
count_vm_event(PGMAJFAULT);
}

+ lock_page(page);
+ delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+
if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
- delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
ret = VM_FAULT_OOM;
+ unlock_page(page);
goto out;
}

mark_page_accessed(page);
- lock_page(page);
- delayacct_clear_flag(DELAYACCT_PF_SWAPIN);

/*
* Back out if somebody else already faulted in this pte.

2008-08-19 08:37:40

by Kamezawa Hiroyuki

Subject: [PATCH -mm][preview] memcg: a patch series for next [7/9]

mem_cgroup_charge_common() has to take the per-zone LRU lock, but the
location of that lock can be computed at an early stage. This patch
prefetches the lock's cacheline ahead of time. (It shows some measurable
benefit on my host.)
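
Condensed from the hunk below: the zone info (and therefore the lru_lock)
depends only on the page, so it can be computed and prefetched before the
charge and flag setup touch it (sketch only):

    nid = page_to_nid(page);        /* known as soon as we have the page */
    zid = page_zonenum(page);

    /* ... allocation of pc and the charge/reclaim loop run here ... */

    mz = mem_cgroup_zoneinfo(mem, nid, zid);
    prefetchw(mz);                  /* warm the cacheline holding mz->lru_lock */

    /* ... pc->mem_cgroup / pc->flags setup runs while the line is fetched ... */

    spin_lock_irqsave(&mz->lru_lock, flags);
    __mem_cgroup_add_list(mz, pc);
    spin_unlock_irqrestore(&mz->lru_lock, flags);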

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

Index: linux-2.6.27-rc1-mm1/mm/memcontrol.c
===================================================================
--- linux-2.6.27-rc1-mm1.orig/mm/memcontrol.c
+++ linux-2.6.27-rc1-mm1/mm/memcontrol.c
@@ -707,11 +707,14 @@ static int mem_cgroup_charge_common(stru
struct page_cgroup *pc;
unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
unsigned long flags;
+ int nid, zid;

pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
if (unlikely(pc == NULL))
goto err;

+ nid = page_to_nid(page);
+ zid = page_zonenum(page);
/*
* We always charge the cgroup the mm_struct belongs to.
* The mm_struct's mem_cgroup changes on task migration if the
@@ -753,6 +756,8 @@ static int mem_cgroup_charge_common(stru
goto out;
}
}
+ mz = mem_cgroup_zoneinfo(mem, nid, zid);
+ prefetchw(mz);

pc->mem_cgroup = mem;
pc->page = page;
@@ -773,7 +778,6 @@ static int mem_cgroup_charge_common(stru

VM_BUG_ON(page->page_cgroup);
page_assign_page_cgroup(page, pc);
- mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_add_list(mz, pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);

2008-08-19 08:38:19

by Kamezawa Hiroyuki

Subject: [PATCH -mm][preview] memcg: a patch series for next [8/9]

Very experimental...

mem+swap controller prototype.

This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP as the memory resource
controller's swap extension.

When this is enabled, the memory resource controller has two limits:

- memory.limit_in_bytes .... limit for pages
- memory.memsw_limit_in_bytes .... limit for pages + swaps.

Following is the (simplified) accounting state transition after this patch.

              pages   swaps   pages_total   memsw_total
                +1      -         +1            +1       new page allocation
                -1     +1         -1            -        swap out
                +1     -1          0            -        swap in (*)
                 -     -1          -            -1       swap_free

At swap-out, the swp_entry will be charged against the cgroup of the page.
At swap-in (*), the page will be charged when it is mapped.
(Accounting at read_swap() might be more elegant, but delaying accounting
until mem_cgroup_charge() lets us avoid some error handling.)

The charge against the swp_entry will be dropped when the swp_entry is freed.

The res.swaps counter only includes swap entries that are not in the swap
cache, so it doesn't show the real number of swp_entries in use; it only
shows swp_entries that are on disk.
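
The two central counter operations, condensed from the hunks below (locking
with res.lock omitted for brevity; the field names are the ones added by this
patch):

    /* page charge: both limits are checked */
    if (memcg->res.pages + num > memcg->res.pages_limit)
            goto busy;                              /* over "limit"       */
    if (memcg->res.pages + memcg->res.swaps + num > memcg->res.memsw_limit)
            goto busy;                              /* over "memsw_limit" */
    memcg->res.pages += num;

    /* swap-out (mem_counter_recharge_swapout): move the charge from
     * "pages" to "swaps"; pages + swaps, and thus memsw usage, is unchanged */
    memcg->res.swaps += 1;
    memcg->res.pages -= 1;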

This patch doesn't include codes for control files.

TODO:
- clean up and add comments.
- support vm_swap_full() under cgroup.
- find an easier-to-understand protocol....
- check force_empty.... (maybe buggy)
- support page migration.
- test!!

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

---
include/linux/swap.h | 32 +++-
init/Kconfig | 12 +
kernel/power/swsusp.c | 2
mm/memcontrol.c | 387 +++++++++++++++++++++++++++++++++++++++++++++-----
mm/shmem.c | 2
mm/swap_state.c | 9 -
mm/swapfile.c | 54 ++++++
7 files changed, 453 insertions(+), 45 deletions(-)

Index: linux-2.6.27-rc1-mm1/mm/memcontrol.c
===================================================================
--- linux-2.6.27-rc1-mm1.orig/mm/memcontrol.c
+++ linux-2.6.27-rc1-mm1/mm/memcontrol.c
@@ -117,8 +117,10 @@ struct mem_cgroup_lru_info {
* a feature that will be implemented much later in the future.
*/
struct mem_counter {
- unsigned long pages_limit;
+ unsigned long pages_limit; /* limit for amount of pages. */
+ unsigned long memsw_limit; /* limit for amount of pages + swaps */
unsigned long pages;
+ unsigned long swaps;
unsigned long failcnt;
unsigned long max_usage;
spinlock_t lock;
@@ -141,6 +143,11 @@ struct mem_cgroup {
* statistics.
*/
struct mem_cgroup_stat stat;
+ /*
+ * swap
+ */
+ spinlock_t swap_list_lock;
+ struct list_head swap_list;
};
static struct mem_cgroup init_mem_cgroup;

@@ -176,6 +183,46 @@ struct mem_cgroup_lazy_lru {

DEFINE_PER_CPU(struct mem_cgroup_lazy_lru, memcg_lazy_lru);

+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+/*
+ * For swap management.
+ */
+DEFINE_SPINLOCK(memcg_swap_control_lock);
+RADIX_TREE(memcg_swap_control, GFP_KERNEL);
+
+struct swap_cgroup {
+ swp_entry_t entry;
+ unsigned long flags;
+ struct mem_cgroup *mem_cgroup;
+ struct list_head list;
+};
+
+/* for flags */
+enum {
+ SWAP_CG_FLAG_ACCOUNTED,
+ NR_SWAP_CG_FLAGS,
+};
+
+static inline int swap_accounted(struct swap_cgroup *sc)
+{
+ return test_bit(SWAP_CG_FLAG_ACCOUNTED, &sc->flags);
+}
+
+static inline void set_swap_accounted(struct swap_cgroup *sc)
+{
+ set_bit(SWAP_CG_FLAG_ACCOUNTED, &sc->flags);
+}
+
+static inline void clear_swap_accounted(struct swap_cgroup *sc)
+{
+ clear_bit(SWAP_CG_FLAG_ACCOUNTED, &sc->flags);
+}
+
+#define do_account_swap (1)
+#else
+#define do_account_swap (0)
+#endif
+

static inline void page_cgroup_set_bit(struct page_cgroup *pc, int flag)
{
@@ -211,6 +258,7 @@ static enum zone_type page_cgroup_zid(st
enum charge_type {
MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
MEM_CGROUP_CHARGE_TYPE_MAPPED,
+ MEM_CGROUP_CHARGE_TYPE_SWAPOUT,
MEM_CGROUP_CHARGE_TYPE_FORCE, /* used by force_empty */
};

@@ -313,7 +361,9 @@ static void mem_counter_init(struct mem_
spin_lock_init(&memcg->res.lock);
memcg->res.pages = 0;
memcg->res.pages_limit = ~0UL;
+ memcg->res.memsw_limit = ~0UL;
memcg->res.failcnt = 0;
+ memcg->res.swaps = 0;
}

static int mem_counter_charge(struct mem_cgroup *memcg, long num)
@@ -321,16 +371,22 @@ static int mem_counter_charge(struct mem
unsigned long flags;

spin_lock_irqsave(&memcg->res.lock, flags);
- if (memcg->res.pages + num > memcg->res.pages_limit) {
- memcg->res.failcnt++;
- spin_unlock_irqrestore(&memcg->res.lock, flags);
- return -EBUSY;
- }
+ if (memcg->res.pages + num > memcg->res.pages_limit)
+ goto busy;
+ if (do_account_swap
+ && (memcg->res.pages + memcg->res.swaps + num
+ > memcg->res.memsw_limit))
+ goto busy;
memcg->res.pages += num;
if (memcg->res.pages > memcg->res.max_usage)
memcg->res.max_usage = memcg->res.pages;
spin_unlock_irqrestore(&memcg->res.lock, flags);
return 0;
+busy:
+ memcg->res.failcnt++;
+ spin_unlock_irqrestore(&memcg->res.lock, flags);
+ return -EBUSY;
+
}

static inline void mem_counter_uncharge(struct mem_cgroup *memcg, long num)
@@ -343,6 +399,30 @@ static inline void mem_counter_uncharge(
spin_unlock_irqrestore(&memcg->res.lock, flags);
}

+/*
+ * Convert the charge from page to swap. (no change in total)
+ * charge value is always "1".
+ */
+static inline void
+mem_counter_recharge_swapout(struct mem_cgroup *memcg)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&memcg->res.lock, flags);
+ memcg->res.swaps += 1;
+ memcg->res.pages -= 1;
+ spin_unlock_irqrestore(&memcg->res.lock, flags);
+}
+
+static inline void
+mem_counter_uncharge_swap(struct mem_cgroup *memcg, long num)
+{
+ unsigned long flags;
+ spin_lock_irqsave(&memcg->res.lock, flags);
+ memcg->res.swaps -= num;
+ spin_unlock_irqrestore(&memcg->res.lock, flags);
+}
+
static int mem_counter_set_pages_limit(struct mem_cgroup *memcg,
unsigned long lim)
{
@@ -372,6 +452,18 @@ static int __mem_counter_check_under_lim
return ret;
}

+static int __mem_counter_check_under_memsw_limit(struct mem_cgroup *memcg)
+{
+ unsigned long flags;
+ int ret = 0;
+
+ spin_lock_irqsave(&memcg->res.lock, flags);
+ if (memcg->res.pages + memcg->res.swaps < memcg->res.memsw_limit)
+ ret = 1;
+ spin_unlock_irqrestore(&memcg->res.lock, flags);
+ return ret;
+}
+
static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
struct page_cgroup *pc)
{
@@ -467,16 +559,156 @@ void mem_cgroup_move_lists(struct page *
rcu_read_lock();
pc = page_get_page_cgroup(page);
if (pc) {
- if (!page_cgroup_test_bit(pc, PAGE_CG_FLAG_OBSOLETE)) {
- mz = page_cgroup_zoneinfo(pc);
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_move_lists(pc, lru);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
+ mz = page_cgroup_zoneinfo(pc);
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ __mem_cgroup_move_lists(pc, lru);
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+ }
+ rcu_read_unlock();
+}
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+/*
+ * Create a space for remember swap_entry.
+ * Called from get_swap_page().
+ */
+int cgroup_precharge_swap_ent(swp_entry_t entry, gfp_t mask)
+{
+ struct swap_cgroup *sc;
+ unsigned long flags;
+ int error = -ENOMEM;
+
+ sc = kmalloc(sizeof(*sc), mask);
+ if (!sc)
+ return error;
+ error = radix_tree_preload(mask);
+ if (error)
+ return error;
+ sc->entry = entry;
+ sc->mem_cgroup = NULL;
+ INIT_LIST_HEAD(&sc->list);
+ spin_lock_irqsave(&memcg_swap_control_lock, flags);
+ error = radix_tree_insert(&memcg_swap_control, entry.val, sc);
+ spin_unlock_irqrestore(&memcg_swap_control_lock, flags);
+
+ if (error) {
+ if (error == -EEXIST)
+ error = 0;
+ kfree(sc);
+ }
+ return error;
+}
+
+/*
+ * This function will never cause memory allocation.
+ * called from add_to_swap_cache().
+ */
+void cgroup_commit_swap_owner(struct page *page, swp_entry_t entry)
+{
+ struct swap_cgroup *sc;
+ unsigned long flags;
+
+ rcu_read_lock();
+ spin_lock_irqsave(&memcg_swap_control_lock, flags);
+ sc = radix_tree_lookup(&memcg_swap_control, entry.val);
+ /*
+ * There are 2 cases:
+ * Swap-In: we do nothing. In this case, sc->mem_cgroup is not NULL.
+ * Swap-out: we set sc->mem_cgroup to the page's mem_cgroup (pc->mem_cgroup).
+ *
+ */
+ VM_BUG_ON(!sc);
+ if (!sc->mem_cgroup) {
+ struct page_cgroup *pc;
+ pc = page_get_page_cgroup(page);
+ if (pc && !page_cgroup_test_bit(pc, PAGE_CG_FLAG_OBSOLETE)) {
+ struct mem_cgroup *memcg = pc->mem_cgroup;
+ sc->mem_cgroup = memcg;
+ sc->flags = 0;
+ spin_lock(&memcg->swap_list_lock);
+ list_add(&sc->list, &memcg->swap_list);
+ spin_unlock(&memcg->swap_list_lock);
+ css_get(&memcg->css);
}
}
+ spin_unlock_irqrestore(&memcg_swap_control_lock, flags);
+ rcu_read_unlock();
+}
+
+static struct swap_cgroup *mem_cgroup_lookup_swap(swp_entry_t entry)
+{
+ struct swap_cgroup *sc;
+
+ rcu_read_lock();
+ sc = radix_tree_lookup(&memcg_swap_control, entry.val);
rcu_read_unlock();
+
+ return sc;
}

+static struct mem_cgroup *lookup_memcg_from_swap(swp_entry_t entry)
+{
+ struct swap_cgroup *sc;
+ sc = mem_cgroup_lookup_swap(entry);
+ if (sc)
+ return sc->mem_cgroup;
+ /* never reach here ? */
+ WARN_ON("lookup_memcg_from_swap returns NULL");
+ return NULL;
+}
+
+static void swap_cgroup_uncharge_swap(struct mem_cgroup *mem, swp_entry_t entry)
+{
+ struct swap_cgroup *sc;
+
+ sc = mem_cgroup_lookup_swap(entry);
+ BUG_ON(!sc);
+
+ if (!swap_accounted(sc))
+ return;
+ mem_counter_uncharge_swap(mem, 1);
+ clear_swap_accounted(sc);
+}
+
+static void swap_cgroup_delete_swap(swp_entry_t entry)
+{
+ struct swap_cgroup *sc;
+ struct mem_cgroup *memcg;
+ unsigned long flags;
+
+ spin_lock_irqsave(&memcg_swap_control_lock, flags);
+ sc = radix_tree_delete(&memcg_swap_control, entry.val);
+ spin_unlock_irqrestore(&memcg_swap_control_lock, flags);
+
+ if (sc) {
+ memcg = sc->mem_cgroup;
+ spin_lock_irqsave(&memcg->swap_list_lock, flags);
+ list_del(&sc->list);
+ spin_unlock_irqrestore(&memcg->swap_list_lock, flags);
+ if (swap_accounted(sc))
+ mem_counter_uncharge_swap(memcg, 1);
+ css_put(&memcg->css);
+ kfree(sc);
+ }
+ return;
+}
+#else
+
+static struct mem_cgroup *lookup_memcg_from_swap(swp_entry_t entry)
+{
+ return NULL;
+}
+static void swap_cgroup_uncharge_swap(struct mem_cgroup *mem, swp_entry_t val)
+{
+ return;
+}
+static void swap_cgroup_delete_swap(swp_entry_t val)
+{
+ return;
+}
+
+#endif
+
/*
* Calculate mapped_ratio under memory controller. This will be used in
* vmscan.c for deteremining we have to reclaim mapped pages.
@@ -566,8 +798,6 @@ unsigned long mem_cgroup_isolate_pages(u
if (unlikely(!PageLRU(page)))
continue;

- if (page_cgroup_test_bit(pc, PAGE_CG_FLAG_OBSOLETE))
- continue;
/*
* TODO: play better with lumpy reclaim, grabbing anything.
*/
@@ -735,21 +965,33 @@ static int mem_cgroup_charge_common(stru
}

while (mem_counter_charge(mem, 1)) {
+ int progress;
+
if (!(gfp_mask & __GFP_WAIT))
goto out;

- if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
- continue;
-
- /*
- * try_to_free_mem_cgroup_pages() might not give us a full
- * picture of reclaim. Some pages are reclaimed and might be
- * moved to swap cache or just unmapped from the cgroup.
- * Check the limit again to see if the reclaim reduced the
- * current usage of the cgroup before giving up
- */
- if (__mem_counter_check_under_limit(mem))
- continue;
+ progress = try_to_free_mem_cgroup_pages(mem, gfp_mask);
+ if (do_account_swap) {
+ /* When we hit memsw_limit, success of
+ try_to_free_page() doesn't mean we can go ahead. */
+ if (progress
+ && __mem_counter_check_under_memsw_limit(mem))
+ continue;
+ } else {
+ if (progress)
+ continue;
+ /*
+ * try_to_free_mem_cgroup_pages() might not give us a
+ * full picture of reclaim. Some pages are reclaimed
+ * and might be moved to swap cache or just
+ * unmapped from the cgroup.
+ *
+ * Check the limit again to see if the reclaim reduced
+ * the current usage of the cgroup before giving up.
+ */
+ if (__mem_counter_check_under_limit(mem))
+ continue;
+ }

if (!nr_retries--) {
mem_cgroup_out_of_memory(mem, gfp_mask);
@@ -782,6 +1024,11 @@ static int mem_cgroup_charge_common(stru
__mem_cgroup_add_list(mz, pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);

+ if (do_account_swap && PageSwapCache(page)) {
+ swp_entry_t entry = { .val = page_private(page) };
+ swap_cgroup_uncharge_swap(mem, entry);
+ }
+
return 0;
out:
css_put(&mem->css);
@@ -792,6 +1039,8 @@ err:

int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
{
+ struct mem_cgroup *memcg = NULL;
+
if (mem_cgroup_subsys.disabled)
return 0;

@@ -806,13 +1055,23 @@ int mem_cgroup_charge(struct page *page,
return 0;
if (unlikely(!mm))
mm = &init_mm;
+
+ if (do_account_swap && PageSwapCache(page)) {
+ swp_entry_t entry = { .val = page_private(page) };
+ /* swap cache can have valid page->page_cgroup */
+ if (page->mapping && page_get_page_cgroup(page))
+ return 0;
+ memcg = lookup_memcg_from_swap(entry);
+ }
+
return mem_cgroup_charge_common(page, mm, gfp_mask,
- MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
+ MEM_CGROUP_CHARGE_TYPE_MAPPED, memcg);
}

int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask)
{
+ struct mem_cgroup *memcg = NULL;
if (mem_cgroup_subsys.disabled)
return 0;

@@ -835,25 +1094,33 @@ int mem_cgroup_cache_charge(struct page
return 0;
}
}
+ if (do_account_swap && PageSwapCache(page)) {
+ swp_entry_t entry = { .val = page_private(page) };
+ /* swap cache can have valid page->page_cgroup */
+ if (page->mapping && page_get_page_cgroup(page))
+ return 0;
+ memcg = lookup_memcg_from_swap(entry);
+ }

if (unlikely(!mm))
mm = &init_mm;

return mem_cgroup_charge_common(page, mm, gfp_mask,
- MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
+ MEM_CGROUP_CHARGE_TYPE_CACHE, memcg);
}

/*
* uncharge if !page_mapped(page)
*/
-static void
+static int
__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
{
struct page_cgroup *pc;
struct mem_cgroup *mem;
+ int ret = 0;

if (mem_cgroup_subsys.disabled)
- return;
+ return 0;

/*
* Check if our page_cgroup is valid
@@ -865,18 +1132,23 @@ __mem_cgroup_uncharge_common(struct page
VM_BUG_ON(pc->page != page);
VM_BUG_ON(page_cgroup_test_bit(pc, PAGE_CG_FLAG_OBSOLETE));

- if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
- && (page_cgroup_test_bit(pc, PAGE_CG_FLAG_CACHE)
- || page_mapped(page)))
+ if (likely(ctype != MEM_CGROUP_CHARGE_TYPE_FORCE))
+ if (PageSwapCache(page) || page_mapped(page) ||
+ (page->mapping && !PageAnon(page)))
goto out;
-
+ ret = 1;
mem = pc->mem_cgroup;
page_cgroup_set_bit(pc, PAGE_CG_FLAG_OBSOLETE);
page_assign_page_cgroup(page, NULL);
- mem_counter_uncharge(mem, 1);
+
+ if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
+ /* swap is not accounted yet; convert the page charge into a swap charge. */
+ mem_counter_recharge_swapout(mem);
+ } else
+ mem_counter_uncharge(mem, 1);
mem_cgroup_drop_lru(pc);
out:
- return;
+ return ret;
}

void mem_cgroup_uncharge_page(struct page *page)
@@ -890,7 +1162,50 @@ void mem_cgroup_uncharge_cache_page(stru
VM_BUG_ON(page->mapping);
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
}
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+/*
+ * This function is called from __delete_from_swap_cache.
+ * The function will be called under following case.
+ * 1. swap out memory by vmscan.
+ * 2. discard shmem's swp_entry at shmem's swap-in.
+ * 3. discard anonymous memory which was swap-cache.
+ */
+
+void mem_cgroup_uncharge_swap_cache(struct page *page, swp_entry_t entry)
+{
+ struct page_cgroup *pc;
+ struct swap_cgroup *sc;
+ enum charge_type ctype = MEM_CGROUP_CHARGE_TYPE_SWAPOUT;
+
+ sc = mem_cgroup_lookup_swap(entry);
+
+ BUG_ON(!sc);
+ BUG_ON(PageSwapCache(page));
+
+ if (swap_accounted(sc)) {
+ pc = page_get_page_cgroup(page);
+ if (pc) {
+ /* never reach here...just for debug */
+ printk("%d need to uncharge page ???", __LINE__);
+ ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
+ __mem_cgroup_uncharge_common(page, ctype);
+ }
+ return;
+ }
+
+ if (__mem_cgroup_uncharge_common(page, ctype))
+ set_swap_accounted(sc);
+}
+
+/*
+ * Called when swap is freed.
+ */
+void mem_cgroup_uncharge_swap(swp_entry_t entry)
+{
+ swap_cgroup_delete_swap(entry);
+}

+#endif
/*
* Before starting migration, account against new page.
*/
@@ -1321,6 +1636,8 @@ mem_cgroup_create(struct cgroup_subsys *
if (alloc_mem_cgroup_per_zone_info(mem, node))
goto free_out;

+ spin_lock_init(&mem->swap_list_lock);
+ INIT_LIST_HEAD(&mem->swap_list);
return &mem->css;
free_out:
for_each_node_state(node, N_POSSIBLE)
Index: linux-2.6.27-rc1-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.27-rc1-mm1.orig/include/linux/swap.h
+++ linux-2.6.27-rc1-mm1/include/linux/swap.h
@@ -295,8 +295,8 @@ extern struct page *swapin_readahead(swp
/* linux/mm/swapfile.c */
extern long total_swap_pages;
extern void si_swapinfo(struct sysinfo *);
-extern swp_entry_t get_swap_page(void);
-extern swp_entry_t get_swap_page_of_type(int);
+extern swp_entry_t get_swap_page(gfp_t);
+extern swp_entry_t get_swap_page_of_type(int, gfp_t);
extern int swap_duplicate(swp_entry_t);
extern int valid_swaphandles(swp_entry_t, unsigned long *);
extern void swap_free(swp_entry_t);
@@ -332,6 +332,34 @@ static inline void disable_swap_token(vo
put_swap_token(swap_token_mm);
}

+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+extern int cgroup_precharge_swap_ent(swp_entry_t entry, gfp_t mask);
+/* All below functions never fail. */
+extern void cgroup_commit_swap_owner(struct page *page, swp_entry_t entry);
+extern void mem_cgroup_uncharge_swap_cache(struct page *page,
+ swp_entry_t entry);
+extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
+
+#else
+
+static int cgroup_precharge_swap_ent(swp_entry_t entry, gfp_t mask) {
+ return 0;
+}
+
+static void cgroup_commit_swap_owner(struct page *page, swp_entry_t entry)
+{
+}
+
+static void mem_cgroup_uncharge_swap_cache(struct page *page,
+ swp_entry_t entry)
+{
+}
+
+static void mem_cgroup_uncharge_swap(swp_entry_t entry)
+{
+}
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR_SWAP */
+
#else /* CONFIG_SWAP */

#define total_swap_pages 0
Index: linux-2.6.27-rc1-mm1/mm/swap_state.c
===================================================================
--- linux-2.6.27-rc1-mm1.orig/mm/swap_state.c
+++ linux-2.6.27-rc1-mm1/mm/swap_state.c
@@ -99,6 +99,8 @@ int add_to_swap_cache(struct page *page,
page_cache_release(page);
}
}
+ if (!error)
+ cgroup_commit_swap_owner(page, entry);
return error;
}

@@ -108,14 +110,17 @@ int add_to_swap_cache(struct page *page,
*/
void __delete_from_swap_cache(struct page *page)
{
+ swp_entry_t entry = { .val = page_private(page) };
+
BUG_ON(!PageLocked(page));
BUG_ON(!PageSwapCache(page));
BUG_ON(PageWriteback(page));
BUG_ON(PagePrivate(page));

- radix_tree_delete(&swapper_space.page_tree, page_private(page));
+ radix_tree_delete(&swapper_space.page_tree, entry.val);
set_page_private(page, 0);
ClearPageSwapCache(page);
+ mem_cgroup_uncharge_swap_cache(page, entry);
total_swapcache_pages--;
__dec_zone_page_state(page, NR_FILE_PAGES);
INC_CACHE_INFO(del_total);
@@ -138,7 +143,7 @@ int add_to_swap(struct page * page, gfp_
BUG_ON(!PageUptodate(page));

for (;;) {
- entry = get_swap_page();
+ entry = get_swap_page(gfp_mask);
if (!entry.val)
return 0;

Index: linux-2.6.27-rc1-mm1/mm/shmem.c
===================================================================
--- linux-2.6.27-rc1-mm1.orig/mm/shmem.c
+++ linux-2.6.27-rc1-mm1/mm/shmem.c
@@ -1023,7 +1023,7 @@ static int shmem_writepage(struct page *
* want to check if there's a redundant swappage to be discarded.
*/
if (wbc->for_reclaim)
- swap = get_swap_page();
+ swap = get_swap_page(GFP_ATOMIC);
else
swap.val = 0;

Index: linux-2.6.27-rc1-mm1/kernel/power/swsusp.c
===================================================================
--- linux-2.6.27-rc1-mm1.orig/kernel/power/swsusp.c
+++ linux-2.6.27-rc1-mm1/kernel/power/swsusp.c
@@ -127,7 +127,7 @@ sector_t alloc_swapdev_block(int swap)
{
unsigned long offset;

- offset = swp_offset(get_swap_page_of_type(swap));
+	offset = swp_offset(get_swap_page_of_type(swap, GFP_KERNEL));
if (offset) {
if (swsusp_extents_insert(offset))
swap_free(swp_entry(swap, offset));
Index: linux-2.6.27-rc1-mm1/mm/swapfile.c
===================================================================
--- linux-2.6.27-rc1-mm1.orig/mm/swapfile.c
+++ linux-2.6.27-rc1-mm1/mm/swapfile.c
@@ -173,7 +173,7 @@ no_page:
return 0;
}

-swp_entry_t get_swap_page(void)
+swp_entry_t __get_swap_page(void)
{
struct swap_info_struct *si;
pgoff_t offset;
@@ -214,7 +214,7 @@ noswap:
return (swp_entry_t) {0};
}

-swp_entry_t get_swap_page_of_type(int type)
+swp_entry_t __get_swap_page_of_type(int type)
{
struct swap_info_struct *si;
pgoff_t offset;
@@ -233,6 +233,48 @@ swp_entry_t get_swap_page_of_type(int ty
spin_unlock(&swap_lock);
return (swp_entry_t) {0};
}
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+swp_entry_t get_swap_page(gfp_t mask)
+{
+ swp_entry_t ret;
+ int error;
+
+ ret = __get_swap_page();
+ if (!ret.val)
+ return (swp_entry_t){0};
+ error = cgroup_precharge_swap_ent(ret, mask);
+ if (error) {
+ swap_free(ret);
+ return (swp_entry_t){0};
+ }
+ return ret;
+}
+swp_entry_t get_swap_page_of_type(int type, gfp_t mask)
+{
+ swp_entry_t ret;
+ int error;
+
+ ret = __get_swap_page_of_type(type);
+ if (!ret.val)
+ return (swp_entry_t){0};
+
+ error = cgroup_precharge_swap_ent(ret, mask);
+ if (error) {
+ swap_free(ret);
+ return (swp_entry_t){0};
+ }
+ return ret;
+}
+#else
+swp_entry_t get_swap_page(gfp_t mask)
+{
+ return __get_swap_page();
+}
+swp_entry_t get_swap_page_of_type(int type, gfp_t mask)
+{
+ return __get_swap_page_of_type(type);
+}
+#endif

static struct swap_info_struct * swap_info_get(swp_entry_t entry)
{
@@ -270,8 +312,9 @@ out:
return NULL;
}

-static int swap_entry_free(struct swap_info_struct *p, unsigned long offset)
+static int swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
{
+ unsigned long offset = swp_offset(entry);
int count = p->swap_map[offset];

if (count < SWAP_MAP_MAX) {
@@ -286,6 +329,7 @@ static int swap_entry_free(struct swap_i
swap_list.next = p - swap_info;
nr_swap_pages++;
p->inuse_pages--;
+ mem_cgroup_uncharge_swap(entry);
}
}
return count;
@@ -301,7 +345,7 @@ void swap_free(swp_entry_t entry)

p = swap_info_get(entry);
if (p) {
- swap_entry_free(p, swp_offset(entry));
+ swap_entry_free(p, entry);
spin_unlock(&swap_lock);
}
}
@@ -420,7 +464,7 @@ void free_swap_and_cache(swp_entry_t ent

p = swap_info_get(entry);
if (p) {
- if (swap_entry_free(p, swp_offset(entry)) == 1) {
+ if (swap_entry_free(p, entry) == 1) {
page = find_get_page(&swapper_space, entry.val);
if (page && unlikely(TestSetPageLocked(page))) {
page_cache_release(page);
Index: linux-2.6.27-rc1-mm1/init/Kconfig
===================================================================
--- linux-2.6.27-rc1-mm1.orig/init/Kconfig
+++ linux-2.6.27-rc1-mm1/init/Kconfig
@@ -408,6 +408,18 @@ config CGROUP_MEM_RES_CTLR
This config option also selects MM_OWNER config option, which
could in turn add some fork/exit overhead.

+config CGROUP_MEM_RES_CTLR_SWAP
+	bool "Memory Resource Controller Swap Extension"
+	depends on CGROUP_MEM_RES_CTLR
+	help
+	  Provides resource accounting for swap. When this is enabled, the
+	  memory resource controller has two limits: "limit" bounds the
+	  number of pages, while "memsw_limit" bounds the sum of pages and
+	  swap entries. Enable this if you do not want to allow excessive
+	  use of swap under the memory resource controller. Note that this
+	  extension consumes some additional kernel memory for its internal
+	  accounting.
+
config CGROUP_MEMRLIMIT_CTLR
bool "Memory resource limit controls for cgroups"
depends on CGROUPS && RESOURCE_COUNTERS && MMU

2008-08-19 08:38:42

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH -mm][preview] memcg: a patch series for next [9/9]

Add control files to the mem+swap controller.

This patch adds the following two files:
- memory.memsw_limit_in_bytes ..... limit for mem+swap usage.
- memory.swap_usage_in_bytes ..... usage of swap entries.

The following rule must be kept:
memory.memsw_limit_in_bytes >= memory.limit_in_bytes.

If it is violated, -EINVAL is returned.
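
For illustration only, a minimal userspace sketch of setting both limits in
the required order (the cgroup mount point, group name, and values here are
hypothetical; normally you would simply echo into these files from a shell):

#include <stdio.h>

/* Hypothetical example: raise memsw_limit first so that
 * memsw_limit_in_bytes >= limit_in_bytes always holds. */
int main(void)
{
	FILE *f;

	f = fopen("/cgroups/grp0/memory.memsw_limit_in_bytes", "w");
	if (!f)
		return 1;
	fprintf(f, "512M\n");	/* parsed by memparse(), so suffixes work */
	fclose(f);

	f = fopen("/cgroups/grp0/memory.limit_in_bytes", "w");
	if (!f)
		return 1;
	fprintf(f, "256M\n");	/* must not exceed memsw_limit, or -EINVAL */
	fclose(f);

	return 0;
}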

TODO:
- add Documentation.
- add function/file to force swap-in for reducing swap usage.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

---
mm/memcontrol.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 93 insertions(+), 7 deletions(-)

Index: linux-2.6.27-rc1-mm1/mm/memcontrol.c
===================================================================
--- linux-2.6.27-rc1-mm1.orig/mm/memcontrol.c
+++ linux-2.6.27-rc1-mm1/mm/memcontrol.c
@@ -268,10 +268,11 @@ enum {
MEMCG_FILE_TYPE_PAGE_USAGE,
MEMCG_FILE_TYPE_FAILCNT,
MEMCG_FILE_TYPE_MAX_USAGE,
+ MEMCG_FILE_TYPE_MEMSW_LIMIT,
+ MEMCG_FILE_TYPE_SWAP_USAGE,
};


-
/*
* Always modified under lru lock. Then, not necessary to preempt_disable()
* "flags" passed to this function is a copy of pc->flags but flags checked
@@ -415,11 +416,11 @@ mem_counter_recharge_swapout(struct mem_
}

static inline void
-mem_counter_uncharge_swap(struct mem_cgroup *memcg, long num)
+mem_counter_uncharge_swap(struct mem_cgroup *memcg)
{
unsigned long flags;
spin_lock_irqsave(&memcg->res.lock, flags);
- memcg->res.swaps -= num;
+ memcg->res.swaps -= 1;
spin_unlock_irqrestore(&memcg->res.lock, flags);
}

@@ -430,7 +431,9 @@ static int mem_counter_set_pages_limit(s
int ret = -EBUSY;

spin_lock_irqsave(&memcg->res.lock, flags);
- if (memcg->res.pages < lim) {
+ if (lim > memcg->res.memsw_limit)
+ ret = -EINVAL;
+ else if (memcg->res.pages < lim) {
memcg->res.pages_limit = lim;
ret = 0;
}
@@ -568,6 +571,25 @@ void mem_cgroup_move_lists(struct page *
}

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+
+static int mem_cgroup_set_memsw_limit(struct mem_cgroup *memcg,
+ unsigned long lim)
+{
+ unsigned long flags;
+ int ret = -EBUSY;
+
+ spin_lock_irqsave(&memcg->res.lock, flags);
+ if (memcg->res.pages_limit > lim)
+ ret = -EINVAL;
+ else if (memcg->res.pages + memcg->res.swaps < lim) {
+ memcg->res.memsw_limit = lim;
+ ret = 0;
+ }
+ spin_unlock_irqrestore(&memcg->res.lock, flags);
+
+ return ret;
+
+}
/*
* Create a space for remember swap_entry.
* Called from get_swap_page().
@@ -666,7 +688,7 @@ static void swap_cgroup_uncharge_swap(st

if (!swap_accounted(sc))
return;
- mem_counter_uncharge_swap(mem, 1);
+ mem_counter_uncharge_swap(mem);
clear_swap_accounted(sc);
}

@@ -686,7 +708,7 @@ static void swap_cgroup_delete_swap(swp_
list_del(&sc->list);
spin_unlock_irqrestore(&memcg->swap_list_lock, flags);
if (swap_accounted(sc))
- mem_counter_uncharge_swap(memcg, 1);
+ mem_counter_uncharge_swap(memcg);
css_put(&memcg->css);
kfree(sc);
}
@@ -1294,7 +1316,10 @@ int mem_cgroup_resize_limit(struct mem_c
int ret = 0;
unsigned long pages = (unsigned long)(val >> PAGE_SHIFT);

- while (mem_counter_set_pages_limit(memcg, pages)) {
+ while (1) {
+ ret = mem_counter_set_pages_limit(memcg, pages);
+ if (!ret || ret == -EINVAL)
+ break;
if (signal_pending(current)) {
ret = -EINTR;
break;
@@ -1310,6 +1335,43 @@ int mem_cgroup_resize_limit(struct mem_c
return ret;
}

+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+static int
+mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg, unsigned long long val)
+{
+ int retry_count = MEM_CGROUP_RECLAIM_RETRIES;
+ int progress;
+ int ret = 0;
+ unsigned long pages = (unsigned long)(val >> PAGE_SHIFT);
+
+ while (1) {
+ ret = mem_cgroup_set_memsw_limit(memcg, pages);
+ if (!ret || ret == -EINVAL)
+ break;
+ if (signal_pending(current)) {
+ ret = -EINTR;
+ break;
+ }
+ if (!retry_count) {
+ ret = -EBUSY;
+ break;
+ }
+ progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL);
+ if (!progress)
+ retry_count--;
+ }
+ return ret;
+
+}
+#else
+static int
+mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg, unsigned long long val)
+{
+ return -EINVAL;
+}
+#endif
+
+

/*
* This routine traverse page_cgroup in given list and drop them all.
@@ -1405,6 +1467,12 @@ static u64 mem_cgroup_read(struct cgroup
case MEMCG_FILE_TYPE_FAILCNT:
ret = memcg->res.failcnt << PAGE_SHIFT;
break;
+ case MEMCG_FILE_TYPE_SWAP_USAGE:
+ ret = memcg->res.swaps << PAGE_SHIFT;
+ break;
+ case MEMCG_FILE_TYPE_MEMSW_LIMIT:
+ ret = memcg->res.memsw_limit << PAGE_SHIFT;
+ break;
default:
BUG();
}
@@ -1441,6 +1509,11 @@ static int mem_cgroup_write(struct cgrou
if (!ret)
ret = mem_cgroup_resize_limit(memcg, val);
break;
+ case MEMCG_FILE_TYPE_MEMSW_LIMIT:
+ ret = call_memparse(buffer, &val);
+ if (!ret)
+ ret = mem_cgroup_resize_memsw_limit(memcg, val);
+ break;
default:
ret = -EINVAL; /* should be BUG() ? */
break;
@@ -1552,6 +1625,19 @@ static struct cftype mem_cgroup_files[]
.name = "stat",
.read_map = mem_control_stat_show,
},
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+ {
+ .name = "memsw_limit_in_bytes",
+ .private = MEMCG_FILE_TYPE_MEMSW_LIMIT,
+ .read_u64 = mem_cgroup_read,
+ .write_string = mem_cgroup_write,
+ },
+ {
+ .name = "swap_usage_in_bytes",
+ .private = MEMCG_FILE_TYPE_SWAP_USAGE,
+ .read_u64 = mem_cgroup_read,
+ }
+#endif
};

static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)

2008-08-19 09:12:10

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [PATCH -mm][preview] memcg: a patch series for next [0/9]

On Tue, 19 Aug 2008 17:30:14 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> Hi,
>
> This post is for showing what I'm trying now.
>
> This patch set is for memory resource controller.
> 4 purposes here.
> - improve performance of memcg.
> - remove lock_page_cgroup()
> - making page_cgroup->flags to be atomic_ops.
> - support mem+swap controller.
>

Thank you for working on mem+swap controller.

I'll review and comment later.


Thanks,
Daisuke Nishimura.

> But this is still under test and the series is not well organised.
> and base tree is old. (2.6.27-rc1-mm1) I'll rebase this set to newer mmtom tree.
>
> Maybe this set have some troubles/objections but I think the direction is not bad.
>
> Patch description. (patch ordering is bad. I'll fix in the next post.)
>
> [1/9] ... private_counter ...replace res_counter with my own counter.
> This is for supporting mem+swap controller.
> (And I think memcg has a bit different characteristics from other
> users of res_counter....)
>
> [2/9] ... change-order-uncharge ...
> This patch is for making it easy to handle swap-cache.
>
> [3/9] ... atomic_flags
> This patch changes operations for page_cgroup->flags to be atomic_ops.
>
> [4/9] ... delayed freeing.
> delaying to free page_cgroup at uncharge.
>
> [5/9] ... RCU freeing of page_cgroup
> free page_cgroup by RCU.
>
> [6/9] ... lockress page cgroup.
> remove lock_page_cgroup() and use RCU semantics.
>
> [7/9] ... add preftech
> add prefetch() macro
>
> [8/9] ... mem+swap controller base.
> introduce mem+swap controller. A bit big patch....but have tons of TODO.
> and have troubles. (it seems it's difficult to cause OOM killer.)
>
> [9/9] ... mem+swap controller control files.
> add mem+swap controller's control files.
>
> I'd like to push patch [2,3,4,5,6,7] first.
>
> Thanks,
> -Kame
>

2008-08-20 01:20:50

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH -mm][preview] memcg: a patch series for next [0/9]

On Tue, 19 Aug 2008 18:11:50 +0900
Daisuke Nishimura <[email protected]> wrote:

> On Tue, 19 Aug 2008 17:30:14 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > Hi,
> >
> > This post is for showing what I'm trying now.
> >
> > This patch set is for memory resource controller.
> > 4 purposes here.
> > - improve performance of memcg.
> > - remove lock_page_cgroup()
> > - making page_cgroup->flags to be atomic_ops.
> > - support mem+swap controller.
> >
>
> Thank you for working on mem+swap controller.
>
> I'll review and comment later.
>
My next version (writing now) will be much cleaner and clearer than this ;)
So, please review only if you have free time.

Thanks,
-Kame

>
> Thanks,
> Daisuke Nishimura.
>
> > But this is still under test and the series is not well organised.
> > and base tree is old. (2.6.27-rc1-mm1) I'll rebase this set to newer mmtom tree.
> >
> > Maybe this set have some troubles/objections but I think the direction is not bad.
> >
> > Patch description. (patch ordering is bad. I'll fix in the next post.)
> >
> > [1/9] ... private_counter ...replace res_counter with my own counter.
> > This is for supporting mem+swap controller.
> > (And I think memcg has a bit different characteristics from other
> > users of res_counter....)
> >
> > [2/9] ... change-order-uncharge ...
> > This patch is for making it easy to handle swap-cache.
> >
> > [3/9] ... atomic_flags
> > This patch changes operations for page_cgroup->flags to be atomic_ops.
> >
> > [4/9] ... delayed freeing.
> > delaying to free page_cgroup at uncharge.
> >
> > [5/9] ... RCU freeing of page_cgroup
> > free page_cgroup by RCU.
> >
> > [6/9] ... lockress page cgroup.
> > remove lock_page_cgroup() and use RCU semantics.
> >
> > [7/9] ... add preftech
> > add prefetch() macro
> >
> > [8/9] ... mem+swap controller base.
> > introduce mem+swap controller. A bit big patch....but have tons of TODO.
> > and have troubles. (it seems it's difficult to cause OOM killer.)
> >
> > [9/9] ... mem+swap controller control files.
> > add mem+swap controller's control files.
> >
> > I'd like to push patch [2,3,4,5,6,7] first.
> >
> > Thanks,
> > -Kame
> >
>

2008-08-20 03:31:10

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH -mm][preview] memcg: a patch series for next [0/9]

KAMEZAWA Hiroyuki wrote:
> Hi,
>
> This post is for showing what I'm trying now.
>
> This patch set is for memory resource controller.
> 4 purposes here.
> - improve performance of memcg.
> - remove lock_page_cgroup()
> - making page_cgroup->flags to be atomic_ops.
> - support mem+swap controller.
>
> But this is still under test and the series is not well organised.
> and base tree is old. (2.6.27-rc1-mm1) I'll rebase this set to newer mmtom tree.
>
> Maybe this set have some troubles/objections but I think the direction is not bad.
>
> Patch description. (patch ordering is bad. I'll fix in the next post.)
>
> [1/9] ... private_counter ...replace res_counter with my own counter.
> This is for supporting mem+swap controller.
> (And I think memcg has a bit different characteristics from other
> users of res_counter....)
>
> [2/9] ... change-order-uncharge ...
> This patch is for making it easy to handle swap-cache.
>
> [3/9] ... atomic_flags
> This patch changes operations for page_cgroup->flags to be atomic_ops.
>
> [4/9] ... delayed freeing.
> delaying to free page_cgroup at uncharge.
>
> [5/9] ... RCU freeing of page_cgroup
> free page_cgroup by RCU.
>
> [6/9] ... lockress page cgroup.
> remove lock_page_cgroup() and use RCU semantics.
>
> [7/9] ... add preftech
> add prefetch() macro
>
> [8/9] ... mem+swap controller base.
> introduce mem+swap controller. A bit big patch....but have tons of TODO.
> and have troubles. (it seems it's difficult to cause OOM killer.)
>
> [9/9] ... mem+swap controller control files.
> add mem+swap controller's control files.
>
> I'd like to push patch [2,3,4,5,6,7] first.
>

I took a quick look at the patches; patch 1 seemed not so clear. Why can't we
enhance or fix the resource counters? I'll review/test the patches tonight.

--
Balbir

2008-08-20 03:45:50

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH -mm][preview] memcg: a patch series for next [0/9]

On Wed, 20 Aug 2008 09:00:50 +0530
Balbir Singh <[email protected]> wrote:

> KAMEZAWA Hiroyuki wrote:
> > Hi,
> >
> > This post is for showing what I'm trying now.
> >
> > This patch set is for memory resource controller.
> > 4 purposes here.
> > - improve performance of memcg.
> > - remove lock_page_cgroup()
> > - making page_cgroup->flags to be atomic_ops.
> > - support mem+swap controller.
> >
> > But this is still under test and the series is not well organised.
> > and base tree is old. (2.6.27-rc1-mm1) I'll rebase this set to newer mmtom tree.
> >
> > Maybe this set have some troubles/objections but I think the direction is not bad.
> >
> > Patch description. (patch ordering is bad. I'll fix in the next post.)
> >
> > [1/9] ... private_counter ...replace res_counter with my own counter.
> > This is for supporting mem+swap controller.
> > (And I think memcg has a bit different characteristics from other
> > users of res_counter....)
> >
> > [2/9] ... change-order-uncharge ...
> > This patch is for making it easy to handle swap-cache.
> >
> > [3/9] ... atomic_flags
> > This patch changes operations for page_cgroup->flags to be atomic_ops.
> >
> > [4/9] ... delayed freeing.
> > delaying to free page_cgroup at uncharge.
> >
> > [5/9] ... RCU freeing of page_cgroup
> > free page_cgroup by RCU.
> >
> > [6/9] ... lockress page cgroup.
> > remove lock_page_cgroup() and use RCU semantics.
> >
> > [7/9] ... add preftech
> > add prefetch() macro
> >
> > [8/9] ... mem+swap controller base.
> > introduce mem+swap controller. A bit big patch....but have tons of TODO.
> > and have troubles. (it seems it's difficult to cause OOM killer.)
> >
> > [9/9] ... mem+swap controller control files.
> > add mem+swap controller's control files.
> >
> > I'd like to push patch [2,3,4,5,6,7] first.
> >
>
> I took a quick look at the patches, patch 1 seemed not so clear, why can't we
> enhance or fix resource counters? I'll review/test the patches tonight.
>
Patch 1 is for patch 8. (The patch ordering is bad.)
Please ignore this version; it is just a preview. (Sorry.)

I'm now writing an easier-to-read one.

thanks,
-kame





> --
> Balbir
>

2008-08-20 09:47:26

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [RFC][PATCH -mm 0/7] memcg: lockless page_cgroup v1

Hi, this is a patch set for lockless page_cgroup.

I dropped the patches related to the mem+swap controller to make review easier.
(I'm rewriting it, too.)

Changes from current -mm are:
- operations on page_cgroup->flags are now atomic.
- lock_page_cgroup() is removed.
- page->page_cgroup is changed from unsigned long to struct page_cgroup*
- page_cgroup is freed by RCU.
- To avoid races, charge/uncharge against mm/memory.c::insert_page() is
omitted. This is usually used for mapping a device's pages. (I think...)

In my quick test, performance is improved a little. But the real benefit of this
patch set is to allow access to page_cgroup without a lock. I think this is good
for Yamamoto's dirty page tracking for memcg.
For the I/O tracking people, I added a header file that allows access to
page_cgroup from outside memcontrol.c.
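
As a rough example, an outside user could look like the sketch below (this
helper is hypothetical; it only follows the RCU access rule documented in
patch 7/7):

#include <linux/rcupdate.h>
#include <linux/mm.h>
#include <linux/page_cgroup.h>

/* Hypothetical out-of-memcontrol.c user: check whether a page is
 * currently charged to a memcg as page cache. */
static int page_charged_as_cache(struct page *page)
{
	struct page_cgroup *pc;
	int ret = 0;

	rcu_read_lock();
	pc = page_get_page_cgroup(page);
	if (pc && !PcgObsolete(pc))
		ret = PcgCache(pc);
	rcu_read_unlock();

	return ret;
}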

The base kernel is a recent mmtom. Any comments are welcome.
This is still under test; I have to do a long-run test before removing "RFC".

Patches [1-4] are the core logic.

[1/7] page_cgroup_atomic_flags.patch
[2/7] delayed_batch_freeing_of_page_cgroup.patch
[3/7] freeing page_cgroup by rcu.patch
[4/7] lockess page_cgroup.patch
[5/7] add prefetch patch
[6/7] make-mapping-null-before-calling-uncharge.patch
[7/7] adding page_cgroup.h header file.patch


Thanks,
-Kame

2008-08-20 09:49:52

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [RFC][PATCH -mm 1/7] memcg: page_cgroup_atomic_flags.patch

This patch changes page_cgroup->flags to be handled with atomic bit
operations and defines functions (and macros) to access it.

This patch itself makes memcg a bit slower, but its final purpose is to
remove lock_page_cgroup() and allow fast/easy access to page_cgroup.

Before modifying the memory resource controller any further, these atomic
operations on the flags are necessary.

Changelog (preview) -> (v1):
- patch ordering is changed.
- Added macros for defining the Test/Set/Clear bit functions (a sample
  expansion is shown below).
- made the flag names shorter.
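
For reference, the new macros generate page-flag style accessors. For
example, TESTPCGFLAG(Cache, CACHE) and __SETPCGFLAG(Cache, CACHE) expand to
roughly the following (hand-expanded here purely as a reading aid):

static inline int PcgCache(struct page_cgroup *pc)
{
	return test_bit(Pcg_CACHE, &pc->flags);
}

static inline void __SetPcgCache(struct page_cgroup *pc)
{
	__set_bit(Pcg_CACHE, &pc->flags);
}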

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

---
mm/memcontrol.c | 108 +++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 77 insertions(+), 31 deletions(-)

Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -158,12 +158,57 @@ struct page_cgroup {
struct list_head lru; /* per cgroup LRU list */
struct page *page;
struct mem_cgroup *mem_cgroup;
- int flags;
+ unsigned long flags;
};
-#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
-#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
-#define PAGE_CGROUP_FLAG_FILE (0x4) /* page is file system backed */
-#define PAGE_CGROUP_FLAG_UNEVICTABLE (0x8) /* page is unevictableable */
+
+enum {
+ /* flags for mem_cgroup */
+ Pcg_CACHE, /* charged as cache */
+ /* flags for LRU placement */
+ Pcg_ACTIVE, /* page is active in this cgroup */
+ Pcg_FILE, /* page is file system backed */
+ Pcg_UNEVICTABLE, /* page is unevictableable */
+};
+
+#define TESTPCGFLAG(uname, lname) \
+static inline int Pcg##uname(struct page_cgroup *pc) \
+ { return test_bit(Pcg_##lname, &pc->flags); }
+
+#define SETPCGFLAG(uname, lname) \
+static inline void SetPcg##uname(struct page_cgroup *pc)\
+ { set_bit(Pcg_##lname, &pc->flags); }
+
+#define CLEARPCGFLAG(uname, lname) \
+static inline void ClearPcg##uname(struct page_cgroup *pc) \
+ { clear_bit(Pcg_##lname, &pc->flags); }
+
+#define __SETPCGFLAG(uname, lname) \
+static inline void __SetPcg##uname(struct page_cgroup *pc)\
+ { __set_bit(Pcg_##lname, &pc->flags); }
+
+#define __CLEARPCGFLAG(uname, lname) \
+static inline void __ClearPcg##uname(struct page_cgroup *pc) \
+ { __clear_bit(Pcg_##lname, &pc->flags); }
+
+/* Cache flag is set only once (at allocation) */
+TESTPCGFLAG(Cache, CACHE)
+__SETPCGFLAG(Cache, CACHE)
+
+/* LRU management flags (from global-lru definition) */
+TESTPCGFLAG(File, FILE)
+SETPCGFLAG(File, FILE)
+__SETPCGFLAG(File, FILE)
+CLEARPCGFLAG(File, FILE)
+
+TESTPCGFLAG(Active, ACTIVE)
+SETPCGFLAG(Active, ACTIVE)
+__SETPCGFLAG(Active, ACTIVE)
+CLEARPCGFLAG(Active, ACTIVE)
+
+TESTPCGFLAG(Unevictable, UNEVICTABLE)
+SETPCGFLAG(Unevictable, UNEVICTABLE)
+CLEARPCGFLAG(Unevictable, UNEVICTABLE)
+

static int page_cgroup_nid(struct page_cgroup *pc)
{
@@ -184,14 +229,15 @@ enum charge_type {
/*
* Always modified under lru lock. Then, not necessary to preempt_disable()
*/
-static void mem_cgroup_charge_statistics(struct mem_cgroup *mem, int flags,
- bool charge)
+static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
+ struct page_cgroup *pc,
+ bool charge)
{
int val = (charge)? 1 : -1;
struct mem_cgroup_stat *stat = &mem->stat;

VM_BUG_ON(!irqs_disabled());
- if (flags & PAGE_CGROUP_FLAG_CACHE)
+ if (PcgCache(pc))
__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_CACHE, val);
else
__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_RSS, val);
@@ -284,18 +330,18 @@ static void __mem_cgroup_remove_list(str
{
int lru = LRU_BASE;

- if (pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE)
+ if (PcgUnevictable(pc))
lru = LRU_UNEVICTABLE;
else {
- if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+ if (PcgActive(pc))
lru += LRU_ACTIVE;
- if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+ if (PcgFile(pc))
lru += LRU_FILE;
}

MEM_CGROUP_ZSTAT(mz, lru) -= 1;

- mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, false);
+ mem_cgroup_charge_statistics(pc->mem_cgroup, pc, false);
list_del(&pc->lru);
}

@@ -304,27 +350,27 @@ static void __mem_cgroup_add_list(struct
{
int lru = LRU_BASE;

- if (pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE)
+ if (PcgUnevictable(pc))
lru = LRU_UNEVICTABLE;
else {
- if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+ if (PcgActive(pc))
lru += LRU_ACTIVE;
- if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+ if (PcgFile(pc))
lru += LRU_FILE;
}

MEM_CGROUP_ZSTAT(mz, lru) += 1;
list_add(&pc->lru, &mz->lists[lru]);

- mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, true);
+ mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
}

static void __mem_cgroup_move_lists(struct page_cgroup *pc, enum lru_list lru)
{
struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
- int active = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
- int file = pc->flags & PAGE_CGROUP_FLAG_FILE;
- int unevictable = pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE;
+ int active = PcgActive(pc);
+ int file = PcgFile(pc);
+ int unevictable = PcgUnevictable(pc);
enum lru_list from = unevictable ? LRU_UNEVICTABLE :
(LRU_FILE * !!file + !!active);

@@ -334,14 +380,14 @@ static void __mem_cgroup_move_lists(stru
MEM_CGROUP_ZSTAT(mz, from) -= 1;

if (is_unevictable_lru(lru)) {
- pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
- pc->flags |= PAGE_CGROUP_FLAG_UNEVICTABLE;
+ ClearPcgActive(pc);
+ SetPcgUnevictable(pc);
} else {
if (is_active_lru(lru))
- pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
+ SetPcgActive(pc);
else
- pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
- pc->flags &= ~PAGE_CGROUP_FLAG_UNEVICTABLE;
+ ClearPcgActive(pc);
+ ClearPcgUnevictable(pc);
}

MEM_CGROUP_ZSTAT(mz, lru) += 1;
@@ -569,18 +615,19 @@ static int mem_cgroup_charge_common(stru

pc->mem_cgroup = mem;
pc->page = page;
+ pc->flags = 0;
/*
* If a page is accounted as a page cache, insert to inactive list.
* If anon, insert to active list.
*/
if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE) {
- pc->flags = PAGE_CGROUP_FLAG_CACHE;
+ __SetPcgCache(pc);
if (page_is_file_cache(page))
- pc->flags |= PAGE_CGROUP_FLAG_FILE;
+ __SetPcgFile(pc);
else
- pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
+ __SetPcgActive(pc);
} else
- pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
+ __SetPcgActive(pc);

lock_page_cgroup(page);
if (unlikely(page_get_page_cgroup(page))) {
@@ -688,8 +735,7 @@ __mem_cgroup_uncharge_common(struct page
VM_BUG_ON(pc->page != page);

if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
- && ((pc->flags & PAGE_CGROUP_FLAG_CACHE)
- || page_mapped(page)))
+ && ((PcgCache(pc) || page_mapped(page))))
goto unlock;

mz = page_cgroup_zoneinfo(pc);
@@ -739,7 +785,7 @@ int mem_cgroup_prepare_migration(struct
if (pc) {
mem = pc->mem_cgroup;
css_get(&mem->css);
- if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
+ if (PcgCache(pc))
ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
}
unlock_page_cgroup(page);

2008-08-20 09:53:34

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [RFC][PATCH -mm 2/7] memcg: delayed_batch_freeing_of_page_cgroup.patch

Free page_cgroup at mem_cgroup_uncharge() in a lazy way.

In mem_cgroup_uncharge_common(), we don't free the page_cgroup immediately;
we just link it to a per-cpu free queue and free the queued entries later,
once a threshold is reached.
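
Stripped of the LRU unlinking and css_put() that the real code also does,
the deferred-free scheme boils down to roughly this (a simplified sketch
with illustrative names, not the patch itself):

/* Simplified sketch of the per-cpu deferred free (names are illustrative). */
struct pcg_sink {
	int count;
	struct page_cgroup *head;	/* singly linked through pc->next */
};
static DEFINE_PER_CPU(struct pcg_sink, pcg_sink);
#define PCG_SINK_THRESH	(16)

static void pcg_sink_drain(void)
{
	struct pcg_sink *sink = &get_cpu_var(pcg_sink);
	struct page_cgroup *pc, *next = sink->head;

	sink->head = NULL;
	sink->count = 0;
	put_cpu_var(pcg_sink);

	while (next) {
		pc = next;
		next = pc->next;
		/* the real code also removes pc from its memcg LRU here */
		kmem_cache_free(page_cgroup_cache, pc);
	}
}

static void pcg_sink_add(struct page_cgroup *pc)
{
	struct pcg_sink *sink = &get_cpu_var(pcg_sink);
	int count;

	pc->next = sink->head;
	sink->head = pc;
	count = ++sink->count;
	put_cpu_var(pcg_sink);

	if (count >= PCG_SINK_THRESH)
		pcg_sink_drain();
}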

This patch is the base for the freeing-page_cgroup-by-RCU patch and
depends on page_cgroup_atomic_flags.patch.

Changelog: (preview) -> (v1)
- Clean up.
- renamed functions

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

---
mm/memcontrol.c | 115 ++++++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 103 insertions(+), 12 deletions(-)

Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -159,11 +159,13 @@ struct page_cgroup {
struct page *page;
struct mem_cgroup *mem_cgroup;
unsigned long flags;
+ struct page_cgroup *next;
};

enum {
/* flags for mem_cgroup */
Pcg_CACHE, /* charged as cache */
+ Pcg_OBSOLETE, /* this page cgroup is invalid (unused) */
/* flags for LRU placement */
Pcg_ACTIVE, /* page is active in this cgroup */
Pcg_FILE, /* page is file system backed */
@@ -194,6 +196,10 @@ static inline void __ClearPcg##uname(str
TESTPCGFLAG(Cache, CACHE)
__SETPCGFLAG(Cache, CACHE)

+/* No "Clear" routine for OBSOLETE flag */
+TESTPCGFLAG(Obsolete, OBSOLETE);
+SETPCGFLAG(Obsolete, OBSOLETE);
+
/* LRU management flags (from global-lru definition) */
TESTPCGFLAG(File, FILE)
SETPCGFLAG(File, FILE)
@@ -220,6 +226,18 @@ static enum zone_type page_cgroup_zid(st
return page_zonenum(pc->page);
}

+/*
+ * per-cpu slot for freeing page_cgroup in lazy manner.
+ * All page_cgroup linked to this list is OBSOLETE.
+ */
+struct mem_cgroup_sink_list {
+ int count;
+ struct page_cgroup *next;
+};
+DEFINE_PER_CPU(struct mem_cgroup_sink_list, memcg_sink_list);
+#define MEMCG_LRU_THRESH (16)
+
+
enum charge_type {
MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
MEM_CGROUP_CHARGE_TYPE_MAPPED,
@@ -427,7 +445,7 @@ void mem_cgroup_move_lists(struct page *
return;

pc = page_get_page_cgroup(page);
- if (pc) {
+ if (pc && !PcgObsolete(pc)) {
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_move_lists(pc, lru);
@@ -520,6 +538,10 @@ unsigned long mem_cgroup_isolate_pages(u
list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
if (scan >= nr_to_scan)
break;
+
+ if (PcgObsolete(pc))
+ continue;
+
page = pc->page;

if (unlikely(!PageLRU(page)))
@@ -552,6 +574,81 @@ unsigned long mem_cgroup_isolate_pages(u
}

/*
+ * Free obsolete page_cgroups which is linked to per-cpu drop list.
+ */
+
+static void __free_obsolete_page_cgroup(void)
+{
+ struct mem_cgroup *memcg;
+ struct page_cgroup *pc, *next;
+ struct mem_cgroup_per_zone *mz, *page_mz;
+ struct mem_cgroup_sink_list *mcsl;
+ unsigned long flags;
+
+ mcsl = &get_cpu_var(memcg_sink_list);
+ next = mcsl->next;
+ mcsl->next = NULL;
+ mcsl->count = 0;
+ put_cpu_var(memcg_sink_list);
+
+ mz = NULL;
+
+ local_irq_save(flags);
+ while (next) {
+ pc = next;
+ VM_BUG_ON(!PcgObsolete(pc));
+ next = pc->next;
+ prefetch(next);
+ page_mz = page_cgroup_zoneinfo(pc);
+ memcg = pc->mem_cgroup;
+ if (page_mz != mz) {
+ if (mz)
+ spin_unlock(&mz->lru_lock);
+ mz = page_mz;
+ spin_lock(&mz->lru_lock);
+ }
+ __mem_cgroup_remove_list(mz, pc);
+ css_put(&memcg->css);
+ kmem_cache_free(page_cgroup_cache, pc);
+ }
+ if (mz)
+ spin_unlock(&mz->lru_lock);
+ local_irq_restore(flags);
+}
+
+static void free_obsolete_page_cgroup(struct page_cgroup *pc)
+{
+ int count;
+ struct mem_cgroup_sink_list *mcsl;
+
+ mcsl = &get_cpu_var(memcg_sink_list);
+ pc->next = mcsl->next;
+ mcsl->next = pc;
+ count = ++mcsl->count;
+ put_cpu_var(memcg_sink_list);
+ if (count >= MEMCG_LRU_THRESH)
+ __free_obsolete_page_cgroup();
+}
+
+/*
+ * Used when freeing memory resource controller to remove all
+ * page_cgroup (in obsolete list).
+ */
+static DEFINE_MUTEX(memcg_force_drain_mutex);
+
+static void mem_cgroup_local_force_drain(struct work_struct *work)
+{
+ __free_obsolete_page_cgroup();
+}
+
+static void mem_cgroup_all_force_drain(void)
+{
+ mutex_lock(&memcg_force_drain_mutex);
+ schedule_on_each_cpu(mem_cgroup_local_force_drain);
+ mutex_unlock(&memcg_force_drain_mutex);
+}
+
+/*
* Charge the memory controller for page usage.
* Return
* 0 if the charge was successful
@@ -616,6 +713,7 @@ static int mem_cgroup_charge_common(stru
pc->mem_cgroup = mem;
pc->page = page;
pc->flags = 0;
+ pc->next = NULL;
/*
* If a page is accounted as a page cache, insert to inactive list.
* If anon, insert to active list.
@@ -718,8 +816,6 @@ __mem_cgroup_uncharge_common(struct page
{
struct page_cgroup *pc;
struct mem_cgroup *mem;
- struct mem_cgroup_per_zone *mz;
- unsigned long flags;

if (mem_cgroup_subsys.disabled)
return;
@@ -737,20 +833,14 @@ __mem_cgroup_uncharge_common(struct page
if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
&& ((PcgCache(pc) || page_mapped(page))))
goto unlock;
-
- mz = page_cgroup_zoneinfo(pc);
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_remove_list(mz, pc);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
-
+ mem = pc->mem_cgroup;
+ SetPcgObsolete(pc);
page_assign_page_cgroup(page, NULL);
unlock_page_cgroup(page);

- mem = pc->mem_cgroup;
res_counter_uncharge(&mem->res, PAGE_SIZE);
- css_put(&mem->css);
+ free_obsolete_page_cgroup(pc);

- kmem_cache_free(page_cgroup_cache, pc);
return;
unlock:
unlock_page_cgroup(page);
@@ -943,6 +1033,7 @@ static int mem_cgroup_force_empty(struct
}
}
ret = 0;
+ mem_cgroup_all_force_drain();
out:
css_put(&mem->css);
return ret;

2008-08-20 09:57:35

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [RFC][PATCH -mm 3/7] memcg: freeing page_cgroup by rcu.patch

With delayed_batch_freeing_of_page_cgroup.patch, page_cgroup can be
freed lazily. After this patch, page_cgroup is freed via RCU and is RCU
safe. This is necessary for the lockless page_cgroup patch.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

---
mm/memcontrol.c | 44 ++++++++++++++++++++++++++++++++++++--------
1 file changed, 36 insertions(+), 8 deletions(-)

Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -577,19 +577,23 @@ unsigned long mem_cgroup_isolate_pages(u
* Free obsolete page_cgroups which is linked to per-cpu drop list.
*/

-static void __free_obsolete_page_cgroup(void)
+struct page_cgroup_rcu_work {
+ struct rcu_head head;
+ struct page_cgroup *list;
+};
+
+static void __free_obsolete_page_cgroup_cb(struct rcu_head *head)
{
struct mem_cgroup *memcg;
struct page_cgroup *pc, *next;
struct mem_cgroup_per_zone *mz, *page_mz;
- struct mem_cgroup_sink_list *mcsl;
+ struct page_cgroup_rcu_work *work;
unsigned long flags;

- mcsl = &get_cpu_var(memcg_sink_list);
- next = mcsl->next;
- mcsl->next = NULL;
- mcsl->count = 0;
- put_cpu_var(memcg_sink_list);
+
+ work = container_of(head, struct page_cgroup_rcu_work, head);
+ next = work->list;
+ kfree(work);

mz = NULL;

@@ -616,6 +620,26 @@ static void __free_obsolete_page_cgroup(
local_irq_restore(flags);
}

+static int __free_obsolete_page_cgroup(void)
+{
+ struct page_cgroup_rcu_work *work;
+ struct mem_cgroup_sink_list *mcsl;
+
+ work = kmalloc(sizeof(*work), GFP_ATOMIC);
+ if (!work)
+ return -ENOMEM;
+ INIT_RCU_HEAD(&work->head);
+
+ mcsl = &get_cpu_var(memcg_sink_list);
+ work->list = mcsl->next;
+ mcsl->next = NULL;
+ mcsl->count = 0;
+ put_cpu_var(memcg_sink_list);
+
+ call_rcu(&work->head, __free_obsolete_page_cgroup_cb);
+ return 0;
+}
+
static void free_obsolete_page_cgroup(struct page_cgroup *pc)
{
int count;
@@ -638,13 +662,17 @@ static DEFINE_MUTEX(memcg_force_drain_mu

static void mem_cgroup_local_force_drain(struct work_struct *work)
{
- __free_obsolete_page_cgroup();
+ int ret;
+ do {
+ ret = __free_obsolete_page_cgroup();
+ } while (ret);
}

static void mem_cgroup_all_force_drain(void)
{
mutex_lock(&memcg_force_drain_mutex);
schedule_on_each_cpu(mem_cgroup_local_force_drain);
+ synchronize_rcu();
mutex_unlock(&memcg_force_drain_mutex);
}

2008-08-20 09:58:53

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [RFC][PATCH -mm 4/7] memcg: lockless page_cgroup

This patch removes lock_page_cgroup(). Now, page_cgroup is guarded by RCU.

To remove lock_page_cgroup(), we have to confirm there is no race.

Anon pages:
 * Pages are charged/uncharged only when first-mapped/last-unmapped;
 page_mapcount() handles that.
 (And pte_lock() is always held in any racy case.)

Swap pages:
 There would be a race because the charge is done before lock_page().
 This patch moves mem_cgroup_charge() under lock_page().

File pages (not shmem):
 * Pages are charged/uncharged only when they are added to / removed from
 the radix-tree. In this case, the page is always locked.

Install page:
 Is it worth charging this special mapped page, which is (maybe) not on
 the LRU? I think not, so I removed charge/uncharge from insert_page().

Page migration:
 We precharge the new page and map it back under lock_page(). This should
 be treated as a special case.

Freeing of page_cgroup is done under RCU.

After this patch, page_cgroup can be accessed via

**
rcu_read_lock();
pc = page_get_page_cgroup(page);
if (pc && !PcgObsolete(pc)) {
......
}
rcu_read_unlock();
**

This is now under test. Don't apply if you're not brave.

Changelog: (preview) -> (v1)
- Added comments.
- Fixed page migration.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>


---
include/linux/mm_types.h | 2
mm/memcontrol.c | 119 +++++++++++++++++------------------------------
mm/memory.c | 16 +-----
3 files changed, 51 insertions(+), 86 deletions(-)

Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -137,20 +137,6 @@ struct mem_cgroup {
static struct mem_cgroup init_mem_cgroup;

/*
- * We use the lower bit of the page->page_cgroup pointer as a bit spin
- * lock. We need to ensure that page->page_cgroup is at least two
- * byte aligned (based on comments from Nick Piggin). But since
- * bit_spin_lock doesn't actually set that lock bit in a non-debug
- * uniprocessor kernel, we should avoid setting it here too.
- */
-#define PAGE_CGROUP_LOCK_BIT 0x0
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
-#define PAGE_CGROUP_LOCK (1 << PAGE_CGROUP_LOCK_BIT)
-#else
-#define PAGE_CGROUP_LOCK 0x0
-#endif
-
-/*
* A page_cgroup page is associated with every page descriptor. The
* page_cgroup helps us identify information about the cgroup
*/
@@ -312,35 +298,14 @@ struct mem_cgroup *mem_cgroup_from_task(
struct mem_cgroup, css);
}

-static inline int page_cgroup_locked(struct page *page)
-{
- return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
{
- VM_BUG_ON(!page_cgroup_locked(page));
- page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
+ rcu_assign_pointer(page->page_cgroup, pc);
}

struct page_cgroup *page_get_page_cgroup(struct page *page)
{
- return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
-}
-
-static void lock_page_cgroup(struct page *page)
-{
- bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static int try_lock_page_cgroup(struct page *page)
-{
- return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static void unlock_page_cgroup(struct page *page)
-{
- bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
+ return rcu_dereference(page->page_cgroup);
}

static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
@@ -434,16 +399,7 @@ void mem_cgroup_move_lists(struct page *
if (mem_cgroup_subsys.disabled)
return;

- /*
- * We cannot lock_page_cgroup while holding zone's lru_lock,
- * because other holders of lock_page_cgroup can be interrupted
- * with an attempt to rotate_reclaimable_page. But we cannot
- * safely get to page_cgroup without it, so just try_lock it:
- * mem_cgroup_isolate_pages allows for page left on wrong list.
- */
- if (!try_lock_page_cgroup(page))
- return;
-
+ rcu_read_lock();
pc = page_get_page_cgroup(page);
if (pc && !PcgObsolete(pc)) {
mz = page_cgroup_zoneinfo(pc);
@@ -451,7 +407,7 @@ void mem_cgroup_move_lists(struct page *
__mem_cgroup_move_lists(pc, lru);
spin_unlock_irqrestore(&mz->lru_lock, flags);
}
- unlock_page_cgroup(page);
+ rcu_read_unlock();
}

/*
@@ -755,14 +711,9 @@ static int mem_cgroup_charge_common(stru
} else
__SetPcgActive(pc);

- lock_page_cgroup(page);
- if (unlikely(page_get_page_cgroup(page))) {
- unlock_page_cgroup(page);
- res_counter_uncharge(&mem->res, PAGE_SIZE);
- css_put(&mem->css);
- kmem_cache_free(page_cgroup_cache, pc);
- goto done;
- }
+ /* Double counting race condition ? */
+ VM_BUG_ON(page_get_page_cgroup(page));
+
page_assign_page_cgroup(page, pc);

mz = page_cgroup_zoneinfo(pc);
@@ -770,8 +721,6 @@ static int mem_cgroup_charge_common(stru
__mem_cgroup_add_list(mz, pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);

- unlock_page_cgroup(page);
-done:
return 0;
out:
css_put(&mem->css);
@@ -796,6 +745,28 @@ int mem_cgroup_charge(struct page *page,
return 0;
if (unlikely(!mm))
mm = &init_mm;
+ /*
+	 * Check for the pre-charged case of an anonymous page,
+	 * i.e. page migration.
+	 *
+	 * Under page migration, the new page (target of migration) is charged
+	 * before being mapped, and page->mapping points to an anon_vma.
+	 * Check here whether we've already charged this page or not.
+	 *
+	 * Note that we don't charge a newly allocated page here; the page
+	 * should be locked to avoid races.
+ */
+ if (PageAnon(page)) {
+ struct page_cgroup *pc;
+ VM_BUG_ON(!PageLocked(page));
+ rcu_read_lock();
+ pc = page_get_page_cgroup(page);
+ if (pc && !PcgObsolete(pc)) {
+ rcu_read_unlock();
+ return 0;
+ }
+ rcu_read_unlock();
+ }
return mem_cgroup_charge_common(page, mm, gfp_mask,
MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
}
@@ -813,20 +784,21 @@ int mem_cgroup_cache_charge(struct page
*
* For GFP_NOWAIT case, the page may be pre-charged before calling
* add_to_page_cache(). (See shmem.c) check it here and avoid to call
- * charge twice. (It works but has to pay a bit larger cost.)
+ * charge twice.
+ *
+ * Note: page migration doesn't call add_to_page_cache(). We can ignore
+ * the case.
*/
if (!(gfp_mask & __GFP_WAIT)) {
struct page_cgroup *pc;
-
- lock_page_cgroup(page);
+ rcu_read_lock();
pc = page_get_page_cgroup(page);
- if (pc) {
+ if (pc && !PcgObsolete(pc)) {
VM_BUG_ON(pc->page != page);
VM_BUG_ON(!pc->mem_cgroup);
- unlock_page_cgroup(page);
return 0;
}
- unlock_page_cgroup(page);
+ rcu_read_unlock();
}

if (unlikely(!mm))
@@ -851,27 +823,26 @@ __mem_cgroup_uncharge_common(struct page
/*
* Check if our page_cgroup is valid
*/
- lock_page_cgroup(page);
+ rcu_read_lock();
pc = page_get_page_cgroup(page);
- if (unlikely(!pc))
- goto unlock;
+ if (unlikely(!pc) || PcgObsolete(pc))
+ goto out;

VM_BUG_ON(pc->page != page);

if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
&& ((PcgCache(pc) || page_mapped(page))))
- goto unlock;
+ goto out;
mem = pc->mem_cgroup;
SetPcgObsolete(pc);
page_assign_page_cgroup(page, NULL);
- unlock_page_cgroup(page);

res_counter_uncharge(&mem->res, PAGE_SIZE);
free_obsolete_page_cgroup(pc);

+out:
+ rcu_read_unlock();
return;
-unlock:
- unlock_page_cgroup(page);
}

void mem_cgroup_uncharge_page(struct page *page)
@@ -898,15 +869,15 @@ int mem_cgroup_prepare_migration(struct
if (mem_cgroup_subsys.disabled)
return 0;

- lock_page_cgroup(page);
+ rcu_read_lock();
pc = page_get_page_cgroup(page);
- if (pc) {
+ if (pc && !PcgObsolete(pc)) {
mem = pc->mem_cgroup;
css_get(&mem->css);
if (PcgCache(pc))
ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
}
- unlock_page_cgroup(page);
+ rcu_read_unlock();
if (mem) {
ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
ctype, mem);
Index: mmtom-2.6.27-rc3+/mm/memory.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memory.c
+++ mmtom-2.6.27-rc3+/mm/memory.c
@@ -1323,18 +1323,14 @@ static int insert_page(struct vm_area_st
pte_t *pte;
spinlock_t *ptl;

- retval = mem_cgroup_charge(page, mm, GFP_KERNEL);
- if (retval)
- goto out;
-
retval = -EINVAL;
if (PageAnon(page))
- goto out_uncharge;
+ goto out;
retval = -ENOMEM;
flush_dcache_page(page);
pte = get_locked_pte(mm, addr, &ptl);
if (!pte)
- goto out_uncharge;
+ goto out;
retval = -EBUSY;
if (!pte_none(*pte))
goto out_unlock;
@@ -1350,8 +1346,6 @@ static int insert_page(struct vm_area_st
return retval;
out_unlock:
pte_unmap_unlock(pte, ptl);
-out_uncharge:
- mem_cgroup_uncharge_page(page);
out:
return retval;
}
@@ -2325,16 +2319,16 @@ static int do_swap_page(struct mm_struct
ret = VM_FAULT_MAJOR;
count_vm_event(PGMAJFAULT);
}
+ lock_page(page);
+ delayacct_clear_flag(DELAYACCT_PF_SWAPIN);

if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
- delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
ret = VM_FAULT_OOM;
+ unlock_page(page);
goto out;
}

mark_page_accessed(page);
- lock_page(page);
- delayacct_clear_flag(DELAYACCT_PF_SWAPIN);

/*
* Back out if somebody else already faulted in this pte.
Index: mmtom-2.6.27-rc3+/include/linux/mm_types.h
===================================================================
--- mmtom-2.6.27-rc3+.orig/include/linux/mm_types.h
+++ mmtom-2.6.27-rc3+/include/linux/mm_types.h
@@ -93,7 +93,7 @@ struct page {
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
- unsigned long page_cgroup;
+ struct page_cgroup *page_cgroup;
#endif

#ifdef CONFIG_KMEMCHECK

2008-08-20 10:00:00

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [RFC][PATCH -mm 5/7] memcg: prefetch mem cgroup per zone

Address of "mz" can be calculated in early stage.
prefetch it (we always do spin_lock later.)


Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -694,6 +694,8 @@ static int mem_cgroup_charge_common(stru
}
}

+ mz = mem_cgroup_zoneinfo(mem, page_to_nid(page), page_zonenum(page));
+ prefetchw(mz);
pc->mem_cgroup = mem;
pc->page = page;
pc->flags = 0;
@@ -716,7 +718,6 @@ static int mem_cgroup_charge_common(stru

page_assign_page_cgroup(page, pc);

- mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_add_list(mz, pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);

2008-08-20 10:00:54

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [RFC][PATCH -mm 6/7] memcg: make-mapping-null-before-calling-uncharge.patch

This patch tries to make page->mapping NULL before
mem_cgroup_uncharge_cache_page() is called.

"page->mapping == NULL" is a good check for whether the page is still
on the radix-tree or not.

This patch also adds a VM_BUG_ON() to mem_cgroup_uncharge_cache_page().


Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

---
mm/filemap.c | 2 +-
mm/memcontrol.c | 1 +
mm/migrate.c | 11 +++++++++--
3 files changed, 11 insertions(+), 3 deletions(-)

Index: mmtom-2.6.27-rc3+/mm/filemap.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/filemap.c
+++ mmtom-2.6.27-rc3+/mm/filemap.c
@@ -116,12 +116,12 @@ void __remove_from_page_cache(struct pag
{
struct address_space *mapping = page->mapping;

- mem_cgroup_uncharge_cache_page(page);
radix_tree_delete(&mapping->page_tree, page->index);
page->mapping = NULL;
mapping->nrpages--;
__dec_zone_page_state(page, NR_FILE_PAGES);
BUG_ON(page_mapped(page));
+ mem_cgroup_uncharge_cache_page(page);

/*
* Some filesystems seem to re-dirty the page even after
Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -854,6 +854,7 @@ void mem_cgroup_uncharge_page(struct pag
void mem_cgroup_uncharge_cache_page(struct page *page)
{
VM_BUG_ON(page_mapped(page));
+ VM_BUG_ON(page->mapping);
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
}

Index: mmtom-2.6.27-rc3+/mm/migrate.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/migrate.c
+++ mmtom-2.6.27-rc3+/mm/migrate.c
@@ -330,8 +330,6 @@ static int migrate_page_move_mapping(str
__inc_zone_page_state(newpage, NR_FILE_PAGES);

spin_unlock_irq(&mapping->tree_lock);
- if (!PageSwapCache(newpage))
- mem_cgroup_uncharge_cache_page(page);

return 0;
}
@@ -379,6 +377,15 @@ static void migrate_page_copy(struct pag
ClearPagePrivate(page);
set_page_private(page, 0);
page->mapping = NULL;
+ /* page->mapping contains a flag for PageAnon() */
+ if (PageAnon(page)) {
+ /* This page is uncharged at try_to_unmap(). */
+ page->mapping = NULL;
+ } else {
+ /* Obsolete file cache should be uncharged */
+ page->mapping = NULL;
+ mem_cgroup_uncharge_cache_page(page);
+ }

/*
* If any waiters have accumulated on the new page then

2008-08-20 10:03:30

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [RFC][PATCH -mm 7/7] memcg: add page_cgroup.h header file

Experimental...I wonder whether this is enough for potential users.
==

page_cgroup is a struct for accounting each page under the memory resource
controller. Currently, it's only used inside memcontrol.c, but there are
possible users of this struct now.
(*) Because page_cgroup is an extended/on-demand mem_map by nature,
    there are people who want to use it for recording information.

If there are no such users, this patch is not necessary.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

---
include/linux/page_cgroup.h | 100 ++++++++++++++++++++++++++++++++++++++++++++
mm/memcontrol.c | 82 ------------------------------------
2 files changed, 101 insertions(+), 81 deletions(-)

Index: mmtom-2.6.27-rc3+/include/linux/page_cgroup.h
===================================================================
--- /dev/null
+++ mmtom-2.6.27-rc3+/include/linux/page_cgroup.h
@@ -0,0 +1,100 @@
+#ifndef __LINUX_PAGE_CGROUP_H
+#define __LINUX_PAGE_CGROUP_H
+
+/*
+ * A page_cgroup page is associated with every page descriptor. The
+ * page_cgroup helps us identify information about the cgroup.
+ *
+ * This is pointed to from struct page via the page->page_cgroup pointer.
+ * The pointer is safe to dereference under RCU. If a page_cgroup is
+ * marked as Obsolete, don't access it.
+ *
+ * The typical way to access a page_cgroup is the following:
+ *
+ * rcu_read_lock();
+ * pc = page_get_page_cgroup(page);
+ * if (pc && !PcgObsolete(pc)) {
+ * ......
+ * }
+ * rcu_read_unlock();
+ *
+ */
+struct page_cgroup {
+ struct list_head lru; /* per zone/memcg LRU list */
+ struct page *page; /* the page this accounts for */
+ struct mem_cgroup *mem_cgroup; /* belongs to this mem_cgroup */
+ unsigned long flags;
+ struct page_cgroup *next;
+};
+
+enum {
+ /* flags for mem_cgroup */
+ Pcg_CACHE, /* charged as cache */
+ Pcg_OBSOLETE, /* this page cgroup is invalid (unused) */
+ /* flags for LRU placement */
+ Pcg_ACTIVE, /* page is active in this cgroup */
+ Pcg_FILE, /* page is file system backed */
+ Pcg_UNEVICTABLE, /* page is unevictableable */
+};
+
+#define TESTPCGFLAG(uname, lname) \
+static inline int Pcg##uname(struct page_cgroup *pc) \
+ { return test_bit(Pcg_##lname, &pc->flags); }
+
+#define SETPCGFLAG(uname, lname) \
+static inline void SetPcg##uname(struct page_cgroup *pc)\
+ { set_bit(Pcg_##lname, &pc->flags); }
+
+#define CLEARPCGFLAG(uname, lname) \
+static inline void ClearPcg##uname(struct page_cgroup *pc) \
+ { clear_bit(Pcg_##lname, &pc->flags); }
+
+#define __SETPCGFLAG(uname, lname) \
+static inline void __SetPcg##uname(struct page_cgroup *pc)\
+ { __set_bit(Pcg_##lname, &pc->flags); }
+
+#define __CLEARPCGFLAG(uname, lname) \
+static inline void __ClearPcg##uname(struct page_cgroup *pc) \
+ { __clear_bit(Pcg_##lname, &pc->flags); }
+
+/* Cache flag is set only once (at allocation) */
+TESTPCGFLAG(Cache, CACHE)
+__SETPCGFLAG(Cache, CACHE)
+
+/* No "Clear" routine for OBSOLETE flag */
+TESTPCGFLAG(Obsolete, OBSOLETE);
+SETPCGFLAG(Obsolete, OBSOLETE);
+
+/* LRU management flags (from global-lru definition) */
+TESTPCGFLAG(File, FILE)
+SETPCGFLAG(File, FILE)
+__SETPCGFLAG(File, FILE)
+CLEARPCGFLAG(File, FILE)
+
+TESTPCGFLAG(Active, ACTIVE)
+SETPCGFLAG(Active, ACTIVE)
+__SETPCGFLAG(Active, ACTIVE)
+CLEARPCGFLAG(Active, ACTIVE)
+
+TESTPCGFLAG(Unevictable, UNEVICTABLE)
+SETPCGFLAG(Unevictable, UNEVICTABLE)
+CLEARPCGFLAG(Unevictable, UNEVICTABLE)
+
+
+static inline int page_cgroup_nid(struct page_cgroup *pc)
+{
+ return page_to_nid(pc->page);
+}
+
+static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
+{
+ return page_zonenum(pc->page);
+}
+
+struct page_cgroup *page_get_page_cgroup(struct page *page)
+{
+ return rcu_dereference(page->page_cgroup);
+}
+
+
+#endif
Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -33,7 +33,7 @@
#include <linux/seq_file.h>
#include <linux/vmalloc.h>
#include <linux/mm_inline.h>
-
+#include <linux/page_cgroup.h>
#include <asm/uaccess.h>

struct cgroup_subsys mem_cgroup_subsys __read_mostly;
@@ -136,81 +136,6 @@ struct mem_cgroup {
};
static struct mem_cgroup init_mem_cgroup;

-/*
- * A page_cgroup page is associated with every page descriptor. The
- * page_cgroup helps us identify information about the cgroup
- */
-struct page_cgroup {
- struct list_head lru; /* per cgroup LRU list */
- struct page *page;
- struct mem_cgroup *mem_cgroup;
- unsigned long flags;
- struct page_cgroup *next;
-};
-
-enum {
- /* flags for mem_cgroup */
- Pcg_CACHE, /* charged as cache */
- Pcg_OBSOLETE, /* this page cgroup is invalid (unused) */
- /* flags for LRU placement */
- Pcg_ACTIVE, /* page is active in this cgroup */
- Pcg_FILE, /* page is file system backed */
- Pcg_UNEVICTABLE, /* page is unevictableable */
-};
-
-#define TESTPCGFLAG(uname, lname) \
-static inline int Pcg##uname(struct page_cgroup *pc) \
- { return test_bit(Pcg_##lname, &pc->flags); }
-
-#define SETPCGFLAG(uname, lname) \
-static inline void SetPcg##uname(struct page_cgroup *pc)\
- { set_bit(Pcg_##lname, &pc->flags); }
-
-#define CLEARPCGFLAG(uname, lname) \
-static inline void ClearPcg##uname(struct page_cgroup *pc) \
- { clear_bit(Pcg_##lname, &pc->flags); }
-
-#define __SETPCGFLAG(uname, lname) \
-static inline void __SetPcg##uname(struct page_cgroup *pc)\
- { __set_bit(Pcg_##lname, &pc->flags); }
-
-#define __CLEARPCGFLAG(uname, lname) \
-static inline void __ClearPcg##uname(struct page_cgroup *pc) \
- { __clear_bit(Pcg_##lname, &pc->flags); }
-
-/* Cache flag is set only once (at allocation) */
-TESTPCGFLAG(Cache, CACHE)
-__SETPCGFLAG(Cache, CACHE)
-
-/* No "Clear" routine for OBSOLETE flag */
-TESTPCGFLAG(Obsolete, OBSOLETE);
-SETPCGFLAG(Obsolete, OBSOLETE);
-
-/* LRU management flags (from global-lru definition) */
-TESTPCGFLAG(File, FILE)
-SETPCGFLAG(File, FILE)
-__SETPCGFLAG(File, FILE)
-CLEARPCGFLAG(File, FILE)
-
-TESTPCGFLAG(Active, ACTIVE)
-SETPCGFLAG(Active, ACTIVE)
-__SETPCGFLAG(Active, ACTIVE)
-CLEARPCGFLAG(Active, ACTIVE)
-
-TESTPCGFLAG(Unevictable, UNEVICTABLE)
-SETPCGFLAG(Unevictable, UNEVICTABLE)
-CLEARPCGFLAG(Unevictable, UNEVICTABLE)
-
-
-static int page_cgroup_nid(struct page_cgroup *pc)
-{
- return page_to_nid(pc->page);
-}
-
-static enum zone_type page_cgroup_zid(struct page_cgroup *pc)
-{
- return page_zonenum(pc->page);
-}

/*
* per-cpu slot for freeing page_cgroup in lazy manner.
@@ -303,11 +228,6 @@ static void page_assign_page_cgroup(stru
rcu_assign_pointer(page->page_cgroup, pc);
}

-struct page_cgroup *page_get_page_cgroup(struct page *page)
-{
- return rcu_dereference(page->page_cgroup);
-}
-
static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
struct page_cgroup *pc)
{

2008-08-20 10:35:32

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH -mm 0/7] memcg: lockless page_cgroup v1

On Wed, 20 Aug 2008 18:53:06 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> Hi, this is a patch set for lockless page_cgroup.
>
> dropped patches related to mem+swap controller for easy review.
> (I'm rewriting it, too.)
>
> Changes from current -mm is.
> - page_cgroup->flags operations is set to be atomic.
> - lock_page_cgroup() is removed.
> - page->page_cgroup is changed from unsigned long to struct page_cgroup*
> - page_cgroup is freed by RCU.
> - For avoiding race, charge/uncharge against mm/memory.c::insert_page() is
> omitted. This is ususally used for mapping device's page. (I think...)
>
> In my quick test, perfomance is improved a little. But the benefit of this
> patch is to allow access page_cgroup without lock. I think this is good
> for Yamamoto's Dirty page tracking for memcg.
> For I/O tracking people, I added a header file for allowing access to
> page_cgroup from out of memcontrol.c
>
> The base kernel is recent mmtom. Any comments are welcome.
> This is still under test. I have to do long-run test before removing "RFC".
>
Known problem: force_empty is broken... so rmdir will get stuck.
It's because of patch 2/7.
This will be fixed in the next version.

Thanks,
-Kame

2008-08-20 10:54:41

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH -mm 0/7] memcg: lockless page_cgroup v1

On Wed, 20 Aug 2008 19:41:08 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Wed, 20 Aug 2008 18:53:06 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > Hi, this is a patch set for lockless page_cgroup.
> >
> > dropped patches related to mem+swap controller for easy review.
> > (I'm rewriting it, too.)
> >
> > Changes from current -mm is.
> > - page_cgroup->flags operations is set to be atomic.
> > - lock_page_cgroup() is removed.
> > - page->page_cgroup is changed from unsigned long to struct page_cgroup*
> > - page_cgroup is freed by RCU.
> > - For avoiding race, charge/uncharge against mm/memory.c::insert_page() is
> > omitted. This is ususally used for mapping device's page. (I think...)
> >
> > In my quick test, perfomance is improved a little. But the benefit of this
> > patch is to allow access page_cgroup without lock. I think this is good
> > for Yamamoto's Dirty page tracking for memcg.
> > For I/O tracking people, I added a header file for allowing access to
> > page_cgroup from out of memcontrol.c
> >
> > The base kernel is recent mmtom. Any comments are welcome.
> > This is still under test. I have to do long-run test before removing "RFC".
> >
> Known problem: force_emtpy is broken...so rmdir will struck into nightmare.
> It's because of patch 2/7.
> will be fixed in the next version.
>

This is a quick fix, but I think I can find a better solution.
==
Because removal from the LRU is delayed, mz->lru will never become empty until
someone kicks a drain. This patch rotates the LRU during force_empty so that
obsolete page_cgroups get freed.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>


---
mm/memcontrol.c | 40 +++++++++++++++++++++++++---------------
1 file changed, 25 insertions(+), 15 deletions(-)

Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -893,34 +893,45 @@ static void mem_cgroup_force_empty_list(
struct mem_cgroup_per_zone *mz,
enum lru_list lru)
{
- struct page_cgroup *pc;
+ struct page_cgroup *pc, *tmp;
struct page *page;
int count = FORCE_UNCHARGE_BATCH;
unsigned long flags;
struct list_head *list;
+ int drain, rotate;

list = &mz->lists[lru];

spin_lock_irqsave(&mz->lru_lock, flags);
+ rotate = 0;
while (!list_empty(list)) {
pc = list_entry(list->prev, struct page_cgroup, lru);
- page = pc->page;
- get_page(page);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
- /*
- * Check if this page is on LRU. !LRU page can be found
- * if it's under page migration.
- */
- if (PageLRU(page)) {
- __mem_cgroup_uncharge_common(page,
- MEM_CGROUP_CHARGE_TYPE_FORCE);
- put_page(page);
+ drain = PcgObsolete(pc);
+ if (drain) {
+ /* Skip this */
+ list_move(&pc->lru, list);
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+ rotate++;
+ if (rotate > MEMCG_LRU_THRESH/2)
+ mem_cgroup_all_force_drain();
+ cond_resched();
+ } else {
+ page = pc->page;
+ get_page(page);
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+ /*
+ * Check if this page is on LRU. !LRU page can be found
+ * if it's under page migration.
+ */
+ if (PageLRU(page)) {
+ __mem_cgroup_uncharge_common(page,
+ MEM_CGROUP_CHARGE_TYPE_FORCE);
+ }
if (--count <= 0) {
count = FORCE_UNCHARGE_BATCH;
cond_resched();
}
- } else
- cond_resched();
+ }
spin_lock_irqsave(&mz->lru_lock, flags);
}
spin_unlock_irqrestore(&mz->lru_lock, flags);
@@ -954,7 +965,6 @@ static int mem_cgroup_force_empty(struct
}
}
ret = 0;
- mem_cgroup_all_force_drain();
out:
css_put(&mem->css);
return ret;

2008-08-20 11:33:41

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: [RFC][PATCH -mm 0/7] memcg: lockless page_cgroup v1

Hi,

> Hi, this is a patch set for lockless page_cgroup.
>
> dropped patches related to mem+swap controller for easy review.
> (I'm rewriting it, too.)
>
> Changes from current -mm are:
> - operations on page_cgroup->flags are now atomic.
> - lock_page_cgroup() is removed.
> - page->page_cgroup is changed from unsigned long to struct page_cgroup*
> - page_cgroup is freed by RCU.
> - To avoid races, charge/uncharge against mm/memory.c::insert_page() is
> omitted. This is usually used for mapping a device's pages. (I think...)
>
> In my quick test, performance is improved a little. But the benefit of this
> patch is to allow access to page_cgroup without a lock. I think this is good
> for Yamamoto's dirty page tracking for memcg.
> For I/O tracking people, I added a header file for allowing access to
> page_cgroup from outside of memcontrol.c

Thanks, Kame.
It is good news that the page tracking framework is being opened up.
I think I can send some feedback to you to make it more generic.

> The base kernel is recent mmtom. Any comments are welcome.
> This is still under test. I have to do long-run test before removing "RFC".
>

Thanks,
Hirokazu Takahashi.

2008-08-21 02:11:36

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH -mm 0/7] memcg: lockless page_cgroup v1

On Wed, 20 Aug 2008 20:00:06 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Wed, 20 Aug 2008 19:41:08 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > On Wed, 20 Aug 2008 18:53:06 +0900
> > KAMEZAWA Hiroyuki <[email protected]> wrote:
> >
> > > Hi, this is a patch set for lockless page_cgroup.
> > >
> > > dropped patches related to mem+swap controller for easy review.
> > > (I'm rewriting it, too.)
> > >
> > > Changes from current -mm are:
> > > - operations on page_cgroup->flags are now atomic.
> > > - lock_page_cgroup() is removed.
> > > - page->page_cgroup is changed from unsigned long to struct page_cgroup*
> > > - page_cgroup is freed by RCU.
> > > - To avoid races, charge/uncharge against mm/memory.c::insert_page() is
> > > omitted. This is usually used for mapping a device's pages. (I think...)
> > >
> > > In my quick test, performance is improved a little. But the benefit of this
> > > patch is to allow access to page_cgroup without a lock. I think this is good
> > > for Yamamoto's dirty page tracking for memcg.
> > > For I/O tracking people, I added a header file for allowing access to
> > > page_cgroup from outside of memcontrol.c
> > >
> > > The base kernel is recent mmtom. Any comments are welcome.
> > > This is still under test. I have to do long-run test before removing "RFC".
> > >
> > Known problem: force_empty is broken... so rmdir will get stuck in a nightmare.
> > It's because of patch 2/7.
> > It will be fixed in the next version.
> >
>
> This is a quick fix, but I think I can find a better solution.
> ==
> Because removal from the LRU is delayed, mz->lru will never be empty until
> someone kicks a drain. This patch rotates the LRU during force_empty so that
> page_cgroup entries can be freed.
>

I'd like to rewrite force_empty to move all usage to the "default" cgroup.
There are some reasons:

1. The current force_empty creates a live page which has no page_cgroup.
This is bad for routines that want to access a page_cgroup from its page,
and this behavior will become a race-condition issue in the future.
2. We can see the amount of out-of-control usage in the default cgroup.

But to do this, I'll have to avoid hitting the limit in the default cgroup.
I'm now considering making it impossible to set a limit on the default cgroup.
(I will show this as a patch in the next version of the series.)
Does anyone have an idea?
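
A minimal sketch of that "no limit on the default cgroup" idea (this is only an
illustration, not the posted patch; mem_cgroup_write_limit() and do_set_limit()
are assumed names):

==
int mem_cgroup_write_limit(struct cgroup *cgrp, u64 val)
{
	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);

	/*
	 * Keep the default (root) cgroup unlimited so that force_empty
	 * can always move charges into it without hitting a limit.
	 */
	if (memcg == &init_mem_cgroup)
		return -EINVAL;

	return do_set_limit(memcg, val);	/* assumed setter */
}
==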

Thanks,
-Kame

2008-08-21 03:38:27

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH -mm 0/7] memcg: lockless page_cgroup v1

KAMEZAWA Hiroyuki wrote:
> On Wed, 20 Aug 2008 20:00:06 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
>> On Wed, 20 Aug 2008 19:41:08 +0900
>> KAMEZAWA Hiroyuki <[email protected]> wrote:
>>
>>> On Wed, 20 Aug 2008 18:53:06 +0900
>>> KAMEZAWA Hiroyuki <[email protected]> wrote:
>>>
>>>> Hi, this is a patch set for lockless page_cgroup.
>>>>
>>>> dropped patches related to mem+swap controller for easy review.
>>>> (I'm rewriting it, too.)
>>>>
>>>> Changes from current -mm are:
>>>> - operations on page_cgroup->flags are now atomic.
>>>> - lock_page_cgroup() is removed.
>>>> - page->page_cgroup is changed from unsigned long to struct page_cgroup*
>>>> - page_cgroup is freed by RCU.
>>>> - To avoid races, charge/uncharge against mm/memory.c::insert_page() is
>>>> omitted. This is usually used for mapping a device's pages. (I think...)
>>>>
>>>> In my quick test, performance is improved a little. But the benefit of this
>>>> patch is to allow access to page_cgroup without a lock. I think this is good
>>>> for Yamamoto's dirty page tracking for memcg.
>>>> For I/O tracking people, I added a header file for allowing access to
>>>> page_cgroup from outside of memcontrol.c
>>>>
>>>> The base kernel is recent mmtom. Any comments are welcome.
>>>> This is still under test. I have to do long-run test before removing "RFC".
>>>>
>>> Known problem: force_empty is broken... so rmdir will get stuck in a nightmare.
>>> It's because of patch 2/7.
>>> It will be fixed in the next version.
>>>
>> This is a quick fix, but I think I can find a better solution.
>> ==
>> Because removal from the LRU is delayed, mz->lru will never be empty until
>> someone kicks a drain. This patch rotates the LRU during force_empty so that
>> page_cgroup entries can be freed.
>>
>
> I'd like to rewrite force_empty to move all usage to the "default" cgroup.
> There are some reasons:
>
> 1. The current force_empty creates a live page which has no page_cgroup.
> This is bad for routines that want to access a page_cgroup from its page,
> and this behavior will become a race-condition issue in the future.
> 2. We can see the amount of out-of-control usage in the default cgroup.
>
> But to do this, I'll have to avoid hitting the limit in the default cgroup.
> I'm now considering making it impossible to set a limit on the default cgroup.
> (I will show this as a patch in the next version of the series.)
> Does anyone have an idea?
>

Hi, Kamezawa-San,

The definition of the default cgroup would be the root cgroup, right? I would like
to implement hierarchies correctly in order to define the default cgroup (it could
be the parent of the child cgroup, for example).


--
Balbir

2008-08-21 03:52:38

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH -mm 0/7] memcg: lockless page_cgroup v1

On Thu, 21 Aug 2008 09:06:53 +0530
Balbir Singh <[email protected]> wrote:
> > I'd like to rewrite force_empty to move all usage to the "default" cgroup.
> > There are some reasons:
> >
> > 1. The current force_empty creates a live page which has no page_cgroup.
> > This is bad for routines that want to access a page_cgroup from its page,
> > and this behavior will become a race-condition issue in the future.
> > 2. We can see the amount of out-of-control usage in the default cgroup.
> >
> > But to do this, I'll have to avoid hitting the limit in the default cgroup.
> > I'm now considering making it impossible to set a limit on the default cgroup.
> > (I will show this as a patch in the next version of the series.)
> > Does anyone have an idea?
> >
>
> Hi, Kamezawa-San,
>
> The definition of the default cgroup would be the root cgroup, right? I would like
> to implement hierarchies correctly in order to define the default cgroup (it could
> be the parent of the child cgroup, for example).
>

Ah yes, the "root" cgroup, for now.
I need a trash-can cgroup somewhere for force_empty. Accounted in a trash can is
better than accounted by no one. Once we change the behavior, we can have
other choices for improvement.

1. move account information to the parent cgroup.
2. move account information to user-defined trash-can cgroup.

As a first step, I'd like to start from the "root" cgroup. We can improve the
behavior in a step-by-step manner, as we've done before.

Thanks,
-Kame

2008-08-21 05:15:49

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [RFC][PATCH -mm 0/7] memcg: lockless page_cgroup v1

On Thu, 21 Aug 2008 11:17:40 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> On Wed, 20 Aug 2008 20:00:06 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > On Wed, 20 Aug 2008 19:41:08 +0900
> > KAMEZAWA Hiroyuki <[email protected]> wrote:
> >
> > > On Wed, 20 Aug 2008 18:53:06 +0900
> > > KAMEZAWA Hiroyuki <[email protected]> wrote:
> > >
> > > > Hi, this is a patch set for lockless page_cgroup.
> > > >
> > > > dropped patches related to mem+swap controller for easy review.
> > > > (I'm rewriting it, too.)
> > > >
> > > > Changes from current -mm are:
> > > > - operations on page_cgroup->flags are now atomic.
> > > > - lock_page_cgroup() is removed.
> > > > - page->page_cgroup is changed from unsigned long to struct page_cgroup*
> > > > - page_cgroup is freed by RCU.
> > > > - To avoid races, charge/uncharge against mm/memory.c::insert_page() is
> > > > omitted. This is usually used for mapping a device's pages. (I think...)
> > > >
> > > > In my quick test, performance is improved a little. But the benefit of this
> > > > patch is to allow access to page_cgroup without a lock. I think this is good
> > > > for Yamamoto's dirty page tracking for memcg.
> > > > For I/O tracking people, I added a header file for allowing access to
> > > > page_cgroup from outside of memcontrol.c
> > > >
> > > > The base kernel is recent mmtom. Any comments are welcome.
> > > > This is still under test. I have to do long-run test before removing "RFC".
> > > >
> > > Known problem: force_empty is broken... so rmdir will get stuck in a nightmare.
> > > It's because of patch 2/7.
> > > It will be fixed in the next version.
> > >
> >
> > This is a quick fix, but I think I can find a better solution.
> > ==
> > Because removal from the LRU is delayed, mz->lru will never be empty until
> > someone kicks a drain. This patch rotates the LRU during force_empty so that
> > page_cgroup entries can be freed.
> >
>
> I'd like to rewrite force_empty to move all usage to the "default" cgroup.
> There are some reasons:
>
> 1. The current force_empty creates a live page which has no page_cgroup.
> This is bad for routines that want to access a page_cgroup from its page,
> and this behavior will become a race-condition issue in the future.
I agree that the current force_empty is not good on this point.

> 2. We can see the amount of out-of-control usage in the default cgroup.
>
> But to do this, I'll have to avoid hitting the limit in the default cgroup.
> I'm now considering making it impossible to set a limit on the default cgroup.
> (I will show this as a patch in the next version of the series.)
> Does anyone have an idea?
>
I don't have a strong objection to making the default cgroup unlimited
and moving usage to the default cgroup.

But I think this is related to hierarchy support, as Balbir-san says.
And setting the default cgroup unlimited would not be so strange if
hierarchy is supported.


Thanks,
Daisuke Nishimura.

2008-08-21 08:28:48

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH -mm 0/7] memcg: lockless page_cgroup v1

On Wed, 20 Aug 2008 20:00:06 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:
> > Known problem: force_empty is broken... so rmdir will get stuck in a nightmare.
> > It's because of patch 2/7.
> > It will be fixed in the next version.
> >
>
This is a new routine for force_empty. It assumes init_mem_cgroup has no limit.
(The lockless page_cgroup series is also applied.)

I think this routine is generic enough to be enhanced for hierarchy support in
the future, and the move_account() routine can be reused for other purposes
(for example, move_task; see the sketch after the code below).


==
int mem_cgroup_move_account(struct page *page, struct page_cgroup *pc,
struct mem_cgroup *from, struct mem_cgroup *to)
{
struct mem_cgroup_per_zone *from_mz, *to_mz;
int nid, zid;
int ret = 1;

VM_BUG_ON(to->no_limit == 0);
VM_BUG_ON(!irqs_disabled());

nid = page_to_nid(page);
zid = page_zonenum(page);
from_mz = mem_cgroup_zoneinfo(from, nid, zid);
to_mz = mem_cgroup_zoneinfo(to, nid, zid);

if (res_counter_charge(&to->res, PAGE_SIZE)) {
/* Now, we assume no_limit...no failure here. */
return ret;
}

if (spin_trylock(&to_mz->lru_lock)) {
__mem_cgroup_remove_list(from_mz, pc);
css_put(&from->css);
res_counter_uncharge(&from->res, PAGE_SIZE);
pc->mem_cgroup = to;
css_get(&to->css);
__mem_cgroup_add_list(to_mz, pc);
ret = 0;
spin_unlock(&to_mz->lru_lock);
} else {
res_counter_uncharge(&to->res, PAGE_SIZE);
}

return ret;
}
/*
* This routine moves all account to root cgroup.
*/
static void mem_cgroup_force_empty_list(struct mem_cgroup *mem,
struct mem_cgroup_per_zone *mz,
enum lru_list lru)
{
struct page_cgroup *pc;
unsigned long flags;
struct list_head *list;
int drain = 0;

list = &mz->lists[lru];

spin_lock_irqsave(&mz->lru_lock, flags);
while (!list_empty(list)) {
pc = list_entry(list->prev, struct page_cgroup, lru);
if (PcgObsolete(pc)) {
list_move(&pc->lru, list);
/* This page_cgroup may remain on this list until
we drain it. */
if (drain++ > MEMCG_LRU_THRESH/2) {
spin_unlock_irqrestore(&mz->lru_lock, flags);
mem_cgroup_all_force_drain();
yield();
drain = 0;
spin_lock_irqsave(&mz->lru_lock, flags);
}
continue;
}
if (mem_cgroup_move_account(pc->page, pc,
mem, &init_mem_cgroup)) {
/* some conflict */
list_move(&pc->lru, list);
spin_unlock_irqrestore(&mz->lru_lock, flags);
yield();
spin_lock_irqsave(&mz->lru_lock, flags);
}
if (atomic_read(&mem->css.cgroup->count) > 0)
break;
}
spin_unlock_irqrestore(&mz->lru_lock, flags);
}
==
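
As a rough illustration of that move_task reuse (assumed function name, not
posted code), a per-page recharge helper built on mem_cgroup_move_account()
could look like this:

==
/*
 * Illustration only: recharge one page from @from to @to by reusing
 * mem_cgroup_move_account().  Returns 0 on success, 1 if the page could
 * not be moved (wrong owner, obsolete page_cgroup, or lock contention).
 */
static int memcg_recharge_page(struct page *page,
			       struct mem_cgroup *from, struct mem_cgroup *to)
{
	struct mem_cgroup_per_zone *from_mz;
	struct page_cgroup *pc;
	unsigned long flags;
	int ret = 1;

	from_mz = mem_cgroup_zoneinfo(from, page_to_nid(page),
				      page_zonenum(page));
	/* move_account() expects irqs off and a stable source LRU list */
	spin_lock_irqsave(&from_mz->lru_lock, flags);
	pc = rcu_dereference(page->page_cgroup);
	if (pc && pc->mem_cgroup == from && !PcgObsolete(pc))
		ret = mem_cgroup_move_account(page, pc, from, to);
	spin_unlock_irqrestore(&from_mz->lru_lock, flags);

	return ret;
}
==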

2008-08-22 05:03:37

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [RFC][PATCH -mm 6/7] memcg: make-mapping-null-before-calling-uncharge.patch

> @@ -379,6 +377,15 @@ static void migrate_page_copy(struct pag
> ClearPagePrivate(page);
> set_page_private(page, 0);
> page->mapping = NULL;
You forgot to remove this line :)

Thanks,
Daisuke Nishimura.

> + /* page->mapping contains a flag for PageAnon() */
> + if (PageAnon(page)) {
> + /* This page is uncharged at try_to_unmap(). */
> + page->mapping = NULL;
> + } else {
> + /* Obsolete file cache should be uncharged */
> + page->mapping = NULL;
> + mem_cgroup_uncharge_cache_page(page);
> + }
>
> /*
> * If any waiters have accumulated on the new page then
>

2008-08-22 05:42:09

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH -mm 6/7] memcg: make-mapping-null-before-calling-uncharge.patch

On Fri, 22 Aug 2008 13:57:43 +0900
Daisuke Nishimura <[email protected]> wrote:

> > @@ -379,6 +377,15 @@ static void migrate_page_copy(struct pag
> > ClearPagePrivate(page);
> > set_page_private(page, 0);
> > page->mapping = NULL;
> You forgot to remove this line :)
>
Ouch, thanks.
-Kame

> Thanks,
> Daisuke Nishimura.
>
> > + /* page->mapping contains a flag for PageAnon() */
> > + if (PageAnon(page)) {
> > + /* This page is uncharged at try_to_unmap(). */
> > + page->mapping = NULL;
> > + } else {
> > + /* Obsolete file cache should be uncharged */
> > + page->mapping = NULL;
> > + mem_cgroup_uncharge_cache_page(page);
> > + }
> >
> > /*
> > * If any waiters have accumulated on the new page then
> >
>

2008-08-22 10:49:30

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [PATCH -mm][preview] memcg: a patch series for next [8/9]

Hi.

I think you are working on updated versions, so I'm sending the comments I have so far.

On Tue, 19 Aug 2008 17:44:04 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> Very experimental...
>
> mem+swap controller prototype.
>
> This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP as memory resource
> controller's swap extension.
>
> When enabling this, memory resource controller will have 2 limits.
>
> - memory.limit_in_bytes .... limit for pages
> - memory.memsw_limit_in_bytes .... limit for pages + swaps.
>
> The following is the (simplified) accounting state transition after this patch.
>
>  pages  swaps  pages_total  memsw_total   event
>   +1      -        +1           +1        new page allocation
>   -1     +1        -1            -        swap out
>   +1     -1         0            -        swap in (*)
>    -     -1         -           -1        swap_free
>
>
What do you mean by "pages_total"?

> At swap-out, swp_entry will be charged against the cgroup of the page.
> At swap-in, the page will be charged when it's mapped.
> (Accounting at read_swap() might be cleaner, but delaying accounting until
> mem_cgroup_charge() lets us avoid some of the error handling.)
>
> The charge against swap_entry will be uncharged when swap_entry is freed.
>
> The parameter res.swaps only includes swap entries that are not in the swap cache.
> So this doesn't show the real usage of swp_entry; it just shows swp_entry usage on disk.
>
IMHO, it would be better to show the real usage of swp_entry.
Otherwise, "sum of swap usage of all groups" != "swap usage of the
system shown by meminfo" (but that means adding another counter, hmm...).

Instead of showing the usage of disk_swap, how about showing
the memsw total usage, which is what the user limits?

> This patch doesn't include codes for control files.
>
> TODO:
> - clean up. and add comments.
> - support vm_swap_full() under cgroup.
Is it needed?

In my swap controller, swap entries are limited per cgroup.
So, to make swap_cgroup_charge() fail less frequently,
vm_swap_full() should be calculated per cgroup so that
vm can free swap entries in advance.
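
A sketch of that per-cgroup idea (the swaps and swap_limit fields are
assumptions for a swap-limiting controller, not fields from the posted patches):

==
/*
 * Per-cgroup analogue of vm_swap_full(): more than half of this group's
 * swap quota is in use, so the VM should free swap entries eagerly.
 */
static inline int memcg_swap_full(struct mem_cgroup *memcg)
{
	return memcg->swaps * 2 > memcg->swap_limit;
}
==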

But I think in mem+swap controller the situation is different.

> - find easier-to-understand protocol....
> - check force_empty....(maybe buggy)
> - support page migration.
> - test!!
>
And,
- move charge along with task move
- hierarchy support

Of course, more basic features and stabilization should be done first.


I agree with this patch as a whole, but I'm worried about a race
between swap-out and swap-in on the same entry (I should think about it more...).


Thanks,
Daisuke Nishimura.

2008-08-22 11:48:24

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH -mm][preview] memcg: a patch series for next [8/9]

On Fri, 22 Aug 2008 19:29:43 +0900
Daisuke Nishimura <[email protected]> wrote:

> Hi.
>
> I think you are working on updated versions, so I'm sending the comments I have so far.
>
Ah, sorry. I just sent one ;(.

> On Tue, 19 Aug 2008 17:44:04 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > Very experimental...
> >
> > mem+swap controller prototype.
> >
> > This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP as memory resource
> > controller's swap extension.
> >
> > When enabling this, memory resource controller will have 2 limits.
> >
> > - memory.limit_in_bytes .... limit for pages
> > - memory.memsw_limit_in_bytes .... limit for pages + swaps.
> >
> > The following is the (simplified) accounting state transition after this patch.
> >
> >  pages  swaps  pages_total  memsw_total   event
> >   +1      -        +1           +1        new page allocation
> >   -1     +1        -1            -        swap out
> >   +1     -1         0            -        swap in (*)
> >    -     -1         -           -1        swap_free
> >
> What do you mean by "pages_total"?
>
It is the usage of the memory resource: the sum of
- mapped anonymous pages,
- file cache pages, and
- swap cache pages.
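
As a sketch of how the "+1 / +1" row (new page allocation) in the table quoted
above could be checked against both limits (the memsw fields are assumptions
and the locking is simplified; this is not the posted patch):

==
static int memsw_charge_new_page(struct mem_cgroup *memcg)
{
	int ret = -ENOMEM;

	/* real code would need an irq-safe lock here */
	spin_lock(&memcg->res.lock);
	if (memcg->res.pages < memcg->res.pages_limit &&
	    memcg->res.memsw < memcg->res.memsw_limit) {
		memcg->res.pages++;	/* pages_total += 1 */
		memcg->res.memsw++;	/* memsw_total += 1 */
		ret = 0;
	} else {
		memcg->res.failcnt++;
	}
	spin_unlock(&memcg->res.lock);
	return ret;
}
==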


> > At swap-out, swp_entry will be charged against the cgroup of the page.
> > At swap-in, the page will be charged when it's mapped.
> > (Accounting at read_swap() might be cleaner, but delaying accounting until
> > mem_cgroup_charge() lets us avoid some of the error handling.)
> >
> > The charge against swap_entry will be uncharged when swap_entry is freed.
> >
> > The parameter res.swaps only includes swap entries that are not in the swap cache.
> > So this doesn't show the real usage of swp_entry; it just shows swp_entry usage on disk.
> >
> IMHO, it would be better to show the real usage of swp_entry.
> Otherwise, "sum of swap usage of all groups" != "swap usage of the
> system shown by meminfo" (but that means adding another counter, hmm...).
>
Yes, it means adding another counter. I'd like to try it.

> Instead of showing the usage of disk_swap, how about showing
> the memsw total usage, which is what the user limits?
>

Yes, I feel the amount of disk swap is not very useful in my tests.
OK, I'll show the memsw total somewhere.


> > This patch doesn't include codes for control files.
> >
> > TODO:
> > - clean up. and add comments.
> > - support vm_swap_full() under cgroup.
> Is it needed?
>
maybe.

> In my swap controller, swap entries are limited per cgroup.
> So, to make swap_cgroup_charge() fail less frequently,
> vm_swap_full() should be calculated per cgroup so that
> vm can free swap entries in advance.
>
> But I think in mem+swap controller the situation is different.
>
Hmm, I'd like to postpone this until we finish testing.


> > - find easier-to-understand protocol....
> > - check force_empty....(maybe buggy)
> > - support page migration.
> > - test!!
> >
> And,
> - move charge along with task move
Yes, and moving the charge for the memory resource should be done, too.

> - hierarchy support
>
> Of course, more basic features and stabilization should be done first.
>
Yes ;)

>
> I agree with this patch as a whole, but I'm worried about a race
> between swap-out and swap-in on the same entry (I should think about it more...).
>
>
The swap-out/swap-in race is guarded by the fact that I always handle the swap cache.

add_to_swap_cache()/delete_from_swap_cache() are called under lock_page(),
and do_swap_page()'s charging is moved under lock_page() as well.
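
A simplified sketch of that ordering (not the actual diff; error handling is
reduced to the minimum and the pte setup is elided):

==
/*
 * Sketch: the charge happens with the page locked, so it cannot
 * interleave with add_to_swap_cache()/delete_from_swap_cache().
 */
static int swapin_charge_sketch(struct mm_struct *mm, swp_entry_t entry)
{
	struct page *page = read_swap_cache_async(entry, GFP_KERNEL, NULL, 0);

	if (!page)
		return -ENOMEM;
	lock_page(page);
	if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
		unlock_page(page);
		page_cache_release(page);
		return -ENOMEM;
	}
	/* ... install the pte while still holding the page lock ... */
	unlock_page(page);
	page_cache_release(page);
	return 0;
}
==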

I saw a race with force_empty ;(. I hope it's fixed in the latest version.

Thanks,
-Kame