2008-12-03 04:48:26

by Kamezawa Hiroyuki

Subject: [PATCH 0/21] memcg updates 2008/12/03

This is a memcg update series on top of
"The mm-of-the-moment snapshot 2008-12-02-17-08"

and includes the following patches. Patches 18-21 are highly experimental
(so the CC: to Andrew is dropped for those).

Bug fixes.
1. memcg-revert-gfp-mask-fix.patch
2. memcg-check-group-leader-fix.patch
3. memsw_limit_check.patch
4. memcg-swapout-refcnt-fix.patch
5. avoid-unnecessary-reclaim.patch

Kosaki's LRU work. (thanks!)
6. inactive_anon_is_low-move-to-vmscan.patch
7. introduce-zone_reclaim-struct.patch
8. make-zone-nr_pages-helper-function.patch
9. make-get_scan_ratio-to-memcg-safe.patch
10. memcg-add-null-check-to-page_cgroup_zoneinfo.patch
11. memcg-make-inactive_anon_is_low.patch
12. memcg-make-mem_cgroup_zone_nr_pages.patch
13. memcg-make-zone_reclaim_stat.patch
14. memcg-remove-mem_cgroup_cal_reclaim.patch
15. memcg-show-reclaim-stat.patch
Cleanup
16. memcg-rename-scan-glonal-lru.patch
Bug fix
16. memcg_prev_priority_protect.patch
New Feature
17. memcg-swappiness.patch

Experimental; these need more work. (from me)
18. fix-pre-destroy.patch
19. cgroup_id.patch
20. memcg-new-hierarchical-reclaim.patch
21. memcg-explain-details-and-test.patch

Thanks,
-Kame


2008-12-03 04:49:48

by Kamezawa Hiroyuki

Subject: [PATCH 1/21] memcg-revert-gfp-mask-fix.patch

From: KAMEZAWA Hiroyuki <[email protected]>

My patch memcg-fix-gfp_mask-of-callers-of-charge.patch changed the gfp_mask
of the callers of charge to GFP_HIGHUSER_MOVABLE, to make explicit what can
happen at memory reclaim.

But in recent discussion it was NACKed because it looks ugly.

This patch reverts that change and adds some cleanup to the gfp_mask used by
the callers of charge. There is no behavior change, but it needs review before
it generates hunks deep in the queue.

This patch also adds an explanation of the meaning of the gfp_mask passed to
the charge functions in memcontrol.h.
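
As a quick illustration of the convention (this only mirrors the hunks below,
it is not new code): callers that can always sleep simply pass GFP_KERNEL, and
callers that receive a gfp_mask keep only the reclaim-relevant bits:

	if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))
		goto oom_free_page;

	error = mem_cgroup_cache_charge(page, current->mm,
					gfp_mask & GFP_RECLAIM_MASK);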

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

include/linux/memcontrol.h | 10 ++++++++++
mm/filemap.c | 2 +-
mm/memcontrol.c | 10 +++++-----
mm/memory.c | 10 ++++------
mm/shmem.c | 8 ++++----
mm/swapfile.c | 3 +--
6 files changed, 25 insertions(+), 18 deletions(-)

Index: mmotm-2.6.28-Dec02/mm/filemap.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/filemap.c
+++ mmotm-2.6.28-Dec02/mm/filemap.c
@@ -461,7 +461,7 @@ int add_to_page_cache_locked(struct page
VM_BUG_ON(!PageLocked(page));

error = mem_cgroup_cache_charge(page, current->mm,
- gfp_mask & ~__GFP_HIGHMEM);
+ gfp_mask & GFP_RECLAIM_MASK);
if (error)
goto out;

Index: mmotm-2.6.28-Dec02/mm/memory.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/memory.c
+++ mmotm-2.6.28-Dec02/mm/memory.c
@@ -1967,7 +1967,7 @@ gotten:
cow_user_page(new_page, old_page, address, vma);
__SetPageUptodate(new_page);

- if (mem_cgroup_newpage_charge(new_page, mm, GFP_HIGHUSER_MOVABLE))
+ if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
goto oom_free_new;

/*
@@ -2398,8 +2398,7 @@ static int do_swap_page(struct mm_struct
lock_page(page);
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);

- if (mem_cgroup_try_charge_swapin(mm, page,
- GFP_HIGHUSER_MOVABLE, &ptr) == -ENOMEM) {
+ if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
ret = VM_FAULT_OOM;
unlock_page(page);
goto out;
@@ -2491,7 +2490,7 @@ static int do_anonymous_page(struct mm_s
goto oom;
__SetPageUptodate(page);

- if (mem_cgroup_newpage_charge(page, mm, GFP_HIGHUSER_MOVABLE))
+ if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))
goto oom_free_page;

entry = mk_pte(page, vma->vm_page_prot);
@@ -2582,8 +2581,7 @@ static int __do_fault(struct mm_struct *
ret = VM_FAULT_OOM;
goto out;
}
- if (mem_cgroup_newpage_charge(page,
- mm, GFP_HIGHUSER_MOVABLE)) {
+ if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
ret = VM_FAULT_OOM;
page_cache_release(page);
goto out;
Index: mmotm-2.6.28-Dec02/mm/swapfile.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/swapfile.c
+++ mmotm-2.6.28-Dec02/mm/swapfile.c
@@ -698,8 +698,7 @@ static int unuse_pte(struct vm_area_stru
pte_t *pte;
int ret = 1;

- if (mem_cgroup_try_charge_swapin(vma->vm_mm, page,
- GFP_HIGHUSER_MOVABLE, &ptr))
+ if (mem_cgroup_try_charge_swapin(vma->vm_mm, page, GFP_KERNEL, &ptr))
ret = -ENOMEM;

pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
Index: mmotm-2.6.28-Dec02/mm/shmem.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/shmem.c
+++ mmotm-2.6.28-Dec02/mm/shmem.c
@@ -924,8 +924,8 @@ found:
* Charge page using GFP_HIGHUSER_MOVABLE while we can wait.
* charged back to the user(not to caller) when swap account is used.
*/
- error = mem_cgroup_cache_charge_swapin(page,
- current->mm, GFP_HIGHUSER_MOVABLE, true);
+ error = mem_cgroup_cache_charge_swapin(page, current->mm, GFP_KERNEL,
+ true);
if (error)
goto out;
error = radix_tree_preload(GFP_KERNEL);
@@ -1267,7 +1267,7 @@ repeat:
* charge against this swap cache here.
*/
if (mem_cgroup_cache_charge_swapin(swappage,
- current->mm, gfp, false)) {
+ current->mm, gfp & GFP_RECLAIM_MASK, false)) {
page_cache_release(swappage);
error = -ENOMEM;
goto failed;
@@ -1385,7 +1385,7 @@ repeat:

/* Precharge page while we can wait, compensate after */
error = mem_cgroup_cache_charge(filepage, current->mm,
- GFP_HIGHUSER_MOVABLE);
+ GFP_KERNEL);
if (error) {
page_cache_release(filepage);
shmem_unacct_blocks(info->flags, 1);
Index: mmotm-2.6.28-Dec02/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec02/mm/memcontrol.c
@@ -1248,7 +1248,7 @@ int mem_cgroup_prepare_migration(struct
unlock_page_cgroup(pc);

if (mem) {
- ret = mem_cgroup_try_charge(NULL, GFP_HIGHUSER_MOVABLE, &mem);
+ ret = mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem);
css_put(&mem->css);
}
*ptr = mem;
@@ -1378,7 +1378,7 @@ static int mem_cgroup_resize_limit(struc
break;

progress = try_to_free_mem_cgroup_pages(memcg,
- GFP_HIGHUSER_MOVABLE, false);
+ GFP_KERNEL, false);
if (!progress) retry_count--;
}
return ret;
@@ -1418,7 +1418,7 @@ int mem_cgroup_resize_memsw_limit(struct
break;

oldusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
- try_to_free_mem_cgroup_pages(memcg, GFP_HIGHUSER_MOVABLE, true);
+ try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL, true);
curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
if (curusage >= oldusage)
retry_count--;
@@ -1464,7 +1464,7 @@ static int mem_cgroup_force_empty_list(s
}
spin_unlock_irqrestore(&zone->lru_lock, flags);

- ret = mem_cgroup_move_parent(pc, mem, GFP_HIGHUSER_MOVABLE);
+ ret = mem_cgroup_move_parent(pc, mem, GFP_KERNEL);
if (ret == -ENOMEM)
break;

@@ -1550,7 +1550,7 @@ try_to_free:
goto out;
}
progress = try_to_free_mem_cgroup_pages(mem,
- GFP_HIGHUSER_MOVABLE, false);
+ GFP_KERNEL, false);
if (!progress) {
nr_retries--;
/* maybe some writeback is necessary */
Index: mmotm-2.6.28-Dec02/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-Dec02.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-Dec02/include/linux/memcontrol.h
@@ -26,6 +26,16 @@ struct page;
struct mm_struct;

#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+/*
+ * All "charge" functions with gfp_mask should use GFP_KERNEL or
+ * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
+ * alloc memory but reclaims memory from all available zones. So, "where I want
+ * memory from" bits of gfp_mask has no meaning. So any bits of that field is
+ * available but adding a rule is better. charge functions' gfp_mask should
+ * be set to GFP_KERNEL or gfp_mask & GFP_RECLAIM_MASK for avoiding ambiguous
+ * codes.
+ * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
+ */

extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);

2008-12-03 04:52:50

by Kamezawa Hiroyuki

Subject: [PATCH 2/21] memcg-check-group-leader-fix.patch

Remove unnecessary code (...fragments of not-implemented functionality...).

Changelog:
- removed all unused fragments.
- added comment.


Reported-by: Nikanth Karthikesan <[email protected]>
Signed-off-by: Nikanth Karthikesan <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Index: mmotm-2.6.28-Nov30/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Nov30.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Nov30/mm/memcontrol.c
@@ -2008,25 +2008,10 @@ static void mem_cgroup_move_task(struct
struct cgroup *old_cont,
struct task_struct *p)
{
- struct mm_struct *mm;
- struct mem_cgroup *mem, *old_mem;
-
- mm = get_task_mm(p);
- if (mm == NULL)
- return;
-
- mem = mem_cgroup_from_cont(cont);
- old_mem = mem_cgroup_from_cont(old_cont);
-
/*
- * Only thread group leaders are allowed to migrate, the mm_struct is
- * in effect owned by the leader
+ * FIXME: It's better to move charges of this process from old
+ * memcg to new memcg. But it's just on TODO-List now.
*/
- if (!thread_group_leader(p))
- goto out;
-
-out:
- mmput(mm);
}

struct cgroup_subsys mem_cgroup_subsys = {

2008-12-03 04:53:15

by Kamezawa Hiroyuki

Subject: [PATCH 3/21] memcg-memoryswap-controller-fix-limit-check.patch

There are scattered calls of res_counter_check_under_limit(), and most
of them don't take mem+swap accounting into account.

Define mem_cgroup_check_under_limit() and avoid direct use of
res_counter_check_under_limit().

Changelog:
- replaces all res_counter_check_under_limit().

Reported-by: Daisuke Nishimura <[email protected]>
Signed-off-by: Daisuke Nishimura <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

mm/memcontrol.c | 26 +++++++++++++++++---------
1 file changed, 17 insertions(+), 9 deletions(-)

Index: mmotm-2.6.28-Dec02/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec02/mm/memcontrol.c
@@ -571,6 +571,18 @@ done:
return ret;
}

+static bool mem_cgroup_check_under_limit(struct mem_cgroup *mem)
+{
+ if (do_swap_account) {
+ if (res_counter_check_under_limit(&mem->res) &&
+ res_counter_check_under_limit(&mem->memsw))
+ return true;
+ } else
+ if (res_counter_check_under_limit(&mem->res))
+ return true;
+ return false;
+}
+
/*
* Dance down the hierarchy if needed to reclaim memory. We remember the
* last child we reclaimed from, so that we don't end up penalizing
@@ -592,7 +604,7 @@ static int mem_cgroup_hierarchical_recla
* have left.
*/
ret = try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap);
- if (res_counter_check_under_limit(&root_mem->res))
+ if (mem_cgroup_check_under_limit(root_mem))
return 0;

next_mem = mem_cgroup_get_first_node(root_mem);
@@ -606,7 +618,7 @@ static int mem_cgroup_hierarchical_recla
continue;
}
ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap);
- if (res_counter_check_under_limit(&root_mem->res))
+ if (mem_cgroup_check_under_limit(root_mem))
return 0;
cgroup_lock();
next_mem = mem_cgroup_get_next_node(next_mem, root_mem);
@@ -709,12 +721,8 @@ static int __mem_cgroup_try_charge(struc
* current usage of the cgroup before giving up
*
*/
- if (do_swap_account) {
- if (res_counter_check_under_limit(&mem_over_limit->res) &&
- res_counter_check_under_limit(&mem_over_limit->memsw))
- continue;
- } else if (res_counter_check_under_limit(&mem_over_limit->res))
- continue;
+ if (mem_cgroup_check_under_limit(mem_over_limit))
+ continue;

if (!nr_retries--) {
if (oom) {
@@ -1334,7 +1342,7 @@ int mem_cgroup_shrink_usage(struct mm_st

do {
progress = try_to_free_mem_cgroup_pages(mem, gfp_mask, true);
- progress += res_counter_check_under_limit(&mem->res);
+ progress += mem_cgroup_check_under_limit(mem);
} while (!progress && --retry);

css_put(&mem->css);

2008-12-03 04:53:58

by Kamezawa Hiroyuki

Subject: [PATCH 4/21] memcg-swapout-refcnt-fix.patch

Fix for memcg-memswap-controller-core.patch.

The css's refcnt is dropped before the end of the following access; at swapout,
the returned memcg is still accessed by mem_cgroup_uncharge_swapcache() to
record it into the swap_cgroup. Hold the reference until that access is done.

Reported-by: Li Zefan <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

mm/memcontrol.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

Index: mmotm-2.6.28-Dec01-2/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec01-2.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec01-2/mm/memcontrol.c
@@ -1171,7 +1171,9 @@ __mem_cgroup_uncharge_common(struct page
mz = page_cgroup_zoneinfo(pc);
unlock_page_cgroup(pc);

- css_put(&mem->css);
+ /* at swapout, this memcg will be accessed to record to swap */
+ if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
+ css_put(&mem->css);

return mem;

@@ -1212,6 +1214,8 @@ void mem_cgroup_uncharge_swapcache(struc
swap_cgroup_record(ent, memcg);
mem_cgroup_get(memcg);
}
+ if (memcg)
+ css_put(&memcg->css);
}

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP

2008-12-03 04:54:43

by Kamezawa Hiroyuki

Subject: [PATCH 5/21] memcg-hierarchy-avoid-unnecessary-reclaim.patch

From: Daisuke Nishimura <[email protected]>

If hierarchy is not used, no tree-walk is necessary.

Reviewed-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: Daisuke Nishimura <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

mm/memcontrol.c | 2 ++
1 file changed, 2 insertions(+)

Index: mmotm-2.6.28-Dec02/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec02/mm/memcontrol.c
@@ -606,6 +606,8 @@ static int mem_cgroup_hierarchical_recla
ret = try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap);
if (mem_cgroup_check_under_limit(root_mem))
return 0;
+ if (!root_mem->use_hierarchy)
+ return ret;

next_mem = mem_cgroup_get_first_node(root_mem);

2008-12-03 04:55:32

by Kamezawa Hiroyuki

Subject: [PATCH 6/21] inactive_anon_is_low-move-to-vmscan.patch

inactive_anon_is_low() is called only from vmscan code, so it can be moved
to vmscan.c.

This patch doesn't have any functional change.

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
Index: mmotm-2.6.28-Dec02/include/linux/mm_inline.h
===================================================================
--- mmotm-2.6.28-Dec02.orig/include/linux/mm_inline.h
+++ mmotm-2.6.28-Dec02/include/linux/mm_inline.h
@@ -81,23 +81,4 @@ static inline enum lru_list page_lru(str
return lru;
}

-/**
- * inactive_anon_is_low - check if anonymous pages need to be deactivated
- * @zone: zone to check
- *
- * Returns true if the zone does not have enough inactive anon pages,
- * meaning some active anon pages need to be deactivated.
- */
-static inline int inactive_anon_is_low(struct zone *zone)
-{
- unsigned long active, inactive;
-
- active = zone_page_state(zone, NR_ACTIVE_ANON);
- inactive = zone_page_state(zone, NR_INACTIVE_ANON);
-
- if (inactive * zone->inactive_ratio < active)
- return 1;
-
- return 0;
-}
#endif
Index: mmotm-2.6.28-Dec02/mm/vmscan.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/vmscan.c
+++ mmotm-2.6.28-Dec02/mm/vmscan.c
@@ -1345,6 +1345,26 @@ static void shrink_active_list(unsigned
pagevec_release(&pvec);
}

+/**
+ * inactive_anon_is_low - check if anonymous pages need to be deactivated
+ * @zone: zone to check
+ *
+ * Returns true if the zone does not have enough inactive anon pages,
+ * meaning some active anon pages need to be deactivated.
+ */
+static int inactive_anon_is_low(struct zone *zone)
+{
+ unsigned long active, inactive;
+
+ active = zone_page_state(zone, NR_ACTIVE_ANON);
+ inactive = zone_page_state(zone, NR_INACTIVE_ANON);
+
+ if (inactive * zone->inactive_ratio < active)
+ return 1;
+
+ return 0;
+}
+
static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
struct zone *zone, struct scan_control *sc, int priority)
{

2008-12-03 04:56:20

by Kamezawa Hiroyuki

Subject: [PATCH 7/21] introduce-zone_reclaim-struct.patch

Make a zone_reclaim_stat struct for a later enhancement.

A later patch uses this.
This patch doesn't have any behavior change (yet).
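
As an aside, with illustrative numbers (not from this patch): get_scan_ratio()
computes fp = (file_prio + 1) * (recent_scanned[1] + 1) / (recent_rotated[1] + 1),
so with recent_scanned[1] = 1000, a recent_rotated[1] of 800 gives roughly
1.25 * (file_prio + 1) while 100 gives roughly 9.9 * (file_prio + 1); the more
of the recently scanned file pages were rotated (re-referenced), the less scan
pressure falls on the file LRU. Moving these counters into zone_reclaim_stat
keeps that calculation unchanged, just behind a struct that a later patch can
also instantiate per memcg.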

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
Acked-by: Rik van Riel <[email protected]>

include/linux/mmzone.h | 24 ++++++++++++++----------
mm/page_alloc.c | 8 ++++----
mm/swap.c | 12 ++++++++----
mm/vmscan.c | 47 ++++++++++++++++++++++++++++++-----------------
4 files changed, 56 insertions(+), 35 deletions(-)

Index: mmotm-2.6.28-Dec02/include/linux/mmzone.h
===================================================================
--- mmotm-2.6.28-Dec02.orig/include/linux/mmzone.h
+++ mmotm-2.6.28-Dec02/include/linux/mmzone.h
@@ -263,6 +263,19 @@ enum zone_type {
#error ZONES_SHIFT -- too many zones configured adjust calculation
#endif

+struct zone_reclaim_stat {
+ /*
+ * The pageout code in vmscan.c keeps track of how many of the
+ * mem/swap backed and file backed pages are refeferenced.
+ * The higher the rotated/scanned ratio, the more valuable
+ * that cache is.
+ *
+ * The anon LRU stats live in [0], file LRU stats in [1]
+ */
+ unsigned long recent_rotated[2];
+ unsigned long recent_scanned[2];
+};
+
struct zone {
/* Fields commonly accessed by the page allocator */
unsigned long pages_min, pages_low, pages_high;
@@ -315,16 +328,7 @@ struct zone {
unsigned long nr_scan;
} lru[NR_LRU_LISTS];

- /*
- * The pageout code in vmscan.c keeps track of how many of the
- * mem/swap backed and file backed pages are refeferenced.
- * The higher the rotated/scanned ratio, the more valuable
- * that cache is.
- *
- * The anon LRU stats live in [0], file LRU stats in [1]
- */
- unsigned long recent_rotated[2];
- unsigned long recent_scanned[2];
+ struct zone_reclaim_stat reclaim_stat;

unsigned long pages_scanned; /* since last reclaim */
unsigned long slab_defrag_counter; /* since last defrag */
Index: mmotm-2.6.28-Dec02/mm/page_alloc.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/page_alloc.c
+++ mmotm-2.6.28-Dec02/mm/page_alloc.c
@@ -3523,10 +3523,10 @@ static void __paginginit free_area_init_
INIT_LIST_HEAD(&zone->lru[l].list);
zone->lru[l].nr_scan = 0;
}
- zone->recent_rotated[0] = 0;
- zone->recent_rotated[1] = 0;
- zone->recent_scanned[0] = 0;
- zone->recent_scanned[1] = 0;
+ zone->reclaim_stat.recent_rotated[0] = 0;
+ zone->reclaim_stat.recent_rotated[1] = 0;
+ zone->reclaim_stat.recent_scanned[0] = 0;
+ zone->reclaim_stat.recent_scanned[1] = 0;
zap_zone_vm_stats(zone);
zone->flags = 0;
if (!size)
Index: mmotm-2.6.28-Dec02/mm/swap.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/swap.c
+++ mmotm-2.6.28-Dec02/mm/swap.c
@@ -157,6 +157,7 @@ void rotate_reclaimable_page(struct pag
void activate_page(struct page *page)
{
struct zone *zone = page_zone(page);
+ struct zone_reclaim_stat *reclaim_stat = &zone->reclaim_stat;

spin_lock_irq(&zone->lru_lock);
if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
@@ -169,8 +170,8 @@ void activate_page(struct page *page)
add_page_to_lru_list(zone, page, lru);
__count_vm_event(PGACTIVATE);

- zone->recent_rotated[!!file]++;
- zone->recent_scanned[!!file]++;
+ reclaim_stat->recent_rotated[!!file]++;
+ reclaim_stat->recent_scanned[!!file]++;
}
spin_unlock_irq(&zone->lru_lock);
}
@@ -398,6 +399,8 @@ void ____pagevec_lru_add(struct pagevec
{
int i;
struct zone *zone = NULL;
+ struct zone_reclaim_stat *reclaim_stat = NULL;
+
VM_BUG_ON(is_unevictable_lru(lru));

for (i = 0; i < pagevec_count(pvec); i++) {
@@ -409,6 +412,7 @@ void ____pagevec_lru_add(struct pagevec
if (zone)
spin_unlock_irq(&zone->lru_lock);
zone = pagezone;
+ reclaim_stat = &zone->reclaim_stat;
spin_lock_irq(&zone->lru_lock);
}
VM_BUG_ON(PageActive(page));
@@ -416,10 +420,10 @@ void ____pagevec_lru_add(struct pagevec
VM_BUG_ON(PageLRU(page));
SetPageLRU(page);
file = is_file_lru(lru);
- zone->recent_scanned[file]++;
+ reclaim_stat->recent_scanned[file]++;
if (is_active_lru(lru)) {
SetPageActive(page);
- zone->recent_rotated[file]++;
+ reclaim_stat->recent_rotated[file]++;
}
add_page_to_lru_list(zone, page, lru);
}
Index: mmotm-2.6.28-Dec02/mm/vmscan.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/vmscan.c
+++ mmotm-2.6.28-Dec02/mm/vmscan.c
@@ -131,6 +131,12 @@ static DECLARE_RWSEM(shrinker_rwsem);
#define scan_global_lru(sc) (1)
#endif

+static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
+ struct scan_control *sc)
+{
+ return &zone->reclaim_stat;
+}
+
/*
* Add a shrinker callback to be called from the vm
*/
@@ -1083,6 +1089,7 @@ static unsigned long shrink_inactive_lis
struct pagevec pvec;
unsigned long nr_scanned = 0;
unsigned long nr_reclaimed = 0;
+ struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);

pagevec_init(&pvec, 1);

@@ -1126,10 +1133,14 @@ static unsigned long shrink_inactive_lis

if (scan_global_lru(sc)) {
zone->pages_scanned += nr_scan;
- zone->recent_scanned[0] += count[LRU_INACTIVE_ANON];
- zone->recent_scanned[0] += count[LRU_ACTIVE_ANON];
- zone->recent_scanned[1] += count[LRU_INACTIVE_FILE];
- zone->recent_scanned[1] += count[LRU_ACTIVE_FILE];
+ reclaim_stat->recent_scanned[0] +=
+ count[LRU_INACTIVE_ANON];
+ reclaim_stat->recent_scanned[0] +=
+ count[LRU_ACTIVE_ANON];
+ reclaim_stat->recent_scanned[1] +=
+ count[LRU_INACTIVE_FILE];
+ reclaim_stat->recent_scanned[1] +=
+ count[LRU_ACTIVE_FILE];
}
spin_unlock_irq(&zone->lru_lock);

@@ -1190,7 +1201,7 @@ static unsigned long shrink_inactive_lis
add_page_to_lru_list(zone, page, lru);
if (PageActive(page) && scan_global_lru(sc)) {
int file = !!page_is_file_cache(page);
- zone->recent_rotated[file]++;
+ reclaim_stat->recent_rotated[file]++;
}
if (!pagevec_add(&pvec, page)) {
spin_unlock_irq(&zone->lru_lock);
@@ -1250,6 +1261,7 @@ static void shrink_active_list(unsigned
struct page *page;
struct pagevec pvec;
enum lru_list lru;
+ struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);

lru_add_drain();
spin_lock_irq(&zone->lru_lock);
@@ -1262,7 +1274,7 @@ static void shrink_active_list(unsigned
*/
if (scan_global_lru(sc)) {
zone->pages_scanned += pgscanned;
- zone->recent_scanned[!!file] += pgmoved;
+ reclaim_stat->recent_scanned[!!file] += pgmoved;
}

if (file)
@@ -1298,7 +1310,7 @@ static void shrink_active_list(unsigned
* pages in get_scan_ratio.
*/
if (scan_global_lru(sc))
- zone->recent_rotated[!!file] += pgmoved;
+ reclaim_stat->recent_rotated[!!file] += pgmoved;

/*
* Move the pages to the [file or anon] inactive list.
@@ -1398,6 +1410,7 @@ static void get_scan_ratio(struct zone *
unsigned long anon, file, free;
unsigned long anon_prio, file_prio;
unsigned long ap, fp;
+ struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);

/* If we have no swap space, do not bother scanning anon pages. */
if (nr_swap_pages <= 0) {
@@ -1430,17 +1443,17 @@ static void get_scan_ratio(struct zone *
*
* anon in [0], file in [1]
*/
- if (unlikely(zone->recent_scanned[0] > anon / 4)) {
+ if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
spin_lock_irq(&zone->lru_lock);
- zone->recent_scanned[0] /= 2;
- zone->recent_rotated[0] /= 2;
+ reclaim_stat->recent_scanned[0] /= 2;
+ reclaim_stat->recent_rotated[0] /= 2;
spin_unlock_irq(&zone->lru_lock);
}

- if (unlikely(zone->recent_scanned[1] > file / 4)) {
+ if (unlikely(reclaim_stat->recent_scanned[1] > file / 4)) {
spin_lock_irq(&zone->lru_lock);
- zone->recent_scanned[1] /= 2;
- zone->recent_rotated[1] /= 2;
+ reclaim_stat->recent_scanned[1] /= 2;
+ reclaim_stat->recent_rotated[1] /= 2;
spin_unlock_irq(&zone->lru_lock);
}

@@ -1456,11 +1469,11 @@ static void get_scan_ratio(struct zone *
* proportional to the fraction of recently scanned pages on
* each list that were recently referenced and in active use.
*/
- ap = (anon_prio + 1) * (zone->recent_scanned[0] + 1);
- ap /= zone->recent_rotated[0] + 1;
+ ap = (anon_prio + 1) * (reclaim_stat->recent_scanned[0] + 1);
+ ap /= reclaim_stat->recent_rotated[0] + 1;

- fp = (file_prio + 1) * (zone->recent_scanned[1] + 1);
- fp /= zone->recent_rotated[1] + 1;
+ fp = (file_prio + 1) * (reclaim_stat->recent_scanned[1] + 1);
+ fp /= reclaim_stat->recent_rotated[1] + 1;

/* Normalize to percentages */
percent[0] = 100 * ap / (ap + fp + 1);

2008-12-03 04:56:56

by Kamezawa Hiroyuki

Subject: [PATCH 8/21] make-zone-nr_pages-helper-function.patch

Make a zone_nr_pages() helper function.

It is used by a later patch.
This patch doesn't have any functional change.

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
Acked-by: Rik van Riel <[email protected]>
mm/vmscan.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)

Index: mmotm-2.6.28-Dec02/mm/vmscan.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/vmscan.c
+++ mmotm-2.6.28-Dec02/mm/vmscan.c
@@ -137,6 +137,13 @@ static struct zone_reclaim_stat *get_rec
return &zone->reclaim_stat;
}

+static unsigned long zone_nr_pages(struct zone *zone, struct scan_control *sc,
+ enum lru_list lru)
+{
+ return zone_page_state(zone, NR_LRU_BASE + lru);
+}
+
+
/*
* Add a shrinker callback to be called from the vm
*/
@@ -1419,10 +1426,10 @@ static void get_scan_ratio(struct zone *
return;
}

- anon = zone_page_state(zone, NR_ACTIVE_ANON) +
- zone_page_state(zone, NR_INACTIVE_ANON);
- file = zone_page_state(zone, NR_ACTIVE_FILE) +
- zone_page_state(zone, NR_INACTIVE_FILE);
+ anon = zone_nr_pages(zone, sc, LRU_ACTIVE_ANON) +
+ zone_nr_pages(zone, sc, LRU_INACTIVE_ANON);
+ file = zone_nr_pages(zone, sc, LRU_ACTIVE_FILE) +
+ zone_nr_pages(zone, sc, LRU_INACTIVE_FILE);
free = zone_page_state(zone, NR_FREE_PAGES);

/* If we have very few page cache pages, force-scan anon pages. */

2008-12-03 04:58:18

by Kamezawa Hiroyuki

Subject: [PATCH 9/21] make-get_scan_ratio-to-memcg-safe.patch

Currently, get_scan_ratio() always calculates the balancing value for global
reclaim and memcg reclaim doesn't use it, so it has no scan_global_lru()
condition.

However, we plan to expand get_scan_ratio() to be usable for memcg too, later.
So this patch moves the code that depends on global reclaim inside an explicit
scan_global_lru() condition.


This patch doesn't have any functional change.

Acked-by: Rik van Riel <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
Index: mmotm-2.6.28-Dec02/mm/vmscan.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/vmscan.c
+++ mmotm-2.6.28-Dec02/mm/vmscan.c
@@ -1430,13 +1430,16 @@ static void get_scan_ratio(struct zone *
zone_nr_pages(zone, sc, LRU_INACTIVE_ANON);
file = zone_nr_pages(zone, sc, LRU_ACTIVE_FILE) +
zone_nr_pages(zone, sc, LRU_INACTIVE_FILE);
- free = zone_page_state(zone, NR_FREE_PAGES);

- /* If we have very few page cache pages, force-scan anon pages. */
- if (unlikely(file + free <= zone->pages_high)) {
- percent[0] = 100;
- percent[1] = 0;
- return;
+ if (scan_global_lru(sc)) {
+ free = zone_page_state(zone, NR_FREE_PAGES);
+ /* If we have very few page cache pages,
+ force-scan anon pages. */
+ if (unlikely(file + free <= zone->pages_high)) {
+ percent[0] = 100;
+ percent[1] = 0;
+ return;
+ }
}

/*

2008-12-03 04:59:42

by Kamezawa Hiroyuki

Subject: [PATCH 10/21] memcg-add-null-check-to-page_cgroup_zoneinfo.patch

If CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y, page_cgroup::mem_cgroup can be NULL,
so a NULL check is better.

A later patch uses this function.

Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
mm/memcontrol.c | 3 +++
1 file changed, 3 insertions(+)

Index: mmotm-2.6.28-Dec02/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec02/mm/memcontrol.c
@@ -231,6 +231,9 @@ page_cgroup_zoneinfo(struct page_cgroup
int nid = page_cgroup_nid(pc);
int zid = page_cgroup_zid(pc);

+ if (!mem)
+ return NULL;
+
return mem_cgroup_zoneinfo(mem, nid, zid);
}

2008-12-03 05:01:19

by Kamezawa Hiroyuki

Subject: [PATCH 11/21] memcg-make-inactive_anon_is_low.patch

Changelog:
v1 -> v2:
- add detail patch description
- fix coding style in mem_cgroup_set_inactive_ratio()
- add comment to mem_cgroup_set_inactive_ratio
- remove extra newline
- changed the type of memcg::inactive_ratio to unsigned int


inactive_anon_is_low() is a key component of active/inactive anon balancing
on reclaim. However, the current inactive_anon_is_low() only considers global
reclaim.

Therefore, we need the following ugly scan_global_lru() condition:

	if (lru == LRU_ACTIVE_ANON &&
	    (!scan_global_lru(sc) || inactive_anon_is_low(zone))) {
		shrink_active_list(nr_to_scan, zone, sc, priority, file);
		return 0;
	}

This causes memcg reclaim to always deactivate pages when shrink_list() is
called. This patch adds mem_cgroup_inactive_anon_is_low() to improve the
active/inactive anon balancing of memcg.
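
As a rough worked example (illustrative, using mem_cgroup_set_inactive_ratio()
added below, which computes int_sqrt(10 * limit_in_GB)): a memcg with a 4GB
limit gets inactive_ratio = int_sqrt(40) = 6, so mem_cgroup_inactive_anon_is_low()
reports "low" once inactive * 6 < active, i.e. once less than about 1/7 of the
group's anon pages sit on its inactive list, and shrink_list() then deactivates
some active anon pages for that memcg only.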


Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
CC: Cyrill Gorcunov <[email protected]>
CC: "Pekka Enberg" <[email protected]>
include/linux/memcontrol.h | 9 ++++++++
mm/memcontrol.c | 46 ++++++++++++++++++++++++++++++++++++++++++++-
mm/vmscan.c | 36 ++++++++++++++++++++++-------------
3 files changed, 77 insertions(+), 14 deletions(-)

Index: mmotm-2.6.28-Dec02/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-Dec02.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-Dec02/include/linux/memcontrol.h
@@ -100,6 +100,8 @@ extern void mem_cgroup_record_reclaim_pr

extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
int priority, enum lru_list lru);
+int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg,
+ struct zone *zone);

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
@@ -251,6 +253,13 @@ static inline bool mem_cgroup_oom_called
{
return false;
}
+
+static inline int
+mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg, struct zone *zone)
+{
+ return 1;
+}
+
#endif /* CONFIG_CGROUP_MEM_CONT */

#endif /* _LINUX_MEMCONTROL_H */
Index: mmotm-2.6.28-Dec02/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec02/mm/memcontrol.c
@@ -156,6 +156,9 @@ struct mem_cgroup {
unsigned long last_oom_jiffies;
int obsolete;
atomic_t refcnt;
+
+ unsigned int inactive_ratio;
+
/*
* statistics. This must be placed at the end of memcg.
*/
@@ -431,6 +434,20 @@ long mem_cgroup_calc_reclaim(struct mem_
return (nr_pages >> priority);
}

+int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg, struct zone *zone)
+{
+ unsigned long active;
+ unsigned long inactive;
+
+ inactive = mem_cgroup_get_all_zonestat(memcg, LRU_INACTIVE_ANON);
+ active = mem_cgroup_get_all_zonestat(memcg, LRU_ACTIVE_ANON);
+
+ if (inactive * memcg->inactive_ratio < active)
+ return 1;
+
+ return 0;
+}
+
unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
struct list_head *dst,
unsigned long *scanned, int order,
@@ -1360,6 +1377,29 @@ int mem_cgroup_shrink_usage(struct mm_st
return 0;
}

+/*
+ * The inactive anon list should be small enough that the VM never has to
+ * do too much work, but large enough that each inactive page has a chance
+ * to be referenced again before it is swapped out.
+ *
+ * this calculation is straightforward porting from
+ * page_alloc.c::setup_per_zone_inactive_ratio().
+ * it describe more detail.
+ */
+static void mem_cgroup_set_inactive_ratio(struct mem_cgroup *memcg)
+{
+ unsigned int gb, ratio;
+
+ gb = res_counter_read_u64(&memcg->res, RES_LIMIT) >> 30;
+ if (gb)
+ ratio = int_sqrt(10 * gb);
+ else
+ ratio = 1;
+
+ memcg->inactive_ratio = ratio;
+
+}
+
static DEFINE_MUTEX(set_limit_mutex);

static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
@@ -1398,6 +1438,10 @@ static int mem_cgroup_resize_limit(struc
GFP_KERNEL, false);
if (!progress) retry_count--;
}
+
+ if (!ret)
+ mem_cgroup_set_inactive_ratio(memcg);
+
return ret;
}

@@ -1982,7 +2026,7 @@ mem_cgroup_create(struct cgroup_subsys *
res_counter_init(&mem->res, NULL);
res_counter_init(&mem->memsw, NULL);
}
-
+ mem_cgroup_set_inactive_ratio(mem);
mem->last_scanned_child = NULL;

return &mem->css;
Index: mmotm-2.6.28-Dec02/mm/vmscan.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/vmscan.c
+++ mmotm-2.6.28-Dec02/mm/vmscan.c
@@ -1364,14 +1364,7 @@ static void shrink_active_list(unsigned
pagevec_release(&pvec);
}

-/**
- * inactive_anon_is_low - check if anonymous pages need to be deactivated
- * @zone: zone to check
- *
- * Returns true if the zone does not have enough inactive anon pages,
- * meaning some active anon pages need to be deactivated.
- */
-static int inactive_anon_is_low(struct zone *zone)
+static int inactive_anon_is_low_global(struct zone *zone)
{
unsigned long active, inactive;

@@ -1384,6 +1377,25 @@ static int inactive_anon_is_low(struct z
return 0;
}

+/**
+ * inactive_anon_is_low - check if anonymous pages need to be deactivated
+ * @zone: zone to check
+ * @sc: scan control of this context
+ *
+ * Returns true if the zone does not have enough inactive anon pages,
+ * meaning some active anon pages need to be deactivated.
+ */
+static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)
+{
+ int low;
+
+ if (scan_global_lru(sc))
+ low = inactive_anon_is_low_global(zone);
+ else
+ low = mem_cgroup_inactive_anon_is_low(sc->mem_cgroup, zone);
+ return low;
+}
+
static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
struct zone *zone, struct scan_control *sc, int priority)
{
@@ -1395,7 +1407,7 @@ static unsigned long shrink_list(enum lr
}

if (lru == LRU_ACTIVE_ANON &&
- (!scan_global_lru(sc) || inactive_anon_is_low(zone))) {
+ inactive_anon_is_low(zone, sc)) {
shrink_active_list(nr_to_scan, zone, sc, priority, file);
return 0;
}
@@ -1560,9 +1572,7 @@ static void shrink_zone(int priority, st
* Even if we did not try to evict anon pages at all, we want to
* rebalance the anon lru active/inactive ratio.
*/
- if (!scan_global_lru(sc) || inactive_anon_is_low(zone))
- shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
- else if (!scan_global_lru(sc))
+ if (inactive_anon_is_low(zone, sc))
shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);

throttle_vm_writeout(sc->gfp_mask);
@@ -1858,7 +1868,7 @@ loop_again:
* Do some background aging of the anon list, to give
* pages a chance to be referenced before reclaiming.
*/
- if (inactive_anon_is_low(zone))
+ if (inactive_anon_is_low(zone, &sc))
shrink_active_list(SWAP_CLUSTER_MAX, zone,
&sc, priority, 0);

2008-12-03 05:02:22

by Kamezawa Hiroyuki

Subject: [PATCH 12/21] memcg-make-mem_cgroup_zone_nr_pages.patch

Introduce mem_cgroup_zone_nr_pages().
It is called from the zone_nr_pages() helper function.


This patch doesn't have any behavior change.

Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
Index: mmotm-2.6.28-Dec02/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-Dec02.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-Dec02/include/linux/memcontrol.h
@@ -102,6 +102,9 @@ extern long mem_cgroup_calc_reclaim(stru
int priority, enum lru_list lru);
int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg,
struct zone *zone);
+unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
+ struct zone *zone,
+ enum lru_list lru);

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
@@ -260,6 +263,14 @@ mem_cgroup_inactive_anon_is_low(struct m
return 1;
}

+static inline unsigned long
+mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
+ enum lru_list lru)
+{
+ return 0;
+}
+
+
#endif /* CONFIG_CGROUP_MEM_CONT */

#endif /* _LINUX_MEMCONTROL_H */
Index: mmotm-2.6.28-Dec02/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec02/mm/memcontrol.c
@@ -186,7 +186,6 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
0, /* FORCE */
};

-
/* for encoding cft->private value on file */
#define _MEM (0)
#define _MEMSWAP (1)
@@ -448,6 +447,17 @@ int mem_cgroup_inactive_anon_is_low(stru
return 0;
}

+unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
+ struct zone *zone,
+ enum lru_list lru)
+{
+ int nid = zone->zone_pgdat->node_id;
+ int zid = zone_idx(zone);
+ struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+
+ return MEM_CGROUP_ZSTAT(mz, lru);
+}
+
unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
struct list_head *dst,
unsigned long *scanned, int order,
Index: mmotm-2.6.28-Dec02/mm/vmscan.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/vmscan.c
+++ mmotm-2.6.28-Dec02/mm/vmscan.c
@@ -140,6 +140,9 @@ static struct zone_reclaim_stat *get_rec
static unsigned long zone_nr_pages(struct zone *zone, struct scan_control *sc,
enum lru_list lru)
{
+ if (!scan_global_lru(sc))
+ return mem_cgroup_zone_nr_pages(sc->mem_cgroup, zone, lru);
+
return zone_page_state(zone, NR_LRU_BASE + lru);
}

2008-12-03 05:03:30

by Kamezawa Hiroyuki

Subject: [PATCH 13/21] memcg-make-zone_reclaim_stat.patch

Introduce a mem_cgroup_per_zone::reclaim_stat member and its
statistics-collecting functions.

Now, get_scan_ratio() can calculate the correct value for memcg reclaim.

Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
include/linux/memcontrol.h | 16 ++++++++++++++++
mm/memcontrol.c | 23 +++++++++++++++++++++++
mm/swap.c | 14 ++++++++++++++
mm/vmscan.c | 27 +++++++++++++--------------
4 files changed, 66 insertions(+), 14 deletions(-)

Index: mmotm-2.6.28-Dec02/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-Dec02.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-Dec02/include/linux/memcontrol.h
@@ -105,6 +105,10 @@ int mem_cgroup_inactive_anon_is_low(stru
unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
struct zone *zone,
enum lru_list lru);
+struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
+ struct zone *zone);
+struct zone_reclaim_stat*
+mem_cgroup_get_reclaim_stat_by_page(struct page *page);

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
@@ -271,6 +275,18 @@ mem_cgroup_zone_nr_pages(struct mem_cgro
}


+static inline struct zone_reclaim_stat*
+mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, struct zone *zone)
+{
+ return NULL;
+}
+
+static inline struct zone_reclaim_stat*
+mem_cgroup_get_reclaim_stat_by_page(struct page *page)
+{
+ return NULL;
+}
+
#endif /* CONFIG_CGROUP_MEM_CONT */

#endif /* _LINUX_MEMCONTROL_H */
Index: mmotm-2.6.28-Dec02/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec02/mm/memcontrol.c
@@ -103,6 +103,8 @@ struct mem_cgroup_per_zone {
*/
struct list_head lists[NR_LRU_LISTS];
unsigned long count[NR_LRU_LISTS];
+
+ struct zone_reclaim_stat reclaim_stat;
};
/* Macro for accessing counter */
#define MEM_CGROUP_ZSTAT(mz, idx) ((mz)->count[(idx)])
@@ -458,6 +460,27 @@ unsigned long mem_cgroup_zone_nr_pages(s
return MEM_CGROUP_ZSTAT(mz, lru);
}

+struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
+ struct zone *zone)
+{
+ int nid = zone->zone_pgdat->node_id;
+ int zid = zone_idx(zone);
+ struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+
+ return &mz->reclaim_stat;
+}
+
+struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat_by_page(struct page *page)
+{
+ struct page_cgroup *pc = lookup_page_cgroup(page);
+ struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
+
+ if (!mz)
+ return NULL;
+
+ return &mz->reclaim_stat;
+}
+
unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
struct list_head *dst,
unsigned long *scanned, int order,
Index: mmotm-2.6.28-Dec02/mm/swap.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/swap.c
+++ mmotm-2.6.28-Dec02/mm/swap.c
@@ -158,6 +158,7 @@ void activate_page(struct page *page)
{
struct zone *zone = page_zone(page);
struct zone_reclaim_stat *reclaim_stat = &zone->reclaim_stat;
+ struct zone_reclaim_stat *memcg_reclaim_stat;

spin_lock_irq(&zone->lru_lock);
if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
@@ -172,6 +173,12 @@ void activate_page(struct page *page)

reclaim_stat->recent_rotated[!!file]++;
reclaim_stat->recent_scanned[!!file]++;
+
+ memcg_reclaim_stat = mem_cgroup_get_reclaim_stat_by_page(page);
+ if (memcg_reclaim_stat) {
+ memcg_reclaim_stat->recent_rotated[!!file]++;
+ memcg_reclaim_stat->recent_scanned[!!file]++;
+ }
}
spin_unlock_irq(&zone->lru_lock);
}
@@ -400,6 +407,7 @@ void ____pagevec_lru_add(struct pagevec
int i;
struct zone *zone = NULL;
struct zone_reclaim_stat *reclaim_stat = NULL;
+ struct zone_reclaim_stat *memcg_reclaim_stat = NULL;

VM_BUG_ON(is_unevictable_lru(lru));

@@ -413,6 +421,8 @@ void ____pagevec_lru_add(struct pagevec
spin_unlock_irq(&zone->lru_lock);
zone = pagezone;
reclaim_stat = &zone->reclaim_stat;
+ memcg_reclaim_stat =
+ mem_cgroup_get_reclaim_stat_by_page(page);
spin_lock_irq(&zone->lru_lock);
}
VM_BUG_ON(PageActive(page));
@@ -421,9 +431,13 @@ void ____pagevec_lru_add(struct pagevec
SetPageLRU(page);
file = is_file_lru(lru);
reclaim_stat->recent_scanned[file]++;
+ if (memcg_reclaim_stat)
+ memcg_reclaim_stat->recent_scanned[file]++;
if (is_active_lru(lru)) {
SetPageActive(page);
reclaim_stat->recent_rotated[file]++;
+ if (memcg_reclaim_stat)
+ memcg_reclaim_stat->recent_rotated[file]++;
}
add_page_to_lru_list(zone, page, lru);
}
Index: mmotm-2.6.28-Dec02/mm/vmscan.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/vmscan.c
+++ mmotm-2.6.28-Dec02/mm/vmscan.c
@@ -134,6 +134,9 @@ static DECLARE_RWSEM(shrinker_rwsem);
static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
struct scan_control *sc)
{
+ if (!scan_global_lru(sc))
+ mem_cgroup_get_reclaim_stat(sc->mem_cgroup, zone);
+
return &zone->reclaim_stat;
}

@@ -1141,17 +1144,14 @@ static unsigned long shrink_inactive_lis
__mod_zone_page_state(zone, NR_INACTIVE_ANON,
-count[LRU_INACTIVE_ANON]);

- if (scan_global_lru(sc)) {
+ if (scan_global_lru(sc))
zone->pages_scanned += nr_scan;
- reclaim_stat->recent_scanned[0] +=
- count[LRU_INACTIVE_ANON];
- reclaim_stat->recent_scanned[0] +=
- count[LRU_ACTIVE_ANON];
- reclaim_stat->recent_scanned[1] +=
- count[LRU_INACTIVE_FILE];
- reclaim_stat->recent_scanned[1] +=
- count[LRU_ACTIVE_FILE];
- }
+
+ reclaim_stat->recent_scanned[0] += count[LRU_INACTIVE_ANON];
+ reclaim_stat->recent_scanned[0] += count[LRU_ACTIVE_ANON];
+ reclaim_stat->recent_scanned[1] += count[LRU_INACTIVE_FILE];
+ reclaim_stat->recent_scanned[1] += count[LRU_ACTIVE_FILE];
+
spin_unlock_irq(&zone->lru_lock);

nr_scanned += nr_scan;
@@ -1209,7 +1209,7 @@ static unsigned long shrink_inactive_lis
SetPageLRU(page);
lru = page_lru(page);
add_page_to_lru_list(zone, page, lru);
- if (PageActive(page) && scan_global_lru(sc)) {
+ if (PageActive(page)) {
int file = !!page_is_file_cache(page);
reclaim_stat->recent_rotated[file]++;
}
@@ -1284,8 +1284,8 @@ static void shrink_active_list(unsigned
*/
if (scan_global_lru(sc)) {
zone->pages_scanned += pgscanned;
- reclaim_stat->recent_scanned[!!file] += pgmoved;
}
+ reclaim_stat->recent_scanned[!!file] += pgmoved;

if (file)
__mod_zone_page_state(zone, NR_ACTIVE_FILE, -pgmoved);
@@ -1319,8 +1319,7 @@ static void shrink_active_list(unsigned
* This helps balance scan pressure between file and anonymous
* pages in get_scan_ratio.
*/
- if (scan_global_lru(sc))
- reclaim_stat->recent_rotated[!!file] += pgmoved;
+ reclaim_stat->recent_rotated[!!file] += pgmoved;

/*
* Move the pages to the [file or anon] inactive list.

2008-12-03 05:05:26

by Kamezawa Hiroyuki

Subject: [PATCH 14/21] memcg-remove-mem_cgroup_cal_reclaim.patch

Now, get_scan_ratio() returns the correct value for memcg reclaim too, so
mem_cgroup_calc_reclaim() can be removed.

With this, memcg reclaim gets the same anon/file reclaim balancing capability
as global reclaim.
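
For illustration (made-up numbers): with the unified path below, an LRU list
of 40,000 pages scanned at priority 12, with percent[file] = 60 from
get_scan_ratio(), yields scan = (40000 >> 12) * 60 / 100 = 5 pages for this
round. Memcg reclaim now uses this value directly as nr[l] instead of the
removed mem_cgroup_calc_reclaim() result.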

Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
include/linux/memcontrol.h | 10 ----------
mm/memcontrol.c | 21 ---------------------
mm/vmscan.c | 27 ++++++++++-----------------
3 files changed, 10 insertions(+), 48 deletions(-)

Index: mmotm-2.6.28-Dec02/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-Dec02.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-Dec02/include/linux/memcontrol.h
@@ -97,9 +97,6 @@ extern void mem_cgroup_note_reclaim_prio
int priority);
extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
int priority);
-
-extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
- int priority, enum lru_list lru);
int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg,
struct zone *zone);
unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
@@ -244,13 +241,6 @@ static inline void mem_cgroup_record_rec
{
}

-static inline long mem_cgroup_calc_reclaim(struct mem_cgroup *mem,
- struct zone *zone, int priority,
- enum lru_list lru)
-{
- return 0;
-}
-
static inline bool mem_cgroup_disabled(void)
{
return true;
Index: mmotm-2.6.28-Dec02/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec02/mm/memcontrol.c
@@ -414,27 +414,6 @@ void mem_cgroup_record_reclaim_priority(
mem->prev_priority = priority;
}

-/*
- * Calculate # of pages to be scanned in this priority/zone.
- * See also vmscan.c
- *
- * priority starts from "DEF_PRIORITY" and decremented in each loop.
- * (see include/linux/mmzone.h)
- */
-
-long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
- int priority, enum lru_list lru)
-{
- long nr_pages;
- int nid = zone->zone_pgdat->node_id;
- int zid = zone_idx(zone);
- struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid);
-
- nr_pages = MEM_CGROUP_ZSTAT(mz, lru);
-
- return (nr_pages >> priority);
-}
-
int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg, struct zone *zone)
{
unsigned long active;
Index: mmotm-2.6.28-Dec02/mm/vmscan.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/vmscan.c
+++ mmotm-2.6.28-Dec02/mm/vmscan.c
@@ -1519,30 +1519,23 @@ static void shrink_zone(int priority, st
get_scan_ratio(zone, sc, percent);

for_each_evictable_lru(l) {
- if (scan_global_lru(sc)) {
- int file = is_file_lru(l);
- int scan;
+ int file = is_file_lru(l);
+ int scan;

- scan = zone_page_state(zone, NR_LRU_BASE + l);
- if (priority) {
- scan >>= priority;
- scan = (scan * percent[file]) / 100;
- }
+ scan = zone_page_state(zone, NR_LRU_BASE + l);
+ if (priority) {
+ scan >>= priority;
+ scan = (scan * percent[file]) / 100;
+ }
+ if (scan_global_lru(sc)) {
zone->lru[l].nr_scan += scan;
nr[l] = zone->lru[l].nr_scan;
if (nr[l] >= sc->swap_cluster_max)
zone->lru[l].nr_scan = 0;
else
nr[l] = 0;
- } else {
- /*
- * This reclaim occurs not because zone memory shortage
- * but because memory controller hits its limit.
- * Don't modify zone reclaim related data.
- */
- nr[l] = mem_cgroup_calc_reclaim(sc->mem_cgroup, zone,
- priority, l);
- }
+ } else
+ nr[l] = scan;
}

while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||

2008-12-03 05:06:20

by Kamezawa Hiroyuki

Subject: [PATCH 15/21] memcg-show-reclaim-stat.patch

Added the following five fields to the memory.stat file:
- inactive_ratio
- recent_rotated_anon
- recent_rotated_file
- recent_scanned_anon
- recent_scanned_file

Changelog:
- unified inactive_ratio patch and recent_rotate patch.
- added documentation.
- put under CONFIG_DEBUG_VM.
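
With CONFIG_DEBUG_VM=y, the tail of memory.stat then looks something like the
following (values are made up for illustration):

	inactive_ratio 6
	recent_rotated_anon 1286
	recent_rotated_file 42
	recent_scanned_anon 5310
	recent_scanned_file 11840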

Acked-by: Rik van Riel <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>

Documentation/controllers/memory.txt | 25 +++++++++++++++++++++++++
mm/memcontrol.c | 30 ++++++++++++++++++++++++++++++
2 files changed, 55 insertions(+)

Index: mmotm-2.6.28-Dec02/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec02/mm/memcontrol.c
@@ -1810,6 +1810,36 @@ static int mem_control_stat_show(struct
cb->fill(cb, "unevictable", unevictable * PAGE_SIZE);

}
+
+#ifdef CONFIG_DEBUG_VM
+ cb->fill(cb, "inactive_ratio", mem_cont->inactive_ratio);
+
+ {
+ int nid, zid;
+ struct mem_cgroup_per_zone *mz;
+ unsigned long recent_rotated[2] = {0, 0};
+ unsigned long recent_scanned[2] = {0, 0};
+
+ for_each_online_node(nid)
+ for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+ mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
+
+ recent_rotated[0] +=
+ mz->reclaim_stat.recent_rotated[0];
+ recent_rotated[1] +=
+ mz->reclaim_stat.recent_rotated[1];
+ recent_scanned[0] +=
+ mz->reclaim_stat.recent_scanned[0];
+ recent_scanned[1] +=
+ mz->reclaim_stat.recent_scanned[1];
+ }
+ cb->fill(cb, "recent_rotated_anon", recent_rotated[0]);
+ cb->fill(cb, "recent_rotated_file", recent_rotated[1]);
+ cb->fill(cb, "recent_scanned_anon", recent_scanned[0]);
+ cb->fill(cb, "recent_scanned_file", recent_scanned[1]);
+ }
+#endif
+
return 0;
}

Index: mmotm-2.6.28-Dec02/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.28-Dec02.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.28-Dec02/Documentation/controllers/memory.txt
@@ -289,6 +289,31 @@ will be charged as a new owner of it.
Because rmdir() moves all pages to parent, some out-of-use page caches can be
moved to the parent. If you want to avoid that, force_empty will be useful.

+5.2 stat file
+ memory.stat file includes following statistics (now)
+ cache - # of pages from page-cache and shmem.
+ rss - # of pages from anonymous memory.
+ pgpgin - # of event of charging
+ pgpgout - # of event of uncharging
+ active_anon - # of pages on active lru of anon, shmem.
+ inactive_anon - # of pages on active lru of anon, shmem
+ active_file - # of pages on active lru of file-cache
+ inactive_file - # of pages on inactive lru of file cache
+ unevictable - # of pages cannot be reclaimed.(mlocked etc)
+
+ Below is depend on CONFIG_DEBUG_VM.
+ inactive_ratio - VM inernal parameter. (see mm/page_alloc.c)
+ recent_rotated_anon - VM internal parameter. (see mm/vmscan.c)
+ recent_rotated_file - VM internal parameter. (see mm/vmscan.c)
+ recent_scanned_anon - VM internal parameter. (see mm/vmscan.c)
+ recent_scanned_file - VM internal parameter. (see mm/vmscan.c)
+
+ Memo:
+ recent_rotated means recent frequency of lru rotation.
+ recent_scanned means recent # of scans to lru.
+ showing for better debug please see the code for meanings.
+
+
6. Hierarchy support

The memory controller supports a deep hierarchy and hierarchical accounting.

2008-12-03 05:07:59

by Kamezawa Hiroyuki

Subject: [PATCH 16/21] memcg-rename-scan-glonal-lru.patch

Rename scan_global_lru() to scanning_global_lru().

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

mm/vmscan.c | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)

Index: mmotm-2.6.28-Dec02/mm/vmscan.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/vmscan.c
+++ mmotm-2.6.28-Dec02/mm/vmscan.c
@@ -126,15 +126,15 @@ static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);

#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-#define scan_global_lru(sc) (!(sc)->mem_cgroup)
+#define scanning_global_lru(sc) (!(sc)->mem_cgroup)
#else
-#define scan_global_lru(sc) (1)
+#define scanning_global_lru(sc) (1)
#endif

static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
struct scan_control *sc)
{
- if (!scan_global_lru(sc))
+ if (!scanning_global_lru(sc))
mem_cgroup_get_reclaim_stat(sc->mem_cgroup, zone);

return &zone->reclaim_stat;
@@ -143,7 +143,7 @@ static struct zone_reclaim_stat *get_rec
static unsigned long zone_nr_pages(struct zone *zone, struct scan_control *sc,
enum lru_list lru)
{
- if (!scan_global_lru(sc))
+ if (!scanning_global_lru(sc))
return mem_cgroup_zone_nr_pages(sc->mem_cgroup, zone, lru);

return zone_page_state(zone, NR_LRU_BASE + lru);
@@ -1144,7 +1144,7 @@ static unsigned long shrink_inactive_lis
__mod_zone_page_state(zone, NR_INACTIVE_ANON,
-count[LRU_INACTIVE_ANON]);

- if (scan_global_lru(sc))
+ if (scanning_global_lru(sc))
zone->pages_scanned += nr_scan;

reclaim_stat->recent_scanned[0] += count[LRU_INACTIVE_ANON];
@@ -1183,7 +1183,7 @@ static unsigned long shrink_inactive_lis
if (current_is_kswapd()) {
__count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scan);
__count_vm_events(KSWAPD_STEAL, nr_freed);
- } else if (scan_global_lru(sc))
+ } else if (scanning_global_lru(sc))
__count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scan);

__count_zone_vm_events(PGSTEAL, zone, nr_freed);
@@ -1282,7 +1282,7 @@ static void shrink_active_list(unsigned
* zone->pages_scanned is used for detect zone's oom
* mem_cgroup remembers nr_scan by itself.
*/
- if (scan_global_lru(sc)) {
+ if (scanning_global_lru(sc)) {
zone->pages_scanned += pgscanned;
}
reclaim_stat->recent_scanned[!!file] += pgmoved;
@@ -1391,7 +1391,7 @@ static int inactive_anon_is_low(struct z
{
int low;

- if (scan_global_lru(sc))
+ if (scanning_global_lru(sc))
low = inactive_anon_is_low_global(zone);
else
low = mem_cgroup_inactive_anon_is_low(sc->mem_cgroup, zone);
@@ -1445,7 +1445,7 @@ static void get_scan_ratio(struct zone *
file = zone_nr_pages(zone, sc, LRU_ACTIVE_FILE) +
zone_nr_pages(zone, sc, LRU_INACTIVE_FILE);

- if (scan_global_lru(sc)) {
+ if (scanning_global_lru(sc)) {
free = zone_page_state(zone, NR_FREE_PAGES);
/* If we have very few page cache pages,
force-scan anon pages. */
@@ -1527,7 +1527,7 @@ static void shrink_zone(int priority, st
scan >>= priority;
scan = (scan * percent[file]) / 100;
}
- if (scan_global_lru(sc)) {
+ if (scanning_global_lru(sc)) {
zone->lru[l].nr_scan += scan;
nr[l] = zone->lru[l].nr_scan;
if (nr[l] >= sc->swap_cluster_max)
@@ -1602,7 +1602,7 @@ static void shrink_zones(int priority, s
* Take care memory controller reclaiming has small influence
* to global LRU.
*/
- if (scan_global_lru(sc)) {
+ if (scanning_global_lru(sc)) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
note_zone_scanning_priority(zone, priority);
@@ -1655,12 +1655,12 @@ static unsigned long do_try_to_free_page

delayacct_freepages_start();

- if (scan_global_lru(sc))
+ if (scanning_global_lru(sc))
count_vm_event(ALLOCSTALL);
/*
* mem_cgroup will not do shrink_slab.
*/
- if (scan_global_lru(sc)) {
+ if (scanning_global_lru(sc)) {
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {

if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
@@ -1679,7 +1679,7 @@ static unsigned long do_try_to_free_page
* Don't shrink slabs when reclaiming memory from
* over limit cgroups
*/
- if (scan_global_lru(sc)) {
+ if (scanning_global_lru(sc)) {
shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages, NULL);
if (reclaim_state) {
sc->nr_reclaimed += reclaim_state->reclaimed_slab;
@@ -1710,7 +1710,7 @@ static unsigned long do_try_to_free_page
congestion_wait(WRITE, HZ/10);
}
/* top priority shrink_zones still had more to do? don't OOM, then */
- if (!sc->all_unreclaimable && scan_global_lru(sc))
+ if (!sc->all_unreclaimable && scanning_global_lru(sc))
ret = sc->nr_reclaimed;
out:
/*
@@ -1723,7 +1723,7 @@ out:
if (priority < 0)
priority = 0;

- if (scan_global_lru(sc)) {
+ if (scanning_global_lru(sc)) {
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {

if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))

2008-12-03 05:09:29

by Kamezawa Hiroyuki

Subject: [PATCH 17/21] memcg_prev_priority_protect.patch

From: KOSAKI Motohiro <[email protected]>

Currently, mem_cgroup doesn't have its own lock, and most of its members don't
need one (e.g. mem_cgroup->info is protected by the zone lock, and
mem_cgroup->stat is a per-cpu variable).

However, there is one explicit exception: mem_cgroup->prev_priority needs a
lock but isn't protected by one.
Luckily, this is NOT a bug, because prev_priority isn't used by the current
reclaim code.

However, we plan to use prev_priority again in the future.
Therefore, it is better to fix this now.


In addition, we plan to reuse this lock for other members.
So the name "reclaim_param_lock" is better than "prev_priority_lock".


Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
mm/memcontrol.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)

Index: mmotm-2.6.28-Dec02/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec02/mm/memcontrol.c
@@ -144,6 +144,11 @@ struct mem_cgroup {
*/
struct mem_cgroup_lru_info info;

+ /*
+ * protects reclaim-related members.
+ */
+ spinlock_t reclaim_param_lock;
+
int prev_priority; /* for recording reclaim priority */

/*
@@ -400,18 +405,28 @@ int mem_cgroup_calc_mapped_ratio(struct
*/
int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem)
{
- return mem->prev_priority;
+ int prev_priority;
+
+ spin_lock(&mem->reclaim_param_lock);
+ prev_priority = mem->prev_priority;
+ spin_unlock(&mem->reclaim_param_lock);
+
+ return prev_priority;
}

void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem, int priority)
{
+ spin_lock(&mem->reclaim_param_lock);
if (priority < mem->prev_priority)
mem->prev_priority = priority;
+ spin_unlock(&mem->reclaim_param_lock);
}

void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem, int priority)
{
+ spin_lock(&mem->reclaim_param_lock);
mem->prev_priority = priority;
+ spin_unlock(&mem->reclaim_param_lock);
}

int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg, struct zone *zone)
@@ -2070,6 +2085,7 @@ mem_cgroup_create(struct cgroup_subsys *
}
mem_cgroup_set_inactive_ratio(mem);
mem->last_scanned_child = NULL;
+ spin_lock_init(&mem->reclaim_param_lock);

return &mem->css;
free_out:

2008-12-03 05:10:43

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 18/21] memcg-swappiness.patch

Currently, /proc/sys/vm/swappiness can change the swappiness ratio for global
reclaim. However, memcg reclaim doesn't have a tuning parameter of its own.

In general, the optimal swappiness depends on the workload.
(e.g. HPC workloads need lower swappiness than others.)

Per-cgroup swappiness therefore improves administrator tunability.

Changelog:
- modified for stacking file.
- return -EINVAL rather than -EBUSY.
- fixed hierarchy handling.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>

Documentation/controllers/memory.txt | 9 ++++
include/linux/swap.h | 3 -
mm/memcontrol.c | 78 +++++++++++++++++++++++++++++++----
mm/vmscan.c | 7 +--
4 files changed, 86 insertions(+), 11 deletions(-)

Index: mmotm-2.6.28-Dec02/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec02/mm/memcontrol.c
@@ -164,6 +164,9 @@ struct mem_cgroup {
int obsolete;
atomic_t refcnt;

+ unsigned int swappiness;
+
+
unsigned int inactive_ratio;

/*
@@ -630,6 +633,22 @@ static bool mem_cgroup_check_under_limit
return false;
}

+static unsigned int get_swappiness(struct mem_cgroup *memcg)
+{
+ struct cgroup *cgrp = memcg->css.cgroup;
+ unsigned int swappiness;
+
+ /* root ? */
+ if (cgrp->parent == NULL)
+ return vm_swappiness;
+
+ spin_lock(&memcg->reclaim_param_lock);
+ swappiness = memcg->swappiness;
+ spin_unlock(&memcg->reclaim_param_lock);
+
+ return swappiness;
+}
+
/*
* Dance down the hierarchy if needed to reclaim memory. We remember the
* last child we reclaimed from, so that we don't end up penalizing
@@ -650,7 +669,8 @@ static int mem_cgroup_hierarchical_recla
* but there might be left over accounting, even after children
* have left.
*/
- ret = try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap);
+ ret = try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
+ get_swappiness(root_mem));
if (mem_cgroup_check_under_limit(root_mem))
return 0;
if (!root_mem->use_hierarchy)
@@ -666,7 +686,8 @@ static int mem_cgroup_hierarchical_recla
cgroup_unlock();
continue;
}
- ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap);
+ ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
+ get_swappiness(next_mem));
if (mem_cgroup_check_under_limit(root_mem))
return 0;
cgroup_lock();
@@ -1394,7 +1415,8 @@ int mem_cgroup_shrink_usage(struct mm_st
rcu_read_unlock();

do {
- progress = try_to_free_mem_cgroup_pages(mem, gfp_mask, true);
+ progress = try_to_free_mem_cgroup_pages(mem, gfp_mask, true,
+ get_swappiness(mem));
progress += mem_cgroup_check_under_limit(mem);
} while (!progress && --retry);

@@ -1462,7 +1484,9 @@ static int mem_cgroup_resize_limit(struc
break;

progress = try_to_free_mem_cgroup_pages(memcg,
- GFP_KERNEL, false);
+ GFP_KERNEL,
+ false,
+ get_swappiness(memcg));
if (!progress) retry_count--;
}

@@ -1506,7 +1530,8 @@ int mem_cgroup_resize_memsw_limit(struct
break;

oldusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
- try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL, true);
+ try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL, true,
+ get_swappiness(memcg));
curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
if (curusage >= oldusage)
retry_count--;
@@ -1637,8 +1662,8 @@ try_to_free:
ret = -EINTR;
goto out;
}
- progress = try_to_free_mem_cgroup_pages(mem,
- GFP_KERNEL, false);
+ progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
+ false, get_swappiness(mem));
if (!progress) {
nr_retries--;
/* maybe some writeback is necessary */
@@ -1858,6 +1883,37 @@ static int mem_control_stat_show(struct
return 0;
}

+static u64 mem_cgroup_swappiness_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+ return get_swappiness(memcg);
+}
+
+static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
+ u64 val)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ struct mem_cgroup *parent;
+ if (val > 100)
+ return -EINVAL;
+
+ if (cgrp->parent == NULL)
+ return -EINVAL;
+
+ parent = mem_cgroup_from_cont(cgrp->parent);
+ /* If under hierarchy, only empty-root can set this value */
+ if ((parent->use_hierarchy) ||
+ (memcg->use_hierarchy && !list_empty(&cgrp->children)))
+ return -EINVAL;
+
+ spin_lock(&memcg->reclaim_param_lock);
+ memcg->swappiness = val;
+ spin_unlock(&memcg->reclaim_param_lock);
+
+ return 0;
+}
+

static struct cftype mem_cgroup_files[] = {
{
@@ -1896,6 +1952,11 @@ static struct cftype mem_cgroup_files[]
.write_u64 = mem_cgroup_hierarchy_write,
.read_u64 = mem_cgroup_hierarchy_read,
},
+ {
+ .name = "swappiness",
+ .read_u64 = mem_cgroup_swappiness_read,
+ .write_u64 = mem_cgroup_swappiness_write,
+ },
};

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
@@ -2087,6 +2148,9 @@ mem_cgroup_create(struct cgroup_subsys *
mem->last_scanned_child = NULL;
spin_lock_init(&mem->reclaim_param_lock);

+ if (parent)
+ mem->swappiness = get_swappiness(parent);
+
return &mem->css;
free_out:
for_each_node_state(node, N_POSSIBLE)
Index: mmotm-2.6.28-Dec02/mm/vmscan.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/vmscan.c
+++ mmotm-2.6.28-Dec02/mm/vmscan.c
@@ -1759,14 +1759,15 @@ unsigned long try_to_free_pages(struct z
#ifdef CONFIG_CGROUP_MEM_RES_CTLR

unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
- gfp_t gfp_mask,
- bool noswap)
+ gfp_t gfp_mask,
+ bool noswap,
+ unsigned int swappiness)
{
struct scan_control sc = {
.may_writepage = !laptop_mode,
.may_swap = 1,
.swap_cluster_max = SWAP_CLUSTER_MAX,
- .swappiness = vm_swappiness,
+ .swappiness = swappiness,
.order = 0,
.mem_cgroup = mem_cont,
.isolate_pages = mem_cgroup_isolate_pages,
Index: mmotm-2.6.28-Dec02/include/linux/swap.h
===================================================================
--- mmotm-2.6.28-Dec02.orig/include/linux/swap.h
+++ mmotm-2.6.28-Dec02/include/linux/swap.h
@@ -214,7 +214,8 @@ static inline void lru_cache_add_active_
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask);
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
- gfp_t gfp_mask, bool noswap);
+ gfp_t gfp_mask, bool noswap,
+ unsigned int swappiness);
extern int __isolate_lru_page(struct page *page, int mode, int file);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
Index: mmotm-2.6.28-Dec02/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.28-Dec02.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.28-Dec02/Documentation/controllers/memory.txt
@@ -314,6 +314,15 @@ will be charged as a new owner of it.
showing for better debug please see the code for meanings.


+5.3 swappiness
+ Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
+
+ The following cgroups' swappiness can't be changed:
+ - the root cgroup (it uses /proc/sys/vm/swappiness).
+ - a cgroup which uses hierarchy and has child cgroups.
+ - a cgroup which uses hierarchy and is not the root of the hierarchy.
+
+
6. Hierarchy support

The memory controller supports a deep hierarchy and hierarchical accounting.

2008-12-03 05:12:25

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [Experimental][PATCH 19/21] memcg-fix-pre-destroy.patch

still under development.
==
Now, the final check of refcnt is done after pre_destroy(), so rmdir() can fail
after pre_destroy().
memcg sets mem->obsolete to 1 at pre_destroy(), and this is buggy..

Several ways to fix this can be considered. This is one idea.

Fortunately, the only user of css_get()/css_put() is memcg, for now.
I'd like to reuse it.
This patch changes css->refcnt usage and behavior as follows:
- css->refcnt is initialized to 1.

- after pre_destroy(), before destroy(), try to drop css->refcnt to 0.

- css_tryget() is added. It only succeeds when css->refcnt > 0.

- css_under_removal() is added. It checks css->refcnt == 0, i.e. whether
this cgroup is under destroy() or not.

- css_put() is changed not to call notify_on_release().
Per the documentation, notify_on_release() is called when there are no
tasks/children in the cgroup. In the implementation, notify_on_release is
not called if css->refcnt > 0.
This is problematic: memcg holds css->refcnt for each page even when
there are no tasks, so the release handler would never be called.
But now, rmdir()/pre_destroy() of memcg works well, and checking
css->refcnt is not (and shouldn't be) necessary for notification.
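
Roughly, a subsystem that looks up a cgroup_subsys_state under RCU is expected
to use the new helpers like this (just a sketch, not code from the patch below;
"my_css" is a placeholder for whatever css pointer the caller found):

	rcu_read_lock();
	if (css_tryget(my_css)) {
		/* refcnt was > 0, so rmdir()/destroy() cannot complete
		 * until we drop this reference. */
		rcu_read_unlock();
		/* ... use the subsystem state ... */
		css_put(my_css);
	} else {
		/* refcnt already reached 0; destroy() is imminent,
		 * treat the object as gone. */
		rcu_read_unlock();
	}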

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>


include/linux/cgroup.h | 21 +++++++++++++++++--
kernel/cgroup.c | 53 +++++++++++++++++++++++++++++++++++--------------
mm/memcontrol.c | 40 +++++++++++++++++++++++++-----------
3 files changed, 85 insertions(+), 29 deletions(-)

Index: mmotm-2.6.28-Dec02/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.28-Dec02.orig/include/linux/cgroup.h
+++ mmotm-2.6.28-Dec02/include/linux/cgroup.h
@@ -54,7 +54,9 @@ struct cgroup_subsys_state {

/* State maintained by the cgroup system to allow
* subsystems to be "busy". Should be accessed via css_get()
- * and css_put() */
+ * and css_put(). If this value is 0, css is now under removal and
+ * destroy() will be called soon. (and there is no roll-back.)
+ */

atomic_t refcnt;

@@ -86,7 +88,22 @@ extern void __css_put(struct cgroup_subs
static inline void css_put(struct cgroup_subsys_state *css)
{
if (!test_bit(CSS_ROOT, &css->flags))
- __css_put(css);
+ atomic_dec(&css->refcnt);
+}
+
+/* returns not-zero if success */
+static inline int css_tryget(struct cgroup_subsys_state *css)
+{
+ if (!test_bit(CSS_ROOT, &css->flags))
+ return atomic_inc_not_zero(&css->refcnt);
+ return 1;
+}
+
+static inline bool css_under_removal(struct cgroup_subsys_state *css)
+{
+ if (test_bit(CSS_ROOT, &css->flags))
+ return false;
+ return atomic_read(&css->refcnt) == 0;
}

/* bits in struct cgroup flags field */
Index: mmotm-2.6.28-Dec02/kernel/cgroup.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/kernel/cgroup.c
+++ mmotm-2.6.28-Dec02/kernel/cgroup.c
@@ -589,6 +589,32 @@ static void cgroup_call_pre_destroy(stru
return;
}

+/*
+ * Try to set all subsys's refcnt to be 0.
+ * css->refcnt==0 means this subsys will be destroy()'d.
+ */
+static bool cgroup_set_subsys_removed(struct cgroup *cgrp)
+{
+ struct cgroup_subsys *ss;
+ struct cgroup_subsys_state *css, *tmp;
+
+ for_each_subsys(cgrp->root, ss) {
+ css = cgrp->subsys[ss->subsys_id];
+ if (!atomic_dec_and_test(&css->refcnt))
+ goto rollback;
+ }
+ return true;
+rollback:
+ for_each_subsys(cgrp->root, ss) {
+ tmp = cgrp->subsys[ss->subsys_id];
+ atomic_inc(&tmp->refcnt);
+ if (tmp == css)
+ break;
+ }
+ return false;
+}
+
+
static void cgroup_diput(struct dentry *dentry, struct inode *inode)
{
/* is dentry a directory ? if so, kfree() associated cgroup */
@@ -2310,7 +2336,7 @@ static void init_cgroup_css(struct cgrou
struct cgroup *cgrp)
{
css->cgroup = cgrp;
- atomic_set(&css->refcnt, 0);
+ atomic_set(&css->refcnt, 1);
css->flags = 0;
if (cgrp == dummytop)
set_bit(CSS_ROOT, &css->flags);
@@ -2438,7 +2464,7 @@ static int cgroup_has_css_refs(struct cg
* matter, since it can only happen if the cgroup
* has been deleted and hence no longer needs the
* release agent to be called anyway. */
- if (css && atomic_read(&css->refcnt))
+ if (css && (atomic_read(&css->refcnt) > 1))
return 1;
}
return 0;
@@ -2465,7 +2491,8 @@ static int cgroup_rmdir(struct inode *un

/*
* Call pre_destroy handlers of subsys. Notify subsystems
- * that rmdir() request comes.
+ * that rmdir() request comes. pre_destroy() is expected to drop all
+ * extra refcnt to css. (css->refcnt == 1)
*/
cgroup_call_pre_destroy(cgrp);

@@ -2479,8 +2506,15 @@ static int cgroup_rmdir(struct inode *un
return -EBUSY;
}

+ /* last check ! */
+ if (!cgroup_set_subsys_removed(cgrp)) {
+ mutex_unlock(&cgroup_mutex);
+ return -EBUSY;
+ }
+
spin_lock(&release_list_lock);
set_bit(CGRP_REMOVED, &cgrp->flags);
+
if (!list_empty(&cgrp->release_list))
list_del(&cgrp->release_list);
spin_unlock(&release_list_lock);
@@ -3003,7 +3037,7 @@ static void check_for_release(struct cgr
/* All of these checks rely on RCU to keep the cgroup
* structure alive */
if (cgroup_is_releasable(cgrp) && !atomic_read(&cgrp->count)
- && list_empty(&cgrp->children) && !cgroup_has_css_refs(cgrp)) {
+ && list_empty(&cgrp->children)) {
/* Control Group is currently removeable. If it's not
* already queued for a userspace notification, queue
* it now */
@@ -3020,17 +3054,6 @@ static void check_for_release(struct cgr
}
}

-void __css_put(struct cgroup_subsys_state *css)
-{
- struct cgroup *cgrp = css->cgroup;
- rcu_read_lock();
- if (atomic_dec_and_test(&css->refcnt) && notify_on_release(cgrp)) {
- set_bit(CGRP_RELEASABLE, &cgrp->flags);
- check_for_release(cgrp);
- }
- rcu_read_unlock();
-}
-
/*
* Notify userspace when a cgroup is released, by running the
* configured release agent with the name of the cgroup (path
Index: mmotm-2.6.28-Dec02/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec02/mm/memcontrol.c
@@ -161,7 +161,6 @@ struct mem_cgroup {
*/
bool use_hierarchy;
unsigned long last_oom_jiffies;
- int obsolete;
atomic_t refcnt;

unsigned int swappiness;
@@ -590,8 +589,14 @@ mem_cgroup_get_first_node(struct mem_cgr
{
struct cgroup *cgroup;
struct mem_cgroup *ret;
- bool obsolete = (root_mem->last_scanned_child &&
- root_mem->last_scanned_child->obsolete);
+ struct mem_cgroup *last_scan = root_mem->last_scanned_child;
+ bool obsolete = false;
+
+ if (last_scan) {
+ if (css_under_removal(&last_scan->css))
+ obsolete = true;
+ } else
+ obsolete = true;

/*
* Scan all children under the mem_cgroup mem
@@ -679,7 +684,7 @@ static int mem_cgroup_hierarchical_recla
next_mem = mem_cgroup_get_first_node(root_mem);

while (next_mem != root_mem) {
- if (next_mem->obsolete) {
+ if (css_under_removal(&next_mem->css)) {
mem_cgroup_put(next_mem);
cgroup_lock();
next_mem = mem_cgroup_get_first_node(root_mem);
@@ -1063,6 +1068,7 @@ int mem_cgroup_try_charge_swapin(struct
{
struct mem_cgroup *mem;
swp_entry_t ent;
+ int ret;

if (mem_cgroup_disabled())
return 0;
@@ -1081,10 +1087,18 @@ int mem_cgroup_try_charge_swapin(struct
ent.val = page_private(page);

mem = lookup_swap_cgroup(ent);
- if (!mem || mem->obsolete)
+ /*
+ * Because we can't assume "mem" is alive now, use tryget() and
+ * drop extra count later
+ */
+ if (!mem || !css_tryget(&mem->css))
goto charge_cur_mm;
*ptr = mem;
- return __mem_cgroup_try_charge(NULL, mask, ptr, true);
+ ret = __mem_cgroup_try_charge(NULL, mask, ptr, true);
+ /* drop extra count */
+ css_put(&mem->css);
+
+ return ret;
charge_cur_mm:
if (unlikely(!mm))
mm = &init_mm;
@@ -1115,14 +1129,16 @@ int mem_cgroup_cache_charge_swapin(struc
ent.val = page_private(page);
if (do_swap_account) {
mem = lookup_swap_cgroup(ent);
- if (mem && mem->obsolete)
+ if (mem && !css_tryget(&mem->css))
mem = NULL;
if (mem)
mm = NULL;
}
ret = mem_cgroup_charge_common(page, mm, mask,
MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
-
+ /* drop extra ref */
+ if (mem)
+ css_put(&mem->css);
if (!ret && do_swap_account) {
/* avoid double counting */
mem = swap_cgroup_record(ent, NULL);
@@ -2065,8 +2081,8 @@ static struct mem_cgroup *mem_cgroup_all
* the number of reference from swap_cgroup and free mem_cgroup when
* it goes down to 0.
*
- * When mem_cgroup is destroyed, mem->obsolete will be set to 0 and
- * entry which points to this memcg will be ignore at swapin.
+ * When mem_cgroup is destroyed, css_under_removal() is true and entry which
+ * points to this memcg will be ignore at swapin.
*
* Removal of cgroup itself succeeds regardless of refs from swap.
*/
@@ -2096,7 +2112,7 @@ static void mem_cgroup_get(struct mem_cg
static void mem_cgroup_put(struct mem_cgroup *mem)
{
if (atomic_dec_and_test(&mem->refcnt)) {
- if (!mem->obsolete)
+ if (!css_under_removal(&mem->css))
return;
mem_cgroup_free(mem);
}
@@ -2163,7 +2179,7 @@ static void mem_cgroup_pre_destroy(struc
struct cgroup *cont)
{
struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
- mem->obsolete = 1;
+ /* dentry's mutex makes this safe. */
mem_cgroup_force_empty(mem, false);
}

2008-12-03 05:13:53

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [Experimental][PATCH 20/21] cgroup_id.patch

still under development.
==
patch for Cgroup ID and hierarchy code.

This patch tries to assign a ID to each cgroup. Attach unique ID to each
cgroup and provides following functions.

- cgroup_lookup(id)
returns the struct cgroup for the given id.
- cgroup_get_next(id, rootid, depth, foundid)
returns the next cgroup under "root" by scanning a bitmap (not by tree-walk).
- cgroup_id_getref()/cgroup_id_putref()
used when a subsystem wants to prevent reuse of an ID.
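
A typical lookup by ID would look roughly like this (just a sketch, not code
taken from this patch; the caller is assumed to store only the integer id):

	struct cgroup *cgrp;

	rcu_read_lock();
	cgrp = cgroup_lookup(id);
	if (cgrp) {
		/* cgrp is only guaranteed to stay valid inside this RCU
		 * section; take a reference (e.g. css_tryget() on the
		 * subsystem state) before using it after unlock. */
	}
	rcu_read_unlock();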

There are several reasons to develop this.

- While trying to implement hierarchy in the memory cgroup, we have to
implement "walk under hierarchy" code.
Now it consists of cgroup_lock and tree up-down code. Because the
memory cgroup has to do the hierarchy walk in other places too, for more
intelligent processing, we'll reuse the "walk" code.
But taking "cgroup_lock" while walking the tree can cause deadlocks.
An easier way is helpful.

- SwapCgroup uses an array of "pointer" to record the owner of swap entries.
With an ID, we can reduce this to "short" or "int". This means the ID is
useful for reducing space consumption compared to a pointer, if the access
cost is not a problem.
(I hear bio-cgroup will use the same kind of...)

Example) OOM-Killer under hierarchy.
	do {
		rcu_read_lock();
		next = cgroup_get_next(id, rootid, depth, &foundid);
		/* check sanity of next here */
		css_tryget();
		rcu_read_unlock();
		if (!next)
			break;
		cgroup_scan_tasks(select_bad_process?);
		/* record score here...*/
	} while (1);


Characteristics:
- Each cgroup gets a new ID when created.
- A cgroup ID contains "ID", "Depth in tree" and a hierarchy code.
- The hierarchy code is an array of the IDs of the ancestors.
- ID 0 is an UNUSED ID.

Considerations:
- I'd like to use "short" for cgroup_id to save space...
- Is MAX_DEPTH too small? (making this depend on a boot option is easy.)
TODO:
- Documentation.

Changelog (v1) -> (v2):
- Design change: show only ID(integer) to outside of cgroup.c
- moved cgroup ID definition from include/ to kernel/cgroup.c
- struct cgroup_id is freed by RCU.
- changed interface from pointer to "int"
- kill_sb() is handled.
- ID 0 as unused ID.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

include/linux/cgroup.h | 28 ++++-
include/linux/idr.h | 1
kernel/cgroup.c | 272 ++++++++++++++++++++++++++++++++++++++++++++++++-
lib/idr.c | 46 ++++++++
4 files changed, 342 insertions(+), 5 deletions(-)

Index: mmotm-2.6.28-Dec02/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.28-Dec02.orig/include/linux/cgroup.h
+++ mmotm-2.6.28-Dec02/include/linux/cgroup.h
@@ -22,6 +22,7 @@ struct cgroupfs_root;
struct cgroup_subsys;
struct inode;
struct cgroup;
+struct cgroup_id;

extern int cgroup_init_early(void);
extern int cgroup_init(void);
@@ -63,6 +64,12 @@ struct cgroup_subsys_state {
unsigned long flags;
};

+/*
+ * Cgroup ID for *internal* identification and lookup. For user-land,"path"
+ * of cgroup works well.
+ */
+#define MAX_CGROUP_DEPTH (10)
+
/* bits in struct cgroup_subsys_state flags field */
enum {
CSS_ROOT, /* This CSS is the root of the subsystem */
@@ -162,6 +169,9 @@ struct cgroup {
int pids_use_count;
/* Length of the current tasks_pids array */
int pids_length;
+
+ /* Cgroup ID */
+ struct cgroup_id *id;
};

/* A css_set is a structure holding pointers to a set of
@@ -346,7 +356,6 @@ struct cgroup_subsys {
struct cgroup *cgrp);
void (*post_clone)(struct cgroup_subsys *ss, struct cgroup *cgrp);
void (*bind)(struct cgroup_subsys *ss, struct cgroup *root);
-
int subsys_id;
int active;
int disabled;
@@ -410,6 +419,23 @@ void cgroup_iter_end(struct cgroup *cgrp
int cgroup_scan_tasks(struct cgroup_scanner *scan);
int cgroup_attach_task(struct cgroup *, struct task_struct *);

+/*
+ * For supporting cgroup lookup and hierarchy management.
+ * Giving Flat view of cgroup hierarchy rather than tree.
+ */
+/* An interface for usual lookup */
+struct cgroup *cgroup_lookup(int id);
+/* get next cgroup under tree (for scan) */
+struct cgroup *
+cgroup_get_next(int id, int rootid, int depth, int *foundid);
+/* get id and depth of cgroup */
+int cgroup_id(struct cgroup *cgroup);
+int cgroup_depth(struct cgroup *cgroup);
+/* For delayed freeing of IDs */
+void cgroup_id_getref(int id);
+void cgroup_id_putref(int id);
+bool cgroup_id_is_obsolete(int id);
+
#else /* !CONFIG_CGROUPS */

static inline int cgroup_init_early(void) { return 0; }
Index: mmotm-2.6.28-Dec02/kernel/cgroup.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/kernel/cgroup.c
+++ mmotm-2.6.28-Dec02/kernel/cgroup.c
@@ -46,7 +46,7 @@
#include <linux/cgroupstats.h>
#include <linux/hash.h>
#include <linux/namei.h>
-
+#include <linux/idr.h>
#include <asm/atomic.h>

static DEFINE_MUTEX(cgroup_mutex);
@@ -545,6 +545,253 @@ void cgroup_unlock(void)
}

/*
+ * CGROUP ID
+ */
+struct cgroup_id {
+ struct cgroup *myself;
+ unsigned int id;
+ unsigned int depth;
+ atomic_t refcnt;
+ struct rcu_head rcu_head;
+ unsigned int hierarchy_code[MAX_CGROUP_DEPTH];
+};
+
+void free_cgroupid_cb(struct rcu_head *head)
+{
+ struct cgroup_id *id;
+
+ id = container_of(head, struct cgroup_id, rcu_head);
+ kfree(id);
+}
+
+void free_cgroupid(struct cgroup_id *id)
+{
+ call_rcu(&id->rcu_head, free_cgroupid_cb);
+}
+
+/*
+ * Cgroup ID and lookup functions.
+ * cgid->myself pointer is safe under rcu_read_lock() because d_put() of
+ * cgroup, which finally frees cgroup pointer, uses rcu_synchronize().
+ */
+static DEFINE_IDR(cgroup_idr);
+DEFINE_SPINLOCK(cgroup_idr_lock);
+
+static int cgrouproot_setup_idr(struct cgroupfs_root *root)
+{
+ struct cgroup_id *newid;
+ int err = -ENOMEM;
+ int myid;
+
+ newid = kzalloc(sizeof(*newid), GFP_KERNEL);
+ if (!newid)
+ goto out;
+ if (!idr_pre_get(&cgroup_idr, GFP_KERNEL))
+ goto free_out;
+
+ spin_lock_irq(&cgroup_idr_lock);
+ err = idr_get_new_above(&cgroup_idr, newid, 1, &myid);
+ spin_unlock_irq(&cgroup_idr_lock);
+
+ /* This one is new idr....*/
+ BUG_ON(err);
+ newid->id = myid;
+ newid->depth = 0;
+ newid->hierarchy_code[0] = myid;
+ atomic_set(&newid->refcnt, 1);
+ rcu_assign_pointer(newid->myself, &root->top_cgroup);
+ root->top_cgroup.id = newid;
+ return 0;
+
+free_out:
+ kfree(newid);
+out:
+ return err;
+}
+
+/*
+ * should be called while "cgrp" is valid.
+ */
+int cgroup_id(struct cgroup *cgrp)
+{
+ if (cgrp->id)
+ return cgrp->id->id;
+ return 0;
+}
+
+int cgroup_depth(struct cgroup *cgrp)
+{
+ if (cgrp->id)
+ return cgrp->id->depth;
+ return 0;
+}
+
+static int cgroup_prepare_id(struct cgroup *parent, struct cgroup_id **id)
+{
+ struct cgroup_id *newid;
+ int myid, error;
+
+ /* check depth */
+ if (parent->id->depth + 1 >= MAX_CGROUP_DEPTH)
+ return -ENOSPC;
+ newid = kzalloc(sizeof(*newid), GFP_KERNEL);
+ if (!newid)
+ return -ENOMEM;
+ /* get id */
+ if (unlikely(!idr_pre_get(&cgroup_idr, GFP_KERNEL))) {
+ error = -ENOMEM;
+ goto err_out;
+ }
+ spin_lock_irq(&cgroup_idr_lock);
+ /* Don't use 0 */
+ error = idr_get_new_above(&cgroup_idr, newid, 1, &myid);
+ spin_unlock_irq(&cgroup_idr_lock);
+ if (error)
+ goto err_out;
+
+ newid->id = myid;
+ atomic_set(&newid->refcnt, 1);
+ *id = newid;
+ return 0;
+err_out:
+ kfree(newid);
+ return error;
+}
+
+
+static void cgroup_id_attach(struct cgroup_id *cgid,
+ struct cgroup *cg, struct cgroup *parent)
+{
+ struct cgroup_id *parent_id = parent->id;
+ int i;
+
+ cgid->depth = parent_id->depth + 1;
+ /* Inherit hierarchy code from parent */
+ for (i = 0; i < cgid->depth; i++) {
+ cgid->hierarchy_code[i] =
+ parent_id->hierarchy_code[i];
+ cgid->hierarchy_code[cgid->depth] = cgid->id;
+ }
+ rcu_assign_pointer(cgid->myself, cg);
+ cg->id = cgid;
+
+ return;
+}
+static void cgroup_id_put(int id)
+{
+ struct cgroup_id *cgid;
+ unsigned long flags;
+
+ rcu_read_lock();
+ cgid = idr_find(&cgroup_idr, id);
+ BUG_ON(!cgid);
+ if (atomic_dec_and_test(&cgid->refcnt)) {
+ spin_lock_irqsave(&cgroup_idr_lock, flags);
+ idr_remove(&cgroup_idr, cgid->id);
+ spin_unlock_irqrestore(&cgroup_idr_lock, flags);
+ free_cgroupid(cgid);
+ }
+ rcu_read_unlock();
+}
+
+static void cgroup_id_detach(struct cgroup *cg)
+{
+ rcu_assign_pointer(cg->id->myself, NULL);
+ cgroup_id_put(cg->id->id);
+}
+
+void cgroup_id_getref(int id)
+{
+ struct cgroup_id *cgid;
+
+ rcu_read_lock();
+ cgid = idr_find(&cgroup_idr, id);
+ if (cgid)
+ atomic_inc(&cgid->refcnt);
+ rcu_read_unlock();
+}
+
+void cgroup_id_putref(int id)
+{
+ cgroup_id_put(id);
+}
+/**
+ * cgroup_lookup - lookup cgroup by id
+ * @id: the id of cgroup to be looked up
+ *
+ * Returns pointer to cgroup if there is valid cgroup with id, NULL if not.
+ * Should be called under rcu_read_lock() or cgroup_lock.
+ * If subsys is not used, returns NULL.
+ */
+
+struct cgroup *cgroup_lookup(int id)
+{
+ struct cgroup *cgrp = NULL;
+ struct cgroup_id *cgid = NULL;
+
+ rcu_read_lock();
+ cgid = idr_find(&cgroup_idr, id);
+
+ if (unlikely(!cgid))
+ goto out;
+
+ cgrp = rcu_dereference(cgid->myself);
+ if (unlikely(!cgrp || cgroup_is_removed(cgrp)))
+ cgrp = NULL;
+out:
+ rcu_read_unlock();
+ return cgrp;
+}
+
+/**
+ * cgroup_get_next - lookup next cgroup under specified hierarchy.
+ * @id: current position of iteration.
+ * @rootid: search tree under this.
+ * @depth: depth of root id.
+ * @foundid: position of found object.
+ *
+ * Search for the next cgroup under the specified hierarchy, starting from
+ * @id. Calling this under rcu_read_lock() or cgroup_lock() is necessary
+ * (to access the found cgroup).
+ * If the subsys is not used, returns NULL. If used, it's guaranteed that
+ * there is a used cgroup ID (the root).
+ */
+struct cgroup *
+cgroup_get_next(int id, int rootid, int depth, int *foundid)
+{
+ struct cgroup *ret = NULL;
+ struct cgroup_id *tmp;
+ int tmpid;
+ unsigned long flags;
+
+ rcu_read_lock();
+ tmpid = id;
+ while (1) {
+ /* scan next entry from bitmap(tree) */
+ spin_lock_irqsave(&cgroup_idr_lock, flags);
+ tmp = idr_get_next(&cgroup_idr, &tmpid);
+ spin_unlock_irqrestore(&cgroup_idr_lock, flags);
+
+ if (!tmp) {
+ ret = NULL;
+ break;
+ }
+
+ if (tmp->hierarchy_code[depth] == rootid) {
+ ret = rcu_dereference(tmp->myself);
+ /* Sanity check and check hierarchy */
+ if (ret && !cgroup_is_removed(ret))
+ break;
+ }
+ tmpid = tmpid + 1;
+ }
+
+ rcu_read_unlock();
+ *foundid = tmpid;
+ return ret;
+}
+
+/*
* A couple of forward declarations required, due to cyclic reference loop:
* cgroup_mkdir -> cgroup_create -> cgroup_populate_dir ->
* cgroup_add_file -> cgroup_create_file -> cgroup_dir_inode_operations
@@ -1039,6 +1286,13 @@ static int cgroup_get_sb(struct file_sys
mutex_unlock(&inode->i_mutex);
goto drop_new_super;
}
+ /* Setup Cgroup ID for this fs */
+ ret = cgrouproot_setup_idr(root);
+ if (ret) {
+ mutex_unlock(&cgroup_mutex);
+ mutex_unlock(&inode->i_mutex);
+ goto drop_new_super;
+ }

ret = rebind_subsystems(root, root->subsys_bits);
if (ret == -EBUSY) {
@@ -1125,9 +1379,10 @@ static void cgroup_kill_sb(struct super_

list_del(&root->root_list);
root_count--;
-
+ if (root->top_cgroup.id)
+ cgroup_id_detach(&root->top_cgroup);
mutex_unlock(&cgroup_mutex);
-
+ synchronize_rcu();
kfree(root);
kill_litter_super(sb);
}
@@ -2360,11 +2615,18 @@ static long cgroup_create(struct cgroup
int err = 0;
struct cgroup_subsys *ss;
struct super_block *sb = root->sb;
+ struct cgroup_id *cgid = NULL;

cgrp = kzalloc(sizeof(*cgrp), GFP_KERNEL);
if (!cgrp)
return -ENOMEM;

+ err = cgroup_prepare_id(parent, &cgid);
+ if (err) {
+ kfree(cgrp);
+ return err;
+ }
+
/* Grab a reference on the superblock so the hierarchy doesn't
* get deleted on unmount if there are child cgroups. This
* can be done outside cgroup_mutex, since the sb can't
@@ -2404,7 +2666,7 @@ static long cgroup_create(struct cgroup

err = cgroup_populate_dir(cgrp);
/* If err < 0, we have a half-filled directory - oh well ;) */
-
+ cgroup_id_attach(cgid, cgrp, parent);
mutex_unlock(&cgroup_mutex);
mutex_unlock(&cgrp->dentry->d_inode->i_mutex);

@@ -2512,6 +2774,8 @@ static int cgroup_rmdir(struct inode *un
return -EBUSY;
}

+ cgroup_id_detach(cgrp);
+
spin_lock(&release_list_lock);
set_bit(CGRP_REMOVED, &cgrp->flags);

Index: mmotm-2.6.28-Dec02/include/linux/idr.h
===================================================================
--- mmotm-2.6.28-Dec02.orig/include/linux/idr.h
+++ mmotm-2.6.28-Dec02/include/linux/idr.h
@@ -106,6 +106,7 @@ int idr_get_new(struct idr *idp, void *p
int idr_get_new_above(struct idr *idp, void *ptr, int starting_id, int *id);
int idr_for_each(struct idr *idp,
int (*fn)(int id, void *p, void *data), void *data);
+void *idr_get_next(struct idr *idp, int *nextid);
void *idr_replace(struct idr *idp, void *ptr, int id);
void idr_remove(struct idr *idp, int id);
void idr_remove_all(struct idr *idp);
Index: mmotm-2.6.28-Dec02/lib/idr.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/lib/idr.c
+++ mmotm-2.6.28-Dec02/lib/idr.c
@@ -573,6 +573,52 @@ int idr_for_each(struct idr *idp,
EXPORT_SYMBOL(idr_for_each);

/**
+ * idr_get_next - lookup the next object after the given id
+ * @idp: idr handle
+ * @nextidp: pointer to the lookup key; updated to the id that was found
+ *
+ * Returns a pointer to the registered object whose id is the smallest one
+ * not less than *nextidp, or NULL if there is none.
+ */
+
+void *idr_get_next(struct idr *idp, int *nextidp)
+{
+ struct idr_layer *p, *pa[MAX_LEVEL];
+ struct idr_layer **paa = &pa[0];
+ int id = *nextidp;
+ int n, max;
+
+ /* find first ent */
+ n = idp->layers * IDR_BITS;
+ max = 1 << n;
+ p = rcu_dereference(idp->top);
+ if (!p)
+ return NULL;
+
+ while (id < max) {
+ while (n > 0 && p) {
+ n -= IDR_BITS;
+ *paa++ = p;
+ p = rcu_dereference(p->ary[(id >> n) & IDR_MASK]);
+ }
+
+ if (p) {
+ *nextidp = id;
+ return p;
+ }
+
+ id += 1 << n;
+ while (n < fls(id)) {
+ n += IDR_BITS;
+ p = *--paa;
+ }
+ }
+ return NULL;
+}
+
+
+
+/**
* idr_replace - replace pointer for given id
* @idp: idr handle
* @ptr: pointer you want associated with the id

2008-12-03 05:15:27

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [Experimental][PATCH 21/21] memcg-new-hierarchical-reclaim.patch

Implement hierarchical reclaim by cgroup ID.

What changes:
- reclaim is no longer done by a tree-walk algorithm
- mem_cgroup->last_scanned_child is an ID, not a pointer.
- no cgroup_lock.
- scanning order is simply defined by ID order.
(Scan by round-robin logic.)

Changelog: v1 -> v2
- make use of css_tryget();
- count # of loops rather than remembering position.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>


mm/memcontrol.c | 214 +++++++++++++++++++-------------------------------------
1 file changed, 75 insertions(+), 139 deletions(-)

Index: mmotm-2.6.28-Dec02/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec02.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec02/mm/memcontrol.c
@@ -153,9 +153,10 @@ struct mem_cgroup {

/*
* While reclaiming in a hiearchy, we cache the last child we
- * reclaimed from. Protected by cgroup_lock()
+ * reclaimed from.
*/
- struct mem_cgroup *last_scanned_child;
+ int last_scanned_child;
+ unsigned long scan_age;
/*
* Should the accounting and control be hierarchical, per subtree?
*/
@@ -521,108 +522,72 @@ unsigned long mem_cgroup_isolate_pages(u
return nr_taken;
}

-#define mem_cgroup_from_res_counter(counter, member) \
- container_of(counter, struct mem_cgroup, member)
-
-/*
- * This routine finds the DFS walk successor. This routine should be
- * called with cgroup_mutex held
- */
-static struct mem_cgroup *
-mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
+static unsigned int get_swappiness(struct mem_cgroup *memcg)
{
- struct cgroup *cgroup, *curr_cgroup, *root_cgroup;
-
- curr_cgroup = curr->css.cgroup;
- root_cgroup = root_mem->css.cgroup;
-
- if (!list_empty(&curr_cgroup->children)) {
- /*
- * Walk down to children
- */
- mem_cgroup_put(curr);
- cgroup = list_entry(curr_cgroup->children.next,
- struct cgroup, sibling);
- curr = mem_cgroup_from_cont(cgroup);
- mem_cgroup_get(curr);
- goto done;
- }
-
-visit_parent:
- if (curr_cgroup == root_cgroup) {
- mem_cgroup_put(curr);
- curr = root_mem;
- mem_cgroup_get(curr);
- goto done;
- }
+ struct cgroup *cgrp = memcg->css.cgroup;
+ unsigned int swappiness;

- /*
- * Goto next sibling
- */
- if (curr_cgroup->sibling.next != &curr_cgroup->parent->children) {
- mem_cgroup_put(curr);
- cgroup = list_entry(curr_cgroup->sibling.next, struct cgroup,
- sibling);
- curr = mem_cgroup_from_cont(cgroup);
- mem_cgroup_get(curr);
- goto done;
- }
+ /* root ? */
+ if (cgrp->parent == NULL)
+ return vm_swappiness;

- /*
- * Go up to next parent and next parent's sibling if need be
- */
- curr_cgroup = curr_cgroup->parent;
- goto visit_parent;
+ spin_lock(&memcg->reclaim_param_lock);
+ swappiness = memcg->swappiness;
+ spin_unlock(&memcg->reclaim_param_lock);

-done:
- root_mem->last_scanned_child = curr;
- return curr;
+ return swappiness;
}

+#define mem_cgroup_from_res_counter(counter, member) \
+ container_of(counter, struct mem_cgroup, member)
+
/*
- * Visit the first child (need not be the first child as per the ordering
- * of the cgroup list, since we track last_scanned_child) of @mem and use
- * that to reclaim free pages from.
+ * This routine select next memcg by ID. Using RCU and tryget().
+ * No cgroup_mutex is required.
*/
static struct mem_cgroup *
-mem_cgroup_get_first_node(struct mem_cgroup *root_mem)
+mem_cgroup_select_victim(struct mem_cgroup *root_mem)
{
- struct cgroup *cgroup;
+ struct cgroup *cgroup, *root_cgroup;
struct mem_cgroup *ret;
- struct mem_cgroup *last_scan = root_mem->last_scanned_child;
- bool obsolete = false;
+ int nextid, rootid, depth, found;

- if (last_scan) {
- if (css_under_removal(&last_scan->css))
- obsolete = true;
- } else
- obsolete = true;
+ root_cgroup = root_mem->css.cgroup;
+ rootid = cgroup_id(root_cgroup);
+ depth = cgroup_depth(root_cgroup);
+ found = 0;
+ ret = NULL;

- /*
- * Scan all children under the mem_cgroup mem
- */
- cgroup_lock();
- if (list_empty(&root_mem->css.cgroup->children)) {
- ret = root_mem;
- goto done;
+ rcu_read_lock();
+ if (!root_mem->use_hierarchy) {
+ spin_lock(&root_mem->reclaim_param_lock);
+ root_mem->scan_age++;
+ spin_unlock(&root_mem->reclaim_param_lock);
+ css_get(&root_mem->css);
+ goto out;
}

- if (!root_mem->last_scanned_child || obsolete) {
-
- if (obsolete)
- mem_cgroup_put(root_mem->last_scanned_child);
-
- cgroup = list_first_entry(&root_mem->css.cgroup->children,
- struct cgroup, sibling);
- ret = mem_cgroup_from_cont(cgroup);
- mem_cgroup_get(ret);
- } else
- ret = mem_cgroup_get_next_node(root_mem->last_scanned_child,
- root_mem);
+ while (!ret) {
+ /* ID:0 is not used by cgroup-id */
+ nextid = root_mem->last_scanned_child + 1;
+ cgroup = cgroup_get_next(nextid, rootid, depth, &found);
+ if (cgroup) {
+ spin_lock(&root_mem->reclaim_param_lock);
+ root_mem->last_scanned_child = found;
+ spin_unlock(&root_mem->reclaim_param_lock);
+ ret = mem_cgroup_from_cont(cgroup);
+ if (!css_tryget(&ret->css))
+ ret = NULL;
+ } else {
+ spin_lock(&root_mem->reclaim_param_lock);
+ root_mem->scan_age++;
+ root_mem->last_scanned_child = 0;
+ spin_unlock(&root_mem->reclaim_param_lock);
+ }
+ }
+out:
+ rcu_read_unlock();

-done:
- root_mem->last_scanned_child = ret;
- cgroup_unlock();
return ret;
}

@@ -638,67 +603,34 @@ static bool mem_cgroup_check_under_limit
return false;
}

-static unsigned int get_swappiness(struct mem_cgroup *memcg)
-{
- struct cgroup *cgrp = memcg->css.cgroup;
- unsigned int swappiness;
-
- /* root ? */
- if (cgrp->parent == NULL)
- return vm_swappiness;
-
- spin_lock(&memcg->reclaim_param_lock);
- swappiness = memcg->swappiness;
- spin_unlock(&memcg->reclaim_param_lock);
-
- return swappiness;
-}

/*
- * Dance down the hierarchy if needed to reclaim memory. We remember the
- * last child we reclaimed from, so that we don't end up penalizing
- * one child extensively based on its position in the children list.
- *
* root_mem is the original ancestor that we've been reclaim from.
*/
static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
gfp_t gfp_mask, bool noswap)
{
- struct mem_cgroup *next_mem;
+ struct mem_cgroup *victim;
+ unsigned long start_age;
int ret = 0;
+ int total = 0;

- /*
- * Reclaim unconditionally and don't check for return value.
- * We need to reclaim in the current group and down the tree.
- * One might think about checking for children before reclaiming,
- * but there might be left over accounting, even after children
- * have left.
- */
- ret = try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
- get_swappiness(root_mem));
- if (mem_cgroup_check_under_limit(root_mem))
- return 0;
- if (!root_mem->use_hierarchy)
- return ret;
-
- next_mem = mem_cgroup_get_first_node(root_mem);
-
- while (next_mem != root_mem) {
- if (css_under_removal(&next_mem->css)) {
- mem_cgroup_put(next_mem);
- cgroup_lock();
- next_mem = mem_cgroup_get_first_node(root_mem);
- cgroup_unlock();
- continue;
- }
- ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
- get_swappiness(next_mem));
+ start_age = root_mem->scan_age;
+ /* allows 2 times of loops */
+ while (time_after((start_age + 2UL), root_mem->scan_age)) {
+ victim = mem_cgroup_select_victim(root_mem);
+ ret = try_to_free_mem_cgroup_pages(victim,
+ gfp_mask, noswap, get_swappiness(victim));
+ css_put(&victim->css);
if (mem_cgroup_check_under_limit(root_mem))
- return 0;
- cgroup_lock();
- next_mem = mem_cgroup_get_next_node(next_mem, root_mem);
- cgroup_unlock();
+ return 1;
+ total += ret;
}
+
+ ret = total;
+ if (mem_cgroup_check_under_limit(root_mem))
+ ret = 1;
+
return ret;
}

@@ -787,6 +719,8 @@ static int __mem_cgroup_try_charge(struc

ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
noswap);
+ if (ret)
+ continue;

/*
* try_to_free_mem_cgroup_pages() might not give us a full
@@ -2161,7 +2095,8 @@ mem_cgroup_create(struct cgroup_subsys *
res_counter_init(&mem->memsw, NULL);
}
mem_cgroup_set_inactive_ratio(mem);
- mem->last_scanned_child = NULL;
+ mem->last_scanned_child = 0;
+ mem->scan_age = 0;
spin_lock_init(&mem->reclaim_param_lock);

if (parent)

2008-12-03 05:16:36

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 22/21] memcg-explain-details-and-test-document.patch

Just passed a spell check. Sorry for 22/21.

==
Documentation for implementation details and how to test.

Just an example; feel free to modify, add, or remove lines.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Documentation/controllers/memcg_test.txt | 145 +++++++++++++++++++++++++++++++
1 file changed, 145 insertions(+)

Index: mmotm-2.6.28-Dec02/Documentation/controllers/memcg_test.txt
===================================================================
--- /dev/null
+++ mmotm-2.6.28-Dec02/Documentation/controllers/memcg_test.txt
@@ -0,0 +1,145 @@
+Memory Resource Controller(Memcg) Implementation Memo.
+Last Updated: 2008/12/03
+
+Because the VM is getting complex (one of the reasons is memcg...), memcg's
+behavior is complex too. This is a document about memcg's internal behavior
+and about test patterns which tend to be racy.
+
+1. charges
+
+ a page/swp_entry may be charged (usage += PAGE_SIZE) at
+
+ mem_cgroup_newpage_charge()
+ called at new page fault and COW.
+
+ mem_cgroup_try_charge_swapin()
+ called at do_swap_page() and swapoff.
+ followed by charge-commit-cancel protocol.
+ (With swap accounting) at commit, a charge recorded in swap is removed.
+
+ mem_cgroup_cache_charge()
+ called at add_to_page_cache()
+
+ mem_cgroup_cache_charge_swapin()
+ called by shmem's swapin processing.
+
+ mem_cgroup_prepare_migration()
+ called before migration. "extra" charge is done
+ followed by charge-commit-cancel protocol.
+ At commit, charge against oldpage or newpage will be committed.
+
+2. uncharge
+ a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
+
+ mem_cgroup_uncharge_page()
+ called when an anonymous page is unmapped. If the page is SwapCache
+ uncharge is delayed until mem_cgroup_uncharge_swapcache().
+
+ mem_cgroup_uncharge_cache_page()
+ called when a page-cache is deleted from radix-tree. If the page is
+ SwapCache, uncharge is delayed until mem_cgroup_uncharge_swapcache()
+
+ mem_cgroup_uncharge_swapcache()
+ called when SwapCache is removed from radix-tree. The charge itself
+ is moved to swap_cgroup. (If mem+swap controller is disabled, no
+ charge to swap.)
+
+ mem_cgroup_uncharge_swap()
+ called when a swp_entry's refcnt goes down to 0. The charge against swap
+ disappears.
+
+ mem_cgroup_end_migration(old, new)
+ at success of migration -> old is uncharged (if necessary), the charge
+ to new is committed. At failure, the charge to old is committed.
+
+3. charge-commit-cancel
+ In some cases, we can't know whether a "charge" is valid or not at charge time.
+ To handle such cases, there are charge-commit-cancel functions.
+ mem_cgroup_try_charge_XXX
+ mem_cgroup_commit_charge_XXX
+ mem_cgroup_cancel_charge_XXX
+ These are used in swap-in and migration.
+
+ At try_charge(), there is no flag yet to say "this page is charged";
+ at this point, usage += PAGE_SIZE.
+
+ At commit(), the function checks whether the page should be charged or not
+ and sets flags, or avoids charging (usage -= PAGE_SIZE).
+
+ At cancel(), simply usage -= PAGE_SIZE.
+
+4. Typical Tests.
+
+ Tests for racy cases.
+
+ 4.1 small limit to memcg.
+ When you test racy cases, it's a good idea to set memcg's limit
+ to be very small rather than GB. Many races have been found in tests under
+ xKB or xxMB limits.
+ (Memory behavior under GB limits and memory behavior under MB limits show
+ very different situations.)
+
+ 4.2 shmem
+ Historically, memcg's shmem handling was poor and we saw some amount
+ of trouble here. This is because shmem is page-cache but can be
+ SwapCache. Testing with shmem/tmpfs is always a good test.
+
+ 4.3 migration
+ For NUMA, migration is another special case. cpuset is useful for an
+ easy test. The following is a sample script to set up migration.
+
+ mount -t cgroup -o cpuset none /opt/cpuset
+
+ mkdir /opt/cpuset/01
+ echo 1 > /opt/cpuset/01/cpuset.cpus
+ echo 0 > /opt/cpuset/01/cpuset.mems
+ echo 1 > /opt/cpuset/01/cpuset.memory_migrate
+ mkdir /opt/cpuset/02
+ echo 1 > /opt/cpuset/02/cpuset.cpus
+ echo 1 > /opt/cpuset/02/cpuset.mems
+ echo 1 > /opt/cpuset/02/cpuset.memory_migrate
+
+ In the above setup, when you move a task from 01 to 02, page migration
+ from node 0 to node 1 will occur. The following is a script to migrate
+ all tasks under a cpuset.
+ --
+ move_task()
+ {
+ for pid in $1
+ do
+ /bin/echo $pid >$2/tasks 2>/dev/null
+ echo -n $pid
+ echo -n " "
+ done
+ echo END
+ }
+
+ G1_TASK=`cat ${G1}/tasks`
+ G2_TASK=`cat ${G2}/tasks`
+ move_task "${G1_TASK}" ${G2} &
+ --
+ 4.4 memory hotplug.
+ A memory hotplug test is another good test.
+ To offline memory, do the following:
+ # echo offline > /sys/devices/system/memory/memoryXXX/state
+ (XXX is the place of memory)
+ This is an easy way to test page migration, too.
+
+ 4.5 mkdir/rmdir
+ When using hierarchy, mkdir/rmdir tests should be done.
+ Tests like the following:
+
+ #echo 1 >/opt/cgroup/01/memory/use_hierarchy
+ #mkdir /opt/cgroup/01/child_a
+ #mkdir /opt/cgroup/01/child_b
+
+ set limit to 01.
+ add limit to 01/child_b
+ run jobs under child_a and child_b
+
+ create/delete the following groups at random while jobs are running.
+ /opt/cgroup/01/child_a/child_aa
+ /opt/cgroup/01/child_b/child_bb
+ /opt/cgroup/01/child_c
+
+ running new jobs in a new group is also a good test.
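
P.S. For readers of section 3 above, the caller side of the charge-commit-cancel
protocol looks roughly like the sketch below. It is loosely modeled on the
swap-in path; "mm" and "page" come from the caller's context, and
do_the_real_work() is just a placeholder, not a real function.

	struct mem_cgroup *ptr = NULL;
	int err;

	err = mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr);
	if (err)
		return err;		/* charge failed; usage was not raised */

	err = do_the_real_work(page);	/* placeholder: e.g. map the page */
	if (!err)
		/* keep the charge: set flags, or undo it internally
		 * if the page turns out to be charged already */
		mem_cgroup_commit_charge_swapin(page, ptr);
	else
		mem_cgroup_cancel_charge_swapin(ptr);	/* usage -= PAGE_SIZE */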

2008-12-03 05:21:41

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 0/21] memcg updates 2008/12/03

On Wed, 3 Dec 2008 13:47:18 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> This is memcg update series onto
> "The mm-of-the-moment snapshot 2008-12-02-17-08"
>
> including following patches. 18-21 are highly experimenal
> (so, drop CC: to Andrew)
>
> Bug fixes.
> 1. memcg-revert-gfp-mask-fix.patch
> 2. memcg-check-group-leader-fix.patch
> 3. memsw_limit_check.patch
> 4. memcg-swapout-refcnt-fix.patch
> 5. avoid-unnecessary-reclaim.patch
>
> Kosaki's LRU works. (thanks!)
> 6. inactive_anon_is_low-move-to-vmscan.patch
> 7. introduce-zone_reclaim-struct.patch
> 8. make-zone-nr_pages-helper-function.patch
> 9. make-get_scan_ratio-to-memcg-safe.patch
> 10. memcg-add-null-check-to-page_cgroup_zoneinfo.patch
> 11. memcg-make-inactive_anon_is_low.patch
> 12. memcg-make-mem_cgroup_zone_nr_pages.patch
> 13. memcg-make-zone_reclaim_stat.patch
> 14. memcg-remove-mem_cgroup_cal_reclaim.patch
> 15. memcg-show-reclaim-stat.patch
> Cleanup
> 16. memcg-rename-scan-glonal-lru.patch
> Bug fix
> 16. memcg_prev_priority_protect.patch
Double-counted here ..sigh...

If mmotm picks up too many patches for this series to still apply, I'll post it again on Friday.

BTW, Balbir, does "21" (really 22/21) meet your request?

Thanks,
-Kame

2008-12-03 05:57:17

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/21] memcg updates 2008/12/03

On Wed, 3 Dec 2008 13:47:18 +0900 KAMEZAWA Hiroyuki <[email protected]> wrote:

> This is memcg update series onto
> "The mm-of-the-moment snapshot 2008-12-02-17-08"

Complaints...

- All these patches had filenames in their Subject: lines. I turned
these all back into sensible-sounding English titles.

- I think a lot of authorships got lost. For example, the way these
patches were sent, you will be identified as the author of
inactive_anon_is_low-move-to-vmscan.patch, but I don't think you
were. So please work out the correct authorship for

memcg-revert-gfp-mask-fix.patch
memcg-check-group-leader-fix.patch
memcg-memoryswap-controller-fix-limit-check.patch
memcg-swapout-refcnt-fix.patch
memcg-hierarchy-avoid-unnecessary-reclaim.patch
inactive_anon_is_low-move-to-vmscan.patch
mm-introduce-zone_reclaim-struct.patch
mm-add-zone-nr_pages-helper-function.patch
mm-make-get_scan_ratio-safe-for-memcg.patch
memcg-add-null-check-to-page_cgroup_zoneinfo.patch
memcg-add-inactive_anon_is_low.patch
memcg-add-mem_cgroup_zone_nr_pages.patch
memcg-add-zone_reclaim_stat.patch
memcg-remove-mem_cgroup_cal_reclaim.patch
memcg-show-reclaim-stat.patch
memcg-rename-scan-global-lru.patch
memcg-protect-prev_priority.patch
memcg-swappiness.patch
memcg-explain-details-and-test-document.patch

and let me know?

- Sentences start with capital letters.

- Your patches are missing the ^--- after the changelog. This
creates additional work (and potential for mistakes) at the other
end.

- I didn't check whether any acked-by's got lost. They may have been...

2008-12-03 06:18:49

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 0/21] memcg updates 2008/12/03

On Tue, 2 Dec 2008 21:56:50 -0800
Andrew Morton <[email protected]> wrote:

> On Wed, 3 Dec 2008 13:47:18 +0900 KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > This is memcg update series onto
> > "The mm-of-the-moment snapshot 2008-12-02-17-08"
>
> Complaints...
>
> - All these patches had filenames in their Subject: lines. I turned
> these all back into sensible-sounding English titles.
>
Sorry..

> - I think a lot of authorships got lost. For example, the way these
> patches were sent, you will be identified as the author of
> inactive_anon_is_low-move-to-vmscan.patch, but I don't think you
> were. So please work out the correct authorship for
>
Sure. Some patches include modifications from me (no big changes).

> memcg-revert-gfp-mask-fix.patch
Author: KAMEZAWA Hiroyuki <[email protected]>

> memcg-check-group-leader-fix.patch
Author: Nikanth Karthikesan <[email protected]>
a bit modified by me.

> memcg-memoryswap-controller-fix-limit-check.patch
Author: Daisuke Nishimura <[email protected]>
a bit modified by me.

> memcg-swapout-refcnt-fix.patch
Author: KAMEZAWA Hiroyuki <[email protected]>

> memcg-hierarchy-avoid-unnecessary-reclaim.patch
Author: Daisuke Nishimura <[email protected]>
a bit modified by me.

> inactive_anon_is_low-move-to-vmscan.patch
Author: KOSAKI Motohiro <[email protected]>

> mm-introduce-zone_reclaim-struct.patch
Author: KOSAKI Motohiro <[email protected]>

> mm-add-zone-nr_pages-helper-function.patch
Author: KOSAKI Motohiro <[email protected]>

> mm-make-get_scan_ratio-safe-for-memcg.patch
Author: KOSAKI Motohiro <[email protected]>

> memcg-add-null-check-to-page_cgroup_zoneinfo.patch
Author: KOSAKI Motohiro <[email protected]>

> memcg-add-inactive_anon_is_low.patch
Author: KOSAKI Motohiro <[email protected]>

> memcg-add-mem_cgroup_zone_nr_pages.patch
Author: KOSAKI Motohiro <[email protected]>

> memcg-add-zone_reclaim_stat.patch
Author: KOSAKI Motohiro <[email protected]>

> memcg-remove-mem_cgroup_cal_reclaim.patch
Author: KOSAKI Motohiro <[email protected]>

> memcg-show-reclaim-stat.patch
Author: KOSAKI Motohiro <[email protected]>
a bit modified by me.

> memcg-rename-scan-global-lru.patch
Author: KAMEZAWA Hiroyuki <[email protected]>

> memcg-protect-prev_priority.patch
Author: KOSAKI Motohiro <[email protected]>

> memcg-swappiness.patch
Author: KOSAKI Motohiro <[email protected]>
fixed bug by me.

> memcg-explain-details-and-test-document.patch
Author: KAMEZAWA Hiroyuki <[email protected]>
>
> and let me know?
>
> - Sentences start with capital letters.
>
> - Your patches are missing the ^--- after the changelog. This
> creates additional work (and potential for mistakes) at the other
> end.
Will fix when I do this kind of thing again..

>
> - I didn't check whether any acked-by's got lost. They may have been...
>

AFAIK, the only Acks on the above from people other than Kamezawa, Balbir and
Nishimura are Rik van Riel's. I think I've picked them all up.


Thanks,
-Kame



2008-12-04 09:36:42

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [Experimental][PATCH 19/21] memcg-fix-pre-destroy.patch

Added CC: Paul Menage <[email protected]>

> @@ -2096,7 +2112,7 @@ static void mem_cgroup_get(struct mem_cg
> static void mem_cgroup_put(struct mem_cgroup *mem)
> {
> if (atomic_dec_and_test(&mem->refcnt)) {
> - if (!mem->obsolete)
> + if (!css_under_removal(&mem->css))
> return;
> mem_cgroup_free(mem);
> }
I don't think it's safe to check css_under_removal here w/o cgroup_lock.
(It's safe *NOW* just because memcg is the only user of css->refcnt.)

As Li said before, css_under_removal doesn't necessarily mean
this group has been destroyed, but mem_cgroup will be freed anyway.

But adding cgroup_lock/unlock here causes another deadlock,
because mem_cgroup_get_next_node calls mem_cgroup_put.

hmm.. hierarchical reclaim code will be re-written completely by [21/21],
so would it be better to change patch order or to take another approach ?


Thanks,
Daisuke Nishimura.

2008-12-04 09:44:18

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [Experimental][PATCH 19/21] memcg-fix-pre-destroy.patch

On Thu, 4 Dec 2008 18:34:28 +0900
Daisuke Nishimura <[email protected]> wrote:

> Added CC: Paul Menage <[email protected]>
>
> > @@ -2096,7 +2112,7 @@ static void mem_cgroup_get(struct mem_cg
> > static void mem_cgroup_put(struct mem_cgroup *mem)
> > {
> > if (atomic_dec_and_test(&mem->refcnt)) {
> > - if (!mem->obsolete)
> > + if (!css_under_removal(&mem->css))
> > return;
> > mem_cgroup_free(mem);
> > }
> I don't think it's safe to check css_under_removal here w/o cgroup_lock.
> (It's safe *NOW* just because memcg is the only user of css->refcnt.)
>

> As Li said before, css_under_removal doesn't necessarily mean
> this this group has been destroyed, but mem_cgroup will be freed.
>
> But adding cgroup_lock/unlock here causes another dead lock,
> because mem_cgroup_get_next_node calls mem_cgroup_put.
>
> hmm.. hierarchical reclaim code will be re-written completely by [21/21],
> so would it be better to change patch order or to take another approach ?
>
Hmm, ok.

How about this ?
==
At initialization, in mem_cgroup_create(), set memcg->refcnt to 1.

At destroy(), put this refcnt by 1.

Remove the css_under_removal(&mem->css) check.
==
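
In code, mem_cgroup_put() would then become roughly this (untested sketch):

	static void mem_cgroup_put(struct mem_cgroup *mem)
	{
		/*
		 * destroy() drops the initial reference taken in
		 * mem_cgroup_create(), so hitting zero here means the
		 * cgroup side is already gone and we can free safely.
		 */
		if (atomic_dec_and_test(&mem->refcnt))
			mem_cgroup_free(mem);
	}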

-Kame

2008-12-04 09:50:35

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [Experimental][PATCH 19/21] memcg-fix-pre-destroy.patch

On Thu, 4 Dec 2008 18:43:09 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Thu, 4 Dec 2008 18:34:28 +0900
> Daisuke Nishimura <[email protected]> wrote:
>
> > Added CC: Paul Menage <[email protected]>
> >
> > > @@ -2096,7 +2112,7 @@ static void mem_cgroup_get(struct mem_cg
> > > static void mem_cgroup_put(struct mem_cgroup *mem)
> > > {
> > > if (atomic_dec_and_test(&mem->refcnt)) {
> > > - if (!mem->obsolete)
> > > + if (!css_under_removal(&mem->css))
> > > return;
> > > mem_cgroup_free(mem);
> > > }
> > I don't think it's safe to check css_under_removal here w/o cgroup_lock.
> > (It's safe *NOW* just because memcg is the only user of css->refcnt.)
> >
>
> > As Li said before, css_under_removal doesn't necessarily mean
> > this this group has been destroyed, but mem_cgroup will be freed.
> >
> > But adding cgroup_lock/unlock here causes another dead lock,
> > because mem_cgroup_get_next_node calls mem_cgroup_put.
> >
> > hmm.. hierarchical reclaim code will be re-written completely by [21/21],
> > so would it be better to change patch order or to take another approach ?
> >
> Hmm, ok.
>
> How about this ?
> ==
> At initlization, mem_cgroup_create(), set memcg->refcnt to be 1.
>
> At destroy(), put this refcnt by 1.
>
> remove css_under_removal(&mem->css) check.
> ==
Ah, anyway, I'll remove mem->refcnt when swap-cgroup uses this ID.
I'll use refcnt-to-ID rather than this.

Thanks,
-Kame



2008-12-04 10:43:24

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [Experimental][PATCH 19/21] memcg-fix-pre-destroy.patch

On Thu, 4 Dec 2008 18:43:09 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> On Thu, 4 Dec 2008 18:34:28 +0900
> Daisuke Nishimura <[email protected]> wrote:
>
> > Added CC: Paul Menage <[email protected]>
> >
> > > @@ -2096,7 +2112,7 @@ static void mem_cgroup_get(struct mem_cg
> > > static void mem_cgroup_put(struct mem_cgroup *mem)
> > > {
> > > if (atomic_dec_and_test(&mem->refcnt)) {
> > > - if (!mem->obsolete)
> > > + if (!css_under_removal(&mem->css))
> > > return;
> > > mem_cgroup_free(mem);
> > > }
> > I don't think it's safe to check css_under_removal here w/o cgroup_lock.
> > (It's safe *NOW* just because memcg is the only user of css->refcnt.)
> >
>
> > As Li said before, css_under_removal doesn't necessarily mean
> > this this group has been destroyed, but mem_cgroup will be freed.
> >
> > But adding cgroup_lock/unlock here causes another dead lock,
> > because mem_cgroup_get_next_node calls mem_cgroup_put.
> >
> > hmm.. hierarchical reclaim code will be re-written completely by [21/21],
> > so would it be better to change patch order or to take another approach ?
> >
> Hmm, ok.
>
> How about this ?
> ==
> At initialization, in mem_cgroup_create(), set memcg->refcnt to 1.
>
> At destroy(), put this refcnt by 1.
>
> remove css_under_removal(&mem->css) check.
> ==
>
That would make sense.

Thanks,
Daisuke Nishimura.

2008-12-04 11:12:27

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [Experimental][PATCH 21/21] memcg-new-hierarchical-reclaim.patch

On Wed, 3 Dec 2008 14:14:23 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> Implement hierarchy reclaim by cgroup_id.
>
> What changes:
> - reclaim is not done by tree-walk algorithm
> - mem_cgroup->last_scanned_child is an ID, not a pointer.
> - no cgroup_lock.
> - scanning order is just defined by ID's order.
> (Scan by round-robin logic.)
>
> Changelog: v1 -> v2
> - make use of css_tryget();
> - count # of loops rather than remembering position.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
>
> mm/memcontrol.c | 214 +++++++++++++++++++-------------------------------------
> 1 file changed, 75 insertions(+), 139 deletions(-)
>
(snip)
> /*
> - * Visit the first child (need not be the first child as per the ordering
> - * of the cgroup list, since we track last_scanned_child) of @mem and use
> - * that to reclaim free pages from.
> + * This routine selects the next memcg by ID, using RCU and tryget().
> + * No cgroup_mutex is required.
> */
> static struct mem_cgroup *
> -mem_cgroup_get_first_node(struct mem_cgroup *root_mem)
> +mem_cgroup_select_victim(struct mem_cgroup *root_mem)
> {
> - struct cgroup *cgroup;
> + struct cgroup *cgroup, *root_cgroup;
> struct mem_cgroup *ret;
> - struct mem_cgroup *last_scan = root_mem->last_scanned_child;
> - bool obsolete = false;
> + int nextid, rootid, depth, found;
>
> - if (last_scan) {
> - if (css_under_removal(&last_scan->css))
> - obsolete = true;
> - } else
> - obsolete = true;
> + root_cgroup = root_mem->css.cgroup;
> + rootid = cgroup_id(root_cgroup);
> + depth = cgroup_depth(root_cgroup);
> + found = 0;
> + ret = NULL;
>
> - /*
> - * Scan all children under the mem_cgroup mem
> - */
> - cgroup_lock();
> - if (list_empty(&root_mem->css.cgroup->children)) {
> - ret = root_mem;
> - goto done;
> + rcu_read_lock();
> + if (!root_mem->use_hierarchy) {
> + spin_lock(&root_mem->reclaim_param_lock);
> + root_mem->scan_age++;
> + spin_unlock(&root_mem->reclaim_param_lock);
> + css_get(&root_mem->css);
> + goto out;
> }
>
I think you forgot "ret = root_mem".
I got a NULL pointer dereference BUG in my test (I've not tested the use_hierarchy case yet).
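Something like this in the !use_hierarchy branch, I guess (untested sketch):

	if (!root_mem->use_hierarchy) {
		spin_lock(&root_mem->reclaim_param_lock);
		root_mem->scan_age++;
		spin_unlock(&root_mem->reclaim_param_lock);
		css_get(&root_mem->css);
		ret = root_mem;		/* was missing */
		goto out;
	}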


Thanks,
Daisuke Nishimura.

> - if (!root_mem->last_scanned_child || obsolete) {
> -
> - if (obsolete)
> - mem_cgroup_put(root_mem->last_scanned_child);
> -
> - cgroup = list_first_entry(&root_mem->css.cgroup->children,
> - struct cgroup, sibling);
> - ret = mem_cgroup_from_cont(cgroup);
> - mem_cgroup_get(ret);
> - } else
> - ret = mem_cgroup_get_next_node(root_mem->last_scanned_child,
> - root_mem);
> + while (!ret) {
> + /* ID:0 is not used by cgroup-id */
> + nextid = root_mem->last_scanned_child + 1;
> + cgroup = cgroup_get_next(nextid, rootid, depth, &found);
> + if (cgroup) {
> + spin_lock(&root_mem->reclaim_param_lock);
> + root_mem->last_scanned_child = found;
> + spin_unlock(&root_mem->reclaim_param_lock);
> + ret = mem_cgroup_from_cont(cgroup);
> + if (!css_tryget(&ret->css))
> + ret = NULL;
> + } else {
> + spin_lock(&root_mem->reclaim_param_lock);
> + root_mem->scan_age++;
> + root_mem->last_scanned_child = 0;
> + spin_unlock(&root_mem->reclaim_param_lock);
> + }
> + }
> +out:
> + rcu_read_unlock();
>
> -done:
> - root_mem->last_scanned_child = ret;
> - cgroup_unlock();
> return ret;
> }
>

2008-12-04 12:45:09

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [Experimental][PATCH 21/21] memcg-new-hierarchical-reclaim.patch

Daisuke Nishimura said:
> On Wed, 3 Dec 2008 14:14:23 +0900, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
>> Implement hierarchy reclaim by cgroup_id.

>> + rcu_read_lock();
>> + if (!root_mem->use_hierarchy) {
>> + spin_lock(&root_mem->reclaim_param_lock);
>> + root_mem->scan_age++;
>> + spin_unlock(&root_mem->reclaim_param_lock);
>> + css_get(&root_mem->css);
>> + goto out;
>> }
>>
> I think you forgot "ret = root_mem".
> I got a NULL pointer dereference BUG in my test (I've not tested
> the use_hierarchy case yet).
>
yes...thank you for catching. will fix.

-Kame