2008-12-09 11:03:20

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 0/6] cgroup id and mix fixes (2008/12/09)


A misc patch set for memcg/cgroup including bug fix.

This set is against mm-of-the-moment snapshot 2008-12-08-16-36

[1/6] Documentation updates (thanks to Randy)
[2/6] fixing pre-destroy() (new implementation, Paul, please comment)
[3/6] cgroup id (removed codes around release_handler)
[4/6] mem_cgroup reclaim with scanning by ID (no big change.)
[5/6] fix inactive_ratio bug under hierarchy. (new one; very important, I think)
[6/6] fix oom-kill handler.

[2/6] and [3/6] adds codes to kernel/cgroup.c
[4/6] is for code sanity, removing cgroup_lock() from scanning.
[5/6] and [6/6] are bug fixes for the use_hierarchy=1 case.
If my fixes cannot go ahead, we'll have to find an alternative, anyway.

Thanks,
-Kame


2008-12-09 11:05:27

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 1/6] memcg: Documentation for internal implementation


From: KAMEZAWA Hiroyuki <[email protected]>

Update/Fix document about implementation details of memcg.

Changelog:
- applied Randy Dunlap's comments.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Documentation/controllers/memcg_test.txt | 240 ++++++++++++++++++++++++++-----
mm/memcontrol.c | 4
2 files changed, 207 insertions(+), 37 deletions(-)

Index: mmotm-2.6.28-Dec03/Documentation/controllers/memcg_test.txt
===================================================================
--- mmotm-2.6.28-Dec03.orig/Documentation/controllers/memcg_test.txt
+++ mmotm-2.6.28-Dec03/Documentation/controllers/memcg_test.txt
@@ -1,59 +1,76 @@
Memory Resource Controller(Memcg) Implementation Memo.
-Last Updated: 2009/12/03
+Last Updated: 2008/12/04
+Base Kernel Version: based on 2.6.28-rc7-mm.

Because VM is getting complex (one of reasons is memcg...), memcg's behavior
-is complex. This is a document for memcg's internal behavior and some test
-patterns tend to be racy.
+is complex. This is a document for memcg's internal behavior.
+Please note that implementation details can be changed.

-1. charges
+(*) Topics on API should be in Documentation/controllers/memory.txt
+
+0. How to record usage ?
+ 2 objects are used.
+
+ page_cgroup ....an object per page.
+ Allocated at boot or memory hotplug. Freed at memory hot removal.
+
+ swap_cgroup ... an entry per swp_entry.
+ Allocated at swapon(). Freed at swapoff().
+
+ The page_cgroup has USED bit and double count against a page_cgroup never
+ occurs. swap_cgroup is used only when a charged page is swapped-out.
+
+1. Charge

a page/swp_entry may be charged (usage += PAGE_SIZE) at

- mem_cgroup_newpage_newpage()
- called at new page fault and COW.
+ mem_cgroup_newpage_charge()
+ Called at new page fault and Copy-On-Write.

mem_cgroup_try_charge_swapin()
- called at do_swap_page() and swapoff.
- followed by charge-commit-cancel protocol.
- (With swap accounting) at commit, charges recorded in swap is removed.
+ Called at do_swap_page() (page fault on swap entry) and swapoff.
+ Followed by charge-commit-cancel protocol. (With swap accounting)
+ At commit, a charge recorded in swap_cgroup is removed.

mem_cgroup_cache_charge()
- called at add_to_page_cache()
+ Called at add_to_page_cache()

- mem_cgroup_cache_charge_swapin)()
- called by shmem's swapin processing.
+ mem_cgroup_cache_charge_swapin()
+ Called at shmem's swapin.

mem_cgroup_prepare_migration()
- called before migration. "extra" charge is done
- followed by charge-commit-cancel protocol.
+ Called before migration. "extra" charge is done and followed by
+ charge-commit-cancel protocol.
At commit, charge against oldpage or newpage will be committed.

-2. uncharge
+2. Uncharge
a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by

mem_cgroup_uncharge_page()
- called when an anonymous page is unmapped. If the page is SwapCache
- uncharge is delayed until mem_cgroup_uncharge_swapcache().
+ Called when an anonymous page is fully unmapped. I.e., mapcount goes
+ to 0. If the page is SwapCache, uncharge is delayed until
+ mem_cgroup_uncharge_swapcache().

mem_cgroup_uncharge_cache_page()
- called when a page-cache is deleted from radix-tree. If the page is
- SwapCache, uncharge is delayed until mem_cgroup_uncharge_swapcache()
+ Called when a page-cache is deleted from radix-tree. If the page is
+ SwapCache, uncharge is delayed until mem_cgroup_uncharge_swapcache().

mem_cgroup_uncharge_swapcache()
- called when SwapCache is removed from radix-tree. the charge itself
+ Called when SwapCache is removed from radix-tree. The charge itself
is moved to swap_cgroup. (If mem+swap controller is disabled, no
- charge to swap.)
+ charge to swap occurs.)

mem_cgroup_uncharge_swap()
- called when swp_entry's refcnt goes down to be 0. charge against swap
+ Called when swp_entry's refcnt goes down to 0. A charge against swap
disappears.

mem_cgroup_end_migration(old, new)
- at success of migration -> old is uncharged (if necessary), charge
- to new is committed. at failure, charge to old is committed.
+ At success of migration old is uncharged (if necessary), a charge
+ to new page is committed. At failure, charge to old page is committed.

3. charge-commit-cancel
- In some case, we can't know this "charge" is valid or not at charge.
+ In some case, we can't know this "charge" is valid or not at charging
+ (because of races).
To handle such case, there are charge-commit-cancel functions.
mem_cgroup_try_charge_XXX
mem_cgroup_commit_charge_XXX
@@ -68,24 +85,164 @@ patterns tend to be racy.

At cancel(), simply usage -= PAGE_SIZE.

-4. Typical Tests.
+Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
+
+4. Anonymous
+ Anonymous page is newly allocated at
+ - page fault into MAP_ANONYMOUS mapping.
+ - Copy-On-Write.
+ It is charged right after it's allocated before doing any page table
+ related operations. Of course, it's uncharged when another page is used
+ for the fault address.
+
+ At freeing anonymous page (by exit() or munmap()), zap_pte() is called
+ and pages for ptes are freed one by one.(see mm/memory.c). Uncharges
+ are done at page_remove_rmap() when page_mapcount() goes down to 0.
+
+ Another page freeing is by page-reclaim (vmscan.c) and anonymous
+ pages are swapped out. In this case, the page is marked as
+ PageSwapCache(). uncharge() routine doesn't uncharge the page marked
+ as SwapCache(). It's delayed until __delete_from_swap_cache().
+
+ 4.1 Swap-in.
+ At swap-in, the page is taken from swap-cache. There are 2 cases.
+
+ (a) If the SwapCache is newly allocated and read, it has no charges.
+ (b) If the SwapCache has been mapped by processes, it has been
+ charged already.
+
+ In case (a), we charge it. In case (b), we don't charge it.
+ (But racy state between (a) and (b) exists. We do check it.)
+ At charging, a charge recorded in swap_cgroup is moved to page_cgroup.
+
+ 4.2 Swap-out.
+ At swap-out, typical state transition is below.
+
+ (a) add to swap cache. (marked as SwapCache)
+ swp_entry's refcnt += 1.
+ (b) fully unmapped.
+ swp_entry's refcnt += # of ptes.
+ (c) write back to swap.
+ (d) delete from swap cache. (remove from SwapCache)
+ swp_entry's refcnt -= 1.
+
+
+ At (b), the page is marked as SwapCache and not uncharged.
+ At (d), the page is removed from SwapCache and a charge in page_cgroup
+ is moved to swap_cgroup.
+
+ Finally, at task exit,
+ (e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
+ Here, a charge in swap_cgroup disappears.
+
+5. Page Cache
+ Page Cache is charged at
+ - add_to_page_cache_locked().
+
+ uncharged at
+ - __remove_from_page_cache().
+
+ The logic is very clear. (About migration, see below)
+ Note: __remove_from_page_cache() is called by remove_from_page_cache()
+ and __remove_mapping().
+
+6. Shmem(tmpfs) Page Cache
+ Memcg's charge/uncharge have special handlers of shmem. The best way
+ to understand shmem's page state transition is to read mm/shmem.c.
+ But brief explanation of the behavior of memcg around shmem will be
+ helpful to understand the logic.
+
+ Shmem's page (just leaf page, not direct/indirect block) can be on
+ - radix-tree of shmem's inode.
+ - SwapCache.
+ - Both on radix-tree and SwapCache. This happens at swap-in
+ and swap-out,
+
+ It's charged when...
+ - A new page is added to shmem's radix-tree.
+ - A swp page is read. (move a charge from swap_cgroup to page_cgroup)
+ It's uncharged when
+ - A page is removed from radix-tree and not SwapCache.
+ - When SwapCache is removed, a charge is moved to swap_cgroup.
+ - When swp_entry's refcnt goes down to 0, a charge in swap_cgroup
+ disappears.

- Tests for racy cases.
+7. Page Migration
+ One of the most complicated functions is page-migration-handler.
+ Memcg has 2 routines. Assume that we are migrating a page's contents
+ from OLDPAGE to NEWPAGE.
+
+ Usual migration logic is..
+ (a) remove the page from LRU.
+ (b) allocate NEWPAGE (migration target)
+ (c) lock by lock_page().
+ (d) unmap all mappings.
+ (e-1) If necessary, replace entry in radix-tree.
+ (e-2) move contents of a page.
+ (f) map all mappings again.
+ (g) pushback the page to LRU.
+ (-) OLDPAGE will be freed.
+
+ Before (g), memcg should complete all necessary charge/uncharge to
+ NEWPAGE/OLDPAGE.
+
+ The point is....
+ - If OLDPAGE is anonymous, all charges will be dropped at (d) because
+ try_to_unmap() drops all mapcount and the page will not be
+ SwapCache.
+
+ - If OLDPAGE is SwapCache, charges will be kept at (g) because
+ __delete_from_swap_cache() isn't called at (e-1)
+
+ - If OLDPAGE is page-cache, charges will be kept at (g) because
+ remove_from_swap_cache() isn't called at (e-1)
+
+ memcg provides following hooks.
+
+ - mem_cgroup_prepare_migration(OLDPAGE)
+ Called after (b) to account a charge (usage += PAGE_SIZE) against
+ memcg which OLDPAGE belongs to.
+
+ - mem_cgroup_end_migration(OLDPAGE, NEWPAGE)
+ Called after (f) before (g).
+ If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already
+ charged, a charge by prepare_migration() is automatically canceled.
+ If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE.
+
+ But zap_pte() (by exit or munmap) can be called during migration, so
+ we have to check if OLDPAGE/NEWPAGE is a valid page after commit().
+
+8. LRU
+ Each memcg has its own private LRU. Now, its handling is under global
+ VM's control (i.e., it's handled under the global zone->lru_lock).
+ Almost all routines around memcg's LRU are called by the global LRU's
+ list management functions under zone->lru_lock().
+
+ A special function is mem_cgroup_isolate_pages(). This scans
+ memcg's private LRU and calls __isolate_lru_page() to extract a page
+ from LRU.
+ (By __isolate_lru_page(), the page is removed from both of global and
+ private LRU.)

- 4.1 small limit to memcg.
+
+9. Typical Tests.
+
+ Tests for racy cases.
+
+ 9.1 Small limit to memcg.
When you do test to do racy case, it's good test to set memcg's limit
to be very small rather than GB. Many races found in the test under
xKB or xxMB limits.
(Memory behavior under GB and Memory behavior under MB shows very
different situation.)

- 4.2 shmem
+ 9.2 Shmem
Historically, memcg's shmem handling was poor and we saw some amount
of troubles here. This is because shmem is page-cache but can be
SwapCache. Test with shmem/tmpfs is always good test.

- 4.3 migration
- For NUMA, migration is an another special. To do easy test, cpuset
+ 9.3 Migration
+ For NUMA, migration is another special case. To do an easy test, cpuset
is useful. Following is a sample script to do migration.

mount -t cgroup -o cpuset none /opt/cpuset
@@ -118,20 +275,20 @@ patterns tend to be racy.
G2_TASK=`cat ${G2}/tasks`
move_task "${G1_TASK}" ${G2} &
--
- 4.4 memory hotplug.
+ 9.4 Memory hotplug.
memory hotplug test is one of good test.
to offline memory, do following.
# echo offline > /sys/devices/system/memory/memoryXXX/state
(XXX is the place of memory)
This is an easy way to test page migration, too.

- 4.5 mkdir/rmdir
+ 9.5 mkdir/rmdir
When using hierarchy, mkdir/rmdir test should be done.
- tests like following.
+ Use tests like the following.

- #echo 1 >/opt/cgroup/01/memory/use_hierarchy
- #mkdir /opt/cgroup/01/child_a
- #mkdir /opt/cgroup/01/child_b
+ echo 1 >/opt/cgroup/01/memory/use_hierarchy
+ mkdir /opt/cgroup/01/child_a
+ mkdir /opt/cgroup/01/child_b

set limit to 01.
add limit to 01/child_b
@@ -143,3 +300,12 @@ patterns tend to be racy.
/opt/cgroup/01/child_c

running new jobs in new group is also good.
+
+ 9.6 Mount with other subsystems.
+ Mounting with other subsystems is a good test because there is a
+ race and lock dependency with other cgroup subsystems.
+
+ example)
+ # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices
+
+ and do task move, mkdir, rmdir etc...under this.
Index: mmotm-2.6.28-Dec03/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec03.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec03/mm/memcontrol.c
@@ -6,6 +6,10 @@
* Copyright 2007 OpenVZ SWsoft Inc
* Author: Pavel Emelianov <[email protected]>
*
+ * Documentation is available at
+ * Documentation/controllers/memory.txt
+ * Documentation/controllers/memcg_test.txt
+ *
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or

2008-12-09 11:07:54

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 2/6] memcg: fix pre_destroy handler

A better name for the new flag is welcome.

==
Because the pre_destroy() handler was moved outside of cgroup_lock() to
avoid deadlock, cgroup's rmdir() now does the following sequence.

cgroup_lock()
check children and tasks.
(A)
cgroup_unlock()
(B)
pre_destroy() for subsys;-----(1)
(C)
cgroup_lock();
(D)
Second check: check for -EBUSY again because we released the lock.
(E)
mark cgroup as removed.
(F)
unlink from lists.
cgroup_unlock();
dput()
=> when dentry's refcnt goes down to 0
destroy() handlers for subsys

memcg marks itself as "obsolete" when pre_destroy() is called at (1).
But rmdir() can fail after pre_destroy(), so marking it "obsolete" there is a bug.
I'd like to fix the sanity of pre_destroy() in the cgroup layer.

Considering the above sequence, new tasks can be added between (B) and (C),
and a swap-in record can be charged back to a cgroup after pre_destroy(),
at (C), (D) and (E).
(This means the cgroup's refcnt can come not from tasks but from other
persistent objects.)

This patch adds a "cgroup_is_being_removed()" check. (A better name is welcome.)
After this,

- the cgroup is marked as CGRP_PRE_REMOVAL at (A).
- If the second check fails, the CGRP_PRE_REMOVAL flag is cleared.
- memcg's own obsolete flag is removed.
- While CGRP_PRE_REMOVAL is set, task attach will fail with -EBUSY.
(task attach via clone() will not hit this case.)

By this, we can trust pre_destroy()'s result.


Note: if CGRP_REMOVED can be set and cleared, it should be used instead of
CGRP_PRE_REMOVAL.
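
To illustrate the sequence above, here is a minimal userspace sketch (plain C,
with a pthread mutex standing in for cgroup_mutex). All fake_* names are
hypothetical; it only models the ordering of the checks and the reversible
"pre removal" mark, not the real cgroup code:

#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_cgroup {
        pthread_mutex_t lock;   /* stands in for cgroup_mutex */
        int task_count;         /* stands in for cgrp->count / tasks */
        bool pre_removal;       /* stands in for CGRP_PRE_REMOVAL */
};

/* (1) pre_destroy(): subsystems drain their charges here; may sleep. */
static void fake_pre_destroy(struct fake_cgroup *cg)
{
        (void)cg;               /* nothing to drain in this model */
}

/* Task attach fails while an rmdir() is in progress. */
static int fake_attach_task(struct fake_cgroup *cg)
{
        int ret = 0;

        pthread_mutex_lock(&cg->lock);
        if (cg->pre_removal)
                ret = -EBUSY;   /* between (B) and (E): rmdir in progress */
        else
                cg->task_count++;
        pthread_mutex_unlock(&cg->lock);
        return ret;
}

static int fake_rmdir(struct fake_cgroup *cg)
{
        pthread_mutex_lock(&cg->lock);
        if (cg->task_count) {                   /* first check, (A) */
                pthread_mutex_unlock(&cg->lock);
                return -EBUSY;
        }
        cg->pre_removal = true;                 /* mark before dropping the lock */
        pthread_mutex_unlock(&cg->lock);        /* (B) */

        fake_pre_destroy(cg);                   /* (1), called without the lock */

        pthread_mutex_lock(&cg->lock);          /* (C)/(D) */
        if (cg->task_count) {                   /* second check, (E) */
                cg->pre_removal = false;        /* rmdir failed: undo the mark */
                pthread_mutex_unlock(&cg->lock);
                return -EBUSY;
        }
        /* (F): mark CGRP_REMOVED, unlink from lists, ... */
        pthread_mutex_unlock(&cg->lock);
        return 0;
}

int main(void)
{
        struct fake_cgroup cg = { PTHREAD_MUTEX_INITIALIZER, 0, false };

        printf("attach -> %d\n", fake_attach_task(&cg));
        printf("rmdir  -> %d\n", fake_rmdir(&cg));  /* -EBUSY: a task is attached */
        return 0;
}

The point is exactly the clearing of the mark in the failure path: because
rmdir() can still fail at the second check, the "being removed" mark must be
reversible, which memcg's one-way "obsolete" flag was not.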


Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

---
include/linux/cgroup.h | 5 +++++
kernel/cgroup.c | 18 +++++++++++++++++-
mm/memcontrol.c | 36 ++++++++++++++++++++++++++----------
3 files changed, 48 insertions(+), 11 deletions(-)

Index: mmotm-2.6.28-Dec08/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec08.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec08/mm/memcontrol.c
@@ -166,7 +166,6 @@ struct mem_cgroup {
*/
bool use_hierarchy;
unsigned long last_oom_jiffies;
- int obsolete;
atomic_t refcnt;

unsigned int swappiness;
@@ -211,6 +210,24 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
static void mem_cgroup_get(struct mem_cgroup *mem);
static void mem_cgroup_put(struct mem_cgroup *mem);

+static bool memcg_is_obsolete(struct mem_cgroup *mem)
+{
+ struct cgroup *cg = mem->css.cgroup;
+ /*
+ * "Being Removed" means pre_destroy() handler is called.
+ * After "pre_destroy" handler is called, memcg should not
+ * have any additional charges.
+ * This means there are small races for mis-accounting. But this
+ * mis-accounting should happen only under a swap-in operation.
+ * (Attaching a new task will fail if the cgroup is under rmdir().)
+ */
+
+ if (!cg || cgroup_is_removed(cg) || cgroup_is_being_removed(cg))
+ return true;
+ return false;
+}
+
+
static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
struct page_cgroup *pc,
bool charge)
@@ -597,7 +614,7 @@ mem_cgroup_get_first_node(struct mem_cgr
struct cgroup *cgroup;
struct mem_cgroup *ret;
bool obsolete = (root_mem->last_scanned_child &&
- root_mem->last_scanned_child->obsolete);
+ memcg_is_obsolete(root_mem->last_scanned_child));

/*
* Scan all children under the mem_cgroup mem
@@ -1070,7 +1087,7 @@ int mem_cgroup_try_charge_swapin(struct
ent.val = page_private(page);

mem = lookup_swap_cgroup(ent);
- if (!mem || mem->obsolete)
+ if (!mem || memcg_is_obsolete(mem))
goto charge_cur_mm;
*ptr = mem;
return __mem_cgroup_try_charge(NULL, mask, ptr, true);
@@ -1104,7 +1121,7 @@ int mem_cgroup_cache_charge_swapin(struc
ent.val = page_private(page);
if (do_swap_account) {
mem = lookup_swap_cgroup(ent);
- if (mem && mem->obsolete)
+ if (mem && memcg_is_obsolete(mem))
mem = NULL;
if (mem)
mm = NULL;
@@ -2050,9 +2067,6 @@ static struct mem_cgroup *mem_cgroup_all
* the number of reference from swap_cgroup and free mem_cgroup when
* it goes down to 0.
*
- * When mem_cgroup is destroyed, mem->obsolete will be set to 0 and
- * entry which points to this memcg will be ignore at swapin.
- *
* Removal of cgroup itself succeeds regardless of refs from swap.
*/

@@ -2081,7 +2095,7 @@ static void mem_cgroup_get(struct mem_cg
static void mem_cgroup_put(struct mem_cgroup *mem)
{
if (atomic_dec_and_test(&mem->refcnt)) {
- if (!mem->obsolete)
+ if (!memcg_is_obsolete(mem))
return;
mem_cgroup_free(mem);
}
@@ -2148,14 +2162,16 @@ static void mem_cgroup_pre_destroy(struc
struct cgroup *cont)
{
struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
- mem->obsolete = 1;
mem_cgroup_force_empty(mem, false);
}

static void mem_cgroup_destroy(struct cgroup_subsys *ss,
struct cgroup *cont)
{
- mem_cgroup_free(mem_cgroup_from_cont(cont));
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+ mem_cgroup_free(mem);
+ /* forget */
+ mem->css.cgroup = NULL;
}

static int mem_cgroup_populate(struct cgroup_subsys *ss,
Index: mmotm-2.6.28-Dec08/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.28-Dec08.orig/include/linux/cgroup.h
+++ mmotm-2.6.28-Dec08/include/linux/cgroup.h
@@ -98,6 +98,8 @@ enum {
CGRP_RELEASABLE,
/* Control Group requires release notifications to userspace */
CGRP_NOTIFY_ON_RELEASE,
+ /* Control Group is preparing for death */
+ CGRP_PRE_REMOVAL,
};

struct cgroup {
@@ -303,8 +305,11 @@ int cgroup_add_files(struct cgroup *cgrp
const struct cftype cft[],
int count);

+
int cgroup_is_removed(const struct cgroup *cgrp);

+int cgroup_is_being_removed(const struct cgroup *cgrp);
+
int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen);

int cgroup_task_count(const struct cgroup *cgrp);
Index: mmotm-2.6.28-Dec08/kernel/cgroup.c
===================================================================
--- mmotm-2.6.28-Dec08.orig/kernel/cgroup.c
+++ mmotm-2.6.28-Dec08/kernel/cgroup.c
@@ -123,6 +123,11 @@ inline int cgroup_is_removed(const struc
return test_bit(CGRP_REMOVED, &cgrp->flags);
}

+inline int cgroup_is_being_removed(const struct cgroup *cgrp)
+{
+ return test_bit(CGRP_PRE_REMOVAL, &cgrp->flags);
+}
+
/* bits in struct cgroupfs_root flags field */
enum {
ROOT_NOPREFIX, /* mounted subsystems have no named prefix */
@@ -1217,6 +1222,13 @@ int cgroup_attach_task(struct cgroup *cg
if (cgrp == oldcgrp)
return 0;

+ /*
+ * This cgroup is under rmdir() operation. Never fails here when this
+ * is called from clone().
+ */
+ if (cgroup_is_being_removed(cgrp))
+ return -EBUSY;
+
for_each_subsys(root, ss) {
if (ss->can_attach) {
retval = ss->can_attach(ss, cgrp, tsk);
@@ -2469,12 +2481,14 @@ static int cgroup_rmdir(struct inode *un
mutex_unlock(&cgroup_mutex);
return -EBUSY;
}
- mutex_unlock(&cgroup_mutex);

/*
* Call pre_destroy handlers of subsys. Notify subsystems
* that rmdir() request comes.
*/
+ set_bit(CGRP_PRE_REMOVAL, &cgrp->flags);
+ mutex_unlock(&cgroup_mutex);
+
cgroup_call_pre_destroy(cgrp);

mutex_lock(&cgroup_mutex);
@@ -2483,12 +2497,14 @@ static int cgroup_rmdir(struct inode *un
if (atomic_read(&cgrp->count)
|| !list_empty(&cgrp->children)
|| cgroup_has_css_refs(cgrp)) {
+ clear_bit(CGRP_PRE_REMOVAL, &cgrp->flags);
mutex_unlock(&cgroup_mutex);
return -EBUSY;
}

spin_lock(&release_list_lock);
set_bit(CGRP_REMOVED, &cgrp->flags);
+ clear_bit(CGRP_PRE_REMOVAL, &cgrp->flags);
if (!list_empty(&cgrp->release_list))
list_del(&cgrp->release_list);
spin_unlock(&release_list_lock);

2008-12-09 11:09:18

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 3/6] cgroup id


From: KAMEZAWA Hiroyuki <[email protected]>

Patch for Cgroup ID and hierarchy code.

This patch tries to assign an ID to each cgroup: it attaches a unique ID to
each cgroup and provides the following functions.

- cgroup_lookup(id)
 returns the struct cgroup for id.
- cgroup_get_next(id, rootid, depth, foundid)
 returns the next cgroup under "root" by scanning a bitmap (not by tree-walk)
- cgroup_id_put()/cgroup_id_tryget()
 used when a subsystem wants to prevent reuse of an ID.

There are several reasons to develop this.

- While trying to implement hierarchy in the memory cgroup, we have to
 implement "walk under hierarchy" code.
 Now it consists of cgroup_lock and tree up-down code. Because the
 memory cgroup has to do hierarchy walks in other places for
 intelligent processing, we'll reuse the "walk" code.
 But taking "cgroup_lock" while walking the tree can cause deadlocks.
 An easier way is helpful.

- SwapCgroup uses an array of pointers to record the owner of each swap entry.
 By ID, we can reduce this to "short" or "int". This means an ID is
 useful for reducing the space consumed by pointers, if the access cost
 is not a problem.
 (I hear bio-cgroup will use the same kind of...)

Example) OOM-Killer under hierarchy.
    do {
        rcu_read_lock();
        next = cgroup_get_next(id, rootid, depth, &foundid);
        /* check sanity of next here */
        css_tryget();
        rcu_read_unlock();
        if (!next)
            break;
        cgroup_scan_tasks(select_bad_process?);
        /* record score here...*/
    } while (1);


Characteristics:
- Each cgroup gets a new ID when created.
- A cgroup ID contains an "ID", a "Depth in tree" and a hierarchy code.
- The hierarchy code is an array of ancestor IDs (see the sketch below).
- ID 0 is the UNUSED ID.
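
A minimal userspace sketch of that hierarchy code (hypothetical fake_* names,
not this patch's API): each node stores its depth and the IDs of itself and
all its ancestors in stack[], so an ancestor test is a single array lookup,
which is what cgroup_is_ancestor() in this patch relies on.

#include <stdio.h>

#define MAX_DEPTH 10

struct fake_cgroup_id {
        unsigned int id;
        unsigned int depth;                /* hierarchy root has depth 0 */
        unsigned int stack[MAX_DEPTH];     /* ancestor IDs; stack[depth] == id */
};

/* non-zero if "root" is an ancestor of (or equal to) "cg" */
static int fake_is_ancestor(const struct fake_cgroup_id *cg,
                            const struct fake_cgroup_id *root)
{
        return cg->stack[root->depth] == root->id;
}

int main(void)
{
        struct fake_cgroup_id root       = { 1, 0, { 1 } };
        struct fake_cgroup_id child      = { 5, 1, { 1, 5 } };
        struct fake_cgroup_id grandchild = { 9, 2, { 1, 5, 9 } };
        struct fake_cgroup_id other      = { 7, 1, { 1, 7 } };

        printf("%d\n", fake_is_ancestor(&grandchild, &child)); /* 1 */
        printf("%d\n", fake_is_ancestor(&grandchild, &other)); /* 0 */
        return 0;
}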

Design Choices:
- At this moment, swap_cgroup and bio_cgroup will be the users of this
 ID. And memcg will use some routines which allow
 scanning-without-cgroup-lock.

- Now, SwapCgroup has its own refcnt to the memory cgroup because the
 lifetime of a swap_entry can be longer than the cgroup's.
 To replace the pointer in SwapCgroup with an ID, the ID should have a refcnt.

- scan-by-ID vs. scan-by-tree-walk.
 As /proc's pid scan does, scan-by-ID is robust when scanning is done
 by the following kind of routine:
 scan -> rest a while (release a lock) -> continue from where it was interrupted
 memcg's hierarchical reclaim does this.

Consideration:
- I'd like to use "short" for cgroup_id to save space...
- MAX_DEPTH is small ? (making this depend on boot option is easy.)
TODO:
- Documentation.

Changelog:(v3) -> (v4)
- updated comments.
- renamed hierarchy_code[] to stack[]
- merged prepare_id routines.

Changelog (v2) -> (v3)
- removed cgroup_id_getref().
- added cgroup_id_tryget().

Changelog (v1) -> (v2):
- Design change: show only ID(integer) to outside of cgroup.c
- moved cgroup ID definition from include/ to kernel/cgroup.c
- struct cgroup_id is freed by RCU.
- changed interface from pointer to "int"
- kill_sb() is handled.
- ID 0 as unused ID.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/cgroup.h | 33 +++++
include/linux/idr.h | 1
kernel/cgroup.c | 320 ++++++++++++++++++++++++++++++++++++++++++++++++-
lib/idr.c | 46 +++++++
4 files changed, 396 insertions(+), 4 deletions(-)

Index: mmotm-2.6.28-Dec08/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.28-Dec08.orig/include/linux/cgroup.h
+++ mmotm-2.6.28-Dec08/include/linux/cgroup.h
@@ -22,6 +22,7 @@ struct cgroupfs_root;
struct cgroup_subsys;
struct inode;
struct cgroup;
+struct cgroup_id;

extern int cgroup_init_early(void);
extern int cgroup_init(void);
@@ -61,6 +62,8 @@ struct cgroup_subsys_state {
unsigned long flags;
};

+#define MAX_CGROUP_DEPTH (10)
+
/* bits in struct cgroup_subsys_state flags field */
enum {
CSS_ROOT, /* This CSS is the root of the subsystem */
@@ -147,6 +150,8 @@ struct cgroup {
int pids_use_count;
/* Length of the current tasks_pids array */
int pids_length;
+ /* Cgroup ID */
+ struct cgroup_id *id;
};

/* A css_set is a structure holding pointers to a set of
@@ -398,6 +403,34 @@ void cgroup_iter_end(struct cgroup *cgrp
int cgroup_scan_tasks(struct cgroup_scanner *scan);
int cgroup_attach_task(struct cgroup *, struct task_struct *);

+/*
+ * Cgroup ID is a system-wide ID for all cgroup structs. It can be used for
+ * look-up and scanning. A cgroup ID is assigned at cgroup allocation and removed
+ * when the refcnt to the ID goes down to 0. The refcnt is incremented when a
+ * subsys wants to avoid reuse of the ID for persistent objects. Usually, the
+ * refcnt to an ID will be 0 when the cgroup is removed.
+ * These look-up and scan functions should be called under rcu_read_lock();
+ * cgroup_lock() is not necessary.
+ */
+
+/* Find the cgroup to which the "ID" is attached. */
+struct cgroup *cgroup_lookup(int id);
+/*
+ * Get the next cgroup under the tree. Returns a cgroup which has an ID equal
+ * to or greater than the given "id".
+ */
+struct cgroup *cgroup_get_next(int id, int rootid, int depth, int *foundid);
+
+/* get id and depth of cgroup */
+int cgroup_id(struct cgroup *cgroup);
+int cgroup_depth(struct cgroup *cgroup);
+/* returns non-zero if root is ancestor of cg */
+int cgroup_is_ancestor(struct cgroup *cg, struct cgroup *root);
+
+/* For refcnt/delayed freeing of IDs */
+int cgroup_id_tryget(int id);
+void cgroup_id_put(int id);
+
#else /* !CONFIG_CGROUPS */

static inline int cgroup_init_early(void) { return 0; }
Index: mmotm-2.6.28-Dec08/kernel/cgroup.c
===================================================================
--- mmotm-2.6.28-Dec08.orig/kernel/cgroup.c
+++ mmotm-2.6.28-Dec08/kernel/cgroup.c
@@ -46,7 +46,7 @@
#include <linux/cgroupstats.h>
#include <linux/hash.h>
#include <linux/namei.h>
-
+#include <linux/idr.h>
#include <asm/atomic.h>

static DEFINE_MUTEX(cgroup_mutex);
@@ -556,6 +556,301 @@ void cgroup_unlock(void)
}

/*
+ * CGROUP ID
+ */
+struct cgroup_id {
+ /*
+ * The cgroup to which this ID points. If cgroup is removed,
+ * this will point to NULL.
+ */
+ struct cgroup *cgroup;
+ /*
+ * ID of this cgroup.
+ */
+ unsigned int id;
+ /*
+ * Depth in hierarchy which this ID belongs to.
+ */
+ unsigned int depth;
+ /*
+ * Refcnt is managed for persistent objects.
+ */
+ atomic_t refcnt;
+ /*
+ * ID is freed by RCU. (and lookup routine is RCU safe.)
+ */
+ struct rcu_head rcu_head;
+ /*
+ * Hierarchy of Cgroup ID belongs to.
+ */
+ unsigned int stack[MAX_CGROUP_DEPTH];
+};
+
+void free_cgroupid_cb(struct rcu_head *head)
+{
+ struct cgroup_id *id;
+
+ id = container_of(head, struct cgroup_id, rcu_head);
+ kfree(id);
+}
+
+void free_cgroupid(struct cgroup_id *id)
+{
+ call_rcu(&id->rcu_head, free_cgroupid_cb);
+}
+
+/*
+ * Cgroup ID and lookup functions.
+ * cgid->cgroup pointer is safe under rcu_read_lock() because dput() of the
+ * cgroup dentry, which finally frees the cgroup pointer, uses synchronize_rcu().
+ *
+ * TODO: defining private ID per hierarchy is maybe better. But it's difficult
+ * now because we cannot guarantee that all IDs are freed at kill_sb().
+ */
+static DEFINE_IDR(cgroup_idr);
+DEFINE_SPINLOCK(cgroup_idr_lock);
+
+/*
+ * To get ID other than 0, this should be called when !cgroup_is_removed().
+ */
+int cgroup_id(struct cgroup *cgrp)
+{
+ struct cgroup_id *cgid = rcu_dereference(cgrp->id);
+
+ if (cgid)
+ return cgid->id;
+ return 0;
+}
+
+int cgroup_depth(struct cgroup *cgrp)
+{
+ struct cgroup_id *cgid = rcu_dereference(cgrp->id);
+
+ if (cgid)
+ return cgid->depth;
+ return 0;
+}
+
+
+int cgroup_is_ancestor(struct cgroup *cgrp, struct cgroup *root)
+{
+ struct cgroup_id *id = cgrp->id;
+ struct cgroup_id *ans = root->id;
+
+ if (!id || !ans)
+ return 0;
+ return (id->stack[ans->depth] == ans->id);
+}
+
+
+static int __get_and_prepare_newid(struct cgroup_id **ret)
+{
+ struct cgroup_id *newid;
+ int myid, error;
+
+ newid = kzalloc(sizeof(*newid), GFP_KERNEL);
+ if (!newid)
+ return -ENOMEM;
+ /* get id */
+ if (unlikely(!idr_pre_get(&cgroup_idr, GFP_KERNEL))) {
+ error = -ENOMEM;
+ goto err_out;
+ }
+ spin_lock_irq(&cgroup_idr_lock);
+ /* Don't use 0 */
+ error = idr_get_new_above(&cgroup_idr, newid, 1, &myid);
+ spin_unlock_irq(&cgroup_idr_lock);
+
+ /* Returns error only when there are no free spaces for new ID.*/
+ if (error)
+ goto err_out;
+
+ newid->id = myid;
+ atomic_set(&newid->refcnt, 1);
+ *ret = newid;
+ return 0;
+err_out:
+ kfree(newid);
+ return error;
+
+}
+
+
+static int cgrouproot_setup_idr(struct cgroupfs_root *root)
+{
+ struct cgroup_id *newid;
+ int err = -ENOMEM;
+
+ err = __get_and_prepare_newid(&newid);
+ if (err)
+ return err;
+
+ newid->depth = 0;
+ newid->stack[0] = newid->id;
+ newid->cgroup = &root->top_cgroup;
+ root->top_cgroup.id = newid;
+ return 0;
+}
+
+static int cgroup_prepare_id(struct cgroup *parent, struct cgroup_id **id)
+{
+ struct cgroup_id *newid;
+ int error;
+
+ /* check depth */
+ if (parent->id->depth + 1 >= MAX_CGROUP_DEPTH)
+ return -ENOSPC;
+
+ error = __get_and_prepare_newid(&newid);
+ if (error)
+ return error;
+ *id = newid;
+ return 0;
+}
+
+
+static void cgroup_id_attach(struct cgroup_id *cgid,
+ struct cgroup *cg, struct cgroup *parent)
+{
+ struct cgroup_id *parent_id = parent->id; /* parent is alive */
+ int i;
+
+ cgid->depth = parent_id->depth + 1;
+ /* Inherit hierarchy code from parent */
+ for (i = 0; i < cgid->depth; i++)
+ cgid->stack[i] = parent_id->stack[i];
+ cgid->stack[cgid->depth] = cgid->id;
+ rcu_assign_pointer(cgid->cgroup, cg);
+ rcu_assign_pointer(cg->id, cgid);
+
+ return;
+}
+
+void cgroup_id_put(int id)
+{
+ struct cgroup_id *cgid;
+ unsigned long flags;
+
+ rcu_read_lock();
+ cgid = idr_find(&cgroup_idr, id);
+ BUG_ON(!cgid);
+ if (atomic_dec_and_test(&cgid->refcnt)) {
+ spin_lock_irqsave(&cgroup_idr_lock, flags);
+ idr_remove(&cgroup_idr, cgid->id);
+ spin_unlock_irqrestore(&cgroup_idr_lock, flags);
+ free_cgroupid(cgid);
+ }
+ rcu_read_unlock();
+}
+
+static void cgroup_id_detach(struct cgroup *cg)
+{
+ struct cgroup_id *id = rcu_dereference(cg->id);
+
+ rcu_assign_pointer(id->cgroup, NULL);
+ cgroup_id_put(id->id);
+ rcu_assign_pointer(cg->id, NULL);
+}
+/**
+ * cgroup_id_tryget() -- try to get refcnt of ID.
+ * @id: the ID to be kept for a while.
+ *
+ * Increment the refcnt of an ID and prevent its reuse. Useful for a subsys
+ * which remembers a cgroup by ID rather than by pointer to struct cgroup
+ * (or subsys). Returns a non-zero value on success.
+ */
+int cgroup_id_tryget(int id)
+{
+ struct cgroup_id *cgid;
+ int ret = 0;
+
+ rcu_read_lock();
+ cgid = idr_find(&cgroup_idr, id);
+ if (cgid)
+ ret = atomic_inc_not_zero(&cgid->refcnt);
+ rcu_read_unlock();
+ return ret;
+}
+
+/**
+ * cgroup_lookup - lookup cgroup by id
+ * @id: the id of cgroup to be looked up
+ *
+ * Returns pointer to cgroup if there is valid cgroup with id, NULL if not.
+ * Should be called under rcu_read_lock() or cgroup_lock.
+ * If subsys is not used, returns NULL.
+ */
+
+struct cgroup *cgroup_lookup(int id)
+{
+ struct cgroup *cgrp = NULL;
+ struct cgroup_id *cgid = NULL;
+
+ rcu_read_lock();
+ cgid = idr_find(&cgroup_idr, id);
+
+ if (unlikely(!cgid))
+ goto out;
+
+ cgrp = rcu_dereference(cgid->cgroup);
+ if (unlikely(!cgrp || cgroup_is_removed(cgrp)))
+ cgrp = NULL;
+out:
+ rcu_read_unlock();
+ return cgrp;
+}
+
+/**
+ * cgroup_get_next - lookup next cgroup under specified hierarchy.
+ * @id: current position of iteration.
+ * @rootid: search tree under this.
+ * @depth: depth of root id.
+ * @foundid: position of found object.
+ *
+ * Search for the next cgroup under the specified hierarchy. If "id" is 0,
+ * the scan starts from the root cgroup. Calling under rcu_read_lock() or
+ * cgroup_lock() is necessary (to safely access the found cgroup).
+ * If the subsys is not used, returns NULL. If used, it's guaranteed that there
+ * is a used cgroup ID (the root).
+ */
+struct cgroup *
+cgroup_get_next(int id, int rootid, int depth, int *foundid)
+{
+ struct cgroup *ret = NULL;
+ struct cgroup_id *tmp;
+ int tmpid;
+ unsigned long flags;
+
+ rcu_read_lock();
+ tmpid = id;
+ while (1) {
+ /* scan next entry from bitmap(tree) */
+ spin_lock_irqsave(&cgroup_idr_lock, flags);
+ tmp = idr_get_next(&cgroup_idr, &tmpid);
+ spin_unlock_irqrestore(&cgroup_idr_lock, flags);
+
+ if (!tmp) {
+ ret = NULL;
+ break;
+ }
+
+ if (tmp->stack[depth] == rootid) {
+ ret = rcu_dereference(tmp->cgroup);
+ /* Sanity check and check hierarchy */
+ if (ret && !cgroup_is_removed(ret))
+ break;
+ }
+ tmpid = tmpid + 1;
+ }
+
+ rcu_read_unlock();
+ *foundid = tmpid;
+ return ret;
+}
+
+/*
* A couple of forward declarations required, due to cyclic reference loop:
* cgroup_mkdir -> cgroup_create -> cgroup_populate_dir ->
* cgroup_add_file -> cgroup_create_file -> cgroup_dir_inode_operations
@@ -1024,6 +1319,13 @@ static int cgroup_get_sb(struct file_sys
mutex_unlock(&inode->i_mutex);
goto drop_new_super;
}
+ /* Setup Cgroup ID for this fs */
+ ret = cgrouproot_setup_idr(root);
+ if (ret) {
+ mutex_unlock(&cgroup_mutex);
+ mutex_unlock(&inode->i_mutex);
+ goto drop_new_super;
+ }

ret = rebind_subsystems(root, root->subsys_bits);
if (ret == -EBUSY) {
@@ -1110,9 +1412,10 @@ static void cgroup_kill_sb(struct super_

list_del(&root->root_list);
root_count--;
-
+ if (root->top_cgroup.id)
+ cgroup_id_detach(&root->top_cgroup);
mutex_unlock(&cgroup_mutex);
-
+ synchronize_rcu();
kfree(root);
kill_litter_super(sb);
}
@@ -2354,11 +2657,18 @@ static long cgroup_create(struct cgroup
int err = 0;
struct cgroup_subsys *ss;
struct super_block *sb = root->sb;
+ struct cgroup_id *cgid = NULL;

cgrp = kzalloc(sizeof(*cgrp), GFP_KERNEL);
if (!cgrp)
return -ENOMEM;

+ err = cgroup_prepare_id(parent, &cgid);
+ if (err) {
+ kfree(cgrp);
+ return err;
+ }
+
/* Grab a reference on the superblock so the hierarchy doesn't
* get deleted on unmount if there are child cgroups. This
* can be done outside cgroup_mutex, since the sb can't
@@ -2398,7 +2708,7 @@ static long cgroup_create(struct cgroup

err = cgroup_populate_dir(cgrp);
/* If err < 0, we have a half-filled directory - oh well ;) */
-
+ cgroup_id_attach(cgid, cgrp, parent);
mutex_unlock(&cgroup_mutex);
mutex_unlock(&cgrp->dentry->d_inode->i_mutex);

@@ -2502,6 +2812,8 @@ static int cgroup_rmdir(struct inode *un
return -EBUSY;
}

+ cgroup_id_detach(cgrp);
+
spin_lock(&release_list_lock);
set_bit(CGRP_REMOVED, &cgrp->flags);
clear_bit(CGRP_PRE_REMOVAL, &cgrp->flags);
Index: mmotm-2.6.28-Dec08/include/linux/idr.h
===================================================================
--- mmotm-2.6.28-Dec08.orig/include/linux/idr.h
+++ mmotm-2.6.28-Dec08/include/linux/idr.h
@@ -106,6 +106,7 @@ int idr_get_new(struct idr *idp, void *p
int idr_get_new_above(struct idr *idp, void *ptr, int starting_id, int *id);
int idr_for_each(struct idr *idp,
int (*fn)(int id, void *p, void *data), void *data);
+void *idr_get_next(struct idr *idp, int *nextid);
void *idr_replace(struct idr *idp, void *ptr, int id);
void idr_remove(struct idr *idp, int id);
void idr_remove_all(struct idr *idp);
Index: mmotm-2.6.28-Dec08/lib/idr.c
===================================================================
--- mmotm-2.6.28-Dec08.orig/lib/idr.c
+++ mmotm-2.6.28-Dec08/lib/idr.c
@@ -573,6 +573,52 @@ int idr_for_each(struct idr *idp,
EXPORT_SYMBOL(idr_for_each);

/**
+ * idr_get_next - lookup the next object for an id >= the given id.
+ * @idp: idr handle
+ * @nextidp: pointer to the lookup key; updated to the id that was found
+ *
+ * Returns a pointer to the registered object whose id is the smallest one
+ * greater than or equal to *nextidp, or NULL if there is none.
+ */
+
+void *idr_get_next(struct idr *idp, int *nextidp)
+{
+ struct idr_layer *p, *pa[MAX_LEVEL];
+ struct idr_layer **paa = &pa[0];
+ int id = *nextidp;
+ int n, max;
+
+ /* find first ent */
+ n = idp->layers * IDR_BITS;
+ max = 1 << n;
+ p = rcu_dereference(idp->top);
+ if (!p)
+ return NULL;
+
+ while (id < max) {
+ while (n > 0 && p) {
+ n -= IDR_BITS;
+ *paa++ = p;
+ p = rcu_dereference(p->ary[(id >> n) & IDR_MASK]);
+ }
+
+ if (p) {
+ *nextidp = id;
+ return p;
+ }
+
+ id += 1 << n;
+ while (n < fls(id)) {
+ n += IDR_BITS;
+ p = *--paa;
+ }
+ }
+ return NULL;
+}
+
+/**
* idr_replace - replace pointer for given id
* @idp: idr handle
* @ptr: pointer you want associated with the id

2008-12-09 11:10:25

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 4/6] Flat hierarchical reclaim by ID


From: KAMEZAWA Hiroyuki <[email protected]>

Implement hierarchy reclaim by cgroup_id.

What changes:
- Page reclaim is no longer done by a tree-walk algorithm.
- mem_cgroup->last_scanned_child is changed to be an ID, not a pointer.
- No cgroup_lock; done under RCU.
- Scanning order is just defined by ID order.
(Scan by round-robin logic; see the sketch below.)
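
Here is a minimal userspace sketch of that round-robin selection (hypothetical
fake_* names, no RCU or locking; the subtree is modelled as a sorted array of
cgroup IDs, so cgroup_get_next() becomes "first ID >= nextid"):

#include <stdio.h>

static int subtree_ids[] = { 4, 7, 9, 12 };     /* IDs under the root, sorted */
#define NR_IDS ((int)(sizeof(subtree_ids) / sizeof(subtree_ids[0])))

struct fake_root {
        int last_scanned_child;         /* last ID we reclaimed from */
        unsigned long scan_age;         /* bumped on every full round */
};

/* model of cgroup_get_next(): smallest ID >= nextid, or 0 if none is left */
static int fake_get_next(int nextid)
{
        int i;

        for (i = 0; i < NR_IDS; i++)
                if (subtree_ids[i] >= nextid)
                        return subtree_ids[i];
        return 0;
}

static int fake_select_victim(struct fake_root *root)
{
        int next;

        while (1) {
                next = fake_get_next(root->last_scanned_child + 1);
                if (next) {
                        root->last_scanned_child = next;
                        return next;
                }
                /* wrapped around: start a new round */
                root->scan_age++;
                root->last_scanned_child = 0;
        }
}

int main(void)
{
        struct fake_root root = { 0, 0 };
        int i;

        for (i = 0; i < 6; i++)
                printf("victim=%d scan_age=%lu\n",
                       fake_select_victim(&root), root.scan_age);
        return 0;
}

In the real patch, the reclaim loop stops once the subtree has been walked
about twice (it compares scan_age against the value seen when reclaim started)
or when usage drops back under the limit.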

Changelog: v3 -> v4
- adjusted to changes in base kernel.
- is_ancestor() is moved to another patch.

Changelog: v2 -> v3
- fixed use_hierarchy==0 case

Changelog: v1 -> v2
- make use of css_tryget();
- count # of loops rather than remembering position.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>


mm/memcontrol.c | 274 +++++++++++++++++++++++++++-----------------------------
1 file changed, 134 insertions(+), 140 deletions(-)

---
Index: mmotm-2.6.28-Dec08/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec08.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec08/mm/memcontrol.c
@@ -158,9 +158,10 @@ struct mem_cgroup {

/*
* While reclaiming in a hiearchy, we cache the last child we
- * reclaimed from. Protected by cgroup_lock()
+ * reclaimed from.
*/
- struct mem_cgroup *last_scanned_child;
+ int last_scanned_child;
+ unsigned long scan_age;
/*
* Should the accounting and control be hierarchical, per subtree?
*/
@@ -170,7 +171,6 @@ struct mem_cgroup {

unsigned int swappiness;

-
unsigned int inactive_ratio;

/*
@@ -550,102 +550,70 @@ unsigned long mem_cgroup_isolate_pages(u
return nr_taken;
}

-#define mem_cgroup_from_res_counter(counter, member) \
- container_of(counter, struct mem_cgroup, member)
-
-/*
- * This routine finds the DFS walk successor. This routine should be
- * called with cgroup_mutex held
- */
-static struct mem_cgroup *
-mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
+static unsigned int get_swappiness(struct mem_cgroup *memcg)
{
- struct cgroup *cgroup, *curr_cgroup, *root_cgroup;
-
- curr_cgroup = curr->css.cgroup;
- root_cgroup = root_mem->css.cgroup;
-
- if (!list_empty(&curr_cgroup->children)) {
- /*
- * Walk down to children
- */
- mem_cgroup_put(curr);
- cgroup = list_entry(curr_cgroup->children.next,
- struct cgroup, sibling);
- curr = mem_cgroup_from_cont(cgroup);
- mem_cgroup_get(curr);
- goto done;
- }
-
-visit_parent:
- if (curr_cgroup == root_cgroup) {
- mem_cgroup_put(curr);
- curr = root_mem;
- mem_cgroup_get(curr);
- goto done;
- }
+ struct cgroup *cgrp = memcg->css.cgroup;
+ unsigned int swappiness;

- /*
- * Goto next sibling
- */
- if (curr_cgroup->sibling.next != &curr_cgroup->parent->children) {
- mem_cgroup_put(curr);
- cgroup = list_entry(curr_cgroup->sibling.next, struct cgroup,
- sibling);
- curr = mem_cgroup_from_cont(cgroup);
- mem_cgroup_get(curr);
- goto done;
- }
+ /* root ? */
+ if (cgrp->parent == NULL)
+ return vm_swappiness;

- /*
- * Go up to next parent and next parent's sibling if need be
- */
- curr_cgroup = curr_cgroup->parent;
- goto visit_parent;
+ spin_lock(&memcg->reclaim_param_lock);
+ swappiness = memcg->swappiness;
+ spin_unlock(&memcg->reclaim_param_lock);

-done:
- root_mem->last_scanned_child = curr;
- return curr;
+ return swappiness;
}

-/*
- * Visit the first child (need not be the first child as per the ordering
- * of the cgroup list, since we track last_scanned_child) of @mem and use
- * that to reclaim free pages from.
- */
+#define mem_cgroup_from_res_counter(counter, member) \
+ container_of(counter, struct mem_cgroup, member)
+
static struct mem_cgroup *
-mem_cgroup_get_first_node(struct mem_cgroup *root_mem)
+mem_cgroup_select_victim(struct mem_cgroup *root_mem)
{
- struct cgroup *cgroup;
- struct mem_cgroup *ret;
- bool obsolete = memcg_is_obsolete(root_mem->last_scanned_child);
+ struct cgroup *cgroup, *root_cgroup;
+ int nextid, rootid, depth, found;
+ struct mem_cgroup *ret = NULL;

+ root_cgroup = root_mem->css.cgroup;
+ rootid = cgroup_id(root_cgroup);
+ depth = cgroup_depth(root_cgroup);

- /*
- * Scan all children under the mem_cgroup mem
- */
- cgroup_lock();
- if (list_empty(&root_mem->css.cgroup->children)) {
+ /* If no hierarchy, "root_mem" is always victim */
+ if (!root_mem->use_hierarchy) {
+ spin_lock(&root_mem->reclaim_param_lock);
+ root_mem->scan_age++;
+ spin_unlock(&root_mem->reclaim_param_lock);
+ css_get(&root_mem->css);
ret = root_mem;
- goto done;
}
+ while (!ret) {
+ rcu_read_lock();
+ /* ID:0 is unused */
+ nextid = root_mem->last_scanned_child + 1;
+ cgroup = cgroup_get_next(nextid, rootid, depth, &found);

- if (!root_mem->last_scanned_child || obsolete) {
-
- if (obsolete)
- mem_cgroup_put(root_mem->last_scanned_child);
-
- cgroup = list_first_entry(&root_mem->css.cgroup->children,
- struct cgroup, sibling);
- ret = mem_cgroup_from_cont(cgroup);
- mem_cgroup_get(ret);
- } else
- ret = mem_cgroup_get_next_node(root_mem->last_scanned_child,
- root_mem);
+ spin_lock(&root_mem->reclaim_param_lock);
+ if (cgroup)
+ root_mem->last_scanned_child = found;
+ else {
+ root_mem->scan_age++;
+ root_mem->last_scanned_child = 0;
+ }
+ spin_unlock(&root_mem->reclaim_param_lock);

-done:
- root_mem->last_scanned_child = ret;
- cgroup_unlock();
+ if (cgroup) {
+ ret = mem_cgroup_from_cont(cgroup);
+ css_get(&ret->css);
+ /* avoid blocking rmdir() */
+ if (memcg_is_obsolete(ret)) {
+ css_put(&ret->css);
+ ret = NULL;
+ }
+ }
+ rcu_read_unlock();
+ }
return ret;
}

@@ -661,22 +629,6 @@ static bool mem_cgroup_check_under_limit
return false;
}

-static unsigned int get_swappiness(struct mem_cgroup *memcg)
-{
- struct cgroup *cgrp = memcg->css.cgroup;
- unsigned int swappiness;
-
- /* root ? */
- if (cgrp->parent == NULL)
- return vm_swappiness;
-
- spin_lock(&memcg->reclaim_param_lock);
- swappiness = memcg->swappiness;
- spin_unlock(&memcg->reclaim_param_lock);
-
- return swappiness;
-}
-
/*
* Dance down the hierarchy if needed to reclaim memory. We remember the
* last child we reclaimed from, so that we don't end up penalizing
@@ -687,41 +639,26 @@ static unsigned int get_swappiness(struc
static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
gfp_t gfp_mask, bool noswap)
{
- struct mem_cgroup *next_mem;
+ struct mem_cgroup *victim;
+ unsigned long start_age;
int ret = 0;
+ int total = 0;

- /*
- * Reclaim unconditionally and don't check for return value.
- * We need to reclaim in the current group and down the tree.
- * One might think about checking for children before reclaiming,
- * but there might be left over accounting, even after children
- * have left.
- */
- ret = try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
- get_swappiness(root_mem));
- if (mem_cgroup_check_under_limit(root_mem))
- return 0;
- if (!root_mem->use_hierarchy)
- return ret;
+ start_age = root_mem->scan_age;

- next_mem = mem_cgroup_get_first_node(root_mem);
+ /* Allows 2 visits to "root_mem" */
+ while (time_after(start_age + 2UL, root_mem->scan_age)) {

- while (next_mem != root_mem) {
- if (memcg_is_obsolete(next_mem)) {
- mem_cgroup_put(next_mem);
- cgroup_lock();
- next_mem = mem_cgroup_get_first_node(root_mem);
- cgroup_unlock();
- continue;
- }
- ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
- get_swappiness(next_mem));
+ victim = mem_cgroup_select_victim(root_mem);
+ ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap,
+ get_swappiness(victim));
+ css_put(&victim->css);
if (mem_cgroup_check_under_limit(root_mem))
- return 0;
- cgroup_lock();
- next_mem = mem_cgroup_get_next_node(next_mem, root_mem);
- cgroup_unlock();
+ return 1;
+ total += ret;
}
+ ret = total;
+
return ret;
}

@@ -2149,7 +2086,8 @@ mem_cgroup_create(struct cgroup_subsys *
res_counter_init(&mem->memsw, NULL);
}
mem_cgroup_set_inactive_ratio(mem);
- mem->last_scanned_child = NULL;
+ mem->last_scanned_child = 0;
+ mem->scan_age = 0;
spin_lock_init(&mem->reclaim_param_lock);

if (parent)

2008-12-09 11:11:30

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 5/6] fix inactive_ratio under hierarchy


From: KAMEZAWA Hiroyuki <[email protected]>

After the LRU updates for memcg, the following test easily hits OOM,
and memory-reclaim speed was very bad.

mkdir /opt/cgroup/xxx
echo 1 > /opt/cgroup/xxx/memory.use_hierarchy
mkdir /opt/cgroup/xxx/01
mkdir /opt/cgroup/xxx/02
echo 40M > /opt/cgroup/xxx/memory.limit_in_bytes

Run task under group 01 or 02.

This is because the calculation of inactive_ratio doesn't handle hierarchy.
In the above, 01's and 02's inactive_ratio = 65535 and the inactive list will
be empty.

This patch tries to set 01's and 02's inactive_ratio to an appropriate value
under hierarchy. inactive_ratio is adjusted to the minimum limit found
upwards in the hierarchy.


ex)In following tree,
/opt/cgroup/01 limit=1G
/opt/cgroup/01/A limit=500M
/opt/cgroup/01/A/B limit=unlimited
/opt/cgroup/01/A/C limit=50M
/opt/cgroup/01/Z limit=700M


/opt/cgroup/01's inactive_ratio is calculated by limit of 1G.
/opt/cgroup/01/A's inactive_ratio is calculated by limit of 500M
/opt/cgroup/01/A/B's inactive_ratio is calculated by limit of 500M.
/opt/cgroup/01/A/C's inactive_ratio is calculated by limit of 50M.
/opt/cgroup/01/Z's inactive_ratio is calculated by limit of 700M.
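
Here is a minimal userspace sketch of that calculation (hypothetical
fake_*/isqrt names; it mirrors the patch's int_sqrt(10 * gb) formula and the
idea that the ratio is derived from the smallest limit found while walking up
the hierarchy):

#include <stdio.h>

/* crude integer square root, standing in for the kernel's int_sqrt() */
static unsigned int isqrt(unsigned long long x)
{
        unsigned long long r = 0;

        while ((r + 1) * (r + 1) <= x)
                r++;
        return (unsigned int)r;
}

static unsigned int inactive_ratio(unsigned long long limit_bytes)
{
        unsigned long long gb = limit_bytes >> 30;

        return gb ? isqrt(10 * gb) : 1;
}

struct fake_memcg {
        unsigned long long limit;       /* bytes; ~0ULL means "unlimited" */
        struct fake_memcg *parent;      /* NULL at the hierarchy root */
};

/* minimum limit on the path from memcg up to the hierarchy root */
static unsigned long long min_limit_upwards(struct fake_memcg *memcg)
{
        unsigned long long min = memcg->limit;

        for (memcg = memcg->parent; memcg; memcg = memcg->parent)
                if (memcg->limit < min)
                        min = memcg->limit;
        return min;
}

int main(void)
{
        struct fake_memcg root = { 1ULL << 30, NULL };      /* 01:     1G        */
        struct fake_memcg a    = { 500ULL << 20, &root };   /* 01/A:   500M      */
        struct fake_memcg b    = { ~0ULL, &a };             /* 01/A/B: unlimited */

        /* B inherits A's 500M limit, so its ratio comes out as 1 */
        printf("01/A/B inactive_ratio = %u\n",
               inactive_ratio(min_limit_upwards(&b)));
        return 0;
}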


Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 64 insertions(+), 7 deletions(-)

---
Index: mmotm-2.6.28-Dec08/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec08.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec08/mm/memcontrol.c
@@ -1382,20 +1382,73 @@ int mem_cgroup_shrink_usage(struct mm_st
* page_alloc.c::setup_per_zone_inactive_ratio().
* it describe more detail.
*/
-static void mem_cgroup_set_inactive_ratio(struct mem_cgroup *memcg)
+static int __mem_cgroup_inactive_ratio(unsigned long long gb)
{
- unsigned int gb, ratio;
+ unsigned int ratio;

- gb = res_counter_read_u64(&memcg->res, RES_LIMIT) >> 30;
+ gb = gb >> 30;
if (gb)
ratio = int_sqrt(10 * gb);
else
ratio = 1;

- memcg->inactive_ratio = ratio;
+ return ratio;
+}
+
+
+static void mem_cgroup_update_inactive_ratio(struct mem_cgroup *memcg)
+{
+ struct cgroup *cur;
+ struct mem_cgroup *root_memcg, *tmp;
+ unsigned long long min_limit, limit;
+ int depth, nextid, rootid, found, ratio;
+
+ if (!memcg->use_hierarchy) {
+ limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
+ memcg->inactive_ratio = __mem_cgroup_inactive_ratio(limit);
+ return;
+ }

+ cur = memcg->css.cgroup;
+ min_limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
+
+ /* go up to root cgroup and find min limit.*/
+ while (cur->parent != NULL) {
+ tmp = mem_cgroup_from_cont(cur);
+ if (!tmp->use_hierarchy)
+ break;
+ limit = res_counter_read_u64(&tmp->res, RES_LIMIT);
+ if (limit < min_limit)
+ min_limit = limit;
+ cur = cur->parent;
+ }
+ /* new inactive ratio for this hierarchy */
+ ratio = __mem_cgroup_inactive_ratio(min_limit);
+
+ /*
+ * update inactive ratio under this.
+ * all children's inactive_ratio will be updated.
+ */
+ cur = memcg->css.cgroup;
+ rootid = cgroup_id(cur);
+ depth = cgroup_depth(cur);
+ nextid = 0;
+ rcu_read_lock();
+ while (1) {
+ cur = cgroup_get_next(nextid, rootid, depth, &found);
+ if (!cur)
+ break;
+ if (!cgroup_is_removed(cur)) {
+ tmp = mem_cgroup_from_cont(cur);
+ tmp->inactive_ratio = ratio;
+ }
+ nextid = found + 1;
+ }
+ rcu_read_unlock();
}

+
+
static DEFINE_MUTEX(set_limit_mutex);

static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
@@ -1435,8 +1488,11 @@ static int mem_cgroup_resize_limit(struc
if (!progress) retry_count--;
}

- if (!ret)
- mem_cgroup_set_inactive_ratio(memcg);
+ if (!ret) {
+ mutex_lock(&set_limit_mutex);
+ mem_cgroup_update_inactive_ratio(memcg);
+ mutex_unlock(&set_limit_mutex);
+ }

return ret;
}
@@ -2081,11 +2137,12 @@ mem_cgroup_create(struct cgroup_subsys *
if (parent && parent->use_hierarchy) {
res_counter_init(&mem->res, &parent->res);
res_counter_init(&mem->memsw, &parent->memsw);
+ /* min_limit under hierarchy is unchanged.*/
+ mem->inactive_ratio = parent->inactive_ratio;
} else {
res_counter_init(&mem->res, NULL);
res_counter_init(&mem->memsw, NULL);
}
- mem_cgroup_set_inactive_ratio(mem);
mem->last_scanned_child = 0;
mem->scan_age = 0;
spin_lock_init(&mem->reclaim_param_lock);

2008-12-09 11:13:30

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 6/6] fix oom under hierarchy

From: KAMEZAWA Hiroyuki <[email protected]>

Currently, memcg code other than memcontrol.c cannot handle hierarchy
in a correct way.
In the following,
/opt/cgroup/01 limit=1G
/opt/cgroup/01/A limit = unlimited
/opt/cgroup/01/B limit = unlimited

the OOM killer searches tasks under '01' and cannot find the bad process
in 01/A or 01/B.

After this patch:
- The OOM killer checks the hierarchy and can kill a bad process anywhere
under the hierarchy (see the sketch below).
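
A minimal userspace sketch of the membership test this patch introduces
(hypothetical fake_* names, using a simple parent-pointer walk instead of the
ID-based cgroup_is_ancestor()): without use_hierarchy a task belongs to a
memcg only if its cgroup is exactly that memcg's cgroup; with use_hierarchy
any descendant counts, so tasks in 01/A and 01/B are found when 01 hits its
limit.

#include <stdbool.h>
#include <stdio.h>

struct fake_cgroup {
        int id;
        struct fake_cgroup *parent;
};

/* true if "root" is an ancestor of (or equal to) "cg" */
static bool fake_is_ancestor(struct fake_cgroup *cg, struct fake_cgroup *root)
{
        for (; cg; cg = cg->parent)
                if (cg == root)
                        return true;
        return false;
}

static bool fake_task_in_memcg(struct fake_cgroup *task_cg,
                               struct fake_cgroup *memcg_cg, bool use_hierarchy)
{
        return use_hierarchy ? fake_is_ancestor(task_cg, memcg_cg)
                             : task_cg == memcg_cg;
}

int main(void)
{
        struct fake_cgroup top = { 1, NULL };   /* /opt/cgroup/01   */
        struct fake_cgroup a   = { 2, &top };   /* /opt/cgroup/01/A */

        /* a task running in 01/A is found when 01 hits its limit */
        printf("%d\n", fake_task_in_memcg(&a, &top, true));    /* 1 */
        printf("%d\n", fake_task_in_memcg(&a, &top, false));   /* 0 */
        return 0;
}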

Changelog:
- adjusted to the base kernel.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

---
include/linux/memcontrol.h | 6 +++-
mm/memcontrol.c | 57 ++++++++++++++++++++++++++++++++++++++++++---
mm/oom_kill.c | 2 -
3 files changed, 59 insertions(+), 6 deletions(-)

Index: mmotm-2.6.28-Dec08/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec08.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec08/mm/memcontrol.c
@@ -403,12 +403,36 @@ void mem_cgroup_move_lists(struct page *
mem_cgroup_add_lru_list(page, to);
}

-int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
+static int mm_under_cgroup(struct mm_struct *mm, struct mem_cgroup *mcg)
+{
+ struct cgroup *cg, *check;
+ int ret = 0;
+
+ if (!mm)
+ return 0;
+
+ rcu_read_lock();
+
+ cg = task_cgroup(rcu_dereference(mm->owner), mem_cgroup_subsys_id);
+ check = mcg->css.cgroup;
+
+ if (!mcg->use_hierarchy)
+ ret = (cg == check);
+ else
+ ret = cgroup_is_ancestor(cg, check);
+ rcu_read_unlock();
+
+ return ret;
+}
+
+
+int task_in_mem_cgroup(struct task_struct *task,
+ struct mem_cgroup *mem)
{
int ret;

task_lock(task);
- ret = task->mm && mm_match_cgroup(task->mm, mem);
+ ret = mm_under_cgroup(task->mm, mem);
task_unlock(task);
return ret;
}
@@ -662,6 +686,33 @@ static int mem_cgroup_hierarchical_recla
return ret;
}

+void update_oom_jiffies(struct mem_cgroup *mem)
+{
+ struct cgroup *cgroup, *rootcg;
+ struct mem_cgroup *tmp;
+ int rootid, nextid, depth, found;
+
+ if (!mem->use_hierarchy) {
+ mem->last_oom_jiffies = jiffies;
+ return;
+ }
+ rootcg = mem->css.cgroup;
+ rootid = cgroup_id(rootcg);
+ depth = cgroup_depth(rootcg);
+ nextid = 0;
+ rcu_read_lock();
+ while (1) {
+ cgroup = cgroup_get_next(nextid, rootid, depth, &found);
+
+ if (!cgroup)
+ break;
+ tmp = mem_cgroup_from_cont(cgroup);
+ tmp->last_oom_jiffies = jiffies;
+ nextid = found + 1;
+ }
+ rcu_read_unlock();
+}
+
bool mem_cgroup_oom_called(struct task_struct *task)
{
bool ret = false;
@@ -764,7 +815,7 @@ static int __mem_cgroup_try_charge(struc
mutex_lock(&memcg_tasklist);
mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
mutex_unlock(&memcg_tasklist);
- mem_over_limit->last_oom_jiffies = jiffies;
+ update_oom_jiffies(mem_over_limit);
}
goto nomem;
}
Index: mmotm-2.6.28-Dec08/mm/oom_kill.c
===================================================================
--- mmotm-2.6.28-Dec08.orig/mm/oom_kill.c
+++ mmotm-2.6.28-Dec08/mm/oom_kill.c
@@ -279,7 +279,7 @@ static struct task_struct *select_bad_pr
*
* Call with tasklist_lock read-locked.
*/
-static void dump_tasks(const struct mem_cgroup *mem)
+static void dump_tasks(struct mem_cgroup *mem)
{
struct task_struct *g, *p;

Index: mmotm-2.6.28-Dec08/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-Dec08.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-Dec08/include/linux/memcontrol.h
@@ -65,7 +65,9 @@ extern unsigned long mem_cgroup_isolate_
struct mem_cgroup *mem_cont,
int active, int file);
extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
-int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
+
+int task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *mem);
+

extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);

@@ -191,7 +193,7 @@ static inline int mm_match_cgroup(struct
}

static inline int task_in_mem_cgroup(struct task_struct *task,
- const struct mem_cgroup *mem)
+ struct mem_cgroup *mem)
{
return 1;
}

2008-12-09 12:32:49

by Balbir Singh

Subject: Re: [RFC][PATCH 4/6] Flat hierarchical reclaim by ID

* KAMEZAWA Hiroyuki <[email protected]> [2008-12-09 20:09:15]:

>
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Implement hierarchy reclaim by cgroup_id.
>
> What changes:
> - Page reclaim is not done by tree-walk algorithm
> - mem_cgroup->last_scanned_child is changed to be an ID, not a pointer.
> - no cgroup_lock, done under RCU.
> - scanning order is just defined by ID's order.
> (Scan by round-robin logic.)
>
> Changelog: v3 -> v4
> - adjusted to changes in base kernel.
> - is_ancestor() is moved to another patch.
>
> Changelog: v2 -> v3
> - fixed use_hierarchy==0 case
>
> Changelog: v1 -> v2
> - make use of css_tryget();
> - count # of loops rather than remembering position.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

I have not yet run the patch, but the heuristics seem a lot like
magic. I am not against scanning by order, but is order the right way
to scan groups? Does this order reflect their position in the
hierarchy? Shouldn't IDs belong to cgroups instead of just the memory
controller? I would push the IDs back to the cgroups infrastructure.

--
Balbir

2008-12-09 14:30:26

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH 4/6] Flat hierarchical reclaim by ID

Balbir Singh said:
> * KAMEZAWA Hiroyuki <[email protected]> [2008-12-09
> 20:09:15]:
>
>>
>> From: KAMEZAWA Hiroyuki <[email protected]>
>>
>> Implement hierarchy reclaim by cgroup_id.
>>
>> What changes:
>> - Page reclaim is not done by tree-walk algorithm
>> - mem_cgroup->last_scanned_child is changed to be an ID, not a pointer.
>> - no cgroup_lock, done under RCU.
>> - scanning order is just defined by ID's order.
>> (Scan by round-robin logic.)
>>
>> Changelog: v3 -> v4
>> - adjusted to changes in base kernel.
>> - is_ancestor() is moved to another patch.
>>
>> Changelog: v2 -> v3
>> - fixed use_hierarchy==0 case
>>
>> Changelog: v1 -> v2
>> - make use of css_tryget();
>> - count # of loops rather than remembering position.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> I have not yet run the patch, but the heuristics seem a lot like
> magic. I am not against scanning by order, but is order the right way
> to scan groups?
My consideration is:
- Both your current implementation and this round-robin are just
examples. I don't think a search algorithm determined by the
shape of the tree is the best way.
- No one knows what order is best, yet. We have to find it.
- The best order will be determined by some kind of calculation rather
than the shape of the tree, and must pass tons of tests.
This needs a large amount of time and patient work. VM management is not
an easy thing.
I think your soft-limit idea can be easily merged onto this patch set.

> Does this order reflect their position in the hierarchy?
No, it just scans IDs from the last scanned one in round-robin.
BTW, can you show an algorithm that works well in the following case?
ex)
groupA/ limit=1G usage=300M
01/ limit=600M usage=600M
02/ limit=700M usage=70M
03/ limit=100M usage=30M
Which one should be shrunk first, and why?
1) when group_A hits its limit.
2) when group_A/01 hits its limit.
3) when group_A/02 hits its limit.
I can't, right now.

This patch itself uses round-robin and has no special order.
I think implementing a good algorithm under this needs some amount of time.

> Shouldn't IDs belong to cgroups instead of just the memory controller?
If Paul rejects it, I'll move this to memcg. But the bio-cgroup people also
use IDs and, this summer, I posted a swap-cgroup-ID patch and was asked to
implement IDs under cgroup rather than the subsys. (Asked by Paul or you.)

From the implementation point of view, hierarchy code management etc. should
go into cgroup.c; it gives us a clearer view than implementing it under memcg.

-Kame
> I would push back ids to cgroups infrastructure.
>

2008-12-09 15:47:44

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/6] Flat hierarchical reclaim by ID

* KAMEZAWA Hiroyuki <[email protected]> [2008-12-09 23:28:32]:

> Balbir Singh said:
> > * KAMEZAWA Hiroyuki <[email protected]> [2008-12-09
> > 20:09:15]:
> >
> >>
> >> From: KAMEZAWA Hiroyuki <[email protected]>
> >>
> >> Implement hierarchy reclaim by cgroup_id.
> >>
> >> What changes:
> >> - Page reclaim is not done by tree-walk algorithm
> >> - mem_cgroup->last_schan_child is changed to be ID, not pointer.
> >> - no cgroup_lock, done under RCU.
> >> - scanning order is just defined by ID's order.
> >> (Scan by round-robin logic.)
> >>
> >> Changelog: v3 -> v4
> >> - adjusted to changes in base kernel.
> >> - is_acnestor() is moved to other patch.
> >>
> >> Changelog: v2 -> v3
> >> - fixed use_hierarchy==0 case
> >>
> >> Changelog: v1 -> v2
> >> - make use of css_tryget();
> >> - count # of loops rather than remembering position.
> >>
> >> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> >
> > I have not yet run the patch, but the heuristics seem a lot like
> > magic. I am not against scanning by order, but is order the right way
> > to scan groups?
> My consideration is
> - Both of current your implementation and this round robin is just
> an example. I never think some kind of search algorighm detemined by
> shape of tree is the best way.
> - No one knows what order is the best, now. We have to find it.
> - The best order will be determined by some kind of calculation rather
> than shape of tree and must pass by tons of tests.

Yes, the shape of the tree just limits where to reclaim from.

> This needs much amount of time and patient work. VM management is not
> so easy thing.
> I think your soft-limit idea can be easily merged onto this patch set.
>

Yes, potentially. With soft limits, the general expectation is this:

Let us say you have group A and B

groupA, soft limit = 1G
groupB, soft limit = 2G

Now assume the system has 4G. When groupB is not using its memory,
group A can grab all 4G, but when groupB kicks in and tries to use 2G
or more, then the expectation is that

group A will get 1/3 * 4 = 4/3G
group B will get 2/3 * 4 = 8/3G

Similar to CPU shares currently.
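
As a quick back-of-the-envelope check, the split above can be computed like
this (a user-space sketch only, nothing from the patches; the numbers just
encode the groupA/groupB example):

#include <stdio.h>

int main(void)
{
        /* soft limits from the example above, in MB */
        unsigned long soft_limit[2] = { 1024, 2048 };   /* groupA, groupB */
        unsigned long total = 4096;                     /* 4G under pressure */
        unsigned long sum = soft_limit[0] + soft_limit[1];
        int i;

        for (i = 0; i < 2; i++)
                printf("group%c expected share: %lu MB\n",
                       'A' + i, total * soft_limit[i] / sum);
        /* prints ~1365 MB (4/3G) for groupA and ~2730 MB (8/3G) for groupB */
        return 0;
}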

> > Does this order reflect their position in the hierarchy?
> No. just scan IDs from last scannned one in RR.
> BTW, can you show what an algorithm works well in following case ?
> ex)
> groupA/ limit=1G usage=300M
> 01/ limit=600M usage=600M
> 02/ limit=700M usage=70M
> 03/ limit=100M usage=30M
> Which one should be shrinked at first and why ?
> 1) when group_A hit limits.

With tree reclaim, reclaim will first reclaim from A and stop if
successful, otherwise it will go to 01, 02 and 03 and then go back to
A.

> 2) when group_A/01 hit limits.

This will reclaim only from 01, since A is under its limit

> 3) when group_A/02 hit limits.

This will reclaim only from 02 since A is under its limit

Does RR do the same right now?

> I can't now.
>
> This patch itself uses round-robin and have no special order.
> I think implenting good algorithm under this needs some amount of time.
>

I agree that fine-tuning it will require time, but what we need is
something usable that will not have hard-to-debug or hard-to-understand
corner cases.

> > Shouldn't id's belong to cgroups instead of just memory controller?
> If Paul rejects, I'll move this to memcg. But bio-cgroup people also use
> ID and, in this summer, I posted swap-cgroup-ID patch and asked to
> implement IDs under cgroup rather than subsys. (asked by Paul or you.)
>

We should talk to Paul and convince him.

> >From implementation, hierarchy code management at el. should go into
> cgroup.c and it gives us clear view rather than implemented under memcg.
>

cgroup has hierarchy management already, in the form of children and
sibling. Walking those structures is up to us, that is all we do
currently :)
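
For reference, the walk meant here is roughly the following (a sketch only,
not code from these patches, and it still assumes the cgroup_mutex rules this
thread is debating):

#include <linux/cgroup.h>
#include <linux/list.h>

/* Visit the direct children of one cgroup via the children/sibling lists. */
static void visit_children(struct cgroup *parent)
{
        struct cgroup *child;

        /* the children list is only stable while cgroup_mutex is held */
        list_for_each_entry(child, &parent->children, sibling) {
                /* per-child work (e.g. picking a reclaim victim) goes here */
        }
}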

> -Kame
> > I would push back ids to cgroups infrastructure.
> >
>
>
>

--
Balbir

2008-12-09 16:34:26

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/6] Flat hierarchical reclaim by ID

Balbir Singh said:

>> I think your soft-limit idea can be easily merged onto this patch
>> set.
>>
>
> Yes, potentially. With soft limit, the general expectation is this
>
> Let us say you have group A and B
>
> groupA, soft limit = 1G
> groupB, soft limit = 2G
>
> Now assume the system has 4G. When groupB is not using its memory,
> group A can grab all 4G, but when groupB kicks in and tries to use 2G
> or more, then the expectation is that
>
> group A will get 1/3 * 4 = 4/3G
> group B will get 2/3 * 4 = 8/3G
>
> Similar to CPU shares currently.
>
I like that idea because it's easy to understand.

>> > Does this order reflect their position in the hierarchy?
>> No. just scan IDs from last scannned one in RR.
>> BTW, can you show what an algorithm works well in following case ?
>> ex)
>> groupA/ limit=1G usage=300M
>> 01/ limit=600M usage=600M
>> 02/ limit=700M usage=70M
>> 03/ limit=100M usage=30M
>> Which one should be shrinked at first and why ?
>> 1) when group_A hit limits.
>
> With tree reclaim, reclaim will first reclaim from A and stop if
> successful, otherwise it will go to 01, 02 and 03 and then go back to
> A.
>
Sorry for my poor example

>> 2) when group_A/01 hit limits.
>
> This will reclaim only from 01, since A is under its limit
>
I should have asked:
2') when a task in group_A/01 hits the limit of group_A

ex)
group_A/  limit=1G        usage~0
  /01     limit=unlimited usage=800M
  /02     limit=unlimited usage=200M
(what limits are allowed for children is another problem to be fixed...)
When a task in 01 hits the limit of group_A, or
when a task in 02 hits the limit of group_A,
where should we start from? (it is unknown)
Currently, this patch uses RR (A->01->02->A->...),
and soft limits or some other good algorithm will give us a better answer.

>> 3) when group_A/02 hit limits.
>
> This will reclaim only from 02 since A is under its limit
>
> Does RR do the same right now?
>
I think so.

Assume
group_A/
  /01
  /02
RR does:
1) when a task under A/01/02 hits the limit at A, shrink A, 01 and 02.
2) when a task under 01 hits the limit at 01, shrink only 01.
3) when a task under 02 hits the limit at 02, shrink only 02.

In case 1), the starting point of shrinking is saved as last_scanned_child.
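
Schematically, the RR above looks something like the sketch below. This is
only an illustration: mem_cgroup_lookup_by_id(), mem_cgroup_is_under(),
memcg_shrink_some(), MAX_CGROUP_ID and MEMCG_SCAN_LIMIT are placeholder names,
not the actual code of patch 3/6 or 4/6. css_tryget() is used as in the v2
changelog, and only the last scanned ID is remembered.

#define MEMCG_SCAN_LIMIT        32      /* arbitrary bound for this example */

static int reclaim_by_id_rr(struct mem_cgroup *root, gfp_t gfp_mask)
{
        int id = root->last_scanned_child;      /* an ID, not a pointer */
        int loop, progress = 0;

        for (loop = 0; loop < MEMCG_SCAN_LIMIT; loop++) {
                struct mem_cgroup *victim;

                id = id % MAX_CGROUP_ID + 1;    /* round-robin over IDs */

                rcu_read_lock();
                victim = mem_cgroup_lookup_by_id(id);
                /* skip unused IDs, groups outside root's tree, dying groups */
                if (!victim || !mem_cgroup_is_under(victim, root) ||
                    !css_tryget(&victim->css)) {
                        rcu_read_unlock();
                        continue;
                }
                rcu_read_unlock();

                progress = memcg_shrink_some(victim, gfp_mask);
                css_put(&victim->css);

                root->last_scanned_child = id;  /* resume point next time */
                if (progress)
                        break;
        }
        return progress;
}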


>> I can't now.
>>
>> This patch itself uses round-robin and have no special order.
>> I think implenting good algorithm under this needs some amount of
>> time.
>>
>
> I agree that fine tuning it will require time, but what we need is
> something usable that will not have hard to debug or understand corner
> cases.

Yes, and we have that now. My point is that "cgroup_lock()" has caused many
problems and, I am convinced, will cause new ones in the future.

And please see 5/6 and 6/6: we need hierarchy consideration in other
places. I think there is more code which should take care of the hierarchy.


> > Shouldn't id's belong to cgroups instead of just memory controller?
>> If Paul rejects, I'll move this to memcg. But bio-cgroup people also use
>> ID and, in this summer, I posted swap-cgroup-ID patch and asked to
>> implement IDs under cgroup rather than subsys. (asked by Paul or you.)
>>
>
> We should talk to Paul and convince him.
>
yes.

>> >From implementation, hierarchy code management at el. should go into
>> cgroup.c and it gives us clear view rather than implemented under memcg.
>>
>
> cgroup has hierarchy management already, in the form of children and
> sibling. Walking those structures is up to us, that is all we do
> currently :)
>
Yes, but that needs cgroup_lock(), and you have to hold a refcount on a
pointer just for remembering it.

This patch doesn't change anything other than removing cgroup_lock() and
removing the refcount kept to remember the start point.

Thanks,
-Kame

2008-12-10 00:28:41

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: Documentation for internal implementation

Paul, Balbir

I have a question.

Why is cgroup's documentation directory divided into 2 places?

Documentation/cgroups
             /controllers

If there is no strong demand, I'd like to remove the "controllers" directory
and move its contents under "cgroups". Some people complain to me that finding
the documentation for memcg is not easy.

On Tue, 9 Dec 2008 20:04:13 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

>
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Update/Fix document about implementation details of memcg.
>
> Changelog:
> - applied Randy Dunlap's comments.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> Documentation/controllers/memcg_test.txt | 240 ++++++++++++++++++++++++++-----
> mm/memcontrol.c | 4
> 2 files changed, 207 insertions(+), 37 deletions(-)
>
> Index: mmotm-2.6.28-Dec03/Documentation/controllers/memcg_test.txt
> ===================================================================
> --- mmotm-2.6.28-Dec03.orig/Documentation/controllers/memcg_test.txt
> +++ mmotm-2.6.28-Dec03/Documentation/controllers/memcg_test.txt
> @@ -1,59 +1,76 @@
> Memory Resource Controller(Memcg) Implementation Memo.
> -Last Updated: 2009/12/03
> +Last Updated: 2008/12/04
> +Base Kernel Version: based on 2.6.28-rc7-mm.
>
> Because VM is getting complex (one of reasons is memcg...), memcg's behavior
> -is complex. This is a document for memcg's internal behavior and some test
> -patterns tend to be racy.
> +is complex. This is a document for memcg's internal behavior.
> +Please note that implementation details can be changed.
>
> -1. charges
> +(*) Topics on API should be in Documentation/controllers/memory.txt)
> +
> +0. How to record usage ?
> + 2 objects are used.
> +
> + page_cgroup ....an object per page.
> + Allocated at boot or memory hotplug. Freed at memory hot removal.
> +
> + swap_cgroup ... an entry per swp_entry.
> + Allocated at swapon(). Freed at swapoff().
> +
> + The page_cgroup has USED bit and double count against a page_cgroup never
> + occurs. swap_cgroup is used only when a charged page is swapped-out.
> +
> +1. Charge
>
> a page/swp_entry may be charged (usage += PAGE_SIZE) at
>
> - mem_cgroup_newpage_newpage()
> - called at new page fault and COW.
> + mem_cgroup_newpage_charge()
> + Called at new page fault and Copy-On-Write.
>
> mem_cgroup_try_charge_swapin()
> - called at do_swap_page() and swapoff.
> - followed by charge-commit-cancel protocol.
> - (With swap accounting) at commit, charges recorded in swap is removed.
> + Called at do_swap_page() (page fault on swap entry) and swapoff.
> + Followed by charge-commit-cancel protocol. (With swap accounting)
> + At commit, a charge recorded in swap_cgroup is removed.
>
> mem_cgroup_cache_charge()
> - called at add_to_page_cache()
> + Called at add_to_page_cache()
>
> - mem_cgroup_cache_charge_swapin)()
> - called by shmem's swapin processing.
> + mem_cgroup_cache_charge_swapin()
> + Called at shmem's swapin.
>
> mem_cgroup_prepare_migration()
> - called before migration. "extra" charge is done
> - followed by charge-commit-cancel protocol.
> + Called before migration. "extra" charge is done and followed by
> + charge-commit-cancel protocol.
> At commit, charge against oldpage or newpage will be committed.
>
> -2. uncharge
> +2. Uncharge
> a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
>
> mem_cgroup_uncharge_page()
> - called when an anonymous page is unmapped. If the page is SwapCache
> - uncharge is delayed until mem_cgroup_uncharge_swapcache().
> + Called when an anonymous page is fully unmapped. I.e., mapcount goes
> + to 0. If the page is SwapCache, uncharge is delayed until
> + mem_cgroup_uncharge_swapcache().
>
> mem_cgroup_uncharge_cache_page()
> - called when a page-cache is deleted from radix-tree. If the page is
> - SwapCache, uncharge is delayed until mem_cgroup_uncharge_swapcache()
> + Called when a page-cache is deleted from radix-tree. If the page is
> + SwapCache, uncharge is delayed until mem_cgroup_uncharge_swapcache().
>
> mem_cgroup_uncharge_swapcache()
> - called when SwapCache is removed from radix-tree. the charge itself
> + Called when SwapCache is removed from radix-tree. The charge itself
> is moved to swap_cgroup. (If mem+swap controller is disabled, no
> - charge to swap.)
> + charge to swap occurs.)
>
> mem_cgroup_uncharge_swap()
> - called when swp_entry's refcnt goes down to be 0. charge against swap
> + Called when swp_entry's refcnt goes down to 0. A charge against swap
> disappears.
>
> mem_cgroup_end_migration(old, new)
> - at success of migration -> old is uncharged (if necessary), charge
> - to new is committed. at failure, charge to old is committed.
> + At success of migration old is uncharged (if necessary), a charge
> + to new page is committed. At failure, charge to old page is committed.
>
> 3. charge-commit-cancel
> - In some case, we can't know this "charge" is valid or not at charge.
> + In some case, we can't know this "charge" is valid or not at charging
> + (because of races).
> To handle such case, there are charge-commit-cancel functions.
> mem_cgroup_try_charge_XXX
> mem_cgroup_commit_charge_XXX
> @@ -68,24 +85,164 @@ patterns tend to be racy.
>
> At cancel(), simply usage -= PAGE_SIZE.
>
> -4. Typical Tests.
> +Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
> +
> +4. Anonymous
> + Anonymous page is newly allocated at
> + - page fault into MAP_ANONYMOUS mapping.
> + - Copy-On-Write.
> + It is charged right after it's allocated before doing any page table
> + related operations. Of course, it's uncharged when another page is used
> + for the fault address.
> +
> + At freeing anonymous page (by exit() or munmap()), zap_pte() is called
> + and pages for ptes are freed one by one.(see mm/memory.c). Uncharges
> + are done at page_remove_rmap() when page_mapcount() goes down to 0.
> +
> + Another page freeing is by page-reclaim (vmscan.c) and anonymous
> + pages are swapped out. In this case, the page is marked as
> + PageSwapCache(). uncharge() routine doesn't uncharge the page marked
> + as SwapCache(). It's delayed until __delete_from_swap_cache().
> +
> + 4.1 Swap-in.
> + At swap-in, the page is taken from swap-cache. There are 2 cases.
> +
> + (a) If the SwapCache is newly allocated and read, it has no charges.
> + (b) If the SwapCache has been mapped by processes, it has been
> + charged already.
> +
> + In case (a), we charge it. In case (b), we don't charge it.
> + (But racy state between (a) and (b) exists. We do check it.)
> + At charging, a charge recorded in swap_cgroup is moved to page_cgroup.
> +
> + 4.2 Swap-out.
> + At swap-out, typical state transition is below.
> +
> + (a) add to swap cache. (marked as SwapCache)
> + swp_entry's refcnt += 1.
> + (b) fully unmapped.
> + swp_entry's refcnt += # of ptes.
> + (c) write back to swap.
> + (d) delete from swap cache. (remove from SwapCache)
> + swp_entry's refcnt -= 1.
> +
> +
> + At (b), the page is marked as SwapCache and not uncharged.
> + At (d), the page is removed from SwapCache and a charge in page_cgroup
> + is moved to swap_cgroup.
> +
> + Finally, at task exit,
> + (e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
> + Here, a charge in swap_cgroup disappears.
> +
> +5. Page Cache
> + Page Cache is charged at
> + - add_to_page_cache_locked().
> +
> + uncharged at
> + - __remove_from_page_cache().
> +
> + The logic is very clear. (About migration, see below)
> + Note: __remove_from_page_cache() is called by remove_from_page_cache()
> + and __remove_mapping().
> +
> +6. Shmem(tmpfs) Page Cache
> + Memcg's charge/uncharge have special handlers of shmem. The best way
> + to understand shmem's page state transition is to read mm/shmem.c.
> + But brief explanation of the behavior of memcg around shmem will be
> + helpful to understand the logic.
> +
> + Shmem's page (just leaf page, not direct/indirect block) can be on
> + - radix-tree of shmem's inode.
> + - SwapCache.
> + - Both on radix-tree and SwapCache. This happens at swap-in
> + and swap-out,
> +
> + It's charged when...
> + - A new page is added to shmem's radix-tree.
> + - A swp page is read. (move a charge from swap_cgroup to page_cgroup)
> + It's uncharged when
> + - A page is removed from radix-tree and not SwapCache.
> + - When SwapCache is removed, a charge is moved to swap_cgroup.
> + - When swp_entry's refcnt goes down to 0, a charge in swap_cgroup
> + disappears.
>
> - Tests for racy cases.
> +7. Page Migration
> + One of the most complicated functions is page-migration-handler.
> + Memcg has 2 routines. Assume that we are migrating a page's contents
> + from OLDPAGE to NEWPAGE.
> +
> + Usual migration logic is..
> + (a) remove the page from LRU.
> + (b) allocate NEWPAGE (migration target)
> + (c) lock by lock_page().
> + (d) unmap all mappings.
> + (e-1) If necessary, replace entry in radix-tree.
> + (e-2) move contents of a page.
> + (f) map all mappings again.
> + (g) pushback the page to LRU.
> + (-) OLDPAGE will be freed.
> +
> + Before (g), memcg should complete all necessary charge/uncharge to
> + NEWPAGE/OLDPAGE.
> +
> + The point is....
> + - If OLDPAGE is anonymous, all charges will be dropped at (d) because
> + try_to_unmap() drops all mapcount and the page will not be
> + SwapCache.
> +
> + - If OLDPAGE is SwapCache, charges will be kept at (g) because
> + __delete_from_swap_cache() isn't called at (e-1)
> +
> + - If OLDPAGE is page-cache, charges will be kept at (g) because
> + remove_from_swap_cache() isn't called at (e-1)
> +
> + memcg provides following hooks.
> +
> + - mem_cgroup_prepare_migration(OLDPAGE)
> + Called after (b) to account a charge (usage += PAGE_SIZE) against
> + memcg which OLDPAGE belongs to.
> +
> + - mem_cgroup_end_migration(OLDPAGE, NEWPAGE)
> + Called after (f) before (g).
> + If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already
> + charged, a charge by prepare_migration() is automatically canceled.
> + If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE.
> +
> + But zap_pte() (by exit or munmap) can be called while migration,
> + we have to check if OLDPAGE/NEWPAGE is a valid page after commit().
> +
> +8. LRU
> + Each memcg has its own private LRU. Now, it's handling is under global
> + VM's control (means that it's handled under global zone->lru_lock).
> + Almost all routines around memcg's LRU is called by global LRU's
> + list management functions under zone->lru_lock().
> +
> + A special function is mem_cgroup_isolate_pages(). This scans
> + memcg's private LRU and call __isolate_lru_page() to extract a page
> + from LRU.
> + (By __isolate_lru_page(), the page is removed from both of global and
> + private LRU.)
>
> - 4.1 small limit to memcg.
> +
> +9. Typical Tests.
> +
> + Tests for racy cases.
> +
> + 9.1 Small limit to memcg.
> When you do test to do racy case, it's good test to set memcg's limit
> to be very small rather than GB. Many races found in the test under
> xKB or xxMB limits.
> (Memory behavior under GB and Memory behavior under MB shows very
> different situation.)
>
> - 4.2 shmem
> + 9.2 Shmem
> Historically, memcg's shmem handling was poor and we saw some amount
> of troubles here. This is because shmem is page-cache but can be
> SwapCache. Test with shmem/tmpfs is always good test.
>
> - 4.3 migration
> - For NUMA, migration is an another special. To do easy test, cpuset
> + 9.3 Migration
> + For NUMA, migration is an another special case. To do easy test, cpuset
> is useful. Following is a sample script to do migration.
>
> mount -t cgroup -o cpuset none /opt/cpuset
> @@ -118,20 +275,20 @@ patterns tend to be racy.
> G2_TASK=`cat ${G2}/tasks`
> move_task "${G1_TASK}" ${G2} &
> --
> - 4.4 memory hotplug.
> + 9.4 Memory hotplug.
> memory hotplug test is one of good test.
> to offline memory, do following.
> # echo offline > /sys/devices/system/memory/memoryXXX/state
> (XXX is the place of memory)
> This is an easy way to test page migration, too.
>
> - 4.5 mkdir/rmdir
> + 9.5 mkdir/rmdir
> When using hierarchy, mkdir/rmdir test should be done.
> - tests like following.
> + Use tests like the following.
>
> - #echo 1 >/opt/cgroup/01/memory/use_hierarchy
> - #mkdir /opt/cgroup/01/child_a
> - #mkdir /opt/cgroup/01/child_b
> + echo 1 >/opt/cgroup/01/memory/use_hierarchy
> + mkdir /opt/cgroup/01/child_a
> + mkdir /opt/cgroup/01/child_b
>
> set limit to 01.
> add limit to 01/child_b
> @@ -143,3 +300,12 @@ patterns tend to be racy.
> /opt/cgroup/01/child_c
>
> running new jobs in new group is also good.
> +
> + 9.6 Mount with other subsystems.
> + Mounting with other subsystems is a good test because there is a
> + race and lock dependency with other cgroup subsystems.
> +
> + example)
> + # mount -t cgroup none /cgroup -t cpuset,memory,cpu,devices
> +
> + and do task move, mkdir, rmdir etc...under this.
> Index: mmotm-2.6.28-Dec03/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.28-Dec03.orig/mm/memcontrol.c
> +++ mmotm-2.6.28-Dec03/mm/memcontrol.c
> @@ -6,6 +6,10 @@
> * Copyright 2007 OpenVZ SWsoft Inc
> * Author: Pavel Emelianov <[email protected]>
> *
> + * Documentation is available at
> + * Documentation/controllers/memory.txt
> + * Documentation/controllers/memcg_test.txt
> + *
> * This program is free software; you can redistribute it and/or modify
> * it under the terms of the GNU General Public License as published by
> * the Free Software Foundation; either version 2 of the License, or
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: [email protected]
>

2008-12-10 01:04:16

by Li Zefan

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: Documentation for internal implementation

KAMEZAWA Hiroyuki wrote:
> Paul, Balbir
>
> I have a question.
>
> Why cgroup's documentation directroy is divided into 2 places ?
>
> Documentation/cgroups
> /controllers
>

Documentation/cgroups was created by Matt Helsley, when he added freezer-subsystem.txt,
and he also moved cgroups.txt to the new Documentation/cgroups.

> If no strong demands, I'd like to remove "controllers" directroy and move

I prepared a patch to do so long ago, but didn't ever send it out.

> contents under "cgroups". Some people complains me that finding document
> for memcg is not easy.
>

2008-12-10 01:08:22

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: Documentation for internal implementation

On Wed, 10 Dec 2008 09:02:19 +0800
Li Zefan <[email protected]> wrote:

> KAMEZAWA Hiroyuki wrote:
> > Paul, Balbir
> >
> > I have a question.
> >
> > Why cgroup's documentation directroy is divided into 2 places ?
> >
> > Documentation/cgroups
> > /controllers
> >
>
> Documentation/cgroups was created by Matt Helsley, when he added freezer-subsystem.txt,
> and he also moved cgroups.txt to the new Documentation/cgroups.
>
> > If no strong demands, I'd like to remove "controllers" directroy and move
>
> I prepared a patch to do so long ago, but didn't ever send it out.
>
Thank you for the clarification. Then it seems there is no special meaning.
I think people tend to recognize the code as "cgroups" rather than "controllers".
I'd like to prepare a patch after sending this update out.
If you send yours, I'll adjust my updates accordingly.

Thanks,
-Kame

2008-12-10 02:09:19

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: fix pre_destory handler

On Tue, 9 Dec 2008 20:06:47 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> @@ -98,6 +98,8 @@ enum {
> CGRP_RELEASABLE,
> /* Control Group requires release notifications to userspace */
> CGRP_NOTIFY_ON_RELEASE,
> + /* Control Group is preparing for death */
> + CGRP_PRE_REMOVAL,
> };
I'll rename CGRP_PRE_REMOVAL to CGRP_TRY_REMOVE.

Thanks,
-Kame

2008-12-10 02:21:26

by Li Zefan

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: fix pre_destory handler

> +static bool memcg_is_obsolete(struct mem_cgroup *mem)
> +{

Will this function be called with mem->css.refcnt == 0? If yes, then
this function is racy.

  memcg_is_obsolete()                      rmdir()/dput() path

  cg = mem->css.cgroup
                                           cgroup_diput()
                                             mem_cgroup_destroy()
                                               mem->css.cgroup = NULL;
                                             kfree(cg);
  if (!cg || cgroup_is_removed(cg)...)

  (accessing invalid cg)

> + struct cgroup *cg = mem->css.cgroup;
> + /*
> + * "Being Removed" means pre_destroy() handler is called.
> + * After "pre_destroy" handler is called, memcg should not
> + * have any additional charges.
> + * This means there are small races for mis-accounting. But this
> + * mis-accounting should happen only under swap-in opration.
> + * (Attachin new task will fail if cgroup is under rmdir()).
> + */
> +
> + if (!cg || cgroup_is_removed(cg) || cgroup_is_being_removed(cg))
> + return true;
> + return false;
> +}
> +

...

> static void mem_cgroup_destroy(struct cgroup_subsys *ss,
> struct cgroup *cont)
> {
> - mem_cgroup_free(mem_cgroup_from_cont(cont));
> + struct mem_cgroup *mem = mem_cgroup_from_cont(cont):
> + mem_cgroup_free(mem);
> + /* forget */
> + mem->css.cgroup = NULL;

mem might already be destroyed by mem_cgroup_free(mem).

> }

2008-12-10 02:24:53

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: fix pre_destory handler

On Wed, 10 Dec 2008 10:19:35 +0800
Li Zefan <[email protected]> wrote:

> > +static bool memcg_is_obsolete(struct mem_cgroup *mem)
> > +{
>
> Will this function be called with mem->css.refcnt == 0? If yes, then
> this function is racy.
>
> cg = mem->css.cgroup
> cgroup_diput()
> mem_cgroup_destroy()
> mem->css.cgroup = NULL;
> kfree(cg);
> if (!cg || cgroup_is_removed(cg)...)
>
> (accessing invalid cg)
>
Hmm. Then we have to add a flag to the css itself, anyway.


> > + struct cgroup *cg = mem->css.cgroup;
> > + /*
> > + * "Being Removed" means pre_destroy() handler is called.
> > + * After "pre_destroy" handler is called, memcg should not
> > + * have any additional charges.
> > + * This means there are small races for mis-accounting. But this
> > + * mis-accounting should happen only under swap-in opration.
> > + * (Attachin new task will fail if cgroup is under rmdir()).
> > + */
> > +
> > + if (!cg || cgroup_is_removed(cg) || cgroup_is_being_removed(cg))
> > + return true;
> > + return false;
> > +}
> > +
>
> ...
>
> > static void mem_cgroup_destroy(struct cgroup_subsys *ss,
> > struct cgroup *cont)
> > {
> > - mem_cgroup_free(mem_cgroup_from_cont(cont));
> > + struct mem_cgroup *mem = mem_cgroup_from_cont(cont):
> > + mem_cgroup_free(mem);
> > + /* forget */
> > + mem->css.cgroup = NULL;
>
> mem might already be destroyed by mem_cgroup_free(mem).
>
Ah, maybe. will fix.

Thanks,
-Kame

2008-12-10 02:48:52

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: fix pre_destory handler

On Tue, 9 Dec 2008 20:06:47 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> better name for new flag is welcome.
>
> ==
> Because pre_destroy() handler is moved out to cgroup_lock() for
> avoiding dead-lock, now, cgroup's rmdir() does following sequence.
>
> cgroup_lock()
> check children and tasks.
> (A)
> cgroup_unlock()
> (B)
> pre_destroy() for subsys;-----(1)
> (C)
> cgroup_lock();
> (D)
> Second check:check for -EBUSY again because we released the lock.
> (E)
> mark cgroup as removed.
> (F)
> unlink from lists.
> cgroup_unlock();
> dput()
> => when dentry's refcnt goes down to 0
> destroy() handers for subsys
>
> memcg marks itself as "obsolete" when pre_destroy() is called at (1)
> But rmdir() can fail after pre_destroy(). So marking "obsolete" is bug.
> I'd like to fix sanity of pre_destroy() in cgroup layer.
>
> Considering above sequence, new tasks can be added while
> (B) and (C)
> swap-in recored can be charged back to a cgroup after pre_destroy()
> at (C) and (D), (E)
> (means cgrp's refcnt not comes from task but from other persistent objects.)
>
> This patch adds "cgroup_is_being_removed()" check. (better name is welcome)
> After this,
>
> - cgroup is marked as CGRP_PRE_REMOVAL at (A)
> - If Second check fails, CGRP_PRE_REMOVAL flag is removed.
> - memcg's its own obsolete flag is removed.
> - While CGROUP_PRE_REMOVAL, task attach will fail by -EBUSY.
> (task attach via clone() will not hit the case.)
>
> By this, we can trust pre_restroy()'s result.
>
>
I agree with the direction of this patch, but I think it would be better
to split it into a cgroup part and a memcg part.

> Note: if CGRP_REMOVED can be set and cleared, it should be used instead of
> CGRP_PRE_REMOVAL.
>
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> ---
> include/linux/cgroup.h | 5 +++++
> kernel/cgroup.c | 18 +++++++++++++++++-
> mm/memcontrol.c | 36 ++++++++++++++++++++++++++----------
> 3 files changed, 48 insertions(+), 11 deletions(-)
>
> Index: mmotm-2.6.28-Dec08/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.28-Dec08.orig/mm/memcontrol.c
> +++ mmotm-2.6.28-Dec08/mm/memcontrol.c
> @@ -166,7 +166,6 @@ struct mem_cgroup {
> */
> bool use_hierarchy;
> unsigned long last_oom_jiffies;
> - int obsolete;
> atomic_t refcnt;
>
> unsigned int swappiness;
> @@ -211,6 +210,24 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
> static void mem_cgroup_get(struct mem_cgroup *mem);
> static void mem_cgroup_put(struct mem_cgroup *mem);
>
> +static bool memcg_is_obsolete(struct mem_cgroup *mem)
> +{
> + struct cgroup *cg = mem->css.cgroup;
> + /*
> + * "Being Removed" means pre_destroy() handler is called.
> + * After "pre_destroy" handler is called, memcg should not
> + * have any additional charges.
> + * This means there are small races for mis-accounting. But this
> + * mis-accounting should happen only under swap-in opration.
> + * (Attachin new task will fail if cgroup is under rmdir()).
> + */
> +
> + if (!cg || cgroup_is_removed(cg) || cgroup_is_being_removed(cg))
> + return true;
> + return false;
> +}
> +
> +
> static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
> struct page_cgroup *pc,
> bool charge)
> @@ -597,7 +614,7 @@ mem_cgroup_get_first_node(struct mem_cgr
> struct cgroup *cgroup;
> struct mem_cgroup *ret;
> bool obsolete = (root_mem->last_scanned_child &&
> - root_mem->last_scanned_child->obsolete);
> + memcg_is_obsolete(root_mem->last_scanned_child);
>
> /*
> * Scan all children under the mem_cgroup mem
> @@ -1070,7 +1087,7 @@ int mem_cgroup_try_charge_swapin(struct
> ent.val = page_private(page);
>
> mem = lookup_swap_cgroup(ent);
> - if (!mem || mem->obsolete)
> + if (!mem || memcg_is_obsolete(mem))
> goto charge_cur_mm;
> *ptr = mem;
> return __mem_cgroup_try_charge(NULL, mask, ptr, true);
> @@ -1104,7 +1121,7 @@ int mem_cgroup_cache_charge_swapin(struc
> ent.val = page_private(page);
> if (do_swap_account) {
> mem = lookup_swap_cgroup(ent);
> - if (mem && mem->obsolete)
> + if (mem && memcg_is_obsolete(mem))
> mem = NULL;
> if (mem)
> mm = NULL;
> @@ -2050,9 +2067,6 @@ static struct mem_cgroup *mem_cgroup_all
> * the number of reference from swap_cgroup and free mem_cgroup when
> * it goes down to 0.
> *
> - * When mem_cgroup is destroyed, mem->obsolete will be set to 0 and
> - * entry which points to this memcg will be ignore at swapin.
> - *
> * Removal of cgroup itself succeeds regardless of refs from swap.
> */
>
> @@ -2081,7 +2095,7 @@ static void mem_cgroup_get(struct mem_cg
> static void mem_cgroup_put(struct mem_cgroup *mem)
> {
> if (atomic_dec_and_test(&mem->refcnt)) {
> - if (!mem->obsolete)
> + if (!memcg_is_obsolete(mem))
> return;
> mem_cgroup_free(mem);
> }
> @@ -2148,14 +2162,16 @@ static void mem_cgroup_pre_destroy(struc
> struct cgroup *cont)
> {
> struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> - mem->obsolete = 1;
> mem_cgroup_force_empty(mem, false);
> }
>
> static void mem_cgroup_destroy(struct cgroup_subsys *ss,
> struct cgroup *cont)
> {
> - mem_cgroup_free(mem_cgroup_from_cont(cont));
> + struct mem_cgroup *mem = mem_cgroup_from_cont(cont):
> + mem_cgroup_free(mem);
> + /* forget */
> + mem->css.cgroup = NULL;
> }
>
Is it OK to access "mem" while it may have been freed ?

> static int mem_cgroup_populate(struct cgroup_subsys *ss,
> Index: mmotm-2.6.28-Dec08/include/linux/cgroup.h
> ===================================================================
> --- mmotm-2.6.28-Dec08.orig/include/linux/cgroup.h
> +++ mmotm-2.6.28-Dec08/include/linux/cgroup.h
> @@ -98,6 +98,8 @@ enum {
> CGRP_RELEASABLE,
> /* Control Group requires release notifications to userspace */
> CGRP_NOTIFY_ON_RELEASE,
> + /* Control Group is preparing for death */
> + CGRP_PRE_REMOVAL,
> };
>
> struct cgroup {
> @@ -303,8 +305,11 @@ int cgroup_add_files(struct cgroup *cgrp
> const struct cftype cft[],
> int count);
>
> +
> int cgroup_is_removed(const struct cgroup *cgrp);
>
> +int cgroup_is_being_removed(const struct cgroup *cgrp);
> +
> int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen);
>
> int cgroup_task_count(const struct cgroup *cgrp);
> Index: mmotm-2.6.28-Dec08/kernel/cgroup.c
> ===================================================================
> --- mmotm-2.6.28-Dec08.orig/kernel/cgroup.c
> +++ mmotm-2.6.28-Dec08/kernel/cgroup.c
> @@ -123,6 +123,11 @@ inline int cgroup_is_removed(const struc
> return test_bit(CGRP_REMOVED, &cgrp->flags);
> }
>
> +inline int cgroup_is_being_removed(const struct cgroup *cgrp)
> +{
> + return test_bit(CGRP_PRE_REMOVAL, &cgrp->flags);
> +}
> +
> /* bits in struct cgroupfs_root flags field */
> enum {
> ROOT_NOPREFIX, /* mounted subsystems have no named prefix */
> @@ -1217,6 +1222,13 @@ int cgroup_attach_task(struct cgroup *cg
> if (cgrp == oldcgrp)
> return 0;
>
> + /*
> + * This cgroup is under rmdir() operation. Never fails here when this
> + * is called from clone().
> + */
I don't think clone() calls cgroup_attach_task().
Do you mean cgroup_clone()?
(I'm sorry, I'm not very familiar with ns_cgroup.)

> + if (cgroup_is_being_removed(cgrp))
> + return -EBUSY;
> +
> for_each_subsys(root, ss) {
> if (ss->can_attach) {
> retval = ss->can_attach(ss, cgrp, tsk);
> @@ -2469,12 +2481,14 @@ static int cgroup_rmdir(struct inode *un
> mutex_unlock(&cgroup_mutex);
> return -EBUSY;
> }
> - mutex_unlock(&cgroup_mutex);
>
> /*
> * Call pre_destroy handlers of subsys. Notify subsystems
> * that rmdir() request comes.
> */
> + set_bit(CGRP_PRE_REMOVAL, &cgrp->flags);
> + mutex_unlock(&cgroup_mutex);
> +
> cgroup_call_pre_destroy(cgrp);
>
Is there any case where pre_destroy is called simultaneously?

> mutex_lock(&cgroup_mutex);
> @@ -2483,12 +2497,14 @@ static int cgroup_rmdir(struct inode *un
> if (atomic_read(&cgrp->count)
> || !list_empty(&cgrp->children)
> || cgroup_has_css_refs(cgrp)) {
> + clear_bit(CGRP_PRE_REMOVAL, &cgrp->flags);
> mutex_unlock(&cgroup_mutex);
> return -EBUSY;
> }
>
> spin_lock(&release_list_lock);
> set_bit(CGRP_REMOVED, &cgrp->flags);
> + clear_bit(CGRP_PRE_REMOVAL, &cgrp->flags);
> if (!list_empty(&cgrp->release_list))
> list_del(&cgrp->release_list);
> spin_unlock(&release_list_lock);
>


Thanks,
Daisuke Nishimura.

2008-12-10 02:52:20

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/6] Flat hierarchical reclaim by ID

* KAMEZAWA Hiroyuki <[email protected]> [2008-12-10 01:34:14]:

> Balbir Singh said:
>
> >> I think your soft-limit idea can be easily merged onto this patch
> >> set.
> >>
> >
> > Yes, potentially. With soft limit, the general expectation is this
> >
> > Let us say you have group A and B
> >
> > groupA, soft limit = 1G
> > groupB, soft limit = 2G
> >
> > Now assume the system has 4G. When groupB is not using its memory,
> > group A can grab all 4G, but when groupB kicks in and tries to use 2G
> > or more, then the expectation is that
> >
> > group A will get 1/3 * 4 = 4/3G
> > group B will get 2/3 * 4 = 8/3G
> >
> > Similar to CPU shares currently.
> >
> I like that idea because it's easy to understand.
>

Excellent, I'll start looking at how to implement it

> >> > Does this order reflect their position in the hierarchy?
> >> No. just scan IDs from last scannned one in RR.
> >> BTW, can you show what an algorithm works well in following case ?
> >> ex)
> >> groupA/ limit=1G usage=300M
> >> 01/ limit=600M usage=600M
> >> 02/ limit=700M usage=70M
> >> 03/ limit=100M usage=30M
> >> Which one should be shrinked at first and why ?
> >> 1) when group_A hit limits.
> >
> > With tree reclaim, reclaim will first reclaim from A and stop if
> > successful, otherwise it will go to 01, 02 and 03 and then go back to
> > A.
> >
> Sorry for my poor example
>
> >> 2) when group_A/01 hit limits.
> >
> > This will reclaim only from 01, since A is under its limit
> >
> I should ask
> 2') when a task in group_A/01 hit limit in group_A
>
> ex)
> group_A/ limtit=1G, usage~0
> /01 limit= unlimited usage=800M
> /02 limit= unlimited usage=200M
> (what limit is allowed to children is another problem to be fixed...)
> when a task in 01 hits limit of group_A
> when a task in 02 hits limit of group_A
> where we should start from ? (is unknown)
> Currenty , this patch uses RR (in A->01->02->A->...).
> and soft-limit or some good algorithm will give us better view.
>
> >> 3) when group_A/02 hit limits.
> >
> > This will reclaim only from 02 since A is under its limit
> >
> > Does RR do the same right now?
> >
> I think so.
>
> Assume
> group_A/
> /01
> /02
> RR does
> 1) when a task under A/01/02 hit limits at A, shrink A, 01, 02,
> 2) when a task under 01 hit limits at 01, shrink only 01.
> 3) when a task under 02 hit limits at 02, shrink only 02.
>
> When 1), start point of shrinking is saved as last_scanned_child.
>
>
> >> I can't now.
> >>
> >> This patch itself uses round-robin and have no special order.
> >> I think implenting good algorithm under this needs some amount of
> >> time.
> >>
> >
> > I agree that fine tuning it will require time, but what we need is
> > something usable that will not have hard to debug or understand corner
> > cases.
>
> yes, we have now. My point is "cgroup_lock()" caused many problems and
> will cause new ones in future, I convince.
>
> And please see 5/6 and 6/6 we need hierarchy consideration in other
> places. I think there are more codes which should take care of hierarchy.
>

Yes, I do have the patches to remove cgroup_lock(); let me post them
independent of Daisuke's patches.

>
> > > Shouldn't id's belong to cgroups instead of just memory controller?
> >> If Paul rejects, I'll move this to memcg. But bio-cgroup people also use
> >> ID and, in this summer, I posted swap-cgroup-ID patch and asked to
> >> implement IDs under cgroup rather than subsys. (asked by Paul or you.)
> >>
> >
> > We should talk to Paul and convince him.
> >
> yes.
>

Paul, would it be very hard to add IDs to control groups?

> >> >From implementation, hierarchy code management at el. should go into
> >> cgroup.c and it gives us clear view rather than implemented under memcg.
> >>
> >
> > cgroup has hierarchy management already, in the form of children and
> > sibling. Walking those structures is up to us, that is all we do
> > currently :)
> >
> yes, but need cgroup_lock(). and you have to keep refcnt to pointer
> just for rememebring it.
>
> This patch doesn't change anything other than removing cgroup_lock() and
> removing refcnt to remember start point.
>

OK, I'll play with it

--
Balbir

2008-12-10 02:59:41

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: fix pre_destory handler

On Wed, 10 Dec 2008 11:28:15 +0900
Daisuke Nishimura <[email protected]> wrote:

> On Tue, 9 Dec 2008 20:06:47 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > better name for new flag is welcome.
> >
> > ==
> > Because pre_destroy() handler is moved out to cgroup_lock() for
> > avoiding dead-lock, now, cgroup's rmdir() does following sequence.
> >
> > cgroup_lock()
> > check children and tasks.
> > (A)
> > cgroup_unlock()
> > (B)
> > pre_destroy() for subsys;-----(1)
> > (C)
> > cgroup_lock();
> > (D)
> > Second check:check for -EBUSY again because we released the lock.
> > (E)
> > mark cgroup as removed.
> > (F)
> > unlink from lists.
> > cgroup_unlock();
> > dput()
> > => when dentry's refcnt goes down to 0
> > destroy() handers for subsys
> >
> > memcg marks itself as "obsolete" when pre_destroy() is called at (1)
> > But rmdir() can fail after pre_destroy(). So marking "obsolete" is bug.
> > I'd like to fix sanity of pre_destroy() in cgroup layer.
> >
> > Considering above sequence, new tasks can be added while
> > (B) and (C)
> > swap-in recored can be charged back to a cgroup after pre_destroy()
> > at (C) and (D), (E)
> > (means cgrp's refcnt not comes from task but from other persistent objects.)
> >
> > This patch adds "cgroup_is_being_removed()" check. (better name is welcome)
> > After this,
> >
> > - cgroup is marked as CGRP_PRE_REMOVAL at (A)
> > - If Second check fails, CGRP_PRE_REMOVAL flag is removed.
> > - memcg's its own obsolete flag is removed.
> > - While CGROUP_PRE_REMOVAL, task attach will fail by -EBUSY.
> > (task attach via clone() will not hit the case.)
> >
> > By this, we can trust pre_restroy()'s result.
> >
> >
> I agrree to the direction of this patch, but I think it would be better
> to split this into cgroup and memcg part.
>
Hmm, but "showing usage" part is necessary for this kind of patches.

> > Note: if CGRP_REMOVED can be set and cleared, it should be used instead of
> > CGRP_PRE_REMOVAL.
> >
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> >
> > ---
> > include/linux/cgroup.h | 5 +++++
> > kernel/cgroup.c | 18 +++++++++++++++++-
> > mm/memcontrol.c | 36 ++++++++++++++++++++++++++----------
> > 3 files changed, 48 insertions(+), 11 deletions(-)
> >
> > Index: mmotm-2.6.28-Dec08/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.28-Dec08.orig/mm/memcontrol.c
> > +++ mmotm-2.6.28-Dec08/mm/memcontrol.c
> > @@ -166,7 +166,6 @@ struct mem_cgroup {
> > */
> > bool use_hierarchy;
> > unsigned long last_oom_jiffies;
> > - int obsolete;
> > atomic_t refcnt;
> >
> > unsigned int swappiness;
> > @@ -211,6 +210,24 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
> > static void mem_cgroup_get(struct mem_cgroup *mem);
> > static void mem_cgroup_put(struct mem_cgroup *mem);
> >
> > +static bool memcg_is_obsolete(struct mem_cgroup *mem)
> > +{
> > + struct cgroup *cg = mem->css.cgroup;
> > + /*
> > + * "Being Removed" means pre_destroy() handler is called.
> > + * After "pre_destroy" handler is called, memcg should not
> > + * have any additional charges.
> > + * This means there are small races for mis-accounting. But this
> > + * mis-accounting should happen only under swap-in opration.
> > + * (Attachin new task will fail if cgroup is under rmdir()).
> > + */
> > +
> > + if (!cg || cgroup_is_removed(cg) || cgroup_is_being_removed(cg))
> > + return true;
> > + return false;
> > +}
> > +
> > +
> > static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
> > struct page_cgroup *pc,
> > bool charge)
> > @@ -597,7 +614,7 @@ mem_cgroup_get_first_node(struct mem_cgr
> > struct cgroup *cgroup;
> > struct mem_cgroup *ret;
> > bool obsolete = (root_mem->last_scanned_child &&
> > - root_mem->last_scanned_child->obsolete);
> > + memcg_is_obsolete(root_mem->last_scanned_child);
> >
> > /*
> > * Scan all children under the mem_cgroup mem
> > @@ -1070,7 +1087,7 @@ int mem_cgroup_try_charge_swapin(struct
> > ent.val = page_private(page);
> >
> > mem = lookup_swap_cgroup(ent);
> > - if (!mem || mem->obsolete)
> > + if (!mem || memcg_is_obsolete(mem))
> > goto charge_cur_mm;
> > *ptr = mem;
> > return __mem_cgroup_try_charge(NULL, mask, ptr, true);
> > @@ -1104,7 +1121,7 @@ int mem_cgroup_cache_charge_swapin(struc
> > ent.val = page_private(page);
> > if (do_swap_account) {
> > mem = lookup_swap_cgroup(ent);
> > - if (mem && mem->obsolete)
> > + if (mem && memcg_is_obsolete(mem))
> > mem = NULL;
> > if (mem)
> > mm = NULL;
> > @@ -2050,9 +2067,6 @@ static struct mem_cgroup *mem_cgroup_all
> > * the number of reference from swap_cgroup and free mem_cgroup when
> > * it goes down to 0.
> > *
> > - * When mem_cgroup is destroyed, mem->obsolete will be set to 0 and
> > - * entry which points to this memcg will be ignore at swapin.
> > - *
> > * Removal of cgroup itself succeeds regardless of refs from swap.
> > */
> >
> > @@ -2081,7 +2095,7 @@ static void mem_cgroup_get(struct mem_cg
> > static void mem_cgroup_put(struct mem_cgroup *mem)
> > {
> > if (atomic_dec_and_test(&mem->refcnt)) {
> > - if (!mem->obsolete)
> > + if (!memcg_is_obsolete(mem))
> > return;
> > mem_cgroup_free(mem);
> > }
> > @@ -2148,14 +2162,16 @@ static void mem_cgroup_pre_destroy(struc
> > struct cgroup *cont)
> > {
> > struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> > - mem->obsolete = 1;
> > mem_cgroup_force_empty(mem, false);
> > }
> >
> > static void mem_cgroup_destroy(struct cgroup_subsys *ss,
> > struct cgroup *cont)
> > {
> > - mem_cgroup_free(mem_cgroup_from_cont(cont));
> > + struct mem_cgroup *mem = mem_cgroup_from_cont(cont):
> > + mem_cgroup_free(mem);
> > + /* forget */
> > + mem->css.cgroup = NULL;
> > }
> >
> Is it OK to access "mem" while it may have been freed ?
>
no, will fix.



> > static int mem_cgroup_populate(struct cgroup_subsys *ss,
> > Index: mmotm-2.6.28-Dec08/include/linux/cgroup.h
> > ===================================================================
> > --- mmotm-2.6.28-Dec08.orig/include/linux/cgroup.h
> > +++ mmotm-2.6.28-Dec08/include/linux/cgroup.h
> > @@ -98,6 +98,8 @@ enum {
> > CGRP_RELEASABLE,
> > /* Control Group requires release notifications to userspace */
> > CGRP_NOTIFY_ON_RELEASE,
> > + /* Control Group is preparing for death */
> > + CGRP_PRE_REMOVAL,
> > };
> >
> > struct cgroup {
> > @@ -303,8 +305,11 @@ int cgroup_add_files(struct cgroup *cgrp
> > const struct cftype cft[],
> > int count);
> >
> > +
> > int cgroup_is_removed(const struct cgroup *cgrp);
> >
> > +int cgroup_is_being_removed(const struct cgroup *cgrp);
> > +
> > int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen);
> >
> > int cgroup_task_count(const struct cgroup *cgrp);
> > Index: mmotm-2.6.28-Dec08/kernel/cgroup.c
> > ===================================================================
> > --- mmotm-2.6.28-Dec08.orig/kernel/cgroup.c
> > +++ mmotm-2.6.28-Dec08/kernel/cgroup.c
> > @@ -123,6 +123,11 @@ inline int cgroup_is_removed(const struc
> > return test_bit(CGRP_REMOVED, &cgrp->flags);
> > }
> >
> > +inline int cgroup_is_being_removed(const struct cgroup *cgrp)
> > +{
> > + return test_bit(CGRP_PRE_REMOVAL, &cgrp->flags);
> > +}
> > +
> > /* bits in struct cgroupfs_root flags field */
> > enum {
> > ROOT_NOPREFIX, /* mounted subsystems have no named prefix */
> > @@ -1217,6 +1222,13 @@ int cgroup_attach_task(struct cgroup *cg
> > if (cgrp == oldcgrp)
> > return 0;
> >
> > + /*
> > + * This cgroup is under rmdir() operation. Never fails here when this
> > + * is called from clone().
> > + */
> I don't think clone() calls cgroup_attach_task().
> Do you mean cgroup_clone() ?
> (I'm sorry that I'm not good at ns_cgroup.)
>
yes, cgroup_clone(). I'll update.


> > + if (cgroup_is_being_removed(cgrp))
> > + return -EBUSY;
> > +
> > for_each_subsys(root, ss) {
> > if (ss->can_attach) {
> > retval = ss->can_attach(ss, cgrp, tsk);
> > @@ -2469,12 +2481,14 @@ static int cgroup_rmdir(struct inode *un
> > mutex_unlock(&cgroup_mutex);
> > return -EBUSY;
> > }
> > - mutex_unlock(&cgroup_mutex);
> >
> > /*
> > * Call pre_destroy handlers of subsys. Notify subsystems
> > * that rmdir() request comes.
> > */
> > + set_bit(CGRP_PRE_REMOVAL, &cgrp->flags);
> > + mutex_unlock(&cgroup_mutex);
> > +
> > cgroup_call_pre_destroy(cgrp);
> >
> Is there any case where pre_destory is called simultaneusly ?
>
I can't quite catch what your concern is.

AFAIK, vfs_rmdir() is done under the parent directory's i_mutex:
==
mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
...
vfs_rmdir()
==
so calls to cgroup_rmdir() will be serialized.

Thanks,
-Kame

2008-12-10 03:04:22

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/6] Flat hierarchical reclaim by ID

On Wed, 10 Dec 2008 08:19:29 +0530
Balbir Singh <[email protected]> wrote:
> > >> >From implementation, hierarchy code management at el. should go into
> > >> cgroup.c and it gives us clear view rather than implemented under memcg.
> > >>
> > >
> > > cgroup has hierarchy management already, in the form of children and
> > > sibling. Walking those structures is up to us, that is all we do
> > > currently :)
> > >
> > yes, but need cgroup_lock(). and you have to keep refcnt to pointer
> > just for rememebring it.
> >
> > This patch doesn't change anything other than removing cgroup_lock() and
> > removing refcnt to remember start point.
> >
>
> OK, I'll play with it
>
I hear Kosaki has another idea for 5/6, so please ignore 5/6 for a while.
It's complicated. I'll post updated ones.

-Kame

2008-12-10 04:05:36

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: fix pre_destory handler

On Wed, 10 Dec 2008 11:58:30 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> On Wed, 10 Dec 2008 11:28:15 +0900
> Daisuke Nishimura <[email protected]> wrote:
>
> > On Tue, 9 Dec 2008 20:06:47 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > better name for new flag is welcome.
> > >
> > > ==
> > > Because pre_destroy() handler is moved out to cgroup_lock() for
> > > avoiding dead-lock, now, cgroup's rmdir() does following sequence.
> > >
> > > cgroup_lock()
> > > check children and tasks.
> > > (A)
> > > cgroup_unlock()
> > > (B)
> > > pre_destroy() for subsys;-----(1)
> > > (C)
> > > cgroup_lock();
> > > (D)
> > > Second check:check for -EBUSY again because we released the lock.
> > > (E)
> > > mark cgroup as removed.
> > > (F)
> > > unlink from lists.
> > > cgroup_unlock();
> > > dput()
> > > => when dentry's refcnt goes down to 0
> > > destroy() handers for subsys
> > >
> > > memcg marks itself as "obsolete" when pre_destroy() is called at (1)
> > > But rmdir() can fail after pre_destroy(). So marking "obsolete" is bug.
> > > I'd like to fix sanity of pre_destroy() in cgroup layer.
> > >
> > > Considering above sequence, new tasks can be added while
> > > (B) and (C)
> > > swap-in recored can be charged back to a cgroup after pre_destroy()
> > > at (C) and (D), (E)
> > > (means cgrp's refcnt not comes from task but from other persistent objects.)
> > >
> > > This patch adds "cgroup_is_being_removed()" check. (better name is welcome)
> > > After this,
> > >
> > > - cgroup is marked as CGRP_PRE_REMOVAL at (A)
> > > - If Second check fails, CGRP_PRE_REMOVAL flag is removed.
> > > - memcg's its own obsolete flag is removed.
> > > - While CGROUP_PRE_REMOVAL, task attach will fail by -EBUSY.
> > > (task attach via clone() will not hit the case.)
> > >
> > > By this, we can trust pre_restroy()'s result.
> > >
> > >
> > I agrree to the direction of this patch, but I think it would be better
> > to split this into cgroup and memcg part.
> >
> Hmm, but "showing usage" part is necessary for this kind of patches.
>
I see.

(snip)
> > > + if (cgroup_is_being_removed(cgrp))
> > > + return -EBUSY;
> > > +
> > > for_each_subsys(root, ss) {
> > > if (ss->can_attach) {
> > > retval = ss->can_attach(ss, cgrp, tsk);
> > > @@ -2469,12 +2481,14 @@ static int cgroup_rmdir(struct inode *un
> > > mutex_unlock(&cgroup_mutex);
> > > return -EBUSY;
> > > }
> > > - mutex_unlock(&cgroup_mutex);
> > >
> > > /*
> > > * Call pre_destroy handlers of subsys. Notify subsystems
> > > * that rmdir() request comes.
> > > */
> > > + set_bit(CGRP_PRE_REMOVAL, &cgrp->flags);
> > > + mutex_unlock(&cgroup_mutex);
> > > +
> > > cgroup_call_pre_destroy(cgrp);
> > >
> > Is there any case where pre_destory is called simultaneusly ?
> >
> I can't catch what is your concern.
>
> AFAIK, vfs_rmdir() is done under mutex to dentry
> ==
> mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
> ...
> vfs_rmdir()
> ==
> And calls to cgroup_rmdir() will be serialized.
>
You're right. I missed that.
Thank you for your clarification.


Thanks,
Daisuke Nishimura.

2008-12-10 04:18:26

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: fix pre_destory handler

On Wed, 10 Dec 2008 12:03:40 +0900
Daisuke Nishimura <[email protected]> wrote:

> On Wed, 10 Dec 2008 11:58:30 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > On Wed, 10 Dec 2008 11:28:15 +0900
> > Daisuke Nishimura <[email protected]> wrote:
> >
> > > On Tue, 9 Dec 2008 20:06:47 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > > better name for new flag is welcome.
> > > >
> > > > ==
> > > > Because pre_destroy() handler is moved out to cgroup_lock() for
> > > > avoiding dead-lock, now, cgroup's rmdir() does following sequence.
> > > >
> > > > cgroup_lock()
> > > > check children and tasks.
> > > > (A)
> > > > cgroup_unlock()
> > > > (B)
> > > > pre_destroy() for subsys;-----(1)
> > > > (C)
> > > > cgroup_lock();
> > > > (D)
> > > > Second check:check for -EBUSY again because we released the lock.
> > > > (E)
> > > > mark cgroup as removed.
> > > > (F)
> > > > unlink from lists.
> > > > cgroup_unlock();
> > > > dput()
> > > > => when dentry's refcnt goes down to 0
> > > > destroy() handers for subsys
> > > >
> > > > memcg marks itself as "obsolete" when pre_destroy() is called at (1)
> > > > But rmdir() can fail after pre_destroy(). So marking "obsolete" is bug.
> > > > I'd like to fix sanity of pre_destroy() in cgroup layer.
> > > >
> > > > Considering above sequence, new tasks can be added while
> > > > (B) and (C)
> > > > swap-in recored can be charged back to a cgroup after pre_destroy()
> > > > at (C) and (D), (E)
> > > > (means cgrp's refcnt not comes from task but from other persistent objects.)
> > > >
> > > > This patch adds "cgroup_is_being_removed()" check. (better name is welcome)
> > > > After this,
> > > >
> > > > - cgroup is marked as CGRP_PRE_REMOVAL at (A)
> > > > - If Second check fails, CGRP_PRE_REMOVAL flag is removed.
> > > > - memcg's its own obsolete flag is removed.
> > > > - While CGROUP_PRE_REMOVAL, task attach will fail by -EBUSY.
> > > > (task attach via clone() will not hit the case.)
> > > >
> > > > By this, we can trust pre_restroy()'s result.
> > > >
> > > >
> > > I agrree to the direction of this patch, but I think it would be better
> > > to split this into cgroup and memcg part.
> > >
> > Hmm, but "showing usage" part is necessary for this kind of patches.
> >
> I see.
>
> (snip)
> > > > + if (cgroup_is_being_removed(cgrp))
> > > > + return -EBUSY;
> > > > +
> > > > for_each_subsys(root, ss) {
> > > > if (ss->can_attach) {
> > > > retval = ss->can_attach(ss, cgrp, tsk);
> > > > @@ -2469,12 +2481,14 @@ static int cgroup_rmdir(struct inode *un
> > > > mutex_unlock(&cgroup_mutex);
> > > > return -EBUSY;
> > > > }
> > > > - mutex_unlock(&cgroup_mutex);
> > > >
> > > > /*
> > > > * Call pre_destroy handlers of subsys. Notify subsystems
> > > > * that rmdir() request comes.
> > > > */
> > > > + set_bit(CGRP_PRE_REMOVAL, &cgrp->flags);
> > > > + mutex_unlock(&cgroup_mutex);
> > > > +
> > > > cgroup_call_pre_destroy(cgrp);
> > > >
> > > Is there any case where pre_destory is called simultaneusly ?
> > >
> > I can't catch what is your concern.
> >
> > AFAIK, vfs_rmdir() is done under mutex to dentry
> > ==
> > mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
> > ...
> > vfs_rmdir()
> > ==
> > And calls to cgroup_rmdir() will be serialized.
> >
> You're right. I missed that.
> Thank you for your clarification.
>
>

I'll post an updated one; it should be much clearer.

Thanks,
-Kame



> Thanks,
> Daisuke Nishimura.
>

2008-12-10 10:40:28

by Paul Menage

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: fix pre_destory handler

On Tue, Dec 9, 2008 at 3:06 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> better name for new flag is welcome.
>
> ==
> Because pre_destroy() handler is moved out to cgroup_lock() for
> avoiding dead-lock, now, cgroup's rmdir() does following sequence.
>
> cgroup_lock()
> check children and tasks.
> (A)
> cgroup_unlock()
> (B)
> pre_destroy() for subsys;-----(1)
> (C)
> cgroup_lock();
> (D)
> Second check:check for -EBUSY again because we released the lock.
> (E)
> mark cgroup as removed.
> (F)
> unlink from lists.
> cgroup_unlock();
> dput()
> => when dentry's refcnt goes down to 0
> destroy() handers for subsys

The reason for needing this patch is because of the non-atomic locking
in cgroup_rmdir() that was introduced due to the circular locking
dependency between the hotplug lock and the cgroup_mutex.

But rather than adding a whole bunch more complexity, this looks like
another case that could be solved by the hierarchy_mutex patches that
I posted a while ago.

Those allow the cpuset hotplug notifier (and any other subsystem that
wants a stable hierarchy) to take ss->hierarchy_mutex, to prevent
mkdir/rmdir/bind in its hierarchy, which helps to remove the deadlock
that the above dropping of cgroup_mutex was introduced to work around.

>
> Considering above sequence, new tasks can be added while
> (B) and (C)
> swap-in recored can be charged back to a cgroup after pre_destroy()
> at (C) and (D), (E)
> (means cgrp's refcnt not comes from task but from other persistent objects.)

Which "persistent object" are you getting the css refcount from?

Is the problem that swap references aren't refcounted because you want
to avoid swap from keeping a cgroup alive? But you still want to be
able to do css_get() on the mem_cgroup* obtained from a swap
reference, and be safely synchronized with concurrent rmdir operations
without having to take a heavy lock?

The solution that I originally tried to use for this in an early
version of cgroups (but dropped as I thought it was not needed) was to
treat css->refcount as follows:

0 => trying to remove or removed
1 => normal state with no additional refs

So to get a reference on a possibly removed css we'd have:

int css_tryget(css) {
	while (!atomic_inc_not_zero(&css->refcount)) {
		if (test_bit(CSS_REMOVED, &css->flags)) {
			return 0;
		}
	}
	return 1;
}

and cgroup_rmdir would do:

for_each_subsys() {
	if (cmpxchg(&css->refcount, 0, -1)) {
		// busy, roll back -1s to 0s, give error
		...
	}
}
// success
for_each_subsys() {
	set_bit(CSS_REMOVED, &css->flags);
}

This makes it easy to have weak references to a css that can be
dropped in a destroy() callback. Would this maybe even remove the need
for mem_cgroup_pre_destroy()?
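
(Editor's illustration, not from the mail above: a minimal sketch of how such
a weak swap reference could use the css_tryget() sketched above.
lookup_swap_css() and charge_page_to_css() are hypothetical helper names, not
real kernel APIs.)
==
/* Illustrative sketch only; helper names are hypothetical. */
static int charge_back_from_swap(swp_entry_t ent, struct page *page)
{
	struct cgroup_subsys_state *css = lookup_swap_css(ent);

	if (!css || !css_tryget(css))
		return -ENOENT;	/* the cgroup is gone (or going away) */
	/*
	 * While this reference is held, a concurrent rmdir() sees a
	 * non-zero refcount and fails with -EBUSY instead of racing.
	 */
	charge_page_to_css(css, page);
	css_put(css);
	return 0;
}
==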

Paul

2008-12-10 11:29:54

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: fix pre_destory handler

Paul Menage said:
> The reason for needing this patch is because of the non-atomic locking
> in cgroup_rmdir() that was introduced due to the circular locking
> dependency between the hotplug lock and the cgroup_mutex.
>
> But rather than adding a whole bunch more complexity, this looks like
> another case that could be solved by the hierarchy_mutex patches that
> I posted a while ago.
>
Removing cgroup_lock() from memcg's reclaim path is okay
(and I posted a patch...).
But preventing css_get() after pre_destroy() is another problem.

(BTW, I don't like the hierarchy-walk-by-small-locks approach now because
I'd like to implement a scan-and-stop-continue routine.
See how readdir() against /proc scans PIDs. It's very robust against
very short-lived PIDs.)

> Those allow the cpuset hotplug notifier (and any other subsystem that
> wants a stable hierarchy) to take ss->hierarchy_mutex, to prevent
> mkdir/rmdir/bind in its hierarchy, which helps to remove the deadlock
> that the above dropping of cgroup_mutex was introduced to work around.
>
>>
>> Considering above sequence, new tasks can be added while
>> (B) and (C)
>> swap-in recored can be charged back to a cgroup after pre_destroy()
>> at (C) and (D), (E)
>> (means cgrp's refcnt not comes from task but from other persistent
>> objects.)
>
> Which "persistent object" are you getting the css refcount from?
>
page_cgroup generated from swap_cgroup.

> Is the problem that swap references aren't refcounted because you want
> to avoid swap from keeping a cgroup alive?
Yes. There is no operation that allows users to bring swapped-out entries back into memory.
> But you still want to be able to do css_get() on the mem_cgroup*
obtained from a swap
> reference, and be safely synchronized with concurrent rmdir operations
> without having to take a heavy lock?
>
yes. I don't want any locks.

> The solution that I originally tried to use for this in an early
> version of cgroups (but dropped as I thought it was not needed) was to
> treat css->refcount as follows:
>
> 0 => trying to remove or removed
> 1 => normal state with no additional refs
>
> So to get a reference on a possibly removed css we'd have:
>
> int css_tryget(css) {
> while (!atomic_inc_not_zero(&css->refcount)) {
> if (test_bit(CSS_REMOVED, &css->flags)) {
> return 0;
> }
> }
> return 1;
> }
>
> and cgroup_rmdir would do:
>
> for_each_subsys() {
> if (cmpxchg(&css->refcount, 0, -1)) {
> // busy, roll back -1s to 0s, give error
> ...
> }
> }
> // success
> for_each_subsys() {
> set_bit(CSS_REMOVED, &css->flags);
> }
>
> This makes it easy to have weak references to a css that can be
> dropped in a destroy() callback.
I tried a similar patch and made it use only one shared refcnt
(my previous patch...).

We need a rolling update of refcnts plus rollback. Such code tends to leave
holes (that is what my first patch did...).
And there is no fundamental difference between my shared refcnt and the
css_tryget() patch. Maybe the above will give us better cache locality.

Anyway, I have no objection to a rolling update of refcnts and tryget().
If that's the way to go, I'll go ahead.

> Would this maybe even remove the need for mem_cgroup_pre_destroy()?
>
Yes and no. I'm not sure, but my thinking is no.

1. pre_destroy() is called by rmdir(), in a synchronous manner.
This means that all refs in memcg are removed at rmdir().
If we drop refs at destroy(), that only happens when the dentry's refcnt
finally goes down to 0 at dput(). This asynchronous behavior is not good for users.

2. The current pre_destroy() code is very young, and we haven't found any
corner case in which pre_destroy() can't complete its work.
So I don't want to remove pre_destroy() for a while.

3. Sometimes pre_destroy() has to call try_to_free_pages(), and
we cannot know whether it is safe to call try_to_free_pages() inside dput().

Thanks,
-Kame

2008-12-10 13:26:33

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: fix pre_destory handler

* KAMEZAWA Hiroyuki <[email protected]> [2008-12-10 20:29:41]:

> Paul Menage said:
> > The reason for needing this patch is because of the non-atomic locking
> > in cgroup_rmdir() that was introduced due to the circular locking
> > dependency between the hotplug lock and the cgroup_mutex.
> >
> > But rather than adding a whole bunch more complexity, this looks like
> > another case that could be solved by the hierarchy_mutex patches that
> > I posted a while ago.
> >

Paul, I can't find those patches in -mm. I will try and dig them out
of my mbox. I agree that we need a hierarchy_mutex; cgroup_mutex is
becoming the next BKL.

> removing cgroup_lock() from memcg's reclaim path is okay.
> (and I posted patch...)
> But, prevent css_get() after pre_destroy() is another problem.
>
> (BTW, I don't like hierarchy-walk-by-small-locks approarch now because
> I'd like to implement scan-and-stop-continue routine.
> See how readdir() aginst /proc scans PID. It's very roboust against
> very temporal PIDs.)
>
> > Those allow the cpuset hotplug notifier (and any other subsystem that
> > wants a stable hierarchy) to take ss->hierarchy_mutex, to prevent
> > mkdir/rmdir/bind in its hierarchy, which helps to remove the deadlock
> > that the above dropping of cgroup_mutex was introduced to work around.
> >
> >>
> >> Considering above sequence, new tasks can be added while
> >> (B) and (C)
> >> swap-in recored can be charged back to a cgroup after pre_destroy()
> >> at (C) and (D), (E)
> >> (means cgrp's refcnt not comes from task but from other persistent
> >> objects.)
> >
> > Which "persistent object" are you getting the css refcount from?
> >
> page_cgroup generated from swap_cgroup.
>
> > Is the problem that swap references aren't refcounted because you want
> > to avoid swap from keeping a cgroup alive?
> Yes. There is no operations allows users to make swap on memory.
> > But you still want to be able to do css_get() on the mem_cgroup*
> obtained from a swap
> > reference, and be safely synchronized with concurrent rmdir operations
> > without having to take a heavy lock?
> >
> yes. I don't want any locks.
>
> > The solution that I originally tried to use for this in an early
> > version of cgroups (but dropped as I thought it was not needed) was to
> > treat css->refcount as follows:
> >
> > 0 => trying to remove or removed
> > 1 => normal state with no additional refs
> >
> > So to get a reference on a possibly removed css we'd have:
> >
> > int css_tryget(css) {
> > while (!atomic_inc_not_zero(&css->refcount)) {
> > if (test_bit(CSS_REMOVED, &css->flags)) {
> > return 0;
> > }
> > }
> > return 1;
> > }
> >
> > and cgroup_rmdir would do:
> >
> > for_each_subsys() {
> > if (cmpxchg(&css->refcount, 0, -1)) {
> > // busy, roll back -1s to 0s, give error
> > ...
> > }
> > }
> > // success
> > for_each_subsys() {
> > set_bit(CSS_REMOVED, &css->flags);
> > }
> >
> > This makes it easy to have weak references to a css that can be
> > dropped in a destroy() callback.
> I tried similar patch and made it to use only one shared refcnt.
> (my previous patch...)
>
> We need rolling update of refcnts and rollback. Such code tends to make
> a hole (This was what my first patch did...).
> And there is no fundamental difference between my shared refcnt and
> css_tryget() patch. Maybe above will give us better cache localily.
>
> Anyway, I have no objections to rolling update of refcnt and tryget().
> If it's a way to go, I'll go ahead.
>
> > Would this maybe even remove the need for mem_cgroup_pre_destroy()?
> >
> Yes and No. not sure. but my thinking is No.
>
> 1. pre_destroy() is called by rmdir(), in synchronized manner.
> This means that all refs in memcg will be removed at rmdir().
> If we drop refs at destroy(), it happens when dput()'s refcnt finally
> goes down to 0. This asynchronous manner is not good for users.
>
> 2. Current pre_destroy() code is very young. And we don't find any
> corner case in which pre_destroy() can't complete thier works.
> So, I don't want to remove pre_destroy() for a while.
>
> 3. Sometimes, pre_destroy() have to call try_to_free_pages() and
> we cannot know we can call try_to_free_page() in dput().
>
> Thanks,
> -Kame
>
>
>

--
Balbir

2008-12-10 13:48:24

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: fix pre_destory handler

On Wed, 10 Dec 2008 18:55:59 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2008-12-10 20:29:41]:
>
> > Paul Menage said:
> > > The reason for needing this patch is because of the non-atomic locking
> > > in cgroup_rmdir() that was introduced due to the circular locking
> > > dependency between the hotplug lock and the cgroup_mutex.
> > >
> > > But rather than adding a whole bunch more complexity, this looks like
> > > another case that could be solved by the hierarchy_mutex patches that
> > > I posted a while ago.
> > >
>
> Paul, I can't find those patches in -mm. I will try and dig them out
> from my mbox. I agree, we need a hierarchy_mutex, cgroup_mutex is
> becoming the next BKL.
>
Hmm.. but does a per-hierarchy mutex solve the problem if memory and cpuset
are mounted on the same hierarchy?


Thanks,
Daisuke Nishimura.

2008-12-10 18:25:49

by Paul Menage

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: fix pre_destory handler

On Wed, Dec 10, 2008 at 5:25 AM, Balbir Singh <[email protected]> wrote:
>
> Paul, I can't find those patches in -mm. I will try and dig them out
> from my mbox. I agree, we need a hierarchy_mutex, cgroup_mutex is
> becoming the next BKL.

It never actually went into -mm. I'll sync it with the latest tree and
try to send it out today.

Paul

2008-12-10 18:26:42

by Paul Menage

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: fix pre_destory handler

On Wed, Dec 10, 2008 at 5:47 AM, Daisuke Nishimura
<[email protected]> wrote:
> Hmm.. but doesn't per-hierarchy-mutex solve the problem if memory and cpuset
> mounted on the same hierarchy ?
>

It's not a per-hierarchy mutex - it's a per-subsystem lock against
changes on that subsystem's hierarchy. So each subsystem just has to
take its own lock, rather than a global or per-hierarchy lock. The
cgroups code takes care of acquiring the multiple locks in a safe
order when necessary.
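
(Editor's illustration, not code from the hierarchy_mutex patches themselves:
the usage pattern described above would look roughly like this, assuming a
mutex member named hierarchy_mutex in struct cgroup_subsys, i.e. the
ss->hierarchy_mutex mentioned earlier in the thread.)
==
/* Illustrative sketch only. */
mutex_lock(&ss->hierarchy_mutex);	/* blocks mkdir/rmdir/bind in ss's hierarchy */
/* ... walk the cgroups attached to ss with a stable hierarchy ... */
mutex_unlock(&ss->hierarchy_mutex);
==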

Paul

2008-12-10 18:36:14

by Paul Menage

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: fix pre_destory handler

On Wed, Dec 10, 2008 at 3:29 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
>
> (BTW, I don't like hierarchy-walk-by-small-locks approarch now because
> I'd like to implement scan-and-stop-continue routine.
> See how readdir() aginst /proc scans PID. It's very roboust against
> very temporal PIDs.)

So you mean that you want to be able to sleep, and then continue
approximately where you left off, without keeping any kind of
reference count on the last cgroup that you touched? OK, so in that
case I agree that you would need some kind of hierarch

> I tried similar patch and made it to use only one shared refcnt.
> (my previous patch...)

A crucial difference is that your css_tryget() fails if the cgroups
framework is trying to remove the cgroup but might abort due to
another subsystem holding a reference, whereas mine spins and, if the
rmdir is aborted, ends up taking a reference.

>
> We need rolling update of refcnts and rollback. Such code tends to make
> a hole (This was what my first patch did...).

Can you clarify what you mean by "rolling update of refcnts"?

>
> 1. pre_destroy() is called by rmdir(), in synchronized manner.
> This means that all refs in memcg will be removed at rmdir().
> If we drop refs at destroy(), it happens when dput()'s refcnt finally
> goes down to 0. This asynchronous manner is not good for users.

OK, fair enough.

Paul

2008-12-10 19:01:20

by Paul Menage

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: fix pre_destory handler

On Wed, Dec 10, 2008 at 10:35 AM, Paul Menage <[email protected]> wrote:
> On Wed, Dec 10, 2008 at 3:29 AM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
>>
>> (BTW, I don't like hierarchy-walk-by-small-locks approarch now because
>> I'd like to implement scan-and-stop-continue routine.
>> See how readdir() aginst /proc scans PID. It's very roboust against
>> very temporal PIDs.)
>
> So you mean that you want to be able to sleep, and then contine
> approximately where you left off, without keeping any kind of
> reference count on the last cgroup that you touched? OK, so in that
> case I agree that you would need some kind of hierarch

Oops, didn't finish that sentence.

I agree that you'd need some kind of hierarchical-restart. But I'd
like to play with / look at your cgroup-id patch more closely and see
if we can come up with something simpler that still does what you
want.

One particular problem with the patch as it stands is that the ids
should be per-css, not per-cgroup, since a css can move between
hierarchies and hence between cgroups. (Currently only at bind/unbind
time, but it still results in a cgroup change).

Paul

2008-12-11 00:22:54

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: fix pre_destory handler

On Wed, 10 Dec 2008 11:00:35 -0800
Paul Menage <[email protected]> wrote:

> On Wed, Dec 10, 2008 at 10:35 AM, Paul Menage <[email protected]> wrote:
> > On Wed, Dec 10, 2008 at 3:29 AM, KAMEZAWA Hiroyuki
> > <[email protected]> wrote:
> >>
> >> (BTW, I don't like hierarchy-walk-by-small-locks approarch now because
> >> I'd like to implement scan-and-stop-continue routine.
> >> See how readdir() aginst /proc scans PID. It's very roboust against
> >> very temporal PIDs.)
> >
> > So you mean that you want to be able to sleep, and then contine
> > approximately where you left off, without keeping any kind of
> > reference count on the last cgroup that you touched? OK, so in that
> > case I agree that you would need some kind of hierarch
>
> Oops, didn't finish that sentence.
>
> I agree that you'd need some kind of hierarchical-restart. But I'd
> like to play with / look at your cgroup-id patch more closely and see
> if we can come up with something simpler that still does what you
> want.
>
Sure, I need to do that too. It's still too young.

> One particular problem with the patch as it stands is that the ids
> should be per-css, not per-cgroup, since a css can move between
> hierarchies and hence between cgroups. (Currently only at bind/unbind
> time, but it still results in a cgroup change).
>

If it's per-css, the lookup function will be
==
struct cgroup_subsys_state *cgroup_css_lookup(subsys_id, id)
==
Do you mean this?

OK, I'll implement it and see what happens. Maybe I'll move the hooks that prepare/destroy IDs
into the subsys layer and assign IDs only when a subsys wants them.

-Kame



2008-12-11 00:24:57

by Paul Menage

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: fix pre_destory handler

On Wed, Dec 10, 2008 at 4:21 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> If per-css, looking up function will be
> ==
> struct cgroup_subsys_state *cgroup_css_lookup(subsys_id, id)
> ==
> Do you mean this ?

Yes, plausibly. And we can presumably have a separate idr per subsystem.

Paul

2008-12-11 00:26:36

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: fix pre_destory handler

On Wed, 10 Dec 2008 10:35:34 -0800
Paul Menage <[email protected]> wrote:

> On Wed, Dec 10, 2008 at 3:29 AM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> >
> > (BTW, I don't like hierarchy-walk-by-small-locks approarch now because
> > I'd like to implement scan-and-stop-continue routine.
> > See how readdir() aginst /proc scans PID. It's very roboust against
> > very temporal PIDs.)
>
> So you mean that you want to be able to sleep, and then contine
> approximately where you left off, without keeping any kind of
> reference count on the last cgroup that you touched? OK, so in that
> case I agree that you would need some kind of hierarch
>
> > I tried similar patch and made it to use only one shared refcnt.
> > (my previous patch...)
>
> A crucial difference is that your css_tryget() fails if the cgroups
> framework is trying to remove the cgroup but might abort due to
> another subsystem holding a reference, whereas mine spins and if the
> rmdir is aborted it will return a refcount.
>
sure.
> >
> > We need rolling update of refcnts and rollback. Such code tends to make
> > a hole (This was what my first patch did...).
>
> Can you clarify what you mean by "rolling update of refcnts"?
>
for (... ; i++)
	atomic_inc/dec(refcnt[i]);

But my first version of this patch did the above. I can write it again easily.
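
(Editor's illustration, not code from either posted patch: a sketch of the
"rolling update with rollback" being discussed, assuming each css refcnt
idles at 0 when only the cgroup itself references it; the function name is
made up.)
==
static int mark_all_css_removed(struct cgroup *cgrp)
{
	struct cgroup_subsys_state *css;
	int i;

	for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
		css = cgrp->subsys[i];
		if (!css)
			continue;
		/* flip the idle value (0) to -1; any other value means busy */
		if (atomic_cmpxchg(&css->refcnt, 0, -1) != 0)
			goto rollback;
	}
	return 0;

rollback:
	/* roll the -1s already set back to 0 and report busy */
	while (--i >= 0) {
		css = cgrp->subsys[i];
		if (css)
			atomic_set(&css->refcnt, 0);
	}
	return -EBUSY;
}
==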

Thanks,
-Kame

2008-12-11 00:28:50

by Paul Menage

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: fix pre_destory handler

On Wed, Dec 10, 2008 at 4:25 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
>>
>> Can you clarify what you mean by "rolling update of refcnts"?
>>
> for(..i++)
> atomic_dec/inc( refcnt[i)
>
> But my first version of this patch did above. I can write it again easily.

I just sent out a small patch collection that had my version of
css_tryget() in it - is that what you had in mind by "rolling update"?

Paul

2008-12-11 01:07:57

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: fix pre_destory handler

On Wed, 10 Dec 2008 16:24:44 -0800
Paul Menage <[email protected]> wrote:

> On Wed, Dec 10, 2008 at 4:21 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > If per-css, looking up function will be
> > ==
> > struct cgroup_subsys_state *cgroup_css_lookup(subsys_id, id)
> > ==
> > Do you mean this ?
>
> Yes, plausibly. And we can presumably have a separate idr per subsystem.
>
Okay, I'll try one.

-Kame

2008-12-11 01:10:57

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: fix pre_destory handler

On Wed, 10 Dec 2008 16:28:38 -0800
Paul Menage <[email protected]> wrote:

> On Wed, Dec 10, 2008 at 4:25 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> >>
> >> Can you clarify what you mean by "rolling update of refcnts"?
> >>
> > for(..i++)
> > atomic_dec/inc( refcnt[i)
> >
> > But my first version of this patch did above. I can write it again easily.
>
> I just sent out a small patch collection that had my version of
> css_tryget() in it - is that what you had in mind by "rolling update"?
>
Yes, but yours seems to be more robust than my version.
I'll try to use yours and prepare a patch for css_tryget() in memcg.

-Kame

2008-12-11 03:14:19

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/6] fix inactive_ratio under hierarchy

Hi

> ex)In following tree,
> /opt/cgroup/01 limit=1G
> /opt/cgroup/01/A limit=500M
> /opt/cgroup/01/A/B limit=unlimited
> /opt/cgroup/01/A/C limit=50M
> /opt/cgroup/01/Z limit=700M
>
>
> /opt/cgroup/01's inactive_ratio is calculated by limit of 1G.
> /opt/cgroup/01/A's inactive_ratio is calculated by limit of 500M
> /opt/cgroup/01/A/B's inactive_ratio is calculated by limit of 500M.
> /opt/cgroup/01/A/C's inactive_ratio is calculated by limit of 50M.
> /opt/cgroup/01/Z's inactive_ratio is calculated by limit of 700M.

This is a good choice,
but I think it is a bit of a complex rule.


How about this?

===============
Currently, memcg's inactive_ratio is calculated when the limit is set,
because page_alloc.c does so and the current implementation is a straightforward port.

However, memcg recently introduced the hierarchy feature.
Under hierarchical restriction, the effective memory limit is decided not only by memory.limit_in_bytes of the current cgroup,
but also by the parent's limit and sibling memory usage.

The optimal inactive_ratio therefore changes frequently,
so calculating it on every call is better.


Signed-off-by: KOSAKI Motohiro <[email protected]>
CC: KAMEZAWA Hiroyuki <[email protected]>
CC: Balbir Singh <[email protected]>
---
include/linux/memcontrol.h | 3 --
mm/memcontrol.c | 64 +++++++++++++++++++++------------------------
mm/vmscan.c | 2 -
3 files changed, 33 insertions(+), 36 deletions(-)

Index: b/include/linux/memcontrol.h
===================================================================
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -95,8 +95,7 @@ extern void mem_cgroup_note_reclaim_prio
int priority);
extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
int priority);
-int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg,
- struct zone *zone);
+int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
struct zone *zone,
enum lru_list lru);
Index: b/mm/memcontrol.c
===================================================================
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -167,9 +167,6 @@ struct mem_cgroup {

unsigned int swappiness;

-
- unsigned int inactive_ratio;
-
/*
* statistics. This must be placed at the end of memcg.
*/
@@ -433,15 +430,43 @@ void mem_cgroup_record_reclaim_priority(
spin_unlock(&mem->reclaim_param_lock);
}

-int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg, struct zone *zone)
+static int calc_inactive_ratio(struct mem_cgroup *memcg, unsigned long *present_pages)
{
unsigned long active;
unsigned long inactive;
+ unsigned long gb;
+ unsigned long inactive_ratio;

inactive = mem_cgroup_get_all_zonestat(memcg, LRU_INACTIVE_ANON);
active = mem_cgroup_get_all_zonestat(memcg, LRU_ACTIVE_ANON);

- if (inactive * memcg->inactive_ratio < active)
+ gb = (inactive + active) >> (30 - PAGE_SHIFT);
+ if (gb)
+ inactive_ratio = int_sqrt(10 * gb);
+ else
+ inactive_ratio = 1;
+
+ if (present_pages) {
+ present_pages[0] = inactive;
+ present_pages[1] = active;
+ }
+
+ return inactive_ratio;
+}
+
+int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
+{
+ unsigned long active;
+ unsigned long inactive;
+ unsigned long present_pages[2];
+ unsigned long inactive_ratio;
+
+ inactive_ratio = calc_inactive_ratio(memcg, present_pages);
+
+ inactive = present_pages[0];
+ active = present_pages[1];
+
+ if (inactive * inactive_ratio < active)
return 1;

return 0;
@@ -1410,29 +1435,6 @@ int mem_cgroup_shrink_usage(struct mm_st
return 0;
}

-/*
- * The inactive anon list should be small enough that the VM never has to
- * do too much work, but large enough that each inactive page has a chance
- * to be referenced again before it is swapped out.
- *
- * this calculation is straightforward porting from
- * page_alloc.c::setup_per_zone_inactive_ratio().
- * it describe more detail.
- */
-static void mem_cgroup_set_inactive_ratio(struct mem_cgroup *memcg)
-{
- unsigned int gb, ratio;
-
- gb = res_counter_read_u64(&memcg->res, RES_LIMIT) >> 30;
- if (gb)
- ratio = int_sqrt(10 * gb);
- else
- ratio = 1;
-
- memcg->inactive_ratio = ratio;
-
-}
-
static DEFINE_MUTEX(set_limit_mutex);

static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
@@ -1472,9 +1474,6 @@ static int mem_cgroup_resize_limit(struc
if (!progress) retry_count--;
}

- if (!ret)
- mem_cgroup_set_inactive_ratio(memcg);
-
return ret;
}

@@ -1833,7 +1832,7 @@ static int mem_control_stat_show(struct
}

#ifdef CONFIG_DEBUG_VM
- cb->fill(cb, "inactive_ratio", mem_cont->inactive_ratio);
+ cb->fill(cb, "inactive_ratio", calc_inactive_ratio(mem_cont, NULL));

{
int nid, zid;
@@ -2125,7 +2124,6 @@ mem_cgroup_create(struct cgroup_subsys *
res_counter_init(&mem->res, NULL);
res_counter_init(&mem->memsw, NULL);
}
- mem_cgroup_set_inactive_ratio(mem);
mem->last_scanned_child = NULL;
spin_lock_init(&mem->reclaim_param_lock);

Index: b/mm/vmscan.c
===================================================================
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1394,7 +1394,7 @@ static int inactive_anon_is_low(struct z
if (scanning_global_lru(sc))
low = inactive_anon_is_low_global(zone);
else
- low = mem_cgroup_inactive_anon_is_low(sc->mem_cgroup, zone);
+ low = mem_cgroup_inactive_anon_is_low(sc->mem_cgroup);
return low;
}
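
(Editor's note: as a quick sanity check of the ratio computed by
calc_inactive_ratio() above, here is a small standalone C sketch of the same
arithmetic; sqrt() stands in for the kernel's int_sqrt(), and truncation
matches for the values shown.)
==
#include <math.h>
#include <stdio.h>

/* inactive_ratio = int_sqrt(10 * gb), or 1 when below 1GB of anon pages */
static unsigned long inactive_ratio(unsigned long anon_pages, int page_shift)
{
	unsigned long gb = anon_pages >> (30 - page_shift);
	return gb ? (unsigned long)sqrt(10.0 * gb) : 1;
}

int main(void)
{
	/* with 4K pages: 1GB -> 3, 4GB -> 6, 16GB -> 12 */
	printf("%lu %lu %lu\n",
	       inactive_ratio(1UL << 18, 12),
	       inactive_ratio(1UL << 20, 12),
	       inactive_ratio(1UL << 22, 12));
	return 0;
}
==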


2008-12-11 03:20:42

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/6] fix inactive_ratio under hierarchy

On Thu, 11 Dec 2008 12:14:04 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> Hi
>
> > ex)In following tree,
> > /opt/cgroup/01 limit=1G
> > /opt/cgroup/01/A limit=500M
> > /opt/cgroup/01/A/B limit=unlimited
> > /opt/cgroup/01/A/C limit=50M
> > /opt/cgroup/01/Z limit=700M
> >
> >
> > /opt/cgroup/01's inactive_ratio is calculated by limit of 1G.
> > /opt/cgroup/01/A's inactive_ratio is calculated by limit of 500M
> > /opt/cgroup/01/A/B's inactive_ratio is calculated by limit of 500M.
> > /opt/cgroup/01/A/C's inactive_ratio is calculated by limit of 50M.
> > /opt/cgroup/01/Z's inactive_ratio is calculated by limit of 700M.
>
> this is one of good choice.
> but I think it is a bit complex rule.
>
>
> How about this?
>
> ===============
> Currently, inactive_ratio of memcg is calculated at setting limit.
> because page_alloc.c does so and current implementation is straightforward porting.
>
> However, memcg introduced hierarchy feature recently.
> In hierarchy restriction, memory limit is not only decided memory.limit_in_bytes of current cgroup,
> but also parent limit and sibling memory usage.
>
> Then, The optimal inactive_ratio is changed frequently.
> So, everytime calculation is better.
>
Thanks, I'll test this.

-Kame



>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> CC: KAMEZAWA Hiroyuki <[email protected]>
> CC: Balbir Singh <[email protected]>
> ---
> include/linux/memcontrol.h | 3 --
> mm/memcontrol.c | 64 +++++++++++++++++++++------------------------
> mm/vmscan.c | 2 -
> 3 files changed, 33 insertions(+), 36 deletions(-)
>
> Index: b/include/linux/memcontrol.h
> ===================================================================
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -95,8 +95,7 @@ extern void mem_cgroup_note_reclaim_prio
> int priority);
> extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
> int priority);
> -int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg,
> - struct zone *zone);
> +int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
> unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> struct zone *zone,
> enum lru_list lru);
> Index: b/mm/memcontrol.c
> ===================================================================
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -167,9 +167,6 @@ struct mem_cgroup {
>
> unsigned int swappiness;
>
> -
> - unsigned int inactive_ratio;
> -
> /*
> * statistics. This must be placed at the end of memcg.
> */
> @@ -433,15 +430,43 @@ void mem_cgroup_record_reclaim_priority(
> spin_unlock(&mem->reclaim_param_lock);
> }
>
> -int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg, struct zone *zone)
> +static int calc_inactive_ratio(struct mem_cgroup *memcg, unsigned long *present_pages)
> {
> unsigned long active;
> unsigned long inactive;
> + unsigned long gb;
> + unsigned long inactive_ratio;
>
> inactive = mem_cgroup_get_all_zonestat(memcg, LRU_INACTIVE_ANON);
> active = mem_cgroup_get_all_zonestat(memcg, LRU_ACTIVE_ANON);
>
> - if (inactive * memcg->inactive_ratio < active)
> + gb = (inactive + active) >> (30 - PAGE_SHIFT);
> + if (gb)
> + inactive_ratio = int_sqrt(10 * gb);
> + else
> + inactive_ratio = 1;
> +
> + if (present_pages) {
> + present_pages[0] = inactive;
> + present_pages[1] = active;
> + }
> +
> + return inactive_ratio;
> +}
> +
> +int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
> +{
> + unsigned long active;
> + unsigned long inactive;
> + unsigned long present_pages[2];
> + unsigned long inactive_ratio;
> +
> + inactive_ratio = calc_inactive_ratio(memcg, present_pages);
> +
> + inactive = present_pages[0];
> + active = present_pages[1];
> +
> + if (inactive * inactive_ratio < active)
> return 1;
>
> return 0;
> @@ -1410,29 +1435,6 @@ int mem_cgroup_shrink_usage(struct mm_st
> return 0;
> }
>
> -/*
> - * The inactive anon list should be small enough that the VM never has to
> - * do too much work, but large enough that each inactive page has a chance
> - * to be referenced again before it is swapped out.
> - *
> - * this calculation is straightforward porting from
> - * page_alloc.c::setup_per_zone_inactive_ratio().
> - * it describe more detail.
> - */
> -static void mem_cgroup_set_inactive_ratio(struct mem_cgroup *memcg)
> -{
> - unsigned int gb, ratio;
> -
> - gb = res_counter_read_u64(&memcg->res, RES_LIMIT) >> 30;
> - if (gb)
> - ratio = int_sqrt(10 * gb);
> - else
> - ratio = 1;
> -
> - memcg->inactive_ratio = ratio;
> -
> -}
> -
> static DEFINE_MUTEX(set_limit_mutex);
>
> static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> @@ -1472,9 +1474,6 @@ static int mem_cgroup_resize_limit(struc
> if (!progress) retry_count--;
> }
>
> - if (!ret)
> - mem_cgroup_set_inactive_ratio(memcg);
> -
> return ret;
> }
>
> @@ -1833,7 +1832,7 @@ static int mem_control_stat_show(struct
> }
>
> #ifdef CONFIG_DEBUG_VM
> - cb->fill(cb, "inactive_ratio", mem_cont->inactive_ratio);
> + cb->fill(cb, "inactive_ratio", calc_inactive_ratio(mem_cont, NULL));
>
> {
> int nid, zid;
> @@ -2125,7 +2124,6 @@ mem_cgroup_create(struct cgroup_subsys *
> res_counter_init(&mem->res, NULL);
> res_counter_init(&mem->memsw, NULL);
> }
> - mem_cgroup_set_inactive_ratio(mem);
> mem->last_scanned_child = NULL;
> spin_lock_init(&mem->reclaim_param_lock);
>
> Index: b/mm/vmscan.c
> ===================================================================
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1394,7 +1394,7 @@ static int inactive_anon_is_low(struct z
> if (scanning_global_lru(sc))
> low = inactive_anon_is_low_global(zone);
> else
> - low = mem_cgroup_inactive_anon_is_low(sc->mem_cgroup, zone);
> + low = mem_cgroup_inactive_anon_is_low(sc->mem_cgroup);
> return low;
> }
>
>
>
>

2008-12-11 12:44:34

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] memcg: fix pre_destory handler

On Wed, 10 Dec 2008 16:24:44 -0800
Paul Menage <[email protected]> wrote:

> On Wed, Dec 10, 2008 at 4:21 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > If per-css, looking up function will be
> > ==
> > struct cgroup_subsys_state *cgroup_css_lookup(subsys_id, id)
> > ==
> > Do you mean this ?
>
> Yes, plausibly. And we can presumably have a separate idr per subsystem.
>

This is my tentative patch. Does this look and interface seem OK?
(still under test)
==

From: KAMEZAWA Hiroyuki <[email protected]>

Patch for Per-CSS ID and private hierarchy code.

This patch tries to assign an ID to each css. It attaches a unique ID to each
css and provides the following functions.

 - css_lookup(subsys, id)
   returns the struct cgroup_subsys_state with that id.
 - css_get_next(subsys, id, rootid, depth, foundid)
   returns the next css under "root" by scanning a bitmap (not by tree walk).

When cgroup_subsys->use_id is set, the id field of each css is maintained.
kernel/cgroup.c just prepares
 - the css_id of the root css for the subsys
 - alloc/free ID functions.
So, each subsys should allocate an ID in its attach() callback if necessary.

There are several reasons to develop this.

- While trying to implement hierarchy in memory cgroup, we have to
implement "walk under hierarchy" code.
Now it consists of cgroup_lock and tree up-down code. Because
memory cgroup has to do hierarchy walks in other places for
intelligent processing, we'll reuse the "walk" code.
But taking "cgroup_lock" while walking the tree can cause deadlocks.
An easier way is helpful.

- SwapCgroup uses an array of "pointer" to record the owner of each swap entry.
With IDs, we can reduce this to a "short" or "int", as sketched just below.
This means IDs are useful for reducing the space consumed by pointers
if the access cost is not a problem.
(I hear bio-cgroup will use the same kind of...)
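
(Editor's illustration of the space saving meant above; the array names and
MAX_SWAP_ENTRIES are hypothetical.)
==
struct mem_cgroup *swap_owner_ptr[MAX_SWAP_ENTRIES]; /* 8 bytes per entry on 64-bit */
unsigned short     swap_owner_id[MAX_SWAP_ENTRIES];  /* 2 bytes per entry with CSS IDs */
==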

Example) OOM-Killer under hierarchy.
	do {
		rcu_read_lock();
		next = css_get_next(ss, nextid, rootid, rootdepth, &found);
		/* check sanity of next here */
		if (next)
			css_tryget(next);
		rcu_read_unlock();
		if (!next)
			break;
		cgroup_scan_tasks(...);	/* e.g. select_bad_process */
		/* record score here... */
		nextid = found + 1;
	} while (1);


Characteristics:
- Each css has a unique ID under its subsys.
- A css ID contains the "ID", its "depth in hierarchy", and the stack of ancestor IDs.
- Allowed IDs are 1-65535; ID 0 is the UNUSED ID.

Design Choices:
- scan-by-ID vs. scan-by-tree-walk.
As /proc's PID scan does, scan-by-ID is robust when scanning is done
by the following kind of routine:
scan -> rest a while (release a lock) -> continue from where it was interrupted.
memcg's hierarchical reclaim does this.

- When subsys->use_id is set, the number of css objects in the system is limited to
65534.
- The max depth of the hierarchy is also limited.

Changelog: (v4) -> (v5)
- Totally re-designed as per-css ID.
Changelog:(v3) -> (v4)
- updated comments.
- renamed hierarchy_code[] to stack[]
- merged prepare_id routines.

Changelog (v2) -> (v3)
- removed cgroup_id_getref().
- added cgroup_id_tryget().

Changelog (v1) -> (v2):
- Design change: show only ID(integer) to outside of cgroup.c
- moved cgroup ID definition from include/ to kernel/cgroup.c
- struct cgroup_id is freed by RCU.
- changed interface from pointer to "int"
- kill_sb() is handled.
- ID 0 as unused ID.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/cgroup.h | 51 +++++++++
include/linux/idr.h | 1
kernel/cgroup.c | 251 ++++++++++++++++++++++++++++++++++++++++++++++++-
lib/idr.c | 46 ++++++++
4 files changed, 347 insertions(+), 2 deletions(-)

Index: mmotm-2.6.28-Dec10/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.28-Dec10.orig/include/linux/cgroup.h
+++ mmotm-2.6.28-Dec10/include/linux/cgroup.h
@@ -15,13 +15,14 @@
#include <linux/cgroupstats.h>
#include <linux/prio_heap.h>
#include <linux/rwsem.h>
-
+#include <linux/idr.h>
#ifdef CONFIG_CGROUPS

struct cgroupfs_root;
struct cgroup_subsys;
struct inode;
struct cgroup;
+struct css_id;

extern int cgroup_init_early(void);
extern int cgroup_init(void);
@@ -59,6 +60,8 @@ struct cgroup_subsys_state {
atomic_t refcnt;

unsigned long flags;
+ /* ID for this css, if possible */
+ struct css_id *id;
};

/* bits in struct cgroup_subsys_state flags field */
@@ -360,6 +363,13 @@ struct cgroup_subsys {
int active;
int disabled;
int early_init;
+ /*
+ * set 1 if the subsys uses IDs. IDs are not available before cgroup_init()
+ * (i.e. not available at early_init time).
+ */
+ int use_id;
+ /* this defines max depth of hierarchy of this subsys if using ID. */
+ int max_depth;
#define MAX_CGROUP_TYPE_NAMELEN 32
const char *name;

@@ -373,6 +383,9 @@ struct cgroup_subsys {
/* Protected by this->hierarchy_mutex and cgroup_lock() */
struct cgroupfs_root *root;
struct list_head sibling;
+ /* used when use_id == 1 */
+ struct idr idr;
+ spinlock_t id_lock;
};

#define SUBSYS(_x) extern struct cgroup_subsys _x ## _subsys;
@@ -426,6 +439,42 @@ void cgroup_iter_end(struct cgroup *cgrp
int cgroup_scan_tasks(struct cgroup_scanner *scan);
int cgroup_attach_task(struct cgroup *, struct task_struct *);

+/*
+ * A CSS ID is an ID for each css struct under a subsys. It only works when
+ * cgroup_subsys->use_id != 0. It can be used for lookup and scanning.
+ * A CSS ID is assigned at css allocation (create) and removed when the
+ * refcnt of the ID goes down to 0. The refcnt is incremented when a subsys
+ * wants to avoid reuse of the ID for persistent objects. Usually, the refcnt
+ * of an ID is already 0 when the cgroup is removed.
+ * These look-up and scan functions should be called under rcu_read_lock().
+ * cgroup_lock() is not necessary.
+ *
+ * Note: When using IDs, the max depth of the hierarchy is determined by
+ * cgroup_subsys->max_depth.
+ */
+
+/* called at create() */
+int css_id_alloc(struct cgroup_subsys *ss, struct cgroup_subsys_state *parent,
+ struct cgroup_subsys_state *css);
+/* called at destroy(), or somewhere we can free ID */
+void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css);
+
+/* Find a cgroup which the "ID" is attached. */
+struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id);
+/*
+ * Get next cgroup under tree. Returning a cgroup which has equal or greater
+ * ID than "id" in argument.
+ */
+struct cgroup_subsys_state *css_get_next(struct cgroup_subsys *ss,
+ int id, int rootid, int depth, int *foundid);
+
+/* get id and depth of css */
+unsigned short css_id(struct cgroup_subsys_state *css);
+unsigned short css_depth(struct cgroup_subsys_state *css);
+/* returns non-zero if root is ancestor of cg */
+int css_is_ancestor(struct cgroup_subsys_state *cg,
+ struct cgroup_subsys_state *root);
+
#else /* !CONFIG_CGROUPS */

static inline int cgroup_init_early(void) { return 0; }
Index: mmotm-2.6.28-Dec10/kernel/cgroup.c
===================================================================
--- mmotm-2.6.28-Dec10.orig/kernel/cgroup.c
+++ mmotm-2.6.28-Dec10/kernel/cgroup.c
@@ -46,7 +46,7 @@
#include <linux/cgroupstats.h>
#include <linux/hash.h>
#include <linux/namei.h>
-
+#include <linux/idr.h>
#include <asm/atomic.h>

static DEFINE_MUTEX(cgroup_mutex);
@@ -185,6 +185,8 @@ struct cg_cgroup_link {
static struct css_set init_css_set;
static struct cg_cgroup_link init_css_set_link;

+static int cgroup_subsys_init_idr(struct cgroup_subsys *ss);
+
/* css_set_lock protects the list of css_set objects, and the
* chain of tasks off each css_set. Nests outside task->alloc_lock
* due to cgroup_iter_start() */
@@ -2684,6 +2686,8 @@ int __init cgroup_init(void)
struct cgroup_subsys *ss = subsys[i];
if (!ss->early_init)
cgroup_init_subsys(ss);
+ if (ss->use_id)
+ cgroup_subsys_init_idr(ss);
}

/* Add init_css_set to the hash table */
@@ -3215,3 +3219,248 @@ static int __init cgroup_disable(char *s
return 1;
}
__setup("cgroup_disable=", cgroup_disable);
+
+/*
+ * CSS ID
+ */
+struct css_id {
+ /*
+ * The css to which this ID points. If the cgroup is removed,
+ * this will point to NULL.
+ */
+ struct cgroup_subsys_state *css;
+ /*
+ * ID of this css.
+ */
+ unsigned short id;
+ /*
+ * Depth in hierarchy which this ID belongs to.
+ */
+ unsigned short depth;
+ /*
+ * ID is freed by RCU. (and lookup routine is RCU safe.)
+ */
+ struct rcu_head rcu_head;
+ /*
+ * Hierarchy of CSS ID belongs to.
+ */
+ unsigned short stack[0];
+};
+
+/*
+ * To get an ID other than 0, this should be called when !cgroup_is_removed().
+ */
+unsigned short css_id(struct cgroup_subsys_state *css)
+{
+ struct css_id *cssid = rcu_dereference(css->id);
+
+ if (cssid)
+ return cssid->id;
+ return 0;
+}
+
+unsigned short css_depth(struct cgroup_subsys_state *css)
+{
+ struct css_id *cssid = rcu_dereference(css->id);
+
+ if (cssid)
+ return cssid->depth;
+ return 0;
+}
+
+int css_is_ancestor(struct cgroup_subsys_state *css,
+ struct cgroup_subsys_state *root)
+{
+ struct css_id *id = css->id;
+ struct css_id *ans = root->id;
+
+ if (!id || !ans)
+ return 0;
+ return id->stack[ans->depth] == ans->id;
+}
+
+static void __free_css_id_cb(struct rcu_head *head)
+{
+ struct css_id *id;
+
+ id = container_of(head, struct css_id, rcu_head);
+ kfree(id);
+}
+
+void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css)
+{
+ struct css_id *id = css->id;
+
+ BUG_ON(!ss->use_id);
+
+ rcu_assign_pointer(id->css, NULL);
+ rcu_assign_pointer(css->id, NULL);
+ spin_lock(&ss->id_lock);
+ idr_remove(&ss->idr, id->id);
+ spin_unlock(&ss->id_lock);
+ call_rcu(&id->rcu_head, __free_css_id_cb);
+}
+
+
+static int __get_and_prepare_newid(struct cgroup_subsys *ss,
+ struct css_id **ret)
+{
+ struct css_id *newid;
+ int myid, error, size;
+
+ BUG_ON(!ss->use_id);
+
+ size = sizeof(*newid) + sizeof(unsigned short) * ss->max_depth;
+ newid = kzalloc(size, GFP_KERNEL);
+ if (!newid)
+ return -ENOMEM;
+ /* get id */
+ if (unlikely(!idr_pre_get(&ss->idr, GFP_KERNEL))) {
+ error = -ENOMEM;
+ goto err_out;
+ }
+ spin_lock(&ss->id_lock);
+ /* Don't use 0 */
+ error = idr_get_new_above(&ss->idr, newid, 1, &myid);
+ spin_unlock(&ss->id_lock);
+
+ /* Returns an error when there is no free space for a new ID. */
+ if (error) {
+ error = -ENOSPC;
+ goto err_out;
+ }
+
+ newid->id = myid;
+ *ret = newid;
+ return 0;
+err_out:
+ kfree(newid);
+ return error;
+
+}
+
+
+static int __init cgroup_subsys_init_idr(struct cgroup_subsys *ss)
+{
+ struct css_id *newid;
+ struct cgroup_subsys_state *rootcss;
+ int err = -ENOMEM;
+
+ spin_lock_init(&ss->id_lock);
+ idr_init(&ss->idr);
+
+ rootcss = init_css_set.subsys[ss->subsys_id];
+ err = __get_and_prepare_newid(ss, &newid);
+ if (err)
+ return err;
+
+ newid->depth = 0;
+ newid->stack[0] = newid->id;
+ newid->css = rootcss;
+ rootcss->id = newid;
+ return 0;
+}
+
+int css_id_alloc(struct cgroup_subsys *ss,
+ struct cgroup_subsys_state *parent,
+ struct cgroup_subsys_state *css)
+{
+ int i, depth = 0;
+ struct css_id *cssid, *parent_id = NULL;
+ int error;
+
+ if (parent) {
+ parent_id = parent->id;
+ depth = parent_id->depth + 1;
+ }
+
+ if (depth >= ss->max_depth)
+ return -ENOSPC;
+
+ error = __get_and_prepare_newid(ss, &cssid);
+ if (error)
+ return error;
+
+ for (i = 0; i < depth; i++)
+ cssid->stack[i] = parent_id->stack[i];
+ cssid->stack[depth] = cssid->id;
+ cssid->depth = depth;
+
+ rcu_assign_pointer(cssid->css, css);
+ rcu_assign_pointer(css->id, cssid);
+
+ return 0;
+}
+
+/**
+ * css_lookup - lookup css by id
+ * @ss: pointer to the subsystem
+ * @id: the id of the css to be looked up
+ *
+ * Returns a pointer to the css if there is a valid css with this id, NULL if not.
+ * Should be called under rcu_read_lock().
+ */
+
+struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id)
+{
+ struct cgroup_subsys_state *css = NULL;
+ struct css_id *cssid = NULL;
+
+ BUG_ON(!ss->use_id);
+ rcu_read_lock();
+ cssid = idr_find(&ss->idr, id);
+
+ if (unlikely(!cssid))
+ goto out;
+
+ css = rcu_dereference(cssid->css);
+out:
+ rcu_read_unlock();
+ return css;
+}
+
+/**
+ * css_get_next - lookup next cgroup under specified hierarchy.
+ * @ss: pointer to subsystem
+ * @id: current position of iteration.
+ * @rootid: search tree under this.
+ * @depth: depth of root id.
+ * @foundid: position of found object.
+ *
+ * Search next css under the specified hierarchy of rootid. Calling under
+ * rcu_read_lock() is necessary. Returns NULL if it reaches the end.
+ */
+struct cgroup_subsys_state *
+css_get_next(struct cgroup_subsys *ss,
+ int id, int rootid, int depth, int *foundid)
+{
+ struct cgroup_subsys_state *ret = NULL;
+ struct css_id *tmp;
+ int tmpid;
+
+ BUG_ON(!ss->use_id);
+ rcu_read_lock();
+ tmpid = id;
+ while (1) {
+ /* scan next entry from bitmap(tree) */
+ spin_lock(&ss->id_lock);
+ tmp = idr_get_next(&ss->idr, &tmpid);
+ spin_unlock(&ss->id_lock);
+
+ if (!tmp) {
+ ret = NULL;
+ break;
+ }
+ if (tmp->stack[depth] == rootid) {
+ ret = rcu_dereference(tmp->css);
+ /* Sanity check and check hierarchy */
+ if (ret && !css_is_removed(ret))
+ break;
+ }
+ tmpid = tmpid + 1;
+ }
+
+ rcu_read_unlock();
+ *foundid = tmpid;
+ return ret;
+}
+
Index: mmotm-2.6.28-Dec10/include/linux/idr.h
===================================================================
--- mmotm-2.6.28-Dec10.orig/include/linux/idr.h
+++ mmotm-2.6.28-Dec10/include/linux/idr.h
@@ -106,6 +106,7 @@ int idr_get_new(struct idr *idp, void *p
int idr_get_new_above(struct idr *idp, void *ptr, int starting_id, int *id);
int idr_for_each(struct idr *idp,
int (*fn)(int id, void *p, void *data), void *data);
+void *idr_get_next(struct idr *idp, int *nextid);
void *idr_replace(struct idr *idp, void *ptr, int id);
void idr_remove(struct idr *idp, int id);
void idr_remove_all(struct idr *idp);
Index: mmotm-2.6.28-Dec10/lib/idr.c
===================================================================
--- mmotm-2.6.28-Dec10.orig/lib/idr.c
+++ mmotm-2.6.28-Dec10/lib/idr.c
@@ -573,6 +573,52 @@ int idr_for_each(struct idr *idp,
EXPORT_SYMBOL(idr_for_each);

/**
+ * idr_get_next - lookup the next object with an id >= the given id
+ * @idp: idr handle
+ * @nextidp: pointer to the lookup key, updated to the id actually found
+ *
+ * Returns a pointer to the registered object with the smallest id greater
+ * than or equal to the given id, or NULL if there is none.
+ */
+
+void *idr_get_next(struct idr *idp, int *nextidp)
+{
+ struct idr_layer *p, *pa[MAX_LEVEL];
+ struct idr_layer **paa = &pa[0];
+ int id = *nextidp;
+ int n, max;
+
+ /* find first ent */
+ n = idp->layers * IDR_BITS;
+ max = 1 << n;
+ p = rcu_dereference(idp->top);
+ if (!p)
+ return NULL;
+
+ while (id < max) {
+ while (n > 0 && p) {
+ n -= IDR_BITS;
+ *paa++ = p;
+ p = rcu_dereference(p->ary[(id >> n) & IDR_MASK]);
+ }
+
+ if (p) {
+ *nextidp = id;
+ return p;
+ }
+
+ id += 1 << n;
+ while (n < fls(id)) {
+ n += IDR_BITS;
+ p = *--paa;
+ }
+ }
+ return NULL;
+}
+
+
+
+/**
* idr_replace - replace pointer for given id
* @idp: idr handle
* @ptr: pointer you want associated with the id