This is an update from the previous CSS ID patch set, plus some updates to memcg.
But it seems people are enjoying LinuxConf.au, so I'll keep this set on my box
for a while ;)
This set contains the following patches:
==
[1/7] add CSS ID to cgroup.
[2/7] use CSS ID under memcg
[3/7] show more hierarchical information via memory.stat file
[4/7] fix "set limit" to return -EBUSY if it seems difficult to shrink usage.
[5/7] fix OOM-Killer under memcg's hierarchy.
[6/7] fix frequent -EBUSY at cgroup rmdir() with memory subsystem.
[7/7] support background reclaim. (for RFC)
Patches 4 and 7 are new.
Thanks,
-Kame
From: KAMEZAWA Hiroyuki <[email protected]>
Patch for per-CSS (Cgroup Subsys State) ID and private hierarchy code.
This patch attaches a unique ID to each css and provides the following:
 - css_lookup(subsys, id)
   returns a pointer to the struct cgroup_subsys_state with that id.
 - css_get_next(subsys, id, root, foundid)
   returns the next css under "root" by scanning IDs upward from id.
When cgroup_subsys->use_id is set, an ID for each css is maintained.
The cgroup framework only prepares:
 - the css_id of the root css for the subsys.
 - an ID is automatically attached at creation of a css.
 - an ID is *not* freed automatically, because the cgroup framework
   does not know the lifetime of the cgroup_subsys_state.
   The free_css_id() function is provided for that; it must be called by the subsys.
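As an illustration of that rule, a subsystem that sets ->use_id would typically
release the ID from its ->destroy() callback. This is only a sketch, not code
from this patch; the "foo" subsystem and its state structure are hypothetical:

	/* hypothetical subsystem state embedding a css */
	struct foo_state {
		struct cgroup_subsys_state css;
		/* ... private data ... */
	};

	static void foo_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
	{
		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];

		/* the cgroup framework never frees the ID for us */
		free_css_id(ss, css);
		kfree(container_of(css, struct foo_state, css));
	}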
There are several reasons to develop this:
 - Saving space. For example, memcg's swap_cgroup is an array of pointers
   to cgroups, but it does not need to be very fast. By replacing pointers
   (8 bytes per entry) with IDs (2 bytes per entry), we can reduce memory
   usage considerably (see the sketch after the scan example below).
 - Scanning without locks.
   CSS ID provides a "scan IDs under this root" function. With it, scanning
   the css tree under a root can be written without locks.
ex)
	do {
		rcu_read_lock();
		next = css_get_next(subsys, id, root, &found);
		/* check sanity of next here */
		if (next && css_tryget(next)) {
			/* ... use next, then css_put(next) ... */
		}
		rcu_read_unlock();
		id = found + 1;
	} while (next);
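And here is a rough sketch of the space-saving point above: a big per-entry
table can store 2-byte css IDs instead of 8-byte pointers and resolve them on
demand. The table name, its size, and the lookup helper below are illustrative
only, not memcg's real swap_cgroup layout:

	#define NR_ENTRIES (1 << 20)			/* illustrative size */
	static unsigned short owner_id[NR_ENTRIES];	/* 2 bytes per entry */

	/* must be called under rcu_read_lock() */
	static struct cgroup_subsys_state *owner_css(unsigned long ent)
	{
		unsigned short id = owner_id[ent];

		if (!id)		/* ID 0 means "no owner recorded" */
			return NULL;
		return css_lookup(&mem_cgroup_subsys, id);
	}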
Characteristics:
 - Each css has a unique ID under its subsys.
 - The lifetime of an ID is controlled by the subsys.
 - A css ID contains the "ID", the "depth in hierarchy", and a stack of
   ancestor IDs.
 - Allowed IDs are 1-65535; ID 0 is the unused ID.
Design Choices:
 - scan-by-ID vs. scan-by-tree-walk.
   As with /proc's pid scan, scan-by-ID is robust when scanning is done by
   the following kind of routine:
	scan -> rest a while (release locks) -> continue from where it stopped
   memcg's hierarchical reclaim does exactly this.
 - When subsys->use_id is set, the number of css in the system is limited
   to 65535.
Changelog: (v7) -> (v8)
- Update id->css pointer after cgroup is populated.
Changelog: (v6) -> (v7)
- refcnt for CSS ID is removed. Subsys can do it by own logic.
- New id allocation is done automatically.
- fixed typos.
- fixed limit check of ID.
Changelog: (v5) -> (v6)
- max depth is removed.
- changed arguments to "scan"
Changelog: (v4) -> (v5)
- Totally re-designed as per-css ID.
Changelog:(v3) -> (v4)
- updated comments.
- renamed hierarchy_code[] to stack[]
- merged prepare_id routines.
Changelog (v2) -> (v3)
- removed cgroup_id_getref().
- added cgroup_id_tryget().
Changelog (v1) -> (v2):
- Design change: show only ID(integer) to outside of cgroup.c
- moved cgroup ID definition from include/ to kernel/cgroup.c
- struct cgroup_id is freed by RCU.
- changed interface from pointer to "int"
- kill_sb() is handled.
- ID 0 as unused ID.
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/cgroup.h | 50 ++++++++
include/linux/idr.h | 1
kernel/cgroup.c | 289 ++++++++++++++++++++++++++++++++++++++++++++++++-
lib/idr.c | 46 +++++++
4 files changed, 385 insertions(+), 1 deletion(-)
Index: mmotm-2.6.29-Jan16/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.29-Jan16.orig/include/linux/cgroup.h
+++ mmotm-2.6.29-Jan16/include/linux/cgroup.h
@@ -15,6 +15,7 @@
#include <linux/cgroupstats.h>
#include <linux/prio_heap.h>
#include <linux/rwsem.h>
+#include <linux/idr.h>
#ifdef CONFIG_CGROUPS
@@ -22,6 +23,7 @@ struct cgroupfs_root;
struct cgroup_subsys;
struct inode;
struct cgroup;
+struct css_id;
extern int cgroup_init_early(void);
extern int cgroup_init(void);
@@ -59,6 +61,8 @@ struct cgroup_subsys_state {
atomic_t refcnt;
unsigned long flags;
+ /* ID for this css, if possible */
+ struct css_id *id;
};
/* bits in struct cgroup_subsys_state flags field */
@@ -363,6 +367,11 @@ struct cgroup_subsys {
int active;
int disabled;
int early_init;
+ /*
+ * True if this subsys uses ID. ID is not available before cgroup_init()
+ * (not available in early_init time.)
+ */
+ bool use_id;
#define MAX_CGROUP_TYPE_NAMELEN 32
const char *name;
@@ -384,6 +393,9 @@ struct cgroup_subsys {
*/
struct cgroupfs_root *root;
struct list_head sibling;
+ /* used when use_id == true */
+ struct idr idr;
+ spinlock_t id_lock;
};
#define SUBSYS(_x) extern struct cgroup_subsys _x ## _subsys;
@@ -437,6 +449,44 @@ void cgroup_iter_end(struct cgroup *cgrp
int cgroup_scan_tasks(struct cgroup_scanner *scan);
int cgroup_attach_task(struct cgroup *, struct task_struct *);
+/*
+ * CSS ID is ID for cgroup_subsys_state structs under subsys. This only works
+ * if cgroup_subsys.use_id == true. It can be used for looking up and scanning.
+ * CSS ID is assigned at cgroup allocation (create) automatically
+ * and removed when subsys calls free_css_id() function. This is because
+ * the lifetime of cgroup_subsys_state is subsys's matter.
+ *
+ * Looking up and scanning function should be called under rcu_read_lock().
+ * Taking cgroup_mutex()/hierarchy_mutex() is not necessary for following calls.
+ * But the css returned by this routine can be "not populated yet" or "being
+ * destroyed". The caller should check css and cgroup's status.
+ */
+
+/*
+ * Typically Called at ->destroy(), or somewhere the subsys frees
+ * cgroup_subsys_state.
+ */
+void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css);
+
+/* Find a cgroup_subsys_state which has given ID */
+
+struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id);
+
+/*
+ * Get a cgroup whose id is greater than or equal to id under tree of root.
+ * Returning a cgroup_subsys_state or NULL.
+ */
+struct cgroup_subsys_state *css_get_next(struct cgroup_subsys *ss, int id,
+ struct cgroup_subsys_state *root, int *foundid);
+
+/* Returns true if root is ancestor of cg */
+bool css_is_ancestor(struct cgroup_subsys_state *cg,
+ struct cgroup_subsys_state *root);
+
+/* Get id and depth of css */
+unsigned short css_id(struct cgroup_subsys_state *css);
+unsigned short css_depth(struct cgroup_subsys_state *css);
+
#else /* !CONFIG_CGROUPS */
static inline int cgroup_init_early(void) { return 0; }
Index: mmotm-2.6.29-Jan16/kernel/cgroup.c
===================================================================
--- mmotm-2.6.29-Jan16.orig/kernel/cgroup.c
+++ mmotm-2.6.29-Jan16/kernel/cgroup.c
@@ -94,7 +94,6 @@ struct cgroupfs_root {
char release_agent_path[PATH_MAX];
};
-
/*
* The "rootnode" hierarchy is the "dummy hierarchy", reserved for the
* subsystems that are otherwise unattached - it never has more than a
@@ -102,6 +101,39 @@ struct cgroupfs_root {
*/
static struct cgroupfs_root rootnode;
+/*
+ * CSS ID -- ID per subsys's Cgroup Subsys State(CSS). used only when
+ * cgroup_subsys->use_id != 0.
+ */
+#define CSS_ID_MAX (65535)
+struct css_id {
+ /*
+ * The css to which this ID points. This pointer is set to valid value
+ * after cgroup is populated. If cgroup is removed, this will be NULL.
+ * This pointer is expected to be RCU-safe because destroy()
+ * is called after synchronize_rcu(). But for safe use, css_is_removed()
+ * css_tryget() should be used for avoiding race.
+ */
+ struct cgroup_subsys_state *css;
+ /*
+ * ID of this css.
+ */
+ unsigned short id;
+ /*
+ * Depth in hierarchy which this ID belongs to.
+ */
+ unsigned short depth;
+ /*
+ * ID is freed by RCU. (and lookup routine is RCU safe.)
+ */
+ struct rcu_head rcu_head;
+ /*
+ * Hierarchy of CSS ID belongs to.
+ */
+ unsigned short stack[0]; /* Array of Length (depth+1) */
+};
+
+
/* The list of hierarchy roots */
static LIST_HEAD(roots);
@@ -185,6 +217,8 @@ struct cg_cgroup_link {
static struct css_set init_css_set;
static struct cg_cgroup_link init_css_set_link;
+static int cgroup_subsys_init_idr(struct cgroup_subsys *ss);
+
/* css_set_lock protects the list of css_set objects, and the
* chain of tasks off each css_set. Nests outside task->alloc_lock
* due to cgroup_iter_start() */
@@ -567,6 +601,9 @@ static struct backing_dev_info cgroup_ba
.capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK,
};
+static int alloc_css_id(struct cgroup_subsys *ss,
+ struct cgroup *parent, struct cgroup *child);
+
static struct inode *cgroup_new_inode(mode_t mode, struct super_block *sb)
{
struct inode *inode = new_inode(sb);
@@ -2324,6 +2361,17 @@ static int cgroup_populate_dir(struct cg
if (ss->populate && (err = ss->populate(ss, cgrp)) < 0)
return err;
}
+ /* This cgroup is ready now */
+ for_each_subsys(cgrp->root, ss) {
+ struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
+ /*
+ * Update id->css pointer and make this css visible from
+ * CSS ID functions. This pointer will be dereferenced
+ * from RCU-read-side without locks.
+ */
+ if (css->id)
+ rcu_assign_pointer(css->id->css, css);
+ }
return 0;
}
@@ -2335,6 +2383,7 @@ static void init_cgroup_css(struct cgrou
css->cgroup = cgrp;
atomic_set(&css->refcnt, 1);
css->flags = 0;
+ css->id = NULL;
if (cgrp == dummytop)
set_bit(CSS_ROOT, &css->flags);
BUG_ON(cgrp->subsys[ss->subsys_id]);
@@ -2410,6 +2459,10 @@ static long cgroup_create(struct cgroup
goto err_destroy;
}
init_cgroup_css(css, ss, cgrp);
+ if (ss->use_id)
+ if (alloc_css_id(ss, parent, cgrp))
+ goto err_destroy;
+ /* At error, ->destroy() callback has to free assigned ID. */
}
cgroup_lock_hierarchy(root);
@@ -2701,6 +2754,8 @@ int __init cgroup_init(void)
struct cgroup_subsys *ss = subsys[i];
if (!ss->early_init)
cgroup_init_subsys(ss);
+ if (ss->use_id)
+ cgroup_subsys_init_idr(ss);
}
/* Add init_css_set to the hash table */
@@ -3234,3 +3289,235 @@ static int __init cgroup_disable(char *s
return 1;
}
__setup("cgroup_disable=", cgroup_disable);
+
+/*
+ * Functons for CSS ID.
+ */
+
+/*
+ *To get ID other than 0, this should be called when !cgroup_is_removed().
+ */
+unsigned short css_id(struct cgroup_subsys_state *css)
+{
+ struct css_id *cssid = rcu_dereference(css->id);
+
+ if (cssid)
+ return cssid->id;
+ return 0;
+}
+
+unsigned short css_depth(struct cgroup_subsys_state *css)
+{
+ struct css_id *cssid = rcu_dereference(css->id);
+
+ if (cssid)
+ return cssid->depth;
+ return 0;
+}
+
+bool css_is_ancestor(struct cgroup_subsys_state *child,
+ struct cgroup_subsys_state *root)
+{
+ struct css_id *child_id = rcu_dereference(child->id);
+ struct css_id *root_id = rcu_dereference(root->id);
+
+ if (!child_id || !root_id || (child_id->depth < root_id->depth))
+ return false;
+ return child_id->stack[root_id->depth] == root_id->id;
+}
+
+static void __free_css_id_cb(struct rcu_head *head)
+{
+ struct css_id *id;
+
+ id = container_of(head, struct css_id, rcu_head);
+ kfree(id);
+}
+
+void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css)
+{
+ struct css_id *id = css->id;
+ /* When this is called before css_id initialization, id can be NULL */
+ if (!id)
+ return;
+
+ BUG_ON(!ss->use_id);
+
+ rcu_assign_pointer(id->css, NULL);
+ rcu_assign_pointer(css->id, NULL);
+ spin_lock(&ss->id_lock);
+ idr_remove(&ss->idr, id->id);
+ spin_unlock(&ss->id_lock);
+ call_rcu(&id->rcu_head, __free_css_id_cb);
+}
+
+/*
+ * This is called by init or create(). Then, calls to this function are
+ * always serialized (By cgroup_mutex() at create()).
+ */
+
+static struct css_id *get_new_cssid(struct cgroup_subsys *ss, int depth)
+{
+ struct css_id *newid;
+ int myid, error, size;
+
+ BUG_ON(!ss->use_id);
+
+ size = sizeof(*newid) + sizeof(unsigned short) * (depth + 1);
+ newid = kzalloc(size, GFP_KERNEL);
+ if (!newid)
+ return ERR_PTR(-ENOMEM);
+ /* get id */
+ if (unlikely(!idr_pre_get(&ss->idr, GFP_KERNEL))) {
+ error = -ENOMEM;
+ goto err_out;
+ }
+ spin_lock(&ss->id_lock);
+ /* Don't use 0. allocates an ID of 1-65535 */
+ error = idr_get_new_above(&ss->idr, newid, 1, &myid);
+ spin_unlock(&ss->id_lock);
+
+ /* Returns error when there are no free spaces for new ID.*/
+ if (error) {
+ error = -ENOSPC;
+ goto err_out;
+ }
+ if (myid > CSS_ID_MAX)
+ goto remove_idr;
+
+ newid->id = myid;
+ newid->depth = depth;
+ return newid;
+remove_idr:
+ error = -ENOSPC;
+ spin_lock(&ss->id_lock);
+ idr_remove(&ss->idr, myid);
+ spin_unlock(&ss->id_lock);
+err_out:
+ kfree(newid);
+ return ERR_PTR(error);
+
+}
+
+static int __init cgroup_subsys_init_idr(struct cgroup_subsys *ss)
+{
+ struct css_id *newid;
+ struct cgroup_subsys_state *rootcss;
+
+ spin_lock_init(&ss->id_lock);
+ idr_init(&ss->idr);
+
+ rootcss = init_css_set.subsys[ss->subsys_id];
+ newid = get_new_cssid(ss, 0);
+ if (IS_ERR(newid))
+ return PTR_ERR(newid);
+
+ newid->stack[0] = newid->id;
+ newid->css = rootcss;
+ rootcss->id = newid;
+ return 0;
+}
+
+static int alloc_css_id(struct cgroup_subsys *ss, struct cgroup *parent,
+ struct cgroup *child)
+{
+ int subsys_id, i, depth = 0;
+ struct cgroup_subsys_state *parent_css, *child_css;
+ struct css_id *child_id, *parent_id = NULL;
+
+ subsys_id = ss->subsys_id;
+ parent_css = parent->subsys[subsys_id];
+ child_css = child->subsys[subsys_id];
+ depth = css_depth(parent_css) + 1;
+ parent_id = parent_css->id;
+
+ child_id = get_new_cssid(ss, depth);
+ if (IS_ERR(child_id))
+ return PTR_ERR(child_id);
+
+ for (i = 0; i < depth; i++)
+ child_id->stack[i] = parent_id->stack[i];
+ child_id->stack[depth] = child_id->id;
+ /*
+ * child_id->css pointer will be set after this cgroup is available
+ * see cgroup_populate_dir()
+ */
+ rcu_assign_pointer(child_css->id, child_id);
+
+ return 0;
+}
+
+/**
+ * css_lookup - lookup css by id
+ * @ss: cgroup subsys to be looked into.
+ * @id: the id
+ *
+ * Returns pointer to cgroup_subsys_state if there is valid one with id.
+ * NULL if not. Should be called under rcu_read_lock()
+ */
+struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id)
+{
+ struct css_id *cssid = NULL;
+
+ BUG_ON(!ss->use_id);
+ cssid = idr_find(&ss->idr, id);
+
+ if (unlikely(!cssid))
+ return NULL;
+
+ return rcu_dereference(cssid->css);
+}
+
+/**
+ * css_get_next - lookup next cgroup under specified hierarchy.
+ * @ss: pointer to subsystem
+ * @id: current position of iteration.
+ * @root: pointer to css. search tree under this.
+ * @foundid: position of found object.
+ *
+ * Search next css under the specified hierarchy of rootid. Calling under
+ * rcu_read_lock() is necessary. Returns NULL if it reaches the end.
+ */
+struct cgroup_subsys_state *
+css_get_next(struct cgroup_subsys *ss, int id,
+ struct cgroup_subsys_state *root, int *foundid)
+{
+ struct cgroup_subsys_state *ret = NULL;
+ struct css_id *tmp;
+ int tmpid;
+ int rootid = css_id(root);
+ int depth = css_depth(root);
+
+ if (!rootid)
+ return NULL;
+
+ BUG_ON(!ss->use_id);
+ rcu_read_lock();
+ /* fill start point for scan */
+ tmpid = id;
+ while (1) {
+ /*
+ * scan next entry from bitmap(tree), tmpid is updated after
+ * idr_get_next().
+ */
+ spin_lock(&ss->id_lock);
+ tmp = idr_get_next(&ss->idr, &tmpid);
+ spin_unlock(&ss->id_lock);
+
+ if (!tmp)
+ break;
+ if (tmp->depth >= depth && tmp->stack[depth] == rootid) {
+ ret = rcu_dereference(tmp->css);
+ if (ret) {
+ *foundid = tmpid;
+ break;
+ }
+ }
+ /* continue to scan from next id */
+ tmpid = tmpid + 1;
+ }
+
+ rcu_read_unlock();
+ return ret;
+}
+
Index: mmotm-2.6.29-Jan16/include/linux/idr.h
===================================================================
--- mmotm-2.6.29-Jan16.orig/include/linux/idr.h
+++ mmotm-2.6.29-Jan16/include/linux/idr.h
@@ -106,6 +106,7 @@ int idr_get_new(struct idr *idp, void *p
int idr_get_new_above(struct idr *idp, void *ptr, int starting_id, int *id);
int idr_for_each(struct idr *idp,
int (*fn)(int id, void *p, void *data), void *data);
+void *idr_get_next(struct idr *idp, int *nextid);
void *idr_replace(struct idr *idp, void *ptr, int id);
void idr_remove(struct idr *idp, int id);
void idr_remove_all(struct idr *idp);
Index: mmotm-2.6.29-Jan16/lib/idr.c
===================================================================
--- mmotm-2.6.29-Jan16.orig/lib/idr.c
+++ mmotm-2.6.29-Jan16/lib/idr.c
@@ -579,6 +579,52 @@ int idr_for_each(struct idr *idp,
EXPORT_SYMBOL(idr_for_each);
/**
+ * idr_get_next - lookup next object of id to given id.
+ * @idp: idr handle
+ * @id: pointer to lookup key
+ *
+ * Returns pointer to registered object with id, which is next number to
+ * given id.
+ */
+
+void *idr_get_next(struct idr *idp, int *nextidp)
+{
+ struct idr_layer *p, *pa[MAX_LEVEL];
+ struct idr_layer **paa = &pa[0];
+ int id = *nextidp;
+ int n, max;
+
+ /* find first ent */
+ n = idp->layers * IDR_BITS;
+ max = 1 << n;
+ p = rcu_dereference(idp->top);
+ if (!p)
+ return NULL;
+
+ while (id < max) {
+ while (n > 0 && p) {
+ n -= IDR_BITS;
+ *paa++ = p;
+ p = rcu_dereference(p->ary[(id >> n) & IDR_MASK]);
+ }
+
+ if (p) {
+ *nextidp = id;
+ return p;
+ }
+
+ id += 1 << n;
+ while (n < fls(id)) {
+ n += IDR_BITS;
+ p = *--paa;
+ }
+ }
+ return NULL;
+}
+
+
+
+/**
* idr_replace - replace pointer for given id
* @idp: idr handle
* @ptr: pointer you want associated with the id
From: KAMEZAWA Hiroyuki <[email protected]>
Use css ID in memcg.
Assign a CSS ID to each memcg and use css_get_next() for scanning the hierarchy.
Assume the following tree:
group_A (ID=3)
/01 (ID=4)
/0A (ID=7)
/02 (ID=10)
group_B (ID=5)
and a task in group_A/01/0A hits the limit at group_A.
Reclaim will be done in the following order (round-robin):
group_A(3) -> group_A/01(4) -> group_A/01/0A(7) -> group_A/02(10)
-> group_A(3) -> .....
Round-robin by ID: the last visited cgroup is recorded, and the next
reclaim restarts from it.
(A smarter algorithm can be implemented later.)
No cgroup_mutex or hierarchy_mutex is required.
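The selection step boils down to the following condensed sketch of
mem_cgroup_select_victim() from the patch below (the reclaim_param_lock taken
around last_scanned_child and the !use_hierarchy shortcut are omitted here):

	struct cgroup_subsys_state *css;
	struct mem_cgroup *victim = NULL;
	int found, nextid = root_mem->last_scanned_child + 1;

	rcu_read_lock();
	css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css, &found);
	if (css && css_tryget(css))
		victim = container_of(css, struct mem_cgroup, css);
	rcu_read_unlock();

	/* remember where to restart; wrap to ID 1 when the scan hits the end */
	root_mem->last_scanned_child = css ? found : 0;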
Changelog (v3) -> (v4)
- dropped css_is_populated() check
- removed scan_age and use more simple logic.
Changelog (v2) -> (v3)
- Added css_is_populated() check
- Adjusted to rc1 + Nishimura's fixes.
- Increased comments.
Changelog (v1) -> (v2)
- Updated texts.
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 220 ++++++++++++++++++++------------------------------------
1 file changed, 82 insertions(+), 138 deletions(-)
Index: mmotm-2.6.29-Jan16/mm/memcontrol.c
===================================================================
--- mmotm-2.6.29-Jan16.orig/mm/memcontrol.c
+++ mmotm-2.6.29-Jan16/mm/memcontrol.c
@@ -95,6 +95,15 @@ static s64 mem_cgroup_read_stat(struct m
return ret;
}
+static s64 mem_cgroup_local_usage(struct mem_cgroup_stat *stat)
+{
+ s64 ret;
+
+ ret = mem_cgroup_read_stat(stat, MEM_CGROUP_STAT_CACHE);
+ ret += mem_cgroup_read_stat(stat, MEM_CGROUP_STAT_RSS);
+ return ret;
+}
+
/*
* per-zone information in memory controller.
*/
@@ -154,9 +163,9 @@ struct mem_cgroup {
/*
* While reclaiming in a hiearchy, we cache the last child we
- * reclaimed from. Protected by hierarchy_mutex
+ * reclaimed from.
*/
- struct mem_cgroup *last_scanned_child;
+ int last_scanned_child;
/*
* Should the accounting and control be hierarchical, per subtree?
*/
@@ -629,103 +638,6 @@ unsigned long mem_cgroup_isolate_pages(u
#define mem_cgroup_from_res_counter(counter, member) \
container_of(counter, struct mem_cgroup, member)
-/*
- * This routine finds the DFS walk successor. This routine should be
- * called with hierarchy_mutex held
- */
-static struct mem_cgroup *
-__mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
-{
- struct cgroup *cgroup, *curr_cgroup, *root_cgroup;
-
- curr_cgroup = curr->css.cgroup;
- root_cgroup = root_mem->css.cgroup;
-
- if (!list_empty(&curr_cgroup->children)) {
- /*
- * Walk down to children
- */
- cgroup = list_entry(curr_cgroup->children.next,
- struct cgroup, sibling);
- curr = mem_cgroup_from_cont(cgroup);
- goto done;
- }
-
-visit_parent:
- if (curr_cgroup == root_cgroup) {
- /* caller handles NULL case */
- curr = NULL;
- goto done;
- }
-
- /*
- * Goto next sibling
- */
- if (curr_cgroup->sibling.next != &curr_cgroup->parent->children) {
- cgroup = list_entry(curr_cgroup->sibling.next, struct cgroup,
- sibling);
- curr = mem_cgroup_from_cont(cgroup);
- goto done;
- }
-
- /*
- * Go up to next parent and next parent's sibling if need be
- */
- curr_cgroup = curr_cgroup->parent;
- goto visit_parent;
-
-done:
- return curr;
-}
-
-/*
- * Visit the first child (need not be the first child as per the ordering
- * of the cgroup list, since we track last_scanned_child) of @mem and use
- * that to reclaim free pages from.
- */
-static struct mem_cgroup *
-mem_cgroup_get_next_node(struct mem_cgroup *root_mem)
-{
- struct cgroup *cgroup;
- struct mem_cgroup *orig, *next;
- bool obsolete;
-
- /*
- * Scan all children under the mem_cgroup mem
- */
- mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
-
- orig = root_mem->last_scanned_child;
- obsolete = mem_cgroup_is_obsolete(orig);
-
- if (list_empty(&root_mem->css.cgroup->children)) {
- /*
- * root_mem might have children before and last_scanned_child
- * may point to one of them. We put it later.
- */
- if (orig)
- VM_BUG_ON(!obsolete);
- next = NULL;
- goto done;
- }
-
- if (!orig || obsolete) {
- cgroup = list_first_entry(&root_mem->css.cgroup->children,
- struct cgroup, sibling);
- next = mem_cgroup_from_cont(cgroup);
- } else
- next = __mem_cgroup_get_next_node(orig, root_mem);
-
-done:
- if (next)
- mem_cgroup_get(next);
- root_mem->last_scanned_child = next;
- if (orig)
- mem_cgroup_put(orig);
- mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
- return (next) ? next : root_mem;
-}
-
static bool mem_cgroup_check_under_limit(struct mem_cgroup *mem)
{
if (do_swap_account) {
@@ -755,46 +667,79 @@ static unsigned int get_swappiness(struc
}
/*
- * Dance down the hierarchy if needed to reclaim memory. We remember the
- * last child we reclaimed from, so that we don't end up penalizing
- * one child extensively based on its position in the children list.
+ * Visit the first child (need not be the first child as per the ordering
+ * of the cgroup list, since we track last_scanned_child) of @mem and use
+ * that to reclaim free pages from.
+ */
+static struct mem_cgroup *
+mem_cgroup_select_victim(struct mem_cgroup *root_mem)
+{
+ struct mem_cgroup *ret = NULL;
+ struct cgroup_subsys_state *css;
+ int nextid, found;
+
+ if (!root_mem->use_hierarchy) {
+ css_get(&root_mem->css);
+ ret = root_mem;
+ }
+
+ while (!ret) {
+ rcu_read_lock();
+ nextid = root_mem->last_scanned_child + 1;
+ css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css,
+ &found);
+ if (css && css_tryget(css))
+ ret = container_of(css, struct mem_cgroup, css);
+
+ rcu_read_unlock();
+ /* Updates scanning parameter */
+ spin_lock(&root_mem->reclaim_param_lock);
+ if (!css) {
+ /* this means start scan from ID:1 */
+ root_mem->last_scanned_child = 0;
+ } else
+ root_mem->last_scanned_child = found;
+ spin_unlock(&root_mem->reclaim_param_lock);
+ }
+
+ return ret;
+}
+
+/*
+ * Scan the hierarchy if needed to reclaim memory. We remember the last child
+ * we reclaimed from, so that we don't end up penalizing one child extensively
+ * based on its position in the children list.
*
* root_mem is the original ancestor that we've been reclaim from.
+ *
+ * We give up and return to the caller when we visit root_mem twice.
+ * (other groups can be removed while we're walking....)
*/
static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
gfp_t gfp_mask, bool noswap)
{
- struct mem_cgroup *next_mem;
- int ret = 0;
-
- /*
- * Reclaim unconditionally and don't check for return value.
- * We need to reclaim in the current group and down the tree.
- * One might think about checking for children before reclaiming,
- * but there might be left over accounting, even after children
- * have left.
- */
- ret += try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
- get_swappiness(root_mem));
- if (mem_cgroup_check_under_limit(root_mem))
- return 1; /* indicate reclaim has succeeded */
- if (!root_mem->use_hierarchy)
- return ret;
-
- next_mem = mem_cgroup_get_next_node(root_mem);
-
- while (next_mem != root_mem) {
- if (mem_cgroup_is_obsolete(next_mem)) {
- next_mem = mem_cgroup_get_next_node(root_mem);
+ struct mem_cgroup *victim;
+ int ret, total = 0;
+ int loop = 0;
+
+ while (loop < 2) {
+ victim = mem_cgroup_select_victim(root_mem);
+ if (victim == root_mem)
+ loop++;
+ if (!mem_cgroup_local_usage(&victim->stat)) {
+ /* this cgroup's local usage == 0 */
+ css_put(&victim->css);
continue;
}
- ret += try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
- get_swappiness(next_mem));
+ /* we use swappiness of local cgroup */
+ ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap,
+ get_swappiness(victim));
+ css_put(&victim->css);
+ total += ret;
if (mem_cgroup_check_under_limit(root_mem))
- return 1; /* indicate reclaim has succeeded */
- next_mem = mem_cgroup_get_next_node(root_mem);
+ return 1 + total;
}
- return ret;
+ return total;
}
bool mem_cgroup_oom_called(struct task_struct *task)
@@ -1324,8 +1269,8 @@ __mem_cgroup_uncharge_common(struct page
res_counter_uncharge(&mem->res, PAGE_SIZE);
if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
res_counter_uncharge(&mem->memsw, PAGE_SIZE);
-
mem_cgroup_charge_statistics(mem, pc, false);
+
ClearPageCgroupUsed(pc);
/*
* pc->mem_cgroup is not cleared here. It will be accessed when it's
@@ -2178,6 +2123,8 @@ static void __mem_cgroup_free(struct mem
{
int node;
+ free_css_id(&mem_cgroup_subsys, &mem->css);
+
for_each_node_state(node, N_POSSIBLE)
free_mem_cgroup_per_zone_info(mem, node);
@@ -2228,11 +2175,12 @@ static struct cgroup_subsys_state * __re
mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
{
struct mem_cgroup *mem, *parent;
+ long error = -ENOMEM;
int node;
mem = mem_cgroup_alloc();
if (!mem)
- return ERR_PTR(-ENOMEM);
+ return ERR_PTR(error);
for_each_node_state(node, N_POSSIBLE)
if (alloc_mem_cgroup_per_zone_info(mem, node))
@@ -2260,7 +2208,7 @@ mem_cgroup_create(struct cgroup_subsys *
res_counter_init(&mem->res, NULL);
res_counter_init(&mem->memsw, NULL);
}
- mem->last_scanned_child = NULL;
+ mem->last_scanned_child = 0;
spin_lock_init(&mem->reclaim_param_lock);
if (parent)
@@ -2269,7 +2217,7 @@ mem_cgroup_create(struct cgroup_subsys *
return &mem->css;
free_out:
__mem_cgroup_free(mem);
- return ERR_PTR(-ENOMEM);
+ return ERR_PTR(error);
}
static void mem_cgroup_pre_destroy(struct cgroup_subsys *ss,
@@ -2283,12 +2231,7 @@ static void mem_cgroup_destroy(struct cg
struct cgroup *cont)
{
struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
- struct mem_cgroup *last_scanned_child = mem->last_scanned_child;
- if (last_scanned_child) {
- VM_BUG_ON(!mem_cgroup_is_obsolete(last_scanned_child));
- mem_cgroup_put(last_scanned_child);
- }
mem_cgroup_put(mem);
}
@@ -2327,6 +2270,7 @@ struct cgroup_subsys mem_cgroup_subsys =
.populate = mem_cgroup_populate,
.attach = mem_cgroup_move_task,
.early_init = 0,
+ .use_id = 1,
};
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
From: KAMEZAWA Hiroyuki <[email protected]>
Clean up memory.stat file routine and show "total" hierarchical stat.
This patch does the following:
- rename get_all_zonestat to get_local_zonestat.
- remove the old mem_cgroup_stat_desc, which was only for per-cpu stats.
- add mcs_stat to cover both per-cpu and per-lru stats.
- add "total" stats for the hierarchy (*).
- add a callback system to scan all memcg under a root.
== example output; the "total_*" lines are new
[kamezawa@localhost ~]$ cat /opt/cgroup/xxx/memory.stat
cache 0
rss 0
pgpgin 0
pgpgout 0
inactive_anon 0
active_anon 0
inactive_file 0
active_file 0
unevictable 0
hierarchical_memory_limit 50331648
hierarchical_memsw_limit 9223372036854775807
total_cache 65536
total_rss 192512
total_pgpgin 218
total_pgpgout 155
total_inactive_anon 0
total_active_anon 135168
total_inactive_file 61440
total_active_file 4096
total_unevictable 0
==
(*) Users could compute the hierarchical stats with their own programs in
    userland, but if it can be done cleanly in the kernel, I think it is
    worth showing.
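The callback system mentioned above is mem_cgroup_walk_tree() (added in the
hunk below): it calls the function for the root and, when use_hierarchy is
set, for every memcg under it, stopping early if the callback returns
non-zero. A minimal, hypothetical user looks like this:

	static int count_one(struct mem_cgroup *mem, void *data)
	{
		(*(int *)data)++;
		return 0;	/* non-zero would abort the walk */
	}

	static int count_memcgs_under(struct mem_cgroup *root)
	{
		int nr = 0;

		mem_cgroup_walk_tree(root, &nr, count_one);
		return nr;
	}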
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Index: mmotm-2.6.29-Jan16/mm/memcontrol.c
===================================================================
--- mmotm-2.6.29-Jan16.orig/mm/memcontrol.c
+++ mmotm-2.6.29-Jan16/mm/memcontrol.c
@@ -256,7 +256,7 @@ page_cgroup_zoneinfo(struct page_cgroup
return mem_cgroup_zoneinfo(mem, nid, zid);
}
-static unsigned long mem_cgroup_get_all_zonestat(struct mem_cgroup *mem,
+static unsigned long mem_cgroup_get_local_zonestat(struct mem_cgroup *mem,
enum lru_list idx)
{
int nid, zid;
@@ -317,6 +317,42 @@ static bool mem_cgroup_is_obsolete(struc
return css_is_removed(&mem->css);
}
+
+/*
+ * Call callback function against all cgroup under hierarchy tree.
+ */
+static int mem_cgroup_walk_tree(struct mem_cgroup *root, void *data,
+ int (*func)(struct mem_cgroup *, void *))
+{
+ int found, ret, nextid;
+ struct cgroup_subsys_state *css;
+ struct mem_cgroup *mem;
+
+ if (!root->use_hierarchy)
+ return (*func)(root, data);
+
+ nextid = 1;
+ do {
+ ret = 0;
+ mem = NULL;
+
+ rcu_read_lock();
+ css = css_get_next(&mem_cgroup_subsys, nextid, &root->css,
+ &found);
+ if (css && css_tryget(css))
+ mem = container_of(css, struct mem_cgroup, css);
+ rcu_read_unlock();
+
+ if (mem) {
+ ret = (*func)(mem, data);
+ css_put(&mem->css);
+ }
+ nextid = found + 1;
+ } while (!ret && css);
+
+ return ret;
+}
+
/*
* Following LRU functions are allowed to be used without PCG_LOCK.
* Operations are called by routine of global LRU independently from memcg.
@@ -510,8 +546,8 @@ static int calc_inactive_ratio(struct me
unsigned long gb;
unsigned long inactive_ratio;
- inactive = mem_cgroup_get_all_zonestat(memcg, LRU_INACTIVE_ANON);
- active = mem_cgroup_get_all_zonestat(memcg, LRU_ACTIVE_ANON);
+ inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_ANON);
+ active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_ANON);
gb = (inactive + active) >> (30 - PAGE_SHIFT);
if (gb)
@@ -1838,54 +1874,90 @@ static int mem_cgroup_reset(struct cgrou
return 0;
}
-static const struct mem_cgroup_stat_desc {
- const char *msg;
- u64 unit;
-} mem_cgroup_stat_desc[] = {
- [MEM_CGROUP_STAT_CACHE] = { "cache", PAGE_SIZE, },
- [MEM_CGROUP_STAT_RSS] = { "rss", PAGE_SIZE, },
- [MEM_CGROUP_STAT_PGPGIN_COUNT] = {"pgpgin", 1, },
- [MEM_CGROUP_STAT_PGPGOUT_COUNT] = {"pgpgout", 1, },
+
+/* For read statistics */
+enum {
+ MCS_CACHE,
+ MCS_RSS,
+ MCS_PGPGIN,
+ MCS_PGPGOUT,
+ MCS_INACTIVE_ANON,
+ MCS_ACTIVE_ANON,
+ MCS_INACTIVE_FILE,
+ MCS_ACTIVE_FILE,
+ MCS_UNEVICTABLE,
+ NR_MCS_STAT,
+};
+
+struct mcs_total_stat {
+ s64 stat[NR_MCS_STAT];
+};
+
+struct {
+ char *local_name;
+ char *total_name;
+} memcg_stat_strings[NR_MCS_STAT] = {
+ {"cache", "total_cache"},
+ {"rss", "total_rss"},
+ {"pgpgin", "total_pgpgin"},
+ {"pgpgout", "total_pgpgout"},
+ {"inactive_anon", "total_inactive_anon"},
+ {"active_anon", "total_active_anon"},
+ {"inactive_file", "total_inactive_file"},
+ {"active_file", "total_active_file"},
+ {"unevictable", "total_unevictable"}
};
+
+static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
+{
+ struct mcs_total_stat *s = data;
+ s64 val;
+
+ /* per cpu stat */
+ val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_CACHE);
+ s->stat[MCS_CACHE] += val * PAGE_SIZE;
+ val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
+ s->stat[MCS_RSS] += val * PAGE_SIZE;
+ val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_PGPGIN_COUNT);
+ s->stat[MCS_PGPGIN] += val;
+ val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_PGPGOUT_COUNT);
+ s->stat[MCS_PGPGOUT] += val;
+
+ /* per zone stat */
+ val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
+ s->stat[MCS_INACTIVE_ANON] += val * PAGE_SIZE;
+ val = mem_cgroup_get_local_zonestat(mem, LRU_ACTIVE_ANON);
+ s->stat[MCS_ACTIVE_ANON] += val * PAGE_SIZE;
+ val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_FILE);
+ s->stat[MCS_INACTIVE_FILE] += val * PAGE_SIZE;
+ val = mem_cgroup_get_local_zonestat(mem, LRU_ACTIVE_FILE);
+ s->stat[MCS_ACTIVE_FILE] += val * PAGE_SIZE;
+ val = mem_cgroup_get_local_zonestat(mem, LRU_UNEVICTABLE);
+ s->stat[MCS_UNEVICTABLE] += val * PAGE_SIZE;
+ return 0;
+}
+
+static void
+mem_cgroup_get_total_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
+{
+ mem_cgroup_walk_tree(mem, s, mem_cgroup_get_local_stat);
+}
+
static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft,
struct cgroup_map_cb *cb)
{
struct mem_cgroup *mem_cont = mem_cgroup_from_cont(cont);
- struct mem_cgroup_stat *stat = &mem_cont->stat;
+ struct mcs_total_stat mystat;
int i;
- for (i = 0; i < ARRAY_SIZE(stat->cpustat[0].count); i++) {
- s64 val;
+ memset(&mystat, 0, sizeof(mystat));
+ mem_cgroup_get_local_stat(mem_cont, &mystat);
- val = mem_cgroup_read_stat(stat, i);
- val *= mem_cgroup_stat_desc[i].unit;
- cb->fill(cb, mem_cgroup_stat_desc[i].msg, val);
- }
- /* showing # of active pages */
- {
- unsigned long active_anon, inactive_anon;
- unsigned long active_file, inactive_file;
- unsigned long unevictable;
-
- inactive_anon = mem_cgroup_get_all_zonestat(mem_cont,
- LRU_INACTIVE_ANON);
- active_anon = mem_cgroup_get_all_zonestat(mem_cont,
- LRU_ACTIVE_ANON);
- inactive_file = mem_cgroup_get_all_zonestat(mem_cont,
- LRU_INACTIVE_FILE);
- active_file = mem_cgroup_get_all_zonestat(mem_cont,
- LRU_ACTIVE_FILE);
- unevictable = mem_cgroup_get_all_zonestat(mem_cont,
- LRU_UNEVICTABLE);
-
- cb->fill(cb, "active_anon", (active_anon) * PAGE_SIZE);
- cb->fill(cb, "inactive_anon", (inactive_anon) * PAGE_SIZE);
- cb->fill(cb, "active_file", (active_file) * PAGE_SIZE);
- cb->fill(cb, "inactive_file", (inactive_file) * PAGE_SIZE);
- cb->fill(cb, "unevictable", unevictable * PAGE_SIZE);
+ for (i = 0; i < NR_MCS_STAT; i++)
+ cb->fill(cb, memcg_stat_strings[i].local_name, mystat.stat[i]);
- }
+ /* Hierarchical information */
{
unsigned long long limit, memsw_limit;
memcg_get_hierarchical_limit(mem_cont, &limit, &memsw_limit);
@@ -1894,6 +1966,12 @@ static int mem_control_stat_show(struct
cb->fill(cb, "hierarchical_memsw_limit", memsw_limit);
}
+ memset(&mystat, 0, sizeof(mystat));
+ mem_cgroup_get_total_stat(mem_cont, &mystat);
+ for (i = 0; i < NR_MCS_STAT; i++)
+ cb->fill(cb, memcg_stat_strings[i].total_name, mystat.stat[i]);
+
+
#ifdef CONFIG_DEBUG_VM
cb->fill(cb, "inactive_ratio", calc_inactive_ratio(mem_cont, NULL));
From: KAMEZAWA Hiroyuki <[email protected]>
As pointed out, shrinking memcg's limit should return -EBUSY after a
reasonable number of retries. This patch tries to fix the current behavior
when shrinking usage.
Before looking into the "shrink should return -EBUSY" problem, we should
fix the hierarchical reclaim code. It compares current usage against the
current limit, but that check only makes sense when the kernel reclaims
memory because a limit was hit. This is also a problem.
What this patch does:
1. Add a new argument "shrink" to hierarchical reclaim. If shrink==true,
   hierarchical reclaim returns immediately and the caller decides whether
   the kernel should shrink more.
   (While shrinking, usage is always smaller than the limit, so a
   usage < limit check is useless.)
2. To adjust to the above change, make two changes in the "shrink" retry path:
   2-a. retry_count depends on the number of children, because the kernel
        visits the children under the hierarchy one by one.
   2-b. Rather than checking the return value of hierarchical reclaim for
        progress, compare usage-before-shrink and usage-after-shrink.
        If usage-before-shrink <= usage-after-shrink, retry_count is
        decremented (sketched below).
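In fragment form, the new retry path looks roughly like this (the res_counter
limit update itself is elided behind the hypothetical try_set_limit(); see
mem_cgroup_resize_limit() in the hunk below for the real code):

	retry_count = mem_cgroup_count_children(memcg) *
		      MEM_CGROUP_RECLAIM_RETRIES;		/* 2-a */
	oldusage = res_counter_read_u64(&memcg->res, RES_USAGE);
	while (retry_count) {
		if (!try_set_limit(memcg, val))		/* hypothetical helper */
			break;				/* new limit fits, done */
		mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
						false, true /* shrink */);
		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
		if (curusage >= oldusage)
			retry_count--;			/* 2-b: no progress */
		else
			oldusage = curusage;		/* progress, keep going */
	}
	/* ret stays -EBUSY if the retries run out */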
Reported-by: Li Zefan <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 59 insertions(+), 12 deletions(-)
Index: mmotm-2.6.29-Jan16/mm/memcontrol.c
===================================================================
--- mmotm-2.6.29-Jan16.orig/mm/memcontrol.c
+++ mmotm-2.6.29-Jan16/mm/memcontrol.c
@@ -702,6 +702,23 @@ static unsigned int get_swappiness(struc
return swappiness;
}
+static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
+{
+ int *val = data;
+ (*val)++;
+ return 0;
+}
+/*
+ * This function returns the number of memcg under hierarchy tree. Returns
+ * 1(self count) if no children.
+ */
+static int mem_cgroup_count_children(struct mem_cgroup *mem)
+{
+ int num = 0;
+ mem_cgroup_walk_tree(mem, &num, mem_cgroup_count_children_cb);
+ return num;
+}
+
/*
* Visit the first child (need not be the first child as per the ordering
* of the cgroup list, since we track last_scanned_child) of @mem and use
@@ -750,9 +767,11 @@ mem_cgroup_select_victim(struct mem_cgro
*
* We give up and return to the caller when we visit root_mem twice.
* (other groups can be removed while we're walking....)
+ *
+ * If shrink==true, to avoid freeing too much, this returns immediately.
*/
static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
- gfp_t gfp_mask, bool noswap)
+ gfp_t gfp_mask, bool noswap, bool shrink)
{
struct mem_cgroup *victim;
int ret, total = 0;
@@ -771,6 +790,13 @@ static int mem_cgroup_hierarchical_recla
ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap,
get_swappiness(victim));
css_put(&victim->css);
+ /*
+ * At shrinking usage, we can't check we should stop here or
+ * reclaim more. It's depends on callers. last_scanned_child
+ * will work enough for keeping fairness under tree.
+ */
+ if (shrink)
+ return ret;
total += ret;
if (mem_cgroup_check_under_limit(root_mem))
return 1 + total;
@@ -856,7 +882,7 @@ static int __mem_cgroup_try_charge(struc
goto nomem;
ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
- noswap);
+ noswap, false);
if (ret)
continue;
@@ -1489,7 +1515,8 @@ int mem_cgroup_shrink_usage(struct page
return 0;
do {
- progress = mem_cgroup_hierarchical_reclaim(mem, gfp_mask, true);
+ progress = mem_cgroup_hierarchical_reclaim(mem,
+ gfp_mask, true, false);
progress += mem_cgroup_check_under_limit(mem);
} while (!progress && --retry);
@@ -1504,11 +1531,21 @@ static DEFINE_MUTEX(set_limit_mutex);
static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
unsigned long long val)
{
-
- int retry_count = MEM_CGROUP_RECLAIM_RETRIES;
+ int retry_count;
int progress;
u64 memswlimit;
int ret = 0;
+ int children = mem_cgroup_count_children(memcg);
+ u64 curusage, oldusage;
+
+ /*
+ * For keeping hierarchical_reclaim simple, how long we should retry
+ * is depends on callers. We set our retry-count to be function
+ * of # of children which we should visit in this loop.
+ */
+ retry_count = MEM_CGROUP_RECLAIM_RETRIES * children;
+
+ oldusage = res_counter_read_u64(&memcg->res, RES_USAGE);
while (retry_count) {
if (signal_pending(current)) {
@@ -1534,8 +1571,13 @@ static int mem_cgroup_resize_limit(struc
break;
progress = mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
- false);
- if (!progress) retry_count--;
+ false, true);
+ curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
+ /* Usage is reduced ? */
+ if (curusage >= oldusage)
+ retry_count--;
+ else
+ oldusage = curusage;
}
return ret;
@@ -1544,13 +1586,16 @@ static int mem_cgroup_resize_limit(struc
int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
unsigned long long val)
{
- int retry_count = MEM_CGROUP_RECLAIM_RETRIES;
+ int retry_count;
u64 memlimit, oldusage, curusage;
- int ret;
+ int children = mem_cgroup_count_children(memcg);
+ int ret = -EBUSY;
if (!do_swap_account)
return -EINVAL;
-
+ /* see mem_cgroup_resize_res_limit */
+ retry_count = children * MEM_CGROUP_RECLAIM_RETRIES;
+ oldusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
while (retry_count) {
if (signal_pending(current)) {
ret = -EINTR;
@@ -1574,11 +1619,13 @@ int mem_cgroup_resize_memsw_limit(struct
if (!ret)
break;
- oldusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
- mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, true);
+ mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, true, true);
curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
+ /* Usage is reduced ? */
if (curusage >= oldusage)
retry_count--;
+ else
+ oldusage = curusage;
}
return ret;
}
From: KAMEZAWA Hiroyuki <[email protected]>
This patch tries to fix OOM-killer problems caused by hierarchy.
Currently, memcg has its own OOM-kill path (in oom_kill.c) and tries to
kill a task in the memcg.
But when hierarchy is used, this is broken and the correct task cannot
be killed. For example, in the following cgroups:
/groupA/ hierarchy=1, limit=1G,
01 nolimit
02 nolimit
Memory usage of all tasks under /groupA, /groupA/01, and /groupA/02 is
limited to groupA's 1 Gbyte, but the OOM killer only kills tasks in groupA
itself.
This patch makes the bad process be selected from all tasks under the
hierarchy. BTW, currently only groupA's oom_jiffies is updated in the
above case; the oom_jiffies of the whole tree should be updated.
To see how oom_jiffies is used, please check the callers of
mem_cgroup_oom_called().
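For reference, this mirrors how the OOM killer's tasklist scan in oom_kill.c
filters candidates: with the hierarchy-aware task_in_mem_cgroup() below, tasks
anywhere under the OOMing hierarchy become candidates, not only tasks in that
exact memcg. A rough sketch of the filtering step (not code from this patch):

	struct task_struct *p;

	for_each_process(p) {
		if (!task_in_mem_cgroup(p, mem_over_limit))
			continue;	/* outside the hierarchy that hit its limit */
		/* ... score p and remember the worst offender ... */
	}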
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Documentation/cgroups/memcg_test.txt | 20 +++++++++++++++++++-
include/linux/memcontrol.h | 4 ++--
mm/memcontrol.c | 32 +++++++++++++++++++++++++++++---
3 files changed, 50 insertions(+), 6 deletions(-)
Index: mmotm-2.6.29-Jan16/mm/memcontrol.c
===================================================================
--- mmotm-2.6.29-Jan16.orig/mm/memcontrol.c
+++ mmotm-2.6.29-Jan16/mm/memcontrol.c
@@ -295,6 +295,9 @@ struct mem_cgroup *mem_cgroup_from_task(
static struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
{
struct mem_cgroup *mem = NULL;
+
+ if (!mm)
+ return;
/*
* Because we have no locks, mm->owner's may be being moved to other
* cgroup. We use css_tryget() here even if this looks
@@ -483,13 +486,23 @@ void mem_cgroup_move_lists(struct page *
mem_cgroup_add_lru_list(page, to);
}
-int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
+int task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *mem)
{
int ret;
+ struct mem_cgroup *curr = NULL;
task_lock(task);
- ret = task->mm && mm_match_cgroup(task->mm, mem);
+ rcu_read_lock();
+ curr = try_get_mem_cgroup_from_mm(task->mm);
+ rcu_read_unlock();
task_unlock(task);
+ if (!curr)
+ return 0;
+ if (curr->use_hierarchy)
+ ret = css_is_ancestor(&curr->css, &mem->css);
+ else
+ ret = (curr == mem);
+ css_put(&curr->css);
return ret;
}
@@ -820,6 +833,19 @@ bool mem_cgroup_oom_called(struct task_s
rcu_read_unlock();
return ret;
}
+
+static int record_last_oom_cb(struct mem_cgroup *mem, void *data)
+{
+ mem->last_oom_jiffies = jiffies;
+ return 0;
+}
+
+static void record_last_oom(struct mem_cgroup *mem)
+{
+ mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
+}
+
+
/*
* Unlike exported interface, "oom" parameter is added. if oom==true,
* oom-killer can be invoked.
@@ -902,7 +928,7 @@ static int __mem_cgroup_try_charge(struc
mutex_lock(&memcg_tasklist);
mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
mutex_unlock(&memcg_tasklist);
- mem_over_limit->last_oom_jiffies = jiffies;
+ record_last_oom(mem_over_limit);
}
goto nomem;
}
Index: mmotm-2.6.29-Jan16/Documentation/cgroups/memcg_test.txt
===================================================================
--- mmotm-2.6.29-Jan16.orig/Documentation/cgroups/memcg_test.txt
+++ mmotm-2.6.29-Jan16/Documentation/cgroups/memcg_test.txt
@@ -1,5 +1,5 @@
Memory Resource Controller(Memcg) Implementation Memo.
-Last Updated: 2009/1/19
+Last Updated: 2009/1/20
Base Kernel Version: based on 2.6.29-rc2.
Because VM is getting complex (one of reasons is memcg...), memcg's behavior
@@ -360,3 +360,21 @@ Under below explanation, we assume CONFI
# kill malloc task.
Of course, tmpfs v.s. swapoff test should be tested, too.
+
+ 9.8 OOM-Killer
+ Out-of-memory caused by memcg's limit will kill tasks under
+ the memcg. When hierarchy is used, a task under hierarchy
+ will be killed by the kernel.
+ In this case, panic_on_oom shouldn't be invoked and tasks
+ in other groups shouldn't be killed.
+
+ It's not difficult to cause OOM under memcg as following.
+ Case A) when you can swapoff
+ #swapoff -a
+ #echo 50M > /memory.limit_in_bytes
+ run 51M of malloc
+
+ Case B) when you use mem+swap limitation.
+ #echo 50M > memory.limit_in_bytes
+ #echo 50M > memory.memsw.limit_in_bytes
+ run 51M of malloc
Index: mmotm-2.6.29-Jan16/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.29-Jan16.orig/include/linux/memcontrol.h
+++ mmotm-2.6.29-Jan16/include/linux/memcontrol.h
@@ -66,7 +66,7 @@ extern unsigned long mem_cgroup_isolate_
struct mem_cgroup *mem_cont,
int active, int file);
extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
-int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
+int task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *mem);
extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
@@ -192,7 +192,7 @@ static inline int mm_match_cgroup(struct
}
static inline int task_in_mem_cgroup(struct task_struct *task,
- const struct mem_cgroup *mem)
+ struct mem_cgroup *mem)
{
return 1;
}
From: KAMEZAWA Hiroyuki <[email protected]>
In the following situation with the memory subsystem,
/groupA use_hierarchy==1
/01 some tasks
/02 some tasks
/03 some tasks
/04 empty
when tasks under 01/02/03 hit the limit on /groupA, hierarchical reclaim
is triggered and the kernel walks the tree under groupA. In this case,
rmdir /groupA/04 frequently fails with -EBUSY because of temporary
refcounts taken by the kernel.
In general, a cgroup can be rmdir'd if it has no child groups and no
tasks. Frequent failures of rmdir() are not useful to users
(and in most cases the reason for the -EBUSY is unknown to them).
This patch tries to modify the above behavior by
- retrying (waiting) if a css refcnt is temporarily held by someone, and
- adding a return value to pre_destroy(), which allows a subsystem to
  say "we're really busy!" (see the sketch after this list).
Changelog: v2 -> v3.
- moved CGRP_ flags to cgroup.c
- unified test function and wake up function.
- check signal_pending() after wake up.
- Modified documentation about pre_destroy().
Changelog: v1 -> v2.
- added return value to pre_destroy().
- removed modification to cgroup_subsys.
- added signal_pending() check.
- added waitqueue and avoid busy spin loop.
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Documentation/cgroups/cgroups.txt | 6 +-
include/linux/cgroup.h | 16 +-----
kernel/cgroup.c | 97 ++++++++++++++++++++++++++++++++------
mm/memcontrol.c | 5 +
4 files changed, 93 insertions(+), 31 deletions(-)
Index: mmotm-2.6.29-Jan16/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.29-Jan16.orig/include/linux/cgroup.h
+++ mmotm-2.6.29-Jan16/include/linux/cgroup.h
@@ -119,19 +119,9 @@ static inline void css_put(struct cgroup
__css_put(css);
}
-/* bits in struct cgroup flags field */
-enum {
- /* Control Group is dead */
- CGRP_REMOVED,
- /* Control Group has previously had a child cgroup or a task,
- * but no longer (only if CGRP_NOTIFY_ON_RELEASE is set) */
- CGRP_RELEASABLE,
- /* Control Group requires release notifications to userspace */
- CGRP_NOTIFY_ON_RELEASE,
-};
-
struct cgroup {
- unsigned long flags; /* "unsigned long" so bitops work */
+ /* "unsigned long" so bitops work. See CGRP_ flags in cgroup.c */
+ unsigned long flags;
/* count users of this cgroup. >0 means busy, but doesn't
* necessarily indicate the number of tasks in the
@@ -350,7 +340,7 @@ int cgroup_is_descendant(const struct cg
struct cgroup_subsys {
struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss,
struct cgroup *cgrp);
- void (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
+ int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
int (*can_attach)(struct cgroup_subsys *ss,
struct cgroup *cgrp, struct task_struct *tsk);
Index: mmotm-2.6.29-Jan16/kernel/cgroup.c
===================================================================
--- mmotm-2.6.29-Jan16.orig/kernel/cgroup.c
+++ mmotm-2.6.29-Jan16/kernel/cgroup.c
@@ -94,6 +94,22 @@ struct cgroupfs_root {
char release_agent_path[PATH_MAX];
};
+/* bits in struct cgroup flags field */
+enum {
+ /* Control Group is dead */
+ CGRP_REMOVED,
+ /* Control Group has previously had a child cgroup or a task,
+ * but no longer (only if CGRP_NOTIFY_ON_RELEASE is set) */
+ CGRP_RELEASABLE,
+ /* Control Group requires release notifications to userspace */
+ CGRP_NOTIFY_ON_RELEASE,
+ /*
+ * A thread in rmdir() is waiting to destroy this cgroup
+ * See wake_up_rmdir_waiters().
+ */
+ CGRP_WAIT_ON_RMDIR,
+};
+
/*
* The "rootnode" hierarchy is the "dummy hierarchy", reserved for the
* subsystems that are otherwise unattached - it never has more than a
@@ -622,13 +638,18 @@ static struct inode *cgroup_new_inode(mo
* Call subsys's pre_destroy handler.
* This is called before css refcnt check.
*/
-static void cgroup_call_pre_destroy(struct cgroup *cgrp)
+static int cgroup_call_pre_destroy(struct cgroup *cgrp)
{
struct cgroup_subsys *ss;
+ int ret = 0;
+
for_each_subsys(cgrp->root, ss)
- if (ss->pre_destroy)
- ss->pre_destroy(ss, cgrp);
- return;
+ if (ss->pre_destroy) {
+ ret = ss->pre_destroy(ss, cgrp);
+ if (ret)
+ break;
+ }
+ return ret;
}
static void free_cgroup_rcu(struct rcu_head *obj)
@@ -722,6 +743,22 @@ static void cgroup_d_remove_dir(struct d
remove_dir(dentry);
}
+/*
+ * A queue for waiters to do rmdir() cgroup. A tasks will sleep when
+ * cgroup->count == 0 && list_empty(&cgroup->children) && subsys has some
+ * reference to css->refcnt. In general, this refcnt is expected to goes down
+ * to zero, soon.
+ *
+ * CGRP_WAIT_ON_RMDIR flag is modified under cgroup's inode->i_mutex;
+ */
+DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
+
+static void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
+{
+ if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
+ wake_up_all(&cgroup_rmdir_waitq);
+}
+
static int rebind_subsystems(struct cgroupfs_root *root,
unsigned long final_bits)
{
@@ -1314,6 +1351,12 @@ int cgroup_attach_task(struct cgroup *cg
set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
synchronize_rcu();
put_css_set(cg);
+
+ /*
+ * wake up rmdir() waiter. the rmdir should fail since the cgroup
+ * is no longer empty.
+ */
+ cgroup_wakeup_rmdir_waiters(cgrp);
return 0;
}
@@ -2602,9 +2645,11 @@ static int cgroup_rmdir(struct inode *un
struct cgroup *cgrp = dentry->d_fsdata;
struct dentry *d;
struct cgroup *parent;
+ DEFINE_WAIT(wait);
+ int ret;
/* the vfs holds both inode->i_mutex already */
-
+again:
mutex_lock(&cgroup_mutex);
if (atomic_read(&cgrp->count) != 0) {
mutex_unlock(&cgroup_mutex);
@@ -2620,17 +2665,39 @@ static int cgroup_rmdir(struct inode *un
* Call pre_destroy handlers of subsys. Notify subsystems
* that rmdir() request comes.
*/
- cgroup_call_pre_destroy(cgrp);
+ ret = cgroup_call_pre_destroy(cgrp);
+ if (ret)
+ return ret;
mutex_lock(&cgroup_mutex);
parent = cgrp->parent;
-
- if (atomic_read(&cgrp->count)
- || !list_empty(&cgrp->children)
- || !cgroup_clear_css_refs(cgrp)) {
+ if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
mutex_unlock(&cgroup_mutex);
return -EBUSY;
}
+ /*
+ * css_put/get is provided for subsys to grab refcnt to css. In typical
+ * case, subsystem has no reference after pre_destroy(). But, under
+ * hierarchy management, some *temporal* refcnt can be hold.
+ * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
+ * is really busy, it should return -EBUSY at pre_destroy(). wake_up
+ * is called when css_put() is called and refcnt goes down to 0.
+ */
+ set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
+ prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
+
+ if (!cgroup_clear_css_refs(cgrp)) {
+ mutex_unlock(&cgroup_mutex);
+ schedule();
+ finish_wait(&cgroup_rmdir_waitq, &wait);
+ clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
+ if (signal_pending(current))
+ return -EINTR;
+ goto again;
+ }
+ /* NO css_tryget() can success after here. */
+ finish_wait(&cgroup_rmdir_waitq, &wait);
+ clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
spin_lock(&release_list_lock);
set_bit(CGRP_REMOVED, &cgrp->flags);
@@ -3186,10 +3253,12 @@ void __css_put(struct cgroup_subsys_stat
{
struct cgroup *cgrp = css->cgroup;
rcu_read_lock();
- if ((atomic_dec_return(&css->refcnt) == 1) &&
- notify_on_release(cgrp)) {
- set_bit(CGRP_RELEASABLE, &cgrp->flags);
- check_for_release(cgrp);
+ if (atomic_dec_return(&css->refcnt) == 1) {
+ if (notify_on_release(cgrp)) {
+ set_bit(CGRP_RELEASABLE, &cgrp->flags);
+ check_for_release(cgrp);
+ }
+ cgroup_wakeup_rmdir_waiters(cgrp);
}
rcu_read_unlock();
}
Index: mmotm-2.6.29-Jan16/mm/memcontrol.c
===================================================================
--- mmotm-2.6.29-Jan16.orig/mm/memcontrol.c
+++ mmotm-2.6.29-Jan16/mm/memcontrol.c
@@ -2371,11 +2371,12 @@ free_out:
return ERR_PTR(error);
}
-static void mem_cgroup_pre_destroy(struct cgroup_subsys *ss,
+static int mem_cgroup_pre_destroy(struct cgroup_subsys *ss,
struct cgroup *cont)
{
struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
- mem_cgroup_force_empty(mem, false);
+
+ return mem_cgroup_force_empty(mem, false);
}
static void mem_cgroup_destroy(struct cgroup_subsys *ss,
Index: mmotm-2.6.29-Jan16/Documentation/cgroups/cgroups.txt
===================================================================
--- mmotm-2.6.29-Jan16.orig/Documentation/cgroups/cgroups.txt
+++ mmotm-2.6.29-Jan16/Documentation/cgroups/cgroups.txt
@@ -478,11 +478,13 @@ cgroup->parent is still valid. (Note - c
newly-created cgroup if an error occurs after this subsystem's
create() method has been called for the new cgroup).
-void pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp);
+int pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp);
Called before checking the reference count on each subsystem. This may
be useful for subsystems which have some extra references even if
-there are not tasks in the cgroup.
+there are not tasks in the cgroup. If pre_destroy() returns error code,
+rmdir() will fail with it. From this behavior, pre_destroy() can be
+called plural times against a cgroup.
int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
struct task_struct *task)
A sample of background reclaim for memcg using pdflush().
This is just an example; any comments are welcome.
It probably needs some more work and time before this patch is ready.
This patch adds background memory reclaim for memcg, similar to kswapd().
Here, pdflush() is used to reclaim some more memory when tasks under a
memcg hit the limit.
Note:
- Considering hierarchy, high/low watermarks in the kernel seem to be
  very complex. My purpose is to add functionality like kswapd().
No performance test yet.
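Condensed, the trigger added to hierarchical reclaim looks like this (the
spinlock around pdflush_called is dropped here for brevity; only the css ID,
not a reference, is handed to the worker, which looks the memcg up again via
css_lookup()):

	if (!shrink && !root_mem->pdflush_called &&
	    !pdflush_operation(mem_cgroup_bg_reclaim, css_id(&root_mem->css)))
		root_mem->pdflush_called = 1;	/* cleared by the worker when done */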
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 85 insertions(+), 2 deletions(-)
Index: mmotm-2.6.29-Jan16/mm/memcontrol.c
===================================================================
--- mmotm-2.6.29-Jan16.orig/mm/memcontrol.c
+++ mmotm-2.6.29-Jan16/mm/memcontrol.c
@@ -37,7 +37,7 @@
#include <linux/mm_inline.h>
#include <linux/page_cgroup.h>
#include "internal.h"
-
+#include <linux/writeback.h>
#include <asm/uaccess.h>
struct cgroup_subsys mem_cgroup_subsys __read_mostly;
@@ -166,6 +166,7 @@ struct mem_cgroup {
* reclaimed from.
*/
int last_scanned_child;
+ int pdflush_called;
/*
* Should the accounting and control be hierarchical, per subtree?
*/
@@ -297,7 +298,7 @@ static struct mem_cgroup *try_get_mem_cg
struct mem_cgroup *mem = NULL;
if (!mm)
- return;
+ return NULL;
/*
* Because we have no locks, mm->owner's may be being moved to other
* cgroup. We use css_tryget() here even if this looks
@@ -771,6 +772,7 @@ mem_cgroup_select_victim(struct mem_cgro
return ret;
}
+static void mem_cgroup_bg_reclaim(unsigned long arg0);
/*
* Scan the hierarchy if needed to reclaim memory. We remember the last child
* we reclaimed from, so that we don't end up penalizing one child extensively
@@ -790,6 +792,17 @@ static int mem_cgroup_hierarchical_recla
int ret, total = 0;
int loop = 0;
+ if (!shrink) { /* memory usage hit limit */
+ if (!root_mem->pdflush_called) {
+ if (!pdflush_operation(mem_cgroup_bg_reclaim,
+ css_id(&root_mem->css))) {
+ spin_lock(&root_mem->reclaim_param_lock);
+ root_mem->pdflush_called = 1;
+ spin_unlock(&root_mem->reclaim_param_lock);
+ }
+ }
+ }
+
while (loop < 2) {
victim = mem_cgroup_select_victim(root_mem);
if (victim == root_mem)
@@ -817,6 +830,76 @@ static int mem_cgroup_hierarchical_recla
return total;
}
+/*
+ * Called when hierarchy reclaim triggered by memory limitation check.
+ * ID of hierarchy root is the argument.
+ */
+#define FREE_THRESH_RATIO (95) /* 95% */
+#define FREE_THRESH_MAX (1024 * 1024) /* 1M bytes*/
+#define FREE_THRESH_MIN (128 * 1024) /* 128kbytes */
+static u64 memcg_stable_free_thresh(u64 limit)
+{
+ u64 ret;
+
+ ret = limit * FREE_THRESH_RATIO/100;
+ /* background writeout is overkill for this cgroup ? */
+ if (ret < FREE_THRESH_MIN)
+ ret = 0;
+ if (ret > FREE_THRESH_MAX)
+ ret = FREE_THRESH_MAX;
+ return ret;
+}
+
+static void mem_cgroup_bg_reclaim(unsigned long arg0)
+{
+ struct cgroup_subsys_state *css;
+ struct mem_cgroup *mem = NULL;
+ u64 usage, limit;
+ bool memshortage, noswap;
+ int retry;
+
+ rcu_read_lock();
+ css = css_lookup(&mem_cgroup_subsys, arg0);
+ if (css && css_tryget(css))
+ mem = container_of(css, struct mem_cgroup, css);
+ rcu_read_unlock();
+ if (!mem)
+ return;
+ retry = mem_cgroup_count_children(mem);
+ while (retry--) {
+ /* check situation */
+ memshortage = false;
+ noswap = false;
+ usage = res_counter_read_u64(&mem->res, RES_USAGE);
+ limit = res_counter_read_u64(&mem->res, RES_LIMIT);
+
+ if (usage > limit - memcg_stable_free_thresh(limit))
+ memshortage = true;
+
+ if (do_swap_account) {
+ usage = res_counter_read_u64(&mem->memsw, RES_USAGE);
+ limit = res_counter_read_u64(&mem->memsw, RES_LIMIT);
+ if (usage > limit - memcg_stable_free_thresh(limit))
+ noswap = true;
+ }
+ if (memshortage || noswap)
+ mem_cgroup_hierarchical_reclaim(mem, GFP_KERNEL,
+ noswap, true);
+ else
+ break;
+ cond_resched();
+ }
+
+ spin_lock(&mem->reclaim_param_lock);
+ mem->pdflush_called = 0;
+ spin_unlock(&mem->reclaim_param_lock);
+ css_put(&mem->css);
+
+ return;
+}
+
+
+
bool mem_cgroup_oom_called(struct task_struct *task)
{
bool ret = false;
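To see what memcg_stable_free_thresh() above works out to, here is a small
standalone calculation (userspace illustration only; the constants are copied
from the patch):

        #include <stdio.h>

        /* mirrors memcg_stable_free_thresh(): 95% of limit; below
         * FREE_THRESH_MIN it becomes 0, above FREE_THRESH_MAX it is capped.
         */
        static unsigned long long stable_free_thresh(unsigned long long limit)
        {
                unsigned long long ret = limit * 95 / 100;

                if (ret < 128 * 1024)           /* FREE_THRESH_MIN: bg reclaim is overkill */
                        return 0;
                if (ret > 1024 * 1024)          /* FREE_THRESH_MAX */
                        return 1024 * 1024;
                return ret;
        }

        int main(void)
        {
                /* 1GB limit -> 1048576 (keep ~1MB free), 100KB limit -> 0 (skipped) */
                printf("%llu %llu\n",
                       stable_free_thresh(1ULL << 30), stable_free_thresh(100 * 1024));
                return 0;
        }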
> +/**
> + * css_get_next - lookup next cgroup under specified hierarchy.
> + * @ss: pointer to subsystem
> + * @id: current position of iteration.
> + * @root: pointer to css. search tree under this.
> + * @foundid: position of found object.
> + *
> + * Search next css under the specified hierarchy of rootid. Calling under
> + * rcu_read_lock() is necessary. Returns NULL if it reaches the end.
> + */
> +struct cgroup_subsys_state *
> +css_get_next(struct cgroup_subsys *ss, int id,
> + struct cgroup_subsys_state *root, int *foundid)
> +{
> + struct cgroup_subsys_state *ret = NULL;
> + struct css_id *tmp;
> + int tmpid;
> + int rootid = css_id(root);
> + int depth = css_depth(root);
> +
> + if (!rootid)
> + return NULL;
> +
> + BUG_ON(!ss->use_id);
> + rcu_read_lock();
> + /* fill start point for scan */
> + tmpid = id;
> + while (1) {
> + /*
> + * scan next entry from bitmap(tree), tmpid is updated after
> + * idr_get_next().
> + */
> + spin_lock(&ss->id_lock);
> + tmp = idr_get_next(&ss->idr, &tmpid);
> + spin_unlock(&ss->id_lock);
> +
> + if (!tmp)
> + break;
> + if (tmp->depth >= depth && tmp->stack[depth] == rootid) {
> + ret = rcu_dereference(tmp->css);
> + if (ret) {
> + *foundid = tmpid;
> + break;
> + }
> + }
> + /* continue to scan from next id */
> + tmpid = tmpid + 1;
> + }
> +
> + rcu_read_unlock();
> + return ret;
> +}
> +
css_get_next is called under rcu_read_lock, so it doesn't need to call
rcu_read_lock/unlock by itself.
Otherwise, looks good to me.
Reviewed-by: Daisuke Nishimura <[email protected]>
Thanks,
Daisuke Nishimura.
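(For reference, the caller added in patch 2, mem_cgroup_select_victim(),
already wraps the call in the RCU read lock, roughly:

        rcu_read_lock();
        nextid = root_mem->last_scanned_child + 1;
        css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css, &found);
        if (css && css_tryget(css))
                ret = container_of(css, struct mem_cgroup, css);
        rcu_read_unlock();

so the rcu_read_lock()/rcu_read_unlock() pair inside css_get_next() is indeed
redundant, though harmless since RCU read-side sections nest.)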
On Thu, 22 Jan 2009 18:35:09 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Patch for Per-CSS(Cgroup Subsys State) ID and private hierarchy code.
>
> This patch attaches unique ID to each css and provides following.
>
> - css_lookup(subsys, id)
> returns pointer to struct cgroup_subysys_state of id.
> - css_get_next(subsys, id, rootid, depth, foundid)
> returns the next css under "root" by scanning
>
> When cgrou_subsys->use_id is set, an id for css is maintained.
> The cgroup framework only parepares
> - css_id of root css for subsys
> - id is automatically attached at creation of css.
> - id is *not* freed automatically. Because the cgroup framework
> don't know lifetime of cgroup_subsys_state.
> free_css_id() function is provided. This must be called by subsys.
>
> There are several reasons to develop this.
> - Saving space .... For example, memcg's swap_cgroup is array of
> pointers to cgroup. But it is not necessary to be very fast.
> By replacing pointers(8bytes per ent) to ID (2byes per ent), we can
> reduce much amount of memory usage.
>
> - Scanning without lock.
> CSS_ID provides "scan id under this ROOT" function. By this, scanning
> css under root can be written without locks.
> ex)
> do {
> rcu_read_lock();
> next = cgroup_get_next(subsys, id, root, &found);
> /* check sanity of next here */
> css_tryget();
> rcu_read_unlock();
> id = found + 1
> } while(...)
>
> Characteristics:
> - Each css has unique ID under subsys.
> - Lifetime of ID is controlled by subsys.
> - css ID contains "ID" and "Depth in hierarchy" and stack of hierarchy
> - Allowed ID is 1-65535, ID 0 is UNUSED ID.
>
> Design Choices:
> - scan-by-ID v.s. scan-by-tree-walk.
> As /proc's pid scan does, scan-by-ID is robust when scanning is done
> by following kind of routine.
> scan -> rest a while(release a lock) -> conitunue from interrupted
> memcg's hierarchical reclaim does this.
>
> - When subsys->use_id is set, # of css in the system is limited to
> 65535.
>
> Changelog: (v7) -> (v8)
> - Update id->css pointer after cgroup is populated.
>
> Changelog: (v6) -> (v7)
> - refcnt for CSS ID is removed. Subsys can do it by own logic.
> - New id allocation is done automatically.
> - fixed typos.
> - fixed limit check of ID.
>
> Changelog: (v5) -> (v6)
> - max depth is removed.
> - changed arguments to "scan"
> Changelog: (v4) -> (v5)
> - Totally re-designed as per-css ID.
> Changelog:(v3) -> (v4)
> - updated comments.
> - renamed hierarchy_code[] to stack[]
> - merged prepare_id routines.
>
> Changelog (v2) -> (v3)
> - removed cgroup_id_getref().
> - added cgroup_id_tryget().
>
> Changelog (v1) -> (v2):
> - Design change: show only ID(integer) to outside of cgroup.c
> - moved cgroup ID definition from include/ to kernel/cgroup.c
> - struct cgroup_id is freed by RCU.
> - changed interface from pointer to "int"
> - kill_sb() is handled.
> - ID 0 as unused ID.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> include/linux/cgroup.h | 50 ++++++++
> include/linux/idr.h | 1
> kernel/cgroup.c | 289 ++++++++++++++++++++++++++++++++++++++++++++++++-
> lib/idr.c | 46 +++++++
> 4 files changed, 385 insertions(+), 1 deletion(-)
>
> Index: mmotm-2.6.29-Jan16/include/linux/cgroup.h
> ===================================================================
> --- mmotm-2.6.29-Jan16.orig/include/linux/cgroup.h
> +++ mmotm-2.6.29-Jan16/include/linux/cgroup.h
> @@ -15,6 +15,7 @@
> #include <linux/cgroupstats.h>
> #include <linux/prio_heap.h>
> #include <linux/rwsem.h>
> +#include <linux/idr.h>
>
> #ifdef CONFIG_CGROUPS
>
> @@ -22,6 +23,7 @@ struct cgroupfs_root;
> struct cgroup_subsys;
> struct inode;
> struct cgroup;
> +struct css_id;
>
> extern int cgroup_init_early(void);
> extern int cgroup_init(void);
> @@ -59,6 +61,8 @@ struct cgroup_subsys_state {
> atomic_t refcnt;
>
> unsigned long flags;
> + /* ID for this css, if possible */
> + struct css_id *id;
> };
>
> /* bits in struct cgroup_subsys_state flags field */
> @@ -363,6 +367,11 @@ struct cgroup_subsys {
> int active;
> int disabled;
> int early_init;
> + /*
> + * True if this subsys uses ID. ID is not available before cgroup_init()
> + * (not available in early_init time.)
> + */
> + bool use_id;
> #define MAX_CGROUP_TYPE_NAMELEN 32
> const char *name;
>
> @@ -384,6 +393,9 @@ struct cgroup_subsys {
> */
> struct cgroupfs_root *root;
> struct list_head sibling;
> + /* used when use_id == true */
> + struct idr idr;
> + spinlock_t id_lock;
> };
>
> #define SUBSYS(_x) extern struct cgroup_subsys _x ## _subsys;
> @@ -437,6 +449,44 @@ void cgroup_iter_end(struct cgroup *cgrp
> int cgroup_scan_tasks(struct cgroup_scanner *scan);
> int cgroup_attach_task(struct cgroup *, struct task_struct *);
>
> +/*
> + * CSS ID is ID for cgroup_subsys_state structs under subsys. This only works
> + * if cgroup_subsys.use_id == true. It can be used for looking up and scanning.
> + * CSS ID is assigned at cgroup allocation (create) automatically
> + * and removed when subsys calls free_css_id() function. This is because
> + * the lifetime of cgroup_subsys_state is subsys's matter.
> + *
> + * Looking up and scanning function should be called under rcu_read_lock().
> + * Taking cgroup_mutex()/hierarchy_mutex() is not necessary for following calls.
> + * But the css returned by this routine can be "not populated yet" or "being
> + * destroyed". The caller should check css and cgroup's status.
> + */
> +
> +/*
> + * Typically Called at ->destroy(), or somewhere the subsys frees
> + * cgroup_subsys_state.
> + */
> +void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css);
> +
> +/* Find a cgroup_subsys_state which has given ID */
> +
> +struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id);
> +
> +/*
> + * Get a cgroup whose id is greater than or equal to id under tree of root.
> + * Returning a cgroup_subsys_state or NULL.
> + */
> +struct cgroup_subsys_state *css_get_next(struct cgroup_subsys *ss, int id,
> + struct cgroup_subsys_state *root, int *foundid);
> +
> +/* Returns true if root is ancestor of cg */
> +bool css_is_ancestor(struct cgroup_subsys_state *cg,
> + struct cgroup_subsys_state *root);
> +
> +/* Get id and depth of css */
> +unsigned short css_id(struct cgroup_subsys_state *css);
> +unsigned short css_depth(struct cgroup_subsys_state *css);
> +
> #else /* !CONFIG_CGROUPS */
>
> static inline int cgroup_init_early(void) { return 0; }
> Index: mmotm-2.6.29-Jan16/kernel/cgroup.c
> ===================================================================
> --- mmotm-2.6.29-Jan16.orig/kernel/cgroup.c
> +++ mmotm-2.6.29-Jan16/kernel/cgroup.c
> @@ -94,7 +94,6 @@ struct cgroupfs_root {
> char release_agent_path[PATH_MAX];
> };
>
> -
> /*
> * The "rootnode" hierarchy is the "dummy hierarchy", reserved for the
> * subsystems that are otherwise unattached - it never has more than a
> @@ -102,6 +101,39 @@ struct cgroupfs_root {
> */
> static struct cgroupfs_root rootnode;
>
> +/*
> + * CSS ID -- ID per subsys's Cgroup Subsys State(CSS). used only when
> + * cgroup_subsys->use_id != 0.
> + */
> +#define CSS_ID_MAX (65535)
> +struct css_id {
> + /*
> + * The css to which this ID points. This pointer is set to valid value
> + * after cgroup is populated. If cgroup is removed, this will be NULL.
> + * This pointer is expected to be RCU-safe because destroy()
> + * is called after synchronize_rcu(). But for safe use, css_is_removed()
> + * css_tryget() should be used for avoiding race.
> + */
> + struct cgroup_subsys_state *css;
> + /*
> + * ID of this css.
> + */
> + unsigned short id;
> + /*
> + * Depth in hierarchy which this ID belongs to.
> + */
> + unsigned short depth;
> + /*
> + * ID is freed by RCU. (and lookup routine is RCU safe.)
> + */
> + struct rcu_head rcu_head;
> + /*
> + * Hierarchy of CSS ID belongs to.
> + */
> + unsigned short stack[0]; /* Array of Length (depth+1) */
> +};
> +
> +
> /* The list of hierarchy roots */
>
> static LIST_HEAD(roots);
> @@ -185,6 +217,8 @@ struct cg_cgroup_link {
> static struct css_set init_css_set;
> static struct cg_cgroup_link init_css_set_link;
>
> +static int cgroup_subsys_init_idr(struct cgroup_subsys *ss);
> +
> /* css_set_lock protects the list of css_set objects, and the
> * chain of tasks off each css_set. Nests outside task->alloc_lock
> * due to cgroup_iter_start() */
> @@ -567,6 +601,9 @@ static struct backing_dev_info cgroup_ba
> .capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK,
> };
>
> +static int alloc_css_id(struct cgroup_subsys *ss,
> + struct cgroup *parent, struct cgroup *child);
> +
> static struct inode *cgroup_new_inode(mode_t mode, struct super_block *sb)
> {
> struct inode *inode = new_inode(sb);
> @@ -2324,6 +2361,17 @@ static int cgroup_populate_dir(struct cg
> if (ss->populate && (err = ss->populate(ss, cgrp)) < 0)
> return err;
> }
> + /* This cgroup is ready now */
> + for_each_subsys(cgrp->root, ss) {
> + struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
> + /*
> + * Update id->css pointer and make this css visible from
> + * CSS ID functions. This pointer will be dereferened
> + * from RCU-read-side without locks.
> + */
> + if (css->id)
> + rcu_assign_pointer(css->id->css, css);
> + }
>
> return 0;
> }
> @@ -2335,6 +2383,7 @@ static void init_cgroup_css(struct cgrou
> css->cgroup = cgrp;
> atomic_set(&css->refcnt, 1);
> css->flags = 0;
> + css->id = NULL;
> if (cgrp == dummytop)
> set_bit(CSS_ROOT, &css->flags);
> BUG_ON(cgrp->subsys[ss->subsys_id]);
> @@ -2410,6 +2459,10 @@ static long cgroup_create(struct cgroup
> goto err_destroy;
> }
> init_cgroup_css(css, ss, cgrp);
> + if (ss->use_id)
> + if (alloc_css_id(ss, parent, cgrp))
> + goto err_destroy;
> + /* At error, ->destroy() callback has to free assigned ID. */
> }
>
> cgroup_lock_hierarchy(root);
> @@ -2701,6 +2754,8 @@ int __init cgroup_init(void)
> struct cgroup_subsys *ss = subsys[i];
> if (!ss->early_init)
> cgroup_init_subsys(ss);
> + if (ss->use_id)
> + cgroup_subsys_init_idr(ss);
> }
>
> /* Add init_css_set to the hash table */
> @@ -3234,3 +3289,235 @@ static int __init cgroup_disable(char *s
> return 1;
> }
> __setup("cgroup_disable=", cgroup_disable);
> +
> +/*
> + * Functons for CSS ID.
> + */
> +
> +/*
> + *To get ID other than 0, this should be called when !cgroup_is_removed().
> + */
> +unsigned short css_id(struct cgroup_subsys_state *css)
> +{
> + struct css_id *cssid = rcu_dereference(css->id);
> +
> + if (cssid)
> + return cssid->id;
> + return 0;
> +}
> +
> +unsigned short css_depth(struct cgroup_subsys_state *css)
> +{
> + struct css_id *cssid = rcu_dereference(css->id);
> +
> + if (cssid)
> + return cssid->depth;
> + return 0;
> +}
> +
> +bool css_is_ancestor(struct cgroup_subsys_state *child,
> + struct cgroup_subsys_state *root)
> +{
> + struct css_id *child_id = rcu_dereference(child->id);
> + struct css_id *root_id = rcu_dereference(root->id);
> +
> + if (!child_id || !root_id || (child_id->depth < root_id->depth))
> + return false;
> + return child_id->stack[root_id->depth] == root_id->id;
> +}
> +
> +static void __free_css_id_cb(struct rcu_head *head)
> +{
> + struct css_id *id;
> +
> + id = container_of(head, struct css_id, rcu_head);
> + kfree(id);
> +}
> +
> +void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css)
> +{
> + struct css_id *id = css->id;
> + /* When this is called before css_id initialization, id can be NULL */
> + if (!id)
> + return;
> +
> + BUG_ON(!ss->use_id);
> +
> + rcu_assign_pointer(id->css, NULL);
> + rcu_assign_pointer(css->id, NULL);
> + spin_lock(&ss->id_lock);
> + idr_remove(&ss->idr, id->id);
> + spin_unlock(&ss->id_lock);
> + call_rcu(&id->rcu_head, __free_css_id_cb);
> +}
> +
> +/*
> + * This is called by init or create(). Then, calls to this function are
> + * always serialized (By cgroup_mutex() at create()).
> + */
> +
> +static struct css_id *get_new_cssid(struct cgroup_subsys *ss, int depth)
> +{
> + struct css_id *newid;
> + int myid, error, size;
> +
> + BUG_ON(!ss->use_id);
> +
> + size = sizeof(*newid) + sizeof(unsigned short) * (depth + 1);
> + newid = kzalloc(size, GFP_KERNEL);
> + if (!newid)
> + return ERR_PTR(-ENOMEM);
> + /* get id */
> + if (unlikely(!idr_pre_get(&ss->idr, GFP_KERNEL))) {
> + error = -ENOMEM;
> + goto err_out;
> + }
> + spin_lock(&ss->id_lock);
> + /* Don't use 0. allocates an ID of 1-65535 */
> + error = idr_get_new_above(&ss->idr, newid, 1, &myid);
> + spin_unlock(&ss->id_lock);
> +
> + /* Returns error when there are no free spaces for new ID.*/
> + if (error) {
> + error = -ENOSPC;
> + goto err_out;
> + }
> + if (myid > CSS_ID_MAX)
> + goto remove_idr;
> +
> + newid->id = myid;
> + newid->depth = depth;
> + return newid;
> +remove_idr:
> + error = -ENOSPC;
> + spin_lock(&ss->id_lock);
> + idr_remove(&ss->idr, myid);
> + spin_unlock(&ss->id_lock);
> +err_out:
> + kfree(newid);
> + return ERR_PTR(error);
> +
> +}
> +
> +static int __init cgroup_subsys_init_idr(struct cgroup_subsys *ss)
> +{
> + struct css_id *newid;
> + struct cgroup_subsys_state *rootcss;
> +
> + spin_lock_init(&ss->id_lock);
> + idr_init(&ss->idr);
> +
> + rootcss = init_css_set.subsys[ss->subsys_id];
> + newid = get_new_cssid(ss, 0);
> + if (IS_ERR(newid))
> + return PTR_ERR(newid);
> +
> + newid->stack[0] = newid->id;
> + newid->css = rootcss;
> + rootcss->id = newid;
> + return 0;
> +}
> +
> +static int alloc_css_id(struct cgroup_subsys *ss, struct cgroup *parent,
> + struct cgroup *child)
> +{
> + int subsys_id, i, depth = 0;
> + struct cgroup_subsys_state *parent_css, *child_css;
> + struct css_id *child_id, *parent_id = NULL;
> +
> + subsys_id = ss->subsys_id;
> + parent_css = parent->subsys[subsys_id];
> + child_css = child->subsys[subsys_id];
> + depth = css_depth(parent_css) + 1;
> + parent_id = parent_css->id;
> +
> + child_id = get_new_cssid(ss, depth);
> + if (IS_ERR(child_id))
> + return PTR_ERR(child_id);
> +
> + for (i = 0; i < depth; i++)
> + child_id->stack[i] = parent_id->stack[i];
> + child_id->stack[depth] = child_id->id;
> + /*
> + * child_id->css pointer will be set after this cgroup is available
> + * see cgroup_populate_dir()
> + */
> + rcu_assign_pointer(child_css->id, child_id);
> +
> + return 0;
> +}
> +
> +/**
> + * css_lookup - lookup css by id
> + * @ss: cgroup subsys to be looked into.
> + * @id: the id
> + *
> + * Returns pointer to cgroup_subsys_state if there is valid one with id.
> + * NULL if not. Should be called under rcu_read_lock()
> + */
> +struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id)
> +{
> + struct css_id *cssid = NULL;
> +
> + BUG_ON(!ss->use_id);
> + cssid = idr_find(&ss->idr, id);
> +
> + if (unlikely(!cssid))
> + return NULL;
> +
> + return rcu_dereference(cssid->css);
> +}
> +
> +/**
> + * css_get_next - lookup next cgroup under specified hierarchy.
> + * @ss: pointer to subsystem
> + * @id: current position of iteration.
> + * @root: pointer to css. search tree under this.
> + * @foundid: position of found object.
> + *
> + * Search next css under the specified hierarchy of rootid. Calling under
> + * rcu_read_lock() is necessary. Returns NULL if it reaches the end.
> + */
> +struct cgroup_subsys_state *
> +css_get_next(struct cgroup_subsys *ss, int id,
> + struct cgroup_subsys_state *root, int *foundid)
> +{
> + struct cgroup_subsys_state *ret = NULL;
> + struct css_id *tmp;
> + int tmpid;
> + int rootid = css_id(root);
> + int depth = css_depth(root);
> +
> + if (!rootid)
> + return NULL;
> +
> + BUG_ON(!ss->use_id);
> + rcu_read_lock();
> + /* fill start point for scan */
> + tmpid = id;
> + while (1) {
> + /*
> + * scan next entry from bitmap(tree), tmpid is updated after
> + * idr_get_next().
> + */
> + spin_lock(&ss->id_lock);
> + tmp = idr_get_next(&ss->idr, &tmpid);
> + spin_unlock(&ss->id_lock);
> +
> + if (!tmp)
> + break;
> + if (tmp->depth >= depth && tmp->stack[depth] == rootid) {
> + ret = rcu_dereference(tmp->css);
> + if (ret) {
> + *foundid = tmpid;
> + break;
> + }
> + }
> + /* continue to scan from next id */
> + tmpid = tmpid + 1;
> + }
> +
> + rcu_read_unlock();
> + return ret;
> +}
> +
> Index: mmotm-2.6.29-Jan16/include/linux/idr.h
> ===================================================================
> --- mmotm-2.6.29-Jan16.orig/include/linux/idr.h
> +++ mmotm-2.6.29-Jan16/include/linux/idr.h
> @@ -106,6 +106,7 @@ int idr_get_new(struct idr *idp, void *p
> int idr_get_new_above(struct idr *idp, void *ptr, int starting_id, int *id);
> int idr_for_each(struct idr *idp,
> int (*fn)(int id, void *p, void *data), void *data);
> +void *idr_get_next(struct idr *idp, int *nextid);
> void *idr_replace(struct idr *idp, void *ptr, int id);
> void idr_remove(struct idr *idp, int id);
> void idr_remove_all(struct idr *idp);
> Index: mmotm-2.6.29-Jan16/lib/idr.c
> ===================================================================
> --- mmotm-2.6.29-Jan16.orig/lib/idr.c
> +++ mmotm-2.6.29-Jan16/lib/idr.c
> @@ -579,6 +579,52 @@ int idr_for_each(struct idr *idp,
> EXPORT_SYMBOL(idr_for_each);
>
> /**
> + * idr_get_next - lookup next object of id to given id.
> + * @idp: idr handle
> + * @id: pointer to lookup key
> + *
> + * Returns pointer to registered object with id, which is next number to
> + * given id.
> + */
> +
> +void *idr_get_next(struct idr *idp, int *nextidp)
> +{
> + struct idr_layer *p, *pa[MAX_LEVEL];
> + struct idr_layer **paa = &pa[0];
> + int id = *nextidp;
> + int n, max;
> +
> + /* find first ent */
> + n = idp->layers * IDR_BITS;
> + max = 1 << n;
> + p = rcu_dereference(idp->top);
> + if (!p)
> + return NULL;
> +
> + while (id < max) {
> + while (n > 0 && p) {
> + n -= IDR_BITS;
> + *paa++ = p;
> + p = rcu_dereference(p->ary[(id >> n) & IDR_MASK]);
> + }
> +
> + if (p) {
> + *nextidp = id;
> + return p;
> + }
> +
> + id += 1 << n;
> + while (n < fls(id)) {
> + n += IDR_BITS;
> + p = *--paa;
> + }
> + }
> + return NULL;
> +}
> +
> +
> +
> +/**
> * idr_replace - replace pointer for given id
> * @idp: idr handle
> * @ptr: pointer you want associated with the id
>
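For a quick picture of what a subsystem has to do to use CSS IDs: set use_id
in its cgroup_subsys and free the ID together with the css. A minimal sketch,
taken from the memcg hunks quoted below:

        /* in the subsystem definition */
        struct cgroup_subsys mem_cgroup_subsys = {
                ...
                .use_id = 1,
        };

        /* in the path that frees the cgroup_subsys_state */
        static void __mem_cgroup_free(struct mem_cgroup *mem)
        {
                free_css_id(&mem_cgroup_subsys, &mem->css);
                ...
        }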
On Thu, 22 Jan 2009 18:35:57 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> From: KAMEZAWA Hiroyuki <[email protected]>
> Use css ID in memcg.
>
> Assign a CSS ID to each memcg and use css_get_next() for scanning the hierarchy.
>
> Assume the following tree.
>
> group_A (ID=3)
> /01 (ID=4)
> /0A (ID=7)
> /02 (ID=10)
> group_B (ID=5)
> and a task in group_A/01/0A hits the limit at group_A.
>
> reclaim will be done in the following order (round-robin).
> group_A(3) -> group_A/01 (4) -> group_A/01/0A (7) -> group_A/02(10)
> -> group_A -> .....
>
> Round robin by ID. The last visited cgroup is recorded, and the scan
> restarts from it when reclaim starts again.
> (A smarter algorithm could be implemented..)
>
> No cgroup_mutex or hierarchy_mutex is required.
>
> Changelog (v3) -> (v4)
> - dropped css_is_populated() check
> - removed scan_age and used simpler logic.
>
I think a check for mem_cgroup_local_usage is also added by this version :)
> Changelog (v2) -> (v3)
> - Added css_is_populated() check
> - Adjusted to rc1 + Nishimura's fixes.
> - Increased comments.
>
> Changelog (v1) -> (v2)
> - Updated texts.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> ---
> mm/memcontrol.c | 220 ++++++++++++++++++++------------------------------------
> 1 file changed, 82 insertions(+), 138 deletions(-)
>
> Index: mmotm-2.6.29-Jan16/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.29-Jan16.orig/mm/memcontrol.c
> +++ mmotm-2.6.29-Jan16/mm/memcontrol.c
> @@ -95,6 +95,15 @@ static s64 mem_cgroup_read_stat(struct m
> return ret;
> }
>
> +static s64 mem_cgroup_local_usage(struct mem_cgroup_stat *stat)
> +{
> + s64 ret;
> +
It would be better to initialize it to 0.
Reviewed-by: Daisuke Nishimura <[email protected]>
Thanks,
Daisuke Nishimura.
> + ret = mem_cgroup_read_stat(stat, MEM_CGROUP_STAT_CACHE);
> + ret += mem_cgroup_read_stat(stat, MEM_CGROUP_STAT_RSS);
> + return ret;
> +}
> +
> /*
> * per-zone information in memory controller.
> */
> @@ -154,9 +163,9 @@ struct mem_cgroup {
>
> /*
> * While reclaiming in a hiearchy, we cache the last child we
> - * reclaimed from. Protected by hierarchy_mutex
> + * reclaimed from.
> */
> - struct mem_cgroup *last_scanned_child;
> + int last_scanned_child;
> /*
> * Should the accounting and control be hierarchical, per subtree?
> */
> @@ -629,103 +638,6 @@ unsigned long mem_cgroup_isolate_pages(u
> #define mem_cgroup_from_res_counter(counter, member) \
> container_of(counter, struct mem_cgroup, member)
>
> -/*
> - * This routine finds the DFS walk successor. This routine should be
> - * called with hierarchy_mutex held
> - */
> -static struct mem_cgroup *
> -__mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
> -{
> - struct cgroup *cgroup, *curr_cgroup, *root_cgroup;
> -
> - curr_cgroup = curr->css.cgroup;
> - root_cgroup = root_mem->css.cgroup;
> -
> - if (!list_empty(&curr_cgroup->children)) {
> - /*
> - * Walk down to children
> - */
> - cgroup = list_entry(curr_cgroup->children.next,
> - struct cgroup, sibling);
> - curr = mem_cgroup_from_cont(cgroup);
> - goto done;
> - }
> -
> -visit_parent:
> - if (curr_cgroup == root_cgroup) {
> - /* caller handles NULL case */
> - curr = NULL;
> - goto done;
> - }
> -
> - /*
> - * Goto next sibling
> - */
> - if (curr_cgroup->sibling.next != &curr_cgroup->parent->children) {
> - cgroup = list_entry(curr_cgroup->sibling.next, struct cgroup,
> - sibling);
> - curr = mem_cgroup_from_cont(cgroup);
> - goto done;
> - }
> -
> - /*
> - * Go up to next parent and next parent's sibling if need be
> - */
> - curr_cgroup = curr_cgroup->parent;
> - goto visit_parent;
> -
> -done:
> - return curr;
> -}
> -
> -/*
> - * Visit the first child (need not be the first child as per the ordering
> - * of the cgroup list, since we track last_scanned_child) of @mem and use
> - * that to reclaim free pages from.
> - */
> -static struct mem_cgroup *
> -mem_cgroup_get_next_node(struct mem_cgroup *root_mem)
> -{
> - struct cgroup *cgroup;
> - struct mem_cgroup *orig, *next;
> - bool obsolete;
> -
> - /*
> - * Scan all children under the mem_cgroup mem
> - */
> - mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
> -
> - orig = root_mem->last_scanned_child;
> - obsolete = mem_cgroup_is_obsolete(orig);
> -
> - if (list_empty(&root_mem->css.cgroup->children)) {
> - /*
> - * root_mem might have children before and last_scanned_child
> - * may point to one of them. We put it later.
> - */
> - if (orig)
> - VM_BUG_ON(!obsolete);
> - next = NULL;
> - goto done;
> - }
> -
> - if (!orig || obsolete) {
> - cgroup = list_first_entry(&root_mem->css.cgroup->children,
> - struct cgroup, sibling);
> - next = mem_cgroup_from_cont(cgroup);
> - } else
> - next = __mem_cgroup_get_next_node(orig, root_mem);
> -
> -done:
> - if (next)
> - mem_cgroup_get(next);
> - root_mem->last_scanned_child = next;
> - if (orig)
> - mem_cgroup_put(orig);
> - mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
> - return (next) ? next : root_mem;
> -}
> -
> static bool mem_cgroup_check_under_limit(struct mem_cgroup *mem)
> {
> if (do_swap_account) {
> @@ -755,46 +667,79 @@ static unsigned int get_swappiness(struc
> }
>
> /*
> - * Dance down the hierarchy if needed to reclaim memory. We remember the
> - * last child we reclaimed from, so that we don't end up penalizing
> - * one child extensively based on its position in the children list.
> + * Visit the first child (need not be the first child as per the ordering
> + * of the cgroup list, since we track last_scanned_child) of @mem and use
> + * that to reclaim free pages from.
> + */
> +static struct mem_cgroup *
> +mem_cgroup_select_victim(struct mem_cgroup *root_mem)
> +{
> + struct mem_cgroup *ret = NULL;
> + struct cgroup_subsys_state *css;
> + int nextid, found;
> +
> + if (!root_mem->use_hierarchy) {
> + css_get(&root_mem->css);
> + ret = root_mem;
> + }
> +
> + while (!ret) {
> + rcu_read_lock();
> + nextid = root_mem->last_scanned_child + 1;
> + css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css,
> + &found);
> + if (css && css_tryget(css))
> + ret = container_of(css, struct mem_cgroup, css);
> +
> + rcu_read_unlock();
> + /* Updates scanning parameter */
> + spin_lock(&root_mem->reclaim_param_lock);
> + if (!css) {
> + /* this means start scan from ID:1 */
> + root_mem->last_scanned_child = 0;
> + } else
> + root_mem->last_scanned_child = found;
> + spin_unlock(&root_mem->reclaim_param_lock);
> + }
> +
> + return ret;
> +}
> +
> +/*
> + * Scan the hierarchy if needed to reclaim memory. We remember the last child
> + * we reclaimed from, so that we don't end up penalizing one child extensively
> + * based on its position in the children list.
> *
> * root_mem is the original ancestor that we've been reclaim from.
> + *
> + * We give up and return to the caller when we visit root_mem twice.
> + * (other groups can be removed while we're walking....)
> */
> static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> gfp_t gfp_mask, bool noswap)
> {
> - struct mem_cgroup *next_mem;
> - int ret = 0;
> -
> - /*
> - * Reclaim unconditionally and don't check for return value.
> - * We need to reclaim in the current group and down the tree.
> - * One might think about checking for children before reclaiming,
> - * but there might be left over accounting, even after children
> - * have left.
> - */
> - ret += try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
> - get_swappiness(root_mem));
> - if (mem_cgroup_check_under_limit(root_mem))
> - return 1; /* indicate reclaim has succeeded */
> - if (!root_mem->use_hierarchy)
> - return ret;
> -
> - next_mem = mem_cgroup_get_next_node(root_mem);
> -
> - while (next_mem != root_mem) {
> - if (mem_cgroup_is_obsolete(next_mem)) {
> - next_mem = mem_cgroup_get_next_node(root_mem);
> + struct mem_cgroup *victim;
> + int ret, total = 0;
> + int loop = 0;
> +
> + while (loop < 2) {
> + victim = mem_cgroup_select_victim(root_mem);
> + if (victim == root_mem)
> + loop++;
> + if (!mem_cgroup_local_usage(&victim->stat)) {
> + /* this cgroup's local usage == 0 */
> + css_put(&victim->css);
> continue;
> }
> - ret += try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
> - get_swappiness(next_mem));
> + /* we use swappiness of local cgroup */
> + ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap,
> + get_swappiness(victim));
> + css_put(&victim->css);
> + total += ret;
> if (mem_cgroup_check_under_limit(root_mem))
> - return 1; /* indicate reclaim has succeeded */
> - next_mem = mem_cgroup_get_next_node(root_mem);
> + return 1 + total;
> }
> - return ret;
> + return total;
> }
>
> bool mem_cgroup_oom_called(struct task_struct *task)
> @@ -1324,8 +1269,8 @@ __mem_cgroup_uncharge_common(struct page
> res_counter_uncharge(&mem->res, PAGE_SIZE);
> if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
> res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> -
> mem_cgroup_charge_statistics(mem, pc, false);
> +
> ClearPageCgroupUsed(pc);
> /*
> * pc->mem_cgroup is not cleared here. It will be accessed when it's
> @@ -2178,6 +2123,8 @@ static void __mem_cgroup_free(struct mem
> {
> int node;
>
> + free_css_id(&mem_cgroup_subsys, &mem->css);
> +
> for_each_node_state(node, N_POSSIBLE)
> free_mem_cgroup_per_zone_info(mem, node);
>
> @@ -2228,11 +2175,12 @@ static struct cgroup_subsys_state * __re
> mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> {
> struct mem_cgroup *mem, *parent;
> + long error = -ENOMEM;
> int node;
>
> mem = mem_cgroup_alloc();
> if (!mem)
> - return ERR_PTR(-ENOMEM);
> + return ERR_PTR(error);
>
> for_each_node_state(node, N_POSSIBLE)
> if (alloc_mem_cgroup_per_zone_info(mem, node))
> @@ -2260,7 +2208,7 @@ mem_cgroup_create(struct cgroup_subsys *
> res_counter_init(&mem->res, NULL);
> res_counter_init(&mem->memsw, NULL);
> }
> - mem->last_scanned_child = NULL;
> + mem->last_scanned_child = 0;
> spin_lock_init(&mem->reclaim_param_lock);
>
> if (parent)
> @@ -2269,7 +2217,7 @@ mem_cgroup_create(struct cgroup_subsys *
> return &mem->css;
> free_out:
> __mem_cgroup_free(mem);
> - return ERR_PTR(-ENOMEM);
> + return ERR_PTR(error);
> }
>
> static void mem_cgroup_pre_destroy(struct cgroup_subsys *ss,
> @@ -2283,12 +2231,7 @@ static void mem_cgroup_destroy(struct cg
> struct cgroup *cont)
> {
> struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> - struct mem_cgroup *last_scanned_child = mem->last_scanned_child;
>
> - if (last_scanned_child) {
> - VM_BUG_ON(!mem_cgroup_is_obsolete(last_scanned_child));
> - mem_cgroup_put(last_scanned_child);
> - }
> mem_cgroup_put(mem);
> }
>
> @@ -2327,6 +2270,7 @@ struct cgroup_subsys mem_cgroup_subsys =
> .populate = mem_cgroup_populate,
> .attach = mem_cgroup_move_task,
> .early_init = 0,
> + .use_id = 1,
> };
>
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
>
> > +static s64 mem_cgroup_local_usage(struct mem_cgroup_stat *stat)
> > +{
> > + s64 ret;
> > +
> It would be better to initialize it to 0.
>
> > + ret = mem_cgroup_read_stat(stat, MEM_CGROUP_STAT_CACHE);
> > + ret += mem_cgroup_read_stat(stat, MEM_CGROUP_STAT_RSS);
> > + return ret;
> > +}
> > +
Ah, ret is initialized by mem_cgroup_read_stat...
please ignore the above comment.
Thanks,
Daisuke Nishimura.
Daisuke Nishimura wrote:
> On Thu, 22 Jan 2009 18:35:57 +0900, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
>>
>> From: KAMEZAWA Hiroyuki <[email protected]>
>> Use css ID in memcg.
>>
>> Assign a CSS ID to each memcg and use css_get_next() for scanning
>> the hierarchy.
>>
>> Assume the following tree.
>>
>> group_A (ID=3)
>> /01 (ID=4)
>> /0A (ID=7)
>> /02 (ID=10)
>> group_B (ID=5)
>> and a task in group_A/01/0A hits the limit at group_A.
>>
>> reclaim will be done in the following order (round-robin).
>> group_A(3) -> group_A/01 (4) -> group_A/01/0A (7) -> group_A/02(10)
>> -> group_A -> .....
>>
>> Round robin by ID. The last visited cgroup is recorded, and the scan
>> restarts from it when reclaim starts again.
>> (A smarter algorithm could be implemented..)
>>
>> No cgroup_mutex or hierarchy_mutex is required.
>>
>> Changelog (v3) -> (v4)
>> - dropped css_is_populated() check
>> - removed scan_age and used simpler logic.
>>
> I think a check for mem_cgroup_local_usage is also added by this version
> :)
>
>> Changelog (v2) -> (v3)
>> - Added css_is_populated() check
>> - Adjusted to rc1 + Nishimura's fixes.
>> - Increased comments.
>>
>> Changelog (v1) -> (v2)
>> - Updated texts.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>>
>> ---
>> mm/memcontrol.c | 220
>> ++++++++++++++++++++------------------------------------
>> 1 file changed, 82 insertions(+), 138 deletions(-)
>>
>> Index: mmotm-2.6.29-Jan16/mm/memcontrol.c
>> ===================================================================
>> --- mmotm-2.6.29-Jan16.orig/mm/memcontrol.c
>> +++ mmotm-2.6.29-Jan16/mm/memcontrol.c
>> @@ -95,6 +95,15 @@ static s64 mem_cgroup_read_stat(struct m
>> return ret;
>> }
>>
>> +static s64 mem_cgroup_local_usage(struct mem_cgroup_stat *stat)
>> +{
>> + s64 ret;
>> +
> It would be better to initialize it to 0.
>
Hmm ? why ?
> Reviewed-by: Daisuke Nishimura <[email protected]>
>
> Thanks,
> Daisuke Nishimura.
>
Thanks,
-Kame
>> + ret = mem_cgroup_read_stat(stat, MEM_CGROUP_STAT_CACHE);
>> + ret += mem_cgroup_read_stat(stat, MEM_CGROUP_STAT_RSS);
>> + return ret;
>> +}
>> +
On Thu, 22 Jan 2009 18:40:18 +0900 KAMEZAWA Hiroyuki wrote:
> Documentation/cgroups/cgroups.txt | 6 +-
> include/linux/cgroup.h | 16 +-----
> kernel/cgroup.c | 97 ++++++++++++++++++++++++++++++++------
> mm/memcontrol.c | 5 +
> 4 files changed, 93 insertions(+), 31 deletions(-)
>
> Index: mmotm-2.6.29-Jan16/Documentation/cgroups/cgroups.txt
> ===================================================================
> --- mmotm-2.6.29-Jan16.orig/Documentation/cgroups/cgroups.txt
> +++ mmotm-2.6.29-Jan16/Documentation/cgroups/cgroups.txt
> @@ -478,11 +478,13 @@ cgroup->parent is still valid. (Note - c
> newly-created cgroup if an error occurs after this subsystem's
> create() method has been called for the new cgroup).
>
> -void pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp);
> +int pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp);
>
> Called before checking the reference count on each subsystem. This may
> be useful for subsystems which have some extra references even if
> -there are not tasks in the cgroup.
> +there are not tasks in the cgroup. If pre_destroy() returns error code,
> +rmdir() will fail with it. From this behavior, pre_destroy() can be
> +called plural times against a cgroup.
s/plural/multiple/ please.
>
> int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
> struct task_struct *task)
---
~Randy
On Mon, 26 Jan 2009 16:58:23 -0800
Randy Dunlap <[email protected]> wrote:
> On Thu, 22 Jan 2009 18:40:18 +0900 KAMEZAWA Hiroyuki wrote:
>
> > Documentation/cgroups/cgroups.txt | 6 +-
> > include/linux/cgroup.h | 16 +-----
> > kernel/cgroup.c | 97 ++++++++++++++++++++++++++++++++------
> > mm/memcontrol.c | 5 +
> > 4 files changed, 93 insertions(+), 31 deletions(-)
> >
> > Index: mmotm-2.6.29-Jan16/Documentation/cgroups/cgroups.txt
> > ===================================================================
> > --- mmotm-2.6.29-Jan16.orig/Documentation/cgroups/cgroups.txt
> > +++ mmotm-2.6.29-Jan16/Documentation/cgroups/cgroups.txt
> > @@ -478,11 +478,13 @@ cgroup->parent is still valid. (Note - c
> > newly-created cgroup if an error occurs after this subsystem's
> > create() method has been called for the new cgroup).
> >
> > -void pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp);
> > +int pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp);
> >
> > Called before checking the reference count on each subsystem. This may
> > be useful for subsystems which have some extra references even if
> > -there are not tasks in the cgroup.
> > +there are not tasks in the cgroup. If pre_destroy() returns error code,
> > +rmdir() will fail with it. From this behavior, pre_destroy() can be
> > +called plural times against a cgroup.
>
> s/plural/multiple/ please.
>
OK, thank you for the review.
-Kame
> >
> > int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
> > struct task_struct *task)
>
>
> ---
> ~Randy
>