I'm sorry that I couldn't work on this much this week.
Not many updates, but I think all the comments I received have been applied.
About the memcg part, I'll wait until all of Nishimura's fixes go ahead.
If the cgroup part looks good, please Ack it. I added a CC to Andrew Morton for that part.
Changes from the previous series:
- dropped a fix for OOM KILL (will reschedule)
- dropped a fix for -EBUSY (will reschedule)
- added css_is_populated()
- added hierarchy_stat patch
My known homework is:
- resize_limit should return -EBUSY. (Li Zefan reported.)
Andrew, I'll CC you on [1/4] and [2/4], but there are no explicit Acked-bys yet on any patches.
Thanks,
-Kame
From: KAMEZAWA Hiroyuki <[email protected]>
Patch for per-CSS (Cgroup Subsys State) IDs and private hierarchy code.
This patch attaches a unique ID to each css and provides the following:
- css_lookup(subsys, id)
  returns a pointer to the struct cgroup_subsys_state with that id.
- css_get_next(subsys, id, root, &foundid)
  returns the next css under "root" by scanning IDs.
When cgroup_subsys->use_id is set, an ID for each css is maintained.
The cgroup framework only prepares
- the css_id of the root css for the subsys.
- an ID is automatically attached at creation of a css.
- an ID is *not* freed automatically, because the cgroup framework
  doesn't know the lifetime of a cgroup_subsys_state.
  A free_css_id() function is provided; it must be called by the subsys.
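As an illustration only (not part of this patch), a hypothetical subsystem "foo"
could opt in and release its IDs roughly like this; the memcg patch later in this
series does the same thing:

/* Hypothetical example: a subsystem "foo" using CSS IDs. */
static void foo_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
{
	struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];

	/* The subsys owns the ID's lifetime: release it when the css dies. */
	free_css_id(ss, css);
	/* ... then free the css itself as the subsys normally would ... */
}

struct cgroup_subsys foo_subsys = {
	.name		= "foo",
	.destroy	= foo_destroy,
	.use_id		= 1,	/* IDs become available only after cgroup_init() */
};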
There are several reasons to develop this.
- Saving space. For example, memcg's swap_cgroup is an array of
  pointers to cgroups, but it does not need to be very fast.
  By replacing pointers (8 bytes per entry) with IDs (2 bytes per entry),
  we can reduce memory usage considerably.
- Scanning without locks.
  CSS ID provides a "scan IDs under this root" function. With it, scanning
  the css entries under a root can be written without locks.
ex)
	do {
		rcu_read_lock();
		next = css_get_next(subsys, id, root, &found);
		/* check sanity of next here */
		if (next && css_tryget(next)) {
			rcu_read_unlock();
			/* ... use next, then css_put(next) ... */
		} else
			rcu_read_unlock();
		id = found + 1;
	} while (next);
Characteristics:
- Each css has a unique ID under its subsys.
- The lifetime of an ID is controlled by the subsys.
- A css ID contains the "ID", the "depth in hierarchy", and the stack of
  ancestor IDs in the hierarchy.
- Allowed IDs are 1-65535; ID 0 is an unused ID.
Design Choices:
- scan-by-ID vs. scan-by-tree-walk.
  As with /proc's pid scan, scan-by-ID is robust when scanning is done
  by the following kind of routine:
  scan -> rest a while (release a lock) -> continue from where it was interrupted
  memcg's hierarchical reclaim does this.
- When subsys->use_id is set, the number of css entries in the system is
  limited to 65535.
Changelog: (v7) -> (v8)
- fixed a problem where free_css_id() could see NULL.
- fixed typo
Changelog: (v6) -> (v7)
- refcnt for CSS ID is removed. The subsys can do it with its own logic.
- New id allocation is done automatically.
- fixed typos.
- fixed limit check of ID.
Changelog: (v5) -> (v6)
- max depth is removed.
- changed arguments to "scan"
Changelog: (v4) -> (v5)
- Totally re-designed as per-css ID.
Changelog:(v3) -> (v4)
- updated comments.
- renamed hierarchy_code[] to stack[]
- merged prepare_id routines.
Changelog (v2) -> (v3)
- removed cgroup_id_getref().
- added cgroup_id_tryget().
Changelog (v1) -> (v2):
- Design change: show only ID(integer) to outside of cgroup.c
- moved cgroup ID definition from include/ to kernel/cgroup.c
- struct cgroup_id is freed by RCU.
- changed interface from pointer to "int"
- kill_sb() is handled.
- ID 0 as unused ID.
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Index: mmotm-2.6.29-Jan14/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.29-Jan14.orig/include/linux/cgroup.h
+++ mmotm-2.6.29-Jan14/include/linux/cgroup.h
@@ -15,6 +15,7 @@
#include <linux/cgroupstats.h>
#include <linux/prio_heap.h>
#include <linux/rwsem.h>
+#include <linux/idr.h>
#ifdef CONFIG_CGROUPS
@@ -22,6 +23,7 @@ struct cgroupfs_root;
struct cgroup_subsys;
struct inode;
struct cgroup;
+struct css_id;
extern int cgroup_init_early(void);
extern int cgroup_init(void);
@@ -59,6 +61,8 @@ struct cgroup_subsys_state {
atomic_t refcnt;
unsigned long flags;
+ /* ID for this css, if possible */
+ struct css_id *id;
};
/* bits in struct cgroup_subsys_state flags field */
@@ -363,6 +367,11 @@ struct cgroup_subsys {
int active;
int disabled;
int early_init;
+ /*
+ * True if this subsys uses ID. ID is not available before cgroup_init()
+ * (not available in early_init time.)
+ */
+ bool use_id;
#define MAX_CGROUP_TYPE_NAMELEN 32
const char *name;
@@ -384,6 +393,9 @@ struct cgroup_subsys {
*/
struct cgroupfs_root *root;
struct list_head sibling;
+ /* used when use_id == true */
+ struct idr idr;
+ spinlock_t id_lock;
};
#define SUBSYS(_x) extern struct cgroup_subsys _x ## _subsys;
@@ -437,6 +449,44 @@ void cgroup_iter_end(struct cgroup *cgrp
int cgroup_scan_tasks(struct cgroup_scanner *scan);
int cgroup_attach_task(struct cgroup *, struct task_struct *);
+/*
+ * CSS ID is ID for cgroup_subsys_state structs under subsys. This only works
+ * if cgroup_subsys.use_id == true. It can be used for looking up and scanning.
+ * CSS ID is assigned at cgroup allocation (create) automatically
+ * and removed when subsys calls free_css_id() function. This is because
+ * the lifetime of cgroup_subsys_state is subsys's matter.
+ *
+ * Looking up and scanning function should be called under rcu_read_lock().
+ * Taking cgroup_mutex()/hierarchy_mutex() is not necessary for following calls.
+ * But the css returned by this routine can be "not populated yet" or "being
+ * destroyed". The caller should check css and cgroup's status.
+ */
+
+/*
+ * Typically Called at ->destroy(), or somewhere the subsys frees
+ * cgroup_subsys_state.
+ */
+void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css);
+
+/* Find a cgroup_subsys_state which has given ID */
+
+struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id);
+
+/*
+ * Get a cgroup whose id is greater than or equal to id under tree of root.
+ * Returning a cgroup_subsys_state or NULL.
+ */
+struct cgroup_subsys_state *css_get_next(struct cgroup_subsys *ss, int id,
+ struct cgroup_subsys_state *root, int *foundid);
+
+/* Returns true if root is ancestor of cg */
+bool css_is_ancestor(struct cgroup_subsys_state *cg,
+ struct cgroup_subsys_state *root);
+
+/* Get id and depth of css */
+unsigned short css_id(struct cgroup_subsys_state *css);
+unsigned short css_depth(struct cgroup_subsys_state *css);
+
#else /* !CONFIG_CGROUPS */
static inline int cgroup_init_early(void) { return 0; }
Index: mmotm-2.6.29-Jan14/kernel/cgroup.c
===================================================================
--- mmotm-2.6.29-Jan14.orig/kernel/cgroup.c
+++ mmotm-2.6.29-Jan14/kernel/cgroup.c
@@ -185,6 +185,8 @@ struct cg_cgroup_link {
static struct css_set init_css_set;
static struct cg_cgroup_link init_css_set_link;
+static int cgroup_subsys_init_idr(struct cgroup_subsys *ss);
+
/* css_set_lock protects the list of css_set objects, and the
* chain of tasks off each css_set. Nests outside task->alloc_lock
* due to cgroup_iter_start() */
@@ -567,6 +569,9 @@ static struct backing_dev_info cgroup_ba
.capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK,
};
+static int alloc_css_id(struct cgroup_subsys *ss,
+ struct cgroup *parent, struct cgroup *child);
+
static struct inode *cgroup_new_inode(mode_t mode, struct super_block *sb)
{
struct inode *inode = new_inode(sb);
@@ -2335,6 +2340,7 @@ static void init_cgroup_css(struct cgrou
css->cgroup = cgrp;
atomic_set(&css->refcnt, 1);
css->flags = 0;
+ css->id = NULL;
if (cgrp == dummytop)
set_bit(CSS_ROOT, &css->flags);
BUG_ON(cgrp->subsys[ss->subsys_id]);
@@ -2410,6 +2416,10 @@ static long cgroup_create(struct cgroup
goto err_destroy;
}
init_cgroup_css(css, ss, cgrp);
+ if (ss->use_id)
+ if (alloc_css_id(ss, parent, cgrp))
+ goto err_destroy;
+ /* At error, ->destroy() callback has to free assigned ID. */
}
cgroup_lock_hierarchy(root);
@@ -2699,6 +2709,8 @@ int __init cgroup_init(void)
struct cgroup_subsys *ss = subsys[i];
if (!ss->early_init)
cgroup_init_subsys(ss);
+ if (ss->use_id)
+ cgroup_subsys_init_idr(ss);
}
/* Add init_css_set to the hash table */
@@ -3232,3 +3244,259 @@ static int __init cgroup_disable(char *s
return 1;
}
__setup("cgroup_disable=", cgroup_disable);
+
+/*
+ * CSS ID -- ID per subsys's Cgroup Subsys State(CSS).
+ */
+struct css_id {
+ /*
+ * The css to which this ID points. If cgroup is removed, this will
+ * be NULL. This pointer is expected to be RCU-safe because destroy()
+ * is called after synchronize_rcu(). But for safe use, css_is_removed()
+ * css_tryget() should be used for avoiding race.
+ */
+ struct cgroup_subsys_state *css;
+ /*
+ * ID of this css.
+ */
+ unsigned short id;
+ /*
+ * Depth in hierarchy which this ID belongs to.
+ */
+ unsigned short depth;
+ /*
+ * ID is freed by RCU. (and lookup routine is RCU safe.)
+ */
+ struct rcu_head rcu_head;
+ /*
+ * Hierarchy of CSS ID belongs to.
+ */
+ unsigned short stack[0]; /* Array of Length (depth+1) */
+};
+#define CSS_ID_MAX (65535)
+
+/*
+ * To get ID other than 0, this should be called when !cgroup_is_removed().
+ */
+unsigned short css_id(struct cgroup_subsys_state *css)
+{
+ struct css_id *cssid = rcu_dereference(css->id);
+
+ if (cssid)
+ return cssid->id;
+ return 0;
+}
+
+unsigned short css_depth(struct cgroup_subsys_state *css)
+{
+ struct css_id *cssid = rcu_dereference(css->id);
+
+ if (cssid)
+ return cssid->depth;
+ return 0;
+}
+
+bool css_is_ancestor(struct cgroup_subsys_state *child,
+ struct cgroup_subsys_state *root)
+{
+ struct css_id *child_id = rcu_dereference(child->id);
+ struct css_id *root_id = rcu_dereference(root->id);
+
+ if (!child_id || !root_id || (child_id->depth < root_id->depth))
+ return false;
+ return child_id->stack[root_id->depth] == root_id->id;
+}
+
+static void __free_css_id_cb(struct rcu_head *head)
+{
+ struct css_id *id;
+
+ id = container_of(head, struct css_id, rcu_head);
+ kfree(id);
+}
+
+void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css)
+{
+ struct css_id *id = css->id;
+ /* When this is called before css_id initialization, id can be NULL */
+ if (!id)
+ return;
+
+ BUG_ON(!ss->use_id);
+
+ rcu_assign_pointer(id->css, NULL);
+ rcu_assign_pointer(css->id, NULL);
+ spin_lock(&ss->id_lock);
+ idr_remove(&ss->idr, id->id);
+ spin_unlock(&ss->id_lock);
+ call_rcu(&id->rcu_head, __free_css_id_cb);
+}
+
+/*
+ * This is called by init or create(). Then, calls to this function are
+ * always serialized (By cgroup_mutex() at create()).
+ */
+
+static struct css_id *get_new_cssid(struct cgroup_subsys *ss, int depth)
+{
+ struct css_id *newid;
+ int myid, error, size;
+
+ BUG_ON(!ss->use_id);
+
+ size = sizeof(*newid) + sizeof(unsigned short) * (depth + 1);
+ newid = kzalloc(size, GFP_KERNEL);
+ if (!newid)
+ return ERR_PTR(-ENOMEM);
+ /* get id */
+ if (unlikely(!idr_pre_get(&ss->idr, GFP_KERNEL))) {
+ error = -ENOMEM;
+ goto err_out;
+ }
+ spin_lock(&ss->id_lock);
+ /* Don't use 0. allocates an ID of 1-65535 */
+ error = idr_get_new_above(&ss->idr, newid, 1, &myid);
+ spin_unlock(&ss->id_lock);
+
+ /* Returns error when there are no free spaces for new ID.*/
+ if (error) {
+ error = -ENOSPC;
+ goto err_out;
+ }
+ if (myid > CSS_ID_MAX)
+ goto remove_idr;
+
+ newid->id = myid;
+ newid->depth = depth;
+ return newid;
+remove_idr:
+ error = -ENOSPC;
+ spin_lock(&ss->id_lock);
+ idr_remove(&ss->idr, myid);
+ spin_unlock(&ss->id_lock);
+err_out:
+ kfree(newid);
+ return ERR_PTR(error);
+
+}
+
+static int __init cgroup_subsys_init_idr(struct cgroup_subsys *ss)
+{
+ struct css_id *newid;
+ struct cgroup_subsys_state *rootcss;
+
+ spin_lock_init(&ss->id_lock);
+ idr_init(&ss->idr);
+
+ rootcss = init_css_set.subsys[ss->subsys_id];
+ newid = get_new_cssid(ss, 0);
+ if (IS_ERR(newid))
+ return PTR_ERR(newid);
+
+ newid->stack[0] = newid->id;
+ newid->css = rootcss;
+ rootcss->id = newid;
+ return 0;
+}
+
+static int alloc_css_id(struct cgroup_subsys *ss, struct cgroup *parent,
+ struct cgroup *child)
+{
+ int subsys_id, i, depth = 0;
+ struct cgroup_subsys_state *parent_css, *child_css;
+ struct css_id *child_id, *parent_id = NULL;
+
+ subsys_id = ss->subsys_id;
+ parent_css = parent->subsys[subsys_id];
+ child_css = child->subsys[subsys_id];
+ depth = css_depth(parent_css) + 1;
+ parent_id = parent_css->id;
+
+ child_id = get_new_cssid(ss, depth);
+ if (IS_ERR(child_id))
+ return PTR_ERR(child_id);
+
+ for (i = 0; i < depth; i++)
+ child_id->stack[i] = parent_id->stack[i];
+ child_id->stack[depth] = child_id->id;
+
+ rcu_assign_pointer(child_id->css, child_css);
+ rcu_assign_pointer(child_css->id, child_id);
+
+ return 0;
+}
+
+/**
+ * css_lookup - lookup css by id
+ * @ss: cgroup subsys to be looked into.
+ * @id: the id
+ *
+ * Returns pointer to cgroup_subsys_state if there is valid one with id.
+ * NULL if not. Should be called under rcu_read_lock()
+ */
+struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id)
+{
+ struct css_id *cssid = NULL;
+
+ BUG_ON(!ss->use_id);
+ cssid = idr_find(&ss->idr, id);
+
+ if (unlikely(!cssid))
+ return NULL;
+
+ return rcu_dereference(cssid->css);
+}
+
+/**
+ * css_get_next - lookup next cgroup under specified hierarchy.
+ * @ss: pointer to subsystem
+ * @id: current position of iteration.
+ * @root: pointer to css. search tree under this.
+ * @foundid: position of found object.
+ *
+ * Search next css under the specified hierarchy of rootid. Calling under
+ * rcu_read_lock() is necessary. Returns NULL if it reaches the end.
+ */
+struct cgroup_subsys_state *
+css_get_next(struct cgroup_subsys *ss, int id,
+ struct cgroup_subsys_state *root, int *foundid)
+{
+ struct cgroup_subsys_state *ret = NULL;
+ struct css_id *tmp;
+ int tmpid;
+ int rootid = css_id(root);
+ int depth = css_depth(root);
+
+ if (!rootid)
+ return NULL;
+
+ BUG_ON(!ss->use_id);
+ rcu_read_lock();
+ /* fill start point for scan */
+ tmpid = id;
+ while (1) {
+ /*
+ * scan next entry from bitmap(tree), tmpid is updated after
+ * idr_get_next().
+ */
+ spin_lock(&ss->id_lock);
+ tmp = idr_get_next(&ss->idr, &tmpid);
+ spin_unlock(&ss->id_lock);
+
+ if (!tmp)
+ break;
+ if (tmp->depth >= depth && tmp->stack[depth] == rootid) {
+ ret = rcu_dereference(tmp->css);
+ if (ret) {
+ *foundid = tmpid;
+ break;
+ }
+ }
+ /* continue to scan from next id */
+ tmpid = tmpid + 1;
+ }
+
+ rcu_read_unlock();
+ return ret;
+}
+
Index: mmotm-2.6.29-Jan14/include/linux/idr.h
===================================================================
--- mmotm-2.6.29-Jan14.orig/include/linux/idr.h
+++ mmotm-2.6.29-Jan14/include/linux/idr.h
@@ -106,6 +106,7 @@ int idr_get_new(struct idr *idp, void *p
int idr_get_new_above(struct idr *idp, void *ptr, int starting_id, int *id);
int idr_for_each(struct idr *idp,
int (*fn)(int id, void *p, void *data), void *data);
+void *idr_get_next(struct idr *idp, int *nextid);
void *idr_replace(struct idr *idp, void *ptr, int id);
void idr_remove(struct idr *idp, int id);
void idr_remove_all(struct idr *idp);
Index: mmotm-2.6.29-Jan14/lib/idr.c
===================================================================
--- mmotm-2.6.29-Jan14.orig/lib/idr.c
+++ mmotm-2.6.29-Jan14/lib/idr.c
@@ -579,6 +579,52 @@ int idr_for_each(struct idr *idp,
EXPORT_SYMBOL(idr_for_each);
/**
+ * idr_get_next - lookup next object of id to given id.
+ * @idp: idr handle
+ * @id: pointer to lookup key
+ *
+ * Returns pointer to registered object with id, which is next number to
+ * given id.
+ */
+
+void *idr_get_next(struct idr *idp, int *nextidp)
+{
+ struct idr_layer *p, *pa[MAX_LEVEL];
+ struct idr_layer **paa = &pa[0];
+ int id = *nextidp;
+ int n, max;
+
+ /* find first ent */
+ n = idp->layers * IDR_BITS;
+ max = 1 << n;
+ p = rcu_dereference(idp->top);
+ if (!p)
+ return NULL;
+
+ while (id < max) {
+ while (n > 0 && p) {
+ n -= IDR_BITS;
+ *paa++ = p;
+ p = rcu_dereference(p->ary[(id >> n) & IDR_MASK]);
+ }
+
+ if (p) {
+ *nextidp = id;
+ return p;
+ }
+
+ id += 1 << n;
+ while (n < fls(id)) {
+ n += IDR_BITS;
+ p = *--paa;
+ }
+ }
+ return NULL;
+}
+
+
+
+/**
* idr_replace - replace pointer for given id
* @idp: idr handle
* @ptr: pointer you want associated with the id
From: KAMEZAWA Hiroyuki <[email protected]>
cgroup creation is done in several stages.
After a cgroup is allocated and linked into the cgroup hierarchy tree, all
necessary control files are created.
When scanning cgroups via CSS_ID without cgroup_lock(), the status of the
cgroup is important. At removal of a cgroup/css, css_tryget() works fine
and we can write safe code. At creation, we need some flag that indicates
"this cgroup is not ready yet".
This patch adds a CSS_POPULATED flag.
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Index: mmotm-2.6.29-Jan14/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.29-Jan14.orig/include/linux/cgroup.h
+++ mmotm-2.6.29-Jan14/include/linux/cgroup.h
@@ -69,6 +69,7 @@ struct cgroup_subsys_state {
enum {
CSS_ROOT, /* This CSS is the root of the subsystem */
CSS_REMOVED, /* This CSS is dead */
+ CSS_POPULATED, /* This CSS finished all initialization */
};
/*
@@ -90,6 +91,11 @@ static inline bool css_is_removed(struct
return test_bit(CSS_REMOVED, &css->flags);
}
+static inline bool css_is_populated(struct cgroup_subsys_state *css)
+{
+ return test_bit(CSS_POPULATED, &css->flags);
+}
+
/*
* Call css_tryget() to take a reference on a css if your existing
* (known-valid) reference isn't already ref-counted. Returns false if
Index: mmotm-2.6.29-Jan14/kernel/cgroup.c
===================================================================
--- mmotm-2.6.29-Jan14.orig/kernel/cgroup.c
+++ mmotm-2.6.29-Jan14/kernel/cgroup.c
@@ -2326,8 +2326,10 @@ static int cgroup_populate_dir(struct cg
}
for_each_subsys(cgrp->root, ss) {
+ struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
if (ss->populate && (err = ss->populate(ss, cgrp)) < 0)
return err;
+ set_bit(CSS_POPULATED, &css->flags);
}
return 0;
Balbir, I updated the comments for the reclaim mechanism. If they are still
unclear, please let me know.
==
From: KAMEZAWA Hiroyuki <[email protected]>
Use css ID in memcg.
Assign a CSS ID to each memcg and use css_get_next() for scanning the hierarchy.
Assume the following tree.
group_A (ID=3)
/01 (ID=4)
/0A (ID=7)
/02 (ID=10)
group_B (ID=5)
and a task in group_A/01/0A hits the limit at group_A.
Reclaim will be done in the following order (round-robin):
group_A(3) -> group_A/01 (4) -> group_A/01/0A (7) -> group_A/02(10)
-> group_A -> .....
Round-robin by ID. The last visited cgroup is recorded, and reclaim restarts
from it the next time it starts.
(A smarter algorithm can be implemented...)
No cgroup_mutex or hierarchy_mutex is required.
Changelog (v2) -> (v3)
- Added css_is_populated() check
- Adjusted to rc1 + Nishimura's fixes.
- Increased comments.
Changelog (v1) -> (v2)
- Updated texts.
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Index: mmotm-2.6.29-Jan14/mm/memcontrol.c
===================================================================
--- mmotm-2.6.29-Jan14.orig/mm/memcontrol.c
+++ mmotm-2.6.29-Jan14/mm/memcontrol.c
@@ -154,9 +154,13 @@ struct mem_cgroup {
/*
* While reclaiming in a hiearchy, we cache the last child we
- * reclaimed from. Protected by hierarchy_mutex
+ * reclaimed from. scan_age is incremented when this is the root
+ * of hierarchical reclaim and hierarchical reclaim visit this.
+ * When scan_age is updated by 2, exit loop and check we have to
+ * retry more. (see hierarchical reclaim codes.)
*/
- struct mem_cgroup *last_scanned_child;
+ int last_scanned_child;
+ unsigned long scan_age;
/*
* Should the accounting and control be hierarchical, per subtree?
*/
@@ -628,103 +632,6 @@ unsigned long mem_cgroup_isolate_pages(u
#define mem_cgroup_from_res_counter(counter, member) \
container_of(counter, struct mem_cgroup, member)
-/*
- * This routine finds the DFS walk successor. This routine should be
- * called with hierarchy_mutex held
- */
-static struct mem_cgroup *
-__mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
-{
- struct cgroup *cgroup, *curr_cgroup, *root_cgroup;
-
- curr_cgroup = curr->css.cgroup;
- root_cgroup = root_mem->css.cgroup;
-
- if (!list_empty(&curr_cgroup->children)) {
- /*
- * Walk down to children
- */
- cgroup = list_entry(curr_cgroup->children.next,
- struct cgroup, sibling);
- curr = mem_cgroup_from_cont(cgroup);
- goto done;
- }
-
-visit_parent:
- if (curr_cgroup == root_cgroup) {
- /* caller handles NULL case */
- curr = NULL;
- goto done;
- }
-
- /*
- * Goto next sibling
- */
- if (curr_cgroup->sibling.next != &curr_cgroup->parent->children) {
- cgroup = list_entry(curr_cgroup->sibling.next, struct cgroup,
- sibling);
- curr = mem_cgroup_from_cont(cgroup);
- goto done;
- }
-
- /*
- * Go up to next parent and next parent's sibling if need be
- */
- curr_cgroup = curr_cgroup->parent;
- goto visit_parent;
-
-done:
- return curr;
-}
-
-/*
- * Visit the first child (need not be the first child as per the ordering
- * of the cgroup list, since we track last_scanned_child) of @mem and use
- * that to reclaim free pages from.
- */
-static struct mem_cgroup *
-mem_cgroup_get_next_node(struct mem_cgroup *root_mem)
-{
- struct cgroup *cgroup;
- struct mem_cgroup *orig, *next;
- bool obsolete;
-
- /*
- * Scan all children under the mem_cgroup mem
- */
- mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
-
- orig = root_mem->last_scanned_child;
- obsolete = mem_cgroup_is_obsolete(orig);
-
- if (list_empty(&root_mem->css.cgroup->children)) {
- /*
- * root_mem might have children before and last_scanned_child
- * may point to one of them. We put it later.
- */
- if (orig)
- VM_BUG_ON(!obsolete);
- next = NULL;
- goto done;
- }
-
- if (!orig || obsolete) {
- cgroup = list_first_entry(&root_mem->css.cgroup->children,
- struct cgroup, sibling);
- next = mem_cgroup_from_cont(cgroup);
- } else
- next = __mem_cgroup_get_next_node(orig, root_mem);
-
-done:
- if (next)
- mem_cgroup_get(next);
- root_mem->last_scanned_child = next;
- if (orig)
- mem_cgroup_put(orig);
- mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
- return (next) ? next : root_mem;
-}
-
static bool mem_cgroup_check_under_limit(struct mem_cgroup *mem)
{
if (do_swap_account) {
@@ -754,46 +661,91 @@ static unsigned int get_swappiness(struc
}
/*
- * Dance down the hierarchy if needed to reclaim memory. We remember the
- * last child we reclaimed from, so that we don't end up penalizing
- * one child extensively based on its position in the children list.
+ * Visit the first child (need not be the first child as per the ordering
+ * of the cgroup list, since we track last_scanned_child) of @mem and use
+ * that to reclaim free pages from.
+ */
+static struct mem_cgroup *
+mem_cgroup_select_victim(struct mem_cgroup *root_mem)
+{
+ struct mem_cgroup *ret = NULL;
+ struct cgroup_subsys_state *css;
+ int nextid, found;
+
+ if (!root_mem->use_hierarchy) {
+ spin_lock(&root_mem->reclaim_param_lock);
+ root_mem->scan_age++;
+ spin_unlock(&root_mem->reclaim_param_lock);
+ css_get(&root_mem->css);
+ ret = root_mem;
+ }
+
+ while (!ret) {
+ rcu_read_lock();
+ nextid = root_mem->last_scanned_child + 1;
+ css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css,
+ &found);
+ if (css && css_is_populated(css) && css_tryget(css))
+ ret = container_of(css, struct mem_cgroup, css);
+
+ rcu_read_unlock();
+ /* Updates scanning parameter */
+ spin_lock(&root_mem->reclaim_param_lock);
+ if (!css) {
+ /* this means start scan from ID:1 */
+ root_mem->last_scanned_child = 0;
+ root_mem->scan_age++;
+ } else
+ root_mem->last_scanned_child = found;
+ spin_unlock(&root_mem->reclaim_param_lock);
+ }
+
+ return ret;
+}
+
+/*
+ * Scan the hierarchy if needed to reclaim memory. We remember the last child
+ * we reclaimed from, so that we don't end up penalizing one child extensively
+ * based on its position in the children list.
*
* root_mem is the original ancestor that we've been reclaim from.
+ *
+ * scan_age is updated every time when select_victim returns "root" and
+ * it's shared under system (per hierarchy root).
+ *
+ * We give up and return to the caller when scan_age is increased by 2. This
+ * means try_to_free_mem_cgroup_pages() is called against all children cgroup,
+ * at least once. The caller itself will do further retry if necessary.
*/
static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
gfp_t gfp_mask, bool noswap)
{
- struct mem_cgroup *next_mem;
- int ret = 0;
-
- /*
- * Reclaim unconditionally and don't check for return value.
- * We need to reclaim in the current group and down the tree.
- * One might think about checking for children before reclaiming,
- * but there might be left over accounting, even after children
- * have left.
- */
- ret += try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
- get_swappiness(root_mem));
- if (mem_cgroup_check_under_limit(root_mem))
- return 1; /* indicate reclaim has succeeded */
- if (!root_mem->use_hierarchy)
- return ret;
-
- next_mem = mem_cgroup_get_next_node(root_mem);
-
- while (next_mem != root_mem) {
- if (mem_cgroup_is_obsolete(next_mem)) {
- next_mem = mem_cgroup_get_next_node(root_mem);
- continue;
- }
- ret += try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
- get_swappiness(next_mem));
+ struct mem_cgroup *victim;
+ unsigned long start_age;
+ int ret, total = 0;
+ /*
+ * Reclaim memory from cgroups under root_mem in round robin.
+ */
+ start_age = root_mem->scan_age;
+ /*
+ * Assume a scan starting from somewhere 1,2,3,4,..
+ * ...->1->2->3->4->1->2->3->4->1->2->3->4->.....
+ * check that "1" is visited twice is enough for checking whether
+ * all IDs are scanned. So, here, checking scan_age is updated by 2.
+ * This scan_age is not time, but just a counter. time_after() is
+ * a useful to check this kind of counters.
+ */
+ while (time_after((start_age + 2UL), root_mem->scan_age)) {
+ victim = mem_cgroup_select_victim(root_mem);
+ /* we use swappiness of local cgroup */
+ ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap,
+ get_swappiness(victim));
+ css_put(&victim->css);
+ total += ret;
if (mem_cgroup_check_under_limit(root_mem))
- return 1; /* indicate reclaim has succeeded */
- next_mem = mem_cgroup_get_next_node(root_mem);
+ return 1 + total;
}
- return ret;
+ return total;
}
bool mem_cgroup_oom_called(struct task_struct *task)
@@ -1319,7 +1271,6 @@ __mem_cgroup_uncharge_common(struct page
default:
break;
}
-
res_counter_uncharge(&mem->res, PAGE_SIZE);
if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
res_counter_uncharge(&mem->memsw, PAGE_SIZE);
@@ -2177,6 +2128,8 @@ static void __mem_cgroup_free(struct mem
{
int node;
+ free_css_id(&mem_cgroup_subsys, &mem->css);
+
for_each_node_state(node, N_POSSIBLE)
free_mem_cgroup_per_zone_info(mem, node);
@@ -2214,11 +2167,12 @@ static struct cgroup_subsys_state * __re
mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
{
struct mem_cgroup *mem, *parent;
+ long error = -ENOMEM;
int node;
mem = mem_cgroup_alloc();
if (!mem)
- return ERR_PTR(-ENOMEM);
+ return ERR_PTR(error);
for_each_node_state(node, N_POSSIBLE)
if (alloc_mem_cgroup_per_zone_info(mem, node))
@@ -2239,7 +2193,8 @@ mem_cgroup_create(struct cgroup_subsys *
res_counter_init(&mem->res, NULL);
res_counter_init(&mem->memsw, NULL);
}
- mem->last_scanned_child = NULL;
+ mem->last_scanned_child = 0;
+ mem->scan_age = 0;
spin_lock_init(&mem->reclaim_param_lock);
if (parent)
@@ -2248,7 +2203,7 @@ mem_cgroup_create(struct cgroup_subsys *
return &mem->css;
free_out:
__mem_cgroup_free(mem);
- return ERR_PTR(-ENOMEM);
+ return ERR_PTR(error);
}
static void mem_cgroup_pre_destroy(struct cgroup_subsys *ss,
@@ -2262,12 +2217,7 @@ static void mem_cgroup_destroy(struct cg
struct cgroup *cont)
{
struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
- struct mem_cgroup *last_scanned_child = mem->last_scanned_child;
- if (last_scanned_child) {
- VM_BUG_ON(!mem_cgroup_is_obsolete(last_scanned_child));
- mem_cgroup_put(last_scanned_child);
- }
mem_cgroup_put(mem);
}
@@ -2306,6 +2256,7 @@ struct cgroup_subsys mem_cgroup_subsys =
.populate = mem_cgroup_populate,
.attach = mem_cgroup_move_task,
.early_init = 0,
+ .use_id = 1,
};
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
This is a demonstration of scanning by CSS ID.
Is the word "hierarchical" better than "total"?
==
From: KAMEZAWA Hiroyuki <[email protected]>
Clean up the memory.stat file routine and show "total" hierarchical stats.
This patch does the following:
- renames get_all_zonestat to get_local_zonestat.
- removes the old mem_cgroup_stat_desc, which was only for per-cpu stats.
- adds mcs_stat to cover both per-cpu and per-lru stats.
- adds a "total" stat for the hierarchy (*)
- adds a callback system to scan all memcgs under a root.
== "total_xxx" entries are added.
[kamezawa@localhost ~]$ cat /opt/cgroup/xxx/memory.stat
cache 0
rss 0
pgpgin 0
pgpgout 0
inactive_anon 0
active_anon 0
inactive_file 0
active_file 0
unevictable 0
hierarchical_memory_limit 50331648
hierarchical_memsw_limit 9223372036854775807
total_cache 65536
total_rss 192512
total_pgpgin 218
total_pgpgout 155
total_inactive_anon 0
total_active_anon 135168
total_inactive_file 61440
total_active_file 4096
total_unevictable 0
==
(*) Maybe the user could calculate the hierarchical stat with his own program
in userland, but if it can be written in a clean way, I think it's worth
showing here.
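As a reference only, a rough userland sketch of such a calculation (not part of
the patch; the mount point "/opt/cgroup/xxx" and the "rss" key are just examples),
summing one field of memory.stat over a cgroup subtree:

/* Userland sketch: sum the "rss" line of memory.stat over a cgroup subtree. */
#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <limits.h>

static long long sum_rss(const char *dir)
{
	char path[PATH_MAX], key[64];
	long long val, total = 0;
	struct dirent *de;
	FILE *f;
	DIR *d;

	snprintf(path, sizeof(path), "%s/memory.stat", dir);
	f = fopen(path, "r");
	if (f) {
		while (fscanf(f, "%63s %lld", key, &val) == 2)
			if (!strcmp(key, "rss"))
				total += val;
		fclose(f);
	}
	d = opendir(dir);
	if (!d)
		return total;
	while ((de = readdir(d)) != NULL) {
		if (de->d_name[0] == '.')
			continue;
		snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
		total += sum_rss(path);	/* child cgroups are subdirectories */
	}
	closedir(d);
	return total;
}

int main(void)
{
	printf("total_rss %lld\n", sum_rss("/opt/cgroup/xxx"));
	return 0;
}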
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Index: mmotm-2.6.29-Jan14/mm/memcontrol.c
===================================================================
--- mmotm-2.6.29-Jan14.orig/mm/memcontrol.c
+++ mmotm-2.6.29-Jan14/mm/memcontrol.c
@@ -250,7 +250,7 @@ page_cgroup_zoneinfo(struct page_cgroup
return mem_cgroup_zoneinfo(mem, nid, zid);
}
-static unsigned long mem_cgroup_get_all_zonestat(struct mem_cgroup *mem,
+static unsigned long mem_cgroup_get_local_zonestat(struct mem_cgroup *mem,
enum lru_list idx)
{
int nid, zid;
@@ -504,8 +504,8 @@ static int calc_inactive_ratio(struct me
unsigned long gb;
unsigned long inactive_ratio;
- inactive = mem_cgroup_get_all_zonestat(memcg, LRU_INACTIVE_ANON);
- active = mem_cgroup_get_all_zonestat(memcg, LRU_ACTIVE_ANON);
+ inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_ANON);
+ active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_ANON);
gb = (inactive + active) >> (30 - PAGE_SHIFT);
if (gb)
@@ -661,6 +661,41 @@ static unsigned int get_swappiness(struc
}
/*
+ * Call callback function against all cgroup under hierarchy tree.
+ */
+static int mem_cgroup_walk_tree(struct mem_cgroup *root, void *data,
+ int (*func)(struct mem_cgroup *, void *))
+{
+ int found, ret, nextid;
+ struct cgroup_subsys_state *css;
+ struct mem_cgroup *mem;
+
+ if (!root->use_hierarchy)
+ return (*func)(root, data);
+
+ nextid = 1;
+ do {
+ ret = 0;
+ mem = NULL;
+
+ rcu_read_lock();
+ css = css_get_next(&mem_cgroup_subsys, nextid, &root->css,
+ &found);
+ if (css && css_is_populated(css) && css_tryget(css))
+ mem = container_of(css, struct mem_cgroup, css);
+ rcu_read_unlock();
+
+ if (mem) {
+ ret = (*func)(mem, data);
+ css_put(&mem->css);
+ }
+ nextid = found + 1;
+ } while (!ret && css);
+
+ return ret;
+}
+
+/*
* Visit the first child (need not be the first child as per the ordering
* of the cgroup list, since we track last_scanned_child) of @mem and use
* that to reclaim free pages from.
@@ -1843,54 +1878,90 @@ static int mem_cgroup_reset(struct cgrou
return 0;
}
-static const struct mem_cgroup_stat_desc {
- const char *msg;
- u64 unit;
-} mem_cgroup_stat_desc[] = {
- [MEM_CGROUP_STAT_CACHE] = { "cache", PAGE_SIZE, },
- [MEM_CGROUP_STAT_RSS] = { "rss", PAGE_SIZE, },
- [MEM_CGROUP_STAT_PGPGIN_COUNT] = {"pgpgin", 1, },
- [MEM_CGROUP_STAT_PGPGOUT_COUNT] = {"pgpgout", 1, },
+
+/* For read statistics */
+enum {
+ MCS_CACHE,
+ MCS_RSS,
+ MCS_PGPGIN,
+ MCS_PGPGOUT,
+ MCS_INACTIVE_ANON,
+ MCS_ACTIVE_ANON,
+ MCS_INACTIVE_FILE,
+ MCS_ACTIVE_FILE,
+ MCS_UNEVICTABLE,
+ NR_MCS_STAT,
+};
+
+struct mcs_total_stat {
+ s64 stat[NR_MCS_STAT];
+};
+
+struct {
+ char *local_name;
+ char *total_name;
+} memcg_stat_strings[NR_MCS_STAT] = {
+ {"cache", "total_cache"},
+ {"rss", "total_rss"},
+ {"pgpgin", "total_pgpgin"},
+ {"pgpgout", "total_pgpgout"},
+ {"inactive_anon", "total_inactive_anon"},
+ {"active_anon", "total_active_anon"},
+ {"inactive_file", "total_inactive_file"},
+ {"active_file", "total_active_file"},
+ {"unevictable", "total_unevictable"}
};
+
+static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
+{
+ struct mcs_total_stat *s = data;
+ s64 val;
+
+ /* per cpu stat */
+ val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_CACHE);
+ s->stat[MCS_CACHE] += val * PAGE_SIZE;
+ val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
+ s->stat[MCS_RSS] += val * PAGE_SIZE;
+ val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_PGPGIN_COUNT);
+ s->stat[MCS_PGPGIN] += val;
+ val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_PGPGOUT_COUNT);
+ s->stat[MCS_PGPGOUT] += val;
+
+ /* per zone stat */
+ val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
+ s->stat[MCS_INACTIVE_ANON] += val * PAGE_SIZE;
+ val = mem_cgroup_get_local_zonestat(mem, LRU_ACTIVE_ANON);
+ s->stat[MCS_ACTIVE_ANON] += val * PAGE_SIZE;
+ val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_FILE);
+ s->stat[MCS_INACTIVE_FILE] += val * PAGE_SIZE;
+ val = mem_cgroup_get_local_zonestat(mem, LRU_ACTIVE_FILE);
+ s->stat[MCS_ACTIVE_FILE] += val * PAGE_SIZE;
+ val = mem_cgroup_get_local_zonestat(mem, LRU_UNEVICTABLE);
+ s->stat[MCS_UNEVICTABLE] += val * PAGE_SIZE;
+ return 0;
+}
+
+static void
+mem_cgroup_get_total_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
+{
+ mem_cgroup_walk_tree(mem, s, mem_cgroup_get_local_stat);
+}
+
static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft,
struct cgroup_map_cb *cb)
{
struct mem_cgroup *mem_cont = mem_cgroup_from_cont(cont);
- struct mem_cgroup_stat *stat = &mem_cont->stat;
+ struct mcs_total_stat mystat;
int i;
- for (i = 0; i < ARRAY_SIZE(stat->cpustat[0].count); i++) {
- s64 val;
+ memset(&mystat, 0, sizeof(mystat));
+ mem_cgroup_get_local_stat(mem_cont, &mystat);
- val = mem_cgroup_read_stat(stat, i);
- val *= mem_cgroup_stat_desc[i].unit;
- cb->fill(cb, mem_cgroup_stat_desc[i].msg, val);
- }
- /* showing # of active pages */
- {
- unsigned long active_anon, inactive_anon;
- unsigned long active_file, inactive_file;
- unsigned long unevictable;
-
- inactive_anon = mem_cgroup_get_all_zonestat(mem_cont,
- LRU_INACTIVE_ANON);
- active_anon = mem_cgroup_get_all_zonestat(mem_cont,
- LRU_ACTIVE_ANON);
- inactive_file = mem_cgroup_get_all_zonestat(mem_cont,
- LRU_INACTIVE_FILE);
- active_file = mem_cgroup_get_all_zonestat(mem_cont,
- LRU_ACTIVE_FILE);
- unevictable = mem_cgroup_get_all_zonestat(mem_cont,
- LRU_UNEVICTABLE);
-
- cb->fill(cb, "active_anon", (active_anon) * PAGE_SIZE);
- cb->fill(cb, "inactive_anon", (inactive_anon) * PAGE_SIZE);
- cb->fill(cb, "active_file", (active_file) * PAGE_SIZE);
- cb->fill(cb, "inactive_file", (inactive_file) * PAGE_SIZE);
- cb->fill(cb, "unevictable", unevictable * PAGE_SIZE);
+ for (i = 0; i < NR_MCS_STAT; i++)
+ cb->fill(cb, memcg_stat_strings[i].local_name, mystat.stat[i]);
- }
+ /* Hierarchical information */
{
unsigned long long limit, memsw_limit;
memcg_get_hierarchical_limit(mem_cont, &limit, &memsw_limit);
@@ -1899,6 +1970,12 @@ static int mem_control_stat_show(struct
cb->fill(cb, "hierarchical_memsw_limit", memsw_limit);
}
+ memset(&mystat, 0, sizeof(mystat));
+ mem_cgroup_get_total_stat(mem_cont, &mystat);
+ for (i = 0; i < NR_MCS_STAT; i++)
+ cb->fill(cb, memcg_stat_strings[i].total_name, mystat.stat[i]);
+
+
#ifdef CONFIG_DEBUG_VM
cb->fill(cb, "inactive_ratio", calc_inactive_ratio(mem_cont, NULL));
* KAMEZAWA Hiroyuki <[email protected]> [2009-01-15 19:21:20]:
>
> I'm sorry that I couldn't work so much, this week.
> No much updates but I think all comments I got are applied.
>
> About memcg part, I'll wait for that all Nishimura's fixes go ahead.
> If cgroup part looks good, please Ack. I added CC to Andrew Morton for that part.
>
> changes from previous series
> - dropeed a fix to OOM KILL (will reschedule)
> - dropped a fix to -EBUSY (will reschedule)
> - added css_is_populated()
> - added hierarchy_stat patch
>
> Known my homework is
> - resize_limit should return -EBUSY. (Li Zefan reported.)
>
> Andrew, I'll CC: you [1/4] and [2/4]. But no explicit Acked-by yet to any patches.
>
Kamezawa-San, like you've suggested earlier, I think it is important
to split up the fixes from the development patches. I wonder if we
should start marking all patches with BUGFIX for bug fixes, so that we
can prioritize those first.
--
Balbir
On Thu, 15 Jan 2009 16:21:16 +0530
Balbir Singh <[email protected]> wrote:
> * KAMEZAWA Hiroyuki <[email protected]> [2009-01-15 19:21:20]:
>
> >
> > I'm sorry that I couldn't work so much, this week.
> > No much updates but I think all comments I got are applied.
> >
> > About memcg part, I'll wait for that all Nishimura's fixes go ahead.
> > If cgroup part looks good, please Ack. I added CC to Andrew Morton for that part.
> >
> > changes from previous series
> > - dropeed a fix to OOM KILL (will reschedule)
> > - dropped a fix to -EBUSY (will reschedule)
> > - added css_is_populated()
> > - added hierarchy_stat patch
> >
> > Known my homework is
> > - resize_limit should return -EBUSY. (Li Zefan reported.)
> >
> > Andrew, I'll CC: you [1/4] and [2/4]. But no explicit Acked-by yet to any patches.
> >
>
> Kamezawa-San, like you've suggested earlier, I think it is important
> to split up the fixes from the development patches. I wonder if we
> should start marking all patches with BUGFIX for bug fixes, so that we
> can prioritize those first.
>
Ah yes. I'll do that from the next post.
Patches [1/4] and [2/4] don't modify memcg at all, so I added Andrew to CC.
It's a new feature for cgroup.
For the memcg part, of course, I'll wait for all of Nishimura's fixes to go to -rc.
Patch [3/4] will remove most of that crazy maze of hierarchical reclaim.
Patch [4/4] is written as a demonstration, but the output seems attractive.
Thanks,
-Kame
KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Patch for Per-CSS(Cgroup Subsys State) ID and private hierarchy code.
>
> This patch attaches unique ID to each css and provides following.
>
...
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Li Zefan <[email protected]>
> /*
> - * Dance down the hierarchy if needed to reclaim memory. We remember the
> - * last child we reclaimed from, so that we don't end up penalizing
> - * one child extensively based on its position in the children list.
> + * Visit the first child (need not be the first child as per the ordering
> + * of the cgroup list, since we track last_scanned_child) of @mem and use
> + * that to reclaim free pages from.
> + */
> +static struct mem_cgroup *
> +mem_cgroup_select_victim(struct mem_cgroup *root_mem)
> +{
> + struct mem_cgroup *ret = NULL;
> + struct cgroup_subsys_state *css;
> + int nextid, found;
> +
> + if (!root_mem->use_hierarchy) {
> + spin_lock(&root_mem->reclaim_param_lock);
> + root_mem->scan_age++;
> + spin_unlock(&root_mem->reclaim_param_lock);
> + css_get(&root_mem->css);
> + ret = root_mem;
> + }
> +
> + while (!ret) {
> + rcu_read_lock();
> + nextid = root_mem->last_scanned_child + 1;
> + css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css,
> + &found);
> + if (css && css_is_populated(css) && css_tryget(css))
I don't see why you need to check css_is_populated(css) ?
> + ret = container_of(css, struct mem_cgroup, css);
> +
> + rcu_read_unlock();
> + /* Updates scanning parameter */
> + spin_lock(&root_mem->reclaim_param_lock);
> + if (!css) {
> + /* this means start scan from ID:1 */
> + root_mem->last_scanned_child = 0;
> + root_mem->scan_age++;
> + } else
> + root_mem->last_scanned_child = found;
> + spin_unlock(&root_mem->reclaim_param_lock);
> + }
> +
> + return ret;
> +}
> +
On Fri, 16 Jan 2009 09:29:48 +0800
Li Zefan <[email protected]> wrote:
> > /*
> > - * Dance down the hierarchy if needed to reclaim memory. We remember the
> > - * last child we reclaimed from, so that we don't end up penalizing
> > - * one child extensively based on its position in the children list.
> > + * Visit the first child (need not be the first child as per the ordering
> > + * of the cgroup list, since we track last_scanned_child) of @mem and use
> > + * that to reclaim free pages from.
> > + */
> > +static struct mem_cgroup *
> > +mem_cgroup_select_victim(struct mem_cgroup *root_mem)
> > +{
> > + struct mem_cgroup *ret = NULL;
> > + struct cgroup_subsys_state *css;
> > + int nextid, found;
> > +
> > + if (!root_mem->use_hierarchy) {
> > + spin_lock(&root_mem->reclaim_param_lock);
> > + root_mem->scan_age++;
> > + spin_unlock(&root_mem->reclaim_param_lock);
> > + css_get(&root_mem->css);
> > + ret = root_mem;
> > + }
> > +
> > + while (!ret) {
> > + rcu_read_lock();
> > + nextid = root_mem->last_scanned_child + 1;
> > + css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css,
> > + &found);
> > + if (css && css_is_populated(css) && css_tryget(css))
>
> I don't see why you need to check css_is_populated(css) ?
>
The main reason is sanity: I don't like to hold css->refcnt of a not-yet-populated css.
The second reason is to avoid unnecessary calls to try_to_free_pages(); it's heavy.
I should also add a mem->res.usage == 0 case for skipping, but haven't yet.
Thanks,
-Kame
> > + ret = container_of(css, struct mem_cgroup, css);
> > +
> > + rcu_read_unlock();
> > + /* Updates scanning parameter */
> > + spin_lock(&root_mem->reclaim_param_lock);
> > + if (!css) {
> > + /* this means start scan from ID:1 */
> > + root_mem->last_scanned_child = 0;
> > + root_mem->scan_age++;
> > + } else
> > + root_mem->last_scanned_child = found;
> > + spin_unlock(&root_mem->reclaim_param_lock);
> > + }
> > +
> > + return ret;
> > +}
> > +
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>
KAMEZAWA Hiroyuki wrote:
> On Fri, 16 Jan 2009 09:29:48 +0800
> Li Zefan <[email protected]> wrote:
>
>>> /*
>>> - * Dance down the hierarchy if needed to reclaim memory. We remember the
>>> - * last child we reclaimed from, so that we don't end up penalizing
>>> - * one child extensively based on its position in the children list.
>>> + * Visit the first child (need not be the first child as per the ordering
>>> + * of the cgroup list, since we track last_scanned_child) of @mem and use
>>> + * that to reclaim free pages from.
>>> + */
>>> +static struct mem_cgroup *
>>> +mem_cgroup_select_victim(struct mem_cgroup *root_mem)
>>> +{
>>> + struct mem_cgroup *ret = NULL;
>>> + struct cgroup_subsys_state *css;
>>> + int nextid, found;
>>> +
>>> + if (!root_mem->use_hierarchy) {
>>> + spin_lock(&root_mem->reclaim_param_lock);
>>> + root_mem->scan_age++;
>>> + spin_unlock(&root_mem->reclaim_param_lock);
>>> + css_get(&root_mem->css);
>>> + ret = root_mem;
>>> + }
>>> +
>>> + while (!ret) {
>>> + rcu_read_lock();
>>> + nextid = root_mem->last_scanned_child + 1;
>>> + css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css,
>>> + &found);
>>> + if (css && css_is_populated(css) && css_tryget(css))
>> I don't see why you need to check css_is_populated(css) ?
>>
>
> Main reason is for sanity. I don't like to hold css->refcnt of not populated css.
I think this is a rare case. It's just a very short period when a cgroup is
being created but not yet fully created.
> Second reason is for avoinding unnecessary calls to try_to_free_pages(),
> it's heavy. I should also add mem->res.usage == 0 case for skipping but not yet.
>
And if mem->res.usage == 0 is checked, css_is_populated() is just redundant.
> THanks,
> -Kame
>
>>> + ret = container_of(css, struct mem_cgroup, css);
>>> +
>>> + rcu_read_unlock();
>>> + /* Updates scanning parameter */
>>> + spin_lock(&root_mem->reclaim_param_lock);
>>> + if (!css) {
>>> + /* this means start scan from ID:1 */
>>> + root_mem->last_scanned_child = 0;
>>> + root_mem->scan_age++;
>>> + } else
>>> + root_mem->last_scanned_child = found;
>>> + spin_unlock(&root_mem->reclaim_param_lock);
>>> + }
>>> +
>>> + return ret;
>>> +}
>>> +
On Fri, 16 Jan 2009 09:49:05 +0800
Li Zefan <[email protected]> wrote:
> KAMEZAWA Hiroyuki wrote:
> > On Fri, 16 Jan 2009 09:29:48 +0800
> > Li Zefan <[email protected]> wrote:
> >
> >>> /*
> >>> - * Dance down the hierarchy if needed to reclaim memory. We remember the
> >>> - * last child we reclaimed from, so that we don't end up penalizing
> >>> - * one child extensively based on its position in the children list.
> >>> + * Visit the first child (need not be the first child as per the ordering
> >>> + * of the cgroup list, since we track last_scanned_child) of @mem and use
> >>> + * that to reclaim free pages from.
> >>> + */
> >>> +static struct mem_cgroup *
> >>> +mem_cgroup_select_victim(struct mem_cgroup *root_mem)
> >>> +{
> >>> + struct mem_cgroup *ret = NULL;
> >>> + struct cgroup_subsys_state *css;
> >>> + int nextid, found;
> >>> +
> >>> + if (!root_mem->use_hierarchy) {
> >>> + spin_lock(&root_mem->reclaim_param_lock);
> >>> + root_mem->scan_age++;
> >>> + spin_unlock(&root_mem->reclaim_param_lock);
> >>> + css_get(&root_mem->css);
> >>> + ret = root_mem;
> >>> + }
> >>> +
> >>> + while (!ret) {
> >>> + rcu_read_lock();
> >>> + nextid = root_mem->last_scanned_child + 1;
> >>> + css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css,
> >>> + &found);
> >>> + if (css && css_is_populated(css) && css_tryget(css))
> >> I don't see why you need to check css_is_populated(css) ?
> >>
> >
> > Main reason is for sanity. I don't like to hold css->refcnt of not populated css.
>
> I think this is a rare case. It's just a very short period when a cgroup is
> being created but not yet fully created.
>
I don't think so. When a cgroup is mounted with several subsystems, create()
and populate() are called several times. So memory allocation can occur between
create() and populate(), and it can call try_to_free_pages() (on the global LRU).
More than that, if CONFIG_PREEMPT=y, a "short" race doesn't mean it's safe.
Thanks,
-Kame
On Fri, 16 Jan 2009 09:49:05 +0800
Li Zefan <[email protected]> wrote:
> KAMEZAWA Hiroyuki wrote:
> > On Fri, 16 Jan 2009 09:29:48 +0800
> > Li Zefan <[email protected]> wrote:
> >
> >>> /*
> >>> - * Dance down the hierarchy if needed to reclaim memory. We remember the
> >>> - * last child we reclaimed from, so that we don't end up penalizing
> >>> - * one child extensively based on its position in the children list.
> >>> + * Visit the first child (need not be the first child as per the ordering
> >>> + * of the cgroup list, since we track last_scanned_child) of @mem and use
> >>> + * that to reclaim free pages from.
> >>> + */
> >>> +static struct mem_cgroup *
> >>> +mem_cgroup_select_victim(struct mem_cgroup *root_mem)
> >>> +{
> >>> + struct mem_cgroup *ret = NULL;
> >>> + struct cgroup_subsys_state *css;
> >>> + int nextid, found;
> >>> +
> >>> + if (!root_mem->use_hierarchy) {
> >>> + spin_lock(&root_mem->reclaim_param_lock);
> >>> + root_mem->scan_age++;
> >>> + spin_unlock(&root_mem->reclaim_param_lock);
> >>> + css_get(&root_mem->css);
> >>> + ret = root_mem;
> >>> + }
> >>> +
> >>> + while (!ret) {
> >>> + rcu_read_lock();
> >>> + nextid = root_mem->last_scanned_child + 1;
> >>> + css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css,
> >>> + &found);
> >>> + if (css && css_is_populated(css) && css_tryget(css))
> >> I don't see why you need to check css_is_populated(css) ?
> >>
> >
> > Main reason is for sanity. I don't like to hold css->refcnt of not populated css.
>
> I think this is a rare case. It's just a very short period when a cgroup is
> being created but not yet fully created.
>
> > Second reason is for avoinding unnecessary calls to try_to_free_pages(),
> > it's heavy. I should also add mem->res.usage == 0 case for skipping but not yet.
> >
>
> And if mem->res.usage == 0 is checked, css_is_popuated() is just redundant.
>
Hmm ? Can I check mem->res.usage before css_tryget() ?
-Kame
> > THanks,
> > -Kame
> >
> >>> + ret = container_of(css, struct mem_cgroup, css);
> >>> +
> >>> + rcu_read_unlock();
> >>> + /* Updates scanning parameter */
> >>> + spin_lock(&root_mem->reclaim_param_lock);
> >>> + if (!css) {
> >>> + /* this means start scan from ID:1 */
> >>> + root_mem->last_scanned_child = 0;
> >>> + root_mem->scan_age++;
> >>> + } else
> >>> + root_mem->last_scanned_child = found;
> >>> + spin_unlock(&root_mem->reclaim_param_lock);
> >>> + }
> >>> +
> >>> + return ret;
> >>> +}
> >>> +
>
Li-san, if you don't like this, could you give me an idea for
"how to check whether a cgroup is fully ready or not"?
BTW, why is "we have a half filled directory - oh well" allowed....
Thanks,
-Kame
On Thu, 15 Jan 2009 19:27:12 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> cgroup creation is done in several stages.
> After allocated and linked to cgroup's hierarchy tree, all necessary
> control files are created.
>
> When using CSS_ID, scanning cgroups without cgrouo_lock(), status
> of cgroup is important. At removal of cgroup/css, css_tryget() works fine
> and we can write a safe code. At creation, we need some flag to show
> "This cgroup is not ready yet"
>
> This patch adds CSS_POPULATED flag.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> ---
> Index: mmotm-2.6.29-Jan14/include/linux/cgroup.h
> ===================================================================
> --- mmotm-2.6.29-Jan14.orig/include/linux/cgroup.h
> +++ mmotm-2.6.29-Jan14/include/linux/cgroup.h
> @@ -69,6 +69,7 @@ struct cgroup_subsys_state {
> enum {
> CSS_ROOT, /* This CSS is the root of the subsystem */
> CSS_REMOVED, /* This CSS is dead */
> + CSS_POPULATED, /* This CSS finished all initialization */
> };
>
> /*
> @@ -90,6 +91,11 @@ static inline bool css_is_removed(struct
> return test_bit(CSS_REMOVED, &css->flags);
> }
>
> +static inline bool css_is_populated(struct cgroup_subsys_state *css)
> +{
> + return test_bit(CSS_POPULATED, &css->flags);
> +}
> +
> /*
> * Call css_tryget() to take a reference on a css if your existing
> * (known-valid) reference isn't already ref-counted. Returns false if
> Index: mmotm-2.6.29-Jan14/kernel/cgroup.c
> ===================================================================
> --- mmotm-2.6.29-Jan14.orig/kernel/cgroup.c
> +++ mmotm-2.6.29-Jan14/kernel/cgroup.c
> @@ -2326,8 +2326,10 @@ static int cgroup_populate_dir(struct cg
> }
>
> for_each_subsys(cgrp->root, ss) {
> + struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
> if (ss->populate && (err = ss->populate(ss, cgrp)) < 0)
> return err;
> + set_bit(CSS_POPULATED, &css->flags);
> }
>
> return 0;
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
>>>>> + while (!ret) {
>>>>> + rcu_read_lock();
>>>>> + nextid = root_mem->last_scanned_child + 1;
>>>>> + css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css,
>>>>> + &found);
>>>>> + if (css && css_is_populated(css) && css_tryget(css))
>>>> I don't see why you need to check css_is_populated(css) ?
>>>>
>>> Main reason is for sanity. I don't like to hold css->refcnt of not populated css.
>> I think this is a rare case. It's just a very short period when a cgroup is
>> being created but not yet fully created.
>>
>>> Second reason is for avoinding unnecessary calls to try_to_free_pages(),
>>> it's heavy. I should also add mem->res.usage == 0 case for skipping but not yet.
>>>
>> And if mem->res.usage == 0 is checked, css_is_popuated() is just redundant.
>>
> Hmm ? Can I check mem->res.usage before css_tryget() ?
>
I think you can. If css != NULL, css is valid (otherwise how can we access css->flags
in css_tryget), so mem is valid. Correct me if I'm wrong. :)
On Fri, 16 Jan 2009 15:35:41 +0800
Li Zefan <[email protected]> wrote:
> >>>>> + while (!ret) {
> >>>>> + rcu_read_lock();
> >>>>> + nextid = root_mem->last_scanned_child + 1;
> >>>>> + css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css,
> >>>>> + &found);
> >>>>> + if (css && css_is_populated(css) && css_tryget(css))
> >>>> I don't see why you need to check css_is_populated(css) ?
> >>>>
> >>> Main reason is for sanity. I don't like to hold css->refcnt of not populated css.
> >> I think this is a rare case. It's just a very short period when a cgroup is
> >> being created but not yet fully created.
> >>
> >>> Second reason is for avoinding unnecessary calls to try_to_free_pages(),
> >>> it's heavy. I should also add mem->res.usage == 0 case for skipping but not yet.
> >>>
> >> And if mem->res.usage == 0 is checked, css_is_popuated() is just redundant.
> >>
> > Hmm ? Can I check mem->res.usage before css_tryget() ?
> >
>
> I think you can. If css != NULL, css is valid (otherwise how can we access css->flags
> in css_tryget), so mem is valid. Correct me if I'm wrong. :)
>
Ok, I'll remove css_is_populated(). (I already removed it in my local set.)
BTW, because we can access a cgroup outside of cgroup_lock via CSS ID scanning,
I want some way to confirm whether this cgroup is worth looking into or not.
*And* I think it's better to mark a cgroup whose initialization is not complete
as NOT VALID.
And it's better to notify the user that "you should rmdir this incomplete cgroup"
when populate() finally fails. Do you have an idea?
-Kame
On Thu, Jan 15, 2009 at 2:27 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> cgroup creation is done in several stages.
> After allocated and linked to cgroup's hierarchy tree, all necessary
> control files are created.
>
> When using CSS_ID, scanning cgroups without cgrouo_lock(), status
> of cgroup is important. At removal of cgroup/css, css_tryget() works fine
> and we can write a safe code.
What problems are you currently running into during creation? Won't
the fact that the css for the cgroup has been created, and its pointer
stored in the cgroup, be sufficient?
Or is the problem that a cgroup that fails creation half-way could
result in the memory code already having taken a reference on the
memcg, which can't then be cleanly destroyed?
Paul
> "This cgroup is not ready yet"
>
> This patch adds CSS_POPULATED flag.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> ---
> Index: mmotm-2.6.29-Jan14/include/linux/cgroup.h
> ===================================================================
> --- mmotm-2.6.29-Jan14.orig/include/linux/cgroup.h
> +++ mmotm-2.6.29-Jan14/include/linux/cgroup.h
> @@ -69,6 +69,7 @@ struct cgroup_subsys_state {
> enum {
> CSS_ROOT, /* This CSS is the root of the subsystem */
> CSS_REMOVED, /* This CSS is dead */
> + CSS_POPULATED, /* This CSS finished all initialization */
> };
>
> /*
> @@ -90,6 +91,11 @@ static inline bool css_is_removed(struct
> return test_bit(CSS_REMOVED, &css->flags);
> }
>
> +static inline bool css_is_populated(struct cgroup_subsys_state *css)
> +{
> + return test_bit(CSS_POPULATED, &css->flags);
> +}
> +
> /*
> * Call css_tryget() to take a reference on a css if your existing
> * (known-valid) reference isn't already ref-counted. Returns false if
> Index: mmotm-2.6.29-Jan14/kernel/cgroup.c
> ===================================================================
> --- mmotm-2.6.29-Jan14.orig/kernel/cgroup.c
> +++ mmotm-2.6.29-Jan14/kernel/cgroup.c
> @@ -2326,8 +2326,10 @@ static int cgroup_populate_dir(struct cg
> }
>
> for_each_subsys(cgrp->root, ss) {
> + struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
> if (ss->populate && (err = ss->populate(ss, cgrp)) < 0)
> return err;
> + set_bit(CSS_POPULATED, &css->flags);
> }
>
> return 0;
>
>
On Thu, Jan 15, 2009 at 7:00 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> Li-san, If you don't like this, could you give me an idea for
> "How to check cgroup is fully ready or not" ?
>
> BTW, why "we have a half filled direcotory - oh well" is allowed....
That's pretty much inherited from the original cpusets code. It
probably should be cleaned up, but unless the system is totally hosed
it seems pretty unlikely for creation of a few dentries to fail.
Paul
Paul Menage wrote:
> On Thu, Jan 15, 2009 at 7:00 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
>> Li-san, If you don't like this, could you give me an idea for
>> "How to check cgroup is fully ready or not" ?
>>
>> BTW, why "we have a half filled direcotory - oh well" is allowed....
>
> That's pretty much inherited from the original cpusets code. It
> probably should be cleaned up, but unless the system is totally hosed
> it seems pretty unlikely for creation of a few dentries to fail.
>
Yes, but I can see a potential problem. If we have subsystems foo and bar,
and both of them have a control file with exactly the same name, like
foo.stat & bar.stat, and we mount them with -o noprefix, then one
of the stat files will fail to be created, without any warning or notification
to the user.
On Mon, Jan 19, 2009 at 5:51 PM, Li Zefan <[email protected]> wrote:
>
> Yes, but I can see a potential problem. If we have subsystem foo and bar,
> and both of them have a control file with exactly the same name, like
> foo.stat & bar.stat. Now we mount them with -o noprefix, and then one
> of the stat file will fail to be created, without any warnning or notification
> to the user.
That's a fair point. The "noprefix" option was really only added for
backwards-compatibility when mounting as the "cpuset" filesystem type,
so it shouldn't end up getting used directly. But you're right that
this does mean that populate can fail and we should handle that, or
else make the "noprefix" option only usable from within the kernel
when mounting cpusets.
Paul
On Thu, Jan 15, 2009 at 2:25 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
>
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Patch for Per-CSS(Cgroup Subsys State) ID and private hierarchy code.
>
> This patch attaches unique ID to each css and provides following.
>
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Paul Menage <[email protected]>
> ---
> Index: mmotm-2.6.29-Jan14/include/linux/cgroup.h
> ===================================================================
> --- mmotm-2.6.29-Jan14.orig/include/linux/cgroup.h
> +++ mmotm-2.6.29-Jan14/include/linux/cgroup.h
> @@ -15,6 +15,7 @@
> #include <linux/cgroupstats.h>
> #include <linux/prio_heap.h>
> #include <linux/rwsem.h>
> +#include <linux/idr.h>
>
> #ifdef CONFIG_CGROUPS
>
> @@ -22,6 +23,7 @@ struct cgroupfs_root;
> struct cgroup_subsys;
> struct inode;
> struct cgroup;
> +struct css_id;
>
> extern int cgroup_init_early(void);
> extern int cgroup_init(void);
> @@ -59,6 +61,8 @@ struct cgroup_subsys_state {
> atomic_t refcnt;
>
> unsigned long flags;
> + /* ID for this css, if possible */
> + struct css_id *id;
> };
>
> /* bits in struct cgroup_subsys_state flags field */
> @@ -363,6 +367,11 @@ struct cgroup_subsys {
> int active;
> int disabled;
> int early_init;
> + /*
> + * True if this subsys uses ID. ID is not available before cgroup_init()
> + * (not available in early_init time.)
> + */
> + bool use_id;
> #define MAX_CGROUP_TYPE_NAMELEN 32
> const char *name;
>
> @@ -384,6 +393,9 @@ struct cgroup_subsys {
> */
> struct cgroupfs_root *root;
> struct list_head sibling;
> + /* used when use_id == true */
> + struct idr idr;
> + spinlock_t id_lock;
> };
>
> #define SUBSYS(_x) extern struct cgroup_subsys _x ## _subsys;
> @@ -437,6 +449,44 @@ void cgroup_iter_end(struct cgroup *cgrp
> int cgroup_scan_tasks(struct cgroup_scanner *scan);
> int cgroup_attach_task(struct cgroup *, struct task_struct *);
>
> +/*
> + * CSS ID is an ID for cgroup_subsys_state structs under a subsys. This only works
> + * if cgroup_subsys.use_id == true. It can be used for looking up and scanning.
> + * A CSS ID is assigned automatically at cgroup allocation (create)
> + * and removed when the subsys calls the free_css_id() function. This is because
> + * the lifetime of cgroup_subsys_state is the subsys's responsibility.
> + *
> + * Lookup and scanning functions should be called under rcu_read_lock().
> + * Taking cgroup_mutex()/hierarchy_mutex() is not necessary for the following calls.
> + * But the css returned by these routines can be "not populated yet" or "being
> + * destroyed". The caller should check the status of the css and its cgroup.
> + */
> +
> +/*
> + * Typically called at ->destroy(), or wherever the subsys frees the
> + * cgroup_subsys_state.
> + */
> +void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css);
> +
> +/* Find a cgroup_subsys_state which has given ID */
> +
> +struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id);
> +
> +/*
> + * Get a css whose id is greater than or equal to id, under the tree of root.
> + * Returns a cgroup_subsys_state or NULL.
> + */
> +struct cgroup_subsys_state *css_get_next(struct cgroup_subsys *ss, int id,
> + struct cgroup_subsys_state *root, int *foundid);
> +
> +/* Returns true if root is ancestor of cg */
> +bool css_is_ancestor(struct cgroup_subsys_state *cg,
> + struct cgroup_subsys_state *root);
> +
> +/* Get id and depth of css */
> +unsigned short css_id(struct cgroup_subsys_state *css);
> +unsigned short css_depth(struct cgroup_subsys_state *css);
> +
> #else /* !CONFIG_CGROUPS */
>
> static inline int cgroup_init_early(void) { return 0; }
> Index: mmotm-2.6.29-Jan14/kernel/cgroup.c
> ===================================================================
> --- mmotm-2.6.29-Jan14.orig/kernel/cgroup.c
> +++ mmotm-2.6.29-Jan14/kernel/cgroup.c
> @@ -185,6 +185,8 @@ struct cg_cgroup_link {
> static struct css_set init_css_set;
> static struct cg_cgroup_link init_css_set_link;
>
> +static int cgroup_subsys_init_idr(struct cgroup_subsys *ss);
> +
> /* css_set_lock protects the list of css_set objects, and the
> * chain of tasks off each css_set. Nests outside task->alloc_lock
> * due to cgroup_iter_start() */
> @@ -567,6 +569,9 @@ static struct backing_dev_info cgroup_ba
> .capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK,
> };
>
> +static int alloc_css_id(struct cgroup_subsys *ss,
> + struct cgroup *parent, struct cgroup *child);
> +
> static struct inode *cgroup_new_inode(mode_t mode, struct super_block *sb)
> {
> struct inode *inode = new_inode(sb);
> @@ -2335,6 +2340,7 @@ static void init_cgroup_css(struct cgrou
> css->cgroup = cgrp;
> atomic_set(&css->refcnt, 1);
> css->flags = 0;
> + css->id = NULL;
> if (cgrp == dummytop)
> set_bit(CSS_ROOT, &css->flags);
> BUG_ON(cgrp->subsys[ss->subsys_id]);
> @@ -2410,6 +2416,10 @@ static long cgroup_create(struct cgroup
> goto err_destroy;
> }
> init_cgroup_css(css, ss, cgrp);
> + if (ss->use_id)
> + if (alloc_css_id(ss, parent, cgrp))
> + goto err_destroy;
> + /* At error, ->destroy() callback has to free assigned ID. */
> }
>
> cgroup_lock_hierarchy(root);
> @@ -2699,6 +2709,8 @@ int __init cgroup_init(void)
> struct cgroup_subsys *ss = subsys[i];
> if (!ss->early_init)
> cgroup_init_subsys(ss);
> + if (ss->use_id)
> + cgroup_subsys_init_idr(ss);
> }
>
> /* Add init_css_set to the hash table */
> @@ -3232,3 +3244,259 @@ static int __init cgroup_disable(char *s
> return 1;
> }
> __setup("cgroup_disable=", cgroup_disable);
> +
> +/*
> + * CSS ID -- ID per subsys's Cgroup Subsys State(CSS).
> + */
> +struct css_id {
> + /*
> + * The css to which this ID points. If cgroup is removed, this will
> + * be NULL. This pointer is expected to be RCU-safe because destroy()
> + * is called after synchronize_rcu(). But for safe use, css_is_removed()
> + * css_tryget() should be used for avoiding race.
> + */
> + struct cgroup_subsys_state *css;
> + /*
> + * ID of this css.
> + */
> + unsigned short id;
> + /*
> + * Depth in hierarchy which this ID belongs to.
> + */
> + unsigned short depth;
> + /*
> + * ID is freed by RCU. (and lookup routine is RCU safe.)
> + */
> + struct rcu_head rcu_head;
> + /*
> + * Hierarchy of CSS ID belongs to.
> + */
> + unsigned short stack[0]; /* Array of Length (depth+1) */
> +};
> +#define CSS_ID_MAX (65535)
> +
> +/*
> + * To get ID other than 0, this should be called when !cgroup_is_removed().
> + */
> +unsigned short css_id(struct cgroup_subsys_state *css)
> +{
> + struct css_id *cssid = rcu_dereference(css->id);
> +
> + if (cssid)
> + return cssid->id;
> + return 0;
> +}
> +
> +unsigned short css_depth(struct cgroup_subsys_state *css)
> +{
> + struct css_id *cssid = rcu_dereference(css->id);
> +
> + if (cssid)
> + return cssid->depth;
> + return 0;
> +}
> +
> +bool css_is_ancestor(struct cgroup_subsys_state *child,
> + struct cgroup_subsys_state *root)
> +{
> + struct css_id *child_id = rcu_dereference(child->id);
> + struct css_id *root_id = rcu_dereference(root->id);
> +
> + if (!child_id || !root_id || (child_id->depth < root_id->depth))
> + return false;
> + return child_id->stack[root_id->depth] == root_id->id;
> +}
> +
> +static void __free_css_id_cb(struct rcu_head *head)
> +{
> + struct css_id *id;
> +
> + id = container_of(head, struct css_id, rcu_head);
> + kfree(id);
> +}
> +
> +void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css)
> +{
> + struct css_id *id = css->id;
> + /* When this is called before css_id initialization, id can be NULL */
> + if (!id)
> + return;
> +
> + BUG_ON(!ss->use_id);
> +
> + rcu_assign_pointer(id->css, NULL);
> + rcu_assign_pointer(css->id, NULL);
> + spin_lock(&ss->id_lock);
> + idr_remove(&ss->idr, id->id);
> + spin_unlock(&ss->id_lock);
> + call_rcu(&id->rcu_head, __free_css_id_cb);
> +}
> +
> +/*
> + * This is called by init or create(). Then, calls to this function are
> + * always serialized (By cgroup_mutex() at create()).
> + */
> +
> +static struct css_id *get_new_cssid(struct cgroup_subsys *ss, int depth)
> +{
> + struct css_id *newid;
> + int myid, error, size;
> +
> + BUG_ON(!ss->use_id);
> +
> + size = sizeof(*newid) + sizeof(unsigned short) * (depth + 1);
> + newid = kzalloc(size, GFP_KERNEL);
> + if (!newid)
> + return ERR_PTR(-ENOMEM);
> + /* get id */
> + if (unlikely(!idr_pre_get(&ss->idr, GFP_KERNEL))) {
> + error = -ENOMEM;
> + goto err_out;
> + }
> + spin_lock(&ss->id_lock);
> + /* Don't use 0. allocates an ID of 1-65535 */
> + error = idr_get_new_above(&ss->idr, newid, 1, &myid);
> + spin_unlock(&ss->id_lock);
> +
> + /* Returns error when there are no free spaces for new ID.*/
> + if (error) {
> + error = -ENOSPC;
> + goto err_out;
> + }
> + if (myid > CSS_ID_MAX)
> + goto remove_idr;
> +
> + newid->id = myid;
> + newid->depth = depth;
> + return newid;
> +remove_idr:
> + error = -ENOSPC;
> + spin_lock(&ss->id_lock);
> + idr_remove(&ss->idr, myid);
> + spin_unlock(&ss->id_lock);
> +err_out:
> + kfree(newid);
> + return ERR_PTR(error);
> +
> +}
> +
> +static int __init cgroup_subsys_init_idr(struct cgroup_subsys *ss)
> +{
> + struct css_id *newid;
> + struct cgroup_subsys_state *rootcss;
> +
> + spin_lock_init(&ss->id_lock);
> + idr_init(&ss->idr);
> +
> + rootcss = init_css_set.subsys[ss->subsys_id];
> + newid = get_new_cssid(ss, 0);
> + if (IS_ERR(newid))
> + return PTR_ERR(newid);
> +
> + newid->stack[0] = newid->id;
> + newid->css = rootcss;
> + rootcss->id = newid;
> + return 0;
> +}
> +
> +static int alloc_css_id(struct cgroup_subsys *ss, struct cgroup *parent,
> + struct cgroup *child)
> +{
> + int subsys_id, i, depth = 0;
> + struct cgroup_subsys_state *parent_css, *child_css;
> + struct css_id *child_id, *parent_id = NULL;
> +
> + subsys_id = ss->subsys_id;
> + parent_css = parent->subsys[subsys_id];
> + child_css = child->subsys[subsys_id];
> + depth = css_depth(parent_css) + 1;
> + parent_id = parent_css->id;
> +
> + child_id = get_new_cssid(ss, depth);
> + if (IS_ERR(child_id))
> + return PTR_ERR(child_id);
> +
> + for (i = 0; i < depth; i++)
> + child_id->stack[i] = parent_id->stack[i];
> + child_id->stack[depth] = child_id->id;
> +
> + rcu_assign_pointer(child_id->css, child_css);
> + rcu_assign_pointer(child_css->id, child_id);
> +
> + return 0;
> +}
> +
> +/**
> + * css_lookup - lookup css by id
> + * @ss: cgroup subsys to be looked into.
> + * @id: the id
> + *
> + * Returns pointer to cgroup_subsys_state if there is valid one with id.
> + * NULL if not. Should be called under rcu_read_lock()
> + */
> +struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id)
> +{
> + struct css_id *cssid = NULL;
> +
> + BUG_ON(!ss->use_id);
> + cssid = idr_find(&ss->idr, id);
> +
> + if (unlikely(!cssid))
> + return NULL;
> +
> + return rcu_dereference(cssid->css);
> +}
> +
> +/**
> + * css_get_next - look up the next css under the specified hierarchy.
> + * @ss: pointer to subsystem
> + * @id: current position of the iteration.
> + * @root: pointer to a css; the tree under this is searched.
> + * @foundid: position of the found object.
> + *
> + * Search for the next css under the hierarchy rooted at root. Calling under
> + * rcu_read_lock() is necessary. Returns NULL if it reaches the end.
> + */
> +struct cgroup_subsys_state *
> +css_get_next(struct cgroup_subsys *ss, int id,
> + struct cgroup_subsys_state *root, int *foundid)
> +{
> + struct cgroup_subsys_state *ret = NULL;
> + struct css_id *tmp;
> + int tmpid;
> + int rootid = css_id(root);
> + int depth = css_depth(root);
> +
> + if (!rootid)
> + return NULL;
> +
> + BUG_ON(!ss->use_id);
> + rcu_read_lock();
> + /* fill start point for scan */
> + tmpid = id;
> + while (1) {
> + /*
> + * scan next entry from bitmap(tree), tmpid is updated after
> + * idr_get_next().
> + */
> + spin_lock(&ss->id_lock);
> + tmp = idr_get_next(&ss->idr, &tmpid);
> + spin_unlock(&ss->id_lock);
> +
> + if (!tmp)
> + break;
> + if (tmp->depth >= depth && tmp->stack[depth] == rootid) {
> + ret = rcu_dereference(tmp->css);
> + if (ret) {
> + *foundid = tmpid;
> + break;
> + }
> + }
> + /* continue to scan from next id */
> + tmpid = tmpid + 1;
> + }
> +
> + rcu_read_unlock();
> + return ret;
> +}
> +
> Index: mmotm-2.6.29-Jan14/include/linux/idr.h
> ===================================================================
> --- mmotm-2.6.29-Jan14.orig/include/linux/idr.h
> +++ mmotm-2.6.29-Jan14/include/linux/idr.h
> @@ -106,6 +106,7 @@ int idr_get_new(struct idr *idp, void *p
> int idr_get_new_above(struct idr *idp, void *ptr, int starting_id, int *id);
> int idr_for_each(struct idr *idp,
> int (*fn)(int id, void *p, void *data), void *data);
> +void *idr_get_next(struct idr *idp, int *nextid);
> void *idr_replace(struct idr *idp, void *ptr, int id);
> void idr_remove(struct idr *idp, int id);
> void idr_remove_all(struct idr *idp);
> Index: mmotm-2.6.29-Jan14/lib/idr.c
> ===================================================================
> --- mmotm-2.6.29-Jan14.orig/lib/idr.c
> +++ mmotm-2.6.29-Jan14/lib/idr.c
> @@ -579,6 +579,52 @@ int idr_for_each(struct idr *idp,
> EXPORT_SYMBOL(idr_for_each);
>
> /**
> + * idr_get_next - look up the next object at or above the given id.
> + * @idp: idr handle
> + * @nextidp: pointer to the lookup key; updated to the id that was found.
> + *
> + * Returns a pointer to the registered object with the smallest id greater
> + * than or equal to *nextidp, or NULL if there is none.
> + */
> +
> +void *idr_get_next(struct idr *idp, int *nextidp)
> +{
> + struct idr_layer *p, *pa[MAX_LEVEL];
> + struct idr_layer **paa = &pa[0];
> + int id = *nextidp;
> + int n, max;
> +
> + /* find first ent */
> + n = idp->layers * IDR_BITS;
> + max = 1 << n;
> + p = rcu_dereference(idp->top);
> + if (!p)
> + return NULL;
> +
> + while (id < max) {
> + while (n > 0 && p) {
> + n -= IDR_BITS;
> + *paa++ = p;
> + p = rcu_dereference(p->ary[(id >> n) & IDR_MASK]);
> + }
> +
> + if (p) {
> + *nextidp = id;
> + return p;
> + }
> +
> + id += 1 << n;
> + while (n < fls(id)) {
> + n += IDR_BITS;
> + p = *--paa;
> + }
> + }
> + return NULL;
> +}
> +
> +
> +
> +/**
> * idr_replace - replace pointer for given id
> * @idp: idr handle
> * @ptr: pointer you want associated with the id
>
>
On Mon, 19 Jan 2009 17:39:40 -0800
Paul Menage <[email protected]> wrote:
> On Thu, Jan 15, 2009 at 2:27 AM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > cgroup creation is done in several stages.
> > After allocated and linked to cgroup's hierarchy tree, all necessary
> > control files are created.
> >
> > When using CSS_ID, scanning cgroups without cgrouo_lock(), status
> > of cgroup is important. At removal of cgroup/css, css_tryget() works fine
> > and we can write a safe code.
>
> What problems are you currently running into during creation? Won't
> the fact that the css for the cgroup has been created, and its pointer
> been stored in the cgroup, be sufficient?
>
> Or is the problem that a cgroup that fails creation half-way could
> result in the memory code alreadying having taken a reference on the
> memcg, which can't then be cleanly destroyed?
>
Ah, this is related to CSS ID scanning. There is no real problem with the current code.
Now, in my patch, the CSS ID is attached just after create().
Then a "scan by CSS ID" can find a cgroup which is not populated yet.
I just wanted to skip those to avoid a mess.
For example, css_tryget() can succeed against a css which belongs to a not-yet-populated
cgroup. If creation of the cgroup fails, it is destroyed and freed without an RCU
synchronization. This may break CSS ID scanning's assumption that "we're safe under
rcu_read_lock", and it allows destroying a css while css->refcnt > 1.
I'm now rewriting the code a bit; what I want is something like this.
==
for_each_subsys(root, ss) {
	/* on error, goto destroy */
	create();
	if (ss->use_id)
		attach_id();
}
/* create dir and populate files */
.....
destroy:
	synchronize_rcu();
	/* start destroy here */
==
css_is_populated() is there to say "ok, you can call css_tryget()".
It might be clearer if css_tryget() itself checked the CSS_POPULATED bit.
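A rough sketch of the intended caller side, using the interfaces above (illustration
only, not from the patch; the function name is made up):
==
/*
 * Sketch: scan every populated css under @root by CSS ID, taking a
 * temporary reference on each css we visit. css_get_next(),
 * css_is_populated() and css_tryget() are the interfaces from this series.
 */
static void scan_populated_css(struct cgroup_subsys *ss,
			       struct cgroup_subsys_state *root)
{
	struct cgroup_subsys_state *css;
	int id = 1, found;

	rcu_read_lock();
	while ((css = css_get_next(ss, id, root, &found)) != NULL) {
		/* skip a cgroup whose control files are not created yet */
		if (css_is_populated(css) && css_tryget(css)) {
			/* ... work on css (could drop rcu_read_lock here) ... */
			css_put(css);
		}
		id = found + 1;
	}
	rcu_read_unlock();
}
==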
-Kame
> Paul
>
> > "This cgroup is not ready yet"
> >
> > This patch adds CSS_POPULATED flag.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> >
> > ---
> > Index: mmotm-2.6.29-Jan14/include/linux/cgroup.h
> > ===================================================================
> > --- mmotm-2.6.29-Jan14.orig/include/linux/cgroup.h
> > +++ mmotm-2.6.29-Jan14/include/linux/cgroup.h
> > @@ -69,6 +69,7 @@ struct cgroup_subsys_state {
> > enum {
> > CSS_ROOT, /* This CSS is the root of the subsystem */
> > CSS_REMOVED, /* This CSS is dead */
> > + CSS_POPULATED, /* This CSS finished all initialization */
> > };
> >
> > /*
> > @@ -90,6 +91,11 @@ static inline bool css_is_removed(struct
> > return test_bit(CSS_REMOVED, &css->flags);
> > }
> >
> > +static inline bool css_is_populated(struct cgroup_subsys_state *css)
> > +{
> > + return test_bit(CSS_POPULATED, &css->flags);
> > +}
> > +
> > /*
> > * Call css_tryget() to take a reference on a css if your existing
> > * (known-valid) reference isn't already ref-counted. Returns false if
> > Index: mmotm-2.6.29-Jan14/kernel/cgroup.c
> > ===================================================================
> > --- mmotm-2.6.29-Jan14.orig/kernel/cgroup.c
> > +++ mmotm-2.6.29-Jan14/kernel/cgroup.c
> > @@ -2326,8 +2326,10 @@ static int cgroup_populate_dir(struct cg
> > }
> >
> > for_each_subsys(cgrp->root, ss) {
> > + struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
> > if (ss->populate && (err = ss->populate(ss, cgrp)) < 0)
> > return err;
> > + set_bit(CSS_POPULATED, &css->flags);
> > }
> >
> > return 0;
> >
> >
>
On Mon, Jan 19, 2009 at 6:02 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> Ah, this is related to CSS ID scanning. No real problem to current codes.
>
> Now, in my patch, CSS ID is attached just after create().
>
> Then, "scan by CSS ID" can find a cgroup which is not populated yet.
> I just wanted to skip them for avoiding mess.
>
> For example, css_tryget() can succeed against css which belongs to not-populated
> cgroup. If creation of cgroup fails, it's destroyed and freed without RCU synchronize.
> This may breaks CSS ID scanning's assumption that "we're safe under rcu_read_lock".
> And allows destroy css while css->refcnt > 1.
So for the CSS ID case, we could solve this by not populating
css_id->css until creation is guaranteed to have succeeded? (We'd
still allocated the css_id along with the subsystem, just not complete
its initialization). cgroup.c already knows about and hides the
details of css_id, so this wouldn't be hard.
The question is whether other users of css_tryget() might run into
this problem, without using CSS IDs. But currently no-one's using
css_tryget() apart from you, so that's a problem we can solve as it
arises.
I think we're safe from css_get() being used in this case, since
css_get() can only be used on css references obtained from a locked
task, or from other pointers that are known to be ref-counted, which
would be impossible if css_tryget() can't succeed on a css in the
partially-completed state.
Paul
On Mon, 19 Jan 2009 18:23:03 -0800
Paul Menage <[email protected]> wrote:
> On Mon, Jan 19, 2009 at 6:02 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > Ah, this is related to CSS ID scanning. No real problem to current codes.
> >
> > Now, in my patch, CSS ID is attached just after create().
> >
> > Then, "scan by CSS ID" can find a cgroup which is not populated yet.
> > I just wanted to skip them for avoiding mess.
> >
> > For example, css_tryget() can succeed against css which belongs to not-populated
> > cgroup. If creation of cgroup fails, it's destroyed and freed without RCU synchronize.
> > This may breaks CSS ID scanning's assumption that "we're safe under rcu_read_lock".
> > And allows destroy css while css->refcnt > 1.
>
> So for the CSS ID case, we could solve this by not populating
> css_id->css until creation is guaranteed to have succeeded? (We'd
> still allocated the css_id along with the subsystem, just not complete
> its initialization). cgroup.c already knows about and hides the
> details of css_id, so this wouldn't be hard.
>
Hmm, moving this call to after populate() is not good because
the id > max_id case cannot be handled ;)
> + if (ss->use_id)
> + if (alloc_css_id(ss, parent, cgrp))
> + goto err_destroy;
> + /* At error, ->destroy() callback has to free assigned ID. */
> }
Should I delay setting the css_id->css pointer to a valid value until the end of
populate()? (i.e. add a populate_css_id() call after cgroup_populate_dir().)
I'd like to write an add-on patch to patch [1/4] (or update it).
The css_id->css == NULL case is handled now, anyway.
> The question is whether other users of css_tryget() might run into
> this problem, without using CSS IDs. But currently no-one's using
> css_tryget() apart from you, so that's a problem we can solve as it
> arises.
>
yes ;)
> I think we're safe from css_get() being used in this case, since
> css_get() can only be used on css references obtained from a locked
> task, or from other pointers that are known to be ref-counted, which
> would be impossible if css_tryget() can't succeed on a css in the
> partially-completed state.
>
>
Thanks, I'll try another patch. Maybe it's OK to delay updating the css_id->css
pointer.
-Kame
On Tue, 20 Jan 2009 11:58:32 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:
> > + if (ss->use_id)
> > + if (alloc_css_id(ss, parent, cgrp))
> > + goto err_destroy;
> > + /* At error, ->destroy() callback has to free assigned ID. */
> > }
>
> Should I delay to set css_id->css pointer to valid value until the end of
> populate() ? (add populage_css_id() call after cgroup_populate_dir()).
>
> I'd like to write add-on patch to the patch [1/4]. (or update it.)
> css_id->css == NULL case is handled now, anyway.
>
How about this?
==
From: KAMEZAWA Hiroyuki <[email protected]>
When a CSS ID is attached, it's not guaranteed that the cgroup will
finally be fully populated (create() may fail at some point).
But a scan by CSS ID can find a CSS which is not fully initialized.
This patch tries to prevent that by delaying the fill of the id->css pointer.
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
kernel/cgroup.c | 22 ++++++++++++++++++----
1 file changed, 18 insertions(+), 4 deletions(-)
Index: mmotm-2.6.29-Jan16/kernel/cgroup.c
===================================================================
--- mmotm-2.6.29-Jan16.orig/kernel/cgroup.c
+++ mmotm-2.6.29-Jan16/kernel/cgroup.c
@@ -569,6 +569,7 @@ static struct backing_dev_info cgroup_ba
.capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK,
};
+static void populate_css_id(struct cgroup_subsys_state *id);
static int alloc_css_id(struct cgroup_subsys *ss,
struct cgroup *parent, struct cgroup *child);
@@ -2329,6 +2330,12 @@ static int cgroup_populate_dir(struct cg
if (ss->populate && (err = ss->populate(ss, cgrp)) < 0)
return err;
}
+ /* This cgroup is ready now */
+ for_each_subsys(cgrp->root, ss) {
+ struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
+ if (ss->use_id)
+ populate_css_id(css);
+ }
return 0;
}
@@ -3252,8 +3259,9 @@ __setup("cgroup_disable=", cgroup_disabl
*/
struct css_id {
/*
- * The css to which this ID points. If cgroup is removed, this will
- * be NULL. This pointer is expected to be RCU-safe because destroy()
+ * The css to which this ID points. This pointer is set to valid value
+ * after cgroup is populated. If cgroup is removed, this will be NULL.
+ * This pointer is expected to be RCU-safe because destroy()
* is called after synchronize_rcu(). But for safe use, css_is_removed()
* css_tryget() should be used for avoiding race.
*/
@@ -3401,6 +3409,13 @@ static int __init cgroup_subsys_init_idr
return 0;
}
+static void populate_css_id(struct cgroup_subsys_state *css)
+{
+ struct css_id *id = rcu_dereference(css->id);
+ if (id)
+ rcu_assign_pointer(id->css, css);
+}
+
static int alloc_css_id(struct cgroup_subsys *ss, struct cgroup *parent,
struct cgroup *child)
{
@@ -3421,8 +3436,7 @@ static int alloc_css_id(struct cgroup_su
for (i = 0; i < depth; i++)
child_id->stack[i] = parent_id->stack[i];
child_id->stack[depth] = child_id->id;
-
- rcu_assign_pointer(child_id->css, child_css);
+ /* child_id->css pointer will be set after this cgroup is available */
rcu_assign_pointer(child_css->id, child_id);
return 0;
On Mon, Jan 19, 2009 at 9:43 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> +static void populate_css_id(struct cgroup_subsys_state *css)
> +{
> + struct css_id *id = rcu_dereference(css->id);
> + if (id)
> + rcu_assign_pointer(id->css, css);
> +}
I don't think this needs to be split out into a separate function.
Also, there's no need for an rcu_dereference(), since we're holding
cgroup_mutex.
Paul
On Wed, 21 Jan 2009 01:36:32 -0800
Paul Menage <[email protected]> wrote:
> On Mon, Jan 19, 2009 at 9:43 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > +static void populate_css_id(struct cgroup_subsys_state *css)
> > +{
> > + struct css_id *id = rcu_dereference(css->id);
> > + if (id)
> > + rcu_assign_pointer(id->css, css);
> > +}
>
> I don't think this needs to be split out into a separate function.
ok.
> Also, there's no need for an rcu_dereference(), since we're holding
> cgroup_mutex.
>
Sure. I'll fix.
I'll merge this behavior to CSS ID patch and post it again.
Thanks,
-Kame
On Wed, 21 Jan 2009 19:34:36 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:
> On Wed, 21 Jan 2009 01:36:32 -0800
> Paul Menage <[email protected]> wrote:
>
> > On Mon, Jan 19, 2009 at 9:43 PM, KAMEZAWA Hiroyuki
> > <[email protected]> wrote:
> > > +static void populate_css_id(struct cgroup_subsys_state *css)
> > > +{
> > > + struct css_id *id = rcu_dereference(css->id);
> > > + if (id)
> > > + rcu_assign_pointer(id->css, css);
> > > +}
> >
> > I don't think this needs to be split out into a separate function.
> ok.
>
> > Also, there's no need for an rcu_dereference(), since we're holding
> > cgroup_mutex.
> >
> Sure. I'll fix.
>
BTW, isn't it better to use rcu_assign_pointer() to show "this pointer will be
dereferenced from the RCU read side"?
-Kame
On Wed, Jan 21, 2009 at 6:06 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
>
> BTW, isn't it better to use rcu_assign_pointer to show "this pointer will be
> dereferenced from RCU-read-side" ?
>
Yes, I think using rcu_assign_pointer() is fine here.
Paul
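Putting those two comments together, the merged hunk would presumably end up
looking roughly like this (a sketch only, not the reposted patch):
==
	/* This cgroup is ready now */
	for_each_subsys(cgrp->root, ss) {
		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
		/*
		 * cgroup_mutex is held here, so css->id can be read directly;
		 * rcu_assign_pointer() is kept because readers do
		 * rcu_dereference() on id->css.
		 */
		if (ss->use_id && css->id)
			rcu_assign_pointer(css->id->css, css);
	}
==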
Hi.
Sorry for very late reply.
It looks good in general.
Just a few comments.
> +/**
> + * css_lookup - lookup css by id
> + * @ss: cgroup subsys to be looked into.
> + * @id: the id
> + *
> + * Returns pointer to cgroup_subsys_state if there is valid one with id.
> + * NULL if not. Should be called under rcu_read_lock()
> + */
> +struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id)
> +{
> + struct css_id *cssid = NULL;
> +
> + BUG_ON(!ss->use_id);
> + cssid = idr_find(&ss->idr, id);
> +
> + if (unlikely(!cssid))
> + return NULL;
> +
> + return rcu_dereference(cssid->css);
> +}
> +
Just for clarification, is there any user of this function?
(I agree it's natural to define a 'lookup' function, though.)
> +/**
> + * css_get_next - lookup next cgroup under specified hierarchy.
> + * @ss: pointer to subsystem
> + * @id: current position of iteration.
> + * @root: pointer to css. search tree under this.
> + * @foundid: position of found object.
> + *
> + * Search next css under the specified hierarchy of rootid. Calling under
> + * rcu_read_lock() is necessary. Returns NULL if it reaches the end.
> + */
> +struct cgroup_subsys_state *
> +css_get_next(struct cgroup_subsys *ss, int id,
> + struct cgroup_subsys_state *root, int *foundid)
> +{
> + struct cgroup_subsys_state *ret = NULL;
> + struct css_id *tmp;
> + int tmpid;
> + int rootid = css_id(root);
> + int depth = css_depth(root);
> +
I think it's safe here, but isn't it better to call css_id()/css_depth()
under rcu_read_lock() (they call rcu_dereference())?
> + if (!rootid)
> + return NULL;
> +
> + BUG_ON(!ss->use_id);
> + rcu_read_lock();
> + /* fill start point for scan */
> + tmpid = id;
> + while (1) {
> + /*
> + * scan next entry from bitmap(tree), tmpid is updated after
> + * idr_get_next().
> + */
> + spin_lock(&ss->id_lock);
> + tmp = idr_get_next(&ss->idr, &tmpid);
> + spin_unlock(&ss->id_lock);
> +
> + if (!tmp)
> + break;
> + if (tmp->depth >= depth && tmp->stack[depth] == rootid) {
Can't css_is_ancestor() be used here?
I think it would be easier to understand.
> + ret = rcu_dereference(tmp->css);
> + if (ret) {
> + *foundid = tmpid;
> + break;
> + }
> + }
> + /* continue to scan from next id */
> + tmpid = tmpid + 1;
> + }
> +
> + rcu_read_unlock();
> + return ret;
> +}
> +
Thanks,
Daisuke Nishimura.
On Thu, 22 Jan 2009 11:37:29 +0900
Daisuke Nishimura <[email protected]> wrote:
> Hi.
>
> Sorry for very late reply.
>
> It looks good in general.
> Just a few comments.
>
> > +/**
> > + * css_lookup - lookup css by id
> > + * @ss: cgroup subsys to be looked into.
> > + * @id: the id
> > + *
> > + * Returns pointer to cgroup_subsys_state if there is valid one with id.
> > + * NULL if not. Should be called under rcu_read_lock()
> > + */
> > +struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id)
> > +{
> > + struct css_id *cssid = NULL;
> > +
> > + BUG_ON(!ss->use_id);
> > + cssid = idr_find(&ss->idr, id);
> > +
> > + if (unlikely(!cssid))
> > + return NULL;
> > +
> > + return rcu_dereference(cssid->css);
> > +}
> > +
> Just for clarification, is there any user of this function ?
> (I agree it's natulal to define 'lookup' function, though.)
>
The planned user is swap_cgroup.
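A rough sketch of that use (the struct and helper names here are hypothetical;
only css_lookup() and css_tryget() are from this series):
==
/* A record that stores a 2-byte CSS ID instead of an 8-byte pointer. */
struct swap_record {
	unsigned short cssid;	/* 0 means "no owner" */
};

/* Resolve the stored ID back to a css, taking a reference if it is still alive. */
static struct cgroup_subsys_state *
swap_record_get_css(struct cgroup_subsys *ss, struct swap_record *rec)
{
	struct cgroup_subsys_state *css = NULL;

	rcu_read_lock();
	if (rec->cssid) {
		css = css_lookup(ss, rec->cssid);
		/* the css may be going away; only keep it if tryget succeeds */
		if (css && !css_tryget(css))
			css = NULL;
	}
	rcu_read_unlock();
	return css;
}
==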
> > +/**
> > + * css_get_next - lookup next cgroup under specified hierarchy.
> > + * @ss: pointer to subsystem
> > + * @id: current position of iteration.
> > + * @root: pointer to css. search tree under this.
> > + * @foundid: position of found object.
> > + *
> > + * Search next css under the specified hierarchy of rootid. Calling under
> > + * rcu_read_lock() is necessary. Returns NULL if it reaches the end.
> > + */
> > +struct cgroup_subsys_state *
> > +css_get_next(struct cgroup_subsys *ss, int id,
> > + struct cgroup_subsys_state *root, int *foundid)
> > +{
> > + struct cgroup_subsys_state *ret = NULL;
> > + struct css_id *tmp;
> > + int tmpid;
> > + int rootid = css_id(root);
> > + int depth = css_depth(root);
> > +
> I think it's safe here, but isn't it better to call css_id/css_depth
> under rcu_read_lock(they call rcu_dereference) ?
>
As the comment says, css_get_next() itself should be called under rcu_read_lock().
> > + if (!rootid)
> > + return NULL;
> > +
> > + BUG_ON(!ss->use_id);
> > + rcu_read_lock();
> > + /* fill start point for scan */
> > + tmpid = id;
> > + while (1) {
> > + /*
> > + * scan next entry from bitmap(tree), tmpid is updated after
> > + * idr_get_next().
> > + */
> > + spin_lock(&ss->id_lock);
> > + tmp = idr_get_next(&ss->idr, &tmpid);
> > + spin_unlock(&ss->id_lock);
> > +
> > + if (!tmp)
> > + break;
> > + if (tmp->depth >= depth && tmp->stack[depth] == rootid) {
> Can't be css_is_ancestor used here ?
> I think it would be more easy to understand.
>
Hmm, it requires
	css_is_ancestor(tmp->css, root);
and adds memory barriers to access tmp->css and tmp->css->id and root->id
(the compiler will not optimize these accesses because of the memory barriers).
So, I think the bare code is better here.
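For comparison, the css_is_ancestor() form of the check inside css_get_next()
would look roughly like this (sketch only), which is where the extra
rcu_dereference()s come from:
==
		/* sketch: replace the open-coded depth/stack test */
		ret = rcu_dereference(tmp->css);
		if (ret && css_is_ancestor(ret, root)) {
			*foundid = tmpid;
			break;
		}
		ret = NULL;
==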
Thank you for review.
-Kame