Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753922AbYK1B2b (ORCPT ); Thu, 27 Nov 2008 20:28:31 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752603AbYK1B2W (ORCPT ); Thu, 27 Nov 2008 20:28:22 -0500 Received: from cn.fujitsu.com ([222.73.24.84]:60034 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1752557AbYK1B2U (ORCPT ); Thu, 27 Nov 2008 20:28:20 -0500 Message-ID: <492F4941.2030707@cn.fujitsu.com> Date: Fri, 28 Nov 2008 09:28:33 +0800 From: Li Zefan User-Agent: Thunderbird 2.0.0.9 (X11/20071115) MIME-Version: 1.0 To: KAMEZAWA Hiroyuki CC: "linux-kernel@vger.kernel.org" , "menage@google.com" , "balbir@linux.vnet.ibm.com" , "nishimura@mxp.nes.nec.co.jp" , taka@valinux.co.jp Subject: Re: [RFC][PATCH 1/2] CGROUP ID and Hierarchy Code References: <20081127160548.3274c8e6.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20081127160548.3274c8e6.kamezawa.hiroyu@jp.fujitsu.com> Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 14596 Lines: 504 KAMEZAWA Hiroyuki wrote: > Patches for CGROUP ID and Hierarchy Code. > (based on mmotm Nov/24...I'll update later.) > > CGROUP ID is a unique value for each cgroup (struct cgroup) > Hierarchy Code is array of CGROUP ID which contains all ID od ancestors. > and tries to remove cgroup_lock() to walk tree. > > This is just an example-level patch but I'm now testing/refleshing this. > Works well with memcg's hierarchy relcaim. > Any comment is welcome. > > -Kame > > == > A toy patch for Cgroup ID and hierarchy code. > > This patch tries to assing a ID to each cgroup. Attach unique ID to each > cgroup and provides following functions. > > - cgroup_lookup(id) > returns cgroup of id. > - cgroup_get_next(id, root, foundid) > returns the next cgroup under "root" by scanning bitmap (not by tree-walk) > - cgroup_is_ancestor(cgroup, root) > returns true if "root" is ancestor of cgroup. > > There is several reasons to develop this. > > - While trying to implement hierarchy in memory cgroup, we had to > implement "walk under hierarchy" code. > Now it's consists of cgroup_lock and tree up-down code. > But taking "cgroup_lock" in walking tree can cause deadlocks. > cgroup_lock() is too big. Easier way is helpful. > > - SwapCgroup uses array of "pointer" to record the owner of swaps. > By ID, we can reduce this to "short" or "int". This means ID is > useful for reducing space consumption by pointer if the access cost > is not problem. > (I hear bio-cgroup will use the same kind of...) > > Example) OOM-Killer under hierarchy. > do { > rcu_read_lock(); > next = cgroup_get_next(id, root, nextid); > /* check sanity of next here */ > /* increment refcnt to "next" */ > rcu_read_unlock(); > if (!next) > break; It's not safe to use the cgroup returned by cgroup_get_next() outside rcu lock. --------------------------------------------- cgroup_rmdir() (check refcnt == 0) rcu_read_lock(); cgrp = cgroup_get_next() (inc refcnt) rcu_read_unlock(); cgroup_id_detach() cgroup_diput(); kfree(cgrp); use cgrp --------------------------------------------- I think the safe way is: rcu_read_lock(); cgrp = cgroup_get_next() if (!inc_not_zero(cgrp) { rcu_read_unlock(); return NULL; } rcu_read_unlock(); return cgrp; But it's also safe to use cgrp = list_entry(&parent->children.next) for the above scenario, seems you don't have to invent this cgroup_get_next(). > cgroup_scan_tasks(select_bad_process); > } while (1); > > > Characteristics: > - Each cgroup get new ID when created. > - cgroup ID contains "ID" and "Depth in tree" and hierarchy code. > - hierarchy code is array of IDs of ancestors. > > Consideration: > - gang lookup if necessary. > - I'd like to use "(unsigned)short" to cgroup_id for saving space... > - MAX_DEPTH is bad ? > > Signed-off-by: KAMEZAWA Hiroyuki > > include/linux/cgroup.h | 42 +++++++++++ > include/linux/idr.h | 1 > kernel/cgroup.c | 180 ++++++++++++++++++++++++++++++++++++++++++++++++- > lib/idr.c | 46 ++++++++++++ > 4 files changed, 266 insertions(+), 3 deletions(-) > > Index: mmotm-2.6.28-Nov24/include/linux/cgroup.h > =================================================================== > --- mmotm-2.6.28-Nov24.orig/include/linux/cgroup.h > +++ mmotm-2.6.28-Nov24/include/linux/cgroup.h > @@ -61,6 +61,20 @@ struct cgroup_subsys_state { > unsigned long flags; > }; > > +/* > + * Cgroup ID for *internal* identification and lookup. For user-land,"path" > + * of cgroup works well. > + */ > +#define MAX_CGROUP_DEPTH (10) > +struct cgroup_id { > + struct cgroup *myself; > + unsigned int id; > + atomic_t refcnt; > + unsigned int depth; > + unsigned int hierarchy_code[MAX_CGROUP_DEPTH]; > + struct list_head list; > +}; > + > /* bits in struct cgroup_subsys_state flags field */ > enum { > CSS_ROOT, /* This CSS is the root of the subsystem */ > @@ -145,6 +159,9 @@ struct cgroup { > int pids_use_count; > /* Length of the current tasks_pids array */ > int pids_length; > + > + /* Cgroup ID */ > + struct cgroup_id *id; > }; > > /* A css_set is a structure holding pointers to a set of > @@ -393,6 +410,31 @@ void cgroup_iter_end(struct cgroup *cgrp > int cgroup_scan_tasks(struct cgroup_scanner *scan); > int cgroup_attach_task(struct cgroup *, struct task_struct *); > > +/* > + * For supporting cgroup lookup and hierarchy management. > + * Giving Flat view of cgroup hierarchy rather than tree. > + */ > +/* An interface for usual lookup */ > +struct cgroup *cgroup_lookup(int id); > +/* get next cgroup under tree (for scan) */ > +struct cgroup *cgroup_get_next(int id, struct cgroup *root, int *foundid); > + > +static inline bool > +cgroup_is_ancestor(struct cgroup *cur, struct cgroup *root) > +{ > + struct cgroup_id *curid = cur->id; > + struct cgroup_id *rootid = root->id; > + > + return (curid->hierarchy_code[rootid->depth] = rootid->id); > +} > + > +static inline int cgroup_myid(struct cgroup *cgrp) > +{ > + return cgrp->id->id; > +} > + > +/* TODO: gang lookup should be */ > + > #else /* !CONFIG_CGROUPS */ > > static inline int cgroup_init_early(void) { return 0; } > Index: mmotm-2.6.28-Nov24/kernel/cgroup.c > =================================================================== > --- mmotm-2.6.28-Nov24.orig/kernel/cgroup.c > +++ mmotm-2.6.28-Nov24/kernel/cgroup.c > @@ -46,7 +46,7 @@ > #include > #include > #include > - > +#include > #include > > static DEFINE_MUTEX(cgroup_mutex); > @@ -547,6 +547,162 @@ void cgroup_unlock(void) > mutex_unlock(&cgroup_mutex); > } > > + > +/* > + * Cgroup ID and lookup functions. > + * cgid->myself pointer is safe under rcu_read_lock() because d_put() of > + * cgroup, which finally frees cgroup pointer, uses rcu_synchronize(). > + */ > + > +static DEFINE_IDR(cgroup_idr); > +static DEFINE_SPINLOCK(cgroup_idr_lock); > + > +static int cgroup_prepare_id(struct cgroup *parent, struct cgroup_id **id) > +{ > + struct cgroup_id *newid; > + int myid, error; > + > + /* check depth */ > + if (parent && parent->id->depth + 1 >= MAX_CGROUP_DEPTH) > + return -ENOSPC; > + newid = kzalloc(sizeof(*newid), GFP_KERNEL); > + if (!newid) > + return -ENOMEM; > + /* get id */ > + if (unlikely(!idr_pre_get(&cgroup_idr, GFP_KERNEL))) { > + error = -ENOMEM; > + goto err_out; > + } > + spin_lock_irq(&cgroup_idr_lock); > + error = idr_get_new(&cgroup_idr, newid, &myid); > + spin_unlock_irq(&cgroup_idr_lock); > + if (error) > + goto err_out; > + > + newid->id = myid; > + atomic_set(&newid->refcnt, 1); > + INIT_LIST_HEAD(&newid->list); > + *id = newid; > + return 0; > +err_out: > + kfree(newid); > + return error; > +} > + > +static void cgroup_id_put(struct cgroup_id *cgid) > +{ > + unsigned long flags; > + > + if (atomic_dec_and_test(&cgid->refcnt)) { > + spin_lock_irqsave(&cgroup_idr_lock, flags); > + idr_remove(&cgroup_idr, cgid->id); > + spin_unlock_irqrestore(&cgroup_idr_lock, flags); > + kfree(cgid); > + } > +} > + > +static void cgroup_id_attach(struct cgroup_id *cgid, > + struct cgroup *cg, struct cgroup *parent) > +{ > + if (parent) { > + struct cgroup_id *parent_id = parent->id; > + int i; > + > + cgid->depth = parent_id->depth + 1; > + /* Inherit hierarchy code from parent */ > + for (i = 0; i < cgid->depth; i++) { > + cgid->hierarchy_code[i] = > + parent_id->hierarchy_code[i]; > + cgid->hierarchy_code[cgid->depth] = cgid->id; > + } > + } else { > + cgid->depth = 0; > + cgid->hierarchy_code[0] = cgid->id; > + } > + rcu_assign_pointer(cgid->myself, cg); > + cg->id = cgid; > + > + return; > +} > + > +static void cgroup_id_detach(struct cgroup *cg) > +{ > + struct cgroup_id *id = cg->id; > + > + rcu_assign_pointer(id->myself, NULL); > + cgroup_id_put(id); > +} > + > +/** > + * cgroup_lookup - lookup cgroup by id > + * @id: the id of cgroup to be looked up > + * > + * Retruns pointer to cgroup if there is vald cgroup with id, NULL if not. > + * Should be called under rcu_read_lock() or cgroup_lock(). > + */ > + > +struct cgroup *cgroup_lookup(int id) > +{ > + struct cgroup *cgrp; > + struct cgroup_id *cgid; > + > + rcu_read_lock(); > + cgid = idr_find(&cgroup_idr, id); > + rcu_read_unlock(); > + > + if (unlikely(!cgid)) > + return NULL; > + > + cgrp = rcu_dereference(cgid->myself); > + if (unlikely(!cgrp || test_bit(CGRP_REMOVED, &cgrp->flags))) > + return NULL; > + > + return cgrp; > +} > + > +/** > + * cgroup_get_next - lookup next cgroup under specified hierarchy. > + * @id: current position of iteration. > + * @root: search tree under this. . > + * @foundid: position of found object. > + * > + * Search next cgroup under the specified hierarchy. If "cur" is NULL, > + * start from root cgroup. Called under rcu_read_lock() or cgroup_lock() > + * is necessary (to access a found cgroup.). > + */ > +struct cgroup *cgroup_get_next(int id, struct cgroup *root, int *foundid) > +{ > + struct cgroup *ret = NULL; > + struct cgroup_id *tmp, *rootid; > + int tmpid, depth, targetid; > + unsigned long flags; > + > + rootid = root->id; > + depth = rootid->depth; > + targetid = rootid->id; > + tmpid = id; > + > + do { > + /* scan next entry from bitmap(tree) */ > + spin_lock_irqsave(&cgroup_idr_lock, flags); > + tmp = idr_get_next(&cgroup_idr, &tmpid); > + spin_unlock_irqrestore(&cgroup_idr_lock, flags); > + > + if (tmp) { > + ret = rcu_dereference(tmp->myself); > + /* Sanity check and check hierarchy */ > + if (ret && !test_bit(CGRP_REMOVED, &ret->flags) > + && tmp->hierarchy_code[depth] == targetid) > + break; > + tmpid = tmpid + 1; > + } else > + tmpid = 0; > + } while (1); > + > + *foundid = tmpid; > + return ret; > +} > + > /* > * A couple of forward declarations required, due to cyclic reference loop: > * cgroup_mkdir -> cgroup_create -> cgroup_populate_dir -> > @@ -991,6 +1147,7 @@ static int cgroup_get_sb(struct file_sys > /* New superblock */ > struct cgroup *cgrp = &root->top_cgroup; > struct inode *inode; > + struct cgroup_id *newid; > int i; > > BUG_ON(sb->s_root != NULL); > @@ -998,6 +1155,11 @@ static int cgroup_get_sb(struct file_sys > ret = cgroup_get_rootdir(sb); > if (ret) > goto drop_new_super; > + > + ret = cgroup_prepare_id(NULL, &newid); > + if (ret) > + goto drop_new_super; > + > inode = sb->s_root->d_inode; > > mutex_lock(&inode->i_mutex); > @@ -1062,7 +1224,7 @@ static int cgroup_get_sb(struct file_sys > BUG_ON(!list_empty(&cgrp->sibling)); > BUG_ON(!list_empty(&cgrp->children)); > BUG_ON(root->number_of_cgroups != 1); > - > + cgroup_id_attach(newid, cgrp, NULL); > cgroup_populate_dir(cgrp); > mutex_unlock(&inode->i_mutex); > mutex_unlock(&cgroup_mutex); > @@ -1997,6 +2159,7 @@ int cgroup_scan_tasks(struct cgroup_scan > return 0; > } > > + > /* > * Stuff for reading the 'tasks' file. > * > @@ -2349,11 +2512,18 @@ static long cgroup_create(struct cgroup > int err = 0; > struct cgroup_subsys *ss; > struct super_block *sb = root->sb; > + struct cgroup_id *cgid = NULL; > > cgrp = kzalloc(sizeof(*cgrp), GFP_KERNEL); > if (!cgrp) > return -ENOMEM; > > + err = cgroup_prepare_id(parent, &cgid); > + if (err) { > + kfree(cgrp); > + return err; > + } > + > /* Grab a reference on the superblock so the hierarchy doesn't > * get deleted on unmount if there are child cgroups. This > * can be done outside cgroup_mutex, since the sb can't > @@ -2394,6 +2564,8 @@ static long cgroup_create(struct cgroup > err = cgroup_populate_dir(cgrp); > /* If err < 0, we have a half-filled directory - oh well ;) */ > > + cgroup_id_attach(cgid, cgrp, parent); > + > mutex_unlock(&cgroup_mutex); > mutex_unlock(&cgrp->dentry->d_inode->i_mutex); > > @@ -2415,7 +2587,7 @@ static long cgroup_create(struct cgroup > > /* Release the reference count that we took on the superblock */ > deactivate_super(sb); > - > + cgroup_id_put(cgid); > kfree(cgrp); > return err; > } > @@ -2494,6 +2666,8 @@ static int cgroup_rmdir(struct inode *un > return -EBUSY; > } > > + cgroup_id_detach(cgrp); > + > spin_lock(&release_list_lock); > set_bit(CGRP_REMOVED, &cgrp->flags); > if (!list_empty(&cgrp->release_list)) > Index: mmotm-2.6.28-Nov24/include/linux/idr.h > =================================================================== > --- mmotm-2.6.28-Nov24.orig/include/linux/idr.h > +++ mmotm-2.6.28-Nov24/include/linux/idr.h > @@ -106,6 +106,7 @@ int idr_get_new(struct idr *idp, void *p > int idr_get_new_above(struct idr *idp, void *ptr, int starting_id, int *id); > int idr_for_each(struct idr *idp, > int (*fn)(int id, void *p, void *data), void *data); > +void *idr_get_next(struct idr *idp, int *nextid); > void *idr_replace(struct idr *idp, void *ptr, int id); > void idr_remove(struct idr *idp, int id); > void idr_remove_all(struct idr *idp); > Index: mmotm-2.6.28-Nov24/lib/idr.c > =================================================================== > --- mmotm-2.6.28-Nov24.orig/lib/idr.c > +++ mmotm-2.6.28-Nov24/lib/idr.c > @@ -573,6 +573,52 @@ int idr_for_each(struct idr *idp, > EXPORT_SYMBOL(idr_for_each); > > /** > + * idr_get_next - lookup next opbject of id to given id. > + * @idp: idr handle > + * @id: pointer to lookup key > + * > + * Retruns pointer to registered object with id, which is next number to > + * given id. > + */ > + > +void *idr_get_next(struct idr *idp, int *nextidp) > +{ > + struct idr_layer *p, *pa[MAX_LEVEL]; > + struct idr_layer **paa = &pa[0]; > + int id = *nextidp; > + int n, max; > + > + /* find first ent */ > + n = idp->layers * IDR_BITS; > + max = 1 << n; > + p = rcu_dereference(idp->top); > + if (!p) > + return NULL; > + > + while (id < max) { > + while (n > 0 && p) { > + n -= IDR_BITS; > + *paa++ = p; > + p = rcu_dereference(p->ary[(id >> n) & IDR_MASK]); > + } > + > + if (p) { > + *nextidp = id; > + return p; > + } > + > + id += 1 << n; > + while (n < fls(id)) { > + n += IDR_BITS; > + p = *--paa; > + } > + } > + return NULL; > +} > + > + > + > +/** > * idr_replace - replace pointer for given id > * @idp: idr handle > * @ptr: pointer you want associated with the id > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/