Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755548AbYLJC7l (ORCPT ); Tue, 9 Dec 2008 21:59:41 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1754246AbYLJC7b (ORCPT ); Tue, 9 Dec 2008 21:59:31 -0500 Received: from fgwmail6.fujitsu.co.jp ([192.51.44.36]:49925 "EHLO fgwmail6.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754062AbYLJC7a (ORCPT ); Tue, 9 Dec 2008 21:59:30 -0500 Date: Wed, 10 Dec 2008 11:58:30 +0900 From: KAMEZAWA Hiroyuki To: Daisuke Nishimura Cc: "linux-mm@kvack.org" , "balbir@linux.vnet.ibm.com" , "lizf@cn.fujitsu.com" , "menage@google.com" , "kosaki.motohiro@jp.fujitsu.com" , "linux-kernel@vger.kernel.org" Subject: Re: [RFC][PATCH 1/6] memcg: fix pre_destory handler Message-Id: <20081210115830.dd13b90b.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20081210112815.d5098e9e.nishimura@mxp.nes.nec.co.jp> References: <20081209200213.0e2128c1.kamezawa.hiroyu@jp.fujitsu.com> <20081209200647.a1fa76a9.kamezawa.hiroyu@jp.fujitsu.com> <20081210112815.d5098e9e.nishimura@mxp.nes.nec.co.jp> Organization: FUJITSU Co. LTD. X-Mailer: Sylpheed 2.5.0 (GTK+ 2.10.14; i686-pc-mingw32) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 9166 Lines: 273 On Wed, 10 Dec 2008 11:28:15 +0900 Daisuke Nishimura wrote: > On Tue, 9 Dec 2008 20:06:47 +0900, KAMEZAWA Hiroyuki wrote: > > better name for new flag is welcome. > > > > == > > Because pre_destroy() handler is moved out to cgroup_lock() for > > avoiding dead-lock, now, cgroup's rmdir() does following sequence. > > > > cgroup_lock() > > check children and tasks. > > (A) > > cgroup_unlock() > > (B) > > pre_destroy() for subsys;-----(1) > > (C) > > cgroup_lock(); > > (D) > > Second check:check for -EBUSY again because we released the lock. > > (E) > > mark cgroup as removed. > > (F) > > unlink from lists. > > cgroup_unlock(); > > dput() > > => when dentry's refcnt goes down to 0 > > destroy() handers for subsys > > > > memcg marks itself as "obsolete" when pre_destroy() is called at (1) > > But rmdir() can fail after pre_destroy(). So marking "obsolete" is bug. > > I'd like to fix sanity of pre_destroy() in cgroup layer. > > > > Considering above sequence, new tasks can be added while > > (B) and (C) > > swap-in recored can be charged back to a cgroup after pre_destroy() > > at (C) and (D), (E) > > (means cgrp's refcnt not comes from task but from other persistent objects.) > > > > This patch adds "cgroup_is_being_removed()" check. (better name is welcome) > > After this, > > > > - cgroup is marked as CGRP_PRE_REMOVAL at (A) > > - If Second check fails, CGRP_PRE_REMOVAL flag is removed. > > - memcg's its own obsolete flag is removed. > > - While CGROUP_PRE_REMOVAL, task attach will fail by -EBUSY. > > (task attach via clone() will not hit the case.) > > > > By this, we can trust pre_restroy()'s result. > > > > > I agrree to the direction of this patch, but I think it would be better > to split this into cgroup and memcg part. > Hmm, but "showing usage" part is necessary for this kind of patches. > > Note: if CGRP_REMOVED can be set and cleared, it should be used instead of > > CGRP_PRE_REMOVAL. > > > > > > Signed-off-by: KAMEZAWA Hiroyuki > > > > --- > > include/linux/cgroup.h | 5 +++++ > > kernel/cgroup.c | 18 +++++++++++++++++- > > mm/memcontrol.c | 36 ++++++++++++++++++++++++++---------- > > 3 files changed, 48 insertions(+), 11 deletions(-) > > > > Index: mmotm-2.6.28-Dec08/mm/memcontrol.c > > =================================================================== > > --- mmotm-2.6.28-Dec08.orig/mm/memcontrol.c > > +++ mmotm-2.6.28-Dec08/mm/memcontrol.c > > @@ -166,7 +166,6 @@ struct mem_cgroup { > > */ > > bool use_hierarchy; > > unsigned long last_oom_jiffies; > > - int obsolete; > > atomic_t refcnt; > > > > unsigned int swappiness; > > @@ -211,6 +210,24 @@ pcg_default_flags[NR_CHARGE_TYPE] = { > > static void mem_cgroup_get(struct mem_cgroup *mem); > > static void mem_cgroup_put(struct mem_cgroup *mem); > > > > +static bool memcg_is_obsolete(struct mem_cgroup *mem) > > +{ > > + struct cgroup *cg = mem->css.cgroup; > > + /* > > + * "Being Removed" means pre_destroy() handler is called. > > + * After "pre_destroy" handler is called, memcg should not > > + * have any additional charges. > > + * This means there are small races for mis-accounting. But this > > + * mis-accounting should happen only under swap-in opration. > > + * (Attachin new task will fail if cgroup is under rmdir()). > > + */ > > + > > + if (!cg || cgroup_is_removed(cg) || cgroup_is_being_removed(cg)) > > + return true; > > + return false; > > +} > > + > > + > > static void mem_cgroup_charge_statistics(struct mem_cgroup *mem, > > struct page_cgroup *pc, > > bool charge) > > @@ -597,7 +614,7 @@ mem_cgroup_get_first_node(struct mem_cgr > > struct cgroup *cgroup; > > struct mem_cgroup *ret; > > bool obsolete = (root_mem->last_scanned_child && > > - root_mem->last_scanned_child->obsolete); > > + memcg_is_obsolete(root_mem->last_scanned_child); > > > > /* > > * Scan all children under the mem_cgroup mem > > @@ -1070,7 +1087,7 @@ int mem_cgroup_try_charge_swapin(struct > > ent.val = page_private(page); > > > > mem = lookup_swap_cgroup(ent); > > - if (!mem || mem->obsolete) > > + if (!mem || memcg_is_obsolete(mem)) > > goto charge_cur_mm; > > *ptr = mem; > > return __mem_cgroup_try_charge(NULL, mask, ptr, true); > > @@ -1104,7 +1121,7 @@ int mem_cgroup_cache_charge_swapin(struc > > ent.val = page_private(page); > > if (do_swap_account) { > > mem = lookup_swap_cgroup(ent); > > - if (mem && mem->obsolete) > > + if (mem && memcg_is_obsolete(mem)) > > mem = NULL; > > if (mem) > > mm = NULL; > > @@ -2050,9 +2067,6 @@ static struct mem_cgroup *mem_cgroup_all > > * the number of reference from swap_cgroup and free mem_cgroup when > > * it goes down to 0. > > * > > - * When mem_cgroup is destroyed, mem->obsolete will be set to 0 and > > - * entry which points to this memcg will be ignore at swapin. > > - * > > * Removal of cgroup itself succeeds regardless of refs from swap. > > */ > > > > @@ -2081,7 +2095,7 @@ static void mem_cgroup_get(struct mem_cg > > static void mem_cgroup_put(struct mem_cgroup *mem) > > { > > if (atomic_dec_and_test(&mem->refcnt)) { > > - if (!mem->obsolete) > > + if (!memcg_is_obsolete(mem)) > > return; > > mem_cgroup_free(mem); > > } > > @@ -2148,14 +2162,16 @@ static void mem_cgroup_pre_destroy(struc > > struct cgroup *cont) > > { > > struct mem_cgroup *mem = mem_cgroup_from_cont(cont); > > - mem->obsolete = 1; > > mem_cgroup_force_empty(mem, false); > > } > > > > static void mem_cgroup_destroy(struct cgroup_subsys *ss, > > struct cgroup *cont) > > { > > - mem_cgroup_free(mem_cgroup_from_cont(cont)); > > + struct mem_cgroup *mem = mem_cgroup_from_cont(cont): > > + mem_cgroup_free(mem); > > + /* forget */ > > + mem->css.cgroup = NULL; > > } > > > Is it OK to access "mem" while it may have been freed ? > no, will fix. > > static int mem_cgroup_populate(struct cgroup_subsys *ss, > > Index: mmotm-2.6.28-Dec08/include/linux/cgroup.h > > =================================================================== > > --- mmotm-2.6.28-Dec08.orig/include/linux/cgroup.h > > +++ mmotm-2.6.28-Dec08/include/linux/cgroup.h > > @@ -98,6 +98,8 @@ enum { > > CGRP_RELEASABLE, > > /* Control Group requires release notifications to userspace */ > > CGRP_NOTIFY_ON_RELEASE, > > + /* Control Group is preparing for death */ > > + CGRP_PRE_REMOVAL, > > }; > > > > struct cgroup { > > @@ -303,8 +305,11 @@ int cgroup_add_files(struct cgroup *cgrp > > const struct cftype cft[], > > int count); > > > > + > > int cgroup_is_removed(const struct cgroup *cgrp); > > > > +int cgroup_is_being_removed(const struct cgroup *cgrp); > > + > > int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen); > > > > int cgroup_task_count(const struct cgroup *cgrp); > > Index: mmotm-2.6.28-Dec08/kernel/cgroup.c > > =================================================================== > > --- mmotm-2.6.28-Dec08.orig/kernel/cgroup.c > > +++ mmotm-2.6.28-Dec08/kernel/cgroup.c > > @@ -123,6 +123,11 @@ inline int cgroup_is_removed(const struc > > return test_bit(CGRP_REMOVED, &cgrp->flags); > > } > > > > +inline int cgroup_is_being_removed(const struct cgroup *cgrp) > > +{ > > + return test_bit(CGRP_PRE_REMOVAL, &cgrp->flags); > > +} > > + > > /* bits in struct cgroupfs_root flags field */ > > enum { > > ROOT_NOPREFIX, /* mounted subsystems have no named prefix */ > > @@ -1217,6 +1222,13 @@ int cgroup_attach_task(struct cgroup *cg > > if (cgrp == oldcgrp) > > return 0; > > > > + /* > > + * This cgroup is under rmdir() operation. Never fails here when this > > + * is called from clone(). > > + */ > I don't think clone() calls cgroup_attach_task(). > Do you mean cgroup_clone() ? > (I'm sorry that I'm not good at ns_cgroup.) > yes, cgroup_clone(). I'll update. > > + if (cgroup_is_being_removed(cgrp)) > > + return -EBUSY; > > + > > for_each_subsys(root, ss) { > > if (ss->can_attach) { > > retval = ss->can_attach(ss, cgrp, tsk); > > @@ -2469,12 +2481,14 @@ static int cgroup_rmdir(struct inode *un > > mutex_unlock(&cgroup_mutex); > > return -EBUSY; > > } > > - mutex_unlock(&cgroup_mutex); > > > > /* > > * Call pre_destroy handlers of subsys. Notify subsystems > > * that rmdir() request comes. > > */ > > + set_bit(CGRP_PRE_REMOVAL, &cgrp->flags); > > + mutex_unlock(&cgroup_mutex); > > + > > cgroup_call_pre_destroy(cgrp); > > > Is there any case where pre_destory is called simultaneusly ? > I can't catch what is your concern. AFAIK, vfs_rmdir() is done under mutex to dentry == mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT); ... vfs_rmdir() == And calls to cgroup_rmdir() will be serialized. Thanks, -Kame -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/