Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754370AbZF1XwO (ORCPT ); Sun, 28 Jun 2009 19:52:14 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752882AbZF1XwA (ORCPT ); Sun, 28 Jun 2009 19:52:00 -0400 Received: from fgwmail6.fujitsu.co.jp ([192.51.44.36]:44516 "EHLO fgwmail6.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750840AbZF1Xv7 (ORCPT ); Sun, 28 Jun 2009 19:51:59 -0400 Date: Mon, 29 Jun 2009 08:50:26 +0900 From: KAMEZAWA Hiroyuki To: Paul Menage Cc: "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , "nishimura@mxp.nes.nec.co.jp" , "balbir@linux.vnet.ibm.com" , "lizf@cn.fujitsu.com" , "akpm@linux-foundation.org" Subject: Re: [PATCH] memcg: cgroup fix rmdir hang Message-Id: <20090629085026.82e0674d.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <6599ad830906261316x52d6c115t720b87ba16b3617@mail.gmail.com> References: <20090623160720.36230fa2.kamezawa.hiroyu@jp.fujitsu.com> <20090626141020.849a081e.kamezawa.hiroyu@jp.fujitsu.com> <6599ad830906261316x52d6c115t720b87ba16b3617@mail.gmail.com> Organization: FUJITSU Co. LTD. X-Mailer: Sylpheed 2.5.0 (GTK+ 2.10.14; i686-pc-mingw32) Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 11361 Lines: 268 On Fri, 26 Jun 2009 13:16:23 -0700 Paul Menage wrote: > Hi Kamezawa, > > Sorry that I didn't get a chance to look at these patches before now. > > On Thu, Jun 25, 2009 at 10:10 PM, KAMEZAWA > Hiroyuki wrote: > > I hope this will be a final bullet.. > > I myself think this one is enough simple and good. > > I'm sorry that we need test again. > > > > == > > From: KAMEZAWA Hiroyuki > > > > After commit: cgroup: fix frequent -EBUSY at rmdir > >              ec64f51545fffbc4cb968f0cea56341a4b07e85a > > cgroup's rmdir (especially against memcg) doesn't return -EBUSY > > by temporal ref counts. That commit expects all refs after pre_destroy() > > is temporary but...it wasn't. Then, rmdir can wait permanently. > > This patch tries to fix that and change followings. > > > >  - set CGRP_WAIT_ON_RMDIR flag before pre_destroy(). > >  - clear CGRP_WAIT_ON_RMDIR flag when the subsys finds racy case. > >   if there are sleeping ones, wakes them up. > >  - rmdir() sleeps only when CGRP_WAIT_ON_RMDIR flag is set. > > > > Changelog v2->v3: > >  - removed retry_rmdir() callback. > >  - make use of CGRP_WAIT_ON_RMDIR flag more. > > > > Reported-by: Daisuke Nishimura > > Signed-off-by: KAMEZAWA Hiroyuki > > --- > >  include/linux/cgroup.h |   13 +++++++++++++ > >  kernel/cgroup.c        |   38 ++++++++++++++++++++++---------------- > >  mm/memcontrol.c        |   25 +++++++++++++++++++++++-- > >  3 files changed, 58 insertions(+), 18 deletions(-) > > > > Index: mmotm-2.6.31-Jun25/include/linux/cgroup.h > > =================================================================== > > --- mmotm-2.6.31-Jun25.orig/include/linux/cgroup.h > > +++ mmotm-2.6.31-Jun25/include/linux/cgroup.h > > @@ -366,6 +366,19 @@ int cgroup_task_count(const struct cgrou > >  int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task); > > > >  /* > > + * Allow to use CGRP_WAIT_ON_RMDIR flag to check race with rmdir() for subsys. > > + * Subsys can call this function if it's necessary to call pre_destroy() again > > + * because it adds not-temporary refs to css after or while pre_destroy(). > > + * The caller of this function should use css_tryget(), too. > > + */ > > +void __cgroup_wakeup_rmdir_waiters(void); > > +static inline void cgroup_wakeup_rmdir_waiters(struct cgroup *cgrp) > > +{ > > +       if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags))) > > +               __cgroup_wakeup_rmdir_waiters(); > > +} > > + > > +/* > >  * Control Group subsystem type. > >  * See Documentation/cgroups/cgroups.txt for details > >  */ > > Index: mmotm-2.6.31-Jun25/kernel/cgroup.c > > =================================================================== > > --- mmotm-2.6.31-Jun25.orig/kernel/cgroup.c > > +++ mmotm-2.6.31-Jun25/kernel/cgroup.c > > @@ -734,14 +734,13 @@ static void cgroup_d_remove_dir(struct d > >  * reference to css->refcnt. In general, this refcnt is expected to goes down > >  * to zero, soon. > >  * > > - * CGRP_WAIT_ON_RMDIR flag is modified under cgroup's inode->i_mutex; > > + * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex; > >  */ > >  DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq); > > > > -static void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp) > > +void __cgroup_wakeup_rmdir_waiters(void) > >  { > > Maybe we should name this wakeup_rmdir_waiter() to emphasise that fact > that there will only be one waiter (the thread doing the rmdir). > Hm, but, there is no guarantee that there will "an" waiter. > > -       if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags))) > > -               wake_up_all(&cgroup_rmdir_waitq); > > +       wake_up_all(&cgroup_rmdir_waitq); > >  } > > > >  static int rebind_subsystems(struct cgroupfs_root *root, > > @@ -2696,33 +2695,40 @@ again: > >        mutex_unlock(&cgroup_mutex); > > > >        /* > > +        * css_put/get is provided for subsys to grab refcnt to css. In typical > > +        * case, subsystem has no reference after pre_destroy(). But, under > > +        * hierarchy management, some *temporal* refcnt can be hold. > > This sentence needs improvement/clarification. (Yes, I know it's just > copied from later in the file, but it wasn't clear there either :-) ) > ok, I'll modify here. How about this ? == In general, subsystem has no css->refcnt after pre_destroy(). But in racy cases, subsystem may have to get css->refcnt after pre_destroy() and it makes rmdir return with -EBUSY. But we don't like frequent -EBUSY. To avoid that, we use waitqueue for cgroup's rmdir. CGROUP_WAIT_ON_RMDIR bit is for synchronizing rmdir waitqueue and subsystem behavior. Please see css_get()/put()/tryget() implementation, it allows subsystem to avoid unnecessary failure of rmdir(). If css_put()/get()/tryget() is not enough, cgroup_wakeup_rmdir_waiter() can be used. == > > > +        * To avoid returning -EBUSY to a user, waitqueue is used. If subsys > > +        * is really busy, it should return -EBUSY at pre_destroy(). wake_up > > +        * is called when css_put() is called and refcnt goes down to 0. > > +        * And this WAIT_ON_RMDIR flag is cleared when subsys detect a race > > +        * condition under pre_destroy()->rmdir. > > What exactly do you mean by pre_destroy()->rmdir ? > Ah, between pre_destroy() and "cgroup is removed" state. I'll remove this. Maybe above comment is enough. > > If flag is cleared, we need > > +        * to call pre_destroy(), again. > > +        */ > > +       set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags); > > + > > +       /* > >         * Call pre_destroy handlers of subsys. Notify subsystems > >         * that rmdir() request comes. > >         */ > >        ret = cgroup_call_pre_destroy(cgrp); > > -       if (ret) > > +       if (ret) { > > +               clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags); > >                return ret; > > +       } > > > >        mutex_lock(&cgroup_mutex); > >        parent = cgrp->parent; > >        if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) { > > +               clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags); > >                mutex_unlock(&cgroup_mutex); > >                return -EBUSY; > >        } > > -       /* > > -        * css_put/get is provided for subsys to grab refcnt to css. In typical > > -        * case, subsystem has no reference after pre_destroy(). But, under > > -        * hierarchy management, some *temporal* refcnt can be hold. > > -        * To avoid returning -EBUSY to a user, waitqueue is used. If subsys > > -        * is really busy, it should return -EBUSY at pre_destroy(). wake_up > > -        * is called when css_put() is called and refcnt goes down to 0. > > -        */ > > -       set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags); > >        prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE); > > - > >        if (!cgroup_clear_css_refs(cgrp)) { > >                mutex_unlock(&cgroup_mutex); > > -               schedule(); > > +               if (test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)) > > +                       schedule(); > > The need for this test_bit() should be commented, I think - at first > sight it doesn't seem necessary since if we've already been woken up, > then the schedule() is logically a no-op anyway. We only need it > because we are worried about the case where someone calls > wakeup_rmdir_waiters() prior to the prepare_to_wait() > > yes. will add comments. > > =================================================================== > > --- mmotm-2.6.31-Jun25.orig/mm/memcontrol.c > > +++ mmotm-2.6.31-Jun25/mm/memcontrol.c > > @@ -1234,6 +1234,12 @@ static int mem_cgroup_move_account(struc > >        ret = 0; > >  out: > >        unlock_page_cgroup(pc); > > +       /* > > +        * We charges against "to" which may not have any tasks. Then, "to" > > +        * can be under rmdir(). But in current implementation, caller of > > +        * this function is just force_empty() and it's garanteed that > > +        * "to" is never removed. So, we don't check rmdir status here. > > +        */ > >        return ret; > >  } > > > > @@ -1455,6 +1461,7 @@ __mem_cgroup_commit_charge_swapin(struct > >                return; > >        if (!ptr) > >                return; > > +       css_get(&ptr->css); > >        pc = lookup_page_cgroup(page); > >        mem_cgroup_lru_del_before_commit_swapcache(page); > >        __mem_cgroup_commit_charge(ptr, pc, ctype); > > @@ -1484,7 +1491,13 @@ __mem_cgroup_commit_charge_swapin(struct > >                } > >                rcu_read_unlock(); > >        } > > -       /* add this page(page_cgroup) to the LRU we want. */ > > +       /* > > +        * At swapin, we may charge account against cgroup which has no tasks. > > +        * So, rmdir()->pre_destroy() can be called while we do this charge. > > +        * In that case, we need to call pre_destroy() again. check it here. > > +        */ > > +       cgroup_wakeup_rmdir_waiters(ptr->css.cgroup); > > +       css_put(&ptr->css); > > > >  } > > > > @@ -1691,7 +1704,7 @@ void mem_cgroup_end_migration(struct mem > > > >        if (!mem) > >                return; > > - > > +       css_get(&mem->css); > >        /* at migration success, oldpage->mapping is NULL. */ > >        if (oldpage->mapping) { > >                target = oldpage; > > @@ -1731,6 +1744,14 @@ void mem_cgroup_end_migration(struct mem > >         */ > >        if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED) > >                mem_cgroup_uncharge_page(target); > > +       /* > > +        * At migration, we may charge account against cgroup which has no tasks > > +        * So, rmdir()->pre_destroy() can be called while we do this charge. > > +        * In that case, we need to call pre_destroy() again. check it here. > > +        */ > > +       cgroup_wakeup_rmdir_waiters(mem->css.cgroup); > > +       css_put(&mem->css); > > Having to do an extra get/put purely in order to seems unfortunate - > is that purely to force the cgroup_clear_css_refs() to fail? > yes. > Maybe we could wrap the get and the wakeup/put inside functions named > "cgroup_exclude_rmdir()" "cgroup_release_rmdir()" so that the mm > cgroup code is more self-explanatory. > Ah, it sounds nice idea. I'll prepare that as [2/2] patch. Thanks, -Kame > Paul > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/