Date: Thu, 11 Oct 2012 16:21:27 +0200
From: Michal Hocko
To: Glauber Costa
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton,
	Mel Gorman, Suleiman Souhlal, Tejun Heo, cgroups@vger.kernel.org,
	kamezawa.hiroyu@jp.fujitsu.com, Johannes Weiner, Greg Thelen,
	devel@openvz.org, Frederic Weisbecker
Subject: Re: [PATCH v4 12/14] execute the whole memcg freeing in free_worker
Message-ID: <20121011142127.GI29295@dhcp22.suse.cz>
References: <1349690780-15988-1-git-send-email-glommer@parallels.com>
	<1349690780-15988-13-git-send-email-glommer@parallels.com>
In-Reply-To: <1349690780-15988-13-git-send-email-glommer@parallels.com>

On Mon 08-10-12 14:06:18, Glauber Costa wrote:
> A lot of the initialization we do in mem_cgroup_create() is done with
> softirqs enabled. This includes grabbing a css id, which holds
> &ss->id_lock->rlock, and the per-zone trees, which hold
> rtpz->lock->rlock. All of those signal to the lockdep mechanism that
> those locks can be used in SOFTIRQ-ON-W context. This means that the
> freeing of the memcg structure must happen in a compatible context,
> otherwise we'll get a deadlock, like the one below, caught by lockdep:
>
>   [] free_accounted_pages+0x47/0x4c
>   [] free_task+0x31/0x5c
>   [] __put_task_struct+0xc2/0xdb
>   [] put_task_struct+0x1e/0x22
>   [] delayed_put_task_struct+0x7a/0x98
>   [] __rcu_process_callbacks+0x269/0x3df
>   [] rcu_process_callbacks+0x31/0x5b
>   [] __do_softirq+0x122/0x277
>
> This usage pattern could not be triggered before kmem came into play.
> With the introduction of kmem stack handling, it is possible that we
> call the last mem_cgroup_put() from the task destructor, which is run
> in an rcu callback. Such callbacks are run with softirqs disabled,
> leading to the offensive usage pattern.
>
> In general, we have few, if any, means to guarantee in which context
> the last memcg_put will happen. The best we can do is test it and try
> to make sure that no invalid-context releases are happening. But as
> we add more code to memcg, the possible interactions grow in number
> and expose more ways to get context conflicts. One thing to keep in
> mind is that part of the freeing process, such as vfree(), is already
> deferred to a worker, because it can only be called from process
> context.
>
> For the moment, the only two functions we really need moved away are:
>
> * free_css_id(), and
> * mem_cgroup_remove_from_trees().
>
> But because the latter accesses per-zone info,
> free_mem_cgroup_per_zone_info() needs to be moved as well. With that,
> we are left with the per-cpu stats only. Better move it all.
>
> Signed-off-by: Glauber Costa
> Tested-by: Greg Thelen
> CC: KAMEZAWA Hiroyuki
> CC: Michal Hocko
> CC: Johannes Weiner
> CC: Tejun Heo
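The deferral pattern the changelog relies on is easier to see in
isolation. Below is a minimal kernel-style sketch of it, not the patch
itself; the my_obj structure and its helper names are invented for
illustration. The last put may fire from an RCU callback, which runs
in softirq context, so the callback does nothing but queue a worker,
and the real teardown happens in process context:

#include <linux/atomic.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct my_obj {
	atomic_t refcnt;		 /* set to 1 at creation time */
	struct rcu_head rcu_freeing;
	struct work_struct work_freeing; /* memcg overlays these two */
};

/*
 * Stage 3: process context. Locks that lockdep saw taken as
 * SOFTIRQ-ON-W, vfree() and friends are all safe here.
 */
static void my_obj_free_work(struct work_struct *work)
{
	struct my_obj *obj = container_of(work, struct my_obj, work_freeing);

	kfree(obj);		/* stands in for the real teardown */
}

/* Stage 2: RCU softirq. Do nothing heavy here; punt to a worker. */
static void my_obj_free_rcu(struct rcu_head *head)
{
	struct my_obj *obj = container_of(head, struct my_obj, rcu_freeing);

	INIT_WORK(&obj->work_freeing, my_obj_free_work);
	schedule_work(&obj->work_freeing);
}

/* Stage 1: may be called from any context, including an RCU callback. */
static void my_obj_put(struct my_obj *obj)
{
	if (atomic_dec_and_test(&obj->refcnt))
		call_rcu(&obj->rcu_freeing, my_obj_free_rcu);
}

The RCU grace period guarantees that no lockless reader still holds a
pointer by the time the worker runs; the worker, in turn, guarantees
process context for the teardown.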
OK, it seems it is much easier this way.

Acked-by: Michal Hocko

> ---
>  mm/memcontrol.c | 66 +++++++++++++++++++++++++++++----------------------------
>  1 file changed, 34 insertions(+), 32 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2f92f89..c5215f1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5205,16 +5205,29 @@ out_free:
>  }
>  
>  /*
> - * Helpers for freeing a kmalloc()ed/vzalloc()ed mem_cgroup by RCU,
> - * but in process context. The work_freeing structure is overlaid
> - * on the rcu_freeing structure, which itself is overlaid on memsw.
> + * At destroying mem_cgroup, references from swap_cgroup can remain.
> + * (scanning all at force_empty is too costly...)
> + *
> + * Instead of clearing all references at force_empty, we remember
> + * the number of reference from swap_cgroup and free mem_cgroup when
> + * it goes down to 0.
> + *
> + * Removal of cgroup itself succeeds regardless of refs from swap.
>   */
> -static void free_work(struct work_struct *work)
> +
> +static void __mem_cgroup_free(struct mem_cgroup *memcg)
>  {
> -	struct mem_cgroup *memcg;
> +	int node;
>  	int size = sizeof(struct mem_cgroup);
>  
> -	memcg = container_of(work, struct mem_cgroup, work_freeing);
> +	mem_cgroup_remove_from_trees(memcg);
> +	free_css_id(&mem_cgroup_subsys, &memcg->css);
> +
> +	for_each_node(node)
> +		free_mem_cgroup_per_zone_info(memcg, node);
> +
> +	free_percpu(memcg->stat);
> +
>  	/*
>  	 * We need to make sure that (at least for now), the jump label
>  	 * destruction code runs outside of the cgroup lock. This is because
> @@ -5233,38 +5246,27 @@ static void free_work(struct work_struct *work)
>  	vfree(memcg);
>  }
>  
> -static void free_rcu(struct rcu_head *rcu_head)
> -{
> -	struct mem_cgroup *memcg;
> -
> -	memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
> -	INIT_WORK(&memcg->work_freeing, free_work);
> -	schedule_work(&memcg->work_freeing);
> -}
>  
>  /*
> - * At destroying mem_cgroup, references from swap_cgroup can remain.
> - * (scanning all at force_empty is too costly...)
> - *
> - * Instead of clearing all references at force_empty, we remember
> - * the number of reference from swap_cgroup and free mem_cgroup when
> - * it goes down to 0.
> - *
> - * Removal of cgroup itself succeeds regardless of refs from swap.
> + * Helpers for freeing a kmalloc()ed/vzalloc()ed mem_cgroup by RCU,
> + * but in process context. The work_freeing structure is overlaid
> + * on the rcu_freeing structure, which itself is overlaid on memsw.
>   */
> -
> -static void __mem_cgroup_free(struct mem_cgroup *memcg)
> +static void free_work(struct work_struct *work)
>  {
> -	int node;
> +	struct mem_cgroup *memcg;
>  
> -	mem_cgroup_remove_from_trees(memcg);
> -	free_css_id(&mem_cgroup_subsys, &memcg->css);
> +	memcg = container_of(work, struct mem_cgroup, work_freeing);
> +	__mem_cgroup_free(memcg);
> +}
>  
> -	for_each_node(node)
> -		free_mem_cgroup_per_zone_info(memcg, node);
> +static void free_rcu(struct rcu_head *rcu_head)
> +{
> +	struct mem_cgroup *memcg;
>  
> -	free_percpu(memcg->stat);
> -	call_rcu(&memcg->rcu_freeing, free_rcu);
> +	memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
> +	INIT_WORK(&memcg->work_freeing, free_work);
> +	schedule_work(&memcg->work_freeing);
>  }
>  
>  static void mem_cgroup_get(struct mem_cgroup *memcg)
> @@ -5276,7 +5278,7 @@ static void __mem_cgroup_put(struct mem_cgroup *memcg, int count)
>  {
>  	if (atomic_sub_and_test(count, &memcg->refcnt)) {
>  		struct mem_cgroup *parent = parent_mem_cgroup(memcg);
> -		__mem_cgroup_free(memcg);
> +		call_rcu(&memcg->rcu_freeing, free_rcu);
>  		if (parent)
>  			mem_cgroup_put(parent);
>  	}
> -- 
> 1.7.11.4

-- 
Michal Hocko
SUSE Labs
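For reference, the release path after this patch, read straight from
the diff above (all function names are the patch's own):

  mem_cgroup_put()
    -> call_rcu(&memcg->rcu_freeing, free_rcu)  once the last ref drops
    -> free_rcu()          RCU softirq: only queues the work item
    -> schedule_work(&memcg->work_freeing)
    -> free_work()         process context
    -> __mem_cgroup_free() css id, per-zone trees, per-cpu stats,
                           jump label teardown, kfree()/vfree()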