Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S935173AbcCPPXP (ORCPT ); Wed, 16 Mar 2016 11:23:15 -0400 Received: from casper.infradead.org ([85.118.1.10]:32939 "EHLO casper.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S935013AbcCPPXN (ORCPT ); Wed, 16 Mar 2016 11:23:13 -0400 Date: Wed, 16 Mar 2016 16:22:45 +0100 From: Peter Zijlstra To: Tejun Heo Cc: Kazuki Yamaguchi , Niklas Cassel , linux-kernel@vger.kernel.org Subject: Re: [BUG] sched: leaf_cfs_rq_list use after free Message-ID: <20160316152245.GY6344@twins.programming.kicks-ass.net> References: <20216ece-a75c-e3cf-4bae-ccbcf5694e9f@rhe.jp> <20160314112057.GT6356@twins.programming.kicks-ass.net> <20160314120903.GP6375@twins.programming.kicks-ass.net> <20160316142414.GA6980@mtj.duckdns.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160316142414.GA6980@mtj.duckdns.org> User-Agent: Mutt/1.5.21 (2012-12-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 7422 Lines: 210 On Wed, Mar 16, 2016 at 07:24:14AM -0700, Tejun Heo wrote: > So, the problem here is that cpu is using css_offline to tear down the > css. perf shouldn't have the same problem as it destroys its css from > css_free. The distinctions among different callbacks evolved over > time and we need to update the documentation but here are the rules. > > css_alloc() > > This one is obvious. Alloc and init. The css will become > visible during css iteration sometime between alloc and > online. > > css_online() > > The css is now guaranteed to be visible for css_for_each*() > iterations. This distinction exists because some controllers > need to propagate state changes downwards requiring a new css > to become visible before it inherits the current state from > the parent. Conversely, there's no reason to use this > callback if there's no such requirement. > > Ex: Freezer which propagates the target state downwards and > needs a new child to inherit the current state while > iteratable. So it looks like sched uses css_online() for no particular reason either, I've moved all that into css_alloc(). > > css_offline() > > The css is going down and draining usages by refusing new > css_tryget_online()'s but still guaranteed to be visible > during css iterations. Controllers which explicitly needs to > put, say, cache references or need to perform cleanup > operations while the css is iteratable need to use this > method; otherwise, no reason to bother with it. > > Ex: blkcg pins csses as part of lookup cache which can prevent > a css from being drained and released, so blkcg uses this call > back to disable caching for the css. > > css_released() > > The css is drained and not visible during iteration and will > be freed after a RCU grace period. Used by controllers which > cache pointers to csses being drained. > > Ex: memcg needs to iterate csses which are being drained and > its custom iterator implementation caches RCU protected > pointers to csses. memcg uses this callback to shoot down > those cached pointers. > > css_free() > > The css is drained, not visible to iterations, not used by > anyone, and will be freed immediately. None of that speaks of where Zombies live, am I to infer that Zombies pass css_offline() but stall css_released() ? I don't particularly care about iterating css bits, but I do need my parent group to still exist, this is now also guaranteed for css_release(), right? The above documentation also doesn't mention this; in particular I require that css_release() for any group is not called before the css_release() of any child group. The reason I'm interested in css_release() is that there is an RCU grace period between that and css_free(), so I can kill my call_rcu() from css_free(). > > So something needs to fundamentally ensure that ->css changes before we > > go offline the thing. > > I could be mistaken but AFAICS there doesn't seem to be anything > requiring bothering with the more specialized exit methods. Given > that no css iteration is used and everything is lock protected, the > patch at the end of this messages should do and seems to work fine > here. Am I missing something? Some additional cleanups maybe.. but yes, that seems to work. Thanks! --- Subject: sched: Fix/cleanup cgroup teardown/init The cpu controller hasn't kept up with the various changes in the whole cgroup initialization / destruction sequence, and commit 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups") caused it to explode. The reason for this is that zombies do not inhibit css_offline() from being called, but do stall css_released(). Now we tear down the cfs_rq structures on css_offline() but zombies can run after that, leading to use-after-free issues. The solution is to move the tear-down to css_released(), which guarantees nobody (including no zombies) is still using our cgroup. Furthermore, a few simple cleanups are possible too. There doesn't appear to be any point to us using css_online() (anymore?) so fold that in css_alloc(). And since cgroup code guarantees an RCU grace period between css_released() and css_free() we can forgo using call_rcu() and free the stuff immediately. Cc: stable@vger.kernel.org Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups") Suggested-by: Tejun Heo Signed-off-by: Peter Zijlstra (Intel) --- kernel/sched/core.c | 35 ++++++++++++++--------------------- 1 file changed, 14 insertions(+), 21 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 04a04e0378fe..bc5076fd0167 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7538,7 +7538,7 @@ void set_curr_task(int cpu, struct task_struct *p) /* task_group_lock serializes the addition/removal of task groups */ static DEFINE_SPINLOCK(task_group_lock); -static void free_sched_group(struct task_group *tg) +static void sched_free_group(struct task_group *tg) { free_fair_sched_group(tg); free_rt_sched_group(tg); @@ -7564,7 +7564,7 @@ struct task_group *sched_create_group(struct task_group *parent) return tg; err: - free_sched_group(tg); + sched_free_group(tg); return ERR_PTR(-ENOMEM); } @@ -7584,17 +7584,16 @@ void sched_online_group(struct task_group *tg, struct task_group *parent) } /* rcu callback to free various structures associated with a task group */ -static void free_sched_group_rcu(struct rcu_head *rhp) +static void sched_free_group_rcu(struct rcu_head *rhp) { /* now it should be safe to free those cfs_rqs */ - free_sched_group(container_of(rhp, struct task_group, rcu)); + sched_free_group(container_of(rhp, struct task_group, rcu)); } -/* Destroy runqueue etc associated with a task group */ void sched_destroy_group(struct task_group *tg) { /* wait for possible concurrent references to cfs_rqs complete */ - call_rcu(&tg->rcu, free_sched_group_rcu); + call_rcu(&tg->rcu, sched_free_group_rcu); } void sched_offline_group(struct task_group *tg) @@ -8053,31 +8052,26 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) if (IS_ERR(tg)) return ERR_PTR(-ENOMEM); + sched_online_group(tg, parent); + return &tg->css; } -static int cpu_cgroup_css_online(struct cgroup_subsys_state *css) +static void cpu_cgroup_css_released(struct cgroup_subsys_state *css) { struct task_group *tg = css_tg(css); - struct task_group *parent = css_tg(css->parent); - if (parent) - sched_online_group(tg, parent); - return 0; + sched_offline_group(tg); } static void cpu_cgroup_css_free(struct cgroup_subsys_state *css) { struct task_group *tg = css_tg(css); - sched_destroy_group(tg); -} - -static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css) -{ - struct task_group *tg = css_tg(css); - - sched_offline_group(tg); + /* + * Relies on the RCU grace period between css_released() and this. + */ + sched_free_group(tg); } static void cpu_cgroup_fork(struct task_struct *task) @@ -8437,9 +8431,8 @@ static struct cftype cpu_files[] = { struct cgroup_subsys cpu_cgrp_subsys = { .css_alloc = cpu_cgroup_css_alloc, + .css_released = cpu_cgroup_css_released, .css_free = cpu_cgroup_css_free, - .css_online = cpu_cgroup_css_online, - .css_offline = cpu_cgroup_css_offline, .fork = cpu_cgroup_fork, .can_attach = cpu_cgroup_can_attach, .attach = cpu_cgroup_attach,