Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S966806AbcCPOYT (ORCPT ); Wed, 16 Mar 2016 10:24:19 -0400 Received: from mail-pf0-f170.google.com ([209.85.192.170]:36555 "EHLO mail-pf0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932846AbcCPOYR (ORCPT ); Wed, 16 Mar 2016 10:24:17 -0400 Date: Wed, 16 Mar 2016 07:24:14 -0700 From: Tejun Heo To: Peter Zijlstra Cc: Kazuki Yamaguchi , Niklas Cassel , linux-kernel@vger.kernel.org Subject: Re: [BUG] sched: leaf_cfs_rq_list use after free Message-ID: <20160316142414.GA6980@mtj.duckdns.org> References: <20216ece-a75c-e3cf-4bae-ccbcf5694e9f@rhe.jp> <20160314112057.GT6356@twins.programming.kicks-ass.net> <20160314120903.GP6375@twins.programming.kicks-ass.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160314120903.GP6375@twins.programming.kicks-ass.net> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4793 Lines: 130 Hello, Peter. Sorry about the delay. On Mon, Mar 14, 2016 at 01:09:03PM +0100, Peter Zijlstra wrote: > On Mon, Mar 14, 2016 at 12:20:57PM +0100, Peter Zijlstra wrote: > > So I would suggest TJ to revert that patch and queue it for stable. > > > > It it clearly borken, because cgroup_exit() is called from preemptible > > context, so _obviously_ we can (and clearly will) schedule after that, > > which is somewhat hard if the cgroup we're supposedly belonging to has > > been torn to shreds in the meantime. > > And I think it might be fundamentally broken, because it leaves ->css > set to whatever cgroup we had, while simultaneously allowing that css to > go away. So, the problem here is that cpu is using css_offline to tear down the css. perf shouldn't have the same problem as it destroys its css from css_free. The distinctions among different callbacks evolved over time and we need to update the documentation but here are the rules. css_alloc() This one is obvious. Alloc and init. The css will become visible during css iteration sometime between alloc and online. css_online() The css is now guaranteed to be visible for css_for_each*() iterations. This distinction exists because some controllers need to propagate state changes downwards requiring a new css to become visible before it inherits the current state from the parent. Conversely, there's no reason to use this callback if there's no such requirement. Ex: Freezer which propagates the target state downwards and needs a new child to inherit the current state while iteratable. css_offline() The css is going down and draining usages by refusing new css_tryget_online()'s but still guaranteed to be visible during css iterations. Controllers which explicitly needs to put, say, cache references or need to perform cleanup operations while the css is iteratable need to use this method; otherwise, no reason to bother with it. Ex: blkcg pins csses as part of lookup cache which can prevent a css from being drained and released, so blkcg uses this call back to disable caching for the css. css_released() The css is drained and not visible during iteration and will be freed after a RCU grace period. Used by controllers which cache pointers to csses being drained. Ex: memcg needs to iterate csses which are being drained and its custom iterator implementation caches RCU protected pointers to csses. memcg uses this callback to shoot down those cached pointers. css_free() The css is drained, not visible to iterations, not used by anyone, and will be freed immediately. So, controllers which don't have persistent states or synchronization requirements around state propagation have no reason to bother with all the rest. css_alloc() and css_free() are the right callbacks to init and tear down csses. If a controller has specific needs to propagate states, only the part of operations which are affected should be in the respective specialized init/exit methods. > This means that anything trying to use this pointer; and there's quite a > lot of that; is in for a nasty surprise. > > So you really need to change the ->css, either when the task starts > dying (like it used to), or otherwise right before the cgroup goes > offline. So, the rule is that the css should stay serviceable until everyone depending on it is gone. > The argument used was that people want to see resources consumed by > Zombies, which is all fine and dandy, but when you destroy the cgroup > you cannot see that anyway. > > So something needs to fundamentally ensure that ->css changes before we > go offline the thing. I could be mistaken but AFAICS there doesn't seem to be anything requiring bothering with the more specialized exit methods. Given that no css iteration is used and everything is lock protected, the patch at the end of this messages should do and seems to work fine here. Am I missing something? Thanks. diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 0f5abc6..98d019d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -8332,14 +8332,8 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css) { struct task_group *tg = css_tg(css); - sched_destroy_group(tg); -} - -static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css) -{ - struct task_group *tg = css_tg(css); - sched_offline_group(tg); + sched_destroy_group(tg); } static void cpu_cgroup_fork(struct task_struct *task) @@ -8701,7 +8695,6 @@ struct cgroup_subsys cpu_cgrp_subsys = { .css_alloc = cpu_cgroup_css_alloc, .css_free = cpu_cgroup_css_free, .css_online = cpu_cgroup_css_online, - .css_offline = cpu_cgroup_css_offline, .fork = cpu_cgroup_fork, .can_attach = cpu_cgroup_can_attach, .attach = cpu_cgroup_attach,