From: Vincent Guittot
Date: Wed, 10 Nov 2021 16:17:14 +0100
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's
To: Michal Koutný
Cc: Peter Zijlstra, Mathias Krause, Benjamin Segall, Ingo Molnar, Juri Lelli,
    Dietmar Eggemann, Steven Rostedt, Mel Gorman, Daniel Bristot de Oliveira,
    Valentin Schneider, linux-kernel@vger.kernel.org, Odin Ugedal, Kevin Tanguy,
    Brad Spengler, cgroups@vger.kernel.org
In-Reply-To: <20211109184744.GA31882@blackbody.suse.cz>
References: <8f4ed996-e6e5-75f4-b5fa-dffb7b7da05b@grsecurity.net>
    <20211105162914.215420-1-minipli@grsecurity.net>
    <20211106104854.GU174703@worktop.programming.kicks-ass.net>
    <9e8b2c49-2a10-2b34-e644-2b99708080bc@grsecurity.net>
    <20211109184744.GA31882@blackbody.suse.cz>
charset="UTF-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 9 Nov 2021 at 19:47, Michal Koutn=C3=BD wrote: > > Hello. > > Let me add a little summary (and I'm CCing cgroups ML too). > > The cgroup API allows following callbacks (simplified explanation): > > - .css_offline called after killing main reference > - .css_released called when css.refcnt hits zero > - .css_free called after css_released and RCU GP > > (The hidden catch is that they're not called directly but through workque= ues, > see css_free_work_fn() for details. .css_free() is queued after RCU GP th= ough.) > > How is this currently used in the cpu controller: > > - .css_offline not used > - .css_released sched_offline_group / unregister_fair_sched_grou= p > - .css_free sched_free_group / free_fair_sched_group > > > On Mon, Nov 08, 2021 at 12:40:19PM +0100, Peter Zijlstra wrote: > > Something like so good? > > Thanks for putting this together, in short I understand it reshuffles as = follows: > > .css_offline not used > .css_released cpu_cgroup_css_released > sched_released_group(tg) > // unlinking from tg tree > .css_free cpu_cgroup_css_free > sched_unregister_group(tg) > unregister_fair_sched_group(tg) > for_each(cpu) > remove_entity_load_avg(cfs_rq) > list_del_leaf_cfs_rq(cfs_rq) (1) > destroy_cfs_bandwidth(tg) (2) > call_rcu(sched_free_group_rcu) (3) > [rcu] sched_free_group > > I see two following issues with this: > > The cfs_bandwidth.period_timer is still active between (1) and (2) and ca= n be > fired and tg_unthrottle_up() may add a cfs_rq back to the leaf list (subs= equent > GP won't help). > (Admittedly, this window is shorter than the original window between > cpu_cgroup_css_released() and cpu_cgroup_css_released().) > > Having another call_rcu() at (3) seems overly complex given the cgroup AP= I > provides a grace period for free (pun intended :-). > > > Therefore, symbolically, the symbolic ordering should be: > > .css_offline not used > .css_released cpu_cgroup_css_released > sched_unregister_group(tg) > unregister_fair_sched_group(tg) > destroy_cfs_bandwidth(tg) > for_each(cpu) > remove_entity_load_avg(cfs_rq) > list_del_leaf_cfs_rq(cfs_rq) As we are not unlinked from tg tree, parent can walk_tg_tree_from and add it back > sched_released_group(tg) > // unlinking from tg tree > > .css_free cpu_cgroup_css_free > sched_free_group > > That is, the destroy_cfs_bandwidth() is called first to make sure the > tg_unthrottle_up() won't undo some of the cleanups and the respective > structures are only freed after RCU grace period. > > (Considering Vincent's remark, remove_entity_load_avg() and > list_del_leaf_cfs_rq() should be swapped but I'd keep that for a differen= t > patch.) > > To continue the discussion, the suggestion above in a form of diff (I str= ipped > the commit message for now). 
>
> --8<--
> diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c
> index 2067080bb235..8629b37d118e 100644
> --- a/kernel/sched/autogroup.c
> +++ b/kernel/sched/autogroup.c
> @@ -31,7 +31,7 @@ static inline void autogroup_destroy(struct kref *kref)
>         ag->tg->rt_se = NULL;
>         ag->tg->rt_rq = NULL;
>  #endif
> -       sched_offline_group(ag->tg);
> +       sched_release_group(ag->tg);
>         sched_destroy_group(ag->tg);
>  }
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 523fd602ea90..a5ca3ae6837b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9771,12 +9771,12 @@ void sched_destroy_group(struct task_group *tg)
>         call_rcu(&tg->rcu, sched_free_group_rcu);
>  }
>
> -void sched_offline_group(struct task_group *tg)
> +void sched_release_group(struct task_group *tg)
>  {
>         unsigned long flags;
>
> -       /* End participation in shares distribution: */
>         unregister_fair_sched_group(tg);
> +       unregister_rt_sched_group(tg);
>
>         spin_lock_irqsave(&task_group_lock, flags);
>         list_del_rcu(&tg->list);
> @@ -9896,7 +9896,7 @@ static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
>  {
>         struct task_group *tg = css_tg(css);
>
> -       sched_offline_group(tg);
> +       sched_release_group(tg);
>  }
>
>  static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 13950beb01a2..6e476f6d9435 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11456,8 +11456,6 @@ void free_fair_sched_group(struct task_group *tg)
>  {
>         int i;
>
> -       destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
> -
>         for_each_possible_cpu(i) {
>                 if (tg->cfs_rq)
>                         kfree(tg->cfs_rq[i]);
> @@ -11534,6 +11532,8 @@ void unregister_fair_sched_group(struct task_group *tg)
>         struct rq *rq;
>         int cpu;
>
> +       destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
> +
>         for_each_possible_cpu(cpu) {
>                 if (tg->se[cpu])
>                         remove_entity_load_avg(tg->se[cpu]);
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index bb945f8faeca..b48baaba2fc2 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -137,13 +137,17 @@ static inline struct rq *rq_of_rt_se(struct sched_rt_entity *rt_se)
>         return rt_rq->rq;
>  }
>
> -void free_rt_sched_group(struct task_group *tg)
> +void unregister_rt_sched_group(struct task_group *tg)
>  {
> -       int i;
> -
>         if (tg->rt_se)
>                 destroy_rt_bandwidth(&tg->rt_bandwidth);
>
> +}
> +
> +void free_rt_sched_group(struct task_group *tg)
> +{
> +       int i;
> +
>         for_each_possible_cpu(i) {
>                 if (tg->rt_rq)
>                         kfree(tg->rt_rq[i]);
> @@ -250,6 +254,8 @@ static inline struct rt_rq *rt_rq_of_se(struct sched_rt_entity *rt_se)
>         return &rq->rt;
>  }
>
> +void unregister_rt_sched_group(struct task_group *tg) { }
> +
>  void free_rt_sched_group(struct task_group *tg) { }
>
>  int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 7f1612d26c18..0e66749486e7 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -488,6 +488,7 @@ extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
>  extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
>  extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
>
> +extern void unregister_rt_sched_group(struct task_group *tg);
>  extern void free_rt_sched_group(struct task_group *tg);
>  extern int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent);
>  extern void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
> @@ -503,7 +504,7 @@ extern struct task_group *sched_create_group(struct task_group *parent);
>  extern void sched_online_group(struct task_group *tg,
>                                 struct task_group *parent);
>  extern void sched_destroy_group(struct task_group *tg);
> -extern void sched_offline_group(struct task_group *tg);
> +extern void sched_release_group(struct task_group *tg);
>
>  extern void sched_move_task(struct task_struct *tsk);
>
>
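For reference, here is a rough, abbreviated sketch (not the exact kernel code)
of how the two cpu-controller css callbacks would read with the ordering
summarized above and the diff applied; the helper names (css_tg(),
sched_release_group(), sched_free_group()) are the ones used in the
discussion, and the bodies are reduced to the single call each one makes:

static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
{
	struct task_group *tg = css_tg(css);

	/*
	 * Runs once css.refcnt hits zero: destroy the cfs bandwidth
	 * timer, unlink the per-cpu cfs_rq's, and only then drop tg
	 * from the task_group tree (ordering discussed above).
	 */
	sched_release_group(tg);
}

static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
{
	struct task_group *tg = css_tg(css);

	/*
	 * The cgroup core queues .css_free only after an RCU grace
	 * period following .css_released, so the structures can be
	 * freed here without an extra call_rcu() in the scheduler.
	 */
	sched_free_group(tg);
}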