Date: Fri, 29 Sep 2017 10:04:34 +0100
From: Morten Rasmussen
To: Peter Zijlstra
Cc: mingo@kernel.org, linux-kernel@vger.kernel.org, tj@kernel.org,
	josef@toxicpanda.com, torvalds@linux-foundation.org,
	vincent.guittot@linaro.org, efault@gmx.de, pjt@google.com,
	clm@fb.com, dietmar.eggemann@arm.com, bsegall@google.com,
	yuyang.du@intel.com
Subject: Re: [PATCH -v2 03/18] sched/fair: Cure calc_cfs_shares() vs reweight_entity()
Message-ID: <20170929090434.GB962@e105550-lin.cambridge.arm.com>
References: <20170901132059.342024223@infradead.org>
	<20170901132748.141477819@infradead.org>
In-Reply-To: <20170901132748.141477819@infradead.org>

On Fri, Sep 01, 2017 at 03:21:02PM +0200, Peter Zijlstra wrote:
> Vincent reported that when running in a cgroup, his root
> cfs_rq->avg.load_avg dropped to 0 on task idle.
>
> This is because reweight_entity() will now immediately propagate the
> weight change of the group entity to its cfs_rq, and as it happens,
> our approximation (5) for calc_cfs_shares() results in 0 when the group
> is idle.
>
> Avoid this by using the correct (3) as a lower bound on (5). This way
> the empty cgroup will slowly decay instead of instantly drop to 0.
>
> Reported-by: Vincent Guittot
> Signed-off-by: Peter Zijlstra (Intel)
> ---
>  kernel/sched/fair.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2703,11 +2703,10 @@ static long calc_cfs_shares(struct cfs_r
>  	tg_shares = READ_ONCE(tg->shares);
>
>  	/*
> -	 * This really should be: cfs_rq->avg.load_avg, but instead we use
> -	 * cfs_rq->load.weight, which is its upper bound. This helps ramp up
> -	 * the shares for small weight interactive tasks.
> +	 * Because (5) drops to 0 when the cfs_rq is idle, we need to use (3)
> +	 * as a lower bound.
>  	 */
> -	load = scale_load_down(cfs_rq->load.weight);
> +	load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);

We use cfs_rq->tg_load_avg_contrib (the filtered version of
cfs_rq->avg.load_avg) instead of cfs_rq->avg.load_avg further down, so I
think we should use it here too for consistency:

+	load = max(scale_load_down(cfs_rq->load.weight),
+		   cfs_rq->tg_load_avg_contrib);

With this change (5) almost becomes (3):

                       tg->weight * max(grq->load.weight, grq->avg.load_avg)
ge->load.weight = -----------------------------------------------------------------------------    (5*)
                   tg->load_avg - grq->avg.load_avg + max(grq->load.weight, grq->avg.load_avg)

The difference is that we boost ge->load.weight if the grq has runnable
tasks with se->avg.load_avg < se->load.weight, i.e. tasks that
occasionally block. This means that the under-estimate scenario I
described in my reply to patch #2 is no longer possible. AFAICT, we are
now guaranteed to over-estimate ge->load.weight. It is still quite
sensitive to periodic high priority tasks though.
tg->weight = 1024
tg->load_avg = 2560
\Sum grq->load.weight = 2048

cpu                     0       1       \Sum

grq->avg.load_avg       1536    1024
grq->load.weight        1024    1024
load (max)              1536    1024

ge->load.weight (1)     512     512     1024    >= tg->weight
ge->load.weight (3)     614     410     1024    >= tg->weight
ge->load.weight (5)     512     410      922    <  tg->weight
ge->load.weight (5*)    614     410     1024    >= tg->weight
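
In case anyone wants to double-check the table, below is a quick
userspace sketch (not kernel code; the variable names and the use of
double arithmetic rounded to nearest, which is how the table appears to
be rounded, are mine) that reproduces the per-cpu and summed
ge->load.weight values for (1), (3), (5) and the proposed (5*):

#include <stdio.h>

int main(void)
{
	const double tg_weight = 1024.0, tg_load_avg = 2560.0;
	const double sum_grq_weight = 2048.0;
	const double grq_avg[2]    = { 1536.0, 1024.0 };	/* grq->avg.load_avg */
	const double grq_weight[2] = { 1024.0, 1024.0 };	/* grq->load.weight  */
	double sum1 = 0, sum3 = 0, sum5 = 0, sum5s = 0;

	for (int cpu = 0; cpu < 2; cpu++) {
		double w = grq_weight[cpu], avg = grq_avg[cpu];
		/* the proposed lower bound: max(grq->load.weight, grq->avg.load_avg) */
		double load = w > avg ? w : avg;

		/* (1)  tg->weight * grq->load.weight / \Sum grq->load.weight */
		double ge1  = tg_weight * w / sum_grq_weight;
		/* (3)  tg->weight * grq->avg.load_avg / tg->load_avg */
		double ge3  = tg_weight * avg / tg_load_avg;
		/* (5)  tg->weight * grq->load.weight /
		 *      (tg->load_avg - grq->avg.load_avg + grq->load.weight) */
		double ge5  = tg_weight * w / (tg_load_avg - avg + w);
		/* (5*) same as (5), but with the max() as numerator/denominator term */
		double ge5s = tg_weight * load / (tg_load_avg - avg + load);

		printf("cpu%d: (1)=%.0f (3)=%.0f (5)=%.0f (5*)=%.0f\n",
		       cpu, ge1, ge3, ge5, ge5s);
		sum1 += ge1; sum3 += ge3; sum5 += ge5; sum5s += ge5s;
	}
	printf("\\Sum: (1)=%.0f (3)=%.0f (5)=%.0f (5*)=%.0f\n",
	       sum1, sum3, sum5, sum5s);
	return 0;
}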