Date: Tue, 5 Apr 2016 10:13:29 +0100
From: Morten Rasmussen
To: Leo Yan
Cc: Steve Muckle, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Vincent Guittot, linux-kernel@vger.kernel.org, eas-dev@lists.linaro.org
Subject: Re: [PATCH RFC] sched/fair: let cpu's cfs_rq to reflect task migration
Message-ID: <20160405091328.GD18516@e105550-lin.cambridge.arm.com>
In-Reply-To: <20160405065644.GA29778@leoy-linaro>

On Tue, Apr 05, 2016 at 02:56:44PM +0800, Leo Yan wrote:
> On Mon, Apr 04, 2016 at 09:48:23AM +0100, Morten Rasmussen wrote:
> > On Sat, Apr 02, 2016 at 03:11:54PM +0800, Leo Yan wrote:
> > > On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote:
> > > > I think I follow - Leo please correct me if I mangle your intentions.
> > > > It's an issue that Morten and Dietmar had mentioned to me as well.
> >
> > Yes. We have been working on this issue for a while without getting to a
> > nice solution yet.
>
> Good to know this. This patch is mainly for discussion purposes.
>
> [...]
>
> > > > Leo I noticed you did not modify detach_entity_load_average(). I think
> > > > this would be needed to avoid the task's stats being double counted for
> > > > a while after switched_from_fair() or task_move_group_fair().
> >
> > I'm afraid that the solution to the problem is more complicated than
> > that :-(
> >
> > You are adding/removing a contribution from the root cfs_rq.avg which
> > isn't part of the signal in the first place. The root cfs_rq.avg only
> > contains the sum of the load/util of the sched_entities on the cfs_rq.
> > If you remove the contribution of the tasks from there you may end up
> > double-accounting for the task migration: once due to your patch, and
> > then again slowly over time as the group sched_entity starts reflecting
> > that the task has migrated. Furthermore, for group scheduling to make
> > sense it has to be the task_h_load() you add/remove, otherwise the
> > group weighting is completely lost. Or am I completely misreading your
> > patch?
>
> Here is one thing I want to confirm first: though CFS maintains the task
> group hierarchy, it updates the task group's cfs_rq.avg and the root
> cfs_rq.avg independently rather than accounting for them across the
> hierarchy.
>
> So currently CFS decreases the group's cfs_rq.avg on task migration, but
> it doesn't iterate up the task group hierarchy to the root cfs_rq.avg. I
> don't understand the second accounting you mentioned by "then again
> slowly over time as the group sched_entity starts reflecting that the
> task has migrated."

The problem is that there is no direct link between a group sched_entity's
se->avg and se->my_q.avg. The latter is the sum of the PELT load/util of
the sched_entities (tasks or nested groups) on the group cfs_rq, while the
former is the PELT load/util of the group entity itself, which is not
based on the cfs_rq sum, but basically just tracks whether that group
entity has been running/runnable or not, weighted by the group load code
which updates the weight occasionally.

In other words, we do go up/down the hierarchy when tasks migrate, but we
only update se->my_q.avg (the cfs_rq), not se->avg, which is the load of
the group as seen by the parent cfs_rq. So the immediate update of the
group cfs_rq.avg where the task sched_entity is enqueued/dequeued doesn't
trickle through the hierarchy instantaneously.
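
To make the asymmetry concrete, here is a minimal sketch (hypothetical
names, simplified structs loosely modelled on the fields discussed above,
not the actual fair.c code) of what the detach side of a migration
touches:

/*
 * Simplified illustration only: a group entity's own avg is not a sum of
 * its cfs_rq, so removing a task only adjusts the cfs_rq it was queued on.
 */
struct sched_avg {
	unsigned long load_avg;		/* runnable based, priority weighted */
	unsigned long util_avg;		/* running time based */
};

struct cfs_rq;

struct sched_entity {
	struct sched_avg avg;	/* contribution seen by the parent cfs_rq */
	struct cfs_rq *my_q;	/* owned cfs_rq if this is a group entity */
};

struct cfs_rq {
	struct sched_avg avg;	/* sum of the entities queued here */
};

static void detach_task_load_sketch(struct cfs_rq *cfs_rq,
				    struct sched_entity *task_se)
{
	/* Only the cfs_rq the task was queued on is updated immediately. */
	cfs_rq->avg.load_avg -= task_se->avg.load_avg;
	cfs_rq->avg.util_avg -= task_se->avg.util_avg;

	/*
	 * Nothing is subtracted from the group entity that owns cfs_rq
	 * (the se with se->my_q == cfs_rq) or from any cfs_rq above it;
	 * that se->avg only converges towards the new reality as its
	 * runnable time and weight are re-evaluated over time, which is
	 * why the root cfs_rq.avg reacts with a delay.
	 */
}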

> Another question is: does cfs_rq.avg _ONLY_ signal historic behavior and
> not present behavior? So even when the task has been migrated do we
> still need to decay it slowly? Or will this be different between load
> and util?

cfs_rq.avg is updated instantaneously on task migration as it is the sum
of the PELT contributions of the sched_entities associated with that
cfs_rq. The group se->avg is not a sum; it behaves just as if it were a
task with a variable load_weight determined by the group weighting code,
but is otherwise identical. There is no adding/removing of contributions
when tasks migrate.

> > I don't think the slow response time for _load_ is necessarily a big
> > problem. Otherwise we would have had people complaining already about
> > group scheduling being broken. It is however a problem for all the
> > initiatives that build on utilization.
>
> Or maybe we need to separate utilization and load; these two signals
> have different semantics and purposes.

I think that is up for discussion. People might have different views on
the semantics of utilization. I see the two as very similar in the
non-group scheduling case: one is based on running time and is not
priority weighted, the other is based on runnable time and has priority
weighting. Otherwise they are the same.

However, in the group scheduling case, I think they should behave somewhat
differently. Load is priority scaled and is designed to ensure fair
scheduling when the system is fully utilized, whereas utilization provides
a metric that estimates the actual busy time of the cpus. Group load is
scaled such that it is capped no matter how much actual cpu time the group
gets across the system. I don't think it makes sense to do the same for
utilization, as it would then not represent the actual compute demand. It
should be treated as a 'flat hierarchy', as Yuyang mentions in his reply,
so that the sum at the root cfs_rq is a proper estimate of the utilization
of the cpu regardless of whether tasks are grouped or not.
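
As a purely illustrative sketch (a hypothetical helper, not part of this
patch or of mainline), treating utilization as a flat hierarchy would mean
moving the task's util contribution between the per-cpu root sums directly
on migration, independent of which group cfs_rq the task sits on:

/*
 * Hypothetical illustration: per-cpu, group-agnostic utilization sums.
 * The destination cpu sees the task's demand immediately; grouping does
 * not scale or delay the contribution.
 */
static void flat_util_migrate_sketch(unsigned long *src_root_util,
				     unsigned long *dst_root_util,
				     unsigned long task_util)
{
	/* clamp to avoid underflowing the unsigned running sum */
	if (task_util < *src_root_util)
		*src_root_util -= task_util;
	else
		*src_root_util = 0;

	/* the destination cpu sees the task's demand immediately */
	*dst_root_util += task_util;
}

That way the root sum keeps estimating the cpu's compute demand no matter
how the tasks are grouped, which is the property described above.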