Date: Mon, 4 Apr 2016 09:48:23 +0100
From: Morten Rasmussen
To: Leo Yan
Cc: Steve Muckle, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Vincent Guittot, linux-kernel@vger.kernel.org,
	eas-dev@lists.linaro.org
Subject: Re: [PATCH RFC] sched/fair: let cpu's cfs_rq to reflect task migration
Message-ID: <20160404084821.GA18516@e105550-lin.cambridge.arm.com>
References: <1459528717-17339-1-git-send-email-leo.yan@linaro.org>
	<20160401194948.GN3448@twins.programming.kicks-ass.net>
	<56FEF621.3070404@linaro.org>
	<20160402071154.GA7046@leoy-linaro>
In-Reply-To: <20160402071154.GA7046@leoy-linaro>

On Sat, Apr 02, 2016 at 03:11:54PM +0800, Leo Yan wrote:
> On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote:
> > I think I follow - Leo please correct me if I mangle your
> > intentions. It's an issue that Morten and Dietmar had mentioned to
> > me as well.

Yes. We have been working on this issue for a while without getting to
a nice solution yet.

> > Assume CONFIG_FAIR_GROUP_SCHED is enabled and a task is running in
> > a task group other than the root. The task migrates from one CPU to
> > another. The cfs_rq.avg structures on the src and dest CPUs
> > corresponding to the group containing the task are updated
> > immediately via remove_entity_load_avg()/update_cfs_rq_load_avg()
> > and attach_entity_load_avg().
> > But the cfs_rq.avg corresponding to the root on the src and dest
> > are not immediately updated. The root cfs_rq.avg must catch up over
> > time with PELT.

Yes. The problem is that only the cfs_rq.avg of the cfs_rq where the
sched_entity is enqueued/dequeued gets immediately updated. If the
cfs_rq is a group cfs_rq, its group entity's se.avg doesn't get updated
immediately; it has to adapt over time at the pace defined by the
geometric series. The impact of a task migration therefore doesn't
trickle through to the root cfs_rq.avg. This behaviour is one of the
fundamental changes Yuyang introduced with his rewrite of PELT.

> > As to why we care, there's at least one issue which may or may not
> > be Leo's - the root cfs_rq is the one whose avg structure we read
> > from to determine the frequency with schedutil. If you have a
> > cpu-bound task in a non-root cgroup which periodically migrates
> > among CPUs on an otherwise idle system, I believe each time it
> > migrates the frequency would drop to fmin and have to ramp up again
> > with PELT.

It makes any scheduling decision based on utilization difficult when
fair group scheduling is used, as cpu_util() doesn't give an up-to-date
picture of the utilization caused by tasks in task groups. The
energy-aware scheduling patches, and the patches we have in the
pipeline for improving capacity awareness in the scheduler, rely on
cpu_util().

> Steve, thanks for the explanation; I totally agree. My initial
> purpose was not from schedutil's perspective; I just did some basic
> analysis of utilization. My case is: fix the cpu to its maximum
> frequency (1.2GHz), then launch a periodic task (period: 100ms, duty
> cycle: 40%) and limit its affinity to only two CPUs. I observed the
> same result as you described.
>
> After applying this patch, I get a much better result for the CPU's
> utilization after the task's migration.
> Please see the parsed results for the CPU's utilization:
>
> http://people.linaro.org/~leo.yan/eas_profiling/cpu1_util_update_cfs_rq_avg.png
> http://people.linaro.org/~leo.yan/eas_profiling/cpu2_util_update_cfs_rq_avg.png

> > Leo I noticed you did not modify detach_entity_load_average(). I
> > think this would be needed to avoid the task's stats being double
> > counted for a while after switched_from_fair() or
> > task_move_group_fair().

I'm afraid that the solution to the problem is more complicated than
that :-(

You are adding/removing a contribution to/from the root cfs_rq.avg
which isn't part of that signal in the first place. The root cfs_rq.avg
only contains the sum of the load/util of the sched_entities on the
root cfs_rq. If you remove the task's contribution there, you may end
up double-accounting for the task migration: once due to your patch,
and then again slowly over time as the group sched_entity starts
reflecting that the task has migrated. Furthermore, for group
scheduling to make sense, it has to be task_h_load() that you
add/remove, otherwise the group weighting is completely lost. Or am I
completely misreading your patch?

I don't think the slow response time for _load_ is necessarily a big
problem. Otherwise we would already have had people complaining about
group scheduling being broken. It is, however, a problem for all the
initiatives built on utilization. We have to either make the updates
trickle through the group hierarchy for utilization, which is difficult
without making a lot of changes to the current code structure, or
introduce a new avg structure at root level which contains the sum of
the utilization of all _tasks_ (not groups) in the group hierarchy and
maintain it separately. Neither of those is particularly pretty. Better
suggestions?
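The slow catch-up of the root cfs_rq.avg can be illustrated with a toy
model of PELT's geometric series (a contribution halves every 32
periods of 1024 us, i.e. y^32 = 0.5). This is a floating-point sketch
for illustration only, not the kernel's fixed-point implementation, and
the initial contribution of 400 is an arbitrary assumed value:

```python
# Toy PELT decay: y^32 = 0.5, so a stale contribution halves every
# 32 periods of 1024 us. The kernel uses fixed-point arithmetic and
# lookup tables; this float sketch only shows the shape of the decay.
Y = 0.5 ** (1.0 / 32)  # per-period decay factor, ~0.97857

def decayed(contrib, periods):
    # Remaining contribution after `periods` 1024-us periods.
    return contrib * Y ** periods

# Assume a group entity contributed util 400 to the root cfs_rq.avg
# when its only task migrated away at t=0. The group cfs_rq is updated
# immediately, but the stale contribution seen at the root only decays:
for t in (0, 32, 64, 128):
    print("t=%3d periods: root still sees ~%d" % (t, round(decayed(400, t))))
# t=  0 periods: root still sees ~400
# t= 32 periods: root still sees ~200
# t= 64 periods: root still sees ~100
# t=128 periods: root still sees ~25
```

The destination CPU sees the mirror image: the migrated task's
utilization ramps up at the root at the same pace, which is why
schedutil can briefly see a near-idle root cfs_rq right after the
migration and drop to fmin.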
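The double-accounting risk can be sketched the same way: subtracting
the task's utilization from the root immediately, while the stale group
entity contribution also decays away, removes the migration twice. The
numbers (400 of utilization, a single-task group) are assumptions for
illustration, not taken from the patch:

```python
# Toy model of double-accounting a migration at the root cfs_rq.
# PELT decay factor y satisfies y^32 = 0.5; float sketch only.
Y = 0.5 ** (1.0 / 32)

task_util = 400.0  # assumed util of the migrating task
root_avg = 400.0   # root cfs_rq.avg: here only the group se's contribution

# A patch that removes the task's contribution from the root directly:
root_avg -= task_util  # root_avg is now 0

# But the group sched_entity, which is what the root signal is actually
# made of, independently decays toward 0 as it notices the departure.
# After 64 periods it has shed 400 - 400 * Y**64 = 300 of utilization,
# so the same migration has been subtracted a second time:
se_decay_loss = task_util - task_util * Y ** 64
effective_root = root_avg - se_decay_loss
print(round(effective_root))  # ~ -300: the signal has gone negative
```

In the kernel the average would be clamped rather than going negative,
but the point stands: the removal is accounted once by the patch and
once more by the group entity's own decay.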