Date: Mon, 4 Apr 2016 09:48:23 +0100
From: Morten Rasmussen <morten.rasmussen@arm.com>
To: Leo Yan <leo.yan@linaro.org>
Cc: Steve Muckle <steve.muckle@linaro.org>,
        Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@redhat.com>,
        Dietmar Eggemann <dietmar.eggemann@arm.com>,
        Vincent Guittot <vincent.guittot@linaro.org>,
        linux-kernel@vger.kernel.org, eas-dev@lists.linaro.org
Subject: Re: [PATCH RFC] sched/fair: let cpu's cfs_rq to reflect task
 migration
Message-ID: <20160404084821.GA18516@e105550-lin.cambridge.arm.com>
References: <1459528717-17339-1-git-send-email-leo.yan@linaro.org>
 <20160401194948.GN3448@twins.programming.kicks-ass.net>
 <56FEF621.3070404@linaro.org>
 <20160402071154.GA7046@leoy-linaro>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160402071154.GA7046@leoy-linaro>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4408
Lines: 81

On Sat, Apr 02, 2016 at 03:11:54PM +0800, Leo Yan wrote:
> On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote:
> > I think I follow - Leo please correct me if I mangle your intentions.
> > It's an issue that Morten and Dietmar had mentioned to me as well.

Yes. We have been working on this issue for a while without getting to a
nice solution yet.

> > Assume CONFIG_FAIR_GROUP_SCHED is enabled and a task is running in a
> > task group other than the root. The task migrates from one CPU to
> > another. The cfs_rq.avg structures on the src and dest CPUs
> > corresponding to the group containing the task are updated immediately
> > via remove_entity_load_avg()/update_cfs_rq_load_avg() and
> > attach_entity_load_avg(). But the cfs_rq.avg corresponding to the root
> > on the src and dest are not immediately updated. The root cfs_rq.avg
> > must catch up over time with PELT.

Yes. The problem is that it is only the cfs_rq.avg of the cfs_rq where
the is enqueued/dequeued that gets immediately updated. If the cfs_rq is
a group cfs_rq its group entity se.avg doesn't get updated immediately.
It has to adapt over time at the pace defined by the geometric series.
The impact of a task migration therefore doesn't trickle through to the
root cfs_rq.avg. This behaviour is one of the fundamental changes Yuyang
introduced with his rewrite of PELT.

> > As to why we care, there's at least one issue which may or may not be
> > Leo's - the root cfs_rq is the one whose avg structure we read from to
> > determine the frequency with schedutil. If you have a cpu-bound task in
> > a non-root cgroup which periodically migrates among CPUs on an otherwise
> > idle system, I believe each time it migrates the frequency would drop to
> > fmin and have to ramp up again with PELT.

It makes any scheduling decision based on utilization difficult if fair
group scheduling is used as cpu_util() doesn't give an up-to-date
picture of any utilization caused by task in task groups.

For the energy-aware scheduling patches and patches we have in the
pipeline for improving capacity awareness in the scheduler we rely on
cpu_util().

> Steve, thanks for explanation and totally agree. My initial purpose is
> not from schedutil's perspective, but just did some basic analysis for
> utilization. So my case is: fixed cpu to maximum frequency 1.2GHz, then
> launch a periodic task (period: 100ms, duty cycle: 40%) and limit its
> affinity to only two CPUs. So observed the same result like you said.
> 
> After applied this patch, I can get much better result for the CPU's
> utilization after task's migration. Please see the parsed result for
> CPU's utilization:
> http://people.linaro.org/~leo.yan/eas_profiling/cpu1_util_update_cfs_rq_avg.png
> http://people.linaro.org/~leo.yan/eas_profiling/cpu2_util_update_cfs_rq_avg.png
> 
> > Leo I noticed you did not modify detach_entity_load_average(). I think
> > this would be needed to avoid the task's stats being double counted for
> > a while after switched_from_fair() or task_move_group_fair().

I'm afraid that the solution to problem is more complicated than that
:-(

You are adding/removing a contribution from the root cfs_rq.avg which
isn't part of the signal in the first place. The root cfs_rq.avg only
contains the sum of the load/util of the sched_entities on the cfs_rq.
If you remove the contribution of the tasks from there you may end up
double-accounting for the task migration. Once due to you patch and then
again slowly over time as the group sched_entity starts reflecting that
the task has migrated. Furthermore, for group scheduling to make sense
it has to be the task_h_load() you add/remove otherwise the group
weighting is completely lost. Or am I completely misreading your patch?

I don't think the slow response time for _load_ is necessarily a big
problem. Otherwise we would have had people complaining already about
group scheduling being broken. It is however a problem for all the
initiatives that built on utilization.

We have to either make the updates trickle through the group hierarchy
for utilization, which is difficult with making a lot of changes to the
current code structure, or introduce a new avg structure at root level
which contains the sum of task utilization for all _tasks_ (not groups)
in the group hierarchy and maintain it separately.

None of those two are particularly pretty. Better suggestions?