Date: Mon, 7 Aug 2017 13:51:43 +0100
From: Morten Rasmussen <morten.rasmussen@arm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Brendan Jackman <brendan.jackman@arm.com>,
        Ingo Molnar <mingo@redhat.com>, linux-kernel@vger.kernel.org,
        Joel Fernandes <joelaf@google.com>,
        Andres Oportus <andresoportus@google.com>,
        Dietmar Eggemann <dietmar.eggemann@arm.com>,
        Vincent Guittot <vincent.guittot@linaro.org>,
        Josef Bacik <josef@toxicpanda.com>
Subject: Re: [PATCH] sched/fair: Sync task util before slow-path wakeup
Message-ID: <20170807125143.GA498@morras01-work>
References: <20170802131002.31576-1-brendan.jackman@arm.com>
 <20170802132405.z5gvut7ecaygbhvy@hirez.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170802132405.z5gvut7ecaygbhvy@hirez.programming.kicks-ass.net>
User-Agent: Mutt/1.5.24 (2015-08-30)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2548
Lines: 57

On Wed, Aug 02, 2017 at 03:24:05PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 02, 2017 at 02:10:02PM +0100, Brendan Jackman wrote:
> > We use task_util in find_idlest_group via capacity_spare_wake. This
> > task_util is updated in wake_cap. However wake_cap is not the only
> > reason for ending up in find_idlest_group - we could have been sent
> > there by wake_wide. So explicitly sync the task util with prev_cpu
> > when we are about to head to find_idlest_group.
> > 
> > We could simply do this at the beginning of
> > select_task_rq_fair (i.e. irrespective of whether we're heading to
> > select_idle_sibling or find_idlest_group & co), but I didn't want to
> > slow down the select_idle_sibling path more than necessary.
> > 
> > Don't do this during fork balancing, we won't need the task_util and
> > we'd just clobber the last_update_time, which is supposed to be 0.
> 
> So I remember Morten explicitly not aging util of tasks on wakeup
> because the old util was higher and better representative of what the
> new util would be, or something along those lines.
> 
> Morten?

That was the intention, but when we discussed the wake_cap() stuff we
decided to drop that, hoping that decay clamping or some other magic
would be added on top later. So this patch is in line with current
behaviour.

Using non-aged util is causing trouble when comparing prev_cpu to other
cpus. In cpu_util_wake() we compensate for the fact that the aged task
util is already included in the cpu util on the prev_cpu. For that to
work, we need to age the task util so we know how much is already
accounted for. In the original wake_cap() series I think I had a patch
that stored the non-aged version so we could calculate the potential cpu
util as:

predicted_cpu_util(prev_cpu) =
	cpu_util(prev_cpu) - task_util_aged(task) + task_util_nonaged(task)

predicted_cpu_util(other_cpu) =
	cpu_util(other_cpu) + task_util_nonaged(task)

This would be better than always under-estimating the task util by using
the aged util as we currently do:

predicted_cpu_util(prev_cpu) =
	cpu_util(prev_cpu) - task_util_aged(task) + task_util_aged(task)

predicted_cpu_util(other_cpu) =
	cpu_util(other_cpu) + task_util_aged(task)

but at least it gives us a fair comparison between prev_cpu and other
cpus.
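
To make the two variants concrete, here is a rough sketch in plain C.
It is illustration only, not kernel code: struct task_sample and both
helper names are made up, and the utilization values are passed in
directly rather than read from PELT. The only point is that the aged
contribution is removed from prev_cpu before the chosen task estimate
(aged or non-aged) is added to both candidates.

```c
/*
 * Hypothetical stand-in for the two task util values discussed above.
 * These names are assumptions for illustration, not kernel structures.
 */
struct task_sample {
	unsigned long util_aged;	/* task util decayed up to the wakeup */
	unsigned long util_nonaged;	/* task util as it was at sleep time */
};

/*
 * prev_cpu already contains the aged task contribution, so remove it
 * first (clamped at zero) and then add whichever estimate is in use.
 */
static unsigned long predicted_util_prev(unsigned long cpu_util,
					 const struct task_sample *p,
					 unsigned long task_estimate)
{
	unsigned long base = 0;

	if (cpu_util > p->util_aged)
		base = cpu_util - p->util_aged;

	return base + task_estimate;
}

/* Any other cpu has no contribution from the task yet: just add it. */
static unsigned long predicted_util_other(unsigned long cpu_util,
					  unsigned long task_estimate)
{
	return cpu_util + task_estimate;
}
```

With made-up numbers util_aged = 50, util_nonaged = 200,
cpu_util(prev_cpu) = 300 and cpu_util(other_cpu) = 260, the non-aged
estimate gives 450 vs 460 and the aged estimate gives 300 vs 310. The
gap between the two cpus is the same in both cases; only the absolute
prediction changes.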

The Android kernel carries additional patches that track the max (peak)
utilization and use that as the non-aged util for wake-up placement.
I'm hoping we can discuss this topic again at LPC, as last year's idea of
clamping decay didn't work very well to solve this issue.