Date: Tue, 11 Apr 2017 09:52:21 +0200
From: Vincent Guittot
To: Peter Zijlstra
Cc: mingo@kernel.org, linux-kernel@vger.kernel.org, dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com, yuyang.du@intel.com, pjt@google.com, bsegall@google.com
Subject: Re: [PATCH v2] sched/fair: update scale invariance of PELT
Message-ID: <20170411075221.GA30421@linaro.org>
References: <1491815909-13345-1-git-send-email-vincent.guittot@linaro.org> <20170410173802.orygigjbcpefqtdv@hirez.programming.kicks-ass.net>
In-Reply-To: <20170410173802.orygigjbcpefqtdv@hirez.programming.kicks-ass.net>

On Monday 10 Apr 2017 at 19:38:02 (+0200), Peter Zijlstra wrote:
> 
> Thanks for the rebase.
> 
> On Mon, Apr 10, 2017 at 11:18:29AM +0200, Vincent Guittot wrote:
> 
> Ok, so let me try and paraphrase what this patch does.
> 
> So consider a task that runs 16 out of our 32ms window:
> 
>     running      idle
>   |---------|---------|
> 
> You're saying that when we scale running with the frequency, suppose we
> were at 50% freq, we'll end up with:
> 
>    run     idle
>   |----|---------|
> 
> Which is obviously a shorter total than before; so what you do is add
> back the lost idle time like:
> 
>    run  lost    idle
>   |----|----|---------|
> 
> to arrive at the same total time. Which seems to make sense.
Yes

> 
> Now I have vague memories of Morten having issues with your previous
> patches, so I'll wait for him to chime in as well.

IIRC, Morten's concerns were about the lost idle time, which was not taken
into account in the previous version.

> 
> On to the implementation:
> 
> >  /*
> > + * Scale the time to reflect the effective amount of computation done during
> > + * this delta time.
> > + */
> > +static __always_inline u64
> > +scale_time(u64 delta, int cpu, struct sched_avg *sa,
> > +	   unsigned long weight, int running)
> > +{
> > +	if (running) {
> > +		sa->stolen_idle_time += delta;
> > +		/*
> > +		 * scale the elapsed time to reflect the real amount of
> > +		 * computation
> > +		 */
> > +		delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
> > +		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
> > +
> > +		/*
> > +		 * Track the amount of stolen idle time due to running at
> > +		 * lower capacity
> > +		 */
> > +		sa->stolen_idle_time -= delta;
> 
> OK so far so good, this tracks, in stolen_idle_time, the 'lost' bit from
> above.
> 
> > +	} else if (!weight) {
> > +		if (sa->util_sum < (LOAD_AVG_MAX * 1000)) {
> 
> But here I'm completely lost. WTF just happened ;-)
> 
> Firstly, I think we want a comment on why we care about the !weight
> case. Why isn't !running sufficient?

We track the time when the task is "really" idle, but not the time that the
task spends waiting to run on the CPU. running is used to detect when the
task is actually running and how much idle time has been lost, while weight
is used to detect when the task is back to the sleep state, which is when we
account the lost idle time.

> 
> Secondly, what's up with the util_sum < LOAD_AVG_MAX * 1000 thing?

The lost idle time makes sense only if the task could also have been "idle"
when running at max capacity. When util_sum reaches the
LOAD_AVG_MAX*SCHED_CAPACITY_SCALE value, all tasks are considered to be the
same, as we can't make any difference between a task running 400ms and a
task running 400sec.
It means that these tasks are "always running" tasks, even at max capacity.
In this case there is no lost idle time, as they always run, so tracking and
adding back the lost idle time because we run at lower capacity doesn't make
sense anymore, and we discard it. Then, an always-running task can have a
util_sum that is less than the max value because of rounding (util_avg
varies in the range [1006..1023]), so I use LOAD_AVG_MAX*1000 instead of
LOAD_AVG_MAX*1024.

> 
> Is that to deal with cpu_capacity?
> 
> > +			/*
> > +			 * Add the idle time stolen by running at lower compute
> > +			 * capacity
> > +			 */
> > +			delta += sa->stolen_idle_time;
> > +		}
> > +		sa->stolen_idle_time = 0;
> > +	}
> > +
> > +	return delta;
> > +}
> 
> Thirdly, I'm thinking this isn't quite right. Imagine a task that's
> running across a decay window, then we'll only add back the stolen_idle
> time in the next window, even though it should've been in this one,
> right?

I don't think so, because PELT should not see more decay windows at half
capacity than at max capacity. In the example below, we cross an
absolute-time decay window when running at half capacity, but once we scale
the running delta time we don't cross it anymore, and the update of PELT is
done in the same manner in both cases:

                  decay window
               |-------|-------|-------|--
max capacity   ---xxxx------------xxxx----
update            *  *            *  *
half capacity  ---xxxxxxxx--------xxxxxxxx
accounted      ---xxxx____--------xxxx____
update            *  *            *  *

x: running   -: sleep   _: lost idle time