Date: Mon, 10 Apr 2017 19:38:02 +0200
From: Peter Zijlstra
To: Vincent Guittot
Cc: mingo@kernel.org, linux-kernel@vger.kernel.org, dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com, yuyang.du@intel.com, pjt@google.com, bsegall@google.com
Subject: Re: [PATCH v2] sched/fair: update scale invariance of PELT
Message-ID: <20170410173802.orygigjbcpefqtdv@hirez.programming.kicks-ass.net>
In-Reply-To: <1491815909-13345-1-git-send-email-vincent.guittot@linaro.org>

Thanks for the rebase.

On Mon, Apr 10, 2017 at 11:18:29AM +0200, Vincent Guittot wrote:

Ok, so let me try and paraphrase what this patch does.

So consider a task that runs 16 out of our 32ms window:

   running   idle
  |---------|---------|

You're saying that when we scale running with the frequency, suppose we
were at 50% freq, we'll end up with:

   run   idle
  |----|---------|

Which is obviously a shorter total than before; so what you do is add
back the lost idle time like:

   run  lost   idle
  |----|----|---------|

to arrive at the same total time. Which seems to make sense.

Now I have vague memories of Morten having issues with your previous
patches, so I'll wait for him to chime in as well.

On to the implementation:

> +/*
> + * Scale the time to reflect the effective amount of computation done during
> + * this delta time.
> + */
> +static __always_inline u64
> +scale_time(u64 delta, int cpu, struct sched_avg *sa,
> +	   unsigned long weight, int running)
> +{
> +	if (running) {
> +		sa->stolen_idle_time += delta;
> +		/*
> +		 * scale the elapsed time to reflect the real amount of
> +		 * computation
> +		 */
> +		delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
> +		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
> +
> +		/*
> +		 * Track the amount of stolen idle time due to running at
> +		 * lower capacity
> +		 */
> +		sa->stolen_idle_time -= delta;

OK so far so good; this tracks, in stolen_idle_time, the 'lost' bit
from above.

> +	} else if (!weight) {
> +		if (sa->util_sum < (LOAD_AVG_MAX * 1000)) {

But here I'm completely lost. WTF just happened ;-)

Firstly, I think we want a comment on why we care about the !weight
case. Why isn't !running sufficient?

Secondly, what's up with the util_sum < LOAD_AVG_MAX * 1000 thing? Is
that to deal with cpu_capacity?

> +			/*
> +			 * Add the idle time stolen by running at lower compute
> +			 * capacity
> +			 */
> +			delta += sa->stolen_idle_time;
> +		}
> +		sa->stolen_idle_time = 0;
> +	}
> +
> +	return delta;
> +}

Thirdly, I'm thinking this isn't quite right. Imagine a task that's
running across a decay window; then we'll only add back the
stolen_idle_time in the next window, even though it should've been in
this one, right?
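
For illustration, here is a minimal standalone C model of the
stolen-idle-time mechanism paraphrased above, worked through with the
16ms-out-of-a-32ms-window-at-50%-frequency example from the diagrams.
This is a sketch, not the kernel code: the toy_avg struct, the
simple_scale() helper and the CAP_SHIFT fixed point are invented
stand-ins for sched_avg, cap_scale() and the kernel's capacity scale.

#include <stdio.h>
#include <stdint.h>

#define CAP_SHIFT	10
#define CAP_MAX		(1u << CAP_SHIFT)	/* 1024 ~ full capacity */

struct toy_avg {
	uint64_t stolen_idle_time;
};

/* scale a delta by a capacity value, like cap_scale() in the patch */
static uint64_t simple_scale(uint64_t delta, unsigned int cap)
{
	return (delta * cap) >> CAP_SHIFT;
}

/* model of the running branch: scale time, remember what was lost */
static uint64_t scale_running(struct toy_avg *sa, uint64_t delta,
			      unsigned int freq_cap)
{
	sa->stolen_idle_time += delta;
	delta = simple_scale(delta, freq_cap);	/* effective computation */
	sa->stolen_idle_time -= delta;		/* the 'lost' part remains */
	return delta;
}

/* model of the idle branch: give the lost time back */
static uint64_t scale_idle(struct toy_avg *sa, uint64_t delta)
{
	delta += sa->stolen_idle_time;
	sa->stolen_idle_time = 0;
	return delta;
}

int main(void)
{
	struct toy_avg sa = { 0 };

	/* 16ms running at 50% freq -> 8ms effective work, 8ms stolen */
	uint64_t run = scale_running(&sa, 16, CAP_MAX / 2);
	/* 16ms idle -> 16ms + the 8ms stolen above = 24ms */
	uint64_t idle = scale_idle(&sa, 16);

	printf("run=%llu lost+idle=%llu total=%llu\n",
	       (unsigned long long)run,
	       (unsigned long long)idle,
	       (unsigned long long)(run + idle));	/* total stays 32 */
	return 0;
}

Running this prints run=8 lost+idle=24 total=32: the scaled running
time plus the returned idle time still sums to the original 32ms
window, which is the invariant the add-back is meant to preserve.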