Date: Tue, 8 Dec 2015 17:04:21 +0000
From: Morten Rasmussen
To: Vincent Guittot
Cc: peterz@infradead.org, mingo@kernel.org, linux-kernel@vger.kernel.org,
	yuyang.du@intel.com, linaro-kernel@lists.linaro.org,
	dietmar.eggemann@arm.com, pjt@google.com, bsegall@google.com
Subject: Re: [PATCH] sched/fair: update scale invariance of pelt
Message-ID: <20151208170419.GA26307@e105550-lin.cambridge.arm.com>
References: <1448372970-8764-1-git-send-email-vincent.guittot@linaro.org>
In-Reply-To: <1448372970-8764-1-git-send-email-vincent.guittot@linaro.org>
User-Agent: Mutt/1.5.21 (2010-09-15)

On Tue, Nov 24, 2015 at 02:49:30PM +0100, Vincent Guittot wrote:
> The current implementation of load tracking invariance scales the load
> tracking value with the current frequency and uarch performance (only
> for utilization) of the CPU.
>
> One main result of the current formula is that the figures are capped
> by the current capacity of the CPU. This limitation is the main reason
> for not including the uarch invariance (arch_scale_cpu_capacity) in
> the calculation of load_avg, because capping the load can generate
> erroneous system load statistics as described in this example [1].

The reason why we don't want to scale load_avg with regard to uarch
capacity (as we do with util_avg) is explained in
e3279a2e6d697e00e74f905851ee7cf532f72b2d as well.
> Instead of scaling the complete value of the PELT algo, we should only
> scale the running time by the current capacity of the CPU. It seems
> more correct to only scale the running time, because the non-running
> time of a task (sleeping or waiting for a runqueue) is the same
> whatever the current freq and the compute capacity of the CPU.

You seem to imply that we currently scale running, waiting, and sleeping
time. That is not the case. We scale running and waiting time, but not
sleeping time.

Whether we should scale waiting time or not is a good question. The
waiting time is affected by the running time of the other tasks on the
cfs_rq, so on one hand it seems a bit inconsistent to scale one and not
the other. On the other hand, not scaling waiting time would make tasks
that spend a lot of time waiting appear bigger, which could be an
advantage as it would make load-balancing more prone to spread tasks. A
third alternative is to drop the scaling of load_avg completely, but it
is still needed for util_avg, as we want util_avg to be invariant to
frequency and uarch scaling.

> Then, one main advantage of this change is that the load of a task can
> reach the max value whatever the current freq and the uarch of the CPU
> on which it runs. It will just take more time at a lower freq than at
> max freq, or on a "little" CPU compared to a "big" one. The load and
> the utilization stay invariant across the system, so we can still
> compare them between CPUs, but with a wider range of values.

Just removing the scaling of waiting time and applying scaling by
current capacity (including uarch) to the running time will not make
load_avg reach the max value for tasks running alone on a cpu. Since the
task isn't waiting at all (it is alone), all contributions are running
time, which is scaled, IIUC, and hence the result is still capped by the
current capacity of the cpu. But that doesn't match your example results
further down, if I read them correctly.
The changes made in the code of this patch are quite subtle, but very
important, as they change the behaviour of the PELT geometric series
quite a lot. It is much more than just changing whether we scale waiting
time and apply uarch scaling to the running time of load_avg or not. I
think we need to understand the math behind this patch to understand how
the PELT metrics are affected, because I think this patch changes some
of the fundamentals originally described by Paul and Ben.

Instead of scaling the contribution of each 1024us segment like we
currently do, this patch essentially warps time: it lumps time together
and lets it contribute fully, but skips decays. It is rather hard to
explain, but the result is that the patch affects both load_avg and
util_avg, and it breaks scale-invariance.

Executive summary: Scaling time (delta) instead of the individual
segment contributions breaks scale-invariance. The net result on
load_avg seems to be nothing apart from slower reaction time. That is
how I see it after having tested it a bit. But I could be getting it all
wrong. :-/

Much more detail:

Original geometric series:

	\sum (0..n) u_n * y^n

Current geometric series with scale invariance:

	\sum (0..n) u_n * c_n * y^n

In reality we only approximate having the capacity scaling for each
segment, as we don't enforce PELT updates for each capacity change due
to frequency scaling.

In this patch, scaling is applied to the entire delta since the last
update instead of to each individual segment. That gives us a very
interesting time-warping effect when updates happen less frequently than
every 1ms. On cpus with reduced capacity the delta is reduced, and all
the math is done as if less time had passed since the last update, which
introduces an error with regard to the decay of the series, as we
effectively skip segments of time that should have decayed it.

It is probably easier described with an example:

We have one periodic task with a period of 4ms. Busy time per activation
is 1ms at 100% capacity.
The task has been running forever (>350ms) and we consider the load_avg
calculations at enqueue/dequeue, which should be the most common update
points for this scenario besides the tick updates.

task states	s = sleeping
		R = running (scheduled)

pelt		d = decay segment (load_avg * y, y^32 = 0.5)
		[0..1024] = segment contribution (including any scaling)
		U = __update_load_avg() is called

f = 100%  |  1024us |  1024us |  1024us |  1024us |  1024us |  1024us |
task      |    s    |    R    |    s    |    s    |    s    |    R    |
pelt ml      d    U   1024  U     d         d         d    U   1024  U
patch        d    U   1024  U     d         d         d    U   1024  U

f = 33%   |  1024us |  1024us |  1024us |  1024us |  1024us |  1024us |
task      |    s    |    R    |    R    |    R    |    s    |    R    |
pelt ml      d    U   341y^2     341y      341   U    d    U   341y^2
patch        d    U   1024        0         0    U    d    U   1024

In the first case, f = 100%, at the update after the busy period has
completed we decay load_avg by one period (segment) and add a
contribution of 1024. We are at 100%, so it is a full contribution for
this segment both with and without this patch. The task enqueue update
accounts for the sleeping time by decaying load_avg three periods. The
same in both cases. We could say that the contributions of a full cycle
of the task are:

	f_100% cycle = 1024 + decay(4)

If we reduce the capacity to 33%, things look a bit different. In
mainline, the dequeue update after the busy period would decay three
periods and add \sum (i = 2..0) 0.33*1024*y^i to account for the three
busy segments. The enqueue update decays the load_avg by one segment.
The full cycle contribution becomes:

	Mainline: f_33% cycle = 341*y^2 + 341*y + 341 + decay(4)

With this patch it is different. At the dequeue update we scale the time
delta instead of the contribution, such that delta = 0.33*delta, so the
calculation is based on only one period (segment) having passed. Hence
we decay by one segment and add 1024, but we still set the update point
to the true timestamp, so the following update doesn't take the two
remaining segments into account.
The enqueue update decays the load_avg by one segment, just like it does
in mainline. The full cycle contribution becomes:

	Patch: f_33% cycle = 1024 + decay(2)

This is clearly different from mainline. Not only is the busy
contribution higher, 1024 > 341*y^2 + 341*y + 341 since y < 1, but we
also decay less. The result is an inflation of the load_avg and util_avg
metrics for tasks that run for more than 1ms at a time if
__update_load_avg() isn't called every 1ms.

I did a quick test to confirm this using a single periodic task and
changing the compute capacity:

			util_avg
	capacity	mainline	patch
	1024		~359		~352
	512		~340		~534

Execution time went from 1.4ms to 2.8ms per activation without
overloading the cpu. The fundamental idea of scale invariance is that
util_avg should be comparable between cpus at any capacity as long as
none of them are over-utilized. This isn't preserved by the patch in its
current form.

> With this change, we don't have to test if a CPU is overloaded or not
> in order to use one metric (util) or another (load), as all metrics
> are always valid.

I'm not sure what you mean by always valid. util_avg is still not a
meaningful metric for tasks running on over-utilized cpus, so it can not
be used unconditionally. If util_avg > capacity, we still have no clue
whether the task can fit on a different cpu with higher capacity.

> I have put below some examples of duration to reach some typical load
> values according to the capacity of the CPU, with the current
> implementation and with this patch.
>
> Util (%)     max capacity  half capacity(mainline)  half capacity(w/ patch)
> 972 (95%)    138ms         not reachable            276ms
> 486 (47.5%)  30ms          138ms                    60ms
> 256 (25%)    13ms          32ms                     26ms

I assume that these are numbers for util_avg and not load_avg as said in
the text above. It confuses me a little bit, as you started out by
talking about the lack of uarch scaling of load_avg and propose to
change that, not util_avg.
The equivalent table for load_avg would be something like this:

	load_avg (%)  max capacity  half capacity(mainline)  half capacity(w/ patch)
	972 (95%)     138ms         138ms                    276ms
	486 (47.5%)   30ms          30ms                     60ms
	256 (25%)     13ms          13ms                     26ms

load_avg does reach the max value as it is. The patch just makes it
happen at a slower pace, and I'm not sure whether that is a good or a
bad thing.

> We can see that at half capacity, we need twice the duration of max
> capacity with this patch, whereas we have a non-linear increase of the
> duration with the current implementation.

Is it a problem that the time to reach a certain value is not linear? It
is still somewhat unclear to me why we would want this change. Adding
uarch scaling to load_avg, but then modifying the geometric series so
the end result is the same except that it now reacts more slowly at
lower capacities, seems a bit strange.

> [1] https://lkml.org/lkml/2014/12/18/128
>
> Signed-off-by: Vincent Guittot
> ---
>  kernel/sched/fair.c | 28 +++++++++++++---------------
>  1 file changed, 13 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 824aa9f..f2a18e1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2560,10 +2560,9 @@ static __always_inline int
>  __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>  		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
>  {
> -	u64 delta, scaled_delta, periods;
> +	u64 delta, periods;
>  	u32 contrib;
> -	unsigned int delta_w, scaled_delta_w, decayed = 0;
> -	unsigned long scale_freq, scale_cpu;
> +	unsigned int delta_w, decayed = 0;
>
>  	delta = now - sa->last_update_time;
>  	/*
> @@ -2584,8 +2583,10 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>  		return 0;
>  	sa->last_update_time = now;
>
> -	scale_freq = arch_scale_freq_capacity(NULL, cpu);
> -	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> +	if (running) {
> +		delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
> +		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL,
> +						 cpu));

This is where the time warping happens. delta is used to determine the
number of periods (segments) since the last update. Scaling delta, as
opposed to scaling the contributions for each segment individually, can
lead to disappearing segments.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/