From: Vincent Guittot
Date: Mon, 11 Apr 2016 11:08:04 +0200
Subject: Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue
To: Yuyang Du
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann, Juri Lelli
In-Reply-To: <1460327765-18024-3-git-send-email-yuyang.du@intel.com>
References: <1460327765-18024-1-git-send-email-yuyang.du@intel.com> <1460327765-18024-3-git-send-email-yuyang.du@intel.com>

Hi Yuyang,

On 11 April 2016 at 00:36, Yuyang Du wrote:
> In __update_load_avg(), the current period is never complete. This
> basically leads to a slightly over-decayed average, say on average we

Yes, that also explains why we almost never reach the max value.

> have 50% current period, then we will lose 1.08%(=(1-0.5^(1/64)) of
> past avg. More importantly, the incomplete current period significantly
> complicates the avg computation, even a full period is only about 1ms.
>
> So we attempt to drop it. The outcome is that for any x.y periods to
> update, we will either lose the .y period or unduely gain (1-.y) period.
> How big is the impact? For a large x (say 20ms), you barely notice the
> difference, which is plus/minus 1% (=(before-after)/before). Moreover,
> the aggregated losses and gains in the long run should statistically
> even out.
>
> Signed-off-by: Yuyang Du
> ---
>  include/linux/sched.h |   2 +-
>  kernel/sched/fair.c   | 170 +++++++++++++++++++------------------------------
>  2 files changed, 66 insertions(+), 106 deletions(-)
>

[snip]

>
>  #if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
> @@ -2704,11 +2694,14 @@ static __always_inline int
>  __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>                   unsigned long weight, int running, struct cfs_rq *cfs_rq)
>  {
> -       u64 delta, scaled_delta, periods;
> -       u32 contrib;
> -       unsigned int delta_w, scaled_delta_w, decayed = 0;
> +       u64 delta;
> +       u32 contrib, periods;
>         unsigned long scale_freq, scale_cpu;
>
> +       /*
> +        * now rolls down to a period boundary
> +        */
> +       now = now && (u64)(~0xFFFFF);
>         delta = now - sa->last_update_time;
>         /*
>          * This should only happen when time goes backwards, which it
> @@ -2720,89 +2713,56 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>         }
>
>         /*
> -        * Use 1024ns as the unit of measurement since it's a reasonable
> -        * approximation of 1us and fast to compute.
> +        * Use 1024*1024ns as an approximation of 1ms period, pretty close.
>          */
> -       delta >>= 10;
> -       if (!delta)
> +       periods = delta >> 20;
> +       if (!periods)
>                 return 0;
>         sa->last_update_time = now;

The optimization looks quite interesting, but I see one potential issue
with migration: we will lose the part of the ongoing period that is no
longer saved anywhere. This lost part can be quite significant for a
short task that ping-pongs between CPUs.
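
To make the concern concrete, here is a rough userspace sketch (my own
illustration, not code from the patch): it assumes ~0.5ms run bursts, a
migration after every burst, and that each migration drops the in-flight
sub-period time now that period_contrib is gone. Under those assumptions
the task accrues no running time at all:

#include <stdio.h>

#define PERIOD_NS       (1024ULL * 1024)        /* one PELT period, ~1ms */

int main(void)
{
        /* assumed workload: short bursts, migrating after each one */
        unsigned long long run_ns = 512 * 1024;         /* ~0.5ms per burst */
        unsigned long long old_acct = 0, new_acct = 0;
        int i, migrations = 100;

        for (i = 0; i < migrations; i++) {
                /* old scheme: the sub-period remainder is carried over
                 * in period_contrib, so every burst is accounted */
                old_acct += run_ns;
                /* patched scheme (as I read it): only whole periods are
                 * accumulated, the ongoing period is lost at migration */
                new_acct += (run_ns / PERIOD_NS) * PERIOD_NS;
        }

        printf("accounted with period_contrib:    %llu ns\n", old_acct);
        printf("accounted without period_contrib: %llu ns\n", new_acct);
        return 0;
}

With these (admittedly extreme) numbers the task looks completely idle to
PELT, even though it actually ran for roughly 50ms.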
>
>         scale_freq = arch_scale_freq_capacity(NULL, cpu);
>         scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
>
> -       /* delta_w is the amount already accumulated against our next period */
> -       delta_w = sa->period_contrib;
> -       if (delta + delta_w >= 1024) {
> -               decayed = 1;
> -
> -               /* how much left for next period will start over, we don't know yet */
> -               sa->period_contrib = 0;
> -
> -               /*
> -                * Now that we know we're crossing a period boundary, figure
> -                * out how much from delta we need to complete the current
> -                * period and accrue it.
> -                */
> -               delta_w = 1024 - delta_w;
> -               scaled_delta_w = cap_scale(delta_w, scale_freq);
> -               if (weight) {
> -                       sa->load_sum += weight * scaled_delta_w;
> -                       if (cfs_rq) {
> -                               cfs_rq->runnable_load_sum +=
> -                                               weight * scaled_delta_w;
> -                       }
> -               }
> -               if (running)
> -                       sa->util_sum += scaled_delta_w * scale_cpu;
> -
> -               delta -= delta_w;
> -
> -               /* Figure out how many additional periods this update spans */
> -               periods = delta / 1024;
> -               delta %= 1024;
> +       /*
> +        * Now we know we're crossing period boundaries, and the *_sum accrues by
> +        * two steps.
> +        */
>
> -               sa->load_sum = decay_load(sa->load_sum, periods + 1);
> -               if (cfs_rq) {
> -                       cfs_rq->runnable_load_sum =
> -                               decay_load(cfs_rq->runnable_load_sum, periods + 1);
> -               }
> -               sa->util_sum = decay_load((u64)(sa->util_sum), periods + 1);
> -
> -               /* Efficiently calculate \sum (1..n_period) 1024*y^i */
> -               contrib = __compute_runnable_contrib(periods);
> -               contrib = cap_scale(contrib, scale_freq);
> -               if (weight) {
> -                       sa->load_sum += weight * contrib;
> -                       if (cfs_rq)
> -                               cfs_rq->runnable_load_sum += weight * contrib;
> -               }
> -               if (running)
> -                       sa->util_sum += contrib * scale_cpu;
> +       /*
> +        * Step 1: decay old *_sum
> +        */
> +       sa->load_sum = __decay_sum(sa->load_sum, periods);
> +       if (cfs_rq) {
> +               cfs_rq->runnable_load_sum =
> +                       __decay_sum(cfs_rq->runnable_load_sum, periods);
>         }
> +       sa->util_sum = __decay_sum((u64)(sa->util_sum), periods);
>
> -       /* Remainder of delta accrued against u_0` */
> -       scaled_delta = cap_scale(delta, scale_freq);
> +       /*
> +        * Step 2: accumulate and meanwhile decay new *_sum by periods since
> +        * last_update_time
> +        */
> +       contrib = __accumulate_sum(periods);
> +       contrib = cap_scale(contrib, scale_freq);
>         if (weight) {
> -               sa->load_sum += weight * scaled_delta;
> +               sa->load_sum += weight * contrib;
>                 if (cfs_rq)
> -                       cfs_rq->runnable_load_sum += weight * scaled_delta;
> +                       cfs_rq->runnable_load_sum += weight * contrib;
>         }
>         if (running)
> -               sa->util_sum += scaled_delta * scale_cpu;
> +               sa->util_sum += contrib * scale_cpu;
>
> -       sa->period_contrib += delta;
> -
> -       if (decayed) {
> -               sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);
> -               if (cfs_rq) {
> -                       cfs_rq->runnable_load_avg =
> -                               div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
> -               }
> -               sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
> +       /*
> +        * *_sum must have been evolved, we update *_avg
> +        */
> +       sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);
> +       if (cfs_rq) {
> +               cfs_rq->runnable_load_avg =
> +                       div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
>         }
> +       sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
>
> -       return decayed;
> +       return 1;
>  }
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> --
> 2.1.4
>
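
As a side note for anyone following along, the two-step update above can
be approximated in plain userspace C. This is only my reading of it:
__decay_sum()/__accumulate_sum() are replaced here by floating-point
geometric series using the usual PELT decay factor y (y^32 = 0.5), and I
assume the old \sum (1..p) 1024*y^i convention for the accumulate step:

#include <math.h>
#include <stdio.h>

static const double y = 0.978572062;    /* 0.5^(1/32): 32-period half-life */

/* Step 1: decay an old sum over 'periods' elapsed periods */
static double decay_sum(double sum, unsigned int periods)
{
        return sum * pow(y, periods);
}

/* Step 2: contribution of 'periods' fully-run new periods,
 * 1024 * (y + y^2 + ... + y^periods)  (assumed convention) */
static double accumulate_sum(unsigned int periods)
{
        return 1024.0 * y * (1.0 - pow(y, periods)) / (1.0 - y);
}

int main(void)
{
        double sum = 0.0;
        int i;

        /* always-running task, updated once every 5 periods */
        for (i = 0; i < 200; i++)
                sum = decay_sum(sum, 5) + accumulate_sum(5);

        /* with this convention the sum settles around 1024*y/(1-y) ~= 46.7k,
         * a bit below LOAD_AVG_MAX (47742) */
        printf("converged sum: %.0f\n", sum);
        return 0;
}

The kernel's fixed-point lookup tables will of course give slightly
different numbers; the sketch is only meant to show the decay-then-
accumulate structure.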