Date: Tue, 12 Apr 2016 13:56:45 +0200
From: Vincent Guittot
To: Yuyang Du
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Benjamin Segall,
	Paul Turner, Morten Rasmussen, Dietmar Eggemann, Juri Lelli
Subject: Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue
Message-ID: <20160412115645.GA3730@vingu-laptop>
In-Reply-To: <20160411194141.GH8697@intel.com>
References: <1460327765-18024-1-git-send-email-yuyang.du@intel.com>
	<1460327765-18024-3-git-send-email-yuyang.du@intel.com>
	<20160411194141.GH8697@intel.com>
List-ID: linux-kernel@vger.kernel.org

On Tuesday 12 Apr 2016 at 03:41:41 (+0800), Yuyang Du wrote:
> Hi Vincent,
>
> On Mon, Apr 11, 2016 at 11:08:04AM +0200, Vincent Guittot wrote:
> > > @@ -2704,11 +2694,14 @@ static __always_inline int
> > > __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > > 		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
> > > {
> > > -	u64 delta, scaled_delta, periods;
> > > -	u32 contrib;
> > > -	unsigned int delta_w, scaled_delta_w, decayed = 0;
> > > +	u64 delta;
> > > +	u32 contrib, periods;
> > > 	unsigned long scale_freq, scale_cpu;
> > >
> > > +	/*
> > > +	 * now rolls down to a period boundary
> > > +	 */
> > > +	now = now && (u64)(~0xFFFFF);
> > > 	delta = now - sa->last_update_time;
> > > 	/*
> > > 	 * This should only happen when time goes backwards, which it
> > > @@ -2720,89 +2713,56 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > > 	}
> > >
> > > 	/*
> > > -	 * Use 1024ns as the unit of measurement since it's a reasonable
> > > -	 * approximation of 1us and fast to compute.
> > > +	 * Use 1024*1024ns as an approximation of 1ms period, pretty close.
> > > 	 */
> > > -	delta >>= 10;
> > > -	if (!delta)
> > > +	periods = delta >> 20;
> > > +	if (!periods)
> > > 		return 0;
> > > 	sa->last_update_time = now;
> >
> > The optimization looks quite interesting, but I see one potential issue
> > with migration: we will lose the part of the ongoing period that is now
> > not saved anymore. This lost part can be quite significant for a short
> > task that ping-pongs between CPUs.
>
> Yes, basically, we lose precision (~1ms scale in contrast with ~1us scale).

But with HZ set to 1000 and a sched slice in the same range, a precision of
1ms instead of 1us makes the load tracking of a short task quite random,
IMHO. You can keep recording this partial period without using it in the
load tracking. Something like the diff below keeps the precision without
sacrificing the optimization.
---
 kernel/sched/fair.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 68273e8..b234169 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -674,6 +674,12 @@ void init_entity_runnable_average(struct sched_entity *se)
 	struct sched_avg *sa = &se->avg;
 
 	sa->last_update_time = 0;
+	/*
+	 * sched_avg's period_contrib should be strictly less than 1024 * 1024, so
+	 * we give it 1023 * 1024 to make sure it is almost a period (1024us), and
+	 * will definitely be updated (after enqueue).
+	 */
+	sa->period_contrib = 1023*1024;
 	sa->load_avg = scale_load_down(se->load.weight);
 	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
 	/*
@@ -2698,10 +2704,6 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 	u32 contrib, periods;
 	unsigned long scale_freq, scale_cpu;
 
-	/*
-	 * now rolls down to a period boundary
-	 */
-	now = now && (u64)(~0xFFFFF);
 	delta = now - sa->last_update_time;
 	/*
 	 * This should only happen when time goes backwards, which it
@@ -2712,6 +2714,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		return 0;
 	}
 
+	/* Add how much is left from the current period */
+	delta += sa->period_contrib;
+
 	/*
 	 * Use 1024*1024ns as an approximation of 1ms period, pretty close.
 	 */
@@ -2720,6 +2725,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		return 0;
 	sa->last_update_time = now;
 
+	/* Keep how much is left for the next period */
+	sa->period_contrib = delta & (u64)(0xFFFFF);
+
 	scale_freq = arch_scale_freq_capacity(NULL, cpu);
 	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);

> But as I wrote, we may either lose a sub-1ms or gain a sub-1ms;
> statistically, they should even out, given that the load/util updates are
> quite a large number of samples, and we do want a lot of samples for the
> metrics; this is the point of the entire average thing. Plus, as you also
> said, the incomplete current period also plays a (somewhat) negative role
> here.
>
> In addition, excluding the flat hierarchical util patch, we gain quite some
> efficiency.
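[Editor's note] The period/remainder bookkeeping in the diff above can be
sketched in isolation as follows. This is an illustrative toy, not kernel
code: `toy_sched_avg` and `toy_update` are made-up names, and the actual
decay/contribution math of `__update_load_avg()` is left out; only the
~1ms-period counting with the `period_contrib` carry is shown.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the proposed bookkeeping: a "period" is 1024*1024 ns
 * (~1ms), and the sub-period remainder is carried in period_contrib
 * instead of being discarded on each update.
 */
struct toy_sched_avg {
	uint64_t last_update_time;	/* ns at the last completed update */
	uint32_t period_contrib;	/* leftover ns, always < 1 << 20 */
};

/* Return the number of ~1ms periods elapsed since the last update. */
static uint32_t toy_update(struct toy_sched_avg *sa, uint64_t now)
{
	uint64_t delta = now - sa->last_update_time;
	uint32_t periods;

	/* Add what was left over from the previous update. */
	delta += sa->period_contrib;

	/* 1024*1024 ns is used as an approximation of 1ms. */
	periods = delta >> 20;
	if (!periods)
		return 0;	/* last_update_time not advanced: no time is lost */

	sa->last_update_time = now;
	/* Carry the incomplete period into the next update. */
	sa->period_contrib = delta & 0xFFFFF;
	return periods;
}
```

For example, updates at t = 1200000 ns and t = 2100000 ns each cross one
~1ms boundary, so `toy_update` returns 1 both times, and the remainder
(1200000 - 1048576 = 151424 ns after the first) is folded into the next
call's delta rather than dropped, which is what preserves precision for
short-running tasks.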