Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933919AbdC3MRS (ORCPT ); Thu, 30 Mar 2017 08:17:18 -0400 Received: from bombadil.infradead.org ([65.50.211.133]:45533 "EHLO bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932823AbdC3MRH (ORCPT ); Thu, 30 Mar 2017 08:17:07 -0400 Date: Thu, 30 Mar 2017 14:16:58 +0200 From: Peter Zijlstra To: Paul Turner Cc: Yuyang Du , Ingo Molnar , LKML , Benjamin Segall , Morten Rasmussen , Vincent Guittot , Dietmar Eggemann , Matt Fleming , umgwanakikbuti@gmail.com Subject: Re: [RESEND PATCH 2/2] sched/fair: Optimize __update_sched_avg() Message-ID: <20170330121658.6mo7datma4ssw7st@hirez.programming.kicks-ass.net> References: <1486935863-25251-1-git-send-email-yuyang.du@intel.com> <1486935863-25251-3-git-send-email-yuyang.du@intel.com> <20170328144625.GX3093@worktop> <20170329000442.GC2459@ydu19desktop> <20170329104126.lg6ismevfbqywpcj@hirez.programming.kicks-ass.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: NeoMutt/20170113 (1.7.2) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4501 Lines: 134 On Thu, Mar 30, 2017 at 04:21:08AM -0700, Paul Turner wrote: > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -2767,7 +2767,7 @@ static const u32 __accumulated_sum_N32[] > > * Approximate: > > * val * y^n, where y^32 ~= 0.5 (~1 scheduling period) > > */ > > -static __always_inline u64 decay_load(u64 val, u64 n) > > +static u64 decay_load(u64 val, u64 n) > > { > > unsigned int local_n; > > > > @@ -2795,32 +2795,113 @@ static __always_inline u64 decay_load(u6 > > return val; > > } > > > > -/* > > - * For updates fully spanning n periods, the contribution to runnable > > - * average will be: \Sum 1024*y^n > > - * > > - * We can compute this reasonably efficiently by combining: > > - * y^PERIOD = 1/2 with precomputed \Sum 1024*y^n {for n > - */ > > -static u32 __compute_runnable_contrib(u64 n) > > +static u32 __accumulate_sum(u64 periods, u32 period_contrib, u32 remainder) > > - The naming here is really ambiguous: > "__accumulate_sum" -> "__accumulate_pelt_segments"? OK, I did struggle with that a bit too but failed to improve, I'll change it. > - Passing in "remainder" seems irrelevant to the sum accumulation. It would be > more clear to handle it from the caller. Well, this way we have all 3 delta parts in one function. I'll try it and see what it looks like though. > > > > { > > - u32 contrib = 0; > > + u32 c1, c2, c3 = remainder; /* y^0 == 1 */ > > > > - if (likely(n <= LOAD_AVG_PERIOD)) > > - return runnable_avg_yN_sum[n]; > > - else if (unlikely(n >= LOAD_AVG_MAX_N)) > > + if (!periods) > > + return remainder - period_contrib; > > This is super confusing. It only works because remainder already had > period_contrib aggregated _into_ it. We're literally computing: > remainder + period_contrib - period_contrib Correct; although I didn't find it too confusing. Could be because I'd been staring at this code for a few hours though. > We should just not call this in the !periods case and handle the remainder > below. I'll change it see what it looks like. > > + > > + if (unlikely(periods >= LOAD_AVG_MAX_N)) > > return LOAD_AVG_MAX; > > > > - return contrib + runnable_avg_yN_sum[n]; > > + /* > > + * c1 = d1 y^(p+1) > > + */ > > + c1 = decay_load((u64)(1024 - period_contrib), periods); > > + > > + periods -= 1; > > + /* > > + * For updates fully spanning n periods, the contribution to runnable > > + * average will be: > > + * > > + * c2 = 1024 \Sum y^n > > + * > > + * We can compute this reasonably efficiently by combining: > > + * > > + * y^PERIOD = 1/2 with precomputed 1024 \Sum y^n {for: n < PERIOD} > > + */ > > + if (likely(periods <= LOAD_AVG_PERIOD)) { > > + c2 = runnable_avg_yN_sum[periods]; > > + } else { > > + c2 = __accumulated_sum_N32[periods/LOAD_AVG_PERIOD]; > > This still wants the comment justifying why we can't overflow. You mean why periods/LOAD_AVG_PERIOD < ARRAY_SIZE(__accumulated_sum_N32) ? Or something else? > > + periods %= LOAD_AVG_PERIOD; > > + c2 = decay_load(c2, periods); > > + c2 += runnable_avg_yN_sum[periods]; > > + } > > + > > + return c1 + c2 + c3; > > } > > + /* > > + * Step 2 > > + */ > > + delta %= 1024; > > + contrib = __accumulate_sum(periods, sa->period_contrib, delta); > > + sa->period_contrib = delta; > > + > > + contrib = cap_scale(contrib, scale_freq); > > + if (weight) { > > + sa->load_sum += weight * contrib; > > Is this correct in the iterated periods > LOAD_AVG_MAX_N case? > I don't think the decay above is guaranteed to return these to zero. Ah! Indeed, so decay_load() needs LOAD_AVG_PERIOD * 63 before it truncates to 0, because every LOAD_AVG_PERIOD we half the value; loose 1 bit; so 63 of those and we're 0. But __accumulate_sum() OTOH returns LOAD_AVG_MAX after only LOAD_AVG_MAX_N, which < LOAD_AVG_PERIOD * 63. So yes, combined we exceed LOAD_AVG_MAX, which is bad. Let me think what to do about that. > > + if (cfs_rq) > > + cfs_rq->runnable_load_sum += weight * contrib; > > + } > > + if (running) > > + sa->util_sum += contrib * scale_cpu; > > + > > + return periods; > > +}