Date: Thu, 30 Mar 2017 14:16:58 +0200
From: Peter Zijlstra <peterz@infradead.org>
To: Paul Turner <pjt@google.com>
Cc: Yuyang Du <yuyang.du@intel.com>, Ingo Molnar <mingo@kernel.org>,
        LKML <linux-kernel@vger.kernel.org>,
        Benjamin Segall <bsegall@google.com>,
        Morten Rasmussen <morten.rasmussen@arm.com>,
        Vincent Guittot <vincent.guittot@linaro.org>,
        Dietmar Eggemann <dietmar.eggemann@arm.com>,
        Matt Fleming <matt@codeblueprint.co.uk>, umgwanakikbuti@gmail.com
Subject: Re: [RESEND PATCH 2/2] sched/fair: Optimize __update_sched_avg()
Message-ID: <20170330121658.6mo7datma4ssw7st@hirez.programming.kicks-ass.net>
References: <1486935863-25251-1-git-send-email-yuyang.du@intel.com>
 <1486935863-25251-3-git-send-email-yuyang.du@intel.com>
 <20170328144625.GX3093@worktop>
 <20170329000442.GC2459@ydu19desktop>
 <20170329104126.lg6ismevfbqywpcj@hirez.programming.kicks-ass.net>
 <CAPM31RLHHKVBvVn04ZmvNQ=69f3M9WyFBUNPOoO-8=Q207ioxA@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAPM31RLHHKVBvVn04ZmvNQ=69f3M9WyFBUNPOoO-8=Q207ioxA@mail.gmail.com>
User-Agent: NeoMutt/20170113 (1.7.2)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4501
Lines: 134

On Thu, Mar 30, 2017 at 04:21:08AM -0700, Paul Turner wrote:
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2767,7 +2767,7 @@ static const u32 __accumulated_sum_N32[]
> >   * Approximate:
> >   *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
> >   */
> > -static __always_inline u64 decay_load(u64 val, u64 n)
> > +static u64 decay_load(u64 val, u64 n)
> >  {
> >         unsigned int local_n;
> >
> > @@ -2795,32 +2795,113 @@ static __always_inline u64 decay_load(u6
> >         return val;
> >  }
> >
> > -/*
> > - * For updates fully spanning n periods, the contribution to runnable
> > - * average will be: \Sum 1024*y^n
> > - *
> > - * We can compute this reasonably efficiently by combining:
> > - *   y^PERIOD = 1/2 with precomputed \Sum 1024*y^n {for  n <PERIOD}
> > - */
> > -static u32 __compute_runnable_contrib(u64 n)
> > +static u32 __accumulate_sum(u64 periods, u32 period_contrib, u32 remainder)
> 
> - The naming here is really ambiguous:
>     "__accumulate_sum" -> "__accumulate_pelt_segments"?

OK, I did struggle with that a bit too but failed to improve, I'll change it.

> - Passing in "remainder" seems irrelevant to the sum accumulation.  It would be
>   more clear to handle it from the caller.

Well, this way we have all 3 delta parts in one function. I'll try it
and see what it looks like though.

> >
> >  {
> > -       u32 contrib = 0;
> > +       u32 c1, c2, c3 = remainder; /* y^0 == 1 */
> >
> > -       if (likely(n <= LOAD_AVG_PERIOD))
> > -               return runnable_avg_yN_sum[n];
> > -       else if (unlikely(n >= LOAD_AVG_MAX_N))
> > +       if (!periods)
> > +               return remainder - period_contrib;
> 
> This is super confusing.  It only works because remainder already had
> period_contrib aggregated _into_ it.  We're literally computing:
>   remainder + period_contrib - period_contrib

Correct; although I didn't find it too confusing. Could be because I'd
been staring at this code for a few hours though.

> We should just not call this in the !periods case and handle the remainder
> below.

I'll change it see what it looks like.

> > +
> > +       if (unlikely(periods >= LOAD_AVG_MAX_N))
> >                 return LOAD_AVG_MAX;
> >
> > -       return contrib + runnable_avg_yN_sum[n];
> > +       /*
> > +        * c1 = d1 y^(p+1)
> > +        */
> > +       c1 = decay_load((u64)(1024 - period_contrib), periods);
> > +
> > +       periods -= 1;
> > +       /*
> > +        * For updates fully spanning n periods, the contribution to runnable
> > +        * average will be:
> > +        *
> > +        *   c2 = 1024 \Sum y^n
> > +        *
> > +        * We can compute this reasonably efficiently by combining:
> > +        *
> > +        *   y^PERIOD = 1/2 with precomputed 1024 \Sum y^n {for: n < PERIOD}
> > +        */
> > +       if (likely(periods <= LOAD_AVG_PERIOD)) {
> > +               c2 = runnable_avg_yN_sum[periods];
> > +       } else {
> > +               c2 = __accumulated_sum_N32[periods/LOAD_AVG_PERIOD];
> 
> This still wants the comment justifying why we can't overflow.

You mean why periods/LOAD_AVG_PERIOD < ARRAY_SIZE(__accumulated_sum_N32)
? Or something else?

> > +               periods %= LOAD_AVG_PERIOD;
> > +               c2 = decay_load(c2, periods);
> > +               c2 += runnable_avg_yN_sum[periods];
> > +       }
> > +
> > +       return c1 + c2 + c3;
> >  }


> > +       /*
> > +        * Step 2
> > +        */
> > +       delta %= 1024;
> > +       contrib = __accumulate_sum(periods, sa->period_contrib, delta);
> > +       sa->period_contrib = delta;
> > +
> > +       contrib = cap_scale(contrib, scale_freq);
> > +       if (weight) {
> > +               sa->load_sum += weight * contrib;
> 
> Is this correct in the iterated periods > LOAD_AVG_MAX_N case?
> I don't think the decay above is guaranteed to return these to zero.

Ah!

Indeed, so decay_load() needs LOAD_AVG_PERIOD * 63 before it truncates
to 0, because every LOAD_AVG_PERIOD we half the value; loose 1 bit; so
63 of those and we're 0.

But __accumulate_sum() OTOH returns LOAD_AVG_MAX after only
LOAD_AVG_MAX_N, which < LOAD_AVG_PERIOD * 63.

So yes, combined we exceed LOAD_AVG_MAX, which is bad. Let me think what
to do about that.

> > +               if (cfs_rq)
> > +                       cfs_rq->runnable_load_sum += weight * contrib;
> > +       }
> > +       if (running)
> > +               sa->util_sum += contrib * scale_cpu;
> > +
> > +       return periods;
> > +}