From: Vincent Guittot
Date: Mon, 11 Apr 2016 11:08:04 +0200
Subject: Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue
To: Yuyang Du
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann, Juri Lelli
In-Reply-To: <1460327765-18024-3-git-send-email-yuyang.du@intel.com>
References: <1460327765-18024-1-git-send-email-yuyang.du@intel.com> <1460327765-18024-3-git-send-email-yuyang.du@intel.com>

Hi Yuyang,

On 11 April 2016 at 00:36, Yuyang Du wrote:
> In __update_load_avg(), the current period is never complete. This
> basically leads to a slightly over-decayed average, say on average we

Yes, that also explains why we almost never reach the max value.

> have 50% current period, then we will lose 1.08%(=(1-0.5^(1/64)) of
> past avg. More importantly, the incomplete current period significantly
> complicates the avg computation, even a full period is only about 1ms.
>
> So we attempt to drop it. The outcome is that for any x.y periods to
> update, we will either lose the .y period or unduely gain (1-.y) period.
> How big is the impact? For a large x (say 20ms), you barely notice the
> difference, which is plus/minus 1% (=(before-after)/before). Moreover,
> the aggregated losses and gains in the long run should statistically
> even out.
>
> Signed-off-by: Yuyang Du
> ---
>  include/linux/sched.h |   2 +-
>  kernel/sched/fair.c   | 170 +++++++++++++++++++------------------------------
>  2 files changed, 66 insertions(+), 106 deletions(-)
>

[snip]

>
>  #if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
> @@ -2704,11 +2694,14 @@ static __always_inline int
>  __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>                   unsigned long weight, int running, struct cfs_rq *cfs_rq)
>  {
> -       u64 delta, scaled_delta, periods;
> -       u32 contrib;
> -       unsigned int delta_w, scaled_delta_w, decayed = 0;
> +       u64 delta;
> +       u32 contrib, periods;
>         unsigned long scale_freq, scale_cpu;
>
> +       /*
> +        * now rolls down to a period boundary
> +        */
> +       now = now && (u64)(~0xFFFFF);
>         delta = now - sa->last_update_time;
>         /*
>          * This should only happen when time goes backwards, which it
> @@ -2720,89 +2713,56 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>         }
>
>         /*
> -        * Use 1024ns as the unit of measurement since it's a reasonable
> -        * approximation of 1us and fast to compute.
> +        * Use 1024*1024ns as an approximation of 1ms period, pretty close.
>          */
> -       delta >>= 10;
> -       if (!delta)
> +       periods = delta >> 20;
> +       if (!periods)
>                 return 0;
>         sa->last_update_time = now;

The optimization looks quite interesting, but I see one potential issue
with migration: we will lose the part of the ongoing period that is no
longer saved anywhere. This lost part can be quite significant for a
short task that ping-pongs between CPUs.
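
To make the concern concrete, here is a rough userspace sketch (my own
illustration, not code from the patch): it assumes ~0.5ms run bursts, a
migration after every burst, and that each migration drops the in-flight
sub-period time now that period_contrib is gone. Under those assumptions
the task accrues no running time at all:

#include <stdio.h>

#define PERIOD_NS       (1024ULL * 1024)        /* one PELT period, ~1ms */

int main(void)
{
        /* assumed workload: short bursts, migrating after each one */
        unsigned long long run_ns = 512 * 1024;         /* ~0.5ms per burst */
        unsigned long long old_acct = 0, new_acct = 0;
        int i, migrations = 100;

        for (i = 0; i < migrations; i++) {
                /* old scheme: the sub-period remainder is carried over
                 * in period_contrib, so every burst is accounted */
                old_acct += run_ns;
                /* patched scheme (as I read it): only whole periods are
                 * accumulated, the ongoing period is lost at migration */
                new_acct += (run_ns / PERIOD_NS) * PERIOD_NS;
        }

        printf("accounted with period_contrib:    %llu ns\n", old_acct);
        printf("accounted without period_contrib: %llu ns\n", new_acct);
        return 0;
}

With these (admittedly extreme) numbers the task looks completely idle to
PELT, even though it actually ran for roughly 50ms.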
>
>         scale_freq = arch_scale_freq_capacity(NULL, cpu);
>         scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
>
> -       /* delta_w is the amount already accumulated against our next period */
> -       delta_w = sa->period_contrib;
> -       if (delta + delta_w >= 1024) {
> -               decayed = 1;
> -
> -               /* how much left for next period will start over, we don't know yet */
> -               sa->period_contrib = 0;
> -
> -               /*
> -                * Now that we know we're crossing a period boundary, figure
> -                * out how much from delta we need to complete the current
> -                * period and accrue it.
> -                */
> -               delta_w = 1024 - delta_w;
> -               scaled_delta_w = cap_scale(delta_w, scale_freq);
> -               if (weight) {
> -                       sa->load_sum += weight * scaled_delta_w;
> -                       if (cfs_rq) {
> -                               cfs_rq->runnable_load_sum +=
> -                                               weight * scaled_delta_w;
> -                       }
> -               }
> -               if (running)
> -                       sa->util_sum += scaled_delta_w * scale_cpu;
> -
> -               delta -= delta_w;
> -
> -               /* Figure out how many additional periods this update spans */
> -               periods = delta / 1024;
> -               delta %= 1024;
> +       /*
> +        * Now we know we're crossing period boundaries, and the *_sum accrues by
> +        * two steps.
> +        */
>
> -               sa->load_sum = decay_load(sa->load_sum, periods + 1);
> -               if (cfs_rq) {
> -                       cfs_rq->runnable_load_sum =
> -                               decay_load(cfs_rq->runnable_load_sum, periods + 1);
> -               }
> -               sa->util_sum = decay_load((u64)(sa->util_sum), periods + 1);
> -
> -               /* Efficiently calculate \sum (1..n_period) 1024*y^i */
> -               contrib = __compute_runnable_contrib(periods);
> -               contrib = cap_scale(contrib, scale_freq);
> -               if (weight) {
> -                       sa->load_sum += weight * contrib;
> -                       if (cfs_rq)
> -                               cfs_rq->runnable_load_sum += weight * contrib;
> -               }
> -               if (running)
> -                       sa->util_sum += contrib * scale_cpu;
> +       /*
> +        * Step 1: decay old *_sum
> +        */
> +       sa->load_sum = __decay_sum(sa->load_sum, periods);
> +       if (cfs_rq) {
> +               cfs_rq->runnable_load_sum =
> +                       __decay_sum(cfs_rq->runnable_load_sum, periods);
>         }
> +       sa->util_sum = __decay_sum((u64)(sa->util_sum), periods);
>
> -       /* Remainder of delta accrued against u_0` */
> -       scaled_delta = cap_scale(delta, scale_freq);
> +       /*
> +        * Step 2: accumulate and meanwhile decay new *_sum by periods since
> +        * last_update_time
> +        */
> +       contrib = __accumulate_sum(periods);
> +       contrib = cap_scale(contrib, scale_freq);
>         if (weight) {
> -               sa->load_sum += weight * scaled_delta;
> +               sa->load_sum += weight * contrib;
>                 if (cfs_rq)
> -                       cfs_rq->runnable_load_sum += weight * scaled_delta;
> +                       cfs_rq->runnable_load_sum += weight * contrib;
>         }
>         if (running)
> -               sa->util_sum += scaled_delta * scale_cpu;
> +               sa->util_sum += contrib * scale_cpu;
>
> -       sa->period_contrib += delta;
> -
> -       if (decayed) {
> -               sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);
> -               if (cfs_rq) {
> -                       cfs_rq->runnable_load_avg =
> -                               div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
> -               }
> -               sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
> +       /*
> +        * *_sum must have been evolved, we update *_avg
> +        */
> +       sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);
> +       if (cfs_rq) {
> +               cfs_rq->runnable_load_avg =
> +                       div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
>         }
> +       sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
>
> -       return decayed;
> +       return 1;
>  }
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> --
> 2.1.4
>
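
As a side note for anyone following along, the two-step update above can
be approximated in plain userspace C. This is only my reading of it:
__decay_sum()/__accumulate_sum() are replaced here by floating-point
geometric series using the usual PELT decay factor y (y^32 = 0.5), and I
assume the old \sum (1..p) 1024*y^i convention for the accumulate step:

#include <math.h>
#include <stdio.h>

static const double y = 0.978572062;    /* 0.5^(1/32): 32-period half-life */

/* Step 1: decay an old sum over 'periods' elapsed periods */
static double decay_sum(double sum, unsigned int periods)
{
        return sum * pow(y, periods);
}

/* Step 2: contribution of 'periods' fully-run new periods,
 * 1024 * (y + y^2 + ... + y^periods)  (assumed convention) */
static double accumulate_sum(unsigned int periods)
{
        return 1024.0 * y * (1.0 - pow(y, periods)) / (1.0 - y);
}

int main(void)
{
        double sum = 0.0;
        int i;

        /* always-running task, updated once every 5 periods */
        for (i = 0; i < 200; i++)
                sum = decay_sum(sum, 5) + accumulate_sum(5);

        /* with this convention the sum settles around 1024*y/(1-y) ~= 46.7k,
         * a bit below LOAD_AVG_MAX (47742) */
        printf("converged sum: %.0f\n", sum);
        return 0;
}

The kernel's fixed-point lookup tables will of course give slightly
different numbers; the sketch is only meant to show the decay-then-
accumulate structure.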