Date: Tue, 14 Jul 2015 07:59:23 +0800
From: Yuyang Du
To: Dietmar Eggemann
Cc: mingo@kernel.org, peterz@infradead.org, linux-kernel@vger.kernel.org,
    pjt@google.com, bsegall@google.com, Morten Rasmussen,
    vincent.guittot@linaro.org, len.brown@intel.com,
    rafael.j.wysocki@intel.com, fengguang.wu@intel.com,
    boqun.feng@gmail.com, srikar@linux.vnet.ibm.com
Subject: Re: [PATCH v9 2/4] sched: Rewrite runnable load and utilization average tracking
Message-ID: <20150713235923.GJ5197@intel.com>
References: <1435018085-7004-1-git-send-email-yuyang.du@intel.com>
 <1435018085-7004-3-git-send-email-yuyang.du@intel.com>
 <55A3F097.8040101@arm.com>
In-Reply-To: <55A3F097.8040101@arm.com>

Hi Dietmar,

On Mon, Jul 13, 2015 at 06:08:39PM +0100, Dietmar Eggemann wrote:
> Hi Yuyang,
>
> I did some testing of your new pelt implementation.
>
> TC 1: one nice-0 60% task affine to cpu1 in root tg and 2 nice-0 20%
> periodic tasks affine to cpu1 in a task group with id=3 (one hierarchy).
>
> TC 2: 10 nice-0 5% tasks affine to cpu1 in a task group with id=3 (one
> hierarchy).
>
> and compared the results (the se (tasks and tg representation for cpu1),
> cfs_rq and tg related pelt signals) with the current pelt implementation.
>
> The signals are very similar (taking the differences due to
> separated/missing blocked load/util in the current pelt and the slightly
> different behaviour in transitional phases (e.g. task enqueue/dequeue)
> into consideration).
>

Thanks for the testing and the comments.

> I haven't done any performance related tests yet.
>
> > +/*
> > + * The load_avg/util_avg represents an infinite geometric series:
> > + * 1) load_avg describes the amount of time that a sched_entity
> > + *    is runnable on a rq. It is based on both load_sum and the
> > + *    weight of the task.
> > + * 2) util_avg describes the amount of time that a sched_entity
> > + *    is running on a CPU. It is based on util_sum and is scaled
> > + *    in the range [0..SCHED_LOAD_SCALE].
>
> sa->load_[avg/sum] and sa->util_[avg/sum] are also used for the
> aggregated load/util values on the cfs_rq's.

Yes.

> >  /*
> > - * Aggregate cfs_rq runnable averages into an equivalent task_group
> > - * representation for computing load contributions.
> > + * Updating tg's load_avg is necessary before update_cfs_share (which is done)
> > + * and effective_load (which is not done because it is too costly).
> >   */
> > -static inline void __update_tg_runnable_avg(struct sched_avg *sa,
> > -                                            struct cfs_rq *cfs_rq)
> > +static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
> >  {
>
> This function is always called with force=0? I remember that there was
> some discussion about this in your v5 (error bounds of '/ 64'), but since
> it is not used ...

IIRC, Peter said the cache line is hot, but we also need to bound the error.
What I did is bound the error as much as the current code does, and keep the
option (the "force" argument here) in case someone wants to flush the error
completely some day.
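
In other words, the helper boils down to something like the sketch below.
This is illustrative rather than the exact hunk; in particular the
atomic_long_add() on tg->load_avg is my reconstruction of the part not
quoted above:

/* Sketch only: shows the '/ 64' error bound and the force override. */
static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
{
        long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

        /*
         * Only touch the contended tg->load_avg cache line when the
         * accumulated error exceeds ~1/64 (~1.5%) of the last value we
         * published, or when the caller explicitly asks to flush it.
         */
        if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
                atomic_long_add(delta, &cfs_rq->tg->load_avg);
                cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
        }
}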
> > -        struct task_group *tg = cfs_rq->tg;
> > -        long contrib;
> > -
> > -        /* The fraction of a cpu used by this cfs_rq */
> > -        contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT,
> > -                          sa->avg_period + 1);
> > -        contrib -= cfs_rq->tg_runnable_contrib;
> > +        long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> >
> > -        if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
> > -                atomic_add(contrib, &tg->runnable_avg);
> > -                cfs_rq->tg_runnable_contrib += contrib;
> > -        }
> > -}
>
> [...]
>
> > -static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
> > -
> > -/* Update a sched_entity's runnable average */
> > -static inline void update_entity_load_avg(struct sched_entity *se,
> > -                                          int update_cfs_rq)
> > +/* Update task and its cfs_rq load average */
> > +static inline void update_load_avg(struct sched_entity *se, int update_tg)
> >  {
> >          struct cfs_rq *cfs_rq = cfs_rq_of(se);
> > -        long contrib_delta, utilization_delta;
> >          int cpu = cpu_of(rq_of(cfs_rq));
> > -        u64 now;
> > +        u64 now = cfs_rq_clock_task(cfs_rq);
> >
> >          /*
> > -         * For a group entity we need to use their owned cfs_rq_clock_task() in
> > -         * case they are the parent of a throttled hierarchy.
> > +         * Track task load average for carrying it to new CPU after migrated, and
> > +         * track group sched_entity load average for task_h_load calc in migration
> >           */
> > -        if (entity_is_task(se))
> > -                now = cfs_rq_clock_task(cfs_rq);
> > -        else
> > -                now = cfs_rq_clock_task(group_cfs_rq(se));
>
> Why don't you make this distinction while getting 'now' between se's
> representing tasks or task groups anymore?

My memory is a little blurry, but I think it is because (1) we no longer
distinguish between task and task-group SEs, and (2) since each cfs_rq now
tracks its own load average, a throttled cfs_rq uses its stopped clock when
it updates its load. Thinking about it again, though, it seems neither
version computes the right load average for throttled load, and it is not
clear to me how group quota should be correlated with the averaged load.

> > +        __update_load_avg(now, cpu, &se->avg,
> > +                se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se);
> >
> > -        if (!__update_entity_runnable_avg(now, cpu, &se->avg, se->on_rq,
> > -                                          cfs_rq->curr == se))
> > -                return;
> > -
> > -        contrib_delta = __update_entity_load_avg_contrib(se);
> > -        utilization_delta = __update_entity_utilization_avg_contrib(se);
> > -
> > -        if (!update_cfs_rq)
> > -                return;
> > -
> > -        if (se->on_rq) {
> > -                cfs_rq->runnable_load_avg += contrib_delta;
> > -                cfs_rq->utilization_load_avg += utilization_delta;
> > -        } else {
> > -                subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
> > -        }
> > +        if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
> > +                update_tg_load_avg(cfs_rq, 0);
> >  }
>
> [...]
>
> > -
> >  static void update_blocked_averages(int cpu)
>
> The name of this function now becomes misleading since you don't update
> blocked averages any more. Existing pelt calls
> __update_blocked_averages_cpu() -> update_cfs_rq_blocked_load() ->
> subtract_blocked_load_contrib() for the whole tg tree.
>
> Whereas you update cfs_rq.avg->[load/util]_[avg/sum] and conditionally
> tg->load_avg and cfs_rq->tg_load_avg_contrib.

I actually gave some thought to the naming. We no longer track blocked load
explicitly, but the function still exists largely for the "silent" blocked
load: it keeps the cfs_rq and tg averages updated as that load decays.
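
For what it is worth, the reworked function roughly reduces to the shape
below (a sketch from memory of the group-scheduling variant, so treat the
details as illustrative rather than the exact hunk): the periodic decay done
by update_cfs_rq_load_avg() is what ages the load left behind by blocked
entities, and the tg is only touched when the cfs_rq average actually moved.

/* Sketch only: walk each leaf cfs_rq and let the decay do the work. */
static void update_blocked_averages(int cpu)
{
        struct rq *rq = cpu_rq(cpu);
        struct cfs_rq *cfs_rq;
        unsigned long flags;

        raw_spin_lock_irqsave(&rq->lock, flags);
        update_rq_clock(rq);

        /* Bottom-up walk of the task_group tree via the leaf cfs_rq list. */
        for_each_leaf_cfs_rq(rq, cfs_rq) {
                /*
                 * update_cfs_rq_load_avg() decays the "silent" blocked
                 * load; propagate to the task_group only if the cfs_rq
                 * average changed.
                 */
                if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq))
                        update_tg_load_avg(cfs_rq, 0);
        }

        raw_spin_unlock_irqrestore(&rq->lock, flags);
}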
Thanks,
Yuyang