Message-ID: <55A3F097.8040101@arm.com>
Date: Mon, 13 Jul 2015 18:08:39 +0100
From: Dietmar Eggemann
To: Yuyang Du, "mingo@kernel.org", "peterz@infradead.org",
 "linux-kernel@vger.kernel.org"
CC: "pjt@google.com", "bsegall@google.com", Morten Rasmussen,
 "vincent.guittot@linaro.org", "len.brown@intel.com",
 "rafael.j.wysocki@intel.com", "fengguang.wu@intel.com",
 "boqun.feng@gmail.com", "srikar@linux.vnet.ibm.com"
Subject: Re: [PATCH v9 2/4] sched: Rewrite runnable load and utilization
 average tracking
References: <1435018085-7004-1-git-send-email-yuyang.du@intel.com>
 <1435018085-7004-3-git-send-email-yuyang.du@intel.com>
In-Reply-To: <1435018085-7004-3-git-send-email-yuyang.du@intel.com>

Hi Yuyang,

I did some testing of your new pelt implementation:

TC 1: one nice-0 60% task affine to cpu1 in the root tg, plus two nice-0
      20% periodic tasks affine to cpu1 in a task group with id=3 (one
      hierarchy).

TC 2: 10 nice-0 5% tasks affine to cpu1 in a task group with id=3 (one
      hierarchy).

I compared the results (the pelt signals of the se's (the tasks and the tg
representation for cpu1) as well as the related cfs_rq and tg signals) with
the current pelt implementation. The signals are very similar, taking into
account the differences due to the separated/missing blocked load/util in
the current pelt and the slightly different behaviour in transitional
phases (e.g. task enqueue/dequeue). (A minimal duty-cycle loop that can
generate such periodic test tasks is sketched at the end of this mail.)

I haven't done any performance related tests yet.

-- Dietmar

On 23/06/15 01:08, Yuyang Du wrote:
> The idea of runnable load average (let runnable time contribute to weight)
> was proposed by Paul Turner, and it is still followed by this rewrite. This
> rewrite aims to solve the following issues:

[...]

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index af0eeba..8b4bc4f 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1183,29 +1183,23 @@ struct load_weight {
>  	u32 inv_weight;
>  };
>
> +/*
> + * The load_avg/util_avg represents an infinite geometric series:
> + * 1) load_avg describes the amount of time that a sched_entity
> + *    is runnable on a rq. It is based on both load_sum and the
> + *    weight of the task.
> + * 2) util_avg describes the amount of time that a sched_entity
> + *    is running on a CPU. It is based on util_sum and is scaled
> + *    in the range [0..SCHED_LOAD_SCALE].

sa->load_[avg/sum] and sa->util_[avg/sum] are also used for the aggregated
load/util values on the cfs_rq's.
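
To make the constants in the next quoted lines easier to follow, here is a
toy model of the decayed sums these fields hold. It is plain userspace C,
not the kernel's fixed-point code; the only kernel-side values referenced
are LOAD_AVG_MAX (47742) and the nice -20 weight (88761) quoted below.

/*
 * Toy model of the decayed sums behind struct sched_avg (illustration
 * only, not the kernel's fixed-point implementation): every ~1ms period
 * contributes up to 1024, and all older periods decay by y per period,
 * with y chosen so that y^32 = 1/2.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
	const double y = pow(0.5, 1.0 / 32.0);	/* y^32 = 1/2 */
	double sum = 0.0;
	int n;

	/* always-runnable entity: each period contributes the full 1024 */
	for (n = 0; n < 1000; n++)
		sum = sum * y + 1024.0;

	/*
	 * Converges to 1024/(1 - y) ~= 47.8k; the kernel's fixed-point
	 * version of this series comes out at LOAD_AVG_MAX = 47742.
	 */
	printf("max decayed sum ~= %.0f\n", sum);

	/*
	 * The 64 bit head-room figure from the comment quoted below:
	 * 2^64 / 47742 / 88761 ~= 4.35e9 always-runnable nice -20 entities.
	 */
	printf("entities ~= %.0f\n", pow(2, 64) / 47742.0 / 88761.0);

	return 0;
}

Nothing new in there, it just shows where the 47742 and 4353082796 figures
come from.
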
> + * The 64 bit load_sum can:
> + * 1) for cfs_rq, afford 4353082796 (=2^64/47742/88761) entities with
> + *    the highest weight (=88761) always runnable, we should not overflow
> + * 2) for entity, support any load.weight always runnable
> + */
>  struct sched_avg {
> -	u64 last_runnable_update;
> -	s64 decay_count;
> -	/*
> -	 * utilization_avg_contrib describes the amount of time that a
> -	 * sched_entity is running on a CPU. It is based on running_avg_sum
> -	 * and is scaled in the range [0..SCHED_LOAD_SCALE].
> -	 * load_avg_contrib described the amount of time that a sched_entity
> -	 * is runnable on a rq. It is based on both runnable_avg_sum and the
> -	 * weight of the task.
> -	 */
> -	unsigned long load_avg_contrib, utilization_avg_contrib;
> -	/*
> -	 * These sums represent an infinite geometric series and so are bound
> -	 * above by 1024/(1-y). Thus we only need a u32 to store them for all
> -	 * choices of y < 1-2^(-32)*1024.
> -	 * running_avg_sum reflects the time that the sched_entity is
> -	 * effectively running on the CPU.
> -	 * runnable_avg_sum represents the amount of time a sched_entity is on
> -	 * a runqueue which includes the running time that is monitored by
> -	 * running_avg_sum.
> -	 */
> -	u32 runnable_avg_sum, avg_period, running_avg_sum;
> +	u64 last_update_time, load_sum;
> +	u32 util_sum, period_contrib;
> +	unsigned long load_avg, util_avg;
>  };

[...]

>  /*
> - * Aggregate cfs_rq runnable averages into an equivalent task_group
> - * representation for computing load contributions.
> + * Updating tg's load_avg is necessary before update_cfs_share (which is done)
> + * and effective_load (which is not done because it is too costly).
>   */
> -static inline void __update_tg_runnable_avg(struct sched_avg *sa,
> -					    struct cfs_rq *cfs_rq)
> +static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>  {

This function is always called with force=0, right? I remember that there
was some discussion about this in your v5 (error bounds of the '/ 64'),
but since force is not used ...

> -	struct task_group *tg = cfs_rq->tg;
> -	long contrib;
> -
> -	/* The fraction of a cpu used by this cfs_rq */
> -	contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT,
> -			  sa->avg_period + 1);
> -	contrib -= cfs_rq->tg_runnable_contrib;
> +	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>
> -	if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
> -		atomic_add(contrib, &tg->runnable_avg);
> -		cfs_rq->tg_runnable_contrib += contrib;
> -	}
> -}

[...]

> -static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
> -
> -/* Update a sched_entity's runnable average */
> -static inline void update_entity_load_avg(struct sched_entity *se,
> -					  int update_cfs_rq)
> +/* Update task and its cfs_rq load average */
> +static inline void update_load_avg(struct sched_entity *se, int update_tg)
>  {
>  	struct cfs_rq *cfs_rq = cfs_rq_of(se);
> -	long contrib_delta, utilization_delta;
>  	int cpu = cpu_of(rq_of(cfs_rq));
> -	u64 now;
> +	u64 now = cfs_rq_clock_task(cfs_rq);
>
>  	/*
> -	 * For a group entity we need to use their owned cfs_rq_clock_task() in
> -	 * case they are the parent of a throttled hierarchy.
> +	 * Track task load average for carrying it to new CPU after migrated, and
> +	 * track group sched_entity load average for task_h_load calc in migration
>  	 */
> -	if (entity_is_task(se))
> -		now = cfs_rq_clock_task(cfs_rq);
> -	else
> -		now = cfs_rq_clock_task(group_cfs_rq(se));

Why don't you make this distinction between se's representing tasks and
se's representing task groups anymore when getting 'now'?
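
For readers following along, this is the distinction in question (sketch
only, not code from the patch; helper semantics as in mainline:
cfs_rq_of(se) is the cfs_rq the entity is queued on, group_cfs_rq(se) is
the cfs_rq a group entity owns, and cfs_rq_clock_task() discounts
throttled time):

/* Sketch (not from the patch) of the clock selection the old
 * update_entity_load_avg() performed: */
static u64 entity_clock_task_old(struct sched_entity *se)
{
	/* task se: use the clock of the cfs_rq it is queued on */
	if (entity_is_task(se))
		return cfs_rq_clock_task(cfs_rq_of(se));

	/*
	 * group se: use the clock of the cfs_rq it owns (se->my_q), so
	 * that time throttled inside its own hierarchy is not accrued.
	 */
	return cfs_rq_clock_task(group_cfs_rq(se));
}

With the rewrite both cases use the clock of cfs_rq_of(se), hence the
question above.
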
> +	__update_load_avg(now, cpu, &se->avg,
> +		se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se);
>
> -	if (!__update_entity_runnable_avg(now, cpu, &se->avg, se->on_rq,
> -					  cfs_rq->curr == se))
> -		return;
> -
> -	contrib_delta = __update_entity_load_avg_contrib(se);
> -	utilization_delta = __update_entity_utilization_avg_contrib(se);
> -
> -	if (!update_cfs_rq)
> -		return;
> -
> -	if (se->on_rq) {
> -		cfs_rq->runnable_load_avg += contrib_delta;
> -		cfs_rq->utilization_load_avg += utilization_delta;
> -	} else {
> -		subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
> -	}
> +	if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
> +		update_tg_load_avg(cfs_rq, 0);
>  }

[...]

> -
>  static void update_blocked_averages(int cpu)

The name of this function now becomes misleading, since you don't update
blocked averages any more. The existing pelt code calls
__update_blocked_averages_cpu() -> update_cfs_rq_blocked_load() ->
subtract_blocked_load_contrib() for the whole tg tree, whereas you update
cfs_rq->avg.[load/util]_[avg/sum] and conditionally tg->load_avg and
cfs_rq->tg_load_avg_contrib.

[...]
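
P.S. In case somebody wants to recreate something like TC 1/TC 2: a
periodic task with a given duty cycle (x busy ms out of a 100ms period,
affine to cpu1) can be generated with a small loop like the one below.
The 100ms period and all names are just an illustration; moving the pids
into the id=3 task group is done separately via the cpu cgroup controller.

/* Illustrative periodic duty-cycle task: ./dutycycle <busy_ms 0..100> */
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* spin for roughly 'ms' milliseconds */
static void busy_for(long ms)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	do {
		clock_gettime(CLOCK_MONOTONIC, &t1);
	} while ((t1.tv_sec - t0.tv_sec) * 1000 +
		 (t1.tv_nsec - t0.tv_nsec) / 1000000 < ms);
}

int main(int argc, char **argv)
{
	long busy_ms = argc > 1 ? atol(argv[1]) : 60;	/* e.g. 60 -> ~60% */
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(1, &set);				/* affine to cpu1 */
	sched_setaffinity(0, sizeof(set), &set);

	for (;;) {
		busy_for(busy_ms);			/* busy part of period */
		usleep((100 - busy_ms) * 1000);		/* sleep the rest */
	}
}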