Subject: Re: [PATCH v9 2/4] sched: Rewrite runnable load and utilization average tracking
From: Wanpeng Li <wanpeng.li@hotmail.com>
To: Yuyang Du, mingo@kernel.org, peterz@infradead.org, linux-kernel@vger.kernel.org
CC: pjt@google.com, bsegall@google.com, morten.rasmussen@arm.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, len.brown@intel.com, rafael.j.wysocki@intel.com, fengguang.wu@intel.com, boqun.feng@gmail.com, srikar@linux.vnet.ibm.com
Date: Tue, 23 Jun 2015 18:30:48 +0800

On 6/23/15 8:08 AM, Yuyang Du wrote:
> The idea of runnable load average (let runnable time contribute to weight)
> was proposed by Paul Turner, and it is still followed by this rewrite. This
> rewrite aims to solve the following issues:
>
> 1. cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is
>    updated at the granularity of one entity at a time, which results in the
>    cfs_rq's load average being stale or partially updated: at any time, only
>    one entity is up to date; all other entities are effectively lagging
>    behind. This is undesirable.
>
>    To illustrate, if we have n runnable entities in the cfs_rq, then as time
>    elapses they certainly become outdated:
>
>    t0: cfs_rq { e1_old, e2_old, ..., en_old }
>
>    and when we update:
>
>    t1: update e1, then we have cfs_rq { e1_new, e2_old, ..., en_old }
>
>    t2: update e2, then we have cfs_rq { e1_old, e2_new, ..., en_old }
>
>    ...
>
>    We solve this by combining all runnable entities' load averages together
>    in the cfs_rq's avg, and updating the cfs_rq's avg as a whole. This is
>    based on the fact that, if we regard the update as a function, then:
>
>    w * update(e) = update(w * e), and
>
>    update(e1) + update(e2) = update(e1 + e2), so
>
>    w1 * update(e1) + w2 * update(e2) = update(w1 * e1 + w2 * e2)
>
>    therefore, by this rewrite, we have an entirely updated cfs_rq at the
>    time we update it:
>
>    t1: update cfs_rq { e1_new, e2_new, ..., en_new }
>
>    t2: update cfs_rq { e1_new, e2_new, ..., en_new }
>
>    ...
>
> 2. cfs_rq's load average differs between the top rq->cfs_rq and other
>    task_groups' per-CPU cfs_rqs in whether or not blocked_load_avg
>    contributes to the load.
>
>    The basic idea behind runnable load average (the same holds for
>    utilization) is that the blocked state is taken into account, as opposed
>    to only accounting for the currently runnable state. Therefore, the
>    average should include both the runnable/running and blocked load
>    averages. This rewrite does that.
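The linearity property used above is easy to see in a standalone sketch. The following userspace C program (illustrative only; decay() and HALF_LIFE are made-up names, and it uses floating point instead of the kernel's fixed-point math) shows that decaying two entity sums separately and adding them gives the same result as decaying their combined sum, which is what allows the cfs_rq average to be maintained as a whole:

#include <math.h>
#include <stdio.h>

#define HALF_LIFE 32    /* periods, mirrors LOAD_AVG_PERIOD */

/* decay a load sum by n periods: sum * y^n, with y^32 == 1/2 */
static double decay(double sum, unsigned int n)
{
        return sum * pow(0.5, (double)n / HALF_LIFE);
}

int main(void)
{
        double e1 = 1024.0, e2 = 3072.0;        /* two entities' load sums */
        unsigned int n = 7;                     /* periods elapsed */

        printf("%.6f\n", decay(e1, n) + decay(e2, n));  /* update(e1) + update(e2) */
        printf("%.6f\n", decay(e1 + e2, n));            /* update(e1 + e2)         */
        return 0;
}

Both lines print the same value (build with -lm).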
> > In addition, we also combine runnable/running and blocked averages > of all entities into the cfs_rq's average, and update it together at > once. This is based on the fact that: > > update(runnable) + update(blocked) = update(runnable + blocked) > > This significantly reduces the codes as we don't need to separately > maintain/update runnable/running load and blocked load. > > 3. How task_group entities' share is calculated is complex and imprecise. > > We reduce the complexity in this rewrite to allow a very simple rule: > the task_group's load_avg is aggregated from its per CPU cfs_rqs's > load_avgs. Then group entity's weight is simply proportional to its > own cfs_rq's load_avg / task_group's load_avg. To illustrate, > > if a task_group has { cfs_rq1, cfs_rq2, ..., cfs_rqn }, then, > > task_group_avg = cfs_rq1_avg + cfs_rq2_avg + ... + cfs_rqn_avg, then > > cfs_rqx's entity's share = cfs_rqx_avg / task_group_avg * task_group's share > > To sum up, this rewrite in principle is equivalent to the current one, but > fixes the issues described above. Turns out, it significantly reduces the > code complexity and hence increases clarity and efficiency. In addition, > the new averages are more smooth/continuous (no spurious spikes and valleys) > and updated more consistently and quickly to reflect the load dynamics. As a > result, we have less load tracking overhead, better performance, and > especially better power efficiency due to more balanced load. > > Signed-off-by: Yuyang Du > --- > include/linux/sched.h | 40 ++-- > kernel/sched/core.c | 3 - > kernel/sched/debug.c | 41 ++-- > kernel/sched/fair.c | 630 +++++++++++++++++--------------------------------- > kernel/sched/sched.h | 28 +-- > 5 files changed, 249 insertions(+), 493 deletions(-) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index af0eeba..8b4bc4f 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1183,29 +1183,23 @@ struct load_weight { > u32 inv_weight; > }; > > +/* > + * The load_avg/util_avg represents an infinite geometric series: > + * 1) load_avg describes the amount of time that a sched_entity > + * is runnable on a rq. It is based on both load_sum and the > + * weight of the task. > + * 2) util_avg describes the amount of time that a sched_entity > + * is running on a CPU. It is based on util_sum and is scaled > + * in the range [0..SCHED_LOAD_SCALE]. > + * The 64 bit load_sum can: > + * 1) for cfs_rq, afford 4353082796 (=2^64/47742/88761) entities with > + * the highest weight (=88761) always runnable, we should not overflow > + * 2) for entity, support any load.weight always runnable > + */ > struct sched_avg { > - u64 last_runnable_update; > - s64 decay_count; > - /* > - * utilization_avg_contrib describes the amount of time that a > - * sched_entity is running on a CPU. It is based on running_avg_sum > - * and is scaled in the range [0..SCHED_LOAD_SCALE]. > - * load_avg_contrib described the amount of time that a sched_entity > - * is runnable on a rq. It is based on both runnable_avg_sum and the > - * weight of the task. > - */ > - unsigned long load_avg_contrib, utilization_avg_contrib; > - /* > - * These sums represent an infinite geometric series and so are bound > - * above by 1024/(1-y). Thus we only need a u32 to store them for all > - * choices of y < 1-2^(-32)*1024. > - * running_avg_sum reflects the time that the sched_entity is > - * effectively running on the CPU. 
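The overflow bound quoted in the new comment above can be sanity-checked with a couple of lines of userspace arithmetic (no kernel types involved):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        /* worst case: every entity always runnable at the highest weight */
        uint64_t per_entity = (uint64_t)88761 * 47742;  /* weight * LOAD_AVG_MAX */

        printf("%llu\n", (unsigned long long)(UINT64_MAX / per_entity));
        /* ~4353082796, the figure quoted in the comment */
        return 0;
}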
> - * runnable_avg_sum represents the amount of time a sched_entity is on > - * a runqueue which includes the running time that is monitored by > - * running_avg_sum. > - */ > - u32 runnable_avg_sum, avg_period, running_avg_sum; > + u64 last_update_time, load_sum; > + u32 util_sum, period_contrib; > + unsigned long load_avg, util_avg; > }; > > #ifdef CONFIG_SCHEDSTATS > @@ -1271,7 +1265,7 @@ struct sched_entity { > #endif > > #ifdef CONFIG_SMP > - /* Per-entity load-tracking */ > + /* Per entity load average tracking */ > struct sched_avg avg; > #endif > }; > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index d5078c0..4dfab27 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -1828,9 +1828,6 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) > p->se.prev_sum_exec_runtime = 0; > p->se.nr_migrations = 0; > p->se.vruntime = 0; > -#ifdef CONFIG_SMP > - p->se.avg.decay_count = 0; > -#endif > INIT_LIST_HEAD(&p->se.group_node); > > #ifdef CONFIG_SCHEDSTATS > diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c > index ca39cb7..56d83f3 100644 > --- a/kernel/sched/debug.c > +++ b/kernel/sched/debug.c > @@ -88,12 +88,8 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group > #endif > P(se->load.weight); > #ifdef CONFIG_SMP > - P(se->avg.runnable_avg_sum); > - P(se->avg.running_avg_sum); > - P(se->avg.avg_period); > - P(se->avg.load_avg_contrib); > - P(se->avg.utilization_avg_contrib); > - P(se->avg.decay_count); > + P(se->avg.load_avg); > + P(se->avg.util_avg); > #endif > #undef PN > #undef P > @@ -207,21 +203,19 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq) > SEQ_printf(m, " .%-30s: %d\n", "nr_running", cfs_rq->nr_running); > SEQ_printf(m, " .%-30s: %ld\n", "load", cfs_rq->load.weight); > #ifdef CONFIG_SMP > - SEQ_printf(m, " .%-30s: %ld\n", "runnable_load_avg", > - cfs_rq->runnable_load_avg); > - SEQ_printf(m, " .%-30s: %ld\n", "blocked_load_avg", > - cfs_rq->blocked_load_avg); > - SEQ_printf(m, " .%-30s: %ld\n", "utilization_load_avg", > - cfs_rq->utilization_load_avg); > + SEQ_printf(m, " .%-30s: %lu\n", "load_avg", > + cfs_rq->avg.load_avg); > + SEQ_printf(m, " .%-30s: %lu\n", "util_avg", > + cfs_rq->avg.util_avg); > + SEQ_printf(m, " .%-30s: %ld\n", "removed_load_avg", > + atomic_long_read(&cfs_rq->removed_load_avg)); > + SEQ_printf(m, " .%-30s: %ld\n", "removed_util_avg", > + atomic_long_read(&cfs_rq->removed_util_avg)); > #ifdef CONFIG_FAIR_GROUP_SCHED > - SEQ_printf(m, " .%-30s: %ld\n", "tg_load_contrib", > - cfs_rq->tg_load_contrib); > - SEQ_printf(m, " .%-30s: %d\n", "tg_runnable_contrib", > - cfs_rq->tg_runnable_contrib); > + SEQ_printf(m, " .%-30s: %lu\n", "tg_load_avg_contrib", > + cfs_rq->tg_load_avg_contrib); > SEQ_printf(m, " .%-30s: %ld\n", "tg_load_avg", > atomic_long_read(&cfs_rq->tg->load_avg)); > - SEQ_printf(m, " .%-30s: %d\n", "tg->runnable_avg", > - atomic_read(&cfs_rq->tg->runnable_avg)); > #endif > #endif > #ifdef CONFIG_CFS_BANDWIDTH > @@ -632,12 +626,11 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m) > > P(se.load.weight); > #ifdef CONFIG_SMP > - P(se.avg.runnable_avg_sum); > - P(se.avg.running_avg_sum); > - P(se.avg.avg_period); > - P(se.avg.load_avg_contrib); > - P(se.avg.utilization_avg_contrib); > - P(se.avg.decay_count); > + P(se.avg.load_sum); > + P(se.avg.util_sum); > + P(se.avg.load_avg); > + P(se.avg.util_avg); > + P(se.avg.last_update_time); > #endif > P(policy); > P(prio); > diff --git a/kernel/sched/fair.c 
b/kernel/sched/fair.c > index 7922532..452c932 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -283,9 +283,6 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp) > return grp->my_q; > } > > -static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, > - int force_update); > - > static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq) > { > if (!cfs_rq->on_list) { > @@ -305,8 +302,6 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq) > } > > cfs_rq->on_list = 1; > - /* We should have no load, but we need to update last_decay. */ > - update_cfs_rq_blocked_load(cfs_rq, 0); > } > } > > @@ -669,19 +664,31 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se) > static int select_idle_sibling(struct task_struct *p, int cpu); > static unsigned long task_h_load(struct task_struct *p); > > -static inline void __update_task_entity_contrib(struct sched_entity *se); > -static inline void __update_task_entity_utilization(struct sched_entity *se); > +/* > + * We choose a half-life close to 1 scheduling period. > + * Note: The tables below are dependent on this value. > + */ > +#define LOAD_AVG_PERIOD 32 > +#define LOAD_AVG_MAX 47742 /* maximum possible load avg */ > +#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_MAX_AVG */ > > /* Give new task start runnable values to heavy its load in infant time */ > void init_task_runnable_average(struct task_struct *p) > { > - u32 slice; > + struct sched_avg *sa = &p->se.avg; > > - slice = sched_slice(task_cfs_rq(p), &p->se) >> 10; > - p->se.avg.runnable_avg_sum = p->se.avg.running_avg_sum = slice; > - p->se.avg.avg_period = slice; > - __update_task_entity_contrib(&p->se); > - __update_task_entity_utilization(&p->se); > + sa->last_update_time = 0; > + /* > + * sched_avg's period_contrib should be strictly less then 1024, so > + * we give it 1023 to make sure it is almost a period (1024us), and > + * will definitely be update (after enqueue). > + */ > + sa->period_contrib = 1023; > + sa->load_avg = scale_load_down(p->se.load.weight); > + sa->load_sum = sa->load_avg * LOAD_AVG_MAX; > + sa->util_avg = scale_load_down(SCHED_LOAD_SCALE); > + sa->util_sum = LOAD_AVG_MAX; > + /* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */ > } > #else > void init_task_runnable_average(struct task_struct *p) > @@ -1702,8 +1709,8 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period) > delta = runtime - p->last_sum_exec_runtime; > *period = now - p->last_task_numa_placement; > } else { > - delta = p->se.avg.runnable_avg_sum; > - *period = p->se.avg.avg_period; > + delta = p->se.avg.load_sum / p->se.load.weight; > + *period = LOAD_AVG_MAX; > } > > p->last_sum_exec_runtime = runtime; > @@ -2351,13 +2358,13 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq) > long tg_weight; > > /* > - * Use this CPU's actual weight instead of the last load_contribution > - * to gain a more accurate current total weight. See > - * update_cfs_rq_load_contribution(). > + * Use this CPU's real-time load instead of the last load contribution > + * as the updating of the contribution is delayed, and we will use the > + * the real-time load to calc the share. See update_tg_load_avg(). 
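As a side note on where LOAD_AVG_MAX and the new-task initialization above come from: a fully runnable entity contributes 1024 per 1024us period, with each past period decayed by y where y^32 = 1/2, so the sum converges to roughly 1024/(1 - y). A throwaway userspace sketch (floating point, made-up names; the kernel's integer arithmetic settles at 47742):

#include <math.h>
#include <stdio.h>

int main(void)
{
        double y = pow(0.5, 1.0 / 32);  /* LOAD_AVG_PERIOD == 32 */
        double sum = 0.0;
        int i;

        for (i = 0; i < 345; i++)       /* LOAD_AVG_MAX_N full periods */
                sum = sum * y + 1024;

        printf("%.0f\n", sum);          /* close to LOAD_AVG_MAX == 47742 */
        return 0;
}

Given that, init_task_runnable_average() setting load_sum = load_avg * LOAD_AVG_MAX (and util_sum = LOAD_AVG_MAX) simply starts a new task off as if it had been runnable at full weight for its whole history, which is what "heavy its load in infant time" amounts to.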
> */ > tg_weight = atomic_long_read(&tg->load_avg); > - tg_weight -= cfs_rq->tg_load_contrib; > - tg_weight += cfs_rq->load.weight; > + tg_weight -= cfs_rq->tg_load_avg_contrib; > + tg_weight += cfs_rq->avg.load_avg; > > return tg_weight; > } > @@ -2367,7 +2374,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg) > long tg_weight, load, shares; > > tg_weight = calc_tg_weight(tg, cfs_rq); > - load = cfs_rq->load.weight; > + load = cfs_rq->avg.load_avg; > > shares = (tg->shares * load); > if (tg_weight) > @@ -2429,14 +2436,6 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq) > #endif /* CONFIG_FAIR_GROUP_SCHED */ > > #ifdef CONFIG_SMP > -/* > - * We choose a half-life close to 1 scheduling period. > - * Note: The tables below are dependent on this value. > - */ > -#define LOAD_AVG_PERIOD 32 > -#define LOAD_AVG_MAX 47742 /* maximum possible load avg */ > -#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_MAX_AVG */ > - > /* Precomputed fixed inverse multiplies for multiplication by y^n */ > static const u32 runnable_avg_yN_inv[] = { > 0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a, 0xeac0c6e6, 0xe5b906e6, > @@ -2485,9 +2484,8 @@ static __always_inline u64 decay_load(u64 val, u64 n) > local_n %= LOAD_AVG_PERIOD; > } > > - val *= runnable_avg_yN_inv[local_n]; > - /* We don't use SRR here since we always want to round down. */ > - return val >> 32; > + val = mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32); > + return val; > } > > /* > @@ -2546,23 +2544,23 @@ static u32 __compute_runnable_contrib(u64 n) > * load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... ) > * = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}] > */ > -static __always_inline int __update_entity_runnable_avg(u64 now, int cpu, > - struct sched_avg *sa, > - int runnable, > - int running) > +static __always_inline int __update_load_avg(u64 now, int cpu, > + struct sched_avg *sa, > + unsigned long weight, > + int running) > { > u64 delta, periods; > - u32 runnable_contrib; > + u32 contrib; > int delta_w, decayed = 0; > unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu); > > - delta = now - sa->last_runnable_update; > + delta = now - sa->last_update_time; > /* > * This should only happen when time goes backwards, which it > * unfortunately does during sched clock init when we swap over to TSC. > */ > if ((s64)delta < 0) { > - sa->last_runnable_update = now; > + sa->last_update_time = now; > return 0; > } > > @@ -2573,26 +2571,26 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu, > delta >>= 10; > if (!delta) > return 0; > - sa->last_runnable_update = now; > + sa->last_update_time = now; > > /* delta_w is the amount already accumulated against our next period */ > - delta_w = sa->avg_period % 1024; > + delta_w = sa->period_contrib; > if (delta + delta_w >= 1024) { > - /* period roll-over */ > decayed = 1; > > + /* how much left for next period will start over, we don't know yet */ > + sa->period_contrib = 0; > + > /* > * Now that we know we're crossing a period boundary, figure > * out how much from delta we need to complete the current > * period and accrue it. 
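For readers unfamiliar with the fixed-point trick in decay_load() above, here is a self-contained userspace rendition (illustrative only; yN_inv() recomputes the kernel's runnable_avg_yN_inv[] entries with pow() instead of hard-coding the table, and a plain 64-bit multiply stands in for mul_u64_u32_shr()):

#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define LOAD_AVG_PERIOD 32

/* y^n in 0.32 fixed point, for n in [0, 31] */
static uint32_t yN_inv(unsigned int n)
{
        if (!n)
                return 0xffffffff;
        return (uint32_t)(pow(0.5, (double)n / LOAD_AVG_PERIOD) * 4294967296.0);
}

/* val * y^n: one halving per full 32-period block, then a multiply-shift */
static uint64_t toy_decay_load(uint64_t val, uint64_t n)
{
        if (n > LOAD_AVG_PERIOD * 63)
                return 0;               /* decayed to nothing anyway */
        val >>= n / LOAD_AVG_PERIOD;    /* y^32 == 1/2 */
        n %= LOAD_AVG_PERIOD;
        return (val * yN_inv(n)) >> 32; /* real code: mul_u64_u32_shr() */
}

int main(void)
{
        printf("%llu\n", (unsigned long long)toy_decay_load(47742, 32)); /* about half   */
        printf("%llu\n", (unsigned long long)toy_decay_load(47742, 1));  /* about 97.86% */
        return 0;
}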
> */ > delta_w = 1024 - delta_w; > - if (runnable) > - sa->runnable_avg_sum += delta_w; > + if (weight) > + sa->load_sum += weight * delta_w; > if (running) > - sa->running_avg_sum += delta_w * scale_freq > - >> SCHED_CAPACITY_SHIFT; > - sa->avg_period += delta_w; > + sa->util_sum += delta_w * scale_freq >> SCHED_CAPACITY_SHIFT; > > delta -= delta_w; > > @@ -2600,334 +2598,156 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu, > periods = delta / 1024; > delta %= 1024; > > - sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum, > - periods + 1); > - sa->running_avg_sum = decay_load(sa->running_avg_sum, > - periods + 1); > - sa->avg_period = decay_load(sa->avg_period, > - periods + 1); > + sa->load_sum = decay_load(sa->load_sum, periods + 1); > + sa->util_sum = decay_load((u64)(sa->util_sum), periods + 1); > > /* Efficiently calculate \sum (1..n_period) 1024*y^i */ > - runnable_contrib = __compute_runnable_contrib(periods); > - if (runnable) > - sa->runnable_avg_sum += runnable_contrib; > + contrib = __compute_runnable_contrib(periods); > + if (weight) > + sa->load_sum += weight * contrib; > if (running) > - sa->running_avg_sum += runnable_contrib * scale_freq > - >> SCHED_CAPACITY_SHIFT; > - sa->avg_period += runnable_contrib; > + sa->util_sum += contrib * scale_freq >> SCHED_CAPACITY_SHIFT; > } > > /* Remainder of delta accrued against u_0` */ > - if (runnable) > - sa->runnable_avg_sum += delta; > + if (weight) > + sa->load_sum += weight * delta; > if (running) > - sa->running_avg_sum += delta * scale_freq > - >> SCHED_CAPACITY_SHIFT; > - sa->avg_period += delta; > - > - return decayed; > -} > - > -/* Synchronize an entity's decay with its parenting cfs_rq.*/ > -static inline u64 __synchronize_entity_decay(struct sched_entity *se) > -{ > - struct cfs_rq *cfs_rq = cfs_rq_of(se); > - u64 decays = atomic64_read(&cfs_rq->decay_counter); > + sa->util_sum += delta * scale_freq >> SCHED_CAPACITY_SHIFT; > > - decays -= se->avg.decay_count; > - se->avg.decay_count = 0; > - if (!decays) > - return 0; > + sa->period_contrib += delta; > > - se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays); > - se->avg.utilization_avg_contrib = > - decay_load(se->avg.utilization_avg_contrib, decays); > + if (decayed) { > + sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX); > + sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX; > + } > > - return decays; > + return decayed; > } > > #ifdef CONFIG_FAIR_GROUP_SCHED > -static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq, > - int force_update) > -{ > - struct task_group *tg = cfs_rq->tg; > - long tg_contrib; > - > - tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg; > - tg_contrib -= cfs_rq->tg_load_contrib; > - > - if (!tg_contrib) > - return; > - > - if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) { > - atomic_long_add(tg_contrib, &tg->load_avg); > - cfs_rq->tg_load_contrib += tg_contrib; > - } > -} > - > /* > - * Aggregate cfs_rq runnable averages into an equivalent task_group > - * representation for computing load contributions. > + * Updating tg's load_avg is necessary before update_cfs_share (which is done) > + * and effective_load (which is not done because it is too costly). 
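The accumulation above splits delta into three pieces: the part that completes the current 1024us period, any whole periods in between (handled with decay_load() plus __compute_runnable_contrib()), and a remainder that opens the next period. A stripped-down model of just that structure, with a single unweighted signal and floating-point decay (everything here is a simplification and the names are invented):

#include <math.h>
#include <stdio.h>

#define PERIOD          1024    /* microseconds */
#define HALF_LIFE       32      /* periods */

struct toy_avg {
        double sum;             /* decayed accumulation, cf. load_sum/util_sum */
        unsigned int contrib;   /* microseconds accrued in the current period */
};

static double decay(double v, unsigned int periods)
{
        return v * pow(0.5, (double)periods / HALF_LIFE);
}

/* account 'delta' microseconds of runnable time */
static void accumulate(struct toy_avg *a, unsigned int delta)
{
        if (a->contrib + delta >= PERIOD) {
                unsigned int head = PERIOD - a->contrib;        /* completes this period */
                unsigned int periods, i;

                delta -= head;
                periods = delta / PERIOD;
                delta %= PERIOD;

                /* the old sum and the completed period age across the boundary */
                a->sum = decay(a->sum + head, periods + 1);

                /* whole periods in between: PERIOD * (y + y^2 + ... + y^periods) */
                for (i = 1; i <= periods; i++)
                        a->sum += decay(PERIOD, i);

                a->contrib = 0;
        }
        a->sum += delta;        /* remainder accrues undecayed against u_0 */
        a->contrib += delta;
}

int main(void)
{
        struct toy_avg a = { 0.0, 0 };
        int i;

        for (i = 0; i < 400; i++)
                accumulate(&a, 1024);

        /* settles a little below LOAD_AVG_MAX, since the newest period here
         * has always just been decayed once */
        printf("%.0f\n", a.sum);
        return 0;
}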
> */ > -static inline void __update_tg_runnable_avg(struct sched_avg *sa, > - struct cfs_rq *cfs_rq) > +static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) > { > - struct task_group *tg = cfs_rq->tg; > - long contrib; > - > - /* The fraction of a cpu used by this cfs_rq */ > - contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT, > - sa->avg_period + 1); > - contrib -= cfs_rq->tg_runnable_contrib; > + long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; > > - if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) { > - atomic_add(contrib, &tg->runnable_avg); > - cfs_rq->tg_runnable_contrib += contrib; > - } > -} > - > -static inline void __update_group_entity_contrib(struct sched_entity *se) > -{ > - struct cfs_rq *cfs_rq = group_cfs_rq(se); > - struct task_group *tg = cfs_rq->tg; > - int runnable_avg; > - > - u64 contrib; > - > - contrib = cfs_rq->tg_load_contrib * tg->shares; > - se->avg.load_avg_contrib = div_u64(contrib, > - atomic_long_read(&tg->load_avg) + 1); > - > - /* > - * For group entities we need to compute a correction term in the case > - * that they are consuming <1 cpu so that we would contribute the same > - * load as a task of equal weight. > - * > - * Explicitly co-ordinating this measurement would be expensive, but > - * fortunately the sum of each cpus contribution forms a usable > - * lower-bound on the true value. > - * > - * Consider the aggregate of 2 contributions. Either they are disjoint > - * (and the sum represents true value) or they are disjoint and we are > - * understating by the aggregate of their overlap. > - * > - * Extending this to N cpus, for a given overlap, the maximum amount we > - * understand is then n_i(n_i+1)/2 * w_i where n_i is the number of > - * cpus that overlap for this interval and w_i is the interval width. > - * > - * On a small machine; the first term is well-bounded which bounds the > - * total error since w_i is a subset of the period. Whereas on a > - * larger machine, while this first term can be larger, if w_i is the > - * of consequential size guaranteed to see n_i*w_i quickly converge to > - * our upper bound of 1-cpu. 
> - */ > - runnable_avg = atomic_read(&tg->runnable_avg); > - if (runnable_avg < NICE_0_LOAD) { > - se->avg.load_avg_contrib *= runnable_avg; > - se->avg.load_avg_contrib >>= NICE_0_SHIFT; > + if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) { > + atomic_long_add(delta, &cfs_rq->tg->load_avg); > + cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg; > } > } > > #else /* CONFIG_FAIR_GROUP_SCHED */ > -static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq, > - int force_update) {} > -static inline void __update_tg_runnable_avg(struct sched_avg *sa, > - struct cfs_rq *cfs_rq) {} > -static inline void __update_group_entity_contrib(struct sched_entity *se) {} > +static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) {} > #endif /* CONFIG_FAIR_GROUP_SCHED */ > > -static inline void __update_task_entity_contrib(struct sched_entity *se) > -{ > - u32 contrib; > - > - /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */ > - contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight); > - contrib /= (se->avg.avg_period + 1); > - se->avg.load_avg_contrib = scale_load(contrib); > -} > +static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq); > > -/* Compute the current contribution to load_avg by se, return any delta */ > -static long __update_entity_load_avg_contrib(struct sched_entity *se) > +/* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */ > +static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > { > - long old_contrib = se->avg.load_avg_contrib; > + int decayed; > + struct sched_avg *sa = &cfs_rq->avg; > > - if (entity_is_task(se)) { > - __update_task_entity_contrib(se); > - } else { > - __update_tg_runnable_avg(&se->avg, group_cfs_rq(se)); > - __update_group_entity_contrib(se); > + if (atomic_long_read(&cfs_rq->removed_load_avg)) { > + long r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); > + sa->load_avg = max_t(long, sa->load_avg - r, 0); > + sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); > } > > - return se->avg.load_avg_contrib - old_contrib; > -} > - > - > -static inline void __update_task_entity_utilization(struct sched_entity *se) > -{ > - u32 contrib; > - > - /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */ > - contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE); > - contrib /= (se->avg.avg_period + 1); > - se->avg.utilization_avg_contrib = scale_load(contrib); > -} > + if (atomic_long_read(&cfs_rq->removed_util_avg)) { > + long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0); > + sa->util_avg = max_t(long, sa->util_avg - r, 0); > + sa->util_sum = max_t(s32, sa->util_sum - > + ((r * LOAD_AVG_MAX) >> SCHED_LOAD_SHIFT), 0); > + } > > -static long __update_entity_utilization_avg_contrib(struct sched_entity *se) > -{ > - long old_contrib = se->avg.utilization_avg_contrib; > + decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa, > + scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL); > > - if (entity_is_task(se)) > - __update_task_entity_utilization(se); > - else > - se->avg.utilization_avg_contrib = > - group_cfs_rq(se)->utilization_load_avg; > - > - return se->avg.utilization_avg_contrib - old_contrib; > -} > +#ifndef CONFIG_64BIT > + smp_wmb(); > + cfs_rq->load_last_update_time_copy = sa->last_update_time; > +#endif > > -static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq, > - long load_contrib) > -{ > - if (likely(load_contrib < cfs_rq->blocked_load_avg)) > - cfs_rq->blocked_load_avg -= load_contrib; 
> - else > - cfs_rq->blocked_load_avg = 0; > + return decayed; > } > > -static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq); > - > -/* Update a sched_entity's runnable average */ > -static inline void update_entity_load_avg(struct sched_entity *se, > - int update_cfs_rq) > +/* Update task and its cfs_rq load average */ > +static inline void update_load_avg(struct sched_entity *se, int update_tg) > { > struct cfs_rq *cfs_rq = cfs_rq_of(se); > - long contrib_delta, utilization_delta; > int cpu = cpu_of(rq_of(cfs_rq)); > - u64 now; > + u64 now = cfs_rq_clock_task(cfs_rq); > > /* > - * For a group entity we need to use their owned cfs_rq_clock_task() in > - * case they are the parent of a throttled hierarchy. > + * Track task load average for carrying it to new CPU after migrated, and > + * track group sched_entity load average for task_h_load calc in migration > */ > - if (entity_is_task(se)) > - now = cfs_rq_clock_task(cfs_rq); > - else > - now = cfs_rq_clock_task(group_cfs_rq(se)); > + __update_load_avg(now, cpu, &se->avg, > + se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se); > > - if (!__update_entity_runnable_avg(now, cpu, &se->avg, se->on_rq, > - cfs_rq->curr == se)) > - return; > - > - contrib_delta = __update_entity_load_avg_contrib(se); > - utilization_delta = __update_entity_utilization_avg_contrib(se); > - > - if (!update_cfs_rq) > - return; > - > - if (se->on_rq) { > - cfs_rq->runnable_load_avg += contrib_delta; > - cfs_rq->utilization_load_avg += utilization_delta; > - } else { > - subtract_blocked_load_contrib(cfs_rq, -contrib_delta); > - } > + if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg) > + update_tg_load_avg(cfs_rq, 0); > } > > -/* > - * Decay the load contributed by all blocked children and account this so that > - * their contribution may appropriately discounted when they wake up. > - */ > -static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update) > +/* Add the load generated by se into cfs_rq's load average */ > +static inline void > +enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) > { > - u64 now = cfs_rq_clock_task(cfs_rq) >> 20; > - u64 decays; > - > - decays = now - cfs_rq->last_decay; > - if (!decays && !force_update) > - return; > + struct sched_avg *sa = &se->avg; > + u64 now = cfs_rq_clock_task(cfs_rq); > + int migrated = 0, decayed; > > - if (atomic_long_read(&cfs_rq->removed_load)) { > - unsigned long removed_load; > - removed_load = atomic_long_xchg(&cfs_rq->removed_load, 0); > - subtract_blocked_load_contrib(cfs_rq, removed_load); > + if (sa->last_update_time == 0) { > + sa->last_update_time = now; > + migrated = 1; > } > - > - if (decays) { > - cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg, > - decays); > - atomic64_add(decays, &cfs_rq->decay_counter); > - cfs_rq->last_decay = now; > + else { > + __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa, > + se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se); > } > > - __update_cfs_rq_tg_load_contrib(cfs_rq, force_update); > -} > + decayed = update_cfs_rq_load_avg(now, cfs_rq); > > -/* Add the load generated by se into cfs_rq's child load-average */ > -static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq, > - struct sched_entity *se, > - int wakeup) > -{ > - /* > - * We track migrations using entity decay_count <= 0, on a wake-up > - * migration we use a negative decay count to track the remote decays > - * accumulated while sleeping. 
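The removed_load_avg/removed_util_avg handling in update_cfs_rq_load_avg() above is a simple producer/consumer pattern: a CPU that cannot take the owning rq->lock (e.g. during migration) parks the departing contribution in an atomic, and the owner folds it in on its next update. A userspace sketch with C11 atomics (toy_* names are invented; single-threaded here just to show the flow):

#include <stdatomic.h>
#include <stdio.h>

struct toy_cfs_rq {
        long load_avg;                  /* only touched by the owner, under its lock */
        atomic_long removed_load_avg;   /* added to lock-free by remote CPUs */
};

/* e.g. remove_entity_load_avg() on a task migrating away */
static void remote_remove(struct toy_cfs_rq *cfs_rq, long load)
{
        atomic_fetch_add(&cfs_rq->removed_load_avg, load);
}

/* e.g. update_cfs_rq_load_avg(), with the owner's rq->lock held */
static void owner_update(struct toy_cfs_rq *cfs_rq)
{
        if (atomic_load(&cfs_rq->removed_load_avg)) {
                long r = atomic_exchange(&cfs_rq->removed_load_avg, 0);

                cfs_rq->load_avg = cfs_rq->load_avg > r ? cfs_rq->load_avg - r : 0;
        }
}

int main(void)
{
        struct toy_cfs_rq rq = { .load_avg = 2048 };

        atomic_init(&rq.removed_load_avg, 0);
        remote_remove(&rq, 1024);
        owner_update(&rq);
        printf("%ld\n", rq.load_avg);   /* 1024 */
        return 0;
}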
> - * > - * Newly forked tasks are enqueued with se->avg.decay_count == 0, they > - * are seen by enqueue_entity_load_avg() as a migration with an already > - * constructed load_avg_contrib. > - */ > - if (unlikely(se->avg.decay_count <= 0)) { > - se->avg.last_runnable_update = rq_clock_task(rq_of(cfs_rq)); > - if (se->avg.decay_count) { > - /* > - * In a wake-up migration we have to approximate the > - * time sleeping. This is because we can't synchronize > - * clock_task between the two cpus, and it is not > - * guaranteed to be read-safe. Instead, we can > - * approximate this using our carried decays, which are > - * explicitly atomically readable. > - */ > - se->avg.last_runnable_update -= (-se->avg.decay_count) > - << 20; > - update_entity_load_avg(se, 0); > - /* Indicate that we're now synchronized and on-rq */ > - se->avg.decay_count = 0; > - } > - wakeup = 0; > - } else { > - __synchronize_entity_decay(se); > + if (migrated) { > + cfs_rq->avg.load_avg += sa->load_avg; > + cfs_rq->avg.load_sum += sa->load_sum; > + cfs_rq->avg.util_avg += sa->util_avg; > + cfs_rq->avg.util_sum += sa->util_sum; > } > > - /* migrated tasks did not contribute to our blocked load */ > - if (wakeup) { > - subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib); > - update_entity_load_avg(se, 0); > - } > - > - cfs_rq->runnable_load_avg += se->avg.load_avg_contrib; > - cfs_rq->utilization_load_avg += se->avg.utilization_avg_contrib; > - /* we force update consideration on load-balancer moves */ > - update_cfs_rq_blocked_load(cfs_rq, !wakeup); > + if (decayed || migrated) > + update_tg_load_avg(cfs_rq, 0); > } > > /* > - * Remove se's load from this cfs_rq child load-average, if the entity is > - * transitioning to a blocked state we track its projected decay using > - * blocked_load_avg. > + * Task first catches up with cfs_rq, and then subtract > + * itself from the cfs_rq (task must be off the queue now). > */ > -static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq, > - struct sched_entity *se, > - int sleep) > +void remove_entity_load_avg(struct sched_entity *se) > { > - update_entity_load_avg(se, 1); > - /* we force update consideration on load-balancer moves */ > - update_cfs_rq_blocked_load(cfs_rq, !sleep); > + struct cfs_rq *cfs_rq = cfs_rq_of(se); > + u64 last_update_time; > + > +#ifndef CONFIG_64BIT > + u64 last_update_time_copy; > > - cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib; > - cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib; > - if (sleep) { > - cfs_rq->blocked_load_avg += se->avg.load_avg_contrib; > - se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter); > - } /* migrations, e.g. 
sleep=0 leave decay_count == 0 */ > + do { > + last_update_time_copy = cfs_rq->load_last_update_time_copy; > + smp_rmb(); > + last_update_time = cfs_rq->avg.last_update_time; > + } while (last_update_time != last_update_time_copy); > +#else > + last_update_time = cfs_rq->avg.last_update_time; > +#endif > + > + __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0); > + atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg); > + atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg); > } > > /* > @@ -2952,16 +2772,10 @@ static int idle_balance(struct rq *this_rq); > > #else /* CONFIG_SMP */ > > -static inline void update_entity_load_avg(struct sched_entity *se, > - int update_cfs_rq) {} > -static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq, > - struct sched_entity *se, > - int wakeup) {} > -static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq, > - struct sched_entity *se, > - int sleep) {} > -static inline void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, > - int force_update) {} > +static inline void update_load_avg(struct sched_entity *se, int update_tg) {} > +static inline void > +enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {} > +static inline void remove_entity_load_avg(struct sched_entity *se) {} > > static inline int idle_balance(struct rq *rq) > { > @@ -3093,7 +2907,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) > * Update run-time statistics of the 'current'. > */ > update_curr(cfs_rq); > - enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP); > + enqueue_entity_load_avg(cfs_rq, se); > account_entity_enqueue(cfs_rq, se); > update_cfs_shares(cfs_rq); > > @@ -3168,7 +2982,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) > * Update run-time statistics of the 'current'. > */ > update_curr(cfs_rq); > - dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP); > + update_load_avg(se, 1); > > update_stats_dequeue(cfs_rq, se); > if (flags & DEQUEUE_SLEEP) { > @@ -3258,7 +3072,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) > */ > update_stats_wait_end(cfs_rq, se); > __dequeue_entity(cfs_rq, se); > - update_entity_load_avg(se, 1); > + update_load_avg(se, 1); > } > > update_stats_curr_start(cfs_rq, se); > @@ -3358,7 +3172,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev) > /* Put 'current' back into the tree. */ > __enqueue_entity(cfs_rq, prev); > /* in !on_rq case, update occurred at dequeue */ > - update_entity_load_avg(prev, 1); > + update_load_avg(prev, 0); > } > cfs_rq->curr = NULL; > } > @@ -3374,8 +3188,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued) > /* > * Ensure that runnable average is periodically updated. 
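The last_update_time_copy loop above exists because a u64 load can tear on 32-bit: the writer publishes the copy only after the real value (the smp_wmb() in update_cfs_rq_load_avg()), and the reader retries until both reads agree. A userspace sketch of the same pairing, using C11 release/acquire in place of the kernel barriers (illustrative only; the raced variable is deliberately plain, as in the kernel, and the copy check is what rejects torn reads):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t last_update_time;               /* may tear on 32-bit */
static _Atomic uint64_t last_update_time_copy;  /* written second, read first */

static void writer(uint64_t now)
{
        last_update_time = now;
        /* mirrors: smp_wmb(); load_last_update_time_copy = ... */
        atomic_store_explicit(&last_update_time_copy, now, memory_order_release);
}

static uint64_t reader(void)
{
        uint64_t copy, val;

        do {
                copy = atomic_load_explicit(&last_update_time_copy,
                                            memory_order_acquire);
                val = last_update_time; /* mirrors: smp_rmb(); then the real read */
        } while (val != copy);          /* raced with writer(): try again */

        return val;
}

int main(void)
{
        writer(123456789ULL);
        printf("%llu\n", (unsigned long long)reader());
        return 0;
}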
> */ > - update_entity_load_avg(curr, 1); > - update_cfs_rq_blocked_load(cfs_rq, 1); > + update_load_avg(curr, 1); > update_cfs_shares(cfs_rq); > > #ifdef CONFIG_SCHED_HRTICK > @@ -4248,8 +4061,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) > if (cfs_rq_throttled(cfs_rq)) > break; > > + update_load_avg(se, 1); > update_cfs_shares(cfs_rq); > - update_entity_load_avg(se, 1); > } > > if (!se) > @@ -4308,8 +4121,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > if (cfs_rq_throttled(cfs_rq)) > break; > > + update_load_avg(se, 1); > update_cfs_shares(cfs_rq); > - update_entity_load_avg(se, 1); > } > > if (!se) > @@ -4448,7 +4261,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load, > static void update_idle_cpu_load(struct rq *this_rq) > { > unsigned long curr_jiffies = READ_ONCE(jiffies); > - unsigned long load = this_rq->cfs.runnable_load_avg; > + unsigned long load = this_rq->cfs.avg.load_avg; > unsigned long pending_updates; > > /* > @@ -4494,7 +4307,7 @@ void update_cpu_load_nohz(void) > */ > void update_cpu_load_active(struct rq *this_rq) > { > - unsigned long load = this_rq->cfs.runnable_load_avg; > + unsigned long load = this_rq->cfs.avg.load_avg; How about using encapsulation weighted_cpuload() here to indicate rq->cpu_load[] which don't care jitter use weighted cpuload directly. Regards, Wanpeng Li > /* > * See the mess around update_idle_cpu_load() / update_cpu_load_nohz(). > */ > @@ -4505,7 +4318,7 @@ void update_cpu_load_active(struct rq *this_rq) > /* Used instead of source_load when we know the type == 0 */ > static unsigned long weighted_cpuload(const int cpu) > { > - return cpu_rq(cpu)->cfs.runnable_load_avg; > + return cpu_rq(cpu)->cfs.avg.load_avg; > } > > /* > @@ -4555,7 +4368,7 @@ static unsigned long cpu_avg_load_per_task(int cpu) > { > struct rq *rq = cpu_rq(cpu); > unsigned long nr_running = READ_ONCE(rq->cfs.h_nr_running); > - unsigned long load_avg = rq->cfs.runnable_load_avg; > + unsigned long load_avg = rq->cfs.avg.load_avg; > > if (nr_running) > return load_avg / nr_running; > @@ -4674,7 +4487,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg) > /* > * w = rw_i + @wl > */ > - w = se->my_q->load.weight + wl; > + w = se->my_q->avg.load_avg + wl; > > /* > * wl = S * s'_i; see (2) > @@ -4695,7 +4508,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg) > /* > * wl = dw_i = S * (s'_i - s_i); see (3) > */ > - wl -= se->load.weight; > + wl -= se->avg.load_avg; > > /* > * Recursively apply this logic to all parent groups to compute > @@ -4769,14 +4582,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) > */ > if (sync) { > tg = task_group(current); > - weight = current->se.load.weight; > + weight = current->se.avg.load_avg; > > this_load += effective_load(tg, this_cpu, -weight, -weight); > load += effective_load(tg, prev_cpu, 0, -weight); > } > > tg = task_group(p); > - weight = p->se.load.weight; > + weight = p->se.avg.load_avg; > > /* > * In low-load situations, where prev_cpu is idle and this_cpu is idle > @@ -4969,12 +4782,12 @@ done: > * tasks. The unit of the return value must be the one of capacity so we can > * compare the usage with the capacity of the CPU that is available for CFS > * task (ie cpu_capacity). > - * cfs.utilization_load_avg is the sum of running time of runnable tasks on a > + * cfs.avg.util_avg is the sum of running time of runnable tasks on a > * CPU. 
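Regarding the weighted_cpuload() encapsulation suggested above, the change would presumably be just the following (untested; weighted_cpuload() may need to be visible before update_cpu_load_active() in fair.c):

-	unsigned long load = this_rq->cfs.avg.load_avg;
+	unsigned long load = weighted_cpuload(cpu_of(this_rq));

and update_idle_cpu_load() could take the same treatment, so the rq->cpu_load[] paths all go through the one accessor.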
It represents the amount of utilization of a CPU in the range > * [0..SCHED_LOAD_SCALE]. The usage of a CPU can't be higher than the full > * capacity of the CPU because it's about the running time on this CPU. > - * Nevertheless, cfs.utilization_load_avg can be higher than SCHED_LOAD_SCALE > - * because of unfortunate rounding in avg_period and running_load_avg or just > + * Nevertheless, cfs.avg.util_avg can be higher than SCHED_LOAD_SCALE > + * because of unfortunate rounding in util_avg or just > * after migrating tasks until the average stabilizes with the new running > * time. So we need to check that the usage stays into the range > * [0..cpu_capacity_orig] and cap if necessary. > @@ -4983,7 +4796,7 @@ done: > */ > static int get_cpu_usage(int cpu) > { > - unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg; > + unsigned long usage = cpu_rq(cpu)->cfs.avg.util_avg; > unsigned long capacity = capacity_orig_of(cpu); > > if (usage >= SCHED_LOAD_SCALE) > @@ -5089,26 +4902,22 @@ unlock: > * previous cpu. However, the caller only guarantees p->pi_lock is held; no > * other assumptions, including the state of rq->lock, should be made. > */ > -static void > -migrate_task_rq_fair(struct task_struct *p, int next_cpu) > +static void migrate_task_rq_fair(struct task_struct *p, int next_cpu) > { > - struct sched_entity *se = &p->se; > - struct cfs_rq *cfs_rq = cfs_rq_of(se); > - > /* > - * Load tracking: accumulate removed load so that it can be processed > - * when we next update owning cfs_rq under rq->lock. Tasks contribute > - * to blocked load iff they have a positive decay-count. It can never > - * be negative here since on-rq tasks have decay-count == 0. > + * We are supposed to update the task to "current" time, then its up to date > + * and ready to go to new CPU/cfs_rq. But we have difficulty in getting > + * what current time is, so simply throw away the out-of-date time. This > + * will result in the wakee task is less decayed, but giving the wakee more > + * load sounds not bad. > */ > - if (se->avg.decay_count) { > - se->avg.decay_count = -__synchronize_entity_decay(se); > - atomic_long_add(se->avg.load_avg_contrib, > - &cfs_rq->removed_load); > - } > + remove_entity_load_avg(&p->se); > + > + /* Tell new CPU we are migrated */ > + p->se.avg.last_update_time = 0; > > /* We have migrated, no longer consider this task hot */ > - se->exec_start = 0; > + p->se.exec_start = 0; > } > #endif /* CONFIG_SMP */ > > @@ -5995,36 +5804,6 @@ static void attach_tasks(struct lb_env *env) > } > > #ifdef CONFIG_FAIR_GROUP_SCHED > -/* > - * update tg->load_weight by folding this cpu's load_avg > - */ > -static void __update_blocked_averages_cpu(struct task_group *tg, int cpu) > -{ > - struct sched_entity *se = tg->se[cpu]; > - struct cfs_rq *cfs_rq = tg->cfs_rq[cpu]; > - > - /* throttled entities do not contribute to load */ > - if (throttled_hierarchy(cfs_rq)) > - return; > - > - update_cfs_rq_blocked_load(cfs_rq, 1); > - > - if (se) { > - update_entity_load_avg(se, 1); > - /* > - * We pivot on our runnable average having decayed to zero for > - * list removal. This generally implies that all our children > - * have also been removed (modulo rounding error or bandwidth > - * control); however, such cases are rare and we can fix these > - * at enqueue. > - * > - * TODO: fix up out-of-order children on enqueue. 
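The migration scheme above is a small handshake: migrate_task_rq_fair() removes the task's contribution via the removed_* atomics and then clears last_update_time, and enqueue_entity_load_avg() treats last_update_time == 0 as "freshly migrated (or forked)" and simply adds the carried averages to the destination cfs_rq instead of decaying them against a clock the task never saw. A greatly simplified sketch (toy_* names are invented, most details omitted):

#include <stdio.h>

struct toy_avg { unsigned long long last_update_time; long load_avg; };
struct toy_cfs_rq { unsigned long long clock; long load_avg; };

/* leaving the old CPU: cf. migrate_task_rq_fair() */
static void toy_migrate(struct toy_avg *sa)
{
        /* remove_entity_load_avg() has parked sa->load_avg for the old cfs_rq */
        sa->last_update_time = 0;       /* tell the next enqueue "I migrated" */
}

/* arriving on the new cfs_rq: cf. enqueue_entity_load_avg() */
static void toy_enqueue(struct toy_cfs_rq *cfs_rq, struct toy_avg *sa)
{
        if (!sa->last_update_time) {
                sa->last_update_time = cfs_rq->clock;
                cfs_rq->load_avg += sa->load_avg;       /* bring the load along */
        }
        /* else: local (re)enqueue, decay sa against this clock first (omitted) */
}

int main(void)
{
        struct toy_cfs_rq dst = { .clock = 5000, .load_avg = 1000 };
        struct toy_avg se = { .last_update_time = 4000, .load_avg = 300 };

        toy_migrate(&se);
        toy_enqueue(&dst, &se);
        printf("%ld\n", dst.load_avg);  /* 1300 */
        return 0;
}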
> - */ > - if (!se->avg.runnable_avg_sum && !cfs_rq->nr_running) > - list_del_leaf_cfs_rq(cfs_rq); > - } > -} > - > static void update_blocked_averages(int cpu) > { > struct rq *rq = cpu_rq(cpu); > @@ -6033,17 +5812,18 @@ static void update_blocked_averages(int cpu) > > raw_spin_lock_irqsave(&rq->lock, flags); > update_rq_clock(rq); > + > /* > * Iterates the task_group tree in a bottom up fashion, see > * list_add_leaf_cfs_rq() for details. > */ > for_each_leaf_cfs_rq(rq, cfs_rq) { > - /* > - * Note: We may want to consider periodically releasing > - * rq->lock about these updates so that creating many task > - * groups does not result in continually extending hold time. > - */ > - __update_blocked_averages_cpu(cfs_rq->tg, rq->cpu); > + /* throttled entities do not contribute to load */ > + if (throttled_hierarchy(cfs_rq)) > + continue; > + > + if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq)) > + update_tg_load_avg(cfs_rq, 0); > } > > raw_spin_unlock_irqrestore(&rq->lock, flags); > @@ -6073,14 +5853,13 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq) > } > > if (!se) { > - cfs_rq->h_load = cfs_rq->runnable_load_avg; > + cfs_rq->h_load = cfs_rq->avg.load_avg; > cfs_rq->last_h_load_update = now; > } > > while ((se = cfs_rq->h_load_next) != NULL) { > load = cfs_rq->h_load; > - load = div64_ul(load * se->avg.load_avg_contrib, > - cfs_rq->runnable_load_avg + 1); > + load = div64_ul(load * se->avg.load_avg, cfs_rq->avg.load_avg + 1); > cfs_rq = group_cfs_rq(se); > cfs_rq->h_load = load; > cfs_rq->last_h_load_update = now; > @@ -6092,8 +5871,8 @@ static unsigned long task_h_load(struct task_struct *p) > struct cfs_rq *cfs_rq = task_cfs_rq(p); > > update_cfs_rq_h_load(cfs_rq); > - return div64_ul(p->se.avg.load_avg_contrib * cfs_rq->h_load, > - cfs_rq->runnable_load_avg + 1); > + return div64_ul(p->se.avg.load_avg * cfs_rq->h_load, > + cfs_rq->avg.load_avg + 1); > } > #else > static inline void update_blocked_averages(int cpu) > @@ -6102,7 +5881,7 @@ static inline void update_blocked_averages(int cpu) > > static unsigned long task_h_load(struct task_struct *p) > { > - return p->se.avg.load_avg_contrib; > + return p->se.avg.load_avg; > } > #endif > > @@ -8103,15 +7882,18 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p) > } > > #ifdef CONFIG_SMP > - /* > - * Remove our load from contribution when we leave sched_fair > - * and ensure we don't carry in an old decay_count if we > - * switch back. 
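The task_h_load() math above walks top-down: each group level scales the parent's h_load by the group entity's share of the parent's load_avg, and the task finally takes its share of the leaf's h_load. A tiny numeric sketch with made-up numbers:

#include <stdio.h>

int main(void)
{
        long h_load = 2048;     /* root cfs_rq: h_load starts at its load_avg */
        long task_h_load;

        /* one group level: its entity contributes 512 of the root's 2048 */
        h_load = h_load * 512 / (2048 + 1);

        /* a task with load_avg 256 on that group's cfs_rq (load_avg 512) */
        task_h_load = 256 * h_load / (512 + 1);

        printf("%ld %ld\n", h_load, task_h_load);       /* 511 255 */
        return 0;
}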
> - */ > - if (se->avg.decay_count) { > - __synchronize_entity_decay(se); > - subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib); > - } > + /* Catch up with the cfs_rq and remove our load when we leave */ > + __update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq), &se->avg, > + se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se); > + > + cfs_rq->avg.load_avg = > + max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0); > + cfs_rq->avg.load_sum = > + max_t(s64, cfs_rq->avg.load_sum - se->avg.load_sum, 0); > + cfs_rq->avg.util_avg = > + max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0); > + cfs_rq->avg.util_sum = > + max_t(s32, cfs_rq->avg.util_sum - se->avg.util_sum, 0); > #endif > } > > @@ -8168,8 +7950,8 @@ void init_cfs_rq(struct cfs_rq *cfs_rq) > cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime; > #endif > #ifdef CONFIG_SMP > - atomic64_set(&cfs_rq->decay_counter, 1); > - atomic_long_set(&cfs_rq->removed_load, 0); > + atomic_long_set(&cfs_rq->removed_load_avg, 0); > + atomic_long_set(&cfs_rq->removed_util_avg, 0); > #endif > } > > @@ -8214,14 +7996,14 @@ static void task_move_group_fair(struct task_struct *p, int queued) > if (!queued) { > cfs_rq = cfs_rq_of(se); > se->vruntime += cfs_rq->min_vruntime; > + > #ifdef CONFIG_SMP > - /* > - * migrate_task_rq_fair() will have removed our previous > - * contribution, but we must synchronize for ongoing future > - * decay. > - */ > - se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter); > - cfs_rq->blocked_load_avg += se->avg.load_avg_contrib; > + /* Virtually synchronize task with its new cfs_rq */ > + p->se.avg.last_update_time = cfs_rq->avg.last_update_time; > + cfs_rq->avg.load_avg += p->se.avg.load_avg; > + cfs_rq->avg.load_sum += p->se.avg.load_sum; > + cfs_rq->avg.util_avg += p->se.avg.util_avg; > + cfs_rq->avg.util_sum += p->se.avg.util_sum; > #endif > } > } > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index d465a5c..3dfec8d 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -245,7 +245,6 @@ struct task_group { > > #ifdef CONFIG_SMP > atomic_long_t load_avg; > - atomic_t runnable_avg; > #endif > #endif > > @@ -366,27 +365,18 @@ struct cfs_rq { > > #ifdef CONFIG_SMP > /* > - * CFS Load tracking > - * Under CFS, load is tracked on a per-entity basis and aggregated up. > - * This allows for the description of both thread and group usage (in > - * the FAIR_GROUP_SCHED case). > - * runnable_load_avg is the sum of the load_avg_contrib of the > - * sched_entities on the rq. > - * blocked_load_avg is similar to runnable_load_avg except that its > - * the blocked sched_entities on the rq. > - * utilization_load_avg is the sum of the average running time of the > - * sched_entities on the rq. 
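The switched_from_fair() and task_move_group_fair() hunks above are mirror images of each other: one subtracts the entity's averages from its cfs_rq (clamped at zero), the other syncs the entity to the new cfs_rq's clock and adds them back. Spelled out as a symmetric pair of helpers, purely as a sketch (this is not how the patch structures it; toy_* names are invented):

#include <stdio.h>

struct toy_avg {
        unsigned long long last_update_time;
        long load_avg, util_avg;
};

static void toy_detach(struct toy_avg *cfs, const struct toy_avg *se)
{
        /* leaving: pull our contribution out, never going negative */
        cfs->load_avg = cfs->load_avg > se->load_avg ? cfs->load_avg - se->load_avg : 0;
        cfs->util_avg = cfs->util_avg > se->util_avg ? cfs->util_avg - se->util_avg : 0;
}

static void toy_attach(struct toy_avg *cfs, struct toy_avg *se)
{
        /* joining: adopt the destination's clock, then contribute */
        se->last_update_time = cfs->last_update_time;
        cfs->load_avg += se->load_avg;
        cfs->util_avg += se->util_avg;
}

int main(void)
{
        struct toy_avg cfs = { .last_update_time = 9000, .load_avg = 2000, .util_avg = 700 };
        struct toy_avg se  = { .last_update_time = 8000, .load_avg = 300,  .util_avg = 100 };

        toy_detach(&cfs, &se);
        toy_attach(&cfs, &se);
        printf("%ld %ld\n", cfs.load_avg, cfs.util_avg);        /* 2000 700 */
        return 0;
}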
> + * CFS load tracking > */ > - unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg; > - atomic64_t decay_counter; > - u64 last_decay; > - atomic_long_t removed_load; > - > + struct sched_avg avg; > #ifdef CONFIG_FAIR_GROUP_SCHED > - /* Required to track per-cpu representation of a task_group */ > - u32 tg_runnable_contrib; > - unsigned long tg_load_contrib; > + unsigned long tg_load_avg_contrib; > +#endif > + atomic_long_t removed_load_avg, removed_util_avg; > +#ifndef CONFIG_64BIT > + u64 load_last_update_time_copy; > +#endif > > +#ifdef CONFIG_FAIR_GROUP_SCHED > /* > * h_load = weight * f(tg) > *