2012-11-17 13:04:36

by Alex Shi

[permalink] [raw]
Subject: [RFC PATCH 0/5] enable runnable load avg in load balance

This patchset tries to take the runnable load avg into account when doing
CPU load comparisons in load balancing.

I had seen Preeti's enabling work before this patchset was finished, but I
still think that considering the runnable load avg on the rq may be a more
natural way.

BTW, I am wondering whether decaying cpu_load twice is too complicated: once
for runnable time, and again for CPU_LOAD_IDX. I think I am missing the
reason for the CPU_LOAD_IDX decay. Could anyone do me a favor and give some
hints on this?
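
To make my question concrete: as far as I understand, the CPU_LOAD_IDX decay
works roughly like the toy model below (a simplified sketch of
__update_cpu_load()'s per-tick update, not the kernel code itself):

#include <stdio.h>

#define CPU_LOAD_IDX_MAX 5

/* Simplified model of __update_cpu_load(): each index i keeps an
 * exponentially decayed history, decaying more slowly as i grows:
 * cpu_load[i] = (old * (2^i - 1) + new) / 2^i, applied once per tick.
 * cpu_load[0] is always the instantaneous load.
 */
static void update_cpu_load(unsigned long cpu_load[], unsigned long this_load)
{
	int i;

	cpu_load[0] = this_load;
	for (i = 1; i < CPU_LOAD_IDX_MAX; i++) {
		unsigned long scale = 1UL << i;
		cpu_load[i] = (cpu_load[i] * (scale - 1) + this_load) >> i;
	}
}

int main(void)
{
	unsigned long cpu_load[CPU_LOAD_IDX_MAX] = { 0 };
	int t, i;

	/* feed 10 ticks of load 2048, then 10 idle ticks */
	for (t = 0; t < 10; t++)
		update_cpu_load(cpu_load, 2048);
	for (t = 0; t < 10; t++)
		update_cpu_load(cpu_load, 0);

	for (i = 0; i < CPU_LOAD_IDX_MAX; i++)
		printf("cpu_load[%d] = %lu\n", i, cpu_load[i]);
	return 0;
}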

Best Regards!
Alex

[RFC PATCH 1/5] sched: get rq runnable load average for load balance
[RFC PATCH 2/5] sched: update rq runnable load average in time
[RFC PATCH 3/5] sched: using runnable load avg in cpu_load and
[RFC PATCH 4/5] sched: consider runnable load average in wake_affine
[RFC PATCH 5/5] sched: revert 'Introduce temporary FAIR_GROUP_SCHED


2012-11-17 13:05:05

by Alex Shi

[permalink] [raw]
Subject: [RFC PATCH 2/5] sched: update rq runnable load average in time

Now we have the rq runnable load average value, and we are preparing to
use it when updating rq cpu_load[].

So we want to make sure the rq cpu_load[] update uses the latest data.
Since task_tick() updates the per-entity runnable averages,
update_cpu_load_active(rq) is moved after task_tick() for this purpose.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 1 +
2 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9dbbe45..bacfee0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2657,8 +2657,8 @@ void scheduler_tick(void)

raw_spin_lock(&rq->lock);
update_rq_clock(rq);
- update_cpu_load_active(rq);
curr->sched_class->task_tick(rq, curr, 0);
+ update_cpu_load_active(rq);
raw_spin_unlock(&rq->lock);

perf_event_task_tick();
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc60e43..44c07ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6011,6 +6011,7 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle)

raw_spin_lock_irq(&rq->lock);
update_rq_clock(rq);
+ update_rq_runnable_avg(rq, rq->nr_running);
update_idle_cpu_load(rq);
raw_spin_unlock_irq(&rq->lock);

--
1.7.5.4

2012-11-17 13:05:28

by Alex Shi

[permalink] [raw]
Subject: [RFC PATCH 4/5] sched: consider runnable load average in wake_affine and move_tasks

Besides using the runnable load average in the background, wake_affine and
move_tasks are also key functions in load balancing. We need to consider
the runnable load average in them in order to get an apples-to-apples load
comparison in load balancing.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 16 ++++++++++------
1 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f918919..7064a13 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3164,8 +3164,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
tg = task_group(current);
weight = current->se.load.weight;

- this_load += effective_load(tg, this_cpu, -weight, -weight);
- load += effective_load(tg, prev_cpu, 0, -weight);
+ this_load += effective_load(tg, this_cpu, -weight, -weight)
+ * cpu_rq(this_cpu)->avg.load_avg_contrib;
+ load += effective_load(tg, prev_cpu, 0, -weight)
+ * cpu_rq(prev_cpu)->avg.load_avg_contrib;
}

tg = task_group(p);
@@ -3185,12 +3187,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)

this_eff_load = 100;
this_eff_load *= power_of(prev_cpu);
- this_eff_load *= this_load +
- effective_load(tg, this_cpu, weight, weight);
+ this_eff_load *= (this_load +
+ effective_load(tg, this_cpu, weight, weight))
+ * cpu_rq(this_cpu)->avg.load_avg_contrib;

prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
prev_eff_load *= power_of(this_cpu);
- prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight);
+ prev_eff_load *= (load + effective_load(tg, prev_cpu, 0, weight))
+ * cpu_rq(prev_cpu)->avg.load_avg_contrib;

balanced = this_eff_load <= prev_eff_load;
} else
@@ -4229,7 +4233,7 @@ static int move_tasks(struct lb_env *env)
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
goto next;

- load = task_h_load(p);
+ load = task_h_load(p) * p->se.avg.load_avg_contrib;

if (sched_feat(LB_MIN) && load < 16 && !env->failed)
goto next;
--
1.7.5.4

2012-11-17 13:05:04

by Alex Shi

[permalink] [raw]
Subject: [RFC PATCH 3/5] sched: using runnable load avg in cpu_load and cpu_avg_load_per_task

These are the base values used in load balancing. Update them with the rq
runnable load average, and load balancing will then consider the runnable
load avg naturally.

The basic idea of the runnable load avg usage is just to include the
runnable load coefficient in the direct load balance process, e.g. in load
comparisons between cpus.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/core.c | 4 ++--
kernel/sched/fair.c | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bacfee0..ee6d765 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2501,7 +2501,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
void update_idle_cpu_load(struct rq *this_rq)
{
unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
- unsigned long load = this_rq->load.weight;
+ unsigned long load = this_rq->avg.load_avg_contrib;
unsigned long pending_updates;

/*
@@ -2551,7 +2551,7 @@ static void update_cpu_load_active(struct rq *this_rq)
* See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
*/
this_rq->last_load_update_tick = jiffies;
- __update_cpu_load(this_rq, this_rq->load.weight, 1);
+ __update_cpu_load(this_rq, this_rq->avg.load_avg_contrib, 1);

calc_load_account_active(this_rq);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 44c07ed..f918919 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2950,7 +2950,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* Used instead of source_load when we know the type == 0 */
static unsigned long weighted_cpuload(const int cpu)
{
- return cpu_rq(cpu)->load.weight;
+ return cpu_rq(cpu)->avg.load_avg_contrib;
}

/*
@@ -2997,7 +2997,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
unsigned long nr_running = ACCESS_ONCE(rq->nr_running);

if (nr_running)
- return rq->load.weight / nr_running;
+ return rq->avg.load_avg_contrib / nr_running;

return 0;
}
--
1.7.5.4

2012-11-17 13:05:52

by Alex Shi

[permalink] [raw]
Subject: [RFC PATCH 1/5] sched: get rq runnable load average for load balance

In load balancing, the rq load weight is the core of the balance judgement.
Now it's time to consider PJT's runnable load tracking in load balancing.

Since we already have the rq runnable_avg_sum and the rq load weight,
the rq runnable load average is easy to get:
	runnable_load(rq) = runnable_avg(rq) * weight(rq)

We then reuse rq->avg.load_avg_contrib to store the value.
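
In fixed-point form this is the same computation that
__update_load_avg_contrib() below performs; as a standalone sketch (the
weight of 2048 and the sample counts here are made-up illustration values):

#include <stdio.h>

/* Simplified model of __update_load_avg_contrib():
 * contrib = runnable_avg_sum * weight / (runnable_avg_period + 1),
 * where runnable_avg_sum / runnable_avg_period is the geometrically
 * decayed fraction of time the rq had runnable tasks.
 */
static unsigned long load_avg_contrib(unsigned long runnable_avg_sum,
				      unsigned long runnable_avg_period,
				      unsigned long weight)
{
	return runnable_avg_sum * weight / (runnable_avg_period + 1);
}

int main(void)
{
	/* an rq runnable ~50% of the time with total weight 2048 */
	printf("%lu\n", load_avg_contrib(500, 999, 2048)); /* prints 1024 */
	return 0;
}
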

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/debug.c | 1 +
kernel/sched/fair.c | 20 ++++++++++++++++----
2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2cd3c1b..1cd5639 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -71,6 +71,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
struct sched_avg *avg = &cpu_rq(cpu)->avg;
P(avg->runnable_avg_sum);
P(avg->runnable_avg_period);
+ P(avg->load_avg_contrib);
return;
}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a624d3b..bc60e43 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1439,14 +1439,25 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,
static inline void __update_group_entity_contrib(struct sched_entity *se) {}
#endif

-static inline void __update_task_entity_contrib(struct sched_entity *se)
+static inline void __update_load_avg_contrib(struct sched_avg *sa,
+ struct load_weight *load)
{
u32 contrib;

/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
- contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
- contrib /= (se->avg.runnable_avg_period + 1);
- se->avg.load_avg_contrib = scale_load(contrib);
+ contrib = sa->runnable_avg_sum * scale_load_down(load->weight);
+ contrib /= (sa->runnable_avg_period + 1);
+ sa->load_avg_contrib = scale_load(contrib);
+}
+
+static inline void __update_task_entity_contrib(struct sched_entity *se)
+{
+ __update_load_avg_contrib(&se->avg, &se->load);
+}
+
+static inline void __update_rq_load_contrib(struct rq *rq)
+{
+ __update_load_avg_contrib(&rq->avg, &rq->load);
}

/* Compute the current contribution to load_avg by se, return any delta */
@@ -1539,6 +1550,7 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
{
__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
__update_tg_runnable_avg(&rq->avg, &rq->cfs);
+ __update_rq_load_contrib(rq);
}

/* Add the load generated by se into cfs_rq's child load-average */
--
1.7.5.4

2012-11-17 13:06:06

by Alex Shi

[permalink] [raw]
Subject: [RFC PATCH 5/5] sched: revert 'Introduce temporary FAIR_GROUP_SCHED dependency ...'

Revert commit f4e26b120b9de84cb627b so that the load-tracking patchset can
be used in the kernel.

Signed-off-by: Alex Shi <[email protected]>
---
include/linux/sched.h | 8 +-------
kernel/sched/core.c | 7 +------
kernel/sched/fair.c | 12 +-----------
kernel/sched/sched.h | 9 +--------
4 files changed, 4 insertions(+), 32 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8f65323..4ce885a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1169,13 +1169,7 @@ struct sched_entity {
/* rq "owned" by this entity/group: */
struct cfs_rq *my_q;
#endif
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
- /* Per-entity load-tracking */
+#ifdef CONFIG_SMP
struct sched_avg avg;
#endif
};
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee6d765..9f9615d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1526,12 +1526,7 @@ static void __sched_fork(struct task_struct *p)
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
p->se.avg.runnable_avg_period = 0;
p->se.avg.runnable_avg_sum = 0;
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7064a13..3f7f732 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1149,8 +1149,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_FAIR_GROUP_SCHED */

-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/*
* We choose a half-life close to 1 scheduling period.
* Note: The tables below are dependent on this value.
@@ -3457,7 +3456,6 @@ unlock:
return new_cpu;
}

-#ifdef CONFIG_FAIR_GROUP_SCHED
static void migrate_task_rq_entity(struct task_struct *p, int next_cpu)
{
struct sched_entity *se = &p->se;
@@ -3474,16 +3472,8 @@ static void migrate_task_rq_entity(struct task_struct *p, int next_cpu)
atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
}
}
-#else
-static void migrate_task_rq_entity(struct task_struct *p, int next_cpu) { }
-#endif

/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-/*
* Called immediately before a task is migrated to a new cpu; task_cpu(p) and
* cfs_rq_of(p) references at time of call are still valid and identify the
* previous cpu. However, the caller only guarantees p->pi_lock is held; no
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bb9475c..3a4a8d6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -226,12 +226,6 @@ struct cfs_rq {
#endif

#ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
/*
* CFS Load tracking
* Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -241,8 +235,7 @@ struct cfs_rq {
u64 runnable_load_avg, blocked_load_avg;
atomic64_t decay_counter, removed_load;
u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
u32 tg_runnable_contrib;
u64 tg_load_contrib;
--
1.7.5.4

2012-11-17 14:09:56

by Ricardo Nabinger Sanchez

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] enable runnable load avg in load balance

On Sat, 17 Nov 2012 21:04:12 +0800, Alex Shi wrote:

> This patchset tries to take the runnable load avg into account when doing
> CPU load comparisons in load balancing.

I found the wording in the commit messages (pretty much all of them,
including the introductory message) rather confusing, especially patch
4/5.


--
Ricardo Nabinger Sanchez http://rnsanchez.wait4.org/
"Left to themselves, things tend to go from bad to worse."

2012-11-17 18:10:33

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [RFC PATCH 4/5] sched: consider runnable load average in wake_affine and move_tasks

Hi Alex,

On 11/17/2012 06:34 PM, Alex Shi wrote:
> Besides using the runnable load average in the background, wake_affine and
> move_tasks are also key functions in load balancing. We need to consider
> the runnable load average in them in order to get an apples-to-apples load
> comparison in load balancing.
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/fair.c | 16 ++++++++++------
> 1 files changed, 10 insertions(+), 6 deletions(-)
>
> @@ -4229,7 +4233,7 @@ static int move_tasks(struct lb_env *env)
> if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
> goto next;
>
> - load = task_h_load(p);
> + load = task_h_load(p) * p->se.avg.load_avg_contrib;
Shouldn't the above be just load = p->se.avg.load_avg_contrib? This
metric has already considered p->se.load.weight; task_h_load(p) returns
the same.
>
> if (sched_feat(LB_MIN) && load < 16 && !env->failed)
> goto next;
>
Regards
Preeti U Murthy

2012-11-17 19:12:37

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] enable runnable load avg in load balance

Hi Alex,

On 11/17/2012 06:34 PM, Alex Shi wrote:
> This patchset tries to take the runnable load avg into account when doing
> CPU load comparisons in load balancing.
>
> I had seen Preeti's enabling work before this patchset was finished, but I
> still think that considering the runnable load avg on the rq may be a more
> natural way.
>
> BTW, I am wondering whether decaying cpu_load twice is too complicated:
> once for runnable time, and again for CPU_LOAD_IDX. I think I am missing
> the reason for the CPU_LOAD_IDX decay. Could anyone do me a favor and give
> some hints on this?

The decay happening for CPU_LOAD_IDX is *more coarse grained* than the
decay that __update_entity_runnable_avg() is performing. While
__update_cpu_load() decays the rq->load.weight *for every jiffy* (~4ms)
passed so far without an update of the load,
__update_entity_runnable_avg() decays the rq->load.weight *for every
1ms* when called from update_rq_runnable_avg().

Before the introduction of PJT's series, __update_cpu_load() seems to be
the only place where decay of the older rq load was happening (so as to
give the older load less importance in its relevance), but with the
introduction of PJT's series, since the older rq load gets decayed in
__update_entity_runnable_avg() in a more fine-grained fashion, perhaps
you are right: while the CPU_LOAD_IDX gets updated, we don't need to
decay the load once again here.
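
To make the contrast concrete, here is a toy model of the two decays side
by side (a rough sketch only, assuming HZ=250; the real kernel code uses
lookup tables and fixed-point arithmetic):

#include <stdio.h>
#include <math.h>

int main(void)
{
	/* PJT's per-entity decay: every 1ms the history is multiplied
	 * by y, where y^32 = 0.5 (a half-life of 32ms). */
	double y = pow(0.5, 1.0 / 32.0);

	/* cpu_load[] decay: every jiffy (4ms at HZ=250) index i is
	 * multiplied by (2^i - 1) / 2^i. */
	int i = 2;	/* e.g. a busy_idx slot */
	double avg = 1.0, load = 1.0;
	int ms;

	for (ms = 1; ms <= 32; ms++) {
		avg *= y;
		if (ms % 4 == 0)	/* one jiffy elapsed */
			load *= (double)((1 << i) - 1) / (1 << i);
	}
	printf("after 32ms: runnable avg factor %.3f, cpu_load[%d] factor %.3f\n",
	       avg, i, load);
	return 0;
}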
>
> Best Regards!
> Alex
>
> [RFC PATCH 1/5] sched: get rq runnable load average for load balance
> [RFC PATCH 2/5] sched: update rq runnable load average in time
> [RFC PATCH 3/5] sched: using runnable load avg in cpu_load and
> [RFC PATCH 4/5] sched: consider runnable load average in wake_affine
> [RFC PATCH 5/5] sched: revert 'Introduce temporary FAIR_GROUP_SCHED
>
Regards
Preeti U Murthy

2012-11-18 08:35:49

by Alex Shi

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] enable runnable load avg in load balance

On Sun, Nov 18, 2012 at 3:12 AM, Preeti U Murthy
<[email protected]> wrote:
> Hi Alex,
>
> On 11/17/2012 06:34 PM, Alex Shi wrote:
>> This patchset tries to take the runnable load avg into account when doing
>> CPU load comparisons in load balancing.
>>
>> I had seen Preeti's enabling work before this patchset was finished, but I
>> still think that considering the runnable load avg on the rq may be a more
>> natural way.
>>
>> BTW, I am wondering whether decaying cpu_load twice is too complicated:
>> once for runnable time, and again for CPU_LOAD_IDX. I think I am missing
>> the reason for the CPU_LOAD_IDX decay. Could anyone do me a favor and give
>> some hints on this?
>
> The decay happening for CPU_LOAD_IDX is *more coarse grained* than the
> decay that __update_entity_runnable_avg() is performing. While
> __update_cpu_load() decays the rq->load.weight *for every jiffy* (~4ms)
> passed so far without an update of the load,
> __update_entity_runnable_avg() decays the rq->load.weight *for every
> 1ms* when called from update_rq_runnable_avg().
>
> Before the introduction of PJT's series, __update_cpu_load() seems to be
> the only place where decay of the older rq load was happening (so as to
> give the older load less importance in its relevance), but with the
> introduction of PJT's series, since the older rq load gets decayed in
> __update_entity_runnable_avg() in a more fine-grained fashion, perhaps
> you are right: while the CPU_LOAD_IDX gets updated, we don't need to
> decay the load once again here.


If cpu_load is just a coarse decay, we can remove it. But it has
different meanings for busy_idx, forkexec_idx, idle_idx and newidle_idx;
each of them gets a different degree of decay. That is the key part, but
I have no idea where their values come from.
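
As far as I understand it, the idx only picks which cpu_load[] slot biases
a comparison, roughly as in the sketch below (modeled on the
source_load()/target_load() pattern in kernel/sched/fair.c; simplified, so
possibly not exact):

struct rq_model {
	unsigned long cpu_load[5];	/* cpu_load[0] is instantaneous */
};

/* source_load() takes the min with the current load (underestimating
 * the source), target_load() the max (overestimating the target); a
 * larger idx means more history and thus more conservative balancing. */
static unsigned long source_load(struct rq_model *rq, int type)
{
	unsigned long total = rq->cpu_load[0];

	if (type == 0)
		return total;
	return total < rq->cpu_load[type - 1] ? total : rq->cpu_load[type - 1];
}

static unsigned long target_load(struct rq_model *rq, int type)
{
	unsigned long total = rq->cpu_load[0];

	if (type == 0)
		return total;
	return total > rq->cpu_load[type - 1] ? total : rq->cpu_load[type - 1];
}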

Thanks!

2012-11-18 09:36:39

by Alex Shi

[permalink] [raw]
Subject: Re: [RFC PATCH 4/5] sched: consider runnable load average in wake_affine and move_tasks

On 11/18/2012 02:09 AM, Preeti U Murthy wrote:
> Hi Alex,
>
> On 11/17/2012 06:34 PM, Alex Shi wrote:
>> Besides using the runnable load average in the background, wake_affine and
>> move_tasks are also key functions in load balancing. We need to consider
>> the runnable load average in them in order to get an apples-to-apples load
>> comparison in load balancing.
>>
>> Signed-off-by: Alex Shi <[email protected]>
>> ---
>> kernel/sched/fair.c | 16 ++++++++++------
>> 1 files changed, 10 insertions(+), 6 deletions(-)
>>
>> @@ -4229,7 +4233,7 @@ static int move_tasks(struct lb_env *env)
>> if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
>> goto next;
>>
>> - load = task_h_load(p);
>> + load = task_h_load(p) * p->se.avg.load_avg_contrib;
> Shouldn't the above be just load = p->se.avg.load_avg_contrib? This
> metric has already considered p->se.load.weight; task_h_load(p) returns
> the same.

Thanks for catching this bug!
But task_h_load(p) is clearly not the same as p->se.load.weight when task
groups are in use. So it could be changed to:
+ load = task_h_load(p) * p->se.avg.runnable_avg_sum
+ / (p->se.avg.runnable_avg_period + 1);

A fixed patch is here:

----------

From 972296706292dcb5cd2bd3c25fa15566130ba74d Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Sat, 17 Nov 2012 19:21:48 +0800
Subject: [PATCH 5/9] sched: consider runnable load average in wake_affine and
move_tasks

Besides using the runnable load average in the background, wake_affine and
move_tasks are also key functions in load balancing. We need to consider
the runnable load average in them in order to get an apples-to-apples load
comparison in load balancing.

Thanks to Preeti for catching the task_h_load bug.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 17 +++++++++++------
1 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f918919..f9f1010 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3164,8 +3164,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
tg = task_group(current);
weight = current->se.load.weight;

- this_load += effective_load(tg, this_cpu, -weight, -weight);
- load += effective_load(tg, prev_cpu, 0, -weight);
+ this_load += effective_load(tg, this_cpu, -weight, -weight)
+ * cpu_rq(this_cpu)->avg.load_avg_contrib;
+ load += effective_load(tg, prev_cpu, 0, -weight)
+ * cpu_rq(prev_cpu)->avg.load_avg_contrib;
}

tg = task_group(p);
@@ -3185,12 +3187,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)

this_eff_load = 100;
this_eff_load *= power_of(prev_cpu);
- this_eff_load *= this_load +
- effective_load(tg, this_cpu, weight, weight);
+ this_eff_load *= (this_load +
+ effective_load(tg, this_cpu, weight, weight))
+ * cpu_rq(this_cpu)->avg.load_avg_contrib;

prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
prev_eff_load *= power_of(this_cpu);
- prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight);
+ prev_eff_load *= (load + effective_load(tg, prev_cpu, 0, weight))
+ * cpu_rq(prev_cpu)->avg.load_avg_contrib;

balanced = this_eff_load <= prev_eff_load;
} else
@@ -4229,7 +4233,8 @@ static int move_tasks(struct lb_env *env)
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
goto next;

- load = task_h_load(p);
+ load = task_h_load(p) * p->se.avg.runnable_avg_sum
+ / (p->se.avg.runnable_avg_period + 1);

if (sched_feat(LB_MIN) && load < 16 && !env->failed)
goto next;
--
1.7.5.4

2012-11-26 19:03:11

by Benjamin Segall

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] enable runnable load avg in load balance

So, I've been trying out using the runnable averages for load balance in
a few ways, but haven't actually gotten any improvement on the
benchmarks I've run. I'll post my patches once I have the numbers down,
but it's generally been about half a percent to 1% worse on the tests
I've tried.

The basic idea is to use (cfs_rq->runnable_load_avg +
cfs_rq->blocked_load_avg) (which should be equivalent to doing
load_avg_contrib on the rq) for cfs_rqs and possibly the rq, and
p->se.load.weight * p->se.avg.runnable_avg_sum / period for tasks.

I have not yet tried including wake_affine, so this has just involved
h_load (task_load_down and task_h_load), as that makes everything
(besides wake_affine) be based on either the new averages or the
rq->cpu_load averages.
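
Concretely, the two metrics are roughly the following (a sketch using the
field names from the per-entity load-tracking series; the struct layout
here is simplified, not the kernel's):

/* cfs_rq side: runnable plus blocked load, which should match what
 * the rq's load_avg_contrib would compute. */
struct cfs_rq_avg {
	unsigned long runnable_load_avg;	/* load of queued entities */
	unsigned long blocked_load_avg;		/* decayed load of sleepers */
};

static unsigned long cfs_rq_load(const struct cfs_rq_avg *a)
{
	return a->runnable_load_avg + a->blocked_load_avg;
}

/* task side: weight scaled by the fraction of time it was runnable */
struct task_avg {
	unsigned long weight;			/* se.load.weight */
	unsigned int runnable_avg_sum;
	unsigned int runnable_avg_period;
};

static unsigned long task_load(const struct task_avg *t)
{
	return t->weight * t->runnable_avg_sum /
	       (t->runnable_avg_period + 1);
}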

2012-11-27 00:39:30

by Alex Shi

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] enable runnable load avg in load balance

On 11/27/2012 03:03 AM, Benjamin Segall wrote:
> So, I've been trying out using the runnable averages for load balance in
> a few ways, but haven't actually gotten any improvement on the
> benchmarks I've run. I'll post my patches once I have the numbers down,
> but it's generally been about half a percent to 1% worse on the tests
> I've tried.
>
> The basic idea is to use (cfs_rq->runnable_load_avg +
> cfs_rq->blocked_load_avg) (which should be equivalent to doing
> load_avg_contrib on the rq) for cfs_rqs and possibly the rq, and
> p->se.load.weight * p->se.avg.runnable_avg_sum / period for tasks.
>
> I have not yet tried including wake_affine, so this has just involved
> h_load (task_load_down and task_h_load), as that makes everything
> (besides wake_affine) be based on either the new averages or the
> rq->cpu_load averages.
>


Which tree is your code based on? tip/master has been changing quickly recently.

2012-11-27 01:13:35

by Alex Shi

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] enable runnable load avg in load balance

On 11/27/2012 03:03 AM, Benjamin Segall wrote:
> So, I've been trying out using the runnable averages for load balance in
> a few ways, but haven't actually gotten any improvement on the
> benchmarks I've run. I'll post my patches once I have the numbers down,
> but it's generally been about half a percent to 1% worse on the tests
> I've tried.

Did you try this RFC patch? And what's the result of it? :)

>
> The basic idea is to use (cfs_rq->runnable_load_avg +
> cfs_rq->blocked_load_avg) (which should be equivalent to doing
> load_avg_contrib on the rq) for cfs_rqs and possibly the rq, and
> p->se.load.weight * p->se.avg.runnable_avg_sum / period for tasks.
>
> I have not yet tried including wake_affine, so this has just involved
> h_load (task_load_down and task_h_load), as that makes everything
> (besides wake_affine) be based on either the new averages or the
> rq->cpu_load averages.
>

2012-11-27 03:09:14

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] enable runnable load avg in load balance

Hi everyone,

On 11/27/2012 12:33 AM, Benjamin Segall wrote:
> So, I've been trying out using the runnable averages for load balance in
> a few ways, but haven't actually gotten any improvement on the
> benchmarks I've run. I'll post my patches once I have the numbers down,
> but it's generally been about half a percent to 1% worse on the tests
> I've tried.
>
> The basic idea is to use (cfs_rq->runnable_load_avg +
> cfs_rq->blocked_load_avg) (which should be equivalent to doing
> load_avg_contrib on the rq) for cfs_rqs and possibly the rq, and
> p->se.load.weight * p->se.avg.runnable_avg_sum / period for tasks.

Why should cfs_rq->blocked_load_avg be included when calculating the load
on the rq? Blocked tasks do not contribute to the active load of the cpu,
right?

When a task goes to sleep, its load is removed from cfs_rq->load.weight
as well, in account_entity_dequeue(). This means the load balancer
considers a sleeping entity as *not* contributing to the active runqueue
load. So shouldn't the new metric consider cfs_rq->runnable_load_avg
alone?
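
The bookkeeping I mean looks roughly like this (a simplified model of the
dequeue path in PJT's series, not the actual kernel code):

/* Simplified model: when an entity goes to sleep, its contribution
 * moves from runnable_load_avg to blocked_load_avg, and
 * account_entity_dequeue() drops its weight from load.weight. */
struct cfs_rq_model {
	unsigned long load_weight;	/* cfs_rq->load.weight */
	unsigned long runnable_load_avg;
	unsigned long blocked_load_avg;
};

static void dequeue_sleeping_entity(struct cfs_rq_model *cfs_rq,
				    unsigned long weight,
				    unsigned long load_avg_contrib)
{
	cfs_rq->load_weight -= weight;
	cfs_rq->runnable_load_avg -= load_avg_contrib;
	cfs_rq->blocked_load_avg += load_avg_contrib;	/* decays from here */
}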
>
> I have not yet tried including wake_affine, so this has just involved
> h_load (task_load_down and task_h_load), as that makes everything
> (besides wake_affine) be based on either the new averages or the
> rq->cpu_load averages.
>

Yeah, I have been trying to view the performance as well, but with
cfs_rq->runnable_load_avg as the rq load contribution and the task load,
same as mentioned above. I have not completed my experiments but I would
expect some significant performance difference due to the below scenario:

                 Task3(10% task)
Task1(100% task) Task4(10% task)
Task2(100% task) Task5(10% task)
---------------- --------------- ----------
CPU1             CPU2            CPU3

When cpu3 triggers load balancing:

CASE1:
Without PJT's metric, the following loads will be perceived:
CPU1->2048
CPU2->3042
Therefore CPU2 might be relieved of one task to result in:


Task1(100% task) Task4(10% task)
Task2(100% task) Task5(10% task) Task3(10% task)
---------------- --------------- ---------------
CPU1             CPU2            CPU3

CASE2:
With PJT's metric, the following loads will be perceived:
CPU1->2048
CPU2->1022
Therefore CPU1 might be relieved of one task to result in:

                 Task3(10% task)
                 Task4(10% task)
Task2(100% task) Task5(10% task) Task1(100% task)
---------------- --------------- ----------------
CPU1             CPU2            CPU3


The differences between the above two scenarios include:

1. Reduced latency for Task1 in CASE2, which is the right task to be
moved in the above scenario.

2. Even though in the former case CPU2 is relieved of one task, it's of
no use if Task3 is going to sleep most of the time. This might result in
more load balancing on behalf of cpu3.

What do you guys think?
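
For what it's worth, the loads above are what I observed; an idealized
back-of-the-envelope model (assuming a NICE_0 weight of 1024 per task)
gives slightly different absolute numbers but the same picture:

#include <stdio.h>

#define NICE_0_LOAD 1024	/* assumed per-task weight */

int main(void)
{
	/* CPU1 runs two 100% tasks, CPU2 three 10% tasks */
	double cpu1_busy[2] = { 1.0, 1.0 };
	double cpu2_busy[3] = { 0.1, 0.1, 0.1 };
	double cpu1_avg = 0, cpu2_avg = 0;
	int i;

	/* without PJT's metric: rq->load.weight just sums raw weights */
	printf("raw: CPU1=%d CPU2=%d\n", 2 * NICE_0_LOAD, 3 * NICE_0_LOAD);

	/* with PJT's metric: each weight is scaled by its runnable
	 * fraction, so mostly-sleeping tasks nearly vanish */
	for (i = 0; i < 2; i++)
		cpu1_avg += NICE_0_LOAD * cpu1_busy[i];
	for (i = 0; i < 3; i++)
		cpu2_avg += NICE_0_LOAD * cpu2_busy[i];
	printf("avg: CPU1=%.0f CPU2=%.0f\n", cpu1_avg, cpu2_avg);
	return 0;
}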

Thank you

Regards
Preeti U Murthy



2012-11-27 06:16:14

by Alex Shi

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] enable runnable load avg in load balance

On 11/27/2012 11:08 AM, Preeti U Murthy wrote:
> Hi everyone,
>
> On 11/27/2012 12:33 AM, Benjamin Segall wrote:
>> So, I've been trying out using the runnable averages for load balance in
>> a few ways, but haven't actually gotten any improvement on the
>> benchmarks I've run. I'll post my patches once I have the numbers down,
>> but it's generally been about half a percent to 1% worse on the tests
>> I've tried.
>>
>> The basic idea is to use (cfs_rq->runnable_load_avg +
>> cfs_rq->blocked_load_avg) (which should be equivalent to doing
>> load_avg_contrib on the rq) for cfs_rqs and possibly the rq, and
>> p->se.load.weight * p->se.avg.runnable_avg_sum / period for tasks.
>
> Why should cfs_rq->blocked_load_avg be included when calculating the load
> on the rq? Blocked tasks do not contribute to the active load of the cpu,
> right?
>
> When a task goes to sleep, its load is removed from cfs_rq->load.weight
> as well, in account_entity_dequeue(). This means the load balancer
> considers a sleeping entity as *not* contributing to the active runqueue
> load. So shouldn't the new metric consider cfs_rq->runnable_load_avg
> alone?
>>
>> I have not yet tried including wake_affine, so this has just involved
>> h_load (task_load_down and task_h_load), as that makes everything
>> (besides wake_affine) be based on either the new averages or the
>> rq->cpu_load averages.
>>
>
> Yeah, I have been trying to view the performance as well, but with
> cfs_rq->runnable_load_avg as the rq load contribution and the task load,
> same as mentioned above. I have not completed my experiments but I would
> expect some significant performance difference due to the below scenario:
>
>                  Task3(10% task)
> Task1(100% task) Task4(10% task)
> Task2(100% task) Task5(10% task)
> ---------------- --------------- ----------
> CPU1             CPU2            CPU3
>
> When cpu3 triggers load balancing:
>
> CASE1:
> Without PJT's metric, the following loads will be perceived:
> CPU1->2048
> CPU2->3042
> Therefore CPU2 might be relieved of one task to result in:
>
>
> Task1(100% task) Task4(10% task)
> Task2(100% task) Task5(10% task) Task3(10% task)
> ---------------- --------------- ---------------
> CPU1             CPU2            CPU3
>
> CASE2:
> With PJT's metric, the following loads will be perceived:
> CPU1->2048
> CPU2->1022
> Therefore CPU1 might be relieved of one task to result in:
>
>                  Task3(10% task)
>                  Task4(10% task)
> Task2(100% task) Task5(10% task) Task1(100% task)
> ---------------- --------------- ----------------
> CPU1             CPU2            CPU3
>
>
> The differences between the above two scenarios include:
>
> 1. Reduced latency for Task1 in CASE2, which is the right task to be
> moved in the above scenario.
>
> 2. Even though in the former case CPU2 is relieved of one task, it's of
> no use if Task3 is going to sleep most of the time. This might result in
> more load balancing on behalf of cpu3.
>
> What do you guys think?

It looks fine, just a question about CASE 1.
Usually cpu2, with three 10% load tasks, will show nr_running == 0 about
70% of the time. So, how do you make rq->nr_running = 3 always?

I guess that in most cases load balancing would pull task1 or task2 to
cpu2 or cpu3, not produce the result of CASE 1.


>
> Thank you
>
> Regards
> Preeti U Murthy
>
>
>
>

2012-11-27 06:46:37

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] enable runnable load avg in load balance

Hi,
On 11/27/2012 11:44 AM, Alex Shi wrote:
> On 11/27/2012 11:08 AM, Preeti U Murthy wrote:
>> Hi everyone,
>>
>> On 11/27/2012 12:33 AM, Benjamin Segall wrote:
>>> So, I've been trying out using the runnable averages for load balance in
>>> a few ways, but haven't actually gotten any improvement on the
>>> benchmarks I've run. I'll post my patches once I have the numbers down,
>>> but it's generally been about half a percent to 1% worse on the tests
>>> I've tried.
>>>
>>> The basic idea is to use (cfs_rq->runnable_load_avg +
>>> cfs_rq->blocked_load_avg) (which should be equivalent to doing
>>> load_avg_contrib on the rq) for cfs_rqs and possibly the rq, and
>>> p->se.load.weight * p->se.avg.runnable_avg_sum / period for tasks.
>>
>> Why should cfs_rq->blocked_load_avg be included when calculating the load
>> on the rq? Blocked tasks do not contribute to the active load of the cpu,
>> right?
>>
>> When a task goes to sleep, its load is removed from cfs_rq->load.weight
>> as well, in account_entity_dequeue(). This means the load balancer
>> considers a sleeping entity as *not* contributing to the active runqueue
>> load. So shouldn't the new metric consider cfs_rq->runnable_load_avg
>> alone?
>>>
>>> I have not yet tried including wake_affine, so this has just involved
>>> h_load (task_load_down and task_h_load), as that makes everything
>>> (besides wake_affine) be based on either the new averages or the
>>> rq->cpu_load averages.
>>>
>>
>> Yeah, I have been trying to view the performance as well, but with
>> cfs_rq->runnable_load_avg as the rq load contribution and the task load,
>> same as mentioned above. I have not completed my experiments but I would
>> expect some significant performance difference due to the below scenario:
>>
>>                  Task3(10% task)
>> Task1(100% task) Task4(10% task)
>> Task2(100% task) Task5(10% task)
>> ---------------- --------------- ----------
>> CPU1             CPU2            CPU3
>>
>> When cpu3 triggers load balancing:
>>
>> CASE1:
>> Without PJT's metric, the following loads will be perceived:
>> CPU1->2048
>> CPU2->3042
>> Therefore CPU2 might be relieved of one task to result in:
>>
>>
>> Task1(100% task) Task4(10% task)
>> Task2(100% task) Task5(10% task) Task3(10% task)
>> ---------------- --------------- ---------------
>> CPU1             CPU2            CPU3
>>
>> CASE2:
>> With PJT's metric, the following loads will be perceived:
>> CPU1->2048
>> CPU2->1022
>> Therefore CPU1 might be relieved of one task to result in:
>>
>>                  Task3(10% task)
>>                  Task4(10% task)
>> Task2(100% task) Task5(10% task) Task1(100% task)
>> ---------------- --------------- ----------------
>> CPU1             CPU2            CPU3
>>
>>
>> The differences between the above two scenarios include:
>>
>> 1. Reduced latency for Task1 in CASE2, which is the right task to be
>> moved in the above scenario.
>>
>> 2. Even though in the former case CPU2 is relieved of one task, it's of
>> no use if Task3 is going to sleep most of the time. This might result in
>> more load balancing on behalf of cpu3.
>>
>> What do you guys think?
>
> It looks fine, just a question about CASE 1.
> Usually cpu2, with three 10% load tasks, will show nr_running == 0 about
> 70% of the time. So, how do you make rq->nr_running = 3 always?
>
> I guess that in most cases load balancing would pull task1 or task2 to
> cpu2 or cpu3, not produce the result of CASE 1.

That's right, Alex. Most of the time the nr_running on CPU2 will be shown
to be 0, or perhaps 1 or 2. But whether you use PJT's metric or not, the
load balancer in such circumstances will behave the same, as you have
rightly pointed out: pull task1/2 to cpu2/3.

But the issue usually arises when all three wake up at the same time on
cpu2, wrongly portraying the load as 3042 if PJT's metric is not used.
This could lead to load balancing one of these short running tasks, as
shown by CASE1. This is the situation where, in my opinion, PJT's metric
could make a difference.

Regards
Preeti U Murthy

2012-11-27 08:08:24

by Alex Shi

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] enable runnable load avg in load balance

On 11/27/2012 02:45 PM, Preeti U Murthy wrote:
> Hi,
> On 11/27/2012 11:44 AM, Alex Shi wrote:
>> On 11/27/2012 11:08 AM, Preeti U Murthy wrote:
>>> Hi everyone,
>>>
>>> On 11/27/2012 12:33 AM, Benjamin Segall wrote:
>>>> So, I've been trying out using the runnable averages for load balance in
>>>> a few ways, but haven't actually gotten any improvement on the
>>>> benchmarks I've run. I'll post my patches once I have the numbers down,
>>>> but it's generally been about half a percent to 1% worse on the tests
>>>> I've tried.
>>>>
>>>> The basic idea is to use (cfs_rq->runnable_load_avg +
>>>> cfs_rq->blocked_load_avg) (which should be equivalent to doing
>>>> load_avg_contrib on the rq) for cfs_rqs and possibly the rq, and
>>>> p->se.load.weight * p->se.avg.runnable_avg_sum / period for tasks.
>>>
>>> Why should cfs_rq->blocked_load_avg be included when calculating the load
>>> on the rq? Blocked tasks do not contribute to the active load of the cpu,
>>> right?
>>>
>>> When a task goes to sleep, its load is removed from cfs_rq->load.weight
>>> as well, in account_entity_dequeue(). This means the load balancer
>>> considers a sleeping entity as *not* contributing to the active runqueue
>>> load. So shouldn't the new metric consider cfs_rq->runnable_load_avg
>>> alone?
>>>>
>>>> I have not yet tried including wake_affine, so this has just involved
>>>> h_load (task_load_down and task_h_load), as that makes everything
>>>> (besides wake_affine) be based on either the new averages or the
>>>> rq->cpu_load averages.
>>>>
>>>
>>> Yeah, I have been trying to view the performance as well, but with
>>> cfs_rq->runnable_load_avg as the rq load contribution and the task load,
>>> same as mentioned above. I have not completed my experiments but I would
>>> expect some significant performance difference due to the below scenario:
>>>
>>>                  Task3(10% task)
>>> Task1(100% task) Task4(10% task)
>>> Task2(100% task) Task5(10% task)
>>> ---------------- --------------- ----------
>>> CPU1             CPU2            CPU3
>>>
>>> When cpu3 triggers load balancing:
>>>
>>> CASE1:
>>> Without PJT's metric, the following loads will be perceived:
>>> CPU1->2048
>>> CPU2->3042
>>> Therefore CPU2 might be relieved of one task to result in:
>>>
>>>
>>> Task1(100% task) Task4(10% task)
>>> Task2(100% task) Task5(10% task) Task3(10% task)
>>> ---------------- --------------- ---------------
>>> CPU1             CPU2            CPU3
>>>
>>> CASE2:
>>> With PJT's metric, the following loads will be perceived:
>>> CPU1->2048
>>> CPU2->1022
>>> Therefore CPU1 might be relieved of one task to result in:
>>>
>>>                  Task3(10% task)
>>>                  Task4(10% task)
>>> Task2(100% task) Task5(10% task) Task1(100% task)
>>> ---------------- --------------- ----------------
>>> CPU1             CPU2            CPU3
>>>
>>>
>>> The differences between the above two scenarios include:
>>>
>>> 1. Reduced latency for Task1 in CASE2, which is the right task to be
>>> moved in the above scenario.
>>>
>>> 2. Even though in the former case CPU2 is relieved of one task, it's of
>>> no use if Task3 is going to sleep most of the time. This might result in
>>> more load balancing on behalf of cpu3.
>>>
>>> What do you guys think?
>>
>> It looks fine, just a question about CASE 1.
>> Usually cpu2, with three 10% load tasks, will show nr_running == 0 about
>> 70% of the time. So, how do you make rq->nr_running = 3 always?
>>
>> I guess that in most cases load balancing would pull task1 or task2 to
>> cpu2 or cpu3, not produce the result of CASE 1.
>
> That's right, Alex. Most of the time the nr_running on CPU2 will be shown
> to be 0, or perhaps 1 or 2. But whether you use PJT's metric or not, the
> load balancer in such circumstances will behave the same, as you have
> rightly pointed out: pull task1/2 to cpu2/3.
>
> But the issue usually arises when all three wake up at the same time on
> cpu2, wrongly portraying the load as 3042 if PJT's metric is not used.
> This could lead to load balancing one of these short running tasks, as
> shown by CASE1. This is the situation where, in my opinion, PJT's metric
> could make a difference.

Sure. And it would be perfect if you can find an appropriate benchmark to
support it.
>
> Regards
> Preeti U Murthy
>