2013-05-30 07:03:05

by Alex Shi

Subject: [patch v7 0/8] sched: using runnable load avg in balance

Thanks for the comments from Peter, Paul, Morten, Michael and Preeti.

The most important change in this version is the rebase onto the latest
tip/sched/core tree.

I tested on Intel Core2, NHM, SNB, and IVB machines with 2 and 4 sockets,
using the kbuild, aim7, dbench, tbench, hackbench, oltp, and netperf
loopback benchmarks.

On the 4-socket SNB EP machine, hackbench improved by about 50% and the
results became stable. On other machines, hackbench improved by about
2~10%. oltp improved by about 30% on the NHM EX box. netperf loopback
also improved on the 4-socket SNB EP box. There were no clear changes on
the other benchmarks.

Michael Wang got better pgbench performance on his box with this
patchset: https://lkml.org/lkml/2013/5/16/82

And Morten tested a previous version and saw better power consumption:
http://comments.gmane.org/gmane.linux.kernel/1463371

Changlong found that the LTP cgroup stress test runs faster on an SNB EP
machine: https://lkml.org/lkml/2013/5/23/65
---
3.10-rc1 patch1-7 patch1-8
duration=764 duration=754 duration=750
duration=764 duration=754 duration=751
duration=763 duration=755 duration=751

duration is the test run time in seconds.
---

Jason also found that a Java server workload benefited on his 8-socket
machine:
https://lkml.org/lkml/2013/5/29/673
---
When using a 3.10-rc2 tip kernel with patches 1-8, there was about a 40%
improvement in performance of the workload compared to when using the
vanilla 3.10-rc2 tip kernel with no patches. When using a 3.10-rc2 tip
kernel with just patches 1-7, the performance improvement of the
workload over the vanilla 3.10-rc2 tip kernel was about 25%.
---

We also tried to include the blocked load avg in balancing, but found
that many benchmarks lose a lot of performance! So accumulating the
current blocked_load_avg into the cpu load does not seem to be a good
idea.
The blocked_load_avg decays at the same rate as the runnable load but is
sometimes far bigger than it; that drives tasks to other idle or lightly
loaded cpus and causes both performance and power issues (see the sketch
below). But if the blocked load is decayed too fast, it loses its effect.
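
As a rough illustration of the scale involved, here is a standalone
userspace sketch, assuming the usual per-entity load-tracking half-life
of about 32ms (this is not kernel code, only the decay curve):

#include <stdio.h>
#include <math.h>

int main(void)
{
	/* decay factor y chosen so that y^32 == 0.5, as in the kernel tables */
	const double y = pow(0.5, 1.0 / 32.0);
	int periods[] = { 1, 8, 32, 64, 128 };
	int i;

	/* share of a task's load still counted as blocked after sleeping */
	for (i = 0; i < 5; i++)
		printf("asleep %3d ms: %5.1f%% still in blocked_load_avg\n",
		       periods[i], pow(y, periods[i]) * 100.0);
	return 0;
}

A cpu whose tasks only recently went to sleep can therefore look about
as loaded as one that is actually running them.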
Another issue with blocked load is that when a task wakes up, we cannot
know what proportion of the cpu's blocked load belongs to that task. So
the blocked load is meaningless for the wake affine decision.

Given the above problems, I cannot figure out a way to use
blocked_load_avg in balancing for now.

Anyway, using the runnable load avg in balancing brings a lot of benefit
for performance and power, and this patchset has been under review for a
long time. So maybe it's time to let it land in a sub-maintainer tree,
like tip or linux-next. Any comments?

Regards
Alex
[patch v7 1/8] Revert "sched: Introduce temporary FAIR_GROUP_SCHED
[patch v7 2/8] sched: move few runnable tg variables into CONFIG_SMP
[patch v7 3/8] sched: set initial value of runnable avg for new
[patch v7 4/8] sched: fix slept time double counting in enqueue
[patch v7 5/8] sched: update cpu load after task_tick.
[patch v7 6/8] sched: compute runnable load avg in cpu_load and
[patch v7 7/8] sched: consider runnable load average in move_tasks
[patch v7 8/8] sched: remove blocked_load_avg in tg


2013-05-30 07:03:10

by Alex Shi

Subject: [patch v7 1/8] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"

Remove the CONFIG_FAIR_GROUP_SCHED guard that covers the runnable info,
so that we can use the runnable load variables.

Signed-off-by: Alex Shi <[email protected]>
---
include/linux/sched.h | 7 +------
kernel/sched/core.c | 7 +------
kernel/sched/fair.c | 13 ++-----------
kernel/sched/sched.h | 10 ++--------
4 files changed, 6 insertions(+), 31 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 178a8d9..0019bef 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -994,12 +994,7 @@ struct sched_entity {
struct cfs_rq *my_q;
#endif

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/* Per-entity load-tracking */
struct sched_avg avg;
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 36f85be..b9e7036 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1598,12 +1598,7 @@ static void __sched_fork(struct task_struct *p)
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
p->se.avg.runnable_avg_period = 0;
p->se.avg.runnable_avg_sum = 0;
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ee1c2e..f404468 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1128,8 +1128,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_FAIR_GROUP_SCHED */

-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/*
* We choose a half-life close to 1 scheduling period.
* Note: The tables below are dependent on this value.
@@ -3436,12 +3435,6 @@ unlock:
}

/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
* Called immediately before a task is migrated to a new cpu; task_cpu(p) and
* cfs_rq_of(p) references at time of call are still valid and identify the
* previous cpu. However, the caller only guarantees p->pi_lock is held; no
@@ -3464,7 +3457,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
}
}
-#endif
#endif /* CONFIG_SMP */

static unsigned long
@@ -6167,9 +6159,8 @@ const struct sched_class fair_sched_class = {

#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
.migrate_task_rq = migrate_task_rq_fair,
-#endif
+
.rq_online = rq_online_fair,
.rq_offline = rq_offline_fair,

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 74ff659..d892a9f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -269,12 +269,6 @@ struct cfs_rq {
#endif

#ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
/*
* CFS Load tracking
* Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -284,9 +278,9 @@ struct cfs_rq {
u64 runnable_load_avg, blocked_load_avg;
atomic64_t decay_counter, removed_load;
u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
+ /* Required to track per-cpu representation of a task_group */
u32 tg_runnable_contrib;
u64 tg_load_contrib;
#endif /* CONFIG_FAIR_GROUP_SCHED */
--
1.7.12

2013-05-30 07:03:23

by Alex Shi

Subject: [patch v7 4/8] sched: fix slept time double counting in enqueue entity

The wakeuped migrated task will __synchronize_entity_decay(se); in
migrate_task_fair, then it needs to set
`se->avg.last_runnable_update -= (-se->avg.decay_count) << 20'
before update_entity_load_avg, in order to avoid slept time is updated
twice for se.avg.load_avg_contrib in both __syncchronize and
update_entity_load_avg.

but if the slept task is waked up from self cpu, it miss the
last_runnable_update before update_entity_load_avg(se, 0, 1), then the
slept time was used twice in both functions.
So we need to remove the double slept time counting.
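
For context, a standalone sketch of the arithmetic behind the `<< 20'
here (my reading of the then-current per-entity load-tracking code, not
part of the patch): __synchronize_entity_decay() returns how many ~1ms
decay periods the entity slept through, and one period covers 1024
"microseconds" of 1024ns each, i.e. 1 << 20 ns, so advancing
last_runnable_update by that many nanoseconds marks the slept span as
already decayed and keeps update_entity_load_avg() from charging it
again.

#include <stdint.h>

/* Hypothetical helper, not from the patch: decay periods -> nanoseconds. */
static inline uint64_t decay_periods_to_ns(uint64_t periods)
{
	return periods << 20;	/* 1 period == 1024 * 1024 ns */
}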

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1fc30b9..42c7be0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1570,7 +1570,8 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
}
wakeup = 0;
} else {
- __synchronize_entity_decay(se);
+ se->avg.last_runnable_update += __synchronize_entity_decay(se)
+ << 20;
}

/* migrated tasks did not contribute to our blocked load */
--
1.7.12

2013-05-30 07:03:33

by Alex Shi

Subject: [patch v7 8/8] sched: remove blocked_load_avg in tg

blocked_load_avg is sometimes too heavy and far bigger than the runnable
load avg, which makes the balancer take wrong decisions. So remove it
from the task group load contribution.

Changlong tested this patch and found that the LTP cgroup stress test
gets better performance: https://lkml.org/lkml/2013/5/23/65
---
3.10-rc1 patch1-7 patch1-8
duration=764 duration=754 duration=750
duration=764 duration=754 duration=751
duration=763 duration=755 duration=751

duration is the test run time in seconds.
---

And Jason also tested this patchset on his 8-socket machine:
https://lkml.org/lkml/2013/5/29/673
---
When using a 3.10-rc2 tip kernel with patches 1-8, there was about a 40%
improvement in performance of the workload compared to when using the
vanilla 3.10-rc2 tip kernel with no patches. When using a 3.10-rc2 tip
kernel with just patches 1-7, the performance improvement of the
workload over the vanilla 3.10-rc2 tip kernel was about 25%.
---

Signed-off-by: Alex Shi <[email protected]>
Tested-by: Changlong Xie <[email protected]>
Tested-by: Jason Low <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bb2470a..163d9ce 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1358,7 +1358,7 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
struct task_group *tg = cfs_rq->tg;
s64 tg_contrib;

- tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
+ tg_contrib = cfs_rq->runnable_load_avg;
tg_contrib -= cfs_rq->tg_load_contrib;

if (force_update || abs64(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
--
1.7.12

2013-05-30 07:04:05

by Alex Shi

Subject: [patch v7 7/8] sched: consider runnable load average in move_tasks

Aside from using the runnable load average in the background load
tracking, move_tasks is also a key function in load balancing. We need
to consider the runnable load average in it as well, in order to get an
apples-to-apples load comparison.
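
As a side note, a standalone sketch of the proportional split that
tg_load_down()/task_h_load() perform once they use the averaged figures
(the helper and its name are mine, not from the patch): a group's h_load
is divided among its children in proportion to their decayed load
contributions, and the `+ 1' guards against a zero divisor.

/* Hypothetical helper mirroring the scaling in the hunks below. */
static unsigned long h_load_share(unsigned long parent_h_load,
				  unsigned long child_load_avg_contrib,
				  unsigned long parent_runnable_load_avg)
{
	return parent_h_load * child_load_avg_contrib /
	       (parent_runnable_load_avg + 1);
}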

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eadd2e7..bb2470a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4178,11 +4178,11 @@ static int tg_load_down(struct task_group *tg, void *data)
long cpu = (long)data;

if (!tg->parent) {
- load = cpu_rq(cpu)->load.weight;
+ load = cpu_rq(cpu)->avg.load_avg_contrib;
} else {
load = tg->parent->cfs_rq[cpu]->h_load;
- load *= tg->se[cpu]->load.weight;
- load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
+ load *= tg->se[cpu]->avg.load_avg_contrib;
+ load /= tg->parent->cfs_rq[cpu]->runnable_load_avg + 1;
}

tg->cfs_rq[cpu]->h_load = load;
@@ -4210,8 +4210,8 @@ static unsigned long task_h_load(struct task_struct *p)
struct cfs_rq *cfs_rq = task_cfs_rq(p);
unsigned long load;

- load = p->se.load.weight;
- load = div_u64(load * cfs_rq->h_load, cfs_rq->load.weight + 1);
+ load = p->se.avg.load_avg_contrib;
+ load = div_u64(load * cfs_rq->h_load, cfs_rq->runnable_load_avg + 1);

return load;
}
--
1.7.12

2013-05-30 07:04:17

by Alex Shi

Subject: [patch v7 6/8] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

These are the base values used in load balancing; update them from the
rq's runnable load average, and then the load balancer will consider the
runnable load avg naturally.

We also tried to include blocked_load_avg in the cpu load used for
balancing, but that caused a 6% kbuild performance drop on every Intel
machine, and aim7/oltp drops on some of the 4-socket machines.
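
For intuition, a standalone userspace sketch (assuming a nice-0 weight
of 1024 and the usual ~32ms half-life; this is not kernel code and the
numbers are only approximate) of how the averaged figure that
weighted_cpuload() now returns ramps up, compared with the instantaneous
rq->load.weight it used to return:

#include <stdio.h>
#include <math.h>

int main(void)
{
	const double y = pow(0.5, 1.0 / 32.0);	/* y^32 == 0.5 */
	int ms;

	/*
	 * A task with a long idle history starts running on an idle cpu:
	 * load.weight jumps to 1024 at once, while the decayed average
	 * only approaches it over a few dozen milliseconds.
	 */
	for (ms = 0; ms <= 96; ms += 16)
		printf("busy %2d ms: load.weight = 1024, runnable_load_avg ~= %4.0f\n",
		       ms, 1024.0 * (1.0 - pow(y, ms)));
	return 0;
}

This is why balancing on the averaged load is less twitchy than
balancing on the raw weight.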

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 5 +++--
kernel/sched/proc.c | 17 +++++++++++++++--
2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 42c7be0..eadd2e7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2962,7 +2962,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* Used instead of source_load when we know the type == 0 */
static unsigned long weighted_cpuload(const int cpu)
{
- return cpu_rq(cpu)->load.weight;
+ return cpu_rq(cpu)->cfs.runnable_load_avg;
}

/*
@@ -3007,9 +3007,10 @@ static unsigned long cpu_avg_load_per_task(int cpu)
{
struct rq *rq = cpu_rq(cpu);
unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
+ unsigned long load_avg = rq->cfs.runnable_load_avg;

if (nr_running)
- return rq->load.weight / nr_running;
+ return load_avg / nr_running;

return 0;
}
diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
index bb3a6a0..ce5cd48 100644
--- a/kernel/sched/proc.c
+++ b/kernel/sched/proc.c
@@ -501,6 +501,18 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
sched_avg_update(this_rq);
}

+#ifdef CONFIG_SMP
+unsigned long get_rq_runnable_load(struct rq *rq)
+{
+ return rq->cfs.runnable_load_avg;
+}
+#else
+unsigned long get_rq_runnable_load(struct rq *rq)
+{
+ return rq->load.weight;
+}
+#endif
+
#ifdef CONFIG_NO_HZ_COMMON
/*
* There is no sane way to deal with nohz on smp when using jiffies because the
@@ -522,7 +534,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
void update_idle_cpu_load(struct rq *this_rq)
{
unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
- unsigned long load = this_rq->load.weight;
+ unsigned long load = get_rq_runnable_load(this_rq);
unsigned long pending_updates;

/*
@@ -568,11 +580,12 @@ void update_cpu_load_nohz(void)
*/
void update_cpu_load_active(struct rq *this_rq)
{
+ unsigned long load = get_rq_runnable_load(this_rq);
/*
* See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
*/
this_rq->last_load_update_tick = jiffies;
- __update_cpu_load(this_rq, this_rq->load.weight, 1);
+ __update_cpu_load(this_rq, load, 1);

calc_load_account_active(this_rq);
}
--
1.7.12

2013-05-30 07:04:25

by Alex Shi

Subject: [patch v7 5/8] sched: update cpu load after task_tick.

To get the latest runnable info, we need to do this cpu load update
after task_tick.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6f226c2..05176b8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2152,8 +2152,8 @@ void scheduler_tick(void)

raw_spin_lock(&rq->lock);
update_rq_clock(rq);
- update_cpu_load_active(rq);
curr->sched_class->task_tick(rq, curr, 0);
+ update_cpu_load_active(rq);
raw_spin_unlock(&rq->lock);

perf_event_task_tick();
--
1.7.12

2013-05-30 07:04:51

by Alex Shi

Subject: [patch v7 3/8] sched: set initial value of runnable avg for new forked task

We need to initialize se.avg.{decay_count, load_avg_contrib} for a
newly forked task.
Otherwise, random values in those variables make a mess when the new
task is enqueued:
enqueue_task_fair
enqueue_entity
enqueue_entity_load_avg

and make fork balancing go wrong because of the incorrect
load_avg_contrib.

Furthermore, Morten Rasmussen noticed that some tasks were not launched
immediately after being created. So Paul and Peter suggested giving a
new task's runnable avg time a starting value equal to sched_slice().
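
For illustration, a minimal userspace sketch of what this initialization
amounts to (the 6ms sched_slice() and the nice-0 weight of 1024 are
assumed values, and the contrib formula is paraphrased from
__update_task_entity_contrib() as it stood then, so treat it as
approximate):

#include <stdio.h>

int main(void)
{
	unsigned long weight = 1024;		/* assumed NICE_0_LOAD */
	unsigned int slice = 6000000 >> 10;	/* assumed 6ms slice, in 1024ns units */
	unsigned int runnable_avg_sum = slice;
	unsigned int runnable_avg_period = slice;

	/* roughly what __update_task_entity_contrib() computes */
	unsigned long contrib = weight * runnable_avg_sum /
				(runnable_avg_period + 1);

	printf("initial load_avg_contrib ~= %lu of %lu\n", contrib, weight);
	return 0;
}

With sum == period the new task starts out looking fully busy at its own
weight, instead of weightless, so fork balancing has something real to
work with.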

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/core.c | 6 ++----
kernel/sched/fair.c | 23 +++++++++++++++++++++++
kernel/sched/sched.h | 2 ++
3 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b9e7036..6f226c2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1598,10 +1598,6 @@ static void __sched_fork(struct task_struct *p)
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);

-#ifdef CONFIG_SMP
- p->se.avg.runnable_avg_period = 0;
- p->se.avg.runnable_avg_sum = 0;
-#endif
#ifdef CONFIG_SCHEDSTATS
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
#endif
@@ -1745,6 +1741,8 @@ void wake_up_new_task(struct task_struct *p)
set_task_cpu(p, select_task_rq(p, SD_BALANCE_FORK, 0));
#endif

+ /* Give new task start runnable values */
+ set_task_runnable_avg(p);
rq = __task_rq_lock(p);
activate_task(rq, p, 0);
p->on_rq = 1;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f404468..1fc30b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -680,6 +680,26 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
return calc_delta_fair(sched_slice(cfs_rq, se), se);
}

+#ifdef CONFIG_SMP
+static inline void __update_task_entity_contrib(struct sched_entity *se);
+
+/* Give new task start runnable values to heavy its load in infant time */
+void set_task_runnable_avg(struct task_struct *p)
+{
+ u32 slice;
+
+ p->se.avg.decay_count = 0;
+ slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
+ p->se.avg.runnable_avg_sum = slice;
+ p->se.avg.runnable_avg_period = slice;
+ __update_task_entity_contrib(&p->se);
+}
+#else
+void set_task_runnable_avg(struct task_struct *p)
+{
+}
+#endif
+
/*
* Update the current task's runtime statistics. Skip current tasks that
* are not in our scheduling class.
@@ -1527,6 +1547,9 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
* We track migrations using entity decay_count <= 0, on a wake-up
* migration we use a negative decay count to track the remote decays
* accumulated while sleeping.
+ *
+ * When enqueue a new forked task, the se->avg.decay_count == 0, so
+ * we bypass update_entity_load_avg(), use avg.load_avg_contrib direct.
*/
if (unlikely(se->avg.decay_count <= 0)) {
se->avg.last_runnable_update = rq_clock_task(rq_of(cfs_rq));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 24b1503..8bc66c6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1058,6 +1058,8 @@ extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime

extern void update_idle_cpu_load(struct rq *this_rq);

+extern void set_task_runnable_avg(struct task_struct *p);
+
#ifdef CONFIG_PARAVIRT
static inline u64 steal_ticks(u64 steal)
{
--
1.7.12

2013-05-30 07:05:04

by Alex Shi

Subject: [patch v7 2/8] sched: move few runnable tg variables into CONFIG_SMP

The following 2 variables are only used under CONFIG_SMP, so it is
better to move their definitions under CONFIG_SMP too.

atomic64_t load_avg;
atomic_t runnable_avg;

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/sched.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d892a9f..24b1503 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -149,9 +149,11 @@ struct task_group {
unsigned long shares;

atomic_t load_weight;
+#ifdef CONFIG_SMP
atomic64_t load_avg;
atomic_t runnable_avg;
#endif
+#endif

#ifdef CONFIG_RT_GROUP_SCHED
struct sched_rt_entity **rt_se;
--
1.7.12

2013-05-31 10:19:47

by Morten Rasmussen

Subject: Re: [patch v7 7/8] sched: consider runnable load average in move_tasks

On Thu, May 30, 2013 at 08:02:03AM +0100, Alex Shi wrote:
> Aside from using the runnable load average in the background load
> tracking, move_tasks is also a key function in load balancing. We need
> to consider the runnable load average in it as well, in order to get an
> apples-to-apples load comparison.
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/fair.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eadd2e7..bb2470a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4178,11 +4178,11 @@ static int tg_load_down(struct task_group *tg, void *data)
> long cpu = (long)data;
>
> if (!tg->parent) {
> - load = cpu_rq(cpu)->load.weight;
> + load = cpu_rq(cpu)->avg.load_avg_contrib;
> } else {
> load = tg->parent->cfs_rq[cpu]->h_load;
> - load *= tg->se[cpu]->load.weight;
> - load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
> + load *= tg->se[cpu]->avg.load_avg_contrib;
> + load /= tg->parent->cfs_rq[cpu]->runnable_load_avg + 1;

runnable_load_avg is u64, so you need to use div_u64() similar to how it
is already done in task_h_load() further down in this patch. It doesn't
build on ARM as is.

Fix:
- load /= tg->parent->cfs_rq[cpu]->runnable_load_avg + 1;
+ load = div_u64(load,
tg->parent->cfs_rq[cpu]->runnable_load_avg + 1);

Morten

> }
>
> tg->cfs_rq[cpu]->h_load = load;
> @@ -4210,8 +4210,8 @@ static unsigned long task_h_load(struct task_struct *p)
> struct cfs_rq *cfs_rq = task_cfs_rq(p);
> unsigned long load;
>
> - load = p->se.load.weight;
> - load = div_u64(load * cfs_rq->h_load, cfs_rq->load.weight + 1);
> + load = p->se.avg.load_avg_contrib;
> + load = div_u64(load * cfs_rq->h_load, cfs_rq->runnable_load_avg + 1);
>
> return load;
> }
> --
> 1.7.12
>
>

2013-05-31 15:07:57

by Alex Shi

Subject: Re: [patch v7 7/8] sched: consider runnable load average in move_tasks


>
> runnable_load_avg is u64, so you need to use div_u64() similar to how it
> is already done in task_h_load() further down in this patch. It doesn't
> build on ARM as is.
>
> Fix:
> - load /= tg->parent->cfs_rq[cpu]->runnable_load_avg + 1;
> + load = div_u64(load,
> tg->parent->cfs_rq[cpu]->runnable_load_avg + 1);
>
> Morten

Thanks a lot for the review!

div_u64 and do_div force-cast the divisor to u32, so on a 64-bit machine
the divisor may become incorrect.
Since cfs_rq->runnable_load_avg is always smaller than
cfs_rq->load.weight, and load.weight is 'unsigned long', we can cast
runnable_load_avg to 'unsigned long' too. Then the division works on
both 64-bit and 32-bit machines with no data truncation!
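
For reference, a minimal sketch of the option taken here (the helper and
its name are mine, not from the patch; div_u64() in
include/linux/math64.h does take a u32 divisor):

#include <linux/types.h>

/*
 * Hypothetical helper, only to contrast the choices: passing the u64
 * runnable_load_avg to div_u64()/do_div() would truncate it to u32,
 * while casting it to unsigned long keeps a native-width divide on both
 * 32-bit and 64-bit builds, relying on runnable_load_avg staying below
 * the corresponding (unsigned long) load.weight sum.
 */
static inline unsigned long scale_by_rla(unsigned long load,
					 u64 runnable_load_avg)
{
	unsigned long rla = (unsigned long)runnable_load_avg + 1;

	return load / rla;
}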

So the patch is changed as follows.

BTW, Paul & Peter:
in cfs_rq, runnable_load_avg, blocked_load_avg and tg_load_contrib are
all u64, but they play a similar role to the 'unsigned long' load.weight.
So could we change them to 'unsigned long'?

---

From 4a17564363f6d65c9d513ad206b54ebd032d3f46 Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Mon, 3 Dec 2012 23:00:53 +0800
Subject: [PATCH 7/8] sched: consider runnable load average in move_tasks

Aside from using the runnable load average in the background load
tracking, move_tasks is also a key function in load balancing. We need
to consider the runnable load average in it as well, in order to get an
apples-to-apples load comparison.

Morten caught a u64 division bug on ARM, thanks!

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 17 ++++++++++-------
1 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eadd2e7..73e4507 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4178,11 +4178,14 @@ static int tg_load_down(struct task_group *tg, void *data)
long cpu = (long)data;

if (!tg->parent) {
- load = cpu_rq(cpu)->load.weight;
+ load = cpu_rq(cpu)->avg.load_avg_contrib;
} else {
+ unsigned long tmp_rla;
+ tmp_rla = tg->parent->cfs_rq[cpu]->runnable_load_avg + 1;
+
load = tg->parent->cfs_rq[cpu]->h_load;
- load *= tg->se[cpu]->load.weight;
- load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
+ load *= tg->se[cpu]->avg.load_avg_contrib;
+ load /= tmp_rla;
}

tg->cfs_rq[cpu]->h_load = load;
@@ -4208,12 +4211,12 @@ static void update_h_load(long cpu)
static unsigned long task_h_load(struct task_struct *p)
{
struct cfs_rq *cfs_rq = task_cfs_rq(p);
- unsigned long load;
+ unsigned long load, tmp_rla;

- load = p->se.load.weight;
- load = div_u64(load * cfs_rq->h_load, cfs_rq->load.weight + 1);
+ load = p->se.avg.load_avg_contrib * cfs_rq->h_load;
+ tmp_rla = cfs_rq->runnable_load_avg + 1;

- return load;
+ return load / tmp_rla;
}
#else
static inline void update_blocked_averages(int cpu)
--
1.7.5.4

2013-06-03 06:44:40

by Alex Shi

Subject: Re: [patch v7 0/8] sched: using runnable load avg in balance

On 05/30/2013 03:01 PM, Alex Shi wrote:
> Anyway, using the runnable load avg in balancing brings a lot of benefit
> for performance and power, and this patchset has been under review for a
> long time. So maybe it's time to let it land in a sub-maintainer tree,
> like tip or linux-next. Any comments?
>

Peter,
What's your opinion of this patchset? Is there something missing? :)


The patchset git tree is here:
[email protected]:alexshi/power-scheduling.git runnablelb

> [patch v7 1/8] Revert "sched: Introduce temporary FAIR_GROUP_SCHED
> [patch v7 2/8] sched: move few runnable tg variables into CONFIG_SMP
> [patch v7 3/8] sched: set initial value of runnable avg for new
> [patch v7 4/8] sched: fix slept time double counting in enqueue
> [patch v7 5/8] sched: update cpu load after task_tick.

Patches 2~5 are bug fixes.
> [patch v7 6/8] sched: compute runnable load avg in cpu_load and
> [patch v7 7/8] sched: consider runnable load average in move_tasks

Only patches 6 and 7 enable the runnable load in load balancing.
> [patch v7 8/8] sched: remove blocked_load_avg in tg

According to testing, the 8th patch brings a performance gain.

--
Thanks
Alex