2013-06-07 07:21:50

by Alex Shi

[permalink] [raw]
Subject: [patch 0/9] sched: use runnable load avg in balance

Peter&Ingo:

v8: add a macro div64_ul and use it in task_h_load()
v7: rebase on the tip/sched/core tree.

I tested on Intel Core2, NHM, SNB, and IVB, on 2- and 4-socket machines,
with the benchmarks kbuild, aim7, dbench, tbench, hackbench, oltp, and
netperf loopback, etc.

On the 4-socket SNB EP machine, hackbench increased about 50%, and the
results became stable. On other machines, hackbench increased about
2~10%. oltp increased about 30% on the NHM EX box. netperf loopback also
increased on the 4-socket SNB EP box.
No clear changes on other benchmarks.

Michael Wang got better pgbench performance on his box with this
patchset. https://lkml.org/lkml/2013/5/16/82

And Morten tested a previous version and saw better power consumption.
http://comments.gmane.org/gmane.linux.kernel/1463371

Changlong found the ltp cgroup stress test gets faster on an SNB EP
machine with the last patch. https://lkml.org/lkml/2013/5/23/65
---
3.10-rc1 patch1-7 patch1-8
duration=764 duration=754 duration=750
duration=764 duration=754 duration=751
duration=763 duration=755 duration=751

duration is the test run time in seconds.
---

Jason also found a java server load benefited on his 8-socket machine
with the last patch. https://lkml.org/lkml/2013/5/29/673
---
When using a 3.10-rc2 tip kernel with patches 1-8, there was about a 40%
improvement in performance of the workload compared to when using the
vanilla 3.10-rc2 tip kernel with no patches. When using a 3.10-rc2 tip
kernel with just patches 1-7, the performance improvement of the
workload over the vanilla 3.10-rc2 tip kernel was about 25%.
---

We also tried to include the blocked load avg in balance, but found that
many benchmarks' performance dropped a lot! It seems accumulating the
current blocked_load_avg into the cpu load isn't a good idea, because:
1. The blocked_load_avg is decayed at the same rate as the runnable load
and is sometimes far bigger than the runnable load (see the sketch below);
that drives tasks to other idle or lightly loaded cpus and causes both
performance and power issues. But if the blocked load is decayed too
fast, it loses its effect.
2. Another issue with the blocked load is that when waking up a task, we
cannot know the blocked load proportion of the task on the rq. So the
blocked load is meaningless in the wake-affine decision.

Given the above problems, we cannot figure out a right way to use
blocked_load_avg in balance now.
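
For illustration, here is a rough user-space sketch (not kernel code, not
part of the patchset) of point 1: it assumes the per-entity load-tracking
decay where y^32 = 0.5 (1 ms periods, ~32 ms half-life), and a made-up
load of 1024 left behind when a cpu's only task goes to sleep:

#include <math.h>
#include <stdio.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* decay factor per 1ms period */
	double blocked = 1024.0;		/* made-up blocked_load_avg */
	int ms;

	/* how "busy" the now-idle cpu still looks as time passes */
	for (ms = 0; ms <= 128; ms += 32)
		printf("after %3d ms: blocked contribution ~ %4.0f\n",
		       ms, blocked * pow(y, ms));
	return 0;
}

For tens of milliseconds after the task blocked, the cpu still looks
substantially loaded, which is what pushes new or woken tasks elsewhere.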

Since using the runnable load avg in balance brings much benefit in
performance and power, and this patchset has been reviewed for a long
time, it seems it's time to let it be picked up by some subtree, like
tip or linux-next. Any comments?

Regards
Alex


2013-06-07 07:21:56

by Alex Shi

[permalink] [raw]
Subject: [patch v8 1/9] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"

Remove the CONFIG_FAIR_GROUP_SCHED dependency that guards the runnable
load-tracking info, so that we can use the runnable load variables.

Signed-off-by: Alex Shi <[email protected]>
---
include/linux/sched.h | 7 +------
kernel/sched/core.c | 7 +------
kernel/sched/fair.c | 13 ++-----------
kernel/sched/sched.h | 10 ++--------
4 files changed, 6 insertions(+), 31 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 178a8d9..0019bef 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -994,12 +994,7 @@ struct sched_entity {
struct cfs_rq *my_q;
#endif

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/* Per-entity load-tracking */
struct sched_avg avg;
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 36f85be..b9e7036 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1598,12 +1598,7 @@ static void __sched_fork(struct task_struct *p)
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
p->se.avg.runnable_avg_period = 0;
p->se.avg.runnable_avg_sum = 0;
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ee1c2e..f404468 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1128,8 +1128,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_FAIR_GROUP_SCHED */

-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/*
* We choose a half-life close to 1 scheduling period.
* Note: The tables below are dependent on this value.
@@ -3436,12 +3435,6 @@ unlock:
}

/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
* Called immediately before a task is migrated to a new cpu; task_cpu(p) and
* cfs_rq_of(p) references at time of call are still valid and identify the
* previous cpu. However, the caller only guarantees p->pi_lock is held; no
@@ -3464,7 +3457,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
}
}
-#endif
#endif /* CONFIG_SMP */

static unsigned long
@@ -6167,9 +6159,8 @@ const struct sched_class fair_sched_class = {

#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
.migrate_task_rq = migrate_task_rq_fair,
-#endif
+
.rq_online = rq_online_fair,
.rq_offline = rq_offline_fair,

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 74ff659..d892a9f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -269,12 +269,6 @@ struct cfs_rq {
#endif

#ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
/*
* CFS Load tracking
* Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -284,9 +278,9 @@ struct cfs_rq {
u64 runnable_load_avg, blocked_load_avg;
atomic64_t decay_counter, removed_load;
u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
+ /* Required to track per-cpu representation of a task_group */
u32 tg_runnable_contrib;
u64 tg_load_contrib;
#endif /* CONFIG_FAIR_GROUP_SCHED */
--
1.7.12

2013-06-07 07:22:03

by Alex Shi

[permalink] [raw]
Subject: [patch v8 3/9] sched: set initial value of runnable avg for new forked task

We need to initialize se.avg.{decay_count, load_avg_contrib} for a
newly forked task.
Otherwise, random values in those variables cause a mess when the new
task is enqueued:
enqueue_task_fair
enqueue_entity
enqueue_entity_load_avg

and make fork balancing imbalanced due to the incorrect load_avg_contrib.

Furthermore, Morten Rasmussen noticed that some tasks were not launched at
once after being created. So Paul and Peter suggested giving the new task's
runnable avg time a start value equal to sched_slice().

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/core.c | 6 ++----
kernel/sched/fair.c | 23 +++++++++++++++++++++++
kernel/sched/sched.h | 2 ++
3 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b9e7036..6f226c2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1598,10 +1598,6 @@ static void __sched_fork(struct task_struct *p)
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);

-#ifdef CONFIG_SMP
- p->se.avg.runnable_avg_period = 0;
- p->se.avg.runnable_avg_sum = 0;
-#endif
#ifdef CONFIG_SCHEDSTATS
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
#endif
@@ -1745,6 +1741,8 @@ void wake_up_new_task(struct task_struct *p)
set_task_cpu(p, select_task_rq(p, SD_BALANCE_FORK, 0));
#endif

+ /* Give new task start runnable values */
+ set_task_runnable_avg(p);
rq = __task_rq_lock(p);
activate_task(rq, p, 0);
p->on_rq = 1;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f404468..1fc30b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -680,6 +680,26 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
return calc_delta_fair(sched_slice(cfs_rq, se), se);
}

+#ifdef CONFIG_SMP
+static inline void __update_task_entity_contrib(struct sched_entity *se);
+
+/* Give new task start runnable values to heavy its load in infant time */
+void set_task_runnable_avg(struct task_struct *p)
+{
+ u32 slice;
+
+ p->se.avg.decay_count = 0;
+ slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
+ p->se.avg.runnable_avg_sum = slice;
+ p->se.avg.runnable_avg_period = slice;
+ __update_task_entity_contrib(&p->se);
+}
+#else
+void set_task_runnable_avg(struct task_struct *p)
+{
+}
+#endif
+
/*
* Update the current task's runtime statistics. Skip current tasks that
* are not in our scheduling class.
@@ -1527,6 +1547,9 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
* We track migrations using entity decay_count <= 0, on a wake-up
* migration we use a negative decay count to track the remote decays
* accumulated while sleeping.
+ *
+ * When enqueue a new forked task, the se->avg.decay_count == 0, so
+ * we bypass update_entity_load_avg(), use avg.load_avg_contrib direct.
*/
if (unlikely(se->avg.decay_count <= 0)) {
se->avg.last_runnable_update = rq_clock_task(rq_of(cfs_rq));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 24b1503..8bc66c6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1058,6 +1058,8 @@ extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime

extern void update_idle_cpu_load(struct rq *this_rq);

+extern void set_task_runnable_avg(struct task_struct *p);
+
#ifdef CONFIG_PARAVIRT
static inline u64 steal_ticks(u64 steal)
{
--
1.7.12

2013-06-07 07:22:07

by Alex Shi

[permalink] [raw]
Subject: [patch v8 4/9] sched: fix slept time double counting in enqueue entity

A woken-up migrated task calls __synchronize_entity_decay(se) in
migrate_task_rq_fair(); it then needs to set
`se->avg.last_runnable_update -= (-se->avg.decay_count) << 20'
before update_entity_load_avg(), in order to avoid the slept time being
counted twice for se.avg.load_avg_contrib, in both
__synchronize_entity_decay() and update_entity_load_avg().

But if the slept task is woken up on the same cpu, it misses the
last_runnable_update adjustment before update_entity_load_avg(se, 0, 1),
and the slept time is used twice in both functions.
So we need to remove the double counting of the slept time.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1fc30b9..42c7be0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1570,7 +1570,8 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
}
wakeup = 0;
} else {
- __synchronize_entity_decay(se);
+ se->avg.last_runnable_update += __synchronize_entity_decay(se)
+ << 20;
}

/* migrated tasks did not contribute to our blocked load */
--
1.7.12

2013-06-07 07:22:39

by Alex Shi

[permalink] [raw]
Subject: [patch v8 6/9] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

These are the base values used in load balance; update them with the rq
runnable load average, and then load balance will consider the runnable
load avg naturally.

We also tried to include blocked_load_avg as cpu load in balancing, but
that caused a 6% kbuild performance drop on every Intel machine, and
aim7/oltp drops on some of the 4-socket machines.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 5 +++--
kernel/sched/proc.c | 17 +++++++++++++++--
2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 42c7be0..eadd2e7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2962,7 +2962,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* Used instead of source_load when we know the type == 0 */
static unsigned long weighted_cpuload(const int cpu)
{
- return cpu_rq(cpu)->load.weight;
+ return cpu_rq(cpu)->cfs.runnable_load_avg;
}

/*
@@ -3007,9 +3007,10 @@ static unsigned long cpu_avg_load_per_task(int cpu)
{
struct rq *rq = cpu_rq(cpu);
unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
+ unsigned long load_avg = rq->cfs.runnable_load_avg;

if (nr_running)
- return rq->load.weight / nr_running;
+ return load_avg / nr_running;

return 0;
}
diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
index bb3a6a0..ce5cd48 100644
--- a/kernel/sched/proc.c
+++ b/kernel/sched/proc.c
@@ -501,6 +501,18 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
sched_avg_update(this_rq);
}

+#ifdef CONFIG_SMP
+unsigned long get_rq_runnable_load(struct rq *rq)
+{
+ return rq->cfs.runnable_load_avg;
+}
+#else
+unsigned long get_rq_runnable_load(struct rq *rq)
+{
+ return rq->load.weight;
+}
+#endif
+
#ifdef CONFIG_NO_HZ_COMMON
/*
* There is no sane way to deal with nohz on smp when using jiffies because the
@@ -522,7 +534,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
void update_idle_cpu_load(struct rq *this_rq)
{
unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
- unsigned long load = this_rq->load.weight;
+ unsigned long load = get_rq_runnable_load(this_rq);
unsigned long pending_updates;

/*
@@ -568,11 +580,12 @@ void update_cpu_load_nohz(void)
*/
void update_cpu_load_active(struct rq *this_rq)
{
+ unsigned long load = get_rq_runnable_load(this_rq);
/*
* See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
*/
this_rq->last_load_update_tick = jiffies;
- __update_cpu_load(this_rq, this_rq->load.weight, 1);
+ __update_cpu_load(this_rq, load, 1);

calc_load_account_active(this_rq);
}
--
1.7.12

2013-06-07 07:22:52

by Alex Shi

[permalink] [raw]
Subject: [patch v8 8/9] sched: consider runnable load average in move_tasks

Aside from the code paths that already use the runnable load average,
move_tasks() is also a key function in load balance. We need to consider
the runnable load average in it too, in order to make an apples-to-apples
load comparison.

Morten caught a u64 division bug on ARM, thanks!

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eadd2e7..3aa1dc0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4178,11 +4178,14 @@ static int tg_load_down(struct task_group *tg, void *data)
long cpu = (long)data;

if (!tg->parent) {
- load = cpu_rq(cpu)->load.weight;
+ load = cpu_rq(cpu)->avg.load_avg_contrib;
} else {
+ unsigned long tmp_rla;
+ tmp_rla = tg->parent->cfs_rq[cpu]->runnable_load_avg + 1;
+
load = tg->parent->cfs_rq[cpu]->h_load;
- load *= tg->se[cpu]->load.weight;
- load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
+ load *= tg->se[cpu]->avg.load_avg_contrib;
+ load /= tmp_rla;
}

tg->cfs_rq[cpu]->h_load = load;
@@ -4208,12 +4211,9 @@ static void update_h_load(long cpu)
static unsigned long task_h_load(struct task_struct *p)
{
struct cfs_rq *cfs_rq = task_cfs_rq(p);
- unsigned long load;
-
- load = p->se.load.weight;
- load = div_u64(load * cfs_rq->h_load, cfs_rq->load.weight + 1);

- return load;
+ return div64_ul(p->se.avg.load_avg_contrib * cfs_rq->h_load,
+ cfs_rq->runnable_load_avg + 1);
}
#else
static inline void update_blocked_averages(int cpu)
--
1.7.12

2013-06-07 07:22:38

by Alex Shi

[permalink] [raw]
Subject: [patch v8 5/9] sched: update cpu load after task_tick.

To get the latest runnable info, we need to do this cpu load update after
task_tick().

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6f226c2..05176b8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2152,8 +2152,8 @@ void scheduler_tick(void)

raw_spin_lock(&rq->lock);
update_rq_clock(rq);
- update_cpu_load_active(rq);
curr->sched_class->task_tick(rq, curr, 0);
+ update_cpu_load_active(rq);
raw_spin_unlock(&rq->lock);

perf_event_task_tick();
--
1.7.12

2013-06-07 07:23:13

by Alex Shi

[permalink] [raw]
Subject: [patch v8 9/9] sched/tg: remove blocked_load_avg in balance

blocked_load_avg is sometimes too heavy, far bigger than the runnable load
avg, and that makes balance make wrong decisions. So remove it.

Changlong tested this patch and found the ltp cgroup stress test gets
better performance: https://lkml.org/lkml/2013/5/23/65
---
3.10-rc1 patch1-7 patch1-8
duration=764 duration=754 duration=750
duration=764 duration=754 duration=751
duration=763 duration=755 duration=751

duration is the test run time in seconds.
---

And Jason also tested this patchset on his 8-socket machine:
https://lkml.org/lkml/2013/5/29/673
---
When using a 3.10-rc2 tip kernel with patches 1-8, there was about a 40%
improvement in performance of the workload compared to when using the
vanilla 3.10-rc2 tip kernel with no patches. When using a 3.10-rc2 tip
kernel with just patches 1-7, the performance improvement of the
workload over the vanilla 3.10-rc2 tip kernel was about 25%.
---

Signed-off-by: Alex Shi <[email protected]>
Tested-by: Changlong Xie <[email protected]>
Tested-by: Jason Low <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3aa1dc0..985d47e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1358,7 +1358,7 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
struct task_group *tg = cfs_rq->tg;
s64 tg_contrib;

- tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
+ tg_contrib = cfs_rq->runnable_load_avg;
tg_contrib -= cfs_rq->tg_load_contrib;

if (force_update || abs64(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
--
1.7.12

2013-06-07 07:23:41

by Alex Shi

[permalink] [raw]
Subject: [patch v8 7/9] math64: add div64_ul macro

There is div64_long() to handle s64/long division, but no macro in the
kernel does u64/unsigned long division.
It is necessary in some scenarios, so add this macro. Also fix the
coding-style errors.

Signed-off-by: Alex Shi <[email protected]>
---
include/linux/math64.h | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/math64.h b/include/linux/math64.h
index b8ba855..2913b86 100644
--- a/include/linux/math64.h
+++ b/include/linux/math64.h
@@ -6,7 +6,8 @@

#if BITS_PER_LONG == 64

-#define div64_long(x,y) div64_s64((x),(y))
+#define div64_long(x, y) div64_s64((x), (y))
+#define div64_ul(x, y) div64_u64((x), (y))

/**
* div_u64_rem - unsigned 64bit divide with 32bit divisor with remainder
@@ -47,7 +48,8 @@ static inline s64 div64_s64(s64 dividend, s64 divisor)

#elif BITS_PER_LONG == 32

-#define div64_long(x,y) div_s64((x),(y))
+#define div64_long(x, y) div_s64((x), (y))
+#define div64_ul(x, y) div_u64((x), (y))

#ifndef div_u64_rem
static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder)
--
1.7.12

2013-06-07 07:24:00

by Alex Shi

[permalink] [raw]
Subject: [patch v8 2/9] sched: move few runnable tg variables into CONFIG_SMP

The following 2 variables are only used under CONFIG_SMP, so it is better
to move their definitions under CONFIG_SMP too.

atomic64_t load_avg;
atomic_t runnable_avg;

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/sched.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d892a9f..24b1503 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -149,9 +149,11 @@ struct task_group {
unsigned long shares;

atomic_t load_weight;
+#ifdef CONFIG_SMP
atomic64_t load_avg;
atomic_t runnable_avg;
#endif
+#endif

#ifdef CONFIG_RT_GROUP_SCHED
struct sched_rt_entity **rt_se;
--
1.7.12

2013-06-08 02:38:36

by Alex Shi

[permalink] [raw]
Subject: Re: [patch 0/9] sched: use runnable load avg in balance

On 06/07/2013 03:20 PM, Alex Shi wrote:
> Peter&Ingo:
>
> v8: add a marco div64_ul and used it in task_h_load()
> v7: rebasing on tip/sched/core tree.

Peter & Ingo:

IMHO, if this patch set misses the 3.11 kernel, we will lose the wide
benefit it brings on many benchmarks: hackbench, pgbench, sysbench,
anonymous java load, ...

What's your opinion? :)

>
> I tested on Intel core2, NHM, SNB, IVB, 2 and 4 sockets machines with
> benchmark kbuild, aim7, dbench, tbench, hackbench, oltp, and netperf
> loopback etc.
>
> On SNB EP 4 sockets machine, the hackbench increased about 50%, and
> result become stable. on other machines, hackbench increased about
> 2~10%. oltp increased about 30% in NHM EX box. netperf loopback also
> increased on SNB EP 4 sockets box.
> No clear changes on other benchmarks.
>
> Michael Wang gotten better performance on pgbench on his box with this
> patchset. https://lkml.org/lkml/2013/5/16/82
>
> And Morten tested previous version with better power consumption.
> http://comments.gmane.org/gmane.linux.kernel/1463371
>
> Changlong found ltp cgroup stress testing get faster on SNB EP
> machine with the last patch. https://lkml.org/lkml/2013/5/23/65
> ---
> 3.10-rc1 patch1-7 patch1-8
> duration=764 duration=754 duration=750
> duration=764 duration=754 duration=751
> duration=763 duration=755 duration=751
>
> duration means the seconds of testing cost.
> ---
>
> Jason also found java server load benefited on his 8 sockets machine
> with last patch. https://lkml.org/lkml/2013/5/29/673
> ---
> When using a 3.10-rc2 tip kernel with patches 1-8, there was about a 40%
> improvement in performance of the workload compared to when using the
> vanilla 3.10-rc2 tip kernel with no patches. When using a 3.10-rc2 tip
> kernel with just patches 1-7, the performance improvement of the
> workload over the vanilla 3.10-rc2 tip kernel was about 25%.
> ---
>
> We also tried to include blocked load avg in balance. but find many
> benchmark performance drop a lot! Seems accumulating current
> blocked_load_avg into cpu load isn't a good idea. Because:
> 1, The blocked_load_avg is decayed same as runnable load, sometime is far
> bigger than runnable load, that drive tasks to other idle or slight
> load cpu, than cause both performance and power issue. But if the
> blocked load is decayed too fast, it lose its effect.
> 2, Another issue of blocked load is that when waking up task, we can not
> know blocked load proportion of the task on rq. So, the blocked load is
> meaningless in wake affine decision.
>
> According to above problem, we can not figure out a right way now to use
> blocked_load_avg in balance.
>
> Since using runnable load avg in balance brings much benefit on
> performance and power. and this patch was reviewed for long time.
> So seems it's time to let it be clobbered in some sub tree, like
> tip or linux-next. Any comments?
>
> Regards
> Alex
>


--
Thanks
Alex

2013-06-10 01:36:05

by Alex Shi

[permalink] [raw]
Subject: Re: [patch 0/9] sched: use runnable load avg in balance

Ingo & Peter:

How are you?
I know you are very, very busy maintaining wide, key parts of the kernel,
so I hope this email doesn't disturb you. :)

PJT's runnable load tracking was introduced into the kernel in 3.8, but
besides lengthening the code path, it does not help anything yet.

This patchset tries to bring the runnable load into scheduler balancing.
Most of the patches are bug fixes; the 6th~8th directly introduce the
runnable load into balancing.
The patchset has been discussed on LKML since last November and has
adopted many suggestions from PJT, MikeG, Morten, PeterZ, etc. It was also
tested on x86 and arm platforms, and both got positive feedback.

With this patchset, the CFS scheduler has better and more stable
performance on the hackbench, sysbench, pgbench, etc. benchmarks. You will
like the performance numbers mentioned in the earlier emails.

I would appreciate any further review and comments!

Wish you all the best!
Alex


On 06/07/2013 03:20 PM, Alex Shi wrote:
> Peter&Ingo:
>
> v8: add a marco div64_ul and used it in task_h_load()
> v7: rebasing on tip/sched/core tree.
>
> I tested on Intel core2, NHM, SNB, IVB, 2 and 4 sockets machines with
> benchmark kbuild, aim7, dbench, tbench, hackbench, oltp, and netperf
> loopback etc.
>
> On SNB EP 4 sockets machine, the hackbench increased about 50%, and
> result become stable. on other machines, hackbench increased about
> 2~10%. oltp increased about 30% in NHM EX box. netperf loopback also
> increased on SNB EP 4 sockets box.
> No clear changes on other benchmarks.
>
> Michael Wang gotten better performance on pgbench on his box with this
> patchset. https://lkml.org/lkml/2013/5/16/82
>
> And Morten tested previous version with better power consumption.
> http://comments.gmane.org/gmane.linux.kernel/1463371
>
> Changlong found ltp cgroup stress testing get faster on SNB EP
> machine with the last patch. https://lkml.org/lkml/2013/5/23/65
> ---
> 3.10-rc1 patch1-7 patch1-8
> duration=764 duration=754 duration=750
> duration=764 duration=754 duration=751
> duration=763 duration=755 duration=751
>
> duration means the seconds of testing cost.
> ---
>
> Jason also found java server load benefited on his 8 sockets machine
> with last patch. https://lkml.org/lkml/2013/5/29/673
> ---
> When using a 3.10-rc2 tip kernel with patches 1-8, there was about a 40%
> improvement in performance of the workload compared to when using the
> vanilla 3.10-rc2 tip kernel with no patches. When using a 3.10-rc2 tip
> kernel with just patches 1-7, the performance improvement of the
> workload over the vanilla 3.10-rc2 tip kernel was about 25%.
> ---
>
> We also tried to include blocked load avg in balance. but find many
> benchmark performance drop a lot! Seems accumulating current
> blocked_load_avg into cpu load isn't a good idea. Because:
> 1, The blocked_load_avg is decayed same as runnable load, sometime is far
> bigger than runnable load, that drive tasks to other idle or slight
> load cpu, than cause both performance and power issue. But if the
> blocked load is decayed too fast, it lose its effect.
> 2, Another issue of blocked load is that when waking up task, we can not
> know blocked load proportion of the task on rq. So, the blocked load is
> meaningless in wake affine decision.
>
> According to above problem, we can not figure out a right way now to use
> blocked_load_avg in balance.
>
> Since using runnable load avg in balance brings much benefit on
> performance and power. and this patch was reviewed for long time.
> So seems it's time to let it be clobbered in some sub tree, like
> tip or linux-next. Any comments?
>
> Regards
> Alex
>


--
Thanks
Alex

2013-06-10 01:52:22

by Gu Zheng

[permalink] [raw]
Subject: Re: [patch v8 6/9] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On 06/07/2013 03:20 PM, Alex Shi wrote:

> They are the base values in load balance, update them with rq runnable
> load average, then the load balance will consider runnable load avg
> naturally.
>
> We also try to include the blocked_load_avg as cpu load in balancing,
> but that cause kbuild performance drop 6% on every Intel machine, and
> aim7/oltp drop on some of 4 CPU sockets machines.

Hi Alex,
Could you explain to me why including the blocked_load_avg causes a performance drop?

Thanks,
Gu

>
> Signed-off-by: Alex Shi <[email protected]>


Reviewed-by: Gu Zheng <[email protected]>

> ---
> kernel/sched/fair.c | 5 +++--
> kernel/sched/proc.c | 17 +++++++++++++++--
> 2 files changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 42c7be0..eadd2e7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2962,7 +2962,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> /* Used instead of source_load when we know the type == 0 */
> static unsigned long weighted_cpuload(const int cpu)
> {
> - return cpu_rq(cpu)->load.weight;
> + return cpu_rq(cpu)->cfs.runnable_load_avg;
> }
>
> /*
> @@ -3007,9 +3007,10 @@ static unsigned long cpu_avg_load_per_task(int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
> unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
> + unsigned long load_avg = rq->cfs.runnable_load_avg;
>
> if (nr_running)
> - return rq->load.weight / nr_running;
> + return load_avg / nr_running;
>
> return 0;
> }
> diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
> index bb3a6a0..ce5cd48 100644
> --- a/kernel/sched/proc.c
> +++ b/kernel/sched/proc.c
> @@ -501,6 +501,18 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
> sched_avg_update(this_rq);
> }
>
> +#ifdef CONFIG_SMP
> +unsigned long get_rq_runnable_load(struct rq *rq)
> +{
> + return rq->cfs.runnable_load_avg;
> +}
> +#else
> +unsigned long get_rq_runnable_load(struct rq *rq)
> +{
> + return rq->load.weight;
> +}
> +#endif
> +
> #ifdef CONFIG_NO_HZ_COMMON
> /*
> * There is no sane way to deal with nohz on smp when using jiffies because the
> @@ -522,7 +534,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
> void update_idle_cpu_load(struct rq *this_rq)
> {
> unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
> - unsigned long load = this_rq->load.weight;
> + unsigned long load = get_rq_runnable_load(this_rq);
> unsigned long pending_updates;
>
> /*
> @@ -568,11 +580,12 @@ void update_cpu_load_nohz(void)
> */
> void update_cpu_load_active(struct rq *this_rq)
> {
> + unsigned long load = get_rq_runnable_load(this_rq);
> /*
> * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
> */
> this_rq->last_load_update_tick = jiffies;
> - __update_cpu_load(this_rq, this_rq->load.weight, 1);
> + __update_cpu_load(this_rq, load, 1);
>
> calc_load_account_active(this_rq);
> }

2013-06-10 01:53:56

by Gu Zheng

[permalink] [raw]
Subject: Re: [patch v8 3/9] sched: set initial value of runnable avg for new forked task

On 06/07/2013 03:20 PM, Alex Shi wrote:

> We need initialize the se.avg.{decay_count, load_avg_contrib} for a
> new forked task.
> Otherwise random values of above variables cause mess when do new task
> enqueue:
> enqueue_task_fair
> enqueue_entity
> enqueue_entity_load_avg
>
> and make forking balancing imbalance since incorrect load_avg_contrib.
>
> Further more, Morten Rasmussen notice some tasks were not launched at
> once after created. So Paul and Peter suggest giving a start value for
> new task runnable avg time same as sched_slice().
>
> Signed-off-by: Alex Shi <[email protected]>


Reviewed-by: Gu Zheng <[email protected]>

> ---
> kernel/sched/core.c | 6 ++----
> kernel/sched/fair.c | 23 +++++++++++++++++++++++
> kernel/sched/sched.h | 2 ++
> 3 files changed, 27 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b9e7036..6f226c2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1598,10 +1598,6 @@ static void __sched_fork(struct task_struct *p)
> p->se.vruntime = 0;
> INIT_LIST_HEAD(&p->se.group_node);
>
> -#ifdef CONFIG_SMP
> - p->se.avg.runnable_avg_period = 0;
> - p->se.avg.runnable_avg_sum = 0;
> -#endif
> #ifdef CONFIG_SCHEDSTATS
> memset(&p->se.statistics, 0, sizeof(p->se.statistics));
> #endif
> @@ -1745,6 +1741,8 @@ void wake_up_new_task(struct task_struct *p)
> set_task_cpu(p, select_task_rq(p, SD_BALANCE_FORK, 0));
> #endif
>
> + /* Give new task start runnable values */
> + set_task_runnable_avg(p);
> rq = __task_rq_lock(p);
> activate_task(rq, p, 0);
> p->on_rq = 1;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f404468..1fc30b9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -680,6 +680,26 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> return calc_delta_fair(sched_slice(cfs_rq, se), se);
> }
>
> +#ifdef CONFIG_SMP
> +static inline void __update_task_entity_contrib(struct sched_entity *se);
> +
> +/* Give new task start runnable values to heavy its load in infant time */
> +void set_task_runnable_avg(struct task_struct *p)
> +{
> + u32 slice;
> +
> + p->se.avg.decay_count = 0;
> + slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
> + p->se.avg.runnable_avg_sum = slice;
> + p->se.avg.runnable_avg_period = slice;
> + __update_task_entity_contrib(&p->se);
> +}
> +#else
> +void set_task_runnable_avg(struct task_struct *p)
> +{
> +}
> +#endif
> +
> /*
> * Update the current task's runtime statistics. Skip current tasks that
> * are not in our scheduling class.
> @@ -1527,6 +1547,9 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
> * We track migrations using entity decay_count <= 0, on a wake-up
> * migration we use a negative decay count to track the remote decays
> * accumulated while sleeping.
> + *
> + * When enqueue a new forked task, the se->avg.decay_count == 0, so
> + * we bypass update_entity_load_avg(), use avg.load_avg_contrib direct.
> */
> if (unlikely(se->avg.decay_count <= 0)) {
> se->avg.last_runnable_update = rq_clock_task(rq_of(cfs_rq));
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 24b1503..8bc66c6 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1058,6 +1058,8 @@ extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime
>
> extern void update_idle_cpu_load(struct rq *this_rq);
>
> +extern void set_task_runnable_avg(struct task_struct *p);
> +
> #ifdef CONFIG_PARAVIRT
> static inline u64 steal_ticks(u64 steal)
> {

2013-06-10 02:01:34

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 6/9] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On 06/10/2013 09:49 AM, Gu Zheng wrote:
> On 06/07/2013 03:20 PM, Alex Shi wrote:
>
>> > They are the base values in load balance, update them with rq runnable
>> > load average, then the load balance will consider runnable load avg
>> > naturally.
>> >
>> > We also try to include the blocked_load_avg as cpu load in balancing,
>> > but that cause kbuild performance drop 6% on every Intel machine, and
>> > aim7/oltp drop on some of 4 CPU sockets machines.
> Hi Alex,
> Could you explain me why including the blocked_load_avg causes performance drop ?


Thanks for the review!

The 9th patch has a brief explanation. For example, after the only task on
a CPU goes to sleep, only the blocked_load_avg is left, and for a short
while it looks quite big. That blocks the CPU from getting tasks before the
blocked load decays, and drives tasks to other cpus in the periodic
balance. So it causes a clear load imbalance.
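
Roughly like this toy comparison (a user-space sketch with invented
numbers, not kernel code): cpu0's only task just went to sleep, cpu1 has
one runnable task.

#include <stdio.h>

struct toy_cpu {
	unsigned long runnable_load_avg;
	unsigned long blocked_load_avg;
};

int main(void)
{
	struct toy_cpu cpu0 = { .runnable_load_avg = 0,    .blocked_load_avg = 980 };
	struct toy_cpu cpu1 = { .runnable_load_avg = 1024, .blocked_load_avg = 0 };

	/* cpu load as the balancer would see it under the two policies */
	printf("runnable + blocked: cpu0=%lu cpu1=%lu\n",
	       cpu0.runnable_load_avg + cpu0.blocked_load_avg,
	       cpu1.runnable_load_avg + cpu1.blocked_load_avg);
	printf("runnable only:      cpu0=%lu cpu1=%lu\n",
	       cpu0.runnable_load_avg, cpu1.runnable_load_avg);
	return 0;
}

With the blocked load included, the two cpus look almost equally loaded,
so the balancer won't move work to the actually idle cpu0; with the
runnable load only, cpu0 is clearly the place to put the next task.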

--
Thanks
Alex

2013-06-10 02:07:45

by Gu Zheng

[permalink] [raw]
Subject: Re: [patch v8 6/9] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On 06/10/2013 10:01 AM, Alex Shi wrote:

> On 06/10/2013 09:49 AM, Gu Zheng wrote:
>> On 06/07/2013 03:20 PM, Alex Shi wrote:
>>
>>>> They are the base values in load balance, update them with rq runnable
>>>> load average, then the load balance will consider runnable load avg
>>>> naturally.
>>>>
>>>> We also try to include the blocked_load_avg as cpu load in balancing,
>>>> but that cause kbuild performance drop 6% on every Intel machine, and
>>>> aim7/oltp drop on some of 4 CPU sockets machines.
>> Hi Alex,
>> Could you explain me why including the blocked_load_avg causes performance drop ?
>
>
> Thanks for review!
>
> the 9th patch has few explanation. like, after the only task got into
> sleep in a CPU, there is only blocked_load_avg left, it looks quite big
> in short time. that, block it get tasks before sleep, drive task to
> other cpu in periodic balance. So, it cause clear load imbalance.
>

Got it. Thanks very much for your explanation. :)

Best regards,
Gu

2013-06-10 15:02:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch 0/9] sched: use runnable load avg in balance

On Fri, Jun 07, 2013 at 03:20:43PM +0800, Alex Shi wrote:
> Peter&Ingo:

I'm tempted to apply 1-7; although I want to look them over in detail
once more; unfortunately (for you) I'm on PTO this week so it'll have to
be early next week.

As to patch 8, I really need/want buy-in from Paul on that; it removes
the entire point of the blocked load stuff.

2013-06-11 03:30:18

by Alex Shi

[permalink] [raw]
Subject: Re: [patch 0/9] sched: use runnable load avg in balance

On 06/10/2013 11:01 PM, Peter Zijlstra wrote:
> On Fri, Jun 07, 2013 at 03:20:43PM +0800, Alex Shi wrote:
>> Peter&Ingo:
>
> I'm tempted to apply 1-7; although I want to look them over in detail
> once more; unfortunately (for you) I'm on PTO this week so it'll have to
> be early next week.

I am very happy to know you like them! :)

BTW, I am sorry about the 7th patch: a previous version of 'add div64_ul'
was added into the mm tree a few hours ago.

>
> As to patch 8, I really need/want buy-in from Paul on that; it removes
> the entire point of the blocked load stuff.

'remove blocked_load_avg' is the 9th patch now. Yes, I am also waiting
for Paul's comments.


Wish you all the best!
--
Thanks
Alex

2013-06-14 10:02:48

by Lei Wen

[permalink] [raw]
Subject: Re: [patch v8 3/9] sched: set initial value of runnable avg for new forked task

Hi Alex,

On Fri, Jun 7, 2013 at 3:20 PM, Alex Shi <[email protected]> wrote:
> We need initialize the se.avg.{decay_count, load_avg_contrib} for a
> new forked task.
> Otherwise random values of above variables cause mess when do new task
> enqueue:
> enqueue_task_fair
> enqueue_entity
> enqueue_entity_load_avg
>
> and make forking balancing imbalance since incorrect load_avg_contrib.
>
> Further more, Morten Rasmussen notice some tasks were not launched at
> once after created. So Paul and Peter suggest giving a start value for
> new task runnable avg time same as sched_slice().

I am confused by this comment: how would setting the runnable avg to a
slice change the behavior of "some tasks were not launched at once after
created"?

IMHO, I can only tell that a newly forked task gets to run if the current
task has already been marked need_resched and preempt_schedule or
preempt_schedule_irq is called.

Since setting the avg to a slice does not affect this task's vruntime, it
cannot make the currently running task need_resched if it previously
could not.

Could you help correct me if I am wrong somewhere?

Thanks,
Lei


>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/core.c | 6 ++----
> kernel/sched/fair.c | 23 +++++++++++++++++++++++
> kernel/sched/sched.h | 2 ++
> 3 files changed, 27 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b9e7036..6f226c2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1598,10 +1598,6 @@ static void __sched_fork(struct task_struct *p)
> p->se.vruntime = 0;
> INIT_LIST_HEAD(&p->se.group_node);
>
> -#ifdef CONFIG_SMP
> - p->se.avg.runnable_avg_period = 0;
> - p->se.avg.runnable_avg_sum = 0;
> -#endif
> #ifdef CONFIG_SCHEDSTATS
> memset(&p->se.statistics, 0, sizeof(p->se.statistics));
> #endif
> @@ -1745,6 +1741,8 @@ void wake_up_new_task(struct task_struct *p)
> set_task_cpu(p, select_task_rq(p, SD_BALANCE_FORK, 0));
> #endif
>
> + /* Give new task start runnable values */
> + set_task_runnable_avg(p);
> rq = __task_rq_lock(p);
> activate_task(rq, p, 0);
> p->on_rq = 1;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f404468..1fc30b9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -680,6 +680,26 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> return calc_delta_fair(sched_slice(cfs_rq, se), se);
> }
>
> +#ifdef CONFIG_SMP
> +static inline void __update_task_entity_contrib(struct sched_entity *se);
> +
> +/* Give new task start runnable values to heavy its load in infant time */
> +void set_task_runnable_avg(struct task_struct *p)
> +{
> + u32 slice;
> +
> + p->se.avg.decay_count = 0;
> + slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
> + p->se.avg.runnable_avg_sum = slice;
> + p->se.avg.runnable_avg_period = slice;
> + __update_task_entity_contrib(&p->se);
> +}
> +#else
> +void set_task_runnable_avg(struct task_struct *p)
> +{
> +}
> +#endif
> +
> /*
> * Update the current task's runtime statistics. Skip current tasks that
> * are not in our scheduling class.
> @@ -1527,6 +1547,9 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
> * We track migrations using entity decay_count <= 0, on a wake-up
> * migration we use a negative decay count to track the remote decays
> * accumulated while sleeping.
> + *
> + * When enqueue a new forked task, the se->avg.decay_count == 0, so
> + * we bypass update_entity_load_avg(), use avg.load_avg_contrib direct.
> */
> if (unlikely(se->avg.decay_count <= 0)) {
> se->avg.last_runnable_update = rq_clock_task(rq_of(cfs_rq));
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 24b1503..8bc66c6 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1058,6 +1058,8 @@ extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime
>
> extern void update_idle_cpu_load(struct rq *this_rq);
>
> +extern void set_task_runnable_avg(struct task_struct *p);
> +
> #ifdef CONFIG_PARAVIRT
> static inline u64 steal_ticks(u64 steal)
> {
> --
> 1.7.12
>

2013-06-14 11:10:21

by Paul Turner

[permalink] [raw]
Subject: Re: [patch v8 3/9] sched: set initial value of runnable avg for new forked task

Minor comments; looks good otherwise.

Signed-off-by: Paul Turner <[email protected]>

On Fri, Jun 7, 2013 at 12:20 AM, Alex Shi <[email protected]> wrote:
> We need initialize the se.avg.{decay_count, load_avg_contrib} for a
> new forked task.
> Otherwise random values of above variables cause mess when do new task
> enqueue:
> enqueue_task_fair
> enqueue_entity
> enqueue_entity_load_avg
>
> and make forking balancing imbalance since incorrect load_avg_contrib.
>
> Further more, Morten Rasmussen notice some tasks were not launched at
> once after created. So Paul and Peter suggest giving a start value for
> new task runnable avg time same as sched_slice().
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/core.c | 6 ++----
> kernel/sched/fair.c | 23 +++++++++++++++++++++++
> kernel/sched/sched.h | 2 ++
> 3 files changed, 27 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b9e7036..6f226c2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1598,10 +1598,6 @@ static void __sched_fork(struct task_struct *p)
> p->se.vruntime = 0;
> INIT_LIST_HEAD(&p->se.group_node);
>
> -#ifdef CONFIG_SMP
> - p->se.avg.runnable_avg_period = 0;
> - p->se.avg.runnable_avg_sum = 0;
> -#endif
> #ifdef CONFIG_SCHEDSTATS
> memset(&p->se.statistics, 0, sizeof(p->se.statistics));
> #endif
> @@ -1745,6 +1741,8 @@ void wake_up_new_task(struct task_struct *p)
> set_task_cpu(p, select_task_rq(p, SD_BALANCE_FORK, 0));
> #endif
>
> + /* Give new task start runnable values */

/* Initialize new task's runnable average */

> + set_task_runnable_avg(p);
> rq = __task_rq_lock(p);
> activate_task(rq, p, 0);
> p->on_rq = 1;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f404468..1fc30b9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -680,6 +680,26 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> return calc_delta_fair(sched_slice(cfs_rq, se), se);
> }
>
> +#ifdef CONFIG_SMP
> +static inline void __update_task_entity_contrib(struct sched_entity *se);
> +
> +/* Give new task start runnable values to heavy its load in infant time */
> +void set_task_runnable_avg(struct task_struct *p)

init_task_runnable_average

> +{
> + u32 slice;
> +
> + p->se.avg.decay_count = 0;
> + slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
> + p->se.avg.runnable_avg_sum = slice;
> + p->se.avg.runnable_avg_period = slice;
> + __update_task_entity_contrib(&p->se);
> +}
> +#else
> +void set_task_runnable_avg(struct task_struct *p)
> +{
> +}
> +#endif
> +
> /*
> * Update the current task's runtime statistics. Skip current tasks that
> * are not in our scheduling class.
> @@ -1527,6 +1547,9 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
> * We track migrations using entity decay_count <= 0, on a wake-up
> * migration we use a negative decay count to track the remote decays
> * accumulated while sleeping.
> + *
> + * When enqueue a new forked task, the se->avg.decay_count == 0, so
> + * we bypass update_entity_load_avg(), use avg.load_avg_contrib direct.

Newly forked tasks are enqueued with se->avg.decay_count == 0, they
are seen by enqueue_entity_load_avg() as a migration with an already
constructed load_avg_contrib.

> */
> if (unlikely(se->avg.decay_count <= 0)) {
> se->avg.last_runnable_update = rq_clock_task(rq_of(cfs_rq));
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 24b1503..8bc66c6 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1058,6 +1058,8 @@ extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime
>
> extern void update_idle_cpu_load(struct rq *this_rq);
>
> +extern void set_task_runnable_avg(struct task_struct *p);
> +
> #ifdef CONFIG_PARAVIRT
> static inline u64 steal_ticks(u64 steal)
> {
> --
> 1.7.12
>

2013-06-14 14:00:03

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 3/9] sched: set initial value of runnable avg for new forked task

On 06/14/2013 06:02 PM, Lei Wen wrote:
>> > enqueue_entity
>> > enqueue_entity_load_avg
>> >
>> > and make forking balancing imbalance since incorrect load_avg_contrib.
>> >
>> > Further more, Morten Rasmussen notice some tasks were not launched at
>> > once after created. So Paul and Peter suggest giving a start value for
>> > new task runnable avg time same as sched_slice().
> I am confused at this comment, how set slice to runnable avg would change
> the behavior of "some tasks were not launched at once after created"?

I also don't know the details of Morten's machine, but my guess is that
there are many tasks in the run queue, and the minimal load avg makes the
new task wait for its turn...
>
> IMHO, I could only tell that for the new forked task, it could be run if current
> task already be set as need_resched, and preempt_schedule or
> preempt_schedule_irq
> is called.
>
> Since the set slice to avg behavior would not affect this task's vruntime,
> and hence cannot make current running task be need_sched, if
> previously it cannot.
>
> Could you help correct if I am wrong at somewhere? ....
>
> Thanks,


--
Thanks
Alex

2013-06-14 14:16:38

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 3/9] sched: set initial value of runnable avg for new forked task

On 06/14/2013 07:09 PM, Paul Turner wrote:
> Minor comments; looks good otherwise.
>
> Signed-off-by: Paul Turner <[email protected]>

Thanks a lot, Paul. The patch, updated with your input, is here:

BTW, would you like to give some comments on the last patch of this patchset?

---
From ed35080d0bae803d68f84a3e683d34a356a5a5de Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Tue, 14 May 2013 09:41:09 +0800
Subject: [PATCH 3/8] sched: set initial value of runnable avg for new forked
task

We need to initialize se.avg.{decay_count, load_avg_contrib} for a
newly forked task.
Otherwise, random values in those variables cause a mess when the new
task is enqueued:
enqueue_task_fair
enqueue_entity
enqueue_entity_load_avg

and make fork balancing imbalanced due to the incorrect load_avg_contrib.

Furthermore, Morten Rasmussen noticed that some tasks were not launched at
once after being created. So Paul and Peter suggested giving the new task's
runnable avg time a start value equal to sched_slice().

Signed-off-by: Alex Shi <[email protected]>
Signed-off-by: Paul Turner <[email protected]>
Reviewed-by: Gu Zheng <[email protected]>
---
kernel/sched/core.c | 6 ++----
kernel/sched/fair.c | 24 ++++++++++++++++++++++++
kernel/sched/sched.h | 2 ++
3 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b9e7036..c78a9e2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1598,10 +1598,6 @@ static void __sched_fork(struct task_struct *p)
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);

-#ifdef CONFIG_SMP
- p->se.avg.runnable_avg_period = 0;
- p->se.avg.runnable_avg_sum = 0;
-#endif
#ifdef CONFIG_SCHEDSTATS
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
#endif
@@ -1745,6 +1741,8 @@ void wake_up_new_task(struct task_struct *p)
set_task_cpu(p, select_task_rq(p, SD_BALANCE_FORK, 0));
#endif

+ /* Initialize new task's runnable average */
+ init_task_runnable_average(p);
rq = __task_rq_lock(p);
activate_task(rq, p, 0);
p->on_rq = 1;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f404468..df5b8a9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -680,6 +680,26 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
return calc_delta_fair(sched_slice(cfs_rq, se), se);
}

+#ifdef CONFIG_SMP
+static inline void __update_task_entity_contrib(struct sched_entity *se);
+
+/* Give new task start runnable values to heavy its load in infant time */
+void init_task_runnable_average(struct task_struct *p)
+{
+ u32 slice;
+
+ p->se.avg.decay_count = 0;
+ slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
+ p->se.avg.runnable_avg_sum = slice;
+ p->se.avg.runnable_avg_period = slice;
+ __update_task_entity_contrib(&p->se);
+}
+#else
+void init_task_runnable_average(struct task_struct *p)
+{
+}
+#endif
+
/*
* Update the current task's runtime statistics. Skip current tasks that
* are not in our scheduling class.
@@ -1527,6 +1547,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
* We track migrations using entity decay_count <= 0, on a wake-up
* migration we use a negative decay count to track the remote decays
* accumulated while sleeping.
+ *
+ * Newly forked tasks are enqueued with se->avg.decay_count == 0, they
+ * are seen by enqueue_entity_load_avg() as a migration with an already
+ * constructed load_avg_contrib.
*/
if (unlikely(se->avg.decay_count <= 0)) {
se->avg.last_runnable_update = rq_clock_task(rq_of(cfs_rq));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 24b1503..0684c26 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1058,6 +1058,8 @@ extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime

extern void update_idle_cpu_load(struct rq *this_rq);

+extern void init_task_runnable_average(struct task_struct *p);
+
#ifdef CONFIG_PARAVIRT
static inline u64 steal_ticks(u64 steal)
{
--
1.7.5.4

2013-06-15 12:09:15

by Lei Wen

[permalink] [raw]
Subject: Re: [patch v8 3/9] sched: set initial value of runnable avg for new forked task

On Fri, Jun 14, 2013 at 9:59 PM, Alex Shi <[email protected]> wrote:
> On 06/14/2013 06:02 PM, Lei Wen wrote:
>>> > enqueue_entity
>>> > enqueue_entity_load_avg
>>> >
>>> > and make forking balancing imbalance since incorrect load_avg_contrib.
>>> >
>>> > Further more, Morten Rasmussen notice some tasks were not launched at
>>> > once after created. So Paul and Peter suggest giving a start value for
>>> > new task runnable avg time same as sched_slice().
>> I am confused at this comment, how set slice to runnable avg would change
>> the behavior of "some tasks were not launched at once after created"?
>
> I also don't know the details on Morten's machine. but just guess, there
> are much tasks on in the run queue. the minimum load avg make the new
> task wait its time...

Is there a possibility that, since the task structure is allocated without
being zeroed, it causes an imbalance between the runqueues, so the newly
forked task is migrated to other cpus and its execution is delayed?

It would be better for Morten to give us more details here. :)

Thanks,
Lei

>>
>> IMHO, I could only tell that for the new forked task, it could be run if current
>> task already be set as need_resched, and preempt_schedule or
>> preempt_schedule_irq
>> is called.
>>
>> Since the set slice to avg behavior would not affect this task's vruntime,
>> and hence cannot make current running task be need_sched, if
>> previously it cannot.
>>
>> Could you help correct if I am wrong at somewhere? ....
>>
>> Thanks,
>
>
> --
> Thanks
> Alex

2013-06-17 00:34:45

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 3/9] sched: set initial value of runnable avg for new forked task

On 06/15/2013 08:09 PM, Lei Wen wrote:
>>>>> >>> > and make forking balancing imbalance since incorrect load_avg_contrib.
>>>>> >>> >
>>>>> >>> > Further more, Morten Rasmussen notice some tasks were not launched at
>>>>> >>> > once after created. So Paul and Peter suggest giving a start value for
>>>>> >>> > new task runnable avg time same as sched_slice().
>>> >> I am confused at this comment, how set slice to runnable avg would change
>>> >> the behavior of "some tasks were not launched at once after created"?
>> >
>> > I also don't know the details on Morten's machine. but just guess, there
>> > are much tasks on in the run queue. the minimum load avg make the new
>> > task wait its time...
> Is there some possibility that since task structure is allocated without being
> set to 0, and it cause the imbalance between runqueues. Then the new forked
> is migrated to other cpus, so that it cause its execution being delayed?

Does something weird actually happen?
The task should run for a while before being migrated, and the periodic
load balance also needs some time to happen. So generally there is no such
worry.

--
Thanks
Alex

2013-06-17 09:21:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch v8 3/9] sched: set initial value of runnable avg for new forked task

On Fri, Jun 14, 2013 at 06:02:45PM +0800, Lei Wen wrote:
> Hi Alex,
>
> On Fri, Jun 7, 2013 at 3:20 PM, Alex Shi <[email protected]> wrote:
> > We need initialize the se.avg.{decay_count, load_avg_contrib} for a
> > new forked task.
> > Otherwise random values of above variables cause mess when do new task
> > enqueue:
> > enqueue_task_fair
> > enqueue_entity
> > enqueue_entity_load_avg
> >
> > and make forking balancing imbalance since incorrect load_avg_contrib.
> >
> > Further more, Morten Rasmussen notice some tasks were not launched at
> > once after created. So Paul and Peter suggest giving a start value for
> > new task runnable avg time same as sched_slice().
>
> I am confused at this comment, how set slice to runnable avg would change
> the behavior of "some tasks were not launched at once after created"?
>
> IMHO, I could only tell that for the new forked task, it could be run if current
> task already be set as need_resched, and preempt_schedule or
> preempt_schedule_irq
> is called.
>
> Since the set slice to avg behavior would not affect this task's vruntime,
> and hence cannot make current running task be need_sched, if
> previously it cannot.


So the 'problem' is that our running avg is a 'floating' average; ie. it
decays with time. Now we have to guess about the future of our newly
spawned task -- something that is nigh impossible seeing these CPU
vendors keep refusing to implement the crystal ball instruction.

So there are two asymptotic cases we want to deal well with; 1) the case
where the newly spawned program will be 'nearly' idle for its lifetime;
and 2) the case where it's cpu-bound.

Since we have to guess, we'll go for the worst case and assume it's
cpu-bound; now we don't want to make the avg so heavy that adjusting to
the near-idle case takes forever. We want to be able to quickly adjust and
lower our running avg.

Now we also don't want to make our avg too light, such that it gets
decremented just for the new task not having had a chance to run yet --
even if when it would run, it would be more cpu-bound than not.

So what we do is make the initial avg the same duration as the time we
guess it takes to run each task on the system at least once -- aka
sched_slice().

Of course we can defeat this with wakeup/fork bombs, but in the 'normal'
case it should be good enough.
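
(A rough back-of-the-envelope sketch of that trade-off, in user-space C
rather than kernel code; the seeded history lengths and the weight of 1024
are made up, standing in for sched_slice() and the task weight:)

#include <math.h>
#include <stdio.h>

/*
 * Seed a new task with `hist' ms of fully-busy history, then let it idle,
 * decaying sum/period every 1ms with the usual y^32 = 0.5 factor.
 */
static double contrib_after_idle(double hist, int idle_ms)
{
	double y = pow(0.5, 1.0 / 32.0);
	double sum = hist, period = hist;
	int i;

	for (i = 0; i < idle_ms; i++) {
		sum = sum * y;			/* idle: nothing added to sum */
		period = period * y + 1.0;	/* but history keeps accruing */
	}
	return 1024.0 * sum / (period + 1.0);
}

int main(void)
{
	double hist[] = { 6.0, 24.0, 96.0 };	/* light vs. heavy initial avg */
	int h, idle;

	for (h = 0; h < 3; h++)
		for (idle = 0; idle <= 64; idle += 32)
			printf("hist=%3.0fms idle=%2dms -> contrib ~ %4.0f\n",
			       hist[h], idle, contrib_after_idle(hist[h], idle));
	return 0;
}

A heavier seed keeps a task that turns out to be idle looking loaded for
longer; a lighter one collapses before the task has had a chance to run.
Something sched_slice()-sized sits in between.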


Does that make sense?

2013-06-17 09:22:13

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch v8 3/9] sched: set initial value of runnable avg for new forked task

On Fri, Jun 14, 2013 at 10:16:28PM +0800, Alex Shi wrote:
> On 06/14/2013 07:09 PM, Paul Turner wrote:
> > Minor comments; looks good otherwise.
> >
> > Signed-off-by: Paul Turner <[email protected]>
>
> thanks a lot Paul. the patch with your input updated here:
>
> BTW, would you like to give some comments on the last patch of this patchset?
>
> ---
> From ed35080d0bae803d68f84a3e683d34a356a5a5de Mon Sep 17 00:00:00 2001
> From: Alex Shi <[email protected]>
> Date: Tue, 14 May 2013 09:41:09 +0800
> Subject: [PATCH 3/8] sched: set initial value of runnable avg for new forked
> task
>
> We need initialize the se.avg.{decay_count, load_avg_contrib} for a
> new forked task.
> Otherwise random values of above variables cause mess when do new task
> enqueue:
> enqueue_task_fair
> enqueue_entity
> enqueue_entity_load_avg
>
> and make forking balancing imbalance since incorrect load_avg_contrib.
>
> Further more, Morten Rasmussen notice some tasks were not launched at
> once after created. So Paul and Peter suggest giving a start value for
> new task runnable avg time same as sched_slice().
>
> Signed-off-by: Alex Shi <[email protected]>
> Signed-off-by: Paul Turner <[email protected]>

Should you all go read: Documentation/SubmittingPatches , or am I
somehow confused on the SoB rules?

2013-06-17 09:39:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch v8 9/9] sched/tg: remove blocked_load_avg in balance

On Fri, Jun 07, 2013 at 03:20:52PM +0800, Alex Shi wrote:
> blocked_load_avg sometime is too heavy and far bigger than runnable load
> avg, that make balance make wrong decision. So remove it.
>
> Changlong tested this patch, found ltp cgroup stress testing get better
> performance: https://lkml.org/lkml/2013/5/23/65
> ---
> 3.10-rc1 patch1-7 patch1-8
> duration=764 duration=754 duration=750
> duration=764 duration=754 duration=751
> duration=763 duration=755 duration=751
>
> duration means the seconds of testing cost.
> ---
>
> And Jason also tested this patchset on his 8 sockets machine:
> https://lkml.org/lkml/2013/5/29/673
> ---
> When using a 3.10-rc2 tip kernel with patches 1-8, there was about a 40%
> improvement in performance of the workload compared to when using the
> vanilla 3.10-rc2 tip kernel with no patches. When using a 3.10-rc2 tip
> kernel with just patches 1-7, the performance improvement of the
> workload over the vanilla 3.10-rc2 tip kernel was about 25%.
> ---
>
> Signed-off-by: Alex Shi <[email protected]>
> Tested-by: Changlong Xie <[email protected]>
> Tested-by: Jason Low <[email protected]>
> ---
> kernel/sched/fair.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3aa1dc0..985d47e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1358,7 +1358,7 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> struct task_group *tg = cfs_rq->tg;
> s64 tg_contrib;
>
> - tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
> + tg_contrib = cfs_rq->runnable_load_avg;
> tg_contrib -= cfs_rq->tg_load_contrib;
>
> if (force_update || abs64(tg_contrib) > cfs_rq->tg_load_contrib / 8) {

PJT ping! ;-)

2013-06-17 09:40:29

by Paul Turner

[permalink] [raw]
Subject: Re: [patch v8 3/9] sched: set initial value of runnable avg for new forked task

I actually did read it before, and still wasn't sure of the right tag to use.

"13) When to use Acked-by: and Cc:

The Signed-off-by: tag indicates that the signer was involved in the
development of the patch, or that he/she was in the patch's delivery path.

If a person was not directly involved in the preparation or handling of a
patch but wishes to signify and record their approval of it then they can
arrange to have an Acked-by: line added to the patch's changelog."
https://www.kernel.org/doc/Documentation/SubmittingPatches

Acked-By seemed to fail the direct involvement test.
The definition of "delivery path" is not clear; is this strictly by
inputs to Linus' tree or recipients of the original patch?

Is Reviewed-By always more appropriate here?



On Mon, Jun 17, 2013 at 2:21 AM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Jun 14, 2013 at 10:16:28PM +0800, Alex Shi wrote:
>> On 06/14/2013 07:09 PM, Paul Turner wrote:
>> > Minor comments; looks good otherwise.
>> >
>> > Signed-off-by: Paul Turner <[email protected]>
>>
>> thanks a lot Paul. the patch with your input updated here:
>>
>> BTW, would you like to give some comments on the last patch of this patchset?
>>
>> ---
>> From ed35080d0bae803d68f84a3e683d34a356a5a5de Mon Sep 17 00:00:00 2001
>> From: Alex Shi <[email protected]>
>> Date: Tue, 14 May 2013 09:41:09 +0800
>> Subject: [PATCH 3/8] sched: set initial value of runnable avg for new forked
>> task
>>
>> We need initialize the se.avg.{decay_count, load_avg_contrib} for a
>> new forked task.
>> Otherwise random values of above variables cause mess when do new task
>> enqueue:
>> enqueue_task_fair
>> enqueue_entity
>> enqueue_entity_load_avg
>>
>> and make forking balancing imbalance since incorrect load_avg_contrib.
>>
>> Further more, Morten Rasmussen notice some tasks were not launched at
>> once after created. So Paul and Peter suggest giving a start value for
>> new task runnable avg time same as sched_slice().
>>
>> Signed-off-by: Alex Shi <[email protected]>
>> Signed-off-by: Paul Turner <[email protected]>
>
> Should you all go read: Documentation/SubmittingPatches , or am I
> somehow confused on the SoB rules?

2013-06-17 09:58:44

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 3/9] sched: set initial value of runnable avg for new forked task

On 06/17/2013 05:21 PM, Peter Zijlstra wrote:
>> >
>> > and make forking balancing imbalance since incorrect load_avg_contrib.
>> >
>> > Further more, Morten Rasmussen notice some tasks were not launched at
>> > once after created. So Paul and Peter suggest giving a start value for
>> > new task runnable avg time same as sched_slice().
>> >
>> > Signed-off-by: Alex Shi <[email protected]>
>> > Signed-off-by: Paul Turner <[email protected]>
> Should you all go read: Documentation/SubmittingPatches , or am I
> somehow confused on the SoB rules?

Would this have been right if Paul had handed in the modified patch as
he suggested? :)

Sorry for the stupid question, but I still don't know what the SoB rule is.

--
Thanks
Alex

2013-06-17 10:51:39

by Paul Turner

[permalink] [raw]
Subject: Re: [patch v8 6/9] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On Fri, Jun 7, 2013 at 12:20 AM, Alex Shi <[email protected]> wrote:
> They are the base values in load balance, update them with rq runnable
> load average, then the load balance will consider runnable load avg
> naturally.
>
> We also try to include the blocked_load_avg as cpu load in balancing,
> but that cause kbuild performance drop 6% on every Intel machine, and
> aim7/oltp drop on some of 4 CPU sockets machines.
>

This looks fine.

Did you try including blocked_load_avg in only get_rq_runnable_load()
[ and not weighted_cpuload() which is called by new-idle ]?

> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/fair.c | 5 +++--
> kernel/sched/proc.c | 17 +++++++++++++++--
> 2 files changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 42c7be0..eadd2e7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2962,7 +2962,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> /* Used instead of source_load when we know the type == 0 */
> static unsigned long weighted_cpuload(const int cpu)
> {
> - return cpu_rq(cpu)->load.weight;
> + return cpu_rq(cpu)->cfs.runnable_load_avg;
> }
>
> /*
> @@ -3007,9 +3007,10 @@ static unsigned long cpu_avg_load_per_task(int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
> unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
> + unsigned long load_avg = rq->cfs.runnable_load_avg;
>
> if (nr_running)
> - return rq->load.weight / nr_running;
> + return load_avg / nr_running;
>
> return 0;
> }
> diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
> index bb3a6a0..ce5cd48 100644
> --- a/kernel/sched/proc.c
> +++ b/kernel/sched/proc.c
> @@ -501,6 +501,18 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
> sched_avg_update(this_rq);
> }
>
> +#ifdef CONFIG_SMP
> +unsigned long get_rq_runnable_load(struct rq *rq)
> +{
> + return rq->cfs.runnable_load_avg;
> +}
> +#else
> +unsigned long get_rq_runnable_load(struct rq *rq)
> +{
> + return rq->load.weight;
> +}
> +#endif
> +
> #ifdef CONFIG_NO_HZ_COMMON
> /*
> * There is no sane way to deal with nohz on smp when using jiffies because the
> @@ -522,7 +534,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
> void update_idle_cpu_load(struct rq *this_rq)
> {
> unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
> - unsigned long load = this_rq->load.weight;
> + unsigned long load = get_rq_runnable_load(this_rq);
> unsigned long pending_updates;
>
> /*
> @@ -568,11 +580,12 @@ void update_cpu_load_nohz(void)
> */
> void update_cpu_load_active(struct rq *this_rq)
> {
> + unsigned long load = get_rq_runnable_load(this_rq);
> /*
> * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
> */
> this_rq->last_load_update_tick = jiffies;
> - __update_cpu_load(this_rq, this_rq->load.weight, 1);
> + __update_cpu_load(this_rq, load, 1);
>
> calc_load_account_active(this_rq);
> }
> --
> 1.7.12
>

2013-06-17 10:59:10

by Paul Turner

[permalink] [raw]
Subject: Re: [patch v8 8/9] sched: consider runnable load average in move_tasks

On Fri, Jun 7, 2013 at 12:20 AM, Alex Shi <[email protected]> wrote:
> Except using runnable load average in background, move_tasks is also
> the key functions in load balance. We need consider the runnable load
> average in it in order to the apple to apple load comparison.
>
> Morten had caught a div u64 bug on ARM, thanks!
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/fair.c | 16 ++++++++--------
> 1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eadd2e7..3aa1dc0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4178,11 +4178,14 @@ static int tg_load_down(struct task_group *tg, void *data)
> long cpu = (long)data;
>
> if (!tg->parent) {
> - load = cpu_rq(cpu)->load.weight;
> + load = cpu_rq(cpu)->avg.load_avg_contrib;
> } else {
> + unsigned long tmp_rla;
> + tmp_rla = tg->parent->cfs_rq[cpu]->runnable_load_avg + 1;
> +
> load = tg->parent->cfs_rq[cpu]->h_load;
> - load *= tg->se[cpu]->load.weight;
> - load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
> + load *= tg->se[cpu]->avg.load_avg_contrib;
> + load /= tmp_rla;

Why do we need the temporary here?

> }
>
> tg->cfs_rq[cpu]->h_load = load;
> @@ -4208,12 +4211,9 @@ static void update_h_load(long cpu)
> static unsigned long task_h_load(struct task_struct *p)
> {
> struct cfs_rq *cfs_rq = task_cfs_rq(p);
> - unsigned long load;
> -
> - load = p->se.load.weight;
> - load = div_u64(load * cfs_rq->h_load, cfs_rq->load.weight + 1);
>
> - return load;
> + return div64_ul(p->se.avg.load_avg_contrib * cfs_rq->h_load,
> + cfs_rq->runnable_load_avg + 1);
> }
> #else
> static inline void update_blocked_averages(int cpu)
> --
> 1.7.12
>

2013-06-17 11:51:43

by Paul Turner

[permalink] [raw]
Subject: Re: [patch v8 4/9] sched: fix slept time double counting in enqueue entity

On Fri, Jun 7, 2013 at 12:20 AM, Alex Shi <[email protected]> wrote:
> The wakeuped migrated task will __synchronize_entity_decay(se); in
> migrate_task_fair, then it needs to set
> `se->avg.last_runnable_update -= (-se->avg.decay_count) << 20'
> before update_entity_load_avg, in order to avoid slept time is updated
> twice for se.avg.load_avg_contrib in both __syncchronize and
> update_entity_load_avg.
>
> but if the slept task is waked up from self cpu, it miss the
> last_runnable_update before update_entity_load_avg(se, 0, 1), then the
> slept time was used twice in both functions.
> So we need to remove the double slept time counting.

Good catch.

>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/fair.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1fc30b9..42c7be0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1570,7 +1570,8 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
> }
> wakeup = 0;
> } else {
> - __synchronize_entity_decay(se);
> + se->avg.last_runnable_update += __synchronize_entity_decay(se)
> + << 20;

Can you add something like:

+	/*
+	 * Task re-woke on same cpu (or else migrate_task_rq_fair()
+	 * would have made count negative); we must be careful to avoid
+	 * double-accounting blocked time after synchronizing decays.
+	 */


Reviewed-by: Paul Turner <[email protected]>

2013-06-17 11:54:35

by Paul Turner

[permalink] [raw]
Subject: Re: [patch v8 5/9] sched: update cpu load after task_tick.

On Fri, Jun 7, 2013 at 12:20 AM, Alex Shi <[email protected]> wrote:
> To get the latest runnable info, we need do this cpuload update after
> task_tick.
>
> Signed-off-by: Alex Shi <[email protected]>

Reviewed-by: Paul Turner <[email protected]>

> ---
> kernel/sched/core.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6f226c2..05176b8 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2152,8 +2152,8 @@ void scheduler_tick(void)
>
> raw_spin_lock(&rq->lock);
> update_rq_clock(rq);
> - update_cpu_load_active(rq);
> curr->sched_class->task_tick(rq, curr, 0);
> + update_cpu_load_active(rq);
> raw_spin_unlock(&rq->lock);
>
> perf_event_task_tick();
> --
> 1.7.12
>

2013-06-17 12:17:51

by Paul Turner

[permalink] [raw]
Subject: Re: [patch v8 6/9] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On Mon, Jun 17, 2013 at 3:51 AM, Paul Turner <[email protected]> wrote:
> On Fri, Jun 7, 2013 at 12:20 AM, Alex Shi <[email protected]> wrote:
>> They are the base values in load balance, update them with rq runnable
>> load average, then the load balance will consider runnable load avg
>> naturally.
>>
>> We also try to include the blocked_load_avg as cpu load in balancing,
>> but that cause kbuild performance drop 6% on every Intel machine, and
>> aim7/oltp drop on some of 4 CPU sockets machines.
>>
>
> This looks fine.
>
> Did you try including blocked_load_avg in only get_rq_runnable_load()
> [ and not weighted_cpuload() which is called by new-idle ]?

Looking at this more this feels less correct since you're taking
averages of averages.

This was previously discussed at:
https://lkml.org/lkml/2013/5/6/109

And you later replied suggesting this didn't seem to hurt; what's the
current status there?


>
>> Signed-off-by: Alex Shi <[email protected]>
>> ---
>> kernel/sched/fair.c | 5 +++--
>> kernel/sched/proc.c | 17 +++++++++++++++--
>> 2 files changed, 18 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 42c7be0..eadd2e7 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2962,7 +2962,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> /* Used instead of source_load when we know the type == 0 */
>> static unsigned long weighted_cpuload(const int cpu)
>> {
>> - return cpu_rq(cpu)->load.weight;
>> + return cpu_rq(cpu)->cfs.runnable_load_avg;
>> }
>>
>> /*
>> @@ -3007,9 +3007,10 @@ static unsigned long cpu_avg_load_per_task(int cpu)
>> {
>> struct rq *rq = cpu_rq(cpu);
>> unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
>> + unsigned long load_avg = rq->cfs.runnable_load_avg;
>>
>> if (nr_running)
>> - return rq->load.weight / nr_running;
>> + return load_avg / nr_running;
>>
>> return 0;
>> }
>> diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
>> index bb3a6a0..ce5cd48 100644
>> --- a/kernel/sched/proc.c
>> +++ b/kernel/sched/proc.c
>> @@ -501,6 +501,18 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
>> sched_avg_update(this_rq);
>> }
>>
>> +#ifdef CONFIG_SMP
>> +unsigned long get_rq_runnable_load(struct rq *rq)
>> +{
>> + return rq->cfs.runnable_load_avg;
>> +}
>> +#else
>> +unsigned long get_rq_runnable_load(struct rq *rq)
>> +{
>> + return rq->load.weight;
>> +}
>> +#endif
>> +
>> #ifdef CONFIG_NO_HZ_COMMON
>> /*
>> * There is no sane way to deal with nohz on smp when using jiffies because the
>> @@ -522,7 +534,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
>> void update_idle_cpu_load(struct rq *this_rq)
>> {
>> unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
>> - unsigned long load = this_rq->load.weight;
>> + unsigned long load = get_rq_runnable_load(this_rq);
>> unsigned long pending_updates;
>>
>> /*
>> @@ -568,11 +580,12 @@ void update_cpu_load_nohz(void)
>> */
>> void update_cpu_load_active(struct rq *this_rq)
>> {
>> + unsigned long load = get_rq_runnable_load(this_rq);
>> /*
>> * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
>> */
>> this_rq->last_load_update_tick = jiffies;
>> - __update_cpu_load(this_rq, this_rq->load.weight, 1);
>> + __update_cpu_load(this_rq, load, 1);
>>
>> calc_load_account_active(this_rq);
>> }
>> --
>> 1.7.12
>>

2013-06-17 12:21:21

by Paul Turner

[permalink] [raw]
Subject: Re: [patch v8 9/9] sched/tg: remove blocked_load_avg in balance

On Fri, Jun 7, 2013 at 12:20 AM, Alex Shi <[email protected]> wrote:
> blocked_load_avg sometime is too heavy and far bigger than runnable load
> avg, that make balance make wrong decision. So remove it.

Ok so this is going to have terrible effects on the correctness of
shares distribution; I'm fairly opposed to it in its present form.

So let's see, what could be happening..

In "sched: compute runnable load avg in cpu_load and
cpu_avg_load_per_task" you already update the load average weights
solely based on current runnable load. While this is generally poor
for stability (and I suspect the benefit is coming largely from
weighted_cpuload() where you do want to use runnable_load_avg and not
get_rq_runnable_load() where I suspect including blocked_load_avg() is
correct in the longer term).

Ah so.. I have an inkling:
Inside weighted_cpuload(), where you're trying to use only
runnable_load_avg, this is in fact still including blocked_load_avg
for a cgroup, since in the cgroup case a group entity's contribution
is a function of both runnable and blocked load.

Having weighted_cpuload() pull rq->load (possibly moderated by
rq->avg) would reasonably avoid this since issued shares are
calculated using instantaneous weights, without breaking the actual
model for how much load overall that we believe the group has.
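
(For reference, a rough sketch of the path Paul describes, with naming
approximated from 3.10-era fair.c; treat it as an illustration, not the
exact code. A group entity's load_avg_contrib is derived from its cfs_rq's
tg_load_contrib, and tg_load_contrib, per the hunk quoted below, still
accumulates runnable plus blocked load, so the parent's runnable_load_avg,
and hence weighted_cpuload(), carries the group's blocked load.)

static inline void __update_group_entity_contrib(struct sched_entity *se)
{
	struct cfs_rq *cfs_rq = group_cfs_rq(se);	/* the group's own queue */
	struct task_group *tg = cfs_rq->tg;
	u64 contrib;

	/* this cpu's share of the tg's total (runnable + blocked) load */
	contrib = cfs_rq->tg_load_contrib * tg->shares;
	se->avg.load_avg_contrib = div64_u64(contrib,
					     atomic64_read(&tg->load_avg) + 1);
}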

>
> Changlong tested this patch, found ltp cgroup stress testing get better
> performance: https://lkml.org/lkml/2013/5/23/65
> ---
> 3.10-rc1 patch1-7 patch1-8
> duration=764 duration=754 duration=750
> duration=764 duration=754 duration=751
> duration=763 duration=755 duration=751
>
> duration means the seconds of testing cost.
> ---
>
> And Jason also tested this patchset on his 8 sockets machine:
> https://lkml.org/lkml/2013/5/29/673
> ---
> When using a 3.10-rc2 tip kernel with patches 1-8, there was about a 40%
> improvement in performance of the workload compared to when using the
> vanilla 3.10-rc2 tip kernel with no patches. When using a 3.10-rc2 tip
> kernel with just patches 1-7, the performance improvement of the
> workload over the vanilla 3.10-rc2 tip kernel was about 25%.
> ---
>
> Signed-off-by: Alex Shi <[email protected]>
> Tested-by: Changlong Xie <[email protected]>
> Tested-by: Jason Low <[email protected]>
> ---
> kernel/sched/fair.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3aa1dc0..985d47e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1358,7 +1358,7 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> struct task_group *tg = cfs_rq->tg;
> s64 tg_contrib;
>
> - tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
> + tg_contrib = cfs_rq->runnable_load_avg;
> tg_contrib -= cfs_rq->tg_load_contrib;
>
> if (force_update || abs64(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
> --
> 1.7.12
>

2013-06-17 12:26:43

by Paul Turner

[permalink] [raw]
Subject: Re: [patch v8 2/9] sched: move few runnable tg variables into CONFIG_SMP

On Fri, Jun 7, 2013 at 12:20 AM, Alex Shi <[email protected]> wrote:
> The following 2 variables only used under CONFIG_SMP, so better to move
> their definiation into CONFIG_SMP too.
>
> atomic64_t load_avg;
> atomic_t runnable_avg;
>
> Signed-off-by: Alex Shi <[email protected]>

Looks fine; the manipulations to the structs could perhaps be
collapsed to one patch.

2013-06-17 12:26:59

by Lei Wen

[permalink] [raw]
Subject: Re: [patch v8 3/9] sched: set initial value of runnable avg for new forked task

Hi Peter,

On Mon, Jun 17, 2013 at 5:20 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Jun 14, 2013 at 06:02:45PM +0800, Lei Wen wrote:
>> Hi Alex,
>>
>> On Fri, Jun 7, 2013 at 3:20 PM, Alex Shi <[email protected]> wrote:
>> > We need initialize the se.avg.{decay_count, load_avg_contrib} for a
>> > new forked task.
>> > Otherwise random values of above variables cause mess when do new task
>> > enqueue:
>> > enqueue_task_fair
>> > enqueue_entity
>> > enqueue_entity_load_avg
>> >
>> > and make forking balancing imbalance since incorrect load_avg_contrib.
>> >
>> > Further more, Morten Rasmussen notice some tasks were not launched at
>> > once after created. So Paul and Peter suggest giving a start value for
>> > new task runnable avg time same as sched_slice().
>>
>> I am confused at this comment, how set slice to runnable avg would change
>> the behavior of "some tasks were not launched at once after created"?
>>
>> IMHO, I could only tell that for the new forked task, it could be run if current
>> task already be set as need_resched, and preempt_schedule or
>> preempt_schedule_irq
>> is called.
>>
>> Since the set slice to avg behavior would not affect this task's vruntime,
>> and hence cannot make current running task be need_sched, if
>> previously it cannot.
>
>
> So the 'problem' is that our running avg is a 'floating' average; ie. it
> decays with time. Now we have to guess about the future of our newly
> spawned task -- something that is nigh impossible seeing these CPU
> vendors keep refusing to implement the crystal ball instruction.

I am curious about this "crystal ball instruction" remark. :)
Could it be real? I mean, what kind of hw mechanism could achieve such
magic power? From what I see, silicon vendors could provide more
monitoring units, but as for precisely predicting the sw's behavior, I
don't think hw has that kind of power...

>
> So there's two asymptotic cases we want to deal well with; 1) the case
> where the newly spawned program will be 'nearly' idle for its lifetime;
> and 2) the case where its cpu-bound.
>
> Since we have to guess, we'll go for worst case and assume its
> cpu-bound; now we don't want to make the avg so heavy adjusting to the
> near-idle case takes forever. We want to be able to quickly adjust and
> lower our running avg.
>
> Now we also don't want to make our avg too light, such that it gets
> decremented just for the new task not having had a chance to run yet --
> even if when it would run, it would be more cpu-bound than not.
>
> So what we do is we make the initial avg of the same duration as that we
> guess it takes to run each task on the system at least once -- aka
> sched_slice().
>
> Of course we can defeat this with wakeup/fork bombs, but in the 'normal'
> case it should be good enough.
>
>
> Does that make sense?

Thanks for your detailed explanation. Very useful indeed! :)

BTW, I have no question about the patch itself; I am just confused by the
patch's comment
"some tasks were not launched at once after created".

Thanks,
Lei

2013-06-17 12:33:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch v8 3/9] sched: set initial value of runnable avg for new forked task

On Mon, Jun 17, 2013 at 08:26:55PM +0800, Lei Wen wrote:
> Hi Peter,
>
> > So the 'problem' is that our running avg is a 'floating' average; ie. it
> > decays with time. Now we have to guess about the future of our newly
> > spawned task -- something that is nigh impossible seeing these CPU
> > vendors keep refusing to implement the crystal ball instruction.
>
> I am curious at this "crystal ball instruction" saying. :)
> Could it be real? I mean what kind of hw mechanism could achieve such
> magic power? What I see, for silicon vendor they could provide more
> monitor unit, but to precise predict the sw's behavior, I don't think hw
> also this kind of power...

There's indeed no known mechanism for this crystal ball instruction to
really work with. It's just something I often wish for ;-) It would make
life so much easier - although I have no experience with handling the
resulting paradoxes, so who knows :-)

> > So there's two asymptotic cases we want to deal well with; 1) the case
> > where the newly spawned program will be 'nearly' idle for its lifetime;
> > and 2) the case where its cpu-bound.
> >
> > Since we have to guess, we'll go for worst case and assume its
> > cpu-bound; now we don't want to make the avg so heavy adjusting to the
> > near-idle case takes forever. We want to be able to quickly adjust and
> > lower our running avg.
> >
> > Now we also don't want to make our avg too light, such that it gets
> > decremented just for the new task not having had a chance to run yet --
> > even if when it would run, it would be more cpu-bound than not.
> >
> > So what we do is we make the initial avg of the same duration as that we
> > guess it takes to run each task on the system at least once -- aka
> > sched_slice().
> >
> > Of course we can defeat this with wakeup/fork bombs, but in the 'normal'
> > case it should be good enough.
> >
> >
> > Does that make sense?
>
> Thanks for your detailed explanation. Very useful indeed! :)
>
> BTW, I have no question for the patch itself, but just confuse at the
> patch's comment
> "some tasks were not launched at once after created".

Right, I might edit that on applying to clarify things.

2013-06-17 13:00:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch v8 3/9] sched: set initial value of runnable avg for new forked task

On Mon, Jun 17, 2013 at 02:39:53AM -0700, Paul Turner wrote:
> I actually did read it before, and still wasn't sure of the right tag to use.
>
> "13) When to use Acked-by: and Cc:
>
> The Signed-off-by: tag indicates that the signer was involved in the
> development of the patch, or that he/she was in the patch's delivery path.
>
> If a person was not directly involved in the preparation or handling of a
> patch but wishes to signify and record their approval of it then they can
> arrange to have an Acked-by: line added to the patch's changelog."
> https://www.kernel.org/doc/Documentation/SubmittingPatches
>
> Acked-By seemed to fail the direct involvement test.
> The definition of "delivery path" is not clear; is this strictly by
> inputs to Linus' tree or recipients of the original patch?

The way I interpret the delivery path is when its part of a bigger
grouping. The author is always the first SOB, if the patch series isn't
by the same author then he who compiles the aggregate work adds his SOB.

Similar for (sub)maintainers passing it onwards towards Linus. I compile
batches of patch series and feed them to Ingo, Ingo stuffs the lot into
-tip and feeds them to Linus.

So supposing you wrote a patch, gave it to Alex who composed the bigger
series, which I picked up and handed to Ingo, we'd get something like:

SoB: PJT
SoB: Alex
SoB: PeterZ
SoB: Mingo

and then when Linus pulls the lot he could add his SOB too, although
looking at git history he typically doesn't add his sob to each and
everything he pulls.

But yes, I see where the confusion stems from.

> Is Reviewed-By always more appropriate here?

Yes, or even Acked would work. The difference between reviewed and acked
is the level of involvement. An ack is for something you glanced over
and generally agree with, a review is for something you looked at in
minute detail.

Then there's some people (and I tend to lean towards this) who can't be
bothered with that distinction too much and simply always use ack.

2013-06-17 13:07:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch v8 3/9] sched: set initial value of runnable avg for new forked task

On Mon, Jun 17, 2013 at 05:57:50PM +0800, Alex Shi wrote:
> On 06/17/2013 05:21 PM, Peter Zijlstra wrote:
> >> >
> >> > and make forking balancing imbalance since incorrect load_avg_contrib.
> >> >
> >> > Further more, Morten Rasmussen notice some tasks were not launched at
> >> > once after created. So Paul and Peter suggest giving a start value for
> >> > new task runnable avg time same as sched_slice().
> >> >
> >> > Signed-off-by: Alex Shi <[email protected]>
> >> > Signed-off-by: Paul Turner <[email protected]>
> > Should you all go read: Documentation/SubmittingPatches , or am I
> > somehow confused on the SoB rules?
>
> has this should been right, if Paul had handed in the modified patch as
> he suggested? :)
>
> Sorry for stupid, I still don't know what's SoB rule?

Right, so it depends on who actually wrote the patch; the only case
that's really hard is when a patch is fully co-authored -- agile dev
nonsense like, 4 hands 1 keyboard situation or so.

Typically there's 1 somebody who did most work on a particular patch;
that someone would be Author/From and have first SoB.

If thereafter the patch becomes part of an aggregate work; he who
compiles can add another SoB; possibly with an extra [] line describing
'smallish' changes that were needed to the initial patch to make it fit
the aggregate.

Example:

From: PJT

foo patch implements foo because bar; note the fubar detail.

SoB: PJT
[alex@intel: changed ponies into horses to make it fit]
SoB: Alex

The other case is where a 'simple' modification of the initial patch
simply won't do; you need to change the core idea of the patch or
similar. In this case I've seen things like:

From: Alex

foo patch implements foo because bar; note the fubar detail.

Based-on-patch-by: PJT
SoB: Alex


This isn't actually in the submitting patches document and I'm not sure
it should be; although some clarification for these weird cases might be
useful.

2013-06-17 13:23:57

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 3/9] sched: set initial value of runnable avg for new forked task

>
> Example:
>
> From: PJT
>
> foo patch implements foo because bar; note the fubar detail.
>
> SoB: PJT
> [alex@intel: changed ponies into horses to make it fit]
> SoB: Alex
>
> The other case is where a 'simple' modification of the initial patch
> simply won't do; you need to change the core idea of the patch or
> similar. In this case I've seen things like:
>
> From: Alex
>
> foo patch implements foo because bar; note the fubar detail.
>
> Based-on-patch-by: PJT
> SoB: Alex
>
>
> This isn't actually in the submitting patches document and I'm not sure
> it should be; although some clarification for these weird cases might be
> useful.

That is a very, very useful clarification!

Currently even some maintainers do not use 'From' and SoB correctly.
So I do suggest including it in the Documentation/SubmittingPatches file.
That would help everyone! :)

--
Thanks
Alex

2013-06-17 13:39:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch v8 6/9] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On Mon, Jun 17, 2013 at 05:17:17AM -0700, Paul Turner wrote:
> On Mon, Jun 17, 2013 at 3:51 AM, Paul Turner <[email protected]> wrote:
> > On Fri, Jun 7, 2013 at 12:20 AM, Alex Shi <[email protected]> wrote:
> >> They are the base values in load balance, update them with rq runnable
> >> load average, then the load balance will consider runnable load avg
> >> naturally.
> >>
> >> We also try to include the blocked_load_avg as cpu load in balancing,
> >> but that cause kbuild performance drop 6% on every Intel machine, and
> >> aim7/oltp drop on some of 4 CPU sockets machines.
> >>
> >
> > This looks fine.
> >
> > Did you try including blocked_load_avg in only get_rq_runnable_load()
> > [ and not weighted_cpuload() which is called by new-idle ]?
>
> Looking at this more this feels less correct since you're taking
> averages of averages.
>
> This was previously discussed at:
> https://lkml.org/lkml/2013/5/6/109
>
> And you later replied suggesting this didn't seem to hurt; what's the
> current status there?

Wasn't there a follow-up series (currently an RFC or so) that proposes
to remove the entire *_idx stuff?

2013-06-17 13:57:44

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 6/9] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On 06/17/2013 08:17 PM, Paul Turner wrote:
> On Mon, Jun 17, 2013 at 3:51 AM, Paul Turner <[email protected]> wrote:
>> On Fri, Jun 7, 2013 at 12:20 AM, Alex Shi <[email protected]> wrote:
>>> They are the base values in load balance, update them with rq runnable
>>> load average, then the load balance will consider runnable load avg
>>> naturally.
>>>
>>> We also try to include the blocked_load_avg as cpu load in balancing,
>>> but that cause kbuild performance drop 6% on every Intel machine, and
>>> aim7/oltp drop on some of 4 CPU sockets machines.
>>>
>>
>> This looks fine.
>>
>> Did you try including blocked_load_avg in only get_rq_runnable_load()
>> [ and not weighted_cpuload() which is called by new-idle ]?
>
> Looking at this more this feels less correct since you're taking
> averages of averages.
>
> This was previously discussed at:
> https://lkml.org/lkml/2013/5/6/109
>
> And you later replied suggesting this didn't seem to hurt; what's the
> current status there?

Yes, your example showed the blocked_load_avg value.
So I had posted a patch for review at that time, before doing detailed
testing: https://lkml.org/lkml/2013/5/7/66

But in detailed testing the patch caused a big performance regression.
When I looked into the details, I found some cpus in kbuild had a big
blocked_load_avg with a very small runnable_load_avg value.

It seems accumulating the current blocked_load_avg into the cpu load isn't
a good idea, because:
1, The blocked_load_avg is decayed the same way as the runnable load and is
sometimes far bigger than the runnable load; that drives tasks to other
idle or lightly loaded cpus and causes both performance and power issues.
But if the blocked load is decayed too fast, it loses its effect.
2, Another issue with blocked load is that when waking up a task, we cannot
know the blocked load proportion of that task on the rq. So the blocked
load is meaningless in the wake-affine decision.

Given the above problems, I have given up on enabling blocked_load_avg in balance.


>
>
>>
>>> Signed-off-by: Alex Shi <[email protected]>
>>> ---
>>> kernel/sched/fair.c | 5 +++--
>>> kernel/sched/proc.c | 17 +++++++++++++++--
>>> 2 files changed, 18 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 42c7be0..eadd2e7 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -2962,7 +2962,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>> /* Used instead of source_load when we know the type == 0 */
>>> static unsigned long weighted_cpuload(const int cpu)
>>> {
>>> - return cpu_rq(cpu)->load.weight;
>>> + return cpu_rq(cpu)->cfs.runnable_load_avg;
>>> }
>>>
>>> /*
>>> @@ -3007,9 +3007,10 @@ static unsigned long cpu_avg_load_per_task(int cpu)
>>> {
>>> struct rq *rq = cpu_rq(cpu);
>>> unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
>>> + unsigned long load_avg = rq->cfs.runnable_load_avg;
>>>
>>> if (nr_running)
>>> - return rq->load.weight / nr_running;
>>> + return load_avg / nr_running;
>>>
>>> return 0;
>>> }
>>> diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
>>> index bb3a6a0..ce5cd48 100644
>>> --- a/kernel/sched/proc.c
>>> +++ b/kernel/sched/proc.c
>>> @@ -501,6 +501,18 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
>>> sched_avg_update(this_rq);
>>> }
>>>
>>> +#ifdef CONFIG_SMP
>>> +unsigned long get_rq_runnable_load(struct rq *rq)
>>> +{
>>> + return rq->cfs.runnable_load_avg;
>>> +}
>>> +#else
>>> +unsigned long get_rq_runnable_load(struct rq *rq)
>>> +{
>>> + return rq->load.weight;
>>> +}
>>> +#endif
>>> +
>>> #ifdef CONFIG_NO_HZ_COMMON
>>> /*
>>> * There is no sane way to deal with nohz on smp when using jiffies because the
>>> @@ -522,7 +534,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
>>> void update_idle_cpu_load(struct rq *this_rq)
>>> {
>>> unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
>>> - unsigned long load = this_rq->load.weight;
>>> + unsigned long load = get_rq_runnable_load(this_rq);
>>> unsigned long pending_updates;
>>>
>>> /*
>>> @@ -568,11 +580,12 @@ void update_cpu_load_nohz(void)
>>> */
>>> void update_cpu_load_active(struct rq *this_rq)
>>> {
>>> + unsigned long load = get_rq_runnable_load(this_rq);
>>> /*
>>> * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
>>> */
>>> this_rq->last_load_update_tick = jiffies;
>>> - __update_cpu_load(this_rq, this_rq->load.weight, 1);
>>> + __update_cpu_load(this_rq, load, 1);
>>>
>>> calc_load_account_active(this_rq);
>>> }
>>> --
>>> 1.7.12
>>>


--
Thanks
Alex

2013-06-17 13:59:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch v8 8/9] sched: consider runnable load average in move_tasks

On Fri, Jun 07, 2013 at 03:20:51PM +0800, Alex Shi wrote:
> Except using runnable load average in background, move_tasks is also
> the key functions in load balance. We need consider the runnable load
> average in it in order to the apple to apple load comparison.
>
> Morten had caught a div u64 bug on ARM, thanks!
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/fair.c | 16 ++++++++--------
> 1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eadd2e7..3aa1dc0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4178,11 +4178,14 @@ static int tg_load_down(struct task_group *tg, void *data)
> long cpu = (long)data;
>
> if (!tg->parent) {
> - load = cpu_rq(cpu)->load.weight;
> + load = cpu_rq(cpu)->avg.load_avg_contrib;
> } else {
> + unsigned long tmp_rla;
> + tmp_rla = tg->parent->cfs_rq[cpu]->runnable_load_avg + 1;
> +
> load = tg->parent->cfs_rq[cpu]->h_load;
> - load *= tg->se[cpu]->load.weight;
> - load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
> + load *= tg->se[cpu]->avg.load_avg_contrib;
> + load /= tmp_rla;
> }
>
> tg->cfs_rq[cpu]->h_load = load;
> @@ -4208,12 +4211,9 @@ static void update_h_load(long cpu)
> static unsigned long task_h_load(struct task_struct *p)
> {
> struct cfs_rq *cfs_rq = task_cfs_rq(p);
> - unsigned long load;
> -
> - load = p->se.load.weight;
> - load = div_u64(load * cfs_rq->h_load, cfs_rq->load.weight + 1);
>
> - return load;
> + return div64_ul(p->se.avg.load_avg_contrib * cfs_rq->h_load,
> + cfs_rq->runnable_load_avg + 1);
> }
> #else
> static inline void update_blocked_averages(int cpu)

Should we not also change the !FAIR_GROUP_SCHED version of task_h_load()
for this?

2013-06-17 13:59:58

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 6/9] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On 06/17/2013 09:39 PM, Peter Zijlstra wrote:
>>> > >
>>> > > This looks fine.
>>> > >
>>> > > Did you try including blocked_load_avg in only get_rq_runnable_load()
>>> > > [ and not weighted_cpuload() which is called by new-idle ]?
>> >
>> > Looking at this more this feels less correct since you're taking
>> > averages of averages.
>> >
>> > This was previously discussed at:
>> > https://lkml.org/lkml/2013/5/6/109
>> >
>> > And you later replied suggesting this didn't seem to hurt; what's the
>> > current status there?
> Wasn't there a follow up series (currently as RFC or so) that proposes
> to removes the entire *_idx stuff?

Yes, it is in my plan; it's just that right now I'm getting my ass kicked
by other internal Intel tasks.

--
Thanks
Alex

2013-06-17 14:01:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch v8 9/9] sched/tg: remove blocked_load_avg in balance

On Mon, Jun 17, 2013 at 05:20:46AM -0700, Paul Turner wrote:
> On Fri, Jun 7, 2013 at 12:20 AM, Alex Shi <[email protected]> wrote:
> > blocked_load_avg sometime is too heavy and far bigger than runnable load
> > avg, that make balance make wrong decision. So remove it.
>
> Ok so this is going to have terrible effects on the correctness of
> shares distribution; I'm fairly opposed to it in its present form.

Should someone take a look at the LTP cgroup tests to make sure they
cover this properly?

2013-06-17 14:01:52

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 8/9] sched: consider runnable load average in move_tasks

On 06/17/2013 06:58 PM, Paul Turner wrote:
>> > + unsigned long tmp_rla;
>> > + tmp_rla = tg->parent->cfs_rq[cpu]->runnable_load_avg + 1;
>> > +
>> > load = tg->parent->cfs_rq[cpu]->h_load;
>> > - load *= tg->se[cpu]->load.weight;
>> > - load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
>> > + load *= tg->se[cpu]->avg.load_avg_contrib;
>> > + load /= tmp_rla;
> Why do we need the temporary here?
>

You're right, tmp_rla is not necessary here.

--
Thanks
Alex

2013-06-17 14:16:01

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 8/9] sched: consider runnable load average in move_tasks

On 06/17/2013 10:01 PM, Alex Shi wrote:
> On 06/17/2013 06:58 PM, Paul Turner wrote:
>>>> + unsigned long tmp_rla;
>>>> + tmp_rla = tg->parent->cfs_rq[cpu]->runnable_load_avg + 1;
>>>> +
>>>> load = tg->parent->cfs_rq[cpu]->h_load;
>>>> - load *= tg->se[cpu]->load.weight;
>>>> - load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
>>>> + load *= tg->se[cpu]->avg.load_avg_contrib;
>>>> + load /= tmp_rla;
>> Why do we need the temporary here?
>>
>
> you'r right. tmp_rla is not necessary here.
>

Sorry, no: tmp_rla is an implicit type cast. Otherwise long /= u64; has a
build issue on 32-bit, according to Morten, and spelling the cast out
across two lines keeps it clear.
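
As an illustration only (not from the patchset; scale_h_load() is a made-up
name): if the divisor stays a u64, the whole division is promoted to 64 bits
and gcc emits a libgcc helper (__udivdi3, or __aeabi_uldivmod on ARM EABI)
that the kernel does not provide, which is the 32-bit build failure Morten
hit. Assigning the divisor to an unsigned long first keeps the division at
native width.

static unsigned long scale_h_load(unsigned long load, u64 runnable)
{
	unsigned long tmp_rla = runnable + 1;	/* truncating assignment */

	/* 'return load / (runnable + 1);' would be a u64 division here */
	return load / tmp_rla;			/* plain native-width division */
}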

--
Thanks
Alex

2013-06-17 14:29:28

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 8/9] sched: consider runnable load average in move_tasks

On 06/17/2013 09:59 PM, Peter Zijlstra wrote:
>> > #else
>> > static inline void update_blocked_averages(int cpu)
> Should we not also change the !FAIR_GROUP_SCHED version of task_h_load()
> for this?

Yes, it is needed. Sorry for omitting this.
---

From dc6a4b052c05d793938f6fbf40190d8cec959ca4 Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Mon, 3 Dec 2012 23:00:53 +0800
Subject: [PATCH 1/2] sched: consider runnable load average in move_tasks

Except using runnable load average in background, move_tasks is also
the key functions in load balance. We need consider the runnable load
average in it in order to the apple to apple load comparison.

Morten had caught a div u64 bug on ARM, thanks!

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 18 +++++++++---------
1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4b68f9f..0ca9cc3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4179,11 +4179,14 @@ static int tg_load_down(struct task_group *tg, void *data)
long cpu = (long)data;

if (!tg->parent) {
- load = cpu_rq(cpu)->load.weight;
+ load = cpu_rq(cpu)->avg.load_avg_contrib;
} else {
+ unsigned long tmp_rla;
+ tmp_rla = tg->parent->cfs_rq[cpu]->runnable_load_avg + 1;
+
load = tg->parent->cfs_rq[cpu]->h_load;
- load *= tg->se[cpu]->load.weight;
- load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
+ load *= tg->se[cpu]->avg.load_avg_contrib;
+ load /= tmp_rla;
}

tg->cfs_rq[cpu]->h_load = load;
@@ -4209,12 +4212,9 @@ static void update_h_load(long cpu)
static unsigned long task_h_load(struct task_struct *p)
{
struct cfs_rq *cfs_rq = task_cfs_rq(p);
- unsigned long load;
-
- load = p->se.load.weight;
- load = div_u64(load * cfs_rq->h_load, cfs_rq->load.weight + 1);

- return load;
+ return div64_ul(p->se.avg.load_avg_contrib * cfs_rq->h_load,
+ cfs_rq->runnable_load_avg + 1);
}
#else
static inline void update_blocked_averages(int cpu)
@@ -4227,7 +4227,7 @@ static inline void update_h_load(long cpu)

static unsigned long task_h_load(struct task_struct *p)
{
- return p->se.load.weight;
+ return p->se.avg.load_avg_contrib;
}
#endif

--
1.7.5.4

2013-06-17 14:58:08

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 6/9] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On 06/17/2013 06:51 PM, Paul Turner wrote:
> This looks fine.
>
> Did you try including blocked_load_avg in only get_rq_runnable_load()
> [ and not weighted_cpuload() which is called by new-idle ]?
>

No, but blocked_load_avg also impacts periodic balance a lot. The ideal
scenario is that a sleeping task will be woken up on the same cpu. But in
a benchmark like kbuild, the quick cc/ld tasks are forked, then finish and
disappear quickly; blocked_load_avg causes trouble in kbuild.

--
Thanks
Alex

2013-06-17 15:21:41

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 6/9] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On 06/17/2013 10:57 PM, Alex Shi wrote:
>> >
>> > Did you try including blocked_load_avg in only get_rq_runnable_load()
>> > [ and not weighted_cpuload() which is called by new-idle ]?
>> >
> No, but blocked_load_avg also impact periodic balance much. The ideal
> scenario is a sleeping task will be waken up on same cpu. But like
> kbuild benchmark, the quick cc/ld task are forked, and then finished and

To explain a bit more: cc/ld finish their cpu work and then wait on IO to
write the object files to disk.
> disappeared quickly. blocked_load_avg cause trouble in kbuild.


--
Thanks
Alex

2013-06-17 15:32:22

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 2/9] sched: move few runnable tg variables into CONFIG_SMP

On 06/17/2013 08:26 PM, Paul Turner wrote:
> On Fri, Jun 7, 2013 at 12:20 AM, Alex Shi <[email protected]> wrote:
>> The following 2 variables only used under CONFIG_SMP, so better to move
>> their definiation into CONFIG_SMP too.
>>
>> atomic64_t load_avg;
>> atomic_t runnable_avg;
>>
>> Signed-off-by: Alex Shi <[email protected]>
>
> Looks fine; the manipulations to the structs could perhaps be
> collapsed to one patch.
>

thanks for review!

--
Thanks
Alex

2013-06-17 15:42:04

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 4/9] sched: fix slept time double counting in enqueue entity

On 06/17/2013 07:51 PM, Paul Turner wrote:
> Can you add something like:
>
> + /*
> + * Task re-woke on same cpu (or else
> migrate_task_rq_fair()
> + * would have made count negative); we must be careful
> to avoid
> + * double-accounting blocked time after synchronizing
> decays.
> + */
>
>
> Reviewed-by: Paul Turner <[email protected]>

thanks for review!
---

From 24d9b43e7a269e6ffee5b874d39812b83812a809 Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Thu, 9 May 2013 10:54:13 +0800
Subject: [PATCH] sched: fix slept time double counting in enqueue entity

The wakeuped migrated task will __synchronize_entity_decay(se); in
migrate_task_fair, then it needs to set
`se->avg.last_runnable_update -= (-se->avg.decay_count) << 20'
before update_entity_load_avg, in order to avoid slept time is updated
twice for se.avg.load_avg_contrib in both __syncchronize and
update_entity_load_avg.

but if the slept task is waked up from self cpu, it miss the
last_runnable_update before update_entity_load_avg(se, 0, 1), then the
slept time was used twice in both functions.
So we need to remove the double slept time counting.

Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Paul Turner <[email protected]>
---
kernel/sched/fair.c | 8 +++++++-
1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df5b8a9..1e5a5e6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1571,7 +1571,13 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
}
wakeup = 0;
} else {
- __synchronize_entity_decay(se);
+ /*
+ * Task re-woke on same cpu (or else migrate_task_rq_fair()
+ * would have made count negative); we must be careful to avoid
+ * double-accounting blocked time after synchronizing decays.
+ */
+ se->avg.last_runnable_update += __synchronize_entity_decay(se)
+ << 20;
}

/* migrated tasks did not contribute to our blocked load */
--
1.7.5.4

2013-06-17 23:01:04

by Paul Turner

[permalink] [raw]
Subject: Re: [patch v8 6/9] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On Mon, Jun 17, 2013 at 6:57 AM, Alex Shi <[email protected]> wrote:
> On 06/17/2013 08:17 PM, Paul Turner wrote:
>> On Mon, Jun 17, 2013 at 3:51 AM, Paul Turner <[email protected]> wrote:
>>> On Fri, Jun 7, 2013 at 12:20 AM, Alex Shi <[email protected]> wrote:
>>>> They are the base values in load balance, update them with rq runnable
>>>> load average, then the load balance will consider runnable load avg
>>>> naturally.
>>>>
>>>> We also try to include the blocked_load_avg as cpu load in balancing,
>>>> but that cause kbuild performance drop 6% on every Intel machine, and
>>>> aim7/oltp drop on some of 4 CPU sockets machines.
>>>>
>>>
>>> This looks fine.
>>>
>>> Did you try including blocked_load_avg in only get_rq_runnable_load()
>>> [ and not weighted_cpuload() which is called by new-idle ]?
>>
>> Looking at this more this feels less correct since you're taking
>> averages of averages.
>>
>> This was previously discussed at:
>> https://lkml.org/lkml/2013/5/6/109
>>
>> And you later replied suggesting this didn't seem to hurt; what's the
>> current status there?
>
> Yes, your example show the blocked_load_avg value.
> So I had given a patch for review at that time before do detailed
> testing. https://lkml.org/lkml/2013/5/7/66
>
> But in detailed testing, the patch cause a big performance regression.
> When I look into for details. I found some cpu in kbuild just had a big
> blocked_load_avg, with a very small runnable_load_avg value.
>
> Seems accumulating current blocked_load_avg into cpu load isn't a good
> idea. Because:

So I think this describes an alternate implementation to the one suggested in:
https://lkml.org/lkml/2013/5/7/66

Specifically, we _don't_ want to accumulate into cpu-load. Taking an
"average of the average" loses the mobility that the new
representation allows.

> 1, The blocked_load_avg is decayed same as runnable load, sometime is
> far bigger than runnable load, that drive tasks to other idle or slight
> load cpu, than cause both performance and power issue. But if the
> blocked load is decayed too fast, it lose its effect.

This is why the idea would be to use an instantaneous load in
weighted_cpuload() and one that incorporated averages on (wants a
rename) get_rq_runnable_load().

For non-instantaneous load indexes we're pulling for stability.

> 2, Another issue of blocked load is that when waking up task, we can not
> know blocked load proportion of the task on rq. So, the blocked load is
> meaningless in wake affine decision.

I think this is confusing two things:

(a) A wake-idle wake-up
(b) A wake-affine wake-up

In (a) we do not care about the blocked load proportion, only whether
a cpu is idle.

But once (a) has failed we should absolutely care how much load is
blocked in (b) as:
- We know we're going to queue for bandwidth on the cpu [ otherwise
we'd be in (a) ]
- Blocked load predicts how much _other_ work is expected to also
share the queue with us during the quantum

>
> According to above problem, I give up to enable blocked_load_avg in balance.
>
>
>>
>>
>>>
>>>> Signed-off-by: Alex Shi <[email protected]>
>>>> ---
>>>> kernel/sched/fair.c | 5 +++--
>>>> kernel/sched/proc.c | 17 +++++++++++++++--
>>>> 2 files changed, 18 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 42c7be0..eadd2e7 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -2962,7 +2962,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>> /* Used instead of source_load when we know the type == 0 */
>>>> static unsigned long weighted_cpuload(const int cpu)
>>>> {
>>>> - return cpu_rq(cpu)->load.weight;
>>>> + return cpu_rq(cpu)->cfs.runnable_load_avg;
>>>> }
>>>>
>>>> /*
>>>> @@ -3007,9 +3007,10 @@ static unsigned long cpu_avg_load_per_task(int cpu)
>>>> {
>>>> struct rq *rq = cpu_rq(cpu);
>>>> unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
>>>> + unsigned long load_avg = rq->cfs.runnable_load_avg;
>>>>
>>>> if (nr_running)
>>>> - return rq->load.weight / nr_running;
>>>> + return load_avg / nr_running;
>>>>
>>>> return 0;
>>>> }
>>>> diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
>>>> index bb3a6a0..ce5cd48 100644
>>>> --- a/kernel/sched/proc.c
>>>> +++ b/kernel/sched/proc.c
>>>> @@ -501,6 +501,18 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
>>>> sched_avg_update(this_rq);
>>>> }
>>>>
>>>> +#ifdef CONFIG_SMP
>>>> +unsigned long get_rq_runnable_load(struct rq *rq)
>>>> +{
>>>> + return rq->cfs.runnable_load_avg;
>>>> +}
>>>> +#else
>>>> +unsigned long get_rq_runnable_load(struct rq *rq)
>>>> +{
>>>> + return rq->load.weight;
>>>> +}
>>>> +#endif
>>>> +
>>>> #ifdef CONFIG_NO_HZ_COMMON
>>>> /*
>>>> * There is no sane way to deal with nohz on smp when using jiffies because the
>>>> @@ -522,7 +534,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
>>>> void update_idle_cpu_load(struct rq *this_rq)
>>>> {
>>>> unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
>>>> - unsigned long load = this_rq->load.weight;
>>>> + unsigned long load = get_rq_runnable_load(this_rq);
>>>> unsigned long pending_updates;
>>>>
>>>> /*
>>>> @@ -568,11 +580,12 @@ void update_cpu_load_nohz(void)
>>>> */
>>>> void update_cpu_load_active(struct rq *this_rq)
>>>> {
>>>> + unsigned long load = get_rq_runnable_load(this_rq);
>>>> /*
>>>> * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
>>>> */
>>>> this_rq->last_load_update_tick = jiffies;
>>>> - __update_cpu_load(this_rq, this_rq->load.weight, 1);
>>>> + __update_cpu_load(this_rq, load, 1);
>>>>
>>>> calc_load_account_active(this_rq);
>>>> }
>>>> --
>>>> 1.7.12
>>>>
>
>
> --
> Thanks
> Alex

2013-06-18 03:45:05

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 6/9] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On 06/18/2013 07:00 AM, Paul Turner wrote:
> On Mon, Jun 17, 2013 at 6:57 AM, Alex Shi <[email protected]> wrote:
>> > On 06/17/2013 08:17 PM, Paul Turner wrote:
>>> >> On Mon, Jun 17, 2013 at 3:51 AM, Paul Turner <[email protected]> wrote:
>>>> >>> On Fri, Jun 7, 2013 at 12:20 AM, Alex Shi <[email protected]> wrote:
>>>>> >>>> They are the base values in load balance, update them with rq runnable
>>>>> >>>> load average, then the load balance will consider runnable load avg
>>>>> >>>> naturally.
>>>>> >>>>
>>>>> >>>> We also try to include the blocked_load_avg as cpu load in balancing,
>>>>> >>>> but that cause kbuild performance drop 6% on every Intel machine, and
>>>>> >>>> aim7/oltp drop on some of 4 CPU sockets machines.
>>>>> >>>>
>>>> >>>
>>>> >>> This looks fine.
>>>> >>>
>>>> >>> Did you try including blocked_load_avg in only get_rq_runnable_load()
>>>> >>> [ and not weighted_cpuload() which is called by new-idle ]?
>>> >>
>>> >> Looking at this more this feels less correct since you're taking
>>> >> averages of averages.
>>> >>
>>> >> This was previously discussed at:
>>> >> https://lkml.org/lkml/2013/5/6/109
>>> >>
>>> >> And you later replied suggesting this didn't seem to hurt; what's the
>>> >> current status there?
>> >
>> > Yes, your example show the blocked_load_avg value.
>> > So I had given a patch for review at that time before do detailed
>> > testing. https://lkml.org/lkml/2013/5/7/66
>> >
>> > But in detailed testing, the patch cause a big performance regression.
>> > When I look into for details. I found some cpu in kbuild just had a big
>> > blocked_load_avg, with a very small runnable_load_avg value.
>> >
>> > Seems accumulating current blocked_load_avg into cpu load isn't a good
>> > idea. Because:
> So I think this describes an alternate implementation to the one suggested in:
> https://lkml.org/lkml/2013/5/7/66
>
> Specifically, we _don't_ want to accumulate into cpu-load. Taking an
> "average of the average" loses the mobility that the new
> representation allows.
>
>> > 1, The blocked_load_avg is decayed same as runnable load, sometime is
>> > far bigger than runnable load, that drive tasks to other idle or slight
>> > load cpu, than cause both performance and power issue. But if the
>> > blocked load is decayed too fast, it lose its effect.
> This is why the idea would be to use an instantaneous load in
> weighted_cpuload() and one that incorporated averages on (wants a
> rename) get_rq_runnable_load().
>
> For non-instantaneous load-indexes we're pulling for stability.

Paul, could I summarize your point here:
keep the current weighted_cpuload(), but add blocked load avg in
get_rq_runnable_load()?

I will test this change.
>
>> > 2, Another issue of blocked load is that when waking up task, we can not
>> > know blocked load proportion of the task on rq. So, the blocked load is
>> > meaningless in wake affine decision.
> I think this is confusing two things:
>
> (a) A wake-idle wake-up
> (b) A wake-affine wake-up

What I mean by wake affine is (b). Anyway, blocked load is no help in
that scenario.
>
> In (a) we do not care about the blocked load proportion, only whether
> a cpu is idle.
>
> But once (a) has failed we should absolutely care how much load is
> blocked in (b) as:
> - We know we're going to queue for bandwidth on the cpu [ otherwise
> we'd be in (a) ]
> - Blocked load predicts how much _other_ work is expected to also
> share the queue with us during the quantum
>
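For illustration, here is a standalone toy model of those two cases (all
names below are made up for this sketch; it is not kernel code): the
wake-idle pass only asks whether a cpu is idle, while the wake-affine
fallback compares an estimate of how much other work will share the
quantum, which is where blocked load would contribute.

#include <stdio.h>

struct toy_cpu {
	unsigned long runnable_load;	/* load of tasks currently queued */
	unsigned long blocked_load;	/* decayed load of sleeping tasks */
	int nr_running;
};

/* case (a), wake-idle: only idleness matters, blocked load is irrelevant */
static int wake_idle_ok(const struct toy_cpu *c)
{
	return c->nr_running == 0;
}

/* case (b), wake-affine fallback: expected competition includes blocked load */
static unsigned long expected_competition(const struct toy_cpu *c)
{
	return c->runnable_load + c->blocked_load;
}

int main(void)
{
	struct toy_cpu target = { .runnable_load = 1024, .blocked_load = 3072, .nr_running = 1 };
	struct toy_cpu prev   = { .runnable_load = 2048, .blocked_load = 0,    .nr_running = 2 };

	if (wake_idle_ok(&target))
		printf("wake-idle: place the task on the target cpu\n");
	else
		printf("wake-affine: target expects %lu vs prev %lu\n",
		       expected_competition(&target),
		       expected_competition(&prev));
	return 0;
}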


--
Thanks
Alex

2013-06-18 09:45:34

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 6/9] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task


>
> Paul, could I summary your point here:
> keep current weighted_cpu_load, but add blocked load avg in
> get_rq_runnable_load?
>
> I will test this change.

Current testing (kbuild, oltp, aim7) doesn't show a clear difference on my
NHM EP box between the following and the original patch; the only difference
is that get_rq_runnable_load() adds blocked_load_avg in the SMP case. I will
test more cases and more boxes.

---
kernel/sched/fair.c | 5 +++--
kernel/sched/proc.c | 17 +++++++++++++++--
2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1e5a5e6..7d5c477 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2968,7 +2968,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* Used instead of source_load when we know the type == 0 */
static unsigned long weighted_cpuload(const int cpu)
{
- return cpu_rq(cpu)->load.weight;
+ return cpu_rq(cpu)->cfs.runnable_load_avg;
}

/*
@@ -3013,9 +3013,10 @@ static unsigned long cpu_avg_load_per_task(int cpu)
{
struct rq *rq = cpu_rq(cpu);
unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
+ unsigned long load_avg = rq->cfs.runnable_load_avg;

if (nr_running)
- return rq->load.weight / nr_running;
+ return load_avg / nr_running;

return 0;
}
diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
index bb3a6a0..36d7db6 100644
--- a/kernel/sched/proc.c
+++ b/kernel/sched/proc.c
@@ -501,6 +501,18 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
sched_avg_update(this_rq);
}

+#ifdef CONFIG_SMP
+unsigned long get_rq_runnable_load(struct rq *rq)
+{
+ return rq->cfs.runnable_load_avg + rq->cfs.blocked_load_avg;
+}
+#else
+unsigned long get_rq_runnable_load(struct rq *rq)
+{
+ return rq->load.weight;
+}
+#endif
+
#ifdef CONFIG_NO_HZ_COMMON
/*
* There is no sane way to deal with nohz on smp when using jiffies because the
@@ -522,7 +534,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
void update_idle_cpu_load(struct rq *this_rq)
{
unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
- unsigned long load = this_rq->load.weight;
+ unsigned long load = get_rq_runnable_load(this_rq);
unsigned long pending_updates;

/*
@@ -568,11 +580,12 @@ void update_cpu_load_nohz(void)
*/
void update_cpu_load_active(struct rq *this_rq)
{
+ unsigned long load = get_rq_runnable_load(this_rq);
/*
* See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
*/
this_rq->last_load_update_tick = jiffies;
- __update_cpu_load(this_rq, this_rq->load.weight, 1);
+ __update_cpu_load(this_rq, load, 1);

calc_load_account_active(this_rq);
}
--
1.7.12

--
Thanks
Alex

2013-06-19 08:16:56

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 6/9] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On 06/18/2013 05:44 PM, Alex Shi wrote:
>
>>
>> Paul, could I summary your point here:
>> keep current weighted_cpu_load, but add blocked load avg in
>> get_rq_runnable_load?
>>
>> I will test this change.
>
> Current testing(kbuild, oltp, aim7) don't show clear different on my NHM EP box
> between the following and the origin patch,
> the only different is get_rq_runnable_load added blocked_load_avg. in SMP
> will test more cases and more box.

I tested tip/sched/core, tip/sched/core with the old patchset, and
tip/sched/core with the blocked_load_avg change on Core2 2S, NHM EP, IVB EP,
SNB EP 2S and SNB EP 4S boxes, with the benchmarks kbuild, sysbench oltp,
hackbench, tbench and dbench.

blocked_load_avg vs. the original patchset: oltp has a suspicious 5% drop and
hackbench a 3% drop on NHM EX; dbench has a suspicious 6% drop on NHM EP.
The other benchmarks show no clear change on any other machine.

Original patchset vs. sched/core: hackbench rises 20% on NHM EX, 60% on SNB
EP 4S, and 30% on IVB EP. Others show no clear changes.

>
> ---
> kernel/sched/fair.c | 5 +++--
> kernel/sched/proc.c | 17 +++++++++++++++--
> 2 files changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1e5a5e6..7d5c477 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2968,7 +2968,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> /* Used instead of source_load when we know the type == 0 */
> static unsigned long weighted_cpuload(const int cpu)
> {
> - return cpu_rq(cpu)->load.weight;
> + return cpu_rq(cpu)->cfs.runnable_load_avg;
> }
>
> /*
> @@ -3013,9 +3013,10 @@ static unsigned long cpu_avg_load_per_task(int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
> unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
> + unsigned long load_avg = rq->cfs.runnable_load_avg;
>
> if (nr_running)
> - return rq->load.weight / nr_running;
> + return load_avg / nr_running;
>
> return 0;
> }
> diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
> index bb3a6a0..36d7db6 100644
> --- a/kernel/sched/proc.c
> +++ b/kernel/sched/proc.c
> @@ -501,6 +501,18 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
> sched_avg_update(this_rq);
> }
>
> +#ifdef CONFIG_SMP
> +unsigned long get_rq_runnable_load(struct rq *rq)
> +{
> + return rq->cfs.runnable_load_avg + rq->cfs.blocked_load_avg;
> +}
> +#else
> +unsigned long get_rq_runnable_load(struct rq *rq)
> +{
> + return rq->load.weight;
> +}
> +#endif
> +
> #ifdef CONFIG_NO_HZ_COMMON
> /*
> * There is no sane way to deal with nohz on smp when using jiffies because the
> @@ -522,7 +534,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
> void update_idle_cpu_load(struct rq *this_rq)
> {
> unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
> - unsigned long load = this_rq->load.weight;
> + unsigned long load = get_rq_runnable_load(this_rq);
> unsigned long pending_updates;
>
> /*
> @@ -568,11 +580,12 @@ void update_cpu_load_nohz(void)
> */
> void update_cpu_load_active(struct rq *this_rq)
> {
> + unsigned long load = get_rq_runnable_load(this_rq);
> /*
> * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
> */
> this_rq->last_load_update_tick = jiffies;
> - __update_cpu_load(this_rq, this_rq->load.weight, 1);
> + __update_cpu_load(this_rq, load, 1);
>
> calc_load_account_active(this_rq);
> }
>


--
Thanks
Alex

2013-06-19 09:50:21

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 9/9] sched/tg: remove blocked_load_avg in balance

On 06/17/2013 10:01 PM, Peter Zijlstra wrote:
> On Mon, Jun 17, 2013 at 05:20:46AM -0700, Paul Turner wrote:
>> > On Fri, Jun 7, 2013 at 12:20 AM, Alex Shi <[email protected]> wrote:
>>> > > blocked_load_avg sometime is too heavy and far bigger than runnable load
>>> > > avg, that make balance make wrong decision. So remove it.
>> >
>> > Ok so this is going to have terrible effects on the correctness of
>> > shares distribution; I'm fairly opposed to it in its present form.
> Should someone take a look at the LTP cgroup tests to make sure they
> cover this properly?

It seems I have no time for this in the near future. :(

But if I were you, I would pick up this patch, since:
1, it is simple and easy to revert.
2, it wins in testing.
3, it will motivate Paul to look into this problem. ;)

--
Thanks
Alex

2013-06-20 00:34:40

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 6/9] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On 06/19/2013 04:15 PM, Alex Shi wrote:
> On 06/18/2013 05:44 PM, Alex Shi wrote:
>>
>>>
>>> Paul, could I summary your point here:
>>> keep current weighted_cpu_load, but add blocked load avg in
>>> get_rq_runnable_load?
>>>
>>> I will test this change.
>>
>> Current testing(kbuild, oltp, aim7) don't show clear different on my NHM EP box
>> between the following and the origin patch,
>> the only different is get_rq_runnable_load added blocked_load_avg. in SMP
>> will test more cases and more box.
>
> I tested the tip/sched/core, tip/sched/core with old patchset and
> tip/schec/core with the blocked_load_avg on Core2 2S, NHM EP, IVB EP,
> SNB EP 2S and SNB EP 4S box, with benchmark kbuild, sysbench oltp,
> hackbench, tbench, dbench.
>
> blocked_load_avg VS origin patchset, oltp has suspicious 5% and
> hackbench has 3% drop on NHM EX; dbench has suspicious 6% drop on NHM
> EP. other benchmarks has no clear change on all other machines.
>
> origin patchset VS sched/core, hackbench rise 20% on NHM EX, 60% on SNB
> EP 4S, and 30% on IVB EP. others no clear changes.
>

>> +#ifdef CONFIG_SMP
>> +unsigned long get_rq_runnable_load(struct rq *rq)
>> +{
>> + return rq->cfs.runnable_load_avg + rq->cfs.blocked_load_avg;

According to the above testing results, including blocked_load_avg is still
slightly worse than leaving it out.

When blocked_load_avg is added here, it affects nohz idle balance and the
periodic balance in update_sg_lb_stats() when the idx is not 0.
As for nohz idle balance, blocked_load_avg should be too small to have a
big effect.
As for update_sg_lb_stats(), since it only matters when _idx is not 0, the
blocked_load_avg is decayed again in update_cpu_load(), which reduces its
impact; a simplified sketch of that decay follows below.

So, could I say that, at least in the above testing, blocked_load_avg should
be kept away from balance?
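As a rough illustration of that damping, here is a simplified, standalone
sketch of the kind of cpu_load[] folding __update_cpu_load() does (it ignores
the rounding details of the real code, and the numbers are arbitrary):

#include <stdio.h>

#define CPU_LOAD_IDX_MAX 5

static unsigned long cpu_load[CPU_LOAD_IDX_MAX];

/* index i keeps roughly (2^i - 1)/2^i of its old value per update */
static void toy_update_cpu_load(unsigned long this_load)
{
	int i;

	cpu_load[0] = this_load;
	for (i = 1; i < CPU_LOAD_IDX_MAX; i++) {
		unsigned long scale = 1UL << i;

		cpu_load[i] = (cpu_load[i] * (scale - 1) + this_load) >> i;
	}
}

int main(void)
{
	int tick, i;

	/* settle at a small steady load first */
	for (tick = 0; tick < 32; tick++)
		toy_update_cpu_load(512);

	/* one tick where blocked load adds a big spike to the sample */
	toy_update_cpu_load(4096);

	/* the larger the index, the less of the one-tick spike gets through */
	for (i = 0; i < CPU_LOAD_IDX_MAX; i++)
		printf("cpu_load[%d] = %lu\n", i, cpu_load[i]);
	return 0;
}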


--
Thanks
Alex

2013-06-20 01:34:14

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 9/9] sched/tg: remove blocked_load_avg in balance

On 06/17/2013 08:20 PM, Paul Turner wrote:
> On Fri, Jun 7, 2013 at 12:20 AM, Alex Shi <[email protected]> wrote:
>> > blocked_load_avg sometime is too heavy and far bigger than runnable load
>> > avg, that make balance make wrong decision. So remove it.
> Ok so this is going to have terrible effects on the correctness of
> shares distribution; I'm fairly opposed to it in its present form.
>
> So let's see, what could be happening..
>
> In "sched: compute runnable load avg in cpu_load and
> cpu_avg_load_per_task" you already update the load average weights
> solely based on current runnable load. While this is generally poor
> for stability (and I suspect the benefit is coming largely from
> weighted_cpuload() where you do want to use runnable_load_avg and not
> get_rq_runnable_load() where I suspect including blocked_load_avg() is
> correct in the longer term).

If 'poor stability' means your previous example of two 40% busy tasks and
one 90% busy task, that does occasionally happen. But at least in all of the
testing (kbuild, aim7, tbench, oltp, hackbench, ltp, etc.), involving
blocked_load_avg is just worse, I guess due to the above reason.
>
> Ah so.. I have an inkling:
> Inside weighted_cpuload() where you're trying to use only
> runnable_load_avg; this is in-fact still including blocked_load_avg
> for a cgroup since in the cgroup case a group entities' contribution
> is a function of both runnable and blocked load.

With this patch, the tg load will not include blocked_load_avg.

Honestly, blocked_load_avg should have its meaning, as in your scenario.
But for now we can only see it bringing more harm, without any help, on
all of the benchmarks we tested.
I can't find a reason to enable something that hurts performance.
>
> Having weighted_cpuload() pull rq->load (possibly moderated by
> rq->avg) would reasonably avoid this since issued shares are
> calculated using instantaneous weights, without breaking the actual
> model for how much load overall that we believe the group has.
>

I considered using rq->avg in weighted_cpuload(), but when we do
move_tasks() to balance load between cpus, we only consider cfs tasks,
not rt tasks, so consulting rq->load/avg would bring in unnecessary rt
interference. So I changed it to use cfs load only.
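A trivial standalone illustration of that concern (made-up names, not kernel
code): a whole-rq view counts rt load that move_tasks() could not migrate
anyway, while the cfs-only view reflects what is actually movable.

#include <stdio.h>

struct toy_rq {
	unsigned long cfs_runnable_load_avg;	/* movable by cfs balancing */
	unsigned long rt_load;			/* not movable by move_tasks() */
};

static unsigned long whole_rq_load(const struct toy_rq *rq)
{
	return rq->cfs_runnable_load_avg + rq->rt_load;
}

static unsigned long cfs_only_load(const struct toy_rq *rq)
{
	return rq->cfs_runnable_load_avg;
}

int main(void)
{
	struct toy_rq rq = { .cfs_runnable_load_avg = 1024, .rt_load = 2048 };

	printf("whole-rq view: %lu, movable cfs view: %lu\n",
	       whole_rq_load(&rq), cfs_only_load(&rq));
	return 0;
}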

--
Thanks
Alex

2013-06-20 01:43:48

by Lei Wen

[permalink] [raw]
Subject: Re: [patch v8 4/9] sched: fix slept time double counting in enqueue entity

Hi Alex,

On Mon, Jun 17, 2013 at 11:41 PM, Alex Shi <[email protected]> wrote:
> On 06/17/2013 07:51 PM, Paul Turner wrote:
>> Can you add something like:
>>
>> + /*
>> + * Task re-woke on same cpu (or else
>> migrate_task_rq_fair()
>> + * would have made count negative); we must be careful
>> to avoid
>> + * double-accounting blocked time after synchronizing
>> decays.
>> + */
>>
>>
>> Reviewed-by: Paul Turner <[email protected]>
>
> thanks for review!
> ---
>
> From 24d9b43e7a269e6ffee5b874d39812b83812a809 Mon Sep 17 00:00:00 2001
> From: Alex Shi <[email protected]>
> Date: Thu, 9 May 2013 10:54:13 +0800
> Subject: [PATCH] sched: fix slept time double counting in enqueue entity
>
> The wakeuped migrated task will __synchronize_entity_decay(se); in
> migrate_task_fair, then it needs to set

Should be migrate_task_rq_fair, right? :)

> `se->avg.last_runnable_update -= (-se->avg.decay_count) << 20'
> before update_entity_load_avg, in order to avoid slept time is updated
> twice for se.avg.load_avg_contrib in both __syncchronize and
> update_entity_load_avg.
>
> but if the slept task is waked up from self cpu, it miss the
> last_runnable_update before update_entity_load_avg(se, 0, 1), then the
> slept time was used twice in both functions.
> So we need to remove the double slept time counting.
>
> Signed-off-by: Alex Shi <[email protected]>
> Reviewed-by: Paul Turner <[email protected]>
> ---
> kernel/sched/fair.c | 8 +++++++-
> 1 files changed, 7 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index df5b8a9..1e5a5e6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1571,7 +1571,13 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
> }
> wakeup = 0;
> } else {
> - __synchronize_entity_decay(se);
> + /*
> + * Task re-woke on same cpu (or else migrate_task_rq_fair()
> + * would have made count negative); we must be careful to avoid
> + * double-accounting blocked time after synchronizing decays.
> + */
> + se->avg.last_runnable_update += __synchronize_entity_decay(se)
> + << 20;

I'm kind of confused here, since migrate_task_rq_fair() uses the
equation below to avoid sleep time being accounted twice:
`se->avg.last_runnable_update -= (-se->avg.decay_count) << 20'

Why do we need to use "+=" here, which is the reverse of the
previous one?

Thanks,
Lei

> }
>
> /* migrated tasks did not contribute to our blocked load */
> --
> 1.7.5.4
>
>

2013-06-20 01:47:14

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 4/9] sched: fix slept time double counting in enqueue entity

On 06/20/2013 09:43 AM, Lei Wen wrote:
>> > From 24d9b43e7a269e6ffee5b874d39812b83812a809 Mon Sep 17 00:00:00 2001
>> > From: Alex Shi <[email protected]>
>> > Date: Thu, 9 May 2013 10:54:13 +0800
>> > Subject: [PATCH] sched: fix slept time double counting in enqueue entity
>> >
>> > The wakeuped migrated task will __synchronize_entity_decay(se); in
>> > migrate_task_fair, then it needs to set
> Should be migrate_task_rq_fair, right? :)
>

yes, thanks!

--
Thanks
Alex

2013-06-20 02:46:33

by Lei Wen

[permalink] [raw]
Subject: Re: [patch v8 4/9] sched: fix slept time double counting in enqueue entity

On Thu, Jun 20, 2013 at 9:43 AM, Lei Wen <[email protected]> wrote:
> Hi Alex,
>
> On Mon, Jun 17, 2013 at 11:41 PM, Alex Shi <[email protected]> wrote:
>> On 06/17/2013 07:51 PM, Paul Turner wrote:
>>> Can you add something like:
>>>
>>> + /*
>>> + * Task re-woke on same cpu (or else
>>> migrate_task_rq_fair()
>>> + * would have made count negative); we must be careful
>>> to avoid
>>> + * double-accounting blocked time after synchronizing
>>> decays.
>>> + */
>>>
>>>
>>> Reviewed-by: Paul Turner <[email protected]>
>>
>> thanks for review!
>> ---
>>
>> From 24d9b43e7a269e6ffee5b874d39812b83812a809 Mon Sep 17 00:00:00 2001
>> From: Alex Shi <[email protected]>
>> Date: Thu, 9 May 2013 10:54:13 +0800
>> Subject: [PATCH] sched: fix slept time double counting in enqueue entity
>>
>> The wakeuped migrated task will __synchronize_entity_decay(se); in
>> migrate_task_fair, then it needs to set
>
> Should be migrate_task_rq_fair, right? :)
>
>> `se->avg.last_runnable_update -= (-se->avg.decay_count) << 20'
>> before update_entity_load_avg, in order to avoid slept time is updated
>> twice for se.avg.load_avg_contrib in both __syncchronize and
>> update_entity_load_avg.
>>
>> but if the slept task is waked up from self cpu, it miss the
>> last_runnable_update before update_entity_load_avg(se, 0, 1), then the
>> slept time was used twice in both functions.
>> So we need to remove the double slept time counting.
>>
>> Signed-off-by: Alex Shi <[email protected]>
>> Reviewed-by: Paul Turner <[email protected]>
>> ---
>> kernel/sched/fair.c | 8 +++++++-
>> 1 files changed, 7 insertions(+), 1 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index df5b8a9..1e5a5e6 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1571,7 +1571,13 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
>> }
>> wakeup = 0;
>> } else {
>> - __synchronize_entity_decay(se);
>> + /*
>> + * Task re-woke on same cpu (or else migrate_task_rq_fair()
>> + * would have made count negative); we must be careful to avoid
>> + * double-accounting blocked time after synchronizing decays.
>> + */
>> + se->avg.last_runnable_update += __synchronize_entity_decay(se)
>> + << 20;
>
> I'm kind of getting confused at here, since migrate_task_rq_fair use
> below equation to
> avoid sleep time be accounted for twice.
> `se->avg.last_runnable_update -= (-se->avg.decay_count) << 20'
>
> Why here we need to use "+=", which the reversed version comparing
> with previous?

Here is my understanding of this reversed behavior; I'm not sure if I'm right...

1. In the migrate_task_rq_fair() case, the decay is applied directly to
load_avg_contrib, while runnable_avg_sum and runnable_avg_period are not
updated accordingly. When the task is migrated to a different runqueue,
avg.last_runnable_update gets aligned with the new runqueue's clock_task,
so without the adjustment the task would lose that decay, since
load_avg_contrib would later be recalculated from the original
runnable_avg_sum and runnable_avg_period values. Pulling the timestamp back
lets the previously applied decay be reflected on the new runqueue.

So we need "-=" in the migrated case.

2. For the normal wake-up case, since we keep running on the local queue,
avg.last_runnable_update is unchanged; if we did nothing, we would get
another decay in __update_entity_runnable_avg(). Since we already got one
from __synchronize_entity_decay(), that would be a redundant one.

Thus it is right to use "+=" here (a toy sketch of the bookkeeping follows).
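A standalone toy model of that timestamp bookkeeping (made-up helpers, not
the kernel code), assuming one load-tracking period is roughly 1 << 20 ns,
which is why the offset is shifted by 20:

#include <stdio.h>

#define PERIOD_SHIFT 20	/* one ~1024us tracking period in ns */

struct toy_avg {
	unsigned long long last_runnable_update;	/* ns */
	unsigned long decay_count;			/* periods already decayed */
};

/* stand-in for __synchronize_entity_decay(): returns periods already decayed */
static unsigned long toy_synchronize_decay(struct toy_avg *a)
{
	unsigned long periods = a->decay_count;

	a->decay_count = 0;
	return periods;
}

/* periods a later delta-based update would still decay */
static unsigned long long periods_left(const struct toy_avg *a,
				       unsigned long long now_ns)
{
	return (now_ns - a->last_runnable_update) >> PERIOD_SHIFT;
}

int main(void)
{
	unsigned long long now = 10ULL << PERIOD_SHIFT;	/* woke 10 periods later */
	struct toy_avg a = { .last_runnable_update = 0, .decay_count = 10 };
	unsigned long done = toy_synchronize_decay(&a);

	printf("without the offset: %llu periods would be decayed again\n",
	       periods_left(&a, now));

	/* push the timestamp forward by the periods already decayed */
	a.last_runnable_update += (unsigned long long)done << PERIOD_SHIFT;
	printf("with '+= periods << 20': %llu periods left to decay\n",
	       periods_left(&a, now));
	return 0;
}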


But here I have a question: there is another use of __synchronize_entity_decay()
in the current kernel, in the switched_from_fair() function.

If a task is frequently switched between the rt and fair classes, would it
also bring this redundancy issue? Do we need a patch like the one below?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b5408e1..9640c66 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5856,7 +5856,7 @@ static void switched_from_fair(struct rq *rq,
struct task_struct *p)
*/
if (p->se.avg.decay_count) {
struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
- se->avg.last_runnable_update +=
+ p->se.avg.last_runnable_update +=
__synchronize_entity_decay(&p->se);
subtract_blocked_load_contrib(cfs_rq,
p->se.avg.load_avg_contrib);


Thanks,
Lei


>
> Thanks,
> Lei
>
>> }
>>
>> /* migrated tasks did not contribute to our blocked load */
>> --
>> 1.7.5.4
>>
>>

2013-06-20 10:23:40

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [patch v8 3/9] sched: set initial value of runnable avg for new forked task

On Sat, Jun 15, 2013 at 01:09:12PM +0100, Lei Wen wrote:
> On Fri, Jun 14, 2013 at 9:59 PM, Alex Shi <[email protected]> wrote:
> > On 06/14/2013 06:02 PM, Lei Wen wrote:
> >>> > enqueue_entity
> >>> > enqueue_entity_load_avg
> >>> >
> >>> > and make forking balancing imbalance since incorrect load_avg_contrib.
> >>> >
> >>> > Further more, Morten Rasmussen notice some tasks were not launched at
> >>> > once after created. So Paul and Peter suggest giving a start value for
> >>> > new task runnable avg time same as sched_slice().
> >> I am confused at this comment, how set slice to runnable avg would change
> >> the behavior of "some tasks were not launched at once after created"?
> >
> > I also don't know the details on Morten's machine. but just guess, there
> > are much tasks on in the run queue. the minimum load avg make the new
> > task wait its time...
>
> Is there some possibility that since task structure is allocated without being
> set to 0, and it cause the imbalance between runqueues. Then the new forked
> is migrated to other cpus, so that it cause its execution being delayed?
>
> It is better for Morten to give us more details here. :)
>

I think Peter's reply pretty much covers it. The problem arises when a task
is not running (another task has a lower vruntime, or it is blocked for other
reasons) shortly after it was created. The runnable_avg_period is
very small, so the load_contrib is very sensitive.

Say if a task runs for 1 ms then is blocked for 1 ms and then runs
again, the load_contrib will go from 100% to 50% instantly and then ramp
back up again. So the task load may be quite different from the true
load of the task depending on when you calculate the load_contrib.

Preloading runnable_avg_period should make the load_contrib a little
less sensitive to this behaviour.
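A small worked example of those numbers (plain arithmetic, ignoring the
geometric decay, which is negligible over a couple of milliseconds; the 6ms
preload below is just an arbitrary stand-in for the sched_slice()-based
initial value):

#include <stdio.h>

/* load_contrib as a percentage: runnable time over tracked period */
static double contrib(double runnable_ms, double period_ms)
{
	return 100.0 * runnable_ms / period_ms;
}

static void show(const char *tag, double preload_ms)
{
	double sum = preload_ms, period = preload_ms;

	/* the task runs for 1ms ... */
	sum += 1.0;
	period += 1.0;
	printf("%s after 1ms running: %.1f%%\n", tag, contrib(sum, period));

	/* ... then blocks for 1ms */
	period += 1.0;
	printf("%s after 1ms blocked: %.1f%%\n", tag, contrib(sum, period));
}

int main(void)
{
	show("no preload: ", 0.0);	/* 100.0% -> 50.0% instantly */
	show("6ms preload:", 6.0);	/* 100.0% -> 87.5%, much smoother */
	return 0;
}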

Morten

> Thanks,
> Lei
>
> >>
> >> IMHO, I could only tell that for the new forked task, it could be run if current
> >> task already be set as need_resched, and preempt_schedule or
> >> preempt_schedule_irq
> >> is called.
> >>
> >> Since the set slice to avg behavior would not affect this task's vruntime,
> >> and hence cannot make current running task be need_sched, if
> >> previously it cannot.
> >>
> >> Could you help correct if I am wrong at somewhere? ....
> >>
> >> Thanks,
> >
> >
> > --
> > Thanks
> > Alex
>

2013-06-20 14:59:46

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 4/9] sched: fix slept time double counting in enqueue entity

On 06/20/2013 10:46 AM, Lei Wen wrote:
>
>
> But here I have a question, there is another usage of __synchronzie_entity_decay
> in current kernel, in the switched_from_fair function.
>
> If task being frequently switched between rt and fair class, would it
> also bring the
> redundant issue? Do we need patch like below?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b5408e1..9640c66 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5856,7 +5856,7 @@ static void switched_from_fair(struct rq *rq,
> struct task_struct *p)
> */
> if (p->se.avg.decay_count) {
> struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
> - se->avg.last_runnable_update +=
> + p->se.avg.last_runnable_update +=

What tree is this patch based on?
> __synchronize_entity_decay(&p->se);
> subtract_blocked_load_contrib(cfs_rq,
> p->se.avg.load_avg_contrib);


--
Thanks
Alex

2013-06-21 02:30:40

by Lei Wen

[permalink] [raw]
Subject: Re: [patch v8 4/9] sched: fix slept time double counting in enqueue entity

Hi Alex,

On Thu, Jun 20, 2013 at 10:59 PM, Alex Shi <[email protected]> wrote:
> On 06/20/2013 10:46 AM, Lei Wen wrote:
>>
>>
>> But here I have a question, there is another usage of __synchronzie_entity_decay
>> in current kernel, in the switched_from_fair function.
>>
>> If task being frequently switched between rt and fair class, would it
>> also bring the
>> redundant issue? Do we need patch like below?
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b5408e1..9640c66 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5856,7 +5856,7 @@ static void switched_from_fair(struct rq *rq,
>> struct task_struct *p)
>> */
>> if (p->se.avg.decay_count) {
>> struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
>> - se->avg.last_runnable_update +=
>> + p->se.avg.last_runnable_update +=
>
> what tree does this patch base on?

I created the patch based on the v3.9 kernel.
If it is OK, I could create a formal one based on the latest kernel.

Thanks,
Lei

>> __synchronize_entity_decay(&p->se);
>> subtract_blocked_load_contrib(cfs_rq,
>> p->se.avg.load_avg_contrib);
>
>
> --
> Thanks
> Alex

2013-06-21 02:40:33

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 4/9] sched: fix slept time double counting in enqueue entity

On 06/21/2013 10:30 AM, Lei Wen wrote:
> Hi Alex,
>
> On Thu, Jun 20, 2013 at 10:59 PM, Alex Shi <[email protected]> wrote:
>> On 06/20/2013 10:46 AM, Lei Wen wrote:
>>>
>>>
>>> But here I have a question, there is another usage of __synchronzie_entity_decay
>>> in current kernel, in the switched_from_fair function.
>>>
>>> If task being frequently switched between rt and fair class, would it
>>> also bring the
>>> redundant issue? Do we need patch like below?
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index b5408e1..9640c66 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5856,7 +5856,7 @@ static void switched_from_fair(struct rq *rq,
>>> struct task_struct *p)
>>> */
>>> if (p->se.avg.decay_count) {
>>> struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
>>> - se->avg.last_runnable_update +=
>>> + p->se.avg.last_runnable_update +=
>>
>> what tree does this patch base on?
>
> I create the patch based on v3.9 kernel.
> If it is ok, I could create a formal one based on latest kernel.

In the Linus tree, the first commit that introduced
__synchronize_entity_decay(&p->se) here is
9ee474f55664ff63111c843099d365e7ecffb56f:
+
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+ /*
+ * Remove our load from contribution when we leave sched_fair
+ * and ensure we don't carry in an old decay_count if we
+ * switch back.
+ */
+ if (p->se.avg.decay_count) {
+ struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
+ __synchronize_entity_decay(&p->se);
+ subtract_blocked_load_contrib(cfs_rq,
+ p->se.avg.load_avg_contrib);
+ }
+#endif
}

And it has never changed since then. So your code must be based on an
incorrect kernel; please double-check.
>
> Thanks,
> Lei
>
>>> __synchronize_entity_decay(&p->se);
>>> subtract_blocked_load_contrib(cfs_rq,
>>> p->se.avg.load_avg_contrib);
>>
>>
>> --
>> Thanks
>> Alex


--
Thanks
Alex

2013-06-21 02:50:35

by Lei Wen

[permalink] [raw]
Subject: Re: [patch v8 4/9] sched: fix slept time double counting in enqueue entity

Alex,

On Fri, Jun 21, 2013 at 10:39 AM, Alex Shi <[email protected]> wrote:
> On 06/21/2013 10:30 AM, Lei Wen wrote:
>> Hi Alex,
>>
>> On Thu, Jun 20, 2013 at 10:59 PM, Alex Shi <[email protected]> wrote:
>>> On 06/20/2013 10:46 AM, Lei Wen wrote:
>>>>
>>>>
>>>> But here I have a question, there is another usage of __synchronzie_entity_decay
>>>> in current kernel, in the switched_from_fair function.
>>>>
>>>> If task being frequently switched between rt and fair class, would it
>>>> also bring the
>>>> redundant issue? Do we need patch like below?
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index b5408e1..9640c66 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -5856,7 +5856,7 @@ static void switched_from_fair(struct rq *rq,
>>>> struct task_struct *p)
>>>> */
>>>> if (p->se.avg.decay_count) {
>>>> struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
>>>> - se->avg.last_runnable_update +=
>>>> + p->se.avg.last_runnable_update +=
>>>
>>> what tree does this patch base on?
>>
>> I create the patch based on v3.9 kernel.
>> If it is ok, I could create a formal one based on latest kernel.
>
> In the linus tree, the first commit which introduced
> __synchronize_entity_decay(&p->se) here is
> 9ee474f55664ff63111c843099d365e7ecffb56f
> +
> +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
> + /*
> + * Remove our load from contribution when we leave sched_fair
> + * and ensure we don't carry in an old decay_count if we
> + * switch back.
> + */
> + if (p->se.avg.decay_count) {
> + struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
> + __synchronize_entity_decay(&p->se);
> + subtract_blocked_load_contrib(cfs_rq,
> + p->se.avg.load_avg_contrib);
> + }
> +#endif
> }
>
> And it is never changed from then on. So your code must based on a
> incorrect kernel. please do a double check.

I see your point... I made the mistake of updating the wrong patch...
Please help check this one.

commit 5fc3d5c74f8359ef382d9a20ffe657ffc237c109
Author: Lei Wen <[email protected]>
Date: Thu Jun 20 10:43:59 2013 +0800

sched: fix potential twice decay issue

Signed-off-by: Lei Wen <[email protected]>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c61a614..9640c66 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5856,7 +5856,8 @@ static void switched_from_fair(struct rq *rq,
struct task_struct *p)
*/
if (p->se.avg.decay_count) {
struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
- __synchronize_entity_decay(&p->se);
+ p->se.avg.last_runnable_update +=
+ __synchronize_entity_decay(&p->se);
subtract_blocked_load_contrib(cfs_rq,
p->se.avg.load_avg_contrib);
}

Thanks,
Lei


>>
>> Thanks,
>> Lei
>>
>>>> __synchronize_entity_decay(&p->se);
>>>> subtract_blocked_load_contrib(cfs_rq,
>>>> p->se.avg.load_avg_contrib);
>>>
>>>
>>> --
>>> Thanks
>>> Alex
>
>
> --
> Thanks
> Alex

2013-06-21 02:57:30

by Lei Wen

[permalink] [raw]
Subject: Re: [patch v8 3/9] sched: set initial value of runnable avg for new forked task

Morten,

On Thu, Jun 20, 2013 at 6:23 PM, Morten Rasmussen
<[email protected]> wrote:
> On Sat, Jun 15, 2013 at 01:09:12PM +0100, Lei Wen wrote:
>> On Fri, Jun 14, 2013 at 9:59 PM, Alex Shi <[email protected]> wrote:
>> > On 06/14/2013 06:02 PM, Lei Wen wrote:
>> >>> > enqueue_entity
>> >>> > enqueue_entity_load_avg
>> >>> >
>> >>> > and make forking balancing imbalance since incorrect load_avg_contrib.
>> >>> >
>> >>> > Further more, Morten Rasmussen notice some tasks were not launched at
>> >>> > once after created. So Paul and Peter suggest giving a start value for
>> >>> > new task runnable avg time same as sched_slice().
>> >> I am confused at this comment, how set slice to runnable avg would change
>> >> the behavior of "some tasks were not launched at once after created"?
>> >
>> > I also don't know the details on Morten's machine. but just guess, there
>> > are much tasks on in the run queue. the minimum load avg make the new
>> > task wait its time...
>>
>> Is there some possibility that since task structure is allocated without being
>> set to 0, and it cause the imbalance between runqueues. Then the new forked
>> is migrated to other cpus, so that it cause its execution being delayed?
>>
>> It is better for Morten to give us more details here. :)
>>
>
> I think Peter's reply pretty much covers it. The problem is when a task
> is not running (other task has lower vruntime or blocked for other
> reasons) shortly after the task was created. The runnable_avg_period is
> very small, so the load_contrib is very sensitive.
>
> Say if a task runs for 1 ms then is blocked for 1 ms and then runs
> again, the load_contrib will go from 100% to 50% instantly and then ramp
> back up again. So the task load may be quite different from the true
> load of the task depending on when you calculate the load_contrib.
>
> Preloading runnable_avg_period should make the load_contrib a little
> less sensitive to this behaviour.

Thanks for the detailed explanation!
Now I understand that the preloading value prevents the entity's load from
changing too quickly at its beginning.
But I cannot see how this explains "some tasks were not launched at once
after created".

Or should I understand the previous words as:
"Without the preloading value, the newly created task's contributed load may
vary too quickly at its beginning. For example, if a task runs for 1 ms, is
then blocked for 1 ms and then runs again, the load_contrib will go from 100%
to 50% instantly."


Thanks,
Lei

>
> Morten
>
>> Thanks,
>> Lei
>>
>> >>
>> >> IMHO, I could only tell that for the new forked task, it could be run if current
>> >> task already be set as need_resched, and preempt_schedule or
>> >> preempt_schedule_irq
>> >> is called.
>> >>
>> >> Since the set slice to avg behavior would not affect this task's vruntime,
>> >> and hence cannot make current running task be need_sched, if
>> >> previously it cannot.
>> >>
>> >> Could you help correct if I am wrong at somewhere? ....
>> >>
>> >> Thanks,
>> >
>> >
>> > --
>> > Thanks
>> > Alex
>>
>

2013-06-21 08:57:19

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 4/9] sched: fix slept time double counting in enqueue entity

On 06/21/2013 10:50 AM, Lei Wen wrote:
> I see your point... I made the mistake that update the wrong patch...
> Please help check this one.
>
> commit 5fc3d5c74f8359ef382d9a20ffe657ffc237c109
> Author: Lei Wen <[email protected]>
> Date: Thu Jun 20 10:43:59 2013 +0800
>
> sched: fix potential twice decay issue
>
> Signed-off-by: Lei Wen <[email protected]>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c61a614..9640c66 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5856,7 +5856,8 @@ static void switched_from_fair(struct rq *rq,
> struct task_struct *p)
> */
> if (p->se.avg.decay_count) {
> struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
> - __synchronize_entity_decay(&p->se);
> + p->se.avg.last_runnable_update +=
> + __synchronize_entity_decay(&p->se);

It is not needed, since last_runnable_update will be reset when the task
is switched back to the fair class.

--
Thanks
Alex

2013-06-21 09:18:23

by Lei Wen

[permalink] [raw]
Subject: Re: [patch v8 4/9] sched: fix slept time double counting in enqueue entity

Alex,

On Fri, Jun 21, 2013 at 4:56 PM, Alex Shi <[email protected]> wrote:
> On 06/21/2013 10:50 AM, Lei Wen wrote:
>> I see your point... I made the mistake that update the wrong patch...
>> Please help check this one.
>>
>> commit 5fc3d5c74f8359ef382d9a20ffe657ffc237c109
>> Author: Lei Wen <[email protected]>
>> Date: Thu Jun 20 10:43:59 2013 +0800
>>
>> sched: fix potential twice decay issue
>>
>> Signed-off-by: Lei Wen <[email protected]>
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c61a614..9640c66 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5856,7 +5856,8 @@ static void switched_from_fair(struct rq *rq,
>> struct task_struct *p)
>> */
>> if (p->se.avg.decay_count) {
>> struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
>> - __synchronize_entity_decay(&p->se);
>> + p->se.avg.last_runnable_update +=
>> + __synchronize_entity_decay(&p->se);
>
> it is not needed, since the last_runnable_update will be reset if it
> will be switched to fair.

When it comes back to the fair class, the only thing the scheduler would
do is check whether it could preempt the currently running task, so where
is last_runnable_update reset?

Thanks,
Lei

>
> --
> Thanks
> Alex

2013-06-21 11:09:28

by Alex Shi

[permalink] [raw]
Subject: Re: [patch v8 4/9] sched: fix slept time double counting in enqueue entity


>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index c61a614..9640c66 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5856,7 +5856,8 @@ static void switched_from_fair(struct rq *rq,
>>> struct task_struct *p)
>>> */
>>> if (p->se.avg.decay_count) {
>>> struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
>>> - __synchronize_entity_decay(&p->se);
>>> + p->se.avg.last_runnable_update +=
>>> + __synchronize_entity_decay(&p->se);
>>
>> it is not needed, since the last_runnable_update will be reset if it
>> will be switched to fair.
>
> When it is backed to fair class, the only thing scheduler would do
> is to check whether it could preempt current running task, so
> where to reset the last_runnable_update?

Here, when the task is enqueued to the cfs_rq:
avg.decay_count == 0, so the last line will reset last_runnable_update.

Anyway, you may consider adding a comment in the function, since it is a bit
hard to understand.

---
/* Add the load generated by se into cfs_rq's child load-average */
static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
struct sched_entity *se,
int wakeup)
{
/*
* We track migrations using entity decay_count <= 0, on a wake-up
* migration we use a negative decay count to track the remote decays
* accumulated while sleeping.
*/
if (unlikely(se->avg.decay_count <= 0)) {
se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;

>
> Thanks,
> Lei
>
>>
>> --
>> Thanks
>> Alex


--
Thanks
Alex

2013-06-21 13:26:10

by Lei Wen

[permalink] [raw]
Subject: Re: [patch v8 4/9] sched: fix slept time double counting in enqueue entity

On Fri, Jun 21, 2013 at 7:09 PM, Alex Shi <[email protected]> wrote:
>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index c61a614..9640c66 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -5856,7 +5856,8 @@ static void switched_from_fair(struct rq *rq,
>>>> struct task_struct *p)
>>>> */
>>>> if (p->se.avg.decay_count) {
>>>> struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
>>>> - __synchronize_entity_decay(&p->se);
>>>> + p->se.avg.last_runnable_update +=
>>>> + __synchronize_entity_decay(&p->se);
>>>
>>> it is not needed, since the last_runnable_update will be reset if it
>>> will be switched to fair.
>>
>> When it is backed to fair class, the only thing scheduler would do
>> is to check whether it could preempt current running task, so
>> where to reset the last_runnable_update?
>
> here, when the task enqueue to cfs_rq.
> avg.decay_count == 0, so the last line will reset last_runnable_update.
>
> Anyway, you may consider to add a comment in the function, since it is bit hard
> to understand for you.
>
> ---
> /* Add the load generated by se into cfs_rq's child load-average */
> static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
> struct sched_entity *se,
> int wakeup)
> {
> /*
> * We track migrations using entity decay_count <= 0, on a wake-up
> * migration we use a negative decay count to track the remote decays
> * accumulated while sleeping.
> */
> if (unlikely(se->avg.decay_count <= 0)) {
> se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;


I see...
I didn't notice that before switched_from_fair(), dequeue and enqueue have
already been called.
Since it works as expected, I have no further questions.

Thanks for your detailed explanation! :)

Thanks,
Lei

>
>>
>> Thanks,
>> Lei
>>
>>>
>>> --
>>> Thanks
>>> Alex
>
>
> --
> Thanks
> Alex