2013-05-06 01:46:13

by Alex Shi

[permalink] [raw]
Subject: [PATCH v5 0/7] use runnable load avg in load balance

This patch set is based on tip/sched/core.

It fixes a UP config bug. Also, the last patch has changed: it now feeds
the runnable load avg into effective_load(), so wake_affine() considers
the load avg via effective_load().

I retested on Intel Core2, NHM, SNB and IVB machines with 2 and 4 sockets,
using the kbuild, aim7, dbench, tbench, hackbench and fileio-cfq (sysbench)
benchmarks, and also tested pthread_cond_broadcast on SNB.

The results are similar to the last version, so the clear changes are the
same: on the 4-socket SNB EP machine, hackbench improved by about 50% and
the results became stable; on other machines, hackbench improved by about
2~5%. There is no clear performance change on the other benchmarks.

Since the change is small and my results are similar to the last version,
I guess Michael and Morten can still see the same advantages.

Anyway, you can find their results for the last version here:
https://lkml.org/lkml/2013/4/2/1022

Michael Wang tested the previous version with pgbench on his box:
http://comments.gmane.org/gmane.linux.kernel/1463371

Morten tested the previous version with some benchmarks.

Thanks again to Peter for his comments!

Regards!
Alex

[PATCH v5 1/7] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"
[PATCH v5 2/7] sched: remove SMP cover for runnable variables in cfs_rq
[PATCH v5 3/7] sched: set initial value of runnable avg for new forked task
[PATCH v5 4/7] sched: update cpu load after task_tick.
[PATCH v5 5/7] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
[PATCH v5 6/7] sched: consider runnable load average in move_tasks
[PATCH v5 7/7] sched: consider runnable load average in effective_load


2013-05-06 01:46:16

by Alex Shi

[permalink] [raw]
Subject: [PATCH v5 1/7] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"

Remove the CONFIG_FAIR_GROUP_SCHED guard that covers the runnable info,
so that we can use the runnable load variables.

Signed-off-by: Alex Shi <[email protected]>
---
include/linux/sched.h | 7 +------
kernel/sched/core.c | 7 +------
kernel/sched/fair.c | 13 ++-----------
kernel/sched/sched.h | 9 +--------
4 files changed, 5 insertions(+), 31 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e692a02..9539597 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1161,12 +1161,7 @@ struct sched_entity {
struct cfs_rq *my_q;
#endif

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/* Per-entity load-tracking */
struct sched_avg avg;
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67d0465..c8db984 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1563,12 +1563,7 @@ static void __sched_fork(struct task_struct *p)
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
p->se.avg.runnable_avg_period = 0;
p->se.avg.runnable_avg_sum = 0;
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..9c2f726 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1109,8 +1109,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_FAIR_GROUP_SCHED */

-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/*
* We choose a half-life close to 1 scheduling period.
* Note: The tables below are dependent on this value.
@@ -3394,12 +3393,6 @@ unlock:
}

/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
* Called immediately before a task is migrated to a new cpu; task_cpu(p) and
* cfs_rq_of(p) references at time of call are still valid and identify the
* previous cpu. However, the caller only guarantees p->pi_lock is held; no
@@ -3422,7 +3415,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
}
}
-#endif
#endif /* CONFIG_SMP */

static unsigned long
@@ -6114,9 +6106,8 @@ const struct sched_class fair_sched_class = {

#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
.migrate_task_rq = migrate_task_rq_fair,
-#endif
+
.rq_online = rq_online_fair,
.rq_offline = rq_offline_fair,

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cc03cfd..7f36024f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -227,12 +227,6 @@ struct cfs_rq {
#endif

#ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
/*
* CFS Load tracking
* Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -242,8 +236,7 @@ struct cfs_rq {
u64 runnable_load_avg, blocked_load_avg;
atomic64_t decay_counter, removed_load;
u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
u32 tg_runnable_contrib;
u64 tg_load_contrib;
--
1.7.12

2013-05-06 01:46:25

by Alex Shi

[permalink] [raw]
Subject: [PATCH v5 4/7] sched: update cpu load after task_tick.

To get the latest runnable info, we need to do this cpu load update
after task_tick().

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ecec7f1..33bcebf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2692,8 +2692,8 @@ void scheduler_tick(void)

raw_spin_lock(&rq->lock);
update_rq_clock(rq);
- update_cpu_load_active(rq);
curr->sched_class->task_tick(rq, curr, 0);
+ update_cpu_load_active(rq);
raw_spin_unlock(&rq->lock);

perf_event_task_tick();
--
1.7.12

2013-05-06 01:46:32

by Alex Shi

[permalink] [raw]
Subject: [PATCH v5 6/7] sched: consider runnable load average in move_tasks

Besides using the runnable load average in the background, move_tasks()
is also a key function in load balance. We need to consider the runnable
load average in it as well, in order to get an apples-to-apples load
comparison.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0bf88e8..790e23d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3966,6 +3966,12 @@ static unsigned long task_h_load(struct task_struct *p);

static const unsigned int sched_nr_migrate_break = 32;

+static unsigned long task_h_load_avg(struct task_struct *p)
+{
+ return div_u64(task_h_load(p) * (u64)p->se.avg.runnable_avg_sum,
+ p->se.avg.runnable_avg_period + 1);
+}
+
/*
* move_tasks tries to move up to imbalance weighted load from busiest to
* this_rq, as part of a balancing operation within domain "sd".
@@ -4001,7 +4007,7 @@ static int move_tasks(struct lb_env *env)
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
goto next;

- load = task_h_load(p);
+ load = task_h_load_avg(p);

if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
goto next;
--
1.7.12
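
For reference, the scaling applied by task_h_load_avg() above is simply the
task's hierarchical load multiplied by its runnable fraction. The stand-alone
sketch below uses made-up numbers and is not part of the patch; it only mirrors
the arithmetic:

#include <stdio.h>

/* Mirrors the scaling in task_h_load_avg(): hierarchical load times
 * runnable fraction. The +1 avoids a division by zero for a task that
 * has not accumulated any period yet.
 */
static unsigned long scaled_h_load(unsigned long h_load,
                                   unsigned long runnable_sum,
                                   unsigned long runnable_period)
{
        return h_load * runnable_sum / (runnable_period + 1);
}

int main(void)
{
        /* a task with h_load 1024 that was runnable ~50% of the time */
        printf("%lu\n", scaled_h_load(1024, 512, 1023));   /* ~512 */
        /* a mostly idle task contributes very little */
        printf("%lu\n", scaled_h_load(1024, 16, 1023));    /* ~16  */
        return 0;
}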

2013-05-06 01:46:20

by Alex Shi

[permalink] [raw]
Subject: [PATCH v5 2/7] sched: remove SMP cover for runnable variables in cfs_rq

The following variables were covered by CONFIG_SMP in struct cfs_rq,
but similar runnable variables, like rq->avg and task_group->load_avg,
already work for UP in struct rq and task_group.
So move them out; they can also work with UP.

u64 runnable_load_avg, blocked_load_avg;
atomic64_t decay_counter, removed_load;
u64 last_decay;

u32 tg_runnable_contrib;
u64 tg_load_contrib;

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/sched.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7f36024f..1a02b90 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -226,7 +226,6 @@ struct cfs_rq {
unsigned int nr_spread_over;
#endif

-#ifdef CONFIG_SMP
/*
* CFS Load tracking
* Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -242,6 +241,7 @@ struct cfs_rq {
u64 tg_load_contrib;
#endif /* CONFIG_FAIR_GROUP_SCHED */

+#ifdef CONFIG_SMP
/*
* h_load = weight * f(tg)
*
--
1.7.12

2013-05-06 01:46:30

by Alex Shi

[permalink] [raw]
Subject: [PATCH v5 5/7] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

These are the base values used in load balance; update them with the
rq runnable load average, then load balance will naturally consider
the runnable load avg.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/core.c | 4 ++--
kernel/sched/fair.c | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 33bcebf..2f51636 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2536,7 +2536,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
void update_idle_cpu_load(struct rq *this_rq)
{
unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
- unsigned long load = this_rq->load.weight;
+ unsigned long load = (unsigned long)this_rq->cfs.runnable_load_avg;
unsigned long pending_updates;

/*
@@ -2586,7 +2586,7 @@ static void update_cpu_load_active(struct rq *this_rq)
* See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
*/
this_rq->last_load_update_tick = jiffies;
- __update_cpu_load(this_rq, this_rq->load.weight, 1);
+ __update_cpu_load(this_rq, this_rq->cfs.runnable_load_avg, 1);

calc_load_account_active(this_rq);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2881d42..0bf88e8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2900,7 +2900,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* Used instead of source_load when we know the type == 0 */
static unsigned long weighted_cpuload(const int cpu)
{
- return cpu_rq(cpu)->load.weight;
+ return (unsigned long)cpu_rq(cpu)->cfs.runnable_load_avg;
}

/*
@@ -2947,7 +2947,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
unsigned long nr_running = ACCESS_ONCE(rq->nr_running);

if (nr_running)
- return rq->load.weight / nr_running;
+ return (unsigned long)rq->cfs.runnable_load_avg / nr_running;

return 0;
}
--
1.7.12

2013-05-06 01:46:36

by Alex Shi

[permalink] [raw]
Subject: [PATCH v5 7/7] sched: consider runnable load average in effective_load

effective_load() calculates the load change as seen from the
root_task_group. It needs to use the runnable average of the
changed task.

Thanks to Morten Rasmussen and PeterZ for the reminder about this.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 790e23d..6f4f14b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2976,15 +2976,15 @@ static void task_waking_fair(struct task_struct *p)

#ifdef CONFIG_FAIR_GROUP_SCHED
/*
- * effective_load() calculates the load change as seen from the root_task_group
+ * effective_load() calculates load avg change as seen from the root_task_group
*
* Adding load to a group doesn't make a group heavier, but can cause movement
* of group shares between cpus. Assuming the shares were perfectly aligned one
* can calculate the shift in shares.
*
- * Calculate the effective load difference if @wl is added (subtracted) to @tg
- * on this @cpu and results in a total addition (subtraction) of @wg to the
- * total group weight.
+ * Calculate the effective load avg difference if @wl is added (subtracted) to
+ * @tg on this @cpu and results in a total addition (subtraction) of @wg to the
+ * total group load avg.
*
* Given a runqueue weight distribution (rw_i) we can compute a shares
* distribution (s_i) using:
@@ -2998,7 +2998,7 @@ static void task_waking_fair(struct task_struct *p)
* rw_i = { 2, 4, 1, 0 }
* s_i = { 2/7, 4/7, 1/7, 0 }
*
- * As per wake_affine() we're interested in the load of two CPUs (the CPU the
+ * As per wake_affine() we're interested in load avg of two CPUs (the CPU the
* task used to run on and the CPU the waker is running on), we need to
* compute the effect of waking a task on either CPU and, in case of a sync
* wakeup, compute the effect of the current task going to sleep.
@@ -3008,20 +3008,20 @@ static void task_waking_fair(struct task_struct *p)
*
* s'_i = (rw_i + @wl) / (@wg + \Sum rw_j) (2)
*
- * Suppose we're interested in CPUs 0 and 1, and want to compute the load
+ * Suppose we're interested in CPUs 0 and 1, and want to compute the load avg
* differences in waking a task to CPU 0. The additional task changes the
* weight and shares distributions like:
*
* rw'_i = { 3, 4, 1, 0 }
* s'_i = { 3/8, 4/8, 1/8, 0 }
*
- * We can then compute the difference in effective weight by using:
+ * We can then compute the difference in effective load avg by using:
*
* dw_i = S * (s'_i - s_i) (3)
*
* Where 'S' is the group weight as seen by its parent.
*
- * Therefore the effective change in loads on CPU 0 would be 5/56 (3/8 - 2/7)
+ * Therefore the effective change in load avg on CPU 0 would be 5/56 (3/8 - 2/7)
* times the weight of the group. The effect on CPU 1 would be -4/56 (4/8 -
* 4/7) times the weight of the group.
*/
@@ -3045,7 +3045,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
/*
* w = rw_i + @wl
*/
- w = se->my_q->load.weight + wl;
+ w = se->my_q->tg_load_contrib + wl;

/*
* wl = S * s'_i; see (2)
@@ -3066,7 +3066,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
/*
* wl = dw_i = S * (s'_i - s_i); see (3)
*/
- wl -= se->load.weight;
+ wl -= se->avg.load_avg_contrib;

/*
* Recursively apply this logic to all parent groups to compute
@@ -3112,14 +3112,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
*/
if (sync) {
tg = task_group(current);
- weight = current->se.load.weight;
+ weight = current->se.avg.load_avg_contrib;

this_load += effective_load(tg, this_cpu, -weight, -weight);
load += effective_load(tg, prev_cpu, 0, -weight);
}

tg = task_group(p);
- weight = p->se.load.weight;
+ weight = p->se.avg.load_avg_contrib;

/*
* In low-load situations, where prev_cpu is idle and this_cpu is idle
--
1.7.12

2013-05-06 01:46:23

by Alex Shi

[permalink] [raw]
Subject: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

We need to initialize se.avg.{decay_count, load_avg_contrib} for a
newly forked task.
Otherwise random values in these variables cause a mess when the new
task is enqueued:
enqueue_task_fair
enqueue_entity
enqueue_entity_load_avg

and the incorrect load_avg_contrib makes fork balancing imbalanced.

Set avg.decay_count = 0 and avg.load_avg_contrib = se->load.weight to
resolve these issues.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/core.c | 6 ++++++
kernel/sched/fair.c | 4 ++++
2 files changed, 10 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c8db984..ecec7f1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1566,6 +1566,7 @@ static void __sched_fork(struct task_struct *p)
#ifdef CONFIG_SMP
p->se.avg.runnable_avg_period = 0;
p->se.avg.runnable_avg_sum = 0;
+ p->se.avg.decay_count = 0;
#endif
#ifdef CONFIG_SCHEDSTATS
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
@@ -1653,6 +1654,11 @@ void sched_fork(struct task_struct *p)
p->sched_reset_on_fork = 0;
}

+ /* New forked task assumed with full utilization */
+#if defined(CONFIG_SMP)
+ p->se.avg.load_avg_contrib = p->se.load.weight;
+#endif
+
if (!rt_prio(p->prio))
p->sched_class = &fair_sched_class;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9c2f726..2881d42 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1508,6 +1508,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
* We track migrations using entity decay_count <= 0, on a wake-up
* migration we use a negative decay count to track the remote decays
* accumulated while sleeping.
+ *
+ * When enqueue a new forked task, the se->avg.decay_count == 0, so
+ * we bypass update_entity_load_avg(), use avg.load_avg_contrib initial
+ * value: se->load.weight.
*/
if (unlikely(se->avg.decay_count <= 0)) {
se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
--
1.7.12
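
A side note on why the explicit load_avg_contrib initialization above matters:
the regular contribution update derives load_avg_contrib from the runnable
sums, which are zero for a task that has never run. The sketch below is only
an illustration of that arithmetic (it assumes a nice-0 weight of 1024 and
mirrors, rather than reproduces, the kernel's fixed-point code):

#include <stdio.h>

/* Rough model of the contribution formula:
 *   contrib = weight * runnable_avg_sum / (runnable_avg_period + 1)
 */
static unsigned long contrib(unsigned long weight,
                             unsigned long sum, unsigned long period)
{
        return weight * sum / (period + 1);
}

int main(void)
{
        /* Freshly forked task: sum == period == 0, so the formula alone
         * would report zero load. The patch instead presets
         * load_avg_contrib to the full weight, i.e. it assumes the new
         * task is fully busy until real history accumulates.
         */
        printf("derived contrib: %lu\n", contrib(1024, 0, 0));      /* 0 */
        printf("preset  contrib: %lu\n", 1024UL);                   /* full */

        /* Charging a minimum period/sum of 1024, as discussed later in
         * the thread, yields nearly the same value: 1024*1024/1025 = 1023.
         */
        printf("minimum-period : %lu\n", contrib(1024, 1024, 1024));
        return 0;
}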

2013-05-06 05:23:47

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH v5 7/7] sched: consider runnable load average in effective_load

Hi, Alex

On 05/06/2013 09:45 AM, Alex Shi wrote:
> effective_load calculates the load change as seen from the
> root_task_group. It needs to engage the runnable average
> of changed task.
[snip]
> */
> @@ -3045,7 +3045,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
> /*
> * w = rw_i + @wl
> */
> - w = se->my_q->load.weight + wl;
> + w = se->my_q->tg_load_contrib + wl;

I've tested the patch set; it seems the last patch caused a big
regression on pgbench:

                         base        patch 1~6   patch 1~7
| db_size | clients |   tps   |     |   tps   | |   tps   |
+---------+---------+---------+     +---------+ +---------+
| 22 MB   |      32 |  43420  |     |  53387  | |  41625  |

I guess something magic happened in effective_load() while calculating
group decay combined with load decay; what's your opinion?

Regards,
Michael Wang

>
> /*
> * wl = S * s'_i; see (2)
> @@ -3066,7 +3066,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
> /*
> * wl = dw_i = S * (s'_i - s_i); see (3)
> */
> - wl -= se->load.weight;
> + wl -= se->avg.load_avg_contrib;
>
> /*
> * Recursively apply this logic to all parent groups to compute
> @@ -3112,14 +3112,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> */
> if (sync) {
> tg = task_group(current);
> - weight = current->se.load.weight;
> + weight = current->se.avg.load_avg_contrib;
>
> this_load += effective_load(tg, this_cpu, -weight, -weight);
> load += effective_load(tg, prev_cpu, 0, -weight);
> }
>
> tg = task_group(p);
> - weight = p->se.load.weight;
> + weight = p->se.avg.load_avg_contrib;
>
> /*
> * In low-load situations, where prev_cpu is idle and this_cpu is idle
>

2013-05-06 05:40:51

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 7/7] sched: consider runnable load average in effective_load

On 05/06/2013 11:34 AM, Michael Wang wrote:
>> > @@ -3045,7 +3045,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
>> > /*
>> > * w = rw_i + @wl
>> > */
>> > - w = se->my_q->load.weight + wl;
>> > + w = se->my_q->tg_load_contrib + wl;
> I've tested the patch set, seems like the last patch caused big
> regression on pgbench:
>
> base patch 1~6 patch 1~7
> | db_size | clients | tps | | tps | | tps |
> +---------+---------+-------+ +-------+ +-------+
> | 22 MB | 32 | 43420 | | 53387 | | 41625 |
>
> I guess some magic thing happened in effective_load() while calculating
> group decay combined with load decay, what's your opinion?


Thanks for testing, Michael!

Maybe two fixes are worth trying.

1, change tg_weight in calc_tg_weight() back to using tg_load_contrib instead of the direct load.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6f4f14b..c770f8d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1037,8 +1037,8 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
* update_cfs_rq_load_contribution().
*/
tg_weight = atomic64_read(&tg->load_avg);
- tg_weight -= cfs_rq->tg_load_contrib;
- tg_weight += cfs_rq->load.weight;
+ //tg_weight -= cfs_rq->tg_load_contrib;
+ //tg_weight += cfs_rq->load.weight;

return tg_weight;
}

2, another try is to follow the current calc_tg_weight(), i.e. remove the following change.

>> > @@ -3045,7 +3045,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
>> > /*
>> > * w = rw_i + @wl
>> > */
>> > - w = se->my_q->load.weight + wl;
>> > + w = se->my_q->tg_load_contrib + wl;

Would you like to try them?


--
Thanks
Alex

2013-05-06 06:23:38

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [PATCH v5 2/7] sched: remove SMP cover for runnable variables in cfs_rq

Hi Alex,

You might want to do the below for struct sched_entity also?
AFAIK, struct sched_entity has struct sched_avg under CONFIG_SMP.

Regards
Preeti U Murthy

On 05/06/2013 07:15 AM, Alex Shi wrote:
> The following variables were covered under CONFIG_SMP in struct cfs_rq.
> but similar runnable variables work for UP in struct rq and task_group.
> like rq->avg, task_group->load_avg.
> So move them out, they also can work with UP.
>
> u64 runnable_load_avg, blocked_load_avg;
> atomic64_t decay_counter, removed_load;
> u64 last_decay;
>
> u32 tg_runnable_contrib;
> u64 tg_load_contrib;
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/sched.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 7f36024f..1a02b90 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -226,7 +226,6 @@ struct cfs_rq {
> unsigned int nr_spread_over;
> #endif
>
> -#ifdef CONFIG_SMP
> /*
> * CFS Load tracking
> * Under CFS, load is tracked on a per-entity basis and aggregated up.
> @@ -242,6 +241,7 @@ struct cfs_rq {
> u64 tg_load_contrib;
> #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> +#ifdef CONFIG_SMP
> /*
> * h_load = weight * f(tg)
> *
>

2013-05-06 07:18:38

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 2/7] sched: remove SMP cover for runnable variables in cfs_rq

On 05/06/2013 12:11 PM, Preeti U Murthy wrote:
> Hi Alex,
>
> You might want to do the below for struct sched_entity also?
> AFAIK,struct sched_entity has struct sched_avg under CONFIG_SMP.
>

Oops, why did it pass my UP config build? Anyway, I will check and test more.
Thanks Preeti!
--
Thanks
Alex

2013-05-06 07:50:04

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH v5 7/7] sched: consider runnable load average in effective_load

On 05/06/2013 01:39 PM, Alex Shi wrote:
[snip]

Rough test done:

>
> 1, change back the tg_weight in calc_tg_weight() to use tg_load_contrib not direct load.

This way stops the regression from patch 7.

>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6f4f14b..c770f8d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1037,8 +1037,8 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
> * update_cfs_rq_load_contribution().
> */
> tg_weight = atomic64_read(&tg->load_avg);
> - tg_weight -= cfs_rq->tg_load_contrib;
> - tg_weight += cfs_rq->load.weight;
> + //tg_weight -= cfs_rq->tg_load_contrib;
> + //tg_weight += cfs_rq->load.weight;
>
> return tg_weight;
> }
>
> 2, another try is follow the current calc_tg_weight, so remove the follow change.

This way shows even better results than patch 1~6 alone.

But the way Preeti suggested doesn't work...

Maybe we should record some explanation about this change here, shouldn't we?

Regards,
Michael Wang

>
>>>> @@ -3045,7 +3045,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
>>>> /*
>>>> * w = rw_i + @wl
>>>> */
>>>> - w = se->my_q->load.weight + wl;
>>>> + w = se->my_q->tg_load_contrib + wl;
>
> Would you like to try them?
>
>

2013-05-06 08:02:15

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v5 2/7] sched: remove SMP cover for runnable variables in cfs_rq

On Sun, May 5, 2013 at 6:45 PM, Alex Shi <[email protected]> wrote:
> The following variables were covered under CONFIG_SMP in struct cfs_rq.
> but similar runnable variables work for UP in struct rq and task_group.
> like rq->avg, task_group->load_avg.
> So move them out, they also can work with UP.

Is there a proposed use-case for UP? My apologies if I missed it in
an alternate patch.

It would seem the only possibly useful thing there would be the
per-rq average for p-state selection; but we can get that without the
per-entity values already.

>
> u64 runnable_load_avg, blocked_load_avg;
> atomic64_t decay_counter, removed_load;
> u64 last_decay;
>
> u32 tg_runnable_contrib;
> u64 tg_load_contrib;
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/sched.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 7f36024f..1a02b90 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -226,7 +226,6 @@ struct cfs_rq {
> unsigned int nr_spread_over;
> #endif
>
> -#ifdef CONFIG_SMP
> /*
> * CFS Load tracking
> * Under CFS, load is tracked on a per-entity basis and aggregated up.
> @@ -242,6 +241,7 @@ struct cfs_rq {
> u64 tg_load_contrib;
> #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> +#ifdef CONFIG_SMP
> /*
> * h_load = weight * f(tg)
> *
> --
> 1.7.12
>

2013-05-06 08:03:26

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 7/7] sched: consider runnable load average in effective_load

On 05/06/2013 03:49 PM, Michael Wang wrote:
> On 05/06/2013 01:39 PM, Alex Shi wrote:
> [snip]
>
> Rough test done:
>
>>
>> 1, change back the tg_weight in calc_tg_weight() to use tg_load_contrib not direct load.
>
> This way stop the regression of patch 7.
>
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 6f4f14b..c770f8d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1037,8 +1037,8 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
>> * update_cfs_rq_load_contribution().
>> */
>> tg_weight = atomic64_read(&tg->load_avg);
>> - tg_weight -= cfs_rq->tg_load_contrib;
>> - tg_weight += cfs_rq->load.weight;
>> + //tg_weight -= cfs_rq->tg_load_contrib;
>> + //tg_weight += cfs_rq->load.weight;
>>
>> return tg_weight;
>> }
>>
>> 2, another try is follow the current calc_tg_weight, so remove the follow change.
>
> This way show even better results than only patch 1~6.

How much better is it than the first change?
>
> But the way Preeti suggested doesn't works...

What's Preeti's suggestion? :)
>
> May be we should record some explanation about this change here, do we?

I don't know why we need this. PJT, would you like to tell us why
calc_tg_weight() uses cfs_rq->load.weight rather than cfs_rq->tg_load_contrib?


>
> Regards,
> Michael Wang
>
>>
>>>>> @@ -3045,7 +3045,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
>>>>> /*
>>>>> * w = rw_i + @wl
>>>>> */
>>>>> - w = se->my_q->load.weight + wl;
>>>>> + w = se->my_q->tg_load_contrib + wl;
>>
>> Would you like to try them?
>>
>>
>


--
Thanks
Alex

2013-05-06 08:20:05

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

On Sun, May 5, 2013 at 6:45 PM, Alex Shi <[email protected]> wrote:
> We need initialize the se.avg.{decay_count, load_avg_contrib} for a
> new forked task.
> Otherwise random values of above variables cause mess when do new task
> enqueue:
> enqueue_task_fair
> enqueue_entity
> enqueue_entity_load_avg
>
> and make forking balancing imbalance since incorrect load_avg_contrib.
>
> set avg.decay_count = 0, and avg.load_avg_contrib = se->load.weight to
> resolve such issues.
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/core.c | 6 ++++++
> kernel/sched/fair.c | 4 ++++
> 2 files changed, 10 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c8db984..ecec7f1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1566,6 +1566,7 @@ static void __sched_fork(struct task_struct *p)
> #ifdef CONFIG_SMP
> p->se.avg.runnable_avg_period = 0;
> p->se.avg.runnable_avg_sum = 0;
> + p->se.avg.decay_count = 0;
> #endif
> #ifdef CONFIG_SCHEDSTATS
> memset(&p->se.statistics, 0, sizeof(p->se.statistics));
> @@ -1653,6 +1654,11 @@ void sched_fork(struct task_struct *p)
> p->sched_reset_on_fork = 0;
> }
>
> + /* New forked task assumed with full utilization */
> +#if defined(CONFIG_SMP)
> + p->se.avg.load_avg_contrib = p->se.load.weight;

This is missing a scale_load() right? Further: Why not put this in
__sched_fork?

We should also charge a minimum period to make the numbers a little
more kosher, e.g.:
+ p->se.avg.runnable_avg_period = 1024;
+ p->se.avg.runnable_avg_sum = 1024;

Rather than exposing the representation of load_avg_contrib to
__sched_fork it might also be better to call:
__update_task_entity_contrib(&p->se)
after the initialization above; this would also avoid potential bugs
like the missing scale_load() above.


> +#endif
> +
> if (!rt_prio(p->prio))
> p->sched_class = &fair_sched_class;
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9c2f726..2881d42 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1508,6 +1508,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
> * We track migrations using entity decay_count <= 0, on a wake-up
> * migration we use a negative decay count to track the remote decays
> * accumulated while sleeping.
> + *
> + * When enqueue a new forked task, the se->avg.decay_count == 0, so
> + * we bypass update_entity_load_avg(), use avg.load_avg_contrib initial
> + * value: se->load.weight.
> */
> if (unlikely(se->avg.decay_count <= 0)) {
> se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
> --
> 1.7.12
>

2013-05-06 08:24:56

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v5 1/7] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"

On Sun, May 5, 2013 at 6:45 PM, Alex Shi <[email protected]> wrote:
> Remove CONFIG_FAIR_GROUP_SCHED that covers the runnable info, then
> we can use runnable load variables.
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> include/linux/sched.h | 7 +------
> kernel/sched/core.c | 7 +------
> kernel/sched/fair.c | 13 ++-----------
> kernel/sched/sched.h | 9 +--------
> 4 files changed, 5 insertions(+), 31 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e692a02..9539597 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1161,12 +1161,7 @@ struct sched_entity {
> struct cfs_rq *my_q;
> #endif
>
> -/*
> - * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
> - * removed when useful for applications beyond shares distribution (e.g.
> - * load-balance).
> - */
> -#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
> +#ifdef CONFIG_SMP
> /* Per-entity load-tracking */
> struct sched_avg avg;
> #endif
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 67d0465..c8db984 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1563,12 +1563,7 @@ static void __sched_fork(struct task_struct *p)
> p->se.vruntime = 0;
> INIT_LIST_HEAD(&p->se.group_node);
>
> -/*
> - * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
> - * removed when useful for applications beyond shares distribution (e.g.
> - * load-balance).
> - */
> -#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
> +#ifdef CONFIG_SMP
> p->se.avg.runnable_avg_period = 0;
> p->se.avg.runnable_avg_sum = 0;
> #endif
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a33e59..9c2f726 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1109,8 +1109,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
> }
> #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> -/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
> -#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
> +#ifdef CONFIG_SMP
> /*
> * We choose a half-life close to 1 scheduling period.
> * Note: The tables below are dependent on this value.
> @@ -3394,12 +3393,6 @@ unlock:
> }
>
> /*
> - * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
> - * removed when useful for applications beyond shares distribution (e.g.
> - * load-balance).
> - */
> -#ifdef CONFIG_FAIR_GROUP_SCHED
> -/*
> * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
> * cfs_rq_of(p) references at time of call are still valid and identify the
> * previous cpu. However, the caller only guarantees p->pi_lock is held; no
> @@ -3422,7 +3415,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
> atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
> }
> }
> -#endif
> #endif /* CONFIG_SMP */
>
> static unsigned long
> @@ -6114,9 +6106,8 @@ const struct sched_class fair_sched_class = {
>
> #ifdef CONFIG_SMP
> .select_task_rq = select_task_rq_fair,
> -#ifdef CONFIG_FAIR_GROUP_SCHED
> .migrate_task_rq = migrate_task_rq_fair,
> -#endif
> +
> .rq_online = rq_online_fair,
> .rq_offline = rq_offline_fair,
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index cc03cfd..7f36024f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -227,12 +227,6 @@ struct cfs_rq {
> #endif
>
> #ifdef CONFIG_SMP
> -/*
> - * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
> - * removed when useful for applications beyond shares distribution (e.g.
> - * load-balance).
> - */
> -#ifdef CONFIG_FAIR_GROUP_SCHED
> /*
> * CFS Load tracking
> * Under CFS, load is tracked on a per-entity basis and aggregated up.
> @@ -242,8 +236,7 @@ struct cfs_rq {
> u64 runnable_load_avg, blocked_load_avg;
> atomic64_t decay_counter, removed_load;
> u64 last_decay;
> -#endif /* CONFIG_FAIR_GROUP_SCHED */
> -/* These always depend on CONFIG_FAIR_GROUP_SCHED */

We should perhaps replace this with a comment that these are only
needed to aggregate the point-wise representation in the
FAIR_GROUP_SCHED case.

> +
> #ifdef CONFIG_FAIR_GROUP_SCHED
> u32 tg_runnable_contrib;
> u64 tg_load_contrib;
> --
> 1.7.12
>

2013-05-06 08:34:23

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH v5 7/7] sched: consider runnable load average in effective_load

On 05/06/2013 04:02 PM, Alex Shi wrote:
> On 05/06/2013 03:49 PM, Michael Wang wrote:
>> On 05/06/2013 01:39 PM, Alex Shi wrote:
>> [snip]
>>
>> Rough test done:
>>
>>>
>>> 1, change back the tg_weight in calc_tg_weight() to use tg_load_contrib not direct load.
>>
>> This way stop the regression of patch 7.
>>
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 6f4f14b..c770f8d 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -1037,8 +1037,8 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
>>> * update_cfs_rq_load_contribution().
>>> */
>>> tg_weight = atomic64_read(&tg->load_avg);
>>> - tg_weight -= cfs_rq->tg_load_contrib;
>>> - tg_weight += cfs_rq->load.weight;
>>> + //tg_weight -= cfs_rq->tg_load_contrib;
>>> + //tg_weight += cfs_rq->load.weight;
>>>
>>> return tg_weight;
>>> }
>>>
>>> 2, another try is follow the current calc_tg_weight, so remove the follow change.
>>
>> This way show even better results than only patch 1~6.
>
> how much better to the first change?

Never mind, it's just a rough test; consider them the same...

>>
>> But the way Preeti suggested doesn't works...
>
> What's the Preeti suggestion? :)

Pasted at the end.

>>
>> May be we should record some explanation about this change here, do we?
>
> I don't know why we need this, PJT, would you like to tell us why the
> calc_tg_weight use cfs_rq->load.weight not cfs_rq->tg_load_contrib?

The comment said this is more accurate, but that was for the world
without load decay, I suppose...

But if it uses 'cfs_rq->load.weight', which means the denominator M contains
that factor, then the numerator w has to contain it also...

Regards,
Michael Wang

>
>
>>
>> Regards,
>> Michael Wang
>>

sched: Modify effective_load() to use runnable load average

From: Preeti U Murthy <[email protected]>

The runqueue weight distribution should update the runnable load average of
the cfs_rq on which the task will be woken up.

However since the computation of se->load.weight takes into consideration
the runnable load average in update_cfs_shares(), no need to modify this in
effective_load().
---
kernel/sched/fair.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 790e23d..5489022 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3045,7 +3045,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
/*
* w = rw_i + @wl
*/
- w = se->my_q->load.weight + wl;
+ w = se->my_q->runnable_load_avg + wl;

/*
* wl = S * s'_i; see (2)
@@ -3066,6 +3066,9 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
/*
* wl = dw_i = S * (s'_i - s_i); see (3)
*/
+ /* Do not modify the below as it already contains runnable
+ * load average in its computation
+ */
wl -= se->load.weight;

/*
@@ -3112,14 +3115,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
*/
if (sync) {
tg = task_group(current);
- weight = current->se.load.weight;
+ weight = current->se.avg.load_avg_contrib;

this_load += effective_load(tg, this_cpu, -weight, -weight);
load += effective_load(tg, prev_cpu, 0, -weight);
}

tg = task_group(p);
- weight = p->se.load.weight;
+ weight = p->se.avg.load_avg_contrib;

/*
* In low-load situations, where prev_cpu is idle and this_cpu is idle


Regards
Preeti U Murthy



2013-05-06 08:46:52

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v5 5/7] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On Sun, May 5, 2013 at 6:45 PM, Alex Shi <[email protected]> wrote:
> They are the base values in load balance, update them with rq runnable
> load average, then the load balance will consider runnable load avg
> naturally.
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/core.c | 4 ++--
> kernel/sched/fair.c | 4 ++--
> 2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 33bcebf..2f51636 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2536,7 +2536,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
> void update_idle_cpu_load(struct rq *this_rq)
> {
> unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
> - unsigned long load = this_rq->load.weight;
> + unsigned long load = (unsigned long)this_rq->cfs.runnable_load_avg;

We should be minimizing:
Variance[ for all i ]{ cfs_rq[i]->runnable_load_avg +
cfs_rq[i]->blocked_load_avg }

blocked_load_avg is the expected "to wake" contribution from tasks
already assigned to this rq.

e.g. this could be:
load = this_rq->cfs.runnable_load_avg + this_rq->cfs.blocked_load_avg;

Although, in general I have a major concern with the current implementation:

The entire reason for stability with the bottom up averages is that
when load migrates between cpus we are able to migrate it between the
tracked sums.

Stuffing observed averages of these into the load_idxs loses that
mobility; we will have to stall (as we do today for idx > 0) before we
can recognize that a cpu's load has truly left it; this is a very
similar problem to the need to stably track this for group shares
computation.

To that end, I would rather see the load_idx disappear completely:
(a) We can calculate the imbalance purely from delta (runnable_avg +
blocked_avg)
(b) It eliminates a bad tunable.

> unsigned long pending_updates;
>
> /*
> @@ -2586,7 +2586,7 @@ static void update_cpu_load_active(struct rq *this_rq)
> * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
> */
> this_rq->last_load_update_tick = jiffies;
> - __update_cpu_load(this_rq, this_rq->load.weight, 1);
> + __update_cpu_load(this_rq, this_rq->cfs.runnable_load_avg, 1);
>
> calc_load_account_active(this_rq);
> }
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2881d42..0bf88e8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2900,7 +2900,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> /* Used instead of source_load when we know the type == 0 */
> static unsigned long weighted_cpuload(const int cpu)
> {
> - return cpu_rq(cpu)->load.weight;
> + return (unsigned long)cpu_rq(cpu)->cfs.runnable_load_avg;

Isn't this going to truncate on the 32-bit case?

> }
>
> /*
> @@ -2947,7 +2947,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
> unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
>
> if (nr_running)
> - return rq->load.weight / nr_running;
> + return (unsigned long)rq->cfs.runnable_load_avg / nr_running;
>
> return 0;
> }
> --
> 1.7.12
>
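
A minimal sketch of the direction Paul suggests in (a) above: pick the busiest
CPU and derive the imbalance directly from the per-cpu runnable + blocked sums,
with no decayed cpu_load index in between. The numbers and structure here are
purely illustrative, not a proposed patch:

#include <stdio.h>

struct cpu_load {
        unsigned long runnable_avg;
        unsigned long blocked_avg;
};

int main(void)
{
        /* made-up per-cpu load-tracking figures */
        struct cpu_load cpus[] = { {900, 100}, {300, 50}, {600, 0}, {100, 10} };
        int i, n = 4, busiest = 0;
        unsigned long total = 0, avg, max = 0;

        for (i = 0; i < n; i++) {
                unsigned long l = cpus[i].runnable_avg + cpus[i].blocked_avg;

                total += l;
                if (l > max) {
                        max = l;
                        busiest = i;
                }
        }
        avg = total / n;

        /* imbalance: how far the busiest CPU sits above the mean */
        printf("busiest=%d imbalance=%lu\n", busiest,
               max > avg ? max - avg : 0);
        return 0;
}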

2013-05-06 08:49:57

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 1/7] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"

On 05/06/2013 04:24 PM, Paul Turner wrote:
>> > /*
>> > * CFS Load tracking
>> > * Under CFS, load is tracked on a per-entity basis and aggregated up.
>> > @@ -242,8 +236,7 @@ struct cfs_rq {
>> > u64 runnable_load_avg, blocked_load_avg;
>> > atomic64_t decay_counter, removed_load;
>> > u64 last_decay;
>> > -#endif /* CONFIG_FAIR_GROUP_SCHED */
>> > -/* These always depend on CONFIG_FAIR_GROUP_SCHED */
> We should perhaps replace this with a comment that these are only
> needed to aggregate the point-wise representation in the
> FAIR_GROUP_SCHED case.
>

Is this comment OK? :)

/* runnable related variables only used in FAIR_GROUP_SCHED */

--
Thanks
Alex

2013-05-06 08:54:18

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v5 6/7] sched: consider runnable load average in move_tasks

On Sun, May 5, 2013 at 6:45 PM, Alex Shi <[email protected]> wrote:
> Except using runnable load average in background, move_tasks is also
> the key functions in load balance. We need consider the runnable load
> average in it in order to the apple to apple load comparison.
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/fair.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0bf88e8..790e23d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3966,6 +3966,12 @@ static unsigned long task_h_load(struct task_struct *p);
>
> static const unsigned int sched_nr_migrate_break = 32;
>
> +static unsigned long task_h_load_avg(struct task_struct *p)
> +{
> + return div_u64(task_h_load(p) * (u64)p->se.avg.runnable_avg_sum,
> + p->se.avg.runnable_avg_period + 1);

Similarly, I think you also want to at least include blocked_load_avg here.

More fundamentally:
I suspect the instability from comparing these to an average taken on
them will not give a representative imbalance weight. While we should
be no worse off than the present situation; we could be doing much
better.

Consider that by not consuming {runnable, blocked}_load_avg directly
you are "hiding" the movement from one load-balancer to the next.
> +}
> +
> /*
> * move_tasks tries to move up to imbalance weighted load from busiest to
> * this_rq, as part of a balancing operation within domain "sd".
> @@ -4001,7 +4007,7 @@ static int move_tasks(struct lb_env *env)
> if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
> goto next;
>
> - load = task_h_load(p);
> + load = task_h_load_avg(p);
>
> if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
> goto next;
> --
> 1.7.12
>

2013-05-06 08:56:05

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v5 1/7] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"

On Mon, May 6, 2013 at 1:49 AM, Alex Shi <[email protected]> wrote:
> On 05/06/2013 04:24 PM, Paul Turner wrote:
>>> > /*
>>> > * CFS Load tracking
>>> > * Under CFS, load is tracked on a per-entity basis and aggregated up.
>>> > @@ -242,8 +236,7 @@ struct cfs_rq {
>>> > u64 runnable_load_avg, blocked_load_avg;
>>> > atomic64_t decay_counter, removed_load;
>>> > u64 last_decay;
>>> > -#endif /* CONFIG_FAIR_GROUP_SCHED */
>>> > -/* These always depend on CONFIG_FAIR_GROUP_SCHED */
>> We should perhaps replace this with a comment that these are only
>> needed to aggregate the point-wise representation in the
>> FAIR_GROUP_SCHED case.
>>
>
> Is the comments ok? :)
>
> /* runnable related variables only used in FAIR_GROUP_SCHED */

Perhaps:

-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
#ifdef ...
+/* Required to track per-cpu representation of a task_group */


>
> --
> Thanks
> Alex

2013-05-06 08:58:08

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 2/7] sched: remove SMP cover for runnable variables in cfs_rq

On 05/06/2013 04:01 PM, Paul Turner wrote:
> On Sun, May 5, 2013 at 6:45 PM, Alex Shi <[email protected]> wrote:
>> The following variables were covered under CONFIG_SMP in struct cfs_rq.
>> but similar runnable variables work for UP in struct rq and task_group.
>> like rq->avg, task_group->load_avg.
>> So move them out, they also can work with UP.
>
> Is there a proposed use-case for UP? My apologies if I missed it in
> an alternate patch.

> It would seem the only possibly useful thing there would the the
> per-rq average for p-state selection; but we can get that without the
> per-entity values already.


Do you mean to move the rq->avg and task_group->load_avg into CONFIG_SMP?

--
Thanks
Alex

2013-05-06 08:59:07

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 1/7] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"

On 05/06/2013 04:55 PM, Paul Turner wrote:
>> > Is the comments ok? :)
>> >
>> > /* runnable related variables only used in FAIR_GROUP_SCHED */
> Perhaps:
>
> -/* These always depend on CONFIG_FAIR_GROUP_SCHED */
> #ifdef ...
> +/* Required to track per-cpu representation of a task_group */
>
>

looks fine. thanks a lot!

--
Thanks
Alex

2013-05-06 09:06:50

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v5 7/7] sched: consider runnable load average in effective_load

I don't think this is a good idea:

The problem with not using the instantaneous weight here is that you
potentially penalize the latency of interactive tasks (similarly,
potentially important background threads -- e.g. garbage collection).

Counter-intuitively we actually want such tasks on the least loaded
cpus to minimize their latency. If the load they contribute ever
becomes more substantial we trust that periodic balance will start
taking notice of them.

[ This is similar to why we have to use the instantaneous weight in
calc_cfs_shares. ]

2013-05-06 09:08:35

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v5 2/7] sched: remove SMP cover for runnable variables in cfs_rq

On Mon, May 6, 2013 at 1:57 AM, Alex Shi <[email protected]> wrote:
> On 05/06/2013 04:01 PM, Paul Turner wrote:
>> On Sun, May 5, 2013 at 6:45 PM, Alex Shi <[email protected]> wrote:
>>> The following variables were covered under CONFIG_SMP in struct cfs_rq.
>>> but similar runnable variables work for UP in struct rq and task_group.
>>> like rq->avg, task_group->load_avg.
>>> So move them out, they also can work with UP.
>>
>> Is there a proposed use-case for UP? My apologies if I missed it in
>> an alternate patch.
>
>> It would seem the only possibly useful thing there would the the
>> per-rq average for p-state selection; but we can get that without the
>> per-entity values already.
>
>
> Do you mean to move the rq->avg and task_group->load_avg into CONFIG_SMP?

More generally: Why do we need them in !CONFIG_SMP?

[ I was suggesting (potentially) using only rq->avg in the !CONFIG_SMP case. ]


>
> --
> Thanks
> Alex

2013-05-06 09:21:46

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

On 05/06/2013 04:19 PM, Paul Turner wrote:
> On Sun, May 5, 2013 at 6:45 PM, Alex Shi <[email protected]> wrote:
>> We need initialize the se.avg.{decay_count, load_avg_contrib} for a
>> new forked task.
>> Otherwise random values of above variables cause mess when do new task
>> enqueue:
>> enqueue_task_fair
>> enqueue_entity
>> enqueue_entity_load_avg
>>
>> and make forking balancing imbalance since incorrect load_avg_contrib.
>>
>> set avg.decay_count = 0, and avg.load_avg_contrib = se->load.weight to
>> resolve such issues.
>>
>> Signed-off-by: Alex Shi <[email protected]>
>> ---
>> kernel/sched/core.c | 6 ++++++
>> kernel/sched/fair.c | 4 ++++
>> 2 files changed, 10 insertions(+)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index c8db984..ecec7f1 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1566,6 +1566,7 @@ static void __sched_fork(struct task_struct *p)
>> #ifdef CONFIG_SMP
>> p->se.avg.runnable_avg_period = 0;
>> p->se.avg.runnable_avg_sum = 0;
>> + p->se.avg.decay_count = 0;
>> #endif
>> #ifdef CONFIG_SCHEDSTATS
>> memset(&p->se.statistics, 0, sizeof(p->se.statistics));
>> @@ -1653,6 +1654,11 @@ void sched_fork(struct task_struct *p)
>> p->sched_reset_on_fork = 0;
>> }
>>
>> + /* New forked task assumed with full utilization */
>> +#if defined(CONFIG_SMP)
>> + p->se.avg.load_avg_contrib = p->se.load.weight;
>
> This is missing a scale_load() right? Further: Why not put this in
> __sched_fork?

scale_load() does not do anything right now. Anyway, I can add this.

>
> We should also charge a minimum period to make the numbers a little
> more kosher, e.g.:
> + p->se.avg.runnable_avg_period = 1024;
> + p->se.avg.runnable_avg_sum = 1024;
>
> Rather than exposing the representation of load_avg_contrib to
> __sched_fork it might also be better to call:
> __update_task_entity_contrib(&p->se)
> After the initialization above; this would also avoid potential bugs
> like the missing scale_load() above.

The simple change above cannot work.
We talked about this solution months ago and reached agreement on this patch:
https://lkml.org/lkml/2013/2/20/48 :)

--
Thanks
Alex

2013-05-06 09:24:40

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH v5 7/7] sched: consider runnable load average in effective_load

On 05/06/2013 01:39 PM, Alex Shi wrote:
> On 05/06/2013 11:34 AM, Michael Wang wrote:
>>>> @@ -3045,7 +3045,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
>>>> /*
>>>> * w = rw_i + @wl
>>>> */
>>>> - w = se->my_q->load.weight + wl;
>>>> + w = se->my_q->tg_load_contrib + wl;
>> I've tested the patch set, seems like the last patch caused big
>> regression on pgbench:
>>
>> base patch 1~6 patch 1~7
>> | db_size | clients | tps | | tps | | tps |
>> +---------+---------+-------+ +-------+ +-------+
>> | 22 MB | 32 | 43420 | | 53387 | | 41625 |
>>
>> I guess some magic thing happened in effective_load() while calculating
>> group decay combined with load decay, what's your opinion?
>
>
> thanks for testing, Michael!
>
> Maybe 2 fix worth to try.
>
> 1, change back the tg_weight in calc_tg_weight() to use tg_load_contrib not direct load.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6f4f14b..c770f8d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1037,8 +1037,8 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
> * update_cfs_rq_load_contribution().
> */
> tg_weight = atomic64_read(&tg->load_avg);
> - tg_weight -= cfs_rq->tg_load_contrib;
> - tg_weight += cfs_rq->load.weight;
> + //tg_weight -= cfs_rq->tg_load_contrib;
> + //tg_weight += cfs_rq->load.weight;
>
> return tg_weight;
> }
>
> 2, another try is follow the current calc_tg_weight, so remove the follow change.
>
>>>> @@ -3045,7 +3045,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
>>>> /*
>>>> * w = rw_i + @wl
>>>> */
>>>> - w = se->my_q->load.weight + wl;
>>>> + w = se->my_q->tg_load_contrib + wl;
>
> Would you like to try them?

Sure, I will give both a try :)

But actually I'm wondering whether it is necessary to change
effective_load() at all?

It only serves wake-affine, and that whole area is still in the dark;
if patch 1~6 already show good results, why don't we leave it alone?

So how about the situation on your box without the last patch? Is the
benefit still there?

Regards,
Michael Wang

>
>

2013-05-06 09:35:53

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 7/7] sched: consider runnable load average in effective_load

On 05/06/2013 05:06 PM, Paul Turner wrote:
> I don't think this is a good idea:
>
> The problem with not using the instantaneous weight here is that you
> potentially penalize the latency of interactive tasks (similarly,
> potentially important background threads -- e.g. garbage collection).
>
> Counter-intuitively we actually want such tasks on the least loaded
> cpus to minimize their latency. If the load they contribute ever
> becomes more substantial we trust that periodic balance will start
> taking notice of them.

Sounds reasonable. Many thanks for your input, Paul!

So, I will use the second try. :)
>
> [ This is similar to why we have to use the instantaneous weight in
> calc_cfs_shares. ]
>


--
Thanks
Alex

2013-05-06 09:39:49

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 7/7] sched: consider runnable load average in effective_load

>
> But actually I'm wondering whether it is necessary to change
> effective_load()?
>
> It is only severed for wake-affine and the whole stuff is still in the
> dark, if patch 1~6 already show good results, why don't we leave it there?

It is used for pipe-connected processes, and your testing showed it
cannot simply be removed, so...
>

--
Thanks
Alex

2013-05-06 10:00:56

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v5 7/7] sched: consider runnable load average in effective_load

On Mon, May 6, 2013 at 2:35 AM, Alex Shi <[email protected]> wrote:
> On 05/06/2013 05:06 PM, Paul Turner wrote:
>> I don't think this is a good idea:
>>
>> The problem with not using the instantaneous weight here is that you
>> potentially penalize the latency of interactive tasks (similarly,
>> potentially important background threads -- e.g. garbage collection).
>>
>> Counter-intuitively we actually want such tasks on the least loaded
>> cpus to minimize their latency. If the load they contribute ever
>> becomes more substantial we trust that periodic balance will start
>> taking notice of them.
>
> Sounds reasonable. Many thanks for your input, Paul!
>
> So, will use the seconds try. :)

Sorry -- not sure what you mean here. I'm suggesting leaving
effective_load() unchanged.
>>
>> [ This is similar to why we have to use the instantaneous weight in
>> calc_cfs_shares. ]
>>
>
>
> --
> Thanks
> Alex

2013-05-06 10:18:27

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

On Mon, May 6, 2013 at 2:21 AM, Alex Shi <[email protected]> wrote:
> On 05/06/2013 04:19 PM, Paul Turner wrote:
>> On Sun, May 5, 2013 at 6:45 PM, Alex Shi <[email protected]> wrote:
>>> We need initialize the se.avg.{decay_count, load_avg_contrib} for a
>>> new forked task.
>>> Otherwise random values of above variables cause mess when do new task
>>> enqueue:
>>> enqueue_task_fair
>>> enqueue_entity
>>> enqueue_entity_load_avg
>>>
>>> and make forking balancing imbalance since incorrect load_avg_contrib.
>>>
>>> set avg.decay_count = 0, and avg.load_avg_contrib = se->load.weight to
>>> resolve such issues.
>>>
>>> Signed-off-by: Alex Shi <[email protected]>
>>> ---
>>> kernel/sched/core.c | 6 ++++++
>>> kernel/sched/fair.c | 4 ++++
>>> 2 files changed, 10 insertions(+)
>>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index c8db984..ecec7f1 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -1566,6 +1566,7 @@ static void __sched_fork(struct task_struct *p)
>>> #ifdef CONFIG_SMP
>>> p->se.avg.runnable_avg_period = 0;
>>> p->se.avg.runnable_avg_sum = 0;
>>> + p->se.avg.decay_count = 0;
>>> #endif
>>> #ifdef CONFIG_SCHEDSTATS
>>> memset(&p->se.statistics, 0, sizeof(p->se.statistics));
>>> @@ -1653,6 +1654,11 @@ void sched_fork(struct task_struct *p)
>>> p->sched_reset_on_fork = 0;
>>> }
>>>
>>> + /* New forked task assumed with full utilization */
>>> +#if defined(CONFIG_SMP)
>>> + p->se.avg.load_avg_contrib = p->se.load.weight;
>>
>> This is missing a scale_load() right? Further: Why not put this in
>> __sched_fork?
>
> scale_load is not working now. Anyway I can add this.
>
>>
>> We should also charge a minimum period to make the numbers a little
>> more kosher, e.g.:
>> + p->se.avg.runnable_avg_period = 1024;
>> + p->se.avg.runnable_avg_sum = 1024;
>>
>> Rather than exposing the representation of load_avg_contrib to
>> __sched_fork it might also be better to call:
>> __update_task_entity_contrib(&p->se)
>> After the initialization above; this would also avoid potential bugs
>> like the missing scale_load() above.
>
> Above simple change can not work.

Could you provide additional detail here? Note that the sum change I
was suggesting above was:

__sched_fork():
+ p->se.avg.decay_count = 0;
+ p->se.avg.runnable_avg_period = 1024;
+ p->se.avg.runnable_avg_sum = 1024;
+ __update_task_entity_contrib(&p->se);

[ Also: move __sched_fork() beyond p->sched_reset_on_fork in sched_fork(). ]

> We had talked this solution months ago. And get agreement on this patch.
> https://lkml.org/lkml/2013/2/20/48 :)

Yes, I made the same suggestion in the last round, see:
https://lkml.org/lkml/2013/2/19/176

Your reply there seems like an ack of my suggestion, the only
difference I'm seeing is that using __update_task_entity_contrib() as
originally suggested is safer since it keeps the representation of
load_avg_contrib opaque.
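
[ For context, a rough sketch of the helper being discussed, paraphrased
from kernels of this era rather than quoted verbatim; the point is that
with runnable_avg_sum == runnable_avg_period == 1024 the computed
contribution comes out to roughly the task's full (scaled) weight, so
__sched_fork() never needs to know the representation:

static inline void __update_task_entity_contrib(struct sched_entity *se)
{
	u32 contrib;

	/* avoid overflowing a 32-bit type with SCHED_LOAD_SCALE */
	contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
	contrib /= (se->avg.runnable_avg_period + 1);
	se->avg.load_avg_contrib = scale_load(contrib);
}

With sum = period = 1024 the ratio is 1024/1025, i.e. essentially the full
weight. ]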

> --
> Thanks
> Alex

2013-05-06 10:21:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 5/7] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On Mon, May 06, 2013 at 01:46:19AM -0700, Paul Turner wrote:
> On Sun, May 5, 2013 at 6:45 PM, Alex Shi <[email protected]> wrote:
> > @@ -2536,7 +2536,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
> > void update_idle_cpu_load(struct rq *this_rq)
> > {
> > unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
> > - unsigned long load = this_rq->load.weight;
> > + unsigned long load = (unsigned long)this_rq->cfs.runnable_load_avg;
>
> We should be minimizing:
> Variance[ for all i ]{ cfs_rq[i]->runnable_load_avg +
> cfs_rq[i]->blocked_load_avg }
>
> blocked_load_avg is the expected "to wake" contribution from tasks
> already assigned to this rq.
>
> e.g. this could be:
> load = this_rq->cfs.runnable_load_avg + this_rq->cfs.blocked_load_avg;
>
> Although, in general I have a major concern with the current implementation:
>
> The entire reason for stability with the bottom up averages is that
> when load migrates between cpus we are able to migrate it between the
> tracked sums.
>
> Stuffing observed averages of these into the load_idxs loses that
> mobility; we will have to stall (as we do today for idx > 0) before we
> can recognize that a cpu's load has truly left it; this is a very
> similar problem to the need to stably track this for group shares
> computation.

Ah indeed. I overlooked that.

> To that end, I would rather see the load_idx disappear completely:
> (a) We can calculate the imbalance purely from delta (runnable_avg +
> blocked_avg)
> (b) It eliminates a bad tunable.

So I suspect (haven't gone back in history to verify) that load_idx mostly
comes from the fact that our balance passes happen more and more slowly the
bigger the domains get.

In that respect it makes sense to equate load_idx to sched_domain::level;
higher domains balance slower and would thus want a longer-term average to base
decisions on.

So what we would want is a means to get sane longer-term averages.
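
[ For context, roughly what the load_idx machinery keeps today: each rq
carries an array of exponentially decaying load averages that are updated
every tick along these lines (a sketch from memory, not a verbatim copy of
__update_cpu_load()); the larger the index, the more slowly it follows the
instantaneous load, which is the longer-term view referred to above:

	for (i = 1; i < CPU_LOAD_IDX_MAX; i++) {
		unsigned long old_load = this_rq->cpu_load[i];
		unsigned long scale = 1UL << i;

		/* keep (scale-1)/scale of the old average and blend in
		 * 1/scale of the current load */
		this_rq->cpu_load[i] = (old_load * (scale - 1) + this_load) >> i;
	}

source_load()/target_load() then pick cpu_load[load_idx] (biased against
the instantaneous load), so a bigger load_idx means balancing on a
longer-term average. ]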

2013-05-06 10:23:06

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

On Mon, May 6, 2013 at 2:21 AM, Alex Shi <[email protected]> wrote:
> On 05/06/2013 04:19 PM, Paul Turner wrote:
>> On Sun, May 5, 2013 at 6:45 PM, Alex Shi <[email protected]> wrote:
>>> We need initialize the se.avg.{decay_count, load_avg_contrib} for a
>>> new forked task.
>>> Otherwise random values of above variables cause mess when do new task
>>> enqueue:
>>> enqueue_task_fair
>>> enqueue_entity
>>> enqueue_entity_load_avg
>>>
>>> and make forking balancing imbalance since incorrect load_avg_contrib.
>>>
>>> set avg.decay_count = 0, and avg.load_avg_contrib = se->load.weight to
>>> resolve such issues.
>>>
>>> Signed-off-by: Alex Shi <[email protected]>
>>> ---
>>> kernel/sched/core.c | 6 ++++++
>>> kernel/sched/fair.c | 4 ++++
>>> 2 files changed, 10 insertions(+)
>>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index c8db984..ecec7f1 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -1566,6 +1566,7 @@ static void __sched_fork(struct task_struct *p)
>>> #ifdef CONFIG_SMP
>>> p->se.avg.runnable_avg_period = 0;
>>> p->se.avg.runnable_avg_sum = 0;
>>> + p->se.avg.decay_count = 0;
>>> #endif
>>> #ifdef CONFIG_SCHEDSTATS
>>> memset(&p->se.statistics, 0, sizeof(p->se.statistics));
>>> @@ -1653,6 +1654,11 @@ void sched_fork(struct task_struct *p)
>>> p->sched_reset_on_fork = 0;
>>> }
>>>
>>> + /* New forked task assumed with full utilization */
>>> +#if defined(CONFIG_SMP)
>>> + p->se.avg.load_avg_contrib = p->se.load.weight;
>>
>> This is missing a scale_load() right? Further: Why not put this in
>> __sched_fork?
>
> scale_load is not working now. Anyway I can add this.

I believe someone tracked down a plausible cause for this:
A governor was examining the values and making a mess with the scaled
ones. I'm sorry, I don't have the post off hand.

You actually likely ideally want this _on_ for these patches; the
available resolution with SCHED_LOAD_SHIFT=10 disappears really
quickly and scaling by runnable_avg only further accelerates that.

We should try to get this generally turned on by default again.
>
>>
>> We should also charge a minimum period to make the numbers a little
>> more kosher, e.g.:
>> + p->se.avg.runnable_avg_period = 1024;
>> + p->se.avg.runnable_avg_sum = 1024;
>>
>> Rather than exposing the representation of load_avg_contrib to
>> __sched_fork it might also be better to call:
>> __update_task_entity_contrib(&p->se)
>> After the initialization above; this would also avoid potential bugs
>> like the missing scale_load() above.
>
> Above simple change can not work.
> We had talked this solution months ago. And get agreement on this patch.
> https://lkml.org/lkml/2013/2/20/48 :)
>
> --
> Thanks
> Alex

2013-05-06 10:28:03

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH v5 7/7] sched: consider runnable load average in effective_load

Hi, Preeti

On 05/06/2013 03:10 PM, Preeti U Murthy wrote:
> Hi Alex,Michael,
>
> Can you try out the below patch and check?

Sure, I will give it a try as well.

> I have the reason mentioned in the changelog.
> If this also causes performance regression,you probably need to remove changes made in
> effective_load() as Michael points out. I believe the below patch should not cause
> performance regression.

Actually, according to the current results of Alex's suggestion, I think
the issue is already addressed. Anyway, I will test this patch and reply
to all of them; let's choose the best way later ;-)

Regards,
Michael Wang

>
> The below patch is a substitute for patch 7.
>
>
> -------------------------------------------------------------------------------
>
> sched: Modify effective_load() to use runnable load average
>
> From: Preeti U Murthy <[email protected]>
>
> The runqueue weight distribution should update the runnable load average of
> the cfs_rq on which the task will be woken up.
>
> However since the computation of se->load.weight takes into consideration
> the runnable load average in update_cfs_shares(),no need to modify this in
> effective_load().
> ---
> kernel/sched/fair.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 790e23d..5489022 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3045,7 +3045,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
> /*
> * w = rw_i + @wl
> */
> - w = se->my_q->load.weight + wl;
> + w = se->my_q->runnable_load_avg + wl;
>
> /*
> * wl = S * s'_i; see (2)
> @@ -3066,6 +3066,9 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
> /*
> * wl = dw_i = S * (s'_i - s_i); see (3)
> */
> + /* Do not modify the below as it already contains runnable
> + * load average in its computation
> + */
> wl -= se->load.weight;
>
> /*
> @@ -3112,14 +3115,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> */
> if (sync) {
> tg = task_group(current);
> - weight = current->se.load.weight;
> + weight = current->se.avg.load_avg_contrib;
>
> this_load += effective_load(tg, this_cpu, -weight, -weight);
> load += effective_load(tg, prev_cpu, 0, -weight);
> }
>
> tg = task_group(p);
> - weight = p->se.load.weight;
> + weight = p->se.avg.load_avg_contrib;
>
> /*
> * In low-load situations, where prev_cpu is idle and this_cpu is idle
>
>
> Regards
> Preeti U Murthy
>
> On 05/06/2013 09:04 AM, Michael Wang wrote:
>> Hi, Alex
>>
>> On 05/06/2013 09:45 AM, Alex Shi wrote:
>>> effective_load calculates the load change as seen from the
>>> root_task_group. It needs to engage the runnable average
>>> of changed task.
>> [snip]
>>> */
>>> @@ -3045,7 +3045,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
>>> /*
>>> * w = rw_i + @wl
>>> */
>>> - w = se->my_q->load.weight + wl;
>>> + w = se->my_q->tg_load_contrib + wl;
>>
>> I've tested the patch set, seems like the last patch caused big
>> regression on pgbench:
>>
>> base patch 1~6 patch 1~7
>> | db_size | clients | tps | | tps | | tps |
>> +---------+---------+-------+ +-------+ +-------+
>> | 22 MB | 32 | 43420 | | 53387 | | 41625 |
>>
>> I guess some magic thing happened in effective_load() while calculating
>> group decay combined with load decay, what's your opinion?
>>
>> Regards,
>> Michael Wang
>>
>>>
>>> /*
>>> * wl = S * s'_i; see (2)
>>> @@ -3066,7 +3066,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
>>> /*
>>> * wl = dw_i = S * (s'_i - s_i); see (3)
>>> */
>>> - wl -= se->load.weight;
>>> + wl -= se->avg.load_avg_contrib;
>>>
>>> /*
>>> * Recursively apply this logic to all parent groups to compute
>>> @@ -3112,14 +3112,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>>> */
>>> if (sync) {
>>> tg = task_group(current);
>>> - weight = current->se.load.weight;
>>> + weight = current->se.avg.load_avg_contrib;
>>>
>>> this_load += effective_load(tg, this_cpu, -weight, -weight);
>>> load += effective_load(tg, prev_cpu, 0, -weight);
>>> }
>>>
>>> tg = task_group(p);
>>> - weight = p->se.load.weight;
>>> + weight = p->se.avg.load_avg_contrib;
>>>
>>> /*
>>> * In low-load situations, where prev_cpu is idle and this_cpu is idle
>>>
>>
>

2013-05-06 10:30:25

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [PATCH v5 7/7] sched: consider runnable load average in effective_load

Hi Alex,Michael,

Can you try out the below patch and check? I have the reason mentioned in the changelog.
If this also causes performance regression, you probably need to remove changes made in
effective_load() as Michael points out. I believe the below patch should not cause
performance regression.

The below patch is a substitute for patch 7.


-------------------------------------------------------------------------------

sched: Modify effective_load() to use runnable load average

From: Preeti U Murthy <[email protected]>

The runqueue weight distribution should update the runnable load average of
the cfs_rq on which the task will be woken up.

However, since the computation of se->load.weight takes into consideration
the runnable load average in update_cfs_shares(), there is no need to modify
this in effective_load().
---
kernel/sched/fair.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 790e23d..5489022 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3045,7 +3045,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
/*
* w = rw_i + @wl
*/
- w = se->my_q->load.weight + wl;
+ w = se->my_q->runnable_load_avg + wl;

/*
* wl = S * s'_i; see (2)
@@ -3066,6 +3066,9 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
/*
* wl = dw_i = S * (s'_i - s_i); see (3)
*/
+ /* Do not modify the below as it already contains runnable
+ * load average in its computation
+ */
wl -= se->load.weight;

/*
@@ -3112,14 +3115,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
*/
if (sync) {
tg = task_group(current);
- weight = current->se.load.weight;
+ weight = current->se.avg.load_avg_contrib;

this_load += effective_load(tg, this_cpu, -weight, -weight);
load += effective_load(tg, prev_cpu, 0, -weight);
}

tg = task_group(p);
- weight = p->se.load.weight;
+ weight = p->se.avg.load_avg_contrib;

/*
* In low-load situations, where prev_cpu is idle and this_cpu is idle


Regards
Preeti U Murthy

On 05/06/2013 09:04 AM, Michael Wang wrote:
> Hi, Alex
>
> On 05/06/2013 09:45 AM, Alex Shi wrote:
>> effective_load calculates the load change as seen from the
>> root_task_group. It needs to engage the runnable average
>> of changed task.
> [snip]
>> */
>> @@ -3045,7 +3045,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
>> /*
>> * w = rw_i + @wl
>> */
>> - w = se->my_q->load.weight + wl;
>> + w = se->my_q->tg_load_contrib + wl;
>
> I've tested the patch set, seems like the last patch caused big
> regression on pgbench:
>
> base patch 1~6 patch 1~7
> | db_size | clients | tps | | tps | | tps |
> +---------+---------+-------+ +-------+ +-------+
> | 22 MB | 32 | 43420 | | 53387 | | 41625 |
>
> I guess some magic thing happened in effective_load() while calculating
> group decay combined with load decay, what's your opinion?
>
> Regards,
> Michael Wang
>
>>
>> /*
>> * wl = S * s'_i; see (2)
>> @@ -3066,7 +3066,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
>> /*
>> * wl = dw_i = S * (s'_i - s_i); see (3)
>> */
>> - wl -= se->load.weight;
>> + wl -= se->avg.load_avg_contrib;
>>
>> /*
>> * Recursively apply this logic to all parent groups to compute
>> @@ -3112,14 +3112,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>> */
>> if (sync) {
>> tg = task_group(current);
>> - weight = current->se.load.weight;
>> + weight = current->se.avg.load_avg_contrib;
>>
>> this_load += effective_load(tg, this_cpu, -weight, -weight);
>> load += effective_load(tg, prev_cpu, 0, -weight);
>> }
>>
>> tg = task_group(p);
>> - weight = p->se.load.weight;
>> + weight = p->se.avg.load_avg_contrib;
>>
>> /*
>> * In low-load situations, where prev_cpu is idle and this_cpu is idle
>>
>

2013-05-06 10:34:19

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v5 5/7] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On Mon, May 6, 2013 at 3:19 AM, Peter Zijlstra <[email protected]> wrote:
> On Mon, May 06, 2013 at 01:46:19AM -0700, Paul Turner wrote:
>> On Sun, May 5, 2013 at 6:45 PM, Alex Shi <[email protected]> wrote:
>> > @@ -2536,7 +2536,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
>> > void update_idle_cpu_load(struct rq *this_rq)
>> > {
>> > unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
>> > - unsigned long load = this_rq->load.weight;
>> > + unsigned long load = (unsigned long)this_rq->cfs.runnable_load_avg;
>>
>> We should be minimizing:
>> Variance[ for all i ]{ cfs_rq[i]->runnable_load_avg +
>> cfs_rq[i]->blocked_load_avg }
>>
>> blocked_load_avg is the expected "to wake" contribution from tasks
>> already assigned to this rq.
>>
>> e.g. this could be:
>> load = this_rq->cfs.runnable_load_avg + this_rq->cfs.blocked_load_avg;
>>
>> Although, in general I have a major concern with the current implementation:
>>
>> The entire reason for stability with the bottom up averages is that
>> when load migrates between cpus we are able to migrate it between the
>> tracked sums.
>>
>> Stuffing observed averages of these into the load_idxs loses that
>> mobility; we will have to stall (as we do today for idx > 0) before we
>> can recognize that a cpu's load has truly left it; this is a very
>> similar problem to the need to stably track this for group shares
>> computation.
>
> Ah indeed. I overlooked that.
>
>> To that end, I would rather see the load_idx disappear completely:
>> (a) We can calculate the imbalance purely from delta (runnable_avg +
>> blocked_avg)
>> (b) It eliminates a bad tunable.
>
> So I suspect (haven't gone back in history to verify) that load_idx mostly
> comes from the fact that our balance passes happen more and more slowly the
> bigger the domains get.
>
> In that respect it makes sense to equate load_idx to sched_domain::level;
> higher domains balance slower and would thus want a longer-term average to base
> decisions on.
>
> So what we would want is means to get sane longer term averages.

Yeah, most of the rationale is super hand-wavy; especially the fairly
arbitrary choice of periods (e.g. busy_idx vs newidle).

I think the other rationale is:
For smaller indices (e.g. newidle) we speed up response time by
also cutting motion out of the averages.

The runnable_avgs themselves actually have a fair bit of history in
them already (50% is last 32ms); but given that they don't need to be
cut-off to respond to load being migrated I'm guessing we could
actually potentially get by with just "instantaneous" and "use averages"
where appropriate?

We always end up having to re-pick/tune them based on a variety of
workloads; if we can eliminate them I think it would be a win.
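
[ For reference, the "fair bit of history" is the geometric series used by
the per-entity tracking; approximately, with ~1ms periods and a decay
factor y chosen so that y^32 = 1/2:

	runnable_avg_sum = u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...

where u_i is the runnable time in the i-th most recent ~1ms period and
y ~= 0.9786. The most recent 32ms therefore accounts for about half of the
total, which is the 50% figure above. ]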

2013-05-06 11:12:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 5/7] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On Mon, May 06, 2013 at 03:33:45AM -0700, Paul Turner wrote:
> Yeah, most of the rationale is super hand-wavy; especially the fairly
> arbitrary choice of periods (e.g. busy_idx vs newidle).
>
> I think the other rationale is:
> For smaller indices (e.g. newidle) we speed up response time by
> also cutting motion out of the averages.
>
> The runnable_avgs themselves actually have a fair bit of history in
> them already (50% is last 32ms); but given that they don't need to be
> cut-off to respond to load being migrated I'm guessing we could
> actually potentially get by with just "instantaneous" and "use averages"
> where appropriate?

Sure, worth a try. If things fall over we can always look at it again.

> We always end up having to re-pick/tune them based on a variety of
> workloads; if we can eliminate them I think it would be a win.

Agreed, esp. the plethora of weird idx things we currently have. If we need to
re-introduce something it would likely only be the busy case and for that we
can immediately link to the balance interval or so.


2013-05-06 11:15:51

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [PATCH v5 7/7] sched: consider runnable load average in effective_load

On 05/06/2013 03:05 PM, Alex Shi wrote:
> On 05/06/2013 05:06 PM, Paul Turner wrote:
>> I don't think this is a good idea:
>>
>> The problem with not using the instantaneous weight here is that you
>> potentially penalize the latency of interactive tasks (similarly,
>> potentially important background threads -- e.g. garbage collection).
>>
>> Counter-intuitively we actually want such tasks on the least loaded
>> cpus to minimize their latency. If the load they contribute ever
>> becomes more substantial we trust that periodic balance will start
>> taking notice of them.
>
> Sounds reasonable. Many thanks for your input, Paul!
>
> So, will use the seconds try. :)
>>
>> [ This is similar to why we have to use the instantaneous weight in
>> calc_cfs_shares. ]
>>
>
>

Yes, thank you very much for the inputs, Paul :)

So Alex, Michael, it looks like this is what happened.

1. The effective_load() as it is, uses instantaneous loads to calculate
the CPU shares before and after a new task can be woken up on the given cpu.

2. With my patch, I modified it to use runnable load average while
calculating the CPU share *after* a new task could be woken up and
retained instantaneous load to calculate the CPU share before a new task
could be woken up.

3. With the first patch of Alex, he has used runnable load average while
calculating the CPU share both before and after a new task could be
woken up on the given CPU.

4. The suggestions that Alex gave:

Suggestion 1: Would change the CPU share calculation to use the runnable load
average all the time.

Suggestion 2: Did the opposite of point 2 above; it used the runnable load
average while calculating the CPU share *before* a new task has been woken up,
while retaining the instantaneous weight to calculate the CPU share
after a new task could be woken up.

So since there was no uniformity in the calculation of CPU shares in
approaches 2 and 3, I think it caused a regression. However, I still
don't understand how approach 4, Suggestion 2, made that go away although
there was non-uniformity in the CPU share calculation.

But as Paul says we could retain the usage of instantaneous loads
wherever there is calculation of CPU shares for the reason he mentioned
and leave effective_load() and calc_cfs_shares() untouched.

This also brings forth another question: should we modify wake_affine()
to pass the runnable load average of the waking task to effective_load()?

What do you think?


Thanks

Regards
Preeti U Murthy

2013-05-06 15:00:54

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 5/7] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

>
> blocked_load_avg is the expected "to wake" contribution from tasks
> already assigned to this rq.
>
> e.g. this could be:
> load = this_rq->cfs.runnable_load_avg + this_rq->cfs.blocked_load_avg;

The current load balancer doesn't consider a sleeping task's load, which is
represented by blocked_load_avg. And since a sleeping task is not on_rq,
considering it in load balancing seems a little strange.

But your concern is worth trying. I will change the patchset and post
the testing results.

>
> Although, in general I have a major concern with the current implementation:
>
> The entire reason for stability with the bottom up averages is that
> when load migrates between cpus we are able to migrate it between the
> tracked sums.
>
> Stuffing observed averages of these into the load_idxs loses that
> mobility; we will have to stall (as we do today for idx > 0) before we
> can recognize that a cpu's load has truly left it; this is a very
> similar problem to the need to stably track this for group shares
> computation.
>
> To that end, I would rather see the load_idx disappear completely:
> (a) We can calculate the imbalance purely from delta (runnable_avg +
> blocked_avg)
> (b) It eliminates a bad tunable.

I also raised a similar concern about load_idx months ago; it seems it was overlooked. :)
>
>> - return cpu_rq(cpu)->load.weight;
>> + return (unsigned long)cpu_rq(cpu)->cfs.runnable_load_avg;
>
> Isn't this going to truncate on the 32-bit case?

I guess not; the old load.weight is unsigned long, and runnable_load_avg
is smaller than load.weight, so it should be fine.

BTW, for the same reason, I guess moving runnable_load_avg to the
'unsigned long' type is OK. Do you think so?
>

--
Thanks
Alex

2013-05-06 15:03:01

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 2/7] sched: remove SMP cover for runnable variables in cfs_rq

On 05/06/2013 05:08 PM, Paul Turner wrote:
> On Mon, May 6, 2013 at 1:57 AM, Alex Shi <[email protected]> wrote:
>> On 05/06/2013 04:01 PM, Paul Turner wrote:
>>> On Sun, May 5, 2013 at 6:45 PM, Alex Shi <[email protected]> wrote:
>>>> The following variables were covered under CONFIG_SMP in struct cfs_rq.
>>>> but similar runnable variables work for UP in struct rq and task_group.
>>>> like rq->avg, task_group->load_avg.
>>>> So move them out, they also can work with UP.
>>>
>>> Is there a proposed use-case for UP? My apologies if I missed it in
>>> an alternate patch.
>>
>>> It would seem the only possibly useful thing there would be the
>>> per-rq average for p-state selection; but we can get that without the
>>> per-entity values already.
>>
>>
>> Do you mean to move the rq->avg and task_group->load_avg into CONFIG_SMP?
>
> More generally: Why do we need them in !CONFIG_SMP?
>
> [ I was suggesting (potentially) using only rq->avg in the !CONFIG_SMP case. ]

Thanks for the comments.
I will look into this. :)
>
>
>>
>> --
>> Thanks
>> Alex


--
Thanks
Alex

2013-05-06 15:06:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 6/7] sched: consider runnable load average in move_tasks

On Mon, May 06, 2013 at 01:53:44AM -0700, Paul Turner wrote:
> On Sun, May 5, 2013 at 6:45 PM, Alex Shi <[email protected]> wrote:
> > Except using runnable load average in background, move_tasks is also
> > the key functions in load balance. We need consider the runnable load
> > average in it in order to the apple to apple load comparison.
> >
> > Signed-off-by: Alex Shi <[email protected]>
> > ---
> > kernel/sched/fair.c | 8 +++++++-
> > 1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 0bf88e8..790e23d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3966,6 +3966,12 @@ static unsigned long task_h_load(struct task_struct *p);
> >
> > static const unsigned int sched_nr_migrate_break = 32;
> >
> > +static unsigned long task_h_load_avg(struct task_struct *p)
> > +{
> > + return div_u64(task_h_load(p) * (u64)p->se.avg.runnable_avg_sum,
> > + p->se.avg.runnable_avg_period + 1);
>
> Similarly, I think you also want to at least include blocked_load_avg here.

I'm puzzled; this is an entity weight. Entities don't have blocked_load_avg.

The purpose here is to compute the amount of weight that's being moved by this
task; to subtract from the imbalance.

2013-05-06 15:07:51

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 6/7] sched: consider runnable load average in move_tasks

On 05/06/2013 04:53 PM, Paul Turner wrote:
>> > +static unsigned long task_h_load_avg(struct task_struct *p)
>> > +{
>> > + return div_u64(task_h_load(p) * (u64)p->se.avg.runnable_avg_sum,
>> > + p->se.avg.runnable_avg_period + 1);
> Similarly, I think you also want to at least include blocked_load_avg here.
>
> More fundamentally:
> I suspect the instability from comparing these to an average taken on
> them will not give a representative imbalance weight. While we should
> be no worse off than the present situation; we could be doing much
> better.
>
> Consider that by not consuming {runnable, blocked}_load_avg directly
> you are "hiding" the movement from one load-balancer to the next.

Sure, sounds reasonable, I will try it.

--
Thanks
Alex

2013-05-06 15:26:16

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

On 05/06/2013 06:22 PM, Paul Turner wrote:
>>> >> This is missing a scale_load() right? Further: Why not put this in
>>> >> __sched_fork?
>> >
>> > scale_load is not working now. Anyway I can add this.
> I believe someone tracked down a plausible cause for this:
> A governor was examining the values and making a mess with the scaled
> ones. I'm sorry, I don't have the post off hand.
>
> You actually likely ideally want this _on_ for these patches; the
> available resolution with SCHED_LOAD_SHIFT=10 disappears really
> quickly and scaling by runnable_avg only further accelerates that.

Sorry, I cannot follow you. Do you mean that scaling by runnable_avg is
better than scale_load?

In fact, after thinking twice about scale_load, I guess it is better to
figure out a good use for it before enabling it blindly.

> We should try to get this generally turned on by default again.
>> >
--
Thanks
Alex

2013-05-06 15:30:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

On Mon, May 06, 2013 at 11:26:06PM +0800, Alex Shi wrote:
> On 05/06/2013 06:22 PM, Paul Turner wrote:
> >>> >> This is missing a scale_load() right? Further: Why not put this in
> >>> >> __sched_fork?
> >> >
> >> > scale_load is not working now. Anyway I can add this.
> > I believe someone tracked down a plausible cause for this:
> > A governor was examining the values and making a mess with the scaled
> > ones. I'm sorry, I don't have the post off hand.
> >
> > You actually likely ideally want this _on_ for these patches; the
> > available resolution with SCHED_LOAD_SHIFT=10 disappears really
> > quickly and scaling by runnable_avg only further accelerates that.
>
> Sorry for can not follow you. Do you mean the scaling by runnable_avg is
> better than scale_load?
>
> In fact, after think twice of scale_load, guess better to figure out a
> good usage for it before enabling blindly.

He's saying it's way too easy to run out of fractional bits with the current
setup; adding the 10 extra fractional bits provided by SCHED_LOAD_SHIFT is
advantageous and we should re-enable that.
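
[ Roughly what is meant, assuming the usual definitions when the extra
resolution is compiled in (sketched, not quoted verbatim from the headers):

	#define SCHED_LOAD_RESOLUTION	10
	#define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
	#define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)

A nice-0 weight of 1024 is then carried internally as 1024 << 10 = 1048576,
so multiplying it by a small runnable fraction still leaves thousands of
units of resolution instead of truncating toward zero. ]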

2013-05-06 16:20:48

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [PATCH v5 2/7] sched: remove SMP cover for runnable variables in cfs_rq

Hi Paul,

On 05/06/2013 02:38 PM, Paul Turner wrote:
> On Mon, May 6, 2013 at 1:57 AM, Alex Shi <[email protected]> wrote:
>> On 05/06/2013 04:01 PM, Paul Turner wrote:
>>> On Sun, May 5, 2013 at 6:45 PM, Alex Shi <[email protected]> wrote:
>>>> The following variables were covered under CONFIG_SMP in struct cfs_rq.
>>>> but similar runnable variables work for UP in struct rq and task_group.
>>>> like rq->avg, task_group->load_avg.
>>>> So move them out, they also can work with UP.
>>>
>>> Is there a proposed use-case for UP? My apologies if I missed it in
>>> an alternate patch.
>>
>>> It would seem the only possibly useful thing there would be the
>>> per-rq average for p-state selection; but we can get that without the
>>> per-entity values already.
>>
>>
>> Do you mean to move the rq->avg and task_group->load_avg into CONFIG_SMP?
>
> More generally: Why do we need them in !CONFIG_SMP?
>
> [ I was suggesting (potentially) using only rq->avg in the !CONFIG_SMP case. ]

If you were to have runnable_load_avg and blocked_load_avg of cfs_rq
under CONFIG_SMP, how will tg->load_avg get updated? tg->load_avg is not
SMP dependent.

tg->load_avg, in turn, is used to decide the CPU shares of the sched
entities on the processor, right?

Thanks

Regards
Preeti U Murthy

2013-05-06 18:34:58

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v5 5/7] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On Mon, May 6, 2013 at 8:00 AM, Alex Shi <[email protected]> wrote:
>>
>> blocked_load_avg is the expected "to wake" contribution from tasks
>> already assigned to this rq.
>>
>> e.g. this could be:
>> load = this_rq->cfs.runnable_load_avg + this_rq->cfs.blocked_load_avg;
>
> Current load balance doesn't consider slept task's load which is
> represented by blocked_load_avg. And the slept task is not on_rq, so
> consider it in load balance is a little strange.

The load-balancer has a longer time horizon; think of blocked_load_avg
as a signal for the load, already assigned to this cpu, which is
expected to appear (within roughly the next quantum).

Consider the following scenario:

tasks: A,B (40% busy), C (90% busy)

Suppose we have:
CPU 0: CPU 1:
A C
B

Then, when C blocks, the load balancer ticks.

If we considered only runnable_load then A or B would be eligible for
migration to CPU 1, which is essentially where we are today.
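
[ Working the example through: while C sleeps, its ~0.9 contribution stays
on CPU 1 as blocked_load_avg, so runnable + blocked is roughly 0.4 + 0.4 =
0.8 for CPU 0 versus ~0.9 for CPU 1, and the balancer leaves A and B where
they are. With runnable_load_avg alone, CPU 1 looks empty the moment C
blocks, and one of A or B gets pulled over. ]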

>
> But your concern is worth to try. I will change the patchset and give
> the testing results.
>
>>
>> Although, in general I have a major concern with the current implementation:
>>
>> The entire reason for stability with the bottom up averages is that
>> when load migrates between cpus we are able to migrate it between the
>> tracked sums.
>>
>> Stuffing observed averages of these into the load_idxs loses that
>> mobility; we will have to stall (as we do today for idx > 0) before we
>> can recognize that a cpu's load has truly left it; this is a very
>> similar problem to the need to stably track this for group shares
>> computation.
>>
>> To that end, I would rather see the load_idx disappear completely:
>> (a) We can calculate the imbalance purely from delta (runnable_avg +
>> blocked_avg)
>> (b) It eliminates a bad tunable.
>
> I also show the similar concern of load_idx months ago. seems overlooked. :)
>>
>>> - return cpu_rq(cpu)->load.weight;
>>> + return (unsigned long)cpu_rq(cpu)->cfs.runnable_load_avg;
>>
>> Isn't this going to truncate on the 32-bit case?
>
> I guess not, the old load.weight is unsigned long, and runnable_load_avg
> is smaller than the load.weight. so it should be fine.
>
> btw, according to above reason, guess move runnable_load_avg to
> 'unsigned long' type is ok, do you think so?
>

Hmm, so long as it's unsigned long and not u32 that should be OK.

From a technical standpoint:
We make the argument that we run out of address space before we can
overflow load.weight in the 32-bit case; we can make the same argument
here.

>
> --
> Thanks
> Alex

2013-05-06 20:59:56

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v5 6/7] sched: consider runnable load average in move_tasks

On Mon, May 6, 2013 at 8:04 AM, Peter Zijlstra <[email protected]> wrote:
> On Mon, May 06, 2013 at 01:53:44AM -0700, Paul Turner wrote:
>> On Sun, May 5, 2013 at 6:45 PM, Alex Shi <[email protected]> wrote:
>> > Except using runnable load average in background, move_tasks is also
>> > the key functions in load balance. We need consider the runnable load
>> > average in it in order to the apple to apple load comparison.
>> >
>> > Signed-off-by: Alex Shi <[email protected]>
>> > ---
>> > kernel/sched/fair.c | 8 +++++++-
>> > 1 file changed, 7 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 0bf88e8..790e23d 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -3966,6 +3966,12 @@ static unsigned long task_h_load(struct task_struct *p);
>> >
>> > static const unsigned int sched_nr_migrate_break = 32;
>> >
>> > +static unsigned long task_h_load_avg(struct task_struct *p)
>> > +{
>> > + return div_u64(task_h_load(p) * (u64)p->se.avg.runnable_avg_sum,
>> > + p->se.avg.runnable_avg_period + 1);
>>
>> Similarly, I think you also want to at least include blocked_load_avg here.
>
> I'm puzzled, this is an entity weight. Entity's don't have blocked_load_avg.
>
> The purpose here is to compute the amount of weight that's being moved by this
> task; to subtract from the imbalance.

Sorry, what I meant to say here is:
If we're going to be using a runnable average based load here the
fraction we take (currently instantaneous) in tg_load_down should be
consistent.

2013-05-07 00:24:54

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 5/7] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task


> The load-balancer has a longer time horizon; think of blocked_load_avg
> to be a signal for the load, already assigned to this cpu, which is
> expected to appear (within roughly the next quantum).
>
> Consider the following scenario:
>
> tasks: A,B (40% busy), C (90% busy)
>
> Suppose we have:
> CPU 0: CPU 1:
> A C
> B
>
> Then, when C blocks the load balancer ticks.
>
> If we considered only runnable_load then A or B would be eligible for
> migration to CPU 1, which is essentially where we are today.

Thanks for re-clarifying. Yes, that's the value of blocked_load_avg here. :)
Anyway, I will try to measure them with some benchmarks.
>
>>
>> But your concern is worth to try. I will change the patchset and give
>> the testing results.
>> I guess not, the old load.weight is unsigned long, and runnable_load_avg
>> is smaller than the load.weight. so it should be fine.
>>
>> btw, according to above reason, guess move runnable_load_avg to
>> 'unsigned long' type is ok, do you think so?
>>
>
> Hmm, so long as it's unsigned long and not u32 that should be OK.
>
> From a technical standpoint:
> We make the argument that we run out of address space before we can
> overflow load.weight in the 32-bit case, we can make the same argument
> here.

Thanks for the comments and input! :)
>
>>
>> --
>> Thanks
>> Alex


--
Thanks
Alex

2013-05-07 02:18:18

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

On 05/06/2013 06:17 PM, Paul Turner wrote:
>>> >> Rather than exposing the representation of load_avg_contrib to
>>> >> __sched_fork it might also be better to call:
>>> >> __update_task_entity_contrib(&p->se)
>>> >> After the initialization above; this would also avoid potential bugs
>>> >> like the missing scale_load() above.
>> >
>> > Above simple change can not work.
> Could you provide additional detail here? Note that the sum change I
> was suggesting above was:
>
> __sched_fork():
> + p->se.avg.decay_count = 0;
> + p->se.avg.runnable_avg_period = 1024;
> + p->se.avg.runnable_avg_sum = 1024;
> + __update_task_entity_contrib(&p->se);
>
> [ Also: move __sched_fork() beyond p->sched_reset_on_fork in sched_fork(). ]

Thanks Paul!
It seems to work with this change if the new __sched_fork() is moved after
the p->sched_reset_on_fork setting.

But why do we initialize the avg sum to 1024? A new task may go to sleep,
and the initial 1024 then gives an unreasonable value.

I guess letting the task accumulate its own avg sum and period is more natural.
>
>> > We had talked this solution months ago. And get agreement on this patch.
>> > https://lkml.org/lkml/2013/2/20/48 :)
> Yes, I made the same suggestion in the last round, see:
> https://lkml.org/lkml/2013/2/19/176
>
> Your reply there seems like an ack of my suggestion, the only
> difference I'm seeing is that using __update_task_entity_contrib() as
> originally suggested is safer since it keeps the representation of
> load_avg_contrib opaque.

Yes, using __update_task_entity_contrib() makes load_avg_contrib opaque,
but the initial value of 1024 is still a bit arbitrary.
>


--
Thanks
Alex

2013-05-07 02:20:17

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

On 05/06/2013 09:45 AM, Alex Shi wrote:
> We need initialize the se.avg.{decay_count, load_avg_contrib} for a
> new forked task.
> Otherwise random values of above variables cause mess when do new task
> enqueue:
> enqueue_task_fair
> enqueue_entity
> enqueue_entity_load_avg
>
> and make forking balancing imbalance since incorrect load_avg_contrib.
>
> set avg.decay_count = 0, and avg.load_avg_contrib = se->load.weight to
> resolve such issues.
>
> Signed-off-by: Alex Shi <[email protected]>

Sorry for forgetting this:

Acked-by: Peter Zijlstra <[email protected]>
--
Thanks
Alex

2013-05-07 02:43:21

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH v5 7/7] sched: consider runnable load average in effective_load

On 05/06/2013 05:59 PM, Preeti U Murthy wrote:
> On 05/06/2013 03:05 PM, Alex Shi wrote:
>> On 05/06/2013 05:06 PM, Paul Turner wrote:
>>> I don't think this is a good idea:
>>>
>>> The problem with not using the instantaneous weight here is that you
>>> potentially penalize the latency of interactive tasks (similarly,
>>> potentially important background threads -- e.g. garbage collection).
>>>
>>> Counter-intuitively we actually want such tasks on the least loaded
>>> cpus to minimize their latency. If the load they contribute ever
>>> becomes more substantial we trust that periodic balance will start
>>> taking notice of them.
>>
>> Sounds reasonable. Many thanks for your input, Paul!
>>
>> So, will use the seconds try. :)
>>>
>>> [ This is similar to why we have to use the instantaneous weight in
>>> calc_cfs_shares. ]
>>>
>>
>>
>
> Yes thank you very much for the inputs Paul :)
>
> So Alex, Michael looks like this is what happened.
>
> 1. The effective_load() as it is, uses instantaneous loads to calculate
> the CPU shares before and after a new task can be woken up on the given cpu.
>
> 2. With my patch, I modified it to use runnable load average while
> calculating the CPU share *after* a new task could be woken up and
> retained instantaneous load to calculate the CPU share before a new task
> could be woken up.
>
> 3. With the first patch of Alex, he has used runnable load average while
> calculating the CPU share both before and after a new task could be
> woken up on the given CPU.
>
> 4.The suggestions that Alex gave:
>
> Suggestion1: Would change the CPU share calculation to use runnable load
> average all the time.
>
> Suggestion2: Did opposite of point 2 above,it used runnable load average
> while calculating the CPU share *before* a new task has been woken up
> while it retaining the instantaneous weight to calculate the CPU share
> after a new task could be woken up.
>
> So since there was no uniformity in the calculation of CPU shares in
> approaches 2 and 3, I think it caused a regression. However I still
> don't understand how approach 4-Suggestion2 made that go away although
> there was non-uniformity in the CPU shares calculation.
>
> But as Paul says we could retain the usage of instantaneous loads
> wherever there is calculation of CPU shares for the reason he mentioned
> and leave effective_load() and calc_cfs_shares() untouched.
>
> This also brings forth another question,should we modify wake_affine()
> to pass the runnable load average of the waking up task to effective_load().
>
> What do you think?

I would suggest we don't touch that dark corner... unless it damages some
benchmark. AFAIK, pgbench is the only one that could be influenced, and none
of those approaches makes things better than patches 1~6 alone...

Regards,
Michael Wang

>
>
> Thanks
>
> Regards
> Preeti U Murthy
>

2013-05-07 03:07:08

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

On Mon, May 6, 2013 at 7:18 PM, Alex Shi <[email protected]> wrote:
> On 05/06/2013 06:17 PM, Paul Turner wrote:
>>>> >> Rather than exposing the representation of load_avg_contrib to
>>>> >> __sched_fork it might also be better to call:
>>>> >> __update_task_entity_contrib(&p->se)
>>>> >> After the initialization above; this would also avoid potential bugs
>>>> >> like the missing scale_load() above.
>>> >
>>> > Above simple change can not work.
>> Could you provide additional detail here? Note that the sum change I
>> was suggesting above was:
>>
>> __sched_fork():
>> + p->se.avg.decay_count = 0;
>> + p->se.avg.runnable_avg_period = 1024;
>> + p->se.avg.runnable_avg_sum = 1024;
>> + __update_task_entity_contrib(&p->se);
>>
>> [ Also: move __sched_fork() beyond p->sched_reset_on_fork in sched_fork(). ]
>
> Thanks Paul!
> It seems work with this change if new __sched_fork move after the
> p->sched_reset_on_fork setting.
>
> But why we initial avg sum to 1024? new task may goes to sleep, the
> initial 1024 give a unreasonable initial value.
>
> guess let the task accumulate itself avg sum and period is more natural.

1024 is a full single unit period representing ~1ms of time.

The reason to store a small initial "observation" here is so that
when we reach our next period edge our load converges (presumably
down) towards its true target more smoothly; as well as providing a
task additional protection from being considered "small" through
start-up.
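
[ To make the convergence concrete (approximate numbers): starting from
runnable_avg_sum = runnable_avg_period = 1024 the ratio is ~1, i.e. full
utilization. If the task then sleeps, each further ~1ms period multiplies
the sum by y (y^32 = 1/2) while the period keeps accumulating wall-clock
time toward its saturation value, so the ratio, and with it
load_avg_contrib, decays smoothly from ~100% toward the task's real
utilization instead of starting from a random or zero value. ]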

>>
>>> > We had talked this solution months ago. And get agreement on this patch.
>>> > https://lkml.org/lkml/2013/2/20/48 :)
>> Yes, I made the same suggestion in the last round, see:
>> https://lkml.org/lkml/2013/2/19/176
>>
>> Your reply there seems like an ack of my suggestion, the only
>> difference I'm seeing is that using __update_task_entity_contrib() as
>> originally suggested is safer since it keeps the representation of
>> load_avg_contrib opaque.
>
> Yes, using __update_task_entity_contrib make load_avg_contrib opaque.
> but just initial value 1024 is a bit arbitrary.
>>
>
>
> --
> Thanks
> Alex

2013-05-07 03:25:06

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

On 05/07/2013 11:06 AM, Paul Turner wrote:
>> > Thanks Paul!
>> > It seems work with this change if new __sched_fork move after the
>> > p->sched_reset_on_fork setting.
>> >
>> > But why we initial avg sum to 1024? new task may goes to sleep, the
>> > initial 1024 give a unreasonable initial value.
>> >
>> > guess let the task accumulate itself avg sum and period is more natural.
> 1024 is a full single unit period representing ~1ms of time.
>
> The reason to store a small initial "observation" here is so that as
> when we reach our next period edge our load converges (presumably
> down) towards its true target more smoothly; as well as providing a
> task additional protection from being considered "small" through
> start-up.
>

It will give a newly forked task 1 ms of extra running time. That will give
incorrect info if the newly forked task goes to sleep for a while.
But this info should benefit some benchmarks like aim7 and
pthread_cond_broadcast. So I am convinced. :)

What's your opinion of this, Peter?

--
Thanks
Alex

2013-05-07 05:03:52

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

On 05/07/2013 11:24 AM, Alex Shi wrote:
>> > The reason to store a small initial "observation" here is so that as
>> > when we reach our next period edge our load converges (presumably
>> > down) towards its true target more smoothly; as well as providing a
>> > task additional protection from being considered "small" through
>> > start-up.
>> >
> It will give new forked task 1 ms extra running time. That will bring
> incorrect info if the new forked goes to sleep a while.
> But this info should benefit to some benchmarks like aim7,
> pthread_cond_broadcast. So I am convinced. :)
>
> What's your opinion of this, Peter?


Here is the patch according to Paul's suggestions.
Just the reference to __update_task_entity_contrib in sched.h looks ugly.
Comments are appreciated!

---
From 647404447c996507b6a94110ed13fd122e4ee154 Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Mon, 3 Dec 2012 17:30:39 +0800
Subject: [PATCH 3/7] sched: set initial value of runnable avg for new forked
task

We need to initialize se.avg.{decay_count, load_avg_contrib} for a
newly forked task.
Otherwise random values of the above variables cause a mess during new task
enqueue:
enqueue_task_fair
enqueue_entity
enqueue_entity_load_avg

and make fork balancing imbalanced due to an incorrect load_avg_contrib.

Set avg.decay_count = 0, and give initial values to runnable_avg_sum/period
to resolve such issues.

Thanks for Paul's suggestions.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/core.c | 8 +++++++-
kernel/sched/fair.c | 4 ++++
kernel/sched/sched.h | 1 +
3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c8db984..4e78de1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1566,6 +1566,11 @@ static void __sched_fork(struct task_struct *p)
#ifdef CONFIG_SMP
p->se.avg.runnable_avg_period = 0;
p->se.avg.runnable_avg_sum = 0;
+ p->se.avg.decay_count = 0;
+ /* New forked task assumed with full utilization */
+ p->se.avg.runnable_avg_period = 1024;
+ p->se.avg.runnable_avg_sum = 1024;
+ __update_task_entity_contrib(&p->se);
#endif
#ifdef CONFIG_SCHEDSTATS
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
@@ -1619,7 +1624,6 @@ void sched_fork(struct task_struct *p)
unsigned long flags;
int cpu = get_cpu();

- __sched_fork(p);
/*
* We mark the process as running here. This guarantees that
* nobody will actually run it, and a signal or other external
@@ -1653,6 +1657,8 @@ void sched_fork(struct task_struct *p)
p->sched_reset_on_fork = 0;
}

+ __sched_fork(p);
+
if (!rt_prio(p->prio))
p->sched_class = &fair_sched_class;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9c2f726..2881d42 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1508,6 +1508,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
* We track migrations using entity decay_count <= 0, on a wake-up
* migration we use a negative decay count to track the remote decays
* accumulated while sleeping.
+ *
+ * When enqueue a new forked task, the se->avg.decay_count == 0, so
+ * we bypass update_entity_load_avg(), use avg.load_avg_contrib initial
+ * value: se->load.weight.
*/
if (unlikely(se->avg.decay_count <= 0)) {
se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c6634f1..ec4cb9b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -876,6 +876,7 @@ extern const struct sched_class idle_sched_class;
extern void trigger_load_balance(struct rq *rq, int cpu);
extern void idle_balance(int this_cpu, struct rq *this_rq);

+extern inline void __update_task_entity_contrib(struct sched_entity *se);
#else /* CONFIG_SMP */

static inline void idle_balance(int cpu, struct rq *rq)
--
1.7.12

--
Thanks
Alex

2013-05-07 05:05:46

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 1/7] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"

On 05/06/2013 04:55 PM, Paul Turner wrote:
> On Mon, May 6, 2013 at 1:49 AM, Alex Shi <[email protected]> wrote:
>> On 05/06/2013 04:24 PM, Paul Turner wrote:
>>>>> /*
>>>>> * CFS Load tracking
>>>>> * Under CFS, load is tracked on a per-entity basis and aggregated up.
>>>>> @@ -242,8 +236,7 @@ struct cfs_rq {
>>>>> u64 runnable_load_avg, blocked_load_avg;
>>>>> atomic64_t decay_counter, removed_load;
>>>>> u64 last_decay;
>>>>> -#endif /* CONFIG_FAIR_GROUP_SCHED */
>>>>> -/* These always depend on CONFIG_FAIR_GROUP_SCHED */
>>> We should perhaps replace this with a comment that these are only
>>> needed to aggregate the point-wise representation in the
>>> FAIR_GROUP_SCHED case.
>>>
>>
>> Is the comments ok? :)
>>
>> /* runnable related variables only used in FAIR_GROUP_SCHED */
>
> Perhaps:
>
> -/* These always depend on CONFIG_FAIR_GROUP_SCHED */
> #ifdef ...
> +/* Required to track per-cpu representation of a task_group */

Here is the changed patch.
Do you mind giving an Acked-by or Reviewed-by? :)

---
From 5a7ba8bd36051039aa8d7288bbb699efc4d9b8be Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Sun, 3 Mar 2013 16:04:36 +0800
Subject: [PATCH 1/7] Revert "sched: Introduce temporary FAIR_GROUP_SCHED
dependency for load-tracking"

Remove CONFIG_FAIR_GROUP_SCHED that covers the runnable info, then
we can use runnable load variables.

Signed-off-by: Alex Shi <[email protected]>
---
include/linux/sched.h | 7 +------
kernel/sched/core.c | 7 +------
kernel/sched/fair.c | 13 ++-----------
kernel/sched/sched.h | 10 ++--------
4 files changed, 6 insertions(+), 31 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e692a02..9539597 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1161,12 +1161,7 @@ struct sched_entity {
struct cfs_rq *my_q;
#endif

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/* Per-entity load-tracking */
struct sched_avg avg;
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67d0465..c8db984 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1563,12 +1563,7 @@ static void __sched_fork(struct task_struct *p)
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
p->se.avg.runnable_avg_period = 0;
p->se.avg.runnable_avg_sum = 0;
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..9c2f726 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1109,8 +1109,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_FAIR_GROUP_SCHED */

-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/*
* We choose a half-life close to 1 scheduling period.
* Note: The tables below are dependent on this value.
@@ -3394,12 +3393,6 @@ unlock:
}

/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
* Called immediately before a task is migrated to a new cpu; task_cpu(p) and
* cfs_rq_of(p) references at time of call are still valid and identify the
* previous cpu. However, the caller only guarantees p->pi_lock is held; no
@@ -3422,7 +3415,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
}
}
-#endif
#endif /* CONFIG_SMP */

static unsigned long
@@ -6114,9 +6106,8 @@ const struct sched_class fair_sched_class = {

#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
.migrate_task_rq = migrate_task_rq_fair,
-#endif
+
.rq_online = rq_online_fair,
.rq_offline = rq_offline_fair,

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cc03cfd..9419764 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -227,12 +227,6 @@ struct cfs_rq {
#endif

#ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
/*
* CFS Load tracking
* Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -242,9 +236,9 @@ struct cfs_rq {
u64 runnable_load_avg, blocked_load_avg;
atomic64_t decay_counter, removed_load;
u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
+ /* Required to track per-cpu representation of a task_group */
u32 tg_runnable_contrib;
u64 tg_load_contrib;
#endif /* CONFIG_FAIR_GROUP_SCHED */
--
1.7.12

--
Thanks
Alex

2013-05-07 05:08:06

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 2/7] sched: remove SMP cover for runnable variables in cfs_rq

On 05/06/2013 05:08 PM, Paul Turner wrote:
>> >
>> >
>> > Do you mean to move the rq->avg and task_group->load_avg into CONFIG_SMP?
> More generally: Why do we need them in !CONFIG_SMP?
>
> [ I was suggesting (potentially) using only rq->avg in the !CONFIG_SMP case. ]
>
>

Paul,
here is the patch according to your suggestions. Any comments? :)

---
From 206957227ed899aa44fabeb9890117428103dd1e Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Mon, 6 May 2013 22:20:29 +0800
Subject: [PATCH 2/7] sched: move few runnable tg variables into CONFIG_SMP

The following 2 variables are only used under CONFIG_SMP, so it is better to
move their definitions under CONFIG_SMP too.

atomic64_t load_avg;
atomic_t runnable_avg;

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/sched.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9419764..c6634f1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -114,9 +114,11 @@ struct task_group {
unsigned long shares;

atomic_t load_weight;
+#ifdef CONFIG_SMP
atomic64_t load_avg;
atomic_t runnable_avg;
#endif
+#endif

#ifdef CONFIG_RT_GROUP_SCHED
struct sched_rt_entity **rt_se;
--
1.7.12

--
Thanks
Alex

2013-05-07 05:12:31

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 5/7] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On 05/07/2013 02:34 AM, Paul Turner wrote:
>> > Current load balance doesn't consider slept task's load which is
>> > represented by blocked_load_avg. And the slept task is not on_rq, so
>> > consider it in load balance is a little strange.
> The load-balancer has a longer time horizon; think of blocked_load_avg
> to be a signal for the load, already assigned to this cpu, which is
> expected to appear (within roughly the next quantum).
>
> Consider the following scenario:
>
> tasks: A,B (40% busy), C (90% busy)
>
> Suppose we have:
> CPU 0: CPU 1:
> A C
> B
>
> Then, when C blocks the load balancer ticks.
>
> If we considered only runnable_load then A or B would be eligible for
> migration to CPU 1, which is essentially where we are today.
>

Here is the changed patch according to Paul's comments. Is this what you had in mind, Paul? :)

---

>From 1d7290530e2ee402874bbce39297bb1cfd882339 Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Sat, 17 Nov 2012 13:56:11 +0800
Subject: [PATCH 5/7] sched: compute runnable load avg in cpu_load and
cpu_avg_load_per_task

They are the base values used in load balancing; update them with the rq
runnable load average so that load balancing considers the runnable load
avg naturally.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/core.c | 16 ++++++++++++++--
kernel/sched/fair.c | 8 ++++++--
2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 33d9a858..f4c6cac 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2536,9 +2536,14 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
void update_idle_cpu_load(struct rq *this_rq)
{
unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
- unsigned long load = this_rq->load.weight;
+ unsigned long load;
unsigned long pending_updates;

+#ifdef CONFIG_SMP
+ load = this_rq->cfs.runnable_load_avg + this_rq->cfs.blocked_load_avg;
+#else
+ load = this_rq->load.weight;
+#endif
/*
* bail if there's load or we're actually up-to-date.
*/
@@ -2582,11 +2587,18 @@ void update_cpu_load_nohz(void)
*/
static void update_cpu_load_active(struct rq *this_rq)
{
+ unsigned long load;
+
+#ifdef CONFIG_SMP
+ load = this_rq->cfs.runnable_load_avg + this_rq->cfs.blocked_load_avg;
+#else
+ load = this_rq->load.weight;
+#endif
/*
* See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
*/
this_rq->last_load_update_tick = jiffies;
- __update_cpu_load(this_rq, this_rq->load.weight, 1);
+ __update_cpu_load(this_rq, load, 1);

calc_load_account_active(this_rq);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2881d42..407ef61 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2900,7 +2900,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* Used instead of source_load when we know the type == 0 */
static unsigned long weighted_cpuload(const int cpu)
{
- return cpu_rq(cpu)->load.weight;
+ struct rq *rq = cpu_rq(cpu);
+ return rq->cfs.runnable_load_avg + rq->cfs.blocked_load_avg;
}

/*
@@ -2946,8 +2947,11 @@ static unsigned long cpu_avg_load_per_task(int cpu)
struct rq *rq = cpu_rq(cpu);
unsigned long nr_running = ACCESS_ONCE(rq->nr_running);

+ unsigned long load_avg;
+ load_avg = rq->cfs.runnable_load_avg + rq->cfs.blocked_load_avg;
+
if (nr_running)
- return rq->load.weight / nr_running;
+ return load_avg / nr_running;

return 0;
}
--
1.7.12


--
Thanks
Alex

2013-05-07 05:18:11

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 6/7] sched: consider runnable load average in move_tasks

On 05/07/2013 04:59 AM, Paul Turner wrote:
>>> Similarly, I think you also want to at least include blocked_load_avg here.
>> >
>> > I'm puzzled, this is an entity weight. Entity's don't have blocked_load_avg.
>> >
>> > The purpose here is to compute the amount of weight that's being moved by this
>> > task; to subtract from the imbalance.
> Sorry, what I meant to say here is:
> If we're going to be using a runnable average based load here the
> fraction we take (currently instantaneous) in tg_load_down should be
> consistent.

Yes, I think so.

So here is the patch; could you take a look?
---

>From 8a98af9578154ce5d755b2c6ea7da0109cd6efa8 Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Mon, 3 Dec 2012 23:00:53 +0800
Subject: [PATCH 6/7] sched: consider runnable load average in move_tasks

Besides using the runnable load average in the background, move_tasks() is
also a key function in load balancing. We need to consider the runnable
load average in it in order to make an apples-to-apples load comparison.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 407ef61..ca0e051 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4121,11 +4121,12 @@ static int tg_load_down(struct task_group *tg, void *data)
long cpu = (long)data;

if (!tg->parent) {
- load = cpu_rq(cpu)->load.weight;
+ load = cpu_rq(cpu)->avg.load_avg_contrib;
} else {
load = tg->parent->cfs_rq[cpu]->h_load;
- load *= tg->se[cpu]->load.weight;
- load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
+ load *= tg->se[cpu]->avg.load_avg_contrib;
+ load /= tg->parent->cfs_rq[cpu]->runnable_load_avg
+ + tg->parent->cfs_rq[cpu]->blocked_load_avg + 1;
}

tg->cfs_rq[cpu]->h_load = load;
@@ -4153,8 +4154,9 @@ static unsigned long task_h_load(struct task_struct *p)
struct cfs_rq *cfs_rq = task_cfs_rq(p);
unsigned long load;

- load = p->se.load.weight;
- load = div_u64(load * cfs_rq->h_load, cfs_rq->load.weight + 1);
+ load = p->se.avg.load_avg_contrib;
+ load = div_u64(load * cfs_rq->h_load,
+ cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg + 1);

return load;
}
--
1.7.12

--
Thanks
Alex

2013-05-07 05:44:04

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 7/7] sched: consider runnable load average in effective_load

On 05/06/2013 05:59 PM, Preeti U Murthy wrote:
> Suggestion1: Would change the CPU share calculation to use runnable load
> average all the time.
>
> Suggestion2: Did opposite of point 2 above,it used runnable load average
> while calculating the CPU share *before* a new task has been woken up
> while it retaining the instantaneous weight to calculate the CPU share
> after a new task could be woken up.
>
> So since there was no uniformity in the calculation of CPU shares in
> approaches 2 and 3, I think it caused a regression. However I still
> don't understand how approach 4-Suggestion2 made that go away although
> there was non-uniformity in the CPU shares calculation.
>
> But as Paul says we could retain the usage of instantaneous loads
> wherever there is calculation of CPU shares for the reason he mentioned
> and leave effective_load() and calc_cfs_shares() untouched.
>
> This also brings forth another question,should we modify wake_affine()
> to pass the runnable load average of the waking up task to effective_load().
>
> What do you think?

I am not Paul. :)

The patch acceptable for pgbench is attached. In fact, since effective_load()
mixes the direct load and the tg's runnable load, the patch does not make much
sense. So I am going to agree to drop it if there is no performance benefit on
my benchmarks.

---

>From f58519a8de3cebb7a865c9911c00dce5f1dd87f2 Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Fri, 3 May 2013 13:29:04 +0800
Subject: [PATCH 7/7] sched: consider runnable load average in effective_load

effective_load() calculates the load change as seen from the
root_task_group. It needs to use the runnable average of the changed task.

Thanks to Morten Rasmussen and PeterZ for the reminder about this.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ca0e051..b683909 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2980,15 +2980,15 @@ static void task_waking_fair(struct task_struct *p)

#ifdef CONFIG_FAIR_GROUP_SCHED
/*
- * effective_load() calculates the load change as seen from the root_task_group
+ * effective_load() calculates load avg change as seen from the root_task_group
*
* Adding load to a group doesn't make a group heavier, but can cause movement
* of group shares between cpus. Assuming the shares were perfectly aligned one
* can calculate the shift in shares.
*
- * Calculate the effective load difference if @wl is added (subtracted) to @tg
- * on this @cpu and results in a total addition (subtraction) of @wg to the
- * total group weight.
+ * Calculate the effective load avg difference if @wl is added (subtracted) to
+ * @tg on this @cpu and results in a total addition (subtraction) of @wg to the
+ * total group load avg.
*
* Given a runqueue weight distribution (rw_i) we can compute a shares
* distribution (s_i) using:
@@ -3002,7 +3002,7 @@ static void task_waking_fair(struct task_struct *p)
* rw_i = { 2, 4, 1, 0 }
* s_i = { 2/7, 4/7, 1/7, 0 }
*
- * As per wake_affine() we're interested in the load of two CPUs (the CPU the
+ * As per wake_affine() we're interested in load avg of two CPUs (the CPU the
* task used to run on and the CPU the waker is running on), we need to
* compute the effect of waking a task on either CPU and, in case of a sync
* wakeup, compute the effect of the current task going to sleep.
@@ -3012,20 +3012,20 @@ static void task_waking_fair(struct task_struct *p)
*
* s'_i = (rw_i + @wl) / (@wg + \Sum rw_j) (2)
*
- * Suppose we're interested in CPUs 0 and 1, and want to compute the load
+ * Suppose we're interested in CPUs 0 and 1, and want to compute the load avg
* differences in waking a task to CPU 0. The additional task changes the
* weight and shares distributions like:
*
* rw'_i = { 3, 4, 1, 0 }
* s'_i = { 3/8, 4/8, 1/8, 0 }
*
- * We can then compute the difference in effective weight by using:
+ * We can then compute the difference in effective load avg by using:
*
* dw_i = S * (s'_i - s_i) (3)
*
* Where 'S' is the group weight as seen by its parent.
*
- * Therefore the effective change in loads on CPU 0 would be 5/56 (3/8 - 2/7)
+ * Therefore the effective change in load avg on CPU 0 would be 5/56 (3/8 - 2/7)
* times the weight of the group. The effect on CPU 1 would be -4/56 (4/8 -
* 4/7) times the weight of the group.
*/
@@ -3070,7 +3070,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
/*
* wl = dw_i = S * (s'_i - s_i); see (3)
*/
- wl -= se->load.weight;
+ wl -= se->avg.load_avg_contrib;

/*
* Recursively apply this logic to all parent groups to compute
@@ -3116,14 +3116,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
*/
if (sync) {
tg = task_group(current);
- weight = current->se.load.weight;
+ weight = current->se.avg.load_avg_contrib;

this_load += effective_load(tg, this_cpu, -weight, -weight);
load += effective_load(tg, prev_cpu, 0, -weight);
}

tg = task_group(p);
- weight = p->se.load.weight;
+ weight = p->se.avg.load_avg_contrib;

/*
* In low-load situations, where prev_cpu is idle and this_cpu is idle
--
1.7.12

--
Thanks
Alex

2013-05-07 06:17:57

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 5/7] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On 05/06/2013 07:10 PM, Peter Zijlstra wrote:
>> > The runnable_avgs themselves actually have a fair bit of history in
>> > them already (50% is last 32ms); but given that they don't need to be
>> > cut-off to respond to load being migrated I'm guessing we could
>> > actually potentially get by with just "instaneous" and "use averages"
>> > where appropriate?
> Sure,. worth a try. If things fall over we can always look at it again.
>
>> > We always end up having to re-pick/tune them based on a variety of
>> > workloads; if we can eliminate them I think it would be a win.
> Agreed, esp. the plethora of weird idx things we currently have. If we need to
> re-introduce something it would likely only be the busy case and for that we
> can immediately link to the balance interval or so.
>
>
>

I'd like to give it a try based on this patchset. :)

First, we can remove the idx and check whether the removal is fine for our
benchmarks: kbuild, dbench, tbench, hackbench, aim7, specjbb, etc.

If there are regressions, we can think about it more.

--
Thanks
Alex

2013-05-07 09:57:07

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

On Tue, May 07, 2013 at 04:06:33AM +0100, Paul Turner wrote:
> On Mon, May 6, 2013 at 7:18 PM, Alex Shi <[email protected]> wrote:
> > On 05/06/2013 06:17 PM, Paul Turner wrote:
> >>>> >> Rather than exposing the representation of load_avg_contrib to
> >>>> >> __sched_fork it might also be better to call:
> >>>> >> __update_task_entity_contrib(&p->se)
> >>>> >> After the initialization above; this would also avoid potential bugs
> >>>> >> like the missing scale_load() above.
> >>> >
> >>> > Above simple change can not work.
> >> Could you provide additional detail here? Note that the sum change I
> >> was suggesting above was:
> >>
> >> __sched_fork():
> >> + p->se.avg.decay_count = 0;
> >> + p->se.avg.runnable_avg_period = 1024;
> >> + p->se.avg.runnable_avg_sum = 1024;
> >> + __update_task_entity_contrib(&p->se);
> >>
> >> [ Also: move __sched_fork() beyond p->sched_reset_on_fork in sched_fork(). ]
> >
> > Thanks Paul!
> > It seems work with this change if new __sched_fork move after the
> > p->sched_reset_on_fork setting.
> >
> > But why we initial avg sum to 1024? new task may goes to sleep, the
> > initial 1024 give a unreasonable initial value.
> >
> > guess let the task accumulate itself avg sum and period is more natural.
>
> 1024 is a full single unit period representing ~1ms of time.
>
> The reason to store a small initial "observation" here is so that as
> when we reach our next period edge our load converges (presumably
> down) towards its true target more smoothly; as well as providing a
> task additional protection from being considered "small" through
> start-up.
>

Agree. The tracked load of new tasks is often very unstable during the first
couple of ms on loaded systems. I'm not sure if 1024 is the right
initial value. It may need to be larger.

Morten

> >>
> >>> > We had talked this solution months ago. And get agreement on this patch.
> >>> > https://lkml.org/lkml/2013/2/20/48 :)
> >> Yes, I made the same suggestion in the last round, see:
> >> https://lkml.org/lkml/2013/2/19/176
> >>
> >> Your reply there seems like an ack of my suggestion, the only
> >> difference I'm seeing is that using __update_task_entity_contrib() as
> >> originally suggested is safer since it keeps the representation of
> >> load_avg_contrib opaque.
> >
> > Yes, using __update_task_entity_contrib make load_avg_contrib opaque.
> > but just initial value 1024 is a bit arbitrary.
> >>
> >
> >
> > --
> > Thanks
> > Alex
>

2013-05-07 11:05:26

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

On 05/07/2013 05:57 PM, Morten Rasmussen wrote:
> Agree. The tracked load of new tasks is often very unstable during first
> couple of ms on loaded systems. I'm not sure if 1024 is the right
> initial value. It may need to be larger.

Maybe Peter can give a better value based on his experience with the
scheduler. :)

--
Thanks
Alex

2013-05-07 11:21:28

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

Yes, 1024 was only intended as a starting point. We could also
arbitrarily pick something larger, the key is that we pick
_something_.

If we wanted to be more exacting about it we could just give them a
sched_slice() worth; this would have a few obvious "nice" properties
(pun intended).

On Tue, May 7, 2013 at 4:05 AM, Alex Shi <[email protected]> wrote:
> On 05/07/2013 05:57 PM, Morten Rasmussen wrote:
>> Agree. The tracked load of new tasks is often very unstable during first
>> couple of ms on loaded systems. I'm not sure if 1024 is the right
>> initial value. It may need to be larger.
>
> Maybe Peter can give a better value according to the experience on
> scheduler. :)
>
> --
> Thanks
> Alex

2013-05-08 01:34:09

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 7/7] sched: consider runnable load average in effective_load

On 05/07/2013 01:43 PM, Alex Shi wrote:
>> > This also brings forth another question,should we modify wake_affine()
>> > to pass the runnable load average of the waking up task to effective_load().
>> >
>> > What do you think?
> I am not Paul. :)
>
> The acceptable patch of pgbench attached. In fact, since effective_load is mixed
> with direct load and tg's runnable load. the patch looks no much sense.
> So, I am going to agree to drop it if there is no performance benefit on my benchmarks.

This has bad results for hackbench on the 4-socket NHM/SNB machines, and bad
results for oltp on the NHM EX machine. So, drop it.

--
Thanks
Alex

2013-05-08 01:39:41

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 6/7] sched: consider runnable load average in move_tasks

On 05/07/2013 01:17 PM, Alex Shi wrote:
>> > Sorry, what I meant to say here is:
>> > If we're going to be using a runnable average based load here the
>> > fraction we take (currently instantaneous) in tg_load_down should be
>> > consistent.
> yes. I think so.
>
> So, here is the patch, could you like take a look?

The new patchset, up to and including this patch, has bad results on kbuild.
So I will do more testing and drop some of the patches, or all of them.

--
Thanks
Alex

2013-05-08 11:17:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

On Tue, May 07, 2013 at 11:24:52AM +0800, Alex Shi wrote:
> It will give new forked task 1 ms extra running time. That will bring
> incorrect info if the new forked goes to sleep a while.
> But this info should benefit to some benchmarks like aim7,
> pthread_cond_broadcast. So I am convinced. :)
>
> What's your opinion of this, Peter?

We're going to get it wrong anyhow (lack of crystal ball instruction);
overestimating the load seems like the best deal wrt latency for new tasks.

The only other option we could try (not sure it's even worth trying) would be
half load.

2013-05-08 11:36:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

On Tue, May 07, 2013 at 04:20:55AM -0700, Paul Turner wrote:
> Yes, 1024 was only intended as a starting point. We could also
> arbitrarily pick something larger, the key is that we pick
> _something_.
>
> If we wanted to be more exacting about it we could just give them a
> sched_slice() worth; this would have a few obvious "nice" properties
> (pun intended).

Oh I see I misunderstood again :/ It's not about the effective load but the weight
of the initial effective load wrt adjustment.

Previous schedulers didn't have this aspect at all, so no experience from me
here. Paul would be the one, since he's run longest with this stuff.

That said, I would tend to keep it shorter rather than longer so that it would
adjust quicker to whatever it really wanted to be.

Morten says the load is unstable specifically on loaded systems. I would think
this is because we'd experience scheduling latency, we're runnable more pushing
things up. But if we're really an idle task at heart we'd not run again for a
long while, pushing things down again.

So on that point Paul's suggestion of maybe starting with __sched_slice() might
make sense because it increases the weight of the initial avg with nr_running.

Not sure really, we'll have to play and see what works best for a number of
workloads.

2013-05-08 12:01:08

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

On Wed, May 8, 2013 at 4:34 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, May 07, 2013 at 04:20:55AM -0700, Paul Turner wrote:
>> Yes, 1024 was only intended as a starting point. We could also
>> arbitrarily pick something larger, the key is that we pick
>> _something_.
>>
>> If we wanted to be more exacting about it we could just give them a
>> sched_slice() worth; this would have a few obvious "nice" properties
>> (pun intended).
>
> Oh I see I misunderstood again :/ Its not about the effective load but weight
> of the initial effective load wrt adjustment.
>
> Previous schedulers didn't have this aspect at all, so no experience from me
> here. Paul would be the one, since he's ran longest with this stuff.
>
> That said, I would tend to keep it shorter rather than longer so that it would
> adjust quicker to whatever it really wanted to be.
>
> Morten says the load is unstable specifically on loaded systems.

Here, Morten was (I believe) referring to the stability at task startup.

To be clear:
Because we have such a small runnable period denominator at this point
a single changed observation (for an equivalently behaving thread)
could have a very large effect. e.g. fork/exec -- happen to take a
major #pf, observe a "relatively" long initial block.

By associating an initial period (along with our full load_contrib)
here, we're making the denominator larger so that these effects are
less pronounced; achieving better convergence towards what our load
contribution should actually be.

Also: We do this conservatively, by converging down, not up.
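
To make the denominator effect concrete, here is a minimal user-space
sketch; the per-period arithmetic is simplified (y^32 = 0.5, partial-period
bookkeeping ignored) and the seed/period numbers are made up:

#include <math.h>
#include <stdio.h>

/*
 * Simplified, illustrative arithmetic: each ~1ms period the old sums
 * decay by y (with y^32 = 0.5) and the period denominator gains a full
 * 1024.  Compare the load ratio after one fully-blocked period for a
 * task seeded with a 1024-unit history versus a tiny 64-unit one.
 */
static double one_blocked_period(double sum, double period)
{
	double y = pow(0.5, 1.0 / 32.0);	/* ~0.9785 per period       */

	sum = sum * y;				/* blocked: nothing accrues */
	period = period * y + 1024.0;		/* time always accrues      */
	return sum / period;
}

int main(void)
{
	printf("seeded   (1024/1024): ratio drops from 1.00 to %.2f\n",
	       one_blocked_period(1024.0, 1024.0));
	printf("unseeded (  64/  64): ratio drops from 1.00 to %.2f\n",
	       one_blocked_period(64.0, 64.0));
	return 0;
}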

> I would think
> this is because we'd experience scheduling latency, we're runnable more pushing
> things up. But if we're really an idle task at heart we'd not run again for a
> long while, pushing things down again.

Exactly, this is why we must be careful to use instantaneous weights
for wake-up decisions. Interactive and background tasks are largely
idle.

While this is exactly how we want them to be perceived from a
load-balance perspective it's important to keep in mind that while
wake-up placement has a very important role in the overall balance of
a system, it is not playing quite the same game as the load-balancer.

>
> So on that point Paul's suggestion of maybe starting with __sched_slice() might
> make sense because it increases the weight of the initial avg with nr_running.
> Not sure really, we'll have to play and see what works best for a number of
> workloads.

2013-05-09 01:24:41

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 6/7] sched: consider runnable load average in move_tasks

Paul,

I am wondering if the following patch is needed.
A woken-up migrated task will do __synchronize_entity_decay(se) in
migrate_task_rq_fair(); it then needs to set
`se->avg.last_runnable_update -= (-se->avg.decay_count) << 20'
before update_entity_load_avg(), in order to avoid the slept time being counted
twice for se.avg.load_avg_contrib, in both __synchronize_entity_decay() and
update_entity_load_avg().

But if the slept task is woken up on its own cpu, it misses that
last_runnable_update adjustment before update_entity_load_avg(se, 0, 1), so the
slept time is used twice in both functions.
Is that right?

Regards
Alex
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80ae94..1b49e97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1532,7 +1532,8 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
}
wakeup = 0;
} else {
- __synchronize_entity_decay(se);
+ se->avg.last_runnable_update += __synchronize_entity_decay(se)
+ << 20;
}

/* migrated tasks did not contribute to our blocked load */
if (wakeup) {
subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
update_entity_load_avg(se, 0, 1);
}


--
Thanks
Alex

2013-05-09 05:29:29

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 6/7] sched: consider runnable load average in move_tasks

On 05/08/2013 09:39 AM, Alex Shi wrote:
> On 05/07/2013 01:17 PM, Alex Shi wrote:
>>>> >> > Sorry, what I meant to say here is:
>>>> >> > If we're going to be using a runnable average based load here the
>>>> >> > fraction we take (currently instantaneous) in tg_load_down should be
>>>> >> > consistent.
>> > yes. I think so.
>> >
>> > So, here is the patch, could you like take a look?
> The new patchset till to this patch has a bad results on kbuild. So,
> will do testing and drop some of them or all.

After adding cfs_rq->blocked_load_avg consideration into the balance:
kbuild decreased 6% on all my machines: core2, NHM, SNB 2P/4P boxes.
aim7 dropped 2~10% on 4P boxes.
oltp also dropped a lot on the 4P box, but it is not very stable.
hackbench dropped 20% on SNB, but increased about 15% on the NHM EX box.


Kbuild vmstat shows that without blocked_load_avg the system has far fewer
context switches (CS).

vmstat average values on the SNB 4P machine; the box has 4P * 8 cores * HT.
without blocked_load_avg:
r b swpd free buff cache si so bi bo in cs us sy id wa st
52 0 0 60820216 30578 2708963 0 0 2404 27841 49176 33642 61 8 29 0 0

with blocked_load_avg:
r b swpd free buff cache si so bi bo in cs us sy id wa st
52 0 0 61045626 32428 2714190 0 0 2272 26943 48923 49661 50 6 42 0 0

aim7 with the default workfile also shows fewer CS.
alexs@debian-alexs:~/tiptest$ cat aim7.vmstat.good
0 0 0 64971976 11316 61352 0 0 0 9 285 589 0 0 100 0 0
0 0 0 64921204 11344 61452 0 0 0 16 36438 97399 15 7 78 0 0
1 0 0 64824712 11368 65452 0 0 0 21 71543 190097 31 15 54 0 0
461 0 0 64569296 11384 77620 0 0 0 2 4164 2113 4 1 94 0 0
1 0 0 64724120 11400 66872 0 0 0 28 107881 296017 42 22 35 0 0
124 0 0 64437332 11416 84248 0 0 0 2 9784 6339 10 4 86 0 0
87 0 0 64224904 11424 63868 0 0 0 1 148456 462921 41 28 31 0 0
1 0 0 64706668 11448 62100 0 0 0 59 33134 41977 30 10 60 0 0
32 0 0 64104320 13064 65836 0 0 324 14 75769 217348 25 16 59 0 0
88 0 0 64444028 13080 66648 0 0 0 4 121466 338645 50 27 22 0 0
2 0 0 64664168 13096 64384 0 0 0 79 20383 22746 20 6 75 0 0
40 0 0 63940308 13104 65020 0 0 0 1 103459 307360 31 20 49 0 0
58 0 0 64197384 13124 67316 0 0 1 2 121445 317690 52 28 20 0 0
average value:
r b swpd free buff cache si so bi bo in cs us sy id wa st
68 0 0 64517724 12043 67089 0 0 25 18 65708 177018 27 14 58 0 0

alexs@debian-alexs:~/tiptest$ cat aim7.vmstat.bad
193 1 0 64701572 8776 67604 0 0 0 2 42509 157649 11 8 81 0 0
0 0 0 64897084 8796 62056 0 0 0 17 15819 21496 11 3 85 0 0
316 0 0 64451460 8812 68488 0 0 0 3 86321 292694 27 17 56 0 0
0 0 0 64853132 8828 61880 0 0 0 32 28989 44443 20 6 73 0 0
82 0 0 64394268 9020 63984 0 0 174 14 74398 280771 18 14 67 0 0
0 0 0 64776500 9036 63752 0 0 0 47 69966 153509 39 16 45 0 0
292 0 0 64347432 9052 74428 0 0 0 2 16542 25876 11 4 85 0 0
340 0 0 64054336 9068 72020 0 0 0 2 132096 524224 28 26 46 0 0
1 0 0 64715984 9084 64440 0 0 0 62 47487 51573 41 13 46 0 0
156 0 0 64124992 9100 73888 0 0 0 2 27755 38801 19 8 73 0 0
326 0 0 63795768 9116 74624 0 0 0 2 138341 560004 25 26 49 0 0
0 0 0 64661592 9140 68796 0 0 0 96 77724 113395 58 20 22 0 0
1951 2 0 64605544 9148 71664 0 0 0 1 1530 2094 1 0 99 0 0
188 0 0 63856212 9164 68536 0 0 0 2 106011 361647 33 23 44 0 0
393 0 0 63941972 9180 76520 0 0 0 3 115553 360168 41 25 34 0 0
average value:
r b swpd free buff cache si so bi bo in cs us sy id wa st
282 0 0 64411856 9021 68845 0 0 11 19 65402 199222 25 13 60 0 0


I reviewed the cfs_rq->blocked_load_avg code path and found nothing clearly abnormal.
It seems the blocked load avg fits the current balance rules.
Sometimes the blocked load is far bigger than the runnable load. The blocked_load_avg
has a long-lasting effect (more than half of its weight remains within 32ms); that drives a
woken task to other cpus rather than its original one, and adds unnecessary load in periodic balance, doesn't it?
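
For reference, a small standalone sketch (not kernel code) of how slowly a
blocked contribution fades, assuming the per-entity tracking's decay factor
y with y^32 = 0.5:

#include <math.h>
#include <stdio.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* per-1ms decay, y^32 = 0.5 */
	int t;

	for (t = 0; t <= 64; t += 16)
		printf("after %2d ms blocked: %3.0f%% of the contribution remains\n",
		       t, 100.0 * pow(y, t));
	return 0;
}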

--
Thanks
Alex

2013-05-09 08:22:46

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

On 05/08/2013 07:34 PM, Peter Zijlstra wrote:
>> > If we wanted to be more exacting about it we could just give them a
>> > sched_slice() worth; this would have a few obvious "nice" properties
>> > (pun intended).
> Oh I see I misunderstood again :/ Its not about the effective load but weight
> of the initial effective load wrt adjustment.
>
> Previous schedulers didn't have this aspect at all, so no experience from me
> here. Paul would be the one, since he's ran longest with this stuff.
>
> That said, I would tend to keep it shorter rather than longer so that it would
> adjust quicker to whatever it really wanted to be.
>
> Morten says the load is unstable specifically on loaded systems. I would think
> this is because we'd experience scheduling latency, we're runnable more pushing
> things up. But if we're really an idle task at heart we'd not run again for a
> long while, pushing things down again.
>
> So on that point Paul's suggestion of maybe starting with __sched_slice() might
> make sense because it increases the weight of the initial avg with nr_running.
>
> Not sure really, we'll have to play and see what works best for a number of
> workloads.


The patch of using sched_slice for review, I am testing the benchmarks

---
>From da40ffa90ec1de520bd7e92f5653734a964e3bb2 Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Thu, 9 May 2013 15:28:34 +0800
Subject: [PATCH 4/8] sched: set initial runnable avg for new task

---
kernel/sched/core.c | 2 ++
kernel/sched/fair.c | 15 +++++++++++++++
kernel/sched/sched.h | 2 ++
3 files changed, 19 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ecec7f1..c17925b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1716,6 +1716,8 @@ void wake_up_new_task(struct task_struct *p)
set_task_cpu(p, select_task_rq(p, SD_BALANCE_FORK, 0));
#endif

+ /* Give new task a start runnable time */
+ set_task_runnable_avg(p);
rq = __task_rq_lock(p);
activate_task(rq, p, 0);
p->on_rq = 1;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2881d42..4ec5f29 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -661,6 +661,21 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
return calc_delta_fair(sched_slice(cfs_rq, se), se);
}

+#ifdef CONFIG_SMP
+void set_task_runnable_avg(struct task_struct *p)
+{
+ u64 slice;
+
+ slice = sched_slice(task_cfs_rq(p), &p->se);
+ p->se.avg.runnable_avg_sum = slice;
+ p->se.avg.runnable_avg_period = slice;
+}
+#else
+void set_task_runnable_avg(struct task_struct *p)
+{
+}
+#endif
+
/*
* Update the current task's runtime statistics. Skip current tasks that
* are not in our scheduling class.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c6634f1..518f3d8a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -900,6 +900,8 @@ extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime

extern void update_idle_cpu_load(struct rq *this_rq);

+extern void set_task_runnable_avg(struct task_struct *p);
+
#ifdef CONFIG_CGROUP_CPUACCT
#include <linux/cgroup.h>
/* track cpu usage of a group of tasks and its child groups */
--
1.7.12

--
Thanks
Alex

2013-05-09 08:31:19

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task


>
> Here is the patch according to Paul's opinions.
> just refer the __update_task_entity_contrib in sched.h looks ugly.
> comments are appreciated!

Paul,

With sched_slice(), we need to set the runnable avg sum/period after the new
task has been assigned to a specific CPU.
So setting them in __sched_fork() is meaningless, and then there is also no
reason to use __update_task_entity_contrib(&p->se). I am going to pick up the
old patch and drop this one; that also avoids declaring it in sched.h.
What's your comment on this?

Regards!
>
> ---
> From 647404447c996507b6a94110ed13fd122e4ee154 Mon Sep 17 00:00:00 2001
> From: Alex Shi <[email protected]>
> Date: Mon, 3 Dec 2012 17:30:39 +0800
> Subject: [PATCH 3/7] sched: set initial value of runnable avg for new forked
> task
>
> We need initialize the se.avg.{decay_count, load_avg_contrib} for a
> new forked task.
> Otherwise random values of above variables cause mess when do new task
> enqueue:
> enqueue_task_fair
> enqueue_entity
> enqueue_entity_load_avg
>
> and make forking balancing imbalance since incorrect load_avg_contrib.
>
> set avg.decay_count = 0, and give initial value of runnable_avg_sum/period
> to resolve such issues.
>
> Thanks for Pual's suggestions
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/core.c | 8 +++++++-
> kernel/sched/fair.c | 4 ++++
> kernel/sched/sched.h | 1 +
> 3 files changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c8db984..4e78de1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1566,6 +1566,11 @@ static void __sched_fork(struct task_struct *p)
> #ifdef CONFIG_SMP
> p->se.avg.runnable_avg_period = 0;
> p->se.avg.runnable_avg_sum = 0;
> + p->se.avg.decay_count = 0;
> + /* New forked task assumed with full utilization */
> + p->se.avg.runnable_avg_period = 1024;
> + p->se.avg.runnable_avg_sum = 1024;
> + __update_task_entity_contrib(&p->se);
> #endif
> #ifdef CONFIG_SCHEDSTATS
> memset(&p->se.statistics, 0, sizeof(p->se.statistics));
> @@ -1619,7 +1624,6 @@ void sched_fork(struct task_struct *p)
> unsigned long flags;
> int cpu = get_cpu();
>
> - __sched_fork(p);
> /*
> * We mark the process as running here. This guarantees that
> * nobody will actually run it, and a signal or other external
> @@ -1653,6 +1657,8 @@ void sched_fork(struct task_struct *p)
> p->sched_reset_on_fork = 0;
> }
>
> + __sched_fork(p);
> +
> if (!rt_prio(p->prio))
> p->sched_class = &fair_sched_class;
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9c2f726..2881d42 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1508,6 +1508,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
> * We track migrations using entity decay_count <= 0, on a wake-up
> * migration we use a negative decay count to track the remote decays
> * accumulated while sleeping.
> + *
> + * When enqueue a new forked task, the se->avg.decay_count == 0, so
> + * we bypass update_entity_load_avg(), use avg.load_avg_contrib initial
> + * value: se->load.weight.
> */
> if (unlikely(se->avg.decay_count <= 0)) {
> se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index c6634f1..ec4cb9b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -876,6 +876,7 @@ extern const struct sched_class idle_sched_class;
> extern void trigger_load_balance(struct rq *rq, int cpu);
> extern void idle_balance(int this_cpu, struct rq *this_rq);
>
> +extern inline void __update_task_entity_contrib(struct sched_entity *se);
> #else /* CONFIG_SMP */
>
> static inline void idle_balance(int cpu, struct rq *rq)
>


--
Thanks
Alex

2013-05-09 09:24:58

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

On Thu, May 9, 2013 at 1:22 AM, Alex Shi <[email protected]> wrote:
> On 05/08/2013 07:34 PM, Peter Zijlstra wrote:
>>> > If we wanted to be more exacting about it we could just give them a
>>> > sched_slice() worth; this would have a few obvious "nice" properties
>>> > (pun intended).
>> Oh I see I misunderstood again :/ Its not about the effective load but weight
>> of the initial effective load wrt adjustment.
>>
>> Previous schedulers didn't have this aspect at all, so no experience from me
>> here. Paul would be the one, since he's ran longest with this stuff.
>>
>> That said, I would tend to keep it shorter rather than longer so that it would
>> adjust quicker to whatever it really wanted to be.
>>
>> Morten says the load is unstable specifically on loaded systems. I would think
>> this is because we'd experience scheduling latency, we're runnable more pushing
>> things up. But if we're really an idle task at heart we'd not run again for a
>> long while, pushing things down again.
>>
>> So on that point Paul's suggestion of maybe starting with __sched_slice() might
>> make sense because it increases the weight of the initial avg with nr_running.
>>
>> Not sure really, we'll have to play and see what works best for a number of
>> workloads.
>
>
> The patch of using sched_slice for review, I am testing the benchmarks
>
> ---
> From da40ffa90ec1de520bd7e92f5653734a964e3bb2 Mon Sep 17 00:00:00 2001
> From: Alex Shi <[email protected]>
> Date: Thu, 9 May 2013 15:28:34 +0800
> Subject: [PATCH 4/8] sched: set initial runnable avg for new task
>
> ---
> kernel/sched/core.c | 2 ++
> kernel/sched/fair.c | 15 +++++++++++++++
> kernel/sched/sched.h | 2 ++
> 3 files changed, 19 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ecec7f1..c17925b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1716,6 +1716,8 @@ void wake_up_new_task(struct task_struct *p)
> set_task_cpu(p, select_task_rq(p, SD_BALANCE_FORK, 0));
> #endif
>
> + /* Give new task a start runnable time */
> + set_task_runnable_avg(p);
> rq = __task_rq_lock(p);
> activate_task(rq, p, 0);
> p->on_rq = 1;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2881d42..4ec5f29 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -661,6 +661,21 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> return calc_delta_fair(sched_slice(cfs_rq, se), se);
> }
>
> +#ifdef CONFIG_SMP
> +void set_task_runnable_avg(struct task_struct *p)
> +{
> + u64 slice;
> +
> + slice = sched_slice(task_cfs_rq(p), &p->se);
> + p->se.avg.runnable_avg_sum = slice;
> + p->se.avg.runnable_avg_period = slice;

This needs to be >> 10 right? sched_slice is in ns.

We also still want to set load_avg_contrib right?

> +}
> +#else
> +void set_task_runnable_avg(struct task_struct *p)
> +{
> +}
> +#endif
> +
> /*
> * Update the current task's runtime statistics. Skip current tasks that
> * are not in our scheduling class.
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index c6634f1..518f3d8a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -900,6 +900,8 @@ extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime
>
> extern void update_idle_cpu_load(struct rq *this_rq);
>
> +extern void set_task_runnable_avg(struct task_struct *p);
> +
> #ifdef CONFIG_CGROUP_CPUACCT
> #include <linux/cgroup.h>
> /* track cpu usage of a group of tasks and its child groups */
> --
> 1.7.12
>
> --
> Thanks
> Alex

2013-05-09 09:31:25

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

On Thu, May 9, 2013 at 1:31 AM, Alex Shi <[email protected]> wrote:
>
>>
>> Here is the patch according to Paul's opinions.
>> just refer the __update_task_entity_contrib in sched.h looks ugly.
>> comments are appreciated!
>
> Paul,
>
> With sched_slice, we need to set the runnable avg sum/period after new
> task assigned to a specific CPU.
> So, set them __sched_fork is meaningless. and then

This is still a reasonable choice.

Assuming the system is well balanced, sched_slice() on the current CPU
should be reasonably indicative of the slice wherever we end up. The
alternative is still to pick a constant. We should do one of these.

> __update_task_entity_contrib(&p->se) also no reason to use.

Surely we'd still want it so that the right load is added by
enqueue_entity_load_avg()?

> I am going
> to pick up the old patch and drop this one. that also avoid to declare
> it in sched.h.
> What's your comment of this?
>
> Regards!
>>
>> ---
>> From 647404447c996507b6a94110ed13fd122e4ee154 Mon Sep 17 00:00:00 2001
>> From: Alex Shi <[email protected]>
>> Date: Mon, 3 Dec 2012 17:30:39 +0800
>> Subject: [PATCH 3/7] sched: set initial value of runnable avg for new forked
>> task
>>
>> We need initialize the se.avg.{decay_count, load_avg_contrib} for a
>> new forked task.
>> Otherwise random values of above variables cause mess when do new task
>> enqueue:
>> enqueue_task_fair
>> enqueue_entity
>> enqueue_entity_load_avg
>>
>> and make forking balancing imbalance since incorrect load_avg_contrib.
>>
>> set avg.decay_count = 0, and give initial value of runnable_avg_sum/period
>> to resolve such issues.
>>
>> Thanks for Pual's suggestions
>>
>> Signed-off-by: Alex Shi <[email protected]>
>> ---
>> kernel/sched/core.c | 8 +++++++-
>> kernel/sched/fair.c | 4 ++++
>> kernel/sched/sched.h | 1 +
>> 3 files changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index c8db984..4e78de1 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1566,6 +1566,11 @@ static void __sched_fork(struct task_struct *p)
>> #ifdef CONFIG_SMP
>> p->se.avg.runnable_avg_period = 0;
>> p->se.avg.runnable_avg_sum = 0;
>> + p->se.avg.decay_count = 0;
>> + /* New forked task assumed with full utilization */
>> + p->se.avg.runnable_avg_period = 1024;
>> + p->se.avg.runnable_avg_sum = 1024;
>> + __update_task_entity_contrib(&p->se);
>> #endif
>> #ifdef CONFIG_SCHEDSTATS
>> memset(&p->se.statistics, 0, sizeof(p->se.statistics));
>> @@ -1619,7 +1624,6 @@ void sched_fork(struct task_struct *p)
>> unsigned long flags;
>> int cpu = get_cpu();
>>
>> - __sched_fork(p);
>> /*
>> * We mark the process as running here. This guarantees that
>> * nobody will actually run it, and a signal or other external
>> @@ -1653,6 +1657,8 @@ void sched_fork(struct task_struct *p)
>> p->sched_reset_on_fork = 0;
>> }
>>
>> + __sched_fork(p);
>> +
>> if (!rt_prio(p->prio))
>> p->sched_class = &fair_sched_class;
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 9c2f726..2881d42 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1508,6 +1508,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
>> * We track migrations using entity decay_count <= 0, on a wake-up
>> * migration we use a negative decay count to track the remote decays
>> * accumulated while sleeping.
>> + *
>> + * When enqueue a new forked task, the se->avg.decay_count == 0, so
>> + * we bypass update_entity_load_avg(), use avg.load_avg_contrib initial
>> + * value: se->load.weight.
>> */
>> if (unlikely(se->avg.decay_count <= 0)) {
>> se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index c6634f1..ec4cb9b 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -876,6 +876,7 @@ extern const struct sched_class idle_sched_class;
>> extern void trigger_load_balance(struct rq *rq, int cpu);
>> extern void idle_balance(int this_cpu, struct rq *this_rq);
>>
>> +extern inline void __update_task_entity_contrib(struct sched_entity *se);
>> #else /* CONFIG_SMP */
>>
>> static inline void idle_balance(int cpu, struct rq *rq)
>>
>
>
> --
> Thanks
> Alex

2013-05-09 09:35:11

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

On Mon, May 6, 2013 at 8:24 PM, Alex Shi <[email protected]> wrote:
> On 05/07/2013 11:06 AM, Paul Turner wrote:
>>> > Thanks Paul!
>>> > It seems work with this change if new __sched_fork move after the
>>> > p->sched_reset_on_fork setting.
>>> >
>>> > But why we initial avg sum to 1024? new task may goes to sleep, the
>>> > initial 1024 give a unreasonable initial value.
>>> >
>>> > guess let the task accumulate itself avg sum and period is more natural.
>> 1024 is a full single unit period representing ~1ms of time.
>>
>> The reason to store a small initial "observation" here is so that as
>> when we reach our next period edge our load converges (presumably
>> down) towards its true target more smoothly; as well as providing a
>> task additional protection from being considered "small" through
>> start-up.
>>
>
> It will give new forked task 1 ms extra running time. That will bring
> incorrect info if the new forked goes to sleep a while.

This is intentional.

Either:
The sleep was representative, we still converge reasonably quickly to zero.
The sleep was not representative and we have not under-represented the task.

Providing it initial time is entirely about improving numerical stability.

> But this info should benefit to some benchmarks like aim7,
> pthread_cond_broadcast. So I am convinced. :)
>
> What's your opinion of this, Peter?
>
> --
> Thanks
> Alex

2013-05-09 10:55:05

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

On Wed, May 08, 2013 at 01:00:34PM +0100, Paul Turner wrote:
> On Wed, May 8, 2013 at 4:34 AM, Peter Zijlstra <[email protected]> wrote:
> > On Tue, May 07, 2013 at 04:20:55AM -0700, Paul Turner wrote:
> >> Yes, 1024 was only intended as a starting point. We could also
> >> arbitrarily pick something larger, the key is that we pick
> >> _something_.
> >>
> >> If we wanted to be more exacting about it we could just give them a
> >> sched_slice() worth; this would have a few obvious "nice" properties
> >> (pun intended).
> >
> > Oh I see I misunderstood again :/ Its not about the effective load but weight
> > of the initial effective load wrt adjustment.
> >
> > Previous schedulers didn't have this aspect at all, so no experience from me
> > here. Paul would be the one, since he's ran longest with this stuff.
> >
> > That said, I would tend to keep it shorter rather than longer so that it would
> > adjust quicker to whatever it really wanted to be.
> >
> > Morten says the load is unstable specifically on loaded systems.
>
> Here, Morten was (I believe) referring to the stability at task startup.
>
> To be clear:
> Because we have such a small runnable period denominator at this point
> a single changed observation (for an equivalently behaving thread)
> could have a very large effect. e.g. fork/exec -- happen to take a
> major #pf, observe a "relatively" long initial block.
>
> By associating an initial period (along with our full load_contrib)
> here, we're making the denominator larger so that these effects are
> less pronounced; achieving better convergence towards what our load
> contribution should actually be.

This is exactly what I meant, thanks :)

For the workloads we are looking at we frequently see tasks that get
blocked for short amounts of time shortly after the task was created. As
you already explained, the small denominator causes the tracked load to
change very quickly until the denominator gets larger.

I think it makes good sense to initialize the period and sum (to be
conservative) to some appropriate value to get a more stable tracked
load for new tasks.

Morten

>
> Also: We do this conservatively, by converging down, not up.
>
> > I would think
> > this is because we'd experience scheduling latency, we're runnable more pushing
> > things up. But if we're really an idle task at heart we'd not run again for a
> > long while, pushing things down again.
>
> Exactly, this is why we must be careful to use instaneous weights
> about wake-up decisions. Interactive and background tasks are largely
> idle.
>
> While this is exactly how we want them to be perceived from a
> load-balance perspective it's important to keep in mind that while
> wake-up placement has a very important role in the overall balance of
> a system, it is not playing quite the same game as the load-balancer.
>
> >
> > So on that point Paul's suggestion of maybe starting with __sched_slice() might
> > make sense because it increases the weight of the initial avg with nr_running.
> > Not sure really, we'll have to play and see what works best for a number of
> > workloads.
>

2013-05-09 13:13:55

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

On 05/09/2013 05:24 PM, Paul Turner wrote:
>> >
>> > +#ifdef CONFIG_SMP
>> > +void set_task_runnable_avg(struct task_struct *p)
>> > +{
>> > + u64 slice;
>> > +
>> > + slice = sched_slice(task_cfs_rq(p), &p->se);
>> > + p->se.avg.runnable_avg_sum = slice;
>> > + p->se.avg.runnable_avg_period = slice;
> This needs to be >> 10 right? sched_slice is in ns.

Yes, I omitted this.
>
> We also still want to set load_avg_contrib right?

Yes, and setting it here seems better. :)
>
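
For reference, a tiny sketch of the unit conversion being discussed, assuming
sched_slice() returns nanoseconds and the runnable average counts ~1us (1024ns)
units, 1024 of them per full period; the 6ms slice value is made up:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t slice_ns = 6000000ULL;			/* hypothetical 6ms sched_slice() */
	uint32_t units = (uint32_t)(slice_ns >> 10);	/* ns -> ~1us average units       */

	printf("%llu ns -> %u units (~%u full 1024-unit periods)\n",
	       (unsigned long long)slice_ns, units, units / 1024);
	return 0;
}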


--
Thanks
Alex

2013-05-09 14:23:22

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 3/7] sched: set initial value of runnable avg for new forked task

On 05/09/2013 05:30 PM, Paul Turner wrote:
>> > With sched_slice, we need to set the runnable avg sum/period after new
>> > task assigned to a specific CPU.
>> > So, set them __sched_fork is meaningless. and then
> This is still a reasonable choice.
>
> Assuming the system is well balanced, sched_slice() on the current CPU
> should be reasonably indicative of the slice wherever we end up. The
> alternative is still to pick a constant. We should do one of these.
>
>> > __update_task_entity_contrib(&p->se) also no reason to use.
> Surely we'd still want it so that the right load is added by
> enqueue_entity_load_avg()?
>

Thanks for the comments!
The new patch is attached; I hope you like it this time. :)

---
>From 5606df1e46e83e64b0ade30522096c00448a97be Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Mon, 3 Dec 2012 17:30:39 +0800
Subject: [PATCH 3/7] sched: set initial value of runnable avg for new forked
task

We need to initialize se.avg.{decay_count, load_avg_contrib} for a
newly forked task.
Otherwise random values in the above variables cause a mess during new task
enqueue:
enqueue_task_fair
enqueue_entity
enqueue_entity_load_avg

and make fork balancing imbalanced due to the incorrect load_avg_contrib.

Furthermore, Morten Rasmussen noticed that some tasks were not launched
immediately after being created. So Paul and Peter suggested giving new tasks
an initial runnable avg time equal to sched_slice().

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/core.c | 6 ++----
kernel/sched/fair.c | 23 +++++++++++++++++++++++
kernel/sched/sched.h | 2 ++
3 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c8db984..866c05a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1563,10 +1563,6 @@ static void __sched_fork(struct task_struct *p)
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);

-#ifdef CONFIG_SMP
- p->se.avg.runnable_avg_period = 0;
- p->se.avg.runnable_avg_sum = 0;
-#endif
#ifdef CONFIG_SCHEDSTATS
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
#endif
@@ -1710,6 +1706,8 @@ void wake_up_new_task(struct task_struct *p)
set_task_cpu(p, select_task_rq(p, SD_BALANCE_FORK, 0));
#endif

+ /* Give new task start runnable values */
+ set_task_runnable_avg(p);
rq = __task_rq_lock(p);
activate_task(rq, p, 0);
p->on_rq = 1;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9c2f726..203f236 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -661,6 +661,26 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
return calc_delta_fair(sched_slice(cfs_rq, se), se);
}

+#ifdef CONFIG_SMP
+static inline void __update_task_entity_contrib(struct sched_entity *se);
+
+/* Give new task start runnable values to heavy its load in infant time */
+void set_task_runnable_avg(struct task_struct *p)
+{
+ u32 slice;
+
+ p->se.avg.decay_count = 0;
+ slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
+ p->se.avg.runnable_avg_sum = slice;
+ p->se.avg.runnable_avg_period = slice;
+ __update_task_entity_contrib(&p->se);
+}
+#else
+void set_task_runnable_avg(struct task_struct *p)
+{
+}
+#endif
+
/*
* Update the current task's runtime statistics. Skip current tasks that
* are not in our scheduling class.
@@ -1508,6 +1528,9 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
* We track migrations using entity decay_count <= 0, on a wake-up
* migration we use a negative decay count to track the remote decays
* accumulated while sleeping.
+ *
+ * When enqueue a new forked task, the se->avg.decay_count == 0, so
+ * we bypass update_entity_load_avg(), use avg.load_avg_contrib direct.
*/
if (unlikely(se->avg.decay_count <= 0)) {
se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c6634f1..518f3d8a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -900,6 +900,8 @@ extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime

extern void update_idle_cpu_load(struct rq *this_rq);

+extern void set_task_runnable_avg(struct task_struct *p);
+
#ifdef CONFIG_CGROUP_CPUACCT
#include <linux/cgroup.h>
/* track cpu usage of a group of tasks and its child groups */
--
1.7.5.4


--
Thanks
Alex

2013-05-10 13:58:10

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 6/7] sched: consider runnable load average in move_tasks

On 05/09/2013 09:24 AM, Alex Shi wrote:
> Paul,
>
> I am wondering if the following patch needed.
> The wakeuped migrated task will __synchronize_entity_decay(se); in migrate_task_fair,
> then it needs to set `se->avg.last_runnable_update -= (-se->avg.decay_count) << 20'
> before update_entity_load_avg, in order to avoid slept time is updated twice for
> se.avg.load_avg_contrib in both __syncchronize and update_entity_load_avg.
>
> but if the slept task is waked up from self cpu, it miss the last_runnable_update before
> update_entity_load_avg(se, 0, 1), so the slept time was used twice in both functions.
> Is that right?

Would you like to give some comments on this, Paul?

--
Thanks
Alex

2013-05-10 14:05:36

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 6/7] sched: consider runnable load average in move_tasks

On 05/09/2013 01:29 PM, Alex Shi wrote:
>
> I reviewed the cfs_rq->blocked_load_avg code path, no clear abnormal found.
> Seems the blocked load avg is fit current balance rules.

Sorry, I meant that the blocked load avg doesn't fit the current balance rules.
The reason is below; any comments on this?
> Sometime the blocked load far bigger than runnable load. The blocked_load_avg
> has a long time effect(more than half weight in 32ms), that drives wakeup task to other
> cpus not locate, and give unnecessary load in periodic balance, isn't it?


--
Thanks
Alex

2013-06-04 01:46:27

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v5 5/7] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On 05/07/2013 02:17 PM, Alex Shi wrote:
> On 05/06/2013 07:10 PM, Peter Zijlstra wrote:
>>>> The runnable_avgs themselves actually have a fair bit of history in
>>>> them already (50% is last 32ms); but given that they don't need to be
>>>> cut-off to respond to load being migrated I'm guessing we could
>>>> actually potentially get by with just "instaneous" and "use averages"
>>>> where appropriate?
>> Sure,. worth a try. If things fall over we can always look at it again.
>>
>>>> We always end up having to re-pick/tune them based on a variety of
>>>> workloads; if we can eliminate them I think it would be a win.
>> Agreed, esp. the plethora of weird idx things we currently have. If we need to
>> re-introduce something it would likely only be the busy case and for that we
>> can immediately link to the balance interval or so.
>>
>>
>>
>
> I like to have try bases on this patchset. :)
>
> First, we can remove the idx, to check if the removing is fine for our
> benchmarks, kbuild, dbench, tbench, hackbench, aim7, specjbb etc.
>
> If there are some regression. we can think more.
>

Peter,

I just tried to remove the variety of rq.cpu_load indexes with the following patch.
Because forkexec_idx and busy_idx are all zero, after the patch the system just
keeps cpu_load[0] and drops the other values.
I tried the patch on 3.10-rc3 and the latest tip/sched/core with the benchmarks
dbench, tbench, aim7, hackbench, and oltp of sysbench. Performance doesn't seem
to change clearly.
So, for my tested machines (core2, NHM, SNB, with 2 or 4 CPU sockets) and the
above benchmarks, we are fine to remove the variety of cpu_load values.
I don't know if there are other concerns in other scenarios.

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 590d535..f0ca983 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4626,7 +4626,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
if (child && child->flags & SD_PREFER_SIBLING)
prefer_sibling = 1;

- load_idx = get_sd_load_idx(env->sd, env->idle);
+ load_idx = 0; //get_sd_load_idx(env->sd, env->idle);

do {
int local_group;


--
Thanks
Alex

2013-06-04 01:51:44

by Alex Shi

[permalink] [raw]
Subject: [DISCUSSION] removing variety rq->cpu_load ?

resend with a new subject.

> Peter,
>
> I just tried to remove the variety rq.cpu_load, by the following patch.
> Because forkexec_idx and busy_idx are all zero, after the patch system just keep cpu_load[0]
> and remove other values.
> I tried the patch base 3.10-rc3 and latest tip/sched/core with benchmark dbench,tbench,
> aim7,hackbench. and oltp of sysbench. Seems performance doesn't change clear.
> So, for my tested machines, core2, NHM, SNB, with 2 or 4 CPU sockets, and above tested
> benchmark. We are fine to remove the variety cpu_load.
> Don't know if there some other concerns on other scenarios.
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 590d535..f0ca983 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4626,7 +4626,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
> if (child && child->flags & SD_PREFER_SIBLING)
> prefer_sibling = 1;
>
> - load_idx = get_sd_load_idx(env->sd, env->idle);
> + load_idx = 0; //get_sd_load_idx(env->sd, env->idle);
>
> do {
> int local_group;
>
>


--
Thanks
Alex

2013-06-04 02:33:31

by Michael wang

[permalink] [raw]
Subject: Re: [DISCUSSION] removing variety rq->cpu_load ?

Hi, Alex

On 06/04/2013 09:51 AM, Alex Shi wrote:
> resend with a new subject.

Forgive me, but I'm a little lost on this thread...

So we are planning to rely only on the instant 'cpu_load[0]' and the decayed
'runnable_load_avg', aren't we?

Regards,
Michael Wang

>
>> Peter,
>>
>> I just tried to remove the various rq->cpu_load values with the following patch.
>> Because forkexec_idx and busy_idx are already zero, after the patch the system keeps
>> only cpu_load[0] and drops the other slots.
>> I tried the patch on top of 3.10-rc3 and the latest tip/sched/core with the benchmarks
>> dbench, tbench, aim7, hackbench, and oltp of sysbench. Performance does not seem to
>> change noticeably.
>> So, for my tested machines (core2, NHM, SNB, with 2 or 4 CPU sockets) and the benchmarks
>> above, we are fine to remove the extra cpu_load values.
>> I don't know whether there are other concerns for other scenarios.
>>
>> ---
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 590d535..f0ca983 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4626,7 +4626,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
>> if (child && child->flags & SD_PREFER_SIBLING)
>> prefer_sibling = 1;
>>
>> - load_idx = get_sd_load_idx(env->sd, env->idle);
>> + load_idx = 0; //get_sd_load_idx(env->sd, env->idle);
>>
>> do {
>> int local_group;
>>
>>
>
>

2013-06-04 02:45:26

by Alex Shi

[permalink] [raw]
Subject: Re: [DISCUSSION] removing variety rq->cpu_load ?

On 06/04/2013 10:33 AM, Michael Wang wrote:
> Hi, Alex
>
> On 06/04/2013 09:51 AM, Alex Shi wrote:
>> resend with a new subject.
>
> Forgive me, but I'm a little lost on this thread...
>
> So we are planning to rely only on the instant 'cpu_load[0]' and the decayed
> 'runnable_load_avg', aren't we?

cpu_load is a kind of time-decayed cpu load, but after the runnable load average was
introduced, that decay functionality is largely a duplicate of it.
So removing it will make the code simpler. The idea was mentioned by Paul, Peter and me.

The following are Peter's words on this matter.

> Agreed, esp. the plethora of weird idx things we currently have. If we need to
> re-introduce something it would likely only be the busy case and for that we
> can immediately link to the balance interval or so.


>
>
> Regards,
> Michael Wang
>
>>
>>> Peter,
>>>
>>> I just tried to remove the various rq->cpu_load values with the following patch.
>>> Because forkexec_idx and busy_idx are already zero, after the patch the system keeps
>>> only cpu_load[0] and drops the other slots.
>>> I tried the patch on top of 3.10-rc3 and the latest tip/sched/core with the benchmarks
>>> dbench, tbench, aim7, hackbench, and oltp of sysbench. Performance does not seem to
>>> change noticeably.
>>> So, for my tested machines (core2, NHM, SNB, with 2 or 4 CPU sockets) and the benchmarks
>>> above, we are fine to remove the extra cpu_load values.
>>> I don't know whether there are other concerns for other scenarios.
>>>
>>> ---
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 590d535..f0ca983 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -4626,7 +4626,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
>>> if (child && child->flags & SD_PREFER_SIBLING)
>>> prefer_sibling = 1;
>>>
>>> - load_idx = get_sd_load_idx(env->sd, env->idle);
>>> + load_idx = 0; //get_sd_load_idx(env->sd, env->idle);
>>>
>>> do {
>>> int local_group;
>>>
>>>
>>
>>
>


--
Thanks
Alex
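
To make the duplicated-decay point above concrete, here is a rough, self-contained
sketch. The constants are simplified and the functions tick_decay and runnable_decay
are invented names, not the kernel implementation. It contrasts the per-tick
cpu_load[i] blend with a geometrically decayed per-entity average (half-life of 32
periods, y^32 = 0.5); fed the same constant load, both filters settle at the same
value, which is the sense in which one duplicates the other.

#include <stdio.h>
#include <math.h>

/* rq->cpu_load[i] style: per-tick blend, heavier history for larger i */
static double tick_decay(double old_load, double new_load, int i)
{
        double scale = (double)(1 << i);

        return (old_load * (scale - 1.0) + new_load) / scale;
}

/* per-entity runnable average style: geometric decay with y^32 == 0.5 */
static double runnable_decay(double old_avg, double contrib)
{
        double y = pow(0.5, 1.0 / 32.0);        /* ~0.9786 per period */

        return old_avg * y + contrib;
}

int main(void)
{
        double y = pow(0.5, 1.0 / 32.0);
        double cpu_load = 0.0, runnable = 0.0;
        int t;

        /* feed a constant load of 1024 through both filters */
        for (t = 0; t < 256; t++) {
                cpu_load = tick_decay(cpu_load, 1024.0, 2);
                /* pre-scale the contribution so this toy average also settles at 1024 */
                runnable = runnable_decay(runnable, 1024.0 * (1.0 - y));
        }
        printf("cpu_load[2]-style: %.1f  runnable-avg-style: %.1f\n",
               cpu_load, runnable);
        return 0;
}

Both printed values converge toward 1024 here, which is the argument in this thread for
dropping one of the two exponential histories.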

2013-06-04 03:09:24

by Michael wang

[permalink] [raw]
Subject: Re: [DISCUSSION] removing variety rq->cpu_load ?

On 06/04/2013 10:44 AM, Alex Shi wrote:
> On 06/04/2013 10:33 AM, Michael Wang wrote:
>> Hi, Alex
>>
>> On 06/04/2013 09:51 AM, Alex Shi wrote:
>>> resend with a new subject.
>>
>> Forgive me, but I'm a little lost on this thread...
>>
>> So we are planning to rely only on the instant 'cpu_load[0]' and the decayed
>> 'runnable_load_avg', aren't we?
>
> cpu_load is a kind of time-decayed cpu load, but after the runnable load average was
> introduced, that decay functionality is largely a duplicate of it.
> So removing it will make the code simpler. The idea was mentioned by Paul, Peter and me.

Nice, what about making a patch 9 to clean up all that stuff?

I suppose we will get more benefit :)

Regards,
Michael Wang

>
> The following are Peter's words on this matter.
>
>> Agreed, esp. the plethora of weird idx things we currently have. If we need to
>> re-introduce something it would likely only be the busy case and for that we
>> can immediately link to the balance interval or so.
>
>
>>
>>
>> Regards,
>> Michael Wang
>>
>>>
>>>> Peter,
>>>>
>>>> I just tried to remove the various rq->cpu_load values with the following patch.
>>>> Because forkexec_idx and busy_idx are already zero, after the patch the system keeps
>>>> only cpu_load[0] and drops the other slots.
>>>> I tried the patch on top of 3.10-rc3 and the latest tip/sched/core with the benchmarks
>>>> dbench, tbench, aim7, hackbench, and oltp of sysbench. Performance does not seem to
>>>> change noticeably.
>>>> So, for my tested machines (core2, NHM, SNB, with 2 or 4 CPU sockets) and the benchmarks
>>>> above, we are fine to remove the extra cpu_load values.
>>>> I don't know whether there are other concerns for other scenarios.
>>>>
>>>> ---
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 590d535..f0ca983 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -4626,7 +4626,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
>>>> if (child && child->flags & SD_PREFER_SIBLING)
>>>> prefer_sibling = 1;
>>>>
>>>> - load_idx = get_sd_load_idx(env->sd, env->idle);
>>>> + load_idx = 0; //get_sd_load_idx(env->sd, env->idle);
>>>>
>>>> do {
>>>> int local_group;
>>>>
>>>>
>>>
>>>
>>
>
>

2013-06-04 04:56:35

by Alex Shi

[permalink] [raw]
Subject: Re: [DISCUSSION] removing variety rq->cpu_load ?


>>> >> Forgive me, but I'm a little lost on this thread...
>>> >>
>>> >> So we are planning to rely only on the instant 'cpu_load[0]' and the decayed
>>> >> 'runnable_load_avg', aren't we?
>> >
>> > cpu_load is a kind of time-decayed cpu load, but after the runnable load average was
>> > introduced, that decay functionality is largely a duplicate of it.
>> > So removing it will make the code simpler. The idea was mentioned by Paul, Peter and me.
> Nice, what about making a patch 9 to clean up all that stuff?
>
> I suppose we will get more benefit :)

I'd like to.

But since Peter hasn't responded to this thread for a long time, I don't even
know his opinions on patches 1~8. :(

--
Thanks
Alex