2013-06-20 02:20:15

by Alex Shi

Subject: [Resend patch v8 0/13] use runnable load in schedule balance

Resending the patchset for more convenient pick-up.
This patch set combines the 'use runnable load in balance' series and the
'change 64bit variables to long type' series, and also collects the
Reviewed-bys and Tested-bys.

The only code change is fixing the load to load_avg conversion in UP mode,
which was found by PeterZ in task_h_load().

Paul still has some concern about leaving blocked_load_avg out of the balance
consideration, but I don't see that the blocked_load_avg usage has been
thought through yet, or a strong reason to bring it into balance.
So, according to the benchmark testing results, I keep the patches unchanged.

Regards
Alex

[Resend patch v8 01/13] Revert "sched: Introduce temporary
[Resend patch v8 02/13] sched: move few runnable tg variables into
[Resend patch v8 03/13] sched: set initial value of runnable avg for
[Resend patch v8 04/13] sched: fix slept time double counting in
[Resend patch v8 05/13] sched: update cpu load after task_tick.
[Resend patch v8 06/13] sched: compute runnable load avg in cpu_load
[Resend patch v8 07/13] sched: consider runnable load average in
[Resend patch v8 08/13] sched/tg: remove blocked_load_avg in balance
[Resend patch v8 09/13] sched: change cfs_rq load avg to unsigned
[Resend patch v8 10/13] sched/tg: use 'unsigned long' for load
[Resend patch v8 11/13] sched/cfs_rq: change atomic64_t removed_load
[Resend patch v8 12/13] sched/tg: remove tg.load_weight
[Resend patch v8 13/13] sched: get_rq_runnable_load() can be static


2013-06-20 02:20:21

by Alex Shi

Subject: [Resend patch v8 01/13] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"

Remove the CONFIG_FAIR_GROUP_SCHED dependency that gates the runnable load
tracking info, so that we can use the runnable load variables.

Signed-off-by: Alex Shi <[email protected]>
---
include/linux/sched.h | 7 +------
kernel/sched/core.c | 7 +------
kernel/sched/fair.c | 13 ++-----------
kernel/sched/sched.h | 10 ++--------
4 files changed, 6 insertions(+), 31 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 178a8d9..0019bef 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -994,12 +994,7 @@ struct sched_entity {
struct cfs_rq *my_q;
#endif

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/* Per-entity load-tracking */
struct sched_avg avg;
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 36f85be..b9e7036 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1598,12 +1598,7 @@ static void __sched_fork(struct task_struct *p)
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
p->se.avg.runnable_avg_period = 0;
p->se.avg.runnable_avg_sum = 0;
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ee1c2e..f404468 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1128,8 +1128,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_FAIR_GROUP_SCHED */

-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/*
* We choose a half-life close to 1 scheduling period.
* Note: The tables below are dependent on this value.
@@ -3436,12 +3435,6 @@ unlock:
}

/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
* Called immediately before a task is migrated to a new cpu; task_cpu(p) and
* cfs_rq_of(p) references at time of call are still valid and identify the
* previous cpu. However, the caller only guarantees p->pi_lock is held; no
@@ -3464,7 +3457,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
}
}
-#endif
#endif /* CONFIG_SMP */

static unsigned long
@@ -6167,9 +6159,8 @@ const struct sched_class fair_sched_class = {

#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
.migrate_task_rq = migrate_task_rq_fair,
-#endif
+
.rq_online = rq_online_fair,
.rq_offline = rq_offline_fair,

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 74ff659..d892a9f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -269,12 +269,6 @@ struct cfs_rq {
#endif

#ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
/*
* CFS Load tracking
* Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -284,9 +278,9 @@ struct cfs_rq {
u64 runnable_load_avg, blocked_load_avg;
atomic64_t decay_counter, removed_load;
u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
+ /* Required to track per-cpu representation of a task_group */
u32 tg_runnable_contrib;
u64 tg_load_contrib;
#endif /* CONFIG_FAIR_GROUP_SCHED */
--
1.7.12

2013-06-20 02:20:29

by Alex Shi

Subject: [Resend patch v8 02/13] sched: move few runnable tg variables into CONFIG_SMP

The following two variables are only used under CONFIG_SMP, so it is better
to move their definitions under CONFIG_SMP too.

atomic64_t load_avg;
atomic_t runnable_avg;

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/sched.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d892a9f..24b1503 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -149,9 +149,11 @@ struct task_group {
unsigned long shares;

atomic_t load_weight;
+#ifdef CONFIG_SMP
atomic64_t load_avg;
atomic_t runnable_avg;
#endif
+#endif

#ifdef CONFIG_RT_GROUP_SCHED
struct sched_rt_entity **rt_se;
--
1.7.12

2013-06-20 02:20:33

by Alex Shi

Subject: [Resend patch v8 03/13] sched: set initial value of runnable avg for new forked task

We need to initialize se.avg.{decay_count, load_avg_contrib} for a
newly forked task.
Otherwise, random values in these variables cause a mess when the new task
is enqueued:
enqueue_task_fair
enqueue_entity
enqueue_entity_load_avg

and make fork balancing imbalanced due to the incorrect load_avg_contrib.

Furthermore, Morten Rasmussen noticed that some tasks were not launched
immediately after being created. So Paul and Peter suggested giving the new
task's runnable avg time a start value equal to sched_slice().

Paul also contributed most of the code comments in this commit.

Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Gu Zheng <[email protected]>
Reviewed-by: Paul Turner <[email protected]>
---
kernel/sched/core.c | 6 ++----
kernel/sched/fair.c | 24 ++++++++++++++++++++++++
kernel/sched/sched.h | 2 ++
3 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b9e7036..c78a9e2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1598,10 +1598,6 @@ static void __sched_fork(struct task_struct *p)
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);

-#ifdef CONFIG_SMP
- p->se.avg.runnable_avg_period = 0;
- p->se.avg.runnable_avg_sum = 0;
-#endif
#ifdef CONFIG_SCHEDSTATS
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
#endif
@@ -1745,6 +1741,8 @@ void wake_up_new_task(struct task_struct *p)
set_task_cpu(p, select_task_rq(p, SD_BALANCE_FORK, 0));
#endif

+ /* Initialize new task's runnable average */
+ init_task_runnable_average(p);
rq = __task_rq_lock(p);
activate_task(rq, p, 0);
p->on_rq = 1;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f404468..df5b8a9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -680,6 +680,26 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
return calc_delta_fair(sched_slice(cfs_rq, se), se);
}

+#ifdef CONFIG_SMP
+static inline void __update_task_entity_contrib(struct sched_entity *se);
+
+/* Give new task start runnable values to heavy its load in infant time */
+void init_task_runnable_average(struct task_struct *p)
+{
+ u32 slice;
+
+ p->se.avg.decay_count = 0;
+ slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
+ p->se.avg.runnable_avg_sum = slice;
+ p->se.avg.runnable_avg_period = slice;
+ __update_task_entity_contrib(&p->se);
+}
+#else
+void init_task_runnable_average(struct task_struct *p)
+{
+}
+#endif
+
/*
* Update the current task's runtime statistics. Skip current tasks that
* are not in our scheduling class.
@@ -1527,6 +1547,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
* We track migrations using entity decay_count <= 0, on a wake-up
* migration we use a negative decay count to track the remote decays
* accumulated while sleeping.
+ *
+ * Newly forked tasks are enqueued with se->avg.decay_count == 0, they
+ * are seen by enqueue_entity_load_avg() as a migration with an already
+ * constructed load_avg_contrib.
*/
if (unlikely(se->avg.decay_count <= 0)) {
se->avg.last_runnable_update = rq_clock_task(rq_of(cfs_rq));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 24b1503..0684c26 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1058,6 +1058,8 @@ extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime

extern void update_idle_cpu_load(struct rq *this_rq);

+extern void init_task_runnable_average(struct task_struct *p);
+
#ifdef CONFIG_PARAVIRT
static inline u64 steal_ticks(u64 steal)
{
--
1.7.12

2013-06-20 02:21:03

by Alex Shi

Subject: [Resend patch v8 04/13] sched: fix slept time double counting in enqueue entity

A woken-up migrated task will call __synchronize_entity_decay(se) in
migrate_task_rq_fair(), and then it needs to set
`se->avg.last_runnable_update -= (-se->avg.decay_count) << 20'
before update_entity_load_avg(), in order to avoid the slept time being
counted twice into se.avg.load_avg_contrib, once in
__synchronize_entity_decay() and once in update_entity_load_avg().

But if the slept task is woken up on the same cpu, it misses that
last_runnable_update adjustment before update_entity_load_avg(se, 0, 1), so
the slept time is counted twice by both functions.
So we need to remove the double counting of slept time.

Paul also contributed some of the code comments in this commit.

Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Paul Turner <[email protected]>
---
kernel/sched/fair.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df5b8a9..1e5a5e6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1571,7 +1571,13 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
}
wakeup = 0;
} else {
- __synchronize_entity_decay(se);
+ /*
+ * Task re-woke on same cpu (or else migrate_task_rq_fair()
+ * would have made count negative); we must be careful to avoid
+ * double-accounting blocked time after synchronizing decays.
+ */
+ se->avg.last_runnable_update += __synchronize_entity_decay(se)
+ << 20;
}

/* migrated tasks did not contribute to our blocked load */
--
1.7.12

2013-06-20 02:21:06

by Alex Shi

Subject: [Resend patch v8 05/13] sched: update cpu load after task_tick.

To get the latest runnable info, we need to do this cpu load update after
task_tick().

Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Paul Turner <[email protected]>
---
kernel/sched/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c78a9e2..ee0225e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2152,8 +2152,8 @@ void scheduler_tick(void)

raw_spin_lock(&rq->lock);
update_rq_clock(rq);
- update_cpu_load_active(rq);
curr->sched_class->task_tick(rq, curr, 0);
+ update_cpu_load_active(rq);
raw_spin_unlock(&rq->lock);

perf_event_task_tick();
--
1.7.12

2013-06-20 02:21:11

by Alex Shi

Subject: [Resend patch v8 06/13] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

They are the base values used in load balancing; update them with the rq
runnable load average, and load balancing will then consider the runnable
load avg naturally.

We also tried to include blocked_load_avg as cpu load in balancing, but that
caused a 6% kbuild performance drop on every Intel machine, and aim7/oltp
drops on some 4-socket machines.
Even when only adding blocked_load_avg into get_rq_runnable_load(),
hackbench still drops a little on NHM EX.

Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Gu Zheng <[email protected]>
---
kernel/sched/fair.c | 5 +++--
kernel/sched/proc.c | 17 +++++++++++++++--
2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1e5a5e6..7d5c477 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2968,7 +2968,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* Used instead of source_load when we know the type == 0 */
static unsigned long weighted_cpuload(const int cpu)
{
- return cpu_rq(cpu)->load.weight;
+ return cpu_rq(cpu)->cfs.runnable_load_avg;
}

/*
@@ -3013,9 +3013,10 @@ static unsigned long cpu_avg_load_per_task(int cpu)
{
struct rq *rq = cpu_rq(cpu);
unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
+ unsigned long load_avg = rq->cfs.runnable_load_avg;

if (nr_running)
- return rq->load.weight / nr_running;
+ return load_avg / nr_running;

return 0;
}
diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
index bb3a6a0..ce5cd48 100644
--- a/kernel/sched/proc.c
+++ b/kernel/sched/proc.c
@@ -501,6 +501,18 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
sched_avg_update(this_rq);
}

+#ifdef CONFIG_SMP
+unsigned long get_rq_runnable_load(struct rq *rq)
+{
+ return rq->cfs.runnable_load_avg;
+}
+#else
+unsigned long get_rq_runnable_load(struct rq *rq)
+{
+ return rq->load.weight;
+}
+#endif
+
#ifdef CONFIG_NO_HZ_COMMON
/*
* There is no sane way to deal with nohz on smp when using jiffies because the
@@ -522,7 +534,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
void update_idle_cpu_load(struct rq *this_rq)
{
unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
- unsigned long load = this_rq->load.weight;
+ unsigned long load = get_rq_runnable_load(this_rq);
unsigned long pending_updates;

/*
@@ -568,11 +580,12 @@ void update_cpu_load_nohz(void)
*/
void update_cpu_load_active(struct rq *this_rq)
{
+ unsigned long load = get_rq_runnable_load(this_rq);
/*
* See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
*/
this_rq->last_load_update_tick = jiffies;
- __update_cpu_load(this_rq, this_rq->load.weight, 1);
+ __update_cpu_load(this_rq, load, 1);

calc_load_account_active(this_rq);
}
--
1.7.12

2013-06-20 02:21:23

by Alex Shi

Subject: [Resend patch v8 11/13] sched/cfs_rq: change atomic64_t removed_load to atomic_long_t

As with the runnable_load_avg and blocked_load_avg variables, the long type
is enough for removed_load on both 64-bit and 32-bit machines.

This way we avoid the expensive atomic64 operations on 32-bit machines.

Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Paul Turner <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 10 ++++++----
kernel/sched/sched.h | 3 ++-
2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c2a2a36..8889fbf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1517,8 +1517,9 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
if (!decays && !force_update)
return;

- if (atomic64_read(&cfs_rq->removed_load)) {
- u64 removed_load = atomic64_xchg(&cfs_rq->removed_load, 0);
+ if (atomic_long_read(&cfs_rq->removed_load)) {
+ unsigned long removed_load;
+ removed_load = atomic_long_xchg(&cfs_rq->removed_load, 0);
subtract_blocked_load_contrib(cfs_rq, removed_load);
}

@@ -3485,7 +3486,8 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
*/
if (se->avg.decay_count) {
se->avg.decay_count = -__synchronize_entity_decay(se);
- atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
+ atomic_long_add(se->avg.load_avg_contrib,
+ &cfs_rq->removed_load);
}
}
#endif /* CONFIG_SMP */
@@ -5947,7 +5949,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
#endif
#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
atomic64_set(&cfs_rq->decay_counter, 1);
- atomic64_set(&cfs_rq->removed_load, 0);
+ atomic_long_set(&cfs_rq->removed_load, 0);
#endif
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 78bb990..755d930 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -278,8 +278,9 @@ struct cfs_rq {
* the FAIR_GROUP_SCHED case).
*/
unsigned long runnable_load_avg, blocked_load_avg;
- atomic64_t decay_counter, removed_load;
+ atomic64_t decay_counter;
u64 last_decay;
+ atomic_long_t removed_load;

#ifdef CONFIG_FAIR_GROUP_SCHED
/* Required to track per-cpu representation of a task_group */
--
1.7.12

2013-06-20 02:21:13

by Alex Shi

Subject: [Resend patch v8 07/13] sched: consider runnable load average in move_tasks

Besides using the runnable load average in the background, move_tasks() is
also a key function in load balancing. We need to consider the runnable load
average in it as well, in order to get an apples-to-apples load comparison.

Morten caught a div u64 bug on ARM, thanks!

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7d5c477..ddbc19f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4184,11 +4184,14 @@ static int tg_load_down(struct task_group *tg, void *data)
long cpu = (long)data;

if (!tg->parent) {
- load = cpu_rq(cpu)->load.weight;
+ load = cpu_rq(cpu)->avg.load_avg_contrib;
} else {
+ unsigned long tmp_rla;
+ tmp_rla = tg->parent->cfs_rq[cpu]->runnable_load_avg + 1;
+
load = tg->parent->cfs_rq[cpu]->h_load;
- load *= tg->se[cpu]->load.weight;
- load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
+ load *= tg->se[cpu]->avg.load_avg_contrib;
+ load /= tmp_rla;
}

tg->cfs_rq[cpu]->h_load = load;
@@ -4214,12 +4217,9 @@ static void update_h_load(long cpu)
static unsigned long task_h_load(struct task_struct *p)
{
struct cfs_rq *cfs_rq = task_cfs_rq(p);
- unsigned long load;
-
- load = p->se.load.weight;
- load = div_u64(load * cfs_rq->h_load, cfs_rq->load.weight + 1);

- return load;
+ return div64_ul(p->se.avg.load_avg_contrib * cfs_rq->h_load,
+ cfs_rq->runnable_load_avg + 1);
}
#else
static inline void update_blocked_averages(int cpu)
@@ -4232,7 +4232,7 @@ static inline void update_h_load(long cpu)

static unsigned long task_h_load(struct task_struct *p)
{
- return p->se.load.weight;
+ return p->se.avg.load_avg_contrib;
}
#endif

--
1.7.12

2013-06-20 02:21:18

by Alex Shi

Subject: [Resend patch v8 09/13] sched: change cfs_rq load avg to unsigned long

Since the 'u64 runnable_load_avg, blocked_load_avg' in the cfs_rq struct are
smaller than the 'unsigned long' cfs_rq->load.weight, we don't need u64
variables to describe them. unsigned long is more efficient and convenient.

Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Paul Turner <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
---
kernel/sched/debug.c | 4 ++--
kernel/sched/fair.c | 7 ++-----
kernel/sched/sched.h | 2 +-
3 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 75024a6..160afdc 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -211,9 +211,9 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
SEQ_printf(m, " .%-30s: %ld\n", "load", cfs_rq->load.weight);
#ifdef CONFIG_FAIR_GROUP_SCHED
#ifdef CONFIG_SMP
- SEQ_printf(m, " .%-30s: %lld\n", "runnable_load_avg",
+ SEQ_printf(m, " .%-30s: %ld\n", "runnable_load_avg",
cfs_rq->runnable_load_avg);
- SEQ_printf(m, " .%-30s: %lld\n", "blocked_load_avg",
+ SEQ_printf(m, " .%-30s: %ld\n", "blocked_load_avg",
cfs_rq->blocked_load_avg);
SEQ_printf(m, " .%-30s: %lld\n", "tg_load_avg",
(unsigned long long)atomic64_read(&cfs_rq->tg->load_avg));
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 395f57c..39a5bae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4186,12 +4186,9 @@ static int tg_load_down(struct task_group *tg, void *data)
if (!tg->parent) {
load = cpu_rq(cpu)->avg.load_avg_contrib;
} else {
- unsigned long tmp_rla;
- tmp_rla = tg->parent->cfs_rq[cpu]->runnable_load_avg + 1;
-
load = tg->parent->cfs_rq[cpu]->h_load;
- load *= tg->se[cpu]->avg.load_avg_contrib;
- load /= tmp_rla;
+ load = div64_ul(load * tg->se[cpu]->avg.load_avg_contrib,
+ tg->parent->cfs_rq[cpu]->runnable_load_avg + 1);
}

tg->cfs_rq[cpu]->h_load = load;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0684c26..762fa63 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -277,7 +277,7 @@ struct cfs_rq {
* This allows for the description of both thread and group usage (in
* the FAIR_GROUP_SCHED case).
*/
- u64 runnable_load_avg, blocked_load_avg;
+ unsigned long runnable_load_avg, blocked_load_avg;
atomic64_t decay_counter, removed_load;
u64 last_decay;

--
1.7.12

2013-06-20 02:21:25

by Alex Shi

Subject: [Resend patch v8 12/13] sched/tg: remove tg.load_weight

Since no one uses it anymore.

Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Paul Turner <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
---
kernel/sched/sched.h | 1 -
1 file changed, 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 755d930..e13d0d5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -148,7 +148,6 @@ struct task_group {
struct cfs_rq **cfs_rq;
unsigned long shares;

- atomic_t load_weight;
#ifdef CONFIG_SMP
atomic_long_t load_avg;
atomic_t runnable_avg;
--
1.7.12

2013-06-20 02:21:29

by Alex Shi

Subject: [Resend patch v8 13/13] sched: get_rq_runnable_load() can be static and inline

Based-on-patch-by: Fengguang Wu <[email protected]>
Signed-off-by: Alex Shi <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
---
kernel/sched/proc.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
index ce5cd48..16f5a30 100644
--- a/kernel/sched/proc.c
+++ b/kernel/sched/proc.c
@@ -502,12 +502,12 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
}

#ifdef CONFIG_SMP
-unsigned long get_rq_runnable_load(struct rq *rq)
+static inline unsigned long get_rq_runnable_load(struct rq *rq)
{
return rq->cfs.runnable_load_avg;
}
#else
-unsigned long get_rq_runnable_load(struct rq *rq)
+static inline unsigned long get_rq_runnable_load(struct rq *rq)
{
return rq->load.weight;
}
--
1.7.12

2013-06-20 02:22:08

by Alex Shi

Subject: [Resend patch v8 10/13] sched/tg: use 'unsigned long' for load variable in task group

Since tg->load_avg is smaller than tg->load_weight, we don't need an
atomic64_t variable for load_avg on 32-bit machines.
The same reasoning applies to cfs_rq->tg_load_contrib.

The atomic_long_t/unsigned long variable types are more efficient and
convenient for them.

Signed-off-by: Alex Shi <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
---
kernel/sched/debug.c | 6 +++---
kernel/sched/fair.c | 12 ++++++------
kernel/sched/sched.h | 4 ++--
3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 160afdc..d803989 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -215,9 +215,9 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->runnable_load_avg);
SEQ_printf(m, " .%-30s: %ld\n", "blocked_load_avg",
cfs_rq->blocked_load_avg);
- SEQ_printf(m, " .%-30s: %lld\n", "tg_load_avg",
- (unsigned long long)atomic64_read(&cfs_rq->tg->load_avg));
- SEQ_printf(m, " .%-30s: %lld\n", "tg_load_contrib",
+ SEQ_printf(m, " .%-30s: %ld\n", "tg_load_avg",
+ atomic_long_read(&cfs_rq->tg->load_avg));
+ SEQ_printf(m, " .%-30s: %ld\n", "tg_load_contrib",
cfs_rq->tg_load_contrib);
SEQ_printf(m, " .%-30s: %d\n", "tg_runnable_contrib",
cfs_rq->tg_runnable_contrib);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 39a5bae..c2a2a36 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1075,7 +1075,7 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
* to gain a more accurate current total weight. See
* update_cfs_rq_load_contribution().
*/
- tg_weight = atomic64_read(&tg->load_avg);
+ tg_weight = atomic_long_read(&tg->load_avg);
tg_weight -= cfs_rq->tg_load_contrib;
tg_weight += cfs_rq->load.weight;

@@ -1356,13 +1356,13 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
int force_update)
{
struct task_group *tg = cfs_rq->tg;
- s64 tg_contrib;
+ long tg_contrib;

tg_contrib = cfs_rq->runnable_load_avg;
tg_contrib -= cfs_rq->tg_load_contrib;

- if (force_update || abs64(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
- atomic64_add(tg_contrib, &tg->load_avg);
+ if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
+ atomic_long_add(tg_contrib, &tg->load_avg);
cfs_rq->tg_load_contrib += tg_contrib;
}
}
@@ -1397,8 +1397,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
u64 contrib;

contrib = cfs_rq->tg_load_contrib * tg->shares;
- se->avg.load_avg_contrib = div64_u64(contrib,
- atomic64_read(&tg->load_avg) + 1);
+ se->avg.load_avg_contrib = div_u64(contrib,
+ atomic_long_read(&tg->load_avg) + 1);

/*
* For group entities we need to compute a correction term in the case
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 762fa63..78bb990 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -150,7 +150,7 @@ struct task_group {

atomic_t load_weight;
#ifdef CONFIG_SMP
- atomic64_t load_avg;
+ atomic_long_t load_avg;
atomic_t runnable_avg;
#endif
#endif
@@ -284,7 +284,7 @@ struct cfs_rq {
#ifdef CONFIG_FAIR_GROUP_SCHED
/* Required to track per-cpu representation of a task_group */
u32 tg_runnable_contrib;
- u64 tg_load_contrib;
+ unsigned long tg_load_contrib;
#endif /* CONFIG_FAIR_GROUP_SCHED */

/*
--
1.7.12

2013-06-20 02:22:37

by Alex Shi

Subject: [Resend patch v8 08/13] sched/tg: remove blocked_load_avg in balance

blocked_load_avg is sometimes too heavy and far bigger than the runnable
load avg, which makes the balancer take wrong decisions. So remove it.

Changlong tested this patch and found that the ltp cgroup stress test gets
better performance: https://lkml.org/lkml/2013/5/23/65
---
3.10-rc1 patch1-7 patch1-8
duration=764 duration=754 duration=750
duration=764 duration=754 duration=751
duration=763 duration=755 duration=751

duration is the test run time in seconds.
---

And Jason also tested this patchset on his 8 sockets machine:
https://lkml.org/lkml/2013/5/29/673
---
When using a 3.10-rc2 tip kernel with patches 1-8, there was about a 40%
improvement in performance of the workload compared to when using the
vanilla 3.10-rc2 tip kernel with no patches. When using a 3.10-rc2 tip
kernel with just patches 1-7, the performance improvement of the
workload over the vanilla 3.10-rc2 tip kernel was about 25%.
---

Signed-off-by: Alex Shi <[email protected]>
Tested-by: Changlong Xie <[email protected]>
Tested-by: Jason Low <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ddbc19f..395f57c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1358,7 +1358,7 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
struct task_group *tg = cfs_rq->tg;
s64 tg_contrib;

- tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
+ tg_contrib = cfs_rq->runnable_load_avg;
tg_contrib -= cfs_rq->tg_load_contrib;

if (force_update || abs64(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
--
1.7.12

2013-06-20 13:29:25

by Vincent Guittot

Subject: Re: [Resend patch v8 06/13] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On 20 June 2013 04:18, Alex Shi <[email protected]> wrote:
> They are the base values in load balance, update them with rq runnable
> load average, then the load balance will consider runnable load avg
> naturally.
>
> We also try to include the blocked_load_avg as cpu load in balancing,
> but that cause kbuild performance drop 6% on every Intel machine, and
> aim7/oltp drop on some of 4 CPU sockets machines.
> Or only add blocked_load_avg into get_rq_runable_load, hackbench still
> drop a little on NHM EX.
>
> Signed-off-by: Alex Shi <[email protected]>
> Reviewed-by: Gu Zheng <[email protected]>
> ---
> kernel/sched/fair.c | 5 +++--
> kernel/sched/proc.c | 17 +++++++++++++++--
> 2 files changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1e5a5e6..7d5c477 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2968,7 +2968,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> /* Used instead of source_load when we know the type == 0 */
> static unsigned long weighted_cpuload(const int cpu)
> {
> - return cpu_rq(cpu)->load.weight;
> + return cpu_rq(cpu)->cfs.runnable_load_avg;
> }

Alex,

In the wake-affine function, we use current->se.load.weight and
p->se.load.weight to adjust the load of this_cpu and prev_cpu, whereas
these loads are now equal to runnable_load_avg, which is the sum of the
se->avg.load_avg_contrib values. Shouldn't we use
se->avg.load_avg_contrib instead of se.load.weight?
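
For illustration, the change I have in mind would look roughly like this in
wake_affine()'s sync path (just a sketch against the current code, not a
tested patch):

	if (sync) {
		tg = task_group(current);
		/* tracked load average instead of the instantaneous weight */
		weight = current->se.avg.load_avg_contrib;

		this_load += effective_load(tg, this_cpu, -weight, -weight);
		load += effective_load(tg, prev_cpu, 0, -weight);
	}

	tg = task_group(p);
	weight = p->se.avg.load_avg_contrib;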

Vincent

>
> /*
> @@ -3013,9 +3013,10 @@ static unsigned long cpu_avg_load_per_task(int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
> unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
> + unsigned long load_avg = rq->cfs.runnable_load_avg;
>
> if (nr_running)
> - return rq->load.weight / nr_running;
> + return load_avg / nr_running;
>
> return 0;
> }
> diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
> index bb3a6a0..ce5cd48 100644
> --- a/kernel/sched/proc.c
> +++ b/kernel/sched/proc.c
> @@ -501,6 +501,18 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
> sched_avg_update(this_rq);
> }
>
> +#ifdef CONFIG_SMP
> +unsigned long get_rq_runnable_load(struct rq *rq)
> +{
> + return rq->cfs.runnable_load_avg;
> +}
> +#else
> +unsigned long get_rq_runnable_load(struct rq *rq)
> +{
> + return rq->load.weight;
> +}
> +#endif
> +
> #ifdef CONFIG_NO_HZ_COMMON
> /*
> * There is no sane way to deal with nohz on smp when using jiffies because the
> @@ -522,7 +534,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
> void update_idle_cpu_load(struct rq *this_rq)
> {
> unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
> - unsigned long load = this_rq->load.weight;
> + unsigned long load = get_rq_runnable_load(this_rq);
> unsigned long pending_updates;
>
> /*
> @@ -568,11 +580,12 @@ void update_cpu_load_nohz(void)
> */
> void update_cpu_load_active(struct rq *this_rq)
> {
> + unsigned long load = get_rq_runnable_load(this_rq);
> /*
> * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
> */
> this_rq->last_load_update_tick = jiffies;
> - __update_cpu_load(this_rq, this_rq->load.weight, 1);
> + __update_cpu_load(this_rq, load, 1);
>
> calc_load_account_active(this_rq);
> }
> --
> 1.7.12
>

2013-06-24 03:16:16

by Alex Shi

Subject: Re: [Resend patch v8 0/13] use runnable load in schedule balance

On 06/20/2013 10:18 AM, Alex Shi wrote:
> Resend patchset for more convenient pick up.
> This patch set combine 'use runnable load in balance' serials and 'change
> 64bit variables to long type' serials. also collected Reviewed-bys, and
> Tested-bys.
>
> The only changed code is fixing load to load_avg convert in UP mode, which
> found by PeterZ in task_h_load().
>
> Paul still has some concern of blocked_load_avg out of balance consideration.
> but I didn't see the blocked_load_avg usage was thought through now, or some
> strong reason to make it into balance.
> So, according to benchmarks testing result I keep patches unchanged.

Ingo & Peter,

This patchset has been discussed widely and deeply.

Now only the 6th/8th patches still have arguments on them. Paul thinks it is
better to consider blocked_load_avg in balance, since it is helpful in some
scenarios, but I think that in most scenarios the blocked_load_avg just
causes load imbalance among cpus, and in addition testing shows that with
blocked_load_avg the performance is just worse on some benchmarks. So I
still prefer to keep it out of balance.

http://www.mail-archive.com/[email protected]/msg455196.html

Is it time to make a decision, or to give more comments? Thanks!
>
> Regards
> Alex
>
> [Resend patch v8 01/13] Revert "sched: Introduce temporary
> [Resend patch v8 02/13] sched: move few runnable tg variables into
> [Resend patch v8 03/13] sched: set initial value of runnable avg for
> [Resend patch v8 04/13] sched: fix slept time double counting in
> [Resend patch v8 05/13] sched: update cpu load after task_tick.
> [Resend patch v8 06/13] sched: compute runnable load avg in cpu_load
> [Resend patch v8 07/13] sched: consider runnable load average in
> [Resend patch v8 08/13] sched/tg: remove blocked_load_avg in balance
> [Resend patch v8 09/13] sched: change cfs_rq load avg to unsigned
> [Resend patch v8 10/13] sched/tg: use 'unsigned long' for load
> [Resend patch v8 11/13] sched/cfs_rq: change atomic64_t removed_load
> [Resend patch v8 12/13] sched/tg: remove tg.load_weight
> [Resend patch v8 13/13] sched: get_rq_runnable_load() can be static
>


--
Thanks
Alex

2013-06-24 09:08:01

by Alex Shi

Subject: Re: [Resend patch v8 06/13] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On 06/20/2013 10:18 AM, Alex Shi wrote:
> They are the base values in load balance, update them with rq runnable
> load average, then the load balance will consider runnable load avg
> naturally.
>
> We also try to include the blocked_load_avg as cpu load in balancing,
> but that cause kbuild performance drop 6% on every Intel machine, and
> aim7/oltp drop on some of 4 CPU sockets machines.
> Or only add blocked_load_avg into get_rq_runable_load, hackbench still
> drop a little on NHM EX.
>
> Signed-off-by: Alex Shi <[email protected]>
> Reviewed-by: Gu Zheng <[email protected]>


I am sorry for still wavering on how cfs and rt task load should be
considered, so here is an extra RFC patch that takes RT load into account
in balancing.
With or without this patch, my test results show no change, since there are
not many RT tasks in my testing.

I am not familiar with the RT scheduler, so I just rely on PeterZ, who is
the expert on this. :)

---

From b9ed5363b0a579a87256b589278c8c66500c7db3 Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Mon, 24 Jun 2013 16:12:29 +0800
Subject: [PATCH 08/16] sched: recover to whole rq load include rt tasks'

The patch 'sched: compute runnable load avg in cpu_load and
cpu_avg_load_per_task' weights the rq's load by cfs.runnable_load_avg
instead of rq->load.weight. That is fine when the system has little RT load.

But if there is a lot of RT load on the rq, that code will just weight the
cfs tasks in load balance without considering RT, which may cause load
imbalance if the RT load is not evenly balanced among cpus.
Using rq->avg.load_avg_contrib can resolve this problem and keep the
advantages of runnable load balancing.

BTW, this patch may increase the number of failed balance attempts if
move_tasks() cannot balance the load between CPUs, e.g. when there is only
RT load on the CPUs.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 4 ++--
kernel/sched/proc.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 37a5720..6979906 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2968,7 +2968,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* Used instead of source_load when we know the type == 0 */
static unsigned long weighted_cpuload(const int cpu)
{
- return cpu_rq(cpu)->cfs.runnable_load_avg;
+ return cpu_rq(cpu)->avg.load_avg_contrib;
}

/*
@@ -3013,7 +3013,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
{
struct rq *rq = cpu_rq(cpu);
unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
- unsigned long load_avg = rq->cfs.runnable_load_avg;
+ unsigned long load_avg = rq->avg.load_avg_contrib;

if (nr_running)
return load_avg / nr_running;
diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
index ce5cd48..4f2490c 100644
--- a/kernel/sched/proc.c
+++ b/kernel/sched/proc.c
@@ -504,7 +504,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
#ifdef CONFIG_SMP
unsigned long get_rq_runnable_load(struct rq *rq)
{
- return rq->cfs.runnable_load_avg;
+ return rq->avg.load_avg_contrib;
}
#else
unsigned long get_rq_runnable_load(struct rq *rq)
--
1.7.12

2013-06-24 10:41:20

by Paul Turner

Subject: Re: [Resend patch v8 0/13] use runnable load in schedule balance

On Sun, Jun 23, 2013 at 8:15 PM, Alex Shi <[email protected]> wrote:
> On 06/20/2013 10:18 AM, Alex Shi wrote:
>> Resend patchset for more convenient pick up.
>> This patch set combine 'use runnable load in balance' serials and 'change
>> 64bit variables to long type' serials. also collected Reviewed-bys, and
>> Tested-bys.
>>
>> The only changed code is fixing load to load_avg convert in UP mode, which
>> found by PeterZ in task_h_load().
>>
>> Paul still has some concern of blocked_load_avg out of balance consideration.
>> but I didn't see the blocked_load_avg usage was thought through now, or some
>> strong reason to make it into balance.
>> So, according to benchmarks testing result I keep patches unchanged.
>
> Ingo & Peter,
>
> This patchset was discussed spread and deeply.
>
> Now just 6th/8th patch has some arguments on them, Paul think it is
> better to consider blocked_load_avg in balance, since it is helpful on
> some scenarios, but I think on most of scenarios, the blocked_load_avg
> just cause load imbalance among cpus. and plus testing show with
> blocked_load_avg the performance is just worse on some benchmarks. So, I
> still prefer to keep it out of balance.

I think you have perhaps misunderstood what I was trying to explain.

I have no problems with not including blocked load in load-balance, in
fact, I encouraged not accumulating it in an average of averages in
CPU load.

The problem is that your current approach has removed it both from
load-balance _and_ from shares distribution; isolation matters as much
as performance in the cgroup case (otherwise you would just not use
cgroups). I would expect the latter to have quite negative effects on
fairness, this is my primary concern.
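
To be concrete, the path I'm worried about is roughly the following (a
sketch of the pre-patch code from memory; details may differ slightly):

	/* __update_cfs_rq_tg_load_contrib(): per-cpu contribution to the tg */
	tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
	tg_contrib -= cfs_rq->tg_load_contrib;
	...
	/* calc_cfs_shares(): tg->shares distributed against tg->load_avg */
	tg_weight = calc_tg_weight(tg, cfs_rq);	/* derived from tg->load_avg */
	shares = tg->shares * cfs_rq->load.weight;
	if (tg_weight)
		shares /= tg_weight;

Dropping blocked load from tg_contrib therefore changes how a group's shares
are split across cpus, not just how cpus are balanced.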

>
> http://www.mail-archive.com/[email protected]/msg455196.html
>
> Is it the time to do the decision or give more comments? Thanks!
>>
>> Regards
>> Alex
>>
>> [Resend patch v8 01/13] Revert "sched: Introduce temporary
>> [Resend patch v8 02/13] sched: move few runnable tg variables into
>> [Resend patch v8 03/13] sched: set initial value of runnable avg for
>> [Resend patch v8 04/13] sched: fix slept time double counting in
>> [Resend patch v8 05/13] sched: update cpu load after task_tick.
>> [Resend patch v8 06/13] sched: compute runnable load avg in cpu_load
>> [Resend patch v8 07/13] sched: consider runnable load average in
>> [Resend patch v8 08/13] sched/tg: remove blocked_load_avg in balance
>> [Resend patch v8 09/13] sched: change cfs_rq load avg to unsigned
>> [Resend patch v8 10/13] sched/tg: use 'unsigned long' for load
>> [Resend patch v8 11/13] sched/cfs_rq: change atomic64_t removed_load
>> [Resend patch v8 12/13] sched/tg: remove tg.load_weight
>> [Resend patch v8 13/13] sched: get_rq_runnable_load() can be static
>>
>
>
> --
> Thanks
> Alex

2013-06-24 10:55:04

by Paul Turner

Subject: Re: [Resend patch v8 06/13] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On Mon, Jun 24, 2013 at 2:06 AM, Alex Shi <[email protected]> wrote:
> On 06/20/2013 10:18 AM, Alex Shi wrote:
>> They are the base values in load balance, update them with rq runnable
>> load average, then the load balance will consider runnable load avg
>> naturally.
>>
>> We also try to include the blocked_load_avg as cpu load in balancing,
>> but that cause kbuild performance drop 6% on every Intel machine, and
>> aim7/oltp drop on some of 4 CPU sockets machines.
>> Or only add blocked_load_avg into get_rq_runable_load, hackbench still
>> drop a little on NHM EX.
>>
>> Signed-off-by: Alex Shi <[email protected]>
>> Reviewed-by: Gu Zheng <[email protected]>
>
>
> I am sorry for still having some swing on cfs and rt task load consideration.
> So give extra RFC patch to consider RT load in balance.
> With or without this patch, my test result has no change, since there is no
> much RT tasks in my testing.


>
> I am not familiar with RT scheduler, just rely on PeterZ who is experts on this. :)
>
> ---
>
> From b9ed5363b0a579a87256b589278c8c66500c7db3 Mon Sep 17 00:00:00 2001
> From: Alex Shi <[email protected]>
> Date: Mon, 24 Jun 2013 16:12:29 +0800
> Subject: [PATCH 08/16] sched: recover to whole rq load include rt tasks'
>
> patch 'sched: compute runnable load avg in cpu_load and
> cpu_avg_load_per_task' weight rq's load on cfs.runnable_load_avg instead
> of rq->load.weight. That is fine when system has no much RT load.
>
> But if there are lots of RT load on rq, that code will just
> weight the cfs tasks in load balance without consideration of RT, that
> may cause load imbalance if much RT load isn't imbalanced among cpu.
> Using rq->avg.load_avg_contrib can resolve this problem and keep the
> advantages from runnable load balance.

I think this patch confuses what "load_avg_contrib" is.

It's the rate-limited (runnable_load_avg + blocked_load_avg[*]) value
that we've currently accumulated into the task_group for the
observation of an individual cpu's runnable+blocked load.
[*] Supposing you're appending this to the end of your current series
you in fact have it as just: cfs_rq->runnable_load_avg

This patch will do nothing for RT load. It's mostly a no-op which is why
you measured no change.

> BTW, this patch may increase the balance failed times, if move_tasks can
> not balance loads between CPUs, like there is only RT load in CPUs.
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/fair.c | 4 ++--
> kernel/sched/proc.c | 2 +-
> 2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 37a5720..6979906 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2968,7 +2968,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> /* Used instead of source_load when we know the type == 0 */
> static unsigned long weighted_cpuload(const int cpu)
> {
> - return cpu_rq(cpu)->cfs.runnable_load_avg;
> + return cpu_rq(cpu)->avg.load_avg_contrib;

This is a bad idea. Neither value is really what's intended by
"type==0", but load_avg_contrib is even more stale.

> }
>
> /*
> @@ -3013,7 +3013,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
> unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
> - unsigned long load_avg = rq->cfs.runnable_load_avg;
> + unsigned long load_avg = rq->avg.load_avg_contrib;
>
> if (nr_running)
> return load_avg / nr_running;
> diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
> index ce5cd48..4f2490c 100644
> --- a/kernel/sched/proc.c
> +++ b/kernel/sched/proc.c
> @@ -504,7 +504,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
> #ifdef CONFIG_SMP
> unsigned long get_rq_runnable_load(struct rq *rq)
> {
> - return rq->cfs.runnable_load_avg;
> + return rq->avg.load_avg_contrib;
> }
> #else
> unsigned long get_rq_runnable_load(struct rq *rq)
> --
> 1.7.12
>
>

2013-06-24 11:04:26

by Vincent Guittot

Subject: Re: [Resend patch v8 06/13] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On 24 June 2013 11:06, Alex Shi <[email protected]> wrote:
> On 06/20/2013 10:18 AM, Alex Shi wrote:
>> They are the base values in load balance, update them with rq runnable
>> load average, then the load balance will consider runnable load avg
>> naturally.
>>
>> We also try to include the blocked_load_avg as cpu load in balancing,
>> but that cause kbuild performance drop 6% on every Intel machine, and
>> aim7/oltp drop on some of 4 CPU sockets machines.
>> Or only add blocked_load_avg into get_rq_runable_load, hackbench still
>> drop a little on NHM EX.
>>
>> Signed-off-by: Alex Shi <[email protected]>
>> Reviewed-by: Gu Zheng <[email protected]>
>
>
> I am sorry for still having some swing on cfs and rt task load consideration.
> So give extra RFC patch to consider RT load in balance.
> With or without this patch, my test result has no change, since there is no
> much RT tasks in my testing.
>
> I am not familiar with RT scheduler, just rely on PeterZ who is experts on this. :)
>
> ---
>
> From b9ed5363b0a579a87256b589278c8c66500c7db3 Mon Sep 17 00:00:00 2001
> From: Alex Shi <[email protected]>
> Date: Mon, 24 Jun 2013 16:12:29 +0800
> Subject: [PATCH 08/16] sched: recover to whole rq load include rt tasks'
>
> patch 'sched: compute runnable load avg in cpu_load and
> cpu_avg_load_per_task' weight rq's load on cfs.runnable_load_avg instead
> of rq->load.weight. That is fine when system has no much RT load.
>
> But if there are lots of RT load on rq, that code will just
> weight the cfs tasks in load balance without consideration of RT, that

AFAICT, the RT tasks' activity is already taken into account by decreasing
the cpu_power that is used during load balance, for example in
find_busiest_queue() where weighted_cpuload() is divided by cpu_power.
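
Roughly, the existing path I'm referring to (a simplified sketch; the real
code has a few more scaling steps):

	/* scale_rt_power(): fraction of the period not consumed by RT */
	total = sched_avg_period() + (rq_clock(rq) - rq->age_stamp);
	available = total - rq->rt_avg;

	/* update_cpu_power() folds that fraction into cpu_power */
	power *= scale_rt_power(cpu);
	power >>= SCHED_POWER_SHIFT;

	/* find_busiest_queue() then weights the CFS load by cpu_power */
	wl = (weighted_cpuload(i) * SCHED_POWER_SCALE) / power;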

Vincent

> may cause load imbalance if much RT load isn't imbalanced among cpu.
> Using rq->avg.load_avg_contrib can resolve this problem and keep the
> advantages from runnable load balance.
>
> BTW, this patch may increase the balance failed times, if move_tasks can
> not balance loads between CPUs, like there is only RT load in CPUs.
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/fair.c | 4 ++--
> kernel/sched/proc.c | 2 +-
> 2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 37a5720..6979906 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2968,7 +2968,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> /* Used instead of source_load when we know the type == 0 */
> static unsigned long weighted_cpuload(const int cpu)
> {
> - return cpu_rq(cpu)->cfs.runnable_load_avg;
> + return cpu_rq(cpu)->avg.load_avg_contrib;
> }
>
> /*
> @@ -3013,7 +3013,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
> unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
> - unsigned long load_avg = rq->cfs.runnable_load_avg;
> + unsigned long load_avg = rq->avg.load_avg_contrib;
>
> if (nr_running)
> return load_avg / nr_running;
> diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
> index ce5cd48..4f2490c 100644
> --- a/kernel/sched/proc.c
> +++ b/kernel/sched/proc.c
> @@ -504,7 +504,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
> #ifdef CONFIG_SMP
> unsigned long get_rq_runnable_load(struct rq *rq)
> {
> - return rq->cfs.runnable_load_avg;
> + return rq->avg.load_avg_contrib;
> }
> #else
> unsigned long get_rq_runnable_load(struct rq *rq)
> --
> 1.7.12
>
>

2013-06-24 11:07:25

by Paul Turner

Subject: Re: [Resend patch v8 06/13] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

> AFAICT, the RT tasks activity is already taken into account by
> decreasing the cpu_power that is used during load balance like in the
> find_busiest_queue where weighted_cpuload is divided by cpu_power.
>

Yes. RT tasks used to be handled as having a very-large weight [e.g.
2*nice(-20)] but now we just scale cpu-power for how much non-RT time
there is available.

2013-06-24 14:57:16

by Alex Shi

Subject: Re: [Resend patch v8 06/13] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On 06/24/2013 07:06 PM, Paul Turner wrote:
>> AFAICT, the RT tasks activity is already taken into account by
>> decreasing the cpu_power that is used during load balance like in the
>> find_busiest_queue where weighted_cpuload is divided by cpu_power.
>>
>
> Yes. RT tasks used to be handled as having a very-large weight [e.g.
> 2*nice(-20)] but now we just scale cpu-power for how much non-RT time
> there is available.
>

Yes, you're right. Thanks, Vincent & Paul, for all the clarification!

--
Thanks
Alex

2013-06-24 15:38:09

by Alex Shi

Subject: Re: [Resend patch v8 0/13] use runnable load in schedule balance

On 06/24/2013 06:40 PM, Paul Turner wrote:
>> > Ingo & Peter,
>> >
>> > This patchset was discussed spread and deeply.
>> >
>> > Now just 6th/8th patch has some arguments on them, Paul think it is
>> > better to consider blocked_load_avg in balance, since it is helpful on
>> > some scenarios, but I think on most of scenarios, the blocked_load_avg
>> > just cause load imbalance among cpus. and plus testing show with
>> > blocked_load_avg the performance is just worse on some benchmarks. So, I
>> > still prefer to keep it out of balance.
> I think you have perhaps misunderstood what I was trying to explain.
>
> I have no problems with not including blocked load in load-balance, in
> fact, I encouraged not accumulating it in an average of averages in
> CPU load.
>

Many thanks for re-clarification!
> The problem is that your current approach has removed it both from
> load-balance _and_ from shares distribution; isolation matters as much
> as performance in the cgroup case (otherwise you would just not use
> cgroups). I would expect the latter to have quite negative effects on
> fairness, this is my primary concern.
>

So the argument is just about the patch 'sched/tg: remove blocked_load_avg in balance'. :)

I understand your correctness concern, but blocked_load_avg will still be decayed to zero within a few hundred ms, so that correctness only matters for a few hundred ms (and causes a performance drop).
The blocked_load_avg is decayed at the same rate as the runnable load, so it is a bit overweighted once a task has slept, since the task may be woken up on another cpu. To relieve this overweight, could we use half or a quarter of the weight of blocked_load_avg? Like the following:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ddbc19f..395f57c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1358,7 +1358,7 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
struct task_group *tg = cfs_rq->tg;
s64 tg_contrib;

- tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
+ tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg / 2;
tg_contrib -= cfs_rq->tg_load_contrib;

if (force_update || abs64(tg_contrib) > cfs_rq->tg_load_contrib / 8) {

>> >
>> > http://www.mail-archive.com/[email protected]/msg455196.html
>> >
>> > Is it the time to do the decision or give more comments? Thanks!


--
Thanks
Alex

2013-06-25 13:13:26

by Alex Shi

Subject: Re: [Resend patch v8 0/13] use runnable load in schedule balance

On 06/24/2013 11:37 PM, Alex Shi wrote:
> On 06/24/2013 06:40 PM, Paul Turner wrote:
>>>> Ingo & Peter,
>>>>
>>>> This patchset was discussed spread and deeply.
>>>>
>>>> Now just 6th/8th patch has some arguments on them, Paul think it is
>>>> better to consider blocked_load_avg in balance, since it is helpful on
>>>> some scenarios, but I think on most of scenarios, the blocked_load_avg
>>>> just cause load imbalance among cpus. and plus testing show with
>>>> blocked_load_avg the performance is just worse on some benchmarks. So, I
>>>> still prefer to keep it out of balance.
>> I think you have perhaps misunderstood what I was trying to explain.
>>
>> I have no problems with not including blocked load in load-balance, in
>> fact, I encouraged not accumulating it in an average of averages in
>> CPU load.
>>
>
> Many thanks for re-clarification!
>> The problem is that your current approach has removed it both from
>> load-balance _and_ from shares distribution; isolation matters as much
>> as performance in the cgroup case (otherwise you would just not use
>> cgroups). I would expect the latter to have quite negative effects on
>> fairness, this is my primary concern.
>>
>
> So the argument is just on patch 'sched/tg: remove blocked_load_avg in balance'. :)
>
> I understand your correctness concern. but blocked_load_avg still will be decayed to zero in few hundreds ms. So such correctness needs just in few hundreds ms. (and cause performance drop)
> The blocked_load_avg is decayed on same degree as runnable load, it is a bit overweight since task slept. since it may will be waken up on other cpu. So to relieve this overweight, could we use the half or a quarter weight of blocked_load_avg? like following:
>

Ping to Paul!

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ddbc19f..395f57c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1358,7 +1358,7 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> struct task_group *tg = cfs_rq->tg;
> s64 tg_contrib;
>
> - tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
> + tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg / 2;
> tg_contrib -= cfs_rq->tg_load_contrib;
>
> if (force_update || abs64(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
>
>>>>
>>>> http://www.mail-archive.com/[email protected]/msg455196.html
>>>>
>>>> Is it the time to do the decision or give more comments? Thanks!
>
>


--
Thanks
Alex

2013-06-26 05:06:44

by Alex Shi

Subject: Re: [Resend patch v8 01/13] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"

On 06/20/2013 10:18 AM, Alex Shi wrote:
> Remove CONFIG_FAIR_GROUP_SCHED that covers the runnable info, then
> we can use runnable load variables.
>
> Signed-off-by: Alex Shi <[email protected]>

There are 2 more places that need to be reverted too; I merged them into the updated patch.
BTW, this patchset was tested by Fengguang's 0day kbuild system.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f404468..1a14209 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5858,7 +5858,7 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
se->vruntime -= cfs_rq->min_vruntime;
}

-#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+#ifdef CONFIG_SMP
/*
* Remove our load from contribution when we leave sched_fair
* and ensure we don't carry in an old decay_count if we
@@ -5917,7 +5917,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
#ifndef CONFIG_64BIT
cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
#endif
-#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+#ifdef CONFIG_SMP
atomic64_set(&cfs_rq->decay_counter, 1);
atomic64_set(&cfs_rq->removed_load, 0);
#endif
--

---

From 190a4d0c8fd6325c6141a9d1ae14518e521e4289 Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Sun, 3 Mar 2013 16:04:36 +0800
Subject: Revert "sched: Introduce temporary FAIR_GROUP_SCHED
dependency for load-tracking"

Remove the CONFIG_FAIR_GROUP_SCHED dependency that gates the runnable load
tracking info, so that we can use the runnable load variables.

Also remove 2 CONFIG_FAIR_GROUP_SCHED settings which are not in the reverted
patch (they were introduced in 9ee474f) but also need to be reverted.

Signed-off-by: Alex Shi <[email protected]>
---
include/linux/sched.h | 7 +------
kernel/sched/core.c | 7 +------
kernel/sched/fair.c | 17 ++++-------------
kernel/sched/sched.h | 10 ++--------
4 files changed, 8 insertions(+), 33 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 178a8d9..0019bef 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -994,12 +994,7 @@ struct sched_entity {
struct cfs_rq *my_q;
#endif

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/* Per-entity load-tracking */
struct sched_avg avg;
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 36f85be..b9e7036 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1598,12 +1598,7 @@ static void __sched_fork(struct task_struct *p)
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
p->se.avg.runnable_avg_period = 0;
p->se.avg.runnable_avg_sum = 0;
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ee1c2e..1a14209 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1128,8 +1128,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_FAIR_GROUP_SCHED */

-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/*
* We choose a half-life close to 1 scheduling period.
* Note: The tables below are dependent on this value.
@@ -3436,12 +3435,6 @@ unlock:
}

/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
* Called immediately before a task is migrated to a new cpu; task_cpu(p) and
* cfs_rq_of(p) references at time of call are still valid and identify the
* previous cpu. However, the caller only guarantees p->pi_lock is held; no
@@ -3464,7 +3457,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
}
}
-#endif
#endif /* CONFIG_SMP */

static unsigned long
@@ -5866,7 +5858,7 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
se->vruntime -= cfs_rq->min_vruntime;
}

-#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+#ifdef CONFIG_SMP
/*
* Remove our load from contribution when we leave sched_fair
* and ensure we don't carry in an old decay_count if we
@@ -5925,7 +5917,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
#ifndef CONFIG_64BIT
cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
#endif
-#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+#ifdef CONFIG_SMP
atomic64_set(&cfs_rq->decay_counter, 1);
atomic64_set(&cfs_rq->removed_load, 0);
#endif
@@ -6167,9 +6159,8 @@ const struct sched_class fair_sched_class = {

#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
.migrate_task_rq = migrate_task_rq_fair,
-#endif
+
.rq_online = rq_online_fair,
.rq_offline = rq_offline_fair,

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 74ff659..d892a9f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -269,12 +269,6 @@ struct cfs_rq {
#endif

#ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
/*
* CFS Load tracking
* Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -284,9 +278,9 @@ struct cfs_rq {
u64 runnable_load_avg, blocked_load_avg;
atomic64_t decay_counter, removed_load;
u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
+ /* Required to track per-cpu representation of a task_group */
u32 tg_runnable_contrib;
u64 tg_load_contrib;
#endif /* CONFIG_FAIR_GROUP_SCHED */
--
1.7.12

2013-06-26 15:05:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [Resend patch v8 0/13] use runnable load in schedule balance

On Thu, Jun 20, 2013 at 10:18:44AM +0800, Alex Shi wrote:
> Resend patchset for more convenient pick up.
> This patch set combine 'use runnable load in balance' serials and 'change
> 64bit variables to long type' serials. also collected Reviewed-bys, and
> Tested-bys.
>
> The only changed code is fixing load to load_avg convert in UP mode, which
> found by PeterZ in task_h_load().
>
> Paul still has some concern of blocked_load_avg out of balance consideration.
> but I didn't see the blocked_load_avg usage was thought through now, or some
> strong reason to make it into balance.
> So, according to benchmarks testing result I keep patches unchanged.
>
> [Resend patch v8 01/13] Revert "sched: Introduce temporary
> [Resend patch v8 02/13] sched: move few runnable tg variables into
> [Resend patch v8 03/13] sched: set initial value of runnable avg for
> [Resend patch v8 04/13] sched: fix slept time double counting in
> [Resend patch v8 05/13] sched: update cpu load after task_tick.
> [Resend patch v8 06/13] sched: compute runnable load avg in cpu_load
> [Resend patch v8 07/13] sched: consider runnable load average in
> [Resend patch v8 08/13] sched/tg: remove blocked_load_avg in balance
> [Resend patch v8 09/13] sched: change cfs_rq load avg to unsigned
> [Resend patch v8 10/13] sched/tg: use 'unsigned long' for load
> [Resend patch v8 11/13] sched/cfs_rq: change atomic64_t removed_load
> [Resend patch v8 12/13] sched/tg: remove tg.load_weight
> [Resend patch v8 13/13] sched: get_rq_runnable_load() can be static

I applied all except patch 8.

2013-06-26 15:23:22

by Alex Shi

[permalink] [raw]
Subject: Re: [Resend patch v8 0/13] use runnable load in schedule balance


>> >
>> > [Resend patch v8 01/13] Revert "sched: Introduce temporary
>> > [Resend patch v8 02/13] sched: move few runnable tg variables into
>> > [Resend patch v8 03/13] sched: set initial value of runnable avg for
>> > [Resend patch v8 04/13] sched: fix slept time double counting in
>> > [Resend patch v8 05/13] sched: update cpu load after task_tick.
>> > [Resend patch v8 06/13] sched: compute runnable load avg in cpu_load
>> > [Resend patch v8 07/13] sched: consider runnable load average in
>> > [Resend patch v8 08/13] sched/tg: remove blocked_load_avg in balance
>> > [Resend patch v8 09/13] sched: change cfs_rq load avg to unsigned
>> > [Resend patch v8 10/13] sched/tg: use 'unsigned long' for load
>> > [Resend patch v8 11/13] sched/cfs_rq: change atomic64_t removed_load
>> > [Resend patch v8 12/13] sched/tg: remove tg.load_weight
>> > [Resend patch v8 13/13] sched: get_rq_runnable_load() can be static
> I applied all except patch 8.

Uh, I will reconsider the 8th patch later. And many thanks for
picking up the others! :)

--
Thanks
Alex

2013-06-26 20:30:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Resend patch v8 01/13] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"


* Alex Shi <[email protected]> wrote:

> On 06/20/2013 10:18 AM, Alex Shi wrote:
> > Remove CONFIG_FAIR_GROUP_SCHED that covers the runnable info, then
> > we can use runnable load variables.
> >
> > Signed-off-by: Alex Shi <[email protected]>
>
> There are 2 more places that need to be reverted too; I merged them into the updated patch.
> BTW, this patchset was tested by Fengguang's 0day kbuild system.

still fails to build with the attached config:

kernel/sched/fair.c:1620:6: error: redefinition of ‘idle_enter_fair’
In file included from kernel/sched/fair.c:35:0:
kernel/sched/sched.h:1034:20: note: previous definition of ‘idle_enter_fair’ was here
kernel/sched/fair.c:1630:6: error: redefinition of ‘idle_exit_fair’
In file included from kernel/sched/fair.c:35:0:
kernel/sched/sched.h:1035:20: note: previous definition of ‘idle_exit_fair’ was here

Thanks,

Ingo


Attachments:
config (48.99 kB)

2013-06-27 01:07:36

by Alex Shi

[permalink] [raw]
Subject: Re: [Resend patch v8 01/13] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"

On 06/27/2013 04:30 AM, Ingo Molnar wrote:

> kernel/sched/fair.c:1620:6: error: redefinition of ‘idle_enter_fair’
> In file included from kernel/sched/fair.c:35:0:
> kernel/sched/sched.h:1034:20: note: previous definition of ‘idle_enter_fair’ was here
> kernel/sched/fair.c:1630:6: error: redefinition of ‘idle_exit_fair’
> In file included from kernel/sched/fair.c:35:0:
> kernel/sched/sched.h:1035:20: note: previous definition of ‘idle_exit_fair’ was here
>
> Thanks,
>
> Ingo
>

My fault! That also needs to be reverted in kernel/sched/sched.h:

@@ -1028,17 +1022,8 @@ extern void update_group_power(struct sched_domain *sd, int cpu);
extern void trigger_load_balance(struct rq *rq, int cpu);
extern void idle_balance(int this_cpu, struct rq *this_rq);

-/*
- * Only depends on SMP, FAIR_GROUP_SCHED may be removed when runnable_avg
- * becomes useful in lb
- */
-#if defined(CONFIG_FAIR_GROUP_SCHED)
extern void idle_enter_fair(struct rq *this_rq);
extern void idle_exit_fair(struct rq *this_rq);
-#else
-static inline void idle_enter_fair(struct rq *this_rq) {}
-static inline void idle_exit_fair(struct rq *this_rq) {}
-#endif

Now, the updated patch is here:

---

From caac05dd984ba52af1d7bde47402ef795f8bc060 Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Sun, 3 Mar 2013 16:04:36 +0800
Subject: [PATCH 01/14] Revert "sched: Introduce temporary FAIR_GROUP_SCHED
dependency for load-tracking"

Remove CONFIG_FAIR_GROUP_SCHED that covers the runnable info, then
we can use runnable load variables.

Also remove 3 extra CONFIG_FAIR_GROUP_SCHED settings introduced in commit
9ee474f and commit 642dbc39a.

Signed-off-by: Alex Shi <[email protected]>
---
include/linux/sched.h | 7 +------
kernel/sched/core.c | 7 +------
kernel/sched/fair.c | 17 ++++-------------
kernel/sched/sched.h | 19 ++-----------------
4 files changed, 8 insertions(+), 42 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 178a8d9..0019bef 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -994,12 +994,7 @@ struct sched_entity {
struct cfs_rq *my_q;
#endif

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/* Per-entity load-tracking */
struct sched_avg avg;
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 36f85be..b9e7036 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1598,12 +1598,7 @@ static void __sched_fork(struct task_struct *p)
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
p->se.avg.runnable_avg_period = 0;
p->se.avg.runnable_avg_sum = 0;
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ee1c2e..1a14209 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1128,8 +1128,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_FAIR_GROUP_SCHED */

-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/*
* We choose a half-life close to 1 scheduling period.
* Note: The tables below are dependent on this value.
@@ -3436,12 +3435,6 @@ unlock:
}

/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
* Called immediately before a task is migrated to a new cpu; task_cpu(p) and
* cfs_rq_of(p) references at time of call are still valid and identify the
* previous cpu. However, the caller only guarantees p->pi_lock is held; no
@@ -3464,7 +3457,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
}
}
-#endif
#endif /* CONFIG_SMP */

static unsigned long
@@ -5866,7 +5858,7 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
se->vruntime -= cfs_rq->min_vruntime;
}

-#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+#ifdef CONFIG_SMP
/*
* Remove our load from contribution when we leave sched_fair
* and ensure we don't carry in an old decay_count if we
@@ -5925,7 +5917,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
#ifndef CONFIG_64BIT
cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
#endif
-#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+#ifdef CONFIG_SMP
atomic64_set(&cfs_rq->decay_counter, 1);
atomic64_set(&cfs_rq->removed_load, 0);
#endif
@@ -6167,9 +6159,8 @@ const struct sched_class fair_sched_class = {

#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
.migrate_task_rq = migrate_task_rq_fair,
-#endif
+
.rq_online = rq_online_fair,
.rq_offline = rq_offline_fair,

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 74ff659..9bc3994 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -269,12 +269,6 @@ struct cfs_rq {
#endif

#ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
/*
* CFS Load tracking
* Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -284,9 +278,9 @@ struct cfs_rq {
u64 runnable_load_avg, blocked_load_avg;
atomic64_t decay_counter, removed_load;
u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
+ /* Required to track per-cpu representation of a task_group */
u32 tg_runnable_contrib;
u64 tg_load_contrib;
#endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -1028,17 +1022,8 @@ extern void update_group_power(struct sched_domain *sd, int cpu);
extern void trigger_load_balance(struct rq *rq, int cpu);
extern void idle_balance(int this_cpu, struct rq *this_rq);

-/*
- * Only depends on SMP, FAIR_GROUP_SCHED may be removed when runnable_avg
- * becomes useful in lb
- */
-#if defined(CONFIG_FAIR_GROUP_SCHED)
extern void idle_enter_fair(struct rq *this_rq);
extern void idle_exit_fair(struct rq *this_rq);
-#else
-static inline void idle_enter_fair(struct rq *this_rq) {}
-static inline void idle_exit_fair(struct rq *this_rq) {}
-#endif

#else /* CONFIG_SMP */

--
1.7.5.4

Subject: [tip:sched/core] sched: Move a few runnable tg variables into CONFIG_SMP

Commit-ID: fa6bddeb14d59d701f846b174b59c9982e926e66
Gitweb: http://git.kernel.org/tip/fa6bddeb14d59d701f846b174b59c9982e926e66
Author: Alex Shi <[email protected]>
AuthorDate: Thu, 20 Jun 2013 10:18:46 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 27 Jun 2013 10:07:29 +0200

sched: Move a few runnable tg variables into CONFIG_SMP

The following 2 variables are only used under CONFIG_SMP, so it's
better to move their definitions under CONFIG_SMP too.

atomic64_t load_avg;
atomic_t runnable_avg;

Signed-off-by: Alex Shi <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/sched.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 77ce668..31d25f8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -149,9 +149,11 @@ struct task_group {
unsigned long shares;

atomic_t load_weight;
+#ifdef CONFIG_SMP
atomic64_t load_avg;
atomic_t runnable_avg;
#endif
+#endif

#ifdef CONFIG_RT_GROUP_SCHED
struct sched_rt_entity **rt_se;

Subject: [tip:sched/core] sched: Fix sleep time double accounting in enqueue entity

Commit-ID: 282cf499f03ec1754b6c8c945c9674b02631fb0f
Gitweb: http://git.kernel.org/tip/282cf499f03ec1754b6c8c945c9674b02631fb0f
Author: Alex Shi <[email protected]>
AuthorDate: Thu, 20 Jun 2013 10:18:48 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 27 Jun 2013 10:07:32 +0200

sched: Fix sleep time double accounting in enqueue entity

A woken migrated task will call __synchronize_entity_decay(se) in
migrate_task_rq_fair(); it then needs to set
`se->avg.last_runnable_update -= (-se->avg.decay_count) << 20' before
update_entity_load_avg(), in order to avoid the sleep time being counted
twice for se.avg.load_avg_contrib, in both __synchronize_entity_decay()
and update_entity_load_avg().

However, if the sleeping task is woken up on the same cpu, it misses the
last_runnable_update adjustment before update_entity_load_avg(se, 0, 1),
so the sleep time is used twice by both functions. So we need to remove
the double sleep-time accounting.

Paul also contributed some code comments in this commit.

Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Paul Turner <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e1602a0..9bbc303 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1571,7 +1571,13 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
}
wakeup = 0;
} else {
- __synchronize_entity_decay(se);
+ /*
+ * Task re-woke on same cpu (or else migrate_task_rq_fair()
+ * would have made count negative); we must be careful to avoid
+ * double-accounting blocked time after synchronizing decays.
+ */
+ se->avg.last_runnable_update += __synchronize_entity_decay(se)
+ << 20;
}

/* migrated tasks did not contribute to our blocked load */

Subject: [tip:sched/core] sched: Compute runnable load avg in cpu_load and cpu_avg_load_per_task

Commit-ID: b92486cbf2aa230d00f160664858495c81d2b37b
Gitweb: http://git.kernel.org/tip/b92486cbf2aa230d00f160664858495c81d2b37b
Author: Alex Shi <[email protected]>
AuthorDate: Thu, 20 Jun 2013 10:18:50 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 27 Jun 2013 10:07:35 +0200

sched: Compute runnable load avg in cpu_load and cpu_avg_load_per_task

These are the base values used in load balancing; update them with the rq
runnable load average so that load balancing naturally considers the
runnable load avg.

We also tried to include blocked_load_avg as cpu load in balancing, but
that caused a 6% kbuild performance drop on every Intel machine, plus
aim7/oltp drops on some 4-socket machines. Adding blocked_load_avg only
into get_rq_runnable_load() still made hackbench drop a little on NHM EX.

Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Gu Zheng <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 5 +++--
kernel/sched/proc.c | 17 +++++++++++++++--
2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9bbc303..e6d82ca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2963,7 +2963,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* Used instead of source_load when we know the type == 0 */
static unsigned long weighted_cpuload(const int cpu)
{
- return cpu_rq(cpu)->load.weight;
+ return cpu_rq(cpu)->cfs.runnable_load_avg;
}

/*
@@ -3008,9 +3008,10 @@ static unsigned long cpu_avg_load_per_task(int cpu)
{
struct rq *rq = cpu_rq(cpu);
unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
+ unsigned long load_avg = rq->cfs.runnable_load_avg;

if (nr_running)
- return rq->load.weight / nr_running;
+ return load_avg / nr_running;

return 0;
}
diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
index bb3a6a0..ce5cd48 100644
--- a/kernel/sched/proc.c
+++ b/kernel/sched/proc.c
@@ -501,6 +501,18 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
sched_avg_update(this_rq);
}

+#ifdef CONFIG_SMP
+unsigned long get_rq_runnable_load(struct rq *rq)
+{
+ return rq->cfs.runnable_load_avg;
+}
+#else
+unsigned long get_rq_runnable_load(struct rq *rq)
+{
+ return rq->load.weight;
+}
+#endif
+
#ifdef CONFIG_NO_HZ_COMMON
/*
* There is no sane way to deal with nohz on smp when using jiffies because the
@@ -522,7 +534,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
void update_idle_cpu_load(struct rq *this_rq)
{
unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
- unsigned long load = this_rq->load.weight;
+ unsigned long load = get_rq_runnable_load(this_rq);
unsigned long pending_updates;

/*
@@ -568,11 +580,12 @@ void update_cpu_load_nohz(void)
*/
void update_cpu_load_active(struct rq *this_rq)
{
+ unsigned long load = get_rq_runnable_load(this_rq);
/*
* See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
*/
this_rq->last_load_update_tick = jiffies;
- __update_cpu_load(this_rq, this_rq->load.weight, 1);
+ __update_cpu_load(this_rq, load, 1);

calc_load_account_active(this_rq);
}

Subject: [tip:sched/core] sched: Consider runnable load average in move_tasks()

Commit-ID: a003a25b227d59ded9197ced109517f037d01c27
Gitweb: http://git.kernel.org/tip/a003a25b227d59ded9197ced109517f037d01c27
Author: Alex Shi <[email protected]>
AuthorDate: Thu, 20 Jun 2013 10:18:51 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 27 Jun 2013 10:07:36 +0200

sched: Consider runnable load average in move_tasks()

Aside from using the runnable load average in the background, move_tasks()
is also the key function in load balancing. We need to consider the
runnable load average in it as well, in order to make an apples-to-apples
load comparison.

Morten caught a u64 division bug on ARM, thanks!

Thanks-to: Morten Rasmussen <[email protected]>
Signed-off-by: Alex Shi <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e6d82ca..7948bb8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4179,11 +4179,14 @@ static int tg_load_down(struct task_group *tg, void *data)
long cpu = (long)data;

if (!tg->parent) {
- load = cpu_rq(cpu)->load.weight;
+ load = cpu_rq(cpu)->avg.load_avg_contrib;
} else {
+ unsigned long tmp_rla;
+ tmp_rla = tg->parent->cfs_rq[cpu]->runnable_load_avg + 1;
+
load = tg->parent->cfs_rq[cpu]->h_load;
- load *= tg->se[cpu]->load.weight;
- load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
+ load *= tg->se[cpu]->avg.load_avg_contrib;
+ load /= tmp_rla;
}

tg->cfs_rq[cpu]->h_load = load;
@@ -4209,12 +4212,9 @@ static void update_h_load(long cpu)
static unsigned long task_h_load(struct task_struct *p)
{
struct cfs_rq *cfs_rq = task_cfs_rq(p);
- unsigned long load;
-
- load = p->se.load.weight;
- load = div_u64(load * cfs_rq->h_load, cfs_rq->load.weight + 1);

- return load;
+ return div64_ul(p->se.avg.load_avg_contrib * cfs_rq->h_load,
+ cfs_rq->runnable_load_avg + 1);
}
#else
static inline void update_blocked_averages(int cpu)
@@ -4227,7 +4227,7 @@ static inline void update_h_load(long cpu)

static unsigned long task_h_load(struct task_struct *p)
{
- return p->se.load.weight;
+ return p->se.avg.load_avg_contrib;
}
#endif

Subject: [tip:sched/core] sched: Change cfs_rq load avg to unsigned long

Commit-ID: 72a4cf20cb71a327c636c7042fdacc25abffc87c
Gitweb: http://git.kernel.org/tip/72a4cf20cb71a327c636c7042fdacc25abffc87c
Author: Alex Shi <[email protected]>
AuthorDate: Thu, 20 Jun 2013 10:18:53 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 27 Jun 2013 10:07:38 +0200

sched: Change cfs_rq load avg to unsigned long

Since 'u64 runnable_load_avg, blocked_load_avg' in struct cfs_rq are
smaller than the 'unsigned long' cfs_rq->load.weight, we don't need u64
variables to describe them. unsigned long is more efficient and convenient.

Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Paul Turner <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/debug.c | 4 ++--
kernel/sched/fair.c | 7 ++-----
kernel/sched/sched.h | 2 +-
3 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 75024a6..160afdc 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -211,9 +211,9 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
SEQ_printf(m, " .%-30s: %ld\n", "load", cfs_rq->load.weight);
#ifdef CONFIG_FAIR_GROUP_SCHED
#ifdef CONFIG_SMP
- SEQ_printf(m, " .%-30s: %lld\n", "runnable_load_avg",
+ SEQ_printf(m, " .%-30s: %ld\n", "runnable_load_avg",
cfs_rq->runnable_load_avg);
- SEQ_printf(m, " .%-30s: %lld\n", "blocked_load_avg",
+ SEQ_printf(m, " .%-30s: %ld\n", "blocked_load_avg",
cfs_rq->blocked_load_avg);
SEQ_printf(m, " .%-30s: %lld\n", "tg_load_avg",
(unsigned long long)atomic64_read(&cfs_rq->tg->load_avg));
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7948bb8..f19772d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4181,12 +4181,9 @@ static int tg_load_down(struct task_group *tg, void *data)
if (!tg->parent) {
load = cpu_rq(cpu)->avg.load_avg_contrib;
} else {
- unsigned long tmp_rla;
- tmp_rla = tg->parent->cfs_rq[cpu]->runnable_load_avg + 1;
-
load = tg->parent->cfs_rq[cpu]->h_load;
- load *= tg->se[cpu]->avg.load_avg_contrib;
- load /= tmp_rla;
+ load = div64_ul(load * tg->se[cpu]->avg.load_avg_contrib,
+ tg->parent->cfs_rq[cpu]->runnable_load_avg + 1);
}

tg->cfs_rq[cpu]->h_load = load;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9c65d46..9eb12d9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -277,7 +277,7 @@ struct cfs_rq {
* This allows for the description of both thread and group usage (in
* the FAIR_GROUP_SCHED case).
*/
- u64 runnable_load_avg, blocked_load_avg;
+ unsigned long runnable_load_avg, blocked_load_avg;
atomic64_t decay_counter, removed_load;
u64 last_decay;

Subject: [tip:sched/core] sched/tg: Use 'unsigned long' for load variable in task group

Commit-ID: bf5b986ed4d20428eeec3df4a03dbfebb9b6538c
Gitweb: http://git.kernel.org/tip/bf5b986ed4d20428eeec3df4a03dbfebb9b6538c
Author: Alex Shi <[email protected]>
AuthorDate: Thu, 20 Jun 2013 10:18:54 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 27 Jun 2013 10:07:40 +0200

sched/tg: Use 'unsigned long' for load variable in task group

Since tg->load_avg is smaller than tg->load_weight, we don't need an
atomic64_t variable for load_avg on a 32-bit machine.
The same reasoning applies to cfs_rq->tg_load_contrib.

The atomic_long_t/unsigned long variable types are more efficient and
convenient for them.

Signed-off-by: Alex Shi <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/debug.c | 6 +++---
kernel/sched/fair.c | 12 ++++++------
kernel/sched/sched.h | 4 ++--
3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 160afdc..d803989 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -215,9 +215,9 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->runnable_load_avg);
SEQ_printf(m, " .%-30s: %ld\n", "blocked_load_avg",
cfs_rq->blocked_load_avg);
- SEQ_printf(m, " .%-30s: %lld\n", "tg_load_avg",
- (unsigned long long)atomic64_read(&cfs_rq->tg->load_avg));
- SEQ_printf(m, " .%-30s: %lld\n", "tg_load_contrib",
+ SEQ_printf(m, " .%-30s: %ld\n", "tg_load_avg",
+ atomic_long_read(&cfs_rq->tg->load_avg));
+ SEQ_printf(m, " .%-30s: %ld\n", "tg_load_contrib",
cfs_rq->tg_load_contrib);
SEQ_printf(m, " .%-30s: %d\n", "tg_runnable_contrib",
cfs_rq->tg_runnable_contrib);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f19772d..30ccc37 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1075,7 +1075,7 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
* to gain a more accurate current total weight. See
* update_cfs_rq_load_contribution().
*/
- tg_weight = atomic64_read(&tg->load_avg);
+ tg_weight = atomic_long_read(&tg->load_avg);
tg_weight -= cfs_rq->tg_load_contrib;
tg_weight += cfs_rq->load.weight;

@@ -1356,13 +1356,13 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
int force_update)
{
struct task_group *tg = cfs_rq->tg;
- s64 tg_contrib;
+ long tg_contrib;

tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
tg_contrib -= cfs_rq->tg_load_contrib;

- if (force_update || abs64(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
- atomic64_add(tg_contrib, &tg->load_avg);
+ if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
+ atomic_long_add(tg_contrib, &tg->load_avg);
cfs_rq->tg_load_contrib += tg_contrib;
}
}
@@ -1397,8 +1397,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
u64 contrib;

contrib = cfs_rq->tg_load_contrib * tg->shares;
- se->avg.load_avg_contrib = div64_u64(contrib,
- atomic64_read(&tg->load_avg) + 1);
+ se->avg.load_avg_contrib = div_u64(contrib,
+ atomic_long_read(&tg->load_avg) + 1);

/*
* For group entities we need to compute a correction term in the case
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9eb12d9..5585eb2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -150,7 +150,7 @@ struct task_group {

atomic_t load_weight;
#ifdef CONFIG_SMP
- atomic64_t load_avg;
+ atomic_long_t load_avg;
atomic_t runnable_avg;
#endif
#endif
@@ -284,7 +284,7 @@ struct cfs_rq {
#ifdef CONFIG_FAIR_GROUP_SCHED
/* Required to track per-cpu representation of a task_group */
u32 tg_runnable_contrib;
- u64 tg_load_contrib;
+ unsigned long tg_load_contrib;
#endif /* CONFIG_FAIR_GROUP_SCHED */

/*

Subject: [tip:sched/core] sched/cfs_rq: Change atomic64_t removed_load to atomic_long_t

Commit-ID: 2509940fd71c2e2915a05052bbdbf2d478364184
Gitweb: http://git.kernel.org/tip/2509940fd71c2e2915a05052bbdbf2d478364184
Author: Alex Shi <[email protected]>
AuthorDate: Thu, 20 Jun 2013 10:18:55 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 27 Jun 2013 10:07:41 +0200

sched/cfs_rq: Change atomic64_t removed_load to atomic_long_t

Similar to the runnable_load_avg and blocked_load_avg variables, a long
type is enough for removed_load on both 64-bit and 32-bit machines.

This avoids the expensive atomic64 operations on 32-bit machines.

Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Paul Turner <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 10 ++++++----
kernel/sched/sched.h | 3 ++-
2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 30ccc37..b43474a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1517,8 +1517,9 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
if (!decays && !force_update)
return;

- if (atomic64_read(&cfs_rq->removed_load)) {
- u64 removed_load = atomic64_xchg(&cfs_rq->removed_load, 0);
+ if (atomic_long_read(&cfs_rq->removed_load)) {
+ unsigned long removed_load;
+ removed_load = atomic_long_xchg(&cfs_rq->removed_load, 0);
subtract_blocked_load_contrib(cfs_rq, removed_load);
}

@@ -3480,7 +3481,8 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
*/
if (se->avg.decay_count) {
se->avg.decay_count = -__synchronize_entity_decay(se);
- atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
+ atomic_long_add(se->avg.load_avg_contrib,
+ &cfs_rq->removed_load);
}
}
#endif /* CONFIG_SMP */
@@ -5942,7 +5944,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
#endif
#ifdef CONFIG_SMP
atomic64_set(&cfs_rq->decay_counter, 1);
- atomic64_set(&cfs_rq->removed_load, 0);
+ atomic_long_set(&cfs_rq->removed_load, 0);
#endif
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5585eb2..7059919 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -278,8 +278,9 @@ struct cfs_rq {
* the FAIR_GROUP_SCHED case).
*/
unsigned long runnable_load_avg, blocked_load_avg;
- atomic64_t decay_counter, removed_load;
+ atomic64_t decay_counter;
u64 last_decay;
+ atomic_long_t removed_load;

#ifdef CONFIG_FAIR_GROUP_SCHED
/* Required to track per-cpu representation of a task_group */

Subject: [tip:sched/core] sched/tg: Remove tg.load_weight

Commit-ID: a9cef46a10cc1b84bf2cdf4060766d858c0439d8
Gitweb: http://git.kernel.org/tip/a9cef46a10cc1b84bf2cdf4060766d858c0439d8
Author: Alex Shi <[email protected]>
AuthorDate: Thu, 20 Jun 2013 10:18:56 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 27 Jun 2013 10:07:43 +0200

sched/tg: Remove tg.load_weight

Since no one uses it.

Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Paul Turner <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/sched.h | 1 -
1 file changed, 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7059919..ef0a7b2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -148,7 +148,6 @@ struct task_group {
struct cfs_rq **cfs_rq;
unsigned long shares;

- atomic_t load_weight;
#ifdef CONFIG_SMP
atomic_long_t load_avg;
atomic_t runnable_avg;

Subject: [tip:sched/core] sched: Change get_rq_runnable_load() to static and inline

Commit-ID: a9dc5d0e33c677619e4b97a38c23db1a42857905
Gitweb: http://git.kernel.org/tip/a9dc5d0e33c677619e4b97a38c23db1a42857905
Author: Alex Shi <[email protected]>
AuthorDate: Thu, 20 Jun 2013 10:18:57 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 27 Jun 2013 10:07:44 +0200

sched: Change get_rq_runnable_load() to static and inline

Based-on-patch-by: Fengguang Wu <[email protected]>
Signed-off-by: Alex Shi <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/proc.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
index ce5cd48..16f5a30 100644
--- a/kernel/sched/proc.c
+++ b/kernel/sched/proc.c
@@ -502,12 +502,12 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
}

#ifdef CONFIG_SMP
-unsigned long get_rq_runnable_load(struct rq *rq)
+static inline unsigned long get_rq_runnable_load(struct rq *rq)
{
return rq->cfs.runnable_load_avg;
}
#else
-unsigned long get_rq_runnable_load(struct rq *rq)
+static inline unsigned long get_rq_runnable_load(struct rq *rq)
{
return rq->load.weight;
}

Subject: [tip:sched/core] sched: Update cpu load after task_tick

Commit-ID: 83dfd5235ebd66c284b97befe6eabff7132333e6
Gitweb: http://git.kernel.org/tip/83dfd5235ebd66c284b97befe6eabff7132333e6
Author: Alex Shi <[email protected]>
AuthorDate: Thu, 20 Jun 2013 10:18:49 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 27 Jun 2013 10:07:33 +0200

sched: Update cpu load after task_tick

To get the latest runnable info, we need to do this cpu load update after
task_tick().

Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Paul Turner <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 729e7fc..08746cc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2165,8 +2165,8 @@ void scheduler_tick(void)

raw_spin_lock(&rq->lock);
update_rq_clock(rq);
- update_cpu_load_active(rq);
curr->sched_class->task_tick(rq, curr, 0);
+ update_cpu_load_active(rq);
raw_spin_unlock(&rq->lock);

perf_event_task_tick();

Subject: [tip:sched/core] sched: Set an initial value of runnable avg for new forked task

Commit-ID: a75cdaa915e42ef0e6f38dc7f2a6a1deca91d648
Gitweb: http://git.kernel.org/tip/a75cdaa915e42ef0e6f38dc7f2a6a1deca91d648
Author: Alex Shi <[email protected]>
AuthorDate: Thu, 20 Jun 2013 10:18:47 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 27 Jun 2013 10:07:30 +0200

sched: Set an initial value of runnable avg for new forked task

We need to initialize se.avg.{decay_count, load_avg_contrib} for a
newly forked task. Otherwise, random values in those fields cause a
mess when the new task is enqueued:

enqueue_task_fair
enqueue_entity
enqueue_entity_load_avg

and cause a fork-balancing imbalance due to an incorrect load_avg_contrib.

Furthermore, Morten Rasmussen noticed that some tasks were not launched
immediately after being created. So Paul and Peter suggested giving a new
task's runnable avg time a start value equal to sched_slice().

PeterZ said:

> So the 'problem' is that our running avg is a 'floating' average; ie. it
> decays with time. Now we have to guess about the future of our newly
> spawned task -- something that is nigh impossible seeing these CPU
> vendors keep refusing to implement the crystal ball instruction.
>
> So there's two asymptotic cases we want to deal well with; 1) the case
> where the newly spawned program will be 'nearly' idle for its lifetime;
> and 2) the case where its cpu-bound.
>
> Since we have to guess, we'll go for worst case and assume its
> cpu-bound; now we don't want to make the avg so heavy adjusting to the
> near-idle case takes forever. We want to be able to quickly adjust and
> lower our running avg.
>
> Now we also don't want to make our avg too light, such that it gets
> decremented just for the new task not having had a chance to run yet --
> even if when it would run, it would be more cpu-bound than not.
>
> So what we do is we make the initial avg of the same duration as that we
> guess it takes to run each task on the system at least once -- aka
> sched_slice().
>
> Of course we can defeat this with wakeup/fork bombs, but in the 'normal'
> case it should be good enough.

Paul also contributed most of the code comments in this commit.

Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Gu Zheng <[email protected]>
Reviewed-by: Paul Turner <[email protected]>
[peterz; added explanation of sched_slice() usage]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 6 ++----
kernel/sched/fair.c | 24 ++++++++++++++++++++++++
kernel/sched/sched.h | 2 ++
3 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0241b1b..729e7fc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1611,10 +1611,6 @@ static void __sched_fork(struct task_struct *p)
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);

-#ifdef CONFIG_SMP
- p->se.avg.runnable_avg_period = 0;
- p->se.avg.runnable_avg_sum = 0;
-#endif
#ifdef CONFIG_SCHEDSTATS
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
#endif
@@ -1758,6 +1754,8 @@ void wake_up_new_task(struct task_struct *p)
set_task_cpu(p, select_task_rq(p, SD_BALANCE_FORK, 0));
#endif

+ /* Initialize new task's runnable average */
+ init_task_runnable_average(p);
rq = __task_rq_lock(p);
activate_task(rq, p, 0);
p->on_rq = 1;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 36eadaa..e1602a0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -680,6 +680,26 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
return calc_delta_fair(sched_slice(cfs_rq, se), se);
}

+#ifdef CONFIG_SMP
+static inline void __update_task_entity_contrib(struct sched_entity *se);
+
+/* Give new task start runnable values to heavy its load in infant time */
+void init_task_runnable_average(struct task_struct *p)
+{
+ u32 slice;
+
+ p->se.avg.decay_count = 0;
+ slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
+ p->se.avg.runnable_avg_sum = slice;
+ p->se.avg.runnable_avg_period = slice;
+ __update_task_entity_contrib(&p->se);
+}
+#else
+void init_task_runnable_average(struct task_struct *p)
+{
+}
+#endif
+
/*
* Update the current task's runtime statistics. Skip current tasks that
* are not in our scheduling class.
@@ -1527,6 +1547,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
* We track migrations using entity decay_count <= 0, on a wake-up
* migration we use a negative decay count to track the remote decays
* accumulated while sleeping.
+ *
+ * Newly forked tasks are enqueued with se->avg.decay_count == 0, they
+ * are seen by enqueue_entity_load_avg() as a migration with an already
+ * constructed load_avg_contrib.
*/
if (unlikely(se->avg.decay_count <= 0)) {
se->avg.last_runnable_update = rq_clock_task(rq_of(cfs_rq));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 31d25f8..9c65d46 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1048,6 +1048,8 @@ extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime

extern void update_idle_cpu_load(struct rq *this_rq);

+extern void init_task_runnable_average(struct task_struct *p);
+
#ifdef CONFIG_PARAVIRT
static inline u64 steal_ticks(u64 steal)
{

Subject: [tip:sched/core] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"

Commit-ID: 141965c7494d984b2bf24efd361a3125278869c6
Gitweb: http://git.kernel.org/tip/141965c7494d984b2bf24efd361a3125278869c6
Author: Alex Shi <[email protected]>
AuthorDate: Wed, 26 Jun 2013 13:05:39 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 27 Jun 2013 10:07:22 +0200

Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"

Remove CONFIG_FAIR_GROUP_SCHED that covers the runnable info, then
we can use runnable load variables.

Also remove 2 CONFIG_FAIR_GROUP_SCHED settings which are not in the reverted
patch (they were introduced in 9ee474f), but also need to be reverted.

Signed-off-by: Alex Shi <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched.h | 7 +------
kernel/sched/core.c | 7 +------
kernel/sched/fair.c | 17 ++++-------------
kernel/sched/sched.h | 19 ++-----------------
4 files changed, 8 insertions(+), 42 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 178a8d9..0019bef 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -994,12 +994,7 @@ struct sched_entity {
struct cfs_rq *my_q;
#endif

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/* Per-entity load-tracking */
struct sched_avg avg;
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ceeaf0f..0241b1b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1611,12 +1611,7 @@ static void __sched_fork(struct task_struct *p)
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
p->se.avg.runnable_avg_period = 0;
p->se.avg.runnable_avg_sum = 0;
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c0ac2c3..36eadaa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1128,8 +1128,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_FAIR_GROUP_SCHED */

-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/*
* We choose a half-life close to 1 scheduling period.
* Note: The tables below are dependent on this value.
@@ -3431,12 +3430,6 @@ unlock:
}

/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
* Called immediately before a task is migrated to a new cpu; task_cpu(p) and
* cfs_rq_of(p) references at time of call are still valid and identify the
* previous cpu. However, the caller only guarantees p->pi_lock is held; no
@@ -3459,7 +3452,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
}
}
-#endif
#endif /* CONFIG_SMP */

static unsigned long
@@ -5861,7 +5853,7 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
se->vruntime -= cfs_rq->min_vruntime;
}

-#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+#ifdef CONFIG_SMP
/*
* Remove our load from contribution when we leave sched_fair
* and ensure we don't carry in an old decay_count if we
@@ -5920,7 +5912,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
#ifndef CONFIG_64BIT
cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
#endif
-#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+#ifdef CONFIG_SMP
atomic64_set(&cfs_rq->decay_counter, 1);
atomic64_set(&cfs_rq->removed_load, 0);
#endif
@@ -6162,9 +6154,8 @@ const struct sched_class fair_sched_class = {

#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
.migrate_task_rq = migrate_task_rq_fair,
-#endif
+
.rq_online = rq_online_fair,
.rq_offline = rq_offline_fair,

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 029601a..77ce668 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -269,12 +269,6 @@ struct cfs_rq {
#endif

#ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
/*
* CFS Load tracking
* Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -284,9 +278,9 @@ struct cfs_rq {
u64 runnable_load_avg, blocked_load_avg;
atomic64_t decay_counter, removed_load;
u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
+ /* Required to track per-cpu representation of a task_group */
u32 tg_runnable_contrib;
u64 tg_load_contrib;
#endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -1027,17 +1021,8 @@ extern void update_group_power(struct sched_domain *sd, int cpu);
extern void trigger_load_balance(struct rq *rq, int cpu);
extern void idle_balance(int this_cpu, struct rq *this_rq);

-/*
- * Only depends on SMP, FAIR_GROUP_SCHED may be removed when runnable_avg
- * becomes useful in lb
- */
-#if defined(CONFIG_FAIR_GROUP_SCHED)
extern void idle_enter_fair(struct rq *this_rq);
extern void idle_exit_fair(struct rq *this_rq);
-#else
-static inline void idle_enter_fair(struct rq *this_rq) {}
-static inline void idle_exit_fair(struct rq *this_rq) {}
-#endif

#else /* CONFIG_SMP */

2013-06-27 13:30:35

by Alex Shi

[permalink] [raw]
Subject: Re: [Resend patch v8 06/13] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

On 06/20/2013 10:18 AM, Alex Shi wrote:
> These are the base values used in load balancing; update them with the rq
> runnable load average so that load balancing naturally considers the
> runnable load avg.
>
> We also tried to include blocked_load_avg as cpu load in balancing, but
> that caused a 6% kbuild performance drop on every Intel machine, plus
> aim7/oltp drops on some 4-socket machines. Adding blocked_load_avg only
> into get_rq_runnable_load() still made hackbench drop a little on NHM EX.
>
> Signed-off-by: Alex Shi <[email protected]>
> Reviewed-by: Gu Zheng <[email protected]>

Paul, do you mind adding a Reviewed-by or Acked-by for this patch?

> ---
> kernel/sched/fair.c | 5 +++--
> kernel/sched/proc.c | 17 +++++++++++++++--
> 2 files changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1e5a5e6..7d5c477 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2968,7 +2968,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> /* Used instead of source_load when we know the type == 0 */
> static unsigned long weighted_cpuload(const int cpu)
> {
> - return cpu_rq(cpu)->load.weight;
> + return cpu_rq(cpu)->cfs.runnable_load_avg;
> }
>
> /*
> @@ -3013,9 +3013,10 @@ static unsigned long cpu_avg_load_per_task(int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
> unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
> + unsigned long load_avg = rq->cfs.runnable_load_avg;
>
> if (nr_running)
> - return rq->load.weight / nr_running;
> + return load_avg / nr_running;
>
> return 0;
> }
> diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
> index bb3a6a0..ce5cd48 100644
> --- a/kernel/sched/proc.c
> +++ b/kernel/sched/proc.c
> @@ -501,6 +501,18 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
> sched_avg_update(this_rq);
> }
>
> +#ifdef CONFIG_SMP
> +unsigned long get_rq_runnable_load(struct rq *rq)
> +{
> + return rq->cfs.runnable_load_avg;
> +}
> +#else
> +unsigned long get_rq_runnable_load(struct rq *rq)
> +{
> + return rq->load.weight;
> +}
> +#endif
> +
> #ifdef CONFIG_NO_HZ_COMMON
> /*
> * There is no sane way to deal with nohz on smp when using jiffies because the
> @@ -522,7 +534,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
> void update_idle_cpu_load(struct rq *this_rq)
> {
> unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
> - unsigned long load = this_rq->load.weight;
> + unsigned long load = get_rq_runnable_load(this_rq);
> unsigned long pending_updates;
>
> /*
> @@ -568,11 +580,12 @@ void update_cpu_load_nohz(void)
> */
> void update_cpu_load_active(struct rq *this_rq)
> {
> + unsigned long load = get_rq_runnable_load(this_rq);
> /*
> * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
> */
> this_rq->last_load_update_tick = jiffies;
> - __update_cpu_load(this_rq, this_rq->load.weight, 1);
> + __update_cpu_load(this_rq, load, 1);
>
> calc_load_account_active(this_rq);
> }
>


--
Thanks
Alex

2013-06-28 10:56:58

by Paul Turner

[permalink] [raw]
Subject: Re: [Resend patch v8 0/13] use runnable load in schedule balance

On Mon, Jun 24, 2013 at 8:37 AM, Alex Shi <[email protected]> wrote:
> On 06/24/2013 06:40 PM, Paul Turner wrote:
>>> > Ingo & Peter,
>>> >
>>> > This patchset has been discussed widely and in depth.
>>> >
>>> > Now just the 6th/8th patches have some arguments against them. Paul thinks it
>>> > is better to consider blocked_load_avg in balance, since it is helpful in
>>> > some scenarios, but I think that in most scenarios the blocked_load_avg
>>> > just causes load imbalance among cpus, and testing shows that with
>>> > blocked_load_avg the performance is simply worse on some benchmarks. So I
>>> > still prefer to keep it out of balance.
>> I think you have perhaps misunderstood what I was trying to explain.
>>
>> I have no problems with not including blocked load in load-balance, in
>> fact, I encouraged not accumulating it in an average of averages in
>> CPU load.
>>
>
> Many thanks for re-clarification!
>> The problem is that your current approach has removed it both from
>> load-balance _and_ from shares distribution; isolation matters as much
>> as performance in the cgroup case (otherwise you would just not use
>> cgroups). I would expect the latter to have quite negative effects on
>> fairness, this is my primary concern.
>>
>
> So the argument is just on patch 'sched/tg: remove blocked_load_avg in balance'. :)
>
> I understand your correctness concern, but blocked_load_avg will still decay to zero within a few hundred ms, so that correctness only matters for a few hundred ms (and costs a performance drop).
> The blocked_load_avg is decayed at the same rate as the runnable load, so it is a bit overweighted once a task has slept, since the task may be woken up on another cpu. To relieve this overweight, could we use half or a quarter of the blocked_load_avg weight, like the following:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ddbc19f..395f57c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1358,7 +1358,7 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> struct task_group *tg = cfs_rq->tg;
> s64 tg_contrib;
>
> - tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
> + tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg / 2;
> tg_contrib -= cfs_rq->tg_load_contrib;

So this is actually an interesting idea, but don't think of it as
overweight. What "cfs_rq->blocked_load_avg / 2" means is actually
blocked_load_avg one period from now. This is interesting because it
makes the (reasonable) supposition that blocked load is not about to
immediately wake, but will continue to decay.
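
To make the arithmetic concrete, here is a minimal standalone sketch of the
geometric decay behind this (plain userspace C, not kernel code;
decayed_after() and the printed example values are illustrative assumptions,
and "one period" is read here as one decay half-life, i.e. 32 of the ~1ms
tracking windows): the tracked load is multiplied by a factor y each window,
with y chosen so that y^32 = 1/2, so blocked_load_avg/2 is what the blocked
load decays to one half-life later, and after a few hundred windows it is
essentially gone -- which also matches the "decays to zero within a few
hundred ms" observation quoted above.

#include <stdio.h>

/*
 * Standalone illustration (not kernel code) of per-entity load decay:
 * load is multiplied by y each ~1ms period, with y^32 = 1/2, i.e. a
 * half-life of 32 periods.
 */
static double decayed_after(double load, unsigned int periods)
{
	const double y = 0.97857206;	/* 2^(-1/32) */
	unsigned int i;

	for (i = 0; i < periods; i++)
		load *= y;
	return load;
}

int main(void)
{
	double blocked = 1024.0;	/* e.g. one nice-0 task's contribution */

	printf("now         : %7.1f\n", blocked);
	printf(" 32 periods : %7.1f  (~ blocked/2)\n",
	       decayed_after(blocked, 32));
	printf("320 periods : %7.3f  (a few hundred ms -> ~0)\n",
	       decayed_after(blocked, 320));
	return 0;
}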

Could you try testing the gvr_lb_tip branch at
git://git.kernel.org/pub/scm/linux/kernel/git/pjt/sched-tip.git ?

It's an extension to your series that tries to improve some of the
cpu_load interactions in an alternate way to the above.

It seems a little better on one and two-socket machines; but we
couldn't reproduce/compare to your best performance results since they
were taken on larger machines.

Thanks,

- Paul

2013-06-28 11:08:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [Resend patch v8 0/13] use runnable load in schedule balance

On Fri, Jun 28, 2013 at 03:56:25AM -0700, Paul Turner wrote:

> So this is actually an interesting idea, but don't think of it as
> overweight. What "cfs_rq->blocked_load_avg / 2" means is actually
> blocked_load_avg one period from now. This is interesting because it
> makes the (reasonable) supposition that blocked load is not about to
> immediately wake, but will continue to decay.
>
> Could you try testing the gvr_lb_tip branch at
> git://git.kernel.org/pub/scm/linux/kernel/git/pjt/sched-tip.git ?
>
> It's an extension to your series that tries to improve some of the
> cpu_load interactions in an alternate way to the above.
>
> It seems a little better on one and two-socket machines; but we
> couldn't reproduce/compare to your best performance results since they
> were taken on larger machines.

Oh nice.. it does away with the entire cpu_load[] array thing. Just what
Frederic needs for his NOHZ stuff as well -- he's currently abusing
LB_BIAS for that.

2013-06-28 11:31:58

by Alex Shi

[permalink] [raw]
Subject: Re: [Resend patch v8 0/13] use runnable load in schedule balance

On 06/28/2013 07:07 PM, Peter Zijlstra wrote:
> On Fri, Jun 28, 2013 at 03:56:25AM -0700, Paul Turner wrote:
>
>> So this is actually an interesting idea, but don't think of it as
>> overweight. What "cfs_rq->blocked_load_avg / 2" means is actually
>> blocked_load_avg one period from now. This is interesting because it
>> makes the (reasonable) supposition that blocked load is not about to
>> immediately wake, but will continue to decay.
>>
>> Could you try testing the gvr_lb_tip branch at
>> git://git.kernel.org/pub/scm/linux/kernel/git/pjt/sched-tip.git ?

OK. will try it next week.
>>
>> It's an extension to your series that tries to improve some of the
>> cpu_load interactions in an alternate way to the above.
>>
>> It seems a little better on one and two-socket machines; but we
>> couldn't reproduce/compare to your best performance results since they
>> were taken on larger machines.
>
> Oh nice.. it does away with the entire cpu_load[] array thing. Just what
> Frederic needs for his NOHZ stuff as well -- he's currently abusing
> LB_BIAS for that.
>


--
Thanks
Alex

2013-06-28 16:01:12

by Paul Turner

[permalink] [raw]
Subject: Re: [Resend patch v8 0/13] use runnable load in schedule balance

On Fri, Jun 28, 2013 at 6:20 AM, Alex Shi <[email protected]> wrote:
>
>> So this is actually an interesting idea, but don't think of it as
>> overweight. What "cfs_rq->blocked_load_avg / 2" means is actually
>> blocked_load_avg one period from now. This is interesting because it
>> makes the (reasonable) supposition that blocked load is not about to
>> immediately wake, but will continue to decay.
>>
>> Could you try testing the gvr_lb_tip branch at
>> git://git.kernel.org/pub/scm/linux/kernel/git/pjt/sched-tip.git ?
>>
>
> Could you rebase the patch on latest tip/sched/core?

I suspect it's more direct to just check out and test the branch
directly (i.e. you should not need to apply it on top of any other
branch). It should be based roughly on where you previously tested.

>
>>
>> It's an extension to your series that tries to improve some of the
>> cpu_load interactions in an alternate way to the above.
>>
>> It seems a little better on one and two-socket machines; but we
>> couldn't reproduce/compare to your best performance results since they
>> were taken on larger machines.
>>
>>
>
> --
> Thanks
> Alex

2013-07-09 08:54:19

by Alex Shi

[permalink] [raw]
Subject: Re: [Resend patch v8 0/13] use runnable load in schedule balance

On 06/29/2013 12:00 AM, Paul Turner wrote:
> On Fri, Jun 28, 2013 at 6:20 AM, Alex Shi <[email protected]> wrote:
>>
>>> So this is actually an interesting idea, but don't think of it as
>>> overweight. What "cfs_rq->blocked_load_avg / 2" means is actually
>>> blocked_load_avg one period from now. This is interesting because it
>>> makes the (reasonable) supposition that blocked load is not about to
>>> immediately wake, but will continue to decay.
>>>
>>> Could you try testing the gvr_lb_tip branch at
>>> git://git.kernel.org/pub/scm/linux/kernel/git/pjt/sched-tip.git ?
>>>
>>
>> Could you rebase the patch on latest tip/sched/core?
>
> I suspect it's more direct to just check out and test the branch
> directly (i.e. you should not need to apply it on top of any other
> branch). It should be based roughly on where you previously tested.

I tested aim7, hackbench, tbench and dbench on NHM EP, SNB EP 2S/4S and
IVB EP, comparing the branch to Alex's rlbv8 (the same as upstream
except that blocked_load_avg is left out of the tg contribution); both
are based on the 3.9.0 kernel.
aim7 drops about 10% on SNB EP 2S/4S.
hackbench drops 10% on SNB EP 4S, and 1~5% on the other two-socket
machines (NHM EP/IVB EP/SNB EP).


tbench/dbench failed to run due to a buggy commit your branch depends
on; it has since been fixed in the upstream kernel.
---
Running for 600 seconds with load '/usr/local/share/client.txt' and
minimum warmup 120 secs
failed to create barrier semaphore.


>
>>
>>>
>>> It's an extension to your series that tries to improve some of the
>>> cpu_load interactions in an alternate way to the above.
>>>
>>> It seems a little better on one and two-socket machines; but we
>>> couldn't reproduce/compare to your best performance results since they
>>> were taken on larger machines.
>>>
--
Thanks
Alex

2013-10-28 10:25:37

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [Resend patch v8 0/13] use runnable load in schedule balance

2013/6/28 Peter Zijlstra <[email protected]>:
> On Fri, Jun 28, 2013 at 03:56:25AM -0700, Paul Turner wrote:
>
>> So this is actually an interesting idea, but don't think of it as
>> overweight. What "cfs_rq->blocked_load_avg / 2" means is actually
>> blocked_load_avg one period from now. This is interesting because it
>> makes the (reasonable) supposition that blocked load is not about to
>> immediately wake, but will continue to decay.
>>
>> Could you try testing the gvr_lb_tip branch at
>> git://git.kernel.org/pub/scm/linux/kernel/git/pjt/sched-tip.git ?
>>
>> It's an extension to your series that tries to improve some of the
>> cpu_load interactions in an alternate way to the above.
>>
>> It seems a little better on one and two-socket machines; but we
>> couldn't reproduce/compare to your best performance results since they
>> were taken on larger machines.
>
> Oh nice.. it does away with the entire cpu_load[] array thing. Just what
> Frederic needs for his NOHZ stuff as well -- he's currently abusing
> LB_BIAS for that.

Hi guys,

Are there any updates on the status of this work? I'm getting back to
fixing cpu_load for full dynticks, and this patchset was apparently
taking care of that.

Thanks.

2013-10-28 12:22:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [Resend patch v8 0/13] use runnable load in schedule balance

On Mon, Oct 28, 2013 at 11:25:34AM +0100, Frederic Weisbecker wrote:
> 2013/6/28 Peter Zijlstra <[email protected]>:
> > On Fri, Jun 28, 2013 at 03:56:25AM -0700, Paul Turner wrote:
> >
> >> So this is actually an interesting idea, but don't think of it as
> >> overweight. What "cfs_rq->blocked_load_avg / 2" means is actually
> >> blocked_load_avg one period from now. This is interesting because it
> >> makes the (reasonable) supposition that blocked load is not about to
> >> immediately wake, but will continue to decay.
> >>
> >> Could you try testing the gvr_lb_tip branch at
> >> git://git.kernel.org/pub/scm/linux/kernel/git/pjt/sched-tip.git ?
> >>
> >> It's an extension to your series that tries to improve some of the
> >> cpu_load interactions in an alternate way to the above.
> >>
> >> It seems a little better on one and two-socket machines; but we
> >> couldn't reproduce/compare to your best performance results since they
> >> were taken on larger machines.
> >
> > Oh nice.. it does away with the entire cpu_load[] array thing. Just what
> > Frederic needs for his NOHZ stuff as well -- he's currently abusing
> > LB_BIAS for that.
>
> Hi guys,
>
> Is there any updates on the status of this work? I'm getting back on
> fixing the cpu_load for full dynticks and this patchset was apparently
> taking care of that.

I talked to PJT about this last week; he said Ben was looking (or was
going to look) into this sometime 'soon', iirc.