2013-09-13 18:26:58

by Jason Low

Subject: [PATCH v5 0/3] sched: Limiting idle balance

v4->v5
- We no longer use the this_rq->avg_idle < this_rq->max_idle_balance_cost
  check. However, we kept the old rq->avg_idle < sysctl_sched_migration_cost
  check since I saw some performance benefits with it.
- Replaced smp_processor_id() with this_cpu.
- Increased the decay to 1% per second.

These patches modify and add to the way we limit idle balancing. The first
patch reduces the chance that we overestimate avg_idle. The second patch
limits idle balancing by comparing avg_idle with the max cost we have ever
spent on a newidle load balance in each sched domain. The third patch
periodically decays each domain's max newidle balance cost.

These changes further reduce the chance that we attempt idle balancing when
the time a CPU remains idle is short and no greater than the cost of doing
the balance.
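
To make the second patch's check concrete, here is a minimal userspace
sketch of the gating idea (not the kernel code; the names and numbers below
are stand-ins): a domain's newidle balance is skipped once the expected idle
time can no longer cover the time already spent balancing plus that domain's
worst observed balance cost.

#include <stdio.h>

/*
 * Sketch only: attempt a domain's newidle balance only if the expected
 * idle time covers the cost already paid plus the domain's worst cost.
 */
static int worth_balancing(unsigned long long avg_idle,
                           unsigned long long curr_cost,
                           unsigned long long max_newidle_lb_cost)
{
        return avg_idle >= curr_cost + max_newidle_lb_cost;
}

int main(void)
{
        unsigned long long avg_idle = 500000;   /* hypothetical: CPU idles ~500us */

        /* First domain: nothing spent yet, worst cost 200us -> attempt it. */
        printf("%d\n", worth_balancing(avg_idle, 0, 200000));
        /* Next domain: 200us already spent, worst cost 400us -> skip it. */
        printf("%d\n", worth_balancing(avg_idle, 200000, 400000));
        return 0;
}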

The table below compares the average jobs per minute when running AIM7 on
an 8 socket, 80 core machine (Hyperthreading enabled) at 10-100, 200-1000, and
1100-2000 users, between the vanilla 3.11 tip kernel and the 3.11 tip kernel
with this patchset. Out of the AIM7 workloads, fserver benefited the most from
this change.

Note: The gains weren't as large as with the v4 patch due to not having
the if (this_rq->avg_idle < this_rq->max_idle_balance_cost) check.

----------------------------------------------------------------
workload     | % improvement   | % improvement  | % improvement
             | with patch      | with patch     | with patch
             | 1100-2000 users | 200-1000 users | 10-100 users
----------------------------------------------------------------
alltests     | +2.5%           | +2.7%          | +0.0%
----------------------------------------------------------------
compute      | +0.2%           | -0.3%          | -0.5%
----------------------------------------------------------------
custom       | +4.7%           | +1.7%          | +3.5%
----------------------------------------------------------------
disk         | +3.0%           | +1.9%          | +4.8%
----------------------------------------------------------------
fserver      | +27.0%          | +7.7%          | +2.2%
----------------------------------------------------------------
high_systime | +4.1%           | +3.0%          | +0.2%
----------------------------------------------------------------
new_fserver  | +23.1%          | +5.1%          | +0.0%
----------------------------------------------------------------
shared       | +3.0%           | +4.5%          | +1.4%
----------------------------------------------------------------

Jason Low (3):
sched: Reduce overestimating rq->avg_idle
sched: Consider max cost of idle balance per sched domain
sched: Periodically decay max cost of idle balance

arch/metag/include/asm/topology.h | 2 +
include/linux/sched.h | 4 +++
include/linux/topology.h | 6 ++++
kernel/sched/core.c | 10 ++++---
kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++-----
kernel/sched/sched.h | 3 ++
6 files changed, 68 insertions(+), 11 deletions(-)


2013-09-13 18:27:05

by Jason Low

Subject: [PATCH v5 1/3] sched: Reduce overestimating rq->avg_idle

When updating avg_idle, if the delta exceeds some max value, then avg_idle
gets set to the max, regardless of what the previous avg was. This can cause
avg_idle to often be overestimated.

This patch modifies the way we update avg_idle by always updating it with the
function call to update_avg() first. Then, if avg_idle exceeds the max, we set
it to the max.
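
For illustration: update_avg() here is a simple exponential moving average
that moves avg 1/8 of the way toward each new sample (assuming the current
definition in kernel/sched/core.c). The small userspace sketch below, with
made-up numbers, shows why averaging first and then clamping avoids jumping
straight to the cap on a single long idle period.

#include <stdio.h>

typedef unsigned long long u64;
typedef long long s64;

/* Same shape as the kernel's helper: move avg 1/8 toward the sample. */
static void update_avg(u64 *avg, u64 sample)
{
        s64 diff = sample - *avg;
        *avg += diff >> 3;
}

int main(void)
{
        u64 max = 1000;                 /* hypothetical cap (2*sysctl_sched_migration_cost) */
        u64 old = 100, delta = 5000;    /* one unusually long idle period */
        u64 avg;

        /* Old behaviour: delta > max, so avg jumps straight to the cap. */
        avg = old;
        if (delta > max)
                avg = max;
        else
                update_avg(&avg, delta);
        printf("old: %llu\n", avg);     /* 1000 */

        /* New behaviour: average first, then clamp. */
        avg = old;
        update_avg(&avg, delta);
        if (avg > max)
                avg = max;
        printf("new: %llu\n", avg);     /* 712, well under the cap */
        return 0;
}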

Signed-off-by: Jason Low <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Srikar Dronamraju <[email protected]>
---
kernel/sched/core.c | 7 ++++---
1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 725aa06..25b0f79 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1347,10 +1347,11 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
u64 delta = rq_clock(rq) - rq->idle_stamp;
u64 max = 2*sysctl_sched_migration_cost;

- if (delta > max)
+ update_avg(&rq->avg_idle, delta);
+
+ if (rq->avg_idle > max)
rq->avg_idle = max;
- else
- update_avg(&rq->avg_idle, delta);
+
rq->idle_stamp = 0;
}
#endif
--
1.7.1

2013-09-13 18:27:16

by Jason Low

Subject: [PATCH v5 2/3] sched: Consider max cost of idle balance per sched domain

v4->v5
- We no longer use the this_rq->avg_idle < this_rq->max_idle_balance_cost
  check. However, we kept the old rq->avg_idle < sysctl_sched_migration_cost
  check since I saw some performance benefits with it.
- Replaced smp_processor_id() with this_cpu.

In this patch, we keep track of the max cost we spend doing idle load balancing
for each sched domain. If the avg time the CPU remains idle is less than the
time we have already spent on idle balancing + the max cost of idle balancing
in the sched domain, then we don't continue to attempt the balance. We also
keep a per rq variable, max_idle_balance_cost, which keeps track of the max
time spent on newidle load balances throughout all its domains so that we can
determine the avg_idle's max value.

By using the max, we avoid overrunning the average. This further reduces the
chance we attempt balancing when the CPU is not idle for longer than the cost
to balance.
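
As an aside, the per-rq value closes a feedback loop: the worst cost observed
across one pass of idle_balance() becomes the bound used when clamping
avg_idle (patch 1 clamps it to twice that max). A minimal sketch of just that
bookkeeping; the struct and helper names below are stand-ins, not the kernel's
struct rq.

#include <stdio.h>

struct rq_idle_stats {
        unsigned long long avg_idle;                    /* expected idle time, ns */
        unsigned long long max_idle_balance_cost;       /* worst newidle balance cost, ns */
};

/* Called at the end of a newidle balance pass with the summed domain costs. */
static void note_newidle_balance_cost(struct rq_idle_stats *rq,
                                      unsigned long long curr_cost)
{
        if (curr_cost > rq->max_idle_balance_cost)
                rq->max_idle_balance_cost = curr_cost;
}

/* Called when the CPU wakes again: avg_idle is capped at twice the max cost. */
static void clamp_avg_idle(struct rq_idle_stats *rq)
{
        unsigned long long max = 2 * rq->max_idle_balance_cost;

        if (rq->avg_idle > max)
                rq->avg_idle = max;
}

int main(void)
{
        struct rq_idle_stats rq = { .avg_idle = 900000, .max_idle_balance_cost = 100000 };

        note_newidle_balance_cost(&rq, 300000); /* a costlier pass raises the max */
        clamp_avg_idle(&rq);                    /* 900000 > 2*300000, clamp to 600000 */
        printf("max cost %llu, avg_idle %llu\n",
               rq.max_idle_balance_cost, rq.avg_idle);
        return 0;
}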

Signed-off-by: Jason Low <[email protected]>
---
arch/metag/include/asm/topology.h | 1 +
include/linux/sched.h | 1 +
include/linux/topology.h | 3 +++
kernel/sched/core.c | 3 ++-
kernel/sched/fair.c | 16 ++++++++++++++++
kernel/sched/sched.h | 3 +++
6 files changed, 26 insertions(+), 1 deletions(-)

diff --git a/arch/metag/include/asm/topology.h b/arch/metag/include/asm/topology.h
index 23f5118..db19292 100644
--- a/arch/metag/include/asm/topology.h
+++ b/arch/metag/include/asm/topology.h
@@ -26,6 +26,7 @@
.last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
+ .max_newidle_lb_cost = 0, \
}

#define cpu_to_node(cpu) ((void)(cpu), 0)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f79ced7..fe16b7d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -818,6 +818,7 @@ struct sched_domain {
unsigned int nr_balance_failed; /* initialise to 0 */

u64 last_update;
+ u64 max_newidle_lb_cost;

#ifdef CONFIG_SCHEDSTATS
/* load_balance() stats */
diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..e2a2c3d 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -106,6 +106,7 @@ int arch_update_cpu_topology(void);
.last_balance = jiffies, \
.balance_interval = 1, \
.smt_gain = 1178, /* 15% */ \
+ .max_newidle_lb_cost = 0, \
}
#endif
#endif /* CONFIG_SCHED_SMT */
@@ -135,6 +136,7 @@ int arch_update_cpu_topology(void);
, \
.last_balance = jiffies, \
.balance_interval = 1, \
+ .max_newidle_lb_cost = 0, \
}
#endif
#endif /* CONFIG_SCHED_MC */
@@ -166,6 +168,7 @@ int arch_update_cpu_topology(void);
, \
.last_balance = jiffies, \
.balance_interval = 1, \
+ .max_newidle_lb_cost = 0, \
}
#endif

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 25b0f79..455ab28 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1345,7 +1345,7 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)

if (rq->idle_stamp) {
u64 delta = rq_clock(rq) - rq->idle_stamp;
- u64 max = 2*sysctl_sched_migration_cost;
+ u64 max = 2*rq->max_idle_balance_cost;

update_avg(&rq->avg_idle, delta);

@@ -6521,6 +6521,7 @@ void __init sched_init(void)
rq->online = 0;
rq->idle_stamp = 0;
rq->avg_idle = 2*sysctl_sched_migration_cost;
+ rq->max_idle_balance_cost = sysctl_sched_migration_cost;

INIT_LIST_HEAD(&rq->cfs_tasks);

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9b3fe1c..8b33257 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5392,6 +5392,7 @@ void idle_balance(int this_cpu, struct rq *this_rq)
struct sched_domain *sd;
int pulled_task = 0;
unsigned long next_balance = jiffies + HZ;
+ u64 curr_cost = 0;

this_rq->idle_stamp = rq_clock(this_rq);

@@ -5408,15 +5409,27 @@ void idle_balance(int this_cpu, struct rq *this_rq)
for_each_domain(this_cpu, sd) {
unsigned long interval;
int continue_balancing = 1;
+ u64 t0, domain_cost;

if (!(sd->flags & SD_LOAD_BALANCE))
continue;

+ if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
+ break;
+
if (sd->flags & SD_BALANCE_NEWIDLE) {
+ t0 = sched_clock_cpu(this_cpu);
+
/* If we've pulled tasks over stop searching: */
pulled_task = load_balance(this_cpu, this_rq,
sd, CPU_NEWLY_IDLE,
&continue_balancing);
+
+ domain_cost = sched_clock_cpu(this_cpu) - t0;
+ if (domain_cost > sd->max_newidle_lb_cost)
+ sd->max_newidle_lb_cost = domain_cost;
+
+ curr_cost += domain_cost;
}

interval = msecs_to_jiffies(sd->balance_interval);
@@ -5438,6 +5451,9 @@ void idle_balance(int this_cpu, struct rq *this_rq)
*/
this_rq->next_balance = next_balance;
}
+
+ if (curr_cost > this_rq->max_idle_balance_cost)
+ this_rq->max_idle_balance_cost = curr_cost;
}

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b3c5653..f24e239 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -476,6 +476,9 @@ struct rq {
u64 age_stamp;
u64 idle_stamp;
u64 avg_idle;
+
+ /* This is used to determine avg_idle's max value */
+ u64 max_idle_balance_cost;
#endif

#ifdef CONFIG_IRQ_TIME_ACCOUNTING
--
1.7.1

2013-09-13 18:27:23

by Jason Low

Subject: [PATCH v5 3/3] sched: Periodically decay max cost of idle balance

v4->v5
- Increased the decay to 1% per second.
- Peter rewrote much of the logic.

This patch builds on patch 2 and periodically decays each sched domain's max
newidle balance cost by approximately 1% per second. The rq's
max_idle_balance_cost value is decayed as well.
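
For reference, each decay step multiplies the cost by 253/256 (~0.988), and
rebalance_domains() takes one step per second (every HZ jiffies), which works
out to a bit over 1% per second. A trivial userspace sketch of just that
arithmetic, starting from a hypothetical value:

#include <stdio.h>

int main(void)
{
        unsigned long long cost = 1000000;      /* hypothetical: 1ms in ns */
        int sec;

        for (sec = 1; sec <= 5; sec++) {
                cost = (cost * 253) / 256;      /* one decay step per second */
                printf("after %ds: %llu ns\n", sec, cost);
        }
        return 0;
}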

Signed-off-by: Jason Low <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
arch/metag/include/asm/topology.h | 1 +
include/linux/sched.h | 3 ++
include/linux/topology.h | 3 ++
kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++------
4 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/arch/metag/include/asm/topology.h b/arch/metag/include/asm/topology.h
index db19292..8e9c0b3 100644
--- a/arch/metag/include/asm/topology.h
+++ b/arch/metag/include/asm/topology.h
@@ -27,6 +27,7 @@
.balance_interval = 1, \
.nr_balance_failed = 0, \
.max_newidle_lb_cost = 0, \
+ .next_decay_max_lb_cost = jiffies, \
}

#define cpu_to_node(cpu) ((void)(cpu), 0)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fe16b7d..b3d2a2d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -818,7 +818,10 @@ struct sched_domain {
unsigned int nr_balance_failed; /* initialise to 0 */

u64 last_update;
+
+ /* idle_balance() stats */
u64 max_newidle_lb_cost;
+ unsigned long next_decay_max_lb_cost;

#ifdef CONFIG_SCHEDSTATS
/* load_balance() stats */
diff --git a/include/linux/topology.h b/include/linux/topology.h
index e2a2c3d..12ae6ce 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -107,6 +107,7 @@ int arch_update_cpu_topology(void);
.balance_interval = 1, \
.smt_gain = 1178, /* 15% */ \
.max_newidle_lb_cost = 0, \
+ .next_decay_max_lb_cost = jiffies, \
}
#endif
#endif /* CONFIG_SCHED_SMT */
@@ -137,6 +138,7 @@ int arch_update_cpu_topology(void);
.last_balance = jiffies, \
.balance_interval = 1, \
.max_newidle_lb_cost = 0, \
+ .next_decay_max_lb_cost = jiffies, \
}
#endif
#endif /* CONFIG_SCHED_MC */
@@ -169,6 +171,7 @@ int arch_update_cpu_topology(void);
.last_balance = jiffies, \
.balance_interval = 1, \
.max_newidle_lb_cost = 0, \
+ .next_decay_max_lb_cost = jiffies, \
}
#endif

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8b33257..f1818f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5677,15 +5677,39 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
/* Earliest time when we have to do rebalance again */
unsigned long next_balance = jiffies + 60*HZ;
int update_next_balance = 0;
- int need_serialize;
+ int need_serialize, need_decay = 0;
+ u64 max_cost = 0;

update_blocked_averages(cpu);

rcu_read_lock();
for_each_domain(cpu, sd) {
+ /*
+ * Decay the newidle max times here because this is a regular
+ * visit to all the domains. Decay ~1% per second.
+ */
+ if (time_after(jiffies, sd->next_decay_max_lb_cost)) {
+ sd->max_newidle_lb_cost =
+ (sd->max_newidle_lb_cost * 253) / 256;
+ sd->next_decay_max_lb_cost = jiffies + HZ;
+ need_decay = 1;
+ }
+ max_cost += sd->max_newidle_lb_cost;
+
if (!(sd->flags & SD_LOAD_BALANCE))
continue;

+ /*
+ * Stop the load balance at this level. There is another
+ * CPU in our sched group which is doing load balancing more
+ * actively.
+ */
+ if (!continue_balancing) {
+ if (need_decay)
+ continue;
+ break;
+ }
+
interval = sd->balance_interval;
if (idle != CPU_IDLE)
interval *= sd->busy_factor;
@@ -5719,14 +5743,14 @@ out:
next_balance = sd->last_balance + interval;
update_next_balance = 1;
}
-
+ }
+ if (need_decay) {
/*
- * Stop the load balance at this level. There is another
- * CPU in our sched group which is doing load balancing more
- * actively.
+ * Ensure the rq-wide value also decays but keep it at a
+ * reasonable floor to avoid funnies with rq->avg_idle.
*/
- if (!continue_balancing)
- break;
+ rq->max_idle_balance_cost =
+ max((u64)sysctl_sched_migration_cost, max_cost);
}
rcu_read_unlock();

--
1.7.1

Subject: [tip:sched/core] sched/balancing: Periodically decay max cost of idle balance

Commit-ID: f48627e686a69f5215cb0761e731edb3d9859dd9
Gitweb: http://git.kernel.org/tip/f48627e686a69f5215cb0761e731edb3d9859dd9
Author: Jason Low <[email protected]>
AuthorDate: Fri, 13 Sep 2013 11:26:53 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 20 Sep 2013 12:03:46 +0200

sched/balancing: Periodically decay max cost of idle balance

This patch builds on patch 2 and periodically decays each sched domain's max
newidle balance cost by approximately 1% per second. The rq's
max_idle_balance_cost value is decayed as well.

Signed-off-by: Jason Low <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/metag/include/asm/topology.h | 1 +
include/linux/sched.h | 3 +++
include/linux/topology.h | 3 +++
kernel/sched/fair.c | 38 +++++++++++++++++++++++++++++++-------
4 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/arch/metag/include/asm/topology.h b/arch/metag/include/asm/topology.h
index db19292..8e9c0b3 100644
--- a/arch/metag/include/asm/topology.h
+++ b/arch/metag/include/asm/topology.h
@@ -27,6 +27,7 @@
.balance_interval = 1, \
.nr_balance_failed = 0, \
.max_newidle_lb_cost = 0, \
+ .next_decay_max_lb_cost = jiffies, \
}

#define cpu_to_node(cpu) ((void)(cpu), 0)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index be078ff..b5344de 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -810,7 +810,10 @@ struct sched_domain {
unsigned int nr_balance_failed; /* initialise to 0 */

u64 last_update;
+
+ /* idle_balance() stats */
u64 max_newidle_lb_cost;
+ unsigned long next_decay_max_lb_cost;

#ifdef CONFIG_SCHEDSTATS
/* load_balance() stats */
diff --git a/include/linux/topology.h b/include/linux/topology.h
index e2a2c3d..12ae6ce 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -107,6 +107,7 @@ int arch_update_cpu_topology(void);
.balance_interval = 1, \
.smt_gain = 1178, /* 15% */ \
.max_newidle_lb_cost = 0, \
+ .next_decay_max_lb_cost = jiffies, \
}
#endif
#endif /* CONFIG_SCHED_SMT */
@@ -137,6 +138,7 @@ int arch_update_cpu_topology(void);
.last_balance = jiffies, \
.balance_interval = 1, \
.max_newidle_lb_cost = 0, \
+ .next_decay_max_lb_cost = jiffies, \
}
#endif
#endif /* CONFIG_SCHED_MC */
@@ -169,6 +171,7 @@ int arch_update_cpu_topology(void);
.last_balance = jiffies, \
.balance_interval = 1, \
.max_newidle_lb_cost = 0, \
+ .next_decay_max_lb_cost = jiffies, \
}
#endif

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ffc99d8..2b89cd2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5681,15 +5681,39 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
/* Earliest time when we have to do rebalance again */
unsigned long next_balance = jiffies + 60*HZ;
int update_next_balance = 0;
- int need_serialize;
+ int need_serialize, need_decay = 0;
+ u64 max_cost = 0;

update_blocked_averages(cpu);

rcu_read_lock();
for_each_domain(cpu, sd) {
+ /*
+ * Decay the newidle max times here because this is a regular
+ * visit to all the domains. Decay ~1% per second.
+ */
+ if (time_after(jiffies, sd->next_decay_max_lb_cost)) {
+ sd->max_newidle_lb_cost =
+ (sd->max_newidle_lb_cost * 253) / 256;
+ sd->next_decay_max_lb_cost = jiffies + HZ;
+ need_decay = 1;
+ }
+ max_cost += sd->max_newidle_lb_cost;
+
if (!(sd->flags & SD_LOAD_BALANCE))
continue;

+ /*
+ * Stop the load balance at this level. There is another
+ * CPU in our sched group which is doing load balancing more
+ * actively.
+ */
+ if (!continue_balancing) {
+ if (need_decay)
+ continue;
+ break;
+ }
+
interval = sd->balance_interval;
if (idle != CPU_IDLE)
interval *= sd->busy_factor;
@@ -5723,14 +5747,14 @@ out:
next_balance = sd->last_balance + interval;
update_next_balance = 1;
}
-
+ }
+ if (need_decay) {
/*
- * Stop the load balance at this level. There is another
- * CPU in our sched group which is doing load balancing more
- * actively.
+ * Ensure the rq-wide value also decays but keep it at a
+ * reasonable floor to avoid funnies with rq->avg_idle.
*/
- if (!continue_balancing)
- break;
+ rq->max_idle_balance_cost =
+ max((u64)sysctl_sched_migration_cost, max_cost);
}
rcu_read_unlock();

Subject: [tip:sched/core] sched: Reduce overestimating rq->avg_idle

Commit-ID: abfafa54db9aba404e8e6763503f04d35bd07138
Gitweb: http://git.kernel.org/tip/abfafa54db9aba404e8e6763503f04d35bd07138
Author: Jason Low <[email protected]>
AuthorDate: Fri, 13 Sep 2013 11:26:51 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 20 Sep 2013 12:03:41 +0200

sched: Reduce overestimating rq->avg_idle

When updating avg_idle, if the delta exceeds some max value, then avg_idle
gets set to the max, regardless of what the previous avg was. This can cause
avg_idle to often be overestimated.

This patch modifies the way we update avg_idle by always updating it with the
function call to update_avg() first. Then, if avg_idle exceeds the max, we set
it to the max.

Signed-off-by: Jason Low <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5ac63c9..048f39e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1332,10 +1332,11 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
u64 delta = rq_clock(rq) - rq->idle_stamp;
u64 max = 2*sysctl_sched_migration_cost;

- if (delta > max)
+ update_avg(&rq->avg_idle, delta);
+
+ if (rq->avg_idle > max)
rq->avg_idle = max;
- else
- update_avg(&rq->avg_idle, delta);
+
rq->idle_stamp = 0;
}
#endif

Subject: [tip:sched/core] sched/balancing: Consider max cost of idle balance per sched domain

Commit-ID: 9bd721c55c8a886b938a45198aab0ccb52f1f7fa
Gitweb: http://git.kernel.org/tip/9bd721c55c8a886b938a45198aab0ccb52f1f7fa
Author: Jason Low <[email protected]>
AuthorDate: Fri, 13 Sep 2013 11:26:52 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 20 Sep 2013 12:03:44 +0200

sched/balancing: Consider max cost of idle balance per sched domain

In this patch, we keep track of the max cost we spend doing idle load balancing
for each sched domain. If the avg time the CPU remains idle is less than the
time we have already spent on idle balancing + the max cost of idle balancing
in the sched domain, then we don't continue to attempt the balance. We also
keep a per rq variable, max_idle_balance_cost, which keeps track of the max
time spent on newidle load balances throughout all its domains so that we can
determine the avg_idle's max value.

By using the max, we avoid overrunning the average. This further reduces the
chance we attempt balancing when the CPU is not idle for longer than the cost
to balance.

Signed-off-by: Jason Low <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/metag/include/asm/topology.h | 1 +
include/linux/sched.h | 1 +
include/linux/topology.h | 3 +++
kernel/sched/core.c | 3 ++-
kernel/sched/fair.c | 16 ++++++++++++++++
kernel/sched/sched.h | 3 +++
6 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/arch/metag/include/asm/topology.h b/arch/metag/include/asm/topology.h
index 23f5118..db19292 100644
--- a/arch/metag/include/asm/topology.h
+++ b/arch/metag/include/asm/topology.h
@@ -26,6 +26,7 @@
.last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
+ .max_newidle_lb_cost = 0, \
}

#define cpu_to_node(cpu) ((void)(cpu), 0)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6682da3..be078ff 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -810,6 +810,7 @@ struct sched_domain {
unsigned int nr_balance_failed; /* initialise to 0 */

u64 last_update;
+ u64 max_newidle_lb_cost;

#ifdef CONFIG_SCHEDSTATS
/* load_balance() stats */
diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..e2a2c3d 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -106,6 +106,7 @@ int arch_update_cpu_topology(void);
.last_balance = jiffies, \
.balance_interval = 1, \
.smt_gain = 1178, /* 15% */ \
+ .max_newidle_lb_cost = 0, \
}
#endif
#endif /* CONFIG_SCHED_SMT */
@@ -135,6 +136,7 @@ int arch_update_cpu_topology(void);
, \
.last_balance = jiffies, \
.balance_interval = 1, \
+ .max_newidle_lb_cost = 0, \
}
#endif
#endif /* CONFIG_SCHED_MC */
@@ -166,6 +168,7 @@ int arch_update_cpu_topology(void);
, \
.last_balance = jiffies, \
.balance_interval = 1, \
+ .max_newidle_lb_cost = 0, \
}
#endif

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 048f39e..c2283c5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1330,7 +1330,7 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)

if (rq->idle_stamp) {
u64 delta = rq_clock(rq) - rq->idle_stamp;
- u64 max = 2*sysctl_sched_migration_cost;
+ u64 max = 2*rq->max_idle_balance_cost;

update_avg(&rq->avg_idle, delta);

@@ -6506,6 +6506,7 @@ void __init sched_init(void)
rq->online = 0;
rq->idle_stamp = 0;
rq->avg_idle = 2*sysctl_sched_migration_cost;
+ rq->max_idle_balance_cost = sysctl_sched_migration_cost;

INIT_LIST_HEAD(&rq->cfs_tasks);

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0784ab6..ffc99d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5396,6 +5396,7 @@ void idle_balance(int this_cpu, struct rq *this_rq)
struct sched_domain *sd;
int pulled_task = 0;
unsigned long next_balance = jiffies + HZ;
+ u64 curr_cost = 0;

this_rq->idle_stamp = rq_clock(this_rq);

@@ -5412,15 +5413,27 @@ void idle_balance(int this_cpu, struct rq *this_rq)
for_each_domain(this_cpu, sd) {
unsigned long interval;
int continue_balancing = 1;
+ u64 t0, domain_cost;

if (!(sd->flags & SD_LOAD_BALANCE))
continue;

+ if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
+ break;
+
if (sd->flags & SD_BALANCE_NEWIDLE) {
+ t0 = sched_clock_cpu(this_cpu);
+
/* If we've pulled tasks over stop searching: */
pulled_task = load_balance(this_cpu, this_rq,
sd, CPU_NEWLY_IDLE,
&continue_balancing);
+
+ domain_cost = sched_clock_cpu(this_cpu) - t0;
+ if (domain_cost > sd->max_newidle_lb_cost)
+ sd->max_newidle_lb_cost = domain_cost;
+
+ curr_cost += domain_cost;
}

interval = msecs_to_jiffies(sd->balance_interval);
@@ -5442,6 +5455,9 @@ void idle_balance(int this_cpu, struct rq *this_rq)
*/
this_rq->next_balance = next_balance;
}
+
+ if (curr_cost > this_rq->max_idle_balance_cost)
+ this_rq->max_idle_balance_cost = curr_cost;
}

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0d7544c..e82484d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -476,6 +476,9 @@ struct rq {
u64 age_stamp;
u64 idle_stamp;
u64 avg_idle;
+
+ /* This is used to determine avg_idle's max value */
+ u64 max_idle_balance_cost;
#endif

#ifdef CONFIG_IRQ_TIME_ACCOUNTING