These patches modify and add to the way we limit idle balancing. The first
patch reduces the chance that avg_idle is overestimated. The second patch makes
idle balance compare avg_idle with the max cost we have ever spent on a newidle
load balance in each sched domain, and skip the balance when the expected idle
time is shorter than that cost.
The third is an RFC patch which periodically decays each domain's max
newidle balance cost and compares avg_idle with each sd's max newidle balance
cost + sched_migration_cost to determine whether we should skip balancing.
These changes further reduce the chance we attempt idle balancing when the time
a CPU remains idle is short and not worth the cost of doing the balancing.
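To illustrate the overall idea, here is a condensed sketch of the checks this
series adds to idle_balance() (simplified, not the exact code from the patches
below):
	/* don't bother if the CPU is not expected to stay idle long enough */
	if (this_rq->avg_idle < this_rq->max_idle_balance_cost)
		return;
	for_each_domain(this_cpu, sd) {
		/* stop once the accumulated balance cost exceeds avg_idle */
		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
			break;
		...
	}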
The first 2 patches provide good performance boosts on many AIM7 workloads on an
8 socket (80 core) machine with Hyperthreading enabled. The table below compares
the average jobs per minute at 10-100, 200-1000, and 1100-2000 users between the
vanilla 3.11-rc7 kernel and the 3.11-rc7 kernel with the first 2 patches applied.
----------------------------------------------------------------
workload | % improvement | % improvement | % improvement
| with patch | with patch | with patch
| 1100-2000 users | 200-1000 users | 10-100 users
----------------------------------------------------------------
alltests | +12.2% | +7.5% | +1.0%
----------------------------------------------------------------
compute | -0.6% | -0.8% | +0.1%
----------------------------------------------------------------
custom | +24.0% | +25.0% | +16.4%
----------------------------------------------------------------
disk | +11.6% | +21.3% | +0.1%
----------------------------------------------------------------
fserver | +74.7% | +34.7% | -2.7%
----------------------------------------------------------------
high_systime | +21.2% | +10.5% | +0.6%
----------------------------------------------------------------
new_fserver | +59.8% | +23.7% | -1.2%
----------------------------------------------------------------
shared | +9.0% | +13.0% | +6.5%
----------------------------------------------------------------
Jason Low (3):
sched: Reduce overestimating rq->avg_idle
sched: Consider max cost of idle balance per sched domain
sched: Periodically decay max cost of idle balance
arch/metag/include/asm/topology.h | 2 +
include/linux/sched.h | 4 +++
include/linux/topology.h | 6 ++++
kernel/sched/core.c | 10 ++++---
kernel/sched/fair.c | 48 ++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 3 ++
6 files changed, 68 insertions(+), 5 deletions(-)
When updating avg_idle, if the delta exceeds some max value, then avg_idle
gets set to the max, regardless of what the previous avg was. This can cause
avg_idle to often be overestimated.
This patch modifies the way we update avg_idle by always updating it with the
function call to update_avg() first. Then, if avg_idle exceeds the max, we set
it to the max.
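For reference, update_avg() is (roughly) a simple exponential moving average
along the lines of:
	static void update_avg(u64 *avg, u64 sample)
	{
		s64 diff = sample - *avg;
		*avg += diff >> 3;
	}
so with this change a long idle period moves avg_idle by at most 1/8 of the
difference before the clamp is applied, instead of snapping avg_idle straight
to the max.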
Signed-off-by: Jason Low <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
kernel/sched/core.c | 7 ++++---
1 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 05c39f0..93b18ef 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1347,10 +1347,11 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
u64 delta = rq_clock(rq) - rq->idle_stamp;
u64 max = 2*sysctl_sched_migration_cost;
- if (delta > max)
+ update_avg(&rq->avg_idle, delta);
+
+ if (rq->avg_idle > max)
rq->avg_idle = max;
- else
- update_avg(&rq->avg_idle, delta);
+
rq->idle_stamp = 0;
}
#endif
--
1.7.1
This RFC patch builds on patch 2 and periodically decays each sched domain's
max newidle balance cost.
Though we want to decay it fairly consistently, we may not want to lower it by
too much each time, especially since avg_idle is capped based on that value.
So I thought that decaying the value every second and lowering it by half a
percent each time appeared to be fairly reasonable.
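As a rough check on that rate, multiplying by 199/200 every second gives a
half-life of
	n = ln(1/2) / ln(199/200) ~= 138 seconds
so a spike in the recorded max is only halved after a bit over two minutes.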
This change would allow us to remove the limit we set on each domain's max cost
to idle balance. Also, since the max can now be reduced, we have to
update rq->max_idle_balance_cost more frequently. So after every idle balance,
we loop through this CPU's sched domains to find the largest newidle load
balance cost of any one domain, and set rq->max_idle_balance_cost to that value.
Since we are now decaying the max cost to do idle balancing, that max cost may
also end up not being high enough. One possible explanation is that
besides the time spent on each newidle load balance, there are other costs
associated with attempting idle balancing. Idle balance also releases and
reacquires a spin lock. That cost is not counted when we keep track of each
domain's cost to do newidle load balance. Also, acquiring the rq locks can
potentially prevent other CPUs from running something useful. And after
migrating tasks, the CPU may also pay the cost of cache misses as the
migrated tasks' cache footprint is rebuilt.
Because of that, this patch also compares avg_idle with the max cost to do idle
balancing + sched_migration_cost. While using the max cost helps reduce the
chance of overestimating how long the CPU will remain idle, adding
sched_migration_cost helps account for those additional costs of idle balancing.
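Concretely, the per-domain gate in idle_balance() then becomes (as in the diff
below):
	if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost
	    + sysctl_sched_migration_cost)
		break;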
Signed-off-by: Jason Low <[email protected]>
---
arch/metag/include/asm/topology.h | 1 +
include/linux/sched.h | 3 ++
include/linux/topology.h | 3 ++
kernel/sched/core.c | 4 +-
kernel/sched/fair.c | 43 ++++++++++++++++++++++++++++++-------
5 files changed, 44 insertions(+), 10 deletions(-)
diff --git a/arch/metag/include/asm/topology.h b/arch/metag/include/asm/topology.h
index db19292..8e9c0b3 100644
--- a/arch/metag/include/asm/topology.h
+++ b/arch/metag/include/asm/topology.h
@@ -27,6 +27,7 @@
.balance_interval = 1, \
.nr_balance_failed = 0, \
.max_newidle_lb_cost = 0, \
+ .next_decay_max_lb_cost = jiffies, \
}
#define cpu_to_node(cpu) ((void)(cpu), 0)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 16e7d80..bcc805a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -818,7 +818,10 @@ struct sched_domain {
unsigned int nr_balance_failed; /* initialise to 0 */
u64 last_update;
+
+ /* idle_balance() stats */
u64 max_newidle_lb_cost;
+ unsigned long next_decay_max_lb_cost;
#ifdef CONFIG_SCHEDSTATS
/* load_balance() stats */
diff --git a/include/linux/topology.h b/include/linux/topology.h
index e2a2c3d..12ae6ce 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -107,6 +107,7 @@ int arch_update_cpu_topology(void);
.balance_interval = 1, \
.smt_gain = 1178, /* 15% */ \
.max_newidle_lb_cost = 0, \
+ .next_decay_max_lb_cost = jiffies, \
}
#endif
#endif /* CONFIG_SCHED_SMT */
@@ -137,6 +138,7 @@ int arch_update_cpu_topology(void);
.last_balance = jiffies, \
.balance_interval = 1, \
.max_newidle_lb_cost = 0, \
+ .next_decay_max_lb_cost = jiffies, \
}
#endif
#endif /* CONFIG_SCHED_MC */
@@ -169,6 +171,7 @@ int arch_update_cpu_topology(void);
.last_balance = jiffies, \
.balance_interval = 1, \
.max_newidle_lb_cost = 0, \
+ .next_decay_max_lb_cost = jiffies, \
}
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 58b0514..bba5a07 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1345,7 +1345,7 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
if (rq->idle_stamp) {
u64 delta = rq_clock(rq) - rq->idle_stamp;
- u64 max = 2*rq->max_idle_balance_cost;
+ u64 max = 2*(sysctl_sched_migration_cost + rq->max_idle_balance_cost);
update_avg(&rq->avg_idle, delta);
@@ -6509,7 +6509,7 @@ void __init sched_init(void)
rq->online = 0;
rq->idle_stamp = 0;
rq->avg_idle = 2*sysctl_sched_migration_cost;
- rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+ rq->max_idle_balance_cost = 0;
INIT_LIST_HEAD(&rq->cfs_tasks);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7697741..60b984d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5274,6 +5274,20 @@ out:
return ld_moved;
}
+/* Returns the max newidle lb cost out of all of this_cpu's sched domains */
+inline u64 get_max_newidle_lb_cost(int this_cpu)
+{
+ struct sched_domain *sd;
+ u64 max = 0;
+
+ for_each_domain(this_cpu, sd) {
+ if (sd->max_newidle_lb_cost > max)
+ max = sd->max_newidle_lb_cost;
+ }
+
+ return max;
+}
+
/*
* idle_balance is called by schedule() if this_cpu is about to become
* idle. Attempts to pull tasks from other CPUs.
@@ -5283,11 +5297,12 @@ void idle_balance(int this_cpu, struct rq *this_rq)
struct sched_domain *sd;
int pulled_task = 0;
unsigned long next_balance = jiffies + HZ;
- u64 curr_cost = 0;
+ u64 curr_cost = 0, max_newidle_lb_cost = 0;
this_rq->idle_stamp = rq_clock(this_rq);
- if (this_rq->avg_idle < this_rq->max_idle_balance_cost)
+ if (this_rq->avg_idle < sysctl_sched_migration_cost
+ + this_rq->max_idle_balance_cost)
return;
/*
@@ -5300,12 +5315,20 @@ void idle_balance(int this_cpu, struct rq *this_rq)
for_each_domain(this_cpu, sd) {
unsigned long interval;
int balance = 1;
- u64 t0, domain_cost, max = 5*sysctl_sched_migration_cost;
+ u64 t0, domain_cost;
+
+ /* Periodically decay sd's max_newidle_lb_cost */
+ if (time_after(jiffies, sd->next_decay_max_lb_cost)) {
+ sd->max_newidle_lb_cost =
+ (sd->max_newidle_lb_cost * 199) / 200;
+ sd->next_decay_max_lb_cost = jiffies + HZ;
+ }
if (!(sd->flags & SD_LOAD_BALANCE))
continue;
- if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
+ if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost
+ + sysctl_sched_migration_cost)
break;
if (sd->flags & SD_BALANCE_NEWIDLE) {
@@ -5316,8 +5339,6 @@ void idle_balance(int this_cpu, struct rq *this_rq)
sd, CPU_NEWLY_IDLE, &balance);
domain_cost = sched_clock_cpu(smp_processor_id()) - t0;
- if (domain_cost > max)
- domain_cost = max;
if (domain_cost > sd->max_newidle_lb_cost)
sd->max_newidle_lb_cost = domain_cost;
@@ -5333,6 +5354,7 @@ void idle_balance(int this_cpu, struct rq *this_rq)
break;
}
}
+ max_newidle_lb_cost = get_max_newidle_lb_cost(this_cpu);
rcu_read_unlock();
raw_spin_lock(&this_rq->lock);
@@ -5345,8 +5367,7 @@ void idle_balance(int this_cpu, struct rq *this_rq)
this_rq->next_balance = next_balance;
}
- if (curr_cost > this_rq->max_idle_balance_cost)
- this_rq->max_idle_balance_cost = curr_cost;
+ this_rq->max_idle_balance_cost = max_newidle_lb_cost;
}
/*
@@ -5576,6 +5597,12 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
rcu_read_lock();
for_each_domain(cpu, sd) {
+ if (time_after(jiffies, sd->next_decay_max_lb_cost)) {
+ sd->max_newidle_lb_cost =
+ (sd->max_newidle_lb_cost * 199) / 200;
+ sd->next_decay_max_lb_cost = jiffies + HZ;
+ }
+
if (!(sd->flags & SD_LOAD_BALANCE))
continue;
--
1.7.1
In this patch, we keep track of the max cost we spend doing idle load balancing
for each sched domain. If the avg time the CPU remains idle is less than the
time we have already spent on idle balancing + the max cost of idle balancing
in the sched domain, then we don't continue to attempt the balance. We also
keep a per rq variable, max_idle_balance_cost, which keeps track of the max
time spent on newidle load balances throughout all its domains. Additionally,
we swap sched_migration_cost used in idle_balance for rq->max_idle_balance_cost.
By using the max, we avoid overrunning the average. This further reduces the chance
we attempt balancing when the CPU is not idle for longer than the cost to balance.
I also limited the max cost of each domain to 5*sysctl_sched_migration_cost as
a way to prevent the max from becoming too inflated.
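For reference, assuming the default sysctl_sched_migration_cost of 500000 ns
(0.5 ms), that limit caps any one domain's recorded newidle balance cost at
2.5 ms.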
Signed-off-by: Jason Low <[email protected]>
---
arch/metag/include/asm/topology.h | 1 +
include/linux/sched.h | 1 +
include/linux/topology.h | 3 +++
kernel/sched/core.c | 3 ++-
kernel/sched/fair.c | 21 ++++++++++++++++++++-
kernel/sched/sched.h | 3 +++
6 files changed, 30 insertions(+), 2 deletions(-)
diff --git a/arch/metag/include/asm/topology.h b/arch/metag/include/asm/topology.h
index 23f5118..db19292 100644
--- a/arch/metag/include/asm/topology.h
+++ b/arch/metag/include/asm/topology.h
@@ -26,6 +26,7 @@
.last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
+ .max_newidle_lb_cost = 0, \
}
#define cpu_to_node(cpu) ((void)(cpu), 0)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 078066d..16e7d80 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -818,6 +818,7 @@ struct sched_domain {
unsigned int nr_balance_failed; /* initialise to 0 */
u64 last_update;
+ u64 max_newidle_lb_cost;
#ifdef CONFIG_SCHEDSTATS
/* load_balance() stats */
diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..e2a2c3d 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -106,6 +106,7 @@ int arch_update_cpu_topology(void);
.last_balance = jiffies, \
.balance_interval = 1, \
.smt_gain = 1178, /* 15% */ \
+ .max_newidle_lb_cost = 0, \
}
#endif
#endif /* CONFIG_SCHED_SMT */
@@ -135,6 +136,7 @@ int arch_update_cpu_topology(void);
, \
.last_balance = jiffies, \
.balance_interval = 1, \
+ .max_newidle_lb_cost = 0, \
}
#endif
#endif /* CONFIG_SCHED_MC */
@@ -166,6 +168,7 @@ int arch_update_cpu_topology(void);
, \
.last_balance = jiffies, \
.balance_interval = 1, \
+ .max_newidle_lb_cost = 0, \
}
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 93b18ef..58b0514 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1345,7 +1345,7 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
if (rq->idle_stamp) {
u64 delta = rq_clock(rq) - rq->idle_stamp;
- u64 max = 2*sysctl_sched_migration_cost;
+ u64 max = 2*rq->max_idle_balance_cost;
update_avg(&rq->avg_idle, delta);
@@ -6509,6 +6509,7 @@ void __init sched_init(void)
rq->online = 0;
rq->idle_stamp = 0;
rq->avg_idle = 2*sysctl_sched_migration_cost;
+ rq->max_idle_balance_cost = sysctl_sched_migration_cost;
INIT_LIST_HEAD(&rq->cfs_tasks);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 68f1609..7697741 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5283,10 +5283,11 @@ void idle_balance(int this_cpu, struct rq *this_rq)
struct sched_domain *sd;
int pulled_task = 0;
unsigned long next_balance = jiffies + HZ;
+ u64 curr_cost = 0;
this_rq->idle_stamp = rq_clock(this_rq);
- if (this_rq->avg_idle < sysctl_sched_migration_cost)
+ if (this_rq->avg_idle < this_rq->max_idle_balance_cost)
return;
/*
@@ -5299,14 +5300,29 @@ void idle_balance(int this_cpu, struct rq *this_rq)
for_each_domain(this_cpu, sd) {
unsigned long interval;
int balance = 1;
+ u64 t0, domain_cost, max = 5*sysctl_sched_migration_cost;
if (!(sd->flags & SD_LOAD_BALANCE))
continue;
+ if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
+ break;
+
if (sd->flags & SD_BALANCE_NEWIDLE) {
+ t0 = sched_clock_cpu(smp_processor_id());
+
/* If we've pulled tasks over stop searching: */
pulled_task = load_balance(this_cpu, this_rq,
sd, CPU_NEWLY_IDLE, &balance);
+
+ domain_cost = sched_clock_cpu(smp_processor_id()) - t0;
+ if (domain_cost > max)
+ domain_cost = max;
+
+ if (domain_cost > sd->max_newidle_lb_cost)
+ sd->max_newidle_lb_cost = domain_cost;
+
+ curr_cost += domain_cost;
}
interval = msecs_to_jiffies(sd->balance_interval);
@@ -5328,6 +5344,9 @@ void idle_balance(int this_cpu, struct rq *this_rq)
*/
this_rq->next_balance = next_balance;
}
+
+ if (curr_cost > this_rq->max_idle_balance_cost)
+ this_rq->max_idle_balance_cost = curr_cost;
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ef0a7b2..6340d0a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -477,6 +477,9 @@ struct rq {
u64 age_stamp;
u64 idle_stamp;
u64 avg_idle;
+
+ /* Set to max idle balance cost for any one sched domain */
+ u64 max_idle_balance_cost;
#endif
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
--
1.7.1
On Thu, Aug 29, 2013 at 01:05:35PM -0700, Jason Low wrote:
> @@ -5299,14 +5300,29 @@ void idle_balance(int this_cpu, struct rq *this_rq)
> for_each_domain(this_cpu, sd) {
> unsigned long interval;
> int balance = 1;
> + u64 t0, domain_cost, max = 5*sysctl_sched_migration_cost;
>
> if (!(sd->flags & SD_LOAD_BALANCE))
> continue;
>
> + if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
> + break;
> +
> if (sd->flags & SD_BALANCE_NEWIDLE) {
> + t0 = sched_clock_cpu(smp_processor_id());
> +
> /* If we've pulled tasks over stop searching: */
> pulled_task = load_balance(this_cpu, this_rq,
> sd, CPU_NEWLY_IDLE, &balance);
> +
> + domain_cost = sched_clock_cpu(smp_processor_id()) - t0;
> + if (domain_cost > max)
> + domain_cost = max;
> +
> + if (domain_cost > sd->max_newidle_lb_cost)
> + sd->max_newidle_lb_cost = domain_cost;
> +
> + curr_cost += domain_cost;
> }
>
> interval = msecs_to_jiffies(sd->balance_interval);
I did an s/smp_processor_id()/this_cpu/ on that.
On Thu, Aug 29, 2013 at 01:05:36PM -0700, Jason Low wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 58b0514..bba5a07 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1345,7 +1345,7 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
>
> if (rq->idle_stamp) {
> u64 delta = rq_clock(rq) - rq->idle_stamp;
> - u64 max = 2*rq->max_idle_balance_cost;
> + u64 max = 2*(sysctl_sched_migration_cost + rq->max_idle_balance_cost);
You re-introduce sched_migration_cost here because max_idle_balance_cost
can now drop down to 0 again?
>
> update_avg(&rq->avg_idle, delta);
>
> @@ -6509,7 +6509,7 @@ void __init sched_init(void)
> rq->online = 0;
> rq->idle_stamp = 0;
> rq->avg_idle = 2*sysctl_sched_migration_cost;
> - rq->max_idle_balance_cost = sysctl_sched_migration_cost;
> + rq->max_idle_balance_cost = 0;
>
> INIT_LIST_HEAD(&rq->cfs_tasks);
>
Would it make sense to keep it initialized at sched_migration_cost and
'tweak' the decay to also provide a lower bound?
> @@ -5300,12 +5315,20 @@ void idle_balance(int this_cpu, struct rq *this_rq)
> for_each_domain(this_cpu, sd) {
> unsigned long interval;
> int balance = 1;
> - u64 t0, domain_cost, max = 5*sysctl_sched_migration_cost;
> + u64 t0, domain_cost;
> +
> + /* Periodically decay sd's max_newidle_lb_cost */
> + if (time_after(jiffies, sd->next_decay_max_lb_cost)) {
> + sd->max_newidle_lb_cost =
> + (sd->max_newidle_lb_cost * 199) / 200;
> + sd->next_decay_max_lb_cost = jiffies + HZ;
> + }
I see you've already done this in rebalance_domains() as well, which I
think is the right place.
I suppose you also do it here since neither place guarantees a full sd
tree iteration? I think we should fix that since you can very easily
have situations where idle_balance() isn't called for ages.
>
> if (!(sd->flags & SD_LOAD_BALANCE))
> continue;
>
> - if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
> + if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost
> + + sysctl_sched_migration_cost)
> break;
>
> if (sd->flags & SD_BALANCE_NEWIDLE) {
> @@ -5316,8 +5339,6 @@ void idle_balance(int this_cpu, struct rq *this_rq)
> sd, CPU_NEWLY_IDLE, &balance);
>
> domain_cost = sched_clock_cpu(smp_processor_id()) - t0;
> - if (domain_cost > max)
> - domain_cost = max;
>
> if (domain_cost > sd->max_newidle_lb_cost)
> sd->max_newidle_lb_cost = domain_cost;
> @@ -5333,6 +5354,7 @@ void idle_balance(int this_cpu, struct rq *this_rq)
> break;
> }
> }
> + max_newidle_lb_cost = get_max_newidle_lb_cost(this_cpu);
OK, so this is a tad counter-intuitive. We apparently just iterated all
domains which suggests we could have just computed the max there.
However... we could have broken out due to cost and not seen the top few
domains. Which is why you're now iterating them all again, right?
How about we move this into the regular balance path too, which iterates
the entire tree every time.
So how about something like this:
---
Subject: sched: Periodically decay max cost of idle balance
From: Jason Low <[email protected]>
Date: Thu, 29 Aug 2013 13:05:36 -0700
This RFC patch builds on patch 2 and periodically decays that max value to
do idle balancing per sched domain.
Though we want to decay it fairly consistently, we may not want to lower it by
too much each time, especially since avg_idle is capped based on that value.
So I thought that decaying the value every second and lowering it by half a
percent each time appeared to be fairly reasonable.
This change would allow us to remove the limit we set on each domain's max cost
to idle balance. Also, since the max can be reduced now, we now have to
update rq->max_idle_balance_cost more frequently. So after every idle balance,
we loop through the sched domain to find the max sd's newidle load balance cost
for any one domain. Then we will set rq->max_idle_balance_cost to that value.
Since we are now decaying the max cost to do idle balancing, that max cost can
also become not high enough. One possible explanation for why is that
besides the time spent on each newidle load balance, there are other costs
associated with attempting idle balancing. Idle balance also releases and
reacquires a spin lock. That cost is not counted when we keep track of each
domain's cost to do newidle load balance. Also, acquiring the rq locks can
potentially prevent other CPUs from running something useful. And after
migrating tasks, it might potentially have to pay the costs of cache misses and
refreshing tasks' cache.
Because of that, this patch also compares avg_idle with max cost to do idle
balancing + sched_migration_cost. While using the max cost helps reduce
overestimating the average idle, the sched_migration_cost can help account
for those additional costs of idle balancing.
Signed-off-by: Jason Low <[email protected]>
[peterz: rewrote most of the logic, but kept the spirit]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
arch/metag/include/asm/topology.h | 1
include/linux/sched.h | 3 ++
include/linux/topology.h | 3 ++
kernel/sched/fair.c | 42 ++++++++++++++++++++++++++++----------
4 files changed, 39 insertions(+), 10 deletions(-)
--- a/arch/metag/include/asm/topology.h
+++ b/arch/metag/include/asm/topology.h
@@ -27,6 +27,7 @@
.balance_interval = 1, \
.nr_balance_failed = 0, \
.max_newidle_lb_cost = 0, \
+ .next_decay_max_lb_cost = jiffies, \
}
#define cpu_to_node(cpu) ((void)(cpu), 0)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -818,7 +818,10 @@ struct sched_domain {
unsigned int nr_balance_failed; /* initialise to 0 */
u64 last_update;
+
+ /* idle_balance() stats */
u64 max_newidle_lb_cost;
+ unsigned long next_decay_max_lb_cost;
#ifdef CONFIG_SCHEDSTATS
/* load_balance() stats */
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -107,6 +107,7 @@ int arch_update_cpu_topology(void);
.balance_interval = 1, \
.smt_gain = 1178, /* 15% */ \
.max_newidle_lb_cost = 0, \
+ .next_decay_max_lb_cost = jiffies, \
}
#endif
#endif /* CONFIG_SCHED_SMT */
@@ -137,6 +138,7 @@ int arch_update_cpu_topology(void);
.last_balance = jiffies, \
.balance_interval = 1, \
.max_newidle_lb_cost = 0, \
+ .next_decay_max_lb_cost = jiffies, \
}
#endif
#endif /* CONFIG_SCHED_MC */
@@ -169,6 +171,7 @@ int arch_update_cpu_topology(void);
.last_balance = jiffies, \
.balance_interval = 1, \
.max_newidle_lb_cost = 0, \
+ .next_decay_max_lb_cost = jiffies, \
}
#endif
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5409,7 +5409,7 @@ void idle_balance(int this_cpu, struct r
for_each_domain(this_cpu, sd) {
unsigned long interval;
int continue_balancing = 1;
- u64 t0, domain_cost, max = 5*sysctl_sched_migration_cost;
+ u64 t0, domain_cost;
if (!(sd->flags & SD_LOAD_BALANCE))
continue;
@@ -5426,8 +5426,6 @@ void idle_balance(int this_cpu, struct r
&continue_balancing);
domain_cost = sched_clock_cpu(this_cpu) - t0;
- if (domain_cost > max)
- domain_cost = max;
if (domain_cost > sd->max_newidle_lb_cost)
sd->max_newidle_lb_cost = domain_cost;
@@ -5680,15 +5678,39 @@ static void rebalance_domains(int cpu, e
/* Earliest time when we have to do rebalance again */
unsigned long next_balance = jiffies + 60*HZ;
int update_next_balance = 0;
- int need_serialize;
+ int need_serialize, need_decay = 0;
+ u64 max_cost = 0;
update_blocked_averages(cpu);
rcu_read_lock();
for_each_domain(cpu, sd) {
+ /*
+ * Decay the newidle max times here because this is a regular
+ * visit to all the domains. Decay ~0.5% per second.
+ */
+ if (time_after(jiffies, sd->next_decay_max_lb_cost)) {
+ sd->max_newidle_lb_cost =
+ (sd->max_newidle_lb_cost * 254) / 256;
+ sd->next_decay_max_lb_cost = jiffies + HZ;
+ max_cost = max(max_cost, sd->max_newidle_lb_cost);
+ need_decay = 1;
+ }
+
if (!(sd->flags & SD_LOAD_BALANCE))
continue;
+ /*
+ * Stop the load balance at this level. There is another
+ * CPU in our sched group which is doing load balancing more
+ * actively.
+ */
+ if (!continue_balancing) {
+ if (need_decay)
+ continue;
+ break;
+ }
+
interval = sd->balance_interval;
if (idle != CPU_IDLE)
interval *= sd->busy_factor;
@@ -5722,14 +5744,14 @@ static void rebalance_domains(int cpu, e
next_balance = sd->last_balance + interval;
update_next_balance = 1;
}
-
+ }
+ if (need_decay) {
/*
- * Stop the load balance at this level. There is another
- * CPU in our sched group which is doing load balancing more
- * actively.
+ * Ensure the rq-wide value also decays but keep it at a
+ * reasonable floor to avoid funnies with rq->avg_idle.
*/
- if (!continue_balancing)
- break;
+ rq->max_idle_balance_cost =
+ min(sysctl_sched_migration_cost, max_cost);
}
rcu_read_unlock();
On Fri, Aug 30, 2013 at 12:18:17PM +0200, Peter Zijlstra wrote:
> for_each_domain(cpu, sd) {
> + /*
> + * Decay the newidle max times here because this is a regular
> + * visit to all the domains. Decay ~0.5% per second.
> + */
> + if (time_after(jiffies, sd->next_decay_max_lb_cost)) {
> + sd->max_newidle_lb_cost =
> + (sd->max_newidle_lb_cost * 254) / 256;
> + sd->next_decay_max_lb_cost = jiffies + HZ;
> + max_cost = max(max_cost, sd->max_newidle_lb_cost);
> + need_decay = 1;
> + }
We need to pull the max_cost thing out of that condition, for we could
possibly trigger need_decay half-way through the iteration.
That said, why is it: max(sd->max_newidle_lb_cost) and not
sum(sd->max_newidle_lb_cost)?
The sum seems like the more reasonable thing since we're actually
computing the sum over in idle_balance().
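For example (made-up numbers): with per-domain newidle costs of 50us (SMT),
200us (MC) and 400us (NUMA), the max is 400us, while the sum, which is what
curr_cost accumulates in idle_balance(), is 650us.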
> + if (need_decay) {
> /*
> - * Stop the load balance at this level. There is another
> - * CPU in our sched group which is doing load balancing more
> - * actively.
> + * Ensure the rq-wide value also decays but keep it at a
> + * reasonable floor to avoid funnies with rq->avg_idle.
> */
> - if (!continue_balancing)
> - break;
> + rq->max_idle_balance_cost =
> + min(sysctl_sched_migration_cost, max_cost);
OK, that gives an ugly compile warn, and that should obviously have
been max, so I suppose max((u64)sysctl_sched_migration_cost, max_cost)
would cure that.
Given all that I end up with:
---
Subject: sched: Periodically decay max cost of idle balance
From: Jason Low <[email protected]>
Date: Thu, 29 Aug 2013 13:05:36 -0700
This RFC patch builds on patch 2 and periodically decays that max value to
do idle balancing per sched domain.
Though we want to decay it fairly consistently, we may not want to lower it by
too much each time, especially since avg_idle is capped based on that value.
So I thought that decaying the value every second and lowering it by half a
percent each time appeared to be fairly reasonable.
This change would allow us to remove the limit we set on each domain's max cost
to idle balance. Also, since the max can be reduced now, we now have to
update rq->max_idle_balance_cost more frequently. So after every idle balance,
we loop through the sched domain to find the max sd's newidle load balance cost
for any one domain. Then we will set rq->max_idle_balance_cost to that value.
Since we are now decaying the max cost to do idle balancing, that max cost can
also become not high enough. One possible explanation for why is that
besides the time spent on each newidle load balance, there are other costs
associated with attempting idle balancing. Idle balance also releases and
reacquires a spin lock. That cost is not counted when we keep track of each
domain's cost to do newidle load balance. Also, acquiring the rq locks can
potentially prevent other CPUs from running something useful. And after
migrating tasks, it might potentially have to pay the costs of cache misses and
refreshing tasks' cache.
Because of that, this patch also compares avg_idle with max cost to do idle
balancing + sched_migration_cost. While using the max cost helps reduce
overestimating the average idle, the sched_migration_cost can help account
for those additional costs of idle balancing.
Signed-off-by: Jason Low <[email protected]>
[peterz: rewrote the logic, but kept the spirit]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
arch/metag/include/asm/topology.h | 1
include/linux/sched.h | 3 ++
include/linux/topology.h | 3 ++
kernel/sched/fair.c | 44 ++++++++++++++++++++++++++++----------
4 files changed, 40 insertions(+), 11 deletions(-)
--- a/arch/metag/include/asm/topology.h
+++ b/arch/metag/include/asm/topology.h
@@ -27,6 +27,7 @@
.balance_interval = 1, \
.nr_balance_failed = 0, \
.max_newidle_lb_cost = 0, \
+ .next_decay_max_lb_cost = jiffies, \
}
#define cpu_to_node(cpu) ((void)(cpu), 0)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -818,7 +818,10 @@ struct sched_domain {
unsigned int nr_balance_failed; /* initialise to 0 */
u64 last_update;
+
+ /* idle_balance() stats */
u64 max_newidle_lb_cost;
+ unsigned long next_decay_max_lb_cost;
#ifdef CONFIG_SCHEDSTATS
/* load_balance() stats */
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -107,6 +107,7 @@ int arch_update_cpu_topology(void);
.balance_interval = 1, \
.smt_gain = 1178, /* 15% */ \
.max_newidle_lb_cost = 0, \
+ .next_decay_max_lb_cost = jiffies, \
}
#endif
#endif /* CONFIG_SCHED_SMT */
@@ -137,6 +138,7 @@ int arch_update_cpu_topology(void);
.last_balance = jiffies, \
.balance_interval = 1, \
.max_newidle_lb_cost = 0, \
+ .next_decay_max_lb_cost = jiffies, \
}
#endif
#endif /* CONFIG_SCHED_MC */
@@ -169,6 +171,7 @@ int arch_update_cpu_topology(void);
.last_balance = jiffies, \
.balance_interval = 1, \
.max_newidle_lb_cost = 0, \
+ .next_decay_max_lb_cost = jiffies, \
}
#endif
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5409,7 +5409,7 @@ void idle_balance(int this_cpu, struct r
for_each_domain(this_cpu, sd) {
unsigned long interval;
int continue_balancing = 1;
- u64 t0, domain_cost, max = 5*sysctl_sched_migration_cost;
+ u64 t0, domain_cost;
if (!(sd->flags & SD_LOAD_BALANCE))
continue;
@@ -5426,8 +5426,6 @@ void idle_balance(int this_cpu, struct r
&continue_balancing);
domain_cost = sched_clock_cpu(this_cpu) - t0;
- if (domain_cost > max)
- domain_cost = max;
if (domain_cost > sd->max_newidle_lb_cost)
sd->max_newidle_lb_cost = domain_cost;
@@ -5680,15 +5678,39 @@ static void rebalance_domains(int cpu, e
/* Earliest time when we have to do rebalance again */
unsigned long next_balance = jiffies + 60*HZ;
int update_next_balance = 0;
- int need_serialize;
+ int need_serialize, need_decay = 0;
+ u64 max_cost = 0;
update_blocked_averages(cpu);
rcu_read_lock();
for_each_domain(cpu, sd) {
+ /*
+ * Decay the newidle max times here because this is a regular
+ * visit to all the domains. Decay ~0.5% per second.
+ */
+ if (time_after(jiffies, sd->next_decay_max_lb_cost)) {
+ sd->max_newidle_lb_cost =
+ (sd->max_newidle_lb_cost * 254) / 256;
+ sd->next_decay_max_lb_cost = jiffies + HZ;
+ need_decay = 1;
+ }
+ max_cost += sd->max_newidle_lb_cost;
+
if (!(sd->flags & SD_LOAD_BALANCE))
continue;
+ /*
+ * Stop the load balance at this level. There is another
+ * CPU in our sched group which is doing load balancing more
+ * actively.
+ */
+ if (!continue_balancing) {
+ if (need_decay)
+ continue;
+ break;
+ }
+
interval = sd->balance_interval;
if (idle != CPU_IDLE)
interval *= sd->busy_factor;
@@ -5722,14 +5744,14 @@ static void rebalance_domains(int cpu, e
next_balance = sd->last_balance + interval;
update_next_balance = 1;
}
-
- /*
- * Stop the load balance at this level. There is another
- * CPU in our sched group which is doing load balancing more
- * actively.
+ }
+ if (need_decay) {
+ /*
+ * Ensure the rq-wide value also decays but keep it at a
+ * reasonable floor to avoid funnies with rq->avg_idle.
*/
- if (!continue_balancing)
- break;
+ rq->max_idle_balance_cost =
+ max((u64)sysctl_sched_migration_cost, max_cost);
}
rcu_read_unlock();
* Jason Low <[email protected]> [2013-08-29 13:05:34]:
> When updating avg_idle, if the delta exceeds some max value, then avg_idle
> gets set to the max, regardless of what the previous avg was. This can cause
> avg_idle to often be overestimated.
>
> This patch modifies the way we update avg_idle by always updating it with the
> function call to update_avg() first. Then, if avg_idle exceeds the max, we set
> it to the max.
>
> Signed-off-by: Jason Low <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Srikar Dronamraju <[email protected]>
> ---
* Jason Low <[email protected]> [2013-08-29 13:05:35]:
> + u64 curr_cost = 0;
>
> this_rq->idle_stamp = rq_clock(this_rq);
>
> - if (this_rq->avg_idle < sysctl_sched_migration_cost)
> + if (this_rq->avg_idle < this_rq->max_idle_balance_cost)
> return;
>
Since max_idle_balance_cost includes the cost of balances across all
domains, can a higher balance cost at a higher domain result in not doing
load balance at a lower level?
Shouldn't the check below for sd->max_newidle_lb_cost mean that we can
actually do away with this check?
> /*
> @@ -5299,14 +5300,29 @@ void idle_balance(int this_cpu, struct rq *this_rq)
> for_each_domain(this_cpu, sd) {
> unsigned long interval;
> int balance = 1;
> + u64 t0, domain_cost, max = 5*sysctl_sched_migration_cost;
>
> if (!(sd->flags & SD_LOAD_BALANCE))
> continue;
>
> + if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
> + break;
I am referring to this check in my above comment.
> +
> if (sd->flags & SD_BALANCE_NEWIDLE) {
> + t0 = sched_clock_cpu(smp_processor_id());
> +
> /* If we've pulled tasks over stop searching: */
> pulled_task = load_balance(this_cpu, this_rq,
> sd, CPU_NEWLY_IDLE, &balance);
> +
> + domain_cost = sched_clock_cpu(smp_processor_id()) - t0;
> + if (domain_cost > max)
> + domain_cost = max;
> +
> + if (domain_cost > sd->max_newidle_lb_cost)
> + sd->max_newidle_lb_cost = domain_cost;
If we face runq lock contention, then domain_cost can go up.
The runq lock contention could be temporary, but we carry the domain
cost forever (i.e. till the next reboot). How about averaging the cost, plus a
penalty for an unsuccessful balance?
Something like
domain_cost = sched_clock_cpu(smp_processor_id()) - t0;
if (!pulled_task)
domain_cost *= 2;
sd->max_newidle_lb_cost += domain_cost;
sd->max_newidle_lb_cost /= 2;
Maybe the name could then change to avg_newidle_lb_cost.
> +
> + curr_cost += domain_cost;
> }
>
--
Thanks and Regards
Srikar Dronamraju
On Mon, 2013-09-02 at 12:24 +0530, Srikar Dronamraju wrote:
> If we face a runq lock contention, then domain_cost can go up.
> The runq lock contention could be temporary, but we carry the domain
> cost forever (i.e till the next reboot). How about averaging the cost +
> penalty for unsuccessful balance.
>
> Something like
> domain_cost = sched_clock_cpu(smp_processor_id()) - t0;
> if (!pulled_task)
> domain_cost *= 2;
>
> sd->max_newidle_lb_cost += domain_cost;
> sd->max_newidle_lb_cost /= 2;
>
>
> Maybe the name could then change to avg_newidle_lb_cost.
>
> > +
> > + curr_cost += domain_cost;
> > }
> >
We tried keeping track of the avg in the v2 patch. It didn't really help
reduce the contention in idle balancing and we needed to also reduce
avg_idle by a factor of 10-20+ when comparing it to
avg_idle_balance_cost in order to get the good performance boosts.
One potential explanation is that the avg idle balance cost can often
have a large variation. That makes both the avg_idle estimate and the
comparison of avg_idle against an average idle balance cost fairly
inconsistent.
I think using the max keeps the cost estimate more stable, so that comparing
avg_idle against the "idle balance cost" is more meaningful. It also helps
reduce the chance avg_idle overruns the balance cost. Patch 3 decays the max
cost so that a high value isn't kept until the next reboot.
On Fri, 2013-08-30 at 12:29 +0200, Peter Zijlstra wrote:
> rcu_read_lock();
> for_each_domain(cpu, sd) {
> + /*
> + * Decay the newidle max times here because this is a regular
> + * visit to all the domains. Decay ~0.5% per second.
> + */
> + if (time_after(jiffies, sd->next_decay_max_lb_cost)) {
> + sd->max_newidle_lb_cost =
> + (sd->max_newidle_lb_cost * 254) / 256;
I initially picked 0.5%, but after trying it out, it appears to decay very
slowly when the max is at a high value. Should we increase the decay a
little bit more? Maybe something like:
sd->max_newidle_lb_cost = (sd->max_newidle_lb_cost * 63) / 64;
> + /*
> + * Stop the load balance at this level. There is another
> + * CPU in our sched group which is doing load balancing more
> + * actively.
> + */
> + if (!continue_balancing) {
Is "continue_balancing" named "balance" in older kernels?
Here are the AIM7 results with the other 2 patches + this patch with the
slightly higher decay value.
----------------------------------------------------------------
workload | % improvement | % improvement | % improvement
| with patch | with patch | with patch
| 1100-2000 users | 200-1000 users | 10-100 users
----------------------------------------------------------------
alltests | +9.2% | +5.2% | +0.3%
----------------------------------------------------------------
compute | +0.0% | -0.9% | +0.6%
----------------------------------------------------------------
custom | +18.6% | +15.3% | +7.0%
----------------------------------------------------------------
disk | +4.0% | +16.5% | +7.1%
----------------------------------------------------------------
fserver | +64.8% | +27.5% | -0.6%
----------------------------------------------------------------
high_systime | +15.1% | +7.9% | +0.0%
----------------------------------------------------------------
new_fserver | +51.0% | +20.1% | -1.3%
----------------------------------------------------------------
shared | +6.3% | +8.8% | +2.8%
----------------------------------------------------------------
On Fri, 2013-08-30 at 12:18 +0200, Peter Zijlstra wrote:
> On Thu, Aug 29, 2013 at 01:05:36PM -0700, Jason Low wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 58b0514..bba5a07 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1345,7 +1345,7 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
> >
> > if (rq->idle_stamp) {
> > u64 delta = rq_clock(rq) - rq->idle_stamp;
> > - u64 max = 2*rq->max_idle_balance_cost;
> > + u64 max = 2*(sysctl_sched_migration_cost + rq->max_idle_balance_cost);
>
> You re-introduce sched_migration_cost here because max_idle_balance_cost
> can now drop down to 0 again?
Yes it was so that max_idle_balance_cost would be at least sched_migration_cost
and that we would still skip idle_balance if avg_idle < sched_migration_cost.
I also initially thought that adding sched_migration_cost would also account for
the extra "costs" of idle balancing that are not accounted for in the time spent
on each newidle load balance. Come to think of it though, sched_migration_cost
might be too large when used in that context considering we're already using the
max cost.
On Tue, Sep 03, 2013 at 11:02:59PM -0700, Jason Low wrote:
> On Fri, 2013-08-30 at 12:29 +0200, Peter Zijlstra wrote:
> > rcu_read_lock();
> > for_each_domain(cpu, sd) {
> > + /*
> > + * Decay the newidle max times here because this is a regular
> > + * visit to all the domains. Decay ~0.5% per second.
> > + */
> > + if (time_after(jiffies, sd->next_decay_max_lb_cost)) {
> > + sd->max_newidle_lb_cost =
> > + (sd->max_newidle_lb_cost * 254) / 256;
>
> I initially picked 0.5%, but after trying it out, it appears to decay very
> slowly when the max is at a high value. Should we increase the decay a
> little bit more? Maybe something like:
>
> sd->max_newidle_lb_cost = (sd->max_newidle_lb_cost * 63) / 64;
So the half-life in either case is given by:
n = ln(1/2) / ln(x)
which gives 88 seconds for x := 254/256 or 44 seconds for x := 63/64.
I don't really care too much, but obviously something like:
256*exp(ln(.5)/60) ~= 253
Is attractive ;-)
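(Working those out as a quick check: ln(254/256) ~= -0.00784, so
n ~= 0.693 / 0.00784 ~= 88s; ln(63/64) ~= -0.01575, so n ~= 44s; and 253/256
per second indeed gives a half-life of roughly 60 seconds.)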
> > + /*
> > + * Stop the load balance at this level. There is another
> > + * CPU in our sched group which is doing load balancing more
> > + * actively.
> > + */
> > + if (!continue_balancing) {
>
> Is "continue_balancing" named "balance" in older kernels?
Yeah, this patch crossed paths with a series remodeling the
load-balancer a bit, that should all be pushed-out to tip/master.
In particular see commit:
23f0d20 sched: Factor out code to should_we_balance()
> Here are the AIM7 results with the other 2 patches + this patch with the
> slightly higher decay value.
Just to clarify, 'this patch' is the one I sent?
On Wed, Sep 04, 2013 at 12:10:01AM -0700, Jason Low wrote:
> On Fri, 2013-08-30 at 12:18 +0200, Peter Zijlstra wrote:
> > On Thu, Aug 29, 2013 at 01:05:36PM -0700, Jason Low wrote:
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 58b0514..bba5a07 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -1345,7 +1345,7 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
> > >
> > > if (rq->idle_stamp) {
> > > u64 delta = rq_clock(rq) - rq->idle_stamp;
> > > - u64 max = 2*rq->max_idle_balance_cost;
> > > + u64 max = 2*(sysctl_sched_migration_cost + rq->max_idle_balance_cost);
> >
> > You re-introduce sched_migration_cost here because max_idle_balance_cost
> > can now drop down to 0 again?
>
> Yes it was so that max_idle_balance_cost would be at least sched_migration_cost
> and that we would still skip idle_balance if avg_idle < sched_migration_cost.
>
> I also initially thought that adding sched_migration_cost would also account for
> the extra "costs" of idle balancing that are not accounted for in the time spent
> on each newidle load balance. Come to think of it though, sched_migration_cost
> might be too large when used in that context considering we're already using the
> max cost.
Right, so shall we do as Srikar suggests and drop that initial check?
On Mon, 2013-09-09 at 13:44 +0200, Peter Zijlstra wrote:
> On Tue, Sep 03, 2013 at 11:02:59PM -0700, Jason Low wrote:
> > On Fri, 2013-08-30 at 12:29 +0200, Peter Zijlstra wrote:
> > > rcu_read_lock();
> > > for_each_domain(cpu, sd) {
> > > + /*
> > > + * Decay the newidle max times here because this is a regular
> > > + * visit to all the domains. Decay ~0.5% per second.
> > > + */
> > > + if (time_after(jiffies, sd->next_decay_max_lb_cost)) {
> > > + sd->max_newidle_lb_cost =
> > > + (sd->max_newidle_lb_cost * 254) / 256;
> >
> > I initially picked 0.5%, but after trying it out, it appears to decay very
> > slowly when the max is at a high value. Should we increase the decay a
> > little bit more? Maybe something like:
> >
> > sd->max_newidle_lb_cost = (sd->max_newidle_lb_cost * 63) / 64;
>
> So the half-life in either case is given by:
>
> n = ln(1/2) / ln(x)
>
> which gives 88 seconds for x := 254/256 or 44 seconds for x := 63/64.
>
> I don't really care too much, but obviously something like:
>
> 256*exp(ln(.5)/60) ~= 253
>
> Is attractive ;-)
>
> > > + /*
> > > + * Stop the load balance at this level. There is another
> > > + * CPU in our sched group which is doing load balancing more
> > > + * actively.
> > > + */
> > > + if (!continue_balancing) {
> >
> > Is "continue_balancing" named "balance" in older kernels?
>
> Yeah, this patch crossed paths with a series remodeling the
> load-balancer a bit, that should all be pushed-out to tip/master.
>
> In particular see commit:
> 23f0d20 sched: Factor out code to should_we_balance()
>
> > Here are the AIM7 results with the other 2 patches + this patch with the
> > slightly higher decay value.
>
> Just to clarify, 'this patch' is the one I sent?
Yes, I was referring to the one you sent out. Also, I did the
s/smp_processor_id()/this_cpu/ on patch 2.
On Mon, 2013-09-09 at 13:49 +0200, Peter Zijlstra wrote:
> On Wed, Sep 04, 2013 at 12:10:01AM -0700, Jason Low wrote:
> > On Fri, 2013-08-30 at 12:18 +0200, Peter Zijlstra wrote:
> > > On Thu, Aug 29, 2013 at 01:05:36PM -0700, Jason Low wrote:
> > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > index 58b0514..bba5a07 100644
> > > > --- a/kernel/sched/core.c
> > > > +++ b/kernel/sched/core.c
> > > > @@ -1345,7 +1345,7 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
> > > >
> > > > if (rq->idle_stamp) {
> > > > u64 delta = rq_clock(rq) - rq->idle_stamp;
> > > > - u64 max = 2*rq->max_idle_balance_cost;
> > > > + u64 max = 2*(sysctl_sched_migration_cost + rq->max_idle_balance_cost);
> > >
> > > You re-introduce sched_migration_cost here because max_idle_balance_cost
> > > can now drop down to 0 again?
> >
> > Yes it was so that max_idle_balance_cost would be at least sched_migration_cost
> > and that we would still skip idle_balance if avg_idle < sched_migration_cost.
> >
> > I also initially thought that adding sched_migration_cost would also account for
> > the extra "costs" of idle balancing that are not accounted for in the time spent
> > on each newidle load balance. Come to think of it though, sched_migration_cost
> > might be too large when used in that context considering we're already using the
> > max cost.
>
> Right, so shall we do as Srikar suggests and drop that initial check?
I agree that we can delete the check between avg_idle and max_idle_balance_cost
so that large costs in higher domains don't cause balancing to be skipped in
lower domains as Srikar suggested. Should we keep the old
"if (this_rq->avg_idle < sysctl_sched_migration_cost)" check?
Also, are the other costs of idle balancing, specifically the cost of cache
refreshes, not considered in the max_newidle_lb_cost? I was wondering if there
is an inexpensive way we can also take those into account. For example, can we
multiply max_newidle_lb_cost by a factor of 1.2x to 2x, but also increase the
decay factor (4% to 10% per second)?
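(For reference, using the half-life formula from earlier in the thread, a 4%
per second decay halves the max in about 17 seconds and a 10% per second decay
in about 6.6 seconds, versus roughly 88 seconds at the current 254/256 rate.)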
On Mon, 2013-09-09 at 14:07 -0700, Jason Low wrote:
> On Mon, 2013-09-09 at 13:49 +0200, Peter Zijlstra wrote:
> > On Wed, Sep 04, 2013 at 12:10:01AM -0700, Jason Low wrote:
> > > On Fri, 2013-08-30 at 12:18 +0200, Peter Zijlstra wrote:
> > > > On Thu, Aug 29, 2013 at 01:05:36PM -0700, Jason Low wrote:
> > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > > index 58b0514..bba5a07 100644
> > > > > --- a/kernel/sched/core.c
> > > > > +++ b/kernel/sched/core.c
> > > > > @@ -1345,7 +1345,7 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
> > > > >
> > > > > if (rq->idle_stamp) {
> > > > > u64 delta = rq_clock(rq) - rq->idle_stamp;
> > > > > - u64 max = 2*rq->max_idle_balance_cost;
> > > > > + u64 max = 2*(sysctl_sched_migration_cost + rq->max_idle_balance_cost);
> > > >
> > > > You re-introduce sched_migration_cost here because max_idle_balance_cost
> > > > can now drop down to 0 again?
> > >
> > > Yes it was so that max_idle_balance_cost would be at least sched_migration_cost
> > > and that we would still skip idle_balance if avg_idle < sched_migration_cost.
> > >
> > > I also initially thought that adding sched_migration_cost would also account for
> > > the extra "costs" of idle balancing that are not accounted for in the time spent
> > > on each newidle load balance. Come to think of it though, sched_migration_cost
> > > might be too large when used in that context considering we're already using the
> > > max cost.
> >
> > Right, so shall we do as Srikar suggests and drop that initial check?
>
> I agree that we can delete the check between avg_idle and max_idle_balance_cost
> so that large costs in higher domains don't cause balancing to be skipped in
> lower domains as Srikar suggested. Should we keep the old
> "if (this_rq->avg_idle < sysctl_sched_migration_cost)" check?
It was put there to allow cross core scheduling to recover as much
overlap as possible, so that rapidly switching communicating tasks with only a
small recoverable overlap in the first place don't get pounded to pulp
by overhead instead. If a different way does a better job, whack it.
-Mike
Jason, could you repost all 3 patches with the changes we talked
about (Srikar's suggestion, changing the half-life, etc.) with fresh
perf numbers so I can apply them all?