When idle balance runs frequently, functions such as load_balance(),
update_sd_lb_stats(), and tg_load_down(), along with acquiring run queue
locks, can significantly increase time spent in the kernel. Beyond the
kernel time spent in those functions directly, frequent idle balancing may
also indirectly increase the percentage of time spent acquiring other
kernel locks and reduce the percentage of time spent completing user space
functions. For example, when running fserver with a high number of users on
an 8 socket box with HT enabled, a lot of time was spent in
__read_lock_failed(), __write_lock_failed(), update_sd_lb_stats(),
tg_load_down(), etc., as reported by perf. Once we reduced the rate at
which idle_balance occurs, the percentage of time spent in those functions
decreased, and the percentage of time spent in user space functions such as
add_long() increased.
Below is a sample perf report from running fserver with a high number of
users, when idle_balance is allowed to run often:
18.01%  reaim    [k] mspin_lock
16.42%  reaim    [k] __read_lock_failed
10.27%  reaim    [k] __write_lock_failed
 2.84%  reaim    [k] _raw_spin_lock
 2.42%  reaim    [k] mutex_spin_on_owner
 2.21%  reaim    [k] update_sd_lb_stats
 1.93%  reaim    [k] start_this_handle
 1.75%  reaim    [.] add_int
 1.74%  reaim    [.] add_long
 1.59%  reaim    [k] update_cfs_rq_blocked_load
 1.42%  reaim    [k] tg_load_down
 1.35%  swapper  [k] intel_idle
 1.22%  reaim    [.] div_long
 1.21%  reaim    [k] _raw_spin_lock_irqsave
 1.18%  reaim    [k] load_balance
Below is a perf report when we lower the rate at which idle_balance is attempted:
14.84%  reaim    [k] mspin_lock
 7.03%  reaim    [.] add_int
 6.99%  reaim    [.] add_long
 4.49%  reaim    [.] div_long
 4.18%  reaim    [.] strncat
 3.98%  reaim    [.] string_rtns_1
 3.94%  reaim    [k] mutex_spin_on_owner
 3.27%  reaim    [.] add_short
 2.77%  reaim    [k] __read_lock_failed
 1.77%  swapper  [k] intel_idle
 1.69%  reaim    [.] msort_with_tmp
 1.69%  reaim    [.] div_short
 1.66%  reaim    [.] div_int
 1.49%  reaim    [.] _int_malloc
 1.23%  reaim    [k] __write_lock_failed
However, this does not necessarily mean we should remove idle balancing or
permanently lower the rate at which it may be attempted. If a CPU remains
idle a lot longer than the overhead of doing idle balance, then we should
allow it. This patch keeps track of each CPU's "approximate" average idle
time and the average cost of attempting each load_balance per sched domain.
Load balancing gets skipped if the approximate cost of load balancing would
be greater than N% of the approximate time the CPU remains idle. Currently,
N is set to 10%, though I'm searching for a more "ideal" way to compute this.
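Condensed, the check added at the top of each domain iteration in
idle_balance() amounts to the following sketch (same names as in the patch
below, with N = 10):

	u64 cost = 0;
	u64 idle_duration = this_rq->idle_duration;

	for_each_domain(this_cpu, sd) {
		/*
		 * Stop searching once the estimated balance cost, scaled
		 * by N, exceeds the time this CPU is expected to stay idle.
		 */
		if (idle_duration / 10 < sd->avg_idle_balance_cost + cost)
			break;
		/* ... attempt load_balance() and update the averages ... */
	}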
Suggested-by: Peter Zijlstra <[email protected]>
Suggested-by: Rik van Riel <[email protected]>
Signed-off-by: Jason Low <[email protected]>
---
arch/metag/include/asm/topology.h | 1 +
include/linux/sched.h | 1 +
include/linux/topology.h | 3 +++
kernel/sched/core.c | 3 +++
kernel/sched/debug.c | 1 +
kernel/sched/fair.c | 19 +++++++++++++++++++
kernel/sched/sched.h | 4 ++++
7 files changed, 32 insertions(+), 0 deletions(-)
diff --git a/arch/metag/include/asm/topology.h b/arch/metag/include/asm/topology.h
index 23f5118..8f0ebeb 100644
--- a/arch/metag/include/asm/topology.h
+++ b/arch/metag/include/asm/topology.h
@@ -26,6 +26,7 @@
.last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
+ .avg_idle_balance_cost = 0, \
}
#define cpu_to_node(cpu) ((void)(cpu), 0)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 178a8d9..1eb94c2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -809,6 +809,7 @@ struct sched_domain {
unsigned int wake_idx;
unsigned int forkexec_idx;
unsigned int smt_gain;
+ u64 avg_idle_balance_cost;
int nohz_idle; /* NOHZ IDLE status */
int flags; /* See SD_* */
diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..59ee2e8 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -106,6 +106,7 @@ int arch_update_cpu_topology(void);
.last_balance = jiffies, \
.balance_interval = 1, \
.smt_gain = 1178, /* 15% */ \
+ .avg_idle_balance_cost = 0, \
}
#endif
#endif /* CONFIG_SCHED_SMT */
@@ -135,6 +136,7 @@ int arch_update_cpu_topology(void);
, \
.last_balance = jiffies, \
.balance_interval = 1, \
+ .avg_idle_balance_cost = 0, \
}
#endif
#endif /* CONFIG_SCHED_MC */
@@ -166,6 +168,7 @@ int arch_update_cpu_topology(void);
, \
.last_balance = jiffies, \
.balance_interval = 1, \
+ .avg_idle_balance_cost = 0, \
}
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e8b3350..da2cb3e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1348,6 +1348,8 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
else
update_avg(&rq->avg_idle, delta);
rq->idle_stamp = 0;
+
+ rq->idle_duration = (rq->idle_duration + delta) / 2;
}
#endif
}
@@ -7027,6 +7029,7 @@ void __init sched_init(void)
rq->online = 0;
rq->idle_stamp = 0;
rq->avg_idle = 2*sysctl_sched_migration_cost;
+ rq->idle_duration = 0;
INIT_LIST_HEAD(&rq->cfs_tasks);
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 75024a6..a3f062c 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -307,6 +307,7 @@ do { \
P(sched_goidle);
#ifdef CONFIG_SMP
P64(avg_idle);
+ P64(idle_duration);
#endif
P(ttwu_count);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c61a614..da7ddd6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5240,6 +5240,8 @@ void idle_balance(int this_cpu, struct rq *this_rq)
struct sched_domain *sd;
int pulled_task = 0;
unsigned long next_balance = jiffies + HZ;
+ u64 cost = 0;
+ u64 idle_duration = this_rq->idle_duration;
this_rq->idle_stamp = this_rq->clock;
@@ -5256,14 +5258,31 @@ void idle_balance(int this_cpu, struct rq *this_rq)
for_each_domain(this_cpu, sd) {
unsigned long interval;
int balance = 1;
+ u64 this_domain_balance_cost = 0;
+ u64 start_time;
if (!(sd->flags & SD_LOAD_BALANCE))
continue;
+ /*
+ * If the time for which this_cpu remains idle is not much higher than the
+ * cost of attempting idle balancing within this domain, then stop searching.
+ */
+ if (idle_duration / 10 < (sd->avg_idle_balance_cost + cost))
+ break;
+
if (sd->flags & SD_BALANCE_NEWIDLE) {
+ start_time = sched_clock_cpu(smp_processor_id());
+
/* If we've pulled tasks over stop searching: */
pulled_task = load_balance(this_cpu, this_rq,
sd, CPU_NEWLY_IDLE, &balance);
+
+ this_domain_balance_cost = sched_clock_cpu(smp_processor_id()) - start_time;
+
+ sd->avg_idle_balance_cost =
+ (sd->avg_idle_balance_cost + this_domain_balance_cost) / 2;
+ cost += this_domain_balance_cost;
}
interval = msecs_to_jiffies(sd->balance_interval);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ce39224..6a7e620 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -521,6 +521,10 @@ struct rq {
#endif
struct sched_avg avg;
+
+#ifdef CONFIG_SMP
+ u64 idle_duration;
+#endif
};
static inline int cpu_of(struct rq *rq)
--
1.7.1
Hi Jason,
I ran ebizzy and kernbench benchmarks on your 3.11-rc1 + your "V1
patch" on a 1 socket, 16 core powerpc machine. I thought I would let you
know the results before I try your V2.
Ebizzy: 30 seconds run. The table below shows the improvement in the
number of records completed. I have not spent enough time on the patch
to explain such a big improvement.
Number_of_threads    %improvement_with_patch
        4                    41.86%
        8                     9.8%
       12                    34.77%
       16                    28.37%
While on kernbench there was no significant change in the observation.
I will try patch V2 and let you know the results.
Regards
Preeti U Murthy
On Fri, 2013-07-19 at 16:54 +0530, Preeti U Murthy wrote:
> Hi Jason,
>
> I ran ebizzy and kernbench benchmarks on your 3.11-rc1 + your "V1
> patch" on a 1 socket, 16 core powerpc machine. I thought I would let you
> know the results before I try your V2.
>
> Ebizzy: 30 seconds run. The table below shows the improvement in the
> number of records completed. I have not spent enough time on the patch
> to explain such a big improvement.
>
> Number_of_threads    %improvement_with_patch
>         4                    41.86%
>         8                     9.8%
>        12                    34.77%
>        16                    28.37%
>
> While on kernbench there was no significant change in the observation.
>
> I will try patch V2 and let you know the results.
Great to see those improvements so far. Thank you for testing this.
Jason
Hi Jason,
With V2 of your patch here are the results for the ebizzy run on
3.11-rc1 + patch on a 1 socket, 16 core powerpc machine. Each ebizzy
run was for 30 seconds.
Number_of_threads    %improvement_with_patch
        4                     8.63
        8                     1.29
       12                     9.98
       16                    20.46
Let me know if you want me to profile any of these runs for specific
statistics.
Regards
Preeti U Murthy
On 07/20/2013 12:58 AM, Jason Low wrote:
> On Fri, 2013-07-19 at 16:54 +0530, Preeti U Murthy wrote:
>> Hi Jason,
>>
>> I ran ebizzy and kernbench benchmarks on your 3.11-rc1 + your "V1
>> patch" on a 1 socket, 16 core powerpc machine. I thought I would let you
>> know the results before I try your V2.
>>
>> Ebizzy: 30 seconds run. The table below shows the improvement in the
>> number of records completed. I have not spent enough time on the patch
>> to explain such a big improvement.
>>
>> Number_of_threads    %improvement_with_patch
>>         4                    41.86%
>>         8                     9.8%
>>        12                    34.77%
>>        16                    28.37%
>>
>> While on kernbench there was no significant change in the observation.
>>
>> I will try patch V2 and let you know the results.
>
> Great to see those improvements so far. Thank you for testing this.
>
> Jason
>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e8b3350..da2cb3e 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1348,6 +1348,8 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
> else
> update_avg(&rq->avg_idle, delta);
> rq->idle_stamp = 0;
> +
> + rq->idle_duration = (rq->idle_duration + delta) / 2;
Can't we just use avg_idle instead of introducing idle_duration?
> }
> #endif
> }
> @@ -7027,6 +7029,7 @@ void __init sched_init(void)
> rq->online = 0;
> rq->idle_stamp = 0;
> rq->avg_idle = 2*sysctl_sched_migration_cost;
> + rq->idle_duration = 0;
>
> INIT_LIST_HEAD(&rq->cfs_tasks);
>
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 75024a6..a3f062c 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -307,6 +307,7 @@ do { \
> P(sched_goidle);
> #ifdef CONFIG_SMP
> P64(avg_idle);
> + P64(idle_duration);
> #endif
>
> P(ttwu_count);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c61a614..da7ddd6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5240,6 +5240,8 @@ void idle_balance(int this_cpu, struct rq *this_rq)
> struct sched_domain *sd;
> int pulled_task = 0;
> unsigned long next_balance = jiffies + HZ;
> + u64 cost = 0;
> + u64 idle_duration = this_rq->idle_duration;
>
> this_rq->idle_stamp = this_rq->clock;
>
> @@ -5256,14 +5258,31 @@ void idle_balance(int this_cpu, struct rq *this_rq)
> for_each_domain(this_cpu, sd) {
> unsigned long interval;
> int balance = 1;
> + u64 this_domain_balance_cost = 0;
> + u64 start_time;
>
> if (!(sd->flags & SD_LOAD_BALANCE))
> continue;
>
> + /*
> + * If the time for which this_cpu remains idle is not much higher than the
> + * cost of attempting idle balancing within this domain, then stop searching.
> + */
> + if (idle_duration / 10 < (sd->avg_idle_balance_cost + cost))
> + break;
> +
> if (sd->flags & SD_BALANCE_NEWIDLE) {
> + start_time = sched_clock_cpu(smp_processor_id());
> +
> /* If we've pulled tasks over stop searching: */
> pulled_task = load_balance(this_cpu, this_rq,
> sd, CPU_NEWLY_IDLE, &balance);
> +
> + this_domain_balance_cost = sched_clock_cpu(smp_processor_id()) - start_time;
Should we take into consideration whether an idle_balance was
successful or not?
How about having a per-sched_domain counter?
For every nth unsuccessful load balance, skip the (n+1)th idle
balance and reset the counter. Also reset the counter on every
successful idle load balance.
I am not sure what a reasonable value for n would be, but maybe we could
try n=3.
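A rough sketch of that idea, with a hypothetical per-domain counter
(the field name is invented for illustration):

	/* In idle_balance(), per domain: */
	if (sd->newidle_fail_count >= n) {
		/* Skip this attempt and reset the counter. */
		sd->newidle_fail_count = 0;
		continue;
	}

	pulled_task = load_balance(this_cpu, this_rq,
				   sd, CPU_NEWLY_IDLE, &balance);
	if (pulled_task)
		sd->newidle_fail_count = 0;	/* reset on success */
	else
		sd->newidle_fail_count++;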
Also, have we checked the performance after adjusting the
sched_migration_cost tunable?
I guess if we increase sched_migration_cost, we should have fewer
newly idle balance requests.
--
Thanks and Regards
Srikar Dronamraju
On Sun, 2013-07-21 at 23:02 +0530, Preeti U Murthy wrote:
> Hi Jason,
>
> With V2 of your patch here are the results for the ebizzy run on
> 3.11-rc1 + patch on a 1 socket, 16 core powerpc machine. Each ebizzy
> run was for 30 seconds.
>
> Number_of_threads    %improvement_with_patch
>         4                     8.63
>         8                     1.29
>        12                     9.98
>        16                    20.46
>
> Let me know if you want me to profile any of these runs for specific
> statistics.
Hi Preeti,
Other numbers that would be interesting to know are the % improvements
when N = 20. That is when I'm getting the bigger performance boosts on
some of the AIM7 workloads.
Would you like to re-run ebizzy with v2 of my patch except replacing
this line of code:
if (idle_duration / 10 < (sd->avg_idle_balance_cost + cost))
with this line of code:
if (idle_duration / 20 < (sd->avg_idle_balance_cost + cost))
Thanks a lot,
Jason
On Mon, 2013-07-22 at 12:31 +0530, Srikar Dronamraju wrote:
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index e8b3350..da2cb3e 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1348,6 +1348,8 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
> > else
> > update_avg(&rq->avg_idle, delta);
> > rq->idle_stamp = 0;
> > +
> > + rq->idle_duration = (rq->idle_duration + delta) / 2;
>
> Can't we just use avg_idle instead of introducing idle_duration?
A potential issue I have found with avg_idle is that it may sometimes not
be accurate enough for the purposes of this patch, because it is always
capped at a max value (default is 1000000 ns). For example, a CPU could
have remained idle for 1 second, yet avg_idle would be set to 1
millisecond. Another question I have is whether we can update avg_idle
at all times without putting a maximum value on it, or increase the
maximum value of avg_idle by a lot.
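For reference, the clamping described above happens in ttwu_do_wakeup();
roughly (paraphrasing the 3.11-era code that the diff above touches):

	if (rq->idle_stamp) {
		u64 delta = rq->clock - rq->idle_stamp;
		u64 max = 2*sysctl_sched_migration_cost; /* 1000000 ns by default */

		if (delta > max)
			rq->avg_idle = max;	/* long idle periods get capped */
		else
			update_avg(&rq->avg_idle, delta);
		rq->idle_stamp = 0;
	}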
> Should we take into consideration whether an idle_balance was
> successful or not?
I recently ran fserver on the 8 socket machine with HT enabled and found
that load balance was succeeding at a higher than average rate, but idle
balance was still lowering the performance of that workload by a lot.
However, it makes sense to allow idle balance to run longer/more often
when it has a higher success rate.
> I am not sure what a reasonable value for n would be, but maybe we could
> try n=3.
Based on some of the data I collected, n = 10 to 20 provides much better
performance increases.
> Also, have we checked the performance after adjusting the
> sched_migration_cost tunable?
>
> I guess if we increase sched_migration_cost, we should have fewer
> newly idle balance requests.
Yes, I have done quite a bit of testing with sched_migration_cost, and
adjusting it does help performance when idle balance overhead is high.
But I have found that a higher value may decrease performance in
situations where the cost of idle_balance is not high. Additionally,
when to modify this tunable, and by how much, can sometimes be
unpredictable.
Thanks,
Jason
On Mon, Jul 22, 2013 at 11:57:47AM -0700, Jason Low wrote:
> On Mon, 2013-07-22 at 12:31 +0530, Srikar Dronamraju wrote:
> > >
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index e8b3350..da2cb3e 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -1348,6 +1348,8 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
> > > else
> > > update_avg(&rq->avg_idle, delta);
> > > rq->idle_stamp = 0;
> > > +
> > > + rq->idle_duration = (rq->idle_duration + delta) / 2;
> >
> > Can't we just use avg_idle instead of introducing idle_duration?
>
> A potential issue I have found with avg_idle is that it may sometimes not
> be accurate enough for the purposes of this patch, because it is always
> capped at a max value (default is 1000000 ns). For example, a CPU could
> have remained idle for 1 second, yet avg_idle would be set to 1
> millisecond. Another question I have is whether we can update avg_idle
> at all times without putting a maximum value on it, or increase the
> maximum value of avg_idle by a lot.
The only user of avg_idle is idle_balance(); since you're building a new
limiter we can completely scrap/rework avg_idle to do as you want it to.
No point in having two of them.
Also, we now have rq->cfs.{blocked,runnable}_load_avg that might help with
estimating if you're so inclined :-)
> > Should we take into consideration whether an idle_balance was
> > successful or not?
>
> I recently ran fserver on the 8 socket machine with HT enabled and found
> that load balance was succeeding at a higher than average rate, but idle
> balance was still lowering the performance of that workload by a lot.
> However, it makes sense to allow idle balance to run longer/more often
> when it has a higher success rate.
>
> > I am not sure what a reasonable value for n would be, but maybe we could
> > try n=3.
>
> Based on some of the data I collected, n = 10 to 20 provides much better
> performance increases.
Right, so I'm still a bit puzzled by why this is so; maybe we're
over-estimating the idle duration due to significant variance in the
idle time?
Maybe we should try with something like the below to test this?
/*
* http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
*/
struct stats {
	long		mean;
	long		M2;
	unsigned int	n;
};

static void stats_update(struct stats *stats, long x)
{
	long delta;

	stats->n++;
	delta = x - stats->mean;
	stats->mean += delta / stats->n;
	stats->M2 += delta * (x - stats->mean);
}

/* Returns the standard deviation; 0 until we have at least two samples. */
static long stats_var(struct stats *stats)
{
	long variance;

	if (stats->n < 2)
		return 0;

	variance = stats->M2 / (stats->n - 1);

	return int_sqrt(variance);
}

static long stats_mean(struct stats *stats)
{
	return stats->mean;
}
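A minimal sketch of how these helpers could be wired up for such a test
(the hook and names are hypothetical, not part of any patch):

	/* Feed each observed idle period into the estimator. */
	static struct stats idle_stats;

	static void note_idle_duration(u64 delta_ns)
	{
		stats_update(&idle_stats, (long)delta_ns);
	}

	/*
	 * If the standard deviation is comparable to the mean, the idle
	 * duration estimate is too noisy to be trusted.
	 */
	static bool idle_estimate_noisy(void)
	{
		return stats_var(&idle_stats) >= stats_mean(&idle_stats) / 2;
	}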
> Yes, I have done quite a bit of testing with sched_migration_cost, and
> adjusting it does help performance when idle balance overhead is high.
> But I have found that a higher value may decrease performance in
> situations where the cost of idle_balance is not high. Additionally,
> when to modify this tunable, and by how much, can sometimes be
> unpredictable.
So the history of sched_migration_cost is that it used to be a per-sd
value; see also:
https://lkml.org/lkml/2008/9/4/215
Ingo wrote it initially for the O(1) scheduler and ripped it out when he
did CFS. He now doesn't like it because it introduces boot-to-boot
scheduling differences -- you never measure the exact numbers again.
That said, there is a case for restoring it since the one measure really
doesn't do justice to larger systems.
>
> A potential issue I have found with avg_idle is that it may sometimes not
> be accurate enough for the purposes of this patch, because it is always
> capped at a max value (default is 1000000 ns). For example, a CPU could
> have remained idle for 1 second, yet avg_idle would be set to 1
> millisecond. Another question I have is whether we can update avg_idle
> at all times without putting a maximum value on it, or increase the
> maximum value of avg_idle by a lot.
Maybe the current max value is a limiting factor, but I think there
should be a limit to the maximum value. Peter and Ingo may help us
understand why they limited it to 1ms. But I don't think we should
introduce a new variable just for this.
>
> > Should we take into consideration whether an idle_balance was
> > successful or not?
>
> > I recently ran fserver on the 8 socket machine with HT enabled and found
> > that load balance was succeeding at a higher than average rate, but idle
> > balance was still lowering the performance of that workload by a lot.
> > However, it makes sense to allow idle balance to run longer/more often
> > when it has a higher success rate.
>
If idle balance did succeed, then it means that the system was indeed
imbalanced, so idle balance was the right thing to do. Maybe we chose
the wrong task to pull. Maybe after the NUMA balancing enhancements go in,
we will pick a better task to pull, at least across nodes. And there could
be other opportunities/strategies to select the right task to pull.
Again, schedstats during the application run should give us hints here.
> > I am not sure what a reasonable value for n would be, but maybe we could
> > try n=3.
>
> Based on some of the data I collected, n = 10 to 20 provides much better
> performance increases.
>
I was saying it the other way.
Your suggestion is to run idle balance once in n runs, where n is 10
to 20.
My thinking was to skip one idle balance after n unsuccessful runs.
> > Also, have we checked the performance after adjusting the
> > sched_migration_cost tunable?
> >
> > I guess if we increase sched_migration_cost, we should have fewer
> > newly idle balance requests.
>
> Yes, I have done quite a bit of testing with sched_migration_cost, and
> adjusting it does help performance when idle balance overhead is high.
> But I have found that a higher value may decrease performance in
> situations where the cost of idle_balance is not high. Additionally,
> when to modify this tunable, and by how much, can sometimes be
> unpredictable.
I think people understand that migration_cost depends on the
hardware/application, and that's why they kept it as a tunable.
But is there something that we can learn from the hardware and the
application behaviour to set a migration cost? Maybe doing this
just complicates things more than necessary.
--
Thanks and Regards
Srikar Dronamraju
On Tue, Jul 23, 2013 at 04:36:46PM +0530, Srikar Dronamraju wrote:
> Maybe the current max value is a limiting factor, but I think there
> should be a limit to the maximum value. Peter and Ingo may help us
> understand why they limited it to 1ms. But I don't think we should
> introduce a new variable just for this.
/me blames it all on Mike.. I tried to remember why he did that, but
alas.
> If idle balance did succeed, then it means that the system was indeed
> imbalanced, so idle balance was the right thing to do. Maybe we chose
> the wrong task to pull. Maybe after the NUMA balancing enhancements go in,
> we will pick a better task to pull, at least across nodes. And there could
> be other opportunities/strategies to select the right task to pull.
>
> Again, schedstats during the application run should give us hints here.
Not necessarily so; IIRC the newidle idx is 0, which means that it's very
aggressive at pulling load. There might not actually be an imbalance
with higher idx averages.
> I was saying it the other way.
> Your suggestion is to run idle balance once in n runs, where n is 10
> to 20.
> My thinking was to skip one idle balance after n unsuccessful runs.
I think you're talking past each other. Each having a different N :-)
On Tue, 2013-07-23 at 14:05 +0200, Peter Zijlstra wrote:
> On Tue, Jul 23, 2013 at 04:36:46PM +0530, Srikar Dronamraju wrote:
>
> > Maybe the current max value is a limiting factor, but I think there
> > should be a limit to the maximum value. Peter and Ingo may help us
> > understand why they limited it to 1ms. But I don't think we should
> > introduce a new variable just for this.
>
> /me blames it all on Mike.. I tried to remember why he did that, but
> alas.
Only to accelerate cutoff after long idle.
-Mike
On Tue, 2013-07-23 at 16:36 +0530, Srikar Dronamraju wrote:
> >
> > A potential issue I have found with avg_idle is that it may sometimes not
> > be accurate enough for the purposes of this patch, because it is always
> > capped at a max value (default is 1000000 ns). For example, a CPU could
> > have remained idle for 1 second, yet avg_idle would be set to 1
> > millisecond. Another question I have is whether we can update avg_idle
> > at all times without putting a maximum value on it, or increase the
> > maximum value of avg_idle by a lot.
>
> Maybe the current max value is a limiting factor, but I think there
> should be a limit to the maximum value. Peter and Ingo may help us
> understand why they limited it to 1ms. But I don't think we should
> introduce a new variable just for this.
You're right. As Peter recently mentioned, avg_idle is only used for
idle_balance() anyway, so we should just use the existing avg_idle
estimator.
> >
> > > Should we take into consideration whether an idle_balance was
> > > successful or not?
> >
> > I recently ran fserver on the 8 socket machine with HT enabled and found
> > that load balance was succeeding at a higher than average rate, but idle
> > balance was still lowering the performance of that workload by a lot.
> > However, it makes sense to allow idle balance to run longer/more often
> > when it has a higher success rate.
> >
>
> If idle balance did succeed, then it means that the system was indeed
> imbalanced, so idle balance was the right thing to do. Maybe we chose
> the wrong task to pull. Maybe after the NUMA balancing enhancements go in,
> we will pick a better task to pull, at least across nodes. And there could
> be other opportunities/strategies to select the right task to pull.
>
> Again, schedstats during the application run should give us hints here.
>
> > > I am not sure what a reasonable value for n would be, but maybe we could
> > > try n=3.
> >
> > Based on some of the data I collected, n = 10 to 20 provides much better
> > performance increases.
> >
>
> I was saying it the other way.
> Your suggestion is to run idle balance once in n runs, where n is 10
> to 20.
> My thinking was to skip one idle balance after n unsuccessful runs.
When I suggested N = 20, I meant that if the average idle time of a
CPU is 1,000,000 ns, then we stop idle balancing once the average cost to
idle balance within a domain would come to be greater than (1,000,000
ns / 20) = 50,000 ns. In the v2 patch, N helps determine the maximum
duration for which we allow each idle_balance() to run.
Thanks,
Jason
> > > Should we take into consideration whether an idle_balance was
> > > successful or not?
> >
> > I recently ran fserver on the 8 socket machine with HT enabled and found
> > that load balance was succeeding at a higher than average rate, but idle
> > balance was still lowering the performance of that workload by a lot.
> > However, it makes sense to allow idle balance to run longer/more often
> > when it has a higher success rate.
> >
> > > I am not sure what a reasonable value for n would be, but maybe we could
> > > try n=3.
> >
> > Based on some of the data I collected, n = 10 to 20 provides much better
> > performance increases.
>
> Right, so I'm still a bit puzzled by why this is so; maybe we're
> over-estimating the idle duration due to significant variance in the
> idle time?
This time, I also collected per-domain stats on the number of load
balances that pulled a task(s) and the number of load balances that did
not pull tasks. Here are some of those numbers for a CPU when running
fserver on the 8 socket machine with Hyper-Threading enabled:
CPU #2:
| load balance | load balance | # total | load balance
domain | pulled task | did not | attempts | success rate
| | pull tasks | |
--------------------------------------------------------------------------
0 | 10574 | 175311 | 185885 | 5.69%
--------------------------------------------------------------------------
1 | 18218 | 157092 | 175310 | 10.39%
--------------------------------------------------------------------------
2 | 0 | 157092 | 157092 | 0%
--------------------------------------------------------------------------
3 | 14858 | 142234 | 157092 | 9.46%
--------------------------------------------------------------------------
4 | 8632 | 133602 | 142234 | 6.07%
--------------------------------------------------------------------------
5 | 4570 | 129032 | 133602 | 3.42%
Note: The % load balance success rate can be a lot lower in some of the
other AIM7 workloads with 8 socket HT on.
In this case, most of the load balances which did not pull tasks were
either due to find_busiest_group() returning NULL or to failing to move
any tasks after attempting the move.
Based on this data, one possible explanation for why the average
load balance cost per domain can be a lot less than the average CPU idle
time, yet idle balancing still lowers performance, is that the
load balance success rate for some of these domains can be very small,
while there is still the overhead of doing update_sd_lb_stats(),
idle_cpu(), acquiring the rq->lock, etc.
So assume that the average cost of load balance on domain 0 is 30000 ns
and the CPU's average idle time is 500000 ns. The average cost of
attempting each balance on domain 0 is a lot less than the average time
the CPU remains idle. However, since load balance in domain 0 is useful
only 5.69% of the time, we expect to pay (30000 ns / 0.0569) = 527240 ns
worth of kernel time for every load balance that moved a task(s) to this
particular CPU. Additionally, domain 2 in this case essentially never
moves tasks during its balance attempts, and so a larger N means spending
even less time balancing in a domain in which no tasks ever get moved.
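The same arithmetic in code form (illustrative numbers only: the
30000 ns cost is assumed, the counts come from the domain 0 row above):

	u64 avg_cost = 30000;	/* ns per balance attempt (assumed)  */
	u64 pulled   = 10574;	/* domain 0 balances that pulled     */
	u64 attempts = 185885;	/* domain 0 total balance attempts   */

	/* Expected kernel time paid per successful pull: ~527,000 ns */
	u64 ns_per_pull = avg_cost * attempts / pulled;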
Perhaps one of the metrics I could use for computing N is the balance
success rate for each sched domain. So in the above case, we would give
little to no time for idle balancing within domain 2, but allow more time
to be spent balancing in domains 1 and 3, because the expected return
there is greater?
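A hypothetical sketch of that idea, using invented per-domain success
counters (none of this is in the patch):

	static u64 sd_balance_budget(struct sched_domain *sd, u64 idle_duration)
	{
		/* lb_pulled/lb_failed are hypothetical per-domain counters */
		u64 total = sd->lb_pulled + sd->lb_failed;

		/* Domains that never pull (e.g. domain 2 above) get no budget. */
		if (!total || !sd->lb_pulled)
			return 0;

		/*
		 * Scale the idle-time budget by the observed success rate,
		 * capped at the current N = 10 budget.
		 */
		return min(idle_duration / 10,
			   idle_duration * sd->lb_pulled / total);
	}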
Jason