LinuxLists.cc - [RFC] sched: Limit idle_balance() when it is being used too frequently

2013-07-16 19:21:12

Subject: [RFC] sched: Limit idle_balance() when it is being used too frequently

When running benchmarks on an 8 socket 80 core machine with a 3.10 kernel,
there can be a lot of contention in idle_balance() and related functions.
On many AIM7 workloads in which CPUs go idle very often and idle balance
gets called a lot, it is actually lowering performance.

Since idle balance often helps performance (when it is not overused), I
looked into trying to avoid attempting idle balance only when it is
occurring too frequently.

This RFC patch attempts to keep track of the approximate "average" time between
idle balance attempts per CPU. Each time the idle_balance() function is
invoked, it will compute the duration since the last idle_balance() for
the current CPU. The avg time between idle balance attempts is then updated
using a very similar method as how rq->avg_idle is computed.

Once the average time between idle balance attempts drops below a certain
value (which in this patch is sysctl_sched_idle_balance_limit), idle_balance
for that CPU will be skipped. The average time between idle balances will
continue to be updated, even if it ends up getting skipped. The
initial/maximum average is set a lot higher though to make sure that the
avg doesn't fall below the threshold until the sample size is large and to
prevent the avg from being overestimated.

This change improved the performance of many AIM7 workloads at 1, 2, 4, 8
sockets on the 3.10 kernel. The most significant differences were at
8 sockets HT-enabled. The table below compares the average jobs per minute
at 1100-2000 users between the vanilla 3.10 kernel and 3.10 kernel with this
patch. I included data for both hyperthreading disabled and enabled. I used
numactl to restrict AIM7 to run on certain number of nodes. I only included
data in which the % difference was beyond a 2% noise range.

--------------------------------------------------------------------------
1 socket
--------------------------------------------------------------------------
workload | HT-disabled | HT-enabled |
| % improvement | % improvement |
| with patch | with patch |
--------------------------------------------------------------------------
disk | +17.7% | +4.7% |
--------------------------------------------------------------------------
high_systime | +2.9% | ----- |
--------------------------------------------------------------------------

--------------------------------------------------------------------------
2 sockets
--------------------------------------------------------------------------
workload | HT-disabled | HT-enabled |
| % improvement | % improvement |
| with patch | with patch |
--------------------------------------------------------------------------
alltests | ----- | +2.3% |
--------------------------------------------------------------------------
disk | +10.5% | ----- |
--------------------------------------------------------------------------
fserver | +3.6% | ----- |
--------------------------------------------------------------------------
new_fserver | +3.7% | ----- |
--------------------------------------------------------------------------

--------------------------------------------------------------------------
4 sockets
--------------------------------------------------------------------------
workload | HT-disabled | HT-enabled |
| % improvement | % improvement |
| with patch | with patch |
--------------------------------------------------------------------------
alltests | +3.7% | ----- |
--------------------------------------------------------------------------
custom | -2.2% | +14.0% |
--------------------------------------------------------------------------
fserver | +2.8% | ----- |
--------------------------------------------------------------------------
high_systime | -3.6% | +18.7% |
--------------------------------------------------------------------------
new_fserver | +3.4% | ----- |
--------------------------------------------------------------------------

--------------------------------------------------------------------------
8 sockets
--------------------------------------------------------------------------
workload | HT-disabled | HT-enabled |
| % improvement | % improvement |
| with patch | with patch |
--------------------------------------------------------------------------
alltests | +4.4% | +13.3% |
--------------------------------------------------------------------------
custom | +8.1% | +15.2% |
--------------------------------------------------------------------------
disk | -4.7% | +20.4% |
--------------------------------------------------------------------------
fserver | +3.4% | +26.8% |
--------------------------------------------------------------------------
high_systime | +11.7% | +14.7% |
--------------------------------------------------------------------------
new_fserver | +3.7% | +16.0% |
--------------------------------------------------------------------------
shared | ----- | +10.1% |
--------------------------------------------------------------------------

All other % difference results were within a 2% noise range.

Signed-off-by: Jason Low <[email protected]>
---
include/linux/sched.h | 4 ++++
kernel/sched/core.c | 3 +++
kernel/sched/fair.c | 26 ++++++++++++++++++++++++++
kernel/sched/sched.h | 6 ++++++
kernel/sysctl.c | 11 +++++++++++
5 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 178a8d9..5385c93 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1031,6 +1031,10 @@ enum perf_event_task_context {
perf_nr_task_contexts,
};

+#ifdef CONFIG_SMP
+extern unsigned int sysctl_sched_idle_balance_limit;
+#endif
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
void *stack;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e8b3350..320389f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7028,6 +7028,9 @@ void __init sched_init(void)
rq->idle_stamp = 0;
rq->avg_idle = 2*sysctl_sched_migration_cost;

+ rq->avg_time_between_ib = 20*sysctl_sched_idle_balance_limit;
+ rq->prev_idle_balance = 0;
+
INIT_LIST_HEAD(&rq->cfs_tasks);

rq_attach_root(rq, &def_root_domain);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c61a614..f5f5e4e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -34,6 +34,8 @@

#include "sched.h"

+unsigned int sysctl_sched_idle_balance_limit = 5000000U;
+
/*
* Targeted preemption latency for CPU-bound tasks:
* (default: 6ms * (1 + ilog(ncpus)), units: nanoseconds)
@@ -5231,6 +5233,23 @@ out:
return ld_moved;
}

+#ifdef CONFIG_SMP
+/* Update average time between idle balance attempts this_rq */
+static inline void update_avg_time_between_ib(struct rq *this_rq)
+{
+ u64 time_since_last_ib = this_rq->clock - this_rq->prev_idle_balance;
+ u64 max_avg_idle_balance = 20*sysctl_sched_idle_balance_limit;
+ s64 diff;
+
+ if (time_since_last_ib > max_avg_idle_balance) {
+ this_rq->avg_time_between_ib = max_avg_idle_balance;
+ } else {
+ diff = time_since_last_ib - this_rq->avg_time_between_ib;
+ this_rq->avg_time_between_ib += (diff >> 3);
+ }
+}
+#endif
+
/*
* idle_balance is called by schedule() if this_cpu is about to become
* idle. Attempts to pull tasks from other CPUs.
@@ -5246,6 +5265,13 @@ void idle_balance(int this_cpu, struct rq *this_rq)
if (this_rq->avg_idle < sysctl_sched_migration_cost)
return;

+ update_avg_time_between_ib(this_rq);
+ this_rq->prev_idle_balance = this_rq->clock;
+
+ /* Skip idle balancing if avg time between attempts is small */
+ if (this_rq->avg_time_between_ib < sysctl_sched_idle_balance_limit)
+ return;
+
/*
* Drop the rq->lock, but keep IRQ/preempt disabled.
*/
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ce39224..27d6752 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -521,6 +521,12 @@ struct rq {
#endif

struct sched_avg avg;
+
+#ifdef CONFIG_SMP
+ /* stats for putting a limit on idle balancing */
+ u64 avg_time_between_ib;
+ u64 prev_idle_balance;
+#endif
};

static inline int cpu_of(struct rq *rq)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 9edcf45..35e5f86 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -436,6 +436,17 @@ static struct ctl_table kern_table[] = {
.extra1 = &one,
},
#endif
+#ifdef CONFIG_SMP
+ {
+ .procname = "sched_idle_balance_limit",
+ .data = &sysctl_sched_idle_balance_limit,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 =&zero,
+ },
+#endif
+
#ifdef CONFIG_PROVE_LOCKING
{
.procname = "prove_locking",
--
1.7.1

2013-07-16 19:28:28

by Rik van Riel

[permalink] [raw]

Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too frequently

On 07/16/2013 03:21 PM, Jason Low wrote:
> When running benchmarks on an 8 socket 80 core machine with a 3.10 kernel,
> there can be a lot of contention in idle_balance() and related functions.
> On many AIM7 workloads in which CPUs go idle very often and idle balance
> gets called a lot, it is actually lowering performance.
>
> Since idle balance often helps performance (when it is not overused), I
> looked into trying to avoid attempting idle balance only when it is
> occurring too frequently.
>
> This RFC patch attempts to keep track of the approximate "average" time between
> idle balance attempts per CPU. Each time the idle_balance() function is
> invoked, it will compute the duration since the last idle_balance() for
> the current CPU. The avg time between idle balance attempts is then updated
> using a very similar method as how rq->avg_idle is computed.
>
> Once the average time between idle balance attempts drops below a certain
> value (which in this patch is sysctl_sched_idle_balance_limit), idle_balance
> for that CPU will be skipped. The average time between idle balances will
> continue to be updated, even if it ends up getting skipped. The
> initial/maximum average is set a lot higher though to make sure that the
> avg doesn't fall below the threshold until the sample size is large and to
> prevent the avg from being overestimated.
>
> This change improved the performance of many AIM7 workloads at 1, 2, 4, 8
> sockets on the 3.10 kernel. The most significant differences were at
> 8 sockets HT-enabled. The table below compares the average jobs per minute
> at 1100-2000 users between the vanilla 3.10 kernel and 3.10 kernel with this
> patch. I included data for both hyperthreading disabled and enabled. I used
> numactl to restrict AIM7 to run on certain number of nodes. I only included
> data in which the % difference was beyond a 2% noise range.

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed

2013-07-16 20:20:38

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too frequently

On Tue, Jul 16, 2013 at 12:21:03PM -0700, Jason Low wrote:
> When running benchmarks on an 8 socket 80 core machine with a 3.10 kernel,
> there can be a lot of contention in idle_balance() and related functions.
> On many AIM7 workloads in which CPUs go idle very often and idle balance
> gets called a lot, it is actually lowering performance.
>
> Since idle balance often helps performance (when it is not overused), I
> looked into trying to avoid attempting idle balance only when it is
> occurring too frequently.
>
> This RFC patch attempts to keep track of the approximate "average" time between
> idle balance attempts per CPU. Each time the idle_balance() function is
> invoked, it will compute the duration since the last idle_balance() for
> the current CPU. The avg time between idle balance attempts is then updated
> using a very similar method as how rq->avg_idle is computed.
>
> Once the average time between idle balance attempts drops below a certain
> value (which in this patch is sysctl_sched_idle_balance_limit), idle_balance
> for that CPU will be skipped. The average time between idle balances will
> continue to be updated, even if it ends up getting skipped. The
> initial/maximum average is set a lot higher though to make sure that the
> avg doesn't fall below the threshold until the sample size is large and to
> prevent the avg from being overestimated.

One of the things I've been talking about for a while now is how I'd
like to use the idle guestimator used for cpuidle for newidle balance.

Basically based on the estimated idle time limit how far/wide you'll
search for tasks to run.

You can remove the sysctl and auto-tune by measuring how long it takes
on avg to do a newidle balance.

Also, its not the time between idles that's interesting, but how long
you're going to be idle.

If you track the time it takes to newidle balance per sched-domain you
traverse, you're almost there.

2013-07-16 22:48:10

by Jason Low

[permalink] [raw]

Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too frequently

On Tue, 2013-07-16 at 22:20 +0200, Peter Zijlstra wrote:
> On Tue, Jul 16, 2013 at 12:21:03PM -0700, Jason Low wrote:
> > When running benchmarks on an 8 socket 80 core machine with a 3.10 kernel,
> > there can be a lot of contention in idle_balance() and related functions.
> > On many AIM7 workloads in which CPUs go idle very often and idle balance
> > gets called a lot, it is actually lowering performance.
> >
> > Since idle balance often helps performance (when it is not overused), I
> > looked into trying to avoid attempting idle balance only when it is
> > occurring too frequently.
> >
> > This RFC patch attempts to keep track of the approximate "average" time between
> > idle balance attempts per CPU. Each time the idle_balance() function is
> > invoked, it will compute the duration since the last idle_balance() for
> > the current CPU. The avg time between idle balance attempts is then updated
> > using a very similar method as how rq->avg_idle is computed.
> >
> > Once the average time between idle balance attempts drops below a certain
> > value (which in this patch is sysctl_sched_idle_balance_limit), idle_balance
> > for that CPU will be skipped. The average time between idle balances will
> > continue to be updated, even if it ends up getting skipped. The
> > initial/maximum average is set a lot higher though to make sure that the
> > avg doesn't fall below the threshold until the sample size is large and to
> > prevent the avg from being overestimated.
>
> One of the things I've been talking about for a while now is how I'd
> like to use the idle guestimator used for cpuidle for newidle balance.
>
> Basically based on the estimated idle time limit how far/wide you'll
> search for tasks to run.
>
> You can remove the sysctl and auto-tune by measuring how long it takes
> on avg to do a newidle balance.

Hi Peter,

When you say how long it takes on avg to do a newidle balance, are you
referring to the avg time it takes for each call to CPU_NEWLY_IDLE
load_balance() to complete, or the avg time it takes for newidle balance
attempts within a domain to eventually successfully pull/move a task(s)?

Thanks,
Jason

2013-07-17 07:25:28

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too frequently

On Tue, Jul 16, 2013 at 03:48:01PM -0700, Jason Low wrote:
> On Tue, 2013-07-16 at 22:20 +0200, Peter Zijlstra wrote:
> > On Tue, Jul 16, 2013 at 12:21:03PM -0700, Jason Low wrote:
> > > When running benchmarks on an 8 socket 80 core machine with a 3.10 kernel,
> > > there can be a lot of contention in idle_balance() and related functions.
> > > On many AIM7 workloads in which CPUs go idle very often and idle balance
> > > gets called a lot, it is actually lowering performance.
> > >
> > > Since idle balance often helps performance (when it is not overused), I
> > > looked into trying to avoid attempting idle balance only when it is
> > > occurring too frequently.
> > >
> > > This RFC patch attempts to keep track of the approximate "average" time between
> > > idle balance attempts per CPU. Each time the idle_balance() function is
> > > invoked, it will compute the duration since the last idle_balance() for
> > > the current CPU. The avg time between idle balance attempts is then updated
> > > using a very similar method as how rq->avg_idle is computed.
> > >
> > > Once the average time between idle balance attempts drops below a certain
> > > value (which in this patch is sysctl_sched_idle_balance_limit), idle_balance
> > > for that CPU will be skipped. The average time between idle balances will
> > > continue to be updated, even if it ends up getting skipped. The
> > > initial/maximum average is set a lot higher though to make sure that the
> > > avg doesn't fall below the threshold until the sample size is large and to
> > > prevent the avg from being overestimated.
> >
> > One of the things I've been talking about for a while now is how I'd
> > like to use the idle guestimator used for cpuidle for newidle balance.
> >
> > Basically based on the estimated idle time limit how far/wide you'll
> > search for tasks to run.
> >
> > You can remove the sysctl and auto-tune by measuring how long it takes
> > on avg to do a newidle balance.
>
> Hi Peter,
>
> When you say how long it takes on avg to do a newidle balance, are you
> referring to the avg time it takes for each call to CPU_NEWLY_IDLE
> load_balance() to complete, or the avg time it takes for newidle balance
> attempts within a domain to eventually successfully pull/move a task(s)?

Both :-), being as the completion time would be roughly equivalent for the
top domain and the entire call.

So I suppose I was somewhat unclear :-) I initially started out with a
simpler model, where you measure the avg time of the entire
idle_balance() call and measure the avg idle time and compare the two.

Then I progressed to the more complex model where you measure the
completion time of each domain in the for_each_domain() iteration of
idle_balance() and compare that against the estimated idle time, bailing
out of the domain iteration when the avg completion time exceeds the
expected idle time.

One thing that I thought of since is that we need to consider what
happens for people with a low resolution sched_clock. IIRC there are
still platforms that are jiffy based.

2013-07-17 07:48:36

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too frequently

On Wed, Jul 17, 2013 at 09:25:04AM +0200, Peter Zijlstra wrote:
> One thing that I thought of since is that we need to consider what
> happens for people with a low resolution sched_clock. IIRC there are
> still platforms that are jiffy based.

Ignore that, they're all UP.

2013-07-17 08:11:50

by Jason Low

[permalink] [raw]

Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too frequently

On Wed, 2013-07-17 at 09:25 +0200, Peter Zijlstra wrote:
> On Tue, Jul 16, 2013 at 03:48:01PM -0700, Jason Low wrote:
> > On Tue, 2013-07-16 at 22:20 +0200, Peter Zijlstra wrote:
> > > On Tue, Jul 16, 2013 at 12:21:03PM -0700, Jason Low wrote:
> > > > When running benchmarks on an 8 socket 80 core machine with a 3.10 kernel,
> > > > there can be a lot of contention in idle_balance() and related functions.
> > > > On many AIM7 workloads in which CPUs go idle very often and idle balance
> > > > gets called a lot, it is actually lowering performance.
> > > >
> > > > Since idle balance often helps performance (when it is not overused), I
> > > > looked into trying to avoid attempting idle balance only when it is
> > > > occurring too frequently.
> > > >
> > > > This RFC patch attempts to keep track of the approximate "average" time between
> > > > idle balance attempts per CPU. Each time the idle_balance() function is
> > > > invoked, it will compute the duration since the last idle_balance() for
> > > > the current CPU. The avg time between idle balance attempts is then updated
> > > > using a very similar method as how rq->avg_idle is computed.
> > > >
> > > > Once the average time between idle balance attempts drops below a certain
> > > > value (which in this patch is sysctl_sched_idle_balance_limit), idle_balance
> > > > for that CPU will be skipped. The average time between idle balances will
> > > > continue to be updated, even if it ends up getting skipped. The
> > > > initial/maximum average is set a lot higher though to make sure that the
> > > > avg doesn't fall below the threshold until the sample size is large and to
> > > > prevent the avg from being overestimated.
> > >
> > > One of the things I've been talking about for a while now is how I'd
> > > like to use the idle guestimator used for cpuidle for newidle balance.
> > >
> > > Basically based on the estimated idle time limit how far/wide you'll
> > > search for tasks to run.
> > >
> > > You can remove the sysctl and auto-tune by measuring how long it takes
> > > on avg to do a newidle balance.
> >
> > Hi Peter,
> >
> > When you say how long it takes on avg to do a newidle balance, are you
> > referring to the avg time it takes for each call to CPU_NEWLY_IDLE
> > load_balance() to complete, or the avg time it takes for newidle balance
> > attempts within a domain to eventually successfully pull/move a task(s)?
>
> Both :-), being as the completion time would be roughly equivalent for the
> top domain and the entire call.
>
> So I suppose I was somewhat unclear :-) I initially started out with a
> simpler model, where you measure the avg time of the entire
> idle_balance() call and measure the avg idle time and compare the two.
>
> Then I progressed to the more complex model where you measure the
> completion time of each domain in the for_each_domain() iteration of
> idle_balance() and compare that against the estimated idle time, bailing
> out of the domain iteration when the avg completion time exceeds the
> expected idle time.

Hi Peter,

For the more complex model, are you suggesting that each completion time
is the time it takes to complete 1 iteration of the for_each_domain()
loop?

Based on some of the data I collected, a single iteration of the
for_each_domain() loop is almost always significantly lower than the
approximate CPU idle time, even in workloads where idle_balance is
lowering performance. The bigger issue is that it takes so many of these
attempts before idle_balance actually "worked" and pulls a tasks.

I initially was thinking about each "completion time" of an idle balance
as the sum total of the times of all iterations to complete until a task
is successfully pulled within each domain.

Jason

2013-07-17 09:40:29

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too frequently

On Wed, Jul 17, 2013 at 01:11:41AM -0700, Jason Low wrote:
> For the more complex model, are you suggesting that each completion time
> is the time it takes to complete 1 iteration of the for_each_domain()
> loop?

Per sd, yes? So higher domains (or lower depending on how you model the thing
in you head) have bigger CPU spans, and thus take longer to complete. Imagine
the top domain of a 4096 cpu system, it would go look at all cpus to see if it
could find a task.

> Based on some of the data I collected, a single iteration of the
> for_each_domain() loop is almost always significantly lower than the
> approximate CPU idle time, even in workloads where idle_balance is
> lowering performance. The bigger issue is that it takes so many of these
> attempts before idle_balance actually "worked" and pulls a tasks.

I'm confused, so:

schedule()
if (!rq->nr_running)
idle_balance()
for_each_domain(sd)
load_balance(sd)

is the entire thing, there's no other loop in there.

> I initially was thinking about each "completion time" of an idle balance
> as the sum total of the times of all iterations to complete until a task
> is successfully pulled within each domain.

So you're saying that normally idle_balance() won't find a task to pull? And we
need many times going newidle before we do get something?

Wouldn't this mean that there simply weren't enough tasks to keep all cpus busy?

If there were tasks we could've pulled, we might need to look at why they
weren't and maybe fix that. Now it could be that it things this cpu, even with
the (little) idle time it has is sufficiently loaded and we'll get a 'local'
wakeup soon enough. That's perfectly fine.

What we should avoid is spending more time looking for tasks then we have idle,
since that reduces the total time we can spend doing useful work. So that is I
think the critical cut-off point.

2013-07-17 15:59:08

by Jason Low

[permalink] [raw]

Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too frequently

Hi Peter,

On Wed, 2013-07-17 at 11:39 +0200, Peter Zijlstra wrote:
> On Wed, Jul 17, 2013 at 01:11:41AM -0700, Jason Low wrote:
> > For the more complex model, are you suggesting that each completion time
> > is the time it takes to complete 1 iteration of the for_each_domain()
> > loop?
>
> Per sd, yes? So higher domains (or lower depending on how you model the thing
> in you head) have bigger CPU spans, and thus take longer to complete. Imagine
> the top domain of a 4096 cpu system, it would go look at all cpus to see if it
> could find a task.
>
> > Based on some of the data I collected, a single iteration of the
> > for_each_domain() loop is almost always significantly lower than the
> > approximate CPU idle time, even in workloads where idle_balance is
> > lowering performance. The bigger issue is that it takes so many of these
> > attempts before idle_balance actually "worked" and pulls a tasks.
>
> I'm confused, so:
>
> schedule()
> if (!rq->nr_running)
> idle_balance()
> for_each_domain(sd)
> load_balance(sd)
>
> is the entire thing, there's no other loop in there.

So if we have the following:

for_each_domain(sd)
before = sched_clock_cpu
load_balance(sd)
after = sched_clock_cpu
idle_balance_completion_time = after - before

At this point, the "idle_balance_completion_time" is usually a very
small value and is usually a lot smaller than the avg CPU idle time.
However, the vast majority of the time, load_balance returns 0.

> > I initially was thinking about each "completion time" of an idle balance
> > as the sum total of the times of all iterations to complete until a task
> > is successfully pulled within each domain.
>
> So you're saying that normally idle_balance() won't find a task to pull? And we
> need many times going newidle before we do get something?

Yes, a while ago, I collected some data on the rate in which
idle_balance() does not pull tasks, and it was a very high number.

> Wouldn't this mean that there simply weren't enough tasks to keep all cpus busy?

If I remember correctly, in a lot of those load_balance attempts when
the machine is under a high Java load, there were no "imbalance" between
the groups in each sched_domain.

> If there were tasks we could've pulled, we might need to look at why they
> weren't and maybe fix that. Now it could be that it things this cpu, even with
> the (little) idle time it has is sufficiently loaded and we'll get a 'local'
> wakeup soon enough. That's perfectly fine.
>
> What we should avoid is spending more time looking for tasks then we have idle,
> since that reduces the total time we can spend doing useful work. So that is I
> think the critical cut-off point.

Do you think its worth a try to consider each newidle balance attempt as
the total load_balance attempts until it is able to move a task, and
then skip balancing within the domain if a CPU's avg idle time is less
than that avg time doing newidle balance?

Thanks,
Jason

2013-07-17 16:19:33

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too frequently

On Wed, Jul 17, 2013 at 08:59:01AM -0700, Jason Low wrote:
>
> So if we have the following:
>
> for_each_domain(sd)
> before = sched_clock_cpu
> load_balance(sd)
> after = sched_clock_cpu
> idle_balance_completion_time = after - before
>
> At this point, the "idle_balance_completion_time" is usually a very
> small value and is usually a lot smaller than the avg CPU idle time.
> However, the vast majority of the time, load_balance returns 0.

I think the interesting question here is: is it significantly more when we do
find a task?

I would also expect sd->newidle_balance_cost (less typing there) to scale with the
number of CPUs in the domain - thus larger domains will take longer etc.

And (obviously) the cost of the entire newidle balance is the direct sum of
individual domain costs.

> Do you think its worth a try to consider each newidle balance attempt as
> the total load_balance attempts until it is able to move a task, and
> then skip balancing within the domain if a CPU's avg idle time is less
> than that avg time doing newidle balance?

So the way I see things is that the only way newidle balance can slow down
things is if it runs when we could have ran something useful.

So all we need to ensure is to not run longer than we expect to be idle for and
things should be 'free', right?

So the problem I have with your proposal is that supposing we're successful
once every 10 newidle balances. Then the sd->newidle_balance_cost gets inflated
by a factor 10 -- for we'd count 10 of them before 'success'.

However when we're idle for that amount of time (10 times longer than it takes
to do a single newidle balance) we'd still only do a single newidle balance,
not 10.

2013-07-17 17:53:35

by Rik van Riel

[permalink] [raw]

Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too frequently

On 07/17/2013 12:18 PM, Peter Zijlstra wrote:
> On Wed, Jul 17, 2013 at 08:59:01AM -0700, Jason Low wrote:
>>

>> Do you think its worth a try to consider each newidle balance attempt as
>> the total load_balance attempts until it is able to move a task, and
>> then skip balancing within the domain if a CPU's avg idle time is less
>> than that avg time doing newidle balance?
>
> So the way I see things is that the only way newidle balance can slow down
> things is if it runs when we could have ran something useful.

Due to contention on the runqueue locks of other CPUs,
newidle also has the potential to keep _others_ from
running something useful.

> So all we need to ensure is to not run longer than we expect to be idle for and
> things should be 'free', right?
>
> So the problem I have with your proposal is that supposing we're successful
> once every 10 newidle balances. Then the sd->newidle_balance_cost gets inflated
> by a factor 10 -- for we'd count 10 of them before 'success'.
>
> However when we're idle for that amount of time (10 times longer than it takes
> to do a single newidle balance) we'd still only do a single newidle balance,
> not 10.

Could we prevent that downside by measuring both the
time spent idle, and the time spent in idle balancing,
and making sure the idle balancing time never exceeds
more than N% of the idle time?

Say, have the CPU never spend more than 10% of its
idle time in the idle balancing code, as averaged
over some time period?

That way we might still do idle balancing every X
idle periods, even if the idle periods themselves
are relatively short.

It might also be enough to prevent excessive lock
contention triggered by the idle balancing code,
though I have to admit I did not really think that
part through :)

2013-07-17 18:03:22

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too frequently

On Wed, Jul 17, 2013 at 01:51:51PM -0400, Rik van Riel wrote:
> On 07/17/2013 12:18 PM, Peter Zijlstra wrote:

> >So the way I see things is that the only way newidle balance can slow down
> >things is if it runs when we could have ran something useful.
>
> Due to contention on the runqueue locks of other CPUs,
> newidle also has the potential to keep _others_ from
> running something useful.

Right, although that should only happen when we do have an imbalance and want
to go move something. Which in Jason's case is 'rare'. But yes, I suppose
there's other scenarios where this is far more likely.

> Could we prevent that downside by measuring both the
> time spent idle, and the time spent in idle balancing,
> and making sure the idle balancing time never exceeds
> more than N% of the idle time?

Sure:

idle_balance(u64 idle_duration)
{
u64 cost = 0;

for_each_domain(sd) {
if (cost + sd->cost > idle_duration/N)
break;

...

sd->cost = (sd->cost + this_cost) / 2;
cost += this_cost;
}
}

I would've initially suggested using something like N=2 since we're dealing
with averages and half should ensure we don't run over except for the worst
peaks. But we could easily use a bigger N.

2013-07-17 18:48:58

by Jason Low

[permalink] [raw]

Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too frequently

2013-07-18 04:02:31

by Jason Low

[permalink] [raw]

Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too frequently

On Wed, 2013-07-17 at 20:01 +0200, Peter Zijlstra wrote:
> On Wed, Jul 17, 2013 at 01:51:51PM -0400, Rik van Riel wrote:
> > On 07/17/2013 12:18 PM, Peter Zijlstra wrote:
>
> > >So the way I see things is that the only way newidle balance can slow down
> > >things is if it runs when we could have ran something useful.
> >
> > Due to contention on the runqueue locks of other CPUs,
> > newidle also has the potential to keep _others_ from
> > running something useful.
>
> Right, although that should only happen when we do have an imbalance and want
> to go move something. Which in Jason's case is 'rare'. But yes, I suppose
> there's other scenarios where this is far more likely.
>
> > Could we prevent that downside by measuring both the
> > time spent idle, and the time spent in idle balancing,
> > and making sure the idle balancing time never exceeds
> > more than N% of the idle time?
>
> Sure:
>
> idle_balance(u64 idle_duration)
> {
> u64 cost = 0;
>
> for_each_domain(sd) {
> if (cost + sd->cost > idle_duration/N)
> break;
>
> ...
>
> sd->cost = (sd->cost + this_cost) / 2;
> cost += this_cost;
> }
> }
>
> I would've initially suggested using something like N=2 since we're dealing
> with averages and half should ensure we don't run over except for the worst
> peaks. But we could easily use a bigger N.

I ran a few AIM7 workloads for the 8 socket HT enabled case and I needed
to set N to more than 20 in order to get the big performance gains.

One thing that I thought of was to have N be based on how often idle
balance attempts does not pull task(s).

For example, N can be calculated based on the number of idle balance
attempts for the CPU since the last "successful" idle balance attempt.
So if the previous 30 idle balance attempts resulted in no tasks moved,
then n = 30 / 5. So idle balance gets less time to run as the number of
unneeded idle balance attempts increases, and thus N will not be set too
high during situations where idle balancing is "successful" more often.
Any comments on this idea?

Thanks,
Jason

2013-07-18 09:32:40

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too frequently

On Wed, Jul 17, 2013 at 09:02:24PM -0700, Jason Low wrote:

> I ran a few AIM7 workloads for the 8 socket HT enabled case and I needed
> to set N to more than 20 in order to get the big performance gains.
>
> One thing that I thought of was to have N be based on how often idle
> balance attempts does not pull task(s).
>
> For example, N can be calculated based on the number of idle balance
> attempts for the CPU since the last "successful" idle balance attempt.
> So if the previous 30 idle balance attempts resulted in no tasks moved,
> then n = 30 / 5. So idle balance gets less time to run as the number of
> unneeded idle balance attempts increases, and thus N will not be set too
> high during situations where idle balancing is "successful" more often.
> Any comments on this idea?

It would be good to get a solid explanation for why we need such high N.
But yes that might work.

2013-07-18 12:00:50

by Rik van Riel

[permalink] [raw]

Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too frequently

On 07/18/2013 05:32 AM, Peter Zijlstra wrote:
> On Wed, Jul 17, 2013 at 09:02:24PM -0700, Jason Low wrote:
>
>> I ran a few AIM7 workloads for the 8 socket HT enabled case and I needed
>> to set N to more than 20 in order to get the big performance gains.
>>
>> One thing that I thought of was to have N be based on how often idle
>> balance attempts does not pull task(s).
>>
>> For example, N can be calculated based on the number of idle balance
>> attempts for the CPU since the last "successful" idle balance attempt.
>> So if the previous 30 idle balance attempts resulted in no tasks moved,
>> then n = 30 / 5. So idle balance gets less time to run as the number of
>> unneeded idle balance attempts increases, and thus N will not be set too
>> high during situations where idle balancing is "successful" more often.
>> Any comments on this idea?
>
> It would be good to get a solid explanation for why we need such high N.
> But yes that might work.

I have some idea, though no proof :)

I suspect a lot of the idle balancing time is spent waiting for
and acquiring the runqueue locks of remote CPUs.

If we spend half our idle time causing contention to remote
runqueue locks, we could be a big factor in keeping those other
CPUs from getting work done.

--
All rights reversed

2013-07-18 12:13:04

by Srikar Dronamraju

[permalink] [raw]

Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too frequently

> >
> > idle_balance(u64 idle_duration)
> > {
> > u64 cost = 0;
> >
> > for_each_domain(sd) {
> > if (cost + sd->cost > idle_duration/N)
> > break;
> >
> > ...
> >
> > sd->cost = (sd->cost + this_cost) / 2;
> > cost += this_cost;
> > }
> > }
> >
> > I would've initially suggested using something like N=2 since we're dealing
> > with averages and half should ensure we don't run over except for the worst
> > peaks. But we could easily use a bigger N.
>
> I ran a few AIM7 workloads for the 8 socket HT enabled case and I needed
> to set N to more than 20 in order to get the big performance gains.
>

As per your observation, newly idle balancing isn't picking tasks and
mostly finding the domains to be balanced. find_busiest_queue() is
under rcu. So where and how are we getting these performance gains?

Is it that tasks are getting woken up and queued while the cpu is doing
newly idle load balance? Or is it that the regular CPU_IDLE balancing
which follows idle_balance() does a more aggressive balancing and hence
is able to find a task to balance?

--
Thanks and Regards
Srikar Dronamraju

2013-07-18 12:16:00

by Srikar Dronamraju

[permalink] [raw]

Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too frequently

* Rik van Riel <[email protected]> [2013-07-18 07:59:22]:

> On 07/18/2013 05:32 AM, Peter Zijlstra wrote:
> >On Wed, Jul 17, 2013 at 09:02:24PM -0700, Jason Low wrote:
> >
> >>I ran a few AIM7 workloads for the 8 socket HT enabled case and I needed
> >>to set N to more than 20 in order to get the big performance gains.
> >>
> >>One thing that I thought of was to have N be based on how often idle
> >>balance attempts does not pull task(s).
> >>
> >>For example, N can be calculated based on the number of idle balance
> >>attempts for the CPU since the last "successful" idle balance attempt.
> >>So if the previous 30 idle balance attempts resulted in no tasks moved,
> >>then n = 30 / 5. So idle balance gets less time to run as the number of
> >>unneeded idle balance attempts increases, and thus N will not be set too
> >>high during situations where idle balancing is "successful" more often.
> >>Any comments on this idea?
> >
> >It would be good to get a solid explanation for why we need such high N.
> >But yes that might work.
>
> I have some idea, though no proof :)
>
> I suspect a lot of the idle balancing time is spent waiting for
> and acquiring the runqueue locks of remote CPUs.
>

We take locks if and only if we see imbalance and want to pull the
tasks.
However if the newly idle balance is not finding an imbalance then this
may not be an issue.

Probably /proc/schedstats will give a better picture.

> If we spend half our idle time causing contention to remote
> runqueue locks, we could be a big factor in keeping those other
> CPUs from getting work done.
>
> --
> All rights reversed
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Thanks and Regards
Srikar Dronamraju

2013-07-18 12:35:59

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too frequently

On Thu, Jul 18, 2013 at 05:45:46PM +0530, Srikar Dronamraju wrote:
> We take locks if and only if we see imbalance and want to pull the
> tasks.
> However if the newly idle balance is not finding an imbalance then this
> may not be an issue.
>
> Probably /proc/schedstats will give a better picture.

Right, so we're interested in move_tasks() calls that fail to 'deliver'.
There's a few conditions in there that can cause us to not move a task,
most of them not counted.

The few that are; are from can_mirgrate_task():

se.statistics.nr_failed_migrations_affine
se.statistics.nr_failed_migrations_running
se.statistics.nr_failed_migrations_hot

If we see significant increments on those we'll be taking locks.

The only one I can see a good way around is the hot one, we could ignore
hotness in favour of newidle -- although I could see that being
detrimental, we'll just have to try or so ;-)

_running shouldn't be much of a problem since we don't bother if
nr_running <= 1. And _affine is out of our reach anyway.

2013-07-18 13:09:53

by Srikar Dronamraju

[permalink] [raw]

Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too frequently

* Peter Zijlstra <[email protected]> [2013-07-18 14:35:31]:

> On Thu, Jul 18, 2013 at 05:45:46PM +0530, Srikar Dronamraju wrote:
> > We take locks if and only if we see imbalance and want to pull the
> > tasks.
> > However if the newly idle balance is not finding an imbalance then this
> > may not be an issue.
> >
> > Probably /proc/schedstats will give a better picture.
>
> Right, so we're interested in move_tasks() calls that fail to 'deliver'.
> There's a few conditions in there that can cause us to not move a task,
> most of them not counted.
>
> The few that are; are from can_mirgrate_task():
>
> se.statistics.nr_failed_migrations_affine
> se.statistics.nr_failed_migrations_running
> se.statistics.nr_failed_migrations_hot
>
> If we see significant increments on those we'll be taking locks.
>

Agree.

Even I think number of times no busy group was found, number of times no
busy queue was found also will tell us that locks are not being taken.

In schedstats, I generally see them as overwhelming majority.

> The only one I can see a good way around is the hot one, we could ignore
> hotness in favour of newidle -- although I could see that being
> detrimental, we'll just have to try or so ;-)
>
> _running shouldn't be much of a problem since we don't bother if
> nr_running <= 1. And _affine is out of our reach anyway.
>

--
Thanks and Regards
Srikar Dronamraju

2013-07-18 19:03:53

by Jason Low

[permalink] [raw]

Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too frequently

On Thu, 2013-07-18 at 17:42 +0530, Srikar Dronamraju wrote:
> > >
> > > idle_balance(u64 idle_duration)
> > > {
> > > u64 cost = 0;
> > >
> > > for_each_domain(sd) {
> > > if (cost + sd->cost > idle_duration/N)
> > > break;
> > >
> > > ...
> > >
> > > sd->cost = (sd->cost + this_cost) / 2;
> > > cost += this_cost;
> > > }
> > > }
> > >
> > > I would've initially suggested using something like N=2 since we're dealing
> > > with averages and half should ensure we don't run over except for the worst
> > > peaks. But we could easily use a bigger N.
> >
> > I ran a few AIM7 workloads for the 8 socket HT enabled case and I needed
> > to set N to more than 20 in order to get the big performance gains.
> >
>
> As per your observation, newly idle balancing isn't picking tasks and
> mostly finding the domains to be balanced. find_busiest_queue() is
> under rcu. So where and how are we getting these performance gains?

I actually just ran fserver on 8 sockets (which idle balance lowers the
performance in this workload at this socket count), and for this
workload, idle balancing is finding tasks to move fairly often on a
per-cpu basis. So I guess it is not always the case that idle balancing
isn't moving tasks on this box.

Jason

2013-07-18 19:06:45

by Jason Low

[permalink] [raw]

Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too frequently

On Thu, 2013-07-18 at 07:59 -0400, Rik van Riel wrote:
> On 07/18/2013 05:32 AM, Peter Zijlstra wrote:
> > On Wed, Jul 17, 2013 at 09:02:24PM -0700, Jason Low wrote:
> >
> >> I ran a few AIM7 workloads for the 8 socket HT enabled case and I needed
> >> to set N to more than 20 in order to get the big performance gains.
> >>
> >> One thing that I thought of was to have N be based on how often idle
> >> balance attempts does not pull task(s).
> >>
> >> For example, N can be calculated based on the number of idle balance
> >> attempts for the CPU since the last "successful" idle balance attempt.
> >> So if the previous 30 idle balance attempts resulted in no tasks moved,
> >> then n = 30 / 5. So idle balance gets less time to run as the number of
> >> unneeded idle balance attempts increases, and thus N will not be set too
> >> high during situations where idle balancing is "successful" more often.
> >> Any comments on this idea?
> >
> > It would be good to get a solid explanation for why we need such high N.
> > But yes that might work.
>
> I have some idea, though no proof :)
>
> I suspect a lot of the idle balancing time is spent waiting for
> and acquiring the runqueue locks of remote CPUs.
>
> If we spend half our idle time causing contention to remote
> runqueue locks, we could be a big factor in keeping those other
> CPUs from getting work done.

I collected some perf samples when running fserver when N=1 and N=60.

N = 1
-----
19.21% reaim [k] __read_lock_failed
14.79% reaim [k] mspin_lock
12.19% reaim [k] __write_lock_failed
7.87% reaim [k] _raw_spin_lock
2.03% reaim [k] start_this_handle
1.98% reaim [k] update_sd_lb_stats
1.92% reaim [k] mutex_spin_on_owner
1.86% reaim [k] update_cfs_rq_blocked_load
1.14% swapper [k] intel_idle
1.10% reaim [.] add_long
1.09% reaim [.] add_int
1.08% reaim [k] load_balance

N = 60
------
7.70% reaim [k] _raw_spin_lock
7.25% reaim [k] mspin_lock
6.30% reaim [.] add_long
6.26% reaim [.] add_int
4.05% reaim [.] strncat
3.81% reaim [.] string_rtns_1
3.66% reaim [.] div_long
3.44% reaim [k] mutex_spin_on_owner
2.91% reaim [.] add_short
2.73% swapper [k] intel_idle
2.65% reaim [k] __read_lock_failed

With idle_balance(), we get more contention in kernel functions such as
update_sd_lb_stats(), load_balance(), and spin_lock() for the rq lock.
Additionally, it increases the time spent in mutex's mspin_lock(),
__read_lock_failed() and __write_lock_failed() by a lot.

N needs to be large because avg_idle time is still a lot higher than the
avg time spent in each load_balance() call per sched domain. Despite the
high ratio of avg_idle time to time spent in load_balance(),
load_balance() still increases the time spent in the kernel by quite a
bit, probably because of how often it is being used.

Jason

2013-07-19 18:37:48

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too frequently

On Thu, Jul 18, 2013 at 12:06:39PM -0700, Jason Low wrote:

> N = 1
> -----
> 19.21% reaim [k] __read_lock_failed
> 14.79% reaim [k] mspin_lock
> 12.19% reaim [k] __write_lock_failed
> 7.87% reaim [k] _raw_spin_lock
> 2.03% reaim [k] start_this_handle
> 1.98% reaim [k] update_sd_lb_stats
> 1.92% reaim [k] mutex_spin_on_owner
> 1.86% reaim [k] update_cfs_rq_blocked_load
> 1.14% swapper [k] intel_idle
> 1.10% reaim [.] add_long
> 1.09% reaim [.] add_int
> 1.08% reaim [k] load_balance

But but but but.. wth is causing this? The only thing we do more of with
N=1 is idle_balance(); where would that cause __{read,write}_lock_failed
and or mspin_lock() contention like that.

There shouldn't be a rwlock_t in the entire scheduler; those things suck
worse than quicksand.

If, as Rik thought, we'd have more rq->lock contention, then I'd
expected _raw_spin_lock to be up highest.

2013-07-19 19:15:26

by Jason Low

[permalink] [raw]

Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too frequently

On Fri, 2013-07-19 at 20:37 +0200, Peter Zijlstra wrote:
> On Thu, Jul 18, 2013 at 12:06:39PM -0700, Jason Low wrote:
>
> > N = 1
> > -----
> > 19.21% reaim [k] __read_lock_failed
> > 14.79% reaim [k] mspin_lock
> > 12.19% reaim [k] __write_lock_failed
> > 7.87% reaim [k] _raw_spin_lock
> > 2.03% reaim [k] start_this_handle
> > 1.98% reaim [k] update_sd_lb_stats
> > 1.92% reaim [k] mutex_spin_on_owner
> > 1.86% reaim [k] update_cfs_rq_blocked_load
> > 1.14% swapper [k] intel_idle
> > 1.10% reaim [.] add_long
> > 1.09% reaim [.] add_int
> > 1.08% reaim [k] load_balance
>
> But but but but.. wth is causing this? The only thing we do more of with
> N=1 is idle_balance(); where would that cause __{read,write}_lock_failed
> and or mspin_lock() contention like that.
>
> There shouldn't be a rwlock_t in the entire scheduler; those things suck
> worse than quicksand.
>
> If, as Rik thought, we'd have more rq->lock contention, then I'd
> expected _raw_spin_lock to be up highest.

For this particular fserver workload, that mutex was acquired in the
function calls from ext4_orphan_add() and ext4_orphan_del(). Those read
and write lock calls were from start_this_handle().

Although these functions are not called within the idle_balance() code
path, update_sd_lb_stats(), tg_load_down(), idle_cpu(), spin_lock(),
ect... increases the time spent in the kernel and that appears to be
indirectly causing more time to be spent acquiring those other kernel
locks.

Thanks,
Jason