2013-07-31 09:38:00

by Jason Low

Subject: [RFC PATCH] sched: Reduce overestimating avg_idle

The avg_idle value may sometimes be overestimated, which may cause new idle
load balance to be attempted more often than it should be. Currently, when
avg_idle gets updated, if the delta exceeds some max value (default 1,000,000 ns),
the entire avg gets set to the max value, regardless of what the previous avg
was. So if a CPU remains idle for 200,000 ns most of the time, but goes idle
once for 1,200,000 ns, the average is pushed up to 1,000,000 ns when it should
be much lower.

Additionally, once avg_idle is at its max, it may take a while to pull the
avg back down to where it should be. In the above example, after avg_idle is
set to the max value of 1,000,000 ns, the CPU's idle durations need to be
200,000 ns for the next 8 occurrences before the avg falls below the migration
cost value.
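
For reference, update_avg() computes an exponential moving average,
roughly avg += (sample - avg) / 8. Here is a quick userspace sketch (not
kernel code) of that convergence, assuming the default 500,000 ns
sysctl_sched_migration_cost:

#include <stdio.h>
#include <stdint.h>

/* Same averaging step as the kernel's update_avg() */
static void update_avg(uint64_t *avg, uint64_t sample)
{
        int64_t diff = sample - *avg;
        *avg += diff >> 3;
}

int main(void)
{
        uint64_t avg = 1000000;         /* starts clipped at the old max */
        int i;

        for (i = 1; i <= 10; i++) {
                update_avg(&avg, 200000);       /* steady 200,000 ns idles */
                printf("after idle %2d: avg_idle = %llu ns\n",
                       i, (unsigned long long)avg);
        }
        /* avg only drops below the 500,000 ns migration cost on the
         * 8th iteration */
        return 0;
}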

This patch attempts to avoid these situations by always updating avg_idle
first with update_avg(), and only then clipping it: if the new avg exceeds
the max value, the avg gets set to the max. This patch also lowers the max
avg_idle value to migration_cost * 1.5 instead of migration_cost * 2, to
reduce the time it takes to pull the avg down after long idles.

With this change, I got some decent performance boosts in AIM7 workloads on
an 8-socket machine on the 3.10 kernel. In particular, it boosted the AIM7
fserver workload by about 20% when running it with a high # of users.

A related avg_idle question: does the migration_cost used in idle balance
need to be the same as the migration_cost used in task_hot()? Can we keep
the default migration_cost used in task_hot() the same, but use a different
(or larger) default value when comparing against avg_idle in idle balance?


Signed-off-by: Jason Low <[email protected]>
---
kernel/sched/core.c | 10 +++++-----
1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e8b3350..62b484b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1341,12 +1341,12 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 
 	if (rq->idle_stamp) {
 		u64 delta = rq->clock - rq->idle_stamp;
-		u64 max = 2*sysctl_sched_migration_cost;
+		u64 max = (sysctl_sched_migration_cost * 3) / 2;
 
-		if (delta > max)
+		update_avg(&rq->avg_idle, delta);
+
+		if (rq->avg_idle > max)
 			rq->avg_idle = max;
-		else
-			update_avg(&rq->avg_idle, delta);
 		rq->idle_stamp = 0;
 	}
 #endif
@@ -7026,7 +7026,7 @@ void __init sched_init(void)
 		rq->cpu = i;
 		rq->online = 0;
 		rq->idle_stamp = 0;
-		rq->avg_idle = 2*sysctl_sched_migration_cost;
+		rq->avg_idle = (sysctl_sched_migration_cost * 3) / 2;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);

--
1.7.1



2013-07-31 09:53:29

by Peter Zijlstra

Subject: Re: [RFC PATCH] sched: Reduce overestimating avg_idle

On Wed, Jul 31, 2013 at 02:37:52AM -0700, Jason Low wrote:
> The avg_idle value may sometimes be overestimated, which may cause new idle
> load balance to be attempted more often than it should be. Currently, when
> avg_idle gets updated, if the delta exceeds some max value (default 1,000,000 ns),
> the entire avg gets set to the max value, regardless of what the previous avg
> was. So if a CPU remains idle for 200,000 ns most of the time, but goes idle
> once for 1,200,000 ns, the average is pushed up to 1,000,000 ns when it should
> be much lower.
>
> Additionally, once avg_idle is at its max, it may take a while to pull the
> avg back down to where it should be. In the above example, after avg_idle is
> set to the max value of 1,000,000 ns, the CPU's idle durations need to be
> 200,000 ns for the next 8 occurrences before the avg falls below the migration
> cost value.
>
> This patch attempts to avoid these situations by always updating avg_idle
> first with update_avg(), and only then clipping it: if the new avg exceeds
> the max value, the avg gets set to the max. This patch also lowers the max
> avg_idle value to migration_cost * 1.5 instead of migration_cost * 2, to
> reduce the time it takes to pull the avg down after long idles.

Indeed, this seems quite sensible.

> With this change, I got some decent performance boosts in AIM7 workloads on
> an 8-socket machine on the 3.10 kernel. In particular, it boosted the AIM7
> fserver workload by about 20% when running it with a high # of users.

Nice :-)

> A related avg_idle question: does the migration_cost used in idle balance
> need to be the same as the migration_cost used in task_hot()? Can we keep
> the default migration_cost used in task_hot() the same, but use a different
> (or larger) default value when comparing against avg_idle in idle balance?

No, they're quite unrelated. I think you can measure the max time we've
ever spent in newidle balance and use that to clip the values.

Similarly, I've thought about how we updated the sd->avg_cost in the
previous patches and wondered if we should not track max_cost.

The 'only' downside I could come up with is that it's all run from
SoftIRQ context which means IRQ/NMI/SMI can all stretch/warp the time it
takes to actually do the idle balance.

The idea behind using the max is that we want to reduce the chance we
overrun the averages and consume time we should have spent doing useful
work.
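
Something like the below, completely untested; max_newidle_balance_cost
is a made-up field name and the actual balancing loop is elided:

void idle_balance(int this_cpu, struct rq *this_rq)
{
        u64 t0, cost;

        /* Skip if we'd likely overrun the expected idle period. */
        if (this_rq->avg_idle < this_rq->max_newidle_balance_cost)
                return;

        t0 = sched_clock_cpu(this_cpu);

        /* ... the existing for_each_domain()/load_balance() loop ... */

        /* Remember the worst time we've ever spent in newidle balance. */
        cost = sched_clock_cpu(this_cpu) - t0;
        if (cost > this_rq->max_newidle_balance_cost)
                this_rq->max_newidle_balance_cost = cost;
}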

2013-07-31 16:01:34

by Rik van Riel

Subject: Re: [RFC PATCH] sched: Reduce overestimating avg_idle

On 07/31/2013 05:37 AM, Jason Low wrote:

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e8b3350..62b484b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1341,12 +1341,12 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
>
>  	if (rq->idle_stamp) {
>  		u64 delta = rq->clock - rq->idle_stamp;
> -		u64 max = 2*sysctl_sched_migration_cost;
> +		u64 max = (sysctl_sched_migration_cost * 3) / 2;
>
> -		if (delta > max)
> +		update_avg(&rq->avg_idle, delta);
> +
> +		if (rq->avg_idle > max)
>  			rq->avg_idle = max;
> -		else
> -			update_avg(&rq->avg_idle, delta);
>  		rq->idle_stamp = 0;
>  	}

I wonder if we could get even more conservative values
of avg_idle by clamping delta to max, before calling
update_avg...

Or rather, I wonder if that would matter enough to make
a difference, and in what direction that difference would
be.

In other words:

	if (rq->idle_stamp) {
		u64 delta = rq->clock - rq->idle_stamp;
		u64 max = (sysctl_sched_migration_cost * 3) / 2;

		if (delta > max)
			delta = max;

		update_avg(&rq->avg_idle, delta);
		rq->idle_stamp = 0;
	}

--
All rights reversed

2013-07-31 16:31:27

by Rik van Riel

Subject: Re: [RFC PATCH] sched: Reduce overestimating avg_idle

On 07/31/2013 05:37 AM, Jason Low wrote:
> The avg_idle value may sometimes be overestimated, which may cause new idle
> load balance to be attempted more often than it should be. Currently, when
> avg_idle gets updated, if the delta exceeds some max value (default 1,000,000 ns),
> the entire avg gets set to the max value, regardless of what the previous avg
> was. So if a CPU remains idle for 200,000 ns most of the time, but goes idle
> once for 1,200,000 ns, the average is pushed up to 1,000,000 ns when it should
> be much lower.
>
> Additionally, once avg_idle is at its max, it may take a while to pull the
> avg back down to where it should be. In the above example, after avg_idle is
> set to the max value of 1,000,000 ns, the CPU's idle durations need to be
> 200,000 ns for the next 8 occurrences before the avg falls below the migration
> cost value.
>
> This patch attempts to avoid these situations by always updating avg_idle
> first with update_avg(), and only then clipping it: if the new avg exceeds
> the max value, the avg gets set to the max. This patch also lowers the max
> avg_idle value to migration_cost * 1.5 instead of migration_cost * 2, to
> reduce the time it takes to pull the avg down after long idles.
>
> With this change, I got some decent performance boosts in AIM7 workloads on
> an 8-socket machine on the 3.10 kernel. In particular, it boosted the AIM7
> fserver workload by about 20% when running it with a high # of users.
>
> A related avg_idle question: does the migration_cost used in idle balance
> need to be the same as the migration_cost used in task_hot()? Can we keep
> the default migration_cost used in task_hot() the same, but use a different
> (or larger) default value when comparing against avg_idle in idle balance?
>
>
> Signed-off-by: Jason Low <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed

2013-08-01 07:36:48

by Jason Low

Subject: Re: [RFC PATCH] sched: Reduce overestimating avg_idle


> I wonder if we could get even more conservative values
> of avg_idle by clamping delta to max, before calling
> update_avg...
>
> Or rather, I wonder if that would matter enough to make
> a difference, and in what direction that difference would
> be.
>
> In other words:
>
> 	if (rq->idle_stamp) {
> 		u64 delta = rq->clock - rq->idle_stamp;
> 		u64 max = (sysctl_sched_migration_cost * 3) / 2;
>
> 		if (delta > max)
> 			delta = max;
>
> 		update_avg(&rq->avg_idle, delta);
> 		rq->idle_stamp = 0;
> 	}

Yes, I initially tried to limit delta to the max. That helped keep the
avg_idle smaller and provided even better performance improvements in the
8-socket, HT-enabled case. Here are some of those performance boosts on
AIM7:

alltests:  +14.5%    custom:        +15.9%    disk:          +15.9%
fserver:   +33.7%    new_fserver:   +15.7%    high_systime:  +16.7%
shared:    +14.1%

When we limit the average instead of the delta, the performance boosts
were in the range of 5-10%, with the exception of fserver.

I initially thought that limiting delta to a small value might cause the
average to often be underestimated. But come to think of it, it may
actually give a more accurate estimate of whether the majority of idle
durations are shorter or longer than migration_cost: idle durations can
be arbitrarily long, while there's a floor on how short an idle can be,
so the raw average is biased towards a high avg. Clamping the delta
helps offset some of that bias.
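
To illustrate with made-up numbers (userspace sketch, reusing the
update_avg() helper from the changelog above): with seven short idles for
every very long one, clamping the delta keeps the average near the
short-idle value, while clamping only the average snaps back to the max
after each long idle:

        const uint64_t max = 1000000;   /* 2 * migration_cost */
        uint64_t clamp_delta = max, clamp_avg = max;
        int i;

        for (i = 0; i < 64; i++) {
                /* seven 200,000 ns idles, then a 10,000,000 ns one */
                uint64_t idle = (i % 8 == 7) ? 10000000 : 200000;

                /* variant 1: clamp the sample before averaging */
                update_avg(&clamp_delta, idle > max ? max : idle);

                /* variant 2: average first, then clamp the result */
                update_avg(&clamp_avg, idle);
                if (clamp_avg > max)
                        clamp_avg = max;
        }

        /* clamp_delta stays roughly in the 260,000-350,000 ns range,
         * while clamp_avg is pinned at 1,000,000 ns right after every
         * long idle */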

So how acceptable is setting a limit of 2*migration_cost or less on the
delta rather than on the avg?

Thanks,
Jason

2013-08-01 07:48:50

by Peter Zijlstra

Subject: Re: [RFC PATCH] sched: Reduce overestimating avg_idle

On Thu, Aug 01, 2013 at 12:36:41AM -0700, Jason Low wrote:
> So how acceptable is setting a limit of 2*migration_cost or less on the
> delta rather than on the avg?

It's fine with me, as we're already doing that. But then you did ask last
time around whether those two things weren't unrelated, etc..

2013-08-02 08:20:35

by Jason Low

Subject: Re: [RFC PATCH] sched: Reduce overestimating avg_idle

On Wed, 2013-07-31 at 11:53 +0200, Peter Zijlstra wrote:

> No, they're quite unrelated. I think you can measure the max time we've
> ever spent in newidle balance and use that to clip the values.

So I tried using the rq's max newidle balance cost to compare with the
average, with sysctl_sched_migration_cost as the initial/default max. One
thing I noticed when running this on an 8-socket machine was that the max
idle balance cost was a lot higher around boot time than after boot. Not
sure if IRQ/NMI/SMI was the cause of this. A temporary "fix" I made was
to reset the max idle balance costs every 2 minutes, roughly as sketched
below.
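
The reset looked roughly like this (hypothetical sketch; the field names
are made up), run from the idle balance path:

        /* Forget a stale max (e.g. a boot-time spike) every 2 minutes. */
        if (time_after(jiffies, rq->last_max_cost_reset + 120 * HZ)) {
                rq->max_idle_balance_cost = sysctl_sched_migration_cost;
                rq->last_max_cost_reset = jiffies;
        }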

> Similarly, I've thought about how we updated the sd->avg_cost in the
> previous patches and wondered if we should not track max_cost.
>
> The 'only' downside I could come up with is that it's all run from
> SoftIRQ context which means IRQ/NMI/SMI can all stretch/warp the time it
> takes to actually do the idle balance.

Another thing I thought of is that the max idle balance cost may also
vary with the workload that is running. So running a workload with
shorter idle balances after one with longer idle balances can leave us
using a stale, higher max cost for a while. But I guess that is okay if
we're trying to reduce the chance of overrunning the average.

Jason