2013-08-12 08:43:00

by Jason Low

Subject: [PATCH] sched: Give idle_balance() a break when it does not move tasks.

When running benchmarks on an 8 socket (80 core) machine, newidle load balance
was being attempted frequently, but most of the time no tasks were moved. The
most common reason was that find_busiest_group() returned NULL (for example,
because the groups were already balanced within their domains). Another common
reason was that move_tasks() was attempted but could not find a suitable task
to move.

In situations where idle balance does not need to, or is not able to, move
tasks, attempting it frequently may lower performance by consuming kernel time
while accomplishing nothing. During these situations, functions such as
update_sd_lb_stats() and idle_cpu(), as well as acquiring the rq lock, show up
as hot in perf profiles.

In this patch, idle balance attempts on a CPU are blocked for a brief duration
(10 ms) after a newidle load balance on that CPU fails to move any tasks. The
idea is that if a CPU has just attempted a newidle load balance and found
that, across all of its sched domains, it either does not need to do any
balancing or is unable to pull a task over, then it should avoid re-attempting
the balance on that CPU again too soon.
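
In condensed form (the names are those used in the diff below; HZ / 100
jiffies corresponds to 10 ms), the change amounts to:

        /* On entry to idle_balance(): skip the attempt while on a break. */
        if (this_rq->avg_idle < sysctl_sched_migration_cost ||
            time_before(jiffies, this_rq->next_newidle_balance))
                return;

        ...

        /*
         * After walking the sched domains: start a 10 ms break if balancing
         * was attempted but no task was pulled.
         */
        if (load_balance_attempted && !pulled_task)
                this_rq->next_newidle_balance = jiffies + (HZ / 100);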

This patch provides good performance boosts for many AIM7 workloads on an
8 socket (80 core) machine while still preserving the performance of a few
other workloads I have been running that benefit from idle balancing.

The table below compares the average jobs per minute at 10-100, 200-1000, and
1100-2000 users between the vanilla 3.11-rc4 kernel and the 3.11-rc4 kernel
with this patch, with Hyperthreading enabled.

----------------------------------------------------------------
workload     | % improvement   | % improvement  | % improvement
             | with patch      | with patch     | with patch
             | 1100-2000 users | 200-1000 users | 10-100 users
----------------------------------------------------------------
alltests     | +13.9%          | +5.6%          | +0.9%
----------------------------------------------------------------
custom       | +20.3%          | +20.7%         | +12.2%
----------------------------------------------------------------
disk         | +22.3%          | +18.7%         | +18.3%
----------------------------------------------------------------
fserver      | +62.6%          | +27.3%         | -0.9%
----------------------------------------------------------------
high_systime | +17.7%          | +10.7%         | +0.6%
----------------------------------------------------------------
new_fserver  | +53.1%          | +20.2%         | -0.2%
----------------------------------------------------------------
shared       | +10.3%          | +10.8%         | +10.7%
----------------------------------------------------------------


Signed-off-by: Jason Low <[email protected]>
---
 kernel/sched/core.c  |  1 +
 kernel/sched/fair.c  | 10 +++++++++-
 kernel/sched/sched.h |  5 +++++
 3 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b7c32cb..b1fcf6d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6458,6 +6458,7 @@ void __init sched_init(void)
                 rq->online = 0;
                 rq->idle_stamp = 0;
                 rq->avg_idle = 2*sysctl_sched_migration_cost;
+                rq->next_newidle_balance = jiffies;

                 INIT_LIST_HEAD(&rq->cfs_tasks);

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9565645..70d96d6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5277,10 +5277,12 @@ void idle_balance(int this_cpu, struct rq *this_rq)
         struct sched_domain *sd;
         int pulled_task = 0;
         unsigned long next_balance = jiffies + HZ;
+        bool load_balance_attempted = false;

         this_rq->idle_stamp = rq_clock(this_rq);

-        if (this_rq->avg_idle < sysctl_sched_migration_cost)
+        if (this_rq->avg_idle < sysctl_sched_migration_cost ||
+            time_before(jiffies, this_rq->next_newidle_balance))
                 return;

         /*
@@ -5298,6 +5300,8 @@ void idle_balance(int this_cpu, struct rq *this_rq)
                         continue;

                 if (sd->flags & SD_BALANCE_NEWIDLE) {
+                        load_balance_attempted = true;
+
                         /* If we've pulled tasks over stop searching: */
                         pulled_task = load_balance(this_cpu, this_rq,
                                                    sd, CPU_NEWLY_IDLE, &balance);
@@ -5322,6 +5326,10 @@ void idle_balance(int this_cpu, struct rq *this_rq)
                  */
                 this_rq->next_balance = next_balance;
         }
+
+        /* Give idle balance on this CPU a break when it isn't moving tasks */
+        if (load_balance_attempted && !pulled_task)
+                this_rq->next_newidle_balance = jiffies + (HZ / 100);
 }

 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ef0a7b2..6a70be9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -524,6 +524,11 @@ struct rq {
 #endif

         struct sched_avg avg;
+
+#ifdef CONFIG_SMP
+        /* set when we want to block idle balance */
+        unsigned long next_newidle_balance;
+#endif
 };

 static inline int cpu_of(struct rq *rq)
--
1.7.1



2013-08-12 11:01:26

by Srikar Dronamraju

Subject: Re: [PATCH] sched: Give idle_balance() a break when it does not move tasks.

>          /*
> @@ -5298,6 +5300,8 @@ void idle_balance(int this_cpu, struct rq *this_rq)
>                          continue;
>
>                  if (sd->flags & SD_BALANCE_NEWIDLE) {
> +                        load_balance_attempted = true;
> +
>                          /* If we've pulled tasks over stop searching: */
>                          pulled_task = load_balance(this_cpu, this_rq,
>                                                     sd, CPU_NEWLY_IDLE, &balance);
> @@ -5322,6 +5326,10 @@ void idle_balance(int this_cpu, struct rq *this_rq)
>                   */
>                  this_rq->next_balance = next_balance;
>          }
> +
> +        /* Give idle balance on this CPU a break when it isn't moving tasks */
> +        if (load_balance_attempted && !pulled_task)
> +                this_rq->next_newidle_balance = jiffies + (HZ / 100);
>  }

Looks reasonable. However, should we do this per sd and not per rq? I.e.,
move next_newidle_balance into struct sched_domain, so that if a newly-idle
load_balance of a domain wasn't successful, we skip load_balance for that sd
on the next newly-idle balance.
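
Roughly, something like the sketch below inside the for_each_domain() loop
(sd->next_newidle_balance is a hypothetical new field in struct sched_domain;
it is not part of the posted patch). The rest of the loop, including the
next_balance bookkeeping, would stay as it is:

        if (sd->flags & SD_BALANCE_NEWIDLE &&
            !time_before(jiffies, sd->next_newidle_balance)) {
                /* If we've pulled tasks over stop searching: */
                pulled_task = load_balance(this_cpu, this_rq,
                                           sd, CPU_NEWLY_IDLE, &balance);

                /* Give only this domain a break when nothing was moved. */
                if (!pulled_task)
                        sd->next_newidle_balance = jiffies + (HZ / 100);
        }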

--
Thanks and Regards
Srikar Dronamraju

2013-08-12 20:20:11

by Jason Low

Subject: Re: [PATCH] sched: Give idle_balance() a break when it does not move tasks.

On Mon, 2013-08-12 at 16:30 +0530, Srikar Dronamraju wrote:
> >          /*
> > @@ -5298,6 +5300,8 @@ void idle_balance(int this_cpu, struct rq *this_rq)
> >                          continue;
> >
> >                  if (sd->flags & SD_BALANCE_NEWIDLE) {
> > +                        load_balance_attempted = true;
> > +
> >                          /* If we've pulled tasks over stop searching: */
> >                          pulled_task = load_balance(this_cpu, this_rq,
> >                                                     sd, CPU_NEWLY_IDLE, &balance);
> > @@ -5322,6 +5326,10 @@ void idle_balance(int this_cpu, struct rq *this_rq)
> >                   */
> >                  this_rq->next_balance = next_balance;
> >          }
> > +
> > +        /* Give idle balance on this CPU a break when it isn't moving tasks */
> > +        if (load_balance_attempted && !pulled_task)
> > +                this_rq->next_newidle_balance = jiffies + (HZ / 100);
> >  }
>
> Looks reasonable. However, should we do this per sd and not per rq? I.e.,
> move next_newidle_balance into struct sched_domain, so that if a newly-idle
> load_balance of a domain wasn't successful, we skip load_balance for that sd
> on the next newly-idle balance.

I wonder: if we skip newidle balance for a domain after a newidle balance
attempt on one CPU did not move tasks, would that potentially cause some
"unfairness" for all the other CPUs within that domain?

Perhaps we could reduce the duration that idle balance is blocked, from 10 ms
to something much smaller, if we were to block on a per-domain basis.
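
For example, purely as an illustration (reusing the hypothetical per-domain
field sketched above, which is not part of the posted patch), something like:

        if (!pulled_task)
                sd->next_newidle_balance = jiffies + msecs_to_jiffies(2);

where the 2 ms is just a placeholder for whatever smaller value works well.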

Peter, any thoughts on which method is preferable?

Thanks for the suggestion,
Jason