2014-06-16 19:48:51

by Tim Chen

Subject: [PATCH v2] sched: Fast idling of CPU when system is partially loaded

Thanks for the reviews from Jason and Peter. I've moved the check
of whether load balancing is required into fair.c's idle_balance.

When a system is lightly loaded (i.e. no more than one job per CPU),
attempting to pull a job to a CPU before putting it to idle is
unnecessary and can be skipped. This patch adds an indicator so the
scheduler can know when no CPU in the system has more than one active
job, and skip the needless job pulls.

On a 4 socket machine running a request/response style workload from
clients, we saw about 0.13 msec of delay when going through a full load
balance to try to pull a job from all the other CPUs. Since only 0.1 msec
was spent processing the request and generating a response, the 0.13 msec
load balance overhead exceeded the actual work being done. This overhead
can be skipped much of the time on lightly loaded systems.

With this patch, we tested a netperf request/response workload that kept
the server busy on half the CPUs of a 4 socket system. The patch
eliminated 75% of the load balance attempts before idling a CPU.

The overhead of setting/clearing the indicator is low, as we already
gather the necessary info when we call add_nr_running and
update_sd_lb_stats. We switch to full load balancing immediately if any
CPU gets more than one job on its run queue in add_nr_running. We clear
the indicator, allowing load balancing to be skipped, when we detect that
no CPU has more than one job while scanning the run queues in
update_sg_lb_stats. In other words, we are aggressive about turning load
balancing on and opportunistic about skipping it.
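
To make the mechanism easier to follow before reading the diff, here is a
minimal userspace sketch of the three touch points. The struct names mirror
the kernel's, but the types, the "_sketch" helpers, and the absence of
locking are simplifications for illustration; the real code is in the diff
below.

#include <stdbool.h>

/* Simplified stand-ins for the kernel structures; illustrative only. */
struct root_domain { bool overload; };
struct rq { unsigned nr_running; struct root_domain *rd; };

/* 1. Set eagerly: a runqueue gains a second task (cf. add_nr_running). */
static void add_nr_running_sketch(struct rq *rq, unsigned count)
{
        unsigned prev_nr = rq->nr_running;

        rq->nr_running = prev_nr + count;
        if (prev_nr < 2 && rq->nr_running >= 2 && !rq->rd->overload)
                rq->rd->overload = true; /* write only on the 1 -> 2 transition */
}

/* 2. Clear lazily: the root-domain scan found no runqueue with more
 *    than one task (cf. update_sd_lb_stats); store only on change to
 *    avoid dirtying the shared cache line. */
static void publish_overload_sketch(struct root_domain *rd, bool overload)
{
        if (rd->overload != overload)
                rd->overload = overload;
}

/* 3. Test cheaply before trying to pull work into an idling CPU
 *    (cf. idle_balance). */
static bool worth_idle_balancing_sketch(struct rq *this_rq)
{
        return this_rq->rd->overload;
}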

Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/fair.c | 24 +++++++++++++++++++++---
kernel/sched/sched.h | 10 ++++++++--
2 files changed, 29 insertions(+), 5 deletions(-)

Change log:
v2.
1. Move the skip load balance code to idle_balance.
2. Use env->dst_rq->rd to get the root domain directly.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9855e87..95bb541 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5863,7 +5863,8 @@ static inline int sg_capacity(struct lb_env *env, struct sched_group *group)
*/
static inline void update_sg_lb_stats(struct lb_env *env,
struct sched_group *group, int load_idx,
- int local_group, struct sg_lb_stats *sgs)
+ int local_group, struct sg_lb_stats *sgs,
+ bool *overload)
{
unsigned long load;
int i;
@@ -5881,6 +5882,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,

sgs->group_load += load;
sgs->sum_nr_running += rq->nr_running;
+ if (overload && rq->nr_running > 1)
+ *overload = true;
#ifdef CONFIG_NUMA_BALANCING
sgs->nr_numa_running += rq->nr_numa_running;
sgs->nr_preferred_running += rq->nr_preferred_running;
@@ -5991,6 +5994,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats tmp_sgs;
int load_idx, prefer_sibling = 0;
+ bool overload = false;

if (child && child->flags & SD_PREFER_SIBLING)
prefer_sibling = 1;
@@ -6011,7 +6015,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
update_group_power(env->sd, env->dst_cpu);
}

- update_sg_lb_stats(env, sg, load_idx, local_group, sgs);
+ if (env->sd->parent)
+ update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
+ NULL);
+ else
+ /* gather overload info if we are at root domain */
+ update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
+ &overload);

if (local_group)
goto next_group;
@@ -6045,6 +6055,13 @@ next_group:

if (env->sd->flags & SD_NUMA)
env->fbq_type = fbq_classify_group(&sds->busiest_stat);
+
+ if (!env->sd->parent) {
+ /* update overload indicator if we are at root domain */
+ if (env->dst_rq->rd->overload != overload)
+ env->dst_rq->rd->overload = overload;
+ }
+
}

/**
@@ -6762,7 +6779,8 @@ static int idle_balance(struct rq *this_rq)
*/
this_rq->idle_stamp = rq_clock(this_rq);

- if (this_rq->avg_idle < sysctl_sched_migration_cost) {
+ if (this_rq->avg_idle < sysctl_sched_migration_cost ||
+ !this_rq->rd->overload) {
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
if (sd)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e47679b..396bce0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -477,6 +477,9 @@ struct root_domain {
cpumask_var_t span;
cpumask_var_t online;

+ /* Indicate more than one runnable task for any CPU */
+ bool overload;
+
/*
* The bit corresponding to a CPU gets set here if such CPU has more
* than one runnable -deadline task (as it is below for RT tasks).
@@ -1212,15 +1215,18 @@ static inline void add_nr_running(struct rq *rq, unsigned count)

rq->nr_running = prev_nr + count;

-#ifdef CONFIG_NO_HZ_FULL
if (prev_nr < 2 && rq->nr_running >= 2) {
+ if (!rq->rd->overload)
+ rq->rd->overload = true;
+
+#ifdef CONFIG_NO_HZ_FULL
if (tick_nohz_full_cpu(rq->cpu)) {
/* Order rq->nr_running write against the IPI */
smp_wmb();
smp_send_reschedule(rq->cpu);
}
- }
#endif
+ }
}

static inline void sub_nr_running(struct rq *rq, unsigned count)
--
1.7.11.7


2014-06-23 04:33:16

by Jason Low

Subject: Re: [PATCH v2] sched: Fast idling of CPU when system is partially loaded

On Mon, 2014-06-16 at 12:48 -0700, Tim Chen wrote:

> Thanks for the reviews from Jason and Peter. I've moved the check
> of whether load balancing is required into fair.c's idle_balance.
>
> When a system is lightly loaded (i.e. no more than one job per CPU),
> attempting to pull a job to a CPU before putting it to idle is
> unnecessary and can be skipped. This patch adds an indicator so the
> scheduler can know when no CPU in the system has more than one active
> job, and skip the needless job pulls.

> Signed-off-by: Tim Chen <[email protected]>

Acked-by: Jason Low <[email protected]>

This change would address one of the main issues I've also been seeing
on my test machines with idle_balance, where most of the
find_busiest_group overhead is wasted because there are no tasks to
move.

2014-06-23 12:52:54

by Peter Zijlstra

Subject: Re: [PATCH v2] sched: Fast idling of CPU when system is partially loaded

On Mon, Jun 16, 2014 at 12:48:47PM -0700, Tim Chen wrote:
> +++ b/kernel/sched/fair.c
> @@ -5863,7 +5863,8 @@ static inline int sg_capacity(struct lb_env *env, struct sched_group *group)
> */
> static inline void update_sg_lb_stats(struct lb_env *env,
> struct sched_group *group, int load_idx,
> - int local_group, struct sg_lb_stats *sgs)
> + int local_group, struct sg_lb_stats *sgs,
> + bool *overload)
> {
> unsigned long load;
> int i;
> @@ -5881,6 +5882,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>
> sgs->group_load += load;
> sgs->sum_nr_running += rq->nr_running;
> + if (overload && rq->nr_running > 1)
> + *overload = true;
> #ifdef CONFIG_NUMA_BALANCING
> sgs->nr_numa_running += rq->nr_numa_running;
> sgs->nr_preferred_running += rq->nr_preferred_running;
> @@ -5991,6 +5994,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> struct sched_group *sg = env->sd->groups;
> struct sg_lb_stats tmp_sgs;
> int load_idx, prefer_sibling = 0;
> + bool overload = false;
>
> if (child && child->flags & SD_PREFER_SIBLING)
> prefer_sibling = 1;
> @@ -6011,7 +6015,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> update_group_power(env->sd, env->dst_cpu);
> }
>
> - update_sg_lb_stats(env, sg, load_idx, local_group, sgs);
> + if (env->sd->parent)
> + update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
> + NULL);
> + else
> + /* gather overload info if we are at root domain */
> + update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
> + &overload);
>
> if (local_group)
> goto next_group;
> @@ -6045,6 +6055,13 @@ next_group:
>
> if (env->sd->flags & SD_NUMA)
> env->fbq_type = fbq_classify_group(&sds->busiest_stat);
> +
> + if (!env->sd->parent) {
> + /* update overload indicator if we are at root domain */
> + if (env->dst_rq->rd->overload != overload)
> + env->dst_rq->rd->overload = overload;
> + }
> +
> }
>
> /**

So I don't get why we can't do the below; I think Jason tried to ask the
same...

Making that overload thing unconditional makes the code simpler and the
cost is about the same; it doesn't matter if we test the pointer or
->nr_running, which we've already loaded anyhow.

Also, with only having a single update_sg_lb_stats() callsite GCC can
more easily inline the lot.

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5886,7 +5886,7 @@ static inline void update_sg_lb_stats(st

sgs->group_load += load;
sgs->sum_nr_running += rq->nr_running;
- if (overload && rq->nr_running > 1)
+ if (rq->nr_running > 1)
*overload = true;
#ifdef CONFIG_NUMA_BALANCING
sgs->nr_numa_running += rq->nr_numa_running;
@@ -6019,13 +6019,7 @@ static inline void update_sd_lb_stats(st
update_group_capacity(env->sd, env->dst_cpu);
}

- if (env->sd->parent)
- update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
- NULL);
- else
- /* gather overload info if we are at root domain */
- update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
- &overload);
+ update_sg_lb_stats(env, sg, load_idx, local_group, sgs, &overload);

if (local_group)
goto next_group;
@@ -6065,7 +6059,6 @@ static inline void update_sd_lb_stats(st
if (env->dst_rq->rd->overload != overload)
env->dst_rq->rd->overload = overload;
}
-
}

/**

2014-06-23 16:22:37

by Andi Kleen

Subject: Re: [PATCH v2] sched: Fast idling of CPU when system is partially loaded

> So I don't get why we can't do the below; I think Jason tried to ask the
> same...

The important part for performance is to minimize the cache line transfers. Your
unconditional variant would cause more dirty cache lines than Tim's,
right?

-Andi

2014-06-23 16:41:00

by Tim Chen

Subject: Re: [PATCH v2] sched: Fast idling of CPU when system is partially loaded

On Mon, 2014-06-23 at 14:52 +0200, Peter Zijlstra wrote:
> On Mon, Jun 16, 2014 at 12:48:47PM -0700, Tim Chen wrote:
> > +++ b/kernel/sched/fair.c
> > @@ -5863,7 +5863,8 @@ static inline int sg_capacity(struct lb_env *env, struct sched_group *group)
> > */
> > static inline void update_sg_lb_stats(struct lb_env *env,
> > struct sched_group *group, int load_idx,
> > - int local_group, struct sg_lb_stats *sgs)
> > + int local_group, struct sg_lb_stats *sgs,
> > + bool *overload)
> > {
> > unsigned long load;
> > int i;
> > @@ -5881,6 +5882,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >
> > sgs->group_load += load;
> > sgs->sum_nr_running += rq->nr_running;
> > + if (overload && rq->nr_running > 1)
> > + *overload = true;
> > #ifdef CONFIG_NUMA_BALANCING
> > sgs->nr_numa_running += rq->nr_numa_running;
> > sgs->nr_preferred_running += rq->nr_preferred_running;
> > @@ -5991,6 +5994,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > struct sched_group *sg = env->sd->groups;
> > struct sg_lb_stats tmp_sgs;
> > int load_idx, prefer_sibling = 0;
> > + bool overload = false;
> >
> > if (child && child->flags & SD_PREFER_SIBLING)
> > prefer_sibling = 1;
> > @@ -6011,7 +6015,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > update_group_power(env->sd, env->dst_cpu);
> > }
> >
> > - update_sg_lb_stats(env, sg, load_idx, local_group, sgs);
> > + if (env->sd->parent)
> > + update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
> > + NULL);
> > + else
> > + /* gather overload info if we are at root domain */
> > + update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
> > + &overload);
> >
> > if (local_group)
> > goto next_group;
> > @@ -6045,6 +6055,13 @@ next_group:
> >
> > if (env->sd->flags & SD_NUMA)
> > env->fbq_type = fbq_classify_group(&sds->busiest_stat);
> > +
> > + if (!env->sd->parent) {
> > + /* update overload indicator if we are at root domain */
> > + if (env->dst_rq->rd->overload != overload)
> > + env->dst_rq->rd->overload = overload;
> > + }
> > +
> > }
> >
> > /**
>
> So I don't get why we can't do the below; I think Jason tried to ask the
> same...
>
> Making that overload thing unconditional makes the code simpler and the
> cost is about the same; it doesn't matter if we test the pointer or
> ->nr_running, which we've already loaded anyhow.
>
> Also, with only having a single update_sg_lb_stats() callsite GCC can
> more easily inline the lot.
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5886,7 +5886,7 @@ static inline void update_sg_lb_stats(st
>
> sgs->group_load += load;
> sgs->sum_nr_running += rq->nr_running;
> - if (overload && rq->nr_running > 1)
> + if (rq->nr_running > 1)
> *overload = true;
> #ifdef CONFIG_NUMA_BALANCING
> sgs->nr_numa_running += rq->nr_numa_running;
> @@ -6019,13 +6019,7 @@ static inline void update_sd_lb_stats(st
> update_group_capacity(env->sd, env->dst_cpu);
> }
>
> - if (env->sd->parent)
> - update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
> - NULL);
> - else
> - /* gather overload info if we are at root domain */
> - update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
> - &overload);
> + update_sg_lb_stats(env, sg, load_idx, local_group, sgs, &overload);

With this change, we'll be computing the overload indicator
even for non-root domains, where it isn't used. That means
extra work in update_sg_lb_stats as it loops through each rq
checking nr_running to update the indicator. I was hoping to
avoid that if possible.

Tim

2014-06-23 16:44:37

by Jason Low

Subject: Re: [PATCH v2] sched: Fast idling of CPU when system is partially loaded

On Mon, 2014-06-23 at 18:22 +0200, Andi Kleen wrote:
> > So I don't get why we can't do the below; I think Jason tried to ask the
> > same...
>
> The important part for performance is to minimize the cache line transfers. Your
> unconditional variant would cause more dirty cache lines than Tim's,
> right?

How about that change to the update_sg_lb_stats() call site along with:

sgs->group_load += load;
sgs->sum_nr_running += rq->nr_running;
- if (overload && rq->nr_running > 1)
+ if (!env->sd->parent && rq->nr_running > 1)
*overload = true;

which keeps the check conditional to avoid setting overload unnecessarily
when it's not used, and still gets the benefits Peter mentioned.
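
In sketch form, the combined change (a single call site that always passes
&overload, with the store guarded by the root-domain check) would look
roughly like the following. The names mirror v2, but this is an
illustration of the proposal, not the actual v3 patch:

#include <stdbool.h>

/* Simplified stand-ins for the kernel types; illustrative only. */
struct rq { unsigned nr_running; };
struct sched_domain { struct sched_domain *parent; };
struct lb_env { struct sched_domain *sd; };

/* Per-runqueue body of the stats loop: callers always pass &overload,
 * but the store happens only at the root domain (sd->parent is NULL). */
static void note_overload_sketch(struct lb_env *env, struct rq *rq,
                                 bool *overload)
{
        if (!env->sd->parent && rq->nr_running > 1)
                *overload = true;
}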

2014-06-23 17:01:13

by Tim Chen

Subject: Re: [PATCH v2] sched: Fast idling of CPU when system is partially loaded

On Mon, 2014-06-23 at 09:44 -0700, Jason Low wrote:
> On Mon, 2014-06-23 at 18:22 +0200, Andi Kleen wrote:
> > > So I don't get why we can't do the below; I think Jason tried to ask the
> > > same...
> >
> > The important part for performance is to minimize the cache line transfers. Your
> > unconditional variant would cause more dirty cache lines than Tim's,
> > right?
>
> How about that change to the update_sg_lb_stats() call site along with:
>
> sgs->group_load += load;
> sgs->sum_nr_running += rq->nr_running;
> - if (overload && rq->nr_running > 1)
> + if (!env->sd->parent && rq->nr_running > 1)
> *overload = true;
>
> which keeps the check conditional to avoid setting overload unnecessarily
> when it's not used, and still gets the benefits Peter mentioned.

I think this change will satisfy both needs. I'll re-spin a v3 patch
with this modification if there are no objections.

Tim

2014-06-23 18:48:19

by Peter Zijlstra

Subject: Re: [PATCH v2] sched: Fast idling of CPU when system is partially loaded

On Mon, Jun 23, 2014 at 06:22:34PM +0200, Andi Kleen wrote:
> > So I don't get why we can't do the below; I think Jason tried to ask the
> > same...
>
> The important part for performance is to minimize the cache line transfers. Your
> unconditional variant would cause more dirty cache lines than Tim's,
> right?

Don't think so. We already have the overload flag on the stack and dirty,
and the only thing the check needs is rq->nr_running, which we already
load unconditionally.

2014-06-23 18:50:12

by Peter Zijlstra

Subject: Re: [PATCH v2] sched: Fast idling of CPU when system is partially loaded

On Mon, Jun 23, 2014 at 09:40:45AM -0700, Tim Chen wrote:

> > @@ -5886,7 +5886,7 @@ static inline void update_sg_lb_stats(st
> >
> > sgs->group_load += load;
> > sgs->sum_nr_running += rq->nr_running;
> > - if (overload && rq->nr_running > 1)
> > + if (rq->nr_running > 1)
> > *overload = true;
> > #ifdef CONFIG_NUMA_BALANCING
> > sgs->nr_numa_running += rq->nr_numa_running;

> With this change, we'll be computing the overload indicator
> even for non-root domains, where it isn't used. That means
> extra work in update_sg_lb_stats as it loops through each rq
> checking nr_running to update the indicator. I was hoping to
> avoid that if possible.

What extra work? We already load nr_running, and the overload flag is
on the stack and should already be quite dirty because of that.

2014-06-23 18:59:43

by Tim Chen

Subject: Re: [PATCH v2] sched: Fast idling of CPU when system is partially loaded

On Mon, 2014-06-23 at 20:50 +0200, Peter Zijlstra wrote:
> On Mon, Jun 23, 2014 at 09:40:45AM -0700, Tim Chen wrote:
>
> > > @@ -5886,7 +5886,7 @@ static inline void update_sg_lb_stats(st
> > >
> > > sgs->group_load += load;
> > > sgs->sum_nr_running += rq->nr_running;
> > > - if (overload && rq->nr_running > 1)
> > > + if (rq->nr_running > 1)
> > > *overload = true;
> > > #ifdef CONFIG_NUMA_BALANCING
> > > sgs->nr_numa_running += rq->nr_numa_running;
>
> > With this change, we'll be computing the overload indicator
> > even for non-root domains, where it isn't used. That means
> > extra work in update_sg_lb_stats as it loops through each rq
> > checking nr_running to update the indicator. I was hoping to
> > avoid that if possible.
>
> What extra work? We already load nr_running, and the overload flag is
> on the stack and should already be quite dirty because of that.

Okay then. I'll need to modify the v3 patch with the following bits:

- /* only need to update overload indicator for root domain */
- if (!env->sd->parent && rq->nr_running > 1)
+ if (rq->nr_running > 1)
*overload = true;

I just sent the v3 patch out before your email reached me.

Tim
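
For reference, here is a sketch of the shape the thread converges on,
pieced together from the quoted fragments above. This is a simplified
userspace model, and the actual v3 patch may differ in detail:

#include <stdbool.h>

/* Simplified stand-ins for the kernel structures; illustrative only. */
struct root_domain { bool overload; };
struct rq { unsigned nr_running; struct root_domain *rd; };

/* Per-runqueue check, now unconditional: the overload flag lives on the
 * caller's stack and rq->nr_running is loaded anyway, so the extra test
 * is essentially free (Peter's argument). */
static void scan_rq_sketch(struct rq *rq, bool *overload)
{
        if (rq->nr_running > 1)
                *overload = true;
}

/* Publication to the shared root_domain stays conditional on being at
 * the root domain and only writes on a change, limiting cache-line
 * traffic (Andi's concern). */
static void publish_sketch(struct root_domain *rd, bool overload,
                           bool at_root)
{
        if (at_root && rd->overload != overload)
                rd->overload = overload;
}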