2008-11-06 19:40:43

by Ken Chen

Subject: [patch] restore sched_exec load balance heuristics

We've seen a long-standing performance regression in sys_execve across
several upstream kernels, largely on workloads that do heavy execve. The
main reason for the regression is a change in the sched_exec load balance
heuristics. For example, on the 2.6.11 kernel, the exec'ing task stays on
the same CPU if it is the only task running there. Kernels 2.6.13 and
onward, however, walk the sched-domains looking for the most idle CPU
(without treating the exec'ing task's own CPU as idle). This bounces the
exec'ing task all over the place, which leads to poor CPU cache and NUMA
locality. (The workload happens to share common data between subsequently
exec'ed programs.)

This execve heuristic was removed from the upstream kernel by this git commit:

commit 68767a0ae428801649d510d9a65bb71feed44dd1
Author: Nick Piggin <[email protected]>
Date: Sat Jun 25 14:57:20 2005 -0700

[PATCH] sched: schedstats update for balance on fork
Add SCHEDSTAT statistics for sched-balance-fork.

From the commit description, it appears that deleting the heuristic
was an accident, as the commit was supposedly just for schedstats.

So, restore sched-exec load balancing when the exec'ing task is the only
task running on that CPU. The logic makes sense: a newly exec'ed program
should continue to run on the current CPU, as moving it neither fixes any
load imbalance nor gains anything by bouncing to another idle CPU. Staying
on the same CPU preserves cache and NUMA locality.

Signed-off-by: Ken Chen <[email protected]>

diff --git a/kernel/sched.c b/kernel/sched.c
index e8819bc..4ad1907 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2873,7 +2873,12 @@ out:
*/
void sched_exec(void)
{
- int new_cpu, this_cpu = get_cpu();
+ int new_cpu, this_cpu;
+
+ if (this_rq()->nr_running <= 1)
+ return;
+
+ this_cpu = get_cpu();
new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC);
put_cpu();
if (new_cpu != this_cpu)


2008-11-06 20:08:17

by Ingo Molnar

Subject: Re: [patch] restore sched_exec load balance heuristics


* Ken Chen <[email protected]> wrote:

> We've seen a long-standing performance regression in sys_execve across
> several upstream kernels, largely on workloads that do heavy execve. The
> main reason for the regression is a change in the sched_exec load balance
> heuristics. For example, on the 2.6.11 kernel, the exec'ing task stays on
> the same CPU if it is the only task running there. Kernels 2.6.13 and
> onward, however, walk the sched-domains looking for the most idle CPU
> (without treating the exec'ing task's own CPU as idle). This bounces the
> exec'ing task all over the place, which leads to poor CPU cache and NUMA
> locality. (The workload happens to share common data between subsequently
> exec'ed programs.)
>
> This execve heuristic was removed from the upstream kernel by this git commit:
>
> commit 68767a0ae428801649d510d9a65bb71feed44dd1
> Author: Nick Piggin <[email protected]>
> Date: Sat Jun 25 14:57:20 2005 -0700
>
> [PATCH] sched: schedstats update for balance on fork
> Add SCHEDSTAT statistics for sched-balance-fork.
>
> From the commit description, it appears that deleting the heuristic
> was an accident, as the commit was supposedly just for schedstats.
>
> So, restore sched-exec load balancing when the exec'ing task is the only
> task running on that CPU. The logic makes sense: a newly exec'ed program
> should continue to run on the current CPU, as moving it neither fixes any
> load imbalance nor gains anything by bouncing to another idle CPU. Staying
> on the same CPU preserves cache and NUMA locality.
>
> Signed-off-by: Ken Chen <[email protected]>
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index e8819bc..4ad1907 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2873,7 +2873,12 @@ out:
> */
> void sched_exec(void)
> {
> - int new_cpu, this_cpu = get_cpu();
> + int new_cpu, this_cpu;
> +
> + if (this_rq()->nr_running <= 1)
> + return;
> +
> + this_cpu = get_cpu();
> new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC);
> put_cpu();
> if (new_cpu != this_cpu)

ok, this should be solved - but rather at the level of
sched_balance_self(): it should never migrate this task over to
another cpu; instead, it should subtract this task's load from the
current CPU's load when considering migration.

Ingo

2008-11-06 20:33:18

by Ken Chen

Subject: Re: [patch] restore sched_exec load balance heuristics

On Thu, Nov 6, 2008 at 12:07 PM, Ingo Molnar <[email protected]> wrote:
> ok, this should be solved - but rather at the level of
> sched_balance_self(): it should never migrate this task over to
> another cpu; instead, it should subtract this task's load from the
> current CPU's load when considering migration.

There are two callers of sched_balance_self(). In the sched_fork
path, sched_balance_self() balances the newly forked task. I think
it is OK to bounce a newly forked task to another CPU, since the
current CPU will be busy when fork returns in the parent process.

And if sched_balance_self() needs to differentiate between fork and
exec load balancing, it has to check a flag passed as a function
argument, which I think is better avoided by just short-circuiting
in sched_exec() directly.

- Ken

2008-11-06 20:39:18

by Ingo Molnar

Subject: Re: [patch] restore sched_exec load balance heuristics


* Ken Chen <[email protected]> wrote:

> On Thu, Nov 6, 2008 at 12:07 PM, Ingo Molnar <[email protected]> wrote:
> > ok, this should be solved - but rather at the level of
> > sched_balance_self(): it should never migrate this task over to
> > another cpu; instead, it should subtract this task's load from the
> > current CPU's load when considering migration.
>
> There are two callers of sched_balance_self(). In the sched_fork
> path, sched_balance_self() balances the newly forked task. I
> think it is OK to bounce a newly forked task to another CPU, since
> the current CPU will be busy when fork returns in the parent
> process.
>
> And if sched_balance_self() needs to differentiate between fork and
> exec load balancing, it has to check a flag passed as a function
> argument, which I think is better avoided by just short-circuiting
> in sched_exec() directly.

yes, but the problem is deeper than that, and your fix only addresses
the most obvious case: a single task that is exec()-ing. But if we
exec while there are two tasks on this CPU, and one task on every
other CPU, we still bounce the "new" task around unnecessarily just
as much.

So the best solution is to pass in not a flag, but a 'load bias'
offset - which is 0 in the fork case and -self-weight in the exec
case.

Ok?

Ingo

2008-11-06 20:50:20

by Chris Friesen

Subject: Re: [patch] restore sched_exec load balance heuristics

Ken Chen wrote:

> There are two callers of sched_balance_self(). In the sched_fork
> path, sched_balance_self() balances the newly forked task. I think
> it is OK to bounce a newly forked task to another CPU, since the
> current CPU will be busy when fork returns in the parent process.

What about vfork()?

> And if sched_balance_self() needs to differentiate between fork and
> exec load balancing, it has to check a flag passed as a function
> argument, which I think is better avoided by just short-circuiting
> in sched_exec() directly.

From a cleanliness perspective, it makes more sense to me for the
decision about whether or not to balance to be made in the "balance"
function, not the "exec" function.

Chris

2008-11-10 08:50:17

by Peter Zijlstra

Subject: Re: [patch] restore sched_exec load balance heuristics

On Thu, 2008-11-06 at 21:07 +0100, Ingo Molnar wrote:

> ok, this should be solved - but rather at the level of
> sched_balance_self(): it should never migrate this task over to
> another cpu; instead, it should subtract this task's load from the
> current CPU's load when considering migration.

How's this?

(compile tested only)

---
Subject: sched: fix sched_exec
From: Peter Zijlstra <[email protected]>
Date: Mon Nov 10 09:47:54 CET 2008

When deciding placement for an execve() task, subtract the effect of the
current task.

Signed-off-by: Peter Zijlstra <[email protected]>
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2035,7 +2035,8 @@ static unsigned long target_load(int cpu
* domain.
*/
static struct sched_group *
-find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
+find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu,
+ int org_cpu, unsigned long org_weight)
{
struct sched_group *idlest = NULL, *this = NULL, *group = sd->groups;
unsigned long min_load = ULONG_MAX, this_load = 0;
@@ -2063,6 +2064,9 @@ find_idlest_group(struct sched_domain *s
else
load = target_load(i, load_idx);

+ if (i == org_cpu)
+ load -= min(org_weight, load);
+
avg_load += load;
}

@@ -2089,7 +2093,7 @@ find_idlest_group(struct sched_domain *s
*/
static int
find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu,
- cpumask_t *tmp)
+ cpumask_t *tmp, int org_cpu, unsigned long org_weight)
{
unsigned long load, min_load = ULONG_MAX;
int idlest = -1;
@@ -2101,6 +2105,9 @@ find_idlest_cpu(struct sched_group *grou
for_each_cpu_mask_nr(i, *tmp) {
load = weighted_cpuload(i);

+ if (i == org_cpu)
+ load -= org_weight;
+
if (load < min_load || (load == min_load && i == this_cpu)) {
min_load = load;
idlest = i;
@@ -2121,10 +2128,11 @@ find_idlest_cpu(struct sched_group *grou
*
* preempt must be disabled.
*/
-static int sched_balance_self(int cpu, int flag)
+static int sched_balance_self(int cpu, int flag, unsigned long org_weight)
{
struct task_struct *t = current;
struct sched_domain *tmp, *sd = NULL;
+ int org_cpu = cpu;

for_each_domain(cpu, tmp) {
/*
@@ -2150,13 +2158,14 @@ static int sched_balance_self(int cpu, i
}

span = sd->span;
- group = find_idlest_group(sd, t, cpu);
+ group = find_idlest_group(sd, t, cpu, org_cpu, org_weight);
if (!group) {
sd = sd->child;
continue;
}

- new_cpu = find_idlest_cpu(group, t, cpu, &tmpmask);
+ new_cpu = find_idlest_cpu(group, t, cpu, &tmpmask,
+ org_cpu, org_weight);
if (new_cpu == -1 || new_cpu == cpu) {
/* Now try balancing at a lower domain level of cpu */
sd = sd->child;
@@ -2365,7 +2374,7 @@ void sched_fork(struct task_struct *p, i
__sched_fork(p);

#ifdef CONFIG_SMP
- cpu = sched_balance_self(cpu, SD_BALANCE_FORK);
+ cpu = sched_balance_self(cpu, SD_BALANCE_FORK, 0);
#endif
set_task_cpu(p, cpu);

@@ -2856,7 +2865,14 @@ out:
void sched_exec(void)
{
int new_cpu, this_cpu = get_cpu();
- new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC);
+ struct task_group *tg;
+ long weight, eload;
+
+ tg = task_group(current);
+ weight = current->se.load.weight;
+ eload = -effective_load(tg, this_cpu, -weight, -weight);
+
+ new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC, eload);
put_cpu();
if (new_cpu != this_cpu)
sched_migrate_task(current, new_cpu);

2008-11-10 09:30:00

by Ingo Molnar

Subject: Re: [patch] restore sched_exec load balance heuristics


* Peter Zijlstra <[email protected]> wrote:

> void sched_exec(void)
> {
> int new_cpu, this_cpu = get_cpu();
> - new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC);
> + struct task_group *tg;
> + long weight, eload;
> +
> + tg = task_group(current);
> + weight = current->se.load.weight;
> + eload = -effective_load(tg, this_cpu, -weight, -weight);
> +
> + new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC, eload);

okay, i think this will work.

it feels somewhat backwards though on a conceptual level.

There's nothing particularly special about exec-balancing: the load
picture is in equilibrium - it is in essence a rebalancing pass, done
not in the scheduler tick but at a special point in the middle of
exec() where the old-task / new-task cross section is at a minimum.

_fork_ balancing is what is special: there we'll get a new context so
we have to take the new load into account. It's a bit like wakeup
balancing. (just done before the new task is truly woken up)

OTOH, triggering the regular busy-balance at exec() time isn't totally
straightforward either: the 'old' task is the current task, so it
cannot be balanced away. We would have to trigger all the
active-migration logic - which again makes exec() balancing special.

So maybe this patch is the best solution after all. Ken, does it do
the trick for your workload, when applied against v2.6.28-rc4?

You might even try to confirm that your testcase still works fine when
you elevate the load average by +1.0 on every CPU, by starting an
infinite CPU-eater loop on each CPU via this bash one-liner:

for ((i=0;i<2;i++)); do while :; do :; done & done

(change the '2' to '4' if you test this on a quad, not on a dual-core
box)

the desired behavior is for your "exec hopper" testcase not to hop
between CPUs, but to stick to the same CPU most of the time.

Ingo

2008-11-10 12:54:24

by Peter Zijlstra

Subject: Re: [patch] restore sched_exec load balance heuristics

On Mon, 2008-11-10 at 10:29 +0100, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > void sched_exec(void)
> > {
> > int new_cpu, this_cpu = get_cpu();
> > - new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC);
> > + struct task_group *tg;
> > + long weight, eload;
> > +
> > + tg = task_group(current);
> > + weight = current->se.load.weight;
> > + eload = -effective_load(tg, this_cpu, -weight, -weight);
> > +
> > + new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC, eload);
>
> okay, i think this will work.
>
> it feels somewhat backwards though on a conceptual level.
>
> There's nothing particularly special about exec-balancing: the load
> picture is in equilibrium - it is in essence a rebalancing pass, done
> not in the scheduler tick but at a special point in the middle of
> exec() where the old-task / new-task cross section is at a minimum.
>
> _fork_ balancing is what is special: there we'll get a new context so
> we have to take the new load into account. It's a bit like wakeup
> balancing. (just done before the new task is truly woken up)
>
> OTOH, triggering the regular busy-balance at exec() time isn't totally
> straightforward either: the 'old' task is the current task, so it
> cannot be balanced away. We would have to trigger all the
> active-migration logic - which again makes exec() balancing special.
>
> So maybe this patch is the best solution after all.

Even worse: you want to balance current, but the generic load balancer
might pick two CPUs to balance between, neither of which has current on
it. And even if it did pick the queue with current on it as the
busiest, there is no guarantee you'd actually end up moving current.

So this specialized form of moving current to a possibly more idle CPU
is, AFAICS, the best solution for balancing a particular task.