2009-10-24 20:06:36

by Arjan van de Ven

Subject: [PATCH 1/3] sched: Enable wake balancing for the SMT/HT domain

Subject: sched: Enable wake balancing for the SMT/HT domain
From: Arjan van de Ven <[email protected]>

Logical CPUs that are part of a hyperthreading/SMT set are equivalent
in terms of where to execute a task; after all they share pretty much
all resources including the L1 cache.

This means that if task A wakes up task B, we should really consider
all logical CPUs in the SMT/HT set to run task B, not just the CPU that
task A is running on; in case task A keeps running, task B now gets to
execute with no latency. In the case where task A then immediately goes
to wait for a response from task B, nothing is lost due to the aforementioned
equivalency.

This patch turns on the "balance on wakeup" and turns off "affine wakeups"
for the SMT/HT scheduler domain to get this lower-latency behavior.

Signed-off-by: Arjan van de Ven <[email protected]>

diff --git a/include/linux/topology.h b/include/linux/topology.h
index fc0bf3e..3665dc2 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -95,8 +95,8 @@ int arch_update_cpu_topology(void);
| 1*SD_BALANCE_NEWIDLE \
| 1*SD_BALANCE_EXEC \
| 1*SD_BALANCE_FORK \
- | 0*SD_BALANCE_WAKE \
- | 1*SD_WAKE_AFFINE \
+ | 1*SD_BALANCE_WAKE \
+ | 0*SD_WAKE_AFFINE \
| 1*SD_SHARE_CPUPOWER \
| 0*SD_POWERSAVINGS_BALANCE \
| 0*SD_SHARE_PKG_RESOURCES \


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org


2009-10-24 20:03:27

by Arjan van de Ven

Subject: [PATCH 2/3] sched: Add aggressive load balancing for certain situations

Subject: sched: Add aggressive load balancing for certain situations
From: Arjan van de Ven <[email protected]>

The scheduler, in its "find idlest group" function, currently has an unconditional
imbalance threshold before it will consider moving a task.

However, there are situations where this is undesirable, and we want to opt in to a
more aggressive load-balancing algorithm to minimize latencies.

This patch adds the infrastructure for this and also adds two cases for which
we select the aggressive approach:
1) From interrupt context. Events that happen in irq context are very likely,
as a heuristic, to show latency-sensitive behavior
2) When doing a wake_up() and the scheduler domain we're investigating has the
flag set that opts in to load balancing during wake_up()
(for example the SMT/HT domain)


Signed-off-by: Arjan van de Ven <[email protected]>


diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 4e777b4..fe9b95b 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1246,7 +1246,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
*/
static struct sched_group *
find_idlest_group(struct sched_domain *sd, struct task_struct *p,
- int this_cpu, int load_idx)
+ int this_cpu, int load_idx, int agressive)
{
struct sched_group *idlest = NULL, *this = NULL, *group = sd->groups;
unsigned long min_load = ULONG_MAX, this_load = 0;
@@ -1290,7 +1290,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
}
} while (group = group->next, group != sd->groups);

- if (!idlest || 100*this_load < imbalance*min_load)
+ if (!idlest)
+ return NULL;
+ if (!agressive && 100*this_load < imbalance*min_load)
return NULL;
return idlest;
}
@@ -1412,6 +1414,7 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
int load_idx = sd->forkexec_idx;
struct sched_group *group;
int weight;
+ int agressive;

if (!(sd->flags & sd_flag)) {
sd = sd->child;
@@ -1421,7 +1424,13 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
if (sd_flag & SD_BALANCE_WAKE)
load_idx = sd->wake_idx;

- group = find_idlest_group(sd, p, cpu, load_idx);
+ agressive = 0;
+ if (in_irq())
+ agressive = 1;
+ if (sd_flag & SD_BALANCE_WAKE)
+ agressive = 1;
+
+ group = find_idlest_group(sd, p, cpu, load_idx, agressive);
if (!group) {
sd = sd->child;
continue;
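The threshold test this patch modifies can be sketched as a standalone C function. This is a toy model, not kernel API: the helper name `pick_idlest` and the sample imbalance value are made up for illustration. It returns 1 when the remote idlest group should be used and 0 when the task should stay in the local group, and shows how the new aggressive mode simply bypasses the imbalance check:

```c
#include <assert.h>

/*
 * Toy model of the return test in find_idlest_group() above.
 * pick_idlest() and the example imbalance value (125, i.e. the remote
 * group must be ~25% less loaded) are illustrative, not kernel code.
 */
static int pick_idlest(unsigned long this_load, unsigned long min_load,
                       unsigned long imbalance, int aggressive)
{
        /* the aggressive mode added by this patch skips the threshold */
        if (aggressive)
                return 1;
        /* default: stay local unless the remote group is clearly less loaded */
        if (100 * this_load < imbalance * min_load)
                return 0;
        return 1;
}
```

With this_load=100 and min_load=90 the default path stays local, while the aggressive path always takes the idlest group.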




2009-10-24 20:06:23

by Arjan van de Ven

Subject: [PATCH 3/3] sched: Disable affine wakeups by default

Subject: sched: Disable affine wakeups by default
From: Arjan van de Ven <[email protected]>

The global affine-wakeup scheduler feature sounds nice, but there is a problem:
it is ALSO a per-scheduler-domain feature already.
By having the global scheduler feature enabled by default, the scheduler domains
no longer have the option to opt out.

There are domains (for example the HT/SMT domain) that have good reason to want
to opt out of this feature.

With this patch they can opt out, while all other domains currently default to
the affine setting anyway.

Signed-off-by: Arjan van de Ven <[email protected]>

diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index 0d94083..58c2ea7 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -72,7 +72,7 @@ SCHED_FEAT(SYNC_WAKEUPS, 1)
* improve cache locality. Typically used with SYNC wakeups as
* generated by pipes and the like, see also SYNC_WAKEUPS.
*/
-SCHED_FEAT(AFFINE_WAKEUPS, 1)
+SCHED_FEAT(AFFINE_WAKEUPS, 0)

/*
* Weaken SYNC hint based on overlap




2009-10-25 06:55:25

by Mike Galbraith

Subject: Re: [PATCH 3/3] sched: Disable affine wakeups by default

On Sat, 2009-10-24 at 13:07 -0700, Arjan van de Ven wrote:
> Subject: sched: Disable affine wakeups by default
> From: Arjan van de Ven <[email protected]>
>
> The global affine-wakeup scheduler feature sounds nice, but there is a problem:
> it is ALSO a per-scheduler-domain feature already.
> By having the global scheduler feature enabled by default, the scheduler domains
> no longer have the option to opt out.

? The affine decision is qualified by SD_WAKE_AFFINE.

if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {

affine_sd = tmp;
want_affine = 0;
}

> There are domains (for example the HT/SMT domain) that have good reason to want
> to opt out of this feature.

Even if you're sharing a cache, there are reasons to wake affine. If
the wakee can preempt the waker while it's still eligible to run, wakee
not only eats toasty warm data, it can hand the cpu back to the waker so
it can make more and repeat this procedure for a while without someone
else getting in between, and trashing cache. Also, for a task which
wakes another, then checks to see if it has more work, sleeps if not,
this preemption can keep that task running, saving wakeups. If you put
the wakee on a runqueue where it may have to wait even a tiny bit, buddy
goes to sleep, so that benefit is gone. These things have a HUGE effect
on scalability, as you can see below.

There are times when not waking affine is good, e.g. immediately after
fork(), it's _generally_ a good idea to not wake affine, because there
may be more on the way, a work generator like make, for example, doing
its thing, and fork() also frequently means an exec is on the way.
That's not usually a producer/consumer situation.

At low load, with producer/consumer, iff you can hit a shared cache,
it's a good idea to not wake affine; any waker/wakee overlap is pure
performance loss in that case. On my Q6600, there's a 1:3 chance of
hitting if left to random chance. You can see that case happening in
the pgsql+oltp numbers below. That wants further examination.
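One way to read that 1:3 figure can be enumerated directly: on a Q6600-like package, cores 0-1 share one L2 and cores 2-3 share the other, so exactly one of the three non-waker cores shares the waker's cache. A throwaway sketch, with the topology assumed rather than queried from the hardware:

```c
#include <assert.h>

/* Q6600-like topology assumed: cores 0-1 share one L2, cores 2-3 the other. */
static int l2_id(int cpu)
{
        return cpu / 2;
}

/* Count how many of the other three cores share the waker's L2. */
static int shared_l2_cores(int waker)
{
        int cpu, hits = 0;

        for (cpu = 0; cpu < 4; cpu++) {
                if (cpu != waker && l2_id(cpu) == l2_id(waker))
                        hits++;
        }
        return hits;
}
```

For any waker, one core out of the three candidates shares the cache, so a random placement hits it one time in three.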

> With this patch they can opt out, while all other domains currently default to
> the affine setting anyway.

The patch globally disables affine wakeups. Not good.

Oh, btw, wrt affinity vs interrupt: a long time ago, I tried disabling
affine wakeups in hardirq, softirq, and both contexts. In all cases, it
was a losing proposition here.

One thing that would be nice for some mixed loads, including the desktop,
is: if a cpu is doing high-frequency sync/affine wakeups, try to keep
other things away from that cpu by counting synchronous tasks as two
instead of one, load-balancing wise.
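That weighting idea could look something like the sketch below. This is purely an illustration of the proposed heuristic; effective_load() is a made-up helper, nothing like it exists in the scheduler:

```c
#include <assert.h>

/*
 * Sketch of the heuristic proposed above: on a cpu doing high-frequency
 * sync/affine wakeups, count each synchronous task twice so the load
 * balancer steers other work away from that cpu.
 * effective_load() is a made-up name, not kernel code.
 */
static unsigned long effective_load(unsigned long nr_running,
                                    unsigned long nr_sync)
{
        /* each sync task contributes its normal weight plus one extra */
        return nr_running + nr_sync;
}
```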

(damn, i'm rambling.. time to shut up;)

Sorry for verbosity, numbers probably would have sufficed. I've been
overdosing on boring affinity/scalability testing ;-)

tip v2.6.32-rc5-1691-g9a8523b

tbench 4
tip 936.314 MB/sec 8 procs
tip+patches 869.153 MB/sec 8 procs
.928

vmark
tip 125307 messages per second
tip+patches 103743 messages per second
.827

mysql+oltp
clients 1 2 4 8 16 32 64 128 256
tip 10013.90 18526.84 34900.38 34420.14 33069.83 32083.40 30578.30 28010.71 25605.47
tip+patches 8436.34 17826.34 34524.32 31471.92 29188.59 27896.10 26036.43 23774.57 19524.33
.842 .962 .989 .914 .882 .869 .851 .848 .762

pgsql+oltp
clients 1 2 4 8 16 32 64 128 256
tip 13907.85 27135.87 52951.98 52514.04 51742.52 50705.43 49947.97 48374.19 46227.94
tip+patches 15277.63 23050.99 51943.13 51937.16 42246.60 38397.86 34998.71 31154.21 26335.68
1.098 .849 .980 .989 .816 .757 .700 .644 .569

2009-10-25 08:01:27

by Peter Zijlstra

Subject: Re: [PATCH 3/3] sched: Disable affine wakeups by default

On Sat, 2009-10-24 at 13:07 -0700, Arjan van de Ven wrote:
> Subject: sched: Disable affine wakeups by default
> From: Arjan van de Ven <[email protected]>
>
> The global affine-wakeup scheduler feature sounds nice, but there is a problem:
> it is ALSO a per-scheduler-domain feature already.
> By having the global scheduler feature enabled by default, the scheduler domains
> no longer have the option to opt out.
>
> There are domains (for example the HT/SMT domain) that have good reason to want
> to opt out of this feature.
>
> With this patch they can opt out, while all other domains currently default to
> the affine setting anyway.
>
> Signed-off-by: Arjan van de Ven <[email protected]>
>

Hell no, that'll destroy many workloads. What you could possibly do is
disable it for sched domains that are known to share cache, maybe.

2009-10-25 08:01:40

by Peter Zijlstra

Subject: Re: [PATCH 2/3] sched: Add aggressive load balancing for certain situations

On Sat, 2009-10-24 at 13:04 -0700, Arjan van de Ven wrote:
> Subject: sched: Add aggressive load balancing for certain situations
> From: Arjan van de Ven <[email protected]>
>
> The scheduler, in its "find idlest group" function, currently has an unconditional
> imbalance threshold before it will consider moving a task.
>
> However, there are situations where this is undesirable, and we want to opt in to a
> more aggressive load-balancing algorithm to minimize latencies.
>
> This patch adds the infrastructure for this and also adds two cases for which
> we select the aggressive approach:
> 1) From interrupt context. Events that happen in irq context are very likely,
> as a heuristic, to show latency-sensitive behavior
> 2) When doing a wake_up() and the scheduler domain we're investigating has the
> flag set that opts in to load balancing during wake_up()
> (for example the SMT/HT domain)
>
>
> Signed-off-by: Arjan van de Ven <[email protected]>



> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 4e777b4..fe9b95b 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1246,7 +1246,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> */
> static struct sched_group *
> find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> - int this_cpu, int load_idx)
> + int this_cpu, int load_idx, int agressive)
> {

can't we fold that into load_idx? like -1 or something?

2009-10-25 08:03:57

by Peter Zijlstra

Subject: Re: [PATCH 1/3] sched: Enable wake balancing for the SMT/HT domain

On Sat, 2009-10-24 at 12:58 -0700, Arjan van de Ven wrote:
> Subject: sched: Enable wake balancing for the SMT/HT domain
> From: Arjan van de Ven <[email protected]>
>
> Logical CPUs that are part of a hyperthreading/SMT set are equivalent
> in terms of where to execute a task; after all they share pretty much
> all resources including the L1 cache.
>
> This means that if task A wakes up task B, we should really consider
> all logical CPUs in the SMT/HT set to run task B, not just the CPU that
> task A is running on; in case task A keeps running, task B now gets to
> execute with no latency. In the case where task A then immediately goes
> to wait for a response from task B, nothing is lost due to the aforementioned
> equivalency.
>
> This patch turns on the "balance on wakeup" and turns off "affine wakeups"
> for the SMT/HT scheduler domain to get this lower-latency behavior.
>
> Signed-off-by: Arjan van de Ven <[email protected]>
>
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index fc0bf3e..3665dc2 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -95,8 +95,8 @@ int arch_update_cpu_topology(void);
> | 1*SD_BALANCE_NEWIDLE \
> | 1*SD_BALANCE_EXEC \
> | 1*SD_BALANCE_FORK \
> - | 0*SD_BALANCE_WAKE \
> - | 1*SD_WAKE_AFFINE \
> + | 1*SD_BALANCE_WAKE \
> + | 0*SD_WAKE_AFFINE \
> | 1*SD_SHARE_CPUPOWER \
> | 0*SD_POWERSAVINGS_BALANCE \
> | 0*SD_SHARE_PKG_RESOURCES \
>

So you're poking at SD_SIBLING_INIT, right?

That seems to make sense. Now doing the same for a cache level domain
(MC is almost that) might also make sense.

2009-10-25 11:48:19

by Peter Zijlstra

Subject: Re: [PATCH 2/3] sched: Add aggressive load balancing for certain situations

On Sun, 2009-10-25 at 09:01 +0100, Peter Zijlstra wrote:
>
> > diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> > index 4e777b4..fe9b95b 100644
> > --- a/kernel/sched_fair.c
> > +++ b/kernel/sched_fair.c
> > @@ -1246,7 +1246,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> > */
> > static struct sched_group *
> > find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> > - int this_cpu, int load_idx)
> > + int this_cpu, int load_idx, int agressive)
> > {
>
> can't we fold that into load_idx? like -1 or something?

A better alternative might be passing imbalance along instead.

2009-10-25 16:50:05

by Arjan van de Ven

Subject: Re: [PATCH 3/3] sched: Disable affine wakeups by default

On Sun, 25 Oct 2009 07:55:25 +0100
Mike Galbraith <[email protected]> wrote:
> Even if you're sharing a cache, there are reasons to wake affine. If
> the wakee can preempt the waker while it's still eligible to run,
> wakee not only eats toasty warm data, it can hand the cpu back to the
> waker so it can make more and repeat this procedure for a while
> without someone else getting in between, and trashing cache.

and on the flipside, and this is the workload I'm looking at,
this is halving your performance roughly due to one core being totally
busy while the other one is idle.

My workload is a relatively simple situation: firefox is starting up
and talking to X. I suspect this is representative for many X using
applications in the field. The application sends commands to X, but is
not (yet) going to wait for a response, it has more work to do.
In this case the affine behavior does not only cause latency, but it
also eats the throughput performance.

This is due to a few things that compound, but a key one is this code:

if (sd_flag & SD_BALANCE_WAKE) {
if (sched_feat(AFFINE_WAKEUPS) &&
cpumask_test_cpu(cpu, &p->cpus_allowed))
want_affine = 1;
new_cpu = prev_cpu;
}

the problem is that

if (affine_sd && wake_affine(affine_sd, p, sync)) {
new_cpu = cpu;
goto out;
}

this then will trigger later, as long as there is any domain that has
SD_WAKE_AFFINE set ;(

(part of that problem is that the code that sets affine_sd is done
before the
if (!(tmp->flags & sd_flag))
continue;
test)
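That ordering problem can be shown with a toy model of the domain walk. The flag values and the helper below are made up; the real loop lives in select_task_rq_fair(). Because affine_sd is latched before the sd_flag filter runs, a wider domain with SD_WAKE_AFFINE set captures the wakeup even though the SMT domain opted out:

```c
#include <assert.h>

#define SD_WAKE_AFFINE  0x01
#define SD_BALANCE_WAKE 0x02

/*
 * Toy model of the domain walk in select_task_rq_fair(); flag values
 * are made up.  flags[] lists domains from narrowest (SMT) to widest.
 * Returns 1 when some domain latches the affine path.
 */
static int wakeup_goes_affine(const int *flags, int n, int sd_flag)
{
        int i, have_affine_sd = 0, want_affine = 1;

        for (i = 0; i < n; i++) {
                /* affine_sd is latched here, for ANY domain ... */
                if (want_affine && (flags[i] & SD_WAKE_AFFINE)) {
                        have_affine_sd = 1;
                        want_affine = 0;
                }
                /* ... before this filter ever gets to reject it */
                if (!(flags[i] & sd_flag))
                        continue;
        }
        return have_affine_sd;
}
```

With an SMT domain that cleared SD_WAKE_AFFINE (but set SD_BALANCE_WAKE) below an MC domain that kept SD_WAKE_AFFINE, the wakeup still goes affine.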


The numbers you posted are for a database, and only measure throughput.
There's more to the world than just databases / throughput-only
computing, and I'm trying to find low impact ways to reduce the latency
aspect of things. One obvious candidate is hyperthreading/SMT where it
IS basically free to switch to a sibling, so wake-affine does not
really make sense there.


2009-10-25 17:38:26

by Mike Galbraith

Subject: Re: [PATCH 3/3] sched: Disable affine wakeups by default

On Sun, 2009-10-25 at 09:51 -0700, Arjan van de Ven wrote:
> On Sun, 25 Oct 2009 07:55:25 +0100
> Mike Galbraith <[email protected]> wrote:
> > Even if you're sharing a cache, there are reasons to wake affine. If
> > the wakee can preempt the waker while it's still eligible to run,
> > wakee not only eats toasty warm data, it can hand the cpu back to the
> > waker so it can make more and repeat this procedure for a while
> > without someone else getting in between, and trashing cache.
>
> and on the flipside, and this is the workload I'm looking at,
> this is halving your performance roughly due to one core being totally
> busy while the other one is idle.

Yeah, the "one pgsql+oltp pair" in the numbers I posted show that
problem really well. If you can hit an idle shared cache at low load,
go for it every time. The rest of the numbers just show how big the
penalty is if you solve affinity problems with an 8" howitzer :)

> My workload is a relatively simple situation: firefox is starting up
> and talking to X. I suspect this is representative for many X using
> applications in the field. The application sends commands to X, but is
> not (yet) going to wait for a response, it has more work to do.
> In this case the affine behavior does not only cause latency, but it
> also eats the throughput performance.

Yeah. Damned if you do, damned if you don't.

> This is due to a few things that compound, but a key one is this code:
>
> if (sd_flag & SD_BALANCE_WAKE) {
> if (sched_feat(AFFINE_WAKEUPS) &&
> cpumask_test_cpu(cpu, &p->cpus_allowed))
> want_affine = 1;
> new_cpu = prev_cpu;
> }
>
> the problem is that
>
> if (affine_sd && wake_affine(affine_sd, p, sync)) {
> new_cpu = cpu;
> goto out;
> }
>
> this then will trigger later, as long as there is any domain that has
> SD_WAKE_AFFINE set ;(

And the task looks like a synchronous task.

> (part of that problem is that the code that sets affine_sd is done
> before the
> if (!(tmp->flags & sd_flag))
> continue;
> test)

Hm. That looks like a bug, but after any task has scheduled a few
times, if it looks like a synchronous task, it'll glue itself to its
waker's runqueue regardless. Initial wakeup may disperse, but it will
come back if it's not overlapping.

> The numbers you posted are for a database, and only measure throughput.
> There's more to the world than just databases / throughput-only
> computing, and I'm trying to find low impact ways to reduce the latency
> aspect of things. One obvious candidate is hyperthreading/SMT where it
> IS basically free to switch to a sibling, so wake-affine does not
> really make sense there.

It's also almost free on my Q6600 if we aimed for idle shared cache.

I agree fully that affinity decisions could be more perfect than they
are. Getting it wrong is very expensive either way.

-Mike

2009-10-25 19:32:17

by Arjan van de Ven

Subject: Re: [PATCH 3/3] sched: Disable affine wakeups by default

On Sun, 25 Oct 2009 18:38:09 +0100
Mike Galbraith <[email protected]> wrote:
> > > Even if you're sharing a cache, there are reasons to wake
> > > affine. If the wakee can preempt the waker while it's still
> > > eligible to run, wakee not only eats toasty warm data, it can
> > > hand the cpu back to the waker so it can make more and repeat
> > > this procedure for a while without someone else getting in
> > > between, and trashing cache.
> >
> > and on the flipside, and this is the workload I'm looking at,
> > this is halving your performance roughly due to one core being
> > totally busy while the other one is idle.
>
> Yeah, the "one pgsql+oltp pair" in the numbers I posted show that
> problem really well. If you can hit an idle shared cache at low load,
> go for it every time.

sadly the current code does not do this ;(
my patch might be too big an axe for it, but it does solve this part ;)

I'll keep digging to see if we can do a more micro-incursion.

> Hm. That looks like a bug, but after any task has scheduled a few
> times, if it looks like a synchronous task, it'll glue itself to its
> waker's runqueue regardless. Initial wakeup may disperse, but it will
> come back if it's not overlapping.

the problem is the "synchronous to WHAT" question.
It may be synchronous to the disk for example; in the testcase I'm
looking at, we get "send message to X. do some more code. hit a page
cache miss and do IO" quite a bit.

> > The numbers you posted are for a database, and only measure
> > throughput. There's more to the world than just databases /
> > throughput-only computing, and I'm trying to find low impact ways
> > to reduce the latency aspect of things. One obvious candidate is
> > hyperthreading/SMT where it IS basically free to switch to a
> > sibling, so wake-affine does not really make sense there.
>
> It's also almost free on my Q6600 if we aimed for idle shared cache.

yeah multicore with shared cache falls for me in the same bucket.

> I agree fully that affinity decisions could be more perfect than they
> are. Getting it wrong is very expensive either way.

Looks like we agree on a key principle:
If there is a free cpu "close enough" (SMT or MC basically), the
wakee should just run on that.

we may not agree on what to do if there's no completely free logical
cpu, but a much lighter loaded one instead.
but first we need to let code speak ;)


2009-10-25 22:04:46

by Mike Galbraith

Subject: Re: [PATCH 3/3] sched: Disable affine wakeups by default

On Sun, 2009-10-25 at 12:33 -0700, Arjan van de Ven wrote:
> On Sun, 25 Oct 2009 18:38:09 +0100
> Mike Galbraith <[email protected]> wrote:
> > > > Even if you're sharing a cache, there are reasons to wake
> > > > affine. If the wakee can preempt the waker while it's still
> > > > eligible to run, wakee not only eats toasty warm data, it can
> > > > hand the cpu back to the waker so it can make more and repeat
> > > > this procedure for a while without someone else getting in
> > > > between, and trashing cache.
> > >
> > > and on the flipside, and this is the workload I'm looking at,
> > > this is halving your performance roughly due to one core being
> > > totally busy while the other one is idle.
> >
> > Yeah, the "one pgsql+oltp pair" in the numbers I posted show that
> > problem really well. If you can hit an idle shared cache at low load,
> > go for it every time.
>
> sadly the current code does not do this ;(
> my patch might be too big an axe for it, but it does solve this part ;)

The below fixed up the pgsql+oltp low end, but has a negative effect on the
high end. Must be some stuttering going on.

> I'll keep digging to see if we can do a more micro-incursion.
>
> > Hm. That looks like a bug, but after any task has scheduled a few
> > times, if it looks like a synchronous task, it'll glue itself to its
> > waker's runqueue regardless. Initial wakeup may disperse, but it will
> > come back if it's not overlapping.
>
> the problem is the "synchronous to WHAT" question.
> It may be synchronous to the disk for example; in the testcase I'm
> looking at, we get "send message to X. do some more code. hit a page
> cache miss and do IO" quite a bit.

Hm. Yes, disk could be problematic. It's going to be exactly what the
affinity code looks for: you wake somebody, and almost immediately go to
sleep. OTOH, even housekeeper threads make warm data.

> > > The numbers you posted are for a database, and only measure
> > > throughput. There's more to the world than just databases /
> > > throughput-only computing, and I'm trying to find low impact ways
> > > to reduce the latency aspect of things. One obvious candidate is
> > > hyperthreading/SMT where it IS basically free to switch to a
> > > sibling, so wake-affine does not really make sense there.
> >
> > It's also almost free on my Q6600 if we aimed for idle shared cache.
>
> yeah multicore with shared cache falls for me in the same bucket.

Anyone with a non-shared cache multicore would be most unhappy with my
little test hack.

> > I agree fully that affinity decisions could be more perfect than they
> > are. Getting it wrong is very expensive either way.
>
> Looks like we agree on a key principle:
> If there is a free cpu "close enough" (SMT or MC basically), the
> wakee should just run on that.
>
> we may not agree on what to do if there's no completely free logical
> cpu, but a much lighter loaded one instead.
> but first we need to let code speak ;)

mysql+oltp
clients 1 2 4 8 16 32 64 128 256
tip 10013.90 18526.84 34900.38 34420.14 33069.83 32083.40 30578.30 28010.71 25605.47 3x avg
tip+ 10071.16 18498.33 34697.17 34275.20 32761.96 31657.10 30223.70 27363.50 24698.71
9971.57 18290.17 34632.46 34204.59 32588.94 31513.19 30081.51 27504.66 24832.24
9884.04 18502.26 34650.08 34250.13 32707.81 31566.86 29954.19 27417.09 24811.75


pgsql+oltp
clients 1 2 4 8 16 32 64 128 256
tip 13907.85 27135.87 52951.98 52514.04 51742.52 50705.43 49947.97 48374.19 46227.94 3x avg
tip+ 15163.56 28882.70 52374.32 52469.79 51739.79 50602.02 49827.18 48029.84 46191.90
15258.65 28778.77 52716.46 52405.32 51434.21 50440.66 49718.89 48082.22 46124.56
15278.02 28178.55 52815.82 52609.98 51729.17 50652.10 49800.19 48126.95 46286.58


diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 37087a7..fa534f0 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1374,6 +1374,8 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag

rcu_read_lock();
for_each_domain(cpu, tmp) {
+ int level = tmp->level;
+
/*
* If power savings logic is enabled for a domain, see if we
* are not overloaded, if so, don't balance wider.
@@ -1398,11 +1400,28 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
want_sd = 0;
}

+ /*
+ * look for an idle shared cache before looking at last CPU.
+ */
if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
- cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
+ (level == SD_LV_SIBLING || level == SD_LV_MC)) {
+ int i;

+ for_each_cpu(i, sched_domain_span(tmp)) {
+ if (!cpu_rq(i)->cfs.nr_running) {
+ affine_sd = tmp;
+ want_affine = 0;
+ cpu = i;
+ }
+ }
+ } else if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
+ cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
affine_sd = tmp;
want_affine = 0;
+
+ if ((level == SD_LV_SIBLING || level == SD_LV_MC) &&
+ !cpu_rq(prev_cpu)->cfs.nr_running)
+ cpu = prev_cpu;
}

if (!want_sd && !want_affine)

2009-10-26 01:54:04

by Peter Zijlstra

Subject: Re: [PATCH 3/3] sched: Disable affine wakeups by default

On Sun, 2009-10-25 at 23:04 +0100, Mike Galbraith wrote:
> if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
> - cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
> + (level == SD_LV_SIBLING || level == SD_LV_MC)) {

quick comment without actually having looked at the patch, we should
really get rid of sd->level and encode properties of the sched domains
in sd->flags.

2009-10-26 04:38:27

by Mike Galbraith

Subject: Re: [PATCH 3/3] sched: Disable affine wakeups by default

On Mon, 2009-10-26 at 02:53 +0100, Peter Zijlstra wrote:
> On Sun, 2009-10-25 at 23:04 +0100, Mike Galbraith wrote:
> > if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
> > - cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
> > + (level == SD_LV_SIBLING || level == SD_LV_MC)) {
>
> quick comment without actually having looked at the patch, we should
> really get rid of sd->level and encode properties of the sched domains
> in sd->flags.

Yeah, sounds right, while writing that, it looked kinda ugly. I suppose
arch land needs to encode cache property somehow if I really want to be
able to target cache on multicore. Booting becomes.. exciting when I
tinker down there.

While tinkering with this, I noticed that when mysql+oltp starts
tripping over itself, moving to any momentarily idle cpu helps get the
load moving again, and the tail improves. Not hugely, but quite
measurably. There seems to be benefit to be had throughout the load
spectrum, just gotta figure out how to retrieve it without losing
anything.

-Mike

2009-10-26 04:51:21

by Arjan van de Ven

Subject: Re: [PATCH 3/3] sched: Disable affine wakeups by default

On Mon, 26 Oct 2009 05:38:27 +0100
Mike Galbraith <[email protected]> wrote:

> On Mon, 2009-10-26 at 02:53 +0100, Peter Zijlstra wrote:
> > On Sun, 2009-10-25 at 23:04 +0100, Mike Galbraith wrote:
> > > if (want_affine && (tmp->flags & SD_WAKE_AFFINE)
> > > &&
> > > - cpumask_test_cpu(prev_cpu,
> > > sched_domain_span(tmp))) {
> > > + (level == SD_LV_SIBLING || level
> > > == SD_LV_MC)) {
> >
> > quick comment without actually having looked at the patch, we should
> > really get rid of sd->level and encode properties of the sched
> > domains in sd->flags.
>
> Yeah, sounds right, while writing that, it looked kinda ugly. I
> suppose arch land needs to encode cache property somehow if I really
> want to be able to target cache on multicore. Booting becomes..
> exciting when I tinker down there.

or we just use SD_WAKE_AFFINE / SD_BALANCE_WAKE for this...




2009-10-26 05:08:54

by Mike Galbraith

Subject: Re: [PATCH 3/3] sched: Disable affine wakeups by default

On Sun, 2009-10-25 at 21:52 -0700, Arjan van de Ven wrote:
> On Mon, 26 Oct 2009 05:38:27 +0100
> Mike Galbraith <[email protected]> wrote:
>
> > On Mon, 2009-10-26 at 02:53 +0100, Peter Zijlstra wrote:
> > > On Sun, 2009-10-25 at 23:04 +0100, Mike Galbraith wrote:
> > > > if (want_affine && (tmp->flags & SD_WAKE_AFFINE)
> > > > &&
> > > > - cpumask_test_cpu(prev_cpu,
> > > > sched_domain_span(tmp))) {
> > > > + (level == SD_LV_SIBLING || level
> > > > == SD_LV_MC)) {
> > >
> > > quick comment without actually having looked at the patch, we should
> > > really get rid of sd->level and encode properties of the sched
> > > domains in sd->flags.
> >
> > Yeah, sounds right, while writing that, it looked kinda ugly. I
> > suppose arch land needs to encode cache property somehow if I really
> > want to be able to target cache on multicore. Booting becomes..
> > exciting when I tinker down there.
>
> or we just use SD_WAKE_AFFINE / SD_BALANCE_WAKE for this...

I don't see how. Oh, you mean another domain level, top level being
cache property, and turn off when degenerating? That looks like it'd be
a problem, but adding SD_CACHE_SIBLING or whatnot should work. Problem
is how to gain knowledge of whether multicores share a cache or not.

-Mike

2009-10-26 05:21:55

by Mike Galbraith

Subject: Re: [PATCH 3/3] sched: Disable affine wakeups by default

On Sun, 2009-10-25 at 12:33 -0700, Arjan van de Ven wrote:

> we may not agree on what to do if there's no completely free logical
> cpu, but a much lighter loaded one instead.
> but first we need to let code speak ;)

BTW, we agree here too, it's just that cache traffic generated by
ripping high frequency buddies apart can be hugely expensive, so much
caution required. Someone needs to get off his duff, and invent ram
that doesn't suck.

-Mike

2009-10-26 05:35:50

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 3/3] sched: Disable affine wakeups by default

On Mon, 26 Oct 2009 06:08:54 +0100
Mike Galbraith <[email protected]> wrote:

> On Sun, 2009-10-25 at 21:52 -0700, Arjan van de Ven wrote:
> > On Mon, 26 Oct 2009 05:38:27 +0100
> > Mike Galbraith <[email protected]> wrote:
> >
> > > On Mon, 2009-10-26 at 02:53 +0100, Peter Zijlstra wrote:
> > > > On Sun, 2009-10-25 at 23:04 +0100, Mike Galbraith wrote:
> > > > > if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
> > > > > - cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
> > > > > + (level == SD_LV_SIBLING || level == SD_LV_MC)) {
> > > >
> > > > quick comment without actually having looked at the patch, we
> > > > should really get rid of sd->level and encode properties of the
> > > > sched domains in sd->flags.
> > >
> > > Yeah, sounds right, while writing that, it looked kinda ugly. I
> > > suppose arch land needs to encode cache property somehow if I
> > > really want to be able to target cache on multicore. Booting
> > > becomes.. exciting when I tinker down there.
> >
> > or we just use SD_WAKE_AFFINE / SD_BALANCE_WAKE for this...
>
> I don't see how. Oh, you mean another domain level, top level being
> cache property, and turn off when degenerating? That looks like it'd
> be a problem, but adding SD_CACHE_SIBLING or whatnot should work.
> Problem is how to gain knowledge of whether multicores share a cache
> or not.

Actually I meant setting the SD_BALANCE_WAKE flag for the SMT and MC
domains (and then making sure that "MC" really means "shares LLC" in
the arch code), and then using this as indication in the sched code..

if you're a multicore domain you better have a shared cache.. that's
what it should mean. If it does not we should fix that.

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
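
[Editorial aside: Arjan's scheme -- SD_BALANCE_WAKE on for the SMT and MC
domains, with "MC" guaranteed to mean shared LLC -- comes down to the flag
composition used by the SD_*_INIT macros in include/linux/topology.h, as in
patch 1/3 of this thread. A minimal userspace sketch, with made-up flag
values:]

```c
/* Illustrative flag values only; the real ones live in linux/sched.h. */
#include <assert.h>

enum {
    SD_BALANCE_WAKE   = 1 << 0,
    SD_WAKE_AFFINE    = 1 << 1,
    SD_SHARE_CPUPOWER = 1 << 2,
};

/* SMT/HT domain after patch 1/3: wake balancing on, affine wakeups off.
 * The 1*FLAG / 0*FLAG style mirrors SD_SIBLING_INIT: every flag is
 * listed exactly once, so toggling one is a one-character diff. */
#define SMT_DOMAIN_FLAGS ( 1*SD_BALANCE_WAKE   \
                         | 0*SD_WAKE_AFFINE    \
                         | 1*SD_SHARE_CPUPOWER )
```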

2009-10-26 05:47:58

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 3/3] sched: Disable affine wakeups by default

On Sun, 2009-10-25 at 22:36 -0700, Arjan van de Ven wrote:
> On Mon, 26 Oct 2009 06:08:54 +0100
> Mike Galbraith <[email protected]> wrote:

> > >
> > > or we just use SD_WAKE_AFFINE / SD_BALANCE_WAKE for this...
> >
> > I don't see how. Oh, you mean another domain level, top level being
> > cache property, and turn off when degenerating? That looks like it'd
> > be a problem, but adding SD_CACHE_SIBLING or whatnot should work.
> > Problem is how to gain knowledge of whether multicores share a cache
> > or not.
>
> Actually I meant setting the SD_BALANCE_WAKE flag for the SMT and MC
> domains (and then making sure that "MC" really means "shares LLC" in
> the arch code), and then using this as indication in the sched code..

I don't think we can do that, because SD_WAKE_BALANCE already has a
different meaning. SD_WAKE_AFFINE could be used though, affine wakeups
have always been a cache thing, and for trying to keep things affine to
a package or whatnot, we have SD_PREFER_LOCAL. Sounds clean to me.

> if you're a multicore domain you better have a shared cache.. that's
> what it should mean. If it does not we should fix that.

Sounds reasonable to me. I'll go make explosions.

-Mike

2009-10-26 05:57:39

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 3/3] sched: Disable affine wakeups by default

On Mon, 2009-10-26 at 06:47 +0100, Mike Galbraith wrote:
> On Sun, 2009-10-25 at 22:36 -0700, Arjan van de Ven wrote:

> > if you're a multicore domain you better have a shared cache.. that's
> > what it should mean. If it does not we should fix that.
>
> Sounds reasonable to me. I'll go make explosions.

(Actually, if multicore and sibling do indeed mean shared cache, no
arch tinkering should be necessary; just resetting SD_WAKE_AFFINE when
degenerating should work fine. Only thing is multicore with siblings..
and test test test)

-Mike

2009-10-26 07:01:21

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 3/3] sched: Disable affine wakeups by default


* Mike Galbraith <[email protected]> wrote:

> On Mon, 2009-10-26 at 06:47 +0100, Mike Galbraith wrote:
> > On Sun, 2009-10-25 at 22:36 -0700, Arjan van de Ven wrote:
>
> > > if you're a multicore domain you better have a shared cache..
> > > that's what it should mean. If it does not we should fix that.
> >
> > Sounds reasonable to me. I'll go make explosions.
>
> (Actually, if multicore and sibling do indeed mean shared cache, no
> arch tinkering should be necessary; just resetting SD_WAKE_AFFINE when
> degenerating should work fine. Only thing is multicore with
> siblings.. and test test test)

Correct. There are a few CPUs where multicore means separate caches, but
all modern CPUs have shared caches for cores, so we want to tune for
that.

Ingo

2009-10-26 07:04:10

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 3/3] sched: Disable affine wakeups by default

On Mon, 26 Oct 2009 08:01:12 +0100
Ingo Molnar <[email protected]> wrote:

>
> * Mike Galbraith <[email protected]> wrote:
>
> > On Mon, 2009-10-26 at 06:47 +0100, Mike Galbraith wrote:
> > > On Sun, 2009-10-25 at 22:36 -0700, Arjan van de Ven wrote:
> >
> > > > if you're a multicore domain you better have a shared cache..
> > > > that's what it should mean. If it does not we should fix that.
> > >
> > > Sounds reasonable to me. I'll go make explosions.
> >
> > (Actually, if multicore and sibling do indeed mean shared cache,
> > no arch tinkering should be necessary; just resetting SD_WAKE_AFFINE
> > when degenerating should work fine. Only thing is multicore with
> > siblings.. and test test test)
>
> Correct. There's a few cpus where multicore means separate caches but
> all modern CPUs have shared caches for cores so we want to tune for
> that.

for those cpus where mc means separate caches we should fix the arch
code to set up separate MC domains to be honest..
I can look into that in a bit..


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-10-26 11:33:42

by Suresh Siddha

[permalink] [raw]
Subject: Re: [PATCH 3/3] sched: Disable affine wakeups by default

On Mon, 2009-10-26 at 00:05 -0700, Arjan van de Ven wrote:
> On Mon, 26 Oct 2009 08:01:12 +0100
> Ingo Molnar <[email protected]> wrote:
>
> > Correct. There's a few cpus where multicore means separate caches but
> > all modern CPUs have shared caches for cores so we want to tune for
> > that.
>
> for those cpus where mc means separate caches we should fix the arch
> code to set up separate MC domains to be honest..
> I can look into that in a bit..

In the default performance mode, multi-core domain is populated with
only cores sharing last-level cache. In the case where the cores don't
share caches, we represent them in the smp domain.

thanks,
suresh

2009-10-27 14:35:38

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 3/3] sched: Disable affine wakeups by default

On Mon, 2009-10-26 at 02:53 +0100, Peter Zijlstra wrote:
> On Sun, 2009-10-25 at 23:04 +0100, Mike Galbraith wrote:
> > if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
> > - cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
> > + (level == SD_LV_SIBLING || level == SD_LV_MC)) {
>
> quick comment without actually having looked at the patch, we should
> really get rid of sd->level and encode properties of the sched domains
> in sd->flags.

I used SD_PREFER_SIBLING in the below. Did I break anything?

(wonder what it does for pgsql+oltp on beefy box with siblings)

tip v2.6.32-rc5-1724-g77a088c

mysql+oltp
clients 1 2 4 8 16 32 64 128 256
tip 9999.77 18472.11 34931.60 34412.09 33006.76 32104.36 30700.47 28111.31 25535.09
10082.75 18625.12 34928.17 34476.91 33088.70 32002.36 30695.77 28173.94 25551.05
9949.05 18466.54 34942.66 34420.74 33092.45 32041.10 30666.43 28090.90 25467.63
tip avg 10010.52 18521.25 34934.14 34436.58 33062.63 32049.27 30687.55 28125.38 25517.92

tip+ 9622.23 18297.65 34496.12 34230.85 32704.20 31796.54 30480.45 27740.20 25394.12
10207.79 18275.83 34622.39 34222.47 32996.69 31936.48 30551.29 28144.48 25616.62
10225.32 18515.02 34538.41 34278.06 33014.14 31965.31 30363.90 28089.41 25531.81
tip+ avg 10018.44 18362.83 34552.30 34243.79 32905.01 31899.44 30465.21 27991.36 25514.18
vs tip 1.000 .991 .989 .994 .995 .995 .992 .995 .999

pgsql+oltp
clients 1 2 4 8 16 32 64 128 256
tip 13945.42 26973.91 52504.18 52613.32 51310.82 50442.61 49826.52 48760.62 45570.45
13921.41 27021.48 52722.64 52565.16 51483.19 50638.83 49499.51 48621.31 46115.77
13924.94 26961.02 52624.45 52365.49 51384.91 50499.44 49622.83 48065.03 45743.14
tip avg 13930.59 26985.47 52617.09 52514.65 51392.97 50526.96 49649.62 48482.32 45809.78

tip+ 15259.79 29162.31 52609.01 52562.16 51578.48 50631.90 49537.41 48376.23 46058.95
15156.54 29114.10 52760.02 52524.86 51412.94 50656.30 48774.34 47968.77 45905.02
15118.64 29190.73 52929.34 52503.58 51574.34 50232.27 49599.15 48283.42 45766.74
tip+ avg 15178.32 29155.71 52766.12 52530.20 51521.92 50506.82 49303.63 48209.47 45910.23
vs tip 1.089 1.080 1.002 1.000 1.002 .999 .993 .994 1.002

sched: check for an idle shared cache in select_task_rq_fair()

When waking affine, check for an idle shared cache, and if found, wake to
that CPU/sibling instead of the waker's CPU. This improves pgsql+oltp
ramp up by roughly 8%. Possibly more for other loads, depending on overlap.
The trade-off is a roughly 1% peak downturn if tasks are truly synchronous.

Signed-off-by: Mike Galbraith <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
LKML-Reference: <new-submission>

---
kernel/sched_fair.c | 33 +++++++++++++++++++++++++++++----
1 file changed, 29 insertions(+), 4 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1398,11 +1398,36 @@ static int select_task_rq_fair(struct ta
want_sd = 0;
}

- if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
- cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
+ if (want_affine && (tmp->flags & SD_WAKE_AFFINE)) {
+ int candidate = -1, i;

- affine_sd = tmp;
- want_affine = 0;
+ if (cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
+ candidate = cpu;
+
+ /*
+ * Check for an idle shared cache.
+ */
+ if (tmp->flags & SD_PREFER_SIBLING) {
+ if (candidate == cpu) {
+ if (!cpu_rq(prev_cpu)->cfs.nr_running)
+ candidate = prev_cpu;
+ }
+
+ if (candidate == -1 || candidate == cpu) {
+ for_each_cpu(i, sched_domain_span(tmp)) {
+ if (!cpu_rq(i)->cfs.nr_running) {
+ candidate = i;
+ break;
+ }
+ }
+ }
+ }
+
+ if (candidate >= 0) {
+ affine_sd = tmp;
+ want_affine = 0;
+ cpu = candidate;
+ }
}

if (!want_sd && !want_affine)

2009-10-28 07:25:45

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 3/3] sched: Disable affine wakeups by default

On Tue, 2009-10-27 at 15:35 +0100, Mike Galbraith wrote:
> On Mon, 2009-10-26 at 02:53 +0100, Peter Zijlstra wrote:
> > On Sun, 2009-10-25 at 23:04 +0100, Mike Galbraith wrote:
> > > if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
> > > - cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
> > > + (level == SD_LV_SIBLING || level == SD_LV_MC)) {
> >
> > quick comment without actually having looked at the patch, we should
> > really get rid of sd->level and encode properties of the sched domains
> > in sd->flags.
>
> I used SD_PREFER_SIBLING in the below. Did I break anything?

Um, other than taskset.

> + if (candidate == -1 || candidate == cpu) {
> + for_each_cpu(i, sched_domain_span(tmp)) {

if (!cpumask_test_cpu(i, &p->cpus_allowed))
continue;

(i think i'm going to need that domain flag in a few minutes anyway)

-Mike

2009-10-28 18:36:20

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 3/3] sched: Disable affine wakeups by default

On Wed, 2009-10-28 at 08:25 +0100, Mike Galbraith wrote:

> (i think i'm going to need that domain flag in a few minutes anyway)

Dirty trick didn't work out.

As long as I let wake_affine() decide whether an affine wakeup should
succeed based on the original cpu, and then took the switched-out cpu
instead, it worked ok. Trying to be cleaner, it turned to crud.

-Mike

2009-11-04 19:34:25

by Mike Galbraith

[permalink] [raw]
Subject: [tip:sched/core] sched: Check for an idle shared cache in select_task_rq_fair()

Commit-ID: a1f84a3ab8e002159498814eaa7e48c33752b04b
Gitweb: http://git.kernel.org/tip/a1f84a3ab8e002159498814eaa7e48c33752b04b
Author: Mike Galbraith <[email protected]>
AuthorDate: Tue, 27 Oct 2009 15:35:38 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 4 Nov 2009 18:46:22 +0100

sched: Check for an idle shared cache in select_task_rq_fair()

When waking affine, check for an idle shared cache, and if
found, wake to that CPU/sibling instead of the waker's CPU.

This improves pgsql+oltp ramp up by roughly 8%. Possibly more
for other loads, depending on overlap. The trade-off is a
roughly 1% peak downturn if tasks are truly synchronous.

Signed-off-by: Mike Galbraith <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched_fair.c | 33 +++++++++++++++++++++++++++++----
1 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 4e777b4..da87385 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1372,11 +1372,36 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
want_sd = 0;
}

- if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
- cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
+ if (want_affine && (tmp->flags & SD_WAKE_AFFINE)) {
+ int candidate = -1, i;

- affine_sd = tmp;
- want_affine = 0;
+ if (cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
+ candidate = cpu;
+
+ /*
+ * Check for an idle shared cache.
+ */
+ if (tmp->flags & SD_PREFER_SIBLING) {
+ if (candidate == cpu) {
+ if (!cpu_rq(prev_cpu)->cfs.nr_running)
+ candidate = prev_cpu;
+ }
+
+ if (candidate == -1 || candidate == cpu) {
+ for_each_cpu(i, sched_domain_span(tmp)) {
+ if (!cpu_rq(i)->cfs.nr_running) {
+ candidate = i;
+ break;
+ }
+ }
+ }
+ }
+
+ if (candidate >= 0) {
+ affine_sd = tmp;
+ want_affine = 0;
+ cpu = candidate;
+ }
}

if (!want_sd && !want_affine)

2009-11-04 20:37:36

by Mike Galbraith

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Check for an idle shared cache in select_task_rq_fair()

On Wed, 2009-11-04 at 19:33 +0000, tip-bot for Mike Galbraith wrote:
> Commit-ID: a1f84a3ab8e002159498814eaa7e48c33752b04b
> Gitweb: http://git.kernel.org/tip/a1f84a3ab8e002159498814eaa7e48c33752b04b
> Author: Mike Galbraith <[email protected]>
> AuthorDate: Tue, 27 Oct 2009 15:35:38 +0100
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Wed, 4 Nov 2009 18:46:22 +0100
>
> sched: Check for an idle shared cache in select_task_rq_fair()
>
> When waking affine, check for an idle shared cache, and if
> found, wake to that CPU/sibling instead of the waker's CPU.

This one still wants some work. It's not playing well with 1b9508f.

tip v2.6.32-rc6-1731-gc5bb4b1

mysql+oltp
clients 1 2 4 8 16 32 64 128 256
tip 10447.14 19734.58 36038.18 35776.85 34662.76 33682.30 32256.22 28770.99 25323.23
10462.61 19580.14 36050.48 35942.63 35054.84 33988.40 32423.89 29259.65 25892.24
10501.02 19231.27 36007.03 35985.32 35060.79 33945.47 32400.42 29140.84 25716.16
tip avg 10470.25 19515.33 36031.89 35901.60 34926.13 33872.05 32360.17 29057.16 25643.87

tip v2.6.32-rc6-1731-gc5bb4b1 + 1b9508f (Rate-limit newidle)
tip+ 10095.64 19989.67 36449.85 35923.98 35024.24 34026.61 32411.75 28965.77 27287.96
10764.98 19864.03 36632.78 36338.00 35610.52 34439.88 32942.57 29889.18 27426.12
10734.48 19572.63 36507.90 36236.36 35486.51 34443.86 32938.90 30136.39 27186.61
tip+ avg 10531.70 19808.77 36530.17 36166.11 35373.75 34303.45 32764.40 29663.78 27300.23
vs tip 1.005 1.015 1.013 1.007 1.012 1.012 1.012 1.020 1.064

v2.6.32-rc6-1796-gd995f1d (a1f84a3 not playing well with 1b9508f)
10578.44 19358.96 36184.70 35549.19 34625.40 33658.72 31947.51 27504.11 20400.97
10610.76 19748.84 36401.08 36021.75 34830.67 33566.52 31989.97 27622.25 21348.00
10568.87 19600.60 36190.48 35959.96 34805.45 33641.46 32005.30 27591.48 22509.01

2009-11-04 21:45:30

by Mike Galbraith

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Check for an idle shared cache in select_task_rq_fair()

On Wed, 2009-11-04 at 21:37 +0100, Mike Galbraith wrote:
> On Wed, 2009-11-04 at 19:33 +0000, tip-bot for Mike Galbraith wrote:
> > Commit-ID: a1f84a3ab8e002159498814eaa7e48c33752b04b
> > Gitweb: http://git.kernel.org/tip/a1f84a3ab8e002159498814eaa7e48c33752b04b
> > Author: Mike Galbraith <[email protected]>
> > AuthorDate: Tue, 27 Oct 2009 15:35:38 +0100
> > Committer: Ingo Molnar <[email protected]>
> > CommitDate: Wed, 4 Nov 2009 18:46:22 +0100
> >
> > sched: Check for an idle shared cache in select_task_rq_fair()
> >
> > When waking affine, check for an idle shared cache, and if
> > found, wake to that CPU/sibling instead of the waker's CPU.
>
> This one still wants some work. It's not playing well with 1b9508f.

Still does want some work (on the tail), but those numbers were from a
failed experiment kernel, not d995f1d. Below are the correct numbers.

tip v2.6.32-rc6-1731-gc5bb4b1

mysql+oltp
clients 1 2 4 8 16 32 64 128 256
tip 10447.14 19734.58 36038.18 35776.85 34662.76 33682.30 32256.22 28770.99 25323.23
10462.61 19580.14 36050.48 35942.63 35054.84 33988.40 32423.89 29259.65 25892.24
10501.02 19231.27 36007.03 35985.32 35060.79 33945.47 32400.42 29140.84 25716.16
tip avg 10470.25 19515.33 36031.89 35901.60 34926.13 33872.05 32360.17 29057.16 25643.87

tip v2.6.32-rc6-1731-gc5bb4b1 + 1b9508f (Rate-limit newidle)
tip+ 10095.64 19989.67 36449.85 35923.98 35024.24 34026.61 32411.75 28965.77 27287.96
10764.98 19864.03 36632.78 36338.00 35610.52 34439.88 32942.57 29889.18 27426.12
10734.48 19572.63 36507.90 36236.36 35486.51 34443.86 32938.90 30136.39 27186.61
tip+ avg 10531.70 19808.77 36530.17 36166.11 35373.75 34303.45 32764.40 29663.78 27300.23
vs tip 1.005 1.015 1.013 1.007 1.012 1.012 1.012 1.020 1.064

v2.6.32-rc6-1796-gd995f1d
10745.30 19684.71 36367.25 35890.25 34598.08 33672.37 32327.99 28744.09 25393.00
10549.41 19747.85 36778.96 36410.58 35531.90 34292.03 32603.66 29487.77 25351.09
10678.30 19944.74 36789.44 36403.78 35523.40 34267.10 32659.60 29359.59 25751.93
avg 10657.67 19792.43 36645.21 36234.87 35217.79 34077.16 32530.41 29197.15 25498.67
vs c5bb4b1 1.017 1.014 1.017 1.009 1.008 1.006 1.005 1.004 .994

Lost the tail gain, gained peak. Drink one, spill one, give one away.

-Mike

2009-11-05 09:31:05

by Ingo Molnar

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Check for an idle shared cache in select_task_rq_fair()


* tip-bot for Mike Galbraith <[email protected]> wrote:

> Commit-ID: a1f84a3ab8e002159498814eaa7e48c33752b04b
> Gitweb: http://git.kernel.org/tip/a1f84a3ab8e002159498814eaa7e48c33752b04b
> Author: Mike Galbraith <[email protected]>
> AuthorDate: Tue, 27 Oct 2009 15:35:38 +0100
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Wed, 4 Nov 2009 18:46:22 +0100
>
> sched: Check for an idle shared cache in select_task_rq_fair()

-tip testing found that this causes problems:

[ 26.804000] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10
[ 26.808000] caller is vmstat_update+0x26/0x70
[ 26.812000] Pid: 10, comm: events/1 Not tainted 2.6.32-rc5 #6887
[ 26.816000] Call Trace:
[ 26.820000] [<c1924a24>] ? printk+0x28/0x3c
[ 26.824000] [<c13258a0>] debug_smp_processor_id+0xf0/0x110
[ 26.824000] mount used greatest stack depth: 1464 bytes left
[ 26.828000] [<c111d086>] vmstat_update+0x26/0x70
[ 26.832000] [<c1086418>] worker_thread+0x188/0x310
[ 26.836000] [<c10863b7>] ? worker_thread+0x127/0x310
[ 26.840000] [<c108d310>] ? autoremove_wake_function+0x0/0x60
[ 26.844000] [<c1086290>] ? worker_thread+0x0/0x310
[ 26.848000] [<c108cf0c>] kthread+0x7c/0x90
[ 26.852000] [<c108ce90>] ? kthread+0x0/0x90
[ 26.856000] [<c100c0a7>] kernel_thread_helper+0x7/0x10
[ 26.860000] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10
[ 26.864000] caller is vmstat_update+0x3c/0x70

oh ... doesn't it break cpus_allowed?

Ingo

2009-11-05 09:57:47

by Mike Galbraith

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Check for an idle shared cache in select_task_rq_fair()

On Thu, 2009-11-05 at 10:30 +0100, Ingo Molnar wrote:
> * tip-bot for Mike Galbraith <[email protected]> wrote:
>
> > Commit-ID: a1f84a3ab8e002159498814eaa7e48c33752b04b
> > Gitweb: http://git.kernel.org/tip/a1f84a3ab8e002159498814eaa7e48c33752b04b
> > Author: Mike Galbraith <[email protected]>
> > AuthorDate: Tue, 27 Oct 2009 15:35:38 +0100
> > Committer: Ingo Molnar <[email protected]>
> > CommitDate: Wed, 4 Nov 2009 18:46:22 +0100
> >
> > sched: Check for an idle shared cache in select_task_rq_fair()
>
> -tip testing found that this causes problems:
>
> [ 26.804000] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10
> [ 26.808000] caller is vmstat_update+0x26/0x70
> [ 26.812000] Pid: 10, comm: events/1 Not tainted 2.6.32-rc5 #6887
> [ 26.816000] Call Trace:
> [ 26.820000] [<c1924a24>] ? printk+0x28/0x3c
> [ 26.824000] [<c13258a0>] debug_smp_processor_id+0xf0/0x110
> [ 26.824000] mount used greatest stack depth: 1464 bytes left
> [ 26.828000] [<c111d086>] vmstat_update+0x26/0x70
> [ 26.832000] [<c1086418>] worker_thread+0x188/0x310
> [ 26.836000] [<c10863b7>] ? worker_thread+0x127/0x310
> [ 26.840000] [<c108d310>] ? autoremove_wake_function+0x0/0x60
> [ 26.844000] [<c1086290>] ? worker_thread+0x0/0x310
> [ 26.848000] [<c108cf0c>] kthread+0x7c/0x90
> [ 26.852000] [<c108ce90>] ? kthread+0x0/0x90
> [ 26.856000] [<c100c0a7>] kernel_thread_helper+0x7/0x10
> [ 26.860000] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10
> [ 26.864000] caller is vmstat_update+0x3c/0x70
>
> oh ... doesnt it break cpus_allowed?

Erk, indeed. You didn't apply the follow-up fix, and I forgot as well.


sched: select_task_rq_fair(): add missing cpus_allowed check in commit a1f84a3.

Signed-off-by: Mike Galbraith <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
LKML-Reference: <new-submission>

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 32f06ed..5488a5d 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1415,6 +1415,8 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag

if (candidate == -1 || candidate == cpu) {
for_each_cpu(i, sched_domain_span(tmp)) {
+ if (!cpumask_test_cpu(i, &p->cpus_allowed))
+ continue;
if (!cpu_rq(i)->cfs.nr_running) {
candidate = i;
break;

2009-11-05 10:00:40

by Mike Galbraith

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Check for an idle shared cache in select_task_rq_fair()

On Thu, 2009-11-05 at 10:57 +0100, Mike Galbraith wrote:

> Erk, indeed. You didn't apply the follow-up fix, and I forgot as well.

But maybe you should just back it out until I've fiddled/tested a bit
more. Dunno.

-Mike

2009-11-06 07:10:23

by Mike Galbraith

[permalink] [raw]
Subject: [tip:sched/core] sched: Fix affinity logic in select_task_rq_fair()

Commit-ID: fd210738f6601d0fb462df9a2fe5a41896ff6a8f
Gitweb: http://git.kernel.org/tip/fd210738f6601d0fb462df9a2fe5a41896ff6a8f
Author: Mike Galbraith <[email protected]>
AuthorDate: Thu, 5 Nov 2009 10:57:46 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 5 Nov 2009 11:01:39 +0100

sched: Fix affinity logic in select_task_rq_fair()

Ingo Molnar reported:

[ 26.804000] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10
[ 26.808000] caller is vmstat_update+0x26/0x70
[ 26.812000] Pid: 10, comm: events/1 Not tainted 2.6.32-rc5 #6887
[ 26.816000] Call Trace:
[ 26.820000] [<c1924a24>] ? printk+0x28/0x3c
[ 26.824000] [<c13258a0>] debug_smp_processor_id+0xf0/0x110
[ 26.824000] mount used greatest stack depth: 1464 bytes left
[ 26.828000] [<c111d086>] vmstat_update+0x26/0x70
[ 26.832000] [<c1086418>] worker_thread+0x188/0x310
[ 26.836000] [<c10863b7>] ? worker_thread+0x127/0x310
[ 26.840000] [<c108d310>] ? autoremove_wake_function+0x0/0x60
[ 26.844000] [<c1086290>] ? worker_thread+0x0/0x310
[ 26.848000] [<c108cf0c>] kthread+0x7c/0x90
[ 26.852000] [<c108ce90>] ? kthread+0x0/0x90
[ 26.856000] [<c100c0a7>] kernel_thread_helper+0x7/0x10
[ 26.860000] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10
[ 26.864000] caller is vmstat_update+0x3c/0x70

Because this commit:

a1f84a3: sched: Check for an idle shared cache in select_task_rq_fair()

broke ->cpus_allowed.

Signed-off-by: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched_fair.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index da87385..e4d4483 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1389,6 +1389,8 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag

if (candidate == -1 || candidate == cpu) {
for_each_cpu(i, sched_domain_span(tmp)) {
+ if (!cpumask_test_cpu(i, &p->cpus_allowed))
+ continue;
if (!cpu_rq(i)->cfs.nr_running) {
candidate = i;
break;

2009-11-10 21:59:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 3/3] sched: Disable affine wakeups by default

On Sun, 2009-10-25 at 22:36 -0700, Arjan van de Ven wrote:
>
> if you're a multicore domain you better have a shared cache.. that's
> what it should mean. If it does not we should fix that.

Fully agreed.. I've been dying to do the below for ages :-)

---
arch/x86/kernel/smpboot.c | 10 +---------
1 files changed, 1 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 213a7a3..297b307 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -433,15 +433,7 @@ void __cpuinit set_cpu_sibling_map(int cpu)
const struct cpumask *cpu_coregroup_mask(int cpu)
{
struct cpuinfo_x86 *c = &cpu_data(cpu);
- /*
- * For perf, we return last level cache shared map.
- * And for power savings, we return cpu_core_map
- */
- if ((sched_mc_power_savings || sched_smt_power_savings) &&
- !(cpu_has(c, X86_FEATURE_AMD_DCM)))
- return cpu_core_mask(cpu);
- else
- return c->llc_shared_map;
+ return c->llc_shared_map;
}

static void impress_friends(void)

2009-11-11 05:59:36

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 3/3] sched: Disable affine wakeups by default

On Tue, 10 Nov 2009 22:59:29 +0100
Peter Zijlstra <[email protected]> wrote:

> On Sun, 2009-10-25 at 22:36 -0700, Arjan van de Ven wrote:
> >
> > if you're a multicore domain you better have a shared cache.. that's
> > what it should mean. If it does not we should fix that.
>
> Fully agreed.. I've been dying to do the below for ages :-)
>

completely sane.

Acked-by: Arjan van de Ven <[email protected]>