2012-02-20 14:41:19

by Peter Zijlstra

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

On Fri, 2011-11-18 at 16:14 +0100, Mike Galbraith wrote:

> ---
> kernel/sched_fair.c | 10 ++--------
> 1 file changed, 2 insertions(+), 8 deletions(-)
>
> Index: linux-3.0-tip/kernel/sched_fair.c
> ===================================================================
> --- linux-3.0-tip.orig/kernel/sched_fair.c
> +++ linux-3.0-tip/kernel/sched_fair.c
> @@ -2276,17 +2276,11 @@ static int select_idle_sibling(struct ta
> for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
> if (idle_cpu(i)) {
> target = i;
> + if (sd->flags & SD_SHARE_CPUPOWER)
> + continue;
> break;
> }
> }
> -
> - /*
> - * Lets stop looking for an idle sibling when we reached
> - * the domain that spans the current cpu and prev_cpu.
> - */
> - if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
> - cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> - break;
> }
> rcu_read_unlock();

Mike, Suresh, did we ever get this sorted? I was looking at
select_idle_sibling() and it looks like we dropped this.

Also, did anybody ever get an answer from a HW guy on why sharing stuff
over SMT threads is so much worse than sharing it over proper cores? It's
not like this workload actually does anything concurrently.

I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.
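
For context, the effect of the quoted hunk: an idle SMT sibling is
remembered as a fallback, but the scan keeps going in the hope of finding
a fully idle core. A rough sketch of the resulting loop, paraphrasing the
3.x-era select_idle_sibling() from memory rather than quoting it exactly:

/*
 * Fragment sketch (paraphrased, not the exact 3.x source): 'target'
 * remembers any idle CPU found, but SMT domains carry
 * SD_SHARE_CPUPOWER, so the scan continues past an idle sibling and
 * only stops early once an idle CPU at the core (MC) level is found.
 */
for_each_domain(cpu, sd) {
	for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
		if (idle_cpu(i)) {
			target = i;		/* remember this idle CPU */
			if (sd->flags & SD_SHARE_CPUPOWER)
				continue;	/* SMT sibling: keep looking */
			break;			/* idle core: take it */
		}
	}
}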


2012-02-20 15:04:29

by Srivatsa Vaddagiri

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

* Peter Zijlstra <[email protected]> [2012-02-20 15:41:01]:

> On Fri, 2011-11-18 at 16:14 +0100, Mike Galbraith wrote:
>
> > ---
> > kernel/sched_fair.c | 10 ++--------
> > 1 file changed, 2 insertions(+), 8 deletions(-)
> >
> > Index: linux-3.0-tip/kernel/sched_fair.c
> > ===================================================================
> > --- linux-3.0-tip.orig/kernel/sched_fair.c
> > +++ linux-3.0-tip/kernel/sched_fair.c
> > @@ -2276,17 +2276,11 @@ static int select_idle_sibling(struct ta
> > for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
> > if (idle_cpu(i)) {
> > target = i;
> > + if (sd->flags & SD_SHARE_CPUPOWER)
> > + continue;
> > break;
> > }
> > }
> > -
> > - /*
> > - * Lets stop looking for an idle sibling when we reached
> > - * the domain that spans the current cpu and prev_cpu.
> > - */
> > - if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
> > - cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> > - break;
> > }
> > rcu_read_unlock();
>
> Mike, Suresh, did we ever get this sorted? I was looking at
> select_idle_sibling() and it looks like we dropped this.
>
> Also, did anybody ever get an answer from a HW guy on why sharing stuff
> over SMT threads is so much worse than sharing it over proper cores? It's
> not like this workload actually does anything concurrently.
>
> I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.

From a quick scan of that code, it seems to prefer selecting an idle cpu
in the same cache domain (vs. selecting prev_cpu in the absence of a core
that is fully idle).

I can give that a try for my benchmark and see how much it helps. My
suspicion is it will not fully solve the problem I have on hand.

- vatsa

2012-02-20 18:14:29

by Mike Galbraith

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

On Mon, 2012-02-20 at 15:41 +0100, Peter Zijlstra wrote:
> On Fri, 2011-11-18 at 16:14 +0100, Mike Galbraith wrote:
>
> > ---
> > kernel/sched_fair.c | 10 ++--------
> > 1 file changed, 2 insertions(+), 8 deletions(-)
> >
> > Index: linux-3.0-tip/kernel/sched_fair.c
> > ===================================================================
> > --- linux-3.0-tip.orig/kernel/sched_fair.c
> > +++ linux-3.0-tip/kernel/sched_fair.c
> > @@ -2276,17 +2276,11 @@ static int select_idle_sibling(struct ta
> > for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
> > if (idle_cpu(i)) {
> > target = i;
> > + if (sd->flags & SD_SHARE_CPUPOWER)
> > + continue;
> > break;
> > }
> > }
> > -
> > - /*
> > - * Lets stop looking for an idle sibling when we reached
> > - * the domain that spans the current cpu and prev_cpu.
> > - */
> > - if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
> > - cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> > - break;
> > }
> > rcu_read_unlock();
>
> Mike, Suresh, did we ever get this sorted? I was looking at
> select_idle_sibling() and it looks like we dropped this.

I thought this was pretty much sorted. We want to prefer core over
sibling, because on at least some modern CPUs with L3, siblings suck
rocks.

> Also, did anybody ever get an answer from a HW guy on why sharing stuff
> over SMT threads is so much worse than sharing it over proper cores?

No. My numbers on westmere indicated to me that siblings do not share
L2, making them fairly worthless. Hard facts we never got.

> It's
> not like this workload actually does anything concurrently.
>
> I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.

I really really need to find time to do systematic mainline testing.

Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
Maybe that has changed, but I doubt it. (general aside: testing with a
bloated distro config is a big mistake)

-Mike

2012-02-20 18:16:08

by Peter Zijlstra

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> I thought this was pretty much sorted. We want to prefer core over
> sibling, because on at least some modern CPUs with L3, siblings suck
> rocks.

Yeah, I have since figured out how it's supposed to (and supposedly does)
work. Suresh was a bit too clever and forgot to put a comment in, since
clearly it was obvious at the time ;-)

2012-02-20 18:25:50

by Mike Galbraith

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

On Mon, 2012-02-20 at 20:33 +0530, Srivatsa Vaddagiri wrote:
> * Peter Zijlstra <[email protected]> [2012-02-20 15:41:01]:
>
> > On Fri, 2011-11-18 at 16:14 +0100, Mike Galbraith wrote:
> >
> > > ---
> > > kernel/sched_fair.c | 10 ++--------
> > > 1 file changed, 2 insertions(+), 8 deletions(-)
> > >
> > > Index: linux-3.0-tip/kernel/sched_fair.c
> > > ===================================================================
> > > --- linux-3.0-tip.orig/kernel/sched_fair.c
> > > +++ linux-3.0-tip/kernel/sched_fair.c
> > > @@ -2276,17 +2276,11 @@ static int select_idle_sibling(struct ta
> > > for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
> > > if (idle_cpu(i)) {
> > > target = i;
> > > + if (sd->flags & SD_SHARE_CPUPOWER)
> > > + continue;
> > > break;
> > > }
> > > }
> > > -
> > > - /*
> > > - * Lets stop looking for an idle sibling when we reached
> > > - * the domain that spans the current cpu and prev_cpu.
> > > - */
> > > - if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
> > > - cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> > > - break;
> > > }
> > > rcu_read_unlock();
> >
> > Mike, Suresh, did we ever get this sorted? I was looking at
> > select_idle_sibling() and it looks like we dropped this.
> >
> > Also, did anybody ever get an answer from a HW guy on why sharing stuff
> > > over SMT threads is so much worse than sharing it over proper cores? It's
> > not like this workload actually does anything concurrently.
> >
> > I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.
>
> From a quick scan of that code, it seems to prefer selecting an idle cpu
> in the same cache domain (vs. selecting prev_cpu in the absence of a core
> that is fully idle).

Yes, that was the sole purpose of select_idle_sibling() from square one.
If you can mobilize a CPU without eating cache penalty, this is most
excellent for load ramp-up. The gain is huge over affine wakeup if
there is any overlap to regain, i.e. it's not a 100% synchronous load.

> I can give that a try for my benchmark and see how much it helps. My
> suspicion is it will not fully solve the problem I have on hand.

I doubt it will either. Your problem is when it doesn't succeed, but
you have an idle core available in another domain. That's a whole
different ball game. Yeah, you can reap benefit by doing wakeup
balancing, but you'd better look very closely at the cost. I haven't
been able to do that lately, so dunno what the cost is in the here and now,
but it used to be _way_ too expensive to consider, just as unrestricted
idle balancing is, or high frequency load balancing in general is.

-Mike

2012-02-20 19:08:00

by Peter Zijlstra

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> Maybe that has changed, but I doubt it.

Right, I thought I remembered some such, you could see it on wakeup
heavy things like pipe-bench and that java msg passing thing, right?

2012-02-21 00:06:33

by Srivatsa Vaddagiri

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

* Mike Galbraith <[email protected]> [2012-02-20 19:25:43]:

> > I can give that a try for my benchmark and see how much it helps. My
> > suspicion is it will not fully solve the problem I have on hand.
>
> I doubt it will either. Your problem is when it doesn't succeed, but
> you have an idle core available in another domain.

fwiw the patch I had sent does a wakeup balance within prev_cpu's
cache_domain (and not outside). It handles the case where we don't have
any idle cpu/core within prev_cpu's cache domain, in which case we look
for the next best thing (the least loaded cpu). I did see good numbers
with that (for both my benchmark and sysbench).
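
A hedged sketch of what that approach might look like (the patch itself
is not quoted in this thread, so the function name and details below are
illustrative only, not the actual code):

/*
 * Illustrative sketch, not the actual patch: on wakeup, search only
 * prev_cpu's last-level-cache domain. An idle CPU there wins; failing
 * that, fall back to the least loaded CPU in the same cache domain.
 */
static int wake_balance_cache_domain(struct task_struct *p, int prev_cpu)
{
	struct sched_domain *sd;
	unsigned long load, min_load = ULONG_MAX;
	int i, target = prev_cpu;

	for_each_domain(prev_cpu, sd) {
		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
			break;			/* left the cache domain: stop */

		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
			if (idle_cpu(i))
				return i;	/* idle CPU in the cache domain wins */

			load = cpu_rq(i)->load.weight;
			if (load < min_load) {
				min_load = load;
				target = i;	/* remember the least loaded CPU */
			}
		}
	}
	return target;
}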

More on this later in the day ..

- vatsa

2012-02-21 05:43:24

by Mike Galbraith

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

On Mon, 2012-02-20 at 20:07 +0100, Peter Zijlstra wrote:
> On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > Maybe that has changed, but I doubt it.
>
> Right, I thought I remembered some such, you could see it on wakeup
> heavy things like pipe-bench and that java msg passing thing, right?

Yeah, it beat up switch heavy buddies pretty bad.

-Mike

2012-02-21 06:37:20

by Mike Galbraith

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

On Tue, 2012-02-21 at 05:36 +0530, Srivatsa Vaddagiri wrote:

> fwiw the patch I had sent does a wakeup balance within prev_cpu's
> cache_domain (and not outside).

Yeah, the testing I did was to turn on the flag and measure. With single
domain scans, maybe select_idle_sibling() could just go away. It ain't
free either.

-Mike

2012-02-21 08:10:05

by Srivatsa Vaddagiri

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

* Mike Galbraith <[email protected]> [2012-02-21 07:37:14]:

> On Tue, 2012-02-21 at 05:36 +0530, Srivatsa Vaddagiri wrote:
>
> > fwiw the patch I had sent does a wakeup balance within prev_cpu's
> > cache_domain (and not outside).
>
> Yeah, the testing I did was to turn on the flag and measure. With single
> domain scans, maybe select_idle_sibling() could just go away.

Yeah that's what I was thinking. Essentially find_idlest_group/cpu
should accomplish that job pretty much for us ...

> It ain't free either.

Will find out how much difference it makes ..

- vatsa

2012-02-21 08:32:54

by Srivatsa Vaddagiri

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

* Mike Galbraith <[email protected]> [2012-02-21 06:43:18]:

> On Mon, 2012-02-20 at 20:07 +0100, Peter Zijlstra wrote:
> > On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> > > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > > Maybe that has changed, but I doubt it.
> >
> > Right, I thought I remembered some such, you could see it on wakeup
> > heavy things like pipe-bench and that java msg passing thing, right?
>
> Yeah, it beat up switch heavy buddies pretty bad.

Do you have a pointer to the java benchmark? Also, is pipe-bench the same
as the one described below?

http://freecode.com/projects/pipebench

- vatsa

2012-02-21 09:21:40

by Mike Galbraith

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

On Tue, 2012-02-21 at 14:02 +0530, Srivatsa Vaddagiri wrote:
> * Mike Galbraith <[email protected]> [2012-02-21 06:43:18]:
>
> > On Mon, 2012-02-20 at 20:07 +0100, Peter Zijlstra wrote:
> > > On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> > > > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > > > Maybe that has changed, but I doubt it.
> > >
> > > Right, I thought I remembered some such, you could see it on wakeup
> > > heavy things like pipe-bench and that java msg passing thing, right?
> >
> > Yeah, it beat up switch heavy buddies pretty bad.
>
> Do you have a pointer to the java benchmark? Also, is pipe-bench the same
> as the one described below?

I use vmark, find it _somewhat_ useful. Not a lovely benchmark, but it
is highly affinity sensitive, and switches heftily. I don't put much
value on it though, too extreme for me, but it is a ~decent indicator.

There are no doubt _lots_ better than vmark for java stuff.

I toss a variety pack at the box in obsessive-compulsive man mode when
testing. Which benchmarks doesn't matter much; they just need to be wide
spectrum and consistent.

> http://freecode.com/projects/pipebench

No, I use Ingo's pipe-test, but that's to measure fastpath overhead.

-Mike

2012-02-21 10:37:34

by Peter Zijlstra

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

On Tue, 2012-02-21 at 10:21 +0100, Mike Galbraith wrote:

> I use vmark, find it _somewhat_ useful. Not a lovely benchmark, but it
> is highly affinity sensitive, and switches heftily. I don't put much
> value on it though, too extreme for me, but it is a ~decent indicator.
>
> There are no doubt _lots_ better than vmark for java stuff.
>
> I toss a variety pack at the box in obsessive-compulsive man mode when
> testing. Which benchmarks doesn't matter much; they just need to be wide
> spectrum and consistent.

http://www.volano.com/benchmarks.html

> No, I use Ingo's pipe-test, but that's to measure fastpath overhead.

http://people.redhat.com/mingo/cfs-scheduler/tools/pipe-test.c

Also I think it lives as: perf bench sched pipe

2012-02-21 14:59:00

by Srivatsa Vaddagiri

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

* Peter Zijlstra <[email protected]> [2012-02-21 11:37:23]:

> http://www.volano.com/benchmarks.html

For some reason, volanomark stops working (the client hangs waiting for
data to arrive on a socket) in the 3rd iteration when run on the latest
tip. I don't see that with a 2.6.32-based kernel though. Checking which
commit may have caused this ..

- vatsa

2012-02-23 11:11:50

by Srivatsa Vaddagiri

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

* Mike Galbraith <[email protected]> [2012-02-20 19:14:21]:

> > I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.
>
> I really really need to find time to do systematic mainline testing.
>
> Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> Maybe that has changed, but I doubt it. (general aside: testing with a
> bloated distro config is a big mistake)

I am seeing a 2.6% _improvement_ in the volanomark score by enabling
SD_BALANCE_WAKE at SMT/MC domains.

Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
Kernel : tip (HEAD at 6241cc8)
Java : OpenJDK 1.6.0_20
Volano : 2_9_0

Volano benchmark was run 4 times with and w/o SD_BALANCE_WAKE enabled in
SMT/MC domains.

                           Average score    std. dev

SD_BALANCE_WAKE disabled      369459.750    4825.046
SD_BALANCE_WAKE enabled       379070.500    379070.5

I am going to try pipe benchmark next. Do you have suggestions for any other
benchmarks I should try to see the effect of SD_BALANCE_WAKE turned on in
SMT/MC domains?

- vatsa

2012-02-23 11:19:45

by Ingo Molnar

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible


* Srivatsa Vaddagiri <[email protected]> wrote:

> * Mike Galbraith <[email protected]> [2012-02-20 19:14:21]:
>
> > > I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.
> >
> > I really really need to find time to do systematic mainline testing.
> >
> > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > Maybe that has changed, but I doubt it. (general aside: testing with a
> > bloated distro config is a big mistake)
>
> I am seeing a 2.6% _improvement_ in the volanomark score by enabling
> SD_BALANCE_WAKE at SMT/MC domains.
>
> Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
> Kernel : tip (HEAD at 6241cc8)
> Java : OpenJDK 1.6.0_20
> Volano : 2_9_0
>
> Volano benchmark was run 4 times with and w/o SD_BALANCE_WAKE enabled in
> SMT/MC domains.
>
>                            Average score    std. dev
>
> SD_BALANCE_WAKE disabled      369459.750    4825.046
> SD_BALANCE_WAKE enabled       379070.500    379070.5
>
> I am going to try pipe benchmark next. Do you have suggestions
> for any other benchmarks I should try to see the effect of
> SD_BALANCE_WAKE turned on in SMT/MC domains?

sysbench is one of the best ones punishing bad scheduler
balancing mistakes.

Thanks,

Ingo

2012-02-23 11:20:16

by Srivatsa Vaddagiri

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

* Srivatsa Vaddagiri <[email protected]> [2012-02-23 16:19:59]:

> > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > Maybe that has changed, but I doubt it. (general aside: testing with a
> > bloated distro config is a big mistake)
>
> I am seeing a 2.6% _improvement_ in the volanomark score by enabling
> SD_BALANCE_WAKE at SMT/MC domains.

Ditto with pipe benchmark.

> Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
> Kernel : tip (HEAD at 6241cc8)

"perf bench sched pipe" was run 10 times and average ops/sec score
alongwith std. dev is noted as below.


             SD_BALANCE_WAKE   SD_BALANCE_WAKE
             disabled          enabled

Avg. score   108984.900        111844.300 (+2.6%)
std dev       20383.457         21668.604

Note:

SD_BALANCE_WAKE for the SMT/MC domains was enabled by writing into the
appropriate flags files present in /proc:

# cd /proc/sys/kernel/sched_domain/
# for i in `find . -name flags | grep domain1`; do echo 4671 > $i; done
# for i in `find . -name flags | grep domain0`; do echo 703 > $i; done
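
For reference, those magic numbers are each domain's stock flags with
SD_BALANCE_WAKE (0x10) added. A rough decode, assuming the ~3.2-era flag
values from include/linux/sched.h (exact per-domain defaults vary with
kernel version and topology, so treat this as approximate):

/* Domain flag values circa v3.2 (include/linux/sched.h) */
#define SD_LOAD_BALANCE		0x0001
#define SD_BALANCE_NEWIDLE	0x0002
#define SD_BALANCE_EXEC		0x0004
#define SD_BALANCE_FORK		0x0008
#define SD_BALANCE_WAKE		0x0010	/* the bit being switched on here */
#define SD_WAKE_AFFINE		0x0020
#define SD_SHARE_CPUPOWER	0x0080	/* SMT siblings share execution units */
#define SD_SHARE_PKG_RESOURCES	0x0200	/* cores share cache */
#define SD_PREFER_SIBLING	0x1000

/*  703 = 0x02bf = SMT (domain0) defaults | SD_BALANCE_WAKE */
/* 4671 = 0x123f = MC  (domain1) defaults | SD_BALANCE_WAKE */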

- vatsa

2012-02-23 11:21:10

by Mike Galbraith

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

On Thu, 2012-02-23 at 16:19 +0530, Srivatsa Vaddagiri wrote:

> I am seeing a 2.6% _improvement_ in the volanomark score by enabling
> SD_BALANCE_WAKE at SMT/MC domains.

That's an unexpected (but welcome) result.

> Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
> Kernel : tip (HEAD at 6241cc8)
> Java : OpenJDK 1.6.0_20
> Volano : 2_9_0
>
> Volano benchmark was run 4 times with and w/o SD_BALANCE_WAKE enabled in
> SMT/MC domains.
>
>                            Average score    std. dev
>
> SD_BALANCE_WAKE disabled      369459.750    4825.046
> SD_BALANCE_WAKE enabled       379070.500    379070.5
>
> I am going to try pipe benchmark next. Do you have suggestions for any other
> benchmarks I should try to see the effect of SD_BALANCE_WAKE turned on in
> SMT/MC domains?

Unpinned netperf TCP_RR and/or tbench pairs? Anything that's wakeup
heavy should tell the tale.

-Mike

2012-02-23 11:26:59

by Ingo Molnar

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible


* Srivatsa Vaddagiri <[email protected]> wrote:

> "perf bench sched pipe" was run 10 times and average ops/sec
> score alongwith std. dev is noted as below.
>
>
>              SD_BALANCE_WAKE   SD_BALANCE_WAKE
>              disabled          enabled
>
> Avg. score   108984.900        111844.300 (+2.6%)
> std dev       20383.457         21668.604

pro perf tip of the day: did you know that it's possible to run:

perf stat --repeat 10 --null perf bench sched pipe

and get meaningful, normalized stddev calculations for free:

5.486491487 seconds time elapsed ( +- 5.50% )

the percentage at the end shows the level of standard deviation.

You can add "--sync" as well, which will cause perf stat to call
sync() - this gives extra stability of individual iterations and
makes sure all IO cost is accounted to the right run.

Thanks,

Ingo

2012-02-23 11:54:30

by Srivatsa Vaddagiri

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

* Ingo Molnar <[email protected]> [2012-02-23 12:26:50]:

> pro perf tip of the day: did you know that it's possible to run:
>
> perf stat --repeat 10 --null perf bench sched pipe

Ah yes ..had forgotten that! What is the recommended # of iterations
to run?

> and get meaningful, normalized stddev calculations for free:
>
> 5.486491487 seconds time elapsed ( +- 5.50% )
>
> the percentage at the end shows the level of standard deviation.
>
> You can add "--sync" as well, which will cause perf stat to call
> sync() - this gives extra stability of individual iterations and
> makes sure all IO cost is accounted to the right run.

Ok ..thanks for the 'pro' tip!! Will use it in my subsequent runs!

- vatsa

2012-02-23 12:18:50

by Srivatsa Vaddagiri

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

* Ingo Molnar <[email protected]> [2012-02-23 12:19:29]:

> sysbench is one of the best ones punishing bad scheduler
> balancing mistakes.

Here are the sysbench OLTP scores on the same machine, i.e.:

Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
Kernel : tip (HEAD at 6241cc8)
sysbench : 0.4.12

sysbench was run with 16 threads as:

./sysbench --num-threads=16 --max-requests=100000 --test=oltp --oltp-table-size=500000 --mysql-socket=/var/lib/mysql/mysql.sock --oltp-read-only --mysql-user=root --mysql-password=blah run

sysbench was run 5 times with fs-cache being purged before each run
(echo 3 > /proc/sys/vm/drop_caches).

The average of 5 runs along with the % std. dev. is noted for the various OLTP stats:

                               SD_BALANCE_WAKE       SD_BALANCE_WAKE
                               disabled              enabled

transactions (per sec)          4833.826 (+- 0.75%)   4837.354 (+- 1%)
read/write requests (per sec)  67673.580 (+- 0.75%)  67722.960 (+- 1%)
other ops (per sec)             9667.654 (+- 0.75%)   9674.710 (+- 1%)

There is a minor improvement seen when SD_BALANCE_WAKE is enabled at the
SMT/MC domains, and no degradation is observed with it enabled.

- vatsa

2012-02-23 16:17:36

by Ingo Molnar

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible


* Srivatsa Vaddagiri <[email protected]> wrote:

> * Ingo Molnar <[email protected]> [2012-02-23 12:26:50]:
>
> > pro perf tip of the day: did you know that it's possible to run:
> >
> > perf stat --repeat 10 --null perf bench sched pipe
>
> Ah yes ..had forgotten that! What is the recommended # of
> iterations to run?

Well, I start at 3 or 5 and sometimes increase it to get a
meaningful stddev - i.e. one that is smaller than half of the
measured effect. (for cases where I expect an improvement and
want to measure it.)

Thanks,

Ingo

2012-02-25 06:54:15

by Srivatsa Vaddagiri

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

* Mike Galbraith <[email protected]> [2012-02-23 12:21:04]:

> Unpinned netperf TCP_RR and/or tbench pairs? Anything that's wakeup
> heavy should tell the tale.

Here are some tbench numbers:

Machine : 2 Intel Xeon X5650 (Westmere) CPUs (6 core/package)
Kernel : tip (HEAD at ebe97fa)
dbench : v4.0

One tbench server/client pair was run on the same host 5 times (with
the fs-cache being purged each time) and the avg of the 5 runs for the
various cases is noted below:

Case A : HT enabled (24 logical CPUs)

Thr'put : 168.166 MB/s (SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
Thr'put : 169.564 MB/s (SD_SHARE_PKG_RESOURCES + SD_BALANCE_WAKE at mc/smt)
Thr'put : 173.151 MB/s (!SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)

Case B : HT disabled (12 logical CPUs)

Thr'put : 167.977 MB/s (SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
Thr'put : 167.891 MB/s (SD_SHARE_PKG_RESOURCES + SD_BALANCE_WAKE at mc)
Thr'put : 173.801 MB/s (!SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)

Observations:

a. ~3% improvement seen with SD_SHARE_PKG_RESOURCES disabled, which I guess
reflects the cost of waking to a cold L2 cache.

b. No degradation seen with SD_BALANCE_WAKE enabled at mc/smt domains

IMO we need to detect tbench-type paired wakeups as the synchronous case,
in which case we should blindly wake the task on cur_cpu (as the cost of
an L2 cache miss could outweigh the benefit of any reduced scheduling
latency).

IOW, select_task_rq_fair() needs to be given a better hint as to whether
the L2 cache has been made warm by someone (an interrupt handler or a
producer task), in which case the (consumer) task needs to be woken in
the same L2 cache domain (i.e. on cur_cpu itself)?
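
One possible shape for such a hint, as a purely hypothetical sketch
(WF_SYNC did exist as a sync-wakeup flag in kernels of this era, but this
exact check is illustrative, not a patch from this thread):

/*
 * Hypothetical sketch, not a real patch: if the wakeup is synchronous
 * (WF_SYNC -- the waker is about to sleep) and the waker is the only
 * task on its runqueue, assume it just produced the wakee's data and
 * its L2 is warm, so wake the wakee on the waker's CPU.
 */
static int want_affine_l2(struct rq *this_rq, int wake_flags)
{
	return (wake_flags & WF_SYNC) && this_rq->nr_running == 1;
}

select_task_rq_fair() could then short-circuit to smp_processor_id()
when this returns true, instead of scanning for an idle sibling.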

- vatsa

2012-02-25 08:31:03

by Mike Galbraith

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

On Sat, 2012-02-25 at 12:24 +0530, Srivatsa Vaddagiri wrote:
> * Mike Galbraith <[email protected]> [2012-02-23 12:21:04]:
>
> > Unpinned netperf TCP_RR and/or tbench pairs? Anything that's wakeup
> > heavy should tell the tale.
>
> Here are some tbench numbers:
>
> Machine : 2 Intel Xeon X5650 (Westmere) CPUs (6 core/package)
> Kernel : tip (HEAD at ebe97fa)
> dbench : v4.0
>
> One tbench server/client pair was run on the same host 5 times (with
> the fs-cache being purged each time) and the avg of the 5 runs for the
> various cases is noted below:
>
> Case A : HT enabled (24 logical CPUs)
>
> Thr'put : 168.166 MB/s (SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
> Thr'put : 169.564 MB/s (SD_SHARE_PKG_RESOURCES + SD_BALANCE_WAKE at mc/smt)
> Thr'put : 173.151 MB/s (!SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
>
> Case B : HT disabled (12 logical CPUs)
>
> Thr'put : 167.977 MB/s (SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
> Thr'put : 167.891 MB/s (SD_SHARE_PKG_RESOURCES + SD_BALANCE_WAKE at mc)
> Thr'put : 173.801 MB/s (!SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
>
> Observations:
>
> a. ~3% improvement seen with SD_SHARE_PKG_RESOURCES disabled, which I guess
> reflects the cost of waking to a cold L2 cache.
>
> b. No degradation seen with SD_BALANCE_WAKE enabled at mc/smt domains

I haven't done a lot of testing, but yeah, the little I have doesn't
show SD_BALANCE_WAKE making much difference on single socket boxen.

> IMO we need to detect tbench-type paired wakeups as the synchronous case,
> in which case we should blindly wake the task on cur_cpu (as the cost of
> an L2 cache miss could outweigh the benefit of any reduced scheduling
> latency).
>
> IOW, select_task_rq_fair() needs to be given a better hint as to whether
> the L2 cache has been made warm by someone (an interrupt handler or a
> producer task), in which case the (consumer) task needs to be woken in
> the same L2 cache domain (i.e. on cur_cpu itself)?

My less rotund config shows the L2 penalty decidedly more prominently.
We used to have avg_overlap as a synchronous wakeup hint, but it was
broken by preemption and whatnot, so it got the axe to recover some
cycles. A reliable and dirt-cheap replacement would be a good thing to
have.
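
For those who don't remember it: avg_overlap was a per-entity running
average of how long a task kept running after waking another task. From
memory of the ~2.6.30-era wake_affine() (hedged sketch; field and sysctl
names as they were then, details approximate):

/*
 * Hedged sketch from memory of ~2.6.30: a wakeup was treated as
 * synchronous only if both waker and wakee historically went to sleep
 * soon after such wakeups, i.e. their avg_overlap stayed below the
 * migration-cost threshold.
 */
static int overlap_says_sync(struct task_struct *waker,
			     struct task_struct *wakee)
{
	return waker->se.avg_overlap < sysctl_sched_migration_cost &&
	       wakee->se.avg_overlap < sysctl_sched_migration_cost;
}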

TCP_RR and tbench are a long way away from the overlap breakeven point
on the E5620, whereas with the Q6600's shared L2, you can start
converting overlap into throughput almost immediately.

2.4 GHz E5620
Throughput 248.994 MB/sec 1 procs SD_SHARE_PKG_RESOURCES
Throughput 379.488 MB/sec 1 procs !SD_SHARE_PKG_RESOURCES

2.4 GHz Q6600
Throughput 299.049 MB/sec 1 procs SD_SHARE_PKG_RESOURCES
Throughput 300.018 MB/sec 1 procs !SD_SHARE_PKG_RESOURCES

-Mike

2012-02-27 22:12:23

by Suresh Siddha

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

On Sat, 2012-02-25 at 09:30 +0100, Mike Galbraith wrote:
> My less rotund config shows the L2 penalty decidedly more prominently.
> We used to have avg_overlap as a synchronous wakeup hint, but it was
> broken by preemption and whatnot, so it got the axe to recover some
> cycles. A reliable and dirt-cheap replacement would be a good thing to
> have.
>
> TCP_RR and tbench are a long way away from the overlap breakeven point
> on the E5620, whereas with the Q6600's shared L2, you can start
> converting overlap into throughput almost immediately.
>
> 2.4 GHz E5620
> Throughput 248.994 MB/sec 1 procs SD_SHARE_PKG_RESOURCES
> Throughput 379.488 MB/sec 1 procs !SD_SHARE_PKG_RESOURCES
>
> 2.4 GHz Q6600
> Throughput 299.049 MB/sec 1 procs SD_SHARE_PKG_RESOURCES
> Throughput 300.018 MB/sec 1 procs !SD_SHARE_PKG_RESOURCES
>

Also, it is not always just about the L2 cache being shared or not, or
warm vs. cold, etc. It also depends on the core C-states/P-states. Waking
up an idle core has a cost, and that cost will depend on which core
C-state it is in. And if we ping-pong between cores often, the cpufreq
governor will come along and request a lower core P-state, even though
the load was keeping one core or the other in the socket busy at any
given point in time.

thanks,
suresh

2012-02-28 05:05:20

by Mike Galbraith

Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

On Mon, 2012-02-27 at 14:11 -0800, Suresh Siddha wrote:
> On Sat, 2012-02-25 at 09:30 +0100, Mike Galbraith wrote:
> > My less rotund config shows the L2 penalty decidedly more prominently.
> > We used to have avg_overlap as a synchronous wakeup hint, but it was
> > broken by preemption and whatnot, so it got the axe to recover some
> > cycles. A reliable and dirt-cheap replacement would be a good thing to
> > have.
> >
> > TCP_RR and tbench are a long way away from the overlap breakeven point
> > on the E5620, whereas with the Q6600's shared L2, you can start
> > converting overlap into throughput almost immediately.
> >
> > 2.4 GHz E5620
> > Throughput 248.994 MB/sec 1 procs SD_SHARE_PKG_RESOURCES
> > Throughput 379.488 MB/sec 1 procs !SD_SHARE_PKG_RESOURCES
> >
> > 2.4 GHz Q6600
> > Throughput 299.049 MB/sec 1 procs SD_SHARE_PKG_RESOURCES
> > Throughput 300.018 MB/sec 1 procs !SD_SHARE_PKG_RESOURCES
> >
>
> Also, it is not always just about the L2 cache being shared or not, or
> warm vs. cold, etc. It also depends on the core C-states/P-states. Waking
> up an idle core has a cost, and that cost will depend on which core
> C-state it is in. And if we ping-pong between cores often, the cpufreq
> governor will come along and request a lower core P-state, even though
> the load was keeping one core or the other in the socket busy at any
> given point in time.

Yeah, pinning yields a couple percent on the Q6600 box, more on the E5620
despite its spiffier gearbox.. likely turbo-boost doing its thing.

-Mike