2010-01-04 08:31:38

by Lin Ming

Subject: volano ~30% regression with 2.6.33-rc1 & -rc2

Mike & Peter,

Compared with 2.6.32, volano has a ~30% regression with 2.6.33-rc1 & -rc2.
Testing machine: Tigerton Xeon, 16 cpus (4P/4Core), 16G memory

Bisected to the commit below:

commit a1f84a3ab8e002159498814eaa7e48c33752b04b
Author: Mike Galbraith <[email protected]>
Date: Tue Oct 27 15:35:38 2009 +0100

sched: Check for an idle shared cache in select_task_rq_fair()

When waking affine, check for an idle shared cache, and if
found, wake to that CPU/sibling instead of the waker's CPU.

This improves pgsql+oltp ramp up by roughly 8%. Possibly more
for other loads, depending on overlap. The trade-off is a
roughly 1% peak downturn if tasks are truly synchronous.

Signed-off-by: Mike Galbraith <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>


This commit can't be reverted cleanly due to conflicts, so I reverted the
four idle-shared-cache related commits below in 2.6.33-rc2, and performance
was then restored to 2.6.32 levels.

fe3bcfe (sched: More generic WAKE_AFFINE vs select_idle_sibling())
a50bde5 (sched: Cleanup select_task_rq_fair())
fd21073 (sched: Fix affinity logic in select_task_rq_fair())
a1f84a3 (sched: Check for an idle shared cache in select_task_rq_fair())

This regression seems to be caused by cache misses from accessing other
CPUs' per-cpu data (see the perf top cache-misses data below for details):

select_idle_sibling(...)
{
	....
	/*
	 * Each iteration dereferences cpu_rq(i), another CPU's per-cpu
	 * runqueue data, so on a large machine this walk generates a
	 * steady stream of cross-CPU cache misses.
	 */
	for_each_cpu_and(i, sched_domain_span(sd), &p->cpus_allowed) {
		if (!cpu_rq(i)->cfs.nr_running) {
			target = i;
			break;
		}
	}
	....
}

Performance is likewise restored to 2.6.32 levels if SD_PREFER_SIBLING is
not set, in which case select_idle_sibling() is never called.
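
(For reference, a trimmed sketch of the call site in the 2.6.33-rc
select_task_rq_fair() wake_affine path; this is the same hunk that Mike's
patch further down the thread ends up changing:)

	for_each_domain(cpu, tmp) {
		....
		/*
		 * If there's an idle sibling in this domain, make that
		 * the wake_affine target instead of the current cpu.
		 */
		if (tmp->flags & SD_PREFER_SIBLING)
			target = select_idle_sibling(p, tmp, target);
		....
	}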

The perf top data follows.

2.6.33-rc1 cache-misses data (note 11.8% select_task_rq_fair)
------------------------------------------------------------------------------------
PerfTop: 12262 irqs/sec kernel:90.6% [1000Hz cache-misses], (all, 16 CPUs)
------------------------------------------------------------------------------------

  samples   pcnt  function             DSO
 ________  _____  ___________________  _________________

 18272.00  11.8%  select_task_rq_fair  [kernel.kallsyms]
 15499.00  10.0%  schedule             [kernel.kallsyms]
  9447.00   6.1%  update_curr          [kernel.kallsyms]
  9255.00   6.0%  _raw_spin_lock       [kernel.kallsyms]
  5161.00   3.3%  tcp_sendmsg          [kernel.kallsyms]

2.6.32 cache-misses data
--------------------------------------------------------------------------------------
PerfTop: 11749 irqs/sec kernel:88.2% [1000Hz cache-misses], (all, 16 CPUs)
--------------------------------------------------------------------------------------

  samples   pcnt  function        DSO
 ________  _____  ______________  _________________

 11974.00  11.5%  schedule        [kernel.kallsyms]
  6656.00   6.4%  _spin_lock      [kernel.kallsyms]
  5852.00   5.6%  update_curr     [kernel.kallsyms]
  3140.00   3.0%  enqueue_entity  [kernel.kallsyms]
  2846.00   2.7%  tcp_sendmsg     [kernel.kallsyms]

2.6.33-rc1 cycles data (note 6.5% select_task_rq_fair)
-------------------------------------------------------------------------------
PerfTop: 11106 irqs/sec kernel:99.7% [1000Hz cycles], (all, 16 CPUs)
-------------------------------------------------------------------------------

  samples   pcnt  function             DSO
 ________  _____  ___________________  _________________

 11658.00  10.0%  schedule             [kernel.kallsyms]
 10870.00   9.4%  _raw_spin_lock       [kernel.kallsyms]
  7576.00   6.5%  select_task_rq_fair  [kernel.kallsyms]
  3696.00   3.2%  tcp_sendmsg          [kernel.kallsyms]
  3000.00   2.6%  update_curr          [kernel.kallsyms]

2.6.32 cycles data
------------------------------------------------------------------------------------
PerfTop: 10462 irqs/sec kernel:99.8% [1000Hz cycles], (all, 16 CPUs)
------------------------------------------------------------------------------------

  samples   pcnt  function       DSO
 ________  _____  _____________  _________________

 13364.00   9.9%  schedule       [kernel.kallsyms]
 13140.00   9.8%  _spin_lock     [kernel.kallsyms]
  4903.00   3.6%  tcp_sendmsg    [kernel.kallsyms]
  4017.00   3.0%  update_curr    [kernel.kallsyms]
  3395.00   2.5%  _spin_lock_bh  [kernel.kallsyms]


Lin Ming


2010-01-04 12:37:54

by Arjan van de Ven

Subject: Re: volano ~30% regression with 2.6.33-rc1 & -rc2

On Mon, 04 Jan 2010 16:15:58 +0800
Lin Ming <[email protected]> wrote:

> Mike & Peter,
>
> Compared with 2.6.32, volano has a ~30% regression with 2.6.33-rc1 &
> -rc2. Testing machine: Tigerton Xeon, 16 cpus (4P/4Core), 16G memory

did this show up only on this cpu?
(since this is a multi-core-without-shared-cache cpu, it could be that
we get the topology wrong and think cores share cache where they don't)
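
(One quick way to check what the kernel actually derived is
/sys/devices/system/cpu/cpuN/cache/indexM/shared_cpu_map, which lists the
CPUs the kernel believes share each cache level.)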


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2010-01-04 12:57:40

by Mike Galbraith

Subject: Re: volano ~30% regression with 2.6.33-rc1 & -rc2

On Mon, 2010-01-04 at 04:40 -0800, Arjan van de Ven wrote:
> On Mon, 04 Jan 2010 16:15:58 +0800
> Lin Ming <[email protected]> wrote:
>
> > Mike & Peter,
> >
> > Compared with 2.6.32, volano has a ~30% regression with 2.6.33-rc1 &
> > -rc2. Testing machine: Tigerton Xeon, 16 cpus (4P/4Core), 16G memory
>
> did this show up only on this cpu?
> (since this is a multi-core-without-shared-cache cpu, it could be that
> we get the topology wrong and think cores share cache where they don't)

My fault for using PREFER_SIBLING I guess. However, I do wonder why in
the heck we set that at the CPU domain level. Siblings lie northward.

-Mike

2010-01-04 13:02:44

by Peter Zijlstra

Subject: Re: volano ~30% regression with 2.6.33-rc1 & -rc2

On Mon, 2010-01-04 at 13:57 +0100, Mike Galbraith wrote:
> On Mon, 2010-01-04 at 04:40 -0800, Arjan van de Ven wrote:
> > On Mon, 04 Jan 2010 16:15:58 +0800
> > Lin Ming <[email protected]> wrote:
> >
> > > Mike & Peter,
> > >
> > > Compared with 2.6.32, volano has a ~30% regression with 2.6.33-rc1 &
> > > -rc2. Testing machine: Tigerton Xeon, 16 cpus (4P/4Core), 16G memory
> >
> > did this show up only on this cpu?
> > (since this is a multi-core-without-shared-cache cpu, it could be that
> > we get the topology wrong and think cores share cache where they don't)
>
> My fault for using PREFER_SIBLING I guess. However, I do wonder why in
> the heck we set that at the CPU domain level. Siblings lie northward.

Ah, PREFER_SIBLING means prefer sibling domain, not sibling thread. It's
set at the CPU (really socket) level to make tasks spread over sockets
first, so that there is no competition for the socket-wide resources.
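
(Concretely, an abbreviated sketch of the 2.6.32-era SD_CPU_INIT flags in
include/linux/topology.h, which is where SD_PREFER_SIBLING gets switched on:)

#define SD_CPU_INIT (struct sched_domain) {		\
	....
			| 0*SD_SHARE_CPUPOWER		\
			| 0*SD_SHARE_PKG_RESOURCES	\
			| 0*SD_SERIALIZE		\
			| 1*SD_PREFER_SIBLING		\
	....
}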

Your change is sane, but we really want a more extensive sched domain
tree in the near future, reflecting the full machine topology.

2010-01-04 13:15:33

by Mike Galbraith

Subject: Re: volano ~30% regression with 2.6.33-rc1 & -rc2

On Mon, 2010-01-04 at 14:02 +0100, Peter Zijlstra wrote:
> On Mon, 2010-01-04 at 13:57 +0100, Mike Galbraith wrote:
> > On Mon, 2010-01-04 at 04:40 -0800, Arjan van de Ven wrote:
> > > On Mon, 04 Jan 2010 16:15:58 +0800
> > > Lin Ming <[email protected]> wrote:
> > >
> > > > Mike & Peter,
> > > >
> > > > Compared with 2.6.32, volano has a ~30% regression with 2.6.33-rc1 &
> > > > -rc2. Testing machine: Tigerton Xeon, 16 cpus (4P/4Core), 16G memory
> > >
> > > did this show up only on this cpu?
> > > (since this is a multi-core-without-shared-cache cpu, it could be that
> > > we get the topology wrong and think cores share cache where they don't)
> >
> > My fault for using PREFER_SIBLING I guess. However, I do wonder why in
> > the heck we set that at the CPU domain level. Siblings lie northward.
>
> Ah, PREFER_SIBLING means prefer sibling domain, not sibling thread. It's
> set at the CPU (really socket) level to make tasks spread over sockets
> first, so that there is no competition for the socket-wide resources.

WRT the regression, would you prefer only the sched_fair.c hunk, and
maybe plunking the topology hunk in sched_devel, or both lines in one
patch, since ramp-up gain remains unrealized half of the time on Nehalem
and ilk.

> Your change is sane, but we really want a more extensive sched domain
> tree in the near future, reflecting the full machine topology.

Yeah.

-Mike

2010-01-04 13:27:31

by Peter Zijlstra

Subject: Re: volano ~30% regression with 2.6.33-rc1 & -rc2

On Mon, 2010-01-04 at 14:15 +0100, Mike Galbraith wrote:
> On Mon, 2010-01-04 at 14:02 +0100, Peter Zijlstra wrote:
> > On Mon, 2010-01-04 at 13:57 +0100, Mike Galbraith wrote:
> > > On Mon, 2010-01-04 at 04:40 -0800, Arjan van de Ven wrote:
> > > > On Mon, 04 Jan 2010 16:15:58 +0800
> > > > Lin Ming <[email protected]> wrote:
> > > >
> > > > > Mike & Peter,
> > > > >
> > > > > Compared with 2.6.32, volano has a ~30% regression with 2.6.33-rc1 &
> > > > > -rc2. Testing machine: Tigerton Xeon, 16 cpus (4P/4Core), 16G memory
> > > >
> > > > did this show up only on this cpu?
> > > > (since this is a multi-core-without-shared-cache cpu, it could be that
> > > > we get the topology wrong and think cores share cache where they don't)
> > >
> > > My fault for using PREFER_SIBLING I guess. However, I do wonder why in
> > > the heck we set that at the CPU domain level. Siblings lie northward.
> >
> > Ah, PREFER_SIBLING means prefer sibling domain, not sibling thread. It's
> > set at the CPU (really socket) level to make tasks spread over sockets
> > first, so that there is no competition for the socket-wide resources.
>
> WRT the regression, would you prefer only the sched_fair.c hunk, and
> maybe plunking the topology hunk in sched_devel, or both lines in one
> patch, since ramp-up gain remains unrealized half of the time on Nehalem
> and ilk.

Both bits seem sane I guess, you change SD_SIBLING_INIT(), right?
Threads really do share package resources so it makes sense to set it.

I guess it's back to poking at Nehalem to see what makes it tick...

2010-01-04 13:45:01

by Mike Galbraith

Subject: [patch] Re: volano ~30% regression with 2.6.33-rc1 & -rc2

On Mon, 2010-01-04 at 14:26 +0100, Peter Zijlstra wrote:
> On Mon, 2010-01-04 at 14:15 +0100, Mike Galbraith wrote:

> > WRT the regression, would you prefer only the sched_fair.c hunk, and
> > maybe plunking the topology hunk in sched_devel, or both lines in one
> > patch, since ramp-up gain remains unrealized half of the time on Nehalem
> > and ilk.
>
> Both bits seem sane I guess, you change SD_SIBLING_INIT(), right?

Right.

> Threads really do share package resources so it makes sense to set it.
>
> I guess it's back to poking at Nehalem to see what makes it tick...

I asked Santa for a quad socket Nehalem and a portable nuclear reactor
to power it, but the stingy old fart let me down ;-)

sched: fix vmark regression on big machines

SD_PREFER_SIBLING is set at the CPU domain level if power saving isn't enabled,
leading to many cache misses on large machines as we traverse looking for an
idle shared cache to wake to. Change the enabler of select_idle_sibling() to
SD_SHARE_PKG_RESOURCES, and enable same at the sibling domain level.

Signed-off-by: Mike Galbraith <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Reported-by: Lin Ming <[email protected]>
LKML-Reference: <new-submission>

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 57e6357..5b81156 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -99,7 +99,7 @@ int arch_update_cpu_topology(void);
 			| 1*SD_WAKE_AFFINE		\
 			| 1*SD_SHARE_CPUPOWER		\
 			| 0*SD_POWERSAVINGS_BALANCE	\
-			| 0*SD_SHARE_PKG_RESOURCES	\
+			| 1*SD_SHARE_PKG_RESOURCES	\
 			| 0*SD_SERIALIZE		\
 			| 0*SD_PREFER_SIBLING		\
 			,				\
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 42ac3c9..8fe7ee8 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1508,7 +1508,7 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 		 * If there's an idle sibling in this domain, make that
 		 * the wake_affine target instead of the current cpu.
 		 */
-		if (tmp->flags & SD_PREFER_SIBLING)
+		if (tmp->flags & SD_SHARE_PKG_RESOURCES)
 			target = select_idle_sibling(p, tmp, target);

 		if (target >= 0) {

2010-01-05 01:00:44

by Lin Ming

Subject: Re: volano ~30% regression with 2.6.33-rc1 & -rc2

On Mon, 2010-01-04 at 20:40 +0800, Arjan van de Ven wrote:
> On Mon, 04 Jan 2010 16:15:58 +0800
> Lin Ming <[email protected]> wrote:
>
> > Mike & Peter,
> >
> > Compared with 2.6.32, volano has a ~30% regression with 2.6.33-rc1 &
> > -rc2. Testing machine: Tigerton Xeon, 16 cpus (4P/4Core), 16G memory
>
> did this show up only on this cpu?
> (since this is a multi-core-without-shared-cache cpu, it could be that
> we get the topology wrong and think cores share cache where they don't)

The Tulsa machine (16 cpus, 4P/2Core/HT) also had a ~8% regression,
and it's now fixed by Mike's patch.

Lin Ming

2010-01-05 02:44:47

by Mike Galbraith

Subject: Re: volano ~30% regression with 2.6.33-rc1 & -rc2

On Tue, 2010-01-05 at 08:44 +0800, Lin Ming wrote:
> On Mon, 2010-01-04 at 20:40 +0800, Arjan van de Ven wrote:
> > On Mon, 04 Jan 2010 16:15:58 +0800
> > Lin Ming <[email protected]> wrote:
> >
> > > Mike & Peter,
> > >
> > > Compared with 2.6.32, volano has a ~30% regression with 2.6.33-rc1 &
> > > -rc2. Testing machine: Tigerton Xeon, 16 cpus (4P/4Core), 16G memory
> >
> > did this show up only on this cpu?
> > (since this is a multi-core-without-shared-cache cpu, it could be that
> > we get the topology wrong and think cores share cache where they don't)
>
> The Tulsa machine (16 cpus, 4P/2Core/HT) also had a ~8% regression,
> and it's now fixed by Mike's patch.

Excellent. Thanks for testing.

-Mike

2010-01-21 13:52:19

by Mike Galbraith

Subject: [tip:sched/urgent] sched: Fix vmark regression on big machines

Commit-ID: 50b926e439620c469565e8be0f28be78f5fca1ce
Gitweb: http://git.kernel.org/tip/50b926e439620c469565e8be0f28be78f5fca1ce
Author: Mike Galbraith <[email protected]>
AuthorDate: Mon, 4 Jan 2010 14:44:56 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 21 Jan 2010 13:39:03 +0100

sched: Fix vmark regression on big machines

SD_PREFER_SIBLING is set at the CPU domain level if power saving isn't
enabled, leading to many cache misses on large machines as we traverse
looking for an idle shared cache to wake to. Change the enabler of
select_idle_sibling() to SD_SHARE_PKG_RESOURCES, and enable same at the
sibling domain level.

Reported-by: Lin Ming <[email protected]>
Signed-off-by: Mike Galbraith <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
 include/linux/topology.h |    2 +-
 kernel/sched_fair.c      |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 57e6357..5b81156 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -99,7 +99,7 @@ int arch_update_cpu_topology(void);
 			| 1*SD_WAKE_AFFINE		\
 			| 1*SD_SHARE_CPUPOWER		\
 			| 0*SD_POWERSAVINGS_BALANCE	\
-			| 0*SD_SHARE_PKG_RESOURCES	\
+			| 1*SD_SHARE_PKG_RESOURCES	\
 			| 0*SD_SERIALIZE		\
 			| 0*SD_PREFER_SIBLING		\
 			,				\
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 42ac3c9..8fe7ee8 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1508,7 +1508,7 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 		 * If there's an idle sibling in this domain, make that
 		 * the wake_affine target instead of the current cpu.
 		 */
-		if (tmp->flags & SD_PREFER_SIBLING)
+		if (tmp->flags & SD_SHARE_PKG_RESOURCES)
 			target = select_idle_sibling(p, tmp, target);

 		if (target >= 0) {