2010-08-12 17:24:35

by Heiko Carstens

Subject: [PATCH/RFC 0/5] sched: add new 'book' scheduling domain

This patch set adds (yet) another scheduling domain to the scheduler. The
reason for this is that the recent (s390) z196 architecture has four cache
levels and uniform memory access (sort of -- see below).
The cpu/cache/memory hierarchy is as follows:

Each cpu has its private L1 (64KB I-cache + 128KB D-cache) and L2 (1.5MB)
cache.
A core consists of four cpus with a 24MB shared L3 cache.
A book consists of six cores with a 192MB shared L4 cache.

The z196 architecture has no SMT.
Also the statement that we have uniform memory access is not entirely
correct. Actually the machine uses memory striping, so it "looks" like
we have UMA until the next slice of memory gets accessed.
However there is no interface which tells us which piece of memory is local
or remote. So we (have to) simplify and assume that the cost of each memory
access with L4 cache miss is the same.

In order to use the information about the cache hierarchy, so that the
scheduler can make decisions that improve cache hits, I added the 'BOOK'
scheduling domain between the MC and CPU domains.
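
To illustrate what "between" means here, the following is a minimal
userspace sketch, not code from the patches. It assumes the three levels
nest (each MC span lies inside its BOOK span, which lies inside the CPU
span) and checks that for spans held in a 32-bit mask. The names sd_level,
span_mask and spans_are_nested are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative level names, ordered from innermost to outermost. */
enum sd_level { LEVEL_MC, LEVEL_BOOK, LEVEL_CPU, LEVEL_COUNT };

/* Bitmask with bits first..last set (inclusive), for CPUs 0..31 only. */
uint32_t span_mask(unsigned first, unsigned last)
{
    uint32_t upto_last, below_first;

    assert(first <= last && last <= 31);
    upto_last = (last == 31) ? UINT32_MAX
                             : ((UINT32_C(1) << (last + 1)) - 1);
    below_first = (UINT32_C(1) << first) - 1;
    return upto_last & ~below_first;
}

/* True if every level's span is contained in the next outer level's span. */
bool spans_are_nested(const uint32_t span[LEVEL_COUNT])
{
    for (int l = 0; l + 1 < LEVEL_COUNT; l++)
        if ((span[l] & ~span[l + 1]) != 0)
            return false;
    return true;
}
```

With the spans reported for CPU6 in the boot output further down (MC 6-9,
BOOK 6-19, CPU 0-19), spans_are_nested() returns true.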

First performance measurements, however, show no effect - neither good nor
bad. So it might be that the workloads aren't good enough, or that the
implementation is simply wrong.

Either way, since it's currently very hard to get machine time for additional
measurements I thought it might be a good idea to post the patches as an RFC
even if we do not have any convincing arguments.

Also please note that the scheduling domain initializers certainly need some
tuning:
The line
#define SD_BOOK_INIT SD_CPU_INIT
within the arch support patch is just there so that it compiles until we
have something that really works.

As for the patches, I think that the first two patches could be merged
anytime since those are only cleanup/preparation patches.
Patch three adds the new scheduling domain and patch four the code needed
to represent books via the cpu topology sysfs interface.
Patch five is just the architecture backend.

A boot of a logical partition with 20 cpus, spread over two books, gives this
initialization output on the console:

Brought up 20 CPUs
CPU0 attaching sched-domain:
domain 0: span 0-5 level BOOK
groups: 0 1-3 (cpu_power = 3072) 4-5 (cpu_power = 2048)
domain 1: span 0-19 level CPU
groups: 0-5 (cpu_power = 6144) 6-19 (cpu_power = 14336)
CPU1 attaching sched-domain:
domain 0: span 1-3 level MC
groups: 1 2 3
domain 1: span 0-5 level BOOK
groups: 1-3 (cpu_power = 3072) 4-5 (cpu_power = 2048) 0
domain 2: span 0-19 level CPU
groups: 0-5 (cpu_power = 6144) 6-19 (cpu_power = 14336)
CPU2 attaching sched-domain:
domain 0: span 1-3 level MC
groups: 2 3 1
domain 1: span 0-5 level BOOK
groups: 1-3 (cpu_power = 3072) 4-5 (cpu_power = 2048) 0
domain 2: span 0-19 level CPU
groups: 0-5 (cpu_power = 6144) 6-19 (cpu_power = 14336)
CPU3 attaching sched-domain:
domain 0: span 1-3 level MC
groups: 3 1 2
domain 1: span 0-5 level BOOK
groups: 1-3 (cpu_power = 3072) 4-5 (cpu_power = 2048) 0
domain 2: span 0-19 level CPU
groups: 0-5 (cpu_power = 6144) 6-19 (cpu_power = 14336)
CPU4 attaching sched-domain:
domain 0: span 4-5 level MC
groups: 4 5
domain 1: span 0-5 level BOOK
groups: 4-5 (cpu_power = 2048) 0 1-3 (cpu_power = 3072)
domain 2: span 0-19 level CPU
groups: 0-5 (cpu_power = 6144) 6-19 (cpu_power = 14336)
CPU5 attaching sched-domain:
domain 0: span 4-5 level MC
groups: 5 4
domain 1: span 0-5 level BOOK
groups: 4-5 (cpu_power = 2048) 0 1-3 (cpu_power = 3072)
domain 2: span 0-19 level CPU
groups: 0-5 (cpu_power = 6144) 6-19 (cpu_power = 14336)
CPU6 attaching sched-domain:
domain 0: span 6-9 level MC
groups: 6 7 8 9
domain 1: span 6-19 level BOOK
groups: 6-9 (cpu_power = 4096) 10-11 (cpu_power = 2048) 12-13 (cpu_power = 2048) 14-16 (cpu_power = 3072) 17-19 (cpu_power = 3072)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU7 attaching sched-domain:
domain 0: span 6-9 level MC
groups: 7 8 9 6
domain 1: span 6-19 level BOOK
groups: 6-9 (cpu_power = 4096) 10-11 (cpu_power = 2048) 12-13 (cpu_power = 2048) 14-16 (cpu_power = 3072) 17-19 (cpu_power = 3072)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU8 attaching sched-domain:
domain 0: span 6-9 level MC
groups: 8 9 6 7
domain 1: span 6-19 level BOOK
groups: 6-9 (cpu_power = 4096) 10-11 (cpu_power = 2048) 12-13 (cpu_power = 2048) 14-16 (cpu_power = 3072) 17-19 (cpu_power = 3072)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU9 attaching sched-domain:
domain 0: span 6-9 level MC
groups: 9 6 7 8
domain 1: span 6-19 level BOOK
groups: 6-9 (cpu_power = 4096) 10-11 (cpu_power = 2048) 12-13 (cpu_power = 2048) 14-16 (cpu_power = 3072) 17-19 (cpu_power = 3072)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU10 attaching sched-domain:
domain 0: span 10-11 level MC
groups: 10 11
domain 1: span 6-19 level BOOK
groups: 10-11 (cpu_power = 2048) 12-13 (cpu_power = 2048) 14-16 (cpu_power = 3072) 17-19 (cpu_power = 3072) 6-9 (cpu_power = 4096)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU11 attaching sched-domain:
domain 0: span 10-11 level MC
groups: 11 10
domain 1: span 6-19 level BOOK
groups: 10-11 (cpu_power = 2048) 12-13 (cpu_power = 2048) 14-16 (cpu_power = 3072) 17-19 (cpu_power = 3072) 6-9 (cpu_power = 4096)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU12 attaching sched-domain:
domain 0: span 12-13 level MC
groups: 12 13
domain 1: span 6-19 level BOOK
groups: 12-13 (cpu_power = 2048) 14-16 (cpu_power = 3072) 17-19 (cpu_power = 3072) 6-9 (cpu_power = 4096) 10-11 (cpu_power = 2048)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU13 attaching sched-domain:
domain 0: span 12-13 level MC
groups: 13 12
domain 1: span 6-19 level BOOK
groups: 12-13 (cpu_power = 2048) 14-16 (cpu_power = 3072) 17-19 (cpu_power = 3072) 6-9 (cpu_power = 4096) 10-11 (cpu_power = 2048)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU14 attaching sched-domain:
domain 0: span 14-16 level MC
groups: 14 15 16
domain 1: span 6-19 level BOOK
groups: 14-16 (cpu_power = 3072) 17-19 (cpu_power = 3072) 6-9 (cpu_power = 4096) 10-11 (cpu_power = 2048) 12-13 (cpu_power = 2048)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU15 attaching sched-domain:
domain 0: span 14-16 level MC
groups: 15 16 14
domain 1: span 6-19 level BOOK
groups: 14-16 (cpu_power = 3072) 17-19 (cpu_power = 3072) 6-9 (cpu_power = 4096) 10-11 (cpu_power = 2048) 12-13 (cpu_power = 2048)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU16 attaching sched-domain:
domain 0: span 14-16 level MC
groups: 16 14 15
domain 1: span 6-19 level BOOK
groups: 14-16 (cpu_power = 3072) 17-19 (cpu_power = 3072) 6-9 (cpu_power = 4096) 10-11 (cpu_power = 2048) 12-13 (cpu_power = 2048)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU17 attaching sched-domain:
domain 0: span 17-19 level MC
groups: 17 18 19
domain 1: span 6-19 level BOOK
groups: 17-19 (cpu_power = 3072) 6-9 (cpu_power = 4096) 10-11 (cpu_power = 2048) 12-13 (cpu_power = 2048) 14-16 (cpu_power = 3072)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU18 attaching sched-domain:
domain 0: span 17-19 level MC
groups: 18 19 17
domain 1: span 6-19 level BOOK
groups: 17-19 (cpu_power = 3072) 6-9 (cpu_power = 4096) 10-11 (cpu_power = 2048) 12-13 (cpu_power = 2048) 14-16 (cpu_power = 3072)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU19 attaching sched-domain:
domain 0: span 17-19 level MC
groups: 19 17 18
domain 1: span 6-19 level BOOK
groups: 17-19 (cpu_power = 3072) 6-9 (cpu_power = 4096) 10-11 (cpu_power = 2048) 12-13 (cpu_power = 2048) 14-16 (cpu_power = 3072)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)


Subject: Re: [PATCH/RFC 0/5] sched: add new 'book' scheduling domain

On Thu, Aug 12, 2010 at 01:25:44PM -0400, Heiko Carstens wrote:
> This patch set adds (yet) another scheduling domain to the scheduler.

All that stuff reminds me of quite similar patches to introduce a
multi-node scheduling domain for Magny-Cours CPUs.

I am afraid that this stuff won't make it upstream and we both have to
review Peter's suggestions from last year to come up with a more
generalized/flexible way to handle different scheduling domains.


> The reason for this is that the recent (s390) z196 architecture has
> four cache levels and uniform memory access (sort of -- see below).
> The cpu/cache/memory hierarchy is as follows:

> Each cpu has its private L1 (64KB I-cache + 128KB D-cache) and L2 (1.5MB)
> cache.
> A core consists of four cpus with a 24MB shared L3 cache.
> A book consists of six cores with a 192MB shared L4 cache.

> The z196 architecture has no SMT.

[...]

> A boot of a logical partition with 20 cpus, spread over two books, gives this
> initialization output on the console:

The output below shows a somewhat odd distribution of your CPUs across
the different domain levels. Is this caused by the fact that not all
CPUs of a core and book were assigned to your logical partition?

For better understanding: is the following CPU-to-core/book mapping correct
for your example?

Book | Core | CPU
-----+------+----------
0    | 0    | 0,1,2,3
0    | 1    | 4,5
1    | 0    | 6,7,8,9
1    | 1    | 10,11
1    | 2    | 12,13
1    | 3    | 14,15,16
1    | 4    | 17,18,19

> Brought up 20 CPUs
> CPU0 attaching sched-domain:
> domain 0: span 0-5 level BOOK
> groups: 0 1-3 (cpu_power = 3072) 4-5 (cpu_power = 2048)

Why isn't there a range 0-3 instead of "0 1-3"?
And why isn't cpu_power=4096?
Ah, I think that for CPU 0 just the power information is
missing. So we have 3 groups:

0 (cpu_power=1024) 1-3 (cpu_power=3072) 4-5 (cpu_power=2048)

And the MC level is folded because it doesn't add anything in this
case.
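
The folding can be pictured with a small userspace sketch. It rests on a
simplifying assumption: a level is dropped when its span contains only one
CPU, since there is nothing to balance inside it. The scheduler's real
check may take more into account (flags, groups); struct level,
span_weight and fold_levels are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct level {
    const char *name;  /* e.g. "MC", "BOOK", "CPU" */
    uint32_t span;     /* bit n set means CPU n is part of the span */
};

/* Number of CPUs in a span (portable population count). */
unsigned span_weight(uint32_t span)
{
    unsigned n = 0;

    for (; span != 0; span &= span - 1)
        n++;
    return n;
}

/*
 * Copy the levels worth keeping from in[] to out[] and return how many
 * were kept. out[] must have room for n entries.
 */
size_t fold_levels(const struct level *in, size_t n, struct level *out)
{
    size_t kept = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        if (span_weight(in[i].span) > 1)
            out[kept++] = in[i];
    return kept;
}
```

For CPU0 the MC span would be just CPU 0 (mask 0x1), so only BOOK (0-5) and
CPU (0-19) survive, which matches the two domains printed for CPU0. For
CPU1 the MC span is 1-3 and all three levels are kept.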

So the mapping is in fact

Book | Core | CPU
-----+------+----------
0    | 0    | 0
0    | 1    | 1,2,3
0    | 2    | 4,5
1    | 0    | 6,7,8,9
1    | 1    | 10,11
1    | 2    | 12,13
1    | 3    | 14,15,16
1    | 4    | 17,18,19


> domain 1: span 0-19 level CPU
> groups: 0-5 (cpu_power = 6144) 6-19 (cpu_power = 14336)
> CPU1 attaching sched-domain:
> domain 0: span 1-3 level MC
> groups: 1 2 3
> domain 1: span 0-5 level BOOK
> groups: 1-3 (cpu_power = 3072) 4-5 (cpu_power = 2048) 0
> domain 2: span 0-19 level CPU
> groups: 0-5 (cpu_power = 6144) 6-19 (cpu_power = 14336)

It's odd that for CPU 1 the BOOK domain groups differ from those shown
for CPU0.

> CPU2 attaching sched-domain:
> domain 0: span 1-3 level MC
> groups: 2 3 1
> domain 1: span 0-5 level BOOK
> groups: 1-3 (cpu_power = 3072) 4-5 (cpu_power = 2048) 0

Again for CPU 0 the cpu_power is missing. I think that is confusing.
For better readability that should also be displayed (even if a group
consists of only 1 CPU).
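
A rough userspace sketch of that suggestion, not the kernel's debug code:
the values in the quoted output are consistent with 1024 units per CPU
(3072 for 1-3, 2048 for 4-5, 6144 for 0-5, 14336 for 6-19), so the sketch
assumes exactly that and prints the power for every group, including
single-CPU ones. POWER_PER_CPU, group_power and format_group are names
invented for this example.

```c
#include <assert.h>
#include <stdio.h>

#define POWER_PER_CPU 1024u  /* assumption: 3 CPUs give 3072 in the log */

/* cpu_power of a group spanning CPUs first..last, all of equal capacity. */
unsigned group_power(unsigned first, unsigned last)
{
    assert(first <= last);
    return (last - first + 1) * POWER_PER_CPU;
}

/*
 * Format one group as "first-last (cpu_power = N)", or as
 * "first (cpu_power = N)" for a single-CPU group, so that the power is
 * never omitted. Returns the value of snprintf().
 */
int format_group(char *buf, size_t len, unsigned first, unsigned last)
{
    unsigned power = group_power(first, last);

    if (first == last)
        return snprintf(buf, len, "%u (cpu_power = %u)", first, power);
    return snprintf(buf, len, "%u-%u (cpu_power = %u)", first, last, power);
}
```

Called for the three BOOK groups of CPU0 this yields "0 (cpu_power = 1024)",
"1-3 (cpu_power = 3072)" and "4-5 (cpu_power = 2048)".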

> domain 2: span 0-19 level CPU
> groups: 0-5 (cpu_power = 6144) 6-19 (cpu_power = 14336)

[snip the rest]



Andreas

--
Operating | Advanced Micro Devices GmbH
System | Einsteinring 24, 85609 Dornach b. München, Germany
Research | Geschäftsführer: Alberto Bozzo, Andrew Bowd
Center | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
(OSRC) | Registergericht München, HRB Nr. 43632