2000-01-24 19:34:10

by dg50

Subject: SMP Theory (was: Re: Interesting analysis of linux kernel threading by IBM)

I've been reading the SMP thread and this is a truly educational and
fascinating discussion. How SMP works, and how much of a benefit it
provides has always been a bit of a mystery to me - and I think the light
is slowly coming on.

But I have a couple of (perhaps dumb) questions.

OK, if you have an n-way SMP box, then you have n processors with n (local)
caches sharing a single block of main system memory. If you then run a
threaded program (like a renderer) with a thread per processor, you wind up
with n threads all looking at a single block of shared memory - right?

OK, if a thread accesses (I assume writes, reading isn't destructive, is
it?) a memory location that another processor is "interested" in, then
you've invalidated that processor's local cache - so it has to be flushed
and refreshed. Have enough cross-talk between threads, and you can achieve
the worst-case scenario where every memory access flushes the cache of
every processor, totally defeating the purpose of the cache, and perhaps
even adding nontrivial cache-flushing overhead.
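
(To make my mental model concrete, I'd imagine a toy program like the one
below - two threads bumping counters that happen to share a cache line,
versus the same counters padded onto separate lines - would show exactly
this effect, if my understanding is right. The 64-byte line size is just
a guess.)

/* false-share.c - toy demo; build with: gcc -O2 -pthread false-share.c
 * Each thread only ever touches its own counter, but in the "shared"
 * layout the two counters sit on the same cache line, so the line
 * ping-pongs between the CPUs. The "padded" layout avoids that. */
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000UL
#define LINE  64	/* assumed cache line size */

struct { volatile unsigned long c; } shared[2];
struct { volatile unsigned long c; char pad[LINE]; } padded[2];

static void *bump_shared(void *arg)
{
	unsigned long i, n = (unsigned long)arg;
	for (i = 0; i < ITERS; i++)
		shared[n].c++;
	return NULL;
}

static void *bump_padded(void *arg)
{
	unsigned long i, n = (unsigned long)arg;
	for (i = 0; i < ITERS; i++)
		padded[n].c++;
	return NULL;
}

int main(void)
{
	pthread_t t[2];
	int i;

	/* time each phase separately (e.g. with time(1)); the padded
	 * phase should run noticeably faster on an SMP box */
	for (i = 0; i < 2; i++)
		pthread_create(&t[i], NULL, bump_shared, (void *)(unsigned long)i);
	for (i = 0; i < 2; i++)
		pthread_join(t[i], NULL);

	for (i = 0; i < 2; i++)
		pthread_create(&t[i], NULL, bump_padded, (void *)(unsigned long)i);
	for (i = 0; i < 2; i++)
		pthread_join(t[i], NULL);

	printf("%lu %lu\n", shared[0].c, padded[0].c);
	return 0;
}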

If this is indeed the case (please correct any misconceptions I have) then
it strikes me that perhaps the hardware design of SMP is broken. That
instead of sharing main memory, each processor should have its own main
memory. You connect the various main memory chunks to the "primary" CPU via
some sort of very wide, very fast memory bus, and then when you spawn a
thread, you instead do something more like a fork - copy the relevant
process and data to the child cpu's private main memory (perhaps via some
sort of blitter) over this bus, and then let that CPU go play in its own
sandbox for a while.

Which really is more like the "array of uni-processor boxen joined by a
network" model than it is current SMP - just with a REALLY fast&wide
network pipe that just happens to be in the same physical box.

Comments? Please feel free to reply private-only if this is just too
entry-level for general discussion.

DG



2000-01-24 20:23:48

by Jamie Lokier

Subject: Re: SMP Theory (was: Re: Interesting analysis of linux kernel threading by IBM)

[email protected] wrote:
> If this is indeed the case (please correct any misconceptions I have) then
> it strikes me that perhaps the hardware design of SMP is broken. That
> instead of sharing main memory, each processor should have its own main
> memory. You connect the various main memory chunks to the "primary" CPU via
> some sort of very wide, very fast memory bus, and then when you spawn a
> thread, you instead do something more like a fork - copy the relevant
> process and data to the child cpu's private main memory (perhaps via some
> sort of blitter) over this bus, and then let that CPU go play in its own
> sandbox for a while.

I think you just reinvented NUMA -- Non-Uniform Memory Access. Every
CPU can access the others' memory, but you really want them to
concentrate on their own. SGI does some boxes like that.

Linux even has a memory allocator which is moving in the direction of
supporting those things.

> Which really is more like the "array of uni-processor boxen joined by a
> network" model than it is current SMP - just with a REALLY fast&wide
> network pipe that just happens to be in the same physical box.

It's been proposed to have multiple instances of the OS running too,
instead of one OS running on all CPUs.

-- Jamie

2000-01-24 21:32:37

by Larry McVoy

Subject: Re: SMP Theory (was: Re: Interesting analysis of linux kernel threading by IBM)

: OK, if you have an n-way SMP box, then you have n processors with n (local)
: caches sharing a single block of main system memory. If you then run a
: threaded program (like a renderer) with a thread per processor, you wind up
: with n threads all looking at a single block of shared memory - right?

Yes. A really good reference for all of this, at an accessible and
general level, is ``Computer Architecture, A Quantitative Approach''
by Patterson and Hennessy, ISBN 1-55860-329-8. I have the second edition,
which was updated in '96. It's an excellent book, and I think that if all
the members of this list read it and understood it, that would be of
enormous benefit. It's a hardware book, but I'm of the school that says
OS people are part hardware people and should have more than a passing
understanding of hardware.

The chapter you want is Chapter 8: multiprocessors. But read the whole
thing, it's one of the most approachable yet useful texts in the field.
Contrast it to Knuth's works, which are without a doubt the definitive
texts in his area, but are also not at all for the faint of heart.
I personally can rarely make head or tail of what Knuth is saying
without some outside help, but I can understand all of P&H on my own -
you can too.

: OK, if a thread accesses (I assume writes, reading isn't destructive, is
: it?) a memory location that another processor is "interested" in, then
: you've invalidated that processor's local cache - so it has to be flushed
: and refreshed.

Reading is better than writing, lots of people can read. Where they
read from depends on whether the L2 caches are write through or write back
(L1 caches I believe are 100% write through, I can't really see how a
write back one would ever work even on a uniprocessor unless all I/O's
caused a cache flush).

If you did your caches correctly, and you are using a snooping protocol,
then typically what happens is CPU 1 puts a request for location X on
the bus; CPU 5 has that data in a dirty cache line; CPU 5 invalidates
the memory op and responds as if it were memory, saying here's X.
(It's not the CPUs at all, by the way, it's the cache controllers.
They are quite async when compared to CPUs, though there are a lot of
interdependencies).
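
In cartoon form, the snoop logic for a single cache line looks something
like the following - a MESI-style simplification for illustration, not any
particular machine's controller, and with all the asynchronous hairiness
left out:

/* What a snooping cache controller does when it sees another CPU's
 * read for an address it holds. Toy model only. */
enum state { INVALID, SHARED, EXCLUSIVE, MODIFIED };

struct line {
	enum state st;
	unsigned long data;
};

/* Returns nonzero if we intervene and supply the data ourselves,
 * zero if main memory should answer. */
int snoop_bus_read(struct line *l, unsigned long *out)
{
	switch (l->st) {
	case MODIFIED:
		*out = l->data;	/* respond as if we were memory */
		l->st = SHARED;	/* or write back and drop it, depending
				   on the protocol flavor */
		return 1;
	case EXCLUSIVE:
		l->st = SHARED;	/* memory is current, let it answer */
		return 0;
	case SHARED:
	case INVALID:
	default:
		return 0;	/* nothing for us to do */
	}
}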

I believe it was the SGI Challenge which had a busted cache controller
and had to do a write back to main memory and then a read, causing
cache to cache transactions to actually be cache to memory, memory
to cache (which really hurts).

: Have enough cross-talk between threads, and you can achieve
: the worst-case scenario where every memory access flushes the cache of
: every processor, totally defeating the purpose of the cache, and perhaps
: even adding nontrivial cache-flushing overhead.

Yes, indeed. At SGI, where they had a very finely threaded kernel, cache
misses in the OS became a major bottleneck and they did a great deal of
restructuring to make things work better. The most typical thing was to
put the lock and the data most likely to be needed on the same cache line.
I.e.,

struct widget {
	rwlock w_lock;
	int flags;
	struct widget *next;
	/* more data after this */
};

and then they would do

for (p = list;
     (read_lock(p) == OK) && (p->flags != whatever);
     t = p->next, read_unlock(p), p = t);

with the idea being that the common case was that one cache line miss would
get you the lock and the data that you most cared about.

This, by the way, is a REALLY STUPID IDEA. If you are locking at
this level, you really need to be looking for other ways to scale up
your system. You're working like crazy to support a model that the
hardware can't really support. As a part hardware and part software guy,
it breaks my heart to see the hardware guys bust their butts to give you
a system that works, only to watch the software guys abuse the sh*t out
of the system and make it look bad. On the other side of the coin, the
hardware guys do advertise this stuff as working and only later come out
and explain the limitations.

A great case in point is that the hardware guys, when they first started
doing SMP boxes, talked about how anything could run anywhere. Later,
they discovered that this was really false. They never said so outright;
instead they wrote a bunch of papers on ``cache affinity'', which is a nice
way of saying ``when you put a process on a CPU/cache, don't move it unless
the world will come to an end if you leave it there''. Mark Hahn's equations
that he posted about the scheduler have everything to do with this concept:
you have built up some state in the cache, and the last thing you want to
do is have to rebuild it somewhere else.

This is an area, sorry to pick on them again, that SGI just completely
screwed up in their scheduler. The simple and right answer is to put
processes on a CPU and leave them there until the system is dramatically
unbalanced. The complicated and slow thing to do is to rethink the
decision at every context switch. The SGI scheduler got it upside down:
you had to prove that you needed to stay on a cache rather than prove that
you needed to move. So I/O-bound jobs got screwed because they never
ran long enough to build up what was considered a cache footprint, so
they were always considered candidates for movement. I got substantial
improvements in BDS (an extension to NFS that did ~100 Mbyte/sec I/O)
by locking the BDS daemons down to individual processors, something the
scheduler could have trivially done correctly itself.
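
For illustration only, here is a minimal sketch of that kind of pinning
done from user space, assuming a kernel and libc that expose a
sched_setaffinity(2)-style call:

/* pin-self.c - pin the calling process to CPU 0, then do the work,
 * so the scheduler can never throw away the cache state it builds up. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(0, &mask);	/* stay on CPU 0, keep the cache warm */

	if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
		perror("sched_setaffinity");
		exit(1);
	}

	/* ... daemon work loop goes here ... */
	return 0;
}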

: If this is indeed the case (please correct any misconceptions I have) then
: it strikes me that perhaps the hardware design of SMP is broken. That
: instead of sharing main memory, each processor should have its own main
: memory. You connect the various main memory chunks to the "primary" CPU via
: some sort of very wide, very fast memory bus, and then when you spawn a
: thread, you instead do something more like a fork - copy the relevant
: process and data to the child cpu's private main memory (perhaps via some
: sort of blitter) over this bus, and then let that CPU go play in its own
: sandbox for a while.

What you have described is an SGI Origin. Each node in an origin looks like

            [ 2x R10K CPU ]
                   |
      memory ------+-------- I/O
                   |
                   |
        interconnect to the rest of the nodes

Each card had 2 cpus, some memory, an I/O bus, and an interconnect bus
which connected the card to the rest of the system in a hypercube fabric
(think of that as a more scalable thing than a bus; more scalable yes,
but with a slight penalty for remote memory access).

The memory coherency was directory based rather than snoopy (snoopy
works when all caches see all transactions, but that doesn't scale.
Directory would appear to scale farther but guess what - the OS locks
completely thrashed the sh*t out of the directories, once again proving
that the illusion of shared, coherent memory is a false one).

Anyway, another thing that didn't work was moving memory. This is called
"page migration" and the problem with it is that the work that you do
has to substantially outweigh the cost of moving the page in the first
place and it usually doesn't. So in practice, at SGI and at customers,
the migration idea didn't pay off. Maybe things have changed since I
left, my info is pretty dated. I kind of doubt it but there must be
some SGI folks reading this who are willing to correct me.

: Which really is more like the "array of uni-processor boxen joined by a
: network" model than it is current SMP - just with a REALLY fast&wide
: network pipe that just happens to be in the same physical box.

Yup. Take a look at the book and the short papers I pointed to the other
day. You'll see a lot in both along these lines.

2000-01-24 22:55:31

by Ralf Baechle

Subject: Re: SMP Theory (was: Re: Interesting analysis of linux kernel threading by IBM)

On Tue, Jan 25, 2000 at 12:56:45AM +0100, Jamie Lokier wrote:

> [email protected] wrote:
> > If this is indeed the case (please correct any misconceptions I have) then
> > it strikes me that perhaps the hardware design of SMP is broken. That
> > instead of sharing main memory, each processor should have its own main
> > memory. You connect the various main memory chunks to the "primary" CPU via
> > some sort of very wide, very fast memory bus, and then when you spawn a
> > thread, you instead do something more like a fork - copy the relevant
> > process and data to the child cpu's private main memory (perhaps via some
> > sort of blitter) over this bus, and then let that CPU go play in its own
> > sandbox for a while.
>
> I think you just reinvented NUMA -- Non-Uniform Memory Access. Every
> CPU can access the others' memory, but you really want them to
> concentrate on their own. SGI does some boxes like that.

SGI does ccNUMA, cache-coherent NUMA. The difference is that, unlike in
`real' NUMA machines, each processor on a node has direct access to the
memory in every node. A node in an Origin system is a dual-CPU SMP system.

> Linux even has a memory allocator which is moving in the direction of
> supporting those things.

It's actually the start of the support for the Origin series, but other
systems are expected to jump on the bandwagon.

> > Which really is more like the "array of uni-processor boxen joined by a
> > network" model than it is current SMP - just with a REALLY fast&wide
> > network pipe that just happens to be in the same physical box.
>
> It's been proposed to have multiple instances of the OS running too,
> instead of one OS running on all CPUs.

Ok, but users and application developers still want the entire system to
feel like a single system.

Ralf

2000-01-25 06:56:57

by Davide Libenzi

Subject: Re: SMP Theory (was: Re: Interesting analysis of linux kernel threading by IBM)

Tuesday, January 25, 2000 3:18 AM
David Schwartz <[email protected]> wrote :

> If you wanted completely separate memory spaces for each processor, the
> current hardware will let you have it. Just separate the address space
> into logical chunks and code each processor only to use its chunk. The
> current hardware design lets you do exactly what you are suggesting. And
> if one processor does need to access the memory earmarked for another,
> the current hardware provides a fast way to do it.

100% agree, and it is faster than an ethernet connection between N separate
UP machines. Probably the cost of an N-way SMP machine is higher than that
of N single UP machines (at least for PCs), but that isn't Linux's business,
is it?

The cache-miss cost that an SMP architecture must sustain is:

CMTc = Nm * F( Np * Ms * WTR )

where:

CMTc = cache-miss time cost
Ms = memory size shared between the Np processes
Np = number of processes sharing Ms
WTR = write touch rate (statistical average) at which the Np processes
      write-access Ms
F = a probably non-linear function depending on architecture, etc.
Nm = number of memory shares

This is an absolute value that _must_ be compared (weighted) with the time
spent by the individual processes in computing, to judge whether the
application design we've chosen for SMP is right, or even whether SMP is
the correct target for our app at all.


Take the rendering pipeline example.
At each step we read a bit of data (think of it as coming from stdin), do a
relatively long computation on that data, and write another kind of data
(think of it as going to stdout) to be processed by the next pipeline step.
The step pattern can be expressed as:

RCCCCCCCCCCCCW

where R = read, C = compute and W = write.
Say we've got a six-step pipeline, so:

Nm = 5 ( 6 - 1 )
Np = 2
Ms = typically small (triangles, scanlines, ...)
WTR = small compared with the computing times

Alternatively we can make Ms a (relatively big) object set.
That increases Ms but also lengthens the computing path, so the weighted
cost comes out the same.
This is, IMVHO, a good candidate for SMP.
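
To make the pipeline case concrete, here is a rough pthread sketch (the
names, sizes and the fake `compute' are invented for illustration): each
stage owns its private working data, and the only shared memory is a
one-slot hand-off buffer between adjacent stages, so Ms and WTR stay small.

/* pipe2.c - two-stage pipeline; the only shared state is the one-slot
 * hand-off buffer, so cross-CPU cache traffic is limited to that slot
 * plus its lock. Build with: gcc -O2 -pthread pipe2.c */
#include <pthread.h>
#include <stdio.h>

#define NITEMS 1000

static struct slot {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	int full;
	double item;		/* the small shared Ms */
} slot = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0.0 };

static void put(double v)	/* the short W of one stage */
{
	pthread_mutex_lock(&slot.lock);
	while (slot.full)
		pthread_cond_wait(&slot.cond, &slot.lock);
	slot.item = v;
	slot.full = 1;
	pthread_cond_signal(&slot.cond);
	pthread_mutex_unlock(&slot.lock);
}

static double get(void)		/* the short R of the next stage */
{
	double v;
	pthread_mutex_lock(&slot.lock);
	while (!slot.full)
		pthread_cond_wait(&slot.cond, &slot.lock);
	v = slot.item;
	slot.full = 0;
	pthread_cond_signal(&slot.cond);
	pthread_mutex_unlock(&slot.lock);
	return v;
}

static void *stage1(void *unused)
{
	int i, j;
	for (i = 0; i < NITEMS; i++) {
		double v = i;
		for (j = 0; j < 10000; j++)	/* long private compute (the C's) */
			v = v * 1.0000001 + 1.0;
		put(v);
	}
	return NULL;
}

static void *stage2(void *unused)
{
	int i, j;
	double sum = 0.0;
	for (i = 0; i < NITEMS; i++) {
		double v = get();
		for (j = 0; j < 10000; j++)	/* more private compute */
			v = v * 0.9999999 + 1.0;
		sum += v;
	}
	printf("sum = %f\n", sum);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, stage1, NULL);
	pthread_create(&t2, NULL, stage2, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}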


Consider now a typical data-centric application in which we have continuous
read-write cycles over the entire data set:

RCCWRCRWCCRCWC

If we can't split this data set into autonomous chunks, we have:

Nm = the number of threads we've split the app into
Np = typically equal to Nm
Ms = probably the entire data set
WTR = typically high, because of the nature of the application

This is not a good candidate for SMP.

Typical examples of such applications are the ones in which the later steps
of the computing path must access data computed (read: write-accessed) by
most of the previous steps.



Davide.

--
All this stuff is IMVHO