2000-01-25 17:27:21

by Iain McClatchie

Subject: Re: SMP Theory (was: Re: Interesting analysis of linux kernel threading by IBM)

Jeff> How do you do fault tolerance on Shared-Everything (like what you
Jeff> describe)?

Disclaimer: I worked on CPUs, not the system controller, so I didn't
actually look at the Verilog.

At an electrical level, the links were designed to be hot-pluggable.
Each O2000 cabinet had I think 8 CPU cards in it (16 CPUs), and had
its own power supply and plug. Within a cabinet, the I/O cables were
hot pluggable, as were the power supplies, fans, etc. CPU cards were
not hot pluggable.

The hub chip on each CPU card connected to the shared memory network,
as well as the memory and the two CPUs. I think the hub had hardware
registers which granted memory access and I/O access to other hubs. A
CPU could deny another CPU access to its local resources. IRIX
could use these capabilities to partition the machine completely, but
I don't think it ever offered anything finer-grained than a
shared-everything vs. shared-nothing toggle, at least not while I was
there.
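
To make that concrete, here's a toy of what I imagine the access-grant
side looks like: a per-hub bitmask saying which remote hubs may touch
local memory and local I/O. All the names and the layout are invented
for illustration; the real Hub registers are certainly different.

    /* Toy model of per-hub access-grant registers.  Names and layout
     * are made up; this is not the actual Hub ASIC programming model. */
    #include <stdint.h>

    struct hub_access_regs {
        uint64_t mem_grant;  /* bit N set => hub N may reach local memory */
        uint64_t io_grant;   /* bit N set => hub N may reach local I/O    */
    };

    /* Keep access within my partition only. */
    static void apply_partition(struct hub_access_regs *r, uint64_t partition)
    {
        r->mem_grant = partition;
        r->io_grant  = partition;
    }

    /* The shared-everything vs. shared-nothing toggle described above. */
    static void set_sharing(struct hub_access_regs *r, int self,
                            int share_everything)
    {
        apply_partition(r, share_everything ? ~0ULL : (1ULL << self));
    }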

[I think I'm using the word image wrongly, but you know what I mean.]

The idea was to write an operating system that could transfer
processes between O/S images, reboot O/S images independently, and
tolerate different O/S revisions on different partitions of the
machine. That would allow online O/S upgrades, and perhaps even
replacement of whole cabinets of hardware.

I know the I/O system was set up to connect two different hub chips to
each I/O crossbar, to maintain access to the I/O resources should a
CPU card go down. You could (of course) have multiple network
adapters going to the same network, and spread the adapters across
different cabinets. I think you could also arrange to have Fibre
Channel disk arrays driven from two different cabinets. That could
have made it possible to hot-plug a cabinet while the machine was
online, without taking down any more processes or disks than the
failing hardware had already taken down.
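
From the software side, that dual-homing just looks like multipath
failover. A cartoon, with invented names -- nothing below is the
actual IRIX I/O layer:

    /* Cartoon of dual-homed I/O: each device is reachable through two
     * hubs; if the primary path's cabinet is down, use the other one.
     * All names invented; not the real IRIX I/O code. */
    #include <stddef.h>
    #include <stdbool.h>

    struct io_path {
        int  hub_id;
        bool alive;          /* cleared when that cabinet/card drops out */
    };

    struct io_device {
        struct io_path primary;
        struct io_path secondary;
    };

    /* Pick a working path, or NULL if both hubs are gone. */
    static const struct io_path *pick_path(const struct io_device *dev)
    {
        if (dev->primary.alive)
            return &dev->primary;
        if (dev->secondary.alive)
            return &dev->secondary;
        return NULL;
    }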

The I/O stuff sounds heavyweight, but I would imagine you'd have to do
the very same thing on a "shared-nothing" cluster.

Jeff> Sounds like COMA or SCI?

COMA: Cache Only Memory Architecture. Whenever you look at such a
thing, ask when cache lines are invalidated. That's the hard problem,
and I haven't yet heard a reasonable answer.
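
For contrast, here's the bookkeeping a plain directory protocol does on
a write: find every cached copy and invalidate it, with the home memory
as the backstop copy. (Invented names, cartoon only.) My reading of the
COMA problem is that every memory is itself a cache, so that backstop
disappears and you have to be sure you never invalidate or evict the
last copy of a line.

    /* Cartoon of directory-style write-invalidation.  Invented names;
     * not any real protocol's code. */
    #include <stdio.h>
    #include <stdint.h>

    #define NODES 64

    struct dir_entry {
        uint64_t sharers;   /* bit N set => node N holds a copy   */
        int      owner;     /* node holding the dirty copy, or -1 */
    };

    static void send_invalidate(int node, uint64_t line)   /* stub */
    {
        printf("invalidate line %#llx at node %d\n",
               (unsigned long long)line, node);
    }

    static void write_miss(struct dir_entry *e, uint64_t line, int writer)
    {
        for (int n = 0; n < NODES; n++)
            if (((e->sharers >> n) & 1) && n != writer)
                send_invalidate(n, line);
        e->sharers = 1ULL << writer;   /* writer is now the only holder */
        e->owner   = writer;
    }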

SCI: Rings are bad for latency. Latency is bad for CPUs. SCI is all
rings - in the hardware, and in the coherency protocols. Kendall
Square Research did something like this; it was a disaster.
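
Back-of-the-envelope on why rings hurt: the average hop count to a
random remote node grows linearly with node count on a ring, but only
logarithmically on a hypercube-ish network like the Origin's. Nothing
SCI-specific in the numbers below.

    /* Average hops to a random remote node, rough numbers:
     *   unidirectional ring: ~N/2
     *   bidirectional ring:  ~N/4
     *   hypercube (N = 2^k): ~k/2
     * Compile with -lm. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        for (int n = 8; n <= 128; n *= 2)
            printf("%4d nodes: uni-ring %5.1f  bi-ring %5.1f  "
                   "hypercube %4.1f hops\n",
                   n, n / 2.0, n / 4.0, log2(n) / 2.0);
        return 0;
    }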

SGI did neither of these things. ccNUMA was most similar to the DASH
project from Stanford. Not surprising -- SGI has very close ties to
Stanford. Note that the Flash work is done on a mutated Origin.

Jeff> Most folks dismiss shared-everything architectures as
Jeff> non-resilient (Intel and Microsoft have traditionally been
Jeff> shared-nothing bigots on this point).

Hmm. Do people actually use NT for big clusters? I thought clusters
were all done with VMS, Unix, and O/S 360 (or whatever it's called
these days). I'm not sure how Microsoft's opinion on the matter would
affect anyone.

After working at SGI, I'm convinced that most of the MP problem is a
software problem. Granted, if the CPU is integrated with the memory
controller, you'll need some good hooks to make MP work, but that's 3
to 10 man-years of work. I'm not familiar with what Intel does in the
O/S and libraries/application area (iWarp?), but I would imagine any
bigotry on their part comes from the marketing guys selling what they
have.

And yes, I've heard about Intel's ASCI offering. What was it, 9000
Pentium Pros in a single room? Has Intel Oregon sold much to anyone
else?

-Iain McClatchie
http://www.10xinc.com
[email protected]
650-364-0520 voice
650-364-0530 FAX