2003-09-03 04:03:33

by Larry McVoy

Subject: Scaling noise

I've frequently tried to make the point that all the scaling for lots of
processors is nonsense. Mr Dell says it better:

"Eight-way (servers) are less than 1 percent of the market and shrinking
pretty dramatically," Dell said. "If our competitors want to claim
they're No. 1 in eight-ways, that's fine. We want to lead the market
with two-way and four-way (processor machines)."

Tell me again that it is a good idea to screw up uniprocessor performance
for 64 way machines. Great idea, that. Go Dinosaurs!
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm


2003-09-03 04:20:07

by Anton Blanchard

Subject: Re: Scaling noise


> I've frequently tried to make the point that all the scaling for lots of
> processors is nonsense. Mr Dell says it better:
>
> "Eight-way (servers) are less than 1 percent of the market and shrinking
> pretty dramatically," Dell said. "If our competitors want to claim
> they're No. 1 in eight-ways, that's fine. We want to lead the market
> with two-way and four-way (processor machines)."
>
> Tell me again that it is a good idea to screw up uniprocessor performance
> for 64 way machines. Great idea, that. Go Dinosaurs!

And does your 4 way have hyperthreading?

Anton

2003-09-03 04:12:41

by Roland Dreier

Subject: Re: Scaling noise

+--------------+
| Don't feed |
| the trolls |
| |
| thank you |
+--------------+
| |
| |
| |
| |
....\ /....

2003-09-03 04:21:00

by Larry McVoy

Subject: Re: Scaling noise

And here I thought that real data was interesting. My mistake.

On Tue, Sep 02, 2003 at 09:12:36PM -0700, Roland Dreier wrote:
> +--------------+
> | Don't feed |
> | the trolls |
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-03 04:30:03

by Larry McVoy

Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 02:18:51PM +1000, Anton Blanchard wrote:
> > I've frequently tried to make the point that all the scaling for lots of
> > processors is nonsense. Mr Dell says it better:
> >
> > "Eight-way (servers) are less than 1 percent of the market and shrinking
> > pretty dramatically," Dell said. "If our competitors want to claim
> > they're No. 1 in eight-ways, that's fine. We want to lead the market
> > with two-way and four-way (processor machines)."
> >
> > Tell me again that it is a good idea to screw up uniprocessor performance
> > for 64 way machines. Great idea, that. Go Dinosaurs!
>
> And does your 4 way have hyperthreading?

What part of "shrinking pretty dramatically" did you not understand? Maybe
you know more than Mike Dell. Could you share that insight?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-03 04:34:30

by CaT

Subject: Re: Scaling noise

On Tue, Sep 02, 2003 at 09:29:53PM -0700, Larry McVoy wrote:
> On Wed, Sep 03, 2003 at 02:18:51PM +1000, Anton Blanchard wrote:
> > > I've frequently tried to make the point that all the scaling for lots of
> > > processors is nonsense. Mr Dell says it better:
> > >
> > > "Eight-way (servers) are less than 1 percent of the market and shrinking
> > > pretty dramatically," Dell said. "If our competitors want to claim
> > > they're No. 1 in eight-ways, that's fine. We want to lead the market
> > > with two-way and four-way (processor machines)."
> > >
> > > Tell me again that it is a good idea to screw up uniprocessor performance
> > > for 64 way machines. Great idea, that. Go Dinosaurs!
> >
> > And does your 4 way have hyperthreading?
>
> What part of "shrinking pretty dramatically" did you not understand? Maybe
> you know more than Mike Dell. Could you share that insight?

I think Anton is referring to the fact that on a 4-way cpu machine with
HT enabled you basically have an 8-way smp box (with special conditions)
and so if 4-way machines are becoming more popular, making sure that 8-way
smp works well is a good idea.

At least that's how I took it.
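
A minimal sketch of why that matters to the kernel, assuming a 4-way box with
HT enabled (the count in the comment is an assumption, not a measurement):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* With hyperthreading enabled a 4-way box brings up eight logical
     * CPUs, so the kernel is running its SMP code paths on all of them,
     * sibling quirks and all. */
    printf("logical CPUs online: %ld\n", sysconf(_SC_NPROCESSORS_ONLN));
    return 0;
}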

--
"How can I not love the Americans? They helped me with a flat tire the
other day," he said.
- http://tinyurl.com/h6fo

2003-09-03 05:09:09

by Larry McVoy

Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 02:33:56PM +1000, CaT wrote:
> I think Anton is referring to the fact that on a 4-way cpu machine with
> HT enabled you basically have an 8-way smp box (with special conditions)
> and so if 4-way machines are becoming more popular, making sure that 8-way
> smp works well is a good idea.

Maybe this is a better way to get my point across. Think about more CPUs
on the same memory subsystem. I've been trying to make this scaling point
ever since I discovered how much cache misses hurt. That was about 1995
or so. At that point, memory latency was about 200 ns and processor speeds
were at about 200Mhz or 5 ns. Today, memory latency is about 130 ns and
processor speeds are about .3 ns. Processor speeds are 15 times faster and
memory is less than 2 times faster. SMP makes that ratio worse.
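
A back-of-the-envelope sketch of that same ratio, expressed as cycles lost per
cache miss, using only the figures quoted above:

#include <stdio.h>

int main(void)
{
    double lat_1995 = 200.0, cycle_1995 = 5.0;   /* ~200 MHz CPU */
    double lat_2003 = 130.0, cycle_2003 = 0.3;   /* ~3.3 GHz CPU */

    /* Cycles thrown away on every miss: latency / cycle time. */
    printf("1995: ~%.0f cycles per miss\n", lat_1995 / cycle_1995);  /* ~40  */
    printf("2003: ~%.0f cycles per miss\n", lat_2003 / cycle_2003);  /* ~433 */
    return 0;
}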

It's called asymptotic behavior. After a while you can look at the graph
and see that more CPUs on the same memory doesn't make sense. It hasn't
made sense for a decade, what makes anyone think that is changing?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-03 05:44:34

by Mikael Abrahamsson

Subject: Re: Scaling noise

On Tue, 2 Sep 2003, Larry McVoy wrote:

> It's called asymptotic behavior. After a while you can look at the graph
> and see that more CPUs on the same memory doesn't make sense. It hasn't
> made sense for a decade, what makes anyone think that is changing?

It didn't make sense two decades ago either: the VAX 8300 could be made to
go 6-way, and it stopped going faster around the third processor added.

(My memory is a bit rusty, but I believe this is what we came up with when
we were donated a few of those in the mid '90s. And yes, they're not from
'83 but perhaps from '86-'87, so not quite two decades ago either.)

--
Mikael Abrahamsson email: [email protected]

2003-09-03 06:01:23

by Samium Gromoff

Subject: Re: Scaling noise


> > It's called asymptotic behavior. After a while you can look at the graph
> > and see that more CPUs on the same memory doesn't make sense. It hasn't
> > made sense for a decade, what makes anyone think that is changing?
>
> It didnt make sense two decades ago either, the VAX 8300 could be made to
> go 6way and it stopped going faster around the third processor added.

That doesn't dismiss the increasing imbalance between CPU and memory speeds,
which is way more fundamental than any possible technical glitches with
experimental SMP on VAXes.

regards, Samium Gromoff

2003-09-03 06:12:29

by Bernd Eckenfels

Subject: Re: Scaling noise

In article <[email protected]> you wrote:
> It's called asymptotic behavior. After a while you can look at the graph
> and see that more CPUs on the same memory doesn't make sense. It hasn't
> made sense for a decade, what makes anyone think that is changing?

That's why NUMA is getting so popular.

Larry, don't forget that Linux is growing in the university labs, where
those big NUMA and multi-node clusters are most popular for number
crunching.

Greetings
Bernd
--
eckes privat - http://www.eckes.org/
Project Freefire - http://www.freefire.org/

2003-09-03 06:28:40

by Anton Blanchard

Subject: Re: Scaling noise


> > > I've frequently tried to make the point that all the scaling for
> > > lots of processors is nonsense. Mr Dell says it better:
> > >
> > > "Eight-way (servers) are less than 1 percent of the market and
> > > shrinking pretty dramatically," Dell said. "If our competitors
> > > want to claim they're No. 1 in eight-ways, that's fine. We
> > > want to lead the market with two-way and four-way (processor
> > > machines)."
> > >
> > > Tell me again that it is a good idea to screw up uniprocessor
> > > performance for 64 way machines. Great idea, that. Go Dinosaurs!
> >
> > And does your 4 way have hyperthreading?
>
> What part of "shrinking pretty dramatically" did you not understand?
> Maybe you know more than Mike Dell. Could you share that insight?

Ok. But only because you asked nicely.

Mike Dell wants to sell 2 and 4 processor boxes and Intel wants to sell
processors with hyperthreading on them. Scaling to 4 or 8 threads is just
like scaling to 4 or 8 processors, only worse.

However, let's not end up in yet another 64-way scalability argument here.

The thing we should be worrying about is the UP -> 2-way SMP scalability
issue. If every chip in the future has hyperthreading, then all of a sudden
everyone is running an SMP kernel. And what hurts us?

atomic ops
memory barriers

I've always worried about those atomic ops that only appear in an SMP
kernel, but Rusty recently reminded me it's the same story for most of the
memory barriers.

Things like RCU can do a lot for this UP -> 2-way SMP issue. The fact that
it also helps the big end of town is just a bonus.
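
A minimal sketch of that config-dependence, in the spirit of the i386
headers; the macro bodies below are simplified illustrations assuming an x86
target, not the actual kernel source:

#ifdef CONFIG_SMP
#define LOCK_PREFIX  "lock; "   /* bus-locked read-modify-write */
#define smp_mb()     asm volatile("mfence" ::: "memory")
#else
#define LOCK_PREFIX  ""         /* UP: a plain incl is enough */
#define smp_mb()     asm volatile("" ::: "memory")  /* compiler barrier only */
#endif

/* On every hyperthreaded box CONFIG_SMP is set, so the expensive versions
 * are always paid for, even in a single physical package. */
static inline void atomic_inc_sketch(int *v)
{
    asm volatile(LOCK_PREFIX "incl %0" : "+m" (*v));
}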

Anton

2003-09-03 06:57:40

by John Bradford

Subject: Re: Scaling noise

> Tell me again that it is a good idea to screw up uniprocessor performance
> for 64 way machines. Great idea, that. Go Dinosaurs!

I suspect Larry is actually right that uniprocessor and $smallnum CPUs
SMP performance will remain the most important, but I don't agree with
his reasoning:

Once true virtualisation becomes part of a mainstream microprocessor
architecture, we'll start to see a lot of small ISPs wanting to move 4
or so 1U servers on to a single, 1U SMP box. Server consolidation
saves money in many ways - physical LAN cabling is replaced by virtual
LANs, less network hardware such as switches is required, there is
less hardware to break, you can add a new Linux image in seconds using
spare capacity rather than going out and buying a new box.

Once you have the option of running a firewall, a hot spare firewall, a
customer webserver, a hot spare customer webserver, a mail server, a
backup mail server, and a few virtual machines for customers, all on a
1U box, why are you going to want to pay for seven or more Us in a
datacentre, plus extra network hardware?

You can do this today on Z/Series, but you need to consolidate a lot
of machines to make it financially viable. Once virtualisation is
available on cheaper hardware, everybody will want $bignum way SMP
boxes, but no Linux image will run on more than $smallnum virtual
CPUs.

John.

2003-09-03 06:56:09

by Nick Piggin

Subject: Re: Scaling noise

Anton Blanchard wrote:

>>>>I've frequently tried to make the point that all the scaling for
>>>>lots of processors is nonsense. Mr Dell says it better:
>>>>
>>>> "Eight-way (servers) are less than 1 percent of the market and
>>>> shrinking pretty dramatically," Dell said. "If our competitors
>>>> want to claim they're No. 1 in eight-ways, that's fine. We
>>>> want to lead the market with two-way and four-way (processor
>>>> machines)."
>>>>
>>>>Tell me again that it is a good idea to screw up uniprocessor
>>>>performance for 64 way machines. Great idea, that. Go Dinosaurs!
>>>>
>>>And does your 4 way have hyperthreading?
>>>
>>What part of "shrinking pretty dramatically" did you not understand?
>>Maybe you know more than Mike Dell. Could you share that insight?
>>
>
>Ok. But only because you asked nicely.
>
>Mike Dell wants to sell 2 and 4 processor boxes and Intel wants to sell
>processors with hyperthreading on them. Scaling to 4 or 8 threads is just
>like scaling to 4 or 8 processors, only worse.
>
>However, lets not end up in a yet another 64 way scalability argument here.
>
>The thing we should be worrying about is the UP -> 2 way SMP scalability
>issue. If every chip in the future has hyperthreading then all of sudden
>everyone is running an SMP kernel. And what hurts us?
>
>atomic ops
>memory barriers
>
>Ive always worried about those atomic ops that only appear in an SMP
>kernel, but Rusty recently reminded me its the same story for most of the
>memory barriers.
>
>Things like RCU can do a lot for this UP -> 2 way SMP issue. The fact it
>also helps the big end of town is just a bonus.
>

I think LM advocates aiming single-image scalability at or before the knee
of the CPU-vs-performance curve. Say that's 4-way; it means you should get
good performance on 8-way while keeping top performance on 1-, 2- and 4-way.
(Sorry if I misrepresent your position.)

I don't think anyone advocates sacrificing UP performance for 32 ways, but
as he says it can happen .1% at a time.

But it looks like 2.6 will scale well to 16 way and higher. I wonder if
there are many regressions from 2.4 or 2.2 on small systems.


2003-09-03 07:38:45

by Mike Fedyk

Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 08:10:33AM +0100, John Bradford wrote:
> boxes, but no Linux image will run on more than $smallnum virtual
> CPUs.

Which is exactly what Larry is advocating. Essentially, instead of having
one large image covering a large NUMA box, you have several images, one
covering each NUMA node (even if they're all in the same box).

2003-09-03 08:13:16

by Giuliano Pochini

Subject: Re: Scaling noise


On 03-Sep-2003 Larry McVoy wrote:
> That was about 1995
> or so. At that point, memory latency was about 200 ns and processor speeds
> were at about 200Mhz or 5 ns. Today, memory latency is about 130 ns and
> processor speeds are about .3 ns. Processor speeds are 15 times faster and
> memory is less than 2 times faster. SMP makes that ratio worse.

Latency is not bandwidth. BTW, you are right; that's why caches are
growing, too. It's likely that in the future there will be only UP (HT'd?)
and NUMA machines.


Bye.
Giuliano.

2003-09-03 09:41:47

by Brown, Len

Subject: RE: Scaling noise

> Latency is not bandwidth.

Bingo.

The way to address memory latency is by increasing bandwidth and
increasing parallelism to use it -- thus amortizing the latency. HT is
one of many ways to do this. If systems are to keep getting faster at a rate
better than memory speeds improve, then plan on more parallelism, not less.

-Len

2003-09-03 11:03:07

by Geert Uytterhoeven

Subject: RE: Scaling noise

On Wed, 3 Sep 2003, Brown, Len wrote:
> > Latency is not bandwidth.
>
> Bingo.
>
> The way to address memory latency is by increasing bandwidth and
> increasing parallelism to use it -- thus amortizing the latency. HT is
> one of many ways to do this. If systems are to grow faster at a rate
> better than memory speeds, then plan on more parallelism, not less.

More parallelism usually means more data to process, hence more bandwidth is
needed => back to where we started.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2003-09-03 11:15:08

by Larry McVoy

Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 12:38:58AM -0700, Mike Fedyk wrote:
> On Wed, Sep 03, 2003 at 08:10:33AM +0100, John Bradford wrote:
> > boxes, but no Linux image will run on more than $smallnum virtual
> > CPUs.
>
> Which is exactly what Larry is advocating. Essencially, instead of having
> one large image covering a large NUMA box, you have several images covering
> each NUMA node (even if they're in the same box).

Right, that is indeed what I believe needs to happen. Instead of spreading
one kernel out over all the processors, run multiple kernels. Most of the
scaling problems go away. Not all of them if you want to share memory between
kernels, but for what John was talking about that is not even needed.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-03 11:20:17

by Larry McVoy

Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 05:41:39AM -0400, Brown, Len wrote:
> > Latency is not bandwidth.
>
> Bingo.
>
> The way to address memory latency is by increasing bandwidth and
> increasing parallelism to use it -- thus amortizing the latency.

And if the app is a pointer chasing app, as many apps are, that doesn't
help at all.

It's pretty much analogous to file systems. If bandwidth were the answer
then we'd all be seeing data moving at 60MB/sec off the disk. Instead
we see about 4 or 5MB/sec.

Expecting more bandwidth to help your app is like expecting more platter
speed to help your file system. It's not the platter speed, it's the
seeks that are the problem. Same thing in systems: it's not the
bcopy speed, it's the cache misses that are the problem. More bandwidth
doesn't do much for that.
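
A small userspace sketch of the difference; the sizes and the ring layout are
arbitrary assumptions, and the point is only that the first loop is a
dependent pointer chase bounded by miss latency, while the second streams
independent loads that prefetching and extra bandwidth can actually help:

#include <stdio.h>
#include <stdlib.h>

struct node { struct node *next; long pad[7]; };  /* roughly one cache line */

/* Latency-bound: every load depends on the previous one, so the full miss
 * latency is paid serially, no matter how wide the memory system is. */
static long chase(struct node *p, long steps)
{
    long n = 0;
    while (steps--) { p = p->next; n++; }
    return n;
}

/* Bandwidth-friendly: independent sequential loads the hardware can overlap. */
static long stream(const long *a, long len)
{
    long i, sum = 0;
    for (i = 0; i < len; i++)
        sum += a[i];
    return sum;
}

int main(void)
{
    enum { N = 1 << 16 };
    struct node *nodes = calloc(N, sizeof(*nodes));
    long *array = calloc(N, sizeof(*array));
    long i;

    for (i = 0; i < N; i++)          /* simple ring; a scrambled order hurts more */
        nodes[i].next = &nodes[(i + 1) % N];

    printf("%ld %ld\n", chase(nodes, N), stream(array, N));
    free(nodes);
    free(array);
    return 0;
}
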
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-03 12:10:02

by Alan

Subject: Re: Scaling noise

On Mer, 2003-09-03 at 07:12, Bernd Eckenfels wrote:
> Thats why NUMA gets so popular.

NUMA doesn't help you much.

> Larry, dont forget, that Linux is growing in the University Labs, where
> those big NUMA and Multi-Node Clusters are most popular for Number
> Crunching.

multi node yes, numa not much and where numa-like systems are being used
they are being used for message passing not as a fake big pc.

Numa is valuable because
- It makes some things go faster without having to rewrite them
- It lets you partition a large box into several effective small ones
cutting maintenance
- It lets you partition a large box into several effective small ones
so you can avoid buying two software licenses for expensive toys

if you actually care enough about performance to write the code to do
the job then its value is rather questionable. There are exceptions as
with anything else.


2003-09-03 14:28:05

by Steven Cole

Subject: Re: Scaling noise

On Tue, 2003-09-02 at 23:08, Larry McVoy wrote:
> On Wed, Sep 03, 2003 at 02:33:56PM +1000, CaT wrote:
> > I think Anton is referring to the fact that on a 4-way cpu machine with
> > HT enabled you basically have an 8-way smp box (with special conditions)
> > and so if 4-way machines are becoming more popular, making sure that 8-way
> > smp works well is a good idea.
>
> Maybe this is a better way to get my point across. Think about more CPUs
> on the same memory subsystem. I've been trying to make this scaling point
> ever since I discovered how much cache misses hurt. That was about 1995
> or so. At that point, memory latency was about 200 ns and processor speeds
> were at about 200Mhz or 5 ns. Today, memory latency is about 130 ns and
> processor speeds are about .3 ns. Processor speeds are 15 times faster and
> memory is less than 2 times faster. SMP makes that ratio worse.
>
> It's called asymptotic behavior. After a while you can look at the graph
> and see that more CPUs on the same memory doesn't make sense. It hasn't
> made sense for a decade, what makes anyone think that is changing?

You're right about the asymptotic behavior and you'll just get more
right as time goes on, but other forces are at work.

What is changing is the number of cores per 'processor' is increasing.
The Intel Montecito will increase this to two, and rumor has it that the
Intel Tanglewood may have as many as sixteen. The IBM Power6 will
likely be similarly capable.

The Tanglewood is not some far off flight of fancy; it may be available
as soon as the 2.8.x stable series, so planning to accommodate it should
be happening now.

With companies like SGI building Altix systems with 64 and 128 CPUs
using the current single-core Madison, just think of what will be
possible using the future hardware.

In four years, Michael Dell will still be saying the same thing, but
he'll just fudge his answer by a factor of four.

The question which will continue to be important in the next kernel
series is: How to best accommodate the future many-CPU machines without
sacrificing performance on the low-end? The change is that the 'many'
in the above may start to double every few years.

Some candidate answers to this have been discussed before, such as
cache-coherent clusters. I just hope this gets worked out before the
hardware ships.

Steven

2003-09-03 15:17:25

by Antonio Vargas

Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 08:25:36AM -0600, Steven Cole wrote:
> On Tue, 2003-09-02 at 23:08, Larry McVoy wrote:
> > On Wed, Sep 03, 2003 at 02:33:56PM +1000, CaT wrote:

> [snip]
>
> The question which will continue to be important in the next kernel
> series is: How to best accommodate the future many-CPU machines without
> sacrificing performance on the low-end? The change is that the 'many'
> in the above may start to double every few years.
>
> Some candidate answers to this have been discussed before, such as
> cache-coherent clusters. I just hope this gets worked out before the
> hardware ships.

As you probably know, CC-clusters were heavily advocated by the
same Larry McVoy who started this thread.

Greets, Antonio.

2003-09-03 15:22:46

by Martin J. Bligh

Subject: Re: Scaling noise

> multi node yes, numa not much and where numa-like systems are being used
> they are being used for message passing not as a fake big pc.
>
> Numa is valuable because
> - It makes some things go faster without having to rewrite them
> - It lets you partition a large box into several effective small ones
> cutting maintenance
> - It lets you partition a large box into several effective small ones
> so you can avoid buying two software licenses for expensive toys
>
> if you actually care enough about performance to write the code to do
> the job then its value is rather questionable. There are exceptions as
> with anything else.

The real core use of NUMA is to run one really big app on one machine,
where it's hard to split it across a cluster. You just can't build an
SMP box big enough for some of these things.

M.

2003-09-03 15:32:31

by Ihar 'Philips' Filipau

Subject: Re: Scaling noise

Steven Cole wrote:
>
> The question which will continue to be important in the next kernel
> series is: How to best accommodate the future many-CPU machines without
> sacrificing performance on the low-end? The change is that the 'many'
> in the above may start to double every few years.
>
> Some candidate answers to this have been discussed before, such as
> cache-coherent clusters. I just hope this gets worked out before the
> hardware ships.
>

RT frameworks already run a single kernel under some kind of RT OS.

It should be possible to develop a framework to run several Linuxes
under a single instance of another OS (or of Linux itself), with every
slave Linux instance told which resources it is responsible for.
You can /partition/ memory, and you can say that a given kernel instance
should use, e.g., only CPUs N through N+M.
But some resources - like IDE controllers, GPUs, NICs - are not that
easy to share. Actually, most resources are not trivial to share.
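
For the CPU and memory side, stock boot options already let one instance
confine itself; a rough illustration with made-up values, with the caveat
that carving out disjoint ranges per instance, and sharing the devices, is
exactly the hard part left over:

    maxcpus=2    # bring up at most two CPUs in this instance
    mem=1024M    # use only the first 1024 MB of RAM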

And I'm not sure which will turn out to be easier: writing a very
scalable kernel, or making the kernel able to share resources
efficiently with others.

P.S. My personal belief is that SMP is never going to become a commodity.
EPIC/VLIW, probably. Not SMP/AMP.

2003-09-03 15:22:36

by Martin J. Bligh

Subject: Re: Scaling noise

--Roland Dreier <[email protected]> wrote (on Tuesday, September 02, 2003 21:12:36 -0700):

> +--------------+
>| Don't feed |
>| the trolls |
>| |
>| thank you |
> +--------------+
> | |
> | |
> | |
> | |
> ....\ /....

Agreed. Please refer to the last flamefest a few months ago, when this was
covered in detail.

M.

2003-09-03 15:24:57

by Martin J. Bligh

Subject: Re: Scaling noise

> I think LM advocates aiming single image scalability at or before the knee
> of the CPU vs performance curve. Say thats 4 way, it means you should get
> good performance on 8 ways while keeping top performance on 1 and 2 and 4
> ways. (Sorry if I mis-represent your position).

Splitting big machines into a cluster is not a solution. However, oddly
enough I actually agree with Larry, with one major caveat ... you have to
make it an SSI cluster (single system image) - that way it's transparent
to users. Unfortunately that's hard to do, but since we still have a
system that's single memory image coherent, it shouldn't actually be nearly
as hard as doing it across machines, as you can still fudge in the odd
global piece if you need it.

Without SSI, it's pretty useless, you're just turning an expensive box
into a cheap cluster, and burning a lot of cash.

> I don't think anyone advocates sacrificing UP performance for 32 ways, but
> as he says it can happen .1% at a time.
>
> But it looks like 2.6 will scale well to 16 way and higher. I wonder if
> there are many regressions from 2.4 or 2.2 on small systems.

You want real data instead of FUD? How *dare* you? ;-)

It would be really interesting to see this ... there are actually plenty of
real degradations there, none of which (that I've seen) come from any
scalability changes. Things like RMAP on fork times (for which there are
other legitimate reasons) are more responsible (and for which the "scalability"
people have offered a solution).

Numbers would be cool ... particularly if people can refrain from the
"it's worse, therefore it must be some scalability change that's at fault"
insta-moron-leap-of-logic.

M.

2003-09-03 15:39:16

by Larry McVoy

Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 08:23:44AM -0700, Martin J. Bligh wrote:
> > I think LM advocates aiming single image scalability at or before the knee
> > of the CPU vs performance curve. Say thats 4 way, it means you should get
> > good performance on 8 ways while keeping top performance on 1 and 2 and 4
> > ways. (Sorry if I mis-represent your position).
>
> Splitting big machines into a cluster is not a solution. However, oddly
> enough I actually agree with Larry, with one major caveat ... you have to
> make it an SSI cluster (single system image) - that way it's transparent
> to users.

Err, when did I ever say it wasn't SSI? If you look at what I said it's
clearly SSI. Unified process, device, file, and memory namespaces.

I'm pretty sure people were so eager to argue with my lovely personality
that they never bothered to understand the architecture. It's _always_
been SSI. I have slides going back at least 4 years that state this:

http://www.bitmover.com/talks/smp-clusters
http://www.bitmover.com/talks/cliq

> Numbers would be cool ... particularly if people can refrain from the
> "it's worse, therefore it must be some scalability change that's at fault"
> insta-moron-leap-of-logic.

It's really easy to claim that scalability isn't the problem. Scaling
changes in general cause very minute differences, it's just that there
are a lot of them. There is constant pressure to scale further and people
think it's cool. You can argue all you want that scaling done right
isn't a problem but nobody has ever managed to do it right. I know it's
politically incorrect to say this group won't either but there is no
evidence that they will.

Instead of doggedly following the footsteps down a path that hasn't worked
before, why not do something cool? The CC stuff is a fun place to work,
it's the last paradigm shift that will ever happen in OS, it's a chance
for Linux to actually do something new. I harp all the time that open
source is a copying mechanism and you are playing right into my hands.
Make me wrong. Do something new. Don't like this design? OK, then come
up with a better design.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-03 15:36:09

by Steven Cole

Subject: Re: Scaling noise

On Wed, 2003-09-03 at 06:47, Antonio Vargas wrote:
> On Wed, Sep 03, 2003 at 08:25:36AM -0600, Steven Cole wrote:
> > On Tue, 2003-09-02 at 23:08, Larry McVoy wrote:
> > > On Wed, Sep 03, 2003 at 02:33:56PM +1000, CaT wrote:
>
> > [snip]
> >
> > The question which will continue to be important in the next kernel
> > series is: How to best accommodate the future many-CPU machines without
> > sacrificing performance on the low-end? The change is that the 'many'
> > in the above may start to double every few years.
> >
> > Some candidate answers to this have been discussed before, such as
> > cache-coherent clusters. I just hope this gets worked out before the
> > hardware ships.
>
> As you may probably know, CC-clusters were heavily advocated by the
> same Larry McVoy who has started this thread.
>

Yes, thanks. I'm well aware of that. I would like to get a discussion
going again on CC-clusters, since that seems to be a way out of the
scaling spiral. Here is an interesting link:
http://www.opersys.com/adeos/practical-smp-clusters/

Steven



2003-09-03 16:06:21

by Martin J. Bligh

Subject: Re: Scaling noise

> Err, when did I ever say it wasn't SSI? If you look at what I said it's
> clearly SSI. Unified process, device, file, and memory namespaces.

I think it was the bit when you suggested using bitkeeper to sync multiple
/etc/passwd files when I really switched off ... perhaps you were just
joking ;-) Perhaps we just had a massive communication disconnect.

> I'm pretty sure people were so eager to argue with my lovely personality
> that they never bothered to understand the architecture. It's _always_
> been SSI. I have slides going back at least 4 years that state this:
>
> http://www.bitmover.com/talks/smp-clusters
> http://www.bitmover.com/talks/cliq

I can go back and re-read them; if I misread them last time then I apologise.
I've also shifted perspectives on SSI clusters somewhat over the last year.
Yes, if it's SSI, I'd agree for the most part ... once it's implemented ;-)

I'd rather start with everything separate (one OS instance per node), and
bind things back together, than split everything up. However, I'm really
not sure how feasible it is until we actually have something that works.

I have a rough plan of how to go about it mapped out, in small steps that
might be useful by themselves. It's a lot of fairly complex hard work ;-)

>> Numbers would be cool ... particularly if people can refrain from the
>> "it's worse, therefore it must be some scalability change that's at fault"
>> insta-moron-leap-of-logic.
>
> It's really easy to claim that scalability isn't the problem. Scaling
> changes in general cause very minute differences, it's just that there
> are a lot of them. There is constant pressure to scale further and people
> think it's cool. You can argue you all you want that scaling done right
> isn't a problem but nobody has ever managed to do it right. I know it's
> politically incorrect to say this group won't either but there is no
> evidence that they will.

Let's not go into that one again, we've both dragged that over the coals
already. Time to agree to disagree. All the significant degradations I
looked at that people screamed were scalability changes turned out to
be something else completely.

> Instead of doggedly following the footsteps down a path that hasn't worked
> before, why not do something cool? The CC stuff is a fun place to work,
> it's the last paradigm shift that will ever happen in OS, it's a chance
> for Linux to actually do something new. I harp all the time that open
> source is a copying mechanism and you are playing right into my hands.
> Make me wrong. Do something new. Don't like this design? OK, then come
> up with a better design.

I'm cool with doing SSI clusters over NUMA on a per-node basis. But it's
still vapourware ... yes, I'd love to work on that full time to try and
change that if I can get funding to do so.

M.

2003-09-03 16:03:56

by Jörn Engel

Subject: Re: Scaling noise

On Wed, 3 September 2003 08:10:33 -0700, Martin J. Bligh wrote:
>
> > multi node yes, numa not much and where numa-like systems are being used
> > they are being used for message passing not as a fake big pc.
> >
> > Numa is valuable because
> > - It makes some things go faster without having to rewrite them
> > - It lets you partition a large box into several effective small ones
> > cutting maintenance
> > - It lets you partition a large box into several effective small ones
> > so you can avoid buying two software licenses for expensive toys
> >
> > if you actually care enough about performance to write the code to do
> > the job then its value is rather questionable. There are exceptions as
> > with anything else.
>
> The real core use of NUMA is to run one really big app on one machine,
> where it's hard to split it across a cluster. You just can't build an
> SMP box big enough for some of these things.

This "hard to split" is usually caused by memory use instead of cpu
use, right?

I don't see a big problem scaling number crunchers over a cluster, but
a process with a working set >64GB cannot be split between 4GB
machines easily.

Jörn

--
Good warriors cause others to come to them and do not go to others.
-- Sun Tzu

2003-09-03 16:39:07

by Kurt Wall

Subject: Re: Scaling noise

Quoth Larry McVoy:

[SMP hits memory latency wall]

> It's called asymptotic behavior. After a while you can look at the graph
> and see that more CPUs on the same memory doesn't make sense. It hasn't
> made sense for a decade, what makes anyone think that is changing?

Isn't this what NUMA is for, then?

Kurt
--
"There was a boy called Eustace Clarence Scrubb, and he almost deserved
it."
-- C. S. Lewis, The Chronicles of Narnia

2003-09-03 16:24:24

by Martin J. Bligh

Subject: Re: Scaling noise

>> The real core use of NUMA is to run one really big app on one machine,
>> where it's hard to split it across a cluster. You just can't build an
>> SMP box big enough for some of these things.
>
> This "hard to split" is usually caused by memory use instead of cpu
> use, right?

Heavy process intercommunication I guess, often but not always through
shared mem.

> I don't see a big problem scaling number crunchers over a cluster, but
> a process with a working set >64GB cannot be split between 4GB
> machines easily.

Right - some problems split nicely, and should get run on clusters because
it's a shitload cheaper. Preferably an SSI cluster so you get to manage
things easily, but either way. As you say, some things just don't split
that way, and that's why people pay for big iron (which ends up being
NUMA).

I've seen people use big machines for clusterable things, which I think
is a waste of money, but the cost of the machine compared to the cost
of admin (vs multiple machines) may have come down to the point where
it's worth it now. You get implicit "cluster" load balancing done in a
transparent way by the OS on NUMA boxes.

M.

2003-09-03 17:09:54

by Brown, Len

Subject: RE: Scaling noise

Fortunately, seek time on RAM is lower than on disk ;-) Sure, parallel
systems are a waste of effort for running a single copy of a single-threaded
app, but when you have multiple apps, or better yet MT apps, you win. If
system performance were limited over time to the rate of decrease in RAM
latency, then we'd be in sorry shape.

Back to the original off-topic...
An OEM can spin their motivation to focus on smaller systems in 3 ways:

1. large server sales are a small % of industry units
2. large server sales are a small % of industry revenue
3. large server sales are a small % of industry profits

Only 1 is true.

Cheers,
-Len

#include <std/disclaimer.h>


> -----Original Message-----
> From: Larry McVoy [mailto:[email protected]]
> Sent: Wednesday, September 03, 2003 7:20 AM
> To: Brown, Len
> Cc: Giuliano Pochini; Larry McVoy; [email protected]
> Subject: Re: Scaling noise
>
>
> On Wed, Sep 03, 2003 at 05:41:39AM -0400, Brown, Len wrote:
> > > Latency is not bandwidth.
> >
> > Bingo.
> >
> > The way to address memory latency is by increasing bandwidth and
> > increasing parallelism to use it -- thus amortizing the latency.
>
> And if the app is a pointer chasing app, as many apps are,
> that doesn't
> help at all.
>
> It's pretty much analogous to file systems. If bandwidth was
> the answer
> then we'd all be seeing data moving at 60MB/sec off the disk.
> Instead
> we see about 4 or 5MB/sec.
>
> Expecting more bandwidth to help your app is like expecting
> more platter
> speed to help your file system. It's not the platter speed, it's the
> seeks which are the problem. Same thing in system doesn't,
> it's not the
> bcopy speed, it's the cache misses that are the problem.
> More bandwidth
> doesn't do much for that.
> --
> ---
> Larry McVoy lm at bitmover.com
> http://www.bitmover.com/lm
>

2003-09-03 17:15:05

by William Lee Irwin III

Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 08:23:44AM -0700, Martin J. Bligh wrote:
> Would be real interesting to see this ... there are actually plenty of
> real degredations there, none of which (that I've seen) come from any
> scalability changes. Things like RMAP on fork times (for which there are
> other legitimite reasons) are more responsible (for which the "scalability"
> people have offered a solution).

How'd that get capitalized? It's not an acronym.

At any rate, fork()'s relevance to performance is not being measured
in any context remotely resembling real usage cases, e.g. forking
servers. There are other problems with kernel compiles, for instance
internally limited parallelism and a relatively highly constrained
userspace component whose concurrency cannot be increased.


-- wli

2003-09-03 17:32:53

by Larry McVoy

Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 01:07:03PM -0400, Brown, Len wrote:
> Fortunately seek time on RAM is lower than disk;-) Sure, parallel
> systems are a waste of effort for running a single copy of a single
> threaded app, but when you have multiple apps, or better yet MT apps,
> you win. If system performance were limited over time to the rate of
> decrease in RAM latency, then we'd be in sorry shape.

For a lot of applications we are. Go talk to your buddies in the processor
group, I think there is a fair amount of awareness that for most apps faster
processors aren't doing any good. Ditto for SMP.

> Back to the original off-topic...
> An OEM can spin their motivation to focus on smaller systems in 3 ways:
>
> 1. large server sales are a small % of industry units
> 2. large server sales are a small % of industry revenue
> 3. large server sales are a small % of industry profits
>
> Only 1 is true.

How about some data to back up that statement?

Sun: ~11B/year and losing money, heavily server based
Dell: ~38B/year and making money, 99% small box based

If you were gambling with _your_ money, would you invest in Sun or Dell?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-03 18:00:11

by William Lee Irwin III

Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 05:41:39AM -0400, Brown, Len wrote:
>> The way to address memory latency is by increasing bandwidth and
>> increasing parallelism to use it -- thus amortizing the latency.

On Wed, Sep 03, 2003 at 04:19:34AM -0700, Larry McVoy wrote:
> And if the app is a pointer chasing app, as many apps are, that doesn't
> help at all.
> It's pretty much analogous to file systems. If bandwidth was the answer
> then we'd all be seeing data moving at 60MB/sec off the disk. Instead
> we see about 4 or 5MB/sec.

RAM is not operationally analogous to disk. For one, it supports
efficient random access, where disk does not.


On Wed, Sep 03, 2003 at 04:19:34AM -0700, Larry McVoy wrote:
> Expecting more bandwidth to help your app is like expecting more platter
> speed to help your file system. It's not the platter speed, it's the
> seeks which are the problem. Same thing in system doesn't, it's not the
> bcopy speed, it's the cache misses that are the problem. More bandwidth
> doesn't do much for that.

Obviously, since the technique is merely increasing concurrency, it
doesn't help any individual application, but rather utilizes cpu
resources while one is stalled to execute another. Cache misses are no
mystery; N times the number of threads of execution is N times the cache
footprint (assuming all threads equal, which is never true but useful
to assume), so it doesn't pay to cachestrate. But it never did anyway.
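
Rough arithmetic behind that footprint point, with made-up round numbers for
the cache and per-thread working-set sizes:

#include <stdio.h>

int main(void)
{
    int cache_kb = 512;   /* shared L2, assumed */
    int wset_kb  = 256;   /* per-thread working set, assumed */
    int t;

    /* Threads sharing one cache get roughly cache_kb / t each; once the
     * working set no longer fits, added threads trade hits for misses. */
    for (t = 1; t <= 4; t++)
        printf("%d thread(s): %3d KB each, working set %s\n",
               t, cache_kb / t,
               cache_kb / t >= wset_kb ? "fits" : "does not fit");
    return 0;
}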

The lines of reasoning presented against tightly coupled systems are
grossly flawed. Attacking the communication bottlenecks by increasing
the penalty for communication is highly ineffective, which is why these
cookie cutter clusters for everything strategies don't work even on
paper.

First, communication requirements originate from the applications, not
the operating system, hence so long as there are applications with such
requirements, the requirements for such kernels will exist. Second, the
proposal is ignoring numerous environmental constraints, for instance,
the system administration, colocation, and other costs of the massive
duplication of perfectly shareable resources implied by the clustering.
Third, the communication penalties are turned from memory access to I/O,
which is tremendously slower by several orders of magnitude. Fourth, the
kernel design problem is actually made harder, since no one has ever
been able to produce a working design for these cache coherent clusters
yet that I know of, and what descriptions of this proposal I've seen that
are extant (you wrote some paper on it, IIRC) are too vague to be
operationally useful.

So as best as I can tell the proposal consists of using an orders-of-
magnitude slower communication method to implement an underspecified
solution to some research problem that to all appearances will be more
expensive to maintain and keep running than the now extant designs.

I like distributed systems and clusters, and they're great to use for
what they're good for. They're not substitutes in any way for tightly
coupled systems, nor do they render large specimens thereof unnecessary.


-- wli

2003-09-03 18:18:06

by William Lee Irwin III

Subject: Re: Scaling noise

At some point in the past, I wrote:
>> The lines of reasoning presented against tightly coupled systems are
>> grossly flawed.

On Wed, Sep 03, 2003 at 11:05:47AM -0700, Larry McVoy wrote:
> [etc].
> Only problem with your statements is that IBM has already implemented all
> of the required features in VM. And multiple Linux instances are running
> on it today, with shared disks underneath so they don't replicate all the
> stuff that doesn't need to be replicated, and they have shared memory
> across instances.

Independent operating system instances running under a hypervisor don't
qualify as a cache-coherent cluster that I can tell; it's merely dynamic
partitioning, which is great, but nothing to do with clustering or SMP.


-- wli

2003-09-03 18:19:31

by Larry McVoy

Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 11:15:50AM -0700, William Lee Irwin III wrote:
> Independent operating system instances running under a hypervisor don't
> qualify as a cache-coherent cluster that I can tell; it's merely dynamic
> partitioning, which is great, but nothing to do with clustering or SMP.

they can map memory between instances
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-03 18:11:02

by Larry McVoy

Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 11:07:02AM -0700, William Lee Irwin III wrote:
> On Wed, Sep 03, 2003 at 01:07:03PM -0400, Brown, Len wrote:
> >> Fortunately seek time on RAM is lower than disk;-) Sure, parallel
> >> systems are a waste of effort for running a single copy of a single
> >> threaded app, but when you have multiple apps, or better yet MT apps,
> >> you win. If system performance were limited over time to the rate of
> >> decrease in RAM latency, then we'd be in sorry shape.
>
> On Wed, Sep 03, 2003 at 10:32:13AM -0700, Larry McVoy wrote:
> > For a lot of applications we are. Go talk to your buddies in the processor
> > group, I think there is a fair amount of awareness that for most apps faster
> > processors aren't doing any good. Ditto for SMP.
>
> You're thinking single-application again. Systems run more than one
> thing at once.

Then explain why hyperthreading is turned off by default in Windows.

> On Wed, Sep 03, 2003 at 01:07:03PM -0400, Brown, Len wrote:
> >> Back to the original off-topic...
> >> An OEM can spin their motivation to focus on smaller systems in 3 ways:
> >> 1. large server sales are a small % of industry units
> >> 2. large server sales are a small % of industry revenue
> >> 3. large server sales are a small % of industry profits
> >> Only 1 is true.
>
> On Wed, Sep 03, 2003 at 10:32:13AM -0700, Larry McVoy wrote:
> > How about some data to back up that statement?
> > Sun: ~11B/year and losing money, heavily server based
> > Dell: ~38B/year and making money, 99% small box based
> > If you were gambling with _your_ money, would you invest in Sun or Dell?
>
> That's neither sufficient information about those two companies nor a
> sufficient number of companies to make a proper empirical statement
> about this. I really don't care for a stock market update, but I'm just
> not going to believe anything this sketchy (from either source, actually).

Translation: "I don't like your data so I'm ignoring it".

How you can look at those two companies and not see what is obvious is
beyond me but everyone is entitled to their opinion. It's nice when your
opinion is based on data, not religion.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-03 18:07:59

by Larry McVoy

Subject: Re: Scaling noise

> The lines of reasoning presented against tightly coupled systems are
> grossly flawed.

[etc].

Only problem with your statements is that IBM has already implemented all
of the required features in VM. And multiple Linux instances are running
on it today, with shared disks underneath so they don't replicate all the
stuff that doesn't need to be replicated, and they have shared memory
across instances.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-03 18:06:41

by William Lee Irwin III

Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 01:07:03PM -0400, Brown, Len wrote:
>> Fortunately seek time on RAM is lower than disk;-) Sure, parallel
>> systems are a waste of effort for running a single copy of a single
>> threaded app, but when you have multiple apps, or better yet MT apps,
>> you win. If system performance were limited over time to the rate of
>> decrease in RAM latency, then we'd be in sorry shape.

On Wed, Sep 03, 2003 at 10:32:13AM -0700, Larry McVoy wrote:
> For a lot of applications we are. Go talk to your buddies in the processor
> group, I think there is a fair amount of awareness that for most apps faster
> processors aren't doing any good. Ditto for SMP.

You're thinking single-application again. Systems run more than one
thing at once.


On Wed, Sep 03, 2003 at 01:07:03PM -0400, Brown, Len wrote:
>> Back to the original off-topic...
>> An OEM can spin their motivation to focus on smaller systems in 3 ways:
>> 1. large server sales are a small % of industry units
>> 2. large server sales are a small % of industry revenue
>> 3. large server sales are a small % of industry profits
>> Only 1 is true.

On Wed, Sep 03, 2003 at 10:32:13AM -0700, Larry McVoy wrote:
> How about some data to back up that statement?
> Sun: ~11B/year and losing money, heavily server based
> Dell: ~38B/year and making money, 99% small box based
> If you were gambling with _your_ money, would you invest in Sun or Dell?

That's neither sufficient information about those two companies nor a
sufficient number of companies to make a proper empirical statement
about this. I really don't care for a stock market update, but I'm just
not going to believe anything this sketchy (from either source, actually).


-- wli

2003-09-03 18:36:58

by Alan

Subject: Re: Scaling noise

On Mer, 2003-09-03 at 19:15, William Lee Irwin III wrote:
> Independent operating system instances running under a hypervisor don't
> qualify as a cache-coherent cluster that I can tell; it's merely dynamic
> partitioning, which is great, but nothing to do with clustering or SMP.

Now add a clusterfs and tell me the difference, other than there being a
lot less sharing going on...

2003-09-03 18:33:35

by Alan

Subject: Re: Scaling noise

On Mer, 2003-09-03 at 19:07, Larry McVoy wrote:
> Then explain why hyperthreading is turned off by default in Windows.

Most people I know turn it off in Windows because, while it's a 5-10%
performance boost (which is nice), vendors bill it as an extra CPU
license!


2003-09-03 18:29:21

by Valdis Klētnieks

Subject: Re: Scaling noise

On Wed, 03 Sep 2003 11:07:55 PDT, Larry McVoy <[email protected]> said:
> On Wed, Sep 03, 2003 at 11:07:02AM -0700, William Lee Irwin III wrote:

> > You're thinking single-application again. Systems run more than one
> > thing at once.
>
> Then explain why hyperthreading is turned off by default in Windows.

I haven't actually checked, but is it possible that the Windows HT support
suffers from the same scaling issues as the rest of the Windows innards? To
tie this in with what Mike Dell said - perhaps the reason he's selling mostly
1/2/4 CPU boxes is because the dominant operating system blows chunks with
more, and in the common desktop scenario enabling HT was actually slower than
leaving it off?



2003-09-03 18:33:36

by William Lee Irwin III

Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 11:15:50AM -0700, William Lee Irwin III wrote:
>> Independent operating system instances running under a hypervisor don't
>> qualify as a cache-coherent cluster that I can tell; it's merely dynamic
>> partitioning, which is great, but nothing to do with clustering or SMP.

On Wed, Sep 03, 2003 at 11:15:52AM -0700, Larry McVoy wrote:
> they can map memory between instances

That's just enough of a hypervisor API for the kernel to do the rest,
which it is very explicitly not doing. It also has other uses.


-- wli

2003-09-03 18:27:00

by William Lee Irwin III

Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 11:07:02AM -0700, William Lee Irwin III wrote:
>> You're thinking single-application again. Systems run more than one
>> thing at once.

On Wed, Sep 03, 2003 at 11:07:55AM -0700, Larry McVoy wrote:
> Then explain why hyperthreading is turned off by default in Windows.

I don't follow M$ stuff and am not interested in much about it. cc:
[email protected] and have them tell the others.


On Wed, Sep 03, 2003 at 11:07:02AM -0700, William Lee Irwin III wrote:
>> That's neither sufficient information about those two companies nor a
>> sufficient number of companies to make a proper empirical statement
>> about this. I really don't care for a stock market update, but I'm just
>> not going to believe anything this sketchy (from either source, actually).

On Wed, Sep 03, 2003 at 11:07:55AM -0700, Larry McVoy wrote:
> Translation: "I don't like your data so I'm ignoring it".
> How you can look at those two companies and not see what is obvious is
> beyond me but everyone is entitled to their opinion. It's nice when your
> opinion is based on data, not religion.

Restating the above in slow motion:

(a) economic arguments make me want to puke in the face of the presenter
(b) I don't believe either of you jokers giving someone else's bottom line
(c) Sun's in the toilet anyway, try comparing Dell to a healthy vendor


-- wli

2003-09-03 18:43:48

by Martin J. Bligh

Subject: Re: Scaling noise

>> Back to the original off-topic...
>> An OEM can spin their motivation to focus on smaller systems in 3 ways:
>>
>> 1. large server sales are a small % of industry units
>> 2. large server sales are a small % of industry revenue
>> 3. large server sales are a small % of industry profits
>>
>> Only 1 is true.
>
> How about some data to back up that statement?
>
> Sun: ~11B/year and losing money, heavily server based
> Dell: ~38B/year and making money, 99% small box based
>
> If you were gambling with _your_ money, would you invest in Sun or Dell?

Errm, IBM is gambling with _their_ money, as are others such as HP,
and they're making big iron.

So do you believe they're all just completely stupid, and unable to read
their own sales figures? Or is it just a vast conspiracy to promote large
SMP / NUMA boxes because of ... something?

Sun is not losing money because they're "server based". It's because
they're locked into Solaris and SPARC, and customers by and large
don't want that. Their machines are probably rather overpriced vs using
ia32 hardware as well. Your one-dimensional extension of logic is ...
unimpressive.

M.

2003-09-03 19:21:27

by Steven Cole

Subject: Re: Scaling noise

On Wed, 2003-09-03 at 12:00, William Lee Irwin III wrote:

>
> First, communication requirements originate from the applications, not
> the operating system, hence so long as there are applications with such
> requirements, the requirements for such kernels will exist. Second, the
> proposal is ignoring numerous environmental constraints, for instance,
> the system administration, colocation, and other costs of the massive
> duplication of perfectly shareable resources implied by the clustering.
> Third, the communication penalties are turned from memory access to I/O,
> which is tremendously slower by several orders of magnitude. Fourth, the
> kernel design problem is actually made harder, since no one has ever
> been able to produce a working design for these cache coherent clusters
> yet that I know of, and what descriptions of this proposal I've seen that
> are extant (you wrote some paper on it, IIRC) are too vague to be
> operationally useful.
>
> So as best as I can tell the proposal consists of using an orders-of-
> magnitude slower communication method to implement an underspecified
> solution to some research problem that to all appearances will be more
> expensive to maintain and keep running than the now extant designs.
>
You and Larry are either talking past each other, or perhaps it is I who
don't understand the not-yet-existing CC-clusters. My understanding is
that communication between nodes of a CC-cluster would be through a
shared-memory mechanism, not through much slower I/O such as a network
(even a very fast network).

From Karim Yaghmour's paper here:
http://www.opersys.com/adeos/practical-smp-clusters/
"That being said, clustering packages may make assumptions that do not
hold in the current architecture. Primarily, by having nodes so close
together, physical network latencies and problems disappear."

> I like distributed systems and clusters, and they're great to use for
> what they're good for. They're not substitutes in any way for tightly
> coupled systems, nor do they render large specimens thereof unnecessary.
>

My point is this: Currently at least one vendor (SGI) wants to scale the
kernel to 128 CPUs. As far as I know, the SGI Altix systems can be
configured up to 512 CPUs. If the Intel Tanglewood really will have 16
cores per chip, very much larger systems will be possible. Will you be
able to scale the kernel to 2048 CPUs and beyond? This may happen
during the lifetime of 2.8.x, so planning should be happening either now
or soon.

Steven

2003-09-03 18:15:59

by Alan

Subject: Re: Scaling noise

On Mer, 2003-09-03 at 18:32, Larry McVoy wrote:
> For a lot of applications we are. Go talk to your buddies in the processor
> group, I think there is a fair amount of awareness that for most apps faster
> processors aren't doing any good. Ditto for SMP.

From the app end I found similar things. My GNOME desktop performance
doesn't measurably improve beyond about 700-800 MHz. Some other stuff like
Mozilla benefits from more CPU, and 3D game stuff can burn all it can
get.

Disk matters a great deal (especially seek performance although there is
a certain amount of bad design in the kernel/desktop also involved
there) and memory bandwidth also seems to matter sometimes.


2003-09-03 19:44:37

by Mike Fedyk

Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 09:21:47AM -0700, Martin J. Bligh wrote:
> I've seen people use big machines for clusterable things, which I think
> is a waste of money, but the cost of the machine compared to the cost
> of admin (vs multiple machines) may have come down to the point where
> it's worth it now. You get implicit "cluster" load balancing done in a
> transparent way by the OS on NUMA boxes.

Doesn't SSI clustering do something similar (without the efficiency of the
interconnections though)?

2003-09-03 19:48:18

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Scaling noise

On Mer, 2003-09-03 at 19:15, William Lee Irwin III wrote:
>> Independent operating system instances running under a hypervisor don't
>> qualify as a cache-coherent cluster that I can tell; it's merely dynamic
>> partitioning, which is great, but nothing to do with clustering or SMP.

On Wed, Sep 03, 2003 at 07:32:12PM +0100, Alan Cox wrote:
> Now add a clusterfs and tell me the difference, other than there being a
> lot less sharing going on...

The sharing matters; e.g. libc and other massively shared bits are
replicated in memory once per instance, which increases memory and
cache footprint(s). A number of other consequences of the sharing loss:

The number of systems to manage proliferates.

Pagecache access suddenly involves cross-instance communication instead
of swift memory access and function calls, with potentially enormous
invalidation latencies.

Userspace IPC goes from shared memory and pipes and sockets inside
a single instance (which are just memory copies) to cross-instance
data traffic, which involves slinging memory around through the
hypervisor's interface, which is slower.

The limited size of a single instance bounds the size of individual
applications, which at various times would like to have larger memory
footprints or consume more cpu time than fits in a single instance.
i.e. something resembling external fragmentation of system resources.

Process migration is confined to within a single instance without
some very ugly bits; things such as forking servers and dynamic task
creation algorithms like thread pools fall apart here.

There's suddenly competition for and a need for dynamic shifting around
of resources not shared across instances, like private disk space and
devices, shares of cpu, IP numbers and other system identifiers, and
even such things as RAM and virtual cpus.

AFAICT this raises more issues than it addresses. Not that the issues
aren't worth addressing, but there's a lot more to do than Larry
saying "I think this is a good idea" before expecting anyone to even
think it's worth thinking about.


-- wli

2003-09-03 19:55:14

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, 2003-09-03 at 12:00, William Lee Irwin III wrote:
>> So as best as I can tell the proposal consists of using an orders-of-
>> magnitude slower communication method to implement an underspecified
>> solution to some research problem that to all appearances will be more
>> expensive to maintain and keep running than the now extant designs.

On Wed, Sep 03, 2003 at 01:11:55PM -0600, Steven Cole wrote:
> You and Larry are either talking past each other, or perhaps it is I who
> don't understand the not-yet-existing CC-clusters. My understanding is
> that communication between nodes of a CC-cluster would be through a
> shared-memory mechanism, not through much slower I/O such as a network
> (even a very fast network).
> From Karim Yaghmour's paper here:
> http://www.opersys.com/adeos/practical-smp-clusters/
> "That being said, clustering packages may make assumptions that do not
> hold in the current architecture. Primarily, by having nodes so close
> together, physical network latencies and problems disappear."

The communication latencies will get better that way, sure.


On Wed, Sep 03, 2003 at 01:11:55PM -0600, Steven Cole wrote:
>> I like distributed systems and clusters, and they're great to use for
>> what they're good for. They're not substitutes in any way for tightly
>> coupled systems, nor do they render large specimens thereof unnecessary.

On Wed, Sep 03, 2003 at 01:11:55PM -0600, Steven Cole wrote:
> My point is this: Currently at least one vendor (SGI) wants to scale the
> kernel to 128 CPUs. As far as I know, the SGI Altix systems can be
> configured up to 512 CPUs. If the Intel Tanglewood really will have 16
> cores per chip, very much larger systems will be possible. Will you be
> able to scale the kernel to 2048 CPUs and beyond? This may happen
> during the lifetime of 2.8.x, so planning should be happening either now
> or soon.

This is not particularly exciting (or truthfully remotely interesting)
news. google for "BBN Butterfly" to see what was around ca. 1988.


-- wli

2003-09-03 19:59:07

by Daniel Gryniewicz

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, 2003-09-03 at 14:11, Alan Cox wrote:
> On Mer, 2003-09-03 at 18:32, Larry McVoy wrote:
> > For a lot of applications we are. Go talk to your buddies in the processor
> > group, I think there is a fair amount of awareness that for most apps faster
> > processors aren't doing any good. Ditto for SMP.
>
> From the app end I found similar things. My gnome desktop performance
> doesn't measurably improve beyond about 7-800Mhz. Some other stuff like
> mozilla benefits from more CPU and 3D game stuff can burn all it can
> get.

Interesting. I've found that my Athlon XP 2200 is noticeably faster at
Gnome than my Athlon MP 1500 (running a UP kernel). The memory and disk
are the same speed, so only the processor is different, unless the
chipsets make a huge difference. Of course, both are much faster than
my 1 GHz Athlon laptop, but there's also a large disk/memory speed
difference there.

--
Daniel Gryniewicz <[email protected]>

2003-09-03 20:17:19

by Alan

[permalink] [raw]
Subject: Re: Scaling noise

On Mer, 2003-09-03 at 20:46, William Lee Irwin III wrote:
> The sharing matters; e.g. libc and other massively shared bits are
> replicated in memory once per instance, which increases memory and
> cache footprint(s). A number of other consequences of the sharing loss:

Memory is cheap. NUMA people already replicate pages on big systems,
even the entire kernel. Last time I looked libc cost me under $1 a
system.

> Pagecache access suddenly involves cross-instance communication instead
> of swift memory access and function calls, with potentially enormous
> invalidation latencies.

Your cross-instance communication in some LPAR-like setup is tiny; it
doesn't have to bounce over ethernet in the kind of setup that Larry
talks about - in many cases it's probably doable as atomic ops in a
shared space.

> a single instance (which are just memory copies) to cross-instance
> data traffic, which involves slinging memory around through the
> hypervisor's interface, which is slower.

Why? If I want to explicitly allocate shared space, I can allocate it
shared in a setup which is LPAR-like. If it's across a LAN then yes, that's
a different kettle of fish.

> Process migration is confined to within a single instance without
> some very ugly bits; things such as forking servers and dynamic task
> creation algorithms like thread pools fall apart here.

I'd be surprised if that is an issue, because large systems either run
lots of stuff (so you can do the occasional move at fork time, which is
expensive) or customised setups. Most NUMA setups already mess around
with CPU binding to make the box fast.

> AFAICT this raises more issues than it addresses. Not that the issues
> aren't worth addressing, but there's a lot more to do than Larry
> saying "I think this is a good idea" before expecting anyone to even
> think it's worth thinking about.

Agreed

2003-09-03 20:15:20

by Diego Calleja

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, 3 Sep 2003 11:07:55 -0700, Larry McVoy <[email protected]> wrote:

> How you can look at those two companies and not see what is obvious is
> beyond me but everyone is entitled to their opinion. It's nice when your
> opinion is based on data, not religion.

What those companies sell isn't a good enough basis to draw any conclusion.

Those data don't prove you're right; they just reflect what you're *supposing*.


Diego Calleja

2003-09-03 20:32:57

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Scaling noise

On Mer, 2003-09-03 at 20:46, William Lee Irwin III wrote:
>> Pagecache access suddenly involves cross-instance communication instead
>> of swift memory access and function calls, with potentially enormous
>> invalidation latencies.

On Wed, Sep 03, 2003 at 09:13:05PM +0100, Alan Cox wrote:
> Your cross-instance communication in some LPAR-like setup is tiny; it
> doesn't have to bounce over ethernet in the kind of setup that Larry
> talks about - in many cases it's probably doable as atomic ops in a
> shared space.

I was thinking of things like truncate(), which is already a single
system latency problem.


On Mer, 2003-09-03 at 20:46, William Lee Irwin III wrote:
>> a single instance (which are just memory copies) to cross-instance
>> data traffic, which involves slinging memory around through the
>> hypervisor's interface, which is slower.

On Wed, Sep 03, 2003 at 09:13:05PM +0100, Alan Cox wrote:
> Why? If I want to explicitly allocate shared space, I can allocate it
> shared in a setup which is LPAR-like. If it's across a LAN then yes, that's
> a different kettle of fish.

It'll probably deteriorate by an additional copy plus trap costs for
hcalls for things like sockets (and pipes are precluded unless far more
cross-system integration than I've heard of is planned). Userspace APIs
for distributed shared memory are hard to program, but userspace could
exploit them to cut down on the amount of copying.


On Mer, 2003-09-03 at 20:46, William Lee Irwin III wrote:
>> Process migration is confined to within a single instance without
>> some very ugly bits; things such as forking servers and dynamic task
>> creation algorithms like thread pools fall apart here.

On Wed, Sep 03, 2003 at 09:13:05PM +0100, Alan Cox wrote:
> I'd be surprised if that is an issue, because large systems either run
> lots of stuff (so you can do the occasional move at fork time, which is
> expensive) or customised setups. Most NUMA setups already mess around
> with CPU binding to make the box fast.

A better way of phrasing this is "the load balancing problem is harder".


-- wli

2003-09-03 21:00:59

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Scaling noise

> On Wed, Sep 03, 2003 at 07:32:12PM +0100, Alan Cox wrote:
>> Now add a clusterfs and tell me the difference, other than there being a
>> lot less sharing going on...
>
> The sharing matters; e.g. libc and other massively shared bits are
> replicated in memory once per instance, which increases memory and
> cache footprint(s). A number of other consequences of the sharing loss:

Explain the cache footprint argument - if you're only using a single
copy from any given cpu, it shouldn't affect the cpu cache. More
importantly, it'll massively reduce the footprint on the NUMA
interconnect cache, which is the whole point of doing text replication.

> The number of systems to manage proliferates.

Not if you have an SSI cluster, that's the point.

> Pagecache access suddenly involves cross-instance communication instead
> of swift memory access and function calls, with potentially enormous
> invalidation latencies.

No, each node in an SSI cluster has its own pagecache, that's mostly
independent.

> Userspace IPC goes from shared memory and pipes and sockets inside
> a single instance (which are just memory copies) to cross-instance
> data traffic, which involves slinging memory around through the
> hypervisor's interface, which is slower.

Indeed, unless the hypervisor-type layer sets up an efficient cross-
communication mechanism that doesn't involve it for every transaction.
Yes, there's some cost here. If the workload is fairly "independent"
(between processes), it's easy; if it does a lot of cross-process
traffic with pipes and shit, it's going to hurt to some extent, but
it *may* be fairly small, depending on the implementation.

> The limited size of a single instance bounds the size of individual
> applications, which at various times would like to have larger memory
> footprints or consume more cpu time than fits in a single instance.
> i.e. something resembling external fragmentation of system resources.

True. depends on how the processes / threads in that app communicate
as to how big the impact would be. There's nothing saying that two
processes of the same app in an SSI cluster can't run on different
nodes ... we present a single system image to userspace, across nodes.
Some of the glue layer (eg for ps, to give a simple example), like
for_each_task, is where the hard work in doing this is.
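
(As a concrete illustration of the glue-layer point - a minimal sketch,
not taken from any real tree - this is roughly what a single-instance
walk looks like with the 2.4-era for_each_task() macro; an SSI glue
layer would have to widen the same walk to cover every node before
anything like ps saw a complete picture.)

#include <linux/kernel.h>
#include <linux/sched.h>

static void count_local_tasks(void)
{
	struct task_struct *p;
	int n = 0;

	read_lock(&tasklist_lock);	/* protects this instance's task list */
	for_each_task(p)		/* walks only this kernel's tasks */
		n++;
	read_unlock(&tasklist_lock);

	printk(KERN_INFO "local tasks: %d\n", n);
	/* the SSI glue would need to merge in the equivalent walk done
	 * on every other node before ps sees the whole cluster */
}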

> Process migration is confined to within a single instance without
> some very ugly bits; things such as forking servers and dynamic task
> creation algorithms like thread pools fall apart here.

You *need* to be able to migrate processes across nodes. Yes, it's hard.
Doing it at exec time is easier, but still far from trivial, and not
sufficient anyway.

> There's suddenly competition for and a need for dynamic shifting around
> of resources not shared across instances, like private disk space and
> devices, shares of cpu, IP numbers and other system identifiers, and
> even such things as RAM and virtual cpus.
>
> AFAICT this raises more issues than it addresses. Not that the issues
> aren't worth addressing, but there's a lot more to do than Larry
> saying "I think this is a good idea" before expecting anyone to even
> think it's worth thinking about.

It raises a lot of hard issues. It addresses a lot of hard issues.
IMHO, it's a fascinating concept, that deserves some attention, and
I'd love to work on it. However, I'm far from sure it'd work out, and
until it's proven to do so, it's unreasonable to expect people to give
up working on the existing methods in favour of an unproven (but rather
cool) pipe-dream.

What we're doing now is mostly just small incremental changes, and
unlike Larry, I don't believe it's harmful (I'm not delving back into
that debate again - see the mail archives of this list). I'd love to see
how the radical SSI cluster approach compares, when it's done. If I can
get funding for it, I'll help it get done.

M.

2003-09-03 20:24:13

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Scaling noise

> On Wed, Sep 03, 2003 at 09:21:47AM -0700, Martin J. Bligh wrote:
>> I've seen people use big machines for clusterable things, which I think
>> is a waste of money, but the cost of the machine compared to the cost
>> of admin (vs multiple machines) may have come down to the point where
>> it's worth it now. You get implicit "cluster" load balancing done in a
>> transparent way by the OS on NUMA boxes.
>
> Doesn't SSI clustering do something similar (without the efficiency of the
> interconnections though)?

Yes ... *if* someone had a implementation that worked well and was
maintainable ;-)

M.

2003-09-03 21:20:39

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Scaling noise

At some point in the past, I wrote:
>> The sharing matters; e.g. libc and other massively shared bits are
>> replicated in memory once per instance, which increases memory and
>> cache footprint(s). A number of other consequences of the sharing loss:

On Wed, Sep 03, 2003 at 01:48:59PM -0700, Martin J. Bligh wrote:
> Explain the cache footprint argument - if you're only using a single
> copy from any given cpu, it shouldn't affect the cpu cache. More
> importantly, it'll massively reduce the footprint on the NUMA
> interconnect cache, which is the whole point of doing text replication.

The single-copy-from-any-given-cpu assumption was not explicitly made.
Some of this depends on how the administrator (or whoever) wants to
arrange OS instances so that when one becomes blocked on I/O or
otherwise idled others can make progress, or on other forms of
overcommitment.


At some point in the past, I wrote:
>> The number of systems to manage proliferates.

On Wed, Sep 03, 2003 at 01:48:59PM -0700, Martin J. Bligh wrote:
> Not if you have an SSI cluster, that's the point.

The scenario described above wasn't SSI but independent instances with
a shared distributed fs. SSI clusters have most of the same problems,
really. Managing the systems just becomes "managing the nodes" because
they're not called systems, and you have to go through some (possibly
automated, though not likely) hassle to figure out the right way to
spread things across nodes, which virtualizes pieces to hand to which
nodes running which loads, etc.


At some point in the past, I wrote:
>> Pagecache access suddenly involves cross-instance communication instead
>> of swift memory access and function calls, with potentially enormous
>> invalidation latencies.

On Wed, Sep 03, 2003 at 01:48:59PM -0700, Martin J. Bligh wrote:
> No, each node in an SSI cluster has its own pagecache, that's mostly
> independent.

But not totally. truncate() etc. need handling, i.e. cross-instance
pagecache invalidations. And write() too. =)


At some point in the past, I wrote:
>> The limited size of a single instance bounds the size of individual
>> applications, which at various times would like to have larger memory
>> footprints or consume more cpu time than fits in a single instance.
>> i.e. something resembling external fragmentation of system resources.

On Wed, Sep 03, 2003 at 01:48:59PM -0700, Martin J. Bligh wrote:
> True. depends on how the processes / threads in that app communicate
> as to how big the impact would be. There's nothing saying that two
> processes of the same app in an SSI cluster can't run on different
> nodes ... we present a single system image to userspace, across nodes.
> Some of the glue layer (eg for ps, to give a simple example), like
> for_each_task, is where the hard work in doing this is.

Well, let's try the word "process" then. e.g. 4GB nodes and a process
that suddenly wants to inflate to 8GB due to some ephemeral load
imbalance.


-- wli

2003-09-03 21:40:38

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Scaling noise

> On Wed, Sep 03, 2003 at 01:48:59PM -0700, Martin J. Bligh wrote:
>> Not if you have an SSI cluster, that's the point.
>
> The scenario described above wasn't SSI but independent instances with
> a shared distributed fs.

OK, sorry - crosstalk confusion between subthreads.

> SSI clusters have most of the same problems,
> really. Managing the systems just becomes "managing the nodes" because
> they're not called systems, and you have to go through some (possibly
> automated, though not likely) hassle to figure out the right way to
> spread things across nodes, which virtualizes pieces to hand to which
> nodes running which loads, etc.

That's where I disagree - it's much easier for the USER because an SSI
cluster works out all the load balancing shit for itself, instead of
pushing the problem out to userspace. It's much harder for the KERNEL
programmer, sure ... but we're smart ;-) And I'd rather solve it once,
properly, in the right place where all the right data is about all
the apps running on the system, and the data about the machine hardware.

> On Wed, Sep 03, 2003 at 01:48:59PM -0700, Martin J. Bligh wrote:
>> No, each node in an SSI cluster has its own pagecache, that's mostly
>> independent.
>
> But not totally. truncate() etc. need handling, i.e. cross-instance
> pagecache invalidations. And write() too. =)

Same problem as any clustered fs, but yes, truncate will suck harder
than it does now. Not sure I care though ;-) Invalidations on write,
etc. will be more expensive when we're sharing files across nodes, but
independent operations will be cheaper due to the locality. It's a
tradeoff - whether it pays off or not depends on the workload.

>> True. depends on how the processes / threads in that app communicate
>> as to how big the impact would be. There's nothing saying that two
>> processes of the same app in an SSI cluster can't run on different
>> nodes ... we present a single system image to userspace, across nodes.
>> Some of the glue layer (eg for ps, to give a simple example), like
>> for_each_task, is where the hard work in doing this is.
>
> Well, let's try the word "process" then. e.g. 4GB nodes and a process
> that suddenly wants to inflate to 8GB due to some ephemeral load
> imbalance.

Well, if you mean "task" in the linux sense (ie not a multi-threaded
process), that reduces us from worrying about tasks to memory. On an
SSI cluster that's on a NUMA machine, we could loan memory across nodes
or something, but yes, that's definitely a problem area. It ain't no
panacea ;-)

M.

2003-09-03 21:58:46

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Scaling noise

>> That's where I disagree - it's much easier for the USER because an SSI
>> cluster works out all the load balancing shit for itself, instead of
>> pushing the problem out to userspace. It's much harder for the KERNEL
>> programmer, sure ... but we're smart ;-) And I'd rather solve it once,
>> properly, in the right place where all the right data is about all
>> the apps running on the system, and the data about the machine hardware.
>
> This is only truly feasible when the nodes are homogeneous. They will
> not be as there will be physical locality (esp. bits like device
> proximity) concerns.

Same problem as a traditional setup of a NUMA system - the scheduler
needs to try to move the process closer to the resources it's using.

> It's vaguely possible some kind of punting out
> of the kernel of the solutions to these concerns is possible, but upon
> the assumption it will appear, we descend further toward science fiction.

Nah, punting to userspace is crap - they have no more ability to solve
this than we do on any sort of dynamic workload, and in most cases,
much worse - they don't have the information that the kernel has available,
at least not on a timely basis. The scheduler belongs in the kernel,
where it can balance decisions across all of userspace, and we have
all the info we need rapidly and efficiently available.

> Some of these proposals also beg the question of "who's going to write
> the rest of the hypervisor supporting this stuff?", which is ominous.

Yeah, it needs lots of hard work by bright people. It's not easy.

M.

2003-09-03 21:50:58

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Scaling noise

At some point in the past, I wrote:
>> SSI clusters have most of the same problems,
>> really. Managing the systems just becomes "managing the nodes" because
>> they're not called systems, and you have to go through some (possibly
>> automated, though not likely) hassle to figure out the right way to
>> spread things across nodes, which virtualizes pieces to hand to which
>> nodes running which loads, etc.

On Wed, Sep 03, 2003 at 02:29:01PM -0700, Martin J. Bligh wrote:
> That's where I disagree - it's much easier for the USER because an SSI
> cluster works out all the load balancing shit for itself, instead of
> pushing the problem out to userspace. It's much harder for the KERNEL
> programmer, sure ... but we're smart ;-) And I'd rather solve it once,
> properly, in the right place where all the right data is about all
> the apps running on the system, and the data about the machine hardware.

This is only truly feasible when the nodes are homogeneous. They will
not be as there will be physical locality (esp. bits like device
proximity) concerns. It's vaguely possible some kind of punting out
of the kernel of the solutions to these concerns is possible, but upon
the assumption it will appear, we descend further toward science fiction.

Some of these proposals also beg the question of "who's going to write
the rest of the hypervisor supporting this stuff?", which is ominous.


-- wli

2003-09-03 23:52:04

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 11:25:24AM -0700, William Lee Irwin III wrote:
>> Restating the above in slow motion:
>> (a) economic arguments make me want to puke in the face of the presenter

On Wed, Sep 03, 2003 at 04:47:37PM -0700, Larry McVoy wrote:
> That sounds like a self control problem. Anger management maybe?

No. It's rather meant to be a humorous sounding way to say economic
arguments regarding technical merit disgust me.


-- wli

2003-09-03 23:48:08

by Larry McVoy

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 11:25:24AM -0700, William Lee Irwin III wrote:
> Restating the above in slow motion:
> (a) economic arguments make me want to puke in the face of the presenter

That sounds like a self control problem. Anger management maybe?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-04 00:07:23

by Mike Fedyk

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 02:46:06PM -0700, Martin J. Bligh wrote:
> > Some of these proposals also beg the question of "who's going to write
> > the rest of the hypervisor supporting this stuff?", which is ominous.
>
> Yeah, it needs lots of hard work by bright people. It's not easy.

Has OpenMosix done anything to help in this regard, or is it unmaintainable?
ISTR that much of its code is asm, and going from 2.4.18 to 19 took a long
time to stabilize (that was the last time I used OM).

2003-09-03 23:57:23

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Scaling noise

--Larry McVoy <[email protected]> wrote (on Wednesday, September 03, 2003 16:47:37 -0700):

> On Wed, Sep 03, 2003 at 11:25:24AM -0700, William Lee Irwin III wrote:
>> Restating the above in slow motion:
>> (a) economic arguments make me want to puke in the face of the presenter
>
> That sounds like a self control problem. Anger management maybe?

Nah, that'd be if he wanted to punch your lights out.
Real-world detachment issues, or taste overload, maybe .... ;-)

M.

2003-09-04 00:37:28

by Larry McVoy

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 11:17:56AM -0700, Martin J. Bligh wrote:
> Errm, IBM is gambling with _their_ money, as are others such as HP,
> and they're making big iron.

But making very little money off of it. I'd love to get revenue numbers
as a function of time for >8 processor SMP boxes. I'll bet my left nut
that the numbers are going down. They have to be, CPUs are fast enough
to handle most problems, clustering has worked for lots of big companies
like Google, Amazon, Yahoo, and the HPC market has been flat for years.
So where's the growth? Nowhere I can see. If I'm not seeing it, show
me the data. I may be a pain in the ass but I'll change my mind instantly
when you show me data that says something different than what I believe.
So far, all I've seen is people having fun proving that their ego is
bigger than the next guys, no real data. Come on, you'd love nothing
better than to prove me wrong. Do it. Or admit that you can't.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-04 00:49:54

by Larry McVoy

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 08:50:46AM -0700, Martin J. Bligh wrote:
> > Err, when did I ever say it wasn't SSI? If you look at what I said it's
> > clearly SSI. Unified process, device, file, and memory namespaces.
>
> I think it was the bit when you suggested using bitkeeper to sync multiple
> /etc/passwd files when I really switched off ... perhaps you were just
> joking ;-) Perhaps we just had a massive communication disconnect.

I wasn't joking, but that has nothing to do with clusters. The BK license
has a "single user is free" mode because I wanted very much to allow distros
to use BK to control their /etc files. It would be amazingly useful if you
could do an upgrade and merge your config changes with their config changes.
Instead we're still in the 80's in terms of config files.

By the way, I could care less if it were BK, CVS, SVN, SCCS, RCS,
whatever. The config files need to be under version control and you
need to be able to merge in your changes. BK is what I'd like because
I understand it and know it would work, but it's not a BK thing at all,
I'd happily do work on RCS or whatever to make this happen. It's just
amazingly painful that these files aren't under version control, it's
stupid, there is an obviously better answer and the distros aren't
seeing it. Bummer.

But this has nothing to do with clusters.

> > I'm pretty sure people were so eager to argue with my lovely personality
> > that they never bothered to understand the architecture. It's _always_
> > been SSI. I have slides going back at least 4 years that state this:
> >
> > http://www.bitmover.com/talks/smp-clusters
> > http://www.bitmover.com/talks/cliq
>
> I can go back and re-read them, if I misread them last time than I apologise.
> I've also shifted perspectives on SSI clusters somewhat over the last year.
> Yes, if it's SSI, I'd agree for the most part ... once it's implemented ;-)

Cool!

> I'd rather start with everything separate (one OS instance per node), and
> bind things back together, than split everything up. However, I'm really
> not sure how feasible it is until we actually have something that works.

I'm in 100% agreement. It's much better to have a bunch of OS's and pull
them together than have one and try and pry it apart.

> I have a rough plan of how to go about it mapped out, in small steps that
> might be useful by themselves. It's a lot of fairly complex hard work ;-)

I've spent quite a bit of time thinking about this and if it started going
anywhere it would be easy for you to tell me to put up or shut up. I'd
be happy to do some real work on this. Maybe it would just be doing the
architecture stuff but I strongly suspect there are few people out there
masochistic enough to make controlling tty semantics work properly in this
environment. I don't want to do it, I'd love someone else to do it, but
if noone steps up to the bat I will. I did all the POSIX crud in SunOS,
I understand the issues, I can do it here and it is part of the least fun
work so if I'm pushing the model I should be willing to put some work into
the non fun part.

The VM work is a lot more fun, I'd like to play there but I suspect that if
we got rolling there are far more talented people who would push me aside.
That's cool, the best people should do the work.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-04 00:58:34

by Larry McVoy

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 02:51:35PM -0700, William Lee Irwin III wrote:
> At some point in the past, I wrote:
> >> SSI clusters have most of the same problems,
> >> really. Managing the systems just becomes "managing the nodes" because
> >> they're not called systems, and you have to go through some (possibly
> >> automated, though not likely) hassle to figure out the right way to
> >> spread things across nodes, which virtualizes pieces to hand to which
> >> nodes running which loads, etc.
>
> On Wed, Sep 03, 2003 at 02:29:01PM -0700, Martin J. Bligh wrote:
> > That's where I disagree - it's much easier for the USER because an SSI
> > cluster works out all the load balancing shit for itself, instead of
> > pushing the problem out to userspace. It's much harder for the KERNEL
> > programmer, sure ... but we're smart ;-) And I'd rather solve it once,
> > properly, in the right place where all the right data is about all
> > the apps running on the system, and the data about the machine hardware.
>
> This is only truly feasible when the nodes are homogeneous. They will
> not be as there will be physical locality (esp. bits like device
> proximity) concerns.

Huh? The nodes are homogeneous. Devices are either local or proxied.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-04 01:11:56

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 02:51:35PM -0700, William Lee Irwin III wrote:
>> This is only truly feasible when the nodes are homogeneous. They will
>> not be as there will be physical locality (esp. bits like device
>> proximity) concerns.

On Wed, Sep 03, 2003 at 05:58:22PM -0700, Larry McVoy wrote:
> Huh? The nodes are homogeneous. Devices are either local or proxied.

Virtualized devices are backed by real devices at some level, so the
distance from the node's physical location to the device's then matters.


-- wli

2003-09-04 01:07:12

by Larry McVoy

[permalink] [raw]
Subject: Re: Scaling noise

Here's a thought. Maybe the next kernel summit needs to have a CC cluster
BOF or whatever. I'd be happy to show up, describe what it is that I see
and have you all try and poke holes in it. If the net result was that you
walked away with the same picture in your head that I have that would be
cool. Heck, I'll sponser it and buy beer and food if you like.

On Wed, Sep 03, 2003 at 02:46:06PM -0700, Martin J. Bligh wrote:
> >> That's where I disagree - it's much easier for the USER because an SSI
> >> cluster works out all the load balancing shit for itself, instead of
> >> pushing the problem out to userspace. It's much harder for the KERNEL
> >> programmer, sure ... but we're smart ;-) And I'd rather solve it once,
> >> properly, in the right place where all the right data is about all
> >> the apps running on the system, and the data about the machine hardware.
> >
> > This is only truly feasible when the nodes are homogeneous. They will
> > not be as there will be physical locality (esp. bits like device
> > proximity) concerns.
>
> Same problem as a traditional setup of a NUMA system - the scheduler
> needs to try to move the process closer to the resources it's using.
>
> > It's vaguely possible some kind of punting out
> > of the kernel of the solutions to these concerns is possible, but upon
> > the assumption it will appear, we descend further toward science fiction.
>
> Nah, punting to userspace is crap - they have no more ability to solve
> this than we do on any sort of dynamic workload, and in most cases,
> much worse - they don't have the information that the kernel has available,
> at least not on a timely basis. The scheduler belongs in the kernel,
> where it can balance decisions across all of userspace, and we have
> all the info we need rapidly and efficiently available.
>
> > Some of these proposals also beg the question of "who's going to write
> > the rest of the hypervisor supporting this stuff?", which is ominous.
>
> Yeah, it needs lots of hard work by bright people. It's not easy.
>
> M.

--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-04 01:11:05

by Larry McVoy

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 06:06:53PM -0700, Larry McVoy wrote:
> Here's a thought. Maybe the next kernel summit needs to have a CC cluster
> BOF or whatever. I'd be happy to show up, describe what it is that I see
> and have you all try and poke holes in it. If the net result was that you
> walked away with the same picture in your head that I have that would be
> cool. Heck, I'll sponser it and buy beer and food if you like.

Oops. s/sponser/sponsor/. Long day, sorry.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-04 01:32:25

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 06:06:53PM -0700, Larry McVoy wrote:
> Here's a thought. Maybe the next kernel summit needs to have a CC cluster
> BOF or whatever. I'd be happy to show up, describe what it is that I see
> and have you all try and poke holes in it. If the net result was that you
> walked away with the same picture in your head that I have that would be
> cool. Heck, I'll sponser it and buy beer and food if you like.

It'd be nice if there were a prototype or something around to at least
get a feel for whether it's worthwhile and how it behaves.

Most of the individual mechanisms have other uses ranging from playing
the good citizen under a hypervisor to just plain old filesharing, so
it should be vaguely possible to get a couple kernels talking and
farting around without much more than 1-2 P-Y's for bootstrapping bits
and some unspecified amount of pain for missing pieces of the above.

Unfortunately, this means
(a) the box needs a hypervisor (or equivalent in native nomenclature)
(b) substantial outlay of kernel hacking time (who's doing this?)

I'm vaguely attached to the idea of there being _something_ to assess,
otherwise it's difficult to ground the discussions in evidence, though
worse comes to worst, we can break down to plotting and scheming again.


-- wli

2003-09-04 01:51:15

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 06:46:02PM -0700, David Lang wrote:
> how much of this need could be met with a native linux master and kernels
> running user-mode kernels? (your resource sharing would obviously not be
> that clean, but you could develop the tools to work across the kernel
> images this way)

Probably a fair amount.


-- wli

2003-09-04 01:47:02

by Daniel Phillips

[permalink] [raw]
Subject: Re: Scaling noise

On Wednesday 03 September 2003 17:31, Steven Cole wrote:
> On Wed, 2003-09-03 at 06:47, Antonio Vargas wrote:
> > As you may probably know, CC-clusters were heavily advocated by the
> > same Larry McVoy who has started this thread.
>
> Yes, thanks. I'm well aware of that. I would like to get a discussion
> going again on CC-clusters, since that seems to be a way out of the
> scaling spiral. Here is an interesting link:
> http://www.opersys.com/adeos/practical-smp-clusters/

As you know, the argument is that locking overhead grows by some factor worse
than linear as the size of an SMP cluster increases, so that the locking
overhead explodes at some point, and thus it would be more efficient to
eliminate the SMP overhead entirely and run a cluster of UP kernels,
communicating through the high bandwidth channel provided by shared memory.
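
(A toy way to see the shape of that curve, purely in userspace and with
nothing to do with the kernel's real locks: the sketch below - my
illustration only, not anything from the thread - times N threads
hammering one shared pthread mutex; plot the per-operation cost against
N and you get the kind of worse-than-linear growth the argument is about.)

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERS 1000000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

static void *worker(void *arg)
{
	int i;

	for (i = 0; i < ITERS; i++) {
		pthread_mutex_lock(&lock);	/* the single contended lock */
		counter++;
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}

int main(int argc, char **argv)
{
	int nthreads = argc > 1 ? atoi(argv[1]) : 2;
	pthread_t *tid = malloc(nthreads * sizeof(*tid));
	struct timespec t0, t1;
	double secs;
	int i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < nthreads; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%d threads: %.1f ns per locked increment\n",
	       nthreads, secs * 1e9 / ((double)nthreads * ITERS));
	free(tid);
	return 0;
}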

There are other arguments, such as how complex locking is, and how it will
never work correctly, but those are noise: it's pretty much done now, the
complexity is still manageable, and Linux has never been more stable.

There was a time when SMP locking overhead actually cost something in the
high single digits of percent on Linux, on certain loads. Today, you'd have
to work at it to
find a real load where the 2.5/6 kernel spends more than 1% of its time in
locking overhead, even on a large SMP machine (sample size of one: I asked
Bill Irwin how his 32 node Numa cluster is running these days). This blows
the ccCluster idea out of the water, sorry. The only way ccCluster gets to
live is if SMP locking is pathetic and it's not.

As for Karim's work, it's a quintessentially flashy trick to make two UP
kernels run on a dual processor. It's worth doing, but not because it blazes
the way forward for ccClusters. It can be the basis for hot kernel swap:
migrate all the processes to one of the two CPUs, load and start a new kernel
on the other one, migrate all processes to it, and let the new kernel restart
the first processor, which is now idle.
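
(For the first step of that sequence - herding every process onto one
CPU - a minimal userspace sketch might look like the following. This is
my illustration only, using the current glibc sched_setaffinity()
interface; the 2003-era syscall wrapper took a raw bitmask rather than
a cpu_set_t, and a real implementation would walk every pid instead of
taking one on the command line.)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	cpu_set_t mask;
	pid_t pid = (argc > 1) ? (pid_t)atoi(argv[1]) : 0;	/* 0 = caller */

	CPU_ZERO(&mask);
	CPU_SET(0, &mask);			/* allow CPU 0 only */

	if (sched_setaffinity(pid, sizeof(mask), &mask) != 0) {
		perror("sched_setaffinity");
		return 1;
	}
	printf("pid %d now confined to CPU 0\n",
	       pid ? (int)pid : (int)getpid());
	return 0;
}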

Regards,

Daniel

2003-09-04 01:49:22

by David Lang

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, 3 Sep 2003, William Lee Irwin III wrote:

> On Wed, Sep 03, 2003 at 06:06:53PM -0700, Larry McVoy wrote:
> > Here's a thought. Maybe the next kernel summit needs to have a CC cluster
> > BOF or whatever. I'd be happy to show up, describe what it is that I see
> > and have you all try and poke holes in it. If the net result was that you
> > walked away with the same picture in your head that I have that would be
> > cool. Heck, I'll sponser it and buy beer and food if you like.
>
> It'd be nice if there were a prototype or something around to at least
> get a feel for whether it's worthwhile and how it behaves.
>
> Most of the individual mechanisms have other uses ranging from playing
> the good citizen under a hypervisor to just plain old filesharing, so
> it should be vaguely possible to get a couple kernels talking and
> farting around without much more than 1-2 P-Y's for bootstrapping bits
> and some unspecified amount of pain for missing pieces of the above.
>
> Unfortunately, this means
> (a) the box needs a hypervisor (or equivalent in native nomenclature)

how much of this need could be met with a native linux master and kernels
running user-mode kernels? (your resource sharing would obviously not be
that clean, but you could develop the tools to work across the kernel
images this way)

David Lang

> (b) substantial outlay of kernel hacking time (who's doing this?)
>
> I'm vaguely attached to the idea of there being _something_ to assess,
> otherwise it's difficult to ground the discussions in evidence, though
> worse comes to worse, we can break down to plotting and scheming again.
>
>
> -- wli

2003-09-04 01:55:42

by Larry McVoy

[permalink] [raw]
Subject: Re: Scaling noise

On Thu, Sep 04, 2003 at 03:50:31AM +0200, Daniel Phillips wrote:
> There are other arguments, such as how complex locking is, and how it will
> never work correctly, but those are noise: it's pretty much done now, the
> complexity is still manageable, and Linux has never been more stable.

yeah, right. I'm not sure what you are smoking but I'll avoid your dealer.

Your politics are showing, Daniel. Try staying focussed on the technical
merits and we can have a discussion. Otherwise you just get ignored.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-04 02:17:38

by Daniel Phillips

[permalink] [raw]
Subject: Re: Scaling noise

On Thursday 04 September 2003 02:49, Larry McVoy wrote:
> It's much better to have a bunch of OS's and pull
> them together than have one and try and pry it apart.

This is bogus. The numbers clearly don't work if the ccCluster is made of
uniprocessors, so obviously the SMP locking has to be implemented anyway, to
get each node up to the size just below the supposed knee in the scaling
curve. This eliminates the argument about saving complexity and/or work.

The way Linux scales now, the locking stays out of the range where SSI could
compete up to, what? 128 processors? More? Maybe we'd better ask SGI about
that, but we already know what the answer is for 32: boring old SMP wins
hands down. Where is the machine that has the knee in the wrong part of the
curve? Oh, maybe we should all just stop whatever work we're doing and wait
ten years for one to show up.

But far be it from me to suggest that reality should interfere with your fun.

Regards,

Daniel

2003-09-04 02:17:54

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Scaling noise

On Thu, Sep 04, 2003 at 03:50:31AM +0200, Daniel Phillips wrote:
> Bill Irwin how his 32 node Numa cluster is running these days). This blows

Sorry for any misunderstanding, the model only goes to 16 nodes/64x,
the box mentioned was 32 cpus. It's also SMP (SSI, shared memory,
mach-numaq), not a cluster. I also only have half of it full-time.


-- wli

2003-09-04 02:22:56

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Scaling noise

--Larry McVoy <[email protected]> wrote (on Wednesday, September 03, 2003 17:36:33 -0700):

> On Wed, Sep 03, 2003 at 11:17:56AM -0700, Martin J. Bligh wrote:
>> Errm, IBM is gambling with _their_ money, as are others such as HP,
>> and they're making big iron.
>
> But making very little money off of it. I'd love to get revenue numbers
> as a function of time for >8 processor SMP boxes. I'll bet my left nut
> that the numbers are going down.

A tempting offer. Make it a legally binding written document, and I'm sure
I can go get them for you ;-)

AFAIK, Dell just wimped out of the 8x market, so we're talking about
more than 4x really, I think.

> They have to be, CPUs are fast enough
> to handle most problems, clustering has worked for lots of big companies
> like Google, Amazon, Yahoo, and the HPC market has been flat for years.
> So where's the growth? Nowhere I can see. If I'm not seeing it, show
> me the data. I may be a pain in the ass but I'll change my mind instantly
> when you show me data that says something different than what I believe.
> So far, all I've seen is people having fun proving that their ego is
> bigger than the next guys, no real data. Come on, you'd love nothing
> better than to prove me wrong. Do it. Or admit that you can't.

Not quite sure why the onus is on the rest of us to disprove your pet
theory, rather than you to prove it.

M.

2003-09-04 02:21:52

by Steven Cole

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, 2003-09-03 at 19:50, Daniel Phillips wrote:
> On Wednesday 03 September 2003 17:31, Steven Cole wrote:
> > On Wed, 2003-09-03 at 06:47, Antonio Vargas wrote:
> > > As you may probably know, CC-clusters were heavily advocated by the
> > > same Larry McVoy who has started this thread.
> >
> > Yes, thanks. I'm well aware of that. I would like to get a discussion
> > going again on CC-clusters, since that seems to be a way out of the
> > scaling spiral. Here is an interesting link:
> > http://www.opersys.com/adeos/practical-smp-clusters/
>
> As you know, the argument is that locking overhead grows by some factor worse
> than linear as the size of an SMP cluster increases, so that the locking
> overhead explodes at some point, and thus it would be more efficient to
> eliminate the SMP overhead entirely and run a cluster of UP kernels,
> communicating through the high bandwidth channel provided by shared memory.
>
> There are other arguments, such as how complex locking is, and how it will
> never work correctly, but those are noise: it's pretty much done now, the
> complexity is still manageable, and Linux has never been more stable.
>
> There was a time when SMP locking overhead actually cost something in the high
> single digits on Linux, on certain loads. Today, you'd have to work at it to
> find a real load where the 2.5/6 kernel spends more than 1% of its time in
> locking overhead, even on a large SMP machine (sample size of one: I asked
> Bill Irwin how his 32 node Numa cluster is running these days). This blows
> the ccCluster idea out of the water, sorry. The only way ccCluster gets to
> live is if SMP locking is pathetic and it's not.

I would never call the SMP locking pathetic, but it could be improved.
Looking at Figure 6 (Star-CD, 1-64 processors on Altix) and Figure 7
(Gaussian 1-32 processors on Altix) on page 13 of "Linux Scalability for
Large NUMA Systems", available for download here:
http://archive.linuxsymposium.org/ols2003/Proceedings/
it appears that for those applications, the curves begin to flatten
rather alarmingly. This may have little to do with locking overhead.

One possible benefit of using ccClusters would be to stay on that lower
part of the curve for the nodes, using perhaps 16 CPUs in a node. That
way, a 256 CPU (e.g. Altix 3000) system might perform better than if a
single kernel were to be used. I say might. It's likely that only
empirical data will tell the tale for sure.

>
> As for Karim's work, it's a quintessentially flashy trick to make two UP
> kernels run on a dual processor. It's worth doing, but not because it blazes
> the way forward for ccClusters. It can be the basis for hot kernel swap:
> migrate all the processes to one of the two CPUs, load and start a new kernel
> on the other one, migrate all processes to it, and let the new kernel restart
> the first processor, which is now idle.
>
Thank you for that very succinct summary of my rather long-winded
exposition on that subject which I posted here:
http://marc.theaimsgroup.com/?l=linux-kernel&m=105214105131450&w=2
Quite a bit of the complexity which I mentioned, if it were necessary at
all, could go into user-space helper processes which get spawned for the
outgoing kernel as it goes away, and before init for the incoming kernel.
Also, my comment about not being able to shoe-horn two kernels in at once
for 32-bit arches may have been addressed by Ingo's 4G/4G split.

Steven

2003-09-04 02:33:51

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Scaling noise

> On Wed, Sep 03, 2003 at 06:06:53PM -0700, Larry McVoy wrote:
>> Here's a thought. Maybe the next kernel summit needs to have a CC cluster
>> BOF or whatever. I'd be happy to show up, describe what it is that I see
>> and have you all try and poke holes in it. If the net result was that you
>> walked away with the same picture in your head that I have that would be
>> cool. Heck, I'll sponser it and buy beer and food if you like.
>
> It'd be nice if there were a prototype or something around to at least
> get a feel for whether it's worthwhile and how it behaves.
>
> Most of the individual mechanisms have other uses ranging from playing
> the good citizen under a hypervisor to just plain old filesharing, so
> it should be vaguely possible to get a couple kernels talking and
> farting around without much more than 1-2 P-Y's for bootstrapping bits
> and some unspecified amount of pain for missing pieces of the above.
>
> Unfortunately, this means
> (a) the box needs a hypervisor (or equivalent in native nomenclature)
> (b) substantial outlay of kernel hacking time (who's doing this?)
>
> I'm vaguely attached to the idea of there being _something_ to assess,
> otherwise it's difficult to ground the discussions in evidence, though
> worse comes to worst, we can break down to plotting and scheming again.

I don't think the initial development baby-steps are *too* bad, and don't
even have to be done on a NUMA box - a pair of PCs connected by 100baseT
would work. Personally, I think the first step is to do task migration -
migrate a process from one linux instance to another without it realising.
Start without the more complex bits like shared filehandles, etc. Something
that just writes 1,2,3,4 to a file. It could even just use shared root NFS,
I think that works already.
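
(Something like this would do for the guinea pig - a minimal sketch,
with an illustrative path only: it appends 1,2,3,4 to a file on the
shared root with a long pause between writes, so there's plenty of time
to migrate it mid-run and then check the output still came out in order.)

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	FILE *f = fopen("/shared/counter.txt", "a");	/* path is illustrative */
	int i;

	if (!f)
		return 1;
	for (i = 1; i <= 4; i++) {
		fprintf(f, "%d\n", i);
		fflush(f);	/* make each write visible before sleeping */
		sleep(10);	/* window in which to migrate the task */
	}
	fclose(f);
	return 0;
}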

Basically swap it out on one node, and in on another, though obviously
there's more state to take across than just RAM. I was talking to Tridge
the other day, and he said someone had hacked up something in userspace
which kinda worked ... I'll get some details.

I view UP -> SMP -> NUMA -> SSI on NUMA -> SSI on many PCs -> beowulf cluster
as a continuum ... the SSI problems are easier on NUMA, because you can
wimp out on things like shmem much easier, but it's all similar.

M.

2003-09-04 02:35:59

by Larry McVoy

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 07:21:29PM -0700, Martin J. Bligh wrote:
> --Larry McVoy <[email protected]> wrote (on Wednesday, September 03, 2003 17:36:33 -0700):
> > They have to be, CPUs are fast enough
> > to handle most problems, clustering has worked for lots of big companies
> > like Google, Amazon, Yahoo, and the HPC market has been flat for years.
> > So where's the growth? Nowhere I can see. If I'm not seeing it, show
> > me the data. I may be a pain in the ass but I'll change my mind instantly
> > when you show me data that says something different than what I believe.
> > So far, all I've seen is people having fun proving that their ego is
> > bigger than the next guys, no real data. Come on, you'd love nothing
> > better than to prove me wrong. Do it. Or admit that you can't.
>
> Not quite sure why the onus is on the rest of us to disprove your pet
> theory, rather than you to prove it.

Maybe because history has shown over and over again that your pet theory
doesn't work. Mine might be wrong but it hasn't been proven wrong. Yours
has. Multiple times.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-04 02:34:01

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 08:19:26PM -0600, Steven Cole wrote:
> I would never call the SMP locking pathetic, but it could be improved.
> Looking at Figure 6 (Star-CD, 1-64 processors on Altix) and Figure 7
> (Gaussian 1-32 processors on Altix) on page 13 of "Linux Scalability for
> Large NUMA Systems", available for download here:
> http://archive.linuxsymposium.org/ols2003/Proceedings/
> it appears that for those applications, the curves begin to flatten
> rather alarmingly. This may have little to do with locking overhead.

Those numbers are 2.4.x


-- wli

2003-09-04 02:36:17

by Martin J. Bligh

[permalink] [raw]
Subject: SSI clusters on NUMA (was Re: Scaling noise)

> how much of this need could be met with a native linux master and kernels
> running user-mode kernels? (your resource sharing would obviously not be
> that clean, but you could develop the tools to work across the kernel
> images this way)

I talked to Jeff and Andrea about this at KS & OLS this year ... the feeling
was that UML was too much overhead, but there were various ways to reduce
that, especially if the underlying OS had UML support (doesn't require it
right now).

I'd really like to see the performance proved to be better before basing
a design on UML, though that was my first instinct of how to do it ...

M.

2003-09-04 02:38:35

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Scaling noise

> On Thursday 04 September 2003 02:49, Larry McVoy wrote:
>> It's much better to have a bunch of OS's and pull
>> them together than have one and try and pry it apart.
>
> This is bogus. The numbers clearly don't work if the ccCluster is made of
> uniprocessors, so obviously the SMP locking has to be implemented anyway, to
> get each node up to the size just below the supposed knee in the scaling
> curve. This eliminates the argument about saving complexity and/or work.
>
> The way Linux scales now, the locking stays out of the range where SSI could
> compete up to, what? 128 processors? More? Maybe we'd better ask SGI about
> that, but we already know what the answer is for 32: boring old SMP wins
> hands down. Where is the machine that has the knee in the wrong part of the
> curve? Oh, maybe we should all just stop whatever work we're doing and wait
> ten years for one to show up.
>
> But far be it from me to suggest that reality should interfere with your fun.

Yes, you need locking, but only for the bits where you glue stuff back
together. Plenty of bits can operate independently per node, or at
least ... I'm hoping they can in my vapourware world ;-)

M.

2003-09-04 02:41:40

by Mike Fedyk

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 07:31:13PM -0700, Martin J. Bligh wrote:
> > Unfortunately, this means
> > (a) the box needs a hypervisor (or equivalent in native nomenclature)
> > (b) substantial outlay of kernel hacking time (who's doing this?)
> >
> > I'm vaguely attached to the idea of there being _something_ to assess,
> > otherwise it's difficult to ground the discussions in evidence, though
> > worse comes to worse, we can break down to plotting and scheming again.
>
> I don't think the initial development baby-steps are *too* bad, and don't
> even have to be done on a NUMA box - a pair of PCs connected by 100baseT
> would work. Personally, I think the first step is to do task migration -
> migrate a process without it realising from one linux instance to another.
> Start without the more complex bits like shared filehandles, etc. Something
> that just writes 1,2,3,4 to a file. It could even just use shared root NFS,
> I think that works already.
>
> Basically swap it out on one node, and in on another, though obviously
> there's more state to take across than just RAM. I was talking to Tridge
> the other day, and he said someone had hacked up something in userspace
> which kinda worked ... I'll get some details.
>
> I view UP -> SMP -> NUMA -> SSI on NUMA -> SSI on many PCs -> beowulf cluster
> as a continuum ... the SSI problems are easier on NUMA, because you can
> wimp out on things like shmem much easier, but it's all similar.

Maybe I'm missing something, but why hasn't openmosix been brought into this
discussion? It looks like the perfect base for something like this. All
that it needs is some cleanup.

2003-09-04 02:44:42

by Steven Cole

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, 2003-09-03 at 20:35, William Lee Irwin III wrote:
> On Wed, Sep 03, 2003 at 08:19:26PM -0600, Steven Cole wrote:
> > I would never call the SMP locking pathetic, but it could be improved.
> > Looking at Figure 6 (Star-CD, 1-64 processors on Altix) and Figure 7
> > (Gaussian 1-32 processors on Altix) on page 13 of "Linux Scalability for
> > Large NUMA Systems", available for download here:
> > http://archive.linuxsymposium.org/ols2003/Proceedings/
> > it appears that for those applications, the curves begin to flatten
> > rather alarmingly. This may have little to do with locking overhead.
>
> Those numbers are 2.4.x

Yes, I saw that. It would be interesting to see results for recent
2.6.0-testX kernels. Judging from other recent numbers out of OSDL, the
results for 2.6 should be quite a bit better. But won't the curves
still begin to flatten, just at a higher CPU count? Or has the miracle
goodness of RCU pushed those limits to insanely high numbers?

Steven

2003-09-04 02:47:47

by Larry McVoy

[permalink] [raw]
Subject: Re: Scaling noise

On Thu, Sep 04, 2003 at 04:21:16AM +0200, Daniel Phillips wrote:
> On Thursday 04 September 2003 02:49, Larry McVoy wrote:
> > It's much better to have a bunch of OS's and pull
> > them together than have one and try and pry it apart.
>
> This is bogus. The numbers clearly don't work if the ccCluster is made of
> uniprocessors, so obviously the SMP locking has to be implemented anyway, to
> get each node up to the size just below the supposed knee in the scaling
> curve. This eliminates the argument about saving complexity and/or work.

If you thought before you spoke you'd realize how wrong you are. How many
locks are there in the IRIX/Solaris/Linux I/O path? How many are needed for
2-4 way scaling?

Here's the litmus test: list all the locks in the kernel and the locking
hierarchy. If you, a self claimed genius, can't do it, how can the rest
of us mortals possibly do it? Quick. You have 30 seconds, I want a list.
A complete list with the locking hierarchy, no silly awk scripts. You have
to show which locks can deadlock, from memory.

No list? Cool, you just proved my point.
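
The "locking hierarchy" being demanded here is just a fixed global order in
which locks may be taken; violate it on one path and two paths can deadlock
against each other. A minimal userspace sketch of that point, with made-up
lock names rather than actual kernel locks:

/* Two paths that both need lock_a and lock_b.  If one path took them in
 * the opposite order (see the commented-out lines), the two threads could
 * each hold one lock and wait forever for the other: the classic ABBA
 * deadlock a lock hierarchy is meant to rule out. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;
static int shared_count;

static void *path_one(void *arg)
{
        pthread_mutex_lock(&lock_a);    /* hierarchy: lock_a before lock_b */
        pthread_mutex_lock(&lock_b);
        shared_count++;
        pthread_mutex_unlock(&lock_b);
        pthread_mutex_unlock(&lock_a);
        return arg;
}

static void *path_two(void *arg)
{
        /* The deadlock-prone ordering would be:
         *      pthread_mutex_lock(&lock_b);
         *      pthread_mutex_lock(&lock_a);
         * Instead we honour the same lock_a -> lock_b order. */
        pthread_mutex_lock(&lock_a);
        pthread_mutex_lock(&lock_b);
        shared_count--;
        pthread_mutex_unlock(&lock_b);
        pthread_mutex_unlock(&lock_a);
        return arg;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, path_one, NULL);
        pthread_create(&t2, NULL, path_two, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("count = %d\n", shared_count);
        return 0;
}

Keeping such an order consistent across every lock in a kernel is exactly
the bookkeeping being asked for above.
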
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-04 02:51:18

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Scaling noise

>> --Larry McVoy <[email protected]> wrote (on Wednesday, September 03, 2003 17:36:33 -0700):
>> > They have to be, CPUs are fast enough
>> > to handle most problems, clustering has worked for lots of big companies
>> > like Google, Amazon, Yahoo, and the HPC market has been flat for years.
>> > So where's the growth? Nowhere I can see. If I'm not seeing it, show
>> > me the data. I may be a pain in the ass but I'll change my mind instantly
>> > when you show me data that says something different than what I believe.
>> > So far, all I've seen is people having fun proving that their ego is
>> > bigger than the next guys, no real data. Come on, you'd love nothing
>> > better than to prove me wrong. Do it. Or admit that you can't.
>>
>> Not quite sure why the onus is on the rest of us to disprove your pet
>> theory, rather than you to prove it.
>
> Maybe because history has shown over and over again that your pet theory
> doesn't work. Mine might be wrong but it hasn't been proven wrong. Yours
> has. Multiple times.

Please, this makes no sense. Why do you think IBM and others make large
machines? Stupidity? From my experience they're hard assed "if it don't
make a profit, nor is likely to, then it can piss off" marketeers. Which
often pisses me off, but still ... if it didn't make money, they wouldn't
do it. And no, I can't go get you internal confidential sales figures,
but I'll bet you we're not selling these things at a loss for our own
general self-flagellating amusement.

I don't think you're stupid, but please ... who do you think has better
data on this? IBM market research people? or you? I think I'd bet on IBM.

M.

2003-09-04 02:56:43

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Scaling noise

> Am I missing something, but why hasn't openmosix been brought into this
> discussion? It looks like the perfect base for something like this. All
> that it needs is some cleanup.

From what I've seen, it needs a redesign. I don't think it's maintainable
or mergeable as is ... nor would I want to work with their design. Just
an initial gut reaction: I haven't spent a lot of time looking at it, but
from what I saw, I didn't bother looking further.

From all accounts, OpenSSI sounds more promising, but I need to spend some
more time looking at it.

M.

2003-09-04 02:51:17

by Larry McVoy

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 06:12:53PM -0700, William Lee Irwin III wrote:
> On Wed, Sep 03, 2003 at 02:51:35PM -0700, William Lee Irwin III wrote:
> >> This is only truly feasible when the nodes are homogeneous. They will
> >> not be as there will be physical locality (esp. bits like device
> >> proximity) concerns.
>
> On Wed, Sep 03, 2003 at 05:58:22PM -0700, Larry McVoy wrote:
> > Huh? The nodes are homogeneous. Devices are either local or proxied.
>
> Virtualized devices are backed by real devices at some level, so the
> distance from the node's physical location to the device's then matters.

Go read what I've written about this. There is no sharing, devices are
local or remote. You share in the page cache only; if you want fast access
to a device, you ask it to put the data in memory and you map it. It's
absolutely as fast as an SMP. With no locking.
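
As a rough illustration of the "put the data in memory and map it" model:
an ordinary file mmap served out of the local page cache, with no
per-access locking in the application. This is only a plain-Linux sketch;
any ccCluster-specific call for asking a remote node to stage the data is
hypothetical and not shown here.

/* Map a file and read it through the page cache.  Once mapped, accesses
 * are ordinary loads; the application takes no locks per access. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        int fd;
        struct stat st;
        char *p;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || fstat(fd, &st) < 0) {
                perror("open/fstat");
                return 1;
        }
        if (st.st_size == 0) {
                fprintf(stderr, "empty file\n");
                return 1;
        }
        p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        printf("first byte: 0x%02x\n", (unsigned char)p[0]);
        munmap(p, st.st_size);
        close(fd);
        return 0;
}
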
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-04 02:50:51

by Steven Cole

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, 2003-09-03 at 20:31, Martin J. Bligh wrote:

> I don't think the initial development baby-steps are *too* bad, and don't
> even have to be done on a NUMA box - a pair of PCs connected by 100baseT
> would work. Personally, I think the first step is to do task migration -
> migrate a process without it realising from one linux instance to another.
> Start without the more complex bits like shared filehandles, etc. Something
> that just writes 1,2,3,4 to a file. It could even just use shared root NFS,
> I think that works already.
>
> Basically swap it out on one node, and in on another, though obviously
> there's more state to take across than just RAM. I was talking to Tridge
> the other day, and he said someone had hacked up something in userspace
> which kinda worked ... I'll get some details.
>

This project may be applicable: http://bproc.sourceforge.net/
BProc is used here: http://www.lanl.gov/projects/pink/

Steven

2003-09-04 03:04:12

by Daniel Phillips

[permalink] [raw]
Subject: Re: Scaling noise

On Thursday 04 September 2003 04:19, Steven Cole wrote:
> On Wed, 2003-09-03 at 19:50, Daniel Phillips wrote:
> > There was a time when SMP locking overhead actually cost something in the
> > high single digits on Linux, on certain loads. Today, you'd have to work
> > at it to find a real load where the 2.5/6 kernel spends more than 1% of
> > its time in locking overhead, even on a large SMP machine (sample size of
> > one: I asked Bill Irwin how his 32 node Numa cluster is running these
> > days). This blows the ccCluster idea out of the water, sorry. The only
> > way ccCluster gets to live is if SMP locking is pathetic and it's not.
>
> I would never call the SMP locking pathetic, but it could be improved.
> Looking at Figure 6 (Star-CD, 1-64 processors on Altix) and Figure 7
> (Gaussian 1-32 processors on Altix) on page 13 of "Linux Scalability for
> Large NUMA Systems", available for download here:
> http://archive.linuxsymposium.org/ols2003/Proceedings/
> it appears that for those applications, the curves begin to flatten
> rather alarmingly. This may have little to do with locking overhead.

2.4.17 is getting a little old, don't you think? This is the thing that
changed most in 2.4 -> 2.6, and indeed, much of the work was in locking.

> One possible benefit of using ccClusters would be to stay on that lower
> part of the curve for the nodes, using perhaps 16 CPUs in a node. That
> way, a 256 CPU (e.g. Altix 3000) system might perform better than if a
> single kernel were to be used. I say might. It's likely that only
> empirical data will tell the tale for sure.

Right, and we do not see SGI contributing patches for partitioning their 256
CPU boxes. That's all the empirical data I need at this point.

They surely do partition them, but not at the Linux OS level.

> > As for Karim's work, it's a quintessentially flashy trick to make two UP
> > kernels run on a dual processor. It's worth doing, but not because it
> > blazes the way forward for ccClusters. It can be the basis for hot
> > kernel swap: migrate all the processes to one of the two CPUs, load and
> > start a new kernel on the other one, migrate all processes to it, and let
> > the new kernel restart the first processor, which is now idle.
>
> Thank you for that very succinct summary of my rather long-winded
> exposition on that subject which I posted here:
> http://marc.theaimsgroup.com/?l=linux-kernel&m=105214105131450&w=2

I swear I made the above up on the spot, just now :-)

> Quite a bit of the complexity which I mentioned, if it were necessary at
> all, could go into user space helper processes which get spawned for the
> kernel going away, and before init for the on-coming kernel. Also, my
> comment about not being able to shoe-horn two kernels in at once for
> 32-bit arches may have been addressed by Ingo's 4G/4G split.

I don't see what you're worried about, they are separate kernels and you get
two instances of whatever split you want.

Regards,

Daniel

2003-09-04 03:03:23

by Larry McVoy

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 07:48:03PM -0700, Martin J. Bligh wrote:
> >> --Larry McVoy <[email protected]> wrote (on Wednesday, September 03, 2003 17:36:33 -0700):
> >> > They have to be, CPUs are fast enough
> >> > to handle most problems, clustering has worked for lots of big companies
> >> > like Google, Amazon, Yahoo, and the HPC market has been flat for years.
> >> > So where's the growth? Nowhere I can see. If I'm not seeing it, show
> >> > me the data. I may be a pain in the ass but I'll change my mind instantly
> >> > when you show me data that says something different than what I believe.
> >> > So far, all I've seen is people having fun proving that their ego is
> >> > bigger than the next guys, no real data. Come on, you'd love nothing
> >> > better than to prove me wrong. Do it. Or admit that you can't.
> >>
> >> Not quite sure why the onus is on the rest of us to disprove your pet
> >> theory, rather than you to prove it.
> >
> > Maybe because history has shown over and over again that your pet theory
> > doesn't work. Mine might be wrong but it hasn't been proven wrong. Yours
> > has. Multiple times.
>
> Please, this makes no sense. Why do you think IBM and others make large
> machines? Stupidity?

I designed and shipped servers for Sun, I know exactly why they do it.
Customers want to know they have a growth path if they need it. It's an
absolute truism in selling that you need three models: the low end,
the middle of the road, the high end. The stupid cheap people (I've
been one of those, don't buy a car without air conditioning, it's dumb)
buy the low end, the vast majority buy the middle of the road, and a
handful of people buy the high end.

You don't make that much money, if any, on the high end, the R&D costs
dominate. But you make money because people buy the middle of the road
because you have the high end. If you don't, they feel uneasy that they
can't grow with you. The high end enables the sales of the real money
makers. It's pure marketing, the high end could be imaginary and as
long as you convinced the customers you had it you'd be more profitable.

I also worked on servers at SGI. At both Sun and SGI, all the money
was made on 4P and less and on disks. The big iron looks like it is
profitable but only when you don't count the R&D.

The problem with the old school approach of low/middle/high is that
everyone knows that they are shouldering far more than the material
costs plus a little profit when buying high end. They are paying for
the R&D. The market for those machines is very small and the volumes
never approach the level where the R&D is lost in the noise; it remains
a significant fraction of the purchase price. That's OK as long as there
is no alternative, but with all the household name companies like Google,
Amazon, Yahoo, etc. demonstrating that racks of 1U boxes are a far better
answer, the market for the big boxes is shrinking. Which is exactly what
Dell was saying. I dunno, maybe I'm completely confused, but I see his
point. I don't see yours.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-04 03:05:42

by David Lang

[permalink] [raw]
Subject: Re: SSI clusters on NUMA (was Re: Scaling noise)

On Wed, 3 Sep 2003, Martin J. Bligh wrote:

> > how much of this need could be met with a native linux master and kernels
> > running user-mode kernels? (your resource sharing would obviously not be
> > that clean, but you could develop the tools to work across the kernel
> > images this way)
>
> I talked to Jeff and Andrea about this at KS & OLS this year ... the feeling
> was that UML was too much overhead, but there were various ways to reduce
> that, especially if the underlying OS had UML support (doesn't require it
> right now).
>
> I'd really like to see the performance proved to be better before basing
> a design on UML, though that was my first instinct of how to do it ...

I agree that UML won't be able to show the performance advantages (the
fact that the UML kernel can't control the cache footprint on the CPUs
because it gets swapped from one to another at the host OS's convenience
is just one issue here).

However, with UML you should be able to develop the tools and features to
start to weld the two different kernels into a single logical image. Once
people have a handle on how these tools work, you can then try them on some
hardware that has a lower-level partitioning setup (i.e. the IBM
mainframes) and do real speed comparisons between one kernel that's given
X CPUs and Y memory and two kernels that are each given X/2 CPUs and Y/2
memory.

The fact that common hardware doesn't nicely support the partitioning
shouldn't stop people from solving the other problems.

David Lang

2003-09-04 03:20:52

by Nick Piggin

[permalink] [raw]
Subject: Re: Scaling noise

Steven Cole wrote:

>On Wed, 2003-09-03 at 20:35, William Lee Irwin III wrote:
>
>>On Wed, Sep 03, 2003 at 08:19:26PM -0600, Steven Cole wrote:
>>
>>>I would never call the SMP locking pathetic, but it could be improved.
>>>Looking at Figure 6 (Star-CD, 1-64 processors on Altix) and Figure 7
>>>(Gaussian 1-32 processors on Altix) on page 13 of "Linux Scalability for
>>>Large NUMA Systems", available for download here:
>>>http://archive.linuxsymposium.org/ols2003/Proceedings/
>>>it appears that for those applications, the curves begin to flatten
>>>rather alarmingly. This may have little to do with locking overhead.
>>>
>>Those numbers are 2.4.x
>>
>
>Yes, I saw that. It would be interesting to see results for recent
>2.6.0-testX kernels. Judging from other recent numbers out of OSDL, the
>results for 2.6 should be quite a bit better. But won't the curves
>still begin to flatten, just at a higher CPU count? Or has the miracle
>goodness of RCU pushed those limits to insanely high numbers?
>

They fixed some big 2.4 scalability problems, so it wouldn't be as
impressive as plain 2.4 -> 2.6. However there are obviously hardware
scalability limits as well as software ones. So a more interesting
comparison would of course be 2.6 vs LM's SSI clusters.


2003-09-04 03:19:30

by David Lang

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, 3 Sep 2003, Martin J. Bligh wrote:

> >> --Larry McVoy <[email protected]> wrote (on Wednesday, September 03, 2003 17:36:33 -0700):
> >> > They have to be, CPUs are fast enough
> >> > to handle most problems, clustering has worked for lots of big companies
> >> > like Google, Amazon, Yahoo, and the HPC market has been flat for years.
> >> > So where's the growth? Nowhere I can see. If I'm not seeing it, show
> >> > me the data. I may be a pain in the ass but I'll change my mind instantly
> >> > when you show me data that says something different than what I believe.
> >> > So far, all I've seen is people having fun proving that their ego is
> >> > bigger than the next guys, no real data. Come on, you'd love nothing
> >> > better than to prove me wrong. Do it. Or admit that you can't.
> >>
> >> Not quite sure why the onus is on the rest of us to disprove your pet
> >> theory, rather than you to prove it.
> >
> > Maybe because history has shown over and over again that your pet theory
> > doesn't work. Mine might be wrong but it hasn't been proven wrong. Yours
> > has. Multiple times.
>
> Please, this makes no sense. Why do you think IBM and others make large
> machines? Stupidity? From my experience they're hard assed "if it don't
> make a profit, nor is likely to, then it can piss off" marketeers. Which
> often pisses me off, but still ... if it didn't make money, they wouldn't
> do it. And no, I can't go get you internal confidential sales figures,
> but I'll bet you we're not selling these things at a loss for our own
> general self-flagellating amusement.
>
> I don't think you're stupid, but please ... who do you think has better
> data on this? IBM market research people? or you? I think I'd bet on IBM.

There are some problems that don't scale well to multiple machines/images
(large databases, huge working sets, etc.), and there are other cases where
the problem may be able to scale to multiple machines, but the software
was badly written and so it won't.

In these cases all you can do is buy a bigger box, however inefficiently
you use the extra processors.

I know one company a couple of years ago that looked at buying a 24-CPU IBM
machine but found that it was outperformed by a pair of 4-CPU IBM machines
(all running the same version of AIX, and the same application software).
In this case they hit an AIX limit that more CPUs/RAM couldn't help with,
but more copies of the OS could.

As CPU speeds climb, and the relative penalty for going off the chip for
any reason climbs even faster, the hardware overhead of keeping an SMP
machine coherent becomes more significant. Sometimes this can be solved by
changing the algorithm (the O(1) scheduler, for example), but other times
you just have to accept the cost of bouncing a cache line from one CPU to
another so that you can acquire a lock.
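
A small userspace sketch of that cost (illustrative only, nothing
kernel-specific): a single lock-protected counter forces its cache line to
bounce between every CPU that touches it, while per-thread counters padded
to separate cache lines stay local and are only summed at the end.

/* One shared, lock-protected counter versus per-thread counters padded to
 * their own cache lines.  Userspace illustration only; the sizes and
 * thread counts are made up. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS  4
#define ITERS     1000000
#define CACHELINE 64

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter;                 /* every increment bounces this line */

struct percpu_counter {
        long count;
        char pad[CACHELINE - sizeof(long)]; /* keep each counter on its own line */
};
static struct percpu_counter local[NTHREADS];

static void *shared_worker(void *arg)
{
        for (int i = 0; i < ITERS; i++) {
                pthread_mutex_lock(&lock);  /* lock word + counter ping-pong between CPUs */
                shared_counter++;
                pthread_mutex_unlock(&lock);
        }
        return arg;
}

static void *percpu_worker(void *arg)
{
        struct percpu_counter *c = arg;

        for (int i = 0; i < ITERS; i++)
                c->count++;                 /* stays in this CPU's cache */
        return NULL;
}

int main(void)
{
        pthread_t t[NTHREADS];
        long total = 0;

        for (int i = 0; i < NTHREADS; i++)
                pthread_create(&t[i], NULL, shared_worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(t[i], NULL);

        for (int i = 0; i < NTHREADS; i++)
                pthread_create(&t[i], NULL, percpu_worker, &local[i]);
        for (int i = 0; i < NTHREADS; i++) {
                pthread_join(t[i], NULL);
                total += local[i].count;
        }
        printf("shared = %ld, per-cpu total = %ld\n", shared_counter, total);
        return 0;
}

On a multi-way box the shared-counter loop spends much of its time moving
that one line between CPUs; on a uniprocessor the two loops behave much the
same.
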

The advantage of multiple images is that the hardware only needs to
maintain full-speed consistency between CPUs in one image; a higher level
(the OS or the partitioning software) can deal with the issues of letting
one image know what it needs to about what the others are doing.

Some of these problems can be addressed in hardware (the Opteron could be
called SSI-NUMA, with its partitioning layer running in hardware
for up to 8 CPUs), but addressing it in hardware runs into scaling
problems because you don't want to pay too much at the low end for the
features you need on the high end (which is why the Opteron doesn't
directly scale to 128+ CPUs in one image).

David Lang

2003-09-04 03:15:37

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 06:12:53PM -0700, William Lee Irwin III wrote:
>> Virtualized devices are backed by real devices at some level, so the
>> distance from the node's physical location to the device's then matters.

On Wed, Sep 03, 2003 at 07:49:04PM -0700, Larry McVoy wrote:
> Go read what I've written about this. There is no sharing, devices are
> local or remote. You share in the page cache only; if you want fast access
> to a device, you ask it to put the data in memory and you map it. It's
> absolutely as fast as an SMP. With no locking.

Given the lack of an implementation I'm going to have to take this
claim as my opportunity to bow out of tonight's discussion.

I'd love to hear more about it when there's something more substantial
to examine.


-- wli

2003-09-04 03:44:37

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 08:16:16PM -0700, David Lang wrote:
> Some of these problems can be addressed in hardware (the Opteron could be
> called SSI-NUMA, with its partitioning layer running in hardware
> for up to 8 CPUs), but addressing it in hardware runs into scaling
> problems because you don't want to pay too much at the low end for the
> features you need on the high end (which is why the Opteron doesn't
> directly scale to 128+ CPUs in one image).

Most of those features are just methods of connecting hardware
components.


-- wli

2003-09-04 03:42:48

by Nick Piggin

[permalink] [raw]
Subject: Re: Scaling noise

Larry McVoy wrote:

>On Wed, Sep 03, 2003 at 06:12:53PM -0700, William Lee Irwin III wrote:
>
>>On Wed, Sep 03, 2003 at 02:51:35PM -0700, William Lee Irwin III wrote:
>>
>>>>This is only truly feasible when the nodes are homogeneous. They will
>>>>not be as there will be physical locality (esp. bits like device
>>>>proximity) concerns.
>>>>
>>On Wed, Sep 03, 2003 at 05:58:22PM -0700, Larry McVoy wrote:
>>
>>>Huh? The nodes are homogeneous. Devices are either local or proxied.
>>>
>>Virtualized devices are backed by real devices at some level, so the
>>distance from the node's physical location to the device's then matters.
>>
>
>Go read what I've written about this. There is no sharing, devices are
>local or remote. You share in the page cache only; if you want fast access
>to a device, you ask it to put the data in memory and you map it. It's
>absolutely as fast as an SMP. With no locking.
>

There is probably more to it - I'm just an interested bystander - but
how much locking does this case incur with a single kernel system?
And what happens if more than one node wants to access the device? Through
a filesystem?


2003-09-04 03:48:04

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, 03 Sep 2003 20:02:27 PDT, Larry McVoy wrote:

> The problem with the old school approach of low/middle/high is that
> everyone knows that they are shouldering far more than the material
> costs plus a little profit when buying high end. They are paying for
> the R&D. The market for those machines is very small and the volumes
> never approach the level where the R&D is lost in the noise; it remains
> a significant fraction of the purchase price. That's OK as long as there
> is no alternative, but with all the household name companies like Google,
> Amazon, Yahoo, etc. demonstrating that racks of 1U boxes are a far better
> answer, the market for the big boxes is shrinking. Which is exactly what
> Dell was saying. I dunno, maybe I'm completely confused, but I see his
> point. I don't see yours.

Hmm. Did you check your data with respect to Amazon, Yahoo, etc.? Not
saying I know anything different, but I think you didn't check before
you made that statement.

gerrit

2003-09-04 03:49:06

by Mike Fedyk

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 07:50:07PM -0700, Martin J. Bligh wrote:
> From all accounts, OpenSSI sounds more promising, but I need to spend some
> more time looking at it.

No kidding.

Taking a look at the web site it does look pretty impressive. And it is
using other code that was integrated recently (linux virtual server), as
well as lustre, opengfs, etc. This looks like they're making a lot of
progress, and doing it in a generic way.

I hope the code is as good as their documentation, and marketing...

2003-09-04 03:55:05

by Davide Libenzi

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, 3 Sep 2003, Larry McVoy wrote:

> Maybe because history has shown over and over again that your pet theory
> doesn't work. Mine might be wrong but it hasn't been proven wrong. Yours
> has. Multiple times.

Ok, who will be using this, Larry? Seriously. You taught us to be market-
and business-driven, so please tell us why this should be done. Will businesses
that are already using Beowulf-style clusters migrate to SSI? Why should
they? They already scale well because they're running apps that scale
well on that type of cluster, and Beowulf-style clusters are cheap and
faster for apps that do not share. Will businesses that are using application
servers like Java, .NET or whatever migrate to the super SSI? Nahh, why
should they. Their app servers will probably be running thousands of
cluster-unaware threads (and sharing a shit-load of memory) that will make
SSI look pretty ugly compared to a standard SMP/NUMA. Ok, they will be
cheaper if implemented with cheaper 1..4-way SMPs. But at the very end, to
get maximum performance from SSI you must have apps with a little
awareness of the system they're running on. So you must force businesses
to either migrate their apps (cost of HW wayyy cheaper than cost of
developers) or suffer from major performance problems. So my question
to you splits in two parts. Why should companies selling HW go with this
solution (cheaper for the customer)? And more, why should businesses buy
into it, with the plan of having to rewrite their server infrastructure
to take full advantage of the new architecture? Maybe, at the very end,
there is a reason why nobody is doing it.



- Davide

2003-09-04 04:16:07

by Larry McVoy

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, Sep 03, 2003 at 08:47:49PM -0700, Davide Libenzi wrote:
> On Wed, 3 Sep 2003, Larry McVoy wrote:
> > Maybe because history has shown over and over again that your pet theory
> > doesn't work. Mine might be wrong but it hasn't been proven wrong. Yours
> > has. Multiple times.
>
> Why should companies selling HW go with this solution?

Higher profits.

> And more, why should businesses buy
> into it, with the plan of having to rewrite their server infrastructure
> to take full advantage of the new architecture? Maybe, at the very end,
> there is a reason why nobody is doing it.

Yeah, it's easier to copy existing roadmaps than do something new. Even
when the picture is painted for you. Sheesh.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-04 04:42:45

by David Miller

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, 3 Sep 2003 18:52:49 -0700
Larry McVoy <[email protected]> wrote:

> On Thu, Sep 04, 2003 at 03:50:31AM +0200, Daniel Phillips wrote:
> > There are other arguments, such as how complex locking is, and how it will
> > never work correctly, but those are noise: it's pretty much done now, the
> > complexity is still manageable, and Linux has never been more stable.
>
> yeah, right. I'm not sure what you are smoking but I'll avoid your dealer.

I hate to enter these threads but...

The number of locking bugs found in the core networking, ipv4, and
ipv6 code over the past year or two of 2.4.x has been nearly nil.

If you're going to try and argue against supporting huge SMP
to me, don't make locking complexity one of the arguments. :-)

2003-09-04 04:47:29

by Martin J. Bligh

[permalink] [raw]
Subject: Re: SSI clusters on NUMA (was Re: Scaling noise)

>> > how much of this need could be met with a native linux master and kernels
>> > running user-mode kernels? (your resource sharing would obviously not be
>> > that clean, but you could develop the tools to work across the kernel
>> > images this way)
>>
>> I talked to Jeff and Andrea about this at KS & OLS this year ... the feeling
>> was that UML was too much overhead, but there were various ways to reduce
>> that, especially if the underlying OS had UML support (doesn't require it
>> right now).
>>
>> I'd really like to see the performance proved to be better before basing
>> a design on UML, though that was my first instinct of how to do it ...
>
> I agree that UML won't be able to show the performance advantages (the
> fact that the UML kernel can't control the cache footprint on the CPUs
> because it gets swapped from one to another at the host OS's convenience
> is just one issue here).
>
> However, with UML you should be able to develop the tools and features to
> start to weld the two different kernels into a single logical image. Once
> people have a handle on how these tools work, you can then try them on some
> hardware that has a lower-level partitioning setup (i.e. the IBM
> mainframes) and do real speed comparisons between one kernel that's given
> X CPUs and Y memory and two kernels that are each given X/2 CPUs and Y/2
> memory.
>
> The fact that common hardware doesn't nicely support the partitioning
> shouldn't stop people from solving the other problems.

Yeah, it's definitely an interesting development environment at least.
FYI, most of the discussions in Ottawa centered around system call
overhead (4 TLB flushes per call, IIRC), but the cache footprint is interesting
too ... with the O(1) sched in the underlying OS, it shouldn't flip-flop
around too easily, but it's interesting, nonetheless.

M.

2003-09-04 04:53:16

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Scaling noise

> As CPU speeds climb, and the relative penalty for going off the chip for
> any reason climbs even faster, the hardware overhead of keeping an SMP
> machine coherent becomes more significant. Sometimes this can be solved by
> changing the algorithm (the O(1) scheduler, for example), but other times
> you just have to accept the cost of bouncing a cache line from one CPU to
> another so that you can acquire a lock.
>
> The advantage of multiple images is that the hardware only needs to
> maintain full-speed consistency between CPUs in one image; a higher level
> (the OS or the partitioning software) can deal with the issues of letting
> one image know what it needs to about what the others are doing.

All true, but multiple images isn't the only way to solve the problem
(I happen to like the idea, but let's go for full disclosure on all
possibilities ;-)). Take one example, the global pagemap_lru_lock in
2.4 ... either we could split the LRU by multiple OS images, or explicitly
split it per-node ... like we did in 2.6 (Andrew did that particular bit
of work IIRC, apologies if not).
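
Sketched very roughly (data-structure shape only, not the actual 2.4
pagemap_lru_lock or the 2.6 per-node code), the "split the problematic lock
per-node" approach looks like this:

/* Rough sketch of splitting one global LRU lock into per-node locks.
 * Shapes only -- the real kernel code differs. */
#include <pthread.h>
#include <stdio.h>

#define MAX_NODES 4

struct page {
        struct page *next;          /* toy LRU list linkage */
        int node;
};

/* Before: one lock serializes all LRU activity on every node. */
static pthread_mutex_t global_lru_lock = PTHREAD_MUTEX_INITIALIZER;
static struct page *global_lru;

/* After: each node has its own LRU and its own lock, so CPUs on
 * different nodes no longer contend for (or bounce) a single lock. */
struct node_lru {
        pthread_mutex_t lock;
        struct page *lru;
};
static struct node_lru node_lru[MAX_NODES];

static void lru_add_global(struct page *p)
{
        pthread_mutex_lock(&global_lru_lock);
        p->next = global_lru;
        global_lru = p;
        pthread_mutex_unlock(&global_lru_lock);
}

static void lru_add_pernode(struct page *p)
{
        struct node_lru *n = &node_lru[p->node];

        pthread_mutex_lock(&n->lock);
        p->next = n->lru;
        n->lru = p;
        pthread_mutex_unlock(&n->lock);
}

int main(void)
{
        struct page a = { .node = 0 }, b = { .node = 1 };

        for (int i = 0; i < MAX_NODES; i++)
                pthread_mutex_init(&node_lru[i].lock, NULL);

        lru_add_global(&a);
        lru_add_pernode(&b);
        printf("added a page to the node %d LRU\n", b.node);
        return 0;
}

The tradeoff is the one described above: more code has to know which node
it is running on, but the single global point of contention goes away.
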

The question is whether it's better to split *everything* to start with,
and glue back the global bits ... or start with everything global, and
split off the problematic bits explicitly. To date, we've been taking
the latter approach ... it's probably more pragmatic and evolutionary,
and is certainly a damned sight easier. The former approach is a fascinating
idea (to Larry and me at least, it seems ... now we've stepped over our
previous communication breakdown on one aspect at least), but unproven.
I wanna go play with it ;-)

M.

2003-09-04 04:45:04

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Scaling noise

> You don't make that much money, if any, on the high end, the R&D costs
> dominate.

Neither of us has figures to hand, I think. But you seem prepared to
admit there's some money to be made at least. Therefore it seems like
a good market for someone to be in.

> But you make money because people buy the middle of the road
> because you have the high end. If you don't, they feel uneasy that they
> can't grow with you. The high end enables the sales of the real money
> makers. It's pure marketing, the high end could be imaginary and as
> long as you convinced the customers you had it you'd be more profitable.

OK. Suppose, for the sake of argument, that I buy that (I don't really,
but agree it might be a factor). What makes you think the argument is
different for an OS than it is for hardware? For adoption of Linux as
an OS to dominate the market, it's important for the high end to be there
as well ... pervasive Linux.

Seeing Linux and Open Source in general succeed is important to me
personally as a deep-seated belief. Maybe from your own arguments, you can
see it's important for the high end to be there, even if you see no other
practical purpose for the machines. I believe there are real uses besides
marketing for it to be there, but maybe you don't need to believe that
to be convinced.

M.

PS. Of course, that still leaves the question of programming approach,
but that's a whole other discussion ;-)


2003-09-04 04:49:59

by David Miller

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, 3 Sep 2003 08:39:01 -0700
Larry McVoy <[email protected]> wrote:

> It's really easy to claim that scalability isn't the problem. Scaling
> changes in general cause very minute differences, it's just that there
> are a lot of them. There is constant pressure to scale further and people
> think it's cool.

So why are people still going down this path?

I'll tell you why, because as SMP issues start to embark upon
the mainstream boxes people are going to find clever solutions
to most of the memory sharing issues that cause all the "lock
overhead".

Things like RCU are just the tip of the iceberg. And think Larry,
we didn't have stuff like RCU back when you were directly working
and watching people work on huge SMP systems.

I think it's instructive to look at hyperthreading from another
angle in this argument, that the cpu people invested billions of
dollars in work to turn memory latency into free cpu cycles.

Put that in your pipe and smoke it :-)

2003-09-04 04:59:04

by David Miller

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, 3 Sep 2003 19:46:08 -0700
Larry McVoy <[email protected]> wrote:

> Here's the litmus test: list all the locks in the kernel and the locking
> hierarchy. If you, a self claimed genius, can't do it, how can the rest
> of us mortals possibly do it? Quick. You have 30 seconds, I want a list.
> A complete list with the locking hierarchy, no silly awk scripts. You have
> to show which locks can deadlock, from memory.
>
> No list? Cool, you just proved my point.

No point Larry, asking the same question about how the I/O
path works sans the locks will give you the same blank stare.

I absolutely do not accept the complexity argument. We have a fully
scalable kernel now. Do you know why? It's not because we have some
weird genius trolls writing the code, it's because of our insanely
huge testing base.

People give a lot of credit to the people writing the code in the
Linux kernel which actually belongs to the people running the
code. :-)

That's where the other systems failed, all the in-house stress
testing in the world is not going to find the bugs we do find in
Linux. That's why Solaris goes out buggy and with all kinds of
SMP deadlocks, their tester base is just too small to hit all
the important bugs.

FWIW, I actually can list all the locks taken for the primary paths in
the networking, and that's about as finely locked as we can make it.
As can Alexey Kuznetsov...

So again, if you're going to argue against huge SMP (at least to me),
don't use the locking complexity argument. Not only have we basically
conquered it, we've along the way found some amazing ways to find
locking bugs both at runtime and at compile time. You can even debug
them on uniprocessor systems. And this doesn't even count the
potential things we can do with Linus's sparse tool.

2003-09-04 07:24:58

by Matthias Andree

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, 03 Sep 2003, Larry McVoy wrote:

> Expecting more bandwidth to help your app is like expecting more platter
> speed to help your file system. It's not the platter speed, it's the
> seeks which are the problem. Same thing in systems: it's not the
> bcopy speed, it's the cache misses that are the problem. More bandwidth
> doesn't do much for that.

Platter speed IS a problem for random access involving seeks, because
platter speed reduces the rotational latency. Whether it takes 7.1 ms
average for a block to rotate past the heads in your average notebook
4,200/min drive or 2 ms in your 15,000/min drive does make a difference.
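
The arithmetic behind those figures is just half a revolution at the given
spindle speed; a trivial check, assuming nothing beyond the quoted RPM
numbers:

/* Average rotational latency = time for half a revolution.
 * 4,200 rpm -> ~7.1 ms, 15,000 rpm -> 2.0 ms, matching the figures above. */
#include <stdio.h>

static double avg_rot_latency_ms(double rpm)
{
        double ms_per_rev = 60000.0 / rpm;   /* one full revolution in ms */
        return ms_per_rev / 2.0;             /* on average, wait half of it */
}

int main(void)
{
        printf("4200 rpm:  %.1f ms\n", avg_rot_latency_ms(4200));    /* 7.1 */
        printf("15000 rpm: %.1f ms\n", avg_rot_latency_ms(15000));   /* 2.0 */
        return 0;
}
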

Even if the drive knows where the sectors are and folds rotational
latency into positioning latency to the maximum possible extent, for
short seeks (track-to-track) it's not going to help.

Unless you're going to add more heads or use other media than spinning
disc, that is.

However, head positioning times, being a tradeoff between noise and
speed, aren't that good particularly with many of the quieter drives, so
the marketing people use the enormous sequential data rate on outer
tracks for advertising. Head positioning time hasn't improved to the
extent throughput has, but that doesn't mean higher rotational frequency
is useless for random access delays.

2003-09-04 07:52:30

by Davide Libenzi

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, 3 Sep 2003, Larry McVoy wrote:

> On Wed, Sep 03, 2003 at 08:47:49PM -0700, Davide Libenzi wrote:
> > On Wed, 3 Sep 2003, Larry McVoy wrote:
> > > Maybe because history has shown over and over again that your pet theory
> > > doesn't work. Mine might be wrong but it hasn't been proven wrong. Yours
> > > has. Multiple times.
> >
> > Why companies selling HW should go with this solution?
>
> Higher profits.

And in which way, exactly? You not only didn't make a bare-bones
implementation (numbers talk, bullshit walks), but you didn't even make a
business case here. This is not cold-fusion stuff, Larry; SSI concepts have
been around for a long time. People didn't buy it, sorry. Beowulf-like clusters
have had moderate success because they're both cheap and scale very well for
certain share-zero applications. They didn't buy SSI because of the need
for application remodelling (besides the cool idea of an SSI, share-a-lot
applications will still suck piles compared to SMP), which is not very
popular in the businesses that are the target of these systems. They didn't
buy SSI because if they had scalability problems and their app was a
share-nada thingy that they were willing to rewrite, they'd already be
using Beowulf-style clusters. Successful new hardware (in the most
general sense) is the kind that fits your current solutions and methods
while, at the same time, giving you increased power and features.



- Davide

2003-09-04 17:02:08

by Daniel Phillips

[permalink] [raw]
Subject: Re: Scaling noise

On Thursday 04 September 2003 04:31, Martin J. Bligh wrote:
> I view UP -> SMP -> NUMA -> SSI on NUMA -> SSI on many PCs -> beowulf
> cluster as a continuum ...

Nicely put. But the last step, ->beowulf, doesn't fit with the others,
because all the other steps are successive levels of virtualization that try
to preserve the appearance of a single system, whereas Beowulf drops the
pretence and lets applications worry about the boundaries, i.e., it lacks
essential SSI features. Also, the hardware changes on each of the first four
arrows and stays the same on the last one.

Regards,

Daniel

2003-09-04 20:37:05

by Rik van Riel

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, 3 Sep 2003, Martin J. Bligh wrote:

> The real core use of NUMA is to run one really big app on one machine,
> where it's hard to split it across a cluster. You just can't build an
> SMP box big enough for some of these things.

That only works when the NUMA factor is low enough that
you can effectively treat the box as an SMP system.

It doesn't work when you have a NUMA factor of 15 (like
some unspecified box you are very familiar with) and
half of your database index is always on the "other half"
of the two-node NUMA system.

You'll end up with half your accesses being 15 times as
slow, meaning that your average memory access time is 8
times as high! Good way to REDUCE performance, but most
people won't like that...
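
The 8x figure is simply the average of the local and remote cases; a quick
check of that arithmetic, assuming the stated 50/50 split and a NUMA factor
of 15:

/* Average memory access cost when a fraction of accesses pay the NUMA
 * penalty: with half the accesses remote and a factor of 15, the average
 * is (0.5 * 1) + (0.5 * 15) = 8, as stated above. */
#include <stdio.h>

static double avg_access_cost(double remote_fraction, double numa_factor)
{
        return (1.0 - remote_fraction) * 1.0 + remote_fraction * numa_factor;
}

int main(void)
{
        printf("50%% remote, factor 15: %.1fx\n", avg_access_cost(0.5, 15.0));
        return 0;
}
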

If the NUMA factor is low enough that applications can
treat it like SMP, then the kernel NUMA support won't
have to be very high either...

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2003-09-04 20:58:13

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Scaling noise

> On Wed, 3 Sep 2003, Martin J. Bligh wrote:
>
>> The real core use of NUMA is to run one really big app on one machine,
>> where it's hard to split it across a cluster. You just can't build an
>> SMP box big enough for some of these things.
>
> That only works when the NUMA factor is low enough that
> you can effectively treat the box as an SMP system.
>
> It doesn't work when you have a NUMA factor of 15 (like
> some unspecified box you are very familiar with) and
> half of your database index is always on the "other half"
> of the two-node NUMA system.
>
> You'll end up with half your accesses being 15 times as
> slow, meaning that your average memory access time is 8
> times as high! Good way to REDUCE performance, but most
> people won't like that...
>
> If the NUMA factor is low enough that applications can
> treat it like SMP, then the kernel NUMA support won't
> have to be very high either...

I think there are a few too many assumptions in that - are you thinking
of a big r/w shmem application? There are lots of other application
programming models that wouldn't suffer nearly so much ... but maybe
they're more splittable ... there are lots of things we can do to ensure
at least better-than-average node-locality for most of the memory.

M.

2003-09-04 21:29:44

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Scaling noise

On Thu, Sep 04, 2003 at 04:36:56PM -0400, Rik van Riel wrote:
> You'll end up with half your accesses being 15 times as
> slow, meaning that your average memory access time is 8
> times as high! Good way to REDUCE performance, but most
> people won't like that...
> If the NUMA factor is low enough that applications can
> treat it like SMP, then the kernel NUMA support won't
> have to be very high either...

This does not hold. The data set is not necessarily where the
communication occurs.


-- wli

2003-09-05 01:35:42

by Robert White

[permalink] [raw]
Subject: RE: Scaling noise

Not to throw a flag on the play but...

Larry asks why penalize low end systems by making the kernel
many-cpu-friendly. The implicit postulate in his question is that the
current design path is "unfair" to the single and very-small-N N-way systems
in favor of the larger-N and very-large-N niche user base.

Lots of discussions then ensue about high end scalability and the
performance penalties of doing memory barriers and atomic actions for large
numbers of CPUs.

I'm not sure I get it. The large-end dynamics don't seem to apply to the
question, either in support or in rebuttal. Is there any concrete evidence to support
the inference?

What really is the impact of high-end scalability issues on the low end
machines?

It *seems* that the high-end diminishing returns due to having any one CPU's
cache invalidated by N companions is clearly decoupled from the
uni-processor model because there are no "other" CPUs to cause the bulk of
such invalidations when there is only one processor.

It *seems* that in a single-kernel architecture, as soon as you reach "more
than one CPU" what "must be shared" ...er... must be shared.

It *seems* that what is unique to each CPU (the instance's CPU-private data
structure) is wholly unlikely to be faulted out because of the actions of
other CPUs. (That *is* part of why they are in separate spaces, right?)

It *seems* that the high-end degradation of cache performance as N increases
is coupled singularly to the increase in N and the follow-on multi-state
competition for resources. (That is, it is the real presence of the 64 CPUs
and not the code to accommodate them that is the cause of the invalidations. If,
on my 2-way box I set the MAX_CPU_COUNT to 8 or 64, the only difference in
runtime is the unused memory for the 6 or 62 per-cpu-data structures. The
cache-invalidation and memory-barrier cost is bounded by the two real CPUs
and not the 62 empty slots.)

It *seems* that any attempt to make a large N system cache friendly would
involve preventing as many invalidations as possible.

And finally, it *seems* that any large-N design, e.g. one that keeps cache
invalidation to a minimum, would, by definition, directly *benefit* the
small N systems because they would naturally also have less invalidation.

That is, "change a pointer, flush a cache" is true for all N greater than 1,
yes? So if I share between 2 or 1000, it is the actions of the 2 or 1000 at
runtime that exact the cost. Any design to better accommodate "up to 1000"
will naturally tend to better accommodate "up to 4".

So...

What/where, if any, are the examples of something being written (or some
technique being propounded) in the current kernel design that "penalizes" a
2-by or 4-by box in the name of making a 64-by or 128-by machine perform
better?

Clearly keeping more private data private is "better" and "worse" for its
respective reasons in a memory-footprint for cache-separation tradeoff, but
that is true for all N >= 2.

Is there some concrete example(s) of SMP code that is wrought overlarge? I
mean all values of N between 1 and 255 fit in one byte (duh 8-) but cache
invalidations happen in more than two bytes, so having some MAX_CPU_COUNT
bounded at 65535 is only one byte more expansive and no-bytes more expensive
in the cache-consistency-conflict space. At runtime, any loop that iterates
across the number of active CPUs will be clamped at the actual number, and
not the theoretical max.

I suspect that the original question is specious and generally presumes
facts not in evidence. Of course, just because *I* cannot immediately
conceive of any useful optimization for a 128-way machine that is inherently
detrimental to a 2-way or 4-way box, doesn't mean that no such optimization
exists.

Someone enlighten me please.

Rob White




-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of William Lee Irwin
III
Sent: Wednesday, September 03, 2003 11:16 AM
To: Larry McVoy; Brown, Len; Giuliano Pochini; Larry McVoy;
[email protected]
Subject: Re: Scaling noise

At some point in the past, I wrote:
>> The lines of reasoning presented against tightly coupled systems are
>> grossly flawed.

On Wed, Sep 03, 2003 at 11:05:47AM -0700, Larry McVoy wrote:
> [etc].
> Only problem with your statements is that IBM has already implemented all
> of the required features in VM. And multiple Linux instances are running
> on it today, with shared disks underneath so they don't replicate all the
> stuff that doesn't need to be replicated, and they have shared memory
> across instances.

Independent operating system instances running under a hypervisor don't
qualify as a cache-coherent cluster that I can tell; it's merely dynamic
partitioning, which is great, but nothing to do with clustering or SMP.


-- wli

2003-09-07 21:18:46

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Scaling noise

Larry McVoy <[email protected]> writes:

> Here's a thought. Maybe the next kernel summit needs to have a CC cluster
> BOF or whatever. I'd be happy to show up, describe what it is that I see
> and have you all try and poke holes in it. If the net result was that you
> walked away with the same picture in your head that I have that would be
> cool. Heck, I'll sponser it and buy beer and food if you like.

Larry CC clusters are an idiotic development target.
The development target should be non coherent clusters.

1) NUMA machines are smaller, more expensive, and less available than
their non-cache-coherent counterparts.

2) If you can solve the communications problems for a non-cache-coherent
counterpart, the solution will also work on a NUMA
machine.

3) People on a NUMA machine can always punt and over share. On a non
cache coherent cluster when people punt they don't share. Not
sharing increases scalability and usually performance.

4) Small start up companies can do non-coherent clusters, and can
scale up. You have to be a substantial company to build a NUMA
machine.

5) NUMA machines are slow. There is not a single NUMA machine in the
top 10 of the top500 supercomputers list. Likely this has more to
do with system sizes supported by the manufacturer than inherent
process inferiority, but it makes a difference.

SSI is good and it helps. But that is not the primary management
problem on a large system. The larger you get, the more the imperfection of
your materials becomes a dominant factor in
management problems.

For example, I routinely reproduce, in a single boot, cases where the BIOS
does not work around hardware bugs that the motherboard vendors
cannot even reproduce.

Another example is Google, who have given up entirely on machines
always working and have built their software to be robust about error
detection and recovery.

And the SSI solutions are evolving. But the problems are hard.
How do you build a distributed filesystem that scales?
How do you do process migration across machines?
How do you checkpoint a distributed job?
How do you properly build a cluster job scheduler?
How do you handle simultaneous similar actions by a group of nodes?
How do you usefully predict, detect, and isolate hardware failures so
as not to cripple the cluster?
etc.

Eric

2003-09-07 23:07:44

by Larry McVoy

[permalink] [raw]
Subject: Re: Scaling noise

On Sun, Sep 07, 2003 at 03:18:19PM -0600, Eric W. Biederman wrote:
> Larry McVoy <[email protected]> writes:
>
> > Here's a thought. Maybe the next kernel summit needs to have a CC cluster
> > BOF or whatever. I'd be happy to show up, describe what it is that I see
> > and have you all try and poke holes in it. If the net result was that you
> > walked away with the same picture in your head that I have that would be
> > cool. Heck, I'll sponser it and buy beer and food if you like.
>
> Larry CC clusters are an idiotic development target.

What a nice way to start a technical conversation.

*PLONK* on two counts: you're wrong and you're rude. Next contestant please.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-07 23:47:23

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Scaling noise

Larry McVoy <[email protected]> writes:

> On Sun, Sep 07, 2003 at 03:18:19PM -0600, Eric W. Biederman wrote:
> > Larry McVoy <[email protected]> writes:
> >
> > > Here's a thought. Maybe the next kernel summit needs to have a CC cluster
> > > BOF or whatever. I'd be happy to show up, describe what it is that I see
> > > and have you all try and poke holes in it. If the net result was that you
> > > walked away with the same picture in your head that I have that would be
> > > cool. Heck, I'll sponser it and buy beer and food if you like.
> >
> > Larry CC clusters are an idiotic development target.
>
> What a nice way to start a technical conversation.
>
> *PLONK* on two counts: you're wrong and you're rude. Next contestant please.

Ok. I will keep building clusters and the code that makes them work, and you can dream.

I backed up my assertion, and can do even better.

I have already built a 2304 cpu machine and am working on a 2900+ cpu
machine.

The software stack and that part of the idea are reasonable, but your
target hardware is just plain rare and expensive.

If you don't get the commodity OS on commodity hardware thing, I'm sorry.

The thing is, for all of your talk of Dell, Dell doesn't make the hardware you
need for a CC cluster. And because the cc NUMA interface requires a
manufacturer to make chips and boards, I have a hard time seeing
cc NUMA hardware becoming a commodity any time soon.

Eric

2003-09-08 00:58:02

by Larry McVoy

[permalink] [raw]
Subject: Re: Scaling noise

On Sun, Sep 07, 2003 at 05:47:04PM -0600, Eric W. Biederman wrote:
> I have already built a 2304 cpu machine and am working on a 2900+ cpu
> machine.

That's not "a machine" that's ~1150 machines on a network. This business
of describing a bunch of boxes on a network as "a machine" is nonsense.

Don't get me wrong, I love clusters, in fact, I think what you are doing
is great. It doesn't screw up the OS, it forces the OS to stay lean and
mean. Goodness.

All the CC cluster stuff is about making sure that the SMP fanatics don't
screw up the OS for you. We're on the same side. Try not to be so rude
and have a bit more vision.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-08 03:56:14

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Scaling noise

Larry McVoy <[email protected]> writes:

> On Sun, Sep 07, 2003 at 05:47:04PM -0600, Eric W. Biederman wrote:
> > I have already built a 2304 cpu machine and am working on a 2900+ cpu
> > machine.
>
> That's not "a machine" that's ~1150 machines on a network. This business
> of describing a bunch of boxes on a network as "a machine" is nonsense.

Every bit as much as describing a scalable NUMA box with replaceable
nodes as a single machine is nonsense. When things are built and run
as a single machine, it is a single machine. The fact that you
standardized on parts used many times over does not change that.

The only real difference is cache coherency, and the price.

I won't argue that at the lowest end the vendor delivers you a pile of
boxes and walks away, at which point you must do everything yourself
and it is a real maintenance pain. But that is the lowest end and
certainly not what I sell. The systems are built and tested as a
single machine before delivery.

> Don't get me wrong, I love clusters, in fact, I think what you are doing
> is great. It doesn't screw up the OS, it forces the OS to stay lean and
> mean. Goodness.
>
> All the CC cluster stuff is about making sure that the SMP fanatics don't
> screw up the OS for you. We're on the same side. Try not to be so rude
> and have a bit more vision.

And I agree, except on some small details. Although I have yet to see
the large-way SMP folks causing problems.

But as far as doing the work goes, there are two different ends the work can be
started from.
a) SMP and make the locks finer grained.
b) Cluster and add the few necessary locks.

Both solutions run fine on a NUMA machine. And both eventually lead
to good SSI solutions. But except for some magic piece that only
works on cc NUMA nodes, you can develop all of the SSI software on an
ordinary cluster. On an ordinary cluster that is the only
option, and so the people with clusters are going to do the work.
The only reason you don't see more SSI work out of the cluster guys is
they are willing to sacrifice some coherency for scalability. But
mostly it is because of the fact that clusters are only slowly
catching on.

So assuming the non coherent cluster guys do their part you get SSI
software that works out of the box and does everything except for
optimize the page cache for the shared physical hardware. And the
software will scale awesomely because each generation of cluster
hardware is larger than the last.

The only piece that is unique is CCFS, which builds a shared page
cache. And even then the non coherent cluster guys may come up with
a better solution.

So my argument is that if you are going to do it right, start with
an ordinary non-coherent cluster. Build the SSI support. Then build
CCFS, the global shared page cache, as an optimization.

I fail to see how starting with CCFS will help, or how assuming CCFS will
be there will help. Unless you think the R&D budgets of all of the non-
coherent cluster guys are insubstantial, and somehow not up to the
task.


Eric

2003-09-08 04:46:04

by Stephen Satchell

[permalink] [raw]
Subject: Re: Scaling noise

At 05:57 PM 9/7/2003 -0700, Larry McVoy wrote:
>That's not "a machine" that's ~1150 machines on a network. This business
>of describing a bunch of boxes on a network as "a machine" is nonsense.

Then you haven't been keeping up with Open-source projects, or the
literature. Virtual servers composed of clusters of
Linux boxes on a private network appear to be a single machine to the
outside world. Indeed, a highly scaled Web site using such a cluster is
indistinguishable from one using a mainframe-class computer (which for the
past 30 years has been a network of specialized processors working together).

The difference is that the bulk of the nodes are on a private network, not
on a public one. Actually, the machines I have seen have been on a weave
of networks, so that as data traverses the nodes you don't get a bottleneck
effect.

It's a lot different than the Illiac IV I grew up with...

Satch



--
"People who seem to have had a new idea have often just stopped having an
old idea." -- Dr. Edwin H. Land

2003-09-08 05:26:36

by Larry McVoy

[permalink] [raw]
Subject: Re: Scaling noise

On Sun, Sep 07, 2003 at 09:47:58PM -0700, Stephen Satchell wrote:
> At 05:57 PM 9/7/2003 -0700, Larry McVoy wrote:
> >That's not "a machine" that's ~1150 machines on a network. This business
> >of describing a bunch of boxes on a network as "a machine" is nonsense.
>
> Then you haven't been keeping up with Open-source projects, or the
> literature.

Err, I'm in that literature, dig a little, you'll find me. I'm quite
familiar with clustering technology. While it is great that people are
wiring up lots of machines and running MPI or whatever on them, they've
been doing that for decades. It's only a recent thing that they started
calling that "a machine". That's marketing, and it's fine marketing,
but a bunch of machines, a network, and a library does not a machine make.
Not to me it doesn't. I want to be able to exec a process and have it
land anywhere on the "machine", on any CPU; I want controlling tty
semantics; if I have 2300 processes in one process group, then when I
hit ^Z they had all better stop. Etc.
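
On a single kernel that last requirement is just POSIX job control.
A minimal user-space sketch of the behaviour being asked for (the
worker count here is made up, and on an SSI the forked workers would
be free to land on any CPU of any node):

#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	/* Put ourselves in a fresh process group, as a shell would. */
	if (setpgid(0, 0) < 0) {
		perror("setpgid");
		return 1;
	}

	/* Fork a handful of workers; on an SSI these could be running
	 * on any CPU anywhere in the "machine". */
	for (int i = 0; i < 4; i++) {
		if (fork() == 0) {
			for (;;)
				pause();	/* worker just sits there */
		}
	}

	/* The parent opts out so the group-wide signals below only
	 * hit the workers (children forked above keep the defaults). */
	signal(SIGTSTP, SIG_IGN);
	signal(SIGTERM, SIG_IGN);

	/* What ^Z amounts to: one SIGTSTP to the whole process group,
	 * and every member had better stop, wherever it is running. */
	kill(-getpgrp(), SIGTSTP);
	sleep(1);
	kill(-getpgrp(), SIGCONT);	/* one SIGCONT resumes them all */
	kill(-getpgrp(), SIGTERM);	/* clean up */

	while (wait(NULL) > 0)
		;
	return 0;
}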

A collection of machines that work together is called a network of
machines, it's not one machine, it's a bunch of them. There's nothing
wrong with getting a lot of use out of a pile of networked machines,
it's a great thing. But it's no more a machine than the internet is
a machine.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-08 06:21:24

by Brown, Len

[permalink] [raw]
Subject: RE: Scaling noise

> 5) NUMA machines are slow. There is not a single NUMA machine in the
> top 10 of the top500 supercomputers list. Likely this has more to
> do with system sizes supported by the manufacture than inherent
> process inferiority, but it makes a difference.

Hardware that is good at running linpack (all you gotta run to get
onto http://www.top500.org/) isn't necessarily hardware that is any
good at, say, http://www.tpc.org/

2003-09-08 08:32:30

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Scaling noise

Larry McVoy <[email protected]> writes:

> On Sun, Sep 07, 2003 at 09:47:58PM -0700, Stephen Satchell wrote:
> > At 05:57 PM 9/7/2003 -0700, Larry McVoy wrote:
> > >That's not "a machine" that's ~1150 machines on a network. This business
> > >of describing a bunch of boxes on a network as "a machine" is nonsense.
> >
> > Then you haven't been keeping up with Open-source projects, or the
> > literature.
>
> Err, I'm in that literature, dig a little, you'll find me. I'm quite
> familiar with clustering technology. While it is great that people are
> wiring up lots of machines and running MPI or whatever on them, they've
> been doing that for decades. It's only a recent thing that they started
> calling that "a machine". That's marketing, and it's fine marketing,
> but a bunch of machines, a network, and a library does not a machine make.

Oh so you need cache coherency to make it a machine. That being the only
difference between that and a NUMA box.

Although I will state that there is a lot more that goes into such
a system than a network, and a library. At least there is a lot more
that goes into the manageable version of one.

> Not to me it doesn't. I want to be able to exec a proces and have it land
> anywhere on the "machine", any CPU, I want controlling tty semantics,
> if I have 2300 processes in one process group then when I hit ^Z they
> had all better stop. Etc.

Oh wait none of that comes with cache coherency. So the difference
cannot be cache coherency.

> A collection of machines that work together is called a network of
> machines, it's not one machine, it's a bunch of them. There's nothing
> wrong with getting a lot of use out of a pile of networked machines,
> it's a great thing. But it's no more a machine than the internet is
> a machine.

Cool, so the SGI Altix is not a machine. Nor is the SMP box over in
my lab. They are separate machines wired together with a network, and
so I better start calling them a network of machines.

As far as I can tell which pile of hardware to call a machine
is a difference that makes no difference. Marketing as you put it.

The only practical difference would seem to be what kind of problems
you think are worth solving for a collection of hardware. By calling
it a single machine I am saying I think it is worth solving the single
system image problem. By refusing to call it a machine you seem to
think it is a class of hardware which is not worth paying attention to.

I do think it is a class of hardware that is worth solving the hard
problems for. And I will continue to call that pile of hardware a
machine until I give up on that.

I admit the hard problems have not yet been solved but the solutions
are coming.

Eric

2003-09-08 09:21:29

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Scaling noise

"Brown, Len" <[email protected]> writes:

> > 5) NUMA machines are slow. There is not a single NUMA machine in the
> > top 10 of the top500 supercomputers list. Likely this has more to
> > do with system sizes supported by the manufacture than inherent
> > process inferiority, but it makes a difference.
>
> Hardware that is good at running linpack (all you gotta run to get onto
> http://www.top500.org/ )isn't necessarily hardware that is any good at,
> say, http://www.tpc.org/

Quite true. And there have been some very reasonable criticisms of linpack,
as it is cache friendly. So I will not argue that clusters are the
proper solution for everything.

The barrier to submitting a TPC result is much higher, so it
captures a smaller chunk of the market. For the people who are its
customers this seems reasonable. Though I find the absence of Google
from the TPC-H results fascinating.

But none of those machines have nearly the same number of CPUs as
the machines in the top500. And the point of Larry's ideas is an
infrastructure that scales. It is easy to scale things to 64
processors; 2.6 will do that today (though not necessarily well).
Going an order of magnitude bigger is a very significant
undertaking.

I won't argue that a NUMA design is bad. In fact I think it is quite
a nice hardware idea. And optimizing for it if you have it is cool.

But I think that if people are going to build software that scales
out of multiple kernels, it will probably be the cluster guys who do
it. Because that is what they must do, and they already have the big
hardware. And if the code works in a non-coherent mode, it should
only get better when you tell it the machine is cache coherent.

Eric

2003-09-08 13:25:38

by Pavel Machek

[permalink] [raw]
Subject: Re: Scaling noise

Hi!

> Maybe this is a better way to get my point across. Think about more CPUs
> on the same memory subsystem. I've been trying to make this scaling point

The point of hyperthreading is that more virtual CPUs on the same
memory subsystem can actually help stuff.
--
Pavel
Written on sharp zaurus, because my Velo1 broke. If you have Velo you don't need...

2003-09-08 13:41:40

by Alan

[permalink] [raw]
Subject: Re: Scaling noise

On Sad, 2003-09-06 at 16:08, Pavel Machek wrote:
> Hi!
>
> > Maybe this is a better way to get my point across. Think about more CPUs
> > on the same memory subsystem. I've been trying to make this scaling point
>
> The point of hyperthreading is that more virtual CPUs on same memory
> subsystem can actually help stuff.

It's a way of exposing asynchronicity while keeping the old
instruction set. It's trying to make better use of the available
bandwidth by having something else to schedule into the stalls.
That's why HT is really good for code which is full of polling I/O
and badly coded memory accesses, but is worthless on perfectly
tuned, hand-coded stuff which doesn't stall.

Its great feature is that HT gets *more*, not less, useful as the
CPU gets faster.
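
To make that concrete, here is a deliberately crude sketch (not a
real benchmark; the sizes are picked out of the air) of the two
kinds of code: a pointer chase that is nothing but dependent cache
misses, which leaves plenty of issue slots for an HT sibling to
fill, and a streaming sum that the prefetcher keeps fed, which
doesn't:

#include <stdio.h>
#include <stdlib.h>

#define N (1 << 22)	/* far bigger than any 2003-era cache */

int main(void)
{
	size_t *next = malloc(N * sizeof(*next));
	double *data = malloc(N * sizeof(*data));
	size_t i, p;
	double sum = 0.0;

	if (!next || !data)
		return 1;

	/* Build a single random cycle (Sattolo's shuffle) so every
	 * step of the chase is a load that depends on the previous
	 * load and almost always misses the cache. */
	for (i = 0; i < N; i++) {
		next[i] = i;
		data[i] = 1.0;
	}
	for (i = N - 1; i > 0; i--) {
		size_t j = (size_t)rand() % i;
		size_t t = next[i]; next[i] = next[j]; next[j] = t;
	}

	/* Latency bound: one miss after another, the pipeline stalls. */
	for (p = 0, i = 0; i < N; i++)
		p = next[p];

	/* Bandwidth/ALU bound: hardware prefetch keeps this busy. */
	for (i = 0; i < N; i++)
		sum += data[i];

	/* Time the two loops separately to see the difference. */
	printf("%zu %f\n", p, sum);
	return 0;
}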

2003-09-08 15:39:46

by John Stoffel

[permalink] [raw]
Subject: 2.6.0-test4-mm4 - bad floppy hangs keyboard input


Hi,

I've run into a weird problem with 2.6.0-test4-mm4 on an SMP Xeon
550MHz system. I was dd'ing some images to floppy and when I used a
bad floppy, it would read 17+0 blocks, write 16+0 blocks and then
hang. At this point, I couldn't use the keyboard at all, and I got an
error message about "lost serial connection to UPS", which is from
apcupsd, which I have talking over my Cyclades Cyclom-8y ISA serial
port card.

At this point, the load jumps to 1, nothing happens on the floppy
drive, and I can't input data to xterms. I can browse the web just
fine using the mouse inside Galeon, and other stuff runs just fine. I
can even cut'n'paste stuff from one xterm and have it run in another.

But the only way I've found to fix this is to reboot, which requires
an fsck of my disks, which takes a while. I'll see if I can re-create
this in more detail tonight while at home, and hopefully from a
console.

Strangely enough, Magic SysRq still seems to work, since that's how I
reboot. Logging out of Xwindows works, but then when I try to reboot
the system from gdm, it hangs at a blank screen.

Whee!

Here's my interrupts:

jfsnew:~> cat /proc/interrupts
           CPU0       CPU1
  0:   44911738        411    IO-APIC-edge  timer
  1:       1290          0    IO-APIC-edge  i8042
  2:          0          0          XT-PIC  cascade
  8:          1          0    IO-APIC-edge  rtc
 11:     861346          2    IO-APIC-edge  Cyclom-Y
 12:       2053          0    IO-APIC-edge  i8042
 14:     141353          0    IO-APIC-edge  ide0
 16:          0          0   IO-APIC-level  ohci-hcd
 17:       6447          0   IO-APIC-level  ohci-hcd, eth0
 18:     214706          1   IO-APIC-level  aic7xxx, aic7xxx, ehci_hcd
 19:         50          0   IO-APIC-level  aic7xxx, uhci-hcd
NMI:          0          0
LOC:   44911434   44911479
ERR:          0
MIS:          0

-------------------------------------------------------------------
/var/log/messages
-------------------------------------------------------------------

Sep 7 22:55:26 jfsnew kernel: floppy0: sector not found: track 0, head 1, secto
r 1, size 2
Sep 7 22:55:26 jfsnew kernel: floppy0: sector not found: track 0, head 1, secto
r 1, size 2
Sep 7 22:55:26 jfsnew kernel: end_request: I/O error, dev fd0, sector 18
Sep 7 22:55:26 jfsnew kernel: Unable to handle kernel paging request at virtual
address eb655050
Sep 7 22:55:26 jfsnew kernel: printing eip:
Sep 7 22:55:26 jfsnew kernel: c02ac816
Sep 7 22:55:26 jfsnew kernel: *pde = 00541063
Sep 7 22:55:26 jfsnew kernel: *pte = 2b655000
Sep 7 22:55:26 jfsnew kernel: Oops: 0000 [#1]
Sep 7 22:55:26 jfsnew kernel: SMP DEBUG_PAGEALLOC
Sep 7 22:55:26 jfsnew kernel: CPU: 0
Sep 7 22:55:26 jfsnew kernel: EIP: 0060:[bad_flp_intr+150/224] Not tainte
d VLI
Sep 7 22:55:26 jfsnew kernel: EIP: 0060:[<c02ac816>] Not tainted VLI
Sep 7 22:55:26 jfsnew kernel: EFLAGS: 00010246
Sep 7 22:55:26 jfsnew kernel: EIP is at bad_flp_intr+0x96/0xe0
Sep 7 22:55:26 jfsnew kernel: eax: 00000000 ebx: 00000000 ecx: 00000000 e
dx: eb655050
Sep 7 22:55:26 jfsnew kernel: esi: c0520c00 edi: eb655050 ebp: 00000002 e
sp: efea7f5c
Sep 7 22:55:26 jfsnew kernel: ds: 007b es: 007b ss: 0068
Sep 7 22:55:26 jfsnew kernel: Process events/0 (pid: 6, threadinfo=efea6000 tas
k=c17ef000)
Sep 7 22:55:26 jfsnew kernel: Stack: 00000000 00000000 00000001 c02ad0ba 000000
00 c0430940 00000003 c0471680
Sep 7 22:55:26 jfsnew kernel: 00000283 c17f0004 00000000 c02aa5f1 c04716
c0 c013547d 00000000 5a5a5a5a
Sep 7 22:55:26 jfsnew kernel: c02aa5e0 00000001 00000000 c011f840 000100
00 00000000 00000000 c17dbf1c
Sep 7 22:55:26 jfsnew kernel: Call Trace:
Sep 7 22:55:26 jfsnew kernel: [rw_interrupt+474/800] rw_interrupt+0x1da/0x320
Sep 7 22:55:26 jfsnew kernel: [<c02ad0ba>] rw_interrupt+0x1da/0x320
Sep 7 22:55:26 jfsnew kernel: [main_command_interrupt+17/32] main_command_inte
rrupt+0x11/0x20
Sep 7 22:55:26 jfsnew kernel: [<c02aa5f1>] main_command_interrupt+0x11/0x20
Sep 7 22:55:26 jfsnew kernel: [worker_thread+525/816] worker_thread+0x20d/0x33
0
Sep 7 22:55:26 jfsnew kernel: [<c013547d>] worker_thread+0x20d/0x330
Sep 7 22:55:26 jfsnew kernel: [main_command_interrupt+0/32] main_command_inter
rupt+0x0/0x20
Sep 7 22:55:26 jfsnew kernel: [<c02aa5e0>] main_command_interrupt+0x0/0x20
Sep 7 22:55:26 jfsnew kernel: [default_wake_function+0/32] default_wake_functi
on+0x0/0x20
Sep 7 22:55:26 jfsnew kernel: [<c011f840>] default_wake_function+0x0/0x20
Sep 7 22:55:26 jfsnew kernel: [ret_from_fork+6/20] ret_from_fork+0x6/0x14
Sep 7 22:55:26 jfsnew kernel: [<c03b5fd2>] ret_from_fork+0x6/0x14
Sep 7 22:55:26 jfsnew kernel: [default_wake_function+0/32] default_wake_functi
on+0x0/0x20
Sep 7 22:55:26 jfsnew kernel: [<c011f840>] default_wake_function+0x0/0x20
Sep 7 22:55:26 jfsnew kernel: [worker_thread+0/816] worker_thread+0x0/0x330
Sep 7 22:55:26 jfsnew kernel: [<c0135270>] worker_thread+0x0/0x330
Sep 7 22:55:26 jfsnew kernel: [kernel_thread_helper+5/12] kernel_thread_helper
+0x5/0xc
Sep 7 22:55:26 jfsnew kernel: [<c010ac69>] kernel_thread_helper+0x5/0xc
Sep 7 22:55:26 jfsnew kernel:
Sep 7 22:55:26 jfsnew kernel: Code: 52 c0 8d 04 9b 8d 04 43 8b 44 c6 28 39 07 7
6 0b 6a 00 a1 24 1c 52 c0 ff 50 0c 5b 0f b6 0d 84 1c 52 c0 8b 15 20 1c 52 c0 8d
04 89 <8b> 12 8d 04 41 c1 e0 03 3b 54 06 30 76 11 a1 80 1c 52 c0 c1 e0
Sep 7 22:55:38 jfsnew apcupsd[1324]: Serial communications with UPS lost
Sep 7 22:55:46 jfsnew kernel:
Sep 7 22:55:46 jfsnew kernel: floppy driver state
Sep 7 22:55:46 jfsnew kernel: -------------------
Sep 7 22:55:46 jfsnew kernel: now=4294937721 last interrupt=4294917720 diff=200
01 last called handler=c02aa5e0
Sep 7 22:55:46 jfsnew kernel: timeout_message=request done %%d
Sep 7 22:55:46 jfsnew kernel: last output bytes:
Sep 7 22:55:46 jfsnew kernel: 0 90 4294917324
Sep 7 22:55:46 jfsnew kernel: 13 80 4294917324
Sep 7 22:55:46 jfsnew kernel: 0 90 4294917324
Sep 7 22:55:46 jfsnew kernel: 1a 90 4294917324
Sep 7 22:55:46 jfsnew kernel: 0 90 4294917324
Sep 7 22:55:46 jfsnew kernel: 3 80 4294917324
Sep 7 22:55:46 jfsnew kernel: c1 90 4294917324
Sep 7 22:55:46 jfsnew kernel: 10 90 4294917324
Sep 7 22:55:46 jfsnew kernel: 7 80 4294917324
Sep 7 22:55:46 jfsnew kernel: 0 90 4294917324
Sep 7 22:55:46 jfsnew kernel: 8 81 4294917324
Sep 7 22:55:46 jfsnew kernel: e6 80 4294917324
Sep 7 22:55:46 jfsnew kernel: 4 90 4294917324
Sep 7 22:55:46 jfsnew kernel: 0 90 4294917324
Sep 7 22:55:46 jfsnew kernel: 1 90 4294917324
Sep 7 22:55:46 jfsnew kernel: 1 90 4294917324
Sep 7 22:55:46 jfsnew kernel: 2 90 4294917325
Sep 7 22:55:46 jfsnew kernel: 12 90 4294917325
Sep 7 22:55:46 jfsnew kernel: 1b 90 4294917325
Sep 7 22:55:46 jfsnew kernel: ff 90 4294917325
Sep 7 22:55:46 jfsnew kernel: last result at 4294917720
Sep 7 22:55:46 jfsnew kernel: last redo_fd_request at 4294917324
Sep 7 22:55:46 jfsnew kernel: 44 1 0 0 1 1 2
Sep 7 22:55:46 jfsnew kernel: status=80
Sep 7 22:55:46 jfsnew kernel: fdc_busy=1
Sep 7 22:55:46 jfsnew kernel: cont=c0471720
Sep 7 22:55:46 jfsnew kernel: current_req=00000000
Sep 7 22:55:46 jfsnew kernel: command_status=-1
Sep 7 22:55:46 jfsnew kernel:
Sep 7 22:55:46 jfsnew kernel: floppy0: floppy timeout called
Sep 7 22:55:46 jfsnew kernel: floppy.c: no request in request_done
Sep 7 22:55:49 jfsnew kernel:
Sep 7 22:55:49 jfsnew kernel: floppy driver state
Sep 7 22:55:49 jfsnew kernel: -------------------
Sep 7 22:55:49 jfsnew kernel: now=4294940721 last interrupt=4294937722 diff=299
9 last called handler=c02abc40
Sep 7 22:55:49 jfsnew kernel: timeout_message=redo fd request
Sep 7 22:55:49 jfsnew kernel: last output bytes:
Sep 7 22:55:49 jfsnew kernel: 0 90 4294917324
Sep 7 22:55:49 jfsnew kernel: 3 80 4294917324
Sep 7 22:55:49 jfsnew kernel: c1 90 4294917324
Sep 7 22:55:49 jfsnew kernel: 10 90 4294917324
Sep 7 22:55:49 jfsnew kernel: 7 80 4294917324
Sep 7 22:55:49 jfsnew kernel: 0 90 4294917324
Sep 7 22:55:49 jfsnew kernel: 8 81 4294917324
Sep 7 22:55:49 jfsnew kernel: e6 80 4294917324
Sep 7 22:55:49 jfsnew kernel: 4 90 4294917324
Sep 7 22:55:49 jfsnew kernel: 0 90 4294917324
Sep 7 22:55:49 jfsnew kernel: 1 90 4294917324
Sep 7 22:55:49 jfsnew kernel: 1 90 4294917324
Sep 7 22:55:49 jfsnew kernel: 2 90 4294917325
Sep 7 22:55:49 jfsnew kernel: 12 90 4294917325
Sep 7 22:55:49 jfsnew kernel: 1b 90 4294917325
Sep 7 22:55:49 jfsnew kernel: ff 90 4294917325
Sep 7 22:55:49 jfsnew kernel: 8 80 4294937722
Sep 7 22:55:49 jfsnew last message repeated 3 times
Sep 7 22:55:49 jfsnew kernel: last result at 4294937722
Sep 7 22:55:49 jfsnew kernel: last redo_fd_request at 4294937721
Sep 7 22:55:49 jfsnew kernel: c3 0
Sep 7 22:55:49 jfsnew kernel: status=80
Sep 7 22:55:49 jfsnew kernel: fdc_busy=1
Sep 7 22:55:49 jfsnew kernel: floppy_work.func=c02abc40
Sep 7 22:55:49 jfsnew kernel: cont=c0471720
Sep 7 22:55:49 jfsnew kernel: current_req=eb794004
Sep 7 22:55:49 jfsnew kernel: command_status=-1
Sep 7 22:55:49 jfsnew kernel:
Sep 7 22:55:49 jfsnew kernel: floppy0: floppy timeout called
Sep 7 22:55:49 jfsnew kernel: end_request: I/O error, dev fd0, sector 0
Sep 7 22:55:49 jfsnew kernel: Buffer I/O error on device fd0, logical block 0
Sep 7 22:55:49 jfsnew kernel: lost page write due to I/O error on fd0
Sep 7 22:55:52 jfsnew kernel:
Sep 7 22:55:52 jfsnew kernel: floppy driver state
Sep 7 22:55:52 jfsnew kernel: -------------------
Sep 7 22:55:52 jfsnew kernel: now=4294943721 last interrupt=4294940722 diff=299
9 last called handler=c02abc40
Sep 7 22:55:52 jfsnew kernel: timeout_message=redo fd request
Sep 7 22:55:52 jfsnew kernel: last output bytes:
Sep 7 22:55:52 jfsnew kernel: 7 80 4294917324
Sep 7 22:55:52 jfsnew kernel: 0 90 4294917324
Sep 7 22:55:52 jfsnew kernel: 8 81 4294917324
Sep 7 22:55:52 jfsnew kernel: e6 80 4294917324
Sep 7 22:55:52 jfsnew kernel: 4 90 4294917324
Sep 7 22:55:52 jfsnew kernel: 0 90 4294917324
Sep 7 22:55:52 jfsnew kernel: 1 90 4294917324
Sep 7 22:55:52 jfsnew kernel: 1 90 4294917324
Sep 7 22:55:52 jfsnew kernel: 2 90 4294917325
Sep 7 22:55:52 jfsnew kernel: 12 90 4294917325
Sep 7 22:55:52 jfsnew kernel: 1b 90 4294917325
Sep 7 22:55:52 jfsnew kernel: ff 90 4294917325
Sep 7 22:55:52 jfsnew kernel: 8 80 4294937722
Sep 7 22:55:52 jfsnew last message repeated 3 times
Sep 7 22:55:52 jfsnew kernel: 8 80 4294940722
Sep 7 22:55:52 jfsnew last message repeated 3 times
Sep 7 22:55:52 jfsnew kernel: last result at 4294940722
Sep 7 22:55:52 jfsnew kernel: last redo_fd_request at 4294940721
Sep 7 22:55:52 jfsnew kernel: c3 0
Sep 7 22:55:52 jfsnew kernel: status=80
Sep 7 22:55:52 jfsnew kernel: fdc_busy=1
Sep 7 22:55:52 jfsnew kernel: floppy_work.func=c02abc40
Sep 7 22:55:52 jfsnew kernel: cont=c0471720
Sep 7 22:55:52 jfsnew kernel: current_req=eb794004
Sep 7 22:55:52 jfsnew kernel: command_status=-1
Sep 7 22:55:52 jfsnew kernel:
Sep 7 22:55:52 jfsnew kernel: floppy0: floppy timeout called
Sep 7 22:55:52 jfsnew kernel: end_request: I/O error, dev fd0, sector 8
Sep 7 22:55:52 jfsnew kernel: Buffer I/O error on device fd0, logical block 1
Sep 7 22:55:52 jfsnew kernel: lost page write due to I/O error on fd0
Sep 7 22:55:55 jfsnew kernel: work still pending
Sep 7 22:56:04 jfsnew kernel: SysRq : Emergency Sync
Sep 7 22:56:05 jfsnew kernel: Emergency Sync complete
Sep 7 22:56:05 jfsnew kernel: SysRq : Emergency Sync
Sep 7 22:56:05 jfsnew kernel: Emergency Sync complete
Sep 7 22:56:06 jfsnew kernel: SysRq : Emergency Sync
Sep 7 22:56:06 jfsnew kernel: Emergency Sync complete
Sep 7 22:56:06 jfsnew kernel: SysRq : Emergency Sync
Sep 7 22:56:06 jfsnew kernel: Emergency Sync complete
Sep 7 22:56:10 jfsnew kernel: e a625280b
Sep 7 22:56:10 jfsnew kernel: 00000018 c179a0c4 c1799c20 ebbcd000 e9c9a0
00 e9c9a000 c026bc0e e9c9a000
Sep 7 22:56:10 jfsnew kernel: 00000008 7fffffff ec7ff0a4 00000000 c012d4
34 00000001 00000282 c0435500
Sep 7 22:56:10 jfsnew kernel: Call Trace:
Sep 7 22:56:10 jfsnew kernel: [opost_block+478/496] opost_block+0x1de/0x1f0
Sep 7 22:56:10 jfsnew kernel: [<c026bc0e>] opost_block+0x1de/0x1f0
Sep 7 22:56:10 jfsnew kernel: [schedule_timeout+20/192] schedule_timeout+0x14/
0xc0
Sep 7 22:56:10 jfsnew kernel: [<c012d434>] schedule_timeout+0x14/0xc0
Sep 7 22:56:10 jfsnew kernel: [generic_file_aio_read+38/48] generic_file_aio_r
ead+0x26/0x30
Sep 7 22:56:10 jfsnew kernel: [<c0140f86>] generic_file_aio_read+0x26/0x30
Sep 7 22:56:10 jfsnew kernel: [read_chan+1282/3248] read_chan+0x502/0xcb0
Sep 7 22:56:10 jfsnew kernel: [<c026dbe2>] read_chan+0x502/0xcb0
Sep 7 22:56:10 jfsnew kernel: [release_pages+541/560] release_pages+0x21d/0x23
0
Sep 7 22:56:10 jfsnew kernel: [<c014bb2d>] release_pages+0x21d/0x230
Sep 7 22:56:10 jfsnew kernel: [acquire_console_sem+42/80] acquire_console_sem+
0x2a/0x50
Sep 7 22:56:10 jfsnew kernel: [<c0124e7a>] acquire_console_sem+0x2a/0x50
Sep 7 22:56:10 jfsnew kernel: [write_chan+515/544] write_chan+0x203/0x220
Sep 7 22:56:10 jfsnew kernel: [<c026e593>] write_chan+0x203/0x220
Sep 7 22:56:10 jfsnew kernel: [default_wake_function+0/32] default_wake_functi
on+0x0/0x20
Sep 7 22:56:10 jfsnew kernel: [<c011f840>] default_wake_function+0x0/0x20
Sep 7 22:56:10 jfsnew kernel: [default_wake_function+0/32] default_wake_functi
on+0x0/0x20
Sep 7 22:56:10 jfsnew kernel: [<c011f840>] default_wake_function+0x0/0x20
Sep 7 22:56:10 jfsnew kernel: [default_wake_function+0/32] default_wake_functi
on+0x0/0x20
Sep 7 22:56:10 jfsnew kernel: [<c011f840>] default_wake_function+0x0/0x20
Sep 7 22:56:10 jfsnew kernel: [default_wake_function+0/32] default_wake_functi
on+0x0/0x20
Sep 7 22:56:10 jfsnew kernel: [<c011f840>] default_wake_function+0x0/0x20
Sep 7 22:56:10 jfsnew kernel: [tty_read+281/416] tty_read+0x119/0x1a0
Sep 7 22:56:10 jfsnew kernel: [<c0267b69>] tty_read+0x119/0x1a0
Sep 7 22:56:10 jfsnew kernel: [vfs_read+170/224] vfs_read+0xaa/0xe0
Sep 7 22:56:10 jfsnew kernel: [<c016014a>] vfs_read+0xaa/0xe0
Sep 7 22:56:10 jfsnew kernel: [do_munmap+347/368] do_munmap+0x15b/0x170
Sep 7 22:56:10 jfsnew kernel: [<c0153e4b>] do_munmap+0x15b/0x170
Sep 7 22:56:10 jfsnew kernel: [sys_read+47/80] sys_read+0x2f/0x50
Sep 7 22:56:10 jfsnew kernel: [<c016036f>] sys_read+0x2f/0x50
Sep 7 22:56:10 jfsnew kernel: [syscall_call+7/11] syscall_call+0x7/0xb
Sep 7 22:56:10 jfsnew kernel: [<c03b60b3>] syscall_call+0x7/0xb
Sep 7 22:56:10 jfsnew kernel:
Sep 7 22:56:10 jfsnew kernel: mingetty S E7944000 1374 1 13
75 1373 (NOTLB)
Sep 7 22:56:10 jfsnew kernel: e79d1e28 00000082 e79ca000 e7944000 e7944020 c179
1c20 000ca26a a68e9177
Sep 7 22:56:10 jfsnew kernel: 00000018 c17920c4 c1791c20 e79ca000 e7ad30
00 e7ad3000 c026bc0e e7ad3000
Sep 7 22:56:10 jfsnew kernel: 00000008 7fffffff e80670a4 00000000 c012d4
34 00000001 00000286 c0435500
Sep 7 22:56:10 jfsnew kernel: Call Trace:
Sep 7 22:56:10 jfsnew kernel: [opost_block+478/496] opost_block+0x1de/0x1f0
Sep 7 22:56:10 jfsnew kernel: [<c026bc0e>] opost_block+0x1de/0x1f0
Sep 7 22:56:10 jfsnew kernel: [schedule_timeout+20/192] schedule_timeout+0x14/
0xc0
Sep 7 22:56:10 jfsnew kernel: [<c012d434>] schedule_timeout+0x14/0xc0
Sep 7 22:56:10 jfsnew kernel: [generic_file_aio_read+38/48] generic_file_aio_r
ead+0x26/0x30
Sep 7 22:56:10 jfsnew kernel: [<c0140f86>] generic_file_aio_read+0x26/0x30
Sep 7 22:56:10 jfsnew kernel: [read_chan+1282/3248] read_chan+0x502/0xcb0
Sep 7 22:56:10 jfsnew kernel: [<c026dbe2>] read_chan+0x502/0xcb0
Sep 7 22:56:10 jfsnew kernel: [release_pages+541/560] release_pages+0x21d/0x23
0
Sep 7 22:56:10 jfsnew kernel: [<c014bb2d>] release_pages+0x21d/0x230
Sep 7 22:56:10 jfsnew kernel: [acquire_console_sem+42/80] acquire_console_sem+
0x2a/0x50
Sep 7 22:56:10 jfsnew kernel: [<c0124e7a>] acquire_console_sem+0x2a/0x50
Sep 7 22:56:10 jfsnew kernel: [write_chan+515/544] write_chan+0x203/0x220
Sep 7 22:56:10 jfsnew kernel: [<c026e593>] write_chan+0x203/0x220
Sep 7 22:56:10 jfsnew kernel: [default_wake_function+0/32] default_wake_functi
on+0x0/0x20
Sep 7 22:56:10 jfsnew kernel: [<c011f840>] default_wake_function+0x0/0x20
Sep 7 22:56:10 jfsnew kernel: [default_wake_function+0/32] default_wake_functi
on+0x0/0x20
Sep 7 22:56:10 jfsnew kernel: [<c011f840>] default_wake_function+0x0/0x20
Sep 7 22:56:10 jfsnew kernel: [default_wake_function+0/32] default_wake_functi
on+0x0/0x20
Sep 7 22:56:10 jfsnew kernel: [<c011f840>] default_wake_function+0x0/0x20
Sep 7 22:56:10 jfsnew kernel: [default_wake_function+0/32] default_wake_function+0x0/0x20
Sep 7 22:56:10 jfsnew kernel: [<c011f840>] default_wake_function+0x0/0x20
Sep 7 22:56:10 jfsnew kernel: [tty_read+281/416] tty_read+0x119/0x1a0
Sep 7 22:56:10 jfsnew kernel: [<c0267b69>] tty_read+0x119/0x1a0
Sep 7 22:56:10 jfsnew kernel: [vfs_read+170/224] vfs_read+0xaa/0xe0
Sep 7 22:56:10 jfsnew kernel: [<c016014a>] vfs_read+0xaa/0xe0
Sep 7 22:56:10 jfsnew kernel: [do_munmap+347/368] do_munmap+0x15b/0x170
Sep 7 22:56:11 jfsnew kernel: [<c0153e4b>] do_munmap+0x15b/0x170
Sep 7 22:56:11 jfsnew kernel: [sys_read+47/80] sys_read+0x2f/0x50
Sep 7 22:56:11 jfsnew kernel: [<c016036f>] sys_read+0x2f/0x50
Sep 7 22:56:11 jfsnew kernel: [syscall_call+7/11] syscall_call+0x7/0xb
Sep 7 22:56:11 jfsnew kernel: [<c03b60b3>] syscall_call+0x7/0xb
Sep 7 22:56:11 jfsnew kernel:
Sep 7 22:56:11 jfsnew kernel: mingetty S 00000246 1375 1 13
76 1374 (NOTLB)
Sep 7 22:56:11 jfsnew kernel: e7a0fe28 00000082 ea36f000 00000246 00000008 c179
1c20 000517e7 a62e1bc3
Sep 7 22:56:11 jfsnew kernel: 00000018 00000000 c1791c20 ea36f000 e790c0
00 e790c000 c026bc0e e790c000
Sep 7 22:56:11 jfsnew kernel: 00000008 7fffffff e790d0a4 00000000 c012d4
34 00000001 00000286 c0435500
Sep 7 22:56:11 jfsnew kernel: Call Trace:
Sep 7 22:56:11 jfsnew kernel: [opost_block+478/496] opost_block+0x1de/0x1f0
Sep 7 22:56:11 jfsnew kernel: [<c026bc0e>] opost_block+0x1de/0x1f0
Sep 7 22:56:11 jfsnew kernel: [schedule_timeout+20/192] schedule_timeout+0x14/
0xc0
Sep 7 22:56:11 jfsnew kernel: [<c012d434>] schedule_timeout+0x14/0xc0
Sep 7 22:56:11 jfsnew kernel: [generic_file_aio_read+38/48] generic_file_aio_r
ead+0x26/0x30
Sep 7 22:56:11 jfsnew kernel: [<c0140f86>] generic_file_aio_read+0x26/0x30
Sep 7 22:56:11 jfsnew kernel: [read_chan+1282/3248] read_chan+0x502/0xcb0
Sep 7 22:56:11 jfsnew kernel: [<c026dbe2>] read_chan+0x502/0xcb0
Sep 7 22:56:11 jfsnew kernel: [release_pages+541/560] release_pages+0x21d/0x23
0
Sep 7 22:56:11 jfsnew kernel: [<c014bb2d>] release_pages+0x21d/0x230
Sep 7 22:56:11 jfsnew kernel: [acquire_console_sem+42/80] acquire_console_sem+
0x2a/0x50
Sep 7 22:56:11 jfsnew kernel: [<c0124e7a>] acquire_console_sem+0x2a/0x50
Sep 7 22:56:11 jfsnew kernel: [write_chan+515/544] write_chan+0x203/0x220
Sep 7 22:56:11 jfsnew kernel: [<c026e593>] write_chan+0x203/0x220
Sep 7 22:56:11 jfsnew kernel: [default_wake_function+0/32] default_wake_functi
on+0x0/0x20
Sep 7 22:56:11 jfsnew kernel: [<c011f840>] default_wake_function+0x0/0x20
Sep 7 22:56:11 jfsnew kernel: [default_wake_function+0/32] default_wake_functi
on+0x0/0x20
Sep 7 22:56:11 jfsnew kernel: [<c011f840>] default_wake_function+0x0/0x20
Sep 7 22:56:11 jfsnew kernel: [default_wake_function+0/32] default_wake_functi
on+0x0/0x20
Sep 7 22:56:11 jfsnew kernel: [<c011f840>] default_wake_function+0x0/0x20
Sep 7 22:56:11 jfsnew kernel: [default_wake_function+0/32] default_wake_functi
on+0x0/0x20
Sep 7 22:56:11 jfsnew kernel: [<c011f840>] default_wake_function+0x0/0x20
Sep 7 22:56:11 jfsnew kernel: [tty_read+281/416] tty_read+0x119/0x1a0
Sep 7 22:56:11 jfsnew kernel: [<c0267b69>] tty_read+0x119/0x1a0
Sep 7 22:56:11 jfsnew kernel: [vfs_read+170/224] vfs_read+0xaa/0xe0
Sep 7 22:56:11 jfsnew kernel: [<c016014a>] vfs_read+0xaa/0xe0
Sep 7 22:56:11 jfsnew kernel: [do_munmap+347/368] do_munmap+0x15b/0x170
Sep 7 22:56:11 jfsnew kernel: [<c0153e4b>] do_munmap+0x15b/0x170
Sep 7 22:56:11 jfsnew kernel: [sys_read+47/80] sys_read+0x2f/0x50
Sep 7 22:56:11 jfsnew kernel: [<c016036f>] sys_read+0x2f/0x50
Sep 7 22:56:11 jfsnew kernel: [syscall_call+7/11] syscall_call+0x7/0xb
Sep 7 22:56:11 jfsnew kernel: [<c03b60b3>] syscall_call+0x7/0xb
Sep 7 22:56:11 jfsnew kernel:
Sep 7 22:56:11 jfsnew kernel: mingetty S 00000246 1376 1 13
77 1375 (NOTLB)
Sep 7 22:56:11 jfsnew kernel: e9c41e28 00000082 eb264000 00000246 00000008 c179
9c20 00048372 a62ece1c
Sep 7 22:56:11 jfsnew kernel: 00000018 00000000 c1799c20 eb264000 e841e0
00 e841e000 c026bc0e e841e000
Sep 7 22:56:11 jfsnew kernel: 00000008 7fffffff e7c740a4 00000000 c012d4
34 00000001 00000282 c0435500
Sep 7 22:56:11 jfsnew kernel: Call Trace:
Sep 7 22:56:11 jfsnew kernel: [opost_block+478/496] opost_block+0x1de/0x1f0
Sep 7 22:56:11 jfsnew kernel: [<c026bc0e>] opost_block+0x1de/0x1f0
Sep 7 22:56:11 jfsnew kernel: [schedule_timeout+20/192] schedule_timeout+0x14/
0xc0
Sep 7 22:56:11 jfsnew kernel: [<c012d434>] schedule_timeout+0x14/0xc0
Sep 7 22:56:11 jfsnew kernel: [generic_file_aio_read+38/48] generic_file_aio_r
ead+0x26/0x30
Sep 7 22:56:11 jfsnew kernel: [<c0140f86>] generic_file_aio_read+0x26/0x30
Sep 7 22:56:11 jfsnew kernel: [read_chan+1282/3248] read_chan+0x502/0xcb0
Sep 7 22:56:11 jfsnew kernel: [<c026dbe2>] read_chan+0x502/0xcb0
Sep 7 22:56:11 jfsnew kernel: [release_pages+541/560] release_pages+0x21d/0x23
0
Sep 7 22:56:11 jfsnew kernel: [<c014bb2d>] release_pages+0x21d/0x230
Sep 7 22:56:11 jfsnew kernel: [acquire_console_sem+42/80] acquire_console_sem+
0x2a/0x50
Sep 7 22:56:11 jfsnew kernel: [<c0124e7a>] acquire_console_sem+0x2a/0x50
Sep 7 22:56:11 jfsnew kernel: [write_chan+515/544] write_chan+0x203/0x220
Sep 7 22:56:11 jfsnew kernel: [<c026e593>] write_chan+0x203/0x220
Sep 7 22:56:11 jfsnew kernel: [default_wake_function+0/32] default_wake_functi
on+0x0/0x20
Sep 7 22:56:11 jfsnew kernel: [<c011f840>] default_wake_function+0x0/0x20
Sep 7 22:56:11 jfsnew kernel: [default_wake_function+0/32] default_wake_functi
on+0x0/0x20
Sep 7 22:56:11 jfsnew kernel: [<c011f840>] default_wake_function+0x0/0x20
Sep 7 22:56:11 jfsnew kernel: [default_wake_function+0/32] default_wake_functi
on+0x0/0x20
Sep 7 22:56:11 jfsnew kernel: [<c011f840>] default_wake_function+0x0/0x20
Sep 7 22:56:11 jfsnew kernel: [default_wake_function+0/32] default_wake_functi
on+0x0/0x20
Sep 7 22:56:11 jfsnew kernel: [<c011f840>] default_wake_function+0x0/0x20
Sep 7 22:56:11 jfsnew kernel: [tty_read+281/416] tty_read+0x119/0x1a0
Sep 7 22:56:11 jfsnew kernel: [<c0267b69>] tty_read+0x119/0x1a0
Sep 7 22:56:11 jfsnew kernel: [vfs_read+170/224] vfs_read+0xaa/0xe0
Sep 7 22:56:11 jfsnew kernel: [<c016014a>] vfs_read+0xaa/0xe0
Sep 7 22:56:11 jfsnew kernel: [do_munmap+347/368] do_munmap+0x15b/0x170
Sep 7 22:56:11 jfsnew kernel: [<c0153e4b>] do_munmap+0x15b/0x170
Sep 7 22:56:11 jfsnew kernel: [sys_read+47/80] sys_read+0x2f/0x50
Sep 7 22:56:11 jfsnew kernel: [<c016036f>] sys_read+0x2f/0x50
Sep 7 22:56:11 jfsnew kernel: [syscall_call+7/11] syscall_call+0x7/0xb
Sep 7 22:56:11 jfsnew kernel: [<c03b60b3>] syscall_call+0x7/0xb
Sep 7 22:56:11 jfsnew kernel:
Sep 7 22:56:11 jfsnew kernel: gdm S C0435F00 1377 1 1385 15
68 1376 (NOTLB)
Sep 7 22:56:11 jfsnew kernel: e79b3ee4 00000082 e7944000 c0435f00 00000000 c179
9c20 0003a879 5efbfc83
Sep 7 22:56:11 jfsnew kernel: 00000025 e7abb004 c1799c20 e7944000 000000
00 c0174503 00000246 eb69e004
Sep 7 22:56:11 jfsnew kernel: 00000000 7fffffff e79b3f30 7fffffff c012d4
34 e79b3fa0 e8035004 00000145
Sep 7 22:56:11 jfsnew kernel: Call Trace:
Sep 7 22:56:11 jfsnew kernel: [__pollwait+51/160] __pollwait+0x33/0xa0
Sep 7 22:56:11 jfsnew kernel: [<c0174503>] __pollwait+0x33/0xa0
Sep 7 22:56:11 jfsnew kernel: [schedule_timeout+20/192] schedule_timeout+0x14/
0xc0
Sep 7 22:56:11 jfsnew kernel: [<c012d434>] schedule_timeout+0x14/0xc0
Sep 7 22:56:11 jfsnew kernel: [do_pollfd+69/128] do_pollfd+0x45/0x80
Sep 7 22:56:11 jfsnew kernel: [<c0174f55>] do_pollfd+0x45/0x80
Sep 7 22:56:11 jfsnew kernel: [do_pollfd+89/128] do_pollfd+0x59/0x80
Sep 7 22:56:11 jfsnew kernel: [<c0174f69>] do_pollfd+0x59/0x80
Sep 7 22:56:11 jfsnew kernel: [do_poll+111/224] do_poll+0x6f/0xe0
Sep 7 22:56:11 jfsnew kernel: [<c0174fff>] do_poll+0x6f/0xe0
Sep 7 22:56:11 jfsnew kernel: [do_poll+178/224] do_poll+0xb2/0xe0
Sep 7 22:56:11 jfsnew kernel: [<c0175042>] do_poll+0xb2/0xe0
Sep 7 22:56:11 jfsnew kernel: [sys_poll+486/688] sys_poll+0x1e6/0x2b0
Sep 7 22:56:11 jfsnew kernel: [<c0175256>] sys_poll+0x1e6/0x2b0
Sep 7 22:56:11 jfsnew kernel: [__pollwait+0/160] __pollwait+0x0/0xa0
Sep 7 22:56:11 jfsnew kernel: [<c01744d0>] __pollwait+0x0/0xa0
Sep 7 22:56:11 jfsnew kernel: [syscall_call+7/11] syscall_call+0x7/0xb
Sep 7 22:56:11 jfsnew kernel: [<c03b60b3>] syscall_call+0x7/0xb
Sep 7 22:56:11 jfsnew kernel: [xdr_decode_string_inplace+11/48] xdr_decode_str
ing_inplace+0xb/0x30
Sep 7 22:56:11 jfsnew kernel: [<c03b007b>] xdr_decode_string_inplace+0xb/0x30
Sep 7 22:56:12 jfsnew kernel:
Sep 7 22:56:12 jfsnew kernel: gdm S 00000001 1385 1377 1386
(NOTLB)
Sep 7 22:56:12 jfsnew kernel: ebb4ff60 00000082 eb84b000 00000001 e740eb1c c1791c20 0001200d 5efcae4c
Sep 7 22:56:12 jfsnew kernel: 00000025 405bde34 c1791c20 eb84b000 000000
00 00000000 405425e1 00000073
Sep 7 22:56:12 jfsnew kernel: 00000001 fffffe00 eb84b000 00000000 c01277
b1 ebb4e000 00000001 00000000
Sep 7 22:56:12 jfsnew kernel: Call Trace:
Sep 7 22:56:12 jfsnew kernel: [sys_wait4+593/656] sys_wait4+0x251/0x290
Sep 7 22:56:12 jfsnew kernel: [<c01277b1>] sys_wait4+0x251/0x290
Sep 7 22:56:12 jfsnew kernel: [default_wake_function+0/32] default_wake_functi
on+0x0/0x20
Sep 7 22:56:12 jfsnew kernel: [<c011f840>] default_wake_function+0x0/0x20
Sep 7 22:56:12 jfsnew kernel: [sys_sigreturn+277/320] sys_sigreturn+0x115/0x14
0
Sep 7 22:56:12 jfsnew kernel: [<c010c445>] sys_sigreturn+0x115/0x140
Sep 7 22:56:12 jfsnew kernel: [default_wake_function+0/32] default_wake_functi
on+0x0/0x20
Sep 7 22:56:12 jfsnew kernel: [<c011f840>] default_wake_function+0x0/0x20
Sep 7 22:56:12 jfsnew kernel: [syscall_call+7/11] syscall_call+0x7/0xb
Sep 7 22:56:12 jfsnew kernel: [<c03b60b3>] syscall_call+0x7/0xb
Sep 7 22:56:12 jfsnew kernel:
Sep 7 22:56:12 jfsnew kernel: X S 00000000 1386 1385 14
37 (NOTLB)
Sep 7 22:56:12 jfsnew kernel: e73c3e60 00003082 e73ea000 00000000 c179a580 c179
9c20 00003f46 f11d73ed
Sep 7 22:56:12 jfsnew kernel: 00000042 00003246 c1799c20 e73ea000 e79db0
00 00000000 00010b0d e73c3e68
Sep 7 22:56:12 jfsnew kernel: 00010b0d e73c3e68 00000000 00000000 c012d4
bc c179afb0 c179afb0 00010b0d
Sep 7 22:56:12 jfsnew kernel: Call Trace:
Sep 7 22:56:12 jfsnew kernel: [schedule_timeout+156/192] schedule_timeout+0x9c
/0xc0
Sep 7 22:56:12 jfsnew kernel: [<c012d4bc>] schedule_timeout+0x9c/0xc0
Sep 7 22:56:12 jfsnew kernel: [process_timeout+0/16] process_timeout+0x0/0x10
Sep 7 22:56:12 jfsnew kernel: [<c012d410>] process_timeout+0x0/0x10
Sep 7 22:56:12 jfsnew kernel: [do_select+767/832] do_select+0x2ff/0x340
Sep 7 22:56:12 jfsnew kernel: [<c017494f>] do_select+0x2ff/0x340
Sep 7 22:56:12 jfsnew kernel: [__pollwait+0/160] __pollwait+0x0/0xa0
Sep 7 22:56:12 jfsnew kernel: [<c01744d0>] __pollwait+0x0/0xa0
Sep 7 22:56:12 jfsnew kernel: [select_bits_alloc+23/32] select_bits_alloc+0x17
/0x20
Sep 7 22:56:12 jfsnew kernel: [<c01749a7>] select_bits_alloc+0x17/0x20
Sep 7 22:56:12 jfsnew kernel: [sys_select+904/1360] sys_select+0x388/0x550
Sep 7 22:56:12 jfsnew kernel: [<c0174d48>] sys_select+0x388/0x550
Sep 7 22:56:12 jfsnew kernel: [do_gettimeofday+32/160] do_gettimeofday+0x20/0x
a0
Sep 7 22:56:12 jfsnew kernel: [<c0112cd0>] do_gettimeofday+0x20/0xa0
Sep 7 22:56:12 jfsnew kernel: [sys_gettimeofday+85/192]
sys_gettimeofday+0x55/
0xc0
Sep 7 22:56:12 jfsnew kernel: [<c0127e45>] sys_gettimeofday+0x55/0xc0
Sep 7 22:56:12 jfsnew kernel: [sys_ioctl+631/736] sys_ioctl+0x277/0x2e0
Sep 7 22:56:12 jfsnew kernel: [<c0173ea7>] sys_ioctl+0x277/0x2e0
Sep 7 22:56:12 jfsnew kernel: [sys_sigreturn+277/320] sys_sigreturn+0x115/0x14
0
Sep 7 22:56:12 jfsnew kernel: [<c010c445>] sys_sigreturn+0x115/0x140
Sep 7 22:56:12 jfsnew kernel: [syscall_call+7/11] syscall_call+0x7/0xb
Sep 7 22:56:12 jfsnew kernel: [<c03b60b3>] syscall_call+0x7/0xb
Sep 7 22:56:12 jfsnew kernel:
Sep 7 22:56:12 jfsnew kernel: tcsh S 00000000 1437 1385 1491
1386 (NOTLB)
Sep 7 22:56:12 jfsnew kernel: e6473f78 00000082 e672a000 00000000 e6473f68 c179
9c20 00045179 9ce57cd0
Sep 7 22:56:12 jfsnew kernel: 00000025 bfff55f0 c1799c20 e672a000 000000
00 00000002 00000000 00000008
Sep 7 22:56:12 jfsnew kernel: e6472000 e6473f80 bfff5670 e6473fc4 c010c0
7d 00010002 00000000 00000002
Sep 7 22:56:12 jfsnew kernel: Call Trace:
Sep 7 22:56:12 jfsnew kernel: [sys_rt_sigsuspend+365/400] sys_rt_sigsuspend+0x
16d/0x190
Sep 7 22:56:12 jfsnew kernel: [<c010c07d>] sys_rt_sigsuspend+0x16d/0x190
Sep 7 22:56:12 jfsnew kernel: [sys_fork+25/32] sys_fork+0x19/0x20
Sep 7 22:56:12 jfsnew kernel: [<c010b3f9>] sys_fork+0x19/0x20
Sep 7 22:56:12 jfsnew kernel: [syscall_call+7/11] syscall_call+0x7/0xb
Sep 7 22:56:12 jfsnew kernel: [<c03b60b3>] syscall_call+0x7/0xb
Sep 7 22:56:12 jfsnew kernel:
Sep 7 22:56:12 jfsnew kernel: fvwm2 S E5749000 1491 1437 1495
(NOTLB)
Sep 7 22:56:12 jfsnew kernel: e5945e60 00000082 e62fa000 e5749000 e5749020 c179
1c20 00064040 4970d179
Sep 7 22:56:12 jfsnew kernel: 0000003a c17920c4 c1791c20 e62fa000 000000
01 000000d0 e5945eec e720f004
Sep 7 22:56:12 jfsnew kernel: 00000000 7fffffff 00000000 00000000 c012d4
34 e5c2b004 00000145 00000246
Sep 7 22:56:12 jfsnew kernel: Call Trace:
Sep 7 22:56:12 jfsnew kernel: [schedule_timeout+20/192] schedule_timeout+0x14/
0xc0
Sep 7 22:56:12 jfsnew kernel: [<c012d434>] schedule_timeout+0x14/0xc0
Sep 7 22:56:12 jfsnew kernel: [pipe_poll+35/112] pipe_poll+0x23/0x70


and so on.... I'm not sure if any of the rest really helps here.

John

2003-09-08 18:09:44

by Randy.Dunlap

[permalink] [raw]
Subject: Re: 2.6.0-test4-mm4 - bad floppy hangs keyboard input

On Mon, 8 Sep 2003 11:38:21 -0400 "John Stoffel" <[email protected]> wrote:

|
| Hi,
|
| I've run into a wierd problem with 2.6.0-test4-mm4 on an SMP Xeon
| 550Mhz system. I was dd'ing some images to floppy and when I used a
| bad floppy, it would read 17+0 blocks, write 16+0 blocks and then
| hang. At this point, I couldn't use the keyboard at all, and I got an
| error message about "lost serial connection to UPS", which is from
| apcupsd, which I have talking over my Cyclades Cyclom-8y ISA serial
| port card.
|
| At this point, the load jumps to 1, nothing happes on the floppy
| drive, and I can't input data to xterms. I can browse the web just
| fine using the mouse inside Galeon, and other stuff runs just fine. I
| can even cut'n'paste stuff from one xterm and have it run in another.
|
| But the only way I've found to fix this is to reboot, which requires
| an fsck of my disks, which takes a while. I'll see if I can re-create
| this in more detail tonight while at home, and hopefully from a
| console.
|
| Strangley enough, Magic SysReq still seems to work, since that's how I
| reboot. Logging out of Xwindows works, but then when I try to reboot
| the system from gdm, it hangs at a blank screen.
|
| Whee!
|
| Here's my interrupts:
|
| jfsnew:~> cat /proc/interrupts
| CPU0 CPU1
| 0: 44911738 411 IO-APIC-edge timer
| 1: 1290 0 IO-APIC-edge i8042
| 2: 0 0 XT-PIC cascade
| 8: 1 0 IO-APIC-edge rtc
| 11: 861346 2 IO-APIC-edge Cyclom-Y
| 12: 2053 0 IO-APIC-edge i8042
| 14: 141353 0 IO-APIC-edge ide0
| 16: 0 0 IO-APIC-level ohci-hcd
| 17: 6447 0 IO-APIC-level ohci-hcd, eth0
| 18: 214706 1 IO-APIC-level aic7xxx, aic7xxx, ehci_hcd
| 19: 50 0 IO-APIC-level aic7xxx, uhci-hcd
| NMI: 0 0
| LOC: 44911434 44911479
| ERR: 0
| MIS: 0
|
| -------------------------------------------------------------------
| /var/log/messages
| -------------------------------------------------------------------
|
| Sep 7 22:55:26 jfsnew kernel: floppy0: sector not found: track 0, head 1, secto
| r 1, size 2
| Sep 7 22:55:26 jfsnew kernel: floppy0: sector not found: track 0, head 1, secto
| r 1, size 2
| Sep 7 22:55:26 jfsnew kernel: end_request: I/O error, dev fd0, sector 18
| Sep 7 22:55:26 jfsnew kernel: Unable to handle kernel paging request at virtual
| address eb655050
| Sep 7 22:55:26 jfsnew kernel: printing eip:
| Sep 7 22:55:26 jfsnew kernel: c02ac816
| Sep 7 22:55:26 jfsnew kernel: *pde = 00541063
| Sep 7 22:55:26 jfsnew kernel: *pte = 2b655000
| Sep 7 22:55:26 jfsnew kernel: Oops: 0000 [#1]
| Sep 7 22:55:26 jfsnew kernel: SMP DEBUG_PAGEALLOC
| Sep 7 22:55:26 jfsnew kernel: CPU: 0
| Sep 7 22:55:26 jfsnew kernel: EIP: 0060:[bad_flp_intr+150/224] Not tainte
| d VLI
| Sep 7 22:55:26 jfsnew kernel: EIP: 0060:[<c02ac816>] Not tainted VLI
| Sep 7 22:55:26 jfsnew kernel: EFLAGS: 00010246
| Sep 7 22:55:26 jfsnew kernel: EIP is at bad_flp_intr+0x96/0xe0
| Sep 7 22:55:26 jfsnew kernel: eax: 00000000 ebx: 00000000 ecx: 00000000 e
| dx: eb655050
| Sep 7 22:55:26 jfsnew kernel: esi: c0520c00 edi: eb655050 ebp: 00000002 e
| sp: efea7f5c
| Sep 7 22:55:26 jfsnew kernel: ds: 007b es: 007b ss: 0068
| Sep 7 22:55:26 jfsnew kernel: Process events/0 (pid: 6, threadinfo=efea6000 tas
| k=c17ef000)
| Sep 7 22:55:26 jfsnew kernel: Stack: 00000000 00000000 00000001 c02ad0ba 000000
| 00 c0430940 00000003 c0471680
| Sep 7 22:55:26 jfsnew kernel: 00000283 c17f0004 00000000 c02aa5f1 c04716
| c0 c013547d 00000000 5a5a5a5a
| Sep 7 22:55:26 jfsnew kernel: c02aa5e0 00000001 00000000 c011f840 000100
| 00 00000000 00000000 c17dbf1c
| Sep 7 22:55:26 jfsnew kernel: Call Trace:
| Sep 7 22:55:26 jfsnew kernel: [rw_interrupt+474/800] rw_interrupt+0x1da/0x320
| Sep 7 22:55:26 jfsnew kernel: [<c02ad0ba>] rw_interrupt+0x1da/0x320
| Sep 7 22:55:26 jfsnew kernel: [main_command_interrupt+17/32] main_command_inte
| rrupt+0x11/0x20
| Sep 7 22:55:26 jfsnew kernel: [<c02aa5f1>] main_command_interrupt+0x11/0x20
| Sep 7 22:55:26 jfsnew kernel: [worker_thread+525/816] worker_thread+0x20d/0x33
| 0
| Sep 7 22:55:26 jfsnew kernel: [<c013547d>] worker_thread+0x20d/0x330
| Sep 7 22:55:26 jfsnew kernel: [main_command_interrupt+0/32] main_command_inter
| rupt+0x0/0x20
| Sep 7 22:55:26 jfsnew kernel: [<c02aa5e0>] main_command_interrupt+0x0/0x20
| Sep 7 22:55:26 jfsnew kernel: [default_wake_function+0/32] default_wake_functi
| on+0x0/0x20
| Sep 7 22:55:26 jfsnew kernel: [<c011f840>] default_wake_function+0x0/0x20
| Sep 7 22:55:26 jfsnew kernel: [ret_from_fork+6/20] ret_from_fork+0x6/0x14
| Sep 7 22:55:26 jfsnew kernel: [<c03b5fd2>] ret_from_fork+0x6/0x14
| Sep 7 22:55:26 jfsnew kernel: [default_wake_function+0/32] default_wake_functi
| on+0x0/0x20
| Sep 7 22:55:26 jfsnew kernel: [<c011f840>] default_wake_function+0x0/0x20
| Sep 7 22:55:26 jfsnew kernel: [worker_thread+0/816] worker_thread+0x0/0x330
| Sep 7 22:55:26 jfsnew kernel: [<c0135270>] worker_thread+0x0/0x330
| Sep 7 22:55:26 jfsnew kernel: [kernel_thread_helper+5/12] kernel_thread_helper
| +0x5/0xc
| Sep 7 22:55:26 jfsnew kernel: [<c010ac69>] kernel_thread_helper+0x5/0xc
| Sep 7 22:55:26 jfsnew kernel:
| Sep 7 22:55:26 jfsnew kernel: Code: 52 c0 8d 04 9b 8d 04 43 8b 44 c6 28 39 07 7
| 6 0b 6a 00 a1 24 1c 52 c0 ff 50 0c 5b 0f b6 0d 84 1c 52 c0 8b 15 20 1c 52 c0 8d
| 04 89 <8b> 12 8d 04 41 c1 e0 03 3b 54 06 30 76 11 a1 80 1c 52 c0 c1 e0
| Sep 7 22:55:38 jfsnew apcupsd[1324]: Serial communications with UPS lost
| Sep 7 22:55:46 jfsnew kernel:
| Sep 7 22:55:46 jfsnew kernel: floppy driver state
| Sep 7 22:55:46 jfsnew kernel: -------------------
| Sep 7 22:55:46 jfsnew kernel: now=4294937721 last interrupt=4294917720 diff=200
| 01 last called handler=c02aa5e0
| Sep 7 22:55:46 jfsnew kernel: timeout_message=request done %%d
| Sep 7 22:55:46 jfsnew kernel: last output bytes:
| Sep 7 22:55:46 jfsnew kernel: 0 90 4294917324
| Sep 7 22:55:46 jfsnew kernel: 13 80 4294917324
| Sep 7 22:55:46 jfsnew kernel: 0 90 4294917324
| Sep 7 22:55:46 jfsnew kernel: 1a 90 4294917324
| Sep 7 22:55:46 jfsnew kernel: 0 90 4294917324
| Sep 7 22:55:46 jfsnew kernel: 3 80 4294917324
| Sep 7 22:55:46 jfsnew kernel: c1 90 4294917324
| Sep 7 22:55:46 jfsnew kernel: 10 90 4294917324
| Sep 7 22:55:46 jfsnew kernel: 7 80 4294917324
| Sep 7 22:55:46 jfsnew kernel: 0 90 4294917324
| Sep 7 22:55:46 jfsnew kernel: 8 81 4294917324
| Sep 7 22:55:46 jfsnew kernel: e6 80 4294917324
| Sep 7 22:55:46 jfsnew kernel: 4 90 4294917324
| Sep 7 22:55:46 jfsnew kernel: 0 90 4294917324
| Sep 7 22:55:46 jfsnew kernel: 1 90 4294917324
| Sep 7 22:55:46 jfsnew kernel: 1 90 4294917324
| Sep 7 22:55:46 jfsnew kernel: 2 90 4294917325
| Sep 7 22:55:46 jfsnew kernel: 12 90 4294917325
| Sep 7 22:55:46 jfsnew kernel: 1b 90 4294917325
| Sep 7 22:55:46 jfsnew kernel: ff 90 4294917325
| Sep 7 22:55:46 jfsnew kernel: last result at 4294917720
| Sep 7 22:55:46 jfsnew kernel: last redo_fd_request at 4294917324
| Sep 7 22:55:46 jfsnew kernel: 44 1 0 0 1 1 2
| Sep 7 22:55:46 jfsnew kernel: status=80
| Sep 7 22:55:46 jfsnew kernel: fdc_busy=1
| Sep 7 22:55:46 jfsnew kernel: cont=c0471720
| Sep 7 22:55:46 jfsnew kernel: current_req=00000000
| Sep 7 22:55:46 jfsnew kernel: command_status=-1
| Sep 7 22:55:46 jfsnew kernel:
| Sep 7 22:55:46 jfsnew kernel: floppy0: floppy timeout called
| Sep 7 22:55:46 jfsnew kernel: floppy.c: no request in request_done
| Sep 7 22:55:49 jfsnew kernel:
| Sep 7 22:55:49 jfsnew kernel: floppy driver state
| Sep 7 22:55:49 jfsnew kernel: -------------------
| Sep 7 22:55:49 jfsnew kernel: now=4294940721 last interrupt=4294937722 diff=299
| 9 last called handler=c02abc40
| Sep 7 22:55:49 jfsnew kernel: timeout_message=redo fd request
| Sep 7 22:55:49 jfsnew kernel: last output bytes:
| Sep 7 22:55:49 jfsnew kernel: 0 90 4294917324
| Sep 7 22:55:49 jfsnew kernel: 3 80 4294917324
| Sep 7 22:55:49 jfsnew kernel: c1 90 4294917324
| Sep 7 22:55:49 jfsnew kernel: 10 90 4294917324
| Sep 7 22:55:49 jfsnew kernel: 7 80 4294917324
| Sep 7 22:55:49 jfsnew kernel: 0 90 4294917324
| Sep 7 22:55:49 jfsnew kernel: 8 81 4294917324
| Sep 7 22:55:49 jfsnew kernel: e6 80 4294917324
| Sep 7 22:55:49 jfsnew kernel: 4 90 4294917324
| Sep 7 22:55:49 jfsnew kernel: 0 90 4294917324
| Sep 7 22:55:49 jfsnew kernel: 1 90 4294917324
| Sep 7 22:55:49 jfsnew kernel: 1 90 4294917324
| Sep 7 22:55:49 jfsnew kernel: 2 90 4294917325
| Sep 7 22:55:49 jfsnew kernel: 12 90 4294917325
| Sep 7 22:55:49 jfsnew kernel: 1b 90 4294917325
| Sep 7 22:55:49 jfsnew kernel: ff 90 4294917325
| Sep 7 22:55:49 jfsnew kernel: 8 80 4294937722
| Sep 7 22:55:49 jfsnew last message repeated 3 times
| Sep 7 22:55:49 jfsnew kernel: last result at 4294937722
| Sep 7 22:55:49 jfsnew kernel: last redo_fd_request at 4294937721
| Sep 7 22:55:49 jfsnew kernel: c3 0
| Sep 7 22:55:49 jfsnew kernel: status=80
| Sep 7 22:55:49 jfsnew kernel: fdc_busy=1
| Sep 7 22:55:49 jfsnew kernel: floppy_work.func=c02abc40
| Sep 7 22:55:49 jfsnew kernel: cont=c0471720
| Sep 7 22:55:49 jfsnew kernel: current_req=eb794004
| Sep 7 22:55:49 jfsnew kernel: command_status=-1
| Sep 7 22:55:49 jfsnew kernel:
| Sep 7 22:55:49 jfsnew kernel: floppy0: floppy timeout called
| Sep 7 22:55:49 jfsnew kernel: end_request: I/O error, dev fd0, sector 0
| Sep 7 22:55:49 jfsnew kernel: Buffer I/O error on device fd0, logical block 0
| Sep 7 22:55:49 jfsnew kernel: lost page write due to I/O error on fd0
| Sep 7 22:55:52 jfsnew kernel:
| Sep 7 22:55:52 jfsnew kernel: floppy driver state
| Sep 7 22:55:52 jfsnew kernel: -------------------
| Sep 7 22:55:52 jfsnew kernel: now=4294943721 last interrupt=4294940722 diff=299
| 9 last called handler=c02abc40
| Sep 7 22:55:52 jfsnew kernel: timeout_message=redo fd request
| Sep 7 22:55:52 jfsnew kernel: last output bytes:
| Sep 7 22:55:52 jfsnew kernel: 7 80 4294917324
| Sep 7 22:55:52 jfsnew kernel: 0 90 4294917324
| Sep 7 22:55:52 jfsnew kernel: 8 81 4294917324
| Sep 7 22:55:52 jfsnew kernel: e6 80 4294917324
| Sep 7 22:55:52 jfsnew kernel: 4 90 4294917324
| Sep 7 22:55:52 jfsnew kernel: 0 90 4294917324
| Sep 7 22:55:52 jfsnew kernel: 1 90 4294917324
| Sep 7 22:55:52 jfsnew kernel: 1 90 4294917324
| Sep 7 22:55:52 jfsnew kernel: 2 90 4294917325
| Sep 7 22:55:52 jfsnew kernel: 12 90 4294917325
| Sep 7 22:55:52 jfsnew kernel: 1b 90 4294917325
| Sep 7 22:55:52 jfsnew kernel: ff 90 4294917325
| Sep 7 22:55:52 jfsnew kernel: 8 80 4294937722
| Sep 7 22:55:52 jfsnew last message repeated 3 times
| Sep 7 22:55:52 jfsnew kernel: 8 80 4294940722
| Sep 7 22:55:52 jfsnew last message repeated 3 times
| Sep 7 22:55:52 jfsnew kernel: last result at 4294940722
| Sep 7 22:55:52 jfsnew kernel: last redo_fd_request at 4294940721
| Sep 7 22:55:52 jfsnew kernel: c3 0
| Sep 7 22:55:52 jfsnew kernel: status=80
| Sep 7 22:55:52 jfsnew kernel: fdc_busy=1
| Sep 7 22:55:52 jfsnew kernel: floppy_work.func=c02abc40
| Sep 7 22:55:52 jfsnew kernel: cont=c0471720
| Sep 7 22:55:52 jfsnew kernel: current_req=eb794004
| Sep 7 22:55:52 jfsnew kernel: command_status=-1
| Sep 7 22:55:52 jfsnew kernel:
| Sep 7 22:55:52 jfsnew kernel: floppy0: floppy timeout called
| Sep 7 22:55:52 jfsnew kernel: end_request: I/O error, dev fd0, sector 8
| Sep 7 22:55:52 jfsnew kernel: Buffer I/O error on device fd0, logical block 1
| Sep 7 22:55:52 jfsnew kernel: lost page write due to I/O error on fd0
| Sep 7 22:55:55 jfsnew kernel: work still pending

Yes, bad_flp_intr() has some problem(s). See
http://bugme.osdl.org/show_bug.cgi?id=1033 and
http://marc.theaimsgroup.com/?l=linux-kernel&m=105837886921297&w=2

It's on my long work list.

--
~Randy

2003-09-08 19:21:43

by Bill Davidsen

[permalink] [raw]
Subject: Re: Scaling noise

In article <[email protected]>,
Steven Cole <[email protected]> wrote:
| On Tue, 2003-09-02 at 23:08, Larry McVoy wrote:
| > On Wed, Sep 03, 2003 at 02:33:56PM +1000, CaT wrote:
| > > I think Anton is referring to the fact that on a 4-way cpu machine with
| > > HT enabled you basically have an 8-way smp box (with special conditions)
| > > and so if 4-way machines are becoming more popular, making sure that 8-way
| > > smp works well is a good idea.
| >
| > Maybe this is a better way to get my point across. Think about more CPUs
| > on the same memory subsystem. I've been trying to make this scaling point
| > ever since I discovered how much cache misses hurt. That was about 1995
| > or so. At that point, memory latency was about 200 ns and processor speeds
| > were at about 200Mhz or 5 ns. Today, memory latency is about 130 ns and
| > processor speeds are about .3 ns. Processor speeds are 15 times faster and
| > memory is less than 2 times faster. SMP makes that ratio worse.
| >
| > It's called asymptotic behavior. After a while you can look at the graph
| > and see that more CPUs on the same memory doesn't make sense. It hasn't
| > made sense for a decade, what makes anyone think that is changing?
|
| You're right about the asymptotic behavior and you'll just get more
| right as time goes on, but other forces are at work.
|
| What is changing is the number of cores per 'processor' is increasing.
| The Intel Montecito will increase this to two, and rumor has it that the
| Intel Tanglewood may have as many as sixteen. The IBM Power6 will
| likely be similarly capable.
|
| The Tanglewood is not some far off flight of fancy; it may be available
| as soon as the 2.8.x stable series, so planning to accommodate it should
| be happening now.
|
| With companies like SGI building Altix systems with 64 and 128 CPUs
| using the current single-core Madison, just think of what will be
| possible using the future hardware.
|
| In four years, Michael Dell will still be saying the same thing, but
| he'll just fudge his answer by a factor of four.

The mass market will still be in small machines, because the CPUs keep
on getting faster. And at least for most small servers running Linux,
like news, mail, DNS, and web, the disk, memory and network are more of
a problem than the CPU. Some database and CGI loads are CPU intensive,
but I don't see that the nature of loads will change; most aren't CPU
intensive.

| The question which will continue to be important in the next kernel
| series is: How to best accommodate the future many-CPU machines without
| sacrificing performance on the low-end? The change is that the 'many'
| in the above may start to double every few years.

Since you can still get a decent research grant or graduate thesis out
of ways to use a lot of CPUs, there will not be a lack of thought on the
topic. I think Larry is just worried that some of these solutions may
really work poorly on smaller systems.

| Some candidate answers to this have been discussed before, such as
| cache-coherent clusters. I just hope this gets worked out before the
| hardware ships.

Honestly, I would expect a good solution to scale better at the
"more" end of the range than at the "less" end. A good 16-way
approach will probably not need major work for 256, while it may be
pretty grim on uniprocessor or 2-way (counting HT) machines.

With all the work people are doing on writing scheduler changes for
responsiveness, and the number of people trying them, I would assume
there is real demand for improvement on small machines, and for
responsiveness over throughput.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2003-09-08 19:59:20

by Bill Davidsen

[permalink] [raw]
Subject: Re: Scaling noise

In article <[email protected]>,
Larry McVoy <[email protected]> wrote:

| It's really easy to claim that scalability isn't the problem. Scaling
| changes in general cause very minute differences, it's just that there
| are a lot of them. There is constant pressure to scale further and people
| think it's cool. You can argue you all you want that scaling done right
| isn't a problem but nobody has ever managed to do it right. I know it's
| politically incorrect to say this group won't either but there is no
| evidence that they will.

I think that if the problem of a single scheduler which is "best" at
everything proves out of reach, perhaps in 2.7 a modular scheduler
will appear, which will allow the user to select the Nick+Con+Ingo
responsiveness scheduler, or the default that is pretty good at
everything, or the 4kbit-affinity-mask NUMA-on-steroids solution.

I have faith that Linux will solve this one, one way or the other,
probably both.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2003-09-08 19:36:20

by Bill Davidsen

[permalink] [raw]
Subject: Re: Scaling noise

In article <[email protected]>,
Daniel Phillips <[email protected]> wrote:

| As for Karim's work, it's a quintessentially flashy trick to make two UP
| kernels run on a dual processor. It's worth doing, but not because it blazes
| the way forward for ccClusters. It can be the basis for hot kernel swap:
| migrate all the processes to one of the two CPUs, load and start a new kernel
| on the other one, migrate all processes to it, and let the new kernel restart
| the first processor, which is now idle.

UML running on a sibling, anyone? Interesting concept, not necessarily
useful.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2003-09-08 19:49:32

by Bill Davidsen

[permalink] [raw]
Subject: Re: Scaling noise

In article <[email protected]>,
David S. Miller <[email protected]> wrote:
| On Wed, 3 Sep 2003 18:52:49 -0700
| Larry McVoy <[email protected]> wrote:
|
| > On Thu, Sep 04, 2003 at 03:50:31AM +0200, Daniel Phillips wrote:
| > > There are other arguments, such as how complex locking is, and how it will
| > > never work correctly, but those are noise: it's pretty much done now, the
| > > complexity is still manageable, and Linux has never been more stable.
| >
| > yeah, right. I'm not sure what you are smoking but I'll avoid your dealer.
|
| I hate to enter these threads but...
|
| The amount of locking bugs found in the core networking, ipv4, and
| ipv6 for a year or two in 2.4.x has been nearly nil.
|
| If you're going to try and argue against supporting huge SMP
| to me, don't make locking complexity one of the arguments. :-)

If you count only "bugs" which cause a hang or oops, sure. But just
because something works doesn't make it simple (or non-complex, if
you prefer). Look at all the "lockless" changes and such in 2.4, and
I think you will agree that there have been quite a number of them
and that the result is complex. I don't think stable and complex are
mutually exclusive in this case.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2003-09-08 20:14:12

by Bill Davidsen

[permalink] [raw]
Subject: Re: Scaling noise

In article <[email protected]>,
John Bradford <[email protected]> wrote:

| Once the option of running a firewall, a hot spare firewall, a
| customer webserver, a hot spare customer webserver, mail server,
| backup mail server, and a few virtual machines for customers, all on a
| 1U box, why are you going to want to pay for seven or more Us in a
| datacentre, plus extra network hardware?

If you plan to run anything else on your firewall, and use the same
machine as a hot spare for itself, I don't want you as my ISP.
Reliability is expensive, and what you describe is known as a single
point of failure.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2003-09-08 23:39:26

by Peter Chubb

[permalink] [raw]
Subject: Re: Scaling noise

>>>>> "bill" == bill davidsen <[email protected]> writes:

> In article <[email protected]>, Larry
> McVoy <[email protected]> wrote:

Larry> It's really easy to claim that scalability isn't the problem.
Larry> Scaling changes in general cause very minute differences, it's
Larry> just that there are a lot of them. There is constant pressure
Larry> to scale further and people think it's cool. You can argue
Larry> you all you want that scaling done right isn't a problem but
Larry> nobody has ever managed to do it right. I know it's
Larry> politically incorrect to say this group won't either but there
Larry> is no evidence that they will.

bill> I think that if the problem of a single scheduler which is
bill> "best" at everything proves out of reach, perhaps in 2.7 a
bill> modular scheduler will appear, which will allow the user to
bill> select the Nick+Con+Ingo responsiveness, or the default pretty
bill> good at everything, or the 4kbit affinity mask NUMA on steroids
bill> solution.

Well, as I see it, it's not processor but memory scalability that's
the problem right now. Memories are getting larger (and, for NUMA
systems, sparser), and the current Linux solutions don't scale
particularly well --- especially when, for architectures like PPC or
IA64, you need two copies of the page tables in different formats,
one for the hardware to look up and one for the OS.

I *do* think that pluggable schedulers are a good idea --- I'd like to
introduce something like the scheduler class mechanism that SVr4 has
(except that I've seen that code, and don't want to get sued by SCO)
to allow different processes to be in different classes in a cleaner
manner than the current FIFO or RR vs OTHER classes. We should be
able to introduce isochronous, gang, lottery or fairshare schedulers
(etc) at runtime, and then tie processes severally and individually to
those schedulers, with a well-defined idea of what happens when
scheduler priorities overlap, and well-defined APIs to adjust
scheduler parameters. However, this will require more major
infrastructure changes, and a better separation of the dispatcher from
the scheduler than in the current one-size-fits-all scheduler.
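
As a rough user-space illustration of the ops-table idea (all names here
are invented for the sketch, not an existing kernel interface): a scheduler
class exports its policy through function pointers, and the dispatcher only
ever calls through that table.

#include <stdio.h>
#include <stddef.h>

struct task {
        const char *name;
        struct task *next;
};

/* Hypothetical "scheduler class" interface. */
struct sched_class_ops {
        const char *name;
        void (*enqueue)(struct task *t);
        struct task *(*pick_next)(void);
};

/* One toy policy: plain FIFO.  A gang, lottery or fairshare class would
 * provide its own enqueue/pick_next pair instead. */
static struct task *fifo_head, *fifo_tail;

static void fifo_enqueue(struct task *t)
{
        t->next = NULL;
        if (fifo_tail)
                fifo_tail->next = t;
        else
                fifo_head = t;
        fifo_tail = t;
}

static struct task *fifo_pick_next(void)
{
        struct task *t = fifo_head;
        if (t) {
                fifo_head = t->next;
                if (!fifo_head)
                        fifo_tail = NULL;
        }
        return t;
}

static struct sched_class_ops fifo_class = {
        .name = "fifo", .enqueue = fifo_enqueue, .pick_next = fifo_pick_next,
};

int main(void)
{
        /* The "dispatcher" only talks to the ops table. */
        struct sched_class_ops *class = &fifo_class;
        struct task a = { "a" }, b = { "b" };
        struct task *t;

        class->enqueue(&a);
        class->enqueue(&b);
        while ((t = class->pick_next()) != NULL)
                printf("dispatch %s via %s class\n", t->name, class->name);
        return 0;
}

Registering a different ops structure at runtime is where the pluggability
would come from; the hard part, as noted above, is the dispatcher/scheduler
separation, which this sketch waves away.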


--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
You are lost in a maze of BitKeeper repositories, all slightly different.


2003-09-09 14:48:24

by Rob Landley

[permalink] [raw]
Subject: Re: Scaling noise

On Monday 08 September 2003 09:38, Alan Cox wrote:
> On Sad, 2003-09-06 at 16:08, Pavel Machek wrote:
> > Hi!
> >
> > > Maybe this is a better way to get my point across. Think about more
> > > CPUs on the same memory subsystem. I've been trying to make this
> > > scaling point
> >
> > The point of hyperthreading is that more virtual CPUs on same memory
> > subsystem can actually help stuff.
>
> Its a way of exposing asynchronicity keeping the old instruction set.
> Its trying to make better use of the bandwidth available by having
> something else to schedule into stalls. Thats why HT is really good for
> code which is full of polling I/O, badly coded memory accesses but is
> worthless on perfectly tuned hand coded stuff which doesnt stall.

<rant>

I wouldn't call it worthless. "Proof of concept", maybe.

Modern processors (Athlon and P4 both, I believe) have three execution cores,
and so are trying to dispatch three instructions per clock. With
speculation, lookahead, branch prediction, register renaming, instruction
reordering, magic pixie dust, happy thoughts, a tailwind, and 8 zillion other
related things, they can just about do it too, but not even close to 100% of
the time. Extracting three parallel instructions from one instruction stream
is doable, but not fun, and not consistent.

The third core is unavoidably idle some of the time. Trying to keep four
cores busy would be a nightmare. (All the VLIW guys keep trying to unload
this on the compiler. Don't ask me how a compiler is supposed to do branch
prediction and speculative execution. I suppose having to recompile your
binaries for more cores isn't TOO big a problem these days, but the boxed
mainstream desktop apps people wouldn't like it at all.)

Transistor budgets keep going up as manufacturing die sizes shrink, and the
engineers keep wanting to throw transistors at the problem. The first really
easy way to turn transistors into performance is a bigger L1 cache, but
somewhere between 256k and one megabyte per running process you hit some
serious diminishing returns since your working set is in cache and your far
accesses to big datasets (or streaming data) just aren't going to be helped
by more L1 cache.

The other obvious way to turn transistors into performance is to build
execution cores out of them. (Yeah, you can also pipeline yourself to death
to do less per clock for marketing reasons, but there's serious diminishing
returns there too.) With more execution cores, you can (theoretically)
execute more instructions per clock. Except that keeping 3 cores busy out of
one instruction stream is really hard, and 4 would be a nightmare...

Hyperthreading is just a neat hack to keep multiple cores busy. Having
another point of execution to schedule instructions from means you're
guaranteed to keep 1 core busy all the time for each point of execution
(barring memory access latency on "branch to mars" conditions), and with 3
cores and 2 points of execution they can fight over the middle core, which
should just about never be idle when the system is loaded.

With hyperthreading (SMT, whatever you wanna call it), the move to 4 execution
cores becomes a no-brainer (keeping 2 cores busy from one instruction
stream is relatively trivial), and even 5 (keeping 3 cores busy is a
solved problem; the third core isn't busy all the time, but the two threads
can fight over the extra core when they actually have something for it to do...)

And THAT is where SMT starts showing real performance benefits, when you get
to 4 or 5 cores. It's cheaper than SMP on a die because they can share all
sorts of hardware (not the least of which being L1 cache, and you can even
expand L1 cache a bit because you now have the working sets of 2 processes to
stick in it)...

Intel's been desperate for a way to make use of its transistor budget for a
while; manufacturing is what it does better than AMD, not clever processor
design. The original Itanic, case in point, had more than 3 instruction
execution cores in each chip: 3 VLIW, an HP PA-RISC, and a brain-damaged
Pentium (which itself had a couple execution cores)... The long list of
reasons Itanic sucked started with the fact that it had 3 different modes,
and whichever one you were in, circuitry for the other 2 wouldn't contribute a
darn thing to your performance (although it did not stop there, and in fact
didn't even slow down...)

Of course since power is now the third variable along with price/performance,
sooner or later you'll see chips that individually power down cores as they
go dormant. Possibly even a banked L1 cache; who knows? (It's another
alternative to clocking down the whole chip; power down individual functional
units of the chip. Dunno who might actually do that, or when, but it's nice
to have options...)

</rant>

In brief: hyperthreading is cool.

> Its great feature is that HT gets *more* not less useful as the CPU gets
> faster..

Execution point 1 stalls waiting for memory, so execution point 2 gets extra
cores. The classic tale of overlapping processing and I/O, only this time
with the memory bus being the slow device you have to wait for...

Rob


2003-09-09 16:07:39

by Ricardo Bugalho

[permalink] [raw]
Subject: Re: Scaling noise

On Tue, 09 Sep 2003 02:11:15 -0400, Rob Landley wrote:


> Modern processors (Athlon and P4 both, I believe) have three execution
> cores, and so are trying to dispatch three instructions per clock. With

Neither of these CPUs is multi-core. They're just superscalar cores, that
is, they can dispatch multiple instructions in parallel. An example of a
multi-core CPU is the POWER4: there are two complete cores in the same
silicon die, sharing some cache levels and the memory bus.

BTW, Pentium [Pro,II,III] and Athlon are three-way in the sense that they have
three-way decoders that decode up to three x86 instructions into µOPs.
Pentium4 has a one-way decoder and a trace cache that stores decoded
µOPs.
As a curiosity, AMD's K5 and K6 were 4-way.

> four cores bus would be a nightmare. (All the VLIW guys keep trying to
> unload this on the compiler. Don't ask me how a compiler is supposed to
> do branch prediction and speculative execution. I suppose having to
> recompile your binaries for more cores isn't TOO big a problem these
> days, but the boxed mainstream desktop apps people wouldn't like it at
> all.)

In normal instruction sets, whatever CPUs do, from the software
perspective, it MUST look like the CPU is executing one instruction at a
time. In VLIW, some forms of parallelism are exposed. For example, before
executing two instructions in parallel, non-VLIW CPUs have to check for
data dependencies. If they exist, those two instructions can't be executed
in parallel. VLIW instruction sets just define that instructions MUST be
grouped in sets of N instructions that can be executed in parallel, and
that if they aren't, the CPU will yield an exception or undefined
behaviour.
In a similar manner, there is the issue of available execution units and
exceptions.
The net result is that in-order VLIW CPUs are simpler to design than
in-order superscalar RISC CPUs, but I think it won't make much of a
difference for out-of-order CPUs. I've never seen a VLIW out-of-order
implementation.
VLIW ISAs are no different from others regarding branch prediction --
which is a problem for ALL pipelined implementations, superscalar or not.
Speculative execution is a feature of out-of-order implementations.
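
A tiny illustration of the data-dependence part (plain C rather than VLIW
assembly, and the function is invented purely for the example):

/* Which of these statements could share a VLIW "bundle"? */
int demo(int a, int b, int c)
{
        int x, y, z;

        x = a + b;      /* independent of the next statement...         */
        y = c * 2;      /* ...so a VLIW compiler may group these two    */

        z = x + y;      /* reads x and y: a data dependency, so this    */
                        /* one must go into a later group               */
        return z;
}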


> Transistor budgets keep going up as manufacturing die sizes shrink, and
> the engineers keep wanting to throw transistors at the problem. The
> first really easy way to turn transistors into performance are a bigger
> L1 cache, but somewhere between 256k and one megabyte per running
> process you hit some serious diminishing returns since your working set
> is in cache and your far accesses to big datasets (or streaming data)
> just aren't going to be helped by more L1 cache.

L1 caches are kept small so they can be fast.

> Hyperthreading is just a neat hack to keep multiple cores busy. Having

SMT (Simultaneous Multi-Threading, aka Hyperthreading, Intel's marketing
term) is a neat hack to keep execution units within the same core busy.
And it's a cheap hack when the CPUs are already out-of-order. CMP
(Chip Multi-Processing) is a neat hack to keep expensive resources
like big L2/L3 caches and memory interfaces busy by placing multiple cores
on the same die.
CMP is simpler, but is only useful for multi-thread performance. With
SMT, it makes sense to add more execution units than now, so it can also
help single-thread performance.


> Intel's been desperate for a way to make use of its transistor budget
> for a while; manufacturing is what it does better than AMD< not clever
> processor design. The original Itanic, case in point, had more than 3
> instruction execution cores in each chip: 3 VLIW, a HP-PA Risc, and a
> brain-damaged Pentium (which itself had a couple execution cores)... The
> long list of reasons Itanic sucked started with the fact that it had 3
> different modes and whichever one you were in circuitry for the other 2
> wouldn't contribute a darn thing to your performance (although it did
> not stop there, and in fact didn't even slow down...)

Itanium doesn't have hardware support for PA-RISC emulation. The IA-64 ISA
has some similarities with PA-RISC to ease dynamic translation though.
But you're right: the IA-32 hardware emulation layer is not a Good Thing™.

--
Ricardo

2003-09-10 05:23:16

by Rob Landley

[permalink] [raw]
Subject: Re: Scaling noise

On Tuesday 09 September 2003 12:07, Ricardo Bugalho wrote:
> On Tue, 09 Sep 2003 02:11:15 -0400, Rob Landley wrote:
> > Modern processors (Athlon and P4 both, I believe) have three execution
> > cores, and so are trying to dispatch three instructions per clock. With
>
> Neither of these CPUs are multi-core. They're just superscalar cores, that
> is, they can dispatch multiple instructions in parallel. An example of a
> multi-core CPU is the POWER4: there are two complete cores in the same
> sillicon die, sharing some cache levels and memory bus.

Sorry, wrong terminology. (I'm a software dude.)

"Instruction execution thingy". (Well you didn't give it a name either. :)

> BTW, Pentium [Pro,II,III] and Athlon are three way in the sense they have
> three-way decoders that decode up to three x86 instructions into µOPs.
> Pentium4 has a one-way decoder and the trace cache that stores decodes
> µOPs.
> As a curiosity, AMD's K5 and K6 were 4 way.

I hadn't known that. (I had known that the AMD guys I talked to around Austin
had proven to themselves that 4 way was not a good idea in the real world,
but I didn't know it had actually made it outside of the labs...)

> > four cores bus would be a nightmare. (All the VLIW guys keep trying to
> > unload this on the compiler. Don't ask me how a compiler is supposed to
> > do branch prediction and speculative execution. I suppose having to
> > recompile your binaries for more cores isn't TOO big a problem these
> > days, but the boxed mainstream desktop apps people wouldn't like it at
> > all.)
>
> In normal instructions sets, whatever CPUs do, from the software
> perspective, it MUST look like the CPU is executing one instruction at a
> time.

Yup.

> In VLIW, some forms of parallelism are exposed.

I tend to think of it as "unloaded upon the compiler"...

> For example, before
> executing two instructions in parallel, non-VLIW CPUs have to check for
> data dependencies. If they exist, those two instructions can't be executed
> in parallel. VLIW instruction sets just define that instructions MUST be
> grouped in sets of N instructions that can be executed in parallel and
> that if they don't the CPU, the CPU will yield an exception or undefined
> behaviour.

Presumably this is the compiler's job, and the CPU can just have "undefined
behavior" if fed impossible instruction mixes. But yeah, throwing an
exception would be the conscientious thing to do. :)

> In a similar manner, there is the issue of avaliable execution units and
> exeptions.
> The net result is that in-order VLIW CPUs are simpler to design that
> in-order superscalar RISC CPUs, but I think it won't make much of a
> difference for out-of-order CPUs. I've never seen a VLIW out-of-ordem
> implementation.

I'm not sure what the point of out-of-order VLIW would be. You just put extra
pressure on the memory bus by tagging your instructions with grouping info,
just to give you even LESS leeway about shuffling the groups at run-time...

> VLIW ISAs are no different from others regarding branch prediction --
> which is a problem for ALL pipelined implementations, superscalar or not.
> Speculative execution is a feature of out-of-order implementation.

Ah yes, predication. Rather than having instruction execution thingies be
idle, have them follow both branches and do work with a 100% chance of being
thrown away. And you wonder why the chips have heat problems... :)

> > Transistor budgets keep going up as manufacturing die sizes shrink, and
> > the engineers keep wanting to throw transistors at the problem. The
> > first really easy way to turn transistors into performance are a bigger
> > L1 cache, but somewhere between 256k and one megabyte per running
> > process you hit some serious diminishing returns since your working set
> > is in cache and your far accesses to big datasets (or streaming data)
> > just aren't going to be helped by more L1 cache.
>
> L1 caches are kept small so they can be fast.

Sorry, I still refer to on-die L2 caches as L1. Bad habit. (As I said, I get
the names wrong...) "On die cache." Right.

The point was, you can spend your transistor budget with big caches on the
die, but there are diminishing returns.

> > Intel's been desperate for a way to make use of its transistor budget
> > for a while; manufacturing is what it does better than AMD< not clever
> > processor design. The original Itanic, case in point, had more than 3
> > instruction execution cores in each chip: 3 VLIW, a HP-PA Risc, and a
> > brain-damaged Pentium (which itself had a couple execution cores)... The
> > long list of reasons Itanic sucked started with the fact that it had 3
> > different modes and whichever one you were in circuitry for the other 2
> > wouldn't contribute a darn thing to your performance (although it did
> > not stop there, and in fact didn't even slow down...)
>
> Itanium doesn't have hardware support for PA-RISC emulation.

I'm under the impression it used to be part of the design, circa 1997. But I
must admit that when discussing Itanium I'm not really prepared; I stopped
paying much attention a year or so after the sucker had taped out but still
had no silicon to play with, especially after HP and SGI revived their own
chip designs due to the delay...

I only actually got to play with the original Itanium hardware once, and never
got it out of the darn monitor that substituted for a BIOS. The people who
did benchmarked it at about Pentium III 300 MHz levels, and it became a
doorstop. (These days, I've got a friend who's got an Itanium II evaluation
system, but it's another doorstop and I'm not going to make him hook it up
again just so I can go "yeah, I agree with you, it sucks"...)

> The IA-64 ISA
> has some similarities with PA-RISC to ease dynamic translation though.
> But you're right: the IA-32 hardware emulation layer is not a Good Thing™.

It's apparently going away.

http://news.com.com/2100-1006-997936.html?tag=nl

Rob

2003-09-10 05:45:11

by David Mosberger

[permalink] [raw]
Subject: Re: Scaling noise

>>>>> On Wed, 10 Sep 2003 01:14:37 -0400, Rob Landley <[email protected]> said:

Rob> (These days, I've got a friend who's got an Itanium II
Rob> evaluation system, but it's another doorstop and I'm not going
Rob> to make him hook it up again just so I can go "yeah, I agree
Rob> with you, it sucks"...)

I'm sorry to hear that. If you really do want to try out an Itanium 2
system, an easy way to go about it is to get an account at
http://testdrive.hp.com/ . It's a quick and painless process and a
single account will give you access to all test-drive machines,
including various Linux Itanium machines (up to 4x 1.4GHz),
as shown here: http://testdrive.hp.com/current.shtml

--david

2003-09-10 09:48:24

by John Bradford

[permalink] [raw]
Subject: Re: Scaling noise

> | Once the option of running a firewall, a hot spare firewall, a
> | customer webserver, a hot spare customer webserver, mail server,
> | backup mail server, and a few virtual machines for customers, all on a
> | 1U box, why are you going to want to pay for seven or more Us in a
> | datacentre, plus extra network hardware?
>
> If you plan to run anything else on your firewall, and use the same
> machine as a hot spare for itself, I don't want you as my ISP.
> Reliability is expensive, and what you describe is known as a single
> point of failure.

Of course, you would be right if we were talking about current
microcomputer architectures. I was talking about the possibility of
current mainframe technologies being implemented in future
microcomputer architectures.

Today, it is perfectly acceptable, normal, and commonplace to run hot
spares of various images on a single zSeries box. In fact, the
ability to do that is often a large factor in budgeting for the
initial investment.

The hardware is fault tolerant by design. Only extreme events like a
fire or flood at the datacentre are likely to cause downtime of the
whole machine. I don't consider that any less secure than a rack of
small servers.

Different images running in their own LPARs, or under z/VM, are
separated from each other. Assessments of their isolation have been
done, and ratings are available.

You absolutely _can_ use the same physical hardware to run a hot
spare, and protect yourself against software failures. A process can
monitor the virtual machine, and switch to the hot spare if it fails.
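
Roughly like this hypothetical user-space watchdog (the health-check and
failover commands are placeholders for the example, not z/VM tooling):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Poll a health check; fail over after three consecutive failures. */
#define HEALTH_CHECK "ping -c 1 -W 2 primary-guest >/dev/null 2>&1"
#define FAILOVER_CMD "echo 'switching traffic to hot-spare guest'"

int main(void)
{
        int failures = 0;

        for (;;) {
                if (system(HEALTH_CHECK) == 0) {
                        failures = 0;
                } else if (++failures >= 3) {
                        fprintf(stderr, "primary looks dead, failing over\n");
                        system(FAILOVER_CMD);
                        return 1;
                }
                sleep(5);
        }
}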

Add to that the fact that physical LAN cabling is reduced. The amount
of network hardware is also reduced. That adds to reliability.

John.

2003-09-10 10:10:54

by Ricardo Bugalho

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, 2003-09-10 at 06:14, Rob Landley wrote:
> I'm not sure what the point of out-of-order VLIW would be. You just put extra
> pressure on the memory bus by tagging your instructions with grouping info,
> just to give you even LESS leeway about shuffling the groups at run-time...

The point is: simpler in-order implementations. In-order CPUs don't
reorder instructions at run-time, as the name suggests.

> > VLIW ISAs are no different from others regarding branch prediction --
> > which is a problem for ALL pipelined implementations, superscalar or not.
> > Speculative execution is a feature of out-of-order implementation.
>
> Ah yes, predication. Rather than having instruction execution thingies be
> idle, have them follow both branches and do work with a 100% chance of being
> thrown away. And you wonder why the chips have heat problems... :)

You're confusing branch prediction with instruction predication.
Branch prediction is a design feature, needed for most pipelined CPUs.
Because they're pipelined, the CPU may not know whether or not to take
the branch when it's time to fetch the next instructions. So, instead of
stalling, it guesses. If it's wrong, it has to roll back.
Instruction predication is another form of conditional execution: each
instruction has a predicate (a register) and is only executed if the
predicate is true.
The bad thing is that these instructions take their slot in the
pipeline, even if the CPU already knows, at the moment it fetched them,
that they'll never be executed.
The good sides are:
a) Unlike branches, it doesn't have a constant mispredict penalty. So,
it's good for replacing "small" and unpredictable branches.
b) Instead of a control dependency (branches), predication is a data
dependency. So, it gives compilers more freedom in scheduling (see the
sketch below).
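
A small, hypothetical C sketch of the control-vs-data-dependency point; the
mask trick stands in for what a predicated ISA would do with predicate
registers:

/* Branchy version: a control dependency the CPU has to predict. */
int max_branch(int a, int b)
{
        if (a > b)
                return a;
        return b;
}

/* Branch-free version: the condition becomes ordinary data (a mask),
 * so there is nothing to mispredict, but both "sides" of the work flow
 * down the pipeline whether they are needed or not.  On a predicated
 * ISA the compiler gets the same effect with predicate registers. */
int max_branchless(int a, int b)
{
        int mask = -(a > b);            /* all ones if a > b, else 0 */
        return (a & mask) | (b & ~mask);
}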

> The point was, you can spend your transistor budget with big caches on the
> die, but there are diminishing returns.

Depends on the workload.

--
Ricardo

2003-09-10 11:38:09

by Alan

[permalink] [raw]
Subject: Re: Scaling noise

On Mer, 2003-09-10 at 11:01, John Bradford wrote:
> Of course, you would be right if we were talking about current
> microcomputer architectures. I was talking about the possibility of
> current mainframe technologies being implemented in future
> microcomputer architectures.

1U fits 4 lower powered PC systems.

> Today, it is perfectly acceptable, normal, and commonplace to run hot
> spares of various images on a single Z/Series box.

> The hardware is fault tollerant by design. Only extreme events like a
> fire or flood at the datacentre

Or a bunch of suspicious looking people claiming to be from EDS and
helping themselves to your mainframe (See .AU customs 8))

The virtue of S/390 is more, IMHO, that it won't make mistakes. The
distributed PC solution is more likely to be available.

Which matters sometimes depends on whether you are running websites or
a bank.

2003-09-10 13:55:45

by Bill Davidsen

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, 10 Sep 2003, John Bradford wrote:

> > | Once the option of running a firewall, a hot spare firewall, a
> > | customer webserver, a hot spare customer webserver, mail server,
> > | backup mail server, and a few virtual machines for customers, all on a
> > | 1U box, why are you going to want to pay for seven or more Us in a
> > | datacentre, plus extra network hardware?
> >
> > If you plan to run anything else on your firewall, and use the same
> > machine as a hot spare for itself, I don't want you as my ISP.
> > Reliability is expensive, and what you describe is known as a single
> > point of failure.
>
> Of course, you would be right if we were talking about current
> microcomputer architectures. I was talking about the possibility of
> current mainframe technologies being implemented in future
> microcomputer architectures.
>
> Today, it is perfectly acceptable, normal, and commonplace to run hot
> spares of various images on a single Z/Series box. Infact, the
> ability to do that is often a large factor in budgeting for the
> initial investment.

Depends on the customer, and how critical the operation is to them.
I'm working on a backup plan for a company now, to satisfy
their "smoking hole" disaster recovery plan. You don't do that unless
your company's survival depends on it, but I've helped put backups in other
countries (for a bank in Ireland), etc.

> The hardware is fault tollerant by design. Only extreme events like a
> fire or flood at the datacentre are likely to cause downtime of the
> whole machine. I don't consider that any less secure than a rack of
> small servers.

And power failure, loss of ISP connectivity, loss of phone service...
As said originally, reliability is *hard* and *expensive* to do right.

> Different images running in their own LPARs, or under Z/Vm are
> separated from each other. Assessments of their isolation have been
> done, and ratings are available.
>
> You absolutely _can_ use the same physical hardware to run a hot
> spare, and protect yourself against software failiures. A process can
> monitor the virtual machine, and switch to the hot spare if it fails.
>
> Add to that the fact that physical LAN cabling is reduced. The amount
> of network hardware is also reduced. That adds to reliability.

For totally redundant equipment the probability of having some failure
goes up, the probability of total failure (unavailability) goes down.
"You pays your money and you takes your choice."

Actually, we were talking about HT, or so it seems looking at the past
items, and if you get a failure of one sibling, I have to believe that
the other CPUs on the same chip are more likely to fail than those not
sharing the same cooling, power supply, cache memory, and die. I bet
you don't have numbers either; feel free to share if you do.

This would be a good topic for comp.arch; it's getting a bit OT here.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2003-09-10 14:41:37

by Timothy Miller

[permalink] [raw]
Subject: Re: Scaling noise



Larry McVoy wrote:

>
> You don't make that much money, if any, on the high end, the R&D costs
> dominate. But you make money because people buy the middle of the road
> because you have the high end. If you don't, they feel uneasy that they
> can't grow with you. The high end enables the sales of the real money
> makers. It's pure marketing, the high end could be imaginary and as
> long as you convinced the customers you had it you'd be more profitable.
>


I think some time in the 90's, Chevy considered discontinuing the
Corvette. Then they realized that that would kill their business.
People who would never buy a Corvette buy other GM cars just because the
Corvette exists. Lots of reasons: It's an icon that people recognize,
it makes them feel that other Chevys will share some of the Corvette
quality, etc.


2003-09-10 15:01:10

by John Bradford

[permalink] [raw]
Subject: Re: Scaling noise

[snip most of discussion]

You have changed the topic completely.

> > The hardware is fault tollerant by design. Only extreme events like a
> > fire or flood at the datacentre are likely to cause downtime of the
> > whole machine. I don't consider that any less secure than a rack of
> > small servers.
>
> And power failure, loss of ISP connectivity, loss of phone service...
> As said originally, reliability is *hard* and *expensive* to do right.

Not sure how any of the above is related to the mainframe vs rack of
micros discussion. Redundant ISP connectivity can be provided to
both.

You seem to be suggesting that I'm saying that it's a good idea to
replace geographically distributed, separate microcomputer servers
with a _single_ mainframe. I am not. I am only concerned with the
per-site situation. As long as the geographically separate machine
stays geographically separate it is no longer part of the equation.

The only thing I am suggesting is that a single rack of machines in a
datacentre somewhere could be replaced by a single machine with
virtualisation technology, and the same or better hardware
reliability as the rack of discrete machines, and there would be no
loss of availability.

In fact, increased availability may result, due to the ease of
administering virtual machines as opposed to physical machines, and
the reduction in network hardware.

> This would be a good topic for comp.arch, it's getting a bit OT here.

Agreed, except that I brought it up to reinforce Larry's point that
scaling above ~4 CPUs at the cost of <=4 CPU performance was a bad
thing.

John.

2003-09-10 15:13:15

by Larry McVoy

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, Sep 10, 2003 at 11:02:34AM -0400, Timothy Miller wrote:
> Larry McVoy wrote:
> >You don't make that much money, if any, on the high end, the R&D costs
> >dominate. But you make money because people buy the middle of the road
> >because you have the high end. If you don't, they feel uneasy that they
> >can't grow with you. The high end enables the sales of the real money
> >makers. It's pure marketing, the high end could be imaginary and as
> >long as you convinced the customers you had it you'd be more profitable.
>
> I think some time in the 90's, Chevy considered discontinuing the
> Corvette. Then they realized that that would kill their business.
> People who would never buy a Corvette buy other GM cars just because the
> Corvette exists. Lots of reasons: It's an icon that people recognize,
> it makes them feel that other Chevys will share some of the Corvette
> quality, etc.

Yeah, yeah. I get that. What people here are not getting is that if you
are building the low-end cars out of the same parts as the high-end cars,
then you are driving up the cost of your low-end cars. Same thing for
the OS. The point is to have some answer that lets people go "ohhh,
wow, that's a big fast OS, now, isn't it?" without having screwed up
the low end and/or having made the source base a rat's nest.

Dave and friends can protest as much as they want that the kernel works
and it scales (it does work; it doesn't scale by comparison to something
like IRIX), but that's missing the point. They refused to answer the
question about listing the locks in the I/O path. In a uniprocessor
OS any idiot can tell you the locks in the I/O path. They didn't even
try to list them; they can't do it now. What's it going to be like
when the system scales 10x more?

I've been through this. These guys are kidding themselves and by the
time they wake up there is too much invested in the source base to
unravel the mess.

What people keep missing, over and over, is that I'm perfectly OK with
the idea of making the system work on a big box. I've made a proposal
for an approach that, as I see it, puts far less pressure on the OS to
get screwed up. You don't have to scale the I/O path when you can throw
a kernel at every device if you want. The I/O path is free: it's
uniprocessor, simple. Think about it. We're all trying to have the
bragging rights; I do enough sales that I understand the importance of
those. It just makes me sick to think that we could get them far faster
and far more easily than the commercial Unix guys did and than Microsoft
will, and instead we're playing by the wrong rules.

Don't like my rules? Make up some different ones. But doing the same
old thing is just bloody stupid.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-28 01:52:34

by Paul Jakma

[permalink] [raw]
Subject: Re: Scaling noise

On Wed, 10 Sep 2003, Larry McVoy wrote:

> Dave and friends can protest as much as they want that the kernel works
> and it scales (it does work, it doesn't scale by comparison to something
> like IRIX)

Aside: Might want to tell SGI, as their new ccNUMA Altix line (Origin
3k with Itanic instead of MIPS, I think) runs Linux!

regards,
--
Paul Jakma [email protected] [email protected] Key ID: 64A2FF6A
warning: do not ever send email to [email protected]
Fortune:
If your hands are clean and your cause is just and your demands are
reasonable, at least it's a start.

2003-09-28 03:16:56

by Steven Cole

[permalink] [raw]
Subject: Re: Scaling noise

On Saturday 27 September 2003 07:51 pm, Paul Jakma wrote:
> On Wed, 10 Sep 2003, Larry McVoy wrote:
> > Dave and friends can protest as much as they want that the kernel works
> > and it scales (it does work, it doesn't scale by comparison to something
> > like IRIX)
>
> Aside: Might want to tell SGI as their new ccNUMA Altrix line (Origin
> 3k with Itanic instead of MIPS i think) run Linux!

Since Larry is off doing other things for the next week and a half, I'll attempt
to answer that for him. Larry was possibly referring to IRIX scaling to 1024 CPUs,
e.g. the "Chapman" machine, mentioned here:
http://www.sgi.com/company_info/awards/03_computerworld.html

It appears that SGI is working to scale the Altix to 128 CPUs on Linux.
http://marc.theaimsgroup.com/?l=linux-kernel&m=106323064611280&w=2

Steven

2003-09-29 00:47:58

by Paul Jakma

[permalink] [raw]
Subject: Re: Scaling noise

On Sat, 27 Sep 2003, Steven Cole wrote:

> Since Larry is off doing other things for the next week and a half,
> I'll attemt to to answer that for him. Larry was possibly
> referring to IRIX scaling to 1024 CPUs, e.g. the "Chapman" machine,
> mentioned here:
> http://www.sgi.com/company_info/awards/03_computerworld.html

An Origin 3k; Altix is essentially the Origin 3k architecture but
engineered around Itanic CPU boards. :)

> It appears that SGI is working to scale the Altix to 128 CPUs on Linux.
> http://marc.theaimsgroup.com/?l=linux-kernel&m=106323064611280&w=2

Working to? Nah, they did that a long time ago:

http://mail.nl.linux.org/linux-mm/2001-03/msg00004.html

They've had another 2.5 years since then to work further on Linux
scalability.

"SGI Altix 3000 superclusters scale up to hundreds of processors."

So, either they've already run Linux on Altix or Origin internally
with hundreds of CPUs, or they're fairly confident there aren't any
major problems they couldn't sort out in the time it would take to
build, test and deliver a hundred+ node Altix.

> Steven
> -

regards,
--
Paul Jakma [email protected] [email protected] Key ID: 64A2FF6A
warning: do not ever send email to [email protected]
Fortune:
I owe the public nothing.
-- J.P. Morgan