2003-02-21 23:32:09

by Hanna Linder

Subject: Minutes from Feb 21 LSE Call


LSE Con Call Minutes from Feb 21

Minutes compiled by Hanna Linder [email protected], please post
corrections to [email protected].

Object Based Reverse Mapping:
(Dave McCracken, Ben LaHaise, Rik van Riel, Martin Bligh, Gerrit Huizenga)

Dave coded up an initial patch for partial object based rmap,
which he sent to linux-mm yesterday. Rik pointed out there is a scalability
problem with the full object based approach. However, a hybrid approach
between regular rmap and object based rmap may not be too radical for
the 2.5/2.6 timeframe.
Ben said none of the users have been complaining about
performance with the existing rmap. Martin disagreed, saying that he,
Linus, and Andrew Morton have all agreed there is a problem.
One of the problems Martin is already hitting on machines with many
cpus and large memory is the space consumed by the pte-chains filling up
memory and killing the machine. There is also a performance impact from
maintaining the chains.
Ben said they shouldn't be using fork; bash is the
main user of fork and should be changed to use clone instead.
Gerrit said bash is not used as much as Ben might think on
these large systems running real world applications.
Ben said he doesn't see the large-system problems with
the users he talks to and doesn't agree that the full object based rmap
is needed. Gerrit explained we have very complex workloads running on
very large systems and we are already hitting the space consumption
problem which is a blocker for running Linux on them.
Ben said none of the distros are supporting these large
systems right now. Martin said UL is already starting to support
them. Then it degraded into a distro discussion and Hanna asked
for them to bring it back to the technical side.
In order to show the problem with object based rmap you have to
add vm pressure to existing benchmarks to see what happens. Martin
agreed to run multiple benchmarks on the same systems to simulate this.
Cliff White of the OSDL offered to help Martin with this.
At the end Ben said the solution for now needs to be
a hybrid with existing rmap. Martin, Rik, and Dave all agreed with Ben.
Then we all agreed to move on to other things.

*ActionItem - someone needs to change bash to use clone instead of fork.

Scheduler Hang as discovered by restarting a large Web application
multiple times:
Rick Lindsley / Hanna Linder

We were seeing a hard hang after restarting a large web
serving application 3-6 times on the 2.5.59 (and up) kernels
(also seen as far back as 2.5.44). It was mainly caused when two
threads each have interrupts disabled and one is spinning on a lock that
the other is holding. The one holding the lock has sent an IPI to all
the other processors telling them to flush their TLBs. But the one
waiting for the spinlock has interrupts turned off and does not receive
that IPI request. So they both sit there waiting forever.
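The interleaving can be sketched like this (a schematic, not the actual 2.5.59 code paths):

```text
  CPU 0                                 CPU 1
  local_irq_disable()                   local_irq_disable()
  spin_lock(&L)                         spin_lock(&L)  <- spins; CPU 0 holds L
  send TLB-flush IPI to all CPUs  --->  IPI pending, but interrupts are
  wait for every CPU to ack...          off, so it is never serviced
```

Neither CPU can make progress: CPU 0 waits for an ack that CPU 1 cannot deliver, and CPU 1 waits for a lock that CPU 0 will not release until it gets the ack.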

The final fix will be in kernel.org mainline kernel version 2.5.63.
Here are the individual patches which should apply with fuzz to
older kernel versions:

http://linux.bkbits.net:8080/linux-2.5/[email protected]?nav=index.html
http://linux.bkbits.net:8080/linux-2.5/[email protected]?nav=index.html


Shared Memory Binding :
Matt Dobson -
Shared memory binding API (new). A way for an
application to bind shared memory to nodes. The motivation
is support for large databases that want more control
over their shared memory.
The current allocation scheme is that each process gets
a chunk of shared memory from the same node the process
is located on. Instead of page faulting around to different
nodes dynamically, this API will allow a process to specify
which node or set of nodes to bind the shared memory to.
Work in progress.

Martin - gcc 2.95 vs 3.2.

Martin has done some testing which indicates that gcc 3.2 produces
slightly worse code for the kernel than 2.95 and takes a bit
longer to do so. gcc 3.2 -Os produces larger code than gcc 2.95 -O2.
On his machines -O2 was faster than -Os, but on a cpu with smaller
caches the inverse may be true. More testing may be needed.




2003-02-22 00:06:44

by Larry McVoy

Subject: Re: Minutes from Feb 21 LSE Call

> Ben said none of the distros are supporting these large
> systems right now. Martin said UL is already starting to support
> them.

Ben is right. I think IBM and the other big iron companies would be
far better served looking at what they have done with running multiple
instances of Linux on one big machine, like the 390 work. Figure out
how to use that model to scale up. There is simply not a big enough
market to justify shoveling lots of scaling stuff in for huge machines
that only a handful of people can afford. That's the same path which
has sunk all the workstation companies, they all have bloated OS's and
Linux runs circles around them.

In terms of the money and in terms of installed seats, the small Linux
machines out number the 4 or more CPU SMP machines easily 10,000:1.
And with the embedded market being one of the few real money makers
for Linux, there will be huge pushback from those companies against
changes which increase memory footprint.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-22 00:17:17

by William Lee Irwin III

Subject: Re: Minutes from Feb 21 LSE Call

On Fri, Feb 21, 2003 at 04:16:18PM -0800, Larry McVoy wrote:
> Ben is right. I think IBM and the other big iron companies would be
> far better served looking at what they have done with running multiple
> instances of Linux on one big machine, like the 390 work. Figure out
> how to use that model to scale up. There is simply not a big enough
> market to justify shoveling lots of scaling stuff in for huge machines
> that only a handful of people can afford. That's the same path which
> has sunk all the workstation companies, they all have bloated OS's and
> Linux runs circles around them.

Scalability done properly should not degrade performance on smaller
machines, Pee Cees, or even microscopic organisms.


On Fri, Feb 21, 2003 at 04:16:18PM -0800, Larry McVoy wrote:
> In terms of the money and in terms of installed seats, the small Linux
> machines out number the 4 or more CPU SMP machines easily 10,000:1.
> And with the embedded market being one of the few real money makers
> for Linux, there will be huge pushback from those companies against
> changes which increase memory footprint.

There's quite a bit of commonality with large x86 highmem there, as
the highmem crew is extremely concerned about the kernel's memory
footprint and is looking to trim kernel memory overhead from every
aspect of its operation they can. Reducing kernel memory footprint
is a crucial part of scalability, in both scaling down to the low end
and scaling up to highmem. =)


-- wli

2003-02-22 00:44:23

by Martin J. Bligh

Subject: Re: Minutes from Feb 21 LSE Call

> Ben is right. I think IBM and the other big iron companies would be
> far better served looking at what they have done with running multiple
> instances of Linux on one big machine, like the 390 work. Figure out
> how to use that model to scale up. There is simply not a big enough
> market to justify shoveling lots of scaling stuff in for huge machines
> that only a handful of people can afford. That's the same path which
> has sunk all the workstation companies, they all have bloated OS's and
> Linux runs circles around them.

In your humble opinion.

Unfortunately, as I've pointed out to you before, this doesn't work in
practice. Workloads may not be easily divisible amongst machines, and
you're just pushing all the complex problems out for every userspace
app to solve itself, instead of fixing it once in the kernel.

The fact that you were never able to do this before doesn't mean it's
impossible, it just means that you failed.

> In terms of the money and in terms of installed seats, the small Linux
> machines out number the 4 or more CPU SMP machines easily 10,000:1.
> And with the embedded market being one of the few real money makers
> for Linux, there will be huge pushback from those companies against
> changes which increase memory footprint.

And the profit margin on the big machines will outpace the smaller
machines by a similar ratio, inverted. The high-end space is where most
of the money is made by the Linux distros, by selling products like SLES
or Advanced Server to people who can afford to pay for it.

M.

2003-02-22 02:15:07

by Steven Cole

Subject: Re: Minutes from Feb 21 LSE Call

On Fri, 2003-02-21 at 17:25, William Lee Irwin III wrote:
> On Fri, Feb 21, 2003 at 04:16:18PM -0800, Larry McVoy wrote:
> > Ben is right. I think IBM and the other big iron companies would be
> > far better served looking at what they have done with running multiple
> > instances of Linux on one big machine, like the 390 work. Figure out
> > how to use that model to scale up. There is simply not a big enough
> > market to justify shoveling lots of scaling stuff in for huge machines
> > that only a handful of people can afford. That's the same path which
> > has sunk all the workstation companies, they all have bloated OS's and
> > Linux runs circles around them.

mjb> Unfortunately, as I've pointed out to you before, this doesn't work
mjb> in practice. Workloads may not be easily divisible amongst
mjb> machines, and you're just pushing all the complex problems out for
mjb> every userspace app to solve itself, instead of fixing it once in
mjb> the kernel.


Please permit an observer from the sidelines a few comments.
I think all four of you are right, for different reasons.
>
> Scalability done properly should not degrade performance on smaller
> machines, Pee Cees, or even microscopic organisms.

s/should/must/ in the above. That must be a guiding principle.

>
>
> On Fri, Feb 21, 2003 at 04:16:18PM -0800, Larry McVoy wrote:
> > In terms of the money and in terms of installed seats, the small Linux
> > machines out number the 4 or more CPU SMP machines easily 10,000:1.
> > And with the embedded market being one of the few real money makers
> > for Linux, there will be huge pushback from those companies against
> > changes which increase memory footprint.
>
> There's quite a bit of commonality with large x86 highmem there, as
> the highmem crew is extremely concerned about the kernel's memory
> footprint and is looking to trim kernel memory overhead from every
> aspect of its operation they can. Reducing kernel memory footprint
> is a crucial part of scalability, in both scaling down to the low end
> and scaling up to highmem. =)
>
>
> -- wli

Since the time between major releases of the kernel seems to be two to
three years now (counting to where the new kernel is really stable),
it is probably worthwhile to think about what high-end systems will
be like when 3.0 is expected.

My guess is that a trend will be machines with increasingly greater cpu
counts with access to the same memory. Why? Because if it can be done,
it will be done. The ability to put more cpus on a single chip may
translate into a Moore's law of increasing cpu counts per machine. And
as Martin points out, the high end machines are where the money is.

In my own unsophisticated opinion, Larry's concept of Cache Coherent
Clusters seems worth further development. And Martin is right about the
need for fixing it in the kernel, again IMHO. But how to fix it in the
kernel? Would something similar to OpenMosix or OpenSSI in a future
kernel be appropriate to get Larry's CCCluster members to cooperate? Or
is it possible to continue the scalability race when cpu counts get to
256, 512, etc.?

Just some thoughts from the sidelines.

Best regards,
Steven



2003-02-22 02:37:59

by Larry McVoy

Subject: Re: Minutes from Feb 21 LSE Call

On Fri, Feb 21, 2003 at 04:44:13PM -0800, Martin J. Bligh wrote:
> > Ben is right. I think IBM and the other big iron companies would be
> > far better served looking at what they have done with running multiple
> > instances of Linux on one big machine, like the 390 work. Figure out
> > how to use that model to scale up. There is simply not a big enough
> > market to justify shoveling lots of scaling stuff in for huge machines
> > that only a handful of people can afford. That's the same path which
> > has sunk all the workstation companies, they all have bloated OS's and
> > Linux runs circles around them.
>
> In your humble opinion.

My opinion has nothing to do with it, go benchmark them and see for
yourself. I'm in a pretty good position to back up my statements with
data, we support BitKeeper on AIX, Solaris, IRIX, HP-UX, Tru64, as well
as a pile of others, so we have both the hardware and the software to
do the comparisons. I stand by my statement above and so does anyone else
who has done the measurements. It is much, much more pleasant to have
Linux versus any other Unix implementation on the same platform. Let's
keep it that way.

> Unfortunately, as I've pointed out to you before, this doesn't work in
> practice. Workloads may not be easily divisible amongst machines, and
> you're just pushing all the complex problems out for every userspace
> app to solve itself, instead of fixing it once in the kernel.

"fixing it", huh? Your "fixes" may be great for your tiny segment of
the market but they are not going to be welcome if they turn Linux into
BloatOS 9.8.

> The fact that you were never able to do this before doesn't mean it's
> impossible, it just means that you failed.

Thanks for the vote of confidence. I think the thing to focus on,
however, is that *no one* has ever succeeded at what you are trying
to do. And there have been many, many attempts. Your opinion, it
would appear, is that you are smarter than all of the people in all
of those past failed attempts, but you'll forgive me if I'm not
impressed with your optimism.

> > In terms of the money and in terms of installed seats, the small Linux
> > machines out number the 4 or more CPU SMP machines easily 10,000:1.
> > And with the embedded market being one of the few real money makers
> > for Linux, there will be huge pushback from those companies against
> > changes which increase memory footprint.
>
> And the profit margin on the big machines will outpace the smaller
> machines by a similar ratio, inverted.

Really? How about some figures? You'd need HUGE profit margins to
justify your position, how about some actual hard cold numbers?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-22 04:22:56

by Martin J. Bligh

Subject: Re: Minutes from Feb 21 LSE Call

>> In your humble opinion.
>
> My opinion has nothing to do with it, go benchmark them and see for
> yourself.

Nope, I was referring to this:

>> > Ben is right. I think IBM and the other big iron companies would be
>> > far better served looking at what they have done with running multiple
>> > instances of Linux on one big machine, like the 390 work. Figure out
>> > how to use that model to scale up. There is simply not a big enough
>> > market to justify shoveling lots of scaling stuff in for huge machines
>> > that only a handful of people can afford.

Which I totally disagree with.

>> >That's the same path which
>> > has sunk all the workstation companies, they all have bloated OS's and
>> > Linux runs circles around them.

Not the fact that Linux is capable of stellar things, which I totally
agree with.

> I'm in a pretty good position to back up my statements with
> data, we support BitKeeper on AIX, Solaris, IRIX, HP-UX, Tru64, as well
> as a pile of others, so we have both the hardware and the software to
> do the comparisons. I stand by my statement above and so does anyone else
> who has done the measurements.

Oh, I don't doubt it - But I'd be amused to see the measurements,
if you have them to hand.

> It is much much more pleasant to have Linux versus any other Unix
> implementation on the same platform. Let's keep it that way.

Absolutely.

>> Unfortunately, as I've pointed out to you before, this doesn't work in
>> practice. Workloads may not be easily divisible amongst machines, and
>> you're just pushing all the complex problems out for every userspace
>> app to solve itself, instead of fixing it once in the kernel.
>
> "fixing it", huh? Your "fixes" may be great for your tiny segment of
> the market but they are not going to be welcome if they turn Linux into
> BloatOS 9.8.

They won't - the maintainers would never allow us to do that.

>> The fact that you were never able to do this before doesn't mean it's
>> impossible, it just means that you failed.
>
> Thanks for the vote of confidence. I think the thing to focus on,
> however, is that *no one* has ever succeeded at what you are trying
> to do. And there have been many, many attempts. Your opinion, it
> would appear, is that you are smarter than all of the people in all
> of those past failed attempts, but you'll forgive me if I'm not
> impressed with your optimism.

Who said that I was going to single-handedly change the world? What's
different with Linux is the development model. That's why *we* will
succeed where others have failed before. There's some incredible intellect
all around Linux, but that's not all it takes, as you've pointed out.

>> > In terms of the money and in terms of installed seats, the small Linux
>> > machines out number the 4 or more CPU SMP machines easily 10,000:1.
>> > And with the embedded market being one of the few real money makers
>> > for Linux, there will be huge pushback from those companies against
>> > changes which increase memory footprint.
>>
>> And the profit margin on the big machines will outpace the smaller
>> machines by a similar ratio, inverted.
>
> Really? How about some figures? You'd need HUGE profit margins to
> justify your position, how about some actual hard cold numbers?

I don't have them to hand, but if you think anyone's making money on
PCs nowadays, you're delusional (with respect to hardware). With respect
to Linux, what makes you think distros are going to make large amounts
of money from a freely replicatable OS, for tiny embedded systems?
Support for servers, on the other hand, is a different game ...

M.


2003-02-22 04:55:42

by Larry McVoy

Subject: Re: Minutes from Feb 21 LSE Call

On Fri, Feb 21, 2003 at 08:32:30PM -0800, Martin J. Bligh wrote:
> > "fixing it", huh? Your "fixes" may be great for your tiny segment of
> > the market but they are not going to be welcome if they turn Linux into
> > BloatOS 9.8.
>
> They won't - the maintainers would never allow us to do that.

The path to hell is paved with good intentions.

> > Really? How about some figures? You'd need HUGE profit margins to
> > justify your position, how about some actual hard cold numbers?
>
> I don't have them to hand, but if you think anyone's making money on
> PCs nowadays, you're delusional (with respect to hardware).

Let's see, Dell has a $66B market cap, revenues of $8B/quarter and
$500M/quarter in profit.

Lots of people working for companies who haven't figured out how to do
it as well as Dell *say* it can't be done but numbers say differently.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-22 06:30:04

by Martin J. Bligh

Subject: Re: Minutes from Feb 21 LSE Call

>> I don't have them to hand, but if you think anyone's making money on
>> PCs nowadays, you're delusional (with respect to hardware).
>
> Let's see, Dell has a $66B market cap, revenues of $8B/quarter and
> $500M/quarter in profit.
>
> Lots of people working for companies who haven't figured out how to do
> it as well as Dell *say* it can't be done but numbers say differently.

And how much of that was profit on PCs running Linux?

M.

2003-02-22 07:37:33

by David Miller

Subject: Re: Minutes from Feb 21 LSE Call

On Fri, 2003-02-21 at 16:16, Larry McVoy wrote:
> In terms of the money and in terms of installed seats, the small Linux
> machines out number the 4 or more CPU SMP machines easily 10,000:1.

While I totally agree with your points, I want to mention that
although this ratio is true, the exact opposite ratio applies to
the price of the service contracts a company can land with the big
machines :-)

2003-02-22 07:43:20

by David Miller

Subject: Re: Minutes from Feb 21 LSE Call

On Fri, 2003-02-21 at 21:05, Larry McVoy wrote:
> Let's see, Dell has a $66B market cap, revenues of $8B/quarter and
> $500M/quarter in profit.

While I understand these numbers are on the mark, there is a tertiary
issue to realize.

Dell makes money on many things other than thin-margin PCs. And lo'
and behold one of those things is selling the larger Intel based
servers and support contracts to go along with that. And so you're
nearly supporting Martin's arguments for supporting large servers
better under Linux by bringing up Dell's balance sheet :-)

2003-02-22 07:43:48

by David Miller

Subject: Re: Minutes from Feb 21 LSE Call

On Fri, 2003-02-21 at 22:39, Martin J. Bligh wrote:
> > Lots of people working for companies who haven't figured out how to do
> > it as well as Dell *say* it can't be done but numbers say differently.
>
> And how much of that was profit on PCs running Linux?

Or PCs period, they make tons of bucks on servers and associated
support contracts.

2003-02-22 08:28:08

by Jeff Garzik

Subject: Re: Minutes from Feb 21 LSE Call

ia32 big iron. sigh. I think that's so unfortunate in a number
of ways, but the main reason, of course, is that highmem is evil :)

Intel can use PAE to "turn back the clock" on ia32. Although googling
doesn't support this speculation, I am willing to bet Intel will
eventually unveil a new PAE that busts the 64GB barrier -- instead of
trying harder to push consumers to 64-bit processors. Processor speed,
FSB speed, PCI bus bandwidth, all these are issues -- but ones that
pale in comparison to the long term effects of highmem on the market.

Enterprise customers will see this as a signal to continue building
around ia32 for the next few years, thoroughly damaging 64-bit
technology sales and development. I bet even IA64 suffers...
at Intel's own hands. Rumors of a "Pentium64" at Intel are constantly
floating around The Register and various rumor web sites, but Intel
is gonna miss that huge profit opportunity too by trying to hack the
ia32 ISA to scale up to big iron -- where it doesn't belong.

Being cynical, one might guess that Intel will treat IA64 as a loss
leader until the other 64-bit competition dies, keeping ia32 at the
top end of the market via silly PAE/PSE hacks. When the existing
64-bit competition disappears, five years down the road, compilers
will have matured sufficiently to make using IA64 boxes feasible.

If you really want to scale, just go to 64-bits, darn it. Don't keep
hacking ia32 ISA -- leave it alone, it's fine as it is, and will live
a nice long life as the future's preferred embedded platform.

64-bit. alpha is old tech, and dead. *sniff* sparc64 is mostly
old tech, and mostly dead. IA64 isn't, yet. x86-64 is _nice_ tech,
but who knows if AMD will survive competition with Intel. PPC64 is
the wild card in all this. I hope it succeeds.

Jeff,
feeling like a silly, random rant after a long drive



...and from a technical perspective, highmem grots up the code, too :)

2003-02-22 14:24:59

by Larry McVoy

Subject: Re: Minutes from Feb 21 LSE Call

On Sat, Feb 22, 2003 at 12:38:33AM -0800, David S. Miller wrote:
> On Fri, 2003-02-21 at 21:05, Larry McVoy wrote:
> > Let's see, Dell has a $66B market cap, revenues of $8B/quarter and
> > $500M/quarter in profit.
>
> While I understand these numbers are on the mark, there is a tertiary
> issue to realize.
>
> Dell makes money on many things other than thin-margin PCs. And lo'
> and behold one of those things is selling the larger Intel based
> servers and support contracts to go along with that.

I did some digging trying to find that ratio before I posted last night
and couldn't. You obviously think that the servers are a significant
part of their business. I'd be surprised at that, but that's cool,
what are the numbers? PC's, monitors, disks, laptops, anything with less
than 4 cpus is in the little bucket, so how much revenue does Dell generate
on the 4 CPU and larger servers?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-22 15:38:31

by Martin J. Bligh

Subject: Re: Minutes from Feb 21 LSE Call

>> > Let's see, Dell has a $66B market cap, revenues of $8B/quarter and
>> > $500M/quarter in profit.
>>
>> While I understand these numbers are on the mark, there is a tertiary
>> issue to realize.
>>
>> Dell makes money on many things other than thin-margin PCs. And lo'
>> and behold one of those things is selling the larger Intel based
>> servers and support contracts to go along with that.
>
> I did some digging trying to find that ratio before I posted last night
> and couldn't. You obviously think that the servers are a significant
> part of their business. I'd be surprised at that, but that's cool,
> what are the numbers? PC's, monitors, disks, laptops, anything with less
> than 4 cpus is in the little bucket, so how much revenue does Dell generate
> on the 4 CPU and larger servers?

It's not a question of revenue, it's one of profit. Very few people buy
desktops for use with Linux, compared to those that buy them for Windows.
The profit on each PC is small, thus I still think a substantial proportion
of the profit made by hardware vendors by Linux is on servers rather than
desktop PCs. The numbers will be smaller for high end machines, but the
profit margins are much higher.

M.

2003-02-22 16:04:22

by Larry McVoy

Subject: Re: Minutes from Feb 21 LSE Call

On Sat, Feb 22, 2003 at 07:47:53AM -0800, Martin J. Bligh wrote:
> >> > Let's see, Dell has a $66B market cap, revenues of $8B/quarter and
> >> > $500M/quarter in profit.
> >>
> >> While I understand these numbers are on the mark, there is a tertiary
> >> issue to realize.
> >>
> >> Dell makes money on many things other than thin-margin PCs. And lo'
> >> and behold one of those things is selling the larger Intel based
> >> servers and support contracts to go along with that.
> >
> > I did some digging trying to find that ratio before I posted last night
> > and couldn't. You obviously think that the servers are a significant
> > part of their business. I'd be surprised at that, but that's cool,
> > what are the numbers? PC's, monitors, disks, laptops, anything with less
> > than 4 cpus is in the little bucket, so how much revenue does Dell generate
> > on the 4 CPU and larger servers?
>
> It's not a question of revenue, it's one of profit. Very few people buy
> desktops for use with Linux, compared to those that buy them for Windows.
> The profit on each PC is small, thus I still think a substantial proportion
> of the profit made by hardware vendors by Linux is on servers rather than
> desktop PCs. The numbers will be smaller for high end machines, but the
> profit margins are much higher.

That's all handwaving and has no meaning without numbers. I could care less
if Dell has 99.99% margins on their servers, if they only sell $50M of servers
a quarter that is still less than 10% of their quarterly profit.

So what are the actual *numbers*? Your point makes sense if and only if
people sell lots of servers. I spent a few minutes in google: world wide
server sales are $40B at the moment. The overwhelming majority of that
revenue is small servers. Let's say that Dell has 20% of that market,
that's $2B/quarter. Now let's chop off the 1-2 CPU systems. I'll bet
you long long odds that that is 90% of their revenue in the server space.
Supposing that's right, that's $200M/quarter in big iron sales. Out of
$8000M/quarter.

I'd love to see data which is different than this but you'll have a tough
time finding it. More and more companies are looking at the cost of
big iron and deciding it doesn't make sense to spend $20K/CPU when they
could be spending $1K/CPU. Look at Google, try selling them some big
iron. Look at Wall Street - abandoning big iron as fast as they can.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-22 16:19:52

by Martin J. Bligh

Subject: Re: Minutes from Feb 21 LSE Call

> That's all handwaving and has no meaning without numbers. I could care less
> if Dell has 99.99% margins on their servers, if they only sell $50M of servers
> a quarter that is still less than 10% of their quarterly profit.
>
> So what are the actual *numbers*? Your point makes sense if and only if
> people sell lots of servers. I spent a few minutes in google: world wide
> server sales are $40B at the moment. The overwhelming majority of that
> revenue is small servers. Let's say that Dell has 20% of that market,
> that's $2B/quarter. Now let's chop off the 1-2 CPU systems. I'll bet
> you long long odds that that is 90% of their revenue in the server space.
> Supposing that's right, that's $200M/quarter in big iron sales. Out of
> $8000M/quarter.
>
> I'd love to see data which is different than this but you'll have a tough
> time finding it. More and more companies are looking at the cost of
> big iron and deciding it doesn't make sense to spend $20K/CPU when they
> could be spending $1K/CPU. Look at Google, try selling them some big
> iron. Look at Wall Street - abandoning big iron as fast as they can.

But we're talking about linux ... and we're talking about profit, not
revenue. I'd guess that 99% of their desktop sales are for Windows.
And I'd guess they make 100 times as much profit on a big server as they
do on a desktop PC.

Would be nice if someone had real numbers, but I doubt they're published
except in non-free corporate research reports.

M.



2003-02-22 16:24:32

by Larry McVoy

Subject: Re: Minutes from Feb 21 LSE Call

On Sat, Feb 22, 2003 at 08:29:34AM -0800, Martin J. Bligh wrote:
> > people sell lots of servers. I spent a few minutes in google: world wide
> > server sales are $40B at the moment. The overwhelming majority of that
> > revenue is small servers. Let's say that Dell has 20% of that market,
> > that's $2B/quarter. Now let's chop off the 1-2 CPU systems. I'll bet
> > you long long odds that that is 90% of their revenue in the server space.
> > Supposing that's right, that's $200M/quarter in big iron sales. Out of
> > $8000M/quarter.
> >
> > I'd love to see data which is different than this but you'll have a tough
> > time finding it. More and more companies are looking at the cost of
> > big iron and deciding it doesn't make sense to spend $20K/CPU when they
> > could be spending $1K/CPU. Look at Google, try selling them some big
> > iron. Look at Wall Street - abandoning big iron as fast as they can.
>
> But we're talking about linux ... and we're talking about profit, not
> revenue. I'd guess that 99% of their desktop sales are for Windows.
> And I'd guess they make 100 times as much profit on a big server as they
> do on a desktop PC.

You are thinking in today's terms. Find the asymptote and project out.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-22 16:29:59

by Martin J. Bligh

Subject: Re: Minutes from Feb 21 LSE Call

>> But we're talking about linux ... and we're talking about profit, not
>> revenue. I'd guess that 99% of their desktop sales are for Windows.
>> And I'd guess they make 100 times as much profit on a big server as they
>> do on a desktop PC.
>
> You are thinking in today's terms. Find the asymptote and project out.

OK, I predict that Linux will take over the whole of the high end server
market ... if people stop complaining about us fixing scalability. That
should give some nicer numbers ....

M.

2003-02-22 16:49:01

by John Bradford

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

> OK, I predict that Linux will take over the whole of the high end server
> market ... if people stop complaining about us fixing scalability. That
> should give some nicer numbers ....

Extending the useful life of current hardware will shift profit even
further towards support contracts, and away from hardware sales.

Imagine the performance gain a webserver serving mostly static
content, with light database and scripting usage, is going to see
moving from a 2.4 -> 2.6 kernel? Zero copy and filesystem
improvements alone will extend its useful life dramatically, in my
opinion.

John.

2003-02-22 17:09:47

by Alan

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, 2003-02-22 at 00:16, Larry McVoy wrote:
> In terms of the money and in terms of installed seats, the small Linux
> machines out number the 4 or more CPU SMP machines easily 10,000:1.
> And with the embedded market being one of the few real money makers
> for Linux, there will be huge pushback from those companies against
> changes which increase memory footprint.

I think people overestimate the number of large boxes badly. Several IDE
pre-patches didn't work on highmem boxes. It took *ages* for people to
actually notice there was a problem. The desktop world is still 128-256Mb
and some of the crap people push is problematic even there. In the embedded
space where there is a *ton* of money to be made by smart people a lot
of the 2.5 choices look very questionable indeed - but not all by any
means, we are for example close to being able to dump the block layer,
shrink stacks down by using IRQ stacks and other good stuff.

I'm hoping the Montavista and IBM people will swat each others bogons 8)

Alan

2003-02-22 19:46:34

by Larry McVoy

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, Feb 22, 2003 at 11:42:55AM -0800, Martin J. Bligh wrote:
> >> Dell makes money on many things other than thin-margin PCs. And lo'
> >
> > Dell's revenue is 53/29/18% desktop/notebook/server;
> > 80% of US sales are to businesses. their annual report doesn't
> > break out service revenue.
>
> Interesting. Given the profit margins involved, I bet they still
> make more money on servers than desktops and notebooks combined
> (the annual report doesn't seem to list that). And that's before
> you take account of the "linux weighting" on top of that ...

Err, here's a news flash. Dell has just one server with more than
4 CPUs and it tops out at 8. Everything else is clusters. And they
call any machine that doesn't have a head a server, they have servers
starting $299. Yeah, that's right, $299.

http://www.dell.com/us/en/bsd/products/series_pedge_servers.htm

How much do you want to bet that more than 95% of their server revenue
comes from 4CPU or less boxes? I wouldn't be surprised if it is more
like 99.5%. And you can configure yourself a pretty nice quad xeon box
for $25K. Yeah, there is some profit in there but nowhere near the huge
margins you are counting on to make your case.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-22 19:56:20

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, 2003-02-22 at 00:16, Larry McVoy wrote:
>> And with the embedded market being one of the few real money makers
>> for Linux, there will be huge pushback from those companies against
>> changes which increase memory footprint.

On Sat, Feb 22, 2003 at 06:20:19PM +0000, Alan Cox wrote:
> I think people overestimate the number of large boxes badly. Several IDE
> pre-patches didn't work on highmem boxes. It took *ages* for people to
> actually notice there was a problem. The desktop world is still 128-256Mb
> and some of the crap people push is problematic even there. In the embedded
> space where there is a *ton* of money to be made by smart people a lot
> of the 2.5 choices look very questionable indeed - but not all by any
> means, we are for example close to being able to dump the block layer,
> shrink stacks down by using IRQ stacks and other good stuff.

Well, I've never seen IDE in a highmem box, and there's probably a good
reason for it. The space trimmings sound pretty interesting. IRQ stacks
in general sound good just to mitigate stackblowings due to IRQ pounding.


On Sat, Feb 22, 2003 at 06:20:19PM +0000, Alan Cox wrote:
> I'm hoping the Montavista and IBM people will swat each others bogons 8)

Sounds like a bigger win for the bigboxen, since space matters there,
but large-scale SMP efficiency probably doesn't make a difference to
embedded (though I think some 2x embedded systems are floating around).


-- wli

2003-02-22 20:15:20

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, Feb 22, 2003 at 11:56:42AM -0800, Larry McVoy wrote:
> Err, here's a news flash. Dell has just one server with more than
> 4 CPUs and it tops out at 8. Everything else is clusters. And they
> call any machine that doesn't have a head a server, they have servers
> starting $299. Yeah, that's right, $299.
> http://www.dell.com/us/en/bsd/products/series_pedge_servers.htm

Sounds like low-capacity boxen meant to minimize colocation costs via
rackspace minimization.


On Sat, Feb 22, 2003 at 11:56:42AM -0800, Larry McVoy wrote:
> How much do you want to bet that more than 95% of their server revenue
> comes from 4CPU or less boxes? I wouldn't be surprised if it is more
> like 99.5%. And you can configure yourself a pretty nice quad xeon box
> for $25K. Yeah, there is some profit in there but nowhere near the huge
> margins you are counting on to make your case.

Ask their marketing dept. or something. I can maximize utility
integrals and find Nash equilibria, but can't tell you Dell's secrets.


-- wli

2003-02-22 20:25:25

by Alan

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, 2003-02-22 at 20:05, William Lee Irwin III wrote:
> On Sat, Feb 22, 2003 at 06:20:19PM +0000, Alan Cox wrote:
> > I'm hoping the Montavista and IBM people will swat each others bogons 8)
>
> Sounds like a bigger win for the bigboxen, since space matters there,
> but large-scale SMP efficiency probably doesn't make a difference to
> embedded (though I think some 2x embedded systems are floating around).

Smaller cleaner code is a win for everyone, and it often pays off in ways
that are not immediately obvious. For example having your entire kernel
working set and running app fitting in the L2 cache happens to be very
good news to most people.

Alan

2003-02-22 20:52:11

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

>> Interesting. Given the profit margins involved, I bet they still
>> make more money on servers than desktops and notebooks combined
>> (the annual report doesn't seem to list that). And that's before
>> you take account of the "linux weighting" on top of that ...
>
> Err, here's a news flash. Dell has just one server with more than
> 4 CPUs and it tops out at 8. Everything else is clusters. And they
> call any machine that doesn't have a head a server, they have servers
> starting $299. Yeah, that's right, $299.
>
> http://www.dell.com/us/en/bsd/products/series_pedge_servers.htm
>
> How much do you want to bet that more than 95% of their server revenue
> comes from 4CPU or less boxes? I wouldn't be surprised if it is more
> like 99.5%. And you can configure yourself a pretty nice quad xeon box
> for $25K. Yeah, there is some profit in there but nowhere near the huge
> margins you are counting on to make your case.

OK, so now you've slid from talking about PCs to 2-way to 4-way ...
perhaps because your original argument was fatally flawed.

The work we're doing on scalability has big impacts on 4-way systems
as well as the high end. We're also simultaneously dramatically improving
stability for smaller SMP machines by finding and reproducing races in
5 minutes that smaller machines might hit once every year or so, and
running high-stress workloads that thrash the hell out of various
subsystems, exposing bugs.

Some applications work well on clusters, which will give them cheaper
hardware, at the expense of a lot more complexity in userspace ...
depending on the scale of the system, that's a tradeoff that might go
either way.

For applications that don't work well on clusters, you have no real
choice but to go with the high-end systems. I'd like to see Linux
across the board, as would many others.

You don't believe we can make it scale without screwing up the low end,
I do believe we can do that. Time will tell ... Linus et al are not
stupid ... we're not going to be able to submit stuff that screws up
the low-end, even if we wanted to.

M.



2003-02-22 21:19:29

by Jeff Garzik

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

Oh, come on :)

It's all vague handwaving because people either don't know real numbers,
or sure as heck won't post them on a public list...

Jeff



2003-02-22 21:29:02

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On 22 Feb 2003 18:20:19 GMT, Alan Cox wrote:
> I think people overestimate the number of large boxes badly. Several IDE
> pre-patches didn't work on highmem boxes. It took *ages* for people to
> actually notice there was a problem. The desktop world is still 128-256Mb

IDE on big boxes? Is that crack I smell burning? A desktop with 4 GB
is a fun toy, but bigger than *I* need, even for development purposes.
But I don't think EMC, Clariion (low end EMC), Shark, etc. have any
IDE products for my 8-proc 16 GB machine... And running pre-patches in
a production environment that might expose this would be a little
silly as well.

Probably a bad example to extrapolate large system numbers from.

gerrit

2003-02-22 21:32:31

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, Feb 22, 2003 at 01:36:31PM -0800, Gerrit Huizenga wrote:
> IDE on big boxes? Is that crack I smell burning? A desktop with 4 GB
> is a fun toy, but bigger than *I* need, even for development purposes.
> But I don't think EMC, Clariion (low end EMC), Shark, etc. have any
> IDE products for my 8-proc 16 GB machine... And running pre-patches in
> a production environment that might expose this would be a little
> silly as well.
>
> Probably a bad example to extrapolate large system numbers from.

At least the SGI Altix does have an IDE/ATAPI CDROM drive :)

2003-02-22 21:56:20

by Mark Hahn

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

> OK, so now you've slid from talking about PCs to 2-way to 4-way ...
> perhaps because your original argument was fatally flawed.

oh, come on. the issue is whether memory is fast and flat.
most "scalability" efforts are mainly trying to code around the fact
that any ccNUMA (and most 4-ways) is going to be slow/bumpy.
it is reasonable to worry that optimizations for imbalanced machines
will hurt "normal" ones. is it worth hurting uni by 5% to give
a 50% speedup to IBM's 32-way? I think not, simply because
low-end machines are more important to Linux.

the best way to kill Linux is to turn it into an OS best suited
for $6+-digit machines.

> For applications that don't work well on clusters, you have no real

ccNUMA worst-case latencies are not much different from decent
cluster (message-passing) latencies. getting an app to work on a cluster
is a matter of programming will.

regards, mark hahn.

2003-02-22 22:09:14

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> ia32 big iron. sigh. I think that's so unfortunate in a number
> of ways, but the main reason, of course, is that highmem is evil :)
> Intel can use PAE to "turn back the clock" on ia32. Although googling
> doesn't support this speculation, I am willing to bet Intel will
> eventually unveil a new PAE that busts the 64GB barrier -- instead of
> trying harder to push consumers to 64-bit processors. Processor speed,
> FSB speed, PCI bus bandwidth, all these are issues -- but ones that
> pale in comparison to the long term effects of highmem on the market.

PAE is a relatively minor insult compared to the FPU, the 50,000 psi
register pressure, variable-length instruction encoding with extremely
difficult to optimize for instruction decoder trickiness, the nauseating
bastardization of segmentation, the microscopic caches and TLB's, the
lack of TLB context tags, frankly bizarre and just-barely-fixable gate
nonsense, the interrupt controller, and ISA DMA.

I've got no idea why this particular system-level ugliness which is
nothing more than a routine pitstop in any bring your own barfbag
reading session of x86 manuals fascinates you so much.

At any rate, if systems (or any other) programming difficulties were
any concern at all, x86 wouldn't be used at all.


On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> Enterprise customers will see this as a signal to continue building
> around ia32 for the next few years, thoroughly damaging 64-bit
> technology sales and development. I bet even IA64 suffers...
> at Intel's own hands. Rumors of a "Pentium64" at Intel are constantly
> floating around The Register and various rumor web sites, but Intel
> is gonna miss that huge profit opportunity too by trying to hack the
> ia32 ISA to scale up to big iron -- where it doesn't belong.

What power do you suppose we have to resist any of this? Intel, the
800lb gorilla, shoves what it wants where it wants to shove it, and
all the "exit only" signs in the world attached to our backsides do
absolutely nothing to deter it whatsoever.


On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> Being cynical, one might guess that Intel will treat IA64 as a loss
> leader until the other 64-bit competition dies, keeping ia32 at the
> top end of the market via silly PAE/PSE hacks. When the existing
> 64-bit competition disappears, five years down the road, compilers
> will have matured sufficiently to make using IA64 boxes feasible.

Sounds relatively natural. I don't have a good notion of the legality
boundaries wrt. to antitrust, but I'd assume they would otherwise do
whatever it takes to either defeat or wipe out competitors.


On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> If you really want to scale, just go to 64-bits, darn it. Don't keep
> hacking ia32 ISA -- leave it alone, it's fine as it is, and will live
> a nice long life as the future's preferred embedded platform.

Take this up with Intel. The rest of us are at their mercy.
Good luck finding anyone there to listen to it, you'll need it.


On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> 64-bit. alpha is old tech, and dead. *sniff* sparc64 is mostly
> old tech, and mostly dead. IA64 isn't, yet. x86-64 is _nice_ tech,
> but who knows if AMD will survive competition with Intel. PPC64 is
> the wild card in all this. I hope it succeeds.

Alpha is old, dead, and kicking most other cpus' asses from the grave.
I always did like DEC hardware. =(

I'm not sure what's so nice about x86-64; another opcode prefix
controlled extension atop the festering pile of existing x86 crud
sounds every bit as bad any other attempt to prolong x86. Some of
the system device -level cleanups like the HPET look nice, though.

This success/failure stuff sounds a lot like economics, which is
pretty much even further out of our control than the weather or the
government. What prompted this bit?


-- wli

2003-02-22 22:08:29

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, Feb 22, 2003 at 05:06:27PM -0500, Mark Hahn wrote:
> ccNUMA worst-case latencies are not much different from decent
> cluster (message-passing) latencies.

Not even close, by several orders of magnitude.


-- wli

2003-02-22 22:34:15

by Ben Greear

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

Mark Hahn wrote:
>>OK, so now you've slid from talking about PCs to 2-way to 4-way ...
>>perhaps because your original argument was fatally flawed.
>
>
> oh, come on. the issue is whether memory is fast and flat.
> most "scalability" efforts are mainly trying to code around the fact
> that any ccNUMA (and most 4-ways) is going to be slow/bumpy.
> it is reasonable to worry that optimizations for imbalanced machines
> will hurt "normal" ones. is it worth hurting uni by 5% to give
> a 50% speedup to IBM's 32-way? I think not, simply because
> low-end machines are more important to Linux.
>
> the best way to kill Linux is to turn it into an OS best suited
> for $6+-digit machines.

Linux has a key feature that most other OS's lack: it can easily be
recompiled, by anyone, for a particular architecture. So there is no
particular reason why optimizing for a high-end system has to kill
performance on uni-processor machines.

For instance, don't locks simply get compiled away to nothing on
uni-processor machines?

--
Ben Greear <[email protected]> <Ben_Greear AT excite.com>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear


2003-02-22 23:00:46

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

>> OK, so now you've slid from talking about PCs to 2-way to 4-way ...
>> perhaps because your original argument was fatally flawed.
>
> oh, come on. the issue is whether memory is fast and flat.
> most "scalability" efforts are mainly trying to code around the fact
> that any ccNUMA (and most 4-ways) is going to be slow/bumpy.

Scalability is not just NUMA machines by any stretch of the imagination.
It's 2x, 4x, 8x SMP as well.

> it is reasonable to worry that optimizations for imbalanced machines
> will hurt "normal" ones. is it worth hurting uni by 5% to give
> a 50% speedup to IBM's 32-way? I think not, simply because
> low-end machines are more important to Linux.

We would never try to propose such a change, and never have.
Name a scalability change that's hurt the performance of UP by 5%.
There isn't one.

> ccNUMA worst-case latencies are not much different from decent
> cluster (message-passing) latencies. getting an app to work on a cluster
> is a matter of programming will.

It's a matter of repeatedly reimplementing a bunch of stuff in userspace,
instead of doing things in kernel space once, properly, with all the
machine specific knowledge that's needed. It's *so* much easier to
program over a single OS image.

M.

2003-02-22 23:10:23

by Larry McVoy

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

> We would never try to propose such a change, and never have.
> Name a scalability change that's hurt the performance of UP by 5%.
> There isn't one.

This is *exactly* the reasoning that every OS marketing weenie has used
for the last 20 years to justify their "feature" of the week.

The road to slow bloated code is paved one cache miss at a time. You
may quote me on that. In fact, print it out and put it above your
monitor and look at it every day. One cache miss at a time. How much
does one cache miss add to any benchmark? .001%? Less.

But your pet features didn't slow the system down. Nope, they just made
the cache smaller, which you didn't notice because whatever artificial
benchmark you ran didn't happen to need the whole cache.

You need to understand that system resources belong to the user. Not the
kernel. The goal is to have all of the kernel code running under any
load be less than 1% of the CPU. Your 5% number up there would pretty
much double the amount of time we spend in the kernel for most workloads.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-22 23:05:49

by Larry McVoy

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, Feb 22, 2003 at 01:02:12PM -0800, Martin J. Bligh wrote:
> > How much do you want to bet that more than 95% of their server revenue
> > comes from 4CPU or less boxes? I wouldn't be surprised if it is more
> > like 99.5%. And you can configure yourself a pretty nice quad xeon box
> > for $25K. Yeah, there is some profit in there but nowhere near the huge
> > margins you are counting on to make your case.
>
> OK, so now you've slid from talking about PCs to 2-way to 4-way ...
> perhaps because your original argument was fatally flawed.

Nice attempt at deflection but it won't work. Your position is that
there is no money in PC's only in big iron. Last I checked, "big iron"
doesn't include $25K 4 way machines, now does it? You claimed that
Dell was making the majority of their profits from servers. To refresh
your memory: "I bet they still make more money on servers than desktops
and notebooks combined". Are you still claiming that? If so, please
provide some data to back it up because, as Mark and others have pointed
out, the bulk of their servers are headless desktop machines in tower
or rackmount cases. I fail to see how there are better margins on the
same hardware in a rackmount box for $800 when the desktop costs $750.
Those rack mount power supplies and cases are not as cheap as the desktop
ones, so I see no difference in the margins.

Let's get back to your position. You want to shovel stuff in the kernel
for the benefit of the 32 way / 64 way etc boxes. I don't see that as
wise. You could prove me wrong. Here's how you do it: go get oprofile
or whatever that tool is which lets you run apps and count cache misses.
Start including before/after runs of each microbench in lmbench and
some time sharing loads with and without your changes. When you can do
that and you don't add any more bus traffic, you're a genius and
I'll shut up.

But that's a false promise because by definition, fine grained threading
adds more bus traffic. It's kind of hard to not have that happen, the
caches have to stay coherent somehow.

> Some applications work well on clusters, which will give them cheaper
> hardware, at the expense of a lot more complexity in userspace ...
> depending on the scale of the system, that's a tradeoff that might go
> either way.

Tell it to Google. That's probably one of the largest applications in
the world; I was the 4th engineer there, and I didn't think that the
cluster added complexity at all. On the contrary, it made things go
one hell of a lot faster.

> You don't believe we can make it scale without screwing up the low end,
> I do believe we can do that.

I'd like a little more than "I think I can, I think I can, I think I can".
The people who are saying "no you can't, no you can't, no you can't" have
seen this sort of work done before and there is no data which shows that
it is possible and all sorts of data which shows that it is not.

Show me one OS which scales to 32 CPUs on an I/O load and run lmbench
on a single CPU. Then take that same CPU and stuff it into a uniprocessor
motherboard and run the same benchmarks under Linux. The Linux one
will blow away the multi threaded one. Come on, prove me wrong, show
me the data.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-22 23:13:43

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, Feb 22, 2003 at 03:15:52PM -0800, Larry McVoy wrote:
> Show me one OS which scales to 32 CPUs on an I/O load and run lmbench
> on a single CPU. Then take that same CPU and stuff it into a uniprocessor
> motherboard and run the same benchmarks under Linux. The Linux one
> will blow away the multi threaded one. Come on, prove me wrong, show
> me the data.

I could ask the SGI Eagan folks to do that with an Altix and an IA64
Whitebox - oh wait, both OSes would be Linux..

2003-02-22 23:18:54

by Larry McVoy

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, Feb 22, 2003 at 02:17:39PM -0800, William Lee Irwin III wrote:
> On Sat, Feb 22, 2003 at 05:06:27PM -0500, Mark Hahn wrote:
> > ccNUMA worst-case latencies are not much different from decent
> > cluster (message-passing) latencies.
>
> Not even close, by several orders of magnitude.

Err, I think you're wrong. It's been a long time since I looked, but I'm
pretty sure myrinet had single digit microseconds. Yup, google rocks,
7.6 usecs, user to user. Last I checked, Sequents worst case was around
there, right?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-22 23:34:07

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

>> OK, so now you've slid from talking about PCs to 2-way to 4-way ...
>> perhaps because your original argument was fatally flawed.
>
> Nice attempt at deflection but it won't work.

On your part or mine? Seemingly yours.

> Your position is that
> there is no money in PC's only in big iron. Last I checked, "big iron"
> doesn't include $25K 4 way machines, now does it?

I would call 4x a "big machine" which is what I originally said.

> You claimed that
> Dell was making the majority of their profits from servers.

I think that's probably true (nobody can be certain, as we don't have the
numbers).

> To refresh
> your memory: "I bet they still make more money on servers than desktops
> and notebooks combined". Are you still claiming that?

Yup.

> If so, please
> provide some data to back it up because, as Mark and others have pointed
> out, the bulk of their servers are headless desktop machines in tower
> or rackmount cases.

So what? they're still servers. I can no more provide data to back it up
than you can to contradict it, because they don't release those figures.
Note my sentence began "I bet", not "I have cast iron evidence".

> Let's get back to your position. You want to shovel stuff in the kernel
> for the benefit of the 32 way / 64 way etc boxes.

Actually, I'm focussed on 16-way at the moment, and have never run on,
or published numbers for anything higher. If you need to exaggerate
to make your point, then go ahead, but it's pretty transparent.

> I don't see that as wise. You could prove me wrong.
> Here's how you do it: go get oprofile
> or whatever that tool is which lets you run apps and count cache misses.
> Start including before/after runs of each microbench in lmbench and
> some time sharing loads with and without your changes. When you can do
> that and you don't add any more bus traffic, you're a genius and
> I'll shut up.

I don't feel the need to do that to prove my point, but if you feel the
need to do it to prove yours, go ahead.

> But that's a false promise because by definition, fine grained threading
> adds more bus traffic. It's kind of hard to not have that happen, the
> caches have to stay coherent somehow.

Adding more bus traffic is fine if you increase throughput. Focussing
on just one tiny aspect of performance is ludicrous. Look at the big
picture. Run some non-micro benchmarks. Analyse the results. Compare
2.4 vs 2.5 (or any set of patches I've put into the kernel of your choice)
On UP, 2P or whatever you care about.

You seem to think the maintainers are morons that we can just slide crap
straight by ... give them a little more credit than that.

> Tell it to Google. That's probably one of the largest applications in
> the world; I was the 4th engineer there, and I didn't think that the
> cluster added complexity at all. On the contrary, it made things go
> one hell of a lot faster.

As I've explained to you many times before, it depends on the system.
Some things split easily, some don't.

>> You don't believe we can make it scale without screwing up the low end,
>> I do believe we can do that.
>
> I'd like a little more than "I think I can, I think I can, I think I can".
> The people who are saying "no you can't, no you can't, no you can't" have
> seen this sort of work done before and there is no data which shows that
> it is possible and all sorts of data which shows that it is not.

The only data that's relevant is what we've done to Linux. If you want
to run the numbers, and show some useful metric on a semi-realistic
benchmark, I'd love to see it.

> Show me one OS which scales to 32 CPUs on an I/O load and run lmbench
> on a single CPU. Then take that same CPU and stuff it into a uniprocessor
> motherboard and run the same benchmarks under Linux. The Linux one
> will blow away the multi threaded one.

Nobody has ever really focussed before on an OS that scales across the
board from UP to big iron ... a closed development system is bad at
resolving that sort of thing. The real interesting comparison is UP
or 2x SMP on Linux with and without the scalability changes that have
made it into the tree.

> Come on, prove me wrong, show me the data.

I don't have to *prove* you wrong. I'm happy in my own personal knowledge
that you're wrong, and things seem to be going along just fine, thanks.
If you want to change the attitude of the maintainers, I suggest you
generate the data yourself.

M.



2003-02-22 23:37:07

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

>> > ccNUMA worst-case latencies are not much different from decent
>> > cluster (message-passing) latencies.
>>
>> Not even close, by several orders of magnitude.
>
> Err, I think you're wrong. It's been a long time since I looked, but I'm
> pretty sure myrinet had single digit microseconds. Yup, google rocks,
> 7.6 usecs, user to user. Last I checked, Sequents worst case was around
> there, right?

Sequent hardware is very old. Go time a Regatta.

M.

2003-02-22 23:35:57

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

>> We would never try to propose such a change, and never have.
>> Name a scalability change that's hurt the performance of UP by 5%.
>> There isn't one.
>
> This is *exactly* the reasoning that every OS marketing weenie has used
> for the last 20 years to justify their "feature" of the week.

Fine, stick 'em all together. I bet it's either an improvement or
doesn't even register on the scale. Knock yourself out.

M.

2003-02-22 23:44:03

by Mark Hahn

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

> I could ask the SGI Eagan folks to do that with an Altix and an IA64
> Whitebox - oh wait, both OSes would be Linux..

the only public info I've seen is "round-trip in as little as 40ns",
which is too vague to be useful. and sounds WAY optimistic - perhaps
that's just between two CPUs in a single brick. remember that
LMBench shows memory latencies of O(100ns) for even fast uniprocessors.

2003-02-22 23:47:52

by Jeff Garzik

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, Feb 22, 2003 at 03:15:52PM -0800, Larry McVoy wrote:
> or rackmount cases. I fail to see how there are better margins on the
> same hardware in a rackmount box for $800 when the desktop costs $750.
> Those rack mount power supplies and cases are not as cheap as the desktop
> ones, so I see no difference in the margins.

Oh, it's definitely different hardware. Maybe the 16550-related portion
of the ASIC is the same :) but just do an lspci to see huge differences in
motherboard chipsets, on-board parts, more complicated BIOS, remote
management bells and whistles, etc. Even the low-end rackmounts.

But the better margins come simply from the mentality, IMO. Desktops
just aren't "as important" to a business compared to servers, so IT
shops are willing to spend more money to not only get better hardware,
but also the support services that accompany it. Selling servers
to enterprise data centers means bigger, more concentrated cash pool.

Jeff



2003-02-23 00:00:04

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, 22 Feb 2003 15:28:59 PST, Larry McVoy wrote:
> On Sat, Feb 22, 2003 at 02:17:39PM -0800, William Lee Irwin III wrote:
> > On Sat, Feb 22, 2003 at 05:06:27PM -0500, Mark Hahn wrote:
> > > ccNUMA worst-case latencies are not much different from decent
> > > cluster (message-passing) latencies.
> >
> > Not even close, by several orders of magnitude.
>
> Err, I think you're wrong. It's been a long time since I looked, but I'm
> pretty sure myrinet had single digit microseconds. Yup, google rocks,
> 7.6 usecs, user to user. Last I checked, Sequent's worst case was around
> there, right?

You are going to drag 1994 technology into this to compare against
something in 2003? Hmm. You might win on that comparison. But yeah,
Sequent way back then was in that ballpark. World has moved forwards
since then...

gerrit

2003-02-23 00:28:08

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

Larry McVoy <[email protected]> writes:

> > Ben said none of the distros are supporting these large
> > systems right now. Martin said UL is already starting to support
> > them.
>
> Ben is right. I think IBM and the other big iron companies would be
> far better served looking at what they have done with running multiple
> instances of Linux on one big machine, like the 390 work. Figure out
> how to use that model to scale up. There is simply not a big enough
> market to justify shoveling lots of scaling stuff in for huge machines
> that only a handful of people can afford. That's the same path that
> sank all the workstation companies: they all have bloated OSes, and
> Linux runs circles around them.

Larry, it isn't that Linux isn't being scaled in the way you suggest.
But for the people who really care about scalability, a single system
image is not the most important thing, so making it look like one
system is secondary.

Linux clusters are currently among the top 5 supercomputers in the
world. And there the question is how you make 1200 machines look
like one, and how you handle the reliability issues. When MTBF
becomes a predictor for how many times a week someone needs to replace
hardware, the problem is very different from a simple SMP.

And there seems to be a fairly substantial market for huge machines,
for people who need high performance. All kinds of problems
require enormous amounts of data crunching.

So far the low-hanging fruit on large clusters is still in making
the hardware and the systems actually work. But increasingly, having
a single high-performance distributed filesystem is becoming
important.

But look at projects like bproc, mosix, and lustre. Not the best
things in the world but the work is getting done. Scalability is
easy. The hard part is making it look like one machine when you are
done.

Eric

2003-02-23 00:32:48

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

Hanna Linder <[email protected]> writes:

> LSE Con Call Minutes from Feb21
>
> Minutes compiled by Hanna Linder [email protected], please post
> corrections to [email protected].
>
> Object Based Reverse Mapping:
> (Dave McCracken, Ben LaHaise, Rik van Riel, Martin Bligh, Gerrit Huizenga)
>

> Ben said none of the users have been complaining about
> performance with the existing rmap. Martin disagreed and said Linus,
> Andrew Morton and himself have all agreed there is a problem.
> One of the problems Martin is already hitting on high cpu machines with
> large memory is the space consumption by all the pte-chains filling up
> memory and killing the machine. There is also a performance impact of
> maintaining the chains.

Note: rmap chains can be restricted to an arbitrary length, or an
arbitrary total count, trivially. All you have to do is enforce a fixed
limit on the number of people who can map a page simultaneously.

The selection of which chain to unmap can be a bit tricky but is
relatively straightforward. Why doesn't someone who is seeing
this just hack this up?


Eric

2003-02-23 00:51:13

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

> On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
>> ia32 big iron. sigh. I think that's so unfortunate in a number
>> of ways, but the main reason, of course, is that highmem is evil :)

One phrase ... "price:performance ratio". That's all it's about.
The only thing that will kill 32-bit big iron is the availability of
cheap 64 bit chips. It's a free-market economy.

It's ugly to program, but it's cheap, and it works.

M.

2003-02-23 01:07:15

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, Feb 22, 2003 at 02:18:20PM -0800, William Lee Irwin III wrote:
> I'm not sure what's so nice about x86-64; another opcode prefix
> controlled extension atop the festering pile of existing x86 crud

What's nice about x86-64 is that it runs existing 32 bit apps fast and
doesn't suffer from the blisteringly small caches that were part of your
rant. Plus, x86-64 binaries are not horrifically bloated like ia64.
Not to mention that the amount of reengineering in compilers like
gcc required to get decent performance out of it is actually sane.

> sounds every bit as bad as any other attempt to prolong x86. Some of
> the system device -level cleanups like the HPET look nice, though.

HPET is part of one of the PCYY specs and is even available on 32-bit x86;
there are just not that many bug-free implementations yet. Since x86-64 made
it part of the base platform and is testing it from launch, they actually
have a chance at being debugged in the mass-market versions.

-ben
--
Don't email: [email protected]

2003-02-23 03:14:03

by Andrew Morton

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

Hanna Linder <[email protected]> wrote:
>
>
> Dave coded up an initial patch for partial object based rmap
> which he sent to linux-mm yesterday.

I've run some numbers on this. Looks like it reclaims most of the
fork/exec/exit rmap overhead.

The testcase is applying and removing 64 kernel patches using my patch
management scripts. I use this because

a) It's a real workload, which someone cares about and

b) It's about as forky as anything is ever likely to be, without being a
stupid microbenchmark.

Testing is on the fast P4-HT, everything in pagecache.

2.4.21-pre4: 8.10 seconds
2.5.62-mm3 with objrmap: 9.95 seconds (+1.85 vs 2.4)
2.5.62-mm3 without objrmap: 10.86 seconds (+0.91 vs objrmap)

Current 2.5 is 2.76 seconds slower, and this patch reclaims 0.91 of those
seconds.


So who stole the remaining 1.85 seconds? Looks like pte_highmem.

Here is 2.5.62-mm3, with objrmap:

c013042c find_get_page 601 10.7321
c01333dc free_hot_cold_page 641 2.7629
c0207130 __copy_to_user_ll 687 6.6058
c011450c flush_tlb_page 725 6.4732
c0139ba0 clear_page_tables 841 2.4735
c011718c pte_alloc_one 910 6.5000
c013b56c do_anonymous_page 954 1.7667
c013b788 do_no_page 1044 1.6519
c015b59c d_lookup 1096 3.2619
c013ba00 handle_mm_fault 1098 4.6525
c0108d14 system_call 1116 25.3636
c0137240 release_pages 1828 6.4366
c013a1f4 zap_pte_range 2616 4.8806
c013f5c0 page_add_rmap 2776 8.3614
c0139eac copy_page_range 2994 3.5643
c013f70c page_remove_rmap 3132 6.2640
c013adb4 do_wp_page 6712 8.4322
c01172e0 do_page_fault 8788 7.7496
c0106ed8 poll_idle 99878 1189.0238
00000000 total 158601 0.0869

Note one second spent in pte_alloc_one().


Here is 2.4.21-pre4, with the following functions uninlined

pte_t *pte_alloc_one(struct mm_struct *mm, unsigned long address);
pte_t *pte_alloc_one_fast(struct mm_struct *mm, unsigned long address);
void pte_free_fast(pte_t *pte);
void pte_free_slow(pte_t *pte);

c0252950 atomic_dec_and_lock 36 0.4800
c0111778 flush_tlb_mm 37 0.3304
c0129c3c file_read_actor 37 0.2569
c025282c strnlen_user 43 0.5119
c012b35c generic_file_write 46 0.0283
c0114c78 schedule 48 0.0361
c0129050 unlock_page 53 0.4907
c0140974 link_path_walk 57 0.0237
c0116740 copy_mm 62 0.0852
c0130740 __free_pages_ok 62 0.0963
c0126afc handle_mm_fault 63 0.3424
c01254c0 __free_pte 67 0.8816
c0129198 __find_get_page 67 0.9853
c01309c4 rmqueue 70 0.1207
c011ae0c exit_notify 77 0.1075
c0149b34 d_lookup 81 0.2774
c0126874 do_anonymous_page 83 0.3517
c0126960 do_no_page 86 0.2087
c01117e8 flush_tlb_page 105 0.8750
c0106f54 system_call 138 2.4643
c01255c8 copy_page_range 197 0.4603
c0130ffc __free_pages 204 5.6667
c0125774 zap_page_range 262 0.3104
c0126330 do_wp_page 775 1.4904
c0113c18 do_page_fault 864 0.7030
c01052f8 poll_idle 6803 170.0750
00000000 total 11923 0.0087

Note the lack of pte_alloc_one_slow().

So we need the page table cache back.

We cannot put it in slab, because slab does not do highmem.

I believe the best way to solve this is to implement a per-cpu LIFO head
array of known-to-be-zeroed pages in the page allocator. Populate it with
free_zeroed_page(), grab pages from it with __GFP_ZEROED.

This is a simple extension to the existing hot and cold head arrays, and I
have patches, and they don't work. Something in the pagetable freeing path
seems to be putting back pages which are not fully zeroed, and I didn't get
onto debugging it.

It would be nice to get it going, because a number of architectures can
perhaps nuke their private pagetable caches.

I shall drop the patches in next-mm/experimental and look hopefully
at Dave ;)

2003-02-23 05:12:24

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, 22 Feb 2003 20:17:24 EST, Benjamin LaHaise wrote:
> On Sat, Feb 22, 2003 at 02:18:20PM -0800, William Lee Irwin III wrote:
> > I'm not sure what's so nice about x86-64; another opcode prefix
> > controlled extension atop the festering pile of existing x86 crud
>
> What's nice about x86-64 is that it runs existing 32 bit apps fast and
> doesn't suffer from the blisteringly small caches that were part of your
> rant. Plus, x86-64 binaries are not horrifically bloated like ia64.
> Not to mention that the amount of reengineering in compilers like
> gcc required to get decent performance out of it is actually sane.

Four or five years ago the claim was that IA64 would solve all the large
memory problems. Commercial viability and substantial market presence
are still lacking. x86-64 has the same uphill battle. It has a better
architecture for highmem and a potentially better architecture for large
systems in general (compared to IA32; not substantially better than, say,
IA64 or PPC64). It also has at least one manufacturer looking at high
end systems. But until those systems have some recognized market share,
the boys with the big pockets aren't likely to make them ubiquitous.
The expense of design and development combined with the
ROI model has more influence on their deployment than the fact that it
is technically a useful architecture.

gerrit

2003-02-23 07:51:36

by Larry McVoy

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, Feb 22, 2003 at 04:09:15PM -0800, Gerrit Huizenga wrote:
> On Sat, 22 Feb 2003 15:28:59 PST, Larry McVoy wrote:
> > On Sat, Feb 22, 2003 at 02:17:39PM -0800, William Lee Irwin III wrote:
> > > On Sat, Feb 22, 2003 at 05:06:27PM -0500, Mark Hahn wrote:
> > > > ccNUMA worst-case latencies are not much different from decent
> > > > cluster (message-passing) latencies.
> > >
> > > Not even close, by several orders of magnitude.
> >
> > Err, I think you're wrong. It's been a long time since I looked, but I'm
> > pretty sure myrinet had single digit microseconds. Yup, google rocks,
> > 7.6 usecs, user to user. Last I checked, Sequent's worst case was around
> > there, right?
>
> You are going to drag 1994 technology into this to compare against
> something in 2003? Hmm. You might win on that comparison. But yeah,
> Sequent way back then was in that ballpark. World has moved forwards
> since then...

Really? "Several orders of magnitude"? Show me the data.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-23 07:58:53

by David Lang

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, 22 Feb 2003, Gerrit Huizenga wrote:

> On Sat, 22 Feb 2003 20:17:24 EST, Benjamin LaHaise wrote:
> > On Sat, Feb 22, 2003 at 02:18:20PM -0800, William Lee Irwin III wrote:
> > > I'm not sure what's so nice about x86-64; another opcode prefix
> > > controlled extension atop the festering pile of existing x86 crud
> >
> > What's nice about x86-64 is that it runs existing 32 bit apps fast and
> > doesn't suffer from the blisteringly small caches that were part of your
> > rant. Plus, x86-64 binaries are not horrifically bloated like ia64.
> > Not to mention that the amount of reengineering in compilers like
> > gcc required to get decent performance out of it is actually sane.
>
> Four or five years ago the claim was that IA64 would solve all the large
> memory problems. Commercial viability and substantial market presence
> are still lacking. x86-64 has the same uphill battle. It has a better
> architecture for highmem and a potentially better architecture for large
> systems in general (compared to IA32; not substantially better than, say,
> IA64 or PPC64). It also has at least one manufacturer looking at high
> end systems. But until those systems have some recognized market share,
> the boys with the big pockets aren't likely to make them ubiquitous.
> The expense of design and development combined with the
> ROI model has more influence on their deployment than the fact that it
> is technically a useful architecture.

Gerrit, you missed the prior poster's point. IA64 has the same fundamental
problem as the Alpha, PPC, and Sparc processors: it doesn't run x86
binaries.

The 8086/8088 CPU was nothing special when it was picked to be used in the
IBM PC, but once it was picked it hit a critical mass that has meant that
compatibility with it is critical for a new CPU. The 286 and 386 CPUs were
arguably inferior to other options available at the time, but they had one
feature that absolutely trumped everything else: they could run existing
programs, with no modifications, faster than anything else available. With
the IA64, Intel forgot this (or decided their name value was so high that
they were immune to the issue). x86-64 takes the same approach that the 286
and 386 did, and will be used by people who couldn't care less about 64-bit
stuff simply because it looks to be the fastest x86 CPU available (and if
the SMP features work as advertised it will again give a big boost to the
price/performance of SMP machines, due to much cheaper motherboard designs).
If it were being marketed by Intel it would be a shoo-in, but AMD does have
a bit of an uphill struggle.

David Lang

2003-02-23 07:57:14

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, Feb 22, 2003 at 04:09:15PM -0800, Gerrit Huizenga wrote:
>> You are going to drag 1994 technology into this to compare against
>> something in 2003? Hmm. You might win on that comparison. But yeah,
>> Sequent way back then was in that ballpark. World has moved forwards
>> since then...

On Sun, Feb 23, 2003 at 12:01:43AM -0800, Larry McVoy wrote:
> Really? "Several orders of magnitude"? Show me the data.

I was assuming ethernet when I said that.


-- wli

2003-02-23 08:11:29

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, Feb 23, 2003 at 12:07:50AM -0800, David Lang wrote:
> Gerrit, you missed the prior poster's point. IA64 has the same fundamental
> problem as the Alpha, PPC, and Sparc processors: it doesn't run x86
> binaries.

If I didn't know this mattered I wouldn't bother with the barfbags.
I just wouldn't deal with it.


-- wli

2003-02-23 09:28:51

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, Feb 22, 2003 at 02:18:20PM -0800, William Lee Irwin III wrote:
>> I'm not sure what's so nice about x86-64; another opcode prefix
>> controlled extension atop the festering pile of existing x86 crud

On Sat, Feb 22, 2003 at 08:17:24PM -0500, Benjamin LaHaise wrote:
> What's nice about x86-64 is that it runs existing 32 bit apps fast and
> doesn't suffer from the blisteringly small caches that were part of your
> rant. Plus, x86-64 binaries are not horrifically bloated like ia64.
> Not to mention that the amount of reengineering in compilers like
> gcc required to get decent performance out of it is actually sane.

Rant? It was just a catalogue of other things that are nasty. The
point was that PAE's not special, it's one of a very long list of
very ugly uglinesses, and my list wasn't anywhere near exhaustive.
But yes, more cache is good. Unfortunately the amount of baggage from
32-bit x86 stuff still puts a good chunk of systems programming into
the old bring your own barfbag territory.


On Sat, Feb 22, 2003 at 02:18:20PM -0800, William Lee Irwin III wrote:
>> sounds every bit as bad as any other attempt to prolong x86. Some of
>> the system device -level cleanups like the HPET look nice, though.

On Sat, Feb 22, 2003 at 08:17:24PM -0500, Benjamin LaHaise wrote:
> HPET is part of one of the PCYY specs and is even available on 32-bit x86;
> there are just not that many bug-free implementations yet. Since x86-64 made
> it part of the base platform and is testing it from launch, they actually
> have a chance at being debugged in the mass-market versions.

Well, it beats the heck out of the TSC and the PIT, and x86-64 is
apparently supposed to have it "for real".

I'm not excited at all about another opcode prefix and pagetable format.


-- wli

2003-02-23 11:12:58

by Magnus Danielson

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

From: "Martin J. Bligh" <[email protected]>
Subject: Re: Minutes from Feb 21 LSE Call
Date: Sat, 22 Feb 2003 16:50:36 -0800

> > On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> >> ia32 big iron. sigh. I think that's so unfortunate in a number
> >> of ways, but the main reason, of course, is that highmem is evil :)
>
> One phrase ... "price:performance ratio". That's all it's about.
> The only thing that will kill 32-bit big iron is the availability of
> cheap 64 bit chips. It's a free-market economy.
>
> It's ugly to program, but it's cheap, and it works.

Not all heavy-duty problems need 64 bits; many fit nicely into 32 bits.
Different 32-bit architectures fit them more or less well, though,
and SIMD may or may not give the same boost as 64 bits itself.
It is just like clustering vs. SMP: it depends on the application.

Cheers,
Magnus

2003-02-23 14:19:26

by Rik van Riel

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, 22 Feb 2003, Eric W. Biederman wrote:

> Note: rmap chains can be restricted to an arbitrary length, or an
> arbitrary total count, trivially. All you have to do is enforce a fixed
> limit on the number of people who can map a page simultaneously.
>
> The selection of which chain to unmap can be a bit tricky but is
> relatively straightforward. Why doesn't someone who is seeing
> this just hack this up?

I'm not sure how useful this feature would be. Also,
there are a bunch of corner cases in which you cannot
limit the number of processes mapping a page; think
of e.g. mlock, nonlinear vmas and anonymous memory.

All in all I suspect that the cost of such a feature
might be higher than any benefits.

cheers,

Rik
--
Engineers don't grow up, they grow sideways.
http://www.surriel.com/ http://kernelnewbies.org/

2003-02-23 16:05:24

by Martin J. Bligh

[permalink] [raw]
Subject: object-based rmap and pte-highmem

> So who stole the remaining 1.85 seconds? Looks like pte_highmem.

I have a plan for that (UKVA) ... we reserve a per-process area with
kernel type protections (either at the top of user space, changing
permissions appropriately, or inside kernel space, changing per-process
vs global appropriately).

This area is permanently mapped into each process, so that there's no
kmap_atomic / tlb_flush_one overhead ... it's highmem backed still.
In order to do fork efficiently, we may need space for 2 sets of
pagetables (12Mb on PAE).

Dave McCracken had an earlier implementation of that, but we never saw
an improvement (quite possibly because the fork double-space wasn't
there) - Dave Hansen is now trying to get something working with current
kernels ... will let you know.

M.


2003-02-23 17:18:17

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

Rik van Riel <[email protected]> writes:

> On Sat, 22 Feb 2003, Eric W. Biederman wrote:
>
> > Note: rmap chains can be restricted to an arbitrary length, or an
> > arbitrary total count, trivially. All you have to do is enforce a fixed
> > limit on the number of people who can map a page simultaneously.
> >
> > The selection of which chain to unmap can be a bit tricky but is
> > relatively straightforward. Why doesn't someone who is seeing
> > this just hack this up?
>
> I'm not sure how useful this feature would be.

The problem: there is no upper bound on how many rmap
entries there can be at one time, and the unbounded
growth can overwhelm a machine.

The goal is to provide an overall system cap on the number
of rmap entries.

> Also,
> there are a bunch of corner cases in which you cannot
> limit the number of processes mapping a page, think
> about eg. mlock, nonlinear vmas and anonymous memory.

Unless something has changed, for nonlinear vmas and anonymous
memory we have been storing enough information in the page tables
to recover the page for ages.

For mlock we want a cap on the number of pages that are locked,
so it should not be a problem. But even then we don't have to
guarantee the page is constantly in the process's page table, simply
that the mlocked page is never swapped out.

> All in all I suspect that the cost of such a feature
> might be higher than any benefits.

Cost? What Cost?

The simple implementation is to walk the page lists and unmap
the pages that are least likely to be used next.

This is not something new. We have been doing this in 2.4.x and
before for years; back then it just prepared pages to be paged out
without also freeing up rmap entries.

Eric

2003-02-23 19:03:03

by David Mosberger

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

>>>>> On Sun, 23 Feb 2003 00:07:50 -0800 (PST), David Lang <[email protected]> said:

David.L> Gerrit, you missed the prior poster's point. IA64 has the
David.L> same fundamental problem as the Alpha, PPC, and Sparc
David.L> processors: it doesn't run x86 binaries.

This simply isn't true. Itanium and Itanium 2 have full x86 hardware
built into the chip (for better or worse ;-). The speed isn't as good
as the fastest x86 chips today, but it's faster (~300MHz P6) than the
PCs many of us are using and it certainly meets my needs better than
any other x86 "emulation" I have used in the past (which includes
FX!32 and its relatives for Alpha).

--david

2003-02-23 19:12:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

In article <[email protected]>,
William Lee Irwin III <[email protected]> wrote:
>On Sun, Feb 23, 2003 at 12:07:50AM -0800, David Lang wrote:
>> Gerrit, you missed the prior poster's point. IA64 has the same fundamental
>> problem as the Alpha, PPC, and Sparc processors: it doesn't run x86
>> binaries.
>
>If I didn't know this mattered I wouldn't bother with the barfbags.
>I just wouldn't deal with it.

Why?

The x86 is a hell of a lot nicer than the ppc32, for example. On the
x86, you get good performance and you can ignore the design mistakes (ie
segmentation) by just basically turning them off.

On the ppc32, the MMU braindamage is not something you can ignore, you
have to write your OS for it and if you turn it off (ie enable soft-fill
on the ones that support it) you now have to have separate paths in the
OS for it.

And the baroque instruction encoding on the x86 is actually a _good_
thing: it's a rather dense encoding, which means that you win on icache.
It's a bit hard to decode, but who cares? Existing chips do well at
decoding, and thanks to the icache win they tend to perform better - and
they load faster too (which is important - you can make your CPU have
big caches, but _nothing_ saves you from the cold-cache costs).

The low register count isn't an issue when you code in any high-level
language, and it has actually forced x86 implementors to do a hell of a
lot better job than the competition when it comes to memory loads and
stores - which helps in general. While the RISC people were off trying
to optimize their compilers to generate loops that used all 32 registers
efficiently, the x86 implementors instead made the chip run fast on
varied loads and used tons of register renaming hardware (and looking at
_memory_ renaming too).

IA64 made all the mistakes anybody else did, and threw out all the good
parts of the x86 because people thought those parts were ugly. They
aren't ugly, they're the "charming oddity" that makes it do well. Look
at them the right way and you realize that a lot of the grottyness is
exactly _why_ the x86 works so well (yeah, and the fact that they are
everywhere ;).

The only real major failure of the x86 is the PAE crud. Let's hope
we'll get to forget it, the same way the DOS people eventually forgot
about their memory extenders.

(Yeah, and maybe IBM will make their ppc64 chips cheap enough that they
will matter, and people can overlook the grottiness there. Right now
Intel doesn't even seem to be interested in "64-bit for the masses", and
maybe IBM will be. AMD certainly seems to be serious about the "masses"
part, which in the end is the only part that really matters).

Linus

2003-02-23 19:15:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: object-based rmap and pte-highmem

In article <11090000.1046016895@[10.10.2.4]>,
Martin J. Bligh <[email protected]> wrote:
>> So who stole the remaining 1.85 seconds? Looks like pte_highmem.
>
>I have a plan for that (UKVA) ... we reserve a per-process area with
>kernel type protections (either at the top of user space, changing
>permissions appropriately, or inside kernel space, changing per-process
>vs global appropriately).

Nobody ever seems to have solved the threading impact of UKVA's. I told
Andrea about it almost a year ago, and his reaction was "oh, duh!" and
couldn't come up with a solution either.

The thing is, you _cannot_ have a per-thread area, since all threads
share the same TLB. And if it isn't per-thread, you still need all the
locking and all the scalability stuff that the _current_ pte_highmem
code needs, since there are people with thousands of threads in the same
process.

Until somebody _addresses_ this issue with UKVA, I consider UKVA to be a
pipe-dream of people who haven't thought it through.

Linus

2003-02-23 19:19:28

by David Mosberger

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

>>>>> On Sun, 23 Feb 2003 19:17:30 +0000 (UTC), [email protected] (Linus Torvalds) said:

Linus> Look at them the right way and you realize that a lot of the
Linus> grottyness is exactly _why_ the x86 works so well (yeah, and
Linus> the fact that they are everywhere ;).

But does x86 really work so well? Itanium 2 on 0.13um performs a lot
better than P4 on 0.13um. As far as I can guess, the only reason P4
comes out on 0.13um (and 0.09um) before anything else is due to the
latter part you mention: it's where the volume is today.

--david

2003-02-23 19:44:50

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

"Martin J. Bligh" <[email protected]> writes:

> > On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> >> ia32 big iron. sigh. I think that's so unfortunate in a number
> >> of ways, but the main reason, of course, is that highmem is evil :)
>
> One phrase ... "price:performance ratio". That's all it's about.
> The only thing that will kill 32-bit big iron is the availability of
> cheap 64 bit chips. It's a free-market economy.
>
> It's ugly to program, but it's cheap, and it works.

I guess ugly to program is in the eye of the beholder. The big platforms
have always seemed much worse to me: every box feels free to
change things in arbitrary ways for no good reason, or the OS and
other low-level software must know exactly which motherboard they are
running on to work properly.

Gratuitous incompatibilities are the ugliest thing I have ever seen,
far uglier than the warts a real platform accumulates because it
is designed to actually be used.

Eric

2003-02-23 20:03:06

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

> Linus> Look at them the right way and you realize that a lot of the
> Linus> grottyness is exactly _why_ the x86 works so well (yeah, and
> Linus> the fact that they are everywhere ;).
>
> But does x86 reall work so well? Itanium 2 on 0.13um performs a lot
> better than P4 on 0.13um. As far as I can guess, the only reason P4
> comes out on 0.13um (and 0.09um) before anything else is due to the
> latter part you mention: it's where the volume is today.

Care to share those impressive benchmark numbers (for macro-benchmarks)?
Would be interesting to see the difference, and where it wins.

Thanks,

M

2003-02-23 20:06:26

by Martin J. Bligh

[permalink] [raw]
Subject: Re: object-based rmap and pte-highmem

>> I have a plan for that (UKVA) ... we reserve a per-process area with
>> kernel type protections (either at the top of user space, changing
>> permissions appropriately, or inside kernel space, changing per-process
>> vs global appropriately).
>
> Nobody ever seems to have solved the threading impact of UKVA's. I told
> Andrea about it almost a year ago, and his reaction was "oh, duh!" and
> couldn't come up with a solution either.
>
> The thing is, you _cannot_ have a per-thread area, since all threads
> share the same TLB. And if it isn't per-thread, you still need all the
> locking and all the scalability stuff that the _current_ pte_highmem
> code needs, since there are people with thousands of threads in the same
> process.
>
> Until somebody _addresses_ this issue with UKVA, I consider UKVA to be a
> pipe-dream of people who haven't thought it through.

I don't see why that's an issue - the pagetables are per-process, not
per-thread.

Yes, that was a stalling point for sticking kmap in there, which was
amongst my original plotting for it, but the stuff that's per-process
still works.

I'm not suggesting kmapping them dynamically (though it's rather like
permanent kmap), I'm suggesting making enough space so we have them all
there for each process all the time. None of this tiny little window
shifting around stuff ...

M.

2003-02-23 20:11:47

by Xavier Bestel

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, 23 Feb 2003 at 20:17, Linus Torvalds wrote:

> And the baroque instruction encoding on the x86 is actually a _good_
> thing: it's a rather dense encoding, which means that you win on icache.
> It's a bit hard to decode, but who cares? Existing chips do well at
> decoding, and thanks to the icache win they tend to perform better - and
> they load faster too (which is important - you can make your CPU have
> big caches, but _nothing_ saves you from the cold-cache costs).

Next step: hardware gzip ?

2003-02-23 20:24:11

by Alan

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, 2003-02-23 at 20:21, Xavier Bestel wrote:
> > they load faster too (which is important - you can make your CPU have
> > big caches, but _nothing_ saves you from the cold-cache costs).
>
> Next step: hardware gzip ?

gzip doesn't work because it's not unpackable from an arbitrary point. x86
in many ways is compressed, with common codes carefully bitpacked. A
horrible CISC design constraint for size has come full circle and turned
into a very nice memory/cache optimisation.

2003-02-23 20:38:26

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, 23 Feb 2003 00:07:50 PST, David Lang wrote:
> Gerrit, you missed the prior poster's point. IA64 has the same fundamental
> problem as the Alpha, PPC, and Sparc processors: it doesn't run x86
> binaries.

IA64 *can* run IA32 binaries, just more slowly than native IA64 code.

gerrit

2003-02-23 20:40:54

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

>> And the baroque instruction encoding on the x86 is actually a _good_
>> thing: it's a rather dense encoding, which means that you win on icache.
>> It's a bit hard to decode, but who cares? Existing chips do well at
>> decoding, and thanks to the icache win they tend to perform better - and
>> they load faster too (which is important - you can make your CPU have
>> big caches, but _nothing_ saves you from the cold-cache costs).
>
> Next step: hardware gzip ?

They did that already ... IBM were demonstrating such a thing a couple of
years ago. Don't see it helping with icache though, as it unpacks between
memory and the processor, IIRC.

M.

2003-02-23 21:03:37

by John Bradford

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

> >If I didn't know this mattered I wouldn't bother with the barfbags.
> >I just wouldn't deal with it.
>
> Why?
>
> The x86 is a hell of a lot nicer than the ppc32, for example. On the
> x86, you get good performance and you can ignore the design mistakes (ie
> segmentation) by just basically turning them off.

I could be wrong, but I always thought that Sparc, and a lot of other
architectures could mark arbitrary areas of memory, (such as the
stack), as non-executable, whereas x86 only lets you have one
non-executable segment.

John.

2003-02-23 21:30:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: object-based rmap and pte-highmem


On Sun, 23 Feb 2003, Martin J. Bligh wrote:
> >
> > The thing is, you _cannot_ have a per-thread area, since all threads
> > share the same TLB. And if it isn't per-thread, you still need all the
> > locking and all the scalability stuff that the _current_ pte_highmem
> > code needs, since there are people with thousands of threads in the same
> > process.
>
> I don't see why that's an issue - the pagetables are per-process, not
> per-thread.

Exactly. Which means that UKVA has all the same problems as the current
global map.

There are _NO_ differences. Any problems you have with the current global
map you would have with UKVA in threads. So I don't see what you expect to
win from UKVA.

> Yes, that was a stalling point for sticking kmap in there, which was
> amongst my original plotting for it, but the stuff that's per-process
> still works.

Exactly what _is_ "per-process"? The only thing that is per-process is
stuff that is totally local to the VM, by the linux definition.

And the rmap stuff certainly isn't "local to the VM". Yes, it is torn down
and built up by the VM, but it needs to be traversed by global code.

Linus

2003-02-23 21:27:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call


On Sun, 23 Feb 2003, David Mosberger wrote:
>
> But does x86 really work so well? Itanium 2 on 0.13um performs a lot
> better than P4 on 0.13um.

On WHAT benchmark?

Itanium 2 doesn't hold a candle to a P4 on any real-world benchmarks.

As far as I know, the _only_ things Itanium 2 does better on are (a) FP
kernels, partly due to a huge cache, and (b) big databases, entirely
because the P4 is crippled when it comes to large memory, since Intel
refuses to do a 64-bit version (because they know it would totally kill
ia-64).

Last I saw P4 was kicking ia-64 butt on specint and friends.

That's also ignoring the fact that ia-64 simply CANNOT DO the things a P4
does every single day. You can't put an ia-64 in a reasonable desktop
machine, partly because of pricing, but partly because it would just suck
so horribly at things people expect not to suck (games spring to mind).

And I further bet that using a native distribution (ie totally ignoring
the power and price and bad x86 performance issues), ia-64 will work a lot
worse for people simply because the binaries are bigger. That was quite
painful on alpha, and ia-64 is even worse - to offset the bigger binaries,
you need a faster disk subsystem etc just to not feel slower than a
bog-standard PC.

Code size matters. Price matters. Real world matters. And ia-64 at least
so far falls flat on its face on ALL of these.

> As far as I can guess, the only reason P4
> comes out on 0.13um (and 0.09um) before anything else is due to the
> latter part you mention: it's where the volume is today.

It's where all the money is ("ia-64: 5 billion dollars in the red and
still sinking") so of _course_ it's where the efforts get put.

Linus

2003-02-23 21:34:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call


On 23 Feb 2003, Xavier Bestel wrote:
> On Sun 23/02/2003 at 20:17, Linus Torvalds wrote:
>
> > And the baroque instruction encoding on the x86 is actually a _good_
> > thing: it's a rather dense encoding, which means that you win on icache.
> > It's a bit hard to decode, but who cares? Existing chips do well at
> > decoding, and thanks to the icache win they tend to perform better - and
> > they load faster too (which is important - you can make your CPU have
> > big caches, but _nothing_ saves you from the cold-cache costs).
>
> Next step: hardware gzip ?

Not gzip, no. It needs to be a random-access compression with reasonably
small blocks, not something designed for streaming. Which makes it harder
to do right and efficiently.

But ARM has Thumb (not the same thing, but same idea), and at least some
PPC chips have a page-based compressor - IBM calls it "CodePack" in case
you want to google for it.

Linus

2003-02-23 21:38:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call


On Sun, 23 Feb 2003, John Bradford wrote:
>
> I could be wrong, but I always thought that Sparc, and a lot of other
> architectures could mark arbitrary areas of memory, (such as the
> stack), as non-executable, whereas x86 only lets you have one
> non-executable segment.

The x86 has that stupid "executablility is tied to a segment" thing, which
means that you cannot make things executable on a page-per-page level.
It's a mistake, but it's one that _could_ be fixed in the architecture if
it really mattered, the same way the WP bit got fixed in the i486.

I'm definitely not saying that the x86 is perfect. It clearly isn't. But a
lot of people complain about the wrong things, and a lot of people who
tried to "fix" things just made them worse by throwing out the good parts
too.

Linus

2003-02-23 21:46:14

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
>> If I didn't know this mattered I wouldn't bother with the barfbags.
>> I just wouldn't deal with it.

On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> The x86 is a hell of a lot nicer than the ppc32, for example. On the
> x86, you get good performance and you can ignore the design mistakes (ie
> segmentation) by just basically turning them off.

We "basically" turn it off, but I was recently reminded that it exists,
as LDTs are apparently wanted by something in userspace. There seem to be
various other unwelcome reminders floating around performance-critical
paths as well.

I vaguely remember segmentation being the only way to enforce
execution permissions for mmap(), which we just don't bother doing.


On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> On the ppc32, the MMU braindamage is not something you can ignore, you
> have to write your OS for it and if you turn it off (ie enable soft-fill
> on the ones that support it) you now have to have separate paths in the
> OS for it.

The hashtables don't bother me very much. They can relatively easily
be front-ended by radix tree pagetables anyway, and if it sucks, well,
no software in the world can save sucky hardware. Hopefully later models
fix it to be fast or disablable. I'm more bothered by x86 lacking ASN's.


On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> And the baroque instruction encoding on the x86 is actually a _good_
> thing: it's a rather dense encoding, which means that you win on icache.
> It's a bit hard to decode, but who cares? Existing chips do well at
> decoding, and thanks to the icache win they tend to perform better - and
> they load faster too (which is important - you can make your CPU have
> big caches, but _nothing_ saves you from the cold-cache costs).

I'm not so sure; between things like cacheline-aligning branch targets
and space/time tradeoffs where smaller instructions run slower than
large sequences of instructions, this stuff gets pretty strange. It
still comes out smaller in the end, but by a smaller-than-expected though
probably still significant margin. There's a good chunk of the
instruction set that should probably just be dumped outright, too.


On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> The low register count isn't an issue when you code in any high-level
> language, and it has actually forced x86 implementors to do a hell of a
> lot better job than the competition when it comes to memory loads and
> stores - which helps in general. While the RISC people were off trying
> to optimize their compilers to generate loops that used all 32 registers
> efficiently, the x86 implementors instead made the chip run fast on
> varied loads and used tons of register renaming hardware (and looking at
> _memory_ renaming too).

Invariably we get stuck diving into assembly anyway. =)

This one is basically me getting irked by looking at disassemblies of
random x86 binaries and seeing vast amounts of register spilling. It's
probably not a performance issue aside from code bloat esp. given the
amount of trickery with the weird L1 cache stack magic and so on.


On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> IA64 made all the mistakes anybody else did, and threw out all the good
> parts of the x86 because people thought those parts were ugly. They
> aren't ugly, they're the "charming oddity" that makes it do well. Look
> at them the right way and you realize that a lot of the grottyness is
> exactly _why_ the x86 works so well (yeah, and the fact that they are
> everywhere ;).

Count me as "not charmed". We've actually tripped over this stuff, and
for the most part you've been personally squashing the super low-level
bugs like the NT flag business and vsyscall segmentation oddities.

IA64 suffers from truly excessive featuritis, and there are relatively
good chances that some (or all) of its features will be every bit as
unused and hated as segmentation, if it actually survives.


On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> The only real major failure of the x86 is the PAE crud. Let's hope
> we'll get to forget it, the same way the DOS people eventually forgot
> about their memory extenders.

We've not really been able to forget about segments or ISA DMA...
The pessimist in me has more or less already resigned me to PAE as
a fact of life.


On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> (Yeah, and maybe IBM will make their ppc64 chips cheap enough that they
> will matter, and people can overlook the grottiness there. Right now
> Intel doesn't even seem to be interested in "64-bit for the masses", and
> maybe IBM will be. AMD certainly seems to be serious about the "masses"
> part, which in the end is the only part that really matters).

ppc64 is sane in my book (not vendor nepotism; the other "vanilla RISC"
machines get the same rating). No idea about the marketing stuff.


-- wli

2003-02-23 21:51:06

by David Mosberger

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

>>>>> On Sun, 23 Feb 2003 12:13:00 -0800, "Martin J. Bligh" <[email protected]> said:

Linus> Look at them the right way and you realize that a lot of the
Linus> grottyness is exactly _why_ the x86 works so well (yeah, and
Linus> the fact that they are everywhere ;).

>> But does x86 really work so well? Itanium 2 on 0.13um performs a
>> lot better than P4 on 0.13um. As far as I can guess, the only
>> reason P4 comes out on 0.13um (and 0.09um) before anything else
>> is due to the latter part you mention: it's where the volume is
>> today.

Martin> Care to share those impressive benchmark numbers (for
Martin> macro-benchmarks)? Would be interesting to see the
Martin> difference, and where it wins.

You can do it two ways: you can look at the numbers Intel has publicly
projected for Madison, or you can compare McKinley with a 0.18um Pentium 4.

--david

2003-02-23 21:57:41

by Martin J. Bligh

[permalink] [raw]
Subject: Re: pte-highmem vs UKVA (was: object-based rmap and pte-highmem)

>> > The thing is, you _cannot_ have a per-thread area, since all threads
>> > share the same TLB. And if it isn't per-thread, you still need all the
>> > locking and all the scalability stuff that the _current_ pte_highmem
>> > code needs, since there are people with thousands of threads in the
>> > same process.
>>
>> I don't see why that's an issue - the pagetables are per-process, not
>> per-thread.
>
> Exactly. Which means that UKVA has all the same problems as the current
> global map.
>
> There are _NO_ differences. Any problems you have with the current global
> map you would have with UKVA in threads. So I don't see what you expect
> to win from UKVA.

This is just for PTEs ... for which at the moment we have two choices:
1. Stick them in lowmem (fills up the global space too much).
2. Stick them in highmem - too much overhead doing k(un)map_atomic
as measured by both myself and Andrew.

Using UKVA for PTEs seems to me a better way to implement pte-highmem.
If you're walking another process's pagetables, you just kmap them as now,
but I think this will avoid most of the kmapping (if we have space for two
sets of pagetables, so we can do a little bit of trickery at fork time).

>> Yes, that was a stalling point for sticking kmap in there, which was
>> amongst my original plotting for it, but the stuff that's per-process
>> still works.
>
> Exactly what _is_ "per-process"? The only thing that is per-process is
> stuff that is totally local to the VM, by the linux definition.

The pagetables.

> And the rmap stuff certainly isn't "local to the VM". Yes, it is torn
> down and built up by the VM, but it needs to be traversed by global code.

Sorry, subject was probably misleading ... I'm just talking about the
PTEs here, not sticking anything to do with rmap into UKVA.

Partially object-based rmap is cool for other reasons, that have little to
do with this. ;-)

M.

2003-02-23 22:01:22

by William Lee Irwin III

[permalink] [raw]
Subject: Re: pte-highmem vs UKVA (was: object-based rmap and pte-highmem)

On Sun, Feb 23, 2003 at 02:07:42PM -0800, Martin J. Bligh wrote:
> Using UKVA for PTEs seems to be a better way to implement pte-highmem to me.
> If you're walking another processes' pagetables, you just kmap them as now,
> but I think this will avoid most of the kmap'ing (if we have space for two
> sets of pagetables so we can do a little bit of trickery at fork time).

Another term for "UKVA for pagetables only" is "recursive pagetables",
if this helps clarify anything.


-- wli

2003-02-23 22:02:14

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

> >> But does x86 really work so well? Itanium 2 on 0.13um performs a
> >> lot better than P4 on 0.13um. As far as I can guess, the only
> >> reason P4 comes out on 0.13um (and 0.09um) before anything else
> >> is due to the latter part you mention: it's where the volume is
> >> today.
>
> Martin> Care to share those impressive benchmark numbers (for
> Martin> macro-benchmarks)? Would be interesting to see the
> Martin> difference, and where it wins.
>
> You can do it two ways: you can look at the numbers Intel has publicly
> projected for Madison, or you can compare McKinley with a 0.18um Pentium 4.

Ummm ... I'm not exactly happy working with Intel's own projections on the
performance of their Itanium chips ... seems a little unscientific ;-)

Presumably when you said "Itanium 2 on 0.13um performs a lot better than P4
on 0.13um." you were referring to some benchmarks you have the results of?

If you can't publish them, fair enough. But if you can, I'd love to see how
it compares ... Itanium seems to be "more interesting" nowadays, though I
can't say I'm happy about the complexity of it.

M.

2003-02-23 22:30:35

by David Mosberger

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

>>>>> On Sun, 23 Feb 2003 13:34:32 -0800 (PST), Linus Torvalds <[email protected]> said:

Linus> Last I saw P4 was kicking ia-64 butt on specint and friends.

I don't think so. According to Intel [1], the highest clock frequency
for a 0.18um part is 2GHz (both for Xeon and P4; for Xeon MP it's
1.5GHz). The highest reported SPECint for a 2GHz Xeon seems to be 701
[2]. In comparison, a 1GHz McKinley gets a SPECint of 810 [3].

--david

[1] http://www.intel.com/support/processors/xeon/corespeeds.htm
[2] http://www.specbench.org/cpu2000/results/res2002q1/cpu2000-20020128-01232.html
[3] http://www.specbench.org/cpu2000/results/res2002q3/cpu2000-20020711-01469.html

2003-02-23 22:39:45

by David Lang

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

I would call a 15% lead over the ia64 pretty substantial.

Yes, it's not the same clock speed, but if that's the clock speed they can
achieve on that process, it's equivalent. The P4 covers a LOT of sins by
ratcheting up its speed; what matters is the final capability, not the
capability/clock (if capability/clock were what mattered, the AMD chips
would have put Intel out of business and the P4 would be as common as
ia-64).

David Lang


On Sun, 23 Feb 2003, David Mosberger wrote:

> Date: Sun, 23 Feb 2003 14:40:44 -0800
> From: David Mosberger <[email protected]>
> Reply-To: [email protected]
> To: Linus Torvalds <[email protected]>
> Cc: [email protected], [email protected]
> Subject: Re: Minutes from Feb 21 LSE Call
>
> >>>>> On Sun, 23 Feb 2003 13:34:32 -0800 (PST), Linus Torvalds <[email protected]> said:
>
> Linus> Last I saw P4 was kicking ia-64 butt on specint and friends.
>
> I don't think so. According to Intel [1], the highest clock frequency
> for a 0.18um part is 2GHz (both for Xeon and P4, for Xeon MP it's
> 1.5GHz). The highest reported SPECint for a 2GHz Xeon seems to be 701
> [2]. In comparison, a 1GHz McKinley gets a SPECint of 810 [3].
>
> --david
>
> [1] http://www.intel.com/support/processors/xeon/corespeeds.htm
> [2] http://www.specbench.org/cpu2000/results/res2002q1/cpu2000-20020128-01232.html
> [3] http://www.specbench.org/cpu2000/results/res2002q3/cpu2000-20020711-01469.html

2003-02-23 22:47:34

by David Lang

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

Yep, I reversed the numbers.

David Lang

On Sun, 23 Feb 2003, David Mosberger wrote:

> Date: Sun, 23 Feb 2003 14:54:12 -0800
> From: David Mosberger <[email protected]>
> Reply-To: [email protected]
> To: David Lang <[email protected]>
> Cc: [email protected], Linus Torvalds <[email protected]>,
> [email protected]
> Subject: Re: Minutes from Feb 21 LSE Call
>
> >>>>> On Sun, 23 Feb 2003 14:48:48 -0800 (PST), David Lang <[email protected]> said:
>
> David.L> I would call a 15% lead over the ia64 pretty substantial.
>
> Huh? Did you misread my mail?
>
> 2 GHz Xeon: 701 SPECint
> 1 GHz Itanium 2: 810 SPECint
>
> That is, Itanium 2 is 15% faster.
>
> --david
>

2003-02-23 22:44:07

by David Mosberger

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

>>>>> On Sun, 23 Feb 2003 14:48:48 -0800 (PST), David Lang <[email protected]> said:

David.L> I would call a 15% lead over the ia64 pretty substantial.

Huh? Did you misread my mail?

2 GHz Xeon: 701 SPECint
1 GHz Itanium 2: 810 SPECint

That is, Itanium 2 is 15% faster.

--david

2003-02-23 22:45:56

by Alan

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, 2003-02-23 at 20:50, Martin J. Bligh wrote:
> >> And the baroque instruction encoding on the x86 is actually a _good_
> >> thing: it's a rather dense encoding, which means that you win on icache.
> >> It's a bit hard to decode, but who cares? Existing chips do well at
> >> decoding, and thanks to the icache win they tend to perform better - and
> >> they load faster too (which is important - you can make your CPU have
> >> big caches, but _nothing_ saves you from the cold-cache costs).
> >
> > Next step: hardware gzip ?
>
> They did that already ... IBM were demonstrating such a thing a couple of
> years ago. Don't see it helping with icache though, as it unpacks between
> memory and the processory, IIRC.

I saw the L2/L3 compressed cache thing and thought "doh!", and I watched
and I've not seen it for a long time. What happened to it?

2003-02-23 22:56:55

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

> Linus> Last I saw P4 was kicking ia-64 butt on specint and friends.
>
> I don't think so. According to Intel [1], the highest clockfrequency
> for a 0.18um part is 2GHz (both for Xeon and P4, for Xeon MP it's
> 1.5GHz). The highest reported SPECint for a 2GHz Xeon seems to be 701
> [2]. In comparison, a 1GHz McKinley gets a SPECint of 810 [3].
>
> --david
>
> [1] http://www.intel.com/support/processors/xeon/corespeeds.htm
> [2] http://www.specbench.org/cpu2000/results/res2002q1/cpu2000-20020128-01232.html
> [3] http://www.specbench.org/cpu2000/results/res2002q3/cpu2000-20020711-01469.html

Got anything more real-world than SPECint type microbenchmarks?

M.

2003-02-23 23:18:46

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, Feb 23, 2003 at 11:13:03AM -0800, David Mosberger wrote:
> This simply isn't true. Itanium and Itanium 2 have full x86 hardware
> built into the chip (for better or worse ;-). The speed isn't as good
> as the fastest x86 chips today, but it's faster (~300MHz P6) than the

That hardly counts as reasonably performant: the slowest mainstream chips
from Intel and AMD are clocked well over 1 GHz. At least x86-64 will
improve the performance of the 32 bit databases people have already
invested large amounts of money in, and it will do so without the need
for a massive outlay of funds for a new 64 bit license. Why accept
more than 10x the cost to migrate to ia64 when a new x86-64 will improve
the speed of existing applications, and improve scalability with the
transparent addition of a 64 bit kernel?

-ben
--
Don't email: [email protected]

2003-02-23 23:16:26

by Bill Davidsen

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, 22 Feb 2003, Gerrit Huizenga wrote:

> On 22 Feb 2003 18:20:19 GMT, Alan Cox wrote:
> > I think people overestimate the number of large boxes badly. Several IDE
> > pre-patches didn't work on highmem boxes. It took *ages* for people to
> > actually notice there was a problem. The desktop world is still 128-256Mb
>
> IDE on big boxes? Is that crack I smell burning? A desktop with 4 GB
> is a fun toy, but bigger than *I* need, even for development purposes.
> But I don't think EMC, Clariion (low end EMC), Shark, etc. have any
> IDE products for my 8-proc 16 GB machine... And running pre-patches in
> a production environment that might expose this would be a little
> silly as well.

I don't disagree with most of your point, however there certainly are
legitimate uses for big boxes with small (IDE) disk. Those which first
come to mind are all computational problems, in which a small dataset is
read from disk and then processors beat on the data. More or less common
examples are graphics transformations (original and final data
compressed), engineering calculations such as finite element analysis,
rendering (raytracing) type calculations, and data analysis (things like
setiathome or automated medical image analysis).

IDE drives are very cost effective, and low cost motherboard RAID is
certainly useful for preserving the results of large calculations on small
(relatively) datasets.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2003-02-23 23:23:40

by Bill Davidsen

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, 22 Feb 2003, Ben Greear wrote:

> Mark Hahn wrote:

> > oh, come on. the issue is whether memory is fast and flat.
> > most "scalability" efforts are mainly trying to code around the fact
> > that any ccNUMA (and most 4-ways) is going to be slow/bumpy.
> > it is reasonable to worry that optimizations for imbalanced machines
> > will hurt "normal" ones. is it worth hurting uni by 5% to give
> > a 50% speedup to IBM's 32-way? I think not, simply because
> > low-end machines are more important to Linux.
> >
> > the best way to kill Linux is to turn it into an OS best suited
> > for $6+-digit machines.
>
> Linux has a key feature that most other OS's lack: It can (easily, and by all)
> be recompiled for a particular architecture. So, there is no particular reason why
> optimizing for a high-end system has to kill performance on uni-processor
> machines.

This is exactly correct, although building just the optimal kernel for a
machine is still somewhat an art rather than a science. You have to choose
the trade-offs carefully.

> For instance, don't locks simply get compiled away to nothing on
> uni-processor machines?

Preempt causes most of the issues of SMP with few of the benefits. There
are loads for which it's ideal, but for general use it may not be the
right feature. I ran it during the time when it was just a patch, but
lately I'm convinced it's for special occasions.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2003-02-23 23:31:35

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

>> For instance, don't locks simply get compiled away to nothing on
>> uni-processor machines?
>
> Preempt causes most of the issues of SMP with few of the benefits. There
> are loads for which it's ideal, but for general use it may not be the
> right feature, and I ran it during the time when it was just a patch, but
> lately I'm convinced it's for special occasions.

Note that preemption was pushed by the embedded people Larry was advocating
for, not the big-machine crowd .... ironic, eh?

M.

2003-02-23 23:54:39

by Bill Davidsen

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On 23 Feb 2003, Xavier Bestel wrote:

> On Sun 23/02/2003 at 20:17, Linus Torvalds wrote:
>
> > And the baroque instruction encoding on the x86 is actually a _good_
> > thing: it's a rather dense encoding, which means that you win on icache.
> > It's a bit hard to decode, but who cares? Existing chips do well at
> > decoding, and thanks to the icache win they tend to perform better - and
> > they load faster too (which is important - you can make your CPU have
> > big caches, but _nothing_ saves you from the cold-cache costs).
>
> Next step: hardware gzip ?

If the firmware issues were better defined in Intel ia32 chips, I could
see a gzip instruction pointing to blocks in memory. As a proof of
concept, not a big win.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2003-02-23 23:49:04

by David Mosberger

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

>>>>> On Sun, 23 Feb 2003 15:06:56 -0800, "Martin J. Bligh" <[email protected]> said:

Linus> Last I saw P4 was kicking ia-64 butt on specint and friends.
>> I don't think so. According to Intel [1], the highest
>> clock frequency for a 0.18um part is 2GHz (both for Xeon and P4,
>> for Xeon MP it's 1.5GHz). The highest reported SPECint for a
>> 2GHz Xeon seems to be 701 [2]. In comparison, a 1GHz McKinley
>> gets a SPECint of 810 [3].

Martin> Got anything more real-world than SPECint type
Martin> microbenchmarks?

SPECint a microbenchmark? You seem to be redefining the meaning of
the word (last time I checked, lmbench was a microbenchmark).

Ironically, Itanium 2 seems to do even better in the "real world" than
suggested by benchmarks, partly because of its large caches and memory
bandwidth and, I'm guessing, partly because of its straightforward
micro-architecture (e.g., a synchronization operation takes on the
order of 10 cycles, as compared to dozens or hundreds of cycles on the
Pentium 4).

BTW: I hope I don't sound too negative on the Pentium 4/Xeon. It's
certainly an excellent performer for many things. I just want to
point out that Itanium 2 also is a good performer, probably more so
than many on this list seem to be willing to give it credit for.

--david

2003-02-23 23:50:42

by Bill Davidsen

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, 22 Feb 2003, Larry McVoy wrote:

> > We would never try to propose such a change, and never have.
> > Name a scalability change that's hurt the performance of UP by 5%.
> > There isn't one.
>
> This is *exactly* the reasoning that every OS marketing weenie has used
> for the last 20 years to justify their "feature" of the week.
>
> The road to slow bloated code is paved one cache miss at a time. You
> may quote me on that. In fact, print it out and put it above your
> monitor and look at it every day. One cache miss at a time. How much
> does one cache miss add to any benchmark? .001%? Less.
>
> But your pet features didn't slow the system down. Nope, they just made
> the cache smaller, which you didn't notice because whatever artificial
> benchmark you ran didn't happen to need the whole cache.

Clearly this is the case: the benefit of a change must balance the
negative effects. Making the code paths longer hurts free cache; having
more of them should not. More code is not always slower code, and doesn't
always have more impact on cache use. You identify something which must be
considered, but it's not the only thing to consider. Linux should be
stable, not moribund.

> You need to understand that system resources belong to the user. Not the
> kernel. The goal is to have all of the kernel code running under any
> load be less than 1% of the CPU. Your 5% number up there would pretty
> much double the amount of time we spend in the kernel for most workloads.

Who profits? For most users a bit more system time resulting in better
disk performance would be a win, or at least non-lose. This isn't black
and white.


On Sat, 22 Feb 2003, Larry McVoy wrote:

> Let's get back to your position. You want to shovel stuff in the kernel
> for the benefit of the 32 way / 64 way etc boxes. I don't see that as
> wise. You could prove me wrong. Here's how you do it: go get oprofile
> or whatever that tool is which lets you run apps and count cache misses.
> Start including before/after runs of each microbench in lmbench and
> some time sharing loads with and without your changes. When you can do
> that and you don't add any more bus traffic, you're a genius and
> I'll shut up.

Code only costs when it's executed. Linux is somewhat heading to the place
where a distro has a few useful configs, and then people who care about the
last bit of whatever they see as a bottleneck can build their own from
"make config". So it is possible to add features for big machines without
any impact on the builds which don't use the features. It goes without
saying that this is hard. I would guess that it results in more bugs as
well, if one path or another is "the less-traveled way."

>
> But that's a false promise because by definition, fine grained threading
> adds more bus traffic. It's kind of hard to not have that happen, the
> caches have to stay coherent somehow.

Clearly. And things which require more locking will pay some penalty for
this. But a quick scan of this list on keyword "lockless" will show that
people are thinking about this.

I don't think developers will buy ignoring part of the market to
completely optimize for another. Linux will grow by being ubiquitous, not
by winning some battle and losing the war. It's not a niche-market OS.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2003-02-24 00:23:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: pte-highmem vs UKVA (was: object-based rmap and pte-highmem)


On Sun, 23 Feb 2003, William Lee Irwin III wrote:
>
> Another term for "UKVA for pagetables only" is "recursive pagetables",
> if this helps clarify anything.

Oh, ok. We did that for alpha, and it was a good deal there (it's actually
architected for alpha). So yes, I don't mind doing it for the page tables,
and it should work fine on x86 too (it's not necessarily a very portable
approach, since it requires that the pmd- and the pte- tables look the
same, which is not always true).

So sure, go ahead with that part.

Linus

2003-02-24 00:33:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call


On Sun, 23 Feb 2003, David Mosberger wrote:
>
> 2 GHz Xeon: 701 SPECint
> 1 GHz Itanium 2: 810 SPECint
>
> That is, Itanium 2 is 15% faster.

Ehh, and this is with how much cache?

Last I saw, the Itanium 2 machines came with 3MB of integrated L3 caches,
and I suspect that whatever 0.13 Itanium numbers you're looking at are
with the new 6MB caches.

So your "apples to apples" comparison isn't exactly that.

The only thing that is meaningful is "performance at the same time of
general availability". At which point the P4 beats the Itanium 2 senseless
with a 25% higher SpecInt. And last I heard, by the time Itanium 2 is up
at 2GHz, the P4 is apparently going to be at 5GHz, comfortably keeping
that 25% lead.

Linus

2003-02-24 00:32:47

by Victor Yodaiken

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, Feb 23, 2003 at 09:21:27PM +0100, Xavier Bestel wrote:
> On Sun, 23/02/2003 at 20:17, Linus Torvalds wrote:
>
> > And the baroque instruction encoding on the x86 is actually a _good_
> > thing: it's a rather dense encoding, which means that you win on icache.
> > It's a bit hard to decode, but who cares? Existing chips do well at
> > decoding, and thanks to the icache win they tend to perform better - and
> > they load faster too (which is important - you can make your CPU have
> > big caches, but _nothing_ saves you from the cold-cache costs).
>
> Next step: hardware gzip ?

See ARM "thumb"

2003-02-24 00:56:18

by dean gaudet

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, 23 Feb 2003, David Mosberger wrote:

> >>>>> On Sun, 23 Feb 2003 14:48:48 -0800 (PST), David Lang <[email protected]> said:
>
> David.L> I would call a 15% lead over the ia64 pretty substantial.
>
> Huh? Did you misread my mail?
>
> 2 GHz Xeon: 701 SPECint
> 1 GHz Itanium 2: 810 SPECint
>
> That is, Itanium 2 is 15% faster.

according to pricewatch i could buy ten 2GHz Xeons for about the cost of
one Itanium 2 900MHz.

that's not even considering the cost of the motherboards i'd need to plug
those into.

-dean

2003-02-24 01:15:02

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, Feb 23, 2003 at 01:45:16PM -0800, Linus Torvalds wrote:
> The x86 has that stupid "executability is tied to a segment" thing, which
> means that you cannot make things executable on a page-per-page level.
> It's a mistake, but it's one that _could_ be fixed in the architecture if
> it really mattered, the same way the WP bit got fixed in the i486.

I've been thinking about this recently, and it turns out that the whole
point is moot with a fixed-address vsyscall page: non-exec stacks are
trivially circumvented by using the vsyscall page as a known starting
point for the exploit. All the other tricks of changing the starting
stack offset and using randomized load addresses don't help at all,
since the exploit can merely use the vsyscall page to perform various
operations. Personally, I'm still a fan of the shared library vsyscall
trick, which would allow us to randomize its load address and defeat
this problem.

-ben

2003-02-24 01:16:51

by Kenneth Johansson

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Mon, 2003-02-24 at 00:57, Alan Cox wrote:
> On Sun, 2003-02-23 at 20:50, Martin J. Bligh wrote:
> > >> And the baroque instruction encoding on the x86 is actually a _good_
> > >> thing: it's a rather dense encoding, which means that you win on icache.
> > >> It's a bit hard to decode, but who cares? Existing chips do well at
> > >> decoding, and thanks to the icache win they tend to perform better - and
> > >> they load faster too (which is important - you can make your CPU have
> > >> big caches, but _nothing_ saves you from the cold-cache costs).
> > >
> > > Next step: hardware gzip ?
> >
> > They did that already ... IBM were demonstrating such a thing a couple of
> > years ago. Don't see it helping with icache though, as it unpacks between
> > memory and the processor, IIRC.
>
> I saw the L2/L3 compressed cache thing, and I thought "doh!", and I watched and
> I've not seen it for a long time. What happened to it ?
>

http://www-3.ibm.com/chips/techlib/techlib.nsf/products/CodePack

If you are thinking of this, it does look like people were not using it;
I know I'm not. It reduces memory for instructions, but that is all, and
memory, it seems, is not a problem, at least not for instructions.

It does not exist in new CPUs from IBM; I don't know the official reason
for the removal.

If you really do mean compressed cache, I don't think anybody has done
that for real.

2003-02-24 01:32:27

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, Feb 23, 2003 at 10:28:04AM -0700, Eric W. Biederman wrote:
> The problem. There is no upper bound to how many rmap
> entries there can be at one time. And the unbounded
> growth can overwhelm a machine.

Eh? By that logic there's no bound to the number of vmas that can exist
at a given time. But there is a bound on the number that a single process
can force the system into using, and that limit also caps the number of
rmap entries the process can bring into existence. Virtual address space
is not free, and there are already mechanisms in place to limit it which,
given that the number of rmap entries is directly proportional to the amount
of virtual address space in use, probably need proper configuration.

> The goal is to provide an overall system cap on the number
> of rmap entries.

No, the goal is to have a stable system under a variety of workloads that
performs well. User exploitable worst case behaviour is a bad idea. Hybrid
solves that at the expense of added complexity.

-ben
--
Don't email: [email protected]

2003-02-24 01:43:14

by dean gaudet

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, 24 Feb 2003, Kenneth Johansson wrote:

> If you really do mean compressed cache I don't think anybody has done
> that for real.

people are doing this *for real* -- it really depends on what you define
as compressed.

ARM thumb is definitely a compression function for code.

x86 native instructions are compressed compared to the RISC-like micro-ops
which a processor like athlon, p3, and p4 actually execute. for similar
operations, an x86 would average probably 1.5 bytes to encode what a
32-bit RISC would need 4 bytes to encode.

-dean

2003-02-24 01:46:30

by David Mosberger

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

>>>>> On Sun, 23 Feb 2003 17:06:29 -0800 (PST), dean gaudet <[email protected]> said:

Dean> On Sun, 23 Feb 2003, David Mosberger wrote:
>> >>>>> On Sun, 23 Feb 2003 14:48:48 -0800 (PST), David Lang <[email protected]> said:

David.L> I would call a 15% lead over the ia64 pretty substantial.

>> Huh? Did you misread my mail?

>> 2 GHz Xeon: 701 SPECint
>> 1 GHz Itanium 2: 810 SPECint

>> That is, Itanium 2 is 15% faster.

Dean> according to pricewatch i could buy ten 2GHz Xeons for about
Dean> the cost of one Itanium 2 900MHz.

Not if you want comparable cache-sizes [1]:

Intel Xeon MP, 2MB L3 cache: $3692

Itanium 2, 1 GHZ, 3MB L3 cache: $4226
Itanium 2, 1 GHZ, 1.5MB L3 cache: $2247
Itanium 2, 900 MHZ, 1.5MB L3 cache: $1338

Intel basically prices things by the cache size.

--david

[1]: http://www.intel.com/intel/finance/pricelist/

2003-02-24 01:54:21

by George Spelvin

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

Linus brought back tablets from the mount on which were graven:
> The x86 is a hell of a lot nicer than the ppc32, for example. On the
> x86, you get good performance and you can ignore the design mistakes (ie
> segmentation) by just basically turning them off.

Now wait a minute. I thought you worked at Transmeta.

There were no development and debugging costs associated with getting
all those different kinds of gates working, and all the segmentation
checking right?

Wouldn't it have been easier to build the system, and shift the effort
where it would really do some good, if you didn't have to support
all that crap?

An extra base/bounds check doesn't take any die area? An extra exception
source doesn't complicate exception handling?

> And the baroque instruction encoding on the x86 is actually a _good_
> thing: it's a rather dense encoding, which means that you win on icache.
> It's a bit hard to decode, but who cares? Existing chips do well at
> decoding, and thanks to the icache win they tend to perform better - and
> they load faster too (which is important - you can make your CPU have
> big caches, but _nothing_ saves you from the cold-cache costs).

I *really* thought you worked at Transmeta.

Transmeta's software-decoding is an extreme example of what all modern
x86 processors are doing in their L1 caches, namely predecoding the
instructions and storing them in expanded form. This varies from
just adding boundary tags (Pentium) and instruction type (K7) through
converting them to uops and caching those (P4).

This exactly undoes any L1 cache size benefits. The win, of course, is
that you don't have as much shifting and aligning on your i-fetch path,
which all the fixed-instruction-size architectures already started with.

So your comments only apply to the L2 cache.

And for the expense of all the instruction predecoding logic between
L2 and L1, don't you think someone could build an instruction compressor
to fit more into the die-size-limited L2 cache? With the sizes caches
are getting to these days, you should be able to do pretty well.
It seems like six of one, half a dozen of the other, and would save the
compiler writers a lot of pain.

> The low register count isn't an issue when you code in any high-level
> language, and it has actually forced x86 implementors to do a hell of a
> lot better job than the competition when it comes to memory loads and
> stores - which helps in general. While the RISC people were off trying
> to optimize their compilers to generate loops that used all 32 registers
> efficiently, the x86 implementors instead made the chip run fast on
> varied loads and used tons of register renaming hardware (and looking at
> _memory_ renaming too).

I don't disagree that chip designers have managed to do very well with
the x86, and there's nothing wrong with making a virtue out of a necessity,
but that doesn't make the necessity good.

I was about to raise the same point. L1 dcache access tends to be a
cycle-limiting bottleneck, and as early as the original Pentium, the
x86 had to go to a 2-access-per-cycle L1 dcache to avoid bottlenecking
with only 2 pipes!

The low register count *does* affect you when using a high-level language,
because if you have too many live variables floating around, you start
suffering. Handling these spills is why you need memory renaming.

It's true that x86 processors have had fancy architectural features
sooner than similar-performance RISCs, but I think there's a fair case
that that's because they've *needed* them. Why do the P4 and K7/K8 have
such enormous reorder buffers, able to keep around 100 instructions
in flight at a time? Because they need it to extract parallelism out
of an instruction stream serialized by a miserly register file.

They've developed some great technology to compensate for the weaknesses,
but it's sure nice to dream of an architecture with all that great
technology but with fewer initial warts. (Alpha seemed like the
best hope, but *sigh*. Still, however you apportion blame for its
demise, performance was clearly not one of its problems.)


I think the same claim applies much more powerfully to the ppc32's MMU.
It may be stupid, but it is only visible from inside the kernel, and
a fairly small piece of the kernel at that.

It could be scrapped and replaced with something better without any
effect on existing user-level code at all.

Do you think you can replace the x86's register problems as easily?

> The only real major failure of the x86 is the PAE crud.

So you think AMD extended the register file just for fun?

Hell, the "PAE crud" is the *same* problem as the tiny register
file. Insufficient virtual address space leading to physical > virtual
kludges.

And, as you've noticed, there are limits to the physical/virtual
ratio above which it gets really painful. And the 64G:4G ratio of PAE
is mirrored in the 128:8 ratio of P4 integer registers.

I wish the original Intel designers could have left a "no heroic measures"
living will, because that design is on more life support than Darth Vader.

2003-02-24 02:05:19

by dean gaudet

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call



On Sun, 23 Feb 2003, David Mosberger wrote:

> >>>>> On Sun, 23 Feb 2003 17:06:29 -0800 (PST), dean gaudet <[email protected]> said:
>
> Dean> On Sun, 23 Feb 2003, David Mosberger wrote:
> >> >>>>> On Sun, 23 Feb 2003 14:48:48 -0800 (PST), David Lang <[email protected]> said:
>
> David.L> I would call a 15% lead over the ia64 pretty substantial.
>
> >> Huh? Did you misread my mail?
>
> >> 2 GHz Xeon: 701 SPECint
> >> 1 GHz Itanium 2: 810 SPECint
>
> >> That is, Itanium 2 is 15% faster.
>
> Dean> according to pricewatch i could buy ten 2GHz Xeons for about
> Dean> the cost of one Itanium 2 900MHz.
>
> Not if you want comparable cache-sizes [1]:

somehow i doubt you're quoting Xeon numbers w/2MB of cache above. in
fact, here's a 701 specint with only 512KB of cache @ 2GHz:

http://www.spec.org/osg/cpu2000/results/res2002q1/cpu2000-20020128-01232.html

my point was that if you had comparable die sizes the 15% "advantage"
would disappear. there's a hell of a lot which could be done with the
approximately double die size that the itanium 2 has compared to any of
the commodity x86 parts. but then the cost per part would be
correspondingly higher... which is exactly what is shown in the intel cost
numbers.

a more fair comparison would be your itanium 2 number with this:

http://www.spec.org/osg/cpu2000/results/res2002q4/cpu2000-20021021-01742.html

2MB L2 Xeon @ 2GHz, scores 842.

is this the itanium 2 number you're quoting us?

http://www.spec.org/osg/cpu2000/results/res2002q3/cpu2000-20020711-01469.html

'cause that's with 3MB L3.

-dean

>
> Intel Xeon MP, 2MB L3 cache: $3692
>
> Itanium 2, 1 GHZ, 3MB L3 cache: $4226
> Itanium 2, 1 GHZ, 1.5MB L3 cache: $2247
> Itanium 2, 900 MHZ, 1.5MB L3 cache: $1338
>
> Intel basically prices things by the cache size.
>
> --david
>
> [1]: http://www.intel.com/intel/finance/pricelist/
>

2003-02-24 02:22:43

by David Mosberger

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

>>>>> On Sun, 23 Feb 2003 16:40:40 -0800 (PST), Linus Torvalds <[email protected]> said:

Linus> On Sun, 23 Feb 2003, David Mosberger wrote:

>> 2 GHz Xeon: 701 SPECint
>> 1 GHz Itanium 2: 810 SPECint

>> That is, Itanium 2 is 15% faster.

Linus> Ehh, and this is with how much cache?

Linus> Last I saw, the Itanium 2 machines came with 3MB of
Linus> integrated L3 caches, and I suspect that whatever 0.13
Linus> Itanium numbers you're looking at are with the new 6MB
Linus> caches.

Unfortunately, HP doesn't sell 1.5MB/1GHz Itanium 2 workstations, but
we can do some educated guessing:

1GHz Itanium 2, 3MB cache: 810 SPECint
900MHz Itanium 2, 1.5MB cache: 674 SPECint

Assuming pure frequency scaling, a 1GHz/1.5MB Itanium 2 would get
around 750 SPECint. In reality, it would get slightly less, but most
likely substantially more than 701.

Linus> So your "apples to apples" comparison isn't exactly that.

I never claimed it's an apples to apples comparison. But comparing
same-process chips from the same manufacturer does make for a fairer
"architectural" comparison because it factors out at least some of the
effects caused by volume (there is no reason other than (a) volume and
(b) being designed as a server chip for Itanium chips to come out on
the same process later than the corresponding x86 chips).

Linus> The only thing that is meaningful is "performace at the same
Linus> time of general availability".

You claimed that x86 is inherently superior. I provided data that
shows that much of this apparent superiority is simply an effect of
the larger volume that x86 achieves today. Please don't claim that
x86 wins on technical grounds when it really wins on economic grounds.

--david

2003-02-24 02:31:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call


On 24 Feb 2003 [email protected] wrote:
>
> Now wait a minute. I thought you worked at Transmeta.
>
> There were no development and debugging costs associated with getting
> all those different kinds of gates working, and all the segmentation
> checking right?

So? The only thing that matters is the end result.

> Wouldn't it have been easier to build the system, and shift the effort
> where it would really do some good, if you didn't have to support
> all that crap?

Probably not appreciably. You forget - it's been tried. Over and over
again. The whole RISC philosophy was all about "wouldn't it perform better
if you didn't have to support that crap".

The fact is, the "crap" doesn't matter that much. As proven by the fact
that the "crap" processor family ends up being the one that eats pretty
much everybody else for lunch on performance issues.

Yes, the "crap" does end up making it a harder market to enter. There's a
lot of IP involved in knowing what all the rules are, and having literally
_millions_ of tests that check for conformance to the architecture (and
much of the "architecture" is a de-facto thing, not really written down in
architecture manuals).

But clearly even that is not insurmountable, as shown by the fact that not
only does the x86 perform well, it's also one of the few CPUs that are
actively worked on by multiple different companies (including Transmeta,
as you point out - although clearly the "crap" is one reason why the sw
approach works at all).

> Transmeta's software-decoding is an extreme example of what all modern
> x86 processors are doing in their L1 caches, namely predecoding the
> instructions and storing them in expanded form. This varies from
> just adding boundary tags (Pentium) and instruction type (K7) through
> converting them to uops and cacheing those (P4).

But you seem to imply that that is somehow a counter-argument to _my_
argument. And I don't agree.

I think what Transmeta (and AMD, and VIA etc) show is that the ugliness
doesn't really matter - there are different ways of handling it, and you
can either throw hardware at it or software at it, but it's still worth
doing, because in the end what matters is not the bad parts of it, but the
good parts.

Btw, the P4 tracecache does pretty much exactly the same thing that
Transmeta does, except in hardware. It's based on a very simple reality:
decoding _is_ going to be the bottleneck for _any_ instruction set, once
you've pushed the rest hard enough. If you're not doing predecoding, that
only means that you haven't pushed hard enough yet - _regardless_ of your
architecture.

> This exactly undoes any L1 cache size benefits. The win, of course, is
> that you don't have as much shifting and aligning on your i-fetch path,
> which all the fixed-instruction-size architectures already started with.

No. You don't understand what "cold-cache" case really means. It's more
than just bringing the thing in from memory to the cache. It's also all
about loading the dang thing from disk.

> So your comments only apply to the L2 cache.

And the disk.

> And for the expense of all the instruction predecoding logic betweeen
> L2 and L1, don't you think someone could build an instruction compressor
> to fit more into the die-size-limited L2 cache?

It's been done. See the PPC stuff. I've read the papers (it's been a long
time, admittedly - it's not something new), and the fact is, it's not
apparently being used that much. Because it's quite painful, unlike the
x86 approach.

> > stores - which helps in general. While the RISC people were off trying
> > to optimize their compilers to generate loops that used all 32 registers
> > efficiently, the x86 implementors instead made the chip run fast on
> > varied loads and used tons of register renaming hardware (and looking at
> > _memory_ renaming too).
>
> I don't disagree that chip designers have managed to do very well with
> the x86, and there's nothing wrong with making a virtue out of a necessity,
> but that doesn't make the necessity good.

Actually, you miss my point.

The necessity is good because it _forced_ people to look at what really
matters. Instead of wasting 15 years and countless PhD's on things that
are, in the end, just engineering-masturbation (nr of registers etc).

> The low register count *does* affect you when using a high-level language,
> because if you have too many live variables floating around, you start
> suffering. Handling these spills is why you need memory renaming.

Bzzt. Wrong answer.

The right answer is that you need memory renaming and memory alias
hardware _anyway_, because doing dynamic scheduling of loads vs stores is
something that is _required_ to get the kind of performance that people
expect today. And all the RISC stuff that tried to avoid it was just a BIG
WASTE OF TIME. Because the _only_ thing the RISC approach ended up showing
was that eventually you have to do the hard stuff anyway, so you might as
well design for doing it in the first place.

Which is what ia-64 did wrong - and what I mean by doing the same mistakes
that everybody else did 15 years ago. Look at all the crap that ia64 does
in order to do compiler-driven loop modulo-optimizations. That's part of
the whole design, with predication and those horrible register windows.
Can you say "risc mistakes all over again"?

My strong suspicion (and that makes it a "fact" ;) is that in another 5
years they'll get to where the x86 has been for the last 10 years, and
they'll realize that they will need to do out-of-order accesses etc, which
makes all of that modulo optimization pretty much useless, since the
hardware pretty much has to do it _anyway_.

> It's true that x86 processors have had fancy architectural features
> sooner than similar-performance RISCs, but I think there's a fair case
> that that's because they've *needed* them.

Which is exactly my point. And by the time you implement them, you notice
that the half-way measures don't mean anything, and in fact make for more
problems.

For example, that small register state is a pain in the ass, no? But since
you basically need register renaming _anyway_, the small register state
actually has some advantages in that it makes it easier to have tons of
read ports and still keep the register file fast. And once you do renaming
(including memory state renaming), IT DOESN'T MUCH MATTER.

> Why do the P4 and K7/K8 have
> such enormous reorder buffers, able to keep around 100 instructions
> in flight at a time? Because they need it to extract parallelism out
> of an instruction stream serialized by a miserly register file.

You think this is bad?

Look at it another way: once you have hundreds of instructions in flight,
you have hardware that automatically

- executes legacy applications reasonably well, since compilers aren't
the most important thing.

End result: users are happy.

- you don't need to have compilers that do stupid things like unrolling
loops, thus keeping your icache pressure down, since you do loop
unrolling in hardware thanks to deep pipelines.

Even the RISC people are doing hundreds of instructions in flight (ie
Power5), but they started doing it years after the x86 did, because they
claimed that they could force their users to recompile their binaries
every few years. And look where it actually got them..

> They've developed some great technology to compensate for the weaknesses,
> but it's sure nice to dream of an architecture with all that great
> technology but with fewer initial warts. (Alpha seemed like the
> best hope, but *sigh*. Still, however you apportion blame for its
> demise, performance was clearly not one of its problems.)

So my premise is that you always end up doing the hard things anyway, and
the "crap" _really_ doesn't matter.

Alpha was nice, no question about it. But it took them way too long to get
to the whole OoO thing, because they tried to take a short-cut that in the
end wasn't the answer. It _looked_ like the answer (the original alpha
design was done explicitly to not _need_ things like complex out-of-order
execution), but it was all just wrong.

The thing about the x86 is that hard cold reality (ie millions of
customers that have existing applications) really _forces_ you to look at
what matters, and so far it clearly appears that the things you are
complaining about (registers and segmentation) simply do _not_ matter.

> I think the same claim applies much more powerfully to the ppc32's MMU.
> It may be stupid, but it is only visible from inside the kernel, and
> a fairly small piece of the kernel at that.
>
> It could be scrapped and replaced with something better without any
> effect on existing user-level code at all.
>
> Do you think you can replace the x86's register problems as easily?

They _have_ been solved. The x86 performs about twice as well as any ppc32
on the market. End of discussion.

> > The only real major failure of the x86 is the PAE crud.
>
> So you think AMD extended the register file just for fun?

I think the AMD register file extension was unnecessary, yes. They did it
because they could, and it wasn't a big deal. That's not the part that
makes the architecture interesting. As you should well know.

> Hell, the "PAE crud" is the *same* problem as the tiny register
> file. Insufficient virtual address space leading to physical > virtual
> kludges.

Nope. The small register file is a non-issue. Trust me. I do work for
transmeta, and we do the register renaming in software, and it doesn't
matter in the end.

Linus

2003-02-24 02:47:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call


On Sun, 23 Feb 2003, David Mosberger wrote:
> >> 2 GHz Xeon: 701 SPECint
> >> 1 GHz Itanium 2: 810 SPECint
>
> >> That is, Itanium 2 is 15% faster.
>
> Unfortunately, HP doesn't sell 1.5MB/1GHz Itanium 2 workstations, but
> we can do some educated guessing:
>
> 1GHz Itanium 2, 3MB cache: 810 SPECint
> 900MHz Itanium 2, 1.5MB cache: 674 SPECint
>
> Assuming pure frequency scaling, a 1GHz/1.5MB Itanium 2 would get
> around 750 SPECint. In reality, it would get slightly less, but most
> likely substantially more than 701.

And as Dean pointed out:

2Ghz Xeon MP with 2MB L3 cache: 842 SPECint

In other words, the P4 eats the Itanium for breakfast even if you limit it
to 2GHz due to some "process" rule.

And if you don't make up any silly rules, but simply look at "what's
available today", you get

2.8Ghz Xeon MP with 2MB L3 cache: 907 SPECint

or even better (much cheaper CPUs):

3.06 GHz P4 with 512kB L2 cache: 1074 SPECint
AMD Athlon XP 2800+: 933 SPECint

These are systems that you can buy today, with _less_ cache and clearly
much higher performance (comparing the best-performing published ia-64
with the best P4 on SPECint, the P4 is 32% faster). Even with the "you
can only run the P4 at 2GHz because that is all it ever ran at in 0.18"
rule, the ia-64 falls behind.

> Linus> The only thing that is meaningful is "performace at the same
> Linus> time of general availability".
>
> You claimed that x86 is inherently superior. I provided data that
> shows that much of this apparent superiority is simply an effect of
> the larger volume that x86 achieves today.

And I showed that your data is flawed. Clearly the P4 outperforms ia-64
on an architectural level _even_ when taking process into account.

Linus

2003-02-24 02:56:53

by Martin J. Bligh

[permalink] [raw]
Subject: Re: pte-highmem vs UKVA (was: object-based rmap and pte-highmem)

> This just just for PTEs ... for which at the moment we have two choices:
> 1. Stick them in lowmem (fills up the global space too much).
> 2. Stick them in highmem - too much overhead doing k(un)map_atomic
> as measured by both myself and Andrew.

Actually Andrew's measurements seem to be a bit different from mine ...
several different things all interacting. I'll try to get some more
measurements from a straight SMP box, and see if they correlate more
closely with what he's seeing.

M.

2003-02-24 02:58:40

by David Mosberger

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

>>>>> On Sun, 23 Feb 2003 18:54:41 -0800 (PST), Linus Torvalds <[email protected]> said:

Linus> In other words, the P4 eats the Itanium for breakfast even if
Linus> you limit it to 2GHz due to some "process" rule.

Ugh, 842 vs 810 is "eating for breakfast"? In my lexicon, that's "in
the same ballpark".

Besides the 2GHz Xeon MP is a 0.13um part.

>> You claimed that x86 is inherently superior. I provided data that
>> shows that much of this apparent superiority is simply an effect of
>> the larger volume that x86 achieves today.

Linus> And I showed that your data is flawed.

No, you did not.

--david

2003-02-24 03:01:00

by David Mosberger

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

>>>>> On Sun, 23 Feb 2003 18:15:29 -0800 (PST), dean gaudet <[email protected]> said:

Dean> somehow i doubt you're quoting Xeon numbers w/2MB of cache above.

I quoted the Xeon 0.13um price because there was no 0.18um part with
>512KB cache (for better or worse, Intel basically prices CPUs by
cache-size).

--david

2003-02-24 03:24:13

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, 23 Feb 2003 18:23:01 EST, Bill Davidsen wrote:
> On Sat, 22 Feb 2003, Gerrit Huizenga wrote:
>
> > On 22 Feb 2003 18:20:19 GMT, Alan Cox wrote:
> > > I think people overestimate the numbner of large boxes badly. Several IDE
> > > pre-patches didn't work on highmem boxes. It took *ages* for people to
> > > actually notice there was a problem. The desktop world is still 128-256Mb
> >
> > IDE on big boxes? Is that crack I smell burning? A desktop with 4 GB
> > is a fun toy, but bigger than *I* need, even for development purposes.
> > But I don't think EMC, Clariion (low end EMC), Shark, etc. have any
> > IDE products for my 8-proc 16 GB machine... And running pre-patches in
> > a production environment that might expose this would be a little
> > silly as well.
>
> I don't disagree with most of your point, however there certainly are
> legitimate uses for big boxes with small (IDE) disk. Those which first
> come to mind are all computational problems, in which a small dataset is
> read from disk and then processors beat on the data. More or less common
> examples are graphics transformations (original and final data
> compressed), engineering calculations such as finite element analysis,
> rendering (raytracing) type calculations, and data analysis (things like
> setiathome or automated medical image analysis).

Yeah and as Christoph pointed out, a lot of big machines have IDE
based CD-ROMs. And, there *are* some IDE disk subsystems with 1 TB
on an IDE bus and such, but there just aren't enough IDE busses or PCI
slots on most big machines to span out to the really high disk capacities
or large numbers of spindles. But some of the compute engines could
either be net-booted (no local disk) or have a cheap, small disk for
boot, small static storage (couple hundred GB range) etc. But most
people don't connect big machines to IDE drive subsystems.

gerrit

2003-02-24 03:19:36

by David Lang

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

also, the RISC argument started back when the CPU ran at the same speed
as memory and CISC was limited to slow clock speeds. the RISC folks
figured that if they could eliminate the CISC complexity they could
ratchet up the clock speeds and more than make up for the other
inefficiencies.

unfortunately for them, the core CPU speeds became uncoupled from the
memory speeds and skyrocketed to the point where CISC cores are as fast
or faster than the 'high speed' RISC cores.

so the remaining difference between CISC and RISC really isn't the core
functions, it's where the optimizations take place.

in CISC they take place on the CPU (either in software or hardware)
in RISC they take place in the compiler

unfortunately for the RISC folks, the optimizations change significantly
from chip to chip, so the code needs to be compiled for each chip (in
many cases each motherboard) to take full advantage of them. while this
could be acceptable in an OpenSource world, in the real world of
pre-compiled binaries companies are not interested in shipping a dozen
versions of a program, so there is no pressure for the compilers to get
optimized this well (besides the problem of people not wanting to use the
'blessed' compiler). and software development always lags hardware, so
the compilers are never available when the CPUs are (and at the pace of
development, by the time they are available new CPUs are out). while
on the CISC side of the house these optimizations are done in the CPU by
the designers of the chips, and since the performance of the chips is
judged including these optimizations (no way to weasel out by claiming
it's a compiler problem) a lot of attention is paid to this work and it's
out when the chip is released (at this point both the Transmeta and Intel
CPUs have the ability to update this code, but it's really only done to
fix major bugs).

the other big claim of the RISC folks was that they didn't waste
transistors on this since it was done in the compiler, but the transistor
budget has climbed so high since then that this cost is minimal. I seem to
remember that intel claims that enabling hyperthreading (which not only
duplicates this entire section, but adds even more to keep them straight)
adds less than 10% to the transistor count of the CPU. yes it would be
nicer to have those transistors available to be used for other things, but
experience with RISC has shown that overall the tradeoff isn't worth it.

David Lang

On Sun, 23 Feb 2003, Linus Torvalds wrote:

> Date: Sun, 23 Feb 2003 18:39:09 -0800 (PST)
> From: Linus Torvalds <[email protected]>
> To: [email protected]
> Cc: [email protected]
> Subject: Re: Minutes from Feb 21 LSE Call
>
>
> On 24 Feb 2003 [email protected] wrote:
> >
> > Now wait a minute. I thought you worked at Transmeta.
> >
> > There were no development and debugging costs associated with getting
> > all those different kinds of gates working, and all the segmentation
> > checking right?
>
> So? The only thing that matters is the end result.
>
> > Wouldn't it have been easier to build the system, and shift the effort
> > where it would really do some good, if you didn't have to support
> > all that crap?
>
> Probably not appreciably. You forget - it's been tried. Over and over
> again. The whole RISC philosophy was all about "wouldn't it perform better
> if you didn't have to support that crap".
>
> The fact is, the "crap" doesn't matter that much. As proven by the fact
> that the "crap" processor family ends up being the one that eats pretty
> much everybody else for lunch on performance issues.
>
> Yes, the "crap" does end up making it a harder market to enter. There's a
> lot of IP involved in knowing what all the rules are, and having literally
> _millions_ of tests that check for conformance to the architecture (and
> much of the "architecture" is a de-facto thing, not really written down in
> architecture manuals).
>
> But clearly even that is not insurmountable, as shown by the fact that not
> only does the x86 perform well, it's also one of the few CPUs that are
> actively worked on by multiple different companies (including Transmeta,
> as you point out - although clearly the "crap" is one reason why the sw
> approach works at all).
>
> > Transmeta's software-decoding is an extreme example of what all modern
> > x86 processors are doing in their L1 caches, namely predecoding the
> > instructions and storing them in expanded form. This varies from
> > just adding boundary tags (Pentium) and instruction type (K7) through
> > converting them to uops and cacheing those (P4).
>
> But you seem to imply that that is somehow a counter-argument to _my_
> argument. And I don't agree.
>
> I think what Transmeta (and AMD, and VIA etc) show is that the ugliness
> doesn't really matter - there are different ways of handling it, and you
> can either throw hardware at it or software at it, but it's still worth
> doing, because in the end what matters is not the bad parts of it, but the
> good parts.
>
> Btw, the P4 tracecache does pretty much exactly the same thing that
> Transmeta does, except in hardware. It's based on a very simple reality:
> decoding _is_ going to be the bottleneck for _any_ instruction set, once
> you've pushed the rest hard enough. If you're not doing predecoding, that
> only means that you haven't pushed hard enough yet - _regardless_ of your
> architecture.
>
> > This exactly undoes any L1 cache size benefits. The win, of course, is
> > that you don't have as much shifting and aligning on your i-fetch path,
> > which all the fixed-instruction-size architectures already started with.
>
> No. You don't understand what "cold-cache" case really means. It's more
> than just bringing the thing in from memory to the cache. It's also all
> about loading the dang thing from disk.
>
> > So your comments only apply to the L2 cache.
>
> And the disk.
>
> > And for the expense of all the instruction predecoding logic between
> > L2 and L1, don't you think someone could build an instruction compressor
> > to fit more into the die-size-limited L2 cache?
>
> It's been done. See the PPC stuff. I've read the papers (it's been a long
> time, admittedly - it's not something new), and the fact is, it's not
> apparently being used that much. Because it's quite painful, unlike the
> x86 approach.
>
> > > stores - which helps in general. While the RISC people were off trying
> > > to optimize their compilers to generate loops that used all 32 registers
> > > efficiently, the x86 implementors instead made the chip run fast on
> > > varied loads and used tons of register renaming hardware (and looking at
> > > _memory_ renaming too).
> >
> > I don't disagree that chip designers have managed to do very well with
> > the x86, and there's nothing wrong with making a virtue out of a necessity,
> > but that doesn't make the necessity good.
>
> Actually, you miss my point.
>
> The necessity is good because it _forced_ people to look at what really
> matters. Instead of wasting 15 years and countless PhD's on things that
> are, in the end, just engineering-masturbation (nr of registers etc).
>
> > The low register count *does* affect you when using a high-level language,
> > because if you have too many live variables floating around, you start
> > suffering. Handling these spills is why you need memory renaming.
>
> Bzzt. Wrong answer.
>
> The right answer is that you need memory renaming and memory alias
> hardware _anyway_, because doing dynamic scheduling of loads vs stores is
> something that is _required_ to get the kind of performance that people
> expect today. And all the RISC stuff that tried to avoid it was just a BIG
> WASTE OF TIME. Because the _only_ thing the RISC approach ended up showing
> was that eventually you have to do the hard stuff anyway, so you might as
> well design for doing it in the first place.
>
> Which is what ia-64 did wrong - and what I mean by doing the same mistakes
> that everybody else did 15 years ago. Look at all the crap that ia64 does
> in order to do compiler-driven loop modulo-optimizations. That's part of
> the whole design, with predication and those horrible register windows.
> Can you say "risc mistakes all over again"?
>
> My strong suspicion (and that makes it a "fact" ;) is that in another 5
> years they'll get to where the x86 has been for the last 10 years, and
> they'll realize that they will need to do out-of-order accesses etc, which
> makes all of that modulo optimization pretty much useless, since the
> hardware pretty much has to do it _anyway_.
>
> > It's true that x86 processors have had fancy architectural features
> > sooner than similar-performance RISCs, but I think there's a fair case
> > that that's because they've *needed* them.
>
> Which is exactly my point. And by the time you implement them, you notice
> that the half-way measures don't mean anything, and in fact make for more
> problems.
>
> For example, that small register state is a pain in the ass, no? But since
> you basically need register renaming _anyway_, the small register state
> actually has some advantages in that it makes it easier to have tons of
> read ports and still keep the register file fast. And once you do renaming
> (including memory state renaming), IT DOESN'T MUCH MATTER.
>
> > Why do the P4 and K7/K8 have
> > such enormous reorder buffers, able to keep around 100 instructions
> > in flight at a time? Because they need it to extract parallelism out
> > of an instruction stream serialized by a miserly register file.
>
> You think this is bad?
>
> Look at it another way: once you have hundreds of instructions in flight,
> you have hardware that automatically
>
> - executes legacy applications reasonably well, since compilers aren't
> the most important thing.
>
> End result: users are happy.
>
> - you don't need to have compilers that do stupid things like unrolling
> loops, thus keeping your icache pressure down, since you do loop
> unrolling in hardware thanks to deep pipelines.
>
> Even the RISC people are doing hundreds of instructions in flight (ie
> Power5), but they started doing it years after the x86 did, because they
> claimed that they could force their users to recompile their binaries
> every few years. And look where it actually got them..
>
> > They've developed some great technology to compensate for the weaknesses,
> > but it's sure nice to dream of an architecture with all that great
> > technology but with fewer initial warts. (Alpha seemed like the
> > best hope, but *sigh*. Still, however you apportion blame for its
> > demise, performance was clearly not one of its problems.)
>
> So my premise is that you always end up doing the hard things anyway, and
> the "crap" _really_ doesn't matter.
>
> Alpha was nice, no question about it. But it took them way too long to get
> to the whole OoO thing, because they tried to take a short-cut that in the
> end wasn't the answer. It _looked_ like the answer (the original alpha
> design was done explicitly to not _need_ things like complex out-of-order
> execution), but it was all just wrong.
>
> The thing about the x86 is that hard cold reality (ie millions of
> customers that have existing applications) really _forces_ you to look at
> what matters, and so far it clearly appears that the things you are
> complaining about (registers and segmentation) simply do _not_ matter.
>
> > I think the same claim applies much more powerfully to the ppc32's MMU.
> > It may be stupid, but it is only visible from inside the kernel, and
> > a fairly small piece of the kernel at that.
> >
> > It could be scrapped and replaced with something better without any
> > effect on existing user-level code at all.
> >
> > Do you think you can replace the x86's register problems as easily?
>
> They _have_ been solved. The x86 performs about twice as well as any ppc32
> on the market. End of discussion.
>
> > > The only real major failure of the x86 is the PAE crud.
> >
> > So you think AMD extended the register file just for fun?
>
> I think the AMD register file extension was unnecessary, yes. They did it
> because they could, and it wasn't a big deal. That's not the part that
> makes the architecture interesting. As you should well know.
>
> > Hell, the "PAE crud" is the *same* problem as the tiny register
> > file. Insufficient virtual address space leading to physical > virtual
> > kludges.
>
> Nope. The small register file is a non-issue. Trust me. I do work for
> transmeta, and we do the register renaming in software, and it doesn't
> matter in the end.
>
> Linus
>
>

2003-02-24 03:40:04

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, 23 Feb 2003 15:59:12 PST, David Mosberger wrote:
> >>>>> On Sun, 23 Feb 2003 15:06:56 -0800, "Martin J. Bligh" <[email protected]> said:
> Martin> Got anything more real-world than SPECint type
> Martin> microbenchmarks?
>
> SPECint a microbenchmark? You seem to be redefining the meaning of
> the word (last time I checked, lmbench was a microbenchmark).
>
> Ironically, Itanium 2 seems to do even better in the "real world" than
> suggested by benchmarks, partly because of the large caches, memory
> bandwidth and, I'm guessing, partly because of its straightforward
> micro-architecture (e.g., a synchronization operation takes on the
> order of 10 cycles, as compared to on the order of dozens or hundreds
> of cycles on the Pentium 4).

Two major types of high end workloads here (and IA64 is definitely
still in the "high end" category). There are the scientific and
technical style workloads, which SPECcpu (of which CINT and CFP are
the integer and floating point subsets) might reasonably categorize,
and some of the "system" workloads, such as those roughly categorized
by things like TPC-C/H/W/etc, or SPECweb/jbb/jvm/jAppServer which
exercise some more complex, multi-tier interactions.

I haven't seen anything recently on the higher level System benchmarks
for IA64 - I'm not sure that anyone is doing much that is significant
in this space, where IA32 results practically saturate the overall
reported results.

I know SGI is generally more interested in the scientific and
technical area. I would assume that HP would be more interested
in the broader system deployment, except that too much activity in
that area might endanger parisc sales. IBM is doing some stuff in
the IA64 space, but more in IA32 and obviously PPC64. That leaves
NEC and a few others that I don't know about. It may be that IA64
isn't really ready for the system level stuff or that it competes
with too many entrenched platforms to make it economically viable.

But, I would be really interested in seeing anything other than
"scientific and technical" based benchmarks for IA64. I don't think
there is much out there. That implies that nobody is interested in
IA64 or that it doesn't perform "competitively" in that space...

gerrit

2003-02-24 03:57:38

by David Mosberger

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

>>>>> On Sun, 23 Feb 2003 19:49:38 -0800, Gerrit Huizenga <[email protected]> said:

Gerrit> I haven't seen anything recently on the higher level System benchmarks
Gerrit> for IA64

Did you miss the TPC-C announcement from last November & December?

rx5670 4-way Itanium 2: 80498 tpmC @ $5.30/transaction (Oracle 10 on Linux).
rx5670 4-way Itanium 2: 87741 tpmC @ $5.03/transaction (MS SQL on Windows).

Both world-records for 4-way machines when they were announced (not
sure if that's still true).

--david
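[Editor's note: TPC-C's price/performance metric is total system price divided by throughput, so the two quoted results imply the overall system prices. A quick sanity check, using only the numbers quoted above:]

```python
# TPC-C reports throughput (tpmC) and price/performance ($/tpmC);
# total system price is simply their product.
results = {
    "rx5670 Oracle 10 / Linux":  (80498, 5.30),
    "rx5670 MS SQL / Windows":   (87741, 5.03),
}

for name, (tpmc, dollars_per_tpmc) in results.items():
    total_price = tpmc * dollars_per_tpmc
    print(f"{name}: ~${total_price:,.0f} implied total system price")
# Roughly $427K and $441K respectively -- comparable system prices,
# with the Windows result slightly ahead on both throughput and $/tpmC.
```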

2003-02-24 03:54:32

by Larry McVoy

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, Feb 23, 2003 at 07:31:26PM -0800, Gerrit Huizenga wrote:
> But most
> people don't connect big machines to IDE drive subsystems.

3ware controllers. They look like SCSI to the host, but use cheap IDE
drives on the back end. Really nice cards. bkbits.net runs on one.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-24 04:11:08

by Russell Leighton

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call


Yup.

Great price and super price/performance.

Gotta luv it.

Larry McVoy wrote:

>On Sun, Feb 23, 2003 at 07:31:26PM -0800, Gerrit Huizenga wrote:
>
>>But most
>>people don't connect big machines to IDE drive subsystems.
>>
>
>3ware controllers. They look like SCSI to the host, but use cheap IDE
>drives on the back end. Really nice cards. bkbits.net runs on one.
>


2003-02-24 04:24:23

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

> Gerrit> I haven't seen anything recently on the higher level System
> Gerrit> benchmarks for IA64
>
> Did you miss the TPC-C announcement from last November & December?
>
> rx5670 4-way Itanium 2: 80498 tpmC @ $5.30/transaction (Oracle 10 on Linux).
> rx5670 4-way Itanium 2: 87741 tpmC @ $5.03/transaction (MS SQL on Windows).
>
> Both world-records for 4-way machines when they were announced (not
> sure if that's still true).

Cool - thanks. that's more what I was looking for.

M.


2003-02-24 04:32:24

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

> The fact is, the "crap" doesn't matter that much. As proven by the fact
> that the "crap" processor family ends up being the one that eats pretty
> much everybody else for lunch on performance issues.

But is that because it's a better design? Or because it has more money
thrown at it? I suspect it's merely its mass-market dominance generating
huge amounts of cash to improve it ... and it got there through history,
not technical prowess.

Of course, to be pragmatic about it, none of this matters. The chip with
the best price:performance and market presence wins, not the best technical
design.

M.

2003-02-24 04:46:11

by Larry McVoy

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

> > Your position is that
> > there is no money in PC's only in big iron. Last I checked, "big iron"
> > doesn't include $25K 4 way machines, now does it?
>
> I would call 4x a "big machine" which is what I originally said.

Nonsense. You were talking about 16/32/64 way boxes, go read your own mail.
In fact, you said so in this message.

Furthermore, I can prove that isn't what you are talking about. Show me
the performance gains you are getting on 4way systems from your changes.
Last I checked, things scaled pretty nicely on 4 ways.

> > You claimed that
> > Dell was making the majority of their profits from servers.
>
> I think that's probably true (nobody can be certain, as we don't have the
> numbers).

Yes, we do. You just don't like what the numbers are saying. You can
work backward from the size of the server market and the percentages
claimed by Sun, HP, IBM, etc. If you do that, you'll see that even
if Dell was making 100% margins on every server they sold, that still
wouldn't be 51% of their profits.

It's not "probably true", it's not physically possible that it is true
and if you don't know that you are simply waving your hands and not
doing any math.
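[Editor's note: the "work backward" check Larry describes can be sketched in a few lines. Every input below is a hypothetical placeholder, not a real market figure; the point is the shape of the bound (server profit can never exceed server revenue), and the conclusion depends entirely on what you plug in:]

```python
# Back-of-envelope bound: even at an impossible 100% margin, profit
# from servers is capped by revenue from servers. All numbers are
# illustrative assumptions -- substitute real market-research data.
server_market_total = 50e9    # assumed total server market, $/yr
dell_server_share = 0.05      # assumed Dell share of that market
dell_total_profit = 2e9       # assumed Dell total annual profit

dell_server_revenue = server_market_total * dell_server_share
max_server_profit_share = dell_server_revenue / dell_total_profit

# Is "servers are >51% of profit" even arithmetically possible
# under these assumptions?
claim_possible = max_server_profit_share >= 0.51
print(f"server revenue: ${dell_server_revenue / 1e9:.1f}B; "
      f"upper bound on profit share: {max_server_profit_share:.0%}; "
      f"claim possible: {claim_possible}")
```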

> > To refresh
> > your memory: "I bet they still make more money on servers than desktops
> > and notebooks combined". Are you still claiming that?
>
> Yup.

Well, you are flat out 100% wrong.

> > If so, please
> > provide some data to back it up because, as Mark and others have pointed
> > out, the bulk of their servers are headless desktop machines in tower
> > or rackmount cases.
>
> So what? they're still servers. I can no more provide data to back it up
> than you can to contradict it, because they don't release those figures.

Read the mail I've posted on topic, the data is there. Or better yet,
don't trust me, go work it out for yourself, it isn't hard.

> > I don't see that as wise. You could prove me wrong.
> > Here's how you do it: go get oprofile
> > or whatever that tool is which lets you run apps and count cache misses.
> > Start including before/after runs of each microbench in lmbench and
> > some time sharing loads with and without your changes. When you can do
> > that and you don't add any more bus traffic, you're a genius and
> > I'll shut up.
>
> I don't feel the need to do that to prove my point, but if you feel the
> need to do it to prove yours, go ahead.

Ahh, now we're getting somewhere. As soon as we get anywhere near real
numbers, you don't want anything to do with it. Why is that?

> You seem to think the maintainers are morons that we can just slide crap
> straight by ... give them a little more credit than that.

It happens all the time.

> > Come on, prove me wrong, show me the data.
>
> I don't have to *prove* you wrong. I'm happy in my own personal knowledge
> that you're wrong, and things seem to be going along just fine, thanks.

Wow. Compelling. "It is so because I say it is so". Jeez, forgive me
if I'm not falling all over myself to have that sort of engineering being
the basis for scaling work.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-24 04:48:30

by Larry McVoy

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, Feb 23, 2003 at 03:37:49PM -0800, Martin J. Bligh wrote:
> >> For instance, don't locks simply get compiled away to nothing on
> >> uni-processor machines?
> >
> > Preempt causes most of the issues of SMP with few of the benefits. There
> > are loads for which it's ideal, but for general use it may not be the
> > right feature, and I ran it during the time when it was just a patch, but
> > lately I'm convinced it's for special occasions.
>
> Note that preemption was pushed by the embedded people Larry was advocating
> for, not the big-machine crowd .... ironic, eh?

Dig through the mail logs and you'll see that I was completely against the
preemption patch. I think it is a bad idea, if you want real time, use
rt/linux, it solves the problem right.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-24 04:52:37

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, 23 Feb 2003 20:07:43 PST, David Mosberger wrote:
> >>> On Sun, 23 Feb 2003 19:49:38 -0800, Gerrit Huizenga <[email protected]> said:
>
> Gerrit> I haven't seen anything recently on the higher level System benchmarks
> Gerrit> for IA64
>
> Did you miss the TPC-C announcement from last November & December?
>
> rx5670 4-way Itanium 2: 80498 tpmC @ $5.30/transaction (Oracle 10 on Linux).
> rx5670 4-way Itanium 2: 87741 tpmC @ $5.03/transaction (MS SQL on Windows).
>
> Both world-records for 4-way machines when they were announced (not
> sure if that's still true).

Yeah, I missed that. And my spot checking didn't catch anything IA64
related. Was there anything else on IA64 that competed with the current
rack of 8-way IA32 boxen, or the upcoming 16-way stuff rolling out
this year? Seems like the larger phys memory support should help on
several of those benchmarks...

The thin set of IA64 results reflects the difference in marketing/sales,
although better price/performance should be able to change that... ;)

Odd that MS is still outdoing Linux (or SQL is outdoing Oracle on Linux).
Will be nice when that changes...

gerrit

2003-02-24 04:51:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call


On Sun, 23 Feb 2003, Martin J. Bligh wrote:
>
> > The fact is, the "crap" doesn't matter that much. As proven by the fact
> > that the "crap" processor family ends up being the one that eats pretty
> > much everybody else for lunch on performance issues.
>
> But is that because it's a better design? Or because it has more money
> thrown at it? I suspect it's merely it's mass-market dominance generating
> huge amounts of cash to improve it ... and it got there through history,
> not technical prowess.

Sure. It's to a large degree "more money and resources", no question about
that.

But what is "better design"? Would it have been possible to put as much
effort as Intel (and others) put into the x86 architecture into something
else, and make it even better?

MY standpoint is that the above question is _meaningless_ and stupid.
People did try. Very hard. Claiming anything else is clearly misguided.
But compatibility and price matter equally much - and often more - than
raw performance. Which means that even _if_ another architecture performed
better (and it certainly happened, in the heyday of the alpha), it
wouldn't much matter. People still stayed away from it in droves.

And in the end, that's why I don't like IA-64. I'll take back every single
bad thing I've ever said about IA-64 if Intel were to just to sell those
things to the mass market instead of P4's. But clearly the IA-64 can't
make it in that market, and thus it is made irrelevant. The same way alpha
was made irrelevant, _despite_ having had much better performance - an
advantage that ia-64 clearly doesn't have.

(Admittedly, alpha didn't have hugely better performance for very long.
Intel came out with the PPro, and took a _lot_ of people by surprise).

AMD's x86-64 approach is a lot more interesting not so much because of any
technical issues, but because AMD _can_ try to avoid the "irrelevant"
part. By having a part that _can_ potentially compete in the market
against a P4, AMD has something that is worth hoping for. Something that
can make a difference.

IBM with Power5 and Apple could be the same thing (yeah yeah, I personally
suspect it goes enough against IBMs normal approach that it will cause
some friction). A CPU that actually competes in a market that is relevant.

Because server CPU's simply aren't very interesting from a technical
standpoint. I don't know of a _single_ CPU that ever grew down. But we've
seen a _lot_ of CPU's grow _up_. In other words: the small machines tend
to eat into the large ones, not the other way around.

And if you start from the large ones, you aren't going to make it in the
long run.

Put yet another way: if I was on Intel's IA-32 team, I'd be a lot more
worried about those XScale people finally getting their act together than
I would be about IA-64.

Linus

2003-02-24 04:57:40

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, Feb 23, 2003 at 08:56:16PM -0800, Larry McVoy wrote:
> Furthermore, I can prove that isn't what you are talking about. Show me
> the performance gains you are getting on 4way systems from your changes.
> Last I checked, things scaled pretty nicely on 4 ways.

Try 4 or 8 mkfs's in parallel on a 4x box running virgin 2.4.x.


-- wli
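[Editor's note: the parallel-mkfs test above can be approximated without dedicating real disks, since mkfs.ext2 will format a regular file with -F. A sketch assuming e2fsprogs is installed; on real hardware you would point each mkfs at a separate spindle:]

```shell
set -e
# Create scratch images to stand in for disks.
for i in 1 2 3 4; do
    dd if=/dev/zero of=/tmp/fs$i.img bs=1M count=16 2>/dev/null
done
# Kick off the mkfs's in parallel, as in the test described above.
for i in 1 2 3 4; do
    mkfs.ext2 -F -q /tmp/fs$i.img &
done
wait
echo "all mkfs done"
```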

2003-02-24 05:06:35

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

> Nonsense. You were talking about 16/32/64 way boxes, go read your own
> mail. In fact, you said so in this message.

Where? I never mentioned 32 / 64 way boxes, for starters ...

> Furthermore, I can prove that isn't what you are talking about. Show me
> the performance gains you are getting on 4way systems from your changes.
> Last I checked, things scaled pretty nicely on 4 ways.

Depends what you mean by "your changes". If you do a before and after
comparison on a 4x machine on the scalability changes IBM LTC has made, I
think you'd find a dramatic difference. Of course, it depends to some
extent on what tests you run. Maybe running bitkeeper (or whatever you're
testing) just eats cpu, and doesn't do much interprocess communication or
disk IO (compared to the CPU load), in which case it'll scale pretty well
on
anything as long as it's multithreaded enough. If you're just worried about
one particular app, yes of course you could tweak the system to go faster
for it ... but that's not what a general purpose OS is about.

> Yes, we do. You just don't like what the numbers are saying. You can
> work backward from the size of the server market and the percentages
> claimed by Sun, HP, IBM, etc. If you do that, you'll see that even
> if Dell was making 100% margins on every server they sold, that still
> wouldn't be 51% of their profits.

Ummm ... now go back to what we were actually talking about. Linux margins.
You think a significant percentage of the desktops they sell run Linux?

>> > To refresh
>> > your memory: "I bet they still make more money on servers than desktops
>> > and notebooks combined". Are you still claiming that?
>>
>> Yup.
>
> Well, you are flat out 100% wrong.

In the context we were talking about (Linux), I seriously doubt it.
Apologies if I didn't feel the need to continuously restate the context in
every email to stop you from trying to twist the argument.

> Ahh, now we're getting somewhere. As soon as we get anywhere near real
> numbers, you don't want anything to do with it. Why is that?

Because I don't see why I should waste my time running benchmarks just to
prove you wrong. I don't respect you that much, and it seems the
maintainers don't either. When you become somebody with the stature in the
Linux community of, say, Linus or Andrew I'd be prepared to spend a lot
more time running benchmarks on any concerns you might have.

>> I don't have to *prove* you wrong. I'm happy in my own personal knowledge
>> that you're wrong, and things seem to be going along just fine, thanks.
>
> Wow. Compelling. "It is so because I say it is so". Jeez, forgive me
> if I'm not falling all over myself to have that sort of engineering being
> the basis for scaling work.

Ummm ... and your argument is different because of what? You've run some
tiny little microfocused benchmark, seen a couple of bus cycles, and
projected the results out? Not very impressive, really, is it? Go run a
real benchmark and prove it makes a difference if you want to sway people's
opinions. Until then, I suspect the current status quo will continue in
terms of us getting patches accepted.

M.


2003-02-24 05:05:15

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, Feb 23, 2003 at 07:31:26PM -0800, Gerrit Huizenga wrote:
>> But most people don't connect big machines to IDE drive subsystems.
>
On Sun, Feb 23, 2003 at 08:02:46PM -0800, Larry McVoy wrote:
> 3ware controllers. They look like SCSI to the host, but use cheap IDE
> drives on the back end. Really nice cards. bkbits.net runs on one.

A quick back-of-the-napkin estimate suggests that this 3ware stuff
would max out at 6 racks of disks on NUMA-Q, or 3/8 of a rack per node
(ignoring cabling, which looks infeasible, but never mind that), which
is a smaller capacity than I remember FC having. NUMA-Q's a bit
optimistic for 3ware because it has buttloads of PCI slots in
comparison to more modern machines.


-- wli
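[Editor's note: one set of assumptions that reproduces the "6 racks / 3/8 of a rack per node" figure above. Every input is a guess for illustration, not NUMA-Q or 3ware specification data:]

```python
# Back-of-the-napkin capacity sketch. All inputs are assumptions.
pci_slots_per_node = 6       # assumed usable PCI slots per NUMA-Q node
nodes = 16                   # assumed node count
drives_per_controller = 8    # 3ware cards of that era took ~8 IDE drives
drives_per_rack = 128        # assumed drive bays per rack

max_drives = pci_slots_per_node * nodes * drives_per_controller
racks = max_drives / drives_per_rack
print(f"max drives: {max_drives}, racks: {racks:.1f}, "
      f"{racks / nodes:.3f} racks per node")
# With these inputs: 768 drives, 6 racks, 0.375 (= 3/8) rack per node.
```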

2003-02-24 05:50:16

by Mark Hahn

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

> > Last I checked, things scaled pretty nicely on 4 ways.
>
> Try 4 or 8 mkfs's in parallel on a 4x box running virgin 2.4.x.

"Doctor, it hurts..."

2003-02-24 05:53:10

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

At some point in the past, Larry McVoy wrote:
>>> Last I checked, things scaled pretty nicely on 4 ways.

At some point in the past, I wrote:
>> Try 4 or 8 mkfs's in parallel on a 4x box running virgin 2.4.x.

On Mon, Feb 24, 2003 at 01:00:22AM -0500, Mark Hahn wrote:
> "Doctor, it hurts..."

Doing disk io is supposed to hurt? I'll file this in the "sick and
wrong" category along with RBJ and Hohensee.

In the meantime, compare to 2.5.x.


-- wli

2003-02-24 06:00:18

by Gerhard Mack

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, 23 Feb 2003, Larry McVoy wrote:

> Date: Sun, 23 Feb 2003 20:57:17 -0800
> From: Larry McVoy <[email protected]>
> To: Martin J. Bligh <[email protected]>
> Cc: Bill Davidsen <[email protected]>, Ben Greear <[email protected]>,
> Linux Kernel Mailing List <[email protected]>
> Subject: Re: Minutes from Feb 21 LSE Call
>
> On Sun, Feb 23, 2003 at 03:37:49PM -0800, Martin J. Bligh wrote:
> > >> For instance, don't locks simply get compiled away to nothing on
> > >> uni-processor machines?
> > >
> > > Preempt causes most of the issues of SMP with few of the benefits. There
> > > are loads for which it's ideal, but for general use it may not be the
> > > right feature, and I ran it during the time when it was just a patch, but
> > > lately I'm convinced it's for special occasions.
> >
> > Note that preemption was pushed by the embedded people Larry was advocating
> > for, not the big-machine crowd .... ironic, eh?
>
> Dig through the mail logs and you'll see that I was completely against the
> preemption patch. I think it is a bad idea, if you want real time, use
> rt/linux, it solves the problem right.

So you're saying I need to switch to rt/linux to run games or an mp3 player?

Gerhard


--
Gerhard Mack

[email protected]

<>< As a computer I find your faith in technology amusing.

2003-02-24 06:25:33

by Valerie Henson

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, Feb 23, 2003 at 06:57:09PM -0500, Bill Davidsen wrote:
> On Sat, 22 Feb 2003, Larry McVoy wrote:
> >
> > But that's a false promise because by definition, fine grained threading
> > adds more bus traffic. It's kind of hard to not have that happen, the
> > caches have to stay coherent somehow.
>
> Clearly. And things which require more locking will pay some penalty for
> this. But a quick scan of this list on keyword "lockless' will show that
> people are thinking about this.

Lockless algorithms still generate bus traffic when you do the atomic
compare-and-swap or load-linked or whatever hardware instruction you
use to implement your lockless algorithm. Caches still have to stay
coherent, lock or no lock.

-VAL

2003-02-24 06:32:18

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, Feb 23, 2003 at 06:57:09PM -0500, Bill Davidsen wrote:
>> Clearly. And things which require more locking will pay some penalty for
>> this. But a quick scan of this list on keyword "lockless' will show that
>> people are thinking about this.

On Sun, Feb 23, 2003 at 11:22:30PM -0700, Val Henson wrote:
> Lockless algorithms still generate bus traffic when you do the atomic
> compare-and-swap or load-linked or whatever hardware instruction you
> use to implement your lockless algorithm. Caches still have to stay
> coherent, lock or no lock.

Not all lockless algorithms operate on the "access everything with
atomic operations" principle. RCU, for example, uses no atomic
operations on the read side, which is actually fewer atomic operations
than standard rwlocks use for the read side.


-- wli

2003-02-24 06:43:18

by Larry McVoy

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

> > Dig through the mail logs and you'll see that I was completely against the
> > preemption patch. I think it is a bad idea, if you want real time, use
> > rt/linux, it solves the problem right.
>
> So you're saying I need to switch to rt/linux to run games or an mp3 player?

It depends on the quality you want. If you want it to work without
exception, yeah, I guess that is what I'm saying. People seem to be
willing to put up with sloppy playback on a computer that they would
freak out over if it happened on their TV. rt/linux will make your
el cheapo laptop actually deliver what you need.

I think there has been a fair amount of discussion of this sort of stuff
in the games world. Some game company got taken to task recently because
even 2Ghz machines couldn't run their game properly. Makes me wonder if
a real time system is what they need.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-24 06:48:22

by Larry McVoy

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, Feb 23, 2003 at 09:16:38PM -0800, Martin J. Bligh wrote:
> Ummm ... now go back to what we were actually talking about. Linux margins.
> You think a significant percentage of the desktops they sell run Linux?

The real discussion was the justification for scaling work beyond the
small SMPs. You tried to make the point that there is no money in PC's so
any work to scale Linux up would help hardware companies stay financially
healthy. I and others pointed out that there is indeed a pile of money
in PC's; that's the vast majority of the hardware Dell sells. They don't
sell anything bigger than an 8-way, and they only have one of those.
We went on to do the digging to figure out that it's impossible that
Dell makes a substantial portion of its profits from the big servers.

The point being that there is a company generating $32B/year in sales and
almost all of that is in uniprocessors. Directly countering your statement
that there is no margin in PC's. They are making $2B/year in profits, QED.

Which brings us back to the point. If the world is not heading towards
an 8 way on every desk then it is really questionable to make a lot of
changes to the kernel to make it work really well on 8-ways. Yeah, I'm
sure it makes you feel good, but it's more of an intellectual exercise than
anything which really benefits the vast majority of the kernel user base.

> > Ahh, now we're getting somewhere. As soon as we get anywhere near real
> > numbers, you don't want anything to do with it. Why is that?
>
> Because I don't see why I should waste my time running benchmarks just to
> prove you wrong. I don't respect you that much, and it seems the
> maintainers don't either. When you become somebody with the stature in the
> Linux community of, say, Linus or Andrew I'd be prepared to spend a lot
> more time running benchmarks on any concerns you might have.

Who cares if you respect me, what does that have to do with proper
engineering? Do you think that I'm the only person who wants to see
numbers? You think Linus doesn't care about this? Maybe you missed
the whole IA32 vs IA64 instruction cache thread. It sure sounded like
he cares. How about Alan? He stepped up and pointed out that less
is more. How about Mark? He knows a thing or two about the topic.
In fact, I think you'd be hard pressed to find anyone who wouldn't be
interested in seeing the cache effects of a patch.

People care about performance, both scaling up and scaling down. A lot of
performance changes are measured poorly, in a way that makes the changes
look good but doesn't expose the hidden costs of the change. What I'm
saying is that those sorts of measurements screwed over performance in
the past, so why are you trying to repeat old mistakes?

> > Wow. Compelling. "It is so because I say it is so". Jeez, forgive me
> > if I'm not falling all over myself to have that sort of engineering being
> > the basis for scaling work.
>
> Ummm ... and your argument is different because of what? You've run some
> tiny little microfocused benchmark, seen a couple of bus cycles, and
> projected the results out?

My argument is different because every effort which has gone in the
direction you are going has ended up with a kernel that worked well on
big boxes and sucked rocks on little boxes. And all of them started
with kernels which performed quite nicely on uniprocessors.

If I was waving my hands and saying "I'm an old fart and I think this
won't work" and that was it, you'd have every right to tell me to piss
off. I'd tell me to piss off. But that's not what is going on here.
What's going on is that a pile of smart people have tried over and over
to do what you claim you will do and they all failed. They all ended up
with kernels that gave up lots of uniprocessor performance and justified
it by throwing more processors at that problem. You haven't said a
single thing to refute that and when challenged to measure the parts
which lead to those results you respond with "nah, nah, I don't respect
you so I don't have to measure it". Come on, *you* should want to know
if what I'm saying is true. You're an engineer, not a marketing drone,
of course you should want to know, why wouldn't you?

Linux is a really fast system right now. The code paths are short and
it is possible to use the OS almost as if it were a library, the cost is
so little that you really can mmap stuff in as you need, something that
people have wanted since Multics. There will always be many more uses
of Linux in small systems than large, simply because there will always
be more small systems. Keeping Linux working well on small systems is
going to have a dramatically larger positive benefit for the world than
scaling it to 64 processors. So who do you want to help? An elite
few or everyone?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-24 07:29:32

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

>> Ummm ... now go back to what we were actually talking about. Linux
>> margins. You think a significant percentage of the desktops they sell
>> run Linux?
>
> The real discussion was the justification for scaling work beyond the
> small SMPs. You tried to make the point that there is no money in PC's so
> any work to scale Linux up would help hardware companies stay financially
> healthy.

More or less, yes.

> The point being that there is a company generating $32B/year in sales and
> almost all of that is in uniprocessors. Directly countering your
> statement that there is no margin in PC's. They are making $2B/year in
> profits, QED.

Which is totally irrelevant. It's the *LINUX* market that matters. What
part of that do you find so hard to understand?

> Which brings us back to the point. If the world is not heading towards
> an 8 way on every desk then it is really questionable to make a lot of
> changes to the kernel to make it work really well on 8-ways. Yeah, I'm
> sure it makes you feel good, but it's more of an intellectual exercise than
> anything which really benefits the vast majority of the kernel user base.

It makes IBM money, ergo they pay me. I enjoy doing it, ergo I work for
them. Most of the work benefits smaller systems as well, ergo we get our
patches accepted. So everyone's happy, apart from you, who keeps whining.

>> Because I don't see why I should waste my time running benchmarks just to
>> prove you wrong. I don't respect you that much, and it seems the
>> maintainers don't either. When you become somebody with the stature in
>> the Linux community of, say, Linus or Andrew I'd be prepared to spend a
>> lot more time running benchmarks on any concerns you might have.
>
> Who cares if you respect me, what does that have to do with proper
> engineering? Do you think that I'm the only person who wants to see
> numbers? You think Linus doesn't care about this? Maybe you missed
> the whole IA32 vs IA64 instruction cache thread. It sure sounded like
> he cares. How about Alan? He stepped up and pointed out that less
> is more. How about Mark? He knows a thing or two about the topic.
> In fact, I think you'd be hard pressed to find anyone who wouldn't be
> interested in seeing the cache effects of a patch.

So now we've slid from talking about bus traffic from fine-grained locking,
which is mostly just you whining in ignorance of the big picture, to cache
effects, which are obviously important. Nice try at twisting the
conversation. Again.

> People care about performance, both scaling up and scaling down. A lot of
> performance changes are measured poorly, in a way that makes the changes
> look good but doesn't expose the hidden costs of the change. What I'm
> saying is that those sorts of measurements screwed over performance in
> the past, why are you trying to repeat old mistakes?

One way to measure those changes poorly would be to do what you were
advocating earlier - look at one tiny metric of a microbenchmark, rather
than the actual throughput of the machine. So pardon me if I take your
concerns, and file them in the appropriate place.

> My argument is different because every effort which has gone in the
> direction you are going has ended up with a kernel that worked well on
> big boxes and sucked rocks on little boxes. And all of them started
> with kernels which performed quite nicely on uniprocessors.

So you're trying to say that fine-grained locking ruins uniprocessor
performance now? Or did you have some other change in mind?

> If I was waving my hands and saying "I'm an old fart and I think this
> won't work" and that was it, you'd have every right to tell me to piss
> off. I'd tell me to piss off. But that's not what is going on here.
> What's going on is that a pile of smart people have tried over and over
> to do what you claim you will do and they all failed. They all ended up
> with kernels that gave up lots of uniprocessor performance and justified
> it by throwing more processors at that problem. You haven't said a
> single thing to refute that and when challenged to measure the parts
> which lead to those results you respond with "nah, nah, I don't respect
> you so I don't have to measure it". Come on, *you* should want to know
> if what I'm saying is true. You're an engineer, not a marketing drone,
> of course you should want to know, why wouldn't you?

You just don't get it, do you? Your head is so vastly inflated that you
think everyone should run around researching whatever *you* happen to think
is interesting. Do your own benchmarking if you think it's a problem.
You're the one whining about this.

> Linux is a really fast system right now. The code paths are short and
> it is possible to use the OS almost as if it were a library, the cost is
> so little that you really can mmap stuff in as you need, something that
> people have wanted since Multics. There will always be many more uses
> of Linux in small systems than large, simply because there will always
> be more small systems. Keeping Linux working well on small systems is
> going to have a dramatically larger positive benefit for the world than
> scaling it to 64 processors. So who do you want to help? An elite
> few or everyone?

Everyone. And we can do that, and make large systems work at the same time.
Despite the fact you don't believe me. And despite the fact that you can't
grasp the difference between the number 16 and the number 64.

M.

2003-02-24 07:40:09

by Bill Huey

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, Feb 23, 2003 at 10:52:04PM -0800, Larry McVoy wrote:
> willing to put up with sloppy playback on a computer that they would
> freak out over if it happened on their TV. rt/linux will make your
> el cheapo laptop actually deliver what you need.
>
> I think there has been a fair amount of discussion of this sort of stuff
> in the games world. Some game company got taken to task recently because
> even 2Ghz machines couldn't run their game properly. Makes me wonder if
> a real time system is what they need.

RT for TV, mp3 player and game performance? What the hell happened to
network and disk QoS, and carrier-grade issues with modern operating
systems as they concern telecoms? VoIP? My god.

bill

2003-02-24 07:38:02

by Bill Huey

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, Feb 23, 2003 at 08:57:17PM -0800, Larry McVoy wrote:
> Dig through the mail logs and you'll see that I was completely against the
> preemption patch. I think it is a bad idea, if you want real time, use
> rt/linux, it solves the problem right.

And large unbounded operations on data structures. DOS, a single-tasking
operating system, is fast running a single thread of execution too; it just
happens to also be completely useless.

Whether folks like it or not, embedded RT is the future of Linux much more
so than any single NUMA machine that's sold or can be sold by IBM, SGI and
any other vendor of that type.

bill

2003-02-24 07:42:31

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, Feb 23, 2003 at 10:58:26PM -0800, Larry McVoy wrote:
> Linux is a really fast system right now. The code paths are short and
> it is possible to use the OS almost as if it were a library, the cost is
> so little that you really can mmap stuff in as you need, something that
> people have wanted since Multics. There will always be many more uses
> of Linux in small systems than large, simply because there will always
> be more small systems. Keeping Linux working well on small systems is
> going to have a dramatically larger positive benefit for the world than
> scaling it to 64 processors. So who do you want to help? An elite
> few or everyone?

I don't know what kind of joke you think I'm trying to play here.

"Scalability" is about making the kernel properly adapt to the size of
the system. This means UP. This means embedded. This means mid-range
x86 bigfathighmem turds. This means SGI Altix. I have _personally_
written patches to decrease the space footprint of pidhashes and other
data structures so that embedded systems function more optimally.

It's not about crapping all over the low end. It's not about degrading
performance on commonly available systems. It's about increasing the
range of systems on which Linux performs well and is useful.

Maintaining the performance of Linux on commonly available systems is
not only deeply ingrained as one of a set of personal standards amongst
all kernel hackers involved with scalability, it's also a prerequisite
for patch acceptance that is rigorously enforced by maintainers. To
further demonstrate this, look at the pgd_ctor patches, which markedly
reduced the overhead of pgd setup and teardown on UP lowmem systems and
were very minor improvements on PAE systems.

Now it's time to turn the question back around on you. Why do you not
want Linux to work well on a broader range of systems than it does now?


-- wli

2003-02-24 07:50:44

by Bill Huey

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, Feb 23, 2003 at 11:54:30PM -0800, William Lee Irwin III wrote:
> On Sun, Feb 23, 2003 at 11:44:47PM -0800, Bill Huey wrote:
> > And large unbounded operations on data structures. DOS, a single-tasking
> > operating system, is fast running a single thread of execution too; it just
> > happens to also be completely useless.
> > Whether folks like it or not, embedded RT is the future of Linux much more
> > so than any single NUMA machine that's sold or can be sold by IBM, SGI and
> > any other vendor of that type.
>
> And scalability is as essential there as it is on 512x/16TB O2K's.
>
> For this, it's _downward_ scalability, where "downward" is relative to
> "typical" UP x86 boxen.

The good thing about Linux is that, with some compile options, stuff
(scalability) can be inserted and removed at any time. One shouldn't narrow
one's view of what an OS can be out of strict tradition.

I don't buy this spinlock-for-all-locking tradition with no preemption,
especially given some of the IO performance improvements that came courtesy
of preempt. Somehow that was forgotten in Larry's discussion.

bill

2003-02-24 07:45:47

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, Feb 23, 2003 at 08:57:17PM -0800, Larry McVoy wrote:
>> Dig through the mail logs and you'll see that I was completely against the
>> preemption patch. I think it is a bad idea, if you want real time, use
>> rt/linux, it solves the problem right.

On Sun, Feb 23, 2003 at 11:44:47PM -0800, Bill Huey wrote:
> And large unbounded operations on data structures. DOS, a single-tasking
> operating system, is fast running a single thread of execution too; it just
> happens to also be completely useless.
> Whether folks like it or not, embedded RT is the future of Linux much more
> so than any single NUMA machine that's sold or can be sold by IBM, SGI and
> any other vendor of that type.

And scalability is as essential there as it is on 512x/16TB O2K's.

For this, it's _downward_ scalability, where "downward" is relative to
"typical" UP x86 boxen.


-- wli

2003-02-24 07:57:18

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, Feb 23, 2003 at 08:02:46PM -0800, Larry McVoy wrote:
> On Sun, Feb 23, 2003 at 07:31:26PM -0800, Gerrit Huizenga wrote:
> > But most
> > people don't connect big machines to IDE drive subsystems.
>
> 3ware controllers. They look like SCSI to the host, but use cheap IDE
> drives on the back end. Really nice cards. bkbits.net runs on one.

That's true (similarly for some nice scsi2ide external raid boxen), but Alan's
original argument was about the Linux IDE driver on big machines, which is used
by neither.

2003-02-24 08:29:49

by Andrew Morton

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

Bill Huey (Hui) <[email protected]> wrote:
>
> especially given some of the IO performance improvements that came courtesy
> of preempt.

There is no evidence for any such thing. Nor has any plausible
theory been put forward as to why such an improvement should occur.

2003-02-24 08:33:54

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, Feb 23, 2003 at 11:54:30PM -0800, William Lee Irwin III wrote:
>> And scalability is as essential there as it is on 512x/16TB O2K's.
>> For this, it's _downward_ scalability, where "downward" is relative to
>> "typical" UP x86 boxen.

On Mon, Feb 24, 2003 at 12:00:52AM -0800, Bill Huey wrote:
> The good thing about Linux is that, with some compile options, stuff
> (scalability) can be inserted and removed at any time. One shouldn't
> narrow one's view of what an OS can be out of strict tradition.

No!! Scalability means the kernel figures out how to adapt to the box.
Removing scalability means it no longer adapts to the size of your box.
Scalability includes scaling "downward" to smaller systems.


On Mon, Feb 24, 2003 at 12:00:52AM -0800, Bill Huey wrote:
> I don't buy this spinlock-for-all-locking tradition with no
> preemption, especially given some of the IO performance improvements
> that came courtesy of preempt. Somehow that was forgotten
> in Larry's discussion.

I've largely not been a party to the preempt business. Advances in
scheduling semantics are good, but are not my focus.


-- wli

2003-02-24 08:41:20

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

Bill Huey (Hui) <[email protected]> wrote:
>> especially given some of the IO performance improvements that
>> came courtesy of preempt.

On Mon, Feb 24, 2003 at 12:40:05AM -0800, Andrew Morton wrote:
> There is no evidence for any such thing. Nor has any plausible
> theory been put forward as to why such an improvement should occur.

There's a vague notion in my head that it should decrease scheduling
latencies in general, possibly including responses to io completion.

No idea how that lines up with reality. You've actually tracked
scheduling latencies at least at some point in the past. What kind
of results have you seen from the stuff (if any)?


-- wli

2003-02-24 08:49:21

by Bill Huey

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 12:40:05AM -0800, Andrew Morton wrote:
> There is no evidence for any such thing. Nor has any plausible
> theory been put forward as to why such an improvement should occur.

I find what you're saying rather unbelievable given some of the
benchmarks I saw when the preempt patch started floating around.

If you search linuxdevices.com for articles on preempt, you'll see a
claim about IO performance improvements with the patch. If something's
changed, then I'd like to know.

The numbers are here:
http://kpreempt.sourceforge.net/

bill

2003-02-24 08:59:21

by Andrew Morton

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

Bill Huey (Hui) <[email protected]> wrote:
>
> On Mon, Feb 24, 2003 at 12:40:05AM -0800, Andrew Morton wrote:
> > There is no evidence for any such thing. Nor has any plausible
> > theory been put forward as to why such an improvement should occur.
>
> I find what you're saying rather unbelievable given some of the
> benchmarks I saw when the preempt patch started floating around.
>
> If you search linuxdevices.com for articles on preempt, you'll see a
> claim about IO performance improvements with the patch. If something's
> changed, then I'd like to know.
>
> The numbers are here:
> http://kpreempt.sourceforge.net/
>

That's a 5% difference across five dbench runs. It may not even be
statistically significant: dbench is notoriously prone to chaotic
effects (less so in 2.5). It is a long stretch to say that any
increase in dbench numbers can be generalised to "improved IO
performance" across the board.

The preempt stuff is all about *worst-case* latency. I doubt if
it shifts the average latency (which is in the 50-100 microsecond
range) by more than 50 microseconds.

2003-02-24 09:17:31

by Bill Huey

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 01:09:38AM -0800, Andrew Morton wrote:
> That's a 5% difference across five dbench runs. It may not even be
> statistically significant: dbench is notoriously prone to chaotic
> effects (less so in 2.5). It is a long stretch to say that any
> increase in dbench numbers can be generalised to "improved IO
> performance" across the board.

I think the test is valid. If the scheduler can't deal with some
kind of IO event in a very tight time window, then you'd think that
it might influence the performance of that IO system.

> The preempt stuff is all about *worst-case* latency. I doubt if
> it shifts the average latency (which is in the 50-100 microsecond
> range) by more than 50 microseconds.

You obviously don't know what the current patch is supposed to do; I'm
assuming that's what you're referring to at this point. A fully preemptive
kernel, like the one from TimeSys, is about constraining worst-case
latency by using sleeping locks that enable preemption across critical
sections where that's normally turned off courtesy of spinlocks. Combine
that with heavyweight interrupts and you have a mix for constraining maximum
latency to about 50us in their kernel.

The patch and locking schema in Linux in their current form only reduce
latency on "average", which is the inverse of your claim concerning
maximum latency. The last time I looked at 2.5.62 there were still quite
a few places where a critical section bounded by spinlocks (with
interrupts turned off) could iterate over a data structure (VM) or copy
and move memory, with very large upper bounds.

I can't believe an engineer of your stature would blow something this
basic to the understanding of locking. You can't mean what you just
said above.

Read:
http://linuxdevices.com/articles/AT6106723802.html

That's basically what I'm referring to...

bill

2003-02-24 09:46:08

by Andrew Morton

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

Bill Huey (Hui) <[email protected]> wrote:
>
> On Mon, Feb 24, 2003 at 01:09:38AM -0800, Andrew Morton wrote:
> > That's a 5% difference across five dbench runs. It may not even be
> > statistically significant: dbench is notoriously prone to chaotic
> > effects (less so in 2.5). It is a long stretch to say that any
> > increase in dbench numbers can be generalised to "improved IO
> > performance" across the board.
>
> I think the test is valid. If the scheduler can't deal with some
> kind of IO event in a very tight time window, then you'd think that
> it might influence the performance of that IO system.
>

On the contrary. If the disk request queue is plugged and the task
which is submitting writeback is preempted, the IO system could
remain artificially idle for hundreds of milliseconds while the CPU
is off calculating pi. This is one of the reasons why I converted
the 2.5 request queues to unplug autonomously.

But that is speculation as well - I never observed this aspect to be
a real problem. Probably, it was not.

Substantiation of your claim requires quality testing and a plausible
explanation. I do not believe we have seen either, OK?

> Read:
> http://linuxdevices.com/articles/AT6106723802.html

I did, briefly. It appears to be claiming that the average scheduling
latency of the non-preemptible kernel is ten milliseconds!

Maybe I need to read that again in the morning.



2003-02-24 10:04:39

by Bill Huey

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 01:56:25AM -0800, Andrew Morton wrote:
> But that is speculation as well - I never observed this aspect to be
> a real problem. Probably, it was not.
>
> Substantiation of your claim requires quality testing and a plausible
> explanation. I do not believe we have seen either, OK?

Well, let's back off here. It's not my claim, it's Robert Love's in that
URL. Not to arrange a fight, but I had to point that out. :)

> > http://linuxdevices.com/articles/AT6106723802.html
>
> I did, briefly. It appears to be claiming that the average scheduling
> latency of the non-preemptible kernel is ten milliseconds!

They mention that this is related to the console code. Obviously, if you're
not checking for reschedule in a big pixmap scroll blit, then it's going
to stick out boldly as a big latency spike.

A fully preemptive system would only turn off preemption in places that
would break drivers and other obvious places like scheduler run-queues,
etc...

> Maybe I need to read that again in the morning.

It's also an old article, but it goes over a lot of the basics of a fully
preemptible kernel. Things might not be as dramatic with 2.5.62; not sure
how things stand now...

bill

2003-02-24 12:17:19

by Alan

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Mon, 2003-02-24 at 06:58, Larry McVoy wrote:
> Which brings us back to the point. If the world is not heading towards
> an 8 way on every desk then it is really questionable to make a lot of
> changes to the kernel to make it work really well on 8-ways.

_If_ it harms performance on small boxes. Otherwise you turn Linux into
Irix and your market doesn't look so hot in 3 or 4 years' time. Featuritis
is a slow creeping death.

The definitive Linux box appears to be $199 from Walmart right now, and
it's not SMP.


2003-02-24 13:55:50

by Alan

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Mon, 2003-02-24 at 05:06, William Lee Irwin III wrote:
> On Sun, Feb 23, 2003 at 08:56:16PM -0800, Larry McVoy wrote:
> > Furthermore, I can prove that isn't what you are talking about. Show me
> > the performance gains you are getting on 4way systems from your changes.
> > Last I checked, things scaled pretty nicely on 4 ways.
>
> Try 4 or 8 mkfs's in parallel on a 4x box running virgin 2.4.x.

You have strange ideas of typical workloads. The parallel mkfs one is a good
one though, because it's also a lot better on one CPU in 2.5.

2003-02-24 14:48:16

by Bill Davidsen

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Mon, 24 Feb 2003, Bill Huey wrote:

> On Mon, Feb 24, 2003 at 12:40:05AM -0800, Andrew Morton wrote:
> > There is no evidence for any such thing. Nor has any plausible
> > theory been put forward as to why such an improvement should occur.
>
> I find what you're saying rather unbelievable given some of the
> benchmarks I saw when the preempt patch started floating around.
>
> If you search linuxdevices.com for articles on preempt, you'll see a
> claim about IO performance improvements with the patch. If something's
> changed then I'd like to know.

Clearly you do know... preempt started out when 2.4 was the only game in
town. It made improvements to some degree because the rest of the kernel
had some real latency issues.

Skip forward through low latency patches, several flavors of elevator
improvements, faster clock rate, rmap, better VM, object rmap, finer
grained locking, io scheduling of several types including latency limiting
and prevention of write blocking, and the O(1) scheduler.

Preempt was a great way to get the right thing running sooner back when
there was a lot of latency in many places. That just doesn't seem to be
true anymore: preempt makes less difference now because so many other
things have been improved.

I'm sure there are applications which benefit greatly from preempt,
but the days of vast improvement seem to be gone; the low-hanging fruit
has been picked. Context-switching latency is still way higher than in
2.4, but that isn't hurting IO as much as all the other improvements
have helped.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2003-02-24 15:37:23

by Larry McVoy

Subject: Re: Minutes from Feb 21 LSE Call

On Sun, Feb 23, 2003 at 11:51:42PM -0800, William Lee Irwin III wrote:
> Now it's time to turn the question back around on you. Why do you not
> want Linux to work well on a broader range of systems than it does now?

I never said that I didn't. I'm just taking issue with the chosen path,
which has been demonstrated not to work.

"Let's scale Linux by multi threading"

"Err, that really sucked for everyone who has tried it in the past, all
the code paths got long and uniprocessor performance suffered"

"Oh, but we won't do that, that would be bad".

"Great, how about you measure the changes carefully and really show that?"

"We don't need to measure the changes, we know we'll do it right".

And just like every other time this comes up, in every other engineering
organization, the focus is on 2x wherever we are today. It is *never*
about getting to 100x or 1000x.

If you were looking at the problem assuming that the same code had to
run on uniprocessor and a 1000 way smp, right now, today, and designing
for it, I doubt very much we'd have anything to argue about. A lot of
what I'm saying starts to become obviously true as you increase the
number of CPUs but engineers are always seduced into making it go 2x
farther than it does today. Unfortunately, each of those 2x increases
comes at some cost and they add up.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-24 15:50:38

by Martin J. Bligh

Subject: Re: Minutes from Feb 21 LSE Call

> I never said that I didn't. I'm just taking issue with the chosen path,
> which has been demonstrated not to work.
>
> "Let's scale Linux by multi threading"
>
> "Err, that really sucked for everyone who has tried it in the past,
> all the code paths got long and uniprocessor performance suffered"
>
> "Oh, but we won't do that, that would be bad".
>
> "Great, how about you measure the changes carefully and really show
> that?"
>
> "We don't need to measure the changes, we know we'll do it right".

Most of the threading changes have been things like 1 thread per cpu, which
would seem to scale up and down rather well to me ... could you illustrate
by pointing to an example of something that's changed in that area which
you think is bad? Yes, if Linux started 2000 kernel threads on a UP system,
that would obviously be bad.

M.

2003-02-24 16:13:03

by Benjamin LaHaise

Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 07:47:25AM -0800, Larry McVoy wrote:
> If you were looking at the problem assuming that the same code had to
> run on uniprocessor and a 1000 way smp, right now, today, and designing
> for it, I doubt very much we'd have anything to argue about. A lot of
> what I'm saying starts to become obviously true as you increase the
> number of CPUs but engineers are always seduced into making it go 2x
> farther than it does today. Unfortunately, each of those 2x increases
> comes at some cost and they add up.

Good point. However, we are in a position to compare test results of
older Linux kernels against newer ones, and to recompile code out of the
kernel for specific applications. I'm curious whether there is a collection
of lmbench results of hand-configured and -compiled kernels vs the vendor
module-based kernels across 2.0, 2.2, 2.4 and recent 2.5 on the same
uniprocessor and dual processor configuration. That would really give
us a better idea of what running the kernels people actually use for
support reasons, rather than a properly tuned one, is costing us, and
whether we're winning or losing.

-ben
--
Don't email: [email protected]

2003-02-24 16:07:15

by Larry McVoy

Subject: Re: Minutes from Feb 21 LSE Call

On Sun, Feb 23, 2003 at 11:39:34PM -0800, Martin J. Bligh wrote:
> > The point being that there is a company generating $32B/year in sales and
> > almost all of that is in uniprocessors. Directly countering your
> > statement that there is no margin in PC's. They are making $2B/year in
> > profits, QED.
>
> Which is totally irrelevant. It's the *LINUX* market that matters. What
> part of that do you find so hard to understand?

OK, so you can't handle the reality that the server market overall doesn't
make your point so you retreat to the Linux market. OK, fine. All the
data anyone has ever seen has Linux running on *smaller* servers, not
larger. Show me all the cases where people replaced 4 CPU NT boxes with
8 CPU Linux boxes.

The point being that if in the overall market place, big iron isn't
dominating, you have one hell of a tough time making the case that the
Linux market place is somehow profoundly different and needs larger
boxes to do the same job.

In fact, the opposite is true. Linux squeezes substantially more
performance out of the same hardware than the commercial OS offerings,
NT or Unix. So where is the market force which says "oh, switching to
Linux? Better get more CPUs".

> It makes IBM money, ergo they pay me. I enjoy doing it, ergo I work for
> them. Most of the work benefits smaller systems as well, ergo we get our
> patches accepted. So everyone's happy, apart from you, who keeps whining.

Indeed I do, I'm good at it. You're about to find out how good. It's
quite effective to simply focus attention on a problem area. Here's
my promise to you: there will be a ton of attention focussed on the
scaling patches until you and anyone else doing them starts showing
up with cache miss counters as part of the submission process.

> So now we've slid from talking about bus traffic from fine-grained locking,
> which is mostly just you whining in ignorance of the big picture, to cache
> effects, which are obviously important. Nice try at twisting the
> conversation. Again.

You need to take a deep breath and try and understand that the focus of
the conversation is Linux, not your ego or mine. Getting mad at me just
wastes energy, stay focussed on the real issue, Linux.

> > People care about performance, both scaling up and scaling down. A lot of
> > performance changes are measured poorly, in a way that makes the changes
> > look good but doesn't expose the hidden costs of the change. What I'm
> > saying is that those sorts of measurements screwed over performance in
> > the past, why are you trying to repeat old mistakes?
>
> One way to measure those changes poorly would be to do what you were
> advocating earlier - look at one tiny metric of a microbenchmark, rather
> than the actual throughput of the machine. So pardon me if I take your
> concerns, and file them in the appropriate place.

You apparently missed the point where I have said (a bunch of times):
run the benchmarks you want, and report before- and after-patch
cache miss counters for the same runs. Microbenchmarks would be
a really bad way to do that; you really want to run a real application
because you need it fighting for the cache.

> > My argument is different because every effort which has gone in the
> > direction you are going has ended up with a kernel that worked well on
> > big boxes and sucked rocks on little boxes. And all of them started
> > with kernels which performed quite nicely on uniprocessors.
>
> So you're trying to say that fine-grained locking ruins uniprocessor
> performance now?

I've been saying that for almost 10 years, check the archives.

> You just don't get it, do you? Your head is so vastly inflated that you
> think everyone should run around researching whatever *you* happen to think
> is interesting. Do your own benchmarking if you think it's a problem.

That's exactly what I'll do if you don't learn how to do it yourself. I'm
astounded that any competent engineer wouldn't want to know the effects of
their changes, I think you actually do but are just too pissed right now
to see it.

> > Linux is a really fast system right now. [etc]
>
> Everyone. And we can do that, and make large systems work at the same time.
> Despite the fact you don't believe me. And despite the fact that you can't
> grasp the difference between the number 16 and the number 64.

See other postings on this one. All engineers in your position have said
"we're just trying to get to N cpus where N = ~2x where we are today and
it won't hurt uniprocessor performance". They *all* say that. And they
all end up with a slow uniprocessor OS. Unlike security and a number of
other invasive features, the SMP stuff can't be configed out or you end
up with an #ifdef-ed mess like IRIX.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-24 16:21:14

by Larry McVoy

Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 11:23:14AM -0500, Benjamin LaHaise wrote:
> kernel for specific applications. I'm curious whether there is a collection
> of lmbench results of hand-configured and -compiled kernels vs the vendor
> module-based kernels across 2.0, 2.2, 2.4 and recent 2.5 on the same
> uniprocessor and dual processor configuration.

If someone were willing to build the init-script infrastructure to
reboot to a new kernel, run the test, etc., I'll buy a couple of
machines and just let them run through this. I'd like to do it
with the cache miss counters turned on so if P4's do a nicer job
of counting than Athlons, I'll get those.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-24 16:21:03

by Victor Yodaiken

Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 12:50:31AM -0800, William Lee Irwin III wrote:
> Bill Huey (Hui) <[email protected]> wrote:
> >> especially given some of the IO performance improvement that
> >> happened as a courtesy of preempt.
>
> On Mon, Feb 24, 2003 at 12:40:05AM -0800, Andrew Morton wrote:
> > There is no evidence for any such thing. Nor has any plausible
> > theory been put forward as to why such an improvement should occur.
>
> There's a vague notion in my head that it should decrease scheduling

Vague notions seem to be the level of data on this topic.

--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
1+ 505 838 9109

2003-02-24 16:22:20

by Victor Yodaiken

Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 11:23:14AM -0500, Benjamin LaHaise wrote:
> Good point. However, we are in a position to compare test results of
> older Linux kernels against newer ones, and to recompile code out of the
> kernel for specific applications. I'm curious whether there is a collection
> of lmbench results of hand-configured and -compiled kernels vs the vendor
> module-based kernels across 2.0, 2.2, 2.4 and recent 2.5 on the same
> uniprocessor and dual processor configuration. That would really give
> us a better idea of what running the kernels people actually use for
> support reasons, rather than a properly tuned one, is costing us, and
> whether we're winning or losing.

It's interesting to me that the people supporting the scale-up do not
carefully do such benchmarks, and indeed have a rather cavalier attitude
to testing and benchmarking: or perhaps they don't think it's worth
publishing.

--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
1+ 505 838 9109

2003-02-24 16:39:02

by Martin J. Bligh

Subject: Re: Minutes from Feb 21 LSE Call

>> > The point being that there is a company generating $32B/year in sales
>> > and almost all of that is in uniprocessors. Directly countering your
>> > statement that there is no margin in PC's. They are making $2B/year in
>> > profits, QED.
>>
>> Which is totally irrelevant. It's the *LINUX* market that matters. What
>> part of that do you find so hard to understand?
>
> OK, so you can't handle the reality that the server market overall doesn't
> make your point so you retreat to the Linux market. OK, fine. All the

Errm, no. That was the conversation all along - you just took some remarks
out of context.

> The point being that if in the overall market place, big iron isn't
> dominating, you have one hell of a tough time making the case that the
> Linux market place is somehow profoundly different and needs larger
> boxes to do the same job.

Dominating in terms of volume? No. My position is that hardware companies
make more of their Linux money on servers than on desktops. We're working
on scalability ... that means CPUs, memory, disk IO, networking,
everything. That improves the efficiency of servers ... "large
machines" (which your original message defined as, and I quote, "4 or more
CPU SMP machines") ... as well as 2x and even larger 1x machines. If you're
being more specific about things like NUMA changes, please point to
examples of patches you think degrade performance on UP / 2x or whatever.

> Indeed I do, I'm good at it. You're about to find out how good. It's
> quite effective to simply focus attention on a problem area. Here's
> my promise to you: there will be a ton of attention focussed on the
> scaling patches until you and anyone else doing them starts showing
> up with cache miss counters as part of the submission process.

Here's my promise to you: people listen to you far less than you think, and
our patches will continue to go into the kernel.

>> So now we've slid from talking about bus traffic from fine-grained
>> locking, which is mostly just you whining in ignorance of the big
>> picture, to cache effects, which are obviously important. Nice try at
>> twisting the conversation. Again.
>
> You need to take a deep breath and try and understand that the focus of
> the conversation is Linux, not your ego or mine. Getting mad at me just
> wastes energy, stay focussed on the real issue, Linux.

So exactly what do you think is the problem? It seems to keep shifting
mysteriously. Name some patches that got accepted into mainline ... if
they're broken, that'll give us some clues about what is bad for the
future, and we can fix them.

>> One way to measure those changes poorly would be to do what you were
>> advocating earlier - look at one tiny metric of a microbenchmark, rather
>> than the actual throughput of the machine. So pardon me if I take your
>> concerns, and file them in the appropriate place.
>
> You apparently missed the point where I have said (a bunch of times):
> run the benchmarks you want, and report before- and after-patch
> cache miss counters for the same runs. Microbenchmarks would be
> a really bad way to do that; you really want to run a real application
> because you need it fighting for the cache.

One statistic (e.g. cache miss counters) isn't the big picture. If throughput
goes up or remains the same on all machines, that's what's important.

>> So you're trying to say that fine-grained locking ruins uniprocessor
>> performance now?
>
> I've been saying that for almost 10 years, check the archives.

And you haven't worked out that locks compile away to nothing on UP yet? I
think you might be better off pulling your head out of where it's currently
residing, and pointing it at the source code.

>> You just don't get it, do you? Your head is so vastly inflated that you
>> think everyone should run around researching whatever *you* happen to
>> think is interesting. Do your own benchmarking if you think it's a
>> problem.
>
> That's exactly what I'll do if you don't learn how to do it yourself. I'm
> astounded that any competent engineer wouldn't want to know the effects of
> their changes, I think you actually do but are just too pissed right now
> to see it.

Cool, I'd love to see some benchmarks ... and real throughput numbers from
them, not just microstatistics.

> See other postings on this one. All engineers in your position have said
> "we're just trying to get to N cpus where N = ~2x where we are today and
> it won't hurt uniprocessor performance". They *all* say that. And they
> all end up with a slow uniprocessor OS. Unlike security and a number of
> other invasive features, the SMP stuff can't be configed out or you end
> up with an #ifdef-ed mess like IRIX.

Try looking up "abstraction" in a dictionary. Linus doesn't take #ifdef's
in the main code.

M.

2003-02-24 17:55:52

by Timothy D. Witham

Subject: Re: Minutes from Feb 21 LSE Call

On Sat, 2003-02-22 at 08:13, Larry McVoy wrote:
> On Sat, Feb 22, 2003 at 07:47:53AM -0800, Martin J. Bligh wrote:
> > >> > Let's see, Dell has a $66B market cap, revenues of $8B/quarter and
> > >> > $500M/quarter in profit.
> > >>
> > >> While I understand these numbers are on the mark, there is a tertiary
> > >> issue to realize.
> > >>
> > >> Dell makes money on many things other than thin-margin PCs. And lo'
> > >> and behold one of those things is selling the larger Intel based
> > >> servers and support contracts to go along with that.
> > >
> > > I did some digging trying to find that ratio before I posted last night
> > > and couldn't. You obviously think that the servers are a significant
> > > part of their business. I'd be surprised at that, but that's cool,
> > > what are the numbers? PC's, monitors, disks, laptops, anything with less
> > > than 4 cpus is in the little bucket, so how much revenue does Dell generate
> > > on the 4 CPU and larger servers?
> >
> > It's not a question of revenue, it's one of profit. Very few people buy
> > desktops for use with Linux, compared to those that buy them for Windows.
> > The profit on each PC is small, thus I still think a substantial proportion
> > of the profit made by hardware vendors by Linux is on servers rather than
> > desktop PCs. The numbers will be smaller for high end machines, but the
> > profit margins are much higher.
>
> That's all handwaving and has no meaning without numbers. I could care less
> if Dell has 99.99% margins on their servers, if they only sell $50M of servers
> a quarter that is still less than 10% of their quarterly profit.
>
> So what are the actual *numbers*? Your point makes sense if and only if
> people sell lots of server. I spent a few minutes in google: world wide
> server sales are $40B at the moment. The overwhelming majority of that
> revenue is small servers. Let's say that Dell has 20% of that market,
> that's $2B/quarter. Now let's chop off the 1-2 CPU systems. I'll bet
> you long long odds that that is 90% of their revenue in the server space.
> Supposing that's right, that's $200M/quarter in big iron sales. Out of
> $8000M/quarter.
>
The numbers that I have seen are covered under an NDA so I can't put
them out, but an important point to note is that while there is a very
sharp decrease in the number of servers sold as you go higher up into
the price bands, the total $ in revenue is hourglass shaped, with
the neck being in the price band that corresponds to a 4-way server.

The total $ spent on the highest band of servers is about equal
to the total $ spent on the lowest price band of servers. But the
margins for the high end are much better than the margins for the
lowest band.

> I'd love to see data which is different than this but you'll have a tough
> time finding it. More and more companies are looking at the cost of
> big iron and deciding it doesn't make sense to spend $20K/CPU when they
> could be spending $1K/CPU. Look at Google, try selling them some big
> iron. Look at Wall Street - abandoning big iron as fast as they can.

Oh, you can see it, it will just cost you about $50,000 to get the
survey from the company that spends all the money putting it together.

On the size of the system, every system should be as big as it needs
to be. Some problems partition nicely, like Google, but other ones
do not, like accounts receivable. It all seems to come down to the
question, "Does the data _naturally_ partition?" If it does, then
you should either use lots of small servers or an S/390-type solution
with lots of instances. However, if the data doesn't naturally partition,
you should use one large machine, as you will spend more money on
people trying to manage the servers than you would have spent initially
on the hardware.

Also, you need to look at the backend systems in places like Wall
Street: those are big machines, have been for a long time, and
aren't changing out. But it doesn't make a good story.

Tim


--
Timothy D. Witham <[email protected]>
Open Source Development Lab, Inc

2003-02-24 18:27:43

by Andy Pfiffer

Subject: Re: Minutes from Feb 21 LSE Call

On Sat, 2003-02-22 at 15:28, Larry McVoy wrote:
> On Sat, Feb 22, 2003 at 02:17:39PM -0800, William Lee Irwin III wrote:
> > On Sat, Feb 22, 2003 at 05:06:27PM -0500, Mark Hahn wrote:
> > > ccNUMA worst-case latencies are not much different from decent
> > > cluster (message-passing) latencies.
> >
> > Not even close, by several orders of magnitude.
>
> Err, I think you're wrong. It's been a long time since I looked, but I'm
> pretty sure myrinet had single digit microseconds. Yup, google rocks,
> 7.6 usecs, user to user. Last I checked, Sequents worst case was around
> there, right?

FYI: The Intel/DOE ASCI Red system (>1 TFLOPS) delivered user-to-user
messaging of < 5us. With a tail wind, peak point-to-point data rates,
delivered from a user-mode buffer into another user-mode buffer anywhere
else on the system were just shy of 400 megabytes/second (actual rates
could be affected by several factors -- obviously).


Andy


2003-02-24 18:29:13

by Davide Libenzi

Subject: Re: Minutes from Feb 21 LSE Call

On Sun, 23 Feb 2003, Larry McVoy wrote:

> > Because I don't see why I should waste my time running benchmarks just to
> > prove you wrong. I don't respect you that much, and it seems the
> > maintainers don't either. When you become somebody with the stature in the
> > Linux community of, say, Linus or Andrew I'd be prepared to spend a lot
> > more time running benchmarks on any concerns you might have.
>
> Who cares if you respect me, what does that have to do with proper
> engineering? Do you think that I'm the only person who wants to see
> numbers? You think Linus doesn't care about this? Maybe you missed
> the whole IA32 vs IA64 instruction cache thread. It sure sounded like
> he cares. How about Alan? He stepped up and pointed out that less
> is more. How about Mark? He knows a thing or two about the topic?
> In fact, I think you'd be hard pressed to find anyone who wouldn't be
> interested in seeing the cache effects of a patch.
>
> People care about performance, both scaling up and scaling down. A lot of
> performance changes are measured poorly, in a way that makes the changes
> look good but doesn't expose the hidden costs of the change. What I'm
> saying is that those sorts of measurements screwed over performance in
> the past, why are you trying to repeat old mistakes?

Larry, how many times has this kind of discussion come up during the last
few years? I think you should remember pretty well, because it was always
you on that side of the river pushing back "Barbarians" with your UP sword.
The point is that people (especially young ones) like to dig where others
failed; it's normal. It's attractive, like honey for bears. Let them try:
many will fail, but chances are that someone will succeed, making it
worth the try. And trust Linus, who is more on your wavelength than on
the huge-scalability one.



- Davide

2003-02-24 18:26:55

by John W. M. Stevens

Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 08:17:16AM -0800, Larry McVoy wrote:
> On Sun, Feb 23, 2003 at 11:39:34PM -0800, Martin J. Bligh wrote:
>
> See other postings on this one. All engineers in your position have said
> "we're just trying to get to N cpus where N = ~2x where we are today and
> it won't hurt uniprocessor performance". They *all* say that. And they
> all end up with a slow uniprocessor OS. Unlike security and a number of
> other invasive features, the SMP stuff can't be configed out

Heck, you can't even configure it out on so-called UP systems.

The moment you introduce DMA into a system, you have an (admittedly,
constrained) SMP system.

And of course, simple interruption is another, constrained, kind of
"virtual SMP", yes?

Anybody who's done any USB HC programming is horribly aware of this
fact, trust me! ;-)

> or you end
> up with an #ifdef-ed mess like IRIX.

Why if-def it every where?

#ifdef SMP

#define lock( mutex ) smpLock( mutex )

#else

#define lock( mutex )

#endif

Do that once, use the lock macro, and forget about it (except in
cases where you have to worry about DMA, interruption, or some other
kind of MP, of course).

My (limited, only about 600 machines) experience is that Linux is
inevitably less stable on non-Intel and on non-UP machines. Before
worrying about scalability, my opinion is that getting the simplest
(dual-processor) machines as stable as UP machines first would be
both a better ROI and a good basis for higher levels of scalability.

Mind you, there is a perfectly simple reason that Linux is less
stable on non-Intel, non-UP machines: the Linux development
methodology pretty much makes this an emergent property.

Interesting discussion, though... from my experience, the commercial
Unices use fine-grained locking.

Luck,
John S.

2003-02-24 18:38:42

by Gerrit Huizenga

Subject: Re: Minutes from Feb 21 LSE Call

On Mon, 24 Feb 2003 09:25:33 MST, [email protected] wrote:
> It's interesting to me that the people supporting the scale-up do not
> carefully do such benchmarks, and indeed have a rather cavalier attitude
> to testing and benchmarking: or perhaps they don't think it's worth
> publishing.

I'm afraid it is the latter half that is closer to correct. Within
IBM's Linux Technology Center, we have a good sized performance team
and a tightly coupled set of developers who can internally share a
lot of real benchmark data. Unfortunately, the rules of SPEC and TPC
don't allow us to release data unless it is carefully (and time-
consumingly) audited, and IBM has a history of not dumping the output
of a few hundred runs of benchmarks out in the open and then claiming
that it is all valid, without doing a lot of internal validation first.

I'm sure other large companies doing Linux stuff have similar hurdles.
In some cases, ours are probably higher than average (IBM as an
entity has zero interest in pissing off the TPC or SPEC).

We do have a few papers out there, check OLS for the large database
workload one that steps through 2.4 performance changes (stock
2.4 vs. a set of patches we pushed to UL & RHAT) that increase
database performance about, oh, I forget, 5-fold... And there
is occasional other data sent out on web server stuff, some
microbenchmark data (see the continuing stream of data from mbligh,
for instance). Also, the contest data, OSDL data, etc. etc.
shows comparisons and trends for anyone who cares to pay attention.

It *would* be nice if someone could publish a compendium of performance
data, but that would be asking a lot...

gerrit

2003-02-24 21:01:13

by Andrea Arcangeli

Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 12:56:17AM -0800, Bill Huey wrote:
> On Mon, Feb 24, 2003 at 12:40:05AM -0800, Andrew Morton wrote:
> > There is no evidence for any such thing. Nor has any plausible
> > theory been put forward as to why such an improvement should occur.
>
> I find what you're saying rather unbelievable given some of the
> benchmarks I saw when the preempt patch started floating around.
>
> If you search linuxdevices.com for articles on preempt, you'll see a
> claim about IO performance improvements with the patch. If something's
> changed then I'd like to know.
>
> The numbers are here:
> http://kpreempt.sourceforge.net/

Most kernels out there are buggy w/o preempt; 2.4.21pre4aa3 has most of
the needed preemption checks in the kernel loops instead. It's quite
pointless to compare preempt against an otherwise buggy kernel.

Andrea

2003-02-24 21:31:21

by Andrea Arcangeli

Subject: Re: Minutes from Feb 21 LSE Call

On Sun, Feb 23, 2003 at 06:54:41PM -0800, Linus Torvalds wrote:
>
> On Sun, 23 Feb 2003, David Mosberger wrote:
> > >> 2 GHz Xeon: 701 SPECint
> > >> 1 GHz Itanium 2: 810 SPECint
> >
> > >> That is, Itanium 2 is 15% faster.
> >
> > Unfortunately, HP doesn't sell 1.5MB/1GHz Itanium 2 workstations, but
> > we can do some educated guessing:
> >
> > 1GHz Itanium 2, 3MB cache: 810 SPECint
> > 900MHz Itanium 2, 1.5MB cache: 674 SPECint
> >
> > Assuming pure frequency scaling, a 1GHz/1.5MB Itanium 2 would get
> > around 750 SPECint. In reality, it would get slightly less, but most
> > likely substantially more than 701.
>
> And as Dean pointed out:
>
> 2Ghz Xeon MP with 2MB L3 cache: 842 SPECint
>
> In other words, the P4 eats the Itanium for breakfast even if you limit it
> to 2GHz due to some "process" rule.
>
> And if you don't make up any silly rules, but simply look at "what's
> available today", you get
>
> 2.8Ghz Xeon MP with 2MB L3 cache: 907 SPECint
>
> or even better (much cheaper CPUs):
>
> 3.06 GHz P4 with 512kB L2 cache: 1074 SPECint
> AMD Athlon XP 2800+: 933 SPECint
>
> These are systems that you can buy today. With _less_ cache, and clearly
> much higher performance (comparing the best-performing published ia-64
> with the best P4 on specint, the P4 is 32% faster). Even with the "you
> can only run the P4 at 2GHz because that is all it ever ran at in 0.18"
> thing the ia-64 falls behind.

I agree; especially the cache difference makes any comparison
uninteresting to my eyes (it's similar to running dbench with different
pagecache sizes and comparing the results). But I have a side note on
these matters in favour of the 64bit platforms. I could be wrong, but
AFAIK some of the specint testcases generate double the data memory
footprint if compiled 64bit, so I guess some of the testcases should
really be called speclong and not specint. (However, I don't think those
testcases alone can explain a global 32% difference, but still there
would be some difference in favour of the 32bit platform.)

So in short, I currently believe specint is not a good benchmark for
comparing a 64bit cpu to a 32bit cpu; a 64bit cpu can only lose in
specint if the cpu is otherwise exactly the same but the data 'longs'
are changed to 64bit. To do a really fair comparison one should first
change the source, replacing every "long" with either a "long long" or
an "int"; only then will it be fair to compare specint results between
32bit and 64bit cpus.

I never used specint myself, so don't ask me for more details on this,
and again I could be wrong. But really, if I'm right, somebody should go
over the source and make a kind of unofficial (but official) patch
available so people can generate a specint testsuite usable for comparing
32bit with 64bit results, or lots of effort will be wasted by people
attempting the impossible. I mean, if the memory bus is the same
hardware in both the 32bit and 64bit runs, the double memory footprint
will run slower and there's nothing the OS or the hardware can do about
it (and dozens of mbytes of ram won't fit in l1 cache, not even on the
itanium 8). The benchmark suite really must be fixed to ensure the 32bit
and 64bit compilations generate the same _data_ memory footprint if
one wants to make comparisons between the two.

Andrea

2003-02-24 23:09:22

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Mon, 2003-02-24 at 05:06, William Lee Irwin III wrote:
>> Try 4 or 8 mkfs's in parallel on a 4x box running virgin 2.4.x.

On Mon, Feb 24, 2003 at 03:06:53PM +0000, Alan Cox wrote:
> You have strange ideas of typical workloads. The parallel mkfs one is a
> good one though, because it's also a lot better on one CPU in 2.5

The results I saw were that this did not affect 2.5 in any interesting
way and 2.4 behaved "very badly".

It's a simple way to get lots of disk io going without a complex
benchmark. There are good reasons, and real workloads, behind the things
that were done to fix this.


-- wli

2003-02-24 23:11:03

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 12:50:31AM -0800, William Lee Irwin III wrote:
>> There's a vague notion in my head that it should decrease scheduling

On Mon, Feb 24, 2003 at 09:17:58AM -0700, [email protected] wrote:
> Vague notions seem to be the level of data on this topic.

Which, if you had bothered reading the rest of my post, is why I asked
for data.


-- wli

2003-02-24 23:27:54

by Victor Yodaiken

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 03:13:41PM -0800, William Lee Irwin III wrote:
> On Mon, Feb 24, 2003 at 12:50:31AM -0800, William Lee Irwin III wrote:
> >> There's a vague notion in my head that it should decrease scheduling
>
> On Mon, Feb 24, 2003 at 09:17:58AM -0700, [email protected] wrote:
> > Vague notions seem to be the level of data on this topic.
>
> Which, if you had bothered reading the rest of my post, is why I asked
> for data.

I'm not sure what you are complaining about. I don't think there is good
or even marginal data or explanations of this "effect".




--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
1+ 505 838 9109

2003-02-24 23:27:26

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sun, Feb 23, 2003 at 11:51:42PM -0800, William Lee Irwin III wrote:
>> Now it's time to turn the question back around on you. Why do you not
>> want Linux to work well on a broader range of systems than it does now?

On Mon, Feb 24, 2003 at 07:47:25AM -0800, Larry McVoy wrote:
> I never said that I didn't. I'm just taking issue with the chosen path
> which has been demonstrated not to work.
> "Let's scale Linux by multi threading"
> "Err, that really sucked for everyone who has tried it in the past, all
> the code paths got long and uniprocessor performance suffered"
> "Oh, but we won't do that, that would be bad".
> "Great, how about you measure the changes carefully and really show that?"
> "We don't need to measure the changes, we know we'll do it right".

The changes are getting measured. By and large, if it's slower on UP
it's rejected. There's a dedicated benchmark crew, of which Randy Hron
is an important member, that benchmarks such things very consistently.
Internal benchmarking includes both free and non-free benchmarks; dbench,
tiobench, kernel compiles, contest, and so on are the publishable bits.

Also, code paths are not necessarily getting longer. Single-
threaded efficiency lowers lock hold time and helps small systems too,
and numerous improvements with buffer_heads, task searching, file
truncation, and the like, are of that flavor.


On Mon, Feb 24, 2003 at 07:47:25AM -0800, Larry McVoy wrote:
> And just like every other time this comes up in every other engineering
> organization, the focus is on 2x wherever we are today. It is *never*
> about getting to 100x or 1000x.
> If you were looking at the problem assuming that the same code had to
> run on uniprocessor and a 1000 way smp, right now, today, and designing
> for it, I doubt very much we'd have anything to argue about. A lot of
> what I'm saying starts to become obviously true as you increase the
> number of CPUs but engineers are always seduced into making it go 2x
> farther than it does today. Unfortunately, each of those 2x increases
> comes at some cost and they add up.

Linux is a patchwork kernel. No coherent design will ever shine through.
Scaling the kernel incrementally merely becomes that much more difficult.
The small system performance standards aren't getting lowered.

Also note there are various efforts to scale the kernel _downward_ to
smaller embedded systems, partly by controlling "bloated" hash tables'
sizes and partly by making major subsystems optional and partly by
supporting systems with no MMU. This is not a one-way street, though I
myself am clearly pointed in the upward direction.


-- wli

2003-02-24 23:45:25

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 03:13:41PM -0800, William Lee Irwin III wrote:
>> Which, if you had bothered reading the rest of my post, is why I asked
>> for data.

On Mon, Feb 24, 2003 at 04:27:54PM -0700, [email protected] wrote:
> I'm not sure what you are complaining about. I don't think there is good
> or even marginal data or explanations of this "effect".

I'm complaining about being quoted out of context and the animus against
unsupported preempt claims being directed against me.

Re-stating preempt's "ostensible purpose" is the purpose of the "vague
notion", not adding to the pile of speculation.

For the data, akpm has apparently tracked scheduling latency, so there
is a chance he actually knows whether it's serving its ostensible
purpose as opposed to having a large stockpile of overwrought wisecracks
and a propensity for quoting out of context.


-- wli

2003-02-24 23:54:26

by Victor Yodaiken

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 03:54:33PM -0800, William Lee Irwin III wrote:
> On Mon, Feb 24, 2003 at 03:13:41PM -0800, William Lee Irwin III wrote:
> >> Which, if you had bothered reading the rest of my post, is why I asked
> >> for data.
>
> On Mon, Feb 24, 2003 at 04:27:54PM -0700, [email protected] wrote:
> > I'm not sure what you are complaining about. I don't think there is good
> > or even marginal data or explanations of this "effect".
>
> I'm complaining about being quoted out of context and the animus against
> unsupported preempt claims being directed against me.

I did not quote you out of context.

> For the data, akpm has apparently tracked scheduling latency, so there
> is a chance he actually knows whether it's serving its ostensible
> purpose as opposed to having a large stockpile of overwrought wisecracks
> and a propensity for quoting out of context.

You seem determined to pick a fight. Goodbye.



--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
1+ 505 838 9109

2003-02-25 00:13:02

by Larry McVoy

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

> The changes are getting measured. By and large if it's slower on UP
> it's rejected.

Suppose I have an application whose working set exactly fits in the
I+D caches, including the related OS stuff.

Someone makes some change to the OS; the benchmark for that change is
smaller than the I+D caches, but the change increases the I+D cache
space needed.

The benchmark will not show any slowdown, correct?
My application no longer fits and will suffer, correct?

The point is that if you are putting SMP changes into the system, you
have to be held to a higher standard for measurement given the past
track record of SMP changes increasing code length and cache footprints.
So "measuring" doesn't mean "it's not slower on XYZ microbenchmark".
It means "under the following work loads the cache misses went down or
stayed the same for before and after tests".

And if you said that all changes should be held to this standard, not
just scaling changes, I'd agree with you. But scaling changes are the
"bad guy" in my mind, they are not to be trusted, so they should be held
to this standard first. If we can get everyone to step up to the plate,
that's all to the good.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-25 00:31:01

by Larry McVoy

[permalink] [raw]
Subject: Server shipments [was Re: Minutes from Feb 21 LSE Call]

More data from news.com.

Dell has 19% of the server market with $531M/quarter in sales[1] over
212,750 machines per quarter[2].

That means that the average sale price for a server from Dell was $2495.

The average sale price of all servers from all companies is $9347.

I still don't see the big profits touted by the scaling fanatics, anyone
care to explain it?

[1] http://news.com.com/2100-1001-983892.html
[2] http://news.com.com/2100-1001-982004.html
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-25 00:44:43

by Larry McVoy

[permalink] [raw]
Subject: Re: Server shipments [was Re: Minutes from Feb 21 LSE Call]

On Mon, Feb 24, 2003 at 04:41:04PM -0800, Martin J. Bligh wrote:
> > More data from news.com.
> >
> > Dell has 19% of the server market with $531M/quarter in sales[1] over
> > 212,750 machines per quarter[2].
> >
> > That means that the average sale price for a server from Dell was $2495.
> >
> > The average sale price of all servers from all companies is $9347.
> >
> > I still don't see the big profits touted by the scaling fanatics, anyone
> > care to explain it?
>
> Sigh. If you're so convinced that there's no money in larger systems,
> why don't you write to Sam Palmisano and explain to him the error of
> his ways? I'm sure IBM has absolutely no market data to go on ...

Numbers talk, bullshit walks. Got shoes?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-25 00:41:15

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Server shipments [was Re: Minutes from Feb 21 LSE Call]

> More data from news.com.
>
> Dell has 19% of the server market with $531M/quarter in sales[1] over
> 212,750 machines per quarter[2].
>
> That means that the average sale price for a server from Dell was $2495.
>
> The average sale price of all servers from all companies is $9347.
>
> I still don't see the big profits touted by the scaling fanatics, anyone
> care to explain it?

Sigh. If you're so convinced that there's no money in larger systems,
why don't you write to Sam Palmisano and explain to him the error of
his ways? I'm sure IBM has absolutely no market data to go on ...

If only he could receive an explanation of the error of his ways from
Larry McVoy, I'm sure he'd turn the ship around, for you obviously have
all the facts, figures, and experience of the server market to make this
kind of decision. I await the email from our CEO that tells us how
much he respects you, and has taken this decision at your bidding.

M.

2003-02-25 01:00:35

by David Lang

[permalink] [raw]
Subject: Re: Server shipments [was Re: Minutes from Feb 21 LSE Call]

If you want to say that sales of Linux servers generate more profits than
sales of Linux desktops, then you have a chance of being right; not
because the server market is so large, but because the desktop market is
so small.

However, if the Linux desktop were to get 10% of the market in terms of
new sales (it's already in the 5%-7% range according to some reports, but
a large percentage of that is repurposed Windows desktops), then the
sales and profits of the desktops would easily outclass the sales and
profits of servers due to the sheer volume.

IBM and Sun make a lot of sales on the theory that their machines (you
know, the ones in fancy PC cases with PC power supplies and IDE drives)
are somehow more reliable than an x86 machine. As people really start
analysing the cost/performance of the machines, and implement HA because
they need 24x7 coverage and even the big boys' boxes need to be updated,
they realize that they can buy multiple cheap boxes and get HA for less
than the cost of buying the one 'professional' box (in some cases they
can afford to buy the multiple smaller boxes and replace them every year
for less than the cost of the professional box over 3 years). And as
more folks use Linux on the small(er) machines, it breaks down the risk
barrier.

One of the big reasons people have traditionally used small numbers of
large boxes is that the licensing costs have been significant; well,
Linux doesn't have a per-server license cost (unless you really want to
pay one), so that's no longer an issue either.

There are some jobs that require large machines instead of clusters;
databases are still one of them (at least as far as I have been able to
learn). But a lot of other jobs are being moved to multiple smaller
boxes (or to multiple logical boxes on one large box, which is what
Larry is advocating), and in spite of the doomsayers the problems are
being worked out. (Can you imagine the reaction from telling a sysadmin
team managing one server in 1970 that in 2000 a similar-sized team would
be managing hundreds or thousands of servers, a la Google? :-) Yes, it
takes planning and discipline, but it's not nearly as hard as people
imagine before they get started down that path.)

David Lang

On Mon, 24 Feb 2003, Martin J. Bligh wrote:

> Date: Mon, 24 Feb 2003 16:41:04 -0800
> From: Martin J. Bligh <[email protected]>
> To: Larry McVoy <[email protected]>
> Cc: [email protected]
> Subject: Re: Server shipments [was Re: Minutes from Feb 21 LSE Call]
>
> > More data from news.com.
> >
> > Dell has 19% of the server market with $531M/quarter in sales[1] over
> > 212,750 machines per quarter[2].
> >
> > That means that the average sale price for a server from Dell was $2495.
> >
> > The average sale price of all servers from all companies is $9347.
> >
> > I still don't see the big profits touted by the scaling fanatics, anyone
> > care to explain it?
>
> Sigh. If you're so convinced that there's no money in larger systems,
> why don't you write to Sam Palmisano and explain to him the error of
> his ways? I'm sure IBM has absolutely no market data to go on ...
>
> If only he could receive an explanation of the error of his ways from
> Larry McVoy, I'm sure he'd turn the ship around, for you obviously have
> all the facts, figures, and experience of the server market to make this
> kind of decision. I await the email from our CEO that tells us how
> much he respects you, and has taken this decision at your bidding.
>
> M.
>

2003-02-25 01:41:16

by Craig Thomas

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call - publishing performance data

On Mon, 2003-02-24 at 10:20, Gerrit Huizenga wrote:

>
> We do have a few papers out there, check OLS for the large database
> workload one that steps through 2.4 performance changes (stock
> 2.4 vs. a set of patches we pushed to UL & RHAT) that increase
> database performance about, oh, I forget, 5-fold... And there
> is occasional other data sent out on web server stuff, some
> microbenchmark data (see the continuing stream of data from mbligh,
> for instance). Also, the contest data, OSDL data, etc. etc.
> shows comparisons and trends for anyone who cares to pay attention.
>
> It *would* be nice if someone could publish a compendium of performance
> data, but that would be asking a lot...
>
> gerrit
> -

OSDL is trying to provide something like this for the 2.5 kernel;
providing this sort of data is very much in our interest. We have been
building database workload information and generating test results from
our STP test framework.

We are in the midst of creating content for a Linux Stability Results
web page: http://www.osdl.org/projects/26lnxstblztn/results/ There is
a great desire on our part to share good performance data for the kernel
as it evolves. I would like to ask you guys what you would like to see
on a page like this. I feel that we could create a single site where
anyone can get access to performance and reliability information about
the Linux kernel as we move toward the 2.6 version.

The page is set up now so that anyone can contribute content to the page
by editing an html template file to point to test and performance data.
If anyone is interested in this concept, email me privately or
[email protected]


--
Craig Thomas <[email protected]>
OSDL

2003-02-25 02:00:39

by Bill Huey

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 09:17:58AM -0700, [email protected] wrote:
> On Mon, Feb 24, 2003 at 12:50:31AM -0800, William Lee Irwin III wrote:
> > There's a vague notion in my head that it should decrease scheduling
>
> Vague notions seem to be the level of data on this topic.

Ok, replace "vague notion" with the latency and scheduling concepts that
everybody except you understands and you'll be a bit more relevant.

It's not even about the IO system; it's about producer-consumer
relationships between threads and some kind of generic IPC mechanism.
You'd run into the same problems with two threads communicating under a
priority-capable scheduler, since the temporal granularity of "things
that the scheduler manages" gets clobbered by inherently brain-damaged
locking.

Say, how would the scheduler properly order the priority relationships
for a non-preemptable thread that holds a critical section for 100ms in
an extreme (or even a normal) case?

The effectiveness of the scheduler in these cases would be meaningless.
Shit, just replace that SOB with a stochastic-insert round-robin system
and it'll be just as effective if the current state of Linux locking
stays in place. There's probably more truth than exaggeration in that,
from what I've seen both in the code and running Linux as a desktop OS.

> Victor Yodaiken

bill


2003-02-25 01:50:30

by Tupshin Harper

[permalink] [raw]
Subject: Re: Server shipments [was Re: Minutes from Feb 21 LSE Call]

This conversation has not only gotten out of hand, it's gotten quite
silly. People are arguing semantics and relative economic value where a
few simple assertions should do:

1) There is a significant interest from developers and users in having
Linux run efficiently on *small* platforms.
2) There is a significant interest from developers and users in having
Linux run efficiently on *large* platforms.
3) There is disagreement on whether it is possible to accomplish 1 and 2
simultaneously.
4) There is disagreement on whether adequate testing is taking place to
make sure 2 doesn't degrade 1(or vice versa).

This leads to two choices:
a) Fork. Obviously to be avoided at all reasonable costs.
b) Identify reasonable improvements to the testing methodology so that
any design conflicts are identified immediately instead of gradually
accumulating and degrading performance over time.

I vote b (surprise, surprise); however, this just changes the debate to
"what is a reasonable testing methodology?" That is a debate much more
worth having than "who ships more of what" and "who said what when".

Given that a fairly thorough performance testing suite is already in
place, it would seem to be up to the advocates for the "threatened"
computing environment (large or small) to convince the "testers that be"
that certain tests should be added. It is inherently unreasonable to
expect the developer of a feature/change to be unbiased and neutral with
respect to that feature, therefore it is unreasonable to expect them to
prove beyond a reasonable doubt that their feature has no negative
impact. The best that they can do is convince themselves that the
feature passes the really deep sniff test. The rest is up to the
community. The ability of a third party to critique code changes is a
large part of why the bazaar nature of linux development is so valuable.

-Tupshin


2003-02-25 02:07:30

by Bill Huey

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 04:27:54PM -0700, [email protected] wrote:
> I'm not sure what you are complaining about. I don't think there is good
> or even marginal data or explanations of this "effect".

You don't need data. It's conceptually obvious. If you have a higher
priority thread that's not running because another thread of lower
priority is hogging the CPU for some unknown operation in the kernel,
then you're going to be less able to respond to external events from the
IO system and elsewhere, with respect to a Unix-style priority scheduler.

That's why we have fully preemptive RTOSes to deal with that, and
priority inheritance, both of which are fundamental to any kind of
fixed-priority RTOS.

If your scheduler is scheduling crap, then it's not going to be very
effective at scheduling...

Rhetorical question... what the hell do you think this is about ?

http://linuxdevices.com/articles/AT5698775833.html

It's about getting relationships inside the kernel to respect, and be
controllable by, the scheduler in some formal manner, not some random
not-so-well-thought-out hack of the day.

bill

2003-02-25 02:04:20

by Larry McVoy

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 06:07:30PM -0800, Bill Huey wrote:
> On Mon, Feb 24, 2003 at 09:17:58AM -0700, [email protected] wrote:
> > On Mon, Feb 24, 2003 at 12:50:31AM -0800, William Lee Irwin III wrote:
> > > There's a vague notion in my head that it should decrease scheduling
> >
> > Vague notions seem to be the level of data on this topic.
>
> Ok, replace "vague notion" with latency and scheduling concepts that
> everybody else except you understands and you'll be a bit more relevant.

Victor has forgotten more than most people know about operating systems.
Dig into his background, he tends to know what he is talking about even
if he is a little terse at times.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-25 02:14:33

by Bill Huey

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 06:14:26PM -0800, Larry McVoy wrote:
> > Ok, replace "vague notion" with latency and scheduling concepts that
> > everybody else except you understands and you'll be a bit more relevant.
>
> Victor has forgotten more than most people know about operating systems.
> Dig into his background, he tends to know what he is talking about even
> if he is a little terse at times.

But apparently what he knows is not very modern. I'm no slouch either,
being a former BSDi (the original Unix folks) engineer, but I don't go
dismissing folks implicitly like he did to "The Will", William Irwin,
while adding nothing usable to the conversation. There's just no excuse
for that from an adult running a company, or in a public forum that's
discussing these very important issues.

Frankly, I don't care what he has done or what traditional so-called
"Unix folks" think. Even FreeBSD's SMPng project, using BSD/OS's 5.0
code, deals with these issues respectfully. Those old-school Unix folks
seem to have a much more modern attitude towards this stuff than either
you or Victor.

bill

2003-02-25 02:25:00

by Victor Yodaiken

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 06:17:36PM -0800, Bill Huey wrote:
> On Mon, Feb 24, 2003 at 04:27:54PM -0700, [email protected] wrote:
> > I'm not sure what you are complaining about. I don't think there is good
> > or even marginal data or explanations of this "effect".
>
> You don't need data. It's conceptually obvious. If you have a higher

Oh. Well that makes things clear enough. Goodbye.

2003-02-25 02:22:17

by Larry McVoy

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

> Rhetorical question... what the hell do you think this is about ?
>
> http://linuxdevices.com/articles/AT5698775833.html

Hmm, maybe someone who is advertising their company's mistaken approach?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-25 02:25:53

by Bill Huey

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 07:24:45PM -0700, [email protected] wrote:
> Oh. Well that makes things clear enough. Goodbye.

It's completely clear.

Ok, now I know you're a completely screwed, narrow-minded asshole. Good
grief.

bill

2003-02-25 02:33:14

by Bill Huey

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 07:24:45PM -0700, [email protected] wrote:
> Oh. Well that makes things clear enough. Goodbye.

I'd be worried about why you don't have a competent reply to this
article, again:

http://linuxdevices.com/articles/AT5698775833.html

Whether or not you, Larry, and other so-called Unix traditionalists
realize it, "resource kernels" from folks like CMU's RTOS group are
going to rule you and the rest of the RT community. It's the future.

bill

2003-02-25 02:27:02

by Werner Almesberger

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

Larry McVoy wrote:
> The point is that if you are putting SMP changes into the system, you
> have to be held to a higher standard for measurement given the past
> track record of SMP changes increasing code length and cache footprints.

So you probably want to run this benchmark on a synthetic CPU a la
cachegrind. The difficult part would be to come up with a reasonably
understandable additive metric for cache pressure.

(I guess there goes another call to arms to academia :-)

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2003-02-25 02:30:26

by Bill Huey

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 06:32:26PM -0800, Larry McVoy wrote:
> Hmm, maybe someone who is advertising their companies mistaken approach?

Or maybe your understanding of this has faded and you haven't kept up
with your generational contemporaries, like our BSD/OS engineers, on
schedulers, preemption and priority inheritance.

Again, assuming that you actually understand what this means read this:
http://linuxdevices.com/articles/AT5698775833.html

...because I don't think you really do understand it.

bill

2003-02-25 02:28:34

by Hans Reiser

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

I expect to have 16-32 CPUs in my $3000 desktop in 5 years. If you all
start planning for that now, you might get it debugged before it happens
to me. ;-)

I don't expect to connect the 16-32 CPUs with ethernet.... but it won't
surprise me if they have non-uniform memory.

It is just a matter of time before the users need Reiser4 to be highly
scalable, and I don't want to rewrite when they do, so we are worrying
about it now.

--
Hans


2003-02-25 02:50:39

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Server shipments [was Re: Minutes from Feb 21 LSE Call]

>> > More data from news.com.
>> >
>> > Dell has 19% of the server market with $531M/quarter in sales[1] over
>> > 212,750 machines per quarter[2].
>> >
>> > That means that the average sale price for a server from Dell was
>> > $2495.
>> >
>> > The average sale price of all servers from all companies is $9347.
>> >
>> > I still don't see the big profits touted by the scaling fanatics,
>> > anyone care to explain it?
>>
>> Sigh. If you're so convinced that there's no money in larger systems,
>> why don't you write to Sam Palmisano and explain to him the error of
>> his ways? I'm sure IBM has absolutely no market data to go on ...
>
> Numbers talk, bullshit walks. Got shoes?

Bullshit numbers walk too. Remember the context? Linux.
Linux servers vs. Linux desktops. If you think the Linux desktop market is
large, I'd like some of whatever you're smoking, as it's obviously good
stuff.

I think there's money in big iron, you don't seem to. That's fine, you're
not paying my salary (thank $deity).

Perhaps a person with the slightest understanding of basic arithmetic would
see that this:

>> > That means that the average sale price for a server from Dell was
>> > $2495.
>> >
>> > The average sale price of all servers from all companies is $9347.

means that somebody other than Dell is making the money on the big servers.
As Dell is a PC company, no real surprise there.

By the way ... you remember when I said that Linux could scale upwards
without hurting the low end? And that the reason we'd succeed in that where
Solaris et al failed was because the development model was different?

When you said you'd go run some UP benchmarks, that's *exactly* where the
development model is different. It's open enough that you can go do that
sort of thing, and if errors are made, you can point them out. I honestly
welcome the benchmark results you provide ... it's the strength of the
system.

M.

2003-02-25 03:03:21

by Larry McVoy

[permalink] [raw]
Subject: Re: Server shipments [was Re: Minutes from Feb 21 LSE Call]

On Mon, Feb 24, 2003 at 07:00:42PM -0800, Martin J. Bligh wrote:
> >> > That means that the average sale price for a server from Dell was
> >> > $2495.
> >> >
> >> > The average sale price of all servers from all companies is $9347.
>
> means that somebody other than Dell is making the money on the big servers.

What part of "all servers from all companies" did you not understand?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-25 03:39:32

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

> I expect to have 16-32 CPUs in my $3000 desktop in 5 years . If you all
> start planning for that now, you might get it debugged before it happens
> to me.;-)

Thank you ... some sanity amongst the crowd

> I don't expect to connect the 16-32 CPUs with ethernet.... but it won't
> surprise me if they have non-uniform memory.

Indeed. Just look at AMD hammer for NUMA effects, and SMT and multiple
chip on die technologies for the way things are going.

M.


2003-02-25 03:43:56

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Server shipments [was Re: Minutes from Feb 21 LSE Call]

> Given that a fairly thorough performance testing suite is already in
> place, it would seem to be up to the advocates for the "threatened"
> computing environment (large or small) to convince the "testers that be"
> that certain tests should be added. It is inherently unreasonable to
> expect the developer of a feature/change to be unbiased and neutral with
> respect to that feature, therefore it is unreasonable to expect them to
> prove beyond a reasonable doubt that their feature has no negative
> impact. The best that they can do is convince themselves that the
> feature passes the really deep sniff test. The rest is up to the
> community. The ability of a third party to critique code changes is a
> large part of why the bazaar nature of linux development is so valuable.

An excellent and well thought out summary, and exactly why I welcome
Larry's proposal to do some testing and produce specific numbers on
specific patches instead of hand-waving and spreading FUD. This kind of
arrangement is exactly why the open development model will allow Linux to
win out in the long term.

M.

2003-02-25 04:01:16

by Martin J. Bligh

Subject: Re: Server shipments [was Re: Minutes from Feb 21 LSE Call]

>> >> > That means that the average sale price for a server from Dell was
>> >> > $2495.
>> >> >
>> >> > The average sale price of all servers from all companies is $9347.
>>
>> means that somebody other than Dell is making the money on the big
>> servers.
>
> What part of "all servers from all companies" did you not understand?

Average price from Dell: $2495
Average price overall: $9347

Conclusion ... Dell makes cheaper servers than average, presumably smaller.

M.

2003-02-25 04:06:57

by Larry McVoy

Subject: Re: Server shipments [was Re: Minutes from Feb 21 LSE Call]

On Mon, Feb 24, 2003 at 08:11:21PM -0800, Martin J. Bligh wrote:
> > What part of "all servers from all companies" did you not understand?
>
> Average price from Dell: $2495
> Average price overall: $9347
>
> Conclusion ... Dell makes cheaper servers than average, presumably smaller.

So how many CPUs do you think you get in a $9K server?

Better yet, since you work for IBM, how many servers do they ship in a year
with 16 CPUs?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-25 04:11:48

by Martin J. Bligh

Subject: Re: Server shipments [was Re: Minutes from Feb 21 LSE Call]

>> > What part of "all servers from all companies" did you not understand?
>>
>> Average price from Dell: $2495
>> Average price overall: $9347
>>
>> Conclusion ... Dell makes cheaper servers than average, presumably
>> smaller.
>
> So how many CPUs do you think you get in a $9K server?

Not sure. Average by price is probably 4 or a little over.

> Better yet, since you work for IBM, how many servers do they ship in a
> year with 16 CPUs?

Will look. If I can find that data, and it's releasable, I'll send it out.
What's more interesting is how much money they make on machines with, say,
more than 4 CPUs. But I doubt I'll be allowed to release that info ;-)

M.

2003-02-25 04:27:45

by Larry McVoy

Subject: Re: Server shipments [was Re: Minutes from Feb 21 LSE Call]

On Mon, Feb 24, 2003 at 08:21:57PM -0800, Martin J. Bligh wrote:
> > So how many CPUs do you think you get in a $9K server?
>
> Not sure. Average by price is probably 4 or a little over.

Nope. For $12K you can get
4x 1.9Ghz
512MB
No networking
1 disk
No operating system

That's as cheap as it gets. And I don't know about you, but I have a
tough time believing that anyone buys a 4 CPU box without an OS, without
networking, with .5GB of ram, and with one disk.

If you think you are getting a realistic 4 CPU server for $9K from
a vendor, you're dreaming.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-25 04:33:30

by William Lee Irwin III

Subject: Re: Minutes from Feb 21 LSE Call

At some point in the past, I wrote:
>> The changes are getting measured. By and large if it's slower on UP
>> it's rejected.

On Mon, Feb 24, 2003 at 04:23:09PM -0800, Larry McVoy wrote:
> Suppose I have an application which has a working set which just exactly
> fits in the I+D caches, including the related OS stuff.
> Someone makes some change to the OS and the benchmark for that change is
> smaller than the I+D caches but the change increased the I+D cache space
> needed.
> The benchmark will not show any slowdown, correct?
> My application no longer fits and will suffer, correct?

Well, it's often clear from the code whether it'll have a larger cache
footprint or not, so it's probably not that large a problem. OTOH it is
a real problem that little cache or TLB profiling is going on. I tried
once or twice and actually came up with a function or two that should
be inlined instead of uninlined in very short order. Much low-hanging
fruit could be gleaned from those kinds of profiles.

It's also worthwhile noting increased cache footprints are actually
very often degradations on SMP and especially NUMA. The notion that
optimizing for SMP and/or NUMA involves increasing cache footprint
on anything doesn't really sound plausible, though I'll admit that
the mistake of trusting microbenchmarks too far on SMP has probably
already been committed at least once. Userspace owns the cache; using
cache for the kernel is "cache pollution", which should be minimized.
Going too far out on the space end of time/space tradeoff curves is
every bit as bad for SMP as UP, and really horrible for NUMA.


On Mon, Feb 24, 2003 at 04:23:09PM -0800, Larry McVoy wrote:
> The point is that if you are putting SMP changes into the system, you
> have to be held to a higher standard for measurement given the past
> track record of SMP changes increasing code length and cache footprints.
> So "measuring" doesn't mean "it's not slower on XYZ microbenchmark".
> It means "under the following work loads the cache misses went down or
> stayed the same for before and after tests".

This kind of measurement is actually relatively unusual. I'm definitely
interested in it, as there appear to be some deficits wrt. locality of
reference that show up as big profile spikes on NUMA boxen. With care
exercised, good solutions should trim down cache misses on UP as well.
Cache and TLB miss profile driven development sounds very attractive.


On Mon, Feb 24, 2003 at 04:23:09PM -0800, Larry McVoy wrote:
> And if you said that all changes should be held to this standard, not
> just scaling changes, I'd agree with you. But scaling changes are the
> "bad guy" in my mind, they are not to be trusted, so they should be held
> to this standard first. If we can get everyone to step up to this bat,
> that's all to the good.

Let me put it this way: IBM sells tiny boxen too, from 4x, to UP, to
whatever. And people are simultaneously actively trying to scale
downward to embedded bacteria or whatever. So the small systems are
being neither ignored nor sacrificed for anything else.


-- wli

2003-02-25 04:43:53

by Larry McVoy

Subject: Re: Minutes from Feb 21 LSE Call

> Userspace owns the cache; using
> cache for the kernel is "cache pollution", which should be minimized.
> Going too far out on the space end of time/space tradeoff curves is
> every bit as bad for SMP as UP, and really horrible for NUMA.

Cool, I agree 100% with this.

> > So "measuring" doesn't mean "it's not slower on XYZ microbenchmark".
> > It means "under the following work loads the cache misses went down or
> > stayed the same for before and after tests".
>
> This kind of measurement is actually relatively unusual. I'm definitely
> interested in it, as there appear to be some deficits wrt. locality of
> reference that show up as big profile spikes on NUMA boxen. With care
> exercised, good solutions should trim down cache misses on UP as well.
> Cache and TLB miss profile driven development sounds very attractive.

Again, I'm with you all the way on this. If the scale up guys can adopt
this as a mantra, I'm a lot less concerned that anything bad will happen.

Tim at OSDL and I have been talking about trying to work out some benchmarks
to test for this. I came up with the idea of adding a "-s XXX" which means
"touch XXX bytes between each iteration" to each LMbench test. One problem
is the lack of page coloring will make the numbers bounce around too much.
We talked that over with Linus and he suggested using the big TLB hack to
get around that. Assuming we can deal with the page coloring, do you think
that there is any merit in taking microbenchmarks, adding an artificial
working set, and running those?
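Larry's "-s XXX" idea can be sketched as below. This is a hypothetical illustration, not LMbench's actual code; the names `CACHE_LINE` and `touch_working_set` are invented here for the sketch.

```c
#include <stddef.h>

/* Sketch of the proposed "-s XXX" option: between benchmark
 * iterations, walk an XXX-byte buffer one cache line at a time so
 * each iteration starts with the caches holding user data.  The
 * slowdown relative to "-s 0" then approximates how much cache the
 * kernel path under test consumed. */
#define CACHE_LINE 64

static volatile unsigned char sink;

/* Touch one byte per cache line; returns the number of lines touched. */
size_t touch_working_set(const unsigned char *buf, size_t bytes)
{
    size_t lines = 0;
    for (size_t i = 0; i < bytes; i += CACHE_LINE, lines++)
        sink = buf[i];          /* one read per cache line */
    return lines;
}
```

A benchmark loop would then alternate the operation under test with `touch_working_set(buf, opt_s_bytes)`; as Larry notes, without page coloring the buffer's physical placement varies run to run, which is exactly the reproducibility problem discussed below.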

> Let me put it this way: IBM sells tiny boxen too, from 4x, to UP, to
> whatever. And people are simultaneously actively trying to scale
> downward to embedded bacteria or whatever.

That's really great, I know it's a lot less sexy but it's important.
I'd love to see as much attention on making Linux work on tiny embedded
platforms as there is on making it work on big iron. Small is cool too.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-25 05:03:24

by Steven Cole

Subject: Re: Minutes from Feb 21 LSE Call

On Mon, 2003-02-24 at 20:49, Martin J. Bligh wrote:
> > I expect to have 16-32 CPUs in my $3000 desktop in 5 years. If you all
> > start planning for that now, you might get it debugged before it happens
> > to me. ;-)
>
> Thank you ... some sanity amongst the crowd
>
> > I don't expect to connect the 16-32 CPUs with ethernet.... but it won't
> > surprise me if they have non-uniform memory.
>
> Indeed. Just look at AMD hammer for NUMA effects, and SMT and multiple
> chip on die technologies for the way things are going.
>
> M.

Hans may have 32 CPUs in his $3000 box, and I expect to have 8 CPUs in
my $500 Walmart special 5 or 6 years hence. And multiple chip on die
along with HT is what will make it possible.

What concerns me is that this will make it possible to put insane
numbers of CPUs in those $250,000 and higher boxes. If Martin et al can
scale Linux to 64 CPUs, can they make it scale several binary orders of
magnitude higher? Why do this? NUMA memory is much faster than even
very fast network connections any day.

Is there a market for such a thing? I won't pretend to know that
answer. But the capability to do it will be there, and in 5 years the
3.2 kernel probably won't be quite stable yet, so decisions made in the
next year for 2.9/3.0 may have to last until then.

Please listen to Larry. When he says you can't scale endlessly, I have
a feeling he knows what he's talking about. The Nirvana machine has 48
SGI boxes with 128 CPUs in each. I don't hear about many 128 CPU
machines nowadays. Perhaps Irix just wasn't quite up to the job. But
new technologies will make this kind of machine affordable (by the
government and financial institutions) in the not too distant future.

Just my two cents. Enough ranting for today.

Steven

2003-02-25 05:09:44

by Chris Wedgwood

Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 01:28:30PM +0000, Alan Cox wrote:

> _If_ it harms performance on small boxes.

You mean like the general slowdown from 2.4 -> 2.5?

It seems to me for small boxes, 2.5.x is marginally slower at most
things than 2.4.x.

I'm hoping as the code solidifies and things are tuned this gap will
go away and 2.5.x will inch ahead... hoping....

> The definitive Linux box appears to be $199 from Walmart right now,
> and it's not SMP.

In two years this kind of hardware will probably be SMP (HT or some
variant).


--cw

2003-02-25 05:15:23

by Rik van Riel

Subject: Re: Minutes from Feb 21 LSE Call

On Mon, 24 Feb 2003, Bill Huey wrote:

> You don't need data. It's conceptually obvious.

I hope you realise this is about as good as a real godwination?

Rik
--
Engineers don't grow up, they grow sideways.
http://www.surriel.com/ http://kernelnewbies.org/

2003-02-25 05:16:59

by William Lee Irwin III

Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 01:28:30PM +0000, Alan Cox wrote:
>> _If_ it harms performance on small boxes.

On Mon, Feb 24, 2003 at 09:19:56PM -0800, Chris Wedgwood wrote:
> You mean like the general slowdown from 2.4 -> 2.5?
> It seems to me for small boxes, 2.5.x is marginally slower at most
> things than 2.4.x.
> I'm hoping as the code solidifies and things are tuned this gap will
> go away and 2.5.x will inch ahead... hoping....

Could you help identify the regressions? Profiles? Workload?

On Mon, Feb 24, 2003 at 01:28:30PM +0000, Alan Cox wrote:
>> The definitive Linux box appears to be $199 from Walmart right now,
>> and it's not SMP.

On Mon, Feb 24, 2003 at 09:19:56PM -0800, Chris Wedgwood wrote:
> In two years this kind of hardware will probably be SMP (HT or some

I'm a programmer not an economist (despite utility functions and Nash
equilibria). Don't tell me what's definitive, give me some profiles.

-- wli

2003-02-25 05:51:42

by William Lee Irwin III

Subject: Re: Minutes from Feb 21 LSE Call

At some point in the past, I wrote:
>> This kind of measurement is actually relatively unusual. I'm definitely
>> interested in it, as there appear to be some deficits wrt. locality of
>> reference that show up as big profile spikes on NUMA boxen. With care
>> exercised, good solutions should trim down cache misses on UP as well.
>> Cache and TLB miss profile driven development sounds very attractive.

On Mon, Feb 24, 2003 at 08:54:04PM -0800, Larry McVoy wrote:
> Again, I'm with you all the way on this. If the scale up guys can adopt
> this as a mantra, I'm a lot less concerned that anything bad will happen.

I don't know about mantras, but we're getting to the point where lock
contention is a non-issue on midrange SMP and straight line efficiency
is beyond the range of "obviously it should be done some other way."
The time to chase cache pollution is certainly coming.


On Mon, Feb 24, 2003 at 08:54:04PM -0800, Larry McVoy wrote:
> Tim at OSDL and I have been talking about trying to work out some benchmarks
> to test for this. I came up with the idea of adding a "-s XXX" which means
> "touch XXX bytes between each iteration" to each LMbench test. One problem
> is the lack of page coloring will make the numbers bounce around too much.
> We talked that over with Linus and he suggested using the big TLB hack to
> get around that. Assuming we can deal with the page coloring, do you think
> that there is any merit in taking microbenchmarks, adding an artificial
> working set, and running those?

Page coloring needs to get into the kernel at some point. Using large
TLB entries will artificially tie this to TLB effects and fragmentation,
in addition to pagetable space conservation (on x86 anyway). So I really
don't see any way to deal with reproducibility issues on this front but
to do page coloring proper. Everything else that achieves it as a side
effect would unduly disturb the results, IMHO.
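To make concrete what page coloring refers to here, a generic illustration (not Linux code; the function name and parameters are invented for this sketch): in a physically indexed set-associative cache, each physical page maps to one of a small number of "colors", and an allocator that rotates colors per address space makes cache placement, and hence benchmark numbers, repeatable.

```c
/* Illustrative only: the color of a physical page frame in a
 * physically indexed set-associative cache.  Pages of the same color
 * compete for the same cache sets, so uncontrolled coloring makes
 * working-set benchmarks bounce around from run to run. */
unsigned long page_color(unsigned long pfn,
                         unsigned long cache_bytes,
                         unsigned long page_bytes,
                         unsigned long assoc)
{
    unsigned long colors = cache_bytes / (page_bytes * assoc);
    return colors ? pfn % colors : 0;
}
```

For example, a 512KB 8-way cache with 4KB pages has 16 colors; a coloring allocator would hand out frames cycling through colors 0..15 rather than whatever happens to be free.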


At some point in the past, I wrote:
>> Let me put it this way: IBM sells tiny boxen too, from 4x, to UP, to
>> whatever. And people are simultaneously actively trying to scale
>> downward to embedded bacteria or whatever.

On Mon, Feb 24, 2003 at 08:54:04PM -0800, Larry McVoy wrote:
> That's really great, I know it's a lot less sexy but it's important.
> I'd love to see as much attention on making Linux work on tiny embedded
> platforms as there is on making it work on big iron. Small is cool too.

There is; unfortunately, the participation of embedded vendors in the
development cycle is not as visible as that of large system vendors.
More direct, frequent, and vocal input from embedded kernel hackers
would be very valuable, as many "corner cases" with automatic kernel
scaling should occur on the small end, not just the large end.

I've had some brief attempts to explain to me the motives and methods
of embedded system vendors and the like, but I've failed to absorb
enough to get a "big picture" or much of any notion as to why embedded
kernel hackers aren't participating as much in the development cycle.

On the large system side, it's very clear that issues in the core VM
and other parts of the kernel must be addressed to achieve the goals,
and hence participation in the development cycle is outright mandatory.
It's not "working effectively". It's a requirement. And part of that
"requirement" bit is we have to work with constraints never enforced
before, including maintaining the scalability curve on the low end.

It's hard, but probably not impossible, and absolutely required.


-- wli

2003-02-25 06:07:05

by Martin J. Bligh

Subject: Re: Minutes from Feb 21 LSE Call

>> _If_ it harms performance on small boxes.
>
> You mean like the general slowdown from 2.4 -> 2.5?
>
> It seems to me for small boxes, 2.5.x is marginally slower at most
> things than 2.4.x.

Can you name a benchmark, or at least do something reproducible between
versions, and produce a 2.4 vs 2.5 profile? Let's at least try to fix it ...

M.

2003-02-25 06:50:03

by Valerie Henson

Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 10:00:53PM -0800, William Lee Irwin III wrote:
> On Mon, Feb 24, 2003 at 08:54:04PM -0800, Larry McVoy wrote:
> > That's really great, I know it's a lot less sexy but it's important.
> > I'd love to see as much attention on making Linux work on tiny embedded
> > platforms as there is on making it work on big iron. Small is cool too.
>
> There is; unfortunately, the participation of embedded vendors in the
> development cycle is not as visible as that of large system vendors.
> More direct, frequent, and vocal input from embedded kernel hackers
> would be very valuable, as many "corner cases" with automatic kernel
> scaling should occur on the small end, not just the large end.
>
> I've had some brief attempts to explain to me the motives and methods
> of embedded system vendors and the like, but I've failed to absorb
> enough to get a "big picture" or much of any notion as to why embedded
> kernel hackers aren't participating as much in the development cycle.

Speaking as a former Linux developer for an embedded[1] systems
vendor, it's because embedded companies aren't the size of IBM and
don't have money to spend on software development beyond the "make it
work on our boards" point. One of the many reasons I'm a _former_
embedded Linux developer.

-VAL

[1] Okay, our boards had up to 4 processors and 1GB memory. But the
same principles applied.

2003-02-25 14:20:50

by Alan

Subject: Re: Minutes from Feb 21 LSE Call

On Tue, 2003-02-25 at 02:17, Bill Huey wrote:
> You don't need data. It's conceptually obvious. If you have a higher
> priority thread that's not running because another thread of lower priority
> is hogging the CPU for some unknown operation in the kernel, then you're
> going to be less able to respond to external events from the IO system and
> other things with respect to a Unix style priority scheduler.

Nothing is conceptually obvious. That's the difference between 'science'
and engineering. Our bridges have to stay up.

> It's about getting relationships inside the kernel to respect and be
> controllable by the scheduler in some formal manner, not some random
> not-so-well-thought-out hack of the day.

Prove it, compute the bounded RT worst case. You can't do it. Linux, NT,
VMS and so on are all basically "armwaved real time". Now for a lot of
things armwaved realtime is OK; one 'click' an hour on a phone call
from a DSP load miss isn't a big deal. Just don't try the same with
precision heavy machinery.

It's not a lack of competence; we genuinely don't yet have the understanding
in computing to solve some of the problems people are content to armwave
about.

If I need extremely high provable precision, Victor's approach is right, if
I want armwaved realtimeish behaviour with a more convenient way of working
then Victor's approach may not be the best.

It's called engineering. There are multiple ways to build most things, each
with different advantages, and multiple ways to model it, each with
more accuracy in some areas. Knowing how to use the right tool is a lot
more important than having some religion about it.

Alan

2003-02-25 14:26:31

by Valdis Klētnieks

Subject: Re: Minutes from Feb 21 LSE Call

On Mon, 24 Feb 2003 18:24:38 PST, Bill Huey said:

> But apparently what he knows is not very modern. I'm no slouch either, being
> a former BSDi (the original Unix folks) engineer, but I don't go dismissing

And here I thought "the original Unix folks" was Dennis and Ken mailing you
an RL05 with a "Good luck, let us know if it works" cover letter... ;)



2003-02-25 14:37:11

by Mr. James W. Laferriere

Subject: Re: Minutes from Feb 21 LSE Call


Hello Valdis. One, in those days there were no RL05's (never were,
if my memory serves). They were RL02's, 10MB packs. Maybe RM05?
*nix definitely was NOT known as BSD then. JimL

On Mon, 24 Feb 2003 [email protected] wrote:
> On Mon, 24 Feb 2003 18:24:38 PST, Bill Huey said:
> > But apparently what he knows is not very modern. I'm no slouch either, being
> > a former BSDi (the original Unix folks) engineer, but I don't go dismissing
> And here I thought "the original Unix folks" was Dennis and Ken mailing you
> an RL05 with a "Good luck, let us know if it works" cover letter... ;)
--
+------------------------------------------------------------------+
| James W. Laferriere | System Techniques | Give me VMS |
| Network Engineer | P.O. Box 854 | Give me Linux |
| [email protected] | Coudersport PA 16915 | only on AXP |
+------------------------------------------------------------------+

2003-02-25 14:52:34

by Bill Huey

Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 03:30:59PM +0000, Alan Cox wrote:
> Nothing is conceptually obvious. That's the difference between 'science'
> and engineering. Our bridges have to stay up.

Yes, I absolutely agree with this. It shouldn't be the case where one is
over the other, they should have a complementary relationship.

> > It's about getting relationships inside the kernel to respect and be
> > controllable by the scheduler in some formal manner, not some random
> > not-so-well-thought-out hack of the day.
>
> Prove it, compute the bounded RT worst case. You can't do it. Linux, NT,
> VMS and so on are all basically "armwaved real time". Now for a lot of
> things armwaved realtime is OK; one 'click' an hour on a phone call
> from a DSP load miss isn't a big deal. Just don't try the same with
> precision heavy machinery.
>
> It's not a lack of competence; we genuinely don't yet have the understanding
> in computing to solve some of the problems people are content to armwave
> about.
>
> If I need extremely high provable precision, Victor's approach is right, if
> I want armwaved realtimeish behaviour with a more convenient way of working
> then Victor's approach may not be the best.

I spoke to some folks related to CMU's RTOS group about a year ago and was
influenced by their preemption design, in that they claimed to get tight RT
latency characteristics with what seem like mild changes to the Linux
kernel. I recently started to investigate their stuff, took a clue from them,
and became convinced that this approach is very neat and elegant. MontaVista
apparently uses this approach, as opposed to other groups that run Linux as a
thread in another RT kernel. How this relates to static analysis tools doing
rate/deadline-monotonic analysis and scheduler "reservations" (born from that
RT theory, I believe) is unclear to me at this moment. I just find this
particular track neat and reminiscent of some FreeBSD ideals that I'd like
to see fully working in an open source kernel.

Top level link to many papers:
http://linuxdevices.com/articles/AT6476691775.html

A paper I've taken interest in recently from the top-level link:
http://www.linuxdevices.com/articles/AT6078481804.html

People I originally talked to who influenced my view on this:
http://www-2.cs.cmu.edu/~rajkumar/linux-rk.html

> Its called engineering. There are multiple ways to build most things, each
> with different advantages, there are multiple ways to model it each with
> more accuracy in some areas. Knowing how to use the right tool is a lot
> more important than having some religion about it.

Yes, I agree. I'm not trying to make a religious assertion and I don't
function that way. I just want things to work smoother and explore some
interesting ideas that I think eventually will be highly relevant to a
very broad embedded arena.

bill

2003-02-25 15:48:19

by Victor Yodaiken

Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 06:59:12AM -0800, Bill Huey wrote:
> latency characteristics with what seem like mild changes to the Linux
> kernel. I recently started to investigate their stuff, took a clue from them,
> and became convinced that this approach is very neat and elegant. MontaVista
> apparently uses this approach, as opposed to other groups that run Linux as a
> thread in another RT kernel. How this relates to static analysis tools doing
> rate/deadline-monotonic analysis and scheduler "reservations" (born from that
> RT theory, I believe) is unclear to me at this moment. I just find this
> particular track neat and reminiscent of some FreeBSD ideals that I'd like
> to see fully working in an open source kernel.

There are two easy tests:

1) Run a millisecond-period real-time task on a system under heavy
   load (not just compute load) and a ping flood, and find the worst
   case jitter. In our experience, tests run for less than 24 hours
   are worthless. (I've seen a lot of numbers based on 1 million
   interrupts - do the math and laugh.) It's not fair to throttle
   the network to make the numbers come out better. Please also make
   clear how much of the kernel you had to rewrite to get your
   numbers: e.g. specially configured network drivers are nice, but
   have an impact on usability.

   BTW: a version of this test is distributed with RTLinux.

2) Run the same real-time task and run a known compute/I/O load,
   such as the standard kernel compile, to see the overhead of
   real-time. Remember:

       hard cli:
           run RT code only

   produces great numbers for (1) at the expense of (2), so no
   reconfiguration is allowed between these tests.

Now try these on some embedded processors that run under
1GHz and 1GB memory.

FWIW: RTLinux numbers are 18 microseconds jitter and about 15 seconds
slowdown of a 5 minute kernel compile on a kinda wimpy P3. On a 2.4GHz
machine we do slightly better. I got 12 microseconds on a K7; the drop for
embedded processors is low. PowerPCs are generally excellent. The
second test requires a little more work on things like StrongARMs
because nobody has the patience to time a kernel compile on those.
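Victor's test (1) can be sketched with plain POSIX timers. This is a hypothetical illustration, not the RTLinux API or its distributed test; the function name `worst_jitter_ns` is invented here, and on a stock kernel under load you should expect numbers far worse than the microsecond-level RTLinux figures above.

```c
#define _POSIX_C_SOURCE 200112L
#include <time.h>
#include <stdint.h>

/* Run a periodic task and track the worst-case wakeup jitter:
 * sleep to an absolute deadline each period, then measure how late
 * the wakeup actually was.  The maximum over a long run (Victor
 * suggests >24 hours) is the number that matters. */
#define PERIOD_NS 1000000L      /* 1 ms period */

int64_t worst_jitter_ns(long iterations)
{
    struct timespec next, now;
    int64_t worst = 0;

    clock_gettime(CLOCK_MONOTONIC, &next);
    for (long i = 0; i < iterations; i++) {
        next.tv_nsec += PERIOD_NS;
        if (next.tv_nsec >= 1000000000L) {
            next.tv_nsec -= 1000000000L;
            next.tv_sec += 1;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        clock_gettime(CLOCK_MONOTONIC, &now);
        /* Lateness past the absolute deadline is the jitter sample. */
        int64_t late = (int64_t)(now.tv_sec - next.tv_sec) * 1000000000L
                     + (now.tv_nsec - next.tv_nsec);
        if (late > worst)
            worst = late;
    }
    return worst;
}
```

Older glibc needs `-lrt` to link this; running it under heavy load plus a ping flood, as Victor describes, is what makes the worst case meaningful.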

As for RMA, it's a nice trick, but of limited use. Instead of
test (and design for testability) you get a formula for calculating
schedulability from the computation times of the tasks. But since we
have no good way to estimate compute times of code without test, it
has the result of moving ignorance instead of removing it. Also, the
idea that frequency and priority are lock-step is simply incorrect
for many applications. When you start dealing with really esoteric
concepts: like demand driven tasks and shared resources,
RMA wheezes mightily.

Pre-allocation of resources is good for RT, although not especially
revolutionary. Traditional RT systems were written using cyclic
schedulers. Many of our simulation customers use a "slot" or
"frame" scheduler. Fortunately, these are really old ideas so I
know about them.

Probably because of the well advertised low level of my knowledge and
abilities, I advocate that RT systems be designed with simplicity and
testability in mind. We have found that exceptionally complex RT
control systems can be developed on such a basis. Making the tools more
complicated does not seem to improve reliability or performance:
the application performance is more interesting than features of the
OS.

You can see a nice illustration of the differences
between RTLinux and the TimeSys approach in my paper on priority
inheritance http://www.fsmlabs.com/articles/inherit/inherit.html
(Originally http://www.linuxdevices.com/articles/AT7168794919.html)
and Doug Locke's response
http://www.linuxdevices.com/articles/AT5698775833.html




>
> Top level link to many papers:
> http://linuxdevices.com/articles/AT6476691775.html
>
> A paper I've taken interest in recently from the top-level link:
> http://www.linuxdevices.com/articles/AT6078481804.html
>
> People I originally talked to who influenced my view on this:
> http://www-2.cs.cmu.edu/~rajkumar/linux-rk.html
>
> > It's called engineering. There are multiple ways to build most things, each
> > with different advantages, and multiple ways to model it, each with
> > more accuracy in some areas. Knowing how to use the right tool is a lot
> > more important than having some religion about it.
>
> Yes, I agree. I'm not trying to make a religious assertion and I don't
> function that way. I just want things to work smoother and explore some
> interesting ideas that I think eventually will be highly relevant to a
> very broad embedded arena.
>
> bill
>

--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
1+ 505 838 9109

2003-02-25 15:49:38

by Jesse Pollard

Subject: Re: Minutes from Feb 21 LSE Call

On Tuesday 25 February 2003 08:47 am, Mr. James W. Laferriere wrote:
> Hello Valdis. One, in those days there were no RL05's (never were,
> if my memory serves). They were RL02's, 10MB packs. Maybe RM05?
> *nix definitely was NOT known as BSD then. JimL

nope - it was RK05 (2.5 MB disk). The distribution was on tape, not disk. (9
track tapes were only 20 bucks, RKs were several thousand).

RLs did not show up for almost 10 years.

RM05 was a 600 MB disk, and didn't show up until after the PDP11/70 and VAX/11
existed (it was a relabeled CDC 9766 disk, I believe).

And it was UNIX v x, where x varied from null (not labeled) through 1..7.

RL distributions did not come from AT&T (Yourdon, Inc. was where I got one)
--
-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]

Any opinions expressed are solely my own.

2003-02-25 17:01:40

by Cliff White

Subject: Re: Minutes from Feb 21 LSE Call

> >> _If_ it harms performance on small boxes.
> >
> > You mean like the general slowdown from 2.4 -> 2.5?
> >
> > It seems to me for small boxes, 2.5.x is marginally slower at most
> > things than 2.4.x.
>
> Can you name a benchmark, or at least do something reproducible between
> versions, and produce a 2.4 vs 2.5 profile? Let's at least try to fix it ...
>
> M.

Well, here's one bit of data. Easy enough to do if you have a web browser.
LMBench 2.0 on 1-way and 2-way, kernels 2.4.18 and 2.5.60
1-way (stp1-003 stp1-002)
2.4.18 http://khack.osdl.org/stp/7443/
2.5.60 http://khack.osdl.org/stp/265622/

2-way (stp2-003 stp2-000)
2.4.18 http://khack.osdl.org/stp/3165/
2.5.60 http://khack.osdl.org/stp/265643/

Interesting items for me are the fork/exec/sh times and some of the file + VM
numbers
LMBench 2.0 Data ( items selected from total of five runs )

Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
Host OS Mhz null null open selct sig sig fork exec sh
call I/O stat clos TCP inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ---- ---- ----
stp2-003. Linux 2.4.18 1000 0.39 0.67 3.89 4.99 30.4 0.93 3.06 344. 1403 4465
stp2-000. Linux 2.5.60 1000 0.41 0.77 4.34 5.57 32.6 1.15 3.59 245. 1406 5795

stp1-003. Linux 2.4.18 1000 0.32 0.46 2.60 3.21 16.6 0.79 2.52 104. 918. 4460
stp1-002. Linux 2.5.60 1000 0.33 0.47 2.83 3.47 16.0 0.94 2.70 143. 1212 5292

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
stp2-003. Linux 2.4.18 2.680 6.2100 15.8 7.9400 110.7 26.4 111.1
stp2-000. Linux 2.5.60 1.590 5.0700 17.6 7.5800 79.8 11.0 113.6

stp1-003. Linux 2.4.18 0.590 3.4700 11.1 4.8200 134.3 30.8 131.7
stp1-002. Linux 2.5.60 1.000 3.5400 11.2 4.1400 129.6 30.4 127.8

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP
ctxsw UNIX UDP TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
stp2-003. Linux 2.4.18 2.680 9.071 17.5 26.9 46.2 34.4 60.0 62.9
stp2-000. Linux 2.5.60 1.590 8.414 13.2 21.2 43.2 28.3 54.1 97.1

stp1-003. Linux 2.4.18 0.590 3.623 6.98 11.7 28.2 17.8 38.4 300K
stp1-002. Linux 2.5.60 1.050 4.591 8.54 14.8 31.8 20.0 41.0 67.1

File & VM system latencies in microseconds - smaller is better
--------------------------------------------------------------
Host OS 0K File 10K File Mmap Prot Page
Create Delete Create Delete Latency Fault Fault
--------- ------------- ------ ------ ------ ------ ------- ----- -----
stp2-003. Linux 2.4.18 34.6 7.2490 110.9 17.9 2642.0 0.771 3.00000
stp2-000. Linux 2.5.60 40.0 9.2780 113.3 23.3 4592.0 0.543 3.00000

stp1-003. Linux 2.4.18 28.8 4.8890 107.5 11.3 686.0 0.621 2.00000
stp1-002. Linux 2.5.60 32.4 6.4290 112.9 16.2 1455.0 0.465 2.00000

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem
UNIX reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
stp2-003. Linux 2.4.18 563. 277. 263. 437.0 552.8 249.1 180.7 553. 215.2
stp2-000. Linux 2.5.60 603. 516. 151. 436.3 549.0 238.0 171.9 548. 233.7

stp1-003. Linux 2.4.18 1009 820. 404. 414.3 467.0 167.2 154.1 466. 236.2
stp1-002. Linux 2.5.60 806. 584. 69.1 408.0 461.7 161.1 149.1 461. 233.5


Memory latencies in nanoseconds - smaller is better
(WARNING - may not be correct, check graphs)
---------------------------------------------------
Host OS Mhz L1 $ L2 $ Main mem Guesses
--------- ------------- ---- ----- ------ -------- -------
stp2-003. Linux 2.4.18 1000 3.464 8.0820 110.9
stp2-000. Linux 2.5.60 1000 3.545 8.2790 110.6

stp1-003. Linux 2.4.18 1000 2.994 6.9850 121.4
stp1-002. Linux 2.5.60 1000 3.023 7.0530 122.5

------------------
cliffw



2003-02-25 17:07:03

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, Feb 22, 2003 at 07:24:24PM -0800, Andrew Morton wrote:
> 2.4.21-pre4: 8.10 seconds
> 2.5.62-mm3 with objrmap: 9.95 seconds (+1.85)
> 2.5.62-mm3 without objrmap: 10.86 seconds (+0.91)
>
> Current 2.5 is 2.76 seconds slower, and this patch reclaims 0.91 of those
> seconds.
>
>
> So who stole the remaining 1.85 seconds? Looks like pte_highmem.

would you mind adding the line for 2.4.21-pre4aa3? it has pte-highmem so
you can find out for sure whether it is pte_highmem that stole >10%
of your fast cpu. A line for the 2.4-rmap patch would also be
interesting.

> Note one second spent in pte_alloc_one().

note the seconds spent in the rmap-affected paths too.

Andrea

2003-02-25 17:08:54

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 09:11:38AM -0800, Cliff White wrote:
> Interesting items for me are the fork/exec/sh times and some of the file + VM
> numbers
> LMBench 2.0 Data ( items selected from total of five runs )

Okay, got profiles for the individual tests you're interested in?

Also, what are the statistical significance cutoffs?


-- wli

2003-02-25 17:25:54

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Server shipments [was Re: Minutes from Feb 21 LSE Call]

On Mon, Feb 24, 2003 at 07:00:42PM -0800, Martin J. Bligh wrote:
> Solaris et al failed was because the development model was different?

Solaris can't be recompiled UP AFAIK. This whole discussion about UP
performance is almost pointless in linux since we have CONFIG_SMP and we
can recompile it.

Especially if what you care about is the desktop (not the UP server), the only
kernel bits that matter for the desktop are the VM, the scheduler,
I/O latency and perhaps the clear_page too. The rest is all a matter of
the X/kde/qt/glibc-dynamiclinking/opengl/memorybloatwithmultiplelibs/etc..
The kernel's raw core performance in the fast paths doesn't matter much for the
desktop; even if syscalls were twice as slow, desktop users
wouldn't notice much.

Andrea

2003-02-25 17:34:49

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Sat, Feb 22, 2003 at 07:24:24PM -0800, Andrew Morton wrote:
>> So who stole the remaining 1.85 seconds? Looks like pte_highmem.

On Tue, Feb 25, 2003 at 06:17:27PM +0100, Andrea Arcangeli wrote:
> would you mind to add the line for 2.4.21-pre4aa3? it has pte-highmem so
> you can easily find it out for sure if it is pte_highmem that stole >10%
> of your fast cpu. A line for the 2.4-rmap patch would be also
> interesting.

On Sat, Feb 22, 2003 at 07:24:24PM -0800, Andrew Morton wrote:
>> Note one second spent in pte_alloc_one().

On Tue, Feb 25, 2003 at 06:17:27PM +0100, Andrea Arcangeli wrote:
> note the seconds spent in the rmap affected paths too.

The pagetable cache is gone in 2.5, so pte_alloc_one() takes the
bitblitting hit for pagetables.

I didn't catch the whole profile, so I'll need numbers for rmap paths.


-- wli

2003-02-25 17:33:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

In article <[email protected]>,
Cliff White <[email protected]> wrote:
>
>Well, here's one bit of data. Easy enough to do if you have a web browser.
>LMBench 2.0 on 1-way and 2-way, kernels 2.4.18 and 2.5.60
>1-way (stp1-003 stp1-002)
>2.4.18 http://khack.osdl.org/stp/7443/
>2.5.60 http://khack.osdl.org/stp/265622/
>
>2-way (stp2-003 stp2-000)
>2.4.18 http://khack.osdl.org/stp/3165/
>2.5.60 http://khack.osdl.org/stp/265643/
>
>Interesting items for me are the fork/exec/sh times and some of the file + VM
>numbers
>LMBench 2.0 Data ( items selected from total of five runs )
>
>Processor, Processes - times in microseconds - smaller is better
>----------------------------------------------------------------
>Host OS Mhz null null open selct sig sig fork exec sh
> call I/O stat clos TCP inst hndl proc proc proc
>--------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ---- ---- ----
>stp2-003. Linux 2.4.18 1000 0.39 0.67 3.89 4.99 30.4 0.93 3.06 344. 1403 4465
>stp2-000. Linux 2.5.60 1000 0.41 0.77 4.34 5.57 32.6 1.15 3.59 245. 1406 5795

Note that those numbers will look quite different (at least on a P4) if
you use a modern library that uses the "sysenter" stuff. The difference
ends up being something like this:

Host OS Mhz null null open selct sig sig fork exec sh
call I/O stat clos inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ---- ---- ----
i686-linu Linux 2.5.30 2380 0.8 1.1 3 5 0.04K 1.1 3 0.2K 1K 3K
i686-linu Linux 2.5.62 2380 0.2 0.6 3 4 0.04K 0.7 3 0.2K 1K 3K

(Yeah, I've never run a 2.4.x kernel on this machine, so..) In other
words, the system call has been speeded up quite noticeably.

Yes, if you don't take advantage of sysenter, then all the sysenter
support will just make us look worse ;(

I'm surprised by your "sh proc" changes, they are quite big. I guess
it's rmap and highmem that bites us, and yes, we've gotten slower there.

Linus

2003-02-25 17:48:59

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 09:43:59AM -0800, William Lee Irwin III wrote:
> On Sat, Feb 22, 2003 at 07:24:24PM -0800, Andrew Morton wrote:
> >> So who stole the remaining 1.85 seconds? Looks like pte_highmem.
>
> On Tue, Feb 25, 2003 at 06:17:27PM +0100, Andrea Arcangeli wrote:
> > would you mind to add the line for 2.4.21-pre4aa3? it has pte-highmem so
> > you can easily find it out for sure if it is pte_highmem that stole >10%
> > of your fast cpu. A line for the 2.4-rmap patch would be also
> > interesting.
>
> On Sat, Feb 22, 2003 at 07:24:24PM -0800, Andrew Morton wrote:
> >> Note one second spent in pte_alloc_one().
>
> On Tue, Feb 25, 2003 at 06:17:27PM +0100, Andrea Arcangeli wrote:
> > note the seconds spent in the rmap affected paths too.
>
> The pagetable cache is gone in 2.5, so pte_alloc_one() takes the
> bitblitting hit for pagetables.

I'm talking about do_anonymous_page, do_wp_page, do_no_page, fork and all
the other places that introduce spinlocks (per-page) and allocations of
2 pieces of ram rather than just 1 (and in turn potentially global
spinlocks too if the cpu-caches are empty). Just grep for
pte_chain_alloc or page_add_rmap in mm/memory.c; that's what I mean, I'm
not talking about pagetables.

Andrea

2003-02-25 17:55:26

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 09:43:59AM -0800, William Lee Irwin III wrote:
>> The pagetable cache is gone in 2.5, so pte_alloc_one() takes the
>> bitblitting hit for pagetables.

On Tue, Feb 25, 2003 at 06:59:28PM +0100, Andrea Arcangeli wrote:
> I'm talking about do_anonymous_page, do_wp_page, do_no_page fork and all
> the other places that introduces spinlocks (per-page) and allocations of
> 2 pieces of ram rather than just 1 (and in turn potentially global
> spinlocks too if the cpu-caches are empty). Just grep for
> pte_chain_alloc or page_add_rmap in mm/memory.c, that's what I mean, I'm
> not talking about pagetables.

Well, pte_alloc_one() has a clear explanation.

The fact that the rmap accounting is not free is not news.

For anonymous pages performing the analogous vma-based lookup as with
Dave McCracken's patch for file-backed pages would require a
significant anonymous page accounting rework.


-- wli

2003-02-25 18:08:38

by Alan

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, 2003-02-25 at 05:19, Chris Wedgwood wrote:
> > The definitive Linux box appears to be $199 from Walmart right now,
> > and its not SMP.
>
> In two year this kind of hardware probably will be SMP (HT or some
> variant).

Not if it costs money. If the cheapest reasonable x86 cpu is one that has chosen
to avoid HT and SMP, it won't have HT and SMP. Think 4xUSB2 connectors, brick PSU,
and no user-adjustable components.

2003-02-25 18:41:17

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 09:43:59AM -0800, William Lee Irwin III wrote:
>> The pagetable cache is gone in 2.5, so pte_alloc_one() takes the
>> bitblitting hit for pagetables.

On Tue, Feb 25, 2003 at 06:59:28PM +0100, Andrea Arcangeli wrote:
> I'm talking about do_anonymous_page, do_wp_page, do_no_page fork and all
> the other places that introduces spinlocks (per-page) and allocations of
> 2 pieces of ram rather than just 1 (and in turn potentially global
> spinlocks too if the cpu-caches are empty). Just grep for
> pte_chain_alloc or page_add_rmap in mm/memory.c, that's what I mean, I'm
> not talking about pagetables.

Okay, fished out the profiles (w/Dave's optimization):

00000000 total 158601 0.0869
c0106ed8 poll_idle 99878 1189.0238
c01172e0 do_page_fault 8788 7.7496
c013adb4 do_wp_page 6712 8.4322
c013f70c page_remove_rmap 3132 6.2640
c0139eac copy_page_range 2994 3.5643
c013f5c0 page_add_rmap 2776 8.3614
c013a1f4 zap_pte_range 2616 4.8806
c0137240 release_pages 1828 6.4366
c0108d14 system_call 1116 25.3636
c013ba00 handle_mm_fault 1098 4.6525
c015b59c d_lookup 1096 3.2619
c013b788 do_no_page 1044 1.6519
c013b56c do_anonymous_page 954 1.7667
c011718c pte_alloc_one 910 6.5000
c0139ba0 clear_page_tables 841 2.4735
c011450c flush_tlb_page 725 6.4732
c0207130 __copy_to_user_ll 687 6.6058
c01333dc free_hot_cold_page 641 2.7629
c013042c find_get_page 601 10.7321

Just taking the exception dwarfs anything written in C.

page_add_rmap() absorbs hits from all of the fault routines and
copy_page_range(). page_remove_rmap() absorbs hits from zap_pte_range().
do_wp_page() is huge because it's doing bitblitting in-line.

These things aren't cheap with or without rmap. Trimming down
accounting overhead could raise search problems elsewhere.
Whether avoiding the search problem is worth the accounting overhead
could probably use some more investigation, like actually trying the
anonymous page handling rework needed to use vma-based ptov resolution.


-- wli

2003-02-25 18:46:14

by Dave Jones

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 05:38:31PM +0000, Linus Torvalds wrote:

> Yes, if you don't take advantage of sysenter, then all the sysenter
> support will just make us look worse ;(

Andi's patch[1] to remove one of the wrmsr's from the context switch
fast path should win back at least some of the lost microbenchmark
points. (Full info at http://bugzilla.kernel.org/show_bug.cgi?id=350)

Dave

[1] http://bugzilla.kernel.org/attachment.cgi?id=140&action=view

2003-02-25 19:07:52

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 10:50:08AM -0800, William Lee Irwin III wrote:
> On Tue, Feb 25, 2003 at 09:43:59AM -0800, William Lee Irwin III wrote:
> >> The pagetable cache is gone in 2.5, so pte_alloc_one() takes the
> >> bitblitting hit for pagetables.
>
> On Tue, Feb 25, 2003 at 06:59:28PM +0100, Andrea Arcangeli wrote:
> > I'm talking about do_anonymous_page, do_wp_page, do_no_page fork and all
> > the other places that introduces spinlocks (per-page) and allocations of
> > 2 pieces of ram rather than just 1 (and in turn potentially global
> > spinlocks too if the cpu-caches are empty). Just grep for
> > pte_chain_alloc or page_add_rmap in mm/memory.c, that's what I mean, I'm
> > not talking about pagetables.
>
> Okay, fished out the profiles (w/Dave's optimization):
>
> 00000000 total 158601 0.0869
> c0106ed8 poll_idle 99878 1189.0238
> c01172e0 do_page_fault 8788 7.7496
> c013adb4 do_wp_page 6712 8.4322
> c013f70c page_remove_rmap 3132 6.2640
> c0139eac copy_page_range 2994 3.5643
> c013f5c0 page_add_rmap 2776 8.3614
> c013a1f4 zap_pte_range 2616 4.8806
> c0137240 release_pages 1828 6.4366
> c0108d14 system_call 1116 25.3636
> c013ba00 handle_mm_fault 1098 4.6525
> c015b59c d_lookup 1096 3.2619
> c013b788 do_no_page 1044 1.6519
> c013b56c do_anonymous_page 954 1.7667
> c011718c pte_alloc_one 910 6.5000
> c0139ba0 clear_page_tables 841 2.4735
> c011450c flush_tlb_page 725 6.4732
> c0207130 __copy_to_user_ll 687 6.6058
> c01333dc free_hot_cold_page 641 2.7629
> c013042c find_get_page 601 10.7321
>
> Just taking the exception dwarfs anything written in C.
>
> page_add_rmap() absorbs hits from all of the fault routines and
> copy_page_range(). page_remove_rmap() absorbs hits from zap_pte_range().
> do_wp_page() is huge because it's doing bitblitting in-line.

"absorbing" is a nice word for it. The way I see it, page_add_rmap and
page_remove_rmap are even more expensive than the pagetable zapping.
They're even more expensive than copy_page_range. Also focus on the
numbers on the right, which are even more interesting for finding what is
worth optimizing away first, IMHO.


>
> These things aren't cheap with or without rmap. Trimming down

lots of things aren't cheap, but this isn't a good reason to make them
twice as expensive, especially if they were as cheap as possible and
they're critical hot paths.

> accounting overhead could raise search problems elsewhere.

this is the point indeed, but at least in 2.4 I don't see any cpu saving
advantage during swapping because during swapping the cpu is always idle
anyways.

In fact I had to drop the lru_cache_add too from the anonymous page fault
path because it was wasting way too much cpu to get peak performance (of
course with rmap you're using per-page spinlocks by hand, and
lru_cache_add needs a global spinlock, so at least rmap shouldn't
introduce a very big scalability issue, unlike the lru_cache_add)

> Whether avoiding the search problem is worth the accounting overhead
> could probably use some more investigation, like actually trying the
> anonymous page handling rework needed to use vma-based ptov resolution.

the only solution is to do rmap lazily, i.e. to start building the rmap
during swapping by walking the pagetables, basically exactly like I
refill the lru with anonymous pages only after I start to need this
information recently in my 2.4 tree, so if you never need to pageout
heavily several giga of ram (like most of very high end numa servers),
you'll never waste a single cycle in locking or whatever other worthless
accounting overhead that hurts performance of all common workloads

Andrea

2003-02-25 19:29:10

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

> the only solution is to do rmap lazily, i.e. to start building the rmap
> during swapping by walking the pagetables, basically exactly like I
> refill the lru with anonymous pages only after I start to need this
> information recently in my 2.4 tree, so if you never need to pageout
> heavily several giga of ram (like most of very high end numa servers),
> you'll never waste a single cycle in locking or whatever other worthless
> accounting overhead that hurts performance of all common workloads

Did you see the partially object-based rmap stuff? I think that does
very close to what you want already.

M.

2003-02-25 19:48:00

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

> Interesting items for me are the fork/exec/sh times and some of the file
> + VM numbers

For the ones where you see degradation in fork/exec type stuff, any chance
you could rerun them with 62-mjb3 with the objrmap stuff in it? That should
fix a lot of the overhead.

Thanks,

M.

2003-02-25 19:49:06

by Scott Robert Ladd

[permalink] [raw]
Subject: RE: Minutes from Feb 21 LSE Call

Chris Wedgwood wrote:
> > The definitive Linux box appears to be $199 from Walmart right now,
> > and its not SMP.
>
> In two year this kind of hardware probably will be SMP (HT or some
> variant).

HT is not the same thing as SMP; while the chip may appear to be two
processors, it is actually equivalent to 1.1 to 1.3 processors, depending on
the application.

Multicore processors and true SMP systems are unlikely to become mainstream
consumer items, given the premium price charged for such systems.

That given, I see some value in a stripped-down, low-overhead,
consumer-focused Linux that targets uniprocessor and HT systems, to be used
in the typical business or gaming PC. I'm not sure such is achievable with
the current config options; perhaps I should try to see how small a kernel I
can build for a simple ia32 system...

..Scott

Scott Robert Ladd
Coyote Gulch Productions (http://www.coyotegulch.com)
Professional programming for science and engineering;
Interesting and unusual bits of very free code.

2003-02-25 20:01:10

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 10:50:08AM -0800, William Lee Irwin III wrote:
>> Just taking the exception dwarfs anything written in C.
>> page_add_rmap() absorbs hits from all of the fault routines and
>> copy_page_range(). page_remove_rmap() absorbs hits from zap_pte_range().
>> do_wp_page() is huge because it's doing bitblitting in-line.

On Tue, Feb 25, 2003 at 08:18:17PM +0100, Andrea Arcangeli wrote:
> "absorbing" is a nice word for it. The way I see it, page_add_rmap and
> page_remove_rmap are even more expensive than the pagtable zapping.
> They're even more expensive than copy_page_range. Also focus on the
> numbers on the right that are even more interesting to find what is
> worth to optimize away first IMHO

Those just divide the number of hits by the size of the function IIRC,
which is useless for some codepath spinning hard in the middle of a
large function or in the presence of over-inlining. It's also greatly
disturbed by spinlock section hackery (as are most profilers).


On Tue, Feb 25, 2003 at 10:50:08AM -0800, William Lee Irwin III wrote:
>> These things aren't cheap with or without rmap. Trimming down

On Tue, Feb 25, 2003 at 08:18:17PM +0100, Andrea Arcangeli wrote:
> lots of things aren't cheap, but this isn't a good reason to make them
> twice more expensive, especially if they were as cheap as possible and
> they're critical hot paths.

They weren't as cheap as possible and it's a bad idea to make them so.
SVR4 proved there are limits to the usefulness of lazy evaluation wrt.
pagetable copying and the like.

You're also looking at sampling hits, not end-to-end timings.

After all these disclaimers, trimming down cpu cost is a good idea.


On Tue, Feb 25, 2003 at 10:50:08AM -0800, William Lee Irwin III wrote:
>> accounting overhead could raise search problems elsewhere.

On Tue, Feb 25, 2003 at 08:18:17PM +0100, Andrea Arcangeli wrote:
> this is the point indeed, but at least in 2.4 I don't see any cpu saving
> advantage during swapping because during swapping the cpu is always idle
> anyways.

It's probably not swapping that matters, but high turnover of clean data.
No one can really make a concrete assertion without some implementations
of the alternatives, which is why I think they need to be done soon.

Once one or more are there we're set. I'm personally in favor of the
anonymous handling rework as the alternative to pursue, since that
actually retains the locality of reference as opposed to wild pagetable
scanning over random processes, which is highly unpredictable with
respect to locality and even worse with respect to cpu consumption.


On Tue, Feb 25, 2003 at 08:18:17PM +0100, Andrea Arcangeli wrote:
> Infact I had to drop the lru_cache_add too from the anonymous page fault
> path because it was wasting way too much cpu to get peak performance (of
> course you're using per-page spinlocks by hand with rmap, and
> lru_cache_add needs a global spinlock, so at least rmap shouldn't
> introduce very big scalability issue unlike the lru_cache_add)

The high arrival rates to LRU lists in do_anonymous_page() etc. were
dealt with by the pagevec batching infrastructure in 2.5.x, which is
the primary method by which pagemap_lru_lock contention was addressed.
The "breakup" so to speak is primarily for locality of reference.
Which reminds me, my node-local pgdat allocation patch is pending...


On Tue, Feb 25, 2003 at 10:50:08AM -0800, William Lee Irwin III wrote:
>> Whether avoiding the search problem is worth the accounting overhead
>> could probably use some more investigation, like actually trying the
>> anonymous page handling rework needed to use vma-based ptov resolution.

On Tue, Feb 25, 2003 at 08:18:17PM +0100, Andrea Arcangeli wrote:
> the only solution is to do rmap lazily, i.e. to start building the rmap
> during swapping by walking the pagetables, basically exactly like I
> refill the lru with anonymous pages only after I start to need this
> information recently in my 2.4 tree, so if you never need to pageout
> heavily several giga of ram (like most of very high end numa servers),
> you'll never waste a single cycle in locking or whatever other worthless
> accounting overhead that hurts performance of all common workloads

I'd just bite the bullet and do the anonymous rework. Building
pte_chains lazily raises the issue of needing to allocate in order to
free, which is relatively thorny. Maintaining any level of accuracy of
the things with lazy buildup is also problematic. That and the whole
space issue wrt. pte_chains is blown away by the anonymous rework,
which is a significant advantage.


-- wli

2003-02-25 20:08:17

by jlnance

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 02:59:05PM -0500, Scott Robert Ladd wrote:
> > In two year this kind of hardware probably will be SMP (HT or some
> > variant).
>
> HT is not the same thing as SMP; while the chip may appear to be two
> processors, it is actually equivalent 1.1 to 1.3 processors, depending on
> the application.
>
> Multicore processors and true SMP systems are unlikely to become mainstream
> consumer items, given the premium price charged for such systems.

I think the difference between SMP and HT is likely to decrease rather
than increase in the future. Even now people want to put multiple CPUs
on the same piece of silicon. Once you do that, it only makes sense to
start sharing things between them. If you had a system with 2 CPUs
which shared a common L1 cache, would that be an HT or an SMP system?
Or you could go further and have 2 CPUs which share an FPU. There are
all sorts of combinations you could come up with. I think designers
will experiment and find the one that gives the most throughput for
the least money.

Jim

2003-02-25 20:13:13

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 12:10:23PM -0800, William Lee Irwin III wrote:
> I'd just bite the bullet and do the anonymous rework. Building
> pte_chains lazily raises the issue of needing to allocate in order to

note that there is no need to allocate in order to free.

Andrea

2003-02-25 20:20:16

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 11:27:40AM -0800, Martin J. Bligh wrote:
> > the only solution is to do rmap lazily, i.e. to start building the rmap
> > during swapping by walking the pagetables, basically exactly like I
> > refill the lru with anonymous pages only after I start to need this
> > information recently in my 2.4 tree, so if you never need to pageout
> > heavily several giga of ram (like most of very high end numa servers),
> > you'll never waste a single cycle in locking or whatever other worthless
> > accounting overhead that hurts performance of all common workloads
>
> Did you see the partially object-based rmap stuff? I think that does
> very close to what you want already.

I don't see how it can optimize away the overhead but I didn't look at
it for long.

Andrea

2003-02-25 20:37:05

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 12:10:23PM -0800, William Lee Irwin III wrote:
>> I'd just bite the bullet and do the anonymous rework. Building
>> pte_chains lazily raises the issue of needing to allocate in order to

On Tue, Feb 25, 2003 at 09:23:35PM +0100, Andrea Arcangeli wrote:
> note that there is no need of allocate to free.

I've no longer got any idea what you're talking about, then.


-- wli

2003-02-25 20:47:18

by Scott Robert Ladd

[permalink] [raw]
Subject: RE: Minutes from Feb 21 LSE Call

[email protected] wrote:
> I think the difference between SMP and HT is likely to decrease rather
> than increase in the future. Even now people want to put multiple CPUs
> on the same piece of silicon. Once you do that it only makes sense to
> start sharning things between them. If you had a system with 2 CPUs
> which shared a common L1 cache is that going to be a HT or an SMP system?
> Or you could go further and have 2 CPUs which share an FPU. There are
> all sorts of combinations you could come up with. I think designers
> will experiment and find the one that gives the most throughput for
> the least money.

IBM's forthcoming Power5 will have two cores, each with SMT (the generic
term for HyperThreading); it will present itself to the OS as four
processors. Those four processors, however, are not equal; SMT is certainly
valuable, but it can only be as effective as multiple cores if it in effect
*becomes* multiple cores (and, as such, turns into SMP).

I'm writing a chapter on memory architectures in my parallel programming
book; it's giving me a bit of a headache, as the issues you raise are both
important and complex. We have multiple levels of caches, NUMA
architectures, clusters, SMP, HT... the list just goes on and on, infinite
in diversity and combinations. Vendors will continue to experiment; I doubt
very much that any one architecture will take center stage.

I hope Linux handles the brain-sprain better than I am at the moment! ;)

..Scott

Scott Robert Ladd
Coyote Gulch Productions (http://www.coyotegulch.com)

2003-02-25 20:27:46

by Scott Robert Ladd

[permalink] [raw]
Subject: RE: Minutes from Feb 21 LSE Call

Steven Cole wrote:
> Hans may have 32 CPUs in his $3000 box, and I expect to have 8 CPUs in
> my $500 Walmart special 5 or 6 years hence. And multiple chip on die
> along with HT is what will make it possible.

Or will Walmart be selling systems with one CPU for $62.50?

"Normal" folk simply have no use for an 8 CPU system. Sure, the technology
is great -- but not many people are buying HDTV, let alone a computer system
that could do real-time 3D holographic imaging. What Walmart is selling
today for $199 is a 1.1 GHz Duron system with minimal memory and a 10GB hard
drive. Not exactly state of the art (although it might make a nice node in a
super-cheap cluster!).

Of course, you'll have your Joe Normals who will buy multiprocessor machines
with neon lights and case windows -- but those are the same people who drive
a Ford Excessive 4WD SuperCab pickup when the only thing they ever "haul" is
groceries.

(Note: I drive a big SUV because I *do* haul stuff, and I've got lots of
kids -- the right tool for the job, as Alan stated.)

> What concerns me is that this will make it possible to put insane
> numbers of CPUs in those $250,000 and higher boxes. If Martin et al can
> scale Linux to 64 CPUs, can they make it scale several binary orders of
> magnitude higher? Why do this? NUMA memory is much faster than even
> very fast network connections any day.
>
> Is there a market for such a thing?

Such systems will be very useful in limited markets. If I need to simulate
the global climate or the evolution of galaxies, I can damned-well use
65,536 quad-core CPUs, and I'll be happy to install Linux on such a box.
Writing e-mail or scanning my kids' drawings doesn't require that sort of
power.

> Please listen to Larry. When he says you can't scale endlessly, I have
> a feeling he knows what he's talking about. The Nirvana machine has 48
> SGI boxes with 128 CPUs in each. I don't hear about many 128 CPU
> machines nowadays. Perhaps Irix just wasn't quite up to the job. But
> new technologies will make this kind of machine affordable (by the
> government and financial institutions) in the not too distant future.

Linux needs a roadmap; perhaps it has one, and I just haven't seen it?

I'm not entirely certain that Linux can scale from toasters to Deep Thought;
the needs of an office worker don't coincide well with the needs of a
scientist trying to simulate the dynamics of hurricanes. I've worked both
ends of that spectrum; they really are two different universes that may not
be effectively addressed by one Linux.

I, for one, would rather see Linux work best on high-end systems; I have no
problem leaving the low end of the spectrum to consumer-oriented companies
like Microsoft. Linux has the most potential of any extant OS, in my
opinion, for handling the types of systems you envision. And to achieve such
a goal, some planning needs to be done *now* to avoid quagmires and
minefields in the future.

..Scott

--
Scott Robert Ladd
Coyote Gulch Productions (http://www.coyotegulch.com)

2003-02-25 20:42:09

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 12:46:16PM -0800, William Lee Irwin III wrote:
> On Tue, Feb 25, 2003 at 12:10:23PM -0800, William Lee Irwin III wrote:
> >> I'd just bite the bullet and do the anonymous rework. Building
> >> pte_chains lazily raises the issue of needing to allocate in order to
>
> On Tue, Feb 25, 2003 at 09:23:35PM +0100, Andrea Arcangeli wrote:
> > note that there is no need of allocate to free.
>
> I've no longer got any idea what you're talking about, then.

Were we able to release memory w/o rmap: yes.

Can we do it again: yes.

Can we use a bit of the released memory to release further memory more
efficiently with rmap: yes.

I'm not saying it's easy to implement that, but the problem that we'll
need memory to release memory doesn't exist, since it also never existed
before rmap was introduced into the kernel. Sure, the early stage of the
swapping would be more cpu-intensive, but that is the feature.

Andrea

2003-02-25 20:58:53

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

>> > the only solution is to do rmap lazily, i.e. to start building the rmap
>> > during swapping by walking the pagetables, basically exactly like I
>> > refill the lru with anonymous pages only after I start to need this
>> > information recently in my 2.4 tree, so if you never need to pageout
>> > heavily several giga of ram (like most of very high end numa servers),
>> > you'll never waste a single cycle in locking or whatever other
>> > worthless accounting overhead that hurts performance of all common
>> > workloads
>>
>> Did you see the partially object-based rmap stuff? I think that does
>> very close to what you want already.
>
> I don't see how it can optimize away the overhead but I didn't look at
> it for long.

Because you don't set up and tear down the rmap pte-chains for every
fault in / delete of any page ... it just works off the vmas.
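[Editor's note: a rough userspace sketch of the "works off the vmas" idea. Structure and field names here are simplified stand-ins for the kernel's vm_area_struct / address_space i_mmap machinery, not the actual patch; the point is that a file page's mapping addresses are derived from the vmas on demand, instead of being stored in per-pte chains.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified structures: in the real kernel these are
 * struct vm_area_struct entries hanging off a file's address_space. */
struct vma {
    unsigned long vm_start;  /* virtual start of the mapping */
    unsigned long vm_pgoff;  /* file offset of vm_start, in pages */
    unsigned long nr_pages;  /* length of the mapping, in pages */
    struct vma *next;        /* next vma mapping the same file */
};

/* Given a file page's index, compute the virtual address at which one
 * vma maps it, or 0 if that vma does not cover the page.  This is the
 * core of working off the vmas: nothing is stored per pte; the address
 * is computed when (and only when) reclaim needs it. */
unsigned long page_address_in_vma(const struct vma *v, unsigned long pgidx)
{
    if (pgidx < v->vm_pgoff || pgidx >= v->vm_pgoff + v->nr_pages)
        return 0;
    return v->vm_start + ((pgidx - v->vm_pgoff) << 12); /* 4K pages */
}

/* Count how many vmas on a file's list map a given page index; a real
 * implementation would then look up and unmap the pte at each address. */
int count_mappers(const struct vma *head, unsigned long pgidx)
{
    int n = 0;
    for (const struct vma *v = head; v; v = v->next)
        if (page_address_in_vma(v, pgidx))
            n++;
    return n;
}
```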

M.

2003-02-25 21:07:23

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 12:53:44PM -0800, Martin J. Bligh wrote:
> >> > the only solution is to do rmap lazily, i.e. to start building the rmap
> >> > during swapping by walking the pagetables, basically exactly like I
> >> > refill the lru with anonymous pages only after I start to need this
> >> > information recently in my 2.4 tree, so if you never need to pageout
> >> > heavily several giga of ram (like most of very high end numa servers),
> >> > you'll never waste a single cycle in locking or whatever other
> >> > worthless accounting overhead that hurts performance of all common
> >> > workloads
> >>
> >> Did you see the partially object-based rmap stuff? I think that does
> >> very close to what you want already.
> >
> > I don't see how it can optimize away the overhead but I didn't look at
> > it for long.
>
> Because you don't set up and tear down the rmap pte-chains for every
> fault in / delete of any page ... it just works off the vmas.

so basically it uses the rmap that we always had since at least 2.2 for
everything but anon mappings, right? this is what DaveM did a few years
back too. This makes lots of sense to me, so at least we avoid the
duplication of rmap information, even if it won't fix the anonymous page
overhead, but clearly it's much lower cost for everything but anonymous
pages.

Andrea

2003-02-25 21:09:19

by Chris Wedgwood

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 02:59:05PM -0500, Scott Robert Ladd wrote:

> HT is not the same thing as SMP; while the chip may appear to be two
> processors, it is actually equivalent to 1.1 to 1.3 processors,
> depending on the application.

You can't have non-integer numbers of processors. HT is a hack that
makes one piece of silicon appear to be two processors.

The fact that it's slower than a real dual-CPU box is irrelevant in some
sense; you still need SMP smarts to deal with it. It's only important
when you want to know why performance increases aren't apparent or you
lose performance in some cases... (i.e. the other virtual CPU thrashing
the cache).

> Multicore processors and true SMP systems are unlikely to become
> mainstream consumer items, given the premium price charged for such
> systems.

I overstated things in thinking SMP/HT would be in low-end hardware
within two years.

As Alan pointed out, since the 'Walmart' class hardware is 'whatever
is cheapest', perhaps HT/SMT/whatever won't be commonplace in
super-low-end boxes in two years --- but I would be surprised if it
didn't gain considerable market share elsewhere.

> That given, I see some value in a stripped-down, low-overhead,
> consumer-focused Linux that targets uniprocessor and HT systems, to
> be used in the typical business or gaming PC.

UP != HT

HT is SMP with magic requirements. For multiple physical CPUs the
requirements become even more complex; you want to try to group tasks
to physical CPUs, not logical ones lest you thrash the cache.
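[Editor's note: the grouping rule above can be illustrated with a toy placement policy. The numbering scheme and helper names are invented for the example; a real kernel discovers this topology from firmware/CPUID and the policy lives in the scheduler.]

```c
#include <assert.h>

/* Toy SMT model: logical CPUs 0..N-1 are grouped into physical
 * packages of `siblings` each.  Spreading runnable tasks across
 * *packages* first avoids two tasks competing for one core's cache
 * and execution units. */
int physical_package(int logical_cpu, int siblings)
{
    return logical_cpu / siblings;
}

/* Pick a logical CPU for the t-th runnable task: fill one logical CPU
 * per package before doubling up on a package's second sibling. */
int place_task(int t, int packages, int siblings)
{
    int pkg = t % packages;              /* spread across packages */
    int sib = (t / packages) % siblings; /* then across siblings   */
    return pkg * siblings + sib;
}
```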

Presumably there are other tweaks possible too; cache lines don't
bounce between logical CPUs on a physical CPU, for example, so some locks
and other data structures will be much faster to access than those
which actually do need cache lines to migrate between different
physical CPUs. I'm not sure this specific property can be
exploited in the general case though.

> I'm not sure such is achievable with the current config options;
> perhaps I should try to see how small a kernel I can build for a
> simple ia32 system...

Present 2.5.x looks like it will have smarts for HT as a subset of
NUMA.

If HT does become more common and similar things abound, I'm not sure
if it even makes sense to have a UP kernel for certain platforms
and/or CPUs --- since a mere BIOS change will affect what is
'virtually' apparent to the OS.


--cw

2003-02-25 21:11:02

by Chris Wedgwood

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 09:26:02PM -0800, William Lee Irwin III wrote:

> Could you help identify the regressions? Profiles? Workload?

Is the OSDL data that Cliff White pointed out sufficient to work with,
or do you want specific tests run with oprofile outputs?


--cw

2003-02-25 21:14:34

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

>> Could you help identify the regressions? Profiles? Workload?
>
> Is the OSDL data that Cliff White pointed out sufficient to work with,
> or do you want specific tests run with oprofile outputs?

It's a great start, but profiles would really help if you can grab them.

M.

2003-02-25 21:19:22

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

At some point in the past, Chris Wedgwood wrote:
>> It seems to me for small boxes, 2.5.x is marginally slower at most
>> things than 2.4.x.

On Mon, Feb 24, 2003 at 10:17:05PM -0800, Martin J. Bligh wrote:
> Can you name a benchmark, or at least do something reproducible between
> versions, and produce a 2.4 vs 2.5 profile? Let's at least try to fix it ...

Looks like Cliff's got some good data.


-- wli

2003-02-25 21:12:37

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

>> Because you don't set up and tear down the rmap pte-chains for every
>> fault in / delete of any page ... it just works off the vmas.
>
> so basically it uses the rmap that we always had since at least 2.2 for
> everything but anon mappings, right? this is what DaveM did a few years
> back too. This makes lots of sense to me, so at least we avoid the
> duplication of rmap information, even if it won't fix the anonymous page
> overhead, but clearly it's much lower cost for everything but anonymous
> pages.

Right ... and anonymous chains are about 95% single-reference (at least for
the case I looked at), so they're direct mapped from the struct page with
no chain at all. Cuts out something like 95% of the space overhead of
pte-chains, and 65% of the time (for kernel compile -j256 on 16x system).
However, it's going to be a little more expensive to *use* the mappings,
so we need to measure that carefully.
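[Editor's note: a minimal userspace sketch of the direct-versus-chain idea described above. All names are invented, and this is not the actual 2.5 patch: the single-mapping case stores the pte pointer right in struct page, and a chain node is allocated only once a second mapping (e.g. after fork) appears.]

```c
#include <assert.h>
#include <stdlib.h>

struct pte;                       /* opaque stand-in for a pagetable entry */

struct pte_chain {
    struct pte *ptep;
    struct pte_chain *next;
};

struct page {
    int direct;                   /* 1: u.ptep is the sole mapping */
    union {
        struct pte *ptep;         /* the common, single-mapping case */
        struct pte_chain *chain;  /* the rare, multiply-mapped case  */
    } u;
};

/* Record a new mapping of pg by ptep.  Returns 0, or -1 on OOM. */
int page_add_rmap(struct page *pg, struct pte *ptep)
{
    if (!pg->direct && !pg->u.chain) {       /* first mapping: go direct */
        pg->direct = 1;
        pg->u.ptep = ptep;
        return 0;
    }
    struct pte_chain *n = malloc(sizeof *n); /* second or later mapping */
    if (!n)
        return -1;
    if (pg->direct) {                        /* demote the direct entry */
        struct pte_chain *first = malloc(sizeof *first);
        if (!first) { free(n); return -1; }
        first->ptep = pg->u.ptep;
        first->next = NULL;
        pg->direct = 0;
        pg->u.chain = first;
    }
    n->ptep = ptep;
    n->next = pg->u.chain;
    pg->u.chain = n;
    return 0;
}

int page_mapcount(const struct page *pg)
{
    if (pg->direct)
        return 1;
    int n = 0;
    for (struct pte_chain *c = pg->u.chain; c; c = c->next)
        n++;
    return n;
}
```

Since ~95% of anonymous pages never take the chain path, the chain allocation (and its space cost) is paid only for the shared minority.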

M.

2003-02-25 21:18:54

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 12:53:44PM -0800, Martin J. Bligh wrote:
>> Because you don't set up and tear down the rmap pte-chains for every
>> fault in / delete of any page ... it just works off the vmas.

On Tue, Feb 25, 2003 at 10:17:18PM +0100, Andrea Arcangeli wrote:
> so basically it uses the rmap that we always had since at least 2.2 for
> everything but anon mappings, right? this is what DaveM did a few years
> back too. This makes lots of sense to me, so at least we avoid the
> duplication of rmap information, even if it won't fix the anonymous page
> overhead, but clearly it's much lower cost for everything but anonymous
> pages.

This is what the "anonymous rework" is about. There is already a fix
extant for the file-backed case, which I presumed you knew of already,
and so we were speaking of issues with the anonymous case.

My impression thus far is that the anonymous case has not been pressing
with respect to space consumption or cpu time once the file-backed code
is in place, though if it resurfaces as a serious concern the anonymous
rework can be pursued (along with other things).


-- wli

2003-02-25 21:12:33

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Mon, Feb 24, 2003 at 09:26:02PM -0800, William Lee Irwin III wrote:
>> Could you help identify the regressions? Profiles? Workload?

On Tue, Feb 25, 2003 at 01:21:15PM -0800, Chris Wedgwood wrote:
> Is the OSDL data that Cliff White pointed out sufficient to work with,
> or do you want specific tests run with oprofile outputs?

oprofile is what's needed. Looks like he's taking care of that too.


-- wli

2003-02-25 21:28:07

by Scott Robert Ladd

[permalink] [raw]
Subject: RE: Minutes from Feb 21 LSE Call

Chris Wedgwood wrote:
SRL>HT is not the same thing as SMP; while the chip may appear to be
SRL>two processors, it is actually equivalent to 1.1 to 1.3 processors,
SRL>depending on the application.
>
CW> You can't have non-integer numbers of processors. HT is a hack
CW> that makes one piece of silicon appear to be two processors.

I'm aware of that. ;) I'm well aware of the architecture needed to support
HT.

> The fact that it's slower than a real dual-CPU box is irrelevant in some
> sense; you still need SMP smarts to deal with it. It's only important
> when you want to know why performance increases aren't apparent or you
> lose performance in some cases... (i.e. the other virtual CPU thrashing
> the cache).

Performance differences *are* quite relevant when it comes to thread
scheduling; the two virtual CPUs are not necessarily equivalent in
performance.

> As Alan pointed out, since the 'Walmart' class hardware is 'whatever
> is cheapest', perhaps HT/SMT/whatever won't be commonplace in
> super-low-end boxes in two years --- but I would be surprised if it
> didn't gain considerable market share elsewhere.

I suspect HT/SMT will be common for people who have multimedia systems,
for video editing and high-end gaming.

I doubt we'll see SMT toasters, though.

> UP != HT

An HT system is still a single, physical processor; HT is not equivalent to
a multicore chip, either. Much depends on memory and connection models; a
dual-core chip may be faster or slower than two similar physical SMP
processors, depending on the architecture.

I was speaking in terms of Intel's push to add HT to all of their P4s.
Systems with a single CPU will likely have HT; that still doesn't make them
as powerful as a true dual processor (or dual core CPU) system.

> HT is SMP with magic requirements. For multiple physical CPUs the
> requirements become even more complex; you want to try to group tasks
> to physical CPUs, not logical ones lest you thrash the cache.

Exactly. This is why HT is not the same thing as two physical CPUs. The OS
must be aware of this to effectively schedule jobs. So I think we generally
agree.

> If HT does become more common and similar things abound, I'm not sure
> if it even makes sense to have a UP kernel for certain platforms
> and/or CPUs --- since a mere BIOS change will affect what is
> 'virtually' apparent to the OS.

A good point.

..Scott

2003-02-25 21:45:04

by Hans Reiser

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

Scott Robert Ladd wrote:

>"Normal" folk simply have no use for an 8 CPU system.
>
I had this argument over whether normal people would ever really need a
10MB hard drive when I was 21. Once was enough, sorry, I didn't
convince the other guy then, and I don't think I have gotten more
eloquent since then.

I'll just say that entertainment will drive computing for the next 5-15
years, and game designers won't have enough CPU that whole time.
Hollywood is dying like radio did, and immersive experiences are
replacing it.

HDTV might not make it. I personally don't really want any audio or
video devices or sources which are not well integrated into my computer,
and HDTV is not. I am not sure if the rest of the market will think
like me, but the gamers might.... I am getting a La Cie 4 monitor next
week which will do 2048x1536 without blurring pixels for $960, and I
just don't think I will want to use an HDTV for anything except maybe
the kitchen. I try to watch a high quality movie once a week with a
friend because I don't want to miss out on our culture (and games are
not yet as culturally rich as movies), but games are more engaging, and
I am not really managing to watch the movie a week. I seem to be at the
extreme of a growing trend.

Scott Robert Ladd wrote:

(Note: I drive a big SUV because I *do* haul stuff, and I've got lots of
kids -- the right tool for the job, as Alan stated.)

You didn't say whether you typically haul stuff and kids over rough
roads. If you don't (and very few SUV owners do), then what you need is
called a "mini-van", which is what people who are functionally oriented
buy for city hauling of kids and stuff ;-), and I bought my wife one.
It has more than 16 CPUs in it....

--
Hans


2003-02-25 21:58:02

by Larry McVoy

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 01:21:34PM -0800, William Lee Irwin III wrote:
> On Mon, Feb 24, 2003 at 09:26:02PM -0800, William Lee Irwin III wrote:
> >> Could you help identify the regressions? Profiles? Workload?
>
> On Tue, Feb 25, 2003 at 01:21:15PM -0800, Chris Wedgwood wrote:
> > Is the OSDL data that Cliff White pointed out sufficient to work with,
> > or do you want specific tests run with oprofile outputs?
>
> oprofile is what's needed. Looks like he's taking care of that too.

Without doing something about the page coloring problem (and he might be)
the numbers will be fairly meaningless.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-25 22:07:39

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 01:26:35PM -0800, William Lee Irwin III wrote:
> My impression thus far is that the anonymous case has not been pressing
> with respect to space consumption or cpu time once the file-backed code
> is in place, though if it resurfaces as a serious concern the anonymous
> rework can be pursued (along with other things).

sounds good to me ;)

Andrea

2003-02-25 22:01:36

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 01:21:34PM -0800, William Lee Irwin III wrote:
>> oprofile is what's needed. Looks like he's taking care of that too.

On Tue, Feb 25, 2003 at 02:08:11PM -0800, Larry McVoy wrote:
> Without doing something about the page coloring problem (and he might be)
> the numbers will be fairly meaningless.

Hmm, point. Let's see if we can get Cliff to apply the new patch that
one guy put out yesterday or so.


-- wli

2003-02-25 22:04:43

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 01:12:55PM -0800, Martin J. Bligh wrote:
> >> Because you don't set up and tear down the rmap pte-chains for every
> >> fault in / delete of any page ... it just works off the vmas.
> >
> > so basically it uses the rmap that we always had since at least 2.2 for
> > everything but anon mappings, right? this is what DaveM did a few years
> > back too. This makes lots of sense to me, so at least we avoid the
> > duplication of rmap information, even if it won't fix the anonymous page
> > overhead, but clearly it's much lower cost for everything but anonymous
> > pages.
>
> Right ... and anonymous chains are about 95% single-reference (at least for
> the case I looked at), so they're direct mapped from the struct page with
> no chain at all. Cuts out something like 95% of the space overhead of
> pte-chains, and 65% of the time (for kernel compile -j256 on 16x system).
> However, it's going to be a little more expensive to *use* the mappings,
> so we need to measure that carefully.

Sure, it is more expensive to use them, but all we care about is
complexity, and they solve the complexity problem just fine, so I
definitely prefer it. Cpu utilization during heavy swapping isn't a big
deal IMHO

Andrea

2003-02-25 22:17:12

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

> Sure, it is more expensive to use them, but all we care about is
> complexity, and they solve the complexity problem just fine, so I
> definitely prefer it. Cpu utilization during heavy swapping isn't a big
> deal IMHO

I totally agree with you. However, the concerns others raised were over
page aging and page stealing (e.g. from pagecache), which might not involve
disk, but would also be slower. It probably needs some tuning and tweaking,
but I'm pretty sure it's fundamentally the right approach.

M.

2003-02-25 22:26:08

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 02:17:48PM -0800, Martin J. Bligh wrote:
> > Sure, it is more expensive to use them, but all we care about is
> > complexity, and they solve the complexity problem just fine, so I
> > definitely prefer it. Cpu utilization during heavy swapping isn't a big
> > deal IMHO
>
> I totally agree with you. However the concerns others raised were over
> page aging and page stealing (eg from pagecache), which might not involve
> disk, but would also be slower. It probably need some tuning and tweaking,
> but I'm pretty sure it's fundamentally the right approach.

there's no slowdown at all when we don't need to unmap anything. We
just need to avoid watching the pte young bit in the pagetables unless
we're about to start unmapping stuff. Most machines won't reach the
point where they need to start unmapping stuff. Watching the ptes during
normal pagecache recycling would be wasteful anyway, regardless of what
chain we take to reach the pte.

Andrea

2003-02-25 22:27:29

by Chris Wedgwood

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 02:08:11PM -0800, Larry McVoy wrote:

> Without doing something about the page coloring problem (and he
> might be) the numbers will be fairly meaningless.

page coloring problem?

i was under the impression on anything 8-way-associative or better the
page coloring improvements were negligible for real-world benchmarks
(ie. kernel compiles)

... or is this more an artifact that even though the improvements for
real-world workloads are negligible, micro-benchmarks are susceptible to
these variations, thus making things like the std. dev. larger than it
would otherwise be?



--cw

2003-02-25 22:47:58

by Larry McVoy

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

> ... or is this more an artifact that even though the improvements for
> real-world workloads are negligible, micro-benchmarks are susceptible to
> these variations, thus making things like the std. dev. larger than it
> would otherwise be?

Bingo. If you are trying to measure whether something adds cache misses
you really want reproducible runs.
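[Editor's note: for context, a page's cache "color" can be sketched as below. The geometry figures in the example are illustrative, not any particular CPU. Two pages of the same color compete for the same sets of a physically-indexed cache, so which colors the allocator happens to hand a benchmark changes its miss counts from run to run; that run-to-run variance is the reproducibility problem Larry is pointing at.]

```c
#include <assert.h>

/* Number of distinct page colors = bytes covered by one cache way,
 * divided by the page size; a page's color is its frame number modulo
 * that count. */
unsigned long page_color(unsigned long pfn,
                         unsigned long cache_bytes,
                         unsigned long ways,
                         unsigned long page_bytes)
{
    unsigned long colors = cache_bytes / ways / page_bytes;
    return pfn % colors;
}
```

Higher associativity shrinks the per-way footprint and hence the number of colors, which is why coloring matters less on highly associative caches.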
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-25 22:57:31

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: Server shipments [was Re: Minutes from Feb 21 LSE Call]

On Mon, 24 Feb 2003 20:17:01 PST, Larry McVoy wrote:
> On Mon, Feb 24, 2003 at 08:11:21PM -0800, Martin J. Bligh wrote:
> > > What part of "all servers from all companies" did you not understand?
> >
> > Average price from Dell: $2495
> > Average price overall: $9347
> >
> > Conclusion ... Dell makes cheaper servers than average, presumably smaller.
>
> So how many CPUs do you think you get in a $9K server?

Did the numbers track add-on prices, as opposed to base server? Most
servers are sold with one CPU and lots of extra slots. Need to dig
down to the add-on data to find upgrades to more CPUs and more memory
in the field (and more disk drives).

> Better yet, since you work for IBM, how many servers do they ship in a year
> with 16 CPUs?

Proprietary data, unfortunately. And I'm not sure if even internally
the totals are rolled up as pSeries, xSeries, zSeries, iSeries, etc. and
broken down by linux/aix/NT/VM/etc. Nor do most big companies efficiently
track the size of a machine at a customer site after hardware upgrades
(I know for Sequent this in particular was a painful problem - sold a
two-way and supported an 18-way machine later).

gerrit

2003-02-25 23:09:15

by Larry McVoy

[permalink] [raw]
Subject: Re: Server shipments [was Re: Minutes from Feb 21 LSE Call]

On Tue, Feb 25, 2003 at 02:02:28PM -0800, Gerrit Huizenga wrote:
> On Mon, 24 Feb 2003 20:17:01 PST, Larry McVoy wrote:
> > On Mon, Feb 24, 2003 at 08:11:21PM -0800, Martin J. Bligh wrote:
> > > > What part of "all servers from all companies" did you not understand?
> > >
> > > Average price from Dell: $2495
> > > Average price overall: $9347
> > >
> > > Conclusion ... Dell makes cheaper servers than average, presumably smaller.
> >
> > So how many CPUs do you think you get in a $9K server?
>
> Did the numbers track add-on prices, as opposed to base server? Most
> servers are sold with one CPU and lots of extra slots. Need to dig
> down to the add-on data to find upgrades to more CPUs and more memory
> in the field (and more disk drives).

I included the URL's so you could check for yourself but I arrived at
those numbers by taking the world wide revenue associated with servers
and dividing by the number of units shipped. I would expect that would
include the add on stuff.

I'm sure IBM makes money on their high end stuff but I'd suspect that
it is more bragging rights than what keeps the lights on.

I think the point which was missed in this whole thread is that even if
IBM has fantastic margins today on big iron, it's unlikely to stay that
way. The world is catching up. I can buy a dual 1.8GHz AMD box for
about $1500. 4-ways are more, maybe $10K or so. So you have the cheapo
white boxes coming at you from the low end.

On the high end, go look at what customers want. They are mostly taking
those big boxes and partitioning them. Sooner or later some bright boy
is going to realize that they could put 4 4-way boxes in one rack and
call it a 16-way box with 4-way partitioning "pre-installed".
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-25 23:16:31

by Scott Robert Ladd

[permalink] [raw]
Subject: RE: Minutes from Feb 21 LSE Call

> >"Normal" folk simply have no use for an 8 CPU system.
> I had this argument over whether normal people would ever really need a
> 10mb hard drive when I was 21. Once was enough, sorry, I didn't
> convince the other guy then, and I don't think I have gotten more
> eloquent since then.

I should be more careful in what I say. I remember fighting to get 20MB
drives in systems for an early Novell LAN, when management thought that no
one would ever need more than 10MB.

To be more precise in my reasoning:

High-powered, multiprocessor computers will be an essential part of people's
lives -- in medical equipment, possibly guiding transportation, in various
tools that affect people's live. As for what we see today as a "home
computer": The vast majority of people don't use what they already have.
This is one reason that sales of "home computers" have slowed; people just
don't need a 3GHz system (with or without HT or SMP) for checking e-mail and
writing a letter to Aunt Edna.

> I'll just say that entertainment will drive computing for the next 5-15
> years, and game designers won't have enough CPU that whole time.
> Hollywood is dying like radio did, and immersive experiences are
> replacing it.

You are correct. Gaming, file sharing, digital imaging -- those applications
eat horsepower. But I honestly can't see how 8 processors can possibly make
Abiword run "better."

Technologies tend to hit a point where they're "good enough" for the
majority of users. For example, houses haven't really changed much in 50
years, in spite of Disney visions and HGTV. I haven't seen too many
push-button houses (like people predicted in the 1950s); and I still want my
flying cars, dang-it!

> You didn't say whether you typically haul stuff and kids over rough
> roads. If you don't (and very few SUV owners do), then what you need is
> called a "mini-van", which is what people who are functionally oriented
> buy for city hauling of kids and stuff ;-), and I bought my wife one.
> It has more than 16 CPUs in it....

I live half-time in rural Colorado -- at 9800 feet above sea level, on rough
highways 60 miles from the nearest grocery store. I've also done Search &
Rescue, and I'm involved in work on Indian Reservations (where roads just
plain stink). I do need a different vehicle for when I'm in Florida -- we
usually leave my behemoth parked and drive a boring Taurus.

My 4x4 SUV is kinda raggy; it's 18 years old, and I maintain it myself.
People who buy $75,000 Cadillac SUVs with leather seats do it for prestige
and "mine is bigger than yours" competition. Kinda like folks who buy
dual-processor systems with 250GB drives, so they can web surf or impress
people at LAN parties... ;)

This point does fit with our discussion of multiprocessor computers.
Minivans are *not* marvels of high technology; they're actually quite
prosaic. But they do the job well for many people who have no need for a
high-tech car. Meanwhile, the best-technology vehicles don't sell very well.
I suspect the same rule holds true for computers.

..Scott

2003-02-25 23:35:58

by Gerhard Mack

[permalink] [raw]
Subject: Re: Server shipments [was Re: Minutes from Feb 21 LSE Call]

On Tue, 25 Feb 2003, Larry McVoy wrote:

> On Tue, Feb 25, 2003 at 02:02:28PM -0800, Gerrit Huizenga wrote:
> > On Mon, 24 Feb 2003 20:17:01 PST, Larry McVoy wrote:
> > > On Mon, Feb 24, 2003 at 08:11:21PM -0800, Martin J. Bligh wrote:
> > > > > What part of "all servers from all companies" did you not understand?
> > > >
> > > > Average price from Dell: $2495
> > > > Average price overall: $9347
> > > >
> > > > Conclusion ... Dell makes cheaper servers than average, presumably smaller.
> > >
> > > So how many CPUs do you think you get in a $9K server?
> >
> > Did the numbers track add-on prices, as opposed to base server? Most
> > servers are sold with one CPU and lots of extra slots. Need to dig
> > down to the add-on data to find upgrades to more CPUs and more memory
> > in the field (and more disk drives).
>
> I included the URL's so you could check for yourself but I arrived at
> those numbers by taking the world wide revenue associated with servers
> and dividing by the number of units shipped. I would expect that would
> include the add on stuff.
>
> I'm sure IBM makes money on their high end stuff but I'd suspect that
> it is more bragging rights than what keeps the lights on.
>
> I think the point which was missed in this whole thread is that even if
> IBM has fantastic margins today on big iron, it's unlikely to stay that
> way. The world is catching up. I can buy a dual 1.8GHz AMD box for
> about $1500. 4-ways are more, maybe $10K or so. So you have the cheapo
> white boxes coming at you from the low end.
>
> On the high end, go look at what customers want. They are mostly taking
> those big boxes and partitioning them. Sooner or later some bright boy
> is going to realize that they could put 4 4-way boxes in one rack and
> call it a 16-way box with 4-way partitioning "pre-installed".

Er, you mean like what racksaver.com does with their two dual-CPU servers
in a box?

Gerhard

--
Gerhard Mack

[email protected]

<>< As a computer I find your faith in technology amusing.

2003-02-25 23:33:38

by Alan

[permalink] [raw]
Subject: RE: Minutes from Feb 21 LSE Call

On Tue, 2003-02-25 at 20:37, Scott Robert Ladd wrote:
> Steven Cole wrote:
> > Hans may have 32 CPUs in his $3000 box, and I expect to have 8 CPUs in
> > my $500 Walmart special 5 or 6 years hence. And multiple chip on die
> > along with HT is what will make it possible.
>
> Or will Walmart be selling systems with one CPU for $62.50?
>
> "Normal" folk simply have no use for an 8 CPU system. Sure, the technology
> is great -- but not many people are buying HDTV, let alone a computer system
> that could do real-time 3D holographic imaging. What Walmart is selling
> today for $199 is a 1.1 GHz Duron system with minimal memory and a 10GB hard
> drive.

Last time I checked it was an 800MHz VIA C3 with onboard everything (an EPIA
variant). Even the CPU is BGA-mounted to keep cost down.

2003-02-25 23:49:58

by Hans Reiser

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

Scott Robert Ladd wrote:

> But I honestly can't see how 8 processors can possibly make
>Abiword run "better."
>
They can't, but you know it was before 1980 that hardware exceeded what
was really needed for email. What happened then? People needed more
horsepower for WYSIWYG editors, the new thing of that time.....

Now it is games that hardware is too slow for. After games, maybe AI
assistants?.... Will you be saying, "My AI doesn't have enough
horsepower to run on, its databases are small and out of date, and it is
providing me worse advice than my wealthy friends get, and providing it
later."? How much will you pay for a good AI to advise you? (I really
like my I-Nav GPS adviser in my mini-van.... money well spent....)

>I live half-time in rural Colorado -- at 9800 feet above sea level, on rough
>highways 60 miles from the nearest grocery store.
>
Ok, you win that one.;-)

>Kinda like folks who buy
>dual-processor systems with 250GB drives, so they can web surf or impress
>people at LAN parties... ;)
>
I am buying a new monitor so that I can do head-shots more easily in
Tribes 2 ;-). I suppose I should be more motivated by having bigger
emacs windows and thereby increasing the size of my visual cache, and
maybe when I was younger I would have been more motivated by that, and
it does prevent me from feeling guilty about spending that money, but at
this phase of my life ;-) I hate it when pixelization prevents me from
lining up on the head....

It is interesting that games are the only compelling motivation for
faster desktop hardware these days. It may be part of why we are in a
tech bust. When AIs become hardware purchase drivers, there will likely
be a boom again.

--
Hans


2003-02-25 23:46:55

by Scott Robert Ladd

[permalink] [raw]
Subject: RE: Minutes from Feb 21 LSE Call

Alan Cox wrote:
SRL> that could do real-time 3D holographic imaging. What Walmart is
SRL> selling today for $199 is a 1.1 GHz Duron system with minimal
SRL> memory and a 10GB hard drive.

AC> Last time I checked it was an 800Mhz VIA C3 with onboard everything
AC> (EPIA variant). Even the CPU is BGA mounted to keep cost down

My reference is:

http://www.walmart.com/catalog/product.gsp?product_id=2138700&cat=3951&type=19&dept=3944&path=0:3944:3951

1.1 GHz Duron
128 MB RAM
10 GB drive
CD-ROM
Ethernet

For $199.98.

..Scott

2003-02-26 00:11:13

by Scott Robert Ladd

[permalink] [raw]
Subject: RE: Minutes from Feb 21 LSE Call

Hans Reiser wrote
> Now it is games that hardware is too slow for. After games, maybe AI
> assistants?.... Will you be saying, "My AI doesn't have enough
> horsepower to run on, its databases are small and out of date, and it is
> providing me worse advice than my wealthy friends get, and providing it
> later."? How much will you pay for a good AI to advise you? (I really
> like my I-Nav GPS adviser in my mini-van.... money well spent....)

Really good AI is predicated on the invention of better algorithms. I do a
bit of work in this area; we're a long way from any useful AI -- unless you
think Microsoft's "Clippy" qualifies. :)

I would love to see "intelligence" in software; IBM's recent "autonomic
computing" initiative is marketing hype for a good idea. Programs (including
Linux!) should be self-diagnosing, fault tolerant, and self-correcting.
We're not there yet on the software side (again).

And "smart AI" may not be something people want. Many people distrust
machines -- and in gaming, a really good AI simply isn't as important (or
desirable) as pretty graphics (handled by a GPU).

> Ok, you win that one.;-)

Yeah! ;)

> It is interesting that games are the only compelling motivation for
> faster desktop hardware these days. It may be part of why we are in a
> tech bust. When AIs become hardware purchase drivers, there will likely
> be a boom again.

I've worked with several game companies; AI just isn't a priority. Games
need to be "good enough" to challenge average gamers; people who want a real
challenge play online against other humans.

Excellent breast physics (Extreme Beach Volleyball) sells games; a crafty,
hard-to-defeat AI actually turns off casual players and just isn't "sexy".

And now I think we're getting *WAY* off topic. :)

..Scott

2003-02-26 00:38:00

by Steven Cole

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

cc list trimmed.

On Tue, 2003-02-25 at 16:41, Hans Reiser wrote:
> Scott Robert Ladd wrote:
>
> > But I honestly can't see how 8 processors can possibly make
> >Abiword run "better."
> >
> They can't, but you know it was before 1980 that hardware exceeded what
> was really needed for email. What happened then? People needed more
> horsepower for wysiwyg editors, the new thing of that time.....
>
> Now it is games that hardware is too slow for. After games, maybe AI
> assistants?.... Will you be saying, "My AI doesn't have enough
> horsepower to run on, its databases are small and out of date, and it is
> providing me worse advice than my wealthy friends get, and providing it
> later."? How much will you pay for a good AI to advise you? (I really
> like my I-Nav GPS adviser in my mini-van.... money well spent....)
>
[snippage]
>
> It is interesting that games are the only compelling motivation for
> faster desktop hardware these days. It may be part of why we are in a
> tech bust. When AIs become hardware purchase drivers, there will likely
> be a boom again.
>
> --
> Hans

It's easy to say that people don't need a multiple Ghz processor to run
most applications (games and AI aside) because it's true. But human
nature is such that people bought muscle cars with way more horsepower
than needed in the 60's and 70's before environmental concerns
intervened. The current slowdown in PC purchases may be more due to a
cyclical bear market than due to satiation of need. When the economy
turns around, and it always does, many people will opt for the $600 2.4
Ghz P4 instead of the $200 1.1 Ghz Duron. And not because they need it,
but because of other factors.

Now, fast forward five years. If AMD is still around, Intel will be
forced to offer ridiculously fast hardware just to stay in business. My
original point is that the Ghz race may be supplemented by a SMP/HT
race, not because of need (AI and games may help provide an excuse), but
because of greed and envy. Never underestimate those last two. And
that SMP/HT race could have an important impact on future kernel design.

Steven (Looking forward to his 2.4 Ghz P4 which will compile a 2.5
kernel faster than the 15 minutes it takes his 450 Mhz PIII today,
especially with Reiser4 patched in.;) )

2003-02-26 00:44:24

by Hans Reiser

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

Scott Robert Ladd wrote:

>
>I've worked with several game companies; AI just isn't a priority. Games
>need to be "good enough" to challenge average gamers; people who want a real
>challenge play online against other humans.
>
I didn't mean game AI. In real life, computers aim better than humans
do. In real life, that makes people want the AI; in a game, when the
robot is faster, they don't buy the game.

I predict the US military will drive AI research over the next 10 years
because AIs can shoot better and faster. After the AIs mature on the
battlefield they'll start being more useful to industry (replacing bus
drivers, etc.)

In 15-30 years, AIs will be a big market, a huge one. Of course, people
said that 30 years ago and it seemed reasonable then....

--
Hans


2003-02-26 01:57:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call


On Tue, 25 Feb 2003, Dave Jones wrote:
>
> > Yes, if you don't take advantage of sysenter, then all the sysenter
> > support will just make us look worse ;(
>
> Andi's patch[1] to remove one of the wrmsr's from the context switch
> fast path should win back at least some of the lost microbenchmark
> points.

But the patch is fundamentally broken wrt preemption at least, and it
looks totally unfixable.

It's also overly complex, for no apparent reason. The simple way to avoid
the wrmsr of SYSENTER_CS is to just cache a per-cpu copy in memory,
preferably in some location that is already in the cache at context switch
time for other reasons.

Linus

2003-02-26 04:13:19

by Jesse Pollard

[permalink] [raw]
Subject: Re: Server shipments [was Re: Minutes from Feb 21 LSE Call]

On Tuesday 25 February 2003 17:46, Gerhard Mack wrote:
> On Tue, 25 Feb 2003, Larry McVoy wrote:
> > Date: Tue, 25 Feb 2003 15:19:26 -0800
> > From: Larry McVoy <[email protected]>
> > To: Gerrit Huizenga <[email protected]>
> > Cc: Larry McVoy <[email protected]>, Martin J. Bligh <[email protected]>,
> > [email protected]
> > Subject: Re: Server shipments [was Re: Minutes from Feb 21 LSE Call]
> >
[snip]
> > On the high end, go look at what customers want. They are mostly taking
> > those big boxes and partitioning them. Sooner or later some bright boy
> > is going to realize that they could put 4 4 way boxes in one rack and
> > call it a 16 way box with 4 way partitioning "pre-installed".
>
> er you mean like what racksaver.com does with their 2 dual CPU servers in
> a box?

And that is not "Big Iron".

Sorry - Big Iron is a 1-5 TFlop single-system-image, shared-memory machine
with a streaming vector processor...

Something like a Cray X1, single processor for instance.
Or a 1024 processor Cray T3, again single system image, even if it doesn't
have a streaming vector processor.

I don't see that any of the current cluster systems provide the throughput
of such a system. Not even IBM's SP series. Aggregate measures of theoretical
throughput just don't add up. Practical throughput is almost always only 80%
of the theoretical (i.e. the advertised) throughput. Most cannot handle the
data I/O requirement, much less the IPC latency.

Sure, 3 microseconds sounds nice for Myrinet, but nothing beats 17 clock
ticks, where each tick is 4 ns, for the first 64-bit word of data... followed
by the next word in 4 ns per bus. (And that is on a slow processor....)

The output is fed to memory on every clock tick. (Most Cray processors have 4
memory buses for each processor - two for input data, one for output data,
and one for the instruction stream; and each has the same cycle time... Now
go to 4/8/16/32 processors without reducing that timing. That requires some
CAREFUL hardware design.)

And you better believe that there are big margins on such a system. You only
have to sell 8 to 16 units to exceed the yearly profit of most computer
companies. Do I have hard numbers on the units? no. I don't work for Cray.
I have used their systems for the last 12 years, and until the Earth Simulator
came on line, there was nothing that came close to their throughput for
weather modeling, finite element analysis, or other large problem types.

None of the microprocessors (possibly excepting the Power 4) can come close -
when you look at the processor internals, they all have only a single memory
bus, running at approximately 1-2 GB/second to cache.

Look at the cray this way: ALL of main memory is cache... with 4 ports to
it... for EACH processor...

Would I like to see Linux running on these? yes. Can I pay for it? No. I'm
not in such a position where I could buy one. Would customers buy one?
Perhaps - if the price were right or the need great enough. Would having
Linux on it save the vendor money? I don't know. I hope that it would.

Unfortunately, there are too many things missing from Linux for it to be
considered:
job and process checkpoint/restart (with files/pipes/sockets intact)
batch job processors (REAL batch jobs ... not just cron)
resource accounting and resource allocation control
compartmented mode security support
truly large filesystem support (10 TB online, 300+ TB nearline in one fs)
large file support (100-300 GB in one file at least)
large process support
(10Gb processes, 10-1000 threads... I can dream can't I :-)
automatic hardware failover support
hot swap components (disks, tapes, memory, processors)

to make a short list.

2003-02-26 04:56:41

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Server shipments [was Re: Minutes from Feb 21 LSE Call]

On Tue, Feb 25, 2003 at 10:23:04PM -0600, Jesse Pollard wrote:
> And that is not "Big Iron".
> sorry - Big Iron is a 1 -5 TFlop single system image, shared memory, with
> streaming vector processor...

Thank you for putting things in their perspectives.

This is why I call x86en maxed to their architectural limits "midrange",
which is a kind overestimate given their sickeningly enormous deficits.


-- wli

2003-02-26 05:17:21

by Bernd Eckenfels

[permalink] [raw]
Subject: Re: Server shipments [was Re: Minutes from Feb 21 LSE Call]

In article <03022522230400.04587@tabby> you wrote:
> Something like a Cray X1, single processor for instance.
> Or a 1024 processor Cray T3, again single system image, even if it doesn't
> have a streaming vector processor.
>
> I don't see that any of the current cluster systems provide the throughput
> of such a system. Not even IBMs' SP series.

This clearly depends on the workload. For most vector processors
partitioning does not make sense. And don't forget, most of those systems are
pure compute servers used for scientific computing.

> The output is fed to memory on every clock tick. (most Cray processors have 4
> memory busses for each processor - two for input data, one for output data
> and one for the instruction stream

The fastest Cray on top500.org is a T3E1200 at rank _22_; the fastest IBM is
ranked _2_ with a Power3 processor. There are 13 IBM systems before the
first (fastest) Cray system. Of course those GFlops are measured for
parallel problems, but there are a lot out there.

And all those numbers are totally uninteresting for DB or Storage Servers.
Even a SAP SD Benchmark would not be fun on a Cray.

> I have used their systems for the last 12 years, and until the Earth Simulator
> came on line, there was nothing that came close to their throughput for
> weather modeling, finite element analysis, or other large problem types.

That's clearly wrong. http://www.top500.org/lists/lists.php?Y=2002&M=06

There are a lot of Power3 and Alpha systems before the first Cray.

Greetings
Bernd

2003-02-26 05:16:28

by Rik van Riel

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, 25 Feb 2003, William Lee Irwin III wrote:

> My impression thus far is that the anonymous case has not been pressing
> with respect to space consumption or cpu time once the file-backed code
> is in place, though if it resurfaces as a serious concern the anonymous
> rework can be pursued (along with other things).

... but making the anonymous pages use an object based
scheme probably will make things too expensive.

IIRC the object based reverse map patches by bcrl and
davem both failed on the complexities needed to deal
with anonymous pages.

My instinct is that a hybrid system will work well in
most cases and the worst case with mapped files won't
be too bad.

cheers,

Rik
--
Engineers don't grow up, they grow sideways.
http://www.surriel.com/ http://kernelnewbies.org/

2003-02-26 05:20:51

by Bernd Eckenfels

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

In article <[email protected]> you wrote:
> unfortunately for them the core CPU speeds became uncoupled from the
> memory speeds and skyrocketed up to the point where CISC cores are as fast
> or faster than the 'high speed' RISC cores.

Hmm.. are there any RISC cores which run even close to CISC speeds?

And why not? Is this only the financial power of Intel?

Bernd

2003-02-26 05:29:09

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, 25 Feb 2003, William Lee Irwin III wrote:
>> My impression thus far is that the anonymous case has not been pressing
>> with respect to space consumption or cpu time once the file-backed code
>> is in place, though if it resurfaces as a serious concern the anonymous
>> rework can be pursued (along with other things).

On Wed, Feb 26, 2003 at 02:24:18AM -0300, Rik van Riel wrote:
> ... but making the anonymous pages use an object based
> scheme probably will make things too expensive.
> IIRC the object based reverse map patches by bcrl and
> davem both failed on the complexities needed to deal
> with anonymous pages.
> My instinct is that a hybrid system will work well in
> most cases and the worst case with mapped files won't
> be too bad.

The boxen I'm supposed to babysit need a high degree of resource
consciousness wrt. lowmem allocations, so there is a clear voice
on this issue. IMHO it's still an open question as to whether this
is efficient for replacement concerns, which may yet favor objects.


-- wli

2003-02-26 05:33:23

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

In article <...> someone wrote:
>> unfortunately for them the core CPU speeds became uncoupled from the
>> memory speeds and skyrocketed up to the point where CISC cores are as fast
>> or faster than the 'high speed' RISC cores.

On Wed, Feb 26, 2003 at 06:30:50AM +0100, Bernd Eckenfels wrote:
> Hmm.. are there any RISC Cores which run even closely to CISC Speeds?
> And why not? Is this only the financial power of Intel?

There is one other: x86 binary compatibility.

Looks like the beginning and end of it to me.


-- wli

2003-02-26 05:51:57

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

>>> My impression thus far is that the anonymous case has not been pressing
>>> with respect to space consumption or cpu time once the file-backed code
>>> is in place, though if it resurfaces as a serious concern the anonymous
>>> rework can be pursued (along with other things).
>
> On Wed, Feb 26, 2003 at 02:24:18AM -0300, Rik van Riel wrote:
>> ... but making the anonymous pages use an object based
>> scheme probably will make things too expensive.
>> IIRC the object based reverse map patches by bcrl and
>> davem both failed on the complexities needed to deal
>> with anonymous pages.
>> My instinct is that a hybrid system will work well in
>> most cases and the worst case with mapped files won't
>> be too bad.
>
> The boxen I'm supposed to babysit need a high degree of resource
> consciousness wrt. lowmem allocations, so there is a clear voice

It seemed, at least on the simple kernel compile tests that I did, that all
the long chains are not anonymous. It killed 95% of the space issue, which
given the simplicity of the patch was pretty damned stunning. Yes, there's
a pointer per page I guess we could kill in the struct page itself, but I
think you already have a better method for killing mem_map bloat ;-)

M.

2003-02-26 05:56:59

by Aaron Lehmann

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 06:28:08PM -0500, Scott Robert Ladd wrote:
> You are correct. Gaming, file sharing, digital imaging -- those application
> eat horsepower. But I honestly can't see how 8 processors can possibly make
> Abiword run "better."

With the current (and historic) state of Abiword performance, anything
would be an improvement.

2003-02-26 06:06:36

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

At some point in the past, I wrote:
>> The boxen I'm supposed to babysit need a high degree of resource
>> consciousness wrt. lowmem allocations, so there is a clear voice

On Tue, Feb 25, 2003 at 10:01:20PM -0800, Martin J. Bligh wrote:
> It seemed, at least on the simple kernel compile tests that I did, that all
> the long chains are not anonymous. It killed 95% of the space issue, which
> given the simplicity of the patch was pretty damned stunning. Yes, there's
> a pointer per page I guess we could kill in the struct page itself, but I
> think you already have a better method for killing mem_map bloat ;-)

I'm not going to get up in arms about this unless there's a serious
performance issue that's going to get smacked down that I want to have
a say in how it gets smacked down. aa is happy with the filebacked
stuff, so I'm not pressing it (much) further.

And yes, page clustering is certainly on its way and fast. I'm getting
very close to the point where a general announcement will be in order.
There's basically "one last big bug" and two bits of gross suboptimality
I want to clean up before bringing the world to bear on it.


-- wli

2003-02-26 06:24:12

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, Feb 25, 2003 at 10:01:20PM -0800, Martin J. Bligh wrote:
>> It seemed, at least on the simple kernel compile tests that I did, that all
>> the long chains are not anonymous. It killed 95% of the space issue, which
>> given the simplicity of the patch was pretty damned stunning. Yes, there's
>> a pointer per page I guess we could kill in the struct page itself, but I
>> think you already have a better method for killing mem_map bloat ;-)

On Tue, Feb 25, 2003 at 10:14:40PM -0800, William Lee Irwin III wrote:
> I'm not going to get up in arms about this unless there's a serious
> performance issue that's going to get smacked down that I want to have
> a say in how it gets smacked down. aa is happy with the filebacked
> stuff, so I'm not pressing it (much) further.
> And yes, page clustering is certainly on its way and fast. I'm getting
> very close to the point where a general announcement will be in order.
> There's basically "one last big bug" and two bits of gross suboptimality
> I want to clean up before bringing the world to bear on it.

Screw it. Here it comes, ready or not. hch, I hope you were right...


-- wli

2003-02-26 07:13:20

by David Lang

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, 25 Feb 2003, William Lee Irwin III wrote:

> In article <...> someone wrote:
> >> unfortunately for them the core CPU speeds became uncoupled from the
> >> memory speeds and skyrocketed up to the point where CISC cores are as fast
> >> or faster than the 'high speed' RISC cores.
>
> On Wed, Feb 26, 2003 at 06:30:50AM +0100, Bernd Eckenfels wrote:
> > Hmm.. are there any RISC Cores which run even closely to CISC Speeds?
> > And why not? Is this only the financial power of Intel?
>
> There is one other: x86 binary compatibility.
>
> Looks like the beginning and end of it to me.

it's more than just the financial power of Intel; AMD is also in many ways
above the performance of the 'high-end' processors.

aceshardware has a chart showing several different processors (dated
October 2002, so it's not _that_ out of date):

http://www.aceshardware.com/read_news.jsp?id=60000436

one interesting thing I see from this chart is that the x86 processors are
well ahead in integer performance and pulling further ahead (the pace of
development is significantly faster than the other processors). While they
do lag in floating point (but not by that much), there are a LOT of
workloads where floating point performance is not as important (the K2
showed that it can't lag _too_ far behind).

the x86 binary compatibility means that even a 'low volume' x86
compatible chip has a large potential market and a company can do
reasonably well getting a small percentage of the market (see the
Transmeta and Cyrix chips), while the non-x86 chips (including ia64) have
to invent a new market segment for themselves.

David Lang

2003-02-26 08:36:15

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

David Mosberger <[email protected]> writes:

> >>>>> On Sun, 23 Feb 2003 00:07:50 -0800 (PST), David Lang
> <[email protected]> said:
>
>
> David.L> Gerrit, you missed the prior poster's point. IA64 has the
> David.L> same fundamental problem as the Alpha, PPC, and Sparc
> David.L> processors: it doesn't run x86 binaries.
>
> This simply isn't true. Itanium and Itanium 2 have full x86 hardware
> built into the chip (for better or worse ;-). The speed isn't as good
> as the fastest x86 chips today, but it's faster (~300MHz P6) than the
> PCs many of us are using and it certainly meets my needs better than
> any other x86 "emulation" I have used in the past (which includes
> FX!32 and its relatives for Alpha).

I have various random x86 binaries that do not work.

My 32bit x86 user space does not run.

A 32bit kernel doesn't have a chance.

So for me at least the 32bit support is not useful in avoiding
converting binaries. For the handful of apps that cannot be
recompiled I suspect the support is good enough so you can get them
to run somehow.

Eric

2003-02-26 09:26:09

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Server shipments [was Re: Minutes from Feb 21 LSE Call]

Bernd Eckenfels <[email protected]> writes:

> In article <03022522230400.04587@tabby> you wrote:
> > The output is fed to memory on every clock tick. (most Cray processors have 4
>
> > memory busses for each processor - two for input data, one for output data
> > and one for the instruction stream
>
> The fastest Cray on top500.org is T3E1200 on rank _22_, the fastest IBM is
> > ranked _2_ with a Power3 Processor. There are 13 IBM systems before the
> first (fastest) Cray system. Of course those GFlops are measured for
> parallel problems, but there are a lot out there.

And it is especially interesting when you note that among ranks 2-5 the
ratings are so close a strong breeze could cause an upset. And that #5
is composed of dual-CPU P4 Xeon nodes....

Eric


2003-02-26 11:59:25

by Jesse Pollard

[permalink] [raw]
Subject: Re: Server shipments [was Re: Minutes from Feb 21 LSE Call]

On Tuesday 25 February 2003 23:27, Bernd Eckenfels wrote:
> In article <03022522230400.04587@tabby> you wrote:
> > Something like a Cray X1, single processor for instance.
> > Or a 1024 processor Cray T3, again single system image, even if it
> > doesn't have a streaming vector processor.
> >
> > I don't see that any of the current cluster systems provide the
> > throughput of such a system. Not even IBMs' SP series.
>
> This clearly depends on the workload. For most vector processors
> partitioning does not make sense. And dont forget, most of those systems
> are pure compute servers used fr scientific computing.

Not as much as you would expect. I've been next to (a cubicle over from) some
people doing benchmarking on the IBM SP 3 (a 330-node quad-processor system
and a newer one). Neither could achieve the "advertised" speed on real
problems.

> > The output is fed to memory on every clock tick. (most Cray processors
> > have 4 memory busses for each processor - two for input data, one for
> > output data and one for the instruction stream
>
> The fastest Cray on top500.org is T3E1200 on rank _22_, the fastest IBM is
> ranked _2_ with a Power3 Processor. There are 13 IBM systems before the
> first (fastest) Cray system. Of course those GFlops are measured for
> parallel problems, but there are a lot out there.

The T3 achieves its speed based on the torus network. The processors
are only 400 MHz Alphas, 4 to a processing element. The IBM achieves
its speed from a carefully crafted benchmark to show the fastest aggregate
computation possible. It is not a practical usage. Basically the computation
is split into the largest possible chunks, each chunk run on independent
systems, and merged at the very end of the computation. (I've used them too
and have access to two of them.)

It takes something in the neighborhood of 60-100 processors in a T3 to
equal one Cray arch processor (even on a C90). A 32-processor C90
easily kept up with a T3 until you exceeded 900 processors in the T3. (I had
access to each of those too.)

> And all those numbers are totally uninteresting for DB or Storage Servers.
> Even a SAP SD Benchmark would not be fun on a Cray.

The Cray has been known to support 200+ GB filesystems with 300+TB
nearline storage with a maximum of 11 second access to data when that
data has been migrated to tape... Admittedly, the time gets longer if the file
exceeds about 100 MB since it must then access multiple tapes in parallel.

> > I have used their systems for the last 12 years, and until the Earth
> > Simulator came on line, there was nothing that came close to their
> > throughput for weather modeling, finite element analysis, or other large
> > problem types.
>
> thats clearly wrong. http://www.top500.org/lists/lists.php?Y=2002&M=06

What you are actually looking at is a custom benchmark, carefully crafted
to show the fastest aggregate computation possible. It is not a practical
usage. The aggregate Cray system throughput (if you max out an X1 cluster)
exceeds even the Earth Simulator. Unfortunately, one of these hasn't been
sold yet.

One of the biggest weaknesses in the IBM world is the SP switch. The lack
of a true shared-memory programming model limits the systems to very
coarse-grained parallelism. It really is just a collection of very fast small
servers. There is no "single system image". The OS and all core utilities
must be duplicated on each node or the cluster will not boot.

> There are a lot of Power3 ans Alpha systems before the first cray.

Ah, no. The first Cray was before the Pentium... The company made a profit
off of its very first system sale. There was no Power3 or Alpha chip.

2003-02-26 15:58:39

by Horst H. von Brand

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

[Massive snippage of Cc:]
Hans Reiser <[email protected]> said:

[...]

> It is interesting that games are the only compelling motivation for
> faster desktop hardware these days. It may be part of why we are in a
> tech bust. When AIs become hardware purchase drivers, there will likely
> be a boom again.

Oh, it was always that way. When it was the Apple ][+, nobody complained about
the spreadsheet being too small/slow; it was games which were CPU and
display hungry. The machines of that vintage that still have a following
around here are the Atari 800XL and such, which had special hardware for
managing graphics on the display. With the first PCs it was color displays.
Then came CDs and multimedia. Today it is fast CPUs and accelerated video
cards my students want for running the latest crop of games. Many keep Win98
just for running games; for work they use Linux ;-)

Some people say that most new computing stuff is first introduced for
gaming. That does make sense to me, as in a game you'll be more tolerant of
rough edges; plus games do have a much wider appeal than office suites or
databases, and are a much more competitive market to boot. ;-)
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2003-02-26 16:21:23

by Horst H. von Brand

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

[Massive cutdown on Cc:]
Hans Reiser <[email protected]>

[...]

> In 15-30 years, AIs will be a big market, a huge one. Of course, people
> said that 30 years ago and it seemed reasonable then....

It won't. Because AI is handwaving, patch over hack with the odd kludge, for
lack of a decent, structured solution. If the problem is important, some
solution is found eventually, and the area doesn't qualify as AI anymore ;-)

It happened to "automatic programming": getting a program written from a
high-level specification was an AI problem, until compiler technology was
born and matured. Being able to manage a computer system required a human,
until modern OSes. Today you have machines reading handwriting (sort of) as
part of PDAs; there is even some limited voice input available. Automatic
recognition of failed parts from video cameras is routine, and work is
progressing on face recognition. It just isn't called AI anymore.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2003-02-26 16:33:15

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: Server shipments [was Re: Minutes from Feb 21 LSE Call]

On Wed, 26 Feb 2003, Jesse Pollard wrote:
> On Tuesday 25 February 2003 23:27, Bernd Eckenfels wrote:
> > There are a lot of Power3 ans Alpha systems before the first cray.
>
> Ah no. The first cray was before the Pentium... The company made a profit
> off of its first sale on one system. There was no power 3 or alpha chip.

I think Bernd was speaking about the Top 500, not about a historical timeline.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2003-02-26 17:03:55

by Rik van Riel

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Tue, 25 Feb 2003, Martin J. Bligh wrote:
> > On Wed, Feb 26, 2003 at 02:24:18AM -0300, Rik van Riel wrote:
> >> ... but making the anonymous pages use an object based
> >> scheme probably will make things too expensive.

> >> My instinct is that a hybrid system will work well in

[snip] "wli wrote something"

> It seemed, at least on the simple kernel compile tests that I did, that
> all the long chains are not anonymous. It killed 95% of the space issue,
> which given the simplicity of the patch was pretty damned stunning. Yes,
> there's a pointer per page I guess we could kill in the struct page
> itself, but I think you already have a better method for killing mem_map
> bloat ;-)

Also, with copy-on-write and mremap after fork, an object based
rmap scheme for anonymous pages is just complex, almost certainly
far too complex to be worth it, since it has too many issues. Just
read the patches by bcrl and davem; things get hairy fast.

The pte chain rmap scheme is clean, but suffers from too much
overhead for file mappings.

As shown by Dave's patch, a hybrid system really is simple and
clean, and it removes most of the pte chain overhead while still
keeping the code nice and efficient.

I think this hybrid system is the way to go, possibly with a few
more tweaks left and right...

regards,

Rik
--
Engineers don't grow up, they grow sideways.
http://www.surriel.com/ http://kernelnewbies.org/

2003-02-26 18:34:55

by Alan

[permalink] [raw]
Subject: Re: Minutes from Feb 21 LSE Call

On Wed, 2003-02-26 at 16:07, Horst von Brand wrote:
> Some people say that most new computing stuff is first introduced for
> gaming. That does make sense to me, as in a game you'll be more tolerant of
> rough edges; plus games do have a much wider appeal than office suites or
> databases, and are a much more competitive market to boot. ;-)

If you've ever seen Master Thief run on a 16MHz PalmPilot you might want to
ask the game folks some *hard* questions too 8)

2003-02-26 19:38:10

by Bill Davidsen

Subject: Re: Minutes from Feb 21 LSE Call

On Mon, 24 Feb 2003, Bill Huey wrote:

> You don't need data. It's conceptually obvious.

The mantra of doomed IPOs, ill-fated software projects, and the guy down
the street who has never invested in a company which was still in business
24 months later. No matter how great the concept, it still has to work.

It's conceptually obvious that professional programmers working for a
major software house will write a better OS than a grad student fighting
off boredom one summer... in the end you always need data.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2003-02-26 20:47:12

by Daniel Phillips

Subject: Re: Minutes from Feb 21 LSE Call

On Wednesday 26 February 2003 17:02, Rik van Riel wrote:
> On Tue, 25 Feb 2003, Martin J. Bligh wrote:
> > It seemed, at least on the simple kernel compile tests that I did, that
> > all the long chains are not anonymous. It killed 95% of the space issue,
> > which given the simplicity of the patch was pretty damned stunning. Yes,
> > there's a pointer per page I guess we could kill in the struct page
> > itself, but I think you already have a better method for killing mem_map
> > bloat ;-)
>
> Also, with copy-on-write and mremap after fork, doing an
> object based rmap scheme for anonymous pages is just complex,
> almost certainly far too complex to be worth it, since it just
> has too many issues. Just read the patches by bcrl and davem,
> things get hairy fast.
>
> The pte chain rmap scheme is clean, but suffers from too much
> overhead for file mappings.

There is a lot of redundancy in the rmap chains that could be exploited. If
a pte page happens to reference a group of (say) 32 anon pages, then you can
set each anon page's page->index to its position in the group and let a
pte_chain node point at the pte of the first page of the group. You can then
find each page's pte by adding its page->index to the pte_chain node's pte
pointer. This allows a single rmap chain to be shared by all the pages in
the group.

This much of the idea is simple, however there are some tricky details to
take care of. How does a copy-on-write break out one page of the group from
one of the pte pages? I tried putting a (32 bit) bitmap in each pte_chain
node to indicate which pte entries actually belong to the group, and that
wasn't too bad except for doubling the per-link memory usage, turning a best
case 32x gain into only 16x. It's probably better to break the group up,
creating log2(groupsize) new chains. (This can be avoided in the common case
that you already know every page in the group is going to be copied, as with
a copy_from_user.) Getting rid of the bitmaps makes the single-page case the
same as the current arrangement and makes it easy to let the size of a group
be as large as the capacity of a whole pte page.
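The lookup trick described above is easy to sketch in C. In this sketch the
names (struct pte_group, struct page_sketch, page_to_pte) are invented for
illustration and are not the kernel's real rmap structures; the point is only
that one shared chain node plus a per-page index replaces 32 separate
pte_chain links:

```c
/* Sketch of the shared-chain idea: invented names, not kernel code. */
#include <stddef.h>

#define GROUP_SIZE 32

typedef unsigned long pte_t;

/* One reverse-map node shared by a whole group of anon pages:
 * it records only the pte of the group's first page. */
struct pte_group {
	pte_t *first_pte;
};

/* Stand-in for struct page: index holds the page's position
 * within its group, as described for page->index above. */
struct page_sketch {
	unsigned int index;
	struct pte_group *group;
};

/* Recover a page's pte by offsetting from the shared node's
 * first pte -- one chain node serves all GROUP_SIZE pages. */
static pte_t *page_to_pte(const struct page_sketch *page)
{
	return page->group->first_pte + page->index;
}
```

The cost shows up exactly where the text says: once one page of the group is
broken out by copy-on-write, a single shared node can no longer describe all
the group's ptes, and the group has to be split or annotated somehow.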

There's also the problem of detecting groupable clusters of pages, e.g., in
do_anon_page. Swap-out and swap-in introduce more messiness, as does mremap.
In the end, I decided it's not needed in the current cycle, but probably
worth investigating later.

My purpose in bringing it up now is to show that there are still some more
incremental gains to be had without needing radical surgery.

> As shown by Dave's patch, a hybrid system really is simple and
> clean, and it removes most of the pte chain overhead while still
> keeping the code nice and efficient.
>
> I think this hybrid system is the way to go, possibly with a few
> more tweaks left and right...

Emphatically, yes.

Regards,

Daniel

2003-02-27 00:49:14

by Bill Huey

Subject: Re: Minutes from Feb 21 LSE Call

On Wed, Feb 26, 2003 at 02:31:33PM -0500, Bill Davidsen wrote:
> On Mon, 24 Feb 2003, Bill Huey wrote:
> > You don't need data. It's conceptually obvious.
>
> The mantra of doomed IPOs, ill-fated software projects, and the guy down
> the street who has never invested in a company which was still in business
> 24 months later. No matter how great the concept, it still has to work.

I'm not disagreeing with that, but if you read the previous exchange you'd see
that I was reacting to what seemed to be an obviously rude dismissal of how
latency affects both the IO performance of a system and trashes the usability
of a priority-driven scheduler. It's basic computer science.

> It's conceptually obvious that professional programmers working for a
> major software house will write a better OS than a grad student fighting
> off boredom one summer... in the end you always need data.

Had to read your post a couple of times to make sure that the tone of it
wasn't charged. :)

All I can say now is that I'm working on it. We'll see if it's vaporware
in the near future.

bill

2003-02-27 17:49:52

by Daniel Egger

Subject: Re: Minutes from Feb 21 LSE Call

On Wed, 2003-02-26 at 06:30, Bernd Eckenfels wrote:

> > unfortunately for them the core CPU speeds became uncoupled from the
> > memory speeds and skyrocketed up to the point where CISC cores are as fast
> > or faster than the 'high speed' RISC cores.

> Hmm.. are there any RISC Cores which run even closely to CISC Speeds?

Define RISC and CISC: do you mean pure RISC implementations or RISC
implementations with CISC frontend?

Define Speed: Felt speed, clock speed or measurable speed?

I'm convinced that for each (sensible) combination of definitions above
there's a clear indication that your question is wrong.

--
Servus,
Daniel



2003-02-27 18:16:23

by David Lang

Subject: Re: Minutes from Feb 21 LSE Call

On Thu, 27 Feb 2003, Daniel Egger wrote:

> Date: Thu, 27 Feb 2003 09:50:21 -0800
> From: Daniel Egger <[email protected]>
> To: Bernd Eckenfels <[email protected]>
> Cc: [email protected]
> Subject: Re: Minutes from Feb 21 LSE Call
>
> On Wed, 2003-02-26 at 06:30, Bernd Eckenfels wrote:
>
> > > unfortunately for them the core CPU speeds became uncoupled from the
> > > memory speeds and skyrocketed up to the point where CISC cores are as fast
> > > or faster than the 'high speed' RISC cores.
>
> > Hmm.. are there any RISC Cores which run even closely to CISC Speeds?
>
> Define RISC and CISC: do you mean pure RISC implementations or RISC
> implementations with CISC frontend?

As far as programmers and users are concerned there is no difference
between CISC and RISC with a CISC front-end (Transmeta is the most obvious
example of this, but all current CISC chips use this technique).

> Define Speed: Felt speed, clock speed or measurable speed?

For my original post I was referring to pure clock speeds. Remember that
the original RISC chips came out when CISC chips were just starting to
hit 60MHz, and a large part of their claim was that it didn't matter if the
chip got less done per clock because they could run at much higher speeds
(a couple hundred MHz). Also, the clocks per instruction for CISC chips were
very high, with the RISC chips pushing towards 1 IPC.

The reasoning was that there was no way to implement all the complicated
CISC instruction set decoding and options and achieve anything close to
the clock speeds that the nice streamlined RISC chips could reach.

Now the RISC chip cores are just over 1GHz and talking about possibly
hitting 1.8GHz within a year or so, while the Intel chips are pushing 3GHz
and the AMD chips are pushing 2GHz (true speed; I'll avoid commenting on the
mistakes that Intel made on the P4 that make these chips competitive with
each other :-)

And the IPC of current CISC implementations is pushing towards 1 as well
(insert disclaimer about benchmarks), so RISC no longer has a huge
advantage there either.

Obviously higher clock speeds do not directly equate to higher
performance, but as Linus has pointed out there are a lot of efficiencies
in the CISC instruction set that mean that if you have two chips running at
the same clock speed with the same IPC, the CISC instruction set will
outperform the RISC one.

The only serious advantage the RISC chips have today is the fact that they
are 64-bit instead of 32-bit, and x86-64 will erase that limitation.

David Lang

> I'm convinced that for each (sensible) combination of definitions above
> there's a clear indication that your question is wrong.
>
> --
> Servus,
> Daniel
>

2003-02-27 20:10:47

by Bill Davidsen

Subject: Re: Minutes from Feb 21 LSE Call

On Wed, 26 Feb 2003, Bill Huey wrote:

> On Wed, Feb 26, 2003 at 02:31:33PM -0500, Bill Davidsen wrote:
> > On Mon, 24 Feb 2003, Bill Huey wrote:
> > > You don't need data. It's conceptually obvious.
> >
> > The mantra of doomed IPOs, ill-fated software projects, and the guy down
> > the street who has never invested in a company which was still in business
> > 24 months later. No matter how great the concept, it still has to work.
>
> I'm not disagreeing with that, but if you read the previous exchange you'd see
> that I was reacting to what seemed to be an obviously rude dismissal of how
> latency affects both the IO performance of a system and trashes the usability
> of a priority-driven scheduler. It's basic computer science.

No argument from me, but I have seen systems driving up the system time
and beating the cache with scheduling logic and context switches. There's
a balance to be had there, and in timeslice size, and other places as
well, and real data are always useful.

> > It's conceptually obvious that professional programmers working for a
> > major software house will write a better OS than a grad student fighting
> > off boredom one summer... in the end you always need data.
>
> Had to read your post a couple of times to make sure that the tone of it
> wasn't charged. :)

It's always more effective if it's subtle and people take an instant
to get it.

> All I can say now is that I'm working on it. We'll see if it's vaporware
> in the near future.

Great. I have no doubt that when you have convinced yourself one way or
the other you won't have any problem convincing me. When the IO was slow,
the VM was primitive, and the scheduler was a doorknob, preempt made a big
improvement. Now that the rest of the kernel doesn't suck, it's a lot
harder to make a big improvement.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2003-02-28 08:41:41

by Filip Van Raemdonck

Subject: Re: Minutes from Feb 21 LSE Call

On Thu, Feb 27, 2003 at 10:25:23AM -0800, David Lang wrote:
> On Thu, 27 Feb 2003, Daniel Egger wrote:
> > On Wed, 2003-02-26 at 06:30, Bernd Eckenfels wrote:
> > >
> > > Hmm.. are there any RISC Cores which run even closely to CISC Speeds?
> >
> > Define Speed: Felt speed, clock speed or measurable speed?
>
> when the RISC chip cores are just over 1GHz and talking about possibly
> hitting 1.8GHz within a year or so the intel chips are pushing 3GHz while
> the AMD chips are pushing 2GHz (true speed, I'll avoid commenting on the
> mistakes that intel made on the P4 that make these chips competitive with
> each other :-)

And they need what kind of cooling to get there? Compare that to a
passively cooled G4.[1]

This is AFAIK not true for every RISC chip; I believe current UltraSPARC,
and Alpha even more so, do need significant cooling as well.
But IMO your question is not really relevant to the discussion, as
especially the common CISC examples you are referring to are artificially
pushed to higher clock speeds. Compare those to a passively cooled
Transmeta or VIA chip and you can ask about clock speeds CISC vs. CISC
instead of CISC vs. RISC.
Or compare a Transmeta or VIA chip with RISC cores and see who wins the
clockspeed race then :-)


Regards,

Filip

[1] Yes, those dual PowerMacs have a case exhaust fan which draws air over
the CPUs' fins. So has the fastest passively cooled, not clocked-down
Intel desktop I've ever seen, a 300MHz PII Dell Dimension. (That's
also still the most silent "high-end" Intel desktop I've ever seen.)

--
"The only stupid question is the unasked one."
-- Martin Schulze

2003-02-28 20:17:21

by Diego Calleja

Subject: Re: Minutes from Feb 21 LSE Call

On Thu, 27 Feb 2003 10:25:23 -0800 (PST),
David Lang <[email protected]> wrote...

> when the RISC chip cores are just over 1GHz and talking about possibly
> hitting 1.8GHz within a year or so the intel chips are pushing 3GHz while

I suppose everybody has seen this... (today on Slashdot ;):
http://apple.slashdot.org/article.pl?sid=03/02/27/2227257&mode=thread&tid=136

"PowerPC 970 Running at 2.5 GHz"

2003-03-01 00:41:28

by Chris Wedgwood

Subject: Re: Minutes from Feb 21 LSE Call

On Fri, Feb 28, 2003 at 08:48:26PM +0100, Arador wrote:

> I suppose everybody has seen this... (today on Slashdot ;):
> http://apple.slashdot.org/article.pl?sid=03/02/27/2227257&mode=thread&tid=136

In a lab... who cares.

I would guess the P4s or whatever are at 5GHz+ in Intel's labs.


--cw

2003-03-01 00:56:24

by Davide Libenzi

Subject: Re: Minutes from Feb 21 LSE Call

On Fri, 28 Feb 2003, Chris Wedgwood wrote:

> On Fri, Feb 28, 2003 at 08:48:26PM +0100, Arador wrote:
>
> > I suppose everybody has seen this... (today on Slashdot ;):
> > http://apple.slashdot.org/article.pl?sid=03/02/27/2227257&mode=thread&tid=136
>
> In a lab... who cares.
>
> I would guess the P4s or whatever are at 5GHz+ in Intel's labs.

Last time I checked (one year ago), they were running a cooled ALU at 10GHz.



- Davide

2003-03-01 01:18:43

by David Lang

Subject: Re: Minutes from Feb 21 LSE Call

This implies that when the new chip is released it may be up to 2.5GHz
instead of the 1.8GHz previously listed.

OK, this means that a year or so from now when the chip is released it
will have clock speeds in the range of today's x86 chips. This is at least
in the ballpark to remain competitive.

David Lang


On Fri, 28 Feb 2003, Chris Wedgwood wrote:

> Date: Fri, 28 Feb 2003 16:51:49 -0800
> From: Chris Wedgwood <[email protected]>
> To: Arador <[email protected]>
> Cc: David Lang <[email protected]>, [email protected],
> [email protected], [email protected]
> Subject: Re: Minutes from Feb 21 LSE Call
>
> On Fri, Feb 28, 2003 at 08:48:26PM +0100, Arador wrote:
>
> > I suppose everybody has seen this... (today on Slashdot ;):
> > http://apple.slashdot.org/article.pl?sid=03/02/27/2227257&mode=thread&tid=136
>
> In a lab... who cares.
>
> I would guess the P4s or whatever are at 5GHz+ in Intel's labs.
>
>
> --cw
>

2003-03-01 14:07:09

by Daniel Egger

Subject: Re: Minutes from Feb 21 LSE Call

On Sat, 2003-03-01 at 02:27, David Lang wrote:

> this implies that when the new chip is released it may be up to 2.5GHz
> instead of the 1.8GHz previously listed.

> Ok, this means that a year or so from now when the chip is released it
> will have clock speeds in the range of todays x86 chips. this is at least
> in the ballpark to remain competitive.

From a pure clock point of view, yes. For real-life performance, rather look
at the SPEC numbers. Actually I don't care about the clocking of a processor,
especially when vendors are pushing it with longer pipelines, as this is
all marketing crap (and let's see what problems Intel will face when the
Centrino architecture shows that lower clocks might bring higher performance
with a different design).

BTW: the distributed.net client now has an AltiVec implementation which makes
a G4/500 eat an Athlon/1.8GHz for breakfast...

--
Servus,
Daniel


Attachments:
signature.asc (189.00 B)
Dies ist ein digital signierter Nachrichtenteil