Linus writes:
>Just face it - people who want memory hotplug had better know that
>beforehand (and let's be honest - in practice it's only going to work in
>virtualized environments or in environments where you can insert the new
>bank of memory and copy it over and remove the old one with hw support).
>
>Same as hugetlb.
>
>Nobody sane _cares_. Nobody sane is asking for these things. Only people
>with special needs are asking for it, and they know their needs.
Hello, my name is Andy. I am insane. I am one of the CRAZY PEOPLE you wrote
about. I am the whisperer in people's minds, causing them to conspire
against sanity everywhere and make lives as insane and crazy as mine is.
I love my work. I am an astrophysicist. I have lurked on various linux
lists for years now, and this is my first time standing in front of all
you people, hoping to make you bend your insane and crazy kernel developing
minds to listen to the rantings of my insane and crazy HPC mind.
I have done high performance computing in astrophysics for nearly two
decades now. It gives me a perspective that kernel developers usually
don't have, but sometimes need. For my part, I promise that I specifically
do *not* have the perspective of a kernel developer. I don't even speak C.
I don't really know what you folks do all day or night, and I actually
don't much care except when it impacts my own work. I am fairly certain
a lot of this hotplug/page defragmentation/page faulting/page zeroing
stuff from the sgi and ibm folk that is currently being rejected for
inclusion in the kernel impacts my work in very serious ways. You're
right, I do know my needs. They are not being met and the people with the
power to do anything about it call me insane and crazy and refuse to be
interested even in making improvement possible, even when it quite likely
helps them too.
Today I didn't hear a voice in my head that told me to shoot the pope, but
I did hear one telling me to write a note telling you about my issues,
which apparently are in the 0.01% of insane crazies that should be
ignored, as are about 1/2 of the people responding on this thread.
I'll tell you a bit about my issues and their context now that things
have gotten hot enough that even a devout lurker like me is posting. Some
of it might make sense. Other parts may be internally inconsistent if only
I knew enough. Still other parts may be useful to get people who don't
talk to each other in contact, and think about things in ways they haven't.
I run large hydrodynamic simulations using a variety of techniques
whose relevance is only tangential to the current flamefest. I'll let you
know the important details as they come in later. A lot of my statements
will be common to a large fraction of all hpc applications, and I imagine
to many large scale database applications as well though I'm guessing a
bit there.
I run the same codes on many kinds of systems from workstations up
to large supercomputing platforms. Mostly my experience has been
in shared memory systems, but recently I've been part of things that
will put me into distributed memory space as well.
What does it mean to use computers like I do? Maybe this is surprising
but my executables are very very small. Typically 1-2MB or less, with
only a bit more needed for various external libraries like FFTW or
the like. On the other hand, my memory requirements are huge. Typically
many GB, and some folks run simulations with many TB. Picture a very
small and very fast flea repeatedly jumping around all over the skin of a
very large elephant, taking a bite at each jump and that is a crude idea
of what is happening.
This has bearing on the current discussion in the following ways, which
are not theoretical in any way.
1) Some of these simulations frequently need to access data that is
located very far away in memory. That means that the bigger your
pages are, the fewer TLB misses you get, the smaller the
thrashing, and the faster your code runs.
One example: I have a particle hydrodynamics code that uses gravity.
Molecular dynamics simulations have similar issues with long range
forces too. Gravity is calculated by culling acceptable nodes and atoms
out of a tree structure that can be many GB in size, or for bigger
jobs, many TB. You have to traverse the entire tree for every particle
(or closely packed small group). During this stage, almost every node
examination (a simple compare and about 5 flops) requires at least one
TLB miss and depending on how you've laid out your array, several TLB
misses. Huge pages help this problem, big time. Fine with me if all I
had was one single page. If I am stupid and get into swap territory, I
deserve every bad thing that happens to me.
Now you have a list of a few thousand nodes and atoms with their data
spread sparsely over that entire multi-GB memory volume. Grab data
(about 10 double precision numbers) for one node, do 40-50 flops with
it, and repeat, L1 and TLB thrashing your way through the entire list.
There are some tricks that work some times (preload an L1 sized array
of node data and use it for an entire group of particles, then discard
it for another preload if there is more data; dimension arrays in the
right direction, so you get multiple loads from the same cache line
etc) but such things don't always work or aren't always useful.
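For the C speakers in the audience, the access pattern is roughly the
sketch below. Everything in it (the node layout, the opening test, the tiny
example tree) is invented for illustration and is not the actual code; the
shape of the loop is the point: a compare and a handful of flops per node,
with each nodes[n] access landing somewhere else entirely in a multi-GB array.

/* Toy sketch (invented layout, not the real code) of the tree walk:
 * a few flops and a compare per node, but each nodes[n] access can land
 * on a different page of a multi-GB array, so nearly every step is a
 * TLB miss. */
#include <stdio.h>

struct node {
    double com[3];     /* center of mass of the node                        */
    double mass;       /* total mass contained in the node                  */
    double size;       /* linear size of the node                           */
    int first_child;   /* index of first child, -1 for a leaf               */
    int next;          /* next node in walk order once this subtree is done */
};

/* Cull acceptable nodes for one particle (or small group) at 'pos'. */
static int walk(const struct node *nodes, const double pos[3],
                double theta, int *list, int maxlist)
{
    int count = 0;
    for (int n = 0; n >= 0; ) {
        const struct node *nd = &nodes[n];          /* likely TLB miss */
        double dx = nd->com[0] - pos[0];
        double dy = nd->com[1] - pos[1];
        double dz = nd->com[2] - pos[2];
        double r2 = dx*dx + dy*dy + dz*dz;
        /* opening criterion: far enough away (or a leaf) -> accept whole */
        if (nd->first_child < 0 || nd->size * nd->size < theta * theta * r2) {
            if (count < maxlist)
                list[count++] = n;
            n = nd->next;          /* skip the subtree */
        } else {
            n = nd->first_child;   /* open the node */
        }
    }
    return count;
}

int main(void)
{
    struct node nodes[3] = {
        /* root covering the unit box, with two point-mass leaves */
        { {0.5,  0.5, 0.5}, 2.0, 1.0,  1, -1 },
        { {0.25, 0.5, 0.5}, 1.0, 0.0, -1,  2 },
        { {0.75, 0.5, 0.5}, 1.0, 0.0, -1, -1 },
    };
    int list[8];
    double far[3] = {100.0, 0.0, 0.0}, near[3] = {0.5, 0.5, 0.5};

    printf("far particle:  %d node(s) on the list\n",
           walk(nodes, far, 0.7, list, 8));
    printf("near particle: %d node(s) on the list\n",
           walk(nodes, near, 0.7, list, 8));
    return 0;
}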
I can easily imagine database apps doing things not too dissimilar to
this. With my particular code, I have measured factors of several (~3-4)
speedup with large pages compared to small. This was measured on
an Origin 3000, where 64kB, 1MB and 16MB pages were used. Not a factor
of several percent. A factor of several. I have also measured similar
sorts of speedups on other types of machines. It is also not a factor
related to NUMA. I can see other effects from that source and can
distinguish between them.
Another example: Take a code that discretizes space on a grid
in 3d and does something to various variables to make them evolve.
You've got 3d arrays many GB in size, and for various calculations
you have to sweep through them in each direction: x, y and z. Going
in the z direction means that you are leaping across huge slices of
memory every time you increment the grid zone by 1. In some codes
only a few calculations are needed per zone. For example you want
to take a derivative:
deriv = (rho(i,j,k+1) - rho(i,j,k-1))/dz(k)
(I speak fortran, so the last index is the slow one here).
Again, every calculation strides through huge distances and gets you
a TLB miss or several. Note for the unwary: it usually does not make
sense to transpose the arrays so that the fast index is the one you
work with. You don't have enough memory for one thing and you pay
for the TLB overhead in the transpose anyway.
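A minimal C sketch of that slow-direction sweep, with invented array sizes
(the index helper mimics the Fortran column-major layout, so as in the
snippet above the last index is the slow one in memory):

/* Minimal sketch (invented sizes, not from any real code) of the
 * slow-direction sweep described above. */
#include <stdio.h>
#include <stdlib.h>

#define NX 256
#define NY 256
#define NZ 256

/* rho(i,j,k) stored with k as the slow index, as in the Fortran layout. */
static size_t idx(int i, int j, int k)
{
    return (size_t)i + (size_t)NX * ((size_t)j + (size_t)NY * (size_t)k);
}

int main(void)
{
    double *rho   = malloc(sizeof(double) * NX * NY * NZ);
    double *deriv = malloc(sizeof(double) * NX * NY * NZ);
    double dz = 0.01;
    if (!rho || !deriv)
        return 1;

    for (size_t n = 0; n < (size_t)NX * NY * NZ; n++)
        rho[n] = (double)n;

    /* Sweep in the z direction: walking a pencil along k means every
     * k -> k+1 step jumps nx*ny elements (here 256*256*8 bytes = 512kB)
     * through memory. */
    for (int j = 0; j < NY; j++)
        for (int i = 0; i < NX; i++)
            for (int k = 1; k < NZ - 1; k++)
                deriv[idx(i, j, k)] =
                    (rho[idx(i, j, k + 1)] - rho[idx(i, j, k - 1)]) / dz;

    printf("stride between k and k+1: %zu bytes (%zu 64kB pages)\n",
           sizeof(double) * NX * NY,
           sizeof(double) * NX * NY / (64 * 1024));

    free(rho);
    free(deriv);
    return 0;
}

With these toy dimensions the k-stride is only 512kB; with the multi-GB
arrays described above, every step of the sweep crosses many small pages.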
In both examples, with large pages the chances of getting a TLB hit
are far far higher than with small pages. That means I want truly
huge pages. Assuming pages at all (various arches don't have them
I think), a single one that covered my whole memory would be fine.
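To put rough numbers on "truly huge pages": TLB reach is simply the number
of TLB entries times the page size, so against a multi-GB working set the
page size is nearly the whole game. A trivial back-of-the-envelope
calculation (the 128-entry count is just an example; it happens to match the
R12000 figure quoted later in the thread):

/* Back-of-the-envelope TLB reach: (number of entries) * (page size).
 * The 128-entry count is purely an example; nothing here is measured. */
#include <stdio.h>

int main(void)
{
    const long long entries = 128;
    const long long page_size[] = { 64LL << 10, 1LL << 20, 16LL << 20 };
    const char *name[] = { "64kB", "1MB", "16MB" };

    for (int i = 0; i < 3; i++) {
        long long reach = entries * page_size[i];
        printf("%5s pages: TLB reach %5lld MB\n", name[i], reach >> 20);
    }
    return 0;
}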
Other codes don't seem to benefit so much from large pages, or even
benefit from small pages, though my experience is minimal with
such codes. Other folks run them on the same machines I do though:
2) The last paragraph above is important because of the way HPC
works as an industry. We often don't just have a dedicated machine to
run on, that gets booted once and one dedicated application runs on it
till it dies or gets rebooted again. Many jobs run on the same machine.
Some jobs run for weeks. Others run for a few hours over and over
again. Some run massively parallel. Some run throughput.
How is this situation handled? With a batch scheduler. You submit
a job to run and ask for X cpus, Y memory and Z time. It goes and
fits you in wherever it can. cpusets were helpful infrastructure
in linux for this.
You may get some cpus on one side of the machine, some more
on the other, and memory associated with still others. They
do a pretty good job of allocating resources sanely, but there is
only so much that they can do.
The important point here for page-related discussions is that
someone, you don't know who, was running on those cpu's and memory
before you. And doing Ghu Knows What with it.
This code could be running something that benefits from small pages, or
it could be running with large pages. It could be dynamically
allocating and freeing large or small blocks of memory or it could be
allocating everything at the beginning and running statically
thereafter. Different codes do different things. That means that the
memory state could be totally fubar'ed before your job ever gets
any time allocated to it.
>Nobody takes a random machine and says "ok, we'll now put our most
>performance-critical database on this machine, and oh, btw, you can't
>reboot it and tune for it beforehand".
Wanna bet?
What I wrote above makes tuning the machine itself totally ineffective.
What do you tune for? Tuning for one person's code makes someone else's
slower. Tuning for the same code on one input makes another input run
horribly.
You also can't be rebooting after every job. What about all the other
ones that weren't done yet? You'd piss off everyone running there and
it takes too long besides.
What about a machine that is running multiple instances of some
database, some bigger or smaller than others, or doing other kinds
of work? Do you penalize the big ones or the small ones, this kind
of work or that?
You also can't establish zones that can't be changed on the fly
as things on the system change. How do zones like that fit into
numa? How do things work when suddenly you've got a job that wants
the entire memory filled with large pages and you've only got
half your system set up for large pages? What if you tune the
system that way and let that job run, and for some stupid user
reason it dies 10 minutes after starting? Do you let the 30
other jobs in the queue sit idle because they want a different
page distribution?
This way lies madness. Sysadmins just say no and set up the machine
as stably as they can, usually with something not too different
from whatever the manufacturer recommends as a default. For very good reasons.
I would bet the only kind of zone stuff that could even possibly
work would be related to a cpu/memset zone arrangement. See below.
3) I have experimented quite a bit with the page merge infrastructure
that exists on irix. I understand that similar large page and merge
infrastructure exists on solaris, though I haven't run on such systems.
I can get very good page distributions if I run immediately after
reboot. I get progressively worse distributions if my job runs only
a few days or weeks later.
My experience is that after some days or weeks of running have gone
by, there is no possible way short of a reboot to get pages merged
effectively back to any pristine state with the infrastructure that
exists there.
Some improvement can be had however, with a bit of pain. What I
would like to see is not a theoretical, general, all purpose
defragmentation and hotplug scheme, but one that can work effectively
with the kinds of constraints that a batch scheduler imposes.
I would even imagine that a more general scheduler type of situation
could be effective if that scheduler were smart enough. God knows,
the scheduler in linux has been rewritten often enough. What is
one more time for this purpose too?
You may claim that this sort of merge stuff requires excessive time
for the OS. Nothing could matter to me less. I've got those cpu's
full time for the next X days and if I want them to spend the first
5 minutes or whatever of my run making the place comfortable, so that
my job gets done three days earlier then I want to spend that time.
4) The thing is that all of this memory management at this level is not
the batch scheduler's job, its the OS's job. The thing that will make
it work is that in the case of a reasonably intelligent batch scheduler
(there are many), you are absolutely certain that nothing else is
running on those cpus and that memory. Except whatever the kernel
sprinkled in and didn't clean up afterwards.
So why can't the kernel clean up after itself? Why does the kernel need
to keep anything in this memory anyway? I supposedly have a guarantee
that it is mine, but it goes and immediately violates that guarantee
long before I even get started. I want all that kernel stuff gone from
my allocation and reset to a nice, sane pristine state.
The thing that would make all of it work is good fragmentation and
hotplug type stuff in the kernel. Push everything that the kernel did
to my memory into the bitbucket and start over. There shouldn't be
anything there that it needs to remember from before anyway. Perhaps
this is what the defragmentation stuff is supposed to help with.
Probably it has other uses that aren't on my agenda. Like pulling out
bad ram sticks or whatever. Perhaps there are things that need to be
remembered. Certainly being able to hotunplug those pieces would do it.
Just do everything but unplug it from the board, and then do a hotplug
to turn it back on.
5) You seem to claim that the issues I wrote about above are 'theoretical
general cases'. They are not, at least not to any more people than the
0.01% of people who regularly time their kernel builds, as I saw someone
doing a few emails ago. Using that sort of argument as a reason not to
incorporate this sort of infrastructure just about made me fall out of
my chair, especially in the context of keeping the sane case sane.
Since this thread has long since lost decency and meaning and descended
into name calling, I suppose I'll pitch in with that too on two fronts:
1) I'd say someone making that sort of argument is doing some very serious
navel gazing.
2) Here's a cluebat: that ain't one of the sane cases you wrote about.
That said, it appears to me there are a variety of constituencies that
have some serious interest in this infrastructure.
1) HPC stuff
2) big database stuff.
3) people who are pushing hotplug for other reasons like the
bad memory replacement stuff I saw discussed.
4) Whatever else the hotplug folk want that I don't follow.
Seems to me that is a bit more than 0.01%.
>When you hear voices in your head that tell you to shoot the pope, do you
>do what they say? Same thing goes for customers and managers. They are the
>crazy voices in your head, and you need to set them right, not just
>blindly do what they ask for.
I don't care if you do what I ask for, but I do start getting irate and
start writing long annoyed letters if I can't do what I need to do, and
find out that someone could do something about it but refuses.
That said, I'm not so hot any more so I'll just unplug now.
Andy Nelson
PS: I read these lists at an archive, so if responders want to rm me from
any cc's that is fine. I'll still read what I want or need to from there.
--
Andy Nelson Theoretical Astrophysics Division (T-6)
andy dot nelson at lanl dot gov Los Alamos National Laboratory
http://www.phys.lsu.edu/~andy Los Alamos, NM 87545
> Linus writes:
>
>> Just face it - people who want memory hotplug had better know that
>> beforehand (and let's be honest - in practice it's only going to work in
>> virtualized environments or in environments where you can insert the new
>> bank of memory and copy it over and remove the old one with hw support).
>>
>> Same as hugetlb.
>>
>> Nobody sane _cares_. Nobody sane is asking for these things. Only people
>> with special needs are asking for it, and they know their needs.
>
>
> Hello, my name is Andy. I am insane. I am one of the CRAZY PEOPLE you wrote
> about.
To provide a slightly shorter version ... we had one customer running
similarly large number crunching things in Fortran. Their app ran 25%
faster with large pages (not a typo). Because they ran a variety of
jobs in batch mode, they need large pages sometimes, and small pages
at others - hence they need to dynamically resize the pool.
That's the sort of thing we were trying to fix with dynamically sized
hugepage pools. It does make a huge difference to real-world customers.
M.
Martin J. Bligh wrote:
>
> To provide a slightly shorter version ... we had one customer running
> similarly large number crunching things in Fortran. Their app ran 25%
> faster with large pages (not a typo). Because they ran a variety of
> jobs in batch mode, they need large pages sometimes, and small pages
> at others - hence they need to dynamically resize the pool.
>
> That's the sort of thing we were trying to fix with dynamically sized
> hugepage pools. It does make a huge difference to real-world customers.
>
Aren't HPC users very easy? In fact, probably the easiest, because they
are generally not very kernel intensive (apart from perhaps some batches of
IO at the beginning and end of the jobs).
A reclaimable zone should provide exactly what they need. I assume the
sysadmin can give some reasonable upper and lower estimates of the
memory requirements.
They don't need to dynamically resize the pool because it is all being
allocated to pagecache anyway, so all jobs are satisfied from the
reclaimable zone.
--
SUSE Labs, Novell Inc.
On Thu, 3 Nov 2005, Andy Nelson wrote:
>
> I have done high performance computing in astrophysics for nearly two
> decades now. It gives me a perspective that kernel developers usually
> don't have, but sometimes need. For my part, I promise that I specifically
> do *not* have the perspective of a kernel developer. I don't even speak C.
Hey, cool. You're a physicist, and you'd like to get closer to 100%
efficiency out of your computer.
And that's really nice, because maybe we can strike a deal.
Because I also have a problem with my computer, and a physicist might just
help _me_ get closer to 100% efficiency out of _my_ computer.
Let me explain.
I've got a laptop that takes about 45W, maybe 60W under load.
And it has a battery that weighs about 350 grams.
Now, I know that if I were to get 100% energy efficiency out of that
battery, a trivial physics calculations tells me that e=mc^2, and that my
battery _should_ have a hell of a lot of energy in it. In fact, according
to my simplistic calculations, it turns out that my laptop _should_ have a
battery life that is only a few times the lifetime of the universe.
It turns out that isn't really the case in practice, but I'm hoping you
can help me out. I obviously don't need it to be really 100% efficient,
but on the other hand, I'd also like the battery to be slightly lighter,
so if you could just make sure that it's at least _slightly_ closer to the
theoretical values I should be getting out of it, maybe I wouldn't need to
find one of those nasty electrical outlets every few hours.
Do we have a deal? After all, you only need to improve my battery
efficiency by a really _tiny_ amount, and I'll never need to recharge it
again. And I'll improve your problem.
Or are you maybe willing to make a few compromises in the name of being
realistic, and living with something less than the theoretical peak
performance of what you're doing?
I'm willing to compromise by using only the chemical energy of the
processes involved, and not even at a hundred percent efficiency at that.
Maybe you'd be willing to compromise by using a few kernel boot-time
command line options for your not-very-common load.
Ok?
Linus
Linus wrote:
> Maybe you'd be willing to compromise by using a few kernel boot-time
> command line options for your not-very-common load.
If we were only a few options away from running Andy's varying load
mix with something close to ideal performance, we'd be in fat city,
and Andy would never have been driven to write that rant.
There's more to it than that, but it is not as impossible as a battery
with the efficiencies you (and the rest of us) dream of.
Andy has used systems that resemble what he is seeking. So he is not
asking for something clearly impossible. Though it might not yet be
possible, in ways that contribute to a continuing healthy kernel code
base.
It's an interesting challenge - finding ways to improve the kernel's
performance on such high end loads, that are also suitable and
desirable (or at least innocent enough) for inclusion in a kernel far
more widely used in embedded systems, desktops and ordinary servers.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
* Paul Jackson <[email protected]> wrote:
> Linus wrote:
> > Maybe you'd be willing to compromise by using a few kernel boot-time
> > command line options for your not-very-common load.
>
> If we were only a few options away from running Andy's varying load
> mix with something close to ideal performance, we'd be in fat city,
> and Andy would never have been driven to write that rant.
>
> There's more to it than that, but it is not as impossible as a battery
> with the efficiencies you (and the rest of us) dream of.
just to make sure i didnt get it wrong, wouldnt we get most of the
benefits Andy is seeking by having a: boot-time option which sets aside
a "hugetlb zone", with an additional sysctl to grow (or shrink) the pool
- with the growing happening on a best-effort basis, without guarantees?
i have implemented precisely such a scheme for 'bigpages' years ago, and
it worked reasonably well. (i was lazy and didnt implement it as a
resizable zone, but as a list of large pages taken straight off the
buddy allocator. This made dynamic resizing really easy and i didnt have
to muck with the buddy and mem_map[] data structures that zone-resizing
forces us to do. It had the disadvantage of those pages skewing the
memory balance of the affected zone.)
my quick solution was good enough that on a test-system i could resize
the pool across Oracle test-runs, when the box was otherwise quiet. I'd
expect a well-controlled HPC system to be equally resizable.
what we cannot offer is a guarantee to be able to grow the pool. Hence
the /proc mechanism would be called:
/proc/sys/vm/try_to_grow_hugemem_pool
to clearly stress the 'might easily fail' restriction. But if userspace
is well-behaved on Andy's systems (which it seems to be), then in
practice it should be resizable. On a generic system, only the boot-time
option is guaranteed to allocate as much RAM as possible. And once this
functionality has been clearly communicated and separated, the 'try to
alloc a large page' thing could become more aggressive: it could attempt
to construct large pages if it can.
i dont think we object to such a capability, as long as the restrictions
are clearly communicated. (and no, that doesnt mean some obscure
Documentation/ entry - the restrictions have to be obvious from the
primary way of usage. I.e. no /proc/sys/vm/hugemem_pool_size thing where
growing could fail.)
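For concreteness, userspace use of such a best-effort knob might look
roughly like the sketch below. The /proc path is only the name proposed
above, not an existing interface, and whether the value written is a target
size or a delta is left open here; reading back nr_hugepages is simply how a
batch system would detect the "might easily fail" case.

/* Sketch of how a batch system might use the *proposed* (not existing)
 * /proc/sys/vm/try_to_grow_hugemem_pool knob: ask for a size, then read
 * back how big the pool actually is, since growing may fail. */
#include <stdio.h>

static long read_pool(const char *path)
{
    FILE *f = fopen(path, "r");
    long val = -1;
    if (f) {
        if (fscanf(f, "%ld", &val) != 1)
            val = -1;
        fclose(f);
    }
    return val;
}

int main(void)
{
    const char *knob = "/proc/sys/vm/try_to_grow_hugemem_pool";
    long wanted = 1024;                /* huge pages we would like */

    FILE *f = fopen(knob, "w");
    if (!f) {
        perror(knob);                  /* proposed interface, not implemented */
        return 1;
    }
    fprintf(f, "%ld\n", wanted);
    fclose(f);

    long got = read_pool("/proc/sys/vm/nr_hugepages");
    printf("asked for %ld huge pages, pool now reports %ld\n", wanted, got);
    if (got >= 0 && got < wanted)
        fprintf(stderr, "best-effort grow fell short; the scheduler must cope\n");
    return 0;
}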
Ingo
Ingo wrote:
> to clearly stress the 'might easily fail' restriction. But if userspace
> is well-behaved on Andy's systems (which it seems to be), then in
> practice it should be resizable.
At first glance, this is the sticky point that jumps out at me.
Andy wrote:
> My experience is that after some days or weeks of running have gone
> by, there is no possible way short of a reboot to get pages merged
> effectively back to any pristine state with the infrastructure that
> exists there.
I take it, from what Andy writes, and from my other experience with
similar customers, that his workload is not "well-behaved" in the
sense you hoped for.
After several diverse jobs are run, we cannot, so far as I know,
merge small pages back to big pages.
I have not played with Mel Gorman's Fragmentation Avoidance patches,
so don't know if they would provide a substantial improvement here.
They well might.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
* Paul Jackson <[email protected]> wrote:
> At first glance, this is the sticky point that jumps out at me.
>
> Andy wrote:
> > My experience is that after some days or weeks of running have gone
> > by, there is no possible way short of a reboot to get pages merged
> > effectively back to any pristine state with the infrastructure that
> > exists there.
>
> I take it, from what Andy writes, and from my other experience with
> similar customers, that his workload is not "well-behaved" in the
> sense you hoped for.
>
> After several diverse jobs are run, we cannot, so far as I know, merge
> small pages back to big pages.
ok, so the zone solution it has to be. I.e. the moment it's a separate
special zone, you can boot with most of the RAM being in that zone, and
you are all set. It can be used both for hugetlb allocations, and for
other PAGE_SIZE allocations as well, in a highmem-fashion. These HPC
setups are rarely kernel-intense.
Thus the only dynamic sizing decision that has to be taken is to
determine the amount of 'generic kernel RAM' that is needed in the
worst-case. To give an example: say on a 256 GB box, set aside 8 GB for
generic kernel needs, and have 248 GB in the hugemem zone. This leaves
us with the following scenario: apps can use up to 97% of all RAM for
hugemem, and they can use up to 100% of all RAM for PAGE_SIZE
allocations. 3% of RAM can be used by generic kernel needs. Sounds
pretty reasonable and straightforward from a system management point of
view. No runtime resizing, but it wouldnt be needed, unless kernel
activity needs more than 8GB of RAM.
Ingo
Paul Jackson wrote:
> Linus wrote:
>
>>Maybe you'd be willing to compromise by using a few kernel boot-time
>>command line options for your not-very-common load.
>
>
> If we were only a few options away from running Andy's varying load
> mix with something close to ideal performance, we'd be in fat city,
> and Andy would never have been driven to write that rant.
I found hugetlb support in linux not very practical/usable on NUMA machines,
whether via boot-time parameters or /proc/sys/vm/nr_hugepages.
With this single integer parameter, you cannot allocate 1000 4MB pages on one
specific node while leaving small pages on another node.
I'm not an astrophysicist, nor a DB admin, I'm only trying to partition a dual
node machine between one (numa aware) memory intensive job and all others
(system, network, shells).
At least I can reboot it if needed, but I feel Andy's pain.
There is a /proc/buddyinfo file, maybe we need a /proc/sys/vm/node_hugepages
with a list of integers (one per node) ?
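From userspace, that suggestion would presumably look something like the
following hypothetical sketch (the file does not exist; the semantics are
only what the paragraph above proposes):

/* Hypothetical use of the suggested per-node knob (this file does not
 * exist): put 1000 huge pages on node 0 and none on node 1. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/vm/node_hugepages", "w");
    if (!f) {
        perror("node_hugepages (proposed interface, not implemented)");
        return 1;
    }
    fprintf(f, "1000 0\n");   /* one integer per NUMA node */
    fclose(f);
    return 0;
}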
Eric
Linus,
Since my other affiliation is with X2, which also goes by
the name Thermonuclear Applications, we have a deal. I'll
continue to help with the work on getting nuclear fusion
to work, and you work on getting my big pages to work
in linux. We both have lots of funding and resources behind
us and are working with smart people. It should be easy.
Beyond that, I don't know much of anything about chemistry,
you'll have to find someone else to increase your battery
efficiency that way.
Big pages don't work now, and zones do not help because the
load is too unpredictable. Sysadmins *always* turn them
off, for very good reasons. They cripple the machine.
I'll try in this post also to merge a couple of replies with
other responses:
I think it was Martin Bligh who wrote that his customer gets
25% speedups with big pages. That is peanuts compared to my
factor 3.4 (search comp.arch for John Mashey's and my name
at the University of Edinburgh in Jan/Feb 2003 for a conversation
that includes detailed data about this), but proves the point that
it is far more than just me that wants big pages.
If your and other kernel developer's (<<0.01% of the universe) kernel
builds slow down by 5% and my and other people's simulations (perhaps
0.01% of the universe) speed up by a factor up to 3 or 4, who wins?
Answer right now: you do, since you are writing the kernel to
respond to your own issues, which are no more representative of the
rest of the universe than my work is. Answer as I think it
ought to be: I do, since I'd bet that HPC takes far more net
cycles in the world than every one else's kernel builds put
together. I can't expect much of anyone else to notice either
way and neither can you, so that is a wash.
Ingo Molnar says that zones work for him. In response I
will now repeat my previous rant about why zones don't
work. I understand that my post was very long and people
probably didn't read it all. So I'll just repeat that
part:
2) The last paragraph above is important because of the way HPC
works as an industry. We often don't just have a dedicated machine to
run on, that gets booted once and one dedicated application runs on it
till it dies or gets rebooted again. Many jobs run on the same machine.
Some jobs run for weeks. Others run for a few hours over and over
again. Some run massively parallel. Some run throughput.
How is this situation handled? With a batch scheduler. You submit
a job to run and ask for X cpus, Y memory and Z time. It goes and
fits you in wherever it can. cpusets were helpful infrastructure
in linux for this.
You may get some cpus on one side of the machine, some more
on the other, and memory associated with still others. They
do a pretty good job of allocating resources sanely, but there is
only so much that they can do.
The important point here for page-related discussions is that
someone, you don't know who, was running on those cpu's and memory
before you. And doing Ghu Knows What with it.
This code could be running something that benefits from small pages, or
it could be running with large pages. It could be dynamically
allocating and freeing large or small blocks of memory or it could be
allocating everything at the beginning and running statically
thereafter. Different codes do different things. That means that the
memory state could be totally fubar'ed before your job ever gets
any time allocated to it.
>Nobody takes a random machine and says "ok, we'll now put our most
>performance-critical database on this machine, and oh, btw, you can't
>reboot it and tune for it beforehand".
Wanna bet?
What I wrote above makes tuning the machine itself totally ineffective.
What do you tune for? Tuning for one person's code makes someone else's
slower. Tuning for the same code on one input makes another input run
horribly.
You also can't be rebooting after every job. What about all the other
ones that weren't done yet? You'd piss off everyone running there and
it takes too long besides.
What about a machine that is running multiple instances of some
database, some bigger or smaller than others, or doing other kinds
of work? Do you penalize the big ones or the small ones, this kind
of work or that?
You also can't establish zones that can't be changed on the fly
as things on the system change. How do zones like that fit into
numa? How do things work when suddenly you've got a job that wants
the entire memory filled with large pages and you've only got
half your system set up for large pages? What if you tune the
system that way and let that job run, and for some stupid user
reason it dies 10 minutes after starting? Do you let the 30
other jobs in the queue sit idle because they want a different
page distribution?
This way lies madness. Sysadmins just say no and set up the machine
as stably as they can, usually with something not too different
from whatever the manufacturer recommends as a default. For very good reasons.
I would bet the only kind of zone stuff that could even possibly
work would be related to a cpu/memset zone arrangement. See below.
Andy Nelson
--
Andy Nelson Theoretical Astrophysics Division (T-6)
andy dot nelson at lanl dot gov Los Alamos National Laboratory
http://www.phys.lsu.edu/~andy Los Alamos, NM 87545
* Andy Nelson <[email protected]> wrote:
> I think it was Martin Bligh who wrote that his customer gets 25%
> speedups with big pages. That is peanuts compared to my factor 3.4
> (search comp.arch for John Mashey's and my name at the University of
> Edinburgh in Jan/Feb 2003 for a conversation that includes detailed
> data about this), but proves the point that it is far more than just
> me that wants big pages.
ok, this posting of yours seems to be it:
http://groups.google.com/group/comp.sys.sgi.admin/browse_thread/thread/39884db861b7db15/e0332608c52a17e3?lnk=st&q=&rnum=35#e0332608c52a17e3
| Timing for the tree traveral+gravity calculation were
|
|  cpus   16MB pages   1MB pages   64k pages
|     1        *            *       2361.8s
|     8      86.4s       198.7s      298.1s
|    16      43.5s        99.2s      148.9s
|    32      22.1s        50.1s       75.0s
|    64      11.2s        25.3s       37.9s
|    96       7.5s        17.1s       25.4s
|
| (*) test not done.
|
| As near as I can tell the numbers show perfect
| linear speedup for the runs for each page size.
|
| Across different page sizes there is degradation
| as follows:
|
| 16m --> 64k decreases by a factor 3.39 in speed
| 16m --> 1m decreases by a factor 2.25 in speed
| 1m --> 64k decreases by a factor 1.49 in speed
[...]
|
| Sum over cpus of TLB miss times for each test:
|
|  cpus   16MB pages   1MB pages   64k pages
|     1                              3489s
|     8      64.3s        1539s      3237s
|    16      64.5s        1540s      3241s
|    32      64.5s        1542s      3244s
|    64      64.9s        1545s      3246s
|    96      64.7s        1545s      3251s
|
| Thus the 16MB pages rarely produced page misses,
| while the 64kB pages used up 2.5x more time than
| the floating point operations that we wanted to
| have. I have at least some feeling that the 16MB pages
| rarely caused misses because with a 128 entry
| TLB (on the R12000 cpu) that gives about 1GB of
| addressible memory before paging is required at all,
| which I think is quite comparable to the size of
| the memory actually used.
to me it seems that this slowdown is due to some inefficiency in the
R12000's TLB-miss handling - possibly very (very!) long TLB-miss
latencies? On modern CPUs (x86/x64) the TLB-miss latency is rarely
visible. Would it be possible to run some benchmarks of hugetlbs vs. 4K
pages on x86/x64?
if my assumption is correct, then hugeTLBs are more of a workaround for
bad TLB-miss properties of the CPUs you are using, not something that
will inevitably happen in the future. Hence i think the 'factor 3x'
slowdown should not be realistic anymore - or are you still running
R12000 CPUs?
Ingo
On Fri, 4 Nov 2005, Ingo Molnar wrote:
>
> just to make sure i didnt get it wrong, wouldnt we get most of the
> benefits Andy is seeking by having a: boot-time option which sets aside
> a "hugetlb zone", with an additional sysctl to grow (or shrink) the pool
> - with the growing happening on a best-effort basis, without guarantees?
Boot-time option to set the hugetlb zone, yes.
Grow-or-shrink, probably not. Not in practice after bootup on any machine
that is less than idle.
The zones have to be pretty big to make any sense. You don't just grow
them or shrink them - they'd be on the order of tens of megabytes to
gigabytes. In other words, sized big enough that you will _not_ be able to
create them on demand, except perhaps right after boot.
Growing these things later simply isn't reasonable. I can pretty much
guarantee that any kernel I maintain will never have dynamic kernel
pointers: when some memory has been allocated with kmalloc() (or
equivalent routines - pretty much _any_ kernel allocation), it stays put.
Which means that if there is a _single_ kernel alloc in such a zone, it
won't ever be then usable for hugetlb stuff.
And I don't want excessive complexity. We can have things like "turn off
kernel allocations from this zone", and then wait a day or two, and hope
that there aren't long-term allocs. It might even work occasionally. But
the fact is, a number of kernel allocations _are_ long-term (superblocks,
root dentries, "struct thread_struct" for long-running user daemons), and
it's simply not going to work well in practice unless you have set aside
the "no kernel alloc" zone pretty early on.
Linus
Ingo wrote:
>ok, this posting of yours seems to be it:
> <elided>
>to me it seems that this slowdown is due to some inefficiency in the
>R12000's TLB-miss handling - possibly very (very!) long TLB-miss
>latencies? On modern CPUs (x86/x64) the TLB-miss latency is rarely
>visible. Would it be possible to run some benchmarks of hugetlbs vs. 4K
>pages on x86/x64?
>
>if my assumption is correct, then hugeTLBs are more of a workaround for
>bad TLB-miss properties of the CPUs you are using, not something that
>will inevitably happen in the future. Hence i think the 'factor 3x'
>slowdown should not be realistic anymore - or are you still running
>R12000 CPUs?
> Ingo
AFAIK, mips chips have a software TLB refill that takes 1000
cycles more or less. I could be wrong. There are sgi folk on this
thread, perhaps they can correct me. What is important is
that I have done similar tests on other arch's and found very
similar results. Specifically with IBM machines running both
AIX and Linux. I've never had the opportunity to try variable
page size stuff on amd or intel chips, either itanic or x86
variants.
The effect is not a consequence of any excessively long tlb
handling times for one single arch.
The effect is a property of the code. Which has one part that
is extremely branchy: traversing a tree, and another part that
isn't branchy but grabs stuff from all over everywhere.
The tree traversal works like this: Start from the root and stop at
each node, load a few numbers, multiply them together and compare to
another number, then open that node or go on to a sibling node. Net,
this is about 5-8 flops and a compare per node. The issue is that the
next time you want to look at a tree node, you are someplace else
in memory entirely. That means a TLB miss almost always.
The tree traversal leaves me with a list of a few thousand nodes
and atoms. I use these nodes and atoms to calculate gravity on some
particle or small group of particles. How? For each node, I grab about
10 numbers from a couple of arrays, do about 50 flops with those
numbers, and store back 4 more numbers. The store back doesn't hurt
anything because it really only happens once at the end of the list.
In the naive case, grabbing 10 numbers out of arrays that are multiple
GB in size means 10 TLB misses. The obvious solution is to stick
everything together that is needed together, and get that down to
one or two. I've done that. The results you quoted in your post
reflect that. In other words, the performance difference is the minimal
number of TLB misses that I can manage to get.
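In C terms, the layout change being described is roughly the difference
below (field names invented, not the real code): with one big array per
quantity, fetching one node's numbers can touch several distinct pages, while
packing each node's numbers into one record gets that down to one or two.
The node records themselves are still scattered across the multi-GB tree,
which is the roughly-one-miss-per-node floor the next paragraph describes.

/* Sketch (invented names, not the real code) of the layout change:
 * separate arrays per quantity vs. one packed record per node. */
#include <math.h>
#include <stdio.h>

/* "Structure of arrays": each quantity lives in its own multi-GB array. */
struct soa {
    double *x, *y, *z, *mass;      /* ...and several more in real life */
};

/* "Array of structures": everything needed for a node is adjacent. */
struct node_rec {
    double x, y, z, mass;          /* ...plus the rest of the ~10 numbers */
};

static void force_soa(const struct soa *t, const int *list, int n,
                      const double p[3], double acc[3])
{
    for (int i = 0; i < n; i++) {
        int j = list[i];           /* four scattered arrays: several misses */
        double dx = t->x[j] - p[0], dy = t->y[j] - p[1], dz = t->z[j] - p[2];
        double r2 = dx*dx + dy*dy + dz*dz;
        double f = t->mass[j] / (r2 * sqrt(r2));
        acc[0] += f*dx; acc[1] += f*dy; acc[2] += f*dz;
    }
}

static void force_aos(const struct node_rec *t, const int *list, int n,
                      const double p[3], double acc[3])
{
    for (int i = 0; i < n; i++) {
        const struct node_rec *nd = &t[list[i]];   /* one record: 1-2 misses */
        double dx = nd->x - p[0], dy = nd->y - p[1], dz = nd->z - p[2];
        double r2 = dx*dx + dy*dy + dz*dz;
        double f = nd->mass / (r2 * sqrt(r2));
        acc[0] += f*dx; acc[1] += f*dy; acc[2] += f*dz;
    }
}

int main(void)
{
    double x[2] = {2, 4}, y[2] = {0, 0}, z[2] = {0, 0}, m[2] = {1, 1};
    struct soa t1 = { x, y, z, m };
    struct node_rec t2[2] = { {2, 0, 0, 1}, {4, 0, 0, 1} };
    int list[2] = {0, 1};
    double p[3] = {0, 0, 0}, a1[3] = {0, 0, 0}, a2[3] = {0, 0, 0};

    force_soa(&t1, list, 2, p, a1);
    force_aos(t2, list, 2, p, a2);
    printf("same answer either way: %g vs %g\n", a1[0], a2[0]);
    return 0;
}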
Now if you have a list of thousands of nodes to cycle through, each of
which lives on a different page (ordinarily true), you thrash TLB,
and you thrash L1, as I noted in my original post.
Believe me, I have worried about this sort of stuff intensely,
and recoded around it a lot. The performance numbers you saw are what
is left over.
It is true that other sorts of codes have much more regular memory
access patterns, and don't have nearly this kind of speedup. Perhaps
more typical would be the 25% number quoted by Martin Bligh.
Andy
>> just to make sure i didnt get it wrong, wouldnt we get most of the
>> benefits Andy is seeking by having a: boot-time option which sets aside
>> a "hugetlb zone", with an additional sysctl to grow (or shrink) the pool
>> - with the growing happening on a best-effort basis, without guarantees?
>
> Boot-time option to set the hugetlb zone, yes.
>
> Grow-or-shrink, probably not. Not in practice after bootup on any machine
> that is less than idle.
>
> The zones have to be pretty big to make any sense. You don't just grow
> them or shrink them - they'd be on the order of tens of megabytes to
> gigabytes. In other words, sized big enough that you will _not_ be able to
> create them on demand, except perhaps right after boot.
>
> Growing these things later simply isn't reasonable. I can pretty much
> guarantee that any kernel I maintain will never have dynamic kernel
> pointers: when some memory has been allocated with kmalloc() (or
> equivalent routines - pretty much _any_ kernel allocation), it stays put.
> Which means that if there is a _single_ kernel alloc in such a zone, it
> won't ever be then usable for hugetlb stuff.
>
> And I don't want excessive complexity. We can have things like "turn off
> kernel allocations from this zone", and then wait a day or two, and hope
> that there aren't long-term allocs. It might even work occasionally. But
> the fact is, a number of kernel allocations _are_ long-term (superblocks,
> root dentries, "struct thread_struct" for long-running user daemons), and
> it's simply not going to work well in practice unless you have set aside
> the "no kernel alloc" zone pretty early on.
Exactly. But that's what all the anti-fragmentation stuff was about - trying
to pack unfreeable stuff together.
I don't think anyone is proposing dynamic kernel pointers inside Linux,
except in that we could possibly change the P-V mapping underneath from
the hypervisor, so that the phys address would change, but you wouldn't
see it. Trouble is, that's mostly done on a larger-than-page size
granularity, so we need SOME larger chunk to switch out (preferably at
least a large-paged size, so we can continue to use large TLB entries for
the kernel mapping).
However, the statically sized option is hugely problematic too.
M.
* Linus Torvalds <[email protected]> wrote:
> Boot-time option to set the hugetlb zone, yes.
>
> Grow-or-shrink, probably not. Not in practice after bootup on any
> machine that is less than idle.
>
> The zones have to be pretty big to make any sense. You don't just grow
> them or shrink them - they'd be on the order of tens of megabytes to
> gigabytes. In other words, sized big enough that you will _not_ be
> able to create them on demand, except perhaps right after boot.
i think the current hugepages=<N> boot option could transparently be
morphed into a 'separate zone' approach, and /proc/sys/vm/nr_hugepages
would just refuse to change (or would go away altogether). Dynamically
growing zones seem like a lot of trouble, without much gain. [ OTOH
hugepages= parameter unit should be changed from the current 'number of
hugepages' to plain RAM metrics - megabytes/gigabytes. ]
that would solve two problems: any 'zone VM statistics skewing effect'
of the current hugetlbs (which is a preallocated list of really large
pages) would go away, and the hugetlb zone could potentially be utilized
for easily freeable objects.
this would already be a lot more flexible than what we have: the hugetlb
area would not be 'lost' altogether, like now. Once we are at this stage
we can see how usable it is in practice. I strongly suspect it will
cover most of the HPC uses.
Ingo
On Fri, 4 Nov 2005, Andy Nelson wrote:
>
> Big pages don't work now, and zones do not help because the
> load is too unpredictable. Sysadmins *always* turn them
> off, for very good reasons. They cripple the machine.
They do. Guess why? It's complicated.
SGI used to do things like that in Irix. They had the flakiest Unix kernel
out there. There's a reason people use Linux, and it's not all price. A
lot of it is development speed, and that in turn comes very much from not
making insane decisions that aren't maintainable in the long run.
Trust me. We can make things _better_, by having zones that you can't do
kernel allocations from. But you'll never get everything you want, without
turning the kernel into an unmaintainable mess.
> I think it was Martin Bligh who wrote that his customer gets
> 25% speedups with big pages. That is peanuts compared to my
> factor 3.4 (search comp.arch for John Mashey's and my name
> at the University of Edinburgh in Jan/Feb 2003 for a conversation
> that includes detailed data about this), but proves the point that
> it is far more than just me that wants big pages.
I didn't find your post on google, but I assume that a large portion of
your 3.4 factor was hardware.
The fact is, there are tons of architectures that suck at TLB handling.
They have small TLB's, and they fill slowly.
x86 is actually one of the best ones out there. It has a hw TLB fill, and
the page tables are cached, with real-life TLB fill times in the single
cycles (a P4 can almost be seen as effectively having 32kB pages because
it fills its TLB entries so fast when they are next to each other in the
page tables). Even when you have lots of other cache pressure, the page
tables are at least in the L2 (or L3) caches, and you effectively have a
really huge TLB.
In contrast, a lot of other machines will use non-temporal loads to load
the TLB entries, forcing them to _always_ go to memory, and use software
fills, causing the whole machine to stall. To make matters worse, many of
them use hashed page tables, so that even if they could (or do) cache
them, the caching just doesn't work very well.
(I used to be a big proponent of software fill - it's very flexible. It's
also very slow. I've changed my mind after doing timing on x86)
Basically, any machine that gets more than twice the slowdown is _broken_.
If the memory access is cached, then so should the page table entry be
(page tables are _much_ smaller than the pages themselves), so even if you
take a TLB fault on every single access, you shouldn't see a 3.4 factor.
So without finding your post, my guess is that you were on a broken
machine. MIPS or alpha do really well when things generally fit in the
TLB, but break down completely when they don't due to their sw fill (alpha
could have fixed it, it had _architecturally_ sane page tables that it
could have walked in hw, but never got the chance. May it rest in peace).
If I remember correctly, ia64 used to suck horribly because Linux had to
use a mode where the hw page table walker didn't work well (maybe it was
just an itanium 1 bug), but should be better now. But x86 probably kicks
its butt.
The reason x86 does pretty well is that it's got one of the few sane page
table setups out there (oh, page table trees are old-fashioned and simple,
but they are dense and cache well), and the microarchitecture is largely
optimized for TLB faults. Not having ASI's and having to work with an OS
that invalidated the TLB about every couple of thousand memory accesses
does that to you - it puts the pressure to do things right.
So I suspect Martin's 25% is a lot more accurate on modern hardware (which
means x86, possibly Power. Nothing else much matters).
> If your and other kernel developer's (<<0.01% of the universe) kernel
> builds slow down by 5% and my and other people's simulations (perhaps
> 0.01% of the universe) speed up by a factor up to 3 or 4, who wins?
First off, you won't speed up by a factor of three or four. Not even
_close_.
Second, it's not about performance. It's about maintainability. It's about
having a system that we can use and understand 10 years down the line. And
the VM is a big part of that.
Linus
* Andy Nelson <[email protected]> wrote:
> Ingo wrote:
> >ok, this posting of yours seems to be it:
>
> > <elided>
>
> >to me it seems that this slowdown is due to some inefficiency in the
> >R12000's TLB-miss handling - possibly very (very!) long TLB-miss
> >latencies? On modern CPUs (x86/x64) the TLB-miss latency is rarely
> >visible. Would it be possible to run some benchmarks of hugetlbs vs. 4K
> >pages on x86/x64?
> >
> >if my assumption is correct, then hugeTLBs are more of a workaround for
> >bad TLB-miss properties of the CPUs you are using, not something that
> >will inevitably happen in the future. Hence i think the 'factor 3x'
> >slowdown should not be realistic anymore - or are you still running
> >R12000 CPUs?
>
> > Ingo
>
>
> AFAIK, mips chips have a software TLB refill that takes 1000 cycles
> more or less. I could be wrong. [...]
x86 in comparison has a typical cost of 7 cycles per TLB miss. And a
modern x64 chip has 1024 TLB entries ... If that's not enough then i believe
you'll be limited by cachemiss costs and RAM latency/throughput anyway,
and the only thing the TLB misses have to do is to be somewhat better
than those bottlenecks. TLBs are really fast in the x86/x64 world. Then
there come other features like TLB prefetch, so if you are touching
pages in any predictable fashion you ought to see better latencies than
the worst-case.
> The effect is not a consequence of any excessively long tlb handling
> times for one single arch.
>
> The effect is a property of the code. Which has one part that is
> extremely branchy: traversing a tree, and another part that isn't
> branchy but grabs stuff from all over everywhere.
i dont think anyone argues against the fact that a larger 'TLB reach'
will most likely improve performance. The question is always 'by how
much', and that number very much depends on the cost of a single TLB
miss. (and on a lot of other factors)
(note that it's also possible for large pages to cause a slowdown: there
are CPUs [e.g. P3] with fewer large-page TLB entries than 4K entries, so
there are workloads where you lose due to having fewer TLB entries. It is
also possible for large pages to be zero speedup: if the working set is so
large that you will always get a TLB miss on each newly accessed node anyway.)
Ingo
On Fri, 4 Nov 2005, Andy Nelson wrote:
>
> AFAIK, mips chips have a software TLB refill that takes 1000
> cycles more or less. I could be wrong.
You're not far off.
Time it on a real machine some day. On a modern x86, you will fill a TLB
entry in anything from 1-8 cycles if it's in L1, and add a couple of dozen
cycles for L2.
In fact, the L1 TLB miss can often be hidden by the OoO engine.
Now, do the math. Your "3-4 times slowdown" with a several-hundred-cycle TLB
miss just GOES AWAY on real hardware. Yes, you'll still see slowdowns,
but they won't be nearly as noticeable. And having a simpler and more
efficient kernel will actually make _up_ for them in many cases. For
example, you can do all your calculations on idle workstations that don't
mysteriously just crash because somebody was also doing something else on
them.
Face it. MIPS sucks. It was clean, but it didn't perform very well. SGI
doesn't sell those things very actively these days, do they?
So don't blame Linux. Don't make sweeping statements based on hardware
situations that just aren't relevant any more.
If you ever see a machine again that has a huge TLB slowdown, let the
machine vendor know, and then SWITCH VENDORS. Linux will work on sane
machines too.
Linus
> So I suspect Martin's 25% is a lot more accurate on modern hardware (which
> means x86, possibly Power. Nothing else much matters).
It was PPC64, if that helps.
>> If your and other kernel developer's (<<0.01% of the universe) kernel
>> builds slow down by 5% and my and other people's simulations (perhaps
>> 0.01% of the universe) speed up by a factor up to 3 or 4, who wins?
>
> First off, you won't speed up by a factor of three or four. Not even
> _close_.
Well, I think it depends on the workload a lot. However fast your TLB is,
if we move from "every cacheline read requires a TLB miss" to "every
cacheline read is a TLB hit" that can be a huge performance knee however
fast your TLB is. Depends heavily on the locality of reference and size
of data set of the application, I suspect.
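A toy model shows why both sides can be right here: the size of that knee
depends on how the TLB miss cost compares with the useful work done per
access. Using figures already quoted in this thread (roughly 1000 cycles for
a MIPS software refill, about 7 cycles for an x86 hardware fill) and an
arbitrary 50 cycles of real work per access, a hedged back-of-the-envelope:

/* Back-of-the-envelope only: average cycles per data access as the TLB
 * hit rate degrades, for a slow software refill (~1000 cycles, the MIPS
 * figure quoted earlier) and a fast hardware fill (~7 cycles, the x86
 * figure quoted earlier).  All numbers are illustrative. */
#include <stdio.h>

int main(void)
{
    const double work = 50.0;     /* useful cycles per access, say */
    const double miss_cost[2] = { 1000.0, 7.0 };
    const char  *label[2] = { "sw refill ~1000c", "hw fill ~7c" };
    const double miss_rate[4] = { 0.0, 0.1, 0.5, 1.0 };

    for (int c = 0; c < 2; c++) {
        printf("%s:\n", label[c]);
        for (int r = 0; r < 4; r++) {
            double avg = work + miss_rate[r] * miss_cost[c];
            printf("  miss rate %3.0f%%: %7.1f cycles/access (%.1fx)\n",
                   miss_rate[r] * 100.0, avg, avg / work);
        }
    }
    return 0;
}

Even a modest miss rate produces a "factor of several" on the slow refill,
while the same miss rate barely registers on the fast one.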
M.
Linus:
>> If your and other kernel developer's (<<0.01% of the universe) kernel
>> builds slow down by 5% and my and other people's simulations (perhaps
>> 0.01% of the universe) speed up by a factor up to 3 or 4, who wins?
>
>First off, you won't speed up by a factor of three or four. Not even
>_close_.
My measurements of factors of 3-4 on more than one hw arch don't
mean anything then? BTW: Ingo Molnar has a response that did find
my comp.arch posts. As I indicated to him, I've done a lot of code
tuning to get better performance even in the presence of tlb issues.
This factor is what is left. Starting from an untuned code, the factor
can be up to an order of magnitude larger. As in 30-60. Yes, I've
measured that too, though these detailed measurments were only on
mips/origins.
It is true that I have never had the opportunity to test these
issues on x86 and its relatives. Perhaps it would be better there.
The relative insensitivity to hw arch of the results I already have
indicates otherwise, though.
Re maintainability: Fine. I like maintainable code too. Coding
standards are great. Language standards are even better.
These are motherhood statements. Your simple rejections
("NO, HELL NO!!") even of any attempts to make these sorts
of improvements seem to make that issue pretty moot anyway.
Andy
* Linus Torvalds <[email protected]> wrote:
> Time it on a real machine some day. On a modern x86, you will fill a
> TLB entry in anything from 1-8 cycles if it's in L1, and add a couple
> of dozen cycles for L2.
below is my (x86-only) testcode that accurately measures TLB miss costs
in cycles. (Has to be run as root, because it uses 'cli' as the
serializing instruction.)
here's the output from the default 128MB (32768 4K pages) random access
pattern workload, on a 2 GHz P4 (which has 64 dTLBs):
0 24 24 24 12 12 0 0 16 0 24 24 24 12 0 12 0 12
32768 randomly accessed pages, 13 cycles avg, 73.751831% TLB misses.
i.e. really cheap TLB misses even in this very bad and TLB-thrashing
scenario: there are only 64 dTLBs and we have 32768 pages - so they are
outnumbered by a factor of 1:512! Still the CPU gets it right.
setting LINEAR to 1 gives an embarrassing:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
32768 linearly accessed pages, 0 cycles avg, 0.259399% TLB misses.
showing that the pagetable got fully cached (probably in L1) and that
has _zero_ overhead. Truly remarkable.
lowering the size to 16 MB (still 1:64 TLB-to-working-set-size ratio!)
gives:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4096 randomly accessed pages, 0 cycles avg, 5.859375% TLB misses.
so near-zero TLB overhead.
increasing BYTES to half a gigabyte gives:
2 0 12 12 24 12 24 264 24 12 24 24 0 0 24 12 24 24 24 24 24 24 24 24 12
12 24 24 24 36 24 24 0 24 24 0 24 24 288 24 24 0 228 24 24 0 0
131072 randomly accessed pages, 75 cycles avg, 94.162750% TLB misses.
so an occasional ~220 cycles (~== 100 nsec - DRAM latency) cachemiss,
but still the average is 75 cycles, or 37 nsecs - which is still only
~37% of the DRAM latency.
(NOTE: the test eliminates most data cachemisses, by using zero-mapped
anonymous memory, so only a single data page exists. So the costs seen
here are mostly TLB misses.)
Ingo
---------------
/*
* TLB miss measurement on PII CPUs.
*
* Copyright (C) 1999, Ingo Molnar <[email protected]>
*/
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>
#include <sys/mman.h>
#define BYTES (128*1024*1024)
#define PAGES (BYTES/4096)
/* This define turns on the linear mode.. */
#define LINEAR 0
#if 1
# define BARRIER "cli"
#else
# define BARRIER "lock ; addl $0,0(%%esp)"
#endif
int do_test (char *addr)
{
        unsigned long start, end;

        /*
         * 'cli' is used as a serializing instruction to
         * isolate the benchmarked instruction from rdtsc.
         */
        __asm__ (
                "jmp 1f; 1: .align 128;"
                BARRIER ";"
                "rdtsc;"
                "movl %0, %1;"
                BARRIER ";"
                "movl (%%esi), %%eax;"
                BARRIER ";"
                "rdtsc;"
                BARRIER ";"
                : "=a" (end), "=c" (start)
                : "S" (addr)
                : "dx", "memory");

        return end - start;
}

extern int iopl(int);

int main (void)
{
        unsigned long overhead, sum;
        int j, k, c, hit;
        int matrix[PAGES];
        int delta[PAGES];
        char *buffer = mmap(NULL, BYTES, PROT_READ,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        iopl(3);

        /*
         * first generate a random access pattern.
         */
        for (j = 0; j < PAGES; j++) {
                unsigned long val;
#if LINEAR
                val = ((j*8) % PAGES) * 4096;
                val = j*2048;
#else
                val = (random() % PAGES) * 4096;
#endif
                matrix[j] = val;
        }

        /*
         * Calculate the overhead
         */
        overhead = ~0UL;
        for (j = 0; j < 100; j++) {
                unsigned int diff = do_test(buffer);
                if (diff < overhead)
                        overhead = diff;
        }
        printf("Overhead = %ld cycles\n", overhead);

        /*
         * 10 warmup loops, the last one is printed.
         */
        for (k = 0; k < 10; k++) {
                c = 0;
                for (j = 0; j < PAGES; j++) {
                        char *addr;

                        addr = buffer + matrix[j];
                        delta[c++] = do_test(addr);
                }
        }

        hit = 0;
        sum = 0;
        for (j = 0; j < PAGES; j++) {
                unsigned long d = delta[j] - overhead;
                printf("%ld ", d);
                if (d <= 1)
                        hit++;
                sum += d;
        }
        printf("\n");

        printf("%d %s accessed pages, %ld cycles avg, %f%% TLB misses.\n",
                PAGES,
#if LINEAR
                "linearly",
#else
                "randomly",
#endif
                sum/PAGES,
                100.0*((double)PAGES-(double)hit)/(double)PAGES);

        return 0;
}
On Fri, 4 Nov 2005, Martin J. Bligh wrote:
>
> > So I suspect Martin's 25% is a lot more accurate on modern hardware (which
> > means x86, possibly Power. Nothing else much matters).
>
> It was PPC64, if that helps.
Ok. I bet x86 is even better, but Power (and possibly itanium) is the only
other architecture that comes close.
I don't like the horrible POWER hash-tables, but for static workloads they
should perform almost as well as a sane page table (I say "almost",
because I bet that the high-performance x86 vendors have spent a lot more
time on tlb latency than even IBM has). My dislike for them comes from the
fact that they are really only optimized for static behaviour.
(And HPC is almost always static wrt TLB stuff - big, long-running
processes).
> Well, I think it depends on the workload a lot. However fast your TLB is,
> if we move from "every cacheline read requires a TLB miss" to "every
> cacheline read is a TLB hit" that can be a huge performance knee however
> fast your TLB is. Depends heavily on the locality of reference and size
> of data set of the application, I suspect.
I'm sure there are really pathological examples, but the thing is, they
won't be on reasonable code.
Some modern CPU's have TLB's that can span the whole cache. In other
words, if your data is in _any_ level of caches, the TLB will be big
enough to find it.
Yes, that's not universally true, and when it's true, the TLB is two-level
and you can have loads where it will usually miss in the first level, but
we're now talking about loads where the _data_ will then always miss in
the first level cache too. So the TLB miss cost will always be _lower_
than the data miss cost.
Right now, you should buy Opteron if you want that kind of large TLB. I
_think_ Intel still has "small" TLB's (the cpuid information only goes up
to 128 entries, I think), but at least Intel has a really good fill. And I
would bet (but have no first-hand information) that next generation
processors will only get bigger TLB's. These things don't tend to shrink.
(Itanium also has a two-level TLB, but it's absolutely pitiful in size).
NOTE! It is absolutely true that for a few years we had regular caches
growing much faster than TLB's. So there are unquestionably unbalanced
machines out there. But it seems that CPU designers started noticing, and
every indication is that TLB's are catching up.
In other words, adding lots of kernel complexity is the wrong thing in the
long run. This is not a long-term problem, and even in the short term you
can fix it by just selecting the right hardware.
In today's world, AMD leads with big TLB's (1024-entry L2 TLB), but Intel
has slightly faster fill and the AMD TLB filtering is sadly turned off on
SMP right now, so you might not always get the full effect of the large
TLB (but in HPC you probably won't have task switching blowing your TLB
away very often).
PPC64 has the huge hashed page tables that work well enough for HPC.
Itanium has a pitifully small TLB, and an in-order CPU, so it will take a
noticeably bigger hit on TLB's than x86 will. But even Itanium will be a
_lot_ better than MIPS was.
Linus
On Fri, 4 Nov 2005, Andy Nelson wrote:
>
> My measurements of factors of 3-4 on more than one hw arch don't
> mean anything then?
When I _know_ that modern hardware does what you tested at least two
orders of magnitude better than the hardware you tested?
Think about it.
Linus
>> Well, I think it depends on the workload a lot. However fast your TLB is,
>> if we move from "every cacheline read requires a TLB miss" to "every
>> cacheline read is a TLB hit" that can be a huge performance knee however
>> fast your TLB is. Depends heavily on the locality of reference and size
>> of data set of the application, I suspect.
>
> I'm sure there are really pathological examples, but the thing is, they
> won't be on reasonable code.
>
> Some modern CPU's have TLB's that can span the whole cache. In other
> words, if your data is in _any_ level of caches, the TLB will be big
> enough to find it.
>
> Yes, that's not universally true, and when it's true, the TLB is two-level
> and you can have loads where it will usually miss in the first level, but
> we're now talking about loads where the _data_ will then always miss in
> the first level cache too. So the TLB miss cost will always be _lower_
> than the data miss cost.
>
> Right now, you should buy Opteron if you want that kind of large TLB. I
> _think_ Intel still has "small" TLB's (the cpuid information only goes up
> to 128 entries, I think), but at least Intel has a really good fill. And I
> would bet (but have no first-hand information) that next generation
> processors will only get bigger TLB's. These things don't tend to shrink.
Well. Last time I looked they had something in the order of 512 entries
per MB of cache or so (ie 2MB of coverage per MB of cache). So it'll only
cover it if you're using 2K of the data in each page (50%), but not if
you're touching cachelines distributed widely over pages. With large
pages, you cover 1000 times that much. Some apps may not be able to
achieve 50% locality of reference, just by their nature ... I'm not sure
that's bad programming for the big number-crunching cases, or for DB
workloads with random access patterns to large data sets.
Of course, this doesn't just apply to HPC/database either: dcache walks
on large fileservers, etc.
Even if we're talking data cache / icache misses, it gets even worse,
doesn't it? Several cacheline misses for pagetable walks per data cacheline
miss. Lots of the compute intensive stuff doesn't even come close to
fitting in data cache by orders of magnitude.
M.
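For concreteness, here is Martin's coverage arithmetic spelled out in a few
lines of C. The figures are the round numbers quoted in this thread (a
512-entry TLB, 1 MB of cache per 512 entries, 2 MB huge pages), not vendor
data; with 4 MB pages the last ratio doubles, which is Martin's "about 1000
times".
---------------
#include <stdio.h>

int main (void)
{
        long entries    = 512;                  /* TLB entries (round figure)    */
        long small_page = 4L * 1024;            /* 4 kB base pages               */
        long huge_page  = 2L * 1024 * 1024;     /* 2 MB huge pages               */
        long cache      = 1L * 1024 * 1024;     /* 1 MB of cache per 512 entries */

        printf("TLB coverage with 4k pages:  %ld MB (%ldx the cache)\n",
                entries * small_page >> 20, entries * small_page / cache);
        printf("cache bytes per mapped page: %ld (%.0f%% of a 4k page)\n",
                cache / entries, 100.0 * (cache / entries) / small_page);
        printf("TLB coverage with 2M pages:  %ld MB (%ldx more reach per entry)\n",
                entries * huge_page >> 20, huge_page / small_page);
        return 0;
}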
Andy,
let's just take Ingo's numbers, measured on modern hardware.
On Fri, 4 Nov 2005, Ingo Molnar wrote:
>
> 32768 randomly accessed pages, 13 cycles avg, 73.751831% TLB misses.
> 32768 linearly accessed pages, 0 cycles avg, 0.259399% TLB misses.
> 131072 randomly accessed pages, 75 cycles avg, 94.162750% TLB misses.
NOTE! It's hard to decide what OoO does - Ingo's load doesn't allow for a
whole lot of overlapping stuff, so Ingo's numbers are fairly close to
worst case, but on the other hand, that serialization can probably be
honestly said to hide a couple of cycles, so let's say that _real_ worst
case is five more cycles than the ones quoted. It doesn't change the math,
and quite frankly, that way we're really anal about it.
In real life, under real load (especially with FP operations going on at
the same time), OoO might make the cost a few cycles _less_, not more, but
hey, let's not count that.
So in the absolute worst case, with 95% TLB miss ratio, the TLB cost was
an average 75 cycles. Let's be _really_ nice to MIPS, and say that this is
only five times faster than the MIPS case you tested (in reality, it's
probably over ten).
That's the WORST CASE. Realize that MIPS doesn't get better: it will
_always_ have a latency of several hundred cycles when the TLB misses. It
has absolutely zero OoO activity to hide a TLB miss (a software miss
totally serializes the pipeline), and it has zero "code caching", so even
with a perfect I$ (which it certainly didn't have), the cost of actually
running the TLB miss handler doesn't go down.
In contrast, the x86 hw miss gets better when there is some more locality
and the page tables are cached. Much better. Ingo's worst-case example is
not realistic (no locality at all in half a gigabyte or totally random
examples), yet even for that worst case, modern CPU's beat the MIPS by
that big factor.
So let's say that the 75% miss ratio was more likely (that's still a high
TLB miss ratio). So in the _likely_ case, a P4 did the miss in an average
of 13 cycles. The MIPS miss cost won't have come down at all - in fact, it
possibly went _up_, since the miss handler now might be getting more I$
misses since it's not called all the time (I don't know if the MIPS miss
handler used non-caching loads or not - the positive D$ effects on the
page tables from slightly denser TLB behaviour might help some to offset
this factor).
That's a likely factor of fifty speedup. But let's be pessimistic again,
and say that the P4 number beat the MIPS TLB miss by "only" a factor of
twenty. That means that your worst case totally untuned argument (30 times
slowdown from TLB misses) on a P4 is only a 120% slowdown. Not a factor of
three.
But clearly you could tune your code too, and did. To the point that you
had a factor of 3.4 on MIPS. Now, let's say that the tuning didn't work as
well on P4 (remember, we're still being pessimistic), and you'd only get
half of that.
End result? If the slowdown was entirely due to TLB miss costs, your
likely slowdown is in the 20-40% range. Pessimistically.
Now, switching to x86 may have _other_ issues. Maybe other things might
get slower. [ Mmwwhahahahhahaaa. I crack myself up. x86 slower than MIPS?
I'm such a joker. ]
Anyway. The point stands. This is something where hardware really rules,
and software can't do a lot of sane stuff. 20-40% may sound like a big
number, and it is, but this is all stuff where Moore's Law says that
we shouldn't spend software effort.
We'll likely be better off with a smaller, simpler kernel in the future. I
hope. And the numbers above back me up. Software complexity for something
like this just kills.
Linus
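As a rough sanity check on the arithmetic above: assume the entire measured
slowdown on the old machine was TLB stall time, assume a newer machine
services each miss k times faster, and assume nothing else changes. That is
a crude model (it ignores OoO overlap and any retuning), but it lands in the
same ballpark as the figures Linus quotes. A minimal sketch, with the
slowdown factors taken from this thread:
---------------
#include <stdio.h>

int main (void)
{
        /*
         * Normalize the non-TLB work to 1. If the old machine showed a
         * slowdown factor S due to TLB misses, the miss time was S-1;
         * if misses are handled k times faster, the new slowdown is
         * (S-1)/k, everything else being equal.
         */
        double untuned = 30.0;          /* "30 times slowdown" quoted above */
        double tuned   = 3.4;           /* Andy's tuned treecode            */
        double k[]     = { 10.0, 20.0, 50.0 };
        int i;

        for (i = 0; i < 3; i++)
                printf("misses %2.0fx faster: untuned -> %3.0f%% slower, tuned -> %2.0f%% slower\n",
                        k[i],
                        100.0 * (untuned - 1.0) / k[i],
                        100.0 * (tuned - 1.0) / k[i]);
        return 0;
}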
Linus,
Please stop focussing on mips as the bad boy. Mips is dead. It
has been for years and everyone knows it unless they are embedded.
I wrote several times that I had tested other arches and every
time you deleted those comments. Not to mention that in the few
anecdotal (read: no records were kept) tests I've done with intel
vs mips on more than one code, mips doesn't come out nearly as bad
as you seem to believe. Maybe that is tlb related, maybe it is
related to other issues. The fact remains.
Later on after your posts I also posted numbers for power 5. Haven't
seen a response to that yet. Maybe you're digesting.
> let's just take Ingo's numbers, measured on modern hardware.
Ingo's numbers show roughly 95% tlb misses. I will likely have 100% tlb
misses over most of this code. Read my discussion of what it does
and you'll see why. Capsule form: Every tree node results in several
thousand nodes that are acceptable. You need to examine several times
that to get the acceptable ones. Several thousand memory reads from
several thousand different pages means 100% TLB misses. This is by no
means a pathological case. Other codes will have such effects too, as
I noted in my first very long rant.
I may have misread it, but that last bit of difference between 95%
and 100% tlb misses will be a pretty big factor in speed differences.
So your 20-40% goes right back up.
Ok, so there is some minimal fp overlap in my case, but a factor of 2
speed difference certainly still exists in the power5 arch numbers I
quoted.
I have a special case version of this code that does cache blocking
on the gravity calculation. As a special case version, it is not
effective for the general case. There are 0 TLB misses and 0 L1 misses
for this part of the code. The tree traversal cannot be similarly
cache blocked and keeps all the tlb and cache misses it always had.
For that version, I can get down to a 20% speedup, because overall the
traversal only takes 20% or so of the total time. That is the absolute
best I can do, and I've been tuning this code alone for close to a
decade.
Andy
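A toy model of the access pattern Andy describes (scattered tree nodes, one
per page, visited in data-dependent order) makes it easy to see why the
traversal runs at essentially 100% TLB misses with 4k pages. This is purely
illustrative and is not his treecode; note that with 2 MB pages the same
half-gigabyte arena would fit in a few hundred TLB entries.
---------------
#include <stdio.h>
#include <stdlib.h>

#define NODES   131072          /* one node per 4k page: a 512 MB arena, the */
#define STRIDE  4096            /* same footprint as Ingo's worst case above */

struct node {
        double  mass;
        long    next;           /* index of the next node to examine */
};

static struct node *node_at(char *arena, long i)
{
        return (struct node *)(arena + i * (long)STRIDE);
}

int main (void)
{
        char *arena = calloc(NODES, STRIDE);
        double sum = 0.0;
        long i, cur = 0;

        if (!arena)
                return 1;

        /*
         * Scatter the traversal order over the arena: each node names a
         * random successor, so consecutive visits land on unrelated pages.
         */
        for (i = 0; i < NODES; i++) {
                node_at(arena, i)->mass = 1.0;
                node_at(arena, i)->next = random() % NODES;
        }

        /*
         * The walk itself: every step touches a different page, so with a
         * TLB of even a thousand 4k entries essentially every access is a
         * miss. With 2 MB pages the whole arena needs only 256 entries.
         */
        for (i = 0; i < NODES; i++) {
                sum += node_at(arena, cur)->mass;
                cur = node_at(arena, cur)->next;
        }

        printf("visited %d nodes, sum = %g\n", NODES, sum);
        free(arena);
        return 0;
}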
Ingo wrote:
> i think the current hugepages=<N> boot option could transparently be
> morphed into a 'separate zone' approach, and ...
>
> this would already be a lot more flexible than what we have: the hugetlb
> area would not be 'lost' altogether, like now. Once we are at this stage
> we can see how usable it is in practice. I strongly suspect it will
> cover most of the HPC uses.
It seems to me this is making it harder than it should be. You're
trying to create a zone that is 100% cleanable, whereas the HPC folks
only desire 99.8% cleanable.
Unlike the hot(un)plug folks, the HPC folks don't mind a few pages of
Linus's unmoveable kmalloc memory in their way. They rather expect
that some modest percentage of each node will have some 'kernel stuff'
on it that refuses to move. They just want to be able to free up
most of the pages on a node, once one job is done there, before the
next job begins.
They are also quite willing (based on my experience with bootcpusets)
to designate a few nodes for the 'general purpose Unix load', and
reserve the remaining nodes just to run their special jobs.
On the other hand, as Eric Dumazet mentions on another subthread of
this topic, requiring that their apps use the hugetlbfs interface
to place the bulk of their memory would be a serious obstacle.
Their apps are already fairly tightly wound around a rich variety
of compiler, tool, library and runtime memory placement mechanisms,
and they would be hardpressed to make systematic changes in that.
I suspect that the answers lie in some further improvements in memory
placement on various nodes. Perhaps this means a cpuset option to
put the easily reclaimed (what Mel Gorman's patch would mark with
__GFP_EASYRCLM) kernel pages and the user pages on the nodes of
the current cpuset, but to prefer placing the less easily reclaimed
pages on the bootcpuset nodes. Then, when a job on such a dedicated
set of nodes completed, most of the memory would be easily reclaimable,
in preparation for the next job.
The bootcpuset stuff is entirely invisible to kernel hackers, because
I am doing it entirely in user space, with a pre-init program that
configures the bootcpuset, moves the unpinned kernel threads into
the bootcpuset, and fires up the real init in that bootcpuset.
With one more twist to the cpuset API, providing a way to state
per-cpuset a separate set of nodes (on what the HPC folks would call
their bootcpuset) as the preferred place to allocate not-EASYRCLM
kernel memory, we might be very close to meeting these HPC needs,
with no changes to or reliance on hugetlbs, with no changes to the
kernel boottime code, and with no changes to the memory management
mechanisms used within these HPC apps.
I am imagining yet another per-cpuset field, which I call 'kmems'. It
would be a nodemask, as is the current 'mems' field. I'd pick up the
__GFP_EASYRCLM flag of Mel Gorman's patch (no comment on suitability of
the rest of his patch), and prefer to place __GFP_EASYRCLM pages on the
'mems' nodes, but other pages evenly spread across the 'kmems' nodes.
For compatibility with the current cpuset API, an unset 'kmems'
would tell the kernel to use the 'mems' setting as a fallback.
The hardest part might be providing a mechanism, that would be invoked
by the batch scheduler between jobs, to flush the easily reclaimed
memory off a node (free it or write it to disk). Again, unlike the
hot(un)plug folks, a 98% solution is plenty good enough.
This will have to be coded and some HPC type loads tried on it, before
we know if it flies.
There is an obvious, unanswered question here. Would moving some of
the kernel's pages (the not easily reclaimed pages) off the current
(faulting) node into some possibly far off node be an acceptable
price to pay, to increase the percentage of the dedicated job nodes
that can be freed up between jobs? Since these HPC jobs tend to be
far more sensitive to their own internal data placement than they
are to the kernel's internal data placement, I am hopeful that this
tradeoff is a good one, for HPC apps.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
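To make the proposed semantics concrete, here is a user-space sketch of the
'kmems' fallback rule Paul describes. This is a model, not kernel code:
'kmems' is only a proposal at this point, and GFP_EASYRCLM below merely
stands in for the __GFP_EASYRCLM marking from Mel Gorman's patch. The one
behavioural rule it captures is that an unset 'kmems' falls back to 'mems',
which preserves the existing cpuset API.
---------------
#include <stdio.h>

typedef unsigned long nodemask_t;       /* toy version: one bit per node */

struct cpuset {
        nodemask_t mems;        /* nodes for user pages + easily reclaimed kernel pages */
        nodemask_t kmems;       /* preferred nodes for everything else (0 == unset)     */
};

#define GFP_EASYRCLM 0x1        /* stand-in for Mel's __GFP_EASYRCLM */

static nodemask_t placement_nodes(const struct cpuset *cs, unsigned int gfp)
{
        if (gfp & GFP_EASYRCLM)
                return cs->mems;
        /* unset kmems == fall back to mems, for API compatibility */
        return cs->kmems ? cs->kmems : cs->mems;
}

int main (void)
{
        /* a job cpuset spanning nodes 4-7, with node 0 acting as the bootcpuset */
        struct cpuset job = { .mems = 0xf0, .kmems = 0x01 };

        printf("easy-to-reclaim pages -> nodemask %#lx\n",
                placement_nodes(&job, GFP_EASYRCLM));
        printf("other kernel pages    -> nodemask %#lx\n",
                placement_nodes(&job, 0));
        return 0;
}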
On Nov 4, 2005, at 10:31:48, Linus Torvalds wrote:
> I can pretty much guarantee that any kernel I maintain will never
> have dynamic kernel pointers: when some memory has been allocated
> with kmalloc() (or equivalent routines - pretty much _any_ kernel
> allocation), it stays put.
Hmm, this brings up something that I haven't seen discussed on this
list (maybe a long time ago, but perhaps it should be brought up
again?). What are the pros/cons to having a non-physically-linear
kernel virtual memory space? Would it be theoretically possible to
allow some kind of dynamic kernel page swapping, such that the _same_
kernel-virtual pointer goes to a different physical memory page?
That would definitely satisfy the memory hotplug people, but I don't
know what the tradeoffs would be for normal boxen.
It seems like the trick would be to make sure that page accesses
_during_ the swap are correctly handled. If the page-swapper
included code in the kernel fault handler to notice that a page was
in the process of being swapped out/in by another CPU, it could just
wait for swap-in to finish and then resume from the new page. This
would get messy with DMA and non-cpu memory accessors and such, which
I assume are the reasons this has not been implemented in the past.
From what I can see, the really dumb-obvious-slow method would be to
call the first and last parts of software-suspend. As memory hotplug
is a relatively rare event, this would probably work well enough
given the requirements:
1) Run software suspend pre-memory-dump code
2) Move pages off the to-be-removed node, remapping the kernel
space to the new locations.
3) Mark the node so that new pages don't end up on it
4) Run software suspend post-memory-reload code
<random-guessing>
Perhaps the non-contiguous memory support would be of some help here?
</random-guessing>
Cheers,
Kyle Moffett
--
Simple things should be simple and complex things should be possible
-- Alan Kay
On Sat, 5 Nov 2005, Paul Jackson wrote:
>
> It seems to me this is making it harder than it should be. You're
> trying to create a zone that is 100% cleanable, whereas the HPC folks
> only desire 99.8% cleanable.
Well, 99.8% is pretty borderline.
> Unlike the hot(un)plug folks, the HPC folks don't mind a few pages of
> Linus's unmoveable kmalloc memory in their way. They rather expect
> that some modest percentage of each node will have some 'kernel stuff'
> on it that refuses to move.
The thing is, if 99.8% of memory is cleanable, the 0.2% is still enough to
make pretty much _every_ hugepage in the system pinned down.
Besides, right now, it's not 99.8% anyway. Not even close. It's more like
60%, and then horribly horribly ugly hacks that try to do something about
the remaining 40% and usually fail (the hacks might get it closer to 99%,
but they are fragile, expensive, and ugly as hell).
It used to be that HIGHMEM pages were always cleanable on x86, but even
that isn't true any more, since now at least pipe buffers can be there
too.
I agree that HPC people are usually a bit less up-tight about things than
database people tend to be, and many of them won't care at all, but if you
want hugetlb, you'll need big areas.
Side note: the exact size of hugetlb is obviously architecture-specific,
and the size matters a lot. On x86, for example, hugetlb pages are either
2MB or 4MB in size (and apparently 2GB may be coming). I assume that's
where you got the 99.8% from (4kB out of 2M).
Other platforms have more flexibility, but sometimes want bigger areas
still.
Linus
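Linus's point that 0.2% of unmovable memory is enough to pin nearly every
hugepage can be made quantitative under one simple assumption: if the pinned
4 kB pages were scattered uniformly at random, the fraction of 2 MB regions
left completely clean is 0.998^512, roughly a third. Real allocators cluster
kernel allocations, so reality sits somewhere between this and much better,
but the order of magnitude is the point (link the sketch below with -lm):
---------------
#include <stdio.h>
#include <math.h>

int main (void)
{
        double pinned = 0.002;          /* "99.8% cleanable"           */
        int per_huge  = 512;            /* 4 kB pages per 2 MB region  */
        double clean  = pow(1.0 - pinned, per_huge);

        printf("huge pages with no pinned 4k page: %.0f%%\n", 100.0 * clean);
        printf("huge pages pinned:                 %.0f%%\n", 100.0 * (1.0 - clean));
        return 0;
}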
On Sun, 6 Nov 2005, Kyle Moffett wrote:
>
> Hmm, this brings up something that I haven't seen discussed on this list
> (maybe a long time ago, but perhaps it should be brought up again?). What are
> the pros/cons to having a non-physically-linear kernel virtual memory space?
Well, we _do_ actually have that, and we use it quite a bit. Both
vmalloc() and HIGHMEM work that way.
The biggest problem with vmalloc() is that the virtual space is often as
constrained as the physical one (ie on old x86-32, the virtual address
space is the bigger problem - you may have 36 bits of physical memory, but
the kernel has only 30 bits of virtual). But it's quite commonly used for
stuff that wants big linear areas.
The HIGHMEM approach works fine, but the overhead of essentially doing a
software TLB is quite high, and if we never ever have to do it again on
any architecture, I suspect everybody will be pretty happy.
> Would it be theoretically possible to allow some kind of dynamic kernel page
> swapping, such that the _same_ kernel-virtual pointer goes to a different
> physical memory page? That would definitely satisfy the memory hotplug
> people, but I don't know what the tradeoffs would be for normal boxen.
Any virtualization will try to do that, but they _all_ prefer huge pages
if they care at all about performance.
If you thought the database people wanted big pages, the kernel is worse.
Unlike databases or HPC, the kernel actually wants to use the physical
page address quite often, notably for IO (but also for just mapping them
into some other virtual address - the users).
And no standard hardware allows you to do that in hw, so we'd end up doing
a software page table walk for it (or, more likely, we'd have to make
"struct page" bigger).
You could do it today, although at a pretty high cost. And you'd have to
forget about supporting any hardware that really wants contiguous memory
for DMA (sound cards etc). It just isn't worth it.
Real memory hotplug needs hardware support anyway (if only buffering the
memory at least electrically). At which point you're much better off
supporting some remapping in the buffering too, I'm convinced. There's no
_need_ to do these things in software.
Linus
On Sun, 6 Nov 2005, Linus Torvalds wrote:
>
> And no standard hardware allows you to do that in hw, so we'd end up doing
> a software page table walk for it (or, more likely, we'd have to make
> "struct page" bigger).
>
> You could do it today, although at a pretty high cost. And you'd have to
> forget about supporting any hardware that really wants contiguous memory
> for DMA (sound cards etc). It just isn't worth it.
Btw, in case it wasn't clear: the cost of these kinds of things in the
kernel is usually not so much the actual "lookup" (whether with hw assist
or with another field in the "struct page").
The biggest cost of almost everything in the kernel these days is the
extra code-footprint of yet another abstraction, and the locking cost.
For example, the real cost of the highmem mapping seems to be almost _all_
in the locking. It also makes some code-paths more complex, so it's yet
another I$ fill for the kernel.
So a remappable kernel tends to be different from a remappable user
application. A user application _only_ ever sees the actual cost of the
TLB walk (which hardware can do quite efficiently and is very amenable
indeed to a lot of optimization like OoO and speculative prefetching), but
on the kernel level, the remapping itself is the cheapest part.
(Yes, user apps can see some of the costs indirectly: they can see the
synchronization costs if they do lots of mmap/munmap's, especially if they
are threaded. But they really have to work at it to see it, and I doubt
the TLB synchronization issues tend to be even on the radar for any user
space performance analysis).
You could probably do a remappable kernel (modulo the problems with
specific devices that want bigger physically contiguous areas than one
page) reasonably cheaply on UP. It gets more complex on SMP and with full
device access.
In fact, I suspect you can ask any Xen developer what their performance
problems and worries are. I suspect they much prefer UP clients over SMP
ones, and _much_ prefer paravirtualization over running unmodified
kernels.
So remappable kernels are certainly doable, they just have more
fundamental problems than remappable user space _ever_ has. Both from a
performance and from a complexity angle.
Linus
Linus wrote:
> The thing is, if 99.8% of memory is cleanable, the 0.2% is still enough to
> make pretty much _every_ hugepage in the system pinned down.
Agreed.
I realized after writing this that I wasn't clear on something.
I wasn't focused on the subject of this thread, adding hugetlb pages after
the system has been up a while.
I was focusing on a related subject - freeing up most of the ordinary
size pages on the dedicated application nodes between jobs on a large
system using
* a bootcpuset (for the classic Unix load) and
* dedicated nodes (for the HPC apps).
I am looking to provide the combination of:
1) specifying some hugetlb pages at system boot, plus
2) the ability to clean off most of the ordinary sized pages
from the application nodes between jobs.
Perhaps Andy or some of my HPC customers wish I was also looking
to provide:
3) the ability to add lots of hugetlb pages on the application
nodes after the system has run a while.
But if they are, then they have some more educatin' to do on me.
For now, I am sympathetic to your concerns with code and locking
complexity. Freeing up great globs of hugetlb sized contiguous chunks
of memory after a system has run a while would be hard.
We have to be careful which hard problems we decide to take on.
We can't take on too many, and we have to pick ones that will provide
a major long term advantage to Linux, over the foreseeable changes in
system hardware and architecture.
Even if most of the processors that Andy has tested against would
benefit from dynamically added hugetlb pages, if we can anticipate
that this will not be a sustained opportunity for Linux (and looking
at current x86 chips doesn't require much anticipating) then that
might not be the place to invest our precious core complexity dollars.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
>>>>> "Linus" == Linus Torvalds <[email protected]> writes:
Linus> On Sun, 6 Nov 2005, Linus Torvalds wrote:
>>
>> And no standard hardware allows you to do that in hw, so we'd end up doing
>> a software page table walk for it (or, more likely, we'd have to make
>> "struct page" bigger).
>>
>> You could do it today, although at a pretty high cost. And you'd have to
>> forget about supporting any hardware that really wants contiguous memory
>> for DMA (sound cards etc). It just isn't worth it.
Linus> Btw, in case it wasn't clear: the cost of these kinds of things
Linus> in the kernel is usually not so much the actual "lookup"
Linus> (whether with hw assist or with another field in the "struct
Linus> page").
Linus> The biggest cost of almost everything in the kernel these days
Linus> is the extra code-footprint of yet another abstraction, and the
Linus> locking cost.
Linus> For example, the real cost of the highmem mapping seems to be
Linus> almost _all_ in the locking. It also makes some code-paths more
Linus> complex, so it's yet another I$ fill for the kernel.
This to me raises the interesting question of which new features of
CPUs and their chipsets are most wanted by the Linux developers. I
know there are different problem spaces, from embedded where
power/cost is king, to user desktops, to big big clusters.
Has any vendor come close to the ideal CPU architecture for an OS? I
would assume that you'd want:
1. large address space, 64 bits
2. large IO space, 64 bits
3. high memory/io bandwidth
4. efficient locking primitives?
- keep some registers for locking only?
5. efficient memory bandwidth?
6. simple setup where you don't need so much legacy cruft?
7. clean CPU design? RISC? Is CISC king again?
8. Variable page sizes?
- how does this affect TLB?
- how do you change sizes in a program?
9. SMP or hyper-threading or multi-cores?
10. PCI (and its flavors) addressing/DMA support?
With the growth in data versus instructions these days, does it make
sense to have memory split into D/I sections? Or is it better to just
have a completely flat memory model and let the OS do any splitting it
wants?
Heck, I don't know. I'm just interested in where
Linus/Alan/Andrew/et al. think that the low level system design should
think about moving towards since it will make things simpler/faster at
the OS level. I'm completely ignoring the application level since
it's ideally not going to change much... really.
To me, it seems that some sort of efficient low level locking
primitives that work well in any of UP/SMP/NUMA environments would be
key. Just looking at all the fine grain locking people are adding to
the kernel to get around all the issues of the BKL over the years.
Of course making memory faster would be nice too...
I know, it's all out of left field, but it would be interesting to see
what people thought. I honestly wonder if Intel, AMD, PowerPC, Sun
really try to work from the top down when designing their chips, or
more from "this is where we are, how can we speed up what we've got?"
type of view?
Thanks,
John
On Sun, 6 Nov 2005, John Stoffel wrote:
>
> Has any vendor come close to the ideal CPU architecture for an OS? I
> would assume that you'd want:
Well, in the end, the #1 requirement ends up being "wide availability of
development boxes".
For example, I think Apple made a huge difference to the PowerPC platform,
and we'll see what happens when Apple boxes are x86. Can IBM continue to
make Power available enough to be relevant?
Note that raw numbers of CPU's don't much matter - ARM sells a lot more
than x86, but it's not to developers. Similarly, the game consoles may
sell a lot of Power, but the actual developers that are using it is a very
specialized bunch and much smaller in number.
> 1. large address space, 64 bits
> 2. large IO space, 64 bits
> 3. high memory/io bandwidth
> 4. efficient locking primitives?
> - keep some registers for locking only?
> 5. efficient memory bandwidth?
> 6. simple setup where you don't need so much legacy cruft?
> 7. clean CPU design? RISC? Is CISC king again?
> 8. Variable page sizes?
> - how does this affect TLB?
> - how do you change sizes in a program?
> 9. SMP or hyper-threading or multi-cores?
> 10. PCI (and its flavors) addressing/DMA support?
It's personal, but I don't think the above are huge deal-breakers.
We do want a "big enough" virtual address space, that's pretty much
required. It doesn't necessarily have to be the full 64 bits, and it's
fine if the IO space is just a part of that.
As to ISA and registers - nobody much cares. The compiler takes care of
it, and I'd personally _much_ rather see a common ISA than a "clean" one.
The x86 architecture may be odd, but it works well.
So the ISA doesn't matter that much, but from a microarchitectural
standpoint:
- fast large first-level caches help a lot. And I'd rather take a bigger
L1 that has a two- or even three-cycle latency than a small one. That's
assuming the uarch is out-of-order, of course.
- good fast L2, and I'll take low-latency memory access over an L3 any
day.
- low-latency serialization (locking and memory barriers). In fact,
pretty much low-latency everything (branch mispredict latency etc).
- cheap and powerful.
but the fact is, we'll work with pretty much any crap we're given. If it's
bad, it won't make it in the marketplace.
Linus
Linus> On Sun, 6 Nov 2005, John Stoffel wrote:
>>
>> Has any vendor come close to the ideal CPU architecture for an OS? I
>> would assume that you'd want:
Linus> Well, in the end, the #1 requirement ends up being "wide
Linus> availability of development boxes".
Heh! Take my thoughts and turn them on my head. Bravo!
Linus> We do want a "big enough" virtual address space, that's pretty
Linus> much required. It doesn't necessarily have to be the full 64
Linus> bits, and it's fine if the IO space is just a part of that.
So 40 bits is fine for now, but 64 would be great just because it
solves the problem for a long long time?
Linus> As to ISA and registers - nobody much cares. The compiler takes
Linus> care of it, and I'd personally _much_ rather see a common ISA
Linus> than a "clean" one. The x86 architecture may be odd, but it
Linus> works well.
But aren't there areas where the ISA would expose useful parts of the
underlying microarchitecture that could be more efficiently used in
OSes?
Linus> - fast large first-level caches help a lot. And I'd rather
Linus> take a bigger L1 that has a two- or even three-cycle latency
Linus> than a small one. That's assuming the uarch is out-of-order,
Linus> of course.
Linus> - good fast L2, and I'll take low-latency memory access over
Linus> an L3 any day.
Linus> - low-latency serialization (locking and memory barriers). In
Linus> fact, pretty much low-latency everything (branch mispredict
Linus> latency etc).
Linus> - cheap and powerful.
Linus> but the fact is, we'll work with pretty much any crap we're
Linus> given. If it's bad, it won't make it in the marketplace.
The corollary of course is that if it's excellent but the marketplace
doesn't like it for some reason, we'll still let it go. I keep
wishing for the Alpha to come back sometimes... Oh well.
Thanks for your thoughts Linus.
John
* Linus Torvalds <[email protected]> wrote:
> > You could do it today, although at a pretty high cost. And you'd have to
> > forget about supporting any hardware that really wants contiguous memory
> > for DMA (sound cards etc). It just isn't worth it.
>
> Btw, in case it wasn't clear: the cost of these kinds of things in the
> kernel is usually not so much the actual "lookup" (whether with hw
> assist or with another field in the "struct page").
[...]
> So remappable kernels are certainly doable, they just have more
> fundamental problems than remappable user space _ever_ has. Both from
> a performance and from a complexity angle.
furthermore, it doesn't bring us any closer to removable RAM. The problem
is still unsolvable (due to the 'how do you find live pointers to fix
up' issue), even if the full kernel VM is 'mapped' at 4K granularity.
Ingo
On Mon, 2005-11-07 at 09:00 +0100, Ingo Molnar wrote:
> * Linus Torvalds <[email protected]> wrote:
> > So remappable kernels are certainly doable, they just have more
> > fundamental problems than remappable user space _ever_ has. Both from
> > a performance and from a complexity angle.
>
> furthermore, it doesn't bring us any closer to removable RAM. The problem
> is still unsolvable (due to the 'how do you find live pointers to fix
> up' issue), even if the full kernel VM is 'mapped' at 4K granularity.
I'm not sure I understand. If you're remapping, why do you have to find
and fix up live pointers? Are you talking about things that
require fixed _physical_ addresses?
-- Dave
* Dave Hansen <[email protected]> wrote:
> On Mon, 2005-11-07 at 09:00 +0100, Ingo Molnar wrote:
> > * Linus Torvalds <[email protected]> wrote:
> > > So remappable kernels are certainly doable, they just have more
> > > fundamental problems than remappable user space _ever_ has. Both from
> > > a performance and from a complexity angle.
> >
> > furthermore, it doesn't bring us any closer to removable RAM. The problem
> > is still unsolvable (due to the 'how do you find live pointers to fix
> > up' issue), even if the full kernel VM is 'mapped' at 4K granularity.
>
> I'm not sure I understand. If you're remapping, why do you have to
> find and fix up live pointers? Are you talking about things that
> require fixed _physical_ addresses?
RAM removal, not RAM replacement. I explained all the variants in an
earlier email in this thread. "extending RAM" is relatively easy.
"replacing RAM" while doable, is probably undesirable. "removing RAM"
impossible.
Ingo
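A toy illustration of the live-pointer problem behind this removal versus
replacement distinction, in ordinary user-space C (nothing below is kernel
code). Replacing memory only requires copying the contents and remapping the
same virtual address to new backing store; removing it means the address
range itself must disappear, and nothing tracks the pointers that still
refer to it.
---------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct consumer {
        int *counter;           /* a live pointer into the region */
};

int main (void)
{
        int *region    = malloc(1024 * sizeof(int));
        int *elsewhere = malloc(1024 * sizeof(int));
        struct consumer c;

        region[10] = 42;
        c.counter = &region[10];

        /*
         * "Replacing" is fine as long as the virtual address stays valid:
         * copy the contents, remap the same address to the new backing
         * store, and c.counter never notices.
         */
        memcpy(elsewhere, region, 1024 * sizeof(int));

        /*
         * "Removing" means the address range itself goes away, and nothing
         * records that c.counter (and every pointer like it, scattered
         * through every data structure) now needs rewriting. That is the
         * part Ingo calls unsolvable.
         */
        printf("c.counter still points into the old region: %p -> %d\n",
                (void *)c.counter, *c.counter);

        free(region);
        free(elsewhere);
        return 0;
}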
On Fri, 2005-11-04 at 08:44 +0100, Eric Dumazet wrote:
> Paul Jackson wrote:
> > Linus wrote:
> >
> >>Maybe you'd be willing on compromising by using a few kernel boot-time
> >>command line options for your not-very-common load.
> >
> >
> > If we were only a few options away from running Andy's varying load
> > mix with something close to ideal performance, we'd be in fat city,
> > and Andy would never have been driven to write that rant.
>
> I found hugetlb support in linux not very practical/usable on NUMA machines,
> whether via boot-time parameters or /proc/sys/vm/nr_hugepages.
>
> With this single integer parameter, you cannot allocate 1000 4MB pages on one
> specific node while leaving small pages on another node.
>
> I'm not an astrophysicist, nor a DB admin; I'm only trying to partition a dual
> node machine between one (numa aware) memory intensive job and all others
> (system, network, shells).
> At least I can reboot it if needed, but I feel Andy's pain.
>
> There is a /proc/buddyinfo file, maybe we need a /proc/sys/vm/node_hugepages
> with a list of integers (one per node) ?
Or perhaps /sys/devices/system/node/nodeX/nr_hugepages triggers that
work like the current /proc trigger but on a per node basis?
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
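For illustration, here is how a batch scheduler might drive such a per-node
trigger. The sysfs path below is Adam's proposed interface, not something
that exists as of this thread, so treat the whole sketch as hypothetical.
---------------
#include <stdio.h>

static int set_node_hugepages(int node, long count)
{
        char path[128];
        FILE *f;

        /* the *proposed* per-node trigger, mirroring /proc/sys/vm/nr_hugepages */
        snprintf(path, sizeof(path),
                 "/sys/devices/system/node/node%d/nr_hugepages", node);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%ld\n", count);
        return fclose(f);
}

int main (void)
{
        /* e.g. 1000 huge pages on node 1, leave node 0 for small pages */
        if (set_node_hugepages(1, 1000) < 0 || set_node_hugepages(0, 0) < 0)
                perror("nr_hugepages");
        return 0;
}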
On Mon, 2005-11-07 at 13:20 +0100, Ingo Molnar wrote:
>
> RAM removal, not RAM replacement. I explained all the variants in an
> earlier email in this thread. "extending RAM" is relatively easy.
> "replacing RAM" while doable, is probably undesirable. "removing RAM"
> impossible.
Hi Ingo,
I'm usually amused when someone says something is impossible, so I'm
wondering exactly "why"?
If the one requirement is that there must be enough free memory
available to remove, then what's the problem for a fully mapped kernel?
Is it the GPT? Or is it drivers that have physical memory mapped?
I'm not sure of the best way to solve the GPT being in the RAM that is
to be removed, but there might be a way. Basically stop all activities
and update all the tasks->mm.
As for the drivers, one could have an accounting for all physical memory
mapped, and disable the driver if it is using the memory that is to be
removed.
But other than these, what exactly is the problem with removing RAM?
BTW, I'm not suggesting any of this is a good idea, I just like to
understand why something _can't_ be done.
-- Steve
>>RAM removal, not RAM replacement. I explained all the variants in an
>>earlier email in this thread. "extending RAM" is relatively easy.
>>"replacing RAM" while doable, is probably undesirable. "removing RAM"
>>impossible.
>
<snip>
> BTW, I'm not suggesting any of this is a good idea, I just like to
> understand why something _can't_ be done.
>
I'm also of the opinion that if we make the kernel remappable we can "remove
RAM". Now, we've had enough people weigh in on this being a bad idea that I'm
not going to try it. After all it is fairly complex, quite a bit more so than
Mel's reasonable patches. But I think it is possible. The steps would look like this:
Method A:
1. Find some unused RAM (or free some up)
2. Reserve that RAM
3. Copy the active data from the soon to be removed RAM to the reserved RAM
4. Remap the addresses
5. Remove the RAM
This of course requires that steps 3 & 4 take place under something like
stop_machine_run() to keep the data from changing.
Alternately you could do it like this:
Method B:
1. Find some unused RAM (or free some up)
2. Reserve that RAM
3. Unmap the addresses on the soon to be removed RAM
4. Copy the active data from the soon to be removed RAM to the reserved RAM
5. Remap the addresses
6. Remove the RAM
Which would save you the stop_machine_run(), but adds the complication of
dealing with faults on pinned memory during the migration.
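A user-space cartoon of Method A, just to pin down the ordering: the mutex
stands in for stop_machine_run(), and swapping a single indirection pointer
stands in for the remap. The real kernel has no such single indirection,
which is exactly why this needs a remappable kernel (or hardware help); take
it as an analogy, not a design.
---------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>

static pthread_mutex_t stop_machine = PTHREAD_MUTEX_INITIALIZER;
static int *live_region;        /* the one pointer every "user" goes through */

static void migrate(size_t words)
{
        /* steps 1 & 2: find and reserve replacement memory */
        int *replacement = malloc(words * sizeof(int));
        int *old;

        pthread_mutex_lock(&stop_machine);                      /* "stop_machine_run()" */
        memcpy(replacement, live_region, words * sizeof(int));  /* step 3: copy          */
        old = live_region;
        live_region = replacement;                              /* step 4: "remap"       */
        pthread_mutex_unlock(&stop_machine);
        free(old);                                              /* step 5: "remove" RAM  */
}

int main (void)
{
        size_t words = 1024;

        live_region = calloc(words, sizeof(int));
        live_region[7] = 42;

        migrate(words);

        printf("after migration: live_region[7] = %d\n", live_region[7]);
        free(live_region);
        return 0;
}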
On Monday 07 November 2005 17:38, you wrote:
> >>RAM removal, not RAM replacement. I explained all the variants in an
> >>earlier email in this thread. "extending RAM" is relatively easy.
> >>"replacing RAM" while doable, is probably undesirable. "removing RAM"
> >>impossible.
>
> <snip>
>
> > BTW, I'm not suggesting any of this is a good idea, I just like to
> > understand why something _cant_ be done.
>
> I'm also of the opinion that if we make the kernel remappable we can
> "remove RAM". Now, we've had enough people weigh in on this being a bad
> idea that I'm not going to try it. After all it is fairly complex, quite a
> bit more so than Mel's reasonable patches. But I think it is possible. The
> steps would look like this:
>
> Method A:
> 1. Find some unused RAM (or free some up)
> 2. Reserve that RAM
> 3. Copy the active data from the soon to be removed RAM to the reserved RAM
> 4. Remap the addresses
> 5. Remove the RAM
>
> This of course requires that steps 3 & 4 take place under something like
> stop_machine_run() to keep the data from changing.
Actually, what I was thinking is that if you use the swsusp infrastructure to
suspend all processes, all dma, quiesce the heck out of the devices, and
_then_ try to move the kernel... Well, you at least have a much more
controlled problem. Yeah, it's pretty darn intrusive, but if you're doing
"suspend to ram" perhaps the downtime could be only 5 or 10 seconds...
I don't know how much of the problem that leaves unsolved, though.
Rob
> Actually, what I was thinking is that if you use the swsusp infrastructure to
> suspend all processes, all dma, quiesce the heck out of the devices, and
> _then_ try to move the kernel... Well, you at least have a much more
> controlled problem. Yeah, it's pretty darn intrusive, but if you're doing
> "suspend to ram" perhaps the downtime could be only 5 or 10 seconds...
I don't think suspend to ram for a memory hotplug remove would be acceptable to
users. The other methods add some complexity to the kernel, but are transparent
to userspace. Downtime of 5 to 10 seconds is really quite a bit of downtime.
> I don't know how much of the problem that leaves unsolved, though.
It would still require a remappable kernel. And it seems intuitively wrong
to me. But if you want to try it out I won't stop you. It might even work.