LinuxLists.cc - Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

> Linus writes:
>
>> Just face it - people who want memory hotplug had better know that
>> beforehand (and let's be honest - in practice it's only going to work in
>> virtualized environments or in environments where you can insert the new
>> bank of memory and copy it over and remove the old one with hw support).
>>
>> Same as hugetlb.
>>
>> Nobody sane _cares_. Nobody sane is asking for these things. Only people
>> with special needs are asking for it, and they know their needs.
>
>
> Hello, my name is Andy. I am insane. I am one of the CRAZY PEOPLE you wrote
> about.

To provide a slightly shorter version ... we had one customer running
similarly large number crunching things in Fortran. Their app ran 25%
faster with large pages (not a typo). Because they ran a variety of
jobs in batch mode, they need large pages sometimes, and small pages
at others - hence they need to dynamically resize the pool.

That's the sort of thing we were trying to fix with dynamically sized
hugepage pools. It does make a huge difference to real-world customers.

M.

2005-11-04 01:25:33

by Nick Piggin

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Martin J. Bligh wrote:

>
> To provide a slightly shorter version ... we had one customer running
> similarly large number crunching things in Fortran. Their app ran 25%
> faster with large pages (not a typo). Because they ran a variety of
> jobs in batch mode, they need large pages sometimes, and small pages
> at others - hence they need to dynamically resize the pool.
>
> That's the sort of thing we were trying to fix with dynamically sized
> hugepage pools. It does make a huge difference to real-world customers.
>

Aren't HPC users very easy? In fact, probably the easiest because they
generally not very kernel intensive (apart from perhaps some batches of
IO at the beginning and end of the jobs).

A reclaimable zone should provide exactly what they need. I assume the
sysadmin can give some reasonable upper and lower estimates of the
memory requirements.

They don't need to dynamically resize the pool because it is all being
allocated to pagecache anyway, so all jobs are satisfied from the
reclaimable zone.

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

2005-11-04 05:15:21

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Thu, 3 Nov 2005, Andy Nelson wrote:
>
> I have done high performance computing in astrophysics for nearly two
> decades now. It gives me a perspective that kernel developers usually
> don't have, but sometimes need. For my part, I promise that I specifically
> do *not* have the perspective of a kernel developer. I don't even speak C.

Hey, cool. You're a physicist, and you'd like to get closer to 100%
efficiency out of your computer.

And that's really nice, because maybe we can strike a deal.

Because I also have a problem with my computer, and a physicist might just
help _me_ get closer to 100% efficiency out of _my_ computer.

Let me explain.

I've got a laptop that takes about 45W, maybe 60W under load.

And it has a battery that weighs about 350 grams.

Now, I know that if I were to get 100% energy efficiency out of that
battery, a trivial physics calculations tells me that e=mc^2, and that my
battery _should_ have a hell of a lot of energy in it. In fact, according
to my simplistic calculations, it turns out that my laptop _should_ have a
battery life that is only a few times the lifetime of the universe.

It turns out that isn't really the case in practice, but I'm hoping you
can help me out. I obviously don't need it to be really 100% efficient,
but on the other hand, I'd also like the battery to be slightly lighter,
so if you could just make sure that it's at least _slightly_ closer to the
theoretical values I should be getting out of it, maybe I wouldn't need to
find one of those nasty electrical outlets every few hours.

Do we have a deal? After all, you only need to improve my battery
efficiency by a really _tiny_ amount, and I'll never need to recharge it
again. And I'll improve your problem.

Or are you maybe willing to make a few compromises in the name of being
realistic, and living with something less than the theoretical peak
performance of what you're doing?

I'm willing on compromising to using only the chemical energy of the
processes involved, and not even a hundred percent efficiency at that.
Maybe you'd be willing on compromising by using a few kernel boot-time
command line options for your not-very-common load.

Ok?

Linus

2005-11-04 06:11:03

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Linus wrote:
> Maybe you'd be willing on compromising by using a few kernel boot-time
> command line options for your not-very-common load.

If we were only a few options away from running Andy's varying load
mix with something close to ideal performance, we'd be in fat city,
and Andy would never have been driven to write that rant.

There's more to it than that, but it is not as impossible as a battery
with the efficiencies you (and the rest of us) dream of.

Andy has used systems that resemble what he is seeking. So he is not
asking for something clearly impossible. Though it might not yet be
possible, in ways that contribute to a continuing healthy kernel code
base.

It's an interesting challenge - finding ways to improve the kernel's
performance on such high end loads, that are also suitable and
desirable (or at least innocent enough) for inclusion in a kernel far
more widely used in embeddeds, desktops and ordinary servers.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-11-04 06:38:39

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

* Paul Jackson <[email protected]> wrote:

> Linus wrote:
> > Maybe you'd be willing on compromising by using a few kernel boot-time
> > command line options for your not-very-common load.
>
> If we were only a few options away from running Andy's varying load
> mix with something close to ideal performance, we'd be in fat city,
> and Andy would never have been driven to write that rant.
>
> There's more to it than that, but it is not as impossible as a battery
> with the efficiencies you (and the rest of us) dream of.

just to make sure i didnt get it wrong, wouldnt we get most of the
benefits Andy is seeking by having a: boot-time option which sets aside
a "hugetlb zone", with an additional sysctl to grow (or shrink) the pool
- with the growing happening on a best-effort basis, without guarantees?

i have implemented precisely such a scheme for 'bigpages' years ago, and
it worked reasonably well. (i was lazy and didnt implement it as a
resizable zone, but as a list of large pages taken straight off the
buddy allocator. This made dynamic resizing really easy and i didnt have
to muck with the buddy and mem_map[] data structures that zone-resizing
forces us to do. It had the disadvantage of those pages skewing the
memory balance of the affected zone.)

my quick solution was good enough that on a test-system i could resize
the pool across Oracle test-runs, when the box was otherwise quiet. I'd
expect a well-controlled HPC system to be equally resizable.

what we cannot offer is a guarantee to be able to grow the pool. Hence
the /proc mechanism would be called:

/proc/sys/vm/try_to_grow_hugemem_pool

to clearly stress the 'might easily fail' restriction. But if userspace
is well-behaved on Andy's systems (which it seems to be), then in
practice it should be resizable. On a generic system, only the boot-time
option is guaranteed to allocate as much RAM as possible. And once this
functionality has been clearly communicated and separated, the 'try to
alloc a large page' thing could become more agressive: it could attempt
to construct large pages if it can.

i dont think we object to such a capability, as long as the restrictions
are clearly communicated. (and no, that doesnt mean some obscure
Documentation/ entry - the restrictions have to be obvious from the
primary way of usage. I.e. no /proc/sys/vm/hugemem_pool_size thing where
growing could fail.)

Ingo

2005-11-04 07:27:26

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Ingo wrote:
> to clearly stress the 'might easily fail' restriction. But if userspace
> is well-behaved on Andy's systems (which it seems to be), then in
> practice it should be resizable.

At first glance, this is the sticky point that jumps out at me.

Andy wrote:
> My experience is that after some days or weeks of running have gone
> by, there is no possible way short of a reboot to get pages merged
> effectively back to any pristine state with the infrastructure that
> exists there.

I take it, from what Andy writes, and from my other experience with
similar customers, that his workload is not "well-behaved" in the
sense you hoped for.

After several diverse jobs are run, we cannot, so far as I know,
merge small pages back to big pages.

I have not played with Mel Gorman's Fragmentation Avoidance patches,
so don't know if they would provide a substantial improvement here.
They well might.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-11-04 07:37:55

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

* Paul Jackson <[email protected]> wrote:

> At first glance, this is the sticky point that jumps out at me.
>
> Andy wrote:
> > My experience is that after some days or weeks of running have gone
> > by, there is no possible way short of a reboot to get pages merged
> > effectively back to any pristine state with the infrastructure that
> > exists there.
>
> I take it, from what Andy writes, and from my other experience with
> similar customers, that his workload is not "well-behaved" in the
> sense you hoped for.
>
> After several diverse jobs are run, we cannot, so far as I know, merge
> small pages back to big pages.

ok, so the zone solution it has to be. I.e. the moment it's a separate
special zone, you can boot with most of the RAM being in that zone, and
you are all set. It can be used both for hugetlb allocations, and for
other PAGE_SIZE allocations as well, in a highmem-fashion. These HPC
setups are rarely kernel-intense.

Thus the only dynamic sizing decision that has to be taken is to
determine the amount of 'generic kernel RAM' that is needed in the
worst-case. To give an example: say on a 256 GB box, set aside 8 GB for
generic kernel needs, and have 248 GB in the hugemem zone. This leaves
us with the following scenario: apps can use up to 97% of all RAM for
hugemem, and they can use up to 100% of all RAM for PAGE_SIZE
allocations. 3% of RAM can be used by generic kernel needs. Sounds
pretty reasonable and straightforward from a system management point of
view. No runtime resizing, but it wouldnt be needed, unless kernel
activity needs more than 8GB of RAM.

Ingo

2005-11-04 07:44:26

by Eric Dumazet

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Paul Jackson a ?crit :
> Linus wrote:
>
>>Maybe you'd be willing on compromising by using a few kernel boot-time
>>command line options for your not-very-common load.
>
>
> If we were only a few options away from running Andy's varying load
> mix with something close to ideal performance, we'd be in fat city,
> and Andy would never have been driven to write that rant.

I found hugetlb support in linux not very practical/usable on NUMA machines,
boot-time parameters or /proc/sys/vm/nr_hugepages.

With this single integer parameter, you cannot allocate 1000 4MB pages on one
specific node, letting small pages on another node.

I'm not an astrophysician, nor a DB admin, I'm only trying to partition a dual
node machine between one (numa aware) memory intensive job and all others
(system, network, shells).
At least I can reboot it if needed, but I feel Andy pain.

There is a /proc/buddyinfo file, maybe we need a /proc/sys/vm/node_hugepages
with a list of integers (one per node) ?

Eric

2005-11-04 14:56:52

by andy

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Linus,

Since my other affiliation is with X2, which also goes by
the name Thermonuclear Applications, we have a deal. I'll
continue to help with the work on getting nuclear fusion
to work, and you work on getting my big pages to work
in linux. We both have lots of funding and resources behind
us and are working with smart people. It should be easy.
Beyond that, I don't know much of anything about chemistry,
you'll have to find someone else to increase your battery
efficiency that way.

Big pages don't work now, and zones do not help because the
load is too unpredictable. Sysadmins *always* turn them
off, for very good reasons. They cripple the machine.

I'll try in this post also to merge a couple of replies with
other responses:

I think it was Martin Bligh who wrote that his customer gets
25% speedups with big pages. That is peanuts compared to my
factor 3.4 (search comp.arch for John Mashey's and my name
at the University of Edinburgh in Jan/Feb 2003 for a conversation
that includes detailed data about this), but proves the point that
it is far more than just me that wants big pages.

If your and other kernel developer's (<<0.01% of the universe) kernel
builds slow down by 5% and my and other people's simulations (perhaps
0.01% of the universe) speed up by a factor up to 3 or 4, who wins?
Answer right now: you do, since you are writing the kernel to
respond to your own issues, which are no more representative of the
rest of the universe than my work is. Answer as I think it
ought to be: I do, since I'd bet that HPC takes far more net
cycles in the world than every one else's kernel builds put
together. I can't expect much of anyone else to notice either
way and neither can you, so that is a wash.

Ingo Molnar says that zones work for him. In response I
will now repeat my previous rant about why zones don't
work. I understand that my post was very long and people
probably didn't read it all. So I'll just repeat that
part:

2) The last paragraph above is important because of the way HPC
works as an industry. We often don't just have a dedicated machine to
run on, that gets booted once and one dedicated application runs on it
till it dies or gets rebooted again. Many jobs run on the same machine.
Some jobs run for weeks. Others run for a few hours over and over
again. Some run massively parallel. Some run throughput.

How is this situation handled? With a batch scheduler. You submit
a job to run and ask for X cpus, Y memory and Z time. It goes and
fits you in wherever it can. cpusets were helpful infrastructure
in linux for this.

You may get some cpus on one side of the machine, some more
on the other, and memory associated with still others. They
do a pretty good job of allocating resources sanely, but there is
only so much that it can do.

The important point here for page related discusssions is that
someone, you don't know who, was running on those cpu's and memory
before you. And doing Ghu Knows What with it.

This code could be running something that benefits from small pages, or
it could be running with large pages. It could be dynamically
allocating and freeing large or small blocks of memory or it could be
allocating everything at the beginning and running statically
thereafter. Different codes do different things. That means that the
memory state could be totally fubar'ed before your job ever gets
any time allocated to it.

>Nobody takes a random machine and says "ok, we'll now put our most
>performance-critical database on this machine, and oh, btw, you can't
>reboot it and tune for it beforehand".

Wanna bet?

What I wrote above makes tuning the machine itself totally ineffective.
What do you tune for? Tuning for one person's code makes someone else's
slower. Tuning for the same code on one input makes another input run
horribly.

You also can't be rebooting after every job. What about all the other
ones that weren't done yet? You'd piss off everyone running there and
it takes too long besides.

What about a machine that is running multiple instances of some
database, some bigger or smaller than others, or doing other kinds
of work? Do you penalize the big ones or the small ones, this kind
of work or that?

You also can't establish zones that can't be changed on the fly
as things on the system change. How do zones like that fit into
numa? How do things work when suddenly you've got a job that wants
the entire memory filled with large pages and you've only got
half your system set up for large pages? What if you tune the
system that way and then let that job run. For some stupid reason user
reason it dies 10 minutes after starting? Do you let the 30
other jobs in the queue sit idle because they want a different
page distribution?

This way lies madness. Sysadmins just say no and set up the machine
in as stably as they can, usually with something not too different
that whatever manufacturer recommends as a default. For very good reasons.

I would bet the only kind of zone stuff that could even possibly
work would be related to a cpu/memset zone arrangement. See below.

Andy Nelson

--
Andy Nelson Theoretical Astrophysics Division (T-6)
andy dot nelson at lanl dot gov Los Alamos National Laboratory
http://www.phys.lsu.edu/~andy Los Alamos, NM 87545

2005-11-04 15:18:50

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

* Andy Nelson <[email protected]> wrote:

> I think it was Martin Bligh who wrote that his customer gets 25%
> speedups with big pages. That is peanuts compared to my factor 3.4
> (search comp.arch for John Mashey's and my name at the University of
> Edinburgh in Jan/Feb 2003 for a conversation that includes detailed
> data about this), but proves the point that it is far more than just
> me that wants big pages.

ok, this posting of you seems to be it:

http://groups.google.com/group/comp.sys.sgi.admin/browse_thread/thread/39884db861b7db15/e0332608c52a17e3?lnk=st&q=&rnum=35#e0332608c52a17e3

| Timing for the tree traveral+gravity calculation were
|
| 16MBpages 1MBpages 64kpages
| 1 * * 2361.8s
| 8 86.4s 198.7s 298.1s
| 16 43.5s 99.2s 148.9s
| 32 22.1s 50.1s 75.0s
| 64 11.2s 25.3s 37.9s
| 96 7.5s 17.1s 25.4s
|
| (*) test not done.
|
| As near as I can tell the numbers show perfect
| linear speedup for the runs for each page size.
|
| Across different page sizes there is degradation
| as follows:
|
| 16m --> 64k decreases by a factor 3.39 in speed
| 16m --> 1m decreases by a factor 2.25 in speed
| 1m --> 64k decreases by a factor 1.49 in speed

[...]
|
| Sum over cpus of TLB miss times for each test:
|
| 16MBpages 1MBpages 64kpages
| 1 3489s
| 8 64.3s 1539s 3237s
| 16 64.5s 1540s 3241s
| 32 64.5s 1542s 3244s
| 64 64.9s 1545s 3246s
| 96 64.7s 1545s 3251s
|
| Thus the 16MB pages rarely produced page misses,
| while the 64kB pages used up 2.5x more time than
| the floating point operations that we wanted to
| have. I have at least some feeling that the 16MB pages
| rarely caused misses because with a 128 entry
| TLB (on the R12000 cpu) that gives about 1GB of
| addressible memory before paging is required at all,
| which I think is quite comparable to the size of
| the memory actually used.

to me it seems that this slowdown is due to some inefficiency in the
R12000's TLB-miss handling - possibly very (very!) long TLB-miss
latencies? On modern CPUs (x86/x64) the TLB-miss latency is rarely
visible. Would it be possible to run some benchmarks of hugetlbs vs. 4K
pages on x86/x64?

if my assumption is correct, then hugeTLBs are more of a workaround for
bad TLB-miss properties of the CPUs you are using, not something that
will inevitably happen in the future. Hence i think the 'factor 3x'
slowdown should not be realistic anymore - or are you still running
R12000 CPUs?

Ingo

2005-11-04 15:32:25

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Fri, 4 Nov 2005, Ingo Molnar wrote:
>
> just to make sure i didnt get it wrong, wouldnt we get most of the
> benefits Andy is seeking by having a: boot-time option which sets aside
> a "hugetlb zone", with an additional sysctl to grow (or shrink) the pool
> - with the growing happening on a best-effort basis, without guarantees?

Boot-time option to set the hugetlb zone, yes.

Grow-or-shrink, probably not. Not in practice after bootup on any machine
that is less than idle.

The zones have to be pretty big to make any sense. You don't just grow
them or shrink them - they'd be on the order of tens of megabytes to
gigabytes. In other words, sized big enough that you will _not_ be able to
create them on demand, except perhaps right after boot.

Growing these things later simply isn't reasonable. I can pretty much
guarantee that any kernel I maintain will never have dynamic kernel
pointers: when some memory has been allocated with kmalloc() (or
equivalent routines - pretty much _any_ kernel allocation), it stays put.
Which means that if there is a _single_ kernel alloc in such a zone, it
won't ever be then usable for hugetlb stuff.

And I don't want excessive complexity. We can have things like "turn off
kernel allocations from this zone", and then wait a day or two, and hope
that there aren't long-term allocs. It might even work occasionally. But
the fact is, a number of kernel allocations _are_ long-term (superblocks,
root dentries, "struct thread_struct" for long-running user daemons), and
it's simply not going to work well in practice unless you have set aside
the "no kernel alloc" zone pretty early on.

Linus

2005-11-04 15:39:51

by andy

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Ingo wrote:
>ok, this posting of you seems to be it:

> <elided>

>to me it seems that this slowdown is due to some inefficiency in the
>R12000's TLB-miss handling - possibly very (very!) long TLB-miss
>latencies? On modern CPUs (x86/x64) the TLB-miss latency is rarely
>visible. Would it be possible to run some benchmarks of hugetlbs vs. 4K
>pages on x86/x64?
>
>if my assumption is correct, then hugeTLBs are more of a workaround for
>bad TLB-miss properties of the CPUs you are using, not something that
>will inevitably happen in the future. Hence i think the 'factor 3x'
>slowdown should not be realistic anymore - or are you still running
>R12000 CPUs?

> Ingo

AFAIK, mips chips have a software TLB refill that takes 1000
cycles more or less. I could be wrong. There are sgi folk on this
thread, perhaps they can correct me. What is important is
that I have done similar tests on other arch's and found very
similar results. Specifically with IBM machines running both
AIX and Linux. I've never had the opportunity to try variable
page size stuff on amd or intel chips, either itanic or x86
variants.

The effect is not a consequence of any excessively long tlb
handling times for one single arch.

The effect is a property of the code. Which has one part that
is extremely branchy: traversing a tree, and another part that
isn't branchy but grabs stuff from all over everywhere.

The tree traversal works like this: Start from the root and stop at
each node, load a few numbers, multiply them together and compare to
another number, then open that node or go on to a sibling node. Net,
this is about 5-8 flops and a compare per node. The issue is that the
next time you want to look at a tree node, you are someplace else
in memory entirely. That means a TLB miss almost always.

The tree traversal leaves me with a list of a few thousand nodes
and atoms. I use these nodes and atoms to calculate gravity on some
particle or small group of particles. How? For each node, I grab about
10 numbers from a couple of arrays, do about 50 flops with those
numbers, and store back 4 more numbers. The store back doesn't hurt
anything becasuse it really only happens once at the end of the list.

In the naive case, grabbing 10 numbers out of arrays that are mutiple
GB in size means 10 TLB misses. The obvious solution is to stick
everything together that is needed together, and get that down to
one or two. I've done that. The results you quoted in your post
reflect that. In other words, the performance difference is the minimal
number of TLB misses that I can manage to get.

Now if you have a list of thousands of nodes to cycle through, each of
which lives on a different page (ordinarily true), you thrash TLB,
and you thrash L1, as I noted in my original post.

Believe me, I have worried about this sort of stuff intensely,
and recoded around it a lot. The performance number you saw were what
is left over.

It is true that other sorts of codes have much more regular memory
access patterns, and don't have nearly this kind of speedup. Perhaps
more typical would be the 25% number quoted by Martin Bligh.

Andy

2005-11-04 15:39:50

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

>> just to make sure i didnt get it wrong, wouldnt we get most of the
>> benefits Andy is seeking by having a: boot-time option which sets aside
>> a "hugetlb zone", with an additional sysctl to grow (or shrink) the pool
>> - with the growing happening on a best-effort basis, without guarantees?
>
> Boot-time option to set the hugetlb zone, yes.
>
> Grow-or-shrink, probably not. Not in practice after bootup on any machine
> that is less than idle.
>
> The zones have to be pretty big to make any sense. You don't just grow
> them or shrink them - they'd be on the order of tens of megabytes to
> gigabytes. In other words, sized big enough that you will _not_ be able to
> create them on demand, except perhaps right after boot.
>
> Growing these things later simply isn't reasonable. I can pretty much
> guarantee that any kernel I maintain will never have dynamic kernel
> pointers: when some memory has been allocated with kmalloc() (or
> equivalent routines - pretty much _any_ kernel allocation), it stays put.
> Which means that if there is a _single_ kernel alloc in such a zone, it
> won't ever be then usable for hugetlb stuff.
>
> And I don't want excessive complexity. We can have things like "turn off
> kernel allocations from this zone", and then wait a day or two, and hope
> that there aren't long-term allocs. It might even work occasionally. But
> the fact is, a number of kernel allocations _are_ long-term (superblocks,
> root dentries, "struct thread_struct" for long-running user daemons), and
> it's simply not going to work well in practice unless you have set aside
> the "no kernel alloc" zone pretty early on.

Exactly. But that's what all the anti-fragmentation stuff was about - trying
to pack unfreeable stuff together.

I don't think anyone is proposing dynamic kernel pointers inside Linux,
except in that we could possibly change the P-V mapping underneath from
the hypervisor, so that the phys address would change, but you wouldn't
see it. Trouble is, that's mostly done on a larger-than-page size
granularity, so we need SOME larger chunk to switch out (preferably at
least a large-paged size, so we can continue to use large TLB entries for
the kernel mapping).

However, the statically sized option is hugely problematic too.

M.

2005-11-04 15:53:42

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

* Linus Torvalds <[email protected]> wrote:

> Boot-time option to set the hugetlb zone, yes.
>
> Grow-or-shrink, probably not. Not in practice after bootup on any
> machine that is less than idle.
>
> The zones have to be pretty big to make any sense. You don't just grow
> them or shrink them - they'd be on the order of tens of megabytes to
> gigabytes. In other words, sized big enough that you will _not_ be
> able to create them on demand, except perhaps right after boot.

i think the current hugepages=<N> boot option could transparently be
morphed into a 'separate zone' approach, and /proc/sys/vm/nr_hugepages
would just refuse to change (or would go away altogether). Dynamically
growing zones seem like a lot of trouble, without much gain. [ OTOH
hugepages= parameter unit should be changed from the current 'number of
hugepages' to plain RAM metrics - megabytes/gigabytes. ]

that would solve two problems: any 'zone VM statistics skewing effect'
of the current hugetlbs (which is a preallocated list of really large
pages) would go away, and the hugetlb zone could potentially be utilized
for easily freeable objects.

this would already be alot more flexible that what we have: the hugetlb
area would not be 'lost' altogether, like now. Once we are at this stage
we can see how usable it is in practice. I strongly suspect it will
cover most of the HPC uses.

Ingo

2005-11-04 16:01:39

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Fri, 4 Nov 2005, Andy Nelson wrote:
>
> Big pages don't work now, and zones do not help because the
> load is too unpredictable. Sysadmins *always* turn them
> off, for very good reasons. They cripple the machine.

They do. Guess why? It's complicated.

SGI used to do things like that in Irix. They had the flakiest Unix kernel
out there. There's a reason people use Linux, and it's not all price. A
lot of it is development speed, and that in turn comes very much from not
making insane decisions that aren't maintainable in the long run.

Trust me. We can make things _better_, by having zones that you can't do
kernel allocations from. But you'll never get everything you want, without
turning the kernel into an unmaintainable mess.

> I think it was Martin Bligh who wrote that his customer gets
> 25% speedups with big pages. That is peanuts compared to my
> factor 3.4 (search comp.arch for John Mashey's and my name
> at the University of Edinburgh in Jan/Feb 2003 for a conversation
> that includes detailed data about this), but proves the point that
> it is far more than just me that wants big pages.

I didn't find your post on google, but I assume that a large portion on
your 3.4 factor was hardware.

The fact is, there are tons of architectures that suck at TLB handling.
They have small TLB's, and they fill slowly.

x86 is actually one of the best ones out there. It has a hw TLB fill, and
the page tables are cached, with real-life TLB fill times in the single
cycles (a P4 can almost be seen as effectively having 32kB pages because
it fills it's TLB entries to fast when they are next to each other in the
page tables). Even when you have lots of other cache pressure, the page
tables are at least in the L2 (or L3) caches, and you effectively have a
really huge TLB.

In contrast, a lot of other machines will use non-temporal loads to load
the TLB entries, forcing them to _always_ go to memory, and use software
fills, causing the whole machine to stall. To make matters worse, many of
them use hashed page tables, so that even if they could (or do) cache
them, the caching just doesn't work very well.

(I used to be a big proponent of software fill - it's very flexible. It's
also very slow. I've changed my mind after doing timing on x86)

Basically, any machine that gets more than twice the slowdown is _broken_.
If the memory access is cached, then so should be page table entry be
(page tables are _much_ smaller than the pages themselves), so even if you
take a TLB fault on every single access, you shouldn't see a 3.4 factor.

So without finding your post, my guess is that you were on a broken
machine. MIPS or alpha do really well when things generally fit in the
TLB, but break down completely when they don't due to their sw fill (alpha
could have fixed it, it had _archtiecturally_ sane page tables that it
could have walked in hw, but never got the chance. May it rest in peace).

If I remember correctly, ia64 used to suck horribly because Linux had to
use a mode where the hw page table walker didn't work well (maybe it was
just an itanium 1 bug), but should be better now. But x86 probably kicks
its butt.

The reason x86 does pretty well is that it's got one of the few sane page
table setups out there (oh, page table trees are old-fashioned and simple,
but they are dense and cache well), and the microarchitecture is largely
optimized for TLB faults. Not having ASI's and having to work with an OS
that invalidated the TLB about every couple of thousand memory accesses
does that to you - it puts the pressure to do things right.

So I suspect Martin's 25% is a lot more accurate on modern hardware (which
means x86, possibly Power. Nothing else much matters).

> If your and other kernel developer's (<<0.01% of the universe) kernel
> builds slow down by 5% and my and other people's simulations (perhaps
> 0.01% of the universe) speed up by a factor up to 3 or 4, who wins?

First off, you won't speed up by a factor of three or four. Not even
_close_.

Second, it's not about performance. It's about maintainability. It's about
having a system that we can use and understand 10 years down the line. And
the VM is a big part of that.

Linus

2005-11-04 16:05:25

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

* Andy Nelson <[email protected]> wrote:

> Ingo wrote:
> >ok, this posting of you seems to be it:
>
> > <elided>
>
> >to me it seems that this slowdown is due to some inefficiency in the
> >R12000's TLB-miss handling - possibly very (very!) long TLB-miss
> >latencies? On modern CPUs (x86/x64) the TLB-miss latency is rarely
> >visible. Would it be possible to run some benchmarks of hugetlbs vs. 4K
> >pages on x86/x64?
> >
> >if my assumption is correct, then hugeTLBs are more of a workaround for
> >bad TLB-miss properties of the CPUs you are using, not something that
> >will inevitably happen in the future. Hence i think the 'factor 3x'
> >slowdown should not be realistic anymore - or are you still running
> >R12000 CPUs?
>
> > Ingo
>
>
> AFAIK, mips chips have a software TLB refill that takes 1000 cycles
> more or less. I could be wrong. [...]

x86 in comparison has a typical cost of 7 cycles per TLB miss. And a
modern x64 chip has 1024 TLBs ... If that's not enough then i believe
you'll be limited by cachemiss costs and RAM latency/throughput anyway,
and the only thing the TLB misses have to do is to be somewhat better
than those bottlenecks. TLBs are really fast in the x86/x64 world. Then
there come other features like TLB prefetch, so if you are touching
pages in any predictable fashion you ought to see better latencies than
the worst-case.

> The effect is not a consequence of any excessively long tlb handling
> times for one single arch.
>
> The effect is a property of the code. Which has one part that is
> extremely branchy: traversing a tree, and another part that isn't
> branchy but grabs stuff from all over everywhere.

i dont think anyone argues against the fact that a larger 'TLB reach'
will most likely improve performance. The question is always 'by how
much', and that number very much depends on the cost of a single TLB
miss. (and on alot of other factors)

(note that it's also possible for large TLBs to cause a slowdown: there
are CPUs [e.g. P3] where there are fewer large TLBs than 4K TLBs, so
there are workloads where you lose due to fewer TLBs. It is also
possible for large TLBs to be zero speedup: if the working set is so
large that you will always get a TLB miss with a new node accessed.)

Ingo

2005-11-04 16:08:20

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Fri, 4 Nov 2005, Andy Nelson wrote:
>
> AFAIK, mips chips have a software TLB refill that takes 1000
> cycles more or less. I could be wrong.

You're not far off.

Time it on a real machine some day. On a modern x86, you will fill a TLB
entry in anything from 1-8 cycles if it's in L1, and add a couple of dozen
cycles for L2.

In fact, the L1 TLB miss can often be hidden by the OoO engine.

Now, do the math. Your "3-4 time slowdown" with several hundred cycle TLB
miss just GOES AWAY with real hardware. Yes, you'll still see slowdowns,
but they won't be nearly as noticeable. And having a simpler and more
efficient kernel will actually make _up_ for them in many cases. For
example, you can do all your calculations on idle workstations that don't
mysteriously just crash because somebody was also doing something else on
them.

Face it. MIPS sucks. It was clean, but it didn't perform very well. SGI
doesn't sell those things very actively these days, do they?

So don't blame Linux. Don't make sweeping statements based on hardware
situations that just aren't relevant any more.

If you ever see a machine again that has a huge TLB slowdown, let the
machine vendor know, and then SWITCH VENDORS. Linux will work on sane
machines too.

Linus

2005-11-04 16:13:33

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

> So I suspect Martin's 25% is a lot more accurate on modern hardware (which
> means x86, possibly Power. Nothing else much matters).

It was PPC64, if that helps.

>> If your and other kernel developer's (<<0.01% of the universe) kernel
>> builds slow down by 5% and my and other people's simulations (perhaps
>> 0.01% of the universe) speed up by a factor up to 3 or 4, who wins?
>
> First off, you won't speed up by a factor of three or four. Not even
> _close_.

Well, I think it depends on the workload a lot. However fast your TLB is,
if we move from "every cacheline read requires is a TLB miss" to "every
cacheline read is a TLB hit" that can be a huge performance knee however
fast your TLB is. Depends heavily on the locality of reference and size
of data set of the application, I suspect.

M.

2005-11-04 16:14:29

by andy

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Linus:

>> If your and other kernel developer's (<<0.01% of the universe) kernel
>> builds slow down by 5% and my and other people's simulations (perhaps
>> 0.01% of the universe) speed up by a factor up to 3 or 4, who wins?
>
>First off, you won't speed up by a factor of three or four. Not even
>_close_.

My measurements of factors of 3-4 on more than one hw arch don't
mean anything then? BTW: Ingo Molnar has a response that did find
my comp.arch posts. As I indicated to him, I've done a lot of code
tuning to get better performance even in the presence of tlb issues.
This factor is what is left. Starting from an untuned code, the factor
can be up to an order of magnitude larger. As in 30-60. Yes, I've
measured that too, though these detailed measurments were only on
mips/origins.

It is true that I have never had the opportunity to test these
issues on x86 and its relatives. Perhaps it would be better there.
The relative insensitivity of the results I have already to hw
arch, indicate otherwise though.

Re maintainability: Fine. I like maintainable code too. Coding
standards are great. Language standards are even better.

These are motherhood statements. Your simple rejections
("NO, HELL NO!!") even of any attempts to make these sorts
of improvements seems to make that issue pretty moot anyway.

Andy

2005-11-04 16:40:45

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

* Linus Torvalds <[email protected]> wrote:

> Time it on a real machine some day. On a modern x86, you will fill a
> TLB entry in anything from 1-8 cycles if it's in L1, and add a couple
> of dozen cycles for L2.

below is my (x86-only) testcode that accurately measures TLB miss costs
in cycles. (Has to be run as root, because it uses 'cli' as the
serializing instruction.)

here's the output from the default 128MB (32768 4K pages) random access
pattern workload, on a 2 GHz P4 (which has 64 dTLBs):

0 24 24 24 12 12 0 0 16 0 24 24 24 12 0 12 0 12

32768 randomly accessed pages, 13 cycles avg, 73.751831% TLB misses.

i.e. really cheap TLB misses even in this very bad and TLB-trashing
scenario: there are only 64 dTLBs and we have 32768 pages - so they are
outnumbered by a factor of 1:512! Still the CPU gets it right.

setting LINEAR to 1 gives an embarrasing:

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

32768 linearly accessed pages, 0 cycles avg, 0.259399% TLB misses.

showing that the pagetable got fully cached (probably in L1) and that
has _zero_ overhead. Truly remarkable.

lowering the size to 16 MB (still 1:64 TLB-to-working-set-size ratio!)
gives:

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

4096 randomly accessed pages, 0 cycles avg, 5.859375% TLB misses.

so near-zero TLB overhead.

increasing BYTES to half a gigabyte gives:

2 0 12 12 24 12 24 264 24 12 24 24 0 0 24 12 24 24 24 24 24 24 24 24 12
12 24 24 24 36 24 24 0 24 24 0 24 24 288 24 24 0 228 24 24 0 0

131072 randomly accessed pages, 75 cycles avg, 94.162750% TLB misses.

so an occasional ~220 cycles (~== 100 nsec - DRAM latency) cachemiss,
but still the average is 75 cycles, or 37 nsecs - which is still only
~37% of the DRAM latency.

(NOTE: the test eliminates most data cachemisses, by using zero-mapped
anonymous memory, so only a single data page exists. So the costs seen
here are mostly TLB misses.)

Ingo

---------------
/*
* TLB miss measurement on PII CPUs.
*
* Copyright (C) 1999, Ingo Molnar <[email protected]>
*/
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>
#include <sys/mman.h>

#define BYTES (128*1024*1024)
#define PAGES (BYTES/4096)

/* This define turns on the linear mode.. */
#define LINEAR 0

#if 1
# define BARRIER "cli"
#else
# define BARRIER "lock ; addl $0,0(%%esp)"
#endif

int do_test (char * addr)
{
unsigned long start, end;
/*
* 'cli' is used as a serializing instruction to
* isolate the benchmarked instruction from rdtsc.
*/
__asm__ (
"jmp 1f; 1: .align 128;\
"BARRIER"; \
rdtsc; \
movl %0, %1; \
"BARRIER"; \
movl (%%esi), %%eax; \
"BARRIER"; \
rdtsc; \
"BARRIER"; \
"

:"=a" (end), "=c" (start)
:"S" (addr)
:"dx","memory");
return end - start;
}

extern int iopl(int);

int main (void)
{
unsigned long overhead, sum;
int j, k, c, hit;
int matrix [PAGES];
int delta [PAGES];
char *buffer = mmap(NULL, BYTES, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

iopl(3);
/*
* first generate a random access pattern.
*/
for (j = 0; j < PAGES; j++) {
unsigned long val;
#if LINEAR
val = ((j*8) % PAGES) * 4096;
val = j*2048;
#else
val = (random() % PAGES) * 4096;
#endif
matrix[j] = val;
}

/*
* Calculate the overhead
*/
overhead = ~0UL;
for (j = 0; j < 100; j++) {
unsigned int diff = do_test(buffer);
if (diff < overhead)
overhead = diff;
}
printf("Overhead = %ld cycles\n", overhead);

/*
* 10 warmup loops, the last one is printed.
*/
for (k = 0; k < 10; k++) {
c = 0;
for (j = 0; j < PAGES; j++) {
char * addr;
addr = buffer + matrix[j];
delta[c++] = do_test(addr);
}
}
hit = 0;
sum = 0;
for (j = 0; j < PAGES; j++) {
unsigned long d = delta[j] - overhead;
printf("%ld ", d);
if (d <= 1)
hit++;
sum += d;
}
printf("\n");
printf("%d %s accessed pages, %d cycles avg, %f%% TLB misses.\n",
PAGES,
#if LINEAR
"linearly",
#else
"randomly",
#endif
sum/PAGES,
100.0*((double)PAGES-(double)hit)/(double)PAGES);

return 0;
}

2005-11-04 16:41:42

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Fri, 4 Nov 2005, Martin J. Bligh wrote:
>
> > So I suspect Martin's 25% is a lot more accurate on modern hardware (which
> > means x86, possibly Power. Nothing else much matters).
>
> It was PPC64, if that helps.

Ok. I bet x86 is even better, but Power (and possibly itanium) is the only
other architecture that comes close.

I don't like the horrible POWER hash-tables, but for static workloads they
should perform almost as well as a sane page table (I say "almost",
because I bet that the high-performance x86 vendors have spent a lot more
time on tlb latency than even IBM has). My dislike for them comes from the
fact that they are really only optimized for static behaviour.

(And HPC is almost always static wrt TLB stuff - big, long-running
processes).

> Well, I think it depends on the workload a lot. However fast your TLB is,
> if we move from "every cacheline read requires is a TLB miss" to "every
> cacheline read is a TLB hit" that can be a huge performance knee however
> fast your TLB is. Depends heavily on the locality of reference and size
> of data set of the application, I suspect.

I'm sure there are really pathological examples, but the thing is, they
won't be on reasonable code.

Some modern CPU's have TLB's that can span the whole cache. In other
words, if your data is in _any_ level of caches, the TLB will be big
enough to find it.

Yes, that's not universally true, and when it's true, the TLB is two-level
and you can have loads where it will usually miss in the first level, but
we're now talking about loads where the _data_ will then always miss in
the first level cache too. So the TLB miss cost will always be _lower_
than the data miss cost.

Right now, you should buy Opteron if you want that kind of large TLB. I
_think_ Intel still has "small" TLB's (the cpuid information only goes up
to 128 entries, I think), but at least Intel has a really good fill. And I
would bet (but have no first-hand information) that next generation
processors will only get bigger TLB's. These things don't tend to shrink.

(Itanium also has a two-level TLB, but it's absolutely pitiful in size).

NOTE! It is absolutely true that for a few years we had regular caches
growing much faster than TLB's. So there are unquestionably unbalanced
machines out there. But it seems that CPU designers started noticing, and
every indication is that TLB's are catching up.

In other words, adding lots of kernel complexity is the wrong thing in the
long run. This is not a long-term problem, and even in the short term you
can fix it by just selecting the right hardware.

In todays world, AMD leads with bug TLB's (1024-entry L2 TLB), but Intel
has slightly faster fill and the AMD TLB filtering is sadly turned off on
SMP right now, so you might not always get the full effect of the large
TLB (but in HPC you probably won't have task switching blowing your TLB
away very often).

PPC64 has the huge hashed page tables that work well enough for HPC.

Itanium has a pitifully small TLB, and an in-order CPU, so it will take a
noticeably bigger hit on TLB's than x86 will. But even Itanium will be a
_lot_ better than MIPS was.

Linus

2005-11-04 16:49:48

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Fri, 4 Nov 2005, Andy Nelson wrote:
>
> My measurements of factors of 3-4 on more than one hw arch don't
> mean anything then?

When I _know_ that modern hardware does what you tested at least two
orders of magnitude better than the hardware you tested?

Think about it.

Linus

2005-11-04 17:10:27

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

>> Well, I think it depends on the workload a lot. However fast your TLB is,
>> if we move from "every cacheline read requires is a TLB miss" to "every
>> cacheline read is a TLB hit" that can be a huge performance knee however
>> fast your TLB is. Depends heavily on the locality of reference and size
>> of data set of the application, I suspect.
>
> I'm sure there are really pathological examples, but the thing is, they
> won't be on reasonable code.
>
> Some modern CPU's have TLB's that can span the whole cache. In other
> words, if your data is in _any_ level of caches, the TLB will be big
> enough to find it.
>
> Yes, that's not universally true, and when it's true, the TLB is two-level
> and you can have loads where it will usually miss in the first level, but
> we're now talking about loads where the _data_ will then always miss in
> the first level cache too. So the TLB miss cost will always be _lower_
> than the data miss cost.
>
> Right now, you should buy Opteron if you want that kind of large TLB. I
> _think_ Intel still has "small" TLB's (the cpuid information only goes up
> to 128 entries, I think), but at least Intel has a really good fill. And I
> would bet (but have no first-hand information) that next generation
> processors will only get bigger TLB's. These things don't tend to shrink.

Well. Last time I looked they had something in the order of 512 entries
per MB of cache or so (ie 2MB of coverage per MB of cache). So it'll only
cover it if you're using 2K of the data in each page (50%), but not if
you're touching cachelines distributed widely over pages. with large
pages, you cover 1000 times that much. Some apps may not be able to
acheive a 50% locality of reference, just by their nature ... not sure
that's bad programming for the big number crunching cases, or DB workloads
with random access patterns to large data sets.

Of course, this doesn't just apply to HPC/database either. dcache walks
on large fileserver, etc.

Even if we're talking data cache / icache misses, it gets even worse,
doesn't it? Several cacheline misses for pagetable walks per data cacheline
miss. Lots of the compute intensive stuff doesn't even come close to
fitting in data cache by orders of magnitude.

M.

2005-11-04 17:23:16

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Andy,
let's just take Ingo's numbers, measured on modern hardware.

On Fri, 4 Nov 2005, Ingo Molnar wrote:
>
> 32768 randomly accessed pages, 13 cycles avg, 73.751831% TLB misses.
> 32768 linearly accessed pages, 0 cycles avg, 0.259399% TLB misses.
> 131072 randomly accessed pages, 75 cycles avg, 94.162750% TLB misses.

NOTE! It's hard to decide what OoO does - Ingo's load doesn't allow for a
whole lot of overlapping stuff, so Ingo's numbers are fairly close to
worst case, but on the other hand, that serialization can probably be
honestly said to hide a couple of cycles, so let's say that _real_ worst
case is five more cycles than the ones quoted. It doesn't change the math,
and quite frankly, that way we're really anal about it.

In real life, under real load (especially with Fp operations going on at
the same time), OoO might make the cost a few cycles _less_, not more, but
hey, lt's not count that.

So in the absolute worst case, with 95% TLB miss ratio, the TLB cost was
an average 75 cycles. Let's be _really_ nice to MIPS, and say that this is
only five times faster than the MIPS case you tested (in reality, it's
probably over ten).

That's the WORST CASE. Realize that MIPS doesn't get better: it will
_always_ have a latency of several hundred cycles when the TLB misses. It
has absolutely zero OoO activity to hide a TLB miss (a software miss
totally serializes the pipeline), and it has zero "code caching", so even
with a perfect I$ (which it certainly didn't have), the cost of actually
running the TLB miss handler doesn't go down.

In contrast, the x86 hw miss gets better when there is some more locality
and the page tables are cached. Much better. Ingo's worst-case example is
not realistic (no locality at all in half a gigabyte or totally random
examples), yet even for that worst case, modern CPU's beat the MIPS by
that big factor.

So let's say that the 75% miss ratio was more likely (that's still a high
TLB miss ratio). So in the _likely_ case, a P4 did the miss in an average
of 13 cycles. The MIPS miss cost won't have come down at all - in fact, it
possibly went _up_, since the miss handler now might be getting more I$
misses since it's not called all the time (I don't know if the MIPS miss
handler used non-caching loads or not - the positive D$ effects on the
page tables from slightly denser TLB behaviour might help some to offset
this factor).

That's a likely factor of fifty speedup. But let's be pessimistic again,
and say that the P4 number beat the MIPS TLB miss by "only" a factor of
twenty. That means that your worst case totally untuned argument (30 times
slowdown from TLB misses) on a P4 is only a 120% slowdown. Not a factor of
three.

But clearly you could tune your code too, and did. To the point that you
had a factor of 3.4 on MIPS. Now, let's say that the tuning didn't work as
well on P4 (remember, we're still being pessimistic), and you'd only get
half of that.

End result? If the slowdown was entirely due to TLB miss costs, your
likely slowdown is in the 20-40% range. Pessimistically.

Now, switching to x86 may have _other_ issues. Maybe other things might
get slower. [ Mmwwhahahahhahaaa. I crack myself up. x86 slower than MIPS?
I'm such a joker. ]

Anyway. The point stands. This is something where hardware really rules,
and software can't do a lot of sane stuff. 20-40% may sound like a big
number, and it is, but this is all stuff where Moore's Law says that
we shouldn't spend software effort.

We'll likely be better off with a smaller, simpler kernel in the future. I
hope. And the numbers above back me up. Software complexity for something
like this just kills.

Linus

2005-11-04 17:44:21

by andy

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Linus,

Please stop focussing on mips as the bad boy. Mips is dead. It
has been for years and everyone knows it unless they are embedded.
I wrote several times that I had tested other arches and every
time you deleted those comments. Not to mention that in the few
anecdotal (read no records were kept) tests I've done on with intel
vs mips on more than one code, mips doesn't come out nearly as bad
as you seem to believe. Maybe that is tlb related maybe it is other
issue related. The fact remains.

Later on after your posts I also posted numbers for power 5. Haven't
seen a response to that yet. Maybe you're digesting.

> let's just take Ingo's numbers, measured on modern hardware.

Ingo's numbers calculate 95% tlb misses. I will likely have 100% tlb
misses over most of this code. Read my discussion of what it does
and you'll see why. Capsule form: Every tree node results in several
thousand nodes that are acceptable. You need to examine several times
that to get the acceptable ones. Several thousand memory reads from
several thousand different pages means 100% TLB misses. This is by no
means a pathological case. Other codes will have such effects too, as
I noted in my first very long rant.

I may have misread it, but that last bit of difference between 95%
and 100% tlb misses will be a pretty big factor in speed differences.
So your 20-40% goes right back up.

Ok, so there is some minimal in my case fp overlap, but a factor 2
speed difference certainly still exists in the power5 arch numbers I
quoted.

I have a special case version of this code that does cache blocking
on the gravity calculation. As a special case version, it is not
effective for the general case. There are 0 TLB misses and 0 L1 misses
for this part of the code. The tree traversal cannot be similarly
cache blocked and keeps all the tlb and cache misses it always had.

For that version, I can get down to 20% speed up, because overall the
traversal only takes 20% or so of the total time. That is the absolute
best I can do, and I've been tuning this code alone for close to a
decade.

Andy

2005-11-06 07:35:31

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Ingo wrote:
> i think the current hugepages=<N> boot option could transparently be
> morphed into a 'separate zone' approach, and ...
>
> this would already be alot more flexible that what we have: the hugetlb
> area would not be 'lost' altogether, like now. Once we are at this stage
> we can see how usable it is in practice. I strongly suspect it will
> cover most of the HPC uses.

It seems to me this is making it harder than it should be. You're
trying to create a zone that is 100% cleanable, whereas the HPC folks
only desire 99.8% cleanable.

Unlike the hot(un)plug folks, the HPC folks don't mind a few pages of
Linus's unmoveable kmalloc memory in their way. They rather expect
that some modest percentage of each node will have some 'kernel stuff'
on it that refuses to move. They just want to be able to free up
most of the pages on a node, once one job is done there, before the
next job begins.

They are also quite willing (based on my experience with bootcpusets)
to designate a few nodes for the 'general purpose Unix load', and
reserve the remaining nodes just to run their special jobs.

On the other hand, as Eric Dumazet mentions on another subthread of
this topic, requiring that their apps use the hugetlbfs interface
to place the bulk of their memory would be a serious obstacle.
Their apps are already fairly tightly wound around a rich variety
of compiler, tool, library and runtime memory placement mechanisms,
and they would be hardpressed to make systematic changes in that.

I suspect that the answers lie in some further improvements in memory
placement on various nodes. Perhaps this means a cpuset option to
put the easily reclaimed (what Mel Gorman's patch would mark with
__GFP_EASYRCLM) kernel pages and the user pages on the the nodes of
the current cpuset, but to prefer placing the less easily reclaimed
pages on the bootcpuset nodes. Then, when a job on such a dedicated
set of nodes completed, most of the memory would be easily reclaimable,
in preparation for the next job.

The bootcpuset stuff is entirely invisible to kernel hackers, because
I am doing it entirely in user space, with a pre-init program that
configures the bootcpuset, moves the unpinned kernel threads into
the bootcpuset, and fires up the real init in that bootcpuset.

With one more twist to the cpuset API, providing a way to state
per-cpuset a separate set of nodes (on what the HPC folks would call
their bootcpuset) as the preferred place to allocate not-EASYRCLM
kernel memory, we might be very close to meeting these HPC needs,
with no changes to or reliance on hugetlbs, with no changes to the
kernel boottime code, and with no changes to the memory management
mechanisms used within these HPC apps.

I am imagining yet another per-cpuset field, which I call 'kmems'. It
would be a nodemask, as is the current 'mems' field. I'd pick up the
__GFP_EASYRCLM flag of Mel Gorman's patch (no comment on suitability of
the rest of his patch), and prefer to place __GFP_EASYRCLM pages on the
'mems' nodes, but other pages evenly spread across the 'kmems' nodes.
For compatibility with the current cpuset API, an unset 'kmems'
would tell the kernel to use the 'mems' setting as a fallback.

The hardest part might be providing a mechanism, that would be invoked
by the batch scheduler between jobs, to flush the easily reclaimed
memory off a node (free it or write it to disk). Again, unlike the
hot(un)plug folks, a 98% solution is plenty good enough.

This will have to be coded and some HPC type loads tried on it, before
we know if it flies.

There is an obvious, unanswered question here. Would moving some of
the kernels pages (the not easily reclaimed pages) off the current
(faulting) node into some possibly far off node be an acceptable
price to pay, to increase the percentage of the dedicated job nodes
that can be freed up between jobs? Since these HPC jobs tend to be
far more sensitive to their own internal data placement than they
are to the kernels internal data placement, I am hopeful that this
tradeoff is a good one, for HPC apps.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-11-06 08:45:25

by Kyle Moffett

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Nov 4, 2005, at 10:31:48, Linus Torvalds wrote:
> I can pretty much guarantee that any kernel I maintain will never
> have dynamic kernel pointers: when some memory has been allocated
> with kmalloc() (or equivalent routines - pretty much _any_ kernel
> allocation), it stays put.

Hmm, this brings up something that I haven't seen discussed on this
list (maybe a long time ago, but perhaps it should be brought up
again?). What are the pros/cons to having a non-physically-linear
kernel virtual memory space? Would it be theoretically possible to
allow some kind of dynamic kernel page swapping, such that the _same_
kernel-virtual pointer goes to a different physical memory page?
That would definitely satisfy the memory hotplug people, but I don't
know what the tradeoffs would be for normal boxen.

It seems like the trick would be to make sure that page accesses
_during_ the swap are correctly handled. If the page-swapper
included code in the kernel fault handler to notice that a page was
in the process of being swapped out/in by another CPU, it could just
wait for swap-in to finish and then resume from the new page. This
would get messy with DMA and non-cpu memory accessors and such, which
are what I assume the reasons for not implementing this in the past
have been.

From what I can see, the really dumb-obvious-slow method would be to
call the first and last parts of software-suspend. As memory hotplug
is a relatively rare event, this would probably work well enough
given the requirements:
1) Run software suspend pre-memory-dump code
2) Move pages off the to-be-removed node, remapping the kernel
space to the new locations.
3) Mark the node so that new pages don't end up on it
4) Run software suspend post-memory-reload code

<random-guessing>
Perhaps the non-contiguous memory support would be of some help here?
</random-guessing>

Cheers,
Kyle Moffett

--
Simple things should be simple and complex things should be possible
-- Alan Kay

2005-11-06 15:56:16

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Sat, 5 Nov 2005, Paul Jackson wrote:
>
> It seems to me this is making it harder than it should be. You're
> trying to create a zone that is 100% cleanable, whereas the HPC folks
> only desire 99.8% cleanable.

Well, 99.8% is pretty borderline.

> Unlike the hot(un)plug folks, the HPC folks don't mind a few pages of
> Linus's unmoveable kmalloc memory in their way. They rather expect
> that some modest percentage of each node will have some 'kernel stuff'
> on it that refuses to move.

The thing is, if 99.8% of memory is cleanable, the 0.2% is still enough to
make pretty much _every_ hugepage in the system pinned down.

Besides, right now, it's not 99.8% anyway. Not even close. It's more like
60%, and then horribly horribly ugly hacks that try to do something about
the remaining 40% and usually fail (the hacks might get it closer to 99%,
but they are fragile, expensive, and ugly as hell).

It used to be that HIGHMEM pages were always cleanable on x86, but even
that isn't true any more, since now at least pipe buffers can be there
too.

I agree that HPC people are usually a bit less up-tight about things than
database people tend to be, and many of them won't care at all, but if you
want hugetlb, you'll need big areas.

Side note: the exact size of hugetlb is obviously architecture-specific,
and the size matters a lot. On x86, for example, hugetlb pages are either
2MB or 4MB in size (and apparently 2GB may be coming). I assume that's
where you got the 99.8% from (4kB out of 2M).

Other platforms have more flexibility, but sometimes want bigger areas
still.

Linus

2005-11-06 16:12:39

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Sun, 6 Nov 2005, Kyle Moffett wrote:
>
> Hmm, this brings up something that I haven't seen discussed on this list
> (maybe a long time ago, but perhaps it should be brought up again?). What are
> the pros/cons to having a non-physically-linear kernel virtual memory space?

Well, we _do_ actually have that, and we use it quite a bit. Both
vmalloc() and HIGHMEM work that way.

The biggest problem with vmalloc() is that the virtual space is often as
constrained as the physical one (ie on old x86-32, the virtual address
space is the bigger problem - you may have 36 bits of physical memory, but
the kernel has only 30 bits of virtual). But it's quite commonly used for
stuff that wants big linear areas.

The HIGHMEM approach works fine, but the overhead of essentially doing a
software TLB is quite high, and if we never ever have to do it again on
any architecture, I suspect everybody will be pretty happy.

> Would it be theoretically possible to allow some kind of dynamic kernel page
> swapping, such that the _same_ kernel-virtual pointer goes to a different
> physical memory page? That would definitely satisfy the memory hotplug
> people, but I don't know what the tradeoffs would be for normal boxen.

Any virtualization will try to do that, but they _all_ prefer huge pages
if they care at all about performance.

If you thought the database people wanted big pages, the kernel is worse.
Unlike databases or HPC, the kernel actually wants to use the physical
page address quite often, notably for IO (but also for just mapping them
into some other virtual address - the users).

And no standard hardware allows you to do that in hw, so we'd end up doing
a software page table walk for it (or, more likely, we'd have to make
"struct page" bigger).

You could do it today, although at a pretty high cost. And you'd have to
forget about supporting any hardware that really wants contiguous memory
for DMA (sound cards etc). It just isn't worth it.

Real memory hotplug needs hardware support anyway (if only buffering the
memory at least electrically). At which point you're much better off
supporting some remapping in the buffering too, I'm convinced. There's no
_need_ to do these things in software.

Linus

2005-11-06 17:01:56

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Sun, 6 Nov 2005, Linus Torvalds wrote:
>
> And no standard hardware allows you to do that in hw, so we'd end up doing
> a software page table walk for it (or, more likely, we'd have to make
> "struct page" bigger).
>
> You could do it today, although at a pretty high cost. And you'd have to
> forget about supporting any hardware that really wants contiguous memory
> for DMA (sound cards etc). It just isn't worth it.

Btw, in case it wasn't clear: the cost of these kinds of things in the
kernel is usually not so much the actual "lookup" (whether with hw assist
or with another field in the "struct page").

The biggest cost of almost everything in the kernel these days is the
extra code-footprint of yet another abstraction, and the locking cost.

For example, the real cost of the highmem mapping seems to be almost _all_
in the locking. It also makes some code-paths more complex, so it's yet
another I$ fill for the kernel.

So a remappable kernel tends to be different from a remappable user
application. A user application _only_ ever sees the actual cost of the
TLB walk (which hardware can do quite efficiently and is very amenable
indeed to a lot of optimization like OoO and speculative prefetching), but
on the kernel level, the remapping itself is the cheapest part.

(Yes, user apps can see some of the costs indirectly: they can see the
synchronization costs if they do lots of mmap/munmap's, especially if they
are threaded. But they really have to work at it to see it, and I doubt
the TLB synchronization issues tend to be even on the radar for any user
space performance analysis).

You could probably do a remappable kernel (modulo the problems with
specific devices that want bigger physically contiguous areas than one
page) reasonably cheaply on UP. It gets more complex on SMP and with full
device access.

In fact, I suspect you can ask any Xen developer what their performance
problems and worries are. I suspect they much prefer UP clients over SMP
ones, and _much_ prefer paravirtualization over running unmodified
kernels.

So remappable kernels are certainly doable, they just have more
fundamental problems than remappable user space _ever_ has. Both from a
performance and from a complexity angle.

Linus

2005-11-06 18:19:42

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Linus wrote:
> The thing is, if 99.8% of memory is cleanable, the 0.2% is still enough to
> make pretty much _every_ hugepage in the system pinned down.

Agreed.

I realized after writing this that I wasn't clear on something.

I wasn't focused the subject of this thread, adding hugetlb pages after
the system has been up a while.

I was focusing on a related subject - freeing up most of the ordinary
size pages on the dedicated application nodes between jobs on a large
system using
* a bootcpuset (for the classic Unix load) and
* dedicated nodes (for the HPC apps).

I am looking to provide the combination of:
1) specifying some hugetlb pages at system boot, plus
2) the ability to clean off most of the ordinary sized pages
from the application nodes between jobs.

Perhaps Andy or some of my HPC customers wish I was also looking
to provide:
3) the ability to add lots of hugetlb pages on the application
nodes after the system has run a while.
But if they are, then they have some more educatin' to do on me.

For now, I am sympathetic to your concerns with code and locking
complexity. Freeing up great globs of hugetlb sized contiguous chunks
of memory after a system has run a while would be hard.

We have to be careful which hard problems we decide to take on.

We can't take on too many, and we have to pick ones that will provide
a major long term advantage to Linux, over the forseeable changes in
system hardware and architecture.

Even if most of the processors that Andy has tested against would
benefit from dynamically added hugetlb pages, if we can anticipate
that this will not be a substained opportunity for Linux (and looking
at current x86 chips doesn't require much anticipating) then that
might not be the place to invest our precious core complexity dollars.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-11-07 01:29:05

by John Stoffel

[permalink] [raw]

Subject: Best CPU chipset for Linux? (was: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19)

>>>>> "Linus" == Linus Torvalds <[email protected]> writes:

Linus> On Sun, 6 Nov 2005, Linus Torvalds wrote:
>>
>> And no standard hardware allows you to do that in hw, so we'd end up doing
>> a software page table walk for it (or, more likely, we'd have to make
>> "struct page" bigger).
>>
>> You could do it today, although at a pretty high cost. And you'd have to
>> forget about supporting any hardware that really wants contiguous memory
>> for DMA (sound cards etc). It just isn't worth it.

Linus> Btw, in case it wasn't clear: the cost of these kinds of things
Linus> in the kernel is usually not so much the actual "lookup"
Linus> (whether with hw assist or with another field in the "struct
Linus> page").

Linus> The biggest cost of almost everything in the kernel these days
Linus> is the extra code-footprint of yet another abstraction, and the
Linus> locking cost.

Linus> For example, the real cost of the highmem mapping seems to be
Linus> almost _all_ in the locking. It also makes some code-paths more
Linus> complex, so it's yet another I$ fill for the kernel.

This to me raises the interesting question of what are the most wanted
new features of CPUs and their chipsets by the Linux developers? I
know there are different problem spaces, such as embedded where
power/cost is king, to user desktops to big big clusters.

Has any vendor come close to the ideal CPU architecture for an OS? I
would assume that you'd want:

1. large address space, 64 bits
2. large IO space, 64 bits
3. high memory/io bandwidth
4. efficient locking primitives?
- keep some registers for locking only?
5. efficient memory bandwidth?
6. simple setup where you don't need so much legacy cruft?
7. clean CPU design? RISC? Is CISC king again?
8. Variable page sizes?
- how does this affect TLB?
- how do you change sizes in a program?
9. SMP or hyper-threading or multi-cores?
10. PCI (and it's flavors) addressing/DMA support?

With the growth in data versus instructions these days, does it make
sense to have memory split into D/I sections? Or is it better to just
have a completely flat memory model and let the OS do any splitting it
wants?

Heck, I don't know. I'm just interested in where
Linus/Alan/Andrew/et all think that the low level system design should
think about moving towards since it will make things simpler/faster at
the OS level. I'm completely ignoring the application level since
it's ideally not going to change much... really.

To me, it seems that some sort of efficient low level locking
primitives that work well in any of UP/SMP/NUMA environments would be
key. Just looking at all the fine grain locking people are adding to
the kernel to get around all the issues of the BKL over the years.

Of course making memory faster would be nice too...

I know, it's all out of left field, but it would be interesting to see
what people thought. I honestly wonder if Intel, AMD, PowerPC, Sun
really try to work from the top down when designing their chips, or
more from "this is where we are, how can we speed up what we've got?"
type of view?

Thanks,
John

2005-11-07 02:08:40

[permalink] [raw]

Subject: Re: Best CPU chipset for Linux? (was: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19)

On Sun, 6 Nov 2005, John Stoffel wrote:
>
> Has any vendor come close to the ideal CPU architecture for an OS? I
> would assume that you'd want:

Well, in the end, the #1 requirement ends up being "wide availability of
development boxes".

For example, I think Apple made a huge difference to the PowerPC platform,
and we'll see what happens when Apple boxes are x86. Can IBM continue to
make Power available enough to be relevant.

Note that raw numbers of CPU's don't much matter - ARM sells a lot more
than x86, but it's not to developers. Similarly, the game consoles may
sell a lot of Power, but the actual developers that are using it is a very
specialized bunch and much smaller in number.

> 1. large address space, 64 bits
> 2. large IO space, 64 bits
> 3. high memory/io bandwidth
> 4. efficient locking primitives?
> - keep some registers for locking only?
> 5. efficient memory bandwidth?
> 6. simple setup where you don't need so much legacy cruft?
> 7. clean CPU design? RISC? Is CISC king again?
> 8. Variable page sizes?
> - how does this affect TLB?
> - how do you change sizes in a program?
> 9. SMP or hyper-threading or multi-cores?
> 10. PCI (and it's flavors) addressing/DMA support?

It's personal, but I don't think the above are huge deal-breakers.

We do want a "big enough" virtual address space, that's pretty much
required. It doesn't necessarily have to be the full 64 bits, and it's
fine if the IO space is just a part of that.

As to ISA and registers - nobody much cares. The compiler takes care of
it, and I'd personally _much_ rather see a common ISA than a "clean" one.
The x86 architecture may be odd, but it works well.

So the ISA doesn't matter that much, but from a microarchitectural
standpoint:

- fast large first-level caches help a lot. And I'd rather take a bigger
L1 that has a two- or even three-cycle latency than a small one. That's
assuming the uarch is out-of-order, of course.

- good fast L2, and I'll take low-latency memory access over an L3 any
day.

- low-latency serialization (locking and memory barriers). In fact,
pretty much low-latency everything (branch mispredict latency etc).

- cheap and powerful.

but the fact is, we'll work with pretty much any crap we're given. If it's
bad, it won't make it in the marketplace.

Linus

2005-11-07 03:19:57

by John Stoffel

[permalink] [raw]

Subject: Re: Best CPU chipset for Linux? (was: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19)

Linus> On Sun, 6 Nov 2005, John Stoffel wrote:
>>
>> Has any vendor come close to the ideal CPU architecture for an OS? I
>> would assume that you'd want:

Linus> Well, in the end, the #1 requirement ends up being "wide
Linus> availability of development boxes".

Heh! Take my thoughts and turn them on my head. Bravo!

Linus> We do want a "big enough" virtual address space, that's pretty
Linus> much required. It doesn't necessarily have to be the full 64
Linus> bits, and it's fine if the IO space is just a part of that.

So 40 bits is fine for now, but 64 would be great just because it
solves the problem for a long long time?

Linus> As to ISA and registers - nobody much cares. The compiler takes
Linus> care of it, and I'd personally _much_ rather see a common ISA
Linus> than a "clean" one. The x86 architecture may be odd, but it
Linus> works well.

But aren't there areas where the ISA would expose useful parts of the
underlying microarchitecture that could be more efficiently used in
OSes?

Linus> - fast large first-level caches help a lot. And I'd rather
Linus> take a bigger L1 that has a two- or even three-cycle latency
Linus> than a small one. That's assuming the uarch is out-of-order,
Linus> of course.

Linus> - good fast L2, and I'll take low-latency memory access over
Linus> an L3 any day.

Linus> - low-latency serialization (locking and memory barriers). In
Linus> fact, pretty much low-latency everything (branch mispredict
Linus> latency etc).

Linus> - cheap and powerful.

Linus> but the fact is, we'll work with pretty much any crap we're
Linus> given. If it's bad, it won't make it in the marketplace.

The corollary of course is that if it's excellent but the marketplace
doesn't like it for some reason, we'll still let it go. I keep
wishing for the Alpha to come back sometimes... Oh well.

Thanks for your thoughts Linus.

John

2005-11-07 08:01:11

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

* Linus Torvalds <[email protected]> wrote:

> > You could do it today, although at a pretty high cost. And you'd have to
> > forget about supporting any hardware that really wants contiguous memory
> > for DMA (sound cards etc). It just isn't worth it.
>
> Btw, in case it wasn't clear: the cost of these kinds of things in the
> kernel is usually not so much the actual "lookup" (whether with hw
> assist or with another field in the "struct page").
[...]

> So remappable kernels are certainly doable, they just have more
> fundamental problems than remappable user space _ever_ has. Both from
> a performance and from a complexity angle.

furthermore, it doesnt bring us any closer to removable RAM. The problem
is still unsolvable (due to the 'how to do you find live pointers to fix
up' issue), even if the full kernel VM is 'mapped' at 4K granularity.

Ingo

2005-11-07 11:01:13

by Dave Hansen

[permalink] [raw]

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Mon, 2005-11-07 at 09:00 +0100, Ingo Molnar wrote:
> * Linus Torvalds <[email protected]> wrote:
> > So remappable kernels are certainly doable, they just have more
> > fundamental problems than remappable user space _ever_ has. Both from
> > a performance and from a complexity angle.
>
> furthermore, it doesnt bring us any closer to removable RAM. The problem
> is still unsolvable (due to the 'how to do you find live pointers to fix
> up' issue), even if the full kernel VM is 'mapped' at 4K granularity.

I'm not sure I understand. If you're remapping, why do you have to find
live and fix up live pointers? Are you talking about things that
require fixed _physical_ addresses?

-- Dave

2005-11-07 12:20:19