2005-11-04 17:04:11

by andy

[permalink] [raw]
Subject: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19







>On Fri, 4 Nov 2005, Andy Nelson wrote:
>>
>> My measurements of factors of 3-4 on more than one hw arch don't
>> mean anything then?
>
>When I _know_ that modern hardware does what you tested at least two
>orders of magnitude better than the hardware you tested?


Ok. In other posts you have skeptically accepted Power as a
`modern' architecture. I have just now dug out some numbers
of a slightly different problem running on a Power 5. Specifically,
an IBM p575, I think. These tests were done in June, while the others
were done more than 2.5 years ago. In other words, there may be
other small tuning optimizations that have gone in since then too.

The problem is a different configuration of particles, and about
2 times bigger (7 million) than the one in comp.arch (3 million, I think).
I would estimate that the data set in this test spans something like
2-2.5GB or so.

Here are the results:

cpus    4k pages    16m pages
  1     4888.74s    2399.36s
  2     2447.68s    1202.71s
  4     1225.98s     617.23s
  6      790.05s     418.46s
  8      592.26s     310.03s
 12      398.46s     210.62s
 16      296.19s     161.96s


These numbers were on a recent Linux. I don't know which one.

Now it looks like it is down to a factor of 2 or slightly more
(4888.74s/2399.36s ~ 2.04 on one cpu; 296.19s/161.96s ~ 1.83 on sixteen).
That is a totally different arch, one that I think you have accepted as
`modern', running the OS that you say doesn't need big page support.

Still a bit more than insignificant, I would say.


>Think about it.

Likewise.


Andy









2005-11-04 17:49:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19



On Fri, 4 Nov 2005, Andy Nelson wrote:
>
> Ok. In other posts you have skeptically accepted Power as a
> `modern' architecture.

Yes, sceptically.

I'd really like to hear what your numbers are on a modern x86. Any x86-64
is interesting, and I can't imagine that with a LANL address you can't
find any.

I do believe that Power is within one order of magnitude of a modern x86
when it comes to TLB fill performance. That's much better than many
others, but whether "almost as good" is within the error range, or whether
it's "only five times worse", I don't know.

The thing is, there's a reason x86 machines kick ass. They are cheap, and
they really _do_ outperform pretty much everything else out there.

Power 5 has a wonderful memory architecture, and those L3 caches kick ass.
They probably don't help you as much as they help databases, though, and
it's entirely possible that a small cheap Opteron with its integrated
memory controller will outperform them on your load if you really don't
have a lot of locality.

Linus

2005-11-04 17:52:13

by andy

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19


Finding an x86 or amd is not the problem. Finding one with
a sysadmin who is willing to let me experiment is. I'll ask
around, but it may be a while.

Andy

2005-11-04 17:56:29

by andy

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19




Correction:
>and you'll see why. Capsule form: Every tree node results in several

read

>and you'll see why. Capsule form: Every tree traversal results in several


Andy

2005-11-04 20:12:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19


* Andy Nelson <[email protected]> wrote:

> The problem is a different configuration of particles, and about
> 2 times bigger (7 million) than the one in comp.arch (3 million, I think).
> I would estimate that the data set in this test spans something like
> 2-2.5GB or so.
>
> Here are the results:
>
> cpus    4k pages    16m pages
>   1     4888.74s    2399.36s
>   2     2447.68s    1202.71s
>   4     1225.98s     617.23s
>   6      790.05s     418.46s
>   8      592.26s     310.03s
>  12      398.46s     210.62s
>  16      296.19s     161.96s

interesting, and thanks for the numbers. Even if hugetlbs were only
showing a 'mere' 5% improvement, a 5% _user-space improvement_ is still
a considerable improvement that we should try to achieve, if possible
cheaply.

the 'separate hugetlb zone' solution is cheap and simple, and i believe
it should cover your needs for mixed hugetlb and smallpage workloads.

it would work like this: unlike the current hugepages=<nr> boot
parameter, this zone would be useful for other (4K sized) allocations
too. If an app requests a hugepage then we have the chance to allocate
it from the hugetlb zone, in a guaranteed way [up to the point where the
whole zone consists of hugepages only].

the architectural appeal in this solution is that no additional
"fragmentation prevention" has to be done on this zone, because we only
allow content into it that is "easy" to flush - this means that there is
no complexity drag on the generic kernel VM.
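
to make the rule concrete, here is a toy model of it in plain C - an
illustration with made-up names, not kernel code:

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model of the proposed policy: the hugetlb zone takes hugepage
 * allocations always (up to its size), and 4K allocations only when
 * they are easily reclaimable, so the zone can always be flushed to
 * satisfy future hugepage requests.
 */
enum alloc_kind { ALLOC_KERNEL, ALLOC_RECLAIMABLE_4K, ALLOC_HUGEPAGE };

static bool hugetlb_zone_accepts(enum alloc_kind kind)
{
	switch (kind) {
	case ALLOC_HUGEPAGE:       return true;  /* guaranteed, up to zone size */
	case ALLOC_RECLAIMABLE_4K: return true;  /* page cache, user pages */
	case ALLOC_KERNEL:         return false; /* inodes, dentries: never */
	}
	return false;
}

int main(void)
{
	printf("kernel alloc allowed?   %d\n", hugetlb_zone_accepts(ALLOC_KERNEL));
	printf("hugepage alloc allowed? %d\n", hugetlb_zone_accepts(ALLOC_HUGEPAGE));
	return 0;
}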

can you think of any reason why the boot-time-configured hugetlb zone
would be inadequate for your needs?

Ingo

2005-11-04 21:04:41

by andy

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19


Hi,

>can you think of any reason why the boot-time-configured hugetlb zone
>would be inadequate for your needs?

I am not enough of a kernel level person or sysadmin to know for certain,
but I still have big worries about consecutive jobs that run on the
same resources, but want extremely different page behavior. If what
you are suggesting can cause all previous history on those resources
to be forgotten and then reset to whatever it is that I want when I
start my run, then yes. It would be fine for me. In some sense, this is
perhaps what I was asking for in my original message when I was talking
about using batch schedulers, cpusets and friends to encapsulate
regions of resources, that could be reset to nice states at user
specified intervals, like when the batch scheduler releases one job
and another job starts.


The issues that I can still think of that hpc people will need are
(some points here are clearly related to each other, but anyway):


1) how do zones play with numa? Does setting up resource management this
way mean that various kernel things that help me access my memory
(hellifino what I'm talking about here--things like tables and lists
of pages that I own and how to access them etc I suppose--whatever
it is that kernels don't get rid of when someone else's job ends and
before mine starts) actually get allocated in some other zone half
way across the machine? This is going to kill me on latency grounds.
Can it be set up so that this reserved special kernel zone is somewhere
close by? If it is bigger than the next guy to get my resources wants,
can it be deleted and reset once my job is finished, so his job can run?
This is what I would hope for and expect that something like
cpuset/memsets would help to do.

2) How do zones play with merging small pages into big pages, splitting
big pages into small, or deleting whatever page environment was there
in favor of a reset of those resources to some initial state? If
someone runs a small page job right after my big page job, will
they get big pages? If I run a big page job right after their small
page job, will I get small pages?

In each case, will it simply say 'no can do' and die? If this setup
just means that some jobs can't be run or can't be run after
something else, it will not fly.

3) How does any sort of fall back scheme work? If I can't have all of my
big pages, maybe I'll settle for some small ones and some big ones.
Can I have them? If I can't have them and die instead, zones like
this will not fly.

Points 2 and 3 mostly have to do with the question: does the system
performance degrade over time for different constituencies of users,
or can it stay up stably, serving everyone equally and well for a
long time?

4) How does any of this stuff play with interactive management? It is
not going to fly if sysadmins have to get involved on a
daily/regular basis, or even at much more than a cursory level of
turning something on once when the machine is purchased.

5) How does any of this stuff play with me having to rewrite my code to
use nonstandard language features? If I can't run using standard
fortran, standard C and maybe for some folks standard C++ or Java,
it won't fly.

6) What about text vs data pages? I'm talking here about executable
code vs whatever that code operates on. Do they get to have different
sized pages? Do they get allocated from sensible places on the
machine, as in reasonably separate from each other but not in some
far away zone over the rainbow?

7) If OS's/HW ever get decent support for lots and lots of page sizes
(like mips and sparc now) rather than a couple, will the
infrastructure be able to give me whichever size I ask for, or will
I only get to choose between a couple, even if perhaps settable at
boot time? Extensibility like this will be a requirement long term
of course.

8) What if I want 32 cpus and 64GB of memory on a machine, get it,
finish using it, and then the next jobs in line request say 8 cpus
and 16GB of memory, 4cpus and 16GB of memory, 20 cpus and 4GB
of memory? Will the zone system be able to handle such dynamically
changing things?


What I would need to see is that these sorts of issues can be handled
gracefully by the OS, perhaps with the help of some userland or
privileged userland hints that would come from things like the batch
scheduler or an env variable to set my preferred page size or other
things about memory policy.


Thanks,

Andy

PS to Linus: I have secured access to a dual cpu dual core amd box.
I have to talk to someone who is not here today to see about turning
on large pages. We'll see how that goes probably some time next week.
If it is possible, you'll see some benchmarks then.

2005-11-04 21:14:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19


* Andy Nelson <[email protected]> wrote:

> 5) How does any of this stuff play with me having to rewrite my code to
> use nonstandard language features? If I can't run using standard
> fortran, standard C and maybe for some folks standard C++ or Java,
> it won't fly.

it ought to be possible to get pretty much the same API as hugetlbfs via
the 'hugetlb zone' approach too. It doesnt really change the API and FS
side, it only impacts the allocator internally. So if you can utilize
hugetlbfs, you should be able to utilize a 'special zone' approach
pretty much the same way.
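
for reference, the user-space pattern that would be preserved looks
like this (a minimal sketch with little error handling; it assumes a
hugetlbfs mount at /mnt/huge and the 16MB huge pages from Andy's
Power5 numbers):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define LENGTH (16UL * 1024 * 1024)   /* one 16MB huge page */

int main(void)
{
	/* files on a hugetlbfs mount are backed by huge pages */
	int fd = open("/mnt/huge/scratch", O_CREAT | O_RDWR, 0600);
	if (fd < 0) { perror("open"); return 1; }

	/* length must be a multiple of the huge page size */
	void *p = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) { perror("mmap"); return 1; }

	((char *)p)[0] = 1;               /* touch the mapping */

	munmap(p, LENGTH);
	close(fd);
	unlink("/mnt/huge/scratch");
	return 0;
}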

Ingo

2005-11-04 21:23:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19



On Fri, 4 Nov 2005, Andy Nelson wrote:
>
> I am not enough of a kernel level person or sysadmin to know for certain,
> but I still have big worries about consecutive jobs that run on the
> same resources, but want extremely different page behavior. If what
> you are suggesting can cause all previous history on those resources
> to be forgotten and then reset to whatever it is that I want when I
> start my run, then yes.

That would largely be the behaviour.

When you use the hugetlb zone for big pages, nothing else would be there.

And when you don't use it, we'd be able to use those zones for at least
page cache and user private pages - both of which are fairly easy to evict
if required.

So the downside is that when the admin requests such a zone at boot-time,
that will mean that the kernel will never be able to use it for its
"normal" allocations. Not for inodes, not for directory name caching, not
for page tables and not for process and file descriptors. Only a very
certain class of allocations that we know how to evict easily could use
them.

Now, for many loads, that's fine. User virtual pages and page cache pages
are often a big part (in fact, often a huge majority) of the memory use.

Not always, though. Some loads really want lots of metadata caching, and
if you make too much of memory be in the largepage zones, performance
would suffer badly on such loads.

But the point is that this is easy(ish) to do, and would likely work
wonderfully well for almost all loads. It does put a small onus on the
maintainer of the machine to give a hint, but it's possible that normal
loads won't mind the limitation and that we could even have a few hugepage
zones by default (limit things to 25% of total memory or something). In
fact, we would almost have to do so initially just to get better test
coverage.

Now, if you want _most_ of memory to be available for hugepages, you
really will always require a special boot option, and a friendly machine
maintainer. Limiting things like inodes, process descriptors etc to a
smallish percentage of memory would not be acceptable in general.

Something like 25% "big page zones" probably is fine even in normal use,
and 50% might be an acceptable compromise even for machines that see a
mixture of pretty regular use and some specialized use. But a machine that
only cares about certain loads might boot up with 75% set aside in the
large-page zones, and that almost certainly would _not_ be a good setup
for random other usage.

IOW, we want a hint up-front about how important huge pages would be.
Because it's practically impossible to free pages later, because they
_will_ become fragmented with stuff that we definitely do not want to
teach the VM how to handle.

But the hint can be pretty friendly. Especially if it's an option to just
load a lot of memory into the boxes, and none of the loads are expected to
want to really be excessively close to memory limits (ie you could just
buy an extra 16GB to allow for "slop").

Linus

2005-11-04 21:31:24

by Gregory Maxwell

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On 11/4/05, Andy Nelson <[email protected]> wrote:
> I am not enough of a kernel level person or sysadmin to know for certain,
> but I still have big worries about consecutive jobs that run on the
> same resources, but want extremely different page behavior. I

That's the idea. The 'hugetlb zone' will only be usable for allocations
which are guaranteed reclaimable. Reclaimable includes userspace
usage (since at worst an in-use userspace page can be swapped out and
then paged back into another physical location).

For your sort of mixed use this should be a fine solution. However,
there are mixed use cases that this will not solve; for example,
if the system usage is split between HPC uses and kernel-allocation
heavy workloads (say forking 10 quintillion java processes) then the
hugetlb zone will need to be made small to keep the kernel-allocation
heavy workload happy.

2005-11-04 21:39:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19



On Fri, 4 Nov 2005, Linus Torvalds wrote:
>
> But the hint can be pretty friendly. Especially if it's an option to just
> load a lot of memory into the boxes, and none of the loads are expected to
> want to really be excessively close to memory limits (ie you could just
> buy an extra 16GB to allow for "slop").

One of the issues _will_ be how to allocate things on NUMA. Right now
"hugetlb" only allows us to say "this much memory for hugetlb", and it
probably needs to be per-zone.

Some uses might want to allocate all of the local memory on one node to
huge-page usage (and specialized programs would then also like to run
pinned to that node), others might want to spread it out. So the
maintainer would need to decide that.

The good news is that you can boot up with almost all zones being "big
page" zones, and you could turn them into "normal zones" dynamically. It's
only going the other way that is hard.

So from a maintenance standpoint if you manage lots of machines, you could
have them all uniformly boot up with lots of memory set aside for large
pages, and then use user-space tools to individually turn the zones into
regular allocation zones.

Linus

2005-11-04 21:51:28

by andy

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19


Hi folks,

It sounds like in principle I (`I'=generic HPC person) could be
happy with this sort of solution. The proof of the pudding is
in the eating however, and various perversions and misunderstandings
can still always crop up. Hopefully they can be solved or avoided
if they do show up though. Also, other folk might not be so satisfied.
I'll let them speak for themselves though.

One issue remaining is that I don't know how this hugetlbfs stuff
that was discussed actually works or should work, in terms of
the interface to my code. What would work for me is something to
the effect of

f90 -flag_that_turns_access_to_big_pages_on code.f

That then substitutes in allocation calls to this hugetlbfs zone
instead of `normal' allocation calls to generic memory, and perhaps
lets me fall back to normal memory up to whatever system limits may
exist if no big pages are available.

Or even something more simple like

setenv HEY_OS_I_WANT_BIG_PAGES_FOR_MY_JOB

or alternatively, a similar request in a batch script.
I don't know that any of these things really have much to do
with the OS directly however.


Thanks all, and have a good weekend.

Andy




2005-11-04 22:43:38

by Andi Kleen

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Friday 04 November 2005 22:31, Gregory Maxwell wrote:
> On 11/4/05, Andy Nelson <[email protected]> wrote:
> > I am not enough of a kernel level person or sysadmin to know for certain,
> > but I still have big worries about consecutive jobs that run on the
> > same resources, but want extremely different page behavior. I
>
> That's the idea. The 'hugetlb zone' will only be usable for allocations
> which are guaranteed reclaimable. Reclaimable includes userspace
> usage (since at worst an in-use userspace page can be swapped out and
> then paged back into another physical location).

I don't like it very much. You have two choices if a workload runs
out of the kernel allocatable pages. Either you spill into the reclaimable
zone or you fail the allocation. The first means that the huge pages
thing is unreliable, the second would mean that all the many problems
of limited lowmem would be back.

None of this is very attractive.

-Andi

2005-11-05 00:05:53

by Nick Piggin

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Andi Kleen wrote:
> On Friday 04 November 2005 22:31, Gregory Maxwell wrote:
>
>>
>>That's the idea. The 'hugetlb zone' will only be usable for allocations
>>which are guaranteed reclaimable. Reclaimable includes userspace
>>usage (since at worst an in-use userspace page can be swapped out and
>>then paged back into another physical location).
>
>
> I don't like it very much. You have two choices if a workload runs
> out of the kernel allocatable pages. Either you spill into the reclaimable
> zone or you fail the allocation. The first means that the huge pages
> thing is unreliable, the second would mean that all the many problems
> of limited lowmem would be back.
>

These are essentially the same problems that the frag patches face as
well.

> None of this is very attractive.
>

Though it is simple and I expect it should actually do a really good
job for the non-kernel-intensive HPC group, and the highly tuned
database group.

Nick

--
SUSE Labs, Novell Inc.


2005-11-05 01:38:28

by Rohit Seth

[permalink] [raw]
Subject: RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

From: Nick Piggin Friday, November 04, 2005 4:08 PM


>These are essentially the same problems that the frag patches face as
>well.

>> None of this is very attractive.
>>

>Though it is simple and I expect it should actually do a really good
>job for the non-kernel-intensive HPC group, and the highly tuned
>database group.

Not sure how applications can seamlessly use the proposed hugetlb zone
based on hugetlbfs. Depending on the programming language, it might
actually need changes in libs/tools etc.

As far as databases are concerned, I think they mostly already grab vast
chunks of memory to be used as hugepages (particularly for big mem
systems), which is a separate list of pages. And they are actually also
glad that the kernel never looks at them for any other purpose.

-rohit

2005-11-05 01:53:03

by Rohit Seth

[permalink] [raw]
Subject: RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

From: Linus Torvalds Sent: Friday, November 04, 2005 8:01 AM


>If I remember correctly, ia64 used to suck horribly because Linux had to
>use a mode where the hw page table walker didn't work well (maybe it was
>just an itanium 1 bug), but should be better now. But x86 probably kicks
>its butt.

I don't remember a difference of more than (roughly) 30 percentage
points even on first generation Itaniums (using hugetlb vs normal
pages). And a few more percentage points when the walker was disabled.
Over time the page table walker on IA-64 has gotten more aggressive.


...though I believe that 30% is a lot of performance.

-rohit

2005-11-05 02:49:49

by Rob Landley

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Friday 04 November 2005 15:22, Linus Torvalds wrote:
> Now, if you want _most_ of memory to be available for hugepages, you
> really will always require a special boot option, and a friendly machine
> maintainer. Limiting things like inodes, process descriptors etc to a
> smallish percentage of memory would not be acceptable in general.

But it might make it a lot easier for User Mode Linux to give unused memory
back to the host system via madvise(MADV_DONTNEED).

(Assuming there's some way to beat the page cache into submission and actually
free up space. If there was an option to tell the page cache to stay the
heck out of the hugepage zone, it would be just about perfect...)
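
For reference, a minimal sketch of that call on anonymous memory (after
the madvise the page reads back as zero and the backing memory has been
returned to the host):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1 << 20;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) { perror("mmap"); return 1; }

	p[0] = 1;                        /* fault a page in */
	madvise(p, len, MADV_DONTNEED);  /* discard it: host gets it back */
	printf("%d\n", p[0]);            /* prints 0: page was dropped */
	return 0;
}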

Rob

2005-11-06 01:31:53

by Zan Lynx

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Andi Kleen wrote:
> I don't like it very much. You have two choices if a workload runs
> out of the kernel allocatable pages. Either you spill into the reclaimable
> zone or you fail the allocation. The first means that the huge pages
> thing is unreliable, the second would mean that all the many problems
> of limited lowmem would be back.
>
> None of this is very attractive.
>
You could allow the 'hugetlb zone' to shrink, allowing more kernel
allocations. User pages at the boundary would be moved to make room.

This would at least keep the 'hugetlb zone' pure and not create holes in it.

2005-11-06 02:26:59

by Rob Landley

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Saturday 05 November 2005 19:30, Zan Lynx wrote:
> > None of this is very attractive.
>
> You could allow the 'hugetlb zone' to shrink, allowing more kernel
> allocations. User pages at the boundary would be moved to make room.

Please make that optional if you do. In my potential use case, an OOM kill
lets the administrator know they've got things configured wrong so they
can fix it and try again. Containing and viciously reaping things like
dentries is the behavior I want out of it.

Also, if you do shrink the hugetlb zone it might be possible to
opportunistically expand it back to its original size. There's no guarantee
that a given kernel allocation will ever go away, but if it _does_ go away
then the hugetlb zone should be able to expand to the next blocking
allocation or the maximum size, whichever comes first. (Given that my
understanding of the layout may not match reality at all; don't ask me how
the discontiguous memory stuff would work in here...)

Rob

2005-11-06 11:00:20

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

How would this hugetlb zone be placed - on which nodes in a NUMA
system?

My understanding is that you are thinking to specify it as a proportion
or amount of total memory, with no particular placement.

I'd rather see it as a subset of the nodes on a system being marked
for use, as much as practical, for easily reclaimed memory (page
cache and user).

My HPC customers normally try to isolate the 'classic Unix load' on
a few nodes that they call the bootcpuset, and keep the other nodes
as unused as practical, except when allocated for dedicated use by a
particular job. These other nodes need to run with a maximum amount of
easily reclaimed memory, while the bootcpuset nodes have no such need.

They don't just want easily reclaimable memory in order to get
hugetlb pages. They also want it so that the memory available for
use as ordinary sized pages by one job will not be unduly reduced by
the hard to reclaim pages left over from some previous job.

This would be easy to do with cpusets, adding a second per-cpuset
nodemask that specified where not easily reclaimed kernel allocations
should come from. The typical HPC user would set that second mask to
their bootcpuset. The few kmalloc calls in the kernel (page cache and
user space) deemed to be easily reclaimable would have a __GFP_EASYRCLM
flag added, and the cpuset hook in the __alloc_pages code path would
put requests -not- marked __GFP_EASYRCLM on this second set of nodes.
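
A rough sketch of that hook, using the names above (an illustration of
the policy only, with stand-in types, not a patch):

#include <stdio.h>

#define __GFP_EASYRCLM 0x80000u   /* hypothetical new gfp flag bit */

typedef unsigned long nodemask_t; /* toy one-word nodemask */

struct cpuset {
	nodemask_t mems_allowed;  /* normal placement for the job */
	nodemask_t kernel_mems;   /* second mask: e.g. the bootcpuset nodes */
};

/* Easily reclaimed pages (page cache, user) may go anywhere the cpuset
 * allows; everything else is confined to the nodes set aside for
 * hard-to-reclaim kernel allocations. */
static nodemask_t cpuset_alloc_nodes(const struct cpuset *cs,
				     unsigned int gfp_flags)
{
	if (gfp_flags & __GFP_EASYRCLM)
		return cs->mems_allowed;
	return cs->kernel_mems;
}

int main(void)
{
	struct cpuset cs = { .mems_allowed = 0xff, .kernel_mems = 0x03 };
	printf("easyrclm -> %#lx\n", cpuset_alloc_nodes(&cs, __GFP_EASYRCLM));
	printf("kernel   -> %#lx\n", cpuset_alloc_nodes(&cs, 0));
	return 0;
}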

No changes to hugetlbs or to the kernel code that runs at boot,
prior to starting init, would be required at all. The bootcpuset
stuff is set up by a pre-init program (specified using the kernel's
"init=..." boot option.) This makes all the configuration of this
entirely a user space problem.

Cpuset nodes, not zone sizes, are the proper way to manage this,
in my view.

If you ask what this means for small (1 or 2 node) systems, then
I would first ask you what we are trying to do on those systems.
I suspect that that would involve other classes of users, with
different needs, than what Andy or I can speak to.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-11-07 00:35:24

by andy

[permalink] [raw]
Subject: RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19


Hi folks,

>Not sure how applications seamlessly can use the proposed hugetlb zone
>based on hugetlbfs. Depending on the programming language, it might
>actually need changes in libs/tools etc.

This is my biggest worry as well. I can't recall the details
right now, but I have some memories of people telling me, for
example, that large pages on linux were simply not available to
fortran programs, period, due to lack of toolchain/lib stuff,
just as you note. What the reasons were/are I have no idea. I
do know that the Power 5 numbers I quoted a couple of days ago
required that the sysadmin apply some special patches to linux
and that I link to an extra library. I don't know what patches (they
came from ibm), but for xlf95 on Power5, the library I had to
link with was this one:

-T /usr/local/lib64/elf64ppc.lbss.x


No changes were required to my code, which is what I need,
but codes that did not link to this library would not run on
a kernel that had the patches installed, and code that did
link with this library would not run on a kernel that didn't
have those patches.

I don't know what library this is or what was in it, but I
can't imagine it would have been something very standard or
mainline, with that sort of drastic behavior. Maybe the ibm
folk can explain what this was about.


I will ask some folks here who should know about how large pages
can be used on intel/amd machines this coming week, when I attempt
to do page size speed testing for my code, as I promised before,
as I promised before, as I promised before.

Andy


2005-11-07 18:59:53

by Adam Litke

[permalink] [raw]
Subject: RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Sun, 2005-11-06 at 17:34 -0700, Andy Nelson wrote:
> Hi folks,
>
> >Not sure how applications seamlessly can use the proposed hugetlb zone
> >based on hugetlbfs. Depending on the programming language, it might
> >actually need changes in libs/tools etc.
>
> This is my biggest worry as well. I can't recall the details
> right now, but I have some memories of people telling me, for
> example, that large pages on linux were simply not available to
> fortran programs, period, due to lack of toolchain/lib stuff,
> just as you note. What the reasons were/are I have no idea. I
> do know that the Power 5 numbers I quoted a couple of days ago
> required that the sysadmin apply some special patches to linux
> and that I link to an extra library. I don't know what patches (they
> came from ibm), but for xlf95 on Power5, the library I had to
> link with was this one:
>
> -T /usr/local/lib64/elf64ppc.lbss.x
>
>
> No changes were required to my code, which is what I need,
> but codes that did not link to this library would not run on
> a kernel that had the patches installed, and code that did
> link with this library would not run on a kernel that didn't
> have those patches.
>
> I don't know what library this is or what was in it, but I
> can't imagine it would have been something very standard or
> mainline, with that sort of drastic behavior. Maybe the ibm
> folk can explain what this was about.

Wow. It's amazing how these things spread from my little corner of the
universe ;) What you speak of sounds dangerously close to what I've
been working on lately. Indeed it is not standard at all yet.

I am currently working on a new approach to what you tried. It
requires fewer changes to the kernel and implements the special large
page usage entirely in an LD_PRELOAD library. And on newer kernels,
programs linked with the .x ldscript you mention above can run using all
small pages if not enough large pages are available.

For the curious, here's how this all works:
1) Link the unmodified application source with a custom linker script which
does the following:
- Align elf segments to large page boundaries
- Assert a non-standard Elf program header flag (PF_LINUX_HTLB)
to signal something (see below) to use large pages.
2) Boot a kernel that supports copy-on-write for PRIVATE hugetlb pages
3) Use an LD_PRELOAD library which reloads the PF_LINUX_HTLB segments into
large pages and transfers control back to the application.
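
Step 3 looks roughly like this inside the preload library (a sketch
with all error handling omitted; huge_fd stands for an open file on a
hugetlbfs mount, and the MAP_PRIVATE mapping relies on the COW support
from step 2):

#include <string.h>
#include <sys/mman.h>

static void remap_segment(void *seg_start, size_t seg_len, int huge_fd)
{
	/* 1. copy the already-loaded segment contents out of the way
	 *    (the library's own code lives elsewhere, so this is safe) */
	void *tmp = mmap(NULL, seg_len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memcpy(tmp, seg_start, seg_len);

	/* 2. map the hugetlbfs file over the segment's address range */
	mmap(seg_start, seg_len, PROT_READ | PROT_WRITE | PROT_EXEC,
	     MAP_PRIVATE | MAP_FIXED, huge_fd, 0);

	/* 3. restore the contents into the new large-page mapping */
	memcpy(seg_start, tmp, seg_len);
	munmap(tmp, seg_len);
}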

> I will ask some folks here who should know about how large pages
> can be used on intel/amd machines this coming week, when I attempt
> to do page size speed testing for my code, as I promised before,
> as I promised before, as I promised before.

I have used this method on ppc64, x86, and x86_64 machines successfully.
I'd love to see how my system works for a real-world user, so if you're
interested in trying it out I can send you the current version.

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2005-11-07 20:45:04

by Rohit Seth

[permalink] [raw]
Subject: RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Mon, 2005-11-07 at 12:58 -0600, Adam Litke wrote:

> I am currently working on a new approach to what you tried. It
> requires fewer changes to the kernel and implements the special large
> page usage entirely in an LD_PRELOAD library. And on newer kernels,
> programs linked with the .x ldscript you mention above can run using all
> small pages if not enough large pages are available.
>

Isn't it true that most of the time we'll need to be worrying about
run-time allocation of memory (using malloc or such) as compared to
static?

> For the curious, here's how this all works:
> 1) Link the unmodified application source with a custom linker script which
> does the following:
> - Align elf segments to large page boundaries
> - Assert a non-standard Elf program header flag (PF_LINUX_HTLB)
> to signal something (see below) to use large pages.

We'll need a similar flag even for code pages to start using hugetlb
pages. In this case, to keep the kernel changes to a minimum, RTLD will
need to be modified.

> 2) Boot a kernel that supports copy-on-write for PRIVATE hugetlb pages
> 3) Use an LD_PRELOAD library which reloads the PF_LINUX_HTLB segments into
> large pages and transfers control back to the application.
>

COW, swap etc. are all very nice (little!) features that let hugetlb
get used more transparently.

-rohit



2005-11-07 20:56:07

by andy

[permalink] [raw]
Subject: RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19


Hi,

>Isn't it true that most of the time we'll need to be worrying about
>run-time allocation of memory (using malloc or such) as compared to
>static?

Perhaps for C. Not necessarily true for Fortran. I don't know
anything about how memory allocations proceed there, but there
are no `malloc' calls (at least with that spelling) in the language
itself, and I don't know what it does for either static or dynamic
allocations under the hood. It could be malloc-like or whatever
else. In the language itself, there are language features for
allocating and deallocating memory and I've seen code that
uses them, but haven't played with it myself, since my codes
need pretty much all the various pieces of memory all the time,
and so are simply statically defined.

If you call something like malloc yourself, you risk portability
problems in Fortran. Fortran 2003 supposedly addresses some of
this with some C interop features, but only got approved within
the last year, and no compilers really exist for it yet, let
alone having code written.


Andy

2005-11-07 20:58:43

by Martin Bligh

[permalink] [raw]
Subject: RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

>> Isn't it true that most of the time we'll need to be worrying about
>> run-time allocation of memory (using malloc or such) as compared to
>> static?
>
> Perhaps for C. Not necessarily true for Fortran. I don't know
> anything about how memory allocations proceed there, but there
> are no `malloc' calls (at least with that spelling) in the language
> itself, and I don't know what it does for either static or dynamic
> allocations under the hood. It could be malloc-like or whatever
> else. In the language itself, there are language features for
> allocating and deallocating memory and I've seen code that
> uses them, but haven't played with it myself, since my codes
> need pretty much all the various pieces of memory all the time,
> and so are simply statically defined.

Doesn't fortran shove everything in BSS to make some truly monstrous
segment?

M.

2005-11-07 21:12:09

by Adam Litke

[permalink] [raw]
Subject: RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Mon, 2005-11-07 at 12:51 -0800, Rohit Seth wrote:
> On Mon, 2005-11-07 at 12:58 -0600, Adam Litke wrote:
>
> > I am currently working on a new approach to what you tried. It
> > requires fewer changes to the kernel and implements the special large
> > page usage entirely in an LD_PRELOAD library. And on newer kernels,
> > programs linked with the .x ldscript you mention above can run using all
> > small pages if not enough large pages are available.
> >
>
> > Isn't it true that most of the time we'll need to be worrying about
> > run-time allocation of memory (using malloc or such) as compared to
> > static?

It really depends on the workload. I've run HPC apps with 10+GB data
segments. I've also worked with applications that would benefit from a
hugetlb-enabled morecore (glibc malloc/sbrk). I'd like to see one
standard hugetlb preload library that handles every different "memory
object" we care about (static and dynamic). That's what I'm working on
now.
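
To sketch the morecore idea (glibc grows malloc's heap through the
__morecore hook; this toy version hands back hugetlbfs mappings,
assumes huge_fd is an already-open hugetlbfs file, and ignores the
shrinking and alignment handling a real implementation needs):

#include <malloc.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

static int huge_fd;      /* open fd on a hugetlbfs file */
static off_t huge_top;   /* how much of the file we have handed out */

static void *huge_morecore(ptrdiff_t increment)
{
	void *p;

	if (increment <= 0)
		return NULL;  /* simplified: never shrink the heap */

	p = mmap(NULL, increment, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE, huge_fd, huge_top);
	if (p == MAP_FAILED)
		return NULL;  /* glibc treats NULL as morecore failure */

	huge_top += increment;
	return p;
}

__attribute__((constructor))
static void install_huge_morecore(void)
{
	__morecore = huge_morecore;  /* must run before the first malloc */
}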

> > For the curious, here's how this all works:
> > 1) Link the unmodified application source with a custom linker script which
> > does the following:
> > - Align elf segments to large page boundaries
> > - Assert a non-standard Elf program header flag (PF_LINUX_HTLB)
> > to signal something (see below) to use large pages.
>
> > We'll need a similar flag even for code pages to start using hugetlb
> > pages. In this case, to keep the kernel changes to a minimum, RTLD will
> > need to be modified.

Yes, I foresee the functionality currently in my preload lib existing in
RTLD at some point way down the road.

> > 2) Boot a kernel that supports copy-on-write for PRIVATE hugetlb pages
> > 3) Use an LD_PRELOAD library which reloads the PF_LINUX_HTLB segments into
> > large pages and transfers control back to the application.
> >
>
> > COW, swap etc. are all very nice (little!) features that let hugetlb
> > get used more transparently.

Indeed. See my parallel post of a hugetlb-COW RFC :)

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2005-11-07 21:14:03

by Rohit Seth

[permalink] [raw]
Subject: RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Mon, 2005-11-07 at 12:58 -0800, Martin J. Bligh wrote:
> >> Isn't it true that most of the time we'll need to be worrying about
> >> run-time allocation of memory (using malloc or such) as compared to
> >> static?
> >
> > Perhaps for C. Not necessarily true for Fortran. I don't know
> > anything about how memory allocations proceed there, but there
> > are no `malloc' calls (at least with that spelling) in the language
> > itself, and I don't know what it does for either static or dynamic
> > allocations under the hood. It could be malloc-like or whatever
> > else. In the language itself, there are language features for
> > allocating and deallocating memory and I've seen code that
> > uses them, but haven't played with it myself, since my codes
> > need pretty much all the various pieces of memory all the time,
> > and so are simply statically defined.
>
> Doesn't fortran shove everything in BSS to make some truly monstrous
> segment?
>

hmmm....that would be strange. So, if an app is using a TB of data, then
a TB of space on disk ...then read in at load time (or maybe some
optimization in the RTLD knows that this is BSS and does not need to get
loaded, but then a TB of disk space is a waste).

-rohit

2005-11-07 21:24:45

by Rohit Seth

[permalink] [raw]
Subject: RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Mon, 2005-11-07 at 15:11 -0600, Adam Litke wrote:
> On Mon, 2005-11-07 at 12:51 -0800, Rohit Seth wrote:
>
> > Isn't it true that most of the time we'll need to be worrying about
> > run-time allocation of memory (using malloc or such) as compared to
> > static?
>
> It really depends on the workload. I've run HPC apps with 10+GB data
> segments. I've also worked with applications that would benefit from a
> hugetlb-enabled morecore (glibc malloc/sbrk). I'd like to see one
> standard hugetlb preload library that handles every different "memory
> object" we care about (static and dynamic). That's what I'm working on
> now.
>

As said below, we will need this functionality even for code pages. I
would rather have the changes absorbed in the run-time loader than
have a preload library. That makes it easier to manage.

malloc/sbrk is the interesting part that does pose some challenges (as
on some archs a different address space is reserved for hugetlb).
Moreover, it will also be critical that the existing semantics of normal
pages are maintained even when the application ends up using hugepages.

> > We'll need a similar flag even for code pages to start using hugetlb
> > pages. In this case, to keep the kernel changes to a minimum, RTLD will
> > need to be modified.
>
> Yes, I foresee the functionality currently in my preload lib existing in
> RTLD at some point way down the road.
>

It will be much sooner...

-rohit

2005-11-07 21:33:53

by Adam Litke

[permalink] [raw]
Subject: RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Mon, 2005-11-07 at 13:20 -0800, Rohit Seth wrote:
> On Mon, 2005-11-07 at 12:58 -0800, Martin J. Bligh wrote:
> > >> Isn't it true that most of the time we'll need to be worrying about
> > >> run-time allocation of memory (using malloc or such) as compared to
> > >> static?
> > >
> > > Perhaps for C. Not necessarily true for Fortran. I don't know
> > > anything about how memory allocations proceed there, but there
> > > are no `malloc' calls (at least with that spelling) in the language
> > > itself, and I don't know what it does for either static or dynamic
> > > allocations under the hood. It could be malloc-like or whatever
> > > else. In the language itself, there are language features for
> > > allocating and deallocating memory and I've seen code that
> > > uses them, but haven't played with it myself, since my codes
> > > need pretty much all the various pieces of memory all the time,
> > > and so are simply statically defined.
> >
> > Doesn't fortran shove everything in BSS to make some truly monstrous
> > segment?
> >
>
> hmmm....that would be strange. So, if an app is using a TB of data, then
> a TB of space on disk ...then read in at load time (or maybe some
> optimization in the RTLD knows that this is BSS and does not need to get
> loaded, but then a TB of disk space is a waste).

Nope, the bss is defined as the difference in file size (on disk) and
the memory size (as specified in the ELF program header for the data
segment). So the kernel loads the pre-initialized data from disk and
extends the mapping to include room for the bss.
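
In ELF program header terms (a small illustration with made-up sizes,
echoing the 1TB example above):

#include <elf.h>
#include <stdio.h>

/* For a PT_LOAD segment, bytes on disk are p_filesz; the bss is the
 * tail where p_memsz exceeds p_filesz, and it costs no file space. */
static void show_bss(const Elf64_Phdr *ph)
{
	if (ph->p_type == PT_LOAD && ph->p_memsz > ph->p_filesz)
		printf("on disk: %llu bytes, in memory: %llu bytes, "
		       "bss: %llu bytes\n",
		       (unsigned long long)ph->p_filesz,
		       (unsigned long long)ph->p_memsz,
		       (unsigned long long)(ph->p_memsz - ph->p_filesz));
}

int main(void)
{
	Elf64_Phdr ph = {
		.p_type   = PT_LOAD,
		.p_filesz = 4096,          /* initialized data */
		.p_memsz  = 1ULL << 40,    /* plus a 1TB array in bss */
	};
	show_bss(&ph);
	return 0;
}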

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2005-11-08 02:13:16

by David Gibson

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Mon, Nov 07, 2005 at 01:55:32PM -0700, Andy Nelson wrote:
>
> Hi,
>
> >Isn't it true that most of the time we'll need to be worrying about
> >run-time allocation of memory (using malloc or such) as compared to
> >static?
>
> Perhaps for C. Not necessarily true for Fortran. I don't know
> anything about how memory allocations proceed there, but there
> are no `malloc' calls (at least with that spelling) in the language
> itself, and I don't know what it does for either static or dynamic
> allocations under the hood. It could be malloc-like or whatever
> else. In the language itself, there are language features for
> allocating and deallocating memory and I've seen code that
> uses them, but haven't played with it myself, since my codes
> need pretty much all the various pieces of memory all the time,
> and so are simply statically defined.
>
> If you call something like malloc yourself, you risk portability
> problems in Fortran. Fortran 2003 supposedly addresses some of
> this with some C interop features, but only got approved within
> the last year, and no compilers really exist for it yet, let
> alone having code written.

I believe F90 has a couple of different ways of dynamically allocating
memory. I'd expect in most implementations the FORTRAN runtime would
translate that into a malloc() call. However, as I gather, many HPC
apps are written by people who are scientists first and programmers
second, and who still think in F77 where there is no dynamic memory
allocation. Hence, gigantic arrays in the BSS are common FORTRAN
practice.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson