2001-12-12 20:03:12

by Wayne Whitney

[permalink] [raw]
Subject: Repost: could ia32 mmap() allocations grow downward?

Hello,

I posted this message five days ago and it drew nary a comment, so perhaps
I did something wrong. Any comments would be appreciated, other than "go
buy 64-bit hardware." :-)

Cheers,
Wayne


Although pretty much a kernel newbie, I wanted to bring up an idea that I
first saw here a year ago but which received no commentary.

Namely, from time to time an ia32 user will write about running out of
user address space in one way or another. The standard answer is that
under ia32 Linux, the 32-bit address space for a program of size P is
carved up as follows:

Start Address    Map Contents                   Growth Direction

0x08000000       the executable's code segment  upwards
0x08000000 + P   the executable's data segment  upwards
0x08000000 + 2P  the program's heap             upwards
0x40000000       mmap() without MAP_FIXED       upwards
0xBFFFFFFF       the stack                      downwards
0xC0000000       kernel space                   upwards
0xFFFFFFFF       top of the address space

Thus a typical problem is that a program that wants to manage its own heap
(using the brk() system call instead of malloc() from libc) will have a
maximum heap size of 0x38000000 - 2P. Or a program that heavily uses
mmap() will only have 0x80000000 of mmap() address space.
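
To make this concrete, here is a tiny probe; it is just a sketch, and the
comments reflect the stock layout above (a dynamically linked binary will
show a few more maps near 0x40000000):

---
/* sketch: print roughly where text, heap, mmap() and the stack land */
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
        int on_stack;
        void *heap = sbrk(0);
        void *map = mmap(0, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        printf("text : %p\n", (void *)main);      /* a bit above 0x08000000 */
        printf("heap : %p\n", heap);              /* just past the data segment */
        printf("mmap : %p\n", map);               /* around 0x40000000 */
        printf("stack: %p\n", (void *)&on_stack); /* just below 0xC0000000 */
        return 0;
}
---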

Various workarounds are usually proposed, such as:

o Modify the program to use malloc(), or tune the malloc() allocation
strategy parameters, as malloc() knows about the two distinct memory
allocation mechanisms, brk() below 0x40000000 and mmap() above it.

o Change the value of TASK_UNMAPPED_BASE in the kernel from its default
of 0x40000000.

o Change __PAGE_OFFSET (and the associated value in vmlinux.lds) to
0xE0000000 to reduce the kernel space to 512MB.

The alternative idea (not mine) which I'm curious about is:

o Pick a maximum stack size S and change the kernel so the "mmap()
without MAP_FIXED" region starts at 0xC0000000 - S and grows downwards.

This seems ideal, as it allows the balance between the mmap() region and
the brk() region to vary for each process, automatically. What changes
would be required to the kernel to implement this properly and
efficiently? Is there some downside I am missing?

FWIW, I made a very simple, very naive attempt at doing this about a year
ago, against 2.2.19-pre7. The patch is included below, and it booted OK
for me at the time. I'm sure I made various poor choices in the patch,
though, having not had the Big Picture.


diff -ru linux-2.2.19-pre7/include/asm-i386/processor.h linux-2.2.19-pre7-hack2/include/asm-i386/processor.h
--- linux-2.2.19-pre7/include/asm-i386/processor.h Tue Jan 9 20:26:35 2001
+++ linux-2.2.19-pre7-hack2/include/asm-i386/processor.h Sat Jan 13 11:58:00 2001
@@ -163,10 +163,22 @@
*/
#define TASK_SIZE (PAGE_OFFSET)

-/* This decides where the kernel will search for a free chunk of vm
- * space during mmap's.
+/*
+ * When looking for a free chunk of vm space during mmap's, the kernel
+ * will search upwards from TASK_UNMAPPED_BASE (the usual algorithm),
+ * unless TASK_UNMAPPED_CEILING is defined, in which case it will
+ * search downwards from TASK_UNMAPPED_CEILING to TASK_UNMAPPED_FLOOR.
*/
#define TASK_UNMAPPED_BASE (TASK_SIZE / 3)
+
+/*
+ * We need to allow room for the stack to grow downward from TASK_SIZE.
+ * I really have no idea how large it can get, so I arbitrarily picked
+ * 128MB. Also, I'm not so sure where to stop searching and give up,
+ * so I picked 128MB, which seems to be where executables get loaded.
+ */
+#define TASK_UNMAPPED_CEILING (TASK_SIZE - 128 * 1024 * 1024)
+#define TASK_UNMAPPED_FLOOR (128 * 1024 * 1024)

/*
* Size of io_bitmap in longwords: 32 is ports 0-0x3ff.
diff -ru linux-2.2.19-pre7/mm/mmap.c linux-2.2.19-pre7-hack2/mm/mmap.c
--- linux-2.2.19-pre7/mm/mmap.c Sat Dec 9 21:29:39 2000
+++ linux-2.2.19-pre7-hack2/mm/mmap.c Sat Jan 13 11:58:00 2001
@@ -365,6 +365,22 @@

        if (len > TASK_SIZE)
                return 0;
+#ifdef TASK_UNMAPPED_CEILING
+       if (!addr)
+               addr = TASK_UNMAPPED_CEILING - len;
+
+       do {
+               /* align addr downwards; PAGE_ALIGN aligns it upwards */
+               addr = addr & PAGE_MASK;
+               vmm = find_vma(current->mm, addr);
+               /* At this point: (!vmm || addr < vmm->vm_end). */
+               if (!vmm || addr + len <= vmm->vm_start)
+                       return addr;
+               addr = vmm->vm_start - len;
+       } while (addr >= TASK_UNMAPPED_FLOOR);
+
+       return 0;
+#else
        if (!addr)
                addr = TASK_UNMAPPED_BASE;
        addr = PAGE_ALIGN(addr);
@@ -377,6 +393,7 @@
                        return addr;
                addr = vmm->vm_end;
        }
+#endif
}

#define vm_avl_empty (struct vm_area_struct *) NULL

2001-12-12 20:48:47

by Petr Vandrovec

[permalink] [raw]
Subject: Re: Repost: could ia32 mmap() allocations grow downward?

On 12 Dec 01 at 12:02, Wayne Whitney wrote:

> o Pick a maximum stack size S and change the kernel so the "mmap()
> without MAP_FIXED" region starts at 0xC0000000 - S and grows downwards.

How will you pick S? 8MB? 128MB? Right now you can have 1GB brk + 2GB
(stack+mmap); after the change you have 2.9GB (brk+mmap) but only 128MB of
stack. And if you change your malloc implementation, you can have up to
2GB of stack today, or up to 3GB of mmap. After your change the stack is
limited to 128MB, and you cannot do anything about that except move the
stack somewhere else during libc startup - and in that case the argv[]
assumptions that setproctitle() and others make are no longer valid.

Another problem is mremap(). Due to the way apps work, you'll have to
move VMAs around much more, because you cannot grow your last VMA upwards
without a move. And if you shrink your last block, you'll get a gap.

> This seems ideal, as it allows the balance between the mmap() region and
> the brk() region to vary for each process, automatically. What changes
> would be required to the kernel to implement this properly and
> efficiently? Is there some downside I am missing?

Nobody can call brk() directly from an app, as libc may use brk() to
implement malloc(), and libraries can call malloc(). So you would have to
create your own allocator on top of what brk() returns, and this
allocator must never release memory back to the system, as that could
also release chunks you do not own. Writing your allocator on top of
malloc()ed areas is a much better idea.
Best regards,
Petr Vandrovec
[email protected]

P.S.: I do not think that your app calls brk() directly. I think that
your app calls malloc() with some small number, and libc decides to use
brk() instead of mmap(). In that case it is a bug in your libc that it
does not fall back to mmap() after brk() fails.

2001-12-13 06:29:45

by Wayne Whitney

[permalink] [raw]
Subject: Re: Repost: could ia32 mmap() allocations grow downward?

On Wed, 12 Dec 2001, Petr Vandrovec wrote:

> On 12 Dec 01 at 12:02, Wayne Whitney wrote:
>
> > o Pick a maximum stack size S and change the kernel so the "mmap()
> > without MAP_FIXED" region starts at 0xC0000000 - S and grows downwards.
>
> How you'll pick S? 8MB? 128MB?

Well, Mark Hahn suggests using the stack ulimit. On my bog standard
RedHat 7.2, ulimit -a tells me the stack size limit is 8MB. Of course,
once an mmap() (sans MAP_FIXED) has occurred, you can't increase S, so a
program that wants more stack would have to ensure that the ulimit is set
before calling mmap().
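
(Incidentally, the value is easy to get at: the kernel sees it as
current->rlim[RLIMIT_STACK].rlim_cur, and userspace can read it with
getrlimit(); a minimal sketch:)

---
/* sketch: read the stack ulimit S before any mmap() has happened */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
        struct rlimit rl;

        if (getrlimit(RLIMIT_STACK, &rl) == 0)
                printf("stack limit S = %lu bytes%s\n",
                       (unsigned long)rl.rlim_cur,
                       rl.rlim_cur == RLIM_INFINITY ? " (unlimited)" : "");
        return 0;
}
---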

> Right now you can have 1GB brk + 2GB (stack+mmap); after the change you
> have 2.9GB (brk+mmap) but only 128MB of stack.

My (very limited) experience suggests that of the stack, mmap and brk
regions, stack is likely the smallest. So if one of the three has to have
a predetermined maximum size, and the other two are allowed to grow toward
each other from opposite ends of the address space, it seems the stack should
have the fixed size, not brk.

> Another problem is mremap(). Due to the way apps work, you'll have to
> move VMAs around much more, because you cannot grow your last VMA upwards
> without a move. And if you shrink your last block, you'll get a gap.

Right now, growing any VMA other than the last requires relocating, and
shrinking any VMA other than the last will cause gaps. How big a hit
would it be to remove the exception for the last VMA, so that any VMA
growth requires relocation, and any VMA shrink leaves a gap? Are there
applications that rely on cheap growth and shrinkage of the most recently
allocated VMA (when there have been deletions and MAP_FIXED mmap()s)?

> Nobody can call brk() directly from an app, as libc may use brk() to
> implement malloc(), and libraries can call malloc(). So you would have to
> create your own allocator on top of what brk() returns, and this
> allocator must never release memory back to the system, as that could
> also release chunks you do not own. Writing your allocator on top of
> malloc()ed areas is a much better idea.

Assuming the overhead of malloc() is low, I agree that in writing a new
program one would be better off writing an allocator over malloc() than
over brk(). But there are plenty of legacy programs that use brk(), which
may be hard to port to a malloc()-based allocator, or available to some
users only as binaries.

So there is a tradeoff between changing the programs and changing the
kernel. I'm trying to figure out how expensive the requisite kernel
changes would be. For example, I don't fully grok the structure that holds
the VMAs; I think it is in some sense sorted by increasing start address.
So if one were to change mmap() to allocate VMAs going downward, would it
be appropriate to change the VMA containment structure to be sorted by
decreasing start address?
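
(From my naive reading of mm/mmap.c, the VMAs hang off mm->mmap as a
singly linked list kept sorted by ascending vm_start, with an AVL tree
alongside for processes with many VMAs - hence the vm_avl_empty in my
patch context above. Roughly:)

---
/* sketch, kernel side (2.2-ish): walk the ascending VMA list */
#include <linux/mm.h>
#include <linux/sched.h>

static void dump_vmas(struct mm_struct *mm)
{
        struct vm_area_struct *vma;

        for (vma = mm->mmap; vma; vma = vma->vm_next)
                printk("vma %08lx-%08lx\n", vma->vm_start, vma->vm_end);
}
---

Since find_vma() leans on that ascending order, I would guess a downward
allocator still wants the list kept ascending and just searches for gaps
from the top, as my patch tries to do.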

BTW, if one were trying to port some code that uses brk() directly and
even frees memory that way, then it seems that with glibc's malloc(), one
could make it work by instructing malloc() always to use mmap().
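
(Something like this at startup, if I read the glibc knobs right -
M_MMAP_THRESHOLD and M_MMAP_MAX are documented mallopt() parameters:)

---
/* sketch: push every glibc malloc() allocation onto mmap() */
#include <malloc.h>

int main(void)
{
        mallopt(M_MMAP_THRESHOLD, 0); /* mmap() even the smallest chunks */
        mallopt(M_MMAP_MAX, 65536);   /* allow plenty of mmap()ed chunks */
        /* ... the old brk()-style allocator then runs on top of malloc() ... */
        return 0;
}
---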

Cheers, Wayne

P.S. I am 100% sure that the particular application of mine that started
me thinking about this, MAGMA, uses its own allocator built on top of
brk() and never calls malloc() itself.

2001-12-13 10:28:12

by Petr Vandrovec

[permalink] [raw]
Subject: Re: Repost: could ia32 mmap() allocations grow downward?

On 12 Dec 01 at 22:28, Wayne Whitney wrote:

> BTW, if one were trying to port some code that uses brk() directly and
> even frees memory that way, then it seems that with glibc's malloc(), one
> could make it work by instructing malloc() always to use mmap().

> P.S. I am 100% sure that the particular application of mine that started
> me thinking about this, MAGMA, uses its own allocator built on top of
> brk() and never calls malloc() itself.

If you have a legacy app, how come it uses mmap()? And if I do
not use mmap(), I have nothing at 1GB:

#include <unistd.h>
int main(void) { sleep(10); brk((void *)0xBF000000); pause(); return 0; }

/proc/`pidof x`/maps says during sleep(10):

08048000-080a1000 r-xp 00000000 03:03 230941 /usr/src/linus/x
080a1000-080a5000 rw-p 00058000 03:03 230941 /usr/src/linus/x
080a5000-080a6000 rwxp 00000000 00:00 0
bffff000-c0000000 rwxp 00000000 00:00 0

and after brk() (which succeeded after I did ulimit -d unlimited
and 'echo 1 >/proc/sys/vm/overcommit_memory') I see:

08048000-080a1000 r-xp 00000000 03:03 230941 /usr/src/linus/x
080a1000-080a5000 rw-p 00058000 03:03 230941 /usr/src/linus/x
080a5000-bf000000 rwxp 00000000 00:00 0
bffff000-c0000000 rwxp 00000000 00:00 0

So maybe MAGMA uses some API which it should not use under any
circumstances... such as the libc6 stdio you linked it with.
Best regards,
Petr Vandrovec
[email protected]

2001-12-13 16:24:00

by Wayne Whitney

[permalink] [raw]
Subject: Re: Repost: could ia32 mmap() allocations grow downward?

On Thu, 13 Dec 2001, Petr Vandrovec wrote:

> If you have legacy app, how it comes that it uses mmap?

Very good question. The app per se does not call mmap(), but mmap() is
called once when I execute it. So it must be something from libc:

[whitney@mf1 whitney]$ ldd `which magma`
not a dynamic executable
[whitney@mf1 whitney]$ magma
[ . . .]
[2]+ Stopped magma
[whitney@mf1 whitney]$ cat /proc/`pidof magma`/maps
08048000-08afb000 r-xp 00000000 21:07 64318 magma
08afb000-08c3e000 rw-p 00ab2000 21:07 64318 magma
08c3e000-0bc54000 rwxp 00000000 00:00 0
40000000-40001000 rw-p 00000000 00:00 0
bfffd000-c0000000 rwxp ffffe000 00:00 0

> So maybe MAGMA uses some API which it should not use under any
> circumstances... such as the libc6 stdio you linked it with.

Indeed. How can I avoid the map at 0x40000000? Must I avoid using
certain glibc2 functions, and then link the executable carefully to leave
out their initialization routines? Or can I set some magic environment
variable to tell glibc2 to mmap() the single map with MAP_FIXED at a
higher address? Of course I could modify glibc2 so that it does all (or
most) of its mmap()'s with MAP_FIXED at a higher address. Is there an
alternative libc that might work out of the box or require less
modification?

So it seems like for MAGMA I should be able to work around the fact that
mmap()'s start at 0x40000000. But as difficulties with other programs
come up here fairly regularly, I still think it makes sense to fully
understand the downside of modifying the kernel to allocate mmap() VMAs
going downward. If the downside is small, I think it is a good tradeoff.

Cheers, Wayne


2001-12-13 16:55:54

by Wayne Whitney

[permalink] [raw]
Subject: Re: Repost: could ia32 mmap() allocations grow downward?

On Thu, 13 Dec 2001, Wayne Whitney wrote:

> The app per se does not call mmap(), but mmap() is called once when I
> execute it.

Correction: strace shows that it is called many times during startup, but
only once without a corresponding munmap().

Wayne


2001-12-13 17:08:34

by Hugh Dickins

[permalink] [raw]
Subject: Re: Repost: could ia32 mmap() allocations grow downward?

On Thu, 13 Dec 2001, Wayne Whitney wrote:
> So it seems like for MAGMA I should be able to work around the fact that
> mmap()'s start at 0x40000000. But as difficulties with other programs
> come up here fairly regularly, I still think it makes sense to fully
> understand the downside of modifying the kernel to allocate mmap() VMAs
> going downward. If the downside is small, I think it is a good tradeoff.

My fear is that you may encounter an indefinite number of buggy apps,
which expect an mmap() to follow the mmap() before: easy bug to commit,
and to go unnoticed, until you reverse the layout.

As to where to place your stack: I don't know what assumptions are made
elsewhere, but a seemingly good place is just below the program's text
at 0x08048000. People sometimes ask why i386 ELF text is usually placed
there: I think it's a convention of some other UNIX implementations,
which used to put stack below text and data above it, all sharing
the one page table (if it's a smallish process).

Hugh

2001-12-13 17:37:48

by Petr Vandrovec

[permalink] [raw]
Subject: Re: Repost: could ia32 mmap() allocations grow downward?

On 13 Dec 01 at 8:22, Wayne Whitney wrote:
> > So maybe MAGMA uses some API which it should not use under any
> > circumstances... such as the libc6 stdio you linked it with.
>
> Indeed. How can I avoid the map at 0x40000000? Must I avoid using
> certain glibc2 functions, and then link the executable carefully to leave
> out their initialization routines? Or can I set some magic environment

It is caused by (rather stupid, I think...) code in
glibc-2.2.4/libio/libioP.h:ALLOC_BUF(), which unconditionally does
'mmap(0, ROUND_TO_PAGE(size), PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0)' instead of 'malloc(size)' when it finds that the underlying system
supports mmap().

If you linked Magma yourself, try adding:
---
#include <malloc.h>
#include <unistd.h>     /* for sbrk() */

/* crude overrides: satisfy malloc() from the brk() heap, and send
   glibc's internal anonymous __mmap() calls back to malloc() */
void *malloc(size_t len) { return sbrk(len); }

void *__mmap(void *start, size_t len, int prot, int flags, int fd,
             unsigned long offset)
{
        if (start == 0 && fd == -1)     /* anonymous, e.g. a stdio buffer */
                return malloc(len);
        return NULL;
}
---
into your project. It forces my 'int main(void) { printf("X\n"); pause(); }'
to use brk() instead of mmap() for its stdio buffers. Maybe we should move
this to bug-glibc instead: there is no way to force stdio not to ignore the
mallopt() parameters, it still insists on using mmap(), and I think that
is a glibc 2.2 bug.
Petr Vandrovec
[email protected]

P.S.: I did some testing, and about 95% of mremap() allocations are
targeted at the last VMA, so no VMA move is needed for them. But no Java
was part of the picture, only the C/C++ programs I use - gcc, mc, perl.

2001-12-13 17:40:08

by Wayne Whitney

[permalink] [raw]
Subject: Re: Repost: could ia32 mmap() allocations grow downward?

On Thu, 13 Dec 2001, Hugh Dickins wrote:

> My fear is that you may encounter an indefinite number of buggy apps,
> which expect an mmap() to follow the mmap() before: easy bug to
> commit, and to go unnoticed, until you reverse the layout.

Hmm, so which is more important to support, buggy users of (unguaranteed
side effects of) the new interface, or users of the legacy interface? I
can see the argument that the buggy users of the new interface are more
important. Maybe CONFIG_MMAP_GROWS_DOWNWARDS, or a /proc entry?

Wayne



2001-12-13 18:01:19

by Hugh Dickins

[permalink] [raw]
Subject: Re: Repost: could ia32 mmap() allocations grow downward?

On Thu, 13 Dec 2001, Wayne Whitney wrote:
> On Thu, 13 Dec 2001, Hugh Dickins wrote:
>
> > My fear is that you may encounter an indefinite number of buggy apps,
> > which expect an mmap() to follow the mmap() before: easy bug to
> > commit, and to go unnoticed, until you reverse the layout.
>
> Hmm, so which is more important to support, buggy users of (unguaranteed
> side effects of) the new interface, or users of the legacy interface? I
> can see the argument that the buggy users of the new interface are more
> important. Maybe CONFIG_MMAP_GROWS_DOWNWARDS, or a /proc entry?

Hard to know until you try it: my fear may prove groundless,
or experience may discourage you from the exercise completely.

Quick guess is that what you'd really want in the end is not a
CONFIG option or /proc tunable, but some mark in an ELF section
for what behaviour that particular executable wants.
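
Something as crude as this might do, say (the section name is invented
purely for the sake of argument, and the kernel side is left to the
imagination):

---
/* illustrative only: tag the executable so the loader could pick a
   top-down mmap layout; ".note.mmap-down" is a made-up name */
__attribute__((section(".note.mmap-down")))
static const int want_mmap_down = 1;
---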

I'm reluctant to call wanting a large virtual address space buggy;
but expecting contiguous ascending mmaps (without MAP_FIXED) is buggy.

Hugh

2001-12-13 18:04:09

by Wayne Whitney

[permalink] [raw]
Subject: Re: Repost: could ia32 mmap() allocations grow downward?

On Thu, 13 Dec 2001, Petr Vandrovec wrote:

> Maybe we should move this to bug-glibc instead: there is no way to force
> stdio not to ignore the mallopt() parameters, it still insists on using
> mmap(), and I think that is a glibc 2.2 bug.

OK, that makes sense for the glibc2 subthread of this discussion. Would
you mind submitting the bug report, as you have a better command of the
issues than I do? Or if you want, I can do it and just quote you. :-)

> P.S.: I did some testing, and about 95% of mremap() allocations are
> targeted at the last VMA, so no VMA move is needed for them. But no Java
> was part of the picture, only the C/C++ programs I use - gcc, mc, perl.

Ah, so this is important data. It shows that the mmap()-grows-downward
strategy will hurt the common case. I don't have any handle on the
magnitude of this effect, but if it is significant, then I would have to
agree that supporting the legacy brk() apps is not as important as keeping
mremap() of the last VMA cheap. How expensive is moving a VMA, and how
often do programs mremap()?

How about the idea of modifying brk() (or adding an alternative) to move
VMAs out of the way as necessary? This way the negative impact (of moving
VMAs) is only borne by the legacy brk()-using app. Or is there some other
downside that I am missing?

Wayne



2001-12-13 19:14:10

by Petr Vandrovec

[permalink] [raw]
Subject: Re: Repost: could ia32 mmap() allocations grow downward?

On 13 Dec 01 at 10:03, Wayne Whitney wrote:
> On Thu, 13 Dec 2001, Petr Vandrovec wrote:
>
> > Maybe we should move this to bug-glibc instead: there is no way to force
> > stdio not to ignore the mallopt() parameters, it still insists on using
> > mmap(), and I think that is a glibc 2.2 bug.
>
> OK, that makes sense for the glibc2 subthread of this discussion. Would
> you mind submitting the bug report, as you have a better command of the
> issues than I do? Or if you want, I can do it and just quote you. :-)

If you can, complain yourself...

> > P.S.: I did some testing, and about 95% of mremap() allocations are
> > targeted at the last VMA, so no VMA move is needed for them. But no Java
> > was part of the picture, only the C/C++ programs I use - gcc, mc, perl.
>
> Ah, so this is important data. It shows that the mmap()-grows-downward
> strategy will hurt the common case. I don't have any handle on the
> magnitude of this effect, but if it is significant, then I would have to
> agree that supporting the legacy brk() apps is not as important as keeping
> mremap() of the last VMA cheap. How expensive is moving a VMA, and how
> often do programs mremap()?

It is not that bad, as only PTEs are moved, but... currently code calls
mremap() and in 95% of cases the same address is returned, while after
the change mremap() would return a different address in 100% of cases,
so a couple of bugs may be discovered because of this change.
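
The pattern I would worry about looks like this (a sketch;
MREMAP_MAYMOVE wants _GNU_SOURCE):

---
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 64 * 4096;
        char *buf = mmap(0, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        char *p = buf + 100;    /* a pointer kept into the block */
        char *nbuf = mremap(buf, len, 2 * len, MREMAP_MAYMOVE);

        /* today nbuf == buf most of the time, so code that keeps using
           p "works"; allocate top-down and nearly every grow moves */
        printf("moved: %s, p %s valid\n",
               nbuf == buf ? "no" : "yes",
               nbuf == buf ? "still" : "no longer");
        return 0;
}
---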

> How about the idea of modifying brk() (or adding an alternative) to move
> VMAs out of the way as necessary? This way the negative impact (of moving
> VMAs) is only borne by the legacy brk()-using app. Or is there some other
> downside that I am missing?

You cannot move VMAs when the app has not requested mremap(), as you would
have to notify the app of the area's new location - the app can hold any
number of pointers into this memory, so you cannot move it around without
the app being informed.

And unfortunately you also cannot just have brk() skip over existing VMAs,
as userspace remembers the latest value returned by brk(), adds the size
to it, and calls brk() to grow the data segment. As the app decides the
new brk() value, and the app does not know that there is some VMA in the
way, the kernel cannot do anything about it either - unfortunately.
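
That is, the usual userspace pattern is essentially (a sketch):

---
#include <unistd.h>

/* libc (or the app) tracks the break itself and just asks for
   "old break + size"; the kernel never gets to pick the address */
void *grow_heap(size_t size)
{
        void *old = sbrk(0);                  /* the remembered break */

        if (brk((char *)old + size) != 0)     /* fails if a VMA is in the way */
                return (void *)-1;
        return old;                           /* new memory starts at the old break */
}
---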
Petr Vandrovec
[email protected]