On Sat, Dec 01, 2001 at 01:37:01AM -0800, Andrew Morton wrote:
> Peter Zaitsev wrote:
> >
> > Hello theowl,
> >
> > Saturday, December 01, 2001, 2:20:20 AM, you wrote:
> >
> > Well. Thank you. I've already found this - in recent kernels it's
> > even tunable via the proc fs.
> >
> > The only question is why anonymous mapping is rather fast (I get
> > 1000 req/sec even with 300,000 blocks mapped), whereas with
> > an fd-backed mapping the performance drops to 20 req/sec at that number.
> >
>
> well a kernel profile is pretty unambiguous:
>
> c0116af4 mm_init 1 0.0050
> c0117394 do_fork 1 0.0005
> c0124ccc clear_page_tables 1 0.0064
> c0125af0 do_wp_page 1 0.0026
> c01260a0 do_no_page 1 0.0033
> c01265dc find_vma_prepare 1 0.0081
> c0129138 file_read_actor 1 0.0093
> c012d95c kmem_cache_alloc 1 0.0035
> c0147890 d_lookup 1 0.0036
> c01573dc write_profile 1 0.0061
> c0169d44 journal_add_journal_head 1 0.0035
> c0106e88 system_call 2 0.0357
> c01264bc unlock_vma_mappings 2 0.0500
> c0135bb4 fget 2 0.0227
> c028982c __generic_copy_from_user 2 0.0192
> c01267ec do_mmap_pgoff 4 0.0035
> c0126d6c find_vma 7 0.0761
> c0105000 _stext 16 0.1667
> c0126c70 get_unmapped_area 4991 19.8056
> c0105278 poll_idle 4997 124.9250
> 00000000 total 10034 0.0062
the vma lookup overhead is nothing compared to the walking of the linked
list (not of the tree) to find the first available slot above
TASK_UNMAPPED_BASE. In the vma lookup engine rewrite I only cared about
find_vma, not about optimizing the search for a free slot above
TASK_UNMAPPED_BASE. Such a list loop is really evil.
> What appears to be happening is that the VMA tree has degenerated
> into a monstrous singly linked list. All those little 4k mappings
actually it's not that the tree degenerated into a list. It's that we
need to walk all the vmas to check whether there is a large enough hole to
place the new dynamic mapping, and for that walk we use the list, not the
tree, because it needs less memory and is simpler and faster.
You can fix the problem in userspace by passing a meaningful 'addr' as a
hint to mmap(2), or by using MAP_FIXED from userspace; then the kernel
won't waste time searching for the first available slot above
TASK_UNMAPPED_BASE.
> The reason you don't see it with an anonymous map is, I think, that
> the kernel will merge contiguous anon mappings into a single one,
Yes. But merging file-backed vmas as well would be a win independently
of the arch_get_unmapped_area loop. It's not possible right now because
the anonymous memory case is much simpler: no i_shared_locks etc. But
I'd like file-backed vmas to become mergeable too in the long run. A
side effect of that win is of course that the loop would also run
faster. But technically, to really kill the evil complexity of the loop
(for example when every vma belongs to a different file, so none can be
merged anyway), we'd need a tree of "holes" indexed by the size of the
hole. But it seems a very dubious optimization... Complexity-wise it
makes sense, but in practice the common case should matter more, I
guess, and of course userspace can fix this without any kernel
modification by passing a helpful "addr" to the mmap syscall, with very
little effort. Untested patch follows:
@@ -68,7 +9,7 @@
int main()
{
int i = 0;
- void *p;
+ void *p = NULL;
int t;
int fd;
@@ -79,7 +20,7 @@
}
t = time(NULL);
while (1) {
- p = mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
+ p = mmap(p, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
if ((int) p == -1) {
perror("mmap");
return;
Andrea
Hello Andrea,
Tuesday, December 04, 2001, 5:15:49 PM, you wrote:
AA> the vma lookup overhead is nothing compared to the walking of the linked
AA> list (not of the tree) to find the first available slot above
AA> TASK_UNMAPPED_BASE. In the vma lookup engine rewrite I only cared about
AA> find_vma, not about optimizing the search of a free vma over
AA> TASK_UNMAPPED_BASE. Such list-loop is really evil.
Sure.
>> What appears to be happening is that the VMA tree has degenerated
>> into a monstrous singly linked list. All those little 4k mappings
AA> actually it's not that the tree degenerated into a list. It's that we
AA> need to walk all the vma to check if there is a large enough hole to
AA> place the new dynamic mapping and so walk we use the list, not the tree,
AA> because it needs less mem resources and it's simpler and faster.
AA> You can fix the problem in userspace by using a meaningful 'addr' as
AA> hint to mmap(2), or by using MAP_FIXED from userspace, then the kernel
AA> won't waste time searching the first available mapping over
AA> TASK_UNMAPPED_BASE.
Well. Really you can't do this, because you can't track all of the
mappings in a user program, as glibc and probably other libraries use
mmap for their own purposes. This may only work at program start, where
you can be almost sure that only low addresses are in use yet.
In my case I'm implementing a cache of mmapped chunks, so there are
always some mmaps/munmaps going on. Also I can't use "random" addresses:
with high probability they would collide, as my application maps about
50% of the user address space (meaning 3.5G with your patches).
>> The reason you don't see it with an anonymous map is, I think, that
>> the kernel will merge contiguous anon mappings into a single one,
AA> Yes. But actually merging also file backed vmas would be a goodness
AA> indipendently of the arch_get_unmapped_area loop. It's not possible
AA> right now because the anonymous memory case it's much simpler: no
AA> i_shared_locks etc... but I'd like if in the long run also the file
AA> backed vma will be mergeable. The side effect of the goodness is that
AA> also the loop would run faster of course.
Do you think there's a big chance that two adjacent mappings will belong
to different parts of one file? I think this should be true only for
some specific workloads.
AA> But technically to really kill
AA> the evil complexity of the loop (for example if every vma belongs to a
AA> different file so it cannot be merged anyways) we'd need a tree of
AA> "holes" indexed in function of the size of the hole. but it seems a very
AA> dubious optimization...
Are there many problems with this approach? A lot of memory usage or CPU
overhead somewhere?
AA> Complexity wise it makes sense, but in practice
AA> the common case should matter more I guess, and of course userspace can
AA> just fix this without any kernel modification by passing an helpful
AA> "addr", to the mmap syscall with very very little effort. Untested
AA> patch follows:
Could you please explain a bit how this hint works? Does it try only the
specified address, or does it also look for free space starting from
that point?
--
Best regards,
Peter mailto:[email protected]
On Tue, 4 Dec 2001, Peter Zaitsev wrote:
> Tuesday, December 04, 2001, 5:15:49 PM, you wrote:
> AA> You can fix the problem in userspace by using a meaningful 'addr' as
> AA> hint to mmap(2), or by using MAP_FIXED from userspace, then the kernel
> AA> won't waste time searching the first available mapping over
> AA> TASK_UNMAPPED_BASE.
> Well. Really you can't do this, because you can not really track all of
> the mappings in user program as glibc and probably other libraries
> use mmap for their purposes.
There's no reason we couldn't do this hint in kernel space.
In arch_get_unmapped_area we can simply keep track of the
lowest address where we found free space, while on munmap()
we can adjust this hint if needed.
OTOH, I doubt it would help real-world workloads where the
application maps and unmaps areas of different sizes and
actually does something with the memory instead of just
mapping and unmapping it ;)))
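Rik's suggestion can be sketched as a small extension of the linear search (again a toy userspace model, with illustrative names - in a real kernel the hint would live in the per-process mm structure): cache where the last search ended, start the next search there, and rewind the cache on munmap() so freed space below it becomes reachable again.

```c
#include <assert.h>
#include <stddef.h>

struct vma { unsigned long start, end; struct vma *next; };

/* one cached hint per "process"; 0x40000000 stands in for
 * TASK_UNMAPPED_BASE */
static unsigned long free_hint = 0x40000000;

unsigned long get_unmapped_area_cached(struct vma *head, unsigned long len)
{
	unsigned long addr = free_hint;		/* resume, don't restart */
	struct vma *v;

	for (v = head; v; v = v->next) {
		if (v->end <= addr)
			continue;
		if (v->start >= addr + len)
			break;
		addr = v->end;
	}
	free_hint = addr + len;		/* next search starts past us */
	return addr;
}

void unmap_hint_update(unsigned long start)
{
	if (start < free_hint)		/* a lower hole opened: rewind */
		free_hint = start;
}
```

This is cheap and helps the unfragmented case, but as noted below it does not change the worst-case complexity: after a rewind the walk from the hint is a plain linear scan again.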
kind regards,
Rik
--
Shortwave goes a long way: irc.starchat.net #swl
http://www.surriel.com/ http://distro.conectiva.com/
On Tue, Dec 04, 2001 at 02:42:28PM -0200, Rik van Riel wrote:
> On Tue, 4 Dec 2001, Peter Zaitsev wrote:
> > Tuesday, December 04, 2001, 5:15:49 PM, you wrote:
>
> > AA> You can fix the problem in userspace by using a meaningful 'addr' as
> > AA> hint to mmap(2), or by using MAP_FIXED from userspace, then the kernel
> > AA> won't waste time searching the first available mapping over
> > AA> TASK_UNMAPPED_BASE.
>
> > Well. Really you can't do this, because you can not really track all of
> > the mappings in user program as glibc and probably other libraries
> > use mmap for their purposes.
>
> There's no reason we couldn't do this hint in kernel space.
>
> In arch_get_unmapped_area we can simply keep track of the
> lowest address where we found free space, while on munmap()
> we can adjust this hint if needed.
>
> OTOH, I doubt it would help real-world workloads where the
> application maps and unmaps areas of different sizes and
> actually does something with the memory instead of just
> mapping and unmapping it ;)))
exactly, while that would be simple to implement and very lightweight at
runtime, that's not enough to mathematically drop the complexity of the
get_unmapped_area algorithm. It would optimize only the case where
there's no fragmentation of the mapped virtual address space.
For finding the best fit with O(log(N)) complexity (rather than the
current O(N) complexity of the linked-list walk), a tree indexed by the
size of each hole would be necessary.
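Such a size-indexed structure could look like the following (purely illustrative code, not kernel code): a binary search tree keyed by hole size, where best fit - the smallest hole that is at least `len` bytes - is found by walking one root-to-leaf path. A production version would need a balanced tree (and rebalancing on every mmap/munmap) to actually guarantee the O(log(N)) bound, which is part of why the optimization looks dubious for the common case.

```c
#include <assert.h>
#include <stdlib.h>

struct hole {
	unsigned long addr, size;
	struct hole *left, *right;
};

/* insert a hole, keyed by size (unbalanced BST for brevity) */
struct hole *hole_insert(struct hole *root, unsigned long addr,
			 unsigned long size)
{
	if (!root) {
		struct hole *h = malloc(sizeof(*h));
		h->addr = addr;
		h->size = size;
		h->left = h->right = NULL;
		return h;
	}
	if (size < root->size)
		root->left = hole_insert(root->left, addr, size);
	else
		root->right = hole_insert(root->right, addr, size);
	return root;
}

/* best fit: the smallest hole with size >= len, or NULL */
struct hole *hole_best_fit(struct hole *root, unsigned long len)
{
	struct hole *best = NULL;

	while (root) {
		if (root->size >= len) {
			best = root;		/* fits; look for smaller */
			root = root->left;
		} else {
			root = root->right;	/* too small; go bigger */
		}
	}
	return best;
}
```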
Andrea
Hello Andrea,
>>
>> OTOH, I doubt it would help real-world workloads where the
>> application maps and unmaps areas of different sizes and
>> actually does something with the memory instead of just
>> mapping and unmapping it ;)))
AA> exactly, while that would be simple to implement and very lightweight at
AA> runtime, that's not enough to mathematically drop the complexity of the
AA> get_unmapped_area algorithm. It would optimize only the case where
AA> there's no fragmentation of the mapped virtual address space.
And it will also optimize all 4K mappings (which are at least 70%
in my case) :)
AA> For finding the best fit in the heap with O(log(N)) complexity (rather
AA> than the current O(N) complexity of the linked list) one tree indexed by
AA> the size of each hole would be necessary.
This of course would be the best way.
--
Best regards,
Peter mailto:[email protected]
Hello Rik,
Tuesday, December 04, 2001, 7:42:28 PM, you wrote:
>> Well. Really you can't do this, because you can not really track all of
>> the mappings in user program as glibc and probably other libraries
>> use mmap for their purposes.
RvR> There's no reason we couldn't do this hint in kernel space.
Yes. This cache will probably give a good hit rate. Of course it does
not decrease the mathematical complexity, but speeding things up by a
couple of times is good anyway :)
RvR> In arch_get_unmapped_area we can simply keep track of the
RvR> lowest address where we found free space, while on munmap()
RvR> we can adjust this hint if needed.
RvR> OTOH, I doubt it would help real-world workloads where the
RvR> application maps and unmaps areas of different sizes and
RvR> actually does something with the memory instead of just
RvR> mapping and unmapping it ;)))
Well, I think this is quite simple. A database may use mmap to access
the data in its files; as you can't map everything into a 32-bit
address space, you have to map just parts of the files, and since you
can't have too many mapped chunks, you end up with different chunk
sizes when merging contiguous mmapped areas. Another thing: some
databases support different physical page sizes, so this will be true
even without merging.
One more thing: fread, at least in some cases, mmaps the file - so if
you use it aggressively you will run into the problem.
Anyway, even when the mmapped chunks have different sizes, most of them
should be around the same size in most cases.
--
Best regards,
Peter mailto:[email protected]