2001-12-01 09:37:42

by Andrew Morton

[permalink] [raw]
Subject: Re: your mail on mmap() to the kernel list

Peter Zaitsev wrote:
>
> Hello theowl,
>
> Saturday, December 01, 2001, 2:20:20 AM, you wrote:
>
> Well, thank you. I've already found this - in recent kernels it's
> even tunable via procfs.
>
> The only question is why anonymous mapping is rather fast (I get
> 1000 req/sec even with 300,000 blocks mapped), whereas with a
> file-backed mapping the performance drops to 20 req/sec at that count.
>

Well, a kernel profile is pretty unambiguous:

c0116af4 mm_init 1 0.0050
c0117394 do_fork 1 0.0005
c0124ccc clear_page_tables 1 0.0064
c0125af0 do_wp_page 1 0.0026
c01260a0 do_no_page 1 0.0033
c01265dc find_vma_prepare 1 0.0081
c0129138 file_read_actor 1 0.0093
c012d95c kmem_cache_alloc 1 0.0035
c0147890 d_lookup 1 0.0036
c01573dc write_profile 1 0.0061
c0169d44 journal_add_journal_head 1 0.0035
c0106e88 system_call 2 0.0357
c01264bc unlock_vma_mappings 2 0.0500
c0135bb4 fget 2 0.0227
c028982c __generic_copy_from_user 2 0.0192
c01267ec do_mmap_pgoff 4 0.0035
c0126d6c find_vma 7 0.0761
c0105000 _stext 16 0.1667
c0126c70 get_unmapped_area 4991 19.8056
c0105278 poll_idle 4997 124.9250
00000000 total 10034 0.0062

The `poll_idle' count is from the other CPU.

What appears to be happening is that the VMA tree has degenerated
into a monstrous singly linked list. All those little 4k mappings
are individual data structures, chained one after the other.

The reason you don't see it with an anonymous map is, I think, that
the kernel will merge contiguous anon mappings into a single one,
so there's basically nothing to be searched each time you request some
more memory.

Running the same test on the -ac kernel is 4-5% slower, so Andrea's
new rbtree implementation makes a better linked list than the old
AVL tree did :)

It strikes me that this is not a completely stupid usage pattern, and
that perhaps it's worth thinking about some tweaks to cope with it.
I really don't know, but I've Cc'ed some people who do.

Here's the test app:

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <time.h>

int main(void)
{
	int i = 0;
	void *p;
	time_t t;
	int fd;

	fd = open("test.dat", O_RDWR);
	if (fd < 0) {
		puts("Unable to open file!");
		return 1;
	}
	t = time(NULL);
	while (1) {
		p = mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		i++;
		if (i % 10000 == 0) {
			printf(" %d Time: %ld\n", i, (long)(time(NULL) - t));
			t = time(NULL);
		}
	}
	return 0;
}


2001-12-03 08:51:54

by Andrew Morton

[permalink] [raw]
Subject: Re: your mail on mmap() to the kernel list

Peter Zaitsev wrote:
>
> ...

It's very simple. The kernel has a linked list of vma's for your
process. It is kept in address order. Each time you request a
new mapping, the kernel walks down that list looking for an address
at which to place the new mapping. This data structure is set up
for efficient find-by-address, not for efficient find-a-gap.

Question is: do we need to optimise for this case?

If it's just a single file, then you'd be better off just mapping the
whole thing. If you need to map lots and lots of files then
you'll hit the maximum open-file limit fairly early, and the mmap()
performance will not be great, but maybe acceptable.

One scenario where this could be a problem is for a file
which is too large to be mapped in its entirety, but the
application needs access to lots of bits of it at the same
time. There doesn't seem to be much alternative to setting
up a distinct mapping for each access window in this case.

> Also, as you can see, other patterns show fast performance degradation
> as the number of pages increases. I can also test random allocation and
> freeing, but something tells me the result will be the same.

Is this proving to be a problem in a real-world application?
What are you trying to do here?

-

2001-12-03 08:51:46

by Peter Zaitsev

[permalink] [raw]
Subject: Re[2]: your mail on mmap() to the kernel list

Hello Andrew,

Saturday, December 01, 2001, 12:37:01 PM, you wrote:

>>
>> The only question is why anonymous mapping is rather fast (I get
>> 1000 req/sec even with 300,000 blocks mapped), whereas with a
>> file-backed mapping the performance drops to 20 req/sec at that count.
>>

AM> well a kernel profile is pretty unambiguous:

AM> [kernel profile snipped; see previous message]

AM> The `poll_idle' count is from the other CPU.

AM> What appears to be happening is that the VMA tree has degenerated
AM> into a monstrous singly linked list. All those little 4k mappings
AM> are individual data structures, chained one after the other.

Well, the thing is, I modified the application a bit so that it
randomly requests from 1 to 64 pages, and the execution profile still
looks the same. So this happens not only when mapping equal sizes; it
also affects the initial mapping of many chunks.

So the next test was also simple - I started to deallocate the
first-mmaped chunks in the same order, holding only the last 40,000
mapped chunks:

31000 Time: 5
32000 Time: 4
33000 Time: 5
34000 Time: 5
35000 Time: 5
36000 Time: 6
37000 Time: 5
38000 Time: 6
39000 Time: 5
40000 Time: 6
41000 Time: 0
42000 Time: 0
43000 Time: 1
44000 Time: 0
45000 Time: 0
46000 Time: 1
47000 Time: 1
48000 Time: 1
49000 Time: 1
50000 Time: 1

As you can see, once I start freeing pages, free addresses can be
found rather quickly.

Next I made it hold only 20,000 mappings on loop iterations from
20,000 to 40,000, and then stop freeing and continue allocating:


2000 Time: 0
4000 Time: 0
6000 Time: 1
8000 Time: 2
10000 Time: 2
12000 Time: 4
14000 Time: 4
16000 Time: 4
18000 Time: 5
20000 Time: 6
22000 Time: 0
24000 Time: 1
26000 Time: 1
28000 Time: 1
30000 Time: 3
32000 Time: 3
34000 Time: 4
36000 Time: 5
38000 Time: 5
40000 Time: 6
42000 Time: 6
44000 Time: 7
46000 Time: 8


Quite surprising: as you can see, the speed increases while the hole
is being opened up, but then degrades again quite fast, even though
the number of mapped pages stayed constant over the interval
20,000 - 40,000.

And now back to the previous test, this time with 20,000 mapped pages.
As you can see, there is a cycle with a period of 20,000: with this
period the pages at low addresses are freed, so it looks exactly as if
the address space is scanned from low addresses to high, looking for
the first place where the page fits:

5000 Time: 1
10000 Time: 4
15000 Time: 10
20000 Time: 13
25000 Time: 1
30000 Time: 5
35000 Time: 9
40000 Time: 14
45000 Time: 1
50000 Time: 5
55000 Time: 9
60000 Time: 13
65000 Time: 1
70000 Time: 5
75000 Time: 10
80000 Time: 13
85000 Time: 1
90000 Time: 5
95000 Time: 10
100000 Time: 13
105000 Time: 1
110000 Time: 5
115000 Time: 9
120000 Time: 14




AM> The reason you don't see it with an anonymous map is, I think, that
AM> the kernel will merge contiguous anon mappings into a single one,
AM> so there's basically nothing to be searched each time you request some
AM> more memory.

AM> Running the same test on the -ac kernel is 4-5% slower, so Andrea's
AM> new rbtree implementation makes a better linked list than the old
AM> AVL-tree's one :)

AM> It strikes me that this is not a completely stupid usage pattern, and
AM> that perhaps it's worth thinking about some tweaks to cope with it.
AM> I really don't know, but I've Cc'ed some people who do.

Hope so :)
Also, as you can see, other patterns show fast performance degradation
as the number of mapped pages increases. I can also test random
allocation and freeing, but something tells me the result will be the
same.

Hope to hear some info from a guru :)

Please write me if any patches become available - I will be happy to
test them.

The last test program (if you're interested):

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <stdlib.h>
#include <time.h>

/* Static (and thus zero-initialized): a million pointers would
 * overflow the stack as local arrays. */
static void *arr[1000000];
static int sz[1000000];

int main(void)
{
	int i = 0;
	void *p;
	time_t t;
	int fd;
	int size;

	t = time(NULL);
	while (1) {
		/* A new fd per iteration, to model mapping many files. */
		fd = open("/spylog/1/test.dat", O_RDWR);
		if (fd < 0) {
			puts("Unable to open file!");
			return 1;
		}
		/* size = (((double)random() / RAND_MAX * 16) + 1) * 4096; */
		size = 4096;
		p = mmap((void *)0x60000000, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE, fd, 0);
		if (p == MAP_FAILED) {
			printf("Failed %d\n", errno);
			return 1;
		}
		arr[i] = p;
		sz[i] = size;
		/* Keep a sliding window of the last 20,000 mappings. */
		if (i > 20000)
			munmap(arr[i - 20000], sz[i - 20000]);
		i++;
		if (i % 5000 == 0) {
			printf(" %d Time: %ld\n", i, (long)(time(NULL) - t));
			t = time(NULL);
		}
	}
	return 0;
}




--
Best regards,
Peter mailto:[email protected]

2001-12-03 23:46:27

by Peter Zaitsev

[permalink] [raw]
Subject: Re[2]: your mail on mmap() to the kernel list

Hello Andrew,

Monday, December 03, 2001, 2:34:54 AM, you wrote:

AM> Peter Zaitsev wrote:
>>
>> ...

AM> It's very simple. The kernel has a linked list of vma's for your
AM> process. It is kept in address order. Each time you request a
AM> new mapping, the kernel walks down that list looking for an address
AM> at which to place the new mapping. This data structure is set up
AM> for efficient find-by-address, not for efficient find-a-gap.

Yes, I see. I've tried some other load patterns, like random
allocation/deallocation, but I still see the speed degrade as the
number of mapped blocks increases :(

The other thing is that I don't understand the anonymous-mapping
result. I've now tried a simple variation - quite the same program,
but with every second mapping anonymous, to prevent merging of the
calls - and I still see a speed gain:

1) Non-anonymous
5000 Time: 1 Mapped: 4999 Unmapped: 1661
10000 Time: 2 Mapped: 9999 Unmapped: 3311
15000 Time: 6 Mapped: 14999 Unmapped: 4957
20000 Time: 9 Mapped: 19999 Unmapped: 6620
25000 Time: 12 Mapped: 24999 Unmapped: 8290
30000 Time: 15 Mapped: 29999 Unmapped: 9975
35000 Time: 17 Mapped: 34999 Unmapped: 11643
40000 Time: 20 Mapped: 39999 Unmapped: 13287
45000 Time: 23 Mapped: 44999 Unmapped: 14953
50000 Time: 26 Mapped: 49999 Unmapped: 16618
55000 Time: 28 Mapped: 54999 Unmapped: 18311
60000 Time: 31 Mapped: 59999 Unmapped: 20013

2) Every second is anonymous
5000 Time: 1 Mapped: 4999 Unmapped: 1661
10000 Time: 1 Mapped: 9999 Unmapped: 3311
15000 Time: 3 Mapped: 14999 Unmapped: 4957
20000 Time: 6 Mapped: 19999 Unmapped: 6620
25000 Time: 8 Mapped: 24999 Unmapped: 8290
30000 Time: 9 Mapped: 29999 Unmapped: 9975
35000 Time: 12 Mapped: 34999 Unmapped: 11643
40000 Time: 13 Mapped: 39999 Unmapped: 13287
45000 Time: 15 Mapped: 44999 Unmapped: 14953
50000 Time: 18 Mapped: 49999 Unmapped: 16618
55000 Time: 18 Mapped: 54999 Unmapped: 18311
60000 Time: 21 Mapped: 59999 Unmapped: 20013


My question is: why? Are anonymous mappings handled in a separate way
so that they can always be merged? This is important for me, as I can
live with slowed-down mmaping of files but really do not want memory
allocation to slow down (mmap is used by a multi-threaded program).

AM> Question is: do we need to optimise for this case?

I hope you will - or at least that some patches will appear for this :)

AM> If it's just a single file, then you'd be better off just mapping the
AM> whole thing. If you need to map lots and lots of files then
AM> you'll hit the maximum file limit fairly early and the mmap()
AM> performance will be not great, but maybe acceptable.

Well, I need lots of small files. These files contain different
clients' data, so they should not be merged.
I will not hit the open-file limit - I recently tested with 500,000
open files, and it seems to work with no real speed penalty.

So this is a really strange case: I can open 500,000+ files
efficiently, but can't efficiently map more than 20,000 of them :(

AM> One scenario where this could be a problem is for a file
AM> which is too large to be mapped in its entirety, but the
AM> application needs access to lots of bits of it at the same
AM> time. There doesn't seem to be much alternative to setting
AM> up a distinct mapping for each access window in this case.

Yes. I also found that migrating to a small number of big files will
not help much and will produce some other problems :)

>> Also, as you can see, other patterns show fast performance degradation
>> as the number of pages increases. I can also test random allocation and
>> freeing, but something tells me the result will be the same.

AM> Is this proving to be a problem in a real-world application?
AM> What are you trying to do here?

I had two problems which I tried to solve by mmaping a lot of files.
My application works with a lot of data - 2G+ - and not all of it is
accessed often, so some of it can be swapped out without any problem
for the application. But I ran into two problems:
1) User address space on Intel is 3.5G (with Andrea's patches). I
can't afford to migrate to a 64-bit environment, as there are a lot of
machines running the application. So I've set up a cache of mapped
pieces. As some of them are accessed rather rarely, this works only
with a relatively small number of mapped chunks.
2) The second problem is more complicated. My program can work
effectively with only 20% of the data present in memory, but I must
save things to disk periodically. With read/write this leads to a lot
of swapping, as a page must be swapped in before it can be written to
disk. Moreover, only a small percent of the pages really change and so
need to be saved; this is hard to track by hand, but mmap should
theoretically handle it automatically.




--
Best regards,
Peter mailto:[email protected]

2001-12-04 14:42:06

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: your mail on mmap() to the kernel list

On Sun, Dec 02, 2001 at 03:34:54PM -0800, Andrew Morton wrote:
> Peter Zaitsev wrote:
> >
> > ...
>
> It's very simple. The kernel has a linked list of vma's for your
> process. It is kept in address order. Each time you request a
> new mapping, the kernel walks down that list looking for an address
> at which to place the new mapping. This data structure is set up
> for efficient find-by-address, not for efficient find-a-gap.

exactly.

Andrea