[email protected] wrote:
> The first idea is to use this for UML - it must create a lot of single page
> mappings, and managing them through separate VMAs is slow.
I don't know about this. The patches add some complexity, I guess because
we now have vmas which cannot communicate the protectedness of the pages.
Still, nobody was too concerned about nonlinear mappings doing the same
for addressing. But this does seem to add more overhead to the common cases
in the VM :(
Now I didn't follow the earlier discussions on this much, but let me try
making a few silly comments to get things going again (cc'ed linux-mm).
I think I would rather this all just folded under VM_NONLINEAR rather than
having this extra MANYPROTS thing, no? (you're already doing that in one
direction).
>
> Additional note: this idea, with some further refinements (which I'll code after
> this chunk is accepted), will allow us to reduce the number of used VMAs for most
> userspace programs - in particular, it will let us avoid creating one VMA per
> guard page (which has PROT_NONE) - forcing PROT_NONE on that page will be
> enough.
I think that's silly. Your VM_MANYPROTS|VM_NONLINEAR vmas will cause more
overhead in faulting and reclaim.
It looks like it would take an hour or two just to code up a patch which
puts a VM_GUARDPAGES flag into the vma, and tells the free area allocator
to skip vm_start-1 .. vm_end+1. What kind of trouble has prevented
something simple and easy like that from going in?
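A minimal sketch of what I mean (purely illustrative: the flag value and
helpers below are made up, nothing like this exists in the posted patches):

/* Hypothetical VM_GUARDPAGES: the free-area search treats the vma as
 * one page wider on each side, so the allocator never places another
 * mapping directly adjacent to it. */
#define VM_GUARDPAGES	0x04000000	/* made-up flag value */

static inline unsigned long vma_guarded_start(struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_GUARDPAGES)
		return vma->vm_start - PAGE_SIZE;
	return vma->vm_start;
}

static inline unsigned long vma_guarded_end(struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_GUARDPAGES)
		return vma->vm_end + PAGE_SIZE;
	return vma->vm_end;
}

/* arch_get_unmapped_area() would then test
 *	addr + len <= vma_guarded_start(vma)
 * and step past vma_guarded_end(vma) instead of using vm_start/vm_end
 * directly. */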
>
> This will be useful since the VMA lookup at fault time can be a bottleneck for
> some programs (I've received a report about this from Ulrich Drepper and I've
> been told that Val Henson from Intel is also interested in this). I guess
> that since we use RB-trees, the slowness is also due to the poor cache locality
> of RB-trees (the RB nodes live inside the VMAs but aren't accessed together with
> their contents), compared for instance with radix trees, where the lookup has
> high cache locality (though they have space usage problems, possibly bigger
> ones, on 64-bit machines).
Let's try to get back to the good old days when people actually reported
their bugs (together with *real* numbers) to the mailing lists. That way,
everybody gets to think about and discuss the problem.
--
SUSE Labs, Novell Inc.
Nick Piggin wrote:
> [email protected] wrote:
>
>> The first idea is to use this for UML - it must create a lot of single
>> page
>> mappings, and managing them through separate VMAs is slow.
[...]
> Let's try to get back to the good old days when people actually reported
> their bugs (together with *real* numbers) to the mailing lists. That way,
> everybody gets to think about and discuss the problem.
Speaking of which, let's see some numbers for UML -- performance
and memory. I don't doubt your claims, but I (and others) would be
interested to see.
Thanks
PS. I'll be away for the next few days.
--
SUSE Labs, Novell Inc.
* Nick Piggin <[email protected]> wrote:
> >Let's try to get back to the good old days when people actually reported
> >their bugs (together with *real* numbers) to the mailing lists. That way,
> >everybody gets to think about and discuss the problem.
>
> Speaking of which, let's see some numbers for UML -- performance and
> memory. I don't doubt your claims, but I (and others) would be
> interested to see.
firstly, thanks for the review feedback!
originally i tested this feature with some minimal amount of RAM
simulated by UML, 128MB or so. That's just 32 thousand pages, but still
the improvement was massive: context-switch times in UML were cut in
half or more. Process-creation times improved 10-fold. With this feature
included I accidentally (for the first time ever!) confused an UML shell
prompt with a real shell prompt. (before that UML was so slow [even in
"skas mode"] that you'd immediately notice it by the shell's behavior)
the 'have 1 vma instead of 32,000 vmas' thing is a really, really big
plus. It makes UML comparable to Xen, in rough terms of basic VM design.
Now imagine a somewhat larger setup - 16 GB RAM UML instance with 4
million vmas per UML process ... Frankly, without
sys_remap_file_pages_prot() the UML design is still somewhat of a toy.
Ingo
On Tue, 2006-05-02 at 13:45 +1000, Nick Piggin wrote:
> [email protected] wrote:
>
> > The first idea is to use this for UML - it must create a lot of single page
> > mappings, and managing them through separate VMAs is slow.
>
> I don't know about this. The patches add some complexity, I guess because
> we now have vmas which cannot communicate the protectedness of the pages.
> Still, nobody was too concerned about nonlinear mappings doing the same
> for addressing. But this does seem to add more overhead to the common cases
> in the VM :(
>
> Now I didn't follow the earlier discussions on this much, but let me try
> making a few silly comments to get things going again (cc'ed linux-mm).
>
> I think I would rather this all just folded under VM_NONLINEAR rather than
> having this extra MANYPROTS thing, no? (you're already doing that in one
> direction).
<snip>
One way I've seen this done on other systems is to use something like a
prio tree [e.g., see the shared policy support for shmem] for sub-vma
protection ranges. Most vmas [I'm guessing here] will have only the
original protections or will be reprotected in toto. So, one need only
allocate/populate the protection tree when sub-vma protections are
requested. Then, one can test protections via the vma, perhaps with
access/check macros to hide the existence of the protection tree. Of
course, adding a tree-like structure could introduce locking
complications/overhead in some paths where we'd rather not [just
guessing again]. Might be more overhead than just mucking with the ptes
[for UML], but would keep the ptes in sync with the vma's view of
"protectedness".
Lee
Ingo Molnar wrote:
> originally i tested this feature with some minimal amount of RAM
> simulated by UML, 128MB or so. That's just 32 thousand pages, but still
> the improvement was massive: context-switch times in UML were cut in
> half or more. Process-creation times improved 10-fold. With this feature
> included I accidentally (for the first time ever!) confused an UML shell
> prompt with a real shell prompt. (before that UML was so slow [even in
> "skas mode"] that you'd immediately notice it by the shell's behavior)
Cool, thanks for the numbers.
>
> the 'have 1 vma instead of 32,000 vmas' thing is a really, really big
> plus. It makes UML comparable to Xen, in rough terms of basic VM design.
>
> Now imagine a somewhat larger setup - 16 GB RAM UML instance with 4
> million vmas per UML process ... Frankly, without
> sys_remap_file_pages_prot() the UML design is still somewhat of a toy.
Yes, I guess I had imagined the common case might be slightly better;
however, with reasonable RAM utilisation, fragmentation means I wouldn't
be surprised if it easily gets close to that theoretical worst case.
My request for numbers was more about the Intel/glibc people than Paolo:
I do realise it is a problem for UML. I just like to see nice numbers :)
I think UML's really neat, so I'd love to see this get in. I don't see
any fundamental sticking point, given a few iterations, and some more
discussion.
--
SUSE Labs, Novell Inc.
On Tuesday 02 May 2006 05:45, Nick Piggin wrote:
> [email protected] wrote:
> > This will be useful since the VMA lookup at fault time can be a
> > bottleneck for some programs (I've received a report about this from
> > Ulrich Drepper and I've been told that Val Henson from Intel is also
> > interested in this). I guess that since we use RB-trees, the slowness
> > is also due to the poor cache locality of RB-trees (the RB nodes live
> > inside the VMAs but aren't accessed together with their contents), compared
> > for instance with radix trees, where the lookup has high cache locality
> > (though they have space usage problems, possibly bigger ones, on 64-bit
> > machines).
> Let's try to get back to the good old days when people actually reported
> their bugs (together with *real* numbers) to the mailing lists. That way,
> everybody gets to think about and discuss the problem.
Indeed, I've not seen the numbers; I've been told of a problem with a "customer
program", and Ingo connected my work with this problem. Frankly, I've always
been astonished that looking up a 10-level tree can be slow. Poor
cache locality is the only explanation I could think of.
That said, it was an add-on, not the original motivation of the work.
--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
On Tuesday 02 May 2006 05:45, Nick Piggin wrote:
> [email protected] wrote:
> > The first idea is to use this for UML - it must create a lot of single
> > page mappings, and managing them through separate VMAs is slow.
> I think I would rather this all just folded under VM_NONLINEAR rather than
> having this extra MANYPROTS thing, no? (you're already doing that in one
> direction).
That step is _temporary_ if the extra usages are accepted.
Also, I reported (changelog of patch 03/14) a definite API bug you get if you
don't distinguish VM_MANYPROTS from VM_NONLINEAR. I'm pasting it here because
that changelog is rather long:
"In fact, without this flag, we'd have indeed a regression with
remap_file_pages VS mprotect, on uniform nonlinear VMAs.
mprotect alters the VMA prots and walks each present PTE, ignoring installed
ones, even when pte_file() is on; their saved prots will be restored on
faults,
ignoring VMA ones and losing the mprotect() on them. So, in do_file_page(), we
must restore anyway VMA prots when the VMA is uniform, as we used to do before
this trail of patches."
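To make the scenario concrete, the problematic sequence looks roughly like
this (userspace sketch, error checking omitted; the file name and sizes are
arbitrary):

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	int fd = open("/tmp/rfp-demo", O_RDWR | O_CREAT | O_TRUNC, 0600);
	char *p;

	ftruncate(fd, 4 * psz);
	p = mmap(NULL, 4 * psz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* make the vma nonlinear, protections still uniform:
	 * page 0 of the mapping now shows file page 2 */
	remap_file_pages(p, psz, 0, 2, 0);

	/* mprotect() updates the vma and the present ptes; a nonlinear
	 * page that is not present keeps its saved prots in the
	 * pte_file() pte, so the fault path must restore the vma prots
	 * for uniform vmas or this mprotect() is silently lost there */
	mprotect(p, 4 * psz, PROT_READ);

	printf("first byte: %d\n", p[0]);
	return 0;
}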
> > Additional note: this idea, with some further refinements (which I'll
> > code after this chunk is accepted), will allow us to reduce the number of
> > used VMAs for most userspace programs - in particular, it will let us
> > avoid creating one VMA per guard page (which has PROT_NONE) -
> > forcing PROT_NONE on that page will be enough.
> I think that's silly. Your VM_MANYPROTS|VM_NONLINEAR vmas will cause more
> overhead in faulting and reclaim.
I know that problem. In fact for that we want VM_MANYPROTS without
VM_NONLINEAR.
> It looks like it would take an hour or two just to code up a patch which
> puts a VM_GUARDPAGES flag into the vma, and tells the free area allocator
> to skip vm_start-1 .. vm_end+1
we must refine which pages to skip (the example I saw has only one guard page,
if I'm not mistaken) but
> . What kind of trouble has prevented
> something simple and easy like that from going in?
That's a rather better idea... It's just that the original proposal was wider,
and that we looked at the problem in the wrong way (plus we wanted to have
the present work merged anyway, so that wasn't a problem).
Ulrich wanted to have code+data (+guard on 64-bit) in the same VMA, but I
left the code+data VMA joining out, to think more about it, since currently
it's too slow on swapout.
The other part is avoiding guard VMAs for thread stacks, and that could be
accomplished by your proposal too - if this work is held out, however.
--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
On Tuesday 02 May 2006 19:16, Lee Schermerhorn wrote:
> On Tue, 2006-05-02 at 13:45 +1000, Nick Piggin wrote:
> > [email protected] wrote:
> > I think I would rather this all just folded under VM_NONLINEAR rather
> > than having this extra MANYPROTS thing, no? (you're already doing that in
> > one direction).
> One way I've seen this done on other systems
I'm curious, which ones?
> is to use something like a
> prio tree [e.g., see the shared policy support for shmem] for sub-vma
> protection ranges.
Which sub-vma ranges? The ones created with mprotect?
I'm curious what the difference is between this sub-tree and the main
tree... you have a point, but I'm missing which one :-) Actually, when doing a
lookup in the main tree the extra nodes in the subtree are not searched, so
you do get an advantage.
One possible point is that a VMA maps to one mmap() call (with splits from
mremap(), mprotect() and partial munmap()s), and then they use sub-VMAs instead
of VMA splits.
> Most vmas [I'm guessing here] will have only the
> original protections or will be reprotected in toto.
> So, one need only
> allocate/populate the protection tree when sub-vma protections are
> requested. Then, one can test protections via the vma, perhaps with
> access/check macros to hide the existence of the protection tree. Of
> course, adding a tree-like structure could introduce locking
> complications/overhead in some paths where we'd rather not [just
> guessing again]. Might be more overhead than just mucking with the ptes
> [for UML], but would keep the ptes in sync with the vma's view of
> "protectedness".
>
> Lee
OK, there are two different situations; overall I'm unconvinced until I
understand the usefulness of a separate sub-tree.
a) UML. The answer is _no_ to all guesses, since we must implement the page
tables of a guest virtual machine via mmap() or remap_file_pages. And they're
as fragmented as they get (we currently get one-page-wide VMAs).
b) the proposed glibc usage. Ulrich's original request (which I cut down
because of problems with objrmap) was to have one mapping per DSO, including
code, data and guard page. So you have three protections in one VMA.
However, this is doable via this remap_file_pages, adding something for
handling private VMAs (handling the movement of the anonymous memory you get
on writes); but it's slow on swapout, since it stops using objrmap. So I've
not planned to do it.
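For the UML case, the host-side pattern is essentially the following (a much
simplified sketch: the backing file name and page count are arbitrary, and
real UML differs in plenty of details):

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Guest "physical" memory lives in one host file; guest page tables are
 * implemented by mapping individual pages of that file at the right
 * guest virtual addresses, each with its own protections.  With plain
 * mmap() every page becomes its own vma; with remap_file_pages() plus
 * per-page prots it could all stay in a single vma. */
int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	int npages = 1024;			/* arbitrary */
	int fd = open("/tmp/uml-physmem", O_RDWR | O_CREAT | O_TRUNC, 0600);
	char *base;
	int i;

	ftruncate(fd, (off_t)npages * psz);
	base = mmap(NULL, npages * psz, PROT_NONE, MAP_SHARED, fd, 0);

	/* the mmap()-per-page variant: each call uses a different file
	 * offset and protection, so nothing can be merged */
	for (i = 0; i < npages; i++)
		mmap(base + i * psz, psz,
		     (i & 1) ? PROT_READ : PROT_READ | PROT_WRITE,
		     MAP_SHARED | MAP_FIXED, fd,
		     (off_t)(npages - 1 - i) * psz);

	printf("now look at wc -l /proc/%d/maps\n", getpid());
	pause();
	return 0;
}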
--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
On Wed, 2006-05-03 at 03:20 +0200, Blaisorblade wrote:
> On Tuesday 02 May 2006 19:16, Lee Schermerhorn wrote:
> > On Tue, 2006-05-02 at 13:45 +1000, Nick Piggin wrote:
> > > [email protected] wrote:
>
> > > I think I would rather this all just folded under VM_NONLINEAR rather
> > > than having this extra MANYPROTS thing, no? (you're already doing that in
> > > one direction).
>
> > One way I've seen this done on other systems
>
> I'm curious, which ones?
Let's see: System V [4.2MP] back in the days of USL [Unix Systems Labs]
did this, as did [does] Digital/Compaq/HP Tru64 on alpha. I'm not sure
if the latter came from the original Mach/OSF code or from Bob Picco's
rewrite thereof back in the 90's.
>
> > is to use something like a
> > prio tree [e.g., see the shared policy support for shmem] for sub-vma
> > protection ranges.
> Which sub-vma ranges? The ones created with mprotect?
>
> I'm curious about what is the difference between this sub-tree and the main
> tree... you have some point, but I miss which one :-) Actually when doing a
> lookup in the main tree the extra nodes in the subtree are not searched, so
> you get an advantage.
True, you still have to locate the protection range in the subtree and then
still walk the page tables to install the new protections. Of course,
in the bad old days, the only time I saw thousands or 10s of thousands
of different mappings/vmas/regions/whatever was in vm stress tests.
Until squid came along, that is. Then we encountered, on large memory
alpha systems, 100s of thousands of mmaped files. Had to replace the
linear [honest] vma/mapping list at that point ;-).
>
> One possible point is that a VMA maps to one mmap() call (with splits from
> mremap(),mprotect(),partial munmap()s), and then they use sub-VMAs instead of
> VMA splits.
Yeah. That was the point--in response to Nick's comment about the
disconnect between the protections as reported by the vma and the actual
pte protections.
>
> > Most vmas [I'm guessing here] will have only the
> > original protections or will be reprotected in toto.
>
> > So, one need only
> > allocate/populate the protection tree when sub-vma protections are
> > requested. Then, one can test protections via the vma, perhaps with
> > access/check macros to hide the existence of the protection tree. Of
> > course, adding a tree-like structure could introduce locking
> > complications/overhead in some paths where we'd rather not [just
> > guessing again]. Might be more overhead than just mucking with the ptes
> > [for UML], but would keep the ptes in sync with the vma's view of
> > "protectedness".
> >
> > Lee
>
> Ok, there are two different situations, I'm globally unconvinced until I
> understand the usefulness of a different sub-tree.
>
> a) UML. The answer is _no_ to all guesses, since we must implement page tables
> of a guest virtual machine via mmap() or remap_file_pages. And they're as
> fragmented as they get (we get one-page-wide VMAs currently).
>
> b) the proposed glibc usage. The original Ulrich's request (which I cut down
> because of problems with objrmap) was to have one mapping per DSO, including
> code,data and guard page. So you have three protections in one VMA.
>
> However, this is doable via this remap_file_pages, adding something for
> handling private VMAs (handling movement of the anonymous memory you get on
> writes); but it's slow on swapout, since it stops using objrmap. So I've not
> thought to do it.
Blaisorblade wrote:
> On Tuesday 02 May 2006 05:45, Nick Piggin wrote:
>
>>[email protected] wrote:
>>
>>>The first idea is to use this for UML - it must create a lot of single
>>>page mappings, and managing them through separate VMAs is slow.
>
>
>>I think I would rather this all just folded under VM_NONLINEAR rather than
>>having this extra MANYPROTS thing, no? (you're already doing that in one
>>direction).
>
>
> That step is _temporary_ if the extra usages are accepted.
Can we try to get the whole design right from the start?
>
> Also, I reported (changelog of patch 03/14) a definite API bug you get if you
> don't distinguish VM_MANYPROTS from VM_NONLINEAR. I'm pasting it here because
> that changelog is rather long:
>
> "In fact, without this flag, we'd have indeed a regression with
> remap_file_pages VS mprotect, on uniform nonlinear VMAs.
>
> mprotect alters the VMA prots and walks each present PTE, ignoring installed
> ones, even when pte_file() is on; their saved prots will be restored on
> faults,
> ignoring VMA ones and losing the mprotect() on them. So, in do_file_page(), we
> must restore anyway VMA prots when the VMA is uniform, as we used to do before
> this trail of patches."
It is only a bug because you hadn't plugged the hole -- make it fix up pte_file
ones as well.
> Ulrich wanted to have code+data (+guard on 64-bit) in the same VMA, but I
> left the code+data VMA joining out, to think more about it, since currently
> it's too slow on swapout.
Yes, and it would be ridiculous to do this with nonlinear protections anyway.
If the vma code is so slow that glibc wants to merge code and data vmas together,
then we obviously need to fix the data structure (which will help everyone)
rather than hacking around it.
>
> The other part is avoiding guard VMAs for thread stacks, and that could be
> accomplished by your proposal too - if this work is held out, however.
I see no reason why they couldn't both go in. In fact, having an mmap flag for
adding guard regions around vmas (and perhaps eg. a system-wide / per-process
option for stack) could almost go in tomorrow.
--
SUSE Labs, Novell Inc.
Nick Piggin wrote:
> I see no reason why they couldn't both go in. In fact, having an mmap
> flag for
> adding guard regions around vmas (and perhaps eg. a system-wide /
> per-process
> option for stack) could almost go in tomorrow.
This would have to be flexible, though. For thread stacks, at least,
the programmer is able to specify the size of the guard area. It can be
arbitrarily large.
Also, consider IA-64. Here we have two stacks. We allocate them with
one mmap call and put the guard somewhere in the middle (the optimal
ratio of CPU and register stack size is yet to be determined) and have
the stacks grow toward each other. This results in three VMAs at the
moment. Anything which results in more VMAs obviously isn't good.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Blaisorblade wrote:
> Indeed, I've not seen the numbers; I've been told of a problem with a "customer
> program", and Ingo connected my work with this problem. Frankly, I've always
> been astonished that looking up a 10-level tree can be slow. Poor
> cache locality is the only explanation I could think of.
It might be good if I explain a bit how much we use mmap in libc. The
numbers can really add up quickly.
- for each loaded DSO we might have up to 5 VMAs today:
1. text segment, protection r-xp, normally never written to
2. a gap, protection ---p (i.e., no access)
3. a relro data segment, protection r--p
4. a data segment, protection rw-p
5. a bss segment, anonymous memory
The first four are mapped from the file. In fact, the first segment
"allocates" the entire address space of all segments, even if that is longer
than the file.
The gap is done using mprotect(PROT_NONE). Then the area for segments 3
and 4 is mapped in one mmap() call. It's in the same file, but the
offset used in the mmap is not the same as the offset which is naturally
already established through the first mmap. I.e., if the first mmap()
were to start at offset 0 and continue for 1000 pages, the gap might start
at, say, an offset of 4 pages and continue for 500 pages. Then the
"natural" offset of the first data page would be 504 pages, but the second
mmap() call would in fact use offset 4 because the text and data segments
are contiguous in the _file_ (although not in memory).
Anyway, once relocations are done the protection of the relro segment is
changed, splitting the data segment in two.
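Spelled out as code, the sequence above is roughly this (illustrative sketch:
the page counts are invented and error handling is omitted):

#include <sys/mman.h>
#include <unistd.h>

/* Rough shape of what the dynamic linker does per DSO today; the result
 * is 5 vmas: text r-xp, gap ---p, relro r--p, data rw-p, bss (anon). */
static void map_dso(int fd)
{
	long psz = sysconf(_SC_PAGESIZE);
	char *base;

	/* 1. text: one mmap reserving the whole 1000-page span */
	base = mmap(NULL, 1000 * psz, PROT_READ | PROT_EXEC,
		    MAP_PRIVATE, fd, 0);

	/* 2. gap: mprotect(PROT_NONE) splits off a ---p vma */
	mprotect(base + 4 * psz, 500 * psz, PROT_NONE);

	/* 3+4. relro+data: second mmap at file offset 4 pages, not the
	 * "natural" 504, because text and data are contiguous in the
	 * file but not in memory */
	mmap(base + 504 * psz, 400 * psz, PROT_READ | PROT_WRITE,
	     MAP_PRIVATE | MAP_FIXED, fd, 4 * psz);

	/* 5. bss: anonymous tail */
	mmap(base + 904 * psz, 96 * psz, PROT_READ | PROT_WRITE,
	     MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS, -1, 0);

	/* after relocation: the relro part goes read-only, which splits
	 * the rw-p vma once more */
	mprotect(base + 504 * psz, 8 * psz, PROT_READ);
}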
So, for DSO loading there would be two steps of improvement:
1. if a mprotect() call wouldn't split the VMA we would have 3 VMAs in
the end instead of 5. 40% gain.
2. if I could use remap_file_pages() for the data segment mapping and
the call would allow changing the protection and it would not split the
VMAs, then we'd be down to 2 mappings. 60% down.
A second big VMA user is thread stacks. I think the application which
was briefly mentioned in this thread used literally thousands of
threads. Leaving aside the insanity of this (it's unfortunately how
many programmers work), this can create problems because we get at least
two (on IA-64 three) VMAs per thread. I.e., thread stack allocation
works like this:
1. allocate area big enough for stack and guard (we don't use automatic
growing, this cannot work)
2. change the protection of the guard end of the stack to PROT_NONE.
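In code, the two steps above boil down to something like this (sketch only;
sizes are arbitrary and the real NPTL code differs in details):

#include <sys/mman.h>
#include <stddef.h>

/* One mapping for stack plus guard, then the guard end is turned into
 * PROT_NONE.  That second step is the mprotect() that splits off the
 * extra vma per thread. */
static void *alloc_thread_stack(size_t stack_size, size_t guard_size)
{
	char *mem = mmap(NULL, stack_size + guard_size,
			 PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return NULL;
	/* guard at the low end: the stack grows down towards it */
	mprotect(mem, guard_size, PROT_NONE);
	return mem + guard_size;		/* usable stack starts here */
}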
So, for say 1000 threads we'll end up with 2000 VMAs. Threads are also
important to mention here because
- often they are short-lived and we have to recreate them often. We
usually reuse stacks but only keep so much allocated memory around.
So more often than we'd like we actually free and later re-allocate stacks.
- these thousands of stack VMAs are really used concurrently. All
threads are woken over a period of time.
A third source of VMAs is anonymous memory allocation. mmap is used in
the malloc implementation and directly in various places. For
randomization reasons there isn't really much we can do here, we
shouldn't lump all these allocations together.
A fourth source of VMAs are the programs themselves which mmap files.
Often read-only mappings of many small files.
Put all this together and non-trivial apps as written today (I don't say
they are high-quality apps) can easily have a few thousand, maybe even
10,000 to 20,000 VMAs. Firefox on my machine uses at the moment ~560
VMAs, and this is with only a handful of threads. Are these the numbers
the VM system is optimized for? I think what our people running the
experiments at the customer site saw is that it's not. The VMA
traversal showed up on the profile lists.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Ulrich Drepper wrote:
> Blaisorblade wrote:
>
>>Indeed, I've not seen the numbers; I've been told of a problem with a "customer
>>program", and Ingo connected my work with this problem. Frankly, I've always
>>been astonished that looking up a 10-level tree can be slow. Poor
>>cache locality is the only explanation I could think of.
>
>
> It might be good if I explain a bit how much we use mmap in libc. The
> numbers can really add up quickly.
[...]
Thanks. Very informative.
> Put all this together and non-trivial apps as written today (I don't say
> they are high-quality apps) can easily have a few thousand, maybe even
> 10,000 to 20,000 VMAs. Firefox on my machine uses at the moment ~560
> VMAs, and this is with only a handful of threads. Are these the numbers
> the VM system is optimized for? I think what our people running the
> experiments at the customer site saw is that it's not. The VMA
> traversal showed up on the profile lists.
Your % improvement numbers are of course only talking about memory
usage improvements. Time complexity increases with the log of the
number of VMAs, so while a search within 100,000 vmas might have a CPU
cost of about 16 arbitrary units (log2(100,000) ~= 17), it is only about
300% of the cost with 40 vmas (log2(40) ~= 5), and not the 250,000% that
the raw ratio of vma counts suggests.
Definitely reducing vmas would be good. If guard ranges around vmas
can be implemented easily and reduce vmas by even 20%, it would come
at an almost zero complexity cost to the kernel.
However, I think another consideration is the vma lookup cache. I need
to get around to looking at this again, but IMO it is inadequate for
threaded applications. Currently we have one last-lookup cached vma
for each mm. You get cacheline bouncing when updating the cache, and
the locality becomes almost useless.
I think possibly each thread should have a private vma cache, with
room for at least its stack vma(s), (and several others, eg. code,
data). Perhaps the per-mm cache could be dispensed with completely,
although it might be useful eg. for the heap. And it might be helped
with increased entries as well.
I've got patches lying around to implement this stuff -- I'd be
interested to have more detail about this problem, or distilled test
cases.
Nick
--
SUSE Labs, Novell Inc.
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c 2006-05-13 23:31:13.000000000 +1000
+++ linux-2.6/mm/mmap.c 2006-05-13 23:48:53.000000000 +1000
@@ -30,6 +30,99 @@
#include <asm/cacheflush.h>
#include <asm/tlb.h>
+static void vma_cache_touch(struct mm_struct *mm, struct vm_area_struct *vma)
+{
+ struct task_struct *curr = current;
+ if (mm == curr->mm) {
+ int i;
+ if (curr->vma_cache_sequence != mm->vma_sequence) {
+ curr->vma_cache_sequence = mm->vma_sequence;
+ curr->vma_cache[0] = vma;
+ for (i = 1; i < 4; i++)
+ curr->vma_cache[i] = NULL;
+ } else {
+ int update_mm;
+
+ if (curr->vma_cache[0] == vma)
+ return;
+
+ for (i = 1; i < 4; i++) {
+ if (curr->vma_cache[i] == vma)
+ break;
+ }
+ update_mm = 0;
+ if (i == 4) {
+ update_mm = 1;
+ i = 3;
+ }
+ while (i != 0) {
+ curr->vma_cache[i] = curr->vma_cache[i-1];
+ i--;
+ }
+ curr->vma_cache[0] = vma;
+
+ if (!update_mm)
+ return;
+ }
+ }
+
+ if (mm->vma_cache != vma) /* prevent cacheline bouncing */
+ mm->vma_cache = vma;
+}
+
+static void vma_cache_replace(struct mm_struct *mm, struct vm_area_struct *vma,
+ struct vm_area_struct *repl)
+{
+ mm->vma_sequence++;
+ if (unlikely(mm->vma_sequence == 0)) {
+ struct task_struct *curr = current, *t;
+ t = curr;
+ rcu_read_lock();
+ do {
+ t->vma_cache_sequence = -1;
+ t = next_thread(t);
+ } while (t != curr);
+ rcu_read_unlock();
+ }
+
+ if (mm->vma_cache == vma)
+ mm->vma_cache = repl;
+}
+
+static inline void vma_cache_invalidate(struct mm_struct *mm, struct vm_area_struct *vma)
+{
+ vma_cache_replace(mm, vma, NULL);
+}
+
+static struct vm_area_struct *vma_cache_find(struct mm_struct *mm,
+ unsigned long addr)
+{
+ struct task_struct *curr;
+ struct vm_area_struct *vma;
+
+ inc_page_state(vma_cache_query);
+
+ curr = current;
+ if (mm == curr->mm && mm->vma_sequence == curr->vma_cache_sequence) {
+ int i;
+ for (i = 0; i < 4; i++) {
+ vma = curr->vma_cache[i];
+ if (vma && vma->vm_end > addr && vma->vm_start <= addr){
+ inc_page_state(vma_cache_hit);
+ return vma;
+ }
+ }
+ }
+
+ vma = mm->vma_cache;
+ if (vma && vma->vm_end > addr && vma->vm_start <= addr) {
+ inc_page_state(vma_cache_hit);
+ return vma;
+ }
+
+ return NULL;
+}
+
static void unmap_region(struct mm_struct *mm,
struct vm_area_struct *vma, struct vm_area_struct *prev,
unsigned long start, unsigned long end);
@@ -460,8 +553,6 @@
{
prev->vm_next = vma->vm_next;
rb_erase(&vma->vm_rb, &mm->mm_rb);
- if (mm->mmap_cache == vma)
- mm->mmap_cache = prev;
}
/*
@@ -586,6 +677,7 @@
* us to remove next before dropping the locks.
*/
__vma_unlink(mm, next, vma);
+ vma_cache_replace(mm, next, vma);
if (file)
__remove_shared_vm_struct(next, file, mapping);
if (next->anon_vma)
@@ -1384,8 +1476,8 @@
if (mm) {
/* Check the cache first. */
/* (Cache hit rate is typically around 35%.) */
- vma = mm->mmap_cache;
- if (!(vma && vma->vm_end > addr && vma->vm_start <= addr)) {
+ vma = vma_cache_find(mm, addr);
+ if (!vma) {
struct rb_node * rb_node;
rb_node = mm->mm_rb.rb_node;
@@ -1405,9 +1497,9 @@
} else
rb_node = rb_node->rb_right;
}
- if (vma)
- mm->mmap_cache = vma;
}
+ if (vma)
+ vma_cache_touch(mm, vma);
}
return vma;
}
@@ -1424,6 +1516,14 @@
if (!mm)
goto out;
+ vma = vma_cache_find(mm, addr);
+ if (vma) {
+ rb_node = rb_prev(&vma->vm_rb);
+ if (rb_node)
+ prev = rb_entry(rb_node, struct vm_area_struct, vm_rb);
+ goto out;
+ }
+
/* Guard against addr being lower than the first VMA */
vma = mm->mmap;
@@ -1445,6 +1545,9 @@
}
out:
+ if (vma)
+ vma_cache_touch(mm, vma);
+
*pprev = prev;
return prev ? prev->vm_next : vma;
}
@@ -1686,6 +1789,7 @@
insertion_point = (prev ? &prev->vm_next : &mm->mmap);
do {
+ vma_cache_invalidate(mm, vma);
rb_erase(&vma->vm_rb, &mm->mm_rb);
mm->map_count--;
tail_vma = vma;
@@ -1698,7 +1802,6 @@
else
addr = vma ? vma->vm_start : mm->mmap_base;
mm->unmap_area(mm, addr);
- mm->mmap_cache = NULL; /* Kill the cache. */
}
/*
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h 2006-05-13 23:31:05.000000000 +1000
+++ linux-2.6/include/linux/page-flags.h 2006-05-13 23:31:44.000000000 +1000
@@ -164,6 +164,9 @@
unsigned long pgrotated; /* pages rotated to tail of the LRU */
unsigned long nr_bounce; /* pages for bounce buffers */
+
+ unsigned long vma_cache_hit;
+ unsigned long vma_cache_query;
};
extern void get_page_state(struct page_state *ret);
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2006-05-13 23:31:05.000000000 +1000
+++ linux-2.6/mm/page_alloc.c 2006-05-13 23:31:44.000000000 +1000
@@ -2389,6 +2389,9 @@
"pgrotated",
"nr_bounce",
+
+ "vma_cache_hit",
+ "vma_cache_query",
};
static void *vmstat_start(struct seq_file *m, loff_t *pos)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h 2006-05-13 23:31:03.000000000 +1000
+++ linux-2.6/include/linux/sched.h 2006-05-13 23:33:01.000000000 +1000
@@ -293,9 +293,11 @@
} while (0)
struct mm_struct {
- struct vm_area_struct * mmap; /* list of VMAs */
+ struct vm_area_struct *mmap; /* list of VMAs */
struct rb_root mm_rb;
- struct vm_area_struct * mmap_cache; /* last find_vma result */
+ struct vm_area_struct *vma_cache;
+ unsigned long vma_sequence;
+
unsigned long (*get_unmapped_area) (struct file *filp,
unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags);
@@ -734,6 +736,8 @@
struct list_head ptrace_list;
struct mm_struct *mm, *active_mm;
+ struct vm_area_struct *vma_cache[4];
+ unsigned long vma_cache_sequence;
/* task state */
struct linux_binfmt *binfmt;
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c 2006-05-13 23:31:03.000000000 +1000
+++ linux-2.6/kernel/fork.c 2006-05-13 23:32:59.000000000 +1000
@@ -197,7 +197,7 @@
mm->locked_vm = 0;
mm->mmap = NULL;
- mm->mmap_cache = NULL;
+ mm->vma_sequence = oldmm->vma_sequence+1;
mm->free_area_cache = oldmm->mmap_base;
mm->cached_hole_size = ~0UL;
mm->map_count = 0;
@@ -238,6 +238,10 @@
tmp->vm_next = NULL;
anon_vma_link(tmp);
file = tmp->vm_file;
+
+ if (oldmm->vma_cache == mpnt)
+ mm->vma_cache = tmp;
+
if (file) {
struct inode *inode = file->f_dentry->d_inode;
get_file(file);
On Sun, May 14, 2006 at 12:13:21AM +1000, Nick Piggin wrote:
>
> OK, I got interested again, but can't get Val's ebizzy to give me
> a find_vma constrained workload yet (though the numbers back up
> my assertion that the vma cache is crap for threaded apps).
Hey Nick,
Glad to see you're using it! There are (at least) two ways to do what
you want:
1. Increase the number of threads - this gives you two vma's per
thread, one for stack, one for guard page:
$ ./ebizzy -t 100
2. Apply the patch at the end of this email and use -p "prevent
coalescing", -m "always mmap" and appropriate number of chunks,
size, and records to search - this works for me:
$ ./ebizzy -p -m -n 10000 -s 4096 -r 100000
The original program mmapped everything with the same permissions and
no alignment restrictions, so all the mmaps were coalesced into one.
This version alternates PROT_WRITE permissions on the mmap'd areas
after they are written, so you get lots of vma's:
val@goober:~/ebizzy$ ./ebizzy -p -m -n 10000 -s 4096 -r 100000
[2]+ Stopped ./ebizzy -p -m -n 10000 -s 4096 -r 100000
val@goober:~/ebizzy$ wc -l /proc/`pgrep ebizzy`/maps
10019 /proc/10917/maps
I haven't profiled to see if this brings find_vma to the top, though.
(The patch also moves around some other stuff so that options are in
alphabetical order; apparently I thought 's' came after 'r' and before
'R'...)
-VAL
--- ebizzy.c.old 2006-05-13 10:18:58.000000000 -0700
+++ ebizzy.c 2006-05-13 11:01:42.000000000 -0700
@@ -52,9 +52,10 @@
static unsigned int always_mmap;
static unsigned int never_mmap;
static unsigned int chunks;
+static unsigned int prevent_coalescing;
static unsigned int records;
-static unsigned int chunk_size;
static unsigned int random_size;
+static unsigned int chunk_size;
static unsigned int threads;
static unsigned int verbose;
static unsigned int linear;
@@ -76,9 +77,10 @@
"-m\t\t Always use mmap instead of malloc\n"
"-M\t\t Never use mmap\n"
"-n <num>\t Number of memory chunks to allocate\n"
+ "-p \t\t Prevent mmap coalescing\n"
"-r <num>\t Total number of records to search for\n"
- "-s <size>\t Size of memory chunks, in bytes\n"
"-R\t\t Randomize size of memory to copy and search\n"
+ "-s <size>\t Size of memory chunks, in bytes\n"
"-t <num>\t Number of threads\n"
"-v[v[v]]\t Be verbose (more v's for more verbose)\n"
"-z\t\t Linear search instead of binary search\n",
@@ -98,7 +100,7 @@
cmd = argv[0];
opterr = 1;
- while ((c = getopt(argc, argv, "mMn:r:s:Rt:vz")) != -1) {
+ while ((c = getopt(argc, argv, "mMn:pr:Rs:t:vz")) != -1) {
switch (c) {
case 'm':
always_mmap = 1;
@@ -111,19 +113,22 @@
if (chunks == 0)
usage();
break;
+ case 'p':
+ prevent_coalescing = 1;
+ break;
case 'r':
records = atoi(optarg);
if (records == 0)
usage();
break;
+ case 'R':
+ random_size = 1;
+ break;
case 's':
chunk_size = atoi(optarg);
if (chunk_size == 0)
usage();
break;
- case 'R':
- random_size = 1;
- break;
case 't':
threads = atoi(optarg);
if (threads == 0)
@@ -141,7 +146,7 @@
}
if (verbose)
- printf("ebizzy 0.1, Copyright 2006 Intel Corporation\n"
+ printf("ebizzy 0.2, Copyright 2006 Intel Corporation\n"
"Written by Val Henson <[email protected]\n");
/*
@@ -173,10 +178,11 @@
printf("always_mmap %u\n", always_mmap);
printf("never_mmap %u\n", never_mmap);
printf("chunks %u\n", chunks);
+ printf("prevent coalescing %u\n", prevent_coalescing);
printf("records %u\n", records);
printf("records per thread %u\n", records_per_thread);
- printf("chunk_size %u\n", chunk_size);
printf("random_size %u\n", random_size);
+ printf("chunk_size %u\n", chunk_size);
printf("threads %u\n", threads);
printf("verbose %u\n", verbose);
printf("linear %u\n", linear);
@@ -251,9 +257,13 @@
{
int i, j;
- for (i = 0; i < chunks; i++)
+ for (i = 0; i < chunks; i++) {
for(j = 0; j < chunk_size / record_size; j++)
mem[i][j] = (record_t) j;
+ /* Prevent coalescing by alternating permissions */
+ if (prevent_coalescing && (i % 2) == 0)
+ mprotect(mem[i], chunk_size, PROT_READ);
+ }
if (verbose)
printf("Wrote memory\n");
}
On Sat, May 13, 2006 at 11:19:46AM -0700, Valerie Henson wrote:
> The original program mmapped everything with the same permissions and
> no alignment restrictions, so all the mmaps were coalesced into one.
> This version alternates PROT_WRITE permissions on the mmap'd areas
> after they are written, so you get lots of vma's:
... Which is of course exactly the case that Blaisorblade's patches
should coalesce into one vma. So I wrote another option which uses
holes instead - takes more memory initially, unfortunately. Grab it
from:
http://www.nmt.edu/~val/patches/ebizzy.tar.gz
-p for preventing coalescing via protections, -P for preventing via
holes.
-VAL
Valerie Henson wrote:
> On Sun, May 14, 2006 at 12:13:21AM +1000, Nick Piggin wrote:
>
>>OK, I got interested again, but can't get Val's ebizzy to give me
>>a find_vma constrained workload yet (though the numbers back up
>>my assertion that the vma cache is crap for threaded apps).
>
>
> Hey Nick,
>
> Glad to see you're using it! There are (at least) two ways to do what
> you want:
>
> 1. Increase the number of threads - this gives you two vma's per
> thread, one for stack, one for guard page:
>
> $ ./ebizzy -t 100
>
> 2. Apply the patch at the end of this email and use -p "prevent
> coalescing", -m "always mmap" and appropriate number of chunks,
> size, and records to search - this works for me:
>
> $ ./ebizzy -p -m -n 10000 -s 4096 -r 100000
>
> The original program mmapped everything with the same permissions and
> no alignment restrictions, so all the mmaps were coalesced into one.
> This version alternates PROT_WRITE permissions on the mmap'd areas
> after they are written, so you get lots of vma's:
>
> val@goober:~/ebizzy$ ./ebizzy -p -m -n 10000 -s 4096 -r 100000
>
> [2]+ Stopped ./ebizzy -p -m -n 10000 -s 4096 -r 100000
> val@goober:~/ebizzy$ wc -l /proc/`pgrep ebizzy`/maps
> 10019 /proc/10917/maps
>
> I haven't profiled to see if this brings find_vma to the top, though.
>
Hi Val,
Thanks, I've tried with your most recent ebizzy and with 256 threads and
50,000 vmas (which gives really poor mmap_cache hits), and I'm still unable
to get find_vma above a few % of kernel time.
With 50,000 vmas, my per-thread vma cache is much less effective, I guess
because access is pretty random (hopefully more realistic patterns would
get a bigger improvement).
I also tried running kbuild under UML, and could not make find_vma take
much time either [in this case, the per-thread vma cache patch roughly
doubles the number of hits, from about 15%->30% (in the host)].
So I guess it's time to go back into my hole. If anyone does come across
a find_vma constrained workload (especially with threads), I'd be very
interested.
Thanks,
Nick
--
SUSE Labs, Novell Inc.
Hi,
On Tue, May 16, 2006 at 11:30:32PM +1000, Nick Piggin wrote:
> I also tried running kbuild under UML, and could not make find_vma take
> much time either [in this case, the per-thread vma cache patch roughly
> doubles the number of hits, from about 15%->30% (in the host)].
>
> So I guess it's time to go back into my hole. If anyone does come across
> a find_vma constrained workload (especially with threads), I'd be very
> interested.
I cannot offer much other than some random confirmation that from my own
oprofiling, whatever I did (often running a load test script consisting of
launching 30 big apps at the same time), find_vma basically always showed up
very prominently in the list of vmlinux-based code (always ranking within the
top 4 or 5 kernel hotspots, such as timer interrupts, ACPI idle I/O etc.pp.).
call-tracing showed it originating from mmap syscalls etc., and AFAIR quite
some find_vma activity from oprofile itself.
Profiling done on 512MB UP Athlon and P3/700, 2.6.16ish, current Debian.
Sorry for the foggy report, I don't have those logs here right now.
So yes, improving that part should help in general, but I cannot quite
say that my machines are "constrained" by it.
But you probably knew that already, otherwise you wouldn't have poked
in there... ;)
Andreas Mohr
On Tue, May 16, 2006 at 03:51:35PM +0200, Andreas Mohr wrote:
>
> I cannot offer much other than some random confirmation that from my own
> oprofiling, whatever I did (often running a load test script consisting of
> launching 30 big apps at the same time), find_vma basically always showed up
> very prominently in the list of vmlinux-based code (always ranking within the
> top 4 or 5 kernel hotspots, such as timer interrupts, ACPI idle I/O etc.pp.).
> call-tracing showed it originating from mmap syscalls etc., and AFAIR quite
> some find_vma activity from oprofile itself.
This is important: Which kernel?
The cases I saw it in were in a (now old) SuSE kernel which as it
turns out had old/different vma lookup code.
Thanks,
-VAL
On Tue, May 16, 2006 at 11:30:32PM +1000, Nick Piggin wrote:
>
> Hi Val,
>
> Thanks, I've tried with your most recent ebizzy and with 256 threads and
> 50,000 vmas (which gives really poor mmap_cache hits), I'm still unable
> to get find_vma above a few % of kernel time.
How excellent! Sometimes negative results are worth publishing. :)
-VAL
Hi,
On Tue, May 16, 2006 at 09:31:12AM -0700, Valerie Henson wrote:
> On Tue, May 16, 2006 at 03:51:35PM +0200, Andreas Mohr wrote:
> >
> > I cannot offer much other than some random confirmation that from my own
> > oprofiling, whatever I did (often running a load test script consisting of
> > launching 30 big apps at the same time), find_vma basically always showed up
> > very prominently in the list of vmlinux-based code (always ranking within the
> > top 4 or 5 kernel hotspots, such as timer interrupts, ACPI idle I/O etc.pp.).
> > call-tracing showed it originating from mmap syscalls etc., and AFAIR quite
> > some find_vma activity from oprofile itself.
>
> This is important: Which kernel?
I had some traces still showing find_vma prominently during a profiling run
just yesterday, with a very fresh 2.6.17-rc4-ck1 (IOW, basically 2.6.17-rc4).
I added some cache prefetching in the list traversal a while ago, and IIRC
that improved profiling times there, but cache prefetching is very often
a bandaid in search of a real solution: a better data-handling algorithm.
Andreas Mohr
Andreas Mohr wrote:
>Hi,
>
>On Tue, May 16, 2006 at 09:31:12AM -0700, Valerie Henson wrote:
>
>>On Tue, May 16, 2006 at 03:51:35PM +0200, Andreas Mohr wrote:
>>
>>>I cannot offer much other than some random confirmation that from my own
>>>oprofiling, whatever I did (often running a load test script consisting of
>>>launching 30 big apps at the same time), find_vma basically always showed up
>>>very prominently in the list of vmlinux-based code (always ranking within the
>>>top 4 or 5 kernel hotspots, such as timer interrupts, ACPI idle I/O etc.pp.).
>>>call-tracing showed it originating from mmap syscalls etc., and AFAIR quite
>>>some find_vma activity from oprofile itself.
>>>
>>This is important: Which kernel?
>>
>
>I had some traces still showing find_vma prominently during a profiling run
>just yesterday, with a very fresh 2.6.17-rc4-ck1 (IOW, basically 2.6.17-rc4).
>I added some cache prefetching in the list traversal a while ago, and IIRC
>that improved profiling times there, but cache prefetching is very often
>a bandaid in search for a real solution: a better data-handling algorithm.
>
If you want to try out the patch and see what it does for you, that would be
interesting. I'll repost a slightly cleaned up version in a couple of hours.
Nick
On Tuesday 16 May 2006 18:47, Andreas Mohr wrote:
> Hi,
>
> On Tue, May 16, 2006 at 09:31:12AM -0700, Valerie Henson wrote:
> > On Tue, May 16, 2006 at 03:51:35PM +0200, Andreas Mohr wrote:
> > > I cannot offer much other than some random confirmation that from my
> > > own oprofiling, whatever I did (often running a load test script
> > > consisting of launching 30 big apps at the same time), find_vma
> > > basically always showed up very prominently in the list of
> > > vmlinux-based code (always ranking within the top 4 or 5 kernel
> > > hotspots, such as timer interrupts, ACPI idle I/O etc.pp.).
> > > call-tracing showed it originating from mmap syscalls etc., and AFAIR
> > > quite some find_vma activity from oprofile itself.
> >
> > This is important: Which kernel?
I'd also add (for everyone): on which processors? L2 cache size probably
plays an important role if (as I'm convinced) the problem is cache misses
during rb-tree traversal.
> I had some traces still showing find_vma prominently during a profiling run
> just yesterday, with a very fresh 2.6.17-rc4-ck1 (IOW, basically
> 2.6.17-rc4). I added some cache prefetching in the list traversal a while
> ago,
You mean the rb-tree traversal, I guess! Or was the base kernel so old?
> and IIRC that improved profiling times there, but cache prefetching is
> very often a bandaid in search for a real solution: a better data-handling
> algorithm.
OK, I've finally found the time to kick in and ask a couple of questions.
The current algorithm is good but has poor cache locality (IMHO).
First, since you can get find_vma on the profile, I've read (the article
talked about userspace apps but I think it applies to kernelspace too) that
oprofile can trace L2 cache misses.
I think such profiling, if possible, would be particularly interesting:
there's no reason whatsoever why that lookup, even on a 32-level tree
(the theoretical maximum, since we have at most 64K vmas and height_rbtree <= 2 log N),
should be so slow, unless you add cache misses to the picture. The fact
that cache prefetching helps points in the same direction.
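Something along these lines should work (illustrative only; the exact event
name is CPU-specific and has to be picked from the ophelp listing):

$ ophelp                                # lists the event names for this CPU
$ opcontrol --setup --vmlinux=/path/to/vmlinux \
            --event=<L2-miss-event-from-ophelp>:100000
$ opcontrol --start
  ... run the workload ...
$ opcontrol --shutdown
$ opreport -l /path/to/vmlinux | head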
The lookup has very poor cache locality: the rb-node (3 pointers, i.e. 12
bytes on 32-bit, of which only 2 pointers are needed during a search) is
surrounded by irrelevant data that we fetch anyway (we don't need the VMA
itself for the nodes we merely traverse).
For cache locality the best data structures I know of are radix trees; but
changing the implementation is absolutely non-trivial (the find_vma_prev()
and friends API is tightly coupled with the rb-tree), and the size of the
tree grows with the virtual address space (which is a problem on 64-bit
archs); finally, you get locality when you do multiple searches, especially
for the nodes near the root, but not across different levels inside a single
search.
--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade