> This will be useful since the VMA lookup at fault time can be a bottleneck for
> some programs (I've received a report about this from Ulrich Drepper, and I've
> been told that Val Henson from Intel is also interested in this).
I've not seen much of this, if any at all; the various caches that are in
place for these lookups seem to function quite well. What we did see was
glibc's malloc implementation being mistuned, resulting in far more
mmaps than needed (which in turn leads to far too much page zeroing,
and that is the really expensive part. It's not the VMA lookup that is
expensive, it's the page zeroing).
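
A rough sketch of the kind of tuning being described here: glibc's mallopt()
knobs control when malloc falls back to fresh, zero-filled mmap()ed chunks.
The threshold and allocation sizes below are illustrative assumptions, not
numbers from this thread.

/* Sketch: raise glibc malloc's mmap threshold so medium-sized allocations
 * are served from the recycled heap instead of fresh mmap()ed pages that
 * the kernel must zero again on every new mapping.  512 KiB is only an
 * illustrative value. */
#include <malloc.h>
#include <stdlib.h>

int main(void)
{
    /* Requests larger than M_MMAP_THRESHOLD bytes go straight to mmap();
     * each such chunk is unmapped on free() and re-zeroed when allocated
     * again, which is the expensive part. */
    mallopt(M_MMAP_THRESHOLD, 512 * 1024);

    /* Optionally cap how many mmap()ed chunks malloc may hold at once. */
    mallopt(M_MMAP_MAX, 64);

    for (int i = 0; i < 1000; i++) {
        void *p = malloc(256 * 1024);   /* now reuses heap memory */
        free(p);                        /* ...instead of munmap()ing it */
    }
    return 0;
}
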
On Tuesday 02 May 2006 12:21, Arjan van de Ven wrote:
> > This will be useful since the VMA lookup at fault time can be a
> > bottleneck for some programs (I've received a report about this from
> > Ulrich Drepper, and I've been told that Val Henson from Intel is also
> > interested in this).
> I've not seen much of this, if any at all; the various caches that are in
> place for these lookups seem to function quite well. What we did see was
> glibc's malloc implementation being mistuned, resulting in far more
> mmaps than needed (which in turn leads to far too much page zeroing,
> and that is the really expensive part. It's not the VMA lookup that is
> expensive, it's the page zeroing).
I hope Ulrich will reply to this email as well.
--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
Blaisorblade wrote:
>> I've not seen much of this, if any at all; the various caches that are in
>> place for these lookups seem to function quite well. What we did see was
>> glibc's malloc implementation being mistuned, resulting in far more
>> mmaps than needed (which in turn leads to far too much page zeroing,
>> and that is the really expensive part. It's not the VMA lookup that is
>> expensive, it's the page zeroing).
> I hope Ulrich will reply to this email as well.
All I can say is that some of our guys tuning a big application at a
customer's site reported seeing the VMA lookups show up in the profile.
This was some huge Java program. It might be that every other page had
a different protection: executable or not, read-only mmap, etc. And the
data access pattern was very non-local.
I cannot say more since this was near the end of the trials.
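
To make that scenario concrete, here is a small standalone demonstration (a
sketch of my own, not code from this thread): giving every other page of one
anonymous mapping a different protection splits it into many VMAs, which can
be seen by counting the lines in /proc/self/maps.

/* Sketch (not from this thread): alternating page protections split one
 * mapping into many VMAs, the kind of layout speculated about above. */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static int count_vmas(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[512];
    int n = 0;

    while (f && fgets(line, sizeof(line), f))
        n++;
    if (f)
        fclose(f);
    return n;
}

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t npages = 64;
    char *base = mmap(NULL, npages * page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (base == MAP_FAILED)
        return 1;

    printf("VMAs before: %d\n", count_vmas());

    /* Make every other page read-only: one mapping becomes ~npages VMAs. */
    for (size_t i = 0; i < npages; i += 2)
        mprotect(base + i * page, page, PROT_READ);

    printf("VMAs after:  %d\n", count_vmas());
    return 0;
}
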
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
On Tue, May 02, 2006 at 12:21:06PM +0200, Arjan van de Ven wrote:
>
> > This will be useful since the VMA lookup at fault time can be a bottleneck for
> > some programs (I've received a report about this from Ulrich Drepper, and I've
> > been told that Val Henson from Intel is also interested in this).
>
> I've not seen much of this, if any at all; the various caches that are in
> place for these lookups seem to function quite well. What we did see was
> glibc's malloc implementation being mistuned, resulting in far more
> mmaps than needed (which in turn leads to far too much page zeroing,
> and that is the really expensive part. It's not the VMA lookup that is
> expensive, it's the page zeroing).
VMA lookup time hasn't been noticeable on the systems we're running
ebizzy[1] on, which is the workload Arjan is talking about. I did see it
with a customer application last year, on a kernel without the RB tree
for looking up VMAs. My vague recollection of the oprofile results was
something on the order of 2-10% of CPU time spent in find_vma() and
similar VMA handling functions. This was on an application with hundreds
of VMAs due to malloc() being tuned to use mmap() too often. Arjan
and I whomped up a patch to glibc to fix the root cause; for details
see:
http://sourceware.org/ml/libc-alpha/2006-03/msg00033.html
A more legitimate situation resulting in 100's of vma's is the JVM
case - you end up with at least 2 vma's per thread, for the stack and
guard page. I personally have seen a JVM with over 100 threads in
active use and over 500 vma's. (One of the nice things about your
patch is that it will eliminate the separate guard page vma, reducing
the number of necessary vma's by one per thread.)
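
A rough way to observe this per-thread cost (my own illustration, not part of
Val's mail) is to start a batch of threads and count the mappings in
/proc/self/maps before and after; with a non-zero guard size each thread
contributes a stack VMA plus an adjacent PROT_NONE guard VMA.

/* Illustration (not from this thread): count VMAs before and after
 * starting many threads; each thread stack plus its guard page shows
 * up as additional entries in /proc/self/maps.  Build with: cc -pthread */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 100

static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    (void)arg;
    pthread_barrier_wait(&barrier);   /* keep the stack mapped while counting */
    pthread_barrier_wait(&barrier);   /* then allow the thread to exit */
    return NULL;
}

static int count_vmas(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[512];
    int n = 0;

    while (f && fgets(line, sizeof(line), f))
        n++;
    if (f)
        fclose(f);
    return n;
}

int main(void)
{
    pthread_t tids[NTHREADS];

    pthread_barrier_init(&barrier, NULL, NTHREADS + 1);
    printf("VMAs before threads: %d\n", count_vmas());

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);

    pthread_barrier_wait(&barrier);   /* all thread stacks are now mapped */
    printf("VMAs with %d threads: %d\n", NTHREADS, count_vmas());

    pthread_barrier_wait(&barrier);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}
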
I intend to go back and look for find_vma() and friends while running
ebizzy, since I wrote the darn thing partly to expose that problem;
however, I suspect it's mostly gone now that we have RB trees for
VMAs. If you want to use ebizzy to evaluate your patches, just run
it with the -M "always mmap" argument and tune the various thread and
memory allocation options until you get a suitable number of VMAs.
-VAL
[1] ebizzy is an application I wrote to replicate this kind of
workload. For more info, see:
http://infohost.nmt.edu/~val/patches.html#ebizzy