2001-03-20 18:33:13

by Serge Orlov

Subject: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

Hi,
I upgraded one of our computers, which was happily running a 2.2.13
kernel, to 2.4.2. Everything was OK, but the compilation time of our C++
project increased greatly (1.4 times slower). I investigated the
issue and found that g++ spends 7 times more time in the kernel.
The reason for this is a big vm map:

cat /proc/15677/maps |wc -l
2238

15677 is the cc1plus process; its map looks like this:
.....
4014a000-4014b000 rw-p 00000000 00:00 0
4014b000-4014c000 rw-p 00000000 00:00 0
4014c000-4014d000 rw-p 00000000 00:00 0
4014d000-4014e000 rw-p 00000000 00:00 0
4014e000-4014f000 rw-p 00000000 00:00 0
4014f000-40150000 rw-p 00000000 00:00 0
40150000-40151000 rw-p 00000000 00:00 0
40151000-40152000 rw-p 00000000 00:00 0
......
strace:
.....
15677 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40019000
15677 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40019000
15677 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001a000
15677 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001b000
15677 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001c000
15677 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
.......

2.4.2:
time g++ -g -Wall -I. -c t_instr.cpp -o t_instr.o

real 0m3.434s
user 0m2.790s
sys 0m0.530s

2.2.13:
time g++ -g -Wall -I. -c t_instr.cpp -o t_instr.o

real 0m3.167s
user 0m2.950s
sys 0m0.070s

I noticed a recent thread (Re: Kernel is unstable) in the archives that
ended with this message from Linus:
--- quote ---
Ehh.. If the merging doesn't actually happen, it's always a loss. We've
just spent CPU cycles on doing something useless. And in my tests, that
was the case a lot more than not.

Also, in the expense of taking a page fault, looking one or two levels
deeper in the AVL tree is pretty much not noticeable.

Show me numbers for real applications, and I might care. I saw barely
measurable speedups (and more importantly to me - real simplification)
by removing it.

Don't bother arguing with "it might.."

Linus
--- end of quote ----

OK, the numbers are here. g++ is 2.96 from RedHat 7.0.
Please, CC me, as I'm not on the list.

Serge.
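
The mapping pattern in the strace output can be reproduced outside of
g++. The following is a minimal sketch in C, assuming a Linux system
with /proc mounted. The function names count_map_lines and
map_single_pages are made up for illustration and are not taken from
g++ or glibc. It issues the same one-page anonymous mmap() calls and
counts the lines in /proc/self/maps.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes MAP_ANONYMOUS under strict -std modes */
#endif
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Count the lines in /proc/self/maps, i.e. the number of mappings the
 * kernel currently tracks for this process. Returns -1 on error. */
long count_map_lines(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    long lines = 0;
    int c;

    if (f == NULL)
        return -1;
    while ((c = fgetc(f)) != EOF)
        if (c == '\n')
            lines++;
    fclose(f);
    return lines;
}

/* Request `pages` anonymous private mappings of `page_size` bytes each,
 * the same pattern as the old_mmap() calls in the strace output.
 * The mappings are deliberately left in place.
 * Returns the number of mmap() calls that succeeded. */
size_t map_single_pages(size_t pages, size_t page_size)
{
    size_t ok = 0;

    for (size_t i = 0; i < pages; i++) {
        void *p = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p != MAP_FAILED)
            ok++;
    }
    return ok;
}
```

Calling count_map_lines() before and after map_single_pages(2000, 4096)
shows whether the kernel merged the new areas: if it did not, the count
should grow by roughly one line per page, as in the 2238-line map above.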



2001-03-20 18:45:03

by Linus Torvalds

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.



On Tue, 20 Mar 2001, Serge Orlov wrote:
>
> I upgraded one of our computers happily running 2.2.13 kernel
> to 2.4.2. Everything was OK, but compilation time of our C++
> project greatly increased (1.4 times slower). I investigated the
> issue and found that g++ spends 7 times more time in kernel.
> The reason for this is big vm map:

Cool. Somebody actually found a real case.

I'll fix the mmap case asap. It's not hard, I just waited to see if it
ever actually triggers. Something like g++ certainly counts as major.

Are you willing to test out patches?

Linus

2001-03-20 18:44:53

by Jakob Oestergaard

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

On Tue, Mar 20, 2001 at 09:28:57PM +0300, Serge Orlov wrote:
> Hi,
> I upgraded one of our computers happily running 2.2.13 kernel
> to 2.4.2. Everything was OK, but compilation time of our C++
> project greatly increased (1.4 times slower). I investigated the
> issue and found that g++ spends 7 times more time in kernel.

I see the *exact* same problem. Large C++ codes, and gcc spending most of the
CPU time in kernel.

> The reason for this is big vm map:
>
> cat /proc/15677/maps |wc -l
> 2238

Exactly what I see too. 200 MB of memory allocated in 4K maps...

There is an easy fix: In libiberty in GCC we could change xmalloc()
to do real malloc instead of calloc(). I think that would fix it.

Or glibc could be fixed to make calloc() behave more reasonably
when it's called with tons and tons of 4K allocations.

Or the kernel could be fixed to merge maps.

...
> .....
> 15677 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40019000

hear hear !

...
>
> OK, the numbers are here. g++ is 2.96 from RedHat 7.0.
> Please, CC me, as I'm not on the list.

gcc 2.96 here too.

Should we take this up with the glibc or gcc folks, or should
someone fix the kernel ?

This *is* a very significant performance problem for a standard tool.

--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:

2001-03-20 19:00:33

by Jakob Oestergaard

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

On Tue, Mar 20, 2001 at 10:43:33AM -0800, Linus Torvalds wrote:
>
>
> On Tue, 20 Mar 2001, Serge Orlov wrote:
> >
> > I upgraded one of our computers happily running 2.2.13 kernel
> > to 2.4.2. Everything was OK, but compilation time of our C++
> > project greatly increased (1.4 times slower). I investigated the
> > issue and found that g++ spends 7 times more time in kernel.
> > The reason for this is big vm map:
>
> Cool. Somebody actually found a real case.
>
> I'll fix the mmap case asap. It's not hard, I just waited to see if it
> ever actually triggers. Something like g++ certainly counts as major.

Uber-cool ! :)

>
> Are you willing to test out patches?

Definitely.

--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:

2001-03-21 01:21:56

by buhr

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

Linus Torvalds <[email protected]> writes:
>
> Cool. Somebody actually found a real case.
>
> I'll fix the mmap case asap. It's not hard, I just waited to see if it
> ever actually triggers. Something like g++ certainly counts as major.

I frequently build Mozilla from scratch on my (aging) dual Celeron
machine. That's about 65 megs of actual C++ source, and it takes
about an hour of real time to compile. I see times for the whole
build like this:

real 60m4.574s
user 101m18.260s
sys 3m23.520s

with gcc 2.95.2 20000220 (Debian GNU/Linux) under Linux 2.4.2.

The sys-to-user ratio seems much closer to Serge's 2.2.13 numbers than
his 2.4.2 numbers, and I'm wondering why.

If I recall correctly, RedHat's 2.96 was a modified development
snapshot of GCC 3.0, not an official GCC release. If this is just a
quirk in 2.96 that can be fixed before the official release of 3.0 by
a trivial patch to libiberty, maybe your original hunch was right and
the kernel should be left as-is.

> Are you willing to test out patches?

I'm willing to help test out the patch; I'd be curious to see what
effect it has on the performance of 2.95.2.

Kevin <[email protected]>

2001-03-21 01:40:15

by David Miller

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.


Kevin Buhr writes:
> If I recall correctly, RedHat's 2.96 was a modified development
> snapshot of GCC 3.0, not an official GCC release. If this is just a
> quirk in 2.96 that can be fixed before the official release of 3.0 by
> a trivial patch to libiberty, maybe your original hunch was right and
> the kernel should be left as-is.

It is the garbage collector scheme used for memory allocation in gcc
>=2.96 that triggers the bad cases seen by Serge.

Later,
David S. Miller
[email protected]

2001-03-21 02:02:18

by Dieter Nützel

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

Linus Torvalds <[email protected]> writes:
>
> Cool. Somebody actually found a real case.
>
> I'll fix the mmap case asap. It's not hard, I just waited to see if it
> ever actually triggers. Something like g++ certainly counts as major.

I do daily builds of the VTK CVS tree (The Visualization Toolkit,
http://www.kitware.com/vtk.html, a huge 3D app).

~33 MB C++ source

It took ~1 hour on my K7 550, 256 MB, IBM DTL-307030, glibc-2.2 and
gcc-2.95.2 ( 19991024 (release)) under most of the 2.4-test kernels (all with
ReiserFS) for a whole rebuild.
Now it takes nearly one and a half hours with 2.4.2-ac20.
BTW, my mouse (PS/2) is very sluggish during C++ compilations now.

I am open to all of your patches. Or should I better say most :-)))

Cheers,
Dieter
--
Dieter Nützel
Graduate Student, Computer Science

University of Hamburg
Department of Computer Science
Cognitive Systems Group
Vogt-Kölln-Straße 30
D-22527 Hamburg, Germany

email: [email protected]
@home: [email protected]

2001-03-21 06:42:47

by Mike Galbraith

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

On 20 Mar 2001, Kevin Buhr wrote:

> Linus Torvalds <[email protected]> writes:
> >
> > Cool. Somebody actually found a real case.
> >
> > I'll fix the mmap case asap. It's not hard, I just waited to see if it
> > ever actually triggers. Something like g++ certainly counts as major.
>
> I frequently build Mozilla from scratch on my (aging) dual Celeron
> machine. That's about 65 megs of actual C++ source, and it takes
> about an hour of real time to compile. I see times for the whole
> build like this:
>
> real 60m4.574s
> user 101m18.260s <-- impossible no?
> sys 3m23.520s

Why do numbers like this show up? I noticed some of this after having
enabled SMP on my UP box.

-Mike

2001-03-21 14:58:01

by Matthias Urlichs

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

> > I frequently build Mozilla from scratch on my (aging) dual Celeron
> > machine. [...]
> > real 60m4.574s
> > user 101m18.260s <-- impossible no?
> > sys 3m23.520s
>
> Why do numbers like this show up? I noticed some of this after having
> enabled SMP on my UP box.
>
Now why would that be impossible on a two-CPU system?

2001-03-21 15:06:31

by Mike Galbraith

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

On Wed, 21 Mar 2001, Matthias Urlichs wrote:

> > > I frequently build Mozilla from scratch on my (aging) dual Celeron
> > > machine. [...]
> > > real 60m4.574s
> > > user 101m18.260s <-- impossible no?
> > > sys 3m23.520s
> >
> > Why do numbers like this show up? I noticed some of this after having
> > enabled SMP on my UP box.
> >
> Now why would that be impossible on a two-CPU system?

zzzt. Right.. impossible on a UP box.

-Mike

2001-03-21 16:12:43

by Kurt Garloff

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

On Wed, Mar 21, 2001 at 07:41:55AM +0100, Mike Galbraith wrote:
> On 20 Mar 2001, Kevin Buhr wrote:
> > real 60m4.574s
> > user 101m18.260s <-- impossible no?
> > sys 3m23.520s
>
> Why do numbers like this show up? I noticed some of this after having
> enabled SMP on my UP box.

As you have two CPUs, you can spend more time in CPU than your wall clock
shows if you time multithreaded processes or multiple processes. At most
(ideal case) twice as much.

Regards,
--
Kurt Garloff <[email protected]> Eindhoven, NL
GPG key: See mail header, key servers Linux kernel development
SuSE GmbH, Nuernberg, FRG SCSI, Security
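
The arithmetic can be checked against Kevin's numbers. Below is a small
sketch in C; the helper names minutes_seconds and cpu_to_wall_ratio are
hypothetical. It converts time(1) readings to seconds and divides CPU
time by wall-clock time.

```c
/* Convert a time(1)-style "XmY.ZZZs" reading to seconds. */
double minutes_seconds(double minutes, double seconds)
{
    return minutes * 60.0 + seconds;
}

/* Ratio of CPU time consumed (user + sys) to wall-clock time elapsed.
 * All arguments are in seconds; returns 0 if real_s is not positive. */
double cpu_to_wall_ratio(double real_s, double user_s, double sys_s)
{
    if (real_s <= 0.0)
        return 0.0;
    return (user_s + sys_s) / real_s;
}
```

For the Mozilla build quoted above, (101m18.260s + 3m23.520s) /
60m4.574s comes to about 1.74, which is under the ideal-case limit of 2
for a two-CPU machine.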



2001-03-21 16:46:35

by Mike Galbraith

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

On Wed, 21 Mar 2001, Kurt Garloff wrote:

> On Wed, Mar 21, 2001 at 07:41:55AM +0100, Mike Galbraith wrote:
> > On 20 Mar 2001, Kevin Buhr wrote:
> > > real 60m4.574s
> > > user 101m18.260s <-- impossible no?
> > > sys 3m23.520s
> >
> > Why do numbers like this show up? I noticed some of this after having
> > enabled SMP on my UP box.
>
> As you have two CPUs, you can spend more time in CPU than your wall clock
> shows if you time multithreaded processes or multiple processes. At most
> (ideal case) twice as much.

Yes. I'm so used to UP numbers I didn't think. I saw user larger than
real on my UP box yesterday during some testing, and then seeing this
post... oops.

-Mike

2001-03-21 20:17:12

by buhr

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

Mike Galbraith <[email protected]> writes:
>
> Yes. I'm so used to UP numbers I didn't think. I saw user larger than
> real on my UP box yesterday during some testing, and then seeing this
> post... oops.

Okay, so you see "user > real" on a UP box running an SMP kernel.

First, I'm not really familiar with this part of the kernel, but as I
understand things (and others will correct me if I'm wrong) ...

The "real time" is calculated by subtracting the "gettimeofday" before
and after running the process. The "user" and "system" times are
sampled times updated every timer tick.

A discrepancy of a hundredth of a second is perfectly normal.
"gettimeofday" uses a neat trick to get microsecond accuracy, but the
user and system times only have one timer tick (1/HZ=.01sec on i386)
resolution. For this reason, any CPU intensive program can give
slightly (within .01sec or so) higher user than real:

buhr@saurus:~/src/cpuburn/cpuburn-1.2$ time ./burnP6
real 0m6.438s
user 0m6.440s
sys 0m0.000s
^C
buhr@saurus:~/src/cpuburn/cpuburn-1.2$

If your discrepancy is bigger than a couple of hundredths of a second, it
gets more complicated.

In an SMP kernel, the jiffies are updated by the "do_timer" function,
and the timer bottom half uses the jiffies to update the time of day.
On the other hand, the user and system times are updated by the
"smp_local_timer_interrupt".

On an SMP motherboard (one with an APIC), "do_timer" is invoked by
timer ticks from the dedicated timer chip, but
"smp_local_timer_interrupt" is invoked by a timer on the APIC chip.
These two timers will run at nearly the same speed (HZ times per
second), but not exactly. If the APIC timer is significantly faster,
you can have user+system>real on an SMP motherboard, even though it
only has one processor installed!

So, the first question is, does your "UP" box really have a UP-only
motherboard? That is, in your bootup messages, do you see a line like
this:

Mar 5 15:32:28 mozart kernel: SMP motherboard not detected. Using
dummy APIC emulation.

If you don't see such a line, this might be the problem: the real time
is based on a different timer than the user and system times.
I believe the APIC timer is based on bus frequency. If you're over-
or under-clocking your board, you may see huge discrepancies.

If you *do* see the emulation message, then "do_timer" and
"smp_local_timer_interrupt" are both called exactly once on every
timer tick, so there is no discrepancy possible there.

However, the "gettimeofday" time isn't just based on the jiffies
count. The time adjustment parameters (set by the adjtimex(2) system
call) can modify the "gettimeofday" time away from what would normally
be calculated from jiffies alone. If you are running a time daemon,
like NTP, if you've run "ntpdate" at bootup and a time adjustment is
in progress, or if you've used the "adjtimex" utility directly to make
your system clock more accurate, then that could also account for the
discrepancy.

In any event, if the discrepancy is large: if user, for a
single-threaded process, exceeds the real time by more than 1% (or a
few hundredths of a second, whichever is greater) on any system, I
think this indicates a serious problem.

Kevin <[email protected]>
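
The two kinds of measurement described above can be compared from user
space. This is a minimal sketch in C, assuming a POSIX system; the
struct timing type and the time_busy_loop function are hypothetical
names for illustration. Real time is taken from gettimeofday() before
and after a busy loop, and user and system time come from getrusage(),
which reports whatever the running kernel has accounted to the process.

```c
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>

struct timing {
    double real_s;   /* wall-clock time, from gettimeofday() */
    double user_s;   /* user CPU time, from getrusage() */
    double sys_s;    /* system CPU time, from getrusage() */
};

/* Convert a struct timeval to seconds as a double. */
static double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Spin for `iterations` loop passes and report how long that took
 * according to the wall clock and according to CPU accounting. */
struct timing time_busy_loop(unsigned long iterations)
{
    struct timeval start, end;
    struct rusage before, after;
    struct timing t;
    volatile unsigned long sink = 0;  /* volatile keeps the loop alive */

    gettimeofday(&start, NULL);
    getrusage(RUSAGE_SELF, &before);
    for (unsigned long i = 0; i < iterations; i++)
        sink += i;
    getrusage(RUSAGE_SELF, &after);
    gettimeofday(&end, NULL);

    t.real_s = tv_seconds(end) - tv_seconds(start);
    t.user_s = tv_seconds(after.ru_utime) - tv_seconds(before.ru_utime);
    t.sys_s  = tv_seconds(after.ru_stime) - tv_seconds(before.ru_stime);
    return t;
}
```

For a single-threaded loop like this one, user_s exceeding real_s by
more than a tick or so would be the kind of discrepancy discussed above.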

2001-03-21 20:20:42

by buhr

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

"David S. Miller" <[email protected]> writes:
>
> It is the garbage collector scheme used for memory allocation in gcc
> >=2.96 that triggers the bad cases seen by Serge.

Ahhh! Thanks for the info.

I'm still happy to help test out the patch, but I guess it's not
likely to affect my 2.95.2 numbers much at all. Maybe I can get a
snapshot of GCC 3.0 up and running, though, and test that out.

Thanks.

Kevin <[email protected]>

2001-03-22 09:06:07

by Mike Galbraith

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

On 21 Mar 2001, Kevin Buhr wrote:

> Mike Galbraith <[email protected]> writes:
> >
> > Yes. I'm so used to UP numbers I didn't think. I saw user larger than
> > real on my UP box yesterday during some testing, and then seeing this
> > post... oops.
>
> Okay, so you see "user > real" on a UP box running an SMP kernel.

On ac20 I see this (has rw_mmap_sem patch in place tho..), but not on
2.4.3-pre6 with Linus' deadlock fix.

[snip nice explanation.. thanks] box is genuine UP btw.

> In any event, if the discrepancy is large: if user, for a
> single-threaded process, exceeds the real time by more than 1% (or a
> few hundredths of a second, whichever is greater) on any system, I
> think this indicates a serious problem.

Let me check virgin ac20 and see what it does.

        2.4.2.ac20.virgin    2.4.3-pre6
real    11m0.708s            11m58.617s
user    15m8.720s            7m29.970s
sys     1m31.410s            0m41.590s

It looks like ac20 is doing some double accounting.

-Mike

(fwiw, the smp/up numbers suck rocks compared to up/up)

2001-03-22 18:24:37

by buhr

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

[email protected] (Kevin Buhr) writes:
>
> "David S. Miller" <[email protected]> writes:
> >
> > It is the garbage collector scheme used for memory allocation in gcc
> > >=2.96 that triggers the bad cases seen by Serge.
>
> Ahhh! Thanks for the info.
>
> I'm still happy to help test out the patch, but I guess it's not
> likely to affect my 2.95.2 numbers much at all. Maybe I can get a
> snapshot of GCC 3.0 up and running, though, and test that out.

I pulled the "gcc-3_0-branch" of GCC from CVS and compiled Mozilla
under a 2.4.2 kernel. The numbers I saw were:

real 57m26.850s
user 96m57.490s
sys 3m16.780s

which are almost exactly the same as my GCC 2.95.2 numbers. When I
peeked at "/proc/<cc1plus>/maps" a few times, I counted ~150 lines,
not ~2000. On another, much smaller block of C++ code, I get similar
results: no dramatic change in kernel time.

Either the Mozilla codebase and my other test case don't tickle the
problem, or something has changed in 3.0's allocation scheme since
RedHat 2.96 was built.

Kevin <[email protected]>

2001-03-22 18:37:47

by Jakob Oestergaard

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

On Thu, Mar 22, 2001 at 12:23:15PM -0600, Kevin Buhr wrote:
> [email protected] (Kevin Buhr) writes:
...
> I pulled the "gcc-3_0-branch" of GCC from CVS and compiled Mozilla
> under a 2.4.2 kernel. The numbers I saw were:
>
> real 57m26.850s
> user 96m57.490s
> sys 3m16.780s
>
> which are almost exactly the same as my GCC 2.95.2 numbers. When I
> peeked at "/proc/<cc1plus>/maps" a few times, I counted ~150 lines,
> not ~2000. On another, much smaller block of C++ code, I get similar
> results: no dramatic change in kernel time.
>
> Either the Mozilla codebase and my other test case don't tickle the
> problem, or something has changed in 3.0's allocation scheme since
> RedHat 2.96 was built.

Mozilla uses C++ mainly as "extended C" - due to compatibility concerns.

Try compiling something like Qt/KDE/gtk-- which are really heavy on
templates (with all the benefits and drawbacks of that).

My code here is quite template heavy, and I suspect that's what's triggering
it. In fact, I can't compile our development code with optimization, because
GCC runs out of memory (it only allocates some 300-500 MB, but each page has
its own map in /proc/pid/maps, and a wc -l /proc/pid/maps doesn't finish for
minutes). My typical GCC eats 100-200 MB and runs for several minutes.

You should benchmark this particular case with code that makes GCC eat
lots of memory, 100MB or more. I've never seen Mozilla really make GCC
eat that much memory - other projects do.

--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:

2001-03-22 22:21:19

by buhr

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

Mike Galbraith <[email protected]> writes:
>
>         2.4.2.ac20.virgin    2.4.3-pre6
> real    11m0.708s            11m58.617s
> user    15m8.720s            7m29.970s
> sys     1m31.410s            0m41.590s
>
> It looks like ac20 is doing some double accounting.

Alan:

In "2.4.2-ac20", the check in "apic.c" in the "APIC_init_uniprocessor"
function to avoid initializing the APIC is:

	if (!smp_found_config && !cpu_has_apic)
		return -1;

However, in "arch/i386/time.c", we use the following check:

	if (!smp_found_config)
		smp_local_timer_interrupt(regs);

to see if we need to emulate an smp_local_timer_interrupt from
"do_timer_interrupt".

In Mike's case, I think we have smp_found_config == 0 but cpu_has_apic
== 1, so we're telling the CPU APIC to generate smp_local_timer_interrupts,
and then we're also emulating them on normal timer ticks. That
doubles the rate at which "smp_local_timer_interrupt" is called,
doubling the process user and system time accounting.

Mike, would you like to try out the following (untested) patch against
vanilla ac20 to see if it does the trick?

Kevin <[email protected]>

* * *

diff -ru linux-2.4.2-ac20/arch/i386/kernel/apic.c linux-2.4.2-ac20-local/arch/i386/kernel/apic.c
--- linux-2.4.2-ac20/arch/i386/kernel/apic.c Thu Mar 22 12:36:02 2001
+++ linux-2.4.2-ac20-local/arch/i386/kernel/apic.c Thu Mar 22 15:59:08 2001
@@ -30,6 +30,9 @@
 #include <asm/mpspec.h>
 #include <asm/pgalloc.h>
 
+/* Using APIC to generate smp_local_timer_interrupt? */
+int using_apic_timer = 0;
+
 int prof_multiplier[NR_CPUS] = { 1, };
 int prof_old_multiplier[NR_CPUS] = { 1, };
 int prof_counter[NR_CPUS] = { 1, };
@@ -884,6 +887,8 @@
 
 	/* and update all other cpus */
 	smp_call_function(setup_APIC_timer, (void *)calibration_result, 1, 1);
+
+	using_apic_timer = 1;
 }
 
 /*
diff -ru linux-2.4.2-ac20/arch/i386/kernel/time.c linux-2.4.2-ac20-local/arch/i386/kernel/time.c
--- linux-2.4.2-ac20/arch/i386/kernel/time.c Thu Mar 22 12:36:03 2001
+++ linux-2.4.2-ac20-local/arch/i386/kernel/time.c Thu Mar 22 16:03:02 2001
@@ -422,7 +422,7 @@
 	if (!user_mode(regs))
 		x86_do_profile(regs->eip);
 #else
-	if (!smp_found_config)
+	if (!using_apic_timer)
 		smp_local_timer_interrupt(regs);
 #endif
 
diff -ru linux-2.4.2-ac20/include/asm-i386/smp.h linux-2.4.2-ac20-local/include/asm-i386/smp.h
--- linux-2.4.2-ac20/include/asm-i386/smp.h Sun Mar 4 21:35:03 2001
+++ linux-2.4.2-ac20-local/include/asm-i386/smp.h Thu Mar 22 16:07:28 2001
@@ -34,6 +34,7 @@
 extern unsigned long cpu_online_map;
 extern volatile unsigned long smp_invalidate_needed;
 extern int pic_mode;
+extern int using_apic_timer;
 extern void smp_flush_tlb(void);
 extern void smp_message_irq(int cpl, void *dev_id, struct pt_regs *regs);
 extern void smp_send_reschedule(int cpu);
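
The double counting described in this message can be modelled in a few
lines. This is a toy sketch in C, not kernel code: it assumes the APIC
timer, once set up, fires exactly once per timer tick, and the function
names are hypothetical. It counts how often smp_local_timer_interrupt
would run under the ac20 check and under the patched check.

```c
/* Mirrors the test in APIC_init_uniprocessor: the local APIC (and, in
 * this model, its timer) is set up unless both flags are zero. */
int apic_timer_running(int smp_found_config, int cpu_has_apic)
{
    return !(!smp_found_config && !cpu_has_apic);
}

/* Accounting calls over `ticks` timer ticks with the ac20 logic. */
long accounting_calls_ac20(int smp_found_config, int cpu_has_apic,
                           long ticks)
{
    long calls = 0;

    if (apic_timer_running(smp_found_config, cpu_has_apic))
        calls += ticks;   /* interrupts generated by the APIC timer */
    if (!smp_found_config)
        calls += ticks;   /* emulation from do_timer_interrupt */
    return calls;
}

/* Accounting calls with the patched logic: emulate only when the APIC
 * timer is not in use. */
long accounting_calls_patched(int smp_found_config, int cpu_has_apic,
                              long ticks)
{
    int using_apic_timer = apic_timer_running(smp_found_config,
                                              cpu_has_apic);
    long calls = 0;

    if (using_apic_timer)
        calls += ticks;   /* interrupts generated by the APIC timer */
    if (!using_apic_timer)
        calls += ticks;   /* emulation from do_timer_interrupt */
    return calls;
}
```

With smp_found_config == 0 and cpu_has_apic == 1, the ac20 model counts
two calls per tick and the patched model counts one, which matches the
roughly doubled user time in Mike's ac20 numbers.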

2001-03-23 04:34:04

by buhr

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

Jakob Østergaard <[email protected]> writes:
>
> Try compiling something like Qt/KDE/gtk-- which are really heavy on
> templates (with all the benefits and drawbacks of that).

Okay, I just compiled gtk-- 1.0.3 (with CFLAGS = "-O2 -g") under three
versions of GCC (Debian 2.95.3, RedHat 2.96, and a CVS pull of the
"gcc-3_0-branch") on the same Debian machine running kernel 2.4.2.

In all cases, the "cc1plus" processes appeared to max out around 25M
total size. The "maps" pseudofiles for the 2.95.3 and 3.0
compiles never grew past 250 lines, but the "maps" pseudofiles for the
RedHat 2.96 compile were gigantic, jumping to 3000 or 5000 lines at
times.

The results speak for themselves:

               CVS gcc 3.0    Debian gcc 2.95.3    RedHat gcc 2.96

real           16m8.423s      8m2.417s             12m24.939s
user           15m23.710s     7m22.200s            10m14.420s
sys            0m48.730s      0m41.040s            2m13.910s
maps lines     <250           <250                 >3000

Obviously, the *real* problem is RedHat GCC 2.96. If Linus bothers to
write this patch (he probably already has), its only proven benefit so
far is that it improves the performance of a RedHat-specific, orphaned
G++ development snapshot that everyone (the people of RedHat most of
all) will be glad to be rid of as soon as possible.

The numbers above suggest that the patch is unlikely to have a
positive impact on the performance of either officially released GCC
versions or the upcoming 3.0 release.

Drifting off topic...

> Mozilla uses C++ mainly as "extended C" - due to compatibility concerns.

This statement is potentially misleading.

I think most people will believe you to mean "using C++ as a better C"
in the sense of Stroustrup: using the small, conventional-language
subset of C++ that looks like C but has stronger type checking,
function and operator overloading, default arguments, "//" style
comments, reference types, and other syntactic and semantic sugar.

Mozilla does not use C++ as "extended C" in this sense. While it does
use a *subset* of C++ for compatibility reasons, the subset includes
extensive use of class lattices and polymorphism as well as extensive
(albeit simple and carefully constructed) uses of templates for its
utility classes, including string and component-autoreferencing
template classes and functions that are used throughout the source.
The only major C++ facilities that are not used are the standard
library, RTTI, namespaces, and exception handling, but other than that
it's a good, real-world C++ test case.

Kevin <[email protected]>

2001-03-23 07:46:09

by Mike Galbraith

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

On 22 Mar 2001, Kevin Buhr wrote:

> Mike Galbraith <[email protected]> writes:
> >
> >         2.4.2.ac20.virgin    2.4.3-pre6
> > real    11m0.708s            11m58.617s
> > user    15m8.720s            7m29.970s
> > sys     1m31.410s            0m41.590s
> >
> > It looks like ac20 is doing some double accounting.

[snip]

> Mike, would you like to try out the following (untested) patch against
> vanilla ac20 to see if it does the trick?

Yes, that fixed it.

-Mike


2001-03-23 20:42:31

by James Lewis Nance

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

On Thu, Mar 22, 2001 at 07:35:49PM +0100, Jakob Østergaard wrote:

> My code here is quite template heavy, and I suspect that's what's triggering
> it. In fact, I can't compile our development code with optimization, because
> GCC runs out of memory (it only allocates some 300-500 MB, but each page has
> its own map in /proc/pid/maps, and a wc -l /proc/pid/maps doesn't finish for
> minutes). My typical GCC eats 100-200 MB and runs for several minutes.

Would it be possible for you to post the preprocessor output to this list?
It would be quite nice to have this testcase.

Jim

2001-03-23 21:37:15

by buhr

Subject: 2.4.2-ac20 patch for process time double-counting (was: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.)

Mike Galbraith <[email protected]> writes:
>
> > Mike, would you like to try out the following (untested) patch against
> > vanilla ac20 to see if it does the trick?
>
> Yes, that fixed it.

Great! Can you test one more configuration, please? I can't test it
properly with my SMP motherboard. Under "ac20", if you disable:

Symmetric multi-processing support (CONFIG_SMP)

you'll get to say yes to:

APIC support on uniprocessors (CONFIG_X86_UP_APIC)

If you say yes to that, you'll also get to say yes to:

IO-APIC support on uniprocessors (CONFIG_X86_UP_IOAPIC)

Can you check that the following patch against vanilla "ac20" works
correctly with SMP disabled and X86_UP_APIC enabled? (The original
patch I gave you won't compile with this configuration, since I put
the declaration in the wrong include file.) It shouldn't matter
whether X86_UP_IOAPIC is enabled or disabled.

In addition to checking that the sys/user times look right, please
check for the message:

Using local APIC timer interrupts.

in your boot messages (I *don't* think it'll be there, but I'm not
sure, and I'd really like to know one way or the other). In fact, if
you could send me your kernel messages up to the PCI probe, that would
be ideal.

Thanks muchly!

Kevin <[email protected]>

* * *

diff -ru linux-2.4.2-ac20-vanilla/arch/i386/kernel/apic.c linux-2.4.2-ac20/arch/i386/kernel/apic.c
--- linux-2.4.2-ac20-vanilla/arch/i386/kernel/apic.c Fri Mar 23 14:21:47 2001
+++ linux-2.4.2-ac20/arch/i386/kernel/apic.c Fri Mar 23 15:12:15 2001
@@ -30,6 +30,9 @@
 #include <asm/mpspec.h>
 #include <asm/pgalloc.h>
 
+/* Using APIC to generate smp_local_timer_interrupt? */
+int using_apic_timer = 0;
+
 int prof_multiplier[NR_CPUS] = { 1, };
 int prof_old_multiplier[NR_CPUS] = { 1, };
 int prof_counter[NR_CPUS] = { 1, };
@@ -872,6 +875,9 @@
 
 void __init setup_APIC_clocks (void)
 {
+	printk("Using local APIC timer interrupts.\n");
+	using_apic_timer = 1;
+
 	__cli();
 
 	calibration_result = calibrate_APIC_clock();
diff -ru linux-2.4.2-ac20-vanilla/arch/i386/kernel/time.c linux-2.4.2-ac20/arch/i386/kernel/time.c
--- linux-2.4.2-ac20-vanilla/arch/i386/kernel/time.c Fri Mar 23 14:21:47 2001
+++ linux-2.4.2-ac20/arch/i386/kernel/time.c Fri Mar 23 14:04:43 2001
@@ -422,7 +422,7 @@
 	if (!user_mode(regs))
 		x86_do_profile(regs->eip);
 #else
-	if (!smp_found_config)
+	if (!using_apic_timer)
 		smp_local_timer_interrupt(regs);
 #endif
 
diff -ru linux-2.4.2-ac20-vanilla/include/asm-i386/mpspec.h linux-2.4.2-ac20/include/asm-i386/mpspec.h
--- linux-2.4.2-ac20-vanilla/include/asm-i386/mpspec.h Mon Jan 8 13:35:28 2001
+++ linux-2.4.2-ac20/include/asm-i386/mpspec.h Fri Mar 23 14:20:19 2001
@@ -182,6 +182,7 @@
 extern int mp_current_pci_id;
 extern unsigned long mp_lapic_addr;
 extern int pic_mode;
+extern int using_apic_timer;
 
 #endif

2001-03-24 04:12:24

by Zack Weinberg

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

Kevin Buhr wrote:
> Jakob Østergaard <[email protected]> writes:
> >
> > Try compiling something like Qt/KDE/gtk-- which are really heavy on
> > templates (with all the benefits and drawbacks of that).
>
> Okay, I just compiled gtk-- 1.0.3 (with CFLAGS = "-O2 -g") under three
> versions of GCC (Debian 2.95.3, RedHat 2.96, and a CVS pull of the
> "gcc-3_0-branch") on the same Debian machine running kernel 2.4.2.
>
> In all cases, the "cc1plus" processes appeared to max out around 25M
> total size. The "maps" pseudofiles for the 2.95.3 and 3.0
> compiles never grew past 250 lines, but the "maps" pseudofiles for the
> RedHat 2.96 compile were gigantic, jumping to 3000 or 5000 lines at
> times.
>
> The results speak for themselves:
>
>                CVS gcc 3.0    Debian gcc 2.95.3    RedHat gcc 2.96
>
> real           16m8.423s      8m2.417s             12m24.939s
> user           15m23.710s     7m22.200s            10m14.420s
> sys            0m48.730s      0m41.040s            2m13.910s
> maps lines     <250           <250                 >3000

Let me inject some information about what gcc's doing in each version.

2.95.3 allocates its memory via a bunch of 'obstacks' which,
underneath, get memory from malloc, and therefore brk(2). I'm very
surprised to see it had ~250 vmas; it should be more like 10.

2.96 and later versions use a garbage collecting allocator instead; it
was becoming much too hard to decide which obstack to use when. The
garbage collector allocates memory with mmap(..., MAP_ANON, ...).
This is to avoid interfering with malloc, which is still used in many
places; and to get page-aligned memory without wasting tons of space,
as valloc(3) does.

In Red Hat's 2.96, that allocator gets memory from mmap one page at a
time. If I understand what's going on in the kernel correctly, that
means each page is its own vma. 25 megs of GC arena is roughly 6400
vmas in that regime.

In CVS 3.0-to-be (and trunk), the allocator gets memory in 32-page
chunks instead. So 25 megs of GC arena is only 200 vmas.

However, 25 megs of GC arena is small as these things go. GCC's
memory consumption can _easily_ get up to 200 or 300 megs. The
example I'm familiar with is insn-attrtab.c from GCC's own sources
(it's machine-generated code, with several huge functions). 256 megs
of GC arena, in 32-page chunks, is 2048 vmas. Yes, at this point the
machine is swapping... but if I understand the issue, it's just when
we're swapping that having thousands of vmas causes problems.

In conclusion, I think that GCC's allocator still makes a good case
for merging vmas.

zw
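
The vma counts quoted above follow from dividing the arena size by the
allocation chunk size. A minimal sketch of that arithmetic, assuming 4 KiB
pages (the i386 page size) and that every unmerged mmap() call ends up as
its own vma; the function name is made up for illustration:

```python
PAGE_SIZE = 4096  # bytes; assumed i386 page size


def vma_count(arena_megs, pages_per_chunk):
    """Number of separate vmas if every mmap'd chunk stays unmerged."""
    chunk_bytes = pages_per_chunk * PAGE_SIZE
    return (arena_megs * 1024 * 1024) // chunk_bytes


print(vma_count(25, 1))    # one page at a time (Red Hat 2.96): 6400
print(vma_count(25, 32))   # 32-page chunks (3.0-to-be): 200
print(vma_count(256, 32))  # insn-attrtab.c-sized arena: 2048
```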

2001-03-24 05:04:11

by Linus Torvalds

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

In article <[email protected]>,
Kevin Buhr <[email protected]> wrote:
>
>The results speak for themselves:
>
>        CVS gcc 3.0:   Debian gcc 2.95.3:   RedHat gcc 2.96:
>
> real   16m8.423s      8m2.417s             12m24.939s
> user   15m23.710s     7m22.200s            10m14.420s
> sys    0m48.730s      0m41.040s            2m13.910s
> maps:  <250 lines     <250 lines           >3000 lines
>
>Obviously, the *real* problem is RedHat GCC 2.96. If Linus bothers to
>write this patch (he probably already has),

Check out 2.4.3-pre7, I'd be interested to hear what the system time is
for that one.

It does seem like gcc-2.96 is kind of special, but considering how easy
it is to merge anonymous memory (most of the changes were cosmetic ones
to get nice ordering to make the merge trivial without having to
allocate a vma that never gets used etc), it's certainly worth doing.

Linus
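
To illustrate what merging adjacent anonymous areas means for a map like
the one at the top of this thread, here is a user-space sketch that
coalesces neighbouring anonymous lines of a /proc/<pid>/maps listing. It is
not the kernel's code and says nothing about how 2.4.3-pre7 does it; the
helper names are hypothetical, and "anonymous" is assumed to mean device
00:00 with inode 0, as in the listing quoted earlier.

```python
def parse_maps_line(line):
    """Split one maps line into (start, end, perms, offset, dev, inode)."""
    fields = line.split()
    start, end = (int(x, 16) for x in fields[0].split("-"))
    return start, end, fields[1], int(fields[2], 16), fields[3], int(fields[4])


def is_anonymous(region):
    # Assumption: anonymous mappings show device 00:00 and inode 0.
    return region[4] == "00:00" and region[5] == 0


def merge_anonymous(lines):
    """Coalesce adjacent anonymous regions that share the same permissions."""
    merged = []
    for region in map(parse_maps_line, lines):
        if merged:
            prev = merged[-1]
            if (is_anonymous(prev) and is_anonymous(region)
                    and prev[1] == region[0] and prev[2] == region[2]):
                # Extend the previous region instead of adding a new one.
                merged[-1] = (prev[0], region[1]) + prev[2:]
                continue
        merged.append(region)
    return merged


sample = [
    "4014a000-4014b000 rw-p 00000000 00:00 0",
    "4014b000-4014c000 rw-p 00000000 00:00 0",
    "4014c000-4014d000 rw-p 00000000 00:00 0",
    "4014d000-4014e000 rw-p 00000000 00:00 0",
]
print(len(sample), "->", len(merge_anonymous(sample)))
```

Run over the four sample lines, the four one-page regions collapse into a
single region covering 4014a000-4014e000.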

2001-03-24 07:51:02

by Mike Galbraith

Subject: Re: 2.4.2-ac20 patch for process time double-counting (was: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.)

On 23 Mar 2001, Kevin Buhr wrote:

> Mike Galbraith <[email protected]> writes:
> >
> > > Mike, would you like to try out the following (untested) patch against
> > > vanilla ac20 to see if it does the trick?
> >
> > Yes, that fixed it.
>
> Great! Can you test one more configuration, please? I can't test it
> properly with my SMP motherboard. Under "ac20", if you disable:
>
> Symmetric multi-processing support (CONFIG_SMP)
>
> you'll get to say yes to:
>
> APIC support on uniprocessors (CONFIG_X86_UP_APIC)
>
> If you say yes to that, you'll also get to say yes to:
>
> IO-APIC support on uniprocessors (CONFIG_X86_UP_IOAPIC)
>
> Can you check that the following patch against vanilla "ac20" works
> correctly with SMP disabled and X86_UP_APIC enabled? (The original
> patch I gave you won't compile with this configuration, since I put
> the declaration in the wrong include file.) It shouldn't matter
> whether X86_UP_IOAPIC is enabled or disabled.
>
> In addition to checking that the sys/user times look right, please
> check for the message:
>
> Using local APIC timer interrupts.

Times are fine. Local APIC timer interrupts are used.

> in your boot messages (I *don't* think it'll be there, but I'm not
> sure, and I'd really like to know one way or the other). In fact, if
> you could send me your kernel messages up to the PCI probe, that would
> be ideal.

Linux version 2.4.2-ac20 (root@el-kaboom) (gcc version gcc-2.95.3 20010315 (release)) #1 Sat Mar 24 07:57:01 CET 2001
BIOS-provided physical RAM map:
BIOS-e820: 000000000009fc00 @ 0000000000000000 (usable)
BIOS-e820: 0000000000000400 @ 000000000009fc00 (reserved)
BIOS-e820: 0000000000010000 @ 00000000000f0000 (reserved)
BIOS-e820: 0000000007ef0000 @ 0000000000100000 (usable)
BIOS-e820: 0000000000003000 @ 0000000007ff0000 (ACPI NVS)
BIOS-e820: 000000000000d000 @ 0000000007ff3000 (ACPI data)
BIOS-e820: 0000000000010000 @ 00000000ffff0000 (reserved)
Scan SMP from c0000000 for 1024 bytes.
Scan SMP from c009fc00 for 1024 bytes.
Scan SMP from c00f0000 for 65536 bytes.
Scan SMP from c009fc00 for 4096 bytes.
On node 0 totalpages: 32752
zone(0): 4096 pages.
zone(1): 28656 pages.
zone(2): 0 pages.
Local APIC disabled by BIOS -- reenabling.
Found and enabled local APIC!
mapped APIC to ffffe000 (fee00000)
Kernel command line: root=/dev/hda6,ro sb=220,5,1,7 mpu401=0x300,0 adlib=0x388 BOOT_IMAGE=242ac20
Initializing CPU#0
Detected 499.176 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 996.14 BogoMIPS
Memory: 126500k/131008k available (1101k kernel code, 4120k reserved, 339k data, 204k init, 0k highmem)
Dentry-cache hash table entries: 16384 (order: 5, 131072 bytes)
Buffer-cache hash table entries: 4096 (order: 2, 16384 bytes)
Page-cache hash table entries: 32768 (order: 5, 131072 bytes)
Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
CPU: Before vendor init, caps: 0383fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 512K
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU: After vendor init, caps: 0383fbff 00000000 00000000 00000000
CPU: After generic, caps: 0383fbff 00000000 00000000 00000000
CPU: Common caps: 0383fbff 00000000 00000000 00000000
CPU: Intel Pentium III (Katmai) stepping 03
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
Getting VERSION: 40011
Getting VERSION: 40011
Getting ID: 0
Getting ID: f000000
Getting LVT0: 700
Getting LVT1: 400
enabled ExtINT on CPU#0
ESR value before enabling vector: 00000040
ESR value after enabling vector: 00000000
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 499.1652 MHz.
..... host bus clock speed is 99.8329 MHz.
cpu: 0, clocks: 998329, slice: 499164
CPU0<T0:998320,T1:499152,D:4,S:499164,C:998329>

> Thanks muchly!

Testing's easy, thanks for the fix.

-Mike

2001-03-24 09:32:59

by Jakob Oestergaard

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

On Fri, Mar 23, 2001 at 09:02:30PM -0800, Linus Torvalds wrote:
> In article <[email protected]>,
> Kevin Buhr <[email protected]> wrote:
> >
> >The results speak for themselves:
> >
> >        CVS gcc 3.0:   Debian gcc 2.95.3:   RedHat gcc 2.96:
> >
> > real   16m8.423s      8m2.417s             12m24.939s
> > user   15m23.710s     7m22.200s            10m14.420s
> > sys    0m48.730s      0m41.040s            2m13.910s
> > maps:  <250 lines     <250 lines           >3000 lines
> >
> >Obviously, the *real* problem is RedHat GCC 2.96. If Linus bothers to
> >write this patch (he probably already has),
>
> Check out 2.4.3-pre7, I'd be interested to hear what the system time is
> for that one.

I was unable to compile gcc-3.0 from CVS this morning - so no tests there
for now...

First the "small" test case:
-----------------------------
2.4.2:
gcc-2.96: -O6 -felide-constructors -fPIC
real 7m31.748s
user 3m52.340s
sys 3m38.180s
Memory consumption: ~200MB

2.4.3-pre7:
gcc-2.96: -O6 -felide-constructors -fPIC
real 3m52.347s
user 3m46.120s
sys 0m3.370s

That's pretty darn impressive Linus ! 3m38 -> 3sec ! Now if the GCC people
could only repeat that trick ;)


Then the bigger one:
-----------------------------
2.4.2:
gcc-2.96: -O6 -felide-constructors -fPIC
Fails compilation with "Virtual memory exhausted!" after
real 37m28.305s
user 23m39.930s
sys 13m44.900s
Memory consumption: ~300MB before failure

Note - there are no ulimits set, and the machine has more than enough memory

2.4.3-pre7:
gcc-2.96: -O6 -felide-constructors -fPIC
real 31m48.898s
user 31m21.460s
sys 0m26.980s
Memory consumption: ~400MB - successful completion

Cool ! I can work again ;)

>
> It does seem like gcc-2.96 is kind of special, but considering how easy
> it is to merge anonymous memory (most of the changes were cosmetic ones
> to get nice ordering to make the merge trivial without having to
> allocate a vma that never gets used etc), it's certainly worth doing.

Beautiful !

Also, the speedup gained here is ~70 times, which may be more than the changed
allocation in gcc-3 will buy us (was that 32 times?). And, after all, there
_has_ to be some other case out there which is not as easily fixed as the GCC
one.
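
For reference, the system-time ratios can be recomputed from the time(1)
figures above. This is only a convenience sketch; the helper names are made
up, and the "bigger" 2.4.2 run aborted before finishing, so its ratio is a
rough lower bound rather than a like-for-like comparison.

```python
import re


def seconds(t):
    """Convert a time(1)-style string such as '3m38.180s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", t)
    if m is None:
        raise ValueError("unrecognised time format: " + t)
    return int(m.group(1)) * 60 + float(m.group(2))


def ratio(before, after):
    """How many times larger 'before' is than 'after'."""
    return seconds(before) / seconds(after)


small = ratio("3m38.180s", "0m3.370s")    # sys time, small test case
big = ratio("13m44.900s", "0m26.980s")    # sys time, bigger test case
print(round(small, 1), round(big, 1))
```

The small case comes out at roughly 65x, in the same ballpark as the "~70
times" quoted above.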

> Linus

--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:

2001-03-24 09:50:01

by Jakob Oestergaard

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

On Thu, Mar 22, 2001 at 10:32:51PM -0600, Kevin Buhr wrote:
> Jakob Østergaard <[email protected]> writes:
> >
> > Try compiling something like Qt/KDE/gtk-- which are really heavy on
> > templates (with all the benefits and drawbacks of that).
>
> Okay, I just compiled gtk-- 1.0.3 (with CFLAGS = "-O2 -g") under three
> versions of GCC (Debian 2.95.3, RedHat 2.96, and a CVS pull of the
> "gcc-3_0-branch") on the same Debian machine running kernel 2.4.2.

It's important that you use at least -O3 to get inlining too.

>
> In all cases, the "cc1plus" processes appeared to max out around 25M
> total size. The "maps" pseudofiles for the 2.95.3 and 3.0
> compiles never grew past 250 lines, but the "maps" pseudofiles for the
> RedHat 2.96 compile were gigantic, jumping to 3000 or 5000 lines at
> times.

25 MB doesn't count ;)

>
> The results speak for themselves:
>
>        CVS gcc 3.0:   Debian gcc 2.95.3:   RedHat gcc 2.96:
>
> real   16m8.423s      8m2.417s             12m24.939s
> user   15m23.710s     7m22.200s            10m14.420s
> sys    0m48.730s      0m41.040s            2m13.910s
> maps:  <250 lines     <250 lines           >3000 lines
>
> Obviously, the *real* problem is RedHat GCC 2.96. If Linus bothers to
> write this patch (he probably already has), its only proven benefit so
> far is that it improves the performance of a RedHat-specific, orphaned
> G++ development snapshot that everyone (the people of RedHat most of
> all) will be glad to be rid of as soon as possible.

No, map merging is obviously a good idea if it can be done at little cost.
There has to be other cases out there than GCC 2.96 (which is still the
best damn C++ compiler to ship on any GNU/Linux distribution in history)

As someone else already pointed out GCC-3.0 will improve its allocation,
but it *still* allocates many maps - less than before, but still potentially
lots...

>
> The numbers above suggest that the patch is unlikely to have a
> positive impact on the performance of either officially released GCC
> versions or the upcoming 3.0 release.

It will still have the 70x performance increase on kernel memory map
handling as demonstrated in my benchmark just posted. However, it will
be 70x of much less than with 2.96.

Granted, the impact will be much smaller on GCC-3.0 in terms of wall clock
ticks, but I bet there is some other code out there that also triggers the
map-nightmare.

>
> Drifting off topic...

We can continue on /. ;)

>
> > Mozilla uses C++ mainly as "extended C" - due to compatibility concerns.
>
> This statement is potentially misleading.
>
> I think most people will believe you to mean "using C++ as a better C"
> in the sense of Stroustrup: using the small, conventional-language
> subset of C++ that looks like C but has stronger type checking,
> function and operator overloading, default arguments, "//" style
> comments, reference types, and other syntactic and semantic sugar.

Yes

>
> Mozilla does not use C++ as "extended C" in this sense. While it does
> use a *subset* of C++ for compatibility reasons, the subset includes
> extensive use of class lattices and polymorphism as well as extensive
> (albeit simple and carefully constructed) uses of templates for its
> utility classes, including string and component-autoreferencing
> template classes and functions that are used throughout the source.
> The only major C++ facilities that are not used are the standard
> library, RTTI, namespaces, and exception handling, but other than that
> it's a good, real-world C++ test case.

Ok - I just read the coding guidelines for Mozilla, that's where I got
my information from... In general (except for the exceptions I guess),
rule number one is "Don't use templates". Rule 5 is "Don't use the namespace
facility". Rule 16 is "Don't put constructors in header files". All
stuff that leads to much much shorter symbols (type names) and less code
inlining - something that makes the job a lot easier for GCC.

Putting template classes and functions in header files with heavy inlining
is something that makes GCC memory usage explode. I managed to write a
few hundred lines once that I couldn't compile because GCC couldn't allocate
more than 800-900 MB (old glibc and GCC). The code was badly designed
and easily fixed, but it demonstrated this feature in GCC nicely.

--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:

2001-03-24 19:28:27

by buhr

Subject: Re: 2.4.2-ac20 patch for process time double-counting (was: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.)

Mike Galbraith <[email protected]> writes:
>
> Times are fine. Local APIC timer interrupts are used.

Okay, thanks. That's good.

> Testing's easy, thanks for the fix.

This is where I'd submit the patch, but Alan evidently works 80 hours
a day. ;) The new patch is already in ac24.

Alan, FYI, I tested the patch on my SMP motherboard with CONFIG_SMP
(and maxcpus=0,1,unspecified) and with all combinations of
CONFIG_X86_UP_{,IO}APIC, and Michael tested CONFIG_SMP and
CONFIG_X86_UP_APIC on his non-SMP motherboard, so I don't think this
will come back to bite anyone.

Thanks!

Kevin <[email protected]>

2001-03-24 21:23:53

by buhr

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

Linus Torvalds <[email protected]> writes:
> >
[ under kernel 2.4.2 ]
> >
> >        CVS gcc 3.0:   Debian gcc 2.95.3:   RedHat gcc 2.96:
> >
> > real   16m8.423s      8m2.417s             12m24.939s
> > user   15m23.710s     7m22.200s            10m14.420s
> > sys    0m48.730s      0m41.040s            2m13.910s
> > maps:  <250 lines     <250 lines           >3000 lines
> >
> >Obviously, the *real* problem is RedHat GCC 2.96. If Linus bothers to
> >write this patch (he probably already has),
>
> Check out 2.4.3-pre7, I'd be interested to hear what the system time is
> for that one.

Okay. One note about the above results: as Zack pointed out, my
2.95.3 number for "maps" was wrong. I must have forgotten to collect
the data but thought I had. In fact, there are ~10 lines in "maps"
for the 2.95.3 "cc1plus" process. The other "maps" numbers for 3.0
and 2.96 are correct, at least within an order of magnitude.

Under 2.4.3-pre7, I get the following disappointing numbers:

       CVS gcc 3.0:   Debian gcc 2.95.3:   RedHat gcc 2.96:

real   16m10.660s     7m58.874s            10m36.368s
user   15m27.900s     7m23.090s            10m0.290s
sys    0m48.400s      0m40.350s            0m40.790s
maps:  <20 lines      ~10 lines            ~10 lines

A huge win for 2.96 and absolutely no benefit whatsoever for 3.0, even
though it obviously had a 10-fold effect on maps counts. On the
positive side, there was no performance *hit* either.

As a blind "have not looked at relevant kernel source" guess, this
looks like a hash scaling problem to me: the hash size works great for
~300 maps and falls apart in a major way at ~3000 maps, presumably
when we get multiple hits per hash bin and start walking 10-member
lists.

How this translates into a course of action---some combination of
keeping your patch, enlarging the hash, and performance tweaking the
list-walking---I'm not sure.

Kevin <[email protected]>

2001-03-24 21:47:57

by buhr

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

"Zack Weinberg" <[email protected]> writes:
>
> Let me inject some information about what gcc's doing in each version.

Thanks... very useful information.

> 2.95.3 allocates its memory via a bunch of 'obstacks' which,
> underneath, get memory from malloc, and therefore brk(2). I'm very
> surprised to see it had ~250 vmas; it should be more like 10.

You are correct. My "maps" numbers for 2.96 and 3.0 are correct (at
least within an order of magnitude), but I must have plucked the
number for 2.95.3 out of thin air---there are only ~10 maps, as you
predict.

> In conclusion, I think that GCC's allocator still makes a good case
> for merging vmas.

Maybe. It looks like the performance drop is quite sharp as a
function of vma count. In another note to the list, I observed no
system time change (not even a half a second) using GCC 3.0 on my
gtk-- test case between 2.4.2 and 2.4.3-pre7, even though the vma
count dropped from ~200 to ~15. On the other hand, 2.96 dropped from
>3000 to ~10 and dropped from a system time of 2m13s to a system time
of 41sec (in line with the 3.0 and 2.95.3 system times).

Given your data, it'll really depend on where the performance hit is
taken. If it's taken at 4000 vmas, then it'll take a 500 meg arena
under 3.0 before the patch makes a difference. If it's taken at 1000
vmas, then we'll see it around 125 megs, and it'll really make a big
difference in some of the test cases people are talking about.

Kevin <[email protected]>
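
The arena sizes mentioned above are the inverse of Zack's calculation: a
vma threshold multiplied by the chunk size. A small sketch, assuming 4 KiB
pages and the 32-page chunks described for GCC 3.0; the function name is
hypothetical and the thresholds themselves are guesses from the message,
not measurements.

```python
PAGE_SIZE = 4096       # bytes; assumed i386 page size
PAGES_PER_CHUNK = 32   # GCC 3.0 chunk size as described by Zack


def arena_megs_for(vma_threshold):
    """Arena size (in megabytes) at which the vma count reaches the threshold."""
    return vma_threshold * PAGES_PER_CHUNK * PAGE_SIZE / (1024 * 1024)


print(arena_megs_for(4000))  # 500.0
print(arena_megs_for(1000))  # 125.0
```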

2001-03-24 21:58:47

by buhr

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

Jakob Østergaard <[email protected]> writes:
>
> It's important that you use at least -O3 to get inlining too.
[ . . . ]
> 25 MB doesn't count ;)

Aggh! I feel like I'm in a comedy sketch. You tell me "do that".
I do that. You tell me, "you should try this instead", so I do this.
Then, you tell me, "but you should really do the other."

You're the one who suggested "gtk--" as a test case. Built out of the
box, it uses "-O2". If there were magical settings or sekret
incantations, I wish you'd mentioned them when you suggested it.

> No, map merging is obviously a good idea if it can be done at little cost.
> There has to be other cases out there than GCC 2.96 (which is still the
> best damn C++ compiler to ship on any GNU/Linux distribution in history)

If something has a cost, even a little cost, and no one can find a
benefit, then implementing it is not "obviously" a good idea. That's
why Linus asked for a real-world example before he spent time
complicating the algorithms and adding checks that incur a cost for
every process, even those that won't get any benefit.

> As someone else already pointed out GCC-3.0 will improve its allocation,
> but it *still* allocates many maps - less than before, but still potentially
> lots...

Yes. Zack's explanation is the first thing I've seen that makes a
case for some benefit (besides babysitting GCC 2.96) without
conflicting with the data I'm getting.

As I've noted elsewhere, I see no change at all in system time for GCC
3.0 between 2.4.2 and 2.4.3-pre7. Given Zack's explanation, I'm
prepared to believe there might be a difference with, say, a 500meg
arena (or perhaps something as small as a 100meg arena).

> It will still have the 70x performance increase on kernel memory map
> handling as demonstrated in my benchmark just posted. However, it will
> be 70x of much less than with 2.96.

For my test cases under 3.0, it looks like 70 times zero. However,
I'm now prepared to believe that it could be 70 times something
non-zero for certain very hairy source files.

Kevin <[email protected]>

2001-03-25 03:18:40

by Jakob Oestergaard

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

On Sat, Mar 24, 2001 at 01:54:39PM -0600, Kevin Buhr wrote:
> Jakob Østergaard <[email protected]> writes:
> >
> > It's important that you use at least -O3 to get inlining too.
> [ . . . ]
> > 25 MB doesn't count ;)
>
> Aggh! I feel like I'm in a comedy sketch. You tell me "do that".
> I do that. You tell me, "you should try this instead", so I do this.
> Then, you tell me, "but you should really do the other."

I'm sorry, I was wrong about gtk-- being hairy enough, and I should have
apologized earlier.

>
> You're the one who suggested "gtk--" as a test case. Built out of the
> box, it uses "-O2". If there were magical settings or sekret
> incantations, I wish you'd mentioned them when you suggested it.

Yes, yes. I guess even Qt won't do the trick either. I know at least one of
the KDE packages will, it uses Qt and if you set compilation options to -O6 it
will grow to ~100MB.

A few years ago when I compiled Mico, that one would make GCC chew up a few
hundred megs as well, if compilation options were set to use heavy
optimization.

But never mind about C++ test cases now...

>
> > No, map merging is obviously a good idea if it can be done at little cost.
> > There has to be other cases out there than GCC 2.96 (which is still the
> > best damn C++ compiler to ship on any GNU/Linux distribution in history)
>
> If something has a cost, even a little cost, and no one can find a
> benefit, then implementing it is not "obviously" a good idea. That's
> why Linus asked for a real-world example before he spent time
> complicating the algorithms and adding checks that incur a cost for
> every process, even those that won't get any benefit.

I just felt that many other parts of the kernel try hard to make it as
inexpensive as possible to use kernel functionality, and that the VM naturally
should do the same (to a reasonable extent, of course, as with the other
layers).

For example, if I use thousands of TCP connections, the network layer folks
have been working hard to ensure that I can actually do that with good
performance.

It would feel "wrong" - I think - if the VM had this special rule that "you can
use MMAP, but if you do it a lot the kernel becomes horribly inefficient".
Especially because it was just proved that such behaviour could be completely
eliminated without a big performance overhead on other, simpler users of
the VM.

It's just my opinion - of course - but I think it's very nice that under Linux
you can actually use the system calls to do lots of neat tricks (such as the
GCC mmap one, or having a thousand TCP connections open), without being
penalized too heavily. Using lots of system calls is not necessarily always
bad design.

>
> > As someone else already pointed out GCC-3.0 will improve its allocation,
> > but it *still* allocates many maps - less than before, but still potentially
> > lots...
>
> Yes. Zack's explanation is the first thing I've seen that makes a
> case for some benefit (besides babysitting GCC 2.96) without
> conflicting with the data I'm getting.

But the bad case was a garbage collector in GCC. The mmap tricks seem like
some you may be inclined to actually use in something like garbage collectors.
Are we sure that the developers of all other garbage collectors out there
foresaw this problem and didn't do mmap tricks ?

When running the Haskell interpreter "Hugs", I see lots of lines like this from
strace:
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40017000
But I don't have any "big" Haskell programs, so I don't know if Hugs does actually
exhibit the gcc-2.96 pattern too...

Maybe some Haskell / ML / Java folks could comment further on this ?
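
One way to check whether another program shows the same pattern is to count
the single-page anonymous mmap calls in a saved strace log. A rough sketch;
the regular expression is written against the flag spelling in the strace
excerpts in this thread and is an assumption about what other strace
versions print, and the function name is made up.

```python
import re

# Matches strace lines for one-page private anonymous mappings, in the
# form shown in this thread (old_mmap or mmap, NULL hint, 4096 bytes).
ONE_PAGE_ANON = re.compile(
    r"(?:old_)?mmap\(NULL, 4096, PROT_READ\|PROT_WRITE, "
    r"MAP_PRIVATE\|MAP_ANONYMOUS, -1, 0\)\s*=\s*0x[0-9a-f]+"
)


def count_one_page_mmaps(strace_lines):
    """Count strace lines that look like one-page anonymous mmap calls."""
    return sum(1 for line in strace_lines if ONE_PAGE_ANON.search(line))


log = [
    "old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, "
    "MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40017000",
    "brk(0x8100000) = 0x8100000",
]
print(count_one_page_mmaps(log))
```

A count in the thousands for a single process would suggest the same
one-vma-per-page behaviour seen with gcc-2.96 on 2.4.2.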

>
> As I've noted elsewhere, I see no change at all in system time for GCC
> 3.0 between 2.4.2 and 2.4.3-pre7. Given Zack's explanation, I'm
> prepared to believe there might be a difference with, say, a 500meg
> arena (or perhaps something as small as a 100meg arena).
>
> > It will still have the 70x performance increase on kernel memory map
> > handling as demonstrated in my benchmark just posted. However, it will
> > be 70x of much less than with 2.96.
>
> For my test cases under 3.0, it looks like 70 times zero. However,
> I'm now prepared to believe that it could be 70 times something
> non-zero for certain very hairy source files.

Or maybe 70x something large for some case we just don't know about yet ?

--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:

2001-03-25 03:38:44

by Linus Torvalds

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.



On 24 Mar 2001, Kevin Buhr wrote:
>
> A huge win for 2.96 and absolutely no benefit whatsoever for 3.0, even
> though it obviously had a 10-fold effect on maps counts. On the
> positive side, there was no performance *hit* either.

I don't think the system time in 3.0 has anything to do with the mmap size.

The 40 seconds of system time you see is probably mostly something else.
It's not as if gcc _only_ does mmap's.

Do a kernel profile, and I bet that the mmap stuff is pretty low in the
noise, and the 40 seconds are for things like clearing pages in
do_anonymous_page() and for actually reading and writing to the file. Note
how the sys numbers are now all pretty much the same across the board for
different gcc versions - regardless of whether they use mmap for the
memory management or not.

(Well, gcc-2.95 and 2.96 are the same. Gcc-3.0 is higher, but it was
higher already before, and that's probably not the memory management per
se. I suspect it's because of other things, like bigger memory footprint
or similar. Or maybe the integrated preprocessor tends to do IO in smaller
chunks or something).

Linus


2001-03-25 16:48:40

by Jamie Lokier

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

Jakob Østergaard wrote:
> But the bad case was a garbage collector in GCC. The mmap tricks seem like
> some you may be inclined to actually use in something like garbage collectors.
> Are we sure that the developers of all other garbage collectors out there
> foresaw this problem and didn't do mmap tricks ?

On this theme, some garbage collectors like to write-protect individual
pages, to detect which pages are modified between generations. The
kernel has never handled this especially well. It could be argued that
mprotect() and signal() aren't the right way to get this information
though, and it would be better to add a different mechanism.

-- Jamie

2001-03-26 04:23:56

by buhr

Subject: Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.

Linus Torvalds <[email protected]> writes:
>
> On 24 Mar 2001, Kevin Buhr wrote:
> >
> > A huge win for 2.96 and absolutely no benefit whatsoever for 3.0, even
> > though it obviously had a 10-fold effect on maps counts. On the
> > positive side, there was no performance *hit* either.
>
> I don't think the system time in 3.0 has anything to do with the mmap size.
>
> The 40 seconds of system time you see is probably mostly something else.
> It's not as if gcc _only_ does mmap's.

Yes, that's what I meant. I was assuming that there was 40sec of
baseline system time in each compilation representing everything
*except* searching lists of unmerged mmaps.

Before doing the pre7 test, I figured that---given Zack's 32-factor
observation---my benchmarks indicated that 2.96 was spending 2m14-40 =
214sec doing unmerged mmaps while 3.0 was spending 49-40 = 9 sec doing
unmerged mmaps. This ratio is more or less in line with a 32-fold
difference in number of maps predicted by Zack plus or minus a couple
seconds.

That is, I was assuming that the total time wasted because of unmerged
mmaps was roughly linear in the number of vmas. Actually, it'll be
O(n log n)---the number of maps times the O(log n) search time once
the AVL tree gets big enough to matter. Anyway, the factor should
still be around 30-50 or so.
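
As a sanity check on that factor, here is a toy cost model: n vmas, each
costing one O(log n) lookup in a balanced tree, with all constants ignored.
The vma counts are the ones from Zack's 25 meg example (6400 for one-page
chunks, 200 for 32-page chunks); the model and names are assumptions for
illustration, not a description of the kernel's actual costs.

```python
import math


def search_cost(n_vmas):
    """Relative cost: n lookups, each O(log n) in a balanced tree."""
    return n_vmas * math.log2(n_vmas)


linear_factor = 6400 / 200
nlogn_factor = search_cost(6400) / search_cost(200)
print(linear_factor, round(nlogn_factor, 1))
```

Under this model the linear estimate is 32 and the n log n estimate comes
out somewhat above 50.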

When I did the test and 2.96 fell right in line with 2.95.3, I was
disappointed that 3.0 *didn't* fall right in line with the other
two---I thought I'd get those extra 8 seconds back.

> Do a kernel profile, and I bet that the mmap stuff is pretty low in the
> noise,

I'll bet you're right---that's why I was disappointed. I thought 3.0's
mmap overhead would be higher than it turned out to be.

As it is, it looks like only the most extreme cases (thousands or tens
of thousands of mergeable maps) will benefit from the patch.

Kevin <[email protected]>