2004-04-05 17:08:53

by Eric Whiting

Subject: -mmX 4G patches feedback

from:
http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.5/2.6.5-mm1/announce.txt

>- I've dropped the 4G/4G patch and the remap-file-pages-prot patch. Two
> reasons:
>
> a) They create a lot of noise in areas where Hugh, Andrea and others
> are working
>
> b) -mm has been a bit flakey for a few people lately and I suspect the
> problems are related to early-startup changes in the 4:4 patch.


Andrew -- some data on the 4G/4G problems:

The following kernels with 4G/4G enabled would hang my box about once every 24
hours.
2.6.5-rc2-mm4
2.6.3-mm3

The 2.6.5-rc3-mm4 kernel with 4G/4G enabled has been much more stable (like
earlier -mmX kernels with 4G/4G enabled).

The 4G/4G patch is still useful for me -- although 64bit linux (x86_64) is the
best 'real' long-term solution to large memory jobs.

eric


2004-04-05 17:46:30

by Andrea Arcangeli

Subject: Re: -mmX 4G patches feedback

On Mon, Apr 05, 2004 at 10:36:58AM -0600, Eric Whiting wrote:
> The 4G/4G patch is still useful for me -- although 64bit linux (x86_64) is the
> best 'real' long-term solution to large memory jobs.

what's your primary limitation? physical memory or virtual address
space? if it's physical memory go with 2.6-aa and it'll work fine, at
full cpu performance, up to and including 32G boxes.

if it's virtual address space and you don't have much more than 4G of
ram, 3.5:1.5 usually works fine, and again you'll run at full cpu
performance.

also make sure to move the task_unmapped_base to around 200M, so that
you get 1G more of address space for free.
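
For reference: the stock i386 kernel puts the mmap area at TASK_SIZE/3,
i.e. around 1G with a 3:1 split. A minimal sketch of the idea, assuming
the 2.6-era definition in include/asm-i386/processor.h; the exact knob
used to lower it (recompile, per-process mapbase, sysctl) differs between
patchsets:

/* stock definition in include/asm-i386/processor.h:
 * the mmap area starts at TASK_SIZE/3 (~1G with a 3:1 split) */
#define TASK_UNMAPPED_BASE	(PAGE_ALIGN(TASK_SIZE / 3))

/* lowered variant as suggested above (assumption: ~200M leaves enough
 * room below it for the application's text, data and brk heap),
 * recovering most of an extra gigabyte of contiguous mmap space */
#define TASK_UNMAPPED_BASE	(PAGE_ALIGN(200UL * 1024 * 1024))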

2004-04-05 22:22:59

by Andrea Arcangeli

Subject: Re: -mmX 4G patches feedback

On Mon, Apr 05, 2004 at 03:35:23PM -0600, Eric Whiting wrote:
> 4G of virtual address is what we need. Virtual address space is why the -mmX
> 4G/4G patches are useful. In this application it is single processes (usually

Indeed.

> 3.5:1.5 appears to be a 2.4.x kernel patch only right?

Martin has a port for 2.6 in the -mjb patchset (though it only works
with PAE disabled; there are patches floating around to make it work
at no noticeable cost with PAE enabled too).

Note that if you never run syscalls you're probably fine with 4:4; I'd
recommend lowering the timer irq back to 100 HZ though (2000 mm switches
per second is way too much for number crunching with 4:4, it's way too
much even with 3:1; with 3:1 it's something like a 1% slowdown just due
to the more frequent irqs [on an average 1-2GHz box], and with 4:4 it
should be a lot lot worse than that).
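
For reference, in a 2.6.5-era i386 tree the timer frequency is a
compile-time constant (the configurable CONFIG_HZ came later); a minimal
sketch of the change suggested above, assuming the define still lives in
include/asm-i386/param.h:

/* include/asm-i386/param.h (sketch; verify against your tree) */
#ifdef __KERNEL__
# define HZ	100	/* was 1000: fewer timer irqs, fewer 4:4 mm switches */
#endif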

2004-04-05 23:00:01

by Eric Whiting

Subject: Re: -mmX 4G patches feedback

Andrea Arcangeli wrote:
>
> On Mon, Apr 05, 2004 at 10:36:58AM -0600, Eric Whiting wrote:
> > The 4G/4G patch is still useful for me -- although 64bit linux (x86_64) is the
> > best 'real' long-term solution to large memory jobs.
>
> what's your primary limitation? physical memory or virtual address
> space? if it's physical memory go with 2.6-aa and it'll work fine, at
> full cpu performance, up to and including 32G boxes.

4G of virtual address is what we need. Virtual address space is why the -mmX
4G/4G patches are useful. In this application it is single processes (usually
running one at a time) that need more than 3G of RAM.

> if it's virtual address space and you don't have much more than 4G of
> ram, 3.5:1.5 usually works fine, and again you'll run at full cpu
> performance.

3.5:1.5 appears to be a 2.4.x kernel patch only right?

2004-04-06 11:55:46

by Ingo Molnar

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]


* Andrea Arcangeli <[email protected]> wrote:

> Note that if you never run syscalls you're probably fine with 4:4; I'd
> recommend lowering the timer irq back to 100 HZ though (2000 mm switches
> per second is way too much for number crunching with 4:4, it's way too
> much even with 3:1; with 3:1 it's something like a 1% slowdown just due
> to the more frequent irqs [on an average 1-2GHz box], and with 4:4 it
> should be a lot lot worse than that). [...]

my measurements do not support your claims wrt. the cost of 4:4.

here is a quick overview of the impact (on pure userspace speed) of
various kernel features turned on. Baseline is a 2.6 kernel with HZ=100,
UP, nohighmem and 3:1 (see [*] for details of the measurement):

100Hz 100.00%
100Hz + PAE: 0.00%
100Hz + 4:4: 0.00%
100Hz + PAE + 4:4: -0.01%

1000Hz: -1.08%
1000Hz + PAE: -1.08%
1000Hz + 4:4: -1.11%
1000Hz + PAE + 4:4: -1.39%

i.e. 1000Hz itself causes a 1.08% slowdown. Adding 4:4+PAE [***] causes
an additional 0.21% overhead on the 1000Hz kernel.

so your statement:

> [...] (2000 mm switches per second is way too much for number crunching
> with 4:4, [...] with 4:4 it should be a lot lot worse than that

is 'very very' incorrect. The cost of 4:4 (on pure userspace code) is
one fifth of the cost of HZ=1000!

4:4, as explained numerous times, does carry costs for a number of
workloads. 4:4 will cause degradation for workloads that switch between
kernel-mode and user-mode heavily, in the 5-10% range. Also, it causes
degradation for threaded applications that do a lot of user-kernel copies
(up to 30% degradation). On mixed workloads like kernel compilation the
impact is in the 1-2% range.

But 4:4 does not degrade mostly-userspace (or mostly-kernelspace)
workloads significantly. Also, 4:4 pushes the lowmem limit up way higher
on lots-of-RAM x86 systems, and it gives 3.98 GB of userspace VM, which
no other kernel feature offers.

I'd like to ask you to tame your colorful attacks on the 4:4 feature. If
you don't want to offer the users of -aa the option of 4:4 then that's
your decision but please respect the choice of others.

Ingo

[*] to get the numbers above, i used a simple userspace program to
measure 'cycles per sec available to userspace' [**]:

http://redhat.com/~mingo/4g-patches/loop_print.c

on an otherwise completely idle system, to the accuracy of 0.02%.
I ran the measurements 3 times and used the best time. (best/worst
ratio was always within 0.02%) Kernel version used was
2.6.4-rc3-mm3. I used a 525 MHz Celeron for testing. The results are
similar on faster x86 systems.
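
The original loop_print.c is at the URL above; a minimal sketch of the
same kind of measurement (count iterations of an empty loop per
rdtsc-measured second, so any kernel overhead shows up as fewer loops/sec
available to userspace) looks roughly like this, and is not the original
program:

/* sketch of a loop_print.c-style measurement */
#include <stdio.h>

#define rdtscll(val) \
	__asm__ __volatile__ ("rdtsc" : "=A" (val))

#define CPU_HZ 525000000ULL	/* assumption: cycles/sec of the test box */

int main(void)
{
	unsigned long long start, now;
	unsigned long count;

	for (;;) {
		rdtscll(start);
		count = 0;
		do {
			count++;
			rdtscll(now);
		} while (now - start < CPU_HZ);
		printf("loops/sec: %lu\n", count);
		fflush(stdout);
	}
	return 0;
}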

[**] i also repeated the measurements with a d-TLB-intense workload,
which should be the worst-case, considering the TLB flushes. [the
workload iterated through #dTLB pages and touched one byte in each
page.] This added +0.02% overhead in the 1000Hz + PAE case. (just
at the statistical noise limit).

[***] non-PAE 4:4 kernels are being used too - there are a fair number
of users who run simulation code using 4GB of physical RAM and a
pure 4:4 kernel with no highmem features required. For these 4:4
users the overhead on number-crunching is even smaller, only
0.03%.

2004-04-06 16:02:22

by Andrea Arcangeli

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

On Tue, Apr 06, 2004 at 01:55:39PM +0200, Ingo Molnar wrote:
>
> http://redhat.com/~mingo/4g-patches/loop_print.c

loop_print does no memory access at all, it just loops forever; no
surprise at all that you get very little slowdown no matter how many tlb
flushes happen.

If people need >3G of user address space I assume they also do some bulk
memory access in their application.

Please write a realistic benchmark and repeat the test; the numbers you
posted are totally meaningless. Try a kernel compile or something
actually realistic (and a kernel compile, doing lots of execve, isn't the
worst case either).

Also note that the slowdown for apps calling syscalls heavily is 30%, not
5-10%, no matter whether they're threaded or not. Furthermore, there has
been no proof that the 30% slowdown of mysql is really related to the
copy-user operations being serialized by a spinlock shared across threads;
we made that assumption but it's not certain yet.

You should also use a bleeding edge cpu with large tlb caches for your
measurements. Which cpu did you use?

2004-04-06 16:14:11

by Arjan van de Ven

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

On Tue, 2004-04-06 at 17:59, Andrea Arcangeli wrote:

> You should also use a bleeding edge cpu with large tlb caches for your
> measurements. Which cpu did you use?

afaics all Intel and AMD cpus with more than say 32 or 64 TLBs are
actually 64 bit capable.... so obviously you run a 64 bit kernel there.
(and amd64 even has that sweet CAM filter on the tlbs to mitigate the
effect even if you run a 32 bit kernel)

2004-04-06 16:40:31

by Andrea Arcangeli

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

On Tue, Apr 06, 2004 at 06:13:39PM +0200, Arjan van de Ven wrote:
> On Tue, 2004-04-06 at 17:59, Andrea Arcangeli wrote:
>
> > You should also use a bleeding edge cpu with large tlb caches for your
> > measurements. Which cpu did you use?
>
> afaics all Intel and AMD cpus with more than say 32 or 64 TLBs are
> actually 64 bit capable.... so obviously you run a 64 bit kernel there.
> (and amd64 even has that sweet CAM filter on the tlbs to mitigate the
> effect even if you run a 32 bit kernel)

I simply heard the effect was less visible on PIII than on more recent
cpus, but maybe that was wrong. Do you have any results comparing
different cpus (I mean with realistic tests, not stuff like loop_print.c
doing nothing but rdtsc)? It'd be most interesting to see the effect on
hugetlbfs; past a certain amount of ram hugetlbfs is needed for
performance reasons (plus it avoids the cost of the ptes, saving ram, but
that's a secondary benefit, ptes are in highmem anyway).

2004-04-06 17:24:24

by Ingo Molnar

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]


* Andrea Arcangeli <[email protected]> wrote:

> > http://redhat.com/~mingo/4g-patches/loop_print.c
>
> loop_print does no memory access at all, it just loops forever, [...]

expecting such a reply my first mail already answers this point:

[**] i also repeated the measurements with a d-TLB-intense workload,
which should be the worst-case, considering the TLB flushes. [the
workload iterated through #dTLB pages and touched one byte in each
page.] This added +0.02% overhead in the 1000Hz + PAE case. (just
at the statistical noise limit).

i just didn't expect your apparent inability to read.

(note that this dTLB test did the worst-case test by looping through
#dTLB pages (and not more), and thus maximizing the slowdown effect of
any TLB flushes. I also tested other workloads, such as
data-cache-intensive and memory-intensive workloads, with similar
results.)

> I simply heard the effect was less visible on PIII than on more recent
> cpus, but maybe that was wrong.

my mail also answers your other point:

[...] I used a 525 MHz Celeron for testing. The results are
similar on faster x86 systems.

yes, i did check a P4 CPU too. Plus:

> [...] no surprise at all that you get very little slowdown no matter
> how many tlb flushes happen.

contrary to your claim, 90% of the TLB-flush overhead is in fact
upfront, at the time of the cr3 write, in the irq handler. So
loop_print.c will already show 90% of the overhead - and it's by far the
simplest and most stable measurement utility.

(anyway, feel free to reproduce and post contrary results here. The onus
is on you. And if you think i'm upset about your approach to this whole
issue then you are damn right.)

Ingo

2004-04-06 17:57:27

by Andrea Arcangeli

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

On Tue, Apr 06, 2004 at 07:24:31PM +0200, Ingo Molnar wrote:
> (anyway, feel free to reproduce and post contrary results here. The onus

I will run benchmarks as soon as I'm back from vacation. You didn't
post any modified benchmark that produces a realistic load.

I will use the HINT to measure the slowdown on HZ=1000. It's an optimal
benchmark simulating userspace load at various cache sizes and it's
somewhat realistic.

Note also that the slowdown I expected wasn't on the order of 10%,
obviously; I was expecting something between 1 and 2%, which would be a
*huge* slowdown for any cpu-bound app just for the timer irq, and I will
try to reproduce it on my 4-way xeon.

Regardless, even if HZ=1000 turned out to cost 1% (not the 0.02% slowdown
you measured), that changes nothing in terms of the 4:4 badness; the real
badness is for apps doing more than pure userspace calculations.

> is on you. And if you think i'm upset about your approach to this whole
> issue then you are damn right.)

the ones upset should be the users running 30% slower with stuff like
mysql just because they own a 4/8G box. There's little interest on my
part in spending time on 4:4 stuff when things are so obvious (I do want
to try to benchmark the HZ=1000 effect with the HINT though).

2004-04-06 19:25:37

by Ingo Molnar

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]


* Andrea Arcangeli <[email protected]> wrote:

> I will use the HINT to measure the slowdown on HZ=1000. It's an
> optimal benchmark simulating userspace load at various cache sizes and
> it's somewhat realistic.

here are the INT results from the HINT benchmark (best of 3 runs):

1000Hz, 3:1, PAE: 25513978.295333 net QUIPs
1000Hz, 4:4, PAE: 25515998.582834 net QUIPs

i.e. the two kernels are equal in performance. (the noise of the
benchmark was around ~0.5% so this 0.01% win of 4:4 is a draw.) This is
not unexpected, the benchmark is too noisy to notice the 0.22% maximum
possible 4:4 hit.

> Also note that the slowdown for apps calling syscalls heavily is 30%,
> not 5-10%, [...]

you are right that it's not 5-10%, it's more like 5-15%. It's not 30%,
except in the mentioned case of heavily threaded MySQL benchmark, and in
microbenchmarks. (the microbenchmark case is understandable, 4:4 adds +3
usecs on PAE and +1 usec on non-PAE.)

i've just re-measured a couple of workloads that are very kernel and
syscall intensive, to get a feel for the worst-case:

apache tested via 'ab': 5% slowdown
dbench: 10% slowdown
tbench: 16% slowdown

these would be the ones where i'd expect to see the biggest slowdown,
these are dominated by kernel overhead and do a lot of small syscalls.
(all these tests fully saturated the CPU.)

you should also consider that while 4:4 does introduce extra TLB
flushes, it also removes the TLB flush at context-switch. So for
context-switch intensive workloads the 4:4 overhead will be smaller. (in
some rare and atypical cases it might even be a speedup - e.g. NFS
servers driven by knfsd.) This is why e.g. lat_ctx is 4.15 with 3:1, and
it's 4.85 with 4:4, a 16% slowdown only - in contrast to lat_syscall
null, which is 0.7 usecs in the 3:1 case vs. 3.9 usecs in the 4:4 case.

But judging by your present attitude i'm sure you'll be able to find
worse performing testcases and will use them as the typical slowdown
number to quote from that point on ;) Good luck in your search.

here's the 4:4 overhead for some other workloads:

kernel compilation (30% kernel overhead): 2% slowdown
pure userspace code: 0% slowdown

anyway, i can only repeat what i said last year in the announcement
email of the 4:4 feature:

the typical cost of 4G/4G on typical x86 servers is +3 usecs of
syscall latency (this is in addition to the ~1 usec null syscall
latency). Depending on the workload this can cause a typical
measurable wall-clock overhead from 0% to 30%, for typical
application workloads (DB workload, networking workload, etc.).
Isolated microbenchmarks can show a bigger slowdown as well - due to
the syscall latency increase.

so it's not like there's a cat in the bag.

the cost of 4:4, just like the cost of any other kernel feature that
impacts performance (like e.g. PAE, highmem or swapping) should be
considered in light of the actual workload. 4:4 is definitely not an
'always good' feature - i never claimed it was. It is an enabler feature
for very large RAM systems, and it gives 3.98 GB of VM to userspace. It
is a slowdown for anything that doesn't need these features.

But for pure userspace code (which started this discussion), where
userspace overhead dominates by far, the cost is negligible even with
1000Hz.

Ingo

2004-04-06 20:25:55

by Andrea Arcangeli

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

On Tue, Apr 06, 2004 at 09:25:49PM +0200, Ingo Molnar wrote:
>
> * Andrea Arcangeli <[email protected]> wrote:
>
> > I will use the HINT to measure the slowdown on HZ=1000. It's an
> > optimal benchmark simulating userspace load at various cache sizes and
> > it's somewhat realistic.
>
> here are the INT results from the HINT benchmark (best of 3 runs):
>
> 1000Hz, 3:1, PAE: 25513978.295333 net QUIPs
> 1000Hz, 4:4, PAE: 25515998.582834 net QUIPs
>
> i.e. the two kernels are equal in performance. (the noise of the
> benchmark was around ~0.5% so this 0.01% win of 4:4 is a draw.) This is
> not unexpected, the benchmark is too noisy to notice the 0.22% maximum
> possible 4:4 hit.

that's really a good result, and this is a benchmark that is realistic
for a number crunching load, I'm running it too on my hardware right now
(actually I started it a few hours ago but it didn't finish a pass yet).
I'm using DOUBLE. However I won't post the quips, I draw the graph
showing the performance for every working set, that gives a better
picture of what is going on w.r.t. memory bandwidth/caches/tlb.

BTW, which is the latest 4:4 patch to use on top of 2.4? I'm porting the
one in Andrew's rc3-mm4 in order to run the benchmark on top of the same
kernel codebase. is that the latest one?

> i've just re-measured a couple of workloads that are very kernel and
> syscall intensive, to get a feel for the worst-case:
>
> apache tested via 'ab': 5% slowdown
> dbench: 10% slowdown
> tbench: 16% slowdown
>
> these would be the ones where i'd expect to see the biggest slowdown,
> these are dominated by kernel overhead and do a lot of small syscalls.
> (all these tests fully saturated the CPU.)

we know perfectly well that the biggest slowdown is in mysql-like loads
and databases calling gettimeofday heavily; they're quite common
workloads. Numbers were posted to the list too. vgettimeofday mitigates it.
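
A minimal sketch of what a gettimeofday-bound load looks like (this is
not one of the benchmarks referred to above, just an illustration): it
counts gettimeofday() calls per second, so any extra per-syscall latency
translates directly into a proportionally lower call rate.

/* sketch: count gettimeofday() calls per second */
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
	struct timeval start, now;
	unsigned long calls = 0;

	gettimeofday(&start, NULL);
	do {
		gettimeofday(&now, NULL);
		calls++;
	} while (now.tv_sec - start.tv_sec < 10);

	printf("%lu gettimeofday() calls/sec\n", calls / 10);
	return 0;
}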

> you should also consider that while 4:4 does introduce extra TLB
> flushes, it also removes the TLB flush at context-switch. So for
> context-switch intensive workloads the 4:4 overhead will be smaller. (in

yes, that's also why threaded apps are hurt most. And due to the
serialized copy-user you probably shouldn't enable the threading support
in apache2 for 4:4. Did you get the 5% slowdown with threading enabled?
I'd expect more if you enable the threading, partly because 3:1 should
run a bit faster with threading, and mostly because 4:4 serializes the
copy-user. (OTOH not sure if apache does that much copy-user, in the
past they used mmapped I/O, and mmapped I/O should scale well with
threading on 4:4 too)

> But judging by your present attitude i'm sure you'll be able to find
> worse performing testcases and will use them as the typical slowdown
> number to quote from that point on ;) Good luck in your search.

they were already posted on the list.

> 'always good' feature - i never claimed it was. It is an enabler feature
> for very large RAM systems, and it gives 3.98 GB of VM to userspace. It

Agreed, I define very large as >32G (32G works fine w/o it).

The number crunching simulations using 4G of address space at the expense
of performing syscalls and interrupts more slowly are a small part of the
userbase (and even before they care about 4:4 they should care about
mapbase); these are the same users that asked me for the 3.5:0.5 in the
past. The point is that these users get a mmap(-ENOMEM) failure if they
need more address space, so it's easy to detect them (and most
importantly they're always very skilled users, capable of recompiling and
customizing a kernel) and to provide them specialized kernels; their
application will just complain with allocation failures. All the rest
of the userbase will just run slower and they'll not know they could run
faster without 4:4.
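
A minimal sketch of how that failure shows up from the application side
(hypothetical example, not from the thread): mmap() starts failing with
ENOMEM once the user address space is exhausted, even with plenty of
physical RAM still free.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t chunk = 256UL * 1024 * 1024;	/* 256 MB slices */
	unsigned long total_mb = 0;
	void *p;

	for (;;) {
		p = mmap(NULL, chunk, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED) {
			/* out of address space, not out of RAM */
			fprintf(stderr, "mmap failed: %s\n", strerror(errno));
			break;
		}
		total_mb += chunk >> 20;
	}
	printf("mapped %lu MB before running out of address space\n",
	       total_mb);
	return 0;
}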

I provided the 3.5:0.5 kernel option + your mapbase tweak for ages in all
2.4-aa kernels for the people who need 3.5G of address space in userspace.
4:4 adds another 512M on top, so it's better for them indeed.

> But for pure userspace code (which started this discussion), where
> userspace overhead dominates by far, the cost is negligible even with
> 1000Hz.

I thought HZ=1000 would hurt more than 0.02%, given that 900 irqs per
second by themselves waste 1% of the cpu, but I may very well be wrong
about that (it's just that your previous post wasn't convincing due to the
apparently too artificial testcase); anyway the HINT is a lot more
convincing now. I will still try to reproduce your HINT numbers here on
slightly different high-end hardware, so you can get a further
confirmation and datapoint.

thanks.

2004-04-07 06:03:36

by Andrea Arcangeli

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

On Tue, Apr 06, 2004 at 10:25:48PM +0200, Andrea Arcangeli wrote:
> I'm using DOUBLE. However I won't post the quips, I draw the graph
> showing the performance for every working set, that gives a better
> picture of what is going on w.r.t. memory bandwidth/caches/tlb.

Here we go:

http://www.kernel.org/pub/linux/kernel/people/andrea/misc/31-44-100-1000/31-44-100-1000.html

the global quips you posted indeed had no way to account for the part of
the curve where 4:4 badly hurts. details in the above url.

Please cross-check my results on your hardware (I used a 2.5GHz xeon, 1G
of ram, and the benchmarks were run fresh after boot with all ram still
free). Numbers are perfectly reproducible for me, and they make perfect
sense too. 2.6.5-aa4 is the same as 2.6.5-aa3 for this benchmark (though
I'll upload 2.6.5-aa4 in a few hours).

2004-04-07 06:46:17

by Ingo Molnar

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]


* Andrea Arcangeli <[email protected]> wrote:

> > I'm using DOUBLE. However I won't post the quips, I draw the graph
> > showing the performance for every working set, that gives a better
> > picture of what is going on w.r.t. memory bandwidth/caches/tlb.
>
> Here we go:
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/misc/31-44-100-1000/31-44-100-1000.html
>
> the global quips you posted indeed had no way to account for the part
> of the curve where 4:4 badly hurts. details in the above url.

Firstly, most of the first (full) chart supports my measurements.

There's the portion of the chart at around 500k working set that is at
issue, which area you've magnified so helpfully [ ;-) ].

That area of the curve is quite suspect at first sight. With a TLB flush
every 1 msec [*], for a 'double digit' slowdown to happen it means the
effect of the TLB flush has to be on the order of 100-200 usecs. This is
near impossible, the dTLB+iTLB on your CPU is only 64+64. This means
that a simple mmap() or a context-switch done by a number-cruncher
(which does a TLB flush too) would have a 100-200 usecs secondary cost -
this has never been seen or reported before!

but it is well-known that most complex numeric benchmarks are extremely
sensitive to the layout of pages - e.g. my dTLB-sensitive benchmark is
so sensitive that i easily get runs that are 2 times faster on 4:4 than
on 3:1, and the 'results' are extremely stable and repeatable on the
same kernel!

to eliminate layout effects, could you do another curve? Plain -aa4 (no
4:4 patch) but a __flush_tlb() added before and after do_IRQ(), in
arch/i386/kernel/irq.c? This should simulate much of the TLB flushing
effect of 4:4 on -aa4, without any of the other layout changes in the
kernel. [it's not a full simulation of all effects of 4:4, but it should
simulate the TLB flush effect quite well.]

once the kernel image layout has been stabilized via the __flush_tlb()
thing, the way to stabilize user-space layout is to static link the
benchmark and boot it via init=/bin/DOUBLE. This ensures that the
placement of the physical pages is constant. Doing the test 'fresh after
bootup' is not good enough.

Ingo

[*] a nitpick: you keep saying '2000 tlb flushes per second'. This is
misleading, there's one flush of the userspace TLBs every 1 msec
(i.e. 1000 per second), and one flush of the kernel TLBs - but
the kernel TLBs are small at this point, especially with 4MB pages.

2004-04-07 07:23:57

by Andrea Arcangeli

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

On Wed, Apr 07, 2004 at 08:46:29AM +0200, Ingo Molnar wrote:
>
> * Andrea Arcangeli <[email protected]> wrote:
>
> > > I'm using DOUBLE. However I won't post the quips, I draw the graph
> > > showing the performance for every working set, that gives a better
> > > picture of what is going on w.r.t. memory bandwidth/caches/tlb.
> >
> > Here we go:
> >
> > http://www.kernel.org/pub/linux/kernel/people/andrea/misc/31-44-100-1000/31-44-100-1000.html
> >
> > the global quips you posted indeed had no way to account for the part
> > of the curve where 4:4 badly hurts. details in the above url.
>
> Firstly, most of the first (full) chart supports my measurements.

yes, above and below a certain threshold there's definitely no difference.

> There's the portion of the chart at around 500k working set that is at
> issue, which area you've magnified so helpfully [ ;-) ].

the problem shows up between 300k and 700k, and I magnified everything,
not just that single part.

> That area of the curve is quite suspect at first sight. With a TLB flush

the cache size of the l2 is 512k; that's the point where slowing down by
walking pagetables out of l2 hurts most. It made perfect sense to me.
Likely on a 1M-cache machine you'll get the same huge slowdown at a 1M
working set, and so on with bigger cache sizes in more expensive x86
big-iron cpus.

> every 1 msec [*], for a 'double digit' slowdown to happen it means the
> effect of the TLB flush has to be on the order of 100-200 usecs. This is
> near impossible, the dTLB+iTLB on your CPU is only 64+64. This means
> that a simple mmap() or a context-switch done by a number-cruncher
> (which does a TLB flush too) would have a 100-200 usecs secondary cost -
> this has never been seen or reported before!

note that it's not only the tlb flush itself that has a cost; the cost is
the later code going slow due to the tlb misses. So if you rdtsc around
the mmap syscall it'll return quickly, just like the irq returns quickly
to userspace. The cost of the tlb misses causing walks of ptes that are
out of the l2 cache isn't easily measurable in ways other than the one I
used.

> but it is well-known that most complex numeric benchmarks are extremely
> sensitive to the layout of pages - e.g. my dTLB-sensitive benchmark is
> so sensitive that i easily get runs that are 2 times faster on 4:4 than
> on 3:1, and the 'results' are extremely stable and repeatable on the
> same kernel!
>
> to eliminate layout effects, could you do another curve? Plain -aa4 (no

with page layout effects I assume you mean different page coloring, but
it's well proven that page coloring has no significant effect on those
multi-associative x86 caches (we ran some benchmarks in the past to
confirm that; ok, it was not on the same hardware, but the associativity
is similar, and the big boost of page coloring comes with a one-way
associative bcache ;). Plus it'd be a tremendous coincidence if the caches
were laid out exactly to support my theories ;). Note that the position in
the dimms doesn't matter much, since the whole point is what happens when
the stuff is at the limit of the l2 cache. I'm quite certain you're wrong
in suspecting this -17% slowdown at a 500k working set to be just a
measurement error due to page coloring; I know the effects of page
coloring in practice, they're visible even on consecutive runs w/o
requiring a kernel recompile or reboot, and I verified that consecutive
runs give the same results (and I ran various gnuplots/ssh etc. between
the runs). Of course there was a tiny divergence between the runs
(primarily due to page coloring), but it wasn't remotely comparable to the
-17% gap and the order of the curves was always the same (i.e. starting at
HZ=100 and HZ=200, then at 200k the two curves cross and 4:4 goes all the
way down, and HZ=1000 even lower). Just give it a spin yourself too.

> 4:4 patch) but a __flush_tlb() added before and after do_IRQ(), in
> arch/i386/kernel/irq.c? This should simulate much of the TLB flushing
> effect of 4:4 on -aa4, without any of the other layout changes in the
> kernel. [it's not a full simulation of all effects of 4:4, but it should
> simulate the TLB flush effect quite well.]

sure I can try it (though not right now, but I'll try before the
weekend).

> once the kernel image layout has been stabilized via the __flush_tlb()
> thing, the way to stabilize user-space layout is to static link the
> benchmark and boot it via init=/bin/DOUBLE. This ensures that the
> placement of the physical pages is constant. Doing the test 'fresh after
> bootup' is not good enough.

You can try it too, you actually already showed the quips, maybe you
never looked at the per-working-set results.

btw, these are my hint.h settings to stop before paging and to get some
more granularity:

#define ADVANCE  1.1 /* 1.2589 */ /* Multiplier. We use roughly 1 decibel step size. */
                                  /* Closer to 1.0 takes longer to run, but might    */
                                  /* produce slightly higher net QUIPS.              */
#define NCHUNK   4    /* Number of chunks for scatter decomposition       */
                      /* Larger numbers increase time to first result     */
                      /* (latency) but sample domain more evenly.         */
#define NSAMP    200  /* Maximum number of QUIPS measurements             */
                      /* Increase if needed, e.g. if ADVANCE is smaller   */
#define NTRIAL   5    /* Normal number of times to run a trial            */
                      /* Increase if computer is prone to interruption    */
#define PATIENCE 7    /* Number of times to rerun a bogus trial           */
#define RUNTM    1.0  /* Target time, seconds. Reduce for high-res timer. */
                      /* Should be much larger than timer resolution.     */
#define STOPRT   0.65 /* Ratio of current to peak QUIPS to stop at        */
                      /* Smaller numbers will beat on virtual memory.     */
#define STOPTM   100  /* Longest time acceptable, seconds. Most systems   */
                      /* run out of decent-speed memory well before this  */
#define MXPROC   32   /* Maximum number of processors to use in shared    */
                      /* memory configuration. Adjust as necessary.       */


I used DOUBLE compiled with -ffast-math -fomit-frame-pointer -march=pentium4,
but I don't think INT or any compiler thing can make a difference.

> [*] a nitpick: you keep saying '2000 tlb flushes per second'. This is
> misleading, there's one flush of the userspace TLBs every 1 msec
> (i.e. 1000 per second), and one flush of the kernel TLBs - but
> the kernel TLBs are small at this point, especially with 4MB pages.

I'm saying 2000 tlb flushes only because I'd be wrong to say there are
only 1000, but I obviously agree that the cost of half of them is not
significant (the footprint of the irq handler is tiny).

2004-04-07 07:28:31

by Ingo Molnar

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]


* Andrea Arcangeli <[email protected]> wrote:

> BTW, which is the latest 4:4 patch to use on top of 2.4? I'm porting
> the one in Andrew's rc3-mm4 in order to run the benchmark on top of
> the same kernel codebase. is that the latest one?

yeah, the one in -mm is the latest one. (right now it's out to isolate
early-bootup and ACPI problems, but it's very recent otherwise)

the 2.4 one i kept minimalistic, the latest one you can find in the
RHEL3 srpm. The 2.6 one also contains related cleanups to x86
infrastructure.

> we know perfectly well that the biggest slowdown is in mysql-like loads
> and databases calling gettimeofday heavily; they're quite common
> workloads. Numbers were posted to the list too. vgettimeofday mitigates it.

ok. I'm not sure it's due to gettimeofday, but vgettimeofday is nice no
matter what.

> > you should also consider that while 4:4 does introduce extra TLB
> > flushes, it also removes the TLB flush at context-switch. So for
> > context-switch intensive workloads the 4:4 overhead will be smaller. (in
>
> yes, that's also why threaded apps are hurt most. And due to the
> serialized copy-user you probably shouldn't enable the threading support
> in apache2 for 4:4. Did you get the 5% slowdown with threading enabled?

it was Apache 2.0, but without threading enabled. (the default install)

> I'd expect more if you enable the threading, partly because 3:1 should
> run a bit faster with threading, and mostly because 4:4 serializes the
> copy-user. (OTOH not sure if apache does that much copy-user, in the
> past they used mmapped I/O, and mmapped I/O should scale well with
> threading on 4:4 too)

with the serialization it can get really slow. I have some copy-user
optimizations in the 2.4 patch. All it needs is a "safe" pagetable
walking function, which doesn't get confused by things like large pages.
If this function fails then we try user_pages() - to make sure we catch
mappings not present in the pagetables (e.g. hugepages on some
platforms). Another problem is IO vmas. It might not be legal to touch
them via the user-copy functions.

but if it's possible to walk the pagetables without having to look at
the vma tree then the copy-user functions can be done lockless. This
feature is not in the 2.6 patch yet.
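
A rough sketch of the kind of walk meant here (hypothetical helper using
the 2.6 i386 pagetable macros; this is not the actual 4:4 code): resolve a
user address purely from the pagetables and bail out on anything unusual,
such as 4MB PSE pages, so the caller can fall back to the slow, vma-aware
path.

/* hypothetical sketch; needs <linux/mm.h>, <asm/pgtable.h> */
static unsigned long safe_follow_user_pfn(struct mm_struct *mm,
					   unsigned long addr)
{
	pgd_t *pgd;
	pmd_t *pmd;
	pte_t *pte, entry;

	pgd = pgd_offset(mm, addr);
	if (pgd_none(*pgd) || pgd_bad(*pgd))
		return 0;			/* fall back to slow path */
	pmd = pmd_offset(pgd, addr);
	if (pmd_none(*pmd))
		return 0;			/* fall back to slow path */
	if (pmd_val(*pmd) & _PAGE_PSE)
		return 0;			/* 4MB page: fall back */
	if (pmd_bad(*pmd))
		return 0;			/* fall back to slow path */
	pte = pte_offset_map(pmd, addr);
	entry = *pte;
	pte_unmap(pte);
	if (!pte_present(entry))
		return 0;			/* not mapped: fall back */
	/* sketch convention: pfn 0 is never user memory here,
	 * so 0 doubles as "fall back" */
	return pte_pfn(entry);
}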

> > But for pure userspace code (which started this discussion), where
> > userspace overhead dominates by far, the cost is negligible even with
> > 1000Hz.
>
> I thought HZ=1000 would hurt more than 0.02%, given that 900 irqs per
> second by themselves waste 1% of the cpu, but I may very well be wrong
> about that

the slowdown is higher: 0.2%. This is the table again:

1000Hz: -1.08%
1000Hz + PAE: -1.08%
1000Hz + 4:4: -1.11%
1000Hz + PAE + 4:4: -1.39%

so we slow down from -1.08% to -1.39%, by -0.21%.

> (it's just that your previous post wasn't convincing due to the
> apparently too artificial testcase), [...]

as i mentioned before, i did generate a dTLB testcase too - see the
simple code below. (it's hardcoded to my setup, the # of TLBs is
slightly below the real # of TLBs to not trash them.) It simply loops
through 60 pages and touches them. Feel free to modify it to have more
pages to see the effect on TLB-trashing apps [that should be a smaller
effect]. Also you can try other effects: nonlinear access to the pages,
and/or a more spread out layout of the working set.

but please be extremely careful with this testcase. It needs static
linking & init= booting to produce stable, comparable numbers. I did
this for a couple of key kernels and made sure the effect is the same as
in the simple loop_print.c case, but it's a real PITA to ensure stable
numbers.

Ingo

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define rdtscll(val) \
	__asm__ __volatile__ ("rdtsc;" : "=A" (val))

#define SECS 10ULL

#define TLBS 60

volatile char pages[TLBS*4096];

void touch_dtlbs(void)
{
	volatile char x;
	int i;

	/* touch one byte in each of TLBS pages: one dTLB entry per page */
	for (i = 0; i < TLBS; i++)
		x = pages[i*4096];
}

int main(void)
{
	/* 525000000 = cycles per second of the 525 MHz test box; adjust */
	unsigned long long start, now, mhz = 525000000, limit = mhz * SECS;
	unsigned int count;

	printf("dtlb-intense workload:\n");
	memset((void *)pages, 1, TLBS*4096);
repeat:
	rdtscll(start);
	count = 0;
	for (;;) {
		touch_dtlbs();
		count++;
		rdtscll(now);
		if (now - start > limit)
			break;
	}
	printf("speed: %llu loops.\n", count/SECS);
	fflush(stdout);
	goto repeat;
}

2004-04-07 08:23:56

by Ingo Molnar

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]


* Andrea Arcangeli <[email protected]> wrote:

> > That area of the curve is quite suspect at first sight. With a TLB flush
>
> the cache size of the l2 is 512k; that's the point where slowing down
> by walking pagetables out of l2 hurts most. It made perfect sense to me.
> Likely on a 1M-cache machine you'll get the same huge slowdown at a 1M
> working set, and so on with bigger cache sizes in more expensive x86
> big-iron cpus.

ah! i assumed a 1MB cache.

yes, there could be caching effects around the cache size, but the
magnitude still looks _way_ off. Here are a number of independent
calculations and measurements to support this point:

64 dTLBs means 64 pages. Even assuming the most spread out layout in PAE
mode, a single pagetable walk needs to access the pte and the pmd
pointers, which, if each pagewalk-path lies on separate cachelines
(worst-case), it means 2x64 == 128 bytes footprint per page. [64 byte L2
cacheline size on your box.] This means 128x64 == 8K footprint in the L2
cache for the TLBs.

this is only 1.5% of the L2 cache, so it should not make such a huge
difference. Even considering P4's habit of fetching two cachelines on a
miss, the footprint could at most be 3%.

the real pagetable footprint is likely much lower - there's likely a
fair amount of sharing at the pmd pointer level.

so the theory that it's the pagetable falling out of the cache that
makes the difference doesn't seem plausible.

> > every 1 msec [*], for a 'double digit' slowdown to happen it means the
> > effect of the TLB flush has to be on the order of 100-200 usecs. This is
> > near impossible, the dTLB+iTLB on your CPU is only 64+64. This means
> > that a simple mmap() or a context-switch done by a number-cruncher
> > (which does a TLB flush too) would have a 100-200 usecs secondary cost -
> > this has never been seen or reported before!
>
> note that it's not only the tlb flush itself that has a cost; the cost
> is the later code going slow due to the tlb misses. [...]

this is what i called the 'secondary' cost of the TLB flush.

> [...] So if you rdtsc around the mmap syscall it'll return quickly,
> just like the irq returns quickly to userspace. The cost of the tlb
> misses causing walks of ptes that are out of the l2 cache isn't easily
> measurable in ways other than the one I used.

i know that it's not measurable directly, but a 100-200 usecs slowdown
due to a TLB flush would be easily noticeable as a slowdown in userspace
performance.

let's assume all of the userspace pagetable cachelines fall out of the L2
cache during the 1 msec slice, and let's assume the worst-case spread of
the pagetables, necessitating 128 cachemisses. With 300 cycles per L2
cachemiss, this makes for 38400 cycles - 15 usecs on your 2.5 GHz Xeon.
This is the very worst case, and it's still an order of magnitude
smaller than the 100-200 usecs.

I've attached tlb2.c which measures cold-cache miss costs with a
randomized access pattern over a memory range of 512 MB, using 131072
separate pages as a TLB-miss target. (you should run it as root, it uses
cli.) So this cost combines the data-cache miss cost and the TLB-miss
costs.

This should give the worst-case cost of TLB misses. According to these
measurements the worst-case for 64 dTLB misses is ~25000 cycles, or 10
usecs on your box - well below the conservative calculation above. This
is the very worst-case TLB-trashing example i could create, and note
that it still over-estimates the TLB-miss cost because the data-cache
miss factors in too. (and the data-cache miss cannot be done in parallel
to the TLB miss, because without knowing the TLB the CPU cannot know
which page to fetch.) [the TLB miss is two cachemisses, the data-cache
fetch is one cachemiss. The two TLB related pte and pmd cachemisses are
linearized too, because the pmd value is needed for the pte fetching.]
So the real cost for the TLB misses would be around 6 usecs, way below
the 100-200 usecs effect you measured.
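
tlb2.c itself is not attached here; a rough sketch of the same kind of
measurement (a randomized pointer chase over 131072 pages of a 512 MB
area, so every dependent load misses both the dTLB and the data cache)
could look like the code below. The original ran with interrupts disabled
(cli) for stable numbers; this sketch just averages over many misses
instead.

/* sketch of a tlb2.c-style cold-TLB/cold-cache chase, not the original */
#include <stdio.h>
#include <stdlib.h>

#define rdtscll(val) \
	__asm__ __volatile__ ("rdtsc" : "=A" (val))

#define PAGES	131072UL		/* 512 MB / 4 KB */
#define PAGESZ	4096UL

int main(void)
{
	char *area = malloc(PAGES * PAGESZ);
	unsigned long *order = malloc(PAGES * sizeof(unsigned long));
	unsigned long i, j, tmp;
	unsigned long long t0, t1;
	void **p;

	if (!area || !order)
		return 1;

	/* random permutation of the page indices */
	for (i = 0; i < PAGES; i++)
		order[i] = i;
	for (i = PAGES - 1; i > 0; i--) {
		j = rand() % (i + 1);
		tmp = order[i]; order[i] = order[j]; order[j] = tmp;
	}
	/* build a cyclic chain: page order[i] points to page order[i+1] */
	for (i = 0; i < PAGES; i++)
		*(void **)(area + order[i] * PAGESZ) =
			area + order[(i + 1) % PAGES] * PAGESZ;

	p = (void **)(area + order[0] * PAGESZ);
	rdtscll(t0);
	for (i = 0; i < PAGES; i++)
		p = *p;			/* one dTLB miss + one cache miss */
	rdtscll(t1);

	printf("avg cycles per chased page: %llu\n", (t1 - t0) / PAGES);
	return p == NULL;		/* keep the chase live */
}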

> > to eliminate layout effects, could you do another curve? Plain -aa4 (no
>
> with page layout effects I assume you mean different page coloring,
> but it's well proven that page coloring has no significant effect on
> those multi-associative x86 caches [...]

E.g. the P4 has associativity constraints in the VM-linear space too
(!). Just try to access 10 cachelines spaced exactly 128 KB away from
each other (in virtual space).

> > 4:4 patch) but a __flush_tlb() added before and after do_IRQ(), in
> > arch/i386/kernel/irq.c? This should simulate much of the TLB flushing
> > effect of 4:4 on -aa4, without any of the other layout changes in the
> > kernel. [it's not a full simulation of all effects of 4:4, but it should
> > simulate the TLB flush effect quite well.]
>
> sure I can try it (though not right now, but I'll try before the
> weekend).

thanks for the testing! It will be interesting to see.

Ingo

2004-04-07 17:27:12

by Andrea Arcangeli

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

On Wed, Apr 07, 2004 at 08:46:29AM +0200, Ingo Molnar wrote:
> 4:4 patch) but a __flush_tlb() added before and after do_IRQ(), in

I added __flush_tlb_global on entry to better simulate the effect of
4:4. I doubt it makes a difference though.

--- x/arch/i386/kernel/irq.c.~1~	2004-03-11 08:27:22.000000000 +0100
+++ x/arch/i386/kernel/irq.c	2004-04-07 19:23:21.735733664 +0200
@@ -427,6 +427,7 @@ asmlinkage unsigned int do_IRQ(struct pt
 	struct irqaction * action;
 	unsigned int status;
 
+	__flush_tlb_global();
 	irq_enter();
 
 #ifdef CONFIG_DEBUG_STACKOVERFLOW
@@ -507,6 +508,7 @@ out:
 	spin_unlock(&desc->lock);
 
 	irq_exit();
+	__flush_tlb();
 
 	return 1;
 }
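
For reference, the two primitives used above differ in scope. Roughly
(see include/asm-i386/tlbflush.h for the real definitions), __flush_tlb()
is a plain reload of %cr3 that drops all non-global TLB entries, while
__flush_tlb_global() also toggles CR4.PGE around the reload so that
global (kernel) entries are dropped too, which is why it's used on irq
entry above to better simulate the 4:4 effect. A paraphrased sketch of
the simpler one:

/* paraphrase, not the exact kernel macro */
static inline void flush_tlb_sketch(void)
{
	unsigned int tmp;

	__asm__ __volatile__(
		"movl %%cr3, %0\n\t"
		"movl %0, %%cr3"	/* drops non-global TLB entries */
		: "=r" (tmp) : : "memory");
}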

2004-04-07 21:09:14

by Martin J. Bligh

Subject: Re: -mmX 4G patches feedback

Tuesday, April 06, 2004 00:16:41 +0200 Andrea Arcangeli <[email protected]> wrote:

> On Mon, Apr 05, 2004 at 03:35:23PM -0600, Eric Whiting wrote:
>> 4G of virtual address is what we need. Virtual address space is why the -mmX
>> 4G/4G patches are useful. In this application it is single processes (usually
>
> Indeed.
>
>> 3.5:1.5 appears to be a 2.4.x kernel patch only right?
>
> Martin has a port for 2.6 in the -mjb patchset (though it only works
> with PAE disabled; there are patches floating around to make it work
> at no noticeable cost with PAE enabled too).

There's no such thing as 3.5:1.5. Do you mean 3.5:0.5? or 2.5:1.5? ;-)

M.

2004-04-07 21:39:37

by Andrea Arcangeli

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

On Wed, Apr 07, 2004 at 09:25:17AM +0200, Ingo Molnar wrote:
> but please be extremely careful with this testcase. It needs static
> linking & init= booting to produce stable, comparable numbers. I did
> this for a couple of key kernels and made sure the effect is the same as
> in the simple loop_print.c case, but it's a real PITA to ensure stable
> numbers.

you're basically testing the associativity of the TLB with your dTLB
test, so no surprise it gives very different results. The biggest noise
generated by page coloring is when the working set equals the size of the
cache; for the tlb cache that's 64 pages on the prescott, but note I'm not
using a prescott, this is the first HT-enabled (P4 based) xeon and I doubt
it has 64 dTLB entries. In fact the biggest noise happens well below a
256kbyte working set; I guess it has between 8 and 32 entries in the tlb
(not sure exactly, I've got too-old and too-recent specifications, none
covering this specific box; I didn't need to know this detail to benchmark
accurately and rule out page coloring, by running multiple passes and
verifying they're all the same).

2004-04-07 21:37:49

by Andrea Arcangeli

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

On Wed, Apr 07, 2004 at 10:23:50AM +0200, Ingo Molnar wrote:
> thanks for the testing! It will be interesting to see.

The overhead of the __flush_tlb_global+__flush_tlb in the irq handler is
nothing like -17%.

But I tried again with the real full 4:4 patch applied (after two runs
with __flush_tlb_global+__flush_tlb) and there I get _again_ the very same
double-digit slowdown compared to 3:1. This is 100% reproducible, no
matter how I mix the memory; this is not a measurement error and you
should be able to reproduce it just fine.

I don't really think this can be related to the page coloring, the
coincidence would be too huge: the double-digit slowdown is always
measurable as soon as I enable 4:4 and it goes away as soon as I disable
it. There seems to be some more overhead in the 4:4 switch that isn't
accounted for by just the tlb flush in the irq of a 3:1 kernel, or maybe
the simple fact that you're sharing the same piece of address space
disables some cpu optimization in hardware (I mean, tlb flushes in mmaps
don't necessarily destroy everything, while a true mm switch has to,
because the pgd changes). But don't ask me more about this; I'm just
reporting to you the accurate measurement of a double-digit slowdown in
pure cpu computations with 4:4, and I have now verified for you that it's
not a measurement error.

BTW, people doing number crunching with 4:4 may also want to try to
recompile with HZ=100 and measure any difference, to see if they also see
what I've measured with 4:4 (assuming they work heavily on memory in
cache-sized chunks at a time, which is btw the most efficient way of
processing the data if your algorithm allows it).

Please try the HINT yourself too, the 4:4 measurements are definitely
100% reproducible for me.

See the "verification" results here (they should be visible in a few
minutes):

http://www.us.kernel.org/pub/linux/kernel/people/andrea/misc/31-44-100-1000/for-ingo.png

the 2.6.5-rc4-1000-flush curve is a run with the flush_tlb_global +
flush_tlb with 3:1; there's little slowdown for pure number crunching
with 1000 irqs per second (I ran it twice and it was the same).

And then 2.6.5-rc4-4:4-1000-2 is another run with 4:4 after another
reboot and another kernel recompile, different memory mixed etc. (i.e.
no page coloring involved; I ran it twice and it was the same, the error
across the different runs is around 3-4%, which is quite high, but
definitely nothing like a double-digit percent error, and also see how
close the 3:1+flush_tlb curve is to the pure 3:1). It doesn't matter how
I run the thing, it keeps giving the same double-digit slowdown result
for 4:4 and almost no slowdown for the flush_tlb. Again my theory is that
the dummy cr3 write is autodetected by the smart cpus, so it doesn't
account for the true 4:4 overhead, which is a true mm switch. See also my
email to Manfred on why it's worthless to use the same 4th level pgtable
for all processes on x86-64.

Also note that the HINT is written exactly to show the whole picture
about the hardware, including l2 caches, so it's not likely that the most
important piece of the curve is going to be just garbage. Obviously there
are errors due to page coloring, but on x86 they're nothing like -17%, and
the measurements are reproducible, so the numbers I posted are correct and
reproducible as far as I can tell.

2004-04-07 21:49:57

by Andrea Arcangeli

Subject: Re: -mmX 4G patches feedback

On Wed, Apr 07, 2004 at 02:19:18PM -0700, Martin J. Bligh wrote:
> Tuesday, April 06, 2004 00:16:41 +0200 Andrea Arcangeli <[email protected]> wrote:
>
> > On Mon, Apr 05, 2004 at 03:35:23PM -0600, Eric Whiting wrote:
> >> 4G of virtual address is what we need. Virtual address space is why the -mmX
> >> 4G/4G patches are useful. In this application it is single processes (usually
> >
> > Indeed.
> >
> >> 3.5:1.5 appears to be a 2.4.x kernel patch only right?
> >
> > Martin has a port for 2.6 in the -mjb patchset (though it only works
> > with PAE disabled; there are patches floating around to make it work
> > at no noticeable cost with PAE enabled too).
>
> There's no such thing as 3.5:1.5. Do you mean 3.5:0.5? or 2.5:1.5? ;-)

I meant 3.5:0.5, that's the additional config option in 2.4-aa and in
your tree too IIRC.

2004-04-07 22:43:05

by Martin J. Bligh

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

>> is on you. And if you think i'm upset about your approach to this whole
>> issue then you are damn right.)
>
> the ones upset should be the users running 30% slower with stuff like
> mysql just because they own a 4/8G box. There's little interest on my
> part in spending time on 4:4 stuff when things are so obvious (I do want
> to try to benchmark the HZ=1000 effect with the HINT though).

Isn't that scenario fixed up by vsyscall gettimeofday already?
Or was this another workload, where gettimeofday wasn't the problem?

M.

2004-04-07 22:46:58

by Martin J. Bligh

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

> anyway, i can only repeat what i said last year in the announcement
> email of the 4:4 feature:
>
> the typical cost of 4G/4G on typical x86 servers is +3 usecs of
> syscall latency (this is in addition to the ~1 usec null syscall
> latency). Depending on the workload this can cause a typical
> measurable wall-clock overhead from 0% to 30%, for typical
> application workloads (DB workload, networking workload, etc.).
> Isolated microbenchmarks can show a bigger slowdown as well - due to
> the syscall latency increase.
>
> so it's not like there's a cat in the bag.

I don't see how you can go by the cost in syscall latency ... the real
cost is not the time taken to flush the cache, it's the impact of doing
so .... such microbenchmarks seem pointless. I'm not against 4/4G at all,
I think it solves a real problem ... I just think latency numbers are a
bad way to justify it - we need to look at whole benchmark runs.

M.

2004-04-07 22:52:02

by Andrea Arcangeli

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

On Wed, Apr 07, 2004 at 03:54:27PM -0700, Martin J. Bligh wrote:
> Or was this another workload, where gettimeofday wasn't the problem?

mysql is another common -30% slowdown where gettimeofday shouldn't be
the problem. Any threaded application doing I/O should have major
scalability problems with 4:4 regardless of the tlb flushing frequency;
the more cpus, the bigger the hit (I should say mm_switch frequency [not
tlb flush frequency] after my latest benchmark results).

2004-04-07 23:03:52

by Andrea Arcangeli

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

On Wed, Apr 07, 2004 at 03:58:25PM -0700, Martin J. Bligh wrote:
> so .... such microbenchmarks seem pointless. I'm not against 4/4G at all,
> I think it solves a real problem ... I just think latency numbers are a

I agree as well that it solves a real problem (i.e. 4G of userspace),
though the userbase that needs it is extremely limited and they're surely
ok with running slower rather than changing their application to use
shmfs (a special 4:4 kernel may be ok, just like a special 2.5:1.5 may be
ok, just like 3.5:0.5 was ok for similar reasons too), but the mass market
doesn't need 4:4 and it will never need it, so it's bad to have the masses
pay this significant and, for them, worthless runtime overhead in various
common workloads.

Of course above I'm talking about 2.6-aa or 2.6-mjb. Clearly with
kernels including rmap like 2.6 mainline or 2.6-mm or 2.6-mc or the
2.4-rmap patches you need 4:4 everywhere, even on a 4/8G box to avoid
running out of normal zone in some fairly common and important workload.

2004-04-07 23:10:24

by Martin J. Bligh

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

> I agree as well that it solves a real problem (i.e. 4G of userspace),
> though the userbase that needs it is extremely limited and they're surely
> ok with running slower rather than changing their application to use
> shmfs (a special 4:4 kernel may be ok, just like a special 2.5:1.5 may be
> ok, just like 3.5:0.5 was ok for similar reasons too), but the mass market
> doesn't need 4:4 and it will never need it, so it's bad to have the masses
> pay this significant and, for them, worthless runtime overhead in various
> common workloads.

Yeah, it needs to be a separate kernel for huge blobby machines. I think
that's exactly what RH does, IIRC (> 16GB ?)

> Of course above I'm talking about 2.6-aa or 2.6-mjb. Clearly with
> kernels including rmap like 2.6 mainline or 2.6-mm or 2.6-mc or the
> 2.4-rmap patches you need 4:4 everywhere, even on a 4/8G box to avoid
> running out of normal zone in some fairly common and important workload.

Speaking of which, pte_highmem is stinking expensive itself. There's
probably a large class of workloads that'd work without pte_highmem
if we had the 4/4 split (or shared pagetables. Grrr ;-))

M.

2004-04-07 23:18:10

by Andrea Arcangeli

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

On Wed, Apr 07, 2004 at 04:21:44PM -0700, Martin J. Bligh wrote:
> Speaking of which, pte_highmem is stinking expensive itself. There's
> probably a large class of workloads that'd work without pte_highmem
> if we had the 4/4 split (or shared pagetables. Grrr ;-))

hey, I can add a sysctl in 5 minutes to disable pte_highmem at runtime.
Why do you think it's expensive? It shouldn't be; it's all atomic kmaps
doing only an invlpg. The few workloads thrashing on pte manipulation
need pte_highmem anyway. If I thought it was expensive for any common
load the sysctl would already be there.
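
What "atomic kmaps only doing invlpg" refers to, roughly: with
CONFIG_HIGHPTE the pte page may live in highmem, so each access maps it
into a per-cpu KM_PTE0 fixmap slot with a set_pte plus an invlpg on that
single virtual page. Paraphrased (see include/asm-i386/pgtable.h for the
real macros):

#define pte_offset_map_sketch(pmd, addr)				\
	((pte_t *)kmap_atomic(pmd_page(*(pmd)), KM_PTE0) +		\
		pte_index(addr))

#define pte_unmap_sketch(pte)						\
	kunmap_atomic(pte, KM_PTE0)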

2004-04-07 23:23:24

by Martin J. Bligh

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

--On Thursday, April 08, 2004 01:18:06 +0200 Andrea Arcangeli <[email protected]> wrote:

> On Wed, Apr 07, 2004 at 04:21:44PM -0700, Martin J. Bligh wrote:
>> Speaking of which, pte_highmem is stinking expensive itself. There's
>> probably a large class of workloads that'd work without pte_highmem
>> if we had the 4/4 split (or shared pagetables. Grrr ;-))
>
> hey, I can add a sysctl in 5 minutes to disable pte_highmem at runtime.
> Why do you think it's expensive? It shouldn't be; it's all atomic kmaps
> doing only an invlpg. The few workloads thrashing on pte manipulation
> need pte_highmem anyway. If I thought it was expensive for any common
> load the sysctl would already be there.

I measured it - IIRC it was 5-10% on kernel compile ... and that was on a
high ratio NUMA which it should have made *better* (as with highmem, the
PTEs can be allocated node locally). I'll try to dig up the old profiles.

M.

2004-04-08 00:18:48

by Andrea Arcangeli

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

On Wed, Apr 07, 2004 at 04:34:51PM -0700, Martin J. Bligh wrote:
> I measured it - IIRC it was 5-10% on kernel compile ... and that was on a
> high ratio NUMA which it should have made *better* (as with highmem, the
> PTEs can be allocated node locally). I'll try to dig up the old profiles.

but this is the kind of machine where I assume you've got plenty of ram
and you really want pte_highmem enabled (the sysctl can still be added,
but you really must know what you're doing if you disable pte_highmem
there). I was more interested in hearing the impact on mid/low-end 1-2G
machines where pte_highmem isn't really necessary for most apps, and
there the sysctl may be useful for general purposes too. If it pays off
significantly, the sysctl could then be elaborated into a heuristic that
prefers lowmem pagetables until a certain threshold and then falls back
to highmem allocations, with the threshold depending on the
highmem/lowmem ratio plus further tuning via the sysctl.
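
A sketch of the shape such a heuristic could take (hypothetical helper
names, not an existing patch; clearing of the new pte page and error
handling omitted):

static struct page *pte_alloc_one_sketch(struct mm_struct *mm,
					 unsigned long address)
{
	struct page *pte;

	/* prefer lowmem pte pages while lowmem is plentiful ... */
	if (lowmem_is_plentiful()) {	/* hypothetical threshold test */
		pte = alloc_pages(GFP_KERNEL | __GFP_REPEAT, 0);
		if (pte)
			return pte;
	}
	/* ... otherwise take the CONFIG_HIGHPTE path and pay the
	 * kmap_atomic cost when the pte page is accessed later */
	return alloc_pages(GFP_KERNEL | __GFP_HIGHMEM | __GFP_REPEAT, 0);
}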

2004-04-08 06:24:33

by Martin J. Bligh

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

> On Wed, Apr 07, 2004 at 04:34:51PM -0700, Martin J. Bligh wrote:
>> I measured it - IIRC it was 5-10% on kernel compile ... and that was on a
>> high ratio NUMA which it should have made *better* (as with highmem, the
>> PTEs can be allocated node locally). I'll try to dig up the old profiles.
>
> but this is the kind of machine where I assume you've got plenty of ram
> and you really want pte_highmem enabled (the sysctl can still be added,
> but you really must know what you're doing if you disable pte_highmem
> there). I was more interested in hearing the impact on mid/low-end 1-2G
> machines where pte_highmem isn't really necessary for most apps, and

I can't imagine why it'd be any less.

> there the sysctl may be useful for general purposes too. If it pays off
> significantly, the sysctl could then be elaborated into a heuristic that
> prefers lowmem pagetables until a certain threshold and then falls back
> to highmem allocations, with the threshold depending on the
> highmem/lowmem ratio plus further tuning via the sysctl.

Instead of fiddling with tuning knobs, I'd prefer to just do the UKVA
idea I've proposed before, and let each process have their own pagetables
mapped permanently ;-)

M.

2004-04-08 21:59:51

by Andrea Arcangeli

Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

On Wed, Apr 07, 2004 at 11:24:16PM -0700, Martin J. Bligh wrote:
> Instead of fiddling with tuning knobs, I'd prefer to just do the UKVA
> idea I've proposed before, and let each process have their own pagetables
> mapped permanently ;-)

that will have you pay for pte-highmem even on non-highmem machines.
I've always been against your above idea ;) It can speed up mmap a bit in
some uncommon cases, but I believe your slowdown with the kernel compile
comes from the page faults after execve and startup, not from mmap, and
worst of all it applies to non-highmem too (no sysctl or tuning knob can
save you then). Admittedly some mmap-intensive workload can get a slight
speedup compared to pte-highmem, but I don't think that's common, and it
has the potential of slowing down the page faults, especially in
short-lived tasks, even w/o highmem.

What I found attractive was the persistent kmap in userspace, but that
idea breaks with threading, and Andrew found another way (making the
page fault interruptible), so it doesn't seem very worthwhile anymore
even w/o threading.

2004-04-08 22:08:15

by Martin J. Bligh

[permalink] [raw]
Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

--On Thursday, April 08, 2004 23:59:46 +0200 Andrea Arcangeli <[email protected]> wrote:

> On Wed, Apr 07, 2004 at 11:24:16PM -0700, Martin J. Bligh wrote:
>> Instead of fiddling with tuning knobs, I'd prefer to just do the UKVA
>> idea I've proposed before, and let each process have their own pagetables
>> mapped permanently ;-)
>
> that will have you pay for pte-highmem even on non-highmem machines.
> I've always been against your above idea ;) It can speed up mmap a bit in
> some uncommon cases, but I believe your slowdown with the kernel compile
> comes from the page faults after execve and startup, not from mmap, and
> worst of all it applies to non-highmem too (no sysctl or tuning knob can
> save you then). Admittedly some mmap-intensive workload can get a slight
> speedup compared to pte-highmem, but I don't think that's common, and it
> has the potential of slowing down the page faults, especially in
> short-lived tasks, even w/o highmem.

You mean the page-faults for the pagetable mappings themselves? I wouldn't
have thought that'd make an impact - at least I don't see how it could be
worse than pte_highmem. And as we could make it conditional on highmem
anyway (or even CONFIG_64GB, I'm pretty sure 4GB machines don't need it),
I don't think it matters (ie you'd just turn it on instead of pte_highmem).

But you're right, we do need to take that into consideration.

> What I found attractive was the persistent kmap in userspace, but that
> idea breaks with threading, and Andrew found another way (making the
> page fault interruptible), so it doesn't seem very worthwhile anymore
> even w/o threading.

Yeah, I've given up on that one ;-) The main use for it was pagetables
anyway, and we can do that without the threading problems.

M.

2004-04-08 22:19:19

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

On Thu, Apr 08, 2004 at 03:19:51PM -0700, Martin J. Bligh wrote:
> --On Thursday, April 08, 2004 23:59:46 +0200 Andrea Arcangeli <[email protected]> wrote:
>
> > On Wed, Apr 07, 2004 at 11:24:16PM -0700, Martin J. Bligh wrote:
> >> Instead of fiddling with tuning knobs, I'd prefer to just do the UKVA
> >> idea I've proposed before, and let each process have their own pagetables
> >> mapped permanently ;-)
> >
> > that will have you pay for pte-highmem even on non-highmem machines.
> > I've always been against your above idea ;) It can speed up mmap a bit in
> > some uncommon cases, but I believe your slowdown with the kernel compile
> > comes from the page faults after execve and startup, not from mmap, and
> > worst of all it applies to non-highmem too (no sysctl or tuning knob can
> > save you then). Admittedly some mmap-intensive workload can get a slight
> > speedup compared to pte-highmem, but I don't think that's common, and it
> > has the potential of slowing down the page faults, especially in
> > short-lived tasks, even w/o highmem.
>
> You mean the page-faults for the pagetable mappings themselves? I wouldn't
> have thought that'd make an impact - at least I don't see how it could be
> worse than pte_highmem. And as we could make it conditional on highmem

it's worse because you pay for it even with lowmem.

as for your question of why the overhead would be lower on 1/2G boxes,
that too is because the probability of a page going into highmem is much
lower.

> anyway (or even CONFIG_64GB, I'm pretty sure 4GB machines don't need it),
> I don't think it matters (ie you'd just turn it on instead of pte_highmem).

A single smp kernel with CONFIG64G and ptehighmem=y covers 99% of the
x86 smp hardware on the market, from 32M of ram to 32G of ram both
included, and always at 99% of the peak possible performance of the
hardware; that's really nice IMHO. I don't like design solutions that
require a different kernel image for every few gigs of ram you add to
the machine unless real big gains can be demonstrated. One can recompile
and tune as usual, but we should prefer generic design solutions to
dedicated ones unless they really make a huge difference. Running
CONFIG64G with ptehighmem=y on a 512M box may be, say, 0.1% slower than
nohighmem-noptehighmem; Ingo posted the exact PAE vs non-PAE slowdown a
few days ago, and it's not significant.

> But you're right, we do need to take that into consideration.

Best really would be to benchmark it; I definitely like your kernel
compile -j benchmark for that (but with mem=800m ;).

> > What I found attractive was the persistent kmap in userspace, but that
> > idea breaks with threading, and Andrew found another way (making the
> > page fault interruptible), so it doesn't seem very worthwhile anymore
> > even w/o threading.
>
> Yeah, I've given up on that one ;-) The main use for it was pagetables
> anyway, and we can do that without the threading problems.

agreed ;)

2004-04-08 23:02:42

by Martin J. Bligh

[permalink] [raw]
Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

>> >> Instead of fiddling with tuning knobs, I'd prefer to just do the UKVA
>> >> idea I've proposed before, and let each process have their own pagetables
>> >> mapped permanently ;-)
>> >
>> > that will have you pay for pte-highmem even on non-highmem machines.
>> > I've always been against your above idea ;) It can speed up mmap a bit in
>> > some uncommon cases, but I believe your slowdown with the kernel compile
>> > comes from the page faults after execve and startup, not from mmap, and
>> > worst of all it applies to non-highmem too (no sysctl or tuning knob can
>> > save you then). Admittedly some mmap-intensive workload can get a slight
>> > speedup compared to pte-highmem, but I don't think that's common, and it
>> > has the potential of slowing down the page faults, especially in
>> > short-lived tasks, even w/o highmem.
>>
>> You mean the page-faults for the pagetable mappings themselves? I wouldn't
>> have thought that'd make an impact - at least I don't see how it could be
>> worse than pte_highmem. And as we could make it conditional on highmem
>
> it's worse because you pay for it even with lowmem.
>
> as for your question of why the overhead would be lower on 1/2G boxes,
> that too is because the probability of a page going into highmem is much
> lower.

Me confused. Are you saying it's worse compared to pte_highmem? or to
shoving ptes in lowmem?

>> anyway (or even CONFIG_64GB, I'm pretty sure 4GB machines don't need it),
>> I don't think it matters (ie you'd just turn it on instead of pte_highmem).
>
> A single smp kernel with CONFIG64G and ptehighmem=y covers 99% of the
> x86 smp hardware on the market, from 32M of ram to 32G of ram both
> included, and always at 99% of the peak possible performance of the
> hardware; that's really nice IMHO. I don't like design solutions that
> require a different kernel image for every few gigs of ram you add to
> the machine unless real big gains can be demonstrated. One can recompile
> and tune as usual, but we should prefer generic design solutions to
> dedicated ones unless they really make a huge difference. Running
> CONFIG64G with ptehighmem=y on a 512M box may be, say, 0.1% slower than
> nohighmem-noptehighmem; Ingo posted the exact PAE vs non-PAE slowdown a
> few days ago, and it's not significant.
>
>> But you're right, we do need to take that into consideration.
>
> Best really would be to benchmark it; I definitely like your kernel
> compile -j benchmark for that (but with mem=800m ;).

Ah. You're worried about the distro situation, where PTE_HIGHMEM would
be turned on for a non-highmem machine, right? Makes more sense I guess.
But runtime switching it probably isn't that hard either ;-)

M.

2004-04-08 23:22:18

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

On Thu, Apr 08, 2004 at 04:14:08PM -0700, Martin J. Bligh wrote:
> Me confused. Are you saying it's worse compared to pte_highmem? or to
> shoving ptes in lowmem?

worse than pte_highmem if booting with mem=800m

> Ah. You're worried about the distro situation, where PTE_HIGHMEM would
> be turned on for a non-highmem machine, right? Makes more sense I guess.

it's not just a distro situation, it's about not having to recompile the
kernel for every machine I own. Even gentoo has an option to use a
compile server on the network that builds the packages so you install
the binaries from it, so there must be some value in being able to share
a binary across more than one machine (this is especially true for me
since I upgrade kernels quite often).

it's not just about non-highmem machines: on 1G/2G boxes the probability
that pte-highmem causes you any slowdown is an order of magnitude smaller
than on a 32G machine (where ptes should never hit lowmem, or it means my
classzone lowmem_reserve_ratio algorithms have not yet been ported to 2.6).
With your model you'd have no way to get a boost when you are lucky enough
to get a lowmem page.
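
For anyone who hasn't followed the classzone work: the point of a lowmem
reserve is that an allocation which *could* have gone to highmem must leave
extra free pages behind before it is allowed to fall back into a lower zone.
A toy version of the check (illustrative struct, field names and numbers;
not the 2.6 code):

#include <stdio.h>

struct zone_sketch {
        unsigned long free_pages;
        unsigned long pages_low;                /* normal watermark */
        unsigned long lowmem_reserve[3];        /* indexed by the highest zone
                                                   the allocation could use */
};

/* May a fallback allocation aimed at zone 'wanted' take a page from 'z'? */
static int zone_allowed(const struct zone_sketch *z, int wanted)
{
        return z->free_pages > z->pages_low + z->lowmem_reserve[wanted];
}

int main(void)
{
        /* A lowmem zone on a big-highmem box: large reserve against
           highmem-capable allocations, none against lowmem-only ones. */
        struct zone_sketch normal = {
                .free_pages = 2000,
                .pages_low = 500,
                .lowmem_reserve = { 0, 0, 4000 },       /* [2] = vs highmem-capable */
        };

        printf("GFP_KERNEL allocation allowed:   %d\n", zone_allowed(&normal, 1));
        printf("__GFP_HIGHMEM fallback allowed:  %d\n", zone_allowed(&normal, 2));
        return 0;
}

This is why on a 32G box pte pages (which are highmem-capable) should
essentially never end up eating lowmem.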

2004-04-08 23:31:02

by Martin J. Bligh

[permalink] [raw]
Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

--On Friday, April 09, 2004 01:22:15 +0200 Andrea Arcangeli <[email protected]> wrote:

> On Thu, Apr 08, 2004 at 04:14:08PM -0700, Martin J. Bligh wrote:
>> Me confused. Are you saying it's worse compared to pte_highmem? or to
>> shoving ptes in lowmem?
>
> worse than pte_highmem if booting with mem=800m
>
>> Ah. You're worried about the distro situation, where PTE_HIGHMEM would
>> be turned on for a non-highmem machine, right? Makes more sense I guess.
>
> it's not just a distro situation, it's about not having to recompile the
> kernel for every machine I own. Even gentoo has an option to use a
> compile server on the network that builds the packages so you install
> the binaries from it, so there must be some value in being able to share
> a binary across more than one machine (this is especially true for me
> since I upgrade kernels quite often).
>
> it's not just about non-highmem machines: on 1G/2G boxes the probability
> that pte-highmem causes you any slowdown is an order of magnitude smaller
> than on a 32G machine (where ptes should never hit lowmem, or it means my
> classzone lowmem_reserve_ratio algorithms have not yet been ported to 2.6).
> With your model you'd have no way to get a boost when you are lucky enough
> to get a lowmem page.

OK, I think I understand your concern now - I was being slow ;-)
I guess there are a few more PTEs to set up on exec, you're right.
I still think it's faster than pte_highmem, which was a static config
option anyway (so it's better in all cases than pte_highmem, when enabled)
but still not perfect. Hmm. I'll go think about it ;-)

m.

2004-04-08 23:49:05

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]

On Thu, Apr 08, 2004 at 04:42:37PM -0700, Martin J. Bligh wrote:
> but still not perfect. Hmm. I'll go think about it ;-)

nothing is perfect anyways ;). If you have an implementation it'd be very
interesting to see a result for the kernel compile with -j and mem=800m;
that should be close to a real-life worst case. Then we'll see if the
setup after execve is slowed down measurably. It's hard to tell, but we
know there's a chance, since you measured a significant slowdown from
pte-highmem on very-highmem machines with short-lived tasks. btw, note
that if the task isn't short-lived, good apps shouldn't flood with mmap
either (at least on 64bit ;)