LinuxLists.cc - lmbench lat_mmap slowdown with CONFIG

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

* Nick Piggin <[email protected]> wrote:

> Hi,
>
> I'm looking at regressions since 2.6.16, and one is lat_mmap has slowed
> down. On further investigation, a large part of this is not due to a
> _regression_ as such, but the introduction of CONFIG_PARAVIRT=y.
>
> Now, it is true that lat_mmap is basically a microbenchmark, however it
> is exercising the memory mapping and page fault handler paths, so we're
> talking about pretty important paths here. So I think it should be of
> interest.
>
> I've run the tests on a 2s8c AMD Barcelona system, binding the test to
> CPU0, and running 100 times (stddev is a bit hard to bring down, and my
> scripts needed 100 runs in order to pick up much smaller changes in the
> results -- for CONFIG_PARAVIRT, just a couple of runs should show up the
> problem).
>
> Times I believe are in nanoseconds for lmbench, anyway lower is better.
>
> non pv AVG=464.22 STD=5.56
> paravirt AVG=502.87 STD=7.36
>
> Nearly 10% performance drop here, which is quite a bit... hopefully
> people are testing the speed of their PV implementations against non-PV
> bare metal :)

Ouch, that looks unacceptably expensive. All the major distros turn
CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the express
promise to have no measurable runtime overhead.

( And i suspect the real life mmap cost is probably even more expensive,
as on a Barcelona all of lmbench fits into the cache hence we dont see
any real $cache overhead. )

Jeremy, any ideas where this slowdown comes from and how it could be
fixed?

Ingo

2009-01-20 12:34:30

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

On Tue, Jan 20, 2009 at 12:26:34PM +0100, Ingo Molnar wrote:
>
> * Nick Piggin <[email protected]> wrote:
>
> > Hi,
> >
> > I'm looking at regressions since 2.6.16, and one is lat_mmap has slowed
> > down. On further investigation, a large part of this is not due to a
> > _regression_ as such, but the introduction of CONFIG_PARAVIRT=y.
> >
> > Now, it is true that lat_mmap is basically a microbenchmark, however it
> > is exercising the memory mapping and page fault handler paths, so we're
> > talking about pretty important paths here. So I think it should be of
> > interest.
> >
> > I've run the tests on a 2s8c AMD Barcelona system, binding the test to
> > CPU0, and running 100 times (stddev is a bit hard to bring down, and my
> > scripts needed 100 runs in order to pick up much smaller changes in the
> > results -- for CONFIG_PARAVIRT, just a couple of runs should show up the
> > problem).
> >
> > Times I believe are in nanoseconds for lmbench, anyway lower is better.
> >
> > non pv AVG=464.22 STD=5.56
> > paravirt AVG=502.87 STD=7.36
> >
> > Nearly 10% performance drop here, which is quite a bit... hopefully
> > people are testing the speed of their PV implementations against non-PV
> > bare metal :)
>
> Ouch, that looks unacceptably expensive. All the major distros turn
> CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the express
> promise to have no measurable runtime overhead.
>
> ( And i suspect the real life mmap cost is probably even more expensive,
> as on a Barcelona all of lmbench fits into the cache hence we dont see
> any real $cache overhead. )

The PV kernel has over 100K larger text size, nearly 40K alone in mm/ and
kernel/. Definitely we don't see the worst of the icache or branch buffer
overhead on this microbenchmark. (wow, that's a nasty amount of bloat :( )

> Jeremy, any ideas where this slowdown comes from and how it could be
> fixed?

I had a bit of a poke around the profiles, but nothing stood out. However
oprofile counted 50% more cycles in the kernel with PV than with non-PV.
I'll have to take a look at the user/system times, because 50% seems
ludicrous.... hopefully it's just oprofile noise.

2009-01-20 12:46:02

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

* Nick Piggin <[email protected]> wrote:

> On Tue, Jan 20, 2009 at 12:26:34PM +0100, Ingo Molnar wrote:
> >
> > * Nick Piggin <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I'm looking at regressions since 2.6.16, and one is lat_mmap has slowed
> > > down. On further investigation, a large part of this is not due to a
> > > _regression_ as such, but the introduction of CONFIG_PARAVIRT=y.
> > >
> > > Now, it is true that lat_mmap is basically a microbenchmark, however it
> > > is exercising the memory mapping and page fault handler paths, so we're
> > > talking about pretty important paths here. So I think it should be of
> > > interest.
> > >
> > > I've run the tests on a 2s8c AMD Barcelona system, binding the test to
> > > CPU0, and running 100 times (stddev is a bit hard to bring down, and my
> > > scripts needed 100 runs in order to pick up much smaller changes in the
> > > results -- for CONFIG_PARAVIRT, just a couple of runs should show up the
> > > problem).
> > >
> > > Times I believe are in nanoseconds for lmbench, anyway lower is better.
> > >
> > > non pv AVG=464.22 STD=5.56
> > > paravirt AVG=502.87 STD=7.36
> > >
> > > Nearly 10% performance drop here, which is quite a bit... hopefully
> > > people are testing the speed of their PV implementations against non-PV
> > > bare metal :)
> >
> > Ouch, that looks unacceptably expensive. All the major distros turn
> > CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the express
> > promise to have no measurable runtime overhead.
> >
> > ( And i suspect the real life mmap cost is probably even more expensive,
> > as on a Barcelona all of lmbench fits into the cache hence we dont see
> > any real $cache overhead. )
>
> The PV kernel has over 100K larger text size, nearly 40K alone in mm/ and
> kernel/. Definitely we don't see the worst of the icache or branch buffer
> overhead on this microbenchmark. (wow, that's a nasty amount of bloat :( )
>
>
> > Jeremy, any ideas where this slowdown comes from and how it could be
> > fixed?
>
> I had a bit of a poke around the profiles, but nothing stood out.
> However oprofile counted 50% more cycles in the kernel with PV than with
> non-PV. I'll have to take a look at the user/system times, because 50%
> seems ludicrous.... hopefully it's just oprofile noise.

If you have a Core2 test-system could you please try tip/master, which
also has your do_page_fault-de-bloating patch applied?

<plug>

The other advantage of tip/master would be that you could try precise
performance counter measurements via:

http://redhat.com/~mingo/perfcounters/timec.c

and split out the lmbench test-case into a standalone .c file loop.
Running it as:

$ taskset 0 ./timec -e -5,-4,-3,0,1,2,3 ./mmap-test

Will give you very precise information about what's going on in that
workload:

Performance counter stats for 'mmap-test':

628315.871980 task clock ticks (msecs)

42330 CPU migrations (events)
124980 context switches (events)
18698292 pagefaults (events)
1351875946010 CPU cycles (events)
1121901478363 instructions (events)
10654788968 cache references (events)
633581867 cache misses (events)

You might also want to try an NMI profile via kerneltop:

http://redhat.com/~mingo/perfcounters/kerneltop.c

just run it with no arguments on a perfcounters kernel and it will give
you something like:

------------------------------------------------------------------------------
KernelTop: 20297 irqs/sec [NMI, 10000 cache-misses], (all, 8 CPUs)
------------------------------------------------------------------------------

events RIP kernel function
______ ______ ________________ _______________

12816.00 - ffffffff803d5760 : copy_user_generic_string!
11751.00 - ffffffff80647a2c : unix_stream_recvmsg
10215.00 - ffffffff805eda5f : sock_alloc_send_skb
9738.00 - ffffffff80284821 : flush_free_list
6749.00 - ffffffff802854a1 : __kmalloc_track_caller
3663.00 - ffffffff805f09fa : skb_dequeue
3591.00 - ffffffff80284be2 : kmem_cache_alloc [qla2xxx]
3501.00 - ffffffff805f15f5 : __alloc_skb
1296.00 - ffffffff803d8eb4 : list_del [qla2xxx]
1110.00 - ffffffff805f0ed2 : kfree_skb

Ingo

2009-01-20 13:41:37

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

On Tue, Jan 20, 2009 at 01:45:00PM +0100, Ingo Molnar wrote:
>
> * Nick Piggin <[email protected]> wrote:
>
> > On Tue, Jan 20, 2009 at 12:26:34PM +0100, Ingo Molnar wrote:
> > >
> > > * Nick Piggin <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm looking at regressions since 2.6.16, and one is lat_mmap has slowed
> > > > down. On further investigation, a large part of this is not due to a
> > > > _regression_ as such, but the introduction of CONFIG_PARAVIRT=y.
> > > >
> > > > Now, it is true that lat_mmap is basically a microbenchmark, however it
> > > > is exercising the memory mapping and page fault handler paths, so we're
> > > > talking about pretty important paths here. So I think it should be of
> > > > interest.
> > > >
> > > > I've run the tests on a 2s8c AMD Barcelona system, binding the test to
> > > > CPU0, and running 100 times (stddev is a bit hard to bring down, and my
> > > > scripts needed 100 runs in order to pick up much smaller changes in the
> > > > results -- for CONFIG_PARAVIRT, just a couple of runs should show up the
> > > > problem).
> > > >
> > > > Times I believe are in nanoseconds for lmbench, anyway lower is better.
> > > >
> > > > non pv AVG=464.22 STD=5.56
> > > > paravirt AVG=502.87 STD=7.36
> > > >
> > > > Nearly 10% performance drop here, which is quite a bit... hopefully
> > > > people are testing the speed of their PV implementations against non-PV
> > > > bare metal :)
> > >
> > > Ouch, that looks unacceptably expensive. All the major distros turn
> > > CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the express
> > > promise to have no measurable runtime overhead.
> > >
> > > ( And i suspect the real life mmap cost is probably even more expensive,
> > > as on a Barcelona all of lmbench fits into the cache hence we dont see
> > > any real $cache overhead. )
> >
> > The PV kernel has over 100K larger text size, nearly 40K alone in mm/ and
> > kernel/. Definitely we don't see the worst of the icache or branch buffer
> > overhead on this microbenchmark. (wow, that's a nasty amount of bloat :( )
> >
> >
> > > Jeremy, any ideas where this slowdown comes from and how it could be
> > > fixed?
> >
> > I had a bit of a poke around the profiles, but nothing stood out.
> > However oprofile counted 50% more cycles in the kernel with PV than with
> > non-PV. I'll have to take a look at the user/system times, because 50%
> > seems ludicrous.... hopefully it's just oprofile noise.

kbuild costs go up a bit (average of 30 builds)
elapsed
non-pv: AVG=53.31s STD=0.99
pv: AVG=53.54s STD=0.94

user
non-pv: AVG=318.63s STD=0.19
pv: AVG=319.33s STD=0.23

system
non-pv: AVG=30.56s STD=0.15
pv: AVG=31.80s STD=0.15

kernel side of the kbuild workload slows down by 4.1%. User time also
increases a bit (probably more cache and branch misses).

> If you have a Core2 test-system could you please try tip/master, which
> also has your do_page_fault-de-bloating patch applied?

Will try to get one to do some runs on.

Thanks,
Nick

2009-01-20 14:03:52

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

* Ingo Molnar <[email protected]> wrote:

> > Times I believe are in nanoseconds for lmbench, anyway lower is
> > better.
> >
> > non pv AVG=464.22 STD=5.56
> > paravirt AVG=502.87 STD=7.36
> >
> > Nearly 10% performance drop here, which is quite a bit... hopefully
> > people are testing the speed of their PV implementations against
> > non-PV bare metal :)
>
> Ouch, that looks unacceptably expensive. All the major distros turn
> CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the express
> promise to have no measurable runtime overhead.

Here are some more precise stats done via hw counters on a perfcounters
kernel using 'timec', running a modified version of the 'mmap performance
stress-test' app i made years ago.

The MM benchmark app can be downloaded from:

http://redhat.com/~mingo/misc/mmap-perf.c

timec.c can be picked up from:

http://redhat.com/~mingo/perfcounters/timec.c

mmap-perf conducts 1 million mmap()/munmap()/mremap() calls, and touches
the mapped area as well with a certain chance. The patterns are
pseudo-random and the random seed is initialized to the same value so
repeated runs produce the exact same mmap sequence.

I ran the test with a single thread and bound to a single core:

# taskset 2 timec -e -5,-4,-3,0,1,2,3 ./mmap-perf 1

[ I ran it as root - so that kernel-space hardware-counter statistics are
included as well. ]

The results are quite surprisingly candid about the true costs of
paravirt_ops on the native kernel's overhead (CONFIG_PARAVIRT=y):

-----------------------------------------------
| Performance counter stats for './mmap-perf' |
-----------------------------------------------
| |
| x86-defconfig | PARAVIRT=y
|------------------------------------------------------------------
|
| 1311.554526 | 1360.624932 task clock ticks (msecs) +3.74%
| |
| 1 | 1 CPU migrations
| 91 | 79 context switches
| 55945 | 55943 pagefaults
| ............................................
| 3781392474 | 3918777174 CPU cycles +3.63%
| 1957153827 | 2161280486 instructions +10.43%
| 50234816 | 51303520 cache references +2.12%
| 5428258 | 5583728 cache misses +2.86%
| |
| 1314.782469 | 1363.694447 time elapsed (msecs) +3.72%
| |
-----------------------------------

The most surprising element is that in the paravirt_ops case we run 204
million more instructions - out of the ~2000 million instructions total.

That's an increase of over 10%!

That shows the expected $cache risks here as well: i ran this on an
Extreme Edition CPU which has a ton of L2 cache [4MB] which mutes L2
$cache misses quite a bit.

Note that this workload tests a broader range of MM related codepaths -
not just pure pagefault costs.

Ingo

ps. Measurement methodology:

The software counters show that the test was indeed done on an idle
system: there are no CPU migrations (the task is affine), nor any
significant context-switches, and the pagefault count is essentially
the same as well. (because this is a fully repeatable workload.)

The numbers are a representative sample from a run of more than 10
testruns, on an otherwise idle system. Measurement noise is very low:

3906920196 CPU cycles (events)
3907556124 CPU cycles (events)
3907902335 CPU cycles (events)
3914423870 CPU cycles (events)
3915642464 CPU cycles (events)
3916134988 CPU cycles (events)
3916840093 CPU cycles (events)
3918777174 CPU cycles (events)
3918993251 CPU cycles (events)
3919907192 CPU cycles (events)

The max/min spread of 10 runs is 0.3%, so the precision of this
measurement is in the 0.1% range - more than enough to be conclusive.

The max/min spread of the instruction counts is even better: in the
0.01% range. (that is because exactly the same workload is executed -
only timer IRQs and small disturbances cause noise here.)

2009-01-20 14:17:13

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

On Tue, Jan 20, 2009 at 03:03:24PM +0100, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > > Times I believe are in nanoseconds for lmbench, anyway lower is
> > > better.
> > >
> > > non pv AVG=464.22 STD=5.56
> > > paravirt AVG=502.87 STD=7.36
> > >
> > > Nearly 10% performance drop here, which is quite a bit... hopefully
> > > people are testing the speed of their PV implementations against
> > > non-PV bare metal :)
> >
> > Ouch, that looks unacceptably expensive. All the major distros turn
> > CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the express
> > promise to have no measurable runtime overhead.
>
> Here are some more precise stats done via hw counters on a perfcounters
> kernel using 'timec', running a modified version of the 'mmap performance
> stress-test' app i made years ago.
>
> The MM benchmark app can be downloaded from:
>
> http://redhat.com/~mingo/misc/mmap-perf.c

BTW. the lmbench test I run directly (it's called lat_mmap.c, and gets
compiled into a standalone lat_mmap exec by the standard lmbench build).

2009-01-20 14:18:36

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

* Nick Piggin <[email protected]> wrote:

> On Tue, Jan 20, 2009 at 03:03:24PM +0100, Ingo Molnar wrote:
> >
> > * Ingo Molnar <[email protected]> wrote:
> >
> > > > Times I believe are in nanoseconds for lmbench, anyway lower is
> > > > better.
> > > >
> > > > non pv AVG=464.22 STD=5.56
> > > > paravirt AVG=502.87 STD=7.36
> > > >
> > > > Nearly 10% performance drop here, which is quite a bit... hopefully
> > > > people are testing the speed of their PV implementations against
> > > > non-PV bare metal :)
> > >
> > > Ouch, that looks unacceptably expensive. All the major distros turn
> > > CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the express
> > > promise to have no measurable runtime overhead.
> >
> > Here are some more precise stats done via hw counters on a perfcounters
> > kernel using 'timec', running a modified version of the 'mmap performance
> > stress-test' app i made years ago.
> >
> > The MM benchmark app can be downloaded from:
> >
> > http://redhat.com/~mingo/misc/mmap-perf.c
>
> BTW. the lmbench test I run directly (it's called lat_mmap.c, and gets
> compiled into a standalone lat_mmap exec by the standard lmbench build).

doesnt that include an indeterminate number of gettimeofday() based
calibration calls? That would make it harder to measure its total costs in
a comparative way.

Ingo

2009-01-20 14:41:25

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

On Tue, Jan 20, 2009 at 03:17:35PM +0100, Ingo Molnar wrote:
>
> * Nick Piggin <[email protected]> wrote:
> >
> > BTW. the lmbench test I run directly (it's called lat_mmap.c, and gets
> > compiled into a standalone lat_mmap exec by the standard lmbench build).
>
> doesnt that include an indeterminate number of gettimeofday() based
> calibration calls? That would make it harder to measure its total costs in
> a comparative way.

Hmm... yes probably for really detailed profile comparisons or
other external measurements it would need modification.

2009-01-20 15:01:30

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

* Nick Piggin <[email protected]> wrote:

> On Tue, Jan 20, 2009 at 03:17:35PM +0100, Ingo Molnar wrote:
> >
> > * Nick Piggin <[email protected]> wrote:
> > >
> > > BTW. the lmbench test I run directly (it's called lat_mmap.c, and gets
> > > compiled into a standalone lat_mmap exec by the standard lmbench build).
> >
> > doesnt that include an indeterminate number of gettimeofday() based
> > calibration calls? That would make it harder to measure its total costs in
> > a comparative way.
>
> Hmm... yes probably for really detailed profile comparisons or other
> external measurements it would need modification.

yeah.

Btw., it's a trend to be aware of i think: as our commit flux goes up and
the average commit size goes down, it becomes harder and harder to measure
the per commit performance impact.

There's just 3 ways to handle it: decrease commit flux (which is out of
question), or to increase commits size (wich is out of question as well),
or to improve the quality of our measurements.

We can improve performance measurement quality in a number of ways:

- We can (and should) increase instrumentation precision

/usr/bin/time's 10 msec measurement granularity might have been
fine a decade ago but it is not fine today.

- We can (and should) increase the number of 'dimensions' (metrics) we
can instrument the kernel with.

Right now we basically only measure along the time axis, in 99% of
the cases. But 'elapsed time' is a tricky, compound and thus noisy
unit: it is affected by all delays in a workload. We do profiles
occasionally, but they are a lot more difficult to generate and a lot
harder to compare and are hard to be plugged into regression
analysis.

So if we see a statistically significant shift in one of more metrics of
something like:

-------------------------------------------------
|
| $ ./timec -e -5,-4,-3,0,1,2,3 make -j16 bzImage
|
| [...]
| Kernel: arch/x86/boot/bzImage is ready (#28)
|
| Performance counter stats for 'make':
|
| 628315.871980 task clock ticks (msecs)
|
| 42330 CPU migrations (events)
| 124980 context switches (events)
| 18698292 pagefaults (events)
| 1351875946010 CPU cycles (events)
| 1121901478363 instructions (events)
| 10654788968 cache references (events)
| 633581867 cache misses (events)
|
| Wall-clock time elapsed: 118348.109066 msecs
|
-----------------------------------------------

Becomes a _lot_ harder to ignore (and talk out of existence) than it is to
ignore a few minor digits changing in:

---------------------------------
|
| $ time make -j16 bzImage
|
| real 0m12.146s
| user 1m30.050s
| sys 0m12.757s
|
---------------------------------

( Especially as those minor digits tend to be rather noisy to begin with,
due to us sampling system/user time from the timer interrupt. )

It becomes even harder to ignore statistically significant regressions if
some of the metrics are hardware-generated hard physical facts - not
something wishy-washy and statistical as stime/utime statistics.

</plug> ;-)

Ingo

2009-01-20 15:14:19

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

* Ingo Molnar <[email protected]> wrote:

> That shows the expected $cache risks here as well: i ran this on an
> Extreme Edition CPU which has a ton of L2 cache [4MB] which mutes L2
> $cache misses quite a bit.

[ there's no such thing as "L2 $cache misses" - what i wanted to say is
that while the instruction cache size is rather small and static, a
large L2 cache helps in keeping the costs of instruction-cache misses
low - hence my measurement skews in favor of paravirt. ]

Ingo

2009-01-20 19:05:17

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

On Tue, 2009-01-20 at 03:26 -0800, Ingo Molnar wrote:

> Jeremy, any ideas where this slowdown comes from and how it could be
> fixed?

Well I'm early responding to this thread before reading on, but I looked
at the generated assembly for some common mm paths and it looked awful.
The biggest loser was probably having functions to convert pte_t back
and forth to pteval_t, which makes most potential mask / shift
optimizations impossible - indeed, because the compiler doesn't even
understand pte_val(X) = Y is static over the lifetime of the function,
it often calls these same conversions back and forth several times, and
because this is often done inside hidden macros, it's not even possible
to save a cached value in most places.

The bulk of state required to keep this extra conversion around ties up
a lot of registers and as a result heavily limits potential further
optimizations.

The code did not look more branchy to me, however, and gcc seemed to do
a good job with lining up a nice branch structure in the few paths I
looked at.

Zach

2009-01-20 19:32:00

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

* Zachary Amsden <[email protected]> wrote:

> On Tue, 2009-01-20 at 03:26 -0800, Ingo Molnar wrote:
>
> > Jeremy, any ideas where this slowdown comes from and how it could be
> > fixed?
>
> Well I'm early responding to this thread before reading on, but I looked
> at the generated assembly for some common mm paths and it looked awful.
> The biggest loser was probably having functions to convert pte_t back
> and forth to pteval_t, which makes most potential mask / shift
> optimizations impossible - indeed, because the compiler doesn't even
> understand pte_val(X) = Y is static over the lifetime of the function,
> it often calls these same conversions back and forth several times, and
> because this is often done inside hidden macros, it's not even possible
> to save a cached value in most places.
>
> The bulk of state required to keep this extra conversion around ties up
> a lot of registers and as a result heavily limits potential further
> optimizations.
>
> The code did not look more branchy to me, however, and gcc seemed to do
> a good job with lining up a nice branch structure in the few paths I
> looked at.

i've extended my mmap test with branch execution hw-perfcounter stats:

-----------------------------------------------
| Performance counter stats for './mmap-perf' |
-----------------------------------------------
| |
| x86-defconfig | PARAVIRT=y
|------------------------------------------------------------------
|
| 1311.554526 | 1360.624932 task clock ticks (msecs) +3.74%
| |
| 1 | 1 CPU migrations
| 91 | 79 context switches
| 55945 | 55943 pagefaults
| ............................................
| 3781392474 | 3918777174 CPU cycles +3.63%
| 1957153827 | 2161280486 instructions +10.43%
| 50234816 | 51303520 cache references +2.12%
| 5428258 | 5583728 cache misses +2.86%
|
| 437983499 | 478967061 branches +9.36%
| 32486067 | 32336874 branch-misses -0.46%
| |
| 1314.782469 | 1363.694447 time elapsed (msecs) +3.72%
| |
-----------------------------------

So we execute 9.36% more branches - i.e. very noticeably higher as well.

The CPU predicts them slightly more effectively though, the -0.46% for
branch-misses is well above measurement noise (of ~0.02% for the branch
metric) so it's a systematic effect.

Non-functional 'boring' bloat tends to be easier to predict so it's not
necessarily a real surprise. That also explains why despite +10.43% more
instructions the total cycle count went up by a comparatively smaller
+3.63%.

[ that's 64-bit x86 btw. ]

Ingo

2009-01-20 19:37:33

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

* Ingo Molnar <[email protected]> wrote:

> -----------------------------------------------
> | Performance counter stats for './mmap-perf' |
> -----------------------------------------------
> | |
> | x86-defconfig | PARAVIRT=y
> |------------------------------------------------------------------
> |
> | 1311.554526 | 1360.624932 task clock ticks (msecs) +3.74%
> | |
> | 1 | 1 CPU migrations
> | 91 | 79 context switches
> | 55945 | 55943 pagefaults
> | ............................................
> | 3781392474 | 3918777174 CPU cycles +3.63%
> | 1957153827 | 2161280486 instructions +10.43%
> | 50234816 | 51303520 cache references +2.12%
> | 5428258 | 5583728 cache misses +2.86%
> | |
> | 1314.782469 | 1363.694447 time elapsed (msecs) +3.72%
> | |
> -----------------------------------
>
> The most surprising element is that in the paravirt_ops case we run 204
> million more instructions - out of the ~2000 million instructions total.

So because this test does exactly 1 million MM syscalls, the average is
easy to calculate:

The native kernel's average MM syscall cost is 1957 instructions - with
CONFIG_PARAVIRT=y that increases by +10.43% to 2161 instructions. There's
over 200 extra instructions executed per MM syscall that we only do due to
CONFIG_PARAVIRT=y.

Ingo

2009-01-20 20:46:24

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>
>>> Times I believe are in nanoseconds for lmbench, anyway lower is
>>> better.
>>>
>>> non pv AVG=464.22 STD=5.56
>>> paravirt AVG=502.87 STD=7.36
>>>
>>> Nearly 10% performance drop here, which is quite a bit... hopefully
>>> people are testing the speed of their PV implementations against
>>> non-PV bare metal :)
>>>
>> Ouch, that looks unacceptably expensive. All the major distros turn
>> CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the express
>> promise to have no measurable runtime overhead.
>>
>
> Here are some more precise stats done via hw counters on a perfcounters
> kernel using 'timec', running a modified version of the 'mmap performance
> stress-test' app i made years ago.
>
> The MM benchmark app can be downloaded from:
>
> http://redhat.com/~mingo/misc/mmap-perf.c
>
> timec.c can be picked up from:
>
> http://redhat.com/~mingo/perfcounters/timec.c
>
> mmap-perf conducts 1 million mmap()/munmap()/mremap() calls, and touches
> the mapped area as well with a certain chance. The patterns are
> pseudo-random and the random seed is initialized to the same value so
> repeated runs produce the exact same mmap sequence.
>
> I ran the test with a single thread and bound to a single core:
>
> # taskset 2 timec -e -5,-4,-3,0,1,2,3 ./mmap-perf 1
>
> [ I ran it as root - so that kernel-space hardware-counter statistics are
> included as well. ]
>
> The results are quite surprisingly candid about the true costs of
> paravirt_ops on the native kernel's overhead (CONFIG_PARAVIRT=y):
>
> -----------------------------------------------
> | Performance counter stats for './mmap-perf' |
> -----------------------------------------------
> | |
> | x86-defconfig | PARAVIRT=y
> |------------------------------------------------------------------
> |
> | 1311.554526 | 1360.624932 task clock ticks (msecs) +3.74%
> | |
> | 1 | 1 CPU migrations
> | 91 | 79 context switches
> | 55945 | 55943 pagefaults
> | ............................................
> | 3781392474 | 3918777174 CPU cycles +3.63%
> | 1957153827 | 2161280486 instructions +10.43%
>

!!

> | 50234816 | 51303520 cache references +2.12%
> | 5428258 | 5583728 cache misses +2.86%
>

Is this I or D, or combined?

> | |
> | 1314.782469 | 1363.694447 time elapsed (msecs) +3.72%
> | |
> -----------------------------------
>
> The most surprising element is that in the paravirt_ops case we run 204
> million more instructions - out of the ~2000 million instructions total.
>
> That's an increase of over 10%!
>

Yow! That's pretty awful. We knew that static instruction count was
up, but wouldn't have thought that it would hit the dynamic instruction
count so much...

I think there are some immediate tweaks we can make to the code
generated for each call site, which will help to an extent.

J

2009-01-20 20:57:41

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

* Jeremy Fitzhardinge <[email protected]> wrote:

> Ingo Molnar wrote:
>> * Ingo Molnar <[email protected]> wrote:
>>
>>
>>>> Times I believe are in nanoseconds for lmbench, anyway lower is
>>>> better.
>>>>
>>>> non pv AVG=464.22 STD=5.56
>>>> paravirt AVG=502.87 STD=7.36
>>>>
>>>> Nearly 10% performance drop here, which is quite a bit... hopefully
>>>> people are testing the speed of their PV implementations against
>>>> non-PV bare metal :)
>>>>
>>> Ouch, that looks unacceptably expensive. All the major distros turn
>>> CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the
>>> express promise to have no measurable runtime overhead.
>>>
>>
>> Here are some more precise stats done via hw counters on a perfcounters
>> kernel using 'timec', running a modified version of the 'mmap
>> performance stress-test' app i made years ago.
>>
>> The MM benchmark app can be downloaded from:
>>
>> http://redhat.com/~mingo/misc/mmap-perf.c
>>
>> timec.c can be picked up from:
>>
>> http://redhat.com/~mingo/perfcounters/timec.c
>>
>> mmap-perf conducts 1 million mmap()/munmap()/mremap() calls, and
>> touches the mapped area as well with a certain chance. The patterns are
>> pseudo-random and the random seed is initialized to the same value so
>> repeated runs produce the exact same mmap sequence.
>>
>> I ran the test with a single thread and bound to a single core:
>>
>> # taskset 2 timec -e -5,-4,-3,0,1,2,3 ./mmap-perf 1
>>
>> [ I ran it as root - so that kernel-space hardware-counter statistics
>> are included as well. ]
>>
>> The results are quite surprisingly candid about the true costs of
>> paravirt_ops on the native kernel's overhead (CONFIG_PARAVIRT=y):
>>
>> -----------------------------------------------
>> | Performance counter stats for './mmap-perf' |
>> -----------------------------------------------
>> | |
>> | x86-defconfig | PARAVIRT=y
>> |------------------------------------------------------------------
>> |
>> | 1311.554526 | 1360.624932 task clock ticks (msecs) +3.74%
>> | |
>> | 1 | 1 CPU migrations
>> | 91 | 79 context switches
>> | 55945 | 55943 pagefaults
>> | ............................................
>> | 3781392474 | 3918777174 CPU cycles +3.63%
>> | 1957153827 | 2161280486 instructions +10.43%
>>
>
> !!
>
>> | 50234816 | 51303520 cache references +2.12%
>> | 5428258 | 5583728 cache misses +2.86%
>>
>
> Is this I or D, or combined?

That's last-level-cache references+misses (L2 cache):

Bit Position Event Name UMask Event Select
CPUID.AH.EBX
3 LLC Reference 4FH 2EH
4 LLC Misses 41H 2EH

Ingo

2009-01-21 07:27:33

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

On Tue, Jan 20, 2009 at 09:56:53PM +0100, Ingo Molnar wrote:
>
> * Jeremy Fitzhardinge <[email protected]> wrote:
> >> | 50234816 | 51303520 cache references +2.12%
> >> | 5428258 | 5583728 cache misses +2.86%
> >>
> >
> > Is this I or D, or combined?
>
> That's last-level-cache references+misses (L2 cache):
>
> Bit Position Event Name UMask Event Select
> CPUID.AH.EBX
> 3 LLC Reference 4FH 2EH
> 4 LLC Misses 41H 2EH

Oh, _llc_ references/misses? Ouch.

You have, what 32K L1I, 32K L1D, and 4MB L2? And even this microbenchmark
is seeing increased L2 misses by nearly 3%. Hmm, I wonder where that is
coming from? Instruction fetches?

It would be interesting to see how "the oltp" benchmark fares with
CONFIG_PARAVIRT turned on. That workload lives and dies by the cache :)

2009-01-21 22:23:38

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

Nick Piggin wrote:
> Oh, _llc_ references/misses? Ouch.
>
> You have, what 32K L1I, 32K L1D, and 4MB L2? And even this microbenchmark
> is seeing increased L2 misses by nearly 3%. Hmm, I wonder where that is
> coming from? Instruction fetches?
>

I assume so. There should be no extra data accesses with
CONFIG_PARAVIRT (hm, there's probably some extra stack/spill traffic,
but I surely hope that's not falling out of cache).

J

2009-01-22 22:27:10

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

Ingo Molnar wrote:
> Ouch, that looks unacceptably expensive. All the major distros turn
> CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the express
> promise to have no measurable runtime overhead.
>
> ( And i suspect the real life mmap cost is probably even more expensive,
> as on a Barcelona all of lmbench fits into the cache hence we dont see
> any real $cache overhead. )
>
> Jeremy, any ideas where this slowdown comes from and how it could be
> fixed?
>

I just posted a couple of patches to pick some low-hanging fruit. It
turns out that we don't need to do any pvops calls to do pte flag
manipulations. I'd be interested to see how much of a difference it
makes (it reduces the static code size by a few k).

J

2009-01-22 22:28:30

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

On Wed, 2009-01-21 at 14:23 -0800, Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:
> > Oh, _llc_ references/misses? Ouch.
> >
> > You have, what 32K L1I, 32K L1D, and 4MB L2? And even this microbenchmark
> > is seeing increased L2 misses by nearly 3%. Hmm, I wonder where that is
> > coming from? Instruction fetches?
> >
>
> I assume so. There should be no extra data accesses with
> CONFIG_PARAVIRT (hm, there's probably some extra stack/spill traffic,
> but I surely hope that's not falling out of cache).

These fragments, from native_pgd_val, certainly don't help:

c0120f60: 55 push %ebp
c0120f61: 89 e5 mov %esp,%ebp
c0120f63: 5d pop %ebp
c0120f64: c3 ret
c0120f65: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi
c0120f69: 8d bc 27 00 00 00 00 lea 0x0(%edi,%eiz,1),%edi

That is really disgusting. We absolutely should be patching away the
function calls here in the native case.. not sure we do that today.

Zach

2009-01-22 22:44:47

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

Zachary Amsden wrote:
> These fragments, from native_pgd_val, certainly don't help:
>
> c0120f60: 55 push %ebp
> c0120f61: 89 e5 mov %esp,%ebp
> c0120f63: 5d pop %ebp
> c0120f64: c3 ret
> c0120f65: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi
> c0120f69: 8d bc 27 00 00 00 00 lea 0x0(%edi,%eiz,1),%edi
>

Yes, that's a rather awful noop; compiling without frame pointers
reduces this to a single "ret".

> That is really disgusting. We absolutely should be patching away the
> function calls here in the native case.. not sure we do that today.
>

I did have some patches to do that at one point. If you set pgd_val =
paravirt_nop, then the patching machinery will completely nop out the
call site. The problem is that it depends on the calling convention
using the same regs for the first arg and return - true for 32-bit, but
not 64. We could fix that with identity functions which the patcher
recognizes and can replace with either pure nops or inline appropriate
register moves.

Also, I just posted patches to get rid of all pvops calls when fetching
or setting flags in a pte, which I hope will help.

J

2009-01-22 22:53:18

by H. Peter Anvin

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

Jeremy Fitzhardinge wrote:
>
> I did have some patches to do that at one point. If you set pgd_val =
> paravirt_nop, then the patching machinery will completely nop out the
> call site. The problem is that it depends on the calling convention
> using the same regs for the first arg and return - true for 32-bit, but
> not 64. We could fix that with identity functions which the patcher
> recognizes and can replace with either pure nops or inline appropriate
> register moves.
>

There is also the option to use assembly wrappers to avoid relying on
the calling convention. This is particularly so since we have sites
where as little as a two-byte instruction gets bloated up with huge
push/pop sequences around a tiny instruction. Those would be better
served with a direct call to a stub (5 bytes), which would be repatched
to the two-byte instruction + 3 byte nop.

-hpa

2009-01-22 22:54:59

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

On Thu, 2009-01-22 at 14:44 -0800, Jeremy Fitzhardinge wrote:

> I did have some patches to do that at one point. If you set pgd_val =
> paravirt_nop, then the patching machinery will completely nop out the
> call site. The problem is that it depends on the calling convention
> using the same regs for the first arg and return - true for 32-bit, but
> not 64. We could fix that with identity functions which the patcher
> recognizes and can replace with either pure nops or inline appropriate
> register moves.

What about removing the identity functions entirely. They are useless,
really. All that is needed is a patch site filled with nops for Xen to
overwrite, just stuffing the value into the proper registers. For
64-bit, it can be a simple mov to satisfy the constraints.

> Also, I just posted patches to get rid of all pvops calls when fetching
> or setting flags in a pte, which I hope will help.

Sounds like it will help.

2009-01-22 22:58:33

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

On Thu, 2009-01-22 at 14:49 -0800, H. Peter Anvin wrote:

> There is also the option to use assembly wrappers to avoid relying on
> the calling convention. This is particularly so since we have sites
> where as little as a two-byte instruction gets bloated up with huge
> push/pop sequences around a tiny instruction. Those would be better
> served with a direct call to a stub (5 bytes), which would be repatched
> to the two-byte instruction + 3 byte nop.

Yes, for known trivial ops (most!), there isn't any reason to ever have
a call to begin with; simply an inline instruction sequence would be
fine, and only those callers that override the sequence would need to
patch. It's possible to write clever macros to assure there is always
space for a 5 byte call.

Zach

2009-01-22 23:04:58

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

* Jeremy Fitzhardinge <[email protected]> wrote:

> Ingo Molnar wrote:
>> Ouch, that looks unacceptably expensive. All the major distros turn
>> CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the express
>> promise to have no measurable runtime overhead.
>>
>> ( And i suspect the real life mmap cost is probably even more expensive,
>> as on a Barcelona all of lmbench fits into the cache hence we dont see
>> any real $cache overhead. )
>>
>> Jeremy, any ideas where this slowdown comes from and how it could be
>> fixed?
>>
>
> I just posted a couple of patches to pick some low-hanging fruit. It
> turns out that we don't need to do any pvops calls to do pte flag
> manipulations. I'd be interested to see how much of a difference it
> makes (it reduces the static code size by a few k).

I've tried your patches - but can see no significant reduction in
overhead. I've updated my table with numbers from your patches:

-----------------------------------------------
| Performance counter stats for './mmap-perf' |
-----------------------------------------------
| | |
| defconfig | PARAVIRT=y | +Jeremy
|-----------------------------------------------------------------------
|
| 1311.55452 | 1360.62493 | 1378.94464 task clock (msecs) +3.74%
| | |
| 1 | 1 | 0 CPU migrations
| 91 | 79 | 77 context switches
| 55945 | 55943 | 55980 pagefaults
|.......................................................................
| 3781392474 | 3918777174 | 3907189795 CPU cycles +3.63%
| 1957153827 | 2161280486 | 2161741689 instructions +10.43%
| 50234816 | 51303520 | 50619593 cache references +2.12%
| 5428258 | 5583728 | 5575808 cache misses +2.86%
|
| 437983499 | 478967061 | 479053595 branches +9.36%
| 32486067 | 32336874 | 32377710 branch-misses -0.46%
| |
| 1314.78246 | 1363.69444 | 1357.58161 time elapsed (msecs) +3.72%
| |
------------------------------------------------------------------------

'+Jeremy' is a CONFIG_PARAVIRT=y run done with your patches.

The most stable count is the instruction count:

| 1957153827 | 2161280486 | 2161741689 instructions +10.43%

But your two patches did not reduce the instruction count in any
measurable way.

In any case, it is rather inefficient of me proxy-testing your patches,
you can do these measurements yourself too on any Core2 or later Intel
CPU, by running tip/master plus picking up these two utilities:

http://people.redhat.com/mingo/perfcounters/perfstat.c
http://redhat.com/~mingo/misc/mmap-perf.c

building them and running this (as root):

taskset 1 ./perfstat ./mmap-perf 1

it will give you numbers like the ones above.

Ingo

2009-01-22 23:30:43

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

On Thu, 2009-01-22 at 15:04 -0800, Ingo Molnar wrote:
> * Jeremy Fitzhardinge <[email protected]> wrote:

> In any case, it is rather inefficient of me proxy-testing your patches,
> you can do these measurements yourself too on any Core2 or later Intel
> CPU, by running tip/master plus picking up these two utilities:

Eek, I have no time to spend on this right now, but if anyone is curious
to run this patch (which heavily breaks Xen), I suspect it will cure
most of the performance ailments.

Back when we did the VMI prototyping, we never saw any significant
benchmark reductions until the introduction of the M-to-P conversion
functions.

Zach

Attachments:

paravirt-drop-mpn-ops.patch (4.24 kB)

2009-01-22 23:55:48

by H. Peter Anvin

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

Zachary Amsden wrote:
> On Thu, 2009-01-22 at 14:49 -0800, H. Peter Anvin wrote:
>
>> There is also the option to use assembly wrappers to avoid relying on
>> the calling convention. This is particularly so since we have sites
>> where as little as a two-byte instruction gets bloated up with huge
>> push/pop sequences around a tiny instruction. Those would be better
>> served with a direct call to a stub (5 bytes), which would be repatched
>> to the two-byte instruction + 3 byte nop.
>
> Yes, for known trivial ops (most!), there isn't any reason to ever have
> a call to begin with; simply an inline instruction sequence would be
> fine, and only those callers that override the sequence would need to
> patch. It's possible to write clever macros to assure there is always
> space for a 5 byte call.
>

It's functionally speaking the same thing... the advantage with starting
out with the call and then patch in the native code as opposed to the
other way around is to be able to handle things properly before we're
ready to run the patching code.

Right now a number of the call sites contain a huge push/pop sequence
followed by an indirect call. We can patch in the native code to avoid
the branch overhead, but the register constraints and icache footprint
is unchanged.

-hpa

2009-01-23 00:08:31

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

H. Peter Anvin wrote:
> Right now a number of the call sites contain a huge push/pop sequence
> followed by an indirect call. We can patch in the native code to
> avoid the branch overhead, but the register constraints and icache
> footprint is unchanged.

That's true for the pvops hooks emitted in the .S files, but not so true
for ones in C code (well, there are no explicit push/pops, but the
presence of the call may cause the compiler to generate them).

The .S hooks can definitely be cleaned up, but I don't think that's
germane to Nick's observations that the mm code is showing slowdowns.

J

2009-01-23 00:14:22

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

Zachary Amsden wrote:
> What about removing the identity functions entirely. They are useless,
> really. All that is needed is a patch site filled with nops for Xen to
> overwrite, just stuffing the value into the proper registers. For
> 64-bit, it can be a simple mov to satisfy the constraints.
>

I think it comes to the same thing really. Both end up generating a
series of nops with values entering and leaving in well-defined
registers. The x86-64 calling convention is a bit awkward because the
first arg is in rdi and the ret is rax, so it can't quite be pure nops,
or we use a non-standard calling-convention with appropriate thunks to
call into C code. I think a mov is a better performance-complexity
tradeoff.

>> Also, I just posted patches to get rid of all pvops calls when fetching
>> or setting flags in a pte, which I hope will help.
>>
>
> Sounds like it will help.
>

...but apparently not.

J

2009-01-27 08:05:29

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

* Jeremy Fitzhardinge <[email protected]> wrote:

>>> Also, I just posted patches to get rid of all pvops calls when
>>> fetching or setting flags in a pte, which I hope will help.
>>
>> Sounds like it will help.
>
> ...but apparently not.

ping?

This is a very serious paravirt_ops slowdown affecting the native kernel's
performance to the tune of 5-10% in certain workloads.

It's been about 2 years ago that paravirt_ops went upstream, when you told
us that something like this would never happen, that paravirt_ops is
designed so flexibly that it will never hinder the native kernel - and if
it does it will be easy to fix it. Now is the time to fulfill that
promise.

Ingo

2009-01-27 08:24:34

[permalink] [raw]

Subject: Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT

Ingo Molnar wrote:
> This is a very serious paravirt_ops slowdown affecting the native kernel's
> performance to the tune of 5-10% in certain workloads.
>
> It's been about 2 years ago that paravirt_ops went upstream, when you told
> us that something like this would never happen, that paravirt_ops is
> designed so flexibly that it will never hinder the native kernel - and if
> it does it will be easy to fix it. Now is the time to fulfill that
> promise.
>

Yep, working on it.

J

2009-01-27 10:17:55