2009-03-02 12:09:39

by Ingo Molnar

Subject: Re: [PATCH] xen: core dom0 support


* Jeremy Fitzhardinge <[email protected]> wrote:

> Ingo Molnar wrote:
>> Personally i'd like to see a sufficient reply to the
>> mmap-perf paravirt regressions pointed out by Nick and
>> reproduced by myself as well. (They were in the 4-5%
>> macro-performance range iirc, which is huge.)
>>
>> So i havent seen any real progress on reducing native kernel
>> overhead with paravirt. Patches were sent but no measurements
>> were done and it seemed to have all fizzled out while the
>> dom0 patches are being pursued.
>>
>
> Hm, I'm not sure what you want me to do here. I sent out
> patches, they got merged, I posted the results of my
> measurements showing that the patches made a substantial
> improvement. I'd love to see confirmation from others that
> the patches help them, but I don't think you can say I've been
> unresponsive about this.

Have i missed a mail of yours perhaps? I dont have any record of
you having posted mmap-perf perfcounters results. I grepped my
mbox and the last mail i saw from you containing the string
"mmap-perf" is from January 20, and it only includes my numbers.

What i'd expect you to do is to proactively measure the
CONFIG_PARAVIRT overhead of the native kernel, and analyze and
address the results - not just reply minimally to my performance
measurements, as that does not really scale in the long run.

Ingo


2009-03-07 09:06:40

by Jeremy Fitzhardinge

Subject: Re: [PATCH] xen: core dom0 support

Ingo Molnar wrote:
> Have i missed a mail of yours perhaps? I dont have any record of
> you having posted mmap-perf perfcounters results. I grepped my
> mbox and the last mail i saw from you containing the string
> "mmap-perf" is from January 20, and it only includes my numbers.


Yes, I think you must have missed a mail. I've attached it for
reference, along with a more complete set of measurements I made
regarding the series of patches applied (series ending at
1f4f931501e9270c156d05ee76b7b872de486304) to improve pvops performance.

My results showed a dramatic drop in cache references (from about 300%
pvop vs non-pvop, down to 125% with the full set of patches applied),
but it didn't seem to have much effect on the overall wallclock time.
I'm a bit sceptical of the numbers here because, while each run's
passes are fairly consistent, booting and remeasuring seemed to cause
larger variations than the differences we're looking at. It would be
easy to handwave it away with "cache effects", but it's not very
satisfying.

I also didn't find the measurements very convincing because the CPU
cycle and instruction counts are effectively unchanged (ie, the
baseline non-pvops vs original pvops apparently execute exactly the
same number of instructions, but we know that there's a lot more going
on), and they show no change even though each added patch definitely
removes some amount of pvops overhead in terms of instructions in the
instruction stream. Is it just measuring usermode stats? I ran it as
root, with the command line you suggested ("./perfstat -e
-5,-4,-3,0,1,2,3 ./mmap-perf 1"). Cache misses wandered up and down in a
fairly non-intuitive way as well.
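
For what it's worth, here is a minimal sketch - not the actual
mmap-perf source, which isn't included here - of the kind of
map/touch/unmap loop such a benchmark exercises; nearly all of its
cost is in the kernel's mmap/fault/munmap paths, so the kernel-side
counters ought to dominate:

/* Not the real mmap-perf (its source isn't in this thread); just an
 * illustration of the kind of map/touch/unmap loop such a benchmark
 * runs, so almost all of its time is spent in kernel mmap code. */
#include <stdio.h>
#include <sys/mman.h>

#define MAP_SIZE        (128 * 1024)
#define PAGE            4096
#define ITERATIONS      100000

int main(void)
{
        for (long i = 0; i < ITERATIONS; i++) {
                char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }
                /* touch each page so the fault path is exercised too */
                for (long off = 0; off < MAP_SIZE; off += PAGE)
                        p[off] = 1;
                munmap(p, MAP_SIZE);
        }
        return 0;
}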

I'll do a rerun comparing current tip.git pvops vs non-pvops to see if I
can get some better results.

J


Attachments:
pvops-mmap-measurements.ods (19.57 kB)
Attached Message (50.57 kB)

2009-03-08 11:02:21

by Ingo Molnar

Subject: Re: [PATCH] xen: core dom0 support


* Jeremy Fitzhardinge <[email protected]> wrote:

> Ingo Molnar wrote:
>> Have i missed a mail of yours perhaps? I dont have any record of you
>> having posted mmap-perf perfcounters results. I grepped my mbox and the
>> last mail i saw from you containing the string "mmap-perf" is from
>> January 20, and it only includes my numbers.
>
>
> Yes, I think you must have missed a mail. I've attached it for
> reference, along with a more complete set of measurements I
> made regarding the series of patches applied (series ending at
> 1f4f931501e9270c156d05ee76b7b872de486304) to improve pvops
> performance.

Yeah - indeed i missed those numbers - they were embedded in a
spreadsheet document attached to the mail ;)

> My results showed a dramatic drop in cache references (from
> about 300% pvop vs non-pvop, down to 125% with the full set of
> patches applied), but it didn't seem to have much effect on the
> overall wallclock time. I'm a bit sceptical of the numbers here
> because, while each run's passes are fairly consistent, booting
> and remeasuring seemed to cause larger variations than the
> differences we're looking at. It would be easy to handwave it
> away with "cache effects", but it's not very satisfying.

Well it's the L2 cache references which are being measured here,
and the L2 cache is likely very large on your test-system. So we
can easily run into associativity limits in the L1 cache while
still being mostly in L2 cache otherwise.

Associativity effects do depend on the kernel image layout and
on the precise kernel data structure allocations we do during
bootup - and they dont really change after that.
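
To illustrate the aliasing effect (with made-up cache geometry, not
the parameters of any box used for these tests): two objects whose
addresses happen to differ by a multiple of line_size * num_sets land
in the same set and compete for the same handful of ways, no matter
how big the cache is overall.

/* Illustrative only - the geometry below is an example (a 32K,
 * 8-way, 64-byte-line L1), not a description of the test machine. */
#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE       64
#define NUM_SETS        64      /* 32768 / (64 * 8) */

static unsigned int cache_set(uintptr_t addr)
{
        return (addr / LINE_SIZE) % NUM_SETS;
}

int main(void)
{
        uintptr_t a = 0x1234040;                     /* some object        */
        uintptr_t b = a + 17 * LINE_SIZE * NUM_SETS; /* unluckily aligned  */
        uintptr_t c = a + 3 * LINE_SIZE;             /* nearby, other set  */

        printf("set(a)=%u set(b)=%u set(c)=%u\n",
               cache_set(a), cache_set(b), cache_set(c));
        return 0;
}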

> I also didn't find the measurements very convincing because
> the CPU cycle and instruction counts are effectively unchanged
> (ie, the baseline non-pvops vs original pvops apparently
> execute exactly the same number of instructions, but we know
> that there's a lot more going on), and they show no change
> even though each added patch definitely removes some amount of
> pvops overhead in terms of instructions in the instruction
> stream. Is it just measuring usermode stats? I ran it as root,
> with the command line you suggested ("./perfstat -e
> -5,-4,-3,0,1,2,3 ./mmap-perf 1"). Cache misses wandered up
> and down in a fairly non-intuitive way as well.

It's measuring kernel stats too - and i very much saw the
instruction count change to the tune of 10% or so.

> I'll do a rerun comparing current tip.git pvops vs non-pvops
> to see if I can get some better results.

Thanks - i'll also try your patch on the same system i measured
for my numbers so we'll have some comparison.

Ingo

2009-03-08 22:00:57

by H. Peter Anvin

Subject: Re: [PATCH] xen: core dom0 support

Ingo Molnar wrote:
>
> Associativity effects do depend on the kernel image layout and
> on the precise kernel data structure allocations we do during
> bootup - and they dont really change after that.
>

By the way, there is a really easy way (if a bit time consuming) to get
the actual variability here -- you have to reboot between runs, even for
the same kernel. It makes the data collection take a long time, but at
least it can be scripted.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-03-08 22:06:36

by Ingo Molnar

Subject: Re: [PATCH] xen: core dom0 support


* H. Peter Anvin <[email protected]> wrote:

> Ingo Molnar wrote:
> >
> > Associativity effects do depend on the kernel image layout
> > and on the precise kernel data structure allocations we do
> > during bootup - and they dont really change after that.
> >
>
> By the way, there is a really easy way (if a bit time
> consuming) to get the actual variability here -- you have to
> reboot between runs, even for the same kernel. It makes the
> data collection take a long time, but at least it can be
> scripted.

Since it's the same kernel image i think the only truly reliable
method would be to reboot between _different_ kernel images:
same instructions but randomly re-align variables both in terms
of absolute address and in terms of relative position to each
other. Plus randomize bootmem allocs and never-gets-freed-really
boot-time allocations.

Really hard to do i think ...
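
In userspace terms the idea would look something like the sketch
below - purely illustrative, no such mechanism exists in the kernel:
give each run a random extra offset so the "same" data sits in
different cache sets from boot to boot, which is what randomizing
bootmem and boot-time allocations would do for kernel structures.

/* Purely illustrative userspace sketch of the idea - no such
 * mechanism exists in the kernel being discussed. A random pad
 * shifts the working buffer by a random number of cache lines on
 * every run. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE_SIZE       64
#define PAYLOAD         (64 * 1024)

int main(void)
{
        srand((unsigned int)time(NULL));

        /* 0..63 whole cache lines of padding, different on each run */
        size_t pad = ((size_t)rand() % 64) * LINE_SIZE;

        char *raw = malloc(PAYLOAD + pad);
        if (!raw)
                return 1;
        char *data = raw + pad;         /* run-to-run alignment now varies */

        for (size_t i = 0; i < PAYLOAD; i++)
                data[i] = (char)i;

        printf("padding this run: %zu bytes\n", pad);
        free(raw);
        return 0;
}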

Ingo

2009-03-08 22:12:11

by H. Peter Anvin

Subject: Re: [PATCH] xen: core dom0 support

Ingo Molnar wrote:
>
> Since it's the same kernel image i think the only truly reliable
> method would be to reboot between _different_ kernel images:
> same instructions but randomly re-align variables both in terms
> of absolute address and in terms of relative position to each
> other. Plus randomize bootmem allocs and never-gets-freed-really
> boot-time allocations.
>
> Really hard to do i think ...
>

Ouch, yeah.

On the other hand, the numbers made sense to me, so I don't see why
there is any reason to distrust them. They show a 5% overhead with
pv_ops enabled, reduced to a 2% overhead with the changes. That is more
or less what would match my intuition from seeing the code.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-03-08 22:12:42

by Ingo Molnar

Subject: Re: [PATCH] xen: core dom0 support


* H. Peter Anvin <[email protected]> wrote:

> Ingo Molnar wrote:
> >
> > Since it's the same kernel image i think the only truly reliable
> > method would be to reboot between _different_ kernel images:
> > same instructions but randomly re-align variables both in terms
> > of absolute address and in terms of relative position to each
> > other. Plus randomize bootmem allocs and never-gets-freed-really
> > boot-time allocations.
> >
> > Really hard to do i think ...
> >
>
> Ouch, yeah.
>
> On the other hand, the numbers made sense to me, so I don't
> see why there is any reason to distrust them. They show a 5%
> overhead with pv_ops enabled, reduced to a 2% overhead with
> the changes. That is more or less what would match my
> intuition from seeing the code.

Yeah - it was Jeremy who expressed doubt in the numbers, not me.

And we need to eliminate that 2% as well - 2% is still an awful
lot of native kernel overhead from a kernel feature that 95%+ of
users do not make any use of.

Ingo

2009-03-09 18:06:55

by Jeremy Fitzhardinge

Subject: Re: [PATCH] xen: core dom0 support

Ingo Molnar wrote:
> * H. Peter Anvin <[email protected]> wrote:
>
>
>> Ingo Molnar wrote:
>>
>>> Since it's the same kernel image i think the only truly reliable
>>> method would be to reboot between _different_ kernel images:
>>> same instructions but randomly re-align variables both in terms
>>> of absolute address and in terms of relative position to each
>>> other. Plus randomize bootmem allocs and never-gets-freed-really
>>> boot-time allocations.
>>>
>>> Really hard to do i think ...
>>>
>>>
>> Ouch, yeah.
>>
>> On the other hand, the numbers made sense to me, so I don't
>> see why there is any reason to distrust them. They show a 5%
>> overhead with pv_ops enabled, reduced to a 2% overhead with
>> the changes. That is more or less what would match my
>> intuition from seeing the code.
>>
>
> Yeah - it was Jeremy who expressed doubt in the numbers, not me.
>

Mainly because I was seeing the instruction and cycle counts completely
unchanged from run to run, which is implausible. They're not zero, so
they're clearly measurements of *something*, but not cycles and
instructions, since we know that they're changing. So what are they
measurements of? And if they're not what they claim, are the other
numbers more meaningful?

It's easy to read the numbers as confirmations of preconceived
expectations of the outcomes, but that's - as I said - unsatisfying.

> And we need to eliminate that 2% as well - 2% is still an awful
> lot of native kernel overhead from a kernel feature that 95%+ of
> users do not make any use of.
>

Well, I think there's a few points here:

1. The test in question is a bit vague about kernel and user
measurements. I assume the stuff coming from perfcounters is
kernel-only state, but the elapsed time includes the usermode
component, and so will be affected by the usermode page placement
and cache effects. If I change the test to copy the test
executable (statically linked, to avoid libraries), then that
should at least fuzz out user page placement.
2. It's true that the cache effects could be due to the precise layout
of the kernel executable; but if those effects are swamping
effects of the changes to improve pvops then it's unclear what the
point of the exercise is. Especially since:
3. It is a config option, so if someone is sensitive to the
performance hit and it gives them no useful functionality to
offset it, then it can be disabled. Distros tend to enable it
because they tend to value function and flexibility over raw
performance; they tend to enable things like audit, selinux,
modules which all have performance hits of a similar scale (of
course, you could argue that more people get benefit from those
features to offset their costs). But,
4. I think you're underestimating the number of people who get
benefit from pvops; the Xen userbase is actually pretty large, and
KVM will use pvops hooks when available to improve Linux-as-guest.
5. Also, we're looking at a single benchmark with no obvious
relevance to a real workload. Perhaps there are workloads which
continuously mash mmap/munmap/mremap(!), but I think they're
fairly rare. Such a benchmark is useful for tuning specific
areas, but if we're going to evaluate pvops overhead, it would be
nice to use something a bit broader to base our measurements on.
Also, what weighting are we going to put on 32 vs 64 bit? Equally
important? One more than the other?

All that said, I would like to get the pvops overhead down to
unmeasurable - the ideal would be to be able to justify removing the
config option altogether and leave it always enabled.

The tradeoff, as always, is how much other complexity are we willing to
stand to get there? The addition of a new calling convention is already
fairly esoteric, but so far it has got us a 60% reduction in overhead
(in this test). But going further is going to get more complex.

For example, the next step would be to attack set_pte (including
set_pte_*, pte_clear, etc), to make them use the new calling convention,
and possibly make them inlineable (ie, to get it as close as possible to
the non-pvops case). But that will require them to be implemented in
asm (to guarantee that they only use the registers they're allowed to
use), and we already have 3 variants of each for the different pagetable
modes. All completely doable, and not even very hard, but it will be
just one more thing to maintain - we just need to be sure the payoff is
worth it.
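
To make the shape of the tradeoff concrete, here is a toy userspace
sketch of the indirection pattern - it is not the real
pv_mmu_ops/set_pte code, just the general "call through an ops table
vs. do the store directly" structure that the patching and
calling-convention work is trying to flatten:

/* Toy illustration of the general pvops shape, not actual kernel
 * code: an ops table of function pointers stands in for pv_mmu_ops,
 * and the "native" backend just performs the store. The kernel
 * patches most such call sites into direct calls (or inline code)
 * at boot. */
#include <stdio.h>
#include <stdint.h>

typedef uint64_t pte_val;

struct mmu_ops {
        void (*set_pte)(pte_val *ptep, pte_val val);
};

static void native_set_pte(pte_val *ptep, pte_val val)
{
        *ptep = val;                    /* the whole native operation */
}

static struct mmu_ops mmu_ops = {
        .set_pte = native_set_pte,      /* a hypervisor backend hooks here */
};

int main(void)
{
        pte_val pte = 0;

        /* the pvops shape: an indirect call for a one-instruction store */
        mmu_ops.set_pte(&pte, 0x1234);

        /* what the non-pvops (or fully inlined/patched) case boils down to */
        pte = 0x5678;

        printf("pte = %#llx\n", (unsigned long long)pte);
        return 0;
}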

J

2009-03-10 12:44:45

by Ingo Molnar

Subject: Re: [PATCH] xen: core dom0 support


* Jeremy Fitzhardinge <[email protected]> wrote:

>> Yeah - it was Jeremy who expressed doubt in the numbers, not me.
>
> Mainly because I was seeing the instruction and cycle counts
> completely unchanged from run to run, which is implausible.
> They're not zero, so they're clearly measurements of
> *something*, but not cycles and instructions, since we know
> that they're changing. So what are they measurements of? And
> if they're not what they claim, are the other numbers more
> meaningful?

Cycle count not changing in a macro-workload is not plausible.
Instruction count not changing can happen sometimes - if the
workload is deterministic (which this one is) and we happen to
get exactly the same number of timer irqs during the test. But
it's more common that it varies slightly - especially on SMP,
where task balancing can be timing-dependent and hence adds
noise.

Ingo

2009-03-10 12:50:17

by Nick Piggin

Subject: Re: [PATCH] xen: core dom0 support

On Tuesday 10 March 2009 05:06:40 Jeremy Fitzhardinge wrote:
> Ingo Molnar wrote:
> > * H. Peter Anvin <[email protected]> wrote:
> >> Ingo Molnar wrote:
> >>> Since it's the same kernel image i think the only truly reliable
> >>> method would be to reboot between _different_ kernel images:
> >>> same instructions but randomly re-align variables both in terms
> >>> of absolute address and in terms of relative position to each
> >>> other. Plus randomize bootmem allocs and never-gets-freed-really
> >>> boot-time allocations.
> >>>
> >>> Really hard to do i think ...
> >>
> >> Ouch, yeah.
> >>
> >> On the other hand, the numbers made sense to me, so I don't
> >> see why there is any reason to distrust them. They show a 5%
> >> overhead with pv_ops enabled, reduced to a 2% overhead with
> >> the changes. That is more or less what would match my
> >> intuition from seeing the code.
> >
> > Yeah - it was Jeremy who expressed doubt in the numbers, not me.
>
> Mainly because I was seeing the instruction and cycle counts completely
> unchanged from run to run, which is implausible. They're not zero, so
> they're clearly measurements of *something*, but not cycles and
> instructions, since we know that they're changing. So what are they
> measurements of? And if they're not what they claim, are the other
> numbers more meaningful?
>
> It's easy to read the numbers as confirmations of preconceived
> expectations of the outcomes, but that's - as I said - unsatisfying.
>
> > And we need to eliminate that 2% as well - 2% is still an awful
> > lot of native kernel overhead from a kernel feature that 95%+ of
> > users do not make any use of.
>
> Well, I think there's a few points here:
>
> 1. The test in question is a bit vague about kernel and user
> measurements. I assume the stuff coming from perfcounters is
> kernel-only state, but the elapsed time includes the usermode
> component, and so will be affected by the usermode page placement
> and cache effects. If I change the test to copy the test
> executable (statically linked, to avoid libraries), then that
> should at least fuzz out user page placement.
> 2. It's true that the cache effects could be due to the precise layout
> of the kernel executable; but if those effects are swamping
> effects of the changes to improve pvops then it's unclear what the
> point of the exercise is. Especially since:
> 3. It is a config option, so if someone is sensitive to the
> performance hit and it gives them no useful functionality to
> offset it, then it can be disabled. Distros tend to enable it
> because they tend to value function and flexibility over raw
> performance; they tend to enable things like audit, selinux,
> modules which all have performance hits of a similar scale (of
> course, you could argue that more people get benefit from those
> features to offset their costs). But,
> 4. I think you're underestimating the number of people who get
> benefit from pvops; the Xen userbase is actually pretty large, and
> KVM will use pvops hooks when available to improve Linux-as-guest.
> 5. Also, we're looking at a single benchmark with no obvious
> relevance to a real workload. Perhaps there are workloads which
> continuously mash mmap/munmap/mremap(!), but I think they're
> fairly rare. Such a benchmark is useful for tuning specific
> areas, but if we're going to evaluate pvops overhead, it would be
> nice to use something a bit broader to base our measurements on.
> Also, what weighting are we going to put on 32 vs 64 bit? Equally
> important? One more than the other?

I saw _most_ of the extra overhead show up in the page fault path. And
also don't forget that fork/exit workloads are essentially mashing
mmap/munmap.

So things which mash these paths include kbuild, scripts, and some
malloc patterns (like you might see in MySQL running OLTP).
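
As a concrete (if contrived) example of that point - not a benchmark
anyone here ran - every iteration of a fork/exit loop like the one
below sets up and tears down an address space, walking exactly those
mmap/munmap/page-fault/teardown paths:

/* Minimal fork/exit loop: each child's setup and teardown exercises
 * the mmap/munmap/page-fault/teardown paths discussed above.
 * Illustrative only, not one of the benchmarks from this thread. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILDREN        10000

int main(void)
{
        for (int i = 0; i < CHILDREN; i++) {
                pid_t pid = fork();

                if (pid < 0) {
                        perror("fork");
                        return 1;
                }
                if (pid == 0)
                        _exit(0);               /* child: exit immediately */
                waitpid(pid, NULL, 0);          /* parent: reap, then repeat */
        }
        return 0;
}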

Of course they tend to do a lot of other stuff as well, so that 2% in a
microbenchmark will translate to much less, but that was never in
dispute. One of the hardest problems is adding lots of features to
critical paths that individually "never show a statistical difference
on any real workload", but combine to slow things down. It really sucks
to have people upgrade and see performance go down.

As an anecdote, I had a problem where an ISV upgraded SLES9 to SLES10
and their software's performance dropped 30% or so. And there were like
3 or 4 things that could be bisected to show a few % of that. This was
without pvops, mind you, but in very similar paths (mmap/munmap/page
fault/teardown). The pvops stuff was basically just an extension of that
saga.

OK, that's probably an extreme case, but any of this stuff must always
be considered a critical fastpath IMO. We know any slowdown is going to
hurt in the long run.


> All that said, I would like to get the pvops overhead down to
> unmeasurable - the ideal would be to be able to justify removing the
> config option altogether and leave it always enabled.
>
> The tradeoff, as always, is how much other complexity are we willing to
> stand to get there? The addition of a new calling convention is already
> fairly esoteric, but so far it has got us a 60% reduction in overhead
> (in this test). But going further is going to get more complex.

If the complexity is not in generic code and is constrained within the
pvops stuff, then from my POV it's "as much as it takes", and you get
to maintain it ;)

Well, that's a bit unfair. From a distro POV, I'd love that to be the
case because we ship pvops. From a kernel.org point of view, you provide
a service that inevitably will have some cost but can be configured out.
But I do think that it would be in your interest too because the speed
of these paths should be important even for virtualised systems.


> For example, the next step would be to attack set_pte (including
> set_pte_*, pte_clear, etc), to make them use the new calling convention,
> and possibly make them inlineable (ie, to get it as close as possible to
> the non-pvops case). But that will require them to be implemented in
> asm (to guarantee that they only use the registers they're allowed to
> use), and we already have 3 variants of each for the different pagetable
> modes. All completely doable, and not even very hard, but it will be
> just one more thing to maintain - we just need to be sure the payoff is
> worth it.

Thanks for what you've done so far. I would like to see this taken as
far as possible. I think it is very worthwhile although complexity is
obviously a very real concern too.