From: Andi Kleen <[email protected]>
On IvyBridge, the MEM_*_RETIRED events can leak through to the same counter
on the other SMT thread. Add a dummy extra_reg to handle this case.
The extra reg is allocated and ensures that nothing else
runs on the same counter on the other thread.
This is normally only visible when multiplexing.
This patch doesn't 100% plug the hole, as the event may still get lost
when the other CPU thread does not have an enabled counter
(e.g. with per-thread counting). However, it fixes it well enough
for the common global profiling case (e.g. perf top / perf stat -a).
In theory it would be possible to handle the other cases
too, but it would need far more changes.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 1 +
arch/x86/kernel/cpu/perf_event_intel.c | 18 +++++++++++++++++-
2 files changed, 18 insertions(+), 1 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index e46f932..5462d6b 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -46,6 +46,7 @@ enum extra_reg_type {
EXTRA_REG_RSP_0 = 0, /* offcore_response_0 */
EXTRA_REG_RSP_1 = 1, /* offcore_response_1 */
EXTRA_REG_LBR = 2, /* lbr_select */
+ EXTRA_REG_MEM_RETIRED = 3, /* used to run MEM_*_RETIRED alone */
EXTRA_REG_MAX /* number of entries needed */
};
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 73121ad..4803bf3 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -160,6 +160,22 @@ static struct extra_reg intel_snb_extra_regs[] __read_mostly = {
EVENT_EXTRA_END
};
+static struct extra_reg intel_ivb_extra_regs[] __read_mostly = {
+ INTEL_EVENT_EXTRA_REG(0xb7, MSR_OFFCORE_RSP_0, 0x3fffffffffull, RSP_0),
+ INTEL_EVENT_EXTRA_REG(0xbb, MSR_OFFCORE_RSP_1, 0x3fffffffffull, RSP_1),
+ /*
+ * A number of MEM_*_RETIRED events can leak through to the same counter
+ * on the other SMT thread. Use a dummy extra reg to make sure they run
+ * alone.
+ */
+ INTEL_EVENT_EXTRA_REG(0xd0, 0, 0, MEM_RETIRED),
+ INTEL_EVENT_EXTRA_REG(0xd1, 0, 0, MEM_RETIRED),
+ INTEL_EVENT_EXTRA_REG(0xd2, 0, 0, MEM_RETIRED),
+ INTEL_EVENT_EXTRA_REG(0xd3, 0, 0, MEM_RETIRED),
+ EVENT_EXTRA_END
+};
+
+
static u64 intel_pmu_event_map(int hw_event)
{
return intel_perfmon_event_map[hw_event];
@@ -2141,7 +2157,7 @@ __init int intel_pmu_init(void)
x86_pmu.event_constraints = intel_ivb_event_constraints;
x86_pmu.pebs_constraints = intel_ivb_pebs_event_constraints;
x86_pmu.pebs_aliases = intel_pebs_aliases_snb;
- x86_pmu.extra_regs = intel_snb_extra_regs;
+ x86_pmu.extra_regs = intel_ivb_extra_regs;
/* all extra regs are per-cpu when HT is on */
x86_pmu.er_flags |= ERF_HAS_RSP_1;
x86_pmu.er_flags |= ERF_NO_HT_SHARING;
--
1.7.7.6
On Mon, Apr 29, 2013 at 11:45:08AM -0700, Andi Kleen wrote:
> From: Andi Kleen <[email protected]>
>
> On IvyBridge, the MEM_*_RETIRED events can leak through to the same counter
> on the other SMT thread. Add a dummy extra_reg to handle this case.
> The extra reg is allocated and ensures that nothing else
> runs on the same counter on the other thread.
>
> This is normally only visible when multiplexing.
>
> This patch doesn't 100% plug the hole, as the event may still get lost
> when the other CPU thread does not have an enabled counter
> (e.g. with per-thread counting). However, it fixes it well enough
> for the common global profiling case (e.g. perf top / perf stat -a).
> In theory it would be possible to handle the other cases
> too, but it would need far more changes.
So you're saying that if two SMT siblings count the same MEM_*_RETIRED event
(on the same counter?) events can get accounted to the wrong sibling?
And when the other sibling doesn't have (the same counter?) enabled we can
lose events?
This begs the question what happens when the sibling does have the (same?)
counter enabled but counting an altogether different event; do we then still
'lose' events from the one sibling and add them to the other counter?
Peter Zijlstra <[email protected]> writes:
>
> So you're saying that if two SMT siblings count the same MEM_*_RETIRED event
> (on the same counter?) events can get accounted to the wrong sibling?
It can happen regardless of what event is enabled on the other counter.
> And when the other sibling doesn't have (the same counter?) enabled we
> can lose events?
doesn't have any events enabled.
> This begs the question what happens when the sibling does have the (same?)
> counter enabled but counting an altogether different event; do we then still
> 'lose' events from the one sibling and add them to the other counter?
Yes, that is what the patch fixes.
Of course only if you actually apply it, and not lose it as usual.
-Andi
--
[email protected] -- Speaking for myself only
* Andi Kleen <[email protected]> wrote:
> Peter Zijlstra <[email protected]> writes:
> >
> > So you're saying that if two SMT siblings count the same MEM_*_RETIRED event
> > (on the same counter?) events can get accounted to the wrong sibling?
>
> It can happen regardless of what event is enabled on the other counter.
>
> > And when the other sibling doesn't have (the same counter?) enabled we
> > can lose events?
>
> doesn't have any events enabled.
>
> > This begs the question what happens when the sibling does have the (same?)
> > counter enabled but counting an altogether different event; do we then still
> > 'lose' events from the one sibling and add them to the other counter?
>
> Yes, that is what the patch fixes.
>
> Of course only if you actually apply it, and not lose it as usual.
If your snide remark is referring to your pending Haswell patchset, then
you are dead wrong: the reason those patches have not been picked up yet is
not because they were ignored, but because they have had to go through 11
review iterations already - still counting (!).
That is a huge amount of overhead on the maintainer side.
With such a negative track record, you should not expect maintainers to
fast-track your patches or trust your judgement too much - your patches
are often sloppy, your changelogs incomplete or outright deceptive, and your
replies often evasive and non-constructive.
Thanks,
Ingo
On Wed, May 01, 2013 at 02:56:32AM -0700, Andi Kleen wrote:
> Peter Zijlstra <[email protected]> writes:
> >
> > So you're saying that if two SMT siblings count the same MEM_*_RETIRED event
> > (on the same counter?) events can get accounted to the wrong sibling?
>
> It can happen regardless of what event is enabled on the other counter.
What I was trying to ask is: where does it leak to? Does it leak to whatever
event the other counter is counting? Or does it explicitly leak to any sibling
counter also counting MEM_*_RETIRED?
SMT0 SMT1
C0 MR MR
C1
C2
C3
So here SMT[01]-C0 will cross count their events.
SMT0 SMT1
C0 MR
C1 MR
C2
C3
Will they too here?
SMT0 SMT1
C0 MR Cycles
C1
C2
C3
What about here?
So again: do they specifically leak between the same counters of siblings or
between the same events of siblings? Your initial explanation wasn't clear on
when and where exactly the leak happens.
> > This begs the question what happens when the sibling does have the (same?)
> > counter enabled but counting an altogether different event; do we then still
> > 'lose' events from the one sibling and add them to the other counter?
>
> Yes, that is what the patch fixes.
Well, it very much depends on the above answer; if case-3 leaks samples from
SMT0-C0 to SMT1-C0 then the patch doesn't fix anything as the SMT1-C0 event
(cycles) doesn't bother with the shared register.
Hi,
This is a serious problem because it silently corrupts events on the
sibling thread.
The problem can be observed easily using a basic perf stat command. The idea
is to use an event for which we know the final count in advance. I use
ROB_MISC_EVENTS:LBR_INSERTS, i.e., the number of entries inserted into
the LBR. Given that the LBR is not used by default, the count should be zero.
Then, you combine this event with the MEM_* events to cause multiplexing
in system-wide mode and you use a workload which does a lot of memory
accesses (I use a triad loop):
On IVB (desktop), you can do:
# perf stat -a -e r20cc,r81d0,r02d2,r02d4,r401d1 triad
You will get a non-zero count for r20cc (lbr_inserts). That's wrong.
If you restrict to one thread per core, you get the correct answer:
# perf stat -C0-3 -a -e r20cc,r81d0,r02d2,r02d4,r401d1 triad
Then you get 0 for lbr_inserts, as expected.
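For reference, the triad loop mentioned above can be as simple as the
following STREAM-style sketch (the array size and iteration count are
arbitrary illustrative choices, not the actual setup used here). The rXXXX
strings are raw umask/event-select encodings; r20cc, for instance, selects
umask 0x20 of event 0xcc, i.e. ROB_MISC_EVENTS.LBR_INSERTS.

/*
 * Minimal triad workload sketch: stream through three large arrays to
 * generate a steady rate of memory loads and stores.
 */
#include <stdio.h>

#define N (4 * 1024 * 1024)

static double a[N], b[N], c[N];

int main(void)
{
	const double scalar = 3.0;
	long i, iter;

	for (i = 0; i < N; i++) {
		b[i] = 1.0;
		c[i] = 2.0;
	}

	for (iter = 0; iter < 100; iter++)
		for (i = 0; i < N; i++)
			a[i] = b[i] + scalar * c[i];

	/* Use a result so the compiler cannot optimize the loops away. */
	printf("%f\n", a[0]);
	return 0;
}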
Unfortunately, Andi's patch fails to correct the problem, though
it is the right approach. As he said, the problem is unfortunately
deeper than this.
On Wed, May 1, 2013 at 1:12 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, May 01, 2013 at 02:56:32AM -0700, Andi Kleen wrote:
>> Peter Zijlstra <[email protected]> writes:
>> >
>> > So you're saying that if two SMT siblings count the same MEM_*_RETIRED event
>> > (on the same counter?) events can get accounted to the wrong sibling?
>>
>> It can happen regardless of what event is enabled on the other counter.
>
> What I was trying to ask is: where does it leak to? Does it leak to whatever
> event the other counter is counting? Or does it explicitly leak to any sibling
> counter also counting MEM_*_RETIRED?
>
> SMT0 SMT1
>
> C0 MR MR
> C1
> C2
> C3
>
> So here SMT[01]-C0 will cross count their events.
>
>
> SMT0 SMT1
>
> C0 MR
> C1 MR
> C2
> C3
>
> Will they too here?
>
>
> SMT0 SMT1
>
> C0 MR Cycles
> C1
> C2
> C3
>
> What about here?
>
>
> So again: do they specifically leak between the same counters of siblings or
> between the same events of siblings? Your initial explanation wasn't clear on
> when and where exactly the leak happens.
>
>
>> > This begs the question what happens when the sibling does have the (same?)
>> > counter enabled but counting an altogether different event; do we then still
>> > 'lose' events from the one sibling and add them to the other counter?
>>
>> Yes, that is what the patch fixes.
>
> Well, it very much depends on the above answer; if case-3 leaks samples from
> SMT0-C0 to SMT1-C0 then the patch doesn't fix anything as the SMT1-C0 event
> (cycles) doesn't bother with the shared register.
>
>
> Unfortunately, Andi's patch fails to correct the problem, though
> it is the right approach. As he said, the problem is unfortunately
perf stat -a / record -a / top will work.
The case it doesn't fix is per-thread counting when the other sibling
is not running a counter.
Essentially with the patch you may still get under-counting
on these specific events if they run per thread (or are really
unlucky in multiplexing), but never over-counting on anything.
In the past Ingo's guidance was to focus on the most common
use cases, which I think this patch does.
It would be possible to fix the per thread mode too, but it would
need quite a bit of new infrastructure and I have doubts that's
worth it.
-Andi
On Wed, May 1, 2013 at 9:13 PM, Andi Kleen <[email protected]> wrote:
>> Unfortunately, Andi's patch fails to correct the problem, though
>> it is the right approach. As he said, the problem is unfortunately
>
> perf stat -a / record -a / top will work.
>
But it does not work for perf stat -a. Look at my example; run it on your machine.
> The case it doesn't fix is per-thread counting when the other sibling
> is not running a counter.
>
> Essentially with the patch you may still get under-counting
> on these specific events if they run per thread (or are really
> unlucky in multiplexing), but never over-counting on anything.
>
> In the past Ingo's guidance was to focus on the most common
> use cases, which I think this patch does.
>
> It would be possible to fix the per thread mode too, but it would
> need quite a bit of new infrastructure and I have doubts that's
> worth it.
>
> -Andi
On Wed, May 01, 2013 at 09:21:55PM +0200, Stephane Eranian wrote:
> On Wed, May 1, 2013 at 9:13 PM, Andi Kleen <[email protected]> wrote:
> >> Unfortunately, Andi's patch fails to correct the problem, though
> >> it is the right approach. As he said, the problem is unfortunately
> >
> > perf stat -a / record -a / top will work.
> >
> But it does not work for perf stat -a. Look at my example; run it on your machine.
Since nobody seems interested in explaining the exact nature of the hardware
failure, I think the best thing we can do is simply blacklist these events.
Broken hardware is broken after all -- we shouldn't jump through too many hoops
pretending it's not.
On Thu, May 2, 2013 at 9:32 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, May 01, 2013 at 09:21:55PM +0200, Stephane Eranian wrote:
>> On Wed, May 1, 2013 at 9:13 PM, Andi Kleen <[email protected]> wrote:
>> >> Unfortunately, Andi's patch fails to correct the problem, though
>> >> it is the right approach. As he said, the problem is unfortunately
>> >
>> > perf stat -a / record -a / top will work.
>> >
>> But it does not work for perf stat -a. Look at my example; run it on your machine.
>
>
> Since nobody seems interested in explaining the exact nature of the hardware
> failure, I think the best thing we can do is simply blacklist these events.
>
> Broken hardware is broken after all -- we shouldn't jump through too many hoops
> pretending it's not.
Problem is, those events are really important. We need to be able to
measure them.
On Mon, May 06, 2013 at 07:41:54PM +0200, Stephane Eranian wrote:
> Problem is, those events are really important. We need to be able to
> measure them.
At what cost? If the code we need to make it work is horrendous, we need to
maintain that too -- possibly quite a lot longer than Intel thinks IVB is
relevant.
Anyway, since these events are immediately dangerous (they corrupt state),
blocking them is the right thing; we can always merge a patch that undoes that
once we've figured out a sane way.
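For concreteness, a minimal sketch of what such a blacklist could look like,
assuming a check wired into the driver's event-setup (hw_config) path -- the
helper names and hook point below are illustrative, not from an actual patch:

/*
 * Hypothetical: reject the IVB MEM_*_RETIRED event-select codes
 * (0xd0-0xd3, the same ones the extra_reg patch covers) outright.
 */
static int ivb_mem_retired_event(u64 config)
{
	u64 event_select = config & 0xff;	/* low 8 bits: event select */

	return event_select >= 0xd0 && event_select <= 0xd3;
}

static int ivb_hw_config_check(struct perf_event *event)
{
	if (ivb_mem_retired_event(event->attr.config))
		return -EOPNOTSUPP;	/* refuse the broken events */
	return 0;
}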
Currently I'm still waiting for someone to explain (in complete detail) how the
corruption manifests. Until that is clear I can't even begin formulating what
might be needed to make it work again.
One mode that makes sense to me is where CNTx on SMTi corrupts CNTx on SMTj for
i != j. However, it has also been said that the only case that doesn't corrupt
is for SMTj to have all counters disabled -- which contradicts the previous
case.
The former case we could maybe make work; the latter I can't see how we could
possibly make work at all.