Hello
It took a long time to track down this bug. This causes a failure in the
PAPI regression tests.
The problem happens with the following:
fork a child, put it to sleep with SIGSTOP
perf_event_open(), attaching to the pid of the child
ENABLE/DISABLE/ENABLE/DISABLE/ENABLE
read the results with the enabled/running multiplexing fields
Your "enabled" time will be a huge negative number, which will confuse any
code depening on it (especially the rdpmc scaling code).
>From what I can tell, when you DISABLE an event that's in a
PERF_EVENT_STATE_INACTIVE state, currently it exits early without
updating event->tstamp_stopped. So later when calculating the
enabled time it is wrong and you get the negative number.
The following quick hack fixes things, it just abuses the existing code
that was there to handle a similar case when a filter failed. The proper
fix might be to just put the event->tstamp_stopped update before the
early out code, but the commit log for the change that introduced that
makes it sound a bit scary.
Anyway a reproducible test case for this is in my pref_event_tests tree,
the "tests/rdpmc/rdpmc_attach_multi_enable" test.
Vince
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 426c2ffba16d..18eadc3b37d5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1812,8 +1812,7 @@ event_sched_out(struct perf_event *event,
* maintained, otherwise bogus information is return
* via read() for time_enabled, time_running:
*/
- if (event->state == PERF_EVENT_STATE_INACTIVE &&
- !event_filter_match(event)) {
+ if (event->state == PERF_EVENT_STATE_INACTIVE) {
delta = tstamp - event->tstamp_stopped;
event->tstamp_running += delta;
event->tstamp_stopped = tstamp;
On Tue, Jul 25, 2017 at 09:47:52PM -0400, Vince Weaver wrote:
> The problem happens with the following:
> fork a child, put it to sleep with SIGSTOP
> perf_event_open(), attaching to the pid of the child
> ENABLE/DISABLE/ENABLE/DISABLE/ENABLE
> read the results with the enabled/running multiplexing fields
>
> Your "enabled" time will be a huge negative number, which will confuse any
> code depening on it (especially the rdpmc scaling code).
>
> From what I can tell, when you DISABLE an event that's in a
> PERF_EVENT_STATE_INACTIVE state, currently it exits early without
> updating event->tstamp_stopped. So later when calculating the
> enabled time it is wrong and you get the negative number.
> The following quick hack fixes things, it just abuses the existing code
> that was there to handle a similar case when a filter failed. The proper
> fix might be to just put the event->tstamp_stopped update before the
> early out code, but the commit log for the change that introduced that
> makes it sound a bit scary.
fa66f07aa1f0 ("perf_events: Fix time tracking for events with pid != -1 and cpu != -1")
is the commit that introduced it (the scary one just moved the && to the
right place).
> Anyway a reproducible test case for this is in my pref_event_tests tree,
> the "tests/rdpmc/rdpmc_attach_multi_enable" test.
>
> Vince
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 426c2ffba16d..18eadc3b37d5 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1812,8 +1812,7 @@ event_sched_out(struct perf_event *event,
> * maintained, otherwise bogus information is return
> * via read() for time_enabled, time_running:
> */
> - if (event->state == PERF_EVENT_STATE_INACTIVE &&
> - !event_filter_match(event)) {
> + if (event->state == PERF_EVENT_STATE_INACTIVE) {
> delta = tstamp - event->tstamp_stopped;
> event->tstamp_running += delta;
> event->tstamp_stopped = tstamp;
It doesn't look entirely crazy.. just trying to understand why Stephane
added that filter clause.
Argh, that time accounting always hurts my brain, its so weird. I never
dared rewrite it for risk of breaking it all again, but maybe, just
maybe we should have.. :/
init(now):
enabled = stopped = running = now;
in(now):
running += now - stopped;
out(now):
stopped = now;
read:
*enabled = stopped - enabled;
*running = stopped - running;
Simple example:
init(10):
e = s = r = 10
in(11):
r += 11 - 10; r=11
out(12):
s = 12
read:
*e = 12-10 = 2
*r = 12-11 = 1
in(14):
r += 14 - 12; r=13
out(16):
s = 16
read:
*e = 16 - 10 = 6
*r = 16 - 13 = 3
which confusingly works :-)
So lets break that by out not updating stopped, then we get:
init(10):
e = s = r = 10
in(11):
r += 11 - 10; r=11
read:
*e = 10-10 = 2
*r = 10-11 = -1
in(14):
r += 14 - 10; r=15
read:
*e = 10 - 10 = 0
*r = 10 - 15 = -5
That is negative for 'running', not enabled. But your test really does
report a negative 'enabled'.
Playing with that test it really is the IOC_DISABLE while STOP'ed that
messes things up.
Ah.. magic.. the below seems to fix things, hopefully it doesn't break
anything else.
---
kernel/events/core.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 426c2ffba16d..5dbbd1f90b4f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2458,6 +2458,20 @@ perf_install_in_context(struct perf_event_context *ctx,
}
/*
+ * reverse update_event_times()
+ */
+void __perf_event_enable_time(struct perf_event *event, u64 tstamp)
+{
+ /*
+ * total_time_enabled = tstamp_stopped - tstamp_enabled
+ * total_time_running = tstamp_stopped - tstamp_running;
+ */
+ event->tstamp_stopped = tstamp;
+ event->tstamp_enabled = tstamp - event->total_time_enabled;
+ event->tstamp_running = tstamp - event->total_time_running;
+}
+
+/*
* Put a event into inactive state and update time fields.
* Enabling the leader of a group effectively enables all
* the group members that aren't explicitly disabled, so we
@@ -2471,10 +2485,11 @@ static void __perf_event_mark_enabled(struct perf_event *event)
u64 tstamp = perf_event_time(event);
event->state = PERF_EVENT_STATE_INACTIVE;
- event->tstamp_enabled = tstamp - event->total_time_enabled;
+ __perf_event_enable_time(event, tstamp);
+
list_for_each_entry(sub, &event->sibling_list, group_entry) {
if (sub->state >= PERF_EVENT_STATE_INACTIVE)
- sub->tstamp_enabled = tstamp - sub->total_time_enabled;
+ __perf_event_enable_time(sub, tstamp);
}
}
On Wed, 2 Aug 2017, Peter Zijlstra wrote:
> Playing with that test it really is the IOC_DISABLE while STOP'ed that
> messes things up.
>
> Ah.. magic.. the below seems to fix things, hopefully it doesn't break
> anything else.
yes, I've tested this and it seems to fix things.
With both this and the other patch applied PAPI can finally pass all of
its validation tests when running with RDPMC support enabled.
Thanks,
Vince
On Thu, Aug 03, 2017 at 02:07:53PM -0400, Vince Weaver wrote:
> On Wed, 2 Aug 2017, Peter Zijlstra wrote:
>
> > Playing with that test it really is the IOC_DISABLE while STOP'ed that
> > messes things up.
> >
> > Ah.. magic.. the below seems to fix things, hopefully it doesn't break
> > anything else.
>
> yes, I've tested this and it seems to fix things.
New version below.. is a bit different and should handle another case.
I'll try and run more tests on it, because while it _SHOULD_ not trigger
that WARN_ON_ONCE() you know how that goes..
> With both this and the other patch applied PAPI can finally pass all of
> its validation tests when running with RDPMC support enabled.
I get:
rdpmc# (for i in *.c ; do ./${i%.c}; done) | grep "^Testing"
Testing if rdpmc attach works... PASSED
Testing if minimized rdpmc papi-multi-attach works... PASSED
Testing if we can rdpmc in execed process... PASSED
Testing if we can rdpmc in execed process (PAPI) ... PASSED
Testing if rdpmc works with event groups... PASSED
-bash: ./rdpmc_lib: No such file or directory
Testing if rdpmc multi-attach works... PASSED
Testing if rdpmc papi-multi-attach works... PASSED
Testing if userspace rdpmc multiplexing works... PASSED
Testing if rdpmc with pthreads works... PASSED
Testing if rdpmc with pthreads works... PASSED
Testing if resetting while using rdpmc works... PASSED
Testing if userspace rdpmc reads are supported... NEW BEHAVIOR
Testing if rdpmc fallback works on sw events... PASSED
Testing if userspace rdpmc reads give expected results... PASSED
is that 'NEW BEHAVIOR' thing something I should worry about?
Full output for that test is:
rdpmc# ./rdpmc_support
This test checks if userspace rdpmc() style reads work.
total start/read/stop latency: 52668 cycles
Event 0 -- count: 63 running: 0
Testing if userspace rdpmc reads are supported... NEW BEHAVIOR
Unpatched kernel also does that.
---
Subject: perf: Fix time on IOC_ENABLE
From: Peter Zijlstra <[email protected]>
Date: Thu Aug 3 15:42:09 CEST 2017
Vince reported that when we do IOC_ENABLE/IOC_DISABLE while the task
is SIGSTOP'ed state the timestamps go wobbly.
It turns out we indeed fail to correctly account time while in 'OFF'
state and doing IOC_ENABLE without getting scheduled in exposes the
problem.
Further thinking about this problem, it occurred to me that we can
suffer a similar fate when we migrate an uncore event between CPUs.
The perf_event_install() on the 'new' CPU will do add_event_to_ctx()
which will reset all the time stamp, resulting in a subsequent
update_event_times() to overwrite the total_time_* fields with smaller
values.
Reported-by: Vince Weaver <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/events/core.c | 36 +++++++++++++++++++++++++++++++-----
1 file changed, 31 insertions(+), 5 deletions(-)
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2217,6 +2217,33 @@ static int group_can_go_on(struct perf_e
return can_add_hw;
}
+/*
+ * Complement to update_event_times(). This computes the tstamp_* values to
+ * continue 'enabled' state from @now. And effectively discards the time
+ * between the prior tstamp_stopped and now (as we were in the OFF state, or
+ * just switched (context) time base).
+ *
+ * This further assumes '@event->state == INACTIVE' (we just came from OFF) and
+ * cannot have been scheduled in yet. And going into INACTIVE state means
+ * '@event->tstamp_stopped = @now'.
+ *
+ * Thus given the rules of update_event_times():
+ *
+ * total_time_enabled = tstamp_stopped - tstamp_enabled
+ * total_time_running = tstamp_stopped - tstamp_running
+ *
+ * We can insert 'tstamp_stopped == now' and reverse them to compute new
+ * tstamp_* values.
+ */
+static void __perf_event_enable_time(struct perf_event *event, u64 now)
+{
+ WARN_ON_ONCE(event->state != PERF_EVENT_STATE_INACTIVE);
+
+ event->tstamp_stopped = now;
+ event->tstamp_enabled = now - event->total_time_enabled;
+ event->tstamp_running = now - event->total_time_running;
+}
+
static void add_event_to_ctx(struct perf_event *event,
struct perf_event_context *ctx)
{
@@ -2224,9 +2251,7 @@ static void add_event_to_ctx(struct perf
list_add_event(event, ctx);
perf_group_attach(event);
- event->tstamp_enabled = tstamp;
- event->tstamp_running = tstamp;
- event->tstamp_stopped = tstamp;
+ __perf_event_enable_time(event, tstamp);
}
static void ctx_sched_out(struct perf_event_context *ctx,
@@ -2471,10 +2496,11 @@ static void __perf_event_mark_enabled(st
u64 tstamp = perf_event_time(event);
event->state = PERF_EVENT_STATE_INACTIVE;
- event->tstamp_enabled = tstamp - event->total_time_enabled;
+ __perf_event_enable_time(event, tstamp);
list_for_each_entry(sub, &event->sibling_list, group_entry) {
+ /* XXX should not be > INACTIVE if event isn't */
if (sub->state >= PERF_EVENT_STATE_INACTIVE)
- sub->tstamp_enabled = tstamp - sub->total_time_enabled;
+ __perf_event_enable_time(sub, tstamp);
}
}
On Fri, 4 Aug 2017, Peter Zijlstra wrote:
> Testing if userspace rdpmc reads are supported... NEW BEHAVIOR
> Testing if rdpmc fallback works on sw events... PASSED
> Testing if userspace rdpmc reads give expected results... PASSED
>
> is that 'NEW BEHAVIOR' thing something I should worry about?
no, sorry, I've been working on cleaning up the outputs of the tests to be
less confusing.
NEW BEHAVIOR here means it has the "new" (since Linux 3.12) cap_usr_rdpmc
bit layout.
Vince
On Fri, Aug 04, 2017 at 01:22:53PM +0200, Peter Zijlstra wrote:
> On Thu, Aug 03, 2017 at 02:07:53PM -0400, Vince Weaver wrote:
> > On Wed, 2 Aug 2017, Peter Zijlstra wrote:
> >
> > > Playing with that test it really is the IOC_DISABLE while STOP'ed that
> > > messes things up.
> > >
> > > Ah.. magic.. the below seems to fix things, hopefully it doesn't break
> > > anything else.
> >
> > yes, I've tested this and it seems to fix things.
>
> New version below.. is a bit different and should handle another case.
> I'll try and run more tests on it, because while it _SHOULD_ not trigger
> that WARN_ON_ONCE() you know how that goes..
And splat it goes.. turns out we can attach to a ctx when OFF (which
makes sense, they need to live someplace too).
In that case the IOC_ENABLE will again go through
__perf_event_enable_time() and recompute things.
I think I'll just remove the WARN_ON_ONCE(), and put in a comment.