2017-07-14 18:06:00

by Vince Weaver

Subject: perf: bisected sampling bug in Linux 4.11-rc1


I was tracking down some regressions in my perf_event_test testsuite.
Some of the tests broke in the 4.11-rc1 timeframe.

I've bisected one of them; this report is about
tests/overflow/simul_oneshot_group_overflow.
This test creates an event group containing two sampling events, set
to overflow to a signal handler (which disables and then refreshes the
event).
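
For reference, the disable-then-refresh mechanism described above works
roughly like the sketch below (a minimal illustration, not the actual test
code; the signal number, the fd-to-index mapping, and the helper names are
assumptions):

/* Minimal sketch of the overflow path: disable on overflow, count it,
 * then re-arm the event with REFRESH so it can fire again. */
#define _GNU_SOURCE
#include <string.h>
#include <signal.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>

static volatile long long overflow_counts[2];

static void overflow_handler(int signum, siginfo_t *info, void *uc)
{
        int fd = info->si_fd;                     /* fd of the overflowing event */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);     /* stop the event */
        overflow_counts[fd - 3]++;                /* assumes fds 3 and 4, as below */
        ioctl(fd, PERF_EVENT_IOC_REFRESH, 1);     /* re-arm for one more overflow */
}

static void arm_overflow_signal(int fd)
{
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = overflow_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGRTMIN + 2, &sa, NULL);

        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC);
        fcntl(fd, F_SETSIG, SIGRTMIN + 2);        /* deliver SIGRTMIN+2 with si_fd set */
        fcntl(fd, F_SETOWN, getpid());
}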

On a good kernel you get the following:
Event perf::instructions with period 1000000
Event perf::instructions with period 2000000
fd 3 overflows: 946 (perf::instructions/1000000)
fd 4 overflows: 473 (perf::instructions/2000000)
Ending counts:
Count 0: 946379875
Count 1: 946365218

With the broken kernels you get:
Event perf::instructions with period 1000000
Event perf::instructions with period 2000000
fd 3 overflows: 938 (perf::instructions/1000000)
fd 4 overflows: 318 (perf::instructions/2000000)
Ending counts:
Count 0: 946373080
Count 1: 653373058


487f05e18aa4efacee6357480f293a5afe6593b5 is the first bad commit

commit 487f05e18aa4efacee6357480f293a5afe6593b5
Author: Alexander Shishkin <[email protected]>
Date: Thu Jan 19 18:43:30 2017 +0200

perf/core: Optimize event rescheduling on active contexts

When new events are added to an active context, we go and reschedule
all cpu groups and all task groups in order to preserve the priority
(cpu pinned, task pinned, cpu flexible, task flexible), but in
reality we only need to reschedule groups of the same priority as
that of the events being added, and below.

This patch changes the behavior so that only groups that need to be
rescheduled are rescheduled.

Reported-by: Adrian Hunter <[email protected]>
Signed-off-by: Alexander Shishkin <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>



2017-07-14 18:26:06

by Vince Weaver

Subject: Re: perf: bisected sampling bug in Linux 4.11-rc1

On Fri, 14 Jul 2017, Vince Weaver wrote:

> On a good kernel you get the following:
> Event perf::instructions with period 1000000
> Event perf::instructions with period 2000000
> fd 3 overflows: 946 (perf::instructions/1000000)
> fd 4 overflows: 473 (perf::instructions/2000000)
> Ending counts:
> Count 0: 946379875
> Count 1: 946365218
>
> With the broken kernels you get:
> Event perf::instructions with period 1000000
> Event perf::instructions with period 2000000
> fd 3 overflows: 938 (perf::instructions/1000000)
> fd 4 overflows: 318 (perf::instructions/2000000)
> Ending counts:
> Count 0: 946373080
> Count 1: 653373058

additional relevant detail:
in the failing case, the group leader of the event set has
.pinned=1; if I change that to .pinned=0, then the test passes.

Vince

2017-07-14 20:07:52

by Alexander Shishkin

Subject: Re: perf: bisected sampling bug in Linux 4.11-rc1

Vince Weaver <[email protected]> writes:

> I was tracking down some regressions in my perf_event_test testsuite.
> Some of the tests broke in the 4.11-rc1 timeframe.
>
> I've bisected one of them, this report is about
> tests/overflow/simul_oneshot_group_overflow
> This test creates an event group containing two sampling events, set
> to overflow to a signal handler (which disables and then refreshes the
> event).
>
> On a good kernel you get the following:
> Event perf::instructions with period 1000000
> Event perf::instructions with period 2000000
> fd 3 overflows: 946 (perf::instructions/1000000)
> fd 4 overflows: 473 (perf::instructions/2000000)
> Ending counts:
> Count 0: 946379875
> Count 1: 946365218
>
> With the broken kernels you get:
> Event perf::instructions with period 1000000
> Event perf::instructions with period 2000000
> fd 3 overflows: 938 (perf::instructions/1000000)
> fd 4 overflows: 318 (perf::instructions/2000000)
> Ending counts:
> Count 0: 946373080
> Count 1: 653373058

I'm not sure I'm seeing it (granted, it's a Friday evening): is it the
difference in overflow counts?

Also, are they cpu or task bound?

Regards,
--
Alex

2017-07-14 20:14:33

by Vince Weaver

Subject: Re: perf: bisected sampling bug in Linux 4.11-rc1

On Fri, 14 Jul 2017, Alexander Shishkin wrote:

> Vince Weaver <[email protected]> writes:
>
> > I was tracking down some regressions in my perf_event_test testsuite.
> > Some of the tests broke in the 4.11-rc1 timeframe.
> >
> > I've bisected one of them, this report is about
> > tests/overflow/simul_oneshot_group_overflow
> > This test creates an event group containing two sampling events, set
> > to overflow to a signal handler (which disables and then refreshes the
> > event).
> >
> > On a good kernel you get the following:
> > Event perf::instructions with period 1000000
> > Event perf::instructions with period 2000000
> > fd 3 overflows: 946 (perf::instructions/1000000)
> > fd 4 overflows: 473 (perf::instructions/2000000)
> > Ending counts:
> > Count 0: 946379875
> > Count 1: 946365218
> >
> > With the broken kernels you get:
> > Event perf::instructions with period 1000000
> > Event perf::instructions with period 2000000
> > fd 3 overflows: 938 (perf::instructions/1000000)
> > fd 4 overflows: 318 (perf::instructions/2000000)
> > Ending counts:
> > Count 0: 946373080
> > Count 1: 653373058
>
> I'm not sure I'm seeing it (granted, it's a friday evening): is it the
> difference in overflow counts?

It's two things.
The test creates a grouped event, with both events being
perf::instructions.

1. The total count at the end should be the same for both
(on the failing kernels it is not)
2. The overflow count for each event should be roughly
total_count/sample_period.
(on the failing kernels it is not)
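
Concretely, with the numbers above: on the good kernel, 946,379,875
instructions / 1,000,000 is roughly 946 overflows for fd 3, and
946,365,218 / 2,000,000 is roughly 473 for fd 4, matching the reported
values. On the broken kernel, fd 4 reports only 318 overflows and a final
count of 653,373,058, i.e. the second event apparently only got scheduled
in for about two thirds of the run.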

> Also, are they cpu or task bound?

The open looks like this:
perf_event_open(&pe,0,-1,-1,0);

On the failing case, the group leader is pinned.

The source code for the test is here:
https://github.com/deater/perf_event_tests/blob/master/tests/overflow/simul_oneshot_group_overflow.c
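
For concreteness, a rough sketch of that setup is below (pid=0, cpu=-1
means the calling task on any CPU, so the events are task-bound; the attr
values follow the thread's description, i.e. two instruction counters with
periods 1000000 and 2000000 and a pinned leader, while everything else is
illustrative rather than copied from the test):

#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static void open_instruction_group(int *leader_fd, int *sibling_fd)
{
        struct perf_event_attr pe;

        memset(&pe, 0, sizeof(pe));
        pe.size = sizeof(pe);
        pe.type = PERF_TYPE_HARDWARE;
        pe.config = PERF_COUNT_HW_INSTRUCTIONS;
        pe.sample_period = 1000000;
        pe.sample_type = PERF_SAMPLE_IP;
        pe.pinned = 1;                  /* pinned group leader: the failing case */
        pe.disabled = 1;

        /* pid=0, cpu=-1, group_fd=-1: new group, calling task, any CPU */
        *leader_fd = syscall(__NR_perf_event_open, &pe, 0, -1, -1, 0);

        memset(&pe, 0, sizeof(pe));
        pe.size = sizeof(pe);
        pe.type = PERF_TYPE_HARDWARE;
        pe.config = PERF_COUNT_HW_INSTRUCTIONS;
        pe.sample_period = 2000000;
        pe.sample_type = PERF_SAMPLE_IP;
        /* the sibling is not pinned; it joins the group via group_fd */
        *sibling_fd = syscall(__NR_perf_event_open, &pe, 0, -1, *leader_fd, 0);
}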

Vince

2017-07-15 11:00:56

by Ingo Molnar

Subject: Re: perf: bisected sampling bug in Linux 4.11-rc1


* Vince Weaver <[email protected]> wrote:

> 487f05e18aa4efacee6357480f293a5afe6593b5 is the first bad commit
>
> commit 487f05e18aa4efacee6357480f293a5afe6593b5
> Author: Alexander Shishkin <[email protected]>
> Date: Thu Jan 19 18:43:30 2017 +0200
>
> perf/core: Optimize event rescheduling on active contexts

BTW., just to prepare for the eventuality: below is a (completely untested...)
revert of this commit, against recent kernels, with conflicts fixed up.

Does this fix your testcase?

Thanks,

Ingo

From fe0deecf2a8e9f5097013bcf89a9ef4b80715be1 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <[email protected]>
Date: Sat, 15 Jul 2017 12:57:51 +0200
Subject: [PATCH] Revert "perf/core: Optimize event rescheduling on active contexts"

This reverts commit 487f05e18aa4efacee6357480f293a5afe6593b5.
---
kernel/events/core.c | 80 ++++++++--------------------------------------------
1 file changed, 11 insertions(+), 69 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9747e422ab20..778aa2548142 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -359,8 +359,6 @@ enum event_type_t {
EVENT_FLEXIBLE = 0x1,
EVENT_PINNED = 0x2,
EVENT_TIME = 0x4,
- /* see ctx_resched() for details */
- EVENT_CPU = 0x8,
EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED,
};

@@ -1445,20 +1443,6 @@ static void update_group_times(struct perf_event *leader)
update_event_times(event);
}

-static enum event_type_t get_event_type(struct perf_event *event)
-{
- struct perf_event_context *ctx = event->ctx;
- enum event_type_t event_type;
-
- lockdep_assert_held(&ctx->lock);
-
- event_type = event->attr.pinned ? EVENT_PINNED : EVENT_FLEXIBLE;
- if (!ctx->task)
- event_type |= EVENT_CPU;
-
- return event_type;
-}
-
static struct list_head *
ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
{
@@ -2232,8 +2216,7 @@ ctx_sched_in(struct perf_event_context *ctx,
struct task_struct *task);

static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
- struct perf_event_context *ctx,
- enum event_type_t event_type)
+ struct perf_event_context *ctx)
{
if (!cpuctx->task_ctx)
return;
@@ -2241,7 +2224,7 @@ static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
if (WARN_ON_ONCE(ctx != cpuctx->task_ctx))
return;

- ctx_sched_out(ctx, cpuctx, event_type);
+ ctx_sched_out(ctx, cpuctx, EVENT_ALL);
}

static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
@@ -2256,51 +2239,13 @@ static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task);
}

-/*
- * We want to maintain the following priority of scheduling:
- * - CPU pinned (EVENT_CPU | EVENT_PINNED)
- * - task pinned (EVENT_PINNED)
- * - CPU flexible (EVENT_CPU | EVENT_FLEXIBLE)
- * - task flexible (EVENT_FLEXIBLE).
- *
- * In order to avoid unscheduling and scheduling back in everything every
- * time an event is added, only do it for the groups of equal priority and
- * below.
- *
- * This can be called after a batch operation on task events, in which case
- * event_type is a bit mask of the types of events involved. For CPU events,
- * event_type is only either EVENT_PINNED or EVENT_FLEXIBLE.
- */
static void ctx_resched(struct perf_cpu_context *cpuctx,
- struct perf_event_context *task_ctx,
- enum event_type_t event_type)
+ struct perf_event_context *task_ctx)
{
- enum event_type_t ctx_event_type = event_type & EVENT_ALL;
- bool cpu_event = !!(event_type & EVENT_CPU);
-
- /*
- * If pinned groups are involved, flexible groups also need to be
- * scheduled out.
- */
- if (event_type & EVENT_PINNED)
- event_type |= EVENT_FLEXIBLE;
-
perf_pmu_disable(cpuctx->ctx.pmu);
if (task_ctx)
- task_ctx_sched_out(cpuctx, task_ctx, event_type);
-
- /*
- * Decide which cpu ctx groups to schedule out based on the types
- * of events that caused rescheduling:
- * - EVENT_CPU: schedule out corresponding groups;
- * - EVENT_PINNED task events: schedule out EVENT_FLEXIBLE groups;
- * - otherwise, do nothing more.
- */
- if (cpu_event)
- cpu_ctx_sched_out(cpuctx, ctx_event_type);
- else if (ctx_event_type & EVENT_PINNED)
- cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
-
+ task_ctx_sched_out(cpuctx, task_ctx);
+ cpu_ctx_sched_out(cpuctx, EVENT_ALL);
perf_event_sched_in(cpuctx, task_ctx, current);
perf_pmu_enable(cpuctx->ctx.pmu);
}
@@ -2347,7 +2292,7 @@ static int __perf_install_in_context(void *info)
if (reprogram) {
ctx_sched_out(ctx, cpuctx, EVENT_TIME);
add_event_to_ctx(event, ctx);
- ctx_resched(cpuctx, task_ctx, get_event_type(event));
+ ctx_resched(cpuctx, task_ctx);
} else {
add_event_to_ctx(event, ctx);
}
@@ -2514,7 +2459,7 @@ static void __perf_event_enable(struct perf_event *event,
if (ctx->task)
WARN_ON_ONCE(task_ctx != ctx);

- ctx_resched(cpuctx, task_ctx, get_event_type(event));
+ ctx_resched(cpuctx, task_ctx);
}

/*
@@ -2941,7 +2886,7 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,

if (do_switch) {
raw_spin_lock(&ctx->lock);
- task_ctx_sched_out(cpuctx, ctx, EVENT_ALL);
+ task_ctx_sched_out(cpuctx, ctx);
raw_spin_unlock(&ctx->lock);
}
}
@@ -3498,7 +3443,6 @@ static int event_enable_on_exec(struct perf_event *event,
static void perf_event_enable_on_exec(int ctxn)
{
struct perf_event_context *ctx, *clone_ctx = NULL;
- enum event_type_t event_type = 0;
struct perf_cpu_context *cpuctx;
struct perf_event *event;
unsigned long flags;
@@ -3512,17 +3456,15 @@ static void perf_event_enable_on_exec(int ctxn)
cpuctx = __get_cpu_context(ctx);
perf_ctx_lock(cpuctx, ctx);
ctx_sched_out(ctx, cpuctx, EVENT_TIME);
- list_for_each_entry(event, &ctx->event_list, event_entry) {
+ list_for_each_entry(event, &ctx->event_list, event_entry)
enabled |= event_enable_on_exec(event, ctx);
- event_type |= get_event_type(event);
- }

/*
* Unclone and reschedule this context if we enabled any event.
*/
if (enabled) {
clone_ctx = unclone_ctx(ctx);
- ctx_resched(cpuctx, ctx, event_type);
+ ctx_resched(cpuctx, ctx);
} else {
ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
}
@@ -10466,7 +10408,7 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
* in.
*/
raw_spin_lock_irq(&child_ctx->lock);
- task_ctx_sched_out(__get_cpu_context(child_ctx), child_ctx, EVENT_ALL);
+ task_ctx_sched_out(__get_cpu_context(child_ctx), child_ctx);

/*
* Now that the context is inactive, destroy the task <-> ctx relation

2017-07-15 14:37:52

by Vince Weaver

Subject: Re: perf: bisected sampling bug in Linux 4.11-rc1

On Sat, 15 Jul 2017, Ingo Molnar wrote:

>
> * Vince Weaver <[email protected]> wrote:
>
> > 487f05e18aa4efacee6357480f293a5afe6593b5 is the first bad commit
> >
> > commit 487f05e18aa4efacee6357480f293a5afe6593b5
> > Author: Alexander Shishkin <[email protected]>
> > Date: Thu Jan 19 18:43:30 2017 +0200
> >
> > perf/core: Optimize event rescheduling on active contexts
>
> BTW., just to prepare for the eventuality: below is a (completely untested...)
> revert of this commit, against recent kernels, with conflicts fixed up.
>
> Does this fix your testcase?

Yes, applying this to current git fixes the testcase and doesn't seem to
break anything else.


Although there is a separate issue also introduced in 4.11-rc1 that still
fails a different testcase. I'm in the middle of bisecting that one and
probably won't have the result of the bisect until Monday.

Vince

2017-07-16 03:03:15

by Vince Weaver

Subject: Re: perf: bisected sampling bug in Linux 4.11-rc1

On Sat, 15 Jul 2017, Vince Weaver wrote:

> Although there is a separate issue also introduced in 4.11-rc1 that still
> fails a different testcase. I'm in the middle of bisecting that one and
> probably won't have the result of the bisect until Monday.

I went and bisected the other issue anyway. It wasn't introduced in
4.11-rc1, but in 4.12-rc4. Yes, I have a test that triggers the
c1582c231ea041 ("perf/core: Drop kernel samples even though :u is specified")
issue.

The test was actually losing 7% of its samples, so not just a single
sample here or there. The test had two events sampling, one at 100k and
one at 200k, so I guess every other sample would have two samples
immediately back-to-back, which must make it more likely to stray into the
kernel.

Vince

Subject: [tip:perf/urgent] Revert "perf/core: Optimize event rescheduling on active contexts"

Commit-ID: 770f8eb8a990a8904bfd8a6849be147b40b6e1aa
Gitweb: http://git.kernel.org/tip/770f8eb8a990a8904bfd8a6849be147b40b6e1aa
Author: Ingo Molnar <[email protected]>
AuthorDate: Sat, 15 Jul 2017 13:00:49 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 18 Jul 2017 10:44:47 +0200

Revert "perf/core: Optimize event rescheduling on active contexts"

This reverts commit 487f05e18aa4efacee6357480f293a5afe6593b5.

Vince Weaver reported that it breaks a testcase for pinned events:

| I've bisected one of them, this report is about:
|
| tests/overflow/simul_oneshot_group_overflow
|
| This test creates an event group containing two sampling events, set
| to overflow to a signal handler (which disables and then refreshes the
| event).
|
| On a good kernel you get the following:
| Event perf::instructions with period 1000000
| Event perf::instructions with period 2000000
| fd 3 overflows: 946 (perf::instructions/1000000)
| fd 4 overflows: 473 (perf::instructions/2000000)
| Ending counts:
| Count 0: 946379875
| Count 1: 946365218
|
| With the broken kernels you get:
| Event perf::instructions with period 1000000
| Event perf::instructions with period 2000000
| fd 3 overflows: 938 (perf::instructions/1000000)
| fd 4 overflows: 318 (perf::instructions/2000000)
| Ending counts:
| Count 0: 946373080
| Count 1: 653373058
...
| additional relevant detail:
| in the failing case, the group leader of the event set has
| .pinned=1
| If I change that to .pinned=0 then the test passes.

As it's an optimization we can revert it for now until the root cause is found.

Cc: Adrian Hunter <[email protected]>
Reported-by: Vince Weaver <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/events/core.c | 80 ++++++++--------------------------------------------
1 file changed, 11 insertions(+), 69 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9747e42..778aa25 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -359,8 +359,6 @@ enum event_type_t {
EVENT_FLEXIBLE = 0x1,
EVENT_PINNED = 0x2,
EVENT_TIME = 0x4,
- /* see ctx_resched() for details */
- EVENT_CPU = 0x8,
EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED,
};

@@ -1445,20 +1443,6 @@ static void update_group_times(struct perf_event *leader)
update_event_times(event);
}

-static enum event_type_t get_event_type(struct perf_event *event)
-{
- struct perf_event_context *ctx = event->ctx;
- enum event_type_t event_type;
-
- lockdep_assert_held(&ctx->lock);
-
- event_type = event->attr.pinned ? EVENT_PINNED : EVENT_FLEXIBLE;
- if (!ctx->task)
- event_type |= EVENT_CPU;
-
- return event_type;
-}
-
static struct list_head *
ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
{
@@ -2232,8 +2216,7 @@ ctx_sched_in(struct perf_event_context *ctx,
struct task_struct *task);

static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
- struct perf_event_context *ctx,
- enum event_type_t event_type)
+ struct perf_event_context *ctx)
{
if (!cpuctx->task_ctx)
return;
@@ -2241,7 +2224,7 @@ static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
if (WARN_ON_ONCE(ctx != cpuctx->task_ctx))
return;

- ctx_sched_out(ctx, cpuctx, event_type);
+ ctx_sched_out(ctx, cpuctx, EVENT_ALL);
}

static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
@@ -2256,51 +2239,13 @@ static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task);
}

-/*
- * We want to maintain the following priority of scheduling:
- * - CPU pinned (EVENT_CPU | EVENT_PINNED)
- * - task pinned (EVENT_PINNED)
- * - CPU flexible (EVENT_CPU | EVENT_FLEXIBLE)
- * - task flexible (EVENT_FLEXIBLE).
- *
- * In order to avoid unscheduling and scheduling back in everything every
- * time an event is added, only do it for the groups of equal priority and
- * below.
- *
- * This can be called after a batch operation on task events, in which case
- * event_type is a bit mask of the types of events involved. For CPU events,
- * event_type is only either EVENT_PINNED or EVENT_FLEXIBLE.
- */
static void ctx_resched(struct perf_cpu_context *cpuctx,
- struct perf_event_context *task_ctx,
- enum event_type_t event_type)
+ struct perf_event_context *task_ctx)
{
- enum event_type_t ctx_event_type = event_type & EVENT_ALL;
- bool cpu_event = !!(event_type & EVENT_CPU);
-
- /*
- * If pinned groups are involved, flexible groups also need to be
- * scheduled out.
- */
- if (event_type & EVENT_PINNED)
- event_type |= EVENT_FLEXIBLE;
-
perf_pmu_disable(cpuctx->ctx.pmu);
if (task_ctx)
- task_ctx_sched_out(cpuctx, task_ctx, event_type);
-
- /*
- * Decide which cpu ctx groups to schedule out based on the types
- * of events that caused rescheduling:
- * - EVENT_CPU: schedule out corresponding groups;
- * - EVENT_PINNED task events: schedule out EVENT_FLEXIBLE groups;
- * - otherwise, do nothing more.
- */
- if (cpu_event)
- cpu_ctx_sched_out(cpuctx, ctx_event_type);
- else if (ctx_event_type & EVENT_PINNED)
- cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
-
+ task_ctx_sched_out(cpuctx, task_ctx);
+ cpu_ctx_sched_out(cpuctx, EVENT_ALL);
perf_event_sched_in(cpuctx, task_ctx, current);
perf_pmu_enable(cpuctx->ctx.pmu);
}
@@ -2347,7 +2292,7 @@ static int __perf_install_in_context(void *info)
if (reprogram) {
ctx_sched_out(ctx, cpuctx, EVENT_TIME);
add_event_to_ctx(event, ctx);
- ctx_resched(cpuctx, task_ctx, get_event_type(event));
+ ctx_resched(cpuctx, task_ctx);
} else {
add_event_to_ctx(event, ctx);
}
@@ -2514,7 +2459,7 @@ static void __perf_event_enable(struct perf_event *event,
if (ctx->task)
WARN_ON_ONCE(task_ctx != ctx);

- ctx_resched(cpuctx, task_ctx, get_event_type(event));
+ ctx_resched(cpuctx, task_ctx);
}

/*
@@ -2941,7 +2886,7 @@ unlock:

if (do_switch) {
raw_spin_lock(&ctx->lock);
- task_ctx_sched_out(cpuctx, ctx, EVENT_ALL);
+ task_ctx_sched_out(cpuctx, ctx);
raw_spin_unlock(&ctx->lock);
}
}
@@ -3498,7 +3443,6 @@ static int event_enable_on_exec(struct perf_event *event,
static void perf_event_enable_on_exec(int ctxn)
{
struct perf_event_context *ctx, *clone_ctx = NULL;
- enum event_type_t event_type = 0;
struct perf_cpu_context *cpuctx;
struct perf_event *event;
unsigned long flags;
@@ -3512,17 +3456,15 @@ static void perf_event_enable_on_exec(int ctxn)
cpuctx = __get_cpu_context(ctx);
perf_ctx_lock(cpuctx, ctx);
ctx_sched_out(ctx, cpuctx, EVENT_TIME);
- list_for_each_entry(event, &ctx->event_list, event_entry) {
+ list_for_each_entry(event, &ctx->event_list, event_entry)
enabled |= event_enable_on_exec(event, ctx);
- event_type |= get_event_type(event);
- }

/*
* Unclone and reschedule this context if we enabled any event.
*/
if (enabled) {
clone_ctx = unclone_ctx(ctx);
- ctx_resched(cpuctx, ctx, event_type);
+ ctx_resched(cpuctx, ctx);
} else {
ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
}
@@ -10466,7 +10408,7 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
* in.
*/
raw_spin_lock_irq(&child_ctx->lock);
- task_ctx_sched_out(__get_cpu_context(child_ctx), child_ctx, EVENT_ALL);
+ task_ctx_sched_out(__get_cpu_context(child_ctx), child_ctx);

/*
* Now that the context is inactive, destroy the task <-> ctx relation

2017-07-18 11:15:00

by Alexander Shishkin

Subject: Re: perf: bisected sampling bug in Linux 4.11-rc1

Vince Weaver <[email protected]> writes:

> I was tracking down some regressions in my perf_event_test testsuite.
> Some of the tests broke in the 4.11-rc1 timeframe.
>
> I've bisected one of them, this report is about
> tests/overflow/simul_oneshot_group_overflow
> This test creates an event group containing two sampling events, set
> to overflow to a signal handler (which disables and then refreshes the
> event).
>
> On a good kernel you get the following:
> Event perf::instructions with period 1000000
> Event perf::instructions with period 2000000
> fd 3 overflows: 946 (perf::instructions/1000000)
> fd 4 overflows: 473 (perf::instructions/2000000)
> Ending counts:
> Count 0: 946379875
> Count 1: 946365218
>
> With the broken kernels you get:
> Event perf::instructions with period 1000000
> Event perf::instructions with period 2000000
> fd 3 overflows: 938 (perf::instructions/1000000)
> fd 4 overflows: 318 (perf::instructions/2000000)
> Ending counts:
> Count 0: 946373080
> Count 1: 653373058
>
>
> 487f05e18aa4efacee6357480f293a5afe6593b5 is the first bad commit
>
> commit 487f05e18aa4efacee6357480f293a5afe6593b5
> Author: Alexander Shishkin <[email protected]>
> Date: Thu Jan 19 18:43:30 2017 +0200

Ok, there was a bug there indeed. This patch should take care of it and
should also be backportable in case it's stable-worthy.

From 187d67c9908cb126656c34546772089c17a8e6c5 Mon Sep 17 00:00:00 2001
From: Alexander Shishkin <[email protected]>
Date: Tue, 18 Jul 2017 13:53:01 +0300
Subject: [PATCH] perf: Fix scheduling regression of pinned groups

Commit 487f05e18a ("perf/core: Optimize event rescheduling on active
contexts") erronously assumed that event's 'pinned' setting determines
whether the event belongs to a pinned group or not, but in fact, it's
the group leader's pinned state that matters. This was discovered by
Vince in a test case where two instruction counters are grouped, the
group leader is pinned, but the other event is not; in the regressed
case the counters were off by 33% (the difference between events'
periods), but should be the same within the error margin.

This fixes the problem by looking at the group leader's pinning.

Reported-by: Vince Weaver <[email protected]>
Signed-off-by: Alexander Shishkin <[email protected]>
Fixes: 487f05e18a ("perf/core: Optimize event rescheduling on active contexts")
Cc: [email protected]
---
kernel/events/core.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index bc63f8db1b..1edbaf94dd 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1451,6 +1451,13 @@ static enum event_type_t get_event_type(struct perf_event *event)

lockdep_assert_held(&ctx->lock);

+ /*
+ * It's 'group type', really, because if our group leader is
+ * pinned, so are we.
+ */
+ if (event->group_leader != event)
+ event = event->group_leader;
+
event_type = event->attr.pinned ? EVENT_PINNED : EVENT_FLEXIBLE;
if (!ctx->task)
event_type |= EVENT_CPU;
--
2.11.0

2017-07-18 12:13:27

by Ingo Molnar

Subject: Re: perf: bisected sampling bug in Linux 4.11-rc1


* Alexander Shishkin <[email protected]> wrote:

> Vince Weaver <[email protected]> writes:
>
> > I was tracking down some regressions in my perf_event_test testsuite.
> > Some of the tests broke in the 4.11-rc1 timeframe.
> >
> > I've bisected one of them, this report is about
> > tests/overflow/simul_oneshot_group_overflow
> > This test creates an event group containing two sampling events, set
> > to overflow to a signal handler (which disables and then refreshes the
> > event).
> >
> > On a good kernel you get the following:
> > Event perf::instructions with period 1000000
> > Event perf::instructions with period 2000000
> > fd 3 overflows: 946 (perf::instructions/1000000)
> > fd 4 overflows: 473 (perf::instructions/2000000)
> > Ending counts:
> > Count 0: 946379875
> > Count 1: 946365218
> >
> > With the broken kernels you get:
> > Event perf::instructions with period 1000000
> > Event perf::instructions with period 2000000
> > fd 3 overflows: 938 (perf::instructions/1000000)
> > fd 4 overflows: 318 (perf::instructions/2000000)
> > Ending counts:
> > Count 0: 946373080
> > Count 1: 653373058
> >
> >
> > 487f05e18aa4efacee6357480f293a5afe6593b5 is the first bad commit
> >
> > commit 487f05e18aa4efacee6357480f293a5afe6593b5
> > Author: Alexander Shishkin <[email protected]>
> > Date: Thu Jan 19 18:43:30 2017 +0200
>
> Ok, there was a bug there indeed. This patch should take care of it and
> should also be backportable in case it's stable-worthy.
>
> From 187d67c9908cb126656c34546772089c17a8e6c5 Mon Sep 17 00:00:00 2001
> From: Alexander Shishkin <[email protected]>
> Date: Tue, 18 Jul 2017 13:53:01 +0300
> Subject: [PATCH] perf: Fix scheduling regression of pinned groups

Ok, great - if this works then I'll pick up this fix instead of the revert that
I've queued up earlier today.

Thanks,

Ingo

2017-07-20 03:30:39

by Vince Weaver

Subject: Re: perf: bisected sampling bug in Linux 4.11-rc1

On Tue, 18 Jul 2017, Ingo Molnar wrote:
>
> Ok, great - if this works then I'll pick up this fix instead of the revert that
> I've queued up earlier today.

Sorry for the delay; I was out of town for a few days.

I've tried the patch and it looks like it fixes the test in question and
doesn't seem to add any other regressions.

Vince

Subject: [tip:perf/urgent] perf/core: Fix scheduling regression of pinned groups

Commit-ID: 3bda69c1c3993a2bddbae01397d12bfef6054011
Gitweb: http://git.kernel.org/tip/3bda69c1c3993a2bddbae01397d12bfef6054011
Author: Alexander Shishkin <[email protected]>
AuthorDate: Tue, 18 Jul 2017 14:08:34 +0300
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 20 Jul 2017 09:43:02 +0200

perf/core: Fix scheduling regression of pinned groups

Vince Weaver reported:

> I was tracking down some regressions in my perf_event_test testsuite.
> Some of the tests broke in the 4.11-rc1 timeframe.
>
> I've bisected one of them, this report is about
> tests/overflow/simul_oneshot_group_overflow
> This test creates an event group containing two sampling events, set
> to overflow to a signal handler (which disables and then refreshes the
> event).
>
> On a good kernel you get the following:
> Event perf::instructions with period 1000000
> Event perf::instructions with period 2000000
> fd 3 overflows: 946 (perf::instructions/1000000)
> fd 4 overflows: 473 (perf::instructions/2000000)
> Ending counts:
> Count 0: 946379875
> Count 1: 946365218
>
> With the broken kernels you get:
> Event perf::instructions with period 1000000
> Event perf::instructions with period 2000000
> fd 3 overflows: 938 (perf::instructions/1000000)
> fd 4 overflows: 318 (perf::instructions/2000000)
> Ending counts:
> Count 0: 946373080
> Count 1: 653373058

The root cause of the bug is that the following commit:

487f05e18a ("perf/core: Optimize event rescheduling on active contexts")

erroneously assumed that an event's 'pinned' setting determines whether the
event belongs to a pinned group or not, but in fact, it's the group
leader's pinned state that matters.

This was discovered by Vince in the test case described above, where two instruction
counters are grouped, the group leader is pinned, but the other event is not;
in the regressed case the counters were off by 33% (the difference between events'
periods), but should be the same within the error margin.

Fix the problem by looking at the group leader's pinning.

Reported-by: Vince Weaver <[email protected]>
Tested-by: Vince Weaver <[email protected]>
Signed-off-by: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Fixes: 487f05e18a ("perf/core: Optimize event rescheduling on active contexts")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/events/core.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9747e42..c9cdbd3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1452,6 +1452,13 @@ static enum event_type_t get_event_type(struct perf_event *event)

lockdep_assert_held(&ctx->lock);

+ /*
+ * It's 'group type', really, because if our group leader is
+ * pinned, so are we.
+ */
+ if (event->group_leader != event)
+ event = event->group_leader;
+
event_type = event->attr.pinned ? EVENT_PINNED : EVENT_FLEXIBLE;
if (!ctx->task)
event_type |= EVENT_CPU;