2013-10-01 19:12:45

by Adrian Hunter

[permalink] [raw]
Subject: PERF_EVENT_IOC_SET_OUTPUT

Hi

It does not seem possible to use set-output between
task contexts of different types (e.g. a software event
to a hardware event)

If you look at perf_event_set_output():

/*
* If its not a per-cpu rb, it must be the same task.
*/
if (output_event->cpu == -1 && output_event->ctx != event->ctx)
goto out;

ctx (perf_event_context) won't be the same for events
of different types. Is this restriction necessary?

Regards
Adrian


2013-10-02 10:04:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: PERF_EVENT_IOC_SET_OUTPUT

On Tue, Oct 01, 2013 at 10:11:56PM +0300, Adrian Hunter wrote:
> Hi
>
> It does not seem possible to use set-output between
> task contexts of different types (e.g. a software event
> to a hardware event)
>
> If you look at perf_event_set_output():
>
> /*
> * If its not a per-cpu rb, it must be the same task.
> */
> if (output_event->cpu == -1 && output_event->ctx != event->ctx)
> goto out;
>
> ctx (perf_event_context) won't be the same for events
> of different types. Is this restriction necessary?

Hmm.. so last night I wrote me a big reply saying we couldn't do it;
then this morning I reconsidered and thing that something like:

output_event->ctx->task != event->ctx->task

should actually work.

The reason it should be OK I think is because perf_mmap() will refuse to
create a buffer for inherited events that have ->cpu == -1.

My initial response was going to say that it wouldn't be possible
because __perf_event_task_sched_out() could 'break' one ctx while still
swapping the other, at which point the buffer would have to service two
different tasks, potentially from different CPUs and with the buffers
not actually being SMP safe that's a problem.

But like stated, perf_mmap() seems to avoid that issue entirely by not
allowing inherited events that aren't cpu bound.

Someone please double check this.. :-)

2013-10-02 10:30:04

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: PERF_EVENT_IOC_SET_OUTPUT

On Wed, Oct 02, 2013 at 12:03:50PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 01, 2013 at 10:11:56PM +0300, Adrian Hunter wrote:
> > Hi
> >
> > It does not seem possible to use set-output between
> > task contexts of different types (e.g. a software event
> > to a hardware event)
> >
> > If you look at perf_event_set_output():
> >
> > /*
> > * If its not a per-cpu rb, it must be the same task.
> > */
> > if (output_event->cpu == -1 && output_event->ctx != event->ctx)
> > goto out;
> >
> > ctx (perf_event_context) won't be the same for events
> > of different types. Is this restriction necessary?
>
> Hmm.. so last night I wrote me a big reply saying we couldn't do it;
> then this morning I reconsidered and thing that something like:
>
> output_event->ctx->task != event->ctx->task
>
> should actually work.
>
> The reason it should be OK I think is because perf_mmap() will refuse to
> create a buffer for inherited events that have ->cpu == -1.
>
> My initial response was going to say that it wouldn't be possible
> because __perf_event_task_sched_out() could 'break' one ctx while still
> swapping the other, at which point the buffer would have to service two
> different tasks, potentially from different CPUs and with the buffers
> not actually being SMP safe that's a problem.

I don't get what you mean with breaking or swapping a ctx.
But I can confirm that perf_mmap() won't allow a buffer to be remotely
accessed from another CPU. Now there may be other issues than locality which
I'm missing :)

>
> But like stated, perf_mmap() seems to avoid that issue entirely by not
> allowing inherited events that aren't cpu bound.
>
> Someone please double check this.. :-)

2013-10-02 11:27:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: PERF_EVENT_IOC_SET_OUTPUT

On Wed, Oct 02, 2013 at 12:29:56PM +0200, Frederic Weisbecker wrote:
> On Wed, Oct 02, 2013 at 12:03:50PM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 01, 2013 at 10:11:56PM +0300, Adrian Hunter wrote:
> > > Hi
> > >
> > > It does not seem possible to use set-output between
> > > task contexts of different types (e.g. a software event
> > > to a hardware event)
> > >
> > > If you look at perf_event_set_output():
> > >
> > > /*
> > > * If its not a per-cpu rb, it must be the same task.
> > > */
> > > if (output_event->cpu == -1 && output_event->ctx != event->ctx)
> > > goto out;
> > >
> > > ctx (perf_event_context) won't be the same for events
> > > of different types. Is this restriction necessary?
> >
> > Hmm.. so last night I wrote me a big reply saying we couldn't do it;
> > then this morning I reconsidered and thing that something like:
> >
> > output_event->ctx->task != event->ctx->task
> >
> > should actually work.
> >
> > The reason it should be OK I think is because perf_mmap() will refuse to
> > create a buffer for inherited events that have ->cpu == -1.
> >
> > My initial response was going to say that it wouldn't be possible
> > because __perf_event_task_sched_out() could 'break' one ctx while still
> > swapping the other, at which point the buffer would have to service two
> > different tasks, potentially from different CPUs and with the buffers
> > not actually being SMP safe that's a problem.
>
> I don't get what you mean with breaking or swapping a ctx.
> But I can confirm that perf_mmap() won't allow a buffer to be remotely
> accessed from another CPU. Now there may be other issues than locality which
> I'm missing :)

The way we 'optimize' context switches between tasks with identical
contexts is to simply swap the context and leave the hardware alone.

So counters belonging to prev will then belong to next and vice versa.
This avoids having to read hardware counters, update stats, removes
counters from hardware, and re-program hardware with possible the exact
same set.

When a child context changes its context (eg, inserts or removes a
counter) we break this swapping because now the contexts don't match
anymore and we have to take the slow and painful way of prodding
hardware.

2013-10-02 11:43:34

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: PERF_EVENT_IOC_SET_OUTPUT

On Wed, Oct 02, 2013 at 01:27:30PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 02, 2013 at 12:29:56PM +0200, Frederic Weisbecker wrote:
> > On Wed, Oct 02, 2013 at 12:03:50PM +0200, Peter Zijlstra wrote:
> > > On Tue, Oct 01, 2013 at 10:11:56PM +0300, Adrian Hunter wrote:
> > > > Hi
> > > >
> > > > It does not seem possible to use set-output between
> > > > task contexts of different types (e.g. a software event
> > > > to a hardware event)
> > > >
> > > > If you look at perf_event_set_output():
> > > >
> > > > /*
> > > > * If its not a per-cpu rb, it must be the same task.
> > > > */
> > > > if (output_event->cpu == -1 && output_event->ctx != event->ctx)
> > > > goto out;
> > > >
> > > > ctx (perf_event_context) won't be the same for events
> > > > of different types. Is this restriction necessary?
> > >
> > > Hmm.. so last night I wrote me a big reply saying we couldn't do it;
> > > then this morning I reconsidered and thing that something like:
> > >
> > > output_event->ctx->task != event->ctx->task
> > >
> > > should actually work.
> > >
> > > The reason it should be OK I think is because perf_mmap() will refuse to
> > > create a buffer for inherited events that have ->cpu == -1.
> > >
> > > My initial response was going to say that it wouldn't be possible
> > > because __perf_event_task_sched_out() could 'break' one ctx while still
> > > swapping the other, at which point the buffer would have to service two
> > > different tasks, potentially from different CPUs and with the buffers
> > > not actually being SMP safe that's a problem.
> >
> > I don't get what you mean with breaking or swapping a ctx.
> > But I can confirm that perf_mmap() won't allow a buffer to be remotely
> > accessed from another CPU. Now there may be other issues than locality which
> > I'm missing :)
>
> The way we 'optimize' context switches between tasks with identical
> contexts is to simply swap the context and leave the hardware alone.
>
> So counters belonging to prev will then belong to next and vice versa.
> This avoids having to read hardware counters, update stats, removes
> counters from hardware, and re-program hardware with possible the exact
> same set.
>
> When a child context changes its context (eg, inserts or removes a
> counter) we break this swapping because now the contexts don't match
> anymore and we have to take the slow and painful way of prodding
> hardware.

Ah right, I remember that now. This caused me quite some headaches
a few years ago :)

2013-10-02 12:29:08

by Ingo Molnar

[permalink] [raw]
Subject: Re: PERF_EVENT_IOC_SET_OUTPUT


* Peter Zijlstra <[email protected]> wrote:

> On Wed, Oct 02, 2013 at 12:29:56PM +0200, Frederic Weisbecker wrote:
> > On Wed, Oct 02, 2013 at 12:03:50PM +0200, Peter Zijlstra wrote:
> > > On Tue, Oct 01, 2013 at 10:11:56PM +0300, Adrian Hunter wrote:
> > > > Hi
> > > >
> > > > It does not seem possible to use set-output between
> > > > task contexts of different types (e.g. a software event
> > > > to a hardware event)
> > > >
> > > > If you look at perf_event_set_output():
> > > >
> > > > /*
> > > > * If its not a per-cpu rb, it must be the same task.
> > > > */
> > > > if (output_event->cpu == -1 && output_event->ctx != event->ctx)
> > > > goto out;
> > > >
> > > > ctx (perf_event_context) won't be the same for events
> > > > of different types. Is this restriction necessary?
> > >
> > > Hmm.. so last night I wrote me a big reply saying we couldn't do it;
> > > then this morning I reconsidered and thing that something like:
> > >
> > > output_event->ctx->task != event->ctx->task
> > >
> > > should actually work.
> > >
> > > The reason it should be OK I think is because perf_mmap() will refuse to
> > > create a buffer for inherited events that have ->cpu == -1.
> > >
> > > My initial response was going to say that it wouldn't be possible
> > > because __perf_event_task_sched_out() could 'break' one ctx while still
> > > swapping the other, at which point the buffer would have to service two
> > > different tasks, potentially from different CPUs and with the buffers
> > > not actually being SMP safe that's a problem.
> >
> > I don't get what you mean with breaking or swapping a ctx. But I can
> > confirm that perf_mmap() won't allow a buffer to be remotely accessed
> > from another CPU. Now there may be other issues than locality which
> > I'm missing :)
>
> The way we 'optimize' context switches between tasks with identical
> contexts is to simply swap the context and leave the hardware alone.

Btw., this does not seem to be working very well when the perf context is
inherited:

Baseline kernel with no perf context:

aldebaran:~> taskset 1 perf stat --null perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

Total time: 5.024 [sec]

5.024951 usecs/op
199006 ops/sec

with inherited perf contexts:

aldebaran:~> taskset 1 perf stat perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

Total time: 17.869 [sec]

17.869061 usecs/op
55962 ops/sec

+12.8 usecs of perf switching fat per context switch, on a non-debug
kernel on a 2.8GHz CPU :-/

We should declare a hard feature stop until that is improved.

Thanks,

Ingo

2013-10-02 12:40:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: PERF_EVENT_IOC_SET_OUTPUT

On Wed, Oct 02, 2013 at 02:29:01PM +0200, Ingo Molnar wrote:
>
> Btw., this does not seem to be working very well when the perf context is
> inherited:

https://lkml.org/lkml/2011/7/21/124

Never figured that one out :-(

2013-10-03 06:43:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: PERF_EVENT_IOC_SET_OUTPUT


* Peter Zijlstra <[email protected]> wrote:

> On Wed, Oct 02, 2013 at 02:29:01PM +0200, Ingo Molnar wrote:
> >
> > Btw., this does not seem to be working very well when the perf context is
> > inherited:
>
> https://lkml.org/lkml/2011/7/21/124
>
> Never figured that one out :-(

Hm, maybe the various fixes over the past 2.5 years would make it work
today?

If you find time to send an updated version that boots then I can try to
trace it to figure out where it fails, if it still fails.

Thanks,

Ingo

2013-10-07 16:43:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: PERF_EVENT_IOC_SET_OUTPUT

On Thu, Oct 03, 2013 at 08:43:51AM +0200, Ingo Molnar wrote:
> If you find time to send an updated version that boots then I can try to
> trace it to figure out where it fails, if it still fails.

Can you give the below a spin?

---
Subject: perf: Fix the perf context switch optimization
From: Peter Zijlstra <[email protected]>
Date: Mon Oct 7 17:12:48 CEST 2013

Currently we only optimize the context switch between two contexts that
have the same parent; this forgoes the optimization between parent and
child context, even though these contexts could be equivalent too.

Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/events/core.c | 64 ++++++++++++++++++++++++++++++++++++---------------
1 file changed, 46 insertions(+), 18 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -899,6 +899,7 @@ static void unclone_ctx(struct perf_even
put_ctx(ctx->parent_ctx);
ctx->parent_ctx = NULL;
}
+ ctx->generation++;
}

static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
@@ -1136,6 +1137,8 @@ list_add_event(struct perf_event *event,
ctx->nr_events++;
if (event->attr.inherit_stat)
ctx->nr_stat++;
+
+ ctx->generation++;
}

/*
@@ -1313,6 +1316,8 @@ list_del_event(struct perf_event *event,
*/
if (event->state > PERF_EVENT_STATE_OFF)
event->state = PERF_EVENT_STATE_OFF;
+
+ ctx->generation++;
}

static void perf_group_detach(struct perf_event *event)
@@ -2149,22 +2154,38 @@ static void ctx_sched_out(struct perf_ev
}

/*
- * Test whether two contexts are equivalent, i.e. whether they
- * have both been cloned from the same version of the same context
- * and they both have the same number of enabled events.
- * If the number of enabled events is the same, then the set
- * of enabled events should be the same, because these are both
- * inherited contexts, therefore we can't access individual events
- * in them directly with an fd; we can only enable/disable all
- * events via prctl, or enable/disable all events in a family
- * via ioctl, which will have the same effect on both contexts.
+ * Test whether two contexts are equivalent, i.e. whether they have both been
+ * cloned from the same version of the same context.
+ *
+ * Equivalence is measured using a generation number in the context that is
+ * incremented on each modification to it; see unclone_ctx(), list_add_event()
+ * and list_del_event().
*/
static int context_equiv(struct perf_event_context *ctx1,
struct perf_event_context *ctx2)
{
- return ctx1->parent_ctx && ctx1->parent_ctx == ctx2->parent_ctx
- && ctx1->parent_gen == ctx2->parent_gen
- && !ctx1->pin_count && !ctx2->pin_count;
+ /* Pinning disables the swap optimization */
+ if (ctx1->pin_count || ctx2->pin_count)
+ return 0;
+
+ /* If ctx1 is the parent of ctx2 */
+ if (ctx1 == ctx2->parent_ctx && ctx1->generation == ctx2->parent_gen)
+ return 1;
+
+ /* If ctx2 is the parent of ctx1 */
+ if (ctx1->parent_ctx == ctx2 && ctx1->parent_gen == ctx2->generation)
+ return 1;
+
+ /*
+ * If ctx1 and ctx2 have the same parent; we flatten the parent
+ * hierarchy, see perf_event_init_context().
+ */
+ if (ctx1->parent_ctx && ctx1->parent_ctx == ctx2->parent_ctx &&
+ ctx1->parent_gen == ctx2->parent_gen)
+ return 1;
+
+ /* Unmatched */
+ return 0;
}

static void __perf_event_sync_stat(struct perf_event *event,
@@ -2247,7 +2268,7 @@ static void perf_event_context_sched_out
{
struct perf_event_context *ctx = task->perf_event_ctxp[ctxn];
struct perf_event_context *next_ctx;
- struct perf_event_context *parent;
+ struct perf_event_context *parent, *next_parent;
struct perf_cpu_context *cpuctx;
int do_switch = 1;

@@ -2259,10 +2280,18 @@ static void perf_event_context_sched_out
return;

rcu_read_lock();
- parent = rcu_dereference(ctx->parent_ctx);
next_ctx = next->perf_event_ctxp[ctxn];
- if (parent && next_ctx &&
- rcu_dereference(next_ctx->parent_ctx) == parent) {
+ if (!next_ctx)
+ goto unlock;
+
+ parent = rcu_dereference(ctx->parent_ctx);
+ next_parent = rcu_dereference(next_ctx->parent_ctx);
+
+ /* If neither context have a parent context; they cannot be clones. */
+ if (!parent && !next_parent)
+ goto unlock;
+
+ if (next_parent == ctx || next_ctx == parent || next_parent == parent) {
/*
* Looks like the two contexts are clones, so we might be
* able to optimize the context switch. We lock both
@@ -2290,6 +2319,7 @@ static void perf_event_context_sched_out
raw_spin_unlock(&next_ctx->lock);
raw_spin_unlock(&ctx->lock);
}
+unlock:
rcu_read_unlock();

if (do_switch) {
@@ -7128,7 +7158,6 @@ SYSCALL_DEFINE5(perf_event_open,
}

perf_install_in_context(ctx, event, event->cpu);
- ++ctx->generation;
perf_unpin_context(ctx);
mutex_unlock(&ctx->mutex);

@@ -7211,7 +7240,6 @@ perf_event_create_kernel_counter(struct
WARN_ON_ONCE(ctx->parent_ctx);
mutex_lock(&ctx->mutex);
perf_install_in_context(ctx, event, cpu);
- ++ctx->generation;
perf_unpin_context(ctx);
mutex_unlock(&ctx->mutex);

Subject: [tip:perf/core] perf: Fix the perf context switch optimization

Commit-ID: 5a3126d4fe7c311fe12f98fef0470f6cb582d1ef
Gitweb: http://git.kernel.org/tip/5a3126d4fe7c311fe12f98fef0470f6cb582d1ef
Author: Peter Zijlstra <[email protected]>
AuthorDate: Mon, 7 Oct 2013 17:12:48 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 29 Oct 2013 14:13:01 +0100

perf: Fix the perf context switch optimization

Currently we only optimize the context switch between two
contexts that have the same parent; this forgoes the
optimization between parent and child context, even though these
contexts could be equivalent too.

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Shishkin, Alexander <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/events/core.c | 64 +++++++++++++++++++++++++++++++++++++---------------
1 file changed, 46 insertions(+), 18 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 85a8bbd..17b3c6c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -899,6 +899,7 @@ static void unclone_ctx(struct perf_event_context *ctx)
put_ctx(ctx->parent_ctx);
ctx->parent_ctx = NULL;
}
+ ctx->generation++;
}

static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
@@ -1136,6 +1137,8 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
ctx->nr_events++;
if (event->attr.inherit_stat)
ctx->nr_stat++;
+
+ ctx->generation++;
}

/*
@@ -1313,6 +1316,8 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
*/
if (event->state > PERF_EVENT_STATE_OFF)
event->state = PERF_EVENT_STATE_OFF;
+
+ ctx->generation++;
}

static void perf_group_detach(struct perf_event *event)
@@ -2149,22 +2154,38 @@ static void ctx_sched_out(struct perf_event_context *ctx,
}

/*
- * Test whether two contexts are equivalent, i.e. whether they
- * have both been cloned from the same version of the same context
- * and they both have the same number of enabled events.
- * If the number of enabled events is the same, then the set
- * of enabled events should be the same, because these are both
- * inherited contexts, therefore we can't access individual events
- * in them directly with an fd; we can only enable/disable all
- * events via prctl, or enable/disable all events in a family
- * via ioctl, which will have the same effect on both contexts.
+ * Test whether two contexts are equivalent, i.e. whether they have both been
+ * cloned from the same version of the same context.
+ *
+ * Equivalence is measured using a generation number in the context that is
+ * incremented on each modification to it; see unclone_ctx(), list_add_event()
+ * and list_del_event().
*/
static int context_equiv(struct perf_event_context *ctx1,
struct perf_event_context *ctx2)
{
- return ctx1->parent_ctx && ctx1->parent_ctx == ctx2->parent_ctx
- && ctx1->parent_gen == ctx2->parent_gen
- && !ctx1->pin_count && !ctx2->pin_count;
+ /* Pinning disables the swap optimization */
+ if (ctx1->pin_count || ctx2->pin_count)
+ return 0;
+
+ /* If ctx1 is the parent of ctx2 */
+ if (ctx1 == ctx2->parent_ctx && ctx1->generation == ctx2->parent_gen)
+ return 1;
+
+ /* If ctx2 is the parent of ctx1 */
+ if (ctx1->parent_ctx == ctx2 && ctx1->parent_gen == ctx2->generation)
+ return 1;
+
+ /*
+ * If ctx1 and ctx2 have the same parent; we flatten the parent
+ * hierarchy, see perf_event_init_context().
+ */
+ if (ctx1->parent_ctx && ctx1->parent_ctx == ctx2->parent_ctx &&
+ ctx1->parent_gen == ctx2->parent_gen)
+ return 1;
+
+ /* Unmatched */
+ return 0;
}

static void __perf_event_sync_stat(struct perf_event *event,
@@ -2247,7 +2268,7 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
{
struct perf_event_context *ctx = task->perf_event_ctxp[ctxn];
struct perf_event_context *next_ctx;
- struct perf_event_context *parent;
+ struct perf_event_context *parent, *next_parent;
struct perf_cpu_context *cpuctx;
int do_switch = 1;

@@ -2259,10 +2280,18 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
return;

rcu_read_lock();
- parent = rcu_dereference(ctx->parent_ctx);
next_ctx = next->perf_event_ctxp[ctxn];
- if (parent && next_ctx &&
- rcu_dereference(next_ctx->parent_ctx) == parent) {
+ if (!next_ctx)
+ goto unlock;
+
+ parent = rcu_dereference(ctx->parent_ctx);
+ next_parent = rcu_dereference(next_ctx->parent_ctx);
+
+ /* If neither context have a parent context; they cannot be clones. */
+ if (!parent && !next_parent)
+ goto unlock;
+
+ if (next_parent == ctx || next_ctx == parent || next_parent == parent) {
/*
* Looks like the two contexts are clones, so we might be
* able to optimize the context switch. We lock both
@@ -2290,6 +2319,7 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
raw_spin_unlock(&next_ctx->lock);
raw_spin_unlock(&ctx->lock);
}
+unlock:
rcu_read_unlock();

if (do_switch) {
@@ -7136,7 +7166,6 @@ SYSCALL_DEFINE5(perf_event_open,
}

perf_install_in_context(ctx, event, event->cpu);
- ++ctx->generation;
perf_unpin_context(ctx);
mutex_unlock(&ctx->mutex);

@@ -7219,7 +7248,6 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
WARN_ON_ONCE(ctx->parent_ctx);
mutex_lock(&ctx->mutex);
perf_install_in_context(ctx, event, cpu);
- ++ctx->generation;
perf_unpin_context(ctx);
mutex_unlock(&ctx->mutex);