2009-09-25 10:25:39

by Arjan van de Ven

[permalink] [raw]
Subject: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters

From 5db5cd76f3c16c9f6093f54d1ccfb97d04b9a1ca Mon Sep 17 00:00:00 2001
From: Arjan van de Ven <[email protected]>
Date: Fri, 25 Sep 2009 12:20:57 +0200
Subject: [PATCH] perf_core: provide a kernel-internal interface to get to performance counters

There are reasons for kernel code to ask for, and use, performance counters.
For example, in CPU freq governors this tends to be a good idea, but other
examples are possible as well, of course.

This patch adds the bits needed to enable this functionality; they have been
tested in an experimental cpufreq driver that I'm working on, and the changes
are all that I needed to access counters properly.

Signed-off-by: Arjan van de Ven <[email protected]>
---
include/linux/perf_event.h | 3 ++
kernel/perf_event.c | 71 +++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 73 insertions(+), 1 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3a9d36d..5890330 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -733,6 +733,9 @@ extern int hw_perf_group_sched_in(struct perf_event *group_leader,
struct perf_cpu_context *cpuctx,
struct perf_event_context *ctx, int cpu);
extern void perf_event_update_userpage(struct perf_event *event);
+extern int perf_event_release_kernel(struct perf_event *event);
+extern struct perf_event *perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu);
+extern u64 perf_event_read_value(struct perf_event *event);

struct perf_sample_data {
u64 type;
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 76ac4db..185f45a 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1734,6 +1734,26 @@ static int perf_release(struct inode *inode, struct file *file)
return 0;
}

+int perf_event_release_kernel(struct perf_event *event)
+{
+ struct perf_event_context *ctx = event->ctx;
+
+ WARN_ON_ONCE(ctx->parent_ctx);
+ mutex_lock(&ctx->mutex);
+ perf_event_remove_from_context(event);
+ mutex_unlock(&ctx->mutex);
+
+ mutex_lock(&event->owner->perf_event_mutex);
+ list_del_init(&event->owner_entry);
+ mutex_unlock(&event->owner->perf_event_mutex);
+ put_task_struct(event->owner);
+
+ free_event(event);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(perf_event_release_kernel);
+
static int perf_event_read_size(struct perf_event *event)
{
int entry = sizeof(u64); /* value */
@@ -1759,7 +1779,7 @@ static int perf_event_read_size(struct perf_event *event)
return size;
}

-static u64 perf_event_read_value(struct perf_event *event)
+u64 perf_event_read_value(struct perf_event *event)
{
struct perf_event *child;
u64 total = 0;
@@ -1770,6 +1790,7 @@ static u64 perf_event_read_value(struct perf_event *event)

return total;
}
+EXPORT_SYMBOL_GPL(perf_event_read_value);

static int perf_event_read_entry(struct perf_event *event,
u64 read_format, char __user *buf)
@@ -4463,6 +4484,54 @@ err_put_context:
}

/*
+ * perf_event_create_kernel_counter
+ * MUST be called from a kernel thread.
+ */
+struct perf_event *
+perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu)
+{
+ struct perf_event *event;
+ struct perf_event_context *ctx;
+ int err;
+
+ /*
+ * Get the target context (task or percpu):
+ */
+
+ ctx = find_get_context(0, cpu);
+ if (IS_ERR(ctx))
+ return NULL;
+
+ event = perf_event_alloc(attr, cpu, ctx, NULL,
+ NULL, GFP_KERNEL);
+ err = PTR_ERR(event);
+ if (IS_ERR(event))
+ goto err_put_context;
+
+ event->filp = NULL;
+ WARN_ON_ONCE(ctx->parent_ctx);
+ mutex_lock(&ctx->mutex);
+ perf_install_in_context(ctx, event, cpu);
+ ++ctx->generation;
+ mutex_unlock(&ctx->mutex);
+
+ event->owner = current;
+ get_task_struct(current);
+ mutex_lock(&current->perf_event_mutex);
+ list_add_tail(&event->owner_entry, &current->perf_event_list);
+ mutex_unlock(&current->perf_event_mutex);
+
+ return event;
+
+err_put_context:
+ if (err < 0)
+ put_ctx(ctx);
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(perf_event_create_kernel_counter);
+
+/*
* inherit a event from parent task to child task:
*/
static struct perf_event *
--
1.6.0.6



--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
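
A minimal sketch of how a kernel-side caller might drive this interface,
assuming the RFC's two-argument perf_event_create_kernel_counter() and the
NULL-on-failure convention visible in the diff above; the function names and
the choice of counter are illustrative only, and the calls are assumed to be
made from a kernel thread, as the comment in the patch requires:

#include <linux/perf_event.h>
#include <linux/string.h>
#include <linux/errno.h>

/* Illustrative only: count retired instructions on one CPU. */
static struct perf_event *example_event;

static int example_start(int cpu)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.type   = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        attr.size   = sizeof(attr);

        /* RFC signature: (attr, cpu); returns NULL on failure. */
        example_event = perf_event_create_kernel_counter(&attr, cpu);
        if (!example_event)
                return -EINVAL;
        return 0;
}

static u64 example_read(void)
{
        /* Sums the event and its children, per the exported helper. */
        return perf_event_read_value(example_event);
}

static void example_stop(void)
{
        perf_event_release_kernel(example_event);
        example_event = NULL;
}

A cpufreq governor, for instance, could call example_read() periodically and
feed the delta into its frequency decision.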


2009-09-25 10:44:26

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters

On Fri, Sep 25, 2009 at 12:25:56PM +0200, Arjan van de Ven wrote:
> From 5db5cd76f3c16c9f6093f54d1ccfb97d04b9a1ca Mon Sep 17 00:00:00 2001
> From: Arjan van de Ven <[email protected]>
> Date: Fri, 25 Sep 2009 12:20:57 +0200
> Subject: [PATCH] perf_core: provide a kernel-internal interface to get to performance counters
>
> There are reasons for kernel code to ask for, and use, performance counters.
> For example, in CPU freq governors this tends to be a good idea, but there
> are other examples possible as well of course.
>
> This patch adds the needed bits to do enable this functionality; they have been
> tested in an experimental cpufreq driver that I'm working on, and the changes
> are all that I needed to access counters properly.
>
> Signed-off-by: Arjan van de Ven <[email protected]>



Sounds like a very good idea. I have that as an ad-hoc part in the
hw-breakpoint patchset. This patch would help me drop that part
and use such a unified interface for in-kernel users instead.

Frederic.

2009-09-25 11:41:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters

On Fri, 2009-09-25 at 12:25 +0200, Arjan van de Ven wrote:
> From 5db5cd76f3c16c9f6093f54d1ccfb97d04b9a1ca Mon Sep 17 00:00:00 2001
> From: Arjan van de Ven <[email protected]>
> Date: Fri, 25 Sep 2009 12:20:57 +0200
> Subject: [PATCH] perf_core: provide a kernel-internal interface to get to performance counters
>
> There are reasons for kernel code to ask for, and use, performance counters.
> For example, in CPU freq governors this tends to be a good idea, but there
> are other examples possible as well of course.
>
> This patch adds the needed bits to do enable this functionality; they have been
> tested in an experimental cpufreq driver that I'm working on, and the changes
> are all that I needed to access counters properly.
>
> Signed-off-by: Arjan van de Ven <[email protected]>

Looks fine to me, can go in when we acquire a user.

2009-09-26 16:03:51

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters


Arjan van de Ven <[email protected]> writes:

> [...]
> There are reasons for kernel code to ask for, and use, performance counters.
> For example, in CPU freq governors this tends to be a good idea, but there
> are other examples possible as well of course.
>
> This patch adds the needed bits to do enable this functionality; they have been
> tested in an experimental cpufreq driver that I'm working on, and the changes
> are all that I needed to access counters properly.
> [...]

For what it's worth, this sort of thing also looks useful from
systemtap's point of view.

It appears that the patch assumes that a perf counter instance is to
be associated with the "current" task. How do you use this from your
prototype scheduler? Do you create/attach a new one for each thread?


- FChE

2009-09-26 16:11:30

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters

On Sat, 26 Sep 2009 12:03:28 -0400
[email protected] (Frank Ch. Eigler) wrote:

>
> Arjan van de Ven <[email protected]> writes:
>
> > [...]
> > There are reasons for kernel code to ask for, and use, performance
> > counters. For example, in CPU freq governors this tends to be a
> > good idea, but there are other examples possible as well of course.
> >
> > This patch adds the needed bits to do enable this functionality;
> > they have been tested in an experimental cpufreq driver that I'm
> > working on, and the changes are all that I needed to access
> > counters properly. [...]
>
> For what it's worth, this sort of thing also looks useful from
> systemtap's point of view.
>
> It appears that the patch assumes that a perf counter instance is to
> be associated with the "current" task. How do you use this from your
> prototype scheduler? Do you create/attach a new one for each thread?
>

Your impression is not correct: a perf counter is attached either to a
task or to a cpu. I use the latter.


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-09-26 16:20:51

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters

Hi -

On Sat, Sep 26, 2009 at 06:11:49PM +0200, Arjan van de Ven wrote:
> [...]
> your appearance is not correct. A perf counter is either attached to a
> task, or to a cpu. I use the later.

OK, that explains the "MUST be called from kernel thread" comment --
so that the ->owner and get/put_task_struct in the new create/free
functions all refer to that kernel thread. Thanks.

- FChE

2009-09-26 18:32:56

by K.Prasad

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters

On Sat, Sep 26, 2009 at 12:03:28PM -0400, Frank Ch. Eigler wrote:
>
> Arjan van de Ven <[email protected]> writes:
>
> > [...]
> > There are reasons for kernel code to ask for, and use, performance counters.
> > For example, in CPU freq governors this tends to be a good idea, but there
> > are other examples possible as well of course.
> >
> > This patch adds the needed bits to do enable this functionality; they have been
> > tested in an experimental cpufreq driver that I'm working on, and the changes
> > are all that I needed to access counters properly.
> > [...]
>
> For what it's worth, this sort of thing also looks useful from
> systemtap's point of view.

Wouldn't SystemTap be another user that desires support for multiple/all CPU
perf-counters (apart from hw-breakpoints as a potential user)? As Arjan pointed
out, perf's present design would support only a per-CPU or per-task counter;
not both.

Thanks,
K.Prasad

2009-09-26 18:48:31

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters

On Sun, 27 Sep 2009 00:02:46 +0530
"K.Prasad" <[email protected]> wrote:

> On Sat, Sep 26, 2009 at 12:03:28PM -0400, Frank Ch. Eigler wrote:

> > For what it's worth, this sort of thing also looks useful from
> > systemtap's point of view.
>
> Wouldn't SystemTap be another user that desires support for
> multiple/all CPU perf-counters (apart from hw-breakpoints as a
> potential user)? As Arjan pointed out, perf's present design would
> support only a per-CPU or per-task counter; not both.

I'm sorry, but I think I am missing your point.
"All cpu counters" would be one small helper wrapper away, a helper I'm
sure the SystemTap people would be happy to submit as part of their patch
series when they submit SystemTap to the kernel.


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
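
A minimal sketch of what such a small helper wrapper might look like, again
assuming the RFC's two-argument API and its NULL-on-failure convention; the
array, function name, and error code are illustrative, and the rollback on
partial failure anticipates a concern raised further down the thread:

#include <linux/perf_event.h>
#include <linux/cpu.h>
#include <linux/threads.h>
#include <linux/errno.h>

static struct perf_event *events[NR_CPUS];

/* Create one counter per online CPU; undo everything on failure. */
static int create_counter_on_all_cpus(struct perf_event_attr *attr)
{
        int cpu, failed = -1;

        get_online_cpus();              /* keep the online mask stable */
        for_each_online_cpu(cpu) {
                events[cpu] = perf_event_create_kernel_counter(attr, cpu);
                if (!events[cpu]) {
                        failed = cpu;
                        break;
                }
        }
        if (failed >= 0) {
                /* roll back the counters created before the failure */
                for_each_online_cpu(cpu) {
                        if (cpu == failed)
                                break;
                        perf_event_release_kernel(events[cpu]);
                        events[cpu] = NULL;
                }
        }
        put_online_cpus();
        return failed >= 0 ? -ENOSPC : 0;
}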

2009-10-01 07:25:32

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters


* Arjan van de Ven <[email protected]> wrote:

> On Sun, 27 Sep 2009 00:02:46 +0530
> "K.Prasad" <[email protected]> wrote:
>
> > On Sat, Sep 26, 2009 at 12:03:28PM -0400, Frank Ch. Eigler wrote:
>
> > > For what it's worth, this sort of thing also looks useful from
> > > systemtap's point of view.
> >
> > Wouldn't SystemTap be another user that desires support for
> > multiple/all CPU perf-counters (apart from hw-breakpoints as a
> > potential user)? As Arjan pointed out, perf's present design would
> > support only a per-CPU or per-task counter; not both.
>
> I'm sorry but I think I am missing your point. "all cpu counters"
> would be one small helper wrapper away, a helper I'm sure the
> SystemTap people are happy to submit as part of their patch series
> when they submit SystemTap to the kernel.

Yes, and Frederic wrote that wrapper already for the hw-breakpoints
patches. It's a non-issue and does not affect the design - we can always
gang up an array of per cpu perf events, it's a straightforward use of
the existing design.

User-space tools have been doing this for ages already: 'perf top' will
open an array of per-cpu events to monitor all events in the system.

Ingo
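
For comparison, a user-space sketch of that per-cpu array pattern (one
counting event per CPU, pid == -1) using the raw perf_event_open syscall of
that era; it assumes headers that define __NR_perf_event_open, and error
handling is trimmed for brevity:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
{
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
        struct perf_event_attr attr;
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        int fd[1024];
        int cpu;

        memset(&attr, 0, sizeof(attr));
        attr.type   = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.size   = sizeof(attr);

        /* one event per CPU, across all tasks (pid == -1) */
        for (cpu = 0; cpu < ncpus && cpu < 1024; cpu++)
                fd[cpu] = sys_perf_event_open(&attr, -1, cpu, -1, 0);

        sleep(1);

        for (cpu = 0; cpu < ncpus && cpu < 1024; cpu++) {
                unsigned long long count = 0;

                if (read(fd[cpu], &count, sizeof(count)) != sizeof(count))
                        count = 0;
                printf("cpu %d: %llu cycles\n", cpu, count);
                close(fd[cpu]);
        }
        return 0;
}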

2009-10-01 08:16:23

by K.Prasad

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters

On Thu, Oct 01, 2009 at 09:25:18AM +0200, Ingo Molnar wrote:
>
> * Arjan van de Ven <[email protected]> wrote:
>
> > On Sun, 27 Sep 2009 00:02:46 +0530
> > "K.Prasad" <[email protected]> wrote:
> >
> > > On Sat, Sep 26, 2009 at 12:03:28PM -0400, Frank Ch. Eigler wrote:
> >
> > > > For what it's worth, this sort of thing also looks useful from
> > > > systemtap's point of view.
> > >
> > > Wouldn't SystemTap be another user that desires support for
> > > multiple/all CPU perf-counters (apart from hw-breakpoints as a
> > > potential user)? As Arjan pointed out, perf's present design would
> > > support only a per-CPU or per-task counter; not both.
> >
> > I'm sorry but I think I am missing your point. "all cpu counters"
> > would be one small helper wrapper away, a helper I'm sure the
> > SystemTap people are happy to submit as part of their patch series
> > when they submit SystemTap to the kernel.
>
> Yes, and Frederic wrote that wrapper already for the hw-breakpoints
> patches. It's a non-issue and does not affect the design - we can always
> gang up an array of per cpu perf events, it's a straightforward use of
> the existing design.
>

Such a design (iteratively invoking a per-CPU perf event for every desired
CPU) isn't without issues, some of which are noted here (apart from those
at http://lkml.org/lkml/2009/9/14/298):

- It breaks the abstraction that a user of the exported interfaces would
enjoy w.r.t. having all-CPU (or a cpumask of CPUs) breakpoints.

- (Un)Availability of debug registers on every requested CPU is not
known until the request for that CPU fails. A failed request should be
followed by a rollback of the partially successful requests.

- Any breakpoint exceptions generated due to partially successful
requests (before a failed request is encountered) must be treated as
'stray' and be ignored (by the end-user? or the wrapper code?).

- Any CPUs that come online later have to be caught and
populated with the appropriate debug register value (not something
that the end-user of breakpoints should be bothered with).

- Modifying the characteristics of a kernel breakpoint (including the
valid CPUs) will be equally painful.

- Races between the requests (also leading to temporary failure of
all-CPU requests) present an unclear picture about free debug
registers (making it difficult to predict the need for a retry).

So we should either have a perf event infrastructure that is cognisant of
many/all-CPU counters, or make perf a user of the hw-breakpoints layer,
which already handles such requests deftly (through appropriate
book-keeping).

Thanks,
K.Prasad

2009-10-01 08:53:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters


* K.Prasad <[email protected]> wrote:

> On Thu, Oct 01, 2009 at 09:25:18AM +0200, Ingo Molnar wrote:
> >
> > * Arjan van de Ven <[email protected]> wrote:
> >
> > > On Sun, 27 Sep 2009 00:02:46 +0530
> > > "K.Prasad" <[email protected]> wrote:
> > >
> > > > On Sat, Sep 26, 2009 at 12:03:28PM -0400, Frank Ch. Eigler wrote:
> > >
> > > > > For what it's worth, this sort of thing also looks useful from
> > > > > systemtap's point of view.
> > > >
> > > > Wouldn't SystemTap be another user that desires support for
> > > > multiple/all CPU perf-counters (apart from hw-breakpoints as a
> > > > potential user)? As Arjan pointed out, perf's present design would
> > > > support only a per-CPU or per-task counter; not both.
> > >
> > > I'm sorry but I think I am missing your point. "all cpu counters"
> > > would be one small helper wrapper away, a helper I'm sure the
> > > SystemTap people are happy to submit as part of their patch series
> > > when they submit SystemTap to the kernel.
> >
> > Yes, and Frederic wrote that wrapper already for the hw-breakpoints
> > patches. It's a non-issue and does not affect the design - we can always
> > gang up an array of per cpu perf events, it's a straightforward use of
> > the existing design.
> >
>
> Such a design (iteratively invoking a per-CPU perf event for all
> desired CPUs) isn't without issues, some of which are noted here:
> (apart from http://lkml.org/lkml/2009/9/14/298).
>
> - It breaks the abstraction that a user of the exported interfaces would
> enjoy w.r.t. having all CPU (or a cpumask of CPU) breakpoints.

CPU offlining/onlining support would be interesting to add.

> - (Un)Availability of debug registers on every requested CPU is not
> known until request for that CPU fails. A failed request should be
> followed by a rollback of the partially successful requests.

Yes.

> - Any breakpoint exceptions generated due to partially successful
> requests (before a failed request is encountered) must be treated as
> 'stray' and be ignored (by the end-user? or the wrapper code?).

Such non-atomicity is inherent in using more than one CPU and a disjoint
set of hw-breakpoints. If the calling code cares, then callbacks that
trigger before the registration has returned can be ignored.

> - Any CPUs that become online eventually have to be trapped and
> populated with the appropriate debug register value (not something
> that the end-user of breakpoints should be bothered with).
>
> - Modifying the characteristics of a kernel breakpoint (including the
> valid CPUs) will be equally painful.
>
> - Races between the requests (also leading to temporary failure of
> all CPU requests) presenting an unclear picture about free debug
> registers (making it difficult to predict the need for a retry).
>
> So we either have a perf event infrastructure that is cognisant of
> many/all CPU counters, or make perf as a user of hw-breakpoints layer
> which already handles such requests in a deft manner (through
> appropriate book-keeping).

Given that these are all still in the add-on category, not affecting the
design, while the problems solved by perf events are definitely in the
non-trivial category, I'd suggest you extend perf events with a 'system
wide' event abstraction, which:

- Enumerates such registered events (via a list)

- Adds a CPU hotplug handler (which clones those events over to a new
CPU and directs it back to the ring-buffer of the existing event(s)
[if any])

- Plus a state field that allows the filtering out of stray/premature
events.

Such an add-on layer/abstraction would surely be useful in other cases as
well. It might make sense to expose it to user-space and make 'perf top'
use it by default.

Thanks,

Ingo

2009-10-01 10:01:20

by K.Prasad

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters

On Thu, Oct 01, 2009 at 10:53:30AM +0200, Ingo Molnar wrote:
>
> * K.Prasad <[email protected]> wrote:
>
> > On Thu, Oct 01, 2009 at 09:25:18AM +0200, Ingo Molnar wrote:
> > >
> > > * Arjan van de Ven <[email protected]> wrote:
> > >
> > > > On Sun, 27 Sep 2009 00:02:46 +0530
> > > > "K.Prasad" <[email protected]> wrote:
> > > >
> > > > > On Sat, Sep 26, 2009 at 12:03:28PM -0400, Frank Ch. Eigler wrote:
> > > >
> > > > > > For what it's worth, this sort of thing also looks useful from
> > > > > > systemtap's point of view.
> > > > >
> > > > > Wouldn't SystemTap be another user that desires support for
> > > > > multiple/all CPU perf-counters (apart from hw-breakpoints as a
> > > > > potential user)? As Arjan pointed out, perf's present design would
> > > > > support only a per-CPU or per-task counter; not both.
> > > >
> > > > I'm sorry but I think I am missing your point. "all cpu counters"
> > > > would be one small helper wrapper away, a helper I'm sure the
> > > > SystemTap people are happy to submit as part of their patch series
> > > > when they submit SystemTap to the kernel.
> > >
> > > Yes, and Frederic wrote that wrapper already for the hw-breakpoints
> > > patches. It's a non-issue and does not affect the design - we can always
> > > gang up an array of per cpu perf events, it's a straightforward use of
> > > the existing design.
> > >
> >
> > Such a design (iteratively invoking a per-CPU perf event for all
> > desired CPUs) isn't without issues, some of which are noted here:
> > (apart from http://lkml.org/lkml/2009/9/14/298).
> >
> > - It breaks the abstraction that a user of the exported interfaces would
> > enjoy w.r.t. having all CPU (or a cpumask of CPU) breakpoints.
>
> CPU offlining/onlining support would be interesting to add.
>
> > - (Un)Availability of debug registers on every requested CPU is not
> > known until request for that CPU fails. A failed request should be
> > followed by a rollback of the partially successful requests.
>
> Yes.
>
> > - Any breakpoint exceptions generated due to partially successful
> > requests (before a failed request is encountered) must be treated as
> > 'stray' and be ignored (by the end-user? or the wrapper code?).
>
> Such inatomicity is inherent in using more than one CPU and a disjoint
> set of hw-breakpoints. If the calling code cares then callbacks
> triggering while the registration has not returned yet can be ignored.
>

It can be prevented through book-keeping for the debug registers, using a
'greedy' approach that writes values onto the physical registers only when
it is known that sufficient slots are available on all desired CPUs (as
done by the register_kernel_hw_breakpoint() code in -tip now).

> > - Any CPUs that become online eventually have to be trapped and
> > populated with the appropriate debug register value (not something
> > that the end-user of breakpoints should be bothered with).
> >
> > - Modifying the characteristics of a kernel breakpoint (including the
> > valid CPUs) will be equally painful.
> >
> > - Races between the requests (also leading to temporary failure of
> > all CPU requests) presenting an unclear picture about free debug
> > registers (making it difficult to predict the need for a retry).
> >
> > So we either have a perf event infrastructure that is cognisant of
> > many/all CPU counters, or make perf as a user of hw-breakpoints layer
> > which already handles such requests in a deft manner (through
> > appropriate book-keeping).
>
> Given that these are all still in the add-on category not affecting the
> design, while the problems solved by perf events are definitely in the
> non-trivial category, i'd suggest you extend perf events with a 'system
> wide' event abstraction, which:
>
> - Enumerates such registered events (via a list)
>
> - Adds a CPU hotplug handler (which clones those events over to a new
> CPU and directs it back to the ring-buffer of the existing event(s)
> [if any])
>
> - Plus a state field that allows the filtering out of stray/premature
> events.
>

With some book-keeping (as stated before) in place, stray exceptions due
to premature events would be prevented, since only successful requests
are written onto the debug registers. There would be no need for a
rollback by the end-user either.

But I'm not sure whether such book-keeping variables/data-structures will
find uses in other hw/sw perf events apart from breakpoints (it depends
on whether there's a need to support multiple instances of a hw/sw perf
counter on a given CPU). If so, the existing synchronisation mechanism
(spin-locks over hw_breakpoint_lock) must be extended over the other
perf events (post integration).

Thanks,
K.Prasad

2009-10-01 10:28:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters


* K.Prasad <[email protected]> wrote:

> On Thu, Oct 01, 2009 at 10:53:30AM +0200, Ingo Molnar wrote:
> >
> > * K.Prasad <[email protected]> wrote:
> >
> > > On Thu, Oct 01, 2009 at 09:25:18AM +0200, Ingo Molnar wrote:
> > > >
> > > > * Arjan van de Ven <[email protected]> wrote:
> > > >
> > > > > On Sun, 27 Sep 2009 00:02:46 +0530
> > > > > "K.Prasad" <[email protected]> wrote:
> > > > >
> > > > > > On Sat, Sep 26, 2009 at 12:03:28PM -0400, Frank Ch. Eigler wrote:
> > > > >
> > > > > > > For what it's worth, this sort of thing also looks useful from
> > > > > > > systemtap's point of view.
> > > > > >
> > > > > > Wouldn't SystemTap be another user that desires support for
> > > > > > multiple/all CPU perf-counters (apart from hw-breakpoints as a
> > > > > > potential user)? As Arjan pointed out, perf's present design would
> > > > > > support only a per-CPU or per-task counter; not both.
> > > > >
> > > > > I'm sorry but I think I am missing your point. "all cpu counters"
> > > > > would be one small helper wrapper away, a helper I'm sure the
> > > > > SystemTap people are happy to submit as part of their patch series
> > > > > when they submit SystemTap to the kernel.
> > > >
> > > > Yes, and Frederic wrote that wrapper already for the hw-breakpoints
> > > > patches. It's a non-issue and does not affect the design - we can always
> > > > gang up an array of per cpu perf events, it's a straightforward use of
> > > > the existing design.
> > > >
> > >
> > > Such a design (iteratively invoking a per-CPU perf event for all
> > > desired CPUs) isn't without issues, some of which are noted here:
> > > (apart from http://lkml.org/lkml/2009/9/14/298).
> > >
> > > - It breaks the abstraction that a user of the exported interfaces would
> > > enjoy w.r.t. having all CPU (or a cpumask of CPU) breakpoints.
> >
> > CPU offlining/onlining support would be interesting to add.
> >
> > > - (Un)Availability of debug registers on every requested CPU is not
> > > known until request for that CPU fails. A failed request should be
> > > followed by a rollback of the partially successful requests.
> >
> > Yes.
> >
> > > - Any breakpoint exceptions generated due to partially successful
> > > requests (before a failed request is encountered) must be treated as
> > > 'stray' and be ignored (by the end-user? or the wrapper code?).
> >
> > Such inatomicity is inherent in using more than one CPU and a disjoint
> > set of hw-breakpoints. If the calling code cares then callbacks
> > triggering while the registration has not returned yet can be ignored.
> >
>
> It can be prevented through book-keeping for debug registers, and
> takes a 'greedy' approach that writes values onto the physical registers
> only if it is known that there are sufficient slots available on all
> desired CPUs (as done by the register_kernel_hw_breakpoint() code in
> -tip now).
>
> > > - Any CPUs that become online eventually have to be trapped and
> > > populated with the appropriate debug register value (not something
> > > that the end-user of breakpoints should be bothered with).
> > >
> > > - Modifying the characteristics of a kernel breakpoint (including the
> > > valid CPUs) will be equally painful.
> > >
> > > - Races between the requests (also leading to temporary failure of
> > > all CPU requests) presenting an unclear picture about free debug
> > > registers (making it difficult to predict the need for a retry).
> > >
> > > So we either have a perf event infrastructure that is cognisant of
> > > many/all CPU counters, or make perf as a user of hw-breakpoints layer
> > > which already handles such requests in a deft manner (through
> > > appropriate book-keeping).
> >
> > Given that these are all still in the add-on category not affecting the
> > design, while the problems solved by perf events are definitely in the
> > non-trivial category, i'd suggest you extend perf events with a 'system
> > wide' event abstraction, which:
> >
> > - Enumerates such registered events (via a list)
> >
> > - Adds a CPU hotplug handler (which clones those events over to a new
> > CPU and directs it back to the ring-buffer of the existing event(s)
> > [if any])
> >
> > - Plus a state field that allows the filtering out of stray/premature
> > events.
> >
>
> With some book-keeping (as stated before) in place, stray exceptions
> due to premature events would be prevented since only successful
> requests are written onto debug registers. There would be no need for
> a rollback from the end-user too.
>
> But I'm not sure if such book-keeping variables/data-structures will
> find uses in other hw/sw events in perf apart from breakpoints
> (depends on whether there's a need for support for multiple instances
> of a hw/sw perf counter for a given CPU). If yes, then, the existing
> synchronisation mechanism (through spin-locks over hw_breakpoint_lock)
> must be extended over other perf events (post integration).

Yes, I think the counter array currently used by 'perf top' could be
changed to use that new event type. Also, 'perf record --all' could use
it. SysProf (a system-wide profiler) would be a potential user of it as well.

So yes, there would be uses for this well beyond hw-breakpoints.

Ingo

2009-10-04 22:29:23

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters

On Thu, Oct 01, 2009 at 10:53:30AM +0200, Ingo Molnar wrote:
>
> * K.Prasad <[email protected]> wrote:
>
> > On Thu, Oct 01, 2009 at 09:25:18AM +0200, Ingo Molnar wrote:
> > >
> > > * Arjan van de Ven <[email protected]> wrote:
> > >
> > > > On Sun, 27 Sep 2009 00:02:46 +0530
> > > > "K.Prasad" <[email protected]> wrote:
> > > >
> > > > > On Sat, Sep 26, 2009 at 12:03:28PM -0400, Frank Ch. Eigler wrote:
> > > >
> > > > > > For what it's worth, this sort of thing also looks useful from
> > > > > > systemtap's point of view.
> > > > >
> > > > > Wouldn't SystemTap be another user that desires support for
> > > > > multiple/all CPU perf-counters (apart from hw-breakpoints as a
> > > > > potential user)? As Arjan pointed out, perf's present design would
> > > > > support only a per-CPU or per-task counter; not both.
> > > >
> > > > I'm sorry but I think I am missing your point. "all cpu counters"
> > > > would be one small helper wrapper away, a helper I'm sure the
> > > > SystemTap people are happy to submit as part of their patch series
> > > > when they submit SystemTap to the kernel.
> > >
> > > Yes, and Frederic wrote that wrapper already for the hw-breakpoints
> > > patches. It's a non-issue and does not affect the design - we can always
> > > gang up an array of per cpu perf events, it's a straightforward use of
> > > the existing design.
> > >
> >
> > Such a design (iteratively invoking a per-CPU perf event for all
> > desired CPUs) isn't without issues, some of which are noted here:
> > (apart from http://lkml.org/lkml/2009/9/14/298).
> >
> > - It breaks the abstraction that a user of the exported interfaces would
> > enjoy w.r.t. having all CPU (or a cpumask of CPU) breakpoints.
>
> CPU offlining/onlining support would be interesting to add.
>
> > - (Un)Availability of debug registers on every requested CPU is not
> > known until request for that CPU fails. A failed request should be
> > followed by a rollback of the partially successful requests.
>
> Yes.
>
> > - Any breakpoint exceptions generated due to partially successful
> > requests (before a failed request is encountered) must be treated as
> > 'stray' and be ignored (by the end-user? or the wrapper code?).
>
> Such inatomicity is inherent in using more than one CPU and a disjoint
> set of hw-breakpoints. If the calling code cares then callbacks
> triggering while the registration has not returned yet can be ignored.
>
> > - Any CPUs that become online eventually have to be trapped and
> > populated with the appropriate debug register value (not something
> > that the end-user of breakpoints should be bothered with).
> >
> > - Modifying the characteristics of a kernel breakpoint (including the
> > valid CPUs) will be equally painful.
> >
> > - Races between the requests (also leading to temporary failure of
> > all CPU requests) presenting an unclear picture about free debug
> > registers (making it difficult to predict the need for a retry).
> >
> > So we either have a perf event infrastructure that is cognisant of
> > many/all CPU counters, or make perf as a user of hw-breakpoints layer
> > which already handles such requests in a deft manner (through
> > appropriate book-keeping).
>
> Given that these are all still in the add-on category not affecting the
> design, while the problems solved by perf events are definitely in the
> non-trivial category, i'd suggest you extend perf events with a 'system
> wide' event abstraction, which:
>
> - Enumerates such registered events (via a list)
>
> - Adds a CPU hotplug handler (which clones those events over to a new
> CPU and directs it back to the ring-buffer of the existing event(s)
> [if any])
>
> - Plus a state field that allows the filtering out of stray/premature
> events.
>
> Such an add-on layer/abstraction would sure be useful in other cases as
> well. It might make sense to expose it to user-space and make perf top
> use it by default.
>
> Thanks,
>
> Ingo


Can't we instead modify perf events to be able to
run on multiple contexts?

We could change struct perf_event::ctx into a list of
contexts and then attach the event to several cpu contexts.

The perf event struct has been designed to run on only one context,
so its structure and handling do not deal with races due to
concurrent use, I guess. But at first glance, only a few things would
need to be modified to handle that, and at a low cost.

There might be bad corner cases I'm forgetting, though...

2009-10-05 07:51:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters

On Thu, 2009-10-01 at 10:53 +0200, Ingo Molnar wrote:
> i'd suggest you extend perf events with a 'system
> wide' event abstraction, which:
>
> - Enumerates such registered events (via a list)
>
> - Adds a CPU hotplug handler (which clones those events over to a new
> CPU and directs it back to the ring-buffer of the existing event(s)
> [if any])
>
> - Plus a state field that allows the filtering out of stray/premature
> events.
>
> Such an add-on layer/abstraction would sure be useful in other cases as
> well. It might make sense to expose it to user-space and make perf top
> use it by default.

Non-trivial.

Something like this would imply a single output channel for all these
CPUs, and we've already seen that stuffing too many CPUs down one such
channel (using -M) leads to significant performance issues.

Therefore I would strongly argue to let the kernel interface be what it
is and solve this in a userspace library for those who care.

We really cannot sanely support an all-CPUs abstraction without running
into trouble.

2009-10-05 08:56:51

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters


* Peter Zijlstra <[email protected]> wrote:

> On Thu, 2009-10-01 at 10:53 +0200, Ingo Molnar wrote:
> > i'd suggest you extend perf events with a 'system
> > wide' event abstraction, which:
> >
> > - Enumerates such registered events (via a list)
> >
> > - Adds a CPU hotplug handler (which clones those events over to a new
> > CPU and directs it back to the ring-buffer of the existing event(s)
> > [if any])
> >
> > - Plus a state field that allows the filtering out of stray/premature
> > events.
> >
> > Such an add-on layer/abstraction would sure be useful in other cases as
> > well. It might make sense to expose it to user-space and make perf top
> > use it by default.
>
> Non-trivial.
>
> Something like this would imply a single output channel for all these
> CPUs, and we've already seen that stuffing too many CPUs down one such
> channel (using -M) leads to significant performance issues.

We could add internal per-cpu buffering before it hits any globally
visible output channel. (That came up when I talked to Frederic
about the function tracer.) We could even have page-sized output (via
the introduction of a NOP event that fills up to the next page edge).

This would have advantages elsewhere as well - it would be an immediate
speedup for 'perf sched record' for example.

> Therefore I would strongly argue to let the kernel interface be what
> it is and solve this in a userspace library for those who care.
>
> We really cannot sanely support an all-CPUs abstraction without
> running into trouble.

User-space will be in an even poorer situation to solve this
intelligently.

Really, the only reason to _not_ abstract something in the kernel,
_ever_ is when:

- it is so trivial that it needs no extra helpers in the kernel

- or when it is so specialized that it's a policy in essence

'it is too difficult' is a real argument _in favor_ of putting something
into the kernel ;-)

Ingo

2009-10-05 09:25:34

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters

2009/10/5 Ingo Molnar <[email protected]>:
>
> * Peter Zijlstra <[email protected]> wrote:
>> Non-trivial.
>>
>> Something like this would imply a single output channel for all these
>> CPUs, and we've already seen that stuffing too many CPUs down one such
>> channel (using -M) leads to significant performance issues.
>
> We could add internal per cpu buffering before it hits any globally
> visible output channel. (That has come up when i talked to Frederic
> about the function tracer.) We could even have page sized output (via
> the introduction of a NOP event that fills up to the next page edge).


That looks good for the counting/sampling fast path, but would that scale
once it comes to reordering in the globally visible output channel?
Such a union has its costs.

2009-10-05 09:49:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters


* Frédéric Weisbecker <[email protected]> wrote:

> 2009/10/5 Ingo Molnar <[email protected]>:
> >
> > * Peter Zijlstra <[email protected]> wrote:
> >> Non-trivial.
> >>
> >> Something like this would imply a single output channel for all these
> >> CPUs, and we've already seen that stuffing too many CPUs down one such
> >> channel (using -M) leads to significant performance issues.
> >
> > We could add internal per cpu buffering before it hits any globally
> > visible output channel. (That has come up when i talked to Frederic
> > about the function tracer.) We could even have page sized output
> > (via the introduction of a NOP event that fills up to the next page
> > edge).
>
> That looks good for the counting/sampling fast path, but would that
> scale once it comes to reordering in the globally visible output
> channel? Such a union has its costs.

Well, reordering always has a cost, and we have multiple models
regarding where to put that cost.

The first model is 'everything is per cpu' - i.e. completely separate
event buffers and the reordering is pushed to the user-space
post-processing stage. This is the most scalable solution - but it can
also lose information such as the true ordering of events.

The second model is 'event multiplexing' - here we use a single output
buffer for events. This serializes all output on the same buffer and
hence is the least scalable one. It is the easiest to use one: just a
single channel of output to deal with. It is also the most precise
solution and it saves the post-processing stage from reordering hassles.

What i suggested above is a third model: 'short-term per cpu,
multiplexed into an output channel with page granularity'. It has the
advantage of being per cpu on a page granular basis. It has the ease of
use of having a single output channel only.

None of these models can eliminate the costs and tradeoffs involved. What
they do is offer an app a spectrum to choose from.

Ingo

2009-10-05 09:56:37

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters


* Frederic Weisbecker <[email protected]> wrote:

> Can't we instead modify the perf events to be able to run on multiple
> contexts?
>
> We could change struct perf_event::ctx into a list of context and then
> attach it to several cpu contexts.
>
> The perf event struct have been designed to run on only one context so
> its structure and handling does not deal with races due to concurrent
> uses I guess. But at a first glance, few things would need to be
> modified to handle that, and at a low cost.
>
> There might be bad corner cases I forget though...

Dunno, I assumed it wouldn't be possible sanely. If you tried a patch, we
could argue about the particulars ...

This would in essence create a new event type: system-wide. It doesn't
scale in its naive implementation - which would be fine for low-frequency
events.

We could encode it via sys_perf_event_open(pid:-1, cpu:-1).

Ingo

2009-10-05 10:09:34

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters

On 5 October 2009 at 11:48, Ingo Molnar <[email protected]> wrote:
>
* Frédéric Weisbecker <[email protected]> wrote:
>
>> 2009/10/5 Ingo Molnar <[email protected]>:
>> >
>> > * Peter Zijlstra <[email protected]> wrote:
>> >> Non-trivial.
>> >>
>> >> Something like this would imply a single output channel for all these
>> >> CPUs, and we've already seen that stuffing too many CPUs down one such
>> >> channel (using -M) leads to significant performance issues.
>> >
>> > We could add internal per cpu buffering before it hits any globally
>> > visible output channel. (That has come up when i talked to Frederic
>> > about the function tracer.) We could even have page sized output
>> > (via the introduction of a NOP event that fills up to the next page
>> > edge).
>>
>> That looks good for the counting/sampling fast path, but would that
>> scale once it comes to reordering in the globally visible output
>> channel? Such a union has its costs.
>
> Well, reordering always has a cost, and we have multiple models
> regarding to where to put that cost.
>
> The first model is 'everything is per cpu' - i.e. completely separate
> event buffers and the reordering is pushed to the user-space
> post-processing stage. This is the most scalable solution - but it can
> also lose information such as the true ordering of events.
>
> The second model is 'event multiplexing' - here we use a single output
> buffer for events. This serializes all output on the same buffer and
> hence is the least scalable one. It is the easiest to use one: just a
> single channel of output to deal with. It is also the most precise
> solution and it saves the post-processing stage from reordering hassles.
>
> What i suggested above is a third model: 'short-term per cpu,
> multiplexed into an output channel with page granularity'. It has the
> advantage of being per cpu on a page granular basis. It has the ease of
> use of having a single output channel only.
>
> Neither solution can eliminate the costs and tradeoffs involved. What
> they do is to offer an app a spectrum to choose from.
>
> Ingo
>

OK. The third solution solves the multi-channel problem, and as for the
ordering... well, as you said, everything has a cost.

2009-10-05 10:13:46

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters

2009/10/5 Ingo Molnar <[email protected]>:
>
> * Frederic Weisbecker <[email protected]> wrote:
>
>> Can't we instead modify the perf events to be able to run on multiple
>> contexts?
>>
>> We could change struct perf_event::ctx into a list of context and then
>> attach it to several cpu contexts.
>>
>> The perf event struct have been designed to run on only one context so
>> its structure and handling does not deal with races due to concurrent
>> uses I guess. But at a first glance, few things would need to be
>> modified to handle that, and at a low cost.
>>
>> There might be bad corner cases I forget though...
>
> Dunno, i assumed it wouldnt be possible sanely. If you tried a patch we
> could argue about the particulars ...
>
> This would in essence create a new event type: system-wide. It doesnt
> scale in its naive implementation - which would be fine for low freq
> events.
>
> We could encode it via sys_perf_event_open(pid:-1, cpu:-1).
>
> Ingo
>

We could combine the above (single-channel/per-cpu page granularity) with
the abstraction of wide perf events. That would avoid creating a new type
that wraps a set of multiple events, which would be another kind of type
to handle, breaking the genericity and adding a lot of code.

2009-11-21 13:37:20

by Arjan van de Ven

[permalink] [raw]
Subject: [tip:perf/core] perf/core: Provide a kernel-internal interface to get to performance counters

Commit-ID: fb0459d75c1d0a4ba3cafdd2c754e7486968a676
Gitweb: http://git.kernel.org/tip/fb0459d75c1d0a4ba3cafdd2c754e7486968a676
Author: Arjan van de Ven <[email protected]>
AuthorDate: Fri, 25 Sep 2009 12:25:56 +0200
Committer: Frederic Weisbecker <[email protected]>
CommitDate: Tue, 3 Nov 2009 18:04:17 +0100

perf/core: Provide a kernel-internal interface to get to performance counters

There are reasons for kernel code to ask for, and use, performance
counters.
For example, in CPU freq governors this tends to be a good idea, but
other examples are possible as well, of course.

This patch adds the bits needed to enable this functionality; they
have been tested in an experimental cpufreq driver that I'm working on,
and the changes are all that I needed to access counters properly.

[[email protected]: added pid to perf_event_create_kernel_counter so
that we can profile a particular task too

TODO: Have better error reporting; don't just return NULL in the failure
case.]

v2: Remove the wrong comment claiming that perf_event_create_kernel_counter
must be called from a kernel thread.

Signed-off-by: Arjan van de Ven <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: "K.Prasad" <[email protected]>
Cc: Alan Stern <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jan Kiszka <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Paul Mundt <[email protected]>
Cc: Jan Kiszka <[email protected]>
Cc: Avi Kivity <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
include/linux/perf_event.h | 6 +++
kernel/perf_event.c | 75 +++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 80 insertions(+), 1 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index df9d964..fa151d4 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -744,6 +744,12 @@ extern int hw_perf_group_sched_in(struct perf_event *group_leader,
struct perf_cpu_context *cpuctx,
struct perf_event_context *ctx, int cpu);
extern void perf_event_update_userpage(struct perf_event *event);
+extern int perf_event_release_kernel(struct perf_event *event);
+extern struct perf_event *
+perf_event_create_kernel_counter(struct perf_event_attr *attr,
+ int cpu,
+ pid_t pid);
+extern u64 perf_event_read_value(struct perf_event *event);

struct perf_sample_data {
u64 type;
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 12b5ec3..02d4ff0 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1725,6 +1725,26 @@ static int perf_release(struct inode *inode, struct file *file)
return 0;
}

+int perf_event_release_kernel(struct perf_event *event)
+{
+ struct perf_event_context *ctx = event->ctx;
+
+ WARN_ON_ONCE(ctx->parent_ctx);
+ mutex_lock(&ctx->mutex);
+ perf_event_remove_from_context(event);
+ mutex_unlock(&ctx->mutex);
+
+ mutex_lock(&event->owner->perf_event_mutex);
+ list_del_init(&event->owner_entry);
+ mutex_unlock(&event->owner->perf_event_mutex);
+ put_task_struct(event->owner);
+
+ free_event(event);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(perf_event_release_kernel);
+
static int perf_event_read_size(struct perf_event *event)
{
int entry = sizeof(u64); /* value */
@@ -1750,7 +1770,7 @@ static int perf_event_read_size(struct perf_event *event)
return size;
}

-static u64 perf_event_read_value(struct perf_event *event)
+u64 perf_event_read_value(struct perf_event *event)
{
struct perf_event *child;
u64 total = 0;
@@ -1761,6 +1781,7 @@ static u64 perf_event_read_value(struct perf_event *event)

return total;
}
+EXPORT_SYMBOL_GPL(perf_event_read_value);

static int perf_event_read_entry(struct perf_event *event,
u64 read_format, char __user *buf)
@@ -4638,6 +4659,58 @@ err_put_context:
return err;
}

+/**
+ * perf_event_create_kernel_counter
+ *
+ * @attr: attributes of the counter to create
+ * @cpu: cpu in which the counter is bound
+ * @pid: task to profile
+ */
+struct perf_event *
+perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
+ pid_t pid)
+{
+ struct perf_event *event;
+ struct perf_event_context *ctx;
+ int err;
+
+ /*
+ * Get the target context (task or percpu):
+ */
+
+ ctx = find_get_context(pid, cpu);
+ if (IS_ERR(ctx))
+ return NULL;
+
+ event = perf_event_alloc(attr, cpu, ctx, NULL,
+ NULL, GFP_KERNEL);
+ err = PTR_ERR(event);
+ if (IS_ERR(event))
+ goto err_put_context;
+
+ event->filp = NULL;
+ WARN_ON_ONCE(ctx->parent_ctx);
+ mutex_lock(&ctx->mutex);
+ perf_install_in_context(ctx, event, cpu);
+ ++ctx->generation;
+ mutex_unlock(&ctx->mutex);
+
+ event->owner = current;
+ get_task_struct(current);
+ mutex_lock(&current->perf_event_mutex);
+ list_add_tail(&event->owner_entry, &current->perf_event_list);
+ mutex_unlock(&current->perf_event_mutex);
+
+ return event;
+
+err_put_context:
+ if (err < 0)
+ put_ctx(ctx);
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(perf_event_create_kernel_counter);
+
/*
* inherit a event from parent task to child task:
*/
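
A minimal sketch of the pid-enabled variant, assuming the pid/cpu convention
of find_get_context() (cpu >= 0 selects a per-CPU context, in which case the
pid is not consulted; cpu == -1 with a pid binds the counter to that task);
the helper name and the task argument are illustrative only:

#include <linux/perf_event.h>
#include <linux/sched.h>
#include <linux/string.h>

/* Illustrative only: count cycles for one task, wherever it runs. */
static struct perf_event *count_task_cycles(struct task_struct *task)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.type   = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.size   = sizeof(attr);

        return perf_event_create_kernel_counter(&attr, -1, task_pid_nr(task));
}

Under the same assumption, a per-CPU counter would be created as
perf_event_create_kernel_counter(&attr, cpu, -1) and torn down with
perf_event_release_kernel() as before.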

2010-01-25 05:16:55

by john smith

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters

Arjan,

> This patch adds the needed bits to do enable this functionality; they have been
> tested in an experimental cpufreq driver that I'm working on, and the changes
> are all that I needed to access counters properly.

Very useful for driver performance analysis.
It doesn't look like you have added the new APIs to your cpufreq driver yet; can you please provide an example of their usage, possibly monitoring multiple counters concurrently?

John


2010-02-05 15:47:51

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters

On Fri, Sep 25, 2009 at 12:25:56PM +0200, Arjan van de Ven wrote:
> >From 5db5cd76f3c16c9f6093f54d1ccfb97d04b9a1ca Mon Sep 17 00:00:00 2001
> From: Arjan van de Ven <[email protected]>
> Date: Fri, 25 Sep 2009 12:20:57 +0200
> Subject: [PATCH] perf_core: provide a kernel-internal interface to get to performance counters

So this patch got accepted into mainline a while ago without a user
ever being presented, never mind actually included in mainline. Time
to revert these useless exports and dead code?

2010-02-05 17:59:36

by john smith

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters

> So this patch got accepted into mainline a while ago,
> without a user
> ever beeing presented, nevermind actually included in
> mainline. Time
> to revert providing these useless exports and dead code?

I think the kernel perf API is very useful for improving kernel (module) performance, and of course, for this purpose its usage won't make it into the distribution code.
I started to use it, but I would still like to see an example showing its full potential (cpufreq was proposed).

Imagine moving a structure's fields around to increase cache performance: the general perf counters are, hmm, too general when we need to monitor only some code sections (after they have been identified using the global perf counters/tool).

John


2010-02-06 06:23:23

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters

On Fri, 5 Feb 2010 10:47:48 -0500
Christoph Hellwig <[email protected]> wrote:

> On Fri, Sep 25, 2009 at 12:25:56PM +0200, Arjan van de Ven wrote:
> > >From 5db5cd76f3c16c9f6093f54d1ccfb97d04b9a1ca Mon Sep 17 00:00:00 2001
> > From: Arjan van de Ven <[email protected]>
> > Date: Fri, 25 Sep 2009 12:20:57 +0200
> > Subject: [PATCH] perf_core: provide a kernel-internal interface to
> > get to performance counters
>
> So this patch got accepted into mainline a while ago, without a user
> ever beeing presented, nevermind actually included in mainline. Time
> to revert providing these useless exports and dead code?
>


Frederic had a user near mainline.

I have one, but it's .35 material at this point...

If you insist on bouncing the patch out now and back in
for .35... I don't care, but Frederic might.


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2010-02-06 11:46:50

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters

On Fri, Feb 05, 2010 at 10:24:30PM -0800, Arjan van de Ven wrote:
> On Fri, 5 Feb 2010 10:47:48 -0500
> Christoph Hellwig <[email protected]> wrote:
>
> > On Fri, Sep 25, 2009 at 12:25:56PM +0200, Arjan van de Ven wrote:
> > > >From 5db5cd76f3c16c9f6093f54d1ccfb97d04b9a1ca Mon Sep 17 00:00:00 2001
> > > From: Arjan van de Ven <[email protected]>
> > > Date: Fri, 25 Sep 2009 12:20:57 +0200
> > > Subject: [PATCH] perf_core: provide a kernel-internal interface to
> > > get to performance counters
> >
> > So this patch got accepted into mainline a while ago, without a user
> > ever beeing presented, nevermind actually included in mainline. Time
> > to revert providing these useless exports and dead code?
> >
>
>
> Frederic had a user near mainline.
>
> I have one but it's .35 material at this point...
>
> if you want to insist to bounce the patch out now and back in
> for .35... I don't care but Frederic might.


We can't.
It has a user merged upstream: the hw-breakpoint subsystem.

2010-02-06 14:18:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters

On Sat, 2010-02-06 at 12:46 +0100, Frederic Weisbecker wrote:

> We can't. It has a user merged upstream: the hw-breakpoint subsystem.

That's non-modular, so removing that EXPORT still makes sense.

2010-02-06 16:08:58

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters

On Sat, Feb 06, 2010 at 03:18:42PM +0100, Peter Zijlstra wrote:
> On Sat, 2010-02-06 at 12:46 +0100, Frederic Weisbecker wrote:
>
> > We can't. It has a user merged upstream: the hw-breakpoint subsystem.
>
> That's non modular, so removing that EXPORT still makes sense.


Yeah, the export can be zapped, indeed.

2010-02-07 17:02:14

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters


* Christoph Hellwig <[email protected]> wrote:

> On Fri, Sep 25, 2009 at 12:25:56PM +0200, Arjan van de Ven wrote:
> > >From 5db5cd76f3c16c9f6093f54d1ccfb97d04b9a1ca Mon Sep 17 00:00:00 2001
> > From: Arjan van de Ven <[email protected]>
> > Date: Fri, 25 Sep 2009 12:20:57 +0200
> > Subject: [PATCH] perf_core: provide a kernel-internal interface to get to performance counters
>
> So this patch got accepted into mainline a while ago, without a user ever
> beeing presented, nevermind actually included in mainline. Time to revert
> providing these useless exports and dead code?

You are quite wrong about that; this API is relied upon by the upstream
hw-breakpoints subsystem:

earth4:~/tip> git grep perf_event_release_kernel
include/linux/perf_event.h:extern int perf_event_release_kernel(struct perf_event *event);
kernel/hw_breakpoint.c: perf_event_release_kernel(bp);
[...]

Furthermore, beyond the use mentioned by Arjan, there's a new patch set
posted to LKML that makes use of this new API as well: the generalized NMI
watchdog feature posted by Don Zickus.

Thanks,

Ingo