2015-02-06 02:59:45

by Sukadev Bhattiprolu

Subject: [RFC][PATCH] perf: Implement read_group() PMU operation

From: Sukadev Bhattiprolu <[email protected]>
Date: Thu Feb 5 20:56:20 EST 2015 -0300
Subject: [RFC][PATCH] perf: Implement read_group() PMU operation

This is a lightly tested, exploratory patch to allow PMUs to return
several counters at once. Appreciate any comments :-)

Unlike normal hardware PMCs, the 24x7 counters[1] in Power8 are stored
in memory and accessed via a hypervisor call (HCALL). A major aspect
of the HCALL is that it allows retrieving _SEVERAL_ counters at once
(unlike regular PMCs, which are read one at a time).

This patch implements a ->read_group() PMU operation that tries to
take advantage of this ability to read several counters at once. A
PMU that implements the ->read_group() operation would allow users
to retrieve several counters at once and get a more consistent
snapshot.

NOTE: This patch has a TODO in h_24x7_event_read_group() in that it
still does multiple HCALLs. I think that can be optimized
independently, once the pmu->read_group() interface itself is
finalized.

Appreciate comments on the ->read_group() interface and on how best to
manage the interfaces between the core and PMU layers - e.g., is it OK
for the hv-24x7 PMU to walk the ->sibling_list?

[1] Some notes about 24x7 counters:

Power8 supports 24x7 counters, which differ from traditional PMCs
in several ways:

- The 24x7 counters are always on and counting. Rather than
start/stop the PMCs, we read/report the _change_ in values
in the counters during the execution of the workload.

- The 24x7 counters are not tied to a task context (they are
always on).

- Rather than reading the event counts from registers, we make
a hypervisor call (HCALL) to retrieve counts. The HCALL allows
retrieving a large number of counters in a single call.

- These counters don't generate interrupts when they overflow (so
sampling does not apply to these counters).
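
The "report the change" behaviour described in the notes above can be illustrated with a small user-space sketch. It is not code from the hv-24x7 driver: the struct and function names are hypothetical, and the raw counter value is passed in by the caller rather than fetched through an HCALL.

```c
#include <stdint.h>

/* Hypothetical model of an event backed by an always-on counter. */
struct always_on_event {
	uint64_t prev;	/* raw counter value seen at the last read */
	uint64_t count;	/* change accumulated since the event was started */
};

/* "Starting" the event only records a baseline; the counter keeps running. */
static void event_start(struct always_on_event *ev, uint64_t raw_now)
{
	ev->prev = raw_now;
	ev->count = 0;
}

/*
 * Fold the change since the last read into the running count.
 * Unsigned subtraction keeps the delta correct across a single wraparound.
 */
static uint64_t event_update(struct always_on_event *ev, uint64_t raw_now)
{
	ev->count += raw_now - ev->prev;
	ev->prev = raw_now;
	return ev->count;
}
```

With a baseline of 1000, raw readings of 1500 and then 1700 would report 500 and then 700.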
---

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1d36314..b69fbdf 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -232,6 +232,13 @@ struct pmu {
void (*read) (struct perf_event *event);

/*
+ * Read a group of counters.
+ */
+ int (*read_group) (struct perf_event *event,
+ u64 *values,
+ int ncounters);
+
+ /*
* Group events scheduling is treated as a transaction, add
* group events as a whole and perform one schedulability test.
* If the test fails, roll back the whole group
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 934687f..026a9d0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3549,10 +3549,43 @@ static int perf_event_read_group(struct perf_event *event,
struct perf_event *leader = event->group_leader, *sub;
int n = 0, size = 0, ret = -EFAULT;
struct perf_event_context *ctx = leader->ctx;
+ u64 *valuesp;
u64 values[5];
+ int use_group_read;
u64 count, enabled, running;
+ struct pmu *pmu = event->pmu;
+
+ /*
+ * If PMU supports group read and group read is requested,
+ * allocate memory before taking the mutex.
+ */
+ use_group_read = 0;
+ if ((read_format & PERF_FORMAT_GROUP) && pmu->read_group) {
+ use_group_read++;
+ }
+
+ if (use_group_read) {
+ valuesp = kzalloc((1 + leader->nr_siblings) * sizeof(u64), GFP_KERNEL);
+ if (!valuesp)
+ return -ENOMEM;
+ }

mutex_lock(&ctx->mutex);
+
+ if (use_group_read) {
+ ret = pmu->read_group(leader, valuesp, 1 + leader->nr_siblings);
+ if (ret >= 0) {
+ size = ret * sizeof(u64);
+
+ ret = size;
+ if (copy_to_user(buf, valuesp, size))
+ ret = -EFAULT;
+ }
+
+ kfree(valuesp);
+ goto unlock;
+ }
+
count = perf_event_read_value(leader, &enabled, &running);

values[n++] = 1 + leader->nr_siblings;
diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index 9445a82..cd48cf0 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -1071,12 +1071,33 @@ static int h_24x7_event_init(struct perf_event *event)
struct hv_perf_caps caps;
unsigned domain;
unsigned long hret;
+ u64 read_format, inv_flags;
u64 ct;

/* Not our event */
if (event->attr.type != event->pmu->type)
return -ENOENT;

+ /*
+ * We don't support enabled/running times with PERF_FORMAT_GROUP.
+ * The ->read_group() operation is intended to be used in continuous
+ * monitoring mode, so these time values are not important at least
+ * for now.
+ *
+ * Not sure if the PERF_FORMAT_ID is useful. Block it for now.
+ */
+ read_format = event->attr.read_format;
+ inv_flags = PERF_FORMAT_TOTAL_TIME_ENABLED;
+ inv_flags |= PERF_FORMAT_TOTAL_TIME_RUNNING;
+ inv_flags |= PERF_FORMAT_ID;
+
+ if ((read_format & PERF_FORMAT_GROUP) && (read_format & inv_flags)) {
+ pr_devel("%s(): Invalid flags: rf 0x%llx, invf 0x%llx\n",
+ __func__, (unsigned long long)read_format,
+ (unsigned long long)inv_flags);
+ return -EINVAL;
+ }
+
/* Unused areas must be 0 */
if (event_get_reserved1(event) ||
event_get_reserved2(event) ||
@@ -1181,6 +1202,50 @@ static int h_24x7_event_add(struct perf_event *event, int flags)
return 0;
}

+static int h_24x7_event_read_group(struct perf_event *leader, u64 *values,
+ int ncounters)
+{
+ struct perf_event *sub;
+ int n = 0;
+
+ BUG_ON(!(leader->attr.read_format & PERF_FORMAT_GROUP));
+
+ /*
+ * sys_perf_event_open() for now prevents inheritance with
+ * PERF_FORMAT_GROUP. Ensure that hasn't changed.
+ */
+ BUG_ON(!list_empty(&leader->child_list));
+
+ if (ncounters < leader->nr_siblings) {
+ pr_devel("%s(): Insufficient buffer : ns %d, nc %d\n",
+ __func__, leader->nr_siblings, ncounters);
+ return -EINVAL;
+ }
+
+ raw_spin_lock(&leader->ctx->lock);
+
+ if (leader->state == PERF_EVENT_STATE_ACTIVE) {
+ h_24x7_event_update(leader);
+ values[n++] = local64_read(&leader->count);
+ }
+
+ /*
+ * TODO: For now, make one HCALL per event. We will soon retrieve
+ * several events with one HCALL.
+ */
+ list_for_each_entry(sub, &leader->sibling_list, group_entry) {
+ if (sub->state != PERF_EVENT_STATE_ACTIVE)
+ continue;
+
+ h_24x7_event_update(sub);
+ values[n++] = local64_read(&sub->count);
+ }
+
+ raw_spin_unlock(&leader->ctx->lock);
+
+ return n;
+}
+
static struct pmu h_24x7_pmu = {
.task_ctx_nr = perf_invalid_context,

@@ -1192,6 +1257,7 @@ static struct pmu h_24x7_pmu = {
.start = h_24x7_event_start,
.stop = h_24x7_event_stop,
.read = h_24x7_event_update,
+ .read_group = h_24x7_event_read_group,
};

static int hv_24x7_init(void)
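
As an illustration of the ->read_group() contract used in the patch above, here is a minimal user-space model. It is only a sketch: the struct names, MAX_SIBLINGS and model_read_group() are hypothetical, and no hypervisor call is made. Because the leader contributes a value as well, the model assumes the buffer must have room for 1 + nr_siblings entries.

```c
#include <stdint.h>

#define MAX_SIBLINGS 8	/* hypothetical limit for this model only */

struct model_event {
	int active;		/* stands in for PERF_EVENT_STATE_ACTIVE */
	uint64_t count;
};

struct model_group {
	struct model_event leader;
	struct model_event siblings[MAX_SIBLINGS];
	int nr_siblings;
};

/*
 * Copy the counts of all active events in the group into values[].
 * Returns the number of values written, or -1 if the buffer cannot
 * hold the leader plus every sibling.
 */
static int model_read_group(const struct model_group *g, uint64_t *values,
			    int ncounters)
{
	int n = 0;

	if (ncounters < 1 + g->nr_siblings)
		return -1;

	if (g->leader.active)
		values[n++] = g->leader.count;

	for (int i = 0; i < g->nr_siblings; i++) {
		if (!g->siblings[i].active)
			continue;	/* inactive events are skipped */
		values[n++] = g->siblings[i].count;
	}

	return n;
}
```

The caller multiplies the return value by sizeof(u64) to get the number of bytes to copy out, as perf_event_read_group() does in the patch.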


2015-02-12 15:59:06

by Peter Zijlstra

Subject: Re: [RFC][PATCH] perf: Implement read_group() PMU operation

On Thu, Feb 05, 2015 at 06:59:15PM -0800, Sukadev Bhattiprolu wrote:
> From: Sukadev Bhattiprolu <[email protected]>
> Date: Thu Feb 5 20:56:20 EST 2015 -0300
> Subject: [RFC][PATCH] perf: Implement read_group() PMU operation
>
> This is a lightly tested, exploratory patch to allow PMUs to return
> several counters at once. Appreciate any comments :-)
>
> Unlike normal hardware PMCs, the 24x7 counters[1] in Power8 are stored
> in memory and accessed via a hypervisor call (HCALL). A major aspect
> of the HCALL is that it allows retrieving _SEVERAL_ counters at once
> (unlike regular PMCs, which are read one at a time).
>
> This patch implements a ->read_group() PMU operation that tries to
> take advantage of this ability to read several counters at once. A
> PMU that implements the ->read_group() operation would allow users
> to retrieve several counters at once and get a more consistent
> snapshot.
>
> NOTE: This patch has a TODO in h_24x7_event_read_group() in that it
> still does multiple HCALLS. I think that can be optimized
> independently, once the pmu->read_group() interface itself is
> finalized.
>
> Appreciate comments on the ->read_group interface and best managing the
> interfaces between the core and PMU layers - eg: Ok for hv-24x7 PMU to
> walk the ->sibling_list ?


> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3549,10 +3549,43 @@ static int perf_event_read_group(struct perf_event *event,

You also want perf_output_read_group().

> struct perf_event *leader = event->group_leader, *sub;
> int n = 0, size = 0, ret = -EFAULT;
> struct perf_event_context *ctx = leader->ctx;
> + u64 *valuesp;
> u64 values[5];
> + int use_group_read;
> u64 count, enabled, running;
> + struct pmu *pmu = event->pmu;
> +
> + /*
> + * If PMU supports group read and group read is requested,
> + * allocate memory before taking the mutex.
> + */
> + use_group_read = 0;
> + if ((read_format & PERF_FORMAT_GROUP) && pmu->read_group) {
> + use_group_read++;
> + }
> +
> + if (use_group_read) {
> + valuesp = kzalloc(leader->nr_siblings * sizeof(u64), GFP_KERNEL);
> + if (!valuesp)
> + return -ENOMEM;
> + }

This seems 'sad', the hardware already knows how many it can maximally
use at once and can preallocate, right?

>
> mutex_lock(&ctx->mutex);
> +
> + if (use_group_read) {
> + ret = pmu->read_group(leader, valuesp, leader->nr_siblings);
> + if (ret >= 0) {
> + size = ret * sizeof(u64);
> +
> + ret = size;
> + if (copy_to_user(buf, valuesp, size))
> + ret = -EFAULT;
> + }
> +
> + kfree(valuesp);
> + goto unlock;
> + }
> +
> count = perf_event_read_value(leader, &enabled, &running);
>
> values[n++] = 1 + leader->nr_siblings;

Since ->read() has a void return value, we can delay its effect, so I'm
currently thinking we might want to extend the transaction interface for
this; give pmu::start_txn() a flags argument to indicate scheduling
(add) or reading (read).

So we'd end up with something like:

	pmu->start_txn(pmu, PMU_TXN_READ);

	leader->read();

	for_each_sibling()
		sibling->read();

	pmu->commit_txn();

after which we can use the values updated by the read calls. The trivial
no-support implementation lets read do its immediate thing like it does
now.

A more complex driver can then collect the actual counter values and
execute one hypercall using its pre-allocated memory.

So no allocations in the core code, and no sibling iterations in the
driver code.

Would that work for you?
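
The read-transaction idea proposed above can be modelled in user space as follows. This is a sketch under stated assumptions, not kernel code: struct pmu_model, TXN_MAX, fake_hcall() and the "counter id holds 100 * id" rule are all invented for illustration. Only the shape (start, deferred reads, one batched fetch at commit) follows the proposal.

```c
#include <stddef.h>
#include <stdint.h>

#define TXN_MAX 8	/* hypothetical preallocated capacity of the driver */

enum txn_mode { TXN_NONE, TXN_READ };

struct event {
	int id;
	uint64_t count;
};

struct pmu_model {
	enum txn_mode mode;
	struct event *queued[TXN_MAX];	/* preallocated, no allocation on read */
	size_t nqueued;
	unsigned int hcalls;		/* number of simulated hypercalls */
};

/* Simulated hypervisor call: counter 'id' currently holds 100 * id. */
static void fake_hcall(struct pmu_model *pmu, struct event **evs, size_t n)
{
	pmu->hcalls++;
	for (size_t i = 0; i < n; i++)
		evs[i]->count = 100u * (uint64_t)evs[i]->id;
}

static void start_txn(struct pmu_model *pmu, enum txn_mode mode)
{
	pmu->mode = mode;
	pmu->nqueued = 0;
}

/* Inside a read txn the read is only queued; otherwise it is immediate. */
static void event_read(struct pmu_model *pmu, struct event *ev)
{
	if (pmu->mode == TXN_READ && pmu->nqueued < TXN_MAX) {
		pmu->queued[pmu->nqueued++] = ev;
		return;
	}
	fake_hcall(pmu, &ev, 1);	/* no txn (or queue full): read now */
}

/* One simulated hypercall fetches every queued counter. */
static int commit_txn(struct pmu_model *pmu)
{
	if (pmu->mode == TXN_READ && pmu->nqueued)
		fake_hcall(pmu, pmu->queued, pmu->nqueued);
	pmu->mode = TXN_NONE;
	pmu->nqueued = 0;
	return 0;
}
```

In this model the event counts are only valid after commit_txn() returns, and a read issued outside a transaction costs one simulated hypercall per event.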

2015-02-17 08:33:54

by Sukadev Bhattiprolu

Subject: Re: [RFC][PATCH] perf: Implement read_group() PMU operation

Peter Zijlstra [[email protected]] wrote:
| > --- a/kernel/events/core.c
| > +++ b/kernel/events/core.c
| > @@ -3549,10 +3549,43 @@ static int perf_event_read_group(struct perf_event *event,
|
| You also want perf_output_read_group().

Ok. Will look into it. We currently don't support sampling with
the 24x7 counters but we should make sure that the new interface
fits correctly.


|
| > struct perf_event *leader = event->group_leader, *sub;
| > int n = 0, size = 0, ret = -EFAULT;
| > struct perf_event_context *ctx = leader->ctx;
| > + u64 *valuesp;
| > u64 values[5];
| > + int use_group_read;
| > u64 count, enabled, running;
| > + struct pmu *pmu = event->pmu;
| > +
| > + /*
| > + * If PMU supports group read and group read is requested,
| > + * allocate memory before taking the mutex.
| > + */
| > + use_group_read = 0;
| > + if ((read_format & PERF_FORMAT_GROUP) && pmu->read_group) {
| > + use_group_read++;
| > + }
| > +
| > + if (use_group_read) {
| > + valuesp = kzalloc(leader->nr_siblings * sizeof(u64), GFP_KERNEL);
| > + if (!valuesp)
| > + return -ENOMEM;
| > + }
|
| This seems 'sad', the hardware already knows how many it can maximally
| use at once and can preallocate, right?

Yeah :-) In a subsequent version, I got rid of the allocation by moving
the copy_to_user() into the PMU, but still needed to walk the sibling list.
|
| >
| > mutex_lock(&ctx->mutex);
| > +
| > + if (use_group_read) {
| > + ret = pmu->read_group(leader, valuesp, leader->nr_siblings);
| > + if (ret >= 0) {
| > + size = ret * sizeof(u64);
| > +
| > + ret = size;
| > + if (copy_to_user(buf, valuesp, size))
| > + ret = -EFAULT;
| > + }
| > +
| > + kfree(valuesp);
| > + goto unlock;
| > + }
| > +
| > count = perf_event_read_value(leader, &enabled, &running);
| >
| > values[n++] = 1 + leader->nr_siblings;
|
| Since ->read() has a void return value, we can delay its effect, so I'm
| currently thinking we might want to extend the transaction interface for
| this; give pmu::start_txn() a flags argument to indicate scheduling
| (add) or reading (read).
|
| So we'd end up with something like:
|
| pmu->start_txn(pmu, PMU_TXN_READ);
|
| leader->read();
|
| for_each_sibling()
| sibling->read();
|
| pmu->commit_txn();

So, each of the ->read() calls is really "appending a counter" to a
list of counters that the PMU should read and the values for the counters
(results of the read) are only available after the commit_txn() right?

In which case, perf_event_read_group() would then follow this commit_txn()
with its "normal" read, and the PMU would return the result cached during
->commit_txn(). If so, we need a way to invalidate the cached result ?

|
| after which we can use the values updated by the read calls. The trivial
| no-support implementation lets read do its immediate thing like it does
| now.
|
| A more complex driver can then collect the actual counter values and
| execute one hypercall using its pre-allocated memory.

the hypercall should happen in the ->commit_txn() right ?

|
| So no allocations in the core code, and no sibling iterations in the
| driver code.
|
| Would that work for you?

I think it would.

I am working on breaking up the underlying code along start/read/commit
lines, and hope to have it done later this week and then can experiment
more with this interface.

Appreciate the input.

Sukadev

2015-02-17 10:03:11

by Peter Zijlstra

Subject: Re: [RFC][PATCH] perf: Implement read_group() PMU operation

On Tue, Feb 17, 2015 at 12:33:12AM -0800, Sukadev Bhattiprolu wrote:
> Peter Zijlstra [[email protected]] wrote:
> | > --- a/kernel/events/core.c
> | > +++ b/kernel/events/core.c
> | > @@ -3549,10 +3549,43 @@ static int perf_event_read_group(struct perf_event *event,
> |
> | You also want perf_output_read_group().
>
> Ok. Will look into it. We currently don't support sampling with
> the 24x7 counters but we should make sure that the new interface
> fits correctly.

One thing someone 'could' do is group them together with a software
event that _can_ sample, and then use SAMPLE_READ to periodically stuff
values into the buffer.

> | Since ->read() has a void return value, we can delay its effect, so I'm
> | currently thinking we might want to extend the transaction interface for
> | this; give pmu::start_txn() a flags argument to indicate scheduling
> | (add) or reading (read).
> |
> | So we'd end up with something like:
> |
> | pmu->start_txn(pmu, PMU_TXN_READ);
> |
> | leader->read();
> |
> | for_each_sibling()
> | sibling->read();
> |
> | pmu->commit_txn();
>
> So, each of the ->read() calls is really "appending a counter" to a
> list of counters that the PMU should read and the values for the counters
> (results of the read) are only available after the commit_txn() right?

Correct.

> In which case, perf_event_read_group() would then follow this commit_txn()
> with its "normal" read, and the PMU would return the result cached during
> ->commit_txn(). If so, we need a way to invalidate the cached result ?

I was thinking of breaking up that code into two loops: one to call
->read() and update states, the second to use the now up-to-date data
and frob it into the stream.

I must say I've not given it much thought yet, but that way you're not
stuck with this cache and its related problems.

> | after which we can use the values updated by the read calls. The trivial
> | no-support implementation lets read do its immediate thing like it does
> | now.
> |
> | A more complex driver can then collect the actual counter values and
> | execute one hypercall using its pre-allocated memory.
>
> the hypercall should happen in the ->commit_txn() right ?

Yah. Of course, if a ->read() is not part of a txn then it must do the
hypercall for just the one value.
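
The two-loop split mentioned above might look roughly like this in a user-space model. The names (struct event, event_read(), group_read()) are hypothetical and the hardware value is just a struct field. The output layout assumed here is the PERF_FORMAT_GROUP one without time or ID fields: the number of events followed by one value per event.

```c
#include <stddef.h>
#include <stdint.h>

struct event {
	uint64_t hw;	/* stands in for the value the PMU would fetch */
	uint64_t count;	/* state updated by ->read() */
};

/* Model of ->read(): it has no return value, it only updates the event. */
static void event_read(struct event *ev)
{
	ev->count = ev->hw;
}

/*
 * Pass 1 updates every event; pass 2 copies the now up-to-date counts
 * into the output buffer. Returns the number of u64 slots written, or
 * 0 if the buffer is too small.
 */
static size_t group_read(struct event *evs, size_t nr, uint64_t *out,
			 size_t cap)
{
	if (cap < nr + 1)
		return 0;

	for (size_t i = 0; i < nr; i++)
		event_read(&evs[i]);

	/* A batching PMU would commit its read transaction at this point. */

	out[0] = nr;
	for (size_t i = 0; i < nr; i++)
		out[1 + i] = evs[i].count;

	return nr + 1;
}
```

Because the second loop reads only state that the first loop already updated, no separate cache of results has to be kept or invalidated.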

2015-02-22 21:04:50

by Cody P Schafer

Subject: Re: [RFC][PATCH] perf: Implement read_group() PMU operation

On Thu, Feb 5, 2015 at 9:59 PM, Sukadev Bhattiprolu
<[email protected]> wrote:
> From: Sukadev Bhattiprolu <[email protected]>
> Date: Thu Feb 5 20:56:20 EST 2015 -0300
> Subject: [RFC][PATCH] perf: Implement read_group() PMU operation
>
> This is a lightly tested, exploratory patch to allow PMUs to return
> several counters at once. Appreciate any comments :-)
>

Back when I was fiddling with this, I started looking into changing
the {start,commit,cancel}_txn to operate on (struct perf_event *)
rather than (struct pmu *), and commit_txn would generate the actual
request & reads based on the perf_event's group (sounds similar but
not identical to what Peter proposed previously).

The key bit I was concerned about was that these "PMUs" aren't
actually physical hw, so it made a bit more sense to pin the grouping
to a group rather than a txn over a PMU.

[Of course, I never did confirm if that actually fit with how perf was
modeling txns]
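
For comparison, a transaction keyed on the group leader rather than on the PMU could be modelled as below. This is a speculative user-space sketch of the idea only: GROUP_MAX, the txn_open flag, and the "counter id holds 10 * id" rule are invented, and nothing here reflects how perf actually models transactions.

```c
#include <stddef.h>
#include <stdint.h>

#define GROUP_MAX 4	/* hypothetical limit for this model only */

struct event {
	int id;
	uint64_t count;
	struct event *siblings[GROUP_MAX];	/* used on group leaders */
	size_t nr_siblings;
	int txn_open;				/* txn state lives in the group */
};

static void start_txn(struct event *leader)
{
	leader->txn_open = 1;
}

/*
 * Build one simulated request from the leader's own group and fill in
 * every member. Returns -1 if no transaction was started on this group.
 */
static int commit_txn(struct event *leader, unsigned int *requests)
{
	if (!leader->txn_open)
		return -1;

	(*requests)++;	/* one request covers the whole group */
	leader->count = 10u * (uint64_t)leader->id;
	for (size_t i = 0; i < leader->nr_siblings; i++)
		leader->siblings[i]->count =
			10u * (uint64_t)leader->siblings[i]->id;

	leader->txn_open = 0;
	return 0;
}
```

Since the state sits in the group, two groups on the same PMU can each have a transaction open without sharing a queue.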