2009-03-17 05:43:42

by Paul Mackerras

[permalink] [raw]
Subject: [PATCH/RFC 2/2] perfcounters: add an mmap method to allow userspace to read hardware counters

Impact: new feature giving performance improvement

This adds the ability for userspace to do an mmap on a hardware counter
fd and get access to a read-only page that contains the information
needed to translate a hardware counter value to the full 64-bit
counter value that would be returned by a read on the fd. This is
useful on architectures that allow user programs to read the hardware
counters, such as PowerPC.

The mmap will only succeed if the counter is a hardware counter
monitoring the current process.

On my quad 2.5GHz PowerPC 970MP machine, userspace can read a counter
and translate it to the full 64-bit value in about 30ns using the
mmapped page, compared to about 830ns for the read syscall on the
counter, so this does give a significant performance improvement.

Signed-off-by: Paul Mackerras <[email protected]>
---
arch/powerpc/kernel/perf_counter.c | 6 +++
include/linux/perf_counter.h | 15 +++++++
kernel/perf_counter.c | 78 ++++++++++++++++++++++++++++++++++++
3 files changed, 99 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kernel/perf_counter.c b/arch/powerpc/kernel/perf_counter.c
index 5008762..44271b6 100644
--- a/arch/powerpc/kernel/perf_counter.c
+++ b/arch/powerpc/kernel/perf_counter.c
@@ -417,6 +417,8 @@ void hw_perf_restore(u64 disable)
atomic64_set(&counter->hw.prev_count, val);
counter->hw.idx = hwc_index[i] + 1;
write_pmc(counter->hw.idx, val);
+ if (counter->user_page)
+ perf_counter_update_userpage(counter);
}
mb();
cpuhw->mmcr[0] |= MMCR0_PMXE | MMCR0_FCECE;
@@ -572,6 +574,8 @@ static void power_perf_disable(struct perf_counter *counter)
ppmu->disable_pmc(counter->hw.idx - 1, cpuhw->mmcr);
write_pmc(counter->hw.idx, 0);
counter->hw.idx = 0;
+ if (counter->user_page)
+ perf_counter_update_userpage(counter);
break;
}
}
@@ -732,6 +736,8 @@ static void record_and_restart(struct perf_counter *counter, long val,
write_pmc(counter->hw.idx, val);
atomic64_set(&counter->hw.prev_count, val);
atomic64_set(&counter->hw.period_left, left);
+ if (counter->user_page)
+ perf_counter_update_userpage(counter);

/*
* Finally record data if requested.
diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index 08c11a6..3f9309f 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -100,6 +100,17 @@ struct perf_counter_hw_event {
#define PERF_COUNTER_IOC_ENABLE _IO('$', 0)
#define PERF_COUNTER_IOC_DISABLE _IO('$', 1)

+/*
+ * Structure of the page that can be mapped via mmap
+ */
+struct perf_counter_mmap_page {
+ __u32 version; /* version number of this structure */
+ __u32 compat_version; /* lowest version this is compat with */
+ __u32 lock; /* seqlock for synchronization */
+ __u32 index; /* hardware counter identifier */
+ __s64 offset; /* add to hardware counter value */
+};
+
#ifdef __KERNEL__
/*
* Kernel-internal data types and definitions:
@@ -214,6 +225,9 @@ struct perf_counter {
int oncpu;
int cpu;

+ /* pointer to page shared with userspace via mmap */
+ unsigned long user_page;
+
/* read() / irq related data */
wait_queue_head_t waitq;
/* optional: for NMIs */
@@ -289,6 +303,7 @@ extern int perf_counter_task_enable(void);
extern int hw_perf_group_sched_in(struct perf_counter *group_leader,
struct perf_cpu_context *cpuctx,
struct perf_counter_context *ctx, int cpu);
+extern void perf_counter_update_userpage(struct perf_counter *counter);

/*
* Return 1 for a software counter, 0 for a hardware counter
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index 7c62b93..a440b9e 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -1165,6 +1165,7 @@ static int perf_release(struct inode *inode, struct file *file)
mutex_unlock(&counter->mutex);
mutex_unlock(&ctx->mutex);

+ free_page(counter->user_page);
call_rcu(&counter->rcu_head, free_counter_rcu);
put_context(ctx);

@@ -1331,12 +1332,87 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
return err;
}

+void perf_counter_update_userpage(struct perf_counter *counter)
+{
+ struct perf_counter_mmap_page *userpg;
+
+ if (!counter->user_page)
+ return;
+ userpg = (struct perf_counter_mmap_page *) counter->user_page;
+
+ ++userpg->lock;
+ smp_wmb();
+ userpg->index = counter->hw.idx;
+ userpg->offset = atomic64_read(&counter->count);
+ if (counter->state == PERF_COUNTER_STATE_ACTIVE)
+ userpg->offset -= atomic64_read(&counter->hw.prev_count);
+ smp_wmb();
+ ++userpg->lock;
+}
+
+static int perf_mmap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct perf_counter *counter = vma->vm_file->private_data;
+
+ if (!counter->user_page)
+ return VM_FAULT_SIGBUS;
+
+ vmf->page = virt_to_page(counter->user_page);
+ get_page(vmf->page);
+ return 0;
+}
+
+static struct vm_operations_struct perf_mmap_vmops = {
+ .fault = perf_mmap_fault,
+};
+
+static int perf_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct perf_counter *counter = file->private_data;
+ unsigned long userpg;
+
+ if (!(vma->vm_flags & VM_SHARED) || (vma->vm_flags & VM_WRITE))
+ return -EINVAL;
+ if (vma->vm_end - vma->vm_start != PAGE_SIZE)
+ return -EINVAL;
+
+ /*
+ * For now, restrict to the case of a hardware counter
+ * on the current task.
+ */
+ if (is_software_counter(counter) || counter->task != current)
+ return -EINVAL;
+
+ userpg = counter->user_page;
+ if (!userpg) {
+ userpg = get_zeroed_page(GFP_KERNEL);
+ mutex_lock(&counter->mutex);
+ if (counter->user_page) {
+ free_page(userpg);
+ userpg = counter->user_page;
+ } else {
+ counter->user_page = userpg;
+ }
+ mutex_unlock(&counter->mutex);
+ if (!userpg)
+ return -ENOMEM;
+ }
+
+ perf_counter_update_userpage(counter);
+
+ vma->vm_flags &= ~VM_MAYWRITE;
+ vma->vm_flags |= VM_RESERVED;
+ vma->vm_ops = &perf_mmap_vmops;
+ return 0;
+}
+
static const struct file_operations perf_intr_fops = {
.release = perf_release,
.read = perf_read_irq_data,
.poll = perf_poll,
.unlocked_ioctl = perf_ioctl,
.compat_ioctl = perf_ioctl,
+ .mmap = perf_mmap,
};

static const struct file_operations perf_value_fops = {
@@ -1345,6 +1421,7 @@ static const struct file_operations perf_value_fops = {
.poll = perf_poll,
.unlocked_ioctl = perf_ioctl,
.compat_ioctl = perf_ioctl,
+ .mmap = perf_mmap,
};

static const struct file_operations perf_clone_fops = {
@@ -1353,6 +1430,7 @@ static const struct file_operations perf_clone_fops = {
.poll = perf_poll,
.unlocked_ioctl = perf_ioctl,
.compat_ioctl = perf_ioctl,
+ .mmap = perf_mmap,
};

/*
--
1.5.5.rc3.7.gba13


2009-03-17 07:38:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH/RFC 2/2] perfcounters: add an mmap method to allow userspace to read hardware counters

On Tue, 2009-03-17 at 16:42 +1100, Paul Mackerras wrote:
> Impact: new feature giving performance improvement
>
> This adds the ability for userspace to do an mmap on a hardware counter
> fd and get access to a read-only page that contains the information
> needed to translate a hardware counter value to the full 64-bit
> counter value that would be returned by a read on the fd. This is
> useful on architectures that allow user programs to read the hardware
> counters, such as PowerPC.
>
> The mmap will only succeed if the counter is a hardware counter
> monitoring the current process.
>
> On my quad 2.5GHz PowerPC 970MP machine, userspace can read a counter
> and translate it to the full 64-bit value in about 30ns using the
> mmapped page, compared to about 830ns for the read syscall on the
> counter, so this does give a significant performance improvement.

While I think mmap'ed counters is a great idea, I really dont like this
patch. It adds a second output format unrelated to the regular output
format, and it doesn't appear to honor the regular output rules either.
PERF_RECORD_GROUP thingies won't work for example.

Nor is there any kind of queuing, one might want to have multiple events
in the mmap buffer..

I was planning to do this after cleaning up the normal output bits, as
our current output stuff is a mess:
- its spread out over arch code (seems daft to me, we should all output
the same)
- its useless for pretty much anything but the two apps we currently
have

In particular, it lacks the tid information for sampled data I hinted to
in the previous email.

Furthermore, in order to reliably profile userspace we need mmap
information in the output stream as well.


2009-03-17 08:27:59

by Paul Mackerras

[permalink] [raw]
Subject: Re: [PATCH/RFC 2/2] perfcounters: add an mmap method to allow userspace to read hardware counters

Peter Zijlstra writes:

> While I think mmap'ed counters is a great idea, I really dont like this
> patch. It adds a second output format unrelated to the regular output
> format, and it doesn't appear to honor the regular output rules either.
> PERF_RECORD_GROUP thingies won't work for example.
>
> Nor is there any kind of queuing, one might want to have multiple events
> in the mmap buffer..

I think you have misunderstood. This is not about sampling counters
*at all*. This is about simple counting counters.

On powerpc, userspace can read the hardware counters directly. This
stuff lets a program that is counting hardware events on itself do
that and translate the result into a full 64-bit value. The
information the program needs in order to do this is (a) which
hardware counter (if any) has been assigned to this particular
perf_counter and (b) what the offset between the hardware counter
value and the full 64-bit perf_counter value is. That, plus a
seqlock-style lock, is what's in the mmapped page.

> I was planning to do this after cleaning up the normal output bits, as
> our current output stuff is a mess:
> - its spread out over arch code (seems daft to me, we should all output
> the same)
> - its useless for pretty much anything but the two apps we currently
> have
>
> In particular, it lacks the tid information for sampled data I hinted to
> in the previous email.

Ingo has talked about reusing some of the tracing infrastructure for
reporting perf_counter events to userspace. That sounds like an
excellent idea to me, and that is why I didn't bother with putting the
event queue into the mmapped page at this stage. If it makes sense to
add it, it can be added later.

Paul.

2009-03-17 08:42:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH/RFC 2/2] perfcounters: add an mmap method to allow userspace to read hardware counters

On Tue, 2009-03-17 at 19:27 +1100, Paul Mackerras wrote:
> Peter Zijlstra writes:
>
> > While I think mmap'ed counters is a great idea, I really dont like this
> > patch. It adds a second output format unrelated to the regular output
> > format, and it doesn't appear to honor the regular output rules either.
> > PERF_RECORD_GROUP thingies won't work for example.
> >
> > Nor is there any kind of queuing, one might want to have multiple events
> > in the mmap buffer..
>
> I think you have misunderstood. This is not about sampling counters
> *at all*. This is about simple counting counters.

I think I did indeed.

> On powerpc, userspace can read the hardware counters directly. This
> stuff lets a program that is counting hardware events on itself do
> that and translate the result into a full 64-bit value. The
> information the program needs in order to do this is (a) which
> hardware counter (if any) has been assigned to this particular
> perf_counter and (b) what the offset between the hardware counter
> value and the full 64-bit perf_counter value is. That, plus a
> seqlock-style lock, is what's in the mmapped page.

Ah, right. I think some of the intel chips can do similar things with
rdpmc instructions.

> > I was planning to do this after cleaning up the normal output bits, as
> > our current output stuff is a mess:
> > - its spread out over arch code (seems daft to me, we should all output
> > the same)
> > - its useless for pretty much anything but the two apps we currently
> > have
> >
> > In particular, it lacks the tid information for sampled data I hinted to
> > in the previous email.
>
> Ingo has talked about reusing some of the tracing infrastructure for
> reporting perf_counter events to userspace. That sounds like an
> excellent idea to me, and that is why I didn't bother with putting the
> event queue into the mmapped page at this stage. If it makes sense to
> add it, it can be added later.

Yeah, I've been looking into that, but so far I'm a bit at a loss, all
that tracing stuff is per-cpu, and that's massive overkill for us, since
we're dealing with single cpu streams.

One worry though, supposedly we want to mmap() such buffers too at some
point, how would that interact with that you proposed?

2009-03-17 09:24:53

by Paul Mackerras

[permalink] [raw]
Subject: Re: [PATCH/RFC 2/2] perfcounters: add an mmap method to allow userspace to read hardware counters

Peter Zijlstra writes:

> Ah, right. I think some of the intel chips can do similar things with
> rdpmc instructions.

Ah OK, I didn't know that. Cool.

> > Ingo has talked about reusing some of the tracing infrastructure for
> > reporting perf_counter events to userspace. That sounds like an
> > excellent idea to me, and that is why I didn't bother with putting the
> > event queue into the mmapped page at this stage. If it makes sense to
> > add it, it can be added later.
>
> Yeah, I've been looking into that, but so far I'm a bit at a loss, all
> that tracing stuff is per-cpu, and that's massive overkill for us, since
> we're dealing with single cpu streams.

I almost certainly know less about the tracing infrastructure than
you. :) Per-cpu buffers does indeed sound like overkill for what we
want - though we might actually want per-cpu buffers if we have
inherited sampling counters and a deep process hierarchy.

> One worry though, supposedly we want to mmap() such buffers too at some
> point, how would that interact with that you proposed?

I think we just have separate pages for the event ring buffer and the
info for reading the counter in userspace, and then there isn't any
interaction. We can do that either by having two fds and mmapping on
each, or by using the file offset argument to mmap.

Paul.