2009-03-25 08:15:53

by Paul Mackerras

[permalink] [raw]
Subject: [PATCH] perf_counter: allow and require one-page mmap on counting counters

Impact: bug fix

Currently the mmap code requires that the length of the mmap be at least
two pages. That is fine for sampling counters, but for counting
counters the second and subsequent pages are just wasted, since counting
counters don't generate events.

This changes the code to require that the mmap be one page in length
for counting counters, and applies the existing check to sampling
counters.

Signed-off-by: Paul Mackerras <[email protected]>
---
Ingo, please pull this from the master branch of my perfcounters.git
repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/paulus/perfcounters.git master

kernel/perf_counter.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index affe227..0412c7c 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -1362,8 +1362,13 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
vma_size = vma->vm_end - vma->vm_start;
nr_pages = (vma_size / PAGE_SIZE) - 1;

- if (nr_pages == 0 || !is_power_of_2(nr_pages))
- return -EINVAL;
+ if (counter->hw_event.record_type == PERF_RECORD_SIMPLE) {
+ if (nr_pages)
+ return -EINVAL;
+ } else {
+ if (nr_pages == 0 || !is_power_of_2(nr_pages))
+ return -EINVAL;
+ }

if (vma_size != PAGE_SIZE * (1 + nr_pages))
return -EINVAL;
--
1.5.6.3


2009-03-25 08:29:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: allow and require one-page mmap on counting counters


* Paul Mackerras <[email protected]> wrote:

> +++ b/kernel/perf_counter.c
> @@ -1362,8 +1362,13 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
> vma_size = vma->vm_end - vma->vm_start;
> nr_pages = (vma_size / PAGE_SIZE) - 1;
>
> - if (nr_pages == 0 || !is_power_of_2(nr_pages))
> - return -EINVAL;
> + if (counter->hw_event.record_type == PERF_RECORD_SIMPLE) {
> + if (nr_pages)
> + return -EINVAL;
> + } else {
> + if (nr_pages == 0 || !is_power_of_2(nr_pages))
> + return -EINVAL;
> + }

Hm, is_power_of_2() is buggy then as 1 page is a power of two as
well: 1 == 2^0.

Hm, it seems fine:

static inline __attribute__((const))
bool is_power_of_2(unsigned long n)
{
return (n != 0 && ((n & (n - 1)) == 0));
}

that should return true for an input of 1.

What am i missing?

Ingo

2009-03-25 08:34:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: allow and require one-page mmap on counting counters

On Wed, 2009-03-25 at 09:28 +0100, Ingo Molnar wrote:
> * Paul Mackerras <[email protected]> wrote:
>
> > +++ b/kernel/perf_counter.c
> > @@ -1362,8 +1362,13 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
> > vma_size = vma->vm_end - vma->vm_start;
> > nr_pages = (vma_size / PAGE_SIZE) - 1;
> >
> > - if (nr_pages == 0 || !is_power_of_2(nr_pages))
> > - return -EINVAL;
> > + if (counter->hw_event.record_type == PERF_RECORD_SIMPLE) {
> > + if (nr_pages)
> > + return -EINVAL;
> > + } else {
> > + if (nr_pages == 0 || !is_power_of_2(nr_pages))
> > + return -EINVAL;
> > + }
>
> Hm, is_power_of_2() is buggy then as 1 page is a power of two as
> well: 1 == 2^0.
>
> Hm, it seems fine:
>
> static inline __attribute__((const))
> bool is_power_of_2(unsigned long n)
> {
> return (n != 0 && ((n & (n - 1)) == 0));
> }
>
> that should return true for an input of 1.
>
> What am i missing?

nr_pages is the number of data pages.


2009-03-25 08:57:11

by Paul Mackerras

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: allow and require one-page mmap on counting counters

Ingo Molnar writes:

> * Paul Mackerras <[email protected]> wrote:
>
> > +++ b/kernel/perf_counter.c
> > @@ -1362,8 +1362,13 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
> > vma_size = vma->vm_end - vma->vm_start;
> > nr_pages = (vma_size / PAGE_SIZE) - 1;

To answer your question below... this: ^^^

> >
> > - if (nr_pages == 0 || !is_power_of_2(nr_pages))
> > - return -EINVAL;
> > + if (counter->hw_event.record_type == PERF_RECORD_SIMPLE) {
> > + if (nr_pages)
> > + return -EINVAL;
> > + } else {
> > + if (nr_pages == 0 || !is_power_of_2(nr_pages))
> > + return -EINVAL;
> > + }
>
> Hm, is_power_of_2() is buggy then as 1 page is a power of two as
> well: 1 == 2^0.
>
> Hm, it seems fine:
>
> static inline __attribute__((const))
> bool is_power_of_2(unsigned long n)
> {
> return (n != 0 && ((n & (n - 1)) == 0));
> }
>
> that should return true for an input of 1.
>
> What am i missing?
>
> Ingo

We have one page as a header that contains the info for reading the
counter value in userspace plus the head pointer, followed by (for a
sampling counter) 2^N pages of ring buffer.

Paul.

2009-03-25 09:04:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: allow and require one-page mmap on counting counters


* Paul Mackerras <[email protected]> wrote:

> Ingo Molnar writes:
>
> > * Paul Mackerras <[email protected]> wrote:
> >
> > > +++ b/kernel/perf_counter.c
> > > @@ -1362,8 +1362,13 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
> > > vma_size = vma->vm_end - vma->vm_start;
> > > nr_pages = (vma_size / PAGE_SIZE) - 1;
>
> To answer your question below... this: ^^^
>
> > >
> > > - if (nr_pages == 0 || !is_power_of_2(nr_pages))
> > > - return -EINVAL;
> > > + if (counter->hw_event.record_type == PERF_RECORD_SIMPLE) {
> > > + if (nr_pages)
> > > + return -EINVAL;
> > > + } else {
> > > + if (nr_pages == 0 || !is_power_of_2(nr_pages))
> > > + return -EINVAL;
> > > + }
> >
> > Hm, is_power_of_2() is buggy then as 1 page is a power of two as
> > well: 1 == 2^0.
> >
> > Hm, it seems fine:
> >
> > static inline __attribute__((const))
> > bool is_power_of_2(unsigned long n)
> > {
> > return (n != 0 && ((n & (n - 1)) == 0));
> > }
> >
> > that should return true for an input of 1.
> >
> > What am i missing?
> >
> > Ingo
>
> We have one page as a header that contains the info for reading
> the counter value in userspace plus the head pointer, followed by
> (for a sampling counter) 2^N pages of ring buffer.

ah - ok. Morning confusion. (any email from me that comes at single
digit hour local time should be considered fundamentally suspect ;-)

Wouldnt it still be better to keep the symmetry between counting and
sampling counters? In theory we could transit between these stags
and 'switch off' a sampling counter or 'switch on' a counting
counter - via an ioctl or so. Shouldnt counting counters be sampling
counters that were created while disabled temporarily?

Ingo

2009-03-25 09:18:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: allow and require one-page mmap on counting counters

On Wed, 2009-03-25 at 10:03 +0100, Ingo Molnar wrote:
> * Paul Mackerras <[email protected]> wrote:
>
> > Ingo Molnar writes:
> >
> > > * Paul Mackerras <[email protected]> wrote:
> > >
> > > > +++ b/kernel/perf_counter.c
> > > > @@ -1362,8 +1362,13 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
> > > > vma_size = vma->vm_end - vma->vm_start;
> > > > nr_pages = (vma_size / PAGE_SIZE) - 1;
> >
> > To answer your question below... this: ^^^
> >
> > > >
> > > > - if (nr_pages == 0 || !is_power_of_2(nr_pages))
> > > > - return -EINVAL;
> > > > + if (counter->hw_event.record_type == PERF_RECORD_SIMPLE) {
> > > > + if (nr_pages)
> > > > + return -EINVAL;
> > > > + } else {
> > > > + if (nr_pages == 0 || !is_power_of_2(nr_pages))
> > > > + return -EINVAL;
> > > > + }
> > >
> > > Hm, is_power_of_2() is buggy then as 1 page is a power of two as
> > > well: 1 == 2^0.
> > >
> > > Hm, it seems fine:
> > >
> > > static inline __attribute__((const))
> > > bool is_power_of_2(unsigned long n)
> > > {
> > > return (n != 0 && ((n & (n - 1)) == 0));
> > > }
> > >
> > > that should return true for an input of 1.
> > >
> > > What am i missing?
> > >
> > > Ingo
> >
> > We have one page as a header that contains the info for reading
> > the counter value in userspace plus the head pointer, followed by
> > (for a sampling counter) 2^N pages of ring buffer.
>
> ah - ok. Morning confusion. (any email from me that comes at single
> digit hour local time should be considered fundamentally suspect ;-)
>
> Wouldnt it still be better to keep the symmetry between counting and
> sampling counters? In theory we could transit between these stags
> and 'switch off' a sampling counter or 'switch on' a counting
> counter - via an ioctl or so. Shouldnt counting counters be sampling
> counters that were created while disabled temporarily?

I think I initially intended 0 pages to be ok, even for sampling
counters. I just messed up that if stmt.

if (nr_pages != 0 && !is_power_of_2(nr_pages))
return -EINVAL;

would I think, do what I intended.


2009-03-25 09:43:24

by Paul Mackerras

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: allow and require one-page mmap on counting counters

Ingo Molnar writes:

> ah - ok. Morning confusion. (any email from me that comes at single
> digit hour local time should be considered fundamentally suspect ;-)

:)

> Wouldnt it still be better to keep the symmetry between counting and
> sampling counters? In theory we could transit between these stags
> and 'switch off' a sampling counter or 'switch on' a counting
> counter - via an ioctl or so. Shouldnt counting counters be sampling
> counters that were created while disabled temporarily?

Well, the buffer size can already be changed on the fly, by unmapping
the counter and remapping. So, shouldn't I be allowed to select a
zero-sized ring buffer at the times when I'm not sampling, i.e. when
it's a counting counter?

And here's something else that is semi-related: the PAPI guys want a
kind of counter that counts until it overflows, and then sends a
signal to the process and disables itself (and the whole group it's
in). The signal handler can then record whatever application-specific
information is interesting, re-enable the counter and return. It
seems to be a way to do profiling where what you're recording is
something specific to the program rather than generic things like the
instruction pointer.

So that could be a case where we want a sampling counter that doesn't
generate any event records in the kernel, and so a 0-sized ring buffer
could be appropriate.

Paul.

2009-03-25 11:48:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: allow and require one-page mmap on counting counters

On Wed, 2009-03-25 at 19:15 +1100, Paul Mackerras wrote:
> Impact: bug fix
>
> Currently the mmap code requires that the length of the mmap be at least
> two pages. That is fine for sampling counters, but for counting
> counters the second and subsequent pages are just wasted, since counting
> counters don't generate events.
>
> This changes the code to require that the mmap be one page in length
> for counting counters, and applies the existing check to sampling
> counters.

Does the below work for you Paul?

---
Subject: perf_counter: allow one-page mmap on counters
From: Peter Zijlstra <[email protected]>
Date: Wed Mar 25 12:44:16 CET 2009

A brainfart stopped single page mmap()s working. The rest of the code
should be perfectly fine with not having any data pages.

Reported-by: Paul Mackerras <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/perf_counter.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -1369,7 +1369,11 @@ static int perf_mmap(struct file *file,
vma_size = vma->vm_end - vma->vm_start;
nr_pages = (vma_size / PAGE_SIZE) - 1;

- if (nr_pages == 0 || !is_power_of_2(nr_pages))
+ /*
+ * If we have data pages ensure they're a power-of-two number, so we
+ * can do bitmasks instead of modulo.
+ */
+ if (nr_pages != 0 && !is_power_of_2(nr_pages))
return -EINVAL;

if (vma_size != PAGE_SIZE * (1 + nr_pages))

2009-03-25 11:53:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: allow and require one-page mmap on counting counters


* Peter Zijlstra <[email protected]> wrote:

> On Wed, 2009-03-25 at 19:15 +1100, Paul Mackerras wrote:
> > Impact: bug fix
> >
> > Currently the mmap code requires that the length of the mmap be at least
> > two pages. That is fine for sampling counters, but for counting
> > counters the second and subsequent pages are just wasted, since counting
> > counters don't generate events.
> >
> > This changes the code to require that the mmap be one page in length
> > for counting counters, and applies the existing check to sampling
> > counters.
>
> Does the below work for you Paul?
>
> ---
> Subject: perf_counter: allow one-page mmap on counters
> From: Peter Zijlstra <[email protected]>
> Date: Wed Mar 25 12:44:16 CET 2009
>
> A brainfart stopped single page mmap()s working. The rest of the code
> should be perfectly fine with not having any data pages.
>
> Reported-by: Paul Mackerras <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> kernel/perf_counter.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/kernel/perf_counter.c
> ===================================================================
> --- linux-2.6.orig/kernel/perf_counter.c
> +++ linux-2.6/kernel/perf_counter.c
> @@ -1369,7 +1369,11 @@ static int perf_mmap(struct file *file,
> vma_size = vma->vm_end - vma->vm_start;
> nr_pages = (vma_size / PAGE_SIZE) - 1;
>
> - if (nr_pages == 0 || !is_power_of_2(nr_pages))
> + /*
> + * If we have data pages ensure they're a power-of-two number, so we
> + * can do bitmasks instead of modulo.
> + */
> + if (nr_pages != 0 && !is_power_of_2(nr_pages))
> return -EINVAL;

Yeah - this is cleaner.

Ingo

2009-03-25 12:09:35

by Peter Zijlstra

[permalink] [raw]
Subject: [tip:perfcounters/core] perf_counter: allow and require one-page mmap on counting counters

Commit-ID: bb0ad4e22511e7caa715362945ad0766c7cca37e
Gitweb: http://git.kernel.org/tip/bb0ad4e22511e7caa715362945ad0766c7cca37e
Author: Peter Zijlstra <[email protected]>
AuthorDate: Wed, 25 Mar 2009 12:48:31 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 25 Mar 2009 13:02:54 +0100

perf_counter: allow and require one-page mmap on counting counters

A brainfart stopped single page mmap()s working. The rest of the code
should be perfectly fine with not having any data pages.

Reported-by: Paul Mackerras <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <1237981712.7972.812.camel@twins>
Signed-off-by: Ingo Molnar <[email protected]>


---
kernel/perf_counter.c | 6 +++++-
1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index f3e1b27..95e0257 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -1369,7 +1369,11 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
vma_size = vma->vm_end - vma->vm_start;
nr_pages = (vma_size / PAGE_SIZE) - 1;

- if (nr_pages == 0 || !is_power_of_2(nr_pages))
+ /*
+ * If we have data pages ensure they're a power-of-two number, so we
+ * can do bitmasks instead of modulo.
+ */
+ if (nr_pages != 0 && !is_power_of_2(nr_pages))
return -EINVAL;

if (vma_size != PAGE_SIZE * (1 + nr_pages))

2009-03-31 19:15:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: allow and require one-page mmap on counting counters

On Wed, 2009-03-25 at 20:42 +1100, Paul Mackerras wrote:
>
> And here's something else that is semi-related: the PAPI guys want a
> kind of counter that counts until it overflows, and then sends a
> signal to the process and disables itself (and the whole group it's
> in).

I tried doing this this evening, and its remarkably hard. Disabling a
counter relies on reading the time, and taking ctx->lock and such.
Things that are impossible to do in NMI context.

Furthermore, some of the software counters (those that use hrtimers)
wait for the completion of the handler (hrtimer_cancel) which would
deadlock when tried from the handler.

Some of these things could be hacked around, others less so, but all in
all it looked remarkably hard for something that sounds so simple.


2009-04-01 02:33:26

by Paul Mackerras

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: allow and require one-page mmap on counting counters

Peter Zijlstra writes:

> On Wed, 2009-03-25 at 20:42 +1100, Paul Mackerras wrote:
> >
> > And here's something else that is semi-related: the PAPI guys want a
> > kind of counter that counts until it overflows, and then sends a
> > signal to the process and disables itself (and the whole group it's
> > in).
>
> I tried doing this this evening, and its remarkably hard. Disabling a
> counter relies on reading the time, and taking ctx->lock and such.
> Things that are impossible to do in NMI context.

So, if I have a group where the leader is a hardware counter set to
use NMIs, and there is a task_clock software counter in the group,
don't we hit exactly the same issue with reading the time?

I'd be OK with saying that you can't use stop-and-signal with NMI
counters. There will still be some issues on powerpc because of our
lazy interrupt disabling scheme, so some work might have to get
deferred until we soft-enable interrupts, but we have a way to manage
that.

On another topic, I noticed that we have a race with perf_counter_read
where we do the IPI but don't check in __read() on the destination cpu
that the task we're after is still running on that cpu. It needs
checking and retry logic like we have in other places in
perf_counter.c.

Paul.

2009-04-01 08:13:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: allow and require one-page mmap on counting counters

On Wed, 2009-04-01 at 13:32 +1100, Paul Mackerras wrote:
> Peter Zijlstra writes:
>
> > On Wed, 2009-03-25 at 20:42 +1100, Paul Mackerras wrote:
> > >
> > > And here's something else that is semi-related: the PAPI guys want a
> > > kind of counter that counts until it overflows, and then sends a
> > > signal to the process and disables itself (and the whole group it's
> > > in).
> >
> > I tried doing this this evening, and its remarkably hard. Disabling a
> > counter relies on reading the time, and taking ctx->lock and such.
> > Things that are impossible to do in NMI context.
>
> So, if I have a group where the leader is a hardware counter set to
> use NMIs, and there is a task_clock software counter in the group,
> don't we hit exactly the same issue with reading the time?

Hmm, I think you're right there. Nasty.

> I'd be OK with saying that you can't use stop-and-signal with NMI
> counters.

Right, so let them use regular IRQs, yes, that will work. Let me code
that.

> There will still be some issues on powerpc because of our
> lazy interrupt disabling scheme, so some work might have to get
> deferred until we soft-enable interrupts, but we have a way to manage
> that.

Hmm, you're saying ppc always uses NMIs, even when !hw_event.nmi?

> On another topic, I noticed that we have a race with perf_counter_read
> where we do the IPI but don't check in __read() on the destination cpu
> that the task we're after is still running on that cpu. It needs
> checking and retry logic like we have in other places in
> perf_counter.c.

Yep, you're right again. Should we perhaps generalize that whole code
and provide a method vector for the various bits. That way we could
collapse all that code replication.

This TODO list keeps growing :-)

BTW, how's progress with the lazy switching?

2009-04-01 09:32:28

by Paul Mackerras

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: allow and require one-page mmap on counting counters

Peter Zijlstra writes:

> > There will still be some issues on powerpc because of our
> > lazy interrupt disabling scheme, so some work might have to get
> > deferred until we soft-enable interrupts, but we have a way to manage
> > that.
>
> Hmm, you're saying ppc always uses NMIs, even when !hw_event.nmi?

The PMU interrupts are always handled the same way at the moment,
which is to say they're handled when they come in whether or not
interrupts are soft-enabled. The way things work is that
local_irq_disable etc. only soft-disable interrupts (clear a per-cpu
interrupts enabled flag in memory). Interrupts don't get
hard-disabled (i.e. we don't clear the CPU's hardware interrupt enable
flag) until we actually get an interrupt. A PMU interrupt can't be
taken while interrupts are hard-disabled but it can be taken while
interrupts are soft-disabled, so from that point of view it's an NMI.

The way that interrupts get hard-disabled is that the interrupt entry
code for regular interrupts and timer interrupts (but not PMU
interrupts) first checks the soft-enable flag. If that flag is clear,
i.e. interrupts are soft-disabled, it clears the copy of the CPU
interrupt enable in the saved state from the interrupt and returns,
meaning that the code that was interrupted gets resumed with its
hardware interrupt enable bit clear. That works because the CPU
doesn't automatically ack the interrupt with the interrupt controller
when it takes the interrupt, so the interrupt request will persist and
the CPU will take the interrupt again once the hardware interrupt
enable bit is set.

So in the period between soft-disabling and hard-disabling interrupts
it's possible to get a PMU interrupt. If there are things we want to
do that we can't if interrupts were soft-disabled, we will have to
defer them to the point where interrupts get soft-enabled again.

> BTW, how's progress with the lazy switching?

Umm, I've been focussing more on things that affect the ABI at this
stage, I'm afraid, plus I was waiting for the dust to settle a bit on
the changes you're making.

Paul.