2010-06-24 14:41:05

by Peter Zijlstra

Subject: [RFC][PATCH 00/11] perf pmu interface -v2

These patches prepare the perf code for multiple pmus (no user
interface yet, Lin Ming is working on that). These patches remove all
weak functions and rework the struct pmu interface.

The patches are boot tested on x86_64 and compile tested on: powerpc
(!fsl, what config is that?), sparc and arm (sorry no SH compiler)

On x86 perf stat and record with both software and hardware events still worked.

I changed all (hardware) pmu implementations, so there's a fair chance I messed
something up; please have a look at it.

If people agree with this, I'll continue with the per-pmu context stuff I
talked about last time, which should get us the hardware write batching back.


2010-06-25 11:11:57

by Will Deacon

Subject: Re: [RFC][PATCH 00/11] perf pmu interface -v2

Hi Peter,

On Thu, 2010-06-24 at 15:28 +0100, Peter Zijlstra wrote:
> These patches prepare the perf code for multiple pmus (no user
> interface yet, Lin Ming is working on that). These patches remove all
> weak functions and rework the struct pmu interface.
>
> The patches are boot tested on x86_64 and compile tested on: powerpc
> (!fsl, what config is that?), sparc and arm (sorry no SH compiler)
>

I tried to test these for ARM, but I get compilation errors. Maybe you
forgot to enable perf in the ARM .config? Anyway, the compiler log is:

CC arch/arm/kernel/perf_event.o
arch/arm/kernel/perf_event.c: In function 'validate_event':
arch/arm/kernel/perf_event.c:342: error: 'pmu' undeclared (first use in this function)
arch/arm/kernel/perf_event.c:342: error: (Each undeclared identifier is reported only once
arch/arm/kernel/perf_event.c:342: error: for each function it appears in.)
arch/arm/kernel/perf_event.c: In function 'armpmu_event_init':
arch/arm/kernel/perf_event.c:536: warning: return makes integer from pointer without a cast
arch/arm/kernel/perf_event.c: In function 'init_hw_perf_events':
arch/arm/kernel/perf_event.c:3037: error: expected ';' before 'return'
arch/arm/kernel/perf_event.c:3038: warning: no return statement in function returning non-void
make[1]: *** [arch/arm/kernel/perf_event.o] Error 1
make: *** [arch/arm/kernel] Error 2

I think there's a missing semicolon, an ERR_PTR that should be a PTR_ERR
and an ordering issue with the pmu struct.

Cheers,

Will

2010-06-25 11:17:45

by Peter Zijlstra

Subject: Re: [RFC][PATCH 00/11] perf pmu interface -v2

On Fri, 2010-06-25 at 12:11 +0100, Will Deacon wrote:
> Hi Peter,
>
> On Thu, 2010-06-24 at 15:28 +0100, Peter Zijlstra wrote:
> > These patches prepare the perf code for multiple pmus (no user
> > interface yet, Lin Ming is working on that). These patches remove all
> > weak functions and rework the struct pmu interface.
> >
> > The patches are boot tested on x86_64 and compile tested on: powerpc
> > (!fsl, what config is that?), sparc and arm (sorry no SH compiler)
> >
>
> I tried to test these for ARM, but I get compilation errors. Maybe you
> forgot to enable perf in the ARM .config? Anyway, the compiler log is:
>
> CC arch/arm/kernel/perf_event.o
> arch/arm/kernel/perf_event.c: In function 'validate_event':
> arch/arm/kernel/perf_event.c:342: error: 'pmu' undeclared (first use in this function)
> arch/arm/kernel/perf_event.c:342: error: (Each undeclared identifier is reported only once
> arch/arm/kernel/perf_event.c:342: error: for each function it appears in.)
> arch/arm/kernel/perf_event.c: In function 'armpmu_event_init':
> arch/arm/kernel/perf_event.c:536: warning: return makes integer from pointer without a cast
> arch/arm/kernel/perf_event.c: In function 'init_hw_perf_events':
> arch/arm/kernel/perf_event.c:3037: error: expected ';' before 'return'
> arch/arm/kernel/perf_event.c:3038: warning: no return statement in function returning non-void
> make[1]: *** [arch/arm/kernel/perf_event.o] Error 1
> make: *** [arch/arm/kernel] Error 2
>
> I think there's a missing semicolon, an ERR_PTR that should be a PTR_ERR
> and an ordering issue with the pmu struct.

I seem to have lost a refresh before sending the emails, please check:

git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-perf.git perf-pmu

I pushed out updated patches there.

2010-06-25 14:37:09

by Will Deacon

Subject: Re: [RFC][PATCH 00/11] perf pmu interface -v2

Hello,

On Fri, 2010-06-25 at 12:16 +0100, Peter Zijlstra wrote:
> I seem to have lost a refresh before sending the emails, please check:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-perf.git perf-pmu
>
> I pushed out updated patches there.

Ok, I rebuilt my kernel and perf tools from that tree and tested it on a
quad-core ARMv7 board. Per-task counters appear to work (software and
hardware) but pinned hardware counters always return 0:


root@will-lucid-nfs:~# perf stat -a -e cs -e cycles -e instructions -- ls
linux-2.6 tmp

 Performance counter stats for 'ls':

             33  context-switches
              0  cycles
              0  instructions             #      0.000 IPC

    0.028572009  seconds time elapsed


It's odd if only ARM is affected in this way. Do pinned events still work
for other people?

Will

2010-06-25 14:50:34

by Peter Zijlstra

Subject: Re: [RFC][PATCH 00/11] perf pmu interface -v2

On Fri, 2010-06-25 at 15:36 +0100, Will Deacon wrote:
> Hello,
>
> On Fri, 2010-06-25 at 12:16 +0100, Peter Zijlstra wrote:
> > I seem to have lost a refresh before sending the emails, please check:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-perf.git perf-pmu
> >
> > I pushed out updated patches there.
>
> Ok, I rebuilt my Kernel and perf tools from that tree and tested it on a
> quad-core ARMv7 board.

Ooh, neat toy ;-)

> Per-task counters appear to work (software and
> hardware) but pinned hardware counters always return 0:
>
>
> root@will-lucid-nfs:~# perf stat -a -e cs -e cycles -e instructions -- ls
> linux-2.6 tmp
>
> Performance counter stats for 'ls':
>
> 33 context-switches
> 0 cycles
> 0 instructions # 0.000 IPC
>
> 0.028572009 seconds time elapsed
>
>
> It's odd if only ARM is affected in this way. Do pinned events still work
> for other people?

/me goes build that exact tree on his x86_64.. and gets:

# perf stat -a -e cs -e cycles -e instructions -- ls > /dev/null

 Performance counter stats for 'ls':

             51  context-switches
       24963513  cycles
        9225808  instructions             #      0.370 IPC

    0.002389051  seconds time elapsed


Not exactly sure how I could have messed up the ARM architecture code to
make this happen though... will have a peek.

2010-06-26 11:22:42

by Matt Fleming

Subject: Re: [RFC][PATCH 00/11] perf pmu interface -v2

On Thu, 24 Jun 2010 16:28:04 +0200, Peter Zijlstra <[email protected]> wrote:
> These patches prepare the perf code for multiple pmus (no user
> interface yet, Lin Ming is working on that). These patches remove all
> weak functions and rework the struct pmu interface.
>
> The patches are boot tested on x86_64 and compile tested on: powerpc
> (!fsl, what config is that?), sparc and arm (sorry no SH compiler)

Hi Peter,

I tried your SH changes and I needed to apply this patch to get it to
compile. I haven't run the code yet, but I'll do that later today.

---

diff --git a/arch/sh/kernel/perf_event.c b/arch/sh/kernel/perf_event.c
index 50ff852..6bfbaec 100644
--- a/arch/sh/kernel/perf_event.c
+++ b/arch/sh/kernel/perf_event.c
@@ -251,7 +251,7 @@ static void sh_pmu_del(struct perf_event *event, int flags)
perf_event_update_userpage(event);
}

-static int sh_pmu_enable(struct perf_event *event, int flags)
+static int sh_pmu_add(struct perf_event *event, int flags)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
@@ -283,7 +283,7 @@ static void sh_pmu_read(struct perf_event *event)
sh_perf_event_update(event, &event->hw, event->hw.idx);
}

-static in sh_pmu_event_init(struct perf_event *event)
+static int sh_pmu_event_init(struct perf_event *event)
{
int err = __hw_perf_event_init(event);
if (unlikely(err)) {
@@ -345,15 +345,15 @@ sh_pmu_notifier(struct notifier_block *self, unsigned long action, void *hcpu)
return NOTIFY_OK;
}

-int __cpuinit register_sh_pmu(struct sh_pmu *pmu)
+int __cpuinit register_sh_pmu(struct sh_pmu *_pmu)
{
if (sh_pmu)
return -EBUSY;
- sh_pmu = pmu;
+ sh_pmu = _pmu;

- pr_info("Performance Events: %s support registered\n", pmu->name);
+ pr_info("Performance Events: %s support registered\n", _pmu->name);

- WARN_ON(pmu->num_events > MAX_HWEVENTS);
+ WARN_ON(_pmu->num_events > MAX_HWEVENTS);

perf_pmu_register(&pmu);
perf_cpu_notifier(sh_pmu_notifier);

2010-06-26 16:22:40

by Corey Ashford

Subject: Re: [RFC][PATCH 00/11] perf pmu interface -v2

On 06/24/2010 07:28 AM, Peter Zijlstra wrote:
> These patches prepare the perf code for multiple pmus (no user
> interface yet, Lin Ming is working on that). These patches remove all
> weak functions and rework the struct pmu interface.
>
> The patches are boot tested on x86_64 and compile tested on: powerpc
> (!fsl, what config is that?), sparc and arm (sorry no SH compiler)
>
> On x86 perf stat and record with both software and hardware events still worked.
>
> I changed all (hardware) pmu implementations so there's a fair chance I messed
> something up, please have a look at it.
>
> If people agree with this, I'll continue with the per-pmu context stuff I
> talked about last time, which should get us the hardware write batching back.

Hi Peter,

These patches look like they are really taking us in the right
direction. Thanks for all your effort on this!

As for the "hardware write batching", can you describe a bit more about
what you have in mind there? I wonder if this might have something to
do with accounting for PMU hardware which is slow to access, for
example, via I2C via an internal bridge.

Cheers,

- Corey

2010-06-28 15:13:56

by Peter Zijlstra

Subject: Re: [RFC][PATCH 00/11] perf pmu interface -v2

On Sat, 2010-06-26 at 09:22 -0700, Corey Ashford wrote:

> These patches look like they are really taking us in the right
> direction. Thanks for all your effort on this!

Yeah, although I seem to have managed to wreck all architectures tested
so far (I found something iffy on x86 too); I guess I need to look at
things carefully.

> As for the "hardware write batching", can you describe a bit more about
> what you have in mind there? I wonder if this might have something to
> do with accounting for PMU hardware which is slow to access, for
> example, via I2C via an internal bridge.

Right, so the write batching is basically delaying writing out the PMU
state to hardware until pmu::pmu_enable() time. It avoids having to
re-program the hardware when, due to a scheduling constraint, we have to
move counters around.

So say, we context switch a task, and remove the old events and add the
new ones under a single pmu::pmu_disable()/::pmu_enable() pair, we will
only hit the hardware twice (once to disable, once to enable), instead
of for each individual ::del()/::add().

For this to work we need to have an association between a context and a
pmu, otherwise it's very hard to know which pmu to disable/enable; the
alternative is disabling all of them, which isn't very attractive.

Then again, it doesn't make sense to have task-counters on an I2C
attached PMU simply because writing to the PMU could cause context
switches.

2010-06-30 17:19:35

by Corey Ashford

Subject: Re: [RFC][PATCH 00/11] perf pmu interface -v2

On 06/28/2010 08:13 AM, Peter Zijlstra wrote:
> On Sat, 2010-06-26 at 09:22 -0700, Corey Ashford wrote:
> As for the "hardware write batching", can you describe a bit more about
>> what you have in mind there? I wonder if this might have something to
>> do with accounting for PMU hardware which is slow to access, for
>> example, via I2C via an internal bridge.
>
> Right, so the write batching is basically delaying writing out the PMU
> state to hardware until pmu::pmu_enable() time. It avoids having to
> re-program the hardware when, due to a scheduling constraint, we have to
> move counters around.
>
> So say, we context switch a task, and remove the old events and add the
> new ones under a single pmu::pmu_disable()/::pmu_enable() pair, we will
> only hit the hardware twice (once to disable, once to enable), instead
> of for each individual ::del()/::add().
>
> For this to work we need to have an association between a context and a
> pmu, otherwise its very hard to know what pmu to disable/enable; the
> alternative is all of them which isn't very attractive.
>
> Then again, it doesn't make sense to have task-counters on an I2C
> attached PMU simply because writing to the PMU could cause context
> switches.

Thanks for your reply.

In our case, writing to some of the nest PMUs' control registers is done
via an internal bridge. We write to a specific memory address and an
internal MMIO-to-SCOM bridge (SCOM is similar to I2C) translates it to
serial and sends it over the internal serial bus. The address we write
to is based upon the control register's serial bus address, plus an
offset from the base of MMIO-to-SCOM bridge. The same process works for
reads.

While it does not cause a context switch because there are no IO drivers
to go through, it will take several thousand CPU cycles to complete,
which by the same token, still makes them inappropriate for
task-counters (not to mention that the nest units operate asynchronously
from the CPU).

However, there are still concerns about writing these control registers
from an interrupt handler because of the latency that will be incurred,
however slowly we choose to do the event rotation. So at least for the
Wire-Speed processor, we may need a worker thread of some sort to hand
off the work to.

Our current code, based on linux 2.6.31 (soon to be 2.6.32) doesn't use
worker threads; we are just taking the latency hit for now.

- Corey

2010-06-30 18:11:40

by Peter Zijlstra

Subject: Re: [RFC][PATCH 00/11] perf pmu interface -v2

On Wed, 2010-06-30 at 10:19 -0700, Corey Ashford wrote:
> However, there still are concerns relative to writing these control
> registers from an interrupt handler because of the latency that will be
> incurred, however slow we choose to do the event rotation. So at least
> for the Wire-Speed processor, we may need a worker thread of some sort
> to hand off the work to.

Right, once we have per-pmu contexts we can look at having means to
over-ride the means of rotation, such that the pmu can optionally drive
it itself instead of through the common tick.

2010-07-01 14:36:24

by Peter Zijlstra

Subject: Re: [RFC][PATCH 00/11] perf pmu interface -v2

On Fri, 2010-06-25 at 16:50 +0200, Peter Zijlstra wrote:

> Not exactly sure how I could have messed up the ARM architecture code to
> make this happen though... will have a peek.

I did find a bug in there, not sure it could have been responsible for
this but who knows...

Pushed out a new git tree with the below delta folded in.

---
Index: linux-2.6/arch/arm/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/arch/arm/kernel/perf_event.c
+++ linux-2.6/arch/arm/kernel/perf_event.c
@@ -221,27 +221,6 @@ again:
}

static void
-armpmu_del(struct perf_event *event, int flags)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- struct hw_perf_event *hwc = &event->hw;
- int idx = hwc->idx;
-
- WARN_ON(idx < 0);
-
- clear_bit(idx, cpuc->active_mask);
- armpmu->disable(hwc, idx);
-
- barrier();
-
- armpmu_event_update(event, hwc, idx);
- cpuc->events[idx] = NULL;
- clear_bit(idx, cpuc->used_mask);
-
- perf_event_update_userpage(event);
-}
-
-static void
armpmu_read(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
@@ -267,6 +246,8 @@ armpmu_stop(struct perf_event *event, in
*/
if (!(hwc->state & PERF_HES_STOPPED)) {
armpmu->disable(hwc, hwc->idx);
+ barrier(); /* why? */
+ armpmu_event_update(event, hwc, hwc->idx);
hwc->state |= PERF_HES_STOPPED | PERF_HES_UPTODATE;
}
}
@@ -289,7 +270,7 @@ armpmu_start(struct perf_event *event, i
hwc->state = 0;
/*
* Set the period again. Some counters can't be stopped, so when we
- * were throttled we simply disabled the IRQ source and the counter
+ * were stopped we simply disabled the IRQ source and the counter
* may have been left counting. If we don't do this step then we may
* get an interrupt too soon or *way* too late if the overflow has
* happened since disabling.
@@ -298,6 +279,23 @@ armpmu_start(struct perf_event *event, i
armpmu->enable(hwc, hwc->idx);
}

+static void
+armpmu_del(struct perf_event *event, int flags)
+{
+ struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+ struct hw_perf_event *hwc = &event->hw;
+ int idx = hwc->idx;
+
+ WARN_ON(idx < 0);
+
+ clear_bit(idx, cpuc->active_mask);
+ armpmu_stop(event, PERF_EF_UPDATE);
+ cpuc->events[idx] = NULL;
+ clear_bit(idx, cpuc->used_mask);
+
+ perf_event_update_userpage(event);
+}
+
static int
armpmu_add(struct perf_event *event, int flags)
{
@@ -1075,7 +1073,7 @@ armv6pmu_handle_irq(int irq_num,
continue;

if (perf_event_overflow(event, 0, &data, regs))
- armpmu_stop(event, 0);
+ armpmu->disable(hwc, idx);
}

/*

2010-07-01 15:02:48

by Peter Zijlstra

Subject: Re: [RFC][PATCH 00/11] perf pmu interface -v2

On Thu, 2010-07-01 at 16:36 +0200, Peter Zijlstra wrote:
>
> Pushed out a new git tree with the below delta folded in.
>
Said tree also cures x86 and folds the SH build fix.

Matt, you said it broke SH completely, but did you try perf stat? perf
record is not supposed to work on SH due to the hardware not having an
overflow interrupt.

Which made me think, what on SH guarantees we update the counter often
enough not to suffer from counter wrap? Would it make sense to make the
SH code hook into their arch tick handler and update the counters from
there?

2010-07-01 15:31:15

by Matt Fleming

Subject: Re: [RFC][PATCH 00/11] perf pmu interface -v2

On Thu, Jul 01, 2010 at 05:02:35PM +0200, Peter Zijlstra wrote:
>
> Matt, you said it broke SH completely, but did you try perf stat? perf
> record is not supposed to work on SH due to the hardware not having an
> overflow interrupt.

perf record does work to some degree. It definitely worked before
applying your changes but not after. I admit I haven't really read the
perf event code, but Paul will know.

> Which made me think, what on SH guarantees we update the counter often
> enough not to suffer from counter wrap? Would it make sense to make the
> SH code hook into their arch tick handler and update the counters from
> there?

This was the way that the oprofile code used to work. Paul and I were
talking about using a hrtimer to sample performance counters as
opposed to piggy-backing on the tick handler.

2010-07-01 15:40:06

by Peter Zijlstra

Subject: Re: [RFC][PATCH 00/11] perf pmu interface -v2

On Thu, 2010-07-01 at 16:31 +0100, Matt Fleming wrote:
> On Thu, Jul 01, 2010 at 05:02:35PM +0200, Peter Zijlstra wrote:
> >
> > Matt, you said it broke SH completely, but did you try perf stat? perf
> > record is not supposed to work on SH due to the hardware not having an
> > overflow interrupt.
>
> perf record does work to some degree. It definitely worked before
> applying your changes but not after. I admit I haven't really read the
> perf event code, but Paul will know.

Ok, let me look at that again.

> > Which made me think, what on SH guarantees we update the counter often
> > enough not to suffer from counter wrap? Would it make sense to make the
> > SH code hook into their arch tick handler and update the counters from
> > there?
>
> This was the way that the oprofile code used to work. Paul and I were
> talking about using a hrtimer to sample performance counters as
> opposed to piggy-backing on the tick handler.

Ah, for sampling for sure, simply group a software perf event and a
hardware perf event together and use PERF_SAMPLE_READ.

But suppose it's a non-sampling counter: how do you avoid overflows of
the hardware register?

2010-07-01 16:04:59

by Matt Fleming

Subject: Re: [RFC][PATCH 00/11] perf pmu interface -v2

On Thu, Jul 01, 2010 at 05:39:53PM +0200, Peter Zijlstra wrote:
> On Thu, 2010-07-01 at 16:31 +0100, Matt Fleming wrote:
> > On Thu, Jul 01, 2010 at 05:02:35PM +0200, Peter Zijlstra wrote:
> > >
> > > Which made me think, what on SH guarantees we update the counter often
> > > enough not to suffer from counter wrap? Would it make sense to make the
> > > SH code hook into their arch tick handler and update the counters from
> > > there?
> >
> > This was the way that the oprofile code used to work. Paul and I were
> > talking about using a hrtimer to sample performance counters as
> > opposed to piggy-backing on the tick handler.
>
> Ah, for sampling for sure, simply group a software perf event and a
> hardware perf event together and use PERF_SAMPLE_READ.
>
> But suppose its a non sampling counter, how do you avoid overflows of
> the hardware register?

Hmm.. good question! I'm not entirely sure we do. As you were saying,
without using the arch tick handler, I don't think we can guarantee
avoiding counter overflows. Currently the counters are chained such
that they are at least 48 bits. I guess all my tests were short enough
not to cause the counters to wrap ;-)

At some point we will want to not require chaining, giving us 32 bits.
So yeah, this issue is going to crop up then. Interestingly, the
counters on SH don't wrap when they reach their maximum value; they
just stop incrementing.

2010-07-02 02:57:58

by Paul Mundt

Subject: Re: [RFC][PATCH 00/11] perf pmu interface -v2

On Thu, Jul 01, 2010 at 05:39:53PM +0200, Peter Zijlstra wrote:
> On Thu, 2010-07-01 at 16:31 +0100, Matt Fleming wrote:
> > On Thu, Jul 01, 2010 at 05:02:35PM +0200, Peter Zijlstra wrote:
> > >
> > > Matt, you said it broke SH completely, but did you try perf stat? perf
> > > record is not supposed to work on SH due to the hardware not having an
> > > overflow interrupt.
> >
> > perf record does work to some degree. It definitely worked before
> > applying your changes but not after. I admit I haven't really read the
> > perf event code, but Paul will know.
>
> Ok, let me look at that again.
>
Any perf record functionality observed is entirely coincidental and not
by design. It was something I planned to revisit, but most of what we
have right now is only geared at the one-shot perf stat case.

> > > Which made me think, what on SH guarantees we update the counter often
> > > enough not to suffer from counter wrap? Would it make sense to make the
> > > SH code hook into their arch tick handler and update the counters from
> > > there?
> >
> > This was the way that the oprofile code used to work. Paul and I were
> > talking about using a hrtimer to sample performance counters as
> > opposed to piggy-backing on the tick handler.
>
> Ah, for sampling for sure, simply group a software perf event and a
> hardware perf event together and use PERF_SAMPLE_READ.
>
> But suppose its a non sampling counter, how do you avoid overflows of
> the hardware register?

At the moment it's not an issue since we have big enough counters that
overflows don't really happen, especially if we're primarily using them
for one-shot measuring.

SH-4A style counters behave in such a fashion that we have 2 general
purpose counters, and 2 counters for measuring bus transactions. These
bus counters can optionally be disabled and used in a chained mode to
provide the general-purpose counters with a 64-bit counter (the actual
validity in the upper half of the chained counter varies depending on the
CPUs, but all of them can do at least 48-bits when chained).

Each counter has overflow detection and asserts an overflow bit, but
there are no exceptions associated with this, so it's something that we
would have to tie in to the tick or defer to a bottom half handler in the
non-sampling case (or simply test on every read, and accept some degree
of accuracy loss). Any perf record functionality we implement with this
sort of scheme is only going to provide ballpark figures anyways, so it's
certainly within the parameters of acceptable loss in exchange for
increased functionality.

Different CPUs also implement their overflows differently, some will roll
and resume counting, but most simply stop until the overflow bit is
cleared.

My main plan was to build on top of the multi-pmu stuff, unchain the
counters, and expose the bus counters with their own event map as a
separate PMU instance. All of the other handling logic can pretty much be
reused directly, but it does mean that we need to be a bit smarter about
overflow detection/handling. Sampling and so on is also on the TODO list,
but is as of yet still not supported.

2010-07-02 09:52:18

by Peter Zijlstra

Subject: Re: [RFC][PATCH 00/11] perf pmu interface -v2

On Fri, 2010-07-02 at 11:57 +0900, Paul Mundt wrote:
> At the moment it's not an issue since we have big enough counters that
> overflows don't really happen, especially if we're primarily using them
> for one-shot measuring.
>
> SH-4A style counters behave in such a fashion that we have 2 general
> purpose counters, and 2 counters for measuring bus transactions. These
> bus counters can optionally be disabled and used in a chained mode to
> provide the general purpose counters a 64-bit counter (the actual
> validity in the upper half of the chained counter varies depending on the
> CPUs, but all of them can do at least 48-bits when chained).

Right, so I was reading some of that code and I couldn't actually find
where you keep consistency between the hardware counter value and the
stored prev_count value.

That is, suppose I'm counting, the hardware starts at 0, hwc->prev_count
= 0 and event->count = 0.

At some point x, we context switch this task away, so we ->disable(),
which disables the counter and updates the values, so at that time
hwc->prev = x and event->count = x, right?

Now suppose we schedule the task back in, so we do ->enable(), then what
happens? sh_pmu_enable() finds an unused index (disables it for some
reason... it should already be cleared if it's not used, but I guess a few
extra hardware writes don't hurt) and calls sh4a_pmu_enable() on it.

sh4a_pmu_enable() does 3 writes:

PPC_PMCAT -- does this clear the counter value?
PPC_CCBR -- writes the ->config bits
PPC_CCBR (adds CCBR_DUC, couldn't this be done in the
previous write to this reg?)

Now assuming that enable does indeed clear the hardware counter value,
shouldn't you also set hwc->prev_count to 0 again? Otherwise the next
update will see a massive jump?

Alternatively you could write the hwc->prev_count value back to the
register.

If you eventually want to drop the chained counter support I guess it
would make sense to have sh_perf_event_update() read and clear the
counter so that you're always 0 based and then enforce an update from
the arch tick handler so you never overflow.

2010-07-02 12:57:15

by Will Deacon

Subject: Re: [RFC][PATCH 00/11] perf pmu interface -v2

Hi Peter,

On Thu, 2010-07-01 at 15:36 +0100, Peter Zijlstra wrote:
> On Fri, 2010-06-25 at 16:50 +0200, Peter Zijlstra wrote:
>
> > Not exactly sure how I could have messed up the ARM architecture code to
> > make this happen though... will have a peek.
>
> I did find a bug in there, not sure it could have been responsible for
> this but who knows...
>
> Pushed out a new git tree with the below delta folded in.
>
I had a look at this yesterday and discovered a bug in the ARM
backend code, which I've posted a patch for to ALKML:

http://lists.infradead.org/pipermail/linux-arm-kernel/2010-July/019461.html

Unfortunately, with this applied and your latest changes I still
get 0 from pinned hardware counters:

# perf stat -r 5 -e cycles -e instructions -e cs -e faults -e branches -a -- git status

 Performance counter stats for 'git status' (5 runs):

              0  cycles                   ( +-     nan% )
              0  instructions             #      0.000 IPC   ( +-     nan% )
          88447  context-switches         ( +- 12.624% )
          13647  page-faults              ( +-  0.015% )
              0  branches                 ( +-     nan% )

The changes you've made to arch/arm/kernel/perf_event.c
look sane. If I get some time I'll try and dig deeper.

Will

2010-07-05 11:15:09

by Paul Mundt

Subject: Re: [RFC][PATCH 00/11] perf pmu interface -v2

On Fri, Jul 02, 2010 at 11:52:03AM +0200, Peter Zijlstra wrote:
> Right, so I was reading some of that code and I couldn't actually find
> where you keep consistency between the hardware counter value and the
> stored prev_count value.
>
> That is, suppose I'm counting, the hardware starts at 0, hwc->prev_count
> = 0 and event->count = 0.
>
> At some point, x we context switch this task away, so we ->disable(),
> which disables the counter and updates the values, so at that time
> hwc->prev = x and event->count = x, right?
>
> Now suppose we schedule the task back in, so we do ->enable(), then what
> happens? sh_pmu_enable() finds an unused index, (disables it for some
> reason.. it should already be cleared if its not used, but I guess a few
> extra hardware writes dont hurt) and calls sh4a_pmu_enable() on it.
>
I don't quite remember where the ->disable() came from, I vaguely recall
copying it from one of the other architectures, but it could have just
been a remnant of something I had for debug code. In any event, you're
correct, we don't seem to need it anymore.

> sh4a_pmu_enable() does 3 writes:
>
> PPC_PMCAT -- does this clear the counter value?

Yes, the counters themselves are read-only, so clearing is done through
the PMCAT control register.

> PPC_CCBR -- writes the ->config bits
> PPC_CCBR (adds CCBR_DUC, couldn't this be done in the
> previous write to this reg?)
>
No, the DUC bit needs to be set by itself or the write is discarded on
some CPUs. Clearing it with other bits is fine, however. This is what
starts the counter running.

> Now assuming that enable does indeed clear the hardware counter value,
> shouldn't you also set hwc->prev_count to 0 again? Otherwise the next
> update will see a massive jump?
>
I think that's a correct observation, but I'm having difficulty verifying
it on my current board since it seems someone moved the PMCAT register,
as the counters aren't being cleared on this particular CPU. I'll test on
the board I wrote this code for initially tomorrow and see how that goes.
It did used to work fine at least.

> Alternatively you could write the hwc->prev_count value back to the
> register.
>
That would be an option if the counters weren't read-only, yes.

> If you eventually want to drop the chained counter support I guess it
> would make sense to have sh_perf_event_update() read and clear the
> counter so that you're always 0 based and then enforce an update from
> the arch tick hander so you never overflow.
>
Yes, I'd thought about that too. I'll give it a go once I find out where
the other half of my registers disappeared to. As it is, it seems my bat
and I have an appointment to make.

2010-07-08 11:13:53

by Peter Zijlstra

Subject: Re: [RFC][PATCH 00/11] perf pmu interface -v2

On Thu, 2010-07-01 at 17:39 +0200, Peter Zijlstra wrote:
>
> Ah, for sampling for sure, simply group a software perf event and a
> hardware perf event together and use PERF_SAMPLE_READ.

So the idea is to sample using a software event (periodic timer of
sorts, maybe randomize it) and weight its samples by the hardware event
deltas.

Suppose you have a workload consisting of two main parts:

my_important_work()
{
        load_my_data();
        compute_me_silly();
}

Now, let's assume that both these functions take the same time to
complete for each part of the work. In that case a periodic timer
generates samples that are about 50/50 distributed between these two
functions.

Now, let us further assume that load_my_data() is so slow because it's
missing all the caches and compute_me_silly() is slow because it's
defeating the branch predictor.

So what we want to end up with, is that when we sample for cache-misses
we get load_my_data() as the predominant function, not a nice 50/50
relation. Idem for branch misses and compute_me_silly().

By weighting the samples by the hw counter delta we get this: if we
assume that the sampling frequency is not a harmonic of the runtime of
these functions, then statistics will do the right thing.

It basically generates a massive skid on the sample, but as long as most
of the samples end up hitting the right function we're good. For a
periodic workload like:
while (lots) { my_important_work() }
that is even true for period > function_runtime with the exception of
that harmonic thing. For less neat workloads like:
while (lots) { my_important_work(); other_random_things(); }
This is not guaranteed to work unless period < function_runtime.

Clearly we cannot attribute anything to the actual instruction hit due
to the massive skid, but we can (possibly) say something about the
function based on these statistical rules.

2010-07-08 11:20:04

by Ingo Molnar

Subject: Re: [RFC][PATCH 00/11] perf pmu interface -v2


* Peter Zijlstra <[email protected]> wrote:

> On Thu, 2010-07-01 at 17:39 +0200, Peter Zijlstra wrote:
> >
> > Ah, for sampling for sure, simply group a software perf event and a
> > hardware perf event together and use PERF_SAMPLE_READ.
>
> So the idea is to sample using a software event (periodic timer of sorts,
> maybe randomize it) and weight its samples by the hardware event deltas.
>
> Suppose you have a workload consisting of two main parts:
>
> my_important_work()
> {
> load_my_data();
> compute_me_silly();
> }
>
> Now, lets assume that both these functions take the same time to complete
> for each part of work. In that case a periodic timer generate samples that
> are about 50/50 distributed between these two functions.
>
> Now, let us further assume that load_my_data() is so slow because its
> missing all the caches and compute_me_silly() is slow because its defeating
> the branch predictor.
>
> So what we want to end up with, is that when we sample for cache-misses we
> get load_my_data() as the predominant function, not a nice 50/50 relation.
> Idem for branch misses and compute_me_silly().
>
> By weighting the samples by the hw counter delta we get this, if we assume
> that the sampling frequency is not a harmonic of the runtime of these
> functions, then statistics will dtrt.

Yes.

And if the platform code implements this then the tooling side already takes
care of it - even if the CPU itself cannot generate interrupts based on, say,
cache misses or branches (but can measure them via counts).

The only situation where statistics will not do the right thing is when the
likelihood of the sample tick significantly correlates with the likelihood of
the workload itself executing. Timer-dominated workloads would be an example.

Real hrtimers are sufficiently tick-less to solve most of these artifacts in
practice.

Ingo

2010-07-18 19:37:41

by Matt Fleming

Subject: Re: [RFC][PATCH 00/11] perf pmu interface -v2

On Thu, 8 Jul 2010 13:19:36 +0200, Ingo Molnar <[email protected]> wrote:
>
> And if the platform code implements this then the tooling side already takes
> care of it - even if the CPU itself cannot geneate interrupts based on say
> cachemisses or branches (but can measure them via counts).

I finally got a bit of time to work on this. I'm confused about how the
tools will take care of this at the moment. How much support currently
exists for these sorts of weighted samples? The only thing I can find
that looks related is PERF_SAMPLE_READ, but I don't think that's exactly
what we need.

Could someone point me in the right direction?