2008-12-04 23:46:26

by Thomas Gleixner

Subject: [patch 2/3] performance counters: documentation

From: Ingo Molnar <[email protected]>

Add more documentation about performance counters.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
Documentation/perf-counters.txt | 104 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 104 insertions(+)

Index: linux/Documentation/perf-counters.txt
===================================================================
--- /dev/null
+++ linux/Documentation/perf-counters.txt
@@ -0,0 +1,104 @@
+
+Performance Counters for Linux
+------------------------------
+
+Performance counters are special hardware registers available on most modern
+CPUs. These registers count the number of certain types of hw events, such
+as instructions executed, cache misses suffered, or branches mispredicted -
+without slowing down the kernel or applications. These registers can also
+trigger interrupts when a threshold number of events have passed - and can
+thus be used to profile the code that runs on that CPU.
+
+The Linux Performance Counter subsystem provides an abstraction of these
+hardware capabilities. It provides per task and per CPU counters, and
+it provides event capabilities on top of those.
+
+Performance counters are accessed via special file descriptors.
+There's one file descriptor per virtual counter used.
+
+The special file descriptor is opened via the perf_counter_open()
+system call:
+
+    int
+    perf_counter_open(u32 hw_event_type,
+                      u32 hw_event_period,
+                      u32 record_type,
+                      pid_t pid,
+                      int cpu);
+
+The syscall returns the new fd. The fd can be used via the normal
+VFS system calls: read() can be used to read the counter, fcntl()
+can be used to set the blocking mode, etc.
+
+Multiple counters can be kept open at a time, and the counters
+can be poll()ed.
+
+When creating a new counter fd, 'hw_event_type' is one of:
+
+    enum hw_event_types {
+        PERF_COUNT_CYCLES,
+        PERF_COUNT_INSTRUCTIONS,
+        PERF_COUNT_CACHE_REFERENCES,
+        PERF_COUNT_CACHE_MISSES,
+        PERF_COUNT_BRANCH_INSTRUCTIONS,
+        PERF_COUNT_BRANCH_MISSES,
+    };
+
+These are standardized types of events that work uniformly on all CPUs
+that implement Performance Counters support under Linux. If a CPU is
+not able to count branch-misses, then the system call will return
+-EINVAL.
+
+[ Note: more hw_event_types are supported as well, but they are CPU
+ specific and are enumerated via /sys on a per CPU basis. Raw hw event
+ types can be passed in as negative numbers. For example, to count
+ "External bus cycles while bus lock signal asserted" events on Intel
+ Core CPUs, pass in a -0x4064 event type value. ]
+
+The parameter 'hw_event_period' is the number of events before waking up
+a read() that is blocked on a counter fd. Zero value means a non-blocking
+counter.
+
+'record_type' is the type of data that a read() will provide for the
+counter, and it can be one of:
+
+    enum perf_record_type {
+        PERF_RECORD_SIMPLE,
+        PERF_RECORD_IRQ,
+    };
+
+A "simple" counter is one that counts hardware events and allows
+them to be read out into a u64 count value. (read() returns 8 on
+a successful read of a simple counter.)
+
+An "irq" counter is one that will also provide IRQ context information:
+the IP of the interrupted context. In this case read() will return
+the 8-byte counter value, plus the Instruction Pointer address of the
+interrupted context.
+
+The 'pid' parameter allows the counter to be specific to a task:
+
+ pid == 0: if the pid parameter is zero, the counter is attached to the
+ current task.
+
+ pid > 0: the counter is attached to a specific task (if the current task
+ has sufficient privilege to do so)
+
+ pid < 0: all tasks are counted (per cpu counters)
+
+The 'cpu' parameter allows a counter to be made specific to a full
+CPU:
+
+ cpu >= 0: the counter is restricted to a specific CPU
+ cpu == -1: the counter counts on all CPUs
+
+Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.
+
+A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts
+events of that task and 'follows' that task to whatever CPU the task
+gets scheduled to. Per task counters can be created by any user, for
+their own tasks.
+
+A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
+all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege.
+
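The argument rules described in the patch above can be sketched in C. This
is only an illustration of the documented semantics, not code from the
patch: the names counter_scope, classify_scope(), is_raw_event() and
raw_event_code() are hypothetical, and the handling of 'cpu < -1' and of
'pid >= 0' combined with 'cpu >= 0' is an assumption, since the text does
not spell those cases out.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>

/* Generic event types, copied from the documentation in the patch. */
enum hw_event_types {
    PERF_COUNT_CYCLES,
    PERF_COUNT_INSTRUCTIONS,
    PERF_COUNT_CACHE_REFERENCES,
    PERF_COUNT_CACHE_MISSES,
    PERF_COUNT_BRANCH_INSTRUCTIONS,
    PERF_COUNT_BRANCH_MISSES,
};

/* Hypothetical classification of a (pid, cpu) pair; not a kernel type. */
enum counter_scope {
    SCOPE_INVALID,  /* e.g. pid == -1 together with cpu == -1 */
    SCOPE_PER_TASK, /* pid >= 0: the counter belongs to one task */
    SCOPE_PER_CPU,  /* pid < 0 and cpu >= 0: all tasks on one CPU */
};

static enum counter_scope classify_scope(pid_t pid, int cpu)
{
    if (cpu < -1)
        return SCOPE_INVALID; /* assumption: only -1 means "all CPUs" */
    if (pid < 0)
        return cpu == -1 ? SCOPE_INVALID : SCOPE_PER_CPU;
    /* Assumption: pid >= 0 with cpu >= 0 is a task counter limited to
     * that CPU; the documentation only describes the cpu == -1 case. */
    return SCOPE_PER_TASK;
}

/* The text says per CPU counters need CAP_SYS_ADMIN. */
static bool needs_cap_sys_admin(enum counter_scope scope)
{
    return scope == SCOPE_PER_CPU;
}

/* Raw, CPU specific event types are passed in as negative numbers. */
static bool is_raw_event(int32_t hw_event_type)
{
    return hw_event_type < 0;
}

/* Recover the raw code; only meaningful when is_raw_event() is true. */
static uint32_t raw_event_code(int32_t hw_event_type)
{
    return (uint32_t)(-(int64_t)hw_event_type);
}
```

With these helpers, classify_scope(0, -1) describes a counter that follows
the current task, and raw_event_code(-0x4064) recovers the raw code used in
the Intel Core example.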


2008-12-05 00:34:22

by Paul Mackerras

Subject: Re: [patch 2/3] performance counters: documentation

Thomas Gleixner writes:

> + enum hw_event_types {
> + PERF_COUNT_CYCLES,
> + PERF_COUNT_INSTRUCTIONS,
> + PERF_COUNT_CACHE_REFERENCES,
> + PERF_COUNT_CACHE_MISSES,
> + PERF_COUNT_BRANCH_INSTRUCTIONS,
> + PERF_COUNT_BRANCH_MISSES,
> + };
> +
> +These are standardized types of events that work uniformly on all CPUs
> +that implements Performance Counters support under Linux. If a CPU is
> +not able to count branch-misses, then the system call will return
> +-EINVAL.
> +
> +[ Note: more hw_event_types are supported as well, but they are CPU
> + specific and are enumerated via /sys on a per CPU basis. Raw hw event
> + types can be passed in as negative numbers. For example, to count
> + "External bus cycles while bus lock signal asserted" events on Intel
> + Core CPUs, pass in a -0x4064 event type value. ]

This is going to be a huge problem, at least on powerpc, because it
means that the kernel will have to know which events can be counted on
which counters and what values need to be put in the control registers
to select them.

The thing is that not all the counters count the same set of events,
or use the same select values when they can count the same events.
For example, on an MPC7450 CPU, you can count L2 cache misses in PMC5
or PMC6. If you're counting them on PMC5 you need to put 19 into the
PMC5 event selector field in the MMCR1 register. But if you're
counting them on PMC6 then you need to put 29 in the PMC6 event
selector field in MMCR1.

Since we don't get to say which counter to use in perf_counter_open,
we would have to pass an abstracted "L2 cache miss" event code and
have that map to 19 or 29 depending on which PMC register we get to
use. But that means that the kernel then has to have the entire table
of countable events for every supported CPU model - something that
perfmon3 manages to keep out of the kernel.

The situation will be even worse with POWER5 and POWER6, where the
event selection logic is very complex, with multiple layers of
multiplexers. I really really don't want the kernel to have to know
about all that.

Basically, what it boils down to is that treating performance monitor
counters as independent units is just not feasible, at least on
powerpc. We really need to be able to deal with the full set of
counters as one thing.

Paul.
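
The MPC7450 example above can be sketched as the kind of per-counter table
the kernel would have to carry. This is a minimal illustration only: the
selector values 19 and 29 come from the message, while abstract_event,
pmc_select and mpc7450_selector() are hypothetical names, and the table is
deliberately limited to the two entries mentioned.

```c
#include <stddef.h>

/* Hypothetical abstract event code; not part of the proposed API. */
enum abstract_event {
    EV_L2_CACHE_MISS,
};

struct pmc_select {
    enum abstract_event event;
    int pmc;      /* counter number, e.g. 5 for PMC5 */
    int selector; /* value for that counter's selector field in MMCR1 */
};

/* Only the two MPC7450 entries quoted in the message. */
static const struct pmc_select mpc7450_table[] = {
    { EV_L2_CACHE_MISS, 5, 19 },
    { EV_L2_CACHE_MISS, 6, 29 },
};

/* Returns the selector value, or -1 if the table has no entry for
 * this event on this counter. */
static int mpc7450_selector(enum abstract_event event, int pmc)
{
    size_t n = sizeof(mpc7450_table) / sizeof(mpc7450_table[0]);

    for (size_t i = 0; i < n; i++) {
        if (mpc7450_table[i].event == event && mpc7450_table[i].pmc == pmc)
            return mpc7450_table[i].selector;
    }
    return -1;
}
```

The same abstract event maps to a different selector depending on which
counter is allocated, so a table like this would be needed for every event
on every supported CPU model.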

2008-12-05 00:38:12

by David Miller

Subject: Re: [patch 2/3] performance counters: documentation

From: Paul Mackerras <[email protected]>
Date: Fri, 5 Dec 2008 11:33:31 +1100

> This is going to be a huge problem, at least on powerpc, because it
> means that the kernel will have to know which events can be counted on
> which counters and what values need to be put in the control registers
> to select them.

Sparc64 is the same.

> The situation will be even worse with POWER5 and POWER6, where the
> event selection logic is very complex, with multiple layers of
> multiplexers. I really really don't want the kernel to have to know
> about all that.

Niagara2 has deep multiplexing and sub-event masking too.

I really appreciated how perfmon kept all of those details
in userspace.

2008-12-05 02:33:18

by Andi Kleen

Subject: Re: [patch 2/3] performance counters: documentation

Paul Mackerras <[email protected]> writes:
>> +
>> +[ Note: more hw_event_types are supported as well, but they are CPU
>> + specific and are enumerated via /sys on a per CPU basis. Raw hw event
>> + types can be passed in as negative numbers. For example, to count
>> + "External bus cycles while bus lock signal asserted" events on Intel
>> + Core CPUs, pass in a -0x4064 event type value. ]
>
> This is going to be a huge problem, at least on powerpc, because it
> means that the kernel will have to know which events can be counted on
> which counters and what values need to be put in the control registers
> to select them.

P4 has similar problems, and to some extent there's also the same
problem on newer Intel CPUs (e.g. with fixed counters and if you
consider PEBS which has some special restrictions)

-Andi

--
[email protected]

2008-12-05 02:49:19

by Arjan van de Ven

Subject: Re: [patch 2/3] performance counters: documentation

On Thu, 04 Dec 2008 16:37:41 -0800 (PST)
David Miller <[email protected]> wrote:

> From: Paul Mackerras <[email protected]>
> Date: Fri, 5 Dec 2008 11:33:31 +1100
>
> > This is going to be a huge problem, at least on powerpc, because it
> > means that the kernel will have to know which events can be counted
> > on which counters and what values need to be put in the control
> > registers to select them.
>
> Sparc64 is the same.
>
> > The situation will be even worse with POWER5 and POWER6, where the
> > event selection logic is very complex, with multiple layers of
> > multiplexers. I really really don't want the kernel to have to know
> > about all that.
>
> Niagara2 has deep multiplexing and sub-event masking too.
>
> I really appreciated how perfmon kept all of those details
> in userspace.

I would like to respectfully disagree with this some. The kernel needs
to abstract hardware to some degree for userspace. The problem in this
case is that userspace can't really do a better job, in fact it can
only do a worse job since it lacks the coordination capability of
knowing it has full control of all the hardware registers.
I am sure the corner cases you're talking about are nasty, I just don't
think they are less nasty when dealt with in userspace. Sure the kernel
might be simpler, but the system as a whole sure is not.



--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-12-05 03:27:06

by David Miller

Subject: Re: [patch 2/3] performance counters: documentation

From: Arjan van de Ven <[email protected]>
Date: Thu, 4 Dec 2008 18:50:02 -0800

> I would like to respectfully disagree with this some. The kernel needs
> to abstract hardware to some degree for userspace. The problem in this
> case is that userspace can't really do a better job, in fact it can
> only do a worse job since it lacks the coordination capability of
> knowing it has full control of all the hardware registers.

The perfmon context abstraction dealt with that. Code using the
perfmon interfaces provided a set of counter and control register
values to the kernel.

The kernel merely loaded and unloaded them when a process (or group of
processes) ran.

The kernel is a validity checker, and that minimal stuff is exactly
what the perfmon kernel component implemented.
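
The division of labour described above, where userspace supplies a full set
of register values and the kernel only checks them, can be sketched roughly
as follows. This is not perfmon's actual interface or data layout: reg_write,
cpu_model, NUM_CTRL_REGS and validate_context() are hypothetical names, and
the per-register writable mask is an assumed form of validity check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CTRL_REGS 4 /* hypothetical number of control registers */

/* One register value supplied by userspace. */
struct reg_write {
    unsigned int index; /* which control register */
    uint64_t value;     /* value userspace wants loaded */
};

/* Hypothetical per-model description: the bits userspace may set. */
struct cpu_model {
    uint64_t writable_mask[NUM_CTRL_REGS];
};

/* The kernel's role in this sketch: reject unknown registers and
 * reserved bits, without knowing what any event selector means. */
static bool validate_context(const struct cpu_model *model,
                             const struct reg_write *writes, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (writes[i].index >= NUM_CTRL_REGS)
            return false;
        if (writes[i].value & ~model->writable_mask[writes[i].index])
            return false;
    }
    return true;
}
```

In this arrangement the event tables stay in userspace; the kernel would
load and unload a validated set of values as the monitored process is
scheduled in and out.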