Hello,
As suggested by Andrew, I wrote a quick overview of the perfmon2 interface
that we have implemented on several architectures now. The goal of this
introduction is to give you an idea of the key features of the interface.
------------------------------------------------------------------------------
===========================================================
----------------------------------------------------
A quick overview of the perfmon2 interface for Linux
----------------------------------------------------
Copyright (c) 2005 Hewlett-Packard Development Company, L.P.
Contributed by Stephane Eranian <[email protected]>
===========================================================
I/ INTRODUCTION
------------
The goal of the perfmon2 interface is to provide access to the hardware
performance counters present in all modern processors.
The interface is designed to be built into the kernel, very generic, flexible and
extensible. It is not designed to support a single application or a
small class of monitoring tools. The goal is to avoid fragmentation
where you have one tool using one interface. Because we want the
interface to be an integral part of the kernel, special care is taken
to make it robust and secure. The interface is uniform across all
hardware platforms, i.e., it offers the same level of software
functionalities on each platform. The nature of the captured data
depends solely on the capabilities of the underlying hardware.
Although, by nature, the Performance Monitoring Unit (PMU) of each
processor architecture can be quite different, it is possible to
extrapolate a common hardware interface on which we can build a
powerful kernel interface. All modern PMUs are implemented using a
register interface. Two types of registers are typically present:
configuration registers (PMC) and data registers (PMD). As such, the
interface is simply exporting read/write operations on those registers.
A minimal set of software abstractions is added, such as the notion of
a perfmon context which is used to encapsulate the PMU state.
The same interface provides support for per-thread AND system-wide
measurements. For each mode, it is possible to collect simple counts
or create full sampling measurements.
Sampling is supported at the user level and also at the kernel level with
the ability to use a kernel-level sampling buffer. The format of the kernel
sampling buffer is implemented by a kernel pluggable module. As such it is
very easy to develop a custom format without any modification to the
interface or its implementation.
To compensate for limitations of many PMUs, such as a small number of
counters, the interface also exports the notion of event sets and allows
applications to multiplex sets, thereby allowing more events to be
measured in a single run than there are actual counters.
The PMU register description is implemented by a kernel pluggable module
thereby allowing new hardware to be supported without the need to wait
for the next release of the kernel. The description table supports virtual
PMD registers which can be tied to any kernel or hardware resource.
II/ BASE INTERFACE
---------------
PMU registers are accessed by reading/writing PMC and PMD registers.
The interface exposes a logical view of the PMU. The logical PMD/PMC
registers are mapped onto the actual PMU registers (be they PMD/PMC or
actual MSRs) by the kernel. The mapping table is implemented by a kernel
module and can thus easily be updated without a kernel recompile. This
mechanism also makes it easy to add new PMU support inside a processor
family. The mapping is exposed to users via a file in /proc
(/proc/perfmon_map).
The interface is implemented using a system call interface rather than
a device driver. There are several reasons for this choice, the most
important being that we do want to support per-thread monitoring and
that requires access to the context switch code of the kernel to
save/restore the PMU state. Another reason is to reinforce the fact
that the interface must be an integral part of the kernel. Lastly, we
think it gives us more flexibility in terms of how arguments can be
passed to/from the kernel.
Whenever possible, the interface leverages existing kernel mechanisms.
As such, we use a file descriptor to identify a perfmon context.
The interface defines the following set of system calls:
- int pfm_create_context(pfarg_ctx_t *ctx, void *smpl_arg, size_t smpl_size)
creates a per-thread or system-wide perfmon context. It returns a
file descriptor that uniquely identifies the context. The regular
file descriptor semantics w.r.t. access control and sharing are
supported.
- pfm_write_pmds(int fd, pfarg_pmd_t *pmds, int n)
Write one or more PMD registers.
- pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
Read one or more PMD registers.
- pfm_write_pmcs(int fd, pfarg_pmc_t *pmcs, int n)
Write one or more PMC registers.
- pfm_load_context(int fd, pfarg_load_t *load)
Attach a perfmon context to either a thread or a processor. In the
case of a thread, the thread id is passed. In the case of a
processor, the context is bound to the CPU on which the call is
performed.
- pfm_start(int fd, pfarg_start_t *start)
Start active monitoring.
- pfm_stop(int fd)
Stop active monitoring.
- pfm_restart(int fd)
Resume monitoring after a user level overflow notification. This
call is used in conjunction with kernel-level sampling.
- pfm_create_evtsets(int fd, pfarg_setdesc_t *sets, int n)
Create/Modify one or more PMU event sets. Each set encapsulates the
full PMU state.
- pfm_delete_evtsets(int fd, pfarg_setdesc_t *sets, int n)
Delete a PMU event set. It is possible to delete one or more sets
in a single call.
- pfm_getinfo_evtsets(int fd, pfarg_setinfo_t *infos, int n):
Return information about an event set. It is possible to get
information about one or more sets in a single call. The call
returns, for instance, the number of times a set has been activated,
i.e., loaded onto the actual PMU.
- pfm_unload_context(int fd)
Detach a context from a thread or a processor.
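To make the call sequence concrete, here is a minimal sketch of a
self-monitoring per-thread measurement using the calls above. The field
names (reg_num, reg_value, load_pid) and the event encoding are
illustrative assumptions only; the actual structure layouts are defined
by the perfmon2 user header and the event encodings come from a user
level library.

    /* Minimal sketch of a self-monitoring per-thread measurement.
     * Field names and the PMC encoding below are assumptions for
     * illustration; consult the perfmon2 header for the real layouts. */
    #include <stdint.h>
    #include <unistd.h>

    int count_one_event(uint64_t event_encoding)
    {
            pfarg_ctx_t  ctx  = { 0 };
            pfarg_pmc_t  pmc  = { 0 };
            pfarg_pmd_t  pmd  = { 0 };
            pfarg_load_t load = { 0 };
            int fd;

            fd = pfm_create_context(&ctx, NULL, 0);  /* per-thread context */
            if (fd < 0)
                    return -1;

            pmc.reg_num   = 0;                /* logical PMC0 */
            pmc.reg_value = event_encoding;   /* PMU-specific event selection */
            pfm_write_pmcs(fd, &pmc, 1);

            pmd.reg_num   = 0;                /* counter paired with PMC0 */
            pmd.reg_value = 0;                /* start counting from zero */
            pfm_write_pmds(fd, &pmd, 1);

            load.load_pid = getpid();         /* attach to this thread */
            pfm_load_context(fd, &load);

            pfm_start(fd, NULL);              /* NULL: default start arguments (assumed) */
            /* ... code being measured ... */
            pfm_stop(fd);

            pfm_read_pmds(fd, &pmd, 1);       /* 64-bit virtualized count in reg_value */
            pfm_unload_context(fd);
            close(fd);
            return 0;
    }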
By default, all counters are exported as 64-bit registers even when the
underlying hardware implements fewer bits. This makes it much easier for
applications that are doing event-based sampling because they don't
need to worry about the width of counters. It is possible to turn the
"virtualization" off.
A system-wide context allows a tool to monitor all activities on one
processor, i.e., across all threads. Full system-wide monitoring in an
SMP system is achieved by creating and binding a perfmon context on
each processor. By construction, a perfmon context can only be bound
to one processor at a time. This design choice is motivated by the
desire to enforce locality and to simplify the kernel implementation.
Multiple per-thread contexts can coexist at the same time on a system.
Multiple system-wide contexts can co-exist as long as they do not monitor the
same set of processors. The existing implementation does not allow
per-thread and system-wide contexts to exist at the same time. This
restriction is not inherent to the interface but comes from the
existing implementation.
3/ SAMPLING SUPPORT
----------------
The interface supports event-based sampling (EBS), where the sampling
period is determined by the number of occurrences of an event rather
than by time. Note that time-based sampling (TBS) can be emulated by
using an event with some correlation to time.
EBS is based on the ability of the PMU to generate an interrupt when
a counter overflows. All modern PMUs support this mode.
Because counters are virtualized to 64 bits, a sampling period p
is set up by writing the PMD to 2^{64} - p, i.e., -p in two's-complement
arithmetic.
The interface itself does not have the notion of a sampling period; it only
manipulates PMD values. When a counter overflows, it is possible
for a tool to request a notification. By construction, the interface
supports as many sampling periods as there are counters on the host
PMU, making it possible to overlap distinct sampling measurements in
one run. The notification can be requested per counter.
The notification is sent as a message and can be extracted by invoking
read() on the file descriptor of the context. The overflow message
contains information about the overflow, such as the index of the
overflowed PMD.
The interface supports using select/poll on context file descriptors.
Similarly, it is possible to get an asynchronous notification via SIGIO
using the regular sequence of fcntl().
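As an illustration, the standard fcntl() sequence and a sketch of
draining one overflow message follow. The pfm_msg_t type is an
assumption standing in for the actual message structure defined by the
perfmon2 header.

    /* Sketch: arm SIGIO delivery on the context file descriptor and
     * drain one overflow message. pfm_msg_t is a stand-in name. */
    #include <fcntl.h>
    #include <unistd.h>

    static void arm_sigio(int ctx_fd)
    {
            fcntl(ctx_fd, F_SETOWN, getpid());                        /* route SIGIO to us */
            fcntl(ctx_fd, F_SETFL, fcntl(ctx_fd, F_GETFL) | O_ASYNC); /* enable async notification */
    }

    static void drain_one_message(int ctx_fd)
    {
            pfm_msg_t msg;

            if (read(ctx_fd, &msg, sizeof(msg)) == (ssize_t) sizeof(msg)) {
                    /* inspect the message, e.g. which PMD overflowed, then resume */
                    pfm_restart(ctx_fd);
            }
    }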
By default, during a notification, monitoring is masked, i.e., nothing
is captured. A tool uses the pfm_restart() call to resume monitoring.
It is possible to request that, on overflow notification, the monitored
thread be blocked. By default, it keeps on running with monitoring
masked. Blocking is not supported in system-wide mode nor when a thread
is self-monitoring.
4/ SAMPLING BUFFER SUPPORT
-----------------------
User level sampling works but is quite expensive, especially for
non-self-monitoring threads. To minimize the overhead, the interface also
supports a kernel level sampling buffer. The idea is simple: on overflow
the kernel records a sample, and only when the buffer becomes full is the
user level notification generated. Thus, we amortize the cost of the
notification by calling back the user only when lots of samples are available.
This is not such a new idea; it is present in OProfile and perfctr.
However, the interface needs to remain generic and flexible. If
the sampling buffer is in the kernel, its format and what gets recorded
become somewhat locked down by the implementation. Every tool has different
needs. For instance, a tool such as Oprofile may want to aggregate
samples in the kernel, while others such as VTUNE want to record all samples
sequentially. Furthermore, some tools may want to combine in each sample
PMU information with other kernel level information, such as a kernel
call stack. It is hard to design a generic buffer format
that can handle all possible requests. Instead, the interface provides
an infrastructure in which the buffer format is implemented by a kernel
module. Each module controls what gets recorded, how it is recorded,
how the information is exported to the user, and when a 'buffer full'
notification must be sent. The perfmon core has an interface to
dynamically register new formats. Each format is uniquely identified by
a 128-bit UUID which is passed by the tool when the context is created.
Arguments for the buffer format are also passed during this call.
As part of the infrastructure, the interface provides a buffer allocation
and remapping service to the buffer format modules. A format may use this
service when the context is created. The kernel memory is reserved
and the tool gains access to the buffer by remapping it
using the mmap() system call. This provides an efficient way of
exporting samples without the need to copy large amounts of data
between the kernel and user space. A format may, however, choose to export
its samples another way, such as via a device driver interface.
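As a small sketch, and assuming the default format is used, remapping
the buffer looks roughly like this; the exact protection and mapping
flags accepted are determined by the format module.

    /* Sketch: remap the kernel sampling buffer into user space.
     * buf_size must correspond to what was requested from the buffer
     * format at pfm_create_context() time (an assumption here). */
    #include <sys/mman.h>
    #include <stddef.h>

    static void *map_sampling_buffer(int ctx_fd, size_t buf_size)
    {
            void *buf = mmap(NULL, buf_size, PROT_READ, MAP_PRIVATE, ctx_fd, 0);
            return buf == MAP_FAILED ? NULL : buf;
    }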
The current implementation comes with a simple default format builtin.
This format records samples in sequential order. Each sample has a
fixed-size header which includes, among other things, the interrupted
instruction pointer, which PMD overflowed, and the PID and TID of the thread.
The header is followed by an optional variable-size body where
additional PMD values can be recorded.
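As an illustration only, a sample entry in this format could be pictured
as the structure below; the field names and layout are assumptions, not
the actual definition exported by the default format module.

    /* Hypothetical picture of a default-format sample header. */
    #include <stdint.h>
    #include <sys/types.h>

    struct dfl_sample_hdr {
            uint64_t ip;        /* interrupted instruction pointer */
            uint16_t ovfl_pmd;  /* index of the overflowed PMD */
            pid_t    pid;       /* process id of the monitored thread */
            pid_t    tid;       /* thread id */
            /* followed by an optional variable-size body of extra PMD values */
    };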
We have successfully hooked the Oprofile kernel infrastructure to
our interface using a simple buffer format module on Linux/ia64.
We have released a buffer format that implements n-way buffering to
show how blind spots may be minimized. Both modules required absolutely
no change to the interface or the perfmon core implementation. We have also
developed a buffer format module to support P4/Xeon Precise Event-Based
Sampling (PEBS).
Because sampling can happen in the kernel without user intervention,
the kernel must have all the information needed to figure out what to record
and how to restart the sampling period. This information is passed when
the PMD used as the sampling period is programmed. For each such PMD,
it is possible to indicate, using bitvectors, which PMDs to record on
overflow and which PMDs to reset on overflow.
For each PMD, the interface provides three possible values which are
used when sampling. The initial value is the one that is first loaded into the
PMD, i.e., the first sampling period. On an overflow which does not
trigger a user level notification, a so-called short reset value is
used by the kernel to reload the PMD. After an overflow with a user
level notification, the kernel uses the so-called long reset value.
This mechanism can be exploited to hide the noise induced by the
recovery after a user notification.
The interface also supports automatic randomization of the reset value
for a PMD after an overflow. Randomization is indicated per PMD and is
controlled by a seed value. The range of variation is specified by a
bitmask per PMD.
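Putting the reset values, the record/reset bitvectors, and the
randomization controls together, programming a sampling PMD might look
like the sketch below. All field and flag names are assumptions modeled
on the description above; the real names live in the perfmon2 header.

    /* Sketch: program PMD4 with a sampling period, recording PMD5 and
     * PMD6 in each sample and randomizing the low bits of the period. */
    #include <stdint.h>

    static void setup_sampling_pmd(int ctx_fd, uint64_t period)
    {
            pfarg_pmd_t pmd = { 0 };

            pmd.reg_num           = 4;
            pmd.reg_value         = -period;  /* initial period: 2^64 - p */
            pmd.reg_short_reset   = -period;  /* reload after a kernel-handled overflow */
            pmd.reg_long_reset    = -period;  /* reload after a user notification */
            pmd.reg_smpl_pmds[0]  = (1ULL << 5) | (1ULL << 6); /* record PMD5, PMD6 */
            pmd.reg_reset_pmds[0] = (1ULL << 5) | (1ULL << 6); /* reset them on overflow */
            pmd.reg_random_mask   = 0xff;     /* randomize the low 8 bits of the period */
            pmd.reg_random_seed   = 42;
            pmd.reg_flags         = PFM_REGFL_OVFL_NOTIFY | PFM_REGFL_RANDOM; /* assumed flag names */

            pfm_write_pmds(ctx_fd, &pmd, 1);
    }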
5/ EVENT SETS AND MULTIPLEXING:
----------------------------
For many PMU models, the number of counters is fairly limited, which
sometimes makes it difficult to collect certain metrics in a single run.
Moreover, even with a large number of counters, it is not always the case
that they all can measure any event at the same time. Such constraints can
be alleviated by creating the notion of an event set. Each set
encapsulates the full PMU state. At any one time only one set is loaded
onto the actual PMU. Sets are then multiplexed. The counts collected
by counters in each set can then be scaled to approximate what they
would have been, had they run for the entire duration of the
measurement. It is very important to keep in mind that this is an
approximation. Its quality depends on the frequency at which sets can
be switched and also the overhead involved.
Event sets and multiplexing can be fully implemented at the user level
but this is prohibitively expensive especially for non-self-monitoring
threads.
The interface exports the notion of an event set. Each PMD/PMC can be
assigned to a set when read/written.
By default, any perfmon context has a single set, namely set 0. Tools can
create additional sets using pfm_create_evtsets(). A set is identified
by a number between 0 and 65535. The number indicates the order in which
sets are switched to and from. The kernel uses an ordered list managed
in a round-robin fashion to determine the switch order.
Switching can either be triggered by a timeout or by a counter overflow.
The switch mode is set per set and it is possible to mix and match.
The timeout is specified when the set is created (or modified for set 0).
It is limited by the granularity of the timer tick: the user timeout is
rounded up to the nearest multiple of the timer tick. The
actual timeout is returned to the tool.
For overflow switching, the interface does not require a dedicated
counter. Each PMD has an overflow switch counter. On overflow the
switch counter is decremented. When it reaches zero, switching occurs.
There can be more than one "trigger" PMD per set.
Overflow-based set switching can be used to implement counter
cascading, where certain counters start measuring only when
a certain threshold is reached on another event.
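As a sketch, creating a second set that is switched to on a timeout could
look like the following; the field names (set_id, set_flags, set_timeout)
and the flag name are assumptions based on the description above.

    /* Sketch: add event set 1, switched to by timeout. */
    static int add_timeout_set(int ctx_fd)
    {
            pfarg_setdesc_t set = { 0 };

            set.set_id      = 1;                     /* switched after set 0 */
            set.set_flags   = PFM_SETFL_TIME_SWITCH; /* assumed flag: timeout-driven switching */
            set.set_timeout = 10000000;              /* requested timeout; rounded up to the
                                                        timer tick, effective value returned */

            return pfm_create_evtsets(ctx_fd, &set, 1);
    }

PMCs and PMDs are then written with their set number, so each set carries
its own PMU programming.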
6/ PMU DESCRIPTION MODULES
-----------------------
The logical PMU is driven by a PMU description table. The table
is implemented by a kernel pluggable module. As such, it can be
updated at will without recompiling the kernel, waiting for the next
release of a Linux kernel or distribution, and without rebooting the
machine as long as the PMU model belongs to the same PMU family. For
instance, for the Itanium Processor Family, the architecture specifies
the framework for the PMU. Thus the Itanium PMU specific code is common
across all processor implementations. This is not the case for IA-32.
The interface and PMU description module support the notion of virtual
PMU registers, i.e., registers not necessarily tied to an actual PMU
register. For instance, it may be interesting, especially when sampling,
to export a kernel resource or a non-PMU hardware register as a PMD.
The actual read/write functions for those virtual PMDs are implemented by
the PMU description module, allowing for maximum flexibility.
It is important to understand that the interface, including the PMU
description modules, knows nothing about PMU events. All event-specific
information, including event names and encodings, has to be
implemented at the user level.
7/ IMPLEMENTATION
---------------
We have developed an implementation for the 2.6.x kernel series.
We do have support for the following processors:
- All Itanium processors (Itanium, McKinley/Madison, Montecito)
- Intel EM64T/Xeon. Includes support for PEBS and HyperThreading (produced by Intel)
- Intel P4/Xeon (32-bit). Includes support for PEBS and HyperThreading
- Intel Pentium M and P6 processors
- AMD 64-bit Opteron
- preliminary support for IBM Power 5 (produced by IBM)
- preliminary support for MIPS R5000 (produced by Phil Mucci)
The so-called "new perfmon code base" incorporates all the features we describe here.
At this point, this is a standalone patch for the 2.6.x kernels. The full patch can be
downloaded from our project web site at:
http://www.sf.net/projects/perfmon2
On Linux/ia64, there is an older version (v2.0) of this interface that is currently
provided on all 2.6.x based kernels. The "new code base" (v2.2) perfmon maintains
backward compatibility with this version.
8/ EXISTING TOOLS
--------------
Several commercial products as well as open-source tools already exist for this
interface (or its previous incarnation) on Linux for Itanium, where such an interface has
been available for quite some time:
- HP Caliper for RHEL4, SLES9.
- BEA JRockit with Dynamic Optimization
- pfmon/libpfm by HPLabs
- qtools/qprof by HPLabs
- PerfSuite from NCSA
- all PAPI-based tools
- OProfile
We think that once the interface is part of the mainline kernel, we will see even more
tools being released and developed for the benefit of ALL users across ALL major
hardware platforms.
--
-Stephane
Stephane Eranian <[email protected]> wrote:
> or create full sampling measurements.
> ...
>
> Sampling is supported at the user level and also at the kernel level with
> the ability to use a kernel-level sampling buffer. The format of the kernel
> sampling buffer is implemented by a kernel pluggable module. As such it is
> very easy to develop a custom format without any modification to the
> interface nor its implementation.
Why would one want to change the format of the sampling buffer?
Would much simplification be realised if we were to remove this option?
> To compensate for limitations of many PMU, such as a small number of
> counters, the interface also exports the notion of event sets and allows
> applications to multiplex sets, thereby allowing more events to be
> measured in a single run than there are actual counters.
>
> The PMU register description is implemented by a kernel pluggable module
> thereby allowing new hardware to be supported without the need to wait
> for the next release of the kernel.
Is that option important, or likely to be useful? Are you sure there isn't
some overdesign here?
> II/ BASE INTERFACE
> ---------------
>
> PMU registers are accessed by reading/writing PMC and PMD registers.
> The interface exposes a logical view of the PMU. The logical PMD/PMC
> registers are mapped onto the actual PMU registers (be them PMD/PMC or
> actual MSR) by the kernel. The mapping table is implemented by a kernel
> module and can thus easily be updated without a kernel recompile. This
> mechanism also makes it easy to add new PMU support inside a processor
> family.
Ditto.
> The interface is implemented using a system call interface rather than
> a device driver. There are several reasons for this choice, the most
> important being that we do want to support per-thread monitoring and
> that requires access to the context switch code of the kernel to
> save/restore the PMU state. Another reason is to reinforce the fact
> that the interface must be an integral part of the kernel. Lastly, we
> think it give us more flexibility in terms of how arguments can be
> passed to/from the kernel.
>
> Whenever possible, the interface leverages existing kernel mechanisms.
> As such, we use a file descriptor to identify a perfmon context.
>
> The interface defines the following set of system calls:
>
> - int pfm_create_context(pfarg_ctx_t *ctx, void *smpl_arg, size_t smpl_size)
pfarg_ctx_t __user *ctx, I assume?
void __user *smpl_arg, I assume?
What is at *smpl_arg? Anonymous pointers to userspace like this aren't
very popular - strongly typed interfaces are preferred.
>
> creates a per-thread or system-wide perfmon context. It returns a
> file descriptor that uniquely identifies the context. The regular
> file descriptor semantics w.r.t. to access control, sharing are
> supported.
>
> - pfm_write_pmds(int fd, pfarg_pmd_t *pmds, int n)
>
> Write one or more PMD registers.
>
> - pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
>
> Read the one or more PMD registers.
>
> - pfm_write_pmcs(int fd, pfarg_pmc_t*pmcs, int n)
>
> Write the one or more PMC registers.
>
> - pfm_load_context(int fd, pfarg_load_t *load)
>
> Attach a perfmon context to either a thread or a processor. In the
> case of a thread, the thread id is passed. In the case of a
> processor, the context is bound to the CPU on which the call is
> performed.
Why should userspace concern itself with a particular CPU? That's really
only valid if the process has bound itself to a single CPU? If the CPU is
fully virtualised by perfmon (it is) then why do we care about individual
CPU instances?
> - pfm_start(int fd, pfarg_start_t *start)
>
> Start active monitoring.
>
> - pfm_stop(int fd)
>
> Stop active monitoring.
>
> - pfm_restart(int fd)
>
> Resume monitoring after a user level overflow notification. This
> call is used in conjunction with kernel-level sampling.
>
> - pfm_create_evtsets(int fd, pfarg_setdesc_t *sets, int n)
>
> Create/Modify one or more PMU event set. Each set encapsulates the
> full PMU sets.
>
> - pfm_delete_evtsets(int fd, pfarg_setdesc_t *sets, int n)
>
> Delete a PMU event set. It is possible to delete one or more sets
> in a single call.
>
> - pfm_getinfo_evtsets(int fd, pfarg_setinfo_t *infos, int n):
>
> Return information about an event set. It is possible to get
> information about one or more sets in a single call. The call
> returns, for instance, the number of times a set has been activated,
> i.e., loaded onto the actual PMU.
>
> - pfm_unload_context(int fd)
>
> Detach a context from a thread or a processor.
>
> By default, all counters are exported as 64-bit registers even when the
> underlying hardware implements less. This makes it much easier for
> applications that are doing event-based sampling because they don't
> need to worry about the width of counters. It is possible to turn the
> "virtualization" off.
>
> A system-wide context allows a tool to monitor all activities on one
> processor, i.e, across all threads. Full System-wide monitoring in an
> SMP system is achieved by creating and binding a perfmon context on
> each processor. By construction, a perfmon context can only be bound
> to one processor at a time. This design choice is motivated by the
> desire to enforce locality and to simplify the kernel implementation.
hm. I'm surprised at such a CPU-centric approach. I'd have expected to
see a more task-centric model.
> Multiple per-thread contexts can coexist at the same time on a system.
> Multiple system-wide can co-exist as long as they do not monitor the
> same set of processors. The existing implementation does not allow
> per-thread and system-wide context to exist at the same time. The
> restriction is not inherent to the interface but come from the
> existing implementation.
>
> 3/ SAMPLING SUPPORT
> ----------------
>
> The interface supports event-based sampling (EBS), where the sampling
> period is determined by the number of occurrences of an event rather
> than by time. Note that time-based sampling (TBS) can be emulated by
> using an event with some correlation to time.
>
> The EBS is based on the ability for PMU to generate an interrupt when
> a counter overflows. All modern PMU support this mode.
>
> Because counters are virtualized to 64 bits. A sampling period p,
> is setup by writing a PMD to 2^{64}-p -1 = -p.
>
> The interface does have the notion of a sampling period, it only
> manipulates PMD values. When a counter overflows, it is possible
> for a tool to request a notification. By construction, the interface
> supports as many sampling periods as there are counters on the host
> PMU making it possible to overlap distinct sampling measurements in
> one run. The notification can be requested per counter.
>
> The notification is sent as a message and can be extracted by invoking
> read() on the file descriptor of the context. The overflow message
> contains information about the overflow, such as the index of the
> overflowed PMD.
>
> The interface supports using select/poll on contexts file descriptors.
> Similarly, it is possible to get an asynchronous notification via SIGIO
> using the regular sequence of fcntl().
So the kernel buffers these messages for the read()er. How does it handle
the case of a process which requests the messages but never gets around to
read()ing them?
> By default, during a notification, monitoring is masked, i.e., nothing
> is captured. A tool uses the pfm_restart() call to resume monitoring.
>
> It is possible to request that on overflow notification, the monitoring
> thread be blocked. By default, it keeps on running with monitoring
> masked. Blocking is not supported in system-wide mode nor when a thread
> is self-monitoring.
>
> 4/ SAMPLING BUFFER SUPPORT
> -----------------------
>
> User level sampling works but it is quite expensive especially when for
> non self-monitoring threads. To minimize the overhead, the interface also
> supports a kernel level sampling buffer. The idea is simple: on overflow
> the kernel record a sample, and only when the buffer becomes full is the
> user level notification generated. Thus, we amortize the cost of the
> notification by simply calling the user when lots of samples are available.
>
> This is not such a new idea, it is present in OProfile or perfctr.
> However, the interface needs to remains generic and flexible. If
> the sampling buffer is in kernel, its format and what gets recorded
> becomes somehow locked by the implementation. Every tool has different
> needs. For instance, a tool such as Oprofile may want to aggregate
> samples in the kernel, others such as VTUNE want to record all samples
> sequentially. Furthermore, some tools may want to combine in each sample
> PMU information with other kernel level information, such as a kernel
> call stack for instance. It is hard to design a generic buffer format
> that can handle all possible request. Instead, the interface provides
> an infrastructure in which the buffer format is implemented by a kernel
> module. Each module controls, what gets recorded, how it is recorded,
> how the information is exported to user, when a 'buffer full'
> notification must be sent. The perfmon core has an interface to
> dynamically register new formats. Each format is uniquely identified by
> a 128-bit UUID which is passed by the tool when the context is created.
> Arguments for the buffer format are also passed during this call.
Well that addresses my earlier questions I guess.
Is this actually useful? oprofile is there and works OK. Again, is there
overdesign here?
And why is it necessary to make the presentation of the samples to
userspace pluggable? I'd have thought that a single relayfs-based
implementation would suit all sampling buffer formats.
> As part of the infrastructure, the interface provides a buffer allocation
> and remapping service to the buffer format modules. Format may use this
> service when the context is created. The kernel memory will be reserved
> and the tool will be able to get access to the buffer via remapping
> using the mmap() system call. This provides an efficient way of
> exporting samples without the need for copying large amount of data
> between the kernel and user space. But a format may choose to export
> its samples via another way, such as a device driver interface for
> instance.
It doesn't sound like perfmon is using relayfs for the sampling buffer.
Why not?
> The current implementation comes with a simple default format builtin.
> This format records samples in a sequential order. Each sample has a
> fixed sized header which include the interrupted instruction pointer,
> which PMD overflowed, the PID and TID of the thread among other things.
> The header is followed by an optional variable size body where
> additional PMD values can be recorded.
>
> We have successfully hooked the Oprofile kernel infrastructure to
> our interface using a simple buffer format module on Linux/ia64.
Neat, but do we actually *need* this?
> We have released a buffer format that implements n-way buffering to
> show how blind spots may be minimized. both modules required absolutely
> no change to the interface nor perfmon core implementation. We have also
> developed a buffer format module to support P4/Xeon Precise Event-Based
> Sampling (PEBS).
>
> Because sampling can happen in the kernel without user intervention,
> the kernel must have all the information to figure out what to record,
> how to restart the sampling period. This information is passed when
> the PMD used as the sampling period is programmed. For each such PMD,
> it is possible to indicate using bitvector, which PMDs to record on
> overflow, which PMDs to reset on overflow.
>
> For each PMD, the interface provides three possible values which are
> used when sampling. The initial value is that is first loaded into the
> PMD, i.e., the first sampling period. On overflow which does not
> trigger a user level notification, a so-called short reset value is
> used by the kernel to reload the PMD. After an overflow with a user
> level notification, the kernel uses the so-called long reset value.
> This mechanism can be exploited to hide the noise induced by the
> recovery after a user notification.
>
> The interface also supports automatic randomization of the reset value
> for a PMD after an overflow.
Why would one want to randomise the PMD after an overflow?
> Randomization is indicated per PMD and is
> controlled by a seed value. The range of variation is specified by a
> bitmask per PMD.
>
> 5/ EVENT SETS AND MULTIPLEXING:
> ----------------------------
>
> ...
>
> 6/ PMU DESCRIPTION MODULES
> -----------------------
>
> The logical PMU is driven by a PMU description table. The table
> is implemented by a kernel pluggable module. As such, it can be
> updated at will without recompiling the kernel, waiting for the next
> release of a Linux kernel or distribution, and without rebooting the
> machine as long as the PMU model belongs to the same PMU family. For
> instance, for the Itanium Processor Family, the architecture specifies
> the framework for the PMU. Thus the Itanium PMU specific code is common
> across all processor implementations. This is not the case for IA-32.
I think the usefulness of this needs justification. CPUs are updated all
the time, and we release new kernels all the time to exploit the new CPU
features. What's so special about performance counters that they need such
special treatment?
>
> ...
>
> 7/ IMPLEMENTATION
> ---------------
>
> We have developed an implementation for the 2.6.x kernel series.
> We do have support for the following processors:
>
> - All Itanium processors (Itanium, McKinley/Madison, Montecito)
> - Intel EM64T/Xeon. Includes support for PEBS and HyperThreading (produced by Intel)
> - Intel P4/Xeon (32-bit). Includes support for PEBS and HyperThreading
> - Intel Pentium M and P6 processors
> - AMD 64-bit Opteron
> - preliminary support for IBM Power 5 (produced by IBM)
> - preliminary support for MIPS R5000 (produced by Phil Mucci)
Which architectures does perfctr support?  More, I think?
> -Stephane
Thanks for putting this together. It helps.
Overall: I worry about excessive configurability, excessive features.
Andrew Morton writes:
> > - All Itanium processors (Itanium, McKinley/Madison, Montecito)
> > - Intel EM64T/Xeon. Includes support for PEBS and HyperThreading (produced by Intel)
> > - Intel P4/Xeon (32-bit). Includes support for PEBS and HyperThreading
> > - Intel Pentium M and P6 processors
> > - AMD 64-bit Opteron
> > - preliminary support for IBM Power 5 (produced by IBM)
> > - preliminary support for MIPS R5000 (produced by Phil Mucci)
>
> Which achitectures does perfctr support? More, I think?
The sets are incomparable.
Intel P5 up to P4/Xeon/EM64T, though not P4's PEBS.
AMD K7 and K8.
X86 clones with performance counters (VIA C3 and Cyrix' P5-clones).
Any x86 with TSC. (Still useful for accurate time measurements.)
PPC32 (604 up to 74xx).
Any PPC32 with TB. (Still useful for accurate time measurements.)
POWER4/G5/POWER5 (done by David Gibson not me).
Preliminary ARM/XScale support is working but stalled due to
more pressing commitments and unresolved ARM platform issues.
(Some XScale/PXA drivers clobber the PMU registers for no good reason.)
UltraSPARC would be trivial to support, except (1) I don't have one,
and (2) they already have a primitive pre-historic perfctr facility.
/Mikael
> > The interface also supports automatic randomization of the reset value
> > for a PMD after an overflow.
>
> Why would one want to randomise the PMD after an overflow?
To get better data. Using a constant reload value may keep measuring the
same spot in the application if you are using a sample frequency that
matches some repeat pattern in the application (and Murphy's law says
that you'll hit this a lot).
-Tony
Andrew,
I will reply to your comments in details.
On Tue, Dec 20, 2005 at 10:05:11AM -0800, Tony Luck wrote:
> > > The interface also supports automatic randomization of the reset value
> > > for a PMD after an overflow.
> >
> > Why would one want to randomise the PMD after an overflow?
>
> To get better data. Using a constant reload value may keep measuring the
> same spot in the application if you are using a sample frequency that
> matches some repeat pattern in the application (and Murphy's law says
> that you'll hit this a lot).
>
Yes, Tony is right.
For several sampling measurements which use events that occur very
frequently such as branches, it becomes very important to avoid getting
in lockstep with the execution. Using prime numbers is not always enough
and randomization is the best way to solve this problem.
We encountered this when David Mosberger was developing
q-syscollect for Linux/IA64. This tool is sampling return branches
using the McKinley Branch Trace Buffer. With the collected samples
it is possible to build a statistical call graph (a la gprof).
Without randomization, the samples were so biased that the data
was unusable. With randomization, the data was very close to the actual
call graph as measured by gprof. The same argument applies to
sampling for cache misses, you want to make sure you are not
always capturing the same cache misses.
The random number generator does not have to be super fancy. That's why
we use the Carta algorithm; it is simple and fast and gives us very
good samples.
--
-Stephane
On Tue, Dec 20, 2005 at 02:51:56AM -0800, Andrew Morton wrote:
> Stephane Eranian <[email protected]> wrote:
> > or create full sampling measurements.
> > ...
> >
> > Sampling is supported at the user level and also at the kernel level with
> > the ability to use a kernel-level sampling buffer. The format of the kernel
> > sampling buffer is implemented by a kernel pluggable module. As such it is
> > very easy to develop a custom format without any modification to the
> > interface nor its implementation.
>
> Why would one want to change the format of the sampling buffer?
>
> Would much simplification be realised if we were to remove this option?
>
> > To compensate for limitations of many PMU, such as a small number of
> > counters, the interface also exports the notion of event sets and allows
> > applications to multiplex sets, thereby allowing more events to be
> > measured in a single run than there are actual counters.
> >
> > The PMU register description is implemented by a kernel pluggable module
> > thereby allowing new hardware to be supported without the need to wait
> > for the next release of the kernel.
>
> Is that option important, or likely to be useful? Are you sure there isn't
> some overdesign here?
Heh, I've been wondering that for some time. It's better than it
was..
[snip]
> > - pfm_load_context(int fd, pfarg_load_t *load)
> >
> > Attach a perfmon context to either a thread or a processor. In the
> > case of a thread, the thread id is passed. In the case of a
> > processor, the context is bound to the CPU on which the call is
> > performed.
>
> Why should userspace concern itself with a particular CPU? That's really
> only valid if the process has bound itself to a single CPU? If the CPU is
> fully virtuialised by perfmon (it is) then why do we care about individual
> CPU instances?
This option is for a context monitoring the whole system, rather than
a single thread, which you sometimes want to do (because one thread's
actions could have an impact on others, for example). But a context
can't sensibly track more than one thing simultaneously executing, so
it needs to be bound to a particular CPU. In practice such a context
would almost invariably be used as part of a 1-per-cpu set of contexts,
with the results ultimately aggregated.
[snip]
> > A system-wide context allows a tool to monitor all activities on one
> > processor, i.e, across all threads. Full System-wide monitoring in an
> > SMP system is achieved by creating and binding a perfmon context on
> > each processor. By construction, a perfmon context can only be bound
> > to one processor at a time. This design choice is motivated by the
> > desire to enforce locality and to simplify the kernel implementation.
>
> hm. I'm surprised at such a CPU-centric approach. I'd have expected to
> see a more task-centric model.
It is task-centric, usually. The CPU binding is only when doing
full-system monitoring. The above is a bit unclear, giving undue
emphasis to the per-CPU case: the point is that (from the kernel's
point of view) a context can only ever be active on one CPU at a time.
For a thread-virtualized context (the common case) that happens
automatically because the thread can't run on multiple CPUs at once,
for a full system monitoring context the binding must be explicit instead.
[snip]
> > This is not such a new idea, it is present in OProfile or perfctr.
> > However, the interface needs to remains generic and flexible. If
> > the sampling buffer is in kernel, its format and what gets recorded
> > becomes somehow locked by the implementation. Every tool has different
> > needs. For instance, a tool such as Oprofile may want to aggregate
> > samples in the kernel, others such as VTUNE want to record all samples
> > sequentially. Furthermore, some tools may want to combine in each sample
> > PMU information with other kernel level information, such as a kernel
> > call stack for instance. It is hard to design a generic buffer format
> > that can handle all possible request. Instead, the interface provides
> > an infrastructure in which the buffer format is implemented by a kernel
> > module. Each module controls, what gets recorded, how it is recorded,
> > how the information is exported to user, when a 'buffer full'
> > notification must be sent. The perfmon core has an interface to
> > dynamically register new formats. Each format is uniquely identified by
> > a 128-bit UUID which is passed by the tool when the context is created.
> > Arguments for the buffer format are also passed during this call.
>
> Well that addresses my earlier questions I guess.
>
> Is this actually useful? oprofile is there and works OK. Again, is there
> overdesign here?
>
> And why is it necessary to make the presentation of the samples to
> userspace pluggable? I'd have thought that a single relayfs-based
> implementation would suit all sampling buffer formats.
I think there's some confusion here because the term "pluggable buffer
format" isn't a particularly good one. It makes sense in the context
of perfmon's internal architecture, but doesn't really convey the
basic idea.
As I understand it from working with the perfmon code a bit, the point
is not so much the buffer itself, but the selection of what things to
sample when and how. The default format gives a fairly configurable
set of ways to do this but for some applications you'd have a choice
of either not collecting all the data you need, or collecting so much
extra data that it would bog down trying to save and/or aggregate it
all. So, the point of "custom buffer formats" is more a matter of
being able to add new (potentially application specific) sampling
schemes.
As you say, relayfs should be fine for the actual presentation of the
buffer to userspace, whatever it contains.
[snip]
> Overall: I worry about excessive configurability, excessive
> features.
Join the club :)
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
Andrew,
Thanks for carefully reviewing this document.
On Tue, Dec 20, 2005 at 02:51:56AM -0800, Andrew Morton wrote:
> Stephane Eranian <[email protected]> wrote:
> >
> > Sampling is supported at the user level and also at the kernel level with
> > the ability to use a kernel-level sampling buffer. The format of the kernel
> > sampling buffer is implemented by a kernel pluggable module. As such it is
> > very easy to develop a custom format without any modification to the
> > interface nor its implementation.
>
> Why would one want to change the format of the sampling buffer?
The initial perfmon interface design as implemented in the 2.4 kernel series
did not have this feature. It was added to the 2nd generation for the following
reason:
- allow support of existing kernel profiling infrastructures such as
Oprofile or VTUNE (the VTUNE driver is open-source)
- added flexibility in the way samples are recorded: what is recorded and
how it is recorded.
- ability to support hardware sampling buffers
Take the example of earlier versions of Oprofile: they already had all the kernel
infrastructure to record the interrupted instruction pointer + OS events such
as mmap/exit to help correlate samples with actual programs. Just looking
at their sampling buffer, they were aggregating samples instead of storing
them linearly. That is how the original DEC DCPI implementation worked.
Our default sampling format stores samples sequentially; it would not have
been easy to adapt Oprofile-based tools to use this model. Another example
is the old VTUNE driver: it used a single buffer to collect samples
across all CPUs. We have one buffer per CPU.
But now you can imagine people wanting to record other types of information
beyond PMU registers. For instance, we have developed a format that collects a full
kernel call stack on counter overflow. Oprofile today does this but it is
hardcoded in the core of Oprofile. We can do this without any change to
the core of perfmon.
On Linux/ia64, Oprofile is hooked up to perfmon by a custom sampling format.
This format does virtually nothing besides connecting the perfmon interrupt
handler to the rest of the Oprofile infrastructure. We re-use 99% of the
Oprofile kernel code without any modifications. They still export their buffer
the way they used to, i.e., via their own driver interface. They moved the setup
of the PMU over to the perfmon interface in the user level code.
Custom formats come in handy to support hardware sampling buffers. Take the
example of the P4/Xeon PMU. The Precise Event Based Sampling (PEBS) support
lets the CPU directly store samples in a designated memory area. This
has two advantages:
- the CPU stores the actual instruction pointer of where the counter
overflowed; that is where the precision comes from. Typically (and not
just on P4), there is a skew between where the counter overflows and
where the PMU interrupts.
- This is faster because you only take a PMU interrupt when the CPU runs
out of space
How would you support this if the buffer format was hardcoded? As you can see it is
not just about speed but also about gaining in precision, i.e. attributing samples
to where they actually occurred. The PEBS is implementation specific. There is no
guarantee it will exist in future Intel processors, for instance it does not exist
on Pentium M. It quickly becomes hard to maintain if you have to ifdef this in
or out based on CPUID. With our custom format, it took about 200 lines of code and
almost no modification to the perfmon generic code. This is the first time PEBS
is really exposed to applications inside a generic interface. Neither Oprofile
nor perfctr support this today.
We provide a builtin default format that is fairly generic and that stores
samples with PMU information in a linear buffer. It may be good enough
for many tools already. For instance, HP Caliper uses it; I believe it may
be good enough for VTUNE.
To comment on complexity, I don't think this adds much. It simply separates
existing code and adds some simple registration code.
>
> Would much simplification be realised if we were to remove this option?
>
I think it would be too limiting; for instance, you would forfeit PEBS on
P4/Xeon.
> > Attach a perfmon context to either a thread or a processor. In the
> > case of a thread, the thread id is passed. In the case of a
> > processor, the context is bound to the CPU on which the call is
> > performed.
>
> Why should userspace concern itself with a particular CPU? That's really
> only valid if the process has bound itself to a single CPU? If the CPU is
> fully virtuialised by perfmon (it is) then why do we care about individual
> CPU instances?
>
See just below.
>
> > A system-wide context allows a tool to monitor all activities on one
> > processor, i.e, across all threads. Full System-wide monitoring in an
> > SMP system is achieved by creating and binding a perfmon context on
> > each processor. By construction, a perfmon context can only be bound
> > to one processor at a time. This design choice is motivated by the
> > desire to enforce locality and to simplify the kernel implementation.
>
> hm. I'm surprised at such a CPU-centric approach. I'd have expected to
> see a more task-centric model.
>
You need both modes. We support both modes using the same interface. There is
just a flag to specify in order to create a CPU-centric perfmon context. In system-wide
mode you are interested in monitoring everything that is happening on one
processor. This is the mode in which OProfile and VTUNE work, for instance.
They collect everything and then you can filter out samples per thread.
In per-thread mode, the PMU is actually virtualized, i.e., it becomes part
of the thread's machine state and is saved/restored on context switch.
The per-thread mode is more challenging to implement but allows more
specific measurements to be made. For instance, I can measure how
many instructions were executed in function foo() of my thread.
How do you design system-wide monitoring?
You have two choices:
- allow a single thread to monitor multiple CPUs with a single
perfmon context
- enforce that a thread can monitor only one CPU. Full coverage
of SMP is achieved by creating and pinning a thread per-CPU.
We chose the second design because it is simpler to implement in the kernel
and scales much better because it enforces locality, which is important when
sampling, i.e., the buffer is allocated on the monitored CPU. This also meshes
well with hardware features such as P4/Xeon PEBS.
For a system-wide context, the pfm_load_context() call uses the current
CPU to bind the context, regardless of the calling thread's affinity bitmask.
The user does not explicitly pass the CPU number he wants to monitor.
If a thread wants to monitor CPU2, it must be sure to run on CPU2 when
making the call. That forces it to call sched_setaffinity() to allow only
CPU2. The pfm_load_context() does not change the affinity of the thread.
We did not want to add another mechanism to change affinity. We are simply
leveraging an existing interface. Similarly, we did not want to internally
enforce a particular affinity. That means that the thread can at any time
change its affinity again. But perfmon will reject any perfmon calls that
read/write PMU registers if the thread is not running on CPU2. That we
can do easily and we do not have to change sched_setaffinity(). That leaves
the door open for a thread monitoring multiple CPUs, but by using multiple
contexts.
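To illustrate the sequence, a sketch of binding a system-wide context to
CPU2 follows; the ctx_flags field and the PFM_FL_SYSTEM_WIDE flag name
are assumptions, and error handling is minimal.

    /* Sketch: pin the caller on CPU2, then create and load a
     * system-wide context; the context binds to the current CPU. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <unistd.h>

    static int monitor_cpu2(void)
    {
            pfarg_ctx_t  ctx  = { 0 };
            pfarg_load_t load = { 0 };
            cpu_set_t    mask;
            int fd;

            CPU_ZERO(&mask);
            CPU_SET(2, &mask);
            if (sched_setaffinity(0, sizeof(mask), &mask) < 0)  /* run on CPU2 only */
                    return -1;

            ctx.ctx_flags = PFM_FL_SYSTEM_WIDE;   /* assumed flag for a system-wide context */
            fd = pfm_create_context(&ctx, NULL, 0);
            if (fd < 0)
                    return -1;

            pfm_load_context(fd, &load);          /* binds to the CPU we are running on, i.e. CPU2 */
            return fd;
    }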
> > 3/ SAMPLING SUPPORT
> > ----------------
> > The notification is sent as a message and can be extracted by invoking
> > read() on the file descriptor of the context. The overflow message
> > contains information about the overflow, such as the index of the
> > overflowed PMD.
> >
> > The interface supports using select/poll on contexts file descriptors.
> > Similarly, it is possible to get an asynchronous notification via SIGIO
> > using the regular sequence of fcntl().
>
> So the kernel buffers these messages for the read()er. How does it handle
> the case of a process which requests the messages but never gets around to
> read()ing them?
Good question. Let's see what happens with the default sampling buffer format.
When the buffer fills up, monitoring is stopped and a notification message is
enqueued. If the tool never gets around to actually calling read(), a single
message will be queued forever. No other message is generated because monitoring
is stopped until the user issues a pfm_restart().
There needs to be a queue because some formats may implement a double buffering
scheme where monitoring is never actually stopped. The queue needs to be as
deep as the number of active buffers.
> > 4/ SAMPLING BUFFER SUPPORT
> > -----------------------
> >
> > User level sampling works but it is quite expensive especially when for
> > non self-monitoring threads. To minimize the overhead, the interface also
> > supports a kernel level sampling buffer. The idea is simple: on overflow
> > the kernel record a sample, and only when the buffer becomes full is the
> > user level notification generated. Thus, we amortize the cost of the
> > notification by simply calling the user when lots of samples are available.
> >
> > This is not such a new idea, it is present in OProfile or perfctr.
> > However, the interface needs to remains generic and flexible. If
> > the sampling buffer is in kernel, its format and what gets recorded
> > becomes somehow locked by the implementation. Every tool has different
> > needs. For instance, a tool such as Oprofile may want to aggregate
> > samples in the kernel, others such as VTUNE want to record all samples
> > sequentially. Furthermore, some tools may want to combine in each sample
> > PMU information with other kernel level information, such as a kernel
> > call stack for instance. It is hard to design a generic buffer format
> > that can handle all possible request. Instead, the interface provides
> > an infrastructure in which the buffer format is implemented by a kernel
> > module. Each module controls, what gets recorded, how it is recorded,
> > how the information is exported to user, when a 'buffer full'
> > notification must be sent. The perfmon core has an interface to
> > dynamically register new formats. Each format is uniquely identified by
> > a 128-bit UUID which is passed by the tool when the context is created.
> > Arguments for the buffer format are also passed during this call.
>
> Well that addresses my earlier questions I guess.
You had this question earlier about:
int pfm_create_context(pfarg_ctx_t __user *ctx, void __user *smpl_arg, size_t smpl_size);
> What is at *smpl_arg? Anonymous pointers to userspace like this aren't
> very popular - strongly typed interfaces are preferred.
>
Each sampling format may have options that must be passed when the context is created. There is
no predefined structure for such options. For the default format, for instance, you can pass the
size of the buffer you want. But you can imagine other things. As such the smpl_arg must be void *.
Now the smpl_size argument is NOT the size of the buffer but the size of the smpl_arg argument.
The format requested is identified by its UUID included in the ctx argument. The kernel takes
the UUID and checks that it is registered. If found, the kernel then checks whether smpl_size matches
the format's expected option size. If not, the call fails.
We have a set of protections to avoid abuses, especially of the vector arguments for many of the
calls. A vector cannot be bigger than a page. The same thing applies to sampling format options.
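For illustration, passing default-format options could look like this;
the structure name and its buf_size field are hypothetical stand-ins for
the default format's real argument block.

    /* Sketch: format-specific options passed at context creation.
     * The format itself is selected by the UUID carried in ctx. */
    #include <stddef.h>

    typedef struct {
            size_t buf_size;   /* requested size of the kernel sampling buffer */
    } dfl_smpl_arg_t;          /* hypothetical name */

    static int create_sampling_context(pfarg_ctx_t *ctx)
    {
            dfl_smpl_arg_t arg = { .buf_size = 2 * 1024 * 1024 };

            /* smpl_size is the size of the argument block, not of the buffer */
            return pfm_create_context(ctx, &arg, sizeof(arg));
    }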
> Is this actually useful? oprofile is there and works OK. Again, is there
> overdesign here?
I think I answered this question at the beginning, about sampling formats.
>
> And why is it necessary to make the presentation of the samples to
> userspace pluggable? I'd have thought that a single relayfs-based
> implementation would suit all sampling buffer formats.
What is pluggable: the policy code that decides what gets recorded and
how. The samples are exposed to the user via mmap() on the file descriptor
identifying the context.
>
> > As part of the infrastructure, the interface provides a buffer allocation
> > and remapping service to the buffer format modules. Format may use this
> > service when the context is created. The kernel memory will be reserved
> > and the tool will be able to get access to the buffer via remapping
> > using the mmap() system call. This provides an efficient way of
> > exporting samples without the need for copying large amount of data
> > between the kernel and user space. But a format may choose to export
> > its samples via another way, such as a device driver interface for
> > instance.
>
> It doesn't sound like perfmon is using relayfs for the sampling buffer.
> Why not?
Relay-fs was also suggested by David Gibson. I don't have anything against
it. The problem is that it is CPU-centric. So it would probably work for
system-wide monitoring where we do effectively measure on a per-CPU basis.
But for per-thread monitoring this does not work correctly. You don't really
want to have a relay-fs buffer per CPU per thread and NR_CPUS file descriptors open
per thread. It also makes it hard to reconstruct a per-thread sequential buffer.
Note that to expose the samples, we have not invented any new mechanism. We simply
leverage the existing mmap() call and the fact that we use a file descriptor to
identify a context.
> > For each PMD, the interface provides three possible values which are
> > used when sampling. The initial value is that is first loaded into the
> > PMD, i.e., the first sampling period. On overflow which does not
> > trigger a user level notification, a so-called short reset value is
> > used by the kernel to reload the PMD. After an overflow with a user
> > level notification, the kernel uses the so-called long reset value.
> > This mechanism can be exploited to hide the noise induced by the
> > recovery after a user notification.
> >
> > The interface also supports automatic randomization of the reset value
> > for a PMD after an overflow.
>
> Why would one want to randomise the PMD after an overflow?
>
I think several people have already explained this. But I would add one
more thing. The phenomenon is somewhat aggravated when doing per-thread
sampling using events that are not so much affected by what is going on
in the system overall. In an earlier E-mail about this, I mentioned sampling
on return branches. The number of return branches is not affected by where
you run, the number of competing threads, or the rate of interrupts. So there
is not so much of this kind of "implicit" randomization that you would get
if you were to sample on cycles, for instance. Overall, randomization is a very
important feature. It needs to be in the kernel because of the kernel level
sampling buffer.
> >
> > 6/ PMU DESCRIPTION MODULES
> > -----------------------
> >
> > The logical PMU is driven by a PMU description table. The table
> > is implemented by a kernel pluggable module. As such, it can be
> > updated at will without recompiling the kernel, waiting for the next
> > release of a Linux kernel or distribution, and without rebooting the
> > machine as long as the PMU model belongs to the same PMU family. For
> > instance, for the Itanium Processor Family, the architecture specifies
> > the framework for the PMU. Thus the Itanium PMU specific code is common
> > across all processor implementations. This is not the case for IA-32.
>
> I think the usefulness of this needs justification. CPUs are updated all
> the time, and we release new kernels all the time to exploit the new CPU
> features. What's so special about performance counters that they need such
> special treatment?
>
There are several issues that we are trying to solve with this feature:
- evolving hardware
- defining custom virtual registers
- debugging and support
It is true that you produce kernels every day. However, most users don't
run your kernels but those of commercial vendors, which evolve at a much slower
pace.
Let's take an example on Itanium. Take a user running a commercial distro
based on 2.6. This user is given early access to a Montecito machine. He wants
to do some early performance analysis and needs perfmon support for Montecito,
yet he wants to stay on his kernel because it includes features that his
application depends upon. With the PMU description module, it would simply be
a matter of shipping the kernel module + updated tools. Without this, the
customer needs to recompile his kernel. The reality is that very few people know
how to, or even want to, recompile their own kernels.
The PMU is very implementation specific and hard to debug. It can be updated
from one stepping to the next. Software release cycles may not necessarily
match hardware release cycles.
As Dan and Phil mentioned, there are other uses of this for special processor
registers, chipset counters and the like.
It is also very attractive to be able to expose an OS resource as a PMD register.
Dan mentioned using this to export PID, TID, but you can extend this to other OS
resources such as the amount of free memory, the number of tasks and so on. What
is the advantage? At the user level, they look like regular PMU resources and they
become very easy to include in samples. In summary, it becomes possible to sample
on resources which would otherwise be hard to get to from user level (see the
sketch below).
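For illustration only, a virtual PMD read handler could be as simple as this;
the callback name and the way it would be hooked into the description table are
assumptions, not the actual perfmon2 code:

/*
 * Hypothetical read handler for a virtual PMD: instead of reading a
 * hardware MSR, return a kernel software value, here the number of
 * runnable tasks, so it can be recorded in samples like any counter.
 */
static u64 pfm_vpmd_read_nr_running(void)
{
	return (u64)nr_running();	/* existing kernel helper */
}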
>
> Overall: I worry about excessive configurability, excessive features.
>
In general I am not a big fan of putting stuff in the kernel just because it's
cool to be a kernel developer. Quite the contrary: if I could stay out of
kernel development, it would certainly make my work easier.
Every feature that is supported by perfmon was put in there because
of user needs and because there was no better way to implement it in
user space while providing the same level of efficiency or simplicity.
The best examples being:
	- the sampling buffer
	- event sets and multiplexing
The best counter-example being:
	- full coverage of an SMP system for system-wide measurement
	  is implemented at the user level by a collection of threads
	  pinned on each CPU (a rough sketch of this model follows).
	  This was preferred over the model of a single thread monitoring
	  all CPUs with the kernel sending IPIs across all CPUs to
	  program/start/stop.
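As a rough user-level sketch of that model; monitor_cpu() stands for the code
that would create and read a system-wide context on that CPU and is not part of
the interface shown here:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

/* one thread per CPU, each pinned before it sets up its own context */
static void *monitor_thread(void *arg)
{
	long cpu = (long)arg;
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);
	sched_setaffinity(0, sizeof(mask), &mask);	/* pin calling thread */

	/* monitor_cpu(cpu); placeholder for the actual monitoring code */
	return NULL;
}

static void start_monitors(void)
{
	long cpu, ncpus = sysconf(_SC_NPROCESSORS_ONLN);
	pthread_t tid;

	for (cpu = 0; cpu < ncpus; cpu++)
		pthread_create(&tid, NULL, monitor_thread, (void *)cpu);
}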
Thanks to David, Dan and Phil for their comments.
--
-Stephane
On Thu, Dec 22, 2005 at 03:56:32AM -0800, Stephane Eranian wrote:
> reason:
> - allow support of existing kernel profiling infrastructures such as
> Oprofile or VTUNE (the VTUNE driver is open-source)
last time I checked it was available in source, but not under an open-source
license. has this changed? In either case intel should contribute to the
kernel profiling infrastructure instead of doing their own thing. Supporting
people to do their own private variant is always a bad thing.
> Let's take an example on Itanium. Take a user running a commercial distro
> based on 2.6. This user is given early access to a Montecito machine.
That scenario is totally uninteresting for kernel development. we want
to encourage people to use upstream kernels, and not the bastardized vendor
crap.
I think you're adding totally pointless complexity everywhere for such
scenarios because HP apparently cares for such vendor mess. Maybe you
should concentrate on what's best for upstream kernel development. And
the most important thing is to reduce complexity by at least one magnitude.
Christoph Hellwig wrote:
> On Thu, Dec 22, 2005 at 03:56:32AM -0800, Stephane Eranian wrote:
>
>>reason:
>> - allow support of existing kernel profiling infrastructures such as
>> Oprofile or VTUNE (the VTUNE driver is open-source)
>
>
> last time I checked it was available in source, but not under an open-source
> license. has this changed? In either case intel should contribute to the
> kernel profiling infrastructure instead of doing their own thing. Supporting
> people to do their own private variant is always a bad thing.
Both OProfile and PAPI are open source and could use such a performance
monitoring interface.
One of the problems right now is there is a patchwork of performance
monitoring support. Each instrumentation system has its own set of
drivers/patches. Few have support integrated into the kernel, e.g.
OProfile. However, the OProfile driver provides only a subset of the
performance monitoring support, system-wide sampling. The OProfile
driver doesn't allow per-thread monitoring or stopwatch style
measurement, which can be very useful for some performance monitoring
applications.
Having specific drivers for each performance monitoring program is not
the way to go. That is one of the reasons that people have problems
doing performance monitoring on Linux. Each performance monitoring
program has its own driver and/or set of patches to the kernel. Many
application programmers are not in a position to patch the kernel and
install a custom kernel on the machine so they can use the performance
monitoring hardware. Not everyone has root access to the machine they
use, so they cannot simply install and reboot a kernel of their choosing.
>>Let's take an example on Itanium. Take a user running a commercial distro
>>based on 2.6. This user is given early access to a Montecito machine.
>
>
> That scenario is totally uninteresting for kernel development. we want
> to encourage people to use upstream kernels, and not the bastardized vendor
> crap.
Vendors don't want to provide "bastardized vendor crap" either. The
fewer patches in the vendor distributed kernels the better.
> I think you're adding totally pointless complexity everywhere for such
> scenarios because HP apparently cares for such vendor mess. Maybe you
> should concentrate on what's best for upstream kernel development. And
> the most important thing is to reduce complexity by at least one magnitude.
Specifically what are the things that are "best for upstream kernel
development?" What are the things that should be eliminated "to reduce
complexity by at least one magnitude?"
-Will
On Thu, Dec 22, 2005 at 10:37:56AM -0500, William Cohen wrote:
> Both OProfile and PAPI are open source and could use such a performance
> monitoring interface.
>
> One of the problems right now is there is a patchwork of performance
> monitoring support. Each instrumentation system has its own set of
> drivers/patches. Few have support integrated into the kernel, e.g.
> OProfile. However, the OProfile driver provides only a subset of the
> performance monitoring support, system-wide sampling. The OProfile
> driver doesn't allow per-thread monitoring or stopwatch style
> measurement, which can be very useful for some performance monitoring
> applications.
What about improving oprofile then? Unlike the vtune or perfmon people
the oprofile authors have shown they actually are able to design sensible
interfaces, and oprofile has broad platform support over most supported
architectures.
Christoph Hellwig wrote:
> On Thu, Dec 22, 2005 at 10:37:56AM -0500, William Cohen wrote:
>
>>Both OProfile and PAPI are open source and could use such a performance
>>monitoring interface.
>>
>>One of the problems right now is there is a patchwork of performance
>>monitoring support. Each instrumentation system has its own set of
>>drivers/patches. Few have support integrated into the kernel, e.g.
>>OProfile. However, the OProfile driver provides only a subset of the
>>performance monitoring support, system-wide sampling. The OProfile
>>driver doesn't allow per-thread monitoring or stopwatch style
>>measurement, which can be very useful for some performance monitoring
>>applications.
>
>
> What about improving oprofile then? Unlike the vtune or perfmon people
> the oprofile authors have shown they actually are able to design sensible
> interfaces, and oprofile has broad platform support over most supported
> architectures.
At what point would adding interfaces to OProfile turn it into perfmon?
Some of the additions like per-thread monitoring would require significant
changes in the kernel that perfmon already implements. The IA64 OProfile
already uses the perfmon support in the kernel.
Perfmon2 has support for ia64, p6, pentium4, x86_64, ppc, and mips, so
there is support for multiple architectures. OProfile currently supports
more architectures: alpha, arm, i386 (p6/pentium4/athlon), ia64,
mips, ppc, ppc64, x86-64, and a fall-back timer mechanism.
-Will
Christoph Hellwig wrote:
> On Thu, Dec 22, 2005 at 10:37:56AM -0500, William Cohen wrote:
>
>>Both OProfile and PAPI are open source and could use such a performance
>>monitoring interface.
>>
>>One of the problems right now is there is a patchwork of performance
>>monitoring support. Each instrumentation system has its own set of
>>drivers/patches. Few have support integrated into the kernel, e.g.
>>OProfile. However, the OProfile driver provides only a subset of the
>>performance monitoring support, system-wide sampling. The OProfile
>>driver doesn't allow per-thread monitoring or stopwatch style
>>measurement, which can be very useful for some performance monitoring
>>applications.
>
>
> What about improving oprofile then? Unlike the vtune or perfmon people
> the oprofile authors have shown they actually are able to design sensible
> interfaces, and oprofile has broad platform support over most supported
> architectures.
Oprofile cannot be improved to provide stopwatch timing.
It is impossible because oprofile is sampling, not direct measurement.
Perfmon2, or anything which requires a system call to read a meter
[counter] of nanoseconds [per-thread virtualized cycle counter]
often adds unreasonably high overhead: hundreds of cycles or more,
instead of tens or less. CPU manufacturers are making life
difficult for users of perfctr, by muddying the meaning of
their user-readable cycle counters (see x86 RDTSC) or by omitting
user-readable cycle counters entirely (whether in the name of lower cost,
reducing "side channel" system information leaks, or otherwise.)
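For reference, this is what reading such a user-readable counter looks like on
x86; the snippet is just the mechanism, and whether the TSC counts true core
cycles is exactly the muddying mentioned above:

/* Reading the x86 TSC directly from user space, no system call needed. */
static inline unsigned long long read_tsc(void)
{
	unsigned int lo, hi;

	__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
	return ((unsigned long long)hi << 32) | lo;
}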
--
Andrew,
> > 6/ PMU DESCRIPTION MODULES
> > -----------------------
> >
> > The logical PMU is driven by a PMU description table. The table
> > is implemented by a kernel pluggable module. As such, it can be
> > updated at will without recompiling the kernel, waiting for the next
> > release of a Linux kernel or distribution, and without rebooting the
> > machine as long as the PMU model belongs to the same PMU family. For
> > instance, for the Itanium Processor Family, the architecture specifies
> > the framework for the PMU. Thus the Itanium PMU specific code is common
> > across all processor implementations. This is not the case for IA-32.
>
> I think the usefulness of this needs justification. CPUs are updated all
> the time, and we release new kernels all the time to exploit the new CPU
> features. What's so special about performance counters that they need such
> special treatment?
>
Given the discussion we are having, I thought it would be useful to take
a concrete example to try and clarify what I am talking about here. I chose
to use the PMU description module/table of the Pentium M because this is
a very common platform supported by all interfaces. The actual module contains
the following (arch/i386/perfmon/perfmon_pm.c) information:
- description of the PMU registers: where they are and their type
- a callback for an optional PMC write checker
- a probe routine (not shown; a sketch appears at the end of this message)
- a module_init/module_exit (not shown)
Let's look at the information in more detail.
The first piece of information is an architecture-specific structure
used by the architecture-specific code (arch/i386/perfmon/perfmon.c).
It contains the information about the MSR addresses for each register
that we want to access. Let's look at PMC0:
	{{MSR_P6_EVNTSEL0, 0}, 0, PFM_REGT_PERFSEL},
- field 0=MSR_P6_EVNTSEL0: PMC0 is mapped onto MSR EVENTSEL0 (for thread 0)
- field 1=0: unused; Pentium M does not support Hyperthreading (no thread 1)
- field 2=0: PMC0 controls PMD 0
- field 3=PFM_REGT_PERFSEL: this is a PMU control register
The business about HT is due to the fact that the i386 code is shared
with P4/Xeon.
struct pfm_arch_pmu_info pfm_pm_pmu_info = {
	.pmc_addrs = {
		{{MSR_P6_EVNTSEL0, 0}, 0, PFM_REGT_PERFSEL},
		{{MSR_P6_EVNTSEL1, 0}, 1, PFM_REGT_PERFSEL}
	},
	.pmd_addrs = {
		{{MSR_P6_PERFCTR0, 0}, 0, PFM_REGT_CTR},
		{{MSR_P6_PERFCTR1, 0}, 0, PFM_REGT_CTR}
	},
	.pmu_style = PFM_I386_PMU_P6,
	.lps_per_core = 1
};
Now let's look at the mapping table. It contains the following information:
- attribute of the register
- logical name
- default value
- reserved bitfield
The mapping table describes the very basic and generic properties of a register and
uses the same structure for all PMU models. In contrast, the first structure
is totally architecture-specific.
static struct pfm_reg_desc pfm_pm_pmc_desc[PFM_MAX_PMCS+1]={
/* pmc0 */ { PFM_REG_W, "PERFSEL0", PFM_PM_PMC_VAL, PFM_PM_PMC_RSVD},
/* pmc1 */ { PFM_REG_W, "PERFSEL1", PFM_PM_PMC_VAL, PFM_PM_PMC_RSVD},
{ PFM_REG_END} /* end marker */
};
static struct pfm_reg_desc pfm_pm_pmd_desc[PFM_MAX_PMDS+1]={
/* pmd0 */ { PFM_REG_C , "PERFCTR0", 0x0, -1},
/* pmd1 */ { PFM_REG_C , "PERFCTR1", 0x0, -1},
{ PFM_REG_END} /* end marker */
};
Now the write checker. It is used to intervene on the value passed by
the user when it programs a PMC register. The role of the function is
to ensure that the reserved bitfields retain their default values.
It can also be used to verify that a PMC value is actually authorized and
sane; a PMU may disallow certain combinations of values. The checker is
optional. On Pentium M we simply enforce the reserved bitfields.
static int pfm_pm_pmc_check(struct pfm_context *ctx, struct pfm_event_set *set,
			    u16 cnum, u32 flags, u64 *val)
{
	u64 tmpval, tmp1, tmp2;
	u64 rsvd_mask, dfl_value;

	tmpval    = *val;
	rsvd_mask = pfm_pm_pmc_desc[cnum].reserved_mask;
	dfl_value = pfm_pm_pmc_desc[cnum].default_value;

	if (flags & PFM_REGFL_NO_EMUL64)
		dfl_value &= ~(1ULL << 20);

	/* remove reserved areas from user value */
	tmp1 = tmpval & rsvd_mask;

	/* get reserved fields values */
	tmp2 = dfl_value & ~rsvd_mask;

	*val = tmp1 | tmp2;

	return 0;
}
And finally the structure that we register with the core of perfmon.
It includes, among other things, the actual width of the counters, which
is useful for sampling and for 64-bit virtualization of counters.
static struct pfm_pmu_config pfm_pm_pmu_conf = {
	.pmu_name        = "Intel Pentium M Processor",
	.counter_width   = 31,
	.pmd_desc        = pfm_pm_pmd_desc,
	.pmc_desc        = pfm_pm_pmc_desc,
	.pmc_write_check = pfm_pm_pmc_check,
	.probe_pmu       = pfm_pm_probe_pmu,
	.version         = "1.0",
	.flags           = PMU_FLAGS,
	.owner           = THIS_MODULE,
	.arch_info       = &pfm_pm_pmu_info
};
This is not much information.
If this were not implemented as a kernel module, it would have to be integrated into
the kernel anyway. This is very basic information that perfmon needs to operate
on the PMU registers. I prefer the table-driven approach to hardcoding and checking
everywhere. I hope you agree with me here.
The PMU description module is simply a way to separate this information from the
core. Note that the modules can, of course, be compiled in.
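For completeness, here is a minimal sketch of what the probe routine mentioned
above could look like; the exact vendor/family test is an assumption for
illustration, not the actual perfmon_pm.c code:

/*
 * Sketch only: the probe routine just decides whether this description
 * module matches the CPU we are running on.
 */
static int pfm_pm_probe_pmu(void)
{
	if (current_cpu_data.x86_vendor != X86_VENDOR_INTEL)
		return -1;
	/* Pentium M is a family 6 processor */
	if (current_cpu_data.x86 != 6)
		return -1;
	return 0;
}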